Data Engineering with AWS Chapter 10: Orchestrating the Data Pipeline
This is post 16 in my Data Engineering with AWS retelling series.
Up to this point in the book, we have been doing everything by hand. Click this button in the console. Run that Glue job manually. Trigger a crawler. Upload a file. It works fine for learning. But imagine doing that in production, every day, at 3 AM, across dozens of data sources. No thanks.
Chapter 10 is about fixing that. It is about pipeline orchestration – the thing that turns a collection of manual tasks into an automated, reliable data pipeline that runs itself.
What Is Pipeline Orchestration Anyway?
A data pipeline is a collection of processing tasks that need to run in a specific order. Some run one after the other. Some can run in parallel. An orchestration engine manages all of that: which tasks run when, what depends on what, what happens when something fails, and when to retry.
Think of it like a conductor leading an orchestra. The violins do not just start playing whenever they feel like it. The conductor keeps everyone in sync and stops everything if something goes wrong.
DAGs: The Fancy Word for “Tasks in Order”
When people talk about pipeline orchestration, the term DAG comes up constantly. It stands for Directed Acyclic Graph. Sounds intimidating, but the concept is simple.
A DAG is just a set of tasks (nodes) connected by arrows (edges) that flow in one direction and never loop back. “Directed” means the arrows point one way. “Acyclic” means no circles – you cannot go from Task D back to Task A.
For example, Task A finishes and triggers Tasks B and C in parallel. Task B triggers Task F. Both B and C must finish before Task D starts. Task D triggers Task E. None of these can loop backward.
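To make that concrete, here is a minimal sketch in plain Python (not from the book; task names match the hypothetical A–F example above). The standard library's graphlib refuses cyclic graphs, which is exactly the DAG guarantee:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Each task maps to the set of tasks that must finish before it starts.
dag = {
    "B": {"A"},       # A triggers B ...
    "C": {"A"},       # ... and C in parallel
    "F": {"B"},       # B triggers F
    "D": {"B", "C"},  # D waits for both B and C
    "E": {"D"},       # D triggers E
}

# A valid execution order exists only if the graph is acyclic;
# TopologicalSorter raises CycleError otherwise.
print(list(TopologicalSorter(dag).static_order()))
# e.g. ['A', 'B', 'C', 'F', 'D', 'E']
```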
Not every orchestration tool requires DAGs. Apache Airflow does – if your pipeline has a loop, Airflow will reject it. AWS Step Functions, on the other hand, actually allows loops in its state machine definitions. Different tools, different rules.
How Do Pipelines Get Triggered?
There are two main approaches: schedule-based and event-based.
Schedule-based is the traditional way. Run the pipeline every day at 6 AM. Or every hour. Or every 15 minutes. This works well for batch processing where you know roughly when data will be available.
Event-based is the modern approach. Instead of waiting for a clock, the pipeline fires in response to something happening. A file lands in an S3 bucket. A database table gets updated. A partner uploads their daily feed. The pipeline kicks off immediately.
Event-based triggers reduce latency. If your partner data usually arrives between 4 AM and 6 AM, a scheduled pipeline runs at 6 AM regardless. An event-based pipeline starts the moment data arrives.
There is also a clever pattern called manifest files. Instead of triggering on every single file upload, your partner sends a manifest file at the end of a batch listing every file with checksums. Your pipeline triggers only on the manifest, validates all listed files, and processes them together.
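Here is a rough sketch of what a manifest handler could look like, assuming a hypothetical JSON manifest listing file keys with MD5 checksums (the format and names are illustrative, not the book's):

```python
import json
import hashlib
import boto3

s3 = boto3.client("s3")

def handle_manifest(bucket: str, manifest_key: str) -> list[str]:
    """Triggered when the partner uploads the end-of-batch manifest.
    Validates every listed file before the batch is processed together."""
    manifest = json.loads(
        s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read()
    )
    validated = []
    for entry in manifest["files"]:  # hypothetical: {"key": ..., "md5": ...}
        body = s3.get_object(Bucket=bucket, Key=entry["key"])["Body"].read()
        if hashlib.md5(body).hexdigest() != entry["md5"]:
            raise ValueError(f"Checksum mismatch for {entry['key']}")
        validated.append(entry["key"])
    return validated  # hand the full, verified batch to the next stage
```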
Handling Failures Like a Professional
Pipelines break. It is not a question of if but when. Chapter 10 covers the common failure types:
- Data quality issues – your pipeline expects CSV but gets JSON. Hard failure. Nothing works until someone fixes the source data.
- Code errors – you deployed a bug. Hard failure. Requires a code fix.
- Endpoint errors – S3 is temporarily unreachable, or a network blip causes a timeout. This might be a soft failure that goes away on retry.
- Dependency errors – an upstream job did not finish yet. Could be soft (just slow) or hard (upstream crashed).
For soft failures, you want a retry strategy. Most orchestration tools let you configure the number of retries, the interval between them, and a backoff rate.
Exponential backoff is the smart approach. With a backoff rate of 1.5 and a starting interval of 10 seconds, retries happen at 10s, 15s, 22.5s, 33.75s, and so on. This prevents hammering a struggling endpoint and gives it time to recover.
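As a quick sketch, that schedule is just interval × rate^attempt. A generic retry helper, not from the book:

```python
import time
import random

def call_with_backoff(func, retries=4, interval=10.0, backoff_rate=1.5):
    """Retry a flaky call with exponentially growing waits:
    10s, 15s, 22.5s, 33.75s for the defaults used above."""
    for attempt in range(retries + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):  # soft failures only
            if attempt == retries:
                raise  # out of retries: escalate as a hard failure
            wait = interval * backoff_rate ** attempt
            # Optional jitter spreads retries from many workers apart.
            time.sleep(wait + random.uniform(0, 1))
```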
Four AWS Orchestration Options
AWS gives you four different tools for pipeline orchestration. Each has its place.
AWS Data Pipeline (the Old One)
This service launched in 2012 and it shows. It supports moving data between DynamoDB, RDS, Redshift, and S3, and can use EC2 or EMR for transformations. But the documentation has not been updated since 2018, it only works in five AWS regions, and it defaults to m1 instance types. The recommendation is clear: use something newer.
AWS Glue Workflows
If your entire pipeline lives within the AWS Glue ecosystem – Glue crawlers, Glue Spark jobs, Glue Python Shell jobs – then Glue Workflows can be a solid choice. You can chain crawlers and jobs together, run things in parallel, and use a graphical UI to monitor progress.
Glue Workflows supports three trigger types: on-demand, scheduled, and event-driven via EventBridge. Schedules use AWS's six-field cron format – for example, cron(*/30 8-16 ? * 2-6 *) runs every 30 minutes during business hours on weekdays (in AWS cron, day-of-week 1 is Sunday, so 2-6 covers Monday through Friday).
There is a nice batching feature too. Configure the workflow to wait for 100 events with a 1-hour timeout. If 100 files arrive in 40 minutes, it starts early. If only 75 arrive in an hour, it starts anyway.
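For illustration, here is how a scheduled trigger might be attached to a workflow with Boto3 (the trigger, workflow, and job names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger: every 30 minutes, 8:00-16:30, Monday-Friday.
glue.create_trigger(
    Name="business-hours-trigger",          # hypothetical name
    WorkflowName="sales-ingest-workflow",   # hypothetical workflow
    Type="SCHEDULED",
    Schedule="cron(*/30 8-16 ? * 2-6 *)",
    Actions=[{"JobName": "clean-sales-data"}],  # hypothetical Glue job
    StartOnCreation=True,
)
```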
The limitation: Glue Workflows only orchestrates Glue services natively. You can work around this using Python Shell jobs that call other AWS services via Boto3, but it is not clean.
Amazon MWAA (Managed Apache Airflow)
Apache Airflow is the open source orchestration king. Created at Airbnb, with 1,500+ contributors, it integrates with practically everything – AWS, Google Cloud, Azure, Databricks, Jenkins, Slack, and hundreds more.
Airflow uses Python to define pipelines as DAGs. Key concepts (a minimal DAG sketch follows the list):
- Hooks and Connections handle authentication to external systems.
- Tasks are the basic units of work, moving through states (Scheduled, Queued, Running, Success/Failed).
- Operators are predefined task templates – BashOperator for shell commands, PythonOperator for Python functions, S3ToRedshiftTransfer for data movement.
- Sensors are special operators that wait for events, like S3KeySensor watching for a file to appear before triggering the next step.
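To see how those pieces fit, here is a minimal DAG sketch (bucket, key, and connection values are made up; imports assume Airflow 2.x with a recent Amazon provider package):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

def process_file():
    print("file arrived - run the transform here")

with DAG(
    dag_id="wait_then_process",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 3, "retry_delay": timedelta(seconds=10)},
    catchup=False,
) as dag:
    # Sensor: poll S3 until the expected file appears.
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="landing-zone-bucket",  # hypothetical bucket
        bucket_key="daily/feed.csv",        # hypothetical key
        aws_conn_id="aws_default",          # a Connection holds the credentials
        poke_interval=60,
    )
    process = PythonOperator(task_id="process_file", python_callable=process_file)

    wait_for_file >> process  # the edge that makes this a DAG
```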
Amazon MWAA is the managed version. AWS handles infrastructure and auto-scales workers. You write DAGs and upload them. The downsides: you need Python skills, and there is a fixed infrastructure cost whether pipelines are running or idle.
AWS Step Functions (the Serverless Option)
Step Functions takes a low-code, visual approach. You can drag and drop states in a visual designer or write pipeline definitions in Amazon States Language (ASL), a JSON format.
It integrates natively with many AWS services – pick a Lambda from a dropdown, add a Glue job task, or call any AWS API through the SDK integration.
Step Functions uses a state machine with different state types (a minimal sketch follows the list):
- Task state – does actual work (runs a Lambda, starts a Glue job).
- Choice state – branches the pipeline based on conditions (like checking a file extension).
- Parallel state – runs multiple branches at the same time.
- Wait state – pauses for a specified duration.
- Pass state – modifies data being passed between states.
- Success/Fail states – mark the pipeline outcome.
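For illustration, here is a small state machine – Task, Choice, and Fail states in ASL, built as a Python dict and registered with Boto3. The ARNs and names are placeholders, and this is a sketch, not the book's exercise definition:

```python
import json
import boto3

# Task -> Choice -> (Task | Fail), expressed in Amazon States Language.
definition = {
    "StartAt": "CheckFileExtension",
    "States": {
        "CheckFileExtension": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-ext",
            "Next": "IsCsv",
        },
        "IsCsv": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.extension", "StringEquals": "csv", "Next": "ProcessFile"}
            ],
            "Default": "InvalidExtension",
        },
        "ProcessFile": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-file",
            "End": True,
        },
        "InvalidExtension": {"Type": "Fail", "Error": "InvalidFileExtension"},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="ProcessFilesSketch",  # hypothetical
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-exec-role",  # placeholder
)
```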
The big advantage is that Step Functions is truly serverless. No infrastructure to manage. You pay per use (per state transition) rather than for idle infrastructure, and AWS guarantees 99.9% availability.
The drawback: you cannot resume a failed pipeline from the point of failure. If something breaks at step 7 of 10, you have to rerun the whole thing. Airflow handles this better.
Step Functions vs. MWAA: The Real Decision
For most teams, the choice comes down to these two. Step Functions wins on ease of use (visual editor, no code) and cost (pay-per-use, no idle charges). MWAA wins on third-party integrations (multi-cloud, hundreds of connectors), resume-from-failure support, and portability (Airflow is open source, Step Functions is AWS-only).
The Hands-On: Building an Event-Driven Pipeline
The chapter exercise builds a complete event-driven pipeline with Step Functions. Here is what you wire together (sketches of the two Lambda functions follow this list):
- Lambda dataeng-check-file-ext – extracts the file extension from an S3 upload event and returns the bucket, key, and extension.
- Lambda dataeng-random-failure-generator – divides 10 by a random number (0, 1, or 2). When the random number is 0, divide-by-zero. About one-third of executions fail.
- SNS topic dataeng-failure-notification – sends email alerts on failure.
- State machine ProcessFilesStateMachine – checks the extension; a choice state routes .csv to the processor and anything else to an “Invalid File Extension” pass state; failures publish to SNS; success ends cleanly.
- EventBridge rule – watches for PutObject/CopyObject/CompleteMultipartUpload on the clean-zone bucket and triggers the state machine.
- CloudTrail trail – required because S3 object-level events are not logged by default. Captures Write events so EventBridge can detect them.
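Reconstructed from the descriptions above, the two Lambda handlers might look like this (the event field paths assume the “AWS API Call via CloudTrail” detail format; treat them as illustrative):

```python
import os
import random

def check_file_ext_handler(event, context):
    """dataeng-check-file-ext: pull bucket/key from the CloudTrail-based
    EventBridge event and return the extension for the choice state."""
    params = event["detail"]["requestParameters"]  # assumed field path
    bucket = params["bucketName"]
    key = params["key"]
    extension = os.path.splitext(key)[1].lstrip(".").lower()
    return {"bucket": bucket, "key": key, "extension": extension}

def random_failure_handler(event, context):
    """dataeng-random-failure-generator: divide 10 by a random 0/1/2,
    so roughly one in three invocations raises ZeroDivisionError."""
    denominator = random.randint(0, 2)
    return {"result": 10 / denominator}  # fails when denominator is 0
```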
Upload a CSV to the clean-zone bucket and the whole chain fires: EventBridge detects it, triggers the state machine, Lambda checks the extension, the choice state routes to the CSV processor, and the random generator either succeeds or fails. On failure, you get an email. Upload a PDF instead and the choice state routes to the invalid extension path, sends a notification, and fails.
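And the EventBridge side could be wired up like this sketch (bucket name and ARNs are placeholders; the pattern shape is the CloudTrail-based one the chapter relies on):

```python
import json
import boto3

events = boto3.client("events")

# Match object-level S3 writes recorded by CloudTrail on the clean-zone bucket.
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutObject", "CopyObject", "CompleteMultipartUpload"],
        "requestParameters": {"bucketName": ["dataeng-clean-zone"]},  # hypothetical
    },
}

events.put_rule(Name="clean-zone-object-created", EventPattern=json.dumps(pattern))
events.put_targets(
    Rule="clean-zone-object-created",
    Targets=[{
        "Id": "state-machine",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:ProcessFilesStateMachine",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-sfn-role",  # placeholder
    }],
)
```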
Key Takeaway
Manual processes do not scale. The moment you have more than one or two data tasks running in production, you need orchestration. Chapter 10 gives you four options on AWS, each with clear trade-offs.
For Glue-only pipelines, use Glue Workflows. For complex multi-service and multi-cloud environments, go with MWAA. For most serverless AWS-native pipelines, Step Functions is the sweet spot – visual, cheap, and reliable.
The real skill is designing pipelines with proper error handling, retry strategies, and clean task separation so that when something breaks at 3 AM, the system handles it instead of paging you.
Book: Data Engineering with AWS by Gareth Eagar | ISBN: 978-1-80056-041-3
Previous: Chapter 9 Part 2 - Loading Data into a Data Mart Next: Chapter 11 - Amazon Athena