Pipeline Orchestration with Airflow, DAGs, and Data Transformations
This is Part 2 of Chapter 7, continuing from batch and streaming basics.
In Part 1, we covered how batch and streaming pipelines move data around. But here is the thing: having a pipeline is one thing. Making sure all its parts run in the right order, at the right time, without you babysitting it? That is orchestration. And this is where Chapter 7 gets really practical.
Data Orchestration: You Are the Conductor
Nwokwu uses an orchestra analogy and it is honestly pretty good. In an orchestra, a conductor makes sure every instrument plays at the right time with the right tempo. In data engineering, you are the conductor and your instruments are scripts, services, and jobs.
Picture this: you work at a startup. You built your first ETL pipeline with three parts. A script that pulls raw data, another that transforms it, and a service that loads it into storage. You run each part manually. Test ingestion. Check transformations. Verify the data landed correctly. Works great for testing.
But here is the problem. Every time someone needs fresh data, you have to run everything by hand. And if you are on vacation? Nobody else knows the order. This does not scale.
Orchestration solves this. It is the process of managing and coordinating workflows across different tools and systems. The building blocks are:
- DAGs to define task dependencies
- Automation to reduce manual triggers
- Scheduling to run things at specific times
- Monitoring to track progress
- Alerting to notify you when things break
DAGs: Directed Acyclic Graphs
This sounds more intimidating than it is. A DAG is just a way to organize tasks so they run in the correct order.
Break it down word by word. Directed means each connection between tasks has a direction. Task A leads to Task B, not the other way around. Acyclic means no loops. You cannot end up back at the starting task. Graph means it is a collection of nodes (tasks) and edges (dependencies).
So a DAG says: “Run task A first. When A is done, run B and C in parallel. When both B and C finish, run D.” No cycles, no confusion, clear execution order.
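That "A, then B and C in parallel, then D" shape can be sketched in plain Python with the standard library's `graphlib`, no Airflow required. This is an illustration of the concept, not anything from the book:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on
dag = {
    "A": set(),       # A has no prerequisites
    "B": {"A"},       # B waits for A
    "C": {"A"},       # C waits for A (can run in parallel with B)
    "D": {"B", "C"},  # D waits for both B and C
}

# A topological sort yields a valid execution order
order = list(TopologicalSorter(dag).static_order())
print(order)  # A comes first, D comes last; B and C may appear in either order
```

If you accidentally introduce a cycle (say, making A depend on D), `TopologicalSorter` raises a `CycleError`, which is exactly the "acyclic" guarantee doing its job.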
Here is what I found most useful in the book: the best practices for designing DAGs.
- Set clear dependencies. If Task B needs output from Task A, make that explicit in the DAG
- Keep it simple. Fewer nodes, easier debugging. Do not over-engineer
- Use modular workflows. Break large DAGs into smaller reusable sub-DAGs. A data extraction sub-DAG can be shared across multiple pipelines
- Implement error handling. Set up alerts for failures. Design mechanisms to skip or rerun failed tasks without restarting the whole pipeline
- Version control your DAGs. Use Git. Track changes. Be able to roll back
- Monitor and log everything. Log errors, successes, and performance metrics for each task
Scheduling, Monitoring, and Alerts
Scheduling defines when automated tasks should run. Three types of triggers:
- Time-based - run at fixed intervals (every day at 8 AM)
- Dependency-based - run only after certain other tasks complete
- Event-driven - respond to signals like a new file landing in S3
The book has a nice analogy here: scheduling is like setting your coffee machine to brew at 7 AM. Automation is what makes it brew without you pressing a button.
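To make the time-based trigger concrete, here is a minimal sketch of the "brew at 7 AM" logic: computing when a daily job should next fire. The function name and the 8 AM default are my own, not from the book; schedulers like Airflow do this for you with cron expressions:

```python
from datetime import datetime, time, timedelta

def next_daily_run(now, run_at=time(8, 0)):
    """Next firing time for a daily 8 AM job, given the current time."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        candidate += timedelta(days=1)  # 8 AM already passed today
    return candidate

print(next_daily_run(datetime(2024, 1, 1, 9, 30)))  # 2024-01-02 08:00:00
```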
Monitoring means watching your pipelines to catch issues early. The key metrics to track fall into three buckets:
Performance metrics - latency (how fast data moves through), throughput (how much data per second), and error rates (how many records fail).
Resource metrics - CPU usage, memory, disk I/O, network bandwidth. If your Spark cluster is at 90% CPU during peak hours, time to think about scaling.
Data quality metrics - completeness (any missing data?), accuracy (do records match expected values?), and consistency (does data agree across pipeline stages?).
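Two of those data quality checks are easy to express in pandas. A sketch on a made-up frame (the column names and valid-code set are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "customerID": [1, 2, 3, 4],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
    "country_code": ["US", "US", "GB", "ZZ"],
})

# Completeness: share of non-null values per column
completeness = df.notna().mean()

# Accuracy (one narrow slice of it): do values fall in an expected set?
valid_codes = {"US", "GB", "DE"}
accuracy = df["country_code"].isin(valid_codes).mean()

print(completeness["email"])  # 0.75 -- one of four emails is missing
print(accuracy)               # 0.75 -- 'ZZ' is not a valid code
```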
Alerts are notifications when something breaks. The book gives solid advice here. Do not alert on every minor delay or your team will start ignoring them. Use severity levels (high, medium, low). Include context-rich messages. Instead of “Pipeline failed,” say “Pipeline daily_sales_aggregator failed at step 3. Error: Database connection timeout. Check the database server health and retry.”
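The context-rich message advice is easy to bake into a tiny helper. This function is my own sketch, not from the book, but it shows the pattern: pipeline name, failing step, error, and a suggested action, tagged with a severity level:

```python
def format_alert(pipeline, step, error, action, severity="high"):
    """Build a context-rich alert message with a severity tag."""
    return (f"[{severity.upper()}] Pipeline {pipeline} failed at step {step}. "
            f"Error: {error}. {action}")

msg = format_alert(
    "daily_sales_aggregator", 3,
    "Database connection timeout",
    "Check the database server health and retry.",
)
print(msg)
```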
Hands-On: Building an ETL Pipeline
The second half of this chapter is a full lab. You build an ETL pipeline that processes bakery customer data, transforms it, loads it into PostgreSQL, and then automates the whole thing with Apache Airflow.
Extract: Loading the Data
Start by reading the CSV with pandas:
import pandas as pd
# Pull in our raw customer data
customer_data = pd.read_csv("customer_data.csv")
# Always peek at your data first
customer_data.head()
Nothing fancy. But always explore your data before transforming it. Check the shape, look for duplicates, understand what you are working with.
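That exploration pass might look like this. The stand-in frame below mimics a few columns so the snippet is self-contained; in the lab you would run these calls on the frame loaded from `customer_data.csv`:

```python
import pandas as pd

# Stand-in for customer_data = pd.read_csv("customer_data.csv")
customer_data = pd.DataFrame({
    "customerID": [1, 2, 2],
    "email": ["a@x.com", "b@x.com", "b@x.com"],
    "phone": ["555-0100", None, None],
})

print(customer_data.shape)               # rows and columns
print(customer_data.duplicated().sum())  # fully duplicated rows
print(customer_data.isna().sum())        # missing values per column
```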
Transform: Cleaning and Reshaping
Here is where the real work happens. The book walks through several common transformations.
Drop duplicates based on key fields:
# Remove rows where both customerID and email are duplicated
customer_data = customer_data.drop_duplicates(
    subset=['customerID', 'email']
)
Handle missing values by filling with a placeholder instead of dropping rows:
# Better to mark unknowns than lose the whole row
# (assign back instead of inplace=True, which pandas is phasing out on column slices)
customer_data['phone'] = customer_data['phone'].fillna('Unknown')
Split compound columns into separate parts. This is a very common real-world task. The delivery address is one big string with street, country code, and zip code separated by commas:
# Break the address into useful pieces with a single split
address_parts = customer_data['delivery_address'].str.split(pat=',', expand=True)
customer_data['address'] = address_parts[0]
customer_data['country_code'] = address_parts[1]
customer_data['zipcode'] = address_parts[2]
# Same idea for the delivery timestamp
delivery_parts = customer_data['last_delivery'].str.split(pat=',', expand=True)
customer_data['last_delivery_date'] = delivery_parts[0]
customer_data['last_delivery_time'] = delivery_parts[1]
Drop old columns you no longer need and rename columns for clarity:
# Clean up: drop originals, rename for clarity
customer_data = customer_data.drop(['delivery_address', 'last_delivery'], axis=1)
customer_data = customer_data.rename(columns={'birthday': 'birth_date'})
Load: Push to PostgreSQL
Once the data is clean, load it into a database using SQLAlchemy:
import sqlalchemy as db
connection_uri = "postgresql://user:password@host:port/dbname"
engine = db.create_engine(connection_uri)
# Write the cleaned data to a 'customers' table
customer_data.to_sql("customers", engine, if_exists='replace', index=False)
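A quick way to verify the load is to read the table straight back. A sketch using an in-memory SQLite engine as a stand-in for the Postgres URI, so it runs anywhere:

```python
import pandas as pd
import sqlalchemy as db

# In-memory SQLite stands in for "postgresql://user:password@host:port/dbname"
engine = db.create_engine("sqlite://")

customer_data = pd.DataFrame({"customerID": [1, 2], "email": ["a@x.com", "b@x.com"]})
customer_data.to_sql("customers", engine, if_exists='replace', index=False)

# Read it back and sanity-check the row count
check = pd.read_sql("SELECT COUNT(*) AS n FROM customers", engine)
print(check["n"][0])  # 2
```

Note that `if_exists='replace'` drops and recreates the table on every run, which suits a full-load pipeline like this one; for incremental loads you would use `'append'` instead.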
Automate: Wire It Up with Airflow
This is the payoff. You take all the transformation logic, put it in a Python function, and schedule it as an Airflow DAG:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator  # the old 'python_operator' path is deprecated in Airflow 2
from customer_etl import main # your ETL function
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG(
    'customer_full_load',
    default_args=default_args,
    description='Customer Table ETL DAG',
    schedule_interval='@daily',  # runs once a day
)
run_etl = PythonOperator(
    task_id='run_customer_etl',
    python_callable=main,
    dag=dag,
)
Now your pipeline runs daily without any manual work. You fire up the Airflow web server on port 8080, and you can see your DAG, trigger it manually, watch it execute, and check logs if something goes wrong.
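For reference, getting Airflow running locally looks roughly like this. Exact commands vary by version (recent releases also offer `airflow standalone`, which bundles everything), so treat this as a sketch:

```shell
# Initialize the metadata database (first run only)
airflow db init

# Start the web UI on port 8080 and the scheduler, in separate terminals
airflow webserver --port 8080
airflow scheduler

# Or trigger the DAG manually from the CLI instead of the UI
airflow dags trigger customer_full_load
```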
This is a clean, complete example that covers the full cycle. Extract from CSV, transform with pandas, load into PostgreSQL, orchestrate with Airflow. For a beginner, this is exactly the right scope. Not too simple to be useless, not so complex that you get lost.
The chapter does a good job of building from concepts to practice. You learn what orchestration is, why DAGs matter, how scheduling and monitoring work, and then you actually build something. I wish more technical books did this.
This is part 11 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.
| < Previous: Batch and Streaming Pipelines | Next: Data Quality > |