Orchestrating Pipelines with Apache Airflow - Part 2
In the last post, we got Airflow running. Now, let’s talk about how to actually use it. The heart of Airflow is the DAG—the Directed Acyclic Graph.
Neylson Crepalde’s book introduces a modern way to build these: the TaskFlow API. If you’ve looked at Airflow tutorials from five years ago, forget them. The new way is much cleaner and uses standard Python decorators.
Writing your first DAG
A DAG is basically a Python script that describes a workflow. Here is a simplified version of the “Titanic” pipeline from the book:
```python
from airflow.decorators import dag, task
from datetime import datetime

@dag(
    start_date=datetime(2024, 1, 1),
    schedule_interval="@once",
    catchup=False,
)
def titanic_pipeline():
    @task
    def download_data():
        # logic to fetch the CSV
        return "/tmp/titanic.csv"

    @task
    def process_data(file_path):
        # logic to count survivors
        print(f"Processing {file_path}")

    # Set the dependencies
    path = download_data()
    process_data(path)

# Instantiate the DAG
titanic_pipeline()
```
The Power of Operators
While the @task decorator is great for custom Python code, Airflow also has built-in Operators for specific tasks:
- BashOperator: Run shell commands.
- PostgresOperator: Execute SQL queries.
- SparkSubmitOperator: (One of our favorites) Trigger a Spark job.
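As a sketch of how a classic operator fits into a DAG file, and how it mixes with TaskFlow tasks: the snippet below uses the Airflow 2.x `BashOperator` import path and chains dependencies with `>>`. The download URL is a placeholder, and this is an untested DAG-definition fragment, not the book’s exact code.

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator

@dag(start_date=datetime(2024, 1, 1), schedule_interval="@once", catchup=False)
def mixed_pipeline():
    # A classic operator: runs a shell command on the worker
    download = BashOperator(
        task_id="download",
        bash_command="curl -o /tmp/titanic.csv https://example.com/titanic.csv",
    )

    @task
    def report():
        print("download finished")

    # Classic operators chain with >> and mix freely with TaskFlow tasks
    download >> report()

mixed_pipeline()
```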
Three Golden Rules for Airflow
The book wraps up Chapter 6 with some essential advice for data engineers:
- Airflow is NOT a processing tool: Don’t load 10GB of data into an Airflow worker. Airflow should just tell other systems (like Spark or Snowflake) to do the heavy lifting.
- Keep tasks idempotent: A task should be able to run multiple times without causing a mess. If a task fails and you restart it, it shouldn’t duplicate data in your database.
- Always set `catchup=False`: By default, Airflow tries to run every missed schedule interval from the `start_date` to today. If you set a start date of two years ago, your cluster will explode the moment you turn on the DAG. Set `catchup=False` unless you specifically need backfilling.
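The idempotency rule is easiest to see in code. Here is a minimal, Airflow-free sketch in plain Python with sqlite3 (the table and column names are made up for illustration): the task deletes its own partition before re-inserting it, so rerunning it never duplicates rows.

```python
import sqlite3

def load_partition(conn, run_date, rows):
    """Idempotent load: wipe this run's partition, then insert it."""
    with conn:  # one transaction: delete + insert succeed or fail together
        conn.execute("DELETE FROM trips WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO trips (run_date, passenger) VALUES (?, ?)",
            [(run_date, p) for p in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (run_date TEXT, passenger TEXT)")

# Running the "task" twice for the same date leaves exactly one copy of the data
load_partition(conn, "2024-01-01", ["Alice", "Bob"])
load_partition(conn, "2024-01-01", ["Alice", "Bob"])

count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(count)  # 2, not 4
```

An append-only INSERT, by contrast, would leave four rows after a retry. The delete-then-insert (or upsert) pattern is what makes a failed-and-restarted task safe.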
Now that we can process data with Spark and orchestrate it with Airflow, we’re missing one thing: how do we handle data that arrives in real-time? In the next post, we’re diving into Apache Kafka.
Next: Real-Time Streaming with Apache Kafka - Part 1
Previous: Orchestrating Pipelines with Apache Airflow - Part 1
Book Details:
- Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
- Author: Neylson Crepalde
- ISBN: 978-1-83546-214-0