Orchestrating Pipelines with Apache Airflow - Part 2

In the last post, we got Airflow running. Now, let’s talk about how to actually use it. The heart of Airflow is the DAG, the Directed Acyclic Graph.

Neylson Crepalde’s book introduces a modern way to build these: the Taskflow API. If you’ve looked at Airflow tutorials from five years ago, forget them. The new way is much cleaner and uses standard Python decorators.

Writing your first DAG

A DAG is basically a Python script that describes a workflow. Here is a simplified version of the “Titanic” pipeline from the book:

from airflow.decorators import task, dag
from datetime import datetime

@dag(
    start_date=datetime(2024, 1, 1),
    schedule="@once",  # "schedule" replaces the deprecated "schedule_interval" in Airflow 2.4+
    catchup=False
)
def titanic_pipeline():
    
    @task
    def download_data():
        # logic to fetch CSV
        return "/tmp/titanic.csv"

    @task
    def process_data(file_path):
        # logic to count survivors
        print(f"Processing {file_path}")

    # Set the dependencies
    path = download_data()
    process_data(path)

# Instantiate the DAG
titanic_pipeline()

The Power of Operators

While the @task decorator is great for custom Python code, Airflow also has built-in Operators for specific tasks:

  • BashOperator: Run shell commands.
  • PostgresOperator: Execute SQL queries.
  • SparkSubmitOperator: (One of our favorites) Trigger a Spark job.

Three Golden Rules for Airflow

The book wraps up Chapter 6 with some essential advice for data engineers:

  1. Airflow is NOT a processing tool: Don’t load 10GB of data into an Airflow worker. Airflow should just tell other systems (like Spark or Snowflake) to do the heavy lifting.
  2. Keep tasks idempotent: A task should be able to run multiple times without causing a mess. If a task fails and you restart it, it shouldn’t duplicate data in your database.
  3. Always set catchup=False: By default, Airflow tries to run every missed schedule interval from the start_date to today. If you set a start date of two years ago, your cluster will explode the moment you turn on the DAG. Set catchup=False unless you specifically need backfilling.
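Rule 2 in practice: a delete-then-insert (or upsert) pattern makes re-runs safe. This is a sketch of the idea using SQLite and made-up table names, not the book’s code:

```python
import sqlite3

def load_survivor_counts(conn, run_date, rows):
    """Idempotent load: re-running for the same run_date never duplicates rows."""
    cur = conn.cursor()
    # Wipe this run's partition first, then insert fresh results
    cur.execute("DELETE FROM survivor_counts WHERE run_date = ?", (run_date,))
    cur.executemany(
        "INSERT INTO survivor_counts (run_date, sex, survivors) VALUES (?, ?, ?)",
        [(run_date, sex, n) for sex, n in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survivor_counts (run_date TEXT, sex TEXT, survivors INTEGER)")

# Run the task twice, as Airflow would on a retry
load_survivor_counts(conn, "2024-01-01", [("female", 233), ("male", 109)])
load_survivor_counts(conn, "2024-01-01", [("female", 233), ("male", 109)])

count = conn.execute("SELECT COUNT(*) FROM survivor_counts").fetchone()[0]
print(count)  # 2, not 4: the retry did not duplicate data
```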

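To see why rule 3 matters, a quick back-of-the-envelope calculation (the dates here are hypothetical): with a daily schedule, catchup queues one run per missed day between start_date and now.

```python
from datetime import datetime, timedelta

start_date = datetime(2024, 1, 1)
today = datetime(2026, 1, 1)

# One scheduled run per missed day between start_date and today
missed_runs = (today - start_date) // timedelta(days=1)
print(missed_runs)  # 731 backfill runs queued the moment the DAG is unpaused
```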
Now that we can process data with Spark and orchestrate it with Airflow, we’re missing one thing: how do we handle data that arrives in real-time? In the next post, we’re diving into Apache Kafka.

Next: Real-Time Streaming with Apache Kafka - Part 1
Previous: Orchestrating Pipelines with Apache Airflow - Part 1

Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0
