Orchestrating Pipelines with Apache Airflow - Part 1

If Spark is the engine, then Apache Airflow is the conductor. In a modern data stack, you rarely have just one job running in isolation. You have ingestion, cleaning, processing, and delivery—and they all have to happen in a specific order.

In Chapter 6, Neylson Crepalde dives into why Airflow has become the industry standard for managing these complex workflows.

Meet the Architecture

Airflow isn’t just a script runner; it’s a distributed system with several moving parts (the short sketch after this list shows how they split up a simple pipeline):

  • Web Server: This is the UI you see in your browser. It’s where you monitor your jobs and trigger new runs.
  • Scheduler: The real brains of the operation. It constantly checks your DAGs to see which tasks are ready to run based on their dependencies and schedule.
  • Executor: This component decides how to run the tasks. You could run them locally (LocalExecutor) or dynamically spin up pods on a cluster (KubernetesExecutor).
  • Workers: These are the actual processes that execute your code.
  • Metadata Database: Usually a Postgres or MySQL instance that stores the state of every single task and DAG run.
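
To see how those parts divide the work, here is a minimal sketch of a DAG file (assuming Airflow 2.4 or newer; the pipeline and task names are made up for illustration). You declare the tasks and their order; the scheduler decides when each task is ready, the executor decides where it runs, and a worker actually executes it:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # The scheduler re-parses this file, tracks the @daily schedule,
    # and queues each task once its upstream dependencies succeed.
    with DAG(
        dag_id="hello_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = BashOperator(task_id="load", bash_command="echo loading")

        # ">>" declares the dependency: a worker runs "load"
        # only after "extract" has succeeded.
        extract >> load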

The Fast Path: Astro CLI

Setting up all these components manually can be a headache. The book recommends using the Astro CLI from Astronomer. It packages all of these components into Docker containers so you can get a professional environment running in seconds.

Here is the “get started” flow:

  1. Install the Astro CLI.
  2. Initialize a project: astro dev init
  3. Start the engine: astro dev start

Once those containers are up, you just go to http://localhost:8080, log in with admin/admin, and you’re in the pilot’s seat.
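
With the UI up, one quick sanity check is to make sure Airflow can actually parse everything in your project’s dags/ folder. This is a sketch of a common pattern rather than anything the chapter mandates: load the folder with Airflow’s DagBag and fail loudly on import errors:

    from airflow.models import DagBag

    # Parse every file in the project's dags/ folder,
    # skipping Airflow's bundled example DAGs.
    dag_bag = DagBag(include_examples=False)

    # import_errors maps each broken file to its traceback,
    # so an empty dict means every DAG loaded cleanly.
    assert not dag_bag.import_errors, dag_bag.import_errors
    print(f"Parsed {len(dag_bag.dags)} DAG(s) without errors")

Dropped into a pytest file, a check like this catches a broken import in CI instead of in the scheduler logs.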

Why this matters for Kubernetes

Airflow and Kubernetes are a match made in heaven. With the KubernetesExecutor, Airflow can spin up a dedicated pod for every single task in your pipeline. This means each task gets exactly the resources it needs (and no more), and they are perfectly isolated from each other.
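
Here is roughly what that per-task sizing looks like in code. This is a minimal sketch, assuming the KubernetesExecutor and the kubernetes Python client are in place; the task name and resource numbers are illustrative. The pod_override key lets one task request its own CPU and memory without touching the rest of the pipeline:

    from kubernetes.client import models as k8s
    from airflow.operators.bash import BashOperator

    # Inside a DAG definition. The KubernetesExecutor reads
    # executor_config when it builds this task's dedicated pod.
    heavy_transform = BashOperator(
        task_id="heavy_transform",
        bash_command="echo crunching a big partition",
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        # "base" is the container Airflow runs tasks in.
                        k8s.V1Container(
                            name="base",
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "500m", "memory": "1Gi"},
                                limits={"cpu": "2", "memory": "4Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )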

Now that we have the infrastructure running, it’s time to actually build something. In the next post, we’ll look at how to write your first DAG using the modern TaskFlow API.

Next: Orchestrating Pipelines with Apache Airflow - Part 2
Previous: Distributed Processing with Apache Spark - Part 2

Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0
