Building an End-to-End Big Data Pipeline - Part 2
In our last post, we checked the infrastructure. Now, let’s build the actual pipeline. Neylson Crepalde uses the IMDB dataset to demonstrate a professional batch workflow.
In my last post, we got Spark running natively on Kubernetes. Now, it’s time to bring in the conductor (Airflow) and the nervous system (Kafka). This is where your cluster starts to feel like a real data platform.
In the last post, we got Airflow running. Now, let’s talk about how to actually use it. The heart of Airflow is the DAG—the Directed Acyclic Graph.
If Spark is the engine, then Apache Airflow is the conductor. In a modern data stack, you rarely have just one job running in isolation. You have ingestion, cleaning, processing, and delivery—and they all have to happen in a specific order.
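That ordering requirement is exactly what the "acyclic" in DAG buys you: if tasks never form a cycle, there is always a valid order to run them in. A minimal pure-Python sketch of the idea (no Airflow needed; the stage names mirror the ones above and the graph itself is made up):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline: each task maps to the tasks it depends on.
pipeline = {
    "cleaning":   {"ingestion"},              # cleaning runs after ingestion
    "processing": {"cleaning"},
    "delivery":   {"processing", "cleaning"},
}

# Because the graph has no cycles, a topological sort yields a run order
# that respects every dependency. This is what a scheduler computes.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # e.g. ['ingestion', 'cleaning', 'processing', 'delivery']
```

Airflow does the same thing at scale: you declare the edges, and the scheduler figures out what can run and when.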
We’ve talked about the architecture, but what about the actual tools? To build a modern data lakehouse on Kubernetes, you need a specific set of tools that can handle scale, automation, and speed.
That’s it. Fifteen chapters, seventeen posts, and one complete walkthrough of Paul Crickard’s Data Engineering with Python (Packt, 2020, ISBN: 978-1-83921-418-9).
You built your data pipelines. They work on your laptop. Now what? Chapter 10 of Data Engineering with Python by Paul Crickard covers the part everyone eventually has to face: getting your pipelines out of development and into production.
This is Part 2 of Chapter 7, continuing from batch and streaming basics.
In Part 1, we covered how batch and streaming pipelines move data around. But building a pipeline is only half the job. Making sure all its parts run in the right order, at the right time, without you babysitting them? That is orchestration. And this is where Chapter 7 gets really practical.
The previous chapters taught you the individual tools. Python, NiFi, Airflow, databases, data cleaning. Chapter 6 of Data Engineering with Python by Paul Crickard puts them all together into one real project.
Chapter 1 was all theory. Now it’s time to actually install stuff. Chapter 2 of Data Engineering with Python by Paul Crickard is a setup chapter. You install the tools, configure them, and make sure everything talks to each other.
In the first part, we got Cloud Composer running, wrote our first DAGs, and learned operators. This second part covers the stuff that separates beginner Airflow code from production-ready pipelines: variables, idempotent tasks, backfilling, sensors, and dataset-driven scheduling.
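Idempotency is the one on that list worth internalizing first, because backfilling depends on it: re-running a task for a past date must produce the same result, not duplicates. A common pattern is to have each run own one date partition and overwrite it. A minimal sketch (the `load_partition` helper and paths are hypothetical; in Airflow the logical date would arrive via the `ds` template variable rather than being passed by hand):

```python
import tempfile
from datetime import date
from pathlib import Path

def load_partition(run_date: date, rows: list[str], base: Path) -> Path:
    """Idempotent load: each run writes (and overwrites) its own date
    partition, so re-running a backfill never duplicates data."""
    target = base / f"ds={run_date.isoformat()}" / "data.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(rows))  # overwrite, never append
    return target

base = Path(tempfile.mkdtemp())
first = load_partition(date(2024, 1, 1), ["a,1", "b,2"], base)
second = load_partition(date(2024, 1, 1), ["a,1", "b,2"], base)  # re-run: same file, same content
```

An append-based load would fail this test: running the backfill twice would double the rows. Partition-and-overwrite makes reruns safe by construction.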
Up until now in the book, we built BigQuery tables by hand, wrote queries in the console, and loaded data manually. That works for learning, but nobody does that in production. In production, you need things to run on their own, on schedule, without you babysitting them at 5 AM.