Deploying the Big Data Stack on Kubernetes - Part 2
In my last post, we got Spark running natively on Kubernetes. Now, it’s time to bring in the conductor (Airflow) and the nervous system (Kafka). This is where your cluster starts to feel like a real data platform.
Airflow: The Kubernetes-Native Way
Deploying Airflow on Kubernetes is a bit different from running it on a single machine. In Chapter 8, Neylson Crepalde highlights three key configurations that make it production-ready:
- The KubernetesExecutor: This is the big one. It tells Airflow to spin up a new pod for every single task in your DAG. No more resource contention between tasks.
- GitSync: Instead of manually uploading your Python DAG files, you connect Airflow to a Git repository. Every time you push a change to GitHub, Airflow automatically pulls the latest version. It’s “DevOps for Data.”
- Remote Logging: You don’t want your logs filling up your Kubernetes disks. We configure Airflow to stream all its task logs directly to an S3 bucket.
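All three of these settings map onto the Helm chart's values file. A minimal sketch of what `custom_values.yaml` could look like is below — the Git repo URL, branch, and S3 bucket are placeholders, and the exact keys can shift between chart versions, so check the chart's values reference for your release:

```yaml
# custom_values.yaml -- sketch for the official apache-airflow Helm chart
executor: KubernetesExecutor      # one pod per task

dags:
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git   # placeholder repo
    branch: main
    subPath: dags                 # folder inside the repo holding DAG files

config:
  logging:
    remote_logging: "True"
    remote_base_log_folder: "s3://your-airflow-logs/"    # placeholder bucket
    remote_log_conn_id: "aws_default"                    # Airflow connection to S3
```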
We use the official Helm chart to deploy all this:
```shell
helm install airflow apache-airflow/airflow \
  --namespace airflow \
  -f custom_values.yaml
```
Kafka: Ingesting at Scale
The final piece of the stack is Apache Kafka. You can deploy it with standard Helm charts, and the book makes the case that running Kafka on Kubernetes is a major operational win.
Kubernetes simplifies scaling brokers and managing partitions. More importantly, we also deploy Kafka Connect. This tool lets us define “connectors” that automatically move data from a SQL database (like Postgres) directly into our Kafka topics without writing a single line of code.
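Connectors are just configuration registered against the Connect REST API. As a sketch (hypothetical names throughout: the `kafka-connect:8083` service address, database credentials, and table layout are placeholders, and this assumes the Confluent JDBC source connector is installed on the Connect workers — the book's exact connector may differ):

```shell
# Register a JDBC source connector that streams new Postgres rows into Kafka.
curl -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-source",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:postgresql://postgres:5432/appdb",
      "connection.user": "app_user",
      "connection.password": "app_password",
      "mode": "incrementing",
      "incrementing.column.name": "id",
      "topic.prefix": "postgres-"
    }
  }'
```

Once registered, Connect polls the database on its own and publishes each new row to a topic named after the table (here prefixed with `postgres-`), so the “no code” claim really does hold: it is all declarative configuration.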
The Result: A Fully Functional Stack
At this point, you have:
- Spark for processing.
- Airflow for orchestrating.
- Kafka for real-time ingestion.
- S3 for long-term storage.
They are all running in the same cluster, sharing resources efficiently, and managed via standardized YAML files. You have successfully built a Modern Data Stack on Kubernetes.
But wait—how do the analysts actually get the data? You can’t just tell them to write PySpark scripts. In the next post, we’re going to look at the Data Consumption Layer and a tool called Trino.
Next: The Data Consumption Layer - Querying with Trino
Previous: Deploying the Big Data Stack on Kubernetes - Part 1
Book Details:
- Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
- Author: Neylson Crepalde
- ISBN: 978-1-83546-214-0