Deploying the Big Data Stack on Kubernetes - Part 2
In my last post, we got Spark running natively on Kubernetes. Now, it’s time to bring in the conductor (Airflow) and the nervous system (Kafka). This is where your cluster starts to feel like a real data platform.
Airflow: The Kubernetes-Native Way
Deploying Airflow on Kubernetes is a bit different from running it on a single machine. In Chapter 8, Neylson Crepalde highlights three key configurations that make it production-ready:
- The KubernetesExecutor: This is the big one. It tells Airflow to spin up a new pod for every single task in your DAG. No more resource contention between tasks.
- GitSync: Instead of manually uploading your Python DAG files, you connect Airflow to a Git repository. Every time you push a change to GitHub, Airflow automatically pulls the latest version. It’s “DevOps for Data.”
- Remote Logging: You don’t want your logs filling up your Kubernetes disks. We configure Airflow to stream all its task logs directly to an S3 bucket.
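All three of these settings map onto the Helm chart's values file. A minimal sketch of what `custom_values.yaml` could look like is below — the Git repo URL, branch, and S3 bucket are placeholders, and the exact keys can shift between chart versions, so check the chart's values reference for your release:

```yaml
# custom_values.yaml -- sketch for the official apache-airflow Helm chart
executor: KubernetesExecutor      # one pod per task

dags:
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git   # placeholder repo
    branch: main
    subPath: dags                 # folder inside the repo holding DAG files

config:
  logging:
    remote_logging: "True"
    remote_base_log_folder: "s3://your-airflow-logs/"    # placeholder bucket
    remote_log_conn_id: "aws_default"                    # Airflow connection to S3
```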
We use the official Helm chart to deploy all this:
```shell
helm install airflow apache-airflow/airflow \
  --namespace airflow \
  -f custom_values.yaml
```
Kafka: Ingesting at Scale
The final piece of the stack is Apache Kafka. You can deploy it with standard Helm charts, and the book makes the case that running Kafka on Kubernetes is a major operational win.
Kubernetes simplifies scaling brokers and managing partitions. More importantly, we also deploy Kafka Connect. This tool lets us define “connectors” that automatically move data from a SQL database (like Postgres) directly into our Kafka topics without writing a single line of code.
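Connectors are just configuration registered against the Connect REST API. As a sketch (hypothetical names throughout: the `kafka-connect:8083` service address, database credentials, and table layout are placeholders, and this assumes the Confluent JDBC source connector is installed on the Connect workers — the book's exact connector may differ):

```shell
# Register a JDBC source connector that streams new Postgres rows into Kafka.
curl -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-source",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:postgresql://postgres:5432/appdb",
      "connection.user": "app_user",
      "connection.password": "app_password",
      "mode": "incrementing",
      "incrementing.column.name": "id",
      "topic.prefix": "postgres-"
    }
  }'
```

Once registered, Connect polls the database on its own and publishes each new row to a topic named after the table (here prefixed with `postgres-`), so the “no code” claim really does hold: it is all declarative configuration.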
The Result: A Fully Functional Stack
At this point, you have:
- Spark for processing.
- Airflow for orchestrating.
- Kafka for real-time ingestion.
- S3 for long-term storage.
They are all running in the same cluster, sharing resources efficiently, and managed via standardized YAML files. You have successfully built a Modern Data Stack on Kubernetes.
But wait—how do the analysts actually get the data? You can’t just tell them to write PySpark scripts. In the next post, we’re going to look at the Data Consumption Layer and a tool called Trino.
Next: The Data Consumption Layer - Querying with Trino
Previous: Deploying the Big Data Stack on Kubernetes - Part 1
Book Details:
- Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
- Author: Neylson Crepalde
- ISBN: 978-1-83546-214-0