Deploying the Big Data Stack on Kubernetes - Part 1

We’ve explored Spark, Airflow, and Kafka as individual tools. But the real goal of Neylson Crepalde’s book is to show you how to run them all as a cohesive “stack” on Kubernetes. In Chapter 8, we finally start the heavy lifting of deployment.

To do this effectively, we need two power tools: Helm and Operators.

Helm: The Package Manager

Think of Helm as the “App Store” for Kubernetes. Instead of writing dozens of YAML files for a single tool, you can use a Helm Chart. It packages up all the configuration, default settings, and dependencies into one bundle.

If you want to install something, you just run helm install. It makes managing complex deployments significantly easier.
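
To make that concrete, here is a minimal sketch of the Helm workflow. The chart shown is Bitnami's public Kafka chart; the release and namespace names are placeholders, not the book's exact values.

# Add a public chart repository and refresh the local index
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Install Kafka into its own namespace; Helm renders and applies all the YAML
helm install my-kafka bitnami/kafka --namespace kafka --create-namespace

# Inspect or remove the release later
helm list --namespace kafka
helm uninstall my-kafka --namespace kafka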

Operators: Smart Controllers

While Helm helps with installation, Operators help with ongoing management. An Operator is a custom controller that "knows" how to run a specific application.

For example, the SparkOperator adds a new type of object to Kubernetes: the SparkApplication. Instead of manually managing pods, you just give Kubernetes a YAML file describing your Spark job, and the Operator handles the rest—spinning up the driver, creating executors, and cleaning up when it’s done.
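
A quick way to see this in action: once the operator is installed (we'll do that below), it registers a CustomResourceDefinition, and SparkApplication starts behaving like any built-in resource.

# The operator registers a CRD for the new resource type
kubectl get crd sparkapplications.sparkoperator.k8s.io

# After that, you can list Spark jobs like any other Kubernetes object
kubectl get sparkapplications --all-namespaces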

Deploying Spark on EKS

The book walks through a professional setup on AWS EKS. Here is the high-level flow:

  1. Install the SparkOperator: We use Helm to get the operator running in its own namespace.
  2. Set up storage: We enable the AWS EBS CSI driver so Kubernetes can provision and attach disks for us.
  3. Prepare your code: Upload your PySpark script to an S3 bucket.
  4. Run the job: Create a SparkApplication manifest that points to your script on S3, like the one below. (The commands for steps 1-3 are sketched after the manifest.)

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-spark-job
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster                        # the driver runs as its own pod in the cluster
  image: "my-custom-spark-image:v1"    # must bundle Spark plus the hadoop-aws/S3A libraries
  mainApplicationFile: "s3a://my-bucket/scripts/spark_job.py"
  sparkVersion: "3.5.0"                # should match the Spark version baked into the image
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark-operator-spark   # a service account with RBAC to manage executor pods
  executor:
    cores: 1
    instances: 2
    memory: "1g"
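
The command-line side of steps 1 through 3 (plus submitting the manifest) looks roughly like this. Treat it as a sketch: the chart repo URL is the Kubeflow spark-operator project's, and the cluster, bucket, and file names are placeholders, not the book's exact values.

# Step 1: install the SparkOperator with Helm, in its own namespace
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator --create-namespace

# Step 2: enable the EBS CSI driver as an EKS add-on
# (in practice you also pass --service-account-role-arn with the right IAM role)
aws eks create-addon --cluster-name my-eks-cluster --addon-name aws-ebs-csi-driver

# Step 3: upload the PySpark script to the bucket referenced in the manifest
aws s3 cp spark_job.py s3://my-bucket/scripts/spark_job.py

# Step 4: submit the job by applying the manifest
kubectl apply -f my-spark-job.yaml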

The Big Win

Once this is running, you have a Kubernetes-native Spark job. You can monitor it with kubectl get sparkapplication and read the driver's logs with kubectl logs, as sketched below.
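
Day-to-day, that monitoring is plain kubectl, using the job name from the manifest above:

# Check the job's state (SUBMITTED, RUNNING, COMPLETED, FAILED)
kubectl get sparkapplication my-spark-job

# Dig into per-run events and status
kubectl describe sparkapplication my-spark-job

# Tail the driver's logs; the operator names the driver pod <app-name>-driver
kubectl logs -f my-spark-job-driver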

This is the dream: your data processing engine is now part of your cluster’s fabric. In the next post, we’ll see how to add the “brain” to this stack by deploying Apache Airflow.


Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0
