The Tools of the Modern Data Stack
We’ve talked about the architecture, but what about the actual tools? To build a modern data lakehouse on Kubernetes, you need a specific set of tools that can handle scale, automation, and speed.
In the second half of Chapter 4, Neylson Crepalde introduces the “Holy Trinity” of big data tools, plus a few other essentials.
The Heavy Lifter: Apache Spark
If you have terabytes of data to process, Apache Spark is your best friend. It’s a distributed computing engine that keeps intermediate results in memory rather than writing them to disk between steps, which makes it dramatically faster than older disk-based engines like Hadoop MapReduce.
On Kubernetes, Spark shines because it can spin up its “executors” as ephemeral pods: they are created for a job, process the data, and are torn down when the job finishes. This on-demand scaling is one of the main reasons people run Spark on Kubernetes.
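As a rough sketch of what that looks like in practice (the cluster endpoint, image name, namespace, and job path below are illustrative placeholders, not values from the book), a Spark job can be submitted directly against the Kubernetes API server:

```shell
# Submit a Spark job in cluster mode against the Kubernetes API server.
# Spark will create the driver and executor pods on demand and clean
# them up when the job completes. All names here are placeholders.
spark-submit \
  --master k8s://https://my-cluster.example.com:6443 \
  --deploy-mode cluster \
  --name daily-aggregation \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.container.image=my-registry/spark:3.5.0 \
  --conf spark.kubernetes.namespace=data-jobs \
  local:///opt/spark/jobs/daily_aggregation.py
```

The `k8s://` master URL is what tells Spark to schedule executors as pods instead of using a standalone or YARN cluster.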
The Nervous System: Apache Kafka
For real-time data, you need Apache Kafka. It’s a distributed streaming platform that acts like a highly resilient, append-only log.
Whether you’re ingesting logs from a website or transactions from a database (using Kafka Connect), Kafka ensures that your data is captured and made available to your streaming and batch processors without missing a beat.
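To make the “append-only log” idea concrete, here is a toy model of it in plain Python. This is an illustration of the concept only, not the Kafka client API: records are only ever appended, and each consumer tracks its own read position, so a stream processor and a batch loader can read the same data independently.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TopicLog:
    """Toy append-only log, loosely modeling a single Kafka topic partition."""
    records: List[str] = field(default_factory=list)
    offsets: Dict[str, int] = field(default_factory=dict)  # consumer -> next offset

    def produce(self, record: str) -> int:
        """Append a record; it is never modified or removed afterwards."""
        self.records.append(record)
        return len(self.records) - 1  # offset of the newly written record

    def consume(self, consumer: str, max_records: int = 10) -> List[str]:
        """Read from this consumer's own offset; other consumers are unaffected."""
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

log = TopicLog()
log.produce("page_view:/home")
log.produce("page_view:/pricing")

print(log.consume("stream-processor"))  # both records
print(log.consume("batch-loader"))      # independent offset: both records again
```

Because consumers only advance their own offsets, data that has been read is still there for the next consumer, which is exactly why Kafka can feed streaming and batch processors from the same log.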
The Conductor: Apache Airflow
A data pipeline isn’t just one script; it’s a complex series of tasks. You need to make sure Task B only starts after Task A succeeds. That’s where Apache Airflow comes in.
It uses “DAGs” (Directed Acyclic Graphs) written in Python to orchestrate your entire workflow. It can trigger Spark jobs, check for new files in S3, and send alerts if something goes wrong.
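As a minimal sketch of that idea (the DAG id, schedule, and commands below are made up for illustration, not taken from the book), an Airflow DAG expressing “run transform only after extract succeeds” looks like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative two-task pipeline: 'transform' runs only if 'extract' succeeds.
with DAG(
    dag_id="example_lakehouse_pipeline",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pulling raw files from S3'",
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'submitting Spark job'",
    )

    extract >> transform  # Task B starts only after Task A succeeds
```

The `>>` operator is how Airflow declares the dependency edge; the scheduler reads this file, builds the graph, and handles retries, alerting, and backfills around it.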
The Unified Query Engine: Trino
Once your data is in the lakehouse (Gold layer), how do people actually use it? You could move it back into a traditional database, but a better way is to use Trino (formerly PrestoSQL).
Trino allows you to run standard SQL queries directly against your data lake files (like Parquet or Delta Lake). It’s designed for high-performance interactive analytics and can query data across multiple sources simultaneously.
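For instance (the catalog, schema, and table names here are hypothetical, chosen only to illustrate the idea), a single Trino statement can join lakehouse tables with an operational database:

```sql
-- 'lakehouse' is assumed to be a catalog over Parquet/Delta files in the lake;
-- 'crm' is assumed to be a PostgreSQL catalog. Trino joins them in one query.
SELECT c.region,
       sum(o.amount) AS total_revenue
FROM lakehouse.gold.orders AS o
JOIN crm.public.customers AS c
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2024-01-01'
GROUP BY c.region
ORDER BY total_revenue DESC;
```

Each source is exposed as a catalog, so analysts write ordinary SQL and never need to know where the bytes actually live.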
Why Kubernetes makes these tools better
In the past, each of these tools required its own dedicated cluster. Managing them was a full-time job. By moving them to Kubernetes, you get:
- Resource efficiency: They can all share the same pool of hardware.
- Standardization: Every tool is just a set of containers and YAML files.
- Portability: You can run the exact same stack on your laptop (Kind) and in the cloud (EKS).
Over the next few posts, we’re going to dive deep into each of these tools individually. First up: mastering distributed processing with Apache Spark.
Next: Distributed Processing with Apache Spark - Part 1
Previous: The Evolution of Data Architecture
Book Details:
- Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
- Author: Neylson Crepalde
- ISBN: 978-1-83546-214-0