Big Data on Kubernetes

Master the art of building scalable, professional big data platforms using Kubernetes and open-source tools.

Building a modern data platform is a daunting task, often plagued by massive operational overhead and complex infrastructure. “Big Data on Kubernetes” by Neylson Crepalde provides a practical, hands-on roadmap to solving these challenges using the world’s most powerful container orchestration platform.

This book guides you through the entire lifecycle of data engineering, from the fundamentals of Docker and Kubernetes architecture to deploying a full “Holy Trinity” stack: Apache Spark for massive processing, Apache Airflow for complex orchestration, and Apache Kafka for real-time event streaming. You’ll learn how to move beyond traditional data warehouses to the flexible Data Lakehouse model, using Trino for high-performance SQL analytics and the ELK stack for real-time visualization.

Whether you’re a data engineer, DevOps professional, or cloud architect, this guide empowers you to build resilient, automated, and cost-effective solutions. It even looks toward the future, showing you how to integrate Generative AI workloads using Amazon Bedrock and RAG patterns. Turn your “data swamp” into a professional data factory with Kubernetes as your foundation.

Rethinking Data Infrastructure: Big Data on Kubernetes

We are living in a world where data is basically everywhere. From your phone to social media and every single online purchase, the amount of info we generate is staggering. But here’s the thing: just having data isn’t enough. You have to be able to process it, and that’s where things get complicated.

Why Containers Are a Must for Data Engineers

If you are working with data today, you can’t really ignore containers. They have become the standardized unit for how we develop, ship, and deploy software. But why do we care so much about them in the big data world?

Building Your Own Data Images

In my last post, we talked about why containers are the bedrock of modern data engineering. But honestly, just running other people’s images only gets you so far. The real magic happens when you start packaging your own custom code.

Decoding Kubernetes Architecture - Part 1

If you want to run big data workloads on Kubernetes, you have to understand how the system is actually put together. It’s not just “magic cloud stuff”—it’s a carefully coordinated cluster of machines.

Local Kubernetes With Kind

Reading about architecture is one thing, but actually seeing a cluster run is where it sticks. In the third chapter of Big Data on Kubernetes, Neylson Crepalde moves from theory to practice.
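A multi-node local cluster with Kind comes down to one small YAML file. Here is a minimal sketch—the cluster name and node counts are my own choices for illustration, not necessarily what the book uses:

```yaml
# cluster.yaml — a minimal multi-node Kind cluster
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
# Create it with: kind create cluster --name bigdata --config cluster.yaml
```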

Scaling to the Cloud With Amazon EKS

Testing things locally with Kind is great, but big data usually needs big iron. In this part of the hands-on journey, Neylson Crepalde shows us how to scale up to a managed cloud environment.

The Evolution of Data Architecture

We’ve all heard the terms “Data Warehouse” and “Data Lake,” but do you actually know why we keep switching between them? In Chapter 4 of Big Data on Kubernetes, Neylson Crepalde gives a masterclass on how data architecture has evolved to keep up with the modern world.

The Tools of the Modern Data Stack

We’ve talked about the architecture, but what about the actual tools? To build a modern data lakehouse on Kubernetes, you need a specific set of tools that can handle scale, automation, and speed.

Distributed Processing With Apache Spark - Part 1

If there is one tool that defined the “Big Data” era, it’s Apache Spark. It’s the engine that handles everything from terabyte-scale ETL to complex machine learning. In Chapter 5, Neylson Crepalde breaks down exactly how Spark works and why it’s so powerful on Kubernetes.
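The core idea behind Spark—split a dataset into partitions, process each partition in parallel, then combine the partial results—can be sketched in a few lines of plain Python. This is emphatically not Spark’s API, just a toy of the divide/map/reduce concept using only the standard library:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy illustration of Spark's core idea: split a dataset into
# partitions, process each partition in parallel, then reduce the
# partial results into one answer. (NOT Spark's API — just the concept.)

def partition(data, n):
    """Split `data` into n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(chunk):
    """A per-partition task: here, sum the squares of the records."""
    return sum(x * x for x in chunk)

def run(data, partitions=4):
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = pool.map(process_partition, partition(data, partitions))
    return sum(partials)  # the "reduce" step

print(run(range(10)))  # same as sum(x*x for x in range(10)) = 285
```

In real Spark the partitions live on different executors across the cluster, and the shuffle/reduce happens over the network—but the mental model is the same.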

Orchestrating Pipelines With Apache Airflow - Part 1

If Spark is the engine, then Apache Airflow is the conductor. In a modern data stack, you rarely have just one job running in isolation. You have ingestion, cleaning, processing, and delivery—and they all have to happen in a specific order.
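Airflow models that "specific order" as a directed acyclic graph (DAG) of tasks. The task names below are made up and this is not Airflow’s API—just a standard-library sketch of the ordering logic underneath:

```python
from graphlib import TopologicalSorter

# A made-up pipeline: ingestion must finish before cleaning,
# cleaning before processing, and processing before delivery.
# Airflow expresses the same dependencies with operators and >>;
# this stdlib sketch only shows the dependency resolution underneath.
dag = {
    "clean":   {"ingest"},   # clean depends on ingest
    "process": {"clean"},
    "deliver": {"process"},
    "report":  {"process"},  # two tasks can share a parent
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ingest always comes first; deliver and report come last
```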

Real-Time Streaming With Apache Kafka - Part 1

In the world of big data, “batch” is no longer enough. We need data the second it happens. Whether it’s tracking stock prices, monitoring website traffic, or detecting fraud, you need a system that can handle massive streams of events with zero downtime.

Real-Time Streaming With Apache Kafka - Part 2

Architecture is great, but let’s actually run some code. In the second half of Chapter 7, Neylson Crepalde walks us through setting up a multi-node Kafka cluster right on our local machine using Docker Compose.
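For a feel of what a multi-node Compose setup looks like, here is a rough sketch of one broker in a three-node KRaft cluster. The image, service names, and environment variables follow the bitnami/kafka container’s conventions and are my assumptions—the book’s actual compose file will differ:

```yaml
# docker-compose.yaml — rough sketch of one node in a 3-broker KRaft cluster.
# Settings follow the bitnami/kafka image convention (an assumption).
services:
  kafka-0:
    image: bitnami/kafka:latest
    environment:
      - KAFKA_CFG_NODE_ID=0
      - KAFKA_CFG_PROCESS_ROLES=controller,broker
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka-0:9093,1@kafka-1:9093,2@kafka-2:9093
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
      - KAFKA_KRAFT_CLUSTER_ID=REPLACE-WITH-A-CLUSTER-ID  # placeholder
  # kafka-1 and kafka-2 repeat the same block with NODE_ID 1 and 2
```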

Deploying the Big Data Stack on Kubernetes - Part 1

We’ve explored Spark, Airflow, and Kafka as individual tools. But the real goal of Neylson Crepalde’s book is to show you how to run them all as a cohesive “stack” on Kubernetes. In Chapter 8, we finally start the heavy lifting of deployment.

The Data Consumption Layer - Querying With Trino

You’ve built your ingestion, you’ve processed your data with Spark, and it’s all sitting neatly in your S3 “Gold” bucket. Now what? You can’t ask every business analyst to learn PySpark just to see last month’s sales.

Real-Time Visualization With Elasticsearch and Kibana

Trino is great for querying your historical data on S3, but for real-time streams and text-heavy search, you need something different. In the second half of Chapter 9, Neylson Crepalde introduces the industry standard for real-time analytics: Elasticsearch and Kibana.

Building an End-to-End Big Data Pipeline - Part 1

We have spent the last few weeks looking at individual tools like Spark, Airflow, and Kafka. But in the real world, these tools don’t live in isolation. They need to talk to each other to form a complete data pipeline.

Building an End-to-End Big Data Pipeline - Part 3

Batch processing is great for historical reports, but what if you need to know what’s happening right now? In the final part of Chapter 10, Neylson Crepalde shows us how to build a world-class Real-Time Pipeline on Kubernetes.

Action Models With Bedrock Agents

In the last post, we saw how to give an AI model a “memory” using RAG. But the real game-changer in the Generative AI world is when you let the model actually do things.

Beyond the Basics: The Kubernetes Ecosystem

We have built some incredible pipelines over the last few posts. But if you were to take what we’ve built and put it into production today, you’d quickly realize that there is a lot more to managing a platform than just getting the YAML files right.

Wrapping Up: Big Data on Kubernetes

We have reached the end of our deep dive into Big Data on Kubernetes by Neylson Crepalde. It has been a massive journey, moving from basic Docker containers to complex, real-time AI pipelines.

About

About BookGrill.net

BookGrill.net is a technology book review site for developers, engineers, and anyone who builds things with code. We cover books on software engineering, AI and machine learning, cybersecurity, systems design, and the culture of technology.
