Wrapping Up: Big Data on Kubernetes
We have reached the end of our deep dive into Big Data on Kubernetes by Neylson Crepalde. It has been a massive journey, moving from basic Docker containers to complex, real-time AI pipelines.
Batch processing is great for historical reports, but what if you need to know what’s happening right now? In the final part of Chapter 10, Neylson Crepalde shows us how to build a world-class Real-Time Pipeline on Kubernetes.
In our last post, we checked the infrastructure. Now, let’s build the actual pipeline. Neylson Crepalde uses the IMDB dataset to demonstrate a professional batch workflow.
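To make that concrete, here’s a minimal PySpark sketch of the kind of batch read the chapter is about. It assumes the public IMDB TSV dumps (tab-separated, with \N for nulls); the bucket path is a placeholder, not the book’s actual layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("imdb-batch").getOrCreate()

# The public IMDB dumps are tab-separated and use '\N' for nulls;
# the bucket path is a placeholder.
titles = (
    spark.read
    .option("sep", "\t")
    .option("header", True)
    .option("nullValue", "\\N")
    .csv("s3a://landing-zone/title.basics.tsv.gz")
)

# A typical batch question: how many movies were released per year?
titles.filter(titles.titleType == "movie") \
      .groupBy("startYear").count() \
      .orderBy("startYear").show()
```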
We have spent the last few weeks looking at individual tools like Spark, Airflow, and Kafka. But in the real world, these tools don’t live in isolation. They need to talk to each other to form a complete data pipeline.
Trino is great for querying your historical data on S3, but for real-time streams and text-heavy search, you need something different. In the second half of Chapter 9, Neylson Crepalde introduces the industry standard for real-time analytics: Elasticsearch and Kibana.
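To give you a feel for the developer experience, here’s a tiny sketch with the official Python client. The endpoint and index name are my own placeholders, not the book’s:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Endpoint and index name are assumptions for illustration.
es = Elasticsearch("http://localhost:9200")

# Index a document, then make it searchable right away.
es.index(index="events", document={"user": "alice", "action": "login"})
es.indices.refresh(index="events")

# Full-text search is the whole point: match, don't just filter.
resp = es.search(index="events", query={"match": {"action": "login"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```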
You’ve built your ingestion, you’ve processed your data with Spark, and it’s all sitting neatly in your S3 “Gold” bucket. Now what? You can’t ask every business analyst to learn PySpark just to see last month’s sales.
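The book’s answer (named in the entry above) is Trino, a distributed SQL engine that queries S3 directly. Here’s a minimal sketch with the trino Python client; the host, catalog, and schema are placeholders:

```python
import trino  # pip install trino

# Host, catalog, and schema are assumptions for illustration.
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="gold",
)
cur = conn.cursor()

# Plain SQL over files sitting in S3 -- no PySpark required.
cur.execute("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)
```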
In my last post, we got Spark running natively on Kubernetes. Now, it’s time to bring in the conductor (Airflow) and the nervous system (Kafka). This is where your cluster starts to feel like a real data platform.
We’ve explored Spark, Airflow, and Kafka as individual tools. But the real goal of Neylson Crepalde’s book is to show you how to run them all as a cohesive “stack” on Kubernetes. In Chapter 8, we finally start the heavy lifting of deployment.
Architecture is great, but let’s actually run some code. In the second half of Chapter 7, Neylson Crepalde walks us through setting up a multi-node Kafka cluster right on our local machine using Docker Compose.
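Once the Compose cluster is up, you can talk to it straight from the host. A quick sketch with the kafka-python client; the advertised ports are assumptions that depend on how your Compose file is written:

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Ports depend on what your Compose file advertises; these are assumptions.
producer = KafkaProducer(bootstrap_servers=["localhost:9092", "localhost:9093"])
producer.send("test-topic", b"hello from the host machine")
producer.flush()

consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 idle seconds
)
for message in consumer:
    print(message.value.decode())
```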
In the world of big data, “batch” is no longer enough. We need data the second it happens. Whether it’s tracking stock prices, monitoring website traffic, or detecting fraud, you need a system that can handle massive streams of events with zero downtime.
In the last post, we got Airflow running. Now, let’s talk about how to actually use it. The heart of Airflow is the DAG—the Directed Acyclic Graph.
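A DAG is just Python code. Here’s a minimal two-task sketch (Airflow 2.x style; the tasks are placeholders, not the book’s example):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The graph: extract -> load. Airflow guarantees the order
# and refuses cycles, hence "acyclic".
with DAG(
    dag_id="hello_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # the >> operator draws an edge in the graph
```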
If Spark is the engine, then Apache Airflow is the conductor. In a modern data stack, you rarely have just one job running in isolation. You have ingestion, cleaning, processing, and delivery—and they all have to happen in a specific order.
In the last post, we looked at Spark’s architecture. Now, let’s talk about how you actually write code for it. Neylson Crepalde highlights two main ways to interact with Spark: the DataFrame API and Spark SQL.
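To see the two approaches side by side, here’s a small sketch. The CSV file and its columns are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The file and its columns (genre, year) are assumptions.
df = spark.read.csv("movies.csv", header=True, inferSchema=True)

# 1. The DataFrame API: method chaining.
df.filter(F.col("year") > 2000).groupBy("genre").count().show()

# 2. Spark SQL: register a view, write plain SQL. Same engine, same plan.
df.createOrReplaceTempView("movies")
spark.sql(
    "SELECT genre, COUNT(*) AS n FROM movies WHERE year > 2000 GROUP BY genre"
).show()
```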
If there is one tool that defined the “Big Data” era, it’s Apache Spark. It’s the engine that handles everything from terabyte-scale ETL to complex machine learning. In Chapter 5, Neylson Crepalde breaks down exactly how Spark works and why it’s so powerful on Kubernetes.
We’ve talked about the architecture, but what about the actual tools? To build a modern data lakehouse on Kubernetes, you need a specific set of tools that can handle scale, automation, and speed.
We’ve all heard the terms “Data Warehouse” and “Data Lake,” but do you actually know why we keep switching between them? In Chapter 4 of Big Data on Kubernetes, Neylson Crepalde gives a masterclass on how data architecture has evolved to keep up with the modern world.
Testing things locally with Kind is great, but big data usually needs big iron. In this part of the hands-on journey, Neylson Crepalde shows us how to scale up to a managed cloud environment.
Reading about architecture is one thing, but actually seeing a cluster run is where it sticks. In the third chapter of Big Data on Kubernetes, Neylson Crepalde moves from theory to practice.
In the last post, we talked about the “brain and muscles” of a Kubernetes cluster. But how do we actually tell that brain what to do? We use Objects.
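An Object is a declarative record of desired state. The book naturally works with YAML manifests and kubectl; purely to keep the examples in one language, here’s the same idea through the official Kubernetes Python client:

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # e.g. pointing at your local Kind cluster

# An Object is metadata plus a spec describing desired state.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="hello", labels={"app": "demo"}),
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="main", image="nginx:alpine")]
    ),
)

# Submitting it hands the desired state to the control plane,
# which then works to make reality match.
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```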
If you want to run big data workloads on Kubernetes, you have to understand how the system is actually put together. It’s not just “magic cloud stuff”—it’s a carefully coordinated cluster of machines.
In my last post, we talked about why containers are the bedrock of modern data engineering. But honestly, just running other people’s images only gets you so far. The real magic happens when you start packaging your own custom code.
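The standard route is a Dockerfile plus docker build. Just as a Python-flavored sketch of that same flow, here’s the official Docker SDK; the tag and build context are placeholders:

```python
import docker  # pip install docker

client = docker.from_env()

# Builds the Dockerfile in the current directory; the tag is a placeholder.
image, build_logs = client.images.build(path=".", tag="my-etl:0.1")
for chunk in build_logs:
    if "stream" in chunk:
        print(chunk["stream"], end="")

print("built", image.tags)
```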
If you are working with data today, you can’t really ignore containers. They have become the standardized unit for how we develop, ship, and deploy software. But why do we care so much about them in the big data world?
We are living in a world where data is basically everywhere. From your phone to social media and every single online purchase, the amount of info we generate is staggering. But here’s the thing: just having data isn’t enough. You have to be able to process it, and that’s where things get complicated.
We’ve covered a lot of ground in this series, from the basic blocks of HDFS to the real-time speeds of Flink and the limitless scale of the AWS cloud. After spending a lot of time with Sridhar Alla’s Big Data Analytics with Hadoop 3, I have a few final thoughts to share.
In the last post, we looked at the basic building blocks of AWS: EC2 and S3. But if you’re trying to run a massive Hadoop or Spark cluster, you don’t really want to be manually installing software on hundreds of individual EC2 instances. That’s where Amazon EMR (Elastic MapReduce) comes in.
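With EMR, a whole cluster, its instance counts, its applications, and its lifecycle become the parameters of a single API call. A hedged boto3 sketch, where the names, instance types, and log bucket are all placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Cluster name, instance types, and the log bucket are placeholders.
response = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Tear the cluster down when the work is done: pay per job, not per day.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-log-bucket/emr/",
)
print("started cluster:", response["JobFlowId"])
```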
We’ve talked about the “what” and the “why” of the cloud. Now it’s time for the “how.” Chapter 12 of Sridhar Alla’s book is a deep look at Amazon Web Services (AWS), which is essentially the playground where most big data pros spend their time.
In the last post, we looked at the basic models of the cloud (IaaS, PaaS, and SaaS). Today, we’re talking about the “where” and the “who.” When you decide to move your big data to the cloud, you have to choose a deployment model and a provider.
We’ve spent this entire series talking about how to set up and run your own Hadoop cluster. But let’s be real: managing hardware is a pain. You have to buy servers, set up networking, worry about power outages, and pray that your hard drives don’t fail.
You’ve done the hard work. You’ve set up a Hadoop cluster, written MapReduce jobs, and built real-time pipelines in Spark and Flink. You have “insights.” But here’s the problem: nobody wants to look at a raw HDFS file or a console log.
In the last post, we looked at Flink’s DataStream API. Today, we’re tackling the big questions: How does Flink handle the messy reality of the real world? How does it talk to other systems? And how does it deal with data that shows up late?
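Flink’s main tool for late data is the watermark: a moving claim that no event older than some timestamp will still arrive. The book’s code is Java/Scala; here’s the same idea sketched in modern PyFlink, with an assumed (name, epoch-millis) event layout:

```python
from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment

# Events are (name, epoch_millis) pairs; the layout is an assumption.
class EventTimeAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value[1]

env = StreamExecutionEnvironment.get_execution_environment()
events = env.from_collection([("click", 1_700_000_000_000),
                              ("view", 1_700_000_002_000)])

# Tolerate events arriving up to 5 seconds out of order; anything
# behind the watermark counts as "late" and is handled separately.
with_time = events.assign_timestamps_and_watermarks(
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(5))
    .with_timestamp_assigner(EventTimeAssigner())
)
```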
We’ve talked about how Spark handles streaming using micro-batches. It’s a great approach, but some people argue it’s not “true” streaming. If you need the absolute lowest latency possible, you want Apache Flink.
In the last post, we got Flink up and running. Now, let’s actually do something useful with it. Chapter 8 of Sridhar Alla’s book focuses on the DataSet API, which is what you’ll use for all your batch processing needs.
We’ve spent a lot of time on Spark, and for good reason - it’s amazing. But if you’re serious about big data, you need to know about Apache Flink. In Chapter 8, Sridhar Alla introduces us to the technology that many experts consider the “true” successor to MapReduce for real-time processing.
In the last post, we looked at DStreams, the original way to do streaming in Spark. But things move fast in the tech world. Spark 2.0 introduced Structured Streaming, a new way to handle real-time data that makes things even simpler and more reliable.
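The core idea: treat the stream as an unbounded table and write an ordinary DataFrame query against it. A minimal sketch using the built-in socket source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-wordcount").getOrCreate()

# Treat a live socket as an unbounded table (feed it with `nc -lk 9999`).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# An ordinary DataFrame query; Spark keeps the counts updated as rows arrive.
counts = lines.groupBy("value").count()

(counts.writeStream
 .outputMode("complete")
 .format("console")
 .start()
 .awaitTermination())
```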
Up until now, we’ve mostly talked about batch processing - looking at data that’s already sitting in HDFS. But what if you need to know what’s happening right now? What if you’re tracking a stock price, monitoring a server for hacks, or following a trending hashtag on Twitter? That’s where Spark Streaming comes in.
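A minimal DStream sketch: word counts over one-second micro-batches read from a socket (feed it with `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=1)  # one-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts to the console

ssc.start()
ssc.awaitTermination()
```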
In the last post, we looked at why Spark is so fast. Today, we’re getting into the nitty-gritty of how to actually use it. If you’re a SQL fan, you’re going to love this. Chapter 6 of Sridhar Alla’s book spends a lot of time on Spark SQL, and for good reason - it’s where most of the work happens.
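The workflow is simple: register your DataFrames as views, then write plain SQL, joins and all. A sketch where the paths and columns are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Paths and columns are assumptions for illustration.
users = spark.read.parquet("hdfs:///data/users")
orders = spark.read.parquet("hdfs:///data/orders")
users.createOrReplaceTempView("users")
orders.createOrReplaceTempView("orders")

# A distributed join plus an aggregation, in plain SQL.
spark.sql("""
    SELECT u.country, SUM(o.amount) AS revenue
    FROM orders o
    JOIN users u ON o.user_id = u.id
    GROUP BY u.country
    ORDER BY revenue DESC
""").show()
```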
If you’ve been following this series, you know we’ve spent a lot of time on MapReduce. It’s the foundation of Hadoop, but let’s be honest: it can be slow and painful to write. That’s why Chapter 6 of Sridhar Alla’s book is such a breath of fresh air. It introduces Apache Spark, the technology that has effectively dethroned MapReduce for most big data tasks.
If Python is the general-purpose king of data science, R is the specialized wizard of statistics. While Python is great for building pipelines and apps, R was built by statisticians, for statisticians. In Chapter 5, Sridhar Alla shows us how to bring that statistical power to the massive datasets sitting in Hadoop.
Java and MapReduce are great for the heavy lifting, but when it comes to actually exploring data and building models, Python is where it’s at. Chapter 4 of Sridhar Alla’s book shifts the focus to how we can use Python’s massive ecosystem to analyze big data.
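One common bridge is pulling HDFS data straight into pandas over WebHDFS. Whether the book uses this exact package is my assumption; the endpoint and path are placeholders:

```python
import pandas as pd
from hdfs import InsecureClient  # pip install hdfs

# The WebHDFS endpoint and file path are placeholders.
client = InsecureClient("http://namenode:9870", user="hadoop")

with client.read("/data/sales.csv") as reader:
    df = pd.read_csv(reader)

# From here it's ordinary Python data analysis.
print(df.describe())
```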
In the last post, we looked at the basics of MapReduce. But in the real world, your data is rarely in one single file. You usually have a few different datasets that you need to combine. This is where things get a little more complex - and a lot more interesting.
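The classic pattern is the reduce-side join: mappers tag each record with its source table, the shuffle groups both sides under the join key, and the reducer stitches them together. A Hadoop-Streaming-style sketch in Python (the book’s own code is Java, and the field layout here is assumed):

```python
#!/usr/bin/env python3
# reducer.py -- reduce-side join, Hadoop Streaming style.
# The (assumed) mapper contract: for users.csv emit "id\tU\tname",
# for orders.csv emit "id\tO\tamount". Hadoop sorts by key, so each
# user arrives grouped together with all of that user's orders.
import sys
from itertools import groupby

rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
for key, group in groupby(rows, key=lambda r: r[0]):
    name, amounts = None, []
    for _, tag, payload in group:
        if tag == "U":
            name = payload
        else:
            amounts.append(payload)
    for amount in amounts:
        print(f"{key}\t{name}\t{amount}")
```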
We’ve talked about Hive, but today we’re going under the hood. MapReduce is the engine that actually does the heavy lifting in Hadoop. Sridhar Alla’s third chapter is a deep look at how this framework takes a massive pile of data and turns it into something useful.
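The canonical illustration is word count. The book walks through it in Java; here’s the equivalent pair of Hadoop Streaming scripts in Python:

```python
#!/usr/bin/env python3
# mapper.py: emit one (word, 1) pair per word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

Hadoop sorts the mapper output by key before it reaches the reducer, so counting takes a single pass:

```python
#!/usr/bin/env python3
# reducer.py: input arrives sorted by word, so one pass suffices.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```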
If you’ve ever tried to write a MapReduce job just to count the number of lines in a file, you know it’s a lot of work. You have to write a Mapper, a Reducer, a Driver… it’s a whole thing.
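With Hive, that whole Mapper/Reducer/Driver ceremony collapses into a line of SQL that Hive compiles to MapReduce for you. A sketch with the PyHive client; the host and table are placeholders:

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Host and table are assumptions; 10000 is HiveServer2's default port.
conn = hive.connect(host="localhost", port=10000, username="hadoop")
cur = conn.cursor()

# The line-count job from above, as one statement.
cur.execute("SELECT COUNT(*) FROM logs")
print(cur.fetchone()[0])
```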
Now that we’ve got a cluster running, let’s talk about why we bother with all this complexity in the first place. Chapter 2 of Sridhar Alla’s book takes a step back to look at the big picture of data analytics.
In the last post, we talked about all the cool new features in Hadoop 3. Now, let’s actually build something. Sridhar Alla’s book gives a solid walkthrough on setting up a single-node cluster. If you’re on Linux, this is pretty straightforward.
Hadoop has been around for a while, but version 3 is where things get really interesting. If you’ve worked with Hadoop 1 or 2, you know it was solid but had some pain points. Sridhar Alla’s book kicks off by looking straight at what’s changed.
So, you’ve heard about big data. It’s everywhere. But how do you actually handle it? If you’re looking for the OG of big data platforms, you’re looking at Hadoop. And honestly, it’s still the foundation for almost everything we do in data today.