Wrapping Up: Big Data on Kubernetes
We have reached the end of our deep dive into Big Data on Kubernetes by Neylson Crepalde. It has been a massive journey, moving from basic Docker containers to complex, real-time AI pipelines.
Batch processing is great for historical reports, but what if you need to know what’s happening right now? In the final part of Chapter 10, Neylson Crepalde shows us how to build a world-class Real-Time Pipeline on Kubernetes.
In our last post, we checked the infrastructure. Now, let’s build the actual pipeline. Neylson Crepalde uses the IMDB dataset to demonstrate a professional batch workflow.
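To make that concrete, here’s a minimal PySpark sketch of the kind of batch read the chapter is about. It assumes the public IMDB TSV dumps (tab-separated, with \N for nulls); the bucket path is a placeholder, not the book’s actual layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("imdb-batch").getOrCreate()

# The public IMDB dumps are tab-separated and use '\N' for nulls;
# the bucket path is a placeholder.
titles = (
    spark.read
    .option("sep", "\t")
    .option("header", True)
    .option("nullValue", "\\N")
    .csv("s3a://landing-zone/title.basics.tsv.gz")
)

# A typical batch question: how many movies were released per year?
titles.filter(titles.titleType == "movie") \
      .groupBy("startYear").count() \
      .orderBy("startYear").show()
```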
We have spent the last few weeks looking at individual tools like Spark, Airflow, and Kafka. But in the real world, these tools don’t live in isolation. They need to talk to each other to form a complete data pipeline.
Trino is great for querying your historical data on S3, but for real-time streams and text-heavy search, you need something different. In the second half of Chapter 9, Neylson Crepalde introduces the industry standard for real-time analytics: Elasticsearch and Kibana.
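To give you a feel for the developer experience, here’s a tiny sketch with the official Python client. The endpoint and index name are my own placeholders, not the book’s:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

# Endpoint and index name are assumptions for illustration.
es = Elasticsearch("http://localhost:9200")

# Index a document, then make it searchable right away.
es.index(index="events", document={"user": "alice", "action": "login"})
es.indices.refresh(index="events")

# Full-text search is the whole point: match, don't just filter.
resp = es.search(index="events", query={"match": {"action": "login"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```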
You’ve built your ingestion, you’ve processed your data with Spark, and it’s all sitting neatly in your S3 “Gold” bucket. Now what? You can’t ask every business analyst to learn PySpark just to see last month’s sales.
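The book’s answer (named in the entry above) is Trino, a distributed SQL engine that queries S3 directly. Here’s a minimal sketch with the trino Python client; the host, catalog, and schema are placeholders:

```python
import trino  # pip install trino

# Host, catalog, and schema are assumptions for illustration.
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="gold",
)
cur = conn.cursor()

# Plain SQL over files sitting in S3 -- no PySpark required.
cur.execute("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)
```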
In my last post, we got Spark running natively on Kubernetes. Now, it’s time to bring in the conductor (Airflow) and the nervous system (Kafka). This is where your cluster starts to feel like a real data platform.
We’ve explored Spark, Airflow, and Kafka as individual tools. But the real goal of Neylson Crepalde’s book is to show you how to run them all as a cohesive “stack” on Kubernetes. In Chapter 8, we finally start the heavy lifting of deployment.
Architecture is great, but let’s actually run some code. In the second half of Chapter 7, Neylson Crepalde walks us through setting up a multi-node Kafka cluster right on our local machine using Docker Compose.
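Once the Compose cluster is up, you can talk to it straight from the host. A quick sketch with the kafka-python client; the advertised ports are assumptions that depend on how your Compose file is written:

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Ports depend on what your Compose file advertises; these are assumptions.
producer = KafkaProducer(bootstrap_servers=["localhost:9092", "localhost:9093"])
producer.send("test-topic", b"hello from the host machine")
producer.flush()

consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 idle seconds
)
for message in consumer:
    print(message.value.decode())
```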
In the world of big data, “batch” is no longer enough. We need data the second it happens. Whether it’s tracking stock prices, monitoring website traffic, or detecting fraud, you need a system that can handle massive streams of events with zero downtime.
In the last post, we got Airflow running. Now, let’s talk about how to actually use it. The heart of Airflow is the DAG—the Directed Acyclic Graph.
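A DAG is just Python code. Here’s a minimal two-task sketch (Airflow 2.x style; the tasks are placeholders, not the book’s example):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The graph: extract -> load. Airflow guarantees the order
# and refuses cycles, hence "acyclic".
with DAG(
    dag_id="hello_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # the >> operator draws an edge in the graph
```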
If Spark is the engine, then Apache Airflow is the conductor. In a modern data stack, you rarely have just one job running in isolation. You have ingestion, cleaning, processing, and delivery—and they all have to happen in a specific order.
In the last post, we looked at Spark’s architecture. Now, let’s talk about how you actually write code for it. Neylson Crepalde highlights two main ways to interact with Spark: the DataFrame API and Spark SQL.
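To see the two approaches side by side, here’s a small sketch. The CSV file and its columns are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The file and its columns (genre, year) are assumptions.
df = spark.read.csv("movies.csv", header=True, inferSchema=True)

# 1. The DataFrame API: method chaining.
df.filter(F.col("year") > 2000).groupBy("genre").count().show()

# 2. Spark SQL: register a view, write plain SQL. Same engine, same plan.
df.createOrReplaceTempView("movies")
spark.sql(
    "SELECT genre, COUNT(*) AS n FROM movies WHERE year > 2000 GROUP BY genre"
).show()
```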
If there is one tool that defined the “Big Data” era, it’s Apache Spark. It’s the engine that handles everything from terabyte-scale ETL to complex machine learning. In Chapter 5, Neylson Crepalde breaks down exactly how Spark works and why it’s so powerful on Kubernetes.
We’ve talked about the architecture, but what about the actual tools? To build a modern data lakehouse on Kubernetes, you need a specific set of tools that can handle scale, automation, and speed.
We’ve all heard the terms “Data Warehouse” and “Data Lake,” but do you actually know why we keep switching between them? In Chapter 4 of Big Data on Kubernetes, Neylson Crepalde gives a masterclass on how data architecture has evolved to keep up with the modern world.
Testing things locally with Kind is great, but big data usually needs big iron. In this part of the hands-on journey, Neylson Crepalde shows us how to scale up to a managed cloud environment.
Reading about architecture is one thing, but actually seeing a cluster run is where it sticks. In the third chapter of Big Data on Kubernetes, Neylson Crepalde moves from theory to practice.
In the last post, we talked about the “brain and muscles” of a Kubernetes cluster. But how do we actually tell that brain what to do? We use Objects.
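An Object is a declarative record of desired state. The book naturally works with YAML manifests and kubectl; purely to keep the examples in one language, here’s the same idea through the official Kubernetes Python client:

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # e.g. pointing at your local Kind cluster

# An Object is metadata plus a spec describing desired state.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="hello", labels={"app": "demo"}),
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="main", image="nginx:alpine")]
    ),
)

# Submitting it hands the desired state to the control plane,
# which then works to make reality match.
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```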
If you want to run big data workloads on Kubernetes, you have to understand how the system is actually put together. It’s not just “magic cloud stuff”—it’s a carefully coordinated cluster of machines.
In my last post, we talked about why containers are the bedrock of modern data engineering. But honestly, just running other people’s images only gets you so far. The real magic happens when you start packaging your own custom code.
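The standard route is a Dockerfile plus docker build. Just as a Python-flavored sketch of that same flow, here’s the official Docker SDK; the tag and build context are placeholders:

```python
import docker  # pip install docker

client = docker.from_env()

# Builds the Dockerfile in the current directory; the tag is a placeholder.
image, build_logs = client.images.build(path=".", tag="my-etl:0.1")
for chunk in build_logs:
    if "stream" in chunk:
        print(chunk["stream"], end="")

print("built", image.tags)
```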
If you are working with data today, you can’t really ignore containers. They have become the standardized unit for how we develop, ship, and deploy software. But why do we care so much about them in the big data world?
We are living in a world where data is basically everywhere. From your phone to social media and every single online purchase, the amount of info we generate is staggering. But here’s the thing: just having data isn’t enough. You have to be able to process it, and that’s where things get complicated.
We’ve covered a lot of ground in this series, from the basic blocks of HDFS to the real-time speeds of Flink and the limitless scale of the AWS cloud. After spending a lot of time with Sridhar Alla’s Big Data Analytics with Hadoop 3, I have a few final thoughts to share.
In the last post, we looked at the basic building blocks of AWS: EC2 and S3. But if you’re trying to run a massive Hadoop or Spark cluster, you don’t really want to be manually installing software on hundreds of individual EC2 instances. That’s where Amazon EMR (Elastic MapReduce) comes in.
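With EMR, a whole cluster, its instance counts, its applications, and its lifecycle become the parameters of a single API call. A hedged boto3 sketch, where the names, instance types, and log bucket are all placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Cluster name, instance types, and the log bucket are placeholders.
response = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Tear the cluster down when the work is done: pay per job, not per day.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-log-bucket/emr/",
)
print("started cluster:", response["JobFlowId"])
```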
We’ve talked about the “what” and the “why” of the cloud. Now it’s time for the “how.” Chapter 12 of Sridhar Alla’s book is a deep look at Amazon Web Services (AWS), which is essentially the playground where most big data pros spend their time.
In the last post, we looked at the basic models of the cloud (IaaS, PaaS, and SaaS). Today, we’re talking about the “where” and the “who.” When you decide to move your big data to the cloud, you have to choose a deployment model and a provider.
We’ve spent this entire series talking about how to set up and run your own Hadoop cluster. But let’s be real: managing hardware is a pain. You have to buy servers, set up networking, worry about power outages, and pray that your hard drives don’t fail.
You’ve done the hard work. You’ve set up a Hadoop cluster, written MapReduce jobs, and built real-time pipelines in Spark and Flink. You have “insights.” But here’s the problem: nobody wants to look at a raw HDFS file or a console log.
In the last post, we looked at Flink’s DataStream API. Today, we’re tackling the big questions: How does Flink handle the messy reality of the real world? How does it talk to other systems? And how does it deal with data that shows up late?
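Flink’s main tool for late data is the watermark: a moving claim that no event older than some timestamp will still arrive. The book’s code is Java/Scala; here’s the same idea sketched in modern PyFlink, with an assumed (name, epoch-millis) event layout:

```python
from pyflink.common import Duration, WatermarkStrategy
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment

# Events are (name, epoch_millis) pairs; the layout is an assumption.
class EventTimeAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value[1]

env = StreamExecutionEnvironment.get_execution_environment()
events = env.from_collection([("click", 1_700_000_000_000),
                              ("view", 1_700_000_002_000)])

# Tolerate events arriving up to 5 seconds out of order; anything
# behind the watermark counts as "late" and is handled separately.
with_time = events.assign_timestamps_and_watermarks(
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(5))
    .with_timestamp_assigner(EventTimeAssigner())
)
```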
We’ve talked about how Spark handles streaming using micro-batches. It’s a great approach, but some people argue it’s not “true” streaming. If you need the absolute lowest latency possible, you want Apache Flink.
In the last post, we got Flink up and running. Now, let’s actually do something useful with it. Chapter 8 of Sridhar Alla’s book focuses on the DataSet API, which is what you’ll use for all your batch processing needs.
We’ve spent a lot of time on Spark, and for good reason - it’s amazing. But if you’re serious about big data, you need to know about Apache Flink. In Chapter 8, Sridhar Alla introduces us to the technology that many experts consider the “true” successor to MapReduce for real-time processing.
In the last post, we looked at DStreams, the original way to do streaming in Spark. But things move fast in the tech world. Spark 2.0 introduced Structured Streaming, a new way to handle real-time data that makes things even simpler and more reliable.
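The core idea: treat the stream as an unbounded table and write an ordinary DataFrame query against it. A minimal sketch using the built-in socket source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-wordcount").getOrCreate()

# Treat a live socket as an unbounded table (feed it with `nc -lk 9999`).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# An ordinary DataFrame query; Spark keeps the counts updated as rows arrive.
counts = lines.groupBy("value").count()

(counts.writeStream
 .outputMode("complete")
 .format("console")
 .start()
 .awaitTermination())
```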
Up until now, we’ve mostly talked about batch processing - looking at data that’s already sitting in HDFS. But what if you need to know what’s happening right now? What if you’re tracking a stock price, monitoring a server for hacks, or following a trending hashtag on Twitter? That’s where Spark Streaming comes in.
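A minimal DStream sketch: word counts over one-second micro-batches read from a socket (feed it with `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=1)  # one-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's counts to the console

ssc.start()
ssc.awaitTermination()
```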
In the last post, we looked at why Spark is so fast. Today, we’re getting into the nitty-gritty of how to actually use it. If you’re a SQL fan, you’re going to love this. Chapter 6 of Sridhar Alla’s book spends a lot of time on Spark SQL, and for good reason - it’s where most of the work happens.
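The workflow is simple: register your DataFrames as views, then write plain SQL, joins and all. A sketch where the paths and columns are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Paths and columns are assumptions for illustration.
users = spark.read.parquet("hdfs:///data/users")
orders = spark.read.parquet("hdfs:///data/orders")
users.createOrReplaceTempView("users")
orders.createOrReplaceTempView("orders")

# A distributed join plus an aggregation, in plain SQL.
spark.sql("""
    SELECT u.country, SUM(o.amount) AS revenue
    FROM orders o
    JOIN users u ON o.user_id = u.id
    GROUP BY u.country
    ORDER BY revenue DESC
""").show()
```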
If you’ve been following this series, you know we’ve spent a lot of time on MapReduce. It’s the foundation of Hadoop, but let’s be honest: it can be slow and painful to write. That’s why Chapter 6 of Sridhar Alla’s book is such a breath of fresh air. It introduces Apache Spark, the technology that has effectively dethroned MapReduce for most big data tasks.
If Python is the general-purpose king of data science, R is the specialized wizard of statistics. While Python is great for building pipelines and apps, R was built by statisticians, for statisticians. In Chapter 5, Sridhar Alla shows us how to bring that statistical power to the massive datasets sitting in Hadoop.
Java and MapReduce are great for the heavy lifting, but when it comes to actually exploring data and building models, Python is where it’s at. Chapter 4 of Sridhar Alla’s book shifts the focus to how we can use Python’s massive ecosystem to analyze big data.
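One common bridge is pulling HDFS data straight into pandas over WebHDFS. Whether the book uses this exact package is my assumption; the endpoint and path are placeholders:

```python
import pandas as pd
from hdfs import InsecureClient  # pip install hdfs

# The WebHDFS endpoint and file path are placeholders.
client = InsecureClient("http://namenode:9870", user="hadoop")

with client.read("/data/sales.csv") as reader:
    df = pd.read_csv(reader)

# From here it's ordinary Python data analysis.
print(df.describe())
```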
In the last post, we looked at the basics of MapReduce. But in the real world, your data is rarely in one single file. You usually have a few different datasets that you need to combine. This is where things get a little more complex - and a lot more interesting.
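The classic pattern is the reduce-side join: mappers tag each record with its source table, the shuffle groups both sides under the join key, and the reducer stitches them together. A Hadoop-Streaming-style sketch in Python (the book’s own code is Java, and the field layout here is assumed):

```python
#!/usr/bin/env python3
# reducer.py -- reduce-side join, Hadoop Streaming style.
# The (assumed) mapper contract: for users.csv emit "id\tU\tname",
# for orders.csv emit "id\tO\tamount". Hadoop sorts by key, so each
# user arrives grouped together with all of that user's orders.
import sys
from itertools import groupby

rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
for key, group in groupby(rows, key=lambda r: r[0]):
    name, amounts = None, []
    for _, tag, payload in group:
        if tag == "U":
            name = payload
        else:
            amounts.append(payload)
    for amount in amounts:
        print(f"{key}\t{name}\t{amount}")
```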
We’ve talked about Hive, but today we’re going under the hood. MapReduce is the engine that actually does the heavy lifting in Hadoop. Sridhar Alla’s third chapter is a deep look at how this framework takes a massive pile of data and turns it into something useful.
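The canonical illustration is word count. The book walks through it in Java; here’s the equivalent pair of Hadoop Streaming scripts in Python:

```python
#!/usr/bin/env python3
# mapper.py: emit one (word, 1) pair per word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

Hadoop sorts the mapper output by key before it reaches the reducer, so counting takes a single pass:

```python
#!/usr/bin/env python3
# reducer.py: input arrives sorted by word, so one pass suffices.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```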
If you’ve ever tried to write a MapReduce job just to count the number of lines in a file, you know it’s a lot of work. You have to write a Mapper, a Reducer, a Driver… it’s a whole thing.
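With Hive, that whole Mapper/Reducer/Driver ceremony collapses into a line of SQL that Hive compiles to MapReduce for you. A sketch with the PyHive client; the host and table are placeholders:

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Host and table are assumptions; 10000 is HiveServer2's default port.
conn = hive.connect(host="localhost", port=10000, username="hadoop")
cur = conn.cursor()

# The line-count job from above, as one statement.
cur.execute("SELECT COUNT(*) FROM logs")
print(cur.fetchone()[0])
```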
Now that we’ve got a cluster running, let’s talk about why we bother with all this complexity in the first place. Chapter 2 of Sridhar Alla’s book takes a step back to look at the big picture of data analytics.
In the last post, we talked about all the cool new features in Hadoop 3. Now, let’s actually build something. Sridhar Alla’s book gives a solid walkthrough on setting up a single-node cluster. If you’re on Linux, this is pretty straightforward.
Hadoop has been around for a while, but version 3 is where things get really interesting. If you’ve worked with Hadoop 1 or 2, you know it was solid but had some pain points. Sridhar Alla’s book kicks off by looking straight at what’s changed.
So, you’ve heard about big data. It’s everywhere. But how do you actually handle it? If you’re looking for the OG of big data platforms, you’re looking at Hadoop. And honestly, it’s still the foundation for almost everything we do in data today.