Building an End-to-End Big Data Pipeline - Part 2
In our last post, we covered the infrastructure. Now, let’s build the actual pipeline. Neylson Crepalde uses the IMDb dataset to demonstrate a professional batch workflow.
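To make the batch workflow concrete, here is a minimal sketch of the kind of job this part builds toward. The input and output paths are placeholders rather than the book's actual locations; the snippet simply reads an IMDb TSV extract with PySpark, filters it, and writes the result as Parquet.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical paths -- substitute your own staging and output locations.
INPUT_PATH = "s3a://my-datalake/raw/imdb/title.basics.tsv.gz"
OUTPUT_PATH = "s3a://my-datalake/curated/imdb/movies_2000s"

spark = SparkSession.builder.appName("imdb-batch-demo").getOrCreate()

# IMDb dumps are tab-separated with a header row and use "\N" for nulls.
titles = (
    spark.read
    .option("sep", "\t")
    .option("header", True)
    .option("nullValue", "\\N")
    .csv(INPUT_PATH)
)

# Keep feature films released from 2000 onward.
movies = (
    titles
    .where(F.col("titleType") == "movie")
    .where(F.col("startYear").cast("int") >= 2000)
    .select("tconst", "primaryTitle", "startYear", "genres")
)

movies.write.mode("overwrite").parquet(OUTPUT_PATH)
spark.stop()
```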
We’ve explored Spark, Airflow, and Kafka as individual tools. But the real goal of Neylson Crepalde’s book is to show you how to run them all as a cohesive “stack” on Kubernetes. In Chapter 8, we finally start the heavy lifting of deployment.
If there is one tool that defined the “Big Data” era, it’s Apache Spark. It’s the engine that handles everything from terabyte-scale ETL to complex machine learning. In Chapter 5, Neylson Crepalde breaks down exactly how Spark works and why it’s so powerful on Kubernetes.
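As a rough illustration of what “Spark on Kubernetes” means in practice, here is a sketch of a SparkSession configured to request executor pods from the Kubernetes API server. The master URL, container image, and namespace are assumptions for the example, not values from the chapter.

```python
from pyspark.sql import SparkSession

# Placeholders: point these at your own cluster and image registry.
K8S_MASTER = "k8s://https://kubernetes.default.svc:443"
SPARK_IMAGE = "my-registry/spark-py:3.5.0"

spark = (
    SparkSession.builder
    .appName("spark-on-k8s-demo")
    .master(K8S_MASTER)                                    # driver talks to the k8s API server
    .config("spark.kubernetes.container.image", SPARK_IMAGE)
    .config("spark.kubernetes.namespace", "data-platform")
    .config("spark.executor.instances", "3")               # Kubernetes schedules 3 executor pods
    .getOrCreate()
)

# Trivial sanity check: the work below runs on the executor pods.
print(spark.range(1_000_000).selectExpr("sum(id)").collect())
spark.stop()
```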
We’ve talked about the architecture, but what about the actual tooling? To build a modern data lakehouse on Kubernetes, you need a specific set of tools that can handle scale, automation, and speed.
That’s it. Fifteen chapters, seventeen posts, and one complete walkthrough of Paul Crickard’s Data Engineering with Python (Packt, 2020, ISBN: 978-1-83921-418-9).
You have NiFi running. Kafka is streaming. Spark is processing. But what about the data source? What happens when your data comes from a tiny sensor or a Raspberry Pi that can barely run a web browser?
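To make that scenario tangible, here is a minimal sketch of the kind of lightweight producer a Raspberry Pi could run, using the kafka-python client. The broker address, topic name, and the read_sensor() stub are assumptions for illustration only.

```python
import json
import random
import time

from kafka import KafkaProducer  # pip install kafka-python

# Placeholder broker address -- replace with your own.
producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def read_sensor():
    """Stand-in for a real sensor read (e.g. a GPIO temperature probe)."""
    return {
        "device_id": "pi-01",
        "temperature_c": round(random.uniform(18, 30), 2),
        "ts": time.time(),
    }

while True:
    producer.send("sensor-readings", value=read_sensor())  # fire-and-forget send
    time.sleep(5)  # one reading every five seconds keeps the device load tiny
```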
You have streaming data. You have batch data. You have a lot of it. Now you need to actually process it. Fast. On more than one machine.
At some point, your data gets too big for one machine. That’s not a hypothetical. Netflix, Google, and Amazon all hit that wall years ago. The question is: what do you do when a single server can’t keep up?
In Part 1 we set up a Dataproc cluster, got familiar with HDFS, and touched on what a data lake actually is. Now it is time to get into the real work: writing PySpark code, understanding RDDs, moving data between HDFS, GCS, and BigQuery, and learning how to actually submit Spark jobs to Dataproc.
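Before diving in, here is a minimal sketch of the shape of a PySpark script you could hand to Dataproc with `gcloud dataproc jobs submit pyspark`. The bucket path is a placeholder, and the RDD operations are only there to show the API, not this post's actual examples.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Build an RDD, transform it lazily, then trigger the job with an action.
numbers = sc.parallelize(range(1, 1_000_001))
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
print("count:", even_squares.count())
print("first five:", even_squares.take(5))

# Writing to a GCS bucket works the same way as HDFS; the path is a placeholder.
even_squares.map(str).saveAsTextFile("gs://my-bucket/output/even_squares")

spark.stop()
```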
In the last post, we looked at DStreams, the original way to do streaming in Spark. But things move fast in the tech world. Spark 2.0 introduced Structured Streaming, a new way to handle real-time data that makes things even simpler and more reliable.
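For a feel of the API, here is a minimal Structured Streaming sketch: a word count over lines arriving on a socket, written to the console. The host and port are placeholders, and in a real pipeline the source would more likely be Kafka.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# Streaming DataFrame: each row is one line of text from the socket (placeholder host/port).
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Same DataFrame API as batch: split lines into words and count them.
word_counts = (
    lines
    .select(F.explode(F.split(F.col("value"), " ")).alias("word"))
    .groupBy("word")
    .count()
)

# Complete output mode re-emits the full counts table after each micro-batch.
query = word_counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```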
Up until now, we’ve mostly talked about batch processing - looking at data that’s already sitting in HDFS. But what if you need to know what’s happening right now? What if you’re tracking a stock price, monitoring a server for hacks, or following a trending hashtag on Twitter? That’s where Spark Streaming comes in.
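To set the stage, here is a minimal sketch of the classic DStream version of word count; the two-second batch interval and the socket source are just example choices.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-demo")
ssc = StreamingContext(sc, 2)  # cut the stream into 2-second micro-batches

# Each batch becomes an RDD of text lines read from the socket (placeholder host/port).
lines = ssc.socketTextStream("localhost", 9999)

counts = (
    lines.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
counts.pprint()  # print the first few (word, count) pairs of every batch

ssc.start()
ssc.awaitTermination()
```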
If you’ve been following this series, you know we’ve spent a lot of time on MapReduce. It’s the foundation of Hadoop, but let’s be honest: it can be slow and painful to write. That’s why Chapter 6 of Sridhar Alla’s book is such a breath of fresh air. It introduces Apache Spark, the technology that has effectively dethroned MapReduce for most big data tasks.
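To show why Spark feels like such a relief after MapReduce, here is the canonical word count as a handful of PySpark lines; the input path is a placeholder. The equivalent hand-written Java MapReduce job typically runs to dozens of lines across a mapper, a reducer, and a driver class.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# The whole MapReduce-style job in a few transformations (placeholder input path).
counts = (
    sc.textFile("hdfs:///data/books/*.txt")
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):  # ten most frequent words
    print(word, n)

spark.stop()
```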