Distributed Processing with Apache Spark - Part 1

If there is one tool that defined the “Big Data” era, it’s Apache Spark. It’s the engine that handles everything from terabyte-scale ETL to complex machine learning. In Chapter 5, Neylson Crepalde breaks down exactly how Spark works and why it’s so powerful on Kubernetes.

How Spark actually thinks

Spark isn’t just one program; it’s a distributed architecture that runs across a cluster. Here is the hierarchy you need to know:

  • Driver Program: This is the “conductor.” It runs your main Python script and coordinates everything else.
  • SparkSession: The unified entry point. This is how your code talks to Spark.
  • Cluster Manager: The resource negotiator. It could be YARN, Mesos, or—our favorite—Kubernetes.
  • Executors: The “workers.” These are separate processes that live on worker nodes and do the actual data crunching.

The Execution Flow

When you run a Spark script, it doesn’t just execute top-to-bottom like an ordinary program. Spark breaks the work down into:

  1. Jobs: Spark creates a job whenever an action asks for a result (like counting rows or writing output).
  2. Stages: Jobs are broken into stages based on whether data needs to be “shuffled” across the network.
  3. Tasks: The smallest unit of work. One task runs on one core (one slot) in an executor.

Getting Started (The Easy Way)

You don’t need a huge cluster to start learning. You can install PySpark right now with one command:

pip install pyspark

The book walks through a great first exercise: loading the classic Titanic dataset.

from pyspark.sql import SparkSession

# Start a session
spark = SparkSession.builder.appName("TitanicData").getOrCreate()

# Read some data
titanic = (
    spark.read
    .options(header=True, inferSchema=True, delimiter=";")
    .csv('data/titanic.csv')
)

titanic.show()

The Secret Weapon: Spark UI

One of my favorite things about Spark is the Spark UI. When you run Spark locally, you can open http://localhost:4040 in your browser. It gives you a visual breakdown of every job, stage, and task. If your data processing is slow, the Spark UI is where you go to find the bottleneck.

Understanding this architecture is the first step. In the next post, we’re going to dive into the DataFrame API and see how Spark uses “Lazy Evaluation” to optimize your whole pipeline before a single row is processed.

Next: Distributed Processing with Apache Spark - Part 2
Previous: The Tools of the Modern Data Stack

Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0
