Distributed Processing with Apache Spark - Part 1

If there is one tool that defined the “Big Data” era, it’s Apache Spark. It’s the engine that handles everything from terabyte-scale ETL to complex machine learning. In Chapter 5, Neylson Crepalde breaks down exactly how Spark works and why it’s so powerful on Kubernetes.

How Spark actually thinks

Spark isn’t just one program; it’s a distributed architecture that runs across a cluster. Here is the hierarchy you need to know:

  • Driver Program: This is the “conductor.” It runs your main Python script and coordinates everything else.
  • SparkSession: The unified entry point. This is how your code talks to Spark.
  • Cluster Manager: The resource negotiator. It could be YARN, Mesos, or—our favorite—Kubernetes.
  • Executors: The “workers.” These are separate processes that live on worker nodes and do the actual data crunching.

The Execution Flow

When you run a Spark script, it doesn’t just execute top-to-bottom like an ordinary program. Spark breaks the work down into:

  1. Jobs: Spark creates a job whenever an action asks for a result (like counting rows or writing output).
  2. Stages: Jobs are broken into stages based on whether data needs to be “shuffled” across the network.
  3. Tasks: The smallest unit of work. One task runs on one core (one slot) in an executor.

Getting Started (The Easy Way)

You don’t need a huge cluster to start learning. You can install PySpark right now with one command:

pip install pyspark

The book walks through a great first exercise: loading the classic Titanic dataset.

from pyspark.sql import SparkSession

# Start a session
spark = SparkSession.builder.appName("TitanicData").getOrCreate()

# Read some data
titanic = (
    spark.read
    .options(header=True, inferSchema=True, delimiter=";")
    .csv('data/titanic.csv')
)

titanic.show()

The Secret Weapon: Spark UI

One of my favorite things about Spark is the Spark UI. When you run Spark locally, you can open http://localhost:4040 in your browser. It gives you a visual breakdown of every job, stage, and task. If your data processing is slow, the Spark UI is where you go to find the bottleneck.

Understanding this architecture is the first step. In the next post, we’re going to dive into the DataFrame API and see how Spark uses “Lazy Evaluation” to optimize your whole pipeline before a single row is processed.

Next: Distributed Processing with Apache Spark - Part 2
Previous: The Tools of the Modern Data Stack

Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0
