Distributed Processing with Apache Spark - Part 2

In the last post, we looked at Spark’s architecture. Now, let’s talk about how you actually write code for it. Neylson Crepalde highlights two main ways to interact with Spark: the DataFrame API and Spark SQL.

But before you start coding, you need to understand Spark’s biggest secret: Lazy Evaluation.

Transformations vs. Actions

In normal Python, every line of code executes as soon as you run it. In Spark, that’s not true. Spark divides operations into two categories:

  1. Transformations: These are operations like .select(), .filter(), and .groupBy(). When you call these, Spark does nothing. It just writes down a “plan” of what you want to do.
  2. Actions: These are operations like .count(), .show(), or a save via .write (e.g. .write.parquet(...) — note that .write is a property, not a method call). As soon as you call an action, Spark looks at the plan you’ve built and says, “Okay, now I’ll actually do the work.”
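The split above can be mimicked in plain Python with a generator pipeline — a toy analogy, not Spark itself: building the pipeline records a plan and does no work, and nothing executes until you consume it (the “action”).

```python
# Toy analogy for lazy evaluation using a plain Python generator.
# Building the pipeline ("transformations") does no work; consuming
# it ("action") triggers the whole chain at once.
executed = []

def trace(x):
    executed.append(x)  # record that this element was actually processed
    return x

data = [1, 2, 3, 4, 5]

# "Transformations": nothing has run yet -- executed is still empty.
pipeline = (trace(x) * 10 for x in data if x > 2)
assert executed == []

# "Action": consuming the generator finally does the work.
result = list(pipeline)
assert result == [30, 40, 50]
assert executed == [3, 4, 5]
```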

Why bother with Lazy Evaluation?

This seems like extra work, but it’s actually a genius optimization. Because Spark knows the whole “plan” before it starts, it uses something called the Catalyst Optimizer to rewrite your query for maximum efficiency.

It can skip unnecessary columns (Projection Pruning), filter rows as early as possible (Predicate Pushdown), and choose the best way to join large tables. If Spark were “eager” (like standard Pandas), it couldn’t do any of this.

DataFrames and SQL: The Best of Both Worlds

If you’re a Python fan, you’ll love the DataFrame API. It feels a lot like Pandas:

# Filter and sort in PySpark
filtered_df = df.filter(df["age"] > 20).orderBy("salary")

But if you come from a database background, you can use Spark SQL. You just register your DataFrame as a temporary view and write standard SQL:

df.createOrReplaceTempView("employees")
results = spark.sql("SELECT * FROM employees WHERE age > 20 ORDER BY salary")

The kicker? Both approaches deliver identical performance. Catalyst optimizes them into the same underlying execution plan, so choosing between the DataFrame API and SQL is purely a matter of taste.

Best Practices for Performance

The book wraps up Chapter 5 with some key tips:

  • Use DataFrames/SQL: Avoid the low-level RDD API unless you really know what you’re doing.
  • Filter early: The less data you move around the network, the faster your job will be.
  • Understand Shuffling: Operations like groupBy and join cause data to move between executors (a “shuffle”). Shuffles are expensive, so use them wisely.

Spark is the engine that powers the lakehouse, but keeping all those Spark jobs organized is a job for an orchestrator. In the next post, we’re going to look at Apache Airflow.

Next: Orchestrating Pipelines with Apache Airflow
Previous: Distributed Processing with Apache Spark - Part 1

Book Details:

  • Title: Big Data on Kubernetes: A practical guide to building efficient and scalable data solutions
  • Author: Neylson Crepalde
  • ISBN: 978-1-83546-214-0
