Data Processing with Apache Spark - Study Notes from Data Engineering with Python Ch 14

You have streaming data. You have batch data. You have a lot of it. Now you need to actually process it. Fast. On more than one machine.

Chapter 14 of Data Engineering with Python by Paul Crickard introduces Apache Spark. It is the tool data engineers reach for when Python scripts and single-machine processing can no longer keep up. Spark lets you distribute data processing across a cluster of machines. It handles both batch and streaming data. And you can write your Spark code in Python using PySpark.

This chapter covers three things: installing Spark, configuring PySpark, and processing data with PySpark DataFrames.

What Is Apache Spark?

Spark is a distributed data processing engine. It is faster than older approaches like MapReduce for most non-trivial transformations. It can handle batch data, streaming data, and even graph computations.

The Spark ecosystem has a few key pieces:

  • Spark Core is the foundation. It handles scheduling, memory management, and fault recovery.
  • Spark SQL lets you run SQL queries on structured data.
  • Spark Streaming handles real-time data streams.
  • MLlib is the machine learning library.
  • GraphX is for graph processing.

All of these sit on top of Spark Core and share the same cluster resources.

Installing and Running Spark

To run Spark as a cluster, you have several options: standalone mode (Spark’s own simple cluster manager), Amazon EC2, YARN, Mesos, or Kubernetes. For learning purposes, standalone mode is the fastest way to get started.

Here is how the setup works. You download Spark from the Apache website, extract it, and put it in a directory. Then you make a copy of that directory to act as a worker node. In a real production environment, worker nodes would be on separate machines. But for learning, running everything on one machine works fine.

Spark comes with scripts to start the cluster. You run one script to start the head node and another to start the worker node. When starting the worker, you point it at the head node’s URL and port.

The worker script accepts several useful parameters:

  • Host and port to control where the node listens
  • Web UI port so you can check the node's status in your browser (the master's UI defaults to 8080, the worker's to 8081)
  • Cores to limit how many CPU cores the worker uses
  • Memory to cap memory usage (defaults to total memory minus 1 GB)
  • Work directory for scratch space
  • Properties file to set all of these in a config file instead of command-line flags

Once the cluster is up, you can open localhost:8080 in your browser and see the Spark web UI. It shows your cluster details, running applications, and completed jobs.

Setting Up PySpark

PySpark comes bundled with Spark. You just need to tell your system where to find it. That means setting a few environment variables:

  • SPARK_HOME points to your Spark installation directory
  • Add $SPARK_HOME/bin to your PATH
  • PYSPARK_PYTHON tells Spark which Python version to use

You can run these in a terminal session, but they disappear when you close the terminal. To make them permanent, add them to your .bashrc file and reload it with source ~/.bashrc.

After that, typing pyspark in a terminal opens the PySpark interactive shell. You are connected to your cluster and ready to run Spark code.

PySpark in Jupyter Notebooks

Crickard walks through two ways to use PySpark inside Jupyter:

  1. Set driver variables. You set PYSPARK_DRIVER_PYTHON to jupyter and PYSPARK_DRIVER_PYTHON_OPTS to notebook. This makes PySpark automatically open in Jupyter when you run it.

  2. Use the findspark library. Install it with pip, then add two lines at the top of your notebook: import findspark and findspark.init(). This method is what the book uses for all its examples.

The findspark approach is simpler. It finds your Spark installation automatically and sets up the environment inside the notebook. No extra environment variables needed.

Processing Data with PySpark

Every PySpark program follows the same pattern:

  1. Import findspark and initialize it
  2. Import PySpark and SparkSession
  3. Create a session connected to your cluster
  4. Do your work
  5. Stop the session

The session creation is the boilerplate. You call SparkSession.builder.master() with your cluster URL and give the application a name with .appName(). That name shows up in the Spark web UI so you can track what is running.

The Pi Estimation Example

Crickard starts with a classic example from the Spark website: estimating the value of pi using random sampling. The important thing here is not the math. It is seeing how Spark parallelizes work. The code uses sparkContext.parallelize() to distribute the computation across the cluster. Each worker processes a chunk of the data, and Spark combines the results.

This is the core idea of Spark. You write code that looks like it runs on one machine, but Spark splits the work across many machines behind the scenes.

Spark DataFrames

If you have used pandas, Spark DataFrames will feel familiar. But there are some differences.

Reading Data

In pandas, you read a CSV with read_csv(). In Spark, it is spark.read.csv(). Small difference, easy to miss.

Here is the thing about Spark CSVs. If you just read the file without any options, Spark treats the header row as data and gives columns default names like _c0, _c1. It also makes every column a string type.

To fix this, you pass two parameters: header=True tells Spark the first row is column names, and inferSchema=True tells it to figure out the correct data types automatically.

Viewing Data

In pandas, you just type the DataFrame variable name and it shows you the data. In Spark, you need to call .show() explicitly. Want to see just the first five rows? Use .show(5).

To check column types, pandas uses .dtypes. Spark uses .printSchema(). Same idea, different method name.

Selecting and Filtering

Selecting a column uses .select('column_name') instead of the bracket notation pandas uses. Don’t forget .show() at the end or you just get a DataFrame object reference, not the actual data.

Filtering works differently too. In pandas, you write df[df['age'] < 40]. In Spark, you have two options:

  • .select(df['age'] < 40) returns a single boolean column: True or False for each row
  • .filter(df['age'] < 40) returns the actual matching rows

You can chain .filter() with .select() to filter rows and then pick specific columns. For example, filter for age under 40, then select only the name, age, and state columns.

Iterating Through Rows

In pandas, you use iterrows(). In Spark, you call .collect() to pull the data into an array of Row objects. Then you loop through them with a regular for loop.

Each Row can be converted to a dictionary with .asDict(). From there, you access values by key like any Python dict.

One thing to be careful about: .collect() pulls all the data to the driver machine. If your DataFrame has millions of rows, that can cause memory problems. Use it on filtered subsets, not full datasets.

Spark SQL

If you prefer SQL over method chaining, Spark has you covered. You create a temporary view from your DataFrame, then query it with standard SQL.

The process is two steps:

  1. Call .createOrReplaceTempView('table_name') on your DataFrame
  2. Run spark.sql('SELECT * FROM table_name WHERE age > 40')

The result is another Spark DataFrame. Same data, different way to get there. Use whichever feels more natural.

Aggregations and Statistics

Spark gives you .describe() for quick summary statistics on a column: count, mean, standard deviation, min, and max. Five numbers that tell you a lot about your data.

For grouping, use .groupBy('column').count() to get counts per category. For custom aggregations, .agg() takes a dictionary of column names and functions. Pass {'age': 'mean'} to get the average age.

Both groupBy and agg support mean, max, min, sum, and other standard aggregation functions.

Built-in Functions

Spark has a large library of built-in functions in pyspark.sql.functions. Import it and you get access to things like:

  • collect_set to get unique values from a column
  • countDistinct to count unique values
  • md5 to hash values (useful for anonymization or deduplication)
  • reverse to reverse string values
  • soundex for phonetic matching on name fields

These are just a few examples. The full list is in the PySpark SQL documentation. The functions module is where a lot of the real data engineering power lives.

Stopping the Session

When you are done, always call spark.stop(). This releases your cluster resources. A stopped session shows up as a completed application in the Spark web UI.

Leaving sessions running when you are not using them wastes cluster resources. In a shared environment, other people’s jobs might not get the cores and memory they need because your idle session is holding them.

Key Takeaways

  • Spark is for distributed processing. When your data is too big or your transformations too slow for a single machine, Spark lets you spread the work across a cluster.
  • PySpark makes Spark accessible to Python developers. The API is similar to pandas, with a few syntax differences to learn.
  • DataFrames are the main abstraction. Read data in, transform it, write it out. The same pattern as pandas, but running on a cluster.
  • Spark SQL is there if you want it. Create a view and query with SQL. Good for people who think in SQL rather than method chains.
  • Always stop your sessions. Free up cluster resources when you are done.

My Take

This is a practical chapter. Crickard keeps things focused on getting Spark running and doing basic data processing. He does not try to cover everything Spark can do, which is the right call for a book that covers the full data engineering stack.

The pandas-to-Spark comparison approach works well. If you already know pandas, you can pick up PySpark fast because Crickard consistently shows you the Spark equivalent of what you already know.

What this chapter does not cover is the harder stuff: performance tuning, partitioning strategies, handling skewed data, or running Spark on a real multi-node cluster. Those are topics that matter a lot in production but would each need their own chapter. For an introductory look at Spark as part of a data engineering toolkit, this gets the job done.

The most useful thing here is the pattern. Every PySpark program has the same structure: set up, connect, process, stop. Once you know the pattern, you just need to learn more transformations and functions. And the Spark documentation is good for that.



About BookGrill.net

BookGrill.net is a technology book review site for developers, engineers, and anyone who builds things with code. We cover books on software engineering, AI and machine learning, cybersecurity, systems design, and the culture of technology.
