Data Engineering with GCP Chapter 5 Part 2: Working with Spark on Dataproc

In Part 1 we set up a Dataproc cluster, got familiar with HDFS, and touched on what a data lake actually is. Now it is time to get into the real work: writing PySpark code, understanding RDDs, moving data between HDFS, GCS, and BigQuery, and learning how to actually submit Spark jobs to Dataproc.

This is where Chapter 5 gets practical.

PySpark Basics

Spark was originally built in Scala, but there is a Python API called PySpark. If you know Python, you already know 80% of what you need.

You access PySpark interactively through the PySpark shell. It works like a regular Python shell, except a SparkContext and a SparkSession are already created for you. These are your connection to the cluster. SparkSession is the newer, unified entry point (it wraps SparkContext), so use that.

The shell is fine for testing, but real work means writing Python files and submitting them as jobs.

RDDs Explained Simply

RDD stands for Resilient Distributed Dataset. Sounds fancy, but the idea is simple.

In regular Python, a list with a million items sits in memory on one machine. An RDD splits that data across multiple machines. That is the “distributed” part. The “resilient” part means that if one machine crashes, Spark can recompute the lost pieces from its record of how they were built and keep going.

Think of an RDD as a Python list that Spark manages across a cluster. You cannot print it like a normal variable. You call specific actions on it, like collect() or sum().

The other big concept is lazy computation. When you chain transformations (filter, map, split), Spark does not execute anything. It remembers the recipe. Only when you call an action does it run the whole pipeline. With a billion records, this saves you from wasting memory on intermediate results nobody asked for.

Accessing HDFS from PySpark

Loading a file from HDFS in PySpark is one line. You point SparkContext to the HDFS path and get back an RDD. From there you filter rows, split strings, map values, all the usual stuff.

The interesting bit: even though HDFS stores files across multiple data nodes, you always access them through the master node. It keeps an index of where every block lives, so you never need to know which data node holds your file.

Accessing GCS from PySpark

This is where Dataproc shines. Switching from HDFS to GCS is literally just changing the path prefix. Instead of hdfs://master-node/path/to/file, you write gs://bucket-name/path/to/file. Everything else stays the same. Same filters, same maps, same code.

The trade-off is speed versus convenience. HDFS on SSDs is faster for I/O-heavy workloads. But GCS works with every other GCP service, needs zero maintenance, and wins for most real-world scenarios.

Spark DataFrames

RDDs are the foundation, but Spark DataFrames are where things get practical. A DataFrame is like a pandas DataFrame, but distributed across the cluster.

The workflow: load raw data into an RDD, do some cleanup (splitting strings, selecting columns), then convert to a DataFrame with named columns. Now you can run SQL queries on it.

The book shows this with Apache web log data. Raw log lines are messy. You split each line by spaces, pick out IP, timestamp, HTTP method, and URL, then structure it as a DataFrame. A SQL GROUP BY tells you which articles got the most traffic. That is the core data lake pattern: take unstructured data, transform it into something queryable.

Submitting Jobs to Dataproc

For production, you write PySpark code in a file, upload it to GCS, and submit it using gcloud commands. The book covers three patterns:

HDFS to HDFS. Read from HDFS, process with Spark, write results back to HDFS. Classic Hadoop pattern, nothing fancy.

GCS to GCS. Same logic, but reads from a GCS bucket and writes output to another GCS path. The only code change is the file paths. The output gets split into multiple part files automatically because Spark parallelizes the work.

GCS to BigQuery. The most practical pattern. Read unstructured data from GCS, process with Spark, write directly to a BigQuery table. You need the BigQuery connector JAR when submitting, but that is one extra flag. Now your analysts can query the results with plain SQL.

Ephemeral Clusters and Dataproc Serverless

If you read from GCS and write to GCS or BigQuery, the Hadoop cluster stores nothing. It just processes. So why keep it running 24/7?

An ephemeral cluster gets created when a job starts and destroyed when it finishes. You pay only for actual processing time.

Two ways to set this up. Workflow Templates let you define a cluster config, attach jobs, and run everything as one unit. Create cluster, run jobs, tear down. Good for simple pipelines.

For complex stuff, Cloud Composer (Airflow) gives full orchestration. Airflow operators create a Dataproc cluster, submit PySpark jobs, and delete the cluster when done. This is the common production pattern because you can combine Dataproc with BigQuery, GCS, and other services.
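With the Dataproc operators from Airflow's Google provider package, the create/submit/delete pattern looks roughly like this; project, region, bucket, and cluster names are placeholders, and details vary by Airflow version:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

with DAG("ephemeral_dataproc", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    create = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="your-project",
        region="your-region",
        cluster_name="ephemeral-cluster",
        cluster_config={"worker_config": {"num_instances": 2}},
    )
    submit = DataprocSubmitJobOperator(
        task_id="submit_job",
        project_id="your-project",
        region="your-region",
        job={
            "placement": {"cluster_name": "ephemeral-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://your-bucket/jobs/gcs_to_bq.py"},
        },
    )
    delete = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id="your-project",
        region="your-region",
        cluster_name="ephemeral-cluster",
        trigger_rule="all_done",   # tear down even if the job fails
    )
    create >> submit >> delete
```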

Then there is Dataproc Serverless. You submit Spark code and Google handles everything. It starts with two workers and can auto-scale up to 2,000. Trade-off: fewer config options. If you need specific Spark versions or custom libraries, use regular Dataproc.

When to Use Dataproc vs BigQuery

Use BigQuery when data is structured and fits into tables. Use Dataproc when you have unstructured data like logs or anything without a clean schema.

Many pipelines use both: Spark wrangles unstructured data into shape, BigQuery stores and queries the structured results. Not competitors, teammates.

Quick decision guide: GCS over HDFS unless you need raw I/O speed. Dataproc Serverless unless you need custom cluster config. Permanent clusters only if you need HDFS storage, Hadoop web UIs, or the math proves it cheaper for your workload.

Chapter Summary

Main takeaways from the hands-on part of Chapter 5:

  • PySpark gives you distributed processing with Python syntax you already know
  • RDDs are distributed lists with lazy computation, DataFrames are distributed tables with SQL support
  • Switching between HDFS and GCS in PySpark is just a path change
  • You can write Spark job output to HDFS, GCS, or directly to BigQuery
  • Ephemeral clusters save money by existing only during job execution
  • Dataproc Serverless removes cluster management entirely
  • The real power is combining Spark for unstructured data processing with BigQuery for structured querying

Next chapter moves to streaming data with Pub/Sub and Dataflow, which is a completely different paradigm from the batch processing we have been doing so far.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 5 Part 1: Building a Data Lake or continue to Chapter 6 Part 1: Streaming with Pub/Sub.
