Data Engineering with GCP Chapter 6 Part 2: Stream Processing with Dataflow
In Part 1 we covered Pub/Sub and how messages flow between publishers and subscribers. Now comes the fun part: what do you actually do with those messages once you have them? That’s where Apache Beam and Dataflow come in.
Apache Beam: Write Once, Run Anywhere
Apache Beam is an open-source programming model for building data pipelines. Think of it as a universal language for data processing. You write your pipeline logic in Beam, then choose where to run it. Locally on your machine? Sure. On Google Cloud Dataflow? Absolutely. On Apache Spark or Flink? Also yes.
This “write once, run anywhere” idea is Beam’s biggest selling point. The book walks through this with Python examples, and the core concepts are pretty straightforward.
A Beam pipeline has a few key building blocks. First is PCollection, which is Beam’s way of representing a dataset. If you’ve worked with Spark, PCollection is similar to an RDD. It’s a distributed collection of data that your pipeline steps operate on.
Then you have transformations. Map takes one element in, gives one element out. ParDo (short for Parallel Do) is more flexible because it can take one element and produce zero, one, or many elements. Both let you transform your data step by step, and you chain them together using a pipe syntax that reads almost like plain English.
There are also aggregation functions like CombinePerKey that let you group and summarize data, similar to a GROUP BY in SQL.
Batch Processing: Your First Dataflow Job
The chapter starts with a batch example before jumping to streaming. This makes sense because the pipeline structure is almost identical for both.
The exercise uses web server log files stored in Google Cloud Storage. The pipeline reads these log files, splits each line to extract the IP address, date, HTTP method, and URL, then counts how many times each URL appears. Simple enough, but it teaches the fundamental pattern.
Here’s the important part about runners. When you’re developing, you use DirectRunner. It runs everything on your local machine, and it’s fast to deploy. When you’re ready for production, you switch to DataflowRunner. That one line change is literally all it takes to go from local execution to a fully managed, scalable cloud service. Dataflow handles all the infrastructure, scaling, and monitoring for you.
Once you run a job with DataflowRunner, you can see it in the Dataflow console. The console shows your pipeline as a visual graph with each step displayed as a node. You can track progress, see how many elements each step processed, and check for errors. It is quite nice.
From Batch to Streaming: Surprisingly Easy
Switching from batch to streaming in Beam is remarkably simple. The book demonstrates this by building a pipeline that reads from a Pub/Sub subscription and writes to BigQuery in real time.
The main differences from the batch version are small but important. You set streaming=True in your pipeline options. You swap the file reader for a Pub/Sub reader. You change the output from a text file to a BigQuery table with a defined schema. That’s basically it.
To test a streaming pipeline, you need three things running at the same time: a terminal running the Beam streaming application, the BigQuery console open to check incoming records, and another terminal running the Pub/Sub publisher to send test messages. It’s a bit of juggling, but it shows how all the pieces connect.

When everything works, you publish a message to Pub/Sub and it shows up in BigQuery almost immediately. That’s the whole point of streaming.
Windowing: Because Streaming Data Never Stops
Here’s where streaming gets interesting and a little tricky. In batch processing, you have a finite dataset. You know when it starts and ends. In streaming, data just keeps coming. So how do you do aggregations like “sum up all the trip durations per station”?
You use windows. A window is a time boundary you put around streaming data so you can group and aggregate it. The book uses a fixed window of 60 seconds. Every minute, the pipeline calculates the sum of trip durations for each station and writes the result to BigQuery.
The key insight is that when you aggregate streaming data, you lose the concept of “process everything at once” that batch gives you. Instead, you get results for each time window. Each row in your output table represents the aggregation for one station during one 60-second window. If you need a running total across all windows, you can create a BigQuery view on top that sums everything up.
One practical detail: when you aggregate within a window, you need to attach the window’s timestamp to your output. Otherwise, you just get numbers with no context about which time period they belong to. The book shows how to use a ParDo function to grab the window start time and add it to each record before writing to BigQuery.
Dataflow in Production: Cost and Operations
A streaming Dataflow job runs continuously. Unlike batch jobs that finish and go away, a streaming job stays on until you stop it. This has cost implications.
Dataflow bills by worker hours. Each worker is basically a virtual machine. If you run one worker for 24 hours, you pay for 24 hours of that VM. Dataflow can auto-scale the number of workers based on data volume, which is great for handling spikes. But you can also set a maximum number of workers to avoid surprise bills.
When you want to stop a streaming job, you have options: Cancel (stop immediately, discarding in-flight data), Drain (finish processing the data already ingested, then stop), or Force Cancel (a last resort for jobs stuck in a cancelling state). Each has different implications for data that’s currently being processed, so Drain is usually the safe default in production.
Change Data Capture with Datastream
The chapter introduces one more important concept: what happens when you can’t modify the source system to publish to Pub/Sub?
This is common in real organizations. Think banking systems, legacy monoliths, or third-party vendor software. You can’t just add a Pub/Sub publisher to someone else’s application.
Change Data Capture (CDC) solves this. CDC tools tap into a database’s internal transaction logs and capture every insert, update, and delete as it happens. On GCP, the CDC service is called Datastream.
Datastream connects to databases like MySQL, PostgreSQL, AlloyDB, and Oracle, then streams the changes to BigQuery, Cloud Storage, Cloud SQL, or Cloud Spanner. No code changes needed in the source system.
The book walks through a full end-to-end exercise: setting up a Cloud SQL MySQL database, configuring Datastream to capture changes, using GCS bucket notifications with Pub/Sub, and running a Dataflow template job to load everything into BigQuery. It involves six different GCP components working together, which honestly reflects how real production systems look.
Chapter Takeaways
This chapter covers a lot of ground, and here’s what stands out. Apache Beam gives you a portable way to write pipelines that work for both batch and streaming with minimal changes. Dataflow handles all the infrastructure so you can focus on the logic. Windowing is the key concept that makes streaming aggregations possible. And when you can’t modify source systems, CDC through Datastream gives you real-time data capture without touching a single line of application code.
The author makes a good point at the end: learning all of Beam’s features at once is not practical. Start with your use case, learn what you need, and expand from there. With the foundation from chapters 1 through 6, you already have about 70% of what data engineering practice looks like day to day.
This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 6 Part 1: Real-Time Data with Pub/Sub or continue to Chapter 7: Visualizing Data with Looker Studio.