Data Engineering with GCP Chapter 6 Part 1: Real-Time Data with Pub/Sub

Chapter 6 is where Adi Wijaya switches gears from batch to real-time. After spending Chapters 3 through 5 on batch pipelines with BigQuery, Cloud Composer, and Dataproc, now it is time to talk about streaming data. Two GCP services carry this chapter: Pub/Sub and Dataflow. This post covers the streaming concepts and Pub/Sub. Dataflow gets its own post in Part 2.

Batch vs Streaming: What Is Actually New Here

Adi makes a point that I really like. Some people say that if data is not real-time, it is not big data. He calls that partially true. The reality is that about 90% of data pipelines in the real world are batch. Daily jobs, hourly refreshes, weekly reports. Batch is not going anywhere.

But all data is technically created in real time. When a user registers in your app, that record exists the instant they click the button. Processing data in batches is just a simplification. We collect events over some period and process them together because it is easier.

Streaming means you stop waiting. You process each record as soon as it arrives. The data flows from source to target continuously, like water through a pipe rather than buckets being carried back and forth.

So why is batch still the default? Two reasons, and both are practical.

First, most business questions are naturally about time periods. “How much revenue this month?” or “How many signups last week?” Nobody asks “How many signups this exact second compared to the previous second?” The questions that decision makers actually ask fit the batch model perfectly.

Second, batch is simpler to build and operate. You control the schedule. You can retry failed jobs. With streaming, you have one long-running process that never stops. Once it starts, it processes everything that comes in. There is no “run it again tomorrow” safety net.

When Streaming Actually Makes Sense

Fraud detection is the classic example. If someone is using a stolen credit card, you need to catch it now, not in tomorrow’s batch report. Real-time marketing campaigns are another. If a user just browsed winter jackets, showing them an ad 24 hours later is way less effective than right now. Live dashboards for operations teams also fit here.

The common thread: situations where waiting even a few minutes costs real money or creates real risk.

The GCP Streaming Stack

On Google Cloud, the standard streaming pattern looks like this: data sources send messages to Pub/Sub, Dataflow subscribes and processes them, then writes results to BigQuery. Cloud Storage sits in the middle as temporary staging when needed.

No scheduler here. No Cloud Composer, no cron jobs. Everything is a long-running process that handles data as it appears.

What Is Pub/Sub

Pub/Sub is a messaging system. Its job is to receive messages from multiple sources and distribute them to multiple consumers. Think of it as a post office that accepts letters from many senders and delivers copies to many recipients.

There are four key terms you need to know.

Publisher is whatever sends messages into Pub/Sub. This is usually your application code. You write a small program in Python, Java, Go, or whatever language you prefer, and it pushes messages into Pub/Sub.

Topic is where messages live inside Pub/Sub. Think of it like a database table. Just as a table stores rows, a topic stores messages. Adi uses “bike-sharing-trips” in his example. Unlike databases, there is no higher-level grouping like datasets or schemas, so your topic names need to be clear on their own.

Subscription is the other end. A subscription is an entity that wants to receive messages from a topic. One topic can have many subscriptions, and each subscription gets identical copies of every message. If topic X has two subscriptions, both get the same messages.

Subscriber is different from subscription. A subscriber is the actual consumer attached to a subscription. One subscription can have multiple subscribers, and the messages get split between them. This is how you handle load balancing.

Acknowledgement: The Pizza Delivery Analogy

There is a fifth concept that ties it all together: acknowledgement, or “ack” for short.

After a subscriber receives a message, it needs to tell Pub/Sub “got it, thanks.” That is the ack. Once acked, Pub/Sub stops trying to deliver the message. If the subscriber fails to ack, perhaps because of a code bug or a server crash, Pub/Sub keeps retrying until the message is either acked or expires.

Adi has a nice analogy. Imagine pizza delivery. The chef (publisher) puts pizzas on the shelf (topic). The delivery person (subscription) brings a pizza to you (subscriber). When you sign the receipt, that is the ack. If you are not home, they come back and try again.

Push vs Pull Delivery

When you create a subscription, you choose between two delivery types: pull and push.

Pull means the subscriber calls Pub/Sub and asks for messages. “Hey, anything new for me?” This is the standard approach for data pipelines. Dataflow uses pull when it reads from a subscription.

Push means Pub/Sub sends messages to an HTTP endpoint that your subscriber exposes. Pub/Sub calls your URL with each message. This works when your subscriber cannot use the Pub/Sub client library directly.

Both methods can handle high throughput, but pull has a much higher quota. Adi mentions the difference is roughly 30 to 1 in favor of pull. For data engineering work, pull is almost always what you want.
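To make the pull-plus-ack flow concrete, here is a minimal sketch of a pull subscriber, assuming the google-cloud-pubsub library and GCP credentials are available. The project and subscription names are placeholders, not from the book, and this is one reasonable shape for the code, not Adi's exact listing:

```python
import json
from concurrent import futures


def decode_message(data: bytes) -> dict:
    # Pub/Sub delivers raw bytes; here we assume the publisher sent JSON.
    return json.loads(data.decode("utf-8"))


def pull_trips(project_id: str, subscription_id: str) -> None:
    # Requires the google-cloud-pubsub package and GCP credentials.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    def on_message(message) -> None:
        trip = decode_message(message.data)
        print(f"Received trip: {trip}")
        # Ack only after processing succeeds; unacked messages are redelivered.
        message.ack()

    streaming_pull = subscriber.subscribe(subscription_path, callback=on_message)
    try:
        # Block the main thread; stop after 30 seconds for this demo.
        streaming_pull.result(timeout=30)
    except futures.TimeoutError:
        streaming_pull.cancel()
```

If `on_message` raises before calling `ack()`, the message stays unacked and Pub/Sub redelivers it after the ack deadline, which is exactly the retry behavior described above.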

Creating Topics and Subscriptions

Setting up Pub/Sub is quick. You can use the console, the gcloud command line, or code.

For topics, you click Create Topic, give it a name, done. Adi recommends unchecking the default subscription option so you can create subscriptions manually and learn the settings.

For subscriptions, there are a few settings worth knowing. Message retention duration controls how long unacked messages stick around (cost vs risk tradeoff). Expiration period sets when an idle subscription gets automatically deleted. Acknowledgement deadline tells Pub/Sub how long to wait before resending an unacked message. If your subscriber does heavy processing, set this higher.
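The same settings can be applied from code instead of the console. Below is a hedged sketch using the google-cloud-pubsub Python client; the names are placeholders, and the 10-to-600-second ack deadline range is Pub/Sub's documented limit, not something from the book:

```python
from datetime import timedelta

# Pub/Sub accepts ack deadlines between 10 and 600 seconds.
MIN_ACK_DEADLINE, MAX_ACK_DEADLINE = 10, 600


def clamp_ack_deadline(seconds: int) -> int:
    # Keep a requested deadline inside the allowed Pub/Sub range.
    return max(MIN_ACK_DEADLINE, min(MAX_ACK_DEADLINE, seconds))


def create_topic_and_subscription(project_id: str, topic_id: str,
                                  subscription_id: str,
                                  ack_deadline_seconds: int = 60) -> None:
    # Requires the google-cloud-pubsub package and GCP credentials.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()

    topic_path = publisher.topic_path(project_id, topic_id)
    subscription_path = subscriber.subscription_path(project_id, subscription_id)

    publisher.create_topic(request={"name": topic_path})
    subscriber.create_subscription(request={
        "name": subscription_path,
        "topic": topic_path,
        # How long Pub/Sub waits for an ack before resending the message.
        "ack_deadline_seconds": clamp_ack_deadline(ack_deadline_seconds),
        # How long unacked messages are retained (cost vs risk tradeoff).
        "message_retention_duration": timedelta(days=1),
    })
```

A subscriber doing heavy per-message processing would pass a larger `ack_deadline_seconds`, matching the advice above.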

Publishing Messages

To send messages, you write a small publisher application. Adi uses Python with the google-cloud-pubsub library. The publisher creates JSON messages, converts them to strings, and calls the publish method. Each published message gets a unique ID back as confirmation.

Messages can be anything: free text, JSON, binary data. For data pipelines, JSON is most common because downstream processors can parse it easily.
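A publisher in that style might look like the sketch below, assuming the google-cloud-pubsub library. The project and topic names are placeholders; the book's own listing may differ in details:

```python
import json


def encode_trip(trip: dict) -> bytes:
    # Pub/Sub message data must be bytes; JSON keeps it easy to parse downstream.
    return json.dumps(trip).encode("utf-8")


def publish_trip(project_id: str, topic_id: str, trip: dict) -> str:
    # Requires the google-cloud-pubsub package and GCP credentials.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)

    future = publisher.publish(topic_path, encode_trip(trip))
    # result() blocks until the server confirms, returning a unique message ID.
    return future.result()
```

Calling `publish_trip("my-project", "bike-sharing-trips", {"trip_id": 1})` would hand the message to Pub/Sub and return its server-assigned ID, the confirmation mentioned above.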

The whole flow from publishing to reading takes milliseconds. Pub/Sub handles scaling automatically. Multiple publishers can send millions of messages without you worrying about server capacity.

What Comes Next

We now have a messaging system that can ingest and distribute data streams. But messages sitting in Pub/Sub are not useful on their own. You need something to consume them, transform them, and write the results to BigQuery.

That something is Dataflow, built on top of Apache Beam. Part 2 covers how Dataflow works, how to write Beam pipelines, and how to connect the full chain from Pub/Sub through Dataflow into BigQuery.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 5 Part 2: Spark on Dataproc or continue to Chapter 6 Part 2: Stream Processing with Dataflow.
