Data Engineering with GCP Chapter 11: Keeping Google Cloud Costs Under Control

Nobody ever got promoted for building the cheapest data pipeline. But plenty of people have gotten uncomfortable phone calls from their CFO after a runaway BigQuery bill. Chapter 11 is about the money side of GCP, and I think this is one of the most practical chapters in the book.

Stakeholders always ask: “How much will this cost?” The answer is never simple.

Three Pricing Models in GCP

Not every GCP service charges you the same way. The book breaks it down into three categories, and once you understand these, estimating costs gets easier.

VM-based pricing. Services like Dataproc, Dataflow, and Cloud Composer bill based on virtual machines. The formula: number of workers times cost per worker times total hours. You pay for uptime, not usage. If your Dataproc cluster sits idle all weekend, you are still paying. Cloud Composer 2 added auto-scaling to help, but machines running equals money spent.
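The workers-times-rate-times-hours formula is simple enough to sketch in code. The hourly rate below is a made-up placeholder, not a real Dataproc or Dataflow price:

```python
def vm_cost(workers: int, hourly_rate: float, hours: float) -> float:
    """VM-based pricing: workers x cost per worker x total hours.

    You pay for uptime, not usage, so idle time bills the same as busy time.
    """
    return workers * hourly_rate * hours

# Hypothetical cluster: 10 workers at an assumed $0.20/hour each,
# left running (idle) over a 48-hour weekend:
idle_weekend = vm_cost(workers=10, hourly_rate=0.20, hours=48)
print(f"${idle_weekend:.2f}")  # $96.00 spent doing nothing
```

The point of the sketch: the formula has no term for "work done", which is exactly why idle clusters are the classic VM-based cost leak.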

Usage-based pricing. Here you pay for what you actually use. Google Cloud Storage charges by how much data you store. Pub/Sub charges by message volume. BigQuery on-demand charges by how many bytes your queries process. No usage, no bill. This feels fairer, but it can also surprise you when usage spikes.

Commitment pricing. You agree to buy a fixed amount of resources for a fixed period, and Google gives you a discount. This is the enterprise model. BigQuery editions are the key example for data engineers. You commit to compute slots and get predictable monthly costs.

BigQuery Pricing: On-Demand vs Editions

BigQuery is unique because you actually get to choose your pricing model. And the choice matters a lot.

On-demand is the default. You pay per terabyte of data processed by your queries. In the US, that is about $6.25 per TB. The math is straightforward. If you have a 10 GB table and 100 users each run a SELECT * once a day for a month, your bill is roughly $187.50. Simple, predictable for small workloads, and no upfront commitment.
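The arithmetic behind that $187.50 figure, written out. Note the chapter rounds 1 TB to 1,000 GB; BigQuery actually bills per TiB (1,024 GiB), which would make the real number slightly lower:

```python
PRICE_PER_TB = 6.25  # assumed US on-demand rate, USD per TB processed
GB_PER_TB = 1000     # the chapter's rounding; BigQuery bills per TiB

table_gb = 10        # full table scanned by each SELECT *
users = 100
days = 30            # one query per user per day for a month

gb_scanned = table_gb * users * days          # 30,000 GB
monthly_cost = gb_scanned / GB_PER_TB * PRICE_PER_TB
print(f"${monthly_cost:.2f}")  # $187.50
```

Every full-table query adds linearly to the bill, which is why the partitioning and clustering tricks later in the chapter matter so much.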

The catch? On-demand gives you a fixed 2,000 slots per project. A slot is basically a unit of compute in BigQuery, think of it like a virtual CPU with some memory. If your queries need more than 2,000 slots, they queue up and get slower. You cannot change this limit on the on-demand plan.

BigQuery editions replaced the old flat-rate model in 2023. Instead of paying per byte scanned, you pay for slots. You configure a slot reservation with two key numbers: baseline slots (the minimum you always have) and max reservation size (the ceiling you are willing to pay for). The difference between these two is your autoscale range.

Set your baseline to 200 slots and your max to 4,000. During quiet hours, you pay for 200. When queries spike, BigQuery scales up in increments of 100, up to 4,000. When demand drops, it scales back down.
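The baseline/max mechanics can be sketched as a small function. This is a simplification of how BigQuery meters autoscaled slots, with the 100-slot increment taken from the description above:

```python
import math

def slots_billed(demand: int, baseline: int, max_slots: int, step: int = 100) -> int:
    """Slots billed under autoscaling (simplified model):
    never below the baseline, scaled up in `step`-slot increments,
    and capped at the max reservation size."""
    if demand <= baseline:
        return baseline
    extra = demand - baseline
    scaled = baseline + math.ceil(extra / step) * step
    return min(scaled, max_slots)

# Baseline 200, max 4,000, as in the example above:
print(slots_billed(50,   baseline=200, max_slots=4000))  # quiet hours -> 200
print(slots_billed(1250, baseline=200, max_slots=4000))  # spike -> 1300
print(slots_billed(9000, baseline=200, max_slots=4000))  # capped -> 4000
```

The asymmetry is the point: you always pay for the baseline, so set it to your real floor of demand and let autoscaling absorb the spikes.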

On top of this, slot commitments let you lock in a 1-year or 3-year term for your baseline slots at a lower price. But you pay whether you use those slots or not. Classic cloud trade-off: flexibility versus savings.

The practical takeaway: for small or unpredictable workloads, stick with on-demand. Queries getting slow or bills getting big? Start looking at editions.

Using the Google Cloud Pricing Calculator

Google provides a free pricing calculator at cloud.google.com/products/calculator. You pick a service, fill in your expected usage, and it gives you a monthly estimate.

The book walks through a realistic scenario: GCS for 100 GB daily CSV storage, a 13-node Dataproc cluster, BigQuery for 20 users, Cloud Composer for orchestration, and Pub/Sub plus Dataflow for streaming at 2 GB per hour. The interesting finding? Dataproc is by far the most expensive piece. That always-on Hadoop cluster costs more than everything else combined.

Estimates and reality will always differ. But having numbers on paper gives your team something concrete to discuss. Maybe that permanent Dataproc cluster should be ephemeral. Maybe batch processing could move to BigQuery SQL. These conversations save real money.

Partitioned Tables: Your First Line of Defense

Here is where cost strategy gets practical. On BigQuery on-demand, every query bills you for the bytes it processes. A SELECT * on a 1 TB table costs roughly $6.25 every single time.

Partitioned tables split your data into segments based on a key, usually a date column. When a query filters on that key, BigQuery only reads the relevant partition instead of the whole table.

Say your table has five days of data totaling 1 TB. Without partitioning, every query scans 1 TB. With daily partitioning, filtering to one date scans about 200 GB. Same result, one-fifth the cost.
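The five-day example in numbers, using the same assumed $6.25/TB on-demand rate:

```python
PRICE_PER_TB = 6.25   # assumed US on-demand rate, USD per TB processed
total_tb = 1.0
days_of_data = 5

full_scan_cost = total_tb * PRICE_PER_TB          # unpartitioned: every query reads 1 TB
one_partition_tb = total_tb / days_of_data        # daily partitions: ~200 GB each
pruned_cost = one_partition_tb * PRICE_PER_TB

print(f"full scan: ${full_scan_cost:.2f}")   # full scan: $6.25
print(f"one partition: ${pruned_cost:.2f}")  # one partition: $1.25
```

Same query result, one-fifth the cost, and the gap widens as the table accumulates more days of data.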

The most common approach is partitioning by a date column. You can also partition by ingestion time or integer ranges. One thing to watch: BigQuery limits you to 4,000 partitions per table. With daily partitions, that covers about 11 years. Partition by hour and you hit the limit in about five and a half months.
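The partition-limit arithmetic is worth doing before you pick a granularity:

```python
MAX_PARTITIONS = 4000  # BigQuery's per-table partition limit, per the chapter

daily_years = MAX_PARTITIONS / 365        # daily partitions: ~11 years of data
hourly_months = MAX_PARTITIONS / 24 / 30  # hourly partitions: ~5.6 months

print(f"daily: ~{daily_years:.1f} years, hourly: ~{hourly_months:.1f} months")
```

Hourly partitioning buys finer pruning but burns through the limit roughly 24 times faster, so it only makes sense for tables with a short retention window.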

Clustered Tables: The Second Line of Defense

Clustering goes one step further. While partitioning divides data by date ranges, clustering sorts data within each partition based on up to four columns.

BigQuery stores data in distributed blocks. Without clustering, blocks contain random rows. With clustering, blocks are sorted. When your query filters on a clustered column, BigQuery knows which blocks to read and skips the rest.

The combination of partitioning and clustering is powerful. Partition by date, cluster by the columns people filter on most. The result is dramatically fewer bytes scanned per query.

When choosing cluster columns, two rules from the book: pick the columns your users filter on most often, and order them from least granular to most granular (country, region, city, postal code).
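Put together, the DDL for a daily-partitioned, clustered table looks roughly like the statement below. The dataset, table, and column names are invented for illustration; the cluster columns follow the least-to-most-granular rule above:

```python
# Hypothetical table: events partitioned by day, clustered coarse-to-fine.
cluster_cols = ["country", "region", "city"]  # least granular first

ddl = (
    "CREATE TABLE my_dataset.events\n"
    "PARTITION BY DATE(event_time)\n"
    f"CLUSTER BY {', '.join(cluster_cols)}\n"
    "AS SELECT * FROM my_dataset.events_raw"
)
print(ddl)
```

A query that filters on event_time and country then reads only the matching partition, and within it only the blocks holding that country's rows.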

The book proves this with a StackOverflow data experiment. Running the same query against a standard table, a partitioned table, and a partitioned-plus-clustered table, the last one consistently processes the fewest bytes. At scale, those savings compound fast.

Practical Tips for Data Engineers

After reading this chapter, here is what I would tell anyone starting with GCP:

Do the math before building. Use the pricing calculator. Knowing that Dataproc will be your biggest expense might change your architecture entirely.

Know what you are paying for. VM-based services charge for uptime. Usage-based services charge for consumption. Shut down what you are not using. Use ephemeral clusters when possible.

Partition and cluster everything in BigQuery. There is almost no reason not to. On on-demand pricing, the savings are direct. On editions, fewer bytes scanned means fewer slots needed.

Watch your SELECT * queries. Every column adds to your bill. Train your team to select only what they need.

Consider editions when you outgrow on-demand. Slow queries hitting the 2,000 slot ceiling? High monthly bill? Run the numbers on editions with autoscaling.

Revisit costs regularly. Google changes pricing. Your usage changes. Make cost review a habit.

Chapter Summary

Chapter 11 is less about building things and more about paying for them wisely. The three pricing models give you a framework for any GCP service. BigQuery pricing deserves special attention because you choose between on-demand and editions, and that choice affects both cost and performance. Partitioned and clustered tables are the easiest wins for reducing BigQuery costs. And the pricing calculator is your best friend when stakeholders ask “how much?”

The money conversation is not glamorous, but nobody wants to be the person who explains why the cloud bill tripled last month.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 10 Part 2: Data Quality and Security or continue to Chapter 12 Part 1: CI/CD for Data Engineers.
