Data Engineering with GCP Chapter 3 Part 2: Data Modeling and BigQuery Features
In Part 1 we loaded CSV files into BigQuery and built a simple warehouse from a MySQL export. Now the book throws a second scenario at us: bike-sharing data. More tables, daily batch loading, and a real question that every data engineer has to face sooner or later: how do you actually model your data so that business people can use it without calling you every five minutes?
The Second Scenario: Bike Trips
The setup is straightforward. We have three tables: trips, stations, and regions. The business user wants to know things like how many bike trips happen daily, average trip duration, and which stations or regions have the longest or shortest rides.
Trips data lives in a GCS bucket, organized by date folders. Regions come from BigQuery public datasets. New data arrives daily. So the challenge is not just loading data once, but handling the ongoing flow of new records without breaking things.
Adi makes an important point here: stop clicking buttons in the GCP console. In real work you write Python scripts. You need reproducibility, version control, and the ability to loop through pipelines programmatically.
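To make that point concrete, here is a minimal sketch of the "script it, don't click it" idea: generating one GCS URI per date folder so a loop can drive daily load jobs. The bucket name, folder layout, and function name are my own illustrations, not the book's exact code.

```python
from datetime import date, timedelta

def daily_trip_uris(bucket: str, start: date, days: int) -> list[str]:
    """Build one GCS wildcard URI per date folder (folder layout is assumed)."""
    return [
        f"gs://{bucket}/trips/{start + timedelta(days=d):%Y-%m-%d}/*.csv"
        for d in range(days)
    ]

uris = daily_trip_uris("my-bike-data", date(2018, 1, 1), 3)
for uri in uris:
    # In a real pipeline, each URI would feed a BigQuery load job
    # (e.g. client.load_table_from_uri with a WRITE_APPEND disposition).
    print(uri)
```

The payoff is reproducibility: the same script handles one day or a full backfill, and it lives in version control instead of in someone's memory of which buttons to click.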
Event Data vs Snapshot Data
Here’s something the book explains really well. Not all data behaves the same way.
Trips are event data. Each trip happened, it’s done, it won’t change. You just keep appending new records using WRITE_APPEND. Yesterday’s trips don’t get edited.
Stations are snapshot data. They’re objects in the real world. A station can be demolished, renamed, or get extra capacity. New stations appear. So when you load tomorrow’s station data, you can’t just append it. You’d end up with duplicates: the old version and the new version of the same station sitting side by side.
The simple fix? Use WRITE_TRUNCATE. Replace the whole table every day. Problem solved, no duplicates. But you lose history. If someone asks “how many stations did we have last month?” you can’t answer that.
The book shows a smarter approach: keep a history table with an insert_date column, append everything, then create a view that only shows the latest snapshot using CURRENT_DATE(). Users query the view and see clean current data. When someone needs historical records, the full table is right there.
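The append-plus-view pattern is easy to simulate in plain Python. The sketch below uses the most recent insert_date instead of CURRENT_DATE(), and the station fields are illustrative, but the logic is the same: every daily snapshot is appended, and the "view" filters down to the newest one.

```python
# Append-only history table: every daily snapshot lands here, duplicates and all.
history = [
    {"station_id": "S1", "capacity": 10, "insert_date": "2018-01-01"},
    {"station_id": "S2", "capacity": 20, "insert_date": "2018-01-01"},
    {"station_id": "S1", "capacity": 15, "insert_date": "2018-01-02"},  # updated
    {"station_id": "S2", "capacity": 20, "insert_date": "2018-01-02"},
    {"station_id": "S3", "capacity": 12, "insert_date": "2018-01-02"},  # new
]

def latest_snapshot(rows):
    """Mimic a view like:
    SELECT * FROM history WHERE insert_date = (SELECT MAX(insert_date) FROM history)
    """
    newest = max(r["insert_date"] for r in rows)
    return [r for r in rows if r["insert_date"] == newest]

current = latest_snapshot(history)  # only the 2018-01-02 snapshot survives
```

Users query `current` and see clean, deduplicated data; anyone who needs last month's station list still has the full `history` to dig through.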
There’s also MERGE in BigQuery (insert if new, update if changed) and Slowly Changing Dimensions, but the book just mentions these as concepts to research on your own.
Data Modeling: Why It Matters
This is where Chapter 3 gets really interesting. Adi makes a statement that I strongly agree with: modern data engineers tend to skip data modeling because storage is cheap and BigQuery is fast. And that’s a mistake.
Yes, BigQuery can join 587 GB of data in 23 seconds. Yes, storing 1 TB costs about $20/month. But cheap storage and fast queries don’t fix these problems:
- Data duplicated across many locations
- Inconsistent values between reports
- End users who can’t understand the table structures
- Business people who stop trusting the data
A good data model is self-explanatory. The book gives a simple example: a People table with name, age, hair_color, and gender makes sense immediately. A People table with postal_code, wealthy (yes/no), and inconsistent gender values (Female vs Woman) is confusing. The goal is that business users answer their own questions just by looking at the table structure, without calling the engineering team asking “what does NULL mean here?”
Inmon vs Kimball
The two big names in data warehouse modeling. Both approaches are from the 1990s, and people still argue about them.
Inmon is top-down. You build one highly normalized central warehouse (the “single source of truth”) and create data marts from it. Everything is consistent, but it takes a long time to build and adapting to new data sources is painful.
Kimball is bottom-up and user-driven. Start with business questions, build fact tables (measurements) and dimension tables (entity attributes) to answer them. Faster to deliver value, but consistency across the whole organization is not guaranteed.
Which one for BigQuery? The book says: both, depending on your situation. Rock-solid financial report? Lean toward Inmon. Marketing team that gets new data sources weekly? Kimball.
For the bike-sharing scenario, the book uses Kimball’s star schema. The business questions are clear, consistency requirements are moderate, and the fact/dimension table split makes the data easy to understand.
The fact table (fact_trips_daily) holds measurements: total trips, average duration, sum of duration per station per day. The dimension table (dim_stations) holds station attributes: name, region, capacity. A business user can join these two tables and answer all four original questions without touching the raw data.
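To show what building fact_trips_daily actually means, here is a toy aggregation in Python. In practice this would be a scheduled SQL query; the column names are illustrative, not the book's exact schema.

```python
from collections import defaultdict

# Raw trip events (the source the business user should NOT have to touch).
trips = [
    {"start_station_id": "S1", "trip_date": "2018-01-01", "duration_sec": 600},
    {"start_station_id": "S1", "trip_date": "2018-01-01", "duration_sec": 300},
    {"start_station_id": "S2", "trip_date": "2018-01-01", "duration_sec": 900},
]

def build_fact_trips_daily(rows):
    """Aggregate per (date, station): trip count, total and average duration."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["trip_date"], r["start_station_id"])].append(r["duration_sec"])
    return [
        {
            "trip_date": d,
            "start_station_id": s,
            "total_trips": len(durs),
            "sum_duration_sec": sum(durs),
            "avg_duration_sec": sum(durs) / len(durs),
        }
        for (d, s), durs in groups.items()
    ]

fact = build_fact_trips_daily(trips)  # one row per station per day
```

Joining this fact table to dim_stations on start_station_id is then a single, obvious step for any analyst.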
Nested Data Types in BigQuery
Here’s a BigQuery-specific trick. The eternal fight between normalized tables (good for storage, need JOINs) and denormalized tables (easy to query, waste storage) has a third option in BigQuery: nested data types.
You can store stations as repeated records inside a regions table. One row per region, with an array of stations nested inside it. Region info is stored once, no duplication, and you don’t need a JOIN to get stations with their regions.
The tradeoff? It confuses people who are used to flat SQL tables. But it’s a useful tool to have in your back pocket.
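A nested regions row is easiest to picture as a dict with an array inside it. The sketch below shows the shape and a flattening step analogous to BigQuery's UNNEST; the field names are my own illustration.

```python
# One BigQuery row with a REPEATED RECORD column, sketched as a Python dict.
region_row = {
    "region_id": 1,
    "region_name": "Downtown",   # stored once, not duplicated per station
    "stations": [                # the nested array: ARRAY<STRUCT<...>>
        {"station_id": "S1", "name": "Main St", "capacity": 10},
        {"station_id": "S2", "name": "5th Ave", "capacity": 20},
    ],
}

def unnest(row):
    """Flatten, like: SELECT region_name, s.* FROM regions, UNNEST(stations) AS s"""
    return [
        {"region_name": row["region_name"], **station}
        for station in row["stations"]
    ]

flat = unnest(region_row)  # two flat rows, region info repeated on each
```

Notice the tradeoff in miniature: the nested form stores "Downtown" once, while the flattened form repeats it on every station row.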
BigQuery Features Tour
The last section of the chapter is a quick tour of BigQuery features beyond basic querying.
Data Transfer Service (DTS) pulls data from sources like Amazon S3, Google Ads, and YouTube directly into BigQuery. Useful if your source is on the supported list; skip it otherwise.
Scheduled Queries let non-engineers schedule SQL to run on a timer. Fine for analysts, but as a data engineer you should use proper orchestration tools like Cloud Composer instead.
Analytics Hub is for data exchange between organizations or teams. Think publish/subscribe for datasets.
Dataform handles data transformation (covered later in the governance chapter).
BigQuery Routines include stored procedures, UDFs (user-defined functions), table functions, and remote functions. UDFs are especially useful: instead of everyone writing their own formula, you create a shared function that keeps business logic consistent.
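The "one shared definition" idea behind UDFs looks roughly like this. The dataset and function names are hypothetical; the Python function below is just an analogue of the same principle, not BigQuery code.

```python
# In BigQuery the shared logic might be created once as a SQL UDF, e.g.:
#   CREATE FUNCTION util.trip_duration_min(sec INT64) AS (sec / 60);
# Python analogue: one shared definition that every report imports,
# instead of each analyst re-deriving the formula (and diverging).
def trip_duration_min(duration_sec: int) -> float:
    """The single, agreed-upon definition of 'trip duration in minutes'."""
    return duration_sec / 60
```

If the business later decides durations should exclude dock time, you change one function and every downstream query picks it up.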
Partitioned Tables are the most important feature covered here. Partition a table by a date column and BigQuery divides the data into segments. Query with a date filter and it only reads that partition, not the whole table. This saves money (with on-demand pricing you pay per byte scanned) and makes queries faster.
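Partition pruning is easy to simulate: bucket rows by the partition column, and a filtered query only ever touches one bucket. The data and function below are a toy illustration, not BigQuery's actual storage format.

```python
from collections import defaultdict

# Toy table "partitioned" by trip_date: one bucket per day.
partitions = defaultdict(list)
rows = [
    {"trip_date": "2018-01-01", "duration_sec": 600},
    {"trip_date": "2018-01-01", "duration_sec": 300},
    {"trip_date": "2018-01-02", "duration_sec": 900},
]
for r in rows:
    partitions[r["trip_date"]].append(r)

def query_one_day(day):
    """A filter on the partition column reads only that partition's rows."""
    scanned = partitions[day]  # other partitions are never touched
    return len(scanned), sum(r["duration_sec"] for r in scanned)

scanned_rows, total = query_one_day("2018-01-01")  # scans 2 rows, not 3
```

On a real table the difference is not 2 rows vs 3 but gigabytes vs terabytes scanned, which is exactly where the cost savings come from.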
What I Took Away
Chapter 3 Part 2 is really two lessons in one. The practical lesson is about loading batch data, handling event vs snapshot patterns, and building fact/dimension tables. The conceptual lesson is that data modeling still matters, even when your warehouse is fast and storage is cheap.
The best data engineers I’ve worked with always think about how humans will use the data, not just how to get it from point A to point B. That’s exactly what this chapter tries to teach.
This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 3 Part 1: Your First BigQuery Warehouse or continue to Chapter 4 Part 1: Cloud Composer Workflows.