Data Engineering with GCP Chapter 3 Part 1: Your First BigQuery Data Warehouse

Chapter 3 is where things get real. Up to now the book has been setting the stage, explaining what data engineering is and showing you around GCP. Now Adi Wijaya says: okay, let’s actually build something. And the something is a data warehouse in BigQuery.

The chapter is big, so I’m splitting it into two parts. This first part covers the basics: what GCS and BigQuery are, how the BigQuery console works, loading data, and building your first end-to-end pipeline from CloudSQL through GCS into BigQuery.

The Dream vs Reality of Data Warehouses

Adi starts with a really honest point. The whole dream of a data warehouse is simple: put all your data in one place so business people can trust it and make decisions from it. That’s the dream.

Reality? Not so smooth. There are three big challenges that come up every time: technology bottlenecks, data consistency problems, and trying to serve multiple business needs at once. These aren’t specific to any one tool or company. They just show up whenever you try to centralize data. The rest of the chapter is about working through these challenges hands-on using BigQuery.

Google Cloud Storage and BigQuery Basics

Two services you need to know before building anything.

Google Cloud Storage (GCS) is object storage. Think of it as a giant bucket where you can throw any file of any size and any format. It’s serverless, meaning you don’t worry about disk space, servers, or network. For data engineers, GCS is where you dump database exports, store historical data, keep ML model files, and stage data before loading it somewhere else.

BigQuery is Google’s serverless data warehouse. You store data, you query it with SQL, you don’t think about infrastructure. It can handle petabytes. Under the hood, it stores data in a distributed file system called Google Colossus (the successor to the system that inspired Hadoop’s HDFS) and processes queries using a SQL engine inspired by Google’s internal tool called Dremel.

As a regular user, you don’t need to care about Colossus or Dremel. But Adi makes a good point: knowing they exist helps when you’re troubleshooting performance issues later.

One thing BigQuery is not: a transactional database. It can scan 1 TB of data in about 5 seconds, but that doesn’t mean it’ll run a 1 MB query in microseconds. It’s built for analytics, not for serving your app’s login page.

Data Location Matters

When you create a BigQuery dataset, you pick a region. That’s where your data physically lives and gets processed. This matters for two reasons: regulatory compliance (some organizations need data in specific countries) and practical stuff like making sure your GCS bucket and BigQuery dataset are in the same region. You can’t load data from a GCS bucket in Asia into a BigQuery dataset in the US. They have to match.
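To make the co-location point concrete, here is a minimal sketch of creating a bucket and a dataset in the same region with the `gsutil` and `bq` CLIs. The project, bucket, and dataset names are made up for illustration:

```shell
# Create a GCS bucket and a BigQuery dataset in the same region
# (hypothetical names; the key point is the matching location)
gsutil mb -l us-central1 gs://my-bikesharing-staging
bq mk --location=us-central1 --dataset my-project:raw_bikesharing
```

If the bucket and dataset regions don’t match, the load job simply fails, so it’s worth deciding on a region up front.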

The BigQuery Console

The console has three main sections: a left menu bar with things like SQL Workspace and Data Transfers, an Explorer panel where you see your projects and datasets, and the query editor where you write and run SQL.

You can literally start with SELECT 'hello world' and hit Run. The results show up at the bottom along with job information, execution details, and query history. Nothing fancy, but it’s important to get comfortable here because you’ll spend a lot of time in this console.
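The same hello-world query also runs from the `bq` command-line tool, which is handy once you move beyond the console. A minimal sketch:

```shell
# Run a trivial query from the CLI instead of the console
bq query --use_legacy_sql=false "SELECT 'hello world' AS greeting"
```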

Datasets, Tables, and Loading CSV Files

In BigQuery, a “dataset” is basically a container for tables. But Adi makes an interesting distinction. Other databases call these “databases” or “schemas,” which makes engineers think in technical terms. BigQuery calls them “datasets” on purpose, because it pushes you to name them by what the data represents. Instead of naming things “Layer_1” or “Layer_2,” you’d name them “Customer_Savings” or “Website_Logs.” Small thing, but it changes how you think about organizing data.

Loading data is straightforward. The simplest way is uploading a local CSV file right from the console. You create a dataset, click the action button on it, choose “Create Table,” point it at your file, define the schema, and you’re done. Not the most common approach in production (that would be loading from GCS or using Dataflow), but great for learning.
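The console upload has a CLI equivalent via `bq load`. This is a sketch with hypothetical dataset, table, file, and column names, showing an inline schema definition:

```shell
# Load a local CSV into a BigQuery table, skipping the header row
# (dataset/table/file/column names are illustrative)
bq load --source_format=CSV --skip_leading_rows=1 \
  my_dataset.stations ./stations.csv \
  station_id:STRING,name:STRING,region_id:STRING,capacity:INTEGER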

BigQuery Public Datasets

If you’re just learning and don’t have your own data, BigQuery comes with a bunch of public datasets. Some are small and simple, but others like StackOverflow, crypto_ethereum, and github_repos are real, messy, multi-table datasets that actually look like production databases.

One practical tip from the book: always check the table size before querying public data. Stick to tables under 1 GB while you’re experimenting. The free tier gives you 1 TB of query processing per month, but you can burn through that fast if you’re not careful with big tables.
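One quick way to check a table’s size before querying it is `bq show`, which reports the stored bytes in its JSON output. A sketch against a real public table:

```shell
# Inspect table metadata; numBytes tells you how big the table is on disk
bq show --format=prettyjson \
  bigquery-public-data:stackoverflow.posts_questions | grep numBytes
```

Remember that what counts against your quota is the bytes *scanned* by a query, so selecting only the columns you need also keeps costs down.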

Data Types: Simpler Than You Think

If you come from traditional databases like PostgreSQL, Oracle, or Teradata, you’re used to choosing between char, varchar, varchar(n), and text for strings, or between smallint, int, int8, int32, and int64 for numbers. BigQuery has just one of each: STRING and INT64 (also written INTEGER). That’s it. You don’t need to think about memory allocation or storage efficiency because BigQuery handles all of that internally. A one-character string and a 100-character string are stored equally efficiently.

One specific thing worth noting: BigQuery stores all timestamps in UTC. Always. This is actually a good thing because it forces data consistency. If your company has offices in different time zones, everyone’s working with the same base format. You can convert to local time in your queries whenever you need to.
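Converting UTC timestamps to local time is a one-function job in BigQuery SQL. A small sketch (the time zone is just an example):

```shell
# Show the current UTC timestamp alongside its local-time conversion
bq query --use_legacy_sql=false \
  "SELECT CURRENT_TIMESTAMP() AS utc_ts,
          DATETIME(CURRENT_TIMESTAMP(), 'Asia/Jakarta') AS jakarta_local"
```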

Building Your First Pipeline: CloudSQL to BigQuery

This is where the chapter gets hands-on with a realistic scenario. Here’s the setup:

A regional manager needs to find the top two regions by total station capacity from a bike-sharing dataset. The source data lives in a CloudSQL MySQL database. They want to download results as CSV. And more data sources will be added in the future.

Adi walks through the thinking process. Since data comes from multiple systems and there will be more sources later, this is clearly a data warehouse situation. The plan becomes a four-step pipeline:

  1. Create a MySQL database in CloudSQL with a stations table and load CSV data into it. This simulates having a real application database as your data source.

  2. Extract from MySQL to GCS. This is the “E” in ELT. You use gcloud commands to export table data as CSV files into a GCS bucket. There’s a detour into IAM here because the CloudSQL service account needs permission to write to GCS. It’s a small but important real-world detail.

  3. Load from GCS to BigQuery. Create a dataset called raw_bikesharing, then create a table from the CSV file in GCS. You define the schema manually since the MySQL export doesn’t include headers.

  4. Create a data mart. Business users typically don’t access raw tables directly. You create a separate dataset (dm_regional_manager) and build a view that answers the specific business question. The view aggregates station capacity by region and returns the top two.
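The extract, load, and data-mart steps above can be sketched as a handful of CLI commands. All instance, bucket, and service-account names here are placeholders, not the book’s exact values:

```shell
# IAM detour: let the CloudSQL service account write to the bucket
# (the service account email is hypothetical; find yours in the instance details)
gsutil iam ch \
  serviceAccount:my-instance-sa@my-project.iam.gserviceaccount.com:objectAdmin \
  gs://my-bucket

# Step 2: export the stations table from CloudSQL to GCS as CSV
gcloud sql export csv my-mysql-instance \
  gs://my-bucket/mysql_export/stations/stations.csv \
  --database=apps_db --query="SELECT * FROM stations"

# Step 3: load the headerless CSV into BigQuery with an explicit schema
bq load --source_format=CSV raw_bikesharing.stations \
  gs://my-bucket/mysql_export/stations/stations.csv \
  station_id:STRING,name:STRING,region_id:STRING,capacity:INTEGER

# Step 4: build the data mart view that answers the business question
bq query --use_legacy_sql=false \
"CREATE OR REPLACE VIEW dm_regional_manager.top_2_region_by_capacity AS
 SELECT region_id, SUM(capacity) AS total_capacity
 FROM raw_bikesharing.stations
 GROUP BY region_id
 ORDER BY total_capacity DESC
 LIMIT 2"
```

Because the transformation happens inside BigQuery after loading the raw data, this is ELT rather than classic ETL.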

The result? A view that the regional manager can query and export to CSV. End-to-end ELT on GCP.

Adi is honest about the limitations: this isn’t really a full data warehouse yet. It’s one table, no real data modeling, and the pipeline is manual. But you’ve touched every piece of the puzzle: CloudSQL, GCS, BigQuery, IAM permissions, datasets, tables, views, and data marts. That’s a solid foundation.

The real data warehouse work, including data modeling, handling multiple tables, and thinking about how to structure your data, comes in Part 2.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 2: Big Data on GCP or continue to Chapter 3 Part 2: Data Modeling in BigQuery.
