Data Engineering with GCP Chapter 10 Part 1: Data Governance Basics on Google Cloud

Data governance is one of those topics that sounds boring until you realize nobody can find anything in your data platform. Then it becomes very interesting very fast.

Chapter 10 is the biggest chapter in the book, so I’m splitting it into two parts. This first part covers the foundations: what data governance is, why data engineers should care, metadata management with Dataplex, and data lineage. Part 2 will cover security, data protection, and data quality.

What Data Governance Actually Means

Data governance is the set of processes, policies, and practices that organizations use to manage their data ecosystem. That definition sounds broad because it is. Every organization will define it differently.

Adi makes an important point: roles vary by organization size. In smaller companies, data engineers handle governance themselves. In bigger ones, there’s a dedicated data governance team.

The book breaks data governance into three pillars:

  • Usability - can people actually find and understand the data?
  • Security - is the data protected from unauthorized access?
  • Accountability - do we know what happened, who did it, and when?

Each of these branches into more specific concerns. But before jumping into tools, Adi emphasizes something many teams miss: understanding why you’re doing data governance matters more than which tools you pick. Wrong motivation leads to wasted effort.

Tools Are Not Enough, You Need Practice Too

Here is an analogy from the book that stuck with me. Think about the word “government.” You think of rules, facilities, and enforcers. But what actually makes government work is that people agree to follow the rules. Same with data governance.

For example, Dataform can help implement unit testing on BigQuery. But Dataform by itself won’t improve data quality. It requires people to consistently write and run proper tests. Without the practice, the tool does nothing. Without the tool, some practices are very hard to follow. You need both.
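To make that concrete, here is a minimal sketch of a Dataform table definition with assertions. The file path, table, and column names are my own invented example, not from the book; Dataform compiles assertions like these into queries that fail the run if they return any rows.

```sql
-- definitions/orders.sqlx (hypothetical names)
config {
  type: "table",
  assertions: {
    nonNull: ["order_id", "order_date"],
    uniqueKey: ["order_id"]
  }
}

-- The assertions above fail the run if order_id or order_date is ever NULL,
-- or if order_id is not unique in the resulting table.
SELECT order_id, order_date, amount
FROM ${ref("raw_orders")}
```

The point stands either way: this file does nothing for data quality unless the team actually keeps the assertions up to date and treats a failed run as something to fix.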

This is why getting everyone on the same page about motivation matters. Every person in the data ecosystem needs to understand and share the same reasons for doing governance.

Data Usability: Can People Find Your Data?

The first pillar is usability. Two questions: can users find the data they need, and can they understand it once they find it?

You might think “of course people can find the data.” But Adi shares a real example: a big retail company with data spread across 800 GCP projects. One project could have 200,000 tables. One table could have 200 columns. Now imagine finding what you need just by table and column names. Good luck.

This is where metadata and cataloging come in. Like Google Search for the internet, you need a search experience for your internal data. On GCP, that tool is Dataplex Data Catalog.

Dataplex: The Data Governance Platform

Dataplex is GCP’s data governance platform. “Platform” is the key word because Dataplex is not one specific tool. It’s an umbrella for several independent services:

  • Dataplex Search and Catalog - manages metadata tagging, indexing, and search. This is the successor to the older Data Catalog service.
  • Dataplex Lakes - organizes data into lakes and zones, and manages the data lake metastore and access control.
  • Dataplex Profile - automatically profiles BigQuery tables.
  • Dataplex Data Quality - automatically checks data quality on BigQuery tables.
  • Dataplex Data Lineage - auto-generated BigQuery table lineage based on DDL and DML queries.

Important for first-time users: all these functionalities are independent. You don’t need Dataplex Lakes before using Dataplex Profile. Pick the service you need and ignore the rest. The console layout might suggest otherwise, but trust me, there are no prerequisites between Dataplex services.

Tags and Tag Templates: Making Data Searchable

The practical heart of Data Catalog is tags. You create tag templates that define what kind of metadata you want to track, then attach tag values to your BigQuery tables and columns.

The workflow is straightforward. Create a tag template in Dataplex Catalog, for example “data-ownership” with fields like “Data Owner” and “Business Owner.” Then go to any BigQuery table in Dataplex Search, attach the template, and fill in values. Tags work at both table and column levels.

Once tags are assigned, users can search for tables using those tags as filters in the Dataplex Search console.

This sounds simple with one template and one table. But scale it up to millions of tables with hundreds of business contexts, and you start to see why this matters. Every organization will have different tag templates. The book shows a design where tags split into Business Tags (data domain, owner, use case) and Technical Tags (freshness, pipeline info, SLA).

Adi recommends defining tag templates through workshops with stakeholders. If only one person designs the tags in isolation, adoption will be poor.

Data Modeling for Understanding

Finding data is only half of usability. The other half is understanding what you found. Two things help: good table and column descriptions, and good data modeling.

Maintaining descriptions is surprisingly hard. Everyone agrees it's important; nobody wants to do it. One approach is using GitOps to maintain table definitions in version control, which the book covers later with Dataform.

Good data modeling means structuring tables to represent the business model clearly. Adi compares two versions of a “People” table. One has cryptic column names. The other has clear names, proper types, and descriptions. The difference in usability is huge.
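The book's exact example isn't reproduced here, but a hedged sketch of the same contrast (all names invented) looks like this in BigQuery DDL. Both tables hold the same data; only the second tells you what it means.

```sql
-- Version 1: cryptic (hypothetical)
CREATE TABLE dataset.ppl (
  c1 STRING,
  c2 STRING,
  c3 INT64
);

-- Version 2: self-describing (hypothetical)
CREATE TABLE dataset.people (
  person_id STRING OPTIONS (description = 'Unique person identifier'),
  full_name STRING OPTIONS (description = 'Legal full name'),
  age       INT64  OPTIONS (description = 'Age in years at registration')
)
OPTIONS (description = 'One row per registered person');
```

The descriptions also surface in Dataplex Search, so the modeling effort pays off twice.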

Data Lineage: Knowing Where Data Comes From

Under the accountability pillar, data lineage is one of the most talked-about topics. The usual question: for a given BigQuery table, what upstream tables and processes created it?

On GCP, you have a few options. You can rely on ETL tools like Cloud Composer or Data Fusion. But there are also two GCP-native approaches.

Dataplex Data Lineage is the out-of-the-box option. Enable the Data Lineage API, and BigQuery automatically generates lineage based on DDL and DML queries. You can see it in the LINEAGE tab of any BigQuery table. The benefit: zero extra work. The downside: it’s automated and sometimes doesn’t capture exactly what you need.
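As a rough illustration (table names invented), a simple CTAS statement like this is enough for the Data Lineage API to record an upstream/downstream relationship:

```sql
-- With the Data Lineage API enabled, running this registers
-- project.dataset.daily_sales as downstream of project.dataset.raw_orders.
CREATE OR REPLACE TABLE `project.dataset.daily_sales` AS
SELECT order_date, SUM(amount) AS total_amount
FROM `project.dataset.raw_orders`
GROUP BY order_date;
```

No extra tooling, no annotations; the lineage graph comes straight from the query text.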

Dataform gives you full control. Since you define all transformations as SQL files with explicit table references, the lineage is precise and complete. The tradeoff: you need to manage your transformations through Dataform.
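A minimal sketch of what that looks like in a Dataform SQLX file, with hypothetical names. The explicit ${ref(...)} is what makes the dependency graph, and therefore the lineage, exact:

```sql
config { type: "table" }

-- definitions/daily_sales.sqlx (hypothetical): ref() declares raw_orders
-- as an upstream dependency that Dataform tracks in its compiled graph.
SELECT
  order_date,
  SUM(amount) AS total_amount
FROM ${ref("raw_orders")}
GROUP BY order_date
```

The cost is that every transformation has to live in Dataform for the graph to stay complete.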

Traceability and Data Ownership

Two more aspects of accountability worth mentioning briefly.

Clear traceability means tracking who did what and when. GCP makes this straightforward through Cloud Logging, which collects logs from BigQuery, GCS, Dataproc, Cloud Composer, and Dataflow. For BigQuery specifically, you can query INFORMATION_SCHEMA views to answer questions like which users ran the most queries in the past 30 days. The Log Router lets you export logs to BigQuery for deeper analysis.
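As a hedged example of that kind of INFORMATION_SCHEMA query (the `region-us` prefix is an assumption; substitute your own region):

```sql
-- Top users by number of queries run in the past 30 days.
SELECT user_email, COUNT(*) AS query_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email
ORDER BY query_count DESC
LIMIT 10;
```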

Data ownership is trickier. Who is responsible for a given table? You can’t always tell from logs because tables are often created by service accounts, not humans. Adi recommends tracking human owners through BigQuery table labels or metadata tags. This requires consistent practice from the whole team.
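One way to follow that recommendation is BigQuery table labels. A sketch with invented names (note that label keys and values must be lowercase):

```sql
-- Record a human owner on a table, even if a service account created it.
ALTER TABLE `project.dataset.daily_sales`
SET OPTIONS (
  labels = [('data_owner', 'jane_doe'), ('team', 'sales_analytics')]
);
```

The label is only as trustworthy as the team's discipline in setting it, which is exactly the tools-plus-practices point again.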

Wrapping Up Part 1

Data governance is not one tool or one practice. It’s how your organization finds, understands, secures, and takes responsibility for its data. Key takeaway: tools without practices are useless, and practices without tools are painful. You need both.

In Part 2, we’ll cover security: data encryption, column-level access control, data masking, PII detection with Sensitive Data Protection, and data quality with Dataform.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 9: User and Project Management or continue to Chapter 10 Part 2: Data Quality and Security.
