Data Engineering with GCP Chapter 10 Part 2: Data Quality, Security, and Compliance

In Part 1 we covered how data governance breaks into three pillars (usability, security, accountability) and went through metadata, Dataplex search, access control in BigQuery, and the Sensitive Data Protection service for finding PII. Now let’s pick up where we left off: understanding what SDP actually finds, and then moving into the accountability pillar.

What SDP Discovery Actually Tells You

After you set up SDP Discovery and let it scan your BigQuery tables, you get something called Data Profiles. These show up in the SDP console and list every table that may contain PII columns.

The interesting part is how the predictions work. For each column, SDP gives you a “Predicted infoType” (its best guess, like PERSON_NAME or EMAIL_ADDRESS) and “Other infoTypes with Estimated prevalence.” The second one means SDP detected PII, but not in every row. Maybe 12% of records in a free-text column contain person names, 9% contain domain names. That percentage matters because it tells you how widespread the problem is.

Each infoType comes with pre-configured data risk and sensitivity scores. Gender is moderate risk. Email addresses are higher. You can find the full definitions in Google’s docs, and these scores are configurable through your SDP template.

Here is what impressed me. SDP can find PII even in free-text columns. Think about Stack Overflow comments. People write whatever they want. Without SDP, you would need someone to manually read thousands of records to check if anyone pasted their name or email into a comment. SDP gives you a free-text score (how likely a column is to be unstructured text) and scans it for hidden PII. In the book’s example, it detected person names in about 12% of comment records. That would take a human analyst weeks to find.

Accountability: Who Did What and When

The third pillar of data governance is accountability. And no, this is not about finding someone to blame when things go wrong. The point is to have systems in place that track all important events so that when something does go wrong, you can actually fix it quickly.

Accountability breaks down into four parts: traceability, data ownership, data lineage, and data quality.

Clear Traceability with Cloud Logging

Traceability means knowing who did what and when. Which user created a table with sensitive data? Which queries consumed the most BigQuery capacity this month? Who’s running the most expensive jobs?

In GCP, this is mostly handled for you. All GCP product logs are tracked and stored in Cloud Logging. BigQuery, GCS, Dataproc, Cloud Composer, Dataflow, everything gets logged. You can also export logs to BigQuery using the Log Router feature and run SQL against them for real analysis.
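A Log Router export is just a logging sink pointed at a BigQuery dataset. A minimal sketch with `gcloud` (the sink name, project, dataset, and filter here are illustrative placeholders, not values from the book):

```shell
# Create a Log Router sink that exports BigQuery audit logs
# into a BigQuery dataset for SQL analysis.
# "bq-audit-sink", "my-project", and "bq_audit_logs" are placeholders.
gcloud logging sinks create bq-audit-sink \
  bigquery.googleapis.com/projects/my-project/datasets/bq_audit_logs \
  --log-filter='resource.type="bigquery_resource"'
```

After creating the sink, remember that its writer identity (a service account shown in the command output) needs BigQuery Data Editor access on the target dataset before logs start flowing.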

BigQuery also exposes its own job and usage metadata through INFORMATION_SCHEMA views. These are Google-managed views you can query directly. Want to know which user ran the most queries in the past 30 days? One SQL query against INFORMATION_SCHEMA.JOBS. Spend time exploring these views. They are the foundation for monitoring your BigQuery environment.
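As a concrete sketch, here is roughly what that "who ran the most queries" question looks like (the `region-us` qualifier is an assumption; use your own region):

```sql
-- Top users by job count over the last 30 days,
-- with billed terabytes as a rough cost proxy.
SELECT
  user_email,
  COUNT(*) AS job_count,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email
ORDER BY job_count DESC
LIMIT 10;
```

Swap `job_count` for `tib_billed` in the ORDER BY and the same query answers "who is running the most expensive jobs" instead.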

Data Ownership: Not as Simple as You Think

Here is a problem I have seen many times. Someone asks “who owns this table?” and nobody knows. The table was created by a Cloud Composer service account. The logs show the service account name, not the human who wrote the pipeline.

You cannot rely on automatic tracking to identify human data owners. Service accounts create most tables, not people. So you need to actively manage ownership using BigQuery table labels or metadata tags. The team has to agree on a standard (like adding an “owner” label to every table) and follow it consistently.
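In BigQuery, that "owner" standard is just a table label set through DDL. A minimal sketch (the dataset, table, and label values are illustrative):

```sql
-- Attach an "owner" label at creation time.
CREATE TABLE mydataset.events (
  event_id STRING,
  event_ts TIMESTAMP
)
OPTIONS (
  labels = [("owner", "analytics-team")]
);

-- Or add the label to a table that already exists.
ALTER TABLE mydataset.events
SET OPTIONS (labels = [("owner", "analytics-team")]);
```

Labels are queryable, so once the convention is in place you can audit it: any table missing an `owner` label shows up in a simple metadata query.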

In my experience, this only works if the organization enforces the rules. Otherwise six months in, half the tables have no owner label and you are back to guessing.

Data Lineage: Where Did This Data Come From

Data lineage answers the question: what upstream tables and processes were needed to produce this table? In a complex data ecosystem with hundreds of tables feeding into each other, this is not trivial.

GCP gives you two main options. First, Dataplex data lineage. You enable the Data Lineage API, and Dataplex automatically generates lineage based on the queries that run against your tables. The upside is that it’s automatic. The downside is that automated lineage does not always match what you actually need. Sometimes the generated graph is incomplete or confusing.

The second option is Dataform, which gives you full control over lineage because you explicitly define table dependencies in your code. More on this in a moment.

Data Quality: Stop Thinking in Black and White

This section of the book has one of the best analogies I have read about data quality. Adi compares it to a water pipeline. If you drink dirty water once, you distrust the entire water system. Same thing happens with data. One bad report, and the business loses trust in the whole data platform.

But here is the thing: expecting perfect data quality in a complex system is like expecting zero bugs in complex software. It will not happen. The right approach is to treat data quality like software quality. You need clear processes and measurements, not a binary “good” or “bad” label.

Dataplex Data Quality gives you built-in rules to measure table quality (null checks, range checks, uniqueness, and so on). But measurement alone is not enough. You need automated testing in your data pipeline. And that is where Dataform comes in.

Dataform: The Missing Piece

Dataform is probably the most practical tool covered in this chapter. A lot of people confuse it with Dataflow or Cloud Composer, so let me clarify. Dataform does not process data. It manages SQL scripts.

Here is the positioning: Dataflow and Dataproc process data. Cloud Composer orchestrates pipelines. Dataform manages the T in ELT. It is a Git-based platform where you write .sqlx files, commit them, create releases, and execute them. Every workspace maps to a Git branch. Releases compile your SQL and produce BigQuery scripts ready to run.

The environment has four layers: repository (connected to a Git repo), development workspaces (one per developer or team, mapped to branches), release configurations (compile and package the code), and workflow executions (run the compiled releases).
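What you actually commit into those workspaces are .sqlx files. A minimal sketch of one (the file, schema, and table names are illustrative, not from the book's exercise):

```sql
-- definitions/stg_events.sqlx
config {
  type: "table",
  schema: "staging",
  description: "Staging table built from the declared raw source."
}

SELECT
  event_id,
  user_id,
  TIMESTAMP_MILLIS(event_ts) AS event_time
FROM ${ref("raw_events")}
```

The `${ref("raw_events")}` call is what makes Dataform's lineage explicit: it both resolves to the fully qualified table name at compile time and registers the dependency, so the execution order and the lineage graph come straight from your code.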

Why Dataform Matters for Governance

The book walks through a hands-on exercise where you build a multi-layer transformation: source declarations, staging tables, a transformed events table, and an access layer view. Then Adi improves the code to show how Dataform enforces governance.

With a few lines in the config block of a .sqlx file, you can specify the target dataset, add table and column descriptions, attach ownership labels, define partition columns, and add assertion-based tests. The assertions are the key feature. You define rules like “event_id must be unique” and Dataform checks them every time you run. If you break uniqueness (the book demonstrates this by setting event_id to a static value), execution fails: “Assertion failed, expected zero rows.”
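Put together, the config block of such a .sqlx file might look like this sketch (dataset, labels, and column names are illustrative; only the general shape follows Dataform's config syntax):

```sql
-- definitions/events.sqlx
config {
  type: "table",
  schema: "dwh",
  description: "Deduplicated events table.",
  columns: {
    event_id: "Unique identifier for each event."
  },
  bigquery: {
    partitionBy: "DATE(event_time)",
    labels: { owner: "analytics-team" }
  },
  assertions: {
    uniqueKey: ["event_id"],
    nonNull: ["event_id", "event_time"]
  }
}

SELECT DISTINCT event_id, user_id, event_time
FROM ${ref("stg_events")}
```

Each assertion compiles into a query that must return zero rows; a duplicated `event_id` produces rows, and the run fails exactly as described above.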

This is how data quality should work. Not as a separate manual audit, but as automated tests baked into the pipeline. If you come from software engineering, think of it as unit tests for your data transformations.

Chapter Summary

Chapter 10 covered a lot of ground across both parts. The three-pillar framework (usability, security, accountability) gives you a mental model for thinking about data governance without getting lost in dozens of tools.

Usability is about making data easy to find and understand. Security is about access control down to the column level and discovering sensitive data you did not know you had. Accountability is about traceability, explicit ownership, lineage, and automated quality testing.

If I had to pick one takeaway: data governance is not a one-time project. It is a set of practices built into how your team works every day. Instead of asking people to follow rules manually, encode the rules into the pipeline itself. That is the only way governance works at scale.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 10 Part 1: Data Governance Basics or continue to Chapter 11: Cost Strategy.
