Data Engineering with GCP Chapter 9: Managing Users and Projects in Google Cloud
Chapter 9 is the one where Adi Wijaya zooms out from data pipelines and asks: okay, but who can access what, and how do we keep this whole thing organized? If the previous chapters taught you how to build things in GCP, this one teaches you how to not let those things turn into a security and management mess.
IAM: Who Gets to Touch What
IAM stands for Identity and Access Management. The concept is straightforward. You have accounts, and you give those accounts roles. Roles contain permissions. Permissions let you do specific things with specific GCP services.
Two types of accounts matter here. User accounts are for humans. Your email, your login. Service accounts are for machines and applications. When Cloud Composer runs your data pipeline at 3 AM, it is not using your personal Gmail. It uses a service account. This is important because people quit jobs, go on vacation, change teams. Service accounts stay put.
Every GCP project comes with a default service account. Adi says never use it. Always create dedicated service accounts for specific purposes. Better security, easier to track who does what.
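As a sketch of that advice, creating a dedicated service account with the gcloud CLI might look like this (the project ID and account name here are invented examples, not from the book):

```shell
# Create a dedicated service account for one specific purpose,
# instead of reusing the project's default service account.
# (project ID and names are hypothetical)
gcloud iam service-accounts create etl-pipeline \
    --project=my-data-project \
    --display-name="ETL pipeline service account"

# Confirm it exists
gcloud iam service-accounts list --project=my-data-project
```

One account per purpose also makes audit logs readable: when something runs at 3 AM, the logs name the pipeline, not a shared catch-all identity.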
Now, about roles. There are three kinds: basic, predefined, and custom. Basic roles like Owner or Editor are the big hammers. They give access to everything in a project. Fine for learning and experimentation, terrible for production. If you give a data engineer the Editor role, they can suddenly touch Kubernetes, App Engine, and a dozen services they have no business going near.
Predefined roles are what you should use in real life. They are scoped to specific services. For example, BigQuery Data Viewer only lets someone see data in BigQuery, nothing else. Custom roles exist for edge cases where predefined ones do not fit, but they add operational overhead.
The golden rule here is the principle of least privilege. Give people only the permissions they actually need. Nothing more.
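In practice, least privilege means each grant pairs a specific member with a specific predefined role. A hedged sketch with gcloud (project ID and email are made up):

```shell
# Grant only BigQuery read access -- nothing else.
# (project ID and member email are hypothetical)
gcloud projects add-iam-policy-binding my-data-project \
    --member="user:analyst@example.com" \
    --role="roles/bigquery.dataViewer"
```

Contrast this with roles/editor, which would hand the same analyst write access to nearly every service in the project.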
Planning Your GCP Project Structure
Throughout the book, all exercises happened in a single GCP project. That works for learning. In the real world, organizations usually need more than one project.
Adi shows three alternatives. First: separate projects by workload. One project for the core application and database, another for the data warehouse stack (BigQuery, Cloud Composer, GCS), and a third for ML work with Vertex AI. Second: put everything in one single project. Third: give every service its own project.
There is no universally correct answer. Startups with small teams? One project is fine. You want to ship your MVP, not spend weeks managing project hierarchies. Larger corporations with multiple teams using GCP for different purposes? Separate projects make sense because different teams need different permissions and different services.
On top of projects, GCP has folders and organizations. One organization sits at the root. Folders group projects underneath. The real benefit of folders is IAM inheritance. If you grant someone BigQuery Viewer access at a folder level, they automatically get that access in every project inside that folder.
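That inheritance can be sketched as a single folder-level binding (the folder ID and group address below are placeholders):

```shell
# Grant BigQuery Data Viewer on a folder; every project inside
# the folder inherits the binding automatically.
# (folder ID and group email are hypothetical)
gcloud resource-manager folders add-iam-policy-binding 123456789012 \
    --member="group:analysts@example.com" \
    --role="roles/bigquery.dataViewer"
```

One binding at the folder level replaces one binding per project, which is exactly the payoff Adi describes.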
Adi makes a good point about folder design. Most people think top-down: “Let me create nice tidy folders first and sort projects into them.” That is the wrong approach. Think bottom-up instead. Look at which projects share the same access patterns and group those under a folder. If no shared access pattern exists, a folder gives you zero benefit and just adds overhead.
Three Reasons to Split Projects
Adi boils the decision down to three factors.
IAM and service requirements. If different teams need different permissions and use different GCP services, separate projects help. If everyone needs the same access to the same services, one project is simpler.
Project limits and quotas. Every GCP service has quotas, and quotas are per project. BigQuery, for example, limits concurrent interactive queries to 100 per project. If you have 1,000 active users all hitting one project, you will run into problems. Splitting into multiple projects gives each one its own quota ceiling.
Cost tracking. The GCP billing dashboard shows cost per project. If your organization wants to know how much the data team is spending versus the ML team, separate projects make that easy. The bill still goes to one billing account, but the breakdown is clear.
Controlling Access in BigQuery
Adi goes deeper into BigQuery specifically because that is where data engineers spend most of their time. BigQuery access control works at multiple levels: project, dataset, table, and even column and row.
There are two main types of permissions. Job permissions let you run queries. Access permissions control what data you can actually see. Having the ability to run a query does not mean you can see every table.
The practical example is an eCommerce company with four user groups: Data Engineers, Marketing, Sales, and Head of Analysts. Each group needs different tables. Instead of granting table-level access one by one (which does not scale), Adi recommends grouping tables into datasets based on access patterns. Sales needs checkout and cart tables? Put them in a Transaction dataset. Grant Sales the viewer role on that dataset. Done. New tables added to that dataset automatically inherit the permissions.
This is inheritance doing the heavy lifting for you.
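One way to express that dataset-level grant is BigQuery's SQL DCL, run through the bq CLI (the project, dataset, and group names here are invented for illustration):

```shell
# Give the Sales group read access to the whole transaction dataset;
# tables added to the dataset later inherit the grant automatically.
# (project, dataset, and group names are hypothetical)
bq query --use_legacy_sql=false \
  'GRANT `roles/bigquery.dataViewer`
   ON SCHEMA `my-data-project.transaction`
   TO "group:sales@example.com";'
```

Granting to a group rather than to individual users keeps the policy stable as people join and leave the Sales team.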
Infrastructure as Code with Terraform
The last section of the chapter is about Terraform and the concept of Infrastructure as Code (IaC). Until this point in the book, everything was created through the GCP console UI or gcloud commands. That works for a handful of resources. It does not work when an organization has hundreds or thousands of resources to manage.
Without IaC, you get inconsistent naming conventions, forgotten configurations, and no clear record of what was created when and by whom. IaC fixes this by defining all your resources in code files. You get version control, templates, testing before deploying, and documentation baked into the process.
Terraform is an open source tool by HashiCorp. It works with GCP and many other cloud providers. The basic workflow has three steps: terraform init downloads the required provider plugins and initializes the backend, terraform plan shows you what will change before anything happens, and terraform apply actually creates or modifies the resources.
The setup involves a few files. A backend configuration tells Terraform where to store its state (typically a GCS bucket). A provider configuration tells it you are working with Google Cloud. Variables keep your configuration flexible so you do not hardcode project IDs everywhere. You declare variables in one file and set their values in another.
Adi walks through creating a BigQuery dataset with Terraform as a simple first example. The main point is not the specific resource but the pattern: define what you want in code, review the plan, apply it.
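The pieces described above might look roughly like this in Terraform. This is a minimal sketch, not the book's actual code; every ID, bucket, and dataset name is a placeholder:

```terraform
# backend.tf -- store Terraform state in a GCS bucket (hypothetical bucket)
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-bucket"
    prefix = "data-platform"
  }
}

# provider.tf -- tell Terraform we are targeting Google Cloud
provider "google" {
  project = var.project_id
  region  = "us-central1"
}

# variables.tf -- declare the variable; its value is set in terraform.tfvars
variable "project_id" {
  type        = string
  description = "GCP project to create resources in"
}

# main.tf -- the resource itself: a BigQuery dataset
resource "google_bigquery_dataset" "raw" {
  dataset_id = "raw_data"
  location   = "US"
}
```

From there the workflow is the one described above: terraform init, then terraform plan to review the changes, then terraform apply to create the dataset.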
In most organizations, DevOps or infrastructure teams own the Terraform code. But data engineers often contribute to it, especially for data-related resources like BigQuery datasets, GCS buckets, and service accounts.
The Bigger Picture
This chapter bridges the gap between being a data engineer who can build pipelines and being someone who understands how those pipelines fit into a larger organization. IAM keeps things secure. Project structure keeps things organized. Quotas keep things running. And IaC keeps things reproducible.
Adi puts it well: understanding these four topics lifts your knowledge from data engineer to cloud data architect. You start thinking not just about your pipeline, but about the whole system it lives in.
This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 8 Part 2: Vertex AI and AutoML or continue to Chapter 10 Part 1: Data Governance Basics.