Data Engineering with GCP Chapter 2: Getting Started with Google Cloud for Big Data

Chapter 2 is where Adi Wijaya starts showing what Google Cloud Platform actually has for data engineers. After the theory in Chapter 1, this one is about opening GCP for the first time and figuring out which services matter and which ones you can safely ignore for now.

Cloud vs Non-Cloud: The Four Month Story

Adi shares a great example from his own career. He once helped a company build an on-premises data warehouse on Oracle databases. How long before he could store his first table? Four months. They had to wait for physical servers to be shipped across continents, then for network engineers to run cables, install the software, and configure everything. Months of waiting before a data engineer could even touch the system.

In BigQuery on GCP? Less than one minute. That is the difference between cloud and non-cloud.

The core principle is simple. In the cloud, computation and storage are services, not hardware. You control them with code and configuration. You request resources when you need them and release them when you don’t.

The On-Demand Nature

This is a concept that feels obvious once you get it, but it changes how you think about infrastructure.

In a traditional setup, you buy servers for development, testing, and production. All three environments run 24/7 whether anyone uses them or not.

In the cloud, you spin up a testing environment only when tests run, then delete everything when done. In the old world, that would be like selling your servers after each test run, then buying them back the next day. Nobody does that with physical hardware. But in the cloud, this is normal.

Adi gives a good Hadoop example too. On-premises Hadoop means one giant cluster with hundreds or thousands of nodes running all the time. On GCP, the common practice is to create a dedicated Spark cluster for a single job, run it, then delete the cluster. You only pay while the job runs. When it sits idle, you pay nothing. This is called an ephemeral cluster, and it is a very different way of thinking about infrastructure.
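A quick back-of-envelope calculation shows why the ephemeral pattern matters. The hourly rate and job counts below are made-up illustration numbers, not real Dataproc pricing:

```python
# Cost comparison: always-on cluster vs ephemeral (per-job) clusters.
# HOURLY_RATE is a hypothetical whole-cluster price, not a real quote.

HOURLY_RATE = 2.00      # assumed $/hour for the whole cluster
HOURS_PER_MONTH = 730   # ~24 * 365 / 12

def always_on_cost() -> float:
    """One permanent cluster, billed around the clock."""
    return HOURLY_RATE * HOURS_PER_MONTH

def ephemeral_cost(jobs_per_month: int, hours_per_job: float) -> float:
    """Create a cluster per job, delete it after; pay only while jobs run."""
    return HOURLY_RATE * jobs_per_month * hours_per_job

print(always_on_cost())        # 1460.0 -- idle or not, you pay
print(ephemeral_cost(30, 2))   # 120.0  -- 30 jobs x 2 hours each
```

Same hardware, same jobs, roughly a 12x difference, and all of it comes from deleting the cluster when nothing is running.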

The GCP Console

When you register for GCP, you land in the console. This is where everything happens. Enabling services, managing users, monitoring costs, checking logs. You can build an entire data pipeline without leaving your browser.

The console has a navigation menu with a long list of services organized by categories like Compute, Storage, Databases, and Analytics. It is overwhelming at first. Adi recommends pinning the services you will actually use to the top of the menu: BigQuery, Cloud Composer, Dataproc, Pub/Sub, Dataflow, Cloud Storage, and a few others.

There is also Cloud Shell, a Linux terminal right in the browser with gcloud and Python preinstalled. Plus a basic code editor. Nothing fancy, but enough for exercises.

About costs: GCP requires a payment method to register, but new users get $300 in free credits for 90 days. Some services like BigQuery have free tiers (10 GB storage, 1 TB queries per month). Others like Cloud Composer charge immediately. Don’t worry about costs for now; the book covers cost strategy later.
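To make the free tier concrete, here is a rough monthly estimate using the two allowances the chapter mentions. The per-unit prices are illustrative assumptions, not Google's current price list:

```python
# Rough BigQuery monthly cost with the free tier carved out.
# Free-tier numbers (10 GB storage, 1 TB queries) are from the chapter;
# the unit prices below are assumptions for illustration only.

FREE_STORAGE_GB = 10
FREE_QUERY_TB = 1
STORAGE_PRICE_PER_GB = 0.02   # assumed $/GB/month
QUERY_PRICE_PER_TB = 5.00     # assumed $/TB scanned

def monthly_cost(storage_gb: float, query_tb: float) -> float:
    """Bill only the usage above the free tier."""
    billable_storage = max(0.0, storage_gb - FREE_STORAGE_GB)
    billable_query = max(0.0, query_tb - FREE_QUERY_TB)
    return billable_storage * STORAGE_PRICE_PER_GB + billable_query * QUERY_PRICE_PER_TB

print(monthly_cost(5, 0.5))   # 0.0  -- under both free tiers
print(monthly_cost(110, 3))   # 12.0 -- 100 GB + 2 TB billable
```

For the exercises in the book, small tables and modest queries stay well inside the free tier.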

Serverless vs Managed vs VM-Based

This is one of the most useful sections in the chapter. Adi breaks GCP services into three groups, and understanding these groups helps you make better choices.

VM-based means you rent virtual machines from Google (Compute Engine) and install whatever software you want on them. Google handles the physical hardware and operating system, but you install and maintain the software. Use this when GCP doesn’t have a managed version of what you need. Adi’s example: Elasticsearch. Google didn’t have a managed Elasticsearch service, so you spin up VMs and install it yourself.

Managed service means Google installs and maintains the software, but you still configure machine sizes and networking. Example: Dataproc for Hadoop. You don’t install Hadoop yourself. But you decide how many nodes and what machine type.

Serverless means you just use it. No setup, no infrastructure config. BigQuery is the poster child. Create a table, run SQL, done. Zero visibility into what runs underneath, and zero need for it.

The tradeoff is flexibility versus simplicity. VM-based gives full control but full responsibility. Serverless gives zero control but zero maintenance. Adi recommends starting with serverless whenever one exists for your use case.

Service Mapping: What Goes Where

GCP has a lot of services. Adi maps the important ones into categories and assigns priorities (1 being most important, 3 being least for a data engineer starting out).

Analytics: BigQuery (data warehouse), Cloud Composer (managed Airflow for orchestration), Dataproc (managed Hadoop/Spark), Pub/Sub (messaging, comparable to Kafka), Dataflow (managed Apache Beam for batch and streaming processing), Dataplex (data governance).

Storage and Database: Cloud Storage (object storage for data lakes), Cloud SQL (managed MySQL/PostgreSQL), Bigtable (NoSQL), Spanner (globally distributed relational database).

Identity and Management: IAM for access control, Cloud Logging/Monitoring, Cloud Build for CI/CD.

ML and BI: Vertex AI for machine learning, Looker Studio for visualization.

Security: Sensitive Data Protection for detecting PII, Secret Manager for passwords and keys.
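The mapping above can be sketched as a small lookup table. The priority scheme follows the book (1 = learn first), but the exact number per service here is my reading, not a quote from the text:

```python
# A subset of the chapter's service map as a lookup table.
# Priorities (1 = learn first) are my reading of the book's scheme.

SERVICES = {
    "BigQuery":       {"category": "Analytics", "priority": 1},
    "Cloud Storage":  {"category": "Storage",   "priority": 1},
    "Cloud Composer": {"category": "Analytics", "priority": 1},
    "Dataproc":       {"category": "Analytics", "priority": 2},
    "Pub/Sub":        {"category": "Analytics", "priority": 2},
    "Dataflow":       {"category": "Analytics", "priority": 2},
    "Vertex AI":      {"category": "ML",        "priority": 3},
}

def learn_first(services: dict) -> list:
    """Names of priority-1 services: the ones worth pinning in the console."""
    return sorted(name for name, s in services.items() if s["priority"] == 1)

print(learn_first(SERVICES))   # ['BigQuery', 'Cloud Composer', 'Cloud Storage']
```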

Each GCP service tends to do one specific thing. Unlike traditional IT products that try to be a full-stack solution in one package, GCP expects you to combine multiple services. As a data engineer, your job is knowing which pieces fit together.

Adi’s advice on choosing between overlapping services: pick the most popular one. Not because of hype, but because popular services get better long-term support from Google, bigger communities, and more available experts to hire.

Quotas: The Hidden Boundaries

Here is something that surprises people. Even with unlimited budget, you cannot do everything in GCP. Every service has quotas, which are hard limits set by Google.

BigQuery has a maximum of 10,000 columns per table. Cloud Storage limits individual objects to 5 TB. These exist for two reasons.

First, GCP services run on shared infrastructure. Quotas prevent one customer from hogging resources and slowing things down for everyone else. Second, quotas act as a smell test for your design. If your table needs more than 10,000 columns, something is probably wrong with your data model.

You don’t need to memorize quotas. Just know they exist and where to check: cloud.google.com/[service-name]/quotas.
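In the spirit of the "smell test" idea, a design review can literally check a proposed table against the two hard limits the chapter cites. The limits are from the text; the check itself is just an illustration:

```python
# A tiny design "smell test" against the two quotas the chapter cites:
# 10,000 columns per BigQuery table, 5 TB per Cloud Storage object.

BIGQUERY_MAX_COLUMNS = 10_000
GCS_MAX_OBJECT_TB = 5

def check_design(num_columns: int, largest_file_tb: float) -> list:
    """Return warnings; an empty list means the design fits the quotas."""
    warnings = []
    if num_columns > BIGQUERY_MAX_COLUMNS:
        warnings.append(f"{num_columns} columns exceeds BigQuery's per-table limit; rethink the data model")
    if largest_file_tb > GCS_MAX_OBJECT_TB:
        warnings.append(f"{largest_file_tb} TB object exceeds Cloud Storage's limit; split the file")
    return warnings

print(check_design(12_000, 0.1))   # one warning: too many columns
print(check_design(50, 0.1))      # [] -- fits comfortably
```

If the check fires, the quota is usually telling you something about the design, not about Google being stingy.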

User Accounts vs Service Accounts

Last concept, and it is important. GCP has two types of accounts.

A user account is your personal login. Your email, your identity. You use it to click around the console and do development work.

A service account is for machines. When you automate an ETL pipeline with Cloud Composer, that pipeline needs permissions to read from Cloud Storage and write to BigQuery. You don’t want it running under someone’s personal email. What happens when that person leaves the company? The whole pipeline breaks.

Instead, you create a service account with its own email address ending in .iam.gserviceaccount.com. It has its own permissions, independent of any employee. People come and go; the service account stays. You also get better security, because you can limit what the service account can access without touching any human user’s permissions.
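Service account addresses follow a fixed convention (NAME@PROJECT_ID.iam.gserviceaccount.com), which makes them easy to tell apart from human logins. The account and project names below are hypothetical, and the length rules in the pattern are approximate:

```python
# Distinguish service accounts from user accounts by address shape.
# The NAME@PROJECT_ID.iam.gserviceaccount.com format is standard GCP;
# the example names are hypothetical and length rules approximate.
import re

SA_PATTERN = re.compile(
    r"^[a-z][a-z0-9-]{5,29}"            # account id
    r"@[a-z][a-z0-9-]{4,28}[a-z0-9]"    # project id
    r"\.iam\.gserviceaccount\.com$"
)

def is_service_account(email: str) -> bool:
    """True if the address looks like a GCP service account, not a user."""
    return bool(SA_PATTERN.match(email))

print(is_service_account("etl-pipeline@my-project.iam.gserviceaccount.com"))  # True
print(is_service_account("alice@example.com"))                                # False
```

A check like this is handy in audit scripts when you want to flag pipelines that still run under somebody's personal login.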

What Stuck With Me

Chapter 2 is not deep, and it is not supposed to be. It is the “here is the map, here is the compass” chapter. The key ideas are: cloud means on-demand resources, serverless is your default choice, each GCP service does one thing, and always use service accounts for automation.

Starting next chapter, Adi goes hands-on with BigQuery. That is where things get interesting.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 1: Fundamentals or continue to Chapter 3 Part 1: BigQuery Data Warehouse.
