Data Engineering with GCP Chapter 5 Part 1: Building a Data Lake on Google Cloud

Chapter 5 is where things get interesting if you come from a traditional database background. We are leaving the nice structured world of BigQuery and entering the territory of raw files, distributed storage, and the Hadoop ecosystem. Welcome to the data lake.

Why Data Lakes Exist

In previous chapters we built a data warehouse in BigQuery. That is great when you know what your data looks like and what questions you want to answer. But what happens when you have terabytes of log files, images, JSON blobs, and CSV exports from 15 different systems, and you have no idea yet what is useful?

That is exactly the problem a data lake solves. Adi Wijaya breaks it down into three things that make a data lake different from a data warehouse:

It can store any kind of data. Structured tables, semi-structured JSON and CSV, unstructured images and logs. But honestly, as Adi points out, from a data engineer’s perspective they are all just files. If your storage can hold files, you are good. No schemas upfront, no table structures before loading.

It can process files in a distributed way. You cannot just read a 500 GB file with a Python script on your laptop. You need a system that breaks work into smaller pieces and runs them across multiple machines in parallel. This is scaling out, not scaling up. Scaling up means buying a bigger server. Scaling out means adding more servers. One has a ceiling, the other does not.
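To make the scale-out idea concrete, here is a minimal sketch in Python (my own illustration, not from the book): the workload is split into chunks, each chunk goes to a separate worker process, and the partial results are combined at the end. This is the same divide-and-conquer pattern a cluster applies, just across processes on one machine instead of across many machines.

```python
from multiprocessing import Pool

def count_error_lines(chunk):
    # each worker scans only its own slice of the data
    return sum(1 for line in chunk if "ERROR" in line)

def parallel_error_count(lines, workers=4):
    # split the "file" into roughly equal chunks, one per worker
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(workers) as pool:
        partial_counts = pool.map(count_error_lines, chunks)
    # combine the partial results, like a cluster's final aggregation step
    return sum(partial_counts)

if __name__ == "__main__":
    logs = ["ERROR disk full", "INFO ok", "ERROR timeout", "INFO ok"] * 1000
    print(parallel_error_count(logs))  # 2000 of the 4000 log lines contain ERROR
```

Adding more workers (or, in a real cluster, more machines) lets the same code handle more data, which is exactly why scaling out has no hard ceiling.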

The mindset is: load the data first, figure out value later. This is maybe the biggest shift. In a data warehouse world, you design your schema, clean your data, and then load it. In a data lake world, you dump everything in first and let data scientists, analysts, and other teams figure out what is worth exploring. If your organization keeps debating “is this data clean enough to store?” then they are thinking about it wrong. Store it, worry later.

A Quick History Lesson

The idea of storing and processing massive amounts of data is not new. Google was doing it in the early 2000s because they needed to store every web page on the internet. You cannot put that in a regular database table.

In 2003, Google published a paper describing the Google File System, and followed it in 2004 with the MapReduce paper. In 2006, the open source community took those ideas and created Hadoop. After that, companies around the world started building their own data lakes. The rest, as they say, is history.


The Hadoop Ecosystem

Hadoop is not just one tool. It is an ecosystem of tools that work together. At the core, there are three main components:

HDFS (Hadoop Distributed File System) takes your files and automatically splits them across multiple machines. You can store terabytes or petabytes of data spread across many hard disks, and HDFS handles all the distribution for you.

MapReduce defines how to process data in parallel across all those machines. In recent years, Spark has become more popular for this job, but conceptually Spark still uses the MapReduce idea of splitting work into chunks and processing them simultaneously.

YARN is the resource manager. It keeps track of how much CPU and memory each machine has available and decides how to allocate resources to each job.
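The MapReduce idea itself fits in a few lines. Here is a single-process Python simulation (an illustration, not how Hadoop is implemented) of the classic word count: a map phase that emits key-value pairs, a shuffle phase that groups them by key, and a reduce phase that aggregates each group. In a real cluster each phase runs in parallel across machines.

```python
from collections import defaultdict

def map_phase(document):
    # map: emit a (word, 1) pair for every word, independently per document
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    # shuffle: group values by key, so each word's counts land together
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # reduce: collapse each word's list of 1s into a total
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["the quick brown fox", "the lazy dog", "the fox again"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts["the"])  # → 3
```

Spark's API looks different, but under the hood it is still this map, shuffle, reduce cycle.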

Beyond these three, the Hadoop ecosystem includes tools for streaming (Kafka, Flume), data warehousing on top of files (Hive, Impala), machine learning, and more. But Adi gives solid advice here: do not try to learn every Hadoop tool at once. Focus on understanding HDFS and a processing framework like Spark first. The rest will make sense naturally once you get those two.

Dataproc: Hadoop on Google Cloud

Setting up a Hadoop cluster from scratch is painful. You need to install operating systems, configure networking, install Hadoop software, tune configurations. It can take days.

Dataproc is Google’s managed Hadoop service. It takes about 5 minutes to spin up a full Hadoop cluster. Google handles the VMs, the OS, and the Hadoop installation. You just focus on writing and running your data processing jobs.

One of the most common reasons organizations use Dataproc is migration. They already have years of Hadoop scripts and developers familiar with the ecosystem. Moving to Dataproc is the path of least resistance.

An interesting point from the book: on GCP, the best practice is to use Google Cloud Storage (GCS) instead of HDFS for storage. Both can store files, both work with Dataproc. But GCS is serverless and integrates directly with other GCP services like BigQuery and Vertex AI. HDFS lives inside your cluster and dies when the cluster dies.

There is also Dataproc Serverless, which lets you submit Spark jobs without even creating a cluster. Autoscaling is built in. It does not replace regular Dataproc, but for many use cases it removes the infrastructure headache entirely.

How Much Hadoop Do You Actually Need on GCP?

This is a practical question Adi addresses. GCP has native alternatives for most Hadoop components. BigQuery replaces Hive for data warehousing. GCS replaces HDFS for storage. What remains is the processing layer, and that means Spark.

If you are new to both Hadoop and GCP, focus on learning Spark. That is where most of the real-world value is, and it is what shows up in data engineering job interviews.

Working with HDFS and Hive

The book walks through a hands-on exercise of creating a Dataproc cluster and working with data. Here is the high-level flow without getting into the specific commands.

First, you create a single-node Dataproc cluster. In production you would have a master node (or three, for high availability) plus multiple worker nodes, but for learning, one node is enough.

Then you access the cluster’s master node through SSH. From there, you interact with HDFS using command-line tools. HDFS has its own filesystem separate from the regular Linux filesystem on the machine. When you list files in HDFS, you see completely different directories than what a normal ls shows. This feels abstract at first, but it clicks after some practice.

To get data into HDFS, you first copy files from Google Cloud Storage to the master node, and then load them from the master node into HDFS. Think of it as a two-step process: GCS to local disk, then local disk to HDFS.
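The two-hop pattern can be sketched as a loose Python analogy (local directories standing in for the GCS bucket, the master node's disk, and HDFS; file and directory names are made up). In the actual exercise, step 1 is a `gsutil cp` and step 2 is an `hdfs dfs -put`.

```python
import os, shutil, tempfile

# three temporary directories stand in for the three storage layers
gcs, local_disk, hdfs = (tempfile.mkdtemp() for _ in range(3))

# a file sitting in the "GCS bucket"
src = os.path.join(gcs, "events.csv")
with open(src, "w") as f:
    f.write("id,value\n1,42\n")

# step 1: GCS -> master node's local disk (gsutil cp in the real exercise)
staged = shutil.copy(src, local_disk)

# step 2: local disk -> HDFS (hdfs dfs -put in the real exercise)
shutil.copy(staged, hdfs)

print(sorted(os.listdir(hdfs)))  # → ['events.csv']
```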

Once your data is in HDFS, you can use Hive to put a table structure on top of your files. Hive lets you create an external table that points to a directory in HDFS. You define the columns, specify the delimiter, and suddenly you can query CSV files with SQL. The underlying data never moves. The files stay where they are. Hive just provides a schema layer on top.

This is one of the most important concepts in data lake technology. You are not loading data into a database. You are querying files directly, and the “table” is just metadata that tells the system how to interpret those files.
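The schema-on-read idea can be sketched in plain Python (a simplified stand-in for what Hive does; the file, column, and table names here are invented for illustration). The files sit untouched in a directory, and the "table" is nothing but metadata, column names plus a delimiter, applied at the moment you read.

```python
import csv, os, tempfile

# a directory of raw CSV files, as they would sit in HDFS or GCS
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "part-0000.csv"), "w") as f:
    f.write("101,jakarta,2500\n102,bandung,1800\n")

# the "external table" is only metadata: schema + delimiter + location
table = {
    "columns": ["store_id", "city", "sales"],
    "delimiter": ",",
    "location": data_dir,
}

def scan(table):
    # schema is applied at read time; the underlying files never move
    for name in sorted(os.listdir(table["location"])):
        with open(os.path.join(table["location"], name)) as f:
            for row in csv.reader(f, delimiter=table["delimiter"]):
                yield dict(zip(table["columns"], row))

rows = list(scan(table))
print(rows[0]["city"])  # → jakarta
```

Drop another file into the directory and the "table" grows; delete the table metadata and the files are still there. That is exactly the relationship between a Hive external table and the data underneath it.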

What is Next

In Part 2, we will get into Spark on Dataproc: how to read files from both HDFS and GCS using PySpark, how to submit Spark jobs to a cluster, and the concept of ephemeral clusters, which is honestly the biggest reason to use Dataproc in the cloud.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 4 Part 2: Airflow Best Practices or continue to Chapter 5 Part 2: Spark on Dataproc.
