Data Engineering with GCP Chapter 1: What Is Data Engineering Anyway?
Chapter 1 starts with a confession most of us in the data world can relate to. Adi Wijaya says he used to think data was clean. Neatly organized, ready to go. Then he actually worked with data in real organizations and realized most of the effort goes into collecting, cleaning, and transforming it. Not the fun machine learning part. The plumbing part.
That’s what data engineering is. The plumbing. This chapter lays the foundation for everything that comes later in the book.
Data Does Not Stay in One Place
The first big idea is the data life cycle. Data is like water. It flows from upstream to downstream. Sometimes it’s a simple waterfall. Sometimes it’s a complex pipeline with filters, branches, and valves along the way.
In most organizations, data starts in application databases. Your website, your mobile app, your point of sale system. Each one generates data. But that data sits in its own little world. The book calls these data silos, and they’ve been a problem since at least the 1980s.
Think about a bank. They have one system for credit cards, another for mortgages, another for savings accounts. Simple question: how many credit card customers also have mortgages? You can’t answer that without pulling data from multiple systems into one place.
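Once the data from those separate systems does sit in one place, the question becomes a single join. Here is a toy sketch using SQLite with made-up table names (illustrative only, not from the book):

```python
import sqlite3

# Hypothetical example: credit card and mortgage data from two separate
# systems have landed in one database, so the silo question is just a join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE credit_card_customers (customer_id INTEGER);
    CREATE TABLE mortgage_customers (customer_id INTEGER);
    INSERT INTO credit_card_customers VALUES (1), (2), (3);
    INSERT INTO mortgage_customers VALUES (2), (3), (4);
""")

# How many credit card customers also have mortgages?
count = conn.execute("""
    SELECT COUNT(*)
    FROM credit_card_customers c
    JOIN mortgage_customers m ON m.customer_id = c.customer_id
""").fetchone()[0]
print(count)  # 2
```

The hard part in practice is not the join itself but getting both tables into the same system in the first place.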
That’s where data warehouses and data lakes come in.
Data Warehouses vs Data Lakes
Data warehouses have been around since the 1980s. Take data from many sources and put it in one place in a structured format. You query it with SQL, it has a schema, and storage and compute are bundled together.
A data lake came later, around 2008, when Hadoop showed up. The key difference is not that a data lake stores unstructured data. The real difference is that a data lake separates its building blocks. Storage is separate from compute. Schema is optional. You pick your own processing engine.
Here’s how the book breaks it down:
- Data warehouse: one product, structured schema, uses SQL, storage and compute together
- Data lake: modular platform, schema optional, flexible processing, storage and compute separated
In modern systems, these two work together. Data lakes don’t replace data warehouses. They complement each other. Organizations store raw data cheaply in a data lake, then move the valuable stuff into a structured data warehouse.
The Full Data Life Cycle
The book shows a pretty standard flow that most companies follow:
- Apps and databases generate raw data
- Data lake collects everything from multiple sources in file formats like CSV, JSON, Parquet
- Data warehouse takes the valuable data and puts it into structured tables with proper schemas
- Data marts serve specific teams (finance gets finance tables, data scientists get ML feature tables)
- End consumers use dashboards, run queries, or train machine learning models
Not every company follows this exact pattern. But across finance, government, telecom, and e-commerce, most companies follow this or are moving toward it. From my own experience in IT, I can confirm this is pretty accurate.
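The lake-to-warehouse step in that flow can be sketched in a few lines. This is my own illustrative example, not the book's, and the column names are made up: raw data sits in the "lake" as a CSV, and only the curated columns get loaded into a structured "warehouse" table.

```python
import csv
import io
import sqlite3

# Raw app data as it might land in the data lake: a CSV with some
# columns we care about and some debris (debug_blob) we don't.
raw_csv = "order_id,customer,amount,debug_blob\n1,alice,10.5,xxx\n2,bob,7.0,yyy\n"

# The "warehouse": a structured table with a proper schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")

for row in csv.DictReader(io.StringIO(raw_csv)):
    # Keep only the valuable columns; cast them to the schema's types.
    warehouse.execute("INSERT INTO orders VALUES (?, ?, ?)",
                      (int(row["order_id"]), row["customer"], float(row["amount"])))

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 17.5
```

In a real GCP setup the lake would be Cloud Storage and the warehouse would be BigQuery, but the shape of the step is the same.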
What Does a Data Engineer Actually Do
The book has a good comparison. If you go to a doctor, you know what they do: examine you, diagnose, prescribe medicine. Clear responsibilities.
Data engineering is not that clear yet. The role is still new. Some companies expect data engineers to handle infrastructure. Others expect them to build dashboards. The expectations are all over the place.
The book defines it simply: a data engineer is someone who designs and builds data pipelines.
Then it maps the role to three zones:
- Core (must know): building data pipelines, moving data from lake to warehouse, designing ETL processes, job orchestration
- Good to have: building data marts, collecting data into the lake
- Good to know: application databases, machine learning, infrastructure, dashboards
If you’re starting out, focus on the core first. The rest comes with experience. If your current job only touches the edges, find a way to get closer to the core.
ETL vs ELT
ETL stands for Extract, Transform, Load. It’s the bread and butter of data engineering.
- Extract: pull data from the source system
- Transform: clean it, join it, reshape it
- Load: put it into the target system
ELT is the same letters, different order. Extract, Load, Transform. You load the raw data into the target system first, then transform it there.
Why does the order matter? Because it affects your choice of technology, performance, cost, and scalability.
If your target system is powerful enough to handle transformations (like BigQuery or a Hadoop cluster), use ELT. Load everything there and let the system do the heavy lifting.
If your target system is not powerful, do the transformations in the middle before loading. That’s ETL.
Simple concept, but the decision comes up at every step of the data life cycle.
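The same letters, different order, really is the whole difference. A minimal sketch, with hypothetical helper names I made up for illustration:

```python
def extract(source):
    """Pull rows from the source system."""
    return list(source)

def transform(rows):
    """Clean and reshape: drop empties, normalize casing."""
    return [r.strip().lower() for r in rows if r.strip()]

def load(target, rows):
    """Write rows into the target system."""
    target.extend(rows)

source = [" Alice ", "", "BOB"]

# ETL: transform in the middle, before loading into the target.
etl_target = []
load(etl_target, transform(extract(source)))

# ELT: load the raw data first, then transform inside the target
# (in real life, this line would be a SQL statement run in BigQuery).
elt_target = []
load(elt_target, extract(source))
elt_target[:] = transform(elt_target)

print(etl_target == elt_target == ["alice", "bob"])  # True
```

Both arrive at the same result; what changes is where the compute for the transform happens, and that is exactly the technology/cost/scalability decision the chapter describes.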
Big Data Is Relative
The book takes a practical approach to big data. Instead of talking about the famous “5 Vs” (volume, variety, velocity, veracity, value), it focuses on what matters to engineers: the “how” questions.
How do you store 1 petabyte when hard drives are measured in terabytes? How do you calculate an average when data lives on multiple machines?
Big data is relative to your system. 5 GB of data on a laptop with 1 TB of storage? Not big data. Grow that dataset to 5 petabytes? Now you need special tools. The key is size relative to what your system can handle.
Big data systems distribute data across multiple machines, called a cluster. A large file gets split into small chunks (like 128 MB each) and spread across servers. Metadata keeps track of which chunk belongs where.
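The chunk-plus-metadata idea can be shown with a toy sketch (my own, with tiny chunk sizes and made-up node names standing in for real servers):

```python
# Toy sketch: split a "file" into fixed-size chunks, assign each chunk
# to a machine round-robin, and record the placement in a metadata map.
CHUNK_SIZE = 8  # stand-in for the ~128 MB chunks real systems use
machines = ["node-a", "node-b", "node-c"]

data = b"the quick brown fox jumps over the lazy dog"
chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

# Metadata: which chunk of which file lives on which machine.
metadata = {("myfile", idx): machines[idx % len(machines)]
            for idx in range(len(chunks))}

print(len(chunks))              # 6 chunks of up to 8 bytes each
print(metadata[("myfile", 0)])  # node-a
```

Real systems like HDFS also replicate each chunk across several machines for fault tolerance, but the bookkeeping idea is the same: the data is scattered, and metadata remembers where everything went.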
MapReduce: Processing Distributed Data
Once data is spread across multiple machines, you need a way to process it. The most famous concept for this is MapReduce, which Google introduced in a 2004 paper.
The classic example: counting words across files on different machines. Three file parts on three machines containing fruit names. You want to count how many times each fruit appears.
Here’s how MapReduce works:
- Map: each machine tags every word with a count of 1 (Apple: 1, Banana: 1, etc.)
- Shuffle: group the same words together on the same machine (all Apples go to one machine)
- Reduce: add up the counts (Apple: 3, Banana: 2, Melon: 1)
- Result: store the final answer
The key thing is that each step happens in parallel across all machines. Instead of downloading everything to one computer and processing it there, you distribute the work.
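The four steps above fit in a few lines of Python. This is a single-machine sketch of the idea, with the three lists standing in for the three machines:

```python
from collections import defaultdict

# Three "machines", each holding part of a file of fruit names.
parts = [["Apple", "Banana"], ["Apple", "Melon"], ["Apple", "Banana"]]

# Map: each machine tags every word with a count of 1.
mapped = [[(word, 1) for word in part] for part in parts]

# Shuffle: route identical keys to the same place.
shuffled = defaultdict(list)
for part in mapped:
    for word, count in part:
        shuffled[word].append(count)

# Reduce: sum the counts per key.
result = {word: sum(counts) for word, counts in shuffled.items()}
print(result)  # {'Apple': 3, 'Banana': 2, 'Melon': 1}
```

In a real cluster, the map and reduce steps run in parallel on different machines and the shuffle moves data over the network; the logic is the same.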
While the original MapReduce implementation in Hadoop is being replaced by newer engines like Spark and Dataflow, the concept is still relevant. When you run a SQL query in BigQuery on a petabyte of data, MapReduce-style distributed processing is happening in the background. You just don't see it.
What I Think
This chapter does a good job of giving you the vocabulary. If you walk into a data engineering interview and someone asks about ETL vs ELT, data warehouses vs data lakes, or how distributed systems process data, you’ll have solid answers after reading this.
What I appreciate about the author’s approach is that he starts from the problem, not the solution. Why did data warehouses exist? Because data was in silos. Why did data lakes appear? Because storage needed to be cheaper and more flexible. Why does MapReduce matter? Because data got too big for one machine.
That’s how good engineering explanations should work. Start with the “why” and the “how” follows naturally.
This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Start from the beginning or continue to Chapter 2: Big Data on GCP.