Data Engineering with GCP Chapter 8 Part 1: Machine Learning Basics for Data Engineers

Chapter 8 is the one where Adi Wijaya finally brings up the topic every data engineer either loves or dreads: machine learning. And honestly, he does a good job of calming down both camps. If you are excited about ML, great. If you think it has nothing to do with your job, think again. This chapter shows why ML and data engineering are way closer than most people realize.

I am splitting this chapter into two parts. This first part covers the foundations: what ML actually is from a data engineering perspective, the key terms you need to know, how a basic ML pipeline works, and the different approaches GCP gives you for building ML models. Part 2 will get into the hands-on stuff with Vertex AI, AutoML, and pipelines.

Why Data Engineers Should Care About ML

Adi makes a simple but important point right at the start. There are two main reasons ML got so popular: better infrastructure and more data. That second reason is literally our job. Data engineers build the pipelines, clean the data, and make it available. Without good data, ML models are useless.

In any organization, data engineers will probably be pulled into ML discussions or projects at some point. Maybe you won’t be training models yourself. But you will be preparing datasets, building pipelines that feed into ML systems, and maintaining the infrastructure around them. Understanding how ML works, even at a high level, makes you much better at all of that.

What ML Actually Is (For Data Engineers)

Here is the simplest way to think about it. ML is a data process. It takes data as input and produces a generalized formula as output. That formula is what people call an ML model.

Take an eCommerce recommendation system. The input is customer purchase history. The ML process crunches that data and produces a formula that can predict what items a customer might buy next. Or a cancer predictor: the input is X-ray images labeled as “cancer” or “no cancer,” and the output is a formula that can look at new, unlabeled X-ray images and make predictions.

Even generative AI follows the same pattern. Tons of text goes in, and out comes a formula that calculates what words should follow other words. The scale is different, but the concept is the same.

For data engineers, the key thing to notice is that ML is not fundamentally different from other data processes. It needs data in and produces output. The special part is that “generalized formula” in the middle, which is where the math lives: regression, decision trees, neural networks, Random Forest, and so on.

The Terms You Need to Know

One thing Adi highlights is that data scientists and data engineers sometimes use different words for the same things. Like how some people say “mobile phone” and others say “cell phone.” Here are the ML terms that matter:

Dataset is the data used as ML input. For data scientists, a dataset usually means a cleaned, ready-to-use version of the data. For data engineers, raw data might need a lot of work before it qualifies as an ML dataset.

Features are the columns in your data that the model uses to make predictions. If you have a table with humidity, temperature, and rain/no-rain columns, then humidity and temperature are your features.

Target is what you want to predict. In that same example, rain/no-rain is the target.

Accuracy is how well the model actually performs, measured as the share of predictions it got right. Like “the model correctly predicted rain or no rain 90% of the time.”

Hyperparameters are settings you give to the model before training. For example, the number of trees in a Random Forest model. Nobody knows the perfect number upfront. Data scientists try many combinations and pick the best one. That process is called hyperparameter tuning.
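To make hyperparameter tuning concrete, here is a minimal sketch (my own illustration, not code from the book): try a few values for the number of trees in a Random Forest and keep whichever scores best on held-out data. The synthetic dataset and candidate values are arbitrary.

```python
# Minimal hyperparameter tuning sketch: loop over candidate values
# for n_estimators and keep the one with the best test accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset (200 rows, 20 features).
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

best_score, best_n = 0.0, None
for n_trees in [10, 50, 100]:  # candidate hyperparameter values
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)  # accuracy on held-out data
    if score > best_score:
        best_score, best_n = score, n_trees
```

Real tuning jobs search many hyperparameters at once (tools like scikit-learn’s `GridSearchCV` or Vertex AI’s tuning service automate this), but the idea is the same: train many variants, keep the best.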

Batch prediction means predicting on a bunch of data at once. Like predicting next week’s weather for all seven days in one go.

Online prediction means predicting in real time, one request at a time. Usually through an API. A web app sends humidity and temperature values, and the API responds with rain or no rain.
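The batch-versus-online distinction can be shown with a toy version of the rain example. This is purely illustrative: the hard-coded humidity threshold below stands in for the “generalized formula” a real trained model would produce, and the function and field names are my own.

```python
# Toy "model" for the rain example: a hypothetical trained formula.
def predict_rain(humidity: float, temperature: float) -> str:
    """Stand-in for the generalized formula an ML process would learn."""
    return "rain" if humidity > 80 and temperature < 30 else "no rain"

# Batch prediction: score a whole table of rows in one go,
# e.g. next week's forecast for all days at once.
next_week = [
    {"humidity": 85, "temperature": 22},
    {"humidity": 40, "temperature": 31},
]
batch_results = [predict_rain(r["humidity"], r["temperature"]) for r in next_week]

# Online prediction: one request at a time, as an API endpoint would
# receive it from a web app.
single_result = predict_rain(humidity=90, temperature=25)
```

In production, batch prediction typically reads from and writes to tables, while online prediction wraps the same model in a web service.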

How a Basic ML Pipeline Works

Adi walks through a practical example: predicting whether a credit card customer will fail to pay their bill next month. The dataset comes from BigQuery public data. Here is the pipeline, simplified:

  1. Load data from BigQuery into a format your ML library can work with
  2. Select which columns (features) matter for your prediction
  3. Split the data into training data (70%) and test data (30%)
  4. Train the model on the training data
  5. Test the model on the test data to measure accuracy
  6. Save the trained model as a file

Once you have a saved model, you can use it for batch prediction (feed it a table of new data, get predictions back) or online prediction (serve it as an API endpoint).

The important takeaway for data engineers: steps 1, 2, and 3 are basically data engineering. Loading data, selecting columns, splitting datasets. You already know how to do that. The ML-specific part is really just steps 4 and 5. And step 6 is just saving a file.
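The six steps above can be sketched in a few lines of Python. This is my own simplified version, not the book’s code: a synthetic DataFrame stands in for the BigQuery credit-card dataset, and the column names are invented for illustration.

```python
# Sketch of the six-step pipeline, using a synthetic stand-in dataset.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Step 1: load data (stand-in for an export from BigQuery).
df = pd.DataFrame({
    "limit_balance":      [20000, 120000, 90000, 50000, 300000, 10000] * 20,
    "age":                [24, 26, 34, 37, 57, 22] * 20,
    "default_next_month": [1, 0, 0, 0, 0, 1] * 20,  # target column
})

# Step 2: select features and target.
features = df[["limit_balance", "age"]]
target = df["default_next_month"]

# Step 3: split into 70% training data, 30% test data.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42)

# Step 4: train the model on the training data.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Step 5: measure accuracy on the test data.
accuracy = model.score(X_test, y_test)

# Step 6: save the trained model as a file.
joblib.dump(model, "credit_default_model.joblib")
```

Notice how much of this is ordinary data wrangling: only the `fit` and `score` calls are ML-specific.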

MLOps: The Bigger Picture

Here is something Adi is very honest about. Most ML content on the internet focuses on creating and improving ML models. That is maybe 10% of the actual work. The other 90% is everything else: data collection, data cleaning, feature engineering, infrastructure, monitoring, deployment, and maintenance.

That whole practice of building and maintaining all these pieces is called MLOps. It requires expertise in ML models, data engineering, orchestration, containerization, web services, monitoring, and more. This is exactly why ML projects usually need more than one team.

If you start everything from scratch, it could take months or years to set up the full stack. GCP offers managed services that cut that time significantly. Not easy to learn, but way easier than building it all yourself.

Four Approaches to ML on GCP

This is where things get practical. GCP gives you four different ways to build ML models, and each one makes trade-offs between development time, expertise needed, and cost:

Custom model in a custom environment. You write all the code, pick the algorithm, tune the hyperparameters, manage the infrastructure. This gives you the most control and can be the cheapest at scale. But it needs experienced data scientists and lots of experimentation time.

Pre-built models. Google provides trained models you can use immediately through an API. Vision AI can read text from images. Translation API can detect and translate languages. You need almost zero ML expertise. The downside: you are limited to what Google offers, and per-API-call pricing can get expensive at heavy usage.

AutoML. You give it your data and a time budget, and AutoML tries many algorithms and hyperparameters automatically. It returns the best model it found in that time. This handles your specific use cases (unlike pre-built models), needs very little ML expertise, and often produces models more accurate than hand-built ones. The downside: infrastructure cost is higher than running custom models.

BigQuery ML. You train ML models using SQL queries right inside BigQuery. If you are comfortable with SQL but not Python, this is your path. You can use various algorithms that BigQuery provides, and the workflow feels similar to building data pipelines.

From fastest to slowest in terms of development time: pre-built models, AutoML, BigQuery ML, custom models. From cheapest to most expensive at scale, roughly the reverse order.

The Data Engineer’s Role in All of This

What struck me most in this chapter is how Adi keeps bringing it back to data engineering. The ML model itself, the fancy algorithm part, is actually a small piece of the puzzle. The bigger work is getting data ready, building reliable pipelines, making sure the infrastructure works, and keeping everything running in production.

That is our work. Data engineers do not need to become data scientists. But understanding how ML fits into the bigger picture makes you the kind of engineer that ML teams actually want to work with. You know how to feed their models good data. You know how to build the pipelines that keep everything running. And on GCP specifically, many of these tools are designed so that data engineers can participate directly in ML projects.

In Part 2, we will get into the hands-on exercises: using Google Cloud Vision as a pre-built model, training a model with AutoML, and building ML pipelines with Vertex AI.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 7: Looker Studio Visualization or continue to Chapter 8 Part 2: Vertex AI and AutoML.
