Data Engineering with GCP Chapter 8 Part 2: Vertex AI, AutoML, and ML Pipelines

Part 1 covered the ML basics: what supervised and unsupervised learning are, how a simple model gets trained, and why data engineers should care about ML at all. Now in Part 2, Adi Wijaya moves into the GCP tools that make ML actually work in production. This is where theory meets infrastructure.

Vertex AI: Google’s ML Platform

Vertex AI is Google’s unified ML platform on GCP. Before Vertex AI existed, Google had a service called AI Platform, which did similar things but was less unified. The Python client library is still called aiplatform rather than something like VertexAIClient, which can be confusing at first. Just know they are the same thing, different generations.

What Vertex AI gives you is a single place to manage datasets, train models, deploy them, and monitor how they perform. For a data engineer, the most relevant parts are the dataset management, the pipeline orchestration, and understanding how models get served so you can build the data flows around them.

AutoML: ML Without Being an ML Engineer

This is probably the most interesting part for data engineers. AutoML lets you build ML models without writing ML code, without knowing which algorithm to pick, and without worrying about infrastructure. You point it at your data, tell it what you want to predict, and AutoML figures out the rest.

Adi walks through a practical example using a credit card default dataset stored in BigQuery. The workflow is straightforward. You create a Vertex AI dataset, connect it to your BigQuery table, pick a target column (in this case, whether someone will default on their payment), select which features to include, and hit train. AutoML tries different models and configurations to find the best one, usually within an hour.
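For readers who prefer the SDK over the console, the same workflow can be sketched with the google-cloud-aiplatform Python library. This is a sketch rather than a recipe: it will not run without a GCP project with billing enabled, and the project ID, table, and column names below are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create a Vertex AI dataset backed by a BigQuery table.
dataset = aiplatform.TabularDataset.create(
    display_name="cc-default",
    bq_source="bq://my-project.my_dataset.cc_default",
)

# Let AutoML search for the best model; no algorithm choice needed.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="cc-default-automl",
    optimization_prediction_type="classification",
)
model = job.run(
    dataset=dataset,
    target_column="default_payment_next_month",
    budget_milli_node_hours=1000,  # 1 node hour, matching the budget note below
)
```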

Three things you do not need with AutoML: code, knowledge of ML algorithms, and infrastructure setup. For a data engineer who needs a quick predictive model, or a prototype to hand to the data science team, that is very practical.

The minimum cost is around $21 for one training node hour on tabular data, so keep an eye on your budget if you are still on the free trial.

What Happens After Training

Once AutoML finishes, you get a model with evaluation metrics such as precision, recall, F1 score, and ROC AUC. If you are not a data scientist, do not stress about these numbers: each one summarizes prediction quality from a different angle. The data science team will care about them deeply. As a data engineer, you just need to know they exist and where to find them.
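To make those metrics a little less mysterious, here is a small self-contained Python example (plain Python, not Vertex AI code) computing precision, recall, and F1 from a toy set of predictions. The customer data is made up purely for illustration:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = will default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted defaults, how many were real
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real defaults, how many we caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Six customers: actual defaults vs. what the model predicted.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.667 0.667 0.667
```

ROC AUC needs predicted probabilities rather than hard labels, which is why it is left out of this toy version.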

From here, there are two paths. You can deploy the model as an API for real-time predictions (someone sends a request, gets a prediction back instantly). Or you can run batch predictions, where you feed a whole table of new data and get predictions for all rows at once. Both options are built into the Vertex AI console.
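In SDK terms, the two paths look roughly like this. Again a sketch, assuming the google-cloud-aiplatform library and a `model` object from a finished training job; the table names are placeholders:

```python
# Online: deploy to an endpoint, then request predictions one at a time.
endpoint = model.deploy(machine_type="n1-standard-4")
prediction = endpoint.predict(instances=[{"limit_balance": 20000, "age": 35}])

# Batch: score a whole BigQuery table in one job, no endpoint needed.
batch_job = model.batch_predict(
    job_display_name="cc-default-batch",
    bigquery_source="bq://my-project.my_dataset.new_customers",
    bigquery_destination_prefix="bq://my-project.my_dataset",
)
```

Note the cost difference in kind: a deployed endpoint bills continuously while it is up, while a batch job bills only while it runs.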

Vertex AI Pipelines: Orchestrating ML Workflows

This is where data engineers feel right at home, because Vertex AI Pipelines is basically orchestration for ML, similar to how Cloud Composer (Airflow) orchestrates data pipelines.

Under the hood, Vertex AI Pipelines uses Kubeflow Pipelines, an open source tool for building ML workflows that run as containers on Kubernetes. The relationship follows the same pattern you see everywhere in GCP: Kubeflow is to Vertex AI Pipelines what Hadoop is to Dataproc, or what Airflow is to Cloud Composer. Google takes the open source tool, manages the infrastructure, and you just use the SDK.

Why containers matter more here than in regular data pipelines: each step in an ML pipeline may need completely different libraries, and even different Python versions. The data-loading step needs the BigQuery client and pandas. The training step needs scikit-learn or TensorFlow. The evaluation step needs its own set of tools. Containers let each step run in an isolated environment with exactly the packages it needs, and you configure this through the @component decorator in Python.
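A sketch of what that decorator looks like, assuming the kfp SDK (in older kfp 1.x versions the import path is kfp.v2.dsl instead of kfp.dsl; the table name and step body here are illustrative only). The base_image and packages_to_install arguments are the knobs that give each step its own environment:

```python
from kfp.dsl import component

@component(
    base_image="python:3.9",
    packages_to_install=["pandas", "google-cloud-bigquery", "gcsfs"],
)
def load_data(bq_table: str, output_gcs_path: str):
    # Runs in its own container, with only the packages listed above installed.
    from google.cloud import bigquery
    df = bigquery.Client().query(f"SELECT * FROM `{bq_table}`").to_dataframe()
    df.to_csv(output_gcs_path, index=False)
```

A training component would declare scikit-learn instead, and neither step's dependencies leak into the other.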

The Two-Pipeline Pattern

Adi demonstrates a pattern that is very common in production ML systems: separating training from prediction into two distinct pipelines.

The first pipeline loads data from BigQuery, stores it in Cloud Storage, trains the model, and saves the model file (a .joblib file in this case) back to Cloud Storage. The second pipeline loads new data from BigQuery, picks up the trained model from Cloud Storage, runs predictions, and saves the results as a CSV file.

Why separate them? Because training and prediction run on different schedules. You might retrain a model once a month when new data accumulates, but run predictions every day or even every hour. Also, the newest model is not always the best model. Data scientists compare versions and decide which one to use. Keeping these pipelines separate gives everyone that flexibility.

One important detail Adi emphasizes: you cannot pass in-memory Python objects (like a pandas DataFrame) between pipeline steps. Each step runs in its own container, often on a different machine. So the way to share data between steps is through Cloud Storage: step one writes a file to GCS, step two reads that file back. The only thing actually passed between steps is a string holding the file's location.
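The handoff pattern can be demonstrated with plain Python and local files standing in for GCS. The function names and JSON payload are invented for illustration; in a real pipeline the paths would be gs:// URIs and each function would be a separate @component:

```python
import json
import tempfile
from pathlib import Path

def step_one_write(workdir: str) -> str:
    """First 'container': compute something, write it out, return only the path."""
    rows = [{"customer_id": i, "score": i * 0.1} for i in range(3)]
    path = Path(workdir) / "predictions.json"
    path.write_text(json.dumps(rows))
    return str(path)  # a plain string is all that crosses the step boundary

def step_two_read(path: str) -> int:
    """Second 'container': knows nothing except the path it was handed."""
    rows = json.loads(Path(path).read_text())
    return len(rows)

with tempfile.TemporaryDirectory() as d:
    location = step_one_write(d)   # step 1 output: just a string
    count = step_two_read(location)
    print(count)  # → 3
```

The DataFrame itself never travels between steps; only its serialized form on shared storage does, exactly as with the .joblib model file in the training pipeline.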

How Data Engineers Support ML Teams

Adi makes a point that keeps coming back throughout this chapter: ML is not a core skill for data engineers. But understanding the ML workflow gives you a much bigger picture of the overall data architecture.

As a data engineer, your job in the ML context is making sure the right data lands in the right place at the right time. You build and maintain the data pipelines that feed ML models. You manage the BigQuery tables and Cloud Storage buckets that training pipelines read from. You help set up the orchestration that runs prediction pipelines on schedule.

You do not need to pick the ML algorithm. You do not need to tune hyperparameters. You do not need to evaluate model accuracy. But you do need to understand the flow well enough to design data infrastructure that supports it. When someone from the ML team says they need fresh data every 6 hours in a specific GCS bucket, you should know exactly how to make that happen.

Chapter Wrap-up

Chapter 8 closes the “Building Data Solutions with GCP Components” section of the book. From BigQuery in Chapter 3 all the way to ML in Chapter 8, Adi has covered the core tools a data engineer uses on GCP.

The main takeaways from this second part: AutoML is a fast way to build models without ML expertise, Vertex AI Pipelines uses Kubeflow under the hood and runs each step in containers, training and prediction should be separate pipelines with separate schedules, and data between pipeline steps should flow through Cloud Storage rather than in-memory objects.

Starting from the next chapter, the book shifts from “how to use tools” to “how to organize everything”: project structures, user management, cost estimation, and CI/CD. Those are the topics that separate someone who can use GCP from someone who can run GCP well in a real organization.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 8 Part 1: ML Basics or continue to Chapter 9: User and Project Management.
