Data Engineering with GCP Chapter 12 Part 1: CI/CD Basics for Data Engineers
Chapter 12 is a shift from everything we have done so far. Until now, we were learning how to build things: pipelines, data lakes, warehouses, streaming systems. Now the question is: how do you ship all that stuff to production without breaking things? The answer is CI/CD.
Do Data Engineers Even Need CI/CD?
Adi Wijaya addresses this head-on, and I like his honesty here. CI/CD is a common practice in software engineering, but in the data world it is still catching up. Many data engineers work their entire careers without setting up a CI/CD pipeline. That does not make them bad engineers. It just means their projects or organizations did not require it.
He uses a cooking analogy that actually makes sense. If you are a solo pasta chef running a small restaurant, you do not need a complex kitchen management system. You cook, you serve, you are done. But if you join a big restaurant with 20 chefs, multiple stations, and a queue of orders, you need systems. Standard recipes, quality checks, plating procedures. Otherwise it is chaos.
Same with data engineering. If you are working alone or in a team of two to three people on a short-term project, CI/CD is nice to have but not critical. If you are in a big organization with multiple engineers pushing code to the same data pipelines, you need it. Someone will eventually push broken code to production at 2 AM, and nobody wants to debug that manually.
The practical takeaway: any code-based data application can use CI/CD. Dataflow jobs, Spark scripts in Dataproc, Terraform configs, Airflow DAGs in Cloud Composer. If it is code that gets deployed somewhere, CI/CD can automate the path from commit to production. On the other hand, purely UI-based tools like Looker Studio cannot use CI/CD because there is no code to integrate or deploy.
CI vs CD: Two Separate Things
People say “CI/CD” like it is one word, but it is actually two distinct practices, and you do not have to use both.
Continuous Integration (CI) is about automatically checking code when developers push changes. A developer commits code, and the system automatically runs tests, builds Docker images, checks for errors. The goal is to catch problems early. If five people push code to the same repository, CI makes sure each change does not break anything before it gets merged.
Continuous Deployment (CD) is about automatically pushing the verified code to production. Once CI says “this code is fine,” CD takes it and deploys it to the actual running application. It can also handle rollbacks if something goes wrong.
Here is the key insight: many teams implement CI but skip CD. They automate the testing and validation part, but they still deploy to production manually. Why? Because automated deployment to production feels risky, and for some systems it genuinely is. That is a perfectly valid approach. Start with CI, add CD when you are confident in your test coverage.
The GCP CI/CD Toolchain
Google Cloud has a service covering each step of a CI/CD pipeline. Here is how they fit together at a high level.
Cloud Source Repositories is Google’s own Git hosting service. Think of it as GitHub but inside GCP. You create a repository, clone it, push code to it, and it integrates natively with other GCP services. One note: Cloud Source Repositories is being deprecated and will be replaced by Secure Source Manager (SSM). The concepts stay the same, just the product name changes. You can also connect external Git providers like GitHub or GitLab instead of using Cloud Source Repositories at all.
Cloud Build is the actual CI/CD engine. It is serverless, which means you do not manage any build servers. You define your pipeline steps in a YAML file called cloudbuild.yaml, and Cloud Build runs them in order. Each step runs inside a container, so you have full control over the environment. Steps can be anything: building Docker images, running unit tests, installing packages, copying files.
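To make that concrete, here is a minimal sketch of a cloudbuild.yaml with a single step. The step id and the echo command are just placeholders, not anything from the book:

```yaml
# Minimal cloudbuild.yaml sketch: one step running in a stock Ubuntu container.
steps:
  - id: say-hello                        # label shown in the Cloud Build console
    name: ubuntu                         # container image the step runs in
    args: ['echo', 'Hello from Cloud Build']
```

You can run a config like this by hand with `gcloud builds submit --config cloudbuild.yaml .` before wiring up any triggers, which is a useful way to debug the pipeline itself.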
Cloud Build Triggers connect the dots. A trigger watches your Git repository for events like code pushes, new tags, or pull requests. When the event happens, the trigger kicks off a Cloud Build run. You configure which repository to watch, which branch pattern to match, and which cloudbuild.yaml to execute.
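Triggers are usually configured in the console, but they can also be exported and imported as YAML via `gcloud builds triggers export` and `gcloud builds triggers import`. As a rough sketch of what such a definition looks like (the trigger, repository, and file names here are hypothetical):

```yaml
# Hypothetical trigger definition: watch the main branch of "my-data-pipeline"
# and run the cloudbuild.yaml at the repository root on every push.
name: my-pipeline-ci
description: Run CI on every push to main
triggerTemplate:
  repoName: my-data-pipeline
  branchName: ^main$       # branch is matched as a regular expression
filename: cloudbuild.yaml
```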
Container Registry (or Artifact Registry, which is the newer replacement) stores the Docker images that your CI pipeline produces. After your code passes tests and gets packaged into a container, it gets pushed here for storage and later deployment.
How a Build Pipeline Actually Works
The book walks through a full exercise, and here is the general flow without any code.
First, you create a Git repository in Cloud Source Repositories. Then you put your application code and a cloudbuild.yaml config file in that repository. The YAML file defines your pipeline steps.
A typical CI pipeline has three steps:
Step 1: Build a Docker image. Take all your code files and package them into a container image. This uses a publicly available Docker builder image from Google’s container registry.
Step 2: Run tests. Use the image you just built to run your unit tests. If the tests fail, Cloud Build stops right there. No broken code gets through. This is the whole point of CI.
Step 3: Push the image. If tests pass, push the finished container image to Container Registry. Now you have a tested, packaged version of your application ready for deployment.
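Put together, the three steps above might look something like this in cloudbuild.yaml. The image name and test command are my own assumptions for illustration, not the book's exact code:

```yaml
steps:
  # Step 1: package the code into a Docker image, tagged with the commit SHA.
  - id: build-image
    name: gcr.io/cloud-builders/docker
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/my-app:$SHORT_SHA', '.']

  # Step 2: run the unit tests inside the image we just built.
  # If the tests exit non-zero, the build stops here and nothing is pushed.
  - id: run-tests
    name: gcr.io/$PROJECT_ID/my-app:$SHORT_SHA
    entrypoint: python
    args: ['-m', 'pytest', 'tests/']

  # Step 3: tests passed, so push the image to the registry for deployment.
  - id: push-image
    name: gcr.io/cloud-builders/docker
    args: ['push', 'gcr.io/$PROJECT_ID/my-app:$SHORT_SHA']
```

Note that $PROJECT_ID and $SHORT_SHA are built-in Cloud Build substitutions, and $SHORT_SHA is only populated for trigger-initiated builds, so a manual `gcloud builds submit` run would need a fixed tag instead.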
Then you create a Cloud Build Trigger that watches your repository. Every time someone pushes code, the trigger fires, and these three steps run automatically.
The beauty of this setup is that it is completely serverless. You do not maintain build servers. You do not SSH into machines to run tests. You push code, and the system handles everything. If a test fails, you see it in the Cloud Build console. If everything passes, your new container image is ready.
The YAML Config Matters
Without getting into specific code, here is what you need to know about cloudbuild.yaml. Each step has a few key parameters:
name specifies which container image to use for that step. This is not just a label. It determines the environment where the step runs. Want to run Docker commands? Use the Docker builder image. Want to run Python tests? Use your own Python image.
id is a human-readable identifier for the step. It shows up in the Cloud Build console so you can see which step passed or failed.
args are the command-line arguments passed to the step's container. Combined with the entrypoint, they form the actual command that runs inside that container.
entrypoint lets you override the container's default command. If your container normally starts bash but you want to run python, you set the entrypoint to python.
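As a small illustration of how those four parameters combine in a single step (the Python image tag and test directory here are hypothetical):

```yaml
steps:
  - id: run-unit-tests        # id: human-readable label in the Cloud Build console
    name: python:3.11-slim    # name: image that provides the step's environment
    entrypoint: python        # entrypoint: override the image's default command
    args: ['-m', 'unittest', 'discover', '-s', 'tests']  # args: passed to the entrypoint
```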
Each step runs in sequence. If any step fails, the whole pipeline stops. This is by design. You do not want to deploy an image that failed its tests.
What is Next
In Part 2, we will look at a more practical scenario: deploying Cloud Composer DAGs using Cloud Build. That is where CI/CD connects back to the data pipelines we built earlier in the book. We will also see how the pipeline handles copying DAG files to GCS buckets, which is how Cloud Composer picks up new workflow definitions.
This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 11: Cost Strategy or continue to Chapter 12 Part 2: Building CI/CD Pipelines.