Data Engineering with GCP Chapter 12 Part 2: Building CI/CD Pipelines on Google Cloud
In Part 1 we covered the theory behind CI/CD and ran through a basic Cloud Build exercise with a Python project. Unit tests ran automatically on every push. Broken code got caught before it reached production. Good stuff, but that was a simple calculator script. Now we need to connect this to real data engineering work.
This part of Chapter 12 is where things get practical. We are deploying Airflow DAGs through a CI/CD pipeline, and then the book wraps up with honest advice about best practices that I think every data engineer should hear.
Deploying Cloud Composer DAGs with Cloud Build
The exercise uses a DAG from Chapter 4, back when the book taught Cloud Composer and Airflow. The idea is simple: instead of manually uploading your DAG Python file to the Cloud Composer GCS bucket, you let Cloud Build handle it automatically every time you push code.
The setup follows the same pattern as before. You create a Cloud Source Repository, connect it to a Cloud Build trigger, and prepare a cloudbuild.yaml file. But the pipeline steps are different this time because we are dealing with Airflow, not a regular Python app.
The pipeline has four steps:
Step 1: Build a Docker image. This image installs Airflow and all dependencies from a requirements file. You need Airflow installed because the next step runs unit tests against the DAG, and those tests need the Airflow package to work.
Step 2: Validate the DAG. This is the CI part. The pipeline runs unit tests against your DAG file inside the Docker container. It sets the AIRFLOW__CORE__DAGS_FOLDER environment variable so Airflow knows where to find the DAGs, then runs the test suite. If your DAG has a syntax error or an invalid cron expression, the build stops right here. Nothing gets deployed.
Step 3: Push the image to Container Registry. This step is optional for the deployment itself: Cloud Composer reads DAG files from GCS, so you cannot deploy a DAG from a Docker image. But storing the image in GCR means the next build run can pull it and use it as a layer cache, which makes subsequent builds faster.
Step 4: Deploy the DAG to GCS. This is the CD part. The pipeline uses gsutil rsync to copy the DAG files from the build workspace to the Cloud Composer GCS bucket. Once the file lands in the /dags directory, Cloud Composer picks it up and the DAG starts running in Airflow. That is it. Push code, tests run, DAG deploys.
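Stitched together, the four steps map to a cloudbuild.yaml roughly like the following. This is a sketch, not the book's exact file: the image name `airflow-ci`, the `tests/` path, and the bucket placeholder are assumptions, and the `docker pull || exit 0` trick simply tolerates a missing cache image on the first run.

```yaml
steps:
  # Step 1: build an image with Airflow and the project dependencies,
  # pulling the previous image from GCR as a layer cache when it exists.
  - name: 'gcr.io/cloud-builders/docker'
    entrypoint: 'bash'
    args: ['-c', 'docker pull gcr.io/$PROJECT_ID/airflow-ci:latest || exit 0']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/airflow-ci:latest',
           '--cache-from', 'gcr.io/$PROJECT_ID/airflow-ci:latest', '.']

  # Step 2: validate the DAGs inside that image (the CI gate).
  # A syntax error or bad cron expression fails the build here.
  - name: 'gcr.io/$PROJECT_ID/airflow-ci:latest'
    env: ['AIRFLOW__CORE__DAGS_FOLDER=/workspace/dags']
    args: ['python', '-m', 'pytest', 'tests/']

  # Step 3: push the image so the next build can reuse its layers.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/airflow-ci:latest']

  # Step 4: sync the DAG files into the Composer bucket (the CD step).
  # Replace the placeholder with your environment's actual bucket.
  - name: 'gcr.io/cloud-builders/gsutil'
    args: ['rsync', '-r', 'dags/', 'gs://<your-composer-bucket>/dags']
```

Because each step runs in order and any non-zero exit stops the build, the deploy in step 4 can only happen after the validation in step 2 passes.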
What Happens When Tests Fail
The book demonstrates this nicely. If you change the schedule_interval in your DAG from a valid cron expression to something like “This is wrong” and push it, Cloud Build catches it at the validation step. The logs clearly say “Invalid Cron expression” and point to the exact file. The deploy step never runs. The broken DAG never reaches your Composer environment.
This is the same principle as the calculator example from Part 1, but now it matters more. A broken DAG in production can silently fail, corrupt data, or waste compute resources for hours before anyone notices. Catching it at the CI stage costs you nothing.
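In the book, the catch happens because Airflow itself refuses to parse the DAG. The same idea can be sketched as a standalone check with nothing but the standard library. This `is_valid_cron` helper is hypothetical (not from the book, and far simpler than Airflow's real validation), but it shows how a string like "This is wrong" gets rejected before anything is deployed.

```python
import re

# Rough pattern for one field of a standard five-field cron expression:
# "*" or a number, optionally a range (-N), a step (/N), and comma-separated
# repeats. A simplified check, not a full cron parser.
_FIELD = re.compile(r"^(\*|\d+)(-\d+)?(/\d+)?(,(\*|\d+)(-\d+)?(/\d+)?)*$")

def is_valid_cron(expr: str) -> bool:
    """Return True if expr looks like a five-field cron expression."""
    fields = expr.split()
    if len(fields) != 5:
        return False
    return all(_FIELD.match(field) for field in fields)

# A valid daily schedule passes; free text does not.
print(is_valid_cron("30 2 * * *"))      # True
print(is_valid_cron("This is wrong"))   # False
```

A unit test asserting this kind of property is exactly the sort of cheap check that belongs in the validation step of the pipeline.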
Best Practices from the Field
This is my favorite section of the chapter. Adi Wijaya shares advice from working with dozens of companies, and it is refreshingly honest. There is no “follow these five steps and you are done” here. Instead, he talks about real tradeoffs.
Start with Code Testing, Not Data Testing
This advice is counterintuitive at first. Data engineers exist to make sure data is clean, so you would think data testing should come first. But Adi argues the opposite, and I agree with him based on my own experience.
Here is the problem. Testing code is straightforward. You define inputs, run functions, check outputs. Testing data is a different beast entirely. Data comes from sources you do not control. You can check for nulls, validate data types, count rows, look at distributions. But none of those checks will ever guarantee the data is correct. You are always guessing.
On top of that, running data quality checks on large datasets is expensive. When your production environment has a petabyte of data, do you really want to run the same tests across dev, staging, and UAT environments?
The practical advice: get your CI/CD pipeline solid with code testing first. Unit test your DAGs, lint your SQL with tools like SQLFluff, run your Python tests. Once the team is comfortable with that foundation, then add data quality checks on top.
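For the SQL linting piece, SQLFluff reads its settings from a `.sqlfluff` file in the repository root. A minimal config might look like this; treat it as a sketch and check the SQLFluff documentation for the option names your installed version supports.

```ini
# .sqlfluff -- minimal linter config (assumed settings, verify against
# your SQLFluff version's docs)
[sqlfluff]
dialect = bigquery
templater = raw
```

With this in place, a `sqlfluff lint` step can run in CI alongside the Python tests, so badly formed SQL fails the build the same way broken code does.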
Follow Software Engineering Practices
CI/CD has been a standard practice in software engineering for years. Data engineers do not need to reinvent the wheel. Keep everything in version control. Commit often. Automate your tests. Have a deployment strategy.
The one thing the book adds specifically for data engineers: treat SQL as code. Use a SQL linter (SQLFluff supports the BigQuery dialect). Set up peer review for SQL changes. Write unit tests for your queries using tools like Dataform. The primary focus should still be code quality, not data quality.
Budget Drives Your Testing Strategy
Here is the honest part that many books skip. When you have multiple environments (dev, SIT, UAT, production), should you duplicate all your production data to each one? Should you run all tests on full datasets in every environment?
The answer the book gives is: it depends on your budget. Ideally, yes, you would test everything everywhere. But duplicating a petabyte of data across four environments and running queries on all of it is extremely expensive.
In practice, Adi has seen all kinds of approaches. Some companies only store data in production, so development and testing happen there too. Others use data sampling for non-production environments. Some replicate everything. There is no single right answer. Every team has to make this call together with whoever controls the budget.
Make It a Team Effort
CI/CD only works if the whole team buys in and practices it consistently. The author has seen implementations that got too complicated and ended up slowing teams down instead of speeding them up. That is the opposite of what CI/CD is supposed to do.
The key is clear communication and documentation. One person can design a great pipeline. But if the rest of the team does not understand it or agree with it, it becomes a burden instead of a tool.
Chapter Summary
Chapter 12 covered CI/CD from a data engineering perspective. We used Cloud Build, Cloud Source Repositories, and Container Registry to build automated pipelines. The first exercise tested a basic Python project. The second exercise, covered here, deployed an Airflow DAG through a full CI/CD pipeline with validation, caching, and automatic deployment to Cloud Composer.
The best practices section was arguably more valuable than the exercises themselves. The takeaway is clear: start simple, focus on code quality first, respect your budget constraints when planning data tests, and make sure the whole team is on the same page.
This was the final technical chapter in the book. Everything from Chapter 1 through Chapter 12 covered the core GCP services for data engineering: BigQuery, Cloud Composer, Dataproc, Pub/Sub, Dataflow, and now CI/CD. The next chapter shifts gears to career advice, certifications, and growing as a data engineer.
This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 12 Part 1: CI/CD Basics or continue to Chapter 13 Part 1: Growing as a Data Engineer.