Data Engineering with GCP Chapter 4 Part 1: Automating Data Workflows with Cloud Composer

Up until now in the book, we built BigQuery tables by hand, wrote queries in the console, and loaded data manually. That works for learning, but nobody does that in production. In production, you need things to run on their own, on schedule, without you babysitting them at 5 AM.

Chapter 4 is where we learn how to automate all of that. The tool is Cloud Composer, and once you understand how it works, a huge chunk of data engineering starts making sense.

What Is Cloud Composer

Cloud Composer is Google’s managed service for Apache Airflow. That’s it. Google runs the servers, handles the installation, manages the updates. You just write your pipeline code and deploy it.

From a data engineer’s perspective, there is almost no difference between using Cloud Composer and running Airflow yourself. When you learn Cloud Composer, you are really learning Airflow.

There are two versions: Composer 1 and Composer 2. The big difference is scaling. Composer 1 makes you pick a fixed number of workers upfront. Those machines run 24/7, busy or idle. Composer 2 can scale down to zero when nothing is happening, and scale up when work comes in. Use Composer 2.

One thing about cost: you pay for the environment as long as it exists, not per DAG or per task. Think of it like renting an apartment. The meter runs whether you cook dinner or not.

Apache Airflow Basics

So what is Airflow actually doing? It is an open-source workflow management tool. If you have ever used Control-M, Informatica, or Talend, Airflow sits in the same category. The difference is that Airflow is code-first, not drag-and-drop.

The book uses a great kitchen analogy. Cooking pasta every morning involves three things: a chain of tasks (prep ingredients, cook, serve), a schedule (every day at 5 AM), and integrations with tools (knives, stove, plates). Airflow handles exactly these three things for data pipelines: task dependencies, scheduling, and system integration.

Why code instead of drag-and-drop? You can automate deployment, write proper tests, and keep everything in Git for version control.

DAGs, Operators, and Tasks

These three terms show up everywhere in Airflow, so let’s get them straight.

A DAG (Directed Acyclic Graph) is your workflow definition. It is a collection of tasks chained together with their dependencies. Every DAG needs an ID (must be unique across the entire environment), a start date, and a schedule. The schedule uses cron format or simple presets like @daily and @weekly.

One gotcha: the start date is counter-intuitive. If today is January 1st and you set start_date to January 1st with a daily midnight schedule, the DAG will not run today. Airflow calculates the first run as start_date plus one schedule interval. So to run today, set the start date to yesterday. The book recommends days_ago(1) to keep things simple.
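Here is what those three required pieces, plus the days_ago(1) trick, look like in a minimal DAG sketch. The DAG ID, task ID, and command are invented for illustration:

```python
# Minimal DAG skeleton. Names are made up; days_ago(1) follows the book's
# advice so the first scheduled run happens today, not tomorrow.
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    dag_id="hello_composer",      # must be unique across the whole environment
    start_date=days_ago(1),       # yesterday, so start_date + 1 interval = today
    schedule_interval="@daily",   # cron string ("0 0 * * *") or a preset
    catchup=False,                # don't backfill runs before the start date
) as dag:
    hello = BashOperator(task_id="say_hello", bash_command="echo hello")
```

Drop this file into the environment's DAGs bucket and the scheduler picks it up automatically.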

An operator is how Airflow talks to external systems. Think of operators as connectors. BigQuery, GCS, Cloud SQL, MySQL, PostgreSQL, email, Bash, Python, and dozens more. The huge library of pre-built integrations is one of Airflow’s biggest strengths.

A task is what you get when you use an operator with specific parameters. One DAG can have many tasks. Tasks are chained together with the bitshift syntax (>>) to define execution order. You can also have a task depend on multiple upstream tasks by grouping them in a list.

A DAG Run is an instance of your DAG executing at a specific time. The Airflow UI shows DAG Runs as colored bars and squares. Green means success, red means failure. You can click into any run and check logs for individual tasks, which is essential for debugging.

Cloud SQL Operator

The book’s Level 2 exercise moves from dummy tasks to real GCP operators. The first one is CloudSQLExportInstanceOperator, which extracts data from Cloud SQL to a GCS bucket.

You give it a project ID, instance name, and a body describing the export (file format, destination URI, SQL query). It pulls data from your database and drops a CSV file into GCS. Straightforward.
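A sketch of what that looks like, with the project, instance, bucket, and query all invented. The body follows the Cloud SQL Admin API export format:

```python
from airflow.providers.google.cloud.operators.cloud_sql import (
    CloudSQLExportInstanceOperator,
)

# Hypothetical export: dump the result of a SELECT into a CSV file in GCS.
export_body = {
    "exportContext": {
        "fileType": "CSV",
        "uri": "gs://my-bucket/extracts/stations.csv",
        "csvExportOptions": {"selectQuery": "SELECT * FROM stations"},
    }
}

export_task = CloudSQLExportInstanceOperator(
    task_id="export_mysql_to_gcs",
    project_id="my-project",
    instance="mysql-instance",
    body=export_body,
)
```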

GCS to BigQuery Operator

Next step: get that CSV from GCS into BigQuery. The operator is GCSToBigQueryOperator. You define the source bucket, file path, destination table, schema, and write disposition.

If you worked through the BigQuery chapter, these parameters look familiar. They match the BigQuery Python API. The operator just wraps it in a cleaner format.
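A hedged sketch of the load step, assuming the bucket, path, table, and schema from a hypothetical export like the one above:

```python
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

# Load a CSV from GCS into a raw BigQuery table. All names are invented;
# write_disposition mirrors the BigQuery Python API's WRITE_TRUNCATE.
load_task = GCSToBigQueryOperator(
    task_id="gcs_to_bq",
    bucket="my-bucket",
    source_objects=["extracts/stations.csv"],
    destination_project_dataset_table="my-project.raw.stations",
    schema_fields=[
        {"name": "station_id", "type": "STRING", "mode": "REQUIRED"},
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
    ],
    source_format="CSV",
    write_disposition="WRITE_TRUNCATE",  # replace the table on each run
)
```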

BigQuery Operator for Transformations

For transformations inside BigQuery, you use BigQueryInsertJobOperator. You give it a job configuration containing a SQL query and a destination table. It runs the query and stores the results, handling table creation and overwrite behavior through options like CREATE_IF_NEEDED and WRITE_TRUNCATE.
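A sketch of the transformation step. The query, project, dataset, and table names are all hypothetical:

```python
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

# Run a SQL transformation and write the result to a warehouse table.
transform_task = BigQueryInsertJobOperator(
    task_id="transform_station_trips",
    configuration={
        "query": {
            "query": (
                "SELECT station_id, COUNT(*) AS trips "
                "FROM `my-project.raw.trips` GROUP BY station_id"
            ),
            "useLegacySql": False,
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "dwh",
                "tableId": "station_trips",
            },
            "createDisposition": "CREATE_IF_NEEDED",  # create table if missing
            "writeDisposition": "WRITE_TRUNCATE",     # replace on each run
        }
    },
)
```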

Chain all three operators together and you have a complete ELT pipeline: extract from Cloud SQL, load into BigQuery raw tables, transform into warehouse tables. All scheduled, all automatic.
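Assuming the three tasks are named export_task, load_task, and transform_task (hypothetical names), the whole ELT chain comes down to one line at the bottom of the DAG file:

```python
# Cloud SQL -> GCS -> BigQuery raw -> BigQuery warehouse
export_task >> load_task >> transform_task
```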

Things to Avoid in BigQuery DAGs

The book makes three strong points about what not to do, and honestly, I’ve seen all three in real projects.

Don’t use the BigQuery Python library inside DAGs. Yes, you can import google.cloud.bigquery and call the API directly. But native operators handle connections, logging, and error handling for you. Your code stays clean and readable as configuration, which is what a DAG should be.

Don’t download data to Cloud Composer workers. Airflow is not a storage system. If your DAG downloads files to the worker VM or loads BigQuery data into a pandas DataFrame, you are eating the worker’s limited disk and memory. Do this with enough data and your entire Composer environment crashes, taking all your DAGs with it.

Don’t process data in Python inside the DAG. Airflow is an orchestrator. Its job is to tell other systems what to do, not to do the heavy lifting itself. Let BigQuery handle SQL. Let Dataproc handle Spark. Composer workers are small machines meant for coordination, not computation.

Variable Types in Cloud Composer

When your pipeline grows beyond one or two tables, you start seeing hardcoded values everywhere: project IDs, bucket names, dataset names. The book introduces three ways to handle variables.

DAG variables live inside your DAG script. Simple, but if you need to change a project ID across 50 DAGs, you are editing 50 files.

Airflow variables are stored in the Airflow metadata database and accessible from any DAG. You create them through the Airflow UI under Admin > Variables. The book recommends storing related values as JSON in a single variable rather than many individual ones, which is better for performance because every variable access is a query against that database.
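Reading such a JSON variable from a DAG might look like this (the variable name and its keys are invented):

```python
from airflow.models import Variable

# One JSON variable instead of many scalars -> one metadata-database read.
# "gcp_settings" would be created in the UI under Admin > Variables with a
# value like: {"project_id": "my-project", "bucket": "my-bucket"}
settings = Variable.get("gcp_settings", deserialize_json=True)

project_id = settings["project_id"]
bucket = settings["bucket"]
```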

Environment variables are set at the Cloud Composer environment level. They apply to the whole environment.
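Because Composer sets them at the environment level, every worker sees them through the standard os.environ. "GCP_PROJECT" below is a hypothetical variable name; the second argument is just a fallback for local development:

```python
import os

# Environment variables set on the Composer environment are visible to every
# DAG on every worker. The name and fallback value here are invented.
project_id = os.environ.get("GCP_PROJECT", "my-dev-project")
print(project_id)
```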

There is also a fourth option for sensitive data: GCP Secret Manager. The book mentions it but doesn’t cover it in detail to keep the learning focused.

What’s Next

At this point in the chapter, you can already build a working automated pipeline from Cloud SQL to BigQuery. It schedules itself, runs on time, and you can monitor it through the Airflow UI. That’s genuinely useful.

But there are still things to improve: making tasks idempotent so reruns don’t create duplicates, handling dependencies between different DAGs, and applying production-ready patterns. That’s what Part 2 covers.


This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 3 Part 2: Data Modeling in BigQuery or continue to Chapter 4 Part 2: Airflow Best Practices.
