Data Engineering with GCP Chapter 4 Part 2: Airflow Scheduling, Idempotency, and Sensors
In the first part we got Cloud Composer running, wrote our first DAGs, and learned operators. This second part covers the stuff that separates beginner Airflow code from production-ready pipelines: variables, idempotent tasks, backfilling, sensors, and dataset-driven scheduling.
Macro Variables and Why They Matter
Airflow has this concept called macro variables. They're template variables that resolve to information about the current DAG run: the DAG ID, the task ID, and, most importantly for data pipelines, the logical date (what older Airflow versions called the execution date).
Here's the thing that confuses everyone the first time. In Airflow, the logical date is not "today." When you set a start_date in the past with a daily schedule, Airflow creates a run for each day going back, and the logical date tells you which day that specific run is supposed to process.
So if today is January 4th and your start_date is January 1st, you'll have three DAG runs so far, one each for January 1st, 2nd, and 3rd (a run only fires after its data interval closes, so the run with logical date January 4th won't execute until January 5th). Each run has its own logical date, and your tasks can use it to know which day's data to load. That's what the {{ ds }} macro does: Airflow renders it into the run's logical date as a YYYY-MM-DD string at runtime.
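A plain-Python sketch makes the mapping concrete (no Airflow needed; the key rule is that a daily run fires only once its interval has closed):

```python
from datetime import date, timedelta

def logical_dates(start_date: date, today: date) -> list[str]:
    """Logical dates for a daily schedule, one per completed interval.

    A run's logical date marks the start of its data interval; the run
    itself fires only after that interval ends, so "today" has no run yet.
    """
    completed_days = (today - start_date).days
    return [(start_date + timedelta(days=i)).isoformat()
            for i in range(completed_days)]

# start_date January 1st, checked on January 4th:
print(logical_dates(date(2024, 1, 1), date(2024, 1, 4)))
# ['2024-01-01', '2024-01-02', '2024-01-03']
# Each of those strings is what {{ ds }} renders to inside that run's tasks.
```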
This sounds abstract until you need to load historical data or reprocess a failed day. The author admits this concept is hard to digest without practice, and I agree. Just remember: logical date is not wall-clock time.
Loading Tables and Task Dependencies
The book walks through loading three tables (stations, regions, trips) from different sources into BigQuery. The interesting pattern is the trips table. It pulls data from a BigQuery public dataset, extracts to GCS, then loads into your own dataset. Why the roundabout path? Because in real life your source is rarely BigQuery. It’s MySQL, Oracle, Hadoop, whatever. The extract-to-GCS-then-load pattern works for any source.
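A minimal sketch of that extract-to-GCS-then-load pattern. The bucket, project, and dataset names here are hypothetical placeholders, not the book's actual code; the public bikeshare table is illustrative:

```python
import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="load_trips",
    start_date=datetime.datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    # Step 1: extract the source table into your own GCS bucket.
    # {{ ds }} keeps each run's export in its own dated folder.
    extract = BigQueryToGCSOperator(
        task_id="extract_trips_to_gcs",
        source_project_dataset_table="bigquery-public-data.san_francisco_bikeshare.bikeshare_trips",
        destination_cloud_storage_uris=["gs://my-bucket/raw/trips/{{ ds }}/*.csv"],
        export_format="CSV",
    )
    # Step 2: load from GCS into your own BigQuery dataset. This half of
    # the pattern stays the same no matter what produced the GCS files.
    load = GCSToBigQueryOperator(
        task_id="load_trips_to_bq",
        bucket="my-bucket",
        source_objects=["raw/trips/{{ ds }}/*.csv"],
        destination_project_dataset_table="my-project.raw_dataset.trips",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )
    extract >> load
```

Swap the extract step for a MySQL dump or a Hadoop export and the load half doesn't change, which is the whole point of staging through GCS.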
The book also shows storing BigQuery schemas as JSON files in GCS instead of hardcoding them in DAG code. When a schema changes, you update the JSON file, not the Python.
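For illustration, a schema file in GCS is just BigQuery's JSON schema format, and GCSToBigQueryOperator can point at it with the schema_object parameter (paths and field names here are hypothetical):

```python
# gs://my-bucket/schemas/stations.json might contain:
# [
#   {"name": "station_id", "type": "STRING",  "mode": "REQUIRED"},
#   {"name": "name",       "type": "STRING",  "mode": "NULLABLE"},
#   {"name": "capacity",   "type": "INTEGER", "mode": "NULLABLE"}
# ]
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

load_stations = GCSToBigQueryOperator(
    task_id="load_stations",
    bucket="my-bucket",
    source_objects=["raw/stations/{{ ds }}/*.csv"],
    destination_project_dataset_table="my-project.raw_dataset.stations",
    schema_object="schemas/stations.json",  # schema lives in GCS, not in DAG code
    source_format="CSV",
    write_disposition="WRITE_TRUNCATE",
)
```

When a column is added, you edit the JSON file in the bucket and the DAG picks it up on the next run, with no Python deploy.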
With multiple tables loading in parallel, you need proper dependencies. Airflow’s bracket notation lets you say “wait for all these tasks before starting this one.” Your three loads run in parallel, transformations wait for all three, and data quality checks at the end verify that output tables have records. Simple row count: if zero, the task fails.
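A sketch of that fan-in shape, with EmptyOperator placeholders standing in for the real load and transform tasks (names are illustrative, and the row-count check uses BigQueryCheckOperator, which fails the task when the first result row is falsy, i.e. a zero count):

```python
import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryCheckOperator

with DAG(
    dag_id="bike_dwh",
    start_date=datetime.datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    # Placeholders for the three parallel loads and the transformation.
    load_stations = EmptyOperator(task_id="load_stations")
    load_regions = EmptyOperator(task_id="load_regions")
    load_trips = EmptyOperator(task_id="load_trips")
    transform = EmptyOperator(task_id="transform_fact_trips")

    # Data quality gate: a zero row count makes this task fail.
    check = BigQueryCheckOperator(
        task_id="check_fact_trips_not_empty",
        sql="SELECT COUNT(*) FROM `my-project.dwh.fact_trips`",
        use_legacy_sql=False,
    )

    # Bracket notation: transform waits for all three parallel loads.
    [load_stations, load_regions, load_trips] >> transform >> check
```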
Backfilling, Rerun, and Catchup
These three concepts come up constantly in real data engineering work, so let’s be clear about each.
Backfilling is loading historical data that was never loaded before. Your pipeline ran for a week, someone asks for the previous week too. You run a backfill command and Airflow creates runs for those past dates.
Rerun is when a DAG or task failed and you need to run it again. Clear the task's state in the Airflow UI and the scheduler runs it again.
Catchup is automatic backfilling. Deploy a new DAG with a start_date in the past and Airflow creates runs for every missed interval. Deploy on January 7th with start_date January 1st and you get six runs right away, one per completed daily interval. Turn it off with catchup=False if you don't want that.
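The flag itself is a one-liner on the DAG (dag_id and dates here are illustrative):

```python
import datetime

from airflow import DAG

# Deployed on January 7th with a start_date a week in the past:
with DAG(
    dag_id="daily_loader",
    start_date=datetime.datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,   # the default: scheduler creates a run per missed interval
) as dag:
    ...
# With catchup=False, only the most recent completed interval gets a run
# on deploy; historical intervals are skipped unless you backfill manually.
```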
All three rely on the logical date concept. And all three will break your pipeline if tasks aren’t idempotent.
Task Idempotency: The Water Glass Problem
The author uses a great analogy here. Imagine filling five empty glasses with water one at a time. You fill the first three. Someone says “the first glass had dirt in it, refill it.” If you just pour more water into the already full glass, it overflows. That’s a non-idempotent task.
An idempotent task would empty the glass first, then refill it. Same action, same result, every time.
In Airflow terms, if your task uses WRITE_APPEND to load data into BigQuery and you rerun it, you get duplicate records. The first run appended 1000 rows, the rerun appends another 1000 rows for the same day. Now you have 2000 rows where you should have 1000.
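A toy simulation of the two write dispositions makes the rerun behavior obvious. This is plain Python standing in for BigQuery, not the real API:

```python
def load(table: list[dict], rows: list[dict], disposition: str) -> None:
    """Simulate a BigQuery load job against an in-memory 'table'."""
    if disposition == "WRITE_TRUNCATE":
        table.clear()       # idempotent: wipe before writing
    table.extend(rows)      # WRITE_APPEND just adds on top

day_rows = [{"trip_id": i, "ds": "2024-01-02"} for i in range(1000)]

appended: list[dict] = []
load(appended, day_rows, "WRITE_APPEND")
load(appended, day_rows, "WRITE_APPEND")      # the rerun
print(len(appended))    # 2000 -- duplicates

truncated: list[dict] = []
load(truncated, day_rows, "WRITE_TRUNCATE")
load(truncated, day_rows, "WRITE_TRUNCATE")   # rerun is safe
print(len(truncated))   # 1000 -- same result no matter how often it runs
```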
The fix seems obvious: use WRITE_TRUNCATE instead of WRITE_APPEND. Truncate the table, reload everything. But that creates a new problem.
BigQuery Partitioning Saves the Day
If you use WRITE_TRUNCATE on the entire table, you're rewriting everything on every run. Got a 1TB table and a pipeline that has been running daily for a year? You'd have rewritten the full terabyte on every one of those 365 runs. That's wasteful and expensive.
This is where BigQuery partitioned tables come in. You partition your table by date. Each partition is basically a separate slice of the table. Now when you WRITE_TRUNCATE, you target only one partition, not the whole table. You add the partition date as a suffix to the destination table ID.
So a rerun for January 2nd only rewrites the January 2nd partition. The rest of the table stays untouched. You get idempotent tasks without the cost of rewriting terabytes of data.
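The suffix is BigQuery's `$YYYYMMDD` partition decorator, which in a DAG you'd typically build from the {{ ds_nodash }} macro. A tiny sketch (table name is illustrative):

```python
def partitioned_table_id(table: str, ds_nodash: str) -> str:
    """Destination table ID scoped to one partition of a date-partitioned table.

    With "$YYYYMMDD" appended, WRITE_TRUNCATE wipes only that day's
    partition instead of the whole table.
    """
    return f"{table}${ds_nodash}"

print(partitioned_table_id("my-project.dwh.fact_trips", "20240102"))
# my-project.dwh.fact_trips$20240102
```

In a load task the second argument would be the rendered macro, so each DAG run targets exactly its own day's partition.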
This combination of WRITE_TRUNCATE plus date-partitioned tables is the pattern the book recommends for incremental loads. Your tasks are safe to rerun, no duplicates, no unnecessary data rewrites.
Airflow Sensors: Don’t Run Blind
Now we get to a different problem. What happens when you have two DAGs that depend on each other?
DAG 1 loads raw data. DAG 2 transforms it. You schedule DAG 1 at 5:00 AM and DAG 2 at 5:45 AM, assuming DAG 1 finishes in about 30 minutes. Works great until one day DAG 1 takes 50 minutes. DAG 2 starts on schedule at 5:45, finds no data, and either fails or produces garbage.
Airflow sensors solve this. A sensor is a task that waits for a condition before letting the DAG proceed. The classic GCS sensor checks if a file exists in a bucket. It keeps poking at a set interval until the file appears. DAG 1 writes a success signal file when it finishes. DAG 2 starts with a sensor watching for that file. No guessing about timing.
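A sketch of the downstream DAG's sensor, using the Google provider's GCSObjectExistenceSensor; bucket and object paths are hypothetical:

```python
import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    dag_id="transform_dag",
    start_date=datetime.datetime(2024, 1, 1),
    schedule="45 5 * * *",
) as dag:
    # Wait for the signal file that DAG 1 writes when it finishes.
    wait_for_raw = GCSObjectExistenceSensor(
        task_id="wait_for_raw_load",
        bucket="my-bucket",
        object="signals/raw_load/{{ ds }}/_SUCCESS",
        poke_interval=60,    # re-check every 60 seconds
        timeout=60 * 60,     # give up (and fail) after an hour
    )
    # ...transform tasks go downstream of wait_for_raw
```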
The downside? In its default poke mode, a sensor occupies a worker slot for the entire wait, burning compute on every check. Which is why Airflow has been evolving this concept.
Airflow Datasets: The Modern Way
Sensors worked but had overhead. Airflow went through Smart Sensors (deprecated), then Deferrable operators, and finally landed on Datasets in Airflow 2.4.
Don’t let the name confuse you. An Airflow Dataset has nothing to do with BigQuery datasets or actual data storage. Think of it as a signal label: a name that one DAG emits and another DAG listens for, tracked by the scheduler in Airflow’s metadata database.
The upstream DAG adds an outlets parameter to its final task, sending a signal with a name you choose. The downstream DAG uses that signal name in its schedule parameter instead of a cron expression. When the upstream task completes, the downstream DAG starts. No files to manage, no poking, no wasted compute. You can wait for multiple signals from different DAGs before starting.
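Both halves of the handshake fit in a few lines. EmptyOperator stands in for the real tasks, and the Dataset name is illustrative:

```python
import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

# A label, not real storage: just a name both DAGs agree on.
TABLES_VERIFIED = Dataset("bike_dwh_tables_verified")

# Upstream DAG: its final task emits the signal when it succeeds.
with DAG(
    dag_id="main_pipeline",
    start_date=datetime.datetime(2024, 1, 1),
    schedule="@daily",
) as upstream:
    verify = EmptyOperator(task_id="verify_tables", outlets=[TABLES_VERIFIED])

# Downstream DAG: scheduled by the signal instead of a cron expression.
# Listing several Datasets here would make it wait for all of them.
with DAG(
    dag_id="data_marts",
    start_date=datetime.datetime(2024, 1, 1),
    schedule=[TABLES_VERIFIED],
) as downstream:
    build_marts = EmptyOperator(task_id="build_mart_tables")
```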
In the final exercise, the main pipeline DAG sends signals once the fact and dimension tables pass verification. A separate downstream DAG that builds data mart tables only starts when both signals arrive.
What I Think
Everyone can write a DAG that runs once. The hard part is writing one that survives reruns, backfills, late data, and team boundaries. The progression from “append everything” to “idempotent partitioned loads” to “dataset-driven scheduling” mirrors what happens in real organizations.
If you take one thing from this chapter, let it be task idempotency. Every task should produce the same result whether it runs once or ten times.
This is part of my retelling of “Data Engineering with Google Cloud Platform” by Adi Wijaya. Go back to Chapter 4 Part 1: Cloud Composer Workflows or continue to Chapter 5 Part 1: Building a Data Lake.