Production Pipeline Features - Study Notes from Data Engineering with Python Ch 7

You built a pipeline. It works on your machine. It runs on a schedule. Data goes in, data comes out. Ship it, right?

Not so fast. Chapter 7 of Data Engineering with Python by Paul Crickard is about the gap between “it works” and “it works in production.” Three features separate a hobby pipeline from a production one: staging and validation, idempotency, and atomicity. If those words sound intimidating, the ideas behind them are simpler than you think.

Staging Your Data

Here is the problem. You query a transactional database. You get 1,000 sales records. You start transforming them. Halfway through, something breaks. You want to re-query the database, but the data has changed. Orders got canceled. New orders came in. The snapshot you had 10 minutes ago is gone forever.

This is why you stage your data.

Staging at the front of the pipeline (extraction):

Instead of passing query results straight into your transformation steps, you save them to a file first. A CSV dump, a JSON file, whatever works. Now if anything downstream breaks, you still have that original snapshot. You can replay the pipeline from the file without hitting the source database again.

Crickard uses a widget sales example to illustrate this. A NiFi pipeline queries a widget database, splits records, converts dates to GMT, and loads into a warehouse. Without staging, a failure in the date conversion step means you have to re-query the database. But the database has changed. You lost data.

With staging, the pipeline becomes two parts. Part one: query the database and save results to a file. Part two: read the file, transform, and load. If part two fails, you just restart it. The file is still there.
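A rough sketch of that two-part shape, using the stdlib sqlite3 and csv modules as stand-ins for the real source database and staging format:

```python
import csv
import sqlite3

def stage_extract(conn, staging_path):
    """Part one: hit the source once and snapshot the results to a file."""
    cur = conn.execute("SELECT id, widget, sold_at FROM sales")
    with open(staging_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur.fetchall())

def transform_from_stage(staging_path):
    """Part two: work from the snapshot; re-runnable without re-querying."""
    with open(staging_path, newline="") as f:
        return list(csv.DictReader(f))
```

If the transform step fails, you rerun `transform_from_stage` against the same file; the source database is never touched a second time.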

Staging at the end of the pipeline (loading):

You can also stage before loading into your final warehouse. Instead of inserting directly into production, load into a replica database first. Run validation checks on that replica. If everything looks good, move it to the real warehouse. If something is wrong (dates stored as strings, wrong record counts), you catch it before it hits production.

Stage everywhere if you need to:

There is no rule that says you can only stage at the start and end. You can stage after every transformation step. The more stages you have, the easier it is to debug failures and resume from any point. As your transformations get more complex and time-consuming, this becomes more valuable.

One more practical benefit: staging reduces load on source systems. If your queries are expensive or the source database belongs to another team, you only need to hit it once. After that, you work from your local copy.

ETL vs ELT

Crickard makes a quick but important note here. Traditional ETL (Extract, Transform, Load) does transformations before loading. But there is a growing trend toward ELT (Extract, Load, Transform), where you load raw data into a database first, then do all transformations there using SQL. Both approaches are valid. ELT works especially well with SQL-based transformation tools. There is no single right answer.

Validating Data with Great Expectations

Staging gives you a place to check your data. But checking it manually is not sustainable. You need automated validation. That is where Great Expectations comes in.

Great Expectations is a Python library that lets you write human-readable validation rules for your data. Instead of writing a bunch of pandas code to check for nulls, data types, and value ranges, you write statements like:

expect_column_values_to_not_be_null('age')

The library handles the implementation whether your data is in a DataFrame or a database.

Setting it up:

You install it with pip, initialize a project with great_expectations init, and walk through a setup wizard. You tell it where your data lives (files or database), what tool processes it (pandas or SQL), and point it at a sample data file.

Great Expectations then generates a default expectation suite based on your sample data. It creates expectations like “this column should exist,” “this column should not have nulls,” “age values should be between 18 and 80.” It even generates HTML documentation showing which expectations passed or failed.

Editing expectations:

The default suite is usually too rigid. You can edit it in a Jupyter notebook using the suite edit command. Remove expectations that are too strict. Add new ones that match your actual requirements. The full glossary of available expectations is in their docs.

Running validations in NiFi:

To use Great Expectations inside a NiFi pipeline, you create a “tap” (an executable Python script) from your expectation suite. One important modification: change all sys.exit(1) calls to sys.exit(0). If the script exits with a non-zero code, NiFi treats it as a crash. Instead, have the script always exit cleanly but print a JSON result like {"result": "pass"} or {"result": "fail"}.
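The shape of such a tap script might look like this sketch; `run_validation` is a placeholder standing in for the generated expectation-suite code, not Crickard's actual tap:

```python
import json
import sys

def run_validation():
    """Placeholder: the real tap runs the expectation suite here and
    reports whether every expectation passed."""
    return True

def main():
    verdict = "pass" if run_validation() else "fail"
    # NiFi reads stdout, so report the outcome as JSON rather than via the
    # exit code -- ExecuteStreamCommand treats a non-zero exit as a crash.
    print(json.dumps({"result": verdict}))
    sys.exit(0)  # always exit cleanly, per the modification described above

# when run as the tap script, the file would end with: main()
```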

Then in NiFi, you use an ExecuteStreamCommand processor to run the tap, an EvaluateJsonPath processor to extract the result, and a RouteOnAttribute processor to send the flowfile down either a pass or fail path. If it passed, continue the pipeline. If it failed, log the error or alert someone.

Running validations in Airflow:

In Airflow, you skip the tap and embed the validation code directly in a Python task. The key difference: instead of printing pass/fail JSON, you raise an AirflowException if validation fails. Airflow handles the rest, marking the task as failed and stopping downstream tasks.
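A hedged sketch of that task body; the import fallback exists only so the snippet runs outside an Airflow installation, and `validation_passed` stands in for the Great Expectations result:

```python
try:
    from airflow.exceptions import AirflowException
except ImportError:  # fallback so the sketch runs without Airflow installed
    class AirflowException(Exception):
        pass

def validate_staged_data(validation_passed):
    """Callable for a PythonOperator task: raising fails the task, and
    Airflow then stops this run's downstream tasks."""
    if not validation_passed:
        raise AirflowException("Data validation failed; halting the run")
    return "validation passed"
```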

Failing on purpose:

Crickard shows a neat way to test your validation. The sample data generator produces ages between 18 and 80. Change it to produce ages between 1 and 100, and suddenly your validation suite catches records outside the expected range. The generated documentation shows exactly which expectations failed and which values were out of range, and it keeps a history of all validation runs.

Building Idempotent Pipelines

Idempotent means: run it once, run it ten times, you get the same result.

Here is the thing. Pipelines fail. That is not an “if,” it is a “when.” When they fail, you need to re-run them. If re-running your pipeline creates duplicate records, you have a problem.

Imagine a pipeline that queries an API and inserts records into Elasticsearch. It runs every 8 hours. Each run grabs recent records, but also records you already have from previous runs. Without idempotency, you get duplicates piling up.

Two approaches to idempotency:

Approach 1: Upserts. Instead of blind inserts, use an upsert operation. Extract a unique identifier from each record (like an issue ID) and use it as the document key. If the record already exists, it gets updated. If it is new, it gets inserted. No duplicates. This is what Crickard used in the SeeClickFix pipeline from Chapter 6 with Elasticsearch’s upsert method.
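The idempotent effect of keyed upserts can be sketched with a plain dict standing in for the Elasticsearch index; the real pipeline passes the issue ID as the document `_id` and uses Elasticsearch's upsert option on the update call:

```python
def upsert_records(index, records, key="issue_id"):
    """Keyed writes: an existing record is overwritten, a new one is added,
    so replaying the same batch never creates duplicates.
    `index` is a dict standing in for the Elasticsearch index."""
    for record in records:
        index[record[key]] = record
    return index
```

Running the same batch twice leaves the index exactly as it was after the first run.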

Approach 2: Timestamped indexes or partitions. Create a new index (or database partition) every time the pipeline runs, with a datetime suffix in the name. Each run produces a distinct, complete snapshot. No run ever modifies a previous run’s data. This makes your data immutable. Some functional data engineering advocates prefer this approach because it also gives you a full history of every pipeline run.
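Naming each run's index is as simple as suffixing a timestamp; the `scf` base name below is made up for illustration:

```python
from datetime import datetime, timezone

def run_index_name(base, when=None):
    """Build a per-run index name like base-20240115-0800. Each pipeline run
    writes to its own index, so earlier runs' data is never modified."""
    when = when or datetime.now(timezone.utc)
    return f"{base}-{when:%Y%m%d-%H%M}"
```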

Building Atomic Pipelines

Atomicity means: all or nothing. If you are inserting 1,000 records and record 500 fails, all 1,000 should fail. You do not want a half-completed transaction sitting in your database.

Why this matters:

Without atomicity, a failure leaves you in a partial state. Some records made it in, some did not. Now you have to figure out which ones succeeded, which failed, and how to fix it. That debugging process is painful and error-prone.

SQL databases have this built in. If you use a library like psycopg2 (for PostgreSQL), you can wrap multiple inserts into a single transaction. If any insert fails, the database rolls back everything automatically. You can safely retry the entire batch.
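A sketch of that rollback behavior, using the stdlib sqlite3 module as a stand-in; psycopg2 against PostgreSQL gives the same all-or-nothing guarantee with the same pattern:

```python
import sqlite3

def atomic_insert(conn, rows):
    """Insert every row in one transaction. If any insert fails, the
    context manager rolls the whole batch back; nothing partial remains."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.executemany(
                "INSERT INTO records (id, value) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.Error:
        return False
```

A batch containing a bad row (say, a duplicate primary key) leaves the table untouched, so the entire batch can be retried safely.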

Elasticsearch does not: it has no atomic transactions, and each document insert is independent. So if your pipeline loads into Elasticsearch, you need to handle atomicity yourself.

Crickard shows one approach: track both successes and failures. Write successful document IDs to one file and failed ones to another. If there are any failures, use the success file to delete what was already inserted. Then retry the whole batch.
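One way to sketch that bookkeeping, with a dict standing in for Elasticsearch and in-memory lists in place of the success/failure files (the names here are illustrative, not Crickard's code):

```python
def load_all_or_nothing(store, docs, insert):
    """Emulate atomicity on a store without transactions: record each doc's
    outcome, and if anything failed, delete the successes and report failure.
    `insert` raises on a bad doc; `store` is a dict standing in for
    the Elasticsearch index."""
    succeeded, failed = [], []
    for doc in docs:
        try:
            insert(store, doc)
            succeeded.append(doc["id"])
        except Exception:
            failed.append(doc["id"])
    if failed:
        for doc_id in succeeded:  # compensate: undo the partial load
            del store[doc_id]
        return False  # the caller can now safely retry the whole batch
    return True
```

If anything failed, the partial load is deleted, so a retry starts from a clean slate instead of a half-loaded index.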

It is not elegant. Crickard says so himself. But it is necessary. Debugging partial failures is harder than building atomicity into the pipeline from the start.

Key Takeaways

  • Stage your data at extraction. Save query results to files before transforming. If the pipeline breaks, you can restart from the file instead of re-querying a changed database.
  • Stage before loading. Load into a replica first. Validate before moving to production.
  • Automate validation with Great Expectations. Human-readable rules, automatic documentation, and integration with both NiFi and Airflow.
  • Make pipelines idempotent. Use upserts or timestamped partitions so re-running never creates duplicates.
  • Make pipelines atomic. All records succeed or all records fail. No partial states.
  • SQL databases give you atomicity for free. NoSQL systems like Elasticsearch require you to build it yourself.

My Take

This is one of those chapters that separates the tutorials from the real world. Most beginner resources teach you how to move data from A to B. Very few talk about what happens when things go wrong. And things always go wrong.

The staging concept is straightforward but easy to skip when you are prototyping. The temptation is to pipe everything through without saving intermediate results. Then when something breaks at step 5 of a 7-step pipeline, you have to start over from step 1. Staging fixes that.

Great Expectations is a nice tool. The human-readable expectations are much cleaner than writing custom pandas validation code. The auto-generated documentation is a bonus. If you are building production pipelines, it is worth adding to your toolkit.

The idempotency and atomicity sections are shorter than I expected. They cover the concepts well and show practical examples, but these are deep topics that could fill chapters on their own. Still, understanding the “why” is the important part. The “how” will vary depending on your stack.

One thing I would add: monitoring and alerting. Staging, validation, idempotency, and atomicity are all essential. But you also need to know when things break. A pipeline that fails silently is worse than one that fails loudly. That said, Crickard may cover observability in later chapters.


About BookGrill.net

BookGrill.net is a technology book review site for developers, engineers, and anyone who builds things with code. We cover books on software engineering, AI and machine learning, cybersecurity, systems design, and the culture of technology.
