Data Engineering with Python: Final Thoughts and Takeaways

That’s it. Fifteen chapters, seventeen posts, and one complete walkthrough of Paul Crickard’s Data Engineering with Python (Packt, 2020, ISBN: 978-1-83921-418-9).

Here’s what I think after going through the whole thing.

What the Book Does Well

It covers the full stack. Most data engineering books focus on one tool or one layer of the pipeline. This one walks you through extraction, transformation, loading, scheduling, monitoring, version control, deployment, streaming, and big data processing. That’s a lot of ground for one book.

The three project chapters are the highlight. Chapters 6, 11, and 15 tie everything together with real pipelines. Chapter 6 builds a 311 data pipeline from scratch. Chapter 11 upgrades it to production quality. Chapter 15 adds edge computing with MiNiFi, Kafka, and Spark. These chapters are where the learning really clicks.

NiFi gets serious coverage. If you’ve never worked with Apache NiFi, this book is one of the better introductions out there. Crickard clearly knows the tool well, and he covers processor configurations, Registry setup, and deployment strategies in practical, hands-on detail.

The progression makes sense. You start with basics (what is data engineering, file I/O, databases), move to building pipelines, then production concerns, then streaming. Each chapter builds on the last. You don’t feel lost jumping between topics.

What Shows Its Age

The book came out in 2020. Some things have changed:

Docker Compose is the standard now. The book installs everything locally with manual downloads and config files. Today you’d use Docker Compose for most of these tools. It would cut the setup chapter in half.
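As a rough illustration, a single Compose file could stand up NiFi and Kafka together in one command. This is only a sketch; the image tags and port mappings are assumptions, not something from the book:

```yaml
# docker-compose.yml -- illustrative sketch, image tags are assumptions
services:
  nifi:
    image: apache/nifi:latest
    ports:
      - "8443:8443"   # NiFi web UI
  kafka:
    image: apache/kafka:latest   # runs in KRaft mode, no ZooKeeper needed
    ports:
      - "9092:9092"   # broker listener
```

With something like this, `docker compose up` replaces most of the manual download-and-configure work in the setup chapter.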

Airflow has evolved a lot. The book’s Airflow sections use the older pattern of instantiating a DAG object and wiring operators together by hand. Modern Airflow has the TaskFlow API, DAG decorators, and a much improved UI. The core concepts still apply, but the syntax looks different now.
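For comparison, here’s roughly what a minimal pipeline looks like in the TaskFlow style. This is a sketch assuming Airflow 2.x (the `@dag`/`@task` decorators and the `schedule` argument), and it needs a running Airflow install to actually execute; the DAG and task names are my own:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def etl_311():
    @task
    def extract():
        # In a real pipeline this would hit the 311 API
        return [{"id": "311-001"}]

    @task
    def load(records):
        print(f"loading {len(records)} records")

    # The dependency is inferred from the data passing --
    # no explicit set_downstream / >> wiring needed
    load(extract())

etl_311()
```

Compare that with the book-era style of creating `PythonOperator` instances and chaining them manually; the concepts map one-to-one, but the decorator version is what you’ll see in current docs.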

Kafka without KRaft. The book uses ZooKeeper for Kafka cluster management. Kafka has since moved to KRaft mode, which removes the ZooKeeper dependency entirely. The fundamentals are the same, but the setup is simpler now.
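In KRaft mode the broker and controller roles are configured directly in `server.properties` instead of pointing at a ZooKeeper ensemble. A minimal single-node sketch (node IDs and ports here are illustrative):

```properties
# server.properties -- single-node KRaft sketch
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
```

Everything the book teaches about topics, partitions, producers, and consumers still applies unchanged; only the cluster-coordination layer moved.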

Great Expectations has grown. The book covers basic expectations. The library has gotten much more powerful since then with data docs, checkpoints, and better integrations.

Python packaging. There’s no mention of virtual environments, Poetry, or modern dependency management. For a Python book, that’s a gap.
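At minimum, each pipeline project today should get its own virtual environment. A quick sketch using nothing but the stdlib `venv` module:

```shell
# Create an isolated environment so pipeline dependencies
# don't leak into (or break against) the system Python
python3 -m venv .venv
. .venv/bin/activate
# Pin the exact installed versions so deployments are reproducible
python -m pip freeze > requirements.txt
```

Tools like Poetry or uv layer nicer workflows on top, but even this much would have saved readers from the version conflicts that plague shared-interpreter setups.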

Who Should Read This

Good for beginners who want breadth. If you’re new to data engineering and want to understand the full landscape, this book gives you that. You’ll touch databases, file processing, ETL, scheduling, monitoring, streaming, and big data processing all in one place.

Good for people who learn by building. The project chapters are the real value. You don’t just read about concepts. You build actual pipelines that do actual things.

Less useful if you already work in data engineering. If you’ve built production pipelines before, the content will feel basic. The tool coverage is introductory, not deep.

Less useful if you want cloud-native patterns. There’s nothing about AWS, GCP, or Azure. No managed services, no serverless, no cloud data warehouses. It’s all self-hosted, local infrastructure.

My Top Takeaways

After working through all fifteen chapters, here’s what stuck with me:

  1. Data engineering is mostly plumbing. It’s not glamorous. It’s about reliably moving data from point A to point B, making sure it’s clean, and making sure nothing breaks at 3 AM.

  2. Pick your scheduling tool and commit. NiFi for visual, drag-and-drop pipelines. Airflow for code-first, version-controlled workflows. Both work. Using both in the same project creates confusion.

  3. Idempotency is not optional. If your pipeline can’t run twice without messing up your data, it’s not production-ready. This was a recurring theme and it’s the right one.

  4. Monitoring is an afterthought until something breaks. Chapter 9 felt short, and that’s probably realistic. Most teams underinvest in monitoring until they get burned.

  5. Streaming is a different mindset. Moving from batch to streaming isn’t just a technology change. It’s a conceptual shift in how you think about data. Windowing, event time vs. processing time, unbounded datasets. It takes practice.

  6. The tools change, the patterns don’t. NiFi might get replaced by something else. Kafka might look different in five years. But ETL, staging, validation, idempotency, and monitoring are patterns that survive any tool swap.
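Takeaway 3 is easy to make concrete. An idempotent load keys every row on a stable ID and upserts, so rerunning the pipeline rewrites rows instead of duplicating them. A minimal sketch with stdlib sqlite3 (the table and row shapes are invented for illustration, not taken from the book’s 311 schema):

```python
import sqlite3

def load_idempotent(conn, rows):
    """Upsert rows keyed by id: running the load twice yields the same table."""
    conn.executemany(
        """
        INSERT INTO complaints (id, complaint_type, status)
        VALUES (:id, :complaint_type, :status)
        ON CONFLICT(id) DO UPDATE SET
            complaint_type = excluded.complaint_type,
            status = excluded.status
        """,
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE complaints (id TEXT PRIMARY KEY, complaint_type TEXT, status TEXT)"
)
rows = [{"id": "311-001", "complaint_type": "noise", "status": "open"}]
load_idempotent(conn, rows)
load_idempotent(conn, rows)  # second run: same table, no duplicates
```

A plain `INSERT` here would leave two copies (or crash on the primary key) after a retry; the upsert makes retries safe by construction.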
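The event-time point in takeaway 5 also fits in a few lines. A tumbling window groups events by when they *happened*, not when they arrived, so out-of-order delivery doesn’t change the counts. A toy sketch in plain Python (real streaming engines add watermarks and late-data handling on top of this idea):

```python
from collections import defaultdict

WINDOW = 60  # window size in seconds

def tumbling_counts(events):
    """Count events per fixed 60-second window, keyed by event time.

    Each event is a (event_timestamp, payload) pair; arrival order
    (processing time) is irrelevant to the result.
    """
    counts = defaultdict(int)
    for event_ts, _payload in events:
        window_start = event_ts - (event_ts % WINDOW)
        counts[window_start] += 1
    return dict(counts)

# Events arrive out of order -- bucketing by event time still works
events = [(125, "a"), (61, "b"), (10, "c"), (119, "d")]
print(tumbling_counts(events))  # {120: 1, 60: 2, 0: 1}
```

In batch you’d just group the finished dataset; in streaming the dataset never finishes, which is exactly the mindset shift the book’s later chapters push you through.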

The Full Series

If you missed any posts, here’s the complete list:

  1. Series Intro
  2. What is Data Engineering? (Ch 1)
  3. Building Your Data Engineering Setup (Ch 2)
  4. Reading and Writing Files (Ch 3)
  5. Working with Databases (Ch 4)
  6. Cleaning and Transforming Data (Ch 5)
  7. Building a 311 Data Pipeline (Ch 6)
  8. Production Pipeline Features (Ch 7)
  9. NiFi Registry Version Control (Ch 8)
  10. Monitoring Data Pipelines (Ch 9)
  11. Deploying Data Pipelines (Ch 10)
  12. Building a Production Pipeline (Ch 11)
  13. Building a Kafka Cluster (Ch 12)
  14. Streaming Data with Kafka (Ch 13)
  15. Data Processing with Apache Spark (Ch 14)
  16. Real-Time Edge Data with MiNiFi and Spark (Ch 15)

Thanks for reading along. If you found these notes useful, the book itself is worth picking up for the hands-on projects alone.

