Paul Crickard's hands-on guide to building data pipelines with Python, covering ETL, NiFi, Airflow, Kafka, Spark, and production deployment.
Data Engineering with Python walks you through the full data engineering stack using Python as the glue language. The book starts with fundamentals like reading files and working with databases, then progresses to building complete data pipelines with Apache NiFi and Apache Airflow. It covers data cleaning with pandas, monitoring with Elasticsearch and Kibana, and deployment strategies for production environments.
The second half shifts to streaming and big data. You set up a Kafka cluster, build streaming pipelines, process data with Apache Spark, and finish with a real-time edge computing project using MiNiFi. Three dedicated project chapters tie everything together with practical, end-to-end pipelines.
This book is for Python developers who want to understand data engineering from the ground up. It’s especially useful for beginners who want breadth across the tooling landscape rather than deep expertise in any single tool. The hands-on approach means you’re building real pipelines, not just reading about theory.
The book was published by Packt in 2020, so some tooling details have aged (manual installs rather than Docker, older Airflow patterns, ZooKeeper-based Kafka). The core patterns of ETL, staging, validation, idempotency, and monitoring remain relevant and are well taught.
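To give a feel for how staging, validation, and idempotency fit together, here is a minimal sketch using only Python's built-in `sqlite3` module. It is not taken from the book: the table names (`staging`, `orders`), the row fields (`id`, `amount`), and the helper functions are all hypothetical and chosen only for illustration.

```python
import sqlite3


def validate(row):
    """Return True when a row has a non-empty id and a numeric amount."""
    return bool(row.get("id")) and isinstance(row.get("amount"), (int, float))


def run_pipeline(conn, rows):
    """Stage, validate, and load rows; return (loaded, rejected) counts.

    Hypothetical schema: 'staging' holds the current batch, 'orders' is the
    final table keyed by id.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS staging (id TEXT, amount REAL)")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, amount REAL)"
    )

    # Start every run from an empty staging table so reruns do not pile up.
    cur.execute("DELETE FROM staging")

    # Validation: keep only rows that pass the checks, count the rest.
    good = [r for r in rows if validate(r)]
    rejected = len(rows) - len(good)
    cur.executemany(
        "INSERT INTO staging (id, amount) VALUES (:id, :amount)", good
    )

    # Idempotent load: rows are keyed by id, so loading the same batch
    # again replaces existing rows instead of duplicating them.
    cur.execute(
        "INSERT OR REPLACE INTO orders (id, amount) "
        "SELECT id, amount FROM staging"
    )
    conn.commit()
    return len(good), rejected
```

Running `run_pipeline` twice with the same batch leaves `orders` with the same rows as running it once, which is the property that makes a failed or repeated job safe to retry.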