Data Engineering with Python: My Study Notes from Paul Crickard's Book
So I picked up Data Engineering with Python by Paul Crickard (Packt, 2020, ISBN: 978-1-83921-418-9) and decided to write up my study notes as I go through it. I’ve been working in IT for over 20 years, and data engineering keeps coming up everywhere. This book seemed like a good one to work through and share what I learn.
What This Book Covers
The book is about building data pipelines using Python. But it’s not just Python scripts. Crickard walks through the full stack of tools you’d actually use in a real data engineering setup:
- Apache NiFi for moving data around
- Apache Airflow for scheduling and orchestrating pipelines
- Apache Kafka for streaming data
- Apache Spark for processing big data
- Databases like Elasticsearch, PostgreSQL, and MongoDB
It’s 15 chapters that take you from “what even is data engineering” all the way to building real-time data pipelines with edge computing.
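If “data pipeline” still sounds abstract, here is a minimal sketch of the extract, transform, load shape that the early chapters build on. It is my own illustration, not code from the book: it uses only Python's standard library (csv and sqlite3) instead of the NiFi, Airflow, and PostgreSQL setup Crickard uses, and the sample data, the `people` table, and the function names are all made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real source file.
RAW_CSV = """name,city,age
Ana,Albuquerque,34
Ben,,41
Cy,Santa Fe,not_a_number
"""


def extract(text):
    """Read CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Keep rows that have a city and a numeric age; cast age to int."""
    clean = []
    for row in rows:
        if row["city"] and row["age"].isdigit():
            clean.append((row["name"], row["city"], int(row["age"])))
    return clean


def load(records, conn):
    """Write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, city TEXT, age INTEGER)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Chain the three steps: extract, then transform, then load."""
    load(transform(extract(text)), conn)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database, nothing on disk
    run_pipeline(RAW_CSV, conn)
    print(conn.execute("SELECT * FROM people").fetchall())
```

Only the first row survives the cleaning step, because the other two have a missing city or a non-numeric age. A real pipeline swaps each function for something sturdier (a scheduler to run it, a proper database to load into), but the three-step shape stays the same.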
Why I’m Writing This Up
I read a lot of books. And honestly, the best way to remember what you read is to explain it to someone else. So that’s what this series is. My notes, my takeaways, and sometimes my opinions on what works and what could be better.
I’m going to keep things simple. No jargon walls. If you’re new to data engineering or just curious about Python’s role in it, these notes should give you a solid overview without having to read all 300+ pages.
The Book Structure
The book breaks down into three sections:
Section 1 - Building Data Pipelines (Chapters 1-6): The basics. What data engineering is, setting up your tools, reading files, working with databases, cleaning data, and your first real pipeline project.
Section 2 - Running Pipelines in Production (Chapters 7-11): Making things production-ready. Version control, monitoring, deployment, and a full production pipeline project.
Section 3 - Beyond Batch (Chapters 12-15): Streaming and real-time processing. Kafka clusters, streaming data, Spark processing, and a final project combining everything.
What to Expect
One chapter per post. I’ll cover the main ideas, share what I found useful, and flag anything that felt outdated or could use more explanation. The book came out in 2020, so some things have changed in the ecosystem since then.
Here’s the full series:
- What is Data Engineering? (Ch 1)
- Building Your Data Engineering Setup (Ch 2)
- Reading and Writing Files (Ch 3)
- Working with Databases (Ch 4)
- Cleaning and Transforming Data (Ch 5)
- Building a 311 Data Pipeline (Ch 6)
- Production Pipeline Features (Ch 7)
- NiFi Registry Version Control (Ch 8)
- Monitoring Data Pipelines (Ch 9)
- Deploying Data Pipelines (Ch 10)
- Building a Production Pipeline (Ch 11)
- Building a Kafka Cluster (Ch 12)
- Streaming Data with Kafka (Ch 13)
- Data Processing with Apache Spark (Ch 14)
- Real-Time Edge Data with MiNiFi and Spark (Ch 15)
Let’s get started.
Next up: What is Data Engineering? (Ch 1)