Data Engineering With Python: Final Thoughts and Takeaways
That’s it. Fifteen chapters, seventeen posts, and one complete walkthrough of Paul Crickard’s Data Engineering with Python (Packt, 2020, ISBN: 978-1-83921-418-9).
You have NiFi running. Kafka is streaming. Spark is processing. But what about the data source? What happens when your data comes from a tiny sensor or a Raspberry Pi that can barely run a web browser?
You have streaming data. You have batch data. You have a lot of it. Now you need to actually process it. Fast. On more than one machine.
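Spark's core idea is exactly this: split the data into partitions, apply a function to each partition independently, then merge the partial results. To keep the sketch self-contained (no PySpark or cluster assumed), here is that map-reduce shape in plain Python; on a real Spark cluster the partitions would live on different machines:

```python
from functools import reduce
from collections import Counter

# Spark's model in miniature: partition the data, map over each
# partition independently, then reduce the partial results together.
lines = ["spark maps over partitions", "spark reduces partial results"]

def count_words(partition):
    return Counter(word for line in partition for word in line.split())

partitions = [lines[:1], lines[1:]]               # pretend these sit on two nodes
partials = [count_words(p) for p in partitions]   # "map" step, parallel in Spark
totals = reduce(lambda a, b: a + b, partials)     # "reduce" step merges partials

print(totals["spark"])  # 2
```

This is only an analogy for the execution model; the chapter itself works with Spark's actual DataFrame API.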
Up to this point in the book, data pipelines have been about moving data that already exists. Query a database, read a file, process it, store it. The data sits still and you go get it.
Up to this point in the book, everything has been batch processing. You query a database, get a full dataset, transform it, load it somewhere. The data sits still while you work on it.
You learned the individual tools. You learned the deployment strategies. Now Chapter 11 of Data Engineering with Python by Paul Crickard puts it all together. This is the chapter where you build a complete, production-grade data pipeline from start to finish.
You built your data pipelines. They work on your laptop. Now what? Chapter 10 of Data Engineering with Python by Paul Crickard covers the part everyone eventually has to face: getting your pipelines out of development and into production.
You built a data pipeline. It is idempotent, uses atomic transactions, and has version control. It is production ready. But can you tell when it breaks?
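"Idempotent" here means the pipeline can replay the same batch without duplicating data, and "atomic" means each load either fully lands or fully rolls back. A minimal sketch of both properties, using stdlib `sqlite3` and a hypothetical `readings` table (the book's examples use PostgreSQL, but the pattern is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT PRIMARY KEY, value REAL)")

def load(rows):
    # "with conn" wraps the load in one atomic transaction:
    # every row lands, or none do.
    with conn:
        conn.executemany(
            # Upserting on the primary key makes the load idempotent:
            # replaying a batch overwrites instead of duplicating.
            "INSERT INTO readings VALUES (?, ?) "
            "ON CONFLICT(sensor_id) DO UPDATE SET value = excluded.value",
            rows,
        )

batch = [("s1", 21.5), ("s2", 19.0)]
load(batch)
load(batch)  # re-running the same batch changes nothing

print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # 2
```

The chapter's actual concern is the next step: once a pipeline has these properties, you still need monitoring to know when it breaks.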
You’ve been building data pipelines for several chapters now. They work. They move data. But here’s the problem: none of them have version control. If you break something, there’s no going back. Chapter 8 of Data Engineering with Python by Paul Crickard fixes that. It introduces the NiFi Registry, a sub-project of Apache NiFi that handles version control for your data pipelines.
You built a pipeline. It works on your machine. It runs on a schedule. Data goes in, data comes out. Ship it, right?
The previous chapters taught you the individual tools. Python, NiFi, Airflow, databases, data cleaning. Chapter 6 of Data Engineering with Python by Paul Crickard puts them all together into one real project.
You can build the best pipeline in the world. You can read files, write to databases, schedule everything with Airflow. But if the data going through that pipeline is messy, none of it matters.
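The chapter does its cleaning with pandas; as a stdlib-only sketch of the same idea, here is a typical pass over messy records: trim whitespace, normalize casing, flag missing values, and drop duplicates. The field names are made up for illustration:

```python
# Hypothetical messy records: inconsistent casing, stray whitespace,
# an empty field, and a duplicate hiding behind the formatting.
records = [
    {"city": " Albuquerque ", "state": "NM"},
    {"city": "albuquerque",   "state": "NM"},
    {"city": "Santa Fe",      "state": ""},
]

seen, cleaned = set(), []
for rec in records:
    city = rec["city"].strip().title()          # trim and normalize casing
    state = rec["state"].strip() or "UNKNOWN"   # flag missing values explicitly
    key = (city, state)
    if key not in seen:                         # deduplicate on the cleaned key
        seen.add(key)
        cleaned.append({"city": city, "state": state})

print(cleaned)
# [{'city': 'Albuquerque', 'state': 'NM'}, {'city': 'Santa Fe', 'state': 'UNKNOWN'}]
```

Note that the duplicate only becomes visible *after* normalization, which is why cleaning order matters.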
Most data pipelines start with a database. Most of them end with one too. Chapter 4 of Paul Crickard’s book is about connecting Python to databases and moving data between them. If the previous chapter was about flat files, this one is where things get real.
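The book's database work is against PostgreSQL with psycopg2, but the connect-execute-fetch pattern it teaches is the same across drivers. A self-contained sketch with stdlib `sqlite3` (table and data hypothetical):

```python
import sqlite3

# The book uses psycopg2 + PostgreSQL; sqlite3 keeps this sketch
# self-contained while the shape of the code stays the same.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Parameterized inserts: let the driver escape values,
# never build SQL with string formatting.
cur.executemany("INSERT INTO users (name) VALUES (?)", [("Ada",), ("Grace",)])
conn.commit()

# Reading back out is the other half of most pipelines.
cur.execute("SELECT name FROM users ORDER BY id")
print([row[0] for row in cur.fetchall()])  # ['Ada', 'Grace']
```

One driver-level difference worth knowing: sqlite3 uses `?` placeholders while psycopg2 uses `%s`, but the cursor API is otherwise near-identical.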
Chapter 3 is where Crickard moves from setup to actual work. You installed all those tools in Chapter 2. Now you use them. The chapter covers one of the most fundamental tasks in data engineering: getting data out of text files and into something useful.
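For files, that fundamental task mostly means CSV and JSON. A minimal sketch of the CSV half with the stdlib `csv` module (the data here is inline for illustration; the chapter reads real files, which it generates with Faker):

```python
import csv
import io

# An in-memory stand-in for a CSV file on disk.
raw = io.StringIO("name,age,city\nAda,36,London\nGrace,45,Arlington\n")

# csv.DictReader maps each row to a dict keyed by the header line,
# so downstream code can use column names instead of positions.
rows = list(csv.DictReader(raw))
print(rows[0]["name"])  # Ada
print(len(rows))        # 2
```

The same pattern scales up: swap `io.StringIO` for `open("data.csv")` and the rest of the code is unchanged.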
Chapter 1 was all theory. Now it’s time to actually install stuff. Chapter 2 of Data Engineering with Python by Paul Crickard is a setup chapter. You install the tools, configure them, and make sure everything talks to each other.
Chapter 1 of Data Engineering with Python by Paul Crickard starts with the basics. What is data engineering? What do data engineers actually do? And how is it different from data science?
So I picked up Data Engineering with Python by Paul Crickard (Packt, 2020, ISBN: 978-1-83921-418-9) and decided to write up my study notes as I go through it. I’ve been working in IT for over 20 years, and data engineering keeps coming up everywhere. This book seemed like a good one to work through and share what I learn.