Data Engineering With Python: Final Thoughts and Takeaways
That’s it. Fifteen chapters, seventeen posts, and one complete walkthrough of Paul Crickard’s Data Engineering with Python (Packt, 2020, ISBN: 978-1-83921-418-9).
So we made it through the whole book. And honestly? It was worth the ride.
The biggest thing Scavetta and Angelov got right is the framing. They didn’t write a “Python is better” or “R is better” book. They wrote a “both are useful, here’s when to use which” book. And that’s the mature take.
The appendix of “Python and R for the Modern Data Scientist” is basically a bilingual dictionary. It runs to about 40 tables and covers everything from package management to indexing. You could spend a whole afternoon reading through it.
You have NiFi running. Kafka is streaming. Spark is processing. But what about the data source? What happens when your data comes from a tiny sensor or a Raspberry Pi that can barely run a web browser?
You have streaming data. You have batch data. You have a lot of it. Now you need to actually process it. Fast. On more than one machine.
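The chapter's answer is Spark, but the core idea, split the data into partitions, compute partial results in parallel, then combine them, can be sketched in miniature with nothing but the standard library. This is an illustration of the map/reduce split, not Spark itself; `subtotal` is a made-up helper, and threads here stand in for workers that would really run on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def subtotal(chunk):
    # The "map" step: each worker computes independently on its own partition.
    return sum(chunk)

data = list(range(1_000))
# Partition the data, much as a cluster scheduler would spread it across machines.
chunks = [data[i::4] for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(subtotal, chunks))

# The "reduce" step: combine the partial results into one answer.
total = sum(partials)
print(total)  # 499500
```

Spark generalizes exactly this pattern: the partitioning, the scheduling, and the shuffle between map and reduce all happen across a cluster instead of a thread pool.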
The whole book has been building to this. Six chapters of philosophy, syntax comparisons, and interoperability tricks. Now Chapter 7 drops a real project on the table. Build it with both languages. Together. Start to finish.
Up to this point in the book, data pipelines have been about moving data that already exists. Query a database, read a file, process it, store it. The data sits still and you go get it.
Chapter 6 is where the book finally delivers on its promise. All that talk about using both languages together? This is where it actually happens. Rick Scavetta walks through the nuts and bolts of making Python and R talk to each other in the same project.
Up to this point in the book, everything has been batch processing. You query a database, get a full dataset, transform it, load it somewhere. The data sits still while you work on it.
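The batch-versus-streaming distinction fits in a few lines of plain Python. In this toy sketch (my own illustration, not the book's code), `sensor_readings` stands in for an unbounded source like a Kafka topic: the batch version collects everything before computing, while the streaming version keeps a running result and never holds the full dataset.

```python
from typing import Iterator

def sensor_readings() -> Iterator[int]:
    """Stand-in for a stream: values arrive one at a time."""
    for value in [3, 7, 2, 9]:
        yield value

# Batch: collect the whole dataset, then compute.
batch = list(sensor_readings())
print(sum(batch) / len(batch))  # 5.25

# Streaming: update a running result per record, never materializing the full set.
count, total = 0, 0
for value in sensor_readings():
    count += 1
    total += value
print(total / count)  # 5.25
```

Same answer, very different memory profile, and only the streaming version still works when the source never ends.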
Chapter 5 is where Boyan Angelov gets practical about the question everyone dances around: which language should you actually use for which job?
You learned the individual tools. You learned the deployment strategies. Now Chapter 11 of Data Engineering with Python by Paul Crickard puts it all together. This is the chapter where you build a complete, production-grade data pipeline from start to finish.
Chapter 4 is where the book stops teaching you the languages and starts telling you when to use which one. This is Part III, “The Modern Context,” and Boyan Angelov takes the lead here. The question is simple: given a specific data format, which language gives you a better experience?
You built your data pipelines. They work on your laptop. Now what? Chapter 10 of Data Engineering with Python by Paul Crickard covers the part everyone eventually has to face: getting your pipelines out of development and into production.
Chapter 2 showed Pythonistas how to pick up R. Chapter 3 flips the script. Now it’s the R user’s turn to step into Python territory. Rick Scavetta writes this one, and he does a good job easing R folks into a world that feels messier at first glance.
You built a data pipeline. It is idempotent, uses atomic transactions, and has version control. It is production ready. But can you tell when it breaks?
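One of the simplest monitoring checks is comparing rows in against rows out and alerting when the gap exceeds a threshold. Here is a minimal sketch of that idea; `check_counts` and its tolerance are my own hypothetical example, not the book's NiFi-based monitoring.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pipeline-monitor")

def check_counts(rows_in: int, rows_out: int, tolerance: float = 0.01) -> bool:
    """Flag runs where the pipeline silently dropped more rows than expected."""
    dropped = rows_in - rows_out
    if rows_in and dropped / rows_in > tolerance:
        log.warning("dropped %d of %d rows", dropped, rows_in)
        return False
    return True

print(check_counts(1000, 940))  # 6% loss -> False, and a warning is logged
```

The chapter builds this kind of check into the pipeline itself, so a silent failure becomes a loud one.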
In Part 1 we covered R basics: setting up your environment, installing packages, working with tibbles, and understanding R’s type system. Now we get to the good stuff. Lists, factors, finding things in your data, and the iteration patterns that make R feel so different from Python.
You’ve been building data pipelines for several chapters now. They work. They move data. But here’s the problem: none of them have version control. If you break something, there’s no going back. Chapter 8 of Data Engineering with Python by Paul Crickard fixes that. It introduces the NiFi Registry, a sub-project of Apache NiFi that handles version control for your data pipelines.
Chapter 2 is where the book gets hands-on. Rick Scavetta takes the wheel and walks Python developers through R. Not from scratch, but with the assumption you already know how to code. The chapter is big, so I split it into two posts. This is the first half.
You built a pipeline. It works on your machine. It runs on a schedule. Data goes in, data comes out. Ship it, right?
Chapter 1 is titled “In the Beginning” and it’s written by Rick Scavetta. He opens with a tongue-in-cheek Dickens reference, saying it’s just the best of times for data science. But to understand where we are, we need to look at where Python and R came from. Their origin stories explain why they feel so different today.
The previous chapters taught you the individual tools. Python, NiFi, Airflow, databases, data cleaning. Chapter 6 of Data Engineering with Python by Paul Crickard puts them all together into one real project.
The preface of “Python and R for the Modern Data Scientist” sets up the whole book in a few pages. And it does something rare for a tech book. It actually defines what it means by its own title.
I picked up “Python and R for the Modern Data Scientist” by Rick J. Scavetta and Boyan Angelov a while back. It’s an O’Reilly book from 2021, and it caught my eye because it doesn’t pick sides in the Python vs R debate. Instead, it argues you should use both.
You can build the best pipeline in the world. You can read files, write to databases, schedule everything with Airflow. But if the data going through that pipeline is messy, none of it matters.
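The book does its cleaning with pandas, and the usual suspects are all one-liners: stray whitespace, duplicate rows, missing values. A tiny sketch with made-up records:

```python
import pandas as pd

# Toy records with the usual problems: stray whitespace, duplicates, missing values.
df = pd.DataFrame({
    "name": [" Alice", "Bob ", "Bob ", None],
    "score": [90, 85, 85, 70],
})

df["name"] = df["name"].str.strip()   # normalize whitespace
df = df.drop_duplicates()             # drop exact duplicate rows
df = df.dropna(subset=["name"])       # drop rows missing a name
print(df["name"].tolist())  # ['Alice', 'Bob']
```

Four rows in, two clean rows out. The chapter's point is that steps like these belong inside the pipeline, not in a one-off notebook.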
Most data pipelines start with a database. Most of them end with one too. Chapter 4 of Paul Crickard’s book is about connecting Python to databases and moving data between them. If the previous chapter was about flat files, this one is where things get real.
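The shape of the work is the same regardless of the database: connect, execute SQL, pull rows back into Python. The chapter works against real servers; this stand-in uses the standard library's in-memory SQLite so it runs anywhere.

```python
import sqlite3

# In-memory SQLite as a stand-in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Alice",), ("Bob",)])
conn.commit()

names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]
print(names)  # ['Alice', 'Bob']
conn.close()
```

Swap the `connect` call for a driver like `psycopg2` and the rest of the pattern carries over nearly unchanged.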
Chapter 3 is where Crickard moves from setup to actual work. You installed all those tools in Chapter 2. Now you use them. The chapter covers one of the most fundamental tasks in data engineering: getting data out of text files and into something useful.
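The workhorse for that task is CSV parsing, which the standard library handles directly. A minimal sketch, using an in-memory string in place of a real file so it's self-contained:

```python
import csv
import io

# A small in-memory CSV standing in for a real file on disk.
raw = "name,age\nAlice,34\nBob,29\n"

# csv.DictReader turns each data row into a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(raw)))

ages = [int(r["age"]) for r in rows]
print(rows[0]["name"], sum(ages))  # Alice 63
```

Note that everything comes back as strings; converting types (like `age` above) is on you, which is exactly the kind of detail the chapter spends time on.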
Chapter 1 was all theory. Now it’s time to actually install stuff. Chapter 2 of Data Engineering with Python by Paul Crickard is a setup chapter. You install the tools, configure them, and make sure everything talks to each other.
Chapter 1 of Data Engineering with Python by Paul Crickard starts with the basics. What is data engineering? What do data engineers actually do? And how is it different from data science?
So I picked up Data Engineering with Python by Paul Crickard (Packt, 2020, ISBN: 978-1-83921-418-9) and decided to write up my study notes as I go through it. I’ve been working in IT for over 20 years, and data engineering keeps coming up everywhere. This book seemed like a good one to work through and share what I learn.