Data Engineering With Python: Final Thoughts and Takeaways
That’s it. Fifteen chapters, seventeen posts, and one complete walkthrough of Paul Crickard’s Data Engineering with Python (Packt, 2020, ISBN: 978-1-83921-418-9).
You have NiFi running. Kafka is streaming. Spark is processing. But what about the data source? What happens when your data comes from a tiny sensor or a Raspberry Pi that can barely run a web browser?
You have streaming data. You have batch data. You have a lot of it. Now you need to actually process it. Fast. On more than one machine.
At some point, your data gets too big for one machine. That’s not a hypothetical. Netflix, Google, and Amazon all hit that wall years ago. The question is: what do you do when a single server can’t keep up?
In Part 1 we set up a Dataproc cluster, got familiar with HDFS, and touched on what a data lake actually is. Now it is time to get into the real work: writing PySpark code, understanding RDDs, moving data between HDFS, GCS, and BigQuery, and learning how to actually submit Spark jobs to Dataproc.
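The RDD work mentioned above comes down to functional transformations over distributed collections. As a rough plain-Python analogue (no cluster required — this mirrors the shape of the RDD calls, not Spark's actual distributed execution), `map` and `reduce` over a local list behave like their RDD counterparts on a single partition:

```python
from functools import reduce

# A local stand-in for data that would live in an RDD's partitions.
# In PySpark this would be: rdd = sc.parallelize(values)
values = [1, 2, 3, 4, 5]

# Analogue of rdd.map(lambda x: x * x) — transforms each element
squared = list(map(lambda x: x * x, values))

# Analogue of rdd.reduce(lambda a, b: a + b) — combines elements pairwise
total = reduce(lambda a, b: a + b, squared)

print(squared)  # [1, 4, 9, 16, 25]
print(total)    # 55
```

On a real Dataproc cluster the same logic runs in parallel across executors; you package it as a script and hand it to the cluster with `gcloud dataproc jobs submit pyspark`, which is exactly the workflow those Spark posts walk through.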