Data Processing With Apache Spark - Study Notes From Data Engineering With Python Ch 14
You have streaming data. You have batch data. You have a lot of it. Now you need to actually process it. Fast. On more than one machine.
In Part 1 we set up a Dataproc cluster, got familiar with HDFS, and touched on what a data lake actually is. Now it is time to get into the real work: writing PySpark code, understanding RDDs, moving data between HDFS, GCS, and BigQuery, and submitting Spark jobs to Dataproc.