Data Processing With Apache Spark - Study Notes From Data Engineering With Python Ch 14
You have streaming data. You have batch data. You have a lot of it. Now you need to actually process it. Fast. On more than one machine.
At some point, your data gets too big for one machine. That’s not a hypothetical. Netflix, Google, and Amazon all hit that wall years ago. The question is: what do you do when a single server can’t keep up?
You have a great hypothesis. Your stakeholders are on board. But none of it matters without the right data.
Chapter 7 of “Data Science Foundations” by Stephen Mariadas and Ian Huke is about sourcing. Where do you get data? How do you collect it? How do you know if it is any good?