Advanced MapReduce: Joins and Filtering Patterns
Previous: Deep Look at MapReduce: How Hadoop Processes Data
In the last post, we looked at the basics of MapReduce. But in the real world, your data is rarely in one single file. You usually have a few different datasets that you need to combine. This is where things get a little more complex - and a lot more interesting.
The Challenge of Joins in Hadoop
In a regular SQL database, joining two tables is a single JOIN statement. In Hadoop, it’s a bit of a headache because your data is spread across many machines, and no single machine can see both datasets at once.
Sridhar Alla introduces the Multiple Mappers pattern. The idea is to have one Mapper for each input source. For example, if you have a file of Cities and a file of Temperatures, you write two different Mapper classes. Both Mappers output the same key (the CityID), but their values are different - typically tagged so the Reducer can tell which source each value came from. The Reducer then gets all the values for a specific CityID from both Mappers and performs the join.
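Stripped of the Hadoop boilerplate, the core of the pattern can be sketched in plain Java. The class and method names here are illustrative (a real job would extend Hadoop’s Mapper and Reducer classes and wire the inputs up with MultipleInputs); this just shows how tagged values from two sources meet at the reducer:

```java
import java.util.*;

// Plain-Java sketch of the Multiple Mappers join pattern (no Hadoop runtime).
// Each "mapper" emits (cityId, taggedValue); the "reducer" joins per key.
public class MultiMapperJoin {

    // Mapper 1: "1,London" -> (1, "CITY:London")
    static Map.Entry<Integer, String> cityMapper(String line) {
        String[] parts = line.split(",");
        return Map.entry(Integer.parseInt(parts[0]), "CITY:" + parts[1]);
    }

    // Mapper 2: "1,18.5" -> (1, "TEMP:18.5")
    static Map.Entry<Integer, String> tempMapper(String line) {
        String[] parts = line.split(",");
        return Map.entry(Integer.parseInt(parts[0]), "TEMP:" + parts[1]);
    }

    // Reducer: receives all tagged values for one cityId and pairs them up.
    static List<String> joinReducer(int cityId, List<String> values) {
        List<String> cities = new ArrayList<>(), temps = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("CITY:")) cities.add(v.substring(5));
            else temps.add(v.substring(5));
        }
        List<String> out = new ArrayList<>();
        for (String c : cities)
            for (String t : temps)
                out.add(cityId + "\t" + c + "\t" + t);
        return out;
    }

    public static void main(String[] args) {
        // The shuffle phase is simulated with a map of key -> list of values.
        Map<Integer, List<String>> shuffled = new TreeMap<>();
        for (String line : List.of("1,London", "2,Paris")) {
            var kv = cityMapper(line);
            shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        for (String line : List.of("1,18.5", "1,17.0", "2,21.3")) {
            var kv = tempMapper(line);
            shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        shuffled.forEach((k, v) -> joinReducer(k, v).forEach(System.out::println));
    }
}
```

The tag prefix ("CITY:" vs "TEMP:") is the important trick: without it, the reducer has no way of knowing which input file a value came from.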
Types of Joins
The book walks through several join patterns, and if you’ve done SQL, these will sound familiar:
- Inner Join: Only keep records where the key exists in both datasets.
- Left Outer Join: Keep everything from the “left” dataset, even if there’s no match in the “right” one.
- Left Anti Join: Only keep records from the left dataset that don’t have a match in the right one. This is great for finding missing data.
- Full Outer Join: Keep all records from both datasets, matched where keys exist on both sides. It’s the most complete but usually the most expensive.
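Inside the reducer, the only thing that differs between these join types is what you emit when one side has no values for the key. A compact sketch (plain Java with illustrative names, not the book’s code) of that per-key decision:

```java
import java.util.*;

// Sketch: what each join type emits for a single key, given the
// (possibly empty) lists of values from the left and right datasets.
public class JoinTypes {

    static List<String> join(String kind, String key,
                             List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        boolean anyRight = !right.isEmpty();
        switch (kind) {
            case "inner": // emit only if the key exists on both sides
                for (String l : left)
                    for (String r : right) out.add(key + ":" + l + "," + r);
                break;
            case "left_outer": // keep every left record, null-pad missing right
                for (String l : left) {
                    if (anyRight)
                        for (String r : right) out.add(key + ":" + l + "," + r);
                    else out.add(key + ":" + l + ",null");
                }
                break;
            case "left_anti": // keep left records only when the right side is empty
                if (!anyRight)
                    for (String l : left) out.add(key + ":" + l);
                break;
            case "full_outer": // keep everything from both sides
                if (left.isEmpty())
                    for (String r : right) out.add(key + ":null," + r);
                else
                    out.addAll(join("left_outer", key, left, right));
                break;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> l = List.of("London"), empty = List.of();
        System.out.println(join("inner", "1", l, empty));      // []
        System.out.println(join("left_outer", "1", l, empty)); // [1:London,null]
        System.out.println(join("left_anti", "1", l, empty));  // [1:London]
    }
}
```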
Aggregation and Filtering Patterns
Beyond joins, there are other “patterns” or templates you can use to solve common problems:
- Aggregation Patterns: These are for summarizing data. Think min, max, count, average, or even more complex things like standard deviation.
- Filtering Patterns: These are for finding a subset of data. Maybe you want the “Top 10” most frequent words, or you want to remove all duplicate records. These patterns help you thin out your data so you’re only looking at what matters.
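The “Top 10 most frequent words” example maps nicely onto a bounded min-heap: each mapper keeps only its local top K, and a single reducer merges those small lists the same way. A plain-Java sketch of the core idea (the class and method names are mine, not the book’s):

```java
import java.util.*;

// Sketch of the Top-K filtering pattern: a min-heap capped at size K,
// so only the K largest counts survive. Each mapper can run this locally
// and a single reducer merges the per-mapper results with the same logic.
public class TopK {

    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        // Min-heap by count: the smallest of the current top K sits on top.
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Comparator.comparingInt(Map.Entry::getValue));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll(); // evict the smallest entry
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> b.getValue() - a.getValue()); // largest first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCounts = Map.of(
            "the", 120, "hadoop", 45, "map", 80, "reduce", 75, "data", 60);
        System.out.println(topK(wordCounts, 3));
        // -> [the=120, map=80, reduce=75]
    }
}
```

The payoff is that no node ever holds more than K entries in memory, no matter how large the full word-count table is - exactly the “thin out your data” idea the filtering patterns are about.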
Why Design Patterns?
You might wonder why we call them “patterns.” It’s because these are tried-and-true templates. Instead of reinventing the wheel every time you have a new data problem, you can look at your problem and say, “Oh, this is a Top-K filtering problem,” and apply the pattern.
It makes your code more reliable and easier for other developers to understand.
We’ve spent a lot of time on Java and MapReduce, but in the next chapter, we’re going to switch gears. We’ll look at how Python - everyone’s favorite language for data science - fits into the Hadoop ecosystem.