A Deep Look at MapReduce: How Hadoop Processes Data

Previous: SQL on Hadoop: Getting Started with Apache Hive

We’ve talked about Hive, but today we’re going under the hood. MapReduce is the engine that actually does the heavy lifting in Hadoop. Sridhar Alla’s third chapter is a deep look at how this framework takes a massive pile of data and turns it into something useful.

The MapReduce Lifecycle

If you want to understand MapReduce, you have to understand the phases. It’s not just “map” and “reduce.” There’s a lot happening in between:

  1. Record Reader: This is the first step. It takes your raw data (like a CSV file) and breaks it into key/value pairs that the Mapper can understand.
  2. Mapper: This is where you define your logic. You take a record and output zero or more new key/value pairs. For example, if you’re counting words, the Mapper outputs (word, 1) for every word it finds.
  3. Combiner (Optional): Think of this as a “mini-reducer” that runs on the same node as the Mapper. It aggregates data locally to save network bandwidth. Instead of sending (Boston, 1) a thousand times, it sends (Boston, 1000).
  4. Partitioner: This decides which Reducer gets which key. Usually, it uses a hash of the key to make sure all instances of the same key end up on the same machine.
  5. Shuffle and Sort: This is the “magic” phase. Hadoop moves the data across the network and sorts it by key, so each Reducer receives its keys in sorted order with all of a key’s values grouped together.
  6. Reducer: Finally, the Reducer takes that list of values and performs an aggregation, like summing them up or taking an average.
  7. Output Format: The final step where the results are written back to HDFS.
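To make steps 3 and 4 concrete, here is a single-process sketch in plain Java (no Hadoop dependencies; the class and method names are illustrative, not the book’s code). It shows a combiner collapsing repeated keys locally and a hash partitioner routing every instance of a key to the same reducer:

```java
import java.util.*;

// Simplified, single-process simulation of the Combiner and Partitioner phases.
public class MiniPipeline {
    // Combiner: aggregate counts locally before anything crosses the network,
    // so (Boston, 1) emitted N times becomes a single (Boston, N) pair.
    static Map<String, Integer> combine(List<String> cities) {
        Map<String, Integer> counts = new HashMap<>();
        for (String city : cities) counts.merge(city, 1, Integer::sum);
        return counts;
    }

    // Partitioner: hash the key (mod the reducer count) so every occurrence
    // of the same key is routed to the same reducer.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        List<String> records = List.of("Boston", "Boston", "NYC", "Boston");
        System.out.println(combine(records).get("Boston")); // 3
        // The same key always maps to the same partition:
        System.out.println(partition("Boston", 4) == partition("Boston", 4)); // true
    }
}
```

The masking with `Integer.MAX_VALUE` keeps the partition index non-negative even when `hashCode()` is negative, which is the same trick Hadoop’s default hash partitioner uses.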

Two Basic Job Types

The book walks through two fundamental types of jobs with real Java code examples.

1. The Single Mapper Job

This is for simple transformations. You’re not aggregating anything; you’re just changing the data. For example, converting city names to short codes (e.g., “New York” to “NYC”). There’s no Reducer needed here. You just read, transform, and write.
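The core of a map-only job is just a per-record transform. A minimal sketch of that logic in plain Java (the lookup table and class name are invented for illustration; a real job would wire this into a Hadoop `Mapper` and might ship the table via the distributed cache):

```java
import java.util.*;

// Map-only job sketch: each record is transformed independently; no Reducer runs.
public class CityCodeMapper {
    // Illustrative lookup table mapping full city names to short codes.
    static final Map<String, String> CODES = Map.of("New York", "NYC", "Boston", "BOS");

    // The "map" step: transform one record, passing unknown cities through unchanged.
    static String map(String city) {
        return CODES.getOrDefault(city, city);
    }

    public static void main(String[] args) {
        System.out.println(map("New York")); // NYC
        System.out.println(map("Tokyo"));    // Tokyo
    }
}
```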

2. The Single Mapper Reducer Job

This is for basic aggregations. The book uses a temperature dataset. The Mapper outputs (CityID, Temperature), and the Reducer takes all the temperatures for a single CityID and calculates the average. This is the classic “group by” pattern in SQL, but implemented in Java.
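A hedged sketch of that reduce step in plain Java (names are illustrative; in real Hadoop code the framework hands the Reducer an `Iterable` of values per key after shuffle-and-sort):

```java
import java.util.*;

// Mapper emits (cityId, temperature); the Reducer averages the grouped values.
public class AvgTemperature {
    // The "reduce" step: collapse all temperatures for one city into an average.
    static double reduce(List<Double> temps) {
        double sum = 0;
        for (double t : temps) sum += t;
        return sum / temps.size();
    }

    public static void main(String[] args) {
        // After shuffle-and-sort, the Reducer sees all values for one key together.
        Map<String, List<Double>> grouped = Map.of("city-42", List.of(10.0, 20.0, 30.0));
        grouped.forEach((city, temps) ->
            System.out.println(city + " -> " + reduce(temps))); // city-42 -> 20.0
    }
}
```

This is exactly the SQL `SELECT city_id, AVG(temp) ... GROUP BY city_id` pattern, spelled out by hand.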

The “Word Count” Example

You can’t talk about MapReduce without mentioning the Word Count example. It’s the “Hello World” of big data. Alla provides a clean implementation showing how the WordMapper tokenizes strings and the CountReducer sums up the ones.
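The tokenize-then-sum logic can be condensed into a few lines of plain Java (this is a simulation of the mapper and reducer combined, not Alla’s actual `WordMapper`/`CountReducer` classes):

```java
import java.util.*;

// Word Count in miniature: tokenize each line into words (the map step),
// then sum the implicit 1s per word (the reduce step).
public class WordCount {
    static Map<String, Integer> run(List<String> lines) {
        // TreeMap mimics the sorted key order a Reducer sees after shuffle-and-sort.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be"))); // {be=2, not=1, or=1, to=2}
    }
}
```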

It might seem like a lot of boilerplate code compared to a SQL query, but this is what allows you to scale to trillions of records. You’re giving the system a very specific plan of execution that can be parallelized across thousands of machines.

In the next post, we’ll look at some more complex patterns, like how to join two different datasets using MapReduce.

Next: Advanced MapReduce: Joins and Filtering Patterns
