Getting Started with Hadoop 3: What's New and Why It Matters
Hadoop has been around for a while, but version 3 is where things get really interesting. If you’ve worked with Hadoop 1 or 2, you know it was solid but had some pain points. Sridhar Alla’s book kicks off by looking straight at what’s changed.
HDFS: More Storage, Less Waste
The biggest thing for me in HDFS is Erasure Coding (EC). Before Hadoop 3, the standard was to replicate every block three times. That’s a 200% overhead. If you had 1TB of data, you needed 3TB of storage.
EC is a huge deal. Instead of storing full copies, it splits data into fragments and computes extra parity fragments from them (Reed-Solomon coding by default), so lost fragments can be rebuilt from the ones that survive. That brings the overhead down from 200% to about 50%. You get comparable fault tolerance with way less hardware, which is a massive win for anyone paying for storage.
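To make the arithmetic concrete, here is a minimal sketch of the two storage models. It assumes a Reed-Solomon layout with 6 data units and 3 parity units, which is the layout that produces the 50% figure; the function names are made up for illustration and are not part of any Hadoop API.

```python
def raw_storage_replication(logical_tb: float, replicas: int = 3) -> float:
    """Raw disk needed when every block is stored `replicas` times."""
    return logical_tb * replicas


def raw_storage_erasure_coded(
    logical_tb: float, data_units: int = 6, parity_units: int = 3
) -> float:
    """Raw disk needed when each group of `data_units` data cells
    carries `parity_units` extra parity cells (assumed RS(6,3) layout)."""
    return logical_tb * (data_units + parity_units) / data_units


def overhead_percent(logical_tb: float, raw_tb: float) -> float:
    """Extra storage beyond the logical data, as a percentage of the data."""
    return (raw_tb - logical_tb) / logical_tb * 100
```

For 1 TB of data, replication needs 3 TB of raw disk (200% overhead), while the assumed 6+3 layout needs 1.5 TB (50% overhead).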
And there’s more:
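- High Availability: You can now have more than one standby NameNode. Hadoop 2 allowed only one active and one standby, so losing both took HDFS offline. With extra standbys, the cluster can survive more than one NameNode failure.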
- Intra-DataNode Balancer: Finally, Hadoop can balance data across multiple disks inside a single node. No more skewed performance because one disk is full and the others are empty.
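- New Port Numbers: They moved the default ports (like 50070 becoming 9870) out of the Linux ephemeral port range. The old defaults could collide with ports the OS hands out to other programs for outgoing connections, so a service would sometimes fail to bind at startup. A small change, but it saves a lot of “why won’t this start?” headaches.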
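As a rough illustration of the port change, the sketch below lists a few of the HDFS defaults before and after the move and checks them against the ephemeral range. The range 32768-61000 is an assumption here, since the real range is configurable per machine, and the table is a partial list rather than a complete reference.

```python
# Assumed Linux ephemeral port range; the actual range is set per host
# (see net.ipv4.ip_local_port_range) and may differ.
EPHEMERAL_PORTS = range(32768, 61001)

# service -> (Hadoop 2 default, Hadoop 3 default); partial list for illustration
PORT_CHANGES = {
    "NameNode HTTP": (50070, 9870),
    "NameNode HTTPS": (50470, 9871),
    "Secondary NameNode HTTP": (50090, 9868),
    "Secondary NameNode HTTPS": (50091, 9869),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode HTTP": (50075, 9864),
    "DataNode HTTPS": (50475, 9865),
}


def in_ephemeral_range(port: int) -> bool:
    """True if the OS might hand this port to an outgoing connection."""
    return port in EPHEMERAL_PORTS


def risky_defaults(use_new_ports: bool) -> list[str]:
    """Names of services whose default port falls inside the ephemeral range."""
    index = 1 if use_new_ports else 0
    return [name for name, ports in PORT_CHANGES.items()
            if in_ephemeral_range(ports[index])]
```

Calling `risky_defaults(False)` returns every service in the table, while `risky_defaults(True)` returns an empty list.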
MapReduce and YARN: Speeding Things Up
MapReduce isn’t dead. In fact, it got a nice performance boost. There’s a new native implementation of the map output collector that can speed up shuffle-intensive jobs by 30% or more.
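And then there’s YARN. Think of YARN as the traffic cop for your cluster. Hadoop 3 introduces Opportunistic Containers. These are basically “low priority” tasks: they get sent to a node even when it has no free resources, wait in a queue there, and start as soon as something frees up. If a regular (guaranteed) container needs the room, the opportunistic ones get bumped. It keeps the cluster busy and improves overall throughput.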
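The queue-and-bump behavior can be sketched with a toy model of a single node. This is a simplification for intuition only, not how YARN is implemented: the `Node` and `Container` classes are hypothetical, resources are reduced to a single core count, and preempted containers are just recorded in a list.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    cores: int
    opportunistic: bool = False


@dataclass
class Node:
    capacity: int
    running: list = field(default_factory=list)
    queued: deque = field(default_factory=deque)
    preempted: list = field(default_factory=list)

    def free(self) -> int:
        return self.capacity - sum(c.cores for c in self.running)

    def submit(self, c: Container) -> bool:
        if c.opportunistic:
            # Opportunistic work is always accepted and waits its turn.
            self.queued.append(c)
            self._start_queued()
            return True
        guaranteed = sum(r.cores for r in self.running if not r.opportunistic)
        if guaranteed + c.cores > self.capacity:
            return False  # would not fit even with every opportunistic task bumped
        # Bump the most recently started opportunistic containers until it fits.
        while self.free() < c.cores:
            victim = next(r for r in reversed(self.running) if r.opportunistic)
            self.running.remove(victim)
            self.preempted.append(victim.name)
        self.running.append(c)
        return True

    def finish(self, name: str) -> None:
        self.running = [c for c in self.running if c.name != name]
        self._start_queued()

    def _start_queued(self) -> None:
        # Start queued containers in order while the head of the queue fits.
        while self.queued and self.queued[0].cores <= self.free():
            self.running.append(self.queued.popleft())
```

On a 4-core node running a 3-core guaranteed container, a 2-core opportunistic container waits in the queue, starts once the guaranteed one finishes, and is bumped again if a new 3-core guaranteed container arrives.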
Another cool addition is Timeline Service v.2. It uses HBase as its backing store now, which means it scales much better. It also tracks “flows,” so you can see how a series of applications work together as one logical workflow.
The Java 8 Factor
One thing to keep in mind: Hadoop 3 requires Java 8. If you’re still on Java 7, it’s time to upgrade. All the JARs are compiled for Java 8 now.
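If you want to check a machine before upgrading, a small helper like the one below can interpret the version string that `java -version` reports. It is a sketch with hypothetical function names, and it only handles the two common formats: the old `1.x` scheme (as in `1.8.0_292`) and the newer scheme where the major version comes first (as in `11.0.2`).

```python
import re


def java_major_version(version: str) -> int:
    """Extract the major Java version from a string like '1.8.0_292' or '11.0.2'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Java version string: {version!r}")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.<major>.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def is_at_least_java8(version: str) -> bool:
    """True if the version string indicates Java 8 or newer."""
    return java_major_version(version) >= 8
```

Whether a release newer than Java 8 works depends on the specific Hadoop 3 version, so treat this as a minimum check only.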
So, Hadoop 3 isn’t just a minor update. It’s about efficiency, scalability, and making life easier for people running these clusters. In the next post, we’ll actually get our hands dirty and look at how to install this beast.