Setting Up Your Hadoop 3 Cluster: A Step-by-Step Guide

Previous: Getting Started with Hadoop 3: What’s New and Why It Matters

In the last post, we talked about all the cool new features in Hadoop 3. Now, let’s actually build something. Sridhar Alla’s book gives a solid walkthrough of setting up a single-node cluster. If you’re on Linux, this is pretty straightforward.

The Basics: Prerequisites

First things first: you need Java 8. Hadoop 3 is picky about this. Java 7 is no longer supported, and releases this early in the 3.x line don’t support Java 11 yet, so stick with 8. Check your version with java -version and make sure your JAVA_HOME is set.
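A quick sanity check looks like this (the major_version helper is just an illustration for parsing both the old “1.8.0_181” style and the newer “11.0.2” style version strings; it’s not part of Hadoop):

```shell
# Show the active Java version and where JAVA_HOME points.
java -version 2>&1 | head -n 1
echo "JAVA_HOME is ${JAVA_HOME:-NOT SET}"

# Illustrative helper: extract the major version number.
# Pre-9 Java reports "1.8.0_181"; Java 9+ reports "11.0.2" style.
major_version() {
  case "$1" in
    1.*) v="${1#1.}"; echo "${v%%.*}" ;;
    *)   echo "${1%%.*}" ;;
  esac
}
```

For Hadoop 3 you want major_version to come out as 8.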

You’ll also need to download the Hadoop 3.1.0 binaries. Once you’ve got them, extract the tarball and you’re almost ready to go.

SSH Without a Passphrase

This is a step that trips people up. Hadoop needs to be able to talk to itself over SSH without you typing in a password every five seconds.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Once that’s done, test it with ssh localhost. If it just logs you in without asking for anything, you’re golden.

Configuring the NameNode and HDFS

You’ll need to edit a couple of XML files in etc/hadoop/.

  • core-site.xml: Tell Hadoop where the filesystem is (usually hdfs://localhost:9000).
  • hdfs-site.xml: Set your replication factor to 1 (since we’re only on one node) and tell it where to store the NameNode data.
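For reference, a minimal single-node version of those two files might look like this (the NameNode directory path is just an example; point it anywhere the Hadoop user can write):

```xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

```xml
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <!-- Example path only; choose your own storage location -->
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
</configuration>
```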

Then, format the filesystem: ./bin/hdfs namenode -format

And start HDFS: ./sbin/start-dfs.sh

Boom. You can now visit http://localhost:9870 and see your NameNode in all its glory.
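If the web UI isn’t cooperating, jps (it ships with the JDK) is the quickest way to see whether the daemons actually came up. On a healthy single-node HDFS you’d expect a NameNode, a DataNode, and a SecondaryNameNode:

```shell
# jps lists running JVMs; filter for the HDFS daemons.
# The || branch just gives a friendly message instead of a silent failure.
jps | grep -E 'NameNode|DataNode|SecondaryNameNode' || echo "HDFS daemons not found"
```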

YARN and the Timeline Service

Next, start YARN with ./sbin/start-yarn.sh. This gives you the resource manager at http://localhost:8088.
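One gotcha: out of the box, YARN won’t actually run MapReduce jobs until you tell it to. The standard single-node setup boils down to two small config fragments (property names are the stock ones from the Apache docs; double-check against your release):

```xml
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

```xml
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```

Depending on the exact release, you may also need to point mapreduce.application.classpath (or the HADOOP_MAPRED_HOME environment settings) at your install so jobs can find the MapReduce jars.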

If you want to try the new Timeline Service v.2, things get a bit more complex. You’ll need Apache HBase as a backing store (the book uses 1.2.6). You have to stand up HBase, create the timeline schema in it, and tweak your YARN config to point at it. It’s a bit of a process, but it’s worth it if you want that deep visibility into your jobs.
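If you do go down that road, the YARN side mostly comes down to a few yarn-site.xml properties. This is a sketch based on the Apache Timeline Service v.2 docs; verify the names against your release:

```xml
<!-- yarn-site.xml: enable Timeline Service v.2 -->
<configuration>
  <property>
    <name>yarn.timeline-service.version</name>
    <value>2.0f</value>
  </property>
  <property>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.system-metrics-publisher.enabled</name>
    <value>true</value>
  </property>
</configuration>
```

The schema step uses Hadoop’s bundled TimelineSchemaCreator tool against your running HBase; check your release’s documentation for the exact invocation.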

Testing It Out

The best way to see if everything is working is to run a sample job. The book suggests the classic “word count” or “grep” examples included in the Hadoop JARs. If the job finishes and you see output in HDFS, you’ve successfully built your first Hadoop 3 cluster.
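Concretely, running the bundled grep example from the top of your Hadoop install looks something like this (the [ -x bin/hadoop ] guard is just there so the snippet does nothing if you paste it in the wrong directory; the jar version should match your download):

```shell
# Run from the Hadoop install directory (bin/, etc/, share/ below you).
if [ -x bin/hadoop ]; then
  # Use Hadoop's own config files as sample input.
  bin/hdfs dfs -mkdir -p input
  bin/hdfs dfs -put etc/hadoop/*.xml input
  # Count every token matching dfs[a-z.]+ across the input files.
  bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar \
      grep input output 'dfs[a-z.]+'
  # Print the results.
  bin/hdfs dfs -cat 'output/*'
fi
```

If you see lines like a count followed by a property name, the whole pipeline — HDFS, YARN, and MapReduce — is working.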

Setting up a cluster can feel like a lot of moving parts, but once you get the hang of these config files, it starts to make sense. In the next chapter, we’re going to step back and look at the “why” - the big picture of data analytics.

Next: The World of Big Data Analytics: Processes and Tools
