Setting Up Your Hadoop 3 Cluster: A Step-by-Step Guide
In the last post, we talked about all the cool new features in Hadoop 3. Now, let’s actually build something. Sridhar Alla’s book gives a solid walkthrough on setting up a single-node cluster. If you’re on Linux, this is pretty straightforward.
The Basics: Prerequisites
First things first: you need Java 8. Hadoop 3 is picky about this: Java 7 is no longer supported, and Java 11 support didn’t arrive until the 3.3 line, so running 3.1.0 on 11+ is asking for trouble. Check your version with java -version and make sure JAVA_HOME points at that JDK.
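If you’re not sure what’s installed, a quick check like the following will tell you. The JDK path here is just an example; substitute wherever your Java 8 actually lives.

```shell
# Print the installed Java version, if any (Hadoop 3 targets Java 8).
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java not found on PATH"
fi

# Point JAVA_HOME at your JDK -- this path is an example and will
# differ on your machine.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
echo "JAVA_HOME is set to $JAVA_HOME"
```

Putting the export in your shell profile (or in etc/hadoop/hadoop-env.sh) saves you from setting it every session.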
You’ll also need to download the Hadoop 3.1.0 binaries. Once you’ve got them, extract the tarball and you’re almost ready to go.
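The download-and-unpack step looks roughly like this. The archive URL is an assumption (check the Apache downloads page for a current mirror), and the wget/tar lines are commented out so the snippet is safe to paste and edit first.

```shell
# Download and unpack the Hadoop 3.1.0 binary tarball.
HADOOP_VERSION=3.1.0
# wget "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"
# tar -xzf "hadoop-${HADOOP_VERSION}.tar.gz"

# Convenient to keep a variable pointing at the extracted directory;
# the location under $HOME is an example.
export HADOOP_HOME="$HOME/hadoop-${HADOOP_VERSION}"
echo "HADOOP_HOME will be $HADOOP_HOME"
```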
SSH Without a Passphrase
This is a step that trips people up. Hadoop needs to be able to talk to itself over SSH without you typing in a password every five seconds.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Once that’s done, test it with ssh localhost. If it just logs you in without asking for anything, you’re golden.
Configuring the NameNode and HDFS
You’ll need to edit a couple of XML files in etc/hadoop/.
- core-site.xml: Tell Hadoop where the default filesystem is (usually hdfs://localhost:9000).
- hdfs-site.xml: Set your replication factor to 1 (since we’re only on one node) and tell it where to store the NameNode data.
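A minimal single-node version of those two files might look like this. The NameNode storage path is just an example; pick any directory the Hadoop user can write to.

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
</configuration>
```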
Then, format the filesystem:
./bin/hdfs namenode -format
And start HDFS:
./sbin/start-dfs.sh
Boom. You can now visit http://localhost:9870 and see your NameNode in all its glory.
YARN and the Timeline Service
Next, start YARN with ./sbin/start-yarn.sh. This gives you the ResourceManager UI at http://localhost:8088.
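One gotcha: before MapReduce jobs will actually run under YARN, the NodeManager needs the shuffle service enabled in etc/hadoop/yarn-site.xml. A minimal sketch:

```xml
<!-- etc/hadoop/yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```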
If you want to try the new Timeline Service v.2, things get a bit more complex. You’ll need Apache HBase 1.2.6. You have to set up HBase, create a specific schema, and tweak your YARN config to point to it. It’s a bit of a process, but it’s worth it if you want that deep visibility into your jobs.
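As a rough sketch, the YARN side of that setup comes down to a couple of properties in yarn-site.xml (the HBase schema creation and hbase-site.xml wiring are separate steps not shown here):

```xml
<!-- etc/hadoop/yarn-site.xml (Timeline Service v.2 additions) -->
<configuration>
  <property>
    <name>yarn.timeline-service.version</name>
    <value>2.0f</value>
  </property>
  <property>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
  </property>
</configuration>
```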
Testing it Out
The best way to see if everything is working is to run a sample job. The book suggests the classic “word count” or “grep” examples included in the Hadoop JARs. If the job finishes and you see output in HDFS, you’ve successfully built your first Hadoop 3 cluster.
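Concretely, a word-count run from the Hadoop install directory might look like the following. The examples-jar version and the HDFS input/output paths are assumptions; adjust them to your install.

```shell
# Run the bundled word-count example against some sample input.
EXAMPLES_JAR=share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar
if [ -f "$EXAMPLES_JAR" ]; then
  # Use the Hadoop config files themselves as handy sample input.
  ./bin/hdfs dfs -mkdir -p /input
  ./bin/hdfs dfs -put etc/hadoop/*.xml /input
  ./bin/hadoop jar "$EXAMPLES_JAR" wordcount /input /output
  # Inspect the result.
  ./bin/hdfs dfs -cat /output/part-r-00000 | head
else
  echo "examples jar not found -- run this from your Hadoop home directory"
fi
```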
Setting up a cluster can feel like a lot of moving parts, but once you get the hang of these config files, it starts to make sense. In the next post, we’re going to step back and look at the “why”: the big picture of data analytics.