SQL on Hadoop: Getting Started with Apache Hive

Previous: The World of Big Data Analytics: Processes and Tools

If you’ve ever tried to write a MapReduce job just to count the number of lines in a file, you know it’s a lot of work. You have to write a Mapper, a Reducer, a Driver… it’s a whole thing.

Enter Apache Hive. Hive is a SQL-like layer that sits on top of Hadoop: you write queries in HiveQL, its SQL dialect, and Hive compiles them into MapReduce jobs automatically. It’s a lifesaver for anyone who already knows SQL and doesn’t want to spend all day writing Java code.

Setting Up Hive

Sridhar Alla’s book walks through the installation process. You’ll need the Hive binaries and a relational database for the metastore, which records where your tables live, what columns they have, and so on. For a local setup, the embedded Apache Derby database is the easiest choice.

Once you extract Hive, you have to tweak a few properties in hive-site.xml, like setting the warehouse directory in HDFS. Then, initialize the Derby schema:

schematool -dbType derby -initSchema --verbose

After that, you just type hive and you’re in the console.
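The hive-site.xml tweaks mentioned above boil down to a couple of properties. A minimal file for this kind of local setup might look like the sketch below — the warehouse path is illustrative, and the connection URL assumes the default embedded Derby metastore:

```xml
<configuration>
  <!-- Directory in HDFS where Hive stores managed tables -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <!-- JDBC URL for the embedded Derby metastore database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
</configuration>
```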

Databases and Tables

Creating a database in Hive is exactly like in MySQL or Postgres: CREATE DATABASE mydb;
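For instance (mydb is just a placeholder name):

```sql
-- Create the database only if it doesn't already exist
CREATE DATABASE IF NOT EXISTS mydb;
-- Switch the session to it, then list its tables
USE mydb;
SHOW TABLES;
```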

Creating a table is also very similar, but with a few Hadoop-specific twists. For example, you can create an external table, which means Hive doesn’t “own” the data; it just points to a directory in HDFS. Drop the table and only the metadata goes away; the files stay where they are. (Drop a regular, managed table and Hive deletes the underlying data too.)

CREATE EXTERNAL TABLE OnlineRetail (
  InvoiceNo STRING,
  StockCode STRING,
  Description STRING,
  Quantity INT,
  ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/normal';

Querying Data

Once your table is set up, you can run regular SQL:

SELECT Description, count(*) AS cnt
FROM OnlineRetail
GROUP BY Description
HAVING cnt > 100
ORDER BY cnt DESC
LIMIT 5;

Behind the scenes, Hive is launching MapReduce jobs to handle the grouping and sorting. It might be a bit slower than a traditional RDBMS for small datasets, but for billions of rows, it’s a beast.
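You can see this for yourself: prefixing a query with EXPLAIN makes Hive print its execution plan — including the MapReduce stages — without actually running anything. The exact stage names vary by Hive version; this is just a sketch of what to look for:

```sql
-- Print the plan instead of running the query; the output
-- lists stages such as a map-side partial aggregation
-- followed by a reduce-side group-by.
EXPLAIN
SELECT Description, count(*) AS cnt
FROM OnlineRetail
GROUP BY Description;
```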

More Than Just Strings and Ints

One of the cool things about Hive is that it supports complex types like Arrays, Maps, and Structs. This is perfect for the kind of nested data you often find in big data logs.
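As a sketch, here’s what a table mixing those types could look like — the web_logs table and its columns are hypothetical, not from the book:

```sql
-- Hypothetical log table using Hive's complex types
CREATE TABLE web_logs (
  user_id STRING,
  pages   ARRAY<STRING>,                  -- pages visited, in order
  headers MAP<STRING, STRING>,            -- HTTP header name -> value
  device  STRUCT<os:STRING, model:STRING> -- nested device info
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
```

You’d then reach into the nested fields with expressions like pages[0], headers['User-Agent'], or device.os.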

We also got a quick glimpse of Apache Spark and Tableau. Spark is generally faster than Hive because it processes data in memory, and Tableau is the go-to tool for turning all these numbers into charts that people can actually understand.

Next up, we’re going to dive deep into the engine that powers Hive: MapReduce.

Next: Deep Look at MapReduce: How Hadoop Processes Data
