SQL on Hadoop: Getting Started with Apache Hive
Previous: The World of Big Data Analytics: Processes and Tools
If you’ve ever tried to write a MapReduce job just to count the number of lines in a file, you know it’s a lot of work. You have to write a Mapper, a Reducer, a Driver… it’s a whole thing.
Enter Apache Hive. Hive is a SQL-like layer that sits on top of Hadoop: you write queries in HiveQL (a dialect of SQL), and Hive compiles them into MapReduce jobs automatically. It’s a lifesaver for anyone who already knows SQL and doesn’t want to spend all day writing Java.
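That line-counting job from the opening paragraph becomes a one-liner. Assuming a table named logs already points at the file (the table name here is just for illustration):

SELECT count(*) FROM logs;

Behind the scenes, Hive turns this into roughly one map phase that emits partial counts and one reduce phase that sums them, with no Mapper, Reducer, or Driver code for you to write.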
Setting Up Hive
Sridhar Alla’s book walks through the installation process. You’ll need the Hive binaries and a database to store metadata, called the metastore (it tracks where your tables live, what columns they have, etc.). For a local setup, the embedded Apache Derby database is the easiest choice.
Once you extract Hive, you have to tweak a few properties in hive-site.xml, like hive.metastore.warehouse.dir, the directory in HDFS where Hive stores managed tables. Then, initialize the Derby schema:
schematool -dbType derby -initSchema --verbose
After that, you just type hive and you’re in the console.
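For reference, the warehouse-directory tweak in hive-site.xml looks something like this (hive.metastore.warehouse.dir is the standard property name; the path shown is just a common default):

<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>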
Databases and Tables
Creating a database in Hive is exactly like in MySQL or Postgres:
CREATE DATABASE mydb;
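To start working in the new database, switch to it; SHOW and DESCRIBE behave much as they do in a traditional RDBMS:

USE mydb;
SHOW TABLES;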
Creating a table is also very similar, but with a few Hadoop-specific twists. For example, you can create an External Table. This means Hive doesn’t “own” the data; it just points to a directory in HDFS. If you drop the table, the data stays where it is.
CREATE EXTERNAL TABLE OnlineRetail (
InvoiceNo string,
StockCode string,
Description string,
Quantity int,
...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/normal';
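A quick way to see the external-table behavior for yourself (using the table defined above):

DESCRIBE FORMATTED OnlineRetail;
-- Look for "Table Type: EXTERNAL_TABLE" in the output.
DROP TABLE OnlineRetail;
-- The files under /user/normal are untouched; recreate the table and the data is back.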
Querying Data
Once your table is set up, you can run regular SQL:
SELECT Description, count(*) AS cnt
FROM OnlineRetail
GROUP BY Description
HAVING cnt > 100
ORDER BY cnt DESC
LIMIT 5;
Behind the scenes, Hive is launching MapReduce jobs to handle the grouping and sorting. It might be a bit slower than a traditional RDBMS for small datasets, but for billions of rows, it’s a beast.
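You don’t have to take this on faith: prefix any query with EXPLAIN and Hive prints the plan it generated, including the MapReduce stages it will launch:

EXPLAIN
SELECT Description, count(*) AS cnt
FROM OnlineRetail
GROUP BY Description;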
More Than Just Strings and Ints
One of the cool things about Hive is that it supports complex types like Arrays, Maps, and Structs. This is perfect for the kind of nested data you often find in big data logs.
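As a sketch of what that looks like in DDL (the table and column names here are made up for illustration; the delimiter clauses tell Hive how the nested values are encoded in the text files):

CREATE TABLE weblogs (
  ip string,
  tags array<string>,
  headers map<string,string>,
  geo struct<city:string, country:string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';

You can then drill into the nested values directly: SELECT tags[0], headers['referer'], geo.city FROM weblogs;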
We also got a quick glimpse of Apache Spark and Tableau. Spark is generally faster than Hive because it processes data in-memory, and Tableau is the go-to for turning all these numbers into charts that people can actually understand.
Next up, we’re going to dive deep into the engine that powers Hive: MapReduce.