The World of Big Data Analytics: Processes and Tools

Now that we’ve got a cluster running, let’s talk about why we bother with all this complexity in the first place. Chapter 2 of Sridhar Alla’s book takes a step back to look at the big picture of data analytics.

What is Data Analytics, Anyway?

At its heart, data analytics is just about asking questions and finding answers in your data. The book breaks it down into two main types:

  • Exploratory Data Analysis (EDA): This is when you’re just poking around, looking for patterns or relationships you didn’t know were there.
  • Confirmatory Data Analysis (CDA): This is more scientific. You start with a specific hypothesis and use statistics to test whether the data supports or contradicts it.
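To make the difference concrete, here is a minimal sketch in plain Python. It is not from the book: the page-load numbers and the function names explore and permutation_p_value are made up for illustration, and the permutation test is just one simple way to do a confirmatory check.

```python
import random
import statistics

# Hypothetical data: page load times (seconds) for two versions of a site.
old_version = [2.9, 3.1, 3.4, 3.0, 3.3, 3.2, 3.5, 3.1]
new_version = [2.6, 2.8, 2.7, 3.0, 2.5, 2.9, 2.7, 2.8]


def explore(sample):
    """EDA: summarize a sample with no particular hypothesis in mind."""
    return {
        "min": min(sample),
        "max": max(sample),
        "mean": statistics.fmean(sample),
        "stdev": statistics.stdev(sample),
    }


def permutation_p_value(a, b, rounds=10_000, seed=0):
    """CDA: test the hypothesis "a and b have the same mean".

    Shuffles the pooled values, splits them into two groups of the
    original sizes, and reports how often the gap between group means
    is at least as large as the gap actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(left) - statistics.fmean(right)) >= observed:
            hits += 1
    return hits / rounds


if __name__ == "__main__":
    print(explore(old_version))
    print(explore(new_version))
    print(permutation_p_value(old_version, new_version))
```

The explore step is the "poking around" part. The permutation test is the confirmatory part: a small value suggests the gap between the two versions is unlikely to be a fluke, though it never proves anything outright.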

The 7 Vs of Big Data

Back in the day, we talked about the 3 Vs: Volume, Velocity, and Variety. Then it became 4. Now, the cool kids talk about the 7 Vs. If you want to sound like you know what you’re talking about in a meeting, remember these:

  1. Volume: The sheer amount of data. We’re talking terabytes and petabytes.
  2. Velocity: How fast the data is coming in (think real-time Twitter feeds vs. monthly reports).
  3. Variety: Different formats like JSON, XML, video, or just plain text.
  4. Veracity: Can you trust the data? Is it accurate?
  5. Variability: Does the meaning of the data change over time or with context?
  6. Visualization: How do you actually show this data so people can understand it?
  7. Value: At the end of the day, is this data actually making the business better?

The Reality of Data Quality

Here’s a truth bomb: data is messy. People type their addresses differently. Systems crash. Records get duplicated.

This is where roles like the Data Steward come in. They’re the ones responsible for knowing where the data comes from and what it means. Before you can do any fancy AI or machine learning, you have to clean your data. Profiling, cleansing, and deduplication aren’t the “sexy” parts of big data, but they’re arguably the most important, because every later analysis inherits whatever errors they miss.
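Here is a rough idea of what those three steps can look like, as a small Python sketch. The records, the abbreviation table, and the function names are all invented for this example. Real pipelines use far more thorough rules than this.

```python
from collections import Counter

# Hypothetical customer records with the kinds of mess described above.
records = [
    {"name": "Ann Lee", "address": "12 Main St."},
    {"name": "ann lee ", "address": "12  main street"},
    {"name": "Bob Roy", "address": ""},
    {"name": "Bob Roy", "address": "7 Oak Ave"},
]

# Illustrative only: a real cleansing step would need a much larger table.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def cleanse(value):
    """Lowercase, drop periods/commas and extra spaces, expand abbreviations."""
    words = value.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def profile(rows):
    """Profiling: count empty values per field."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if not value.strip():
                missing[field] += 1
    return dict(missing)


def deduplicate(rows):
    """Keep the first record seen for each cleansed (name, address) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (cleanse(row["name"]), cleanse(row["address"]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    print(profile(records))
    print(deduplicate(records))
```

Notice that the two "Ann Lee" rows only collapse into one after cleansing. Compare the raw strings and they look like different people, which is exactly why deduplication comes after cleansing and not before.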

Why Hadoop?

Hadoop exists because traditional single-server databases struggle with data at this scale, especially its volume, velocity, and variety. When quintillions of bytes are being generated every day, you can’t just keep buying a bigger server. You need to spread the storage and the work across hundreds or thousands of cheap, commodity machines.

This distributed approach is the kind of architecture that makes services at the scale of Instagram or Gmail possible. It’s not just about storing the data; it’s about processing it in a way that doesn’t take forever.
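If "spread the work across machines" feels abstract, here is a toy version of the map, shuffle, and reduce pattern that Hadoop popularized, written in plain Python. To be clear, this is not Hadoop’s API. Everything runs in one process, the function names are my own, and each chunk simply stands in for the slice of data a worker node would handle.

```python
from collections import defaultdict


def split_into_chunks(lines, workers):
    """Deal the input lines out into `workers` chunks, one per pretend node."""
    return [lines[i::workers] for i in range(workers)]


def map_chunk(chunk):
    """Map step: a worker emits a (word, 1) pair for every word in its chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group the emitted values by word across all workers."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: add up each word's values into one total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines, workers=3):
    chunks = split_into_chunks(lines, workers)
    # On a real cluster these map calls would run on different machines.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster", "data data"]))
```

The point of the pattern is that map_chunk never needs to see anyone else’s chunk, so adding more workers adds more capacity. The answer comes out the same whether you use one worker or three.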

In the next post, we’ll look at one of the most popular tools for making Hadoop feel like a regular database: Apache Hive.
