Statistical Computing with R and Hadoop

Previous: Scientific Computing with Python and Hadoop

If Python is the general-purpose king of data science, R is the specialized wizard of statistics. While Python is great for building pipelines and apps, R was built by statisticians, for statisticians. In Chapter 5, Sridhar Alla shows us how to bring that statistical power to the massive datasets sitting in Hadoop.

The Big Challenge: R’s Memory Limit

The “elephant in the room” (pun intended) is that R usually wants to load all your data into RAM. That’s fine for a few megabytes, but for a 100TB Hadoop cluster? Not going to happen.

To fix this, we have a few options for integrating R with Hadoop:

  1. Workstation Connection: Use R on your laptop to pull a sample of data from Hadoop.
  2. Shared Server: Run R on a massive server with 512GB of RAM.
  3. RHadoop: This is the big one. It’s a collection of packages (like rhdfs and rmr2) that let you run R code inside the Hadoop cluster.
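To make option 3 a little more concrete, here's a minimal sketch of what talking to HDFS from R looks like with rhdfs. This assumes rhdfs is installed, the `HADOOP_CMD` environment variable points at your hadoop binary, and the paths are placeholders for your own cluster:

```r
# Sketch: browsing HDFS from an R session with rhdfs.
# Assumes a reachable Hadoop cluster and HADOOP_CMD set.
library(rhdfs)

hdfs.init()                                # connect to the cluster
hdfs.ls("/user")                           # list a directory, like ls on HDFS
hdfs.put("local.csv", "/user/me/data.csv") # copy a local file into HDFS
```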

RHadoop: Running R in Parallel

rmr2 is particularly cool because it lets you write Mappers and Reducers using pure R functions. Instead of moving the data to your R environment, you move your R code to where the data lives. This skips the painful “data movement” phase and lets you parallelize your computations across the entire cluster.
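Here's roughly what that looks like in practice: a tiny rmr2 job in the spirit of the package's own tutorial, squaring numbers in the map phase and summing them in the reduce phase. It's a sketch that assumes a working Hadoop cluster with rmr2 installed and the usual `HADOOP_CMD`/`HADOOP_STREAMING` variables set:

```r
# Sketch: a toy MapReduce job written entirely in R with rmr2.
# Requires a configured Hadoop cluster; keys/values here are illustrative.
library(rmr2)

ints <- to.dfs(1:1000)  # push a small vector into HDFS

result <- mapreduce(
  input  = ints,
  map    = function(k, v) keyval(v %% 10, v^2),   # key each square by last digit
  reduce = function(k, vv) keyval(k, sum(vv))     # sum the squares per key
)

from.dfs(result)  # pull the (key, value) pairs back into your R session
```

The point isn't the arithmetic, it's that `map` and `reduce` are ordinary R functions; Hadoop handles shipping them to the data.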

There’s also RHive, which lets you launch Hive queries from your R console. It’s the best of both worlds: the ease of SQL and the analytical power of R.
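A hedged sketch of that workflow, assuming RHive is installed and a Hive server is reachable (the hostname and table are hypothetical):

```r
# Sketch: running a Hive query from R with RHive.
# Server address and table name are placeholders.
library(RHive)

rhive.init()
rhive.connect("hive-server.example.com")

# SQL does the heavy lifting on the cluster; R gets back a data frame.
sales <- rhive.query("SELECT region, SUM(amount) FROM sales GROUP BY region")

rhive.close()
```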

Exploring Data in R

The book walks through some basic data exploration in R. If you’ve used Pandas, some of this will feel familiar, but the syntax is definitely its own thing:

  • read.csv(): Loading your data (you can even use file.choose() to get a nice popup window).
  • head() and tail(): Checking the top and bottom of your dataset.
  • summary() and fivenum(): Getting instant statistical breakdowns. summary() reports min, quartiles, median, mean, and max; fivenum() returns Tukey’s five-number summary (min, lower hinge, median, upper hinge, max).
  • plot() and friends: R’s built-in graphics are legendary. With just one or two lines of code you can go from a simple scatter plot with plot() to a histogram with hist() or a box plot with boxplot().
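Put together, a quick exploration session looks something like this. To keep the sketch self-contained it uses R’s built-in mtcars dataset, written out to a CSV first so read.csv() has something to load:

```r
# A quick base-R exploration session on the built-in mtcars dataset.
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)

df <- read.csv(tmp)   # interactively, read.csv(file.choose()) pops a file picker
head(df)              # first six rows
tail(df, 3)           # last three rows
summary(df$mpg)       # min, quartiles, median, mean, max
fivenum(df$mpg)       # Tukey's five-number summary
plot(df$wt, df$mpg)   # scatter plot: car weight vs. fuel economy
hist(df$mpg)          # histogram of mpg
```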

R vs. Python: Which One to Use?

Honestly, it depends on your background. If you’re a developer first, you’ll probably prefer Python. If you’re a statistician or researcher, you’ll feel right at home with R.

The good news is that with tools like SparkR, the gap is closing. You can now use the speed of Apache Spark while still writing R code.
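For a taste of what that looks like, here's a small SparkR sketch. It assumes Spark is installed locally with `SPARK_HOME` set and the SparkR package on your library path; the data is just R's built-in faithful dataset:

```r
# Sketch: the SparkR DataFrame API, assuming a local Spark install.
library(SparkR)

sparkR.session(appName = "sketch")

df <- as.DataFrame(faithful)          # distribute an ordinary R data.frame
head(filter(df, df$waiting > 70))     # familiar dplyr-style verbs, run by Spark
collect(summarize(df, avg_wait = mean(df$waiting)))  # aggregate on the cluster

sparkR.session.stop()
```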

Speaking of Spark, that’s exactly what we’re looking at next. It’s the technology that many say is replacing MapReduce as the default way to process big data.

Next: Batch Analytics with Apache Spark: Faster Than MapReduce
