Statistical Computing with R and Hadoop
Previous: Scientific Computing with Python and Hadoop
If Python is the general-purpose king of data science, R is the specialized wizard of statistics. While Python is great for building pipelines and apps, R was built by statisticians, for statisticians. In Chapter 5, Sridhar Alla shows us how to bring that statistical power to the massive datasets sitting in Hadoop.
The Big Challenge: R’s Memory Limit
The “elephant in the room” (pun intended) is that R usually wants to load all your data into RAM. That’s fine for a few megabytes, but for a 100TB Hadoop cluster? Not going to happen.
To fix this, we have a few options for integrating R with Hadoop:
- Workstation Connection: Use R on your laptop to pull a sample of data from Hadoop.
- Shared Server: Run R on a massive server with 512GB of RAM.
- RHadoop: This is the big one. It's a collection of packages (like rhdfs and rmr2) that let you run R code inside the Hadoop cluster.
RHadoop: Running R in Parallel
rmr2 is particularly cool because it lets you write Mappers and Reducers using pure R functions. Instead of moving the data to your R environment, you move your R code to where the data lives. This skips the painful “data movement” phase and lets you parallelize your computations across the entire cluster.
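To make that concrete, here's a minimal word-count sketch in rmr2, assuming the rmr2 and rhdfs packages are installed and Hadoop's environment variables (like HADOOP_CMD) are configured; the HDFS input path is hypothetical:

```r
library(rmr2)

# Count word frequencies across a text file stored in HDFS.
# The mapper and reducer are plain R functions; rmr2 ships them
# out to the cluster and runs them inside Hadoop MapReduce.
wordcount <- mapreduce(
  input        = "/data/corpus.txt",   # hypothetical HDFS path
  input.format = "text",
  map = function(k, lines) {
    words <- unlist(strsplit(lines, "\\s+"))
    keyval(words, 1)                   # emit (word, 1) pairs
  },
  reduce = function(word, counts) {
    keyval(word, sum(counts))          # sum the 1s per word
  }
)

# Pull the (word, count) pairs back into the local R session
results <- from.dfs(wordcount)
```

Notice that the data never leaves HDFS until the final `from.dfs()` call; only the small R functions travel across the network.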
There’s also RHive, which lets you launch Hive queries from your R console. It’s the best of both worlds: the ease of SQL and the analytical power of R.
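A rough sketch of that workflow, assuming the RHive package is installed and pointed at a running Hive server (the hostname and table here are hypothetical):

```r
library(RHive)

# Connect to the Hive server
rhive.init()
rhive.connect(host = "hive.example.com")   # hypothetical host

# A Hive query comes back as an ordinary R data frame,
# so you can feed it straight into R's statistical functions
sales <- rhive.query("
  SELECT region, SUM(amount) AS total
  FROM sales
  GROUP BY region
")
summary(sales$total)

rhive.close()
```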
Exploring Data in R
The book walks through some basic data exploration in R. If you’ve used Pandas, some of this will feel familiar, but the syntax is definitely its own thing:
- read.csv(): Loading your data (you can even use file.choose() to get a nice popup window).
- head() and tail(): Checking the top and bottom of your dataset.
- summary() and fivenum(): Getting instant statistical breakdowns (min, max, median, quartiles).
- plot(): R’s built-in plotting is legendary. You can create everything from simple scatter plots to complex histograms with just one or two lines of code.
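Those functions fit together in a session like this; the example uses R's built-in mtcars dataset so it runs as-is (with your own file you'd start from read.csv() instead):

```r
# Built-in dataset so the example is self-contained;
# with your own data: df <- read.csv(file.choose())
df <- mtcars

head(df, 3)        # first three rows
tail(df, 3)        # last three rows

summary(df$mpg)    # min, quartiles, median, mean, max
fivenum(df$mpg)    # Tukey's five-number summary

# One-line visualizations
plot(df$wt, df$mpg, main = "Weight vs. MPG")
hist(df$mpg)
```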
R vs. Python: Which One to Use?
Honestly, it depends on your background. If you’re a developer first, you’ll probably prefer Python. If you’re a statistician or researcher, you’ll feel right at home with R.
The good news is that with tools like SparkR, the gap is closing. You can now use the speed of Apache Spark while still writing R code.
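As a taste of what that looks like, here's a minimal SparkR sketch, assuming a Spark installation with the SparkR package on the library path (the app name is arbitrary, and the built-in faithful dataset stands in for real cluster data):

```r
library(SparkR)

# Start (or attach to) a Spark session
sparkR.session(appName = "r-on-spark")

# Distribute a local data frame across the cluster
df <- as.DataFrame(faithful)

# Familiar R-style operations, executed by Spark's engine
long_waits <- filter(df, df$waiting > 70)
head(summarize(long_waits, avg_wait = avg(long_waits$waiting)))

sparkR.session.stop()
```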
Speaking of Spark, that’s exactly what we’re looking at next. It’s the technology that many say is replacing MapReduce as the default way to process big data.
Next: Batch Analytics with Apache Spark: Faster Than MapReduce