The Powerful Marriage of Hadoop and R (David Champagne)

Revolution Analytics

November 9, 2011

Leveraging R in Hadoop Environments

In Today’s Presentation:

About Revolution Analytics

Why R and Hadoop?

The Packages (rhdfs, rhbase, rmr)

Examples

OPEN SOURCE ANALYTICS FOR THE ENTERPRISE

The professor who invented analytic software for the experts now wants to take it to the masses

Most advanced statistical analysis software available

Half the cost of commercial alternatives

2M+ Users

3,000+ Packages

Statistics

Predictive Analytics

Data Mining

Visualization

Finance

Life Sciences

Manufacturing

Retail

Telecom

Social Media

Government

Power

Productivity

Enterprise Readiness

What’s the Difference Between R and Revolution R Enterprise?

Revolution R is 100% R and More®

R Engine Language Libraries

3,000+ Community Packages

Technical Support

Web-Based GUI

Web Services API

Big Data Analysis

IDE / Developer GUI

Build Assurance

Parallel Tools

Multi-Threaded Math Libraries

For more information contact: [email protected]

Let’s Talk about R and Hadoop

Why R and Hadoop?

Hadoop - a scalable infrastructure for processing massive amounts of data

Storage – HDFS, HBASE

Distributed Computing - MapReduce

R - a statistical programming language

Need for more than counts and averages

Analyze all of the data

Motivation for this project

Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs

Run R on a massively distributed system without having to understand the underlying infrastructure

Statisticians stay focused on the analysis

Open source

R and Hadoop – The R Packages

R Client

R

Map or Reduce

Job Tracker

Task Node

HDFS

HBASE

Thrift

rhdfs - R and HDFS

rhbase - R and HBASE

rmr - R and MapReduce

Capabilities delivered as individual R packages

rmr

rhdfs rhbase

Downloads available from Github

https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads

rhdfs

Manipulate HDFS directly from R

Mimic as much of the HDFS Java API as possible

Examples:

Read an HDFS text file into a data frame

Serialize/Deserialize a model to HDFS

Write an HDFS file to local storage

rhdfs/pkg/inst/unitTests

rhdfs/pkg/inst/examples
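
The serialize/deserialize example can be sketched in base R. This is a minimal sketch, not the rhdfs implementation: a local temporary file stands in for HDFS (an assumption made so the code runs without a cluster), and the model and variable names are chosen only for illustration.

```r
# Fit a small model on a built-in data set.
model <- lm(mpg ~ wt, data = mtcars)

# serialize() with connection = NULL returns the object as a raw vector.
model_bytes <- serialize(model, NULL)

# Stand-in for the HDFS round trip: a local temporary file.
# With rhdfs the bytes would go through the package's file functions
# (hdfs.file, hdfs.write, hdfs.close) instead; that step is not shown here.
path <- tempfile(fileext = ".bin")
writeBin(model_bytes, path)
restored_bytes <- readBin(path, what = "raw", n = file.info(path)$size)

# unserialize() rebuilds the model object from the raw vector.
restored <- unserialize(restored_bytes)

# Check that the restored model predicts the same values as the original.
new_data <- data.frame(wt = c(2.5, 3.0))
same <- isTRUE(all.equal(predict(model, new_data), predict(restored, new_data)))
print(same)
```

Because the model travels as a plain raw vector, any byte-oriented store can hold it; only the write and read steps would change when targeting HDFS.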

rhdfs Functions

File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get

File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file

Directory - hdfs.dircreate, hdfs.mkdir

Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists

Initialization – hdfs.init, hdfs.defaults

rhbase

Manipulate HBASE tables and their content

Uses the Thrift C++ API to communicate with HBASE

Examples:

Create a data frame from a collection of rows and columns in an HBASE table

Update an HBASE table with values from a data frame

rhbase/pkg/inst/unitTests
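
The first example, building a data frame from HBASE rows and columns, might look roughly like the following in base R. This is a sketch under stated assumptions: the nested list `rows` is a hypothetical in-memory stand-in for data fetched from a table, and the "family:qualifier" column names and values are invented for illustration. No rhbase calls are made.

```r
# Hypothetical stand-in for rows fetched from an HBASE table:
# each element is one row, named by its row key, holding
# "family:qualifier" = value pairs.
rows <- list(
  user1 = list("info:age" = 34, "info:gender" = "F"),
  user2 = list("info:age" = 27, "info:gender" = "M"),
  user3 = list("info:age" = 45, "info:gender" = "F")
)

# Turn each row into a one-row data frame, keeping the column names as-is,
# then stack the rows into a single data frame.
row_frames <- lapply(names(rows), function(key) {
  data.frame(rowkey = key, rows[[key]],
             check.names = FALSE, stringsAsFactors = FALSE)
})
users <- do.call(rbind, row_frames)
print(users)
```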

rhbase Functions

Table Manipulation – hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table

Row Read/Write - hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan

Utility - hb.list.tables

Initialization - hb.defaults, hb.init

Writing MapReduce programs in R

rmr - For R Programmers

• A way to access big data sets

• A simple way to write parallel programs – everyone will have to

• Very R-like, building on the functional characteristics of R

• Just a library

rmr – For MapReduce Developers

• Much simpler than writing Java

• Not as simple as Hive, Pig at what they do, but more general

• Great for prototyping, can transition to production: optimize instead of rewriting! Lower risk, always executable.

rmr mapreduce Function

mapreduce (input, output, map, reduce, …)

input – input folder

output – output folder

map – R function used as map

reduce – R function used as reduce

… - other advanced parameters
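
To make the roles of map and reduce concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases in base R. `keyval_local` and `mapreduce_local` are hypothetical stand-ins written for this illustration; they are not the rmr functions and they take an in-memory list rather than input and output folders.

```r
# Hypothetical helpers for illustration only (not part of rmr).
keyval_local <- function(k, v) list(key = k, val = v)

mapreduce_local <- function(input, map, reduce) {
  # Map phase: call map on each (key, value) record; drop NULL results.
  mapped <- lapply(input, function(rec) map(rec$key, rec$val))
  mapped <- Filter(Negate(is.null), mapped)

  # Shuffle phase: group record positions by key.
  keys <- vapply(mapped, function(kv) as.character(kv$key), character(1))
  idx <- split(seq_along(mapped), keys)

  # Reduce phase: call reduce once per key with the list of its values.
  lapply(names(idx), function(k)
    reduce(k, lapply(mapped[idx[[k]]], function(kv) kv$val)))
}

# Input records with no key and one word per value.
words <- c("apple", "avocado", "banana", "blueberry", "cherry")
input <- lapply(words, function(w) keyval_local(NULL, w))

# Total number of characters per first letter.
result <- mapreduce_local(
  input,
  map = function(k, v) keyval_local(substr(v, 1, 1), nchar(v)),
  reduce = function(k, vv) keyval_local(k, sum(unlist(vv)))
)
str(result)
```

The map function sees one record at a time and the reduce function sees all values that share a key, which is the contract the map and reduce parameters above describe.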

Some Simple Things

Example showing sampling and counting

map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)

reduce = function(k, vv) keyval(k, length(vv))

mapreduce(input, output, map, reduce)
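
The same sampling-and-counting logic can be tried without a cluster. The sketch below is an assumption-laden local version: `toy_hash` is a made-up deterministic hash (the slide does not define `hash()`), `keyval_local` is a hypothetical stand-in for `keyval`, and the input records are synthetic.

```r
# Hypothetical stand-ins for illustration only (not part of rmr).
keyval_local <- function(k, v) list(key = k, val = v)
toy_hash <- function(k) sum(utf8ToInt(as.character(k)))

# Keep a record only when its key hashes to a multiple of 10 (sampling);
# map returns NULL for every other record.
map <- function(k, v) if (toy_hash(k) %% 10 == 0) keyval_local(k, v)
# Count how many values arrived for each sampled key.
reduce <- function(k, vv) keyval_local(k, length(vv))

# Synthetic input: 40 distinct keys, 5 records per key.
keys <- rep(sprintf("user%02d", 1:40), each = 5)
input <- Map(keyval_local, keys, seq_along(keys))

# Map phase, dropping the records that were not sampled.
mapped <- Filter(Negate(is.null), lapply(input, function(r) map(r$key, r$val)))
# Shuffle phase: group record positions by key.
idx <- split(seq_along(mapped),
             vapply(mapped, function(kv) kv$key, character(1)))
# Reduce phase: one call per sampled key.
counts <- lapply(names(idx), function(k)
  reduce(k, lapply(mapped[idx[[k]]], function(kv) kv$val)))
str(counts)
```

Sampling on the key rather than on the record keeps every value of a sampled key, so the per-key counts stay exact.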

More Simple Things

HIVE

INSERT OVERWRITE TABLE pv_gender_sum

SELECT pv_users.gender, count (DISTINCT pv_users.userid)

FROM pv_users

GROUP BY pv_users.gender;

rmr

mapreduce(input =
            mapreduce(input = "pv_users",
                      map = function(k, v) keyval(v['userid'], v['gender']),
                      reduce = function(k, vv) keyval(k, vv[[1]])),
          output = "pv_gender_sum",
          map = function(k, v) keyval(v, 1),
          reduce = function(k, vv) keyval(k, sum(unlist(vv))))

Takeaways

A language like HIVE makes a class of problems easy to solve, but it is not a general tool

The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities
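
The two-pass structure of the rmr version (deduplicate users first, then count per gender) can be checked locally. This is a sketch with hypothetical helpers (`keyval_local`, `mapreduce_local`) and invented page-view records; it mirrors the shape of the nested call but does not use rmr or Hadoop.

```r
# Hypothetical in-memory helpers for illustration only (not part of rmr).
keyval_local <- function(k, v) list(key = k, val = v)
mapreduce_local <- function(input, map, reduce) {
  mapped <- Filter(Negate(is.null),
                   lapply(input, function(r) map(r$key, r$val)))
  keys <- vapply(mapped, function(kv) as.character(kv$key), character(1))
  idx <- split(seq_along(mapped), keys)
  lapply(names(idx), function(k)
    reduce(k, lapply(mapped[idx[[k]]], function(kv) kv$val)))
}

# Invented page-view records: a user can appear more than once.
pv_users <- list(
  c(userid = "u1", gender = "F"), c(userid = "u1", gender = "F"),
  c(userid = "u2", gender = "M"), c(userid = "u3", gender = "F"),
  c(userid = "u2", gender = "M"), c(userid = "u4", gender = "M"),
  c(userid = "u5", gender = "F")
)
input <- lapply(pv_users, function(v) keyval_local(NULL, v))

# Pass 1: one record per distinct user, carrying that user's gender.
distinct_users <- mapreduce_local(
  input,
  map = function(k, v) keyval_local(v[["userid"]], v[["gender"]]),
  reduce = function(k, vv) keyval_local(k, vv[[1]])
)

# Pass 2: count the distinct users under each gender.
pv_gender_sum <- mapreduce_local(
  distinct_users,
  map = function(k, v) keyval_local(v, 1),
  reduce = function(k, vv) keyval_local(k, sum(unlist(vv)))
)
str(pv_gender_sum)
```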

Complex Things

k-means Clustering

k-means - Implementation

Well known design (MacQueen, 1967)

Comparison of k-means implementations in MapReduce

Pig (from Hortonworks)

Requires coding in 3 languages (Python-Pig-Java)

100 lines of code

rmr

20 lines of only R code

https://github.com/RevolutionAnalytics/RHadoop/wiki/Comparison-of-high-level-languages-for-mapreduce:-k-means



k-means - Highlights

map = function(k, v)
  keyval(which.min(distances(centers, v)), v)

reduce = function(k, vv)
  keyval(NULL, col.average(vv))

centers = from.dfs(
  mapreduce(input = "data-points", map = map, reduce = reduce))
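
The highlights above leave `distances` and `col.average` undefined. The following base R sketch fills them in with assumed definitions (Euclidean distance and column means) and runs the assign-then-average loop in memory on synthetic data. It illustrates the map and reduce roles only; it is not the RHadoop implementation.

```r
set.seed(42)
# Synthetic data: two well-separated groups of 2-D points.
points <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
# Arbitrary starting centers, one row per center.
centers <- rbind(c(1, 1), c(4, 4))

# Assumed helper: Euclidean distance from one point v to every center.
distances <- function(centers, v) sqrt(rowSums(sweep(centers, 2, v)^2))

for (iter in 1:5) {
  # Map step: key each point by the index of its nearest center.
  nearest <- apply(points, 1, function(v) which.min(distances(centers, v)))
  # Reduce step: the new center for each key is the column average
  # of the points assigned to it.
  centers <- t(sapply(sort(unique(nearest)), function(k)
    colMeans(points[nearest == k, , drop = FALSE])))
}
print(round(centers, 1))
```

Each pass of the loop corresponds to one MapReduce job: the assignment is independent per point, and the averaging needs only the points that share a key.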

k-means - Optimizations

Slow: for(i in 1:100) a[i] = b[i] + c[i]
Fast: a = b + c
Notes: light use of R interpreter, use fast vector primitives, C if necessary

Slow: [1, 2, 3, 4, 5]
Fast: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]...
Notes: use beefier records, say 1k points per record

Slow: distance(center, point)
Fast: norm(center - P)
Notes: compute all distances with fast matrix operations

Slow: combiner = FALSE
Fast: combiner = TRUE
Notes: reduce often and early, use combiner

Slow: keyval(k, mean(…))
Fast: keyval(k, c(total, count))
Notes: replace means with (sum, count) pairs to enable early reduction
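
Why the last optimization matters can be shown with a few numbers. In this sketch the three vectors are an invented example of one key's values landing on three map tasks in unequal amounts; averaging the per-task averages gives the wrong answer, while (sum, count) pairs can be merged in any order and divided once at the end.

```r
# Values for one key, spread unevenly across three hypothetical map tasks.
partitions <- list(c(1, 2, 3, 4, 5, 6), c(10), c(20, 30))

# Naive early reduction: average per partition, then average the averages.
mean_of_means <- mean(vapply(partitions, mean, numeric(1)))

# Combiner-friendly version: each partition emits c(total, count).
partials <- lapply(partitions,
                   function(x) c(total = sum(x), count = length(x)))
# Partial pairs add up element-wise, in any order or grouping.
combined <- Reduce(`+`, partials)
# The division happens once, at the very end.
pair_mean <- combined[["total"]] / combined[["count"]]

true_mean <- mean(unlist(partitions))
print(c(mean_of_means = mean_of_means, pair_mean = pair_mean,
        true_mean = true_mean))
```

Addition of (sum, count) pairs is associative and commutative, which is the property a combiner needs in order to be applied early and repeatedly.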

https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means

Final thoughts

R and Hadoop together offer innovation and flexibility needed to meet analytics challenges of big data

We need contributors to this project!

Developers

Documentation

Use cases

General Feedback

Resources

RHadoop Open source project: https://github.com/RevolutionAnalytics/RHadoop/wiki

Revolution R Enterprise: bit.ly/Enterprise-R

Cloudera CDH: http://www.cloudera.com/hadoop/

Email: [email protected]

www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR

The leading commercial provider of software and support for the popular open source R statistics language.

Thank you.
