Revolution Analytics, November 9, 2011: Leveraging R in Hadoop Environments

The Powerful Marriage of Hadoop and R (David Champagne)

Jun 27, 2015

Technology

When two of the most powerful innovations in modern analytics come together, the result is revolutionary.

This presentation covers:
- An overview of R, the open source programming language developed specifically for statistical analysis and data visualization and used by more than 2 million people.
- The ways that R and Hadoop have been integrated.
- A use case that provides real-world experience.
- A look at how enterprises can take advantage of both of these industry-leading technologies.

Presented at Hadoop World 2011 by:

David Champagne
CTO, Revolution Analytics

David Champagne is a top software architect, programmer and product manager with over 20 years of experience in enterprise and web application development for business customers across a wide range of industries. As Principal Architect/Engineer for SPSS, Champagne led the development teams and created and led the text mining team.
Transcript
Page 1: The Powerful Marriage of Hadoop and R (David Champagne)

Revolution Analytics

November 9, 2011

Leveraging R in Hadoop Environments

Page 2: The Powerful Marriage of Hadoop and R (David Champagne)

In Today’s Presentation:

About Revolution Analytics

Why R and Hadoop?

The Packages (rhdfs, rhbase, rmr)

Examples

Page 3: The Powerful Marriage of Hadoop and R (David Champagne)

OPEN SOURCE ANALYTICS FOR THE ENTERPRISE

The professor who invented analytic software for the experts now wants to take it to the masses

Most advanced statistical analysis software available

Half the cost of commercial alternatives

2M+ Users

3,000+ Packages

Statistics

Predictive Analytics

Data Mining

Visualization

Finance

Life Sciences

Manufacturing

Retail

Telecom

Social Media

Government

Power

Productivity

Enterprise Readiness

Page 4: The Powerful Marriage of Hadoop and R (David Champagne)

What’s the Difference Between R and Revolution R Enterprise?

Revolution R is 100% R and More®

R Engine Language Libraries

3,000+ Community Packages

Technical Support

Web-Based GUI

Web Services API

Big Data Analysis

IDE / Developer GUI

Build Assurance

Parallel Tools

Multi-Threaded Math Libraries

For more information contact: [email protected]

Page 5: The Powerful Marriage of Hadoop and R (David Champagne)

Let’s Talk about R and Hadoop

Page 6: The Powerful Marriage of Hadoop and R (David Champagne)

Why R and Hadoop?

Hadoop - a scalable infrastructure for processing massive amounts of data

Storage – HDFS, HBASE

Distributed Computing - MapReduce

R - a statistical programming language

Need for more than counts and averages

Analyze all of the data

Page 7: The Powerful Marriage of Hadoop and R (David Champagne)

Motivation for this project

Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs

Run R on a massively distributed system without having to understand the underlying infrastructure

Statisticians stay focused on the analysis

Open source

Page 8: The Powerful Marriage of Hadoop and R (David Champagne)

R and Hadoop – The R Packages

R Client

R

Map or Reduce

Job Tracker

Task Node

HDFS

HBASE

Thrift

rhdfs - R and HDFS

rhbase - R and HBASE

rmr - R and MapReduce

Capabilities delivered as individual R packages

rmr

rhdfs rhbase

Downloads available from Github

Page 9: The Powerful Marriage of Hadoop and R (David Champagne)

rhdfs

Manipulate HDFS directly from R

Mimic as much of the HDFS Java API as possible

Examples:

Read an HDFS text file into a data frame

Serialize/Deserialize a model to HDFS

Write an HDFS file to local storage

rhdfs/pkg/inst/unitTests

rhdfs/pkg/inst/examples
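The "Serialize/Deserialize a model" example can be sketched without a cluster. The snippet below is only a local illustration: it uses base R and a temporary file as a stand-in for an HDFS path (an assumption of this sketch), and it does not call any rhdfs function. The round trip from model object to bytes and back is the part that would carry over.

```r
# Local stand-in for an HDFS path (assumption: no Hadoop cluster is available here).
path <- tempfile(fileext = ".bin")

# Fit a small model on a built-in data set.
model <- lm(mpg ~ wt, data = mtcars)

# Serialize the model to a raw vector and write the bytes out.
bytes <- serialize(model, connection = NULL)
writeBin(bytes, path)

# Read the bytes back and rebuild the model object.
restored <- unserialize(readBin(path, what = "raw", n = file.info(path)$size))

# Check that the restored model predicts the same values as the original.
new_data <- data.frame(wt = c(2.5, 3.0))
same <- isTRUE(all.equal(predict(model, new_data), predict(restored, new_data)))
```

With rhdfs the bytes would be written to and read from a file on the cluster instead of a local temporary file.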

Page 10: The Powerful Marriage of Hadoop and R (David Champagne)

rhdfs Functions

File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get

File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file

Directory - hdfs.dircreate, hdfs.mkdir

Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists

Initialization – hdfs.init, hdfs.defaults

Page 11: The Powerful Marriage of Hadoop and R (David Champagne)

rhbase

Manipulate HBASE tables and their content

Uses the Thrift C++ API as the mechanism to communicate with HBASE

Examples:

Create a data frame from a collection of rows and columns in an HBASE table

Update an HBASE table with values from a data frame

rhbase/pkg/inst/unitTests

Page 12: The Powerful Marriage of Hadoop and R (David Champagne)

rhbase Functions

Table Manipulation – hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table

Row Read/Write - hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan

Utility - hb.list.tables

Initialization - hb.defaults, hb.init

Page 13: The Powerful Marriage of Hadoop and R (David Champagne)

Writing MapReduce programs in R

Page 14: The Powerful Marriage of Hadoop and R (David Champagne)

rmr - For R Programmers

• A way to access big data sets

• A simple way to write parallel programs – everyone will have to

• Very R-like, building on the functional characteristics of R

• Just a library

Page 15: The Powerful Marriage of Hadoop and R (David Champagne)

rmr – For MapReduce Developers

• Much simpler than writing Java

• Not as simple as Hive, Pig at what they do, but more general

• Great for prototyping, can transition to production -- optimize instead of rewriting! Lower risk, always executable.

Page 16: The Powerful Marriage of Hadoop and R (David Champagne)

rmr mapreduce Function

mapreduce (input, output, map, reduce, …)

input – input folder

output – output folder

map – R function used as map

reduce – R function used as reduce

 

… - other advanced parameters

Page 17: The Powerful Marriage of Hadoop and R (David Champagne)

Some Simple Things

Example showing sampling and counting

map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)

reduce = function(k, vv) keyval(k, length(vv))

mapreduce(input, output, map, reduce)
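To see what the map and reduce functions above do without a cluster, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: `local_mapreduce` is a hypothetical stand-in for the real `mapreduce` call, `keyval` is modelled as a plain two-element list, and `hash` is a toy function, since the slide does not define one. It only mimics the map, group-by-key, reduce flow.

```r
# Key-value pairs modelled as two-element lists (an assumption of this sketch;
# rmr has its own keyval representation).
keyval <- function(k, v) list(key = k, val = v)

# Hypothetical in-memory driver: apply map to every record, group the emitted
# values by key, then call reduce once per key with the list of values.
local_mapreduce <- function(input, map, reduce) {
  mapped <- Filter(Negate(is.null),
                   lapply(input, function(rec) map(rec$key, rec$val)))
  keys <- vapply(mapped, function(kv) as.character(kv$key), character(1))
  idx <- split(seq_along(mapped), keys)
  lapply(names(idx), function(k) {
    reduce(k, lapply(mapped[idx[[k]]], function(kv) kv$val))
  })
}

# Toy hash (hypothetical): sum of the character codes of the key.
hash <- function(k) sum(utf8ToInt(k))

# Same shape as the slide: map samples keys, reduce counts records per key.
map <- function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
reduce <- function(k, vv) keyval(k, length(vv))

input <- list(keyval("d", 1), keyval("d", 2), keyval("n", 3), keyval("a", 4))
result <- local_mapreduce(input, map, reduce)
```

Keys whose hash is not a multiple of 10 are dropped in the map step because `if` without `else` returns `NULL`, which the driver filters out before grouping.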

Page 18: The Powerful Marriage of Hadoop and R (David Champagne)

More Simple Things

HIVE

INSERT OVERWRITE TABLE pv_gender_sum

SELECT pv_users.gender, count (DISTINCT pv_users.userid)

FROM pv_users 

GROUP BY pv_users.gender;

rmr

mapreduce(input =
  mapreduce(input = "pv_users",
      map = function(k, v) keyval(v['userid'], v['gender']),
      reduce = function(k, vv) keyval(k, vv[[1]])),
 output = "pv_gender_sum",
 map = function(k, v) keyval(v, 1),
 reduce = function(k, vv) keyval(k, sum(unlist(vv))))

Takeaways

A language like HIVE makes a class of problems easy to solve, but it is not a general tool

The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities
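The two chained jobs in the rmr version can be traced on a small data frame. This is a local base R sketch with made-up rows, not rmr code: the first step keeps one record per user (the inner job), and the second counts users per gender (the outer job).

```r
# Toy page-view log; column names follow the query on the slide, rows are invented.
pv_users <- data.frame(
  userid = c(1, 1, 2, 3, 3, 3),
  gender = c("F", "F", "M", "F", "F", "F"),
  stringsAsFactors = FALSE
)

# Stage 1 (inner job): one record per userid, keeping that user's gender.
one_per_user <- pv_users[!duplicated(pv_users$userid), c("userid", "gender")]

# Stage 2 (outer job): emit 1 per user keyed by gender, then sum per key.
ones_by_gender <- split(rep(1, nrow(one_per_user)), one_per_user$gender)
pv_gender_sum <- vapply(ones_by_gender, sum, numeric(1))
```

The result is a named numeric vector with one distinct-user count per gender, which is what the `count (DISTINCT ...)` with `GROUP BY` computes for this data.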

Page 19: The Powerful Marriage of Hadoop and R (David Champagne)

Complex Things

k-means Clustering

Page 20: The Powerful Marriage of Hadoop and R (David Champagne)
Page 21: The Powerful Marriage of Hadoop and R (David Champagne)
Page 22: The Powerful Marriage of Hadoop and R (David Champagne)

k-means - Implementation

Well known design (MacQueen, 1967)

Comparison of k-means implementations in MapReduce

Pig (from Hortonworks): requires coding in 3 languages (Python-Pig-Java), 100 lines of code

rmr: 20 lines of only R code

Page 23: The Powerful Marriage of Hadoop and R (David Champagne)

k-means - Highlights

map = function(k,v)

keyval(which.min(distances(centers,v)),v)

reduce = function(k,vv)

keyval(NULL, col.average(vv))

centers = from.dfs(

mapreduce("data-points", map, reduce))
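The slide leaves `distances` and `col.average` undefined, so the sketch below supplies plausible versions of both; they are assumptions, not the project's code. It runs the same assign-then-average step locally in base R on synthetic points, with a plain loop standing in for repeated `mapreduce` calls.

```r
set.seed(1)
# Two well-separated 2-D clusters of 50 points each (synthetic data).
points <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
                matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
# One starting center taken from each cluster.
centers <- points[c(1, 51), , drop = FALSE]

# Assumed helper: squared Euclidean distance from point v to every center.
distances <- function(centers, v) rowSums(sweep(centers, 2, v)^2)

# Assumed helper: column-wise mean of a list of points.
col.average <- function(vv) colMeans(do.call(rbind, vv))

for (iter in 1:5) {
  # "Map": key each point by the index of its nearest center.
  nearest <- apply(points, 1, function(v) which.min(distances(centers, v)))
  # "Reduce": average the points sharing a key to get that key's new center.
  groups <- split(seq_len(nrow(points)), nearest)
  centers <- do.call(rbind, lapply(groups, function(idx) {
    col.average(lapply(idx, function(i) points[i, ]))
  }))
}
```

Each pass of the loop corresponds to one map and reduce round: the map step only needs the current centers, and the reduce step only needs the points that share a key.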

Page 24: The Powerful Marriage of Hadoop and R (David Champagne)

k-means - Optimizations

Slow: for(i in 1:100) a[i] = b[i] + c[i]
Fast: a = b + c
Notes: light use of R interpreter, use fast vector primitives, C if necessary

Slow: [1, 2, 3, 4, 5]
Fast: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], ...]
Notes: use beefier records, say 1k points per record

Slow: distance(center, point)
Fast: norm(center - P)
Notes: compute all distances with fast matrix operations

Slow: combiner = FALSE
Fast: combiner = TRUE
Notes: reduce often and early, use combiner

Slow: keyval(k, mean(…))
Fast: keyval(k, c(total, count))
Notes: replace means with (sum, count) pairs to enable early reduction

https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means
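The last optimization, replacing means with (sum, count) pairs, is worth a small worked example. The base R sketch below uses invented values and a hypothetical `partial` helper. It shows why a mean cannot be reduced early when chunks differ in size, while (total, count) pairs can be merged in any order and still give the exact mean.

```r
# Values for one key, spread unevenly over two map tasks (invented data).
chunk_a <- c(1, 2, 3, 4)
chunk_b <- c(10)

# Averaging the per-chunk means is wrong when chunk sizes differ.
mean_of_means <- mean(c(mean(chunk_a), mean(chunk_b)))

# Combiner step (hypothetical helper): emit a (total, count) pair per chunk.
partial <- function(x) c(total = sum(x), count = length(x))

# Reduce step: pairs add element-wise, so partial results merge in any order.
merged <- partial(chunk_a) + partial(chunk_b)
combined_mean <- merged[["total"]] / merged[["count"]]

# Reference value computed over all the data at once.
true_mean <- mean(c(chunk_a, chunk_b))
```

Because addition of pairs is associative and commutative, the same function can serve as both combiner and reducer, with the division deferred to the very end.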

Page 25: The Powerful Marriage of Hadoop and R (David Champagne)

Final thoughts

R and Hadoop together offer innovation and flexibility needed to meet analytics challenges of big data

We need contributors to this project!

Developers

Documentation

Use cases

General Feedback

Page 26: The Powerful Marriage of Hadoop and R (David Champagne)

Resources

RHadoop Open source project: https://github.com/RevolutionAnalytics/RHadoop/wiki

Revolution R Enterprise: bit.ly/Enterprise-R

Cloudera CDH: http://www.cloudera.com/hadoop/

Email: [email protected]

Page 27: The Powerful Marriage of Hadoop and R (David Champagne)

www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR

The leading commercial provider of software and support for the popular open source R statistics language.

Thank you.