Leveraging R in Hadoop Environments
Revolution Analytics, November 9, 2011
Jun 27, 2015
In Today’s Presentation:
About Revolution Analytics
Why R and Hadoop?
The Packages (rhdfs, rhbase, rmr)
Examples
OPEN SOURCE ANALYTICS FOR THE ENTERPRISE
The professor who invented analytic software for the experts now wants to take it to the masses
Most advanced statistical analysis software available
Half the cost of commercial alternatives
2M+ Users
3,000+ Packages
Statistics
Predictive Analytics
Data Mining
Visualization
Finance
Life Sciences
Manufacturing
Retail
Telecom
Social Media
Government
Power
Productivity
Enterprise Readiness
4
What’s the Difference Between R and Revolution R Enterprise?
Revolution R is 100% R and More®
R Engine Language Libraries
3,000+ Community Packages
Technical Support
Web-Based GUI
Web Services API
Big Data Analysis
IDE / Developer GUI
Build Assurance
Parallel Tools
Multi-Threaded Math Libraries
For more information contact: [email protected]
5
Let’s Talk about R and Hadoop
6
Why R and Hadoop?
Hadoop - a scalable infrastructure for processing massive amounts of data
Storage – HDFS, HBASE
Distributed Computing - MapReduce
R - a statistical programming language
Need for more than counts and averages
Analyze all of the data
7
Motivation for this project
Make it easy for the R programmer to interact with the Hadoop data stores and write MapReduce programs
Run R on a massively distributed system without having to understand the underlying infrastructure
Statisticians stay focused on the analysis
Open source
8
R and Hadoop – The R Packages
[Architecture diagram: R Client, R, Map or Reduce, Job Tracker, Task Node, HDFS, HBASE, Thrift]
rhdfs - R and HDFS
rhbase - R and HBASE
rmr - R and MapReduce
Capabilities delivered as individual R packages
Downloads available from GitHub
9
rhdfs
Manipulate HDFS directly from R
Mimic as much of the HDFS Java API as possible
Examples:
Read an HDFS text file into a data frame
Serialize/deserialize a model to HDFS
Write an HDFS file to local storage
rhdfs/pkg/inst/unitTests
rhdfs/pkg/inst/examples
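The "serialize/deserialize a model" example can be sketched in base R alone. This is only an illustration under stated assumptions: a local temporary file stands in for the HDFS file, and no rhdfs calls are used because their exact signatures are not shown in this deck. With a running cluster, the file read/write functions listed for rhdfs would presumably take the place of `writeBin` and `readBin`.

```r
# Fit a small model on a built-in dataset.
model <- lm(mpg ~ wt, data = mtcars)

# Serialize the model to a raw vector; these bytes are what would be stored.
bytes <- serialize(model, connection = NULL)

# ASSUMPTION: a local temporary file stands in for an HDFS file here.
path <- tempfile(fileext = ".bin")
writeBin(bytes, path)

# Read the bytes back and rebuild the model object.
restored <- unserialize(readBin(path, what = "raw", n = file.size(path)))

# The restored model can be used for prediction like the original.
new_data <- data.frame(wt = c(2.5, 3.5))
print(predict(restored, new_data))
```

The round trip through raw bytes is what makes the model portable: any store that can hold a byte stream can hold the fitted object.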
10
rhdfs Functions
File Manipulations - hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
File Read/Write - hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
Directory - hdfs.dircreate, hdfs.mkdir
Utility - hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
Initialization – hdfs.init, hdfs.defaults
11
rhbase
Manipulate HBASE tables and their content
Uses the Thrift C++ API to communicate with HBASE
Examples:
Create a data frame from a collection of rows and columns in an HBASE table
Update an HBASE table with values from a data frame
rhbase/pkg/inst/unitTests
12
rhbase Functions
Table Manipulation – hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
Row Read/Write - hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan
Utility - hb.list.tables
Initialization - hb.defaults, hb.init
13
Writing MapReduce programs in R
14
rmr - For R Programmers
• A way to access big data sets
• A simple way to write parallel programs – everyone will have to
• Very R-like, building on the functional characteristics of R
• Just a library
15
rmr – For MapReduce Developers
• Much simpler than writing Java
• Not as simple as Hive or Pig at what they do, but more general
• Great for prototyping, and can transition to production: optimize instead of rewriting! Lower risk, always executable.
16
rmr mapreduce Function
mapreduce (input, output, map, reduce, …)
input – input folder
output – output folder
map – R function used as map
reduce – R function used as reduce
… - other advanced parameters
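To make the roles of the map and reduce arguments concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases in base R. It does not use rmr or Hadoop: `local_mapreduce` and this `keyval` are hypothetical stand-ins written for illustration, and the real package may behave differently in its details.

```r
# HYPOTHETICAL local stand-ins; these are NOT the rmr implementations.
keyval <- function(k, v) list(key = k, val = v)

local_mapreduce <- function(input, map, reduce) {
  # Map phase: call map(k, v) on every input record; drop NULL results.
  mapped <- lapply(input, function(kv) map(kv$key, kv$val))
  mapped <- Filter(Negate(is.null), mapped)
  # Shuffle phase: group the emitted values by key.
  keys <- vapply(mapped, function(kv) as.character(kv$key), character(1))
  groups <- split(lapply(mapped, function(kv) kv$val), keys)
  # Reduce phase: call reduce(k, vv) once per distinct key.
  unname(Map(reduce, names(groups), groups))
}

# Usage: count how many times each word occurs.
words <- c("r", "hadoop", "r", "hdfs", "r")
input <- lapply(words, function(w) keyval(NULL, w))
result <- local_mapreduce(
  input,
  map    = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, sum(unlist(vv)))
)
counts <- setNames(
  vapply(result, function(kv) kv$val, numeric(1)),
  vapply(result, function(kv) kv$key, character(1))
)
print(counts)
```

The map function sees one record at a time, while the reduce function sees a key together with the list of all values emitted for it, which is why the reduce signature is written as `function(k, vv)`.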
Some Simple Things
Example showing sampling and counting
map = function(k, v) if (hash(k) %% 10 == 0) keyval(k, v)
reduce = function(k, vv) keyval(k, length(vv))
mapreduce(input, output, map, reduce)
18
More Simple Things
HIVE
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
rmr
mapreduce(input =
    mapreduce(input = "pv_users",
        map = function(k, v) keyval(v['userid'], v['gender']),
        reduce = function(k, vv) keyval(k, vv[[1]])),
    output = "pv_gender_sum",
    map = function(k, v) keyval(v, 1),
    reduce = function(k, vv) keyval(k, sum(unlist(vv))))
Takeaways
A language like HIVE makes a class of problems easy to solve, but it is not a general tool
The cost of doing the same operation in rmr is modest and it provides a broader set of capabilities
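The two-stage logic of the distinct count can be checked on a toy table in base R. This sketch mirrors the two stages (one gender per user, then a count per gender) without rmr or Hadoop; the column names follow the slide, while the rows are invented for illustration.

```r
# Toy page-view log; the rows are made up for this sketch.
pv_users <- data.frame(
  userid = c("u1", "u1", "u2", "u3", "u3", "u3"),
  gender = c("F",  "F",  "M",  "F",  "F",  "F"),
  stringsAsFactors = FALSE
)

# Stage 1 (inner job): group by userid and keep one gender per user.
gender_by_user <- vapply(
  split(pv_users$gender, pv_users$userid),
  function(g) g[[1]],
  character(1)
)

# Stage 2 (outer job): group by gender and sum a 1 for every distinct user.
pv_gender_sum <- vapply(
  split(rep(1, length(gender_by_user)), gender_by_user),
  sum,
  numeric(1)
)

print(pv_gender_sum)
```

The first stage is what makes the count distinct: repeated page views by the same user collapse to a single record before the second stage counts them.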
19
Complex Things
k-means Clustering
22
k-means - Implementation
Well-known design (MacQueen, 1967)
Comparison of k-means implementations in MapReduce:
Pig (from Hortonworks): requires coding in 3 languages (Python, Pig, Java), 100 lines of code
rmr: 20 lines of R code only
23
k-means - Highlights
map = function(k,v)
keyval(which.min(distances(centers,v)),v)
reduce = function(k,vv)
keyval(NULL, col.average(vv))
centers = from.dfs(
mapreduce("data-points", map, reduce))
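One iteration of the map and reduce steps above can be sketched locally in base R. The slide names `distances` and `col.average` without defining them, so the definitions below are assumptions, and `kmeans_step` is a hypothetical helper that replaces the distributed job with an in-memory loop.

```r
# ASSUMED definitions; the slide names these helpers but does not define them.
distances <- function(centers, v) sqrt(rowSums(sweep(centers, 2, v)^2))
col.average <- function(vv) colMeans(do.call(rbind, vv))

# HYPOTHETICAL local replacement for one mapreduce pass.
kmeans_step <- function(points, centers) {
  # "Map": key each point (a row) by the index of its nearest center.
  keys <- apply(points, 1, function(v) which.min(distances(centers, v)))
  # "Shuffle": collect the row indices that share a key.
  groups <- split(seq_len(nrow(points)), keys)
  # "Reduce": the new center of each group is the column-wise average.
  # Note: a center that attracts no points is dropped in this sketch.
  new_centers <- lapply(groups, function(idx) {
    col.average(lapply(idx, function(i) points[i, ]))
  })
  do.call(rbind, new_centers)
}

points <- rbind(c(0, 0), c(0, 2), c(10, 10), c(10, 12))
centers <- rbind(c(1, 1), c(9, 9))
for (i in 1:5) centers <- kmeans_step(points, centers)
print(centers)
```

Each pass feeds the centers from the previous pass back into the map step, which is the role `from.dfs` plays on the slide when it pulls the reduced centers back into the R session.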
24
k-means - Optimizations
Slow | Fast | Notes
for(i in 1:100) a[i] = b[i] + c[i] | a = b + c | light use of R interpreter; use fast vector primitives, C if necessary
[1, 2, 3, 4, 5] | [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], ...] | use beefier records, say 1k points per record
distance(center, point) | norm(center - P) | compute all distances with fast matrix operations
combiner = FALSE | combiner = TRUE | reduce often and early, use combiner
keyval(k, mean(…)) | keyval(k, c(total, count)) | replace means with (sum, count) pairs to enable early reduction
https://github.com/RevolutionAnalytics/RHadoop/wiki/Fast-k-means
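Why (sum, count) pairs allow early reduction while means do not can be shown with a few numbers in base R. This is a small sketch with invented data: two partitions of unequal size play the role of two combiner outputs.

```r
# Invented data, split into two partitions of unequal size.
x <- c(1, 2, 3, 4, 5, 6, 7, 100, 200, 300)
part1 <- x[1:7]
part2 <- x[8:10]

# Each "combiner" emits a (total, count) pair for its partition.
partials <- list(
  c(total = sum(part1), count = length(part1)),
  c(total = sum(part2), count = length(part2))
)

# Final reduce: pairs add up component-wise, so merging is exact.
combined <- Reduce(`+`, partials)
mean_from_pairs <- combined[["total"]] / combined[["count"]]

# Averaging the partial means instead weights both partitions equally.
mean_of_means <- mean(c(mean(part1), mean(part2)))

print(c(true = mean(x), pairs = mean_from_pairs, naive = mean_of_means))
```

Because addition of pairs is associative, the same reduce logic can run as a combiner on each node and again in the final reduce, with the division done once at the end.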
25
Final thoughts
R and Hadoop together offer the innovation and flexibility needed to meet the analytics challenges of big data
We need contributors to this project!
Developers
Documentation
Use cases
General Feedback
26
Resources
RHadoop Open source project: https://github.com/RevolutionAnalytics/RHadoop/wiki
Revolution R Enterprise: bit.ly/Enterprise-R
Cloudera CDH: http://www.cloudera.com/hadoop/
Email: [email protected]
27
www.revolutionanalytics.com 650.330.0553 Twitter: @RevolutionR
The leading commercial provider of software and support for the popular open source R statistics language.
Thank you.