RICARDO: INTEGRATING R AND HADOOP
Angel Trifonov, Yun Lu, Ying Wang
1. Introduction
2. Motivating Examples
3. Preliminaries
4. Ricardo Design
5. Experimental Study
6. Conclusion
CONTENTS
Enterprise datasets
Why are these datasets important?
Statistical analysis on datasets
Data analyst workflow
- Explore/summarize data
- Build a model
- Use the model to improve business practices
Need a statistical package
DATA COLLECTION
R design
- Single server
- Main memory
- Large data: FAIL!
Problem for analysts – they work with large datasets
- Vertical scalability
- Subsets
- Neither is ideal!
Large-scale data management systems (DMS)
- Example: Hadoop
- Aggregation processing
R AND DMS
Overview
- Scalable platform for deep analytics
- Part of eXtreme Analytics Platform (XAP) project
- Named after economist David Ricardo
- Facilitates trading between R and Hadoop
Previous work on Map-Reduce
Small data – combined approach success
Several advantages
RICARDO
Familiar working environment – work within a statistical environment
Data attraction – Hadoop’s flexible data store together with the Jaql query language
Integration of data processing into the analytical workflow – handle large data by preprocessing and reducing it
Reliability and community support – built from open-source projects
Improved user code – facilitates better code
Deep analytics – can handle many kinds of advanced statistical analyses
No re-inventing of wheels – combine existing statistical and DMS technology
RICARDO ADVANTAGES
Analyst workflow: exploration
Graph shows movie perception over time
How does an analyst get this data visualization?
R is good for the job, BUT…
Ricardo can help!
EXAMPLE 1: SIMPLE TRADING
Analyst workflow: evaluation – already have a model
Analysis must be on all the data
Ricardo can help once again
What did we see? Simple trading
- First case: pass to R
- Second case: pass to Hadoop
More complicated analyses? No problem!
EXAMPLE 2: SIMPLE TRADING
Analyst workflow: modeling
How?
Simple-trading scheme no good
- Losing information
- Ricardo permits complex trading
Data needs decomposition
- Small parts handled by R
- Large parts handled by Hadoop
Consider an example
- Latent-factor model
- Each piece of data must be taken into account
- Simple trading won’t work
EXAMPLE 3: COMPLEX TRADING
Developed at the University of Auckland, New Zealand
Open-source language and statistical environment
Small maintenance team, but big popularity
Example of functionality:
fit <- lm(df$mean ~ df$year)
plot(df$year, df$mean)
abline(fit)
Data frame equivalent
THE R PROJECT
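To make the `lm` call on the slide more concrete, here is a rough sketch of the closed-form least-squares fit that a simple regression of mean rating on year computes. It is written in Python rather than R, the function name `fit_line` is hypothetical, and the data values are illustrative only, not taken from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical per-year mean ratings (illustrative values only).
years = [2000, 2001, 2002, 2003]
means = [3.0, 3.2, 3.4, 3.6]
intercept, slope = fit_line(years, means)
```

The pair `(intercept, slope)` plays the role of the coefficients that `abline(fit)` would draw as a line over the scatter plot.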
Enterprise data warehouses – dominant type of DMS
Designed for clean/structured data – not good
Analysts want their data dirty
What to do? Use Hadoop!
Hadoop method
- Hadoop Distributed File System
- Operates on raw data files
- Process according to MapReduce
- Map phase results fed to reducer
Used successfully on large-scale datasets
Appealing alternative
LARGE-SCALE DMS
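The map-then-reduce flow mentioned on this slide can be sketched in a few lines of single-process Python. This only illustrates the programming model, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and all records except the first are made up for the example.

```python
from collections import defaultdict


def map_phase(record):
    # Emit one (key, value) pair per raw record: the movie name and a count of 1.
    yield record["movie"], 1


def shuffle(pairs):
    # Group all emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Aggregate the values that share a key; here, count ratings per movie.
    return key, sum(values)


# Illustrative records; only the first mirrors the example used in the talk.
records = [
    {"customer": "Michael", "movie": "About Schmidt", "rating": 5},
    {"customer": "Alice", "movie": "About Schmidt", "rating": 4},
    {"customer": "Bob", "movie": "Sideways", "rating": 3},
]

mapped = [pair for rec in records for pair in map_phase(rec)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map calls run in parallel over file splits and the grouped values are sent to reducers over the network; the sketch keeps everything in one process to show the data flow.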
Hadoop drawback – programming interface
Attempts to help this
Ricardo uses Jaql
- Open-source dataflow language
- Jaql scripts automatically compiled
- Operates directly on data files
JSON view:
[
  { customer: "Michael",
    movie: { name: "About Schmidt", year: 2002 },
    rating: 5 },
  ...
]
Jaql query:
read("ratings")
-> group by year = $.movie.year
   into { year, mean: avg($[*].rating) }
-> sort by [ $.year ]
JAQL: A JSON QUERY LANGUAGE
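For readers unfamiliar with Jaql, the query above groups ratings by release year, averages the ratings in each group, and sorts by year. A minimal Python sketch of the same computation follows. The function name `mean_rating_by_year` is hypothetical, and every record other than the first is invented for illustration.

```python
from collections import defaultdict

# The first record mirrors the JSON view on the slide; the rest are hypothetical.
ratings = [
    {"customer": "Michael", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 5},
    {"customer": "Bob", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 3},
    {"customer": "Alice", "movie": {"name": "Lost in Translation", "year": 2003}, "rating": 3},
]


def mean_rating_by_year(records):
    # group by year = $.movie.year
    by_year = defaultdict(list)
    for rec in records:
        by_year[rec["movie"]["year"]].append(rec["rating"])
    # into { year, mean: avg(...) } -> sort by [ $.year ]
    return [
        {"year": year, "mean": sum(values) / len(values)}
        for year, values in sorted(by_year.items())
    ]


result = mean_rating_by_year(ratings)
```

The result is a small table with one row per year, which is the kind of aggregate that fits comfortably in a data frame.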
PROBLEM STATEMENT
How to bridge between them?
R
Advantages:
- Statistical software
- Data analysis
Disadvantages:
- Operates in main memory
- Limited data
Hadoop
Advantage:
- Large-scale processing
Disadvantage:
- Insufficient analytical functionality
R driver: Not memory-resident
Does R need memory to store some data?
Hadoop:
- Performance operations
- Store data in HDFS
R-Jaql bridge: connects the R driver and the Hadoop cluster
- Execute query (what kind of query?)
- Send the result back to R as data frames
- Allow Jaql queries to spawn R processes on Hadoop worker nodes
RICARDO DESIGN
Components: an R package (Jaql R) and a Jaql module (R Jaql)
R-JAQL BRIDGE
[Diagram: trading between R and Hadoop, in both directions]
Analyst’s typical workflow
- Data exploration: preliminary observation; simple trading
- Model building: deep analytics; complex trading
- Model evaluation: quality of models; simple trading
RICARDO WORKFLOW
Why is model building complex trading?
Movie recommendation
REVIEW EXAMPLE
Data exploration – simple trading: linear regression
Model building – complex trading: latent-factor model
MODEL BUILDING
Pick random initial p and q
Set up the optimization method
Compute
• Squared error (e)
• The derivative of e with respect to p
• The derivative of e with respect to q
Update p and q
Repeat until convergence
Table r: stores ratings
Tables p and q: store latent factors
MODEL BUILDING
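The quantities listed on this slide can be written out as follows. This is a sketch under stated assumptions, not necessarily the exact formulation used in the talk: each user u and item i has a single scalar factor (p_u and q_i), the predicted rating is their product, the sum runs over the observed ratings R, and no regularization term is included.

```latex
e = \sum_{(u,i) \in R} \left( r_{ui} - p_u q_i \right)^2

\frac{\partial e}{\partial p_u} = -2 \sum_{i \,:\, (u,i) \in R} \left( r_{ui} - p_u q_i \right) q_i

\frac{\partial e}{\partial q_i} = -2 \sum_{u \,:\, (u,i) \in R} \left( r_{ui} - p_u q_i \right) p_u
```

Each derivative is a sum over every rating that involves the given user or item, which is why the whole ratings table has to be scanned at each iteration.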
Table r
User      Item
Alice     About Schmidt
Bob       Lost in Translation
Michael   Sideways

Table q
Item                  Factor
About Schmidt         2.24
Lost in Translation   1.92
Sideways              1.18

Table p
User      Factor
Alice     1.98
Bob       1.21
Michael   2.30
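The model-building loop (pick random p and q, compute the squared error and its derivatives, update, repeat until convergence) might look like the following single-machine Python sketch. It assumes scalar factors and plain gradient descent with a fixed step size; the optimization method in the talk may differ. The function names, the step size, and all rating values are hypothetical.

```python
import random


def squared_error(ratings, p, q):
    # e = sum over observed ratings of (r_ui - p_u * q_i)^2
    return sum((r - p[u] * q[i]) ** 2 for u, i, r in ratings)


def gradient_step(ratings, p, q, step):
    # Accumulate de/dp_u and de/dq_i over all ratings, then move against them.
    grad_p = {u: 0.0 for u in p}
    grad_q = {i: 0.0 for i in q}
    for u, i, r in ratings:
        residual = r - p[u] * q[i]
        grad_p[u] += -2.0 * residual * q[i]
        grad_q[i] += -2.0 * residual * p[u]
    new_p = {u: p[u] - step * grad_p[u] for u in p}
    new_q = {i: q[i] - step * grad_q[i] for i in q}
    return new_p, new_q


def fit(ratings, step=0.01, max_iters=5000, tol=1e-10, seed=0):
    rng = random.Random(seed)
    # Random starting factors for every user and item that appears in the data.
    p = {u: rng.uniform(0.5, 1.5) for u in sorted({u for u, _, _ in ratings})}
    q = {i: rng.uniform(0.5, 1.5) for i in sorted({i for _, i, _ in ratings})}
    error = squared_error(ratings, p, q)
    for _ in range(max_iters):
        p, q = gradient_step(ratings, p, q, step)
        new_error = squared_error(ratings, p, q)
        if abs(error - new_error) < tol:  # stop once the error stops changing
            break
        error = new_error
    return p, q, squared_error(ratings, p, q)


# Hypothetical (user, item, rating) triples; the values are illustrative only.
ratings = [
    ("Alice", "About Schmidt", 4.0), ("Alice", "Sideways", 2.0),
    ("Bob", "Lost in Translation", 2.5), ("Bob", "About Schmidt", 3.0),
    ("Michael", "Sideways", 3.0), ("Michael", "Lost in Translation", 4.5),
]
p, q, final_error = fit(ratings)
```

In the division of labor the slides describe, the per-rating sums inside `gradient_step` are the large part that would run on Hadoop, while the small update of p and q is the part an optimizer in R would handle.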
Principal component analysis (PCA)
- Compute eigenvectors and eigenvalues
- Eigenvectors are mutually perpendicular
Generalized linear model (GLM)
- Compute response variable
- Expressed as a nonlinear function
……
OTHER MODELS
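As a small illustration of the PCA bullet, the sketch below computes the covariance matrix of two-dimensional data and its eigenvalues and eigenvectors in closed form, then leaves the perpendicularity of the eigenvectors easy to check. It is a toy single-machine example in Python with hypothetical function names and made-up data points; it says nothing about how Ricardo would distribute the computation.

```python
import math


def covariance_2d(points):
    # Sample covariance entries (sxx, sxy, syy) of a list of (x, y) pairs.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy


def eigen_symmetric_2x2(a, b, d):
    # Eigenpairs of the symmetric matrix [[a, b], [b, d]], largest eigenvalue first.
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    theta = 0.5 * math.atan2(2.0 * b, a - d)  # angle of the principal axis
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))  # v1 rotated by 90 degrees
    return (mean + radius, v1), (mean - radius, v2)


# Hypothetical data points (illustrative values only).
points = [(2.0, 1.9), (0.5, 0.7), (1.5, 1.6), (1.0, 1.1), (2.5, 2.4)]
sxx, sxy, syy = covariance_2d(points)
(l1, v1), (l2, v2) = eigen_symmetric_2x2(sxx, sxy, syy)
dot = v1[0] * v2[0] + v1[1] * v2[1]  # zero when the eigenvectors are perpendicular
```

Only the covariance entries require a pass over the data; the eigen-decomposition itself works on a matrix whose size depends on the number of variables, not the number of records.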
Java Native Interface (JNI) as the bridge between C and Java
How to transfer the data across JNI?
- Naïve way
- Better solution
Jaql wrapper handles data-representation incompatibilities
This is in the bridge
What are the components in the R-Jaql bridge right now?
IMPLEMENTATIONS
Scaling out R
- Low-level message passing
- Task- and data-parallel computing systems
- Automatic parallelization of high-level
Deepening a DMS
RELATED WORK
Ricardo combines the data management capabilities of Hadoop and Jaql with the statistical functionality provided by R.
Future work
Identifying and integrating additional statistical analyses that are amenable to the Ricardo approach.
CONCLUSION