Angel Trifonov Yun Lu Ying Wang RICARDO: INTEGRATING R AND HADOOP

Dec 13, 2015

Page 1

Angel Trifonov, Yun Lu, Ying Wang

RICARDO: INTEGRATING R AND HADOOP

Page 2

1. Introduction
2. Motivating Examples
3. Preliminaries
4. Ricardo Design
5. Experimental Study
6. Conclusion

CONTENTS

Page 3

INTRODUCTION

Page 4

Enterprise datasets

Why are these datasets important?

Statistical analysis on datasets

Data analyst workflow
- Explore and summarize the data
- Build a model
- Use the model to improve business practices
- Needs a statistical package

DATA COLLECTION

Page 5

R design
- Runs on a single server
- Operates in main memory
- Large data: FAIL!

Problem for analysts: they work with large datasets
- Vertical scalability
- Working on subsets of the data
- Neither is ideal!

Large-scale data management systems (DMS)
- Example: Hadoop
- Aggregation processing

R AND DMS

Page 6

Overview
- Scalable platform for deep analytics
- Part of the eXtreme Analytics Platform (XAP) project
- Named after economist David Ricardo
- Facilitates trading between R and Hadoop

Previous work on Map-Reduce

Small data – combined approach success

Several advantages

RICARDO

Page 7

Familiar working environment – work within a statistical environment

Data attraction – Hadoop’s flexible data store together with the Jaql query language

Integration of data processing into the analytical workflow – handle large data by preprocessing and reducing it

Reliability and community support – built from open-source projects

Improved user code – facilitates better code

Deep analytics – can handle many kinds of advanced statistical analyses

No re-inventing of wheels – combines existing statistical and DMS technology

RICARDO ADVANTAGES

Page 8

MOTIVATING EXAMPLES

Page 9

Analyst workflow: exploration

Graph shows movie perception over time

How does an analyst get this data visualization?

R is good for the job, BUT…

Ricardo can help!

EXAMPLE 1: SIMPLE TRADING

Page 10

Analyst workflow: evaluation – already have a model

Analysis must be on all the data

Ricardo can help once again

What did we see? Simple trading
- First case: results passed to R
- Second case: model passed to Hadoop

More complicated analyses? No problem!

EXAMPLE 2: SIMPLE TRADING

Page 11

Analyst workflow: modeling

How?

Simple-trading scheme is no good
- Loses information
- Ricardo permits complex trading

Data needs decomposition
- Small parts handled by R
- Large parts handled by Hadoop

Consider an example
- Latent-factor model
- Each piece of data must be taken into account
- Simple trading won’t work

EXAMPLE 3: COMPLEX TRADING

Page 12

LATENT-FACTOR MODEL

Page 13

PRELIMINARIES

Page 14

Developed at the University of Auckland, New Zealand

Open-source language and statistical environment

Small maintenance team, but big popularity

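Example of functionality:
# Fit a linear model of the mean rating against the year
fit <- lm(df$mean ~ df$year)
# Scatter plot of the data
plot(df$year, df$mean)
# Overlay the fitted regression line
abline(fit)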

Data frame: R's equivalent of a table
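To make explicit what the `lm` call computes, here is a minimal sketch of ordinary least squares for a single predictor. It is written in Python rather than R purely for illustration. The yearly means and the helper name `fit_line` are assumptions, not values or names from the slides.

```python
# Hypothetical yearly mean ratings (illustrative values, not from the slides).
years = [2000, 2001, 2002, 2003, 2004]
means = [3.0, 3.2, 3.4, 3.6, 3.8]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Covariance-like and variance-like sums (the common factor 1/n cancels).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(years, means)
print(f"mean rating changes by about {slope:.2f} per year")
```

The fitted intercept and slope play the role of the coefficients that R stores in the `fit` object.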

THE R PROJECT

Page 15

Enterprise data warehouses – dominant type of DMS
Designed for clean/structured data – not good
Analysts want their data dirty
What to do? Use Hadoop!

Hadoop method
- Hadoop Distributed File System (HDFS)
- Operates on raw data files
- Processes data according to MapReduce
- Map phase results are fed to the reducers

Used successfully on large-scale datasets
Appealing alternative
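The map-then-reduce flow described above can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop's API. The records and the function names `map_phase`, `shuffle` and `reduce_phase` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw rating records (illustrative data).
records = [
    {"customer": "Michael", "movie": "About Schmidt", "rating": 5},
    {"customer": "Bob", "movie": "About Schmidt", "rating": 3},
    {"customer": "Alice", "movie": "Sideways", "rating": 2},
]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    yield record["movie"], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

On a real cluster the map and reduce calls run in parallel on different nodes, and the grouping step is done by the framework.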

LARGE-SCALE DMS

Page 16

Hadoop drawback – programming interface
Attempts to help with this
Ricardo uses Jaql
- Open-source dataflow language
- Jaql scripts are automatically compiled into MapReduce jobs
- Operates directly on data files

JSON view:
[
  {
    customer: "Michael",
    movie: { name: "About Schmidt", year: 2002 },
    rating: 5
  },
  ...
]

Jaql query:
read("ratings")
-> group by year = $.movie.year
   into { year, mean: avg($[*].rating) }
-> sort by [ $.year ]
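For readers unfamiliar with Jaql, the query above (group the ratings by movie year, average each group, sort by year) can be mirrored in plain Python. This is a sketch of the query's semantics only, not of how Jaql runs it. Only the first record comes from the slide. The other records and the function name `mean_rating_by_year` are assumptions.

```python
from collections import defaultdict

ratings = [
    {"customer": "Michael", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 5},
    # The records below are made up for illustration.
    {"customer": "Bob", "movie": {"name": "About Schmidt", "year": 2002}, "rating": 3},
    {"customer": "Alice", "movie": {"name": "Sideways", "year": 2004}, "rating": 2},
    {"customer": "Bob", "movie": {"name": "Lost in Translation", "year": 2003}, "rating": 2},
]


def mean_rating_by_year(records):
    """Group by movie year, average the ratings, and sort by year."""
    by_year = defaultdict(list)
    for record in records:
        by_year[record["movie"]["year"]].append(record["rating"])
    rows = [{"year": year, "mean": sum(rs) / len(rs)} for year, rs in by_year.items()]
    return sorted(rows, key=lambda row: row["year"])


result = mean_rating_by_year(ratings)
print(result)
```

The result is a small table with one row per year, which is the kind of aggregate that is cheap to hand over to R.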

JAQL: A JSON QUERY LANGUAGE

Page 17

RICARDO DESIGN

Page 18

PROBLEM STATEMENT

How to bridge between them?

R
Advantages:
- Statistical software
- Data analysis
Disadvantages:
- Operates in main memory
- Limited data

Hadoop
Advantage:
- Large-scale processing
Disadvantage:
- Insufficient analytical functionality

Page 19

RICARDO DESIGN

Page 20

R driver:
- Not memory-resident
- Does R need memory to store some data?

Hadoop:
- Performs operations on the data
- Stores data in HDFS

R-Jaql bridge:
- Connects the R driver and the Hadoop cluster
- Executes the query (what kind of query?)
- Sends the result back to R as data frames
- Allows Jaql queries to spawn R processes on Hadoop worker nodes

RICARDO DESIGN

Page 21

Components: an R package (integrates Jaql into R) and a Jaql module (integrates R into Jaql)

R-JAQL BRIDGE

[Diagram: data flows from R to Hadoop and from Hadoop to R]

Page 22

Analyst’s typical workflow
- Data exploration: preliminary observation, simple trading
- Model building: deep analytics, complex trading
- Model evaluation: quality of models, simple trading

RICARDO WORKFLOW

Why is model building complex trading?

Page 23

Movie recommendation

REVIEW EXAMPLE

Data exploration: simple trading (linear regression)
Model building: complex trading (latent-factor model)

Page 24

SIMPLE TRADING – LINEAR REGRESSION

Get data from Hadoop

Fit data

Page 25

SIMPLE TRADING – EVALUATE MODEL

Fit data

Select top 10 outliers
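The evaluation step (score every rating against an already fitted model, then keep the worst-predicted ones) might look roughly like the sketch below. It is plain Python, not Ricardo or Jaql code. The model parameters, the records and the function name `top_outliers` are assumptions chosen for illustration.

```python
import heapq

# Hypothetical fitted linear model: predicted rating = intercept + slope * year.
intercept, slope = -997.0, 0.5

# Hypothetical rating records.
records = [
    {"movie": "About Schmidt", "year": 2002, "rating": 5},
    {"movie": "About Schmidt", "year": 2002, "rating": 3},
    {"movie": "Lost in Translation", "year": 2003, "rating": 2},
    {"movie": "Sideways", "year": 2004, "rating": 2},
    {"movie": "Sideways", "year": 2004, "rating": 5},
]


def top_outliers(records, intercept, slope, k=10):
    """Return the k records with the largest squared prediction error."""
    def squared_error(record):
        predicted = intercept + slope * record["year"]
        return (record["rating"] - predicted) ** 2

    return heapq.nlargest(k, records, key=squared_error)


worst = top_outliers(records, intercept, slope, k=2)
print([(r["movie"], r["rating"]) for r in worst])
```

Only the small model has to travel to where the data lives. The per-record error computation touches every rating, which is why this step suits the large-scale side.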

Page 26

COMPLEX TRADING

Model Building Objectives

Page 27

MODEL BUILDING

Randomly pick initial p and q
Set up the optimization method

Compute
• Squared error (e)
• The derivative of e with respect to p
• The derivative of e with respect to q

Update p and q
Repeat until convergence
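The quantities named in these steps can be written out for the simplest case of one latent factor per customer and per movie. The notation is an assumption made for illustration: $r_{ij}$ is the observed rating of customer $i$ for movie $j$, $p_i$ and $q_j$ are the latent factors, the sums run over observed ratings only, and $\eta$ is a hypothetical step size for a plain gradient step. The slides say only that an optimization method is set up.

```latex
\begin{align*}
e &= \sum_{(i,j)} \left( r_{ij} - p_i q_j \right)^2 \\
\frac{\partial e}{\partial p_i} &= -2 \sum_{j} \left( r_{ij} - p_i q_j \right) q_j \\
\frac{\partial e}{\partial q_j} &= -2 \sum_{i} \left( r_{ij} - p_i q_j \right) p_i \\
p_i &\leftarrow p_i - \eta \, \frac{\partial e}{\partial p_i}, \qquad
q_j \leftarrow q_j - \eta \, \frac{\partial e}{\partial q_j}
\end{align*}
```

Every term in these sums depends on an individual rating, so the error and the derivatives cannot be computed from a small aggregate of the data.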

Page 28

Table r: stores ratings
Tables p and q: store latent factors

MODEL BUILDING

Table r
User      Item
Alice     About Schmidt
Bob       Lost in Translation
Michael   Sideways

Table q
Item                   Factor
About Schmidt          2.24
Lost in Translation    1.92
Sideways               1.18

Table p
User      Factor
Alice     1.98
Bob       1.21
Michael   2.30

Page 29

DETAILS

Compute the gradient

Compute the sum of squared errors
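The two computations named here, the sum of squared errors and its gradient, can be combined into a small single-machine training loop. This is a minimal sketch under stated assumptions, not the Ricardo implementation. It uses plain gradient descent with a fixed step size, starts from fixed values rather than random ones so that runs are reproducible, and the ratings are illustrative.

```python
# Illustrative ratings, (customer, movie) -> rating; these values are assumptions.
ratings = {
    ("Alice", "Lost in Translation"): 4.0,
    ("Alice", "Sideways"): 2.0,
    ("Bob", "About Schmidt"): 3.0,
    ("Bob", "Lost in Translation"): 2.0,
    ("Michael", "About Schmidt"): 5.0,
    ("Michael", "Sideways"): 3.0,
}


def squared_error(ratings, p, q):
    """Sum of squared errors e over all observed ratings."""
    return sum((r - p[i] * q[j]) ** 2 for (i, j), r in ratings.items())


def gradients(ratings, p, q):
    """Partial derivatives of e with respect to every p[i] and q[j]."""
    grad_p = {i: 0.0 for i in p}
    grad_q = {j: 0.0 for j in q}
    for (i, j), r in ratings.items():
        residual = r - p[i] * q[j]
        grad_p[i] += -2.0 * residual * q[j]
        grad_q[j] += -2.0 * residual * p[i]
    return grad_p, grad_q


def fit(ratings, steps=1000, rate=0.01):
    """Plain gradient descent on one latent factor per customer and movie."""
    p = {i: 1.0 for (i, _j) in ratings}
    q = {j: 1.0 for (_i, j) in ratings}
    for _ in range(steps):
        grad_p, grad_q = gradients(ratings, p, q)  # both use the old p and q
        for i in p:
            p[i] -= rate * grad_p[i]
        for j in q:
            q[j] -= rate * grad_q[j]
    return p, q


start_p = {i: 1.0 for (i, _j) in ratings}
start_q = {j: 1.0 for (_i, j) in ratings}
initial_error = squared_error(ratings, start_p, start_q)
p, q = fit(ratings)
final_error = squared_error(ratings, p, q)
print(initial_error, final_error)
```

In the division of labor the slides describe, the loops over all ratings inside `squared_error` and `gradients` are the large parts suited to Hadoop, while the update of the comparatively small p and q is the part an optimizer in R would handle.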

Page 30

Principal component analysis (PCA)
- Compute eigenvectors and eigenvalues
- The eigenvectors are mutually perpendicular

Generalized linear model (GLM)
- Compute the response variable
- Expressed as a nonlinear function of a linear combination of the predictors

……

OTHER MODELS

Page 31

Java Native Interface (JNI) as the bridge between C and Java
- How is data transferred across JNI?
- Naïve way
- Better solution

Jaql wrapper handles data-representation incompatibilities
- This is in the bridge

What are the components of the R-Jaql bridge right now?

IMPLEMENTATIONS

Page 32

EXPERIMENTAL STUDY

Page 33

EXPERIMENTAL STUDY

Page 34

EXPERIMENTAL STUDY

Page 35

EXPERIMENTAL STUDY

Page 36

Scaling Out R
- Low-level message passing
- Task- and data-parallel computing systems
- Automatic parallelization of high-level

Deepening a DMS

RELATED WORK

Page 37

Ricardo combines the data management capabilities of Hadoop and Jaql with the statistical functionality provided by R.

Future work

Identifying and integrating additional statistical analyses that are amenable to the Ricardo approach.

CONCLUSION

Page 38

S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: integrating R and Hadoop. In SIGMOD, pages 987-998, 2010.

REFERENCES