© Hortonworks Inc. 2012 Enabling R on Hadoop July 11, 2013 Page 1

Enabling R on Hadoop

Jan 27, 2015

Hadoop_Summit

Hadoop is a disruptive data processing framework that has had a large impact on today's data ecosystems. Enabling business users to carry their existing skills over to Hadoop encourages adoption and lets businesses get value from their Hadoop investment quickly. R, a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem. With the advent of technologies such as RHadoop, optimizing R workloads for use on Hadoop has become much easier. This session will help you understand how RHadoop projects such as RMR and RHDFS work with Hadoop, and will show you examples of using these technologies on the Hortonworks Data Platform.
Transcript
Page 1: Enabling R on Hadoop

Enabling R on Hadoop July 11, 2013

Page 2: Enabling R on Hadoop

Your Presenters

Ravi Mutyala, Systems Architect

Paul Codding, Solutions Engineer

Page 3: Enabling R on Hadoop

Agenda

• A Brief History of R
• How R is Typically Used
• How R is Used with Hadoop
• Getting Started

Page 4: Enabling R on Hadoop

A Brief History of R

Page 5: Enabling R on Hadoop

History of R

S
• 1976: S, implemented in Fortran by John Chambers
• 1988: S V3, written in C, statistical models included
• 1998: S V4

R
• 1991: R created by Ross Ihaka & Robert Gentleman
• 1997: R Core Group formed
• 2000: R Version 1.0 released

Page 6: Enabling R on Hadoop

How R is Typically Used

Page 7: Enabling R on Hadoop

Main Uses of R

• Statistical Analysis & Modeling
  – Classification
  – Scoring
  – Ranking
  – Clustering
  – Finding relationships
  – Characterization
• Common Uses
  – Interactive Data Analysis
  – General Purpose Statistics
  – Predictive Modeling

Page 8: Enabling R on Hadoop

How R is Used with Hadoop

Page 9: Enabling R on Hadoop

Hadoop Components

[Diagram: Hortonworks Data Platform (HDP) stack, deployable on OS, cloud, VM, or appliance]
• Hadoop Core (distributed storage & processing): HDFS, YARN (in 2.0), WebHDFS, MapReduce
• Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
• Operational Services (manage & operate at scale): Oozie, Ambari
• Platform Services (enterprise readiness): HA, DR, snapshots, security, …

Page 10: Enabling R on Hadoop

Hadoop Components & R

[Diagram: the HDP stack from the previous slide, with the components relevant to R called out]
• Data Service Components: Hive, HBase
• Hadoop Core: MapReduce, HDFS

Page 11: Enabling R on Hadoop

Options for R on Hadoop

• Options
  – RODBC/RJDBC
  – RHive
  – RHadoop
• Analysis
  – Focus
  – Integration Ease
  – Benefits
  – Limitations

Page 12: Enabling R on Hadoop

RODBC/RJDBC

• Focus
  – SQL access from R
• Integration Ease
  – Install Hortonworks Hive ODBC Driver
  – Install Hive libraries
• Benefits
  – Low impact on existing R scripts leveraging other DB packages
  – Not required to install Hadoop configuration/binaries on client machines
• Limitations
  – Parallelism limited to Hive
  – Result set size
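
As a rough illustration of the SQL-access approach, the sketch below queries Hive from R through RJDBC and DBI. It assumes a HiveServer2 endpoint and the Hive JDBC driver jars on the client machine. The function names `hive_jdbc_url` and `query_hive`, the default port, and the `jar_dir` parameter are made up for this example and do not come from the slides.

```r
# Build a JDBC URL for Hive (assumption: HiveServer2 with the
# jdbc:hive2://host:port/database scheme).
hive_jdbc_url <- function(host, port = 10000, database = "default") {
  sprintf("jdbc:hive2://%s:%d/%s", host, as.integer(port), database)
}

# Run one query and return the result set as an R data frame.
# jar_dir is a hypothetical directory that holds the Hive JDBC jars.
query_hive <- function(sql, host, jar_dir, user = "", password = "") {
  jars <- list.files(jar_dir, pattern = "\\.jar$", full.names = TRUE)
  drv <- RJDBC::JDBC(
    driverClass = "org.apache.hive.jdbc.HiveDriver",
    classPath = paste(jars, collapse = .Platform$path.sep))
  conn <- DBI::dbConnect(drv, hive_jdbc_url(host), user, password)
  on.exit(DBI::dbDisconnect(conn))
  # The whole result set is pulled into client memory here, which is
  # the "result set size" limitation in practice.
  DBI::dbGetQuery(conn, sql)
}
```

A call might look like `query_hive("SELECT ...", host = "my-hive-host", jar_dir = "/path/to/hive/jars")`, where the host and path are placeholders. Because the query is executed by Hive, any parallelism comes from Hive rather than from R.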

Page 13: Enabling R on Hadoop

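Deployment Considerations

[Diagram: cluster layout with worker nodes each running TaskTracker (TT) and DataNode (DN), plus master services JobTracker (JT), NameNode (NN), and HS (likely HiveServer)]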

Page 14: Enabling R on Hadoop

RHive

• Focus
  – Broad access to Hive and HDFS
• Integration Ease
  – Requires Hadoop binaries, libraries, and configuration files on client machines
  – Uses Java DFS Client and HiveServer
• Benefits
  – Wide range of features expressed through HQL
  – rhive-apply: R distributed apply function using HQL
• Limitations
  – Requires heavy client deployment
  – Dependent on HiveServer; can’t be used with HiveServer2

Page 15: Enabling R on Hadoop

Deployment Considerations

[Diagram: worker nodes each running TT + DN; master services JT, NN, and HS; R runs on an edge node]

Page 16: Enabling R on Hadoop

RHadoop

• Focus
  – Tight integration with core Hadoop components
• Benefit
  – Ability to run R on a massively distributed system
  – Ability to work with full data sets instead of sample sets
• Additional Information
  – https://github.com/RevolutionAnalytics/RHadoop/wiki

Page 17: Enabling R on Hadoop

RHadoop Architecture

[Diagram: R on the client uses rhdfs to reach HDFS, rhbase to reach HBase through the HBase Thrift Gateway, and rmr2 to run MapReduce jobs via Streaming, which runs R processes on the cluster]

Page 18: Enabling R on Hadoop

rhdfs

• Access HDFS from R
• Read from HDFS to R data frame
• Write from R data frame to HDFS
• 1.0.6 adds support for Windows (using HDP)

Page 19: Enabling R on Hadoop

rhdfs

• Hadoop CLI commands and their rhdfs equivalents
• hadoop fs -ls /
  – hdfs.ls("/")
• hadoop fs -mkdir /user/rhdfs/ppt
  – hdfs.mkdir("/user/rhdfs/ppt")
• hadoop fs -put 1.txt /user/rhdfs/ppt/
  – localData <- system.file(file.path("unitTestData", "1.txt"), package = "rhdfs")
  – hdfs.put(localData, "/user/rhdfs/ppt/1.txt")
• hadoop fs -get /user/rhdfs/ppt/1.txt 1.txt
  – hdfs.get("/user/rhdfs/ppt/1.txt", "test")
• hadoop fs -rm /user/rhdfs/ppt/1.txt
  – hdfs.delete("/user/rhdfs/ppt/1.txt")

Page 20: Enabling R on Hadoop

rhbase

• Access and change data within HBase
• Uses Thrift API
• Command Examples
  – hb.new.table
  – hb.insert
  – hb.scan.ex
  – hb.scan

Page 21: Enabling R on Hadoop

rmr2

• Enables writing MapReduce jobs using R
• Ability to parallelize algorithms
• Ability to use big data sets without needing to sample data
• mapreduce(input, output, map, reduce, …)
• Reduce takes a key and a collection of values, which could be a vector, list, data frame, or matrix
• 2.2.1 adds support for Windows (using HDP)
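
To make the map and reduce contract concrete, here is a small imitation of the flow in plain R. It is a sketch for intuition only: it does not call rmr2 or Hadoop, and `local_mapreduce` is a hypothetical helper, not part of any RHadoop package. It shows that reduce is called once per key and receives all values collected for that key.

```r
# Local imitation of the map -> group by key -> reduce flow.
# Plain R for illustration; nothing here runs on a cluster.
local_mapreduce <- function(input, map, reduce) {
  # map returns a list with parallel 'key' and 'val' vectors
  mapped <- map(input)
  # shuffle: collect all values that share a key
  grouped <- split(mapped$val, mapped$key)
  # reduce: called once per key with that key's collection of values
  reduced <- lapply(names(grouped), function(k) reduce(k, grouped[[k]]))
  do.call(rbind, reduced)
}

words <- c("hive", "pig", "hive", "hbase", "pig", "hive")
counts <- local_mapreduce(
  input = words,
  map = function(v) list(key = v, val = rep(1, length(v))),
  reduce = function(k, vv) {
    data.frame(key = k, count = sum(vv), stringsAsFactors = FALSE)
  })
print(counts)
```

In rmr2 the same roles are played by the `map` and `reduce` arguments of `mapreduce`, with `keyval` used to emit key/value pairs; the grouping step is what Hadoop performs between the two phases.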

Page 22: Enabling R on Hadoop

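Sample code - wordcount

library(rmr2)

# enclosing function as in the rmr2 tutorial; it supplies input, output, and pattern
wordcount =
  function(input, output = NULL, pattern = " ") {
    # map: split each line into words and emit (word, 1)
    wc.map =
      function(., lines) {
        keyval(
          unlist(
            strsplit(
              x = lines,
              split = pattern)),
          1)}
    # reduce: sum the counts collected for each word
    wc.reduce =
      function(word, counts) {
        keyval(word, sum(counts))}
    mapreduce(
      input = input,
      output = output,
      input.format = "text",
      map = wc.map,
      reduce = wc.reduce,
      combine = TRUE)}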

Page 23: Enabling R on Hadoop

More Sample Code

# plain R: count how often each value occurs
groups = rbinom(32, n = 50, prob = 0.4)
tapply(groups, groups, length)

# the same count expressed as a MapReduce job
groups = to.dfs(groups)
from.dfs(
  mapreduce(
    input = groups,
    map = function(., v) keyval(v, 1),
    reduce =
      function(k, vv)
        keyval(k, length(vv))))

Page 24: Enabling R on Hadoop

Deployment Considerations

[Diagram: worker nodes each running TT, DN, and RS with R installed; master services JT and NN; an HBase Thrift Gateway (HTG); R on an edge node]

Page 25: Enabling R on Hadoop

RHadoop

• Limitations
  – Requires installation of R on all TaskTracker nodes
  – Does not automatically parallelize algorithms
  – Different slot/memory configuration recommended to leave memory and CPU resources for R

[Diagram: a node's resources split between OS and MapReduce, compared with a node split between OS, MapReduce, and R]

Page 26: Enabling R on Hadoop

Getting Started

Page 27: Enabling R on Hadoop

Your Fastest On-ramp to Enterprise Hadoop™

http://hortonworks.com/products/hortonworks-sandbox/

The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop – no data center, no cloud and no internet connection needed! The Hortonworks Sandbox is:
• A free download: http://hortonworks.com/products/hortonworks-sandbox/
• A complete, self-contained virtual machine with Apache Hadoop pre-configured
• A personal, portable and standalone Hadoop environment
• A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop

Page 28: Enabling R on Hadoop

Installation

• Install R on all nodes
• Install dependent packages
  – RJSONIO
  – itertools
  – digest
  – Rcpp
  – rJava
  – functional
  – RCurl
  – httr
  – plyr
• Download & install RHadoop packages
  – rmr2
  – rhdfs
  – rhbase (requires Thrift)
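
The dependency step can be checked from within R before installing anything. The sketch below only reports which of the listed packages are missing on the current node. `missing_packages` is a hypothetical helper name, and the commented-out install calls use a placeholder tarball name rather than a real release file.

```r
# Packages listed on the slide as dependencies of the RHadoop packages.
rhadoop_deps <- c("RJSONIO", "itertools", "digest", "Rcpp", "rJava",
                  "functional", "RCurl", "httr", "plyr")

# Return the subset of pkgs that is not installed in any library path.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(rhadoop_deps)
print(to_install)

# Not run here: installing needs network access, and the RHadoop
# packages are installed from source tarballs downloaded beforehand
# (the file name below is a placeholder).
# install.packages(to_install)
# install.packages("rmr2.tar.gz", repos = NULL, type = "source")
```

Since R must be installed on every node, a check like this would need to run on each node, not only on the client. rhdfs and rmr2 also read environment variables such as `HADOOP_CMD` (and `HADOOP_STREAMING` for rmr2) to locate the Hadoop installation, so those should be set before loading the packages.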

Page 29: Enabling R on Hadoop

Questions & Answers

TRY: Download HDP at hortonworks.com
LEARN: Applying Data Science using Apache Hadoop Training
FOLLOW: twitter: @hortonworks, Facebook: facebook.com/hortonworks

Further questions & comments: [email protected]