Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013

Workshop – Hadoop + R

Carlos Gil Bellosta

http://www.bigdataspain.org/2013/conference/workshop-hadoop-r

http://www.bigdataspain.org/2013/speakers#carlos-gil-bellosta

Big Data Analytics

Carlos J. Gil Bellosta

Intro to Hadoop & R
    All about Hadoop
    Hadoop FS
    Hadoop & mapreduce
    All about R
Counting (& Graphics)
    Graphics & big data
    Let’s count... hexagons
Details of mapreduce
Scoring, sampling & simulating
Data modelling
    Linear Regression
    Logistic Regression
    Trees & Random Forests
Final remarks

Big Data Analytics

R & Hadoop

Carlos J. Gil Bellosta

[email protected]

November 2013

Table of Contents

1 Intro to Hadoop & R
    All about Hadoop
    Hadoop FS
    Hadoop & mapreduce
    All about R

2 Counting (& Graphics)

3 Details of mapreduce

4 Scoring, sampling & simulating

5 Data modelling

6 Final remarks

File system: manages everything about files

• Examples: diskettes, hard disks, RAIDs, ... magnetic tapes!

• Combination of hardware and software that hides boring activities from users:
    • Find space to write the files
    • Read/write files
    • Manage fragmentation
    • Etc.

• How many devices per FS?
    • 1-to-1: diskettes, CD-ROMs, HDDs, ...
    • n-to-1: partitioned HDDs, ...
    • 1-to-n: RAIDs, Hadoop

Hadoop goodies (as a FS)

• Splits (large) files into chunks across machines

• Replicates chunks (3 copies by default)

• Balances data

• Robust to hardware failures

• It is rack aware

Obviously, it requires some system to keep track of:

• Which servers/racks are up/down

• Where each chunk is located

• ...

How to work with data in Hadoop?

• Provides a shell (ls, cp, etc.)

• You can put/get data between your local FS and Hadoop FS

• That is:
    • You can dump your data to your local machine
    • You can run your programs on your local machine
    • You can put results back into Hadoop

• But what if the file is too large?

Solution

Rather than bringing the data to the code, why not move the code to the data?

One of the ways to move code to data is known as mapreduce.

Mapreduce

• Two-step process:
    • Map: run your code on chunks all over the cluster
    • Reduce: reshape the output into the desired format

• Hadoop manages issues such as:
    • System failures
    • Threads that do not return
    • And everything (?) that made life miserable for OpenMP, MPI, etc. users

• Slotted approach: mapreduce provides slots where you put the mapper/reducer code

• The code is for you to provide!

What is R?

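• R is a
    • software package?
    • programming language?
    • environment?
  for data analysis and graphics.

• R users are (should be?) used to the mapreduce approach:

    library(plyr)  # provides ddply() and summarize()

    # Mean and standard deviation of age for each (group, sex) combination
    ddply(dfx, .(group, sex), summarize,
          mean = mean(age),
          sd = sd(age))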
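The split/apply/combine pattern behind that call can be spelled out in base R, which makes the analogy with mapreduce easier to see. This is only a sketch: the slides do not show `dfx`, so the data below are simulated, and `summarise_piece` is a hypothetical helper name.

```r
# Hypothetical data: the slides do not show dfx, so we simulate it.
set.seed(1)
dfx <- data.frame(
  group = sample(c("A", "B", "C"), 200, replace = TRUE),
  sex   = sample(c("F", "M"), 200, replace = TRUE),
  age   = runif(200, min = 18, max = 65)
)

# Split: one piece of the data per (group, sex) key.
pieces <- split(dfx, list(dfx$group, dfx$sex), drop = TRUE)

# Apply: summarise each piece independently of all the others.
summarise_piece <- function(piece) {
  data.frame(group = piece$group[1], sex = piece$sex[1],
             mean = mean(piece$age), sd = sd(piece$age))
}

# Combine: stack the small per-key results into one data frame.
result <- do.call(rbind, lapply(pieces, summarise_piece))
rownames(result) <- NULL
print(result)
```

Because every piece is summarised on its own, the apply step could run on different machines, which is the point of the comparison.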

2 Counting (& Graphics)
    Graphics & big data
    Let’s count... hexagons

Visualizing a million

Fluctuation plot

Table plot

• Non-trivial counting exercise (no, we are not counting words today!)

• Good visualization features for big datasets

• Fits in the mapreduce framework:
    • Map: assigns points to hexagons
    • Reduce: aggregates counts on hexagons
    • The output is small and can be plotted locally
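The map and reduce steps listed above can be sketched locally in base R. This is an illustration under stated assumptions, not the workshop's code: the points are simulated, the chunks stand in for HDFS blocks, and `assign_hex`, `map_hex` and `reduce_hex` are hypothetical names. The binning relies on the fact that the centres of a hexagonal grid are the union of two rectangular lattices, so the hexagon of a point is simply its nearest centre.

```r
# Assign each point to its nearest hexagon centre. Indices (i, j) count
# half-steps, so lattice A has even indices and lattice B odd ones.
assign_hex <- function(x, y, size = 1) {
  sx <- size             # horizontal distance between neighbouring centres
  sy <- size * sqrt(3)   # vertical period of each rectangular lattice
  ai <- 2 * round(x / sx);     aj <- 2 * round(y / sy)       # lattice A
  bi <- 2 * floor(x / sx) + 1; bj <- 2 * floor(y / sy) + 1   # lattice B
  da <- (x - ai * sx / 2)^2 + (y - aj * sy / 2)^2
  db <- (x - bi * sx / 2)^2 + (y - bj * sy / 2)^2
  use_a <- da <= db
  data.frame(i = ifelse(use_a, ai, bi), j = ifelse(use_a, aj, bj))
}

# Map: count the points of one chunk per hexagon.
map_hex <- function(chunk, size = 1) {
  h <- assign_hex(chunk$x, chunk$y, size)
  aggregate(n ~ i + j, data = cbind(h, n = 1), FUN = sum)
}

# Reduce: add up the partial counts that share a hexagon.
reduce_hex <- function(partials) {
  aggregate(n ~ i + j, data = do.call(rbind, partials), FUN = sum)
}

set.seed(1)
size <- 0.25
pts <- data.frame(x = rnorm(10000), y = rnorm(10000))   # simulated points
chunks <- split(pts, rep(1:4, length.out = nrow(pts)))  # stand-in for blocks
hex_counts <- reduce_hex(lapply(chunks, map_hex, size = size))

# Centre coordinates, ready to be plotted locally.
hex_counts$cx <- hex_counts$i * size / 2
hex_counts$cy <- hex_counts$j * size * sqrt(3) / 2
print(head(hex_counts))
```

Whatever the number of input points, the result has at most one row per occupied hexagon, which is why it can be brought back and plotted on a laptop.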

3 Details of mapreduce

What you see: input/output, map, reduce

• input:
    • Type: text, csv, R object, ...
    • Options: separator, ...

• output: similar to input

• map & reduce:
    • Functions with (k, v) arguments (k, key; v, value)
    • They return a (k, v) list
    • Thus, mapreduce jobs can be chained together (the output of the first one is the input for the second)
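The (k, v) contract and the chaining of jobs can be illustrated with a toy, single-machine stand-in written in base R. Everything here is hypothetical: `kv` and `local_mapreduce` are made-up helpers for illustration and are not the API of any Hadoop package.

```r
# A (key, value) pair of vectors.
kv <- function(k, v) list(key = k, val = v)

# Hypothetical local stand-in for one mapreduce job.
local_mapreduce <- function(input, map, reduce = NULL) {
  mapped <- map(input$key, input$val)           # map returns (k, v)
  if (is.null(reduce)) return(mapped)           # map-only job
  groups <- split(mapped$val, mapped$key)       # "shuffle": group by key
  out <- lapply(names(groups), function(k) reduce(k, groups[[k]]))
  kv(unlist(lapply(out, function(o) o$key)),
     unlist(lapply(out, function(o) o$val)))    # reduce returns (k, v) too
}

# Job 1: sum of squares, keyed by parity.
job1 <- local_mapreduce(
  kv(NULL, 1:10),
  map    = function(k, v) kv(v %% 2, v^2),
  reduce = function(k, vv) kv(k, sum(vv))
)

# Job 2: its input is the output of job 1.
job2 <- local_mapreduce(
  job1,
  map    = function(k, v) kv(rep("total", length(v)), v),
  reduce = function(k, vv) kv(k, sum(vv))
)

print(job1)
print(job2)
```

Since both map and reduce consume and produce (k, v) pairs, the output of one job is always a valid input for the next.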

What you don’t see

$HADOOP jar $HADOOP_STREAMING -D stream.map.input=typedbytes
    -D stream.map.output=typedbytes
    -D stream.reduce.input=typedbytes
    -D stream.reduce.output=typedbytes
    -D mapred.reduce.tasks=0
    -input /tmp/RtmpUUrNMj/file68c0185e60c
    -output /tmp/RtmpUUrNMj/file68c04c25d5f0
    -mapper \"Rscript rmr-streaming-map68c018acf680 \"
    -file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a
    -file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080
    -file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680
    -inputformat org.apache.hadoop.streaming.AutoInputFormat
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat 2>&1

4 Scoring, sampling & simulating

Scoring

• External consultants build a model (using R and small data)

• Models in R should have a predict method

• You can then score your huge database (in batch)

• No need to rewrite the model into your systems!
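A minimal local sketch of batch scoring in base R follows. The data, the model and the number of chunks are all made up for illustration; on a cluster, each chunk would be what a mapper receives.

```r
set.seed(42)

# "Small data": the sample on which the consultants fit their model.
small <- data.frame(x = rnorm(200))
small$y <- rbinom(200, 1, plogis(0.5 + 1.2 * small$x))
model <- glm(y ~ x, data = small, family = binomial)

# "Big data": scored chunk by chunk with the model's predict method.
big <- data.frame(x = rnorm(1e5))
idx <- rep(1:10, length.out = nrow(big))   # hypothetical chunk assignment
big$score <- NA_real_
for (k in 1:10) {
  chunk <- big[idx == k, , drop = FALSE]
  big$score[idx == k] <- predict(model, newdata = chunk, type = "response")
}

summary(big$score)
```

The fitted object is the only thing that has to travel to the data; the scoring code is a single call to `predict`.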

The case for sampling

• Sampling works!

• Sampled datasets can be used to build small data models

• You can use R (& mapreduce) to sample data, but you had better not
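For illustration only, this is what sampling would look like as a map-only job, sketched locally in base R with simulated data; `sample_chunk` is a hypothetical helper. It also shows the first point: a model fitted on a 1% sample lands close to the model fitted on everything.

```r
set.seed(7)
big <- data.frame(x = rnorm(1e5))
big$y <- 2 + 3 * big$x + rnorm(1e5)

# Map-only job: each chunk keeps every row independently with probability p.
sample_chunk <- function(chunk, p) {
  chunk[runif(nrow(chunk)) < p, , drop = FALSE]
}

chunks <- split(big, rep(1:20, length.out = nrow(big)))
smp <- do.call(rbind, lapply(chunks, sample_chunk, p = 0.01))

# A small-data model on the sample versus the same model on all rows.
coef_sample <- coef(lm(y ~ x, data = smp))
coef_full   <- coef(lm(y ~ x, data = big))
print(rbind(sample = coef_sample, full = coef_full))
```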

Running simulations on Hadoop

• Some (many?) people say it is not the right tool

• Mapreduce needs input data, but simulations often have none

• You want to control the number of mappers (which run your simulations)

• Still mapreduce is nice for simulations...

• ... so let an old dog try his dirty trick!
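The slides do not spell the trick out. One plausible reading, and it is only an assumption, is to feed the job a dummy input with one record per simulation run, so that the input controls how much work is distributed. A local base-R sketch of that idea, with a hypothetical `simulate_pi` function:

```r
# Dummy input: one seed per simulation run. Its length decides how many
# independent pieces of work exist (an assumption about "the trick").
seeds <- 1:50

# Map: ignore the record as data and use it only as a random seed.
simulate_pi <- function(seed, n = 10000) {
  set.seed(seed)
  inside <- runif(n)^2 + runif(n)^2 <= 1   # points inside the quarter circle
  4 * mean(inside)
}

# Reduce: average the per-run estimates.
estimates <- vapply(seeds, simulate_pi, numeric(1))
pi_hat <- mean(estimates)
print(pi_hat)
```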

5 Data modelling
    Linear Regression
    Logistic Regression
    Trees & Random Forests

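Linear regression can be parallelized

Simple linear regression: $y \sim \alpha + \beta x$

$$
\beta = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}
      = \frac{\sum_{i=1}^n x_i y_i - \frac{1}{n} \sum_{i=1}^n x_i \sum_{j=1}^n y_j}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \left( \sum_{i=1}^n x_i \right)^2}
$$

Operations are case by case!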
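Since every term in the formula is a sum over cases, each chunk only needs to report five numbers. A base-R sketch with simulated data; the chunking and the `map_sums` helper are illustrative assumptions.

```r
set.seed(3)
n <- 10000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
chunks <- split(data.frame(x = x, y = y), rep(1:8, length.out = n))

# Map: each chunk returns its own sums (five numbers, whatever its size).
map_sums <- function(d) {
  c(n = nrow(d), sx = sum(d$x), sy = sum(d$y),
    sxy = sum(d$x * d$y), sxx = sum(d$x^2))
}

# Reduce: sums of sums.
s <- Reduce(`+`, lapply(chunks, map_sums))

# Plug the totals into the second form of the formula for beta.
beta  <- (s[["sxy"]] - s[["sx"]] * s[["sy"]] / s[["n"]]) /
         (s[["sxx"]] - s[["sx"]]^2 / s[["n"]])
alpha <- s[["sy"]] / s[["n"]] - beta * s[["sx"]] / s[["n"]]
print(c(alpha = alpha, beta = beta))
```

The result agrees with `lm(y ~ x)` up to floating-point error, although no single step ever saw all the rows at once.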

Multiple linear regression

• Based on $X'X$ and $X'y$:

$$\beta = (X'X)^{-1} X'y$$

• If $X' = [X_1 | \dots | X_n]$ (by blocks), then $X'X = \sum_i X_i X_i'$.
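The block decomposition translates directly into a map step that computes small cross-products and a reduce step that adds them. A base-R sketch with simulated data; the same argument gives $X'y$ as a sum over blocks, which the code also uses. Note that each row block `Xb` below plays the role of $X_i'$, so `crossprod(Xb)` is $X_i X_i'$.

```r
set.seed(4)
n <- 10000
X <- cbind(1, matrix(rnorm(n * 3), ncol = 3))   # intercept + 3 predictors
y <- drop(X %*% c(1, -2, 0.5, 3)) + rnorm(n)
block <- rep(1:8, length.out = n)               # hypothetical row blocks

# Map: per-block cross-products (a 4 x 4 matrix and a 4 x 1 vector).
map_block <- function(b) {
  Xb <- X[block == b, , drop = FALSE]
  list(XtX = crossprod(Xb), Xty = crossprod(Xb, y[block == b]))
}

# Reduce: add the contributions of two blocks.
reduce_blocks <- function(a, b) {
  list(XtX = a$XtX + b$XtX, Xty = a$Xty + b$Xty)
}

total <- Reduce(reduce_blocks, lapply(1:8, map_block))
beta <- drop(solve(total$XtX, total$Xty))       # solves (X'X) beta = X'y
print(beta)
```

The matrices that travel between nodes have as many rows and columns as there are predictors, regardless of the number of observations.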

Can logistic regression be parallelized? Yes and no.

• Fitting logistic regression models is iterative, and the iterations cannot run in parallel with each other (each one depends on the previous one).

• However, the work inside each iteration can be parallelized (it is not unlike fitting linear models as before)

• We will explore two big data alternatives:
    • Parallelize iterations using mapreduce (see http://goo.gl/ftx36r)
    • Split your data meaningfully and do standard logistic regression in the nodes
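The first alternative can be sketched locally in base R with Newton's method (equivalently, iteratively reweighted least squares). This is an illustration with simulated data, not the code behind the link: each iteration is one map/reduce pass over the blocks, and the iterations themselves run in sequence.

```r
set.seed(5)
n <- 5000
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, -0.7))))
block <- rep(1:8, length.out = n)               # hypothetical row blocks

# Map: gradient and Hessian contributions of one block at the current beta.
map_block <- function(b, beta) {
  Xb <- X[block == b, , drop = FALSE]
  mu <- plogis(drop(Xb %*% beta))               # fitted probabilities
  list(g = crossprod(Xb, y[block == b] - mu),
       H = crossprod(Xb * (mu * (1 - mu)), Xb)) # X' W X for this block
}

# Reduce: add the contributions of two blocks.
reduce_blocks <- function(a, b) list(g = a$g + b$g, H = a$H + b$H)

beta <- rep(0, ncol(X))
for (iter in 1:25) {                            # sequential iterations
  tot  <- Reduce(reduce_blocks, lapply(1:8, map_block, beta = beta))
  step <- drop(solve(tot$H, tot$g))             # Newton update
  beta <- beta + step
  if (max(abs(step)) < 1e-10) break
}
print(beta)
```

Each pass has the same shape as the linear regression case: small matrices are computed per block and then added up.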

How many bytes make knowledge? (aka the fractal nature of big data)

Split logistic regression

Viable alternatives to logistic models

• Trees
    • High interpretability
    • But unstable and tend to miss details

• Random forests
    • Black boxes
    • Superb performance
    • These are collections of trees that can be built in parallel

• Both can be parallelized in different ways:
    • Similar to partitioned logistic models above
    • Within training

6 Final remarks

Forget most of what you learned today, seriously

• People strive to extend small data models to big data (as we did today)...

• ... but is it the way to go?

• Achtung: microlocal structure
    • Small data people know microlocal structure as outliers
    • Global models (linear, logistic, ...) cannot (easily?) exploit microlocal structure
    • But the promises of big data lie precisely there
    • (Otherwise, just sample and you will be fine)

• Areas to watch for insights on big data modelling:
    • SNA (social network analysis)
    • Text analysis

Thank you very much and...

... questions?
