Workshop – Hadoop + R Carlos Gil Bellosta
May 06, 2015
Big Data Analytics
Carlos J. Gil Bellosta
Intro to Hadoop & R
All about Hadoop
Hadoop FS
Hadoop & mapreduce
All about R
Counting (& Graphics)
Graphics & big data
Let’s count... hexagons
Details of mapreduce
Scoring, sampling & simulating
Data modelling
Linear Regression
Logistic Regression
Trees & Random Forests
Final remarks
Big Data Analytics
R & Hadoop
Carlos J. Gil Bellosta
November 2013
Table of Contents
1 Intro to Hadoop & R
  All about Hadoop
  Hadoop FS
  Hadoop & mapreduce
  All about R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
File system: manages everything about files
• Examples: diskettes, hard disks, RAIDs,... magnetic tapes!
• Combination of hardware and software to hide boring activities from users:
  • Find space to write the files
  • Read/write files
  • Manage fragmentation
  • Etc.
• How many devices per FS?
  • 1-to-1: diskettes, CD-ROMs, HDDs,...
  • n-to-1: partitioned HDDs,...
  • 1-to-n: RAIDs, Hadoop
Hadoop goodies (as a FS)
• Splits (large) files into chunks spread across machines
• Replicates chunks (3 copies by default)
• Balances data
• Robust to hardware failures
• It is rack aware
Obviously, it requires some system to keep track of:
• Which servers/racks are up/down
• Where each chunk is located
• ...
How to work with data in Hadoop?
• Provides a shell (ls, cp, etc.)
• You can put/get data from your local FS to Hadoop FS
• That is:
  • You can dump your data to your local machine
  • You can run your programs on your local machine
  • You can put the results back into Hadoop
• But what if the file is too large?
Solution
Rather than bringing the data to the code, why not move the code to the data?
One of the ways to move code to data is known as mapreduce.
Mapreduce
• Two-step process:
  • Map: run your code on the chunks, wherever they are stored
  • Reduce: reshape the output into the desired format
• Hadoop manages issues:
  • System failures
  • Threads that do not return
  • And all (?) that made the life of OpenMP, MPI, etc. users miserable
• Slotted approach: mapreduce provides slots where you put the mapper/reducer code
• The code is for you to provide!
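The two steps can be imitated on a single machine. The sketch below is plain base R, not Hadoop: the chunking scheme, the toy data and the names `map_chunk` and `reduce_parts` are assumptions made for illustration only.

```r
# Toy data: values whose mean we want.
set.seed(1)
x <- rnorm(1e5)

# "Chunk" the data as a distributed file system would (here: 10 pieces).
chunks <- split(x, rep(1:10, length.out = length(x)))

# Map: run the same code on every chunk; return a small partial result.
map_chunk <- function(chunk) c(sum = sum(chunk), n = length(chunk))
partials <- lapply(chunks, map_chunk)

# Reduce: combine the partial results into the desired output.
reduce_parts <- function(a, b) a + b
total <- Reduce(reduce_parts, partials)
chunk_mean <- unname(total["sum"] / total["n"])
```

Only the two functions are "your code"; everything around them (chunking, dispatching, collecting) is what the framework would take care of.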
What is R?
• R is a
  • software package?
  • programming language?
  • environment?
  for data analysis and graphics.
• R users are (should be?) used to the mapreduce approach:
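library(plyr)  # provides ddply(), .() and summarize()
# dfx: a data frame with columns group, sex and age
ddply(dfx, .(group, sex), summarize,
      mean = mean(age),
      sd = sd(age))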
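To see why this call has a mapreduce flavour, the same summary can be spelled out by hand in base R. This is a rough sketch: the contents of `dfx` are invented here (the slide does not show them) and `summarise_piece` is a hypothetical helper.

```r
# Hypothetical data with the columns used on the slide.
dfx <- data.frame(
  group = c("a", "a", "a", "a", "b", "b", "b", "b"),
  sex   = c("f", "f", "m", "m", "f", "f", "m", "m"),
  age   = c(20, 30, 40, 50, 25, 35, 45, 55)
)

# "Map": split the rows by key (group, sex).
pieces <- split(dfx, list(dfx$group, dfx$sex))

# "Reduce": summarise each piece, then bind the small results together.
summarise_piece <- function(d) {
  data.frame(group = d$group[1], sex = d$sex[1],
             mean = mean(d$age), sd = sd(d$age))
}
res <- do.call(rbind, lapply(pieces, summarise_piece))
```

Each piece is processed independently of the others, which is what makes the pattern easy to distribute.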
Visualizing a million
Fluctuation plot
Table plot
• Non-trivial counting exercise (no, we are not counting words today!)
• Good visualization features for big datasets
• Fits in the mapreduce framework:
  • Map: assigns points to hexagons
  • Reduce: aggregates counts on hexagons
  • The output is small and can be plotted locally
What you see: input/output, map, reduce
• input:
  • Type: text, csv, R object,...
  • Options: separator,...
• output: similar to input
• map & reduce:
  • Functions with a (k,v) argument (k, key; v, value)
  • They return a (k,v) list
  • Thus, mapreduces can be chained together (the output of the first one is the input for the second)
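The (k,v) contract and the chaining it allows can be mimicked in a few lines of base R. Note that `kv` and `toy_mapreduce` below are hypothetical stand-ins written for this sketch; they run on one machine and are not the API of any Hadoop package.

```r
# A key-value list: keys and values travel together.
kv <- function(k, v) list(key = k, val = v)

# Minimal single-machine imitation of a mapreduce call:
# map and reduce both take (k, v) and return a key-value list.
toy_mapreduce <- function(input, map, reduce = NULL) {
  mapped <- map(input$key, input$val)
  if (is.null(reduce)) return(mapped)
  # "Shuffle": group the mapped values by key, then reduce each group.
  groups  <- split(mapped$val, mapped$key)
  reduced <- lapply(names(groups), function(k) reduce(k, groups[[k]]))
  kv(unlist(lapply(reduced, `[[`, "key")),
     unlist(lapply(reduced, `[[`, "val")))
}

words <- kv(NULL, c("a", "b", "a", "c", "b", "a", "d"))

# Job 1: count how often each word occurs.
job1 <- toy_mapreduce(words,
  map    = function(k, v) kv(v, rep(1, length(v))),
  reduce = function(k, v) kv(k, sum(v)))

# Job 2 takes the output of job 1 as its input:
# how many words occur once, twice, three times...?
job2 <- toy_mapreduce(job1,
  map    = function(k, v) kv(as.character(v), rep(1, length(v))),
  reduce = function(k, v) kv(k, sum(v)))
```

Because both jobs consume and produce the same kind of object, the second one needs no knowledge of how the first one was computed.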
What you don’t see
$HADOOP jar $HADOOP_STREAMING -D stream.map.input=typedbytes
-D stream.map.output=typedbytes
-D stream.reduce.input=typedbytes
-D stream.reduce.output=typedbytes
-D mapred.reduce.tasks=0
-input /tmp/RtmpUUrNMj/file68c0185e60c
-output /tmp/RtmpUUrNMj/file68c04c25d5f0
-mapper \"Rscript rmr-streaming-map68c018acf680 \"
-file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a
-file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080
-file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680
-inputformat org.apache.hadoop.streaming.AutoInputFormat
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat 2>&1
Scoring
• External consultants build a model (using R and small data)
• Models in R should have a predict method
• You can then score your huge database (in batch)
• No need to rewrite the model into your systems!
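Batch scoring could be sketched along these lines. Everything here is illustrative: the data are simulated, `lm` stands in for whatever model the consultants deliver, and the chunks are processed with `lapply` on one machine instead of by mappers.

```r
# Step 1 (the consultants): fit a model on small data.
set.seed(7)
small <- data.frame(x = runif(200))
small$y <- 1 + 2 * small$x + rnorm(200, sd = 0.1)
model <- lm(y ~ x, data = small)

# Step 2 (you): score a much larger table chunk by chunk with predict().
big    <- data.frame(x = runif(50000))
chunks <- split(big, rep(1:5, length.out = nrow(big)))
scores <- lapply(chunks, function(chunk) {
  data.frame(x = chunk$x, score = predict(model, newdata = chunk))
})
scored <- do.call(rbind, scores)
```

The only thing each chunk needs is the fitted `model` object, which is small and can be shipped alongside the code.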
The case for sampling
• Sampling works!
• Sampled datasets can be used to build small data models
• You can use R (& mapreduce) to sample data, but you had better not
Running simulations on Hadoop
• Some (many?) people say it is not the right tool
• Mapreduce needs input data, but simulations often have none
• You want to control the number of mappers (which run your simulations)
• Still, mapreduce is nice for simulations...
• ... so let an old dog try its dirty trick!
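The slides do not spell the trick out. One common workaround, assumed here and not necessarily the one meant, is to feed the job a tiny dummy input with one record (a seed) per simulation task. The base R sketch below imitates that on one machine; `simulate` and the Monte Carlo example are invented for illustration.

```r
# Dummy input: one seed per simulation task. Its length fixes how many
# map calls are made (an assumption of this sketch, not a Hadoop setting).
seeds <- 1:8

# Map: each task ignores any real data and just runs a simulation.
# Here: Monte Carlo estimate of pi from points in the unit square.
simulate <- function(seed, n = 10000) {
  set.seed(seed)
  x <- runif(n)
  y <- runif(n)
  c(hits = sum(x^2 + y^2 <= 1), n = n)
}
partials <- lapply(seeds, simulate)

# Reduce: pool the results of all tasks.
total  <- Reduce(`+`, partials)
pi_hat <- unname(4 * total["hits"] / total["n"])
```

Distinct seeds keep the tasks from repeating the same random stream, and the number of input records gives a handle on how much work is launched.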
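Linear regression can be parallelized
Simple linear regression: y ∼ α + βx
\[
\beta = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}
      = \frac{\sum_{i=1}^n x_i y_i - \frac{1}{n} \sum_{i=1}^n x_i \sum_{j=1}^n y_j}{\sum_{i=1}^n x_i^2 - \frac{1}{n} \left( \sum_{i=1}^n x_i \right)^2}
\]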
Every sum is accumulated case by case!
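Since the formula only involves sums over cases, each chunk can report its partial sums and the slope follows from their totals. A minimal base R sketch, with simulated data and hypothetical names (`map_sums`, the chunk count), and with the intercept recovered from the standard identity α = ȳ − βx̄:

```r
set.seed(3)
x <- runif(10000)
y <- 2 + 3 * x + rnorm(10000)
chunks <- split(data.frame(x, y), rep(1:8, length.out = length(x)))

# Map: each chunk returns five numbers, whatever its size.
map_sums <- function(d) {
  c(n = nrow(d), sx = sum(d$x), sy = sum(d$y),
    sxy = sum(d$x * d$y), sxx = sum(d$x^2))
}

# Reduce: add the partial sums.
s <- Reduce(`+`, lapply(chunks, map_sums))

# Plug the totals into the second form of the formula for beta.
beta  <- unname((s["sxy"] - s["sx"] * s["sy"] / s["n"]) /
                (s["sxx"] - s["sx"]^2 / s["n"]))
alpha <- unname(s["sy"] / s["n"] - beta * s["sx"] / s["n"])
```

On this toy data the result agrees with `coef(lm(y ~ x))` up to rounding error.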
Multiple linear regression
• Based on $X'X$ and $X'y$:
\[
\beta = (X'X)^{-1} X'y
\]
• If $X' = [X_1 | \dots | X_n]$ (by blocks), then $X'X = \sum_i X_i X_i'$.
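The block decomposition translates directly into a map and a reduce step; the same argument gives $X'y = \sum_i X_i y_i$, with $y_i$ the responses of block $i$. The sketch below is base R on simulated data, with `map_xtx` and the number of blocks chosen arbitrarily for illustration. In the code, each block of rows of the design matrix plays the role of $X_i'$.

```r
set.seed(5)
n <- 5000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
chunks <- split(d, rep(1:5, length.out = n))

# Map: each block of rows contributes its own cross-products.
map_xtx <- function(block) {
  X <- cbind(1, block$x1, block$x2)    # design matrix with intercept
  list(xtx = crossprod(X), xty = crossprod(X, block$y))
}

# Reduce: sum the small (3 x 3 and 3 x 1) matrices over blocks.
parts <- lapply(chunks, map_xtx)
xtx <- Reduce(`+`, lapply(parts, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(parts, `[[`, "xty"))

# Solve the normal equations once, locally.
beta <- drop(solve(xtx, xty))
```

What travels between the map and reduce steps depends on the number of predictors only, not on the number of rows.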
Can logistic regression be parallelized? Yes and no.
• Fitting logistic regression models is iterative, and iterations are not parallelizable.
• However, each iteration can be parallelized (these are not unlike fitting linear models as before)
• We will explore two big data alternatives:
  • Parallelize iterations using mapreduce (see http://goo.gl/ftx36r)
  • Split your data meaningfully and do standard logistic regression in the nodes
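The first alternative can be sketched with Newton's method: within an iteration, the gradient and the Hessian are sums over cases, so chunks can compute their shares independently. This is an illustrative base R version on simulated data, not the code behind the link; `map_step`, the starting value and the stopping rule are assumptions of the sketch.

```r
set.seed(11)
n <- 4000
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * d$x))
chunks <- split(d, rep(1:4, length.out = n))

# Map: for the current beta, a chunk returns its share of the
# gradient and of the (negated) Hessian of the log-likelihood.
map_step <- function(block, beta) {
  X <- cbind(1, block$x)
  p <- plogis(drop(X %*% beta))
  list(grad = crossprod(X, block$y - p),
       hess = crossprod(X, X * (p * (1 - p))))
}

# The iterations themselves are sequential: each needs the previous beta.
beta <- c(0, 0)
for (iter in 1:25) {
  parts <- lapply(chunks, map_step, beta = beta)     # parallelizable
  grad  <- Reduce(`+`, lapply(parts, `[[`, "grad"))  # reduce
  hess  <- Reduce(`+`, lapply(parts, `[[`, "hess"))
  step  <- drop(solve(hess, grad))
  beta  <- beta + step
  if (max(abs(step)) < 1e-10) break
}
```

Each pass through the loop corresponds to one mapreduce job, which is why the number of iterations matters so much for the total running time.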
How many bytes make knowledge? (aka the fractal nature of big data)
Split logistic regression
Viable alternatives to logistic models
• Trees
  • High interpretability
  • But unstable and tend to miss details
• Random forests
  • Black boxes
  • Superb performance
  • These are collections of trees that can be built in parallel
• Both can be parallelized in different ways:
  • Similar to partitioned logistic models above
  • Within training
Forget most of what you learned today, seriously
• People strive to extend small data models to big data (as we did today)...
• ... but is it the way to go?
• Watch out for microlocal structure
  • Small data people know microlocal structure as outliers
  • Global models (linear, logistic,...) cannot (easily?) exploit microlocal structure
  • But the promises of big data lie precisely there
  • (Otherwise, just sample and you will be fine)
• Areas to watch for insights on big data modelling:
  • SNA (social network analysis)
  • Text analysis
Thank you very much and...
... questions?