
R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Allows All Developers to Leverage the MapReduce Framework

Nov 28, 2014

Transcript
Page 1: R + Hadoop = Big Data Analytics. How Revolution Analytics' RHadoop Project Allows All Developers to Leverage the MapReduce Framework

R + Hadoop = Big Data Analytics

Page 2
Page 3
Page 4
Page 5
Page 6


Page 7

Page 8

Page 9

Page 10
Page 11
Page 12
Page 13

library(rmr)

mapreduce(…)

Page 14

lapply(data, function)

mapreduce(big.data, map = function)
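The slide draws an analogy: `mapreduce` with a `map` function is to data on HDFS what `lapply` is to an in-memory list. As a minimal sketch of that idea (plain Python, no Hadoop; all names here are illustrative, not part of rmr):

```python
# In-memory sketch of the analogy above: `map_fn` transforms each
# (key, value) pair, just as the function passed to lapply transforms
# each list element. This is not rmr, only an illustration of the idea.

def local_mapreduce(data, map_fn):
    """Apply map_fn to every (key, value) pair, like rmr's map phase."""
    return [map_fn(k, v) for k, v in data]

data = [(None, x) for x in range(1, 6)]   # keys unused in this example
squares = local_mapreduce(data, lambda k, v: (v, v * v))
```

The difference, of course, is that rmr's `mapreduce` streams the pairs through Hadoop rather than holding them in memory.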

Page 15

Expose MR:
  Java, C++
  Cascading, Crunch
  Rmr, Rhipe, Dumbo, Pydoop, Hadoopy

Hide MR:
  Hive, Pig
  Cascalog, Scalding, Scrunch

Page 16

mapreduce(input, output, map, reduce)

Page 17

x = from.dfs(hdfs.object)

hdfs.object = to.dfs(x)

Page 18

small.ints = 1:1000
lapply(small.ints, function(x) x^2)

small.ints = to.dfs(1:1000)
mapreduce(input = small.ints,
          map = function(k, v) keyval(v, v^2))

groups = rbinom(32, n = 50, prob = 0.4)
tapply(groups, groups, length)

groups = to.dfs(groups)
mapreduce(input = groups,
          map = function(k, v) keyval(v, 1),
          reduce = function(k, vv) keyval(k, length(vv)))
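The second rmr job above is the distributed analogue of `tapply(groups, groups, length)`: map emits `(group, 1)`, the shuffle collects the ones by key, and reduce counts them. A hedged in-memory sketch of those three steps (plain Python, illustrative names only):

```python
# Simulate map / shuffle / reduce in memory: map_fn emits a (key, value)
# pair per record, the shuffle groups values by key, reduce_fn combines
# each group. Not rmr -- just the mechanics behind the slide's example.
from collections import defaultdict

def simulate_mapreduce(values, map_fn, reduce_fn):
    shuffled = defaultdict(list)
    for v in values:
        k, out = map_fn(None, v)
        shuffled[k].append(out)
    return {k: reduce_fn(k, vv) for k, vv in shuffled.items()}

groups = [0, 1, 1, 2, 0, 1]
counts = simulate_mapreduce(groups,
                            lambda k, v: (v, 1),       # map: emit (group, 1)
                            lambda k, vv: len(vv))     # reduce: count per group
```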

Page 19

condition = function(x) x > 10

out = mapreduce(input = input,
                map = function(k, v) if (condition(v)) keyval(k, v))
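The filtering trick here is that a map function which emits nothing for a record drops that record. The same pattern, sketched in plain Python (illustrative, not rmr itself):

```python
# Filtering via map: the map function returns a (key, value) pair only
# when the predicate holds; returning None drops the record.
def filter_map(data, map_fn):
    out = []
    for k, v in data:
        kv = map_fn(k, v)
        if kv is not None:
            out.append(kv)
    return out

condition = lambda x: x > 10
kept = filter_map([(1, 5), (2, 12), (3, 42)],
                  lambda k, v: (k, v) if condition(v) else None)
```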

Page 20

kmeans = function(points, ncenters, iterations = 10,
                  distfun = function(a, b) norm(as.matrix(a - b), type = 'F')) {
  newCenters = kmeans.iter(points, distfun, ncenters = ncenters)
  for (i in 1:iterations) {
    newCenters = kmeans.iter(points, distfun, centers = newCenters)
  }
  newCenters
}

kmeans.iter = function(points, distfun, ncenters = dim(centers)[1], centers = NULL) {
  from.dfs(
    mapreduce(
      input = points,
      map = if (is.null(centers)) {
        function(k, v) keyval(sample(1:ncenters, 1), v)
      } else {
        function(k, v) {
          distances = apply(centers, 1, function(c) distfun(c, v))
          keyval(centers[which.min(distances), ], v)
        }
      },
      reduce = function(k, vv) keyval(NULL, apply(do.call(rbind, vv), 2, mean))),
    to.data.frame = T)
}

Page 21

#!/usr/bin/python
import sys
from math import fabs
from org.apache.pig.scripting import Pig

filename = "student.txt"
k = 4
tolerance = 0.01

MAX_SCORE = 4
MIN_SCORE = 0
MAX_ITERATION = 100

# initial centroids, equally dividing the space
initial_centroids = ""
last_centroids = [None] * k
for i in range(k):
    last_centroids[i] = MIN_SCORE + float(i)/k*(MAX_SCORE-MIN_SCORE)
    initial_centroids = initial_centroids + str(last_centroids[i])
    if i != k-1:
        initial_centroids = initial_centroids + ":"

P = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids');
raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';
""")

converged = False
iter_num = 0
while iter_num < MAX_ITERATION:
    Q = P.bind({'centroids': initial_centroids})
    results = Q.runSingle()

Page 22

    if not results.isSuccessful():
        raise Exception("Pig job failed")

    iter = results.result("result").iterator()
    centroids = [None] * k
    distance_move = 0
    # get the new centroids of this iteration and calculate how far they
    # moved relative to the last iteration
    for i in range(k):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / k
    Pig.fs("rmr output")
    print("iteration " + str(iter_num))
    print("average distance moved: " + str(distance_move))
    if distance_move < tolerance:
        sys.stdout.write("k-means converged at centroids: [")
        sys.stdout.write(",".join(str(v) for v in centroids))
        sys.stdout.write("]\n")
        converged = True
        break
    last_centroids = centroids[:]
    initial_centroids = ""
    for i in range(k):
        initial_centroids = initial_centroids + str(last_centroids[i])
        if i != k-1:
            initial_centroids = initial_centroids + ":"
    iter_num += 1

if not converged:
    print("did not converge after " + str(iter_num) + " iterations")
    sys.stdout.write("last centroids: [")
    sys.stdout.write(",".join(str(v) for v in last_centroids))
    sys.stdout.write("]\n")

Page 23

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class FindCentroid extends EvalFunc<Double> {
    double[] centroids;

    public FindCentroid(String initialCentroid) {
        String[] centroidStrings = initialCentroid.split(":");
        centroids = new double[centroidStrings.length];
        for (int i = 0; i < centroidStrings.length; i++)
            centroids[i] = Double.parseDouble(centroidStrings[i]);
    }

    @Override
    public Double exec(Tuple input) throws IOException {
        double min_distance = Double.MAX_VALUE;
        double closest_centroid = 0;
        for (double centroid : centroids) {
            double distance = Math.abs(centroid - (Double)input.get(0));
            if (distance < min_distance) {
                min_distance = distance;
                closest_centroid = centroid;
            }
        }
        return closest_centroid;
    }
}

Page 24

mapreduce(mapreduce(…

mapreduce(input = c(input1, input2), …)

equijoin = function(left.input, right.input, input, output, outer,
                    map.left, map.right, reduce, reduce.all)
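The `equijoin` signature above follows the classic reduce-side join pattern: each side tags its records, the shuffle groups them by join key, and the reduce pairs left records with right records. A hedged in-memory sketch of that idea (plain Python; this is an illustration of the pattern, not rmr's implementation):

```python
# Reduce-side equijoin sketch: collect each side's values per key
# (the role of map.left / map.right), then pair them up per key
# (the role of reduce). Inner join only, for simplicity.
from collections import defaultdict

def equijoin(left, right):
    buckets = defaultdict(lambda: ([], []))
    for k, v in left:
        buckets[k][0].append(v)        # tag as coming from the left side
    for k, v in right:
        buckets[k][1].append(v)        # tag as coming from the right side
    return [(k, l, r)                  # emit every left/right pair per key
            for k, (ls, rs) in buckets.items()
            for l in ls for r in rs]

rows = equijoin([(1, 'a'), (2, 'b')], [(1, 'x'), (3, 'y')])
```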

Page 25

out1 = mapreduce(…)
mapreduce(input = out1, <xyz>)
mapreduce(input = out1, <abc>)

abstract.job = function(input, output, …) {
  …
  result = mapreduce(input = input,
                     output = output)
  …
  result
}

Page 26

input.format, output.format, format
reduce.on.data.frame, to.data.frame
local, hadoop backends
profiling

Page 27
Page 28
Page 29