CS224: Advanced Topics in Data Management, Spring 2008
Computing PageRank Using Hadoop (+ Introduction to MapReduce)
Alexander Behm, Ajey Shah
University of California, Irvine
Instructor: Prof. Chen Li



Outline

Introduction to MapReduce
  - Motivation + Goals
  - MapReduce Paradigm + Example

Introduction to Hadoop
  - Architecture
  - Our setup

Computing PageRank using MapReduce
  - Link Analysis
  - Matrix Multiplication


Motivation for MapReduce

How can we process huge amounts of data quickly (think web-scale)?

Mainframe (one big machine)

expensive, one vendor, hard to scale radically, single point of failure

COTS Cluster (many small machines)

cheap components, many vendors, easy to scale

COTS Clusters very popular because of price and scalability

Main drawback is complexity of programming parallel applications on them

COTS = Commodity off the shelf


Motivation for MapReduce

What are the main challenges of programming a COTS cluster?

1. Fault Tolerance (many machines, many failures)

2. Transparency: how to hide underlying details of cluster

3. Scheduling and load balancing

Parallel Programming Models

One end of the spectrum:
- High exposure to programmer
- Complex programming
- High efficiency
- (Long development time)

Other end of the spectrum:
- Low exposure to programmer
- Simple programming
- Lower efficiency
- (Short development time)

MapReduce


MapReduce Goals

Provide easy but general model for programmers to use cluster resources

Hide network communication (i.e. RPCs)

Hide storage details: file chunks are automatically distributed and replicated

Provide transparent fault tolerance

Failed tasks are automatically rescheduled on live nodes

High throughput and automatic load balancing

E.g. scheduling tasks on nodes that already have data

RPC = Remote Procedure Call


MapReduce is NOT…

An operating system

A programming language

Meant for online processing

Hadoop (it is an implementation of MapReduce)

MapReduce is a programming paradigm!


MapReduce Flow

Input: Key, Value; Key, Value; …
Split the input into key-value pairs. For each K-V pair, call Map.

Map
Each Map call produces a new set of K-V pairs.

Sort (between Map and Reduce)

Reduce(K, V[ ])
For each distinct key, call Reduce. Produces one K-V pair for each distinct key.

Output: Key, Value; Key, Value; …
Output as a set of key-value pairs.


MapReduce WordCount Example

Input: File containing words

Hello World Bye World
Hello Hadoop Bye Hadoop
Bye Hadoop Hello Hadoop

-> MapReduce ->

Output: Number of occurrences of each word

Bye 3
Hadoop 4
Hello 3
World 2

How can we do this within the MapReduce framework?

Basic idea: parallelize on lines in input file!


MapReduce WordCount Example

Map(K, V) {
    For each word w in V
        Collect(w, 1);
}

Input (one Map call per line) -> Map Output

1, “Hello World Bye World” -> <Hello,1><World,1><Bye,1><World,1>
2, “Hello Hadoop Bye Hadoop” -> <Hello,1><Hadoop,1><Bye,1><Hadoop,1>
3, “Bye Hadoop Hello Hadoop” -> <Bye,1><Hadoop,1><Hello,1><Hadoop,1>


MapReduce WordCount Example

Reduce(K, V[ ]) {
    Int count = 0;
    For each v in V
        count += v;
    Collect(K, count);
}

Map Output

<Hello,1><World,1><Bye,1><World,1>
<Hello,1><Hadoop,1><Bye,1><Hadoop,1>
<Bye,1><Hadoop,1><Hello,1><Hadoop,1>

Internal Grouping

<Bye, [1, 1, 1]>
<Hadoop, [1, 1, 1, 1]>
<Hello, [1, 1, 1]>
<World, [1, 1]>

Reduce Output (one Reduce call per distinct key)

<Bye, 3><Hadoop, 4><Hello, 3><World, 2>
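The Map, grouping, and Reduce steps of the WordCount example can be traced in a few lines of plain Java. This is only an in-memory sketch that runs on one machine and does not use Hadoop; the class name WordCountSketch and its method names are made up for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory sketch of the WordCount flow; hypothetical names, no Hadoop involved.
class WordCountSketch {

    // Map(K, V): emit <word, 1> for every word in one input line.
    static List<Map.Entry<String, Integer>> map(int lineNumber, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Reduce(K, V[]): sum all values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Internal grouping: a sorted map from each key to the list of its values.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> kv : map(i + 1, lines.get(i))) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // One reduce call per distinct key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Bye Hadoop",
                "Bye Hadoop Hello Hadoop");
        System.out.println(run(lines)); // prints {Bye=3, Hadoop=4, Hello=3, World=2}
    }
}
```

In a real job the map calls run on different nodes and the grouping is done by the framework; here a TreeMap stands in for the sort step.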


Hadoop

- Open-source implementation of MapReduce by Apache
- Java software framework
- In use and supported by Yahoo!

Hadoop consists of the following components:
- Processing: MapReduce
- Storage: HDFS, HBase (modeled on Google Bigtable)

Some Webmap size data @Yahoo!:
- Number of links between pages in the index: roughly 1 trillion links
- Size of output: over 300 TB, compressed!
- Number of cores used to run a single Map-Reduce job: over 10,000
- Raw disk used in the production cluster: over 5 Petabytes

(source: http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html)


Typical Hadoop Setup


Our Hadoop Setup

MASTER
- peach: NameNode, JobTracker, TaskTracker, DataNode

SLAVES
- watermelon: DataNode, TaskTracker
- cherry: DataNode, TaskTracker
- avocado: DataNode, TaskTracker
- blueberry: DataNode, TaskTracker

The machines are connected through a switch.


Our Hadoop Setup

Demo: Hadoop Admin Pages!


Storage: HDFS

• Single NameNode: manages metadata and block placement

• DataNode: stores blocks


Job Execution Diagram

[Figure: Run Application -> Job Tracker -> Task Tracker, Task Tracker, Task Tracker, … -> Tasks. The figure is labeled "Hadoop Black Box".]


Processing: Hadoop MapReduce

Input -> Map -> Shuffle -> Reduce -> Output


Using Hadoop To Program

[Figure: the user's Map class extends MapReduceBase and implements Mapper; the user's Reduce class extends MapReduceBase and implements Reducer.]


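Sample Map Class

// Imports needed at the top of the enclosing source file:
//   java.io.IOException, java.util.StringTokenizer,
//   org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text,
//   org.apache.hadoop.mapred.MapReduceBase, org.apache.hadoop.mapred.Mapper,
//   org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused output objects, so map() does not allocate one per token.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // key identifies the input line (unused here); value is the text of the line.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one); // emit <word, 1>
        }
    }
}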


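Sample Reduce Class

// Imports needed at the top of the enclosing source file:
//   java.io.IOException, java.util.Iterator,
//   org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text,
//   org.apache.hadoop.mapred.MapReduceBase, org.apache.hadoop.mapred.Reducer,
//   org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Called once per distinct key with all values emitted for that key.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum)); // emit <word, total count>
    }
}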


Running a Job

Demo: Show WordCount Example


Project 5: PageRank on Hadoop


Link Analysis

Crawled Pages -> Link Extractor -> Output

Output format:
#|colNum|NumOfRows|<R,val>…..<R,val>|#.....


PageRank on MapReduce
Very Basic PageRank Algorithm

Input:
    PageRankVector
    DistributionMatrix

ComputePageRank {
    Until converged {
        PageRankVector = DistributionMatrix * PageRankVector;
    }
}

Output:
    PageRankVector
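The loop above can be tried on a tiny dense matrix in plain Java, as a minimal single-machine sketch. The slides do not say how convergence is decided; the test used here (sum of absolute changes below a tolerance), the iteration cap, and the 3-page example graph are all assumptions made for illustration.

```java
import java.util.Arrays;

// Single-machine sketch of the basic iteration; names and parameters are hypothetical.
class PowerIterationSketch {

    // Dense matrix * vector product.
    static double[] multiply(double[][] m, double[] v) {
        double[] result = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < v.length; j++) {
                sum += m[i][j] * v[j];
            }
            result[i] = sum;
        }
        return result;
    }

    // Repeats v = M * v until the L1 change drops below tolerance (assumed criterion).
    static double[] pageRank(double[][] distributionMatrix, double[] start,
                             double tolerance, int maxIterations) {
        double[] v = start.clone();
        for (int it = 0; it < maxIterations; it++) {
            double[] next = multiply(distributionMatrix, v);
            double change = 0.0;
            for (int i = 0; i < v.length; i++) {
                change += Math.abs(next[i] - v[i]);
            }
            v = next;
            if (change < tolerance) {
                break;
            }
        }
        return v;
    }

    public static void main(String[] args) {
        // Example graph (assumed): page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
        // Column j spreads the rank of page j evenly over the pages it links to.
        double[][] m = {
                {0.0, 0.0, 1.0},
                {0.5, 0.0, 0.0},
                {0.5, 1.0, 0.0}};
        double[] start = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        // Converges toward approximately [0.4, 0.2, 0.4].
        System.out.println(Arrays.toString(pageRank(m, start, 1e-9, 1000)));
    }
}
```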

Challenges
- Storage of matrix and vector
- Parallel matrix multiplication
- Determine convergence
- Implementation on Hadoop


PageRank on MapReduce
Why is storage a challenge?

UCI domain: 500000 pages

Assuming 4 Bytes per entry

Size of Vector: 500000 * 4 = 2000000 Bytes = 2MB

Size of Matrix: 500000 * 500000 * 4 = 10^12 Bytes = 1TB

This assumes a fully connected graph. Clearly this is very unrealistic for web pages!

Solution: Sparse Matrix
But: Row-Wise or Column-Wise?

Depends on usage patterns (e.g. how we do parallel matrix multiplication, updating of the matrix, etc.)


PageRank on MapReduce
Parallel Matrix Multiplication

Requirement: Make it work! Simple but practical solution

[Figure: matrix M times vector V yields the vector M x V]

Every row of M is “combined” with V, yielding one element of M x V each

Intuition:
- Parallelize on rows: each parallel task computes one final value
- Use row-wise sparse matrix, so above can be done easily!
  (column-wise is actually better for PageRank)


PageRank on MapReduce

Original Matrix

     1  2  3  4  5  6
1    0  0  0  0  1  0
2    0  1  0  1  0  0
3    1  1  0  0  0  0
4    0  0  0  1  1  0
5    1  1  0  0  0  0
6    0  1  0  0  0  1

Stored As Row-Wise Sparse Matrix (each entry is "column, value")

1    5, 1
2    2, 1    4, 1
3    1, 1    2, 1
4    4, 1    5, 1
5    1, 1    2, 1
6    2, 1    6, 1
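The conversion from the original matrix to its row-wise sparse form can be sketched as follows. The class and method names are hypothetical, and in-memory lists stand in for whatever on-disk layout the project actually uses.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: convert a dense matrix into per-row lists of (column, value) entries.
class RowWiseSparseSketch {

    static final class Entry {
        final int column; // 1-based, as on the slide
        final int value;

        Entry(int column, int value) {
            this.column = column;
            this.value = value;
        }

        @Override
        public String toString() {
            return "(" + column + ", " + value + ")";
        }
    }

    // Keeps only the non-zero entries of each row, in column order.
    static List<List<Entry>> toRowWiseSparse(int[][] dense) {
        List<List<Entry>> rows = new ArrayList<>();
        for (int[] denseRow : dense) {
            List<Entry> row = new ArrayList<>();
            for (int c = 0; c < denseRow.length; c++) {
                if (denseRow[c] != 0) {
                    row.add(new Entry(c + 1, denseRow[c]));
                }
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        int[][] dense = {
                {0, 0, 0, 0, 1, 0},
                {0, 1, 0, 1, 0, 0},
                {1, 1, 0, 0, 0, 0},
                {0, 0, 0, 1, 1, 0},
                {1, 1, 0, 0, 0, 0},
                {0, 1, 0, 0, 0, 1}};
        List<List<Entry>> sparse = toRowWiseSparse(dense);
        for (int r = 0; r < sparse.size(); r++) {
            System.out.println((r + 1) + ": " + sparse.get(r)); // first line: 1: [(5, 1)]
        }
    }
}
```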

New Storage Requirements

UCI domain: 500000 pages

Assuming 4 Bytes per entry

Assuming max 100 outgoing links per page

Size of Matrix: 500000 * 100 * (4 + 4) = 400 * 10^6 Bytes = 400MB

Notice:

No more random access!
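The storage figures for the dense and sparse representations can be checked with a short calculation. The sketch below reads the (4 + 4) term as 4 bytes for the value plus 4 bytes for the column index of each stored entry, which is an assumption about the slide's intent; the class and method names are hypothetical.

```java
// Back-of-the-envelope storage estimates for the UCI example (500000 pages).
class StorageEstimate {

    static long denseVectorBytes(long pages, long bytesPerEntry) {
        return pages * bytesPerEntry;
    }

    static long denseMatrixBytes(long pages, long bytesPerEntry) {
        return pages * pages * bytesPerEntry;
    }

    // Assumes each stored entry holds a value plus a column index.
    static long sparseMatrixBytes(long pages, long maxLinksPerPage,
                                  long bytesPerValue, long bytesPerIndex) {
        return pages * maxLinksPerPage * (bytesPerValue + bytesPerIndex);
    }

    public static void main(String[] args) {
        long pages = 500_000L;
        System.out.println(denseVectorBytes(pages, 4));          // 2000000 (2 MB)
        System.out.println(denseMatrixBytes(pages, 4));          // 1000000000000 (1 TB)
        System.out.println(sparseMatrixBytes(pages, 100, 4, 4)); // 400000000 (400 MB)
    }
}
```

Under these assumptions the sparse matrix is 2500 times smaller than the dense one.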


PageRank on MapReduce

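Map(Key, Row) {
    Vector v = getVector();    // the whole vector is needed: a row may reference any column
    Double sum = 0;            // rank values are fractional
    For each Element e in Row
        sum += e.value * v.at(e.columnNumber);
    collect(Key, sum);         // one element of M x V per row
}

Reduce(Key, Value) {
    collect(Key, Value);       // identity: Map already produced the final value
}

Map-Reduce procedures for parallel matrix*vector multiplication using row-wise sparse matrix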
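The Map and Reduce procedures above can be exercised without a cluster. The following plain-Java sketch runs one map call per row of the 6 x 6 example matrix; the class name, the row helper, and the test vector [1, 2, 3, 4, 5, 6] are made up for illustration, and the in-memory array stands in for getVector().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Single-machine sketch of the row-wise sparse matrix * vector job.
class SparseMatVecSketch {

    static final class Element {
        final int columnNumber; // 0-based index into the vector
        final double value;

        Element(int columnNumber, double value) {
            this.columnNumber = columnNumber;
            this.value = value;
        }
    }

    // Mirrors Map(Key, Row): combine one sparse row with the full vector.
    static double map(List<Element> row, double[] vector) {
        double sum = 0.0;
        for (Element e : row) {
            sum += e.value * vector[e.columnNumber];
        }
        return sum;
    }

    // Mirrors Reduce(Key, Value): identity.
    static double reduce(int key, double value) {
        return value;
    }

    static double[] multiply(List<List<Element>> matrix, double[] vector) {
        double[] result = new double[matrix.size()];
        // The iterations are independent of each other, so they could run in parallel.
        for (int key = 0; key < matrix.size(); key++) {
            result[key] = reduce(key, map(matrix.get(key), vector));
        }
        return result;
    }

    // Hypothetical helper: a row with value 1.0 at the given 1-based columns.
    static List<Element> row(int... oneBasedColumns) {
        List<Element> elements = new ArrayList<>();
        for (int c : oneBasedColumns) {
            elements.add(new Element(c - 1, 1.0));
        }
        return elements;
    }

    // The 6 x 6 example matrix in row-wise sparse form.
    static List<List<Element>> slideMatrix() {
        return Arrays.asList(row(5), row(2, 4), row(1, 2), row(4, 5), row(1, 2), row(2, 6));
    }

    public static void main(String[] args) {
        double[] v = {1, 2, 3, 4, 5, 6};
        // prints [5.0, 6.0, 3.0, 9.0, 3.0, 8.0]
        System.out.println(Arrays.toString(multiply(slideMatrix(), v)));
    }
}
```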


Matrix Vector Multiplication

Demo: Show Matrix-Vector Multiplication


Hadoop: Implementing Own File Format

HDFS File -> InputFormat -> InputSplits -> RecordReaders -> Map

InputFormat
- Splits file into chunks (“InputSplits”)
- Byte oriented
- Provides appropriate RecordReader

InputSplit
- Filename
- Start offset
- End offset
- Hosts in HDFS

RecordReader
- Record oriented
- Extracts records from an InputSplit
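The interplay between byte-oriented splits and a record-oriented reader can be illustrated without Hadoop. The sketch below is a simplified stand-in, not Hadoop's InputFormat or RecordReader classes: it assumes newline-terminated records held in a byte array, and the rule that a record belongs to the split containing its first byte is a design choice made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for splits and record reading; all names are hypothetical.
class SplitReaderSketch {

    // A byte range [start, end) of the file. Filename and hosts are omitted here.
    static final class Split {
        final int start;
        final int end;

        Split(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Byte oriented: cut the file into fixed-size ranges, ignoring record boundaries.
    static List<Split> computeSplits(int fileLength, int splitSize) {
        List<Split> splits = new ArrayList<>();
        for (int s = 0; s < fileLength; s += splitSize) {
            splits.add(new Split(s, Math.min(s + splitSize, fileLength)));
        }
        return splits;
    }

    // Record oriented: return every record whose first byte lies inside the split.
    static List<String> readRecords(byte[] data, Split split) {
        int pos = split.start;
        if (pos > 0) {
            // Back up one byte and skip past the next newline, so a record that was
            // cut by the split boundary is left to the previous split.
            pos = pos - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> records = new ArrayList<>();
        while (pos < split.end && pos < data.length) {
            int recordEnd = pos;
            // A record may run past split.end; it is still read to completion.
            while (recordEnd < data.length && data[recordEnd] != '\n') {
                recordEnd++;
            }
            records.add(new String(data, pos, recordEnd - pos, StandardCharsets.UTF_8));
            pos = recordEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "row1\nrow2\nrow3\n".getBytes(StandardCharsets.UTF_8);
        for (Split split : computeSplits(data.length, 7)) {
            System.out.println("[" + split.start + ", " + split.end + ") -> " + readRecords(data, split));
        }
    }
}
```

With this rule every record is read exactly once, regardless of where the split boundaries fall.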
