Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 2 September 1, 2011
Jason Baldridge and Matt Lease
https://sites.google.com/a/utcompling.com/dicta-f11/
Transcript
Page 1: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M

University of Texas at Austin, Fall 2011

Lecture 2 September 1, 2011

Matt Lease

School of Information

University of Texas at Austin

ml at ischool dot utexas dot edu

Jason Baldridge

Department of Linguistics

University of Texas at Austin

jasonbaldridge at gmail dot com

Page 2: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Acknowledgments

Course design and slides derived from Jimmy Lin’s cloud computing courses at the University of Maryland, College Park

Some figures courtesy of

• Chuck Lam’s Hadoop In Action (2011)

• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)

Page 3: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Roots in Functional Programming

[Figure: Map applies a function f to each list element independently; Fold aggregates the elements with a function g and an accumulator]
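To make the connection concrete, here is a minimal sketch (not from the slides) of map and fold using Java 8's stream API; the data and functions are illustrative only:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldDemo {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4, 5);

        // Map: apply f (here, squaring) to each element independently --
        // no element depends on any other, so this parallelizes trivially
        List<Integer> squared = xs.stream()
                                  .map(x -> x * x)
                                  .collect(Collectors.toList());

        // Fold: aggregate all elements with g (here, addition),
        // starting from an initial accumulator value of 0
        int sum = xs.stream().reduce(0, (a, b) -> a + b);

        System.out.println(squared);  // [1, 4, 9, 16, 25]
        System.out.println(sum);      // 15
    }
}
```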

Page 4: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Divide and Conquer

[Figure: divide and conquer. The "Work" is partitioned into w1, w2, w3; each piece is handled by a "worker" producing r1, r2, r3; the partial results are combined into the "Result"]

Page 5: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

MapReduce

Page 6: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

“Big Ideas”

Scale “out”, not “up”

Limits of SMP and large shared-memory machines

Move processing to the data

Clusters have limited bandwidth

Process data sequentially, avoid random access

Seeks are expensive, disk throughput is reasonable

Seamless scalability

From the mythical man-month to the tradable machine-hour

Page 7: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Typical Large-Data Problem

Iterate over a large number of records

Compute something of interest from each

Shuffle and sort intermediate results

Aggregate intermediate results

Generate final output

Key idea: provide a functional abstraction for two of these operations: the per-record computation (map) and the aggregation (reduce)

(Dean and Ghemawat, OSDI 2004)

Page 8: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

MapReduce Data Flow

Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52

Page 9: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

MapReduce “Runtime”

Handles scheduling

Assigns workers to map and reduce tasks

Handles “data distribution”

Moves processes to data

Handles synchronization

Gathers, sorts, and shuffles intermediate data

Handles errors and faults

Detects worker failures and restarts failed tasks

Built on a distributed file system

Page 10: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

MapReduce

Programmers specify two functions

map ( K1, V1 ) → list ( K2, V2 )

reduce ( K2, list(V2) ) → list ( K3, V3)

Note the correspondence of types: map output → reduce input

Data Flow

Input → “input splits”: each a sequence of logical (K1,V1) “records”

Map

• Each split is processed in its entirety by a single map task

• map invoked iteratively: once per record in the split

• For each record processed, map may emit 0-N (K2,V2) pairs

Reduce

• reduce invoked iteratively for each ( K2, list(V2) ) intermediate value

• For each processed, reduce may emit 0-N (K3,V3) pairs

Each reducer’s output written to a persistent file in HDFS

Page 11: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

InputSplit

[Figure: an InputFormat divides the input files into InputSplits; a RecordReader parses each split into records and feeds them to a Mapper, which produces intermediate data. Source: redrawn from a slide by Cloudera, cc-licensed]

Page 12: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Data Flow

Input → “input splits”: each a sequence of logical (K1,V1) “records”

For each split, for each record, do map(K1,V1) (multiple calls)

Each map call may emit any number of (K2,V2) pairs (0-N)

Run-time

Groups all values with the same key into ( K2, list(V2) )

Determines which reducer will process each key

Copies data across network as needed for reducer

Ensures intra-node sort of keys processed by each reducer

• No guarantee by default of inter-node total sort across reducers

Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30

Page 13: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

“Hello World”: Word Count

Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);

map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
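For reference, here is the same word count as runnable Java against Hadoop's new API (the org.apache.hadoop.mapreduce classes covered on Page 32); a minimal sketch, with deliberately naive whitespace tokenization:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (byte offset, line of text) -> (word, 1) for each word
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);     // Emit(w, 1)
        }
    }
}

// Reducer: (word, [1, 1, ...]) -> (word, sum)
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {    // new API: Iterable, so foreach works
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);       // Emit(term, sum)
    }
}
```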

Page 14: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

[Figure: word-count data flow. Four map tasks emit (k,v) pairs: (b,1)(a,2), (c,3)(c,6), (a,5)(c,2), (b,7)(c,8). "Shuffle and Sort: aggregate values by keys" groups them into (a,[1,5]), (b,[2,7]), (c,[2,3,6,8]). Three reduce tasks then emit results (r1,s1), (r2,s2), (r3,s3). Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 45, 52]

Page 15: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Partition

Given: map ( K1, V1 ) → list ( K2, V2 )

reduce ( K2, list(V2) ) → list ( K3, V3)

partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]

Each distinct key (with associated values) sent to a single reducer

• Same reduce node may process multiple keys in separate reduce() calls

Balances workload across reducers: roughly equal number of keys to each

• Default: a simple hash of the key, e.g., hash(K2) mod N (N = # of reducers; see the sketch at the end of this slide)

Customizable

• Some keys require more computation than others

• e.g. value skew, or key-specific computation performed

• For skew, sampling can dynamically estimate distribution & set partition

• Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
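As a concrete reference for the default policy above, here is a sketch of the same hash-and-mod logic using the new API's Partitioner base class (this mirrors what Hadoop's built-in HashPartitioner does):

```java
import org.apache.hadoop.mapreduce.Partitioner;

// The default strategy: hash the key, mask off the sign bit so the
// result is non-negative, and take it modulo the number of reducers
public class HashStylePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```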

Page 16: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Secondary Sorting (Lin 57, White 241)

How to output sorted bigrams (1st word, then list of 2nds)?

What if we use word1 as the key, word2 as the value?

What if we use <first>--<second> as the key?

Pattern

Create a composite key of (first, second)

Define a Key Comparator based on both words

• This will produce the sort order we want (aa ab ac ba bb bc ca cb…)

Define a partition function based only on the first word (sketched below)

• All bigrams with the same first word go to same reducer

• How do you know when the first word changes across invocations?

Preserve state in the reducer across invocations

• Will be called separately for each bigram, but we want to remember the current first word across bigrams seen

Hadoop also provides Group Comparator
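A sketch of the partition-on-first-word step of this pattern, assuming (hypothetically) that the composite bigram key is serialized as a Text of the form "first<TAB>second"; the Key Comparator and Group Comparator would be defined analogously over the same composite key:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route each bigram by its first word only, so all bigrams sharing a
// first word reach the same reducer; the key comparator (not shown)
// still sorts on the full (first, second) composite key
public class FirstWordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String first = key.toString().split("\t")[0];
        return (first.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```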

Page 17: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Combine

Given: map ( K1, V1 ) → list ( K2, V2 )

reduce ( K2, list(V2) ) → list ( K3, V3)

combine ( K2, list(V2) ) → list ( K2, V2 )

Optional optimization

Local aggregation to reduce network traffic

No guarantee it will be used, nor how many times it will be called

Semantics of program cannot depend on its use

Signature: same input as reduce, same output as map

Combine may be run repeatedly on its own output

Lin: if reduce is associative & commutative, then combiner = reducer

• See next slide
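A sketch of the driver wiring for word count, reusing the mapper and reducer sketched on Page 13; setting the reducer as the combiner is safe here precisely because integer addition is associative and commutative (next slide):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");  // 0.20-era new-API job setup
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Optional combiner: local aggregation before the shuffle; the
        // framework may run it zero or more times, so program semantics
        // must not depend on it -- summing satisfies this
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```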

Page 18: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Functional Properties

Associative: f( a, f(b,c) ) = f( f(a,b), c )

Grouping of operations doesn’t matter

YES: Addition, multiplication, concatenation

NO: division, subtraction, NAND

NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )

Commutative: f(a,b) = f(b,a)

Ordering of arguments doesn’t matter

YES: addition, multiplication, NAND

NO: division, subtraction, concatenation

Concatenate("a","b") != Concatenate("b","a")

Distributive

White (p. 32) and Lam (p. 84) mention it with regard to combiners

But really, go with associative + commutative in Lin (pp. 20, 27)
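A quick check of these properties in Java (illustrative only):

```java
public class PropertiesDemo {
    // NAND on single bits
    static int nand(int a, int b) { return (a & b) == 1 ? 0 : 1; }

    public static void main(String[] args) {
        // NAND is not associative: grouping changes the result
        System.out.println(nand(1, nand(1, 0)));   // 0
        System.out.println(nand(nand(1, 0), 0));   // 1

        // Concatenation is associative but not commutative
        System.out.println("a".concat("b").equals("b".concat("a")));  // false
    }
}
```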

Page 19: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

[Figure: the same data flow as Page 14, with combine and partition steps inserted after each map task. The combiner locally collapses the second mapper's (c,3)(c,6) into (c,9), so after shuffle and sort the third reducer receives (c,[2,9,8]) instead of (c,[2,3,6,8]); the reducers again emit (r1,s1), (r2,s2), (r3,s3)]

Page 20: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

[Figure: MapReduce execution overview. (1) The user program submits the job to the master. (2) The master schedules map and reduce tasks onto workers. (3) Map workers read input splits 0-4 and (4) write intermediate files to local disk. (5) Reduce workers remotely read the intermediate data and (6) write output files 0-1. Overall flow: input files → map phase → intermediate files (on local disk) → reduce phase → output files. Adapted from (Dean and Ghemawat, OSDI 2004)]

Page 21: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Shuffle and 2 Sorts

As map emits values, local sorting runs in tandem (1st sort)

Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (more sorting of output)

Partition determines which (logical) reducer Rj each key will go to

Node’s TaskTracker tells JobTracker it has keys for Rj

JobTracker determines node to run Rj based on data locality

When local map/combine/sort finishes, sends data to Rj’s node

Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)

For each (K, list(V)) tuple in merged output, call reduce(…)

Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178

Page 22: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Distributed File System

Don’t move data… move computation to the data!

Store data on the local disks of nodes in the cluster

Start up the workers on the node that has the data local

Why?

Not enough RAM to hold all the data in memory

Disk access is slow, but disk throughput is reasonable

A distributed file system is the answer

GFS (Google File System) for Google’s MapReduce

HDFS (Hadoop Distributed File System) for Hadoop

Page 23: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

GFS: Assumptions

Commodity hardware over “exotic” hardware

Scale “out”, not “up”

High component failure rates

Inexpensive commodity components fail all the time

“Modest” number of huge files

Multi-gigabyte files are common, if not encouraged

Files are write-once, mostly appended to

Perhaps concurrently

Large streaming reads over random access

High sustained throughput over low latency

GFS slides adapted from material by (Ghemawat et al., SOSP 2003)

Page 24: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

GFS: Design Decisions

Files stored as chunks

Fixed size (64MB)

Reliability through replication

Each chunk replicated across 3+ chunkservers

Single master to coordinate access, keep metadata

Simple centralized management

No data caching

Little benefit due to large datasets, streaming reads

Simplify the API

Push some of the issues onto the client (e.g., data layout)

HDFS = GFS clone (same basic ideas)

Page 25: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Basic Cluster Components

1 “Manager” node (can be split onto 2 nodes)

Namenode (NN)

Jobtracker (JT)

1-N “Worker” nodes

Tasktracker (TT)

Datanode (DN)

Optional Secondary Namenode

Periodically checkpoints Namenode state in case of failure

Page 26: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Hadoop Architecture

Courtesy of Chuck Lam’s Hadoop In Action (2011), pp. 24-25

Page 27: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Namenode Responsibilities

Managing the file system namespace:

Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.

Coordinating file operations:

Directs clients to datanodes for reads and writes

No data is moved through the namenode

Maintaining overall health:

Periodic communication with the datanodes

Block re-replication and rebalancing

Garbage collection

Page 28: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Putting everything together…

[Figure: cluster layout. The job submission node runs the jobtracker and the namenode runs the namenode daemon; each slave node runs a tasktracker and a datanode daemon on top of the Linux file system]

Page 29: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Anatomy of a Job

MapReduce program in Hadoop = Hadoop job

Jobs are divided into map and reduce tasks (+ more!)

An instance of running a task is called a task attempt

Multiple jobs can be composed into a workflow

Job submission process

Client (i.e., driver program) creates a job, configures it, and submits it to the JobTracker

JobClient computes input splits (on client end)

Job data (jar, configuration XML) are sent to JobTracker

JobTracker puts job data in shared location, enqueues tasks

TaskTrackers poll for tasks

Off to the races…

Page 30: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Why have 1 API when you can have 2?

White pp. 25-27, Lam pp. 77-80

Hadoop 0.19 and earlier had “old API”

Hadoop 0.21 and forward has “new API”

Hadoop 0.20 has both!

Old API most stable, but deprecated

Current books use old API predominantly, but discuss changes

• Example code using new API available online from publisher

Some old API classes/methods not yet ported to new API

Cloud9 uses both, and you can too

Page 31: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

Old API

Mapper (interface)

void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)

void configure(JobConf job)

void close() throws IOException

Reducer/Combiner

void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)

void configure(JobConf job)

void close() throws IOException

Partitioner

int getPartition(K2 key, V2 value, int numPartitions)
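For contrast with the new-API word count sketched on Page 13, here is a minimal old-API mapper (illustrative; note OutputCollector and Reporter in place of Context):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API: Mapper is an interface; MapReduceBase supplies no-op
// configure() and close() implementations
public class OldApiWordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, ONE);   // old API: collect(), not write()
        }
    }
}
```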

Page 32: Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)

New API

org.apache.hadoop.mapred now deprecated; instead use org.apache.hadoop.mapreduce & org.apache.hadoop.mapreduce.lib

Mapper, Reducer now abstract classes, not interfaces

Use Context instead of OutputCollector and Reporter

Context.write(), not OutputCollector.collect()

Reduce takes value list as Iterable, not Iterator

Can use Java’s foreach syntax for iterating

Can throw InterruptedException as well as IOException

JobConf & JobClient replaced by Configuration & Job