
Jan 04, 2016


Some Details About Stream-and-Sort Operations

William W. Cohen


MERGE SORTS


Bottom-Up Merge Sort

use: input array A[n]; buffer array B[n]

• assert: A[ ] contains sorted runs of length r=1
• for run-length r=1,2,4,8,…
  – merge adjacent length-r runs in A[ ], copying the result into the buffer B[ ]
  – assert: B[ ] contains sorted runs of length 2*r
  – swap roles of A and B
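
The loop above can be sketched in Java as follows. This is a minimal illustration, not code from the slides: the class name BottomUpMergeSort and the helper merge are made up for the example, and it sorts a copy of an int array.

```java
import java.util.Arrays;

// Hypothetical illustration of the bottom-up merge sort outlined above.
class BottomUpMergeSort {
    /** Returns a sorted copy of the input. */
    static int[] sort(int[] input) {
        int n = input.length;
        int[] a = Arrays.copyOf(input, n); // A[]: starts as sorted runs of length r = 1
        int[] b = new int[n];              // B[]: buffer array

        for (int r = 1; r < n; r *= 2) {
            // Merge adjacent length-r runs in a[], copying the result into b[].
            for (int lo = 0; lo < n; lo += 2 * r) {
                int mid = Math.min(lo + r, n);
                int hi = Math.min(lo + 2 * r, n);
                merge(a, lo, mid, hi, b);
            }
            // b[] now contains sorted runs of length 2*r; swap roles of A and B.
            int[] tmp = a;
            a = b;
            b = tmp;
        }
        return a;
    }

    // Merges src[lo..mid) and src[mid..hi) into dst[lo..hi).
    private static void merge(int[] src, int lo, int mid, int hi, int[] dst) {
        int i = lo, j = mid;
        for (int k = lo; k < hi; k++) {
            if (i < mid && (j >= hi || src[i] <= src[j])) dst[k] = src[i++];
            else dst[k] = src[j++];
        }
    }

    public static void main(String[] args) {
        int[] sorted = sort(new int[] {5, 2, 9, 1, 5, 6, 0});
        System.out.println(Arrays.toString(sorted)); // prints [0, 1, 2, 5, 5, 6, 9]
    }
}
```

Each pass only reads one array sequentially and writes the other sequentially, which is why the same scheme carries over to tapes or files.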


Wikipedia on Old-School Merge Sort

Use four tape drives A,B,C,D

1. merge runs from A,B and write them alternately into C,D

2. merge runs from C,D and write them alternately into A,B

3. And so on….

Requires only constant memory.


21st Century Sorting


Unix Sort

• Load as much as you can [actually --buffer-size=SIZE] into memory and do an in-memory sort [usually quicksort].
• If you have more to do, then spill this sorted buffer out to disk, and get another buffer’s worth of data.
• Finally, merge your spill buffers.
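
The three steps above can be sketched as follows. This is a rough illustration under stated assumptions, not how sort is actually implemented: the "spill files" are in-memory lists standing in for files on disk, bufferSize counts lines rather than bytes, and the names SpillAndMergeSort and Cursor are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: sort a buffer at a time, "spill" each sorted run, then merge the spills.
class SpillAndMergeSort {
    /** A read position within one sorted spill, ordered by the line it currently points at. */
    private static final class Cursor implements Comparable<Cursor> {
        final List<String> spill;
        int pos = 0;
        Cursor(List<String> spill) { this.spill = spill; }
        String head() { return spill.get(pos); }
        @Override public int compareTo(Cursor other) { return head().compareTo(other.head()); }
    }

    static List<String> sort(Iterable<String> lines, int bufferSize) {
        List<List<String>> spills = new ArrayList<>(); // assumption: stands in for spill files on disk
        List<String> buffer = new ArrayList<>();
        for (String line : lines) {
            buffer.add(line);
            if (buffer.size() == bufferSize) {
                Collections.sort(buffer);  // in-memory sort of one buffer's worth of data
                spills.add(buffer);        // spill the sorted buffer
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {           // last, partially filled buffer
            Collections.sort(buffer);
            spills.add(buffer);
        }

        // Merge the spills: repeatedly emit the smallest head among all of them.
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> spill : spills) heap.add(new Cursor(spill));
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head());
            c.pos++;
            if (c.pos < c.spill.size()) heap.add(c); // re-insert with its next line
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of("pear", "apple", "fig", "kiwi", "banana"), 2));
    }
}
```

The merge step only ever holds one line per spill in the heap, so memory use during merging depends on the number of spills, not on the total input size.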


PIPES


How Unix Pipes Work

• Processes are all started at the same time
• Data streaming thru the pipe is held in a queue: writer […queue…] reader
• If the queue is full:
  – the writing process is blocked
• If the queue is empty:
  – the reading process is blocked
• (I think) queues are usually smallish: 64k
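
The blocking behavior described above can be imitated inside one Java program with two threads and a bounded queue. This is only an analogy: real pipes connect separate processes through a kernel buffer of bytes, while this sketch uses threads and a queue of lines. The names PipeSketch, pump, and the EOF sentinel are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical analogy for: writer [...queue...] reader
class PipeSketch {
    private static final String EOF = new String("<EOF>"); // sentinel, compared by identity

    /** Streams lines from a writer thread to a reader thread through a queue of the given capacity. */
    static List<String> pump(List<String> lines, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        List<String> received = new ArrayList<>();

        Thread writer = new Thread(() -> {
            try {
                for (String line : lines) queue.put(line); // blocks while the queue is full
                queue.put(EOF);                            // writer closes its end
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread reader = new Thread(() -> {
            try {
                while (true) {
                    String line = queue.take();            // blocks while the queue is empty
                    if (line == EOF) break;
                    received.add(line);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Both ends are started together, like the processes in a shell pipeline.
        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(pump(List.of("a", "b", "c", "d"), 2)); // prints [a, b, c, d]
    }
}
```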


How stream-and-sort works

• Pipeline is stream […queue…] sort
• Algorithm you get:
  – sort reads --buffer-size lines in, sorts them, spills them to disk
  – sort merges spill files after stream closes
  – stream is blocked when sort falls behind
  – and sort is blocked if it gets ahead


LOOKING AHEAD TO PARALLELIZATION…


Stream and Sort Counting → Distributed Counting

[Diagram: examples (example 1, example 2, example 3, …) → counting logic (Machines A1,…) emits “C[x] += D” → sort (Machines B1,…) → sorted updates (C[x1] += D1, C[x1] += D2, …) → logic to combine counter updates (Machines C1,…)]

• Trivial to parallelize! Easy to parallelize!
• Standardized message routing logic


Stream and Sort Counting → Distributed Counting

[Diagram: examples (example 1, example 2, example 3, …) → counting logic (Machines A1,…) emits “C[x] += D” → sort (Spill 1, Spill 2, Spill 3, then Merge Spill Files) → sorted updates (C[x1] += D1, C[x1] += D2, …) → logic to combine counter updates (Machines C1,…)]


Stream and Sort Counting → Distributed Counting

[Diagram: examples (example 1, example 2, example 3, …) → counting logic (Counter Machine) emits “C[x] += D” → sort (Spill 1, Spill 2, Spill 3, then Merge Spill Files) → sorted updates (C[x1] += D1, C[x1] += D2, …) → logic to combine counter updates (Combiner Machine)]


Stream and Sort Counting → Distributed Counting

[Diagram: two parallel pipelines. Counter Machine 1 and Counter Machine 2 each read examples (example 1, example 2, example 3, …), run the counting logic, and Partition/Sort their output into spills (Spill 1, Spill 2, Spill 3, …, Spill n). Combiner Machine 1 and Combiner Machine 2 each Merge Spill Files and apply the logic to combine counter updates (C[x1] += D1, C[x1] += D2, …).]
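
One way to read the Partition step in this diagram is that every update for the same counter must reach the same combiner machine. A minimal sketch of that idea follows, assuming hash partitioning on the counter key and messages of the form "key += delta". The slides do not say how partitioning is done, so the hashing rule and the names KeyPartitioner, keyOf, and combinerFor are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: route each counter update to a combiner chosen from its key.
class KeyPartitioner {
    /** Extracts the counter key from a message like "Y=sports ^ X=hockey += 1". */
    static String keyOf(String message) {
        return message.substring(0, message.lastIndexOf(" += "));
    }

    /** Picks a combiner index in [0, numCombiners); equal keys always get the same index. */
    static int combinerFor(String key, int numCombiners) {
        return Math.floorMod(key.hashCode(), numCombiners);
    }

    /** Splits messages into one list per combiner. */
    static List<List<String>> partition(List<String> messages, int numCombiners) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numCombiners; i++) parts.add(new ArrayList<>());
        for (String m : messages) {
            parts.get(combinerFor(keyOf(m), numCombiners)).add(m);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> messages = List.of(
            "Y=sports += 1", "Y=business += 1", "Y=sports ^ X=hockey += 1", "Y=sports += 1");
        List<List<String>> parts = partition(messages, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("combiner " + i + ": " + parts.get(i));
        }
    }
}
```

Because all updates for a key land on one combiner, each combiner can sort and sum its share independently of the others.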


COMMENTS ON BUFFERING


Review: Large-vocab Naïve Bayes

• Create a hashtable C
• For each example id, y, x1,….,xd in train:
  – C(“Y=ANY”) ++; C(“Y=y”) ++
  – For j in 1..d:
    • C(“Y=y ^ X=xj”) ++


Large-vocabulary Naïve Bayes

• Create a hashtable C
• For each example id, y, x1,….,xd in train:
  – C(“Y=ANY”) ++; C(“Y=y”) ++
  – Print “Y=ANY += 1”
  – Print “Y=y += 1”
  – For j in 1..d:
    • C(“Y=y ^ X=xj”) ++
    • Print “Y=y ^ X=xj += 1”
• Sort the event-counter update “messages”
• Scan the sorted messages and compute and output the final counter values

Think of these as “messages” to another component to increment the counters

java MyTrainer train | sort | java MyCountAdder > model
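
The Print steps of the trainer might look like the sketch below. This is a guess at the shape of such a component, not the actual MyTrainer: the class MessageEmitter and method messagesFor are hypothetical, and the example is passed in directly rather than parsed from a training file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the message-emitting part of a trainer.
class MessageEmitter {
    /** Returns the counter-update messages for one example with label y and words xs. */
    static List<String> messagesFor(String y, List<String> xs) {
        List<String> out = new ArrayList<>();
        out.add("Y=ANY += 1");
        out.add("Y=" + y + " += 1");
        for (String x : xs) {
            out.add("Y=" + y + " ^ X=" + x + " += 1");
        }
        return out;
    }

    public static void main(String[] args) {
        // In a pipeline, each message would go to stdout, one per line, for sort to consume.
        for (String message : messagesFor("sports", List.of("hockey", "hat"))) {
            System.out.println(message);
        }
    }
}
```

Nothing is stored between examples, so the memory used by this stage does not grow with the vocabulary.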


Large-vocabulary Naïve Bayes

• Create a hashtable C
• For each example id, y, x1,….,xd in train:
  – C(“Y=ANY”) ++; C(“Y=y”) ++
  – Print “Y=ANY += 1”
  – Print “Y=y += 1”
  – For j in 1..d:
    • C(“Y=y ^ X=xj”) ++
    • Print “Y=y ^ X=xj += 1”
• Sort the event-counter update “messages”
  – We’re collecting together messages about the same counter
• Scan and add the sorted messages and output the final counter values

Y=business += 1
Y=business += 1
…
Y=business ^ X=aaa += 1
…
Y=business ^ X=zynga += 1
Y=sports ^ X=hat += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
…
Y=sports ^ X=hoe += 1
…
Y=sports += 1
…


Large-vocabulary Naïve Bayes

Y=business += 1
Y=business += 1
…
Y=business ^ X=aaa += 1
…
Y=business ^ X=zynga += 1
Y=sports ^ X=hat += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
…
Y=sports ^ X=hoe += 1
…
Y=sports += 1
…

streaming Scan-and-add:

• previousKey = Null
• sumForPreviousKey = 0
• For each (event,delta) in input:
  • If event==previousKey
    • sumForPreviousKey += delta
  • Else
    • OutputPreviousKey()
    • previousKey = event
    • sumForPreviousKey = delta
• OutputPreviousKey()

define OutputPreviousKey():
• If previousKey!=Null
  • print previousKey,sumForPreviousKey

Accumulating the event counts requires constant storage … as long as the input is sorted.
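
A Java rendering of the scan-and-add pseudocode is sketched below. It assumes each message has the form "event += delta", as in the sample above, and that output lines are "event<TAB>sum"; the class name ScanAndAdd and that output format are choices made for the example, not something the slides specify.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of streaming scan-and-add over sorted counter-update messages.
class ScanAndAdd {
    /** Sums deltas over runs of equal events. The input must already be sorted by event. */
    static List<String> scanAndAdd(Iterable<String> sortedMessages) {
        List<String> out = new ArrayList<>();
        String previousKey = null;
        long sumForPreviousKey = 0;
        for (String message : sortedMessages) {
            int sep = message.lastIndexOf(" += ");
            String event = message.substring(0, sep);
            long delta = Long.parseLong(message.substring(sep + 4).trim());
            if (event.equals(previousKey)) {
                sumForPreviousKey += delta;
            } else {
                // A new event starts: emit the finished one, then reset the running sum.
                if (previousKey != null) out.add(previousKey + "\t" + sumForPreviousKey);
                previousKey = event;
                sumForPreviousKey = delta;
            }
        }
        if (previousKey != null) out.add(previousKey + "\t" + sumForPreviousKey);
        return out;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of(
            "Y=sports ^ X=hat += 1",
            "Y=sports ^ X=hockey += 1",
            "Y=sports ^ X=hockey += 1",
            "Y=sports ^ X=hockey += 1");
        for (String line : scanAndAdd(sorted)) System.out.println(line);
    }
}
```

Only previousKey and sumForPreviousKey are carried between messages, which is the constant-storage property noted above. The output list here exists only so the result can be inspected; a streaming version would print each line as it is produced.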


Distributed Counting → Stream and Sort Counting

[Diagram: examples (example 1, example 2, example 3, …) → counting logic on Machine 0 emits “C[x] += D” → message-routing logic → hash tables (Hash table1, Hash table2, …) on Machine 1, Machine 2, . . ., Machine K]


Distributed Counting → Stream and Sort Counting

[Diagram: examples (example 1, example 2, example 3, …) → counting logic (Machine A) emits “C[x] += D” → sort (Machine B) → sorted updates (C[x1] += D1, C[x1] += D2, …) → logic to combine counter updates (Machine C). A box labeled BUFFER is also shown.]


Review: Large-vocab Naïve Bayes

• Create a hashtable C
• For each example id, y, x1,….,xd in train:
  – C.inc(“Y=ANY”); C.inc(“Y=y”)
  – For j in 1..d:
    • C.inc(“Y=y ^ X=xj”)

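import java.util.HashMap;
import java.util.Map;

class EventCounter {
  static final int BUFFER_SIZE = 10000; // assumed value; the slide leaves it unspecified
  private final Map<String, Integer> hashtable = new HashMap<>();

  void inc(String event) {
    hashtable.merge(event, 1, Integer::sum); // increment the right hashtable slot
    if (hashtable.size() > BUFFER_SIZE) flush();
  }

  // Print buffered counts as "event<TAB>count" and empty the buffer.
  // Also call this once at end of input so the last partial buffer is not lost.
  void flush() {
    for (Map.Entry<String, Integer> e : hashtable.entrySet())
      System.out.println(e.getKey() + "\t" + e.getValue());
    hashtable.clear();
  }
}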


How much does buffering help?

small-events.txt: nb.jar
	time java -cp nb.jar com.wcohen.SmallStreamNB < RCV1.small_train.txt \
	| sort -k1,1 \
	| java -cp nb.jar com.wcohen.StreamSumReducer > small-events.txt

test-small: small-events.txt nb.jar
	time java -cp nb.jar com.wcohen.SmallStreamNB \
	RCV1.small_test.txt MCAT,CCAT,GCAT,ECAT 2000 < small-events.txt \
	| cut -f3 | sort | uniq -c

BUFFER_SIZE    Time    Message Size
none                   1.7M words
100            47s     1.2M
1,000          42s     1.0M
10,000         30s     0.7M
100,000        16s     0.24M
1,000,000      13s     0.16M
limit                  0.05M
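
The reason message size shrinks as BUFFER_SIZE grows can be seen on a toy stream. The sketch below is not the experiment in the table. It counts how many "event<TAB>n" messages a buffered counter would emit for a short made-up list of events, using the same flush rule as the EventCounter slide (flush when the table holds more than bufferSize distinct events) plus a final flush. The class name BufferedCounterDemo and the event list are invented for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of how buffering reduces the number of messages sent to sort.
class BufferedCounterDemo {
    /** Counts the messages a buffered counter would emit for the given event stream. */
    static int messagesEmitted(List<String> events, int bufferSize) {
        Map<String, Integer> table = new HashMap<>();
        int messages = 0;
        for (String event : events) {
            table.merge(event, 1, Integer::sum);
            if (table.size() > bufferSize) { // buffer full: flush the partial counts
                messages += table.size();
                table.clear();
            }
        }
        messages += table.size();            // final flush at end of input
        return messages;
    }

    public static void main(String[] args) {
        List<String> events = List.of("a", "a", "b", "a", "b", "c", "a", "a", "b", "a");
        for (int size : new int[] {0, 1, 3}) {
            System.out.println("bufferSize=" + size + " messages=" + messagesEmitted(events, size));
        }
    }
}
```

With no buffering every event becomes its own message. Once the buffer can hold every distinct event, the output is one message per distinct event, which corresponds to the "limit" row of the table.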