Large Scale Math with Hadoop MapReduce

Hadoop Summit 2011 presentation on Large Scale Math with Apache Hadoop MapReduce, by Tsz-Wo (Nicholas) Sze, PhD (June 29, 2011).

Transcript
Page 1: Large Scale Math with Hadoop MapReduce

Large Scale Math withHadoop MapReduce

Tsz-Wo (Nicholas) Sze, PhD

Hadoop SummitJune 29, 2011

1

Page 2: Large Scale Math with Hadoop MapReduce

Who am I?

• Hortonworks Software Engineer

• Apache Hadoop PMC Member

• Mathematician

I Interests:

F Distributed Computing

F Algorithms

F Number Theory

2

Page 3: Large Scale Math with Hadoop MapReduce

Agenda

• Introduction

• Integer Multiplication

• MapReduce-FFT

• MapReduce-Sum

• MapReduce-SSA

• A New World Record

• The “Machine” Behind the Computation

Tsz-Wo Sze, Hadoop Summit 2011 3

Page 5: Large Scale Math with Hadoop MapReduce

Typical Hadoop Applications

I Major applications of Hadoop include

• Search and crawling

• Text processing

• Machine learning

• ...

Tsz-Wo Sze, Hadoop Summit 2011 5

Page 6: Large Scale Math with Hadoop MapReduce

Typical Hadoop Applications

I Major applications of Hadoop include

• Search and crawling

• Text processing

• Machine learning

• ...

I But not yet commonly used in scientific

or mathematical applications.

Why?

Tsz-Wo Sze, Hadoop Summit 2011 6

Page 7: Large Scale Math with Hadoop MapReduce

Why Not Math?

I No MapReduce math libraries available, and

I More fundamentally,

MapReduce math algorithms are not well studied.

Tsz-Wo Sze, Hadoop Summit 2011 7

Page 8: Large Scale Math with Hadoop MapReduce

Existing Library

I Really no MapReduce Math Library?

Not exactly.

Tsz-Wo Sze, Hadoop Summit 2011 8

Page 9: Large Scale Math with Hadoop MapReduce

Existing Library

I Really no MapReduce Math Library?

Not exactly.

I Apache Mahout

• A machine learning library.

• Includes packages for matrix operations.

Tsz-Wo Sze, Hadoop Summit 2011 9

Page 10: Large Scale Math with Hadoop MapReduce

Existing Library

I Really no MapReduce Math Library?

Not exactly.

I Apache Mahout

• A machine learning library.

• Includes packages for matrix operations.

I Apache Hama (incubating)

• A matrix computational package.

Tsz-Wo Sze, Hadoop Summit 2011 10

Page 11: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (1)

I Integer Factoring

• a.k.a. breaking the RSA cryptosystem: given N, e and c, compute m such that

c ≡ m^e (mod N),

where N is a product of two primes.

• a 768-bit RSA modulus was factored in 2009 [1]

[1] Kleinjung et al., Factorization of a 768-bit RSA modulus, CRYPTO 2010.
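
A toy Java illustration of the congruence using java.math.BigInteger (tiny primes and message chosen only for the demo; it assumes gcd(e, φ(N)) = 1, which holds with overwhelming probability for random primes):

import java.math.BigInteger;
import java.security.SecureRandom;

/** Toy RSA: c = m^e mod N with a small modulus; real moduli are 1024+ bits,
 *  and factoring N into p and q is what breaks the system. */
public class RsaToy {
  public static void main(String[] args) {
    SecureRandom rnd = new SecureRandom();
    BigInteger p = BigInteger.probablePrime(64, rnd);
    BigInteger q = BigInteger.probablePrime(64, rnd);
    BigInteger N = p.multiply(q);                 // public modulus, a product of two primes
    BigInteger e = BigInteger.valueOf(65537);     // public exponent
    BigInteger m = new BigInteger("123456789");   // message
    BigInteger c = m.modPow(e, N);                // ciphertext: c = m^e mod N
    // knowing p and q, compute d = e^{-1} mod (p-1)(q-1) and recover m
    BigInteger phi = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
    BigInteger d = e.modInverse(phi);             // assumes gcd(e, phi) = 1
    System.out.println(c.modPow(d, N));           // prints 123456789
  }
}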

Tsz-Wo Sze, Hadoop Summit 2011 11

Page 12: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (2)

I Solving PDEs (Partial Differential Equations)

• Fluid dynamics

• Electromagnetism

• Financial analysis

• ...

(Two-dimensional Turbulence, courtesy of Y.K. Tsang)

Tsz-Wo Sze, Hadoop Summit 2011 12

Page 13: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (3)

I Finding complex zeros of the Riemann zeta function

ζ(s) = ∑_{n=1}^∞ 1/n^s for s ∈ ℂ, Re(s) > 1,

and then analytically continued to all s ≠ 1.

Tsz-Wo Sze, Hadoop Summit 2011 13

Page 14: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (3)

I Finding complex zeros of the Riemann zeta function

ζ(s) = ∑_{n=1}^∞ 1/n^s for s ∈ ℂ, Re(s) > 1,

and then analytically continued to all s ≠ 1.

• Disprove the Riemann Hypothesis (RH)

Then, you will get $1,000,000 [2].

However, RH is unlikely to be false.

[2] See http://www.claymath.org/millennium/Riemann_Hypothesis/.

Tsz-Wo Sze, Hadoop Summit 2011 14

Page 15: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (3)

I Finding complex zeros of the Riemann zeta function

ζ(s) = ∑_{n=1}^∞ 1/n^s for s ∈ ℂ, Re(s) > 1,

and then analytically continued to all s ≠ 1.

• Disprove the Riemann Hypothesis (RH)

Then, you will get $1,000,000.

However, RH is unlikely to be false.

• More likely:

Obtain more evidence supporting RH.

Tsz-Wo Sze, Hadoop Summit 2011 15

Page 16: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (4)

I Computing π

Latest world records:

• Five trillion decimal digits (August 2010)

F by Alexander Yee & Shigeru Kondo [3]

[3] See http://www.numberworld.org/misc_runs/pi-5t/announce_en.html

Tsz-Wo Sze, Hadoop Summit 2011 16

Page 17: Large Scale Math with Hadoop MapReduce

Computationally Intensive Problems (4)

I Computing π

Latest world records:

• Five trillion decimal digits (August 2010)

F by Alexander Yee & Shigeru Kondo

• The two quadrillionth bits (July 2010)

F by Tsz-Wo Sze &

the Yahoo! Cloud Computing Team [4]

[4] See http://developer.yahoo.net/blogs/hadoop/2010/09/two_quadrillionth_bit_pi.html

Tsz-Wo Sze, Hadoop Summit 2011 17

Page 18: Large Scale Math with Hadoop MapReduce

Missing Functionalities

I Fast Fourier Transform (FFT) – the basic routine behind many algorithms.

I Arbitrary Precision Arithmetic

F Integer functions

F Floating-point functions

F Complex functions

I ...

Tsz-Wo Sze, Hadoop Summit 2011 18

Page 19: Large Scale Math with Hadoop MapReduce

Agenda

• Introduction

• Integer Multiplication

• MapReduce-FFT

• MapReduce-Sum

• MapReduce-SSA

• A New World Record

• The “Machine” Behind the Computation

Tsz-Wo Sze, Hadoop Summit 2011 19

Page 20: Large Scale Math with Hadoop MapReduce

Why Integer Multiplication?

I There exist fast algorithms.

I Many applications

• Division

• Logarithm

• Trigonometric functions

• ...
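
For example, division reduces to multiplication through Newton's iteration x_{k+1} = x_k (2 − b·x_k), which converges quadratically to 1/b; a small double-precision Java sketch (the initial guess and iteration count are illustrative only):

/** Newton's iteration for a reciprocal uses only multiplications and
 *  subtractions, so fast multiplication also yields fast division. */
public class NewtonReciprocal {
  public static void main(String[] args) {
    double b = 7.0;
    double x = 0.1;                  // rough initial guess for 1/7
    for (int k = 0; k < 6; k++) {
      x = x * (2 - b * x);           // quadratic convergence: correct digits double each step
      System.out.println(x);
    }
    // the last line printed is 0.14285714285714285, i.e. 1/7 to double precision
  }
}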

Tsz-Wo Sze, Hadoop Summit 2011 20

Page 21: Large Scale Math with Hadoop MapReduce

Prerequisite of Algorithms

(D.J. Bernstein, Fast multiplication and its applications, ANTS 2008.)

Tsz-Wo Sze, Hadoop Summit 2011 21

Page 22: Large Scale Math with Hadoop MapReduce

Integer Multiplication Algorithms

I Naïve, O(N^2)

I Karatsuba, O(N^{log_2 3}) = O(N^{1.585})

I Toom-Cook, O(N^{log(2D−1)/log D})

If D = 3, then O(N^{log 5/log 3}) = O(N^{1.465})

I FFT-based algorithms, O(N log N ···)
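
As a concrete illustration of the Karatsuba bound, a minimal Java sketch (the splitting threshold and the BigInteger base case are arbitrary choices for the demo): three half-size multiplications replace the four of the naive method.

import java.math.BigInteger;

public class Karatsuba {
  private static final int THRESHOLD = 1024;   // bits; below this, fall back to BigInteger

  /** Karatsuba multiplication of non-negative integers. */
  public static BigInteger multiply(BigInteger x, BigInteger y) {
    int n = Math.max(x.bitLength(), y.bitLength());
    if (n <= THRESHOLD) {
      return x.multiply(y);
    }
    int half = n / 2;
    // split x = x1*2^half + x0 and y = y1*2^half + y0
    BigInteger x1 = x.shiftRight(half);
    BigInteger x0 = x.subtract(x1.shiftLeft(half));
    BigInteger y1 = y.shiftRight(half);
    BigInteger y0 = y.subtract(y1.shiftLeft(half));

    BigInteger p1 = multiply(x1, y1);                 // high * high
    BigInteger p0 = multiply(x0, y0);                 // low * low
    BigInteger pm = multiply(x1.add(x0), y1.add(y0))  // middle term via one extra multiply
                      .subtract(p1).subtract(p0);

    // x*y = p1*2^(2*half) + pm*2^half + p0
    return p1.shiftLeft(2 * half).add(pm.shiftLeft(half)).add(p0);
  }

  public static void main(String[] args) {
    BigInteger a = BigInteger.probablePrime(2048, new java.util.Random(1));
    BigInteger b = BigInteger.probablePrime(2048, new java.util.Random(2));
    System.out.println(multiply(a, b).equals(a.multiply(b)));  // prints true
  }
}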

Tsz-Wo Sze, Hadoop Summit 2011 22

Page 23: Large Scale Math with Hadoop MapReduce

FFT-based Algorithms

I Basic FFT, O(N log N log log N log log log N ···)

I Schönhage-Strassen, O(N log N log log N)

I Nussbaumer, O(N log N log log N)

I Fürer, O(N (log N) 2^{log* N})

I De-Kurur-Saha-Saptharishi, O(N (log N) 2^{log* N})

Tsz-Wo Sze, Hadoop Summit 2011 23

Page 24: Large Scale Math with Hadoop MapReduce

Convolution

I By the convolution theorem,

a × b = dft^{-1}(dft(a) ∗ dft(b)),

where

× denotes the convolution operator,

∗ denotes componentwise multiplication,

dft( · ) denotes discrete Fourier transform.
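
A small base-10 Java sketch of the idea: convolving the digit vectors (computed here directly rather than via DFTs) and then propagating carries yields the product; the base and array sizes are illustrative only.

public class ConvolutionMultiply {
  /** Multiply two non-negative integers given as base-10 digit arrays
   *  (least significant digit first): componentwise products first,
   *  carry propagation afterwards. */
  public static int[] multiply(int[] a, int[] b) {
    long[] conv = new long[a.length + b.length];
    for (int i = 0; i < a.length; i++)
      for (int j = 0; j < b.length; j++)
        conv[i + j] += (long) a[i] * b[j];     // the convolution, no carries yet

    int[] digits = new int[conv.length];
    long carry = 0;
    for (int k = 0; k < conv.length; k++) {    // carrying ("normalization" in SSA terms)
      long t = conv[k] + carry;
      digits[k] = (int) (t % 10);
      carry = t / 10;
    }
    return digits;
  }

  public static void main(String[] args) {
    int[] p = multiply(new int[]{1, 2, 3}, new int[]{6, 5, 4});  // 321 * 456
    StringBuilder s = new StringBuilder();
    for (int k = p.length - 1; k >= 0; k--) s.append(p[k]);
    System.out.println(s);                     // prints 146376
  }
}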

Tsz-Wo Sze, Hadoop Summit 2011 24

Page 25: Large Scale Math with Hadoop MapReduce

Schönhage-Strassen Algorithm (SSA)

I Represent integers as polynomials. Then, compute the convolution with DFTs modulo an integer [5].

[5] It has the form 2^n + 1 and is called the Schönhage-Strassen modulus.

Tsz-Wo Sze, Hadoop Summit 2011 25

Page 26: Large Scale Math with Hadoop MapReduce

SSA Steps

I Step 1: two DFTs,

â := dft(a) and b̂ := dft(b);

I Step 2: componentwise multiplication,

p̂ := â ∗ b̂;

I Step 3: a DFT inverse,

p = dft^{-1}(p̂);

I Step 4: normalization.

Tsz-Wo Sze, Hadoop Summit 2011 26

Page 27: Large Scale Math with Hadoop MapReduce

Calculating DFTs

I DFT can be calculated by a family of algorithms

called Fast Fourier Transform (FFT).

Tsz-Wo Sze, Hadoop Summit 2011 27

Page 28: Large Scale Math with Hadoop MapReduce

FFT Family

I Recursive-FFT

I Parallel-FFT

I Cooley-Tukey (decimation-in-time)

I Gentleman-Sande (decimation-in-frequency)

I Danielson-Lanczos

I Ping-pong FFT

I ...

Tsz-Wo Sze, Hadoop Summit 2011 28

Page 29: Large Scale Math with Hadoop MapReduce

Data Model(1)

I Need a data model which allows accessing

terabit integers efficiently.

I An integer x is represented as a D-dimensional

tuple

x = (x_{D−1}, x_{D−2}, ..., x_0).

Tsz-Wo Sze, Hadoop Summit 2011 29

Page 30: Large Scale Math with Hadoop MapReduce

Data Model(2)

I Write

D = IJ,

where I and J are powers of two.

I Define J-dimensional tuples

x^{(i)} := (x_{(J−1)I+i}, x_{(J−2)I+i}, ..., x_i)

for 0 ≤ i < I.

Tsz-Wo Sze, Hadoop Summit 2011 30

Page 31: Large Scale Math with Hadoop MapReduce

Data Model(3)

I Then,

( x^{(0)}   )   ( x_{(J−1)I}         x_{(J−2)I}         ...  x_0     )
( x^{(1)}   ) = ( x_{(J−1)I+1}       x_{(J−2)I+1}       ...  x_1     )
( ...       )   ( ...                ...                ...  ...     )
( x^{(I−1)} )   ( x_{(J−1)I+(I−1)}   x_{(J−2)I+(I−1)}   ...  x_{I−1} )

I We call it the (I, J)-format of x.

Tsz-Wo Sze, Hadoop Summit 2011 31

Page 32: Large Scale Math with Hadoop MapReduce

Data Model(4)

I Each x^{(i)} is a sequence of J records.

I Each record is a key-value pair.

Record #   <Key, Value>
0          < i, x_i >
1          < I + i, x_{I+i} >
...        ...
J − 1      < (J−1)I + i, x_{(J−1)I+i} >

Tsz-Wo Sze, Hadoop Summit 2011 32

Page 33: Large Scale Math with Hadoop MapReduce

Data Model(5)

I Thus, an integer is stored as I SequenceFiles in

HDFS; each SequenceFile contains J records.
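
A tiny Java sketch of this addressing scheme (illustrative sizes only): component x_k lands in SequenceFile number k mod I as record number k div I, so record r of file i carries key rI + i, as in the table above.

/** (I, J)-format addressing for a D-dimensional tuple, D = I*J. */
public class IJFormat {
  public static void main(String[] args) {
    final int I = 4, J = 8;                          // illustrative; real runs use large powers of two
    for (int file = 0; file < I; file++) {
      StringBuilder keys = new StringBuilder();
      for (int record = 0; record < J; record++) {
        keys.append(record * I + file).append(' ');  // key of record #record in file #file
      }
      System.out.println("file " + file + " holds x_{"
          + keys.toString().trim().replace(" ", "}, x_{") + "}");
    }
  }
}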

Tsz-Wo Sze, Hadoop Summit 2011 33

Page 34: Large Scale Math with Hadoop MapReduce

Parallel-FFT Steps

I Step 1: I inner J-point DFTs,

â^{(i)} = dft(a^{(i)});

I Step 2: componentwise shifting,

z_{jI+i} := ζ^{ij} â^{(i)}_j;

I Step 3: transposition,

z^{[j]} := (z_{jI+(I−1)}, z_{jI+(I−2)}, ..., z_{jI});

I Step 4: J outer I-point DFTs,

ẑ^{[j]} := dft(z^{[j]}).
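
A self-contained numerical Java sketch of these four steps, using naive floating-point DFTs for the inner and outer transforms (the real SSA code works modulo the Schönhage-Strassen modulus instead); the output index mapping k2*J + j follows from the standard Cooley-Tukey derivation, and main() checks the result against a direct DFT.

import java.util.Arrays;

/** Four-step ("parallel") FFT: an N-point DFT done as I inner J-point DFTs,
 *  twiddle multiplication, a transposition, and J outer I-point DFTs (N = I*J). */
public class FourStepDft {
  /** Naive O(n^2) DFT of (re, im); returns {realPart, imagPart}. */
  static double[][] dft(double[] re, double[] im) {
    int n = re.length;
    double[] or = new double[n], oi = new double[n];
    for (int k = 0; k < n; k++) {
      for (int t = 0; t < n; t++) {
        double ang = -2 * Math.PI * k * t / n;
        or[k] += re[t] * Math.cos(ang) - im[t] * Math.sin(ang);
        oi[k] += re[t] * Math.sin(ang) + im[t] * Math.cos(ang);
      }
    }
    return new double[][] { or, oi };
  }

  static double[][] fourStep(double[] re, double[] im, int I, int J) {
    int N = I * J;
    double[] zr = new double[N], zi = new double[N];
    // Steps 1 and 2: for each i, a J-point DFT of the stride-I subsequence a^(i), then twiddles
    for (int i = 0; i < I; i++) {
      double[] ar = new double[J], ai = new double[J];
      for (int j = 0; j < J; j++) { ar[j] = re[j * I + i]; ai[j] = im[j * I + i]; }
      double[][] A = dft(ar, ai);
      for (int j = 0; j < J; j++) {
        double ang = -2 * Math.PI * i * j / N;       // twiddle factor zeta^(i*j)
        double c = Math.cos(ang), s = Math.sin(ang);
        zr[j * I + i] = A[0][j] * c - A[1][j] * s;
        zi[j * I + i] = A[0][j] * s + A[1][j] * c;
      }
    }
    // Steps 3 and 4: for each j, gather z_{jI+i} over i (the transposition), then an I-point DFT
    double[] outR = new double[N], outI = new double[N];
    for (int j = 0; j < J; j++) {
      double[] br = new double[I], bi = new double[I];
      for (int i = 0; i < I; i++) { br[i] = zr[j * I + i]; bi[i] = zi[j * I + i]; }
      double[][] B = dft(br, bi);
      for (int k2 = 0; k2 < I; k2++) {               // component k2 of z^[j] is output index k2*J + j
        outR[k2 * J + j] = B[0][k2];
        outI[k2 * J + j] = B[1][k2];
      }
    }
    return new double[][] { outR, outI };
  }

  public static void main(String[] args) {
    double[] re = {1, 2, 3, 4, 5, 6, 7, 8}, im = new double[8];
    System.out.println(Arrays.toString(fourStep(re, im, 2, 4)[0]));
    System.out.println(Arrays.toString(dft(re, im)[0]));   // agrees up to rounding
  }
}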

Tsz-Wo Sze, Hadoop Summit 2011 34

Page 35: Large Scale Math with Hadoop MapReduce

MapReduce Model

[Diagram: Input → Map1..Map4 → Shuffle → Reduce1..Reduce4 → Output]

Tsz-Wo Sze, Hadoop Summit 2011 35

Page 36: Large Scale Math with Hadoop MapReduce

MapReduce-FFT

[Diagram: Input → Inner FFT1..FFT4 → Transposition (by shuffle) → Outer FFT1..FFT4 → Output]

Tsz-Wo Sze, Hadoop Summit 2011 36

Page 37: Large Scale Math with Hadoop MapReduce

Data Locality

I The FFT transposition, which traditionally makes

it hard to preserve data locality, becomes trivial in

MapReduce.

Tsz-Wo Sze, Hadoop Summit 2011 37

Page 38: Large Scale Math with Hadoop MapReduce

MapReduce-FFT(1)

I Map function:

(k1, v1) −→ list〈k2, v2〉

Algorithm 1 (Forward FFT, Mapper).

(f.m.1) read key i, value a(i);

(f.m.2) calculate a J-point DFT;

(f.m.3) componentwise multiply;

(f.m.4) for 0 ≤ j < J, emit key j, value (i, z_{jI+i}).

Tsz-Wo Sze, Hadoop Summit 2011 38

Page 39: Large Scale Math with Hadoop MapReduce

MapReduce-FFT(2)

I Reduce function:

(k2, list〈v2〉) −→ list〈k3, v3〉.

Algorithm 2 (Forward FFT, Reducer).

(f.r.1) receive key j, list [(i, z_{jI+i})] for 0 ≤ i < I;

(f.r.2) calculate an I-point DFT;

(f.r.3) write key j, value z^{[j]}.
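
A structural Hadoop (new API) sketch of these two functions; the class names are made up, the tuple values are carried as Text only for brevity, and the DFT/twiddle arithmetic is left as comments because the real DistFft code performs it modulo the Schönhage-Strassen modulus.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MapReduceFftSketch {

  /** Algorithm 1 (Forward FFT, Mapper): key i, value a^(i). */
  public static class InnerDftMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable i, Text aOfI, Context ctx)
        throws IOException, InterruptedException {
      int J = ctx.getConfiguration().getInt("fft.J", 1);
      // (f.m.2) compute the J-point DFT of a^(i)
      // (f.m.3) multiply the j-th component by the twiddle factor zeta^(i*j)
      for (long j = 0; j < J; j++) {
        // (f.m.4) emit key j, value (i, z_{jI+i}); the shuffle performs the transposition
        ctx.write(new LongWritable(j), new Text(i.get() + ":" + aOfI));
      }
    }
  }

  /** Algorithm 2 (Forward FFT, Reducer): key j, values [(i, z_{jI+i})] for 0 <= i < I. */
  public static class OuterDftReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable j, Iterable<Text> parts, Context ctx)
        throws IOException, InterruptedException {
      // (f.r.1) the I tagged components arrive in arbitrary order; the real code sorts them by i
      StringBuilder zOfJ = new StringBuilder();
      for (Text p : parts) zOfJ.append(p).append(' ');
      // (f.r.2) compute the I-point DFT of z^[j]
      // (f.r.3) write key j, value z^[j]
      ctx.write(j, new Text(zOfJ.toString().trim()));
    }
  }
}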

Tsz-Wo Sze, Hadoop Summit 2011 39

Page 40: Large Scale Math with Hadoop MapReduce

Normalization

I Normalization can be viewed as a summation of three integers.

Tsz-Wo Sze, Hadoop Summit 2011 40

Page 41: Large Scale Math with Hadoop MapReduce

Summation

I Integer summation can be done by (1) componentwise

summation, (2) carry evaluation, and then (3) parallel carrying.
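
A single-JVM Java sketch of the three steps for the sum of two integers, with base-10 limbs, chunks standing in for HDFS records, and parallel streams standing in for the map and reduce tasks (the talk's normalization sums three integers, but the shape of the computation is the same):

import java.util.stream.IntStream;

public class ParallelCarry {
  static final int BASE = 10;

  /** Add two equal-length little-endian limb arrays. */
  static int[] add(int[] a, int[] b, int chunk) {
    int n = a.length;
    int[] s = new int[n];
    // (1) componentwise summation -- embarrassingly parallel (a map-only job in the talk)
    IntStream.range(0, n).parallel().forEach(k -> s[k] = a[k] + b[k]);

    // (2) carry evaluation: each chunk's outgoing carry, assuming incoming carry 0 or 1
    int chunks = (n + chunk - 1) / chunk;
    int[] out0 = new int[chunks], out1 = new int[chunks];
    IntStream.range(0, chunks).parallel().forEach(c -> {
      out0[c] = chunkCarry(s, c * chunk, Math.min(n, (c + 1) * chunk), 0);
      out1[c] = chunkCarry(s, c * chunk, Math.min(n, (c + 1) * chunk), 1);
    });
    int[] in = new int[chunks];   // resolve incoming carries sequentially: one value per chunk, cheap
    for (int c = 1; c < chunks; c++) in[c] = (in[c - 1] == 0) ? out0[c - 1] : out1[c - 1];

    // (3) parallel carrying: each chunk is normalized independently given its incoming carry
    int[] r = new int[n];
    IntStream.range(0, chunks).parallel().forEach(c -> {
      int carry = in[c];
      for (int k = c * chunk; k < Math.min(n, (c + 1) * chunk); k++) {
        int t = s[k] + carry;
        r[k] = t % BASE;
        carry = t / BASE;
      }
    });
    return r;
  }

  private static int chunkCarry(int[] s, int from, int to, int carry) {
    for (int k = from; k < to; k++) carry = (s[k] + carry) / BASE;
    return carry;
  }

  public static void main(String[] args) {
    int[] a = {9, 9, 9, 9, 2, 0, 0, 0};   // 29999, little-endian with headroom
    int[] b = {1, 0, 0, 0, 3, 0, 0, 0};   // 30001
    System.out.println(java.util.Arrays.toString(add(a, b, 4)));  // [0, 0, 0, 0, 6, 0, 0, 0] = 60000
  }
}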

Tsz-Wo Sze, Hadoop Summit 2011 41

Page 42: Large Scale Math with Hadoop MapReduce

MapReduce Model

[Diagram: Input → Map1..Map4 → Shuffle → Reduce1..Reduce4 → Output]

Tsz-Wo Sze, Hadoop Summit 2011 42

Page 43: Large Scale Math with Hadoop MapReduce

MapReduce-Sum

[Diagram: Input → Summation1..Summation4 → Carry Evaluation (modified shuffle) → Carrying1..Carrying4 → Output]

Tsz-Wo Sze, Hadoop Summit 2011 43

Page 44: Large Scale Math with Hadoop MapReduce

Job 1: Componentwise Summation

[Diagram: Input → Summation1..Summation4 → Output]

I A map-only job.

Tsz-Wo Sze, Hadoop Summit 2011 44

Page 45: Large Scale Math with Hadoop MapReduce

Job 2: Carrying

[Diagram: Input → Carry Evaluation → Carrying1..Carrying4 → Output]

Tsz-Wo Sze, Hadoop Summit 2011 45

Page 46: Large Scale Math with Hadoop MapReduce

MapReduce-SSA

I two concurrent forward FFT jobs;

I a backward FFT job with componentwise

multiplication and splitting ;

I a componentwise summation map-only job;

I a carrying job [6].

[6] It is possible to combine the last two jobs if we modify the shuffle process in MapReduce.

Tsz-Wo Sze, Hadoop Summit 2011 46

Page 47: Large Scale Math with Hadoop MapReduce

Prototype Implementation

I DistMpMult – distributed multi-precision multiplication

F DistFft – distributed FFT

F DistCompSum – distributed componentwise

summation

F DistCarrying – distributed carrying

I Open source – available at

https://issues.apache.org/jira/browse/MAPREDUCE-2471

Tsz-Wo Sze, Hadoop Summit 2011 47

Page 48: Large Scale Math with Hadoop MapReduce

Cluster Configuration

I A shared cluster:

F Apache Hadoop 0.20

F 1350 nodes

F 6 GB memory per node

F 2 map tasks & 1 reduce task per node

F Imposed a limitation on the aggregated

memory usage of individual jobs.

Tsz-Wo Sze, Hadoop Summit 2011 48

Page 49: Large Scale Math with Hadoop MapReduce

Running Time

I Actual running time for 2^36 ≤ N ≤ 2^40.

[Plot: log(t) against log(N), where t is the elapsed time in seconds.]

Tsz-Wo Sze, Hadoop Summit 2011 49

Page 50: Large Scale Math with Hadoop MapReduce

Agenda

• Introduction

• Integer Multiplication

• MapReduce-FFT

• MapReduce-Sum

• MapReduce-SSA

• A New World Record

• The “Machine” Behind the Computation

Tsz-Wo Sze, Hadoop Summit 2011 50

Page 51: Large Scale Math with Hadoop MapReduce

What is π?

I π is a mathematical

constant such that,

for any circle,

π = circumference / diameter = C / d.

Tsz-Wo Sze, Hadoop Summit 2011 51

Page 52: Large Scale Math with Hadoop MapReduce

What is π?

I π is a mathematical

constant such that,

for any circle,

π = circumference / diameter = C / d.

I We have π = 3.244

Tsz-Wo Sze, Hadoop Summit 2011 52

Page 53: Large Scale Math with Hadoop MapReduce

What is π?

I π is a mathematical

constant such that,

for any circle,

π = circumference / diameter = C / d.

I We have π = 3.244 (in hexadecimal)

Tsz-Wo Sze, Hadoop Summit 2011 53

Page 54: Large Scale Math with Hadoop MapReduce

Decimal, Hexadecimal & Binary

I Representing π in different bases

π = 3.1415926535 8979323846 2643383279 ...

= 3.243F6A88 85A308D3 13198A2E ...

= 11.00100100 00111111 01101010 ...

I Bit position is counted after the radix point.

I e.g., the eight bits starting at the ninth bit position

are 00111111 in binary or 3F in hexadecimal.
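
A few lines of Java that check this convention against the hexadecimal expansion above (the hex digits are hard-coded for the demo):

/** Bits of the fractional part of pi, counted from 1 after the radix point. */
public class PiBits {
  public static void main(String[] args) {
    String hexFraction = "243F6A8885A308D3";   // pi = 3.243F6A88 85A308D3 ... (hex)
    StringBuilder bits = new StringBuilder();
    for (char c : hexFraction.toCharArray()) {
      String nibble = Integer.toBinaryString(Integer.parseInt(String.valueOf(c), 16));
      bits.append("0000".substring(nibble.length())).append(nibble);  // pad each hex digit to 4 bits
    }
    // the eight bits starting at bit position 9 (1-based, after the radix point)
    System.out.println(bits.substring(8, 16));  // prints 00111111, i.e. 0x3F
  }
}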

Tsz-Wo Sze, Hadoop Summit 2011 54

Page 55: Large Scale Math with Hadoop MapReduce

A New World Record

I Yahoo! Cloud Computing (July 2010)

• Machines: Idle slices of 1000-node clusters

Each node has two quad-core 1.8-2.5 GHz CPUs

• Duration: 23 days

• CPU time: 503 years

• Verification: 582 years CPU time

Tsz-Wo Sze, Hadoop Summit 2011 55

Page 56: Large Scale Math with Hadoop MapReduce

A New World Record

I Bit values (in hexadecimal)

0E6C1294 AED40403 F56D2D76 4026265B

CA98511D 0FCFFAA1 0F4D28B1 BB5392B8

Tsz-Wo Sze, Hadoop Summit 2011 56

Page 57: Large Scale Math with Hadoop MapReduce

A New World Record

I Bit values (in hexadecimal)

0E6C1294 AED40403 F56D2D76 4026265B

CA98511D 0FCFFAA1 0F4D28B1 BB5392B8

(256 bits)

F The first bit position: 1,999,999,999,999,997 (= 2·10^15 − 3)

F The last bit position: 2,000,000,000,000,252 (= 2·10^15 + 252)

F The two quadrillionth (2·10^15 th) bit is 0.

Tsz-Wo Sze, Hadoop Summit 2011 57

Page 58: Large Scale Math with Hadoop MapReduce

BBC News (16 Sep 2010)

I Pi record smashed as team finds two-quadrillionth digit

http://www.bbc.co.uk/news/technology-11313194

Tsz-Wo Sze, Hadoop Summit 2011 58

Page 59: Large Scale Math with Hadoop MapReduce

NewScientist (17 Sep 2010)

I New pi record exploits Yahoo’s computers

http://www.newscientist.com/article/dn19465-new-pi-record-exploits-yahoos-computers.html

Tsz-Wo Sze, Hadoop Summit 2011 59

Page 60: Large Scale Math with Hadoop MapReduce

Other News Coverage

I New Pi Record Exploits Yahoo’s Computers

http://cacm.acm.org/news/99207-new-pi-record-exploits-yahoos-computers

I The Yahoo! boffin scores pi’s two

quadrillionth bit

http://www.theregister.co.uk/2010/09/16/pi_record_at_yahoo

I Pi calculation more than doubles old record

http://www.radionz.co.nz/news/world/57128/pi-calculation-more-than-doubles-old-record

I Hadoop used to calculate Pi’s two quadrillionth bit

http://www.zdnet.co.uk/blogs/mapping-babel-10017967/hadoop-used-to-calculate-pis-two-quadrillionth-bit-10018670/

Tsz-Wo Sze, Hadoop Summit 2011 60

Page 61: Large Scale Math with Hadoop MapReduce

I Yahoo! researcher breaks Pi record in finding

the two-quadrillionth digit

http://www.engadget.com/2010/09/17/yahoo-researcher-breaks-pi-record-in-finding-the-two-quadrillio

I Nicholas Sze of Yahoo Finds Two-Quadrillionth

Digit of Pi

http://science.slashdot.org/story/10/09/16/2155227/Nicholas-Sze-of-Yahoo-Finds-Two-Quadrillionth-Digit-of-Pi

I The 2,000,000,000,000,000th digit of the mathematical

constant pi discovered

http://news.gather.com/viewArticle.action?articleId=281474978525563

I Researcher Shatters Pi Record by Finding

Two-Quadrillionth Digit

http://www.maximumpc.com/article/news/researcher_shatters_pi_record_finding_two-quadrillionth_digit

Tsz-Wo Sze, Hadoop Summit 2011 61

Page 62: Large Scale Math with Hadoop MapReduce

I A bigger slice of pi

http://radar.oreilly.com/2010/09/strata-week-grabbing-a-slice.html

I 2 Quadrillionth digit of PI is found: Scientist

celebration in worldwide Pandemonium

http://engforum.pravda.ru/showthread.php?296242-2-Quadrillionth-digit-of-PI-is-found-Scientist-celebration-in-worldwide-Pandemonium

I And the number is...0

http://www.hexus.net/content/item.php?item=26505

I Pi Record Smashed as Team Finds Two-

Quadrillionth Digit

http://hardocp.com/news/2010/09/16/pi_record_smashed_as_team_finds_twoquadrillionth_digit

Tsz-Wo Sze, Hadoop Summit 2011 62

Page 63: Large Scale Math with Hadoop MapReduce

I Yahoo Engineer Calculates Two Quadrillionth

Bit Of Pi

http://www.webpronews.com/topnews/2010/09/17/yahoo-engineer-calculates-two-quadrillionth-bit-of-pi

I A Cloud Computing Milestone: Yahoo!

Reaches the 2 Quadrillionth Bit of Pi

http://www.readwriteweb.com/cloud/2010/09/a-cloud-computing-milestone-ya.php

I Yahoo researcher Nicolas Sze determines

the 2,000,000,000,000,000th digit of the mathematical

constant pi

http://www.thaindian.com/newsportal/sci-tech/yahoo-researcher-nicolas-sze-determines-the-2000000000000000th-digit-of-the-mathematical-constant-pi_100430278.html

I ...

Tsz-Wo Sze, Hadoop Summit 2011 63

Page 64: Large Scale Math with Hadoop MapReduce

Computing π

I How to compute the nth bits of π?

Tsz-Wo Sze, Hadoop Summit 2011 64

Page 65: Large Scale Math with Hadoop MapReduce

Computing π

I How to compute the nth bits of π?

Let’s ignore this question in this talk ...

and focus on:

Tsz-Wo Sze, Hadoop Summit 2011 65

Page 66: Large Scale Math with Hadoop MapReduce

Computing π

I How to compute the nth bits of π?

Let’s ignore this question in this talk ...

and focus on:

I How to execute such huge computation?

Tsz-Wo Sze, Hadoop Summit 2011 66

Page 67: Large Scale Math with Hadoop MapReduce

Map- & Reduce-side Computations

I Developed a generic framework to execute tasks

on either the map-side or the reduce-side.

I Applications define two functions:

• partition(c,m):

partition the computation c into m parts.

• compute(c):

execute the computation c
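
One hypothetical Java shape for this contract (the actual interfaces in the MAPREDUCE-1923 code may be named and typed differently): a map-side or reduce-side job calls partition once, then hands each part to a task that calls compute.

import java.util.List;

/** Hypothetical application contract of the generic framework. */
public interface MachineComputable<C extends MachineComputable<C>> {
  List<C> partition(int m);  // partition(c, m): split this computation c into m parts
  void compute();            // compute(c): execute this computation c
}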

Tsz-Wo Sze, Hadoop Summit 2011 67

Page 68: Large Scale Math with Hadoop MapReduce

Map-side Job

I Contains multiple mappers and zero reducers

• A PartitionInputFormat partitions c

into m parts

• Each part is executed by a mapper

Tsz-Wo Sze, Hadoop Summit 2011 68

Page 69: Large Scale Math with Hadoop MapReduce

Reduce-side Job

I Contains a mapper and multiple reducers

• A SingletonInputFormat launches

a PartitionMapper

• An Indexer launches m reducers.

Tsz-Wo Sze, Hadoop Summit 2011 69

Page 70: Large Scale Math with Hadoop MapReduce

Abstract Machine(1)

I Machine

– an abstract base class allows abstract Runner(s)

to execute MachineComputable tasks.

I Machine subclasses

• Map-Side Machine (m100t3): 100 maps with 3 threads each.

• Reduce-Side Machine (r50t2): 50 reduces with 2 threads each.

Tsz-Wo Sze, Hadoop Summit 2011 70

Page 71: Large Scale Math with Hadoop MapReduce

Abstract Machine(2)

I More Machine subclasses

• Mix Machine – chooses Map-/Reduce-side

jobs according to the cluster status.

x-m200t1-r100t2-5: either launch a job with 200 maps

with 1 thread each, or a job with 100 reduces with 2 threads each.

• Alternation Machine – alternates Map-side

and Reduce-side jobs in a regular pattern.

a-m200t1-r100t2-mrr: submit a map job, then a

reduce job, then another reduce job, and repeat this pattern.

• Null Machine – does nothing for testing.

Tsz-Wo Sze, Hadoop Summit 2011 71

Page 72: Large Scale Math with Hadoop MapReduce

Utilizing The Idle Slices

I Monitor cluster status

• Submit a map-side (or reduce-side) job if there

are sufficient available map (or reduce) slots.

I Small jobs

• Hold resources only for a short period of time

I Interruptible & resumable

• can be interrupted at any time by simply

killing the running jobs

Tsz-Wo Sze, Hadoop Summit 2011 72

Page 73: Large Scale Math with Hadoop MapReduce

Running The Jobs

Tsz-Wo Sze, Hadoop Summit 2011 73

Page 74: Large Scale Math with Hadoop MapReduce

The Implementation

I Main programs:

F DistBbp – a program to submit jobs.

F DistSum – distributed summation.

I Open source – available at

https://issues.apache.org/jira/browse/MAPREDUCE-1923

Tsz-Wo Sze, Hadoop Summit 2011 74

Page 75: Large Scale Math with Hadoop MapReduce

The World Record Computation

I 35,000 MapReduce jobs; each job has either:

• 200 map tasks with one thread each, or

• 100 reduce tasks with two threads each.

I Each thread computes 200,000,000 terms

• ∼45 minutes.

I Submit up to 60 concurrent jobs

I The entire computation took:

• 23 days of real time and 503 CPU years

Tsz-Wo Sze, Hadoop Summit 2011 75

Page 76: Large Scale Math with Hadoop MapReduce

References

• [1] Tsz-Wo Sze. Schönhage-Strassen Algorithm with MapReduce for Multiplying Terabit Integers. Symbolic-Numeric Computation 2011, to appear. Preprint available at http://people.apache.org/~szetszwo/ssmr20110430.pdf

• [2] Tsz-Wo Sze. The Two Quadrillionth Bit of Pi is 0! Distributed Computation of Pi with Apache Hadoop. In IEEE 2nd International Conference on Cloud Computing Technology and Science (CloudCom), pages 727-732, 2010. (Earlier versions available at http://arxiv.org/abs/1008.3171)

Tsz-Wo Sze, Hadoop Summit 2011 76

Page 77: Large Scale Math with Hadoop MapReduce

Thank you!

Tsz-Wo Sze, Hadoop Summit 2011 77