Segregated storage and compute

  • 2

    • Segregated storage and compute
      – NFS, GPFS, PVFS, Lustre
      – Batch-scheduled systems: Clusters, Grids, and Supercomputers
      – Programming paradigm: HPC, MTC, and HTC

    • Co-located storage and compute
      – HDFS, GFS
      – Data centers at Google, Yahoo, and others
      – Programming paradigm: MapReduce
      – Others from academia: Sector, MosaStore, Chirp


  • 6

    [Chart: MB/s per processor core, 2002-2004 vs. today, for local disk, cluster, and supercomputer storage (log scale, 0.1-1000 MB/s), annotated with per-core bandwidth drops of 2.2X, 15X, 99X, and 438X]

    • Local Disk:
      – 2002-2004: ANL/UC TG Site (70GB SCSI)
      – Today: PADS (RAID-0, 6 drives 750GB SATA)
    • Cluster:
      – 2002-2004: ANL/UC TG Site (GPFS, 8 servers, 1Gb/s each)
      – Today: PADS (GPFS, SAN)
    • Supercomputer:
      – 2002-2004: IBM Blue Gene/L (GPFS)
      – Today: IBM Blue Gene/P (GPFS)

  • 7

    What if we could keep the scientific community’s existing programming paradigms, yet still exploit the data locality that naturally occurs in scientific workloads?

  • 8

  • 9

    [Figure: problem space quadrants of Input Data Size (Low, Med, Hi) vs. Number of Tasks (1, 1K, 1M) — HPC (heroic MPI tasks), HTC/MTC (many loosely coupled tasks), MapReduce/MTC (data analysis, mining), and MTC (big data and many tasks)]

    [MTAGS08] “Many-Task Computing for Grids and Supercomputers”

  • 10

    • Important concepts related to the hypothesis
      – Workload: a complex query (or set of queries) decomposable into simpler tasks to answer broader analysis questions
      – Data locality is crucial to the efficient use of large scale distributed systems for scientific and data-intensive applications
      – Allocate computational and caching storage resources, co-scheduled to optimize workload performance

    “Significant performance improvements can be obtained in the analysis of large datasets by leveraging information about data analysis workloads rather than individual data analysis tasks.”

  • 11

    [Figure: data diffusion architecture — a task dispatcher with a data-aware scheduler drawing on persistent storage, a shared file system, idle resources, and provisioned resources]

    • Resources are acquired in response to demand (sketched below)
    • Data diffuses from archival storage to newly acquired transient resources
    • Resource “caching” allows faster responses to subsequent requests
    • Resources are released when demand drops
    • Optimizes performance by co-scheduling data and computations
    • Decreases dependency on shared/parallel file systems
    • Critical to support data-intensive MTC

    [DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
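    To make the acquire/release behavior above concrete, here is a minimal Python sketch of a demand-driven provisioner with per-node caches. It is illustrative only: the class and field names are invented here, and the real Falkon provisioner and cache management are more involved.

      import time
      from dataclasses import dataclass, field

      @dataclass
      class Node:
          cache: dict = field(default_factory=dict)        # per-node file cache
          last_used: float = field(default_factory=time.time)

      class Provisioner:
          def __init__(self, max_nodes=128, idle_timeout=60.0):
              self.nodes = []                               # provisioned transient resources
              self.max_nodes = max_nodes
              self.idle_timeout = idle_timeout

          def adjust(self, queue_length):
              # Acquire: grow the pool toward current demand, capped by the allocation limit.
              while len(self.nodes) < min(queue_length, self.max_nodes):
                  self.nodes.append(Node())
              # Release: drop nodes idle longer than the timeout; files cached on them
              # are simply re-fetched from persistent storage if requested again.
              now = time.time()
              self.nodes = [n for n in self.nodes if now - n.last_used <= self.idle_timeout]

      pool = Provisioner()
      pool.adjust(queue_length=10)
      print(len(pool.nodes))                                # 10 nodes acquired on demand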

  • 12

    • What would data diffusion look like in practice?
    • Extend the Falkon framework

    [SC07] “Falkon: a Fast and Light-weight tasK executiON framework”

  • 13

    • FA: first-available – simple load balancing
    • MCH: max-cache-hit – maximize cache hits
    • MCU: max-compute-util – maximize processor utilization
    • GCC: good-cache-compute – maximize both cache hit rate and processor utilization at the same time (see the sketch below)

    [DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
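    The following Python fragment is a rough, illustrative rendering of how a dispatcher might choose a worker under these four policies; the function and field names are invented here, and the GCC behavior (switching between MCH and MCU around a utilization target) is a simplification of the published heuristic, not Falkon's actual code.

      def pick_worker(task_files, workers, policy="GCC", util_target=0.9):
          """workers: non-empty list of dicts with 'cache' (set of file names) and 'busy' (bool).
          Returns the chosen worker, or None to keep the task queued."""
          def hits(w):
              return len(task_files & w["cache"])           # cached input files on this worker

          idle = [w for w in workers if not w["busy"]]
          if policy == "FA":        # first-available: plain load balancing, locality ignored
              return idle[0] if idle else None
          if policy == "MCH":       # max-cache-hit: wait for the node holding the most input data
              best = max(workers, key=hits)
              return best if not best["busy"] else None
          if policy == "MCU":       # max-compute-util: never leave a CPU idle; prefer locality among idle nodes
              return max(idle, key=hits) if idle else None
          if policy == "GCC":       # good-cache-compute: act like MCH while utilization is high, else like MCU
              util = 1 - len(idle) / len(workers)
              return pick_worker(task_files, workers, "MCH" if util >= util_target else "MCU")
          raise ValueError(f"unknown policy {policy}")

      workers = [{"cache": {"f1", "f2"}, "busy": False}, {"cache": {"f3"}, "busy": True}]
      print(pick_worker({"f1"}, workers, policy="MCH"))     # worker 0: it already caches f1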

  • 14

    [Charts: per-task CPU time (ms) and throughput (tasks/sec) for first-available without I/O, first-available with I/O, max-compute-util, max-cache-hit, and good-cache-compute, broken down into task submit, notification for task availability, task dispatch (data-aware scheduler), task results (data-aware scheduler), notification for task results, and WS communication]

    • 3GHz dual CPUs
    • ANL/UC TG with 128 processors
    • Scheduling window: 2500 tasks
    • Dataset: 100K files, 1 byte each
    • Tasks: read 1 file, write 1 file

    [DIDC09] “Towards Data Intensive Many-Task Computing”, under review

  • 15

    • Monotonically Increasing Workload – emphasizes increasing loads
    • Sine-Wave Workload – emphasizes varying loads
    • All-Pairs Workload – compares against the best-case model of active storage
    • Image Stacking Workload (Astronomy) – evaluates data diffusion on a real large-scale data-intensive application from the astronomy domain

    [DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”

  • 16

    • 250K tasks – 10MB reads, 10ms compute
    • Vary arrival rate:
      – Min: 1 task/sec
      – Increment function: CEILING(*1.3) (see the generator sketch below)
      – Max: 1000 tasks/sec
    • 128 processors
    • Ideal case: 1415 sec, 80Gb/s peak throughput

    [Chart: arrival rate (per second) and tasks completed (up to 250K) over time for the monotonically increasing workload]
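    As a minimal sketch (my own reading, assuming “CEILING(*1.3)” means the rate is multiplied by 1.3 and rounded up each second), the arrival schedule above can be generated like this:

      import math

      def arrival_schedule(total_tasks=250_000, start=1, factor=1.3, cap=1000):
          """rates[t] = number of tasks submitted during second t."""
          rates, remaining, rate = [], total_tasks, start
          while remaining > 0:
              issued = min(rate, remaining)
              rates.append(issued)
              remaining -= issued
              rate = min(cap, math.ceil(rate * factor))     # CEILING(*1.3), capped at 1000/sec
          return rates

      schedule = arrival_schedule()
      print(len(schedule), "seconds of submission,", sum(schedule), "tasks total")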

  • 17

    • GPFS vs. ideal: 5011 sec vs. 1415 sec

    [Chart: throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated over time for the GPFS baseline]

  • 18

    [Charts: max-compute-util and max-cache-hit runs — cache hit/miss %, CPU utilization %, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated over time]

  • 19

    [Charts: good-cache-compute with 1GB, 1.5GB, 2GB, and 4GB per-node caches — cache hit/miss %, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated over time]

  • 20

    • Data Diffusion vs. ideal: 1436 sec vs. 1415 sec

    [Chart: cache hit/miss %, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated over time]

  • 21

    • Throughput:
      – Average: 14Gb/s vs. 4Gb/s
      – Peak: 81Gb/s vs. 6Gb/s
    • Response time: 3 sec vs. 1569 sec (506X better)

    [Charts: throughput (Gb/s) split into local worker caches, remote worker caches, and GPFS, and average response time (sec), for Ideal, FA, GCC 1GB, GCC 1.5GB, GCC 2GB, GCC 4GB, MCH 4GB, and MCU 4GB]

  • 22

    • Performance index: 34X higher
    • Speedup: 3.5X faster than GPFS

    [Chart: performance index and speedup (relative to first-available on LAN GPFS) for FA, GCC 1GB, GCC 1.5GB, GCC 2GB, GCC 4GB, GCC 4GB SRP, MCH 4GB, and MCU 4GB]

  • 23

    • 2M tasks – 10MB reads, 10ms compute
    • Vary arrival rate:
      – Min: 1 task/sec
      – Arrival rate function (evaluated in the sketch below):
        A = (sin(sqrt(time + 0.11) * 2.859678) + 1) * (time + 0.11) * 5.705
      – Max: 1000 tasks/sec
    • 200 processors
    • Ideal case: 6505 sec, 80Gb/s peak throughput

    [Charts: arrival rate (per sec, up to 1000) and number of tasks completed (up to 2M) over time (0-6600 sec) for the sine-wave workload]
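    A small Python check of the reconstructed arrival-rate function (the constants come from the garbled slide text, so treat them as a best-effort reading rather than the definitive experiment parameters):

      import math

      def arrival_rate(t, cap=1000):
          a = (math.sin(math.sqrt(t + 0.11) * 2.859678) + 1) * (t + 0.11) * 5.705
          return min(cap, max(1, round(a)))                 # bounded between 1 and 1000 tasks/sec

      print([arrival_rate(t) for t in (0, 60, 600, 3600, 6600)])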

  • 24

    • GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs

    [Chart: throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated over time]

  • 25

    • GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
    • GCC+SRP: 1.8 hrs, ~25Gb/s, 361 CPU hrs

    [Chart: cache hit/miss %, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated over time]

  • 26

    • GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
    • GCC+SRP: 1.8 hrs, ~25Gb/s, 361 CPU hrs
    • GCC+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs

    [Chart: cache hit/miss %, throughput (Gb/s), demand (Gb/s), wait queue length (x1K), and number of nodes allocated over time]

  • 27

    • All-Pairs(set A, set B, function F) returns matrix M:
      Compare all elements of set A to all elements of set B via function F, yielding matrix M, such that M[i,j] = F(A[i],B[j])

        foreach $i in A
          foreach $j in B
            submit_job F $i $j
          end
        end

      (a runnable sketch of this pattern follows below)

    • 500x500 – 250K tasks, 24MB reads, 100ms compute, 200 CPUs
    • 1000x1000 – 1M tasks, 24MB reads, 4 sec compute, 4096 CPUs
    • Ideal case: 6505 sec, 80Gb/s peak throughput

    [DIDC09] “Towards Data Intensive Many-Task Computing”, under review
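    A runnable toy version of the All-Pairs pattern above, using a local thread pool in place of a real dispatcher such as Falkon (the pool stands in for submit_job; everything else is illustrative):

      from concurrent.futures import ThreadPoolExecutor

      def all_pairs(A, B, F, max_workers=8):
          """Return matrix M with M[i][j] = F(A[i], B[j]); each pair is an independent task."""
          with ThreadPoolExecutor(max_workers=max_workers) as pool:
              futures = [[pool.submit(F, a, b) for b in B] for a in A]
              return [[f.result() for f in row] for row in futures]

      M = all_pairs([1, 2, 3], [10, 20], lambda a, b: a * b)
      print(M)                                              # [[10, 20], [20, 40], [30, 60]]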

  • 28

    • Efficiency: 75%

    [Chart: cache hit local %, cache hit global %, cache miss %, and throughput (Gb/s) over time vs. maximum GPFS and local-disk throughput, for the 500x500 All-Pairs run]

  • 29

    • Efficiency: 86%

    [Chart: cache hit local %, cache hit global %, cache miss %, and throughput (Gb/s) over time vs. maximum GPFS and local-memory throughput, for the 1000x1000 All-Pairs run]

    [DIDC09] “Towards Data Intensive Many-Task Computing”, under review

  • 30

    • Pull vs. Push
      – Data Diffusion:
        • pulls the task working set
        • incremental spanning forest
      – Active Storage:
        • pushes the workload working set to all nodes
        • static spanning tree

    [Chart: efficiency of Best Case (active storage), Falkon (data diffusion), and Best Case (parallel file system) for 500x500 on 200 CPUs at 1 sec and 0.1 sec, and 1000x1000 on 4096 and 5832 CPUs at 4 sec]

    Experiment                    Approach                     Local Disk/Memory (GB)   Network node-to-node (GB)   Shared File System (GB)
    500x500, 200 CPUs, 1 sec      Best Case (active storage)   6000                     1536                        12
                                  Falkon (data diffusion)      6000                     1698                        34
    500x500, 200 CPUs, 0.1 sec    Best Case (active storage)   6000                     1536                        12
                                  Falkon (data diffusion)      6000                     1528                        62
    1000x1000, 4096 CPUs, 4 sec   Best Case (active storage)   24000                    12288                       24
                                  Falkon (data diffusion)      24000                    4676                        384
    1000x1000, 5832 CPUs, 4 sec   Best Case (active storage)   24000                    12288                       24
                                  Falkon (data diffusion)      24000                    3867                        906

    Christopher Moretti, Douglas Thain, University of Notre Dame
    [DIDC09] “Towards Data Intensive Many-Task Computing”, under review

  • 31

    • Best to use active storage if:
      – slow data source
      – workload working set fits on local node storage
    • Best to use data diffusion if:
      – medium to fast data source
      – task working set

  • 32

    • Purpose
      – On-demand “stacks” of random locations within a ~10TB dataset (a toy stacking sketch follows below)
    • Challenge
      – Processing costs: O(100ms) per object
      – Data intensive: 40MB:1sec
      – Rapid access to 10-10K “random” files
      – Time-varying load

    [Figure: AstroPortal stacking — many Sloan data cutouts (+) summed (=) into a single stacked image]

    Locality   Number of Objects   Number of Files
    1          111700              111700
    1.38       154345              111699
    2          97999               49000
    3          88857               29620
    4          76575               19145
    5          60590               12120
    10         46480               4650
    20         40460               2025
    30         23695               790

    [DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”
    [TG06] “AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis”
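    For intuition, a toy NumPy sketch of the stacking operation itself (pixel-wise summation of cutouts of the same sky location); the real AstroPortal pipeline also performs the calibration and interpolation steps shown on the next slide, and reads FIT/GZ files rather than in-memory arrays:

      import numpy as np

      def stack(cutouts):
          """cutouts: equally sized 2-D arrays; returns their pixel-wise sum."""
          return np.sum(np.stack(list(cutouts)), axis=0)

      rng = np.random.default_rng(0)
      cutouts = [rng.normal(0, 1, (64, 64)) + 0.1 for _ in range(100)]   # faint 0.1 signal in noise
      print(stack(cutouts).mean())                          # ~10: the signal accumulates across the stack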

  • 33

    [Chart: time (ms) per stacking operation for GPFS GZ, LOCAL GZ, GPFS FIT, and LOCAL FIT, broken down into open, radec2xy, readHDU+getTile+curl+convertArray, calibration+interpolation+doStacking, and writeStacking]

    [DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”

  • 34

    • Low data locality: similar (but better) performance compared to GPFS
    • High data locality: near-perfect scalability

    [Charts: time (ms) per stack per CPU vs. number of CPUs (2-128) for Data Diffusion (GZ), Data Diffusion (FIT), GPFS (GZ), and GPFS (FIT), under low and high data locality]

    [DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”

  • 35

    • Aggregate throughput:
      – 39Gb/s
      – 10X higher than GPFS
    • Reduced load on GPFS:
      – 0.49Gb/s
      – 1/10 of the original load
    • Big performance gains as locality increases

    [Charts: aggregate throughput (Gb/s) by locality (1-30), split into data diffusion local, cache-to-cache, and GPFS throughput vs. GPFS (FIT) and GPFS (GZ) throughput; and time (ms) per stack per CPU by locality (1-30 and Ideal) for Data Diffusion (GZ/FIT) and GPFS (GZ/FIT)]

    [DADC08] “Accelerating Large-scale Data Exploration through Data Diffusion”

  • 36

    • Data access patterns: write once, read many
    • Task definition must include input/output file metadata (illustrated below)
    • Per-task working set must fit in local storage
    • Needs IP connectivity between hosts
    • Needs local storage (disk, memory, etc.)
    • Needs Java 1.4+
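    A hypothetical task description (not Falkon’s actual format, and the file paths are invented) showing the kind of input/output file metadata a data-aware scheduler needs, together with the per-task working set that must fit in local storage:

      task = {
          "executable": "doStacking",
          "arguments": ["--band", "r"],
          "inputs":  ["sdss/raw/run1234/field0042.fit.gz"],  # files the scheduler can locate in caches
          "outputs": ["stacks/stack_0042.fit"],              # written back to persistent storage
      }

      def working_set_bytes(task, file_sizes):
          """Sum of input sizes; must fit in a node's local cache (constraint above)."""
          return sum(file_sizes[f] for f in task["inputs"])

      sizes = {"sdss/raw/run1234/field0042.fit.gz": 40 * 2**20}   # ~40MB, as on the astronomy slide
      print(working_set_bytes(task, sizes))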

  • 37

    • [Ghemawat03, Dean04]: MapReduce + GFS
    • [Bialecki05]: Hadoop + HDFS
    • [Gu06]: Sphere + Sector
    • [Tatebe04]: Gfarm
    • [Chervenak04]: RLS, DRS
    • [Kosar06]: Stork

    • Conclusions
      – None focused on the co-location of storage and generic black-box computations with data-aware scheduling while operating in a dynamic, elastic environment
      – Swift + Falkon + Data Diffusion is arguably a more generic and powerful solution than MapReduce

  • 38

    • Identified that data locality is crucial to the efficient use of large scale distributed systems for data-intensive applications
    • Data Diffusion:
      – Integrated streamlined task dispatching with data-aware scheduling policies
      – Heuristics to maximize real-world performance
      – Suitable for varying, data-intensive workloads
      – Proof of O(NM) competitive caching

  • 39

    • Falkon is a real system
      – Late 2005: initial prototype, AstroPortal
      – January 2007: Falkon v0
      – November 2007: Globus incubator project v0.1
        • http://dev.globus.org/wiki/Incubator/Falkon
      – February 2009: Globus incubator project v0.9
    • Implemented in Java (~20K lines of code) and C (~1K lines of code)
      – Open source: svn co https://svn.globus.org/repos/falkon
    • Source code contributors (besides myself)
      – Yong Zhao, Zhao Zhang, Ben Clifford, Mihael Hategan

    [Globus07] “Falkon: A Proposal for Project Globus Incubation”

  • 40

    • Workload:
      – 160K CPUs
      – 1M tasks
      – 60 sec per task
    • 2 CPU years in 453 sec
    • Throughput: 2312 tasks/sec
    • 85% efficiency

    [TPDS09] “Middleware Support for Many-Task Computing”, under preparation

  • 41

    [TPDS09] “Middleware Support for Many-Task Computing”, under preparation

  • 42

    ACM MTAGS09 Workshop @ SC09

    Due Date: August 1st, 2009

  • 43

    IEEE TPDS Journal
    Special Issue on MTC

    Due Date: December 1st, 2009

  • 44

    • More information:
      – Other publications: http://people.cs.uchicago.edu/~iraicu/
      – Falkon: http://dev.globus.org/wiki/Incubator/Falkon
      – Swift: http://www.ci.uchicago.edu/swift/index.php
    • Funding:
      – NASA: Ames Research Center, GSRP
      – DOE: Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy
      – NSF: TeraGrid
    • Relevant activities:
      – ACM MTAGS09 Workshop at Supercomputing 2009
        • http://dsl.cs.uchicago.edu/MTAGS09/
      – Special Issue on MTC in IEEE TPDS Journal
        • http://dsl.cs.uchicago.edu/TPDS_MTC/