Page 1: Hybrid Preemptive Scheduling of MPI applications


Hybrid Preemptive Scheduling of MPI applications

Aurélien Bouteiller, Hinde Lilia Bouziane, Thomas Hérault, Pierre Lemarinier, Franck Cappello
MPICH-V team, INRIA Grand-Large
LRI, University Paris South

Page 2: Hybrid Preemptive Scheduling of MPI applications


Problem definition

• Context: clusters and Grids (made of clusters) shared by many users
  (fewer available resources than required at a given time). In this study: finite sets of MPI applications.
  Time sharing of parallel applications is attractive to increase fairness between users, compared to batch scheduling.

• It is very likely that several applications will reside in virtual memory at the same time, exceeding the total physical memory
  → out-of-core scheduling of parallel applications on clusters! (scheduling parallel applications on a cluster under memory constraints)

• Most of the proposed approaches try to avoid this situation (by limiting job admission based on memory requirements, delaying some jobs unpredictably if the jobs' execution times are not known).

Issue: a novel (out-of-core) approach that avoids delaying some jobs?

Constraint: no OS modification (no kernel patch).

Page 3: Hybrid Preemptive Scheduling of MPI applications


Outline

• Introduction (related work)

• A Hybrid approach dedicated to out-of-core

• Evaluation

• Concluding remarks

Page 4: Hybrid Preemptive Scheduling of MPI applications


Related work 1

Scheduling parallel applications on distributed memory machines: a long history of research, still very active (5 papers in 2004 in the main conferences: IPDPS, Cluster, SC, Grid, Europar)!

Co-scheduling: all processes of each application are scheduled independently (no coordination).
Expected advantage: overlapping communication and computation.

Gang scheduling (sometimes called "co-scheduling"): all processes of each application are executed simultaneously (coordination), within a global time slice.
Expected advantage: scheduling communicating processes together, at the cost of a scheduling overhead at each time slice.

[Figure: time diagrams of applications 1-3 on processors 1-2 under co-scheduling and gang scheduling, showing communications, time slices and scheduling overhead.]

Page 5: Hybrid Preemptive Scheduling of MPI applications


Related work 2

Comparison between gang and co-scheduling:

Gang scheduling outperforms co-scheduling:
• D. G. Feitelson and L. Rudolph. Gang Scheduling Performance Benefits for Fine-Grained Synchronization. Journal of Parallel and Distributed Computing, 16(4):306–318, December 1992.

Co-scheduling outperforms gang scheduling:
• Eitan Frachtenberg, Dror G. Feitelson, Fabrizio Petrini and Juan Fernandez, "Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources", IPDPS 2003 (gang-schedules only the applications that take advantage of it, after classification)
• Gyu Sang Choi, Jin-Ha Kim, Deniz Ersoz, Andy B. Yoo, Chita R. Das, "Coscheduling in Clusters: Is It a Viable Alternative?", SC2004 (increases the priority of processes during communications)
• Peter Strazdins and John Uhlmann, "Local scheduling outperforms gang scheduling on a Beowulf cluster", Technical report, Department of Computer Science, Australian National University, January 2004; Cluster 2004 (Ethernet, SCore for gang scheduling, MatMul and Linpack)

Multiple-parameter problem: the conclusion depends on the assumptions!

Page 6: Hybrid Preemptive Scheduling of MPI applications


Related work 3

Metrics for measuring performance: Metrics and Benchmarking for Parallel Job Scheduling [Fe98]

Performance:
• Makespan
• Throughput
• Response time

Fairness (not so much investigated, still very important):
• Standard deviation of the response time for a set of homogeneous applications
• The minimum is the best fairness
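As an illustration of these metrics (not from the original slides), they can all be computed from per-application launch and termination timestamps; the helper below is hypothetical, not part of MPICH-V:

```python
import statistics

def scheduling_metrics(start_times, end_times):
    """Compute the metrics listed above from per-application
    launch and termination timestamps (in seconds)."""
    response_times = [e - s for s, e in zip(start_times, end_times)]
    makespan = max(end_times) - min(start_times)   # time to finish the whole set
    throughput = len(end_times) / makespan         # completed applications per second
    # Fairness: standard deviation of the response times of homogeneous
    # applications (0 would be perfectly fair).
    fairness = statistics.pstdev(response_times)
    return {"makespan": makespan,
            "throughput": throughput,
            "mean_response": statistics.mean(response_times),
            "fairness_stdev": fairness}

# Example: three identical applications launched together at t = 0
print(scheduling_metrics([0, 0, 0], [1200, 1250, 1190]))
```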

Page 7: Hybrid Preemptive Scheduling of MPI applications


Outline

• Introduction (related work)

• A Hybrid approach dedicated to out-of-core

• Evaluation

• Concluding remarks

Page 8: Hybrid Preemptive Scheduling of MPI applications


Our approach 1/2: Hybrid

Principle:
• A given set of parallel applications to schedule
• Application Subsets: a set of applications fitting in memory
• Co-scheduling applications within a Subset (in-core co-scheduling)
• Gang scheduling Subsets of applications (out-of-core gang scheduling)

Example: 1 set of 6 applications with 2 Subsets of 3 applications.

[Figure: time line of the two Subsets alternating at each time slice, with an "Application Subset context switch" between them; communications overlap computation inside the running Subset.]

Expected benefits:
• Overlapping communications and I/O with computation within a Subset
• No memory page miss/replacement during Subset execution
• Allows known co-scheduling optimizations within a Subset

Potential limitation:
• High "Subset context" switching overhead
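A minimal sketch of this principle (an illustration, not the authors' implementation): applications are packed into Subsets that fit in physical memory, Subsets are rotated gang-style, and within the active Subset all applications run concurrently under the node OS scheduler. The packing heuristic and the suspend/resume callbacks are assumptions:

```python
import itertools
import time

def build_subsets(apps, mem_capacity):
    """Greedily pack applications into Subsets whose total memory footprint
    fits in physical memory (first-fit decreasing; apps = [{"name":..., "mem":...}])."""
    subsets = []
    for app in sorted(apps, key=lambda a: a["mem"], reverse=True):
        for subset in subsets:
            if sum(a["mem"] for a in subset) + app["mem"] <= mem_capacity:
                subset.append(app)
                break
        else:
            subsets.append([app])
    return subsets

def hybrid_schedule(apps, mem_capacity, time_slice, suspend, resume):
    """Gang-schedule Subsets: only one Subset is resident at a time; inside it,
    applications are co-scheduled by the node OS. `suspend` and `resume` are
    placeholders for the checkpoint/restart mechanism of the next slides."""
    subsets = build_subsets(apps, mem_capacity)
    for subset in itertools.cycle(subsets):   # rotate Subsets until all applications finish
        resume(subset)                        # bring the whole Subset back in core
        time.sleep(time_slice)                # co-scheduled execution within the Subset
        suspend(subset)                       # evict the whole Subset at once
```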

Page 9: Hybrid Preemptive Scheduling of MPI applications


Basic OS virtual memory management:
• Paging in pages on request
• Paging out pages on replacement (LRU)
• Interaction with the OS scheduler (some OSes are deliberately unfair in out-of-core situations)
→ Poor performance for HPC applications

Our approach 2/2: Checkpointing

Adaptive Memory Paging for Gang-Scheduling [Ry04]: the best performance for gang scheduling is obtained by:
1) Selective paging out (swapping out only the pages of descheduled processes)
2) Aggressive paging out (evicting the pages of descheduled processes at once)
3) Adaptive paging in (swapping in the pages of the scheduled process at once)
Good, but requires deep kernel modifications.

[Figure: memory and disk contents under standard OS virtual memory management (pages migrated one by one by the pager) versus checkpointing (the whole application image written to / read from disk at once).]

Our approach: user-level application Subset checkpointing.
• Checkpointing provides the same benefits as 1), 2) and 3).
• Works for co-scheduling as well as for gang scheduling.
• Does not require any kernel modification!
→ We need a parallel application (MPI) checkpoint mechanism.
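Continuing the earlier sketch, the suspend/resume callbacks can be realized with user-level checkpointing instead of relying on the OS pager; the checkpoint_app/restart_app/kill_app commands are hypothetical stand-ins for the MPICH-V mechanism presented next:

```python
import subprocess

def suspend(subset):
    """Checkpoint every application of the Subset to local disk, then terminate
    its processes so their memory is freed at once (the user-level equivalent of
    selective and aggressive paging out)."""
    for app in subset:
        subprocess.run(["checkpoint_app", app["name"]], check=True)  # hypothetical command
        subprocess.run(["kill_app", app["name"]], check=True)        # hypothetical command

def resume(subset):
    """Restart every application of the Subset from its last image, loading its
    whole working set at once (the equivalent of adaptive paging in)."""
    for app in subset:
        subprocess.run(["restart_app", app["name"]], check=True)     # hypothetical command
```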

Page 10: Hybrid Preemptive Scheduling of MPI applications


Implementation using MPICH-V Framework

MPICH-V framework: a set of components.
An MPICH-V protocol: a composition of a subset of these components.

[Figure: architecture of the MPICH-V framework: nodes running MPI processes and daemons, connected through the network to a Dispatcher, a Checkpoint Scheduler, Channel Memories, Checkpoint Servers, Event Loggers and a Fault Detector.]

Page 11: Hybrid Preemptive Scheduling of MPI applications


Checkpoint protocol selection: coordinated or uncoordinated?

6 protocols implemented in MPICH-V:
• 1 coordinated (Chandy-Lamport)
• 2 uncoordinated + pessimistic message logging
• 3 uncoordinated + causal message logging

The coordinated one provides the best performance for fault-free execution.

Page 12: Hybrid Preemptive Scheduling of MPI applications


Coordinated Checkpoint: 2 ways

Flushing the network (Chandy-Lamport)

Checkpointing the communication stack (Parakeet, Meiosys, SCore)

[Figure: time diagrams of the two approaches for processes P0 and P1. Flushing the network: the channels are flushed on the checkpoint signal, and the checkpoint image of a process is its state plus the in-transit messages, so the checkpoint may last longer. Checkpointing the communication stack: the communication buffers are saved with the process image on the checkpoint signal and the buffered messages are delivered again after the restart signal, so the restart may last longer.]

1) We expect minor performance differences between the two approaches.
2) Checkpoint/restart of the communication stack requires OS modifications.
→ So we implemented the Chandy-Lamport approach.

Page 13: Hybrid Preemptive Scheduling of MPI applications


MPICH-V/CL protocol: coordinated checkpointing (Chandy-Lamport)

Reference protocol for coordinated checkpointing:

1) When receiving a checkpoint tag, start a checkpoint and store any incoming message.
2) Store all incoming messages in the checkpoint image.
3) Send the checkpoint tag to all neighbors in the topology.
   The checkpoint is finished when a tag has been received from all neighbors.
4) After a crash, all nodes retrieve their checkpoint images.
5) Deliver the stored in-transit messages to the restarted processes.
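A minimal per-process sketch of the tag handling in steps 1-3, assuming a generic message layer; the `send`, `snapshot` and `store` callbacks are hypothetical, not MPICH-V's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ProcState:
    """Minimal per-process state for the protocol above (illustrative only)."""
    rank: int
    neighbors: set
    checkpointing: bool = False
    image: object = None
    tags_received: set = field(default_factory=set)
    in_transit: list = field(default_factory=list)

def handle_message(p, sender, kind, payload, send, snapshot, store):
    """Handle one incoming message; `send(dest, kind, payload)`, `snapshot(rank)`
    and `store(rank, image, in_transit)` stand in for the real layers."""
    if kind == "CKPT_TAG":
        if not p.checkpointing:                     # first tag: start the checkpoint (step 1)
            p.checkpointing = True
            p.image = snapshot(p.rank)              # local process state
            for n in p.neighbors:                   # step 3: forward the tag to every neighbor
                send(n, "CKPT_TAG", None)
        p.tags_received.add(sender)
        if p.tags_received == p.neighbors:          # tag received from all neighbors: done
            store(p.rank, p.image, p.in_transit)    # image = state + in-transit messages
            p.checkpointing = False
            p.tags_received.clear()
            p.in_transit = []
    else:
        if p.checkpointing and sender not in p.tags_received:
            p.in_transit.append((sender, payload))  # step 2: record an in-transit message
        # normal delivery of the application message happens here; after a crash,
        # steps 4-5 reuse the stored image and in-transit messages
```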

Page 14: Hybrid Preemptive Scheduling of MPI applications


Implementation details: MPICH-V

[Figure: deployment of the components. Co-scheduling: one Dispatcher per application, with a Checkpoint Scheduler and Checkpoint Servers, and daemons on the nodes connected through the network. Gang/Hybrid: a Master Scheduler on top of the per-application Dispatchers and Checkpoint Schedulers.]

Co-scheduling: several Dispatchers (no master/checkpoint scheduler).

Gang (and Hybrid): a Master Scheduler + several Checkpoint Schedulers.
1) The Master Scheduler issues a checkpoint order to the Checkpoint Scheduler(s) of the running application(s).
2) When receiving this order, a Checkpoint Scheduler launches a coordinated checkpoint. Every running daemon computes the MPI process image and stores it on the local disk. All daemons send a completion message to the Checkpoint Scheduler.
3) All running daemons stop the MPI process and their own execution.
4) The Master Scheduler selects the Checkpoint Scheduler(s) of other application(s) and sends a restart order. Every Checkpoint Scheduler receiving this order spawns new daemons restarting the MPI processes from their local images.
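An illustrative sketch of the Master Scheduler's time-slice loop for steps 1-4; the order names and the `send`/`wait` control channel are assumptions, not MPICH-V code:

```python
import time

def master_scheduler_loop(subsets, time_slice, send, wait):
    """Rotate application Subsets following steps 1-4 above.
    `subsets` is a list of lists of Checkpoint Scheduler ids;
    `send(sched, order)` and `wait(sched, event)` stand in for the
    real control channel to each Checkpoint Scheduler."""
    running = 0
    for sched in subsets[running]:
        send(sched, "RESTART")                # launch the first Subset
    while True:
        time.sleep(time_slice)
        nxt = (running + 1) % len(subsets)
        for sched in subsets[running]:
            send(sched, "CHECKPOINT")         # 1) checkpoint order to running applications
        for sched in subsets[running]:
            wait(sched, "CHECKPOINT_DONE")    # 2)-3) images stored on local disks, daemons stopped
        for sched in subsets[nxt]:
            send(sched, "RESTART")            # 4) restart the next Subset from local images
        running = nxt
```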

Page 15: Hybrid Preemptive Scheduling of MPI applications


Outline

• Introduction (related work)

• A Hybrid approach dedicated to out-of-core

• Evaluation

• Concluding remarks

Page 16: Hybrid Preemptive Scheduling of MPI applications


Methodology

• LRI cluster (Beowulf cluster):
  – Athlon 1800+
  – 1 GB memory
  – IDE ATA100 disk
  – Ethernet 100 Mb/s
  – Linux 2.4.2

• Benchmark (MPI):
  – NAS BT (computation bound)
  – NAS CG (communication bound)

• Time measurement:
  – Homogeneous applications
  – Simultaneous launch (scripts)
  – Time is measured between the first launch and the last termination
  – Fairness is measured by the response-time standard deviation

• Gang scheduling time slice: 200 or 600 s
  – Gang scheduling is also implemented by checkpointing (not OS signals)

Page 17: Hybrid Preemptive Scheduling of MPI applications


Context switch overlap policy

We can imagine several policies to switch between Subset contexts. Which one is the best for in-core and out-of-core situations?

A) Sequential store and load: 1X (1 context in memory)
B) Store and load in parallel: 2X (2 contexts in memory)
C) Load prefetch: 2X (2 contexts in memory)

[Figure: time-slice diagram of the three policies (execution, context storage, context load) and measured results for NAS BT-C-25, in-core and near out-of-core; the in-core differences between policies are below 3%.]

1) Overlapping policies do not provide substantial improvements in the in-core situation.
2) They need 2x the memory capacity to stay in-core.
→ The sequential policy is the best; we used it for the other experiments.
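A rough sketch of policies A and B, with C noted in a comment; `store_ctx`/`load_ctx` are hypothetical stand-ins for the checkpoint store and restart operations:

```python
import threading

def sequential_switch(store_ctx, load_ctx, cur, nxt):
    """A) Sequential store and load: only one Subset context in memory at a time (1X)."""
    store_ctx(cur)   # checkpoint the running Subset
    load_ctx(nxt)    # then restart the next one

def parallel_switch(store_ctx, load_ctx, cur, nxt):
    """B) Store and load in parallel: the outgoing and incoming contexts coexist (2X memory)."""
    storer = threading.Thread(target=store_ctx, args=(cur,))
    storer.start()
    load_ctx(nxt)    # the load overlaps the store
    storer.join()

# C) Load prefetch is similar to B), but the load of the next context is started
#    during the previous time slice, so only the store remains on the critical path.
```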

Page 18: Hybrid Preemptive Scheduling of MPI applications


Co vs. Gang (checkpoint based)

• Which scheduling strategy is the best for communication-bound and compute-bound applications?

Makespan: execution time of N applications with co-scheduling and gang scheduling, NAS benchmarks CG and BT.

[Figure: makespan (s) of co-scheduling versus checkpoint-based gang scheduling as a function of the number of CG-C-8 (1 to 21) and BT-B-9 (1 to 24) applications executed "simultaneously", spanning the in-core and out-of-core regimes; one off-scale out-of-core point is annotated "more than 24 k".]

1) Co-scheduling is the best for in-core executions (but the advantage is small: roughly the checkpoint overhead plus a tiny communication/computation overlap).
2) Gang scheduling outperforms co-scheduling for out-of-core (checkpoint based).

The memory constraint is managed by checkpointing, not by delaying jobs.

Page 19: Hybrid Preemptive Scheduling of MPI applications


Checkpoint-based Gang vs. checkpoint-based Hybrid

Makespan: execution time of N applications with co-, gang and hybrid scheduling.

[Figure: makespan (minutes) of co-scheduling, gang scheduling and hybrid scheduling (Subsets of 5) versus the number of CG-C-8 and BT-B-9 applications executed "simultaneously", covering the in-core and out-of-core regimes; annotations mark the checkpoint overhead and the communication/computation overlap, and one off-scale point is annotated "more than 3000".]

• Gang and hybrid scheduling outperform co-scheduling for out-of-core.
• Hybrid scheduling compares favorably to gang scheduling on BT and out-of-core, thanks to communication and computation overlap.

Page 20: Hybrid Preemptive Scheduling of MPI applications


Overhead comparison

Relative slowdown: (total time / number of concurrent executions) / best sequential time

• What is the performance degradation due to time sharing?

1) Gang and hybrid scheduling add no performance penalty to CG (and also no improvement).
2) Gang scheduling adds a 10% performance penalty to BT.
3) Hybrid scheduling improves BT performance by almost 10%.
4) The difference is mostly due to communication/computation overlap.
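Restating the definition above as a small helper (the names and example numbers are illustrative only):

```python
def relative_slowdown(total_time, n_concurrent, best_seq_time):
    """Relative slowdown = (total time / number of concurrent executions) / best sequential time.
    1.0 means time sharing costs nothing; above 1.0 is a penalty, below 1.0 an improvement."""
    return (total_time / n_concurrent) / best_seq_time

# Example: 12 concurrent runs finishing after 14,400 s, best sequential run 1,100 s
print(relative_slowdown(14400, 12, 1100))   # ≈ 1.09, i.e. ~10% slowdown
```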

[Figure: relative slowdown of co-, gang and hybrid scheduling versus the number of concurrent CG-C-8 (6 to 21) and BT-B-9 (12 to 24) executions, with the sequential execution as the reference (relative slowdown 1.0).]

Page 21: Hybrid Preemptive Scheduling of MPI applications


Co-scheduling Fairness (Linux)

[Table: page-miss statistics for 7 and 9 BT-C-25 applications (out-of-core): number of page misses per minute experienced by each node of an application (mean, app0 to app8), the mean over all applications, and the standard deviation.]

• How fair is co-scheduling for in-core and out-of-core? Response time of BT 9 with modified memory sizes

[Figure: response time (s) of each application, plotted by application rank. In-core: M=1210, Diff=8. Slightly out-of-core: M=2251, SD=298, Diff=961.]

1) The fairness deficiency in slightly out-of-core situations seems due to the virtual memory management.
2) Of course there should be some solution, but it would involve kernel modifications.

Co-scheduling is highly unfair in out-of-core situations!

Page 22: Hybrid Preemptive Scheduling of MPI applications


Outline

• Introduction (related work)

• A Hybrid approach dedicated to out-of-core

• Evaluation

• Concluding remarks

Page 23: Hybrid Preemptive Scheduling of MPI applications


Concluding remarks

• Checkpoint-based gang scheduling outperforms co-scheduling, and certainly classical (OS-signal-based) gang scheduling, in out-of-core situations (thanks to better memory management).
• Compared to known approaches based on job admission control, the benefit of checkpointing is that it avoids delaying some jobs.
• Hybrid scheduling, combining the two approaches plus checkpointing, outperforms gang scheduling on BT (presumably thanks to overlapping communications and computations).
• More generally, hybrid scheduling can take advantage of advanced co-scheduling approaches within a gang Subset.

Work in progress:
• Test with other applications / benchmarks
• Compare with traditional gang scheduling based on OS signals
• Experiments with high-speed networks
• Experiments on hybrid scheduling with co-scheduling optimizations

Page 24: Hybrid Preemptive Scheduling of MPI applications


Meet us!

At the INRIA booth 2345

Mail contact: [email protected]

Page 25: Hybrid Preemptive Scheduling of MPI applications


References

[Ag03] S. Agarwal, G. Choi, C. R. Das, A. B. Yoo, and S. Nagar. Co-ordinated Coscheduling in Time-Sharing Clusters through a Generic Framework. In Proceedings of the International Conference on Cluster Computing, December 2003.

[Ar98] A. C. Arpaci-Dusseau, D. E. Culler, and A. M. Mainwaring. Implicit Scheduling with Implicit Information in Distributed Systems. In Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 233–243, June 1998.

[Ba00] Anat Batat and Dror G. Feitelson, "Gang Scheduling with Memory Considerations", in Proceedings of IPDPS 2000.

[Bo03] Aurélien Bouteiller, Pierre Lemarinier, Géraud Krawezik, and Franck Cappello, "Coordinated checkpoint versus message log for fault tolerant MPI", in IEEE International Conference on Cluster Computing (Cluster 2003), IEEE CS Press, December 2003.

[Ch85] K. M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems", ACM Transactions on Computer Systems, 3(1):63–75, February 1985.

[Fe98] D. G. Feitelson and L. Rudolph, "Metrics and Benchmarking for Parallel Job Scheduling", in Job Scheduling Strategies for Parallel Processing, LNCS vol. 1495, pp. 1–24, Springer-Verlag, March 1998.

[Fr03] Eitan Frachtenberg, Dror G. Feitelson, Fabrizio Petrini and Juan Fernandez, "Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources", IPDPS 2003.

[Ho98] Atsushi Hori, Hiroshi Tezuka, and Yutaka Ishikawa, "Overhead analysis of preemptive gang scheduling", Lecture Notes in Computer Science, 1459:217–230, April 1998.

[Ry04] Kyung Dong Ryu, Nimish Pachapurkar, Liana L. Fong, "Adaptive Memory Paging for Efficient Gang Scheduling of Parallel Applications", in Proceedings of IPDPS 2004.

[Na99] S. Nagar, A. Banerjee, A. Sivasubramaniam, and C. R. Das. Alternatives to Coscheduling a Network of Workstations. Journal of Parallel and Distributed Computing, 59(2):302–327, November 1999.

[Ni02] Dimitrios S. Nikolopoulos and Constantine D. Polychronopoulos, "Adaptive Scheduling under Memory Pressure on Multiprogrammed Clusters", CCGRID 2002.

[Ch04] Gyu Sang Choi, Jin-Ha Kim, Deniz Ersoz, Andy B. Yoo, Chita R. Das, "Coscheduling in Clusters: Is It a Viable Alternative?", SC2004.

[Se99] S. Setia, M. S. Squillante, and V. K. Naik. The Impact of Job Memory Requirements on Gang-Scheduling Performance. ACM SIGMETRICS Performance Evaluation Review, 26(4):30–39, 1999.

[So98] P. G. Sobalvarro, S. Pakin, W. E. Weihl, and A. A. Chien. Dynamic Coscheduling on Workstation Clusters. In Proceedings of the IPPS Workshop on Job Scheduling Strategies for Parallel Processing, pages 231–256, March 1998.

[St04] Peter Strazdins and John Uhlmann, "Local scheduling outperforms gang scheduling on a Beowulf cluster", Technical report, Department of Computer Science, Australian National University, January 2004; Cluster 2004.

[Wi03] Yair Wiseman, Dror G. Feitelson, "Paired Gang Scheduling", IEEE TPDS, June 2003.

Page 26: Hybrid Preemptive Scheduling of MPI applications


Related work: optimizations

- Memory management (mainly based on job admission control):
• Impact of Memory Requirements on Gang-Scheduling Performance [Se99] (control of multiprogramming)
• Gang Scheduling with Memory Considerations [Ba00] (job admission control to avoid swapping)
• Memory-aware co-scheduling [Ch04] (job admission control to avoid swapping)
• Adaptive Memory Paging for Gang-Scheduling [Ry04] (improving memory paging in/out)

- Communications (concerns co-scheduling):
• ICS (Implicit Co-scheduling), SB (Spin Blocking), CC (Coordinated Co-scheduling): self-descheduling after a timeout on communication [Ar98][Na99]
• DCS (Dynamic Co-scheduling): an incoming message triggers the receiver's scheduler [So98]
• PB (Periodic Boost): schedule the receiver based on a periodic check of the receive buffer [Na99]

Page 27: Hybrid Preemptive Scheduling of MPI applications


Is the in-core result kernel dependent (Linux)?

Kernel 2.4.2 was used in our experiments. How does time-sharing efficiency evolve with Linux kernel maturation (from 2.4 to 2.6)?

[Figure: computation, communication and total execution times (s, log scale) versus the number of concurrent CG-A-4 and BT-A-9 executions (1 to 5, in-core), for Linux kernels 2.4.2, 2.6.2 and 2.6.7.]

Yes, the performance of co-scheduling (in-core) depends on the kernel:
1) Kernel 2.6.2 is less efficient (much less for CG).
2) Kernels 2.6.7 and 2.4.2 provide overall similar performance.

→ Careful selection of the kernel version, OR restriction (deactivation) of co-scheduling.