
Experimental Comparative Study of Job Management Systems George Washington University George Mason University .

Dec 13, 2015


Blake Gray
Page 1: Experimental Comparative Study of Job Management Systems George Washington University George Mason University .

Experimental Comparative Study of Job Management Systems

George Washington University, George Mason University

http://ece.gmu.edu/lucite

Page 2:

Outline:

1. Review of experiments

2. Results

3. Encountered problems

4. Functional comparison

5. Extension to reconfigurable hardware

Page 3:

Review of Experiments

Page 4:

Our Testbed

[Network diagram of the testbed, spanning the domains science.gmu.edu and gmu.edu. Hosts shown:]

palpc2: Linux, PII 400 MHz, 128 MB RAM
pallj / m0: Linux RH7.0, PIII 450 MHz, 512 MB RAM
m1 to m4: 4 x Linux RH6.2, 2 x PIII 500 MHz, 128 MB RAM
m5 to m7: 3 x Linux RH6.2, 2 x PIII 450 MHz, 128 MB RAM
alicja: Solaris 8, UltraSparcIIi 360 MHz, 512 MB RAM
magdalena: Solaris 8, UltraSparcIIi 440 MHz, 512 MB RAM
anna: Solaris 8, UltraSparcIIi 440 MHz, 128 MB RAM
redfox: Solaris 8, UltraSparcIIi 330 MHz, 128 MB RAM

Page 5:

SHORT JOBS (execution time between 1 s and 2 minutes)

No.  Group  Name   Class  Script name    CPU time [s]  Memory Usage [MB]  Memory Requirements [MB]
1    NPB    FT     S      ft.S.sh        3.5           3.2                4
2    NPB    FT     W      ft.W.sh        9.4           6.4                8
3    NPB    MG     W      mg.W.sh        10.8          1.9                3
4    NPB    EP     S      ep.S.sh        26.5          0.25               1
5    NPB    EP     W      ep.W.sh        53.0          0.25               1
6    NPB    IS     W      is.W.sh        1.0           1.7                3
7    NPB    BT     S      bt.S.sh        3.0           2.5                3
8*   NPB    BT     W      bt.W.sh        115           17                 21
9    NSA    IS     7 mln  radix.7M.sh    6             12.8               16
10   UPC    Sobel  256    sobel.256.sh   4             0.4                1
11   UPC    Sobel  512    sobel.512.sh   17            0.8                1
12*  UPC    Sobel  1024   sobel.1024.sh  68            2.4                3
13   UPC    MM     512    matrix.1.sh    10.5          5.9                8
14   UPC    MM     1024   matrix.2.sh    21            9.9                12
15   UPC    MM     2048   matrix.3.sh    40            18.4               23
Average                                  22.0

* benchmarks used to determine the relative CPU factors of execution hosts

Page 6:

Machine names  Host Type  Host Model       CPU Factor
m1-m4          Linux      PIII_2_500_128   1.65
m5-m7          Linux      PIII_2_450_128   1.55
pallj          Linux      PIII_1_450_512   1.60
palpc2         Linux      P2_1_400_128     1.70
alicja         Solaris64  USIIi_1_360_512  1.0
anna           Solaris64  USIIi_1_440_128  1.2
magdalena      Solaris64  USIIi_1_440_512  1.2
redfox         Solaris64  USIIi_1_330_128  1.2

CPU factors for the medium benchmark list, based on the execution time for bt.W and Sobel 1024
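
The slides do not state how the CPU factors were computed from the benchmark timings. A minimal sketch of one plausible approach, assuming the factor of a host is the slowest host's execution time divided by that host's time, averaged over the reference benchmarks. The function name and the timings in the usage note are hypothetical, not measurements from the study.

```python
def cpu_factors(exec_times):
    """Map {host: {benchmark: seconds}} to {host: relative CPU factor}.

    Assumption (not from the slides): the slowest host on each benchmark
    gets ratio 1.0, and a host's factor is the mean of its ratios.
    """
    hosts = list(exec_times)
    # Only benchmarks measured on every host can be compared.
    benchmarks = sorted(set.intersection(*(set(exec_times[h]) for h in hosts)))
    factors = {}
    for host in hosts:
        ratios = []
        for bench in benchmarks:
            slowest = max(exec_times[h][bench] for h in hosts)
            ratios.append(slowest / exec_times[host][bench])
        factors[host] = sum(ratios) / len(ratios)
    return factors
```

For example, with made-up timings `{"slow": {"a": 100.0, "b": 60.0}, "fast": {"a": 50.0, "b": 40.0}}`, the host `slow` gets factor 1.0 and `fast` gets the mean of 2.0 and 1.5.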

Page 7:

MEDIUM JOBS (execution time between 2 minutes and 10 minutes)

No.  Group   Name      Class  Script name           CPU time [min:s]  Memory Usage [MB]  Memory Requirements [MB]
1    NPB     EP        A      ep.A.sh               7:45              1.3                3
2    NPB     LU        W      lu.W.sh               8:09              6.8                9
3*   NPB     SP        W      sp.W.sh               6:07              15.1               19
4    Crypto  Mars      M      crypto.mars.M.sh      9:21              0.4                1
5    Crypto  RC6       M      crypto.rc6.M.sh       6:21              0.4                1
6    Crypto  Rijndael  M      crypto.rijndael.M.sh  4:11              0.4                1
7    Crypto  Serpent   M      crypto.serpent.M.sh   8:54              0.4                1
8*   Crypto  Twofish   M      crypto.twofish.M.sh   8:05              0.4                1
Average                                             7:22

* benchmarks used to determine the relative CPU factors of execution hosts

Page 8:

LONG JOBS (execution time between 10 minutes and 30 minutes)

No.  Group   Name      Class  Script name           CPU time [min:s]  Memory Usage [MB]  Memory Requirements [MB]
1*   NPB     EP        B      ep.B.sh               30:15             5                  6
2    Crypto  Mars      L      crypto.mars.L.sh      14:55             0.4                1
3    Crypto  RC6       L      crypto.rc6.L.sh       10:07             0.4                1
4    Crypto  Rijndael  L      crypto.rijndael.L.sh  10:58             0.4                1
5    Crypto  Serpent   L      crypto.serpent.L.sh   14:09             0.4                1
6*   Crypto  Twofish   L      crypto.twofish.L.sh   20:45             0.4                1
Average                                             16:51

* benchmarks used to determine the relative CPU factors of execution hosts

Page 9:

INPUT/OUTPUT JOBS (execution time between 1 second and 10 minutes)

No.  Group  Name  Class  Script name     CPU time [min:s]  Memory Usage [MB]  Memory Requirements [MB]  Input files                     Output file
1    NPB    FT    S      ft.S.io.sh      0:04              3.2                4                         fft_64.in_pc, fft_64.in_sun     fft_64.out_pc, fft_64.out_sun
2    NPB    FT    W      ft.W.io.sh      0:10              6.4                8                         fft_128.in_pc, fft_128.in_sun   fft_128.out_pc, fft_128.out_sun
3    UPC    MM    512    matrix.1.io.sh  0:11              5.9                8                         mat_512.in_pc, mat_512.in_sun   mat_512.out_pc, mat_512.out_sun
4    UPC    MM    1024   matrix.2.io.sh  0:21              9.9                12                        mat_1024.in_pc, mat_1024.in_sun mat_1024.out_pc, mat_1024.out_sun
5    UPC    MM    2048   matrix.3.io.sh  0:40              18.4               23                        mat_2048.in_pc, mat_2048.in_sun mat_2048.out_pc, mat_2048.out_sun
6    NPB    LU    W      lu.W.io.sh      8:09              6.8                9                         -                               LU_W.out
Average                                  1:36

Page 10:

Typical experiment

[Timeline diagram: job submissions 1 to N along a time axis starting at time = 0, and jobs i1 to iN finishing execution.]

Total time of an experiment 2 hours

N = 150 for medium and small jobs, 75 for long jobs

Pseudorandom delays between consecutive job submissions

Poisson distribution of the job submission rate
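
A Poisson submission process has exponentially distributed delays between consecutive submissions. The sketch below generates such a schedule with the first job at time = 0. The function name, the seed handling, and the choice of Python's `random` module are assumptions for illustration; the slides do not describe the actual submission script.

```python
import random


def submission_times(n_jobs, mean_interval_s, seed=0):
    """Return n_jobs submission times in seconds, starting at 0.0.

    Delays between consecutive submissions are drawn from an exponential
    distribution with the given mean, which yields a Poisson process with
    an average rate of 1 / mean_interval_s jobs per second.
    """
    rng = random.Random(seed)  # fixed seed makes the schedule repeatable
    t = 0.0
    times = []
    for _ in range(n_jobs):
        times.append(t)
        t += rng.expovariate(1.0 / mean_interval_s)
    return times
```

For instance, `submission_times(150, 15.0)` corresponds to 150 jobs with an average interval of 15 s, that is, an average rate of 4 jobs/min.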

Page 11:

List of experiments

Experiment  Benchmark Set            Average CPU time  Average Time Intervals       Total Number  Special
Number                               / Job             Between Job Submissions      of Jobs       Assumptions
1           Set 2, Medium job list   7 min 22 s        30 s, 15 s, 5 s              150           one job / CPU
2           Set 2, Medium job list   7 min 22 s        15 s                         150           two jobs / CPU
3           Set 3, Long job list     16 min 51 s       2 min, 30 s                  75            one job / CPU
4           Set 1, Short job list    22 s              15 s, 10 s, 5 s              150           one job / CPU
5           Set 4, I/O job list      1 min 36 s        15 s                         150           one job / CPU

Page 12:

Definition of timing parameters

[Timeline diagram marking four instants on a time axis and the intervals between them.]

Instants:
ts – submission time
tb – begin of execution time
te – end of execution time
td – delivery time

Intervals:
TR – response time
TTA – turn-around time
TEXE – execution time
TD – delivery time

Page 13:

Typical scenario

[Timeline diagram marking three instants on a time axis and the intervals between them.]

Instants:
ts – submission time
tb – begin of execution time
te – end of execution time

Intervals:
TR – response time
TTA – turn-around time
TEXE – execution time
TD = 0 (delivery time = 0)

Times are determined using the gettimeofday() function
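
The relations between the instants and the intervals were shown only in the diagram, so the sketch below is an assumption: response time runs from submission to the begin of execution, execution time from begin to end of execution, delivery time from end of execution to delivery, and turn-around time from submission to delivery. The class and function names are hypothetical. In Python, `time.time()` plays the role of `gettimeofday()` for collecting the timestamps.

```python
from dataclasses import dataclass


@dataclass
class JobTimestamps:
    ts: float  # submission time
    tb: float  # begin of execution time
    te: float  # end of execution time
    td: float  # delivery time


def timing_parameters(job):
    """Derive the intervals from the four instants (assumed definitions)."""
    return {
        "TR": job.tb - job.ts,    # response time
        "TEXE": job.te - job.tb,  # execution time
        "TD": job.td - job.te,    # delivery time
        "TTA": job.td - job.ts,   # turn-around time
    }
```

Under these assumptions TTA = TR + TEXE + TD, so in the typical scenario with TD = 0 the turn-around time is the sum of the response time and the execution time.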

Page 14:

Total Throughput

[Timeline diagram: job submissions 1 to N along a time axis starting at time = 0, and jobs i1 to iN finishing execution.]

$T_N$ – time necessary to execute N jobs

$\text{Total Throughput} = \dfrac{N}{T_N}$

Page 15:

Partial Throughput

[Timeline diagram: job submissions 1 to N along a time axis starting at time = 0, and jobs i1 to iN finishing execution, with the k-th finishing job ik marked.]

$T_k$ – time necessary to execute k jobs

$\text{Throughput}(k) = \dfrac{k}{T_k}$
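
The total and partial throughput definitions can be sketched as follows, assuming that finish times are measured in seconds from time = 0 and that the time necessary to execute k jobs is the instant at which the k-th job finishes. The function names and the conversion to jobs/hour are choices made for this illustration.

```python
def partial_throughput(finish_times_s, k):
    """Throughput(k) = k / T_k in jobs/hour.

    T_k is taken as the finish time of the k-th job to complete,
    measured in seconds from time = 0 (an assumption about the diagram).
    """
    ordered = sorted(finish_times_s)
    t_k = ordered[k - 1]
    return k / t_k * 3600.0


def total_throughput(finish_times_s):
    """Total Throughput = N / T_N in jobs/hour."""
    return partial_throughput(finish_times_s, len(finish_times_s))
```

With hypothetical finish times of 600, 1200, 1800 and 3600 s, the partial throughput for k = 2 is 6 jobs/hour, while the total throughput is 4 jobs/hour.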

Page 16:

Utilization

[Diagram: CPU utilization (0% to 100%) over time for machine 1, machine 2, ..., machine M, showing the jobs run on each machine (job1, job2, job3) and the average CPU utilization of each machine, $U_{avr}^{1}$, $U_{avr}^{2}$, ..., $U_{avr}^{M}$.]

$\text{Overall utilization} = \dfrac{1}{M} \sum_{j=1}^{M} U_{avr}^{j}$
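
The overall utilization, the mean of the per-machine average CPU utilizations, can be sketched as below. How the per-machine average was measured is not stated in the slides; the helper `average_utilization` assumes one job per CPU, with the CPU counted as 100% busy while a job runs and 0% otherwise. Both function names are hypothetical.

```python
def average_utilization(busy_intervals, experiment_length_s):
    """Average CPU utilization of one machine, in percent.

    busy_intervals: non-overlapping (start, end) pairs in seconds during
    which the machine was executing a job (assumed 100% busy).
    """
    busy = sum(end - start for start, end in busy_intervals)
    return 100.0 * busy / experiment_length_s


def overall_utilization(per_machine_avg):
    """Mean of the per-machine averages over all M machines."""
    return sum(per_machine_avg) / len(per_machine_avg)
```

For example, a machine that is busy for two 30 s intervals in a 120 s experiment has an average utilization of 50%.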

Page 17:

Results

Page 18:

Medium jobs – Total Throughput

[Bar chart. Y axis: Throughput [jobs/hour], 0 to 120. X axis: Average job submission rate (2 jobs/min, 4 jobs/min, 12 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 76, 70, 68, 79, 97, 91, 82, 114, 107, 102, 86, 110.]

Page 19:

Medium jobs – Turn-around Time

[Bar chart. Y axis: Turn-around Time [s], 0 to 2500. X axis: Average job submission rate (2 jobs/min, 4 jobs/min, 12 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 496, 462, 607, 505, 1134, 944, 1293, 1148, 1765, 1466, 1949, 1627.]

Page 20:

Medium jobs – Response Time

[Bar chart. Y axis: Response Time [s], 0 to 1600. X axis: Average job submission rate (2 jobs/min, 4 jobs/min, 12 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 13, 3, 31, 28, 636, 452, 734, 671, 1274, 984, 1385, 1156.]

Page 21:

Medium jobs – Utilization

[Bar chart. Y axis: Utilization [%], 0 to 90. X axis: Average job submission rate (2 jobs/min, 4 jobs/min, 12 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 54, 41, 70, 61, 63, 57, 71, 74, 73, 67, 78, 69.]

Page 22:

Long jobs – Total Throughput

[Bar chart. Y axis: Throughput [jobs/hour], 0 to 45. X axis: Average job submission rate (0.5 job/min, 2 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 25, 26, 18, 40, 28, 30, 23, 42.]

Page 23:

Long jobs – Turn-around Time

[Bar chart. Y axis: Turn-around Time [s], 0 to 4000. X axis: Average job submission rate (0.5 job/min, 2 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 1148, 1079, 1903, 1926, 2191, 2163, 3401, 2357.]

Page 24:

Long jobs – Response Time

[Bar chart. Y axis: Response Time [s], 0 to 1600. X axis: Average job submission rate (0.5 job/min, 2 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 13, 3, 3, 721, 860, 799, 1478, 1225.]

Page 25:

Long jobs – Utilization

[Bar chart. Y axis: Utilization [%], 0 to 80. X axis: Average job submission rate (0.5 job/min, 2 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 43, 46, 52, 24, 56, 58, 64, 69.]

Page 26:

Short jobs – Total Throughput

[Bar chart. Y axis: Throughput [jobs/hour], 0 to 1400. X axis: Average job submission rate (4, 6, 12, 30, 60 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 240, 227, 234, 160, 356, 322, 337, 205, 652, 414, 607, 280, 1076, 576, 336, 1255, 642, 370, 1027, 1210.]

Page 27:

Short jobs – Turn-around Time

[Bar chart. Y axis: Turn-around Time [s], 0 to 140. X axis: Average job submission rate (4, 6, 12, 30, 60 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 42, 34, 29, 50, 41, 33, 29, 51, 42, 58, 29, 51, 68, 58, 31, 52, 120, 62, 32, 50.]

Page 28:

Short jobs – Response Time

[Bar chart. Y axis: Response Time [s], 0 to 90. X axis: Average job submission rate (4, 6, 12, 30, 60 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 9, 2, 1, 19, 9, 3, 1, 19, 9, 8, 1, 17, 32, 8, 2, 18, 83, 9, 2, 18.]

Page 29:

Short jobs – Utilization

[Bar chart. Y axis: Utilization [%], 0 to 45. X axis: Average job submission rate (4, 6, 12, 30, 60 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 9, 18, 6, 6, 15, 21, 9, 8, 20, 35, 16, 10, 26, 38, 12, 37, 38, 12, 32, 37.]

Page 30:

Medium jobs – Total Throughput

[Bar chart. Y axis: Throughput [jobs/hour], 0 to 120. X axis: Maximum number of jobs per CPU (1 job/CPU at 4 jobs/min, 2 jobs/CPU at 4 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 97, 91, 82, 114, 90, 80, 67, 105.]

Page 31:

Medium jobs – Turn-around Time

[Bar chart. Y axis: Turn-around Time [s], 0 to 1600. X axis: Maximum number of jobs per CPU (1 job/CPU at 4 jobs/min, 2 jobs/CPU at 4 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 1134, 944, 1293, 1147, 1297, 1273, 1482, 969.]

Page 32:

Medium jobs – Response Time

[Bar chart. Y axis: Response Time [s], 0 to 800. X axis: Maximum number of jobs per CPU (1 job/CPU at 4 jobs/min, 2 jobs/CPU at 4 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 636, 452, 734, 671, 387, 285, 386, 386.]

Page 33:

Medium jobs – Utilization

[Bar chart. Y axis: Utilization [%], 0 to 80. X axis: Maximum number of jobs per CPU (1 job/CPU at 4 jobs/min, 2 jobs/CPU at 4 jobs/min). Series: LSF, PBS, Codine, Condor. Bar labels, in extraction order (assignment to individual bars is not preserved): 63, 57, 71, 74, 63, 58, 63, 54.]

Page 34:

Encountered problems

Page 35:

1. Jobs with high requirements on the stack size

Indication: Certain jobs do not finish execution when run under LSF. The same jobs run correctly outside of any JMS and under other job management systems.

Source: Variable STACKLIMIT in $LSB_CONFDIR/<cluster_name>/configdir/lsb.queues

Remaining Problem: Documentation of default limits.

Page 36:

2. Frequently submitted small jobs

Indication: Unexpectedly high response time and turn-around time for a medium job submission rate

Possible solution: Defining variable CHUNK_JOB_SIZE (e.g., =5) in lsb.queues, and the variable LSB_CHUNK_NORUSAGE=y in lsf.conf

Page 37:

3. Ordering of machines fulfilling resource requirements

Question: How many machines are dropped from the list based on the first ordering?

Default:

r1m : pg

Page 38:

4. Random behavior from iteration to iteration

Question: Why is r1m different each time?

Indication: Assignment of jobs to particular machines is different in each iteration of the experiment

Page 39:

5. Boundary effects in the calculation of the throughput

Question: How to define the steady state throughput?

Indication: Steady-state partial throughput differs from the total throughput
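
The slides leave the definition of steady-state throughput as an open question. One possible definition, offered only as a sketch and not as the authors' choice, is to measure the completion rate over a middle window that skips the first and last completions, where the system is still filling up or draining. The function name and the 10% cut-offs are assumptions.

```python
def steady_state_throughput(finish_times_s, skip_first=0.1, skip_last=0.1):
    """Completion rate in jobs/hour over the middle part of the experiment.

    Skips the given fractions of the earliest and latest completions to
    reduce boundary effects (one possible definition, assumed here).
    """
    ordered = sorted(finish_times_s)
    n = len(ordered)
    lo = int(n * skip_first)
    hi = n - int(n * skip_last) - 1
    if hi <= lo or ordered[hi] == ordered[lo]:
        raise ValueError("not enough distinct completions in the window")
    # (hi - lo) jobs complete between ordered[lo] and ordered[hi].
    return (hi - lo) / (ordered[hi] - ordered[lo]) * 3600.0
```

With hypothetical finish times in which the last job completes long after the others, this estimate can differ substantially from N / T_N, which is the kind of discrepancy the indication above describes.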

Page 40:

6. Throughput vs. turn-around time

Question: How to explain the lack of this correlation?

Indication: No correlation between the ranking of JMSes in terms of the throughput and in terms of the turn-around time
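
The slides do not answer this question. As one possible contributing factor, offered as an illustration rather than the authors' explanation, the order in which jobs are run can change the average turn-around time without changing the throughput. The sketch below uses a hypothetical single CPU with all jobs submitted at time 0; the function name and job durations are made up.

```python
def run_in_order(durations_s):
    """Run jobs back to back on one CPU, all submitted at time 0.

    Returns (throughput in jobs/hour, mean turn-around time in seconds).
    """
    t = 0.0
    finish = []
    for d in durations_s:
        t += d
        finish.append(t)
    throughput = len(finish) / finish[-1] * 3600.0
    # Submission time is 0 for every job, so turn-around equals finish time.
    mean_tta = sum(finish) / len(finish)
    return throughput, mean_tta
```

Running durations of 60, 600 and 1140 s in that order and in the reverse order gives the same throughput in both cases, but the mean turn-around time is lower when the short job runs first.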

Page 41:

Functional comparison

Page 42:

Operating system, flexibility, user interface

LSF Codine PBS CONDOR RES

Distribution (in extraction order): com, pub, pub/com, pub, gov

Source code: [table entries not preserved]

OS Support (Solaris, Linux, Tru64, NT): [table entries not preserved]

User Interface: GUI & CLI for four systems, CLI for one [assignment to systems not preserved]

Page 43:

Scheduling and Resource Management

LSF Codine PBS CONDOR RES

Batch jobs

Interactive jobs

Parallel jobs

Accounting

Page 44:

Efficiency and Utilization

LSF Codine PBS CONDOR RES

Stage-in and stage-out

Timesharing

Process migration

Dynamic load balancing

Scalability

Page 45:

Fault Tolerance and Security

LSF Codine PBS CONDOR RES

Checkpointing

Daemon fault recovery

Authentication

Authorization

Page 46:

Documentation and Technical Support

LSF Codine PBS CONDOR RES

Documentation

Technical support

Page 47:

JMS features supporting extension to reconfigurable hardware

• capability to define new dynamic resources

• strong support for stage-in and stage-out
  - configuration bitstreams
  - executable code
  - input/output data

• support for Windows NT and Linux

Page 48:

Ranking of Centralized Job Management Systems (1)

Capability to define new dynamic resources:

Excellent: LSF, PBS, CODINE
More difficult: CONDOR, RES

Stage-in and stage-out:

Excellent: LSF, PBS
Limited: CONDOR
No: CODINE, RES

Page 49:

Ranking of Centralized Job Management Systems (2)

Overall suitability to extend to reconfigurable hardware:

1. LSF
2. CODINE
3. PBS
4. CONDOR
5. RES

without changing the JMS source code

requires changes to the JMS source code

Page 50:

Extension to reconfigurable hardware

Page 51:

Extension of LSF to reconfigurable hardware (1): Operation of LSF

[Diagram of LSF operation with numbered steps 1 to 13, connecting the components below.]

Submission host: LIM, Batch API
Master host: MLIM, MBD
Execution host: SBD, Child SBD, LIM, RES, User job
Also shown: bsub app, queue, load information, other hosts

LIM – Load Information Manager
MLIM – Master LIM
MBD – Master Batch Daemon
SBD – Slave Batch Daemon
RES – Remote Execution Server

Page 52:

Extension of LSF to reconfigurable hardware (2)

[Diagram of LSF operation extended with an FPGA board, with numbered steps 1 to 14, connecting the components below.]

Submission host: LIM, Batch API
Master host: MLIM, MBD
Execution host: SBD, Child SBD, LIM, RES, User job
Also shown: bsub app, queue, load information, other hosts, ELIM, ACS API, FPGA board, status of the board

ELIM – External Load Information Manager
ACS API – Adaptive Computing Systems API
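
A minimal sketch of what an ELIM reporting the status of the board might look like. It assumes the LSF convention that an ELIM periodically writes one line to standard output containing the number of indices followed by name/value pairs. The resource name `fpga_free`, the stub `read_board_status` (standing in for a query through the ACS API), and the reporting period are hypothetical and do not come from the slides.

```python
import sys
import time


def read_board_status():
    """Hypothetical stand-in for querying the FPGA board via the ACS API."""
    return {"fpga_free": 1}  # 1 = board available, 0 = board in use


def format_report(indices):
    """Build one report line: '<count> <name1> <value1> <name2> <value2> ...'."""
    parts = [str(len(indices))]
    for name, value in indices.items():
        parts += [name, str(value)]
    return " ".join(parts)


def run(reports, period_s=15.0, out=sys.stdout):
    """Write the given number of reports, one every period_s seconds."""
    for i in range(reports):
        out.write(format_report(read_board_status()) + "\n")
        out.flush()  # LIM reads the pipe, so each line must be flushed
        if i + 1 < reports:
            time.sleep(period_s)
```

A real ELIM would loop for as long as the LIM keeps it running; the bounded `run` function is used here only so that the sketch terminates.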