A science-gateway workload archive application to the self-healing of workflow incidents


Journées Scientifiques Mésocentres et France Grilles October 1st-3rd 2012

Rafael Ferreira da Silva – [email protected]

Rafael FERREIRA DA SILVA, Tristan GLATARD University of Lyon, CNRS, INSERM, CREATIS

Villeurbanne, France

Frédéric DESPREZ INRIA, University of Lyon, LIP, ENS Lyon

Lyon, France

Context: Workload Archives


Information produced by grid workflow executions, for example:

task_status, submit_time, execution_time, input_file, site_name, workflow_id, activity_name, exit_code

Useful for:

•  Assumptions validation

•  Computational activity modeling

•  Methods evaluation (simulation or experimental)

Science-gateway architecture

Components: User, Web Portal, Storage Element, Workflow Engine, Pilot Manager, Meta-Scheduler, Computing site

Steps:

0. Login
1. Send input data
2. Transfer input files
3. Launch workflow
4. Generate and submit task
5. Submit pilot jobs
6. Schedule pilot jobs
7. Get task
8. Get files
9. Execute
10. Upload results

State of the Art

Grid Workload Archives

•  Information gathered at infrastructure-level, about tasks

•  Trace fields shown in the figure: task_status, submit_time, execution_time, input_file, site_name, workflow_id, activity_name, exit_code

•  Examples:
   •  Parallel Workloads Archive (http://www.cs.huji.ac.il/labs/parallel/workload/)
   •  Grid Workloads Archive (http://gwa.ewi.tudelft.nl/pmwiki/)

Lack of critical information:

•  Dependencies among tasks
•  Task sub-steps
•  Application-level scheduling artifacts
•  User

[Figure: science-gateway architecture diagram annotated "At infrastructure-level"]

Outline

•  A science-gateway workload archive

•  Case studies
   •  Pilot Jobs
   •  Accounting
   •  Task analysis
   •  Bag of tasks

•  Workflow Self-Healing

•  Conclusions


Our approach

Science-Gateway Workload Archive

•  Information gathered at science-gateway level, about workflow executions

•  Trace fields shown in the figure: task_status, submit_time, execution_time, input_file, site_name, workflow_id, activity_name, exit_code

Advantages:

•  Fine-grained information about tasks
•  Dependencies among tasks
•  Workflow characterization
•  Accounting

[Figure: science-gateway architecture diagram annotated "At science-gateway level"]

Virtual Imaging Platform

•  Virtual Imaging Platform (VIP)
   •  Medical imaging science-gateway
   •  Grid of 129 sites (EGI – http://www.egi.eu)

•  Significant usage
   •  Registered users: 244 from 26 countries
   •  Applications: 18
   •  Consumed 32 CPU years in 2011

[Figure: VIP usage in 2011: CPU consumption of VIP and related platforms on EGI (labels: Applications, File transfer)]

VIP – http://vip.creatis.insa-lyon.fr

SGWA

•  Science Gateway Workload Archive (SGWA)

•  Archive is extracted from VIP

Science-gateway archive model (figure):

•  Task, Site and Workflow Execution are acquired from databases populated by the workflow engine at runtime

•  File and Pilot Job are extracted by parsing the task standard output and error files

Workload for Case Studies

•  Based on the workload of VIP

•  January 2011 to April 2012

•  112 users

•  2,941 workflow executions

•  339,545 pilot jobs

•  680,988 tasks, by status:

Task status        Count
completed          338,989
error              138,480
aborted            105,488
aborted replicas   15,576
stalled            48,293
queued             34,162

Pilot Jobs

•  A single pilot can wrap several tasks and users

•  At infrastructure-level
   •  Assimilates pilot jobs to tasks and users
   •  Valid for only 62% of the tasks
   •  Valid for 95% of user-task associations

•  At science-gateway level
   •  Users and tasks are correctly associated to pilots

Histogram of tasks per pilot (values read from the figure):

Tasks per pilot   Frequency
1                 282,331
2                 28,121
3                 11,885
4                 6,721
5                 10,487

Histogram of users per pilot (values read from the figure):

Users per pilot   Frequency
1                 323,214
2                 15,178
3                 1,079
4                 70
5                 4

Accounting: Users

•  Authentications based on login and password are mapped to X.509 robot certificates

•  At infrastructure-level
   •  All VIP users are reported as a single user

•  At science-gateway level
   •  Maps task executions to VIP users

[Figure: Number of reported EGI and VIP users (x-axis: Months, 1 to 16; y-axis: Users; series: EGI, VIP)]

Accounting: CPU and Wall-clock Time

•  Huge discrepancy of values
   •  Pilot jobs do not register to the pilot system
   •  Absence of workload
   •  Outputs unretrievable
   •  Pilot setup time
   •  Lost tasks (a.k.a. stalled)

•  Undetectable at infrastructure-level

[Figure: Number of submitted pilot jobs by EGI and VIP (x-axis: Month; y-axis: Number of jobs; series: VIP jobs, EGI jobs)]

[Figure: Consumed CPU and wall-clock time by EGI and VIP (x-axis: Month; y-axis: Years; series: VIP CPU time, VIP Wall-clock time, EGI CPU time, EGI Wall-clock time)]

Task Analysis

•  At infrastructure-level
   •  Limited to task exit codes

•  At science-gateway level
   •  Fine-grained information
   •  Steps in task life
   •  Error causes
   •  Replicas per task

Error causes (values read from the figure):

Error cause    Number of tasks
application    55,165
input          50,925
stalled        48,293
output         19,463
folder         1,123

[Figure: Different steps in task life (x-axis: Time (s), from 1 to 10000; y-axis: CDF; series: download, execution, upload)]

Bag of Tasks: at Infrastructure level

•  Evaluation of the accuracy of the method of Iosup et al. [8] to detect bags of tasks (BoT)

•  Two successively submitted tasks are in the same BoT if the time interval between their submission times is lower than or equal to Δ.

Example (figure): Task 1, Task 2 and Task 3 are submitted at times t1, t2 and t3.

•  Δ1,2 = |t1 – t2| ≤ Δ: Task 1 and Task 2 are in the same BoT (BoT 1)

•  Δ2,3 = |t2 – t3| > Δ: Task 3 is in another BoT (BoT 2)

[8] Iosup, A., Jan, M., Sonmez, O., Epema, D.: The characteristics and performance of groups of jobs in grids. In: Euro-Par (2007) 382-393
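The Δ rule stated above can be sketched in a few lines of Python. This is a minimal illustration of the rule as worded on the slide, not the implementation of Iosup et al.; the function name `detect_bots` and the sample submission times are assumptions made for the example.

```python
from typing import List


def detect_bots(submit_times: List[float], delta: float) -> List[List[float]]:
    """Group submission times into bags of tasks (BoT) with the Δ rule.

    Two successively submitted tasks fall in the same BoT when the interval
    between their submission times is lower than or equal to delta.
    """
    bots: List[List[float]] = []
    for t in sorted(submit_times):
        # Compare with the last task of the most recent BoT.
        if bots and t - bots[-1][-1] <= delta:
            bots[-1].append(t)
        else:
            bots.append([t])  # gap larger than delta: start a new BoT
    return bots


# Hypothetical submission times in seconds, with delta = 120 s.
bots = detect_bots([0.0, 50.0, 400.0], delta=120.0)
print(bots)  # [[0.0, 50.0], [400.0]]
```

With these values, the first two tasks form one BoT and the third one starts a new BoT, which mirrors the three-task example of the figure.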

Bag of Tasks: Size and Duration

Infrastructure vs science-gateway

[Figure: CDF of BoT size (x-axis: Size (number of tasks); y-axis: CDF; series: Real Batch, Batch)]

[Figure: CDF of duration (x-axis: Duration (s); y-axis: CDF; series: Real Batch, Real Non-Batch, Batch, Non-Batch)]

Legend:

•  Real Batch = ground-truth BoT
•  Real Non-Batch = ground-truth non-BoT
•  Batch = Iosup et al. BoT
•  Non-Batch = Iosup et al. non-BoT

Findings:

•  90% of Batch BoTs have a size between 2 and 10, while this range represents 50% of Real Batch

•  Non-Batch duration is overestimated by up to 400%

Bag of Tasks: Inter-arrival Time and Consumed CPU Time

[Figure: CDF of inter-arrival time (x-axis: Inter-Arrival Time (s); y-axis: CDF; series: Real Batch, Real Non-Batch, Batch, Non-Batch)]

[Figure: CDF of consumed CPU time (x-axis: Consumed CPU Time (KCPUs); y-axis: CDF; series: Real Batch, Real Non-Batch, Batch, Non-Batch)]

Legend:

•  Real Batch = ground-truth BoT
•  Real Non-Batch = ground-truth non-BoT
•  Batch = Iosup et al. BoT
•  Non-Batch = Iosup et al. non-BoT

Findings:

•  Batch and Non-Batch inter-arrival times are underestimated by about 30%

•  CPU times are underestimated by 25% for Non-Batch and by about 20% for Batch


Workflow Self-Healing

•  Problem: costly manual operations
   •  Rescheduling tasks, restarting services, killing misbehaving experiments or replicating data files

•  Objective: automated platform administration
   •  Autonomous detection of operational incidents
   •  Perform appropriate set of actions

•  Assumptions: online and non-clairvoyant
   •  Only partial information available
   •  Decisions must be fast
   •  Production conditions, no prediction of user activity or workloads


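General MAPE-K loop

•  Phases: Monitoring, Analysis, Planning, Execution, with shared Knowledge

•  Trigger: event (job completion and failures) or timeout

•  Monitoring data is used to compute the incident degree η, which is mapped to a level (level 1, level 2, level 3)

•  Planning: roulette wheel selection of an incident, then roulette wheel selection based on association rules, leading to a set of actions

Roulette wheel selection: incident i is selected with probability

$$ p_i = \frac{\eta_i}{\sum_{j=1}^{n} \eta_j} $$

Example (figure): Incident 1 has degree η = 0.8; the wheel sectors are 0.61, 0.30 and 0.07; Incident 1 is selected.

Association rules for incident 1:

Rule     Confidence (ρ)   ρ × η
2 → 1    0.8              0.32
3 → 1    0.2              0.02
1 → 1    1.0              0.80

Roulette wheel selection based on association rules (figure): the wheel sectors are 0.37, 0.16 and 0.66; Incident 2 is selected.

[Figure: histogram "Estimation by Median" (x-axis: 0.0 to 1.0; y-axis: Frequency), shown as an example of monitoring data]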
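The roulette wheel selection of incidents can be illustrated with a short Python sketch. It is an assumption-laden example rather than the platform code: the function names are hypothetical, and the degrees 0.4 and 0.1 for incidents 2 and 3 are illustrative values chosen so that the resulting probabilities are close to the sector sizes 0.61, 0.30 and 0.07 shown for the first wheel.

```python
import random
from typing import Dict, Optional


def selection_probabilities(degrees: Dict[str, float]) -> Dict[str, float]:
    """Return eta_i / sum_j eta_j for every incident."""
    total = sum(degrees.values())
    if total <= 0:
        raise ValueError("at least one incident degree must be positive")
    return {name: eta / total for name, eta in degrees.items()}


def roulette_select(degrees: Dict[str, float],
                    rng: Optional[random.Random] = None) -> str:
    """Pick one incident with probability proportional to its degree."""
    rng = rng or random.Random()
    names = list(degrees)
    weights = [degrees[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Incident 1 has degree 0.8 as on the slide; the other two are assumed.
degrees = {"incident 1": 0.8, "incident 2": 0.4, "incident 3": 0.1}
probs = selection_probabilities(degrees)
print({name: round(p, 3) for name, p in probs.items()})
print(roulette_select(degrees, random.Random(42)))
```

Because the selection is random, an incident with a low degree can still be picked occasionally, which keeps less severe incidents from being ignored entirely.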

Incident: Activity Blocked

•  An invocation is late compared to the others

•  Possible causes
   •  Longer waiting times
   •  Lost tasks (e.g. killed by site due to quota violation)
   •  Resources with poor performance

[Figures: Invocations completion rate for a simulation; Job flow for a simulation. Recovered plot labels: title "FIELD-II/pasa - workflow-9SIeNv"; x-axis: Time (s); y-axis: Completed Jobs]

Activity blocked: degree

•  Degree computed from all completed jobs of the activity

•  Job phases: setup → inputs download → execution → outputs upload

•  Assumption: bag-of-tasks (all jobs have equal durations)

•  Median-based estimation: E_i is the estimated job duration and M_i is the sum of the median durations of the job phases

•  Incident degree: job performance w.r.t. the median

$$ d = \frac{E_i}{M_i + E_i} \in [0, 1] $$

Example:

Phase             Median duration   Real job duration   Estimated job duration
setup             50s               42s (completed)     42s
inputs download   250s              300s (completed)    300s
execution         400s              20s (current)       400s*
outputs upload    15s               ?                   15s
Total             M_i = 715s                            E_i = 757s

*: max(400s, 20s) = 400s
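The median-based estimation and the degree d can be worked through in Python with the numbers of the example. This is a sketch under assumptions: the function names and the list-based representation of phases are hypothetical, and the rule applied to each phase (real duration when completed, max(median, elapsed) when running, median when not started) is inferred from the example values.

```python
from typing import Optional, Sequence


def estimated_job_duration(medians: Sequence[float],
                           real: Sequence[Optional[float]],
                           current: int) -> float:
    """Estimate E_i from phase medians and the durations observed so far."""
    total = 0.0
    for i, (median, elapsed) in enumerate(zip(medians, real)):
        if i < current:
            total += elapsed               # completed phase: real duration
        elif i == current:
            total += max(median, elapsed)  # running phase: at least the median
        else:
            total += median                # phase not started: median
    return total


def incident_degree(medians: Sequence[float],
                    real: Sequence[Optional[float]],
                    current: int) -> float:
    """Compute d = E_i / (M_i + E_i)."""
    m_i = sum(medians)
    e_i = estimated_job_duration(medians, real, current)
    return e_i / (m_i + e_i)


# Phases: setup, inputs download, execution, outputs upload (seconds).
medians = [50.0, 250.0, 400.0, 15.0]
real = [42.0, 300.0, 20.0, None]   # execution is the current phase
e_i = estimated_job_duration(medians, real, current=2)
d = incident_degree(medians, real, current=2)
print(e_i, round(d, 3))  # 757.0 0.514
```

A job that behaves exactly like the median yields d = 0.5, and d moves toward 1 as the estimated duration grows relative to the median.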

Activity blocked: levels and actions

•  Levels: identified from the platform logs

•  Actions
   •  Job replication
   •  Cancel replicas with bad performance
   •  Replicate only if all active replicas are running

[Figure: Replication process for one task]

[Figure: histogram of d, "Estimation by Median" (x-axis: d, from 0.0 to 1.0; y-axis: Frequency). The threshold τ1 separates Level 1 (no actions) from Level 2 (action: replicate jobs).]
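The mapping from degree to level and the replication condition can be sketched as follows. Everything specific here is an assumption for illustration: the value of τ1 is not given on the slide, so `TAU_1 = 0.6` is a placeholder, the state name "running" is hypothetical, and the cap of 5 replicas is borrowed from the experiment parameters of this talk.

```python
from typing import Sequence

TAU_1 = 0.6        # hypothetical threshold; the real value comes from platform logs
MAX_REPLICAS = 5   # replication limit used in the experiments


def incident_level(d: float, tau_1: float = TAU_1) -> int:
    """Level 1 (no actions) below the threshold, level 2 (replicate) otherwise."""
    return 1 if d < tau_1 else 2


def should_replicate(d: float, replica_states: Sequence[str],
                     tau_1: float = TAU_1,
                     max_replicas: int = MAX_REPLICAS) -> bool:
    """Replicate only at level 2 and only if all active replicas are running."""
    if incident_level(d, tau_1) < 2:
        return False
    if len(replica_states) >= max_replicas:
        return False
    return all(state == "running" for state in replica_states)


print(should_replicate(0.8, ["running"]))            # True
print(should_replicate(0.8, ["running", "queued"]))  # False
print(should_replicate(0.3, ["running"]))            # False
```

Waiting until every active replica is running avoids piling up replicas that are still queued, which would add load without new information about the job.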

Experiments

•  Goal: Self-Healing vs No-Healing
   •  Cope with recoverable errors

•  Metrics
   •  Makespan of the activity execution
   •  Resource waste:

$$ w = \frac{(\mathrm{CPU} + \mathrm{data})_{\text{self-healing}}}{(\mathrm{CPU} + \mathrm{data})_{\text{no-healing}}} - 1 $$

   •  For w < 0: self-healing consumed fewer resources
   •  For w > 0: self-healing wasted resources
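The resource waste metric w is a simple ratio, sketched below in Python. The function name, the assumption that CPU and data consumption are expressed in the same unit so they can be summed, and the input numbers are all hypothetical; they are chosen only so that the result equals the -0.26 reported for one repetition.

```python
def resource_waste(cpu_self: float, data_self: float,
                   cpu_none: float, data_none: float) -> float:
    """Compute w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1."""
    baseline = cpu_none + data_none
    if baseline <= 0:
        raise ValueError("the no-healing consumption must be positive")
    return (cpu_self + data_self) / baseline - 1.0


# Hypothetical consumptions in arbitrary but identical units.
w = resource_waste(cpu_self=60.0, data_self=14.0,
                   cpu_none=80.0, data_none=20.0)
print(round(w, 2))  # -0.26
```

A negative value means the self-healing run consumed less than the no-healing run, and a positive value means replication cost more than it saved.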

Experiment Conditions

•  Software
   •  Virtual Imaging Platform
   •  MOTEUR workflow engine
   •  DIRAC pilot job system

•  Infrastructure
   •  European Grid Infrastructure (EGI): production, shared
   •  Self-Healing and No-Healing launched simultaneously

•  Experiment parameters
   •  Task and file replication limited to 5
   •  Failed task resubmission limited to 5


Applications

FIELD-II/pasa

•  Ultrasound imaging simulation
•  122 invocations
•  CPU Time: 15 min
•  ~210 MB
•  Data-intensive

Mean-Shift/hs3

•  Image denoising
•  250 invocations
•  CPU Time: 1 hour
•  ~182 MB
•  CPU-intensive

Image credits:

•  Image courtesy of ANR project US-Tagging http://www.creatis.insa-lyon.fr/us-tagging/news (O. Bernard, M. Alessandrini)
•  Image courtesy of Ting Li http://www.creatis.insa-lyon.fr

Results

•  Experiment: tests if recoverable errors are detected

[Figures: Makespan (s) of No-Healing and Self-Healing over 5 repetitions, for FIELD-II/pasa and Mean-Shift/hs3]

•  FIELD-II/pasa: speeds up execution by up to a factor of 4

•  Mean-Shift/hs3: speeds up execution by up to a factor of 2.6

•  The Self-Healing process reduced resource consumption by up to 26% compared to the No-Healing execution

Resource waste w per repetition:

Repetition   FIELD-II/pasa   Mean-Shift/hs3
1            -0.10           -0.02
2            -0.15           -0.20
3            -0.09           -0.02
4            0.05            -0.02
5            -0.26           -0.01

Conclusions

•  Science-gateway model of workload archive
   •  Illustrated using traces of VIP from 2011/2012

•  Added value when compared to infrastructure-level traces
   •  Exactly identifies tasks and users
   •  Distinguishes additional workload artifacts from real workload
   •  Fine-grained information about tasks
   •  Ground truth of bags of tasks

•  Self-healing of workflow incidents
   •  Implements a generic MAPE-K loop
   •  Incident degrees computed online
   •  Speeds up execution by up to a factor of 4
   •  Reduces resource consumption by up to 26%
   •  Successful example of a self-healing loop deployed in production

•  VIP is openly available at http://vip.creatis.insa-lyon.fr

•  Traces are available to the community in the Grid Observatory: http://www.grid-observatory.org


Thank you for your attention. Questions?


ACKNOWLEDGMENTS

•  VIP users and project members
•  French National Agency for Research (ANR-09-COSI-03)
•  European Grid Initiative (EGI)
•  France-Grilles
