Page 1
A science-gateway workload archive application to the self-healing
of workflow incidents
Journées Scientifiques Mésocentres et France Grilles October 1st-3rd 2012
Rafael Ferreira da Silva – [email protected]
Rafael FERREIRA DA SILVA, Tristan GLATARD University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France
Frédéric DESPREZ INRIA, University of Lyon, LIP, ENS Lyon
Lyon, France
Page 2
Context: Workload Archives
Information produced by grid workflow executions: task_status, submit_time, execution_time, input_file, site_name, workflow_id, activity_name, exit_code
Useful for:
• Assumptions validation
• Computational activity modeling
• Methods evaluation (simulation or experimental)
Page 3
Science-gateway architecture
Components: User, Web Portal, Storage Element, Workflow Engine, Pilot Manager, Meta-Scheduler, Computing site
Steps:
0. Login
1. Send input data
2. Transfer input files
3. Launch workflow
4. Generate and submit task
5. Submit pilot jobs
6. Schedule pilot jobs
7. Get task
8. Get files
9. Execute
10. Upload results
Page 4
State of the Art
Grid Workload Archives: information gathered at infrastructure-level about tasks (task_status, submit_time, execution_time, input_file, site_name, workflow_id, activity_name, exit_code)
Lack of critical information:
• Dependencies among tasks
• Task sub-steps
• Application-level scheduling artifacts
• User
Examples:
• Parallel Workloads Archive (http://www.cs.huji.ac.il/labs/parallel/workload/)
• Grid Workloads Archive (http://gwa.ewi.tudelft.nl/pmwiki/)
Page 5
At infrastructure-level
[Figure: the science-gateway architecture diagram of Page 3 (components and steps 0 to 10), shown again to locate the infrastructure level]
Page 6
Outline
• A science-gateway workload archive
• Case studies
  • Pilot Jobs
  • Accounting
  • Task analysis
  • Bag of tasks
• Workflow Self-Healing
• Conclusions
Page 7
Our approach
Science-Gateway Workload Archive: information gathered at science-gateway level about workflow executions (task_status, submit_time, execution_time, input_file, site_name, workflow_id, activity_name, exit_code)
Advantages:
• Fine-grained information about tasks
• Dependencies among tasks
• Workflow characterization
• Accounting
Page 8
At science-gateway level
[Figure: the science-gateway architecture diagram of Page 3 (components and steps 0 to 10), shown again to locate the science-gateway level]
Page 9
Virtual Imaging Platform (VIP)
Medical imaging science-gateway
Grid of 129 sites (EGI – http://www.egi.eu)
Significant usage:
• Registered users: 244 from 26 countries
• Applications: 18
• Consumed 32 CPU years in 2011
[Figure: VIP usage in 2011: CPU consumption of VIP and related platforms on EGI (labels: Applications, File transfer)]
VIP – http://vip.creatis.insa-lyon.fr
Page 10
Science Gateway Workload Archive (SGWA)
Archive is extracted from VIP
[Figure: Science-gateway archive model]
• Task, Site and Workflow Execution: acquired from databases populated by the workflow engine at runtime
• File and Pilot Job: extracted by parsing task standard output and error files
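As a rough illustration of what one archive entry could look like, the sketch below models a task record with the fields listed on the earlier slides. The class name, the field types, the units and the sample values are assumptions made for the example; they are not the actual SGWA schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """One task entry; field names follow the slides, types are assumed."""
    workflow_id: str
    activity_name: str
    site_name: str
    task_status: str                  # e.g. "completed", "error", "stalled"
    submit_time: float                # seconds since epoch (assumed unit)
    execution_time: float             # seconds (assumed unit)
    input_file: Optional[str] = None
    exit_code: Optional[int] = None


def tasks_of_workflow(records, workflow_id):
    """Return the tasks that belong to one workflow execution."""
    return [r for r in records if r.workflow_id == workflow_id]


# Hypothetical sample records, for illustration only.
records = [
    TaskRecord("wf-1", "simulate", "site-A", "completed", 0.0, 120.0, "in.dat", 0),
    TaskRecord("wf-1", "merge", "site-B", "error", 130.0, 15.0, "out.dat", 1),
    TaskRecord("wf-2", "simulate", "site-A", "stalled", 5.0, 0.0),
]

print(len(tasks_of_workflow(records, "wf-1")))
```

Keeping workflow_id and activity_name on every task is what makes it possible to recover workflow-level groupings that an infrastructure-level trace does not carry.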
Page 11
Workload for Case Studies
Based on the workload of VIP, January 2011 to April 2012
• 112 users
• 2,941 workflow executions
• 680,988 tasks
  • 338,989 completed
  • 138,480 error
  • 105,488 aborted
  • 15,576 aborted replicas
  • 48,293 stalled
  • 34,162 queued
• 339,545 pilot jobs
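A quick consistency check of the task counts can be sketched as follows. It only uses the numbers on this slide; treating the six statuses as mutually exclusive is an assumption, supported by the fact that they add up to the reported total.

```python
# Task counts per final status, copied from the slide.
task_counts = {
    "completed": 338_989,
    "error": 138_480,
    "aborted": 105_488,
    "aborted replicas": 15_576,
    "stalled": 48_293,
    "queued": 34_162,
}

total_tasks = sum(task_counts.values())

# Share of each status, in percent of all tasks.
shares = {status: 100.0 * n / total_tasks for status, n in task_counts.items()}

for status, pct in shares.items():
    print(f"{status:>17}: {pct:5.1f}%")
```

Under that reading, about half of the submitted tasks complete successfully, which is the background for the self-healing work later in the talk.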
Page 12
Pilot Jobs
A single pilot can wrap several tasks and users
At infrastructure-level:
• Assimilates pilot jobs to tasks and users
• Valid for only 62% of the tasks
• Valid for 95% of user-task associations
At science-gateway level:
• Users and tasks are correctly associated to pilots
[Figure: histograms of tasks per pilot and users per pilot (y-axis: frequency)]
Per pilot          1        2        3       4        5
Tasks        282,331   28,121   11,885   6,721   10,487
Users        323,214   15,178    1,079      70        4
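The 62% and 95% figures can be approximately recovered from the two histograms on this slide. The sketch below is one plausible reading, not the authors' stated computation: it assumes the last bin means exactly 5 (the figure may mean "5 or more"), and it approximates the share of correct user-task associations by the share of pilots that served a single user.

```python
# Histogram values read from the slide: number of pilots wrapping k tasks / k users.
# Assumption: the last bin means exactly 5.
tasks_per_pilot = {1: 282_331, 2: 28_121, 3: 11_885, 4: 6_721, 5: 10_487}
users_per_pilot = {1: 323_214, 2: 15_178, 3: 1_079, 4: 70, 5: 4}

total_pilots = sum(tasks_per_pilot.values())
total_tasks = sum(k * n for k, n in tasks_per_pilot.items())

# Tasks that ran alone in their pilot: "one pilot = one task" holds for them.
single_task_share = tasks_per_pilot[1] / total_tasks

# Pilots that served a single user: "one pilot = one user" holds for them.
single_user_share = users_per_pilot[1] / sum(users_per_pilot.values())

print(f"pilots: {total_pilots}")
print(f"tasks alone in their pilot: {single_task_share:.0%}")
print(f"pilots with a single user: {single_user_share:.0%}")
```

Both histograms sum to the 339,545 pilot jobs reported for the workload, which suggests the bin values were read correctly.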
Page 13
Accounting: Users
Authentications based on login and password are mapped to X.509 robot certificates
At infrastructure-level:
• All VIP users are reported as a single user
At science-gateway level:
• Maps task executions to VIP users
[Figure: Number of reported EGI and VIP users (x-axis: months 1 to 16; y-axis: users, 0 to 40; series: EGI, VIP)]
Page 14
Accounting: CPU and Wall-clock Time
Huge discrepancy of values:
• Pilot jobs do not register to the pilot system
• Absence of workload
• Outputs unretrievable
• Pilot setup time
• Lost tasks (a.k.a. stalled)
Undetectable at infrastructure-level
[Figure: Number of submitted pilot jobs by EGI and VIP (x-axis: month; y-axis: number of jobs, 1e+05 to 6e+05; series: VIP jobs, EGI jobs)]
[Figure: Consumed CPU and wall-clock time by EGI and VIP (x-axis: month; y-axis: years, 50 to 150; series: VIP CPU time, VIP wall-clock time, EGI CPU time, EGI wall-clock time)]
Page 15
Task Analysis
At infrastructure-level:
• Limited to task exit codes
At science-gateway level:
• Fine-grained information
  • Steps in task life
  • Error causes
  • Replicas per task
[Figure: Error causes (number of tasks): application 55,165; input 50,925; stalled 48,293; output 19,463; folder 1,123]
[Figure: Different steps in task life: CDF of the time (s, axis ticks at 1, 100 and 10000) spent in download, execution and upload]
Page 16
Bag of Tasks: at Infrastructure level
Evaluation of the accuracy of Iosup et al.[8] method to detect bag of tasks (BoT)
Two successively submitted tasks are in the same BoT if the time interval between submission times is lower or equal to Δ.
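The grouping rule stated above can be sketched in a few lines. This is a minimal illustration of the Δ rule only, as written on the slide; the function name, the sample submission times and the threshold value are hypothetical, and any additional criteria the original method of Iosup et al. may apply are not modeled.

```python
def group_bags_of_tasks(submit_times, delta):
    """Group tasks into bags: consecutive submissions at most `delta` apart share a bag."""
    bags = []
    for t in sorted(submit_times):
        if bags and t - bags[-1][-1] <= delta:
            bags[-1].append(t)      # close enough to the previous task: same BoT
        else:
            bags.append([t])        # gap larger than delta: start a new BoT
    return bags


# Hypothetical submission times (seconds) and threshold.
example = group_bags_of_tasks([0, 50, 400], delta=120)
print(example)
```

Because the rule only looks at submission times, tasks from unrelated workflow executions that happen to be submitted close together end up in the same bag, which is the kind of error the comparison with the science-gateway ground truth measures.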
[Figure: three tasks submitted at times t1, t2 and t3 on a time axis, with Δ1,2 = |t1 – t2| and Δ2,3 = |t2 – t3|]
• Δ1,2 ≤ Δ: Task 1 and Task 2 belong to the same bag (BoT 1)
• Δ2,3 > Δ: Task 3 starts a new bag (BoT 2)
[8] Iosup, A., Jan, M., Sonmez, O., Epema, D.: The Characteristics and performance of groups of jobs in grids. In: Euro-Par. (2007) 382-393
Page 17
Bag of Tasks: Size and Duration
Infrastructure vs science-gateway
[Figure: CDF of BoT size (number of tasks, up to 1000) for Real Batch and Batch]
[Figure: CDF of duration (s, up to 50000) for Real Batch, Real Non-Batch, Batch and Non-Batch]
Legend:
• Real Batch = ground-truth BoT
• Real Non-Batch = ground-truth non-BoT
• Batch = Iosup et al. BoT
• Non-Batch = Iosup et al. non-BoT
Observations:
• 90% of Batch BoTs have a size between 2 and 10 tasks, whereas this range covers only 50% of Real Batch BoTs
• Non-Batch duration is overestimated by up to 400%
Page 18
Bag of Tasks: Inter-arrival Time and Consumed CPU Time
[Figure: CDF of inter-arrival time (s, up to 10000) for Real Batch, Real Non-Batch, Batch and Non-Batch]
[Figure: CDF of consumed CPU time (KCPUs, up to 30000) for Real Batch, Real Non-Batch, Batch and Non-Batch]
Legend:
• Real Batch = ground-truth BoT
• Real Non-Batch = ground-truth non-BoT
• Batch = Iosup et al. BoT
• Non-Batch = Iosup et al. non-BoT
Observations:
• Batch and Non-Batch inter-arrival times are underestimated by about 30%
• CPU times are underestimated by 25% for Non-Batch and by about 20% for Batch
Page 19
Outline
• A science-gateway workload archive
• Case studies
  • Pilot Jobs
  • Accounting
  • Task analysis
  • Bag of tasks
• Workflow Self-Healing
• Conclusions
Page 20
Workflow Self-Healing
Problem: costly manual operations
• Rescheduling tasks, restarting services, killing misbehaving experiments or replicating data files
Objective: automated platform administration
• Autonomous detection of operational incidents
• Perform appropriate set of actions
Assumptions: online and non-clairvoyant
• Only partial information available
• Decisions must be fast
• Production conditions: no prediction of user activity or workloads
Page 21
General MAPE-K loop
[Figure: MAPE-K loop with Monitoring, Analysis, Planning, Execution and Knowledge components, triggered by an event (job completion and failures) or a timeout]
Incident degrees, each quantified into level 1, level 2 or level 3:
• Incident 1: degree η = 0.8
• Incident 2: degree η = 0.4
• Incident 3: degree η = 0.1
Roulette wheel selection: incident i is selected with probability
$p(I_i) = \frac{\eta_i}{\sum_{j=1}^{n} \eta_j}$
which gives 0.61, 0.30 and 0.07 for incidents 1, 2 and 3 (Incident 1 selected in the example)
Association rules for incident 1:
Rule     Confidence (ρ)   ρ × η
2 → 1    0.8              0.32
3 → 1    0.2              0.02
1 → 1    1.0              0.80
Roulette wheel selection based on association rules: wheel values 0.37, 0.16 and 0.66; Incident 2 selected (x2), leading to the set of actions
[Figure: monitoring data, histogram of the incident degree estimated by median (x-axis: 0.0 to 1.0; y-axis: frequency)]
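The first roulette wheel of this slide can be sketched as follows, using the three incident degrees given in the example (0.8, 0.4 and 0.1). The function names are hypothetical, and the sketch covers only the degree-proportional selection, not the second wheel based on association rules.

```python
import random


def selection_probabilities(degrees):
    """Normalise incident degrees so that they sum to 1."""
    total = sum(degrees.values())
    return {incident: eta / total for incident, eta in degrees.items()}


def roulette_select(degrees, rng=random):
    """Pick one incident with probability proportional to its degree."""
    incidents = list(degrees)
    weights = [degrees[i] for i in incidents]
    return rng.choices(incidents, weights=weights, k=1)[0]


# Degrees from the slide example.
degrees = {"incident 1": 0.8, "incident 2": 0.4, "incident 3": 0.1}
probs = selection_probabilities(degrees)
print({k: round(v, 3) for k, v in probs.items()})

# A seeded generator makes the draw reproducible.
print(roulette_select(degrees, random.Random(0)))
```

The normalised degrees are 0.8/1.3, 0.4/1.3 and 0.1/1.3; truncated to two decimals they match the 0.61, 0.30 and 0.07 shown on the wheel. Selecting at random rather than always taking the highest degree leaves lower-degree incidents a chance of being handled.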
Page 22
Incident: Activity Blocked
An invocation is late compared to the others
Possible causes:
• Longer waiting times
• Lost tasks (e.g. killed by site due to quota violation)
• Resources with poor performance
[Figures: Invocations completion rate for a simulation; Job flow for a simulation. Plot title: FIELD-II/pasa - workflow-9SIeNv (x-axis: Time (s); y-axis: Completed Jobs, 0 to 100)]
Page 23
Activity blocked: degree
Degree computed from all completed jobs of the activity
Job phases: setup → inputs download → execution → outputs upload
Assumption: bag-of-tasks (all jobs have equal durations)
Median-based estimation:
Phase                            setup   inputs download   execution       outputs upload   total
Median duration of job phases    50s     250s              400s            15s              Mi = 715s
Real job duration                42s     300s              20s (current)   ?
Estimated job duration           42s     300s              400s*           15s              Ei = 757s
*: max(400s, 20s) = 400s
Setup and inputs download are completed; execution is the current phase
Incident degree: job performance w.r.t. median
$d = \frac{E_i}{M_i + E_i} \in [0, 1]$
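The median-based estimation of this slide can be reproduced with a short sketch. The function names and the list-based representation of phases are assumptions; the rule itself (real duration for completed phases, the larger of median and elapsed time for the current phase, median for the remaining phases) follows the example values shown here.

```python
def estimate_job_duration(medians, elapsed, current_phase):
    """Estimate the total duration of a running job, phase by phase.

    medians: median duration of each phase over completed jobs (seconds)
    elapsed: time the job has spent in each phase it has reached so far
    current_phase: index of the phase the job is in right now
    """
    estimate = 0.0
    for i, median in enumerate(medians):
        if i < current_phase:
            estimate += elapsed[i]                 # completed phase: real duration
        elif i == current_phase:
            estimate += max(median, elapsed[i])    # running phase: at least the median
        else:
            estimate += median                     # future phase: median
    return estimate


def incident_degree(medians, elapsed, current_phase):
    """d = E / (M + E), in [0, 1]."""
    m = sum(medians)
    e = estimate_job_duration(medians, elapsed, current_phase)
    return e / (m + e)


# Values from the slide: setup, inputs download, execution, outputs upload.
medians = [50, 250, 400, 15]
elapsed = [42, 300, 20]          # the job is currently executing
d = incident_degree(medians, elapsed, current_phase=2)
print(round(d, 3))
```

With these numbers the estimate is 42 + 300 + 400 + 15 = 757 s against a median total of 715 s, so d = 757 / 1472, slightly above 0.5. A job that behaves exactly like the median gets d = 0.5, and d grows toward 1 as the job falls behind.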
Page 24
Activity blocked: levels and actions
Levels: identified from the platform logs
Actions:
• Job replication
  • Cancel replicas with bad performance
  • Replicate only if all active replicas are running
[Figure: Replication process for one task]
[Figure: histogram of the degree d estimated by median (x-axis: 0.0 to 1.0; y-axis: frequency), with threshold τ1 separating Level 1 (no actions) from Level 2 (action: replicate jobs)]
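The level and replication rules can be sketched as a small decision function. Several points are assumptions and not taken from the slides: that Level 2 corresponds to degrees at or above τ1, the value 0.6 used for τ1 in the example, the state names, and the criterion for cancelling badly performing replicas (which is not modeled). The limit of 5 replicas is the experiment parameter given later in the talk.

```python
def incident_level(degree, tau1):
    """Level 1 below the threshold, level 2 at or above it (assumed direction)."""
    return 2 if degree >= tau1 else 1


def should_replicate(degree, tau1, replica_states, max_replicas=5):
    """Replicate a task only at level 2, and only if every active replica is running."""
    if incident_level(degree, tau1) < 2:
        return False                      # level 1: no action
    if len(replica_states) >= max_replicas:
        return False                      # replication limit reached
    return all(state == "running" for state in replica_states)


# Hypothetical threshold and replica states.
print(should_replicate(0.7, 0.6, ["running"]))
print(should_replicate(0.7, 0.6, ["running", "queued"]))
print(should_replicate(0.3, 0.6, ["running"]))
```

Requiring all active replicas to be running avoids stacking new replicas behind ones that are still waiting in a queue.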
Page 25
Experiments
Goal: Self-Healing vs No-Healing
• Cope with recoverable errors
Metrics:
• Makespan of the activity execution
• Resource waste:
  $w = \frac{(CPU + data)_{\text{self-healing}}}{(CPU + data)_{\text{no-healing}}} - 1$
  • For w < 0: self-healing consumed less resources
  • For w > 0: self-healing wasted resources
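The resource waste metric is a simple ratio; the sketch below computes it for made-up totals. The function name, the input values and the choice of seconds as the unit for both CPU and data terms are assumptions for illustration.

```python
def resource_waste(cpu_self, data_self, cpu_no, data_no):
    """w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1."""
    return (cpu_self + data_self) / (cpu_no + data_no) - 1


# Hypothetical totals in seconds (CPU time plus data transfer time).
w = resource_waste(cpu_self=600, data_self=140, cpu_no=800, data_no=200)
print(f"w = {w:+.2f}")
```

A negative value, as in this example, means the self-healing run used fewer resources than the no-healing run even though it replicates tasks.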
Page 26
Experiment Conditions
Software:
• Virtual Imaging Platform
• MOTEUR workflow engine
• DIRAC pilot job system
Infrastructure:
• European Grid Infrastructure (EGI): production, shared
• Self-Healing and No-Healing launched simultaneously
Experiment parameters:
• Task and file replication limited to 5
• Failed task resubmission limited to 5
Page 27
Applications
FIELD-II/pasa
• Ultrasound imaging simulation
• 122 invocations
• CPU Time: 15 min
• ~210 MB
• Data-intensive
• Image courtesy of ANR project US-Tagging (O. Bernard, M. Alessandrini), http://www.creatis.insa-lyon.fr/us-tagging/news
Mean-Shift/hs3
• Image denoising
• 250 invocations
• CPU Time: 1 hour
• ~182 MB
• CPU-intensive
• Image courtesy of Ting Li, http://www.creatis.insa-lyon.fr
Page 28
Results
Experiment: tests if recoverable errors are detected
• FIELD-II/pasa: self-healing speeds up execution by a factor of up to 4
• Mean-Shift/hs3: self-healing speeds up execution by a factor of up to 2.6
[Figures: makespan (s) over 5 repetitions, No-Healing vs Self-Healing, one plot per application]
The Self-Healing process reduced resource consumption by up to 26% compared to the No-Healing execution
Resource waste w per repetition:
Repetition   FIELD-II/pasa   Mean-Shift/hs3
1            -0.10           -0.02
2            -0.15           -0.20
3            -0.09           -0.02
4             0.05           -0.02
5            -0.26           -0.01
Page 29
Conclusions
Science-gateway model of workload archive
• Illustration by using traces of the VIP from 2011/2012
Added value when compared to infrastructure-level traces
• Exactly identifies tasks and users
• Distinguishes additional workload artifacts from real workload
• Fine-grained information about tasks
• Ground-truth of bag of tasks
Self-healing of workflow incidents
• Implements a generic MAPE-K loop
• Incident degrees computed online
• Speeds up execution up to a factor of 4
• Reduced resource consumption up to 26%
• Successful example of self-healing loop deployed in production
VIP is openly available at http://vip.creatis.insa-lyon.fr
Traces are available to the community in the Grid Observatory: http://www.grid-observatory.org
Page 30
Thank you for your attention. Questions?
ACKNOWLEDGMENTS
• VIP users and project members
• French National Agency for Research (ANR-09-COSI-03)
• European Grid Initiative (EGI)
• France-Grilles