MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project. Young Suk Moon. Chair: Dr. Hans-Peter Bischof. Reader: Dr. Gregor von Laszewski. Observer: Dr. Minseok Kwon.

Feb 06, 2016

Transcript
Page 1: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

MS Thesis Defense

Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Young Suk Moon

Chair: Dr. Hans-Peter Bischof
Reader: Dr. Gregor von Laszewski
Observer: Dr. Minseok Kwon

Page 2: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Outline

Introduction to the Water Threat Management Project

Motivation

Research Objectives

Fault-Tolerant Queue

Evaluation

Conclusion

Page 3: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Water Threat Management

Motivation

Urban water distribution systems (WDSs) can be an easy target for terror attacks, e.g. contamination of the water.

Methods

Detect contamination using sensors located across the WDSs.

Run algorithms (developed by NCSU) that determine sensor locations so as to minimize the time needed to find the contaminant source locations.

Page 4: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Existing Water Threat Management System Architecture

Optimization Engine: Runs Evolutionary Algorithm (EA)

Simulation Engine: Runs EPANET

Page 5: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Water Threat Management System Requirements

Requirements

  Time sensitive
  Massive calculation
  Dynamic adaptation to a Grid environment
  Fault tolerance

Our goal

  The current system is not fault-tolerant: develop a fault-tolerant framework for the dynamic environment.

Page 6: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Motivation

Resource (site) outage: 5% down during 2009

Queue wait time

TeraGrid User & System News (http://news.teragrid.org/)

Page 7: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Research Objectives

Develop a fault-tolerant framework dealing with resource outages
  Strategy: generation distribution on multiple sites

Reduce queue wait time
  Strategy: dynamic job dependency

Page 8: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Water Threat Management Application

Sequential & parallel processing

Page 9: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Generation Distribution

Divide the generations into multiple parts, each run as a separate job. Distribute the jobs across multiple sites.
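The split described on this slide can be sketched in a few lines of Python. This is only an illustration, not the thesis implementation: the function name `split_generations` and the dictionary keys are hypothetical, and the sketch assumes generations are numbered from 1 and cut into contiguous, near-equal ranges with one job per site.

```python
def split_generations(total_generations, sites):
    """Split generations 1..total_generations into contiguous ranges, one job per site.

    Hypothetical helper for illustration; not taken from the thesis code.
    """
    if not sites or total_generations < len(sites):
        raise ValueError("need at least one generation per site")
    base, extra = divmod(total_generations, len(sites))
    jobs = []
    start = 1
    for index, site in enumerate(sites):
        # The first `extra` sites take one additional generation each.
        size = base + (1 if index < extra else 0)
        end = start + size - 1
        jobs.append({"site": site, "first_gen": start, "last_gen": end})
        start = end + 1
    return jobs


jobs = split_generations(20, ["Abe", "Big Red"])
for job in jobs:
    print(job["site"], job["first_gen"], job["last_gen"])
```

With 20 generations and two sites this yields the ranges 1-10 and 11-20, one job per site.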

Page 10: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Dynamic Job Dependency

Problems of generation distribution on multiple sites

  Additional queue wait times
  Each job is dependent on another: a job cannot be submitted before the prior job finishes.

Solution: determine job dependency at run time.

  Submit jobs at the same time.
  Whichever job starts first computes the first set of generations.
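One way to picture run-time dependency is a shared board from which each job claims the next unclaimed generation range at the moment it starts running. The sketch below is an assumption-laden illustration, not the mechanism used in the thesis: the class `GenerationRangeBoard`, its `claim` method, and the job names are all hypothetical, and a lock stands in for whatever coordination the real workflow uses.

```python
import threading


class GenerationRangeBoard:
    """Hands out generation ranges, in order, to whichever job starts first.

    Hypothetical coordination object for illustration only.
    """

    def __init__(self, ranges):
        self._ranges = list(ranges)
        self._next = 0
        self._lock = threading.Lock()
        self.assignments = []  # (job_name, range) in the order jobs started

    def claim(self, job_name):
        # The lock ensures two jobs starting at once never get the same range.
        with self._lock:
            if self._next >= len(self._ranges):
                return None  # nothing left to compute
            claimed = self._ranges[self._next]
            self._next += 1
            self.assignments.append((job_name, claimed))
            return claimed


board = GenerationRangeBoard([(1, 10), (11, 20)])
# Suppose the job on Big Red leaves its queue before the job on Abe.
first = board.claim("job-on-Big-Red")
second = board.claim("job-on-Abe")
print(first, second)
```

The sketch covers only the assignment of ranges. The job holding a later range still needs the result of the earlier range before it can compute, but both jobs have already waited in their queues at the same time.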

Page 11: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Dynamic WTM Workflow Management

Example scenario

Page 12: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Fault-tolerant Queue

Most common fault-tolerant strategies in a Grid

  Replication
  Checkpointing

Limitations of checkpointing under time-criticality

  Checkpointing degrades performance
  A checkpoint may not be compatible with a different site (heterogeneity)
  The job cannot be rescheduled on the same site in case of a site outage

The fault-tolerant queue therefore uses the replication strategy.

Page 13: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Fault-tolerant Queue Design

Components

Command Line Interface

Task Pool

Resource Pool

Scheduler

Resource Checker (integration with the TeraGrid Information Services)
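To make the roles of these components concrete, here is a minimal sketch of how a task pool, a resource pool, and a scheduler might fit together. Everything in it is an assumption for illustration: the class names, the `replicas` field, and the rule of replicating a task onto the first available sites are hypothetical and are not taken from the thesis. The `mark` method stands in for updates that a resource checker would supply.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    replicas: int = 2  # hypothetical: how many sites should run a copy


@dataclass
class ResourcePool:
    available: dict = field(default_factory=dict)  # site name -> is the site up?

    def mark(self, site, is_up):
        # In a real system a resource checker would call this.
        self.available[site] = is_up

    def up_sites(self):
        return [site for site, up in self.available.items() if up]


class Scheduler:
    def __init__(self, resource_pool):
        self.task_pool = deque()
        self.resource_pool = resource_pool

    def submit(self, task):
        self.task_pool.append(task)

    def schedule_next(self):
        """Pop the next task and pick sites for its replicas; None if no task waits."""
        if not self.task_pool:
            return None
        task = self.task_pool.popleft()
        sites = self.resource_pool.up_sites()[: task.replicas]
        return task, sites


pool = ResourcePool()
pool.mark("Abe", True)
pool.mark("Big Red", False)  # pretend Big Red is reported down
pool.mark("Queen Bee", True)

scheduler = Scheduler(pool)
scheduler.submit(Task("generations 1-10"))
task, sites = scheduler.schedule_next()
print(task.name, sites)
```

In this toy run the site reported as down is skipped and the task is placed on the two remaining sites.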

Page 14: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Fault Detection in Fault-tolerant Queue

Fault detection

Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit
  Communicate with GRAM to detect job failure

TeraGrid Information Services
  GRAM service may fail when the resource is down
  Publishes XML documents containing the outage information
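Checking an outage document for resources that are down might look like the sketch below. The XML layout is invented for illustration: the element names `outages`, `outage`, `resource`, `start`, and `end`, the host name, and the timestamp format are assumptions, not the actual schema published by the TeraGrid Information Services.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical outage document; the real schema is not shown on the slides.
SAMPLE_OUTAGE_XML = """
<outages>
  <outage>
    <resource>abe.example.org</resource>
    <start>2009-01-01T00:00:00</start>
    <end>2009-01-01T06:00:00</end>
  </outage>
</outages>
"""


def resources_down(xml_text, at):
    """Return the set of resources with an outage window covering time `at`."""
    root = ET.fromstring(xml_text.strip())
    down = set()
    for outage in root.findall("outage"):
        start = datetime.fromisoformat(outage.findtext("start"))
        end = datetime.fromisoformat(outage.findtext("end"))
        if start <= at < end:
            down.add(outage.findtext("resource"))
    return down


during = resources_down(SAMPLE_OUTAGE_XML, datetime(2009, 1, 1, 3, 0))
after = resources_down(SAMPLE_OUTAGE_XML, datetime(2009, 1, 1, 7, 0))
print(during, after)
```

A check of this kind complements the GRAM messages, since GRAM itself may be unreachable when the whole resource is down.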

Page 15: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Evaluation – WTM performance

WTM application performance (original)

              Abe   Big Red
#CPUs         16    16
CPU per Node  8     4

Page 16: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Evaluation – Queue Wait Time

Queue wait time statistics

            Abe     Big Red
Avg. (min)  82      42
Var.        38513   5354
sd.         196     73
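As a quick consistency check on these statistics, the standard deviation should be the square root of the variance. The short Python sketch below recomputes it from the reported variances; the dictionary layout and the function name are just illustrative choices.

```python
import math

# Values copied from the queue wait time statistics on this slide.
reported = {
    "Abe": {"avg": 82, "var": 38513, "sd": 196},
    "Big Red": {"avg": 42, "var": 5354, "sd": 73},
}


def sd_from_variance(variance):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance)


for site, stats in reported.items():
    computed = sd_from_variance(stats["var"])
    print(site, round(computed), stats["sd"])
```

Rounded to whole minutes, the recomputed values match the reported ones on both machines. In both cases the standard deviation is larger than the average, which indicates that queue wait times vary widely from one submission to the next.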

Page 17: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Evaluation – Performance Overhead

Performance overhead

Integrating a fault-tolerant framework usually causes performance degradation

No performance loss in our framework

Page 18: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Evaluation – Workflow Performance

Run time comparison of different types of workflow

  Original deployment vs. fault-tolerant deployment
  Dynamic job dependency vs. static job dependency
  Test each type of deployment in the real Grid system, including queue wait time

Workflow        Dependency  Site Name     # Jobs  Gen. range
Original        -           Abe           1       1-20
Original        -           Big Red       1       1-20
Fault-tolerant  static      Abe, Big Red  2       1-10 (Abe), 11-20 (Big Red)
Fault-tolerant  dynamic     Abe, Big Red  2       1-10, 11-20

Page 19: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Evaluation – Workflow Performance

Workflow comparison results

[Figure panels: Experiment 1, Experiment 2, Experiment 3]

Page 20: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Simulation – Worst Case Run Time Comparison

A threat management system must deliver results under any circumstances.

Thus, the worst-case run time is a critical factor in the Water Threat Management system.

Page 21: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Simulation – Worst Case Run Time Comparison

Simulation setup

  The generations are equally distributed among the machines.
  Use the 2009 TeraGrid outage data.
  Submit jobs every 5 minutes starting from 1/1/2009 12:00 am EST.

                         Abe    Big Red  Queen Bee
Run Time per Gen. (min)  0.52   2.07     1.02
#CPUs                    16     16       8
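The per-generation run times above make it easy to estimate pure compute time for an equal split. The sketch below is a rough illustration under stated assumptions: it uses 20 generations (the count from the workflow experiments; the simulation's generation count is not given on this slide), it runs the parts one after another, and it ignores queue wait time and outages entirely. The function name `compute_time` is hypothetical.

```python
# Per-generation run times in minutes, copied from the simulation setup table.
RUN_TIME_PER_GEN_MIN = {"Abe": 0.52, "Big Red": 2.07, "Queen Bee": 1.02}


def compute_time(total_generations, sites):
    """Compute minutes when generations are split equally and run part after part.

    Ignores queue wait time and outages; illustration only.
    """
    per_site, remainder = divmod(total_generations, len(sites))
    if remainder:
        raise ValueError("generations must divide equally among the sites")
    return sum(per_site * RUN_TIME_PER_GEN_MIN[site] for site in sites)


single_site = compute_time(20, ["Abe"])
two_sites = compute_time(20, ["Abe", "Big Red"])
print(round(single_site, 2), round(two_sites, 2))
```

Under these assumptions, running everything on the fastest machine needs less compute time than an equal split with a slower machine, so any benefit of the split has to come from queue wait times and outages.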

Page 22: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Simulation – Worst Case Run Time Comparison

Simulation queue wait time setup (unit: minutes)

Page 23: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Simulation – Worst Case Run Time Comparison

TeraGrid User & System News (http://news.teragrid.org/)

Page 24: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Simulation – Worst Case Run Time Comparison

Page 25: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Simulation – Worst Case Run Time Comparison

Page 26: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Simulation – Median Run Time, Worst Case (Max.) Run Time

Page 27: MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Conclusion

Achievement:
  Worst case run time is significantly reduced.

Limitation:
  In “general” cases, the dynamic workflow has performance degradation, due to the low failure rate and the compute performance difference between different machines.

Possible improvement:
  Migrate the generation process to a faster machine whenever possible.