Conference title 1 Extending GridSim with an Architecture for Failure Detection Agustín Caminero 1 , Anthony Sulistio 2 , Blanca Caminero 1 , Carmen Carrión 1 , and Rajkumar Buyya 2 1 Dept. of Computing Systems. The University of Castilla La Mancha, Spain 2 Grid Computing and Distributed Systems (GRIDS) Lab. The University of Melbourne, Australia
Mar 26, 2015

Extending GridSim with an Architecture for Failure Detection

Agustín Caminero 1, Anthony Sulistio 2, Blanca Caminero 1, Carmen Carrión 1, and Rajkumar Buyya 2

1 Dept. of Computing Systems, The University of Castilla La Mancha, Spain
2 Grid Computing and Distributed Systems (GRIDS) Lab, The University of Melbourne, Australia

ICPADS 2007 2

Agenda

Introduction

Contribution of Our Work

Design and Implementation

Experiments and Results

Conclusion and Further Work

Questions and Answers

Grid as Cyberinfrastructure for e-Science and e-Business Applications

[Figure: Grid overview with the labels Application, Grid Resource Broker, Grid Information Service, database, and resources R1 to RN.]

Grids as Variable Environments

• Grids are variable environments: each organization can set its own policy and can join or leave a virtual organization (VO) at any time.

• The number of resources can fluctuate significantly over time.

• Availability of resources may vary due to:

–changes in network condition,

–partial failures,

–the connection or disconnection of resources, …

• With so many resources in a Grid, resource or network failures are the rule rather than the exception.

The Importance of Dealing with Failures

• Supporting fault tolerance is one of the main technical challenges in designing Grid environments.

• This is because production Grid systems must be able to tolerate resource failures, while at the same time effectively exploiting the resources in a scalable and transparent manner.

• Thus, both detection and recovery schemes must be an integral part of the Grid computing infrastructure.

Grid Resource Failure Scenario

Grids as a Research Area

• To test new detection and recovery schemes in a Grid environment like the above scenario, a lot of work is required to set up the testbeds on many distributed sites.

• It is very difficult to produce performance evaluations in a repeatable and controlled manner, due to the inherent heterogeneity of the Grid.

• In addition, Grid testbeds are limited, and creating an adequately sized testbed is expensive and time-consuming.

• Therefore, it is easier to use simulation as a means of studying complex scenarios.

Contribution of the Paper

• Among the existing Grid simulation tools, we can find GridSim, SimGrid, OptorSim, and MicroGrid.

• None of them provides support for computing resource failures.

• To address the above issues, we have incorporated a failure detection and recovery scheme into GridSim.

• This extension allows GridSim to simulate the failure of computing resources.

• Most of the parameters of this extension are configurable, allowing researchers to simulate a wide variety of failure patterns.

Existing Resource Failure Detection Methods

• Computing resource failures can occur in hardware, operating systems, and Grid middleware components, as well as in network connections.

• There are two methods for detecting resource failures:

• Push:

– Each monitored resource periodically sends a message to a central server indicating its availability.

– If no message arrives within a certain time interval, the resource is considered to have failed.

• Pull:

– The resource monitor sends polling requests to the monitored resources.

– On receiving a polling request, each resource sends a reply, so that the monitor knows the resource is alive.

– A missed message indicates a resource failure.
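The two detection styles can be contrasted in a few lines of Python. This is only an illustrative sketch, not GridSim code: the class names `HeartbeatMonitor` and `PollingMonitor`, the 5-second timeout, and the resource names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Push model: resources report in; silence beyond `timeout` means failure."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, resource: str, now: float) -> None:
        # Called whenever a resource pushes an "I am alive" message.
        self.last_seen[resource] = now

    def failed(self, now: float) -> set:
        # A resource is suspected once its last message is older than the timeout.
        return {r for r, t in self.last_seen.items() if now - t > self.timeout}


class PollingMonitor:
    """Pull model: the monitor asks each resource and expects a reply."""

    def __init__(self, resources):
        # Maps a resource name to a callable that returns True when it replies.
        self.resources = dict(resources)

    def poll(self) -> set:
        # A missing reply marks the resource as failed.
        return {name for name, ping in self.resources.items() if not ping()}


push = HeartbeatMonitor(timeout=5.0)
push.heartbeat("R1", now=0.0)
push.heartbeat("R2", now=0.0)
push.heartbeat("R1", now=4.0)        # R2 stays silent after time 0
push_failed = push.failed(now=6.0)

pull = PollingMonitor({"R1": lambda: True, "R2": lambda: False})
pull_failed = pull.poll()
```

In the push sketch the monitor never contacts the resources and relies on timestamps alone; in the pull sketch the monitor initiates every exchange, which is the style the extension adopts.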

Designing Resource Failures

• We implement the pull method.

• Two types of entities perform polling:

• The Grid Information Service (GIS) entity polls the resources registered to it.

• Users poll resources running their jobs.

Scenario of failure detection (I)

Scenario of failure detection (II)

Scenario of failure detection (III)

Main classes

• RegionalGISWithFailure:

• Keeps a list of available resources, and polls them.

• Support for resource failures:

–Decides how many resources, when, how long, and how many machines at each resource will fail.

–These parameters are based on continuous, discrete or variate distributions, allowing a wide variety of failure patterns.

• GridUserFailure:

• Submits jobs to resources; polls the resources running its jobs; and on the failure of a job, chooses another resource and re-submits the job.
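How the four failure parameters (how many resources, when, how long, and how many machines) could be drawn from pluggable distributions is sketched below in Python. This is not the RegionalGISWithFailure implementation: the function names `plan_failures` and `hyper_exponential`, the resource names, and every numeric value are hypothetical. The hyper-exponential sampler is included only because the experiments mention that distribution; it is modelled here as a mixture of exponentials.

```python
import random


def hyper_exponential(rng, means, weights):
    """Draw from a mixture of exponentials: pick a branch, then sample it."""
    mean = rng.choices(means, weights=weights, k=1)[0]
    return rng.expovariate(1.0 / mean)


def plan_failures(resources, rng, num_failing, start_dist, duration_dist, machines_dist):
    """resources maps name -> machine count; returns one failure event per chosen resource."""
    chosen = rng.sample(sorted(resources), k=min(num_failing, len(resources)))
    plan = []
    for name in chosen:
        machines = resources[name]
        # Clamp the sampled value so at least one and at most all machines fail.
        failed = max(1, min(machines, round(machines_dist(rng, machines))))
        plan.append({
            "resource": name,
            "start": start_dist(rng),        # when the failure begins (seconds)
            "duration": duration_dist(rng),  # how long it lasts (seconds)
            "failed_machines": failed,
        })
    return plan


rng = random.Random(42)                      # fixed seed for repeatable runs
resources = {"R1": 4, "R2": 8, "R3": 2}      # hypothetical machine counts
plan = plan_failures(
    resources, rng, num_failing=2,
    start_dist=lambda r: r.uniform(0, 3600),
    duration_dist=lambda r: hyper_exponential(r, means=[60, 600], weights=[0.8, 0.2]),
    machines_dist=lambda r, n: r.randint(1, n),
)
```

Passing each distribution as a callable keeps the planner independent of any particular failure pattern, which mirrors the configurability described on the slide.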

Main classes (II)

• SpaceSharedWithFailure:

• Implements AllocPolicyWithFailure interface.

• Behaves like FCFS.

• TimeSharedWithFailure:

• Implements AllocPolicyWithFailure interface.

• Behaves like round-robin.
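The difference between the two allocation behaviours can be seen on a single CPU. The Python sketch below is not the GridSim allocation policy code; the function names, the one-second quantum, and the job lengths are assumptions chosen for illustration, and failures are left out entirely.

```python
from collections import deque


def fcfs_finish_times(lengths, mips):
    """Space-shared on one CPU: jobs run to completion in arrival order."""
    clock, finish = 0.0, []
    for length in lengths:
        clock += length / mips
        finish.append(clock)
    return finish


def round_robin_finish_times(lengths, mips, quantum):
    """Time-shared on one CPU: each job gets `quantum` seconds per turn."""
    remaining = [length / mips for length in lengths]
    queue = deque(range(len(lengths)))
    finish = [0.0] * len(lengths)
    clock = 0.0
    while queue:
        job = queue.popleft()
        run = min(quantum, remaining[job])
        clock += run
        remaining[job] -= run
        if remaining[job] > 1e-12:
            queue.append(job)      # not done yet: back to the end of the queue
        else:
            finish[job] = clock    # job completed at the current clock value
    return finish


jobs = [3000, 1000, 2000]          # job lengths in millions of instructions (MI)
fcfs = fcfs_finish_times(jobs, mips=1000)
rr = round_robin_finish_times(jobs, mips=1000, quantum=1.0)
```

With these inputs the short second job finishes much earlier under round-robin than under FCFS, while the long first job finishes later.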

EU DataGrid Testbed and Grid Modelling

Resource Characteristics

Resource (Location)       # Nodes   CPU Rating*   Policy         VO
RAL (UK)                  41        49,000        Space-shared   2
Imperial College (UK)     52        62,000        Space-shared   2
NorduGrid (Norway)        17        20,000        Space-shared   3
NIKHEF (Netherlands)      18        21,000        Space-shared   3
Lyon (France)             12        14,000        Space-shared   0
CERN (Switzerland)        59        70,000        Space-shared   0
Milano (Italy)            5         70,000        Space-shared   1
Torino (Italy)            2         3,000         Time-shared    1
Rome (Italy)              5         6,000         Space-shared   1
Padova (Italy)            1         1,000         Time-shared    4
Bologna (Italy)           67        80,000        Space-shared   4

*CPU Rating is measured in MIPS

Users Characteristics

Resource (Location)       # Users   Primary VO   Secondary VO
RAL (UK)                  12        2            4
Imperial College (UK)     16        2            0
NorduGrid (Norway)        4         3            2
NIKHEF (Netherlands)      8         3            4
Lyon (France)             12        0            1
CERN (Switzerland)        24        0            1
Milano (Italy)            4         1            2
Torino (Italy)            2         1            3
Rome (Italy)              4         1            4
Padova (Italy)            2         4            3
Bologna (Italy)           12        4            0

Experiment Parameters

• We simulated failures based on the hyper-exponential distribution, with mean equal to half of the number of CPUs of the VO.

• Each user has 10 jobs; each job would take 10 minutes to run at CERN.

• Users choose a resource to run each job among the resources in their primary VO.

• If no resource is available, they choose a resource from their secondary VO.
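The primary-then-secondary selection rule can be sketched as follows. The slides do not say how a user picks among several available resources in the same VO, so the uniform random choice here is an assumption, as are the function name `choose_resource` and the dictionary layout. The VO numbers for CERN, Lyon, Rome, Torino, and Milano are taken from the resource table, and the user is assumed to be a CERN user (primary VO 0, secondary VO 1).

```python
import random


def choose_resource(resources, primary_vo, secondary_vo, rng):
    """Return the name of an available resource, or None if both VOs are down."""
    for vo in (primary_vo, secondary_vo):
        candidates = [r["name"] for r in resources if r["vo"] == vo and r["alive"]]
        if candidates:
            return rng.choice(candidates)  # assumption: uniform random choice
    return None


resources = [
    {"name": "CERN", "vo": 0, "alive": True},
    {"name": "Lyon", "vo": 0, "alive": True},
    {"name": "Rome", "vo": 1, "alive": True},
    {"name": "Torino", "vo": 1, "alive": True},
    {"name": "Milano", "vo": 1, "alive": True},
]
rng = random.Random(7)

# All resources are up, so the choice comes from the primary VO.
first = choose_resource(resources, primary_vo=0, secondary_vo=1, rng=rng)

# Simulate a failure of every resource in the primary VO.
for r in resources:
    if r["vo"] == 0:
        r["alive"] = False
fallback = choose_resource(resources, primary_vo=0, secondary_vo=1, rng=rng)
```

A user that detects a failed job could call the same function again to pick the resource for re-submission.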

Results: Availability and period of failure

Fig 1. Availability of computing resources per VO.

Fig 2. Failed machines per VO.

• VO_0 and VO_1 suffered a big drop in their available MIPS because powerful CPUs failed.

Results: Failed jobs for a user

Fig 3. Time-line for User_0.

• Jobs submitted to different resources have different execution times

Results: Resource failure statistics

VO      # CPUs   # Failed CPUs   # Jobs   # Failed jobs   MFT*
VO_0    71       25              360      219             2.76 hours
VO_1    12       5               100      20              103.36 hours
VO_2    93       24              280      96              5.25 hours
VO_3    35       35              120      120             15.82 hours
VO_4    68       6               140      96              9.5 hours

* MFT: mean failure time.
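The results figures can be cross-checked against the experiment setup with a short Python calculation. The tuples below read each flattened results row as CPUs, failed CPUs, jobs, and failed jobs (for example, VO_0 as 71, 25, 360, and 219); the users-per-VO totals are obtained by summing the users table by primary VO. The variable names are illustrative only, and the failure shares are derived values, not figures reported by the authors.

```python
# (vo, cpus, failed_cpus, jobs, failed_jobs), read from the results table
stats = [
    ("VO_0", 71, 25, 360, 219),
    ("VO_1", 12, 5, 100, 20),
    ("VO_2", 93, 24, 280, 96),
    ("VO_3", 35, 35, 120, 120),
    ("VO_4", 68, 6, 140, 96),
]

# Users summed by primary VO from the users table (e.g. VO_0 = CERN 24 + Lyon 12).
users_per_vo = {"VO_0": 36, "VO_1": 10, "VO_2": 28, "VO_3": 12, "VO_4": 14}
JOBS_PER_USER = 10  # "Each user has 10 jobs"

shares = {}
for vo, cpus, failed_cpus, jobs, failed_jobs in stats:
    # Sanity check: the job count should equal users times jobs per user.
    consistent = jobs == users_per_vo[vo] * JOBS_PER_USER
    shares[vo] = (failed_cpus / cpus, failed_jobs / jobs, consistent)

for vo, (cpu_share, job_share, consistent) in shares.items():
    print(f"{vo}: {cpu_share:.1%} of CPUs failed, {job_share:.1%} of jobs failed, "
          f"job count consistent: {consistent}")
```

The job counts match the user totals for every VO, which supports this reading of the table columns.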

Conclusion

• Grids are currently an active research area, in which simulation is essential.

• New features allow GridSim to support computing resource failures based on fully configurable mathematical patterns.

• Our experiment has shown that the new extension can be used to simulate failure of computing resources.

• Support for network link failures and finite network buffers is considered future work.

• GridSim is available to download:

• www.gridbus.org/gridsim/

Thank you.

5th December, 2007

Acknowledgement

• This work has been jointly supported by the Spanish MEC and European Commission FEDER funds under grants Consolider Ingenio-2010 CSD2006-00046 and TIN2006-15516-C04-02; by JCCM under grants PBC-05-007-01, PBC-05-005-01 and José Castillejo.

• This research is also partially funded by the Australian Research Council and the Department of Education, Science and Training.

• We would like to thank Chee Shin Yeo and anonymous reviewers for their comments on the paper.