Cloud Computing 2011, Roma, Sept 2011: Reducing the Human Cost of Grid Computing with glideinWMS, by Igor Sfiligoi (1), F. Würthwein (1), J.M. Dost (1), I. MacNeill (1), B. Holzman (2), and P. Mhashilkar (2). (1) UCSD, (2) FNAL

Reducing the Human Cost of Grid Computing with glideinWMS

Jan 15, 2015

Igor Sfiligoi

Presented at Cloud Computing 2011.

The switch from dedicated, tightly controlled compute clusters to a widely distributed, shared Grid infrastructure has introduced significant operational overheads. If not properly managed, this human cost could grow to a point where it would undermine the benefits of increased resource availability of Grid computing. The glideinWMS system addresses the human cost problem by drastically reducing the number of people directly exposed to the Grid infrastructure. This presentation provides an analysis of what steps have been taken to reduce the human cost problem, alongside the experience of glideinWMS use within the Open Science Grid.

Paper available at: http://www.thinkmind.org/index.php?view=article&articleid=cloud_computing_2011_8_40_20068

Cloud Computing 2011 Web site: http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html
Transcript
Page 1: Reducing the Human Cost of Grid Computing with glideinWMS

Roma, Sept 2011 Reducing human cost with glideinWMS 1

Cloud Computing 2011

Reducing the Human Cost of Grid Computing with glideinWMS

by Igor Sfiligoi (1), F. Würthwein (1), J.M. Dost (1), I. MacNeill (1), B. Holzman (2), and P. Mhashilkar (2)

(1) UCSD, (2) FNAL

Page 2: Reducing the Human Cost of Grid Computing with glideinWMS

Our environment - Grid computing

● Set of loosely coupled compute clusters (i.e. sites)
● Great for resource providers (i.e. site operators)
  ● High autonomy
  ● Easy sharing between communities (VOs)
  ● High utilization
● Not so great for users
  ● Actually, not too bad for users when things work
  ● But handling failures is extremely time consuming
    – May need to contact multiple site admins

Page 3: Reducing the Human Cost of Grid Computing with glideinWMS

A problem of scale

● O(100) sites
● Aggregate of O(100k) CPUs
● At least a few sites have some broken nodes at any point in time
● O(10k) users
  ● O(100) users likely hit by those broken nodes every day
  ● If each spent even 30 mins debugging
    – O(10) scientific FTEs wasted (and I am being an optimist)
    – Plus drastic reduction in usability (users expect things “to just work”)
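
The O(10) FTE figure can be checked with a little arithmetic. The sketch below takes the user count and the 30 minutes from the slide; the 8-hour working day is an assumption that the slide does not state, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the "O(10) scientific FTEs wasted" estimate.
# Assumption (not on the slide): one FTE works 8 hours per day.

def wasted_ftes(users_hit_per_day: int, minutes_each: float,
                hours_per_fte_day: float = 8.0) -> float:
    """Return how many full-time equivalents are spent on debugging."""
    person_hours = users_hit_per_day * minutes_each / 60.0
    return person_hours / hours_per_fte_day


if __name__ == "__main__":
    # O(100) users hit per day, 30 minutes of debugging each.
    print(wasted_ftes(100, 30))
```

With these inputs the result is a handful of FTEs, which is the same order of magnitude as the slide's O(10).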

Page 4: Reducing the Human Cost of Grid Computing with glideinWMS

The glideinWMS

● The glideinWMS approach to the problem
  ● Use the pilot paradigm
  ● Split pilot submission from pilot regulation
  ● Emphasize sharing of pilot submission service

The glideinWMS is a Grid job scheduler initially developed at FNAL by the CMS experiment

● Based on the CDF glideCAF concept

● With contributions from several other institutes

● Widely used in OSG, with a large instance at UCSD

Page 5: Reducing the Human Cost of Grid Computing with glideinWMS

The pilot paradigm

● Send pilots to Grid sites (never user jobs)
● Create a dynamic overlay pool of compute resources
● Handle user jobs within this overlay pool
● A broken node will fail pilot jobs
  ● So they will not join the overlay pool
  ● No user job ever fails
● Problem moved to the pilot submitter

[Figure: a pilot submitter sends pilots to Site 1 ... Site N; the pilots form the overlay pool. Pilots are not user specific; one pool per user community.]
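
A toy model may make the pilot paradigm concrete. This is a minimal sketch, not glideinWMS code: the node dictionaries, `run_pilot`, `build_overlay_pool` and `schedule` are all hypothetical names, and the node "health check" is reduced to a boolean flag.

```python
# Toy model of the pilot paradigm (hypothetical names, not glideinWMS code).

def run_pilot(node: dict) -> bool:
    """A pilot starts only on a working node; a broken node fails the pilot."""
    return node["healthy"]


def build_overlay_pool(nodes: list) -> tuple:
    """Send one pilot per node; only pilots that start join the overlay pool."""
    pool, failed = [], []
    for node in nodes:
        (pool if run_pilot(node) else failed).append(node["name"])
    return pool, failed


def schedule(user_jobs: list, pool: list) -> dict:
    """Hand user jobs (round-robin) to slots in the overlay pool only."""
    if not pool:
        return {}
    return {job: pool[i % len(pool)] for i, job in enumerate(user_jobs)}


if __name__ == "__main__":
    nodes = [
        {"name": "site1-node1", "healthy": True},
        {"name": "site1-node2", "healthy": False},  # broken node
        {"name": "siteN-node1", "healthy": True},
    ]
    pool, failed = build_overlay_pool(nodes)
    print("overlay pool:", pool)
    print("pilot failures to debug:", failed)
    print("assignments:", schedule(["job-a", "job-b", "job-c"], pool))
```

The broken node shows up only in the pilot submitter's failure list; user jobs are never matched to it.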

Page 6: Reducing the Human Cost of Grid Computing with glideinWMS

Cost reduction

● Difference in job types
  ● All user jobs are precious => must recover
  ● Pilot jobs are all the same => pilot failures not critical
    – Failures used to detect broken compute nodes
    – Diagnose node problem
● Fewer humans exposed
  ● Can be more expert => lower cost per event

Estimates for a sizable OSG VO (assuming 1% error rate):

Metric         Entities to debug
O(10M) jobs    O(100k)
O(1k) nodes    O(10)

Reduction by several orders of magnitude
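
The estimates above follow directly from applying the 1% error rate to each population. A minimal sketch of that arithmetic, with the order-of-magnitude values from the slide plugged in as exact numbers (an assumption made only so the calculation can run):

```python
# Entities to debug at a 1% error rate: per-job view vs. per-node view.
# The exact counts stand in for the slide's order-of-magnitude estimates.

def entities_to_debug(total: int, error_rate: float = 0.01) -> int:
    """Number of entities expected to fail at the given error rate."""
    return round(total * error_rate)


if __name__ == "__main__":
    jobs, nodes = 10_000_000, 1_000
    failed_jobs = entities_to_debug(jobs)    # user-job failures to chase
    failed_nodes = entities_to_debug(nodes)  # broken nodes seen by pilots
    print(failed_jobs, failed_nodes, failed_jobs // failed_nodes)
```

Debugging nodes instead of jobs shrinks the work list by the ratio of jobs to nodes, here four orders of magnitude.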

Page 7: Reducing the Human Cost of Grid Computing with glideinWMS

Traditional pilots & multiple VOs

● Each user community (VO) wants its own pilot infrastructure
  ● To maintain control over scheduling policies
● Many pilot admins, debugging the same sites

[Figure: two pilot submitters, each with its own overlay pool, send pilots to Site 1 ... Site k ... Site N.]

Page 8: Reducing the Human Cost of Grid Computing with glideinWMS

Splitting the process

● The glideinWMS separates
  ● pilot submission (glidein factory)
  ● from pilot regulation (VO frontend)
● Credential owned by VO frontend
  ● And delegated to factory as needed
● All scheduling policy implemented in the frontend

[Figure: a VO frontend regulates a glidein factory, which sends pilots to Site 1 ... Site N; the pilots form the overlay pool.]
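
The division of labour can be sketched as two small classes. This is an illustration of the idea only, not the glideinWMS API: the class names mirror the slide, but every method, field and the capping policy (`max_pilots`) are hypothetical.

```python
# Sketch of the factory/frontend split (hypothetical classes and policy).
from dataclasses import dataclass, field


@dataclass
class GlideinFactory:
    """Submits pilots to sites; knows nothing about user jobs or VO policy."""
    sites: list  # assumed non-empty
    submitted: list = field(default_factory=list)

    def submit_pilots(self, vo: str, credential: str, count: int) -> int:
        # The credential is owned by the frontend and delegated per request.
        for i in range(count):
            self.submitted.append((vo, self.sites[i % len(self.sites)]))
        return count


@dataclass
class VOFrontend:
    """Holds the VO credential and decides how many pilots are needed."""
    vo: str
    credential: str
    max_pilots: int  # toy scheduling policy: cap on concurrent pilots

    def pilots_needed(self, idle_jobs: int, running_pilots: int) -> int:
        return max(0, min(idle_jobs, self.max_pilots - running_pilots))

    def regulate(self, factory: GlideinFactory, idle_jobs: int,
                 running_pilots: int) -> int:
        count = self.pilots_needed(idle_jobs, running_pilots)
        return factory.submit_pilots(self.vo, self.credential, count)


if __name__ == "__main__":
    factory = GlideinFactory(sites=["Site 1", "Site N"])
    frontend = VOFrontend(vo="vo_a", credential="vo_a-proxy", max_pilots=5)
    print(frontend.regulate(factory, idle_jobs=10, running_pilots=2))
    print(factory.submitted)
```

All policy lives in `VOFrontend`; the factory only executes requests, which is what allows it to be operated by a separate team.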

Page 9: Reducing the Human Cost of Grid Computing with glideinWMS

The factory can be shared

● Each VO runs only its own VO frontend (with the associated overlay pool)
  ● While still having full control over policy
● All debugging handled by a single factory team

[Figure: two VO frontends, each with its own overlay pool, share one glidein factory that sends pilots to Site 1 ... Site k ... Site N.]
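
One rough way to see the saving is to count how many (team, site) debugging relationships exist in each model. The sketch below uses made-up VO and site names; the counting rule itself is an assumption chosen for illustration, not a measurement from the talk.

```python
# Counting (team, site) debugging relationships: separate pilot
# infrastructures vs. one shared factory. Names and data are illustrative.

def debug_pairs_separate(vo_sites: dict) -> int:
    """Each VO's own pilot admins must debug every site that VO uses."""
    return sum(len(sites) for sites in vo_sites.values())


def debug_pairs_shared(vo_sites: dict) -> int:
    """A single factory team debugs each distinct site once."""
    return len(set().union(*vo_sites.values()))


if __name__ == "__main__":
    vo_sites = {
        "vo_a": {"site1", "site2", "site3"},
        "vo_b": {"site2", "site3"},
        "vo_c": {"site1", "site3"},
    }
    print(debug_pairs_separate(vo_sites), debug_pairs_shared(vo_sites))
```

The more the VOs overlap in the sites they use, the larger the gap between the two counts.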

Page 10: Reducing the Human Cost of Grid Computing with glideinWMS

Risk of common factory?

● A single factory is a single point of failure
  ● And possibly a scalability choke point
● The glideinWMS allows for multiple factories
  ● For redundancy, scalability, trust, etc.
  ● Of course the cost goes up
● How many factories to use is a balance between low cost and low risk
  ● Each VO can decide what works best for it

Page 11: Reducing the Human Cost of Grid Computing with glideinWMS

OSG experience

● Operating a multi-VO factory since 2009
  ● 12 VOs at the time of writing
● Gliding into ~100 Grid sites
  ● We include sites that claim to support the VOs we serve
  ● Significant fraction shared
● Weekly statistics

[Figure: weekly statistics plot. One VO much bigger than the others.]

Page 12: Reducing the Human Cost of Grid Computing with glideinWMS

Effort investment

● About 1 FTE total
  ● Only a fraction for actual Grid debugging
  ● Comparable fraction helping VOs debug problems between Grid nodes and their VO overlay pool
● We also help with know-how in configuring and operating the overlay pool

Page 13: Reducing the Human Cost of Grid Computing with glideinWMS

Savings estimate

● Not counting the consulting services
  ● Those tend to be high at start-up and then level off

● For the remainder of the effort:

7x cheaper

Page 14: Reducing the Human Cost of Grid Computing with glideinWMS

Conclusions

● Failures in a highly distributed system like the scientific Grids can have a high human cost

● The pilot paradigm drastically reduces this by
  ● Catching errors during provisioning
  ● Debugging by expert staff only

● The glideinWMS further reduces the cost by allowing for a shared pilot factory
  ● Confirmed by the OSG experience

Page 15: Reducing the Human Cost of Grid Computing with glideinWMS

For more information

● The glideinWMS home page: http://tinyurl.com/glideinWMS

● Relevant papers and supporting material:
  ● I. Sfiligoi et al., "The pilot way to grid resources using glideinWMS," CSIE, WRI World Cong. on, vol. 2, pp. 428-432, 2009, doi:10.1109/CSIE.2009.950
  ● Open Science Grid home page, http://www.opensciencegrid.org/

Page 16: Reducing the Human Cost of Grid Computing with glideinWMS

Acknowledgment

● This work is partially sponsored by
  ● the US Department of Energy under Grant No. DE-FC02-06ER41436 subcontract No. 647F290 (OSG)
  ● the US National Science Foundation under Grants No. PHY-0612805 (CMS Maintenance & Operations), and OCI-0943725 (STCI).