EGEE-II INFSO-RI- 031688 EGEE and gLite are registered trademarks The EGEE Production Grid Ian Bird EGEE Operations Manager HEPiX Jefferson Lab, 12 th October 2006 Enabling Grids for E-sciencE

Dec 13, 2015


EGEE-II INFSO-RI-031688 EGEE and gLite are registered trademarks

The EGEE Production Grid

Ian Bird

EGEE Operations Manager

HEPiX

Jefferson Lab, 12th October 2006

Enabling Grids for E-sciencE


Outline

• Some history
  – What led up to where we are now?
  – The EGEE project
• What is the EGEE grid infrastructure today?
  – What has been achieved?
  – How is it used?
  – How does it compare and relate to other production grids?

• Outlook


Some history … LHC EGEE Grid

• 1999 – Monarc Project
  – Early discussions on how to organise distributed computing for LHC
• 2000 – growing interest in grid technology
  – HEP community was the driver in launching the DataGrid project
• 2001-2004 – EU DataGrid project
  – middleware & testbed for an operational grid
• 2002-2005 – LHC Computing Grid (LCG)
  – deploying the results of DataGrid to provide a production facility for LHC experiments
• 2004-2006 – EU EGEE project phase 1
  – starts from the LCG grid
  – shared production infrastructure
  – expanding to other communities and sciences
• 2006-2008 – EU EGEE-II
  – Building on phase 1
  – Expanding applications and communities …
• … and in the future
  – Worldwide grid infrastructure??
  – Interoperating and co-operating infrastructures?


The EGEE project

• EGEE – €32 M
  – 1 April 2004 – 31 March 2006
  – 71 partners in 27 countries, federated in regional Grids
• EGEE-II – €35 M
  – 1 April 2006 – 31 March 2008
  – 91 partners in 32 countries – 13 Federations
• Objectives
  – Large-scale, production-quality infrastructure for e-Science
  – Attracting new resources and users from industry as well as science
  – Improving and maintaining “gLite” Grid middleware


The EGEE Infrastructure

Infrastructure:
• Physical test-beds & services
• Support organisations & procedures
• Policy groups

Test-beds & Services
• Certification testbeds (SA3)
• Pre-production service
• Production service

Support Structures
• Operations Coordination Centre
• Regional Operations Centres
• Global Grid User Support
• EGEE Network Operations Centre (SA2)
• Operational Security Coordination Team

Security & Policy Groups
• Operations Advisory Group (+NA4)
• Joint Security Policy Group
• EuGridPMA (& IGTF)
• Grid Security Vulnerability Group


Certification & release preparation

• The goal is to produce a middleware distribution that can be deployed widely
  – Not the same as middleware releases from development projects
  – More like a Linux distribution – bringing together many pieces from several sources
• Extensive certification test-bed:
  – Close to 100 machines involved, CERN + partners
• Emulate the main deployment environments
• Certification testing:
  – Installation and configuration
  – Component (service) functionality
  – System testing (trying to emulate real workloads and stress testing)
  – Beginning to use virtualization to simplify the testing environment
• Deployment into the pre-production system
  – Final step of certification – validation by real sites
  – Validation by applications – also allows apps to be prepared for new versions


Pre-production service

• Pre-production service is now ~20 sites
• Provides access to some 500 CPU
  – Some sites allow access to their full production batch systems for scale tests
• Sites install and test different configurations and sets of services
  – Try to get a good feeling for the quality of the release or updates before general release to production
  – Feedback to: certification, integration, developers, etc.
• P-PS is now used in the way it was intended
  – For some time it was acting as a second certification test-bed for the gLite-1.x branch
  – Some services may be demonstrated in this environment before going to production (or they may need more work)


Production service

Size of the infrastructure today:

• 196 sites in 42 countries
• ~32 000 CPU
• ~3 PB disk, + tape MSS

[Chart: number of sites and number of CPUs; CPU axis 0 to 35,000]


Usage of the infrastructure

[Chart: EGEE workload – jobs/month per VO, Jan-05 to Aug-06, axis 0 to 1,800,000]

[Chart: Normalized CPU time – k.SI2k.hours per VO, Jan-05 to Aug-06, axis 0 to 6,000,000]

VOs shown in both charts: alice, atlas, biomed, cms, compchem, dteam, egeode, egrid, esr, fusion, geant4, lhcb, magic, ops, planck, other VOs

>50k jobs/day

~7000 CPU-months/month
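
As a rough cross-check, the two headline figures can be put into monthly and per-CPU terms. This is a sketch only: the 30-day month is an assumption, and using the ~32 000 CPU figure from the "Production service" slide as the denominator is also an assumption, not something the slides state.

```python
# Rough cross-check of the headline usage figures (assumptions flagged below).
JOBS_PER_DAY = 50_000          # ">50k jobs/day" from the slide (a lower bound)
CPU_MONTHS_PER_MONTH = 7_000   # "~7000 CPU-months/month" from the slide
TOTAL_CPUS = 32_000            # "~32 000 CPU" from the production service slide
DAYS_PER_MONTH = 30            # assumption: average month length
HOURS_PER_MONTH = DAYS_PER_MONTH * 24

jobs_per_month = JOBS_PER_DAY * DAYS_PER_MONTH
cpu_hours_per_month = CPU_MONTHS_PER_MONTH * HOURS_PER_MONTH
# A CPU-month per month is one CPU kept busy continuously.
average_occupancy = CPU_MONTHS_PER_MONTH / TOTAL_CPUS
# Upper bound, because the job rate is quoted as a lower bound.
mean_cpu_hours_per_job = cpu_hours_per_month / jobs_per_month

print(f"{jobs_per_month:,} jobs/month")
print(f"{cpu_hours_per_month:,} CPU-hours/month")
print(f"average occupancy about {average_occupancy:.0%}")
print(f"at most about {mean_cpu_hours_per_job:.2f} CPU-hours per job")
```

Under these assumptions the figures correspond to about 1.5 million jobs and about 5 million CPU-hours per month, i.e. roughly a fifth of the installed CPUs busy on average.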


Non-LHC VOs

[Chart: EGEE workload – jobs/month per non-LHC VO, axis 0 to 250,000]

[Chart: Normalized CPU time – k.SI2k.hours per non-LHC VO, axis 0 to 800,000]

VOs shown: biomed, compchem, egeode, egrid, esr, fusion, geant4, magic, ops, planck, other VOs (dteam also appears in the CPU-time legend)

Workloads of the “other VOs” start to be significant – approaching 8-10K jobs per day; and 1000 cpu-months/month

• one year ago this was the overall scale of work for all VOs


Use of the infrastructure

20k jobs running simultaneously


CPU Usage

Virtual Organizations

Jan. ’06

Sep. ’06


Use for massive data transfer

Large LHC experiments now transferring ~ 1PB/month each
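
To give a feel for what ~1 PB/month means as a sustained rate, the conversion can be sketched as below. Decimal (SI) units and a 30-day month are assumptions; the slide gives only the monthly volume.

```python
# Sustained rate implied by ~1 PB/month.
# Assumptions: decimal petabyte (10**15 bytes) and a 30-day month.
PETABYTE = 10**15
SECONDS_PER_MONTH = 30 * 24 * 3600

bytes_per_second = PETABYTE / SECONDS_PER_MONTH
megabytes_per_second = bytes_per_second / 10**6
gigabits_per_second = bytes_per_second * 8 / 10**9

print(f"about {megabytes_per_second:.0f} MB/s sustained")
print(f"about {gigabits_per_second:.1f} Gb/s sustained")
```

This is an average over the month per experiment; peak rates would need to be higher to allow for downtime.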


Applications on EGEE

• More than 25 applications from an increasing number of domains
  – Astrophysics
  – Computational Chemistry
  – Earth Sciences
  – Financial Simulation
  – Fusion
  – Geophysics
  – High Energy Physics
  – Life Sciences
  – Multimedia
  – Material Sciences
  – …
• Application types:
  – Simulation
  – Bulk Processing
  – Responsive Apps.
  – Workflow
  – Parallel Jobs
  – Legacy Applications


Simulation

• Examples
  – LHC Monte Carlo simulation
  – Fusion
  – WISDOM – malaria/avian flu
• Characteristics
  – Jobs are CPU-intensive
  – Large number of independent jobs
  – Run by few (expert) users
  – Small input; large output
• Needs
  – Batch-system services
  – Minimal data management for storage of results

[Images: ATLAS, ITER]


Drug Discovery

• WISDOM focuses on in silico drug discovery for neglected and emerging diseases.

• Malaria – Summer 2005
  – 46 million ligands docked
  – 1 million selected
  – 1 TB data produced; 80 CPU-years used in 6 weeks
• Avian Flu – Spring 2006
  – H5N1 neuraminidase
  – Impact of selected point mutations on eff. of existing drugs
  – Identification of new potential drugs acting on mutated N1
• Fall 2006
  – Extension to other neglected diseases
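
The malaria figures above can be turned into two back-of-envelope numbers. This is a sketch under assumptions: a 365-day year, and all 80 CPU-years attributed evenly to the 46 million dockings, which the slide does not state.

```python
# Back-of-envelope numbers for the Summer 2005 malaria run.
CPU_YEARS = 80                 # from the slide
WEEKS = 6                      # from the slide
LIGANDS_DOCKED = 46_000_000    # from the slide
DAYS_PER_YEAR = 365            # assumption
SECONDS_PER_DAY = 24 * 3600

# Number of CPUs that would have to run continuously for 6 weeks.
equivalent_cpus = CPU_YEARS * DAYS_PER_YEAR / (WEEKS * 7)

# Assumption: all CPU time is spread evenly over the dockings.
cpu_seconds_per_docking = (
    CPU_YEARS * DAYS_PER_YEAR * SECONDS_PER_DAY / LIGANDS_DOCKED
)

print(f"about {equivalent_cpus:.0f} CPUs busy continuously")
print(f"about {cpu_seconds_per_docking:.0f} CPU-seconds per docking")
```

On these assumptions the run is equivalent to roughly 700 CPUs working continuously, at a little under a minute of CPU time per docking.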


Bulk Processing

• Examples
  – HEP processing of raw data, analysis
  – Earth observation data processing
• Characteristics
  – Widely-distributed input data
  – Significant amount of input and output data
• Needs
  – Job management tools (workload management)
  – Meta-data services
  – More sophisticated data management


Responsive Apps. (I)

• Examples
  – Prototyping new applications
  – Monitoring grid operations
  – Direct interactivity
• Characteristics
  – Small amounts of input and output data
  – Not CPU-intensive
  – Short response time (few minutes)
• Needs
  – Configuration which allows “immediate” execution (QoS)
  – Services must treat jobs with minimum latency


Responsive Apps. (II)

• Grid as a backend infrastructure:
  – gPTM3D: interactive analysis of medical images
  – GPS@: bioinformatics via web portal
  – GATE: radiotherapy planning
  – DILIGENT: digital libraries
  – Volcano sonification
• Characteristics
  – Rapid response: a human waiting for the result!
  – Many small but CPU-intensive tasks
  – User is not aware of “grid”!
• Needs
  – Interfacing (data & computing) with non-grid application or portal
  – User and rights management between front-end and grid


Workflow

• Examples
  – “Bronze Standard”: image registration
  – Flood prediction
• Characteristics
  – Use of grid and non-grid services
  – Complex set of algorithms for the analysis
  – Complex dependencies between individual tasks
• Needs
  – Tools for managing the workflow itself
  – Standard interfaces for services (i.e. web services)
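
The "complex dependencies between individual tasks" point can be illustrated with a small dependency-ordering sketch. The task names are hypothetical and this is not how any EGEE workflow tool is implemented; it only uses Python's standard `graphlib` module (Python 3.9+) to show what a workflow manager has to work out.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "fetch_images": set(),
    "register_a": {"fetch_images"},
    "register_b": {"fetch_images"},
    "compare": {"register_a", "register_b"},
}

def execution_batches(dependencies):
    """Group tasks into batches; tasks within one batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = execution_batches(workflow)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

Here the two registration tasks land in the same batch, so they could be submitted as independent jobs, while the comparison has to wait for both.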


Parallel Jobs

• Examples
  – Climate modeling
  – Earthquake analysis
  – Computational chemistry
• Characteristics
  – Many interdependent, communicating tasks
  – Many CPUs needed simultaneously
  – Use of MPI libraries
• Needs
  – Configuration of resources for flexible use of MPI
  – Pre-installation of optimized MPI libraries


Legacy Applications

• Examples
  – Commercial or closed source binaries
  – Geocluster: geophysical analysis software
  – FlexX: molecular docking software
  – Matlab, Mathematica, …
• Characteristics
  – Licenses: control access to software on the grid
  – No recompilation, so no direct use of grid APIs!
• Needs
  – License server and grid deployment model
  – Transparent access to data on the grid


Grid management: structure

• Operations Coordination Centre (OCC)
  – management, oversight of all operational and support activities
• Regional Operations Centres (ROC)
  – providing the core of the support infrastructure, each supporting a number of resource centres within its region
  – Grid Operator on Duty
• Resource centres
  – providing resources (computing, storage, network, etc.)
• Global Grid User Support (GGUS)
  – At FZK, coordination and management of user support, single point of contact for users


Grid Monitoring

• Goal:
  – Proactively monitor operational state & performance of the grid
  – Trigger corrective actions at sites, ROCs, service managers
• Many tools used:
  – Distributed responsibility for tools maintenance and operation
  – Operator portal, Info sys monitor, SFT/SAM, job monitors, etc.
• Site Functional Tests (SFT) / Site Availability Monitor (SAM)
  – Framework to sample/test services at sites and publish results
  – Can include ad-hoc tests (e.g. VO-specific) in the framework or externally
  – Allows dynamic look-up by VO of sites that are currently OK for them
  – SAM: extends the concept to measure service availability
  – Web service access to the data
  – Intend to use this to generate trouble tickets and alarms
• Primary tools of the operator on duty are:
  – Information system monitoring and SFT/SAM
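
The two ideas of sampling tests per site and letting a VO look up the sites that are currently OK can be sketched as follows. The record layout, site names and function names are all hypothetical; this is not the SFT/SAM data model or API, just an illustration of the concept.

```python
from collections import defaultdict

# Hypothetical test records in time order: (site, VO the test applies to, passed?)
results = [
    ("site-a", "atlas", True),
    ("site-a", "atlas", True),
    ("site-a", "biomed", False),
    ("site-b", "atlas", True),
    ("site-b", "atlas", False),
]

def availability_by_site(records):
    """Fraction of sampled tests that passed, per site."""
    passed, total = defaultdict(int), defaultdict(int)
    for site, _vo, ok in records:
        total[site] += 1
        passed[site] += ok  # True counts as 1, False as 0
    return {site: passed[site] / total[site] for site in total}

def sites_ok_for(vo, records):
    """Sites whose most recent test for this VO passed."""
    latest = {}
    for site, record_vo, ok in records:
        if record_vo == vo:
            latest[site] = ok  # later records overwrite earlier ones
    return sorted(site for site, ok in latest.items() if ok)

print(availability_by_site(results))
print(sites_ok_for("atlas", results))
```

A real availability measure would also weight tests by service and by time window; the sketch counts every sample equally.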


Site metrics - availability


Support - GGUS


The EGEE Network Operations Centre

• Creating a “Network Support unit” in the EGEE operational model;

• Tasks:
  – Receive tickets from NRENs, and forward to GGUS if impact on grid
  – Receive tickets from GGUS if a network issue
  – Troubleshoot & follow up with sites or NRENs
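
The ticket flow in the tasks above can be written as a small routing rule. This is a sketch only: the `Ticket` fields, the source labels and the two fall-through outcomes ("no grid action", "return to GGUS") are assumptions added for illustration and are not described in the slides.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    source: str          # "NREN" or "GGUS" (hypothetical labels)
    network_issue: bool  # is the problem network-related?
    affects_grid: bool   # does it have an impact on the grid?

def route(ticket):
    """Return the ENOC's next action for a ticket."""
    if ticket.source == "NREN":
        # NREN tickets go to GGUS only if they have an impact on the grid.
        # "no grid action" is an assumed outcome for the other case.
        return "forward to GGUS" if ticket.affects_grid else "no grid action"
    if ticket.source == "GGUS" and ticket.network_issue:
        return "troubleshoot with sites or NRENs"
    # Assumed outcome: anything else is not an ENOC matter.
    return "return to GGUS"

print(route(Ticket(source="NREN", network_issue=True, affects_grid=True)))
print(route(Ticket(source="GGUS", network_issue=True, affects_grid=True)))
```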

[Diagram: ticket flow between Users, GGUS, Support Units, ENOC, NRENs, GÉANT2 and the EGEE Network]


Interoperation

• Interoperability and interoperation (or co-operation)

• EGEE has interoperability activities with: (enabling the middlewares to work together)
  – Open Science Grid (U.S.) – quite far advanced
  – NorduGrid (ARC) – task in EGEE-II, 4 workshops and ongoing activity
  – UNICORE – task in EGEE-II
  – NAREGI (Japan) – 1 workshop, continued activity
  – GIN (OGF) – active in several areas
• EGEE has interoperation activities with: (enabling the infrastructures to co-operate)
  – Open Science Grid – actually in use
  – Anticipated with NorduGrid (NDGF) for WLCG


Interoperating information systems

[Diagram: information systems of EGEE, OSG, NAREGI, TeraGrid, PRAGMA and NorduGrid]


Related infrastructure projects

[Diagram labels: DEISA, TeraGrid]

Coordination in SA1 for:
• EELA, BalticGrid, EUMedGrid, EUChinaGrid, SEE-GRID

Interoperation with:
• OSG, NAREGI

SA3:
• DEISA, ARC, NAREGI


Sustainability: Beyond EGEE-II

• Need to prepare for permanent Grid infrastructure
  – Maintain Europe’s leading position in global science Grids
  – Ensure a reliable and adaptive support for all sciences
  – Independent of short project funding cycles
  – Modelled on success of GÉANT
  – Infrastructure managed in collaboration with national grid initiatives


Summary of status

• Today we have an operating production infrastructure
  – Probably the largest in the world, supporting many science domains
  – Relied upon by several as their primary source of computing
• We have a managed operations process addressing most areas
  – Constantly evolving
• Inter/Co-operation is a fact and is becoming more important very quickly
  – Several applications need to work across grids – and they need support for that

• A large fraction of the value of the operations activity is in the intangibles – processes, structures, expertise, etc.

• We recognise that there are many outstanding problems with the current state of things: reliability and robustness are the focus for the next year