Project Status David Britton,15/Dec/08.

Mar 28, 2015


Project Status

David Britton, 15/Dec/08


Outline

• Programmatic Review Outcome

• CCRC08

• LHC Schedule Changes

• Service Resilience

• CASTOR

• Current Status

• Project Management

• Feedback from the last Oversight Committee

• Forward Look

10/04/23


Programmatic Review


• The programmatic review recommended a 5% cut to GridPP:

• Although ALICE and LHCb were ultimately rescued, the cut was still imposed. However, there was a silver lining:

• Bottom Line: GridPP3 reduced by £1.24m on top of the £1.20m removed from GridPP2 noted at the last OC.


Funding Cut


• Savings of £1.24m achieved by:

– Planned and unplanned late starts to a number of GridPP3 posts.

– Reduction in Tier-1 hardware to reflect changes imposed by the programmatic review (LHCb and BaBar).

– Re-costing of hardware based on the 2007 procurement.

– A reduction in the budget line for the second tranche of Tier-2 hardware, consistent with the reduction in Tier-1 hardware.

– Reduction in travel and miscellaneous spending.

• New plan presented to STFC in July 08; Updated in GridPP-PMB-133-Resources.doc


CCRC08

• The Combined Computing Readiness Challenge took place in two phases, February and May 2008. Largely successful for all experiments.


LHC Schedule

Current indications are:

- Machine cold in June.

- First beams in July.

- Collisions at some point later.

- Plans may change!

Consequences on GridPP

- Capacity and services need to be ready in June.

- Meanwhile many exercises (MC productions, Cosmics re-processings, Analysis challenges) to keep things busy and stress the system.

- Prudent to maintain procurement schedule for April 2009 (little downside to this and helps reduce risks).

- Opportunity to build on the service quality and resilience.


Service Resilience

• Emphasis over the last year on making the Grid resilient.

– Much work on monitoring and alarms.

– 24 x 7 service initiated.

– Extensive work on making the component services more resilient at many levels (see document).

• Future work on resilience:

– Create project-manager overview to keep this active at the PMB level.

– Provision a back-up link for the OPN (significant cost).

– Link to the (evolving) experiment disaster planning (UCL meeting).


CASTOR

• CASTOR proved unreliable in early 2007 but performed well with the upgrade to 2.1.3 for CCRC08.

• In time for first collisions, an upgrade from 2.1.6 to 2.1.7 was required in order to maintain a version supported by CERN. This coincided with a move to a resilient RAC Oracle system – combination of upgrades led to instability in August and September.

• System is now stabilising and the problems have led to improved communications and management processes.

– High load-testing identified as a critical missing step for new releases.

– Oracle problems raised to a higher level of awareness in wLCG.

– Storage Review at RAL in November.

• Other Tier-1s have had similar or worse problems with mass storage – a difficult area where effort is underestimated.


Status: Resources 2008 (2007)


             Tier-1        Tier-2
CPU [kSI2k]  4590 (1500)   14140 (8588)
Disk [TB]    2222 (750)    1365 (743)
Tape [TB]    2195 (~800)

• MOU commitments for 2008 met.

• Combined effort from all Institutions.
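The year-on-year growth implied by the resource figures on this slide can be checked with a short calculation. This is a minimal sketch: the numbers are transcribed from the slide, the `growth_factor` helper is a hypothetical name introduced here, and the 2007 Tier-1 tape figure is only quoted approximately (~800 TB), so that ratio is indicative only.

```python
# Resource figures transcribed from the slide as (2008, 2007).
# The 2007 Tier-1 tape value is quoted only approximately (~800 TB).
resources = {
    ("Tier-1", "CPU [kSI2k]"): (4590, 1500),
    ("Tier-2", "CPU [kSI2k]"): (14140, 8588),
    ("Tier-1", "Disk [TB]"): (2222, 750),
    ("Tier-2", "Disk [TB]"): (1365, 743),
    ("Tier-1", "Tape [TB]"): (2195, 800),
}


def growth_factor(current, previous):
    """Return current / previous, i.e. the year-on-year multiplier."""
    return current / previous


# Hypothetical helper output: one multiplier per (tier, resource) pair.
growth = {key: growth_factor(*values) for key, values in resources.items()}

for (tier, resource), factor in growth.items():
    print(f"{tier:7s} {resource:12s} x{factor:.2f}")
```

On these figures, Tier-1 CPU and disk roughly tripled between 2007 and 2008, while Tier-2 capacity grew by somewhat less than a factor of two.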


Global Resource


Status in Oct 2007: 245 sites, 40,518 CPUs, 24,135 TB storage

Status in Dec 2008: 263 sites, 81,953 CPUs, xx,xxx TB storage


Current Performance

[Figure: Tier-1 reliability and availability (%), Dec-07 to Oct-08, showing RAL, average, and top-8 values.]

[Figure: Tier-2 reliability (%), 2Q07 to 3Q08, for London, NorthGrid, ScotGrid, and SouthGrid.]

• Good and improving reliability at the Tier-1 and Tier-2s (but need to move to experiment-specific SAM tests).

• 2008 MOU resources at Tier-1 and Tier-2s delivered in full.

• Following CCRC08 successes, other exercises continue: e.g. CMS Cosmic Reprocessing at the end of November, which inadvertently ran (successfully) at 10x the I/O rate (Tier-1 LAN and CASTOR service) for 3.5 days!

• Despite some problems, RAL is ~the best Tier-1 for LHCb globally. CMS needs are also ~met.


Current Performance


Disk failure rate ~1/working day or ~6% failure rate (twice our assumption).

ATLAS hit by two multiple disk failures within a RAID array resulting in data loss.

CASTOR 2.1.7 and the Oracle RAC upgrade caused considerable instability, and ATLAS lost 2 weeks of UK simulated production when the Tier-1 became unavailable to receive data.

Database loads are running several times higher than at CERN; this is partly a cost issue and partly caused by the higher-than-average number of transactions triggered by some ATLAS jobs.
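The disk failure figures quoted on this slide can be related to each other with some rough arithmetic. The sketch below rests on two assumptions that the slide does not state: that a year has about 250 working days, and that "~6%" is an annual failure rate. Under those assumptions it estimates the disk population the two figures jointly imply, and the failure count the planning assumption (half the observed rate, per the slide) would have predicted.

```python
# Assumption (not on the slide): roughly 250 working days per year.
WORKING_DAYS_PER_YEAR = 250

# From the slide: "~1/working day".
failures_per_working_day = 1.0

# From the slide: "~6%", read here as an ANNUAL failure rate (assumption).
observed_rate = 0.06

# Failures per year implied by one failure per working day.
failures_per_year = failures_per_working_day * WORKING_DAYS_PER_YEAR

# Disk population consistent with both figures under these assumptions.
implied_disks = failures_per_year / observed_rate

# The slide says the observed rate is twice the planning assumption.
planned_rate = observed_rate / 2
expected_failures_planned = implied_disks * planned_rate

print(f"failures per year:         {failures_per_year:.0f}")
print(f"implied disk population:   {implied_disks:.0f}")
print(f"failures/year at planning: {expected_failures_planned:.0f}")
```

The implied population is only an order-of-magnitude estimate; it is not a number reported in the presentation.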


Project Map

[Figure: GridPP3 project map. GridPP3 Goal: to provide UK computing for the Large Hadron Collider. Numbered items 1.1 to 6.4 cover areas including LCG, Tier-1, Tier-2 (London, NorthGrid, ScotGrid, SouthGrid), ATLAS, CMS, LHCb, other experiments, management, and external engagement.]


Project Plan


Feedback from last Oversight Committee

• 8.1 (Disaster recovery) – GridPP-PMB-135-Resilience.doc

• 8.2 (CASTOR) – GridPP-PMB-136-CASTOR.doc

• 8.3 (Documentation) - http://www.gridpp.ac.uk/support/

• 8.4 (Certificates) - http://www.gridpp.ac.uk/deployment/users/certificate.html

• 8.5 (24x7 Cover) – Now fully operational.

• 8.6 (Experiment Support Posts) – Despite all the cuts we have managed to fund 1-FTE for each of ATLAS, CMS, and LHCb.


Forward Look

• Move to the new building at RAL.

• Concentrate on further improving service resilience and engage ATLAS, CMS, LHCb in developing coherent disaster management strategies.

• Investigate (even more) rigorous certification of CASTOR releases.

• Recognise global conclusion that mass data storage requires more effort than anticipated.

• Preparations for GridPP3 took ~20 months: Need to start considering now what happens after GridPP3.


Backup Slides


Job Success Rate


ATLAS data analysis site tests – Nov 25-27 2008.


Job Efficiencies

• Efficiency for RAL Tier-1: CPU-Time / Wall-Clock

• Nov 2008 – Overall efficiency 58% - LHC experiments 83%
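The efficiency definition above (CPU time divided by wall-clock time) can be written out as a short sketch. The function names and the job records are hypothetical, and aggregating by total CPU time over total wall-clock time is an assumption; the slide does not say how the overall figure was combined across jobs.

```python
def job_efficiency(cpu_seconds, wall_seconds):
    """CPU-time / wall-clock ratio, as defined on the slide."""
    if wall_seconds <= 0:
        raise ValueError("wall-clock time must be positive")
    return cpu_seconds / wall_seconds


def overall_efficiency(jobs):
    """Aggregate efficiency: total CPU time over total wall-clock time.

    This weighting is an assumption, not something the slide specifies.
    """
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_wall = sum(wall for _, wall in jobs)
    return job_efficiency(total_cpu, total_wall)


# Hypothetical (cpu_seconds, wall_seconds) records; not data from the slide.
example_jobs = [(3000, 3600), (1200, 3600), (3400, 3600)]

print(f"overall efficiency: {overall_efficiency(example_jobs):.0%}")
```

Weighting by wall-clock time means long, I/O-bound jobs pull the overall figure down more than short ones, which is one plausible way an overall value could sit well below the figure for the LHC experiments alone.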


Error Messages
