
Project Status

David Britton, 15/Dec/08


Outline

• Programmatic Review Outcome

• CCRC08

• LHC Schedule Changes

• Service Resilience

• CASTOR

• Current Status

• Project Management

• Feedback from the last Oversight Committee

• Forward Look


Programmatic Review


• The programmatic review recommended a 5% cut to GridPP:

• Although ALICE and LHCb were ultimately rescued, the cut was still imposed. However, there was a silver lining:

• Bottom Line: GridPP3 reduced by £1.24m on top of the £1.20m removed from GridPP2 noted at the last OC.


Funding Cut


• Savings of £1.24m achieved by:

– Planned and unplanned late starts to a number of GridPP3 posts.

– Reduction in Tier-1 hardware to reflect changes imposed by the programmatic review (LHCb and BaBar).

– Re-costing of hardware based on the 2007 procurement.

– A reduction in the budget line for the second tranche of Tier-2 hardware, consistent with the reduction in Tier-1 hardware.

– Reduction in travel and miscellaneous spending.

• New plan presented to STFC in July 08; updated in GridPP-PMB-133-Resources.doc.


CCRC08

• The Combined Computing Readiness Challenge took place in two phases, February and May 2008. Largely successful for all experiments.


LHC Schedule

Current indications are:

- Machine cold in June.

- First beams in July.

- Collisions at some point later.

- Plans may change!

Consequences for GridPP

- Capacity and services need to be ready in June.

- Meanwhile, many exercises (MC production, cosmics re-processing, analysis challenges) keep things busy and stress the system.

- Prudent to maintain procurement schedule for April 2009 (little downside to this and helps reduce risks).

- Opportunity to build on the service quality and resilience.


Service Resilience

• Emphasis over the last year on making the Grid resilient:

– Much work on monitoring and alarms (a probe sketch follows this list).

– 24x7 service initiated.

– Extensive work on making the component services more resilient at many levels (see document).

• Future work on resilience:

– Create a project-manager overview to keep this active at the PMB level.

– Provision a back-up link for the OPN (significant cost).

– Link to the (evolving) experiment disaster planning (UCL meeting).
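As an illustration of the monitoring-and-alarms work, the sketch below is a minimal Nagios-style service probe: it times a TCP connect to an endpoint and maps the latency onto the standard alarm levels. The endpoint, port, and thresholds are illustrative assumptions, not GridPP configuration.

```python
#!/usr/bin/env python
"""Minimal sketch of a Nagios-style service probe (illustrative only)."""
import socket
import sys
import time

# Nagios plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2

def check_tcp(host, port, warn_s=1.0, crit_s=5.0):
    """Time a TCP connect and map the latency onto an alarm level."""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=crit_s):
            elapsed = time.time() - start
    except OSError as exc:
        print(f"CRITICAL - cannot connect to {host}:{port} ({exc})")
        return CRITICAL
    if elapsed > warn_s:
        print(f"WARNING - {host}:{port} responded in {elapsed:.2f}s")
        return WARNING
    print(f"OK - {host}:{port} responded in {elapsed:.2f}s")
    return OK

if __name__ == "__main__":
    # Hypothetical endpoint; substitute a real service to probe.
    sys.exit(check_tcp("srm.example.ac.uk", 8443))
```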


CASTOR

• CASTOR proved unreliable in early 2007 but performed well with the upgrade to 2.1.3 for CCRC08.

• In time for first collisions, an upgrade from 2.1.6 to 2.1.7 was required in order to maintain a version supported by CERN. This coincided with a move to a resilient Oracle RAC system; the combination of upgrades led to instability in August and September.

• The system is now stabilising, and the problems have led to improved communications and management processes:

– High-load testing identified as a critical missing step for new releases (a sketch of the idea follows below).

– Oracle problems raised to a higher level of awareness in WLCG.

– Storage Review at RAL in November.

• Other Tier-1s have had similar or worse problems with mass storage – a difficult area where effort is underestimated.
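On the high-load testing point above, a minimal sketch of the idea: drive many concurrent transfers at a test instance and report the failure rate, raising the concurrency between runs. The command line is a placeholder (an rfcp-style copy is shown); the paths, counts, and endpoint are assumptions, not a real CASTOR test setup.

```python
#!/usr/bin/env python
"""Illustrative load generator for release testing (placeholder commands)."""
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder transfer command; substitute the real client and test paths.
COPY_CMD = ["rfcp", "/tmp/testfile", "/castor/test-instance/stress/out"]
N_TRANSFERS = 500   # total transfers per run
CONCURRENCY = 50    # simultaneous clients; raise this between runs

def one_transfer(_):
    """Run a single copy; count it as a success iff the client exits 0."""
    return subprocess.run(COPY_CMD, capture_output=True).returncode == 0

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(one_transfer, range(N_TRANSFERS)))
    failures = results.count(False)
    print(f"{failures}/{N_TRANSFERS} transfers failed "
          f"({100.0 * failures / N_TRANSFERS:.1f}%) at concurrency {CONCURRENCY}")
```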


Status: Resources 2008 (2007)


             Tier-1        Tier-2
CPU [kSI2k]  4590 (1500)   14140 (8588)
Disk [TB]    2222 (750)    1365 (743)
Tape [TB]    2195 (~800)   –

• MOU commitments for 2008 met.

• Combined effort from all Institutions.
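For scale, the growth factors implied by the table above (a quick calculation from the quoted figures; the 2007 tape number is the approximate one shown):

```python
# 2008 vs 2007 figures from the table above.
resources = {
    "Tier-1 CPU [kSI2k]": (4590, 1500),
    "Tier-2 CPU [kSI2k]": (14140, 8588),
    "Tier-1 disk [TB]":   (2222, 750),
    "Tier-2 disk [TB]":   (1365, 743),
    "Tier-1 tape [TB]":   (2195, 800),   # 2007 figure is approximate
}
for name, (y2008, y2007) in resources.items():
    print(f"{name}: x{y2008 / y2007:.1f}")   # e.g. Tier-1 CPU: x3.1
```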


Global Resource


Status in Oct 2007: 245 sites, 40,518 CPUs, 24,135 TB storage.

Status in Dec 2008: 263 sites, 81,953 CPUs, xx,xxx TB storage.


Current Performance


[Figure: Tier-1 reliability and availability, Dec-07 to Oct-08 – RAL vs. the Tier-1 average and the top-8 Tier-1s.]

[Figure: Tier-2 reliability (%) by region – London, NorthGrid, ScotGrid, SouthGrid – 2Q07 to 3Q08.]

• Good and improving reliability at the Tier-1 and Tier-2s (but we need to move to experiment-specific SAM tests; the definitions used are sketched after this list).

• 2008 MOU resources at Tier-1 and Tier-2s delivered in full.

• Following the CCRC08 successes, other exercises continue: e.g. the CMS cosmic reprocessing at the end of November, which inadvertently ran (successfully) at 10x the expected I/O rate (Tier-1 LAN and CASTOR service) for 3.5 days!

• Although there have been some problems, RAL is ~the best Tier-1 for LHCb globally, and CMS needs have also been ~met.
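For reference, the reliability and availability figures above follow the usual WLCG-style definitions: availability counts all time, while reliability excuses scheduled downtime. A minimal sketch (the example hours are invented):

```python
def availability(up_h, total_h):
    """Percentage of all time the site passed the tests."""
    return 100.0 * up_h / total_h

def reliability(up_h, total_h, sched_down_h):
    """As availability, but scheduled downtime does not count against the site."""
    return 100.0 * up_h / (total_h - sched_down_h)

# Invented example month: 720 h total, 680 h up, 20 h scheduled + 20 h unscheduled down.
print(f"availability: {availability(680, 720):.1f}%")    # 94.4%
print(f"reliability:  {reliability(680, 720, 20):.1f}%") # 97.1%
```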


Current Performance (continued)


• Disk failure rate ~1 per working day, or ~6% failure rate (twice our assumption); a quick sanity check follows this list.

• ATLAS was hit by two multiple-disk failures within a RAID array, resulting in data loss.

• CASTOR 2.1.7 and the Oracle RAC upgrade caused considerable instability, and ATLAS lost 2 weeks of UK simulated production when the Tier-1 became unavailable to receive data.

• Database loads are running several times higher than at CERN; this is partly a cost issue, and partly triggered by the higher-than-average number of transactions generated by some ATLAS jobs.
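A quick sanity check on the disk numbers in the first bullet, assuming ~250 working days a year and that the ~6% figure is an annual rate over the whole fleet (both assumptions mine):

```python
failures_per_working_day = 1
working_days_per_year = 250     # assumed
annual_failure_rate = 0.06      # ~6%, i.e. twice the planning assumption of ~3%

failures_per_year = failures_per_working_day * working_days_per_year
implied_fleet = failures_per_year / annual_failure_rate
print(f"{failures_per_year} failures/year at {annual_failure_rate:.0%} "
      f"implies a fleet of ~{implied_fleet:.0f} drives")   # ~4167 drives
```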


Project Map


[Figure: GridPP3 project map, linking the goal "To provide UK computing for the Large Hadron Collider" to the work areas: management (planning, execution & tracking), Tier-1, the Tier-2s (London, NorthGrid, ScotGrid, SouthGrid), grid services (data and storage, storage systems, front end systems, middleware support & deployment, security, network, operations), the experiments (ATLAS, CMS, LHCb, other experiments), external links (LCG, EGEE, National Grid Infrastructure), hardware procurement, resource delivery, and outreach & engagement.]


Project Plan


Feedback from last Oversight Committee

• 8.1 (Disaster recovery) – GridPP-PMB-135-Resilience.doc

• 8.2 (CASTOR) – GridPP-PMB-136-CASTOR.doc

• 8.3 (Documentation) – http://www.gridpp.ac.uk/support/

• 8.4 (Certificates) – http://www.gridpp.ac.uk/deployment/users/certificate.html

• 8.5 (24x7 Cover) – Now fully operational.

• 8.6 (Experiment Support Posts) – Despite all the cuts, we have managed to fund 1 FTE for each of ATLAS, CMS, and LHCb.


Forward Look

• Move to the new building at RAL.

• Concentrate on further improving service resilience and engage ATLAS, CMS, LHCb in developing coherent disaster management strategies.

• Investigate (even more) rigorous certification of CASTOR releases.

• Recognise global conclusion that mass data storage requires more effort than anticipated.

• Preparations for GridPP3 took ~20 months: we need to start considering now what happens after GridPP3.


Backup Slides


Job Success Rate


ATLAS data analysis site tests – Nov 25-27 2008.


Job Efficiencies

• Efficiency for RAL Tier-1: CPU-Time / Wall-Clock

• Nov 2008 – overall efficiency 58%; LHC experiments 83%.
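A minimal sketch of the calculation, per the CPU-time over wall-clock definition above; the job records below are invented for illustration:

```python
# Invented accounting records: seconds of CPU and wall-clock time per job.
jobs = [
    {"vo": "atlas", "cpu_s": 41000, "wall_s": 43000},
    {"vo": "lhcb",  "cpu_s": 18000, "wall_s": 26000},
    {"vo": "other", "cpu_s":  4000, "wall_s": 21000},
]

def efficiency(records):
    """Aggregate efficiency: total CPU time over total wall-clock time."""
    return 100.0 * sum(j["cpu_s"] for j in records) / sum(j["wall_s"] for j in records)

lhc = [j for j in jobs if j["vo"] in ("atlas", "cms", "lhcb")]
print(f"overall: {efficiency(jobs):.0f}%")         # all VOs together
print(f"LHC experiments: {efficiency(lhc):.0f}%")  # LHC VOs only
```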


Error Messages
