Project Status
David Britton, 15/Dec/08
Outline
• Programmatic Review Outcome
• CCRC08
• LHC Schedule Changes
• Service Resilience
• CASTOR
• Current Status
• Project Management
• Feedback from the last Oversight Committee
• Forward Look
Programmatic Review
• The programmatic review recommended a 5% cut to GridPP.
• Although ALICE and LHCb were ultimately rescued, the cut was still imposed. However, there was a silver lining:
• Bottom line: GridPP3 reduced by £1.24m, on top of the £1.20m removed from GridPP2 (noted at the last OC).
Funding Cut
• Savings of £1.24m achieved by:
– Planned and unplanned late starts to a number of GridPP3 posts.
– Reduction in Tier-1 hardware to reflect changes imposed by the programmatic review (LHCb and BaBar).
– Re-costing of hardware based on the 2007 procurement.
– A reduction in the budget line for the second tranche of Tier-2 hardware, consistent with the reduction in Tier-1 hardware.
– Reduction in travel and miscellaneous spending.
• New plan presented to STFC in July 08; updated in GridPP-PMB-133-Resources.doc
CCRC08
• The Combined Computing Readiness Challenge took place in two phases, February and May 2008. Largely successful for all experiments.
LHC Schedule
Current indications are:
- Machine cold in June.
- First beams in July.
- Collisions at some point later.
- Plans may change!
Consequences for GridPP
- Capacity and services need to be ready in June.
- Meanwhile, many exercises (MC production, cosmics reprocessing, analysis challenges) keep things busy and stress the system.
- Prudent to maintain procurement schedule for April 2009 (little downside to this and helps reduce risks).
- Opportunity to build on the service quality and resilience.
Service Resilience
• Emphasis over the last year on making the Grid resilient:
– Much work on monitoring and alarms.
– 24x7 service initiated.
– Extensive work on making the component services more resilient at many levels (see document).
• Future work on resilience:
– Create a project-manager overview to keep this active at the PMB level.
– Provision a back-up link for the OPN (significant cost).
– Link to the (evolving) experiment disaster planning (UCL meeting).
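As an illustration of the alarm side of this work, below is a minimal health-check loop in Python; the service endpoints, polling interval, failure threshold, and alert channel are all hypothetical placeholders, not the actual GridPP monitoring stack.

  import time
  import urllib.request

  SERVICES = {  # hypothetical endpoints, for illustration only
      "srm": "http://srm.example.ac.uk:8443/ping",
      "lfc": "http://lfc.example.ac.uk:5010/ping",
  }
  FAIL_THRESHOLD = 3  # consecutive failures before raising an alarm
  failures = {name: 0 for name in SERVICES}

  def check(url, timeout=10.0):
      """Return True if the service answers with HTTP 200."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return resp.status == 200
      except OSError:
          return False

  def alert(name):
      """Placeholder for the real alarm channel (pager, e-mail, ticket)."""
      print("ALARM: %s failed %d consecutive checks" % (name, FAIL_THRESHOLD))

  while True:  # poll forever, once a minute
      for name, url in SERVICES.items():
          if check(url):
              failures[name] = 0
          else:
              failures[name] += 1
              if failures[name] == FAIL_THRESHOLD:
                  alert(name)
      time.sleep(60)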
CASTOR
• CASTOR proved unreliable in early 2007 but performed well with the upgrade to 2.1.3 for CCRC08.
• In time for first collisions, an upgrade from 2.1.6 to 2.1.7 was required in order to maintain a version supported by CERN. This coincided with a move to a resilient RAC Oracle system – combination of upgrades led to instability in August and September.
• The system is now stabilising, and the problems have led to improved communications and management processes:
– High-load testing identified as a critical missing step for new releases.
– Oracle problems raised to a higher level of awareness in WLCG.
– Storage Review at RAL in November.
• Other Tier-1s have had similar or worse problems with mass storage – a difficult area where the effort required is underestimated.
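A minimal sketch of the kind of high-load test now identified as a missing step for new releases: drive many concurrent transfers at a storage endpoint and report the success rate. The copy command, paths, and stream count below are illustrative placeholders, not the actual CASTOR certification procedure.

  import concurrent.futures
  import subprocess

  COPY_CMD = "rfcp"  # placeholder transfer client
  SOURCE = "/tmp/testfile"  # local test file (illustrative)
  DEST = "castor.example.ac.uk:/castor/test/load"  # hypothetical target
  N_STREAMS = 50  # concurrent transfer streams

  def one_transfer(i):
      """Run a single copy; True on exit code 0."""
      try:
          result = subprocess.run(
              [COPY_CMD, SOURCE, "%s/file_%d" % (DEST, i)],
              capture_output=True,
          )
      except OSError:  # client not installed, etc.
          return False
      return result.returncode == 0

  with concurrent.futures.ThreadPoolExecutor(max_workers=N_STREAMS) as pool:
      outcomes = list(pool.map(one_transfer, range(N_STREAMS)))

  ok = sum(outcomes)
  print("%d/%d transfers succeeded (%.0f%%)"
        % (ok, N_STREAMS, 100.0 * ok / N_STREAMS))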
Status: Resources 2008 (2007)
              Tier-1        Tier-2
CPU [kSI2k]   4590 (1500)   14140 (8588)
Disk [TB]     2222 (750)    1365 (743)
Tape [TB]     2195 (~800)   –
• MOU commitments for 2008 met.
• Combined effort from all Institutions.
Global Resource
• Status in Oct 2007: 245 sites, 40,518 CPUs, 24,135 TB storage.
• Status in Dec 2008: 263 sites, 81,953 CPUs, xx,xxx TB storage – roughly a doubling of CPU capacity in 14 months.
Current Performance
[Charts: Tier-1 reliability and availability, Dec-07 to Oct-08 – RAL vs. the Tier-1 average and the top-8 Tier-1s; Tier-2 reliability (%) by region, 2Q07 to 3Q08 – London, NorthGrid, ScotGrid, SouthGrid.]
• Good and improving reliability at the Tier-1 and Tier-2s (but we need to move to experiment-specific SAM tests).
• 2008 MOU resources at the Tier-1 and Tier-2s delivered in full.
• Following the CCRC08 successes, other exercises continue: e.g. the CMS cosmic reprocessing at the end of November, which inadvertently ran (successfully) at 10x the intended I/O rate (Tier-1 LAN and CASTOR service) for 3.5 days!
• Despite some problems, RAL has been roughly the best-performing Tier-1 for LHCb globally; CMS needs have also largely been met.
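For reference, the reliability and availability figures plotted above are derived from periodic SAM test results; below is a minimal sketch of the usual calculation, assuming hourly site states as input (the exact WLCG treatment of unknown periods varies).

  def availability(states):
      """Fraction of known time the site was up."""
      known = [s for s in states if s != "UNKNOWN"]
      return sum(s == "UP" for s in known) / len(known)

  def reliability(states):
      """As availability, but scheduled downtime is excluded entirely."""
      known = [s for s in states if s not in ("UNKNOWN", "SCHEDULED_DOWN")]
      return sum(s == "UP" for s in known) / len(known)

  # Illustrative month of hourly states: one scheduled intervention,
  # one short outage, a few unmonitored hours.
  month = ["UP"] * 700 + ["SCHEDULED_DOWN"] * 24 + ["DOWN"] * 10 + ["UNKNOWN"] * 10

  print("availability: %.1f%%" % (100 * availability(month)))  # ~95.4%
  print("reliability:  %.1f%%" % (100 * reliability(month)))   # ~98.6%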
Current Performance
• Disk failure rate of ~1 per working day, i.e. a ~6% failure rate (twice our assumption); see the worked check after this list.
• ATLAS hit by two multiple-disk failures within a RAID array, resulting in data loss.
• CASTOR 2.1.7 and the Oracle RAC upgrade caused considerable instability, and ATLAS lost 2 weeks of UK simulated production when the Tier-1 became unavailable to receive data.
• Database loads are running several times higher than at CERN; this is partly a cost issue, and partly triggered by the higher-than-average number of transactions generated by some ATLAS jobs.
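The quoted failure rate is consistent with simple arithmetic, sketched below; the drive count is an assumption (~2.2 PB of disk at ~0.5 TB per 2007-era drive), not a figure from the report.

  # Back-of-envelope check of the quoted disk failure rate.
  n_disks = 2222 * 2       # assumed: ~2222 TB of disk at ~0.5 TB per drive
  working_days = 250       # per year
  failures_per_day = 1.0   # observed: ~1 failure per working day

  annual_rate = failures_per_day * working_days / n_disks
  print("implied annual failure rate: %.1f%%" % (100 * annual_rate))  # ~5.6%, i.e. ~6%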
Project Map
[Diagram: GridPP3 project map, linking the goal "To provide UK computing for the Large Hadron Collider" to numbered work areas (1.1 to 6.4). Elements include: management, planning, execution & tracking; the LHC experiments (ATLAS, CMS, LHCb) and other experiments; the Tier-1 (hardware procurement, front-end systems, storage systems, grid services, operations); the Tier-2s (London, NorthGrid, ScotGrid, SouthGrid); middleware (data and storage, security, network, deployment); and external links (LCG, EGEE, National Grid Infrastructure, outreach & engagement, resource delivery, infrastructure support).]
Project Plan
Feedback from the last Oversight Committee
• 8.1 (Disaster recovery) – GridPP-PMB-135-Resilience.doc
• 8.2 (CASTOR) – GridPP-PMB-136-CASTOR.doc
• 8.3 (Documentation) – http://www.gridpp.ac.uk/support/
• 8.4 (Certificates) – http://www.gridpp.ac.uk/deployment/users/certificate.html
• 8.5 (24x7 Cover) – Now fully operational.
• 8.6 (Experiment Support Posts) – Despite all the cuts we have managed to fund 1-FTE for each of ATLAS, CMS, and LHCb.
Forward Look
• Move to the new building at RAL.
• Concentrate on further improving service resilience and engage ATLAS, CMS, LHCb in developing coherent disaster management strategies.
• Investigate (even more) rigorous certification of CASTOR releases.
• Recognise global conclusion that mass data storage requires more effort than anticipated.
• Preparations for GridPP3 took ~20 months: Need to start considering now what happens after GridPP3.
Backup Slides
Job Success Rate
ATLAS data analysis site tests – Nov 25-27 2008.
Job Efficiencies
• Efficiency for the RAL Tier-1: CPU time / wall-clock time.
• Nov 2008: overall efficiency 58%; LHC experiments 83%.
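A minimal sketch of this efficiency calculation, aggregated over a set of jobs; the job records below are illustrative, not real accounting data.

  # Job efficiency = CPU time / wall-clock time, aggregated over jobs.
  jobs = [  # illustrative accounting records
      {"cpu_s": 3600, "wall_s": 4000},  # well-behaved job
      {"cpu_s": 1200, "wall_s": 6000},  # I/O-bound or stalled job
  ]

  total_cpu = sum(j["cpu_s"] for j in jobs)
  total_wall = sum(j["wall_s"] for j in jobs)
  print("efficiency: %.0f%%" % (100.0 * total_cpu / total_wall))  # 48%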
Error Messages