Project Status
David Britton, 15/Dec/08
Outline
• Programmatic Review Outcome
• CCRC08
• LHC Schedule Changes
• Service Resilience
• CASTOR
• Current Status
• Project Management
• Feedback from the last Oversight Committee
• Forward Look
Programmatic Review
• The programmatic review recommended a 5% cut to GridPP.
• Although ALICE and LHCb were ultimately rescued, the cut was still imposed. However, there was a silver lining:
• Bottom line: GridPP3 reduced by £1.24m, on top of the £1.20m removed from GridPP2 (noted at the last OC).
Funding Cut
• Savings of £1.24m achieved by:
– Planned and unplanned late starts to a number of GridPP3 posts.
– Reduction in Tier-1 hardware to reflect changes imposed by the programmatic review (LHCb and BaBar).
– Re-costing of hardware based on the 2007 procurement.
– A reduction in the budget line for the second tranche of Tier-2 hardware, consistent with the reduction in Tier-1 hardware.
– Reduction in travel and miscellaneous spending.
• New plan presented to STFC in July 08; updated in GridPP-PMB-133-Resources.doc
CCRC08
• The Combined Computing Readiness Challenge took place in two phases, February and May 2008. Largely successful for all experiments.
LHC Schedule
Current indications are:
- Machine cold in June.
- First beams in July.
- Collisions at some point later.
- Plans may change!
Consequences for GridPP
- Capacity and services need to be ready in June.
- Meanwhile, many exercises (MC production, cosmics reprocessing, analysis challenges) keep things busy and stress the system.
- Prudent to maintain procurement schedule for April 2009 (little downside to this and helps reduce risks).
- Opportunity to build on the service quality and resilience.
Service Resilience
• Emphasis over the last year on making the Grid resilient:
– Much work on monitoring and alarms.
– 24x7 service initiated.
– Extensive work on making the component services more resilient at many levels (see document).
• Future work on resilience:
– Create a project-manager overview to keep this active at the PMB level.
– Provision a back-up link for the OPN (significant cost).
– Link to the (evolving) experiment disaster planning (UCL meeting).
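As an illustration of the alarm side of this work, below is a minimal health-check loop in Python; the service endpoints, polling interval, failure threshold, and alert channel are all hypothetical placeholders, not the actual GridPP monitoring stack.

  import time
  import urllib.request

  SERVICES = {  # hypothetical endpoints, for illustration only
      "srm": "http://srm.example.ac.uk:8443/ping",
      "lfc": "http://lfc.example.ac.uk:5010/ping",
  }
  FAIL_THRESHOLD = 3  # consecutive failures before raising an alarm
  failures = {name: 0 for name in SERVICES}

  def check(url, timeout=10.0):
      """Return True if the service answers with HTTP 200."""
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return resp.status == 200
      except OSError:
          return False

  def alert(name):
      """Placeholder for the real alarm channel (pager, e-mail, ticket)."""
      print("ALARM: %s failed %d consecutive checks" % (name, FAIL_THRESHOLD))

  while True:  # poll forever, once a minute
      for name, url in SERVICES.items():
          if check(url):
              failures[name] = 0
          else:
              failures[name] += 1
              if failures[name] == FAIL_THRESHOLD:
                  alert(name)
      time.sleep(60)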
CASTOR
• CASTOR proved unreliable in early 2007 but performed well with the upgrade to 2.1.3 for CCRC08.
• In time for first collisions, an upgrade from 2.1.6 to 2.1.7 was required in order to maintain a version supported by CERN. This coincided with a move to a resilient RAC Oracle system – combination of upgrades led to instability in August and September.
• The system is now stabilising, and the problems have led to improved communications and management processes:
– High-load testing identified as a critical missing step for new releases.
– Oracle problems raised to a higher level of awareness in WLCG.
– Storage Review at RAL in November.
• Other Tier-1s have had similar or worse problems with mass storage – a difficult area where the effort required is underestimated.
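A minimal sketch of the kind of high-load test now identified as a missing step for new releases: drive many concurrent transfers at a storage endpoint and report the success rate. The copy command, paths, and stream count below are illustrative placeholders, not the actual CASTOR certification procedure.

  import concurrent.futures
  import subprocess

  COPY_CMD = "rfcp"  # placeholder transfer client
  SOURCE = "/tmp/testfile"  # local test file (illustrative)
  DEST = "castor.example.ac.uk:/castor/test/load"  # hypothetical target
  N_STREAMS = 50  # concurrent transfer streams

  def one_transfer(i):
      """Run a single copy; True on exit code 0."""
      try:
          result = subprocess.run(
              [COPY_CMD, SOURCE, "%s/file_%d" % (DEST, i)],
              capture_output=True,
          )
      except OSError:  # client not installed, etc.
          return False
      return result.returncode == 0

  with concurrent.futures.ThreadPoolExecutor(max_workers=N_STREAMS) as pool:
      outcomes = list(pool.map(one_transfer, range(N_STREAMS)))

  ok = sum(outcomes)
  print("%d/%d transfers succeeded (%.0f%%)"
        % (ok, N_STREAMS, 100.0 * ok / N_STREAMS))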
Status: Resources 2008 (2007)
              Tier-1        Tier-2
CPU [kSI2k]   4590 (1500)   14140 (8588)
Disk [TB]     2222 (750)    1365 (743)
Tape [TB]     2195 (~800)   –
• MOU commitments for 2008 met.
• Combined effort from all Institutions.
Global Resource
• Status in Oct 2007: 245 sites, 40,518 CPUs, 24,135 TB storage.
• Status in Dec 2008: 263 sites, 81,953 CPUs, xx,xxx TB storage – roughly a doubling of CPU capacity in 14 months.
Current Performance
[Charts: Tier-1 reliability and availability, Dec-07 to Oct-08 – RAL vs. the Tier-1 average and the top-8 Tier-1s; Tier-2 reliability (%) by region, 2Q07 to 3Q08 – London, NorthGrid, ScotGrid, SouthGrid.]
• Good and improving reliability at the Tier-1 and Tier-2s (but we need to move to experiment-specific SAM tests).
• 2008 MOU resources at the Tier-1 and Tier-2s delivered in full.
• Following the CCRC08 successes, other exercises continue: e.g. the CMS cosmic reprocessing at the end of November, which inadvertently ran (successfully) at 10x the intended I/O rate (Tier-1 LAN and CASTOR service) for 3.5 days!
• Despite some problems, RAL has been roughly the best-performing Tier-1 for LHCb globally; CMS needs have also largely been met.
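For reference, the reliability and availability figures plotted above are derived from periodic SAM test results; below is a minimal sketch of the usual calculation, assuming hourly site states as input (the exact WLCG treatment of unknown periods varies).

  def availability(states):
      """Fraction of known time the site was up."""
      known = [s for s in states if s != "UNKNOWN"]
      return sum(s == "UP" for s in known) / len(known)

  def reliability(states):
      """As availability, but scheduled downtime is excluded entirely."""
      known = [s for s in states if s not in ("UNKNOWN", "SCHEDULED_DOWN")]
      return sum(s == "UP" for s in known) / len(known)

  # Illustrative month of hourly states: one scheduled intervention,
  # one short outage, a few unmonitored hours.
  month = ["UP"] * 700 + ["SCHEDULED_DOWN"] * 24 + ["DOWN"] * 10 + ["UNKNOWN"] * 10

  print("availability: %.1f%%" % (100 * availability(month)))  # ~95.4%
  print("reliability:  %.1f%%" % (100 * reliability(month)))   # ~98.6%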
Current Performance
• Disk failure rate of ~1 per working day, i.e. a ~6% failure rate (twice our assumption); see the worked check after this list.
• ATLAS hit by two multiple-disk failures within a RAID array, resulting in data loss.
• CASTOR 2.1.7 and the Oracle RAC upgrade caused considerable instability, and ATLAS lost 2 weeks of UK simulated production when the Tier-1 became unavailable to receive data.
• Database loads are running several times higher than at CERN; this is partly a cost issue, and partly triggered by the higher-than-average number of transactions generated by some ATLAS jobs.
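The quoted failure rate is consistent with simple arithmetic, sketched below; the drive count is an assumption (~2.2 PB of disk at ~0.5 TB per 2007-era drive), not a figure from the report.

  # Back-of-envelope check of the quoted disk failure rate.
  n_disks = 2222 * 2       # assumed: ~2222 TB of disk at ~0.5 TB per drive
  working_days = 250       # per year
  failures_per_day = 1.0   # observed: ~1 failure per working day

  annual_rate = failures_per_day * working_days / n_disks
  print("implied annual failure rate: %.1f%%" % (100 * annual_rate))  # ~5.6%, i.e. ~6%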
Project Map
[Diagram: GridPP3 project map, linking the goal "To provide UK computing for the Large Hadron Collider" to numbered work areas (1.1 to 6.4). Elements include: management, planning, execution & tracking; the LHC experiments (ATLAS, CMS, LHCb) and other experiments; the Tier-1 (hardware procurement, front-end systems, storage systems, grid services, operations); the Tier-2s (London, NorthGrid, ScotGrid, SouthGrid); middleware (data and storage, security, network, deployment); and external links (LCG, EGEE, National Grid Infrastructure, outreach & engagement, resource delivery, infrastructure support).]
Project Plan
Feedback from the last Oversight Committee
• 8.1 (Disaster recovery) – GridPP-PMB-135-Resilience.doc
• 8.2 (CASTOR) – GridPP-PMB-136-CASTOR.doc
• 8.3 (Documentation) – http://www.gridpp.ac.uk/support/
• 8.4 (Certificates) – http://www.gridpp.ac.uk/deployment/users/certificate.html
• 8.5 (24x7 Cover) – Now fully operational.
• 8.6 (Experiment Support Posts) – Despite all the cuts we have managed to fund 1-FTE for each of ATLAS, CMS, and LHCb.
Forward Look
• Move to the new building at RAL.
• Concentrate on further improving service resilience and engage ATLAS, CMS, LHCb in developing coherent disaster management strategies.
• Investigate (even more) rigorous certification of CASTOR releases.
• Recognise global conclusion that mass data storage requires more effort than anticipated.
• Preparations for GridPP3 took ~20 months: Need to start considering now what happens after GridPP3.
Backup Slides
Job Success Rate
ATLAS data analysis site tests – Nov 25-27 2008.
Job Efficiencies
• Efficiency for the RAL Tier-1: CPU time / wall-clock time.
• Nov 2008: overall efficiency 58%; LHC experiments 83%.
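A minimal sketch of this efficiency calculation, aggregated over a set of jobs; the job records below are illustrative, not real accounting data.

  # Job efficiency = CPU time / wall-clock time, aggregated over jobs.
  jobs = [  # illustrative accounting records
      {"cpu_s": 3600, "wall_s": 4000},  # well-behaved job
      {"cpu_s": 1200, "wall_s": 6000},  # I/O-bound or stalled job
  ]

  total_cpu = sum(j["cpu_s"] for j in jobs)
  total_wall = sum(j["wall_s"] for j in jobs)
  print("efficiency: %.0f%%" % (100.0 * total_cpu / total_wall))  # 48%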
Error Messages