CERN - IT Department
CH-1211 Genève 23, Switzerland
www.cern.ch/it

LCG Deployment
GridPP 18, Glasgow, 21st March 2007
Tony Cass, Leader, Fabric Infrastructure & Operations Group, IT Department
Material provided by Ian Bird, Flavia Donno, Jamie Shiers and others
SL(C)4 Migration
• The target OS level for initial LHC operation
  – RHES5 is out, but migration is not feasible
  – Experiments would like to drop SL(C)3 builds
• SL3-built, but SL4-compatible, middleware rpms are available now
  – no longer any problem preventing subsequent updates.
• Natively built WN rpms are under test; expected in PPS next week.
  – still some issues with the native UI build
• Plan your migration!
GridPP18: LCG Deployment - 5
SRM v2.2 Server Status
• DPM: version 1.6.3 available in production. SRM 2.2 features still not officially certified. Implementation stable. Use-case tests are OK. Copy not available, but interoperability tests are OK. A few general issues to be solved.
  – Volunteers to install and test SRM 2.2 features welcome!
• BeStMan and StoRM: Stable implementations. Copy in PULL mode not available in StoRM. Recently some instability observed with BeStMan. Some use-case tests still not passing and under investigation.
• dCache: Stable implementation. Copy is available and working with all implementations excluding DRM. Working on some use-case tests.
  – Requires migration to v1.8.0 (which will support v1.1 & v2.2 transparently); beta version in April.
• CASTOR: The implementation has improved remarkably, with a lot of progress during the last 3 weeks. Main instability causes found and fixed. Use-case tests OK. Copy not yet implemented, but interoperability tests OK.
  – Stress tests at CERN now, and at CNAF from next week, but an upgrade to the underlying CASTOR version is required for efficient operation.
• Deployment at CERN scheduled for mid-May; CNAF and RAL to follow soon afterwards, to be ready for production use by July 1st.
GridPP18: LCG Deployment - 6
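The Copy status reported above differs per implementation. As a purely illustrative summary (this is not an API of any real SRM client library; the dictionary and function names are invented for this sketch), the slide's statements can be encoded and queried like this:

```python
# Illustrative encoding of the srmCopy status reported on this slide.
# True/False reflect only what the slide states; nothing here talks to
# a real SRM endpoint.
COPY_SUPPORT = {
    "DPM": False,     # Copy not available (interoperability tests OK)
    "StoRM": False,   # Copy in PULL mode not available
    "dCache": True,   # Copy available and working, excluding DRM
    "CASTOR": False,  # Copy not yet implemented
}

def copy_expected_to_work(source: str, destination: str) -> bool:
    """Return True if a server-side srmCopy initiated at `source`
    towards `destination` is expected to work per the status above.
    Implementations not listed (e.g. DRM as a source) default to False."""
    if not COPY_SUPPORT.get(source, False):
        return False
    # dCache Copy is reported to fail against DRM specifically.
    if source == "dCache" and destination == "DRM":
        return False
    return True
```

This is only a snapshot of the March 2007 status; any of the entries could flip as certification progresses.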
SRM v2.2 Client Status
• FTS
  – SRM client code has been unit-tested and integrated into FTS.
  – Tested against DPM, dCache and StoRM; CASTOR and DRM tests started.
  – Released to the development testbed.
• GFAL/lcg-utils
  – New rpms available on the test UI and being officially certified. No outstanding issues at the moment. ATLAS has started some tests.

gLite WMS
• Reliability of the gLite WMS is being addressed with high priority
  – not yet ready to replace the LCG RB
  – no plans (or effort?) to migrate the LCG RB to SL4.
• Acceptance criteria for the RB have been agreed, based on performance requirements from ATLAS and CMS.
GridPP18: LCG Deployment - 8
                           CMS                                 ATLAS
Performance
  2007 dress rehearsals    Not specified, but was              20K successful jobs/day
                           50K jobs/day in CSA06               + analysis load
  2008                     200K jobs/day through               100K jobs/day,
                           <10 WMS                             <10 WMS
Stability                  Not specified                       <1 restart of WMS or L&B
                                                               every month (== LCG RB)
gLite WMS criteria
• A single WMS machine should demonstrate submission rates of at least 10K jobs/day, sustained over 5 days, during which time the WMS services, including the L&B, should not need to be restarted. This performance level should be reachable with both bulk and single job submission.
  – During this 5-day test the performance must not degrade significantly due to filling of internal queues, memory consumption, etc.; i.e. the submission rate on day 5 should be the same as that on day 1.
• Proxy renewal must work at the 98% level: i.e. <2% of jobs should fail due to proxy renewal problems (the real failure rate should be less, because jobs may be retried).
• The number of stale jobs after 5 days must be <1%.
• The L&B data and job states must be verified:
  – A reasonable time after submission has ended, there should be no jobs in "transient" or "cancelled" states.
  – If jobs are very short, no jobs should stay in the "running" state for more than a few hours.
  – After the proxy expires, all jobs must be in a final state (Done-Success or Aborted).
GridPP18: LCG Deployment - 9
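The quantitative parts of these criteria are straightforward to check from test-run bookkeeping. The sketch below (an illustrative helper, not part of any WMS tooling; the 5% degradation tolerance is an assumption, since the slide only says "must not degrade significantly") flags which criteria a 5-day run violates:

```python
def wms_acceptance(day_rates, proxy_failures, total_jobs, stale_jobs):
    """Check the gLite WMS acceptance criteria listed above.

    day_rates      -- jobs submitted per day over the 5-day test
    proxy_failures -- jobs failed due to proxy renewal problems
    total_jobs     -- total jobs submitted in the test
    stale_jobs     -- jobs still stale after 5 days
    Returns the list of violated criteria (empty list == pass).
    """
    problems = []
    # >= 10K jobs/day, sustained over 5 days
    if len(day_rates) < 5 or any(r < 10_000 for r in day_rates):
        problems.append("submission rate below 10K jobs/day")
    # day-5 rate should match day-1 (5% tolerance is an assumption)
    if day_rates and day_rates[-1] < 0.95 * day_rates[0]:
        problems.append("rate degraded between day 1 and day 5")
    # proxy renewal must work at the 98% level
    if proxy_failures / total_jobs >= 0.02:
        problems.append("proxy renewal failure rate >= 2%")
    # stale jobs after 5 days must be < 1%
    if stale_jobs / total_jobs >= 0.01:
        problems.append("stale job fraction >= 1%")
    return problems
```

The L&B state checks (no "transient"/"cancelled" jobs, all jobs final after proxy expiry) would be verified separately against the L&B database.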
gLite CE
• Similarly for the gLite CE
  – it is not yet reliable enough
  – reliability criteria have been defined
  – no port of the LCG CE to SL4 is foreseen.
• For both, development progress against the reliability criteria is reviewed weekly.
• Deployment of the gLite versions is not recommended at this stage, but
  – if you do have them installed, please keep them running and track developments to help in testing, and
  – be ready to deploy when production-ready code becomes available!
GridPP18: LCG Deployment - 10
gLite CE criteria
• Performance:
  – 2007 dress rehearsals:
    • 5000 simultaneous jobs per CE node.
    • 50 user/role/submission-node combinations (Condor_C instances) per CE node.
  – End 2007:
    • 5000 simultaneous jobs per CE node (assuming the same machine as 2007, but expect this to improve).
    • 1 CE node should support an unlimited number of user/role/submission-node combinations, from at least 10 VOs, up to the limit on the number of jobs (might be achieved with 1 Condor_C per VO, with user switching done by glexec in blah).
• Reliability:
  – Job failure rates due to the CE in normal operation: <0.5%; job failures due to restart of CE services or CE reboot: <0.5%.
  – 2007 dress rehearsals:
    • 5 days unattended running, with performance on day 5 equivalent to that on day 1.
  – End 2007:
    • 1 month unattended running without performance degradation.
GridPP18: LCG Deployment - 11
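The two reliability thresholds are both hard numbers, so a test run can be scored mechanically. A minimal sketch, with invented function and parameter names (the 0.5% thresholds are taken directly from the slide):

```python
def ce_reliability_ok(failed_normal: int, failed_restart: int,
                      total_jobs: int) -> bool:
    """Apply the gLite CE reliability criteria above:
    - job failures due to the CE in normal operation < 0.5%, and
    - job failures due to restart of CE services or CE reboot < 0.5%.
    Illustrative only; not part of any certification harness."""
    return (failed_normal / total_jobs < 0.005
            and failed_restart / total_jobs < 0.005)
```

Note the two failure classes are counted separately: a CE could pass on normal-operation failures yet still fail overall if a service restart kills more than 0.5% of jobs.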
Planning future deployments
• Still a number of components to deploy before data taking
  – and even before the dress rehearsals.
• Remember, there is no longer a "big bang" model; individual components are released as they are ready – be prepared…
• Deployment/Intervention Scheduling
  – discussed at the January workshop: when is "the least inconvenient time"?
  – has been discussed since then at the LCG Experiment Coordination Meeting, but with no consensus. Opinion seems to be that this is not an issue for the "engineering run":
    • last system changes in September/October, then things kept stable for the short run.
  – the situation for 2008 is to be decided before the run.
  – Whatever is decided, clear and early announcement of changes leads to ready acceptance by users…
GridPP18: LCG Deployment - 13
WLCG Intervention Scheduling
1. Scheduled service interventions shall normally be performed outside of the announced period of operation of the LHC accelerator.
2. In the event of mandatory interventions during the operation period of the accelerator – such as a non-critical security patch – an announcement will be made using the Core Infrastructure Centre (CIC) operations portal, and the period of scheduled downtime entered in the Grid Operations Centre (GOC) database (GOCDB).
3. Such an announcement shall be made at least one working day in advance for interventions of up to 4 hours.
4. Interventions resulting in significant service interruption or degradation longer than 4 hours and up to 12 hours shall be announced at the Weekly Operations meeting prior to the intervention, with a reminder sent via the CIC portal as above.
5. Interventions exceeding 12 hours must be announced at least one week in advance, following the procedure above.
6. A further announcement shall be made once normal service has been resumed.
7. [deleted]
8. Intervention planning should also anticipate any interruptions to jobs running in the site batch queues. If appropriate, the queues should be drained and closed for further job submission.
GridPP18: LCG Deployment - 14
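Rules 3-5 amount to a simple mapping from intervention duration to the required announcement. As an illustrative helper (invented names; not part of any real GOCDB or CIC-portal tooling):

```python
def required_announcement(duration_hours: float) -> str:
    """Map an intervention's expected duration to the announcement
    required by the WLCG scheduling rules above (rules 3-5).
    Purely illustrative of the policy text."""
    if duration_hours <= 4:
        # Rule 3: up to 4 hours
        return "CIC broadcast at least one working day in advance"
    if duration_hours <= 12:
        # Rule 4: longer than 4 and up to 12 hours
        return "announce at the Weekly Operations meeting, plus CIC reminder"
    # Rule 5: exceeding 12 hours
    return "announce at least one week in advance"
```

In every case the downtime would also be entered in GOCDB (rule 2), and a follow-up announcement made once service resumes (rule 6).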
LCG Commissioning Schedule
[Timeline graphic, 2006-2008:]
• SC4 becomes the initial service when reliability and performance goals are met.
• Initial service commissioning: increase performance, reliability and capacity to target levels; gain experience in monitoring, 24 x 7 operation, ….
• 1 July 2007: service commissioned – full 2007 capacity and performance.
• 2008: first collisions in the LHC. Full FTS services demonstrated at 2008 data rates for all required Tx-Ty channels, over extended periods, including recovery (T0-T1).
• The April 1st target is to allow experiments to prepare for the July 1st FDRs. The timescale is tight based on SC3/4 experience! (4 months would have been better…)
GridPP18: LCG Deployment - 15