CERN - IT Department
CH-1211 Genève 23, Switzerland
www.cern.ch/it

LCG Deployment
GridPP 18, Glasgow, 21st March 2007
Tony Cass, Leader, Fabric Infrastructure & Operations Group, IT Department
Material provided by Ian Bird, Flavia Donno, Jamie Shiers and others
SL(C)4 Migration
• The target OS level for initial LHC operation
  – RHES5 is out, but migration is not feasible
  – Experiments would like to drop SL(C)3 builds
• SL3-built, but SL4-compatible, middleware rpms are available now
  – no longer any problem preventing subsequent updates.
• Natively built WN rpms are under test; expected in PPS next week.
  – still some issues with the native UI build
• Plan your migration!
GridPP18: LCG Deployment - 5
SRM v2.2 Server Status
• DPM: version 1.6.3 available in production. SRM 2.2 features still not officially certified. Implementation stable. Use-case tests are OK. Copy not available, but interoperability tests are OK. A few general issues to be solved.
  – Volunteers to install and test SRM 2.2 features welcome!
• BeStMan and StoRM: Stable implementations. Copy in PULL mode not available in StoRM. Recently some instability observed with BeStMan. Some use-case tests still not passing and under investigation.
• dCache: Stable implementation. Copy is available and working with all implementations excluding DRM. Working on some use-case tests.
  – Requires migration to v1.8.0 (which will support v1.1 & v2.2 transparently); beta version in April.
• CASTOR: The implementation has improved remarkably, with a lot of progress during the last 3 weeks. Main instability causes found and fixed. Use-case tests OK. Copy not yet implemented, but interoperability tests OK.
  – Stress tests at CERN now, and at CNAF from next week, but an upgrade to the underlying CASTOR version is required for efficient operation.
• Deployment at CERN scheduled for mid-May; CNAF and RAL to follow soon afterwards, to be ready for production use by July 1st.
GridPP18: LCG Deployment - 6
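The Copy status reported above differs per implementation. As a purely illustrative summary (this is not an API of any real SRM client library; the dictionary and function names are invented for this sketch), the slide's statements can be encoded and queried like this:

```python
# Illustrative encoding of the srmCopy status reported on this slide.
# True/False reflect only what the slide states; nothing here talks to
# a real SRM endpoint.
COPY_SUPPORT = {
    "DPM": False,     # Copy not available (interoperability tests OK)
    "StoRM": False,   # Copy in PULL mode not available
    "dCache": True,   # Copy available and working, excluding DRM
    "CASTOR": False,  # Copy not yet implemented
}

def copy_expected_to_work(source: str, destination: str) -> bool:
    """Return True if a server-side srmCopy initiated at `source`
    towards `destination` is expected to work per the status above.
    Implementations not listed (e.g. DRM as a source) default to False."""
    if not COPY_SUPPORT.get(source, False):
        return False
    # dCache Copy is reported to fail against DRM specifically.
    if source == "dCache" and destination == "DRM":
        return False
    return True
```

This is only a snapshot of the March 2007 status; any of the entries could flip as certification progresses.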
SRM v2.2 Client Status
• FTS
  – SRM client code has been unit-tested and integrated into FTS.
  – Tested against DPM, dCache and StoRM; CASTOR and DRM tests started.
  – Released to the development testbed.
• GFAL/lcg-utils
  – New rpms available on the test UI and being officially certified. No outstanding issues at the moment. ATLAS has started some tests.

gLite WMS
• Reliability of the gLite WMS is being addressed with high priority
  – not yet ready to replace the LCG RB
  – no plans (or effort?) to migrate the LCG RB to SL4.
• Acceptance criteria for the RB have been agreed, based on performance requirements from ATLAS and CMS.
GridPP18: LCG Deployment - 8
                           CMS                                 ATLAS
Performance
  2007 dress rehearsals    Not specified, but was              20K successful jobs/day
                           50K jobs/day in CSA06               + analysis load
  2008                     200K jobs/day through               100K jobs/day,
                           <10 WMS                             <10 WMS
Stability                  Not specified                       <1 restart of WMS or L&B
                                                               every month (== LCG RB)
gLite WMS criteria
• A single WMS machine should demonstrate submission rates of at least 10K jobs/day, sustained over 5 days, during which time the WMS services, including the L&B, should not need to be restarted. This performance level should be reachable with both bulk and single job submission.
  – During this 5-day test the performance must not degrade significantly due to filling of internal queues, memory consumption, etc.; i.e. the submission rate on day 5 should be the same as that on day 1.
• Proxy renewal must work at the 98% level: i.e. <2% of jobs should fail due to proxy renewal problems (the real failure rate should be less, because jobs may be retried).
• The number of stale jobs after 5 days must be <1%.
• The L&B data and job states must be verified:
  – A reasonable time after submission has ended, there should be no jobs in "transient" or "cancelled" states.
  – If jobs are very short, no jobs should stay in the "running" state for more than a few hours.
  – After the proxy expires, all jobs must be in a final state (Done-Success or Aborted).
GridPP18: LCG Deployment - 9
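The quantitative parts of these criteria are straightforward to check from test-run bookkeeping. The sketch below (an illustrative helper, not part of any WMS tooling; the 5% degradation tolerance is an assumption, since the slide only says "must not degrade significantly") flags which criteria a 5-day run violates:

```python
def wms_acceptance(day_rates, proxy_failures, total_jobs, stale_jobs):
    """Check the gLite WMS acceptance criteria listed above.

    day_rates      -- jobs submitted per day over the 5-day test
    proxy_failures -- jobs failed due to proxy renewal problems
    total_jobs     -- total jobs submitted in the test
    stale_jobs     -- jobs still stale after 5 days
    Returns the list of violated criteria (empty list == pass).
    """
    problems = []
    # >= 10K jobs/day, sustained over 5 days
    if len(day_rates) < 5 or any(r < 10_000 for r in day_rates):
        problems.append("submission rate below 10K jobs/day")
    # day-5 rate should match day-1 (5% tolerance is an assumption)
    if day_rates and day_rates[-1] < 0.95 * day_rates[0]:
        problems.append("rate degraded between day 1 and day 5")
    # proxy renewal must work at the 98% level
    if proxy_failures / total_jobs >= 0.02:
        problems.append("proxy renewal failure rate >= 2%")
    # stale jobs after 5 days must be < 1%
    if stale_jobs / total_jobs >= 0.01:
        problems.append("stale job fraction >= 1%")
    return problems
```

The L&B state checks (no "transient"/"cancelled" jobs, all jobs final after proxy expiry) would be verified separately against the L&B database.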
gLite CE
• Similarly for the gLite CE
  – it is not yet reliable enough
  – reliability criteria have been defined
  – no port of the LCG CE to SL4 is foreseen.
• For both, development progress against the reliability criteria is reviewed weekly.
• Deployment of the gLite versions is not recommended at this stage, but
  – if you do have them installed, please keep them running and track developments to help in testing, and
  – be ready to deploy when production-ready code becomes available!
GridPP18: LCG Deployment - 10
gLite CE criteria
• Performance:
  – 2007 dress rehearsals:
    • 5000 simultaneous jobs per CE node.
    • 50 user/role/submission-node combinations (Condor_C instances) per CE node.
  – End 2007:
    • 5000 simultaneous jobs per CE node (assuming the same machine as 2007, but expect this to improve).
    • 1 CE node should support an unlimited number of user/role/submission-node combinations, from at least 10 VOs, up to the limit on the number of jobs (might be achieved with 1 Condor_C per VO, with user switching done by glexec in blah).
• Reliability:
  – Job failure rates due to the CE in normal operation: <0.5%; job failures due to restart of CE services or CE reboot: <0.5%.
  – 2007 dress rehearsals:
    • 5 days unattended running, with performance on day 5 equivalent to that on day 1.
  – End 2007:
    • 1 month unattended running without performance degradation.
GridPP18: LCG Deployment - 11
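The two reliability thresholds are both hard numbers, so a test run can be scored mechanically. A minimal sketch, with invented function and parameter names (the 0.5% thresholds are taken directly from the slide):

```python
def ce_reliability_ok(failed_normal: int, failed_restart: int,
                      total_jobs: int) -> bool:
    """Apply the gLite CE reliability criteria above:
    - job failures due to the CE in normal operation < 0.5%, and
    - job failures due to restart of CE services or CE reboot < 0.5%.
    Illustrative only; not part of any certification harness."""
    return (failed_normal / total_jobs < 0.005
            and failed_restart / total_jobs < 0.005)
```

Note the two failure classes are counted separately: a CE could pass on normal-operation failures yet still fail overall if a service restart kills more than 0.5% of jobs.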
Planning future deployments
• Still a number of components to deploy before data taking
  – and even before the dress rehearsals.
• Remember, there is no longer a "big bang" model; individual components are released as they are ready – be prepared…
• Deployment/Intervention Scheduling
  – discussed at the January workshop: when is "the least inconvenient time"?
  – has been discussed since then at the LCG Experiment Coordination Meeting, but with no consensus. Opinion seems to be that this is not an issue for the "engineering run":
    • last system changes in September/October, then things kept stable for the short run.
  – the situation for 2008 is to be decided before the run.
  – Whatever is decided, clear and early announcement of changes leads to ready acceptance by users…
GridPP18: LCG Deployment - 13
WLCG Intervention Scheduling
1. Scheduled service interventions shall normally be performed outside of the announced period of operation of the LHC accelerator.
2. In the event of mandatory interventions during the operation period of the accelerator – such as a non-critical security patch – an announcement will be made using the Core Infrastructure Centre (CIC) operations portal, and the period of scheduled downtime entered in the Grid Operations Centre (GOC) database (GOCDB).
3. Such an announcement shall be made at least one working day in advance for interventions of up to 4 hours.
4. Interventions resulting in significant service interruption or degradation longer than 4 hours and up to 12 hours shall be announced at the Weekly Operations meeting prior to the intervention, with a reminder sent via the CIC portal as above.
5. Interventions exceeding 12 hours must be announced at least one week in advance, following the procedure above.
6. A further announcement shall be made once normal service has been resumed.
7. [deleted]
8. Intervention planning should also anticipate any interruptions to jobs running in the site batch queues. If appropriate, the queues should be drained and closed for further job submission.
GridPP18: LCG Deployment - 14
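Rules 3-5 amount to a simple mapping from intervention duration to the required announcement. As an illustrative helper (invented names; not part of any real GOCDB or CIC-portal tooling):

```python
def required_announcement(duration_hours: float) -> str:
    """Map an intervention's expected duration to the announcement
    required by the WLCG scheduling rules above (rules 3-5).
    Purely illustrative of the policy text."""
    if duration_hours <= 4:
        # Rule 3: up to 4 hours
        return "CIC broadcast at least one working day in advance"
    if duration_hours <= 12:
        # Rule 4: longer than 4 and up to 12 hours
        return "announce at the Weekly Operations meeting, plus CIC reminder"
    # Rule 5: exceeding 12 hours
    return "announce at least one week in advance"
```

In every case the downtime would also be entered in GOCDB (rule 2), and a follow-up announcement made once service resumes (rule 6).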
LCG Commissioning Schedule
[Timeline graphic, 2006-2008:]
• SC4 becomes the initial service when reliability and performance goals are met.
• Initial service commissioning: increase performance, reliability and capacity to target levels; gain experience in monitoring, 24 x 7 operation, ….
• 1 July 2007: service commissioned – full 2007 capacity and performance.
• 2008: first collisions in the LHC. Full FTS services demonstrated at 2008 data rates for all required Tx-Ty channels, over extended periods, including recovery (T0-T1).
• The April 1st target is to allow experiments to prepare for the July 1st FDRs. The timescale is tight based on SC3/4 experience! (4 months would have been better…)
GridPP18: LCG Deployment - 15