Glenn Patrick, Rutherford Appleton Laboratory, GridPP22, 1st April 2009

Transcript
Page 1

Glenn Patrick
Rutherford Appleton Laboratory
GridPP22, 1st April 2009

Page 2

Key Words
• Disaster Planning
• Resilience & Performance
• Response

Page 3

ALICE

Page 4

ALICE Workflows
[Diagram: Tier-0 (CASTOR, CAF) – prompt reconstruction, calibration & alignment, express stream analysis; Tier-1s – RAW re-processing, plus simulation and analysis if resources are free; Tier-2s – simulation and analysis; T1 AF and T2 AF analysis facilities fed through a storage hypervisor (xrootd global redirector).]

c/o Kors Bos, CHEP 2009

Page 5

ALICE
1. Loss of custodial data and T2 data. Both would be handled by restoration from existing replicas. In general, loss of T2 data is less critical: the ALICE Computing Model keeps 3 replicas of ESDs and AODs, so the main impact is a loss of analysis resources until the data are restored. Loss of custodial data (RAW) is more critical, as only the original + 1 replica are kept, and would need higher priority (see the sketch after this slide).

2. Compute/storage loss. The affected services would be excluded from production activities at the level of the central AliEn services. Response to the incident and all remedial actions would be co-ordinated by the ALICE Grid team in collaboration with the technical and management groups at the affected centre.

c/o Cristina Lazzeroni
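
Point 1 above amounts to a priority rule over surviving replica counts. The following is a minimal Python sketch of that rule, not ALICE's actual tooling: the dataset records, field names, and the `restoration_order` helper are all hypothetical; only the replica targets (RAW = original + 1, three copies of ESDs/AODs) come from the slide.

```python
# A minimal sketch, assuming a catalogue that reports surviving replica
# counts per dataset. Target counts follow the slide; all else is made up.
TARGET_REPLICAS = {"RAW": 2, "ESD": 3, "AOD": 3}

def restoration_order(lost_datasets):
    """Sort lost datasets so custodial RAW with the fewest surviving
    copies is restored first, as described in point 1."""
    def urgency(d):
        deficit = TARGET_REPLICAS[d["type"]] - d["replicas_left"]
        return (d["type"] == "RAW", deficit)  # custodial first, then deficit
    return sorted(lost_datasets, key=urgency, reverse=True)

lost = [
    {"name": "run1234.ESD", "type": "ESD", "replicas_left": 2},
    {"name": "run1234.RAW", "type": "RAW", "replicas_left": 1},
]
for d in restoration_order(lost):
    print("restore", d["name"])  # RAW comes out first
```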

Page 6

ALICE
3. Massive procurement failure. Fairshare use of Grid computing resources is maintained through an internal prioritisation system at the central AliEn level, so a loss of a fraction of the computing resources will not be directly visible to end users.

4. Extended outage of a Tier 1 (> 5 days). Short-term changes: stop the replication of RAW and divert traffic to other T1 centres; stop the processing of RAW and discontinue using the centre as a target for custodial storage of ESDs/AODs; discontinue T1–T2 data replication (may affect availability of ESDs/AODs at T2s). Changes are made at the level of the AliEn central services (a rebalancing sketch follows this slide). Users are not directly affected, but processing capacity will be reduced. Highest restoration priority = MSS and replication services. Users informed through mailing lists.
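
The first short-term change in point 4, diverting RAW traffic away from the affected T1, is essentially a rebalancing of replication shares. Below is a hedged sketch of that step; the site names, the equal starting shares, and the `divert_raw_replication` helper are illustrative assumptions, not the real AliEn mechanism.

```python
# Hypothetical rebalancing of RAW replication shares when one T1 is down;
# site names and the equal starting shares are illustrative only.
def divert_raw_replication(shares, down_site):
    """shares: {site: fraction of RAW traffic}. Drop the failed site and
    renormalise the remaining fractions so all RAW still flows somewhere."""
    remaining = {s: f for s, f in shares.items() if s != down_site}
    total = sum(remaining.values())
    return {s: f / total for s, f in remaining.items()}

t1_shares = {"RAL": 0.2, "CNAF": 0.2, "IN2P3": 0.2, "FZK": 0.2, "NDGF": 0.2}
print(divert_raw_replication(t1_shares, "RAL"))  # each survivor -> 0.25
```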

Page 7

MINOS

Page 8

MINOS
1. Loss of custodial data and T2 data. Use of the Tier 1 is limited to MC production, with little user analysis. MC data are shipped directly to FNAL, which also holds the master copies of the software and of the data input to the MC. A data loss at the UK T1 would lose only the small amount of data awaiting transfer, along with about 200 GB of input data, which would be retransferred from FNAL (see the estimate after this slide).

2. Compute/storage loss. For a short-term loss, MINOS would simply wait for the system to come back up, since one MC production run takes of order months. For a longer-term loss, production would be moved elsewhere.

3. Massive procurement failure. Alternative facilities would be investigated.

c/o Philip Rodrigues
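
The ~200 GB retransfer in point 1 is small enough to quantify with a back-of-envelope sketch. The sustained link rate below is an assumption, not a figure from the talk; only the 200 GB volume comes from the slide.

```python
# Rough retransfer estimate for the ~200 GB of MC input held at FNAL.
input_gb = 200            # from the slide
link_gbit_s = 1.0         # assumed sustained FNAL -> RAL rate
hours = input_gb * 8 / link_gbit_s / 3600
print(f"~{hours:.1f} h at {link_gbit_s} Gbit/s")  # ~0.4 h for these inputs
```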

Page 9

MINOS
4. Extended outage of a Tier 1 (> 5 days). Again, an outage of days would not change much, but once the outage moved into weeks, alternative facilities would be considered. The small number of users makes it easy to communicate changes.

Page 10

MICE

Page 11

MICE
MICE resilience plan in preparation.
• Loss of CPU at the Tier 1 would interfere with the ability to tune the beam and reduce efficiency by as much as 20-30% (beam tuning phase).
• Loss of the ability to store data on tape would mean data taking coming to a halt once local storage was exhausted (4 days). This could be countered by copying data to multiple Tier 2 sites, unless the disaster takes out the network (a fan-out sketch follows this slide).
• Network loss would mean inability to analyse data (the T1 is not used for analysis) and inability to store data at Tier 2 centres.
• Hence network access is the highest priority, followed by the ability to write to tape.

c/o Paul Kyberd
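
The tape-loss fallback in the second bullet, keeping data taking alive by copying files out to several Tier 2 sites, can be sketched as below. The site names and the `copy` callable are placeholders; MICE's actual data mover is not described in the talk.

```python
# Hedged sketch of the MICE tape-loss fallback: while T1 tape is down,
# fan each raw file out to multiple Tier 2 storage elements instead.
T2_SITES = ["t2-site-a", "t2-site-b", "t2-site-c"]  # placeholder names

def store_raw_file(filename, tape_available, copy):
    """copy(filename, destination) stands in for the real transfer tool."""
    if tape_available:
        copy(filename, "t1-tape")
    else:
        for site in T2_SITES:  # multiple copies guard against a single loss
            copy(filename, site)

store_raw_file("run42.raw", tape_available=False,
               copy=lambda f, dest: print(f"{f} -> {dest}"))
```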

Page 12

SiD Detector for ILC – LOI Studies at T1
[Diagram: 41M events – Simulation, Reconstruction, Lepton ID, Vertexing]

Page 13

SiD
1. Loss of custodial data and T2 data. At the moment SiD has only 2 T1s, SLAC and RAL, and no real T2 structure yet. The recovery strategy would be to copy all data from SLAC to RAL, which is bandwidth limited (see the estimate after this slide).

2. Compute/storage loss. As there are not enough resources for other centres to take over, compute power would simply be lost. At this point, it would probably take longer to co-ordinate a backup strategy than to wait for recovery of services. In case of a foreseeable long-term loss, the response would be co-ordinated locally, assuming SLAC has no free reserves to absorb the extra demand. The highest priority would be to recover storage, as this is the bottleneck in the VO.

c/o Jan Strube
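
Since the recovery in point 1 is bandwidth limited, the dominant cost is just volume over rate. The numbers below are assumptions chosen only to show the arithmetic; the talk gives neither the dataset size nor the usable transatlantic rate.

```python
# Illustrative SLAC -> RAL bulk-copy estimate; both inputs are assumed.
dataset_tb = 50.0        # assumed size of the LOI dataset
wan_gbit_s = 2.0         # assumed usable transatlantic rate
days = dataset_tb * 8000 / wan_gbit_s / 86400  # 1 TB = 8000 Gbit
print(f"~{days:.1f} days to re-replicate")     # ~2.3 days for these inputs
```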

Page 14

SiD
3. Massive procurement failure. Excellent question, but no useful response at the moment.

4. Extended outage of a Tier 1 (> 5 days). An extended outage would be pretty devastating. Data would have to be recovered from the tape store at SLAC and shipped to computing centres with enough resources on conventional farms. At the moment, SiD is limited only by storage throughput. Loss of the UK T1 would cause a considerable delay in work: taking the recent LOI efforts as an example, roughly half the benchmarking analyses would not have finished before the deadline.

The SiD Collaboration wishes to thank the RAL T1 team for all their help in the recent studies.

Page 15

SuperNEMO

Page 16

SuperNEMO
1. Loss of custodial data and T2 data. Currently not using the T1 and mainly using T2s for cache storage, so a data loss would have little impact on VO operations.

2. Compute/storage loss. Response would be channelled through the VO Admin. The implication is a massive slowdown of activities, as the VO would need to fall back on local clusters.

3. Massive procurement failure. This scenario has not yet been thought about.

4. Extended outage of a Tier 1 (> 5 days). Currently relying only on the WMS at the UK T1. Short term: fall back to WMSs in France for immediate activities. Long term: establish support from WMSs at other centres (a failover sketch follows this slide). The main communication channel is the GridPP/TB Support mailing lists → VO Admin → users. Some users may also be subscribed to the GridPP-USERS list.

c/o Gianfranco Sciacca
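
The short-term/long-term WMS fallback in point 4 is in effect an ordered failover list. A minimal sketch follows, assuming placeholder endpoint URLs; the real service names are not given in the talk, and `try_submit` stands in for the actual grid client.

```python
# Ordered WMS failover: UK T1 first, French WMSs as the short-term
# fallback, then longer-term alternates. All URLs are placeholders.
WMS_ENDPOINTS = [
    "https://wms.uk-t1.example:7443",
    "https://wms.fr-1.example:7443",
    "https://wms.fr-2.example:7443",
]

def submit_with_failover(job, try_submit):
    """try_submit(endpoint, job) -> bool stands in for the real client."""
    for endpoint in WMS_ENDPOINTS:
        if try_submit(endpoint, job):
            return endpoint  # first endpoint that accepts the job
    raise RuntimeError("no WMS endpoint accepted the job")
```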

Page 17

H1

Page 18

H1
Provided a list of UK site problems:
• Scheduling time is too long for H1 (Oxford). After 6 hours in the Scheduled state, jobs are cancelled and resubmitted (a watchdog sketch follows this slide). Concern over VO priority.
• Specific scheduling of jobs into the Running state (RAL, Birmingham). Certain queues show "one-by-one" or "two-by-two" submission.
• Bad-sites list (Brunel, Birmingham, RAL). Includes missing libraries, "forever running" jobs, etc.
• SRM/LFC catalogue problem (QMUL, Oxford, IC). The LFC entry exists, but the physical file does not. This only seems to be a problem at UK sites. After a few such cases, the site is deleted from the experiment's list of queues.

c/o Dave Sankey
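
The first bullet describes a simple watchdog policy: cancel and resubmit any job still in the Scheduled state after 6 hours. A sketch under that reading follows; the 6-hour timeout is the figure quoted on the slide, while the job records and the cancel/resubmit callables stand in for the real grid commands.

```python
# Watchdog for jobs stuck in the Scheduled state; the 6-hour timeout is
# from the slide, everything else here is illustrative.
TIMEOUT_S = 6 * 3600

def police_jobs(jobs, now, cancel, resubmit):
    """jobs: dicts with 'state' and 'scheduled_at' (epoch seconds)."""
    for job in jobs:
        stuck = now - job["scheduled_at"] > TIMEOUT_S
        if job["state"] == "Scheduled" and stuck:
            cancel(job)     # placeholder for the real cancellation command
            resubmit(job)   # placeholder for the real resubmission
```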

Page 19

Final Comments
"The Others" have very limited manpower resources to deal with disasters and to "fire-fight"...

Although the LHC has priority, it is important to remember that "The Others" actually exist...

[Image: ATLAS and CMS alongside one of "The Others"]

Page 20

The END