
Introduction T2s - indico.in2p3.fr

Oct 16, 2021

Page 1: Introduction T2s - indico.in2p3.fr

CAF_T2_141116 1

Introduction T2s

L. Poggioli, LAL

• Recent

– S&C week @CERN, 26-30/09

– CHEP2016

– Pledges 2017 & 2018 revisited

• RRB feedback

• ATLAS policy

• Next

– Sites jamboree, 18-20 January @ CERN: https://indico.cern.ch/event/579473/

Luc

Page 2: Introduction T2s - indico.in2p3.fr

Resource usage in 2016

Distributed processing

Data and MC

Heterogeneous resource

Page 3: Introduction T2s - indico.in2p3.fr

Resource usage

• T1s are full (88%)

– 5-10% cannot be used (tape buffers)

• Fraction of secondary

• T2s are FINALLY full

– Old recurrent problem (full at 78%)

• Large fraction of secondary

Page 4: Introduction T2s - indico.in2p3.fr

WORLD Cloud (1)

See Barreiro et al., CHEP2016

• Fully activated end March 2016

• Moving definitively away from the MONARC model

• Dynamic, tasks not confined to a cloud. Group of processing sites defined dynamically per task

• Task nucleus

– Task brokerage chooses a nucleus for each task with respect to data locality, queued work & available storage

– T1s and the bigger T2s are defined as nuclei

– Output aggregated in task nucleus
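
The nucleus choice described above can be sketched as a small scoring function. This is only an illustration under stated assumptions: the `Site` fields, the `choose_nucleus` name, the 0.5 weight and the `max_queue` cap are all hypothetical and are not taken from the actual task brokerage.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    is_nucleus: bool             # T1s and the bigger T2s
    local_input_fraction: float  # share of the task input already on site (0..1)
    queued_jobs: int             # work already waiting at the site
    free_storage_tb: float       # space left to aggregate the task output


def choose_nucleus(sites, output_size_tb, max_queue=10_000):
    """Pick a nucleus for one task (illustrative scoring only)."""
    # Only nuclei with enough free storage for the aggregated output qualify.
    candidates = [
        s for s in sites
        if s.is_nucleus and s.free_storage_tb >= output_size_tb
    ]
    if not candidates:
        return None

    def score(s):
        # Reward data locality, penalise queued work. The weight is arbitrary.
        queue_load = min(s.queued_jobs / max_queue, 1.0)
        return s.local_input_fraction - 0.5 * queue_load

    return max(candidates, key=score)


# Made-up sites, for illustration only.
sites = [
    Site("NUCLEUS_A", True, 0.9, 8_000, 500.0),
    Site("NUCLEUS_B", True, 0.6, 1_000, 800.0),
    Site("SATELLITE_C", False, 1.0, 0, 50.0),
]
chosen = choose_nucleus(sites, output_size_tb=100.0)
print(chosen.name)
```

In this toy input the site with the most local data loses to a less loaded nucleus, which shows how the three criteria trade off against each other.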

Page 5: Introduction T2s - indico.in2p3.fr

WORLD Cloud (2)

• Task satellites

– Run jobs and ship the output to the nucleus

– Job brokerage selects satellites for each task, based on usual criteria (#jobs, data availability)

– Satellites are selected worldwide: a network weight matches well connected nuclei & satellites

• Nuclei http://adc-ddm-mon.cern.ch/ddmusr01/NUCLEUS_DATADISK.html

– Currently T1s and ~20% of T2s: better T2 disk usage!!

– 65% of datadisk is in nuclei, aim to increase to ~80%

• Today: CC, Tokyo, LAPP for FR
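
The satellite selection for a given nucleus can be sketched in the same spirit. The function name, the dictionary layout, the 0.5 penalty for missing input data and all numbers below are hypothetical; only the idea of combining the usual criteria (#jobs, data availability) with a network weight comes from the slide.

```python
def rank_satellites(nucleus, sites, network_weight, top_n=2):
    """Rank candidate satellites for one nucleus (illustrative only).

    sites: dict name -> {"free_slots": int, "has_input": bool}
    network_weight: dict (nucleus, site) -> float in 0..1, higher is better
    """
    scored = []
    for name, info in sites.items():
        if info["free_slots"] == 0:
            continue  # nothing can run there right now
        weight = network_weight.get((nucleus, name), 0.0)
        # Sites without the input data are still usable, but penalised.
        data_factor = 1.0 if info["has_input"] else 0.5
        scored.append((weight * data_factor * info["free_slots"], name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]


# Made-up inputs, for illustration only.
sites = {
    "S1": {"free_slots": 1000, "has_input": True},
    "S2": {"free_slots": 3000, "has_input": False},
    "S3": {"free_slots": 0, "has_input": True},
    "S4": {"free_slots": 500, "has_input": True},
}
network_weight = {("N", "S1"): 0.9, ("N", "S2"): 0.2,
                  ("N", "S3"): 1.0, ("N", "S4"): 1.0}
satellites = rank_satellites("N", sites, network_weight)
print(satellites)
```

Here the largest site ranks last because of its poor connection to the nucleus, which is the effect the network weight is meant to have.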

Page 6: Introduction T2s - indico.in2p3.fr

Run-2: 2017 & 2018

• LHC delivered 50% more data in 2016

– Expect the same for 2017 & 2018

• -> New input parameters for CRSG

-> New pledge requests (for the 4 experiments)

Page 7: Introduction T2s - indico.in2p3.fr

Requests for 2017 & 2018

50% more data but only a 20% increase (disk & CPU)

– Majority of resources go to MC production

• N_{YEAR-N}(FullSim) = 3.5B + 0.3B × N_{YEAR-N}(Data)

• N_{YEAR-N}(FastSim) = F × N_{YEAR-N}(FullSim), with F = 0.6 in 2017 and F = 0.7 in 2018

– Tape request reduced, based on experience with the lifetime model
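
The two MC formulas above can be evaluated with a few lines of Python. This is a sketch under one loudly stated assumption: the slide does not give the unit of N(Data), so the code treats it as a plain number that multiplies 0.3B, and returns event counts in billions. The function name and the input value 10.0 are hypothetical.

```python
# F values as given on the slide.
FAST_TO_FULL = {2017: 0.6, 2018: 0.7}


def mc_request(n_data, year):
    """Return (FullSim, FastSim) event counts in billions.

    n_data: value of N(Data); its unit is an assumption (see lead-in).
    """
    full = 3.5 + 0.3 * n_data          # N(FullSim) = 3.5B + 0.3B * N(Data)
    fast = FAST_TO_FULL[year] * full   # N(FastSim) = F * N(FullSim)
    return full, fast


# 10.0 is a made-up input, not a number from the slides.
full, fast = mc_request(n_data=10.0, year=2017)
print(f"FullSim {full:.2f}B, FastSim {fast:.2f}B")
```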

FLAT BUDGET model no longer valid

Page 8: Introduction T2s - indico.in2p3.fr

CRSG outcome

• For 2017: Endorse ATLAS requests!

– Crucial: Requested resources to be available at T0. Highest priority: tape at T0 & T1s for data and MC

– Essential: large use of beyond-pledge CPU resources for full simulation

This does not guarantee that Funding Agencies will be able to fulfill the requests (OK for major FAs, not France today)

• For 2018

– Evaluate impact of parking data

– Evaluate impact of reducing #MC events

Page 9: Introduction T2s - indico.in2p3.fr

If +20% resource not available (1)

• In France (under discussion with LCG-FR)

– In the flat budget scenario, already unable to fulfill the April 2017 pledges (except for disk at T2s)

• ATLAS model has very little contingency

– Many aspects have been optimized (#copies, dynamic data placement, lifetime model)

• In practice, a reduction in our resources requires producing fewer MC events

– ATLAS grid dominated by MC simulation

• If resources are lacking: collaborative effort across S&C, Trigger, DataPrep, PC, subdetectors

Page 10: Introduction T2s - indico.in2p3.fr

If +20% resource not available (2)

• 2 ‘options’

– Reducing the HLT output rate to 750Hz -> stronger impact on physics program

– Parking data until LS2 -> negative impact on ATLAS students, and requires unexpected extra resources during LS2

• If we do not get more resources, maintain 1kHz HLT & process all data, we can produce 4.5B FullSim in 2017

– For comparison, need is 5.9B for 2017

– In 2016 will produce approximately 5.2B
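
The size of the gap follows from simple arithmetic on the three numbers above. The sketch below assumes all three figures are FullSim events in billions; the variable names are illustrative.

```python
need_2017 = 5.9        # B FullSim events needed in 2017
achievable_2017 = 4.5  # B events producible without extra resources
produced_2016 = 5.2    # B events, approximate 2016 production

shortfall = need_2017 - achievable_2017
coverage = achievable_2017 / need_2017
change_vs_2016 = achievable_2017 / produced_2016 - 1.0

print(f"Shortfall: {shortfall:.1f}B events")
print(f"Need covered: {coverage:.0%}")
print(f"Change with respect to 2016: {change_vs_2016:.0%}")
```

Without extra resources, about 1.4B events (roughly a quarter of the 2017 need) would be missing, and production would fall about 13% below the 2016 level.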

Page 11: Introduction T2s - indico.in2p3.fr

ATLAS policy for 2017

• In case +20% not achievable (realistic)

• T1

– Favor disk over CPU

• Allows better use of opportunistic & pledged CPU resources

– Situation ~OK for tapes

• T2s

– For nuclei-like: favor disk

– For satellite-like: favor CPU

Balancing CPU/Disk is obsolete

Page 12: Introduction T2s - indico.in2p3.fr

Activities since last CAF (1)

Average 1.1M jobs/day (was 1.05M)

Page 13: Introduction T2s - indico.in2p3.fr

Activities since last CAF (2)

Running slots

• Constant > 220k running slots, up to 300k

– MC simulation decrease (end of campaign) & Dip 1 on 25/10 (CentOS vulnerability)

• Dominated by MC simulation (Less MC Reco)

Page 14: Introduction T2s - indico.in2p3.fr

Activities since last CAF (3)

FR cloud 11.1% (last period 10.0%)

Walltime / processing cloud

Page 15: Introduction T2s - indico.in2p3.fr

MCORE (Production only)

• No more quota required

– e.g. 80% is obsolete

– Just ‘dynamic’ handling, performing well

• FR-cloud: 13.4% in WT (last period 10.0%)

• (CERN-T0 -3%)

Page 16: Introduction T2s - indico.in2p3.fr

FR-Cloud

WT all cloud

• CC (-9%), but last period was 6% higher than normal

• Tokyo, GRIF sites, CPPM, LAPP

Page 17: Introduction T2s - indico.in2p3.fr

Transfers

FR as source

FR as destination

CPPM, CC

RO-02, RO-07

Page 18: Introduction T2s - indico.in2p3.fr

FR-sites availability (ATLAS_CRITICAL)

• All sites >90%, except RO-16 & RO-14

Page 19: Introduction T2s - indico.in2p3.fr

Sites ranking ASAP (ICB) Analysis availability Online (since last CAF)

Integrated over 2 months

All above 90%!! Instabilities: LAL

Page 20: Introduction T2s - indico.in2p3.fr

Issues (1)

• CPPM

– GGUS:124652 Failing transfers

• LAPP

– UDT cooling failure

• LPC

– GGUS:123227 Deletion errors

– Renater network problem

• LPNHE

– GGUS:124043 Deletion errors. Storage server problem

• LAL

– GGUS:124726 Deletion errors

Page 21: Introduction T2s - indico.in2p3.fr

Issues (2)

• IRFU

– Cooling issue: disk OK but CPU at 30%. OK now

– GGUS:124532 Failing transfers (reverse DNS lookup broken for IPv6)

• RO-14

– DATADISK full

• RO-16

– GGUS:124175 Squid down

• RO-07

– GGUS:124939 Failing transfers (disk full)

Page 22: Introduction T2s - indico.in2p3.fr

Ongoing: HPC progress

• HPC used today

– US, China, Europe (Germany)

– Still under development

– Used in backfill mode

• In France

– Working group ATLAS/CCIN2P3/IDRIS (Orsay)

• Work on a demonstrator

• PowerPC architecture not favorable

– Contact with TGCC (Saclay)

• Architecture OK but machine full

• Ongoing

Page 23: Introduction T2s - indico.in2p3.fr

Ongoing

• General

– Pledges revision for 2017 & 2018

– AFS_GROUPDIR removal OK

– Archiving to tape old files (sps, LOCAL)

– IPv6?

– HPC

• Sites

– Critical kernel vulnerability OK

– RO-LCG Federation review

• Document supplied to reviewers (A. Filipcic, LP)

• Report provided by reviewers