Role of Tier-0, Tier-1 and Tier-2 Regional Centers during CMS DC04
D. Bonacorsi (CNAF-INFN Bologna, Italy), on behalf of the CMS Collaboration
CHEP’04, Interlaken (Sept 27th – Oct 1st, 2004), D. Bonacorsi (CNAF-INFN, Italy) [ id498 ses9 tr5 ]
Outline
• Introductory overview of the CMS Pre-Challenge Production (PCP) and the CMS Data Challenge (DC04): ideas, strategies, key points (main focus on Regional Centers, RCs)
• Role of RCs in the data distribution infrastructure; description of the distinct scenarios deployed and tested in DC04
• Successes, failures, experience gained, issues raised
• Summary and conclusions
CMS PCP-DC04 overview
• Pre-Challenge Production: PCP (Jul. 03 – Feb. 04)
– Simulation and digitization of the data samples needed as input for the DC
– PCP strategy: mainly non-grid productions, but also grid prototypes (CMS/LCG-0, LCG-1, Grid3)
– ~70M Monte Carlo events produced (20M with Geant4): 750K jobs, 3500 KSI2000 months, 80 TB of data
• Goal: validation of the CMS computing model on a sufficient number of Tier-0/1/2 sites, as a large-scale test of the computing and analysis models
• Data Challenge: DC04 (Mar. – Apr. 04)
– Reconstruction and analysis of CMS data sustained over 2 months at 5% of the LHC rate at full luminosity (25% of the start-up luminosity)
– Data distribution to Tier-1 and Tier-2 sites
– DC strategy:
• sustain a 25 Hz reconstruction rate in the Tier-0 farm
• register data and metadata in a world-readable catalogue
• transfer reconstructed data from the Tier-0 to the Tier-1 centers
• analyze reconstructed data at the Tier-1/2’s as they arrive
• monitor and archive resource and process information
Aimed at demonstrating the feasibility of the full chain.
[Diagram: PCP covers Generation, Simulation and Digitization; DC04 covers Reconstruction and Analysis]
Global DC04 layout and data distribution infrastructure
[Diagram: the Tier-0 (Castor MSS, IB and GDB pools, fake on-line process, RefDB, POOL RLS catalogue, TMDB, ORCA RECO jobs, Tier-0 data distribution agents, Export Buffers, LCG-2 services) feeds the Tier-1’s (Tier-1 agent, T1 storage, MSS, ORCA analysis and Grid jobs); each Tier-1 serves Tier-2’s (T2 storage, ORCA local jobs) where physicists work]
DC04 key points and Regional Centers involvement
• Maximize reconstruction efficiency at the Tier-0
• Automatic registration and distribution of data via a set of loosely coupled agents running at the Tier-1’s: key role of the Transfer Management DB (TMDB) for inter-agent communication
• Support a (reasonable) variety of data transfer strategies (and MSS): LCG-2 Replica Manager (CNAF and PIC T1’s, with LCG-2 Castor-SE); native SRM (FNAL T1, with dCache+Enstore); SRB (RAL, IN2P3, GridKA T1’s, with Castor, HPSS, …); this results in 3 distinct T0 → T1 distribution chains
• Use a single global file catalogue (accessible from all Tier-1’s): RLS used for data and metadata (POOL) by all transfer tools
• Redundant monitoring/archiving of information on resources and processes: MonALISA for global monitoring of the network and all CPU resources, LEMON for dedicated monitoring of the DC04 Tier-0 resources, GridICE for monitoring of all LCG resources
• Grant data access at the Tier-2’s for “real-time data analysis”
see also [ id162 ses7 tr4 ]
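The loosely coupled agents communicating only through the TMDB can be sketched as a polling loop over a shared table. This is a toy sketch, not the real system: the table layout, column names and state values are invented (the actual TMDB schema is not described here), and SQLite stands in for the real database.

```python
import sqlite3

# Toy stand-in for the Transfer Management DB (TMDB): agents never talk
# to each other directly, they only claim and update rows in a shared table.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE transfers (
    guid TEXT PRIMARY KEY,
    destination TEXT,           -- e.g. 'CNAF', 'PIC', 'FNAL'
    state TEXT DEFAULT 'at_EB'  -- at_EB -> transferring -> done
)""")

def tier0_agent(guids, destination):
    """Tier-0 side: advertise files sitting in the Export Buffer."""
    for g in guids:
        db.execute("INSERT OR IGNORE INTO transfers (guid, destination) VALUES (?, ?)",
                   (g, destination))
    db.commit()

def tier1_agent(site, copy_file):
    """Tier-1 side: poll the TMDB, claim pending files, mark them done."""
    rows = db.execute(
        "SELECT guid FROM transfers WHERE destination = ? AND state = 'at_EB'",
        (site,)).fetchall()
    for (guid,) in rows:
        db.execute("UPDATE transfers SET state = 'transferring' WHERE guid = ?", (guid,))
        copy_file(guid)   # replica-manager / srmcp / Sreplicate in the real chains
        db.execute("UPDATE transfers SET state = 'done' WHERE guid = ?", (guid,))
    db.commit()
    return [g for (g,) in rows]

tier0_agent(["guid-001", "guid-002"], "CNAF")
moved = tier1_agent("CNAF", copy_file=lambda g: None)  # no real transfer in this toy
print(moved)  # ['guid-001', 'guid-002']
```

The point of the design is visible even in the toy: the Tier-0 and Tier-1 agents share no code path and no direct connection, so any one of them can crash and restart without stalling the others.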
Hierarchy of RCs in DC04 and data distribution chains
[Diagram: the CERN Tier-0 feeds CNAF (Italy) and PIC (Spain) via the LCG-2 RM chain, FNAL (USA) via the SRM chain, and RAL (UK), GridKA (Germany) and IN2P3 (France) via the SRB chain; Tier-2’s: Legnaro under CNAF, CIEMAT under PIC, UFL and Caltech under FNAL]
Tier-0
Systems
• LSF batch system: 3 dedicated racks, 44 nodes each (264 CPUs total), dual P-IV Xeon 2.4 GHz, 1 GB memory, 100baseT
• Dedicated cmsdc04 batch queue, 500 RUN slots
• Disk servers: DC04-dedicated stager with 2 pools (IB and GDB, 10 + 4 TB)
Export Buffers
• EB-SRM (4 servers, 4.2 TB total)
• EB-SRB (4 servers, 4.2 TB total)
• EB-SE (3 servers, 3.1 TB total)
Databases
• RLS (Replica Location Service)
• TMDB (Transfer Management DB)
Transfer steering
• Agents steering data transfers run on a dedicated node (for close monitoring)
Monitoring services
[Diagram: Tier-0 architecture built on Castor, the IB and GDB pools, the fake on-line process, RefDB, the POOL RLS catalogue, the TMDB, ORCA RECO jobs, the data distribution agents and the Export Buffers]
The LCG-2 chain (1/2)
[Diagram: at the CERN Tier-0, Castor feeds a disk-SE Export Buffer; the RM data distribution agent, the RLS and the TMDB drive replication to the Tier-1 (Tier-1 agent, Castor-SE backed by Castor) and on to Tier-1/Tier-2 disk-SEs]
• involved Tier-1’s: CNAF and PIC
Principle: data replication between LCG-2 SEs
Set-up:
– Tier-0: 1 EB, a classic disk-based LCG-2 SE (3 SE machines with 1 TB each)
– Tier-1’s: a Castor-SE receiving the data, but with different underlying MSS hardware solutions
Strategy comparison:
– CNAF: Replica Manager CLI (+ LRC C++ API for listing replicas only): copies a file and inherently registers it in the RLS, with file-size info stored in the LRC (safer against failed replicas, at the cost of the overhead introduced by the CLI Java processes)
– PIC: globus-url-copy + LRC C++ API: copies a file and registers it in the RLS later, with no file-size check (faster! but with no quality check of replica operations)
see also [ id497 ses9 tr5 ]
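The trade-off between the two strategies (CNAF's atomic copy-and-register versus PIC's faster copy with deferred registration) can be illustrated with a toy sketch; every function here is an invented stand-in for the real tools (the LCG-2 Replica Manager CLI, globus-url-copy, the LRC C++ API), not their actual interfaces.

```python
# Toy RLS: guid -> list of replica URLs. All names invented for illustration.
catalogue = {}

def gridftp_copy(src, dst):
    """Pretend transfer; in DC04 both strategies moved data with GridFTP."""
    return True

def copy_and_register(guid, src, dst):
    """CNAF-style: copy and RLS registration are one operation, so the
    catalogue can never point at a replica that failed to arrive (but every
    call pays the start-up cost of a CLI Java process)."""
    if gridftp_copy(src, dst):
        catalogue.setdefault(guid, []).append(dst)

def copy_then_register(guid, src, dst):
    """PIC-style: faster, but the copy and the later registration can get
    out of step on failures, and there is no file-size cross-check."""
    gridftp_copy(src, dst)
    # ... some time later, registration happens unconditionally:
    catalogue.setdefault(guid, []).append(dst)

copy_and_register("guid-A", "sfn://eb.cern.ch/f1", "sfn://castor.cnaf.it/f1")
copy_then_register("guid-B", "sfn://eb.cern.ch/f2", "sfn://castor.pic.es/f2")
print(sorted(catalogue))  # ['guid-A', 'guid-B']
```

In the toy both paths succeed; the difference only shows up when `gridftp_copy` fails, where the PIC-style path would still leave an entry in the catalogue.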
The LCG-2 chain (2/2)
• both the CNAF and PIC approaches achieved good performance
– T1 agents were robust and kept pace with the data available at the EB
– network ‘stress test’ at the end of DC04 with ‘big’ files: typical transfer rates >30 MB/s, CNAF sustained >42 MB/s for some hours
• dealing with too many small files (a DC issue affecting all distribution chains) is “bad” for efficient use of the bandwidth and for the scalability of MSS systems
[Plots (CNAF T1 network monitoring): eth I/O of the SE-EB of the LCG chain at the CERN Tier-0, of the CNAF Tier-1 Castor-SE and of the CNAF Tier-1 classic disk-SE; >3k files, >750 GB, ~340 Mbps]
The SRM chain (1/2)
• involved Tier-1: FNAL
Principle: SRM transactions to receive TURLs from the EB, transfers via GridFTP
Set-up:
– Tier-0: an SRM/dCache-based DRM serving as EB; files are staged out of Castor to the dCache disk pool and pinned until transferred
– Tier-1: an SRM/dCache/Enstore-based HRM acting as Import Buffer, with the SRM interface providing access to Enstore via dCache
[Diagram: an SRM-COPY issued from the SRM client on the T1 agent machine triggers space reservation and write at FNAL and stage/pin at CERN; SRM-GET returns a TURL one file at a time; a GridFTP GET in pull mode moves the data over the network from the CERN T0 EB dCache pool to the FNAL T1 dCache pool and on to Enstore]
see [ id190 ses7 tr4 ]
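The per-file handshake of the SRM chain (an SRM-GET returning a TURL, then a GridFTP pull) can be sketched as follows. All function names are illustrative stand-ins, not the actual SRM client API, and the URLs are invented examples.

```python
def srm_get(surl):
    """Toy SRM-GET: the source stages and pins the file, then hands back
    a transfer URL (TURL), one file at a time as in DC04."""
    return surl.replace("srm://", "gsiftp://")

def gridftp_get(turl, local_path):
    """Toy GridFTP GET in pull mode: in DC04 this moved data from the
    CERN EB dCache pool into the FNAL dCache pool (and on to Enstore)."""
    return (turl, local_path)

def srm_copy(surls, dest_dir):
    """Toy SRM-COPY driven from the T1 agent machine: request a TURL,
    then pull the file, for every file in the request."""
    done = []
    for surl in surls:
        turl = srm_get(surl)  # source stages and pins the file
        done.append(gridftp_get(turl, dest_dir + "/" + surl.rsplit("/", 1)[-1]))
    return done

transfers = srm_copy(["srm://eb.cern.ch/dc04/file1.root",
                      "srm://eb.cern.ch/dc04/file2.root"], "/pnfs/fnal/dc04")
print(len(transfers))  # 2
```

The one-TURL-at-a-time loop also makes the small-file problem discussed on the next slide concrete: every file pays the full handshake cost regardless of its size.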
[Plot: number of transferred files vs date, 1-Mar-2004 to 26-Apr-2004, scale up to 20000 files]
The SRM chain (2/2)
• in general quite robust tools: e.g. SRM for error checking/retrying, dCache for automatic migration to tape, …
• stressed a few software/hardware components to the breaking point: e.g. monitoring was not implemented to catch service failures, forcing manual interventions
• again, problems from the high number/small size of the DC files:
– the use of multiple streams, with multiple files in each stream, reduced the overhead of the authentication process
– MSS optimization was necessary to handle the challenge load
– inefficient use of tapes forced the allocation of more tapes plus the deployment of a larger namespace service
• relevant improvements during ongoing DC operations: e.g. a reduction of the delegated proxy’s modulus size in SRM yielded a factor 3.5 speed-up of the interaction between SRM client and server
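Why batching many files into each authenticated stream helps can be shown with simple arithmetic; all the numbers below (per-stream authentication cost, file size, bandwidth) are invented for illustration, not DC04 measurements.

```python
def transfer_time(n_files, file_mb, bw_mbps, auth_s, files_per_stream):
    """Total time = one authentication per stream + payload time.
    Illustrative model only: ignores protocol overhead and contention."""
    n_streams = -(-n_files // files_per_stream)  # ceiling division
    payload_s = n_files * file_mb * 8 / bw_mbps  # MB -> Mbit, then / Mbps
    return n_streams * auth_s + payload_s

# 10,000 small files of 10 MB over a 300 Mbps link, 5 s per authentication:
one_per_stream  = transfer_time(10_000, 10, 300, 5, files_per_stream=1)
batched_streams = transfer_time(10_000, 10, 300, 5, files_per_stream=100)
print(round(one_per_stream))   # 52667 s: dominated by authentication
print(round(batched_streams))  # 3167 s: authentication cost amortized
```

With one file per stream the fixed per-connection cost dwarfs the actual payload time, which is exactly the small-file pathology the slide describes.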
The SRB chain (1/2)
• involved Tier-1’s: GridKA, IN2P3, RAL
Principle: use SRB to transfer files to the local MSS with consistent catalogue info
Set-up:
– Tier-0: an SRB EB; files are copied from Castor to the EB machine, then ‘inserted’ into the SRB virtual space (both data and metadata)
– Tier-1’s: one SRB IB at each site; data replication with SRB commands, i.e. Sreplicate or Sget/Sput
• a GMCat component developed in the UK links the SRB namespaces by periodically publishing SRB replica info into the RLS at CERN
• again, problems from the high number/small size of the DC files: troublesome injection of the initial entries into the SRB EB at the T0; unexpected inefficiencies of SRB commands on small files
• reasonable T0 → T1 transfer rates: e.g. IN2P3 averaged ~30 Mbps and sustained 80 Mbps for some hours, mainly limited by the small file sizes
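The SRB flow (insert into the virtual space at the T0, replicate to a Tier-1, then have GMCat publish replica info into the RLS) can be modelled with a toy sketch; `Sput` and `Sreplicate` are real SRB command names mentioned above, but everything in this code, including their behaviour, is an invented simplification.

```python
# Toy model of the SRB chain: a virtual namespace seeded at the T0 EB,
# per-site replicas, and a GMCat-like step mirroring replica info to the RLS.
srb_namespace = {}  # logical name -> set of sites holding a replica
rls = {}            # toy RLS at CERN: logical name -> sorted list of sites

def s_put(lfn):
    """Stand-in for 'Sput': insert a file into the SRB virtual space at the T0 EB."""
    srb_namespace.setdefault(lfn, set()).add("T0-EB")

def s_replicate(lfn, site):
    """Stand-in for 'Sreplicate': copy an already-registered file to a Tier-1 IB."""
    if lfn in srb_namespace:
        srb_namespace[lfn].add(site)

def gmcat_publish():
    """GMCat-like step: periodically publish SRB replica info into the RLS."""
    for lfn, sites in srb_namespace.items():
        rls[lfn] = sorted(sites)

s_put("dc04/hits/file1")
s_replicate("dc04/hits/file1", "IN2P3")
gmcat_publish()
print(rls["dc04/hits/file1"])  # ['IN2P3', 'T0-EB']
```

Note that the RLS only learns about replicas at the next periodic publication, so between `s_replicate` and `gmcat_publish` the global catalogue lags the SRB namespace.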
The SRB chain (2/2)The SRB chain (2/2)• while successful in PCP (that’s why some sites chose it), SRB showed unexpected poor performance in DC04
• severely hampered by technical issues: MCat single point of failure: unusability of metadata catalogue at RAL
Loss of performance, long time queries causing transfer commands to timeout, core-dumps.. several annoyances in both client/server sw of SRB v.2 used in DC04
SRB commands return code not reliable, hard to cleanly kill on-going Sreplicate processes, ...
• its use was stopped before official end of DC04
T1’s of the SRB chain did not take part to the large file transfer test at the end of DC04
• in-depth investigation in progress most problematic items successfully being addressed in SRB v.3
MCat problems..
GridKA Tier-1
Tier-2’s: real-time data analysis
• Tier-2’s involved in DC04: CIEMAT referring to the PIC T1 and Legnaro referring to the CNAF T1 (LCG-2 chain); UFL and Caltech referring to the FNAL T1 (SRM chain)
• LCG-2 chain: automatic procedures advertise to analysts that new data have become available on the T1 and T2 disk-SEs (identifying complete file sets proved difficult); job submission is then automatically triggered via the Resource Broker
– jobs run at a site close to the data, access files via rfio, register their output in the RLS, …
– >15k jobs submitted over about 2 weeks via LCG-2 ran through the system
– real-time data analysis at PIC measured a median delay of ~20 minutes between files being ready for distribution at the T0 and analysis jobs being submitted at the T1
• SRM chain: the FNAL T1 deployed a MySQL POOL catalogue to enable access to the DC data transferred to the US; for a few days data access was attempted through dCache via a ROOT plug-in, allowing COBRA-based applications to access the data
– software environment based on access to the applications over AFS at CERN
– the high number of small files made it logistically difficult to find the needed files: stored by date on tape, many stages were required to complete a file set
see also [ id136 ses9 tr5 ]
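The LCG-2 real-time analysis trigger (advertise newly arrived file sets, then automatically submit jobs via the Resource Broker) can be sketched as below. All names, dataset labels and timestamps are invented; only the ~20 minute median delay echoes the PIC measurement quoted above.

```python
import statistics

def new_file_sets(advertised, already_processed):
    """Toy 'advertisement' step: pick out unprocessed file sets (in DC04,
    identifying *complete* file sets was the hard part)."""
    return [fs for fs in advertised if fs not in already_processed]

def submit_analysis(file_set, submit):
    """Toy Resource Broker submission: the real jobs ran at a site close
    to the data, read files via rfio and registered output in the RLS."""
    return submit(file_set)

processed = {"sampleA.set1"}  # invented dataset names
fresh = new_file_sets(["sampleA.set1", "sampleA.set2"], processed)
jobs = [submit_analysis(fs, submit=lambda f: "job-for-" + f) for fs in fresh]
print(jobs)  # ['job-for-sampleA.set2']

# Median T0-ready -> T1-submission delay, recomputed from invented
# per-file timestamps (minutes), matching the ~20 min figure at PIC:
ready, submitted = [0, 5, 10], [18, 27, 30]
delays = [s - r for r, s in zip(ready, submitted)]
print(statistics.median(delays))  # 20
```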
An example: replica to disk-SEs
[Plots for a single day, Apr 19th: eth I/O input from the SE-EB at the CNAF T1 disk-SE; eth I/O input from the Castor-SE at the CNAF T1 Castor-SE; eth I/O input from the Castor-SE at the Legnaro T2 disk-SE]
Summary and Conclusions
The full chain was demonstrated to be feasible, but only for a limited amount of time.
• Tier-0:
– reconstruction/data-transfer/analysis may run at 25 Hz
– 2200 running jobs/day (on ~500 CPUs), 4 MB/s produced and distributed to each Tier-1, 0.4 files/s registered in the RLS (with POOL metadata)
• Tier-1’s: different Tier-1 performances, related to operational choices
– key items were raised and addressed; e.g. good overall performance of the LCG-2 chain (among others) throughout the DC
• main areas for future improvement have been identified:
– Reduce the number of files (i.e. increase <#events>/<#files>):
• more efficient use of the bandwidth
• the fixed “start-up” time dominates command execution times (e.g. Java in replica operations)
• address the scalability of MSS systems
– Better organize in advance, foreseeing what the real working scenarios will be:
• avoid working in an “always-reacting-to-something” mode
• avoid conditions of “statistical debugging” on too many files in problematic states
• Real-time analysis at the Tier-2’s was demonstrated to be possible
– the time window between reco data availability and the start of analysis jobs can be reasonably low
– … but a clean environment is needed.
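The Tier-0 figures quoted in the summary make the small-file lesson concrete; a quick check using only the numbers above (4 MB/s per Tier-1, 0.4 files/s into the RLS):

```python
# Figures from the DC04 summary slide:
rate_mb_per_s = 4.0   # data produced and distributed to each Tier-1
files_per_s = 0.4     # registrations into the RLS (with POOL metadata)

avg_file_mb = rate_mb_per_s / files_per_s  # implied average file size
files_per_day = files_per_s * 86_400       # catalogue registrations per day

print(avg_file_mb)         # 10.0 -> the average file is only ~10 MB
print(int(files_per_day))  # 34560 registrations per day
```

An average file of only ~10 MB, tens of thousands of times a day, is exactly the regime in which per-file authentication, catalogue and tape-staging overheads dominated, motivating the "increase <#events>/<#files>" recommendation.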
Full authors list
T. Barras, S. Metson, Bristol University, United Kingdom
J. Andreeva, W. Jank, N. Sinanis, CERN, Switzerland
N. Colino, P. Garcia-Abia, J. M. Hernandez, F. J. Rodriguez-Calonge, CIEMAT, Madrid, Spain
M. Ernst, DESY, Germany
A. Anzar, L. Bauerdick, I. Fisk, R. Harris, Y. Wu, FNAL, Batavia, USA
G. Quast, K. Rabbertz, J. Rehn, Karlsruhe University, Germany
N. De Filippis, G. Donvito, G. Maggi, INFN-Bari, Italy
P. Capiluppi, A. Fanfani, C. Grandi, INFN-Bologna, Italy
D. Bonacorsi, A. Chierici, L. Dell’Agnello, G. LoRe, B. Martelli, P. Ricci, F. Rosso, F. Ruggieri, INFN-CNAF, Italy
M. Biasotto, S. Fantinel, INFN-Legnaro, Italy
M. Corvo, F. Fanzago, M. Mazzucato, INFN-Padova, Italy
C. Charlot, P. Mine', I. Semeniouk, LLR-Ecole Polytechnique, CNRS&IN2P3, France
L. Tuura, Northeastern University, Boston, USA
M. Delfino, F. Martinez, G. Merino, A. Pacheco, M. Rodriguez, PIC, Barcelona, Spain
D. Stickland, T. Wildish, Princeton University, USA
D. Newbold, C. Shepherd-Themistocleous, RAL, United Kingdom
A. Nowack, RWTH Aachen, Germany