Role of Tier-0, Tier-1 and Tier-2 Regional Centers during CMS DC04
D. Bonacorsi (CNAF-INFN Bologna, Italy), on behalf of the CMS Collaboration
CHEP’04, Interlaken (Sept 27th – Oct 1st, 2004), D. Bonacorsi (CNAF-INFN, Italy) [ id498 ses9 tr5 ]
Outline
• Introductory overview of the CMS Pre-Challenge Production (PCP) and the CMS Data Challenge (DC04): ideas, strategies, key points (main focus on Regional Centers, RCs)
• Role of RCs in the data distribution infrastructure; description of the distinct scenarios deployed and tested in DC04
• Successes, failures, experience gained, issues raised
• Summary and conclusions
CMS PCP-DC04 overview
• Pre-Challenge Production: PCP (Jul. 03 – Feb. 04)
– Simulation and digitization of the data samples needed as input for the DC
– PCP strategy: mainly non-grid productions, but also grid prototypes (CMS/LCG-0, LCG-1, Grid3)
– ~70M Monte Carlo events produced (20M with Geant4): 750K jobs, 3500 KSI2000 months, 80 TB of data
• Goal: validation of the CMS computing model on a sufficient number of Tier-0/1/2 sites, as a large-scale test of the computing and analysis models
• Data Challenge: DC04 (Mar. – Apr. 04)
– Reconstruction and analysis of CMS data sustained over 2 months at 5% of the LHC rate at full luminosity (25% of the start-up luminosity)
– Data distribution to Tier-1 and Tier-2 sites
– DC strategy:
• sustain a 25 Hz reconstruction rate in the Tier-0 farm
• register data and metadata in a world-readable catalogue
• transfer reconstructed data from the Tier-0 to the Tier-1 centers
• analyze reconstructed data at the Tier-1/2’s as they arrive
• monitor and archive resource and process information
Aimed at demonstrating the feasibility of the full chain.
[Diagram: PCP covers Generation, Simulation and Digitization; DC04 covers Reconstruction and Analysis]
Global DC04 layout and data distribution infrastructure
[Diagram: the Tier-0 (Castor MSS, IB and GDB pools, fake on-line process, RefDB, POOL RLS catalogue, TMDB, ORCA RECO jobs, Tier-0 data distribution agents, Export Buffers, LCG-2 services) feeds the Tier-1’s (Tier-1 agent, T1 storage, MSS, ORCA analysis and Grid jobs); each Tier-1 serves Tier-2’s (T2 storage, ORCA local jobs) where physicists work]
DC04 key points and Regional Centers involvement
• Maximize reconstruction efficiency at the Tier-0
• Automatic registration and distribution of data via a set of loosely coupled agents running at the Tier-1’s: key role of the Transfer Management DB (TMDB) for inter-agent communication
• Support a (reasonable) variety of data transfer strategies (and MSS): LCG-2 Replica Manager (CNAF and PIC T1’s, with LCG-2 Castor-SE); native SRM (FNAL T1, with dCache+Enstore); SRB (RAL, IN2P3, GridKA T1’s, with Castor, HPSS, …); this results in 3 distinct T0 → T1 distribution chains
• Use a single global file catalogue (accessible from all Tier-1’s): RLS used for data and metadata (POOL) by all transfer tools
• Redundant monitoring/archiving of information on resources and processes: MonALISA for global monitoring of the network and all CPU resources, LEMON for dedicated monitoring of the DC04 Tier-0 resources, GridICE for monitoring of all LCG resources
• Grant data access at the Tier-2’s for “real-time data analysis”
see also [ id162 ses7 tr4 ]
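The loosely coupled agents communicating only through the TMDB can be sketched as a polling loop over a shared table. This is a toy sketch, not the real system: the table layout, column names and state values are invented (the actual TMDB schema is not described here), and SQLite stands in for the real database.

```python
import sqlite3

# Toy stand-in for the Transfer Management DB (TMDB): agents never talk
# to each other directly, they only claim and update rows in a shared table.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE transfers (
    guid TEXT PRIMARY KEY,
    destination TEXT,           -- e.g. 'CNAF', 'PIC', 'FNAL'
    state TEXT DEFAULT 'at_EB'  -- at_EB -> transferring -> done
)""")

def tier0_agent(guids, destination):
    """Tier-0 side: advertise files sitting in the Export Buffer."""
    for g in guids:
        db.execute("INSERT OR IGNORE INTO transfers (guid, destination) VALUES (?, ?)",
                   (g, destination))
    db.commit()

def tier1_agent(site, copy_file):
    """Tier-1 side: poll the TMDB, claim pending files, mark them done."""
    rows = db.execute(
        "SELECT guid FROM transfers WHERE destination = ? AND state = 'at_EB'",
        (site,)).fetchall()
    for (guid,) in rows:
        db.execute("UPDATE transfers SET state = 'transferring' WHERE guid = ?", (guid,))
        copy_file(guid)   # replica-manager / srmcp / Sreplicate in the real chains
        db.execute("UPDATE transfers SET state = 'done' WHERE guid = ?", (guid,))
    db.commit()
    return [g for (g,) in rows]

tier0_agent(["guid-001", "guid-002"], "CNAF")
moved = tier1_agent("CNAF", copy_file=lambda g: None)  # no real transfer in this toy
print(moved)  # ['guid-001', 'guid-002']
```

The point of the design is visible even in the toy: the Tier-0 and Tier-1 agents share no code path and no direct connection, so any one of them can crash and restart without stalling the others.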
Hierarchy of RCs in DC04 and data distribution chains
[Diagram: the CERN Tier-0 feeds CNAF (Italy) and PIC (Spain) via the LCG-2 RM chain, FNAL (USA) via the SRM chain, and RAL (UK), GridKA (Germany) and IN2P3 (France) via the SRB chain; Tier-2’s: Legnaro under CNAF, CIEMAT under PIC, UFL and Caltech under FNAL]
Tier-0
Systems
• LSF batch system: 3 dedicated racks, 44 nodes each (264 CPUs total), dual P-IV Xeon 2.4 GHz, 1 GB memory, 100baseT
• Dedicated cmsdc04 batch queue, 500 RUN slots
• Disk servers: DC04-dedicated stager with 2 pools (IB and GDB, 10 + 4 TB)
Export Buffers
• EB-SRM (4 servers, 4.2 TB total)
• EB-SRB (4 servers, 4.2 TB total)
• EB-SE (3 servers, 3.1 TB total)
Databases
• RLS (Replica Location Service)
• TMDB (Transfer Management DB)
Transfer steering
• Agents steering data transfers run on a dedicated node (for close monitoring)
Monitoring services
[Diagram: Tier-0 architecture built on Castor, the IB and GDB pools, the fake on-line process, RefDB, the POOL RLS catalogue, the TMDB, ORCA RECO jobs, the data distribution agents and the Export Buffers]
The LCG-2 chain (1/2)
[Diagram: at the CERN Tier-0, Castor feeds a disk-SE Export Buffer; the RM data distribution agent, the RLS and the TMDB drive replication to the Tier-1 (Tier-1 agent, Castor-SE backed by Castor) and on to Tier-1/Tier-2 disk-SEs]
• involved Tier-1’s: CNAF and PIC
Principle: data replication between LCG-2 SEs
Set-up:
– Tier-0: 1 EB, a classic disk-based LCG-2 SE (3 SE machines with 1 TB each)
– Tier-1’s: a Castor-SE receiving the data, but with different underlying MSS hardware solutions
Strategy comparison:
– CNAF: Replica Manager CLI (+ LRC C++ API for listing replicas only): copies a file and inherently registers it in the RLS, with file-size info stored in the LRC (safer against failed replicas, at the cost of the overhead introduced by the CLI Java processes)
– PIC: globus-url-copy + LRC C++ API: copies a file and registers it in the RLS later, with no file-size check (faster! but with no quality check of replica operations)
see also [ id497 ses9 tr5 ]
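The trade-off between the two strategies (CNAF's atomic copy-and-register versus PIC's faster copy with deferred registration) can be illustrated with a toy sketch; every function here is an invented stand-in for the real tools (the LCG-2 Replica Manager CLI, globus-url-copy, the LRC C++ API), not their actual interfaces.

```python
# Toy RLS: guid -> list of replica URLs. All names invented for illustration.
catalogue = {}

def gridftp_copy(src, dst):
    """Pretend transfer; in DC04 both strategies moved data with GridFTP."""
    return True

def copy_and_register(guid, src, dst):
    """CNAF-style: copy and RLS registration are one operation, so the
    catalogue can never point at a replica that failed to arrive (but every
    call pays the start-up cost of a CLI Java process)."""
    if gridftp_copy(src, dst):
        catalogue.setdefault(guid, []).append(dst)

def copy_then_register(guid, src, dst):
    """PIC-style: faster, but the copy and the later registration can get
    out of step on failures, and there is no file-size cross-check."""
    gridftp_copy(src, dst)
    # ... some time later, registration happens unconditionally:
    catalogue.setdefault(guid, []).append(dst)

copy_and_register("guid-A", "sfn://eb.cern.ch/f1", "sfn://castor.cnaf.it/f1")
copy_then_register("guid-B", "sfn://eb.cern.ch/f2", "sfn://castor.pic.es/f2")
print(sorted(catalogue))  # ['guid-A', 'guid-B']
```

In the toy both paths succeed; the difference only shows up when `gridftp_copy` fails, where the PIC-style path would still leave an entry in the catalogue.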
The LCG-2 chain (2/2)
• both the CNAF and PIC approaches achieved good performance
– T1 agents were robust and kept pace with the data available at the EB
– network ‘stress test’ at the end of DC04 with ‘big’ files: typical transfer rates >30 MB/s, CNAF sustained >42 MB/s for some hours
• dealing with too many small files (a DC issue affecting all distribution chains) is “bad” for efficient use of the bandwidth and for the scalability of MSS systems
[Plots (CNAF T1 network monitoring): eth I/O of the SE-EB of the LCG chain at the CERN Tier-0, of the CNAF Tier-1 Castor-SE and of the CNAF Tier-1 classic disk-SE; >3k files, >750 GB, ~340 Mbps]
The SRM chain (1/2)
• involved Tier-1: FNAL
Principle: SRM transactions to receive TURLs from the EB, transfers via GridFTP
Set-up:
– Tier-0: an SRM/dCache-based DRM serving as EB; files are staged out of Castor to the dCache disk pool and pinned until transferred
– Tier-1: an SRM/dCache/Enstore-based HRM acting as Import Buffer, with the SRM interface providing access to Enstore via dCache
[Diagram: an SRM-COPY issued from the SRM client on the T1 agent machine triggers space reservation and write at FNAL and stage/pin at CERN; SRM-GET returns a TURL one file at a time; a GridFTP GET in pull mode moves the data over the network from the CERN T0 EB dCache pool to the FNAL T1 dCache pool and on to Enstore]
see [ id190 ses7 tr4 ]
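The per-file handshake of the SRM chain (an SRM-GET returning a TURL, then a GridFTP pull) can be sketched as follows. All function names are illustrative stand-ins, not the actual SRM client API, and the URLs are invented examples.

```python
def srm_get(surl):
    """Toy SRM-GET: the source stages and pins the file, then hands back
    a transfer URL (TURL), one file at a time as in DC04."""
    return surl.replace("srm://", "gsiftp://")

def gridftp_get(turl, local_path):
    """Toy GridFTP GET in pull mode: in DC04 this moved data from the
    CERN EB dCache pool into the FNAL dCache pool (and on to Enstore)."""
    return (turl, local_path)

def srm_copy(surls, dest_dir):
    """Toy SRM-COPY driven from the T1 agent machine: request a TURL,
    then pull the file, for every file in the request."""
    done = []
    for surl in surls:
        turl = srm_get(surl)  # source stages and pins the file
        done.append(gridftp_get(turl, dest_dir + "/" + surl.rsplit("/", 1)[-1]))
    return done

transfers = srm_copy(["srm://eb.cern.ch/dc04/file1.root",
                      "srm://eb.cern.ch/dc04/file2.root"], "/pnfs/fnal/dc04")
print(len(transfers))  # 2
```

The one-TURL-at-a-time loop also makes the small-file problem discussed on the next slide concrete: every file pays the full handshake cost regardless of its size.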
[Plot: number of transferred files vs date, 1-Mar-2004 to 26-Apr-2004, scale up to 20000 files]
The SRM chain (2/2)
• in general quite robust tools: e.g. SRM for error checking/retrying, dCache for automatic migration to tape, …
• stressed a few software/hardware components to the breaking point: e.g. monitoring was not implemented to catch service failures, forcing manual interventions
• again, problems from the high number/small size of the DC files:
– the use of multiple streams, with multiple files in each stream, reduced the overhead of the authentication process
– MSS optimization was necessary to handle the challenge load
– inefficient use of tapes forced the allocation of more tapes plus the deployment of a larger namespace service
• relevant improvements during ongoing DC operations: e.g. a reduction of the delegated proxy’s modulus size in SRM yielded a factor 3.5 speed-up of the interaction between SRM client and server
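Why batching many files into each authenticated stream helps can be shown with simple arithmetic; all the numbers below (per-stream authentication cost, file size, bandwidth) are invented for illustration, not DC04 measurements.

```python
def transfer_time(n_files, file_mb, bw_mbps, auth_s, files_per_stream):
    """Total time = one authentication per stream + payload time.
    Illustrative model only: ignores protocol overhead and contention."""
    n_streams = -(-n_files // files_per_stream)  # ceiling division
    payload_s = n_files * file_mb * 8 / bw_mbps  # MB -> Mbit, then / Mbps
    return n_streams * auth_s + payload_s

# 10,000 small files of 10 MB over a 300 Mbps link, 5 s per authentication:
one_per_stream  = transfer_time(10_000, 10, 300, 5, files_per_stream=1)
batched_streams = transfer_time(10_000, 10, 300, 5, files_per_stream=100)
print(round(one_per_stream))   # 52667 s: dominated by authentication
print(round(batched_streams))  # 3167 s: authentication cost amortized
```

With one file per stream the fixed per-connection cost dwarfs the actual payload time, which is exactly the small-file pathology the slide describes.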
The SRB chain (1/2)
• involved Tier-1’s: GridKA, IN2P3, RAL
Principle: use SRB to transfer files to the local MSS with consistent catalogue info
Set-up:
– Tier-0: an SRB EB; files are copied from Castor to the EB machine, then ‘inserted’ into the SRB virtual space (both data and metadata)
– Tier-1’s: one SRB IB at each site; data replication with SRB commands, i.e. Sreplicate or Sget/Sput
• a GMCat component developed in the UK links the SRB namespaces by periodically publishing SRB replica info into the RLS at CERN
• again, problems from the high number/small size of the DC files: troublesome injection of the initial entries into the SRB EB at the T0; unexpected inefficiencies of SRB commands on small files
• reasonable T0 → T1 transfer rates: e.g. IN2P3 averaged ~30 Mbps and sustained 80 Mbps for some hours, mainly limited by the small file sizes
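The SRB flow (insert into the virtual space at the T0, replicate to a Tier-1, then have GMCat publish replica info into the RLS) can be modelled with a toy sketch; `Sput` and `Sreplicate` are real SRB command names mentioned above, but everything in this code, including their behaviour, is an invented simplification.

```python
# Toy model of the SRB chain: a virtual namespace seeded at the T0 EB,
# per-site replicas, and a GMCat-like step mirroring replica info to the RLS.
srb_namespace = {}  # logical name -> set of sites holding a replica
rls = {}            # toy RLS at CERN: logical name -> sorted list of sites

def s_put(lfn):
    """Stand-in for 'Sput': insert a file into the SRB virtual space at the T0 EB."""
    srb_namespace.setdefault(lfn, set()).add("T0-EB")

def s_replicate(lfn, site):
    """Stand-in for 'Sreplicate': copy an already-registered file to a Tier-1 IB."""
    if lfn in srb_namespace:
        srb_namespace[lfn].add(site)

def gmcat_publish():
    """GMCat-like step: periodically publish SRB replica info into the RLS."""
    for lfn, sites in srb_namespace.items():
        rls[lfn] = sorted(sites)

s_put("dc04/hits/file1")
s_replicate("dc04/hits/file1", "IN2P3")
gmcat_publish()
print(rls["dc04/hits/file1"])  # ['IN2P3', 'T0-EB']
```

Note that the RLS only learns about replicas at the next periodic publication, so between `s_replicate` and `gmcat_publish` the global catalogue lags the SRB namespace.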
The SRB chain (2/2)The SRB chain (2/2)• while successful in PCP (that’s why some sites chose it), SRB showed unexpected poor performance in DC04
• severely hampered by technical issues: MCat single point of failure: unusability of metadata catalogue at RAL
Loss of performance, long time queries causing transfer commands to timeout, core-dumps.. several annoyances in both client/server sw of SRB v.2 used in DC04
SRB commands return code not reliable, hard to cleanly kill on-going Sreplicate processes, ...
• its use was stopped before official end of DC04
T1’s of the SRB chain did not take part to the large file transfer test at the end of DC04
• in-depth investigation in progress most problematic items successfully being addressed in SRB v.3
MCat problems..
GridKA Tier-1
Tier-2’s: real-time data analysis
• Tier-2’s involved in DC04: CIEMAT referring to the PIC T1 and Legnaro referring to the CNAF T1 (LCG-2 chain); UFL and Caltech referring to the FNAL T1 (SRM chain)
• LCG-2 chain: automatic procedures advertise to analysts that new data have become available on the T1 and T2 disk-SEs (identifying complete file sets proved difficult); job submission is then automatically triggered via the Resource Broker
– jobs run at a site close to the data, access files via rfio, register their output in the RLS, …
– >15k jobs submitted over about 2 weeks via LCG-2 ran through the system
– real-time data analysis at PIC measured a median delay of ~20 minutes between files being ready for distribution at the T0 and analysis jobs being submitted at the T1
• SRM chain: the FNAL T1 deployed a MySQL POOL catalogue to enable access to the DC data transferred to the US; for a few days data access was attempted through dCache via a ROOT plug-in, allowing COBRA-based applications to access the data
– software environment based on access to the applications over AFS at CERN
– the high number of small files made it logistically difficult to find the needed files: stored by date on tape, many stages were required to complete a file set
see also [ id136 ses9 tr5 ]
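The LCG-2 real-time analysis trigger (advertise newly arrived file sets, then automatically submit jobs via the Resource Broker) can be sketched as below. All names, dataset labels and timestamps are invented; only the ~20 minute median delay echoes the PIC measurement quoted above.

```python
import statistics

def new_file_sets(advertised, already_processed):
    """Toy 'advertisement' step: pick out unprocessed file sets (in DC04,
    identifying *complete* file sets was the hard part)."""
    return [fs for fs in advertised if fs not in already_processed]

def submit_analysis(file_set, submit):
    """Toy Resource Broker submission: the real jobs ran at a site close
    to the data, read files via rfio and registered output in the RLS."""
    return submit(file_set)

processed = {"sampleA.set1"}  # invented dataset names
fresh = new_file_sets(["sampleA.set1", "sampleA.set2"], processed)
jobs = [submit_analysis(fs, submit=lambda f: "job-for-" + f) for fs in fresh]
print(jobs)  # ['job-for-sampleA.set2']

# Median T0-ready -> T1-submission delay, recomputed from invented
# per-file timestamps (minutes), matching the ~20 min figure at PIC:
ready, submitted = [0, 5, 10], [18, 27, 30]
delays = [s - r for r, s in zip(ready, submitted)]
print(statistics.median(delays))  # 20
```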
An example: replica to disk-SEs
[Plots for a single day, Apr 19th: eth I/O input from the SE-EB at the CNAF T1 disk-SE; eth I/O input from the Castor-SE at the CNAF T1 Castor-SE; eth I/O input from the Castor-SE at the Legnaro T2 disk-SE]
Summary and Conclusions
The full chain was demonstrated to be feasible, but only for a limited amount of time.
• Tier-0:
– reconstruction/data-transfer/analysis may run at 25 Hz
– 2200 running jobs/day (on ~500 CPUs), 4 MB/s produced and distributed to each Tier-1, 0.4 files/s registered in the RLS (with POOL metadata)
• Tier-1’s: different Tier-1 performances, related to operational choices
– key items were raised and addressed; e.g. good overall performance of the LCG-2 chain (among others) throughout the DC
• main areas for future improvement have been identified:
– Reduce the number of files (i.e. increase <#events>/<#files>):
• more efficient use of the bandwidth
• the fixed “start-up” time dominates command execution times (e.g. Java in replica operations)
• address the scalability of MSS systems
– Better organize in advance, foreseeing what the real working scenarios will be:
• avoid working in an “always-reacting-to-something” mode
• avoid conditions of “statistical debugging” on too many files in problematic states
• Real-time analysis at the Tier-2’s was demonstrated to be possible
– the time window between reco data availability and the start of analysis jobs can be reasonably low
– … but a clean environment is needed.
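The Tier-0 figures quoted in the summary make the small-file lesson concrete; a quick check using only the numbers above (4 MB/s per Tier-1, 0.4 files/s into the RLS):

```python
# Figures from the DC04 summary slide:
rate_mb_per_s = 4.0   # data produced and distributed to each Tier-1
files_per_s = 0.4     # registrations into the RLS (with POOL metadata)

avg_file_mb = rate_mb_per_s / files_per_s  # implied average file size
files_per_day = files_per_s * 86_400       # catalogue registrations per day

print(avg_file_mb)         # 10.0 -> the average file is only ~10 MB
print(int(files_per_day))  # 34560 registrations per day
```

An average file of only ~10 MB, tens of thousands of times a day, is exactly the regime in which per-file authentication, catalogue and tape-staging overheads dominated, motivating the "increase <#events>/<#files>" recommendation.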
Full authors list
T. Barras, S. Metson, Bristol University, United Kingdom
J. Andreeva, W. Jank, N. Sinanis, CERN, Switzerland
N. Colino, P. Garcia-Abia, J. M. Hernandez, F. J. Rodriguez-Calonge, CIEMAT, Madrid, Spain
M. Ernst, DESY, Germany
A. Anzar, L. Bauerdick, I. Fisk, R. Harris, Y. Wu, FNAL, Batavia, USA
G. Quast, K. Rabbertz, J. Rehn, Karlsruhe University, Germany
N. De Filippis, G. Donvito, G. Maggi, INFN-Bari, Italy
P. Capiluppi, A. Fanfani, C. Grandi, INFN-Bologna, Italy
D. Bonacorsi, A. Chierici, L. Dell’Agnello, G. LoRe, B. Martelli, P. Ricci, F. Rosso, F. Ruggieri, INFN-CNAF, Italy
M. Biasotto, S. Fantinel, INFN-Legnaro, Italy
M. Corvo, F. Fanzago, M. Mazzucato, INFN-Padova, Italy
C. Charlot, P. Mine', I. Semeniouk, LLR-Ecole Polytechnique, CNRS&IN2P3, France
L. Tuura, Northeastern University, Boston, USA
M. Delfino, F. Martinez, G. Merino, A. Pacheco, M. Rodriguez, PIC, Barcelona, Spain
D. Stickland, T. Wildish, Princeton University, USA
D. Newbold, C. Shepherd-Themistocleous, RAL, United Kingdom
A. Nowack, RWTH Aachen, Germany