Page 1: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Towards the operations of the Italian Tier-1 for CMS:
lessons learned from the CMS Data Challenge

D. Bonacorsi (on behalf of INFN-CNAF Tier-1 staff and the CMS experiment)

ACAT 2005 - X Int. Workshop on Advanced Computing & Analysis Techniques in Physics Research
May 22nd-27th, 2005 - DESY, Zeuthen, Germany

Page 2: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Outline

The past: the CMS operational environment during the Data Challenge
- focus on INFN-CNAF Tier-1 resources and set-up
The present: the lessons learned from the challenge
The future: try to apply what we (think we) learned…

Page 3: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

The INFN-CNAF Tier-1

Located at the INFN-CNAF centre in Bologna (Italy): the computing facility for the INFN HENP community
- one of the main nodes of the GARR network

Multi-experiment Tier-1: LHC experiments + AMS, Argo, BaBar, CDF, Magic, Virgo, …
- evolution: dynamic sharing of resources among the involved experiments

CNAF is a relevant Italian site from a Grid perspective
- participating in the LCG, EGEE and INFN-GRID projects
- support to R&D activities, development/testing of prototypes/components
- "traditional" (non-Grid) access to resources is also granted, but it is more 'manpower-consuming'

Page 4: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Tier-1 resources and services

Computing power
- CPU farms for ~1300 kSI2k + a few dozen servers
- dual-processor boxes [320 @ 0.8-2.4 GHz, 350 @ 3 GHz], hyper-threading activated

Storage
- on-line data access (disks): IDE, SCSI, FC; 4 NAS systems [~60 TB], 2 SAN systems [~225 TB]
- custodial task on MSS (tapes in the Castor HSM system):
  • STK L180 library: overall ~18 TB
  • STK 5500 library: 6 LTO-2 drives [~240 TB] + 2 9940B drives [~136 TB] (more to be installed)

Networking
- T1 LAN: rack FE switches with 2x Gbps uplinks to the core switch (disk servers via GE to the core); an upgrade to rack Gb switches is foreseen
- 1 Gbps T1 link to the WAN (+1 Gbps reserved for the Service Challenge); will be 10 Gbps [Q3 2005]

More: infrastructure (electric power, UPS, etc.); system administration, database services administration, etc.; support to experiment-specific activities; coordination with the Tier-0, the other Tier-1's, and the Tier-n's (n>1)

Page 5: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

The CMS Data Challenge: what and how

CMS Pre-Challenge Production (PCP): Generation, Simulation, Digitization
- up to digitization (needed as input for the DC)
- mainly non-grid productions…
  • …but also grid prototypes (CMS/LCG-0, LCG-1, Grid3)
- ~70M Monte Carlo events (20M with Geant4) produced, 750K jobs run, 3500 kSI2k months, 80 TB of data

CMS Data Challenge (DC04): Reconstruction, Analysis
- reconstruction and analysis on CMS data, sustained over 2 months at 5% of the LHC rate at full luminosity (25% of the start-up luminosity):
  • sustain a 25 Hz reconstruction rate in the Tier-0 farm
  • register data and metadata to a world-readable catalogue
  • distribute the reconstructed data from the Tier-0 to the Tier-1/2's
  • analyze the reconstructed data at the Tier-1/2's as they arrive
  • monitor/archive information on resources and processes
- not a CPU challenge: aimed at demonstrating the feasibility of the full chain
- validate the CMS computing model on a sufficient number of Tier-0/1/2's: a large-scale test of the computing/analysis models

Page 6: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

PCP set-up: a hybrid model

[Diagram of the PCP hybrid set-up: a Physics Group asks for a new dataset; the Production Manager defines assignments in RefDB; a Site Manager starts an assignment with McRunjob, using dataset and job metadata (BOSS DB, with job-level and data-level queries). Submission paths: shell scripts to a local batch manager on a computer farm; JDL to the LCG-x scheduler on the Grid (LCG), with the RLS; a DAG to DAGMan (MOP), Chimera VDL, the Virtual Data Catalogue and a Planner on Grid3. Legend: push data or info vs. pull info.]

Page 7: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

PCP grid-based prototypes

EU-CMS: submit to the LCG scheduler, via the CMS-LCG "virtual" Regional Center
- 0.5 Mevts Generation ["heavy" Pythia] (~2000 jobs of ~8 hours* each, ~10 kSI2k months)
- ~2.1 Mevts Simulation [CMSIM+OSCAR] (~8500 jobs of ~10 hours* each, ~130 kSI2k months)
- ~2 TB of data
- CMSIM: ~1.5 Mevts on CMS/LCG-0; OSCAR: ~0.6 Mevts on LCG-1
(* on a PIII 1 GHz)

Strong INFN contribution to the crucial PCP production, in both "traditional" and grid-based production
- constant integration work in CMS between the CMS software/production tools and the evolving EDG-X / LCG-Y middleware
- in several phases: CMS "Stress Test" with EDG < 1.4, then PCP on the CMS/LCG-0 testbed, PCP on LCG-1, … towards DC04 with LCG-2

INFN/CMS share of the CMS production steps [%]:
- Generation: 13%
- Simulation: 14%
- ooHit formatting: 21%
- Digitisation: 18%

Page 8: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Global DC04 layout and workflow

[Diagram. Tier-0: a fake on-line process and RefDB drive the ORCA RECO jobs; data go to the Castor MSS through the IB and GDB buffers, are registered in the POOL RLS catalogue and in the TMDB, and are exported through disk-SE Export Buffers (EBs) by the T0 data distribution agents. Tier-1's: T1 data distribution agents feed a T1 Castor-SE (with Castor MSS behind) and a T1 disk-SE. Tier-2's: a T2 disk-SE, where physicists run ORCA jobs through the LCG-2 services.]

Hierarchy of Regional Centers and data distribution chains: 3 distinct scenarios deployed and tested.

Page 9: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

INFN-specific DC04 workflow

[Diagram: data flow from the disk-SE Export Buffer to the CNAF T1, where the TRA-Agent writes into the T1 Castor SE (backed by the LTO-2 tape library), the REP-Agent replicates to the T1 disk-SE and to the Legnaro T2 disk-SE, and the SAFE-Agent tracks the safely-stored state; the agents query/update the Transfer Management DB and a local MySQL database.]

Basic issues addressed at the T1:
- data movement T0 → T1
- data custodial task: interface to the MSS
- data movement T1 → T2 for "real-time analysis"

Page 10: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

An example: data flow during just one day of DC04 (April 19th)

[Monitoring plots for a single day: ethernet I/O into the CNAF T1 Castor SE (from the SE-EB), ethernet I/O into the CNAF T1 disk-SE (from the Castor SE), TCP connections and RAM memory, and ethernet I/O into the Legnaro T2 disk-SE (from the Castor SE).]

Page 11: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

DC04 outcome (grand summary + focus on INFN T1)

Reconstruction / data transfer / analysis may run at 25 Hz
- automatic registration and distribution of data; key role of the TMDB (it was the embryonic PhEDEx!)
- support for a (reasonable) variety of different data transfer tools and set-ups
  • Tier-1's: different performances, related to operational choices; SRB, LCG Replica Manager and SRM investigated (see the CHEP04 talk)
  • INFN T1: good performance of the LCG-2 chain (PIC T1 as well)

Register all data and metadata (POOL) to a world-readable catalogue
- RLS: good as a global file catalogue, bad as a global metadata catalogue

Analyze the reconstructed data at the Tier-1's as the data arrive
- LCG components: dedicated bdII+RB; UIs, CEs+WNs at CNAF and PIC
- real-time analysis at the Tier-2's was demonstrated to be possible
- ~15k jobs submitted; the time window between reco data availability and the start of the analysis jobs can be reasonably low (i.e. ~20 mins)

Lessons:
- reduce the number of files (i.e. increase <#events>/<#files>): more efficient use of bandwidth, less command overhead (a toy estimate is sketched below)
- address the scalability of MSS systems (!)
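The "increase <#events>/<#files>" point can be made concrete with a toy estimate. The sketch below is illustrative only: the per-file overhead value and the events-per-file choices are hypothetical parameters, not DC04 measurements; only the idea (fixed per-file costs get amortized over more events per file) comes from the slide.

```python
# Toy estimate: how the number of files and the total per-file command
# overhead shrink as <#events>/<#files> grows. All parameters are
# illustrative assumptions, not DC04 measurements.

def file_overhead(total_events, events_per_file, per_file_overhead_s=5.0):
    """Return (number of files, total overhead in hours) for a dataset."""
    n_files = -(-total_events // events_per_file)   # ceiling division
    return n_files, n_files * per_file_overhead_s / 3600.0

if __name__ == "__main__":
    total_events = 70_000_000   # order of the PCP sample quoted in the slides
    for epf in (50, 500, 5000):
        n, hours = file_overhead(total_events, epf)
        print(f"{epf:>5} events/file -> {n:>9} files, ~{hours:8.1f} h of per-file overhead")
```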

Page 12: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Learn from the DC04 lessons…

Some general considerations may apply:
- although a DC is experiment-specific, maybe its conclusions are not
- an "experiment-specific" problem is better addressed if conceived as a "shared" one in a shared Tier-1
- an experiment DC just provides hints; real work gives insight

Crucial role of the experiments at the Tier-1:
- find weaknesses of the CASTOR MSS system in particular operating conditions
- stress-test the new LSF farm with official CMS production jobs
- test DNS-based load-balancing by serving data for production and/or analysis from the CMS disk-servers
- test new components, newly installed/upgraded Grid tools, etc.
- find bottlenecks and scalability problems in DB services
- give feedback on monitoring and accounting activities
- …

Page 13: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

T1 today: farming. What changed since DC04?

[Plot: number of RUNNING and PENDING jobs vs. time, together with the total number of jobs and the maximum number of slots.]

Analysis: "controlled" and "fake" (DC04) vs. "unpredictable" and "real" (now)
- the T1 provides one full LCG site + 2 dedicated RBs/bdII + support to CRAB users
- interoperability: always an issue, even harder in a transition period
- dealing with ~2-3 sub-farms in use by ~10 experiments (in production); resource-use optimization still to be achieved

Migration in progress:
- OS: RH 7.3 → SLC 3.0.4
- middleware: upgrade to LCG 2.4.0
- WN/server installation and management: LCFGng → Quattor, with LCG-Quattor integration
- batch scheduler: Torque+Maui → LSF 6.0, with separate queues for production/analysis and management of the Grid interfacing (a minimal submission sketch follows)

see [N. De Filippis, session II, day 3]
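As a minimal illustration of the "queues for prod/anal" point, the sketch below routes a job to one of two LSF queues via bsub. The queue names (cms_prod, cms_anal) and the wrapped commands are hypothetical, not the actual CNAF configuration; only the standard bsub -q/-o options are assumed.

```python
# Minimal sketch: submit a job to a production or an analysis LSF queue.
# Queue names and the wrapped commands are hypothetical examples.
import subprocess

def submit(command, analysis=False, log="job.%J.out"):
    queue = "cms_anal" if analysis else "cms_prod"    # assumed queue names
    cmd = ["bsub", "-q", queue, "-o", log, command]   # standard bsub options
    return subprocess.run(cmd, check=True)

if __name__ == "__main__":
    submit("run_orca_reco.sh", analysis=False)      # a production-style job
    submit("run_user_analysis.sh", analysis=True)   # an analysis-style job
```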

Page 14: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

T1 today: storage. What changed since DC04?

Storage issues (1/2): disks
- driven by the requirements of LHC data processing at the Tier-1, i.e. simultaneous access to ~PBs of data from ~1000 nodes at high rate
- main focus on robust, load-balanced, redundant solutions to grant efficient and stable data access to distributed users; namely: "make both software and data accessible from jobs running on WNs"
  • remote access (gridftp) and local access (rfiod, xrootd, GPFS) services, afs/nfs to share the experiments' software on WNs, filesystem tests, specific problem solving in the analysts' daily operations, CNAF participation in SC2/3, etc.
- a SAN approach with a parallel filesystem on top looks promising

Storage issues (2/2): tapes
- CMS DC04 helped to focus some problems: LTO-2 drives not efficiently used by the experiments in production at the T1
  • performance degradation increases as file size decreases
  • hangs on locate/fskip after ~100 non-sequential reads
  • not-full tapes are labelled 'RDONLY' after only 50-100 GB written
- CASTOR performance increases with clever pre-staging of files
  • some reliability achieved only on sequential/pre-staged reading
- solutions?
  • from the HSM software side: fix coming with CASTOR v2 (Q2 2005)?
  • from the HSM hardware side: test 9940B drives in production (see PIC T1)
  • from the experiment side: explore possible solutions
    ▪ e.g. file-merging in coupling the PhEDEx tool to the CMS production system (a toy merging sketch follows)
    ▪ e.g. a pure-disk buffer in front of the MSS, disentangled from CASTOR

see [P.P. Ricci, session II, day 3]
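The file-merging idea mentioned above (fewer, larger files before anything hits the tape system) can be sketched as follows. This is not the actual PhEDEx/production-system coupling, and real CMS event files would need an application-level merge rather than byte concatenation; the function below only illustrates the packing logic. Paths and the ~1 GB target size are assumptions.

```python
# Sketch of the file-merging idea: pack many small files into larger
# "merge files" of roughly a target size before tape migration.
# Paths and the target size are illustrative assumptions.
import os, shutil

def merge_small_files(inputs, out_dir, target_bytes=1_000_000_000):
    """Concatenate input files into merge files of roughly target_bytes each."""
    os.makedirs(out_dir, exist_ok=True)
    merged, out, size = [], None, 0
    for path in inputs:
        if out is None:                              # open a new merge file
            out = open(os.path.join(out_dir, f"merge_{len(merged):04d}.dat"), "wb")
            merged.append(out.name)
            size = 0
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)
        size += os.path.getsize(path)
        if size >= target_bytes:                     # merge file is big enough
            out.close()
            out = None
    if out is not None:
        out.close()
    return merged
```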

Page 15: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

CMS activities at the Tier-1

Current CMS set-up at the Tier-1

[Diagram: the Grid.it/LCG layer (CE, SEs) sits in front of the LSF batch system and the WNs, with the CPUs logically grouped into "core" and "overflow", shared and CMS-local; storage comprises the Castor disk buffer + Castor MSS, production disks, analysis disks and an Import-Export Buffer (SE) served by PhEDEx agents; operations control goes through a gw/UI and a UI for local production, the PhEDEx agents and Grid production/analysis; resources management runs via "control" and remote "access" paths.]

Page 16: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

PhEDEx in CMS

PhEDEx (Physics Experiment Data Export) is used by CMS as the overall infrastructure for data transfer management: allocation and transfer of CMS physics data among the Tier-0/1/2's
- different datasets move on bidirectional routes among the Regional Centers
- data should reside on SEs (e.g. gsiftp or srm protocols)

Components:
- the TMDB from DC04 (files, topology, subscriptions, …)
- a coherent set of software agents, loosely coupled, inter-operating and communicating through the TMDB blackboard
  • e.g. agents for data allocation (based on site data subscriptions), file import/export, migration to MSS, routing (based on the implemented topologies), monitoring, etc. (a minimal agent sketch follows)

INFN T1 involvement: mainly on data transfer… and on production/analysis.

Born, and growing fast: >70 TB known to PhEDEx, >150 TB total replicated.
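To illustrate the "loosely coupled agents around a TMDB blackboard" design (this is not the actual PhEDEx code, which has its own schema and agent framework), here is a minimal sketch: an agent polls a shared table for files assigned to its node, performs the transfer, and records the new state so that other agents can react. The table name, columns, states and the use of sqlite3 are invented for the example.

```python
# Minimal sketch of a blackboard-style transfer agent: poll a shared table
# for work assigned to this node, act on it, record the new state.
# Schema, states and the copy step are invented for illustration.
import sqlite3, time

NODE = "T1_CNAF"

def fake_copy(url):
    return True                                  # stand-in for gridftp/srmcp

def one_pass(db):
    rows = db.execute(
        "SELECT file_id, source_url FROM transfer_tasks "
        "WHERE dest_node = ? AND state = 'assigned'", (NODE,)).fetchall()
    for file_id, src in rows:
        new_state = "done" if fake_copy(src) else "error"
        db.execute("UPDATE transfer_tasks SET state = ? WHERE file_id = ?",
                   (new_state, file_id))
    db.commit()

if __name__ == "__main__":
    db = sqlite3.connect("tmdb_sketch.sqlite")
    db.execute("CREATE TABLE IF NOT EXISTS transfer_tasks "
               "(file_id INTEGER PRIMARY KEY, source_url TEXT, "
               "dest_node TEXT, state TEXT)")
    while True:                                  # agents run as daemons
        one_pass(db)
        time.sleep(60)                           # poll the blackboard periodically
```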

Page 17: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

PhEDEx transfer rates, T0 → INFN T1 (weekly and daily views)

[Plots: CNAF T1 diskserver I/O; rate out of the CERN Tier-0.]

Page 18: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

PhEDEx at INFN

INFN-CNAF is a T1 'node' in PhEDEx
- the CMS DC04 experience was crucial to start up PhEDEx in INFN; the CNAF node has been operational since the beginning

First phase (Q3/4 2004): agent code development + focus on operations: T0 → T1 transfers
- >1 TB/day T0 → T1 demonstrated feasible
  • … but the aim is not to achieve peaks, it is to sustain them in normal operations

Second phase (Q1 2005): PhEDEx deployment in INFN to the Tier-n, n>1: "distributed" topology scenario
- Tier-n agents run at the remote sites, not at the T1: know-how required, T1 support
- already operational at Legnaro, Pisa, Bari, Bologna

Third phase (Q>1 2005): many issues, e.g. stability of the service, dynamic routing, coupling PhEDEx to the CMS official production system, PhEDEx involvement in SC3 phase II, etc.

An example: data flow to the T2's in daily operations (here: a test with ~2000 files, 90 GB, with no optimization): ~450 Mbps CNAF T1 → LNL T2, ~205 Mbps CNAF T1 → Pisa T2 (a back-of-the-envelope rate check is sketched below).
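A quick sanity check on the numbers above, using only the figures quoted on the slide: 1 TB/day corresponds to just over 90 Mbps sustained, and 90 GB at ~450 Mbps takes roughly half an hour. The conversions below assume decimal TB/GB.

```python
# Back-of-the-envelope conversions for the transfer figures quoted above
# (decimal TB/GB assumed).
def tb_per_day_to_mbps(tb):
    return tb * 1e12 * 8 / 86400 / 1e6        # TB per day -> megabit/s

def transfer_minutes(gb, mbps):
    return gb * 1e9 * 8 / (mbps * 1e6) / 60   # GB at a given rate -> minutes

print(f"1 TB/day          ~ {tb_per_day_to_mbps(1):.0f} Mbps sustained")
print(f"90 GB at 450 Mbps ~ {transfer_minutes(90, 450):.0f} min")
print(f"90 GB at 205 Mbps ~ {transfer_minutes(90, 205):.0f} min")
```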

Page 19: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

CMS MonteCarlo productions

The CMS production system is evolving into a permanent effort, with a strong contribution of the INFN T1 to the CMS productions
- 252 'assignments' in PCP-DC04, for all production steps [both local and Grid]
- plenty of assignments (simulation only) now running on LCG (Italy+Spain)
  • CNAF support for 'direct' submitters + backup SEs provided for Spain
- currently, digitization/DST production runs efficiently locally (mostly at the T1); the produced data are then injected into the CMS data distribution infrastructure
- future of T1 productions: rounds of "scheduled" reprocessing

DST production at INFN T1: ~11.8 Mevts produced out of ~12.9 Mevts assigned.

Page 20: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Coming next: Service Challenge 3 (SC3)

Data transfer and data serving in real use-cases: review the existing infrastructure/tools and give them a boost; the details of the challenge are currently under definition.

Two phases:
- Jul 05: SC3 "throughput" phase
  • Tier-0/1/2 simultaneous import/export, MSS involved
  • move real files, store on real hardware
- >Sep 05: SC3 "service" phase
  • small-scale replica of the overall system
  • modest throughput; the main focus is on testing in a quite complete environment, with all the crucial components
  • space for experiment-specific tests and inputs

Goals: test the crucial components, push them to production quality, and measure; towards the next production service.

INFN T1 participated in SC2, and is joining SC3.

Page 21: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Conclusions

INFN-CNAF T1 is quite young, but ramping up towards stable production-quality services
- optimized use of resources + interfaces to the Grid
- policy/human resources to support the experiments at the Tier-1

The Tier-1 actively participated in CMS DC04
- good hints: identified bottlenecks in managing resources, scalability, …

Learn the lessons: an overall revision of the CMS set-up at the T1, involving both Grid and non-Grid access
- first results are encouraging: successful daily operations
- local/Grid productions + distributed analysis are running…

Go ahead: a long path… next step on it: preparation for SC3, also with CMS applications.

Page 22: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Back-up slides

Page 23: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

PhEDEx transfer rates, T0 → INFN T1 (weekly and daily views)

[Back-up plots: CNAF T1 diskserver I/O; rate out of the CERN Tier-0.]

Page 24: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

PhEDEx transfer rates, T0 → INFN T1 (weekly and daily views)

[Back-up plots: CNAF T1 diskserver I/O; rate out of the CERN Tier-0.]

Page 25: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

CNAF "autopsy" of DC04: lethal injuries only

Agents drain data from the SE-EB down to the CNAF/PIC T1's, where they land directly on a Castor SE buffer; it occurred that in DC04 these files were many and small. For any file on the Castor SE filesystem a tape migration is foreseen with a given policy, regardless of file size/number. This strongly affected data transfer at the CNAF T1 (the MSS below it is the STK tape library with LTO-2 tapes).

Castor stager scalability issues
- many small files (mostly 500 B - 50 kB) in the stager db; bad performance of the stager db for >300-400k entries (may need more RAM?)
  • CNAF fast set-up of an additional stager during DC04: basically worked
  • REP-Agent cloned to transparently continue the replication to the disk-SEs

Tape library (LTO-2) issues
- high number of segments on tape; bad tape read/write performance, LTO-2 SCSI errors, repositioning failures, slow migration to tape and delays in the TMDB "SAFE"-labelling, inefficient tape space usage

A-posteriori solutions: consider a disk-based Import Buffer in front of the MSS…
[see next slide]

Page 26: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

CNAF "autopsy" of DC04: non-lethal injuries (constant and painful debugging…)

Minor (?) Castor/tape-library issues
- Castor filename length (more info: Castor ticket CT196717)
- ext3 file-system corruption on a partition of the old stager
- tapes blocked in the library

Several crashes/hangs of the TRA-Agent (rate: ~3 times per week)
- created some backlogs from time to time, nevertheless fast to recover; post-mortem analysis in progress

Experience with the Replica Manager interface
- e.g. files of size 0 created at the destination when trying to replicate from the Castor SE data which are temporarily not accessible due to stager (or other) problems on the Castor side; needs further tests to achieve reproducibility, and then Savannah reports

Globus-MDS Information System instabilities (rate: ~once per week)
- some temporary stops of data transfer (i.e. 'no SE found' means 'no replicas')

RLS instabilities (rate: ~once per week)
- some temporary stops of data transfer (cannot list replicas nor (de)register files)

SCSI driver problems on a CNAF disk-SE (rate: just once, but it affected the fake analysis)
- disks mounted but no I/O: under investigation

Page 27: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

CMS DC04: number and sizes of files

DC04 data time window: 51 (+3) days, March 11th - May 3rd

[Plots: number and sizes of files transferred; global CNAF network activity.]

Global CNAF network activity: ~340 Mbps (>42 MB/s) sustained for ~5 hours (the maximum was 383.8 Mbps); on May 1st-2nd, >3k files for >750 GB.

Page 28: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Description of RLS usage

[Diagram, centred on the POOL RLS catalogue (a CNAF RLS replica is kept via ORACLE mirroring), showing the workflow:
1. register files (RM/SRM/SRB EB agents);
2. find the Tier-1 location, based on metadata (Configuration agent);
3. copy/delete files to/from the export buffers;
4. copy files to the Tier-1's (Tier-1 Transfer agent);
5. submit the analysis job (Resource Broker);
6. process the DST and register private data (LCG ORCA Analysis Job, local POOL catalogue).
Other components shown: TMDB, SRB GMCAT, XML Publication Agent, Replica Manager.]

Specific client tools: POOL CLI, Replica Manager CLI, C++ LRC API based programs, LRC java API tools (SRB/GMCAT), Resource Broker.

Page 29: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

Tier-0 in DC04

Systems
- LSF batch system: 3 racks of 44 nodes each, dedicated (264 CPUs in total); dual P-IV Xeon 2.4 GHz, 1 GB memory, 100baseT; dedicated cmsdc04 batch queue, 500 RUN slots
- disk servers: DC04-dedicated stager, with 2 pools: IB and GDB, 10 + 4 TB
- Export Buffers:
  • EB-SRM (4 servers, 4.2 TB total)
  • EB-SRB (4 servers, 4.2 TB total)
  • EB-SE (3 servers, 3.1 TB total)
- databases: RLS (Replica Location Service), TMDB (Transfer Management DB)
- transfer steering: the agents steering data transfers run on a dedicated node (close monitoring…)
- monitoring services

[Diagram, architecture built on: Castor with the IB and GDB pools, the fake on-line process, RefDB, the POOL RLS catalogue, the TMDB, the ORCA RECO jobs, the Tier-0 data distribution agents and the Export Buffers (EB).]

Page 30: Towards the operations of the Italian Tier-1 for CMS: lessons learned from the CMS Data Challenge

CMS Production tools

CMS production tools (OCTOPUS)
- RefDB: contains the production requests, with all the parameters needed to produce the dataset and the details about the production process
- McRunJob: evolution of IMPALA, more modular (plug-in approach); a tool/framework for job preparation and job submission
- BOSS: real-time job-dependent parameter tracking. The running job's standard output/error are intercepted and filtered, and the extracted information is stored in the BOSS database. The remote updator is based on MySQL, but a remote updator based on R-GMA is being developed. (A minimal sketch of this output-filtering idea is given below.)
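To make the BOSS idea concrete (run a job, intercept its stdout/stderr, filter it, store the extracted parameters in a database), here is a minimal sketch. It is not BOSS itself: the regular expressions, the table layout and the use of sqlite3 instead of MySQL are assumptions for the example, and a real tracker would filter the output while the job runs rather than after it finishes.

```python
# Minimal sketch of the BOSS idea: run a job, capture its output,
# filter it with patterns, store the extracted values in a database.
# Patterns, schema and sqlite3 (instead of MySQL) are illustrative choices.
import re, sqlite3, subprocess

PATTERNS = {                         # hypothetical quantities to track
    "events_read":    re.compile(r"events read:\s+(\d+)"),
    "events_written": re.compile(r"events written:\s+(\d+)"),
}

def run_and_track(job_id, command, db):
    proc = subprocess.run(command, capture_output=True, text=True)
    for line in (proc.stdout + proc.stderr).splitlines():
        for name, pat in PATTERNS.items():
            m = pat.search(line)
            if m:
                db.execute("INSERT INTO job_params VALUES (?, ?, ?)",
                           (job_id, name, m.group(1)))
    db.commit()
    return proc.returncode

if __name__ == "__main__":
    db = sqlite3.connect("boss_sketch.sqlite")
    db.execute("CREATE TABLE IF NOT EXISTS job_params "
               "(job_id INTEGER, name TEXT, value TEXT)")
    run_and_track(1, ["./my_orca_job.sh"], db)   # hypothetical job wrapper
```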