ATLAS Distributed Computing
Stephen Burke, RAL
RAL PPD Computing Christmas Lectures, December 17th 2008
Outline
• Introduction
• Computing model
  – File types, data flows, etc.
• Production system
  – Monitoring
  – Performance this year
• Physics analysis
• Outlook for 2009
• Some slides “borrowed” from Kors Bos
  – Mistakes are mine
Introduction
• Not much time to cover the whole of ATLAS computing!
• Focus on Distributed Computing (~ the Grid)
  – Ignore detector, trigger, etc.
  – Ignore offline software (athena, sim, reco, …)
• Just the big picture
  – Not RAL- or UK-specific
  – Not going to explain the Grid
  – Still very complex
  – Some parts still subject to change
• Successes and problems
Tiers of ATLAS
• The tier structure is common to the LHC experiments, but some usage is ATLAS-specific
• Tier-0 @ CERN: does initial processing of raw data
• Tier-1, e.g. RAL: reprocessing, simulation, group analysis (no users!)
  – A typical Tier-1 is ~ 10% of the total
• Tier-2, e.g. Southgrid: simulation, ATLAS-wide user analysis
• Tier-3, e.g. RAL PPD: local user analysis
• Tiers are logical concepts: physical sites may merge functions
  – The RAL Tier-1 has no Tier-2 component, but that’s unusual
• A Tier-1 plus its associated Tier-2s forms a “cloud”
  – The logical unit for task and data assignment (a toy sketch follows below)
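To make the cloud idea concrete, here is a minimal sketch in Python. The mapping and the toy assignment rule are illustrative assumptions, not ATLAS's actual algorithm; the site names are just the examples used in this talk.

    # One ATLAS "cloud": a Tier-1 plus its associated Tier-2s (and, informally,
    # the Tier-3s attached to them), used as the unit for task and data assignment.
    uk_cloud = {
        "tier1": "RAL",
        "tier2s": ["Southgrid"],   # illustrative; the real UK cloud has several
        "tier3s": ["RAL PPD"],     # local analysis resources, outside the formal model
    }

    def assign_task(cloud: dict, task: str) -> str:
        """Toy rule: reprocessing and group analysis go to the Tier-1, the rest to a Tier-2."""
        site = cloud["tier1"] if task in ("reprocessing", "group analysis") else cloud["tier2s"][0]
        return f"{task} -> {site}"

    print(assign_task(uk_cloud, "simulation"))    # simulation -> Southgrid
    print(assign_task(uk_cloud, "reprocessing"))  # reprocessing -> RAL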
Data types
• HITS – simulated data from GEANT
  – ~ 4 MB/event
• RDO (Raw Data Object) – raw data from the detector or simulation
  – ~ 2 MB/event
• ESD (Event Summary Data) – output from reconstruction
  – ~ 1 MB/event
• AOD (Analysis Object Data) – reduced format used for most analysis (= DST)
  – ~ 200 kB/event
• DPD (Derived Physics Data) – ROOT ntuple format for specific purposes (several types)
  – ~ 10 kB/event
• For guidance, expect ~ 10 million events/day in normal data-taking, so e.g. ~ 10 TB/day for ESD (a back-of-the-envelope check follows below)
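A quick back-of-the-envelope check of that last figure. This is a sketch only: it applies the same nominal event rate to every format, purely to show the scale.

    # Per-event sizes from the slide, in megabytes.
    EVENT_SIZE_MB = {
        "HITS": 4.0,   # simulated GEANT output
        "RDO":  2.0,   # raw data (detector or simulation)
        "ESD":  1.0,   # reconstruction output
        "AOD":  0.2,   # reduced analysis format
        "DPD":  0.01,  # purpose-specific ntuples
    }

    EVENTS_PER_DAY = 10_000_000  # nominal data-taking rate quoted above

    for fmt, size_mb in EVENT_SIZE_MB.items():
        tb_per_day = size_mb * EVENTS_PER_DAY / 1_000_000  # MB -> TB
        print(f"{fmt}: ~{tb_per_day:g} TB/day")
    # ESD comes out at ~10 TB/day, matching the figure quoted above.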
Dataflow for ATLAS DATA

[Diagram: flow of real data through the tiers. The Tier-0 sends RDO/ESD to each Tier-1 (ATLASDATADISK and ATLASDATATAPE, with raw data going to TAPE); reprocessing at the Tier-1 produces new ESD/AOD, which are exchanged with the other Tier-1s; AOD is replicated to Tier-2 ATLASDATADISK for end-user analysis; group analysis writes DPD to ATLASGROUP space, and user DPD goes to Tier-2 ATLASUSERDISK and Tier-3 ATLASLOCALGROUPDISK.]
Data flow for Simulation Production

[Diagram: simulation jobs at the Tier-2s write HITS to ATLASPRODDISK; the HITS are gathered at the Tier-1, where pile-up, mixing and reconstruction turn HITS/RDO into ESD/AOD on ATLASMCDISK and TAPE; the resulting RDO/ESD/AOD are exchanged with the Tier-0 and the other Tier-1s.]
Production system
• ATLAS has recently moved to a pilot-job system similar to LHCb’s (PanDA: Production ANd Distributed Analysis)
  – PanDA originated in the US, but has recently moved to CERN
  – Tasks are split into jobs
  – Pilot jobs are sent to each site; when they start, they pull jobs from a central repository (see the sketch below)
• Data management by DQ2 (DQ = Don Quixote!)
  – Files -> datasets -> containers
  – Data are moved to sites according to the computing model, then jobs are sent to where the data sit
  – Job output is stored on the local Storage Element, then moved with DQ2
  – Dataset movement can be requested by anyone, but can only be triggered by authorised people
• Metadata are stored in AMI (ATLAS Metadata Interface)
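A minimal sketch of the pull model, assuming a toy in-memory queue in place of PanDA's real central server. All names here are hypothetical: the actual pilot polls the PanDA server over HTTP and does far more (validation, staging, monitoring).

    import subprocess
    import time

    # Toy stand-in for the central job repository; in PanDA this is a server
    # that pilots poll remotely. Each job belongs to a task ("tasks -> jobs").
    JOB_QUEUE = [
        {"task": "reco-2008", "id": 1, "cmd": ["echo", "reconstruct run 12345"]},
        {"task": "simul-2008", "id": 2, "cmd": ["echo", "simulate 1000 events"]},
    ]

    def fetch_job():
        """Pull the next job from the central queue, or None if it is empty."""
        return JOB_QUEUE.pop(0) if JOB_QUEUE else None

    def pilot(site: str, max_idle_polls: int = 3) -> None:
        """A pilot lands on a worker node and pulls real work until none is left."""
        idle = 0
        while idle < max_idle_polls:
            job = fetch_job()
            if job is None:
                idle += 1
                time.sleep(1)  # back off before polling the repository again
                continue
            idle = 0
            result = subprocess.run(job["cmd"], capture_output=True, text=True)
            print(f"[{site}] {job['task']} job {job['id']}: {result.stdout.strip()}")

    pilot("RAL")  # one pilot; in reality many run concurrently across the sites

Even this toy shows the key property of the model: the site only ever sees generic pilots, and the experiment decides centrally, at the last moment, which job each pilot runs.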
Production dashboard
DQ2 dashboard
Experience in 2008
• Many tests of different aspects of the production system
  – CCRC (Common Computing Readiness Challenge) in May
    • All experiments testing at once
  – FDR (Full/Final Dress Rehearsal)
  – Reprocessing tests
  – Functional tests (regular low-priority system tests)
  – Simulation production
• General results are good
  – The system works!
  – Many detailed problems at sites
  – Lots of babysitting
CCRC: Results
[Plot: ATLAS data-transfer rates during CCRC, with the NOMINAL and PEAK target rates and the transfer ERRORS marked.]
CCRC: all experiments
Transfers over one month
Efficiencies over one month
Simulation Production over one month
User Analysis

• Grid-based analysis frameworks/procedures are still in development
  – No real data yet
  – Many people use lxplus@CERN
  – Some Grid pioneers
  – The GANGA tool is popular (shared with LHCb, developed in the UK)
• “Traditional” Grid job submission vs pilot jobs: not yet decided
• Run anywhere vs run locally?
  – The Grid concept is that all users can run at all sites, but “Tier-3” resources can be local (how local?)
  – Pilot jobs make it hard for sites to control whose jobs run
• User data storage prototype
  – No storage quotas on Grid storage
    • May need a big stick!
  – GROUPDISK – managed by physics groups
  – LOCALGROUPDISK – for local (= country) users
  – USERDISK – scratch storage, anyone can write, files are cleaned after ~ 1 month (a toy cleaner is sketched below)
• Little experience so far, but tests are now starting
  – It seems that bandwidth to storage may be a bottleneck
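To make the USERDISK scratch policy concrete, here is a toy cleaner assuming a local filesystem view of the space. It is purely illustrative: the real cleaning is done by the storage and DQ2 services, and the mount point below is hypothetical.

    import os
    import time

    MAX_AGE_DAYS = 30  # files untouched for ~1 month count as expired scratch

    def clean_scratch(root: str, max_age_days: int = MAX_AGE_DAYS) -> None:
        """Delete files under `root` not modified for more than max_age_days."""
        cutoff = time.time() - max_age_days * 86400
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < cutoff:
                    print(f"expiring {path}")
                    os.remove(path)

    # clean_scratch("/atlas/userdisk")  # hypothetical mount point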
Outlook for 2009
• Many ongoing activities
  – Simulation production
  – Cosmics
    • Once the detector is back together
  – Functional tests
• Specific tests
  – “10 million files”
    • Testing Tier-1 to Tier-1 transfers
  – Reprocessing
  – CCRC09?
  – FDR?
• Analysis challenges
  – Analysis is the big challenge!
• Real data …
Are we ready?
• Yes, but …
• The production system works
  – Tested well above nominal rates
  – Bulk production of simulated data is now standard operation
  – Computing and storage resources are ~ adequate
    • At least for now
• A constant barrage of problems; many people on shift and lots of manual intervention
  – At one point recently, 7 Tier-1s were down simultaneously!
  – 24×7 cover now at the Tier-1
  – Some critical people are leaving
• Analysis on the Grid is still largely untested
  – Real data will bring a lot of new, inexperienced users
  – Will they be able to cope with the typical failure rate on the Grid?