ATLAS Distributed Computing
Stephen Burke, RAL
RAL PPD Computing Christmas Lectures, December 17th 2008

Transcript

Page 1: ATLAS Distributed Computing

Stephen Burke, RAL

Page 2: Outline

• Introduction
• Computing model
  – File types, data flows etc.
• Production system
  – Monitoring
  – Performance this year
• Physics analysis
• Outlook for 2009
• Some slides “borrowed” from Kors Bos
  – Mistakes are mine

Page 3: Introduction

• Not much time to cover the whole of ATLAS computing!
• Focus on Distributed Computing (~ Grid)
  – Ignore detector, trigger etc.
  – Ignore offline software (athena, sim, reco, …)
• Just the big picture
  – Not RAL- or UK-specific
  – Not going to explain the Grid
  – Still very complex
  – Some parts still subject to change
• Successes and problems

Page 4: Tiers of ATLAS

• The tier structure is common to the LHC experiments, but some usage is ATLAS-specific
• Tier-0 @ CERN: does initial processing of raw data
• Tier-1, e.g. RAL: reprocessing, simulation, group analysis (no users!)
  – A typical Tier-1 is ~ 10% of the total
• Tier-2, e.g. SouthGrid: simulation, ATLAS-wide user analysis
• Tier-3, e.g. RAL PPD: local user analysis
• Tiers are logical concepts: physical sites may merge functions
  – The RAL Tier-1 has no Tier-2 component, but that’s unusual
• A Tier-1 plus its associated Tier-2s form a “cloud” (see the sketch below)
  – the logical unit for task and data assignment
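
As a toy illustration of the cloud concept, a minimal sketch in Python; the grouping of UK sites shown is an assumption for the example, not an authoritative topology:

```python
# Toy illustration of the "cloud" concept: a Tier-1 plus its
# associated Tier-2s form the unit for task and data assignment.
# The UK grouping below is an assumed example, not an official list.
CLOUDS = {
    "UK": {
        "tier1": "RAL",
        "tier2s": ["SouthGrid", "NorthGrid", "ScotGrid", "London"],
    },
}

def sites_in_cloud(cloud: str) -> list[str]:
    """All sites in a cloud that can receive the cloud's tasks and data."""
    c = CLOUDS[cloud]
    return [c["tier1"], *c["tier2s"]]

print(sites_in_cloud("UK"))  # ['RAL', 'SouthGrid', 'NorthGrid', ...]
```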

Page 5: Data types

• HITS – simulated data from GEANT
  – ~ 4 MB/event
• RDO (Raw Data Object) – raw data from the detector or simulation
  – ~ 2 MB/event
• ESD (Event Summary Data) – output from reconstruction
  – ~ 1 MB/event
• AOD (Analysis Object Data) – reduced format used for most analysis (= DST)
  – ~ 200 kB/event
• DPD (Derived Physics Data) – ROOT ntuple format for specific purposes (several types)
  – ~ 10 kB/event
• For guidance, expect ~ 10 million events/day in normal data-taking, so e.g. ~ 10 TB/day for ESD (see the worked check below)
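
A quick back-of-the-envelope check of those rates, as illustrative Python using the per-event sizes above:

```python
# Approximate per-event sizes from the slide, in MB.
SIZES_MB = {"HITS": 4.0, "RDO": 2.0, "ESD": 1.0, "AOD": 0.2, "DPD": 0.01}
EVENTS_PER_DAY = 10_000_000  # ~10 million events/day in normal data-taking

# Daily volume per format: size/event * events/day, converted MB -> TB.
for fmt, mb_per_event in SIZES_MB.items():
    tb_per_day = mb_per_event * EVENTS_PER_DAY / 1_000_000
    print(f"{fmt}: ~{tb_per_day:g} TB/day")

# ESD comes out at ~10 TB/day, matching the slide's estimate.
```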

Page 6: Data flow for ATLAS data

[Diagram: the data flow for real ATLAS data. RDO and ESD go from the Tier-0 to Tier-1 storage (ATLASDATADISK and ATLASDATATAPE); the Tier-1 runs reprocessing and exchanges ESD/AOD with the other Tier-1s; AOD is copied to Tier-2 ATLASDATADISK for group analysis (ATLASGROUP) and end-user analysis; DPD flows down to Tier-3 storage (ATLASUSERDISK, ATLASLOCALGROUPDISK).]

Page 7: Data flow for Simulation Production

[Diagram: the data flow for simulation production. HITS made by simulation at the Tier-2s are staged on ATLASPRODDISK and gathered at the Tier-1 (ATLASMCDISK and tape); pile-up, mixing and reconstruction at the Tier-1 turn HITS and RDO into ESD/AOD, which are stored on ATLASMCDISK and copied to the Tier-0 and the other Tier-1s.]

Page 8: Production system

• ATLAS has recently moved to a pilot-job system similar to LHCb's: PANDA (Production ANd Distributed Analysis)
  – PANDA originated in the US, but has recently moved to CERN
  – Tasks are split into jobs
  – Pilot jobs are sent to each site; when they start, they pull jobs from a central repository (see the sketch below)
• Data management is by DQ2 (DQ = Don Quixote!)
  – Files -> datasets -> containers
  – Data are moved to sites according to the computing model, then jobs are sent to where the data sit
  – Job output is stored on the local Storage Element, then moved with DQ2
  – Dataset movement can be requested by anyone, but can only be triggered by authorised people
• Metadata are stored in AMI (ATLAS Metadata Interface)
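
A minimal sketch of the pilot-job pull model in Python; the endpoint URL and job format are invented for illustration and bear no relation to the real PANDA protocol:

```python
import json
import subprocess
import time
import urllib.request

# Hypothetical dispatch endpoint -- the real PANDA server and its
# protocol differ; this only illustrates the pull model.
JOB_SERVER = "https://example.org/panda/getJob"

def run_pilot() -> None:
    """A pilot starts on a worker node, then repeatedly pulls real
    jobs from the central repository until none are left."""
    while True:
        with urllib.request.urlopen(JOB_SERVER) as resp:
            job = json.load(resp)  # e.g. {"id": 42, "cmd": [...]} (assumed)
        if not job:
            break  # empty reply: nothing queued for this site
        subprocess.run(job["cmd"], check=False)  # execute the payload
        time.sleep(1)  # small pause between requests

if __name__ == "__main__":
    run_pilot()
```

The point of the pattern is that a site only ever sees generic pilots; matching real jobs to resources happens centrally, at the moment the pilot asks for work.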

Page 9: Production dashboard

[Screenshot: the ATLAS production dashboard.]

Page 10: DQ2 dashboard

[Screenshot: the DQ2 dashboard.]

Page 11: Experience in 2008

• Many tests of different aspects of the production system
  – CCRC (Common Computing Readiness Challenge) in May
    • All experiments testing at once
  – FDR (Full/Final Dress Rehearsal)
  – Reprocessing tests
  – Functional tests (regular low-priority system tests)
  – Simulation production
• General results are good
  – The system works!
  – Many detailed problems at sites
  – Lots of babysitting

Page 12: CCRC: Results

[Plot: ATLAS transfer rates during CCRC, annotated with the NOMINAL and PEAK target rates and with transfer ERRORS.]

Page 13: CCRC: all experiments

[Plot: transfer rates for all LHC experiments during CCRC.]

Page 14: Transfers over one month

[Plot: ATLAS data transfers over one month.]

Page 15: Efficiencies over one month

[Plot: transfer efficiencies over one month.]

Page 16: Simulation Production over one month

[Plot: simulation production over one month.]

Page 17: User Analysis

• Grid-based analysis frameworks/procedures are still in development
  – No real data yet
  – Many people use lxplus@CERN
  – Some Grid pioneers
  – The GANGA tool is popular (shared with LHCb, developed in the UK)
• “Traditional” Grid job submission vs pilot jobs is not yet decided
• Run anywhere vs run locally?
  – The Grid concept is that all users can run at all sites, but “Tier-3” resources can be local (how local?)
  – Pilot jobs make it hard for sites to control whose jobs run
• User data storage prototype
  – No storage quotas on Grid storage
    • May need a big stick!
  – GROUPDISK – managed by physics groups
  – LOCALGROUPDISK – for local (= country) users
  – USERDISK – scratch storage; anyone can write, and files are cleaned after ~ 1 month (see the sketch below)
• Little experience so far, but tests are now starting
  – It seems that bandwidth to storage may be a bottleneck
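
A minimal sketch of the kind of age-based cleanup policy described for USERDISK, assuming a POSIX-mounted scratch area; this is illustrative, not the actual ATLAS cleaning agent:

```python
import os
import time

# Illustrative scratch-cleanup policy: anyone may write, but files
# older than ~1 month are removed. Not the real ATLAS cleaning agent;
# the mount point below is a hypothetical example.
MAX_AGE_SECONDS = 30 * 24 * 3600  # ~ 1 month

def clean_scratch(root: str) -> None:
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > MAX_AGE_SECONDS:
                os.remove(path)  # file has outlived its grace period

if __name__ == "__main__":
    clean_scratch("/scratch/atlasuserdisk")  # hypothetical mount point
```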

Page 18: Outlook for 2009

• Many ongoing activities
  – Simulation production
  – Cosmics
    • Once the detector is back together
  – Functional tests
• Specific tests
  – “10 million files”
    • Testing Tier-1 to Tier-1 transfers
  – Reprocessing
  – CCRC09?
  – FDR?
• Analysis challenges
  – Analysis is the big challenge!
• Real data …

Page 19: Are we ready?

• Yes, but …
• The production system works
  – Tested well above nominal rates
  – Bulk production of simulated data is now standard operation
  – Computing and storage resources are ~ adequate
    • At least for now
• Constant barrage of problems: many people on shift and lots of manual intervention
  – At one point recently, 7 Tier-1s were down simultaneously!
  – 24×7 cover now at the Tier-1
  – Some critical people are leaving
• Analysis on the Grid is still largely untested
  – Real data will bring a lot of new, inexperienced users
  – Will they be able to cope with the typical failure rate on the Grid?