Dataflow Monitoring. Nicoletta Garelli. ALICE, ATLAS, CMS & LHCb joint workshop on DAQ@LHC, 12-14 March 2013, Château de Bossey


Jan 20, 2016


Helmut Hummler

Page 1: Dataflow  Monitoring

Dataflow Monitoring

Nicoletta Garelli

ALICE, ATLAS, CMS & LHCb joint workshop on DAQ@LHC

12-14 March 2013, Château de Bossey

Page 2: Dataflow  Monitoring

What Dataflow Monitoring Means

• Monitoring in real time the flow of data to ensure optimal data taking
  – from detector readout to permanent storage
  – trigger & DAQ quantities (counters, data rate, buffer occupancy, etc.)
• Avoid dead-time
  – and eventually allow problems to be fixed
• Each experiment uses its own jargon to indicate the same thing
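
As an illustration of the quantities listed above, here is a minimal sketch of how an event rate, a data rate and a buffer occupancy could be derived from two consecutive counter readings. The `Sample` fields and the `rates` function are hypothetical names chosen for this example; they are not taken from any experiment's software.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a dataflow element (all field names are hypothetical)."""
    time_s: float      # timestamp of the reading, in seconds
    events: int        # cumulative event counter
    bytes_out: int     # cumulative bytes sent downstream
    buffer_used: int   # buffer slots currently occupied
    buffer_size: int   # total buffer slots


def rates(prev: Sample, curr: Sample) -> dict:
    """Derive monitoring quantities from two consecutive counter readings."""
    dt = curr.time_s - prev.time_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "event_rate_hz": (curr.events - prev.events) / dt,
        "data_rate_mb_s": (curr.bytes_out - prev.bytes_out) / dt / 1e6,
        "buffer_occupancy": curr.buffer_used / curr.buffer_size,
    }


# Two made-up readings taken 5 s apart.
first = Sample(0.0, 0, 0, 10, 100)
second = Sample(5.0, 500_000, 750_000_000, 40, 100)
print(rates(first, second))
```

Working from cumulative counters rather than instantaneous rates means a missed reading only coarsens the time resolution; it does not lose events from the total.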

12/03/2013 N. Garelli (SLAC) - Dataflow Monitoring 2

Page 3: Dataflow  Monitoring

Requirements

• Access any relevant information in real time to follow data taking
• Online aggregation & data correlation
• Online problem detection: dead-time, data losses, etc.
• Archive: access historical data for diagnostics, statistics, post-mortem
• Use monitoring data to trigger alarms and/or automatic actions to recover from problems


Basics

Added later
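
The last requirement, using monitoring data to trigger alarms or automatic actions, could be sketched as a threshold check on a rolling average of the dead-time fraction. This is only an illustration under stated assumptions: the class name `DeadTimeAlarm`, the threshold, the window length and the callback are all hypothetical, and no experiment's actual alarm logic is implied.

```python
from collections import deque
from typing import Callable, Deque


class DeadTimeAlarm:
    """Raise an alarm when the mean dead-time over a full window exceeds a threshold."""

    def __init__(self, threshold: float, window: int,
                 on_alarm: Callable[[float], None]) -> None:
        self.threshold = threshold
        self.samples: Deque[float] = deque(maxlen=window)
        self.on_alarm = on_alarm   # automatic action, called once per alarm episode
        self.active = False

    def update(self, busy_fraction: float) -> bool:
        """Add one dead-time reading (0..1) and return whether the alarm is raised."""
        self.samples.append(busy_fraction)
        window_full = len(self.samples) == self.samples.maxlen
        mean = sum(self.samples) / len(self.samples)
        raised = window_full and mean > self.threshold
        if raised and not self.active:
            self.on_alarm(mean)    # fire only on the transition into the alarm state
        self.active = raised
        return raised


actions = []
alarm = DeadTimeAlarm(threshold=0.05, window=3, on_alarm=actions.append)
for reading in [0.0, 0.0, 0.3, 0.0, 0.0, 0.0]:
    alarm.update(reading)
print(len(actions))
```

Averaging over a window and firing only on the transition keeps a single noisy reading from paging an expert, and keeps a persistent problem from triggering the recovery action repeatedly.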

Page 4: Dataflow  Monitoring

Evolution
• Users: shifters & experts
• LHC operations …at the beginning
  – scattered information and rudimentary tools
  – shifters: intense monitoring activity
  – experts: high presence in control room + ringing on-call phone
• LHC operations …routine
  – coherent information and optimized tools
  – automate as much as possible to reduce shifter’s tasks
    • see Luca’s talk of this morning
  – move from custom GUI to ubiquitous web based tools
    • let’s do all with a smartphone

Page 5: Dataflow  Monitoring

THE 4 ARCHITECTURES


Page 6: Dataflow  Monitoring

Middleware

• The 4 experiments developed 4 different DAQ systems using different technologies
• Variety reflected in dataflow monitoring middleware
  – LHCb & ALICE: Distributed Information Management (DIM)
    • client/server paradigm, lightweight
  – ATLAS: Information Service (IS)
    • custom library on top of CORBA
    • client-server communication model where information is stored in memory by so-called IS servers
  – CMS: Web Service
    • Cross-DAQ (XDAQ) framework
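
The client/server idea shared by these middleware layers, where a server keeps named values in memory and clients either read them on demand or are notified on change, can be sketched in a few lines. This is a toy model only: `InfoServer` and its methods are hypothetical and do not reproduce the DIM, IS or XDAQ APIs.

```python
from typing import Any, Callable, Dict, List

Callback = Callable[[str, Any], None]


class InfoServer:
    """Toy in-memory information server (not the DIM or IS API)."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._subscribers: Dict[str, List[Callback]] = {}

    def publish(self, name: str, value: Any) -> None:
        """Store the latest value and notify every subscriber of that name."""
        self._values[name] = value
        for callback in self._subscribers.get(name, []):
            callback(name, value)

    def get(self, name: str) -> Any:
        """On-demand read of the latest published value."""
        return self._values[name]

    def subscribe(self, name: str, callback: Callback) -> None:
        """Register a callback invoked on every later update of `name`."""
        self._subscribers.setdefault(name, []).append(callback)


server = InfoServer()
seen = []
server.subscribe("readout.rate_hz", lambda n, v: seen.append((n, v)))
server.publish("readout.rate_hz", 75_000)
print(server.get("readout.rate_hz"), seen)
```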


Page 7: Dataflow  Monitoring

ALICE DAQ

• ~300 processes on ~300 machines
• 100k dataflow information published every 5 s, ~3 GB/h

NOTE: HLT monitoring & dedicated expert storage monitoring not discussed here

Page 8: Dataflow  Monitoring

ALICE Dataflow Monitoring Architecture

Figure: ALICE dataflow monitoring architecture. Labels: DAQ processes; DIM / SMI; Status Display; Backpressure Monitor; Logbook; MySQL.

• Based on DIM/SMI
  – SMI: framework for designing and implementing distributed control systems developed by DELPHI
• MySQL:
  – store system configuration
  – LDC & GDC write run info
  – Archive
• Logbook as visualization
• 2 monitoring applications
• Tcl/Tk
• “Logbook”
  – much more than what you think
  – PHP

Page 9: Dataflow  Monitoring

ALICE Visualization


Page 10: Dataflow  Monitoring

ALICE & Android


Page 11: Dataflow  Monitoring

ATLAS DAQ

• O(20k) processes on ~2k machines
• 1M dataflow information published every 5-10 s, ~4 GB/h

Figure: ATLAS DAQ diagram. Labels: Detector Readout (FE, ROD; Calo/Muon, Other); Custom HW; Level 1; ReadOut System; Level 2 (Processing Units on Processing Nodes); Event Builder; Event Filter (Processing Units on Processing Nodes); Data Logger; DC; BE. Quantities shown: ~150, ~100, ~5, ~8k, ~8k.

Page 12: Dataflow  Monitoring

ATLAS Dataflow Monitoring Architecture


Figure: ATLAS dataflow monitoring architecture. Labels: DAQ processes; Information Service (IS), ~100 servers; Monitoring GUI applications; Web Server (Apache); Web IS (Python CGI); Web Browser; Mirror Information Service; Monitoring GUI applications; CERN GPN.

• Web IS: IS gives information access on demand via the HTTP protocol
  ‐ Python wrapper accepts HTTP requests & sends back dynamically formed XML text (value of the IS object pointed to by the given URL)
• Mirror IS: real-time copy from IS (ATCN) to mirror counterpart in CERN GPN
• Archive: none
  ‐ information stored & accessed for ~2 months in RRD files every ~30 s via network monitoring system
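
The Web IS idea, a Python wrapper that turns the object named by a URL into dynamically formed XML, might look roughly like the sketch below. The URL-to-name mapping, the XML layout and the `STORE` contents are assumptions made for illustration; the real Web IS schema and the IS client API are not shown in the slides. A CGI script or any HTTP handler could call `object_to_xml` with the request path and return the string as the response body.

```python
import xml.etree.ElementTree as ET
from typing import Mapping

# Hypothetical snapshot of information objects, keyed by dotted name.
STORE = {
    "DF.ROS-1.rate": {"value": 75000, "unit": "Hz"},
}


def object_to_xml(store: Mapping[str, Mapping[str, object]], url_path: str) -> str:
    """Build an XML document for the object addressed by `url_path`.

    A path such as "/DF/ROS-1/rate" is mapped to the name "DF.ROS-1.rate"
    (an assumed convention, not the real Web IS URL scheme).
    """
    name = url_path.strip("/").replace("/", ".")
    root = ET.Element("object", {"name": name})
    attributes = store.get(name)
    if attributes is None:
        root.set("status", "not-found")
    else:
        for key, value in attributes.items():
            child = ET.SubElement(root, "attr", {"name": key})
            child.text = str(value)
    return ET.tostring(root, encoding="unicode")


print(object_to_xml(STORE, "/DF/ROS-1/rate"))
```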


Page 13: Dataflow  Monitoring

Shifter’s Tools in 2012

• DAQ Panel: tool portal for shifters
• DFSummary
  – dynamically constructed web page which computes & displays most important dataflow parameters (~200 variables)
  – ~30 s update rate
• Busy Panel: Qt application for monitoring dead-time
• Shifter Assistant
  – see Luca’s talk of this morning

Page 14: Dataflow  Monitoring

CMS DAQ

• O(20k) processes on ~2k machines
• O(600k) dataflow information published every 1-5 s, ~8 GB/h

Page 15: Dataflow  Monitoring

XDAQ Monitoring & Alarming Service
• Fully scalable distributed monitoring & alarming system
• Service-oriented architecture organized in 3-tier structured collection of communicating components:
  – Sensor: report monitoring data
  – Eventing: scalable publisher-subscriber service orchestrated by a load balancer application (Broker)
  – Collector: build relational tables
  – Live Access Service: presentation of raw data (Web Service)
• Archive: automatic persistency of collected tables in ORACLE according to configuration
  – Subset of information stored: ~30 GB/y
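
The roles named on this slide can be pictured with a small model: a broker hands each sensor the least-loaded eventing instance, eventing instances forward reports to subscribers, and a collector keeps the latest row per source in a table. All class names, the load-balancing rule and the row layout are assumptions for illustration; this is not the XDAQ API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Eventing:
    """One publisher-subscriber instance that forwards sensor reports."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.clients = 0   # number of sensors assigned by the broker
        self.subscribers: List[Callable[[Row], None]] = []

    def publish(self, row: Row) -> None:
        for deliver in self.subscribers:
            deliver(row)


class Broker:
    """Load balancer: gives each new sensor the least-loaded eventing instance."""

    def __init__(self, instances: List[Eventing]) -> None:
        self.instances = instances

    def assign(self) -> Eventing:
        chosen = min(self.instances, key=lambda e: e.clients)
        chosen.clients += 1
        return chosen


class Collector:
    """Builds a table with one row per source, keeping the latest report."""

    def __init__(self) -> None:
        self.table: Dict[str, Row] = {}

    def on_report(self, row: Row) -> None:
        self.table[str(row["source"])] = row

    def rows(self) -> List[Row]:
        return [self.table[key] for key in sorted(self.table)]


eventings = [Eventing("ev-1"), Eventing("ev-2")]
collector = Collector()
for ev in eventings:
    ev.subscribers.append(collector.on_report)
broker = Broker(eventings)
for source in ["ru-1", "ru-2", "ru-3"]:
    broker.assign().publish({"source": source, "rate_hz": 100_000})
print(collector.rows())
```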

Page 16: Dataflow  Monitoring

Monitoring as a Service
• XDAQ as a Service (XaaS): common infrastructure for both central DAQ & sub-detectors

Figure: XDAQ as a service (xaas). Labels: ecal; hcal; csc; ...; cdaq; sentinel; monitor; application; central xaas.

• interoperable services providing standard functionalities for use in XDAQ environment

• All processes organized into searchable groups known as zones

• A zone defines the scope of a distributed XDAQ application

• Each zone has its own monitoring data types (flashlists)

Page 17: Dataflow  Monitoring

CMS Visualization
• LabView DAQMon

Page 18: Dataflow  Monitoring

LHCb DAQ

• O(40k) processes on 2k machines
• 4M dataflow information published every 5 s, ~11.5 GB/h
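
The publication figures quoted on the four DAQ slides can be compared by converting each into an average number of bytes per published value. The short calculation below is a back-of-the-envelope sketch, assuming 1 GB = 10^9 bytes and taking the quoted counts, periods and throughputs at face value (several are only order-of-magnitude numbers in the slides).

```python
def bytes_per_value(values: float, period_s: float, gb_per_hour: float) -> float:
    """Average bytes per published value, assuming 1 GB = 1e9 bytes."""
    bytes_per_second = gb_per_hour * 1e9 / 3600.0
    return bytes_per_second * period_s / values


# (values per publication, (shortest period, longest period) in s, GB/h), as quoted.
QUOTED = {
    "ALICE": (100e3, (5, 5), 3.0),
    "ATLAS": (1e6, (5, 10), 4.0),
    "CMS": (600e3, (1, 5), 8.0),
    "LHCb": (4e6, (5, 5), 11.5),
}

for name, (count, (fast, slow), gbh) in QUOTED.items():
    low = bytes_per_value(count, fast, gbh)
    high = bytes_per_value(count, slow, gbh)
    print(f"{name}: {low:.1f} to {high:.1f} bytes per value")
```

Under these assumptions the result is roughly 42 bytes per value for ALICE, 6 to 11 for ATLAS, 4 to 19 for CMS and about 4 for LHCb, which suggests the experiments count "dataflow information" in somewhat different units.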

Figure: LHCb DAQ diagram. Labels: Detector (VELO, ST, OT, RICH, ECal, HCal, Muon, L0 Trigger); FE Electronics; Readout Boards; Readout Network (switches); HLT farm (CPUs); MON farm (CPUs); TFC System (L0 trigger, LHC clock, MEP Request); Event building; Front-End; Experiment Control System (ECS). Legend: Event data; Timing and Fast Control Signals; Control and Monitoring data.

Page 19: Dataflow  Monitoring

LHCb Dataflow Monitoring Architecture

Figure: LHCb dataflow monitoring architecture. Labels: Control System; UIs (UI, Alarm UI, Detailed UIs, UI Readout Units, UI Timing & Trigger); Archive; Status, Counters, Rates, Alarms; Detailed Counters; HLT, Monitoring, Reconstr. and Storage branches, each with Infrastr and Dataflow (plus DQ Histo / Histos); HLT SubFarms x 56; HLT Nodes x ~30; HLT SF Ctrl; DIM links; Event Data; Automatic Actions; (aggregation).

Archive:
• PVSS ORACLE DB
• O(200k) values every 5 s
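
The "(aggregation)" steps in the diagram suggest that counters are summed on the way up, from nodes to sub-farms to the top level. A minimal sketch of such a two-level sum follows; it assumes ~30 nodes in each of 56 sub-farms (one reading of the diagram labels), and the counter names and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple

Counters = Mapping[str, int]


def aggregate(counter_sets: Iterable[Counters]) -> Dict[str, int]:
    """Sum counters of the same name across several sources."""
    total: Counter = Counter()
    for counters in counter_sets:
        total.update(counters)   # Counter.update adds values, it does not replace
    return dict(total)


def aggregate_farm(
    farm: Mapping[str, Mapping[str, Counters]]
) -> Tuple[Dict[str, Dict[str, int]], Dict[str, int]]:
    """Aggregate node counters per sub-farm, then sub-farm totals into one total."""
    per_subfarm = {name: aggregate(nodes.values()) for name, nodes in farm.items()}
    return per_subfarm, aggregate(per_subfarm.values())


# Made-up farm: 56 sub-farms of 30 nodes, identical counters on every node.
farm = {
    f"subfarm-{s:02d}": {
        f"node-{n:02d}": {"events_in": 100, "events_accepted": 1} for n in range(30)
    }
    for s in range(56)
}
per_subfarm, total = aggregate_farm(farm)
print(per_subfarm["subfarm-00"], total)
```

Aggregating at the sub-farm level first means the top level handles 56 inputs instead of every node individually.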

Page 20: Dataflow  Monitoring

LHCb Visualization

PVSS GUI

VT100 graphics detailed UI

Page 21: Dataflow  Monitoring

CONCLUSIONS


Page 22: Dataflow  Monitoring

Satisfied?
“YES, it does the job”
“ … BUT …”
– 4 different solutions for the same problem …
– sharing experience and maybe even future common solutions?

Luciano’s talk on Thursday