Automating the CMS DAQ
A review of automation features added to CMS during Run 1 of the LHC
20th International Conference on Computing in High Energy and Nuclear Physics (CHEP), Amsterdam, The Netherlands, Oct 2013
Hannes Sakulin, CERN/PH, on behalf of the CMS DAQ group

Feb 25, 2016



Automating the CMS DAQ, CHEP, Oct 17, 2013, Amsterdam. H. Sakulin / CERN PH

Overview

Automation features were added to the CMS DAQ gradually over Run 1 of the LHC:
  as we learned from operation
  as new requirements emerged
  as new failure scenarios became apparent

These features were needed:
  to operate CMS with high data-taking efficiency
  to operate CMS with a non-expert crew
  to reduce the workload for on-call experts

Definition: Automation

In principle, anything done by the experiment's online software ultimately falls into the domain of automation.

For this talk we define automation to be:
  The automation of routine tasks that otherwise would be performed by the operator or on-call expert (or which were initially performed by the operator)
  Automatic diagnosis
  Automatic recovery from known error conditions

System Overview

The Compact Muon Solenoid Experiment

General-purpose detector at the LHC
12 sub-systems
55 million readout channels

The first-level trigger

[Diagram: L1 trigger electronics and front-end electronics, with data and control paths; low voltage, high voltage, gas, magnet]
40 racks of custom electronics

Sub-detector DAQ electronics

[Diagram: sub-detector DAQ electronics added to the previous picture]
Front-end Controllers
Front-end Drivers
Delivering data to the central DAQ
700 links

The central DAQ

[Diagram: central DAQ electronics (Frontend Readout Links) and central DAQ farm, High Level Trigger & storage (filter farm) added to the previous picture]
Builds events at 100 kHz, 100 GB/s
2 stages: Myrinet, Gigabit Ethernet
8 independent event builder / filter slices
High level trigger running on filter farm: ~1200 PCs, ~13000 cores (2012)

The online software

[Diagram: XDAQ Online Software and CMSSW on top of the hardware layers]
XDAQ Framework (C++)
XDAQ applications control hardware and handle data flow
Hardware access, transport protocols, XML configuration, SOAP communication, HyperDAQ web server
~20000 applications to control

The online software: Run Control System

[Diagram: Run Control tree with a Level-0 node above the DAQ (with its slices), Trigger, ECAL, Tracker, DCS and DQM nodes]
Run Control System: Java, web technologies
Defines the control structure
Function Manager: node in the Run Control tree; defines a state machine & parameters
User function managers are dynamically loaded into the web application
Run Control Web Application: Apache Tomcat servlet container, Java Server Pages, tag libraries
GUI in a web browser: HTML, CSS, JavaScript, AJAX
Communication: Web Services (WSDL, Axis, SOAP)

The online software: control, monitoring and operators

Detector Control System (Tracker, ECAL, ...): WinCC OA (Siemens ETM), SMI++ (CERN)
XDAQ monitoring & alarming: monitor collectors, Live Access Servers
Expert system
[Diagram: errors, alarms and monitor clients; DAQ operator, DCS operator, DQM shifter, trigger shifter, shift leader]

Adding automation

Principles of Automation in CMS

Integrated into the control tree
  As local as possible
    Control nodes react to the state of their child nodes
    Detailed knowledge stays encapsulated
    Propagate information only as far up as necessary
  Implemented in the Run Control & XDAQ frameworks
    Distributed
    Parallel

In addition: external automation
  For recovery that changes the system structure
  E.g. re-computation of the configuration after hardware failure

Integrated Automation a) at the XDAQ layer

Automation in the XDAQ layer

XDAQ application detects a problem and takes corrective action

How
  Using framework functionality (work loops)
  Full flexibility of C++ (e.g. Unix signals)

Example: Event Processors (High-Level Trigger)
  Child processes are forked to process in parallel
    Reduces memory footprint (copy-on-write)
    Reduces configuration time
  In case a child process crashes, a new one is created
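The fork-and-replace behaviour described for the Event Processors can be sketched as follows. This is a minimal illustration in Python rather than the C++ used by XDAQ; the names run_worker, fork_worker and supervise, the simulated crash, and the restart limit are all assumptions made for the example, not part of the CMS software. It needs a Unix-like system because it calls os.fork.

```python
import os


def run_worker(worker_id: int, attempt: int) -> None:
    """Stand-in for event processing in a child process; never returns."""
    if worker_id == 0 and attempt == 0:
        os._exit(1)  # simulate a crash on the first attempt of worker 0
    os._exit(0)      # normal end of processing


def fork_worker(worker_id: int, attempt: int) -> int:
    """Fork one child; the parent gets the child's pid back."""
    pid = os.fork()
    if pid == 0:
        # Child: shares the parent's already-configured memory copy-on-write.
        run_worker(worker_id, attempt)
    return pid


def supervise(n_workers: int, max_restarts: int = 3) -> dict:
    """Fork n_workers children and replace any child that exits abnormally."""
    attempts = {w: 0 for w in range(n_workers)}
    running = {fork_worker(w, 0): w for w in range(n_workers)}
    restarts = 0
    while running:
        pid, status = os.waitpid(-1, 0)  # block until any child exits
        if pid not in running:
            continue  # not one of our workers
        worker_id = running.pop(pid)
        clean = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        if not clean and restarts < max_restarts:
            restarts += 1
            attempts[worker_id] += 1
            running[fork_worker(worker_id, attempts[worker_id])] = worker_id
    return {"restarts": restarts, "attempts": attempts}


if __name__ == "__main__":
    print(supervise(3))
```

Because the parent is configured once and the children are forked from it, a replacement child starts without repeating the configuration step, which is the benefit the slide attributes to forking.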

Integrated Automation b) at the Run Control layer

Automation in Run Control

How
  Function Managers receive asynchronous updates from their children
    State updates
    Monitoring data
  Function Managers define event handlers (Java) reacting to these notifications
    Full flexibility of Java

Examples
  Exclude crashed resources from control
  Recover flush failure
    When stopping a run, sometimes a DAQ slice cannot be flushed
    The corresponding function manager detects this and locally recovers this slice
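The two examples above can be sketched as an event handler reacting to child state notifications. The real function managers are written in Java inside the Run Control framework; this Python sketch only illustrates the pattern, and the class names, the state names ("Crashed", "FlushFailed", "Stopped") and the recovery step are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Child:
    name: str
    state: str = "Running"
    excluded: bool = False


@dataclass
class FunctionManager:
    children: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def add_child(self, name: str) -> None:
        self.children[name] = Child(name)

    def on_state_update(self, name: str, new_state: str) -> None:
        """Event handler called for each asynchronous state notification."""
        child = self.children[name]
        child.state = new_state
        if new_state == "Crashed":
            # Exclude the crashed resource from further control.
            child.excluded = True
            self.log.append(f"excluded {name}")
        elif new_state == "FlushFailed":
            # Recover locally; nothing is propagated further up the tree.
            self.recover_slice(child)

    def recover_slice(self, child: Child) -> None:
        child.state = "Stopped"  # placeholder for the real recovery actions
        self.log.append(f"recovered {child.name}")

    def controlled(self) -> list:
        return [c.name for c in self.children.values() if not c.excluded]


if __name__ == "__main__":
    fm = FunctionManager()
    for n in ("slice0", "slice1", "ecal"):
        fm.add_child(n)
    fm.on_state_update("ecal", "Crashed")
    fm.on_state_update("slice1", "FlushFailed")
    print(fm.controlled(), fm.log)
```

Keeping the handler in the node that owns the child is what lets the detailed knowledge stay encapsulated, in line with the "as local as possible" principle.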

Automation in the Run Control Top-level Node

For actions that need coordination between multiple sub-systems

Example 1: Automatic actions in the DAQ in response to condition changes in the DCS and the LHC
Example 2: Synchronized error recovery
Example 3: Automatic configuration selection based on LHC mode
(see following slides)

Automation Example 1: Automatic reaction to LHC and DCS

DAQ actions on LHC and DCS state changes

Extensive automation has been added to DCS
  Automatic handshake with the LHC
  Automatic ramping of high voltages (HV) driven by LHC machine and beam mode

Some DAQ settings depend on the LHC and DCS states
  Suppress payload while HV is off (noise)
  Reduce gain while HV is off
  Mask sensitive channels while LHC ramps
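The dependency of DAQ settings on the LHC and DCS states listed above can be sketched as a pure function, with a new run section opened whenever the derived settings change. The function and class names, the setting keys and the "nominal"/"reduced" gain labels are assumptions made for this illustration; only the three rules themselves come from the slide.

```python
def daq_settings(tracker_hv_on: bool, lhc_ramping: bool) -> dict:
    """Derive the state-dependent DAQ settings from DCS and LHC states."""
    return {
        # Payload is suppressed while HV is off (it would only be noise).
        "payload_enabled": tracker_hv_on,
        # Gain is reduced while HV is off.
        "gain": "nominal" if tracker_hv_on else "reduced",
        # Sensitive channels are masked while the LHC ramps.
        "sensitive_channels_masked": lhc_ramping,
    }


class RunSections:
    """Opens a new run section whenever a notification changes the settings."""

    def __init__(self) -> None:
        self.section = 1
        self.settings = daq_settings(tracker_hv_on=False, lhc_ramping=False)

    def notify(self, tracker_hv_on: bool, lhc_ramping: bool) -> int:
        new_settings = daq_settings(tracker_hv_on, lhc_ramping)
        if new_settings != self.settings:
            self.settings = new_settings
            self.section += 1
        return self.section


if __name__ == "__main__":
    run = RunSections()
    print(run.notify(tracker_hv_on=False, lhc_ramping=True))   # ramp start
    print(run.notify(tracker_hv_on=False, lhc_ramping=False))  # ramp done
    print(run.notify(tracker_hv_on=True, lhc_ramping=False))   # HV on
```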

Initially, a new run had to be started with new settings: very error prone

Since 2010: automatic new run sections driven by asynchronous state notifications from DCS/LHC
[Diagram: LHC and DCS (Tracker, ECAL, ...) states reach the Level-0 node of the Run Control System through PSX, the PVSS SOAP eXchange XDAQ service]

Now everything is driven by the LHC

[Timeline: LHC dipole current and tracker HV over a fill, from a special run with circulating beam through ADJUST to STABLE BEAMS, with the run divided into sections]
Automatic actions in DAQ:
  LHC ramp start: mask sensitive trigger channels
  LHC ramp done: unmask sensitive trigger channels
  DCS ramps up tracker HV; tracker HV on: enable payload (Tk), raise gains (Pixel)
  DCS ramps down tracker HV; tracker HV off: disable payload (Tk), reduce gains (Pixel)
Start run at FLAT TOP. The rest is automatic.

Example 2: Automatic recovery from Soft Errors

Automatic soft error recovery

With higher instantaneous luminosity in 2011, soft errors caused the run to get stuck more and more frequently
  Proportional to integrated luminosity
  Believed to be due to single-event upsets

Recovery procedure
  Stop run (30 sec)
  Re-configure a sub-detector (2-3 min)
  Start new run (20 sec)
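As a quick check of the numbers, the three manual steps can be summed and compared with the 12 seconds quoted later in the talk for the automatic procedure. The step durations are taken from the slide; the variable and function names are just for this sketch.

```python
# (min, max) duration of each manual recovery step in seconds, from the slide.
MANUAL_STEPS_S = {
    "stop run": (30, 30),
    "re-configure sub-detector": (120, 180),
    "start new run": (20, 20),
}
AUTOMATIC_RECOVERY_S = 12  # down-time quoted for the automatic procedure


def manual_recovery_range_s(steps=MANUAL_STEPS_S):
    """Return the (min, max) total duration of the manual steps in seconds."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


if __name__ == "__main__":
    low, high = manual_recovery_range_s()
    print(f"manual steps alone: {low}-{high} s")
    print(f"saved per recovery: {low - AUTOMATIC_RECOVERY_S}-"
          f"{high - AUTOMATIC_RECOVERY_S} s")
```

The steps alone add up to roughly 3 to 4 minutes, which sits at the low end of the 3-10 min down-time quoted in the talk.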

[Plot: single-event upsets in the electronics of the Si-Pixel detector, proportional to integrated luminosity]
One single-event upset (needing recovery) every 73 pb^-1
3-10 min down-time

Automatic soft error recovery

From 2012, new automatic recovery procedure in top-level control node
  Sub-system detects soft error and signals by changing its state
  Top-level control node invokes recovery procedure
    Pause triggers
    Invoke newly defined selective recovery transition on requesting detector
    In parallel perform preventive recovery of other detectors
    Resynchronize
    Resume
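The recovery sequence above can be sketched as a small coordinator. This is an illustration only: the Detector class, the state names and the recover_soft_error function are hypothetical, and a thread pool stands in for whatever parallelism the Run Control framework actually provides.

```python
from concurrent.futures import ThreadPoolExecutor


class Detector:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state = "Running"
        self.recoveries = 0

    def recover(self) -> None:
        """Placeholder for the selective recovery transition."""
        self.recoveries += 1
        self.state = "Running"


def recover_soft_error(detectors: dict, requesting: str, preventive=()) -> list:
    """Run the recovery sequence and return a log of the steps taken."""
    log = ["pause triggers"]
    targets = [requesting, *preventive]
    # Recover the requesting detector and, in parallel, the preventive ones.
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        list(pool.map(lambda name: detectors[name].recover(), targets))
    log.append("recovered " + ", ".join(sorted(targets)))
    log.append("resynchronize")
    log.append("resume triggers")
    return log


if __name__ == "__main__":
    dets = {n: Detector(n) for n in ("pixel", "tracker", "ecal")}
    dets["pixel"].state = "SoftError"  # sub-system signals by changing state
    for step in recover_soft_error(dets, "pixel", preventive=("tracker",)):
        print(step)
```

Triggers are only paused rather than the run being stopped, so the run continues once the affected detectors are back.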

12 seconds down-time
At least 46 hours of down-time avoided in 2012

Example 3: Automatic Configuration Handling

Lots of options to parameterize the configuration

[Screenshot: top-level Run Control GUI]
Initially, flexibility was needed: many manual settings

First added shifter guidance

Added cross-checks and reminders
  When sub-systems need to be re-configured
    Because of clock instabilities
    Because of (external) configuration changes ...
  Correct order of operations enforced

Then simplified and automated

Configuration Handling (initial), 2009/2010

[Diagram: L1 trigger shifter and DAQ shifter each follow instructions ("If ... then ...") to set L1 TRG Key, L1 Run Setting, HLT Conf, Clock source, Sub Det A Mode ... Sub Det Z Mode]
Trigger configuration needs to be consistent
Synchronization by human communication

Configuration Handling (simplified), 2010

[Diagram: DAQ shifter; L1/HLT Trigger MODE in front of L1 TRG Key, L1 Run Setting, HLT Conf; Clock source, Sub Det A Mode ... Sub Det Z Mode]
Choices prepared by experts
Instructions: cosmics -> A1 B1 C1 D1; collisions -> A2 B2 C2 D2

Configuration Handling (Run Mode), 2012

[Diagram: DAQ shifter; RUN MODE in front of L1/HLT Trigger MODE, Clock source, Sub Det A Mode ... Sub Det Z Mode]
Choices prepared by experts
Instructions: cosmics -> mode 1; collisions -> mode 2

Configuration Handling (automatic), 2012

[Diagram: LHC machine / beam mode in front of RUN MODE, L1/HLT Trigger MODE, L1 TRG Key, L1 Run Setting, HLT Conf, Clock source, Sub Det A Mode ... Sub Det Z Mode]
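The final step, in which the run mode follows from the LHC machine / beam mode and expands into the individual settings, can be sketched as two lookups. Everything specific here is hypothetical: the contents of RUN_MODES, the mapping from beam modes to run modes, and the fallback to cosmics are placeholders for choices that, according to the slides, are prepared by experts.

```python
# Hypothetical expert-prepared expansion of each run mode into settings.
RUN_MODES = {
    "collisions": {
        "trigger_mode": "collisions",
        "clock_source": "LHC",
        "sub_detector_mode": "collisions",
    },
    "cosmics": {
        "trigger_mode": "cosmics",
        "clock_source": "local",
        "sub_detector_mode": "cosmics",
    },
}

# Hypothetical mapping from LHC beam mode to run mode.
BEAM_MODE_TO_RUN_MODE = {
    "FLAT TOP": "collisions",
    "ADJUST": "collisions",
    "STABLE BEAMS": "collisions",
}


def select_configuration(beam_mode: str) -> dict:
    """Pick the run mode for a beam mode and expand it into settings."""
    run_mode = BEAM_MODE_TO_RUN_MODE.get(beam_mode, "cosmics")
    return {"run_mode": run_mode, **RUN_MODES[run_mode]}


if __name__ == "__main__":
    print(select_configuration("STABLE BEAMS"))
    print(select_configuration("NO BEAM"))
```

With such a table, the shifter no longer has to keep several settings consistent by hand; a single input determines all of them.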

Central Expert System

[Diagram: system overview with the DAQ Doctor expert system added; labels: monitoring, computing farm monitoring, XDAQ monitoring & alarming (monitor collectors, Live Access Servers), Trigger Supervisor, Run Control System, DAQ operator; outputs: sound alerts, spoken alerts, advice, automatic actions]

The DAQ Doctor

Expert tool based on the same technology as
High-level scripting language (Perl)
Generic framework & pluggable modules
Detection of a high-level anomaly triggers further investigation
Archive (web based)
  All notes
  Sub-system errors
  CRC errors
Dumps (of all monitoring data) for expert analysis in case of anomalies

The DAQ Doctor

Diagnoses anomalies in
  L1 rate
  HLT physics stream rate
  Dead time
  Backpressure
  Resynchronization rate
  Farm health
  Event builder and HLT farm data flow
  HLT farm CPU utilization

Automatic actions
  Triggers computation of a new central DAQ configuration in case of PC hardware failure
  (great help for on-call experts since 2012)
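The pattern of a generic framework with pluggable modules, where a high-level anomaly triggers further investigation, can be sketched as below. The DAQ Doctor itself is written in Perl; this Python sketch is not its code, and the metric names, the 5% dead-time threshold and the module functions are assumptions chosen for the example.

```python
def dead_time_anomaly(monitoring: dict) -> bool:
    """High-level check: is the dead time above a (hypothetical) threshold?"""
    return monitoring.get("dead_time_pct", 0.0) > 5.0


def investigate_backpressure(monitoring: dict):
    sources = monitoring.get("backpressure_from", [])
    if sources:
        return {"note": "backpressure from " + ", ".join(sources), "action": None}
    return None


def investigate_farm(monitoring: dict):
    failed = monitoring.get("failed_pcs", [])
    if failed:
        return {
            "note": "PC hardware failure: " + ", ".join(failed),
            "action": "compute new central DAQ configuration",
        }
    return None


# Pluggable follow-up modules, run only after a high-level anomaly.
FOLLOW_UP_MODULES = [investigate_backpressure, investigate_farm]


def diagnose(monitoring: dict, modules=FOLLOW_UP_MODULES):
    """Return (notes, automatic actions) for one snapshot of monitoring data."""
    if not dead_time_anomaly(monitoring):
        return [], []
    findings = [f for f in (module(monitoring) for module in modules) if f]
    notes = ["high dead time"] + [f["note"] for f in findings]
    actions = [f["action"] for f in findings if f["action"]]
    return notes, actions


if __name__ == "__main__":
    snapshot = {"dead_time_pct": 40.0, "failed_pcs": ["pc-17"]}
    print(diagnose(snapshot))
```

New diagnoses are added by appending a function to the module list, leaving the framework itself untouched.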

Operational Efficiency

[Photo: CMS control room, Cessy, France]
CMS data taking efficiency (stable beams): 2010: 90.7%, 2011: 93.1%, 2012: 94.8%
Central DAQ availability > 99.6% in all three years

Summary & Outlook

Summary

During Run 1 of the LHC, routine actions and error recovery have largely been automated in the CMS DAQ

Two approaches were followed
  Integrated into the control tree
    Local, distributed
  Central expert system
    For diagnosis, advice & automation that changes the system structure

Automation allowed us to
  Run CMS with a non-expert crew
  Increase data taking efficiency despite frequent single-event upsets at high luminosity
  Significantly ease the load on the on-call experts

Outlook

Automation integrated into the DAQ control tree served us well
  Knowledge encapsulated (rather than centralized)
  Fast & reliable
  Plan to cover more error scenarios using this approach

Expert analysis of the central DAQ system will change significantly due to the DAQ & Filter Farm upgrade
  Planning to include more information sources in analysis: errors, alarms ...
  We are currently investigating the use of a complex event processing engine (Esper)
    Powerful query language
    Rather steep learning curve

On our wish list
  Completely automated operation: start of collision, cosmic and calibration runs
  Correlation of errors: root cause finding
  ... stay tuned for CHEP 2015

#87: P. Zejdl: 10 Gbps TCP/IP streams from FPGA for HEP
#72: R. Mommsen: Prototype of a file-based high level trigger in CMS
#139: A. Holzner: The new CMS DAQ system for LHC operation after 2014

Thank You