The Terabit/s Super-Fragment Builder and Trigger Throttling
System for CMS
Automating the CMS DAQ
A review of automation features added to CMS during Run 1 of the LHC

20th International Conference on Computing in High Energy and Nuclear Physics (CHEP)
Amsterdam, The Netherlands, Oct 2013
Hannes Sakulin, CERN/PH
on behalf of the CMS DAQ group

Automating the CMS DAQ, CHEP, Oct 17, 2013, Amsterdam. H. Sakulin / CERN PH

Overview

Automation features added to the CMS DAQ over Run 1 of the LHC
- added gradually
- as we learned from operation
- as new requirements emerged
- as new failure scenarios became apparent

These features were needed
- to operate CMS with high data-taking efficiency
- to operate CMS with a non-expert crew
- to reduce the workload for on-call experts

Definition: Automation

In principle, anything done by the experiment's online software ultimately falls into the domain of automation.

For this talk we define automation to be
- The automation of routine tasks that would otherwise be performed by the operator or on-call expert (or which were initially performed by the operator)
- Automatic diagnosis
- Automatic recovery from known error conditions

System Overview

The Compact Muon Solenoid Experiment
- General-purpose detector at the LHC
- 12 sub-systems
- 55 million readout channels

The first-level trigger
- 40 racks of custom electronics
[Diagram labels: L1 trigger electronics; low voltage, high voltage, gas, magnet; front-end electronics; data; control]

Sub-detector DAQ electronics
- Front-end Controllers
- Front-end Drivers delivering data to the central DAQ
- 700 links
[Diagram labels: as above, plus sub-detector DAQ electronics]

The central DAQ
- Builds events at 100 kHz, 100 GB/s
- 2 stages: Myrinet, Gigabit Ethernet
- 8 independent event builder / filter slices
- High-level trigger running on the filter farm: ~1200 PCs, ~13000 cores (2012)
[Diagram labels: L1 trigger electronics; sub-detector DAQ electronics; low voltage, high voltage, gas, magnet; front-end electronics; Frontend Readout Links; central DAQ electronics; central DAQ farm, High Level Trigger & storage; filter farm; data; control]
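
The two throughput figures quoted for the central DAQ imply an average event size, and the slice count implies a per-slice load. The following is a minimal arithmetic sketch of that reasoning. It assumes decimal units (1 GB = 1e9 bytes) and an even split of the load across the 8 slices; neither assumption is stated on the slide, and the function names are illustrative only.

```python
# Figures quoted on the slide. Decimal units (1 GB = 1e9 bytes) are an assumption.
EVENT_RATE_HZ = 100e3         # 100 kHz event-building rate
THROUGHPUT_BYTES_S = 100e9    # 100 GB/s aggregate throughput
N_SLICES = 8                  # independent event builder / filter slices


def average_event_size_bytes(throughput=THROUGHPUT_BYTES_S, rate=EVENT_RATE_HZ):
    """Average event size implied by aggregate throughput and event rate."""
    return throughput / rate


def per_slice_load(throughput=THROUGHPUT_BYTES_S, rate=EVENT_RATE_HZ, slices=N_SLICES):
    """Event rate and throughput per slice, assuming an even split (an assumption)."""
    return rate / slices, throughput / slices


if __name__ == "__main__":
    print(average_event_size_bytes())   # bytes per event
    print(per_slice_load())             # (events per second, bytes per second) per slice
```

Under these assumptions the average event is about 1 MB, and each slice handles 12.5 kHz and 12.5 GB/s.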

The online software
- XDAQ Framework (C++): hardware access, transport protocols, XML configuration, SOAP communication, HyperDAQ web server
- XDAQ applications control hardware and handle data flow
- ~20000 applications to control
[Diagram labels: L1 trigger electronics; sub-detector DAQ electronics; central DAQ electronics; central DAQ farm, High Level Trigger & storage; front-end electronics; XDAQ Online Software; CMSSW; XDAQ Application; data; control]

The online software
- Run Control System (Java, web technologies): defines the control structure
- Function Manager: node in the Run Control tree; defines a state machine & parameters
- User function managers are dynamically loaded into the web application
- Run Control Web Application: Apache Tomcat servlet container, Java Server Pages, tag libraries
- GUI in a web browser: HTML, CSS, JavaScript, AJAX
- Communication: Web Services (WSDL, Axis, SOAP)
[Diagram labels: Run Control System tree with Level-0, DAQ, Trigger, Slice, Slice, ECAL, Tracker, DCS, DQM; XDAQ Online Software; CMSSW; hardware layers as on the previous slides]

The online software
- Detector Control System (Tracker, ECAL): WinCC OA (Siemens ETM), SMI++ (CERN)
- XDAQ monitoring & alarming: monitor collectors (TRG, DAQ), Live Access Servers
- Expert system
- Shift crew: DAQ Operator, DCS Operator, DQM shifter, Trigger Shifter, Shift Leader
[Diagram labels: errors; alarms; monitor clients; Run Control System tree with Level-0, DAQ, Trigger, Slice, Slice, ECAL, Tracker, DCS, DQM; XDAQ Online Software; CMSSW; hardware layers as on the previous slides]

Adding automation

Principles of Automation in CMS
- Integrated into the control tree
  - As local as possible
  - Control nodes react to the state of their child nodes
  - Detailed knowledge stays encapsulated
  - Propagate information only as far up as necessary
  - Implemented in the Run Control & XDAQ frameworks
  - Distributed
  - Parallel
- In addition: external automation
  - For recovery that changes the system structure
  - E.g. re-computation of the configuration after a hardware failure
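
The principle that control nodes react to the state of their children, and pass information up only as far as necessary, can be sketched as a small tree walk. This is an illustrative Python sketch, not the Run Control implementation (which is Java). The `ControlNode` class, its fields, the state names and the `can_recover_locally` flag are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ControlNode:
    """Hypothetical node in a control tree; all names here are illustrative."""
    name: str
    children: List["ControlNode"] = field(default_factory=list)
    state: str = "Running"
    detail: Optional[str] = None        # detailed diagnosis stays local to the node
    can_recover_locally: bool = False   # assumed flag: node knows how to fix its children

    def handle(self) -> str:
        """React to the children's states and return the state the parent will see."""
        child_states = [child.handle() for child in self.children]
        if "Error" in child_states:
            if self.can_recover_locally:
                # Recover here; the error is never propagated further up.
                for child in self.children:
                    if child.state == "Error":
                        child.state, child.detail = "Running", None
                self.state = "Running"
            else:
                # Only the coarse state travels upwards, not the detail.
                self.state = "Error"
        elif self.children:
            self.state = "Running"
        return self.state


if __name__ == "__main__":
    bad_slice = ControlNode("Slice", state="Error", detail="flush failed")
    daq = ControlNode("DAQ", children=[bad_slice], can_recover_locally=True)
    top = ControlNode("Level-0", children=[daq, ControlNode("Tracker")])
    print(top.handle(), bad_slice.state)
```

In this sketch the top-level node never learns about the slice problem when the DAQ node can resolve it, which is the encapsulation the slide describes.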

Integrated Automation
a) at the XDAQ layer

Automation in the XDAQ layer
- XDAQ application detects a problem and takes corrective action
- How
  - Using framework functionality (work loops)
  - Full flexibility of C++ (e.g. Unix signals)
- Example: Event Processors (High-Level Trigger)
  - Child processes are forked to process in parallel
  - Reduces memory footprint (Copy-On-Write)
  - Reduces configuration time
  - If a child process crashes, a new one is created
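
The fork-and-replace pattern described for the Event Processors can be sketched with the standard Unix process calls. The real applications are C++ XDAQ applications; the Python below is only a stand-in to show the control flow. The function `run_workers`, its parameters and the restart limit are hypothetical. It requires a Unix-like system because it uses `os.fork`.

```python
import os


def run_workers(n_workers, work, max_restarts=3):
    """Fork n_workers children running work(slot); re-fork any child that crashes.

    Returns the number of replacement children created. Unix only.
    """
    def spawn(slot):
        pid = os.fork()
        if pid == 0:
            # Child: starts with a copy-on-write view of the parent's memory,
            # so configuration loaded before the fork is not duplicated.
            try:
                code = work(slot)
            except BaseException:
                code = 1
            os._exit(code or 0)   # never return into the parent's code path
        return pid

    slots = {spawn(slot): slot for slot in range(n_workers)}
    restarts = 0
    while slots:
        pid, status = os.wait()               # block until some child terminates
        slot = slots.pop(pid, None)
        if slot is None:
            continue                          # not one of our workers
        clean_exit = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        if not clean_exit and restarts < max_restarts:
            restarts += 1
            slots[spawn(slot)] = slot         # replace the crashed child
    return restarts
```

A call such as `run_workers(4, process_events)` would keep four workers alive until each has exited cleanly, where `process_events` is a placeholder for the per-child processing function.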

Integrated Automation
b) at the Run Control layer

Automation in Run Control
- How
  - Function Managers receive asynchronous updates from their children: state updates, monitoring data
  - Function Managers define event handlers (Java) reacting to these notifications
  - Full flexibility of Java
- Examples
  - Exclude crashed resources from control
  - Recover flush failure: when stopping a run, sometimes a DAQ slice cannot be flushed. The corresponding function manager detects this and locally recovers this slice
[Diagram label: Function Manager]
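
The handler mechanism described above, with asynchronous notifications dispatched to event handlers, can be sketched as follows. The real function managers are written in Java inside the Run Control framework; this Python sketch only illustrates the dispatch idea. The class `SliceFunctionManager`, the state names `Crashed` and `FlushFailed`, the resource names and the `Halted` target state are hypothetical.

```python
class SliceFunctionManager:
    """Hypothetical function manager for one DAQ slice (illustration only)."""

    def __init__(self, resources):
        self.controlled = set(resources)   # resources currently under control
        self.state = "Running"
        self.log = []
        # Event handlers, keyed by the state a child reports.
        self.handlers = {
            "Crashed": self.on_crashed,
            "FlushFailed": self.on_flush_failed,
        }

    def notify(self, resource, new_state):
        """Entry point for an asynchronous state update from a child."""
        handler = self.handlers.get(new_state)
        if handler is not None and resource in self.controlled:
            handler(resource)

    def on_crashed(self, resource):
        # Exclude the crashed resource so later commands skip it.
        self.controlled.discard(resource)
        self.log.append(f"excluded {resource}")

    def on_flush_failed(self, resource):
        # Recover locally; the parent node is not involved.
        self.log.append(f"recovering slice after flush failure of {resource}")
        self.state = "Halted"   # assumed target state after local recovery


if __name__ == "__main__":
    fm = SliceFunctionManager(["ru-01", "bu-01"])
    fm.notify("ru-01", "Crashed")
    fm.notify("bu-01", "FlushFailed")
    print(fm.log)
```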

Automation in the Run Control Top-level Node
- For actions that need coordination between multiple sub-systems
- Example 1: Automatic actions in the DAQ in response to condition changes in the DCS and the LHC
- Example 2: Synchronized error recovery
- Example 3: Automatic configuration selection based on LHC mode
(see following slides)
[Diagram label: Function Manager]

Automation Example 1: Automatic reaction to LHC and DCS

DAQ actions on LHC and DCS state changes
- Extensive automation has been added to the DCS
  - Automatic handshake with the LHC
  - Automatic ramping of high voltages (HV) driven by LHC machine and beam mode
- Some DAQ settings depend on the LHC and DCS states
  - Suppress payload while HV is off (noise)
  - Reduce gain while HV is off
  - Mask sensitive channels while LHC ramps
- Initially, a new run had to be started with the new settings: very error prone
- Since 2010: automatic new run sections driven by asynchronous state notifications from DCS/LHC
[Diagram labels: LHC; Detector Control System (Tracker, ECAL); PSX (PVSS SOAP eXchange) XDAQ service; Run Control System (Level-0, DAQ, Tracker, DCS)]

Now everything is driven by the LHC
- LHC dipole current
  - ramp start -> mask sensitive trigger channels
  - ramp done -> unmask sensitive trigger channels
- DCS ramps up tracker HV
  - Tracker HV on -> enable payload (Tk), raise gains (Pixel)
- DCS ramps down tracker HV
  - Tracker HV off -> disable payload (Tk), reduce gains (Pixel)
- Start run at FLAT TOP. The rest is automatic.
[Timeline labels: automatic actions in DAQ; special run with circulating beam; ADJUST; STABLE BEAMS; run sections 1 and 2]
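
The mapping from LHC and DCS notifications to DAQ actions can be sketched as a lookup table plus a run-section counter. This is an illustrative Python sketch only. The event names, the `Run` class and the rule that every handled notification opens exactly one new run section are assumptions made for the example; the action strings are taken from the slide.

```python
# Notification -> DAQ actions, following the slide. Event names are hypothetical.
ACTIONS = {
    "LHC_RAMP_START": ["mask sensitive trigger channels"],
    "LHC_RAMP_DONE": ["unmask sensitive trigger channels"],
    "TRACKER_HV_ON": ["enable payload (Tk)", "raise gains (Pixel)"],
    "TRACKER_HV_OFF": ["disable payload (Tk)", "reduce gains (Pixel)"],
}


class Run:
    """Hypothetical ongoing run that is split into sections as settings change."""

    def __init__(self):
        self.section = 1
        self.history = []   # (section, action) pairs, in the order applied

    def on_notification(self, event):
        """Apply the actions for an asynchronous DCS/LHC notification."""
        actions = ACTIONS.get(event, [])
        if not actions:
            return self.section          # unknown event: nothing changes
        self.section += 1                # assumed: new settings open a new section
        for action in actions:
            self.history.append((self.section, action))
        return self.section


if __name__ == "__main__":
    run = Run()
    for event in ["LHC_RAMP_START", "LHC_RAMP_DONE", "TRACKER_HV_ON"]:
        run.on_notification(event)
    print(run.section, run.history)
```

The point of the design is that the run keeps going: settings change at a section boundary instead of requiring the operator to stop and restart.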

Example 2: Automatic recovery from Soft Errors

Automatic soft error recovery
- With higher instantaneous luminosity in 2011, more and more frequent soft errors caused the run to get stuck
  - Proportional to integrated luminosity
  - Believed to be due to single-event upsets
- Recovery procedure
  - Stop run (30 sec)
  - Re-configure a sub-detector (2-3 min)
  - Start new run (20 sec)
  - 3-10 min down-time
[Plot: single-event upsets in the electronics of the Si-Pixel detector, proportional to integrated luminosity. One single-event upset (needing recovery) every 73 pb^-1]

Automatic soft error recovery
- From 2012: new automatic recovery procedure in the top-level control node
  - Sub-system detects a soft error and signals it by changing its state
  - Top-level control node invokes the recovery procedure
    1. Pause triggers
    2. Invoke newly defined selective recovery transition on the requesting detector
    3. In parallel, perform preventive recovery of other detectors
    4. Resynchronize
    5. Resume
- 12 seconds down-time
- At least 46 hours of down-time avoided in 2012
[Diagram label: Function Manager]
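
The recovery sequence run by the top-level node can be sketched as follows: pause, recover the requesting detector and the others in parallel, resynchronize, resume. This Python sketch only illustrates the ordering and the parallel step. The `Detector` class, the state names and the use of a thread pool are assumptions; the real procedure is implemented in the Java function manager.

```python
from concurrent.futures import ThreadPoolExecutor


class Detector:
    """Hypothetical sub-system with a selective recovery transition."""

    def __init__(self, name):
        self.name = name
        self.state = "Running"
        self.recoveries = 0

    def recover(self):
        self.recoveries += 1
        self.state = "Running"
        return self.name


def soft_error_recovery(requesting, others, trigger_log):
    """Run the recovery steps in order; detector recoveries happen in parallel."""
    trigger_log.append("pause")
    with ThreadPoolExecutor() as pool:
        # Selective recovery of the requesting detector and preventive
        # recovery of the others are submitted together.
        futures = [pool.submit(d.recover) for d in [requesting, *others]]
        recovered = [f.result() for f in futures]   # wait for all to finish
    trigger_log.append("resynchronize")
    trigger_log.append("resume")
    return recovered


if __name__ == "__main__":
    pixel = Detector("Pixel")
    pixel.state = "SoftError"
    log = []
    print(soft_error_recovery(pixel, [Detector("ECAL")], log), log)
```

For scale, the step durations quoted for the manual procedure (30 sec, 2-3 min, 20 sec) add up to 170 to 230 seconds, against the 12 seconds quoted for the automatic procedure.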

Example 3: Automatic Configuration Handling

Lots of options to parameterize the configuration
- Initially flexibility needed: many manual settings
[Screenshot: top-level Run Control GUI]

First added shifter guidance
- Added cross-checks and reminders
  - When sub-systems need to be re-configured
    - Because of clock instabilities
    - Because of (external) configuration changes
    - ..
  - Correct order of operations enforced
- Then simplified and automated

Configuration Handling (initial), 2009/2010
- Settings: L1 TRG Key, HLT Conf, Clock source, Sub Det A Mode, Sub Det Z Mode, L1 Run Setting
- L1 trigger shifter and DAQ shifter each follow instructions ("If then" rules)
- Synchronization by human communication
- Trigger configuration needs to be consistent

Configuration Handling (simplified), 2010
- Settings: L1 TRG Key, HLT Conf, Clock source, Sub Det A Mode, Sub Det Z Mode, L1 Run Setting
- DAQ shifter sets the L1/HLT Trigger MODE; choices prepared by experts
- Instructions
  - cosmics -> A1 B1 C1 D1
  - collisions -> A2 B2 C2 D2

Configuration Handling (Run Mode), 2012
- Settings: L1 TRG Key, HLT Conf, Clock source, Sub Det A Mode, Sub Det Z Mode, L1 Run Setting
- DAQ shifter sets the RUN MODE, which covers the L1/HLT Trigger MODE; choices prepared by experts
- Instructions
  - cosmics -> mode 1
  - collisions -> mode 2

Configuration Handling (automatic), 2012
- Settings: L1 TRG Key, HLT Conf, Clock source, Sub Det A Mode, Sub Det Z Mode, L1 Run Setting
- L1/HLT Trigger MODE and RUN MODE are driven by the LHC machine / beam mode
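
The final step, selecting the run mode from the LHC machine / beam mode and expanding it into individual settings, can be sketched as two lookups. This is an illustrative Python sketch only. The values A1..D2 are the placeholders used on the slides; which placeholder belongs to which setting, the dictionary key names, the mapping from beam mode to run mode and the fallback to cosmics are all assumptions made for the example.

```python
# Expert-prepared choices per run mode. The pairing of placeholders with
# settings is an assumption; the slides only give "A1 B1 C1 D1" and "A2 B2 C2 D2".
RUN_MODE_SETTINGS = {
    "cosmics": {
        "l1_trg_key": "A1", "hlt_conf": "B1",
        "clock_source": "C1", "sub_det_mode": "D1",
    },
    "collisions": {
        "l1_trg_key": "A2", "hlt_conf": "B2",
        "clock_source": "C2", "sub_det_mode": "D2",
    },
}

# Hypothetical mapping from LHC beam mode to run mode.
LHC_MODE_TO_RUN_MODE = {
    "STABLE BEAMS": "collisions",
    "NO BEAM": "cosmics",
}


def select_configuration(lhc_beam_mode, default_run_mode="cosmics"):
    """Return the full set of settings implied by the current LHC beam mode."""
    run_mode = LHC_MODE_TO_RUN_MODE.get(lhc_beam_mode, default_run_mode)
    return {"run_mode": run_mode, **RUN_MODE_SETTINGS[run_mode]}


if __name__ == "__main__":
    print(select_configuration("STABLE BEAMS"))
```

Keeping the expert-prepared choices in one table means the shifter, or the automation, picks a single value and the individual settings stay consistent with each other.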

Central Expert System

- DAQ Doctor expert system: monitoring, sound alerts, spoken alerts, advice, automatic actions
- XDAQ monitoring & alarming: monitor collectors (TRG, DAQ), Live Access Servers
- Computing farm monitoring
[Diagram labels: DAQ Operator; Run Control System tree with Level-0, DAQ, Trigger, Slice, Slice, ECAL, Tracker, DCS, DQM; Trigger Supervisor; L1 trigger electronics; sub-detector DAQ electronics; central DAQ electronics; central DAQ farm, High Level Trigger & storage; XDAQ Online Software; CMSSW; data]

The DAQ Doctor
- Expert tool based on the same technology as [image not recoverable]
- High-level scripting language (Perl)
- Generic framework & pluggable modules
- Detection of a high-level anomaly triggers further investigation
- Archive (web based): all notes, sub-system errors, CRC errors
- Dumps (of all monitoring data) for expert analysis in case of anomalies

The DAQ Doctor
- Diagnoses anomalies in
  - L1 rate
  - HLT physics stream rate
  - Dead time
  - Backpressure
  - Resynchronization rate
  - Farm health
  - Event builder and HLT farm data flow
  - HLT farm CPU utilization
- Automatic actions
  - Triggers computation of a new central DAQ configuration in case of PC hardware failure (great help for on-call experts since 2012)
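
The structure described for the DAQ Doctor, a generic framework with pluggable modules where a high-level anomaly triggers further investigation, can be sketched as follows. The actual tool is written in Perl; this Python sketch only illustrates the two-stage idea. The `Doctor` class, the monitoring keys, the thresholds and the note texts are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Snapshot = Dict[str, float]                    # one set of monitoring values
Module = Callable[[Snapshot], Optional[str]]   # returns a note, or None if all is well


class Doctor:
    """Hypothetical two-stage diagnosis framework with pluggable modules."""

    def __init__(self):
        self.anomaly_checks: List[Module] = []
        self.investigations: Dict[str, List[Module]] = {}

    def add_check(self, check: Module) -> None:
        self.anomaly_checks.append(check)

    def add_investigation(self, anomaly: str, module: Module) -> None:
        self.investigations.setdefault(anomaly, []).append(module)

    def diagnose(self, snapshot: Snapshot) -> List[str]:
        notes = []
        for check in self.anomaly_checks:
            anomaly = check(snapshot)
            if anomaly is None:
                continue
            notes.append(anomaly)
            # Only a detected anomaly triggers the more detailed modules.
            for module in self.investigations.get(anomaly, []):
                finding = module(snapshot)
                if finding:
                    notes.append(finding)
        return notes


def dead_time_check(s: Snapshot) -> Optional[str]:
    # Threshold is an arbitrary value chosen for the example.
    return "high dead time" if s.get("dead_time_pct", 0.0) > 5.0 else None


def backpressure_probe(s: Snapshot) -> Optional[str]:
    # Threshold is an arbitrary value chosen for the example.
    return "backpressure present" if s.get("backpressure_pct", 0.0) > 1.0 else None


if __name__ == "__main__":
    doctor = Doctor()
    doctor.add_check(dead_time_check)
    doctor.add_investigation("high dead time", backpressure_probe)
    print(doctor.diagnose({"dead_time_pct": 12.0, "backpressure_pct": 3.0}))
```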

Operational Efficiency
[Photo: CMS control room, Cessy, France]

CMS data taking efficiency (stable beams):
- 2010: 90.7%
- 2011: 93.1%
- 2012: 94.8%
Central DAQ availability > 99.6% in all three years

Summary & Outlook

Summary
- During Run 1 of the LHC, routine actions and error recovery have largely been automated in the CMS DAQ
- Two approaches were followed
  - Integrated into the control tree: local, distributed
  - Central expert system: for diagnosis, advice & automation that changes the system structure
- Automation allowed us to
  - Run CMS with a non-expert crew
  - Increase data taking efficiency despite frequent single-event upsets at high luminosity
  - Significantly ease the load on the on-call experts

Outlook
- Automation integrated into the DAQ control tree served us well
  - Knowledge encapsulated (rather than centralized)
  - Fast & reliable
  - Plan to cover more error scenarios using this approach
- Expert analysis of the central DAQ system will change significantly due to the DAQ & Filter Farm upgrade
  - Planning to include more information sources in analysis: errors, alarms
  - We are currently investigating the use of a complex event processing engine (Esper)
    - Powerful query language
    - Rather steep learning curve
- On our wish list
  - Completely automated operation: start of collision, cosmic and calibration runs
  - Correlation of errors: root cause finding
  - stay tuned for CHEP 2015

#87: P. Zejdl: 10 Gbps TCP/IP streams from FPGA for HEP
#72: R. Mommsen: Prototype of a file-based high level trigger in CMS
#139: A. Holzner: The new CMS DAQ system for LHC operation after 2014

Thank You