LHCb Trigger & DAQ: an Introductory Overview
Niko Neufeld, CERN/PH Department
Yandex, Moscow, July 3rd 2012
Seminar "Using modern information technologies to solve modern problems in particle physics", Yandex Moscow office, July 3, 2012

Niko Neufeld, CERN
Transcript
Page 1: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

LHCb Trigger & DAQ: an Introductory Overview

Niko Neufeld, CERN/PH Department

Yandex, Moscow, July 3rd 2012

Page 2: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

The Large Hadron Collider


Page 3: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Physics, Detectors, Trigger & DAQ


[Block diagram: rare physics needs many collisions → a high-rate collider and fast electronics; detector signals feed the Trigger and the Data Acquisition, trigger decisions steer the Event Filter, and the selected data flow to Mass Storage.]

Page 4: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


The Data Acquisition Challenge at LHC

• 15 million detector channels
• sampled at 40 MHz
• ≈ 15,000,000 channels × 40,000,000 samples/s × 1 byte

• ≈ 600 TB/s
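As a back-of-the-envelope check of this arithmetic (a minimal sketch; the one byte per channel per sampling is the assumption already implicit in the slide's numbers):

```python
# Naive LHC detector data rate, assuming ~1 byte per channel per sampling
channels = 15_000_000        # detector channels
sampling_rate = 40_000_000   # Hz (LHC bunch-crossing rate)
bytes_per_sample = 1         # assumption behind the slide's estimate

rate = channels * sampling_rate * bytes_per_sample   # bytes per second
print(f"{rate / 1e12:.0f} TB/s")                     # -> 600 TB/s
```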


Page 5: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Should we read everything?

• A typical collision is “boring”
  – Although we also need some of these “boring” data as a cross-check, as a calibration tool, and for some important “low-energy” physics

• “Interesting” physics is about 6–8 orders of magnitude rarer (EWK & Top)

• “Exciting” physics involving new particles/discoveries is 9 orders of magnitude below σ_tot
  – 100 GeV Higgs: 0.1 Hz*
  – 600 GeV Higgs: 0.01 Hz

• We just need to efficiently identify these rare processes from the overwhelming background before reading out & storing the whole event

[Rate scale shown on the slide: total ≈ 10^9 Hz; ≈ 5 x 10^6 Hz; EWK: 20 – 100 Hz; ≈ 10 Hz]

*Note: this is just the production rate, properly finding it is much rarer!

Page 6: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Know Your Enemy: pp Collisions at 14 TeV at 10^34 cm^-2 s^-1

• σ(pp) = 70 mb → > 7 x 10^8 interactions/s (!)

• In ATLAS and CMS* 20 – 30 minimum-bias events overlap

• H → ZZ → 4 muons: the cleanest (“golden”) signature

Reconstructed tracks with pt > 25 GeV

And this (not the H though…)

repeats every 25 ns…

*) LHCb @ 4 x 10^33 cm^-2 s^-1 isn’t much nicer, and ALICE (Pb-Pb) is even busier
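A quick sanity check of these numbers (a minimal sketch; the number of filled bunches is an assumption of mine, not something stated on the slide):

```python
# Interaction rate and pile-up at nominal LHC conditions
sigma_pp   = 70e-27    # inelastic pp cross-section, 70 mb expressed in cm^2
luminosity = 1e34      # cm^-2 s^-1
spacing    = 25e-9     # bunch spacing in s

rate = sigma_pp * luminosity          # ~7e8 interactions per second
pileup_avg = rate * spacing           # averaged over all 25 ns slots
# Assumption: ~2808 filled bunches out of 3564 possible 25 ns slots
pileup = pileup_avg * 3564 / 2808

print(f"rate ≈ {rate:.1e}/s, pile-up ≈ {pileup:.0f} per filled crossing")
```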

Page 7: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Trivial DAQ with a real trigger 2


[Diagram: the Sensor feeds a Delay → ADC → Processing → storage chain and, in parallel, a Discriminator forming the Trigger; a Busy Logic (Set/Clear flip-flop with an AND-NOT gate) lets the Trigger Start the ADC only while the system is not busy; the ADC raises an Interrupt towards Processing, and Ready clears the busy state when done.]

Deadtime (%) is the ratio between the time the DAQ is busy and the total time.
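To make the deadtime definition concrete, here is a minimal simulation sketch (the Poisson trigger arrivals and the fixed per-event busy time are illustrative assumptions; the busy logic simply ignores triggers that arrive while the DAQ is busy):

```python
import random

def simulate_deadtime(trigger_rate=100e3, busy_time=5e-6, duration=1.0, seed=7):
    """Return (deadtime fraction, fraction of triggers accepted)."""
    rng = random.Random(seed)
    t, busy_until = 0.0, 0.0
    busy_total, triggers, accepted = 0.0, 0, 0
    while t < duration:
        t += rng.expovariate(trigger_rate)   # next trigger (Poisson arrivals)
        triggers += 1
        if t >= busy_until:                  # DAQ ready: accept and become busy
            accepted += 1
            busy_until = t + busy_time
            busy_total += busy_time
    return busy_total / duration, accepted / triggers

dead, eff = simulate_deadtime()
print(f"deadtime ≈ {dead:.1%}, accepted triggers ≈ {eff:.1%}")
```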

Page 8: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

A “simple” 40 MHz track trigger – the LHCb PileUp system


Page 9: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Finding vertices in FPGAs

• Use r-coordinates of hits in Si-detector discs (detector geometry made for this task!)

• Find coincidences between hits on two discs

• Count & histogram
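The same idea expressed in software (a minimal sketch: the disc positions, the toy track model and the 5 mm histogram binning are illustrative assumptions, not the FPGA firmware; the straight-line extrapolation z_v = (r2·z1 − r1·z2)/(r2 − r1) is the geometry behind the coincidence):

```python
import random
from collections import Counter

Z1, Z2 = -220.0, -300.0     # assumed z positions (mm) of the two silicon discs

def vertex_z(r1, r2):
    """z where the straight line through (Z1, r1) and (Z2, r2) meets the beam axis."""
    return (r2 * Z1 - r1 * Z2) / (r2 - r1)

# Toy event: hits from straight tracks coming from two primary vertices
random.seed(3)
hits1, hits2 = [], []
for zv in (-15.0, 40.0):                    # true vertex positions (mm)
    for _ in range(20):
        slope = random.uniform(0.02, 0.3)   # dr/d|z| of a toy track
        hits1.append(slope * (zv - Z1))
        hits2.append(slope * (zv - Z2))

# Coincidences: histogram the extrapolated z of every hit pair, then count
histo = Counter()
for r1 in hits1:
    for r2 in hits2:
        if r2 > r1:                                       # pairs pointing back towards the beam spot
            histo[round(vertex_z(r1, r2) / 5) * 5] += 1   # 5 mm bins

for z, n in histo.most_common(3):                         # the most populated bins mark the vertices
    print(f"z ≈ {z:+.0f} mm: {n} coincidences")
```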


Page 10: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

LHCb Pileup: finding multiple vertices and quality


Comparing with the “offline” truth (full tracking, calibration, alignment)

Page 11: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

LHCb Pileup Algorithm

• Time-budget for this algorithm: about 2 µs

• Runs in conventional FPGAs in a radiation-safe area

• Limited to low pile-up (ok for LHCb)


Page 12: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

After the Trigger: Detector Read-out and DAQ

Page 13: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

DAQ design guidelines

• Scalability – change in event-size, luminosity (pile-up!)
• Robust (very little dead-time, high efficiency, non-expert operators) → intelligent control systems
• Use industry-standard, commercial technologies (long-term maintenance) → PCs, Ethernet
• Low cost → PCs, standard LANs
• High bandwidth (many Gigabytes/s) → use local area networks (LAN)
• “Creative” & “Flexible” (open for new things) → use software and reconfigurable logic (FPGAs)


Page 14: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

One network to rule them all

• Ethernet, IEEE 802.3xx, has almost become synonymous with Local Area Networking
• Ethernet has many nice features: cheap, simple, cheap, etc…
• Ethernet does not:
  – guarantee delivery of messages
  – allow multiple network paths
  – provide quality of service or bandwidth assignment (albeit to a varying degree this is provided by many switches)
• Because of this raw Ethernet is rarely used; usually it serves as a transport medium for IP, UDP, TCP etc…


Ethernet flow control

• Flow-control in standard Ethernet is only defined between immediate neighbors
• A sending station is free to throw away Xoff’ed frames (and often does)

Page 15: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Generic DAQ implemented on a LAN


Typical numbers of pieces:

Detector                                     1
Custom links from the detector               1000
“Readout Units” for protocol adaptation      100 to 1000
Powerful core routers                        2 to 8
Edge switches                                50 to 100
Servers for event filtering                  > 1000

Page 16: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Congestion

2• "Bang" translates into

random, uncontrolled packet-loss

• In Ethernet this is perfectly valid behavior and implemented by many low-latency devices

• This problem comes from synchronized sources sending to the same destination at the same time

• Either a higher level “event-building” protocol avoids this congestion or the switches must avoid packet loss with deep buffer memories
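A rough estimate of the buffering this implies for a store & forward switch (a minimal sketch; the number of sources, fragment size and link speed are illustrative values, not the parameters of a specific experiment):

```python
# Worst case: N synchronized sources send a fragment to the same destination at once;
# everything that cannot drain through the output link must sit in switch buffers.
n_sources  = 300          # synchronized readout sources (illustrative)
fragment   = 2_000        # bytes per source and event (illustrative)
link_speed = 10e9 / 8     # 10 Gbit/s output link, in bytes/s

burst = n_sources * fragment           # data converging on one output port
drain_time = burst / link_speed        # time needed to forward the burst
buffer_needed = burst - fragment       # upper bound: all but one fragment queued

print(f"burst {burst/1e3:.0f} kB, drains in {drain_time*1e6:.0f} µs, "
      f"buffer ≥ {buffer_needed/1e3:.0f} kB on that output port")
```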


Page 17: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Push-Based Event Building with store & forward switching and load-balancing

Data Acquisition Switch, Event Manager, Event Builders EB1–EB3:

1. Event Builders notify the Event Manager of available capacity (“Send me an event!”)
2. The Event Manager ensures that data are sent only to nodes with available capacity (“Send next event to EB1”, “Send next event to EB2”, “Send next event to EB3”)
3. The readout system relies on feedback from the Event Builders

Sources do not buffer – so the switch must buffer to avoid packet loss due to overcommitment
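The control flow described above, as a minimal sketch in Python (the credit-based push scheme only; class and method names are illustrative, not actual DAQ software):

```python
from collections import deque

class EventManager:
    """Assigns each new event to a builder that has announced free capacity."""
    def __init__(self):
        self.credits = deque()                 # builder ids, one entry per free slot
    def announce_capacity(self, builder_id):   # "Send me an event!"
        self.credits.append(builder_id)
    def next_destination(self):                # "Send next event to EB<n>"
        return self.credits.popleft() if self.credits else None

class EventBuilder:
    def __init__(self, builder_id, n_sources, manager):
        self.id, self.n_sources, self.manager = builder_id, n_sources, manager
        self.partial = {}                      # event_id -> set of sources seen
        manager.announce_capacity(builder_id)  # initial credit
    def receive(self, event_id, source_id):
        seen = self.partial.setdefault(event_id, set())
        seen.add(source_id)
        if len(seen) == self.n_sources:        # event complete -> hand back a credit
            del self.partial[event_id]
            self.manager.announce_capacity(self.id)

# Tiny demo: 4 readout sources push fragments of 6 events to 3 builders
em = EventManager()
n_sources = 4
builders = {b: EventBuilder(b, n_sources, em) for b in range(3)}
for event_id in range(6):
    dest = em.next_destination()               # only nodes with capacity get data
    for source_id in range(n_sources):         # sources push, they never buffer
        builders[dest].receive(event_id, source_id)
    print(f"event {event_id} built on EB{dest}")
```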

Page 18: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


LHCb DAQ

[Diagram of the LHCb DAQ: the front-end (FE) electronics of the sub-detectors (VELO, ST, OT, RICH, ECal, HCal, Muon, L0 Trigger) feed Readout Boards; the L0 trigger and the LHC clock drive the TFC system, which distributes Timing and Fast Control signals and collects MEP requests; the Readout Boards push event data at 55 GB/s through the readout network (core and edge switches) to the HLT CPU farm, where event building takes place; a monitoring (MON) farm receives a fraction of the data; 200 – 300 MB/s leave the farm towards storage; the whole system is supervised by the Experiment Control System (ECS), which carries control and monitoring data.]

Average event size: 55 kB
Average rate into farm: 1 MHz
Average rate to tape: 4 – 5 kHz

Page 19: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

LHCb DAQ

• Events are very small (about 55 kB total) – each read-out board contributes about 200 bytes (only!!)
  – A UDP message on Ethernet takes 8 + 14 + 20 + 8 + 4 = 54 bytes of overhead → ~25% overhead(!)
• LHCb uses coalescence of messages, packing about 10 to 15 events into one message (called MEP) → message rate is ~ 80 kHz (c.f. CMS, ATLAS)
• Protocol is a simple, single-stage push; every farm-node builds complete events; the TTC system is used to assign IP addresses coherently to the read-out boards
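The effect of this coalescing in numbers (a minimal sketch using the figures from the slide; the choice of 12 events per MEP is simply the midpoint of the quoted 10 to 15):

```python
# Per-datagram overhead on Ethernet: preamble 8 + Ethernet 14 + IP 20 + UDP 8 + FCS 4
overhead   = 8 + 14 + 20 + 8 + 4     # = 54 bytes
fragment   = 200                     # bytes per read-out board and event
event_rate = 1_000_000               # Hz into the farm

# Without coalescing: one datagram per event and per board
print(f"per event : {overhead / fragment:.0%} overhead, "
      f"{event_rate / 1e3:.0f} kHz messages per board")

# With MEPs: ~12 events packed into one datagram per board
events_per_mep = 12
print(f"per MEP   : {overhead / (events_per_mep * fragment):.1%} overhead, "
      f"{event_rate / events_per_mep / 1e3:.0f} kHz messages per board")
```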


Page 20: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

DAQ network parameters

Experiment  Link load [%]                     Technology           Protocol   Event building
ALICE       30                                Ethernet             TCP/IP     pull
                                              InfiniBand (HLT)                pull (RDMA)
ATLAS       20 (L2) / 50 (Event-collection)   10 Gbit/s Ethernet   TCP/IP     pull
CMS         65                                Myrinet              Myrinet    push (with credits)
            40 – 80                           Ethernet             TCP/IP     pull
LHCb        40 – 80                           Ethernet             UDP        push


Page 21: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

LHC Trigger/DAQ parameters (as seen 2011/12)

Experiment  # Levels  Level-0/1/2 Trigger Rate (Hz)   Event Size (Byte)   Network Bandw. (GB/s)   Storage MB/s (Event/s)
ALICE       4         Pb-Pb 500 / p-p 10^3            5x10^7 / 2x10^6     25                      4000 (10^2) / 200 (10^2)
ATLAS       3         LV-1 10^5 / LV-2 3x10^3         1.5x10^6            6.5                     700 (6x10^2)
CMS         2         LV-1 10^5                       10^6                100                     ~1000 (10^2)
LHCb        2         LV-0 10^6                       5.5x10^4            55                      250 (4.5x10^3)


Page 22: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


High Level Trigger Farms

And that, in simple terms, is what we do in the High Level Trigger

Page 23: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Online Trigger Farms 2012

                                      ALICE                                 ATLAS                    CMS                                      LHCb
# cores (+ hyperthreading)            2700                                  17000                    13200                                    15500
# servers (mainboards)                –                                     ~ 2000                   ~ 1300                                   1574
total available cooling power (kW)    ~ 500                                 ~ 820                    800                                      525
total available rack-space (Us)       ~ 2000                                2400                     ~ 3600                                   2200
CPU type(s)                           AMD Opteron, Intel 54xx, Intel 56xx   Intel 54xx, Intel 56xx   Intel 54xx, Intel 56xx, Intel E5-2670    Intel 5450, Intel 5650, AMD 6220

And counting…

Page 24: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Not yet approved!

LHC planning


Long Shutdown 1 (LS1): CMS: Myrinet → InfiniBand / Ethernet; ATLAS: merge L2 and Event-Collection infrastructures
Long Shutdown 2 (LS2): ALICE continuous read-out; LHCb 40 MHz read-out
Long Shutdown 3 (LS3): CMS track-trigger

Page 25: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Motivation

• The LHC (large hadron collider) collides protons every 25 ns (40 MHz)

• Each collision produces about 100 kB of data in the detector

• Currently a pre-selection in custom electronics rejects 97.5% of these events; unfortunately a lot of them contain interesting physics

• In 2017 the detector will be changed so that all events can be read-out into a standard compute platform for detailed inspection


Page 26: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

LHCb after LS2


• Ready for an all-software trigger (resources permitting)
• 0-suppression on front-end electronics mandatory!
• Event-size about 100 kB, readout-rate up to 40 MHz
• Will need a network scalable up to 32 Tbit/s: InfiniBand, 10/40/100 Gigabit Ethernet?
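The 32 Tbit/s follows directly from the two numbers above (a minimal sketch):

```python
event_size   = 100e3     # bytes per event
readout_rate = 40e6      # Hz, i.e. every bunch crossing

bandwidth = event_size * readout_rate                 # bytes per second
print(f"{bandwidth / 1e12:.0f} TB/s "
      f"= {bandwidth * 8 / 1e12:.0f} Tbit/s")         # 4 TB/s = 32 Tbit/s
```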

Page 27: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Key figures

• Minimum required bandwidth: > 32 Tbit/s
• # of 100 Gigabit/s links: > 320
• # of compute units: > 1500
• An event (“snapshot of a collision”) is about 100 kB of data
• # of events processed every second: 10 to 40 million
• # of events retained after filtering: 20000 to 30000 (data reduction of at least a factor 1000)


Page 28: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

GBT: custom radiation-hard link over MMF, 3.2 Gbit/s (about 10000)

Input into DAQ network (10/40 Gigabit Ethernet or FDR IB) (1000 to 4000)

Output from DAQ network into compute unit clusters (100 Gbit Ethernet / EDR IB) (200 to 400 links)

Compute units could be servers with GPUs or other coprocessors

LHCb DAQ as of 2018


[Diagram: the Detector in the cavern, separated by 100 m of rock from the Readout Units; GBT links from the detector feed the Readout Units, which send into the DAQ network connecting to the Compute Units.]

Page 29: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Readout Unit

• Readout Unit needs to collect custom-links• Some pre-processing• Buffering• Coalescing of data-fragment reduce message-rate /

transport overheads• Needs an FPGA• Sends data using standard network protocol (IB, Ethernet)• Sending of data can be done directly from the FPGA or via

a standard network silicon• Works together with Compute Units to build events
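A minimal sketch of what fragment coalescing means in practice (the header layout and field names are illustrative assumptions, not the LHCb MEP format):

```python
import struct

def coalesce(fragments, first_event_id):
    """Pack consecutive event fragments into one multi-event packet (MEP-like)."""
    payload = struct.pack("<IH", first_event_id, len(fragments))   # toy header
    for frag in fragments:
        payload += struct.pack("<H", len(frag)) + frag             # length-prefixed
    return payload

# One ~200-byte fragment per event, 12 events per packet (see the MEP slide)
fragments = [bytes(200) for _ in range(12)]
mep = coalesce(fragments, first_event_id=1000)
print(f"{len(fragments)} fragments of 200 B -> one {len(mep)} B message")
```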


Page 30: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Compute Unit

• A compute unit is a destination for the event-data fragments from the readout units
• It assembles the fragments into a complete “event” and runs various selection algorithms on this event
• About 0.1 % of events are retained
• A compute unit will be a high-density server platform (mainboard with standard CPUs), probably augmented with a co-processor card (like an Intel MIC or a GPU)


Page 31: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Future DAQ systems: trends

• Certainly LAN based
  – InfiniBand deserves a serious evaluation for high bandwidth (> 100 GB/s)
  – In Ethernet, if DCB works, we might be able to build networks from smaller units; otherwise we will stay with large store & forward boxes
• The trend to “trigger-free” → do everything in software → bigger DAQ will continue
  – Physics data-handling in commodity CPUs
• Will there be a place for many-core / coprocessor cards (Intel MIC / CUDA)?
  – IMHO this will depend on whether we can establish a development framework which allows for long-term maintenance of the software by non-“geek” users, much more than on the actual technology


Page 32: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Fat-Tree Topology for One Slice

• 48-port 10 GbE switches
• Mix readout-boards (ROB) and filter-farm servers in one switch:
  – 15 x readout-boards
  – 18 x servers
  – 15 x uplinks
• Non-blocking switching → use 65% of installed bandwidth (classical DAQ only 50%)
• Each slice accommodates:
  – 690 x inputs (ROBs)
  – 828 x outputs (servers)
• Ratio (servers/ROBs) is adjustable
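Checking that the slice numbers quoted above are consistent (a minimal sketch; the number of leaf switches is derived here, it is not quoted on the slide):

```python
ports = 48
robs_per_switch, servers_per_switch, uplinks_per_switch = 15, 18, 15
assert robs_per_switch + servers_per_switch + uplinks_per_switch == ports

inputs_per_slice = 690
leaf_switches = inputs_per_slice // robs_per_switch        # 690 / 15 = 46 switches
outputs_per_slice = leaf_switches * servers_per_switch     # 46 * 18 = 828 servers

print(f"{leaf_switches} leaf switches, {inputs_per_slice} ROB inputs, "
      f"{outputs_per_slice} server outputs per slice")
```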


Page 33: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Pull-Based Event Building

Data Acquisition Switch, Event Manager, Event Builders EB1–EB3:

1. Event Builders notify the Event Manager of available capacity (“Send me an event!”)
2. The Event Manager elects an event-builder node (“EB1, get next event”, “EB2, get next event”)
3. Readout traffic is driven by the Event Builders (“Send event to EB1!” sent to every readout source)

Page 34: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Summary

• Large modern DAQ systems are based entirely (mostly) on Ethernet and big PC-server farms

• Bursty, uni-directional traffic is a challenge for the network and the receivers, and requires substantial buffering in the switches

• The future:
  – It seems that buffering in switches is being reduced (latency vs. buffering)
  – Advanced flow-control is coming, but it will need to be tested whether it is sufficient for DAQ
  – Ethernet is still strongest, but InfiniBand looks like a very interesting alternative
  – Integrated protocols (RDMA) can offload servers, but will be more complex
  – Integration of GPUs, non-Intel processors and other many-cores will need to be studied

• For the DAQ and triggering the question is not if we can do it, but how we can do it so that we can afford it!


Page 35: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

More Stuff

Page 36: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Cut-through switching: Head-of-Line Blocking

• A packet to node 4 must wait even though the port to node 4 is free

• The reason for this is the First-In-First-Out (FIFO) structure of the input buffer

• Queuing theory tells us* that for random traffic (and infinitely many switch ports) the throughput of the switch will go down to 58.6% → that means on a 100 Mbit/s network the nodes will "see" effectively only ~ 58 Mbit/s (see the simulation sketch below)

*) "Input Versus Output Queueing on a Space-Division Packet Switch"; Karol, M. et al. ; IEEE Trans. Comm., 35/12


Page 37: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

GBT: custom radiation-hard link over MMF, 3.2 Gbit/s (about 10000)

Input into DAQ network (10/40 Gigabit Ethernet or FDR IB) (1000 to 4000)

Output from DAQ network into compute unit clusters (100 Gbit Ethernet / EDR IB) (200 to 400 links)

Event-building


[Diagram: the Detector in the cavern, separated by 100 m of rock from the Readout Units; GBT links from the detector feed the Readout Units, which send into the DAQ network connecting to the Compute Units.]

Readout Units send to Compute Units; Compute Units receive passively ("Push-architecture")

Page 38: Niko Neufeld "A 32 Tbit/s Data Acquisition System"

Runcontrol

(Image © Warner Bros.)

Page 39: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Runcontrol challenges

• Start, configure and control O(10000) processes on farms of several 1000 nodes

• Configure and monitor O(10000) front-end elements

• Fast database access, caching, pre-loading, parallelization, and all this 100% reliable!

Page 40: Niko Neufeld "A 32 Tbit/s Data Acquisition System"


Runcontrol technologies

• Communication:
  – CORBA (ATLAS)
  – HTTP/SOAP (CMS)
  – DIM (LHCb, ALICE)

• Behavior & Automation:
  – SMI++ (ALICE)
  – CLIPS (ATLAS)
  – RCMS (CMS)
  – SMI++ (in PVSS) (used also in the DCS)

• Job/Process control:
  – Based on XDAQ, CORBA, …
  – FMC/PVSS (LHCb, does also fabric monitoring)

• Logging:
  – log4C, log4j, syslog, FMC (again), …