Real-time on-detector AI

Nhan Tran + Javier Duarte, Lindsey Gray, Sergo Jindariani, Kevin Pedro, Bill Pellico, Gabe Perdue, Ryan Rivera, Brian Schupbach, Kiyomi Seiya, Jason St. John, Mike Wang,…

May 10, 2019
Transcript
Page 1: Real-time on-detector AI (title slide)

Nhan Tran + Javier Duarte, Lindsey Gray, Sergo Jindariani, Kevin Pedro, Bill Pellico, Gabe Perdue, Ryan Rivera, Brian Schupbach, Kiyomi Seiya, Jason St. John, Mike Wang,…

May 10, 2019

Page 2: CMS Event Processing

CMS trigger: L1 Trigger → High-Level Trigger, taking the 40 MHz collision rate down to 100 kHz after L1 and to 1 kHz (at 1 MB/evt) after the HLT

• Level-1 Trigger (hardware)
  • 99.75% of events rejected
  • decision in ~4 μs
• High-Level Trigger (software)
  • 99% of remaining events rejected
  • decision in ~100s of ms
• After the trigger, 99.99975% of events are gone forever

[Figure: compute-latency axis from 1 ns to 1 s; FPGAs sit at the shortest latencies, with CPUs (running ML) at the HLT and offline stages]

Page 3: CMS Event Processing (build)

[Same slide as Page 2, with ML on FPGAs added to the latency figure]

Page 4: CMS Event Processing (build)

[Same slide as Page 2]

A whole other talk, mostly for the computing group: https://arxiv.org/abs/1904.08986

Page 5: CMS Event Processing (build)

[Same slide as Page 2, with ASICs (marked "???") added to the latency figure]

At latencies above ~1 ms (network-switching latencies), you hit the domain of CPUs/GPUs and you're better off going to industry tools.

But when there is no time for a CPU, the calculation is heavy, and the throughput is high: custom real-time detector AI applications are for you!

Page 6: High Rate and Intelligent Edge

FPGA building blocks: DSPs (multiply-accumulate, etc.), flip-flops (registers/distributed memory), LUTs (logic), and block RAMs (memories)

O(50-100) optical transceivers running at ~O(15) Gb/s

Traditionally, FPGAs are programmed in low-level hardware description languages like Verilog and VHDL. High-level synthesis (HLS) instead allows C-level programming with specialized preprocessor directives, from which optimized firmware is synthesized; this drastically reduces firmware development times.

Page 7: Neural Networks and Linear Algebra

Figure 2: A cartoon of a deep, fully connected neural network (input layer, M hidden layers of widths N_1 … N_m … N_M, output layer) illustrating the description conventions used in the text.

2.2 Case study: jet substructure

Jets are collimated showers of particles that result from the decay and hadronization of quarks q and gluons g. At the LHC, due to the high collision energy, a particularly interesting jet signature emerges from overlapping quark-initiated showers produced in decays of heavy standard model particles. For example, the W and Z bosons decay to two quarks (qq̄) 67%-70% of the time and the Higgs boson is predicted to decay to two b-quarks (bb̄) approximately 58% of the time. The top quark decays to two light quarks and a b-quark (qq̄b). It is the task of jet substructure [18, 47] to distinguish the various radiation profiles of these jets from backgrounds consisting mainly of quark (u, d, c, s, b) and gluon-initiated jets. The tools of jet substructure have been used to distinguish interesting jet signatures from backgrounds that have production rates hundreds of times larger than the signal [48].

Jet substructure at the LHC has been a particularly active field for machine learning techniques, as jets contain O(100) particles whose properties and correlations may be exploited to identify physics signals. The high dimensionality and highly correlated nature of the phase space makes this task an interesting testbed for machine learning techniques. There are many studies that explore this possibility, both in experiment and theory [18, 39–47, 49–51]. For this reason, we choose to benchmark our FPGA studies using the jet substructure task.

We give two examples in Fig. 3 where jet substructure techniques in the trigger can play an important role: low-mass hidden hadronic resonances [52] and boosted Higgs produced in gluon fusion [53]. Both processes are overwhelmed by backgrounds and current trigger strategies would


O_j = Φ( Σ_i I_i × W_ij + b_j )

Φ = activation function (non-linearity)
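The layer equation above can be written out directly in code. The sketch below is a minimal pure-Python rendering of O_j = Φ(Σ_i I_i W_ij + b_j); the weights, sizes, and the ReLU choice for Φ are illustrative, not taken from the talk.

```python
def dense_layer(inputs, weights, biases, phi=lambda x: max(x, 0.0)):
    """One fully connected layer: O_j = phi(sum_i I_i * W[i][j] + b_j).

    phi defaults to ReLU; any non-linearity can be passed in.
    """
    outputs = []
    for j in range(len(biases)):
        acc = biases[j]                     # start from the bias b_j
        for i, x in enumerate(inputs):      # accumulate sum_i I_i * W_ij
            acc += x * weights[i][j]
        outputs.append(phi(acc))            # apply the activation
    return outputs

# Tiny illustrative example: 3 inputs -> 2 outputs
I = [1.0, -2.0, 0.5]
W = [[0.2, -0.1],
     [0.4,  0.3],
     [-0.5, 0.6]]
b = [1.0, 0.0]
O = dense_layer(I, W, b)   # -> approximately [0.15, 0.0]
```

On an FPGA each of these multiply-accumulates can map onto a DSP, which is why the same loop structure reappears in the hls4ml-generated firmware discussed below.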

Page 8: Project Overview

2 Building neural networks with hls4ml

In this section, we give an overview of translating a given neural network model into an FPGA implementation using HLS. We then detail a specific jet substructure case study, but the same concepts are applicable to a broad class of problems. We conclude this section by discussing how to create an efficient and optimal implementation of a neural network in terms of performance, resource usage, and latency.

2.1 hls4ml concept

The task of automatically translating a trained neural network, specified by the model's architecture, weights, and biases, into HLS code is performed by the hls4ml package. A schematic of a typical workflow is illustrated in Fig. 1.


Figure 1: A typical workflow to translate a model into a FPGA implementation using hls4ml.

The part of the workflow illustrated in red indicates the usual software workflow required to design a neural network for a specific task. This usual machine learning workflow, with tools such as Keras and PyTorch, involves a training step and possible compression steps (more discussion below in Sec. 2.3) before settling on a final model. The blue section of the workflow is the task of hls4ml, which translates a model into an HLS project that can be synthesized and implemented to run on an FPGA.

At a high level, FPGA algorithm design differs from programming a CPU in that independent operations may run fully in parallel, allowing FPGAs to achieve trillions of operations per second at a relatively low power cost. However, such operations consume dedicated resources onboard the FPGA and cannot be dynamically remapped while running. The challenge in creating an optimal FPGA implementation is to balance FPGA resource usage with achieving the latency and throughput goals of the target algorithm. Key metrics for an FPGA implementation include the latency, the initiation interval (which sets the throughput), and the resource usage (DSPs, LUTs, flip-flops, and block RAMs).
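The resource/latency balance described above is often steered in hls4ml through a "reuse factor": using fewer multipliers, each reused several times. The toy model below is a back-of-the-envelope sketch for intuition only, not a synthesis estimate; the function name and layer sizes are invented for illustration.

```python
def reuse_tradeoff(n_in, n_out, reuse_factor):
    """Simplified resource/latency model for one dense layer.

    A fully parallel layer needs n_in * n_out multipliers and finishes in
    one multiplier pass; with reuse factor R, each multiplier is used R
    times sequentially, so resources drop by R while latency grows by R.
    (Toy model for intuition only; real HLS adds pipelining overheads.)
    """
    n_mult = n_in * n_out
    assert n_mult % reuse_factor == 0, "R should divide the multiply count"
    multipliers = n_mult // reuse_factor   # DSPs consumed
    passes = reuse_factor                  # sequential multiplier passes
    return multipliers, passes

# A hypothetical 16 -> 64 dense layer at three operating points
for R in (1, 4, 16):
    dsps, passes = reuse_tradeoff(16, 64, R)
    print(f"R={R:2d}: {dsps:4d} multipliers, {passes:2d} passes")
```

At R=1 the layer is fully unrolled (fastest, most resources); raising R serializes the arithmetic and shrinks the footprint, which is exactly the knob a trigger designer turns to fit within the FPGA.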


Quantization, Compression, Parallelization made easy with hls4ml!
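Quantization here means snapping trained weights onto a fixed-point grid before they are baked into firmware. The sketch below is an illustrative pure-Python model of signed fixed-point rounding with saturation; the parameter names and the <8, 1> precision choice are assumptions for illustration, not values from the talk, and real HLS fixed-point types offer several rounding/overflow modes.

```python
def quantize_fixed(x, total_bits=8, int_bits=1):
    """Round x onto a signed fixed-point grid, ap_fixed<total, int>-style.

    frac_bits = total_bits - int_bits, so the grid step is 2**-frac_bits;
    values outside the representable range saturate at the endpoints.
    Illustrative model only, not a drop-in for an HLS fixed-point type.
    """
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits
    lo = -(2.0 ** (int_bits - 1))          # most negative representable value
    hi = (2.0 ** (int_bits - 1)) - step    # most positive representable value
    q = round(x / step) * step             # snap to the grid
    return min(max(q, lo), hi)             # saturate out-of-range values

weights = [0.1234, -0.5678, 0.9999, -1.5]
quantized = [quantize_fixed(w) for w in weights]
# -> [0.125, -0.5703125, 0.9921875, -1.0]
```

With 8 total bits each weight costs a fraction of a DSP-friendly word instead of a 32-bit float, which is what makes the sub-100 ns inference quoted below feasible.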

Results and outlook: a 4000-parameter network inferred in < 100 ns with 30% of FPGA resources!

Muon pT reconstruction with a NN reduces the rate by 80%. Larger networks and different architectures (CNN, RNN, graph) are actively being developed.

Page 9: Tech Transfer

LDRD: Add "reinforcement learning" to improve accelerator operations

Tuning the Gradient Magnet Power Supply (GMPS) system for the Booster will be a first for accelerators and critical for future machines. A first proof of concept that could apply across the accelerator complex.
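The slide names the idea without detail; as a miniature illustration of the control loop involved, here is a tabular Q-learning toy in which an agent learns to nudge a regulation error toward zero. Everything in it (states, actions, reward, hyperparameters) is an invented stand-in, not the actual GMPS controller.

```python
import random

def train_toy_regulator(episodes=2000, seed=0):
    """Tabular Q-learning on a toy 1-D regulation task.

    State: a discretized error in -5..5. Actions: nudge by -1, 0, or +1.
    Reward: the negative absolute error after the nudge, so the agent is
    rewarded for driving the error toward zero. Toy stand-in only.
    """
    rng = random.Random(seed)
    states = range(-5, 6)
    actions = (-1, 0, 1)
    Q = {(s, a): 0.0 for s in states for a in actions}
    alpha, gamma, eps = 0.2, 0.9, 0.1     # learning rate, discount, exploration
    for _ in range(episodes):
        s = rng.randint(-5, 5)            # start each episode at a random error
        for _ in range(20):               # short episode
            if rng.random() < eps:        # epsilon-greedy exploration
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2 = max(-5, min(5, s + a))   # the nudge shifts the error
            r = -abs(s2)                  # small residual error is good
            best_next = max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = train_toy_regulator()
# After training, the greedy policy nudges positive errors down
# and negative errors up, i.e. it regulates toward zero.
```

The real LDRD problem replaces this toy state with GMPS readings and the nudge with a power-supply correction, but the learn-from-reward loop has the same shape.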

Page 10: Futuristic Ideas

Hardware alternatives trade flexibility for efficiency: CPUs → GPUs → FPGAs → ASICs (silicon alternatives), plus photonics

[Figure: CPU block diagram (control unit, registers, ALU) contrasted with arrays of parallel arithmetic units]

Page 11: ASICs

Edge TPU

Page 12: Frankensteins

Xilinx Versal

Page 13: Photonics

Even faster: a neural network photonics "ASIC". Recently, fabrication processes have become more reliable.

In contact with 2 groups (MIT, Princeton) on possible photonics prototypes

Page 14: Summary

• Real-time AI brings processing power on-detector
  • Recovers efficiency/performance losses in triggers, gaining back physics
• Other physics scenarios? A lot of efficiency loss comes from high-bandwidth systems…
• Want to demonstrate that it helps with automation and efficiency of system operation
• Futuristic technologies could bring even more front-end processing power
  • Hardened vector DSPs, electronics, and photonics