-
Searching for Exotic Particles in High-Energy Physics with Deep
Learning
P. Baldi,1 P. Sadowski,1 and D. Whiteson2
1Dept. of Computer Science, UC Irvine, Irvine, CA 926172Dept. of
Physics and Astronomy, UC Irvine, Irvine, CA 92617
Collisions at high-energy particle colliders are a traditionally
fruitful source of exotic particle dis-coveries. Finding these rare
particles requires solving difficult signal-versus-background
classificationproblems, hence machine learning approaches are often
used. Standard approaches have relied onshallow machine learning
models that have a limited capacity to learn complex non-linear
functionsof the inputs, and rely on a pain-staking search through
manually constructed non-linear features.Progress on this problem
has slowed, as a variety of techniques have shown equivalent
performance.Recent advances in the field of deep learning make it
possible to learn more complex functions andbetter discriminate
between signal and background classes. Using benchmark datasets, we
showthat deep learning methods need no manually constructed inputs
and yet improve the classificationmetric by as much as 8% over the
best current approaches. This demonstrates that deep
learningapproaches can improve the power of collider searches for
exotic particles.
INTRODUCTION
The field of high energy physics is devoted to the studyof the
elementary constituents of matter. By investigat-ing the structure
of matter and the laws that govern itsinteractions, this field
strives to discover the fundamentalproperties of the physical
universe. The primary toolsof experimental high energy physicists
are modern ac-celerators, which collide protons and/or antiprotons
tocreate exotic particles that occur only at extremely highenergy
densities. Observing these particles and measur-ing their
properties may yield critical insights about thevery nature of
matter [1]. Such discoveries require power-ful statistical methods,
and machine learning tools playa critical role. Given the limited
quantity and expen-sive nature of the data, improvements in
analytical toolsdirectly boost particle discovery potential.
To discover a new particle, physicists must isolate asubspace of
their high-dimensional data in which the hy-pothesis of a new
particle or force gives a significantlydifferent prediction than
the null hypothesis, allowing foran effective statistical test. For
this reason, the criticalelement of the search for new particles
and forces in high-energy physics is the computation of the
relative likeli-hood, the ratio of the sample likelihood functions
in thetwo considered hypotheses, shown by Neyman and Pear-son [2]
to be the optimal discriminating quantity. Of-ten this relative
likelihood function cannot be expressedanalytically, so simulated
collision data generated withMonte Carlo methods are used as a
basis for approxima-tion of the likelihood function. The high
dimensionalityof data, referred to as the feature space, makes it
in-tractable to generate enough simulated collisions to de-scribe
the relative likelihood in the full feature space, andmachine
learning tools are used for dimensionality reduc-tion. Machine
learning classifiers such as neural networksprovide a powerful way
to solve this learning problem.
The relative likelihood function is a complicated func-tion in a
high-dimensional space. While any function
can theoretically be represented by a shallow classifier,such as
a neural network with a single hidden layer [3],an intractable
number of hidden units may be required.Circuit complexity theory
tells us that deep neural net-works (DN) have the potential to
compute complex func-tions much more efficiently (fewer hidden
units), but inpractice they are notoriously difficult to train due
to thevanishing gradient problem [4, 5]; the adjustments to
theweights in the early layers of a deep network rapidly ap-proach
zero during training. A common approach is tocombine shallow
classifiers with high-level features thatare derived manually from
the raw features. These aregenerally non-linear functions of the
input features thatcapture physical insights about the data. While
helpful,this approach is labor-intensive and not necessarily
op-timal; a robust machine learning method would obviatethe need
for this additional step and capture all of theavailable
classification power directly from the raw data.
Recent successes in deep learning e.g. neural net-works with
multiple hidden layers have come from al-leviating the gradient
diffusion problem by a combina-tion of factors, including: 1)
speeding up the stochas-tic gradient descent algorithm with
graphics processors;2) using much larger training sets; 3) using
new learn-ing algorithms, including randomized algorithms such
asdropout [6, 7]; and 4) pre-training the initial layers ofthe
network with unsupervised learning methods suchas autoencoders [8,
9]. With these methods, it is be-coming common to train deep
networks of five or morelayers. These advances in deep learning
could have a sig-nificant impact on applications in high-energy
physics.Construction and operation of the particle acceleratorsis
extremely expensive, so any additional classificationpower
extracted from the collision data is very valuable.
In this paper, we show that the current techniquesused in
high-energy physics fail to capture all of theavailable
information, even when boosted by manually-constructed
physics-inspired features. This effectively re-duces the power of
the collider to discover new particles.
arX
iv:1
402.
4735
v2 [
hep-
ph]
5 Jun
2014
-
2We demonstrate that recent developments in deep learn-ing tools
can overcome these failings, providing signifi-cant boosts even
without manual assistance.
RESULTS
The vast majority of particle collisions do not pro-duce exotic
particles. For example, though the LargeHadron Collider produces
approximately 1011 collisionsper hour, approximately 300 of these
collisions result ina Higgs boson, on average. Therefore, good data
anal-ysis depends on distinguishing collisions which
produceparticles of interest (signal) from those producing
otherparticles (background).
Even when interesting particles are produced, detect-ing them
poses considerable challenges. They are toosmall to be directly
observed and decay almost immedi-ately into other particles. Though
new particles cannotbe directly observed, the lighter stable
particles to whichthey decay, called decay products, can be
observed. Mul-tiple layers of detectors surround the point of
collision forthis purpose. As each decay product pass through
thesedetectors, it interacts with them in a way that allows
itsdirection and momentum to be measured.
Observable decay products include electrically-chargedleptons
(electrons or muons, denoted `), and particle jets(collimated
streams of particles originating from quarksor gluons, denoted j).
In the case of jets we attemptto distinguish between jets from
heavy quarks (b) andjets from gluons or low-mass quarks; jets
consistent withb-quarks receive a b-quark tag. For each object, the
mo-mentum is determined by three measurements: the mo-mentum
transverse to the beam direction (pT), and twoangles, (polar) and
(azimuthal). For convenience, athadron colliders, such as Tevatron
and LHC, the pseu-dorapidity, defined as = ln(tan(/2)) is used
insteadof the polar angle . Finally, an important quantity isthe
amount of momentum carried away by the invisibleparticles. This
cannot be directly measured, but can beinferred in the plane
transverse to the beam by requiringconservation of momentum. The
initial state has zeromomentum transverse to the beam axis,
therefore anyimbalance of transverse momentum (denoted 6ET ) in
thefinal state must be due to production of invisible particlessuch
as neutrinos () or exotic particles. The momentumimbalance in the
longitudinal direction along the beamcannot be measured at hadron
colliders, as the initialstate momentum of the quarks is not
known.
Benchmark Case for Higgs Bosons (HIGGS)
The first benchmark classification task is to distinguishbetween
a signal process where new theoretical Higgsbosons are produced,
and a background process with the
b
b
WWg
g
H0
H
h0
(a)
g
g
t
t
b
b
W+
W
(b)
FIG. 1: Diagrams for Higgs benchmark. (a) Diagram de-scribing
the signal process involving new exotic Higgs bosonsH0 and H. (b)
Diagram describing the background processinvolving top-quarks (t).
In both cases, the resulting particlesare two W bosons and two
b-quarks.
identical decay products but distinct kinematic features.This
benchmark task was recently considered by experi-ments at the LHC
[10] and the Tevatron colliders [11].
The signal process is the fusion of two gluons into aheavy
electrically-neutral Higgs boson (gg H0), whichdecays to a heavy
electrically-charged Higgs bosons (H)and a W boson. The H boson
subsequently decays to asecond W boson and the light Higgs boson,
h0 which hasrecently been observed by the ATLAS [12] and CMS
[13]experiments. The light Higgs boson decays predomi-nantly to a
pair of bottom quarks, giving the process:
gg H0 WH WWh0 WWbb, (1)
which leads to WWbb, see Figure 1. The backgroundprocess, which
mimics WWbb without the Higgs bo-son intermediate state, is the
production of a pair of topquarks, each of which decay toWb, also
givingWWbb,see Figure 1.
Simulated events are generated with the mad-graph5 [14] event
generator assuming 8 TeV collisionsof protons as at the latest run
of the Large HadronCollider, with showering and hadronization
performedby pythia [15] and detector response simulated bydelphes
[16]. For the benchmark case here, mH0 = 425GeV and mH = 325 GeV
has been assumed.
We focus on the semi-leptonic decay mode, in whichone W boson
decays to a lepton and neutrino (`) andthe other decays to a pair
of jets (jj), giving decay prod-ucts `b jjb. We consider events
which satisfy the re-quirements:
-
3 Exactly one electron or muon, with pT > 20 GeVand || <
2.5; at least four jets, each with pT > 20 GeV and || 20GeV and
|| < 2.5; at least 20 GeV of missing transverse momentum
As above, the basic detector response is used to mea-sure the
momentum of each visible particle, in this casethe charged leptons.
In addition, there may be particlejets induced by radiative
processes. A critical quantityis the missing transverse momentum,
6ET . Figure 5 givesdistributions of low-level features for signal
and back-ground processes.
The search for supersymmetric particles is a centralpiece of the
scientific mission of the Large Hadron Col-lider. The strategy we
applied to the Higgs boson bench-
-
5g
g
h
+
0
0
`+
`
(a)
q
q
q
W
W
`+
`
(b)
FIG. 4: Diagrams for SUSY benchmark. Example di-agrams
describing the signal process involving hypotheticalsupersymmetric
particles and 0 along with charged lep-tons ` and neutrinos (a) and
the background process in-volving W bosons (b). In both cases, the
resulting observedparticles are two charged leptons, as neutrinos
and 0 escapeundetected.
mark, of reconstructing the invariant mass of the inter-mediate
state, is not feasible here, as there is too muchinformation
carried away by the escaping neutrinos (twoneutrinos in this case,
compared to one for the Higgscase). Instead, a great deal of
intellectual energy hasbeen spent in attempting to devise features
which giveadditional classification power. These include
high-levelfeatures such as:
Axial 6ET : missing transverse energy along the vec-tor defined
by the charged leptons,
stransverse mass MT2: estimating the mass ofparticles produced
in pairs and decaying semi-invisibly [17, 18],
6ETRel: 6ET if pi/2, 6ET sin() if < pi/2,where is the minimum
angle between 6ET and ajet or lepton,
razor quantities ,R, and MR [19],
super-razor quantities R+1, cos(R+1), R, MR ,MTR , and
sR [20].
See Figure 6 for distributions of these high-level fea-tures for
both signal and background processes.
[GeV]T
Lepton 1 p0 50 100 150 200
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(a)
[GeV]T
Lepton 2 p0 50 100 150 200
Frac
tion
of E
vent
s
0
0.1
0.2
0.3
(b)
Sum jet pT0 50 100 150 200
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(c)
Missing Trans. Mom [GeV]0 50 100 150 200
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(d)
N jets0 1 2 3 4
Frac
tion
of E
vent
s
0
0.2
0.4
0.6
(e)
FIG. 5: Low-level input features for SUSY bench-mark.
Distribution of low-level features in simulated samplesfor the SUSY
signal (black) and background (red) benchmarkprocesses.
A dataset containing five million simulatedcollision events are
available for download atarchive.ics.uci.edu/ml/datasets/SUSY.
Current Approach
Standard techniques in high-energy physics data anal-yses
include feed-forward neural networks with a singlehidden layer and
boosted-decision trees. We use thewidely-used TMVA package [21],
which provides a stan-dardized implementation of common
multi-variate learn-ing techniques and an excellent performance
baseline.
Deep Learning
We explored the use of deep neural networks as a prac-tical tool
for applications in high-energy physics. Hyper-
-
6R0 0.2 0.4 0.6 0.8 1
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(a)
Axial MET-200 0 200
Frac
tion
of E
vent
s
0
0.1
0.2
(b)
R
0 0.2 0.4 0.6 0.8 1
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(c)
R+1
0 0.2 0.4 0.6 0.8 1
Frac
tion
of E
vent
s
0
0.2
0.4
(d)
)R+1cos(0 0.2 0.4 0.6 0.8 1
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(e)
R
-2 0 2
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(f)
RM
0 50 100 150 200 250
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(g)
RM0 100 200 300 400 500
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(h)
RTM
0 50 100 150 200
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(i)
MET_rel0 100 200 300
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(j)
T2M0 50 100 150 200
Frac
tion
of E
vent
s
0
0.1
0.2
0.3
(k)
Rs0 200 400 600
Frac
tion
of E
vent
s
0
0.05
0.1
0.15
0.2
(l)
FIG. 6: High-level input features for SUSY bench-mark.
Distribution of high-level features in simulated sam-ples for the
SUSY signal (black) and background (red) bench-mark processes.
parameters were chosen using a subset of the HIGGS
dataconsisting of 2.6 million training examples and
100,000validation examples. Due to computational costs,
thisoptimization was not thorough, but included combina-tions of
the pre-training methods, network architectures,initial learning
rates, and regularization methods shownin Supplementary Table 1. We
selected a five-layer neuralnetwork with 300 hidden units in each
layer, a learningrate of 0.05, and a weight decay coefficient of 1
105.Pre-training, extra hidden units, and additional hiddenlayers
significantly increased training time without no-ticeably
increasing performance. To facilitate compari-son, shallow neural
networks were trained with the samehyper-parameters and the same
number of units per hid-den layer. Additional training details are
provided in theMethods section below.
The hyper-parameter optimization was performed us-ing the full
set of HIGGS features. To investigate whetherthe neural networks
were able to learn the discrimina-tive information contained in the
high-level features, wetrained separate classifiers for each of the
three featuresets described above: low-level, high-level, and
combinedfeature sets. For the SUSY benchmark, the networkswere
trained with the same hyper-parameters chosen forthe HIGGS, as the
datasets have similar characteristicsand the hyper-parameter search
is computationally ex-pensive.
Performance
Classifiers were tested on 500,000 simulated examplesgenerated
from the same Monte Carlo procedures as thetraining sets. We
produced Receiver Operating Charac-teristic (ROC) curves to
illustrate the performance of theclassifiers. Our primary metric
for comparison is the areaunder the ROC curve (AUC), with larger
AUC values in-dicating higher classification accuracy across a
range ofthreshold choices.
This metric is insightful, as it is directly connected
toclassification accuracy, which is the quantity optimizedfor in
training. In practice, physicists may be interestedin other
metrics, such as signal efficiency at some fixedbackground
rejection, or discovery significance as calcu-lated by p-value in
the null hypothesis. We choose AUCas it is a standard in machine
learning, and is closelycorrelated with the other metrics. In
addition, we calcu-late discovery significance the standard metric
in high-energy physics to demonstrate that small increases inAUC
can represent significant enhancement in discoverysignificance.
Note, however, that in some applications the determin-ing factor
in the sensitivity to new exotic particles is de-termined not only
by the discriminating power of the se-lection, but by the
uncertainties in the background modelitself. Some portions of the
background model may be
-
7TABLE I: Performance for Higgs benchmark. Com-parison of the
performance of several learning techniques:boosted decision trees
(BDT), shallow neural networks (NN),and deep neural networks (DN)
for three sets of input fea-tures: low-level features, high-level
features and the completeset of features. Each neural network was
trained five timeswith different random initializations. The table
displays themean Area Under the Curve (AUC) of the
signal-rejectioncurve in Figure 7, with standard deviations in
parentheses aswell as the expected significance of a discovery (in
units ofGaussian ) for 100 signal events and 1000 50
backgroundevents.
AUC
Technique Low-level High-level Complete
BDT 0.73 (0.01) 0.78 (0.01) 0.81 (0.01)
NN 0.733 (0.007) 0.777 (0.001) 0.816 (0.004)
DN 0.880 (0.001) 0.800 (< 0.001) 0.885 (0.002)
Discovery significance
Technique Low-level High-level Complete
NN 2.5 3.1 3.7
DN 4.9 3.6 5.0
better understood than others, so that some simulatedbackground
collisions have larger associated systematicuncertainties than
other collisions. This can transformthe problem into one of
reinforcement learning, whereper-collision truth labels no longer
indicate the ideal net-work output target. This is beyond the scope
of thisstudy, but see Refs. [22, 23] for stochastic
optimizatonstrategies for such problems.
Figure 7 and Table I show the signal efficiency andbackground
rejection for varying thresholds on the out-put of the neural
network (NN) or boosted decision tree(BDT).
A shallow NN or BDT trained using only the low-levelfeatures
performs significantly worse than one trainedwith only the
high-level features. This implies that theshallow NN and BDT are
not succeeding in indepen-dently discovering the discriminating
power of the high-level features. This is a well-known problem with
shallowlearning methods, and motivates the calculation of
high-level features.
Methods trained with only the high-level features,however, have
a weaker performance than those trainedwith the full suite of
features, which suggests that despitethe insight represented by the
high-level features, they donot capture all of the information
contained in the low-level features. The deep learning techniques
show nearlyequivalent performance using the low-level features
andthe complete features, suggesting that they are automat-ically
discovering the insight contained in the high-levelfeatures.
Finally, the deep learning technique finds addi-tional separation
power beyond what is contained in thehigh-level features,
demonstrated by the superior perfor-
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
NN lo+hi-level (AUC=0.81)
NN hi-level (AUC=0.78)
NN lo-level (AUC=0.73)
(a)
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0
0.2
0.4
0.6
0.8
1
DN lo+hi-level (AUC=0.88)
DN lo-level (AUC=0.88)
DN hi-level (AUC=0.80)
(b)
FIG. 7: Performance for Higgs benchmark. For theHiggs benchmark,
comparison of background rejection versussignal efficiency for the
traditional learning method (a) andthe deep learning method (b)
using the low-level features, thehigh-level features and the
complete set of features.
mance of the deep network with low-level features to
thetraditional network using high-level features. These re-sults
demonstrate the advantage to using deep learningtechniques for this
type of problem.
The internal representation of a NN is notoriously dif-ficult to
reverse engineer. To gain some insight into themechanism by which
the deep network (DN) is improvingupon the discrimination in the
high-level physics features,we compare the distribution of
simulated events selectedby a minimum threshold on the NN or DN
output, cho-sen to give equivalent rejection of 90% of the
background
-
8 [GeV]Wbbm150 200 250 300 350 400 450 500
Frac
tion
of E
vent
s
0
0.01
0.02
0.03
0.04
0.05
0.06
Signal
Backg.
DN21 (rej=0.9)NN7 (rej=0.9)
(a)
[GeV]WWbbm200 300 400 500 600 700 800
Frac
tion
of E
vent
s
0
0.01
0.02
0.03
0.04
0.05
0.06
Signal
Backg.
DN21 (rej=0.9)NN7 (rej=0.9)
(b)
FIG. 8: Performance comparisons. Distribution ofevents for two
rescaled input features: (a) mWbb and (b)mWWbb. Shown are pure
signal and background distribu-tions, as well as events which pass
a threshold requirementwhich gives a background rejection of 90%
for a deep networkwith 21 low-level inputs (DN21) and a shallow
network with7 high-level inputs (NN7).
events. Figure 8 shows events selected by such thresh-olds in an
mixture of 50% signal and 50% backgroundcollisions, compared to
pure distributions of signal andbackground. The NN preferentially
selects events withvalues of the features close to the
characteristic signal val-ues and away from background-dominated
values. TheDN, which has a higher efficiency for the equivalent
re-jection, selects events near the same signal values, butalso
retains events away from the signal-dominated re-gion. The likely
explanation is that the DN has discov-ered the same signal-rich
region identified by the physicsfeatures, but has in addition found
avenues to carve intothe background-dominated region.
In the case of the SUSY benchmark, the deep neuralnetworks again
perform better than the shallow networks.
TABLE II: Performance comparison for the SUSYbenchmark. Each
model was trained five times with dif-ferent weight
initializations. The mean AUC is shown withstandard deviations in
parentheses as well as the expected sig-nificance of a discovery
(in units of Gaussian ) for 100 signalevents and 1000 50 background
events.
AUC
Technique Low-level High-level Complete
BDT 0.850 (0.003) 0.835 (0.003) 0.863 (0.003)
NN 0.867 (0.002) 0.863 (0.001) 0.875 (< 0.001)
NNdropout 0.856 (< 0.001) 0.859 (< 0.001) 0.873 (<
0.001)
DN 0.872 (0.001) 0.865 (0.001) 0.876 (< 0.001)
DNdropout 0.876 (< 0.001) 0.869 (< 0.001) 0.879 (<
0.001)
Discovery significance
Technique Low-level High-level Complete
NN 6.5 6.2 6.9
DN 7.5 7.3 7.6
The improvement is less dramatic, though
statisticallysignificant.
An additional boost in performance is obtained by us-ing the
dropout training algorithm, in which we stochas-tically drop
neurons in the top hidden layer with 50%probability during
training. For deep networks trainedwith dropout, we achieve an AUC
of 0.88 on both thelow-level and complete feature sets. Table II,
Supple-mentary Fig. 2 and Supplementary Fig. 3 compare
theperformance of shallow and deep networks for each ofthe three
sets of input features.
In this SUSY case, neither the high-level features northe deep
network finds dramatic gains over the shallownetwork of low-level
features. The power of the deepnetwork to automatically find
non-linear features revealssomething about the nature of the
classification problemin this case: it suggests that there may be
little gainfrom further attempts to manually construct
high-levelfeatures.
Analysis
To highlight the advantage of deep networks over shal-low
networks with a similar number of parameters, weperformed a
thorough hyper-parameter optimization forthe class of single-layer
neural networks over the hyper-parameters specified in
Supplementary Table 2 on theHIGGS benchmark. The largest shallow
network had300, 001 parameters, slightly more than the 279, 901
pa-rameters in the largest deep network, but these additionalhidden
units did very little to increase performance overa shallow network
with only 30, 001 parameters. Supple-mentary Table 3 compares the
performance of the bestshallow networks of each size with deep
networks of vary-ing depth.
-
9While the primary advantage of deep neural networksis their
ability to automatically learn high-level featuresfrom the data,
one can imagine facilitating this processby pre-training a neural
network to compute a particu-lar set of high-level features. As a
proof of concept, wedemonstrate how deep neural networks can be
trainedto compute the high-level HIGGS features with a highdegree
of accuracy (Supplementary Table 4). Note thatsuch a network could
be used as a module within a largerneural network classifier.
DISCUSSION
It is widely accepted in experimental high-energyphysics that
machine learning techniques can providepowerful boosts to searches
for exotic particles. Untilnow, physicists have reluctantly
accepted the limitationsof the shallow networks employed to date;
in an attemptto circumvent these limitations, physicists manually
con-struct helpful non-linear feature combinations to guidethe
shallow networks.
Our analysis shows that recent advances in deeplearning
techniques may lift these limitations by auto-matically discovering
powerful non-linear featurecombinations and providing better
discriminationpower than current classifiers even when aided
bymanually-constructed features. This appears to be thefirst such
demonstration in a semi-realistic case.
We suspect that the novel environment of high energyphysics,
with high volumes of relatively low-dimensionaldata containing rare
signals hiding under enormous back-grounds, can inspire new
developments in machine learn-ing tools. Beyond these simple
benchmarks, deep learn-ing methods may be able to tackle thornier
problems withmultiple backgrounds, or lower-level tasks such as
identi-fying the decay products from the high-dimensional
rawdetector output.
METHODS
Neural Network Training
In training the neural networks, the following hyper-parameters
were predetermined without optimization.Hidden units all used the
tanh activation function.Weights were initialized from a normal
distribution withzero mean and standard deviation 0.1 in the first
layer,0.001 in the output layer, and 0.05 all other hidden lay-ers.
Gradient computations were made on mini-batchesof size 100. A
momentum term increased linearly overthe first 200 epochs from 0.9
to 0.99, at which point itremained constant. The learning rate
decayed by a factorof 1.0000002 every batch update until it reached
a mini-mum of 106. Training ended when the momentum had
reached its maximum value and the minimum error onthe validation
set (500,000 examples) had not decreasedby more than a factor of
0.00001 over 10 epochs. Thisearly stopping prevented overfitting
and resulted in eachneural network being trained for 200-1000
epochs.
Autoencoder pretraining was performed by training astack of
single-hidden-layer autoencoder networks as in[9], then fine-tuning
the full network using the class la-bels. Each autoencoder in the
stack used tanh hiddenunits and linear outputs, and was trained
with the sameinitialization scheme, learning algorithm, and
stoppingparameters as in the fine-tuning stage. When trainingwith
dropout, we increased the learning rate decay factorto 1.0000003,
and only ended training when the momen-tum had reached its maximum
value and the error on thevalidation set had not decreased for 40
epochs.
Datasets
The data sets were nearly balanced, with 53% positiveexamples in
the HIGGS data set, and 46% positive exam-ples in the SUSY data
set. Input features were standard-ized over the entire train/test
set with mean zero andstandard deviation one, except for those
features withvalues strictly greater than zero these we scaled so
thatthe mean value was one.
Computation
Computations were performed using machines with 16Intel Xeon
cores, an NVIDIA Tesla C2070 graphics pro-cessor, and 64 GB memory.
All neural networks weretrained using the GPU-accelerated Theano
and Pylearn2software libraries [24, 25]. Our code is available
athttps://github.com/uci-igb/higgs-susy.
ACKNOWLEDGMENTS
We are grateful to Kyle Cranmer, Chris Hays, ChaseShimmin,
Davide Gerbaudo, Bo Jayatilaka, Jahred Adel-man, and Shimon
Whiteson for their insightful com-ments. We wish to acknowledge a
hardware grant fromNVIDIA.
AUTHOR CONTRIBUTIONS
PB conceived the idea of applying deep learning meth-ods to high
energy physics and particle detection. DWchose the benchmarks and
generated the data and struc-tured the problem. PB and PS designed
the architectures
-
10
and the algorithms. PS implemented the code and per-formed the
experiments. All authors analyzed the resultsand contributed to
writing up the manuscript.
REFERENCES
Electronic address: [email protected] Electronic address:
[email protected]
[1] Dawson, S. et al. Higgs Working Group Report of theSnowmass
2013 Community Planning Study, Preprint
athttp://arxiv.org/abs/1310.8361 (2013).
[2] Neyman, J. & Pearson, E. Philosophical Transactions
ofthe Royal Society 231, 694706 (1933).
[3] Hornik, K., Stinchcombe, M. & White, H.
Multilayerfeedforward networks are universal approximators. Neu-ral
Netw. 2, 359366 (1989). URL
http://dx.doi.org/10.1016/0893-6080(89)90020-8.
[4] Hochreiter, S. Recurrent Neural Net Learning and Van-ishing
Gradient (1998).
[5] Bengio, Y., Simard, P. & Frasconi, P. Learning
long-termdependencies with gradient descent is difficult.
NeuralNetworks, IEEE Transactions on 5, 157166 (1994).
[6] Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever,I.
& Salakhutdinov, R. R. Improving neural net-works by preventing
co-adaptation of feature detectors.http://arxiv.org/abs/1207.0580
(2012).
[7] Baldi, P. & Sadowski, P. The dropout learn-ing
algorithm. Artificial Intelligence 210, 78122(2014). URL
http://www.sciencedirect.com/science/article/pii/S0004370214000216.
[8] Hinton, G. E., Osindero, S. & Teh, Y.-W. A fast
learningalgorithm for deep belief nets. Neural Computation
18,15271554 (2006). URL
http://dx.doi.org/10.1162/neco.2006.18.7.1527.
[9] Bengio, Y. et al. Greedy layer-wise training of deep
net-works. In Advances in Neural Information ProcessingSystems 19
(MIT Press, 2007).
[10] Aad, G. et al. Search for a Multi-Higgs Boson Cascade
inW+Wbb events with the ATLAS detector in pp collisionsat s = 8 TeV
(2013).
[11] Aaltonen, T. et al. Search for a two-Higgs-boson
doubletusing a simplified model in pp collisions at
s = 1.96
TeV. Phys.Rev.Lett. 110, 121801 (2013).[12] ATLAS Collaboration.
Observation of a new particle in
the search for the Standard Model Higgs boson with theATLAS
detector at the LHC. Phys.Lett. B716, 129(2012).
[13] CMS Collaboration. Observation of a new boson at amass of
125 GeV with the CMS experiment at the LHC.Phys.Lett. B716, 3061
(2012).
[14] Alwall, J. et al. MadGraph 5 : Going Beyond. JHEP1106, 128
(2011).
[15] Sjostrand, T. et al. PYTHIA 6.4 physics and manual.JHEP 05,
026 (2006).
[16] Ovyn, S., Rouby, X. & Lemaitre, V. DELPHES, a
frame-work for fast simulation of a generic collider
experiment(2009).
[17] Cheng, H.-C. & Han, Z. Minimal Kinematic Constraintsand
m(T2). JHEP 0812, 063 (2008).
[18] Barr, A., Lester, C. & Stephens, P. m(T2): The
Truthbehind the glamour. J.Phys. G29, 23432363 (2003).
[19] Rogan, C. Kinematical variables towards new dynamicsat the
LHC (2010).
[20] Buckley, M. R., Lykken, J. D., Rogan, C. & Spiropulu,M.
Super-Razor and Searches for Sleptons and Charginosat the LHC
(2013).
[21] Hocker, A. et al. TMVA - Toolkit for Multivariate
DataAnalysis. PoS ACAT, 040 (2007).
[22] Whiteson, S. & Whiteson, D. Stochastic optimizationfor
collision selection in high energy physics. In IAAI2007:
Proceedings of the Nineteenth Annual InnovativeApplications of
Artificial Intelligence Conference, 18191825 (2007).
[23] Aaltonen, T. et al. Measurement of the top quark masswith
dilepton events selected using neuroevolution atCDF. Phys.Rev.Lett.
102, 152001 (2009).
[24] Bergstra, J. et al. Theano: a CPU and GPU math expres-sion
compiler. In Proceedings of the Python for ScientificComputing
Conference (SciPy) (Austin, TX, 2010). OralPresentation.
[25] Goodfellow, I. J. et al. Pylearn2: a machine
learningresearch library. arXiv preprint arXiv:1308.4214
(2013).
-
11
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0
0.2
0.4
0.6
0.8
1
DN lo-level (AUC=0.88)
NN lo-level (AUC=0.73)
(a)
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0
0.2
0.4
0.6
0.8
1
DN hi-level (AUC=0.80)
NN hi-level (AUC=0.78)
(b)
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0
0.2
0.4
0.6
0.8
1
DN lo+hi-level (AUC=0.88)
NN lo+hi-level (AUC=0.81)
(c)
Supplementary Figure 1: Details of performance forHiggs
benchmark. For the HIGGS benchmark, compar-ison of background
rejection versus signal efficiency for thelow-level features (a),
high-level features (b) and complete setof features (c) using
traditional and deep learning methods.
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0
0.2
0.4
0.6
0.8
1
NN lo+hi-level (AUC=0.88)
NN hi-level (AUC=0.86)
NN lo-level (AUC=0.86)
(a)
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0
0.2
0.4
0.6
0.8
1
DN lo+hi-level (AUC=0.88)
DN lo-level (AUC=0.88)
DN hi-level (AUC=0.87)
(b)
Supplementary Figure 2: Performance for SUSY bench-mark.
Comparison of background rejection versus signal ef-ficiency for
the traditional learning method (a) and the deeplearning method (b)
using the low-level features, the high-level features and the
complete set of features.
-
12
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0
0.2
0.4
0.6
0.8
1
DN lo-level (AUC=0.88)
NN lo-level (AUC=0.86)
(a)
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0
0.2
0.4
0.6
0.8
1
DN hi-level (AUC=0.87)
NN hi-level (AUC=0.86)
(b)
Signal efficiency0 0.2 0.4 0.6 0.8 1
Back
grou
nd R
ejecti
on
0
0.2
0.4
0.6
0.8
1
DN hi-level (AUC=0.87)
NN hi-level (AUC=0.86)
(c)
Supplementary Figure 3: Details of performance forSUSY
benchmark. Comparison of background rejectionversus signal
efficiency for the low-level features (a), high-levelfeatures (b)
and complete set of features (c) using traditionaland deep learning
methods.
-
13
Supplementary Table 1: Hyper-parameter choices fordeep networks.
Shown are the considered value for eachhyper-parameter.
Hyper-parameters Choices
Depth 2, 3, 4, 5, 6 layers
Hidden units per layer 100, 200, 300, 500
Initial learning rate 0.01, 0.05
Weight decay 0, 105
Pre-training none, autoencoder
Supplementary Table 2: Hyper-parameter choices forshallow
networks. Shown are the considered value for
eachhyper-parameter.
Hyper-parameters Choices
Hidden units 300, 1000, 2000, 10000
Initial learning rate 0.05, 0.005, 0.0005
Weight decay 0, 105
Supplementary Table 3: Study of network size anddepth.
Comparison of shallow networks with different num-bers of hidden
units (single hidden layer), and deep networkswith varying hidden
layers in terms of the Area Under theROC Curve (AUC) for the HIGGS
benchmark. The deepnetworks have 300 units in each hidden
layer.
AUC
Technique Low-level High-level Complete
NN 300-hidden 0.733 0.777 0.816
NN 1000-hidden 0.788 0.783 0.841
NN 2000-hidden 0.787 0.788 0.842
NN 10000-hidden 0.790 0.789 0.841
DN 3 layers 0.836 0.791 0.850
DN 4 layers 0.868 0.797 0.872
DN 5 layers 0.880 0.800 0.885
DN 6 layers 0.888 0.799 0.893
-
14
Supplementary Table 4: Error Analysis. Mean SquaredError (MSE)
of networks trained to compute the seven high-level features from
the 21 low-level features on HIGGS. Allnetworks were trained using
the same hyperparameters as thedeep networks in the previous
experiments, except with linearoutput units, no momentum, and a
smaller initial learningrate of 0.005.
Technique Feature Regression MSE
Linear Regression 0.1468
NN 0.0885
DN 3 layers 0.0821
DN 4 layers 0.0818
DN 5 layers 0.0815
DN 6 layers 0.0812
Introduction Results Benchmark Case for Higgs Bosons (HIGGS)
Benchmark Case for Supersymmetry Particles (SUSY) Current Approach
Deep Learning Performance Analysis
Discussion Methods Neural Network Training Datasets
Computation
Acknowledgments Author Contributions References References