CS4622 – Machine Learning
Group Project
(Higgs Boson Machine Learning Challenge)
Group Members
Dhanushka S.W.S. 100097C
Lasandun K.H.L. 100295G
Siriwardena M. P. 100512X
Upeksha W. D. 100552T
Wijayarathna D. G. C. D. 100596F
Table of Contents

Introduction
    What is the Higgs boson?
    The Higgs Boson Challenge
    Signal and Background
Training Data Analysis and Data Preprocessing Techniques Used
Classification Methods Used
    Boosted Decision Trees with XGBoost
    Boosted Decision Trees with TMVA (Toolkit for MultiVariate Analysis)
    Naive Bayesian Model
    Artificial Neural Network Based Approach
    MultiBoost Based Approach
AMS Score and Model Evaluation Criteria
Conclusion
References
1.0 Introduction

1.1 What is the Higgs boson?
The Higgs boson (or Higgs particle) is an elementary particle that gives mass to other particles. Peter Higgs was one of the first physicists to propose it. It is part of the Standard Model of particle physics, and the associated Higgs field permeates all of space. The Higgs particle is a boson; bosons are the particles responsible for all physical forces except gravity.
It is very difficult to detect the Higgs boson with the equipment and technology we have now. The particle is believed to exist for less than a septillionth of a second. Because the Higgs boson has so much mass compared to other particles, it takes a great deal of energy to create one (E = mc²).
The discovery of the long-awaited Higgs boson was announced on July 4, 2012 and confirmed six months later. 2013 saw a number of prestigious awards, including a Nobel Prize. But for physicists, the discovery of a new particle marks the beginning of a long and difficult quest to measure its characteristics and determine whether it fits the current model of nature.
1.2 The Higgs Boson Challenge
The goal of the Higgs Boson Machine Learning Challenge is to explore the potential of
advanced machine learning methods to improve the discovery significance of the experiment.
Using simulated data with features characterizing events detected by ATLAS, the task is to
classify events into "tau tau decay of a Higgs boson" versus "background." The best method
may eventually be applied to real data.
1.3 Signal and Background
Classification algorithms have been used routinely since the 1990s in high-energy physics to separate signal and background in particle detectors. The goal of the classifier is to maximize the sensitivity of a counting test in a selection region. This is similar in spirit to, but formally different from, the classical objectives of minimizing the misclassification error or maximizing the AUC. The working example is the ongoing search for the Higgs boson in the tau-tau decay channel in the ATLAS detector at the LHC. The problem is first formalized; we then describe the usual analysis chain and explain some of the choices physicists make when designing a classifier to optimize the discovery significance. Different surrogates that capture this goal are derived, along with some simple techniques to optimize them, raising questions on both the statistical and the algorithmic sides.
2.0 Training Data Analysis and Data Preprocessing Techniques Used

The competition provides two data files. The "training.csv" file contains a training set of 250,000 events, with an ID column, 30 feature columns, a weight column, and a label column. The "test.csv" file contains a test set of 550,000 events, with an ID column and 30 feature columns. The attributes in the training and test sets are described below, along with the distribution of values and their frequencies for each attribute.
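Before turning to the individual attributes, the following is a minimal loading-and-inspection sketch of our own (not the official starting kit); the file paths are assumptions, and in the Challenge data undefined values are encoded as -999.0.

import pandas as pd

# Load the competition files (paths assumed)
train = pd.read_csv("training.csv")   # 250,000 rows: EventId, 30 features, Weight, Label
test = pd.read_csv("test.csv")        # 550,000 rows: EventId, 30 features
print(train.shape, test.shape)

# Fraction of undefined (-999.0) entries per column, largest first
print((train == -999.0).mean().sort_values(ascending=False).head(10))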
EventId
A unique integer identifier of the event.
DER_mass_MMC
The estimated mass m_H of the Higgs boson candidate, obtained through a probabilistic phase space integration (may be undefined if the topology of the event is too far from the expected topology).
DER_mass_transverse_met_lep
The transverse mass between the missing transverse energy and the lepton.
DER_mass_vis
The invariant mass of the hadronic tau and the lepton.
DER_pt_h
The modulus of the vector sum of the transverse momentum of the hadronic tau, the lepton, and the missing transverse energy vector.
DER_deltaeta_jet_jet
The absolute value |η₁ − η₂| of the pseudorapidity separation between the two jets (undefined if PRI_jet_num ≤ 1).
DER_mass_jet_jet
The invariant mass of the two jets (undefined if PRI_jet_num ≤ 1).
DER_prodeta_jet_jet
The product of the pseudorapidities of the two jets (undefined if PRI_jet_num ≤ 1).
DER_deltar_tau_lep
The R separation between the hadronic tau and the lepton.
DER_pt_tot
The modulus of the vector sum of the missing transverse momentum and the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num ≥ 1) and the subleading jet (if PRI_jet_num = 2) (but not of any additional jets).
DER_sum_pt
The sum of the moduli of the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num ≥ 1), the subleading jet (if PRI_jet_num = 2), and the other jets (if PRI_jet_num = 3).
DER_pt_ratio_lep_tau
The ratio of the transverse momenta of the lepton and the hadronic tau.
DER_met_phi_centrality
The centrality of the azimuthal angle of the missing transverse energy vector w.r.t. the hadronic tau and the lepton:

C = (A + B) / √(A² + B²), where A = sin(φ_met − φ_lep) and B = sin(φ_had − φ_met),

and φ_met, φ_lep, and φ_had are the azimuthal angles of the missing transverse energy vector, the lepton, and the hadronic tau, respectively. The centrality is √2 if the missing transverse energy vector E_T^miss is on the bisector of the transverse momenta of the lepton and the hadronic tau. It decreases to 1 if E_T^miss is collinear with one of these vectors and decreases further to −√2 when E_T^miss is exactly opposite to the bisector.
DER_lep_eta_centrality
The centrality of the pseudorapidity of the lepton w.r.t. the two jets (undefined if PRI_jet_num ≤ 1):

exp( −4/(η₁ − η₂)² · (η_lep − (η₁ + η₂)/2)² )

where η_lep is the pseudorapidity of the lepton and η₁ and η₂ are the pseudorapidities of the two jets. The centrality is 1 when the lepton is on the bisector of the two jets, decreases to 1/e when it is collinear with one of the jets, and decreases further to zero at infinity.
PRI_tau_pt
The transverse momentum √(p_x² + p_y²) of the hadronic tau.
PRI_tau_eta
The pseudorapidity η of the hadronic tau.
PRI_tau_phi
The azimuth angle φ of the hadronic tau.
PRI_lep_pt
The transverse momentum √(p_x² + p_y²) of the lepton (electron or muon).
PRI_lep_eta
The pseudorapidity η of the lepton.
PRI_lep_phi
The azimuth angle φ of the lepton.
PRI_met
The missing transverse energy E_T^miss.
PRI_met_phi
The azimuth angle φ of the missing transverse energy.
PRI_met_sumet
The total transverse energy in the detector.
PRI_jet_num
The number of jets (an integer with value 0, 1, 2, or 3; possible larger values have been capped at 3).
PRI_jet_leading_pt
The transverse momentum √(p_x² + p_y²) of the leading jet, that is, the jet with the largest transverse momentum (undefined if PRI_jet_num = 0).
PRI_jet_leading_eta
The pseudorapidity η of the leading jet (undefined if PRI_jet_num = 0).
PRI_jet_leading_phi
The azimuth angle φ of the leading jet (undefined if PRI_jet_num = 0).
PRI_jet_subleading_pt
The transverse momentum √(p_x² + p_y²) of the subleading jet, that is, the jet with the second largest transverse momentum (undefined if PRI_jet_num ≤ 1).
PRI_jet_subleading_eta
The pseudorapidity η of the subleading jet (undefined if PRI_jet_num ≤ 1).
PRI_jet_subleading_phi
The azimuth angle φ of the subleading jet (undefined if PRI_jet_num ≤ 1).
PRI_jet_all_pt
The scalar sum of the transverse momenta of all the jets of the event.
The relative importance of each attribute is shown below.
3.0 Classification Methods Used
3.1 Boosted Decision Trees with XGBoost

XGBoost is an optimized, general-purpose gradient boosting library, parallelized using OpenMP. It implements machine learning algorithms under the gradient boosting framework, including generalized linear models and gradient boosted regression trees. XGBoost is originally written in C++; we used its Python wrapper for our classification task. The main parameters of XGBoost are the number of trees, the maximum tree depth, the learning rate, and the decision threshold. We tried several values for these parameters and evaluated the results with k-fold cross-validation to find the optimum, as sketched below.
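The following is a minimal sketch of the tuning loop, assuming the training.csv layout described in Section 2.0; the file path, column indices, and learning rate are illustrative assumptions rather than our exact script.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold

data = pd.read_csv("training.csv")
X = data.iloc[:, 1:31].values                  # the 30 feature columns
y = (data["Label"] == "s").astype(int).values  # signal = 1, background = 0
w = data["Weight"].values

params = {"objective": "binary:logistic", "max_depth": 5, "eta": 0.1}

for fold, (tr, va) in enumerate(KFold(n_splits=5, shuffle=True).split(X), 1):
    dtrain = xgb.DMatrix(X[tr], label=y[tr], weight=w[tr], missing=-999.0)
    dvalid = xgb.DMatrix(X[va], missing=-999.0)
    bst = xgb.train(params, dtrain, num_boost_round=225)
    scores = bst.predict(dvalid)
    # Select the top 15% of scores as signal (threshold value 0.15)
    cut = np.percentile(scores, 85.0)
    print("fold %d: %d events selected" % (fold, int((scores > cut).sum())))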
Trees = 200, Max Depth = 5: K-fold maximum AMS values
K Fold Number Best AMS Value
1 3.630984241
2 3.54537596044
3 3.54888839138
4 3.59332200144
5 3.62470914431
Threshold Value AMS value
0.05 2.9453
0.10 3.4022
0.15 3.5887
0.20 3.4950
0.25 3.2897
0.30 3.0586
Trees = 200, Max Depth = 4: K-fold maximum AMS values
K Fold Number Best AMS Value
1 3.53563871014
2 3.57851195657
3 3.52942843937
4 3.53440432178
5 3.64061642647
Threshold Value AMS value
0.05 2.7504
0.10 3.3832
0.15 3.5571
0.20 3.4424
0.25 3.2577
0.30 3.0415

Trees = 200, Max Depth = 6: K-fold maximum AMS values
K Fold Number Best AMS Value
1 3.56166869876
2 3.55592410986
3 3.5719883896
4 3.61092024853
5 3.6750468775
Threshold Value AMS value
0.05 2.8546
0.10 3.4115
0.15 3.5951
0.20 3.5013
0.25 3.2910
0.30 3.0613

Trees = 225, Max Depth = 5: K-fold maximum AMS values
K Fold Number Best AMS Value
1 3.65364319102
2 3.55973684323
3 3.57199860177
4 3.6157171063
5 3.69048997716
Threshold Value AMS value
0.05 3.6048
0.10 3.6055
0.15 3.5795
0.20 3.5495
0.25 3.5280
0.30 3.5040

Trees = 250, Max Depth = 5: K-fold maximum AMS values
K Fold Number Best AMS Value
1 3.65865435037
2 3.56965475557
3 3.60216938156
4 3.6471499509
5 3.69041678819
Threshold Value AMS value
0.05 3.6100
0.10 3.6129
0.15 3.5896
0.20 3.5672
0.25 3.5247
0.30 3.4989

Looking at the cross-validation results, we found that the optimum AMS value is obtained when the number of trees is 225 and the maximum depth is 5. One of the most important factors we noticed is that, independent of changes to the other parameters, the highest results were obtained at a threshold value of 0.15. This means that, among all the training data, roughly 85% are background events and 15% are signal events.
Figure: distribution of background and signal data after prediction and cross-validation on the training set; a clear separation is obtained at 0.85.
3.2 Boosted Decision Trees with TMVA (Toolkit for MultiVariate Analysis)

ROOT is a C++ analysis framework that is very popular among high-energy physicists (HEP). TMVA is a toolkit for training and applying various multivariate analysis algorithms. TMVA implements machine learning algorithms such as Boosted Decision Trees, Support Vector Machines, and Artificial Neural Networks. We tried the Boosted Decision Tree and Artificial Neural Network algorithms implemented in TMVA. A decision (regression) tree is a binary tree-structured classifier.
22
We used the Python wrapper of the library, which was originally written in C++. It is a simple script with five steps:

1. conversion of the .csv file into a .root file
2. training on the training file
3. evaluation of the scores for the training and test files
4. optimisation of the score threshold with respect to the AMS
5. creation of the submission file
The TMVA training phase begins by instantiating a Factory object with configuration options. The MVA method to use is booked via the Factory by specifying the method's type, a unique name chosen by the user, and a set of method-specific configuration options encoded in a string qualifier:

method = factory.BookMethod(TMVA.Types.kBDT, "BDT")

A BDT has a set of configuration variables to be defined at booking time. We tried different values for them and tuned them to obtain the maximum AMS value. The variables we tuned and the AMS values we obtained are listed below; a booking sketch using the best values follows the tables.

NTrees: the number of trees in the forest
Ntree AMS Value
900 3.52663
1000 3.56202
1100 3.52020
1250 3.54684
1500 3.53972
2000 3.52912

MaxDepth: the maximum depth allowed for the decision trees
MaxDepth AMS Value
3 3.49033
4 3.56202
5 3.30283
BoostType: the boosting type for the trees in the forest. We obtained the best AMS value with the 'AdaBoost' method. In adaptive boosting, events that are misclassified during the training of a decision tree are given a higher event weight.
BoostType AMS Value
AdaBoost 3.56202
RealAdaBoost 3.30376
Grad 3.53710
Bagging 2.64225
nEventsMin: the minimum number of events in a tree leaf
nEventsMin AMS Value
50 3.49956
75 3.49966
100 3.60636
125 3.51272
150 3.56202
175 3.61129
200 3.55118
SeparationType: the separation criterion for node splitting
SeparationType AMS Value
CrossEntropy 3.49039
GiniIndex 3.61129
GiniIndexWithLaplace 3.49648
MisClassificationError 2.70885
SDivSqrtSPlusB 2.26043
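As referenced above, the following is a minimal PyROOT/TMVA sketch that books a BDT with the best-performing option values from the tables; the output file name and the omitted variable/tree registration are illustrative assumptions, not our exact script (TMVA versions from the era of the Challenge book methods directly on the Factory).

from ROOT import TFile, TMVA

outFile = TFile("tmva_output.root", "RECREATE")
factory = TMVA.Factory("HiggsML", outFile, "!V:!Silent:AnalysisType=Classification")

# ... AddVariable / AddSignalTree / AddBackgroundTree calls omitted here ...

# Book a BDT with the tuned option values found above
method = factory.BookMethod(
    TMVA.Types.kBDT, "BDT",
    "NTrees=1000:MaxDepth=4:BoostType=AdaBoost:SeparationType=GiniIndex:nEventsMin=175")

factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()
outFile.Close()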
3.3 Naive Bayesian Model

We tried a probabilistic approach to predicting the results, using a Bayesian model with the features we derived. The results lagged behind those of the other classifiers we used. As is well known, the trick to getting a Naive Bayes model to work well is to select and input the best feature set, so we tried various feature sets obtained by preprocessing and analyzing their relevance. However, the maximum AMS value we obtained was 2.05933, significantly behind the other values we obtained, so we set this classifier aside and moved on to other classifiers. A minimal sketch of the approach follows.
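The sketch below uses scikit-learn's GaussianNB as a stand-in for our Bayesian model; the feature selection we actually experimented with is not reproduced, and the file path is assumed.

import pandas as pd
from sklearn.naive_bayes import GaussianNB

data = pd.read_csv("training.csv")
X = data.iloc[:, 1:31].values                  # the 30 feature columns
y = (data["Label"] == "s").astype(int).values  # signal = 1, background = 0

# Fit a Gaussian Naive Bayes classifier and inspect signal probabilities
clf = GaussianNB()
clf.fit(X, y)
p_signal = clf.predict_proba(X)[:, 1]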
3.4 Artificial Neural Network Based Approach

We tried an artificial neural network based approach, but encountered some problems. Training took a long time (more than 5 hours), which made trial-and-error with feature selection and optimization very hard, since each set of results took more than 5 hours to obtain. We also wanted to examine how the results are generated and how the features are used by the classifier; since the interpretability of neural networks is fairly low, we gave up on this particular method.
3.5 MultiBoost Based Approach

AdaBoost constructs a classifier in an incremental fashion by adding simple classifiers to a pool and using their weighted "vote" to determine the final classification. MultiBoost is an extended version of AdaBoost, implemented in C++ as a boosting software package, which adds multi-class/multi-label/multi-task capability along with weak-learning algorithms and cascades. First we carried out the preprocessing; then, to run the MultiBoost package, the data had to be converted into the ARFF format. The conversion calls are shown below.
What we did first was to feed both the training data and the validation data to the DataToArff converter:

DataToArff(xsTrain, labelsTrain, weightsTrain, header, "HiggsML_challenge_train", "training")
DataToArff(xsValidation, labelsValidation, weightsValidation, header, "HiggsML_challenge_validation", "validation")
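The original conversion snippet did not survive extraction; the following is a hedged sketch of what such a DataToArff helper might look like, under the assumption that MultiBoost reads a standard ARFF layout with the weight and label appended to each row (the starting kit's exact format may differ).

def DataToArff(xs, labels, weights, header, title, file_name):
    # Write features, weight, and label for each event in ARFF format
    with open(file_name + ".arff", "w") as f:
        f.write("@RELATION %s\n\n" % title)
        for feature in header:
            f.write("@ATTRIBUTE %s NUMERIC\n" % feature)
        f.write("@ATTRIBUTE weight NUMERIC\n")
        f.write("@ATTRIBUTE label {s,b}\n\n@DATA\n")
        for x, label, weight in zip(xs, labels, weights):
            f.write("%s,%g,%s\n" % (",".join("%g" % v for v in x), weight, label))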
We then used the converted training data as input to the MultiBoost package and ran MultiBoost on it. From the results MultiBoost produced, we plotted the learning curve (balanced weighted error rate), as sketched below.
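The original plotting snippet was also lost; the following is a hedged reconstruction that assumes MultiBoost's per-iteration output can be loaded as a two-column table of iteration number and balanced weighted error rate (the file name and column layout are assumptions).

import numpy as np
import matplotlib.pyplot as plt

# Per-iteration results: iteration number, balanced weighted error rate (assumed layout)
results = np.loadtxt("resultsValidation.dta")
plt.plot(results[:, 0], results[:, 1], label="validation BER")
plt.xlabel("Boosting iteration")
plt.ylabel("Balanced weighted error rate")
plt.legend()
plt.savefig("learning_curve.png")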
We then optimized the score threshold with respect to the AMS score (discussed below) to produce better values. We used the validation data to optimize the AMS by generating a configScoresValidation.txt from the scoreValidation.txt produced by the previous run. We then loaded the resulting dataset and selected the maximum AMS value produced by MultiBoost for the validation set, applying some normalization to choose the best possible value.
4.0 AMS Score and Model Evaluation Criteria

Given a classifier g and a realization of the experiment with n observed events selected by g (positives), the (Gaussian) significance of discovery would be roughly (n − μ_b)/√μ_b standard deviations (sigma), as the Poisson fluctuation of the background has a standard deviation of √μ_b. Since we can estimate n by s + b and μ_b by b, this would suggest an objective function of s/√b for training g. Indeed, the first-order behavior of all objective functions is ~ s/√b, but it is only valid when s ≪ b and b ≫ 1, which is often not the case in practice. To improve the behavior of the objective function in this range, the approximate median significance (AMS) objective function is used, defined by

AMS = √( 2 ( (s + b + b_reg) ln(1 + s/(b + b_reg)) − s ) )

where s and b are the weighted sums of the signal and background events falling in the selection region G = {x : g(x) = s}:

s = Σ_{i ∈ S ∩ G} w_i
b = Σ_{i ∈ B ∩ G} w_i

and b_reg is a regularization term set to the constant b_reg = 10 in the Challenge. The task of the participants is therefore to train a classifier g on the training data D with the goal of maximizing the AMS on a held-out (test) data set.
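Since the AMS is the evaluation metric throughout this report, a direct transcription of the formula above into Python is worth writing down (s and b are the weighted sums of selected signal and background events, with b_reg = 10 as in the Challenge):

import math

def ams(s, b, b_reg=10.0):
    # Approximate median significance, as defined in the Challenge
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))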
5.0 Conclusion

Our team score on the public leaderboard:

Our final team score on the private leaderboard:

Our best submission scores:
6.0 References
1. Claire Adam-Bourdarios, Learning to discover: the Higgs boson machine learning challenge, http://higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf
2. Observation of single top-quark production, http://arxiv.org/pdf/0903.0850v2.pdf
3. eXtreme Gradient Boosting (Tree) Library, https://github.com/tqchen/xgboost
4. MultiBoost benchmark script, http://higgsml.lal.in2p3.fr/software/multiboost/
5. TMVA Starting Kit, http://higgsml.lal.in2p3.fr/software/heptmvakit/
6. Naive Bayesian Starting kit, http://higgsml.lal.in2p3.fr/software/startingkit/
7. TMVA4 Users Guide, http://tmva.sourceforge.net/docu/TMVAUsersGuide.pdf