Multivariate Discrimination Methods for Top Quark Analysis

Lasse Holmström
Rolf Nevanlinna Institute, University of Helsinki, Finland

Stephan R. Sain
Department of Statistical Science, Southern Methodist University, Dallas, TX 75275-0332

Technometrics, February 1997, Vol. 39, No. 1, pp. 91-99

Multivariate discrimination techniques are being considered as alternatives to many data-analysis methods conventionally used in experimental high-energy physics. In this article, the applications of four different methods (a quadratic classifier, kernel discriminant analysis, a feedforward neural network, and multivariate adaptive regression splines) are compared for the detection of the top quark. Simulated particle-collision data that correspond to two classes, a rare signal and a dominant background, are used.

KEY WORDS: High-energy physics; Kernel estimate; Multivariate adaptive regression splines; Neural networks; Quadratic classifier.

According to the accepted theory of physics, all matter in the universe is made of 12 elementary particles: 6 leptons and 6 quarks. The familiar electron is an example of a lepton, and a proton is made up of three quarks. The 17-year search for the heaviest of the quarks, the "top," finally ended in early 1995 when two separate research groups based at the Fermi National Accelerator Laboratory simultaneously reported conclusive evidence for its existence (Abachi et al. 1995; Abe et al. 1995). In their experiments, collisions between protons and antiprotons were produced using the Tevatron particle accelerator, and data from individual collision events were screened for signs of the top quark. In the last stage of the standard data analysis, a discrimination rule is applied to a set of promising event candidates to reject "background" events that generate data closely resembling the actual top, or "signal," events.

The discrimination method conventionally used in high-energy physics sets thresholds on the values of suitable physical variables measured from the collisions. The thresholds are obtained by examining the marginal distributions of such variables for the signal and background events. To improve the effectiveness of discrimination, it is natural to consider more flexible multivariate classification methods, and this idea is in fact gaining acceptance. Neural networks are one recently proposed technique (Denby 1993). In connection with the top quark search, linear and quadratic classifiers have also been suggested (Ballochi and Odorico 1983; Odorico 1983). The purpose of the present study is to compare the performance of some popular multivariate discrimination schemes using simulated collision data. The methods tested include a quadratic classifier, kernel discriminant analysis, a neural network, and multivariate adaptive regression splines (MARS). The kernel method described in this article contributed to the data analysis leading to the discovery of the top. A simulation study indicates that this method also outperforms the conventional technique used (Holmström, Sain, and Miettinen 1995). Some results on the performance of the kernel method on real data were reported by Miettinen (1995). Note that simulations also play a role in the analysis of real data because the classifiers used in practice are partly based on simulated events.

The article is organized as follows. Section 1 discusses the particular discrimination problem studied and the data used. Section 2 describes the four discrimination methods used. Section 3 discusses the results obtained. We compare the discrimination performance of the four methods and discuss their relative merits from other points of view, too.

1. THE PROBLEM

The energy released in the annihilation of the constituent quarks of a colliding proton and antiproton can give rise to a brief existence of a top (t) and its antimatter counterpart (t̄). The tt̄ pair can further decay in several ways, but in this article we use the mode tt̄ → e + E_t + 4 jets as the signal event. Here the final state consists of an electron (e), missing energy (E_t) transverse to the particle beam (due to an undetected neutrino), and jets, or collimated bunches of particles emanating from the point of collision. The statistical problem involved is one of two-way discrimination, and it arises from the fact that the proton-antiproton collision can also take place in a way that does not involve the top quark but in which the measured final state still closely mimics the actual signal event. The leading process of this type of proton-antiproton collision produces the W particle together with jets, and then W + jets → e + E_t + jets. We refer to this as the background event.

A special characteristic of the signal-background discrimination problem is the great difference between the prior probabilities of the two classes. The background events dominate the data, and a useful discrimination procedure must be able to detect a very weak signal against an overwhelming background. In such a situation, simply minimizing the error of misclassification could easily lead to the rejection of all true signal events, and it is therefore necessary to consider the misclassification probabilities for the two classes separately.

Let X = (X_1, ..., X_d)' denote d variables used to describe an event, and let J indicate the event type, J = s for signal and J = b for background. The two quantities of interest to an experimental physicist are the signal efficiency E and the signal-to-background ratio S/B among the events classified as signals. Given a discrimination rule that either accepts or rejects events as signals, these two quantities are defined as

\[ E = P(\mathbf{X} \text{ accepted} \mid J = s) \]

and

\[ S/B = \frac{P(J = s \mid \mathbf{X} \text{ accepted})}{P(J = b \mid \mathbf{X} \text{ accepted})}. \]

A useful event classifier accepts a reasonable fraction of the true signal events and rejects as many of the background events as possible.

The quantities E and S/B are related to standard decision-theoretic concepts as follows. Consider testing the "background" null hypothesis against the "signal" alternative hypothesis, and denote by π_1 and π_2 the probabilities of error of the first and the second kind, respectively. Then E = 1 - π_2. Furthermore, if P_s = P(J = s) and P_b = P(J = b) are the signal and background a priori probabilities, then

\[ S/B = \frac{P(J = s \mid \mathbf{X} \text{ accepted})}{P(J = b \mid \mathbf{X} \text{ accepted})}
      = \frac{[P_s / P(\mathbf{X} \text{ accepted})] \, P(\mathbf{X} \text{ accepted} \mid J = s)}{[P_b / P(\mathbf{X} \text{ accepted})] \, P(\mathbf{X} \text{ accepted} \mid J = b)}
      = \frac{P_s}{P_b} \cdot \frac{P(\mathbf{X} \text{ accepted} \mid J = s)}{P(\mathbf{X} \text{ accepted} \mid J = b)}. \tag{1} \]

Thus, S/B = (P_s/P_b)(1 - π_2)/π_1.

The signal and background events were generated using the PYTHIA simulator (Sjöstrand 1992). Each event was described by 14 variables, all of which are commonly used in top quark search:

1. Electron transverse energy
2. Electron pseudorapidity
3. Missing transverse energy E_t
4. The angular difference between the electron direction and the E_t vector in the transverse plane
5. The normalized total transverse energy of the event, excluding muons and neutrinos
6. Centrality of the event, excluding muons and neutrinos
7-9. Planarity, sphericity, and aplanarity of the event, calculated using jet momenta
10-14. Transverse energies of the five leading jets, including the electron

The first four variables contain complete information about the electron and E_t, the next five characterize the entire event, and the last five contain partial information about the jets. The exact definitions of these variables were given by Sjöstrand (1992). The variables were normalized to have roughly equal magnitude. The marginal densities of the variables were mostly unimodal but not symmetric. Variable 4 has a clear multimodal appearance for signal events. The signal marginal densities generally had longer tails than the corresponding background densities. Figure 1 depicts some of the marginal densities. Each curve is a kernel density estimate based on 2,500 simulated events. Within the signal and background classes, the variables were generally correlated, some of them very strongly. An exception was variable 2, which did not appear to correlate with any other variable.

The simulations suggested that the relative rate of occurrence of signal and background events in proton-antiproton collisions is 1:2,566. This ratio could be improved to 1:41 by imposing a set of simple selection criteria based on the transverse energies of the particles observed in the event (Holmström et al. 1995). We therefore evaluated the discriminant analysis methods assuming that the pooled event population has signal a priori probability P_s = 1/42 and background a priori probability P_b = 41/42. The design goal for this work was to increase the signal-to-background ratio to about 1:1 while maintaining a signal efficiency of at least .5.

A total of 5,000 signal and 5,000 background events were generated. Half of each set was used as training data to design the classifiers, and half was set aside as independent testing data. When N_s signal events out of a total of N are accepted as signals, then the signal efficiency of the discrimination rule can be estimated as Ê = N_s/N. When N background events also are tested and N_b of them are erroneously accepted as signals, the signal-to-background ratio (1) in the accepted event population can be estimated as

\[ \widehat{S/B} = \frac{P_s}{P_b} \cdot \frac{N_s}{N_b}. \]
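As a concrete illustration (our sketch, not code from the original study), the following Python fragment evaluates these plug-in estimates for a fixed accept/reject rule; the counts in the example call are hypothetical.

    # Sketch: plug-in estimates of the signal efficiency E and of the
    # signal-to-background ratio (1), assuming N signal and N background
    # test events and the prior ratio Ps/Pb = 1/41 used in the study.
    def efficiency_and_sb(n_signal_accepted, n_background_accepted, n_test,
                          prior_ratio=1.0 / 41.0):
        e_hat = n_signal_accepted / n_test        # estimate of P(accepted | signal)
        b_hat = n_background_accepted / n_test    # estimate of P(accepted | background)
        sb_hat = prior_ratio * e_hat / b_hat      # estimate of (1)
        return e_hat, sb_hat

    # Hypothetical counts: 1,250 of 2,500 signal and 30 of 2,500 background
    # test events accepted by some rule.
    print(efficiency_and_sb(1250, 30, 2500))      # -> (0.5, roughly 1.0)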

Figure 1. Signal (solid) and Background (dashed) Marginal Densities of Variables 2 (a), 4 (b), 5 (c), and 14 (d).


2. DISCRIMINATION METHODS

A large part of discrimination methods falls into one of two main categories, likelihood ratio and regression methods. A likelihood ratio classifier uses the likelihood ratio

\[ l(\mathbf{x}) = \frac{f_s(\mathbf{x})}{f_b(\mathbf{x})} \tag{2} \]

to classify an event described by x = (x_1, ..., x_d)' as signal if l(x) > α and as background if l(x) < α. Here f_s and f_b are the probability density functions of the signal and the background event classes and must be estimated from data. The threshold α > 0 can be adjusted to achieve desirable discrimination properties. The choice α = P_b/P_s corresponds to the Bayes classifier that minimizes the probability of misclassification. However, α can also be used to control the individual class-conditional error probabilities (errors of the first and the second kind), which is required in the present problem. When the class densities are estimated using multivariate normal distributions, the logarithm of l(x) is a quadratic or, in the case of equal covariances, a linear function of x. A recent account of the resulting quadratic and linear classifiers was provided by McLachlan (1992). Nonparametric density estimation gives rise to such popular classification methods as kernel discriminant analysis and nearest neighbor rules (Fix and Hodges 1951; Hand 1982; Devijver and Kittler 1982; Silverman and Jones 1989).

One gets an equivalent classification rule by using, instead of the likelihood ratio l(x), the signal posterior probability

\[ P(J = s \mid \mathbf{X} = \mathbf{x}) = \frac{P_s f_s(\mathbf{x})}{P_s f_s(\mathbf{x}) + P_b f_b(\mathbf{x})} \tag{3} \]

as the discriminant function and a threshold value 0 < α < 1. We in fact used the likelihood ratio method in this form.
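For illustration only (not the authors' code), here is a minimal Python sketch of this rule in the posterior form (3); the density estimates fs_hat and fb_hat are assumed to be callables supplied by whichever estimation method is used.

    # Sketch: classification with the signal posterior probability (3) as the
    # discriminant function and an adjustable threshold alpha. fs_hat and
    # fb_hat are assumed callables returning density estimates at x.
    def signal_posterior(x, fs_hat, fb_hat, Ps=0.5, Pb=0.5):
        ps = Ps * fs_hat(x)                  # prior times signal density estimate
        pb = Pb * fb_hat(x)                  # prior times background density estimate
        return ps / (ps + pb)

    def accept_as_signal(x, fs_hat, fb_hat, alpha, Ps=0.5, Pb=0.5):
        """Accept the event as a signal when the posterior exceeds alpha."""
        return signal_posterior(x, fs_hat, fb_hat, Ps, Pb) > alpha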

In the regression approach, one uses the training data to model the posterior probability directly, without first estimating the class-conditional densities separately. Fisher himself used linear regression in the classic article that originated the statistical theory of discrimination (Fisher 1936). As a general approach, one can use the least squares error criterion to associate the predictor variables X with a binary 0-1 response Y. If Y = 1 is made to correspond to signal, one clearly has E(Y | X = x) = P(J = s | X = x) so that a least squares fit approximates the signal posterior probability. Alternatively, one can try to model the posterior probability as

\[ P(J = s \mid \mathbf{X} = \mathbf{x}) = \frac{e^{g(\mathbf{x})}}{1 + e^{g(\mathbf{x})}} \]

and use logistic regression methodology to estimate g. Note that the estimated posterior probability then takes values in the interval [0, 1] as a probability should. The function g can be linear or taken as one of the modern nonparametric techniques such as projection pursuit (Flick, Jones, Priest, and Herman 1990; Friedman and Stuetzle 1981), additive models (Hastie and Tibshirani 1990), or MARS (Friedman 1991). An alternative but related approach is provided by artificial neural networks (Cheng and Titterington 1994; Haykin 1994; Ripley 1994). The logistic-output feedforward layered network trained with binary regression is a commonly used discrimination technique. Other modern developments include classification trees (Breiman, Friedman, Olshen, and Stone 1984). An example of a family of techniques that is based neither on the likelihood ratio nor regression is the subspace methods that model the classes as linear subspaces of the predictor variable space and classify a point on the basis of its distance to these subspaces (Oja 1983).

We included the quadratic classifier and kernel discriminant analysis as examples of likelihood ratio methods in our comparison study. Regression methodology was represented by the feedforward neural network and the logistic regression MARS. Signal and background training sets of equal size were used to estimate the class densities and to fit the regression models. Thus, the discriminant function of each method can be thought to approximate (3) with P_s = P_b = .5. The next sections describe in more detail how each technique was used.

2.1 Quadratic Classifier

The quadratic classifier uses the likelihood ratio rule and the multivariate normal density estimates

\[ \hat{f}_s(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\hat{\Sigma}_s|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \hat{\boldsymbol{\mu}}_s)' \hat{\Sigma}_s^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}_s) \right) \]

and

\[ \hat{f}_b(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\hat{\Sigma}_b|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \hat{\boldsymbol{\mu}}_b)' \hat{\Sigma}_b^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}_b) \right). \]

Here the sample means μ̂_s, μ̂_b and the sample covariance matrices Σ̂_s, Σ̂_b are estimated from the training data. If N signal training vectors x_1, ..., x_N are used, one can take

\[ \hat{\boldsymbol{\mu}}_s = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i, \qquad \hat{\Sigma}_s = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_s)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_s)' \]

and similarly for the background.

The higher the threshold 0 < α < 1 on the discriminant function (3), the lower the achieved signal efficiency E = E(α). From the point of view of top quark search, the value E = .5 was still regarded as acceptable. The achieved signal-to-background ratio at this efficiency was used as a performance index in the design of the quadratic, kernel, and neural-network classifiers. Thus, given a classifier, a threshold α_{.5} with E(α_{.5}) = .5 was determined and the performance of the classifier was measured by the index

\[ I = S/B(\alpha_{.5}). \]

The classifier design was optimized by maximizing I. This strategy captures in a concise manner the very goal of discrimination, a high signal-to-background ratio with acceptable signal efficiency.
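A minimal sketch of this index (ours, not the original implementation): given discriminant-function values for signal and background training events, locate the threshold giving signal efficiency .5 and report the signal-to-background ratio there.

    import numpy as np

    # Sketch: performance index I = S/B at signal efficiency .5.
    # signal_scores and background_scores are discriminant-function values on
    # the two training samples; prior_ratio is Ps/Pb (1/41 in this application).
    def performance_index(signal_scores, background_scores,
                          prior_ratio=1.0 / 41.0, efficiency=0.5):
        signal_scores = np.asarray(signal_scores)
        background_scores = np.asarray(background_scores)
        # alpha_.5 is exceeded by a fraction `efficiency` of the signal scores.
        alpha = np.quantile(signal_scores, 1.0 - efficiency)
        eff = np.mean(signal_scores > alpha)             # achieved efficiency
        bg_rate = np.mean(background_scores > alpha)     # background acceptance
        return prior_ratio * eff / bg_rate               # S/B at this threshold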

The finite number of training data limits the complexity of the classifiers one can profitably use. One way to reduce complexity is to use only a subset of the original variables. For the quadratic classifier, all possible subsets of the original 14 variables were searched to find the variable combination with the highest performance index I. The value of I was computed for each variable combination using 20-fold cross-validation (or "rotation"; see Devijver and Kittler 1982, sec. 10.6) on the training data. The dash-dotted curve of Figure 2 shows the performance index of the optimal variable combinations for the quadratic classifier as a function of d, the number of variables used. The final quadratic classifier used the optimal eight variables,

1, 3, 5, 6, 7, 8, 13, 14,   (4)

and a discriminant function estimated from all training data.
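For concreteness, here is a compact sketch (ours) of such a quadratic classifier using plug-in normal density estimates; the variable selection and cross-validation machinery is omitted, and the column indices in the usage note are only an example.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Sketch: quadratic classifier built from plug-in multivariate normal
    # density estimates for the signal and background training samples.
    def fit_quadratic(Xs, Xb):
        """Xs, Xb: (N, d) arrays of signal and background training vectors."""
        fs = multivariate_normal(mean=Xs.mean(axis=0), cov=np.cov(Xs, rowvar=False))
        fb = multivariate_normal(mean=Xb.mean(axis=0), cov=np.cov(Xb, rowvar=False))
        return fs, fb

    def quadratic_posterior(X, fs, fb, Ps=0.5, Pb=0.5):
        """Signal posterior probability (3) under the fitted normal densities."""
        ps = Ps * fs.pdf(X)
        pb = Pb * fb.pdf(X)
        return ps / (ps + pb)

    # Usage sketch: restrict the training arrays to the selected variables,
    # e.g. columns [0, 2, 4, 5, 6, 7, 12, 13] for variables 1, 3, 5, 6, 7, 8, 13, 14,
    # then threshold quadratic_posterior on new events.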

2.2 Kernel Discriminant Analysis

The quadratic classifier attempts to separate the signal from background using a quadratic surface. Although such a classifier is computationally efficient, it has only limited power to model the possibly highly nonlinear optimal decision boundaries between the signal and background classes. Kernel discriminant analysis is a nonparametric technique that offers greater flexibility at the cost of increased complexity. In connection with top quark search, a simple nonparametric classifier based on multivariate histograms was considered by Ballochi and Odorico (1983).

The kernel estimate of a d-variate probability density function f has the form

\[ \hat{f}(\mathbf{x}) = \frac{1}{N h^d} \sum_{i=1}^{N} K\!\left( \frac{\mathbf{x} - \mathbf{x}_i}{h} \right). \]

Here x_1, ..., x_N is a sample from f, the kernel K is a fixed probability density function, and h > 0 is a smoothing parameter. It is generally desirable that the shape of the kernel reflects the covariance of the density f (e.g., Fukunaga 1990, sec. 6.1). For a given variable set, we used a linear transformation to diagonalize simultaneously the signal and the background covariances before the kernel density estimates were found. It was then natural to use estimates σ̂_{s,j} and σ̂_{b,j}, j = 1, ..., d, of the signal and background standard deviations and a univariate kernel k to construct product kernels for the estimates f̂_s and f̂_b, so that, for example,

\[ \hat{f}_s(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} \frac{1}{h \hat{\sigma}_{s,j}} \, k\!\left( \frac{x_j - x_{ij}}{h \hat{\sigma}_{s,j}} \right). \]

We took k to be the standard normal density. A single smoothing parameter h was used to control the smoothing simultaneously in the two density estimates. A rough idea of the proper amount of smoothing can be obtained from kernel density estimation theory, in which it is shown that the value

\[ h_N = \left( \frac{4}{(d+2) N} \right)^{1/(d+4)} \tag{5} \]

is optimal in the sense of mean integrated squared error, provided that the estimated density is a product of univariate normal densities (Scott 1992, p. 152).

Figure 2. Selection of Variables for the Quadratic Classifier (dash-dotted), Kernel Discriminant Analysis (solid), and Neural Network (dotted) Methods. The best performance index values for d variables are shown.

Selection of variables by searching through all possible variable subsets is not computationally feasible when kernel density estimates are used. We therefore applied a simple sequential forward-selection strategy that first finds the best single variable, next a variable that, together with the best single variable, gives the best two-variable performance, and so on. This strategy, called SFS, is, of course, only one of many possible suboptimal search schemes (Devijver and Kittler 1982, sec. 5.6). The performance of a variable combination was evaluated by using the index I. Only now I depends on the parameter h used to smooth the signal and the background density estimates. The performance index was optimized by using leave-one-out (i.e., 5,000-fold) cross-validation on the training data and searching through a grid of smoothing parameter values in a neighborhood of the value (5). Figure 2 (solid curve) suggests that seven variables be used. In the order of preference, these variables are

5, 9, 14, 6, 4, 8, 11.   (6)

The optimal smoothing parameter for the best variable combinations ranged between 1.2 and 2.8 times the value given by (5). Figure 3 shows the performance index for the best seven-variable combination as a function of s, where h = s h_N is the smoothing parameter used. A clear peak in the performance index was observed for most combinations picked by SFS. The final kernel discriminant function used all training data, the optimal variable combination (6), and the best smoothing parameter value suggested by Figure 3.
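The following sketch (ours, with the simultaneous diagonalization step omitted) shows the Gaussian product-kernel estimate and the normal-reference value h_N of (5) scaled by a factor s:

    import numpy as np

    # Sketch: Gaussian product-kernel density estimate with per-variable scales
    # sigma (class standard deviations) and a single smoothing parameter h.
    def product_kernel_density(x, train, sigma, h):
        """Density estimate at a single point x from an (N, d) training array."""
        bw = h * sigma                                        # per-variable bandwidths
        u = (x - train) / bw                                  # (N, d) standardized gaps
        log_k = -0.5 * u ** 2 - np.log(bw * np.sqrt(2.0 * np.pi))
        return np.mean(np.exp(log_k.sum(axis=1)))             # mean of product kernels

    def normal_reference_h(n, d):
        """Normal-reference smoothing parameter h_N of (5), per our reading of Scott (1992)."""
        return (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))

    # Usage sketch: with Xs the selected signal training variables and s a
    # scale factor from the grid search,
    #   h = s * normal_reference_h(len(Xs), Xs.shape[1])
    #   fs_x = product_kernel_density(x, Xs, Xs.std(axis=0), h)
    # and similarly for the background; the two estimates then enter (3).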

2.3 A Neural Network

The basic unit of a feedforward layered neural network is the formal neuron. The neuron receives inputs x = (x_1, ..., x_d)' and computes the output y = τ(a_0 + a'x). Here a_0 and a = (a_1, ..., a_d)' are parameters or weights associated with the neuron, and the function τ is assumed to be monotonically increasing and differentiable. Often the logistic function τ(x) = 1/(1 + e^{-x}) is used.


Figure 3. Selection of the Best Smoothing Parameter for Kernel Discriminant Analysis. The performance index is shown as a function of s, where h = s h_N is the smoothing parameter used and h_N is the normal distribution-based optimal value.

In a feedforward layered network such formal neurons are arranged in consecutive layers, and each neuron in a layer receives its input from the outputs of all neurons in the previous layer. We used a network with one "hidden" layer. Such a network has the functional form y = τ(b_0 + b'z), z = (z_1, ..., z_m)', z_k = τ(a_{k0} + a_k'x), where the z_k are the hidden layer outputs; m is the size of the hidden layer; a_{k0}, a_k = (a_{k1}, ..., a_{kd})', and b_0, b = (b_1, ..., b_m)' are weights; and x is the input. We denote the collection of all weights by θ and the mapping defined by the network by y = g(x; θ).

The weights θ of the network are optimized using least squares binary regression. Thus, if x_1, ..., x_N is a training sample and y_i is defined to be 1 for signal and 0 for background, one attempts to find weights θ* that minimize the averaged squared residual (1/N) Σ_{i=1}^{N} (y_i - g(x_i; θ))². The optimization is performed numerically starting from randomly chosen initial weights. A useful recursive scheme, called back-propagation of errors, can be used to compute the required partial derivatives. Both simple gradient descent and more powerful optimization methods are used. We optimized using the Marquardt and the conjugate gradient methods.
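For concreteness, a minimal numpy sketch (ours) of such a one-hidden-layer logistic network trained by plain gradient descent on the averaged squared residual; the study itself used the Marquardt and conjugate gradient methods instead.

    import numpy as np

    # Sketch: one-hidden-layer network y = tau(b0 + b'z), z_k = tau(a_k0 + a_k'x),
    # trained by least squares binary regression with plain gradient descent.
    def tau(t):
        return 1.0 / (1.0 + np.exp(-t))              # logistic activation

    def train_network(X, y, m=5, steps=5000, lr=0.1, seed=0):
        """X: (N, d) inputs; y: 0/1 labels; m: hidden layer size."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        A = rng.normal(scale=0.1, size=(d, m))       # input-to-hidden weights
        a0 = np.zeros(m)
        b = rng.normal(scale=0.1, size=m)            # hidden-to-output weights
        b0 = 0.0
        for _ in range(steps):
            Z = tau(X @ A + a0)                      # (N, m) hidden outputs
            out = tau(Z @ b + b0)                    # (N,) network outputs
            # Back-propagated gradients of (1/N) * sum((y_i - out_i)^2).
            d_out = 2.0 * (out - y) * out * (1.0 - out) / N
            d_hid = np.outer(d_out, b) * Z * (1.0 - Z)
            A -= lr * (X.T @ d_hid)
            a0 -= lr * d_hid.sum(axis=0)
            b -= lr * (Z.T @ d_out)
            b0 -= lr * d_out.sum()
        return A, a0, b, b0

    def network_output(X, A, a0, b, b0):
        """The fitted network approximates the signal posterior probability (3)."""
        return tau(tau(X @ A + a0) @ b + b0)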

The neural network classifier uses a threshold 0 < α < 1 to classify an event as signal if g(x; θ*) ≥ α, and as background otherwise. Such an approach is justified because the conditional mean E(Y | X = x) coincides with the signal posterior probability (3), and consequently

\[ E[(Y - g(\mathbf{X}; \boldsymbol{\theta}))^2] = E[(Y - P(J = s \mid \mathbf{X}))^2] + E[(P(J = s \mid \mathbf{X}) - g(\mathbf{X}; \boldsymbol{\theta}))^2]. \]

Thus, the optimal network tries to approximate the signal posterior probability in the least squares sense.

A set of variables for the neural network was selected using the SFS procedure and 5-fold cross-validation. Furthermore, a modest effort was made to find an optimal hidden layer size m by trying three alternative network architectures for each d-variable combination. The hidden layer sizes considered were m = d/2, d, and 3d/2, rounded upward to integer values. The required computation times made more thorough model selection infeasible. Thus, the performance of a given variable combination and hidden layer size was evaluated using the index I computed using the cross-validated outputs of five optimized networks. The dotted curve of Figure 2 shows how the best I changed with d. The optimum suggested nine variables; in the order of preference, these are

5, 8, 14, 12, 4, 3, 11, 7, 9.   (7)

The corresponding best hidden layer size was 5. The peak of the performance index at nine variables occurs quite suddenly, so we therefore tested more closely the variables (7) by performing the 5-fold cross-validation run 10 times with random weight initializations in the optimization routine. The results suggested an average performance index value of .8, which is substantially smaller than the observed peak value. This apparent sensitivity to initialization was regarded as a drawback in the neural-network approach. The final neural network classifier was obtained by constructing a network with the input variables (7) and hidden layer size 5, training it with all training data using three random initializations, and picking the network with the smallest least squares error.

2.4 MARS

The MARS algorithm (Friedman 1991) attempts to model high-dimensional structure by taking advantage of local features in data through an expansion in product spline basis functions. The procedure is adaptive in that the number of basis functions, as well as the product degree and spline knot locations, is automatically determined by the data. The algorithm is a two-stage procedure, beginning with a forward stepwise phase that adds basis functions to the model in a deliberate attempt to overfit the data. The second stage of the algorithm is standard linear regression backward subset selection.

MARS finds the regression function in the form

\[ g(\mathbf{x}) = \sum_{m=0}^{M} a_m B_m(\mathbf{x}), \tag{8} \]

where the functions B_m are multivariate splines. The univariate truncated power-spline basis is used to construct tensor product basis functions of the form

\[ B_m(\mathbf{x}) = \prod_{k=1}^{K_m} \left[ s_{km} \left( x_{j(k,m)} - t_{km} \right) \right]_+^{q}. \tag{9} \]

Here [ ]_+ denotes the positive part and q is an integer. The order of variable interactions is defined by K_m. The constant s_{km} takes on only two values, ±1. The index j(k, m) labels the predictor variables, and the t_{km} are knot values selected from the training points. The forward procedure begins with B_0(x) = 1, and after M iterations, there are 2M + 1 basis functions, B_0(x), ..., B_{2M}(x), in the model. At the (M + 1)st iteration, two new basis functions of the following form are added to the model:


\[ B_{2M+1}(\mathbf{x}) = B_{l(M+1)}(\mathbf{x}) \left[ +\left( x_{j(M+1)} - t_{M+1} \right) \right]_+^{q}, \qquad B_{2M+2}(\mathbf{x}) = B_{l(M+1)}(\mathbf{x}) \left[ -\left( x_{j(M+1)} - t_{M+1} \right) \right]_+^{q}, \]

where B_{l(M+1)} is one of the 2M + 1 basis functions already present in the model, x_{j(M+1)} is one of the predictor variables, and t_{M+1} is a knot location on that variable. The indices l(M + 1) and j(M + 1) and the knot location t_{M+1} are chosen to give the most improvement (minimizing the sum of squared residuals) to the model.
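To make the basis construction concrete, here is a small sketch (ours, not the MARS implementation used in the study) of the truncated power factor in (9) and the candidate pair added in one forward step, with q = 1:

    import numpy as np

    # Sketch: truncated power spline factor of (9) and the pair of candidate
    # basis functions considered at one forward step of MARS.
    def hinge(x, t, sign, q=1):
        """[sign * (x - t)]_+^q evaluated for a vector x of one predictor."""
        return np.maximum(sign * (x - t), 0.0) ** q

    def candidate_pair(parent_values, x_j, t, q=1):
        """New basis functions B_parent(x) * [+/-(x_j - t)]_+^q.

        parent_values: values of a basis function already in the model on the
        training points; x_j: values of the chosen predictor; t: a knot taken
        from the training points.
        """
        return (parent_values * hinge(x_j, t, +1.0, q),
                parent_values * hinge(x_j, t, -1.0, q))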

In the backward model-selection stage, terms are deleted from the model to optimize the fit as measured by the generalized cross-validation criterion [GCV (Craven and Wahba 1979)]. The form of the GCV criterion is

\[ \mathrm{GCV}(M) = \frac{(1/N) \sum_{i=1}^{N} \left( y_i - \hat{g}_M(\mathbf{x}_i) \right)^2}{\left[ 1 - C(M)/N \right]^2}, \]

where the numerator is the lack-of-fit based on the training data and the denominator imposes a penalty for increasing the model complexity, C(M). This complexity function is taken to be C(M) = M(c/2 + 1), where M is the number of basis functions in the proposed model and c represents an additional cost for fitting the parameters associated with the included additional basis functions.
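As a direct transcription into code (our sketch), the GCV score of a proposed model with M basis functions can be computed as:

    import numpy as np

    # Sketch: generalized cross-validation criterion used in the backward stage.
    def gcv(y, fitted, n_basis, c=3.0):
        """y, fitted: responses and model fit on the N training events;
        n_basis: number M of basis functions in the proposed model;
        c: additional cost per basis function (the study used the default c = 3)."""
        N = len(y)
        lack_of_fit = np.mean((np.asarray(y) - np.asarray(fitted)) ** 2)
        complexity = n_basis * (c / 2.0 + 1.0)       # C(M) = M (c/2 + 1)
        return lack_of_fit / (1.0 - complexity / N) ** 2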

The smoothness of the model is controlled by the parameter q, the exponent in the spline basis functions (9). The strategy used in the MARS algorithm is to start with q = 1 splines to fit the model. Then, the basis functions of this initial model are used to derive an analogous piecewise cubic approximation ensuring continuous first derivatives.

In our work we used a modification of MARS that is partly based on logistic regression (see Friedman 1991, pp. 46-48). In this procedure the basis functions are chosen using the standard MARS algorithm, and the coefficients in the final model are chosen using linear logistic regression.

We experimented with alternative models by considering different values for the maximum allowed M in (8) during the forward stage, the maximum allowed K_m in (9), and the cost factor c. The different models were evaluated on the basis of the achieved signal efficiencies and signal-to-background ratios. Trying the values 15, 29, and 43 for the maximum M produced very similar performance results, and we decided to use the value 29. For the order of interactions, we tried two alternatives, K_m = 1 for all m and K_m = 2 for all m. The results appeared better for K_m = 1, so an additive model was adopted. Trying the three cost factors, c = 2, 3, and 4, had virtually no effect, so we used the default value c = 3. The final model had 20 basis functions, and all variables except 6, 7, and 8 were included.

3. COMPARISONS AND DISCUSSION

Figure 2 suggests that when variables are selected for the quadratic and kernel classifiers the value of the performance index I generally first increases as more variables are included until a point is reached beyond which adding new variables begins to deteriorate the performance. The performance index for the neural network behaves more erratically for large numbers of variables.

Peaking of the performance index can be explained as follows. Each classifier approximates the decision rule based on the likelihood ratio (2). By the Neyman-Pearson lemma (e.g., Lindgren 1976), the likelihood ratio classifier is optimal in the sense that, if an alternative classifier accepts signal events with the same probability, then the likelihood ratio classifier cannot accept background events at a probability higher than the alternative classifier. In other words, when the two classifiers both have signal efficiency E, then the performance index I of the alternative classifier cannot be higher than that of the likelihood ratio classifier. The performance index of the likelihood ratio classifier cannot then decrease when a new variable is added because a d-variable classifier can be viewed as an alternative classifier for events described by an augmented set of d + 1 variables. A similar argument shows that I is monotonically increasing also for the exhaustive search strategy. The number of training data is fixed, however, and it therefore becomes increasingly difficult to estimate the classifiers when the dimension d of the underlying space increases. Thus, a point may be reached after which the rising estimation errors start to overshadow the decreasing extra discrimination information provided by the addition of increasingly unimportant variables and the performance begins to deteriorate. The number of variables for which this happens naturally depends on the size of the signal and background training sets. The larger the training sets are, the more variables the classifiers can use.

The best single variable for the quadratic, kernel, and neural network classifiers was 5 [see (4), (6), and (7)]. None of these classifiers used variable 2, which was uncorrelated with the other variables. The other omitted variable, 10, correlates with the rest of the variables (with the exception of variable 2), being most strongly correlated with the best single variable 5, with signal and background correlation coefficients .8 and .82, respectively. In MARS, model selection is an automatic part of the algorithm itself. The relative importance of each variable in the MARS model can be evaluated based on how much leaving that variable out degrades the GCV score computed for the transformed model exp(ĝ_M)/[1 + exp(ĝ_M)] (see Friedman 1991). The variable 5 was by far the most important, and the least important variable was 2.

All the classifier discriminant functions had their values in the interval [0, 1]. Figure 4 shows typical distributions of discriminant function values, both with equal prior probabilities for the two classes and for the actual 1:41 ratio used to evaluate discrimination performance. With equal signal and background prior probabilities, a classifier with both a high signal-to-background ratio and high signal efficiency could be constructed. The 1:41 prior ratio, however, forces one to give up high efficiency to avoid a very low signal-to-background ratio. Figure 5 shows the achieved signal-to-background ratios as a function of the signal efficiency for the four tested methods using the independent testing data. The error bars indicate 95% confidence intervals obtained using the delta method (e.g., Barndorff-Nielsen and Cox 1989, sec. 2.7). The MARS classifier achieves the design goal of 1:1 signal-to-background ratio with signal efficiency at least .5, and it has the highest signal-to-background ratio throughout the efficiency range considered, although the absolute differences get rather small toward the high-efficiency end. The quadratic classifier appears to fall short of the design goal, and the performance of the kernel and the neural-network methods lies between that of the MARS classifier and the quadratic classifier. We also note that the regression-based methods perform better than the likelihood-based methods.

Figure 4. Typical Distributions of Classifier Discriminant Function Values (kernel discriminant analysis was used). Also shown (the bottom panel) are the distributions for signal (solid) and background (dashed and truncated from above) with the 1:41 prior ratio.

Figure 5. Performance of the Quadratic (dash-dotted), Kernel (solid), Neural Network (dotted), and MARS (dashed) Classifiers.
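The article does not spell out the delta-method computation; one plausible reconstruction (ours, with hypothetical details) treats the accepted signal and background counts as independent binomial outcomes and propagates their variances to the estimated S/B on the log scale:

    import numpy as np

    # Sketch (our reconstruction, not necessarily the authors' exact procedure):
    # approximate 95% confidence interval for the estimated S/B via the delta
    # method, with the accepted counts treated as independent binomial counts
    # out of n_s signal and n_b background test events.
    def sb_confidence_interval(ns_acc, n_s, nb_acc, n_b, prior_ratio=1.0 / 41.0):
        p_s = ns_acc / n_s                       # estimated P(accepted | signal)
        p_b = nb_acc / n_b                       # estimated P(accepted | background)
        sb = prior_ratio * p_s / p_b             # estimate of (1)
        # Delta method on log(S/B): var(log p_hat) is roughly (1 - p) / (n p).
        var_log = (1 - p_s) / (n_s * p_s) + (1 - p_b) / (n_b * p_b)
        half_width = 1.96 * np.sqrt(var_log)
        return sb * np.exp(-half_width), sb * np.exp(half_width)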

It appears essential that an event classifier estimates the optimal discrimination rule well in the tails of the background density (see Fig. 4). The use of a single smoothing parameter in the kernel method, however, may result in an unsatisfactory compromise between estimation accuracy in high- and low-density areas. Using different smoothing for signal and background, as well as adaptive methods (Scott 1992), might therefore improve kernel discriminant analysis. On the other hand, due to their great degree of local adaptivity, both MARS and the feedforward neural network are able to approximate very complicated functions, which could explain their better discrimination performance in the task at hand. The backward model-selection stage built into MARS efficiently prunes the overfitted model obtained in the forward stage to produce a classifier with good testing-set performance. Model selection for the neural network classifier consisted of only stepwise variable selection and a rather crude attempt to select an optimal hidden layer size. More sophisticated neural network model selection, including incremental network growing and pruning, could be tried as well (Haykin 1994, sec. 6.17). The quadratic classifier is often applied even if the underlying normality assumptions are known to be unfounded. Because many of the marginal distributions appear nonnormal (see Fig. 1), this seems to be the case here, too. As a remedy, a normalizing transformation to the data might improve the performance of the quadratic classifier (see Velilla and Barrio 1994).

Besides discrimination performance, the methods can also be compared on the basis of ease of implementation and classifier design, computational complexity of the discriminant functions, interpretability of model and variable selection, conceptual simplicity, and theoretical tractability.

The quadratic and kernel classifiers are easy to implement and design. The smoothing parameter optimization required in the kernel method is straightforward to carry out. The MARS and neural network techniques require much more sophisticated software, but fortunately good implementations of these techniques are publicly available. The MARS algorithm used in our study was based on the software written by J. H. Friedman and available from the StatLib software library. The training of a neural network classifier needs multivariate nonlinear optimization. This is slow for high-dimensional problems, and sensitivity to initialization (that is, local minima) can be a problem. The neural network classifier was in fact the slowest to design. The MARS model has several tunable parameters, and one needs to experiment with different designs to find suitable values for them.

The complexity of the discriminant function affects the time needed to classify an event. If the classifier has to work on-line, the evaluation speed becomes highly relevant. The discrimination task considered here, however, is not tackled until a small set of candidate signal events has been isolated. In a different application that requires fast classification, the kernel method could be too slow to use in its basic form, and computational speed-ups should be considered (e.g., Härdle and Scott 1992).

The variable selection used in the first three discrimination methods can shed at least some light on the relative importance of the original variables. For a given number of variables, however, the exhaustive search used with the quadratic classifier typically produced many combinations with nearly peak performance. The SFS procedure again is deliberately suboptimal. It is therefore impossible to conclude that any of the variable combinations suggested is truly optimal. An alternative to variable selection would be to use linear combinations of the original 14 variables. Such linear feature extraction can be based on principal components or related methodology, but the tests with the kernel method indicate that no discrimination improvement is achieved (Holmström and Sain 1993). Furthermore, compared with variable selection, the relative importance of the individual variables would be harder to discern. The terms, and hence the variables, included in the MARS model can be analyzed for their relative importance by considering their variance and effect on the GCV criterion. The way new terms are added to the model, however, is suboptimal. We also observed that changing the maximum allowed size of the model in the forward stage could radically change the structure of the terms appearing in the final model without affecting the discrimination performance much. Thus, as with the quadratic classifier, there appear to be many variable subsets with nearly top performance, and it is therefore not possible to conclude the optimality of any particular variable subset appearing in the MARS model.

Both the quadratic classifier and kernel discriminant analysis are conceptually simple and supported by a vast body of theoretical literature. The nonparametric kernel method is capable of estimating consistently arbitrary density functions, and hence kernel discriminant analysis can work well even when the underlying distributions are not normal. Successful application of the kernel method in a high-dimensional space may, however, require a large training sample. Neural networks and MARS have been introduced quite recently, and their theoretical foundations are not as well developed as those of the two likelihood-based methods.

In conclusion, on the basis of pure discrimination performance MARS appears to perform best. Its practical implementation is facilitated by the available public-domain software. On the other hand, the easy-to-understand explicit definition of the kernel density estimate has been an important factor in the acceptance of the kernel method as a top quark data-analysis tool.

4. SUMMARY

Multivariate discrimination methods are gaining ground in the analysis of high-energy physics data. In this work we have compared the performance of four different techniques in the search for the top quark. The problem was to detect a rare signal against a large background of uninteresting collision events. The quadratic classifier and kernel discriminant analysis were examples of likelihood-ratio-based methods, whereas the feedforward neural network and MARS represented regression approaches. Such multivariate classifiers offer promising alternatives to the conventional data-analysis technique currently in use.

ACKNOWLEDGMENTS

The research leading to this article was initiated while the authors were at the Department of Statistics of Rice University, Houston, Texas, L. Holmström as a Visiting Professor and S. R. Sain as a doctoral student. The impetus to develop multivariate discrimination techniques for top quark search came from our collaboration with Hannu Miettinen from the Department of Physics at Rice University. He also prepared the simulated datasets used in this study. We are also grateful to David Scott and Dennis Cox from the Department of Statistics at Rice University for useful discussions. The neural network software was written by Petri Koistinen from the Rolf Nevanlinna Institute. Charles Lo Presti from the Pacific Northwest Laboratory wrote the driver for the MARS software. We also thank the editor, an anonymous associate editor, and anonymous referees for useful suggestions that led to considerable improvement of the article during the revision process. This research was supported by grant 8001726 from the Academy of Finland, grants from the Emil Aaltonen Foundation and the Rolf Nevanlinna Institute, and National Science Foundation grant DMS-9306658.

[Received March 1994. Revised April 1996.]

REFERENCES

Abachi, S., et al. (D0 Collaboration) (1995), "Observation of the Top Quark," Physical Review Letters, 74, 2632-2637.
Abe, F., et al. (CDF Collaboration) (1995), "Observation of Top Quark Production in p̄p Collisions With the Collider Detector at Fermilab," Physical Review Letters, 74, 2626-2631.
Ballochi, G., and Odorico, R. (1983), "Comparison of Calorimetric Profiles of Top and QCD Jets and Possibilities of Discriminating Between Them," Nuclear Physics, B229, 1-28.
Barndorff-Nielsen, O. E., and Cox, D. R. (1989), Asymptotic Techniques for Use in Statistics, London: Chapman and Hall.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, New York: Chapman and Hall.


Cheng, B., and Titterington, D. M. (1994), "Neural Networks: A Review From a Statistical Perspective" (with discussion), Statistical Science, 9, 2-54.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions," Numerische Mathematik, 31, 377-403.
Denby, B. (1993), "The Use of Neural Networks in High-Energy Physics," Neural Computation, 5, 505-549.
Devijver, P. A., and Kittler, J. (1982), Pattern Recognition: A Statistical Approach, London: Prentice-Hall International.
Fisher, R. A. (1936), "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, 179-188.
Fix, E., and Hodges, J. L. (1951), "Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties," Report 4, Project 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas.
Flick, T. E., Jones, L. K., Priest, R. G., and Herman, C. (1990), "Pattern Classification Using Projection Pursuit," Pattern Recognition, 23, 1367-1376.
Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), The Annals of Statistics, 19, 1-141.
Friedman, J. H., and Stuetzle, W. (1981), "Projection Pursuit Regression," Journal of the American Statistical Association, 76, 817-823.
Fukunaga, K. (1990), Introduction to Statistical Pattern Recognition (2nd ed.), San Diego: Academic Press.
Hand, D. J. (1982), Kernel Discriminant Analysis, Chichester, U.K.: Research Studies Press.
Härdle, W. K., and Scott, D. W. (1992), "Smoothing by Weighted Averaging of Rounded Points," Computational Statistics, 7, 97-128.
Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman and Hall.
Haykin, S. (1994), Neural Networks: A Comprehensive Foundation, New York: Macmillan.
Holmström, L., and Sain, S. R. (1993), "Searching for the Top Quark Using Multivariate Density Estimates," Technical Report 93-3, Rice University, Dept. of Statistics.
Holmström, L., Sain, S. R., and Miettinen, H. E. (1995), "A New Multivariate Technique for Top Quark Search," Computer Physics Communications, 88, 195-210.
Lindgren, B. W. (1976), Statistical Theory (3rd ed.), New York: Macmillan.
McLachlan, G. J. (1992), Discriminant Analysis and Statistical Pattern Recognition, New York: Wiley.
Miettinen, H. E. (1995), "Top Quark Results From D0," in New Computing Techniques in Physics Research IV, eds. B. Denby and D. Perret-Gallix, Singapore: World Scientific, pp. 499-507.
Odorico, R. (1983), "Telling Top Jets from QCD Jets Using Energy Flow," Physics Letters, 120B, 219-223.
Oja, E. (1983), Subspace Methods of Pattern Recognition, Letchworth, U.K.: Research Studies Press.
Ripley, B. D. (1994), "Neural Networks and Related Methods for Classification" (with discussion), Journal of the Royal Statistical Society, Ser. B, 56, 409-456.
Scott, D. W. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization, New York: Wiley.
Silverman, B. W., and Jones, M. C. (1989), "E. Fix and J. L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation: Commentary on Fix and Hodges (1951)," International Statistical Review, 57, 233-247.
Sjöstrand, T. (1992), "PYTHIA 5.6 and JETSET 7.3 Physics Manual," CERN preprint CERN-TH.6488/92.
Velilla, S., and Barrio, J. A. (1994), "A Discriminant Rule Under Transformation," Technometrics, 36, 348-353.
