
HAL Id: tel-00868847
https://tel.archives-ouvertes.fr/tel-00868847

Submitted on 2 Oct 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Learning algorithms for sparse classification

Luis Francisco Sanchez Merchante

To cite this version: Luis Francisco Sanchez Merchante. Learning algorithms for sparse classification. Computer science. Université de Technologie de Compiègne, 2013. English. NNT: 2013COMP2084. tel-00868847.

By Luis Francisco SANCHEZ MERCHANTE

Thesis presented to obtain the degree of Doctor of the UTC

Learning algorithms for sparse classification

Defended on 7 June 2013

Speciality: Technologies de l'Information et des Systèmes

D2084

Algorithmes d'estimation pour la classification parcimonieuse

Luis Francisco Sanchez Merchante
University of Compiègne

Compiègne, France

"You never know what you will find behind a door. Perhaps that is what life is about: turning doorknobs."

Albert Espinosa

"Be brave. Take risks. Nothing can substitute experience."

Paulo Coelho

Acknowledgements

If this thesis has fallen into your hands and you have the curiosity to read this paragraph, you must know that, even though it is a short section, there are quite a lot of people behind this volume. All of them supported me during the three years, three months and three weeks that it took me to finish this work. However, you will hardly find any names. I think it is a little sad writing people's names in a document that they will probably not see and that will be condemned to gather dust on a bookshelf. It is like losing a wallet with pictures of your beloved family and friends. It makes me feel something like melancholy.

Obviously, this does not mean that I have nothing to be grateful for. I always felt unconditional love and support from my family, and I never felt homesick since my Spanish friends did the best they could to visit me frequently. During my time in Compiègne I met wonderful people that are now friends for life. I am sure that all these people do not need to be listed in this section to know how much I love them. I thank them every time we see each other by giving them the best of myself.

I enjoyed my time in Compiègne. It was an exciting adventure and I do not regret a single thing. I am sure that I will miss these days, but this does not make me sad because, as the Beatles sang in "The End", or Jorge Drexler in "Todo se transforma", the amount that you miss people is equal to the love you gave them and received from them.

The only names I am including are my supervisors', Yves Grandvalet and Gérard Govaert. I do not think it is possible to have had better teaching and supervision, and I am sure that the reason I finished this work was not only their technical advice, but also their close support, humanity and patience.

Contents

List of Figures

List of Tables

Notation and Symbols

I Context and Foundations

1 Context

2 Regularization for Feature Selection
  2.1 Motivations
  2.2 Categorization of Feature Selection Techniques
  2.3 Regularization
    2.3.1 Important Properties
    2.3.2 Pure Penalties
    2.3.3 Hybrid Penalties
    2.3.4 Mixed Penalties
    2.3.5 Sparsity Considerations
    2.3.6 Optimization Tools for Regularized Problems

II Sparse Linear Discriminant Analysis

Abstract

3 Feature Selection in Fisher Discriminant Analysis
  3.1 Fisher Discriminant Analysis
  3.2 Feature Selection in LDA Problems
    3.2.1 Inertia Based
    3.2.2 Regression Based

4 Formalizing the Objective
  4.1 From Optimal Scoring to Linear Discriminant Analysis
    4.1.1 Penalized Optimal Scoring Problem
    4.1.2 Penalized Canonical Correlation Analysis
    4.1.3 Penalized Linear Discriminant Analysis
    4.1.4 Summary
  4.2 Practicalities
    4.2.1 Solution of the Penalized Optimal Scoring Regression
    4.2.2 Distance Evaluation
    4.2.3 Posterior Probability Evaluation
    4.2.4 Graphical Representation
  4.3 From Sparse Optimal Scoring to Sparse LDA
    4.3.1 A Quadratic Variational Form
    4.3.2 Group-Lasso OS as Penalized LDA

5 GLOSS Algorithm
  5.1 Regression Coefficients Updates
    5.1.1 Cholesky Decomposition
    5.1.2 Numerical Stability
  5.2 Score Matrix
  5.3 Optimality Conditions
  5.4 Active and Inactive Sets
  5.5 Penalty Parameter
  5.6 Options and Variants
    5.6.1 Scaling Variables
    5.6.2 Sparse Variant
    5.6.3 Diagonal Variant
    5.6.4 Elastic Net and Structured Variant

6 Experimental Results
  6.1 Normalization
  6.2 Decision Thresholds
  6.3 Simulated Data
  6.4 Gene Expression Data
  6.5 Correlated Data
  Discussion

III Sparse Clustering Analysis

Abstract

7 Feature Selection in Mixture Models
  7.1 Mixture Models
    7.1.1 Model
    7.1.2 Parameter Estimation: The EM Algorithm
  7.2 Feature Selection in Model-Based Clustering
    7.2.1 Based on Penalized Likelihood
    7.2.2 Based on Model Variants
    7.2.3 Based on Model Selection

8 Theoretical Foundations
  8.1 Resolving EM with Optimal Scoring
    8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
    8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
    8.1.3 Clustering Using Penalized Optimal Scoring
    8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
  8.2 Optimized Criterion
    8.2.1 A Bayesian Derivation
    8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
  9.1 Mix-GLOSS
    9.1.1 Outer Loop: Whole Algorithm Repetitions
    9.1.2 Penalty Parameter Loop
    9.1.3 Inner Loop: EM Algorithm
  9.2 Model Selection

10 Experimental Results
  10.1 Tested Clustering Algorithms
  10.2 Results
  10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
  B.1 How to Solve the Eigenvector Decomposition
  B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
  D.1 Useful Properties
  D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
  G.1 Prior Probabilities
  G.2 Means
  G.3 Covariance Matrix

Bibliography

List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 The four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ‖β‖_p
2.4 Two-dimensional regularized problems with ‖β‖_1 and ‖β‖_2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to the Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image
6.1 TPR versus FPR for all simulations
6.2 2D representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS loops scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N  the set of natural numbers, N = {1, 2, ...}
R  the set of reals
|A|  cardinality of a set A (for finite sets, the number of elements)
Ā  complement of set A

Data

X  input domain
x_i  input sample, x_i ∈ X
X  design matrix, X = (x_1^T, ..., x_n^T)^T
x^j  column j of X
y_i  class indicator of sample i
Y  indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z  complete data, z = (x, y)
G_k  set of the indices of observations belonging to class k
n  number of examples
K  number of classes
p  dimension of X
i, j, k  indices running over N

Vectors, Matrices and Norms

0  vector with all entries equal to zero
1  vector with all entries equal to one
I  identity matrix
A^T  transpose of matrix A (ditto for vectors)
A^{-1}  inverse of matrix A
tr(A)  trace of matrix A
|A|  determinant of matrix A
diag(v)  diagonal matrix with v on the diagonal
‖v‖_1  L1 norm of vector v
‖v‖_2  L2 norm of vector v
‖A‖_F  Frobenius norm of matrix A

Probability

E[·]  expectation of a random variable
var[·]  variance of a random variable
N(μ, σ²)  normal distribution with mean μ and variance σ²
W(W, ν)  Wishart distribution with ν degrees of freedom and scale matrix W
H(X)  entropy of random variable X
I(X; Y)  mutual information between random variables X and Y

Mixture Models

y_ik  hard membership of sample i to cluster k
f_k  distribution function for cluster k
t_ik  posterior probability of sample i belonging to cluster k
T  posterior probability matrix
π_k  prior probability or mixture proportion for cluster k
μ_k  mean vector of cluster k
Σ_k  covariance matrix of cluster k
θ_k  parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)  parameter vector at iteration t of the EM algorithm
f(X; θ)  likelihood function
L(θ; X)  log-likelihood function
L_C(θ; X, Y)  complete log-likelihood function

Optimization

J(·)  cost function
L(·)  Lagrangian
β̂  generic notation for the solution with respect to β
β^ls  least squares solution coefficient vector
A  active set
γ  step size to update the regularization path
h  direction to update the regularization path

Penalized Models

λ, λ_1, λ_2  penalty parameters
P_λ(θ)  penalty term over a generic parameter vector
β_kj  coefficient j of discriminant vector k
β_k  kth discriminant vector, β_k = (β_k1, ..., β_kp)
B  matrix of discriminant vectors, B = (β_1, ..., β_{K−1})
β^j  jth row of B = (β^{1T}, ..., β^{pT})^T
B_LDA  coefficient matrix in the LDA domain
B_CCA  coefficient matrix in the CCA domain
B_OS  coefficient matrix in the OS domain
X_LDA  data matrix in the LDA domain
X_CCA  data matrix in the CCA domain
X_OS  data matrix in the OS domain
θ_k  score vector k
Θ  score matrix, Θ = (θ_1, ..., θ_{K−1})
Y  label matrix
Ω  penalty matrix
L_CP(θ; X, Z)  penalized complete log-likelihood function
Σ_B  between-class covariance matrix
Σ_W  within-class covariance matrix
Σ_T  total covariance matrix
Σ̂_B  sample between-class covariance matrix
Σ̂_W  sample within-class covariance matrix
Σ̂_T  sample total covariance matrix
Λ  inverse of the covariance matrix, or precision matrix
w_j  weights
τ_j  penalty components of the variational approach

Part I

Context and Foundations


This thesis is divided into three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here, to introduce the models and some basic concepts that will be used along this document. The state of the art of the relevant techniques is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the research point of view, the members of the consortium must deal with four main goals:

1. Software development of the website framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo

The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code on some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if they were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the implemented tool are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the implemented tool can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using the RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to meet our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and eases interpretation in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction on the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

"Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, used to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I propose below a framework inspired by these references; it does not cover all the possibilities, but it gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

  – Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

  – Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

  – Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion, without being re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

  – Complete - No subsets are missed from evaluation; this involves combinatorial searches.

  – Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

  – Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

  – Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

  – Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

  – Dependency Measures - Measuring the correlation between features.

  – Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

  – Predictive Accuracy - Use the selected features to predict the labels.

  – Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. An example arises when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

$$\min_{\beta} \; J(\beta) + \lambda P(\beta) \qquad (2.1)$$

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \qquad (2.2)$$

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties, which will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖_p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)$$

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ‖β‖_1 and ‖β‖_2 penalties

Regularizing a linear model with a norm like ‖β‖_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^l1, and for the L2 regularization it is β^l2. Solution β^l1 is sparse because its second component is zero, while both components of β^l2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is what we seek. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖_0 = card{β_j | β_j ≠ 0}:

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)$$

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
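The combinatorial nature of the L0 constraint can be made concrete with a tiny brute-force sketch. The code below is my own illustration (not part of the thesis or MASH code): it enumerates every support of size at most t and keeps the one with the lowest residual sum of squares, which quickly becomes intractable as p grows.

```python
import itertools
import numpy as np

def best_subset(X, y, t):
    """Exhaustive search for the L0-constrained least squares problem (2.4).

    Tries every subset of at most t variables; cost grows combinatorially with p.
    """
    n, p = X.shape
    best_rss, best_beta = np.inf, np.zeros(p)
    for k in range(1, t + 1):
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            beta_s = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = np.sum((y - X[:, cols] @ beta_s) ** 2)
            if rss < best_rss:
                best_beta = np.zeros(p)
                best_beta[cols] = beta_s
                best_rss = rss
    return best_beta

# Toy usage: 20 variables, only 3 of them informative
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
beta_true = np.zeros(20)
beta_true[[2, 5, 11]] = [1.5, -2.0, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(50)
print(np.nonzero(best_subset(X, y, t=3))[0])   # should recover {2, 5, 11}
```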

L1 Penalties The penalties built using L1 norms induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \qquad (2.5)$$

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used, to avoid the square root and solve a linear system. Thus, an L2-penalized optimization problem looks like

$$\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)$$

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 \qquad (2.7)$$

with solution $\beta^{ls} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The solution to this problem is $\beta^{l_2} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_p)^{-1} \mathbf{X}^\top \mathbf{y}$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
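As a quick numerical illustration (my own sketch, not code from the thesis), the closed-form ridge solution can be compared with ordinary least squares on nearly collinear inputs: the λI_p term shifts the small eigenvalues of XᵀX and stabilizes the inversion.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator: beta = (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n = 40
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)   # almost a copy of x1: X'X is near singular
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = x1 + 2 * x3 + 0.1 * rng.standard_normal(n)

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]  # wildly unstable coefficients on x1/x2
beta_l2 = ridge(X, y, lam=1.0)                  # "equalized", stable coefficients
print(beta_ls, beta_l2, sep="\n")
```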

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \qquad (2.8)$$

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component; there, every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although L2-penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties A special case of Lp norms is the infinity norm, defined as ‖x‖_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ‖β‖_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term by itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm $\|\beta\|_*$ of a norm $\|\beta\|$ is defined as

$$\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \; \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1$$

In the case of an Lq norm, with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself: because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \qquad (2.9)$$

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
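In practice, criterion (2.9) can be handled with any Lasso solver through a data augmentation trick (cf. Zou and Hastie, 2005): stacking √λ2·I_p under X and p zeros under y turns the ridge term into extra least squares rows. The sketch below is my own illustration; `lasso_solver` is an assumed callable (for instance a coordinate descent routine as in Section 2.3.6), not a function defined in this thesis.

```python
import numpy as np

def elastic_net_augmented(X, y, lam1, lam2, lasso_solver):
    """Solve the (naive) elastic net (2.9) as a Lasso on augmented data.

    The ridge term lam2 * ||beta||_2^2 is absorbed by appending sqrt(lam2)*I
    to X and zeros to y, so that any L1 solver can be reused.
    `lasso_solver(X, y, lam)` is assumed to minimize
    sum_i (y_i - x_i'beta)^2 + lam * sum_j |beta_j|.
    """
    n, p = X.shape
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    return lasso_solver(X_aug, y_aug, lam1)
```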


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by $G_\ell$ the group of genes for the $\ell$th process and by $d_\ell$ the number of genes (variables) in each group, for all $\ell \in \{1, \ldots, L\}$. Thus, the dimension of vector β is the sum of the sizes of all groups, $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \right)^{\frac{1}{r}} \qquad (2.10)$$

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
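The following sketch (my own illustration, with arbitrary group definitions) computes the mixed norm (2.10) for the group-Lasso case (r, s) = (1, 2), optionally weighting each group by the square root of its size as mentioned above.

```python
import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0, size_weighted=True):
    """Mixed (r, s) norm of eq. (2.10): an L_s norm within each group,
    an L_r norm over the vector of (possibly weighted) group norms."""
    group_norms = []
    for g in groups:                       # g is a list of indices (a group G_l)
        w = np.sqrt(len(g)) if size_weighted else 1.0
        group_norms.append(w * np.sum(np.abs(beta[g]) ** s) ** (1.0 / s))
    return np.sum(np.asarray(group_norms) ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.5, -2.0, 0.0, 0.3])
groups = [[0, 1], [2, 3], [4, 5]]          # three groups of two coefficients
print(mixed_norm(beta, groups, r=1, s=2))  # group-Lasso penalty ||beta||_(1,2)
```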

Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al., 2008) or ‖β‖_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms with the proper definition of groups can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for (a) the L1 Lasso and (b) the L(1,2) group-Lasso

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity, (b) L(1,2) group-induced sparsity

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be described as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent Subgradient descent is a generic optimization method that can be used in the setting of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

$$\beta^{(t+1)} = \beta^{(t)} - \alpha \,(\mathbf{s} + \lambda \mathbf{s}'), \quad \text{where } \mathbf{s} \in \partial J(\beta^{(t)}), \; \mathbf{s}' \in \partial P(\beta^{(t)})$$
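As an illustration (my own sketch, with an arbitrary fixed step size), the update above applied to the penalized least squares criterion J(β) = Σ_i (y_i − x_iᵀβ)² with P(β) = ‖β‖_1 reads as follows; note that the iterates are typically not exactly sparse, as pointed out above.

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, alpha=1e-3, n_iter=5000):
    """Subgradient descent on sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        s = -2.0 * X.T @ (y - X @ beta)    # gradient of the quadratic loss
        s_prime = np.sign(beta)            # a subgradient of the L1 norm
        beta = beta - alpha * (s + lam * s_prime)
    return beta
```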

Coordinate Descent Coordinate descent is based on the first-order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient β_j gives

$$\hat{\beta}_j = \frac{-\lambda \, \mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2}$$

In the literature those algorithms may also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating the values with an iterative thresholding rule where $\beta_j^{(t+1)} = S_\lambda\big(\tfrac{\partial J(\beta^{(t)})}{\partial \beta_j}\big)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex]
\dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex]
0 & \text{if } \Bigl|\frac{\partial J(\beta)}{\partial \beta_j}\Bigr| \le \lambda
\end{cases}
\qquad (2.11)$$

The same principles define "block-coordinate descent" algorithms. In this case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).

Active and Inactive Sets Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero β_j; it is usually denoted A. The complement of the active set is the "inactive set", noted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set. Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, the solutions of successive active sets are typically close to each other, so it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).
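The three tasks can be organized in a forward working-set skeleton such as the following. This is a simplified sketch of mine for the Lasso case (not the GLOSS implementation): the subproblem restricted to the active set is solved by coordinate descent, warm started with the current solution, and the KKT check uses the gradient of the quadratic loss.

```python
import numpy as np

def _soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_working_set(X, y, lam, tol=1e-6, inner_iter=200):
    """Forward working-set scheme for the Lasso: grow the active set A with the
    variable that most violates the optimality conditions, then re-optimize the
    subproblem restricted to A (here by coordinate descent, warm started)."""
    n, p = X.shape
    beta = np.zeros(p)
    active = []                                   # the active set A
    while True:
        grad = -2.0 * X.T @ (y - X @ beta)        # gradient of the quadratic loss
        violation = np.abs(grad)
        violation[active] = 0.0                   # only inspect the inactive set
        j_star = int(np.argmax(violation))
        if violation[j_star] <= lam + tol:
            return beta                           # optimality conditions hold
        active.append(j_star)                     # working set update task
        # Optimization task restricted to A, warm started with the current beta
        col_sq = np.sum(X[:, active] ** 2, axis=0)
        for _ in range(inner_iter):
            for idx, j in enumerate(active):
                r_j = y - X @ beta + X[:, j] * beta[j]
                beta[j] = _soft(2.0 * X[:, j] @ r_j, lam) / (2.0 * col_sq[idx])
```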

These active constraints or working set methods, even if they were originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties, for example linear functions and L1 penalties (Roth, 2004), linear functions and L(1,2) penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and which variable should enter the active set, from the correlation with the residuals.
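For a quick look at such a path in practice, scikit-learn ships a LARS implementation; the sketch below is my own usage example (it assumes scikit-learn is installed, which is not a dependency of this thesis) and computes the breakpoints of the piecewise-linear Lasso path.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 10))
beta_true = np.zeros(10)
beta_true[[1, 4]] = [2.0, -1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(60)

# alphas: penalty values at the breakpoints of the piecewise-linear path
# coefs[:, k]: the solution at breakpoint k (columns of the coefficient path)
alphas, active, coefs = lars_path(X, y, method="lasso")
print(alphas)
print(active)   # order in which the variables enter the active set
```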

Proximal Methods Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)$$

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)$$

The basic algorithm uses the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
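For P(β) = ‖β‖_1, the minimizer of (2.13) is the soft-thresholding of the gradient step, which gives the classical ISTA iteration. Below is a minimal sketch of mine for the penalized least squares case, where L is taken as twice the squared spectral norm of X, a valid Lipschitz bound for this loss.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) on sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1.

    Each step solves (2.13): a gradient step on the smooth part followed by the
    proximal operator of (lam/L) * ||.||_1, i.e. soft-thresholding.
    """
    n, p = X.shape
    L = 2.0 * np.linalg.norm(X, ord=2) ** 2     # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        z = beta - grad / L                      # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox step
    return beta
```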


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables deriving variants, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, generating a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher DiscriminantAnalysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived. Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider data consisting of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0,1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1, ..., x_n)^⊤ and the corresponding labels in the n×K matrix Y = (y_1, ..., y_n)^⊤.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

max_{β∈R^p}  (β^⊤ Σ_B β) / (β^⊤ Σ_W β) ,        (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

Σ_W = (1/n) Σ_{k=1}^{K} Σ_{i∈G_k} (x_i − μ_k)(x_i − μ_k)^⊤
Σ_B = (1/n) Σ_{k=1}^{K} Σ_{i∈G_k} (μ − μ_k)(μ − μ_k)^⊤ ,

where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

max_{B∈R^{p×(K−1)}}  tr(B^⊤ Σ_B B) / tr(B^⊤ Σ_W B) ,        (3.2)

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

max_{β_k∈R^p}  β_k^⊤ Σ_B β_k
s.t.  β_k^⊤ Σ_W β_k ≤ 1 ,
      β_k^⊤ Σ_W β_ℓ = 0 , ∀ℓ < k .        (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated with the kth largest eigenvalue (see Appendix C).
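For small problems with an invertible Σ_W, the solution of the series (3.3) can be computed directly as a generalized eigenvalue problem. The following NumPy/SciPy sketch is only illustrative (our own function names); it assumes a centered X and a positive-definite Σ_W.

    import numpy as np
    from scipy.linalg import eigh

    def fisher_directions(X, y, K):
        # X: (n, p) centered data; y: (n,) labels in {0, ..., K-1}
        n, p = X.shape
        mu = X.mean(axis=0)
        Sigma_W = np.zeros((p, p))
        Sigma_B = np.zeros((p, p))
        for k in range(K):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
            Sigma_B += Xk.shape[0] * np.outer(mu - mu_k, mu - mu_k) / n
        # generalized symmetric eigenproblem Sigma_B v = e Sigma_W v (eigenvalues in ascending order)
        evals, evecs = eigh(Sigma_B, Sigma_W)
        return evecs[:, ::-1][:, :K - 1]   # the K-1 leading discriminant directions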

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually assessed by univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far apart (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

min_{β∈R^p}  β^⊤ Σ_W β
s.t.  (μ_1 − μ_2)^⊤ β = 1 ,
      Σ_{j=1}^{p} |β_j| ≤ t ,

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

max_{β_k∈R^p}  β_k^⊤ Σ_B^k β_k − P_k(β_k)
s.t.  β_k^⊤ Σ_W β_k ≤ 1 .

The term to maximize is the projected between-class covariance β_k^⊤ Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k^⊤ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks the less informative variables to zero, while the fused Lasso encourages a piecewise-constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem, but instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

min_{β∈R^p}  ‖β‖_1
s.t.  ‖Σ β − (μ_1 − μ_2)‖_∞ ≤ λ .

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed above are conceived for binary classification, and for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity. However, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al. 2000, Friedman et al. 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n×K matrix with the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al. 2009). Another "popular" choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to optimal scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similarity is confirmed, their formalization would be very close to that of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

min_{β∈R^p, β_0∈R}  n^{-1} Σ_{i=1}^{n} (y_i − β_0 − x_i^⊤ β)² + λ Σ_{j=1}^{p} |β_j| ,

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting from λ = 0, but a different intercept β_0 is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the class indicators together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(B^⊤ Ω B)        (3.4a)
s.t.  n^{-1} Θ^⊤ Y^⊤ Y Θ = I_{K−1} ,        (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

min_{θ_k∈R^K, β_k∈R^p}  ‖Yθ_k − Xβ_k‖² + β_k^⊤ Ω β_k        (3.5a)
s.t.  n^{-1} θ_k^⊤ Y^⊤ Y θ_k = 1 ,        (3.5b)
      θ_k^⊤ Y^⊤ Y θ_ℓ = 0 , ℓ = 1, ..., k−1 ,        (3.5c)

where each β_k corresponds to a discriminant direction.


Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

min_{β_k∈R^p, θ_k∈R^K}  Σ_k ‖Yθ_k − Xβ_k‖²_2 + λ_1 ‖β_k‖_1 + λ_2 β_k^⊤ Ω β_k ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

min_{β_k∈R^p, θ_k∈R^K}  Σ_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖²_2 + λ Σ_{j=1}^{p} ( Σ_{k=1}^{K−1} β²_{kj} )^{1/2} ,        (3.6)

which is the criterion that was chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm closely followed the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.
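The grouping structure of criterion (3.6) is what produces variable-wise sparsity: each group gathers the coefficients of one variable across all K − 1 discriminant directions, that is, one row of B. The snippet below, an illustrative sketch with our own names, simply evaluates this penalty (the weights default to one, as in (3.6); GLOSS uses predefined nonnegative weights w_j).

    import numpy as np

    def group_lasso_penalty(B, lam, w=None):
        # one group per variable: the j-th row of B holds its coefficients in the K-1 directions
        row_norms = np.sqrt((B ** 2).sum(axis=1))
        w = np.ones(B.shape[0]) if w is None else w
        return lam * np.sum(w * row_norms)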


4 Formalizing the Objective

In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995) and already used for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix Y^⊤Y is full rank;

• inputs are centered, that is, X^⊤ 1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore, so as to simplify all expressions. The generic problem solved is thus

min_{θ∈R^K, β∈R^p}  ‖Yθ − Xβ‖² + β^⊤ Ω β        (4.1a)
s.t.  n^{-1} θ^⊤ Y^⊤ Y θ = 1 .        (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

β_OS = (X^⊤X + Ω)^{-1} X^⊤ Y θ .        (4.2)

The objective function (4.1a) is then

‖Yθ − Xβ_OS‖² + β_OS^⊤ Ω β_OS = θ^⊤Y^⊤Yθ − 2 θ^⊤Y^⊤X β_OS + β_OS^⊤ (X^⊤X + Ω) β_OS
                              = θ^⊤Y^⊤Yθ − θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ ,

where the second line stems from the definition (4.2) of β_OS. Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

max_{θ : n^{-1} θ^⊤Y^⊤Yθ = 1}  θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ ,        (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

(Y^⊤Y)^{-1} Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = α² θ ,        (4.4)


where α² is the maximal eigenvalue:¹

n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = α² n^{-1} θ^⊤ (Y^⊤Y) θ
n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = α² .        (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

max_{θ∈R^K, β∈R^p}  n^{-1} θ^⊤Y^⊤X β        (4.6a)
s.t.  n^{-1} θ^⊤Y^⊤Y θ = 1 ,        (4.6b)
      n^{-1} β^⊤ (X^⊤X + Ω) β = 1 .        (4.6c)

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

n L(β, θ, ν, γ) = θ^⊤Y^⊤X β − ν (θ^⊤Y^⊤Y θ − n) − γ (β^⊤ (X^⊤X + Ω) β − n)
⇒  n ∂L(β, θ, γ, ν)/∂β = X^⊤Y θ − 2γ (X^⊤X + Ω) β
⇒  β_CCA = (1/2γ) (X^⊤X + Ω)^{-1} X^⊤Y θ .

Then, as β_CCA obeys (4.6c), we obtain

β_CCA = (X^⊤X + Ω)^{-1} X^⊤Y θ / ( n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ )^{1/2} ,        (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

n^{-1} θ^⊤Y^⊤X β_CCA = n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ / ( n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ )^{1/2}
                     = ( n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ )^{1/2} ,

and the optimization problem with respect to θ can be restated as

max_{θ : n^{-1} θ^⊤Y^⊤Yθ = 1}  θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ .        (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

β_OS = α β_CCA ,        (4.9)

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

n ∂L(β, θ, γ, ν)/∂θ = Y^⊤X β − 2ν Y^⊤Y θ
⇒  θ_CCA = (1/2ν) (Y^⊤Y)^{-1} Y^⊤X β .        (4.10)

Then, as θ_CCA obeys (4.6b), we obtain

θ_CCA = (Y^⊤Y)^{-1} Y^⊤X β / ( n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β )^{1/2} ,        (4.11)

leading to the following expression of the optimal objective function

n^{-1} θ_CCA^⊤ Y^⊤X β = n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β / ( n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β )^{1/2}
                      = ( n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β )^{1/2} .

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

max_{β∈R^p}  n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β        (4.12a)
s.t.  n^{-1} β^⊤ (X^⊤X + Ω) β = 1 ,        (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution; following the reasoning of Appendix C, β_CCA verifies

n^{-1} X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β_CCA = λ (X^⊤X + Ω) β_CCA ,        (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

n^{-1} β_CCA^⊤ X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β_CCA = λ
⇒  n^{-1} α^{-1} β_CCA^⊤ X^⊤Y (Y^⊤Y)^{-1} Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = λ
⇒  n^{-1} α β_CCA^⊤ X^⊤Y θ = λ
⇒  n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = λ
⇒  α² = λ .

The first line is obtained from constraint (4.12b); the second line from relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again relationship (4.7); and the last one the definition (4.5) of α.


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), penalized Linear Discriminant Analysis is defined as follows:

max_{β∈R^p}  β^⊤ Σ_B β        (4.14a)
s.t.  β^⊤ (Σ_W + n^{-1} Ω) β = 1 ,        (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form, amenable to a simple matrix representation using the projection operator Y (Y^⊤Y)^{-1} Y^⊤:

Σ_T = (1/n) Σ_{i=1}^{n} x_i x_i^⊤ = n^{-1} X^⊤X
Σ_B = (1/n) Σ_{k=1}^{K} n_k μ_k μ_k^⊤ = n^{-1} X^⊤Y (Y^⊤Y)^{-1} Y^⊤X
Σ_W = (1/n) Σ_{k=1}^{K} Σ_{i : y_ik = 1} (x_i − μ_k)(x_i − μ_k)^⊤ = n^{-1} ( X^⊤X − X^⊤Y (Y^⊤Y)^{-1} Y^⊤X ) .

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β_LDA = λ ( X^⊤X + Ω − X^⊤Y (Y^⊤Y)^{-1} Y^⊤X ) β_LDA
X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β_LDA = (λ / (1 − λ)) ( X^⊤X + Ω ) β_LDA .

The comparison of the last equation with (4.13) shows that β_LDA and β_CCA are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

β_LDA = (1 − α²)^{-1/2} β_CCA
      = α^{-1} (1 − α²)^{-1/2} β_OS ,

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(B^⊤ Ω B)
s.t.  n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1} .

Let A represent the (K−1)×(K−1) diagonal matrix with elements α_k, the square root of the kth largest eigenvalue of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y; we have

B_LDA = B_CCA (I_{K−1} − A²)^{-1/2}
      = B_OS A^{-1} (I_{K−1} − A²)^{-1/2} ,        (4.15)

where I_{K−1} is the (K−1)×(K−1) identity matrix.

At this point, the feature matrix X, which has dimensions n×p in the input space, can be projected into the optimal scoring domain as the n×(K−1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as the n×(K−1) matrix X_LDA = X B_LDA. Classification can be performed in any of these domains, provided the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as B_OS = (X^⊤X + λΩ)^{-1} X^⊤Y Θ, where Θ are the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1} (I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Produce a graphical representation.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr(B^⊤ Ω B)        (4.16a)
s.t.  n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1} ,        (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰^⊤ Y^⊤Y Θ⁰ = I_{K−1}.

2. Compute B = (X^⊤X + λΩ)^{-1} X^⊤Y Θ⁰.

3. Set Θ to be the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y.

4. Compute the optimal regression coefficients

B_OS = (X^⊤X + λΩ)^{-1} X^⊤Y Θ .        (4.17)

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰^⊤ Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y Θ⁰, which is computed as Θ⁰^⊤ Y^⊤X B, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
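The sketch below follows the four steps above for a quadratic penalty Ω, including the Θ⁰ trick. It is an illustrative NumPy sketch under the assumptions stated in Section 4.1 (centered X, 0/1 indicator matrix Y, invertible X^⊤X + λΩ); the function names are ours, and it is not the authors' matlab implementation.

    import numpy as np

    def initial_scores(Y):
        # Theta0 with n^{-1} Theta0' Y'Y Theta0 = I_{K-1}, orthogonal to the constant score
        n, K = Y.shape
        Q, _ = np.linalg.qr(np.hstack([np.ones((K, 1)), np.random.randn(K, K - 1)]))
        U = Q[:, 1:]                                  # orthonormal columns orthogonal to 1_K
        counts = Y.sum(axis=0)                        # class sizes (diagonal of Y'Y)
        return np.sqrt(n) * (U / np.sqrt(counts)[:, None])

    def penalized_os(X, Y, Omega, lam):
        n, K = Y.shape
        Theta0 = initial_scores(Y)
        M = X.T @ X + lam * Omega
        B0 = np.linalg.solve(M, X.T @ Y @ Theta0)     # Step 2
        S = Theta0.T @ (Y.T @ X) @ B0                 # Theta0' Y'X (X'X + lam Omega)^{-1} X'Y Theta0
        evals, V = np.linalg.eigh((S + S.T) / 2)      # Step 3 (symmetrized for numerical safety)
        order = np.argsort(evals)[::-1]
        Theta = Theta0 @ V[:, order]                  # optimal scores
        B_os = B0 @ V[:, order]                       # Step 4: optimal regression coefficients
        alpha = np.sqrt(np.maximum(evals[order], 0.0) / n)
        return B_os, Theta, alpha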

This four-step algorithm is valid when the penalty is of the form tr(B^⊤ Ω B). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown and the rule is applied with parameters estimated from training data (the sample estimators of μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_WΩ^{-1} (x_i − μ_k) − 2 log(n_k / n)        (4.18)

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

Σ_WΩ^{-1} = ( n^{-1} (X^⊤X + λΩ) − Σ_B )^{-1}
          = ( n^{-1} X^⊤X − Σ_B + n^{-1} λΩ )^{-1}
          = ( Σ_W + n^{-1} λΩ )^{-1} .        (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution B_OS of the p-OS problem is enough to accomplish classification;

• in the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension R < K − 1, by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

‖(x_i − μ_k) B_OS‖²_{Σ_WΩ} − 2 log(π_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is

‖(x_i − μ_k) B_OS A^{-1} (I_{K−1} − A²)^{-1/2}‖²_2 − 2 log(π_k) ,

which is a plain Euclidean distance.
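The following sketch applies the nearest-centroid rule in the LDA domain, where the metric is plainly Euclidean. It is an illustrative NumPy sketch with our own function names, assuming the discriminant variates X_LDA have already been computed as in Section 4.1.4.

    import numpy as np

    def nearest_centroid_lda(X_lda, Y, Xnew_lda):
        # X_lda: (n, K-1) training variates, Y: (n, K) 0/1 indicators, Xnew_lda: (m, K-1) to classify
        n, K = Y.shape
        counts = Y.sum(axis=0)
        priors = counts / n
        M = (Y.T @ X_lda) / counts[:, None]                      # class centroids in the LDA domain
        d2 = ((Xnew_lda[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        scores = d2 - 2.0 * np.log(priors)[None, :]              # Euclidean distance adjusted by class priors
        return np.argmin(scores, axis=1), scores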


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1 | x) can be estimated as

p(y_k = 1 | x) ∝ exp( −d(x, μ_k)/2 )
              ∝ π_k exp( −(1/2) ‖(x − μ_k) B_OS A^{-1} (I_{K−1} − A²)^{-1/2}‖²_2 ) .        (4.20)

These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

p(y_k = 1 | x) = π_k exp(−d(x, μ_k)/2) / Σ_ℓ π_ℓ exp(−d(x, μ_ℓ)/2)
              = π_k exp(−(d(x, μ_k) − d_max)/2) / Σ_ℓ π_ℓ exp(−(d(x, μ_ℓ) − d_max)/2) ,

where d_max = max_k d(x, μ_k).
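This stabilization is a standard shifted-exponential (log-sum-exp style) normalization. The short sketch below, illustrative only and with our own names, turns a matrix of distances and the class priors into posterior probabilities using the same shift.

    import numpy as np

    def class_posteriors(d, priors):
        # d: (m, K) distances d(x, mu_k); priors: (K,) estimated class proportions
        dmax = d.max(axis=1, keepdims=True)                 # shift by the largest distance per sample
        w = priors[None, :] * np.exp(-(d - dmax) / 2.0)     # proportional to pi_k exp(-d_k/2), underflow-safe
        return w / w.sum(axis=1, keepdims=True)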

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether the dataset is presented in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^⊤Ωβ, under the assumption that Y^⊤Y and X^⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros on the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B^⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:

min_{τ∈R^p}  min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^{p} w_j² ‖β^j‖²_2 / τ_j        (4.21a)
s.t.  Σ_j τ_j − Σ_j w_j ‖β^j‖_2 ≤ 0 ,        (4.21b)
      τ_j ≥ 0 , j = 1, ..., p ,        (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, ..., β^{p⊤})^⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss ‖YΘ − XB‖²_2; from now on, for simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero, with b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence between our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and analyzed in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^{p} w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is

L = J(B) + λ Σ_{j=1}^{p} w_j² ‖β^j‖²_2 / τ_j + ν_0 ( Σ_{j=1}^{p} τ_j − Σ_{j=1}^{p} w_j ‖β^j‖_2 ) − Σ_{j=1}^{p} ν_j τ_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j are

∂L/∂τ_j (τ_j*) = 0  ⇔  −λ w_j² ‖β^j‖²_2 / τ_j*² + ν_0 − ν_j = 0
                  ⇔  −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² − ν_j τ_j*² = 0
                  ⇒  −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² = 0 .

The last line is obtained from complementary slackness, which implies here ν_j τ_j* = 0 (complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier of the constraint g_j(τ_j) ≤ 0). As a result, the optimal value of τ_j is

τ_j* = ( λ w_j² ‖β^j‖²_2 / ν_0 )^{1/2} = (λ/ν_0)^{1/2} w_j ‖β^j‖_2 .        (4.22)

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

Σ_{j=1}^{p} τ_j* − Σ_{j=1}^{p} w_j ‖β^j‖_2 = 0 ,        (4.23)

so that τ_j* = w_j ‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^{p} w_j ‖β^j‖_2 .        (4.24)

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ B^⊤ΩB, where

Ω = diag( w_1²/τ_1, w_2²/τ_2, ..., w_p²/τ_p ) ,        (4.25)

with τ_j = w_j ‖β^j‖_2, resulting in the diagonal components

(Ω)_jj = w_j / ‖β^j‖_2 .        (4.26)

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set described in Chapter 5.
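In practice, the reweighting (4.25)–(4.26) is what turns the group-Lasso into an adaptive quadratic penalty: Ω is recomputed from the current B at each pass. A minimal illustrative sketch (our own names; rows with a zero norm receive an infinite entry, meaning the variable is excluded):

    import numpy as np

    def omega_from_B(B, w):
        # diagonal penalty (4.26): Omega_jj = w_j / ||beta^j||_2, +inf for zeroed rows
        row_norms = np.sqrt((B ** 2).sum(axis=1))
        with np.errstate(divide='ignore'):
            diag = np.where(row_norms > 0.0, w / row_norms, np.inf)
        return np.diag(diag)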

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2 / τ, known as the perspective function of f(β) = ‖β‖²_2, is jointly convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

{ V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λ G } ,        (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, ..., g^{p⊤})^⊤, defined as follows. Let S(B) denote the support of B, S(B) = { j ∈ {1, ..., p} : ‖β^j‖_2 ≠ 0 }; then we have

∀j ∈ S(B) ,  g^j = w_j ‖β^j‖_2^{-1} β^j ,        (4.28)
∀j ∉ S(B) ,  ‖g^j‖_2 ≤ w_j .        (4.29)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is

∂( λ Σ_{m=1}^{p} w_m ‖β^m‖_2 ) / ∂β^j = λ w_j β^j / ‖β^j‖_2 .        (4.30)

At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

∂_{β^j} ( λ Σ_{m=1}^{p} w_m ‖β^m‖_2 ) = ∂_{β^j} ( λ w_j ‖β^j‖_2 ) = { λ w_j v ∈ R^{K−1} : ‖v‖_2 ≤ 1 } .        (4.31)

This gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B* of the objective function verifying the following conditions are global minima:

∀j ∈ S* ,  ∂J(B*)/∂β^j + λ w_j ‖β*^j‖_2^{-1} β*^j = 0 ,        (4.32a)
∀j ∉ S* ,  ‖∂J(B*)/∂β^j‖_2 ≤ λ w_j ,        (4.32b)

where S* ⊆ {1, ..., p} denotes the set of non-zero row vectors β*^j, and its complement indexes the zero rows.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

B_OS = argmin_{B∈R^{p×(K−1)}} min_{Θ∈R^{K×(K−1)}}  (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^{p} w_j ‖β^j‖_2
       s.t.  n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1}

is equivalent to the penalized LDA problem

B_LDA = argmax_{B∈R^{p×(K−1)}}  tr(B^⊤ Σ_B B)
        s.t.  B^⊤ (Σ_W + n^{-1} λΩ) B = I_{K−1} ,

where Ω = diag( w_1²/τ_1, ..., w_p²/τ_p ), with

Ω_jj = +∞ if β_OS^j = 0 ,  and  Ω_jj = w_j ‖β_OS^j‖_2^{-1} otherwise.        (4.33)

That is, B_LDA = B_OS diag( α_k^{-1} (1 − α_k²)^{-1/2} ), where α_k ∈ (0,1) is the kth leading eigenvalue of

n^{-1} Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr(B^⊤ΩB).

5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2) ‖YΘ − XB‖²_2.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented as a block diagram in Figure 5.1 and formalized in more detail in Algorithm 1. Note that this formulation uses the equations of the variational approach detailed in Section 4.3.1; if the alternative variational approach of Appendix D is used instead, Equations (4.21), (4.32a) and (4.32b) must be replaced by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive as we then solve (K − 1) similar systems

( X_A^⊤ X_A + λΩ ) β_k = X_A^⊤ Y θ⁰_k ,        (5.1)


Figure 5.1: GLOSS block diagram.


Algorithm 1 Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, ..., p} : ‖β^j‖_2 > 0 } ;  Θ⁰ such that n^{-1} Θ⁰^⊤Y^⊤Y Θ⁰ = I_{K−1} ;  convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
        B_A ← ( X_A^⊤ X_A + λΩ )^{-1} X_A^⊤ Y Θ⁰
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for j ∈ A such that ‖β^j‖_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j} ; go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) outside A
    ĵ ← argmax_{j∉A} ‖∂J/∂β^j‖_2
    if ‖∂J/∂β^ĵ‖_2 < λ then
        convergence ← true   % B is optimal
    else
        A ← A ∪ {ĵ}
    end if
until convergence
(s, V) ← eigenanalyze( Θ⁰^⊤Y^⊤X_A B ), that is, Θ⁰^⊤Y^⊤X_A B V_k = s_k V_k, k = 1, ..., K−1
Θ ← Θ⁰ V ;  B ← B V ;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, ..., K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ⁰_k denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is enough to solve all of them, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in a different "penalty" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

( X^⊤X + λΩ ) B = X^⊤Y Θ .        (5.2)

Defining the Cholesky decomposition C^⊤C = X^⊤X + λΩ, (5.2) is solved efficiently as follows:

C^⊤C B = X^⊤Y Θ
C B = C^⊤ \ ( X^⊤Y Θ )
B = C \ ( C^⊤ \ ( X^⊤Y Θ ) ) ,        (5.3)

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
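In Python, the same one-factorization / multiple-triangular-solves pattern can be written with SciPy's Cholesky helpers. This is only an illustrative sketch of (5.2)–(5.3) with our own function name, not the matlab code.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def solve_regression_coefficients(XA, Y, Theta0, Omega, lam):
        # solve (X_A' X_A + lam * Omega) B = X_A' Y Theta0 for all K-1 right-hand sides at once
        M = XA.T @ XA + lam * Omega
        factor = cho_factor(M)                     # single Cholesky factorization of the common matrix
        return cho_solve(factor, XA.T @ Y @ Theta0)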

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

B = Ω^{-1/2} ( Ω^{-1/2} X^⊤X Ω^{-1/2} + λ I )^{-1} Ω^{-1/2} X^⊤Y Θ⁰ ,        (5.4)

where the conditioning of Ω^{-1/2} X^⊤X Ω^{-1/2} + λI is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved to cases with large ω_j values; our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. This eigen-analysis is actually solved in the form Θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^⊤X + Ω)^{-1}, which involves the inversion of an n×n matrix. Let Θ⁰ be an arbitrary K×(K−1) matrix whose range includes the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B⁰ = (X^⊤X + λΩ)^{-1} X^⊤Y Θ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as

Θ⁰^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ⁰ = Θ⁰^⊤Y^⊤X B⁰ .

Thus, the solution of the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix Θ⁰^⊤Y^⊤X B⁰ = V Λ V^⊤. Defining Θ = Θ⁰V, we have Θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ = Λ, and when Θ⁰ is chosen such that n^{-1} Θ⁰^⊤Y^⊤Y Θ⁰ = I_{K−1}, we also have n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

¹As X is centered, 1_K belongs to the null space of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = (Y^⊤Y)^{-1/2} U, where U is a K×(K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4; optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the gradient of the objective function

(1/2) ‖YΘ − XB‖²_2 + λ Σ_{j=1}^{p} w_j ‖β^j‖_2 .        (5.5)

Let J(B) be the data-fitting term (1/2) ‖YΘ − XB‖²_2. Its gradient with respect to the jth row of B, β^j, is the (K−1)-dimensional vector

∂J(B)/∂β^j = x_j^⊤ ( XB − YΘ ) ,

where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as

x_j^⊤ ( XB − YΘ ) + λ w_j β^j / ‖β^j‖_2 = 0 .


The second optimality condition (4.32b) can be computed for every variable j as

‖ x_j^⊤ ( XB − YΘ ) ‖_2 ≤ λ w_j .

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease of the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion in the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease of the objective function:

ĵ = argmax_j  max( ‖ x_j^⊤ ( XB − YΘ ) ‖_2 − λ w_j , 0 ) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:

‖ x_j^⊤ ( XB − YΘ ) ‖_2 ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
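The two tests above only require the column-wise gradient norms ‖x_j^⊤(XB − YΘ)‖_2. The sketch below, illustrative only and with our own names, computes these norms and the resulting inclusion violations for all variables at once.

    import numpy as np

    def optimality_checks(X, Y, B, Theta, lam, w):
        # gradient norms ||x_j' (XB - Y Theta)||_2 for j = 1..p, and inclusion violations (Section 5.4)
        R = X @ B - Y @ Theta
        grad_norms = np.linalg.norm(X.T @ R, axis=1)
        inclusion_violation = np.maximum(grad_norms - lam * w, 0.0)
        return grad_norms, inclusion_violation

    # a candidate enters the active set when its violation is the largest positive one;
    # an active variable with a vanishing row of B leaves the set when grad_norms[j] <= lam * w[j]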

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem for this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λ_max, corresponding to a null B matrix, is obtained by computing the optimality condition (4.32b) at B = 0:

λ_max = max_{j∈{1,...,p}}  (1/w_j) ‖ x_j^⊤ Y Θ⁰ ‖_2 .

The algorithm then computes a series of solutions along the regularization path defined by a sequence of penalties λ_1 = λ_max > ... > λ_t > ... > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
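A small sketch of this path construction (illustrative only; the names and the stopping rule are ours), assuming a hypothetical routine gloss_solve(X, Y, lam, B_init) standing in for the solver of Sections 5.1–5.4:

    import numpy as np

    def lambda_max(X, Y, Theta0, w):
        # optimality condition (4.32b) at B = 0: lambda_max = max_j ||x_j' Y Theta0||_2 / w_j
        return np.max(np.linalg.norm(X.T @ (Y @ Theta0), axis=1) / w)

    def regularization_path(X, Y, Theta0, w, gloss_solve, max_active=None):
        # gloss_solve(X, Y, lam, B_init) is a placeholder, not an actual GLOSS API
        p = X.shape[1]
        max_active = min(X.shape[0], p) if max_active is None else max_active
        lam = lambda_max(X, Y, Theta0, w)
        B = np.zeros((p, Y.shape[1] - 1))
        path = []
        while np.count_nonzero(np.linalg.norm(B, axis=1)) < max_active and lam > 0:
            lam = lam / 2.0                         # regularly decreasing penalties, warm-started
            B = gloss_solve(X, Y, lam, B)
            path.append((lam, B))
        return path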


5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some data structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al. 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

min_{B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F = min_{B∈R^{p×(K−1)}}  tr( Θ^⊤Y^⊤YΘ − 2 Θ^⊤Y^⊤XB + n B^⊤ Σ_T B )

are replaced by

min_{B∈R^{p×(K−1)}}  tr( Θ^⊤Y^⊤YΘ − 2 Θ^⊤Y^⊤XB + n B^⊤ ( Σ_B + diag(Σ_W) ) B ) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition for their penalized discriminant analysis model, constraining the discriminant directions to be spatially smooth.


Figure 5.2: Graph and Laplacian matrix for a 3×3 image. Pixels are connected to their neighbors (including diagonals), and the corresponding Laplacian is

Ω_L = [  3 −1  0 −1 −1  0  0  0  0
        −1  5 −1 −1 −1 −1  0  0  0
         0 −1  3  0 −1 −1  0  0  0
        −1 −1  0  5 −1  0 −1 −1  0
        −1 −1 −1 −1  8 −1 −1 −1 −1
         0 −1 −1  0 −1  5  0 −1 −1
         0  0  0 −1 −1  0  3 −1  0
         0  0  0 −1 −1 −1 −1  5 −1
         0  0  0  0 −1 −1  0 −1  3 ] .

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive-semidefinite, and the penalty β^⊤Ω_Lβ favors, among vectors of identical L2 norm, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1,1,0,1,1,0,0,0,0)^⊤, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector (−1,1,0,1,1,0,0,0,0)^⊤, which has a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.


6 Experimental Results

This section presents comparison results between the Group-Lasso Optimal Scoring Solver and two other state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al. 2011), which applies an elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T to ones, or the diagonal of the within-class covariance matrix Σ_W to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing the penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption; in particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split in a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_1j = 0.7 × 1_(1≤j≤25), μ_2j = 0.7 × 1_(26≤j≤50), μ_3j = 0.7 × 1_(51≤j≤75), μ_4j = 0.7 × 1_(76≤j≤100).

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ); if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_(j≤200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100×100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic the correlation of gene expression data.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
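As an illustration of the protocol, the sketch below generates data following Simulation 1 (300 examples per class, to be split into the 100/100/1000 train/validation/test sets mentioned above). It is our own illustrative code, not the script used for the experiments.

    import numpy as np

    def simulation1(n_per_class=300, p=500, shift=0.7, seed=0):
        # four classes; class k has mean 0.7 on features 25(k-1)+1 .. 25k and 0 elsewhere
        rng = np.random.default_rng(seed)
        K = 4
        mu = np.zeros((K, p))
        for k in range(K):
            mu[k, 25 * k:25 * (k + 1)] = shift
        X = np.vstack([rng.normal(loc=mu[k], scale=1.0, size=(n_per_class, p)) for k in range(K)])
        y = np.repeat(np.arange(K), n_per_class)
        return X, y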

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

             Err (%)       Var            Dir

Sim 1: K = 4, mean shift, ind. features
  PLDA       12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA       31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS      19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D    11.2 (0.1)    251.1 (4.1)    3.0 (0.0)

Sim 2: K = 2, mean shift, dependent features
  PLDA        9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA       19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS      15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D     9.0 (0.0)    203.5 (4.0)    1.0 (0.0)

Sim 3: K = 4, 1D mean shift, ind. features
  PLDA       13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA       57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS      31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D    18.5 (0.1)    357.5 (2.8)    1.0 (0.0)

Sim 4: K = 4, mean shift, ind. features
  PLDA       60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA       65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS      60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D    58.8 (0.1)    162.7 (4.9)    2.9 (0.0)



Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

           Simulation 1      Simulation 2      Simulation 3      Simulation 4
           TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR

  PLDA     99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
  SLDA     73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
  GLOSS    64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
  GLOSS-D  93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
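Both rates can be computed directly from the index set of selected variables, as in this minimal sketch (the index sets shown are hypothetical).

import numpy as np

def tpr_fpr(selected, relevant, p):
    """TPR: fraction of relevant variables that are selected.
       FPR: fraction of irrelevant variables that are selected."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr

# e.g. the first 100 of p = 500 variables are relevant, 120 variables were selected
tpr, fpr = tpr_fpr(selected=range(120), relevant=range(100), p=500)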

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

2. http://www.broadinstitute.org/cancer/software/genepattern/datasets
3. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and of the number of selected variables.

             Err (%)         Var

Nakayama: n = 86, p = 22,283, K = 5
  PLDA       20.95 (1.3)     10478.7 (2116.3)
  SLDA       25.71 (1.7)       252.5 (3.1)
  GLOSS      20.48 (1.4)       129.0 (18.6)

Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA       38.36 (6.0)     14873.5 (720.3)
  SLDA       —               —
  GLOSS      20.61 (6.9)       372.4 (122.1)

Sun: n = 180, p = 54,613, K = 4
  PLDA       33.78 (5.9)     21634.8 (7443.2)
  SLDA       36.22 (6.5)       384.4 (16.5)
  GLOSS      31.77 (4.5)        93.0 (93.6)


Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.
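A sketch of this protocol is given below; `fit_predict` is a placeholder for any of the compared classifiers (assumed to tune its penalty by 10-fold cross-validation on the training part) and is not part of the actual implementations.

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

def evaluate(X, y, fit_predict, n_repeats=10, seed=0):
    """10 random 75/25 splits; returns mean and std of the test error rate."""
    errors = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, stratify=y, random_state=seed + r)
        y_hat = fit_predict(X_tr, y_tr, X_te, cv=StratifiedKFold(n_splits=10))
        errors.append(np.mean(y_hat != y_te))
    return np.mean(errors), np.std(errors)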

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

4. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2 shows, for the Nakayama (top row) and Sun (bottom row) datasets, the samples projected on the first two discriminant directions (1st discriminant on the x-axis, 2nd discriminant on the y-axis), estimated by GLOSS (left column) and SLDA (right column). Nakayama legend: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun legend: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships from Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.
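As an illustration, such a Laplacian matrix can be built from the 4-neighbour pixel grid of a 16 × 16 image as sketched below; the exact weighting used for Ω_L in S-GLOSS may differ.

import numpy as np

def grid_laplacian(h=16, w=16):
    """Graph Laplacian L = D - A of the 4-neighbour pixel grid (256 x 256 here)."""
    n = h * w
    A = np.zeros((n, n))
    for i in range(h):
        for j in range(w):
            u = i * w + j
            if j + 1 < w:                      # right neighbour
                A[u, u + 1] = A[u + 1, u] = 1
            if i + 1 < h:                      # bottom neighbour
                A[u, u + w] = A[u + w, u] = 1
    D = np.diag(A.sum(axis=1))
    return D - A

Omega_L = grid_laplacian()   # 256 x 256 penalty matrix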

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow the detection of strokes, and will probably provide better prediction results.


Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and for S-GLOSS (right).

Figure 6.5: Sparse discriminant direction between digits "1" and "0": β for GLOSS (left) and for S-GLOSS (right), both with λ = 0.3.


Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; this is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified with different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^⊤, …, x_n^⊤)^⊤ have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

f(x_i) = ∑_{k=1}^K π_k f_k(x_i) ,  ∀i ∈ {1, …, n} ,

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ ∀k and ∑_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data is generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

f(x_i; θ) = ∑_{k=1}^K π_k φ(x_i; θ_k) ,  ∀i ∈ {1, …, n} ,


where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.
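For a Gaussian parametric family with a common covariance matrix (the model used later in this chapter), the mixture density can be evaluated as in the following sketch.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(X, pis, mus, Sigma):
    # f(x_i; theta) = sum_k pi_k N(x_i; mu_k, Sigma)
    return sum(pi_k * multivariate_normal.pdf(X, mean=mu_k, cov=Sigma)
               for pi_k, mu_k in zip(pis, mus))

# toy usage: two components in R^2
X = np.random.default_rng(0).standard_normal((5, 2))
f = mixture_density(X, pis=[0.4, 0.6], mus=[np.zeros(2), np.ones(2)], Sigma=np.eye(2))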

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood from the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

L(θ; X) = log ( ∏_{i=1}^n f(x_i; θ) ) = ∑_{i=1}^n log ( ∑_{k=1}^K π_k f_k(x_i; θ_k) )    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or


classification log-likelihood:

L_C(θ; X, Y) = log ( ∏_{i=1}^n f(x_i, y_i; θ) )
             = ∑_{i=1}^n log ( ∑_{k=1}^K y_ik π_k f_k(x_i; θ_k) )
             = ∑_{i=1}^n ∑_{k=1}^K y_ik log(π_k f_k(x_i; θ_k))    (7.2)

The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k, and y_ik = 0 otherwise.

Defining the soft membership t_ik(θ) as

t_ik(θ) = p(Y_ik = 1 | x_i; θ)    (7.3)
        = π_k f_k(x_i; θ_k) / f(x_i; θ) ,    (7.4)

to lighten notations, t_ik(θ) will be denoted t_ik when the parameter θ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

L_C(θ; X, Y) = ∑_{i,k} y_ik log(π_k f_k(x_i; θ_k))
             = ∑_{i,k} y_ik log(t_ik f(x_i; θ))
             = ∑_{i,k} y_ik log t_ik + ∑_{i,k} y_ik log f(x_i; θ)
             = ∑_{i,k} y_ik log t_ik + ∑_{i=1}^n log f(x_i; θ)
             = ∑_{i,k} y_ik log t_ik + L(θ; X)    (7.5)

where ∑_{i,k} y_ik log t_ik can be reformulated as

∑_{i,k} y_ik log t_ik = ∑_{i=1}^n ∑_{k=1}^K y_ik log(p(Y_ik = 1 | x_i; θ))
                      = ∑_{i=1}^n log p(y_i | x_i; θ)
                      = log p(Y | X; θ) .

As a result, the relationship (7.5) can be rewritten as

L(θ; X) = L_C(θ; Z) − log p(Y | X; θ) .    (7.6)


Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_ik are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value θ^(t):

L(θ; X) = E_{Y∼p(·|X; θ^(t))}[L_C(θ; X, Y)] + E_{Y∼p(·|X; θ^(t))}[− log p(Y | X; θ)]
        =            Q(θ, θ^(t))           +            H(θ, θ^(t)) .

In this expression, H(θ, θ^(t)) is the entropy and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:

ΔL = (Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t))) + (H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t))) ,

where the first term is nonnegative by definition of iteration t+1, and the second term is nonnegative by Jensen's inequality.

Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

Q(θ, θ′) = E_{Y∼p(Y|X; θ′)}[L_C(θ; X, Y)]
         = ∑_{i,k} p(Y_ik = 1 | x_i; θ′) log(π_k f_k(x_i; θ_k))
         = ∑_{i=1}^n ∑_{k=1}^K t_ik(θ′) log(π_k f_k(x_i; θ_k)) .    (7.7)

Due to its similarity with the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);

• M-step: calculation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).


Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix Σ and different mean vectors μ_k, the mixture density is

f(x_i; θ) = ∑_{k=1}^K π_k f_k(x_i; θ_k)
          = ∑_{k=1}^K π_k (2π)^{−p/2} |Σ|^{−1/2} exp{ −½ (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) } .

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current θ^(t) parameters; then the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

Q(θ, θ^(t)) = ∑_{i,k} t_ik log(π_k) − ∑_{i,k} t_ik log((2π)^{p/2} |Σ|^{1/2}) − ½ ∑_{i,k} t_ik (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
            = ∑_k t_k log(π_k) − (np/2) log(2π) [constant term] − (n/2) log(|Σ|) − ½ ∑_{i,k} t_ik (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
            ≡ ∑_k t_k log(π_k) − (n/2) log(|Σ|) − ∑_{i,k} t_ik ( ½ (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) ) ,    (7.8)

where

t_k = ∑_{i=1}^n t_ik .    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^(t+1):

π_k^(t+1) = t_k / n    (7.10)

μ_k^(t+1) = ( ∑_i t_ik x_i ) / t_k    (7.11)

Σ^(t+1) = (1/n) ∑_k W_k    (7.12)

with  W_k = ∑_i t_ik (x_i − μ_k)(x_i − μ_k)^⊤ .    (7.13)

The derivations are detailed in Appendix G.
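A direct transcription of the E-step (7.4) and of the M-step updates (7.10)-(7.13) is sketched below; it only illustrates the formulas and is not the Mix-GLOSS implementation.

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigma):
    """Posterior probabilities t_ik of (7.4) for a common-covariance Gaussian mixture."""
    dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigma)
                            for k in range(len(pis))])
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(X, T):
    """Updates (7.10)-(7.13): proportions, means and pooled covariance."""
    n, p = X.shape
    tk = T.sum(axis=0)                           # (7.9)
    pis = tk / n                                 # (7.10)
    mus = (T.T @ X) / tk[:, None]                # (7.11)
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        Xc = X - mus[k]
        Sigma += (T[:, k, None] * Xc).T @ Xc     # W_k of (7.13)
    return pis, mus, Sigma / n                   # (7.12)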

7.2 Feature Selection in Model-Based Clustering

When a common covariance matrix is assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

log ( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x^⊤ Σ^{−1} (μ_k − μ_ℓ) − ½ (μ_k + μ_ℓ)^⊤ Σ^{−1} (μ_k − μ_ℓ) + log(π_k / π_ℓ) .

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

λ ∑_{k=1}^K ∑_{j=1}^p |μ_kj| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

λ_1 ∑_{k=1}^K ∑_{j=1}^p |μ_kj| + λ_2 ∑_{k=1}^K ∑_{j=1}^p ∑_{m=1}^p |(Σ_k^{−1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

λ ∑_{j=1}^p ∑_{1≤k<k′≤K} |μ_kj − μ_k′j| .

This PFP regularization does not shrink the means to zero but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

λ ∑_{j=1}^p ||(μ_1j, μ_2j, …, μ_Kj)||_∞ .

One group is defined for each variable j as the set of the jth components of the K means, (μ_1j, …, μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

λ √K ∑_{j=1}^p √( ∑_{k=1}^K μ_kj² ) .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing to test it is available on the authors' website.
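For reference, the VMG group-Lasso penalty above amounts to the following computation on the K × p matrix of cluster means (a small sketch with names of our own).

import numpy as np

def vmg_group_penalty(mu, lam):
    # one group per variable j: lambda * sqrt(K) * sum_j ||(mu_1j, ..., mu_Kj)||_2
    K = mu.shape[0]
    return lam * np.sqrt(K) * np.linalg.norm(mu, axis=0).sum()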

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as

f(x_i | φ, π, θ, ν) = ∑_{k=1}^K π_k ∏_{j=1}^p [f(x_ij | θ_jk)]^{φ_j} [h(x_ij | ν_j)]^{1−φ_j} ,

where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion,

tr( (U^⊤ Σ_W U)^{−1} U^⊤ Σ_B U ) ,    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and the model parameters in the latent space, such that the matrix U enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

min_{Ũ ∈ R^{p×(K−1)}}  ||X_U − XŨ||²_F + λ ∑_{k=1}^{K−1} ||ũ_k||_1 ,

where X_U = XU is the input data projected in the non-sparse space, and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:

min_{A, B ∈ R^{p×(K−1)}}  ∑_{k=1}^K || R_W^{−⊤} H_{B,k} − A B^⊤ H_{B,k} ||²_2 + ρ ∑_{j=1}^{K−1} β_j^⊤ Σ_W β_j + λ ∑_{j=1}^{K−1} ||β_j||_1

s.t.  A^⊤ A = I_{K−1} ,

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W; Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.

The last possibility suggests the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

min_{U ∈ R^{p×(K−1)}}  ∑_{j=1}^p || Σ_{B,j} − U U^⊤ Σ_{B,j} ||²_2   s.t.  U^⊤ U = I_{K−1} ,

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed". Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized".

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): set of selected relevant variables;

• X^(2): set of variables being considered for inclusion in or exclusion from X^(1);

• X^(3): set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2) | X^(1)) f(X^(1) | Y)

• M2:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2), X^(1) | Y)

Model M1 means that the variables in X^(2) are independent of the clustering Y; model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of variable X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_12 = f(X | M1) / f(X | M2) ,

where the high-dimensional f(X^(3) | X^(2), X^(1)) cancels from the ratio:

B_12 = f(X^(1), X^(2), X^(3) | M1) / f(X^(1), X^(2), X^(3) | M2)
     = [ f(X^(2) | X^(1), M1) f(X^(1) | M1) ] / f(X^(2), X^(1) | M2) .

This factor is approximated, since the integrated likelihoods f(X^(1) | M1) and f(X^(2), X^(1) | M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2) | X^(1), M1), if there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1). There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to produce reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow solving the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_W^{−1} (x_i − μ_k) ,

where the μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.


The likelihood equations of the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2 l_weight(μ, Σ) = ∑_{i=1}^n ∑_{k=1}^K t_ik d(x_i, μ_k) − n log(|Σ_W|) ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, μ_k) = ||(x_i − μ_k) B_LDA||²_2 − 2 log(π_k) .

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

   B_OS = (X^⊤X + λΩ)^{−1} X^⊤ Y Θ ,

   where Θ are the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{−1} X^⊤Y.

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag( α_k^{−1} (1 − α_k²)^{−1/2} ).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik with

   t_ik ∝ exp[ −(d(x_i, μ_k) − 2 log(π_k)) / 2 ] .    (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
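The following sketch transcribes steps 1 to 8 in code. `solve_pos` is a placeholder for a penalized optimal scoring solver returning (B_OS, Θ, α), standing in for GLOSS, and a random hard assignment replaces the K-means initialization of step 1 for brevity.

import numpy as np

def mix_os_em(X, K, lam, solve_pos, n_iter=100, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    T = np.eye(K)[rng.integers(0, K, n)]             # step 1: initial (hard) memberships
    for _ in range(n_iter):
        B, Theta, alpha = solve_pos(X, T, lam)       # step 2: penalized optimal scoring
        D = np.diag(1.0 / (alpha * np.sqrt(1.0 - alpha ** 2)))
        Z = X @ B @ D                                # step 3: map to the LDA domain
        pis = T.sum(axis=0) / n
        M = (T.T @ Z) / T.sum(axis=0)[:, None]       # step 4: centroids in the LDA domain
        d = ((Z[:, None, :] - M[None]) ** 2).sum(-1)     # step 5: squared distances
        logp = -(d - 2 * np.log(pis)) / 2                # step 6: posteriors, as in (8.1)
        T_new = np.exp(logp - logp.max(axis=1, keepdims=True))
        T_new /= T_new.sum(axis=1, keepdims=True)
        if np.abs(T_new - T).mean() < tol:           # step 8: stop when the t_ik converge
            return B, T_new
        T = T_new                                    # step 7: update the (soft) labels
    return B, T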

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with penalized OS. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(Σ | Λ_0, ν_0) = ( 1 / ( 2^{np/2} |Λ_0|^{n/2} Γ_p(n/2) ) ) |Σ^{−1}|^{(ν_0−p−1)/2} exp{ −½ tr(Λ_0^{−1} Σ^{−1}) } ,

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

Γ_p(n/2) = π^{p(p−1)/4} ∏_{j=1}^p Γ( n/2 + (1−j)/2 ) .

The posterior distribution can be maximized similarly to the likelihood, through the maximization of

Q(θ, θ′) + log(f(Σ | Λ_0, ν_0))
  = ∑_{k=1}^K t_k log π_k − ((n+1)p/2) log 2 − (n/2) log|Λ_0| − (p(p+1)/4) log(π)
    − ∑_{j=1}^p log Γ( n/2 + (1−j)/2 ) − ((ν_n − p − 1)/2) log|Σ| − ½ tr(Λ_n^{−1} Σ^{−1})
  ≡ ∑_{k=1}^K t_k log π_k − (n/2) log|Λ_0| − ((ν_n − p − 1)/2) log|Σ| − ½ tr(Λ_n^{−1} Σ^{−1}) ,    (8.2)

with  t_k = ∑_{i=1}^n t_ik ,
      ν_n = ν_0 + n ,
      Λ_n^{−1} = Λ_0^{−1} + S_0 ,
      S_0 = ∑_{i=1}^n ∑_{k=1}^K t_ik (x_i − μ_k)(x_i − μ_k)^⊤ .

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, in which only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

Σ_MAP = ( 1 / (ν_0 + n − p − 1) ) (Λ_0^{−1} + S_0) ,    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and setting Λ_0^{−1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).


9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the implemented warm-start reduces the computation time by a factor of 8, compared with using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0 ; Y ← K-means(X, K)
    {Run non-penalized Mix-GLOSS}
    λ ← 0 ; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    lastLAMBDA ← false
repeat
    {Estimate λ}
    Compute the gradient at β_j = 0:  ∂J(B)/∂β_j |_{β_j=0} = x^{j⊤} ( ∑_{m≠j} x^m β_m − YΘ )
    Compute λ^max for every feature using (4.32b):  λ_j^max = (1/w_j) || ∂J(B)/∂β_j |_{β_j=0} ||_2
    Choose λ so as to remove 10% of the relevant features
    {Run penalized Mix-GLOSS}
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0 ; Y ← Y0
    else
        B_OS ← 0 ; Y ← K-means(X, K)
    end if
    convergenceEM ← false ; tolEM ← 1e-3
repeat
    {M-step}
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag( α^{−1} (1 − α²)^{−1/2} )
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    {E-step}
    t_ik as per (8.1)
    L(θ) as per (8.2)
    if (1/n) ∑_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ, and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression; here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T using

t_ik ∝ exp[ −(d(x_i, μ_k) − 2 log(π_k)) / 2 ] .

The convergence of these t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions, which transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0), and the execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time.


[Figure 9.2 sketches the model selection mechanism: an initial non-penalized Mix-GLOSS (λ = 0, 20 repetitions) is run on (X, K); the B and T matrices from the best repetition are used to warm-start the penalized Mix-GLOSS runs over the λ grid; BIC is computed for each run, and λ is chosen as the minimizer of BIC. The outputs are the partition, the t_ik, π_k, the selected λ, B, Θ, D, L(θ) and the active set.]

Figure 9.2: Mix-GLOSS model selection diagram.

Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software package implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters that are particularly important for their dataset. The package LumiWCluster allows performing clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are the following:

• Clustering Error (in percentage): To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows obtaining the ideal 0% clustering error even if the IDs of the clusters and of the real classes differ (a sketch of such a permutation-invariant error is given after this list).

• Number of Disposed Features: This value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. Those algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
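A common way to obtain such a label-permutation-invariant error is to match clusters to classes with an optimal one-to-one assignment, as sketched below; this is our reading of the measure, and the exact definition of Wu and Schölkopf (2007) may differ.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred, K):
    """y_true, y_pred: integer labels in {0, ..., K-1}."""
    C = np.zeros((K, K), dtype=int)            # contingency table: classes x clusters
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)     # best cluster-to-class matching
    return 1.0 - C[rows, cols].sum() / len(y_true)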

The adequacy of the selected features was also assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and high clustering error, respectively, and as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are thus Fisher EM (Bouveyron and Brunet, 2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data.

                          Err. (%)       Var.          Time

Sim 1: K = 4, mean shift, ind. features
  CS general cov          46 (15)        985 (72)      884h
  Fisher EM               58 (87)        784 (52)      1645m
  Clustvarsel             602 (107)      378 (291)     383h
  LumiWCluster-Kuan       42 (68)        779 (4)       389s
  LumiWCluster-Wang       43 (69)        784 (39)      619s
  Mix-GLOSS               32 (16)        80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov          154 (2)        997 (09)      783h
  Fisher EM               74 (23)        809 (28)      8m
  Clustvarsel             73 (2)         334 (207)     166h
  LumiWCluster-Kuan       64 (18)        798 (04)      155s
  LumiWCluster-Wang       63 (17)        799 (03)      14s
  Mix-GLOSS               77 (2)         841 (34)      2h

Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov          304 (57)       55 (468)      1317h
  Fisher EM               233 (65)       366 (55)      22m
  Clustvarsel             658 (115)      232 (291)     542h
  LumiWCluster-Kuan       323 (21)       80 (02)       83s
  LumiWCluster-Wang       308 (36)       80 (02)       1292s
  Mix-GLOSS               347 (92)       81 (88)       21h

Sim 4: K = 4, mean shift, ind. features
  CS general cov          626 (55)       999 (02)      112h
  Fisher EM               567 (104)      55 (48)       195m
  Clustvarsel             732 (4)        24 (12)       767h
  LumiWCluster-Kuan       692 (112)      99 (2)        876s
  LumiWCluster-Wang       697 (119)      991 (21)      825s
  Mix-GLOSS               669 (91)       975 (12)      11h

Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms.

              Simulation 1      Simulation 2      Simulation 3      Simulation 4
              TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR

MIX-GLOSS     992     015       828     335       884     67        780     12
LUMI-KUAN     992     28        1000    02        1000    005       50      005
FISHER-EM     986     24        888     17        838     5825      620     4075


[Figure: scatter plot of TPR (%) versus FPR (%) for MIX-GLOSS, LUMI-KUAN and FISHER-EM on Simulations 1–4.]

Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort was spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to best in terms of fall-out and recall.


Conclusions

Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory was verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which proved its effectiveness by inducing extremely parsimonious models without renouncing any predictive capability. GLOSS was tested on four artificial and three real datasets, outperforming state-of-the-art algorithms in almost all situations.

In Part III this theory was adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection for the penalty parameter. Also in this case, the theory was put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We plan to test it at least on the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. These algorithms are well suited to genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been used in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computations, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables. This can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of these possibilities have been partially implemented, such as the diagonal version of GLOSS, but they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used to stop the EM algorithm and to perform model selection. However, further investigations must be carried out in this direction to assess the convergence properties of the algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather all the points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


Appendix


A Matrix Properties

Property 1. By definition, $\boldsymbol{\Sigma}_W$ and $\boldsymbol{\Sigma}_B$ are both symmetric matrices:
$$\boldsymbol{\Sigma}_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i \in C_k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \ , \qquad \boldsymbol{\Sigma}_B = \frac{1}{n}\sum_{k=1}^{g} n_k (\boldsymbol{\mu}_k - \bar{\mathbf{x}})(\boldsymbol{\mu}_k - \bar{\mathbf{x}})^\top \ .$$

Property 2. $\dfrac{\partial \mathbf{x}^\top \mathbf{a}}{\partial \mathbf{x}} = \dfrac{\partial \mathbf{a}^\top \mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}$

Property 3. $\dfrac{\partial \mathbf{x}^\top \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$

Property 4. $\dfrac{\partial |\mathbf{X}^{-1}|}{\partial \mathbf{X}} = -|\mathbf{X}^{-1}|\,(\mathbf{X}^{-1})^\top$

Property 5. $\dfrac{\partial \mathbf{a}^\top \mathbf{X} \mathbf{b}}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^\top$

Property 6. $\dfrac{\partial}{\partial \mathbf{X}} \mathrm{tr}\left(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}\right) = -(\mathbf{X}^{-1}\mathbf{B}\mathbf{A}\mathbf{X}^{-1})^\top = -\mathbf{X}^{-\top}\mathbf{A}^\top\mathbf{B}^\top\mathbf{X}^{-\top}$
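As a quick numerical sanity check of these identities (a sketch of ours, not part of the original material), Properties 3 and 5 can be verified by finite differences:

import numpy as np

rng = np.random.default_rng(0)
p, eps = 5, 1e-6
A = rng.normal(size=(p, p))
X = rng.normal(size=(p, p))
x, a, b = rng.normal(size=p), rng.normal(size=p), rng.normal(size=p)

# Property 3: d(x'Ax)/dx = (A + A')x
num_grad = np.array([((x + eps * e) @ A @ (x + eps * e) - x @ A @ x) / eps
                     for e in np.eye(p)])
print(np.allclose(num_grad, (A + A.T) @ x, atol=1e-4))

# Property 5: d(a'Xb)/dX = ab'
num_grad = np.array([[(a @ (X + eps * np.outer(np.eye(p)[i], np.eye(p)[j])) @ b
                       - a @ X @ b) / eps for j in range(p)] for i in range(p)])
print(np.allclose(num_grad, np.outer(a, b), atol=1e-4))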


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

$$\min_{\boldsymbol{\theta}_k, \boldsymbol{\beta}_k} \ \|\mathbf{Y}\boldsymbol{\theta}_k - \mathbf{X}\boldsymbol{\beta}_k\|_2^2 + \boldsymbol{\beta}_k^\top \boldsymbol{\Omega}_k \boldsymbol{\beta}_k \qquad \text{(B.1)}$$
$$\text{s.t.} \quad \boldsymbol{\theta}_k^\top \mathbf{Y}^\top\mathbf{Y}\boldsymbol{\theta}_k = 1 \ , \qquad \boldsymbol{\theta}_\ell^\top \mathbf{Y}^\top\mathbf{Y}\boldsymbol{\theta}_k = 0 \quad \forall \ell < k \ ,$$

for $k = 1, \ldots, K-1$. The Lagrangian associated with Problem (B.1) is

$$\mathcal{L}_k(\boldsymbol{\theta}_k, \boldsymbol{\beta}_k, \lambda_k, \boldsymbol{\nu}_k) = \|\mathbf{Y}\boldsymbol{\theta}_k - \mathbf{X}\boldsymbol{\beta}_k\|_2^2 + \boldsymbol{\beta}_k^\top \boldsymbol{\Omega}_k \boldsymbol{\beta}_k + \lambda_k(\boldsymbol{\theta}_k^\top \mathbf{Y}^\top\mathbf{Y}\boldsymbol{\theta}_k - 1) + \sum_{\ell < k} \nu_\ell\, \boldsymbol{\theta}_\ell^\top \mathbf{Y}^\top\mathbf{Y}\boldsymbol{\theta}_k \ . \qquad \text{(B.2)}$$

Setting the gradient of (B.2) with respect to $\boldsymbol{\beta}_k$ to zero gives the value of the optimal $\boldsymbol{\beta}_k^\star$:
$$\boldsymbol{\beta}_k^\star = (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega}_k)^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\theta}_k \ . \qquad \text{(B.3)}$$

The objective function of (B.1) evaluated at $\boldsymbol{\beta}_k^\star$ is
$$\min_{\boldsymbol{\theta}_k} \ \|\mathbf{Y}\boldsymbol{\theta}_k - \mathbf{X}\boldsymbol{\beta}_k^\star\|_2^2 + \boldsymbol{\beta}_k^{\star\top} \boldsymbol{\Omega}_k \boldsymbol{\beta}_k^\star = \min_{\boldsymbol{\theta}_k} \ \boldsymbol{\theta}_k^\top \mathbf{Y}^\top\big(\mathbf{I} - \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega}_k)^{-1}\mathbf{X}^\top\big)\mathbf{Y}\boldsymbol{\theta}_k$$
$$= \max_{\boldsymbol{\theta}_k} \ \boldsymbol{\theta}_k^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega}_k)^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\theta}_k \ . \qquad \text{(B.4)}$$

If the penalty matrix $\boldsymbol{\Omega}_k$ is identical for all problems, $\boldsymbol{\Omega}_k = \boldsymbol{\Omega}$, then (B.4) corresponds to an eigen-problem where the score vectors $\boldsymbol{\theta}_k$ are the eigenvectors of $\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like $\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$ is not trivial because of the $p \times p$ inverse. For some datasets $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let $\mathbf{M}$ be the matrix $\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$, so that expression (B.4) can be rewritten in a compact way:
$$\max_{\boldsymbol{\Theta} \in \mathbb{R}^{K \times (K-1)}} \ \mathrm{tr}\big(\boldsymbol{\Theta}^\top\mathbf{M}\boldsymbol{\Theta}\big) \qquad \text{s.t.} \quad \boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{Y}\boldsymbol{\Theta} = \mathbf{I}_{K-1} \ . \qquad \text{(B.5)}$$

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1) \times (K-1)$ matrix $\mathbf{M}_{\boldsymbol{\Theta}}$ be $\boldsymbol{\Theta}^\top\mathbf{M}\boldsymbol{\Theta}$. The classical eigenvector formulation associated with (B.5) is then
$$\mathbf{M}_{\boldsymbol{\Theta}}\mathbf{v} = \lambda\mathbf{v} \ , \qquad \text{(B.6)}$$

where $\mathbf{v}$ is an eigenvector and $\lambda$ the associated eigenvalue of $\mathbf{M}_{\boldsymbol{\Theta}}$. Operating,
$$\mathbf{v}^\top\mathbf{M}_{\boldsymbol{\Theta}}\mathbf{v} = \lambda \iff \mathbf{v}^\top\boldsymbol{\Theta}^\top\mathbf{M}\boldsymbol{\Theta}\mathbf{v} = \lambda \ .$$

Making the change of variable $\mathbf{w} = \boldsymbol{\Theta}\mathbf{v}$, we obtain an alternative eigen-problem where the $\mathbf{w}$ are the eigenvectors of $\mathbf{M}$ and $\lambda$ the associated eigenvalue:
$$\mathbf{w}^\top\mathbf{M}\mathbf{w} = \lambda \ . \qquad \text{(B.7)}$$

Therefore $\mathbf{v}$ are the eigenvectors in the eigen-decomposition of matrix $\mathbf{M}_{\boldsymbol{\Theta}}$ and $\mathbf{w}$ are the eigenvectors in the eigen-decomposition of matrix $\mathbf{M}$. Note that the only difference between the $(K-1) \times (K-1)$ matrix $\mathbf{M}_{\boldsymbol{\Theta}}$ and the $K \times K$ matrix $\mathbf{M}$ is the $K \times (K-1)$ matrix $\boldsymbol{\Theta}$ in the expression $\mathbf{M}_{\boldsymbol{\Theta}} = \boldsymbol{\Theta}^\top\mathbf{M}\boldsymbol{\Theta}$. Then, to avoid the computation of the $p \times p$ inverse $(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}$, we can use the optimal value of the coefficient matrix $\mathbf{B} = (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\Theta}$ in $\mathbf{M}_{\boldsymbol{\Theta}}$:
$$\mathbf{M}_{\boldsymbol{\Theta}} = \boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\Theta} = \boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B} \ .$$

Thus, the eigen-decomposition of the $(K-1) \times (K-1)$ matrix $\mathbf{M}_{\boldsymbol{\Theta}} = \boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B}$ yields the $\mathbf{v}$ eigenvectors of (B.6). To obtain the $\mathbf{w}$ eigenvectors of the alternative formulation (B.7), the change of variable $\mathbf{w} = \boldsymbol{\Theta}\mathbf{v}$ needs to be undone.

To summarize, we calculate the $\mathbf{v}$ eigenvectors from the eigen-decomposition of the tractable matrix $\mathbf{M}_{\boldsymbol{\Theta}}$, evaluated as $\boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B}$. The definitive eigenvectors $\mathbf{w}$ are then recovered as $\mathbf{w} = \boldsymbol{\Theta}\mathbf{v}$. The final step is the reconstruction of the optimal score matrix using the vectors $\mathbf{w}$ as its columns. At this point we understand what is called in the literature "updating the initial score matrix": multiplying the initial $\boldsymbol{\Theta}$ by the eigenvector matrix $\mathbf{V}$ of decomposition (B.6) reverses the change of variable and restores the $\mathbf{w}$ vectors. The matrix $\mathbf{B}$ also needs to be "updated", by multiplying it by the same matrix of eigenvectors $\mathbf{V}$, in order to account for the initial $\boldsymbol{\Theta}$ used in the first computation of $\mathbf{B}$:
$$\mathbf{B}^\star = (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\Theta}\mathbf{V} = \mathbf{B}\mathbf{V} \ .$$
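The following NumPy sketch illustrates this update (illustrative only, not the GLOSS implementation; it assumes an initial score matrix Theta0 satisfying the constraint of (B.5) and a common penalty matrix Omega):

import numpy as np

def update_scores(X, Y, Theta0, Omega):
    # Solve the regression for the initial scores (a p x p linear system, no explicit inverse),
    # then eigen-decompose the small (K-1) x (K-1) matrix M_Theta = Theta0' Y' X B
    # instead of the K x K matrix M = Y' X (X'X + Omega)^{-1} X' Y.
    B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
    M_theta = Theta0.T @ Y.T @ X @ B0
    evals, V = np.linalg.eigh((M_theta + M_theta.T) / 2.0)
    V = V[:, np.argsort(evals)[::-1]]      # sort eigenvectors by decreasing eigenvalue
    Theta = Theta0 @ V                     # undo the change of variable w = Theta v
    B = B0 @ V                             # update the coefficients accordingly
    return Theta, B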



B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix $\boldsymbol{\Theta}$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $\mathbf{M} = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$.

By definition of the eigen-decomposition, the eigenvectors of $\mathbf{M}$ (called $\mathbf{w}$ in (B.7)) form a basis, so that any score vector $\boldsymbol{\theta}_k$ can be expressed as a linear combination of them:
$$\boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m \ , \quad \text{s.t.} \quad \boldsymbol{\theta}_k^\top\boldsymbol{\theta}_k = 1 \ . \qquad \text{(B.8)}$$

The score vector constraint $\boldsymbol{\theta}_k^\top\boldsymbol{\theta}_k = 1$ can also be expressed as a function of this basis,
$$\Big(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\Big) = 1 \ ,$$
which, by the properties of eigenvectors, reduces to
$$\sum_{m=1}^{K-1}\alpha_m^2 = 1 \ . \qquad \text{(B.9)}$$

Let $\mathbf{M}$ be multiplied by a score vector $\boldsymbol{\theta}_k$, which can be replaced by its linear combination of eigenvectors $\mathbf{w}_m$ (B.8):
$$\mathbf{M}\boldsymbol{\theta}_k = \mathbf{M}\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m = \sum_{m=1}^{K-1}\alpha_m\mathbf{M}\mathbf{w}_m \ .$$
As the $\mathbf{w}_m$ are the eigenvectors of $\mathbf{M}$, the relationship $\mathbf{M}\mathbf{w}_m = \lambda_m\mathbf{w}_m$ can be used to obtain
$$\mathbf{M}\boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m \ .$$

Left-multiplying both sides by $\boldsymbol{\theta}_k^\top$, written as its linear combination of eigenvectors, gives
$$\boldsymbol{\theta}_k^\top\mathbf{M}\boldsymbol{\theta}_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell\mathbf{w}_\ell\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m\Big) \ .$$
This equation can be simplified using the orthogonality property of eigenvectors, according to which $\mathbf{w}_\ell^\top\mathbf{w}_m$ is zero for any $\ell \neq m$, giving
$$\boldsymbol{\theta}_k^\top\mathbf{M}\boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \ .$$


The optimization problem (B.5) for discriminant direction $k$ can be rewritten as
$$\max_{\boldsymbol{\theta}_k \in \mathbb{R}^{K}} \ \boldsymbol{\theta}_k^\top\mathbf{M}\boldsymbol{\theta}_k = \max_{\boldsymbol{\theta}_k \in \mathbb{R}^{K}} \ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \qquad \text{(B.10)}$$
$$\text{with} \quad \boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m \quad \text{and} \quad \sum_{m=1}^{K-1}\alpha_m^2 = 1 \ .$$

One way of maximizing Problem (B.10) is to choose $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m$, the resulting score vector $\boldsymbol{\theta}_k$ is equal to the $k$th eigenvector $\mathbf{w}_k$.

As a summary, it can be concluded that the solution of the original problem (B.1) can be obtained from an eigenvector decomposition of the matrix $\mathbf{M} = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

$$\max_{\boldsymbol{\beta} \in \mathbb{R}^p} \ \boldsymbol{\beta}^\top\boldsymbol{\Sigma}_B\boldsymbol{\beta} \qquad \text{(C.1a)}$$
$$\text{s.t.} \quad \boldsymbol{\beta}^\top\boldsymbol{\Sigma}_W\boldsymbol{\beta} = 1 \ , \qquad \text{(C.1b)}$$
where $\boldsymbol{\Sigma}_B$ and $\boldsymbol{\Sigma}_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.

The Lagrangian of Problem (C.1) is
$$\mathcal{L}(\boldsymbol{\beta}, \nu) = \boldsymbol{\beta}^\top\boldsymbol{\Sigma}_B\boldsymbol{\beta} - \nu(\boldsymbol{\beta}^\top\boldsymbol{\Sigma}_W\boldsymbol{\beta} - 1) \ ,$$
so that its first derivative with respect to $\boldsymbol{\beta}$ is
$$\frac{\partial\mathcal{L}(\boldsymbol{\beta}, \nu)}{\partial\boldsymbol{\beta}} = 2\boldsymbol{\Sigma}_B\boldsymbol{\beta} - 2\nu\boldsymbol{\Sigma}_W\boldsymbol{\beta} \ .$$

A necessary optimality condition for $\boldsymbol{\beta}^\star$ is that this derivative is zero, that is,
$$\boldsymbol{\Sigma}_B\boldsymbol{\beta}^\star = \nu\boldsymbol{\Sigma}_W\boldsymbol{\beta}^\star \ .$$
Provided $\boldsymbol{\Sigma}_W$ is full rank, we have
$$\boldsymbol{\Sigma}_W^{-1}\boldsymbol{\Sigma}_B\boldsymbol{\beta}^\star = \nu\boldsymbol{\beta}^\star \ . \qquad \text{(C.2)}$$

Thus, the solutions $\boldsymbol{\beta}^\star$ match the definition of an eigenvector of the matrix $\boldsymbol{\Sigma}_W^{-1}\boldsymbol{\Sigma}_B$ with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
$$\boldsymbol{\beta}^{\star\top}\boldsymbol{\Sigma}_B\boldsymbol{\beta}^\star = \boldsymbol{\beta}^{\star\top}\boldsymbol{\Sigma}_W\boldsymbol{\Sigma}_W^{-1}\boldsymbol{\Sigma}_B\boldsymbol{\beta}^\star = \nu\,\boldsymbol{\beta}^{\star\top}\boldsymbol{\Sigma}_W\boldsymbol{\beta}^\star \quad \text{from (C.2)} \quad = \nu \quad \text{from (C.1b)} \ .$$
That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence, $\nu$ is the largest eigenvalue of $\boldsymbol{\Sigma}_W^{-1}\boldsymbol{\Sigma}_B$ and $\boldsymbol{\beta}^\star$ is any eigenvector corresponding to this maximal eigenvalue.
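For illustration, a small NumPy/SciPy sketch of this result (our own, with hypothetical function names; it assumes $\boldsymbol{\Sigma}_W$ is invertible) solves the equivalent generalized eigen-problem $\boldsymbol{\Sigma}_B\boldsymbol{\beta} = \nu\boldsymbol{\Sigma}_W\boldsymbol{\beta}$:

import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, y):
    # Build the within- and between-class covariance matrices of Property 1,
    # then solve Sigma_B beta = nu Sigma_W beta and keep the leading eigenvector.
    n, p = X.shape
    xbar = X.mean(axis=0)
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk) / n
        Sb += len(Xk) * np.outer(mk - xbar, mk - xbar) / n
    evals, evecs = eigh(Sb, Sw)            # generalized symmetric eigen-problem
    return evecs[:, -1], evals[-1]         # beta and nu for the largest eigenvalue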


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\boldsymbol{\tau} \in \mathbb{R}^p} \ \min_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \ J(\mathbf{B}) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j} \qquad \text{(D.1a)}$$
$$\text{s.t.} \quad \sum_{j=1}^{p}\tau_j = 1 \ , \qquad \text{(D.1b)}$$
$$\qquad\ \ \tau_j \geq 0 \ , \quad j = 1, \ldots, p \ . \qquad \text{(D.1c)}$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$ be a matrix composed of row vectors $\boldsymbol{\beta}^j \in \mathbb{R}^{K-1}$, $\mathbf{B} = (\boldsymbol{\beta}^{1\top}, \ldots, \boldsymbol{\beta}^{p\top})^\top$.
$$\mathcal{L}(\mathbf{B}, \boldsymbol{\tau}, \lambda, \nu_0, \boldsymbol{\nu}) = J(\mathbf{B}) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j \ . \qquad \text{(D.2)}$$

The starting point is the Lagrangian (D.2), which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:
$$\frac{\partial\mathcal{L}(\mathbf{B}, \boldsymbol{\tau}, \lambda, \nu_0, \boldsymbol{\nu})}{\partial\tau_j}\bigg|_{\tau_j = \tau_j^\star} = 0 \ \Rightarrow\ -\lambda w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0 \ \Rightarrow\ -\lambda w_j^2\|\boldsymbol{\beta}^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0 \ \Rightarrow\ -\lambda w_j^2\|\boldsymbol{\beta}^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0 \ .$$
The last two expressions are related through the complementary slackness property of Lagrange multipliers, which states that $\nu_j g_j(\boldsymbol{\tau}^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\boldsymbol{\tau})$ is the corresponding inequality constraint. The optimal $\tau_j^\star$ can then be deduced:
$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j\|\boldsymbol{\beta}^j\|_2 \ .$$

Plugging this optimal value of $\tau_j$ into constraint (D.1b),
$$\sum_{j=1}^{p}\tau_j = 1 \ \Rightarrow\ \tau_j^\star = \frac{w_j\|\boldsymbol{\beta}^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{j'}\|_2} \ . \qquad \text{(D.3)}$$


With this value of $\tau_j^\star$, Problem (D.1) is equivalent to
$$\min_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \ J(\mathbf{B}) + \lambda\Big(\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2\Big)^2 \ . \qquad \text{(D.4)}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null row vectors $\boldsymbol{\beta}^j$.
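As a numerical illustration (a sketch of ours, with arbitrary values), the variational penalty of (D.1a) evaluated at the optimal $\tau^\star$ of (D.3) indeed coincides with the squared penalty of (D.4):

import numpy as np

rng = np.random.default_rng(0)
p, K = 8, 4
B = rng.normal(size=(p, K - 1))
B[rng.random(p) < 0.4] = 0.0                    # some null rows, as in a sparse solution
w = np.ones(p)

norms = np.linalg.norm(B, axis=1)
squared_penalty = np.sum(w * norms) ** 2        # penalty of (D.4)

tau = w * norms / np.sum(w * norms)             # optimal tau of (D.3)
active = tau > 0                                # terms with a null row vanish in (D.1a)
variational_penalty = np.sum((w[active] * norms[active]) ** 2 / tau[active])

print(np.isclose(squared_penalty, variational_penalty))   # True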

The penalty term of (D.1a) can be conveniently presented as $\lambda\mathbf{B}^\top\boldsymbol{\Omega}\mathbf{B}$, where
$$\boldsymbol{\Omega} = \mathrm{diag}\Big(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\Big) \ . \qquad \text{(D.5)}$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\boldsymbol{\Omega}$ is
$$(\boldsymbol{\Omega})_{jj} = \frac{w_j\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{j'}\|_2}{\|\boldsymbol{\beta}^j\|_2} \ . \qquad \text{(D.6)}$$

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (D.4) is
$$\bigg\{\mathbf{V} \in \mathbb{R}^{p \times (K-1)} : \mathbf{V} = \frac{\partial J(\mathbf{B})}{\partial\mathbf{B}} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2\Big)\mathbf{G}\bigg\} \ , \qquad \text{(D.7)}$$
where $\mathbf{G} = (\mathbf{g}^{1\top}, \ldots, \mathbf{g}^{p\top})^\top$ is a $p \times (K-1)$ matrix defined as follows. Let $\mathcal{S}(\mathbf{B})$ denote the row support of $\mathbf{B}$, $\mathcal{S}(\mathbf{B}) = \{j \in 1, \ldots, p : \|\boldsymbol{\beta}^j\|_2 \neq 0\}$; then we have
$$\forall j \in \mathcal{S}(\mathbf{B}) \ , \quad \mathbf{g}^j = w_j\|\boldsymbol{\beta}^j\|_2^{-1}\boldsymbol{\beta}^j \ , \qquad \text{(D.8)}$$
$$\forall j \notin \mathcal{S}(\mathbf{B}) \ , \quad \|\mathbf{g}^j\|_2 \leq w_j \ . \qquad \text{(D.9)}$$


This condition results in an equality for the "active" non-zero row vectors $\boldsymbol{\beta}^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(\mathbf{B})$ is strictly convex. All critical points $\mathbf{B}^\star$ of the objective function verifying the following conditions are global minima. Let $\mathcal{S}(\mathbf{B}^\star)$ denote the row support of $\mathbf{B}^\star$, $\mathcal{S}(\mathbf{B}^\star) = \{j \in 1, \ldots, p : \|\boldsymbol{\beta}^{\star j}\|_2 \neq 0\}$, and let $\bar{\mathcal{S}}(\mathbf{B}^\star)$ be its complement; then we have
$$\forall j \in \mathcal{S}(\mathbf{B}^\star) \ , \quad -\frac{\partial J(\mathbf{B}^\star)}{\partial\boldsymbol{\beta}^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{\star j'}\|_2\Big)\, w_j\|\boldsymbol{\beta}^{\star j}\|_2^{-1}\boldsymbol{\beta}^{\star j} \ , \qquad \text{(D.10a)}$$
$$\forall j \in \bar{\mathcal{S}}(\mathbf{B}^\star) \ , \quad \bigg\|\frac{\partial J(\mathbf{B}^\star)}{\partial\boldsymbol{\beta}^j}\bigg\|_2 \leq 2\lambda w_j\Big(\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{\star j'}\|_2\Big) \ . \qquad \text{(D.10b)}$$

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $\mathbf{B}$, the gap between these objectives is null at $\boldsymbol{\tau}^\star$ such that
$$\tau_j^\star = \frac{w_j\|\boldsymbol{\beta}^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{j'}\|_2} \ .$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\boldsymbol{\tau} \in \mathbb{R}^p$ be any feasible vector; we have
$$\Big(\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2\Big)^2 = \Bigg(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\boldsymbol{\beta}^j\|_2}{\tau_j^{1/2}}\Bigg)^2 \leq \Big(\sum_{j=1}^{p}\tau_j\Big)\Bigg(\sum_{j=1}^{p} w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j}\Bigg) \leq \sum_{j=1}^{p} w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j} \ ,$$
where we used the Cauchy–Schwarz inequality for the first inequality and the definition of the feasibility set of $\boldsymbol{\tau}$ for the second one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because there the definitions of the feasible sets of $\boldsymbol{\tau}$ and $\boldsymbol{\beta}$ are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $\mathbf{B}^0$ are optimal for the score values $\boldsymbol{\Theta}^0$, and if the optimal scores $\boldsymbol{\Theta}^\star$ are obtained by a unitary transformation of $\boldsymbol{\Theta}^0$, say $\boldsymbol{\Theta}^\star = \boldsymbol{\Theta}^0\mathbf{V}$ (where $\mathbf{V} \in \mathbb{R}^{M \times M}$ is a unitary matrix), then $\mathbf{B}^\star = \mathbf{B}^0\mathbf{V}$ is optimal conditionally on $\boldsymbol{\Theta}^\star$, that is, $(\boldsymbol{\Theta}^\star, \mathbf{B}^\star)$ is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $\mathbf{B}^\star$ be a solution of
$$\min_{\mathbf{B} \in \mathbb{R}^{p \times M}} \ \|\mathbf{Y} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2 \ , \qquad \text{(E.1)}$$
and let $\tilde{\mathbf{Y}} = \mathbf{Y}\mathbf{V}$, where $\mathbf{V} \in \mathbb{R}^{M \times M}$ is a unitary matrix. Then $\tilde{\mathbf{B}} = \mathbf{B}^\star\mathbf{V}$ is a solution of
$$\min_{\mathbf{B} \in \mathbb{R}^{p \times M}} \ \|\tilde{\mathbf{Y}} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2 \ . \qquad \text{(E.2)}$$

Proof. The first-order necessary optimality conditions for $\mathbf{B}^\star$ are
$$\forall j \in \mathcal{S}(\mathbf{B}^\star) \ , \quad 2\mathbf{x}^{j\top}\big(\mathbf{x}^j\boldsymbol{\beta}^{\star j} - \mathbf{Y}\big) + \lambda w_j\|\boldsymbol{\beta}^{\star j}\|_2^{-1}\boldsymbol{\beta}^{\star j} = \mathbf{0} \ , \qquad \text{(E.3a)}$$
$$\forall j \notin \mathcal{S}(\mathbf{B}^\star) \ , \quad 2\big\|\mathbf{x}^{j\top}\big(\mathbf{x}^j\boldsymbol{\beta}^{\star j} - \mathbf{Y}\big)\big\|_2 \leq \lambda w_j \ , \qquad \text{(E.3b)}$$
where $\mathcal{S}(\mathbf{B}^\star) \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors of $\mathbf{B}^\star$ and $\bar{\mathcal{S}}(\mathbf{B}^\star)$ is its complement.

First, we note that, from the definition of $\tilde{\mathbf{B}}$, we have $\mathcal{S}(\tilde{\mathbf{B}}) = \mathcal{S}(\mathbf{B}^\star)$. Then we may rewrite the above conditions as follows:
$$\forall j \in \mathcal{S}(\tilde{\mathbf{B}}) \ , \quad 2\mathbf{x}^{j\top}\big(\mathbf{x}^j\tilde{\boldsymbol{\beta}}^{j} - \tilde{\mathbf{Y}}\big) + \lambda w_j\|\tilde{\boldsymbol{\beta}}^{j}\|_2^{-1}\tilde{\boldsymbol{\beta}}^{j} = \mathbf{0} \ , \qquad \text{(E.4a)}$$
$$\forall j \notin \mathcal{S}(\tilde{\mathbf{B}}) \ , \quad 2\big\|\mathbf{x}^{j\top}\big(\mathbf{x}^j\tilde{\boldsymbol{\beta}}^{j} - \tilde{\mathbf{Y}}\big)\big\|_2 \leq \lambda w_j \ , \qquad \text{(E.4b)}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $\mathbf{V}$, using that $\mathbf{V}\mathbf{V}^\top = \mathbf{I}$, so that $\forall\mathbf{u} \in \mathbb{R}^M$, $\|\mathbf{u}^\top\|_2 = \|\mathbf{u}^\top\mathbf{V}\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{\mathbf{B}}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
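A quick numerical check of this invariance (a sketch of ours; the names and values are illustrative) shows that the group-Lasso objective takes the same value at $(\mathbf{B}, \mathbf{Y})$ and $(\mathbf{B}\mathbf{V}, \mathbf{Y}\mathbf{V})$ for any $\mathbf{B}$, which is the property underlying Proposition E.1:

import numpy as np

rng = np.random.default_rng(1)
n, p, M = 30, 10, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))
V, _ = np.linalg.qr(rng.normal(size=(M, M)))    # a random unitary (orthogonal) matrix
lam, w = 0.5, np.ones(p)

def objective(B, Y):
    # Frobenius loss plus weighted group-Lasso penalty on the rows of B
    return np.linalg.norm(Y - X @ B, "fro") ** 2 + lam * np.sum(w * np.linalg.norm(B, axis=1))

print(np.isclose(objective(B, Y), objective(B @ V, Y @ V)))   # True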


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\boldsymbol{\theta}, \boldsymbol{\theta}')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from $Q(\boldsymbol{\theta}, \boldsymbol{\theta}')$ when the latter is available:

$$\mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{n}\log\Bigg(\sum_{k=1}^{K}\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\Bigg) \ , \qquad \text{(F.1)}$$
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\boldsymbol{\theta}')\log\big(\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\big) \ , \qquad \text{(F.2)}$$
$$\text{with} \quad t_{ik}(\boldsymbol{\theta}') = \frac{\pi_k' f_k(\mathbf{x}_i; \boldsymbol{\theta}_k')}{\sum_{\ell}\pi_\ell' f_\ell(\mathbf{x}_i; \boldsymbol{\theta}_\ell')} \ . \qquad \text{(F.3)}$$

In the EM algorithm, $\boldsymbol{\theta}'$ denotes the model parameters at the previous iteration, $t_{ik}(\boldsymbol{\theta}')$ are the posterior probabilities computed from $\boldsymbol{\theta}'$ at the previous E-step, and $\boldsymbol{\theta}$ without prime denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\boldsymbol{\theta}, \boldsymbol{\theta}')$.

Using (F.3), we have
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}') = \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\big(\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\big) = \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\big(t_{ik}(\boldsymbol{\theta})\big) + \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\Bigg(\sum_{\ell}\pi_\ell f_\ell(\mathbf{x}_i; \boldsymbol{\theta}_\ell)\Bigg) = \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\big(t_{ik}(\boldsymbol{\theta})\big) + \mathcal{L}(\boldsymbol{\theta}) \ .$$

In particular, after the evaluation of $t_{ik}$ in the E-step, where $\boldsymbol{\theta} = \boldsymbol{\theta}'$, the log-likelihood can be computed from the value of $Q(\boldsymbol{\theta}, \boldsymbol{\theta})$ (7.7) and the entropy of the posterior probabilities:
$$\mathcal{L}(\boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}) - \sum_{i,k} t_{ik}(\boldsymbol{\theta})\log\big(t_{ik}(\boldsymbol{\theta})\big) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}) + H(\mathbf{T}) \ .$$
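This identity is easy to check numerically; the sketch below (ours, with arbitrary toy parameters, not taken from the thesis experiments) evaluates both sides for a two-component Gaussian mixture:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
pis = np.array([0.4, 0.6])
mus = [np.zeros(2), np.ones(2)]
Sigma = np.eye(2)

dens = np.column_stack([pi * multivariate_normal(mu, Sigma).pdf(X)
                        for pi, mu in zip(pis, mus)])   # pi_k f_k(x_i), an n x K array
L = np.sum(np.log(dens.sum(axis=1)))                    # log-likelihood (F.1)
T = dens / dens.sum(axis=1, keepdims=True)              # posteriors t_ik (F.3), E-step
Q = np.sum(T * np.log(dens))                            # expected complete log-likelihood (F.2)
H = -np.sum(T * np.log(T))                              # entropy of the posteriors
print(np.isclose(L, Q + H))                             # True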


G Derivation of the M-Step Equations

This appendix shows the whole process of obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}') = \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\big(\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\big) = \sum_{k}\log(\pi_k)\sum_{i} t_{ik} - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{i,k} t_{ik}(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_k) \ ,$$
which has to be maximized subject to $\sum_k\pi_k = 1$.

The Lagrangian of this problem is
$$\mathcal{L}(\boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}') + \lambda\Big(\sum_{k}\pi_k - 1\Big) \ .$$
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$.

G.1 Prior probabilities

$$\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\pi_k} = 0 \iff \frac{1}{\pi_k}\sum_{i} t_{ik} + \lambda = 0 \ ,$$
where $\lambda$ is identified from the constraint, leading to
$$\pi_k = \frac{1}{n}\sum_{i} t_{ik} \ .$$


G.2 Means

$$\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\mu}_k} = 0 \iff -\frac{1}{2}\sum_{i} t_{ik}\, 2\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_k - \mathbf{x}_i) = 0 \ \Rightarrow\ \boldsymbol{\mu}_k = \frac{\sum_{i} t_{ik}\mathbf{x}_i}{\sum_{i} t_{ik}} \ .$$

G.3 Covariance Matrix

$$\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\Sigma}^{-1}} = 0 \iff \underbrace{\frac{n}{2}\boldsymbol{\Sigma}}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}_{\text{as per Property 5}} = 0 \ \Rightarrow\ \boldsymbol{\Sigma} = \frac{1}{n}\sum_{i,k} t_{ik}(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \ .$$
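Gathered in code, the three updates read as follows (a NumPy sketch of ours, not the Mix-GLOSS implementation; T is assumed to be the n × K matrix of posteriors $t_{ik}$ produced by the E-step):

import numpy as np

def m_step(X, T):
    # M-step of a Gaussian mixture with a common covariance matrix (Appendix G).
    n, p = X.shape
    nk = T.sum(axis=0)                      # sum_i t_ik
    pis = nk / n                            # G.1: pi_k = (1/n) sum_i t_ik
    mus = (T.T @ X) / nk[:, None]           # G.2: mu_k = sum_i t_ik x_i / sum_i t_ik
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):             # G.3: common covariance matrix
        R = X - mus[k]
        Sigma += (T[:, k, None] * R).T @ R
    return pis, mus, Sigma / n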


Bibliography

F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisherrsquos linear discriminant function lsquonaiveBayesrsquo and some alternatives when there are many more variables than observationsBernoulli 10(6)989ndash1010 2004

C Bienarcki G Celeux G Govaert and F Langrognet MIXMOD Statistical Docu-mentation httpwwwmixmodorg 2008

C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chap-man amp HallCRC 2003

D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation base-line for clustering Technical Report D71-m12 Massive Sets of Heuristics for MachineLearning httpssecuremash-projecteufilesmash-deliverable-D71-m12pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implemen-tations of original clustering Technical Report D72-m24 Massive Sets of Heuristicsfor Machine Learning httpssecuremash-projecteufilesmash-deliverable-D72-m24pdf 2011

Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and ML Martin-Magniette Selvarclust software for variable se-lection in model-based clustering rdquohttpwwwmathuniv-toulousefr~maugisSelvarClustHomepagehtmlrdquo 2009b

L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

Y Nesterov Gradient methods for minimizing composite functions preprint 2007

S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour lrsquoIntegration de Connaissances dans lesModeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005



    Par Luis Francisco SANCHEZ MERCHANTE

    Thegravese preacutesenteacutee pour lrsquoobtention du grade de Docteur de lrsquoUTC

    Learning algorithms for sparse classification

    Soutenue le 07 juin 2013

    Speacutecialiteacute Technologies de lrsquoInformation et des Systegravemes

    D2084

    Algorithmes drsquoestimation pour laclassification parcimonieuse

    Luis Francisco Sanchez MerchanteUniversity of Compiegne

    CompiegneFrance

    ldquoNunca se sabe que encontrara uno tras una puerta Quiza en eso consistela vida en girar pomosrdquo

    Albert Espinosa

    ldquoBe brave Take risks Nothing can substitute experiencerdquo

    Paulo Coelho

    Acknowledgements

    If this thesis has fallen into your hands and you have the curiosity to read this para-graph you must know that even though it is a short section there are quite a lot ofpeople behind this volume All of them supported me during the three years threemonths and three weeks that it took me to finish this work However you will hardlyfind any names I think it is a little sad writing peoplersquos names in a document that theywill probably not see and that will be condemned to gather dust on a bookshelf It islike losing a wallet with pictures of your beloved family and friends It makes me feelsomething like melancholy

    Obviously this does not mean that I have nothing to be grateful for I always feltunconditional love and support from my family and I never felt homesick since my spanishfriends did the best they could to visit me frequently During my time in CompiegneI met wonderful people that are now friends for life I am sure that all this people donot need to be listed in this section to know how much I love them I thank them everytime we see each other by giving them the best of myself

    I enjoyed my time in Compiegne It was an exciting adventure and I do not regreta single thing I am sure that I will miss these days but this does not make me sadbecause as the Beatles sang in ldquoThe endrdquo or Jorge Drexler in ldquoTodo se transformardquo theamount that you miss people is equal to the love you gave them and received from them

    The only names I am including are my supervisorsrsquo Yves Grandvalet and GerardGovaert I do not think it is possible to have had better teaching and supervision andI am sure that the reason I finished this work was not only thanks to their technicaladvice but also but also thanks to their close support humanity and patience

    Contents

    List of figures v

    List of tables vii

    Notation and Symbols ix

    I Context and Foundations 1

    1 Context 5

    2 Regularization for Feature Selection 921 Motivations 9

    2.2 Categorization of Feature Selection Techniques
    2.3 Regularization
        2.3.1 Important Properties
        2.3.2 Pure Penalties
        2.3.3 Hybrid Penalties
        2.3.4 Mixed Penalties
        2.3.5 Sparsity Considerations
        2.3.6 Optimization Tools for Regularized Problems

    II Sparse Linear Discriminant Analysis

    Abstract

    3 Feature Selection in Fisher Discriminant Analysis
        3.1 Fisher Discriminant Analysis
        3.2 Feature Selection in LDA Problems
            3.2.1 Inertia Based
            3.2.2 Regression Based

    4 Formalizing the Objective
        4.1 From Optimal Scoring to Linear Discriminant Analysis
            4.1.1 Penalized Optimal Scoring Problem
            4.1.2 Penalized Canonical Correlation Analysis
            4.1.3 Penalized Linear Discriminant Analysis
            4.1.4 Summary
        4.2 Practicalities
            4.2.1 Solution of the Penalized Optimal Scoring Regression
            4.2.2 Distance Evaluation
            4.2.3 Posterior Probability Evaluation
            4.2.4 Graphical Representation
        4.3 From Sparse Optimal Scoring to Sparse LDA
            4.3.1 A Quadratic Variational Form
            4.3.2 Group-Lasso OS as Penalized LDA

    5 GLOSS Algorithm
        5.1 Regression Coefficients Updates
            5.1.1 Cholesky decomposition
            5.1.2 Numerical Stability
        5.2 Score Matrix
        5.3 Optimality Conditions
        5.4 Active and Inactive Sets
        5.5 Penalty Parameter
        5.6 Options and Variants
            5.6.1 Scaling Variables
            5.6.2 Sparse Variant
            5.6.3 Diagonal Variant
            5.6.4 Elastic net and Structured Variant

    6 Experimental Results
        6.1 Normalization
        6.2 Decision Thresholds
        6.3 Simulated Data
        6.4 Gene Expression Data
        6.5 Correlated Data
        Discussion

    III Sparse Clustering Analysis

    Abstract

    7 Feature Selection in Mixture Models
        7.1 Mixture Models
            7.1.1 Model
            7.1.2 Parameter Estimation: The EM Algorithm
        7.2 Feature Selection in Model-Based Clustering
            7.2.1 Based on Penalized Likelihood
            7.2.2 Based on Model Variants
            7.2.3 Based on Model Selection

    8 Theoretical Foundations
        8.1 Resolving EM with Optimal Scoring
            8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
            8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
            8.1.3 Clustering Using Penalized Optimal Scoring
            8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
        8.2 Optimized Criterion
            8.2.1 A Bayesian Derivation
            8.2.2 Maximum a Posteriori Estimator

    9 Mix-GLOSS Algorithm
        9.1 Mix-GLOSS
            9.1.1 Outer Loop: Whole Algorithm Repetitions
            9.1.2 Penalty Parameter Loop
            9.1.3 Inner Loop: EM Algorithm
        9.2 Model Selection

    10 Experimental Results
        10.1 Tested Clustering Algorithms
        10.2 Results
        10.3 Discussion

    Conclusions

    Appendix

    A Matrix Properties

    B The Penalized-OS Problem is an Eigenvector Problem
        B.1 How to Solve the Eigenvector Decomposition
        B.2 Why the OS Problem is Solved as an Eigenvector Problem

    C Solving Fisher's Discriminant Problem

    D Alternative Variational Formulation for the Group-Lasso
        D.1 Useful Properties
        D.2 An Upper Bound on the Objective Function

    E Invariance of the Group-Lasso to Unitary Transformations

    F Expected Complete Likelihood and Likelihood

    G Derivation of the M-Step Equations
        G.1 Prior probabilities
        G.2 Means
        G.3 Covariance Matrix

    Bibliography

    List of Figures

    1.1 MASH project logo
    2.1 Example of relevant features
    2.2 Four key steps of feature selection
    2.3 Admissible sets in two dimensions for different pure norms ||β||_p
    2.4 Two dimensional regularized problems with ||β||_1 and ||β||_2 penalties
    2.5 Admissible sets for the Lasso and Group-Lasso
    2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
    4.1 Graphical representation of the variational approach to Group-Lasso
    5.1 GLOSS block diagram
    5.2 Graph and Laplacian matrix for a 3 × 3 image
    6.1 TPR versus FPR for all simulations
    6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
    6.3 USPS digits "1" and "0"
    6.4 Discriminant direction between digits "1" and "0"
    6.5 Sparse discriminant direction between digits "1" and "0"
    9.1 Mix-GLOSS Loops Scheme
    9.2 Mix-GLOSS model selection diagram
    10.1 Class mean vectors for each artificial simulation
    10.2 TPR versus FPR for all simulations

    List of Tables

    6.1 Experimental results for simulated data, supervised classification
    6.2 Average TPR and FPR for all simulations
    6.3 Experimental results for gene expression data, supervised classification
    10.1 Experimental results for simulated data, unsupervised clustering
    10.2 Average TPR versus FPR for all clustering simulations

    Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

    Sets

N        the set of natural numbers, N = {1, 2, ...}
R        the set of reals
|A|      cardinality of a set A (for finite sets, the number of elements)
A̅        complement of set A

    Data

X        input domain
x_i      input sample, x_i ∈ X
X        design matrix, X = (x_1^T, ..., x_n^T)^T
x^j      column j of X
y_i      class indicator of sample i
Y        indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z        complete data, z = (x, y)
G_k      set of the indices of observations belonging to class k
n        number of examples
K        number of classes
p        dimension of X
i, j, k  indices running over N

Vectors, Matrices and Norms

0          vector with all entries equal to zero
1          vector with all entries equal to one
I          identity matrix
A^T        transpose of matrix A (ditto for vectors)
A^{-1}     inverse of matrix A
tr(A)      trace of matrix A
|A|        determinant of matrix A
diag(v)    diagonal matrix with v on the diagonal
||v||_1    L1 norm of vector v
||v||_2    L2 norm of vector v
||A||_F    Frobenius norm of matrix A


    Probability

E[·]         expectation of a random variable
var[·]       variance of a random variable
N(μ, σ²)     normal distribution with mean μ and variance σ²
W(W, ν)      Wishart distribution with ν degrees of freedom and W scale matrix
H(X)         entropy of random variable X
I(X; Y)      mutual information between random variables X and Y

    Mixture Models

y_ik          hard membership of sample i to cluster k
f_k           distribution function for cluster k
t_ik          posterior probability of sample i to belong to cluster k
T             posterior probability matrix
π_k           prior probability or mixture proportion for cluster k
μ_k           mean vector of cluster k
Σ_k           covariance matrix of cluster k
θ_k           parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)         parameter vector at iteration t of the EM algorithm
f(X; θ)       likelihood function
L(θ; X)       log-likelihood function
L_C(θ; X, Y)  complete log-likelihood function

    Optimization

J(·)          cost function
L(·)          Lagrangian
β̂             generic notation for the solution with respect to β
β^ls          least squares solution coefficient vector
A             active set
γ             step size to update the regularization path
h             direction to update the regularization path


    Penalized models

λ, λ_1, λ_2    penalty parameters
P_λ(θ)         penalty term over a generic parameter vector
β_kj           coefficient j of discriminant vector k
β_k            kth discriminant vector, β_k = (β_k1, ..., β_kp)
B              matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j            jth row of B = (β^{1T}, ..., β^{pT})^T
B_LDA          coefficient matrix in the LDA domain
B_CCA          coefficient matrix in the CCA domain
B_OS           coefficient matrix in the OS domain
X_LDA          data matrix in the LDA domain
X_CCA          data matrix in the CCA domain
X_OS           data matrix in the OS domain
θ_k            score vector k
Θ              score matrix, Θ = (θ_1, ..., θ_{K-1})
Y              label matrix
Ω              penalty matrix
L_CP(θ; X, Z)  penalized complete log-likelihood function
Σ_B            between-class covariance matrix
Σ_W            within-class covariance matrix
Σ_T            total covariance matrix
Σ̂_B            sample between-class covariance matrix
Σ̂_W            sample within-class covariance matrix
Σ̂_T            sample total covariance matrix
Λ              inverse of covariance matrix, or precision matrix
w_j            weights
τ_j            penalty components of the variational approach


    Part I

    Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. The generic models and some basic concepts that will be used along this document are also detailed here, and the state of the art of feature selection by regularization is reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state of the art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


    1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiegne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, some of them must share the same theoretical principles or supply similar features. The framework of the project tests every new piece of code with some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of the Pearson correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


    2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical imaging are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems, the computational complexity increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and provides easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in data preprocessing, used to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a categorizing framework that integrates supervised and unsupervised feature selection algorithms. Both references are excellent reviews to characterize feature selection techniques. I am proposing a framework inspired by these references; it does not cover all the possibilities, but it gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

  – Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

  – Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

  – Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

  – Complete - No subsets are missed from evaluation. Involves combinatorial searches.

  – Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

  – Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

  – Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

  – Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

  – Dependency Measures - Measuring the correlation between features.

  – Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

  – Predictive Accuracy - Use the selected features to predict the labels.

  – Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness make it possible to evaluate subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues of ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_{\beta} \; J(\beta) + \lambda P(\beta)   (2.1)

\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t   (2.2)

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)   (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ||β||_1 and ||β||_2 penalties

Regularizing a linear model with a norm like ||β||_p means that the larger the component |β_j|, the more important the feature x^j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x^j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^ℓ1, and for the L2 regularization it is β^ℓ2. Solution β^ℓ1 is sparse because its second component is zero, while both components of β^ℓ2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes, which would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.

To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with L_p norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||_0 = card{β_j | β_j ≠ 0}:

\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t   (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. It has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t   (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions where Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used, to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2   (2.6)

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2   (2.7)

with solution β^ls = (X^T X)^{-1} X^T y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

The solution to this problem is β^ℓ2 = (X^T X + λ I_p)^{-1} X^T y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
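To make these closed-form expressions concrete, here is a minimal NumPy sketch (my own illustration, not from the thesis; the simulated data, penalty level and variable names are invented) comparing the least squares and ridge estimators on a design with two nearly collinear columns:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 30, 5
    X = rng.normal(size=(n, p))
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)      # two nearly collinear columns
    beta_true = np.array([1.0, 1.0, 0.0, 0.5, -0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    lam = 1.0
    # Least squares: (X'X)^{-1} X'y, unstable when X'X is ill-conditioned
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
    # Ridge: (X'X + lam I)^{-1} X'y, every eigenvalue is shifted upwards by lam
    beta_l2 = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    print("cond(X'X)         =", np.linalg.cond(X.T @ X))
    print("cond(X'X + lam I) =", np.linalg.cond(X.T @ X + lam * np.eye(p)))
    print("beta_ls =", np.round(beta_ls, 2))
    print("beta_l2 =", np.round(beta_l2, 2))

The condition number drops sharply once λI_p is added, which is precisely the "equalization" effect described above.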

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\hat{\beta}_j^{ls})^2}   (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component: every λ_j is optimized to penalize more or less, depending on the influence of β_j in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of L_p norms is the infinity norm, defined as ||x||_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ||β||_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ||β||_* of a norm ||β|| is defined as

\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \; \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1

In the case of an L_q norm with q ∈ [1, +∞], the dual norm is the L_r norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not so popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation; we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2   (2.9)

The term in λ_1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ_2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us identify as G_ℓ the group of genes for the ℓth process and d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β is the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^{L} d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \right)^{1/r}   (2.10)

The pair (r, s) identifies the norms that are combined: an L_s norm within groups and an L_r norm between groups. The L_s norm penalizes the variables in every group G_ℓ, while the L_r norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
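As a concrete illustration of definition (2.10), the following short NumPy sketch (mine, with invented coefficients and groups) evaluates the mixed norm for an arbitrary group structure; the optional weighting of the outer norm mentioned above is omitted for simplicity:

    import numpy as np

    def mixed_norm(beta, groups, r, s):
        # ||beta||_(r,s): an L_s norm within each group, then an L_r norm between groups
        within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
        return np.sum(within ** r) ** (1.0 / r)

    beta = np.array([0.0, 0.0, 0.0, 1.5, -2.0, 0.3, 0.0, -0.7])
    groups = [[0, 1, 2], [3, 4, 5], [6, 7]]               # G_1, G_2, G_3

    print("group-Lasso norm ||beta||_(1,2):", mixed_norm(beta, groups, r=1, s=2))
    print("plain L1 norm ||beta||_1       :", np.sum(np.abs(beta)))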

Several combinations are available; the most popular is the norm ||β||_{(1,2)}, known as group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L_{1,2} norm. Many other mixings are possible, such as ||β||_{(1,4/3)} (Szafranski et al., 2008) or ||β||_{(1,∞)} (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L_{1,2} or L_{1,∞} mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for (a) the Lasso (L1) and (b) the group-Lasso (L_{(1,2)})

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1 induced sparsity; (b) L_{(1,2)} group induced sparsity
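The difference between the two sparsity patterns can be checked programmatically. The toy sketch below (my own illustration; the coefficient values are made up so that the zeroed rows match the variables 3, 5 and 8 of Figure 2.6) stores the 8 × 4 matrix of parameters and shows that only a row-wise (group) pattern allows variables to be dismissed:

    import numpy as np

    # 8 variables (rows) described by 4 parameters (columns), as in Figure 2.6
    B_entrywise = np.array([[0.9, 0.0, 0.0, 0.4],
                            [0.0, 0.7, 0.0, 0.0],
                            [0.0, 0.0, 0.3, 0.0],
                            [0.5, 0.0, 0.0, 0.0],
                            [0.0, 0.2, 0.0, 0.1],
                            [0.0, 0.0, 0.0, 0.6],
                            [0.8, 0.0, 0.1, 0.0],
                            [0.0, 0.0, 0.4, 0.0]])
    B_groupwise = B_entrywise.copy()
    B_groupwise[[2, 4, 7], :] = 0.0        # a group penalty on rows zeroes whole rows

    def removable(B):
        # a variable can be dismissed only if all of its parameters are null
        return 1 + np.flatnonzero(np.all(B == 0.0, axis=1))   # 1-based variable indices

    print("removable with entry-wise sparsity:", removable(B_entrywise))   # none
    print("removable with group sparsity     :", removable(B_groupwise))   # variables 3, 5, 8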

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha (\mathbf{s} + \lambda \mathbf{s}') \quad \text{where } \mathbf{s} \in \partial J(\beta^{(t)}), \; \mathbf{s}' \in \partial P(\beta^{(t)})
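For illustration only (not part of the thesis), here is a minimal NumPy sketch of this update for the Lasso setting, with J(β) = Σ_i (y_i − x_i'β)² and P(β) = ||β||_1, taking sign(β) as a subgradient of the penalty. The step size, number of iterations and data are arbitrary choices of mine; note that, as stated above, the iterates get small but are essentially never exactly zero:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 10
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:3] = [2.0, -1.0, 0.5]
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    lam, alpha = 5.0, 1e-3
    beta = np.zeros(p)
    for t in range(5000):
        s = -2.0 * X.T @ (y - X @ beta)      # gradient of the quadratic loss J
        s_prime = np.sign(beta)              # a subgradient of the L1 penalty P
        beta = beta - alpha * (s + lam * s_prime)

    print("exact zeros in beta:", int(np.sum(beta == 0.0)))   # typically 0: not sparse
    print(np.round(beta, 3))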

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2}

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating their values using an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\[1ex]
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\[1ex]
0 & \text{if } \left|\partial J(\beta)/\partial \beta_j\right| \le \lambda
\end{cases}   (2.11)

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
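The sketch below (my own didactic implementation, not the algorithm of Fu (1998) nor the one developed in this thesis) applies this coordinate-wise soft-thresholding update to the Lasso with a quadratic loss; the data and the penalty level are invented for the example:

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
        """Cyclic coordinate descent for min_beta sum_i (y_i - x_i' beta)^2 + lam sum_j |beta_j|."""
        n, p = X.shape
        beta = np.zeros(p)
        denom = 2.0 * np.sum(X ** 2, axis=0)              # 2 sum_i x_ij^2
        for _ in range(n_sweeps):
            for j in range(p):
                r_j = y - X @ beta + X[:, j] * beta[j]    # partial residual, variable j left out
                z = 2.0 * X[:, j] @ r_j                   # -dJ/dbeta_j evaluated at beta_j = 0
                beta[j] = soft_threshold(z, lam) / denom[j]
        return beta

    rng = np.random.default_rng(2)
    X = rng.normal(size=(40, 8))
    beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.0])
    y = X @ beta_true + 0.1 * rng.normal(size=40)
    print(np.round(lasso_coordinate_descent(X, y, lam=10.0), 3))

Unlike the subgradient iteration, each update is an exact minimization in one coordinate, so irrelevant coefficients are set exactly to zero.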

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of variables with non-zero β_j. It is usually identified as the set A. The complement of the active set is the "inactive set", noted A̅; in the inactive set we can find the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There exists also a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, the solutions of successive active sets are typically close to each other, so it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set A̅ that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set, and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L_{1,2} penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
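The following sketch (a simplified illustration of mine, applied to the Lasso with a quadratic loss; it is neither GLOSS nor the algorithm of Osborne et al. (2000a)) shows how the three tasks interact: an inner solver restricted to the active set, a check of the optimality conditions on the inactive set, and a working set update with the most violating variable:

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_working_set(X, y, lam, tol=1e-6, max_outer=100):
        n, p = X.shape
        beta = np.zeros(p)
        active = []                                   # forward growing: start from an empty A
        for _ in range(max_outer):
            # 1) optimization task, restricted to the active set (warm-started by construction)
            for _ in range(100):
                for j in active:
                    r_j = y - X @ beta + X[:, j] * beta[j]
                    beta[j] = soft_threshold(2.0 * X[:, j] @ r_j, lam) / (2.0 * X[:, j] @ X[:, j])
            # 2) optimality conditions: every inactive j must satisfy |2 x_j'(y - X beta)| <= lam
            violation = np.abs(2.0 * X.T @ (y - X @ beta)) - lam
            violation[active] = -np.inf               # active variables are handled by the inner solver
            j_star = int(np.argmax(violation))
            if violation[j_star] <= tol:
                break                                 # no violation left: beta is (approximately) optimal
            # 3) working set update: add the most violating variable
            active.append(j_star)
        return beta, sorted(active)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(60, 20))
    beta_true = np.zeros(20)
    beta_true[[0, 5, 12]] = [2.0, -1.5, 1.0]
    y = X @ beta_true + 0.1 * rng.normal(size=60)
    beta_hat, active = lasso_working_set(X, y, lam=15.0)
    print("active set:", active)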

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and A̅^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size and which variable should enter the active set from the correlation with the residuals.
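LARS follows the exact piecewise-linear path with this active set bookkeeping; the toy sketch below (mine) only illustrates the idea in the special case of an orthonormal design, where the Lasso path has the closed form β_j(λ) = S(β_j^ls, λ/2) and its piecewise linearity in λ can be read directly from the output:

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    rng = np.random.default_rng(4)
    Q, _ = np.linalg.qr(rng.normal(size=(50, 5)))       # orthonormal design: Q'Q = I
    beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
    y = Q @ beta_true + 0.05 * rng.normal(size=50)
    beta_ls = Q.T @ y                                   # least squares coefficients

    lam_max = 2.0 * np.max(np.abs(beta_ls))             # smallest lam for which beta(lam) = 0
    for lam in np.linspace(lam_max, 0.0, 6):
        print(f"lam = {lam:5.2f}   beta =", np.round(soft_threshold(beta_ls, lam / 2.0), 3))

Variables enter the path one by one as λ decreases, the first one being the most correlated with y, which is precisely the intuition exploited by LARS; for a general design, the same behaviour requires the active set machinery described above.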

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz differentiable cost function J(β) and a non-differentiable penalty λP(β):

\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2   (2.12)

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \tfrac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta)   (2.13)

The basic algorithm makes use of the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
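Below is a minimal sketch of the basic proximal update (2.13) for the Lasso case, where the proximal operator of the L1 penalty is the soft-thresholding operator (the sketch and its data are my own; the accelerated variants of Nesterov (2007) and FISTA are not implemented):

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def ista_lasso(X, y, lam, n_iter=500):
        """Proximal gradient (ISTA) for min_beta ||y - X beta||_2^2 + lam ||beta||_1."""
        n, p = X.shape
        L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()           # Lipschitz constant of grad J
        beta = np.zeros(p)
        for _ in range(n_iter):
            grad = -2.0 * X.T @ (y - X @ beta)                # gradient step on J ...
            beta = soft_threshold(beta - grad / L, lam / L)   # ... then the prox of (lam/L)||.||_1
        return beta

    rng = np.random.default_rng(5)
    X = rng.normal(size=(60, 12))
    beta_true = np.zeros(12)
    beta_true[[1, 4]] = [1.5, -2.0]
    y = X @ beta_true + 0.1 * rng.normal(size=60)
    print(np.round(ista_lasso(X, y, lam=10.0), 3))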


    Part II

    Sparse Linear Discriminant Analysis


    Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced rank classification. The first two or three directions can also be used to project the data, to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables, and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and label y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n×K matrix Y = (y_1^T, ..., y_n^T)^T.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta}   (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class covariance and within-class covariance matrices respectively, defined (for a K-class problem) as

\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^\top

\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations for the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

\max_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; \frac{\mathrm{tr}\left(\mathbf{B}^\top \Sigma_B \mathbf{B}\right)}{\mathrm{tr}\left(\mathbf{B}^\top \Sigma_W \mathbf{B}\right)}   (3.2)

where the B matrix is built with the discriminant directions β_k as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

\begin{cases}
\max_{\beta_k \in \mathbb{R}^p} \; \beta_k^\top \Sigma_B \beta_k \\
\text{s.t. } \beta_k^\top \Sigma_W \beta_k \le 1, \\
\phantom{\text{s.t. }} \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k
\end{cases}   (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated to the kth largest eigenvalue (see Appendix C).
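As an illustration of this formulation (a sketch of mine, not the GLOSS algorithm developed later), the discriminant directions can be computed from the sample covariance matrices defined above as the leading eigenvectors of Σ_W^{-1} Σ_B; the small ridge term added to Σ_W is purely a numerical safeguard of my own, not part of the original formulation:

    import numpy as np

    def fisher_lda_directions(X, y, n_classes, eps=1e-8):
        n, p = X.shape
        mu = X.mean(axis=0)
        Sigma_W = np.zeros((p, p))
        Sigma_B = np.zeros((p, p))
        for k in range(n_classes):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n                 # within-class covariance
            Sigma_B += len(Xk) * np.outer(mu - mu_k, mu - mu_k) / n    # between-class covariance
        M = np.linalg.solve(Sigma_W + eps * np.eye(p), Sigma_B)        # Sigma_W^{-1} Sigma_B
        evals, evecs = np.linalg.eig(M)
        order = np.argsort(evals.real)[::-1]
        return evecs.real[:, order[:n_classes - 1]]                    # at most K-1 directions

    rng = np.random.default_rng(6)
    means = np.array([[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]], dtype=float)
    X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in means])
    y = np.repeat(np.arange(3), 30)
    B = fisher_lda_directions(X, y, n_classes=3)
    print(B.shape)          # (4, 2): two discriminant directions for K = 3 classes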

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity has as main target reducing the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity inducing extension, that is, either Fisher's Discriminant Analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance), and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of Sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for Sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

\begin{cases}
\min_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_W \beta \\
\text{s.t. } (\mu_1 - \mu_2)^\top \beta = 1, \\
\phantom{\text{s.t. }} \sum_{j=1}^{p} |\beta_j| \le t
\end{cases}

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

\[
\begin{aligned}
\max_{\beta_k\in\mathbb{R}^{p}} \quad & \beta_k^\top \Sigma_B^{k} \beta_k - P_k(\beta_k) \\
\text{s.t.} \quad & \beta_k^\top \Sigma_W \beta_k \le 1 .
\end{aligned}
\]

The term to maximize is the projected between-class covariance β_k^⊤Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k^⊤Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

\[
\begin{aligned}
\min_{\beta\in\mathbb{R}^{p}} \quad & \|\beta\|_1 \\
\text{s.t.} \quad & \left\| \Sigma\beta - (\mu_1-\mu_2) \right\|_\infty \le \lambda .
\end{aligned}
\]

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

    Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n × K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k, and y_ik = 0 otherwise) is commonly used for linking multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is y_ik = 1 if sample i belongs to class k, and y_ik = −1/(K−1) otherwise. It was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).
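As a short illustration (not part of the original text), both codings can be built in MATLAB from a label vector y with values in {1,…,K}; variable names are illustrative.

    % Build the two class indicator matrices mentioned above.
    n = numel(y);  K = max(y);
    Ydummy = zeros(n, K);
    Ydummy(sub2ind([n, K], (1:n)', y(:))) = 1;    % 0/1 dummy coding
    Ysym = -1/(K-1) * ones(n, K);                 % +1 / -1/(K-1) coding
    Ysym(sub2ind([n, K], (1:n)', y(:))) = 1;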

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to that of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

\[
\min_{\beta\in\mathbb{R}^{p},\,\beta_0\in\mathbb{R}} \; n^{-1}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top\beta\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| ,
\]

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest that any other generic sparsity-inducing penalty may be used. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting for λ = 0, but a different intercept β_0 is required.
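A rough MATLAB sketch of such a Lasso-penalized least squares classifier is given below; it uses lasso() from the Statistics and Machine Learning Toolbox and is only an illustration under assumed names (y01 holds 0/1 labels), not the authors' code.

    lambda = 0.1;                               % penalty level, to be tuned
    [beta, info] = lasso(X, double(y01), 'Lambda', lambda);
    beta0 = info.Intercept;
    scores = X * beta + beta0;
    yhat = scores > 0.5;                        % with 0/1 responses, 0.5 is the natural
                                                % cut; as noted above, the intercept or
                                                % threshold should be re-tuned in practice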

    Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables one to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\[
\begin{aligned}
\min_{\Theta,\,B} \quad & \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\left(B^\top\Omega B\right) & (3.4a)\\
\text{s.t.} \quad & n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} , & (3.4b)
\end{aligned}
\]

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

\[
\begin{aligned}
\min_{\theta_k\in\mathbb{R}^{K},\,\beta_k\in\mathbb{R}^{p}} \quad & \|Y\theta_k - X\beta_k\|^2 + \beta_k^\top\Omega\beta_k & (3.5a)\\
\text{s.t.} \quad & n^{-1}\,\theta_k^\top Y^\top Y\theta_k = 1 & (3.5b)\\
& \theta_k^\top Y^\top Y\theta_\ell = 0 , \quad \ell = 1,\dots,k-1 , & (3.5c)
\end{aligned}
\]

where each β_k corresponds to a discriminant direction.


Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\[
\min_{\beta_k\in\mathbb{R}^{p},\,\theta_k\in\mathbb{R}^{K}} \; \sum_{k} \|Y\theta_k - X\beta_k\|_2^2 + \lambda_1\|\beta_k\|_1 + \lambda_2\,\beta_k^\top\Omega\beta_k ,
\]

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

\[
\min_{\beta_k\in\mathbb{R}^{p},\,\theta_k\in\mathbb{R}^{K}} \; \sum_{k=1}^{K-1} \|Y\theta_k - X\beta_k\|_2^2 + \lambda\sum_{j=1}^{p}\sqrt{\sum_{k=1}^{K-1}\beta_{kj}^2} , \qquad (3.6)
\]

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.


    4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

- there is no empty class, that is, the diagonal matrix Y⊤Y is full rank;

- inputs are centered, that is, X⊤1_n = 0;

- the quadratic penalty Ω is positive semidefinite and such that X⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice versa. The problems are, however, non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraints (3.5c) inherently limit the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), so as to simplify all expressions. The generic problem solved is thus

\[
\begin{aligned}
\min_{\theta\in\mathbb{R}^{K},\,\beta\in\mathbb{R}^{p}} \quad & \|Y\theta - X\beta\|^2 + \beta^\top\Omega\beta & (4.1a)\\
\text{s.t.} \quad & n^{-1}\,\theta^\top Y^\top Y\theta = 1 . & (4.1b)
\end{aligned}
\]

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\[
\beta_{\mathrm{OS}} = \left(X^\top X + \Omega\right)^{-1} X^\top Y\theta . \qquad (4.2)
\]

The objective function (4.1a) is then

\[
\begin{aligned}
\|Y\theta - X\beta_{\mathrm{OS}}\|^2 + \beta_{\mathrm{OS}}^\top\Omega\beta_{\mathrm{OS}}
&= \theta^\top Y^\top Y\theta - 2\,\theta^\top Y^\top X\beta_{\mathrm{OS}} + \beta_{\mathrm{OS}}^\top\left(X^\top X + \Omega\right)\beta_{\mathrm{OS}}\\
&= \theta^\top Y^\top Y\theta - \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta ,
\end{aligned}
\]

where the second line stems from the definition of β_OS (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\[
\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta , \qquad (4.3)
\]

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

\[
\left(Y^\top Y\right)^{-1} Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta = \alpha^2\theta , \qquad (4.4)
\]

where α² is the maximal eigenvalue¹:

\[
\begin{aligned}
n^{-1}\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2\, n^{-1}\theta^\top Y^\top Y\theta\\
n^{-1}\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2 . \qquad (4.5)
\end{aligned}
\]

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

\[
\begin{aligned}
\max_{\theta\in\mathbb{R}^{K},\,\beta\in\mathbb{R}^{p}} \quad & n^{-1}\theta^\top Y^\top X\beta & (4.6a)\\
\text{s.t.} \quad & n^{-1}\,\theta^\top Y^\top Y\theta = 1 & (4.6b)\\
& n^{-1}\,\beta^\top\left(X^\top X + \Omega\right)\beta = 1 . & (4.6c)
\end{aligned}
\]

The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:

\[
\begin{aligned}
nL(\beta,\theta,\nu,\gamma) &= \theta^\top Y^\top X\beta - \nu\left(\theta^\top Y^\top Y\theta - n\right) - \gamma\left(\beta^\top(X^\top X + \Omega)\beta - n\right)\\
\Rightarrow\; n\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\beta} &= X^\top Y\theta - 2\gamma(X^\top X + \Omega)\beta\\
\Rightarrow\; \beta_{\mathrm{CCA}} &= \frac{1}{2\gamma}(X^\top X + \Omega)^{-1}X^\top Y\theta .
\end{aligned}
\]

Then, as β_CCA obeys (4.6c), we obtain

\[
\beta_{\mathrm{CCA}} = \frac{(X^\top X + \Omega)^{-1}X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta}} , \qquad (4.7)
\]

so that the optimal objective function (4.6a) can be expressed with θ alone:

\[
n^{-1}\theta^\top Y^\top X\beta_{\mathrm{CCA}}
= \frac{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta}}
= \sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta} ,
\]

and the optimization problem with respect to θ can be restated as

\[
\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1}X^\top Y\theta . \qquad (4.8)
\]

Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\[
\beta_{\mathrm{OS}} = \alpha\,\beta_{\mathrm{CCA}} , \qquad (4.9)
\]

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).

where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\[
\begin{aligned}
n\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\theta} &= Y^\top X\beta - 2\nu\,Y^\top Y\theta\\
\Rightarrow\; \theta_{\mathrm{CCA}} &= \frac{1}{2\nu}(Y^\top Y)^{-1}Y^\top X\beta . \qquad (4.10)
\end{aligned}
\]

Then, as θ_CCA obeys (4.6b), we obtain

\[
\theta_{\mathrm{CCA}} = \frac{(Y^\top Y)^{-1}Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}} , \qquad (4.11)
\]

leading to the following expression of the optimal objective function:

\[
n^{-1}\theta_{\mathrm{CCA}}^\top Y^\top X\beta
= \frac{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}}
= \sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta} .
\]

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

\[
\begin{aligned}
\max_{\beta\in\mathbb{R}^{p}} \quad & n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta & (4.12a)\\
\text{s.t.} \quad & n^{-1}\,\beta^\top\left(X^\top X + \Omega\right)\beta = 1 , & (4.12b)
\end{aligned}
\]

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution; following the reasoning of Appendix C, β_CCA verifies

\[
n^{-1}X^\top Y(Y^\top Y)^{-1}Y^\top X\beta_{\mathrm{CCA}} = \lambda\left(X^\top X + \Omega\right)\beta_{\mathrm{CCA}} , \qquad (4.13)
\]

where λ is the maximal eigenvalue, shown below to be equal to α²:

\[
\begin{aligned}
& n^{-1}\beta_{\mathrm{CCA}}^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta_{\mathrm{CCA}} = \lambda\\
\Rightarrow\;& n^{-1}\alpha^{-1}\beta_{\mathrm{CCA}}^\top X^\top Y(Y^\top Y)^{-1}Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta = \lambda\\
\Rightarrow\;& n^{-1}\alpha\,\beta_{\mathrm{CCA}}^\top X^\top Y\theta = \lambda\\
\Rightarrow\;& n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta = \lambda\\
\Rightarrow\;& \alpha^2 = \lambda .
\end{aligned}
\]

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), where the denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), penalized Linear Discriminant Analysis is defined as follows:

\[
\begin{aligned}
\max_{\beta\in\mathbb{R}^{p}} \quad & \beta^\top\Sigma_B\beta & (4.14a)\\
\text{s.t.} \quad & \beta^\top\left(\Sigma_W + n^{-1}\Omega\right)\beta = 1 , & (4.14b)
\end{aligned}
\]

where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y(Y⊤Y)^{-1}Y⊤:

\[
\begin{aligned}
\Sigma_T &= \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = n^{-1}X^\top X\\
\Sigma_B &= \frac{1}{n}\sum_{k=1}^{K} n_k\,\mu_k\mu_k^\top = n^{-1}X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\\
\Sigma_W &= \frac{1}{n}\sum_{k=1}^{K}\sum_{i:\,y_{ik}=1}\left(x_i-\mu_k\right)\left(x_i-\mu_k\right)^\top
= n^{-1}\left(X^\top X - X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\right) .
\end{aligned}
\]

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\[
\begin{aligned}
X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\beta_{\mathrm{LDA}} &= \lambda\left(X^\top X + \Omega - X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\right)\beta_{\mathrm{LDA}}\\
X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\beta_{\mathrm{LDA}} &= \frac{\lambda}{1-\lambda}\left(X^\top X + \Omega\right)\beta_{\mathrm{LDA}} .
\end{aligned}
\]

The comparison of the last equation with β_CCA (4.13) shows that β_LDA and β_CCA are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

\[
\beta_{\mathrm{LDA}} = (1-\alpha^2)^{-1/2}\,\beta_{\mathrm{CCA}} = \alpha^{-1}(1-\alpha^2)^{-1/2}\,\beta_{\mathrm{OS}} ,
\]

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

\[
\begin{aligned}
\min_{\Theta,\,B} \quad & \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\left(B^\top\Omega B\right)\\
\text{s.t.} \quad & n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} .
\end{aligned}
\]

Let A represent the (K − 1) × (K − 1) diagonal matrix with elements α_k, the square root of the kth largest eigenvalue of Y⊤X(X⊤X + Ω)^{-1}X⊤Y; we have

\[
B_{\mathrm{LDA}} = B_{\mathrm{CCA}}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} = B_{\mathrm{OS}}\,A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} , \qquad (4.15)
\]

where I_{K−1} is the (K − 1) × (K − 1) identity matrix.

At this point, the feature matrix X, which in the input space has dimensions n × p, can be projected into the optimal scoring domain as an n × (K − 1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n × (K − 1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows (a code sketch is given after this list):

1. Solve the p-OS problem as B_OS = (X⊤X + λΩ)^{-1} X⊤YΘ, where Θ are the K − 1 leading eigenvectors of Y⊤X(X⊤X + λΩ)^{-1}X⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1}(I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, produce a graphical representation.
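The following MATLAB sketch transcribes steps 1 and 2 for a quadratic penalty Ω and a given λ, with X centered and Y an n × K indicator matrix; it is an illustration under assumed names, not the GLOSS package itself (the group-Lasso case is handled later through an adaptive Ω).

    [n, K] = size(Y);
    U = null(ones(1, K));                         % K x (K-1), orthonormal, orthogonal to 1_K
    Theta0 = sqrt(n) * (sqrtm(Y' * Y) \ U);       % n^-1 * Theta0'*Y'*Y*Theta0 = I_{K-1}
    B0 = (X' * X + lambda * Omega) \ (X' * Y * Theta0);   % regression part of step 1
    M = Theta0' * (Y' * X) * B0;                  % small (K-1) x (K-1) symmetric matrix
    [V, S] = eig((M + M') / 2);
    [s, order] = sort(diag(S), 'descend');
    V = V(:, order);
    Theta = Theta0 * V;                           % optimal scores
    BOS = B0 * V;                                 % p-OS regression coefficients
    alpha = sqrt(s / n);                          % diagonal of A, see (4.5)
    D = diag(1 ./ (alpha .* sqrt(1 - alpha.^2))); % step 2: OS -> LDA scaling
    XLDA = X * BOS * D;                           % discriminant variates

The eigen-analysis is deliberately performed on the small (K − 1) × (K − 1) matrix, a point developed in Section 4.2.1.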


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

\[
\begin{aligned}
\min_{\Theta\in\mathbb{R}^{K\times(K-1)},\,B\in\mathbb{R}^{p\times(K-1)}} \quad & \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\left(B^\top\Omega B\right) & (4.16a)\\
\text{s.t.} \quad & n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} , & (4.16b)
\end{aligned}
\]

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}.

2. Compute B = (X⊤X + λΩ)^{-1} X⊤YΘ⁰.

3. Set Θ to be the K − 1 leading eigenvectors of Y⊤X(X⊤X + λΩ)^{-1}X⊤Y.

4. Compute the optimal regression coefficients

\[
B_{\mathrm{OS}} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta . \qquad (4.17)
\]

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰⊤Y⊤X(X⊤X + λΩ)^{-1}X⊤YΘ⁰, which is computed as Θ⁰⊤Y⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form tr(B⊤ΩB). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (the sample estimators of μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

\[
d(x_i,\mu_k) = (x_i-\mu_k)^\top\Sigma_{W\Omega}^{-1}(x_i-\mu_k) - 2\log\left(\frac{n_k}{n}\right) \qquad (4.18)
\]

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

\[
\begin{aligned}
\Sigma_{W\Omega}^{-1} &= \left(n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B\right)^{-1}\\
&= \left(n^{-1}X^\top X - \Sigma_B + n^{-1}\lambda\Omega\right)^{-1}\\
&= \left(\Sigma_W + n^{-1}\lambda\Omega\right)^{-1} . \qquad (4.19)
\end{aligned}
\]

Before explaining how to compute the distances, let us summarize some clarifying points (a code sketch follows this list):

- The solution B_OS of the p-OS problem is enough to accomplish classification.

- In the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances.

- Classification can be done in a reduced-rank space of dimension R < K − 1 by using the first R discriminant directions {β_k}_{k=1}^{R}.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

\[
\left\|(x_i-\mu_k)\,B_{\mathrm{OS}}\right\|^2_{\Sigma_{W\Omega}} - 2\log(\pi_k) ,
\]

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is

\[
\left\|(x_i-\mu_k)\,B_{\mathrm{OS}}\,A^{-1}\left(I_{K-1}-A^2\right)^{-\frac{1}{2}}\right\|_2^2 - 2\log(\pi_k) ,
\]

which is a plain Euclidean distance.
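A minimal MATLAB sketch of the nearest-centroid rule in the LDA domain is given below; it reuses XLDA, Y and n from the earlier pipeline sketch, and all names are illustrative.

    prior = sum(Y, 1) / n;                       % class proportions n_k / n
    M = (Y' * Y) \ (Y' * XLDA);                  % K x (K-1) class centroids
    d = zeros(n, K);
    for k = 1:K
        diff = XLDA - repmat(M(k, :), n, 1);     % plain Euclidean distance here
        d(:, k) = sum(diff.^2, 2) - 2 * log(prior(k));
    end
    [~, yhat] = min(d, [], 2);                   % maximum a posteriori assignment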


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k, defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

\[
\hat{p}(y_k=1|x) \;\propto\; \exp\left(-\frac{d(x,\mu_k)}{2}\right)
\;\propto\; \pi_k \exp\left(-\frac{1}{2}\left\|(x-\mu_k)\,B_{\mathrm{OS}}\,A^{-1}\left(I_{K-1}-A^2\right)^{-\frac{1}{2}}\right\|_2^2\right) . \qquad (4.20)
\]

These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

\[
\hat{p}(y_k=1|x)
= \frac{\pi_k\exp\left(-\frac{d(x,\mu_k)}{2}\right)}{\sum_{\ell}\pi_\ell\exp\left(-\frac{d(x,\mu_\ell)}{2}\right)}
= \frac{\pi_k\exp\left(\frac{-d(x,\mu_k)+d_{\max}}{2}\right)}{\sum_{\ell}\pi_\ell\exp\left(\frac{-d(x,\mu_\ell)+d_{\max}}{2}\right)} ,
\]

where d_max = max_k d(x, μ_k).
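The shift trick translates directly into MATLAB; in this sketch, d is the n × K distance matrix and prior the 1 × K class proportions from the previous sketch (names are illustrative).

    dmax = max(d, [], 2);                        % per-sample largest distance
    W = exp((-d + repmat(dmax, 1, K)) / 2);      % shifted exponentials, no underflow
    W = W .* repmat(prior, n, 1);
    post = W ./ repmat(sum(W, 2), 1, K);         % normalized posterior probabilities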

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or the discriminant variates X_LDA, depending on whether we are presenting the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β⊤Ωβ, under the assumption that Y⊤Y and X⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:

\[
\begin{aligned}
\min_{\tau\in\mathbb{R}^{p}}\;\min_{B\in\mathbb{R}^{p\times(K-1)}} \quad & J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^{j}\|_2^2}{\tau_j} & (4.21a)\\
\text{s.t.} \quad & \sum_{j}\tau_j - \sum_{j} w_j\|\beta^{j}\|_2 \le 0 & (4.21b)\\
& \tau_j \ge 0 , \quad j = 1,\dots,p , & (4.21c)
\end{aligned}
\]

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})⊤, and the w_j are predefined nonnegative weights. In our context, the cost function J(B) is the OS regression ‖YΘ − XB‖²_2; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero: b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic penalty into the convex hull of a family of quadratic penalties defined by the variables τ_j. This is shown graphically in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^{p} w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is

\[
L = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^{j}\|_2^2}{\tau_j}
+ \nu_0\left(\sum_{j=1}^{p}\tau_j - \sum_{j=1}^{p} w_j\|\beta^{j}\|_2\right) - \sum_{j=1}^{p}\nu_j\tau_j .
\]


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j are

\[
\begin{aligned}
\frac{\partial L}{\partial\tau_j}(\tau_j^\star) = 0
&\;\Leftrightarrow\; -\lambda w_j^2\,\frac{\|\beta^{j}\|_2^2}{\tau_j^{\star\,2}} + \nu_0 - \nu_j = 0\\
&\;\Leftrightarrow\; -\lambda w_j^2\|\beta^{j}\|_2^2 + \nu_0\,\tau_j^{\star\,2} - \nu_j\,\tau_j^{\star\,2} = 0\\
&\;\Rightarrow\; -\lambda w_j^2\|\beta^{j}\|_2^2 + \nu_0\,\tau_j^{\star\,2} = 0 .
\end{aligned}
\]

The last line is obtained from complementary slackness, which implies here ν_j τ_j^⋆ = 0 (complementary slackness states that ν_j g_j(τ_j^⋆) = 0, where ν_j is the Lagrange multiplier for constraint g_j(τ_j) ≤ 0). As a result, the optimal value of τ_j is

\[
\tau_j^\star = \sqrt{\frac{\lambda w_j^2\|\beta^{j}\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^{j}\|_2 . \qquad (4.22)
\]

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is saturated (due to complementary slackness):

\[
\sum_{j=1}^{p}\tau_j^\star - \sum_{j=1}^{p} w_j\|\beta^{j}\|_2 = 0 , \qquad (4.23)
\]

so that τ_j^⋆ = w_j‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

\[
\min_{B\in\mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda\sum_{j=1}^{p} w_j\|\beta^{j}\|_2 . \qquad (4.24)
\]

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ tr(B⊤ΩB), where

\[
\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\dots,\frac{w_p^2}{\tau_p}\right) , \qquad (4.25)
\]

with τ_j = w_j‖β^j‖_2, resulting in the diagonal components

\[
(\Omega)_{jj} = \frac{w_j}{\|\beta^{j}\|_2} . \qquad (4.26)
\]
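As a rough illustration (not the GLOSS implementation itself), the reweighting implied by (4.25)–(4.26) amounts to a few MATLAB lines; w, lambda and Theta0 are assumed as in the earlier sketches, and eps guards the division for vanishing rows.

    rownorm = sqrt(sum(B.^2, 2));               % ||beta^j||_2 for each row of B
    Omega = diag(w ./ max(rownorm, eps));       % (Omega)_jj = w_j / ||beta^j||_2
    % one reweighted penalized least squares pass:
    B = (X' * X + lambda * Omega) \ (X' * Y * Theta0);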

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2/τ, known as the perspective function of f(β) = ‖β‖²_2, is jointly convex in (β, τ) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

\[
\left\{ V\in\mathbb{R}^{p\times(K-1)} :\; V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} , \qquad (4.27)
\]

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})⊤, defined as follows. Let S(B) denote the row support of B, S(B) = {j ∈ {1,…,p} : ‖β^j‖_2 ≠ 0}; then we have

\[
\forall j\in S(B),\quad g^{j} = w_j\|\beta^{j}\|_2^{-1}\beta^{j} , \qquad (4.28)
\]
\[
\forall j\notin S(B),\quad \|g^{j}\|_2 \le w_j . \qquad (4.29)
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is

\[
\frac{\partial}{\partial\beta^{j}}\left(\lambda\sum_{m=1}^{p} w_m\|\beta^{m}\|_2\right) = \lambda w_j\,\frac{\beta^{j}}{\|\beta^{j}\|_2} . \qquad (4.30)
\]

At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

\[
\partial_{\beta^{j}}\left(\lambda\sum_{m=1}^{p} w_m\|\beta^{m}\|_2\right)
= \partial_{\beta^{j}}\left(\lambda w_j\|\beta^{j}\|_2\right)
= \left\{\lambda w_j v \in\mathbb{R}^{K-1} :\; \|v\|_2 \le 1\right\} . \qquad (4.31)
\]

This gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

\[
\forall j\in S,\quad \frac{\partial J(B)}{\partial\beta^{j}} + \lambda w_j\|\beta^{j}\|_2^{-1}\beta^{j} = 0 , \qquad (4.32a)
\]
\[
\forall j\notin S,\quad \left\|\frac{\partial J(B)}{\partial\beta^{j}}\right\|_2 \le \lambda w_j , \qquad (4.32b)
\]

where S ⊆ {1,…,p} denotes the set of non-zero row vectors β^j, and its complement gathers the remaining indices.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

\[
\begin{aligned}
B_{\mathrm{OS}} = \operatorname*{argmin}_{B\in\mathbb{R}^{p\times(K-1)}}\;\min_{\Theta\in\mathbb{R}^{K\times(K-1)}} \quad & \frac{1}{2}\|Y\Theta - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^{j}\|_2\\
\text{s.t.} \quad & n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1}
\end{aligned}
\]

is equivalent to the penalized LDA problem

\[
\begin{aligned}
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B\in\mathbb{R}^{p\times(K-1)}} \quad & \mathrm{tr}\left(B^\top\Sigma_B B\right)\\
\text{s.t.} \quad & B^\top\left(\Sigma_W + n^{-1}\lambda\Omega\right)B = I_{K-1} ,
\end{aligned}
\]

where

\[
\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\dots,\frac{w_p^2}{\tau_p}\right) ,
\quad\text{with}\quad
\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^{j}_{\mathrm{OS}} = 0\\ w_j\,\|\beta^{j}_{\mathrm{OS}}\|_2^{-1} & \text{otherwise.} \end{cases} \qquad (4.33)
\]

That is, B_LDA = B_OS diag(α_k^{-1}(1 − α_k²)^{-1/2}), where α_k ∈ (0, 1) is the kth leading eigenvalue of

\[
n^{-1}\,Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y .
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note, however, that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B⊤ΩB).


    5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½‖YΘ − XB‖²_2.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations of the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K − 1) similar systems

\[
\left(X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega\right)\beta_k = X_{\mathcal{A}}^\top Y\theta_k^{0} , \qquad (5.1)
\]


Figure 5.1: GLOSS block diagram. (Flow chart: initialize the model with λ and B; form the active set {j : ‖β^j‖_2 > 0}; solve the p-OS problem so that B satisfies the first optimality condition; move to the inactive set any active variable that must leave it; test the second optimality condition on the inactive set and activate the worst violator, if any; once no move is required, compute Θ, update B, and stop.)


Algorithm 1 Adaptively Penalized Optimal Scoring

    Input: X, Y, B, λ
    Initialize: A ← {j ∈ {1,…,p} : ‖β^j‖_2 > 0};
                Θ⁰ such that n^{-1} Θ⁰⊤Y⊤YΘ⁰ = I_{K−1};
                convergence ← false
    repeat
        % Step 1: solve (4.21) in B, assuming A optimal
        repeat
            Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
            B_A ← (X_A⊤X_A + λΩ)^{-1} X_A⊤YΘ⁰
        until condition (4.32a) holds for all j ∈ A
        % Step 2: identify inactivated variables
        for j ∈ A such that ‖β^j‖_2 = 0 do
            if optimality condition (4.32b) holds then
                A ← A \ {j};  go back to Step 1
            end if
        end for
        % Step 3: check the greatest violation of optimality condition (4.32b) outside A
        ĵ ← argmax_{j ∉ A} ‖∂J/∂β^j‖_2
        if ‖∂J/∂β^ĵ‖_2 < λ then
            convergence ← true   % B is optimal
        else
            A ← A ∪ {ĵ}
        end if
    until convergence
    (s, V) ← eigenanalyze(Θ⁰⊤Y⊤X_A B), that is, Θ⁰⊤Y⊤X_A B V_k = s_k V_k, k = 1,…,K−1
    Θ ← Θ⁰V;  B ← BV;  α_k ← n^{-1/2} s_k^{1/2}, k = 1,…,K−1
    Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

\[
\left(X^\top X + \lambda\Omega\right)B = X^\top Y\Theta . \qquad (5.2)
\]

Defining the Cholesky decomposition C⊤C = (X⊤X + λΩ), (5.2) is solved efficiently as follows:

\[
\begin{aligned}
C^\top C\,B &= X^\top Y\Theta\\
C\,B &= C^\top\backslash\, X^\top Y\Theta\\
B &= C\backslash\left(C^\top\backslash\, X^\top Y\Theta\right) , \qquad (5.3)
\end{aligned}
\]

where the symbol "\" is the MATLAB mldivide operator, which solves linear systems efficiently (here by triangular backsubstitution). The GLOSS code implements (5.3).

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

\[
B = \Omega^{-1/2}\left(\Omega^{-1/2}X^\top X\Omega^{-1/2} + \lambda I\right)^{-1}\Omega^{-1/2}X^\top Y\Theta^{0} , \qquad (5.4)
\]

where the conditioning of Ω^{-1/2}X⊤XΩ^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values; our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. This eigen-analysis is actually solved in the form Θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X⊤X + Ω)^{-1}, which involves the inversion of a p × p matrix. Let Θ⁰ be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B⁰ = (X⊤X + λΩ)^{-1}X⊤YΘ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as

\[
\Theta^{0\top}Y^\top X\left(X^\top X + \Omega\right)^{-1}X^\top Y\Theta^{0} = \Theta^{0\top}Y^\top X B^{0} .
\]

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix Θ⁰⊤Y⊤XB⁰ = VΛV⊤. Defining Θ = Θ⁰V, we have Θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤YΘ = Λ, and when Θ⁰ is chosen such that n^{-1}Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}, we also have n^{-1}Θ⊤Y⊤YΘ = I_{K−1}, so the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
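A compact MATLAB transcription of this trick (names as in the earlier sketches; an illustration, not the package code) could read:

    B0 = (X' * X + lambda * Omega) \ (X' * Y * Theta0);  % K-1 solves of (5.2)
    M  = Theta0' * (Y' * X) * B0;                        % small (K-1) x (K-1) matrix
    [V, L] = eig((M + M') / 2);                          % no large matrix is inverted
    [~, order] = sort(diag(L), 'descend');
    V = V(:, order);
    Theta = Theta0 * V;                                  % optimal scores
    B = B0 * V;                                          % mapped regression coefficients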

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

\[
\frac{1}{2}\|Y\Theta - XB\|_2^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^{j}\|_2 . \qquad (5.5)
\]

Let J(B) be the data-fitting term ½‖YΘ − XB‖²_2. Its gradient with respect to the jth row of B, β^j, is the (K − 1)-dimensional vector

\[
\frac{\partial J(B)}{\partial\beta^{j}} = x_j^\top\left(XB - Y\Theta\right) ,
\]

where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as

\[
x_j^\top\left(XB - Y\Theta\right) + \lambda w_j\,\frac{\beta^{j}}{\|\beta^{j}\|_2} = 0 .
\]

¹As X is centered, 1_K belongs to the null space of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set Θ⁰ = (Y⊤Y)^{-1/2}U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.

The second optimality condition (4.32b) can be computed for every variable j as

\[
\left\| x_j^\top\left(XB - Y\Theta\right) \right\|_2 \le \lambda w_j .
\]

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

\[
j^{\star} = \operatorname*{argmax}_{j}\;\max\left(\left\| x_j^\top\left(XB - Y\Theta\right)\right\|_2 - \lambda w_j ,\; 0\right) .
\]

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:

\[
\left\| x_j^\top\left(XB - Y\Theta\right) \right\|_2 \le \lambda w_j .
\]

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
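A minimal MATLAB sketch of this bookkeeping is given below, assuming a logical p × 1 vector `active`, weights w and a penalty lambda (illustrative names, not the GLOSS code).

    R = X * B - Y * Theta;                       % current residuals
    grad_norm = sqrt(sum((X' * R).^2, 2));       % ||x_j'(XB - Y*Theta)||_2, j = 1..p
    violation = max(grad_norm - lambda * w, 0);
    violation(active) = 0;                       % only inspect the inactive set
    [vmax, jstar] = max(violation);
    if vmax > 0
        active(jstar) = true;                    % include the worst violator
    end
    % an active variable j with ||beta^j||_2 ~ 0 is removed when, with beta^j set
    % to zero, grad_norm(j) <= lambda * w(j) still holds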

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λ_max, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

\[
\lambda_{\max} = \max_{j\in\{1,\dots,p\}}\;\frac{1}{w_j}\left\| x_j^\top Y\Theta^{0}\right\|_2 .
\]

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ··· > λ_t > ··· > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t/2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
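A schematic MATLAB rendering of this path strategy is shown below; it is illustrative only (the inner solver is left as a placeholder), with p, K, w and Theta0 assumed as in the earlier sketches.

    lambda_max = max(sqrt(sum((X' * (Y * Theta0)).^2, 2)) ./ w);  % (4.32b) at B = 0
    T = 10;                                  % number of path points, arbitrary here
    lambdas = lambda_max ./ 2.^(0:T-1);      % lambda_{t+1} = lambda_t / 2
    B = zeros(p, K-1);                       % warm start for the first penalty value
    for t = 1:T
        % solve the group-Lasso p-OS problem at lambdas(t), warm-started with B,
        % and stop early once the desired number of active variables is reached
    end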


5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some MATLAB commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

\[
\min_{B\in\mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2
= \min_{B\in\mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left(\Theta^\top Y^\top Y\Theta - 2\,\Theta^\top Y^\top XB + n\,B^\top\Sigma_T B\right)
\]

are replaced by

\[
\min_{B\in\mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left(\Theta^\top Y^\top Y\Theta - 2\,\Theta^\top Y^\top XB + n\,B^\top\left(\Sigma_B + \mathrm{diag}\left(\Sigma_W\right)\right)B\right) .
\]

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image (the 9 pixels form an 8-connected grid graph; the Laplacian Ω_L carries the node degrees on its diagonal and a −1 entry for each pair of neighboring pixels).

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of pixels in a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semidefinite, and the penalty β⊤Ω_Lβ favors, among vectors of identical L2 norms, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, which is the indicator of the neighbors of pixel 1, and it is 17 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, with a sign mismatch between pixel 1 and its neighborhood.
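The Laplacian of such a pixel grid is easy to build; the following MATLAB sketch (illustrative, not the GLOSS code) assumes the 8-connected neighborhood suggested by the degrees in Figure 5.2.

    r = 3; c = 3; p = r * c;
    [I, J] = ndgrid(1:r, 1:c);
    A = zeros(p, p);
    for a = 1:p
        for b = a+1:p
            if max(abs(I(a) - I(b)), abs(J(a) - J(b))) == 1   % neighboring pixels
                A(a, b) = 1; A(b, a) = 1;
            end
        end
    end
    OmegaL = diag(sum(A, 2)) - A;    % graph Laplacian: degree matrix minus adjacency
    % beta' * OmegaL * beta is small when neighboring coefficients are similar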

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.


    6 Experimental Results

This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an elastic net penalty to the OS problem. With the aim of testing its parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites: PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from those of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T to ones, or the diagonal of the within-class covariance matrix Σ_W to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is given below.

Simulation 1: mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_{1j} = 0.7 × 1_{(1≤j≤25)}, μ_{2j} = 0.7 × 1_{(26≤j≤50)}, μ_{3j} = 0.7 × 1_{(51≤j≤75)}, μ_{4j} = 0.7 × 1_{(76≤j≤100)}.

Simulation 2: mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j≤200)}. The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: one-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_{1j} ∼ N(0, 0.3²) for j ≤ 25 and μ_{1j} = 0 otherwise; μ_{2j} ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_{2j} = 0 otherwise; μ_{3j} ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_{3j} = 0 otherwise; μ_{4j} ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_{4j} = 0 otherwise.
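For concreteness, a minimal MATLAB sketch generating one draw of Simulation 1 is shown below (illustrative only; the class sizes are balanced in expectation rather than exactly).

    n = 100; p = 500; K = 4;
    y = randi(K, n, 1);                          % class labels
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25 + (1:25)) = 0.7;          % 25 shifted features per class
    end
    X = mu(y, :) + randn(n, p);                  % x_i ~ N(mu_k, I)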

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

    The results are summarized in Table 61 Overall the best predictions are performedby PLDA and GLOS-D that both benefit of the knowledge of the true within-classcovariance structure Then among SLDA and GLOSS that both ignore this structureour proposal has a clear edge The error rates are far away from the Bayesrsquo error ratesbut the sample size is small with regard to the number of relevant variables Regardingsparsity the clear overall winner is GLOSS followed far away by SLDA which is the only

    58

    63 Simulated Data

    Table 61 Experimental results for simulated data averages with standard deviationscomputed over 25 repetitions of the test error rate the number of selectedvariables and the number of discriminant directions selected on the validationset

    Err () Var Dir

    Sim 1 K = 4 mean shift ind features

    PLDA 126 (01) 4117 (37) 30 (00)SLDA 319 (01) 2280 (02) 30 (00)GLOSS 199 (01) 1064 (13) 30 (00)GLOSS-D 112 (01) 2511 (41) 30 (00)

    Sim 2 K = 2 mean shift dependent features

    PLDA 90 (04) 3376 (57) 10 (00)SLDA 193 (01) 990 (00) 10 (00)GLOSS 154 (01) 398 (08) 10 (00)GLOSS-D 90 (00) 2035 (40) 10 (00)

    Sim 3 K = 4 1D mean shift ind features

    PLDA 138 (06) 1615 (37) 10 (00)SLDA 578 (02) 1526 (20) 19 (00)GLOSS 312 (01) 1238 (18) 10 (00)GLOSS-D 185 (01) 3575 (28) 10 (00)

    Sim 4 K = 4 mean shift ind features

    PLDA 603 (01) 3360 (58) 30 (00)SLDA 659 (01) 2088 (16) 27 (00)GLOSS 607 (02) 743 (22) 27 (00)GLOSS-D 588 (01) 1627 (49) 29 (00)

    59

    6 Experimental Results

    0 10 20 30 40 50 60 70 8020

    30

    40

    50

    60

    70

    80

    90

    100TPR Vs FPR

    gloss

    glossd

    slda

    plda

    Simulation1

    Simulation2

    Simulation3

    Simulation4

    Figure 61 TPR versus FPR (in ) for all algorithms and simulations

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1     Simulation 2     Simulation 3     Simulation 4
             TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
  PLDA       99.0    78.2     96.9    60.3     98.0    15.9     74.3    65.6
  SLDA       73.9    38.5     33.8    16.3     41.6    27.8     50.7    39.5
  GLOSS      64.1    10.6     30.0     4.6     51.1    18.2     26.0    12.1
  GLOSS-D    93.5    39.4     92.1    28.1     95.6    65.5     42.9    29.9

SLDA is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of truly relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
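As an illustration, the two rates can be computed from the set of selected variables as in the following sketch (the function name and arguments are ours; in the simulations above, the first 100 variables are the relevant ones).

```python
import numpy as np

def tpr_fpr(selected, relevant, p):
    """TPR: fraction of relevant variables that are selected.
       FPR: fraction of irrelevant variables that are selected."""
    sel = np.zeros(p, dtype=bool); sel[np.asarray(selected, dtype=int)] = True
    rel = np.zeros(p, dtype=bool); rel[np.asarray(relevant, dtype=int)] = True
    return sel[rel].mean(), sel[~rel].mean()

# Example: a method selecting variables 0..149 out of p = 500,
# with variables 0..99 truly relevant, gets TPR = 1.0 and FPR = 0.125.
tpr, fpr = tpr_fpr(selected=np.arange(150), relevant=np.arange(100), p=500)
```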

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama2 dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011).

2. http://www.broadinstitute.org/cancer/software/genepattern/datasets
3. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and the number of selected variables.

                   Err (%)        Var
  Nakayama, n = 86, p = 22,283, K = 5
    PLDA           20.95 (1.3)    10478.7 (2116.3)
    SLDA           25.71 (1.7)      252.5 (3.1)
    GLOSS          20.48 (1.4)      129.0 (18.6)
  Ramaswamy, n = 198, p = 16,063, K = 14
    PLDA           38.36 (6.0)    14873.5 (720.3)
    SLDA             -                 -
    GLOSS          20.61 (6.9)      372.4 (122.1)
  Sun, n = 180, p = 54,613, K = 4
    PLDA           33.78 (5.9)    21634.8 (7443.2)
    SLDA           36.22 (6.5)      384.4 (16.5)
    GLOSS          31.77 (4.5)       93.0 (93.6)

The Ramaswamy3 dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun4 dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.
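The evaluation protocol can be summarized by the following pseudo-implementation (fit_gloss and error_rate are placeholders for the actual estimator and error computation, not functions from an existing package; the use of scikit-learn utilities is our choice for the sketch).

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

def evaluate(X, y, lambdas, fit_gloss, error_rate, n_repeats=10, seed=0):
    """Repeated 75/25 splits; the penalty is tuned by 10-fold CV on the training part."""
    errors = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, stratify=y, random_state=seed + r)
        cv_err = np.zeros(len(lambdas))
        for tr_idx, va_idx in KFold(n_splits=10, shuffle=True,
                                    random_state=seed + r).split(X_tr):
            for i, lam in enumerate(lambdas):
                model = fit_gloss(X_tr[tr_idx], y_tr[tr_idx], lam)
                cv_err[i] += error_rate(model, X_tr[va_idx], y_tr[va_idx])
        best_lam = lambdas[int(np.argmin(cv_err))]
        model = fit_gloss(X_tr, y_tr, best_lam)
        errors.append(error_rate(model, X_te, y_te))
    return np.mean(errors), np.std(errors)
```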

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

4. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure: four scatter plots in the plane of the first two discriminant directions (rows: Nakayama, Sun; columns: GLOSS, SLDA); axes are the 1st and 2nd discriminant. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships from Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.
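A possible construction of such a penalty matrix is sketched below, assuming a 4-neighbor proximity on the 16 × 16 pixel grid (this is our illustration of the idea; the exact neighborhood encoded in Figure 5.2 may differ).

```python
import numpy as np

def grid_laplacian(h=16, w=16):
    """Graph Laplacian of the 4-neighbor pixel grid (h*w nodes)."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in ((0, 1), (1, 0)):       # right and bottom neighbors
                rr, cc = r + dr, c + dc
                if rr < h and cc < w:
                    j = rr * w + cc
                    A[i, j] = A[j, i] = 1.0
    D = np.diag(A.sum(axis=1))
    return D - A                                  # Laplacian L = D - A

Omega_L = grid_laplacian()                        # 256 x 256 penalty matrix
```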

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow to detect strokes, and will probably provide better prediction results.


[Figure: β for GLOSS (left) and β for S-GLOSS (right).]

Figure 6.4: Discriminant direction between digits "1" and "0".

[Figure: β for GLOSS with λ = 0.3 (left) and β for S-GLOSS with λ = 0.3 (right).]

Figure 6.5: Sparse discriminant direction between digits "1" and "0".


    Discussion

GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, up to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, either regarding its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data, from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


    Part III

    Sparse Clustering Analysis


    Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified with different means and a common covariance matrix.

As in the supervised framework, the traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other model-based sparse clustering mechanisms at the state of the art in Chapter 10.


    7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^⊤, ..., x_n^⊤)^⊤ have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

    f(x_i) = \sum_{k=1}^{K} π_k f_k(x_i),   ∀ i ∈ {1, ..., n},

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ ∀k and \sum_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, ..., π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

    f(x_i; θ) = \sum_{k=1}^{K} π_k φ(x_i; θ_k),   ∀ i ∈ {1, ..., n},


where θ = (π_1, ..., π_K, θ_1, ..., θ_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphic methods, maximum likelihood methods, and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

    Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

    L(θ; X) = \log \left( \prod_{i=1}^{n} f(x_i; θ) \right) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} π_k f_k(x_i; θ_k) \right),    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or


classification log-likelihood:

    L_C(θ; X, Y) = \log \left( \prod_{i=1}^{n} f(x_i, y_i; θ) \right)
                 = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} y_{ik} π_k f_k(x_i; θ_k) \right)
                 = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( π_k f_k(x_i; θ_k) \right).    (7.2)

The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k, and y_{ik} = 0 otherwise.

The soft membership t_{ik}(θ) is defined as

    t_{ik}(θ) = p(Y_{ik} = 1 | x_i; θ)    (7.3)
              = π_k f_k(x_i; θ_k) / f(x_i; θ).    (7.4)

To lighten notations, t_{ik}(θ) will be denoted t_{ik} when the parameter θ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

    L_C(θ; X, Y) = \sum_{i,k} y_{ik} \log \left( π_k f_k(x_i; θ_k) \right)
                 = \sum_{i,k} y_{ik} \log \left( t_{ik} f(x_i; θ) \right)
                 = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; θ)
                 = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; θ)
                 = \sum_{i,k} y_{ik} \log t_{ik} + L(θ; X),    (7.5)

where \sum_{i,k} y_{ik} \log t_{ik} can be reformulated as

    \sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( p(Y_{ik} = 1 | x_i; θ) \right)
                                  = \sum_{i=1}^{n} \log \left( p(Y_{ik} = 1 | x_i; θ) \right)
                                  = \log \left( p(Y | X; θ) \right).

As a result, the relationship (7.5) can be rewritten as

    L(θ; X) = L_C(θ; Z) − \log \left( p(Y | X; θ) \right).    (7.6)


    Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_{ik} are unknown. However, it is possible to work with the expectation of the log-likelihood, taking expectations in (7.6) conditionally on a current value θ^{(t)} of the parameter:

    L(θ; X) = E_{Y ∼ p(·|X; θ^{(t)})} [L_C(θ; X, Y)] + E_{Y ∼ p(·|X; θ^{(t)})} [−\log p(Y | X; θ)],

where the first term is denoted Q(θ, θ^{(t)}) and the second H(θ, θ^{(t)}). In this expression, H(θ, θ^{(t)}) is the entropy term and Q(θ, θ^{(t)}) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ∆L = L(θ^{(t+1)}; X) − L(θ^{(t)}; X). Then θ^{(t+1)} = argmax_θ Q(θ, θ^{(t)}) also increases the log-likelihood:

    ∆L = \left( Q(θ^{(t+1)}, θ^{(t)}) − Q(θ^{(t)}, θ^{(t)}) \right) + \left( H(θ^{(t+1)}, θ^{(t)}) − H(θ^{(t)}, θ^{(t)}) \right),

where the first difference is non-negative by definition of iteration t + 1, and the second is non-negative by Jensen's inequality (it is a Kullback-Leibler divergence). Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^{(t)}). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^{(t)}).

For the mixture model problem, Q(θ, θ′) is

    Q(θ, θ′) = E_{Y ∼ p(Y|X; θ′)} [L_C(θ; X, Y)]
             = \sum_{i,k} p(Y_{ik} = 1 | x_i; θ′) \log \left( π_k f_k(x_i; θ_k) \right)
             = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(θ′) \log \left( π_k f_k(x_i; θ_k) \right).    (7.7)

Due to its similarity with the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_{ik}(θ′) are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^{(0)};

• E-step: evaluation of Q(θ, θ^{(t)}), using t_{ik}(θ^{(t)}) (7.4) in (7.7);

• M-step: calculation of θ^{(t+1)} = argmax_θ Q(θ, θ^{(t)}).


    Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix Σ and different mean vectors μ_k, the mixture density is

    f(x_i; θ) = \sum_{k=1}^{K} π_k f_k(x_i; θ_k)
              = \sum_{k=1}^{K} π_k \frac{1}{(2π)^{p/2} |Σ|^{1/2}} \exp \left\{ −\frac{1}{2} (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) \right\}.

At the E-step, the posterior probabilities t_{ik} are computed as in (7.4) with the current θ^{(t)} parameters; then the M-step maximizes Q(θ, θ^{(t)}) (7.7), whose form is as follows:

    Q(θ, θ^{(t)}) = \sum_{i,k} t_{ik} \log π_k − \sum_{i,k} t_{ik} \log \left( (2π)^{p/2} |Σ|^{1/2} \right) − \frac{1}{2} \sum_{i,k} t_{ik} (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
                  = \sum_{k} t_k \log π_k − \frac{np}{2} \log(2π) − \frac{n}{2} \log |Σ| − \frac{1}{2} \sum_{i,k} t_{ik} (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
                  ≡ \sum_{k} t_k \log π_k − \frac{n}{2} \log |Σ| − \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) \right),    (7.8)

where the term in np/2 \log(2π) is constant and

    t_k = \sum_{i=1}^{n} t_{ik}.    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ^{(t+1)}:

    π_k^{(t+1)} = t_k / n,    (7.10)

    μ_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k},    (7.11)

    Σ^{(t+1)} = \frac{1}{n} \sum_k W_k,    (7.12)

    with W_k = \sum_i t_{ik} (x_i − μ_k)(x_i − μ_k)^⊤.    (7.13)

The derivations are detailed in Appendix G.
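These updates translate directly into code. The sketch below implements one EM iteration for a Gaussian mixture with common covariance matrix, following (7.4) and (7.10)-(7.13); it is a plain illustration of the standard algorithm, not the Mix-GLOSS implementation.

```python
import numpy as np

def e_step(X, pi, mu, Sigma):
    """Posterior probabilities t_ik of Eq. (7.4) for a common-covariance Gaussian mixture."""
    n, p = X.shape
    K = len(pi)
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    log_t = np.empty((n, K))
    for k in range(K):
        d = X - mu[k]
        maha = np.einsum('ij,jk,ik->i', d, Sigma_inv, d)   # (x_i-mu_k)' Sigma^-1 (x_i-mu_k)
        log_t[:, k] = np.log(pi[k]) - 0.5 * (maha + logdet + p * np.log(2 * np.pi))
    log_t -= log_t.max(axis=1, keepdims=True)              # stabilize before normalizing
    T = np.exp(log_t)
    return T / T.sum(axis=1, keepdims=True)

def m_step(X, T):
    """Updates (7.10)-(7.13)."""
    n, p = X.shape
    t_k = T.sum(axis=0)                                    # (7.9)
    pi = t_k / n                                           # (7.10)
    mu = (T.T @ X) / t_k[:, None]                          # (7.11)
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        d = X - mu[k]
        Sigma += (T[:, k, None] * d).T @ d                 # W_k, Eq. (7.13)
    return pi, mu, Sigma / n                               # (7.12)
```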

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

    \log \left( \frac{p(Y_k = 1 | x)}{p(Y_ℓ = 1 | x)} \right) = x^⊤ Σ^{−1} (μ_k − μ_ℓ) − \frac{1}{2} (μ_k + μ_ℓ)^⊤ Σ^{−1} (μ_k − μ_ℓ) + \log \frac{π_k}{π_ℓ}.

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm

    λ \sum_{k=1}^{K} \sum_{j=1}^{p} |μ_{kj}|,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

    λ_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |μ_{kj}| + λ_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} |(Σ_k^{−1})_{jm}|.

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

    λ \sum_{j=1}^{p} \sum_{1 ≤ k < k' ≤ K} |μ_{kj} − μ_{k'j}|.

This PFP regularization does not shrink the means to zero but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L_{1,∞} penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

    λ \sum_{j=1}^{p} \| (μ_{1j}, μ_{2j}, ..., μ_{Kj}) \|_∞.

One group is defined for each variable j, as the set of the jth components of the K means, (μ_{1j}, ..., μ_{Kj}). The L_{1,∞} penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

    λ \sqrt{K} \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K} μ_{kj}^2 }.

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing.
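To make the differences between these penalties concrete, the following sketch evaluates them on a K × p matrix of cluster means (this is our illustration; weights and scaling constants vary between the cited papers).

```python
import numpy as np

def lasso_penalty(mu):            # Pan and Shen (2007): sum_k sum_j |mu_kj|
    return np.abs(mu).sum()

def linf_group_penalty(mu):       # Wang and Zhu (2008): sum_j max_k |mu_kj|
    return np.abs(mu).max(axis=0).sum()

def group_lasso_vmg_penalty(mu):  # Xie et al. (2008b), VMG: sqrt(K) * sum_j ||mu_.j||_2
    K = mu.shape[0]
    return np.sqrt(K) * np.linalg.norm(mu, axis=0).sum()

mu = np.array([[0.0, 0.7, 0.0],   # K = 2 clusters, p = 3 variables
               [0.0, 0.0, 0.5]])
# The L_inf and group-Lasso penalties group the K coefficients of each variable,
# so zeroing a whole column removes that variable from the decision rule.
```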

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions from the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independency; that is, the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


    f(x_i | φ, π, θ, ν) = \sum_{k=1}^{K} π_k \prod_{j=1}^{p} \left[ f(x_{ij} | θ_{jk}) \right]^{φ_j} \left[ h(x_{ij} | ν_j) \right]^{1 − φ_j},

where f(·|θ_{jk}) is the distribution function for the relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, ..., φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_{jk}, ν_j, ρ_j}. Their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion

    \operatorname{tr} \left( (U^⊤ Σ_W U)^{−1} U^⊤ Σ_B U \right),    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and the model parameters in the latent space, such that the U matrix enters into the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

    \min_{Ũ ∈ R^{p×(K−1)}} \left\| X_U − X Ũ \right\|_F^2 + λ \sum_{k=1}^{K−1} \| ũ_k \|_1,

where X_U = XU is the input data projected in the non-sparse space, and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:

    \min_{A, B ∈ R^{p×(K−1)}} \sum_{k=1}^{K} \left\| R_W^{−⊤} H_{B,k} − A B^⊤ H_{B,k} \right\|_2^2 + ρ \sum_{j=1}^{K−1} β_j^⊤ Σ_W β_j + λ \sum_{j=1}^{K−1} \| β_j \|_1
    s.t. A^⊤ A = I_{K−1},

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_{ik}, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W; Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space; A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.

The last possibility computes the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:

    \min_{U ∈ R^{p×(K−1)}} \sum_{j=1}^{p} \left\| Σ_{B,j} − U U^⊤ Σ_{B,j} \right\|_2^2
    s.t. U^⊤ U = I_{K−1},

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of Ũ to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain suppositions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this view, Raftery and Dean (2006) model the observations with a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^{(1)}: set of selected relevant variables;

• X^{(2)}: set of variables being considered for inclusion in or exclusion from X^{(1)};

• X^{(3)}: set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:
    f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) f(X^{(2)} | X^{(1)}) f(X^{(1)} | Y);

• M2:
    f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) f(X^{(2)}, X^{(1)} | Y).

Model M1 means that the variables in X^{(2)} are independent of the clustering Y; model M2

states that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

    B_{12} = \frac{f(X | M_1)}{f(X | M_2)},

where the high-dimensional f(X^{(3)} | X^{(2)}, X^{(1)}) cancels from the ratio:

    B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} | M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} | M_2)} = \frac{f(X^{(2)} | X^{(1)}, M_1) \, f(X^{(1)} | M_1)}{f(X^{(2)}, X^{(1)} | M_2)}.

This factor is approximated, since the integrated likelihoods f(X^{(1)} | M_1) and f(X^{(2)}, X^{(1)} | M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)} | X^{(1)}, M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}. There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows to define blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the quantity of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


    8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to derive reduced-rank decision rules, using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow to solve the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

    d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_W^{−1} (x_i − μ_k),

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_{ik} (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

    2 l_{weight}(μ, Σ) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, d(x_i, μ_k) − n \log(|Σ_W|),

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_{ik} in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

    d(x_i, μ_k) = \| (x_i − μ_k) B_{LDA} \|_2^2 − 2 \log(π_k).

This distance defines the computation of the posterior probabilities t_{ik} in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

       B_OS = (X^⊤ X + λΩ)^{−1} X^⊤ Y Θ,

   where Θ are the K − 1 leading eigenvectors of Y^⊤ X (X^⊤ X + λΩ)^{−1} X^⊤ Y.

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag(α_k^{−1} (1 − α_k^2)^{−1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate distances into posterior probabilities t_{ik} with

       t_{ik} ∝ \exp \left[ − \frac{d(x_i, μ_k) − 2 \log(π_k)}{2} \right].    (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_{ik} converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
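A compact sketch of the two formulaic steps is given below, with a quadratic penalty matrix Ω standing in for the working penalty of GLOSS; it follows the steps as stated above, while the normalization of the scores, the mapping to the LDA domain and the scaling matrix D are omitted for brevity (this is an illustration, not the Mix-GLOSS inner solver).

```python
import numpy as np

def penalized_os_step(X, Y, Omega, lam):
    """Step 2: B_OS = (X'X + lam*Omega)^{-1} X'Y Theta, with Theta the
    K-1 leading eigenvectors of Y'X (X'X + lam*Omega)^{-1} X'Y."""
    K = Y.shape[1]
    A = np.linalg.inv(X.T @ X + lam * Omega)
    M = Y.T @ X @ A @ X.T @ Y
    evals, evecs = np.linalg.eigh(M)              # ascending eigenvalues
    Theta = evecs[:, ::-1][:, :K - 1]             # K-1 leading eigenvectors
    B_os = A @ X.T @ Y @ Theta
    return B_os, Theta

def posteriors(dist2, pi):
    """Step 6, Eq. (8.1): t_ik proportional to exp(-(d(x_i, mu_k) - 2 log pi_k)/2)."""
    log_t = -0.5 * (dist2 - 2.0 * np.log(pi))     # dist2: n x K matrix of squared distances
    log_t -= log_t.max(axis=1, keepdims=True)     # numerical stabilization
    T = np.exp(log_t)
    return T / T.sum(axis=1, keepdims=True)
```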

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized


optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

    f(Σ | Λ_0, ν_0) = \frac{1}{2^{np/2} |Λ_0|^{n/2} Γ_p(n/2)} |Σ^{−1}|^{(ν_0 − p − 1)/2} \exp \left\{ −\frac{1}{2} \operatorname{tr}(Λ_0^{−1} Σ^{−1}) \right\},

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

    Γ_p(n/2) = π^{p(p−1)/4} \prod_{j=1}^{p} Γ\left( \frac{n}{2} + \frac{1 − j}{2} \right).

The posterior distribution can be maximized, similarly to the likelihood, through the


maximization of

    Q(θ, θ′) + \log f(Σ | Λ_0, ν_0)
      = \sum_{k=1}^{K} t_k \log π_k − \frac{(n+1)p}{2} \log 2 − \frac{n}{2} \log |Λ_0| − \frac{p(p+1)}{4} \log π
        − \sum_{j=1}^{p} \log Γ\left( \frac{n}{2} + \frac{1 − j}{2} \right) − \frac{ν_n − p − 1}{2} \log |Σ| − \frac{1}{2} \operatorname{tr}(Λ_n^{−1} Σ^{−1})
      ≡ \sum_{k=1}^{K} t_k \log π_k − \frac{n}{2} \log |Λ_0| − \frac{ν_n − p − 1}{2} \log |Σ| − \frac{1}{2} \operatorname{tr}(Λ_n^{−1} Σ^{−1}),    (8.2)

with

    t_k = \sum_{i=1}^{n} t_{ik},
    ν_n = ν_0 + n,
    Λ_n^{−1} = Λ_0^{−1} + S_0,
    S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i − μ_k)(x_i − μ_k)^⊤.

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by deriving (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

    Σ_MAP = \frac{1}{ν_0 + n − p − 1} \left( Λ_0^{−1} + S_0 \right),    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and setting Λ_0^{−1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).


    9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_{ik}.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates the local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the warm-start implemented here reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage


of variables that will be removed with the next penalty parameter can be modified to make the feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, Equations (4.32b) must be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0
    Y ← K-means(X, K)
    Run non-penalized Mix-GLOSS:
        λ ← 0
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    lastLAMBDA ← false
repeat
    Estimate λ:
        Compute the gradient at β_j = 0:
            ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( \sum_{m ≠ j} x_m β_m − YΘ )
        Compute λ_max for every feature using (4.32b):
            λ_max^j = (1/w_j) \| ∂J(B)/∂β_j |_{β_j = 0} \|_2
        Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_{ik}, π_k, μ_k, Σ, Y for every λ in the solution path
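The λ_max computation in Algorithm 2 can be sketched as follows (B, YΘ and the group weights w_j are assumed to come from the current fit; reading the 10% rule as a quantile over the active features is our interpretation of the heuristic).

```python
import numpy as np

def lambda_max_per_feature(X, YTheta, B, w):
    """lambda_max_j = || x_j' (sum_{m != j} x_m beta_m - Y Theta) ||_2 / w_j."""
    R = X @ B - YTheta                        # full residual of the OS regression
    lam_max = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        r_j = R - np.outer(X[:, j], B[j])     # residual with feature j removed
        lam_max[j] = np.linalg.norm(X[:, j] @ r_j) / w[j]
    return lam_max

# One way to implement the "remove about 10% of the relevant features" rule:
# active = np.flatnonzero(np.linalg.norm(B, axis=1) > 0)
# next_lambda = np.quantile(lambda_max_per_feature(X, YTheta, B, w)[active], 0.10)
```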

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_{ik} is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0, Y ← Y0
    else
        B_OS ← 0, Y ← K-means(X, K)
    end if
    convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{−1} (1 − α^2)^{−1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_{ik} as per (8.1)
        L(θ) as per (8.2)
    if (1/n) \sum_i |t_{ik} − y_{ik}| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_{ik}, π_k, μ_k, Σ, Y


    M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ and the priors π_k of every component. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

    E-Step

The E-step evaluates the posterior probability matrix T, using

    t_{ik} ∝ \exp \left[ − \frac{d(x_i, μ_k) − 2 \log(π_k)}{2} \right].

The convergence of those t_{ik} is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested


[Figure: flowchart of the model selection mechanism. An initial non-penalized Mix-GLOSS (λ = 0, 20 repetitions) is run on X with the chosen K, maximum EM iterations and number of repetitions; the B and T matrices of the best repetition warm-start a single penalized Mix-GLOSS run per λ value; BIC is computed for each λ, and λ is chosen so as to minimize BIC, yielding the partition, t_{ik}, π_k, λ_BEST, B, Θ, D, L(θ) and the active set.]

Figure 9.2: Mix-GLOSS model selection diagram.

with no significant differences in the quality of the clustering, but reducing dramatically the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.
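As an illustration of the criterion, the sketch below evaluates a BIC in the spirit of Pan and Shen (2007), where the count of free parameters only includes the coefficients that have not been zeroed; the exact parameter count used inside Mix-GLOSS may differ.

```python
import numpy as np

def modified_bic(loglik, n, K, B):
    """BIC = -2 log L + log(n) * d, with d counting only non-zero coefficients.

    B is the p x (K-1) coefficient matrix; a variable whose row is entirely
    null is considered removed and does not contribute to the count.
    """
    nonzero_rows = int(np.count_nonzero(np.linalg.norm(B, axis=1) > 0))
    d = (K - 1) + nonzero_rows * (K - 1)      # mixture proportions + active coefficients
    return -2.0 * loglik + np.log(n) * d
```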


    10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations, except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1, adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering, using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering libraries mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded, due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L_{1,∞} mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The LumiWCluster package allows to perform clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see


Chapter 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The parameters used to measure the performance are:

• Clustering error (in percentage): to measure the quality of the partition, with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Scholkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows to obtain the ideal 0% of clustering error even if the IDs of the clusters and of the real classes differ (a possible implementation is sketched after this list).

• Number of disposed features: this value shows the number of variables whose coefficients have been zeroed, so that they are not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of Execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to consume more memory and CPU as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of the relevant variables that are selected (recall), and the FPR is the proportion of the irrelevant variables that are selected (fall-out). The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results we compare TPR and FPR on the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS. A small illustrative sketch of how these performance measures can be computed is given below.
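For concreteness, the following sketch (ours, not the code used in these experiments) illustrates how such measures can be computed. The best cluster-to-class matching is one common way to obtain the labeling invariance of the clustering error described above; Wu and Scholkopf (2007) give the precise definition used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Error rate (in %) after the best one-to-one matching of cluster IDs to class IDs."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Confusion matrix: rows are true classes, columns are predicted clusters.
    conf = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                     for c in classes])
    rows, cols = linear_sum_assignment(-conf)        # matching that maximizes agreement
    return 100.0 * (1.0 - conf[rows, cols].sum() / len(y_true))

def tpr_fpr(selected, relevant, p):
    """TPR and FPR (in %) of the selected variable indices, given the relevant ones."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = 100.0 * len(selected & relevant) / len(relevant)
    fpr = 100.0 * len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr
```

In our simulations, `relevant` would contain the first 20 variables and p = 100, so a perfect selection gives TPR = 100% and FPR = 0%.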

Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).


Table 10.1: Experimental results for simulated data

Sim 1: K = 4, mean shift, ind. features
                        Err (%)        Var            Time
  CS general cov        46 (15)        985 (72)       884h
  Fisher EM             58 (87)        784 (52)       1645m
  Clustvarsel           602 (107)      378 (291)      383h
  LumiWCluster-Kuan     42 (68)        779 (4)        389s
  LumiWCluster-Wang     43 (69)        784 (39)       619s
  Mix-GLOSS             32 (16)        80 (09)        15h

Sim 2: K = 2, mean shift, dependent features
                        Err (%)        Var            Time
  CS general cov        154 (2)        997 (09)       783h
  Fisher EM             74 (23)        809 (28)       8m
  Clustvarsel           73 (2)         334 (207)      166h
  LumiWCluster-Kuan     64 (18)        798 (04)       155s
  LumiWCluster-Wang     63 (17)        799 (03)       14s
  Mix-GLOSS             77 (2)         841 (34)       2h

Sim 3: K = 4, 1D mean shift, ind. features
                        Err (%)        Var            Time
  CS general cov        304 (57)       55 (468)       1317h
  Fisher EM             233 (65)       366 (55)       22m
  Clustvarsel           658 (115)      232 (291)      542h
  LumiWCluster-Kuan     323 (21)       80 (02)        83s
  LumiWCluster-Wang     308 (36)       80 (02)        1292s
  Mix-GLOSS             347 (92)       81 (88)        21h

Sim 4: K = 4, mean shift, ind. features
                        Err (%)        Var            Time
  CS general cov        626 (55)       999 (02)       112h
  Fisher EM             567 (104)      55 (48)        195m
  Clustvarsel           732 (4)        24 (12)        767h
  LumiWCluster-Kuan     692 (112)      99 (2)         876s
  LumiWCluster-Wang     697 (119)      991 (21)       825s
  Mix-GLOSS             669 (91)       975 (12)       11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms

              Simulation 1      Simulation 2      Simulation 3      Simulation 4
              TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  MIX-GLOSS   992     015       828     335       884     67        780     12
  LUMI-KUAN   992     28        1000    02        1000    005       50      005
  FISHER-EM   986     24        888     17        838     5825      620     4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations (MIX-GLOSS, LUMI-KUAN and FISHER-EM on Simulations 1 to 4).

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best in terms of fall-out and recall.


    Conclusions

    Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows all the resources available for solving regression problems to be brought to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection for the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

    Perspectives

Even if the preliminary results are encouraging, Mix-GLOSS has not been sufficiently tested. We plan to test it at least on the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample setting (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the referenced literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement elastic-net-like penalties. Some of those possibilities, such as the diagonal version of GLOSS, have been partially implemented, but they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms, which do not involve the computation of the real criterion, have been used for stopping the EM algorithm and for model selection. However, further investigation must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex, but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


    Appendix


    A Matrix Properties

Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:

$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top, \qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top.$$

Property 2. $\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a$

Property 3. $\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top)\,x$

Property 4. $\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top$

Property 5. $\dfrac{\partial\, a^\top X b}{\partial X} = a b^\top$

Property 6. $\dfrac{\partial}{\partial X}\operatorname{tr}\!\left(A X^{-1} B\right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}$
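As a sanity check of the derivative conventions used above, the short NumPy sketch below (ours, purely illustrative; the helper `num_grad` and the tolerances are assumptions) verifies Properties 5 and 6 by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
a, b = rng.normal(size=(p, 1)), rng.normal(size=(p, 1))
X = rng.normal(size=(p, p)) + p * np.eye(p)   # well-conditioned, invertible

def num_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function f with respect to X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# Property 5: d(a'Xb)/dX = ab'
assert np.allclose(num_grad(lambda M: (a.T @ M @ b).item(), X), a @ b.T, atol=1e-5)

# Property 6: d tr(A X^{-1} B)/dX = -(X^{-1} B A X^{-1})'
A, B = rng.normal(size=(p, p)), rng.normal(size=(p, p))
expected = -(np.linalg.inv(X) @ B @ A @ np.linalg.inv(X)).T
assert np.allclose(num_grad(lambda M: np.trace(A @ np.linalg.inv(M) @ B), X),
                   expected, atol=1e-4)
```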


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form

$$\min_{\theta_k,\,\beta_k} \; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \qquad \text{(B.1)}$$
$$\text{s.t.} \quad \theta_k^\top Y^\top Y \theta_k = 1, \qquad \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k,$$

for k = 1, ..., K−1.

The Lagrangian associated with Problem (B.1) is

$$L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k\big(\theta_k^\top Y^\top Y \theta_k - 1\big) + \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k. \qquad \text{(B.2)}$$

Setting the gradient of (B.2) with respect to $\beta_k$ to zero gives the optimal $\beta_k^\star$:

$$\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. \qquad \text{(B.3)}$$

The objective function of (B.1) evaluated at $\beta_k^\star$ is

$$\min_{\theta_k} \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star
= \min_{\theta_k} \; \theta_k^\top Y^\top\big(I - X(X^\top X + \Omega_k)^{-1}X^\top\big) Y \theta_k$$
$$= \max_{\theta_k} \; \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. \qquad \text{(B.4)}$$

If the penalty matrix $\Omega_k$ is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem where the $k$ score vectors $\theta_k$ are the eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like $Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$ is not trivial because of the $p \times p$ inverse: with some datasets $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let $M$ be the matrix $Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$, so that expression (B.4) can be rewritten in a compact way:

$$\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \operatorname{tr}\big(\Theta^\top M \Theta\big) \qquad \text{(B.5)}$$
$$\text{s.t.} \quad \Theta^\top Y^\top Y \Theta = I_{K-1}.$$

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1) \times (K-1)$ matrix $M_\Theta$ be $\Theta^\top M \Theta$. Then the classical eigenvector formulation associated with (B.5) is

$$M_\Theta v = \lambda v, \qquad \text{(B.6)}$$

where $v$ is an eigenvector and $\lambda$ the associated eigenvalue of $M_\Theta$. Operating on this expression,

$$v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda.$$

Making the change of variable $w = \Theta v$, we obtain an alternative eigen-problem where the $w$ are the eigenvectors of $M$ and $\lambda$ the associated eigenvalues:

$$w^\top M w = \lambda. \qquad \text{(B.7)}$$

Therefore the $v$ are the eigenvectors of the eigen-decomposition of matrix $M_\Theta$, and the $w$ are the eigenvectors of the eigen-decomposition of matrix $M$. Note that the only difference between the $(K-1) \times (K-1)$ matrix $M_\Theta$ and the $K \times K$ matrix $M$ is the $K \times (K-1)$ matrix $\Theta$ in the expression $M_\Theta = \Theta^\top M \Theta$. Then, to avoid the computation of the $p \times p$ inverse $(X^\top X + \Omega)^{-1}$, we can use the optimal value of the coefficient matrix $B^\star = (X^\top X + \Omega)^{-1} X^\top Y \Theta$ in $M_\Theta$:

$$M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B^\star.$$

Thus the eigen-decomposition of the $(K-1) \times (K-1)$ matrix $M_\Theta = \Theta^\top Y^\top X B^\star$ yields the $v$ eigenvectors of (B.6). To obtain the $w$ eigenvectors of the alternative formulation (B.7), the change of variable $w = \Theta v$ needs to be undone.

To summarize, we compute the $v$ eigenvectors from the eigen-decomposition of the tractable matrix $M_\Theta$, evaluated as $\Theta^\top Y^\top X B^\star$. Then the definitive eigenvectors $w$ are recovered by $w = \Theta v$. The final step is the reconstruction of the optimal score matrix $\Theta^\star$ using the vectors $w$ as its columns. At this point we understand what is called in the literature "updating the initial score matrix": multiplying the initial $\Theta$ by the matrix of eigenvectors $V$ from decomposition (B.6) reverses the change of variable to restore the $w$ vectors. The $B$ matrix also needs to be "updated", by multiplying $B^\star$ by the same matrix of eigenvectors $V$, in order to account for the initial $\Theta$ matrix used in the first computation of $B^\star$:

$$(X^\top X + \Omega)^{-1} X^\top Y \Theta V = B^\star V.$$
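A minimal NumPy sketch of this update (ours, not the GLOSS implementation) could look as follows, assuming X is the n×p data matrix, Y the n×K indicator matrix, Theta0 an initial score matrix satisfying the constraint of (B.5), and Omega a symmetric penalty matrix; the symmetrization and sorting steps are implementation choices.

```python
import numpy as np

def update_scores(X, Y, Theta0, Omega):
    """Small-dimension eigen-update sketched in B.1.

    X      : (n, p) data matrix
    Y      : (n, K) class indicator matrix
    Theta0 : (K, K-1) initial score matrix with Theta0' Y'Y Theta0 = I
    Omega  : (p, p) symmetric penalty matrix
    Returns the updated score matrix and regression coefficients.
    """
    # Penalized regression of the scored indicators: B = (X'X + Omega)^{-1} X'Y Theta0,
    # computed with a linear solve rather than an explicit p x p inverse.
    B = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)

    # Small (K-1) x (K-1) matrix M_Theta = Theta0' Y'X B; its eigenvectors are the v of (B.6).
    M_Theta = Theta0.T @ Y.T @ X @ B
    eigvals, V = np.linalg.eigh((M_Theta + M_Theta.T) / 2)   # symmetrize for stability
    order = np.argsort(eigvals)[::-1]                        # sort by decreasing eigenvalue
    V = V[:, order]

    # Undo the change of variable w = Theta v: update scores and coefficients.
    return Theta0 @ V, B @ V
```

The key point is that only a (K−1)×(K−1) eigen-decomposition is performed, while the p-dimensional computation reduces to the linear solve that is needed anyway to obtain the regression coefficients.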


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix $\Theta^\star$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$.

By definition of the eigen-decomposition, the eigenvectors of $M$ (called $w$ in (B.7)) form a basis, so that any score vector $\theta$ can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m, \quad \text{s.t.} \quad \theta_k^\top \theta_k = 1. \qquad \text{(B.8)}$$

The unit-norm constraint $\theta_k^\top \theta_k = 1$ can also be expressed as a function of this basis:

$$\Big(\sum_{m=1}^{K-1} \alpha_m w_m\Big)^{\!\top} \Big(\sum_{m=1}^{K-1} \alpha_m w_m\Big) = 1,$$

which, as per the eigenvector properties, can be reduced to

$$\sum_{m=1}^{K-1} \alpha_m^2 = 1. \qquad \text{(B.9)}$$

Let $M$ be multiplied by a score vector $\theta_k$, which can be replaced by its linear combination of eigenvectors $w_m$ from (B.8):

$$M\theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m.$$

As the $w_m$ are the eigenvectors of $M$, the relationship $M w_m = \lambda_m w_m$ can be used to obtain

$$M\theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m.$$

Multiplying on the left by $\theta_k^\top$, expressed as its linear combination of eigenvectors,

$$\theta_k^\top M \theta_k = \Big(\sum_{\ell=1}^{K-1} \alpha_\ell w_\ell\Big)^{\!\top} \Big(\sum_{m=1}^{K-1} \alpha_m \lambda_m w_m\Big).$$

This equation can be simplified using the orthogonality of the eigenvectors, according to which $w_\ell^\top w_m = 0$ for any $\ell \neq m$, giving

$$\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m.$$


The optimization problem (B.5) for discriminant direction $k$ can thus be rewritten as

$$\max_{\theta_k \in \mathbb{R}^{K}} \; \theta_k^\top M \theta_k = \max_{\theta_k \in \mathbb{R}^{K}} \; \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m, \qquad \text{(B.10)}$$
$$\text{with} \quad \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \quad \text{and} \quad \sum_{m=1}^{K-1} \alpha_m^2 = 1.$$

One way of maximizing Problem (B.10) is to choose $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m$, the resulting score vector $\theta_k$ is equal to the $k$th eigenvector $w_k$.

As a summary, it can be concluded that the solution of the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

$$\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta \qquad \text{(C.1a)}$$
$$\text{s.t.} \quad \beta^\top \Sigma_W \beta = 1, \qquad \text{(C.1b)}$$

where $\Sigma_B$ and $\Sigma_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.

The Lagrangian of Problem (C.1) is

$$L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu\big(\beta^\top \Sigma_W \beta - 1\big),$$

so that its first derivative with respect to $\beta$ is

$$\frac{\partial L(\beta, \nu)}{\partial \beta} = 2\Sigma_B \beta - 2\nu \Sigma_W \beta.$$

A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,

$$\Sigma_B \beta^\star = \nu \Sigma_W \beta^\star.$$

Provided $\Sigma_W$ is full rank, we have

$$\Sigma_W^{-1} \Sigma_B \beta^\star = \nu \beta^\star. \qquad \text{(C.2)}$$

Thus the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top} \Sigma_B \beta^\star = \beta^{\star\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^\star$$
$$= \nu\, \beta^{\star\top} \Sigma_W \beta^\star \qquad \text{from (C.2)}$$
$$= \nu \qquad \text{from (C.1b)}.$$

That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
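Numerically, (C.1) is a generalized symmetric eigenproblem and can be solved directly; the sketch below (ours, for illustration, assuming $\Sigma_W$ is full rank as above) shows this with SciPy.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, y):
    """Leading Fisher discriminant direction of data X (n x p) with labels y."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk) / n                     # within-class variance
        Sb += len(Xk) * np.outer(mk - xbar, mk - xbar) / n    # between-class variance
    # Generalized eigenproblem Sb v = nu Sw v (Sw must be positive definite);
    # eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = eigh(Sb, Sw)
    beta = eigvecs[:, -1]                                     # eigenvector of largest nu
    return beta / np.sqrt(beta @ Sw @ beta)                   # enforce beta' Sw beta = 1
```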


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j} \qquad \text{(D.1a)}$$
$$\text{s.t.} \quad \sum_{j=1}^{p} \tau_j = 1, \qquad \text{(D.1b)}$$
$$\qquad\;\; \tau_j \ge 0, \quad j = 1, \dots, p. \qquad \text{(D.1c)}$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B \in \mathbb{R}^{p \times (K-1)}$ be a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$: $B = (\beta^{1\top}, \dots, \beta^{p\top})^\top$.

The starting point is the Lagrangian

$$L(B, \tau, \lambda, \nu_0, \nu) = J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p} \tau_j - 1\Big) - \sum_{j=1}^{p} \nu_j \tau_j, \qquad \text{(D.2)}$$

which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:

$$\frac{\partial L(B, \tau, \lambda, \nu_0, \nu)}{\partial \tau_j}\bigg|_{\tau_j = \tau_j^\star} = 0
\;\Rightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0$$
$$\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0.$$

The last two expressions are related through a property of the Lagrange multipliers (complementary slackness), which states that $\nu_j g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ is the corresponding inequality constraint. Then the optimal $\tau_j^\star$ can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \|\beta^j\|_2.$$

Plugging this optimal value of $\tau_j$ into constraint (D.1b),

$$\sum_{j=1}^{p} \tau_j^\star = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}. \qquad \text{(D.3)}$$


With this value of $\tau_j^\star$, Problem (D.1) is equivalent to

$$\min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \Big(\sum_{j=1}^{p} w_j \|\beta^j\|_2\Big)^{\!2}. \qquad \text{(D.4)}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null vectors $\beta^j$.

The penalty term of (D.1a) can be conveniently presented as $\lambda B^\top \Omega B$, where

$$\Omega = \operatorname{diag}\Big(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p}\Big). \qquad \text{(D.5)}$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is

$$(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2}. \qquad \text{(D.6)}$$
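In an alternating scheme built on this variational form, one would, at each step, recompute $\tau^\star$ from the current coefficients via (D.3) and form the quadratic penalty through (D.5)–(D.6). A minimal sketch of these two computations is given below (ours; the function name and the eps safeguard for zeroed rows are assumptions, not part of the GLOSS code).

```python
import numpy as np

def variational_weights(B, w, eps=1e-12):
    """Compute tau* (D.3) and the diagonal of Omega (D.6) from the current coefficients.

    B : (p, K-1) matrix of regression coefficients (rows beta^j)
    w : (p,) vector of group weights w_j
    """
    norms = np.sqrt((B ** 2).sum(axis=1))          # ||beta^j||_2 for each row
    s = np.sum(w * norms)                          # sum_j w_j ||beta^j||_2
    tau = (w * norms) / max(s, eps)                # optimal tau*_j, summing to one
    omega_diag = w ** 2 / np.maximum(tau, eps)     # (Omega)_jj = w_j^2 / tau*_j
    return tau, omega_diag
```

The safeguarded division reflects the fact that $\tau_j^\star$ vanishes exactly when $\beta^j$ is zeroed, which is where the sparsity of the solution comes from.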

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $B \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (D.4) is the set of matrices

$$V \in \mathbb{R}^{p \times (K-1)}, \quad V = \frac{\partial J(B)}{\partial B} + 2\lambda \Big(\sum_{j=1}^{p} w_j \|\beta^j\|_2\Big) G, \qquad \text{(D.7)}$$

where $G = (g^{1\top}, \dots, g^{p\top})^\top$ is a $p \times (K-1)$ matrix defined as follows. Let $S(B)$ denote the row support of $B$, $S(B) = \{j \in \{1, \dots, p\} : \|\beta^j\|_2 \neq 0\}$; then we have

$$\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j, \qquad \text{(D.8)}$$
$$\forall j \notin S(B), \quad \|g^j\|_2 \le w_j. \qquad \text{(D.9)}$$

    2le wj (D9)


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B$ of the objective function verifying the following conditions are global minima. Let $S(B)$ denote the row support of $B$, $S(B) = \{j \in \{1, \dots, p\} : \|\beta^j\|_2 \neq 0\}$, and let $\bar{S}(B)$ be its complement; then we have

$$\forall j \in S(B), \quad -\frac{\partial J(B)}{\partial \beta^j} = 2\lambda \Big(\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2\Big)\, w_j \|\beta^j\|_2^{-1} \beta^j, \qquad \text{(D.10a)}$$
$$\forall j \in \bar{S}(B), \quad \Big\|\frac{\partial J(B)}{\partial \beta^j}\Big\|_2 \le 2\lambda\, w_j \Big(\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2\Big). \qquad \text{(D.10b)}$$

In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at $\tau^\star$ such that

$$\tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}.$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau \in \mathbb{R}^p$ be any feasible vector; we have

$$\Big(\sum_{j=1}^{p} w_j \|\beta^j\|_2\Big)^{\!2} = \Big(\sum_{j=1}^{p} \tau_j^{1/2}\, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}}\Big)^{\!2}
\le \Big(\sum_{j=1}^{p} \tau_j\Big) \Big(\sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j}\Big)
\le \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j},$$

where we used the Cauchy–Schwarz inequality for the first inequality and the definition of the feasibility set of $\tau$ for the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B^0$ are optimal for the score values $\Theta^0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta^0$, say $\Theta^\star = \Theta^0 V$ (where $V \in \mathbb{R}^{M \times M}$ is a unitary matrix), then $B^\star = B^0 V$ is optimal conditionally on $\Theta^\star$, that is, $(\Theta^\star, B^\star)$ is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $B^\star$ be a solution of

$$\min_{B \in \mathbb{R}^{p \times M}} \; \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2, \qquad \text{(E.1)}$$

and let $\tilde{Y} = YV$, where $V \in \mathbb{R}^{M \times M}$ is a unitary matrix. Then $\tilde{B} = B^\star V$ is a solution of

$$\min_{B \in \mathbb{R}^{p \times M}} \; \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2. \qquad \text{(E.2)}$$

Proof. The first-order necessary optimality conditions for $B^\star$ are

$$\forall j \in S(B^\star), \quad 2\, x_j^\top\big(X B^\star - Y\big) + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0, \qquad \text{(E.3a)}$$
$$\forall j \notin S(B^\star), \quad 2\, \big\|x_j^\top\big(X B^\star - Y\big)\big\|_2 \le \lambda w_j, \qquad \text{(E.3b)}$$

where $S(B^\star) \subseteq \{1, \dots, p\}$ denotes the set of non-zero row vectors of $B^\star$ and $\bar{S}(B^\star)$ is its complement.

First, we note that from the definition of $\tilde{B}$ we have $S(\tilde{B}) = S(B^\star)$. Then we may rewrite the above conditions as follows:

$$\forall j \in S(\tilde{B}), \quad 2\, x_j^\top\big(X \tilde{B} - \tilde{Y}\big) + \lambda w_j \|\tilde{\beta}^{j}\|_2^{-1} \tilde{\beta}^{j} = 0, \qquad \text{(E.4a)}$$
$$\forall j \notin S(\tilde{B}), \quad 2\, \big\|x_j^\top\big(X \tilde{B} - \tilde{Y}\big)\big\|_2 \le \lambda w_j, \qquad \text{(E.4b)}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses that $VV^\top = I$, so that $\forall u \in \mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$. Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
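The two ingredients of this proof, invariance of the data-fitting term and of the row norms under a unitary transformation, can be checked numerically; the sketch below (ours, with random data) illustrates them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M = 30, 8, 3
X, Y, B = rng.normal(size=(n, p)), rng.normal(size=(n, M)), rng.normal(size=(p, M))

# Random unitary (orthogonal) matrix V obtained from a QR decomposition.
V, _ = np.linalg.qr(rng.normal(size=(M, M)))
Y_tilde, B_tilde = Y @ V, B @ V

# The residual norm and the row norms entering the group-Lasso penalty are
# both unchanged by the rotation, so the objective value is preserved.
assert np.isclose(np.linalg.norm(Y - X @ B), np.linalg.norm(Y_tilde - X @ B_tilde))
assert np.allclose(np.linalg.norm(B, axis=1), np.linalg.norm(B_tilde, axis=1))
```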


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta, \theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta, \theta')$ when the latter is available:

$$L(\theta) = \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)\Big), \qquad \text{(F.1)}$$

$$Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\big(\pi_k f_k(x_i; \theta_k)\big), \qquad \text{(F.2)}$$

$$\text{with} \quad t_{ik}(\theta') = \frac{\pi_k' f_k(x_i; \theta_k')}{\sum_{\ell} \pi_\ell' f_\ell(x_i; \theta_\ell')}. \qquad \text{(F.3)}$$

In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probabilities computed from $\theta'$ at the previous E-step, and $\theta$, without "prime", denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\theta, \theta')$.

Using (F.3), we have

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\big(\pi_k f_k(x_i; \theta_k)\big)$$
$$= \sum_{i,k} t_{ik}(\theta') \log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta') \log\Big(\sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell)\Big)$$
$$= \sum_{i,k} t_{ik}(\theta') \log\big(t_{ik}(\theta)\big) + L(\theta).$$

In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed from the value of $Q(\theta, \theta)$ (7.7) and the entropy of the posterior probabilities:

$$L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log\big(t_{ik}(\theta)\big) = Q(\theta, \theta) + H(T).$$
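This identity is convenient for monitoring the EM algorithm; the following sketch (ours, with arbitrary parameter values for a Gaussian mixture with a common covariance matrix) checks it numerically.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, K, d = 200, 3, 2
X = rng.normal(size=(n, d))

# Arbitrary mixture parameters, purely for illustration.
pi = np.array([0.5, 0.3, 0.2])
mu = rng.normal(size=(K, d))
Sigma = np.eye(d)

# Weighted component densities pi_k f_k(x_i) and posterior probabilities t_ik (E-step).
dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma) for k in range(K)])
t = dens / dens.sum(axis=1, keepdims=True)

L = np.sum(np.log(dens.sum(axis=1)))     # log-likelihood (F.1)
Q = np.sum(t * np.log(dens))             # Q(theta, theta) from (F.2)
H = -np.sum(t * np.log(t))               # entropy of the posterior probabilities
assert np.isclose(L, Q + H)              # L(theta) = Q(theta, theta) + H(T)
```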


    G Derivation of the M-Step Equations

This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\big(\pi_k f_k(x_i; \theta_k)\big)$$
$$= \sum_{k} \log(\pi_k) \sum_{i} t_{ik} - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k),$$

which has to be maximized subject to $\sum_k \pi_k = 1$.

The Lagrangian of this problem is

$$L(\theta) = Q(\theta, \theta') + \lambda\Big(\sum_k \pi_k - 1\Big).$$

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.

G.1 Prior Probabilities

$$\frac{\partial L(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,$$

where $\lambda$ is identified from the constraint, leading to

$$\hat{\pi}_k = \frac{1}{n}\sum_i t_{ik}.$$


G.2 Means

$$\frac{\partial L(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2}\sum_i t_{ik}\, 2\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \hat{\mu}_k = \frac{\sum_i t_{ik}\, x_i}{\sum_i t_{ik}}.$$

G.3 Covariance Matrix

$$\frac{\partial L(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} \; - \; \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \hat{\Sigma} = \frac{1}{n}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top.$$
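These update formulas translate directly into code; a minimal NumPy sketch of such an M-step (ours, not the Mix-GLOSS implementation) could read:

```python
import numpy as np

def m_step(X, T):
    """M-step for a Gaussian mixture with a common covariance matrix.

    X : (n, p) data matrix
    T : (n, K) posterior probabilities t_ik from the E-step
    Returns the updated priors, means and pooled covariance matrix.
    """
    n, p = X.shape
    nk = T.sum(axis=0)                     # sum_i t_ik for each component
    pi = nk / n                            # pi_k = (1/n) sum_i t_ik
    mu = (T.T @ X) / nk[:, None]           # mu_k = sum_i t_ik x_i / sum_i t_ik
    Sigma = np.zeros((p, p))
    for k in range(mu.shape[0]):
        R = X - mu[k]                      # deviations from the k-th mean
        Sigma += (R * T[:, k:k + 1]).T @ R # sum_i t_ik (x_i - mu_k)(x_i - mu_k)'
    return pi, mu, Sigma / n
```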


    Bibliography

    F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

    F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

    F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

    J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

    A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

    H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function 'naive Bayes' and some alternatives when there are many more variables than observations Bernoulli 10(6)989–1010 2004

C Bienarcki G Celeux G Govaert and F Langrognet MIXMOD Statistical Documentation http://www.mixmod.org 2008

    C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

    C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

    C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

    S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

    L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

    L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


    T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

    S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

    C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

    B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

    L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

    C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

    A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

    D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

    R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

    B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

    Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

    R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

    V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

    J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


    J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

    J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

    W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chapman & Hall/CRC 2003

    D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation baseline for clustering Technical Report D71-m12 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implementations of original clustering Technical Report D72-m24 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf 2011

    Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

    Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

    L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

    Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

    J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

    I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

    T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

    T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


    T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

    A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

    J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

    T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

    K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

    P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

    T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

    M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

    Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

    C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

    C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

    H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

    J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

    Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

    C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and M L Martin-Magniette Selvarclust software for variable selection in model-based clustering http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html 2009b

    L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

    N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

    B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

    B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

    Y Nesterov Gradient methods for minimizing composite functions preprint 2007

    S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

    B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

    M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

    M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

    W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

    W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

    K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

    S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


    Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

    A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

    C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

    S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

    V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

    V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

    V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

    C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

    L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

    Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

    A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

    S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

    P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour l'Integration de Connaissances dans les Modeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


    M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

    R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

    J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

    S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

    D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

    D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

    D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

    M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

    MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

    T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

    B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

    B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

    C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

    J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


    M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

    P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

    P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

    H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

    H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

    H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005

    130

    • SANCHEZ MERCHANTE PDTpdf
    • Thesis Luis Francisco Sanchez Merchantepdf
      • List of figures
      • List of tables
      • Notation and Symbols
      • Context and Foundations
        • Context
        • Regularization for Feature Selection
          • Motivations
          • Categorization of Feature Selection Techniques
          • Regularization
            • Important Properties
            • Pure Penalties
            • Hybrid Penalties
            • Mixed Penalties
            • Sparsity Considerations
            • Optimization Tools for Regularized Problems
              • Sparse Linear Discriminant Analysis
                • Abstract
                • Feature Selection in Fisher Discriminant Analysis
                  • Fisher Discriminant Analysis
                  • Feature Selection in LDA Problems
                    • Inertia Based
                    • Regression Based
                        • Formalizing the Objective
                          • From Optimal Scoring to Linear Discriminant Analysis
                            • Penalized Optimal Scoring Problem
                            • Penalized Canonical Correlation Analysis
                            • Penalized Linear Discriminant Analysis
                            • Summary
                              • Practicalities
                                • Solution of the Penalized Optimal Scoring Regression
                                • Distance Evaluation
                                • Posterior Probability Evaluation
                                • Graphical Representation
                                  • From Sparse Optimal Scoring to Sparse LDA
                                    • A Quadratic Variational Form
                                    • Group-Lasso OS as Penalized LDA
                                        • GLOSS Algorithm
                                          • Regression Coefficients Updates
                                            • Cholesky decomposition
                                            • Numerical Stability
                                              • Score Matrix
                                              • Optimality Conditions
                                              • Active and Inactive Sets
                                              • Penalty Parameter
                                              • Options and Variants
                                                • Scaling Variables
                                                • Sparse Variant
                                                • Diagonal Variant
                                                • Elastic net and Structured Variant
                                                    • Experimental Results
                                                      • Normalization
                                                      • Decision Thresholds
                                                      • Simulated Data
                                                      • Gene Expression Data
                                                      • Correlated Data
                                                        • Discussion
                                                          • Sparse Clustering Analysis
                                                            • Abstract
                                                            • Feature Selection in Mixture Models
                                                              • Mixture Models
                                                                • Model
                                                                • Parameter Estimation The EM Algorithm
                                                                  • Feature Selection in Model-Based Clustering
                                                                    • Based on Penalized Likelihood
                                                                    • Based on Model Variants
                                                                    • Based on Model Selection
                                                                        • Theoretical Foundations
                                                                          • Resolving EM with Optimal Scoring
                                                                            • Relationship Between the M-Step and Linear Discriminant Analysis
                                                                            • Relationship Between Optimal Scoring and Linear Discriminant Analysis
                                                                            • Clustering Using Penalized Optimal Scoring
                                                                            • From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
                                                                              • Optimized Criterion
                                                                                • A Bayesian Derivation
                                                                                • Maximum a Posteriori Estimator
                                                                                    • Mix-GLOSS Algorithm
                                                                                      • Mix-GLOSS
                                                                                        • Outer Loop Whole Algorithm Repetitions
                                                                                        • Penalty Parameter Loop
                                                                                        • Inner Loop EM Algorithm
                                                                                          • Model Selection
                                                                                            • Experimental Results
                                                                                              • Tested Clustering Algorithms
                                                                                              • Results
                                                                                              • Discussion
                                                                                                  • Conclusions
                                                                                                  • Appendix
                                                                                                    • Matrix Properties
                                                                                                    • The Penalized-OS Problem is an Eigenvector Problem
                                                                                                      • How to Solve the Eigenvector Decomposition
                                                                                                      • Why the OS Problem is Solved as an Eigenvector Problem
                                                                                                        • Solving Fishers Discriminant Problem
                                                                                                        • Alternative Variational Formulation for the Group-Lasso
                                                                                                          • Useful Properties
                                                                                                          • An Upper Bound on the Objective Function
                                                                                                            • Invariance of the Group-Lasso to Unitary Transformations
                                                                                                            • Expected Complete Likelihood and Likelihood
                                                                                                            • Derivation of the M-Step Equations
                                                                                                              • Prior probabilities
                                                                                                              • Means
                                                                                                              • Covariance Matrix
                                                                                                                  • Bibliography

      Algorithmes drsquoestimation pour laclassification parcimonieuse

      Luis Francisco Sanchez MerchanteUniversity of Compiegne

      CompiegneFrance

      ldquoNunca se sabe que encontrara uno tras una puerta Quiza en eso consistela vida en girar pomosrdquo

      Albert Espinosa

      ldquoBe brave Take risks Nothing can substitute experiencerdquo

      Paulo Coelho

      Acknowledgements

      If this thesis has fallen into your hands and you have the curiosity to read this para-graph you must know that even though it is a short section there are quite a lot ofpeople behind this volume All of them supported me during the three years threemonths and three weeks that it took me to finish this work However you will hardlyfind any names I think it is a little sad writing peoplersquos names in a document that theywill probably not see and that will be condemned to gather dust on a bookshelf It islike losing a wallet with pictures of your beloved family and friends It makes me feelsomething like melancholy

      Obviously this does not mean that I have nothing to be grateful for I always feltunconditional love and support from my family and I never felt homesick since my spanishfriends did the best they could to visit me frequently During my time in CompiegneI met wonderful people that are now friends for life I am sure that all this people donot need to be listed in this section to know how much I love them I thank them everytime we see each other by giving them the best of myself

      I enjoyed my time in Compiegne It was an exciting adventure and I do not regreta single thing I am sure that I will miss these days but this does not make me sadbecause as the Beatles sang in ldquoThe endrdquo or Jorge Drexler in ldquoTodo se transformardquo theamount that you miss people is equal to the love you gave them and received from them

      The only names I am including are my supervisorsrsquo Yves Grandvalet and GerardGovaert I do not think it is possible to have had better teaching and supervision andI am sure that the reason I finished this work was not only thanks to their technicaladvice but also but also thanks to their close support humanity and patience

      Contents

List of Figures

List of Tables

Notation and Symbols

I Context and Foundations

1 Context

2 Regularization for Feature Selection
   2.1 Motivations
   2.2 Categorization of Feature Selection Techniques
   2.3 Regularization
       2.3.1 Important Properties
       2.3.2 Pure Penalties
       2.3.3 Hybrid Penalties
       2.3.4 Mixed Penalties
       2.3.5 Sparsity Considerations
       2.3.6 Optimization Tools for Regularized Problems

II Sparse Linear Discriminant Analysis

Abstract

3 Feature Selection in Fisher Discriminant Analysis
   3.1 Fisher Discriminant Analysis
   3.2 Feature Selection in LDA Problems
       3.2.1 Inertia Based
       3.2.2 Regression Based

4 Formalizing the Objective
   4.1 From Optimal Scoring to Linear Discriminant Analysis
       4.1.1 Penalized Optimal Scoring Problem
       4.1.2 Penalized Canonical Correlation Analysis
       4.1.3 Penalized Linear Discriminant Analysis
       4.1.4 Summary
   4.2 Practicalities
       4.2.1 Solution of the Penalized Optimal Scoring Regression
       4.2.2 Distance Evaluation
       4.2.3 Posterior Probability Evaluation
       4.2.4 Graphical Representation
   4.3 From Sparse Optimal Scoring to Sparse LDA
       4.3.1 A Quadratic Variational Form
       4.3.2 Group-Lasso OS as Penalized LDA

5 GLOSS Algorithm
   5.1 Regression Coefficients Updates
       5.1.1 Cholesky decomposition
       5.1.2 Numerical Stability
   5.2 Score Matrix
   5.3 Optimality Conditions
   5.4 Active and Inactive Sets
   5.5 Penalty Parameter
   5.6 Options and Variants
       5.6.1 Scaling Variables
       5.6.2 Sparse Variant
       5.6.3 Diagonal Variant
       5.6.4 Elastic net and Structured Variant

6 Experimental Results
   6.1 Normalization
   6.2 Decision Thresholds
   6.3 Simulated Data
   6.4 Gene Expression Data
   6.5 Correlated Data
   Discussion

III Sparse Clustering Analysis

Abstract

7 Feature Selection in Mixture Models
   7.1 Mixture Models
       7.1.1 Model
       7.1.2 Parameter Estimation: The EM Algorithm
   7.2 Feature Selection in Model-Based Clustering
       7.2.1 Based on Penalized Likelihood
       7.2.2 Based on Model Variants
       7.2.3 Based on Model Selection

8 Theoretical Foundations
   8.1 Resolving EM with Optimal Scoring
       8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
       8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
       8.1.3 Clustering Using Penalized Optimal Scoring
       8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
   8.2 Optimized Criterion
       8.2.1 A Bayesian Derivation
       8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
   9.1 Mix-GLOSS
       9.1.1 Outer Loop: Whole Algorithm Repetitions
       9.1.2 Penalty Parameter Loop
       9.1.3 Inner Loop: EM Algorithm
   9.2 Model Selection

10 Experimental Results
   10.1 Tested Clustering Algorithms
   10.2 Results
   10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
   B.1 How to Solve the Eigenvector Decomposition
   B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
   D.1 Useful Properties
   D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
   G.1 Prior probabilities
   G.2 Means
   G.3 Covariance Matrix

Bibliography

      List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ||β||_p
2.4 Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image
6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS Loops Scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

      List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

      Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

      Sets

N : the set of natural numbers, N = {1, 2, ...}
R : the set of reals
|A| : cardinality of a set A (for finite sets, the number of elements)
Ā : complement of set A

      Data

X : input domain
x_i : input sample, x_i ∈ X
X : design matrix, X = (x_1^T, ..., x_n^T)^T
x^j : column j of X
y_i : class indicator of sample i
Y : indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z : complete data, z = (x, y)
G_k : set of the indices of observations belonging to class k
n : number of examples
K : number of classes
p : dimension of X
i, j, k : indices running over N

Vectors, Matrices and Norms

0 : vector with all entries equal to zero
1 : vector with all entries equal to one
I : identity matrix
A^T : transpose of matrix A (ditto for vector)
A^{-1} : inverse of matrix A
tr(A) : trace of matrix A
|A| : determinant of matrix A
diag(v) : diagonal matrix with v on the diagonal
||v||_1 : L1 norm of vector v
||v||_2 : L2 norm of vector v
||A||_F : Frobenius norm of matrix A


      Probability

E[·] : expectation of a random variable
var[·] : variance of a random variable
N(μ, σ²) : normal distribution with mean μ and variance σ²
W(W, ν) : Wishart distribution with ν degrees of freedom and scale matrix W
H(X) : entropy of random variable X
I(X; Y) : mutual information between random variables X and Y

      Mixture Models

y_ik : hard membership of sample i to cluster k
f_k : distribution function for cluster k
t_ik : posterior probability of sample i to belong to cluster k
T : posterior probability matrix
π_k : prior probability or mixture proportion for cluster k
μ_k : mean vector of cluster k
Σ_k : covariance matrix of cluster k
θ_k : parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t) : parameter vector at iteration t of the EM algorithm
f(X; θ) : likelihood function
L(θ; X) : log-likelihood function
L_C(θ; X, Y) : complete log-likelihood function

      Optimization

J(·) : cost function
L(·) : Lagrangian
β̂ : generic notation for the solution with respect to β
β_ls : least squares solution coefficient vector
A : active set
γ : step size to update the regularization path
h : direction to update the regularization path


      Penalized models

λ, λ_1, λ_2 : penalty parameters
P_λ(θ) : penalty term over a generic parameter vector
β_kj : coefficient j of discriminant vector k
β_k : kth discriminant vector, β_k = (β_k1, ..., β_kp)
B : matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j : jth row of B, B = (β^{1T}, ..., β^{pT})^T
B_LDA : coefficient matrix in the LDA domain
B_CCA : coefficient matrix in the CCA domain
B_OS : coefficient matrix in the OS domain
X_LDA : data matrix in the LDA domain
X_CCA : data matrix in the CCA domain
X_OS : data matrix in the OS domain
θ_k : score vector k
Θ : score matrix, Θ = (θ_1, ..., θ_{K-1})
Y : label matrix
Ω : penalty matrix
L_CP(θ; X, Z) : penalized complete log-likelihood function
Σ_B : between-class covariance matrix
Σ_W : within-class covariance matrix
Σ_T : total covariance matrix
Σ̂_B : sample between-class covariance matrix
Σ̂_W : sample within-class covariance matrix
Σ̂_T : sample total covariance matrix
Λ : inverse of covariance matrix, or precision matrix
w_j : weights
τ_j : penalty components of the variational approach


      Part I

      Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. The models and some basic concepts that will be used along this document are also detailed here, and the relevant state of the art is reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


      1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiegne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of the website framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo.


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below there is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, which is a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors (a small illustrative sketch is given below). A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
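To make the pairwise comparison concrete, here is a minimal Python/numpy sketch, purely illustrative and not the MASH platform code, that computes the RV coefficient between two column-centered tables sharing the same rows; the function name, the toy data and the use of the n×n operators XX^T are assumptions made for this example.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data tables X (n x p) and Y (n x q) observed on the same n samples."""
    # Center the columns so that the coefficient compares configurations, not offsets
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # n x n operators associated with each table
    Sx = Xc @ Xc.T
    Sy = Yc @ Yc.T
    # Normalized inner product between the two operators
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

# Toy usage: two "extractors" evaluated on the same 50 samples
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
B = A[:, :3] + 0.1 * rng.normal(size=(50, 3))        # strongly related to A
print(rv_coefficient(A, B))                          # close to 1
print(rv_coefficient(A, rng.normal(size=(50, 5))))   # close to 0
```

A dissimilarity such as 1 - RV(O_i, O_j) can then feed any standard clustering method to group the extractors.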

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


      2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions turn those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008).

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005).

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references, which does not cover all the possibilities but gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness make it possible to evaluate subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_{\beta}\; J(\beta) + \lambda P(\beta)    (2.1)

\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad P(\beta) \le t    (2.2)

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I am reviewing pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p.

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2,\quad f(t x_1 + (1-t)x_2) \le t f(x_1) + (1-t) f(x_2)    (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties.

Regularizing a linear model with a norm like ||β||_p means that the larger the component |β_j|, the more important the feature x^j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x^j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β_ls is outside the penalties' admissible region. The closest point to this β_ls for the L1 regularization is β_l1, and for the L2 regularization it is β_l2. Solution β_l1 is sparse because its second component is zero, while both components of β_l2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} region results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||_0 = card{β_j | β_j ≠ 0}:

\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \|\beta\|_0 \le t    (2.4)

where parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \sum_{j=1}^{p} |\beta_j| \le t    (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not so easy a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

Lasso is a popular tool that has been used in multiple contexts beside regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012a,b).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions where Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

\min_{\beta}\; J(\beta) + \lambda \|\beta\|_2^2    (2.6)

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2    (2.7)

with solution β_ls = (X^T X)^{-1} X^T y. If some input variables are highly correlated, the estimator β_ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

The solution to this problem is β_l2 = (X^T X + λ I_p)^{-1} X^T y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
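As an illustration of this stabilizing effect, the following minimal numpy sketch (the function name and the toy data are assumptions made for the example) computes the closed-form ridge estimator and shows how the λI term tames the coefficients of two nearly collinear columns.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p = 30, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 3] + 1e-3 * rng.normal(size=n)    # two strongly correlated columns
y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=n)

print(ridge(X, y, lam=0.0))   # near-singular X'X: wild coefficients on the correlated pair
print(ridge(X, y, lam=1.0))   # eigenvalues shifted upwards by lambda: stabilized estimate
```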

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{\mathrm{ls}})^2}    (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component. There, every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although the L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ||x||_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ||β||_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region fits a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ||β||_* of a norm ||β|| is defined as

\|\beta\|_* = \max_{w \in \mathbb{R}^p}\; \beta^\top w \quad \text{s.t.}\quad \|w\| \le 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2    (2.9)

The term in λ_1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ_2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
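In practice an off-the-shelf solver can be used for (2.9). The sketch below relies on scikit-learn's ElasticNet; note that this library rescales the loss by 1/(2n) and encodes the two penalty levels through alpha and l1_ratio, so the correspondence with (λ_1, λ_2) holds only up to those scaling conventions, and the data here are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
n, p = 40, 100                       # more variables than samples
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:10] = 1.0                 # only 10 relevant features
y = X @ beta_true + 0.5 * rng.normal(size=n)

# l1_ratio balances the L1 (sparsity) and L2 (stability/consistency) terms
model = ElasticNet(alpha=0.1, l1_ratio=0.7, max_iter=10000).fit(X, y)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```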


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us identify as G_ℓ the group of genes for the ℓth process and d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus the dimension of vector β will be the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^{L} d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \Biggl( \sum_{\ell} \Bigl( \sum_{j \in G_\ell} |\beta_j|^s \Bigr)^{r/s} \Biggr)^{1/r}    (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
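The following short helper, an illustrative sketch whose group encoding and optional sqrt(d_ℓ) weighting are assumptions commonly made in practice, evaluates the mixed L(1,2) norm of (2.10) for a given partition of the indices.

```python
import numpy as np

def mixed_norm_1_2(beta, groups, weight_by_size=True):
    """Group-Lasso norm: L1 across groups of the (optionally weighted) L2 norms within groups."""
    total = 0.0
    for g in groups:                              # g is a list of indices, one group G_l
        w = np.sqrt(len(g)) if weight_by_size else 1.0
        total += w * np.linalg.norm(beta[g])      # L2 norm of the sub-vector of group G_l
    return total

beta = np.array([0.0, 0.0, 1.5, -2.0, 0.0, 3.0])
groups = [[0, 1], [2, 3], [4, 5]]
print(mixed_norm_1_2(beta, groups))   # the first group contributes nothing and can be dropped entirely
```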

Several combinations are available; the most popular is the norm ||β||_(1,2), known as group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ||β||_(1,4/3) (Szafranski et al., 2008) or ||β||_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

Figure 2.5: Admissible sets for the Lasso ((a) L1) and the group-Lasso ((b) L(1,2)).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters ((a) L1-induced sparsity, (b) L(1,2) group-induced sparsity).

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms with the proper definition of groups can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
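A compact way to see this row-wise pattern is block soft-thresholding, which is the proximal operator of the group-Lasso penalty when each group gathers all the parameters of one variable; the sketch below (illustrative code, assumed names and data) zeroes whole rows of an 8×4 coefficient matrix, exactly the kind of pattern displayed in the right of Figure 2.6.

```python
import numpy as np

def row_soft_threshold(B, lam):
    """Block soft-thresholding of each row of B (one row = all parameters of one variable)."""
    out = np.zeros_like(B)
    for j, row in enumerate(B):
        norm = np.linalg.norm(row)
        if norm > lam:                        # shrink the whole row, or remove the variable entirely
            out[j] = (1.0 - lam / norm) * row
    return out

rng = np.random.default_rng(3)
B = rng.normal(size=(8, 4))                   # 8 variables characterized by 4 parameters
B_sparse = row_soft_threshold(B, lam=2.0)
print(np.nonzero(np.linalg.norm(B_sparse, axis=1) == 0)[0])   # indices of the removed variables
```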

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints" implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\ s' \in \partial P(\beta^{(t)})
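For instance, for the Lasso case where J(β) is the residual sum of squares, a minimal sketch of this update reads as follows; the decreasing step size and the scaling by the Lipschitz constant are assumptions made to keep the toy iteration stable, not part of the generic method.

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, n_iter=2000):
    """Subgradient descent on sum of squared residuals + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    L = 2 * np.linalg.norm(X, 2) ** 2              # scale used to pick a safe step size
    for t in range(1, n_iter + 1):
        s = 2 * X.T @ (X @ beta - y)               # gradient of the quadratic loss J
        s_prime = np.sign(beta)                    # one valid subgradient of the L1 penalty
        beta = beta - (s + lam * s_prime) / (L * np.sqrt(t))   # decreasing step size
    return beta                                     # close to, but in general not exactly, sparse
```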

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

\beta_j = \frac{-\lambda\,\mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2}

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β_ls and updating their values using an iterative thresholding algorithm where β_j^{(t+1)} = S_λ(∂J(β^{(t)})/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\
0 & \text{if } |\partial J(\beta)/\partial \beta_j| \le \lambda
\end{cases}    (2.11)
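A minimal implementation of this shooting-style scheme for the Lasso case, with the partial derivative computed from the residual that excludes variable j, could look as follows; the function names, the warm start and the fixed number of sweeps are assumptions made for this illustrative sketch.

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_shooting(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for sum of squared residuals + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0] if n >= p else np.zeros(p)   # warm start
    col_sq = (X ** 2).sum(axis=0)                     # the sum_i x_ij^2 denominators
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]    # partial residual excluding variable j
            # thresholding rule (2.11): beta_j is set to zero when the partial gradient is small
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta

# toy usage
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=50)
print(lasso_shooting(X, y, lam=5.0))
```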

The same principles define "block-coordinate descent" algorithms. In this case, first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of variables with non-zero β_j and is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā; in the inactive set we can find the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There exists also a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).
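The sketch below illustrates these three tasks on the Lasso problem min ||y - Xβ||² + λ||β||₁, with a plain coordinate-descent routine standing in for the optimization task; it is only a schematic rendition of the working-set idea under these assumptions, not the algorithm developed later in this thesis.

```python
import numpy as np

def _soft(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def _cd_on_active(X, y, beta, active, lam, sweeps=50):
    # optimization task: coordinate descent restricted to the active variables (warm-started)
    for _ in range(sweeps):
        for j in active:
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = _soft(X[:, j] @ r_j, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta

def lasso_working_set(X, y, lam, tol=1e-6):
    """Forward active-set loop for min ||y - X beta||^2 + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    active = []
    while True:
        grad = 2 * X.T @ (X @ beta - y)       # ingredient of the optimality conditions
        viol = np.abs(grad) - lam             # optimality requires |grad_j| <= lam when beta_j = 0
        viol[active] = -np.inf                # active variables are handled by the inner solver
        j = int(np.argmax(viol))
        if viol[j] <= tol:                    # no violator left in the inactive set: done
            return beta, active
        active.append(j)                      # working-set update: add the worst violator
        beta = _cd_on_active(X, y, beta, active, lam)
```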

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L(1,2) penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-planes approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes in different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size and which variable should enter the active set from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz differentiable cost function J(β) and a non-differentiable penalty λP(β):

\min_{\beta \in \mathbb{R}^p}\; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2    (2.12)

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve in each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

\min_{\beta \in \mathbb{R}^p}\; \frac{1}{2} \left\| \beta - \Bigl( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Bigr) \right\|_2^2 + \frac{\lambda}{L} P(\beta)    (2.13)

The basic algorithm makes use of the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, making λ = 0 in equation (2.13), the standard gradient update rule comes up.
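For the L1 penalty, the solution of (2.13) is given by soft-thresholding, which yields the basic proximal gradient (ISTA) iteration sketched below; the quadratic loss, the computation of L from the spectral norm of X and the fixed iteration count are assumptions made for this illustrative example.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """ISTA: proximal gradient iterations for min ||y - X beta||^2 + lam * ||beta||_1."""
    L = 2 * np.linalg.norm(X, 2) ** 2              # Lipschitz constant of the gradient of the loss
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)
        z = beta - grad / L                         # gradient step on the smooth part (2.13)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # proximal step = soft-thresholding
    return beta
```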


      Part II

      Sparse Linear Discriminant Analysis


      Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n×K matrix Y = (y_1^T, ..., y_n^T)^T.

      Fisherrsquos discriminant problem was first proposed for two-class problems for the analy-sis of the famous iris dataset as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance

      maxβisinRp

      βgtΣBβ

      βgtΣWβ (31)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p × p between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as

\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{G}_k} (x_i - \mu_k)(x_i - \mu_k)^\top , \qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{G}_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,

where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}( B^\top \Sigma_B B )}{\operatorname{tr}( B^\top \Sigma_W B )} , \qquad (3.2)

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B \beta_k \\
\text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 \\
& \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k .
\end{aligned} \qquad (3.3)

The maximizer of subproblem k is the eigenvector of \Sigma_W^{-1} \Sigma_B associated with the k-th largest eigenvalue (see Appendix C).
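As an illustration of this eigenvalue characterization, the following matlab sketch computes the sample covariance matrices and the discriminant directions for hypothetical data X with labels y; it is a naive implementation intended only to make the formulas concrete.

    % Fisher's discriminant directions as leading eigenvectors of inv(Sigma_W)*Sigma_B
    n = 150; p = 4; K = 3;
    X = randn(n, p); y = randi(K, n, 1);          % hypothetical data and labels
    mu = mean(X, 1);
    SigmaW = zeros(p); SigmaB = zeros(p);
    for k = 1:K
      Xk = X(y == k, :);
      muk = mean(Xk, 1);
      Xc = bsxfun(@minus, Xk, muk);
      SigmaW = SigmaW + Xc' * Xc / n;                               % within-class covariance
      SigmaB = SigmaB + size(Xk, 1) * (muk - mu)' * (muk - mu) / n; % between-class covariance
    end
    [V, D] = eig(SigmaW \ SigmaB);
    [~, order] = sort(real(diag(D)), 'descend');
    B = real(V(:, order(1:K-1)));                 % K-1 leading discriminant directions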

3.2 Feature Selection in LDA Problems

LDA is often used as a data-reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main purpose of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually assessed by univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, which is either Fisher's discriminant analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes based on inertia principles: mass centers should be far away from each other (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_W \beta \\
\text{s.t. } & (\mu_1 - \mu_2)^\top \beta = 1 \\
& \textstyle\sum_{j=1}^{p} |\beta_j| \le t ,
\end{aligned}

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant rewritten in the form of K − 1 constrained and penalized maximization problems:

\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B^{k} \beta_k - P_k(\beta_k) \\
\text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 .
\end{aligned}

The term to maximize is the projected between-class covariance β_k^⊤ Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k^⊤ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise-constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly, through constrained L1 minimization:

\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \|\beta\|_1 \\
\text{s.t. } & \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda .
\end{aligned}

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al. 2000, Friedman et al. 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n × K matrix encoding the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used for linking multi-class classification with linear regression (Friedman et al. 2009). Another popular choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise; it was used, for example, for extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).
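For instance, the dummy indicator matrix and the symmetric coding mentioned above can be built in a couple of lines (matlab sketch, with a hypothetical label vector y):

    y = [1 2 3 2 1 3]';                 % hypothetical label vector with K = 3 classes
    n = numel(y); K = max(y);
    Y = full(sparse((1:n)', y, 1, n, K));  % dummy indicator: Y(i,k) = 1 iff y(i) == k
    Ysym = Y - (1 - Y) / (K - 1);          % symmetric coding: 1 for the class, -1/(K-1) otherwise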

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to optimal scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is


obtained by solving

\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,

where y_i is the binary label indicator of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest that any other generic sparsity-inducing penalty may be used. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting from λ = 0, but a different intercept β_0 is required.

      Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the class indicators together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\begin{aligned}
\min_{\Theta,\, B} \;\; & \| Y\Theta - XB \|_F^2 + \lambda \operatorname{tr}( B^\top \Omega B ) & (3.4a) \\
\text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , & (3.4b)
\end{aligned}

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the ordering that arises naturally when considering the following series of K − 1 problems:

\begin{aligned}
\min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \;\; & \| Y\theta_k - X\beta_k \|^2 + \beta_k^\top \Omega \beta_k & (3.5a) \\
\text{s.t. } & n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 & (3.5b) \\
& \theta_k^\top Y^\top Y \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , & (3.5c)
\end{aligned}

where each β_k corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k} \| Y\theta_k - X\beta_k \|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \| Y\theta_k - X\beta_k \|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K-1} \beta_{kj}^2 } , \qquad (3.6)

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


      4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can either be complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995) and were already used for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of penalized optimal scoring (p-OS) problems and the series of penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix Y^⊤Y is full rank;
• inputs are centered, that is, X^⊤ 1_n = 0;
• the quadratic penalty Ω is positive-semidefinite and such that X^⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the K-th problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus

\begin{aligned}
\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & \| Y\theta - X\beta \|^2 + \beta^\top \Omega \beta & (4.1a) \\
\text{s.t. } & n^{-1}\, \theta^\top Y^\top Y \theta = 1 . & (4.1b)
\end{aligned}

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\beta_{\mathrm{os}} = \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta . \qquad (4.2)

The objective function (4.1a) is then

\begin{aligned}
\| Y\theta - X\beta_{\mathrm{os}} \|^2 + \beta_{\mathrm{os}}^\top \Omega \beta_{\mathrm{os}}
&= \theta^\top Y^\top Y \theta - 2\, \theta^\top Y^\top X \beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \left( X^\top X + \Omega \right) \beta_{\mathrm{os}} \\
&= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta ,
\end{aligned}

where the second line stems from the definition (4.2) of β_os. Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\max_{\theta :\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta , \qquad (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the k-th largest eigenvector of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

(Y^\top Y)^{-1} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta = \alpha^2 \theta , \qquad (4.4)


where α² is the maximal eigenvalue¹:

\begin{aligned}
n^{-1}\, \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2\, n^{-1}\, \theta^\top (Y^\top Y)\, \theta \\
n^{-1}\, \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2 . \qquad (4.5)
\end{aligned}

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized canonical correlation analysis (p-CCA) problem between variables X and Y is defined as follows:

\begin{aligned}
\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & n^{-1}\, \theta^\top Y^\top X \beta & (4.6a) \\
\text{s.t. } & n^{-1}\, \theta^\top Y^\top Y \theta = 1 & (4.6b) \\
& n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 . & (4.6c)
\end{aligned}

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

\begin{aligned}
n L(\beta, \theta, \nu, \gamma) &= \theta^\top Y^\top X \beta - \nu \left( \theta^\top Y^\top Y \theta - n \right) - \gamma \left( \beta^\top (X^\top X + \Omega) \beta - n \right) \\
\Rightarrow \quad n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} &= X^\top Y \theta - 2\gamma\, (X^\top X + \Omega) \beta \\
\Rightarrow \quad \beta_{\mathrm{cca}} &= \frac{1}{2\gamma} (X^\top X + \Omega)^{-1} X^\top Y \theta .
\end{aligned}

Then, as β_cca obeys (4.6c), we obtain

\beta_{\mathrm{cca}} = \frac{ (X^\top X + \Omega)^{-1} X^\top Y \theta }{ \sqrt{ n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta } } , \qquad (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

\begin{aligned}
n^{-1}\, \theta^\top Y^\top X \beta_{\mathrm{cca}}
&= \frac{ n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta }{ \sqrt{ n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta } } \\
&= \sqrt{ n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta } ,
\end{aligned}

and the optimization problem with respect to θ can be restated as

\max_{\theta :\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta . \qquad (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\beta_{\mathrm{os}} = \alpha\, \beta_{\mathrm{cca}} , \qquad (4.9)

where α is defined by (4.5).

¹ The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\begin{aligned}
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} &= Y^\top X \beta - 2\nu\, Y^\top Y \theta \\
\Rightarrow \quad \theta_{\mathrm{cca}} &= \frac{1}{2\nu} (Y^\top Y)^{-1} Y^\top X \beta . \qquad (4.10)
\end{aligned}

Then, as θ_cca obeys (4.6b), we obtain

\theta_{\mathrm{cca}} = \frac{ (Y^\top Y)^{-1} Y^\top X \beta }{ \sqrt{ n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta } } , \qquad (4.11)

leading to the following expression of the optimal objective function:

\begin{aligned}
n^{-1}\, \theta_{\mathrm{cca}}^\top Y^\top X \beta
&= \frac{ n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta }{ \sqrt{ n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta } } \\
&= \sqrt{ n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta } .
\end{aligned}

The p-CCA problem can thus be solved with respect to β by plugging this value into (4.6):

\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \;\; & n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta & (4.12a) \\
\text{s.t. } & n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 , & (4.12b)
\end{aligned}

where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

n^{-1}\, X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \left( X^\top X + \Omega \right) \beta_{\mathrm{cca}} , \qquad (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

\begin{aligned}
& n^{-1}\, \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \\
\Rightarrow \;& n^{-1} \alpha^{-1}\, \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow \;& n^{-1} \alpha\, \beta_{\mathrm{cca}}^\top X^\top Y \theta = \lambda \\
\Rightarrow \;& n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow \;& \alpha^2 = \lambda .
\end{aligned}

The first line is obtained from constraint (4.12b); the second line follows from the relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one uses the definition (4.5) of α.


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized linear discriminant analysis problem is defined as follows:

\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_B \beta & (4.14a) \\
\text{s.t. } & \beta^\top \left( \Sigma_W + n^{-1} \Omega \right) \beta = 1 , & (4.14b)
\end{aligned}

where Σ_B and Σ_W are, respectively, the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y(Y^⊤Y)^{-1}Y^⊤:

\begin{aligned}
\Sigma_T &= \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top = n^{-1} X^\top X , \\
\Sigma_B &= \frac{1}{n} \sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1} X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X , \\
\Sigma_W &= \frac{1}{n} \sum_{k=1}^{K} \sum_{i : y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1} \left( X^\top X - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) .
\end{aligned}

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\begin{aligned}
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \lambda \left( X^\top X + \Omega - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) \beta_{\mathrm{lda}} \\
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \frac{\lambda}{1-\lambda} \left( X^\top X + \Omega \right) \beta_{\mathrm{lda}} .
\end{aligned}

The comparison of the last equation with the characterization (4.13) of β_cca shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

\begin{aligned}
\beta_{\mathrm{lda}} &= (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{cca}} \\
&= \alpha^{-1} (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{os}} ,
\end{aligned}

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the k-th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

\begin{aligned}
\min_{\Theta,\, B} \;\; & \| Y\Theta - XB \|_F^2 + \lambda \operatorname{tr}( B^\top \Omega B ) \\
\text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} .
\end{aligned}

Let A represent the (K − 1) × (K − 1) diagonal matrix whose elements α_k are the square roots of the K − 1 largest eigenvalues of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y; we have

\begin{aligned}
B_{\mathrm{LDA}} &= B_{\mathrm{CCA}} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \\
&= B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} , \qquad (4.15)
\end{aligned}

where I_{K−1} is the (K − 1) × (K − 1) identity matrix.

At this point, the feature matrix X, which has dimensions n × p in the input space, can be projected into the optimal scoring domain as an n × (K − 1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n × (K − 1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains, provided the appropriate distance (based on the penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as

   B_OS = (X^⊤X + λΩ)^{-1} X^⊤Y Θ ,

   where Θ holds the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1} (I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Produce a graphical representation, if desired.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

\begin{aligned}
\min_{\Theta \in \mathbb{R}^{K \times (K-1)},\, B \in \mathbb{R}^{p \times (K-1)}} \;\; & \| Y\Theta - XB \|_F^2 + \lambda \operatorname{tr}( B^\top \Omega B ) & (4.16a) \\
\text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , & (4.16b)
\end{aligned}

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form, as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ0 such that n^{-1} Θ0^⊤ Y^⊤Y Θ0 = I_{K−1}.

2. Compute B = (X^⊤X + λΩ)^{-1} X^⊤Y Θ0.

3. Set Θ to be the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y.

4. Compute the optimal regression coefficients

   B_OS = (X^⊤X + λΩ)^{-1} X^⊤Y Θ . \qquad (4.17)

Defining Θ0 in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ0^⊤ Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y Θ0, which is computed as Θ0^⊤ Y^⊤X B, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form tr(B^⊤ΩB). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with parameters estimated from training data (sample estimators μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\left( \frac{n_k}{n} \right) \qquad (4.18)

is minimized over all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes based on the estimated prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

\begin{aligned}
\Sigma_{W\Omega}^{-1} &= \left( n^{-1} (X^\top X + \lambda\Omega) - \Sigma_B \right)^{-1} \\
&= \left( n^{-1} X^\top X - \Sigma_B + n^{-1}\lambda\Omega \right)^{-1} \\
&= \left( \Sigma_W + n^{-1}\lambda\Omega \right)^{-1} . \qquad (4.19)
\end{aligned}

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution B_OS of the p-OS problem is enough to accomplish classification;
• in the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances;
• classification can be done in a reduced-rank space of dimension R < K − 1, by using only the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

\left\| (x_i - \mu_k) B_{\mathrm{OS}} \right\|_{\Sigma_{W\Omega}}^2 - 2 \log(\pi_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is

\left\| (x_i - \mu_k) B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 - 2 \log(\pi_k) ,

which is a plain Euclidean distance.
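As an illustration, nearest-centroid classification in the LDA domain then reduces to a few lines (matlab sketch; the discriminant variates Xlda, the centroid matrix M and the class proportions are hypothetical placeholders):

    n = 20; K = 3;
    Xlda = randn(n, K-1); M = randn(K, K-1);      % hypothetical discriminant variates and centroids
    prior = ones(K, 1) / K;                       % estimated class proportions
    d = zeros(n, K);
    for k = 1:K
      diffk = bsxfun(@minus, Xlda, M(k, :));
      d(:, k) = sum(diffk.^2, 2) - 2 * log(prior(k));  % Euclidean distance + prior adjustment
    end
    [~, yhat] = min(d, [], 2);                    % maximum a posteriori class assignment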


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

\begin{aligned}
\hat{p}(y_k = 1 \,|\, x) &\propto \exp\!\left( - \frac{d(x, \mu_k)}{2} \right) \\
&\propto \pi_k \exp\!\left( - \frac{1}{2} \left\| (x - \mu_k) B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 \right) . \qquad (4.20)
\end{aligned}

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

\begin{aligned}
\hat{p}(y_k = 1 \,|\, x) &= \frac{ \pi_k \exp\!\left( - \frac{d(x, \mu_k)}{2} \right) }{ \sum_{\ell} \pi_\ell \exp\!\left( - \frac{d(x, \mu_\ell)}{2} \right) } \\
&= \frac{ \pi_k \exp\!\left( - \frac{d(x, \mu_k)}{2} + \frac{d_{\max}}{2} \right) }{ \sum_{\ell} \pi_\ell \exp\!\left( - \frac{d(x, \mu_\ell)}{2} + \frac{d_{\max}}{2} \right) } ,
\end{aligned}

where d_max = max_k d(x, μ_k).
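A direct transcription of this normalization trick reads as follows (matlab sketch with hypothetical distances and priors):

    d = 500 + 10 * rand(5, 3);                       % hypothetical squared distances (n-by-K)
    prior = [0.2 0.3 0.5];                           % hypothetical class proportions
    shift = max(d, [], 2);                           % d_max for each sample, as in the text
    num = bsxfun(@times, prior, exp(bsxfun(@minus, shift, d) / 2));
    post = bsxfun(@rdivide, num, sum(num, 2));       % normalized posteriors, rows sum to one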

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^⊤Ωβ, under the assumption that Y^⊤Y and X^⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B^⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:

\begin{aligned}
\min_{\tau \in \mathbb{R}^p} \min_{B \in \mathbb{R}^{p \times (K-1)}} \;\; & J(B) + \lambda \sum_{j=1}^{p} \frac{ w_j^2 \left\| \beta^j \right\|_2^2 }{ \tau_j } & (4.21a) \\
\text{s.t. } & \sum_{j} \tau_j - \sum_{j} w_j \left\| \beta^j \right\|_2 \le 0 & (4.21b) \\
& \tau_j \ge 0 , \quad j = 1, \ldots, p , & (4.21c)
\end{aligned}

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})^⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss ‖YΘ − XB‖²₂; for now, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero, that is, b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖β^j‖₂.

Proof. The Lagrangian of Problem (4.21) is

L = J(B) + \lambda \sum_{j=1}^{p} \frac{ w_j^2 \left\| \beta^j \right\|_2^2 }{ \tau_j } + \nu_0 \left( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \right) - \sum_{j=1}^{p} \nu_j \tau_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j are

\begin{aligned}
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
&\Leftrightarrow - \lambda w_j^2 \frac{ \left\| \beta^j \right\|_2^2 }{ \tau_j^{\star 2} } + \nu_0 - \nu_j = 0 \\
&\Leftrightarrow - \lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} - \nu_j\, \tau_j^{\star 2} = 0 \\
&\Rightarrow - \lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} = 0 .
\end{aligned}

The last line is obtained from complementary slackness, which implies here ν_j τ_j^⋆ = 0. (Complementary slackness states that ν_j g_j(τ_j^⋆) = 0, where ν_j is the Lagrange multiplier for the constraint g_j(τ_j) ≤ 0.) As a result, the optimal value of τ_j is

\tau_j^\star = \sqrt{ \frac{ \lambda w_j^2 \left\| \beta^j \right\|_2^2 }{ \nu_0 } } = \sqrt{ \frac{\lambda}{\nu_0} }\, w_j \left\| \beta^j \right\|_2 . \qquad (4.22)

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus, the inequality constraint (4.21b) is at bound (due to complementary slackness):

\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 = 0 , \qquad (4.23)

so that τ_j^⋆ = w_j ‖β^j‖₂. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso problem

\min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 . \qquad (4.24)

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as λ tr(B^⊤ΩB), where

\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \right) , \qquad (4.25)

with τ_j = w_j ‖β^j‖₂, resulting in the diagonal components

(\Omega)_{jj} = \frac{ w_j }{ \left\| \beta^j \right\|_2 } . \qquad (4.26)
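In practice, the adaptive penalty is recomputed from the current coefficients as in the following sketch (matlab; B and the weights w are hypothetical, and the division is guarded to avoid the infinite entries associated with null rows):

    B = randn(6, 2); w = ones(6, 1);       % hypothetical coefficient rows beta^j and weights
    rownorm = sqrt(sum(B.^2, 2));          % ||beta^j||_2
    omega = w ./ max(rownorm, eps);        % (Omega)_jj = w_j / ||beta^j||_2
    Omega = diag(omega);
    % a reweighted ridge step would then read: B = (X'*X + lambda*Omega) \ (X'*Y*Theta);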

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thus established for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set strategy described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²₂/τ, known as the perspective function of f(β) = ‖β‖²₂, is convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

\left\{ V \in \mathbb{R}^{p \times (K-1)} : \; V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} , \qquad (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})^⊤, defined as follows. Let S(B) denote the support of B, S(B) = { j ∈ {1, …, p} : ‖β^j‖₂ ≠ 0 }; then we have

\begin{aligned}
\forall j \in S(B) , \quad & g^j = w_j \left\| \beta^j \right\|_2^{-1} \beta^j , & (4.28) \\
\forall j \notin S(B) , \quad & \left\| g^j \right\|_2 \le w_j . & (4.29)
\end{aligned}


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖₂ ≠ 0, the gradient of the penalty with respect to β^j is

\frac{\partial}{\partial \beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right) = \lambda w_j \frac{ \beta^j }{ \left\| \beta^j \right\|_2 } . \qquad (4.30)

At ‖β^j‖₂ = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

\partial_{\beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right) = \partial_{\beta^j} \left( \lambda w_j \left\| \beta^j \right\|_2 \right) = \left\{ \lambda w_j v : \; v \in \mathbb{R}^{K-1} , \; \| v \|_2 \le 1 \right\} . \qquad (4.31)

This gives the expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B⋆ of the objective function verifying the following conditions are global minima:

\begin{aligned}
\forall j \in S^\star , \quad & \frac{\partial J(B^\star)}{\partial \beta^j} + \lambda w_j \left\| \beta^{\star j} \right\|_2^{-1} \beta^{\star j} = 0 , & (4.32a) \\
\forall j \notin S^\star , \quad & \left\| \frac{\partial J(B^\star)}{\partial \beta^j} \right\|_2 \le \lambda w_j , & (4.32b)
\end{aligned}

where S⋆ ⊆ {1, …, p} denotes the set of indices of the non-zero row vectors β^{⋆j}, and the second condition applies to its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

\begin{aligned}
B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p \times (K-1)}} \min_{\Theta \in \mathbb{R}^{K \times (K-1)}} \;\; & \frac{1}{2} \| Y\Theta - XB \|_F^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \\
\text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\end{aligned}

is equivalent to the penalized LDA problem

\begin{aligned}
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p \times (K-1)}} \;\; & \operatorname{tr}( B^\top \Sigma_B B ) \\
\text{s.t. } & B^\top \left( \Sigma_W + n^{-1} \lambda \Omega \right) B = I_{K-1} ,
\end{aligned}

where

\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \ldots, \frac{w_p^2}{\tau_p} \right) , \quad \text{with} \quad
\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^j_{\mathrm{os}} = 0 , \\ w_j \left\| \beta^j_{\mathrm{os}} \right\|_2^{-1} & \text{otherwise.} \end{cases} \qquad (4.33)

That is, B_LDA = B_OS diag(α_k^{-1} (1 − α_k²)^{-1/2}), where α_k ∈ (0, 1) is the k-th leading eigenvalue of

n^{-1}\, Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption on the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B^⊤ΩB).


      5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½ ‖YΘ − XB‖²₂.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach of Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K − 1) similar systems

\left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda \Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 , \qquad (5.1)


Figure 5.1: GLOSS block diagram.


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, …, p} : ‖β^j‖₂ > 0 };  Θ0 such that n^{-1} Θ0^⊤ Y^⊤Y Θ0 = I_{K−1};  convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖₂^{-1}
        B_A ← (X_A^⊤ X_A + λΩ)^{-1} X_A^⊤ Y Θ0
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for all j ∈ A such that ‖β^j‖₂ = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j};  go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in the complement of A
    ĵ ← argmax_{j ∉ A} ‖∂J/∂β^j‖₂
    if ‖∂J/∂β^ĵ‖₂ < λ then
        convergence ← true   (B is optimal)
    else
        A ← A ∪ {ĵ}
    end if
until convergence
(s, V) ← eigenanalyze(Θ0^⊤ Y^⊤ X_A B), that is, Θ0^⊤ Y^⊤ X_A B v_k = s_k v_k, k = 1, …, K−1
Θ ← Θ0 V;  B ← B V;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, …, K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the k-th columns of B and Θ0, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition suffices to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

\left( X^\top X + \lambda\Omega \right) B = X^\top Y \Theta . \qquad (5.2)

Defining the Cholesky decomposition as C^⊤C = X^⊤X + λΩ, (5.2) is solved efficiently as follows:

\begin{aligned}
C^\top C\, B &= X^\top Y \Theta \\
C\, B &= C^\top \backslash \left( X^\top Y \Theta \right) \\
B &= C \backslash \left( C^\top \backslash \left( X^\top Y \Theta \right) \right) , \qquad (5.3)
\end{aligned}

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
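A minimal matlab transcription of (5.3), with hypothetical sizes and penalty, is given below; the same triangular factor C serves all K − 1 right-hand sides.

    n = 50; p = 10; K = 4; lambda = 0.5;
    X = randn(n, p); Omega = eye(p);              % hypothetical data and current penalty
    YTheta = randn(n, K-1);                       % stands for Y*Theta in (5.2)
    C = chol(X' * X + lambda * Omega);            % upper triangular, C'*C = X'*X + lambda*Omega
    B = C \ (C' \ (X' * YTheta));                 % two triangular solves share one factorization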

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

B = \Omega^{-\frac{1}{2}} \left( \Omega^{-\frac{1}{2}} X^\top X\, \Omega^{-\frac{1}{2}} + \lambda I \right)^{-1} \Omega^{-\frac{1}{2}} X^\top Y \Theta^0 , \qquad (5.4)

where the conditioning of Ω^{-1/2} X^⊤X Ω^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values; our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. This eigen-analysis is actually solved in the form Θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^⊤X + Ω)^{-1}, which involves the inversion of a p × p matrix. Let Θ0 be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B0 = (X^⊤X + λΩ)^{-1} X^⊤Y Θ0. This B0 matrix can be identified in the expression to eigen-analyze, as

\Theta^{0\top} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix Θ0^⊤ Y^⊤X B0 = V Λ V^⊤. Defining Θ = Θ0 V, we have Θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ = Λ, and when Θ0 is chosen such that n^{-1} Θ0^⊤Y^⊤Y Θ0 = I_{K−1}, we also have that n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ0 to Θ, that is, B = B0 V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active-set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

\frac{1}{2} \| Y\Theta - XB \|_2^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 . \qquad (5.5)

      row of B βj is the (K minus 1)-dimensional vector

      partJ(B)

      partβj= xj

      gt(XBminusYΘ)

      where xj is the column j of X Hence the first optimality condition (432a) can becomputed for every variable j as

      xjgt

      (XBminusYΘ) + λwjβj∥∥βj∥∥

      2

¹ As X is centered, 1_K belongs to the null space of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. It is thus sufficient to choose Θ0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set Θ0 = (Y^⊤Y)^{-1/2} U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.


The second optimality condition (4.32b) can be computed for every variable j as

\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:

j^\star = \operatorname*{argmax}_{j} \; \max\!\left( \left\| x_j^\top (XB - Y\Theta) \right\|_2 - \lambda w_j ,\; 0 \right) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖₂ is small and if, after setting β^j to zero, the following optimality condition holds:

\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
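The scan for the strongest violator can be sketched as follows (matlab; all inputs are hypothetical placeholders, and B is taken null as at the start of the algorithm):

    n = 40; p = 20; K = 3; lambda = 0.3;
    X = randn(n, p); w = ones(p, 1);                         % hypothetical data and penalty weights
    Y = full(sparse((1:n)', randi(K, n, 1), 1, n, K));
    Theta = null(ones(1, K));                                % placeholder score matrix, K-by-(K-1)
    B = zeros(p, K-1); active = false(p, 1);                 % start from an empty active set
    R = X * B - Y * Theta;                                   % residual term XB - Y*Theta
    viol = sqrt(sum((X' * R).^2, 2)) - lambda * w;           % ||x_j'(XB - Y*Theta)||_2 - lambda*w_j
    viol(active) = -Inf;                                     % only inactive variables are candidates
    [vmax, jstar] = max(viol);
    if vmax > 0
      active(jstar) = true;                                  % add the strongest violator
    end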

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ, until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λ_max, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

\lambda_{\max} = \max_{j \in \{1, \ldots, p\}} \; \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ··· > λ_t > ··· > λ_T = λ_min ≥ 0, obtained by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
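The computation of λ_max and of the halving path with warm starts can be sketched as follows (matlab; the inner solver call gloss_fit is a hypothetical placeholder, left as a comment):

    n = 40; p = 20; K = 3;
    X = randn(n, p); w = ones(p, 1);                            % hypothetical data and weights
    Y = full(sparse((1:n)', randi(K, n, 1), 1, n, K));
    Theta0 = sqrt(n) * (sqrtm(Y' * Y) \ null(ones(1, K)));      % feasible initial scores
    lambda_max = max(sqrt(sum((X' * Y * Theta0).^2, 2)) ./ w);  % max_j ||x_j'*Y*Theta0||_2 / w_j
    lambdas = lambda_max * 2.^(0:-1:-10);                       % lambda_{t+1} = lambda_t / 2
    B = zeros(p, K-1);
    for t = 1:numel(lambdas)
      % B = gloss_fit(X, Y, lambdas(t), B);   % hypothetical solver call, warm-started with B
    end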


5.6 Options and Variants

5.6.1 Scaling Variables

As for most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small-sample-size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al. 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

\min_{B \in \mathbb{R}^{p \times (K-1)}} \| Y\Theta - XB \|_F^2 = \min_{B \in \mathbb{R}^{p \times (K-1)}} \operatorname{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \Sigma_T B \right)

are replaced by

\min_{B \in \mathbb{R}^{p \times (K-1)}} \operatorname{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \left( \Sigma_B + \operatorname{diag}(\Sigma_W) \right) B \right) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

    7 8 9
    4 5 6
    1 2 3

    Ω_L =
    [  3 −1  0 −1 −1  0  0  0  0
      −1  5 −1 −1 −1 −1  0  0  0
       0 −1  3  0 −1 −1  0  0  0
      −1 −1  0  5 −1  0 −1 −1  0
      −1 −1 −1 −1  8 −1 −1 −1 −1
       0 −1 −1  0 −1  5  0 −1 −1
       0  0  0 −1 −1  0  3 −1  0
       0  0  0 −1 −1 −1 −1  5 −1
       0  0  0  0 −1 −1  0 −1  3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive-semidefinite, and the penalty β^⊤Ω_L β favors, among vectors of identical L2 norm, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, which is the indicator of pixel 1 and its neighbors, and it is larger, 21, for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, which has a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
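For reference, the Laplacian of Figure 5.2 can be rebuilt programmatically as follows (matlab sketch for an 8-neighbour pixel graph; the grid size is a parameter):

    r = 3; c = 3;                                   % grid size
    idx = reshape(1:r*c, r, c);                     % pixel numbering on the grid
    E = [reshape(idx(1:end-1, :), [], 1),       reshape(idx(2:end, :), [], 1);        % vertical
         reshape(idx(:, 1:end-1), [], 1),       reshape(idx(:, 2:end), [], 1);        % horizontal
         reshape(idx(1:end-1, 1:end-1), [], 1), reshape(idx(2:end, 2:end), [], 1);    % diagonal
         reshape(idx(2:end, 1:end-1), [], 1),   reshape(idx(1:end-1, 2:end), [], 1)]; % anti-diagonal
    A = sparse(E(:, 1), E(:, 2), 1, r*c, r*c); A = A + A';   % symmetric 8-neighbour adjacency
    OmegaL = diag(sum(A, 2)) - A;                            % graph Laplacian
    v = [1 1 0 1 1 0 0 0 0]';                                % a corner pixel and its neighbours
    penalty = full(v' * OmegaL * v);                         % equals 9 on the 3-by-3 grid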


      6 Experimental Results

This chapter presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA: penalized LDA (PLDA) (Witten and Tibshirani 2011), which applies a Lasso penalty within Fisher's LDA framework, and sparse linear discriminant analysis (SLDA) (Clemmensen et al. 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.1
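As an illustration, here is a minimal MATLAB sketch of the first strategy (rescaling the observations prior to the analysis); the within-class version assumes hard labels stored in an n × K indicator matrix Y, and all variable names are ours.

```matlab
% Total-covariance normalization: set diag(Sigma_T) to ones.
Xc  = X - mean(X, 1);                    % center the features
sdT = std(Xc, 1, 1);                     % total standard deviation of each feature
XnT = Xc ./ sdT;

% Within-class normalization: set diag(Sigma_W) to ones.
R   = Xc - Y * ((Y' * Y) \ (Y' * Xc));   % residuals after removing class means
sdW = sqrt(sum(R.^2, 1) / size(X, 1));   % within-class standard deviations
XnW = Xc ./ sdW;
```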

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.
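A minimal sketch of this idea for a binary problem, assuming the discriminant scores s = Xβ have already been computed; the candidate thresholds and variable names are ours.

```matlab
% Select the decision threshold that minimizes the training error (binary case).
% s: n-by-1 vector of discriminant scores, y: n-by-1 labels in {1, 2}.
cand = sort(s);                          % candidate thresholds: the observed scores
err  = zeros(size(cand));
for t = 1:numel(cand)
  yhat   = 1 + (s > cand(t));            % classify as class 2 above the threshold
  err(t) = mean(yhat ~= y);
end
[~, best] = min(err);
threshold = cand(best);                  % empirically optimal threshold
```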

1 The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100 and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with μ_1j = 0.7 × 1_{(1≤j≤25)}, μ_2j = 0.7 × 1_{(26≤j≤50)}, μ_3j = 0.7 × 1_{(51≤j≤75)}, μ_4j = 0.7 × 1_{(76≤j≤100)}.

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j≤200)}. The covariance structure is block-diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
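A minimal MATLAB sketch of the data generation for Simulation 1 (the other setups only change the means and the covariance); the sizes follow the protocol above and the variable names are ours.

```matlab
% Generate one training set for Simulation 1.
n = 100; p = 500; K = 4;
mu = zeros(K, p);                        % class means: 0.7 on disjoint blocks of 25 variables
for k = 1:K
  mu(k, (k-1)*25+1 : k*25) = 0.7;
end
y = repmat(1:K, 1, n/K)';                % balanced class labels
X = mu(y, :) + randn(n, p);              % x_i ~ N(mu_k, I)
```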

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables and the number of discriminant directions selected on the validation set.

                                        Err. (%)      Var.           Dir.
  Sim. 1: K = 4, mean shift, ind. features
    PLDA                                12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
    SLDA                                31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
    GLOSS                               19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
    GLOSS-D                             11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
  Sim. 2: K = 2, mean shift, dependent features
    PLDA                                 9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
    SLDA                                19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
    GLOSS                               15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
    GLOSS-D                              9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
  Sim. 3: K = 4, 1D mean shift, ind. features
    PLDA                                13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
    SLDA                                57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
    GLOSS                               31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
    GLOSS-D                             18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
  Sim. 4: K = 4, mean shift, ind. features
    PLDA                                60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
    SLDA                                65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
    GLOSS                               60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
    GLOSS-D                             58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all four simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

               Simulation 1     Simulation 2     Simulation 3     Simulation 4
               TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
  PLDA         99.0    78.2     96.9    60.3     98.0    15.9     74.3    65.6
  SLDA         73.9    38.5     33.8    16.3     41.6    27.8     50.7    39.5
  GLOSS        64.1    10.6     30.0     4.6     51.1    18.2     26.0    12.1
  GLOSS-D      93.5    39.4     92.1    28.1     95.6    65.5     42.9    29.9

method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the true positive rate (TPR) and the false positive rate (FPR): the TPR is the proportion of truly relevant variables that are selected, and the FPR is the proportion of truly irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
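As a small illustration of these two criteria, the following MATLAB snippet assumes that the first 100 variables are the relevant ones (as in the four simulations) and that `selected` contains the indices returned by a given method; the names are ours.

```matlab
relevant = 1:100;                        % ground-truth relevant variables
irrelev  = 101:500;                      % remaining, irrelevant variables
TPR = numel(intersect(selected, relevant)) / numel(relevant);
FPR = numel(intersect(selected, irrelev)) / numel(irrelev);
```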

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama2 dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani 2011).

2 http://www.broadinstitute.org/cancer/software/genepattern/datasets
3 http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rate and the number of selected variables.

                                              Err. (%)         Var.
  Nakayama (n = 86, p = 22,283, K = 5)
    PLDA                                      20.95 (1.3)      10478.7 (2116.3)
    SLDA                                      25.71 (1.7)        252.5 (3.1)
    GLOSS                                     20.48 (1.4)        129.0 (18.6)
  Ramaswamy (n = 198, p = 16,063, K = 14)
    PLDA                                      38.36 (6.0)      14873.5 (720.3)
    SLDA                                        —                   —
    GLOSS                                     20.61 (6.9)        372.4 (122.1)
  Sun (n = 180, p = 54,613, K = 4)
    PLDA                                      33.78 (5.9)      21634.8 (7443.2)
    SLDA                                      36.22 (6.5)        384.4 (16.5)
    GLOSS                                     31.77 (4.5)         93.0 (93.6)

The Ramaswamy3 dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun4 dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations of the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

4 http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS (left) and SLDA (right). Top row: Nakayama, with classes 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma. Bottom row: Sun, with classes 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas. The axes are the first and second discriminant directions; the big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to introduce this prior knowledge easily.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class covariance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 in a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is clearly visible in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplacian-penalized execution of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating both digits.

Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalty parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows strokes to be detected and will probably provide better prediction results.


Figure 6.4: Discriminant direction between digits "1" and "0" (left: β for GLOSS; right: β for S-GLOSS).

Figure 6.5: Sparse discriminant direction between digits "1" and "0" (left: β for GLOSS with λ = 0.3; right: β for S-GLOSS with λ = 0.3).


      Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; this is, up to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the penalized discriminant analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


      Part III

      Sparse Clustering Analysis


      Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


      7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^⊤, …, x_n^⊤)^⊤ have been drawn identically from K different subpopulations of the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compound distribution whose density can be expressed as
\[
f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) \quad \forall i \in \{1,\dots,n\} ,
\]
where K is the number of components, f_k are the densities of the components and π_k are the mixture proportions (π_k ∈ ]0, 1[ ∀k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(· ; θ_k). The density of the mixture can then be written as
\[
f(x_i;\theta) = \sum_{k=1}^{K} \pi_k\, \phi(x_i;\theta_k) \quad \forall i \in \{1,\dots,n\} ,
\]
where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is by maximizing the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al. 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

      Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:
\[
L(\theta; X) = \log\left(\prod_{i=1}^{n} f(x_i;\theta)\right)
             = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} \pi_k f_k(x_i;\theta_k)\right) , \tag{7.1}
\]
where n is the number of samples, K is the number of components of the mixture (or number of clusters) and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:
\[
L_C(\theta; X, Y) = \log\left(\prod_{i=1}^{n} f(x_i, y_i;\theta)\right)
 = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} y_{ik}\,\pi_k f_k(x_i;\theta_k)\right)
 = \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\big(\pi_k f_k(x_i;\theta_k)\big) . \tag{7.2}
\]
The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k, and y_{ik} = 0 otherwise.

Defining the soft membership t_{ik}(θ) as
\[
t_{ik}(\theta) = p(Y_{ik} = 1 \,|\, x_i;\theta) \tag{7.3}
\]
\[
\phantom{t_{ik}(\theta)} = \frac{\pi_k f_k(x_i;\theta_k)}{f(x_i;\theta)} , \tag{7.4}
\]
to lighten notations, t_{ik}(θ) will be denoted t_{ik} when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
\[
\begin{aligned}
L_C(\theta; X, Y) &= \sum_{i,k} y_{ik}\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{i,k} y_{ik}\log\big(t_{ik} f(x_i;\theta)\big)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i,k} y_{ik}\log f(x_i;\theta)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i=1}^{n}\log f(x_i;\theta)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + L(\theta; X) ,
\end{aligned} \tag{7.5}
\]
where Σ_{i,k} y_{ik} log t_{ik} can be reformulated as
\[
\begin{aligned}
\sum_{i,k} y_{ik}\log t_{ik} &= \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\big(p(Y_{ik}=1|x_i;\theta)\big)\\
&= \sum_{i=1}^{n}\log\big(p(Y_{ik}=1|x_i;\theta)\big)\\
&= \log\big(p(Y\,|\,X;\theta)\big) .
\end{aligned}
\]
As a result, the relationship (7.5) can be rewritten as
\[
L(\theta; X) = L_C(\theta; Z) - \log\big(p(Y\,|\,X;\theta)\big) . \tag{7.6}
\]


      Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_{ik} are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6) conditionally on a current value θ^{(t)}:
\[
L(\theta; X) = \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\big[L_C(\theta; X, Y)\big]}_{Q(\theta,\theta^{(t)})}
 + \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\big[-\log p(Y|X;\theta)\big]}_{H(\theta,\theta^{(t)})} .
\]
In this expression, H(θ, θ^{(t)}) is an entropy term and Q(θ, θ^{(t)}) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ΔL = L(θ^{(t+1)}; X) − L(θ^{(t)}; X). Then θ^{(t+1)} = argmax_θ Q(θ, θ^{(t)}) also increases the log-likelihood:
\[
\Delta L = \underbrace{\big(Q(\theta^{(t+1)},\theta^{(t)}) - Q(\theta^{(t)},\theta^{(t)})\big)}_{\ge 0 \text{ by definition of iteration } t+1}
 + \underbrace{\big(H(\theta^{(t+1)},\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)})\big)}_{\ge 0 \text{ by Jensen's inequality}} \;\ge\; 0 .
\]
Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^{(t)}). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^{(t)}).

For the mixture model problem, Q(θ, θ′) is
\[
\begin{aligned}
Q(\theta,\theta') &= \mathbb{E}_{Y\sim p(Y|X;\theta')}\big[L_C(\theta; X, Y)\big]\\
&= \sum_{i,k} p(Y_{ik}=1|x_i;\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) .
\end{aligned} \tag{7.7}
\]
Q(θ, θ′), due to its similitude to the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights t_{ik}(θ′) are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^{(0)};

• E-step: evaluation of Q(θ, θ^{(t)}), using t_{ik}(θ^{(t)}) (7.4) in (7.7);

• M-step: computation of θ^{(t+1)} = argmax_θ Q(θ, θ^{(t)}).

      Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors μ_k, the mixture density is
\[
f(x_i;\theta) = \sum_{k=1}^{K} \pi_k f_k(x_i;\theta_k)
             = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}
               \exp\left\{-\tfrac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right\} .
\]
At the E-step, the posterior probabilities t_{ik} are computed as in (7.4) with the current parameters θ^{(t)}; then the M-step maximizes Q(θ, θ^{(t)}) (7.7), whose form is as follows:
\[
\begin{aligned}
Q(\theta,\theta^{(t)}) &= \sum_{i,k} t_{ik}\log(\pi_k) - \sum_{i,k} t_{ik}\log\big((2\pi)^{p/2}|\Sigma|^{1/2}\big)
 - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\\
&= \sum_{k} t_k\log(\pi_k) - \underbrace{\frac{np}{2}\log(2\pi)}_{\text{constant term}}
 - \frac{n}{2}\log(|\Sigma|) - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\\
&\equiv \sum_{k} t_k\log(\pi_k) - \frac{n}{2}\log(|\Sigma|)
 - \sum_{i,k} t_{ik}\left(\frac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right) ,
\end{aligned} \tag{7.8}
\]
where
\[
t_k = \sum_{i=1}^{n} t_{ik} . \tag{7.9}
\]
The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^{(t+1)}:
\[
\pi_k^{(t+1)} = \frac{t_k}{n} , \tag{7.10}
\]
\[
\mu_k^{(t+1)} = \frac{\sum_i t_{ik}\, x_i}{t_k} , \tag{7.11}
\]
\[
\Sigma^{(t+1)} = \frac{1}{n}\sum_k W_k , \tag{7.12}
\]
\[
\text{with}\quad W_k = \sum_i t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top . \tag{7.13}
\]

      The derivations are detailed in Appendix G
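As a summary of (7.4) and (7.10)–(7.13), the following MATLAB sketch implements one EM iteration for the common-covariance Gaussian mixture; it is a bare-bones illustration (no penalty, no safeguard against a singular Σ), and all variable names are ours.

```matlab
% One EM iteration for a K-component Gaussian mixture with common covariance.
% X: n-by-p data, pk: 1-by-K proportions, mu: K-by-p means, Sigma: p-by-p.
[n, p] = size(X);
K = numel(pk);

% E-step: posterior probabilities t_ik, Equation (7.4), via log-densities.
logdens = zeros(n, K);
L = chol(Sigma, 'lower');
for k = 1:K
  Z = (X - mu(k,:)) / L';                          % whitened residuals
  logdens(:,k) = log(pk(k)) - 0.5 * sum(Z.^2, 2) ...
                 - 0.5 * p * log(2*pi) - sum(log(diag(L)));
end
T = exp(logdens - max(logdens, [], 2));            % stabilized exponentials
T = T ./ sum(T, 2);                                % t_ik, rows sum to one

% M-step: updates (7.10)-(7.13).
tk    = sum(T, 1);                                 % soft class sizes t_k (7.9)
pk    = tk / n;                                    % proportions (7.10)
mu    = (T' * X) ./ tk';                           % weighted means (7.11)
Sigma = zeros(p);
for k = 1:K
  R     = X - mu(k,:);
  Sigma = Sigma + R' * (T(:,k) .* R);              % accumulate W_k (7.13)
end
Sigma = Sigma / n;                                 % common covariance (7.12)
```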

7.2 Feature Selection in Model-Based Clustering

When a common covariance matrix is assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989); Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:
\[
\log\left(\frac{p(Y_k=1|x)}{p(Y_\ell=1|x)}\right)
 = x^\top\Sigma^{-1}(\mu_k-\mu_\ell) - \frac{1}{2}(\mu_k+\mu_\ell)^\top\Sigma^{-1}(\mu_k-\mu_\ell) + \log\frac{\pi_k}{\pi_\ell} .
\]

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{-1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm:
\[
\lambda\sum_{k=1}^{K}\sum_{j=1}^{p} |\mu_{kj}| ,
\]
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
\[
\lambda_1\sum_{k=1}^{K}\sum_{j=1}^{p} |\mu_{kj}|
 + \lambda_2\sum_{k=1}^{K}\sum_{j=1}^{p}\sum_{m=1}^{p} \big|(\Sigma_k^{-1})_{jm}\big| .
\]
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a pairwise fusion penalty (PFP):
\[
\lambda\sum_{j=1}^{p}\sum_{1\le k\le k'\le K} |\mu_{kj} - \mu_{k'j}| .
\]
This PFP regularization does not shrink the means towards zero, but towards each other: if the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
\[
\lambda\sum_{j=1}^{p} \big\|(\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj})\big\|_\infty .
\]
One group is defined for each variable j, as the set of the K means' jth components (μ_1j, …, μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. 2008a) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

\[
\lambda\sqrt{K}\,\sum_{j=1}^{p}\sqrt{\sum_{k=1}^{K}\mu_{kj}^2} .
\]
The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing it.
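To make the differences between these penalties concrete, here is a small MATLAB sketch that evaluates them on a K × p matrix of cluster means M (the variable names are ours).

```matlab
% Evaluate the penalties of Section 7.2.1 on a K-by-p matrix of cluster means M.
[K, p] = size(M);

penL1  = sum(abs(M(:)));                                % Lasso penalty on the means
penPFP = 0;                                             % pairwise fusion penalty
for k = 1:K-1
  for kp = k+1:K
    penPFP = penPFP + sum(abs(M(k,:) - M(kp,:)));
  end
end
penLinf  = sum(max(abs(M), [], 1));                     % L1,inf penalty, per variable
penGroup = sqrt(K) * sum(sqrt(sum(M.^2, 1)));           % group-Lasso (VMG) penalty
```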

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
\[
f(x_i\,|\,\boldsymbol{\phi},\boldsymbol{\pi},\boldsymbol{\theta},\boldsymbol{\nu})
 = \sum_{k=1}^{K}\pi_k\prod_{j=1}^{p}\big[f(x_{ij}|\theta_{jk})\big]^{\phi_j}\big[h(x_{ij}|\nu_j)\big]^{1-\phi_j} ,
\]
where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al. 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher criterion
\[
\operatorname{tr}\Big(\big(U^\top\Sigma_W U\big)^{-1} U^\top\Sigma_B U\Big) \tag{7.14}
\]
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters into the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U which maximizes (7.14). This sparse approximation is defined as the solution of
\[
\min_{\tilde U\in\mathbb{R}^{p\times(K-1)}} \big\|X_U - X\tilde U\big\|_F^2 + \lambda\sum_{k=1}^{K-1}\big\|\tilde u_k\big\|_1 ,
\]
where X_U = XU is the input data projected onto the non-sparse space, and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:
\[
\begin{aligned}
\min_{A,B\in\mathbb{R}^{p\times(K-1)}}\quad & \sum_{k=1}^{K}\big\|R_W^{-\top}H_{B,k} - AB^\top H_{B,k}\big\|_2^2
 + \rho\sum_{j=1}^{K-1}\beta_j^\top\Sigma_W\beta_j + \lambda\sum_{j=1}^{K-1}\big\|\beta_j\big\|_1\\
\text{s.t.}\quad & A^\top A = I_{K-1} ,
\end{aligned}
\]
where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.

The last possibility suggests computing the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:
\[
\begin{aligned}
\min_{U\in\mathbb{R}^{p\times(K-1)}}\quad & \sum_{j=1}^{p}\big\|\Sigma_{B,j} - UU^\top\Sigma_{B,j}\big\|_2^2\\
\text{s.t.}\quad & U^\top U = I_{K-1} ,
\end{aligned}
\]
where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U to recover orthogonality. This is not necessary with the third option, since the penalized version of the SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^{(1)}: the set of selected relevant variables;

• X^{(2)}: the set of variables being considered for inclusion into or exclusion from X^{(1)};

• X^{(3)}: the set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:
\[
f(X|Y) = f\big(X^{(1)},X^{(2)},X^{(3)}\,\big|\,Y\big)
 = f\big(X^{(3)}\,\big|\,X^{(2)},X^{(1)}\big)\,f\big(X^{(2)}\,\big|\,X^{(1)}\big)\,f\big(X^{(1)}\,\big|\,Y\big)
\]

• M2:
\[
f(X|Y) = f\big(X^{(1)},X^{(2)},X^{(3)}\,\big|\,Y\big)
 = f\big(X^{(3)}\,\big|\,X^{(2)},X^{(1)}\big)\,f\big(X^{(2)},X^{(1)}\,\big|\,Y\big)
\]

Model M1 means that the variables in X^{(2)} are independent of the clustering Y; model M2 states that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding on the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor
\[
B_{12} = \frac{f(X|M_1)}{f(X|M_2)} ,
\]
where the high-dimensional f(X^{(3)}|X^{(2)},X^{(1)}) cancels from the ratio:
\[
B_{12} = \frac{f\big(X^{(1)},X^{(2)},X^{(3)}\,\big|\,M_1\big)}{f\big(X^{(1)},X^{(2)},X^{(3)}\,\big|\,M_2\big)}
       = \frac{f\big(X^{(2)}\,\big|\,X^{(1)},M_1\big)\,f\big(X^{(1)}\,\big|\,M_1\big)}{f\big(X^{(2)},X^{(1)}\,\big|\,M_2\big)} .
\]

This factor is approximated, since the integrated likelihoods f(X^{(1)}|M_1) and f(X^{(2)},X^{(1)}|M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)}|X^{(1)},M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}; there is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


      8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to derive reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty; as with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
\[
d(x_i,\mu_k) = (x_i-\mu_k)^\top\Sigma_W^{-1}(x_i-\mu_k) ,
\]
where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.

The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
\[
2\,l_{\mathrm{weight}}(\mu,\Sigma) = -\sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\, d(x_i,\mu_k) - n\log(|\Sigma_W|) ,
\]
which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to the Fisher discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
\[
d(x_i,\mu_k) = \big\|(x_i-\mu_k)B_{\mathrm{LDA}}\big\|_2^2 - 2\log(\pi_k) .
\]
This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y\Theta ,
\]
where Θ are the K − 1 leading eigenvectors of
\[
Y^\top X\big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y .
\]

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag(α_k^{-1}(1 − α_k²)^{-1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik, with
\[
t_{ik} \propto \exp\left[-\frac{d(x_i,\mu_k) - 2\log(\pi_k)}{2}\right] . \tag{8.1}
\]

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, of this alternative view of the EM algorithm for Gaussian mixtures.
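A minimal MATLAB sketch of step 2, using a plain quadratic penalty matrix Omega as a stand-in for the adaptive penalty handled by GLOSS; the scaling matrix D of step 3 is omitted, since the α_k are defined in Chapter 4, and all variable names are ours.

```matlab
% Penalized optimal scoring solve of step 2 (quadratic penalty as a stand-in
% for the working penalty of GLOSS).
% X: n-by-p centered data, Y: n-by-K membership matrix (hard or soft),
% Omega: p-by-p penalty matrix, lambda: penalty parameter.
K = size(Y, 2);
A = X' * X + lambda * Omega;              % penalized Gram matrix
G = A \ (X' * Y);                         % A^{-1} X'Y
M = Y' * X * G;                           % K-by-K matrix whose eigenvectors give Theta
[V, E] = eig((M + M') / 2);               % symmetrize before the eigen-decomposition
[~, order] = sort(diag(E), 'descend');
Theta = V(:, order(1:K-1));               % K-1 leading eigenvectors
B_OS  = G * Theta;                        % p-by-(K-1) coefficient matrix
```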

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS problem. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7), so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized OS problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
\[
f(\Sigma|\Lambda_0,\nu_0) = \frac{1}{2^{np/2}\,|\Lambda_0|^{n/2}\,\Gamma_p(\frac{n}{2})}\;
|\Sigma^{-1}|^{\frac{\nu_0-p-1}{2}} \exp\left\{-\frac{1}{2}\operatorname{tr}\!\big(\Lambda_0^{-1}\Sigma^{-1}\big)\right\} ,
\]
where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as
\[
\Gamma_p(n/2) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\!\big(n/2 + (1-j)/2\big) .
\]

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
\[
\begin{aligned}
Q(\theta,\theta') &+ \log\big(f(\Sigma|\Lambda_0,\nu_0)\big)\\
&= \sum_{k=1}^{K} t_k\log\pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p-1)}{4}\log(\pi)
 - \sum_{j=1}^{p}\log\Gamma\!\left(\frac{n}{2}+\frac{1-j}{2}\right)
 - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\!\big(\Lambda_n^{-1}\Sigma^{-1}\big)\\
&\equiv \sum_{k=1}^{K} t_k\log\pi_k - \frac{n}{2}\log|\Lambda_0|
 - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\!\big(\Lambda_n^{-1}\Sigma^{-1}\big) ,
\end{aligned} \tag{8.2}
\]
with
\[
t_k = \sum_{i=1}^{n} t_{ik} , \qquad \nu_n = \nu_0 + n , \qquad
\Lambda_n^{-1} = \Lambda_0^{-1} + S_0 , \qquad
S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top .
\]

Details of these calculations can be found in textbooks (for example Bishop 2006; Gelman et al. 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by deriving (8.2) with respect to Σ; the details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is
\[
\widehat{\Sigma}_{\mathrm{MAP}} = \frac{1}{\nu_0+n-p-1}\big(\Lambda_0^{-1} + S_0\big) , \tag{8.3}
\]
where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class covariance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and setting Λ_0^{-1} = λΩ, where Ω is the penalty matrix of the group-Lasso regularization (4.25).


      9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates the local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

Figure 9.1: Mix-GLOSS loops scheme.

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have checked that the warm-start implemented here reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage


of variables that will be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0, Y ← K-means(X, K)
Run non-penalized Mix-GLOSS: λ ← 0, (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ:
        compute the gradient at β_j = 0:  ∂J(B)/∂β_j |_{β_j=0} = x_j^⊤ ( Σ_{m≠j} x_m β_m − YΘ )
        compute λ_max for every feature using (4.32b):  λ_max^j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j=0} ‖_2
        choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS: (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
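A minimal MATLAB sketch of the λ_max computation used above, at the fully sparse point B = 0 (so that the residual is simply −YΘ); the weights w_j are taken equal to one here, which is an assumption, since their exact definition is given with (4.32b) in Chapter 5.

```matlab
% Gradient of the data-fitting term at B = 0 and per-feature lambda_max.
% X: n-by-p centered data, YTheta: n-by-(K-1) scaled indicator matrix Y*Theta.
w = ones(size(X, 2), 1);                  % assumed unit group weights
G = X' * YTheta;                          % minus the gradient at B = 0, p-by-(K-1)
lambda_max = sqrt(sum(G.^2, 2)) ./ w;     % ||gradient_j||_2 / w_j for each feature j
% A trial lambda can then be chosen among these values, e.g. so that about
% 10% of the currently relevant features would be removed.
```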

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) are available then
        B_OS ← B0, Y ← Y0
    else
        B_OS ← 0, Y ← K-means(X, K)
    end if
    convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{-1}(1 − α²)^{-1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y

      M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression; here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

      E-Step

The E-step evaluates the posterior probability matrix T, using
\[
t_{ik} \propto \exp\left[-\frac{d(x_i,\mu_k) - 2\log(\pi_k)}{2}\right] .
\]
The convergence of these t_ik is used as the stopping criterion for EM.

      92 Model Selection

      Here model selection refers to the choice of the penalty parameter Up to now wehave not conducted experiments where the number of clusters has to be automaticallyselected

      In a first attempt we tried a classical structure where clustering was performed severaltimes from different initializations for all penalty parameter values Then using the log-likelihood criterion the best repetition for every value of the penalty parameter waschosen The definitive λ was selected by means of the stability criterion described byLange et al (2002) This algorithm took lots of computing resources since the stabilityselection mechanism required a certain number of repetitions that transformed Mix-GLOSS in a lengthy four nested loops structure

      In a second attempt we replaced the stability based model selection algorithm by theevaluation of a modified version of BIC (Pan and Shen 2007) This version of BIC lookslike the traditional one (Schwarz 1978) but takes into consideration the variables thathave been removed This mechanism even if it turned out to be faster required alsolarge computation time

The third and definitive attempt (up to now) proceeds with several runs of Mix-GLOSS for the non-penalized case (λ = 0). The run with the best log-likelihood is chosen, so the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized run are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter.


Figure 9.2: Mix-GLOSS model selection diagram. The initial Mix-GLOSS is run with λ = 0 and 20 repetitions; the B and T of the best repetition are used as StartB and StartT to warm-start Mix-GLOSS for every λ; BIC is computed for each fit and λ is chosen as the minimizer of BIC.

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.
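A compact sketch of this selection scheme is given below. It is only a skeleton, under the assumption that two callables exist: a hypothetical mix_gloss(X, K, lam, B0, Y0) returning a fit with its log-likelihood, coefficients and posteriors, and a hypothetical bic(fit) implementing the modified BIC mentioned above; neither is part of the thesis code.

    def select_lambda(X, K, lam_grid, mix_gloss, bic, n_rep=20):
        """Model selection as in Figure 9.2 (sketch): repeat the unpenalized
        problem, keep the best run, warm-start one penalized run per lambda,
        and return the fit with the smallest BIC."""
        best = max((mix_gloss(X, K, lam=0.0) for _ in range(n_rep)),
                   key=lambda fit: fit["loglik"])      # best of n_rep runs at lambda = 0
        scored = []
        for lam in lam_grid:                           # one warm-started run per lambda
            fit = mix_gloss(X, K, lam=lam, B0=best["B"], Y0=best["T"])
            scored.append((bic(fit), lam, fit))
        return min(scored, key=lambda t: t[0])         # (BIC, lambda, fit) minimizing BIC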


      10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust / Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al. 2009b) is software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al. 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010) they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The LumiWCluster package allows clustering to be performed using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see Section 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational form of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are:

• Clustering Error (in percentage): To measure the quality of the partition with a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007); a sketch of this computation is given after this list. If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined makes it possible to obtain the ideal 0% clustering error even if the IDs of the clusters and of the real classes differ.

• Number of Discarded Features: This value shows the number of variables whose coefficients have been zeroed, and that are therefore not used in the partitioning. In our datasets only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
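The sketch announced in the first item above computes a label-permutation-invariant clustering error: cluster IDs are matched to class IDs by the assignment that maximizes agreement. It is only an illustration using SciPy's Hungarian algorithm; the exact definition of Wu and Schölkopf (2007) may differ in details.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_error(y_true, y_pred):
        """Fraction of misclassified points under the best one-to-one matching
        between cluster IDs and class IDs (label-permutation invariant)."""
        classes, clusters = np.unique(y_true), np.unique(y_pred)
        # contingency[i, j] = number of points of class i assigned to cluster j
        C = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                      for c in classes])
        row, col = linear_sum_assignment(-C)        # maximize matched points
        return 1.0 - C[row, col].sum() / len(y_true)

    # identical partitions with different IDs give 0% error
    print(clustering_error(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))  # 0.0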

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and high clustering error, respectively, and as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.

Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).


Table 10.1: Experimental results for simulated data

                        Err (%)       Var           Time
  Sim 1: K = 4, mean shift, ind. features
    CS general cov      46 (15)       985 (72)      884h
    Fisher EM           58 (87)       784 (52)      1645m
    Clustvarsel         602 (107)     378 (291)     383h
    LumiWCluster-Kuan   42 (68)       779 (4)       389s
    LumiWCluster-Wang   43 (69)       784 (39)      619s
    Mix-GLOSS           32 (16)       80 (09)       15h
  Sim 2: K = 2, mean shift, dependent features
    CS general cov      154 (2)       997 (09)      783h
    Fisher EM           74 (23)       809 (28)      8m
    Clustvarsel         73 (2)        334 (207)     166h
    LumiWCluster-Kuan   64 (18)       798 (04)      155s
    LumiWCluster-Wang   63 (17)       799 (03)      14s
    Mix-GLOSS           77 (2)        841 (34)      2h
  Sim 3: K = 4, 1D mean shift, ind. features
    CS general cov      304 (57)      55 (468)      1317h
    Fisher EM           233 (65)      366 (55)      22m
    Clustvarsel         658 (115)     232 (291)     542h
    LumiWCluster-Kuan   323 (21)      80 (02)       83s
    LumiWCluster-Wang   308 (36)      80 (02)       1292s
    Mix-GLOSS           347 (92)      81 (88)       21h
  Sim 4: K = 4, mean shift, ind. features
    CS general cov      626 (55)      999 (02)      112h
    Fisher EM           567 (104)     55 (48)       195m
    Clustvarsel         732 (4)       24 (12)       767h
    LumiWCluster-Kuan   692 (112)     99 (2)        876s
    LumiWCluster-Wang   697 (119)     991 (21)      825s
    Mix-GLOSS           669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions for the best performing algorithms

                Simulation 1    Simulation 2    Simulation 3    Simulation 4
                TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
    Mix-GLOSS   992    015      828    335      884    67       780    12
    Lumi-Kuan   992    28       1000   02       1000   005      50     005
    Fisher-EM   986    24       888    17       838    5825     620    4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (Mix-GLOSS, Lumi-Kuan, Fisher-EM) on the four simulations.

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu 2008; Kuan et al. 2010) is by far the fastest method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu 2008; Kuan et al. 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al. 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations; Fisher EM (Bouveyron and Brunet 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best solution in terms of fall-out and recall.


      Conclusions


      Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational form of the group-Lasso penalty to preserve this equivalence, allowing penalized optimal scoring regressions to be used for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without sacrificing predictive capabilities. GLOSS has been tested with four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement the model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

      Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible. Identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al. 2011), and Stirling faces (Roth and Lange 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet 2012a), have also been used in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; that can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the true criterion have been used to stop the EM algorithm and to perform model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a significant effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for detecting outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


      Appendix


      A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:

$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,$$

$$\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .$$

Property 2. $\frac{\partial x^\top a}{\partial x} = \frac{\partial a^\top x}{\partial x} = a$

Property 3. $\frac{\partial x^\top A x}{\partial x} = (A + A^\top)x$

Property 4. $\frac{\partial |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top$

Property 5. $\frac{\partial a^\top X b}{\partial X} = ab^\top$

Property 6. $\frac{\partial}{\partial X}\,\mathrm{tr}\big(AX^{-1}B\big) = -(X^{-1}BAX^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}$
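These identities are easy to check numerically. The sketch below is a small finite-difference verification of Property 6 on random matrices; it is only an illustration (matrix sizes, the perturbed entry and the tolerance are arbitrary choices).

    import numpy as np

    rng = np.random.default_rng(4)
    p = 4
    A, B = rng.normal(size=(p, p)), rng.normal(size=(p, p))
    X = np.eye(p) + 0.1 * rng.normal(size=(p, p))     # well-conditioned, invertible

    def f(X):
        return np.trace(A @ np.linalg.inv(X) @ B)

    # analytic gradient from Property 6: -(X^{-1} B A X^{-1})^T
    Xinv = np.linalg.inv(X)
    grad = -(Xinv @ B @ A @ Xinv).T

    # central finite difference on one arbitrary entry (i, j)
    eps, (i, j) = 1e-6, (1, 2)
    E = np.zeros_like(X); E[i, j] = eps
    print(np.isclose((f(X + E) - f(X - E)) / (2 * eps), grad[i, j], rtol=1e-4))  # True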


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has this form:

$$\min_{\theta_k, \beta_k} \; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \quad \text{s.t.} \quad \theta_k^\top Y^\top Y\theta_k = 1, \;\; \theta_\ell^\top Y^\top Y\theta_k = 0 \;\; \forall \ell < k ,$$   (B.1)

for k = 1, ..., K − 1. The Lagrangian associated with Problem (B.1) is

$$L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k(\theta_k^\top Y^\top Y\theta_k - 1) + \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y\theta_k .$$   (B.2)

Setting to zero the gradient of (B.2) with respect to β_k gives the value of the optimal β_k*:

$$\beta_k^\star = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k .$$   (B.3)

The objective function of (B.1) evaluated at β_k* is

$$\min_{\theta_k} \; \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star = \min_{\theta_k} \; \theta_k^\top Y^\top\big(I - X(X^\top X + \Omega_k)^{-1}X^\top\big)Y\theta_k = \max_{\theta_k} \; \theta_k^\top Y^\top X(X^\top X + \Omega_k)^{-1}X^\top Y\theta_k .$$   (B.4)

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, such that we can rewrite expression (B.4) in a compact way:

$$\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \; \mathrm{tr}\big(\Theta^\top M\Theta\big) \quad \text{s.t.} \quad \Theta^\top Y^\top Y\Theta = I_{K-1} .$$   (B.5)

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1) × (K−1) matrix M_Θ be Θ^⊤MΘ. Hence the classical eigenvector formulation associated with (B.5) is

$$M_\Theta v = \lambda v ,$$   (B.6)

where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,

$$v^\top M_\Theta v = \lambda \iff v^\top \Theta^\top M\Theta v = \lambda .$$

Making the variable change w = Θv, we obtain an alternative eigen-problem where the w are the eigenvectors of M and λ the associated eigenvalues:

$$w^\top M w = \lambda .$$   (B.7)

Therefore v are the eigenvectors of the eigen-decomposition of matrix M_Θ and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K−1) × (K−1) matrix M_Θ and the K × K matrix M is the K × (K−1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{-1}X^⊤YΘ in M_Θ:

$$M_\Theta = \Theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top XB .$$

To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB. Then the definitive eigenvectors w are recovered by applying w = Θv. The final step is the reconstruction of the optimal score matrix Θ* using the vectors w as its columns. At this point we understand what the literature calls "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The B matrix also needs to be "updated" by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

$$B^\star = (X^\top X + \Omega)^{-1}X^\top Y\Theta V = BV .$$
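The following NumPy sketch illustrates this update, assuming X, Y, an initial score matrix Theta and the regression coefficients B (as returned by the penalized regression) are available; the symmetrization and the eigenvalue ordering are implementation choices, not part of the derivation.

    import numpy as np

    def update_scores(X, Y, Theta, B):
        """Eigen-decompose the small (K-1)x(K-1) matrix M_Theta = Theta^T Y^T X B
        instead of the p x p form, then rotate Theta and B by the eigenvectors V."""
        M = Theta.T @ Y.T @ X @ B              # (K-1) x (K-1), cheap to form
        M = 0.5 * (M + M.T)                    # symmetric up to numerical error
        eigval, V = np.linalg.eigh(M)          # ascending eigenvalues
        V = V[:, ::-1]                         # decreasing order of eigenvalues
        return Theta @ V, B @ V                # "updated" score and coefficient matrices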


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of M (called w in (B.7)) form a basis, so that any score vector θ_k can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \quad \text{s.t.} \quad \theta_k^\top\theta_k = 1 .$$   (B.8)

The score vectors' normalization constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis,

$$\Big(\sum_{m=1}^{K-1} \alpha_m w_m\Big)^\top \Big(\sum_{m=1}^{K-1} \alpha_m w_m\Big) = 1 ,$$

which, as per the eigenvector properties, can be reduced to

$$\sum_{m=1}^{K-1} \alpha_m^2 = 1 .$$   (B.9)

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

$$M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m .$$

As the w_m are the eigenvectors of M, the relationship M w_m = λ_m w_m can be used to obtain

$$M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m .$$

Left-multiplying by θ_k^⊤, written as its linear combination of eigenvectors,

$$\theta_k^\top M\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\Big) .$$

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤w_m is zero for any ℓ ≠ m, giving

$$\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m .$$


The optimization problem (B.5) for discriminant direction k can be rewritten as

$$\max_{\theta_k \in \mathbb{R}^{K\times 1}} \; \theta_k^\top M\theta_k = \max_{\theta_k \in \mathbb{R}^{K\times 1}} \; \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \quad \text{with} \quad \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad \text{and} \quad \sum_{m=1}^{K-1}\alpha_m^2 = 1 .$$   (B.10)

One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K-1} α_m w_m, the resulting score vector θ_k will be equal to the kth eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

$$\max_{\beta \in \mathbb{R}^p} \; \beta^\top\Sigma_B\beta \quad \text{s.t.} \quad \beta^\top\Sigma_W\beta = 1 ,$$   (C.1a, C.1b)

where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

$$L(\beta, \nu) = \beta^\top\Sigma_B\beta - \nu(\beta^\top\Sigma_W\beta - 1) ,$$

so that its first derivative with respect to β is

$$\frac{\partial L(\beta, \nu)}{\partial \beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta .$$

A necessary optimality condition for β* is that this derivative is zero, that is,

$$\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star .$$

Provided Σ_W is full rank, we have

$$\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star .$$   (C.2)

Thus the solutions β* match the definition of an eigenvector of matrix Σ_W^{-1}Σ_B with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\,\beta^{\star\top}\Sigma_W\beta^\star \;\;\text{from (C.2)}\; = \nu \;\;\text{from (C.1b)} .$$

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of Σ_W^{-1}Σ_B and β* is any eigenvector corresponding to this maximal eigenvalue.
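In practice, (C.2) is conveniently solved as a generalized symmetric eigenproblem rather than by forming Σ_W^{-1}. The sketch below is one way to do this with SciPy, assuming Σ_B and Σ_W are given as symmetric matrices with Σ_W positive definite (the variable names are illustrative).

    import numpy as np
    from scipy.linalg import eigh

    def fisher_direction(Sigma_B, Sigma_W):
        """First discriminant direction: eigenvector of Sigma_W^{-1} Sigma_B for
        the largest eigenvalue, computed as the generalized eigenproblem
        Sigma_B beta = nu Sigma_W beta (Problem C.1)."""
        nu, betas = eigh(Sigma_B, Sigma_W)     # generalized eigh, ascending eigenvalues
        beta = betas[:, -1]                    # eigenvector of the largest eigenvalue
        # eigh normalizes the eigenvectors so that beta^T Sigma_W beta = 1,
        # which is exactly constraint (C.1b)
        return nu[-1], beta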


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p\times K-1}} \; J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j}$$   (D.1a)
$$\text{s.t.} \quad \sum_{j=1}^p \tau_j = 1 ,$$   (D.1b)
$$\qquad \tau_j \geq 0, \;\; j = 1, \ldots, p .$$   (D.1c)

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×K−1} be a matrix composed of row vectors β^j ∈ R^{K−1}: B = (β^{1⊤}, ..., β^{p⊤})^⊤.

$$L(B, \tau, \lambda, \nu_0, \nu) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^p \tau_j - 1\Big) - \sum_{j=1}^p \nu_j\tau_j .$$   (D.2)

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j*:

$$\frac{\partial L(B, \tau, \lambda, \nu_0, \nu)}{\partial \tau_j}\bigg|_{\tau_j = \tau_j^\star} = 0 \;\Rightarrow\; -\lambda w_j^2\frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0 \;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0 \;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0 .$$

The last two expressions are related through a property of the Lagrange multipliers stating that ν_j g_j(τ*) = 0, where ν_j is the Lagrange multiplier and g_j(τ) the inequality constraint. Then the optimal τ_j* can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j\|\beta^j\|_2 .$$

Placing this optimal value of τ_j* into constraint (D.1b):

$$\sum_{j=1}^p \tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j=1}^p w_j\|\beta^j\|_2} .$$   (D.3)


With this value of τ_j*, Problem (D.1) is equivalent to

$$\min_{B \in \mathbb{R}^{p\times K-1}} \; J(B) + \lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^2 .$$   (D.4)

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently presented as λB^⊤ΩB, where

$$\Omega = \mathrm{diag}\Big(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\Big) .$$   (D.5)

Using the value of τ_j* from (D.3), each diagonal component of Ω is

$$(\Omega)_{jj} = \frac{w_j\sum_{\ell=1}^p w_\ell\|\beta^\ell\|_2}{\|\beta^j\|_2} .$$   (D.6)

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×K−1}, the subdifferential of the objective function of Problem (D.4) is

$$\bigg\{ V \in \mathbb{R}^{p\times K-1} : V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)\,G \bigg\} ,$$   (D.7)

where G = (g^{1⊤}, ..., g^{p⊤})^⊤ is a p × (K−1) matrix defined as follows. Let S(B) denote the support of B, that is, the set of its non-zero row vectors, S(B) = { j ∈ 1, ..., p : ||β^j||_2 ≠ 0 }; then we have

$$\forall j \in S(B), \quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j ,$$   (D.8)
$$\forall j \notin S(B), \quad \|g^j\|_2 \leq w_j .$$   (D.9)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B* of the objective function verifying the following conditions are global minima. Let S(B*) denote the support of B*, S(B*) = { j ∈ 1, ..., p : ||β*^j||_2 ≠ 0 }, and let its complement be the set of remaining indices; then we have

$$\forall j \in S(B^\star), \quad -\frac{\partial J(B^\star)}{\partial \beta^j} = 2\lambda\Big(\sum_{\ell=1}^{p} w_\ell\|\beta^{\star\ell}\|_2\Big)\, w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} ,$$   (D.10a)
$$\forall j \notin S(B^\star), \quad \bigg\|\frac{\partial J(B^\star)}{\partial \beta^j}\bigg\|_2 \leq 2\lambda\, w_j\Big(\sum_{\ell=1}^{p} w_\ell\|\beta^{\star\ell}\|_2\Big) .$$   (D.10b)

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at τ* such that

$$\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j=1}^p w_j\|\beta^j\|_2} .$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have

$$\Big(\sum_{j=1}^p w_j\|\beta^j\|_2\Big)^2 = \Bigg(\sum_{j=1}^p \tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Bigg)^2 \leq \Big(\sum_{j=1}^p \tau_j\Big)\Bigg(\sum_{j=1}^p \frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Bigg) \leq \sum_{j=1}^p \frac{w_j^2\|\beta^j\|_2^2}{\tau_j} ,$$

where we used the Cauchy-Schwarz inequality in the second step and the definition of the feasibility set of τ in the last one.
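The bound and the equality at τ* are easy to check numerically. The following is a small NumPy illustration (random B, random weights; the sizes and seeds are arbitrary), not part of the thesis code.

    import numpy as np

    rng = np.random.default_rng(1)
    p, K = 30, 4
    B = rng.normal(size=(p, K - 1))
    w = rng.uniform(0.5, 2.0, size=p)
    norms = np.linalg.norm(B, axis=1)                    # ||beta^j||_2

    group_lasso_sq = (w @ norms) ** 2                    # (sum_j w_j ||beta^j||)^2

    tau_star = w * norms / np.sum(w * norms)             # optimal tau from (D.3)
    variational = np.sum(w**2 * norms**2 / tau_star)     # sum_j w_j^2 ||beta^j||^2 / tau_j

    tau_rand = rng.dirichlet(np.ones(p))                 # arbitrary feasible tau
    upper = np.sum(w**2 * norms**2 / tau_rand)

    print(np.isclose(group_lasso_sq, variational))       # True: equality at tau*
    print(upper >= group_lasso_sq - 1e-12)               # True: upper bound elsewhere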


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1) because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B^0 are optimal for the score values Θ^0, and if the optimal scores Θ* are obtained by a unitary transformation of Θ^0, say Θ* = Θ^0 V (where V ∈ R^{M×M} is a unitary matrix), then B* = B^0 V is optimal conditionally on Θ*, that is, (Θ*, B*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let B* be a solution of

$$\min_{B \in \mathbb{R}^{p\times M}} \; \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 ,$$   (E.1)

and let Ỹ = YV, where V ∈ R^{M×M} is a unitary matrix. Then B̃ = B*V is a solution of

$$\min_{B \in \mathbb{R}^{p\times M}} \; \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 .$$   (E.2)

Proof. The first-order necessary optimality conditions for B* are

$$\forall j \in S(B^\star), \quad 2x_j^\top\big(XB^\star - Y\big) + \lambda w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = 0 ,$$   (E.3a)
$$\forall j \notin S(B^\star), \quad 2\big\|x_j^\top\big(XB^\star - Y\big)\big\|_2 \leq \lambda w_j ,$$   (E.3b)

where S(B*) ⊆ {1, ..., p} denotes the set of non-zero row vectors of B* and its complement is the set of remaining indices.

First, we note that, from the definition of B̃, we have S(B̃) = S(B*). Then we may rewrite the above conditions as follows:

$$\forall j \in S(\tilde{B}), \quad 2x_j^\top\big(X\tilde{B} - \tilde{Y}\big) + \lambda w_j\|\tilde{\beta}^{j}\|_2^{-1}\tilde{\beta}^{j} = 0 ,$$   (E.4a)
$$\forall j \notin S(\tilde{B}), \quad 2\big\|x_j^\top\big(X\tilde{B} - \tilde{Y}\big)\big\|_2 \leq \lambda w_j ,$$   (E.4b)

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, and also uses that VV^⊤ = I, so that ∀u ∈ R^M, ||u^⊤||_2 = ||u^⊤V||_2. Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for B̃ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
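The invariance itself can be illustrated without solving any optimization problem: rotating Y by a unitary V and the candidate B by the same V leaves the group-Lasso OS objective unchanged, because the Frobenius norm and the row norms are unitarily invariant. The sketch below is a toy numeric check with arbitrary random data (names and sizes are illustrative).

    import numpy as np
    from scipy.stats import ortho_group

    rng = np.random.default_rng(2)
    n, p, M = 50, 10, 3
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, M))
    B = rng.normal(size=(p, M))
    w = np.ones(p)
    lam = 0.7

    def objective(Y, B):
        """Group-Lasso optimal scoring objective of (E.1)."""
        return (np.linalg.norm(Y - X @ B, "fro") ** 2
                + lam * np.sum(w * np.linalg.norm(B, axis=1)))

    V = ortho_group.rvs(M, random_state=3)          # random orthogonal (unitary) matrix
    print(np.isclose(objective(Y, B), objective(Y @ V, B @ V)))  # True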


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:

$$L(\theta) = \sum_{i=1}^n \log\Big(\sum_{k=1}^K \pi_k f_k(x_i; \theta_k)\Big) ,$$   (F.1)

$$Q(\theta, \theta') = \sum_{i=1}^n\sum_{k=1}^K t_{ik}(\theta')\log\big(\pi_k f_k(x_i; \theta_k)\big) ,$$   (F.2)

$$\text{with} \quad t_{ik}(\theta') = \frac{\pi_k' f_k(x_i; \theta_k')}{\sum_\ell \pi_\ell' f_\ell(x_i; \theta_\ell')} .$$   (F.3)

In the EM algorithm, θ′ denotes the model parameters at the previous iteration, t_ik(θ′) are the posterior probabilities computed from θ′ at the previous E-step, and θ (without prime) denotes the parameters of the current iteration, to be obtained by maximizing Q(θ, θ′).

Using (F.3), we have

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i; \theta_k)\big) = \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_\ell \pi_\ell f_\ell(x_i; \theta_\ell)\Big) = \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta) .$$

In particular, after the evaluation of the t_ik in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

$$L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) = Q(\theta, \theta) + H(T) .$$
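The identity L(θ) = Q(θ, θ) + H(T) can be verified on a toy Gaussian mixture; the sketch below is only an illustration with arbitrary parameters, not related to the thesis data.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 2))
    pi = np.array([0.3, 0.7])
    mu = np.array([[0.0, 0.0], [1.0, -1.0]])
    Sigma = np.eye(2)

    # component densities f_k(x_i) and posteriors t_ik (E-step at theta)
    F = np.column_stack([multivariate_normal.pdf(X, mean=m, cov=Sigma) for m in mu])
    T = pi * F
    T /= T.sum(axis=1, keepdims=True)

    loglik = np.sum(np.log(F @ pi))                    # direct definition (F.1)
    Q = np.sum(T * np.log(pi * F))                     # Q(theta, theta) as in (F.2)
    H = -np.sum(T * np.log(T))                         # entropy of the posteriors
    print(np.isclose(loglik, Q + H))                   # True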


      G Derivation of the M-Step Equations

This appendix shows the whole process of obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i; \theta_k)\big) = \sum_{k}\Big(\sum_i t_{ik}\Big)\log\pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}(x_i - \mu_k)^\top\Sigma^{-1}(x_i - \mu_k) ,$$

which has to be maximized subject to Σ_k π_k = 1.

The Lagrangian of this problem is

$$\mathcal{L}(\theta) = Q(\theta, \theta') + \lambda\Big(\sum_k \pi_k - 1\Big) .$$

Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.

G.1 Prior Probabilities

$$\frac{\partial \mathcal{L}(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0 ,$$

where λ is identified from the constraint, leading to

$$\pi_k = \frac{1}{n}\sum_i t_{ik} .$$


G.2 Means

$$\frac{\partial \mathcal{L}(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2}\sum_i t_{ik}\, 2\Sigma^{-1}(\mu_k - x_i) = 0 \;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}x_i}{\sum_i t_{ik}} .$$

G.3 Covariance Matrix

$$\frac{\partial \mathcal{L}(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0 \;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top .$$
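For reference, the three closed-form updates just derived translate directly into code. The following is a minimal NumPy sketch (not the thesis implementation) computing π_k, μ_k and the common Σ from a matrix of posteriors T of shape n × K.

    import numpy as np

    def m_step(X, T):
        """Closed-form updates derived above: priors, means and the common
        covariance matrix of the Gaussian mixture, given posteriors T (n x K)."""
        n, p = X.shape
        Nk = T.sum(axis=0)                       # sum_i t_ik, one value per component
        pi = Nk / n                              # pi_k = (1/n) sum_i t_ik
        mu = (T.T @ X) / Nk[:, None]             # mu_k = sum_i t_ik x_i / sum_i t_ik
        Sigma = np.zeros((p, p))
        for k in range(T.shape[1]):
            diff = X - mu[k]
            Sigma += (T[:, k, None] * diff).T @ diff
        return pi, mu, Sigma / n                 # Sigma = (1/n) sum_{i,k} t_ik (x_i-mu_k)(x_i-mu_k)^T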


      Bibliography

      F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

      F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

      F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

      J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

      A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

      H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations Bernoulli 10(6)989–1010 2004

C Biernacki G Celeux G Govaert and F Langrognet MIXMOD Statistical Documentation http://www.mixmod.org 2008

      C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

      C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

      C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

      S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

      L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

      L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


      T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

      S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

      C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

      B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

      L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

      C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

      A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

      D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

      R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

      B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

      Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

      R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

      V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

      J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


      J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

      J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

      W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

      A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chap-man amp HallCRC 2003

      D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation baseline for clustering Technical Report D7.1-m12 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D7.1-m12.pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implementations of original clustering Technical Report D7.2-m24 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D7.2-m24.pdf 2011

      Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

      Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

      L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

      Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

      J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

      I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

      T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

      T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


      T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

      A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

      J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

      T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

      K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

      P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

      T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

      M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

      Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

      C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

      C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

      H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

      J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

      Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

      C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and M L Martin-Magniette SelvarClust: software for variable selection in model-based clustering http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html 2009b

      L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

      N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

      B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

      B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

      Y Nesterov Gradient methods for minimizing composite functions preprint 2007

      S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

      B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

      M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

      M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

      W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

      W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

      K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

      S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


      Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

      A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

      C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

      S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

      V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

      V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

      V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

      C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

      L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

      Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

      A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

      S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

      P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour l'Integration de Connaissances dans les Modeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


      M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

      R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

      J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

      S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

      D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

      D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

      D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

      M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

      MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

      T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

      B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

      B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

      C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

      J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


      M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

      P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

      P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

      H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

      H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

      H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005


                                                                                                • Tested Clustering Algorithms
                                                                                                • Results
                                                                                                • Discussion
                                                                                                    • Conclusions
                                                                                                    • Appendix
                                                                                                      • Matrix Properties
                                                                                                      • The Penalized-OS Problem is an Eigenvector Problem
                                                                                                        • How to Solve the Eigenvector Decomposition
                                                                                                        • Why the OS Problem is Solved as an Eigenvector Problem
                                                                                                          • Solving Fishers Discriminant Problem
                                                                                                          • Alternative Variational Formulation for the Group-Lasso
                                                                                                            • Useful Properties
                                                                                                            • An Upper Bound on the Objective Function
                                                                                                              • Invariance of the Group-Lasso to Unitary Transformations
                                                                                                              • Expected Complete Likelihood and Likelihood
                                                                                                              • Derivation of the M-Step Equations
                                                                                                                • Prior probabilities
                                                                                                                • Means
                                                                                                                • Covariance Matrix
                                                                                                                    • Bibliography


Contents

List of Figures
List of Tables
Notation and Symbols

I  Context and Foundations

1  Context

2  Regularization for Feature Selection
   2.1  Motivations
   2.2  Categorization of Feature Selection Techniques
   2.3  Regularization
        2.3.1  Important Properties
        2.3.2  Pure Penalties
        2.3.3  Hybrid Penalties
        2.3.4  Mixed Penalties
        2.3.5  Sparsity Considerations
        2.3.6  Optimization Tools for Regularized Problems

II  Sparse Linear Discriminant Analysis

Abstract

3  Feature Selection in Fisher Discriminant Analysis
   3.1  Fisher Discriminant Analysis
   3.2  Feature Selection in LDA Problems
        3.2.1  Inertia Based
        3.2.2  Regression Based

4  Formalizing the Objective
   4.1  From Optimal Scoring to Linear Discriminant Analysis
        4.1.1  Penalized Optimal Scoring Problem
        4.1.2  Penalized Canonical Correlation Analysis
        4.1.3  Penalized Linear Discriminant Analysis
        4.1.4  Summary
   4.2  Practicalities
        4.2.1  Solution of the Penalized Optimal Scoring Regression
        4.2.2  Distance Evaluation
        4.2.3  Posterior Probability Evaluation
        4.2.4  Graphical Representation
   4.3  From Sparse Optimal Scoring to Sparse LDA
        4.3.1  A Quadratic Variational Form
        4.3.2  Group-Lasso OS as Penalized LDA

5  GLOSS Algorithm
   5.1  Regression Coefficients Updates
        5.1.1  Cholesky decomposition
        5.1.2  Numerical Stability
   5.2  Score Matrix
   5.3  Optimality Conditions
   5.4  Active and Inactive Sets
   5.5  Penalty Parameter
   5.6  Options and Variants
        5.6.1  Scaling Variables
        5.6.2  Sparse Variant
        5.6.3  Diagonal Variant
        5.6.4  Elastic net and Structured Variant

6  Experimental Results
   6.1  Normalization
   6.2  Decision Thresholds
   6.3  Simulated Data
   6.4  Gene Expression Data
   6.5  Correlated Data

Discussion

III  Sparse Clustering Analysis

Abstract

7  Feature Selection in Mixture Models
   7.1  Mixture Models
        7.1.1  Model
        7.1.2  Parameter Estimation: The EM Algorithm
   7.2  Feature Selection in Model-Based Clustering
        7.2.1  Based on Penalized Likelihood
        7.2.2  Based on Model Variants
        7.2.3  Based on Model Selection

8  Theoretical Foundations
   8.1  Resolving EM with Optimal Scoring
        8.1.1  Relationship Between the M-Step and Linear Discriminant Analysis
        8.1.2  Relationship Between Optimal Scoring and Linear Discriminant Analysis
        8.1.3  Clustering Using Penalized Optimal Scoring
        8.1.4  From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
   8.2  Optimized Criterion
        8.2.1  A Bayesian Derivation
        8.2.2  Maximum a Posteriori Estimator

9  Mix-GLOSS Algorithm
   9.1  Mix-GLOSS
        9.1.1  Outer Loop: Whole Algorithm Repetitions
        9.1.2  Penalty Parameter Loop
        9.1.3  Inner Loop: EM Algorithm
   9.2  Model Selection

10  Experimental Results
   10.1  Tested Clustering Algorithms
   10.2  Results
   10.3  Discussion

Conclusions

Appendix

A  Matrix Properties

B  The Penalized-OS Problem is an Eigenvector Problem
   B.1  How to Solve the Eigenvector Decomposition
   B.2  Why the OS Problem is Solved as an Eigenvector Problem

C  Solving Fisher's Discriminant Problem

D  Alternative Variational Formulation for the Group-Lasso
   D.1  Useful Properties
   D.2  An Upper Bound on the Objective Function

E  Invariance of the Group-Lasso to Unitary Transformations

F  Expected Complete Likelihood and Likelihood

G  Derivation of the M-Step Equations
   G.1  Prior probabilities
   G.2  Means
   G.3  Covariance Matrix

Bibliography

List of Figures

1.1  MASH project logo
2.1  Example of relevant features
2.2  Four key steps of feature selection
2.3  Admissible sets in two dimensions for different pure norms ‖β‖p
2.4  Two dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties
2.5  Admissible sets for the Lasso and Group-Lasso
2.6  Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1  Graphical representation of the variational approach to Group-Lasso
5.1  GLOSS block diagram
5.2  Graph and Laplacian matrix for a 3×3 image
6.1  TPR versus FPR for all simulations
6.2  2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3  USPS digits "1" and "0"
6.4  Discriminant direction between digits "1" and "0"
6.5  Sparse discriminant direction between digits "1" and "0"
9.1  Mix-GLOSS loops scheme
9.2  Mix-GLOSS model selection diagram
10.1  Class mean vectors for each artificial simulation
10.2  TPR versus FPR for all simulations

List of Tables

6.1  Experimental results for simulated data, supervised classification
6.2  Average TPR and FPR for all simulations
6.3  Experimental results for gene expression data, supervised classification
10.1  Experimental results for simulated data, unsupervised clustering
10.2  Average TPR versus FPR for all clustering simulations

Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets
N  the set of natural numbers, N = {1, 2, . . .}
R  the set of reals
|A|  cardinality of a set A (for finite sets, the number of elements)
Ā  complement of set A

Data
X  input domain
xi  input sample, xi ∈ X
X  design matrix, X = (x1⊤, . . . , xn⊤)⊤
xj  column j of X
yi  class indicator of sample i
Y  indicator matrix, Y = (y1⊤, . . . , yn⊤)⊤
z  complete data, z = (x, y)
Gk  set of the indices of observations belonging to class k
n  number of examples
K  number of classes
p  dimension of X
i, j, k  indices running over N

Vectors, Matrices and Norms
0  vector with all entries equal to zero
1  vector with all entries equal to one
I  identity matrix
A⊤  transpose of matrix A (ditto for vectors)
A−1  inverse of matrix A
tr(A)  trace of matrix A
|A|  determinant of matrix A
diag(v)  diagonal matrix with v on the diagonal
‖v‖1  L1 norm of vector v
‖v‖2  L2 norm of vector v
‖A‖F  Frobenius norm of matrix A

Probability
E[·]  expectation of a random variable
var[·]  variance of a random variable
N(µ, σ²)  normal distribution with mean µ and variance σ²
W(W, ν)  Wishart distribution with ν degrees of freedom and scale matrix W
H(X)  entropy of random variable X
I(X; Y)  mutual information between random variables X and Y

Mixture Models
yik  hard membership of sample i to cluster k
fk  distribution function for cluster k
tik  posterior probability of sample i to belong to cluster k
T  posterior probability matrix
πk  prior probability or mixture proportion for cluster k
µk  mean vector of cluster k
Σk  covariance matrix of cluster k
θk  parameter vector for cluster k, θk = (µk, Σk)
θ(t)  parameter vector at iteration t of the EM algorithm
f(X; θ)  likelihood function
L(θ; X)  log-likelihood function
LC(θ; X, Y)  complete log-likelihood function

Optimization
J(·)  cost function
L(·)  Lagrangian
β̂  generic notation for the solution with respect to β
βls  least squares solution coefficient vector
A  active set
γ  step size to update the regularization path
h  direction to update the regularization path

Penalized Models
λ, λ1, λ2  penalty parameters
Pλ(θ)  penalty term over a generic parameter vector
βkj  coefficient j of discriminant vector k
βk  kth discriminant vector, βk = (βk1, . . . , βkp)
B  matrix of discriminant vectors, B = (β1, . . . , βK−1)
βj  jth row of B, B = (β1⊤, . . . , βp⊤)⊤
BLDA  coefficient matrix in the LDA domain
BCCA  coefficient matrix in the CCA domain
BOS  coefficient matrix in the OS domain
XLDA  data matrix in the LDA domain
XCCA  data matrix in the CCA domain
XOS  data matrix in the OS domain
θk  score vector k
Θ  score matrix, Θ = (θ1, . . . , θK−1)
Y  label matrix
Ω  penalty matrix
LCP(θ; X, Z)  penalized complete log-likelihood function
ΣB  between-class covariance matrix
ΣW  within-class covariance matrix
ΣT  total covariance matrix
Σ̂B  sample between-class covariance matrix
Σ̂W  sample within-class covariance matrix
Σ̂T  sample total covariance matrix
Λ  inverse of the covariance matrix, or precision matrix
wj  weights
τj  penalty components of the variational approach

        Part I

        Context and Foundations


This thesis is divided into three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here to introduce the models and some basic concepts that will be used throughout this document, and the state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared with other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


        1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the research point of view, the members of the consortium must deal with four main goals:

1. Software development of the website framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm and the clusters are built by maximum a posteriori estimation. For the calculations we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011). A minimal sketch of the RV computation is given below.
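The following Python sketch (assuming numpy is available) illustrates how an RV coefficient between two extractor tables might be computed; the function name and the toy data are illustrative and are not the MASH platform code.

    import numpy as np

    def rv_coefficient(Xi, Xj):
        """RV coefficient between two tables observed on the same n samples.

        Each table is summarized by its n x n cross-product operator;
        the RV coefficient is then a cosine between those two operators.
        """
        Xi = Xi - Xi.mean(axis=0)          # center each table column-wise
        Xj = Xj - Xj.mean(axis=0)
        Oi = Xi @ Xi.T                     # n x n configuration operators
        Oj = Xj @ Xj.T
        return np.trace(Oi @ Oj) / (np.linalg.norm(Oi) * np.linalg.norm(Oj))

    # Toy usage: two extractors computing features for the same 50 samples
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 10))
    B = np.hstack([A[:, :5], rng.normal(size=(50, 3))])   # partially redundant extractor
    print(rv_coefficient(A, B))   # closer to 1 when the tables carry similar information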

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to meet our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).

        2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There has been a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and eases interpretation in the unsupervised framework. Removing features must be done wisely to avoid discarding critical information.

When talking about dimensionality reduction, there are two families of techniques that could be confused:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. A problem arises when there is a restriction on the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

        Redundant features are those which provide distinguishing information but are cu-mulative to another feature or group of features that provide substantially the same dis-tinguishing information Using previous example consider illustrative ldquodietrdquo and ldquodo-mesticationrdquo features Dogs and cats both have similar carnivorous diets while squirrelsconsume nuts and so forth Thus the ldquodietrdquo feature can efficiently distinguish squirrelsfrom dogs and cats although it provides little information to distinguish between dogsand cats Dogs and cats are also both typically domesticated animals while squirrels arewild animals Thus the ldquodomesticationrdquo feature provides substantially the same infor-mation as the ldquodietrdquo feature namely distinguishing squirrels from dogs and cats but notdistinguishing between dogs and cats Thus the ldquodietrdquo and ldquodomesticationrdquo features arecumulative and one can identify one of these features as redundant so as to be filteredout However unlike irrelevant features care should be taken with redundant featuresto ensure that one retains enough of the redundant features to provide the relevant dis-tinguishing information In the foregoing example on may wish to filter out either the

        10

        22 Categorization of Feature Selection Techniques

        Figure 22 The four key steps of feature selection according to Liu and Yu (2005)

        ldquodietrdquo feature or the ldquodomesticationrdquo feature but if one removes both the ldquodietrdquo and theldquodomesticationrdquo features then useful distinguishing information is lost

        There are some tricks to build robust estimators when the number of features exceedsthe number of samples Ignoring some of the dependencies among variables and replacingthe covariance matrix by a diagonal approximation are two of them Another populartechnique and the one chosen in this thesis is imposing regularity conditions

        22 Categorization of Feature Selection Techniques

        Feature selection is one of the most frequent techniques in preprocessing data in orderto remove irrelevant redundant or noisy features Nevertheless the risk of removingsome informative dimensions is always there thus the relevance of the remaining subsetof features must be measured

        I am reproducing here the scheme that generalizes any feature selection process as itis shown by Liu and Yu (2005) Figure 22 provides a very intuitive scheme with thefour key steps in a feature selection algorithm

        The classification of those algorithms can respond to different criteria Guyon andElisseeff (2003) propose a check list that summarizes the steps that may be taken tosolve a feature selection problem guiding the user through several techniques Liu andYu (2005) propose a framework that integrates supervised and unsupervised featureselection algorithms through a categorizing framework Both references are excellentreviews to characterize feature selection techniques according to their characteristicsI am proposing a framework inspired by these references that does not cover all thepossibilities but which gives a good summary about existing possibilities

• Depending on the type of integration with the machine learning algorithm, we have:

  – Filter Models: Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance from the mining algorithm.

  – Wrapper Models: Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

  – Embedded Models: They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process for a given mining algorithm.

• Depending on the feature searching technique:

  – Complete: No subsets are missed from evaluation; involves combinatorial searches.

  – Sequential: Features are added (forward searches) or removed (backward searches) one at a time.

  – Random: The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

  – Distance Measures: Choosing the features that maximize the difference in separability, divergence or discrimination measures.

  – Information Measures: Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

  – Dependency Measures: Measuring the correlation between features.

  – Consistency Measures: Finding a minimum number of features that separate classes as consistently as the full set of features can.

  – Predictive Accuracy: Using the selected features to predict the labels.

  – Cluster Goodness: Using the selected features to perform clustering and evaluating the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues on ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_{\beta} J(\beta) + \lambda P(\beta) \quad (2.1)

\min_{\beta} J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \quad (2.2)

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \quad (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties

Regularizing a linear model with a norm like ‖β‖p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum βls is outside the penalties' admissible regions. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 norm results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, since they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖0 = card{βj : βj ≠ 0}:

\min_{\beta} J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \quad (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
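As an illustration of this combinatorial nature, the sketch below (hypothetical Python code, assuming numpy and a squared loss for J) solves the constraint in (2.4) by exhaustive enumeration of subsets, which is only feasible for very small p.

    import numpy as np
    from itertools import combinations

    def best_subset(X, y, t):
        """Exhaustive search for the L0-constrained least squares problem (2.4).

        Every subset of at most t features is tried, which is only tractable
        for small p and illustrates why L0 penalties lead to combinatorial schemes.
        """
        n, p = X.shape
        best_rss, best_beta = np.inf, np.zeros(p)
        for k in range(t + 1):
            for subset in combinations(range(p), k):
                beta = np.zeros(p)
                if subset:
                    S = list(subset)
                    beta[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
                rss = np.sum((y - X @ beta) ** 2)
                if rss < best_rss:
                    best_rss, best_beta = rss, beta
        return best_beta

    # Usage on a toy problem: at most 2 non-zero coefficients are allowed
    # beta_hat = best_subset(X, y, t=2)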

L1 Penalties. Penalties built using the L1 norm induce sparsity and stability. The resulting estimator was named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_{\beta} J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \quad (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not simply a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).
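The sparsity induced by (2.5) and the saturation effect when n < p can be observed with a few lines of Python; the snippet below is an illustrative sketch assuming scikit-learn is available (its parameter alpha plays the role of λ in the equivalent formulation (2.1)).

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 30, 100                       # fewer samples than features
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 2.0                  # only 5 informative variables
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    lasso = Lasso(alpha=0.1).fit(X, y)
    print("non-zero coefficients:", np.sum(lasso.coef_ != 0))
    # With n < p, at most n coefficients can be non-zero, whatever alpha is.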

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus an L2 penalized optimization problem looks like

\min_{\beta} J(\beta) + \lambda \|\beta\|_2^2 \quad (2.6)

The effect of this penalty is the "equalization" of the components of the parameter vector that is being penalized. To highlight this property, let us consider a least squares problem

\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \quad (2.7)

with solution βls = (X⊤X)−1X⊤y. If some input variables are highly correlated, the estimator βls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

The solution to this problem is βl2 = (X⊤X + λIp)−1X⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
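A minimal numerical check of this eigenvalue shift, assuming numpy, could look like the following sketch; the nearly collinear design is artificial.

    import numpy as np

    def ridge(X, y, lam):
        """Closed-form ridge estimator (X'X + lambda*I)^(-1) X'y."""
        n, p = X.shape
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))
    X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=50)    # two nearly collinear columns
    y = rng.normal(size=50)

    print(np.linalg.eigvalsh(X.T @ X)[:2])                      # smallest eigenvalues, near zero
    print(np.linalg.eigvalsh(X.T @ X + 1.0 * np.eye(5))[:2])    # shifted upwards by lambda
    print(ridge(X, y, lam=1.0))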

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrote, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \quad (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component: every λj is optimized to penalize more or less depending on the influence of βj in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖∞ = max(|x1|, |x2|, . . . , |xp|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as

\|\beta\|^* = \max_{w \in \mathbb{R}^p} \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why the L∞ norm is so important, even if it is not as popular as a penalty itself as the L1 norm is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason to use pure penalties only in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \quad (2.9)

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
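The different behaviour of the two penalties on correlated variables can be sketched as follows; this is an illustration assuming scikit-learn, whose ElasticNet parameterizes λ1 and λ2 through alpha and l1_ratio rather than directly.

    import numpy as np
    from sklearn.linear_model import ElasticNet, Lasso

    rng = np.random.default_rng(0)
    n, p = 40, 200
    X = rng.normal(size=(n, p))
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # two strongly correlated relevant features
    y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

    # The Lasso tends to keep only one of the two correlated variables...
    print(np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_))
    # ...while the elastic net (L1 + L2) tends to keep both.
    print(np.flatnonzero(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_))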


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by Gℓ the group of genes for the ℓ-th process and by dℓ the number of genes (variables) in each group, for all ℓ ∈ {1, . . . , L}. Thus the dimension of vector β is the sum of the number of genes of every group, dim(β) = Σℓ=1..L dℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in \mathcal{G}_\ell} |\beta_j|^s \Big)^{r/s} \Bigg)^{1/r} \quad (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group Gℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.

Several combinations are available; the most popular is the norm ‖β‖(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L1,2 norm. Many other mixings are possible, such as ‖β‖(1,4/3) (Szafranski et al., 2008) or ‖β‖(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).
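In practice, the ‖β‖(1,2) penalty is often handled through its proximal operator, which shrinks each group norm as a block. The following is a minimal sketch, assuming numpy and a predefined list of group index sets; it is not the algorithm used later in this thesis.

    import numpy as np

    def group_soft_threshold(beta, groups, threshold):
        """Proximal operator of the (1,2) mixed norm: each group is shrunk as a block.

        `groups` is a list of index arrays; a whole group is set to zero when its
        L2 norm falls below `threshold`, which is how the group-Lasso removes
        variables group-wise rather than coordinate-wise.
        """
        out = beta.copy()
        for g in groups:
            norm = np.linalg.norm(beta[g])
            out[g] = 0.0 if norm <= threshold else (1 - threshold / norm) * beta[g]
        return out

    beta = np.array([0.5, -0.2, 3.0, 2.5, 0.1, 0.05])
    groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
    print(group_soft_threshold(beta, groups, threshold=0.6))
    # the first and last groups vanish entirely; the second is shrunk but kept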

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L1,2 or L1,∞ mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso and the Group-Lasso ((a) L1, Lasso; (b) L(1,2), group-Lasso)

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters ((a) L1-induced sparsity; (b) L(1,2) group-induced sparsity)
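The contrast between the two panels of Figure 2.6 can be mimicked numerically: entrywise soft-thresholding scatters zeros anywhere in the coefficient matrix, whereas thresholding row norms keeps or removes a variable as a whole. The snippet below is an illustrative numpy sketch with made-up coefficients.

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(8, 4))          # 8 variables (rows) x 4 parameters (columns)
    B[[2, 4, 7], :] *= 0.1               # three weak, uninformative variables
    lam = 0.5

    # Entrywise (Lasso-like) soft-thresholding: zeros may appear anywhere in the matrix
    B_l1 = np.sign(B) * np.maximum(np.abs(B) - lam, 0.0)

    # Row-wise (group-Lasso-like) thresholding: each variable is kept or dropped as a block
    row_norms = np.linalg.norm(B, axis=1, keepdims=True)
    B_group = np.maximum(1 - lam / np.maximum(row_norms, 1e-12), 0.0) * B

    print("rows zeroed by the entrywise rule:", np.flatnonzero(np.all(B_l1 == 0, axis=1)))
    print("rows zeroed by the row-wise rule :", np.flatnonzero(np.all(B_group == 0, axis=1)))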

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints" implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used in the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\ s' \in \partial P(\beta^{(t)})
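A bare-bones version of this update for the Lasso (quadratic loss plus L1 penalty) is sketched below; the step size, the iteration count and the use of np.sign as a subgradient of the L1 norm are arbitrary illustrative choices.

    import numpy as np

    def lasso_subgradient_descent(X, y, lam, step=1e-3, n_iter=5000):
        """Minimal sketch of subgradient descent on J(beta) + lam * ||beta||_1,
        with J the residual sum of squares. Slow, and iterates are not exactly sparse."""
        n, p = X.shape
        beta = np.zeros(p)
        for t in range(n_iter):
            grad_J = 2 * X.T @ (X @ beta - y)    # gradient of the quadratic loss
            sub_P = np.sign(beta)                # a subgradient of the L1 norm
            beta = beta - step * (grad_J + lam * sub_P)
        return beta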

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient βj gives

\beta_j = \frac{-\lambda\,\operatorname{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2}

In the literature, those algorithms are also referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution βls and updating the values with an iterative thresholding algorithm where βj(t+1) = Sλ(∂J(β(t))/∂βj). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_{\lambda}\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\
0 & \text{if } |\partial J(\beta)/\partial \beta_j| \le \lambda
\end{cases}
\quad (2.11)
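For the squared loss, this update amounts to soft-thresholding a partial least squares solution for each coordinate in turn. The following sketch (assuming numpy) is a generic "shooting"-style implementation in the spirit of (2.11), not the exact code of Fu (1998).

    import numpy as np

    def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
        """Coordinate descent sketch for the squared loss with an L1 penalty.

        Each coefficient is updated in turn by soft-thresholding its partial
        least squares solution, all other coefficients being kept fixed.
        """
        n, p = X.shape
        beta = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0)
        for sweep in range(n_sweeps):
            for j in range(p):
                residual_j = y - X @ beta + X[:, j] * beta[j]   # residual ignoring feature j
                rho = X[:, j] @ residual_j
                # soft-thresholding step for coordinate j
                beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq[j]
        return beta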

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set". This subset stores the indices of the variables with non-zero βj. It is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā. In the inactive set we can find the indices of the variables whose βj is zero. Thus the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions and L1,2 penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006; Smola et al. 2008; Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and the variable that should enter the active set, from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

    min_{β∈R^p}  J(β^(t)) + ∇J(β^(t))^T (β − β^(t)) + λ P(β) + (L/2) ‖β − β^(t)‖²_2 .        (2.12)

They are iterative methods where the cost function J(β) is linearized in the proximity of the current solution β^(t), so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

    min_{β∈R^p}  (1/2) ‖ β − ( β^(t) − (1/L) ∇J(β^(t)) ) ‖²_2 + (λ/L) P(β) .        (2.13)

The basic algorithm uses the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
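The following Python/NumPy sketch illustrates the basic proximal update on the Lasso, where the proximal step reduces to soft-thresholding; it is an assumed minimal implementation of this scheme (often called ISTA), not FISTA nor any code referenced in this chapter.

    import numpy as np

    def ista_lasso(X, y, lam, n_iter=500):
        """Proximal gradient for 0.5 * ||y - X b||_2^2 + lam * ||b||_1."""
        n, p = X.shape
        L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
        beta = np.zeros(p)
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - y)        # gradient of the smooth part
            z = beta - grad / L                # plain gradient step
            beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
        return beta

Setting lam = 0 in this sketch turns each iteration into a plain gradient step, mirroring the remark above.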


        Part II

        Sparse Linear Discriminant Analysis


        Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and label y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n×K matrix Y = (y_1^T, ..., y_n^T)^T.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

    max_{β∈R^p}  ( β^T Σ_B β ) / ( β^T Σ_W β ) ,        (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as

    Σ_W = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (x_i − μ_k)(x_i − μ_k)^T ,
    Σ_B = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (μ − μ_k)(μ − μ_k)^T ,

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.
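As an illustration, the following Python/NumPy sketch computes Σ_W and Σ_B from a data matrix and integer class labels; the function and variable names are hypothetical.

    import numpy as np

    def scatter_matrices(X, y, K):
        """Within-class and between-class covariance matrices as defined above.

        X: (n, p) data matrix; y: (n,) integer labels in {0, ..., K-1}.
        """
        n, p = X.shape
        mu = X.mean(axis=0)
        Sigma_W = np.zeros((p, p))
        Sigma_B = np.zeros((p, p))
        for k in range(K):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            Sigma_W += (Xk - mu_k).T @ (Xk - mu_k)
            Sigma_B += len(Xk) * np.outer(mu - mu_k, mu - mu_k)
        return Sigma_W / n, Sigma_B / n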


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

    max_{B∈R^{p×(K−1)}}  tr( B^T Σ_B B ) / tr( B^T Σ_W B ) ,        (3.2)

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

    max_{β_k∈R^p}  β_k^T Σ_B β_k
    s.t.  β_k^T Σ_W β_k ≤ 1 ,
          β_k^T Σ_W β_ℓ = 0 , ∀ℓ < k .        (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated with the kth largest eigenvalue (see Appendix C).
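A minimal sketch of this eigen-solution, assuming the scatter matrices computed above are available and Σ_W is invertible (a small ridge may be added otherwise): scipy.linalg.eigh solves the generalized eigenproblem Σ_B v = λ Σ_W v and returns Σ_W-orthonormal eigenvectors, which matches the constraints of (3.3).

    import numpy as np
    from scipy.linalg import eigh

    def fisher_directions(Sigma_B, Sigma_W, n_directions):
        """Leading eigenvectors of Sigma_W^{-1} Sigma_B via a generalized eigenproblem."""
        eigvals, eigvecs = eigh(Sigma_B, Sigma_W)        # ascending eigenvalues
        order = np.argsort(eigvals)[::-1][:n_directions]
        return eigvecs[:, order], eigvals[order]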

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or regression-based formulations.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

    min_{β∈R^p}  β^T Σ_W β
    s.t.  (μ_1 − μ_2)^T β = 1 ,
          Σ_{j=1}^p |β_j| ≤ t ,

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

    max_{β_k∈R^p}  β_k^T Σ_B^k β_k − P_k(β_k)
    s.t.  β_k^T Σ_W β_k ≤ 1 .

The term to maximize is the projected between-class covariance β_k^T Σ_B^k β_k, subject to an upper bound on the projected within-class covariance β_k^T Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

    min_{β∈R^p}  ‖β‖_1
    s.t.  ‖Σβ − (μ_1 − μ_2)‖_∞ ≤ λ .

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al. 2000; Friedman et al. 2009).

        Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n×K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al. 2009). Another popular choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K − 1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

    min_{β∈R^p, β_0∈R}  n^{-1} Σ_{i=1}^n (y_i − β_0 − x_i^T β)² + λ Σ_{j=1}^p |β_j| ,

where y_i is the binary indicator of the label for pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^T β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting from λ = 0, but a different intercept β_0 is required.

        Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

    min_{Θ, B}  ‖YΘ − XB‖²_F + λ tr( B^T Ω B )        (3.4a)
    s.t.  n^{-1} Θ^T Y^T Y Θ = I_{K−1} ,        (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the ordering that arises naturally when considering the following series of K − 1 problems:

    min_{θ_k∈R^K, β_k∈R^p}  ‖Yθ_k − Xβ_k‖² + β_k^T Ω β_k        (3.5a)
    s.t.  n^{-1} θ_k^T Y^T Y θ_k = 1 ,        (3.5b)
          θ_k^T Y^T Y θ_ℓ = 0 , ℓ = 1, ..., k − 1 ,        (3.5c)

where each β_k corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005; Leng 2008; Grosenick et al. 2008; Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

    min_{β_k∈R^p, θ_k∈R^K}  Σ_k ‖Yθ_k − Xβ_k‖²_2 + λ_1 ‖β_k‖_1 + λ_2 β_k^T Ω β_k ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

    min_{β_k∈R^p, θ_k∈R^K}  Σ_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖²_2 + λ Σ_{j=1}^p √( Σ_{k=1}^{K−1} β_kj² ) ,        (3.6)

which is the criterion that was chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.
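To illustrate why the penalty in (3.6) removes the same features from all discriminant directions, the short sketch below applies the proximal operator of a row-wise group-Lasso penalty to a small coefficient matrix; this is a generic illustration of the penalty's effect, written in Python/NumPy with hypothetical names, not Leng's algorithm nor the GLOSS solver.

    import numpy as np

    def group_soft_threshold(B, t):
        """Proximal operator of t * sum_j ||B[j, :]||_2: shrinks each row of B,
        zeroing it entirely when its Euclidean norm falls below t."""
        norms = np.linalg.norm(B, axis=1, keepdims=True)
        scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
        return scale * B

    # A row whose norm falls below the threshold is removed from every column at
    # once, so the selected features are common to all discriminant directions.
    B = np.array([[0.1, -0.05], [2.0, 1.0], [0.0, 0.3]])
    print(group_soft_threshold(B, 0.5))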


        4 Formalizing the Objective

In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K − 1, or partial for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984; Hastie et al. 1994; Hastie and Tibshirani 1996; Hastie et al. 1995) and already used before for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix Y^T Y is full rank;

• inputs are centered, that is, X^T 1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^T X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus

    min_{θ∈R^K, β∈R^p}  ‖Yθ − Xβ‖² + β^T Ω β        (4.1a)
    s.t.  n^{-1} θ^T Y^T Y θ = 1 .        (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

    β_os = ( X^T X + Ω )^{-1} X^T Y θ .        (4.2)

The objective function (4.1a) is then

    ‖Yθ − Xβ_os‖² + β_os^T Ω β_os = θ^T Y^T Y θ − 2 θ^T Y^T X β_os + β_os^T ( X^T X + Ω ) β_os
                                  = θ^T Y^T Y θ − θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ ,

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

    max_{θ : n^{-1} θ^T Y^T Y θ = 1}  θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ ,        (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^T X ( X^T X + Ω )^{-1} X^T Y. Indeed, Appendix C details that Problem (4.3) is solved by

    ( Y^T Y )^{-1} Y^T X ( X^T X + Ω )^{-1} X^T Y θ = α² θ ,        (4.4)


where α² is the maximal eigenvalue¹:

    n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ = α² n^{-1} θ^T ( Y^T Y ) θ
    n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ = α² .        (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

    max_{θ∈R^K, β∈R^p}  n^{-1} θ^T Y^T X β        (4.6a)
    s.t.  n^{-1} θ^T Y^T Y θ = 1 ,        (4.6b)
          n^{-1} β^T ( X^T X + Ω ) β = 1 .        (4.6c)

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

    n L(β, θ, ν, γ) = θ^T Y^T X β − ν ( θ^T Y^T Y θ − n ) − γ ( β^T ( X^T X + Ω ) β − n )

    ⇒ n ∂L(β, θ, γ, ν)/∂β = X^T Y θ − 2γ ( X^T X + Ω ) β
    ⇒ β_cca = (1/(2γ)) ( X^T X + Ω )^{-1} X^T Y θ .

Then, as β_cca obeys (4.6c), we obtain

    β_cca = ( X^T X + Ω )^{-1} X^T Y θ / √( n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ ) ,        (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

    n^{-1} θ^T Y^T X β_cca = n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ / √( n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ )
                           = √( n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ ) ,

        and the optimization problem with respect to θ can be restated as

    max_{θ : n^{-1} θ^T Y^T Y θ = 1}  θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ .        (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

    β_os = α β_cca ,        (4.9)

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).


where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

    n ∂L(β, θ, γ, ν)/∂θ = Y^T X β − 2ν Y^T Y θ
    ⇒ θ_cca = (1/(2ν)) ( Y^T Y )^{-1} Y^T X β .        (4.10)

Then, as θ_cca obeys (4.6b), we obtain

    θ_cca = ( Y^T Y )^{-1} Y^T X β / √( n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β ) ,        (4.11)

        leading to the following expression of the optimal objective function

    n^{-1} θ_cca^T Y^T X β = n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β / √( n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β )
                           = √( n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β ) .

The p-CCA problem can thus be solved with respect to β by plugging this value into (4.6):

    max_{β∈R^p}  n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β        (4.12a)
    s.t.  n^{-1} β^T ( X^T X + Ω ) β = 1 ,        (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

    n^{-1} X^T Y ( Y^T Y )^{-1} Y^T X β_cca = λ ( X^T X + Ω ) β_cca ,        (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

    n^{-1} β_cca^T X^T Y ( Y^T Y )^{-1} Y^T X β_cca = λ
    ⇒ n^{-1} α^{-1} β_cca^T X^T Y ( Y^T Y )^{-1} Y^T X ( X^T X + Ω )^{-1} X^T Y θ = λ
    ⇒ n^{-1} α β_cca^T X^T Y θ = λ
    ⇒ n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ = λ
    ⇒ α² = λ .

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), whose denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

    max_{β∈R^p}  β^T Σ_B β        (4.14a)
    s.t.  β^T ( Σ_W + n^{-1} Ω ) β = 1 ,        (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y ( Y^T Y )^{-1} Y^T:

    Σ_T = (1/n) Σ_{i=1}^n x_i x_i^T = n^{-1} X^T X ,
    Σ_B = (1/n) Σ_{k=1}^K n_k μ_k μ_k^T = n^{-1} X^T Y ( Y^T Y )^{-1} Y^T X ,
    Σ_W = (1/n) Σ_{k=1}^K Σ_{i : y_ik = 1} (x_i − μ_k)(x_i − μ_k)^T = n^{-1} ( X^T X − X^T Y ( Y^T Y )^{-1} Y^T X ) .

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

    X^T Y ( Y^T Y )^{-1} Y^T X β_lda = λ ( X^T X + Ω − X^T Y ( Y^T Y )^{-1} Y^T X ) β_lda
    X^T Y ( Y^T Y )^{-1} Y^T X β_lda = ( λ / (1 − λ) ) ( X^T X + Ω ) β_lda .

The comparison of the last equation with the one verified by β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it follows that

    β_lda = (1 − α²)^{-1/2} β_cca
          = α^{-1} (1 − α²)^{-1/2} β_os ,

        which ends the path from p-OS to p-LDA


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

    min_{Θ, B}  ‖YΘ − XB‖²_F + λ tr( B^T Ω B )
    s.t.  n^{-1} Θ^T Y^T Y Θ = I_{K−1} .

Let A represent the (K − 1)×(K − 1) diagonal matrix whose elements α_k are the square roots of the K − 1 leading eigenvalues of Y^T X ( X^T X + Ω )^{-1} X^T Y; we have

    B_LDA = B_CCA ( I_{K−1} − A² )^{-1/2}
          = B_OS A^{-1} ( I_{K−1} − A² )^{-1/2} ,        (4.15)

where I_{K−1} is the (K − 1)×(K − 1) identity matrix. At this point, the feature matrix X, which in the input space has dimensions n×p, can be projected into the optimal scoring domain as an n×(K − 1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n×(K − 1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows (a schematic implementation is sketched after the list):

1. Solve the p-OS problem as B_OS = ( X^T X + λΩ )^{-1} X^T Y Θ, where Θ are the K − 1 leading eigenvectors of Y^T X ( X^T X + λΩ )^{-1} X^T Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1} ( I_{K−1} − A² )^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Produce a graphical representation if desired.
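The sketch below strings these steps together in Python/NumPy for a fixed quadratic penalty Ω (the identity by default); it is a schematic illustration of the summary above, not the GLOSS code (which is written in Matlab and uses the adaptive group-Lasso penalty), and all names are hypothetical.

    import numpy as np
    from scipy.linalg import eigh

    def pos_lda_classify(X, y, K, lam, Omega=None):
        """Steps 1-5 above for a quadratic penalty Omega.

        X: centered (n, p) data matrix; y: integer labels in {0, ..., K-1}.
        """
        n, p = X.shape
        Y = np.eye(K)[y]                                  # n x K indicator matrix
        Omega = np.eye(p) if Omega is None else Omega
        XtX_pen = X.T @ X + lam * Omega
        XtY = X.T @ Y

        # Step 1: Theta = K-1 leading generalized eigenvectors, then B_OS.
        M = XtY.T @ np.linalg.solve(XtX_pen, XtY)         # Y'X (X'X + lam*Omega)^{-1} X'Y
        evals, evecs = eigh(M, Y.T @ Y)                   # ascending eigenvalues
        idx = np.argsort(evals)[::-1][:K - 1]
        alpha2 = evals[idx]                               # alpha_k^2, expected in (0, 1)
        Theta = np.sqrt(n) * evecs[:, idx]                # n^{-1} Theta' Y'Y Theta = I
        B_os = np.linalg.solve(XtX_pen, XtY @ Theta)

        # Step 2: project onto the LDA domain.
        D = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))
        X_lda = X @ B_os @ D

        # Steps 3-5: centroids, Euclidean distances corrected by priors, MAP rule.
        centroids = np.vstack([X_lda[y == k].mean(axis=0) for k in range(K)])
        log_priors = np.log(np.bincount(y, minlength=K) / n)
        dist = ((X_lda[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return np.argmin(dist - 2 * log_priors, axis=1), B_os, Theta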


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

    min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr( B^T Ω B )        (4.16a)
    s.t.  n^{-1} Θ^T Y^T Y Θ = I_{K−1} ,        (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰ᵀ Y^T Y Θ⁰ = I_{K−1}.

2. Compute B = ( X^T X + λΩ )^{-1} X^T Y Θ⁰.

3. Set Θ to be the K − 1 leading eigenvectors of Y^T X ( X^T X + λΩ )^{-1} X^T Y.

4. Compute the optimal regression coefficients

    B_OS = ( X^T X + λΩ )^{-1} X^T Y Θ .        (4.17)

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰ᵀ Y^T X ( X^T X + λΩ )^{-1} X^T Y Θ⁰, which is computed as Θ⁰ᵀ Y^T X B, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form B^T Ω B. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the Nearest Centroid rule, where sample x_i is assigned to class k if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown and the rule is applied with the parameters estimated from training data (sample estimators μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

    d(x_i, μ_k) = (x_i − μ_k)^T Σ_WΩ^{-1} (x_i − μ_k) − 2 log( n_k / n )        (4.18)

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment term for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009; Mai et al. 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

    Σ_WΩ^{-1} = ( n^{-1} ( X^T X + λΩ ) − Σ_B )^{-1}
              = ( n^{-1} X^T X − Σ_B + n^{-1} λΩ )^{-1}
              = ( Σ_W + n^{-1} λΩ )^{-1} .        (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution B_OS of the p-OS problem is enough to accomplish classification;

• in the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension R < K − 1 by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, it is

    ‖(x_i − μ_k) B_OS‖²_{Σ_WΩ} − 2 log(π_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, it is

    ‖(x_i − μ_k) B_OS A^{-1} ( I_{K−1} − A² )^{-1/2}‖²_2 − 2 log(π_k) ,

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

    p(y_k = 1|x) ∝ exp( −d(x, μ_k)/2 )
                 ∝ π_k exp( −(1/2) ‖(x_i − μ_k) B_OS A^{-1} ( I_{K−1} − A² )^{-1/2}‖²_2 ) .        (4.20)

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

    p(y_k = 1|x) = π_k exp( −d(x, μ_k)/2 ) / Σ_ℓ π_ℓ exp( −d(x, μ_ℓ)/2 )
                 = π_k exp( ( −d(x, μ_k) + d_max )/2 ) / Σ_ℓ π_ℓ exp( ( −d(x, μ_ℓ) + d_max )/2 ) ,

where d_max = max_k d(x, μ_k).
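A minimal Python/NumPy version of this normalization, applied to the vector of distances of a single sample; names are illustrative.

    import numpy as np

    def posteriors_from_distances(d, priors):
        """Class posteriors from distances d (length K) and priors (length K),
        using the shift by d_max described above to avoid underflow."""
        d = np.asarray(d, dtype=float)
        d_max = d.max()
        unnorm = priors * np.exp(0.5 * (d_max - d))   # pi_k * exp((-d_k + d_max) / 2)
        return unnorm / unnorm.sum()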

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^T Ω β, under the assumption that Y^T Y and X^T X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties, but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B^T Ω B.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998; Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:

    min_{τ∈R^p} min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖²_2 / τ_j        (4.21a)
    s.t.  Σ_j τ_j − Σ_j w_j ‖β^j‖_2 ≤ 0 ,        (4.21b)
          τ_j ≥ 0 ,  j = 1, ..., p ,        (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1T}, ..., β^{pT})^T, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression ‖YΘ − XB‖²_2; for now, for the sake of simplicity, we keep the generic notation J(B). Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999; Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables τ_j. That is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is

    L = J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖²_2 / τ_j + ν_0 ( Σ_{j=1}^p τ_j − Σ_{j=1}^p w_j ‖β^j‖_2 ) − Σ_{j=1}^p ν_j τ_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j are

    ∂L/∂τ_j (τ_j*) = 0  ⇔  −λ w_j² ‖β^j‖²_2 / τ_j*² + ν_0 − ν_j = 0
                        ⇔  −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² − ν_j τ_j*² = 0
                        ⇒  −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² = 0 .

The last line is obtained from complementary slackness, which implies here ν_j τ_j* = 0. Complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier for constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is

    τ_j* = √( λ w_j² ‖β^j‖²_2 / ν_0 ) = √(λ/ν_0) w_j ‖β^j‖_2 .        (4.22)

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

    Σ_{j=1}^p τ_j − Σ_{j=1}^p w_j ‖β^j‖_2 = 0 ,        (4.23)

so that τ_j* = w_j ‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso formulation:

    min_{B∈R^{p×M}}  J(B) + λ Σ_{j=1}^p w_j ‖β^j‖_2 .        (4.24)

So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ B^T Ω B, where

    Ω = diag( w_1²/τ_1, w_2²/τ_2, ..., w_p²/τ_p ) ,        (4.25)

with τ_j = w_j ‖β^j‖_2, resulting in the Ω diagonal components

    (Ω)_jj = w_j / ‖β^j‖_2 .        (4.26)

And, as stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2/τ, known as the perspective function of f(β) = ‖β‖²_2, is convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

    { V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λ G } ,        (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1T}, ..., g^{pT})^T, defined as follows. Let S(B) denote the columnwise support of B, S(B) = { j ∈ {1, ..., p} : ‖β^j‖_2 ≠ 0 }; then we have

    ∀j ∈ S(B),  g^j = w_j ‖β^j‖_2^{-1} β^j ,        (4.28)
    ∀j ∉ S(B),  ‖g^j‖_2 ≤ w_j .        (4.29)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is

    ∂( λ Σ_{m=1}^p w_m ‖β^m‖_2 ) / ∂β^j = λ w_j β^j / ‖β^j‖_2 .        (4.30)

At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

    ∂_{β^j}( λ Σ_{m=1}^p w_m ‖β^m‖_2 ) = ∂_{β^j}( λ w_j ‖β^j‖_2 ) = { λ w_j v ∈ R^{K−1} : ‖v‖_2 ≤ 1 } .        (4.31)

That gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

    ∀j ∈ S,  ∂J(B)/∂β^j + λ w_j ‖β^j‖_2^{-1} β^j = 0 ,        (4.32a)
    ∀j ∉ S,  ‖∂J(B)/∂β^j‖_2 ≤ λ w_j ,        (4.32b)

where S ⊆ {1, ..., p} denotes the set of non-zero row vectors β^j and S̄ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

    B_OS = argmin_{B∈R^{p×(K−1)}} min_{Θ∈R^{K×(K−1)}}  (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖β^j‖_2
    s.t.  n^{-1} Θ^T Y^T Y Θ = I_{K−1}


is equivalent to the penalized LDA problem

    B_LDA = argmax_{B∈R^{p×(K−1)}}  tr( B^T Σ_B B )
    s.t.  B^T ( Σ_W + n^{-1} λ Ω ) B = I_{K−1} ,

where Ω = diag( w_1²/τ_1, ..., w_p²/τ_p ), with

    Ω_jj = +∞ if β^j_os = 0 ,  and  Ω_jj = w_j ‖β^j_os‖_2^{-1} otherwise .        (4.33)

That is, B_LDA = B_OS diag( α_k^{-1} (1 − α_k²)^{-1/2} ), where α_k ∈ (0, 1) is the kth leading eigenvalue of

    n^{-1} Y^T X ( X^T X + λΩ )^{-1} X^T Y .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008; Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note, however, that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr( B^T Ω B ).


        5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased/decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2) ‖YΘ − XB‖²_2.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K − 1) similar systems

    ( X_A^T X_A + λΩ ) β_k = X_A^T Y θ_k^0 ,        (5.1)


Figure 5.1: GLOSS block diagram (initialize the model (λ, B); solve the p-OS problem on the active set; exchange variables between the active and inactive sets according to the optimality conditions; on convergence, compute Θ, update B and stop).


Algorithm 1 Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, ..., p} : ‖β^j‖_2 > 0 }, Θ⁰ such that n^{-1} Θ⁰ᵀ Y^T Y Θ⁰ = I_{K−1}, convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
        B_A ← ( X_A^T X_A + λΩ )^{-1} X_A^T Y Θ⁰
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for all j ∈ A such that ‖β^j‖_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j}; go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in Ā
    ĵ ← argmax_{j∈Ā} ‖∂J/∂β^j‖_2
    if ‖∂J/∂β^ĵ‖_2 < λ then
        convergence ← true   (B is optimal)
    else
        A ← A ∪ {ĵ}
    end if
until convergence
(s, V) ← eigenanalyze( Θ⁰ᵀ Y^T X_A B_A ), that is, Θ⁰ᵀ Y^T X_A B_A V_k = s_k V_k, k = 1, ..., K − 1
Θ ← Θ⁰ V;  B ← B V;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, ..., K − 1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

    ( X^T X + λΩ ) B = X^T Y Θ .        (5.2)

Defining the Cholesky decomposition as C^T C = ( X^T X + λΩ ), (5.2) is solved efficiently as follows:

    C^T C B = X^T Y Θ
    C B = C^T \ X^T Y Θ
    B = C \ ( C^T \ X^T Y Θ ) ,        (5.3)

where the symbol "\" is the Matlab mldivide operator, which efficiently solves linear systems. The GLOSS code implements (5.3).
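A sketch of this computation in Python/SciPy, reusing a single Cholesky factorization for the K − 1 right-hand sides; the Matlab GLOSS code relies on mldivide instead, and the function name here is hypothetical.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def solve_penalized_systems(X_A, Y, Theta0, lam, Omega):
        """Solve (X_A' X_A + lam * Omega) B = X_A' Y Theta0 for all columns at once,
        reusing one Cholesky factorization as in (5.3)."""
        A = X_A.T @ X_A + lam * Omega       # same matrix for every right-hand side
        rhs = X_A.T @ (Y @ Theta0)          # |A| x (K-1) right-hand sides
        c, low = cho_factor(A)
        return cho_solve((c, low), rhs)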

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^T X + λΩ. This difficulty can be avoided using the following equivalent expression:

    B = Ω^{-1/2} ( Ω^{-1/2} X^T X Ω^{-1/2} + λ I )^{-1} Ω^{-1/2} X^T Y Θ⁰ ,        (5.4)

where the conditioning of Ω^{-1/2} X^T X Ω^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved for cases with large ω_j values. Our code is otherwise based on expression (5.2).

        52 Score Matrix

        The optimal score matrix Θ is made of the K minus 1 leading eigenvectors of

        YgtX(XgtX + Ω

        )minus1XgtY This eigen-analysis is actually solved in the form

        ΘgtYgtX(XgtX + Ω

        )minus1XgtYΘ (see Section 421 and Appendix B) The latter eigen-

        vector decomposition does not require the costly computation of(XgtX + Ω

        )minus1that

        52

        53 Optimality Conditions

        involves the inversion of an n times n matrix Let Θ0 be an arbitrary K times (K minus 1) ma-

        trix whose range includes the Kminus1 leading eigenvectors of YgtX(XgtX + Ω

        )minus1XgtY 1

        Then solving the Kminus1 systems (53) provides the value of B0 = (XgtX+λΩ)minus1XgtYΘ0This B0 matrix can be identified in the expression to eigenanalyze as

        Θ0gtYgtX(XgtX + Ω

        )minus1XgtYΘ0 = Θ0gtYgtXB0

        Thus the solution to penalized OS problem can be computed trough the singular

        value decomposition of the (K minus 1)times (K minus 1) matrix Θ0gtYgtXB0 = VΛVgt Defining

        Θ = Θ0V we have ΘgtYgtX(XgtX + Ω

        )minus1XgtYΘ = Λ and when Θ0 is chosen such

        that nminus1 Θ0gtYgtYΘ0 = IKminus1 we also have that nminus1 ΘgtYgtYΘ = IKminus1 holding theconstraints of the p-OS problem Hence assuming that the diagonal elements of Λ aresorted in decreasing order θk is an optimal solution to the p-OS problem Finally onceΘ has been computed the corresponding optimal regression coefficients B satisfying(52) are simply recovered using the mapping from Θ0 to Θ that is B = B0VAppendix E details why the computational trick described here for quadratic penaltiescan be applied to the group-Lasso for which Ω is defined by a variational formulation

        53 Optimality Conditions

        GLOSS uses an active set optimization technique to obtain the optimal values of thecoefficient matrix B and the score matrix Θ To be a solution the coefficient matrix mustobey Lemmas 43 and 44 Optimality conditions (432a) and (432b) can be deducedfrom those lemmas Both expressions require the computation of the gradient of theobjective function

        1

        2YΘminusXB22 + λ

        psumj=1

        wj∥∥βj∥∥

        2(55)

        Let J(B) be the data-fitting term 12 YΘminusXB22 Its gradient with respect to the jth

        row of B βj is the (K minus 1)-dimensional vector

        partJ(B)

        partβj= xj

        gt(XBminusYΘ)

        where xj is the column j of X Hence the first optimality condition (432a) can becomputed for every variable j as

        xjgt

        (XBminusYΘ) + λwjβj∥∥βj∥∥

        2

        1 As X is centered 1K belongs to the null space of YgtX(XgtX + Ω

        )minus1XgtY It is thus suffi-

        cient to choose Θ0 orthogonal to 1K to ensure that its range spans the leading eigenvectors of

        YgtX(XgtX + Ω

        )minus1XgtY In practice to comply with this desideratum and conditions (35b) and

        (35c) we set Θ0 =(YgtY

        )minus12U where U is a Ktimes (Kminus1) matrix whose columns are orthonormal

        vectors orthogonal to 1K

        53

        5 GLOSS Algorithm

        The second optimality condition (432b) can be computed for every variable j as∥∥∥xjgt (XBminusYΘ)∥∥∥

        2le λwj

        54 Active and Inactive Sets

        The feature selection mechanism embedded in GLOSS selects the variables that pro-vide the greatest decrease in the objective function This is accomplished by means ofthe optimality conditions (432a) and (432b) Let A be the active set with the variablesthat have already been considered relevant A variable j can be considered for inclusioninto the active set if it violates the second optimality condition We proceed one variableat a time by choosing the one that is expected to produce the greatest decrease in theobjective function

        j = maxj

        ∥∥∥xjgt (XBminusYΘ)∥∥∥

        2minus λwj 0

        The exclusion of a variable belonging to the active set A is considered if the norm∥∥βj∥∥

        2

        is small and if after setting βj to zero the following optimality condition holds∥∥∥xjgt (XBminusYΘ)∥∥∥

        2le λwj

        The process continue until no variable in the active set violates the first optimalitycondition and no variable in the inactive set violates the second optimality condition

        55 Penalty Parameter

        The penalty parameter can be specified by the user in which case GLOSS solves theproblem with this value of λ The other strategy is to compute the solution path forseveral values of λ GLOSS looks then for the maximum value of the penalty parameterλmax such that B 6= 0 and solve the p-OS problem for decreasing values of λ until aprescribed number of features are declared active

        The maximum value of the penalty parameter λmax corresponding to a null B matrixis obtained by computing the optimality condition (432b) at B = 0

        λmax = maxjisin1p

        1

        wj

        ∥∥∥xjgtYΘ0∥∥∥

        2

        The algorithm then computes a series of solutions along the regularization path definedby a series of penalties λ1 = λmax gt middot middot middot gt λt gt middot middot middot gt λT = λmin ge 0 by regularlydecreasing the penalty λt+1 = λt2 and using a warm-start strategy where the feasibleinitial guess for B(λt+1) is initialized with B(λt) The final penalty parameter λmin

        is specified in the optimization process when the maximum number of desired activevariables is attained (by default the minimum of n and p)

        54

        56 Options and Variants

        56 Options and Variants

        561 Scaling Variables

        As most penalization schemes GLOSS is sensitive to the scaling of variables Itthus makes sense to normalize them before applying the algorithm or equivalently toaccommodate weights in the penalty This option is available in the algorithm

        562 Sparse Variant

        This version replaces some matlab commands used in the standard version of GLOSSby the sparse equivalents commands In addition some mathematical structures areadapted for sparse computation

        563 Diagonal Variant

        We motivated the group-Lasso penalty by sparsity requisites but robustness consid-erations could also drive its usage since LDA is known to be unstable when the numberof examples is small compared to the number of variables In this context LDA hasbeen experimentally observed to benefit from unrealistic assumptions on the form of theestimated within-class covariance matrix Indeed the diagonal approximation that ig-nores correlations between genes may lead to better classification in microarray analysisBickel and Levina (2004) shown that this crude approximation provides a classifier withbest worst-case performances than the LDA decision rule in small sample size regimeseven if variables are correlated

        The equivalence proof between penalized OS and penalized LDA (Hastie et al 1995)reveals that quadratic penalties in the OS problem are equivalent to penalties on thewithin-class covariance matrix in the LDA formulation This proof suggests a slightvariant of penalized OS corresponding to penalized LDA with diagonal within-classcovariance matrix where the least square problems

        minBisinRptimesKminus1

        YΘminusXB2F = minBisinRptimesKminus1

        tr(ΘgtYgtYΘminus 2ΘgtYgtXB + nBgtΣTB

        )are replaced by

        minBisinRptimesKminus1

        tr(ΘgtYgtYΘminus 2ΘgtYgtXB + nBgt(ΣB + diag (ΣW))B

        )Note that this variant only requires diag(ΣW)+ΣB +nminus1Ω to be positive definite whichis a weaker requirement than ΣT + nminus1Ω positive definite

        564 Elastic net and Structured Variant

        For some learning problems the structure of correlations between variables is partiallyknown Hastie et al (1995) applied this idea to the field of handwritten digits recognition

        55

        5 GLOSS Algorithm

        7 8 9

        4 5 6

        1 2 3

        - ΩL =

        3 minus1 0 minus1 minus1 0 0 0 0minus1 5 minus1 minus1 minus1 minus1 0 0 00 minus1 3 0 minus1 minus1 0 0 0minus1 minus1 0 5 minus1 0 minus1 minus1 0minus1 minus1 minus1 minus1 8 minus1 minus1 minus1 minus10 minus1 minus1 0 minus1 5 0 minus1 minus10 0 0 minus1 minus1 0 3 minus1 00 0 0 minus1 minus1 minus1 minus1 5 minus10 0 0 0 minus1 minus1 0 minus1 3

        Figure 52 Graph and Laplacian matrix for a 3times 3 image

        for their penalized discriminant analysis model to constrain the discriminant directionsto be spatially smooth

        When an image is represented as a vector of pixels it is reasonable to assume posi-tive correlations between the variables corresponding to neighboring pixels Figure 52represents the neighborhood graph of pixels in an 3 times 3 image with the correspondingLaplacian matrix The Laplacian matrix ΩL is semi-positive definite and the penaltyβgtΩLβ favors among vectors of identical L2 norms the ones having similar coeffi-cients in the neighborhoods of the graph For example this penalty is 9 for the vector(1 1 0 1 1 0 0 0 0)gt which is the indicator of the neighbors of pixel 1 and it is 17 forthe vector (minus1 1 0 1 1 0 0 0 0)gt with sign mismatch between pixel 1 and its neighbor-hood

        This smoothness penalty can be imposed jointly with the group-Lasso From thecomputational point of view GLOSS hardly needs to be modified The smoothnesspenalty has just to be added to group-Lasso penalty As the new penalty is convex andquadratic (thus smooth) there is no additional burden in the overall algorithm Thereis however an additional hyperparameter to be tuned

        56

        6 Experimental Results

        This section presents some comparison results between the Group Lasso Optimal Scor-ing Solver algorithm and two other classifiers at the state of the art proposed to performsparse LDA Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani 2011)which applies a Lasso penalty into a Fisherrsquos LDA framework and the Sparse LinearDiscriminant Analysis (SLDA) (Clemmensen et al 2011) which applies an Elastic netpenalty to the OS problem With the aim of testing the parsimony capacities the latteralgorithm was tested without any quadratic penalty that is with a Lasso penalty Theimplementation of PLDA and SLDA is available from the authorsrsquo website PLDA is anR implementation and SLDA is coded in matlab All the experiments used the sametraining validation and test sets Note that they differ significantly from the ones ofWitten and Tibshirani (2011) in Simulation 4 for which there was a typo in their paper

        61 Normalization

        With shrunken estimates the scaling of features has important outcomes For thelinear discriminants considered here the two most common normalization strategiesconsist in setting either the diagonal of the total covariance matrix ΣT to ones orthe diagonal of the within-class covariance matrix ΣW to ones These options can beimplemented either by scaling the observations accordingly prior to the analysis or byproviding penalties with weights The latter option is implemented in our matlabpackage 1

        62 Decision Thresholds

        The derivations of LDA based on the analysis of variance or on the regression ofclass indicators do not rely on the normality of the class-conditional distribution forthe observations Hence their applicability extends beyond the realm of Gaussian dataBased on this observation Friedman et al (2009 chapter 4) suggest to investigate otherdecision thresholds than the ones stemming from the Gaussian mixture assumptionIn particular they propose to select the decision thresholds that empirically minimizetraining error This option was tested using validation sets or cross-validation

        1The GLOSS matlab code can be found in the software section of wwwhdsutcfr~grandval

        57

        6 Experimental Results

        63 Simulated Data

        We first compare the three techniques in the simulation study of Witten and Tibshirani(2011) which considers four setups with 1200 examples equally distributed betweenclasses They are split in a training set of size n = 100 a validation set of size 100 anda test set of size 1000 We are in the small sample regime with p = 500 variables out ofwhich 100 differ between classes Independent variables are generated for all simulationsexcept for Simulation 2 where they are slightly correlated In Simulations 2 and 3 classesare optimally separated by a single projection of the original variables while the twoother scenarios require three discriminant directions The Bayesrsquo error was estimatedto be respectively 17 67 73 and 300 The exact definition of every setup asprovided in Witten and Tibshirani (2011) is

        Simulation1 Mean shift with independent features There are four classes If samplei is in class k then xi sim N(microk I) where micro1j = 07 times 1(1lejle25) micro2j = 07 times 1(26lejle50)micro3j = 07times 1(51lejle75) micro4j = 07times 1(76lejle100)

        Simulation2 Mean shift with dependent features There are two classes If samplei is in class 1 then xi sim N(0Σ) and if i is in class 2 then xi sim N(microΣ) withmicroj = 06 times 1(jle200) The covariance structure is block diagonal with 5 blocks each of

        dimension 100times 100 The blocks have (j jprime) element 06|jminusjprime| This covariance structure

        is intended to mimic gene expression data correlation

        Simulation3 One-dimensional mean shift with independent features There are fourclasses and the features are independent If sample i is in class k then Xij sim N(kminus1

        3 1)if j le 100 and Xij sim N(0 1) otherwise

        Simulation4 Mean shift with independent features and no linear ordering Thereare four classes If sample i is in class k then xi sim N(microk I) With mean vectorsdefined as follows micro1j sim N(0 032) for j le 25 and micro1j = 0 otherwise micro2j sim N(0 032)for 26 le j le 50 and micro2j = 0 otherwise micro3j sim N(0 032) for 51 le j le 75 and micro3j = 0otherwise micro4j sim N(0 032) for 76 le j le 100 and micro4j = 0 otherwise

        Note that this protocol is detrimental to GLOSS as each relevant variable only affectsa single class mean out of K The setup is favorable to PLDA in the sense that mostwithin-class covariance matrix are diagonal We thus also tested the diagonal GLOSSvariant discussed in Section 563

        The results are summarized in Table 61 Overall the best predictions are performedby PLDA and GLOS-D that both benefit of the knowledge of the true within-classcovariance structure Then among SLDA and GLOSS that both ignore this structureour proposal has a clear edge The error rates are far away from the Bayesrsquo error ratesbut the sample size is small with regard to the number of relevant variables Regardingsparsity the clear overall winner is GLOSS followed far away by SLDA which is the only

        58

        63 Simulated Data

        Table 61 Experimental results for simulated data averages with standard deviationscomputed over 25 repetitions of the test error rate the number of selectedvariables and the number of discriminant directions selected on the validationset

        Err () Var Dir

        Sim 1 K = 4 mean shift ind features

        PLDA 126 (01) 4117 (37) 30 (00)SLDA 319 (01) 2280 (02) 30 (00)GLOSS 199 (01) 1064 (13) 30 (00)GLOSS-D 112 (01) 2511 (41) 30 (00)

        Sim 2 K = 2 mean shift dependent features

        PLDA 90 (04) 3376 (57) 10 (00)SLDA 193 (01) 990 (00) 10 (00)GLOSS 154 (01) 398 (08) 10 (00)GLOSS-D 90 (00) 2035 (40) 10 (00)

        Sim 3 K = 4 1D mean shift ind features

        PLDA 138 (06) 1615 (37) 10 (00)SLDA 578 (02) 1526 (20) 19 (00)GLOSS 312 (01) 1238 (18) 10 (00)GLOSS-D 185 (01) 3575 (28) 10 (00)

        Sim 4 K = 4 mean shift ind features

        PLDA 603 (01) 3360 (58) 30 (00)SLDA 659 (01) 2088 (16) 27 (00)GLOSS 607 (02) 743 (22) 27 (00)GLOSS-D 588 (01) 1627 (49) 29 (00)

        59

        6 Experimental Results

        0 10 20 30 40 50 60 70 8020

        30

        40

        50

        60

        70

        80

        90

        100TPR Vs FPR

        gloss

        glossd

        slda

        plda

        Simulation1

        Simulation2

        Simulation3

        Simulation4

        Figure 61 TPR versus FPR (in ) for all algorithms and simulations

        Table 62 Average TPR and FPR (in ) computed over 25 repetitions

        Simulation1 Simulation2 Simulation3 Simulation4TPR FPR TPR FPR TPR FPR TPR FPR

        PLDA 990 782 969 603 980 159 743 656

        SLDA 739 385 338 163 416 278 507 395

        GLOSS 641 106 300 46 511 182 260 121

        GLOSS-D 935 394 921 281 956 655 429 299

        method that do not succeed in uncovering a low-dimensional representation in Simulation3 The adequacy of the selected features was assessed by the True Positive Rate (TPR)and the False Positive Rate (FPR) The TPR is defined as the ratio of selected variablesthat are actually relevant Similarly the FPR is the ratio of selected variables that areactually non relevant The best algorithm would be the one that selects all the relevantvariables and rejects all the others That is TPR = 1 and FPR = 0 simultaneouslyPLDA has the best TPR but a terrible FPR except in simulation 3 where it dominatesall the other methods GLOSS has by far the best FPR with overall TPR slightly belowSLDA Results are displayed in Figure 61 (both in percentages) (or in Table 62 )

        64 Gene Expression Data

        We now compare GLOSS to PLDA and SLDA on three genomic datasets TheNakayama2 dataset contains 105 examples of 22283 gene expressions for categorizing10 soft tissue tumors It was reduced to the 86 examples belonging to the 5 dominantcategories (Witten and Tibshirani 2011) The Ramaswamy3 dataset contains 198 exam-

        2httpwwwbroadinstituteorgcancersoftwaregenepatterndatasets3httpwwwncbinlmnihgovsitesGDSbrowseracc=GDS2736

        60

        64 Gene Expression Data

        Table 63 Experimental results for gene expression data averages over 10 trainingtestsets splits with standard deviations of the test error rates and the numberof selected variables

        Err () Var

        Nakayama n = 86 p = 22 283 K = 5

        PLDA 2095 (13) 104787 (21163)SLDA 2571 (17) 2525 (31)GLOSS 2048 (14) 1290 (186)

        Ramaswamy n = 198 p = 16 063 K = 14

        PLDA 3836 (60) 148735 (7203)SLDA mdash mdashGLOSS 2061 (69) 3724 (1221)

        Sun n = 180 p = 54 613 K = 4

        PLDA 3378 (59) 216348 (74432)SLDA 3622 (65) 3844 (165)GLOSS 3177 (45) 930 (936)

        ples of 16063 gene expressions for categorizing 14 classes of cancer Finally the Sun4

        dataset contains 180 examples of 54613 gene expressions for categorizing 4 classes oftumors

        Each dataset was split into a training set and a test set with respectively 75 and25 of the examples Parameter tuning is performed by 10-fold cross-validation and thetest performances are then evaluated The process is repeated 10 times with randomchoices of training and test set split

        Test error rates and the number of selected variables are presented in Table 63 Theresults for the PLDA algorithm are extracted from Witten and Tibshirani (2011) Thethree methods have comparable prediction performances on the Nakayama and Sundatasets but GLOSS performs better on the Ramaswamy data where the SparseLDApackage failed to return a solution due to numerical problems in the LARS-EN imple-mentation Regarding the number of selected variables GLOSS is again much sparserthan its competitors

        Finally Figure 62 displays the projection of the observations for the Nakayama andSun datasets in the first canonical planes estimated by GLOSS and SLDA For theNakayama dataset groups 1 and 2 are well-separated from the other ones in both rep-resentations but GLOSS is more discriminant in the meta-cluster gathering groups 3to 5 For the Sun dataset SLDA suffers from a high colinearity of its first canonicalvariables that renders the second one almost non-informative As a result group 1 isbetter separated in the first canonical plane with GLOSS

        4httpwwwncbinlmnihgovsitesGDSbrowseracc=GDS1962

        61

        6 Experimental Results

        GLOSS SLDA

        Naka

        yam

        a

        minus25000 minus20000 minus15000 minus10000 minus5000 0 5000

        minus25

        minus2

        minus15

        minus1

        minus05

        0

        05

        1

        x 104

        1) Synovial sarcoma

        2) Myxoid liposarcoma

        3) Dedifferentiated liposarcoma

        4) Myxofibrosarcoma

        5) Malignant fibrous histiocytoma

        2n

        dd

        iscr

        imin

        ant

        minus2000 0 2000 4000 6000 8000 10000 12000 14000

        2000

        4000

        6000

        8000

        10000

        12000

        14000

        16000

        1) Synovial sarcoma

        2) Myxoid liposarcoma

        3) Dedifferentiated liposarcoma

        4) Myxofibrosarcoma

        5) Malignant fibrous histiocytoma

        Su

        n

        minus1 minus05 0 05 1 15 2

        x 104

        05

        1

        15

        2

        25

        3

        35

        x 104

        1) NonTumor

        2) Astrocytomas

        3) Glioblastomas

        4) Oligodendrogliomas

        1st discriminant

        2n

        dd

        iscr

        imin

        ant

        minus2 minus15 minus1 minus05 0

        x 104

        0

        05

        1

        15

        2

        x 104

        1) NonTumor

        2) Astrocytomas

        3) Glioblastomas

        4) Oligodendrogliomas

        1st discriminant

        Figure 62 2D-representations of Nakayama and Sun datasets based on the two first dis-criminant vectors provided by GLOSS and SLDA The big squares representclass means

        62

        65 Correlated Data

        Figure 63 USPS digits ldquo1rdquo and ldquo0rdquo

        65 Correlated Data

        When the features are known to be highly correlated the discrimination algorithmcan be improved by using this information in the optimization problem The structuredvariant of GLOSS presented in Section 564 S-GLOSS from now on was conceived tointroduce easily this prior knowledge

        The experiments described in this section are intended to illustrate the effect of com-bining the group-Lasso sparsity inducing penalty with a quadratic penalty used as asurrogate of the unknown within-class variance matrix This preliminary experimentdoes not include comparisons with other algorithms More comprehensive experimentalresults have been left for future works

        For this illustration we have used a subset of the USPS handwritten digit datasetmade of of 16times 16 pixels representing digits from 0 to 9 For our purpose we comparethe discriminant direction that separates digits ldquo1rdquo and ldquo0rdquo computed with GLOSS andS-GLOSS The mean image of every digit is showed in Figure 63

        As in Section 564 we have represented the pixel proximity relationships from Figure52 into a penalty matrix ΩL but this time in a 256-nodes graph Introducing this new256times 256 Laplacian penalty matrix ΩL in the GLOSS algorithm is straightforward

        The effect of this penalty is fairly evident in Figure 64 where the discriminant vectorβ resulting of a non-penalized execution of GLOSS is compared with the β resultingfrom a Laplace penalized execution of S-GLOSS (without group-Lasso penalty) Weperfectly distinguish the center of the digit ldquo0rdquo in the discriminant direction obtainedby S-GLOSS that is probably the most important element to discriminate both digits

        Figure 65 display the discriminant direction β obtained by GLOSS and S-GLOSSfor a non-zero group-Lasso penalty with an identical penalization parameter (λ = 03)Even if both solutions are sparse the discriminant vector from S-GLOSS keeps connectedpixels that allow to detect strokes and will probably provide better prediction results

        63

        6 Experimental Results

        β for GLOSS β for S-GLOSS

        Figure 64 Discriminant direction between digits ldquo1rdquo and ldquo0rdquo

        β for GLOSS and λ = 03 β for S-GLOSS and λ = 03

        Figure 65 Sparse discriminant direction between digits ldquo1rdquo and ldquo0rdquo

        64

        Discussion

        GLOSS is an efficient algorithm that performs sparse LDA based on the regressionof class indicators Our proposal is equivalent to a penalized LDA problem This isup to our knowledge the first approach that enjoys this property in the multi-classsetting This relationship is also amenable to accommodate interesting constraints onthe equivalent penalized LDA problem such as imposing a diagonal structure of thewithin-class covariance matrix

        Computationally GLOSS is based on an efficient active set strategy that is amenableto the processing of problems with a large number of variables The inner optimizationproblem decouples the p times (K minus 1)-dimensional problem into (K minus 1) independent p-dimensional problems The interaction between the (K minus 1) problems is relegated tothe computation of the common adaptive quadratic penalty The algorithm presentedhere is highly efficient in medium to high dimensional setups which makes it a goodcandidate for the analysis of gene expression data

        The experimental results confirm the relevance of the approach which behaves wellcompared to its competitors either regarding its prediction abilities or its interpretabil-ity (sparsity) Generally compared to the competing approaches GLOSS providesextremely parsimonious discriminants without compromising prediction performancesEmploying the same features in all discriminant directions enables to generate modelsthat are globally extremely parsimonious with good prediction abilities The resultingsparse discriminant directions also allow for visual inspection of data from the low-dimensional representations that can be produced

        The approach has many potential extensions that have not yet been implemented Afirst line of development is to consider a broader class of penalties For example plainquadratic penalties can also be added to the group-penalty to encode priors about thewithin-class covariance structure in the spirit of the Penalized Discriminant Analysis ofHastie et al (1995) Also besides the group-Lasso our framework can be customized toany penalty that is uniformly spread within groups and many composite or hierarchicalpenalties that have been proposed for structured data meet this condition

        65

        Part III

        Sparse Clustering Analysis

        67

        Abstract

        Clustering can be defined as a grouping task of samples such that all the elementsbelonging to one cluster are more ldquosimilarrdquo to each other than to the objects belongingto the other groups There are similarity measures for any data structure databaserecords or even multimedia objects (audio video) The similarity concept is closelyrelated to the idea of distance which is a specific dissimilarity

        Model-based clustering aims to describe an heterogeneous population with a proba-bilistic model that represent each group with a its own distribution Here the distribu-tions will be Gaussians and the different populations are identified with different meansand common covariance matrix

        As in the supervised framework the traditional clustering techniques perform worsewhen the number of irrelevant features increases In this part we develop Mix-GLOSSwhich builds on the supervised GLOSS algorithm to address unsupervised problemsresulting in a clustering mechanism with embedded feature selection

        Chapter 7 reviews different techniques of inducing sparsity in model-based clusteringalgorithms The theory that motivates our original formulation of the EM algorithm isdeveloped in Chapter 8 followed by the description of the algorithm in Chapter 9 Its per-formance is assessed and compared to other model-based sparse clustering mechanismsat the state of the art in Chapter 10

        69

        7 Feature Selection in Mixture Models

        71 Mixture Models

        One of the most popular clustering algorithm is K-means that aims to partition nobservations into K clusters Each observation is assigned to the cluster with the nearestmean (MacQueen 1967) A generalization of K-means can be made through probabilisticmodels which represents K subpopulations by a mixture of distributions Since their firstuse by Newcomb (1886) for the detection of outlier points and 8 years later by Pearson(1894) to identify two separate populations of crabs finite mixtures of distributions havebeen employed to model a wide variety of random phenomena These models assumethat measurements are taken from a set of individuals each of which belongs to oneout of a number of different classes while any individualrsquos particular class is unknownMixture models can thus address the heterogeneity of a population and are especiallywell suited to the problem of clustering

        711 Model

        We assume that the observed data X = (xgt1 xgtn )gt have been drawn identically

        from K different subpopulations in the domain Rp The generative distribution is afinite mixture model that is the data are assumed to be generated from a compoundeddistribution whose density can be expressed as

        f(xi) =

        Ksumk=1

        πkfk(xi) foralli isin 1 n

        where K is the number of components fk are the densities of the components and πk arethe mixture proportions (πk isin]0 1[ forallk and

        sumk πk = 1) Mixture models transcribe that

        given the proportions πk and the distributions fk for each class the data is generatedaccording to the following mechanism

        bull y each individual is allotted to a class according to a multinomial distributionwith parameters π1 πK

        bull x each xi is assumed to arise from a random vector with probability densityfunction fk

        In addition it is usually assumed that the component densities fk belong to a para-metric family of densities φ(middotθk) The density of the mixture can then be written as

        f(xiθ) =

        Ksumk=1

        πkφ(xiθk) foralli isin 1 n

        71

        7 Feature Selection in Mixture Models

        where θ = (π1 πK θ1 θK) is the parameter of the model

        712 Parameter Estimation The EM Algorithm

        For the estimation of parameters of the mixture model Pearson (1894) used themethod of moments to estimate the five parameters (micro1 micro2 σ

        21 σ

        22 π) of a univariate

        Gaussian mixture model with two components That method required him to solvepolynomial equations of degree nine There are also graphic methods maximum likeli-hood methods and Bayesian approaches

        The most widely used process to estimate the parameters is by maximizing the log-likelihood using the EM algorithm It is typically used to maximize the likelihood formodels with latent variables for which no analytical solution is available (Dempsteret al 1977)

        The EM algorithm iterates two steps called the expectation step (E) and the max-imization step (M) Each expectation step involves the computation of the likelihoodexpectation with respect to the hidden variables while each maximization step esti-mates the parameters by maximizing the E-step expected likelihood

        Under mild regularity assumptions this mechanism converges to a local maximumof the likelihood However the type of problems targeted is typically characterized bythe existence of several local maxima and global convergence cannot be guaranteed Inpractice the obtained solution depends on the initialization of the algorithm

        Maximum Likelihood Definitions

        The likelihood is is commonly expressed in its logarithmic version

        L(θ X) = log

        (nprodi=1

        f(xiθ)

        )

        =nsumi=1

        log

        (Ksumk=1

        πkfk(xiθk)

        ) (71)

        where n in the number of samples K is the number of components of the mixture (ornumber of clusters) and πk are the mixture proportions

        To obtain maximum likelihood estimates the EM algorithm works with the jointdistribution of the observations x and the unknown latent variables y which indicatethe cluster membership of every sample The pair z = (xy) is called the completedata The log-likelihood of the complete data is called the complete log-likelihood or

        72

        71 Mixture Models

        classification log-likelihood

        LC(θ XY) = log

        (nprodi=1

        f(xiyiθ)

        )

        =

        nsumi=1

        log

        (Ksumk=1

        yikπkfk(xiθk)

        )

        =nsumi=1

        Ksumk=1

        yik log (πkfk(xiθk)) (72)

        The yik are the binary entries of the indicator matrix Y with yik = 1 if the observation ibelongs to the cluster k and yik = 0 otherwise

        Defining the soft membership tik(θ) as

        tik(θ) = p(Yik = 1|xiθ) (73)

        =πkfk(xiθk)

        f(xiθ) (74)

        To lighten notations tik(θ) will be denoted tik when parameter θ is clear from contextThe regular (71) and complete (72) log-likelihood are related as follows

        LC(θ XY) =sumik

        yik log (πkfk(xiθk))

        =sumik

        yik log (tikf(xiθ))

        =sumik

        yik log tik +sumik

        yik log f(xiθ)

        =sumik

        yik log tik +nsumi=1

        log f(xiθ)

        =sumik

        yik log tik + L(θ X) (75)

        wheresum

        ik yik log tik can be reformulated as

        sumik

        yik log tik =nsumi=1

        Ksumk=1

        yik log(p(Yik = 1|xiθ))

        =

        nsumi=1

        log(p(Yik = 1|xiθ))

        = log (p(Y |Xθ))

        As a result the relationship (75) can be rewritten as

        L(θ X) = LC(θ Z)minus log (p(Y |Xθ)) (76)

        73

        7 Feature Selection in Mixture Models

        Likelihood Maximization

        The complete log-likelihood cannot be assessed because the variables yik are unknownHowever it is possible to estimate the value of log-likelihood taking expectations condi-tionally to a current value of θ on (76)

        L(θ X) = EYsimp(middot|Xθ(t)) [LC(θ X Y ))]︸ ︷︷ ︸Q(θθ(t))

        +EYsimp(middot|Xθ(t)) [minus log p(Y |Xθ)]︸ ︷︷ ︸H(θθ(t))

        In this expression H(θθ(t)) is the entropy and Q(θθ(t)) is the conditional expecta-tion of the complete log-likelihood Let us define an increment of the log-likelihood as∆L = L(θ(t+1) X)minus L(θ(t) X) Then θ(t+1) = argmaxθQ(θθ(t)) also increases thelog-likelihood

        ∆L = (Q(θ(t+1)θ(t))minusQ(θ(t)θ(t)))︸ ︷︷ ︸ge0 by definition of iteration t+1

        minus (H(θ(t+1)θ(t))minusH(θ(t)θ(t)))︸ ︷︷ ︸le0 by Jensen Inequality

        Therefore it is possible to maximize the likelihood by optimizing Q(θθ(t)) The rela-tionship between Q(θθprime) and L(θ X) is developed in deeper detail in Appendix F toshow how the value of L(θ X) can be recovered from Q(θθ(t))

        For the mixture model problem Q(θθprime) is

        Q(θθprime) = EYsimp(Y |Xθprime) [LC(θ X Y ))]

        =sumik

        p(Yik = 1|xiθprime) log(πkfk(xiθk))

        =nsumi=1

        Ksumk=1

        tik(θprime) log (πkfk(xiθk)) (77)

        Q(θθprime) due to its similitude to the expression of the complete likelihood (72) is alsoknown as the weighted likelihood In (77) the weights tik(θ

        prime) are the posterior proba-bilities of cluster memberships

        Hence the EM algorithm sketched above results in

        bull Initialization (not iterated) choice of the initial parameter θ(0)

        bull E-Step Evaluation of Q(θθ(t)) using tik(θ(t)) (74) in (77)

        bull M-Step Calculation of θ(t+1) = argmaxθQ(θθ(t))

        74

        72 Feature Selection in Model-Based Clustering

        Gaussian Model

        In the particular case of a Gaussian mixture model with common covariance matrixΣ and different vector means microk the mixture density is

        f(xiθ) =Ksumk=1

        πkfk(xiθk)

        =

        Ksumk=1

        πk1

        (2π)p2 |Σ|

        12

        exp

        minus1

        2(xi minus microk)

        gtΣminus1(xi minus microk)

        At the E-step the posterior probabilities tik are computed as in (74) with the currentθ(t) parameters then the M-Step maximizes Q(θθ(t)) (77) whose form is as follows

        Q(θθ(t)) =sumik

        tik log(πk)minussumik

        tik log(

        (2π)p2 |Σ|

        12

        )minus 1

        2

        sumik

        tik(xi minus microk)gtΣminus1(xi minus microk)

        =sumk

        tk log(πk)minusnp

        2log(2π)︸ ︷︷ ︸

        constant term

        minusn2

        log(|Σ|)minus 1

        2

        sumik

        tik(xi minus microk)gtΣminus1(xi minus microk)

        equivsumk

        tk log(πk)minusn

        2log(|Σ|)minus

        sumik

        tik

        (1

        2(xi minus microk)

        gtΣminus1(xi minus microk)

        ) (78)

        where

        tk =nsumi=1

        tik (79)

        The M-step which maximizes this expression with respect to θ applies the followingupdates defining θ(t+1)

        π(t+1)k =

        tkn

        (710)

        micro(t+1)k =

        sumi tikxitk

        (711)

        Σ(t+1) =1

        n

        sumk

        Wk (712)

        with Wk =sumi

        tik(xi minus microk)(xi minus microk)gt (713)

        The derivations are detailed in Appendix G

        72 Feature Selection in Model-Based Clustering

        When common covariance matrices are assumed Gaussian mixtures are related toLDA with partitions defined by linear decision rules When every cluster has its own

        75

        7 Feature Selection in Mixture Models

        covariance matrix Σk Gaussian mixtures are associated to quadratic discriminant anal-ysis (QDA) with quadratic boundaries

        In the high-dimensional low-sample setting numerical issues appear in the estimationof the covariance matrix To avoid those singularities regularization may be applied Aregularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989)Bensmail and Celeux (1996) extended this algorithm but rewriting the covariance matrixin terms of its eigenvalue decomposition Σk = λkDkAkD

        gtk (Banfield and Raftery 1993)

        These regularization schemes address singularity and stability issues but they do notinduce parsimonious models

        In this Chapter we review some techniques to induce sparsity with model-based clus-tering algorithms This sparsity refers to the rule that assigns examples to classesclustering is still performed in the original p-dimensional space but the decision rulecan be expressed with only a few coordinates of this high-dimensional space

        721 Based on Penalized Likelihood

        Penalized log-likelihood maximization is a popular estimation technique for mixturemodels It is typically achieved by the EM algorithm using mixture models for which theallocation of examples is expressed as a simple function of the input features For exam-ple for Gaussian mixtures with a common covariance matrix the log-ratio of posteriorprobabilities is a linear function of x

        log

        (p(Yk = 1|x)

        p(Y` = 1|x)

        )= xgtΣminus1(microk minus micro`)minus

        1

        2(microk + micro`)

        gtΣminus1(microk minus micro`) + logπkπ`

        In this model a simple way of introducing sparsity in discriminant vectors Σminus1(microk minusmicro`) is to constrain Σ to be diagonal and to favor sparse means microk Indeed for Gaussianmixtures with common diagonal covariance matrix if all means have the same value ondimension j then variable j is useless for class allocation and can be discarded Themeans can be penalized by the L1 norm

        λKsumk=1

        psumj=1

        |microkj |

        as proposed by Pan et al (2006) Pan and Shen (2007) Zhou et al (2009) consider morecomplex penalties on full covariance matrices

        λ1

        Ksumk=1

        psumj=1

        |microkj |+ λ2

        Ksumk=1

        psumj=1

        psumm=1

        |(Σminus1k )jm|

        In their algorithm they make use the graphical Lasso to estimate the covariances Evenif their formulation induces sparsity on the parameters their combination of L1 penaltiesdoes not directly target decision rules based on few variables and thus does not guaranteeparsimonious models

        76

        72 Feature Selection in Model-Based Clustering

        Guo et al (2010) propose a variation with a Pairwise Fusion Penalty (PFP)

        λ

        psumj=1

        sum16k6kprime6K

        |microkj minus microkprimej |

        This PFP regularization is not shrinking the means to zero but towards to each otherThe jth feature for all cluster means are driven to the same value that variable can beconsidered as non-informative

        A L1infin penalty is used by Wang and Zhu (2008) and Kuan et al (2010) to penalizethe likelihood encouraging null groups of features

        λ

        psumj=1

        (micro1j micro2j microKj)infin

        One group is defined for each variable j as the set of the K meanrsquos jth component(micro1j microKj) The L1infin penalty forces zeros at the group level favoring the removalof the corresponding feature This method seems to produce parsimonious models andgood partitions within a reasonable computing time In addition the code is publiclyavailable Xie et al (2008b) apply a group-Lasso penalty Their principle describesa vertical mean grouping (VMG with the same groups as Xie et al (2008a)) and ahorizontal mean grouping (HMG) VMG allows to get real feature selection because itforces null values for the same variable in all cluster means

        λradicK

        psumj=1

        radicradicradicradic Ksum

        k=1

        micro2kj

        The clustering algorithm of VMG differs from ours but the group penalty proposed isthe same however no code is available on the authorsrsquo website that allows to test

        The optimization of a penalized likelihood by means of an EM algorithm can be refor-mulated rewriting the maximization expressions from the M-step as a penalized optimalscoring regression Roth and Lange (2004) implemented it for two cluster problems us-ing a L1 penalty to encourage sparsity on the discriminant vector The generalizationfrom quadratic to non-quadratic penalties is quickly outlined in this work We extendthis works by considering an arbitrary number of clusters and by formalizing the linkbetween penalized optimal scoring and penalized likelihood estimation

        722 Based on Model Variants

        The algorithm proposed by Law et al (2004) takes a different stance The authorsdefine feature relevancy considering conditional independency That is the jth feature ispresumed uninformative if its distribution is independent of the class labels The densityis expressed as

        77

        7 Feature Selection in Mixture Models

        f(xi|φ πθν) =Ksumk=1

        πk

        pprodj=1

        [f(xij |θjk)]φj [h(xij |νj)]1minusφj

        where f(middot|θjk) is the distribution function for relevant features and h(middot|νj) is the distri-bution function for the irrelevant ones The binary vector φ = (φ1 φ2 φp) representsrelevance with φj = 1 if the jth feature is informative and φj = 0 otherwise Thesaliency for variable j is then formalized as ρj = P (φj = 1) So all φj must be treatedas missing variables Thus the set of parameters is πk θjk νj ρj Theirestimation is done by means of the EM algorithm (Law et al 2004)

        An original and recent technique is the Fisher-EM algorithm proposed by Bouveyronand Brunet (2012ba) The Fisher-EM is a modified version of EM that runs in a latentspace This latent space is defined by an orthogonal projection matrix U isin RptimesKminus1

        which is updated inside the EM loop with a new step called the Fisher step (F-step fromnow on) which maximizes a multi-class Fisherrsquos criterion

        tr(

        (UgtΣWU)minus1UgtΣBU) (714)

        so as to maximize the separability of the data The E-step is the standard one computingthe posterior probabilities Then the F-step updates the projection matrix that projectsthe data to the latent space Finally the M-step estimates the parameters by maximizingthe conditional expectation of the complete log-likelihood Those parameters can berewritten as a function of the projection matrix U and the model parameters in thelatent space such that the U matrix enters into the M-step equations

        To induce feature selection Bouveyron and Brunet (2012a) suggest three possibilitiesThe first one results in the best sparse orthogonal approximation U of the matrix Uwhich maximizes (714) This sparse approximation is defined as the solution of

        minUisinRptimesKminus1

        ∥∥∥XU minusXU∥∥∥2

        F+ λ

        Kminus1sumk=1

        ∥∥∥uk∥∥∥1

        where XU = XU is the input data projected in the non-sparse space and uk is thekth column vector of the projection matrix U The second possibility is inspired byQiao et al (2009) and reformulates Fisherrsquos discriminant (714) used to compute theprojection matrix as a regression criterion penalized by a mixture of Lasso and Elasticnet

        minABisinRptimesKminus1

        Ksumk=1

        ∥∥∥RminusgtW HBk minusABgtHBk

        ∥∥∥2

        2+ ρ

        Kminus1sumj=1

        βgtj ΣWβj + λ

        Kminus1sumj=1

        ∥∥βj∥∥1

        s t AgtA = IKminus1

        where HB isin RptimesK is a matrix defined conditionally to the posterior probabilities tiksatisfying HBHgtB = ΣB and HBk is the kth column of HB RW isin Rptimesp is an upper

        78

        72 Feature Selection in Model-Based Clustering

        triangular matrix resulting from the Cholesky decomposition of ΣW ΣW and ΣB arethe p times p within-class and between-class covariance matrices in the observations spaceA isin RptimesKminus1 and B isin RptimesKminus1 are the solutions of the optimization problem such thatB = [β1 βKminus1] is the best sparse approximation of U

        The last possibility suggests the solution of the Fisherrsquos discriminant (714) as thesolution of the following constrained optimization problem

        minUisinRptimesKminus1

        psumj=1

        ∥∥∥ΣBj minus UUgtΣBj

        ∥∥∥2

        2

        s t UgtU = IKminus1

        whereΣBj is the jth column of the between covariance matrix in the observations spaceThis problem can be solved by a penalized version of the singular value decompositionproposed by (Witten et al 2009) resulting in a sparse approximation of U

        To comply with the constraint stating that the columns of U are orthogonal the firstand the second options must be followed by a singular vector decomposition of U to getorthogonality This is not necessary with the third option since the penalized version ofSVD already guarantees orthogonality

        However there is a lack of guarantees regarding convergence Bouveyron states ldquotheupdate of the orientation matrix U in the F-step is done by maximizing the Fishercriterion and not by directly maximizing the expected complete log-likelihood as requiredin the EM algorithm theory From this point of view the convergence of the Fisher-EM algorithm cannot therefore be guaranteedrdquo Immediately after this paragraph wecan read that under certain suppositions their algorithms converge ldquothe model []which assumes the equality and the diagonality of covariance matrices the F-step of theFisher-EM algorithm satisfies the convergence conditions of the EM algorithm theoryand the convergence of the Fisher-EM algorithm can be guaranteed in this case For theother discriminant latent mixture models although the convergence of the Fisher-EMprocedure cannot be guaranteed our practical experience has shown that the Fisher-EMalgorithm rarely fails to converge with these models if correctly initializedrdquo

        723 Based on Model Selection

        Some clustering algorithms recast the feature selection problem as model selectionproblem According to this Raftery and Dean (2006) model the observations as amixture model of Gaussians distributions To discover a subset of relevant features (andits superfluous complementary) they define three subsets of variables

        bull X(1) set of selected relevant variables

        bull X(2) set of variables being considered for inclusion or exclusion of X(1)

        bull X(3) set of non relevant variables

        79

        7 Feature Selection in Mixture Models

        With those subsets they defined two different models where Y is the partition toconsider

        bull M1

        f (X|Y) = f(X(1)X(2)X(3)|Y

        )= f

        (X(3)|X(2)X(1)

        )f(X(2)|X(1)

        )f(X(1)|Y

        )bull M2

        f (X|Y) = f(X(1)X(2)X(3)|Y

        )= f

        (X(3)|X(2)X(1)

        )f(X(2)X(1)|Y

        )Model M1 means that variables in X(2) are independent on clustering Y Model M2

        shows that variables in X(2) depend on clustering Y To simplify the algorithm subsetX(2) is only updated one variable at a time Therefore deciding the relevance of variableX(2) deals with a model selection between M1 and M2 The selection is done via theBayes factor

        B12 =f (X|M1)

        f (X|M2)

        where the high-dimensional f(X(3)|X(2)X(1)) cancels from the ratio

        B12 =f(X(1)X(2)X(3)|M1

        )f(X(1)X(2)X(3)|M2

        )=f(X(2)|X(1)M1

        )f(X(1)|M1

        )f(X(2)X(1)|M2

        )

        This factor is approximated since the integrated likelihoods f(X(1)|M1

        )and

        f(X(2)X(1)|M2

        )are difficult to calculate exactly Raftery and Dean (2006) use the

        BIC approximation The computation of f(X(2)|X(1)M1

        ) if there is only one variable

        in X(2) can be represented as a linear regression of variable X(2) on the variables inX(1) There is also a BIC approximation for this term

        Maugis et al (2009a) have proposed a variation of the algorithm developed by Rafteryand Dean They define three subsets of variables the relevant and irrelevant subsets(X(1) and X(3)) remains the same but X(2) is reformulated as a subset of relevantvariables that explains the irrelevance through a multidimensional regression This algo-rithm also uses of a backward stepwise strategy instead of the forward stepwise used byRaftery and Dean (2006) Their algorithm allows to define blocks of indivisible variablesthat in certain situations improve the clustering and its interpretability

        Both algorithms are well motivated and appear to produce good results however thequantity of computation needed to test the different subset of variables requires a hugecomputation time In practice they cannot be used for the amount of data consideredin this thesis

        80

        8 Theoretical Foundations

        In this chapter we develop Mix-GLOSS which uses the GLOSS algorithm conceivedfor supervised classification (see Section 5) to solve clustering problems The goal here issimilar that is providing an assignements of examples to clusters based on few features

        We use a modified version of the EM algorithm whose M-step is formulated as apenalized linear regression of a scaled indicator matrix that is a penalized optimalscoring problem This idea was originally proposed by Hastie and Tibshirani (1996)to perform reduced-rank decision rules using less than K minus 1 discriminant directionsTheir motivation was mainly driven by stability issues no sparsity-inducing mechanismwas introduced in the construction of discriminant directions Roth and Lange (2004)pursued this idea by for binary clustering problems where sparsity was introduced bya Lasso penalty applied to the OS problem Besides extending the work of Roth andLange (2004) to an arbitrary number of clusters we draw links between the OS penaltyand the parameters of the Gaussian model

        In the subsequent sections we provide the principles that allow to solve the M-stepas an optimal scoring problem The feature selection technique is embedded by meansof a group-Lasso penalty We must then guarantee that the equivalence between theM-step and the OS problem holds for our penalty As with GLOSS this is accomplishedwith a variational approach of group-Lasso Finally some considerations regarding thecriterion that is optimized with this modified EM are provided

        81 Resolving EM with Optimal Scoring

        In the previous chapters EM was presented as an iterative algorithm that computesa maximum likelihood estimate through the maximization of the expected complete log-likelihood This section explains how a penalized OS regression embedded into an EMalgorithm produces a penalized likelihood estimate

        811 Relationship Between the M-Step and Linear Discriminant Analysis

        LDA is typically used in a supervised learning framework for classification and dimen-sion reduction It looks for a projection of the data where the ratio of between-classvariance to within-class variance is maximized (see Appendix C) Classification in theLDA domain is based on the Mahalanobis distance

        d(ximicrok) = (xi minus microk)gtΣminus1

        W (xi minus microk)

        where microk are the p-dimensional centroids and ΣW is the p times p common within-classcovariance matrix

        81

        8 Theoretical Foundations

        The likelihood equations in the M-Step (711) and (712) can be interpreted as themean and covariance estimates of a weighted and augmented LDA problem Hastie andTibshirani (1996) where the n observations are replicated K times and weighted by tik(the posterior probabilities computed at the E-step)

        Having replicated the data vectors Hastie and Tibshirani (1996) remark that the pa-rameters maximizing the mixture likelihood in the M-step of the EM algorithm (711)and (712) can also be defined as the maximizers of the weighted and augmented likeli-hood

        2lweight(microΣ) =nsumi=1

        Ksumk=1

        tikd(ximicrok)minus n log(|ΣW|)

        which arises when considering a weighted and augmented LDA problem This viewpointprovides the basis for an alternative maximization of penalized maximum likelihood inGaussian mixtures

        812 Relationship Between Optimal Scoring and Linear DiscriminantAnalysis

        The equivalence between penalized optimal scoring problems and a penalized lineardiscriminant analysis has already been detailed in Section 41 in the supervised learningframework This is a critical part of the link between the M-step of an EM algorithmand optimal scoring regression

        813 Clustering Using Penalized Optimal Scoring

        The solution of the penalized optimal scoring regression in the M-step is a coefficientmatrix BOS analytically related to the Fisherrsquos discriminative directions BLDA for thedata (XY) where Y is the current (hard or soft) cluster assignement In order tocompute the posterior probabilities tik in the E-step the distance between the samplesxi and the centroids microk must be evaluated Depending wether we are working in theinput domain OS or LDA domain different expressions are used for the distances (seeSection 422 for more details) Mix-GLOSS works in the LDA domain based on thefollowing expression

        d(ximicrok) = (xminus microk)BLDA22 minus 2 log(πk)

        This distance defines the computation of the posterior probabilities tik in the E-step (seeSection 423) Putting together all those elements the complete clustering algorithmcan be summarized as

        82

        82 Optimized Criterion

        1 Initialize the membership matrix Y (for example by K-means algorithm)

        2 Solve the p-OS problem as

        BOS =(XgtX + λΩ

        )minus1XgtYΘ

        where Θ are the K minus 1 leading eigenvectors of

        YgtX(XgtX + λΩ

        )minus1XgtY

        3 Map X to the LDA domain XLDA = XBOSD with D = diag(αminus1k (1minusα2

        k)minus 1

        2 )

        4 Compute the centroids M in the LDA domain

        5 Evaluate distances in the LDA domain

        6 Translate distances into posterior probabilities tik with

        tik prop exp

        [minusd(x microk)minus 2 log(πk)

        2

        ] (81)

        7 Update the labels using the posterior probabilities matrix Y = T

        8 Go back to step 2 and iterate until tik converge

        Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alter-native view of the EM algorithm for Gaussian mixtures

        814 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

        In the previous section we schemed a clustering algorithm that replaces the M-stepwith penalized OS This modified version of EM holds for any quadratic penalty We ex-tend this equivalence to sparsity-inducing penalties through the a quadratic variationalapproach to the group-Lasso provided in Section 43 We now look for a formal equiva-lence between this penalty and penalized maximum likelihood for Gaussian mixtures

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ') (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by an optimal scoring regression is equivalent; replacing the M-step by a penalized optimal scoring problem is also possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2}\, |\Lambda_0|^{n/2}\, \Gamma_p\!\left(\frac{n}{2}\right)}\, |\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}} \exp\left\{-\frac{1}{2}\, \mathrm{tr}\!\left(\Lambda_0^{-1}\Sigma^{-1}\right)\right\} ,

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

\Gamma_p\!\left(\frac{n}{2}\right) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left(\frac{n}{2} + \frac{1-j}{2}\right) .

The posterior distribution can be maximized similarly to the likelihood, through the maximization of

Q(\theta, \theta') + \log f(\Sigma \mid \Lambda_0, \nu_0)

= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log \pi - \sum_{j=1}^{p} \log \Gamma\!\left(\frac{n}{2} + \frac{1-j}{2}\right) - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\,\mathrm{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right)

\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\,\mathrm{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right) ,   (8.2)

with

t_k = \sum_{i=1}^{n} t_{ik} , \qquad \nu_n = \nu_0 + n , \qquad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0 , \qquad S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top .

Details of these calculations can be found in textbooks (for example Bishop, 2006; Gelman et al., 2003).
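As an illustration, the short sketch below evaluates the right-hand side of (8.2), up to additive constants, for λ > 0 and under the identification Λ_0^{-1} = λΩ used later in this chapter. The function name and interface are ours, not part of the Mix-GLOSS code.

import numpy as np

def penalized_criterion(X, T, pi, Mu, Sigma, lam, Omega, nu0):
    """Value of (8.2) up to constants, with Lambda_0^{-1} = lam * Omega (illustrative sketch)."""
    n, p = X.shape
    tk = T.sum(axis=0)                              # t_k = sum_i t_ik
    nu_n = nu0 + n
    Lambda0_inv = lam * Omega
    S0 = np.zeros((p, p))                           # S_0 = sum_{i,k} t_ik (x_i - mu_k)(x_i - mu_k)^T
    for k in range(T.shape[1]):
        Xc = X - Mu[k]
        S0 += (T[:, [k]] * Xc).T @ Xc
    Lambda_n_inv = Lambda0_inv + S0
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_L0inv = np.linalg.slogdet(Lambda0_inv)   # log|Lambda_0| = -log|Lambda_0^{-1}|
    return (tk @ np.log(pi)
            + 0.5 * n * logdet_L0inv
            - 0.5 * (nu_n - p - 1) * logdet_Sigma
            - 0.5 * np.trace(np.linalg.solve(Sigma, Lambda_n_inv)))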

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, in which only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

\hat{\Sigma}_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1}\left(\Lambda_0^{-1} + S_0\right) ,   (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
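With this identification, the MAP covariance update reduces to the one-liner sketched below (illustrative names; note that with ν_0 = p + 1 the denominator reduces to n).

import numpy as np

def sigma_map(S0, Omega, lam, n, p):
    """MAP covariance (8.3) under nu_0 = p + 1 and Lambda_0^{-1} = lam * Omega (sketch)."""
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)   # denominator equals n here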


        9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X,

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically),

• the number of clusters K,

• the maximum number of iterations for the EM algorithm,

• the convergence tolerance for the EM algorithm,

• the number of whole repetitions of the clustering algorithm,

• a p × (K − 1) initial coefficient matrix (optional),

• an n × K initial posterior probability matrix (optional).

Figure 9.1: Mix-GLOSS loops scheme

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that this warm start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.
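The structure of this loop can be sketched as follows. The names are hypothetical: `em_solver(X, K, B, Y, lam)` stands for one full EM run (Algorithm 3 below) returning the coefficient matrix B and the label/posterior matrix Y.

import numpy as np

def lambda_path(X, K, lambdas, em_solver):
    """Warm-started loop over the penalty parameters (illustrative sketch)."""
    B, Y = None, None                         # first run is initialized inside the solver (K-means)
    path = {}
    for lam in sorted(lambdas):               # ascending order
        B, Y = em_solver(X, K, B, Y, lam)     # warm start from the previous solution
        path[lam] = (B, Y)
        if not np.any(B):                     # null coefficient matrix: stop the path
            break
    return path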

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) has to be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
  B ← 0; Y ← K-means(X, K)
  Run non-penalized Mix-GLOSS: λ ← 0; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  lastLAMBDA ← false
repeat
  Estimate λ:
    Compute the gradient at β^j = 0:
      \frac{\partial J(B)}{\partial \beta^j}\Big|_{\beta^j = 0} = x_j^\top \Big( \sum_{m \neq j} x_m \beta^m - Y\Theta \Big)
    Compute λ_max for every feature using (4.32b):
      \lambda_{\max}^j = \frac{1}{w_j} \Big\| \frac{\partial J(B)}{\partial \beta^j}\Big|_{\beta^j = 0} \Big\|_2
    Choose λ so as to remove 10% of the relevant features
  Run penalized Mix-GLOSS: (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
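The per-feature λ_max computation of Algorithm 2 can be sketched in a few lines of Python (our names; `Y_Theta` stands for the scaled indicator matrix YΘ and `w` for the group weights). λ is then typically chosen between these values so that roughly 10% of the currently relevant features fall below their λ_max.

import numpy as np

def lambda_max_per_feature(X, Y_Theta, B, w):
    """Smallest penalty that zeroes each feature, from the gradient at beta^j = 0 (sketch)."""
    n, p = X.shape
    lam_max = np.zeros(p)
    for j in range(p):
        B_minus_j = B.copy()
        B_minus_j[j, :] = 0.0                            # keep only the other features' contribution
        grad_j = X[:, j] @ (X @ B_minus_j - Y_Theta)     # x_j^T (sum_{m != j} x_m beta^m - Y Theta)
        lam_max[j] = np.linalg.norm(grad_j, 2) / w[j]
    return lam_max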

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.

Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then
    B_OS ← B0; Y ← Y0
  else
    B_OS ← 0; Y ← K-means(X, K)
  end if
  convergenceEM ← false; tolEM ← 1e-3
repeat
  M-step:
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
  E-step:
    t_ik as per (8.1)
    L(θ) as per (8.2)
  if (1/n) Σ_i |t_ik − y_ik| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ and the prior probability of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T using

t_{ik} \propto \exp\left[-\,\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right] .

The convergence of these t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm required a lot of computing resources, since the stability selection mechanism needed a certain number of repetitions, which transformed Mix-GLOSS into a lengthy four-nested-loop structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter.


[Figure 9.2 depicts the model selection flow: an initial Mix-GLOSS run with λ = 0 and 20 repetitions, taking X, K, λ, the maximum number of EM iterations and the number of repetitions as inputs; the B and T of the best repetition are used as StartB and StartT to warm-start Mix-GLOSS(λ, StartB, StartT) for each trial λ; BIC is computed for each run and λ is chosen as its minimizer, yielding the partition, t_ik, π_k, λ_BEST, B, Θ, D, L(θ) and the active set.]

Figure 9.2: Mix-GLOSS model selection diagram

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


        10 Experimental Results

The performance of Mix-GLOSS is measured here on the artificial dataset that was used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1, adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.

Figure 10.1: Class mean vectors for each artificial simulation

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The criteria used to measure performance are:

• Clustering error (in percentage): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ.

• Number of discarded features: this value shows the number of variables whose coefficients have been zeroed; they are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases. This is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected (recall), and the FPR is the proportion of irrelevant variables that are selected (fall-out). The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
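For reference, these two rates can be computed as in the following sketch, where `relevant` marks the first 20 variables of our simulations and `selected` the non-zero rows of the estimated coefficient matrix (names are ours).

import numpy as np

def tpr_fpr(selected, relevant):
    """TPR (recall) and FPR (fall-out) of a feature selection (illustrative sketch)."""
    selected = np.asarray(selected, dtype=bool)
    relevant = np.asarray(relevant, dtype=bool)
    tpr = selected[relevant].mean()            # share of the relevant variables that are kept
    fpr = selected[~relevant].mean()           # share of the irrelevant variables that are kept
    return tpr, fpr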

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data (averages over 25 repetitions, standard deviations in parentheses)

Simulation 1: K = 4, mean shift, independent features
                        Err (%)        Var            Time
  CS general cov        4.6 (1.5)      98.5 (7.2)     88.4 h
  Fisher EM             5.8 (8.7)      78.4 (5.2)     164.5 m
  Clustvarsel           60.2 (10.7)    37.8 (29.1)    38.3 h
  LumiWCluster-Kuan     4.2 (6.8)      77.9 (4.0)     38.9 s
  LumiWCluster-Wang     4.3 (6.9)      78.4 (3.9)     61.9 s
  Mix-GLOSS             3.2 (1.6)      80.0 (0.9)     1.5 h

Simulation 2: K = 2, mean shift, dependent features
  CS general cov        15.4 (2.0)     99.7 (0.9)     78.3 h
  Fisher EM             7.4 (2.3)      80.9 (2.8)     8 m
  Clustvarsel           7.3 (2.0)      33.4 (20.7)    16.6 h
  LumiWCluster-Kuan     6.4 (1.8)      79.8 (0.4)     15.5 s
  LumiWCluster-Wang     6.3 (1.7)      79.9 (0.3)     14 s
  Mix-GLOSS             7.7 (2.0)      84.1 (3.4)     2 h

Simulation 3: K = 4, 1D mean shift, independent features
  CS general cov        30.4 (5.7)     55.0 (46.8)    131.7 h
  Fisher EM             23.3 (6.5)     36.6 (5.5)     22 m
  Clustvarsel           65.8 (11.5)    23.2 (29.1)    54.2 h
  LumiWCluster-Kuan     32.3 (2.1)     80.0 (0.2)     83 s
  LumiWCluster-Wang     30.8 (3.6)     80.0 (0.2)     129.2 s
  Mix-GLOSS             34.7 (9.2)     81.0 (8.8)     2.1 h

Simulation 4: K = 4, mean shift, independent features
  CS general cov        62.6 (5.5)     99.9 (0.2)     11.2 h
  Fisher EM             56.7 (10.4)    55.0 (4.8)     19.5 m
  Clustvarsel           73.2 (4.0)     24.0 (1.2)     76.7 h
  LumiWCluster-Kuan     69.2 (11.2)    99.0 (2.0)     87.6 s
  LumiWCluster-Wang     69.7 (11.9)    99.1 (2.1)     82.5 s
  Mix-GLOSS             66.9 (9.1)     97.5 (1.2)     1.1 h

Table 10.2: TPR versus FPR (in %), averaged over the 25 repetitions, for the best performing algorithms

                     Simulation 1      Simulation 2      Simulation 3      Simulation 4
                     TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  Mix-GLOSS          99.2    0.15      82.8    3.35      88.4    6.7       78.0    1.2
  LumiWCluster-Kuan  99.2    2.8       100.0   0.2       100.0   0.05      5.0     0.05
  Fisher EM          98.6    2.4       88.8    1.7       83.8    58.25     62.0    40.75


[Figure 10.2 plots TPR against FPR (both in %) for MIX-GLOSS, LUMI-KUAN and FISHER-EM on the four simulations.]

Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing the irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.

Conclusions

        Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which proved its effectiveness, inducing extremely parsimonious models without giving up any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to be maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. For now, due to time constraints, only artificial datasets have been tested, with positive results.

        Perspectives

Even if the preliminary results are encouraging, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be advisable in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables; however, other high-dimension low-sample-size (HDLSS) domains are also possible. Identification of male or female silhouettes, of fungal species or of fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables. This can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of these possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized by Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms, which do not involve the computation of the real criterion, have been used for stopping the EM algorithm and for model selection. However, further investigations must be carried out in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.

Appendix

        A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:

\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top , \qquad \Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .

Property 2. \frac{\partial x^\top a}{\partial x} = \frac{\partial a^\top x}{\partial x} = a

Property 3. \frac{\partial x^\top A x}{\partial x} = (A + A^\top) x

Property 4. \frac{\partial |X^{-1}|}{\partial X} = -|X^{-1}| (X^{-1})^\top

Property 5. \frac{\partial a^\top X b}{\partial X} = a b^\top

Property 6. \frac{\partial}{\partial X} \mathrm{tr}\!\left(A X^{-1} B\right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we explain why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form

\min_{\theta_k, \beta_k} \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k   (B.1)
s.t. \theta_k^\top Y^\top Y \theta_k = 1 ,
     \theta_\ell^\top Y^\top Y \theta_k = 0 , \quad \forall \ell < k ,

for k = 1, ..., K − 1. The Lagrangian associated with Problem (B.1) is

L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k(\theta_k^\top Y^\top Y \theta_k - 1) + \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k .   (B.2)

Setting the gradient of (B.2) with respect to β_k to zero gives the optimal β_k:

\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k .   (B.3)

The objective function of (B.1) evaluated at β_k^⋆ is

\min_{\theta_k} \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star = \min_{\theta_k} \theta_k^\top Y^\top \left(I - X(X^\top X + \Omega_k)^{-1} X^\top\right) Y \theta_k
 = \max_{\theta_k} \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k .   (B.4)

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Performing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial, due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, so that expression (B.4) can be rewritten in a compact way:

\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \mathrm{tr}\!\left(\Theta^\top M \Theta\right)   (B.5)
s.t. \Theta^\top Y^\top Y \Theta = I_{K-1} .

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K − 1) × (K − 1) matrix M_Θ be Θ^⊤MΘ. Hence, the classical eigenvector formulation associated with (B.5) is

M_\Theta v = \lambda v ,   (B.6)

where v is the eigenvector and λ the associated eigenvalue of M_Θ. Operating,

v^\top M_\Theta v = \lambda \iff v^\top \Theta^\top M \Theta v = \lambda .

Making the change of variable w = Θv, we obtain an alternative eigen-problem where the w are the eigenvectors of M and λ the associated eigenvalue:

w^\top M w = \lambda .   (B.7)

Therefore, v are the eigenvectors of the matrix M_Θ and w are the eigenvectors of the matrix M. Note that the only difference between the (K − 1) × (K − 1) matrix M_Θ and the K × K matrix M is the K × (K − 1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{-1}X^⊤YΘ in M_Θ:

M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B .

Thus, the eigen-decomposition of the (K − 1) × (K − 1) matrix M_Θ = Θ^⊤Y^⊤XB yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv must be undone.

To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB. The definitive eigenvectors w are then recovered as w = Θv, and the final step is the reconstruction of the optimal score matrix Θ using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable and restores the w vectors. The B matrix also needs to be "updated", by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

B^{\mathrm{up}} = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V .
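The whole "update" amounts to a few matrix products and a small eigen-decomposition, as in the following NumPy sketch (our names; the symmetrization line only removes round-off asymmetry, since M_Θ = Θ^⊤MΘ is symmetric in theory).

import numpy as np

def update_scores(X, Y, B, Theta):
    """Eigen-decompose the small (K-1) x (K-1) matrix M_Theta = Theta^T Y^T X B
    instead of the p x p problem, then rotate Theta and B by the eigenvectors V (sketch)."""
    M_theta = Theta.T @ Y.T @ X @ B
    M_theta = 0.5 * (M_theta + M_theta.T)     # enforce symmetry numerically
    evals, V = np.linalg.eigh(M_theta)        # eigenvectors v of (B.6), eigenvalues ascending
    V = V[:, ::-1]                            # order by decreasing eigenvalue
    return Theta @ V, B @ V                   # w = Theta v, and the matching update of B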


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m , \quad \text{s.t. } \theta_k^\top \theta_k = 1 .   (B.8)

The score vectors' normalization constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis:

\left(\sum_{m=1}^{K-1} \alpha_m w_m\right)^\top \left(\sum_{m=1}^{K-1} \alpha_m w_m\right) = 1 ,

which, by the orthonormality of the eigenvectors, reduces to

\sum_{m=1}^{K-1} \alpha_m^2 = 1 .   (B.9)

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

M \theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m .

As the w_m are the eigenvectors of M, the relationship M w_m = λ_m w_m can be used to obtain

M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m .

Multiplying both sides on the left by θ_k^⊤, written as its linear combination of eigenvectors,

\theta_k^\top M \theta_k = \left(\sum_{\ell=1}^{K-1} \alpha_\ell w_\ell\right)^\top \left(\sum_{m=1}^{K-1} \alpha_m \lambda_m w_m\right) .

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤w_m = 0 for any ℓ ≠ m, giving

\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m .


The optimization problem (B.5) for discriminant direction k can be rewritten as

\max_{\theta_k \in \mathbb{R}^{K}} \left\{ \theta_k^\top M \theta_k \right\} = \max_{\theta_k \in \mathbb{R}^{K}} \left\{ \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m \right\}   (B.10)

with \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m and \sum_{m=1}^{K-1} \alpha_m^2 = 1 .

One way of maximizing Problem (B.10) is to choose α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K−1} α_m w_m, the resulting score vector θ_k will be equal to the k-th eigenvector w_k.

In summary, the solution to the original problem (B.1) can be obtained from an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

\max_{\beta \in \mathbb{R}^p} \beta^\top \Sigma_B \beta   (C.1a)
s.t. \beta^\top \Sigma_W \beta = 1 ,   (C.1b)

where Σ_B and Σ_W are respectively the between-class and the within-class covariance matrices of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu(\beta^\top \Sigma_W \beta - 1) ,

so that its first derivative with respect to β is

\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta .

A necessary optimality condition for β is that this derivative is zero, that is,

\Sigma_B \beta = \nu \Sigma_W \beta .

Provided Σ_W is full rank, we have

\Sigma_W^{-1} \Sigma_B \beta = \nu \beta .   (C.2)

Thus, the solutions β match the definition of an eigenvector of the matrix Σ_W^{-1}Σ_B with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

\beta^\top \Sigma_B \beta = \beta^\top \Sigma_W \Sigma_W^{-1} \Sigma_B \beta
 = \nu\, \beta^\top \Sigma_W \beta    from (C.2)
 = \nu    from (C.1b) .

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence, ν is the largest eigenvalue of Σ_W^{-1}Σ_B, and β is any eigenvector corresponding to this maximal eigenvalue.
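In practice this is a symmetric-definite generalized eigenproblem, which can be solved directly without forming Σ_W^{-1}Σ_B, as in the sketch below (illustrative names; Σ_W is assumed full rank as above).

import numpy as np
from scipy.linalg import eigh

def fisher_direction(Sigma_B, Sigma_W):
    """Leading Fisher direction as the top eigenvector of Sigma_B beta = nu Sigma_W beta (C.2)."""
    evals, evecs = eigh(Sigma_B, Sigma_W)   # generalized symmetric-definite eigenproblem
    beta = evecs[:, -1]                     # eigenvector of the largest eigenvalue nu
    # eigh normalizes the eigenvectors so that beta^T Sigma_W beta = 1, matching (C.1b)
    return beta, evals[-1]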


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

\min_{\tau \in \mathbb{R}^p} \min_{B \in \mathbb{R}^{p \times (K-1)}} J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}   (D.1a)
s.t. \sum_{j=1}^{p} \tau_j = 1 ,   (D.1b)
     \tau_j \geq 0 , \quad j = 1, \ldots, p .   (D.1c)

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}: B = (β^{1⊤}, ..., β^{p⊤})^⊤.

L(B, \tau, \lambda, \nu_0, \nu_j) = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} + \nu_0 \left(\sum_{j=1}^{p} \tau_j - 1\right) - \sum_{j=1}^{p} \nu_j \tau_j .   (D.2)

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j*:

\frac{\partial L(B, \tau, \lambda, \nu_0, \nu_j)}{\partial \tau_j}\bigg|_{\tau_j = \tau_j^\star} = 0
 \;\Rightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
 \;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
 \;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0 .

The last two expressions are related through one of the properties of the Lagrange multipliers, which states that ν_j g_j(τ*) = 0, where ν_j is the Lagrange multiplier and g_j(τ) is the corresponding inequality constraint. Then, the optimal τ_j* can be deduced:

\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j \|\beta^j\|_2 .

Placing this optimal value of τ_j* into constraint (D.1b):

\sum_{j=1}^{p} \tau_j^\star = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} .   (D.3)


With this value of τ_j*, Problem (D.1) is equivalent to

\min_{B \in \mathbb{R}^{p \times (K-1)}} J(B) + \lambda \left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right)^{2} .   (D.4)

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently written as λB^⊤ΩB, where

\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\right) .   (D.5)

Using the value of τ_j* from (D.3), each diagonal component of Ω is

(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2} .   (D.6)

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is

\left\{ V \in \mathbb{R}^{p \times (K-1)} : V = \frac{\partial J(B)}{\partial B} + 2\lambda \left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right) G \right\} ,   (D.7)

where G = (g^{1\top}, \ldots, g^{p\top})^\top is a p × (K − 1) matrix defined as follows. Let S(B) denote the row-wise support of B, S(B) = {j ∈ {1, ..., p} : ‖β^j‖_2 ≠ 0}; then we have

\forall j \in S(B) , \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j ,   (D.8)
\forall j \notin S(B) , \quad \|g^j\|_2 \leq w_j .   (D.9)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B* of the objective function verifying the following conditions are global minima. Let S(B*) denote the row-wise support of B*, S(B*) = {j ∈ {1, ..., p} : ‖β^{*j}‖_2 ≠ 0}, and let S̄(B*) be its complement; then we have

\forall j \in S(B^\star) , \quad -\frac{\partial J(B^\star)}{\partial \beta^j} = 2\lambda \left(\sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2\right) w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} ,   (D.10a)
\forall j \in \bar{S}(B^\star) , \quad \left\|\frac{\partial J(B^\star)}{\partial \beta^j}\right\|_2 \leq 2\lambda w_j \left(\sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2\right) .   (D.10b)

In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4). For a given B, the gap between these objectives is null at τ such that

\tau_j = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} .

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have

\left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right)^2 = \left(\sum_{j=1}^{p} \tau_j^{1/2}\, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}}\right)^2
 \leq \left(\sum_{j=1}^{p} \tau_j\right) \left(\sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}\right)
 \leq \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} ,

where we used the Cauchy-Schwarz inequality in the second line and the definition of the feasibility set of τ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B_0 are optimal for the score values Θ_0, and if the optimal scores Θ* are obtained by a unitary transformation of Θ_0, say Θ* = Θ_0 V (where V ∈ R^{M×M} is a unitary matrix), then B* = B_0 V is optimal conditionally on Θ*, that is, (Θ*, B*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let B* be a solution of

\min_{B \in \mathbb{R}^{p \times M}} \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 ,   (E.1)

and let \tilde{Y} = YV, where V ∈ R^{M×M} is a unitary matrix. Then \tilde{B} = B^\star V is a solution of

\min_{B \in \mathbb{R}^{p \times M}} \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 .   (E.2)

Proof. The first-order necessary optimality conditions for B* are

\forall j \in S(B^\star) , \quad 2\, x_j^\top \left(x_j \beta^{\star j} - Y\right) + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0 ,   (E.3a)
\forall j \in \bar{S}(B^\star) , \quad 2 \left\| x_j^\top \left(x_j \beta^{\star j} - Y\right) \right\|_2 \leq \lambda w_j ,   (E.3b)

where S(B*) ⊆ {1, ..., p} denotes the set of non-zero row vectors of B* and S̄(B*) is its complement.

First, we note that, from the definition of B̃, we have S(B̃) = S(B*). Then we may rewrite the above conditions as follows:

\forall j \in S(\tilde{B}) , \quad 2\, x_j^\top \left(x_j \tilde{\beta}^{j} - \tilde{Y}\right) + \lambda w_j \|\tilde{\beta}^{j}\|_2^{-1} \tilde{\beta}^{j} = 0 ,   (E.4a)
\forall j \in \bar{S}(\tilde{B}) , \quad 2 \left\| x_j^\top \left(x_j \tilde{\beta}^{j} - \tilde{Y}\right) \right\|_2 \leq \lambda w_j ,   (E.4b)

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, using VV^⊤ = I so that, for all u ∈ R^M, ‖u^⊤‖_2 = ‖u^⊤V‖_2. Equation (E.4b) is also


obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for B̃ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ') (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from Q(θ, θ') when the latter is available:

L(\theta) = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)\right) ,   (F.1)

Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right) ,   (F.2)

with \quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i; \theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(x_i; \theta'_\ell)} .   (F.3)

In the EM algorithm, θ' denotes the model parameters at the previous iteration, the t_ik(θ') are the posterior probabilities computed from θ' at the previous E-step, and θ, without "prime", denotes the parameters of the current iteration, to be obtained by the maximization of Q(θ, θ').

Using (F.3), we have

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right)
 = \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + \sum_{i,k} t_{ik}(\theta') \log\left(\sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell)\right)
 = \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + L(\theta) .

In particular, after the evaluation of the t_ik in the E-step, where θ = θ', the log-likelihood can be computed from the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log(t_{ik}(\theta)) = Q(\theta, \theta) + H(T) .


        G Derivation of the M-Step Equations

This appendix shows the whole process for obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right)
 = \sum_{k} \left(\sum_{i} t_{ik}\right) \log \pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) ,

which has to be maximized subject to \sum_{k} \pi_k = 1.

The Lagrangian of this problem is

L(\theta) = Q(\theta, \theta') + \lambda \left(\sum_{k} \pi_k - 1\right) .

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.

G.1 Prior Probabilities

\frac{\partial L(\theta)}{\partial \pi_k} = 0 \iff \frac{1}{\pi_k} \sum_{i} t_{ik} + \lambda = 0 ,

where λ is identified from the constraint, leading to

\hat{\pi}_k = \frac{1}{n} \sum_{i} t_{ik} .


G.2 Means

\frac{\partial L(\theta)}{\partial \mu_k} = 0 \iff -\frac{1}{2} \sum_{i} t_{ik}\, 2\, \Sigma^{-1} (\mu_k - x_i) = 0
 \;\Rightarrow\; \hat{\mu}_k = \frac{\sum_{i} t_{ik} x_i}{\sum_{i} t_{ik}} .

G.3 Covariance Matrix

\frac{\partial L(\theta)}{\partial \Sigma^{-1}} = 0 \iff \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0
 \;\Rightarrow\; \hat{\Sigma} = \frac{1}{n} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .


        Bibliography

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, Arxiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.


T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.


J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, ArXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D7.1-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D7.2-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.


T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.


C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Buhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.


Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York, Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Penalites Hierarchiques pour l'Integration de Connaissances dans les Modeles Statistiques. PhD thesis, Universite de Technologie de Compiegne, 2008.


M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Scholkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.


M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.




Contents

List of Figures
List of Tables
Notation and Symbols

I Context and Foundations

1 Context

2 Regularization for Feature Selection
  2.1 Motivations
  2.2 Categorization of Feature Selection Techniques
  2.3 Regularization
    2.3.1 Important Properties
    2.3.2 Pure Penalties
    2.3.3 Hybrid Penalties
    2.3.4 Mixed Penalties
    2.3.5 Sparsity Considerations
    2.3.6 Optimization Tools for Regularized Problems

II Sparse Linear Discriminant Analysis

Abstract

3 Feature Selection in Fisher Discriminant Analysis
  3.1 Fisher Discriminant Analysis
  3.2 Feature Selection in LDA Problems
    3.2.1 Inertia Based
    3.2.2 Regression Based

4 Formalizing the Objective
  4.1 From Optimal Scoring to Linear Discriminant Analysis
    4.1.1 Penalized Optimal Scoring Problem
    4.1.2 Penalized Canonical Correlation Analysis
    4.1.3 Penalized Linear Discriminant Analysis
    4.1.4 Summary
  4.2 Practicalities
    4.2.1 Solution of the Penalized Optimal Scoring Regression
    4.2.2 Distance Evaluation
    4.2.3 Posterior Probability Evaluation
    4.2.4 Graphical Representation
  4.3 From Sparse Optimal Scoring to Sparse LDA
    4.3.1 A Quadratic Variational Form
    4.3.2 Group-Lasso OS as Penalized LDA

5 GLOSS Algorithm
  5.1 Regression Coefficients Updates
    5.1.1 Cholesky decomposition
    5.1.2 Numerical Stability
  5.2 Score Matrix
  5.3 Optimality Conditions
  5.4 Active and Inactive Sets
  5.5 Penalty Parameter
  5.6 Options and Variants
    5.6.1 Scaling Variables
    5.6.2 Sparse Variant
    5.6.3 Diagonal Variant
    5.6.4 Elastic net and Structured Variant

6 Experimental Results
  6.1 Normalization
  6.2 Decision Thresholds
  6.3 Simulated Data
  6.4 Gene Expression Data
  6.5 Correlated Data
  Discussion

III Sparse Clustering Analysis

Abstract

7 Feature Selection in Mixture Models
  7.1 Mixture Models
    7.1.1 Model
    7.1.2 Parameter Estimation: The EM Algorithm
  7.2 Feature Selection in Model-Based Clustering
    7.2.1 Based on Penalized Likelihood
    7.2.2 Based on Model Variants
    7.2.3 Based on Model Selection

8 Theoretical Foundations
  8.1 Resolving EM with Optimal Scoring
    8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
    8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
    8.1.3 Clustering Using Penalized Optimal Scoring
    8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
  8.2 Optimized Criterion
    8.2.1 A Bayesian Derivation
    8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
  9.1 Mix-GLOSS
    9.1.1 Outer Loop: Whole Algorithm Repetitions
    9.1.2 Penalty Parameter Loop
    9.1.3 Inner Loop: EM Algorithm
  9.2 Model Selection

10 Experimental Results
  10.1 Tested Clustering Algorithms
  10.2 Results
  10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
  B.1 How to Solve the Eigenvector Decomposition
  B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
  D.1 Useful Properties
  D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
  G.1 Prior probabilities
  G.2 Means
  G.3 Covariance Matrix

Bibliography

List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ‖β‖p
2.4 Two dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image
6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS Loops Scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

          Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

          Sets

N        the set of natural numbers, N = {1, 2, . . .}
R        the set of reals
|A|      cardinality of a set A (for finite sets, the number of elements)
Ā        complement of set A

          Data

X          input domain
xi         input sample, xi ∈ X
X          design matrix, X = (x1⊤, . . . , xn⊤)⊤
xj         column j of X
yi         class indicator of sample i
Y          indicator matrix, Y = (y1⊤, . . . , yn⊤)⊤
z          complete data, z = (x, y)
Gk         set of the indices of observations belonging to class k
n          number of examples
K          number of classes
p          dimension of X
i, j, k    indices running over N

Vectors, Matrices and Norms

0          vector with all entries equal to zero
1          vector with all entries equal to one
I          identity matrix
A⊤         transpose of matrix A (ditto for vectors)
A−1        inverse of matrix A
tr(A)      trace of matrix A
|A|        determinant of matrix A
diag(v)    diagonal matrix with v on the diagonal
‖v‖1       L1 norm of vector v
‖v‖2       L2 norm of vector v
‖A‖F       Frobenius norm of matrix A


          Probability

E[·]         expectation of a random variable
var[·]       variance of a random variable
N(µ, σ2)     normal distribution with mean µ and variance σ2
W(W, ν)      Wishart distribution with ν degrees of freedom and W scale matrix
H(X)         entropy of random variable X
I(X, Y)      mutual information between random variables X and Y

          Mixture Models

yik          hard membership of sample i to cluster k
fk           distribution function for cluster k
tik          posterior probability of sample i to belong to cluster k
T            posterior probability matrix
πk           prior probability or mixture proportion for cluster k
µk           mean vector of cluster k
Σk           covariance matrix of cluster k
θk           parameter vector for cluster k, θk = (µk, Σk)
θ(t)         parameter vector at iteration t of the EM algorithm
f(X; θ)      likelihood function
L(θ; X)      log-likelihood function
LC(θ; X, Y)  complete log-likelihood function

          Optimization

J(·)     cost function
L(·)     Lagrangian
β̂        generic notation for the solution with respect to β
βls      least squares solution coefficient vector
A        active set
γ        step size to update the regularization path
h        direction to update the regularization path


          Penalized models

λ, λ1, λ2      penalty parameters
Pλ(θ)          penalty term over a generic parameter vector
βkj            coefficient j of discriminant vector k
βk             kth discriminant vector, βk = (βk1, . . . , βkp)
B              matrix of discriminant vectors, B = (β1, . . . , βK−1)
βj             jth row of B = (β1⊤, . . . , βp⊤)⊤
BLDA           coefficient matrix in the LDA domain
BCCA           coefficient matrix in the CCA domain
BOS            coefficient matrix in the OS domain
XLDA           data matrix in the LDA domain
XCCA           data matrix in the CCA domain
XOS            data matrix in the OS domain
θk             score vector k
Θ              score matrix, Θ = (θ1, . . . , θK−1)
Y              label matrix
Ω              penalty matrix
LCP(θ; X, Z)   penalized complete log-likelihood function
ΣB             between-class covariance matrix
ΣW             within-class covariance matrix
ΣT             total covariance matrix
Σ̂B             sample between-class covariance matrix
Σ̂W             sample within-class covariance matrix
Σ̂T             sample total covariance matrix
Λ              inverse of covariance matrix, or precision matrix
wj             weights
τj             penalty components of the variational approach


          Part I

          Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here to introduce the models and some basic concepts that will be used along this document. The state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


          1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of the website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Bienarcki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the space of extractors is defined using the RV coefficient, a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors (a small numerical sketch of this coefficient is given right after this list). A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
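To give a concrete flavour of this third tool, here is a minimal sketch of an RV-coefficient distance between feature tables. It assumes that the operators Oi are the centered Gram matrices XiXi⊤ built from each extractor's table; the data and the number of extractors are purely illustrative, not taken from the MASH platform.

```python
import numpy as np

def rv_coefficient(Xa, Xb):
    """RV coefficient between two column-centered tables sharing the same n rows."""
    Xa = Xa - Xa.mean(axis=0)
    Xb = Xb - Xb.mean(axis=0)
    Oa, Ob = Xa @ Xa.T, Xb @ Xb.T                 # operators built from each table
    num = np.trace(Oa @ Ob)                       # inner product between the two operators
    den = np.sqrt(np.trace(Oa @ Oa) * np.trace(Ob @ Ob))
    return num / den

rng = np.random.default_rng(0)
n = 50
tables = [rng.normal(size=(n, p)) for p in (5, 8, 12)]   # outputs of 3 hypothetical extractors
K = len(tables)
dist = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        dist[i, j] = 1.0 - rv_coefficient(tables[i], tables[j])  # small value = similar extractors
print(np.round(dist, 3))
```

Any standard technique, hierarchical clustering for instance, can then be applied to this distance matrix to group the extractors.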

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


          2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
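As a minimal numerical illustration of the diagonal approximation mentioned above (synthetic numbers, not MASH data), the following sketch shows that with fewer samples than features the sample covariance matrix is singular, whereas its diagonal approximation can still be inverted:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                         # fewer samples than features
X = rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)            # p x p sample covariance, rank at most n - 1
print(np.linalg.matrix_rank(S))        # strictly smaller than p: S is singular

S_diag = np.diag(np.diag(S))           # keep only the variances, ignore dependencies
print(np.all(np.diag(S_diag) > 0))     # all diagonal entries positive: the inverse exists
```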

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.
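As an illustration of a filter model driven by an information measure, the sketch below ranks features by an estimate of mutual information with the labels and keeps the top ones (synthetic data; the use of scikit-learn's mutual_info_classif is just one possible choice of criterion):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# synthetic problem: 20 features, only a handful of them informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=2, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)  # information measure for each feature
ranking = np.argsort(scores)[::-1]                  # decreasing relevance
selected = ranking[:5]                              # filtered subset passed to the mining algorithm
print(selected, np.round(scores[selected], 3))
```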

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\[
\min_{\beta} \; J(\beta) + \lambda P(\beta) \tag{2.1}
\]

\[
\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \tag{2.2}
\]

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\[
\forall (\mathbf{x}_1, \mathbf{x}_2) \in \mathcal{X}^2, \quad f(t\mathbf{x}_1 + (1-t)\mathbf{x}_2) \le t f(\mathbf{x}_1) + (1-t) f(\mathbf{x}_2) \tag{2.3}
\]

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Null coefficients usually furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties

Regularizing a linear model with a norm like ‖β‖p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves, whose global minimum βls lies outside the penalties' admissible regions. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖0 = card{βj | βj ≠ 0}:

\[
\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \tag{2.4}
\]

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
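The combinatorial nature of the L0 constraint can be seen in the following best-subset sketch, which enumerates all subsets of at most t variables for a tiny least squares problem (hypothetical data; this brute-force search is only feasible for very small p):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, p, t = 40, 6, 2                                # at most t non-zero coefficients
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, 0., 0., -2., 0., 0.])
y = X @ beta_true + 0.1 * rng.normal(size=n)

best_rss, best_subset = np.inf, None
for k in range(1, t + 1):
    for subset in combinations(range(p), k):      # every subset of size at most t
        Xs = X[:, subset]
        coef = np.linalg.lstsq(Xs, y, rcond=None)[0]
        rss = np.sum((y - Xs @ coef) ** 2)
        if rss < best_rss:
            best_rss, best_subset = rss, subset
print(best_subset)                                # recovers the two informative columns (0, 3)
```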

$L_1$ Penalties The penalties built using $L_1$ norms induce sparsity and stability. The corresponding estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \sum_{j=1}^{p} |\beta_j| \le t . \qquad (2.5)$$

Despite all the advantages of the Lasso, the choice of the right penalty is not simply a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples $n$ is lower than the number of variables $p$, the maximum number of non-zero entries of $\beta$ is $n$. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where $n$ is typically some tens of individuals and $p$ several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter ($t$ or $\lambda$, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

$L_2$ Penalties The graphical interpretation of pure norm penalties in Figure 2.3 shows that the $L_2$ norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the $L_2$ norm involves the square root of the sum of all squared components. In practice, when using $L_2$ penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an $L_2$ penalized optimization problem looks like

$$\min_{\beta}\; J(\beta) + \lambda \|\beta\|_2^2 . \qquad (2.6)$$

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To illustrate this property, let us consider a least squares problem

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 , \qquad (2.7)$$

with solution $\beta^{\mathrm{ls}} = (X^\top X)^{-1} X^\top y$. If some input variables are highly correlated, the estimator $\beta^{\mathrm{ls}}$ is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 .$$

The solution to this problem is $\beta^{\ell_2} = (X^\top X + \lambda I_p)^{-1} X^\top y$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by $\lambda$. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{\mathrm{ls}})^2} . \qquad (2.8)$$

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component: every $\lambda_j$ is optimized to penalize more or less, depending on the influence of $\beta_j$ in the model.

Although $L_2$ penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

$L_\infty$ Penalties A special case of $L_p$ norms is the infinity norm, defined as $\|x\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_p|)$. The admissible region for a penalty like $\|\beta\|_\infty \le t$ is displayed in Figure 2.3. For the $L_\infty$ norm, the greyed out region is a square containing all the $\beta$ vectors whose largest coefficient is less than or equal to the value of the penalty parameter $t$.

This norm is not commonly used as a regularization term by itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems, there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm $\|\beta\|^{*}$ of a norm $\|\beta\|$ is defined as

$$\|\beta\|^{*} = \max_{w \in \mathbb{R}^p} \; \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1 .$$

In the case of an $L_q$ norm with $q \in [1, +\infty]$, the dual norm is the $L_r$ norm such that $1/q + 1/r = 1$. For example, the $L_2$ norm is self-dual, and the dual norm of the $L_1$ norm is the $L_\infty$ norm. This is one of the reasons why $L_\infty$ is important, even if it is not as popular a penalty as $L_1$ is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation: we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), whose objective is to improve the Lasso penalization when $n \le p$. As recalled in Section 2.3.2, when $n \le p$ the Lasso penalty can select at most $n$ non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of $L_1$ and $L_2$ penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 . \qquad (2.9)$$

The term in $\lambda_1$ is a Lasso penalty that induces sparsity in vector $\beta$; on the other side, the term in $\lambda_2$ is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when $n$ goes to infinity) of always making the right choice of relevant variables.


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by $L$ different groups of genes. Let us denote by $\mathcal{G}_\ell$ the group of genes of the $\ell$th process and by $d_\ell$ the number of genes (variables) in each group, $\forall \ell \in \{1, \ldots, L\}$. Thus, the dimension of vector $\beta$ is the sum of the number of genes of every group: $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in \mathcal{G}_\ell} |\beta_j|^s \Big)^{r/s} \Bigg)^{1/r} . \qquad (2.10)$$

The pair $(r, s)$ identifies the norms that are combined: an $L_s$ norm within groups and an $L_r$ norm between groups. The $L_s$ norm penalizes the variables in every group $\mathcal{G}_\ell$, while the $L_r$ norm penalizes the within-group norms. The pair $(r, s)$ is chosen so as to induce different properties in the resulting $\beta$ vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
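A small sketch of how the mixed norm (2.10), possibly with outer weights, can be evaluated; the function name, the grouping and the weights are hypothetical choices of ours, not part of the original text.

```python
import numpy as np

def mixed_norm(beta, groups, r, s, weights=None):
    """||beta||_(r,s): an L_s norm within each group, an L_r norm across groups."""
    groups = [np.asarray(g) for g in groups]
    if weights is None:
        weights = np.ones(len(groups))
    inner = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum((weights * inner) ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.0, -2.0, 0.5])
groups = [[0, 1], [2, 3, 4]]               # hypothetical grouping of the 5 variables
print(mixed_norm(beta, groups, r=1, s=2))  # group-Lasso penalty ||beta||_(1,2)
```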

Several combinations are available; the most popular is the norm $\|\beta\|_{(1,2)}$, known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure $L_1$ norm and a mixed $L_{1,2}$ norm. Many other mixings are possible, such as $\|\beta\|_{(1,4/3)}$ (Szafranski et al., 2008) or $\|\beta\|_{(1,\infty)}$ (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter, I have reviewed several possibilities for inducing sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to featurewise parsimonious models. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other $L_1$ penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the $L_1$ norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, $L_{1,2}$ or $L_{1,\infty}$ mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one on the right of Figure 2.6.


Figure 2.5: Admissible sets for the Lasso ((a) $L_1$) and the group-Lasso ((b) $L_{(1,2)}$).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters ((a) $L_1$-induced sparsity; (b) $L_{(1,2)}$ group-induced sparsity).


This pattern displays a solution where variables 3, 5 and 8 are removed.

2.3.6 Optimization Tools for Regularized Problems

Caramanis et al. (2012) provide a good collection of mathematical techniques and optimization methods for solving regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It can be described as an "active constraints" algorithm implemented along a regularization path, where the cost function is approximated by secant hyper-planes. Further details are given in the dedicated Chapter 5.

Subgradient Descent Subgradient descent is a generic optimization method that can be used for penalized problems where the subgradient of the loss function, $\partial J(\beta)$, and the subgradient of the regularizer, $\partial P(\beta)$, can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector $\beta^{(t+1)}$ is updated proportionally to the negative subgradient of the function at the current point $\beta^{(t)}$:

$$\beta^{(t+1)} = \beta^{(t)} - \alpha\,(s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)}) .$$
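As an illustration, a minimal NumPy sketch of this update for a Lasso-type problem, with $J(\beta) = \|y - X\beta\|_2^2$ and $P(\beta) = \|\beta\|_1$; the step size `alpha` and the iteration count are arbitrary choices of ours, not prescribed by the text.

```python
import numpy as np

def subgradient_lasso(X, y, lam, alpha=1e-3, n_iter=5000):
    """Plain subgradient descent for min_beta ||y - X beta||^2 + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = -2.0 * X.T @ (y - X @ beta)   # gradient of the quadratic loss
        s_prime = np.sign(beta)           # a subgradient of the L1 norm (0 at beta_j = 0)
        beta -= alpha * (s + lam * s_prime)
    return beta
```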

Coordinate Descent Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient $\beta_j$ gives

$$\beta_j = \frac{-\lambda\,\operatorname{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} .$$

In the literature, these algorithms are also referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution $\beta^{\mathrm{ls}}$ and updating the values with an iterative thresholding scheme where $\beta_j^{(t+1)} = S_\lambda\!\big(\frac{\partial J(\beta^{(t)})}{\partial \beta_j}\big)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) = \begin{cases} \dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex] \dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex] 0 & \text{if } \left|\frac{\partial J(\beta)}{\partial \beta_j}\right| \le \lambda \end{cases} \qquad (2.11)$$

The same principles define "block-coordinate descent" algorithms; in that case, the first order conditions are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
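A minimal sketch of such a cyclic soft-thresholding / coordinate descent scheme for the Lasso with squared loss, in the spirit of (2.11); the sweep count and the zero initialization are arbitrary choices of ours.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S_gamma(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_descent_lasso(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for min_beta ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)   # the sum_i x_ij^2 term of the update
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual excluding the contribution of coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta
```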

Active and Inactive Sets Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero $\beta_j$; it is usually denoted $\mathcal{A}$. The complement of the active set is the "inactive set", denoted $\bar{\mathcal{A}}$, which contains the indices of the variables whose $\beta_j$ is zero. Thus, the problem can be reduced to the dimensionality of $\mathcal{A}$.

Osborne et al. (2000a) proposed the first of these algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty $\mathcal{A}$ has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are expected to be selected.

Working set algorithms have to deal with three main tasks. First, there is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set $\mathcal{A}$ is augmented with the variable from the inactive set $\bar{\mathcal{A}}$ that most violates the optimality conditions of Problem (2.1). Finally, there is also a task for computing the optimality conditions; their expressions are essential for selecting the next variable to add to the active set and for testing whether a particular vector $\beta$ is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve $L_1$ regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and $L_1$ penalties (Roth, 2004), linear functions and $L_{1,2}$ penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of $L_0$, $L_1$ and $L_2$ penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter $\lambda$ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that these techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set $\mathcal{A}^{(t)}$ and its corresponding solution $\beta^{(t)}$ have been determined, following the regularization path means looking for a direction $h$ and a step size $\gamma$ to update the solution as $\beta^{(t+1)} = \beta^{(t)} + \gamma h$. Afterwards, the active and inactive sets $\mathcal{A}^{(t+1)}$ and $\bar{\mathcal{A}}^{(t+1)}$ are updated, by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and the variable that should enter the active set, from the correlation with the residuals.

Proximal Methods Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function $J(\beta)$ and a non-differentiable penalty $\lambda P(\beta)$:

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 . \qquad (2.12)$$

They are iterative methods where the cost function $J(\beta)$ is linearized in the proximity of the solution $\beta^{(t)}$, so that the problem to solve at each iteration looks like (2.12), where the parameter $L > 0$ should be an upper bound on the Lipschitz constant of the gradient $\nabla J$. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) . \qquad (2.13)$$

The basic algorithm uses the solution to (2.13) as the next iterate $\beta^{(t+1)}$. However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting $\lambda = 0$ in equation (2.13) recovers the standard gradient update rule.
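A minimal sketch of the basic proximal iteration (2.13) for the Lasso case, where the proximal operator of $\frac{\lambda}{L}\|\cdot\|_1$ is soft-thresholding; taking $L$ as the largest eigenvalue of $2X^\top X$ and a fixed iteration count are our own choices for the sketch.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Basic proximal gradient (ISTA-like) for min_beta ||y - X beta||^2 + lam * ||beta||_1."""
    L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()   # upper bound on the Lipschitz constant of grad J
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        z = beta - grad / L                       # gradient step on the smooth part
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
    return beta
```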


          Part II

          Sparse Linear Discriminant Analysis


          Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with $L_1$ penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part, we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data and generate a graphical display. The algorithm is detailed in Chapter 5, and our experimental results in Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe $n$ labeled observations belonging to $K$ groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of $n$ examples, with observations $x_i \in \mathbb{R}^p$ comprising $p$ features, and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the $K$ classes. It will be convenient to gather the observations in the $n \times p$ matrix $X = (x_1^\top, \ldots, x_n^\top)^\top$ and the corresponding labels in the $n \times K$ matrix $Y = (y_1^\top, \ldots, y_n^\top)^\top$.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_{\mathrm{B}} \beta}{\beta^\top \Sigma_{\mathrm{W}} \beta} , \qquad (3.1)$$

where $\beta$ is the discriminant direction used to project the data, and $\Sigma_{\mathrm{B}}$ and $\Sigma_{\mathrm{W}}$ are the $p \times p$ between-class and within-class covariance matrices, respectively, defined (for a $K$-class problem) as

$$\Sigma_{\mathrm{W}} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{G}_k} (x_i - \mu_k)(x_i - \mu_k)^\top , \qquad \Sigma_{\mathrm{B}} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{G}_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,$$

where $\mu$ is the sample mean of the whole dataset, $\mu_k$ the sample mean of class $k$, and $\mathcal{G}_k$ indexes the observations of class $k$.


This analysis can be extended to the multi-class framework with $K$ groups. In this case, $K-1$ discriminant vectors $\beta_k$ may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\!\left(B^\top \Sigma_{\mathrm{B}} B\right)}{\operatorname{tr}\!\left(B^\top \Sigma_{\mathrm{W}} B\right)} , \qquad (3.2)$$

where the matrix $B$ is built with the discriminant directions $\beta_k$ as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of $K-1$ subproblems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;& \beta_k^\top \Sigma_{\mathrm{B}} \beta_k \\ \text{s.t. } & \beta_k^\top \Sigma_{\mathrm{W}} \beta_k \le 1 , \\ & \beta_k^\top \Sigma_{\mathrm{W}} \beta_\ell = 0 , \quad \forall \ell < k . \end{aligned} \qquad (3.3)$$

The maximizer of subproblem $k$ is the eigenvector of $\Sigma_{\mathrm{W}}^{-1}\Sigma_{\mathrm{B}}$ associated with the $k$th largest eigenvalue (see Appendix C).

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the $K-1$ discriminant directions summarize the $p$ original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly aims at reducing the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually assessed with univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to endow LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away from each other (large between-class variance), and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;& \beta^\top \Sigma_{\mathrm{W}} \beta \\ \text{s.t. } & (\mu_1 - \mu_2)^\top \beta = 1 , \\ & \textstyle\sum_{j=1}^{p} |\beta_j| \le t , \end{aligned}$$

where $\mu_1$ and $\mu_2$ are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant rewritten in the form of $K-1$ constrained and penalized maximization problems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;& \beta_k^\top \Sigma_{\mathrm{B}}^{k} \beta_k - P_k(\beta_k) \\ \text{s.t. } & \beta_k^\top \Sigma_{\mathrm{W}} \beta_k \le 1 . \end{aligned}$$

The term to maximize is the projected between-class covariance $\beta_k^\top \Sigma_{\mathrm{B}} \beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top \Sigma_{\mathrm{W}} \beta_k$. The penalty $P_k(\beta_k)$ is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant $\beta_k$ vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem, but instead of performing separate estimations of $\Sigma_{\mathrm{W}}$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_{\mathrm{W}}^{-1}(\mu_1 - \mu_2)$, they estimate the product directly through constrained $L_1$ minimization:

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;& \|\beta\|_1 \\ \text{s.t. } & \left\| \hat{\Sigma} \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda . \end{aligned}$$

Sparsity is encouraged by the $L_1$ norm of vector $\beta$, and the parameter $\lambda$ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. Among those that address multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For $K > 2$, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al., 2000; Friedman et al., 2009).

          Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix $Y$ is an $n \times K$ matrix containing the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample $i$ belongs to class $k$ and $y_{ik} = 0$ otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is $y_{ik} = 1$ if sample $i$ belongs to class $k$ and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or in generalizing the kernel target alignment measure (Guermeur et al., 2004).

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression; some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude were confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector $\beta$ is obtained by solving

$$\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,$$

where $y_i$ is the binary indicator of the label of pattern $x_i$. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built using the $\beta$ vector obtained for $\lambda = 0$, but a different intercept $\beta_0$ is required.

          Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables recovering exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix $\Omega$, leading to a problem expressed in compact form as

$$\min_{\Theta,\, B} \; \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left(B^\top \Omega B\right) \qquad (3.4a)$$
$$\text{s.t. } n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \qquad (3.4b)$$

where $\Theta \in \mathbb{R}^{K \times (K-1)}$ are the class scores, $B \in \mathbb{R}^{p \times (K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of $K-1$ problems:

$$\begin{aligned} \min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \;& \|Y\theta_k - X\beta_k\|^2 + \beta_k^\top \Omega \beta_k & (3.5a) \\ \text{s.t. } & n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 , & (3.5b) \\ & \theta_k^\top Y^\top Y \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , & (3.5c) \end{aligned}$$

where each $\beta_k$ corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_k \|Y\theta_k - X\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters and $\Omega$ is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|Y\theta_k - X\beta_k\|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K-1} \beta_{kj}^2} , \qquad (3.6)$$

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.


          4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6) due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For $K$ classes, this representation can be either complete, in dimension $K-1$, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

- there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;
- inputs are centered, that is, $X^\top 1_n = 0$;
- the quadratic penalty $\Omega$ is positive-semidefinite and such that $X^\top X + \Omega$ is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript $k$ to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in $(\theta, \beta)$, that is, convex in $\theta$ for each $\beta$ value and vice-versa. The problems are however non-convex: in particular, if $(\theta^\star, \beta^\star)$ is a solution, then $(-\theta^\star, -\beta^\star)$ is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to $K$, since we assumed that there are no empty classes. Moreover, as $X$ is centered, the $K-1$ first optimal scores are orthogonal to $1$ (and the $K$th problem would be solved by $\beta_K = 0$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), so as to simplify all expressions. The generic problem solved is thus

$$\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \; \|Y\theta - X\beta\|^2 + \beta^\top \Omega \beta \qquad (4.1a)$$
$$\text{s.t. } n^{-1}\, \theta^\top Y^\top Y \theta = 1 . \qquad (4.1b)$$

For a given score vector $\theta$, the discriminant direction $\beta$ that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

$$\beta_{\mathrm{OS}} = \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta . \qquad (4.2)$$

The objective function (4.1a) is then

$$\begin{aligned} \|Y\theta - X\beta_{\mathrm{OS}}\|^2 + \beta_{\mathrm{OS}}^\top \Omega \beta_{\mathrm{OS}} &= \theta^\top Y^\top Y\theta - 2\,\theta^\top Y^\top X\beta_{\mathrm{OS}} + \beta_{\mathrm{OS}}^\top\left(X^\top X + \Omega\right)\beta_{\mathrm{OS}} \\ &= \theta^\top Y^\top Y\theta - \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta , \end{aligned}$$

where the second line stems from the definition of $\beta_{\mathrm{OS}}$ (4.2). Now, using the fact that the optimal $\theta$ obeys constraint (4.1b), the optimization problem is equivalent to

$$\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta , \qquad (4.3)$$

which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the $k$th largest eigenvector of $Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by

$$(Y^\top Y)^{-1} Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta = \alpha^2 \theta , \qquad (4.4)$$


where $\alpha^2$ is the maximal eigenvalue¹:

$$\begin{aligned} n^{-1}\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2\, n^{-1}\theta^\top(Y^\top Y)\theta \\ n^{-1}\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2 . \end{aligned} \qquad (4.5)$$

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables $X$ and $Y$ is defined as follows:

$$\begin{aligned} \max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;& n^{-1}\theta^\top Y^\top X\beta & (4.6a) \\ \text{s.t. } & n^{-1}\, \theta^\top Y^\top Y\theta = 1 , & (4.6b) \\ & n^{-1}\, \beta^\top\left(X^\top X + \Omega\right)\beta = 1 . & (4.6c) \end{aligned}$$

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

$$n L(\beta, \theta, \nu, \gamma) = \theta^\top Y^\top X\beta - \nu\,(\theta^\top Y^\top Y\theta - n) - \gamma\,(\beta^\top(X^\top X + \Omega)\beta - n)$$
$$\Rightarrow \; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} = X^\top Y\theta - 2\gamma(X^\top X + \Omega)\beta$$
$$\Rightarrow \; \beta_{\mathrm{CCA}} = \frac{1}{2\gamma}(X^\top X + \Omega)^{-1} X^\top Y\theta .$$

Then, as $\beta_{\mathrm{CCA}}$ obeys (4.6c), we obtain

$$\beta_{\mathrm{CCA}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta}} , \qquad (4.7)$$

so that the optimal objective function (4.6a) can be expressed with $\theta$ alone:

$$n^{-1}\theta^\top Y^\top X\beta_{\mathrm{CCA}} = \frac{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta}} = \sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta} ,$$

and the optimization problem with respect to $\theta$ can be restated as

$$\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta . \qquad (4.8)$$

Hence, the p-OS and p-CCA problems produce the same optimal score vectors $\theta$. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

$$\beta_{\mathrm{OS}} = \alpha\, \beta_{\mathrm{CCA}} , \qquad (4.9)$$

¹The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where $\alpha$ is defined by (4.5).

The p-CCA optimization problem can also be written as a function of $\beta$ alone, using the optimality conditions for $\theta$:

$$n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} = Y^\top X\beta - 2\nu\, Y^\top Y\theta \quad \Rightarrow \quad \theta_{\mathrm{CCA}} = \frac{1}{2\nu}(Y^\top Y)^{-1} Y^\top X\beta . \qquad (4.10)$$

Then, as $\theta_{\mathrm{CCA}}$ obeys (4.6b), we obtain

$$\theta_{\mathrm{CCA}} = \frac{(Y^\top Y)^{-1} Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta}} , \qquad (4.11)$$

leading to the following expression of the optimal objective function:

$$n^{-1}\theta_{\mathrm{CCA}}^\top Y^\top X\beta = \frac{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta}} = \sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta} .$$

The p-CCA problem can thus be solved with respect to $\beta$ by plugging this value into (4.6):

$$\begin{aligned} \max_{\beta \in \mathbb{R}^p} \;& n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta & (4.12a) \\ \text{s.t. } & n^{-1}\, \beta^\top\left(X^\top X + \Omega\right)\beta = 1 , & (4.12b) \end{aligned}$$

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{\mathrm{CCA}}$ verifies

$$n^{-1} X^\top Y(Y^\top Y)^{-1} Y^\top X\beta_{\mathrm{CCA}} = \lambda\left(X^\top X + \Omega\right)\beta_{\mathrm{CCA}} , \qquad (4.13)$$

where $\lambda$ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:

$$\begin{aligned} & n^{-1}\beta_{\mathrm{CCA}}^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta_{\mathrm{CCA}} = \lambda \\ \Rightarrow\; & n^{-1}\alpha^{-1}\beta_{\mathrm{CCA}}^\top X^\top Y(Y^\top Y)^{-1} Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta = \lambda \\ \Rightarrow\; & n^{-1}\alpha\,\beta_{\mathrm{CCA}}^\top X^\top Y\theta = \lambda \\ \Rightarrow\; & n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta = \lambda \\ \Rightarrow\; & \alpha^2 = \lambda . \end{aligned}$$

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), whose denominator is $\alpha$; the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of $\alpha$ (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

$$\begin{aligned} \max_{\beta \in \mathbb{R}^p} \;& \beta^\top \Sigma_{\mathrm{B}} \beta & (4.14a) \\ \text{s.t. } & \beta^\top(\Sigma_{\mathrm{W}} + n^{-1}\Omega)\beta = 1 , & (4.14b) \end{aligned}$$

where $\Sigma_{\mathrm{B}}$ and $\Sigma_{\mathrm{W}}$ are respectively the sample between-class and within-class covariance matrices of the original $p$-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix $X$ is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a matrix representation using the projection operator $Y\left(Y^\top Y\right)^{-1} Y^\top$:

$$\Sigma_{\mathrm{T}} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = n^{-1} X^\top X$$
$$\Sigma_{\mathrm{B}} = \frac{1}{n}\sum_{k=1}^{K} n_k\, \mu_k\mu_k^\top = n^{-1} X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X$$
$$\Sigma_{\mathrm{W}} = \frac{1}{n}\sum_{k=1}^{K}\sum_{i:\, y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1}\left(X^\top X - X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\right) .$$

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

$$X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta_{\mathrm{LDA}} = \lambda\left(X^\top X + \Omega - X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\right)\beta_{\mathrm{LDA}}$$
$$X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta_{\mathrm{LDA}} = \frac{\lambda}{1-\lambda}\left(X^\top X + \Omega\right)\beta_{\mathrm{LDA}} .$$

The comparison of the last equation with the one verified by $\beta_{\mathrm{CCA}}$ (4.13) shows that $\beta_{\mathrm{LDA}}$ and $\beta_{\mathrm{CCA}}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that

$$\beta_{\mathrm{LDA}} = (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{CCA}} = \alpha^{-1}(1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{OS}} ,$$

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the $k$th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

$$\begin{aligned} \min_{\Theta,\, B} \;& \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left(B^\top \Omega B\right) \\ \text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y\Theta = I_{K-1} . \end{aligned}$$

Let $A$ represent the $(K-1)\times(K-1)$ diagonal matrix with elements $\alpha_k$, the square root of the $k$th largest eigenvalue of $Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y$; we have

$$B_{\mathrm{LDA}} = B_{\mathrm{CCA}}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} = B_{\mathrm{OS}}\, A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} , \qquad (4.15)$$

where $I_{K-1}$ is the $(K-1)\times(K-1)$ identity matrix.

At this point, the feature matrix $X$, which in the input space has dimensions $n \times p$, can be projected into the optimal scoring domain as an $n \times (K-1)$ matrix $X_{\mathrm{OS}} = X B_{\mathrm{OS}}$, or into the linear discriminant analysis space as an $n \times (K-1)$ matrix $X_{\mathrm{LDA}} = X B_{\mathrm{LDA}}$. Classification can be performed in any of those domains, provided the appropriate distance (based on the penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows (a code sketch of these steps is given after the list):

1. Solve the p-OS problem as
   $$B_{\mathrm{OS}} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta ,$$
   where $\Theta$ are the $K-1$ leading eigenvectors of $Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y$.

2. Translate the data samples $X$ into the LDA domain as $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, where $D = A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}}$.

3. Compute the matrix $M$ of centroids $\mu_k$ from $X_{\mathrm{LDA}}$ and $Y$.

4. Evaluate the distances $d(x, \mu_k)$ in the LDA domain as a function of $M$ and $X_{\mathrm{LDA}}$.

5. Translate distances into posterior probabilities and assign every sample $i$ to a class $k$ following the maximum a posteriori rule.

6. Optionally, produce a graphical representation.
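A minimal NumPy sketch of steps 1-5 for the quadratic case $\Omega = I_p$ is shown below; `Y` is the one-hot indicator matrix, all variable names are ours, and the sketch assumes non-empty classes and $0 < \alpha_k^2 < 1$.

```python
import numpy as np

def pos_lda_classify(X, Y, X_new, lam):
    """Steps 1-5 with Omega = I_p: p-OS fit, projection to the LDA domain (4.15),
    then nearest (prior-adjusted) centroid assignment."""
    n, p = X.shape
    K = Y.shape[1]
    x_bar = X.mean(axis=0)
    Xc, Xn = X - x_bar, X_new - x_bar                             # centered inputs
    R = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ Y)    # (X'X + lam I)^{-1} X'Y
    M = Y.T @ Xc @ R                                              # Y'X (X'X + lam I)^{-1} X'Y
    D = Y.T @ Y                                                   # diagonal class counts
    Dm12 = np.diag(1.0 / np.sqrt(np.diag(D)))
    evals, V = np.linalg.eigh(Dm12 @ M @ Dm12)                    # eigenproblem (4.4)
    order = np.argsort(evals)[::-1][:K - 1]
    alpha2 = evals[order]                                         # alpha_k^2, assumed in (0, 1)
    Theta = np.sqrt(n) * Dm12 @ V[:, order]                       # n^{-1} Theta' D Theta = I
    B_os = R @ Theta                                              # step 1
    Dmat = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))        # A^{-1}(I - A^2)^{-1/2}
    X_lda, Xn_lda = Xc @ B_os @ Dmat, Xn @ B_os @ Dmat            # step 2
    centroids = Dm12 @ Dm12 @ Y.T @ X_lda                         # step 3: class means
    log_prior = np.log(np.diag(D) / n)
    d2 = ((Xn_lda[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)   # step 4
    return np.argmin(d2 - 2.0 * log_prior[None, :], axis=1)            # step 5 (MAP)
```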


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

$$\begin{aligned} \min_{\Theta \in \mathbb{R}^{K\times(K-1)},\, B \in \mathbb{R}^{p\times(K-1)}} \;& \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left(B^\top\Omega B\right) & (4.16a) \\ \text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y\Theta = I_{K-1} , & (4.16b) \end{aligned}$$

where $\Theta$ are the class scores, $B$ the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in $\Theta$ and $B$: the optimal $B_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to $\Theta$, and the optimization with respect to $B$ is obtained in closed form, as a linear combination of the optimal scores $\Theta$ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize $\Theta$ to $\Theta^0$ such that $n^{-1}\, \Theta^{0\top} Y^\top Y\Theta^0 = I_{K-1}$.

2. Compute $B = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta^0$.

3. Set $\Theta$ to be the $K-1$ leading eigenvectors of $Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y$.

4. Compute the optimal regression coefficients
   $$B_{\mathrm{OS}} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta . \qquad (4.17)$$

Defining $\Theta^0$ in Step 1, instead of using directly $\Theta$ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top} Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta^0$, which is computed as $\Theta^{0\top} Y^\top X B$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form $B^\top\Omega B$. However, when an $L_1$ penalty is applied in (4.16), the optimization algorithm requires iterative updates of $B$ and $\Theta$. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample $x_i$ is assigned to class $k$ if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with parameters estimated from training data (sample estimators $\hat{\mu}_k$ and $\hat{\Sigma}_{\mathrm{W}}$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to class $k$ if the distance

$$d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{\mathrm{W}\Omega}^{-1} (x_i - \mu_k) - 2\log\!\left(\frac{n_k}{n}\right) \qquad (4.18)$$

is minimized among all $k$. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class $k$. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\Sigma_{\mathrm{W}\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

$$\Sigma_{\mathrm{W}\Omega}^{-1} = \left(n^{-1}(X^\top X + \lambda\Omega) - \Sigma_{\mathrm{B}}\right)^{-1} = \left(n^{-1} X^\top X - \Sigma_{\mathrm{B}} + n^{-1}\lambda\Omega\right)^{-1} = \left(\Sigma_{\mathrm{W}} + n^{-1}\lambda\Omega\right)^{-1} . \qquad (4.19)$$

Before explaining how to compute the distances, let us summarize some clarifying points:

- The solution $B_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification.
- In the LDA domain (the space of discriminant variates $X_{\mathrm{LDA}}$), classification is based on Euclidean distances.
- Classification can be done in a reduced-rank space of dimension $R < K-1$ by using the first $R$ discriminant directions $\{\beta_k\}_{k=1}^{R}$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

$$\|(x_i - \mu_k) B_{\mathrm{OS}}\|_{\Sigma_{\mathrm{W}\Omega}}^2 - 2\log(\pi_k) ,$$

where $\pi_k$ is the estimated class prior and $\|\cdot\|_{S}$ is the Mahalanobis distance assuming within-class covariance $S$. If classification is done in the p-LDA domain, it is

$$\left\|(x_i - \mu_k) B_{\mathrm{OS}} A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}}\right\|_2^2 - 2\log(\pi_k) ,$$

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let $d(x, \mu_k)$ be the distance between $x$ and $\mu_k$ defined as in (4.18). Under the assumption that the classes are Gaussian, the posterior probabilities $p(y_k = 1 \,|\, x)$ can be estimated as

$$p(y_k = 1 \,|\, x) \propto \exp\!\left(-\frac{d(x, \mu_k)}{2}\right) \propto \pi_k \exp\!\left(-\frac{1}{2}\left\|(x - \mu_k) B_{\mathrm{OS}} A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}}\right\|_2^2\right) . \qquad (4.20)$$

Those probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \mu_k)$ take large values, $\exp\left(-\frac{d(x, \mu_k)}{2}\right)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

$$p(y_k = 1 \,|\, x) = \frac{\pi_k \exp\!\left(-\frac{d(x, \mu_k)}{2}\right)}{\sum_{\ell} \pi_\ell \exp\!\left(-\frac{d(x, \mu_\ell)}{2}\right)} = \frac{\pi_k \exp\!\left(-\frac{d(x, \mu_k)}{2} + \frac{d_{\max}}{2}\right)}{\sum_{\ell} \pi_\ell \exp\!\left(-\frac{d(x, \mu_\ell)}{2} + \frac{d_{\max}}{2}\right)} ,$$

where $d_{\max} = \max_k d(x, \mu_k)$.
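A small sketch of this normalization; the variant below shifts every exponent by the largest one (equivalently, by the smallest distance), which is the usual log-sum-exp safeguard against both underflow and overflow, in the same spirit as the trick above.

```python
import numpy as np

def posteriors(d, priors):
    """Posterior probabilities from squared distances d (m x K) and class priors (K,)."""
    logits = np.log(priors)[None, :] - d / 2.0
    logits -= logits.max(axis=1, keepdims=True)   # shift exponents: largest becomes 0
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```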

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $X_{\mathrm{OS}}$ or of the discriminant variates $X_{\mathrm{LDA}}$, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top\Omega\beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and $\Omega$ is positive definite). Quadratic penalties have interesting properties, but, as recalled in Section 2.3, they do not induce sparsity. In this respect, $L_1$ penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces zeros in the groups of coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $B^\top\Omega B$.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j^2 \, \frac{\|\beta^j\|_2^2}{\tau_j}
\qquad (4.21a)
\]
\[
\text{s.t.} \quad \sum_j \tau_j - \sum_j w_j \|\beta^j\|_2 \le 0 ,
\qquad (4.21b)
\]
\[
\phantom{\text{s.t.}} \quad \tau_j \ge 0 , \; j = 1, \dots, p ,
\qquad (4.21c)
\]
where $B \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$, $B = (\beta^{1\top}, \dots, \beta^{p\top})^\top$, and $w_j$ are predefined nonnegative weights. The cost function $J(B)$ in our context is the OS regression loss $\frac{1}{2}\|Y\Theta - XB\|_2^2$; from now on, for the sake of simplicity, we keep the generic notation $J(B)$. Here and in what follows, $b/\tau$ is defined by continuation at zero, as $b/0 = +\infty$ if $b \neq 0$ and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variable $\tau_j$. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2$.

Proof. The Lagrangian of Problem (4.21) is
\[
L = J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j}
+ \nu_0 \Big( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \|\beta^j\|_2 \Big)
- \sum_{j=1}^{p} \nu_j \tau_j .
\]


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for $\tau_j$ are
\[
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
\;\Leftrightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Leftrightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0 .
\]
The last line is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$. Complementary slackness states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for constraint $g_j(\tau_j) \le 0$. As a result, the optimal value of $\tau_j$ is
\[
\tau_j^\star = \sqrt{ \frac{\lambda w_j^2 \|\beta^j\|_2^2}{\nu_0} }
= \sqrt{\frac{\lambda}{\nu_0}} \, w_j \|\beta^j\|_2 .
\qquad (4.22)
\]

We note that $\nu_0 \neq 0$ if there is at least one coefficient $\beta_{jk} \neq 0$; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):
\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \|\beta^j\|_2 = 0 ,
\qquad (4.23)
\]
so that $\tau_j^\star = w_j \|\beta^j\|_2$. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso problem
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 .
\qquad (4.24)
\]

So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as $\lambda B^\top \Omega B$, where
\[
\Omega = \mathrm{diag}\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \right) ,
\qquad (4.25)
\]
with $\tau_j = w_j \|\beta^j\|_2$, resulting in the diagonal components of $\Omega$:
\[
(\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} .
\qquad (4.26)
\]
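As a quick numerical sanity check of Lemma 4.1 and of (4.25)-(4.26), the following MATLAB fragment (with arbitrary made-up data, names ours) verifies that, at the optimal $\tau_j = w_j\|\beta^j\|_2$, the quadratic penalty equals the group-Lasso penalty.

```matlab
% Check that the variational form recovers the group-Lasso penalty (Lemma 4.1).
rng(0); p = 6; K = 4; lambda = 0.5;
B = randn(p, K-1); B(3,:) = 0;                 % one inactive row of coefficients
w = ones(p, 1);                                % predefined nonnegative weights
rownorm = sqrt(sum(B.^2, 2));                  % ||beta^j||_2 for each row
tau = w .* rownorm;                            % optimal tau_j, from (4.22)-(4.23)
active = tau > 0;                              % 0/0 is handled by continuation at zero
quad   = lambda * sum(w(active).^2 .* rownorm(active).^2 ./ tau(active));
gLasso = lambda * sum(w .* rownorm);
fprintf('quadratic form: %.4f, group-Lasso: %.4f\n', quad, gLasso);
```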

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Section 5.

The first property states that the quadratic formulation is convex when $J$ is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If $J$ is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \|\beta\|_2^2 / \tau$, known as the perspective function of $f(\beta) = \|\beta\|_2^2$, is convex in $(\beta, \tau)$ (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B, \tau)$.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $B \in \mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (4.24) is
\[
\left\{ V \in \mathbb{R}^{p\times(K-1)} : V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} ,
\qquad (4.27)
\]
where $G \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $g^j \in \mathbb{R}^{K-1}$, $G = (g^{1\top}, \dots, g^{p\top})^\top$, defined as follows. Let $\mathcal{S}(B)$ denote the columnwise support of $B$, $\mathcal{S}(B) = \{ j \in \{1, \dots, p\} : \|\beta^j\|_2 \neq 0 \}$; then we have
\[
\forall j \in \mathcal{S}(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j ,
\qquad (4.28)
\]
\[
\forall j \notin \mathcal{S}(B), \quad \|g^j\|_2 \le w_j .
\qquad (4.29)
\]


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When $\|\beta^j\|_2 \neq 0$, the gradient of the penalty with respect to $\beta^j$ is
\[
\frac{\partial \left( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \right)}{\partial \beta^j}
= \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} .
\qquad (4.30)
\]
At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):
\[
\partial_{\beta^j} \left( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \right)
= \partial_{\beta^j} \left( \lambda w_j \|\beta^j\|_2 \right)
= \left\{ \lambda w_j v \in \mathbb{R}^{K-1} : \|v\|_2 \le 1 \right\} ,
\qquad (4.31)
\]
which gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if $J$ is strictly convex. All critical points $B$ of the objective function verifying the following conditions are global minima:
\[
\forall j \in \mathcal{S}, \quad \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \|\beta^j\|_2^{-1} \beta^j = 0 ,
\qquad (4.32a)
\]
\[
\forall j \in \bar{\mathcal{S}}, \quad \left\| \frac{\partial J(B)}{\partial \beta^j} \right\|_2 \le \lambda w_j ,
\qquad (4.32b)
\]
where $\mathcal{S} \subseteq \{1, \dots, p\}$ denotes the set of non-zero row vectors $\beta^j$ and $\bar{\mathcal{S}}$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
\[
B_{OS} = \operatorname*{argmin}_{B \in \mathbb{R}^{p\times(K-1)}} \; \min_{\Theta \in \mathbb{R}^{K\times(K-1)}} \;
\frac{1}{2} \|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2
\quad \text{s.t.} \quad n^{-1} \Theta^\top Y^\top Y \Theta = I_{K-1}
\]


is equivalent to the penalized LDA problem
\[
B_{LDA} = \operatorname*{argmax}_{B \in \mathbb{R}^{p\times(K-1)}} \; \mathrm{tr}\left( B^\top \Sigma_B B \right)
\quad \text{s.t.} \quad B^\top (\Sigma_W + n^{-1}\lambda\Omega) B = I_{K-1} ,
\]
where $\Omega = \mathrm{diag}\left( \frac{w_1^2}{\tau_1}, \dots, \frac{w_p^2}{\tau_p} \right)$, with
\[
\Omega_{jj} =
\begin{cases}
+\infty & \text{if } \beta^j_{OS} = 0 , \\
w_j \|\beta^j_{OS}\|_2^{-1} & \text{otherwise.}
\end{cases}
\qquad (4.33)
\]
That is, $B_{LDA} = B_{OS}\, \mathrm{diag}\left( \alpha_k^{-1} (1-\alpha_k^2)^{-1/2} \right)$, where $\alpha_k \in (0,1)$ is the $k$th leading eigenvalue of
\[
n^{-1}\, Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y .
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for $K = 2$, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form $\mathrm{tr}\left( B^\top \Omega B \right)$.
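In practice, Proposition 4.1 translates into a one-line rescaling of the optimal scoring directions. A MATLAB sketch, assuming the matrix B_os and the column vector alpha of the $K-1$ leading eigenvalues are already available (hypothetical variable names):

```matlab
% Rescale optimal scoring directions into LDA discriminant directions
% (Proposition 4.1): B_LDA = B_OS * diag( alpha_k^-1 * (1 - alpha_k^2)^(-1/2) ).
B_lda = B_os * diag(1 ./ (alpha .* sqrt(1 - alpha.^2)));
```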


          5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased/decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac{1}{2}\|Y\Theta - XB\|_2^2$.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say $B = 0$, thus defining the set $\mathcal{A}$ of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below.

1. Update the coefficient matrix $B$ within the current active set $\mathcal{A}$, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix $B$ within the current active set $\mathcal{A}$. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving $(K-1)$ independent $\mathrm{card}(\mathcal{A})$-dimensional problems instead of a single $(K-1)\times\mathrm{card}(\mathcal{A})$-dimensional problem. The interaction between the $(K-1)$ problems is relegated to the common adaptive quadratic penalty $\Omega$. This decomposition is especially attractive, as we then solve $(K-1)$ similar systems
\[
\left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 ,
\qquad (5.1)
\]


Figure 5.1: GLOSS block diagram. The chart initializes the model ($\lambda$, $B$) and the active set (all $j$ with $\|\beta^j\|_2 > 0$), then loops: solve the p-OS problem until the first optimality condition holds on the active set, move to the inactive set any active variable that must leave it, and test the second optimality condition on the inactive set to activate new variables; when no variable moves, $\Theta$ is computed and $B$ is updated.


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1,...,p} : ||β^j||_2 > 0 },
            Θ^0 such that n^{-1} Θ^{0⊤} Y^⊤ Y Θ^0 = I_{K-1},
            convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ||β^j||_2^{-1}
        B_A ← ( X_A^⊤ X_A + λΩ )^{-1} X_A^⊤ Y Θ^0
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for j ∈ A such that ||β^j||_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j}; go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in the complement of A
    j* ← argmax_{j ∉ A} || ∂J/∂β^j ||_2
    if || ∂J/∂β^{j*} ||_2 < λ then
        convergence ← true   (B is optimal)
    else
        A ← A ∪ {j*}
    end if
until convergence
(s, V) ← eigenanalyze( Θ^{0⊤} Y^⊤ X_A B ), that is,
         Θ^{0⊤} Y^⊤ X_A B V_k = s_k V_k, k = 1,...,K-1
Θ ← Θ^0 V;  B ← B V;  α_k ← n^{-1/2} s_k^{1/2}, k = 1,...,K-1
Output: Θ, B, α
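The following MATLAB sketch mirrors the control flow of Algorithm 1 under simplifying assumptions (unit weights $w_j = 1$, fixed score matrix $\Theta^0$, a fixed number of inner reweighting iterations and crude tolerances). It illustrates the three steps, not the actual GLOSS implementation.

```matlab
function B = gloss_sketch(X, Y, Theta0, lambda, maxit)
% Schematic active-set loop for adaptively penalized optimal scoring (sketch only).
[n, p] = size(X); K1 = size(Theta0, 2);
B = zeros(p, K1); A = false(p, 1);              % start with an empty active set
R = X' * (X * B - Y * Theta0);                  % p x (K-1) gradient of J
for it = 1:maxit
    % Step 3: activate the strongest violator of the second optimality condition
    viol = sqrt(sum(R.^2, 2)); viol(A) = -inf;
    [v, j] = max(viol);
    if v < lambda, break; end                   % all optimality conditions hold
    A(j) = true;
    B(j,:) = -R(j,:) / (X(:,j)' * X(:,j) + lambda);   % small warm start for the new row
    % Step 1: iteratively reweighted penalized least squares on the active set
    for inner = 1:50
        nrm = max(sqrt(sum(B(A,:).^2, 2)), 1e-8);
        Omega = diag(1 ./ nrm);                 % adaptive quadratic penalty (4.26), w_j = 1
        B(A,:) = (X(:,A)' * X(:,A) + lambda * Omega) \ (X(:,A)' * Y * Theta0);
    end
    % Step 2: deactivate variables whose coefficients have vanished
    gone = A & (sqrt(sum(B.^2, 2)) < 1e-6);
    B(gone,:) = 0; A(gone) = false;
    R = X' * (X * B - Y * Theta0);              % refresh the gradient
end
end
```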


where $X_{\mathcal{A}}$ denotes the columns of $X$ indexed by $\mathcal{A}$, and $\beta_k$ and $\theta_k^0$ denote the $k$th column of $B$ and $\Theta^0$, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" $\Omega$ for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the $(K-1)$ systems together, (5.1) leads to
\[
(X^\top X + \lambda\Omega) B = X^\top Y \Theta .
\qquad (5.2)
\]
Defining the Cholesky decomposition as $C^\top C = (X^\top X + \lambda\Omega)$, (5.2) is solved efficiently as follows:
\[
C^\top C B = X^\top Y \Theta
\;\Leftrightarrow\; C B = C^\top \backslash \left( X^\top Y \Theta \right)
\;\Leftrightarrow\; B = C \backslash \left( C^\top \backslash \left( X^\top Y \Theta \right) \right) ,
\qquad (5.3)
\]
where the symbol "$\backslash$" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
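A minimal MATLAB transcription of (5.3), assuming X, Y, Theta, Omega and lambda are available in the workspace:

```matlab
% Solve (X'X + lambda*Omega) B = X'Y*Theta for all K-1 columns with a single
% Cholesky factorization, following (5.3).
C = chol(X' * X + lambda * Omega);   % upper triangular, C'*C = X'X + lambda*Omega
B = C \ (C' \ (X' * Y * Theta));     % two triangular solves shared by all columns
```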

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer $\Omega$ is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of $\Omega$ reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $X^\top X + \lambda\Omega$. This difficulty can be avoided using the following equivalent expression:
\[
B = \Omega^{-1/2} \left( \Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I \right)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 ,
\qquad (5.4)
\]
where the conditioning of $\Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I$ is always well-behaved provided $X$ is appropriately normalized (recall that $0 \le 1/\omega_j \le 1$). This more stable expression demands more computation and is thus reserved to cases with large $\omega_j$ values. Our code is otherwise based on expression (5.2).
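A possible MATLAB transcription of (5.4), again with hypothetical variable names; Omega is the diagonal penalty matrix of (4.33):

```matlab
% More stable update (5.4) when some omega_j become very large.
s  = 1 ./ sqrt(diag(Omega));                  % Omega^(-1/2), stored as a vector
Xs = bsxfun(@times, X, s');                   % X * Omega^(-1/2): scale columns of X
B  = bsxfun(@times, ...
       (Xs' * Xs + lambda * eye(numel(s))) \ (Xs' * Y * Theta0), s);  % left-multiply by Omega^(-1/2)
```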

5.2 Score Matrix

The optimal score matrix $\Theta$ is made of the $K-1$ leading eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$. This eigen-analysis is actually solved in the form $\Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta$ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of $(X^\top X + \Omega)^{-1}$, which involves the inversion of a $p \times p$ matrix. Let $\Theta^0$ be an arbitrary $K\times(K-1)$ matrix whose range includes the $K-1$ leading eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.¹ Then, solving the $K-1$ systems (5.3) provides the value of $B^0 = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0$. This $B^0$ matrix can be identified in the expression to eigenanalyze as
\[
\Theta^{0\top} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .
\]
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the $(K-1)\times(K-1)$ matrix $\Theta^{0\top} Y^\top X B^0 = V \Lambda V^\top$. Defining $\Theta = \Theta^0 V$, we have $\Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Lambda$, and when $\Theta^0$ is chosen such that $n^{-1}\Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$, we also have that $n^{-1}\Theta^\top Y^\top Y \Theta = I_{K-1}$, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of $\Lambda$ are sorted in decreasing order, $\Theta$ is an optimal solution to the p-OS problem. Finally, once $\Theta$ has been computed, the corresponding optimal regression coefficients $B$ satisfying (5.2) are simply recovered using the mapping from $\Theta^0$ to $\Theta$, that is, $B = B^0 V$. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which $\Omega$ is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix $B$ and the score matrix $\Theta$. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which optimality conditions (4.32a) and (4.32b) can be deduced. Both expressions require the computation of the gradient of the objective function
\[
\frac{1}{2} \|Y\Theta - XB\|_2^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 .
\qquad (5.5)
\]
Let $J(B)$ be the data-fitting term $\frac{1}{2}\|Y\Theta - XB\|_2^2$. Its gradient with respect to the $j$th row of $B$, $\beta^j$, is the $(K-1)$-dimensional vector
\[
\frac{\partial J(B)}{\partial \beta^j} = x_j^\top (XB - Y\Theta) ,
\]
where $x_j$ is the $j$th column of $X$. Hence the first optimality condition (4.32a) can be computed for every variable $j$ as
\[
x_j^\top (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} = 0 .
\]

¹ As $X$ is centered, $1_K$ belongs to the null space of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$. It is thus sufficient to choose $\Theta^0$ orthogonal to $1_K$ to ensure that its range spans the leading eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set $\Theta^0 = (Y^\top Y)^{-1/2} U$, where $U$ is a $K\times(K-1)$ matrix whose columns are orthonormal vectors orthogonal to $1_K$.


The second optimality condition (4.32b) can be computed for every variable $j$ as
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .
\]

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let $\mathcal{A}$ be the active set, containing the variables that have already been considered relevant. A variable $j$ can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
\[
j^\star = \operatorname*{argmax}_{j} \; \max\left( \left\| x_j^\top (XB - Y\Theta) \right\|_2 - \lambda w_j ,\; 0 \right) .
\]
The exclusion of a variable belonging to the active set $\mathcal{A}$ is considered if the norm $\|\beta^j\|_2$ is small and if, after setting $\beta^j$ to zero, the following optimality condition holds:
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .
\]
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
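For illustration, the activation test of this section can be written in MATLAB as follows (A is a logical vector marking the active set, w the vector of penalty weights; all names are ours):

```matlab
% Activation test: pick the strongest violator of condition (4.32b).
G = X' * (X * B - Y * Theta);                  % gradient of J, one row per variable
score = sqrt(sum(G.^2, 2)) - lambda * w;       % positive value => (4.32b) is violated
score(A) = -inf;                               % restrict the test to inactive variables
[worst, jstar] = max(score);
if worst > 0, A(jstar) = true; end             % activate one variable at a time
```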

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of $\lambda$. The other strategy is to compute the solution path for several values of $\lambda$: GLOSS then looks for the maximum value of the penalty parameter, $\lambda_{\max}$, such that $B \neq 0$, and solves the p-OS problem for decreasing values of $\lambda$, until a prescribed number of features are declared active.

The maximum value of the penalty parameter $\lambda_{\max}$, corresponding to a null $B$ matrix, is obtained by computing the optimality condition (4.32b) at $B = 0$:
\[
\lambda_{\max} = \max_{j \in \{1,\dots,p\}} \; \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 .
\]
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties $\lambda_1 = \lambda_{\max} > \dots > \lambda_t > \dots > \lambda_T = \lambda_{\min} \ge 0$, by regularly decreasing the penalty, $\lambda_{t+1} = \lambda_t / 2$, and using a warm-start strategy where the feasible initial guess for $B(\lambda_{t+1})$ is initialized with $B(\lambda_t)$. The final penalty parameter $\lambda_{\min}$ is specified in the optimization process when the maximum number of desired active variables is attained (by default, the minimum of $n$ and $p$).
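A sketch of the path strategy in MATLAB; gloss_sketch_warm stands for any warm-started solver of the p-OS problem (a hypothetical function name), and p, n, K, T, w and Theta0 are assumed to be defined:

```matlab
% Largest useful penalty (B = 0) and a halving regularization path.
lambda_max = max(sqrt(sum((X' * (Y * Theta0)).^2, 2)) ./ w);   % (4.32b) at B = 0
lambdas = lambda_max * 2 .^ -(0:T-1);          % lambda_{t+1} = lambda_t / 2
B = zeros(p, K-1);
for t = 1:numel(lambdas)
    B = gloss_sketch_warm(X, Y, Theta0, lambdas(t), B);  % warm start from previous solution
    if nnz(any(B, 2)) >= min(n, p), break; end           % enough variables are active
end
```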


5.6 Options and Variants

5.6.1 Scaling Variables

As with most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm, or equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al. 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2
= \min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left( \Theta^\top Y^\top Y \Theta - 2\,\Theta^\top Y^\top X B + n B^\top \Sigma_T B \right)
\]
are replaced by
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left( \Theta^\top Y^\top Y \Theta - 2\,\Theta^\top Y^\top X B + n B^\top (\Sigma_B + \mathrm{diag}(\Sigma_W)) B \right) .
\]
Note that this variant only requires $\mathrm{diag}(\Sigma_W) + \Sigma_B + n^{-1}\Omega$ to be positive definite, which is a weaker requirement than $\Sigma_T + n^{-1}\Omega$ positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

          55

          5 GLOSS Algorithm

[3×3 pixel grid, numbered bottom-to-top: 7 8 9 / 4 5 6 / 1 2 3]

Ω_L =
  3 -1  0 -1 -1  0  0  0  0
 -1  5 -1 -1 -1 -1  0  0  0
  0 -1  3  0 -1 -1  0  0  0
 -1 -1  0  5 -1  0 -1 -1  0
 -1 -1 -1 -1  8 -1 -1 -1 -1
  0 -1 -1  0 -1  5  0 -1 -1
  0  0  0 -1 -1  0  3 -1  0
  0  0  0 -1 -1 -1 -1  5 -1
  0  0  0  0 -1 -1  0 -1  3

Figure 5.2: Graph and Laplacian matrix for a 3×3 image.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of pixels in a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix $\Omega_L$ is positive semi-definite, and the penalty $\beta^\top \Omega_L \beta$ favors, among vectors of identical $L_2$ norms, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector $(1,1,0,1,1,0,0,0,0)^\top$, which is the indicator of the neighbors of pixel 1, and it is 17 for the vector $(-1,1,0,1,1,0,0,0,0)^\top$, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty has just to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
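For illustration, the Laplacian matrix of Figure 5.2 can be generated for an arbitrary image size with a few lines of base MATLAB (8-neighborhood); the resulting $\Omega_L$ is then simply added to the quadratic part of the GLOSS criterion.

```matlab
% Laplacian penalty of the 8-neighbor graph of an r-by-c image.
r = 3; c = 3;                                     % 3x3 reproduces Figure 5.2; 16x16 for USPS
[I, J] = ndgrid(1:r, 1:c);
coord = [I(:), J(:)];                             % pixel coordinates
D = max(abs(bsxfun(@minus, coord(:,1), coord(:,1)')), ...
        abs(bsxfun(@minus, coord(:,2), coord(:,2)')));   % Chebyshev distances
Adj = double(D == 1);                             % 8-neighborhood adjacency
OmegaL = diag(sum(Adj, 2)) - Adj;                 % graph Laplacian
% the smoothness term beta' * OmegaL * beta is added to the quadratic penalty
```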


          6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al. 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites: PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix $\Sigma_T$ to ones, or the diagonal of the within-class covariance matrix $\Sigma_W$ to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest to investigate other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split in a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample $i$ is in class $k$, then $x_i \sim N(\mu_k, I)$, where $\mu_{1j} = 0.7 \times 1_{(1 \le j \le 25)}$, $\mu_{2j} = 0.7 \times 1_{(26 \le j \le 50)}$, $\mu_{3j} = 0.7 \times 1_{(51 \le j \le 75)}$, $\mu_{4j} = 0.7 \times 1_{(76 \le j \le 100)}$.

Simulation 2: Mean shift with dependent features. There are two classes. If sample $i$ is in class 1, then $x_i \sim N(0, \Sigma)$, and if $i$ is in class 2, then $x_i \sim N(\mu, \Sigma)$, with $\mu_j = 0.6 \times 1_{(j \le 200)}$. The covariance structure is block diagonal, with 5 blocks, each of dimension $100 \times 100$. The blocks have $(j, j')$ element $0.6^{|j-j'|}$. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes, and the features are independent. If sample $i$ is in class $k$, then $X_{ij} \sim N(\frac{k-1}{3}, 1)$ if $j \le 100$, and $X_{ij} \sim N(0, 1)$ otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample $i$ is in class $k$, then $x_i \sim N(\mu_k, I)$, with mean vectors defined as follows: $\mu_{1j} \sim N(0, 0.3^2)$ for $j \le 25$, and $\mu_{1j} = 0$ otherwise; $\mu_{2j} \sim N(0, 0.3^2)$ for $26 \le j \le 50$, and $\mu_{2j} = 0$ otherwise; $\mu_{3j} \sim N(0, 0.3^2)$ for $51 \le j \le 75$, and $\mu_{3j} = 0$ otherwise; $\mu_{4j} \sim N(0, 0.3^2)$ for $76 \le j \le 100$, and $\mu_{4j} = 0$ otherwise.

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of $K$. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                                               Err (%)       Var            Dir
Sim 1: K = 4, mean shift, ind. features
  PLDA                                         12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA                                         31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS                                        19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D                                      11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
Sim 2: K = 2, mean shift, dependent features
  PLDA                                          9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA                                         19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS                                        15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D                                       9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
Sim 3: K = 4, 1D mean shift, ind. features
  PLDA                                         13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA                                         57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS                                        31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D                                      18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
Sim 4: K = 4, mean shift, ind. features
  PLDA                                         60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA                                         65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS                                        60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D                                      58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all four simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

           Simulation 1      Simulation 2      Simulation 3      Simulation 4
           TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
PLDA       99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
SLDA       73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
GLOSS      64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
GLOSS-D    93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer.

²http://www.broadinstitute.org/cancer/software/genepattern/datasets
³http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                                        Err (%)         Var
Nakayama: n = 86, p = 22,283, K = 5
  PLDA                                  20.95 (1.3)     10478.7 (2116.3)
  SLDA                                  25.71 (1.7)       252.5 (3.1)
  GLOSS                                 20.48 (1.4)       129.0 (18.6)
Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA                                  38.36 (6.0)     14873.5 (720.3)
  SLDA                                  —               —
  GLOSS                                 20.61 (6.9)       372.4 (122.1)
Sun: n = 180, p = 54,613, K = 4
  PLDA                                  33.78 (5.9)     21634.8 (7443.2)
  SLDA                                  36.22 (6.5)       384.4 (16.5)
  GLOSS                                 31.77 (4.5)        93.0 (93.6)

Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS (left) and SLDA (right); the big squares represent class means. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results have been left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16×16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have represented the pixel proximity relationships from Figure 5.2 into a penalty matrix $\Omega_L$, but this time in a 256-node graph. Introducing this new 256×256 Laplacian penalty matrix $\Omega_L$ in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector $\beta$ resulting from a non-penalized execution of GLOSS is compared with the $\beta$ resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction $\beta$ obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter ($\lambda = 0.3$). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow detecting strokes, and it will probably provide better prediction results.


Figure 6.4: Discriminant direction between digits "1" and "0" (left: β for GLOSS; right: β for S-GLOSS).

Figure 6.5: Sparse discriminant direction between digits "1" and "0" (left: β for GLOSS, λ = 0.3; right: β for S-GLOSS, λ = 0.3).


          Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the $p\times(K-1)$-dimensional problem into $(K-1)$ independent $p$-dimensional problems. The interaction between the $(K-1)$ problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty, to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


          Part III

          Sparse Clustering Analysis


          Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified with different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


          7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition $n$ observations into $K$ clusters, each observation being assigned to the cluster with the nearest mean (MacQueen 1967). A generalization of K-means can be made through probabilistic models, which represent $K$ subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data $X = (x_1^\top, \dots, x_n^\top)^\top$ have been drawn identically from $K$ different subpopulations in the domain $\mathbb{R}^p$. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as
\[
f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) , \quad \forall i \in \{1, \dots, n\} ,
\]
where $K$ is the number of components, $f_k$ are the densities of the components, and $\pi_k$ are the mixture proportions ($\pi_k \in\, ]0,1[\; \forall k$, and $\sum_k \pi_k = 1$). Mixture models transcribe that, given the proportions $\pi_k$ and the distributions $f_k$ for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters $\pi_1, \dots, \pi_K$;

• x: each $x_i$ is assumed to arise from a random vector with probability density function $f_k$.

In addition, it is usually assumed that the component densities $f_k$ belong to a parametric family of densities $\phi(\cdot\,; \theta_k)$. The density of the mixture can then be written as
\[
f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \, \phi(x_i; \theta_k) , \quad \forall i \in \{1, \dots, n\} ,
\]


where $\theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K)$ is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi)$ of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is to maximize the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al. 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

          Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:
\[
L(\theta; X) = \log\left( \prod_{i=1}^{n} f(x_i; \theta) \right)
= \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) ,
\qquad (7.1)
\]
where $n$ is the number of samples, $K$ is the number of components of the mixture (or number of clusters), and $\pi_k$ are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations $x$ and the unknown latent variables $y$, which indicate the cluster membership of every sample. The pair $z = (x, y)$ is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or


classification log-likelihood:
\[
L_C(\theta; X, Y) = \log\left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right)
= \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} y_{ik} \, \pi_k f_k(x_i; \theta_k) \right)
= \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right) .
\qquad (7.2)
\]
The $y_{ik}$ are the binary entries of the indicator matrix $Y$, with $y_{ik} = 1$ if the observation $i$ belongs to the cluster $k$, and $y_{ik} = 0$ otherwise.

Define the soft membership $t_{ik}(\theta)$ as
\[
t_{ik}(\theta) = p(Y_{ik} = 1 | x_i; \theta)
\qquad (7.3)
\]
\[
\phantom{t_{ik}(\theta)} = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)} .
\qquad (7.4)
\]
To lighten notations, $t_{ik}(\theta)$ will be denoted $t_{ik}$ when the parameter $\theta$ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
\[
L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right)
= \sum_{i,k} y_{ik} \log\left( t_{ik} f(x_i; \theta) \right)
= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)
= \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X) ,
\qquad (7.5)
\]
where $\sum_{i,k} y_{ik} \log t_{ik}$ can be reformulated as
\[
\sum_{i,k} y_{ik} \log t_{ik}
= \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( p(Y_{ik} = 1 | x_i; \theta) \right)
= \sum_{i=1}^{n} \log\left( p(y_i | x_i; \theta) \right)
= \log\left( p(Y | X; \theta) \right) .
\]
As a result, the relationship (7.5) can be rewritten as
\[
L(\theta; X) = L_C(\theta; Z) - \log\left( p(Y | X; \theta) \right) .
\qquad (7.6)
\]


          Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables $y_{ik}$ are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value $\theta^{(t)}$ of $\theta$:
\[
L(\theta; X) =
\underbrace{\mathbb{E}_{Y \sim p(\cdot|X, \theta^{(t)})} \left[ L_C(\theta; X, Y) \right]}_{Q(\theta, \theta^{(t)})}
+ \underbrace{\mathbb{E}_{Y \sim p(\cdot|X, \theta^{(t)})} \left[ -\log p(Y|X, \theta) \right]}_{H(\theta, \theta^{(t)})} .
\]
In this expression, $H(\theta, \theta^{(t)})$ is an entropy term and $Q(\theta, \theta^{(t)})$ is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as $\Delta L = L(\theta^{(t+1)}; X) - L(\theta^{(t)}; X)$. Then $\theta^{(t+1)} = \operatorname*{argmax}_\theta Q(\theta, \theta^{(t)})$ also increases the log-likelihood:
\[
\Delta L = \underbrace{\left( Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by definition of iteration } t+1}
+ \underbrace{\left( H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by Jensen's inequality}} \ge 0 .
\]
Therefore, it is possible to maximize the likelihood by optimizing $Q(\theta, \theta^{(t)})$. The relationship between $Q(\theta, \theta')$ and $L(\theta; X)$ is developed in deeper detail in Appendix F, to show how the value of $L(\theta; X)$ can be recovered from $Q(\theta, \theta^{(t)})$.

For the mixture model problem, $Q(\theta, \theta')$ is
\[
Q(\theta, \theta') = \mathbb{E}_{Y \sim p(Y|X, \theta')} \left[ L_C(\theta; X, Y) \right]
= \sum_{i,k} p(Y_{ik} = 1 | x_i; \theta') \log\left( \pi_k f_k(x_i; \theta_k) \right)
= \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left( \pi_k f_k(x_i; \theta_k) \right) .
\qquad (7.7)
\]
$Q(\theta, \theta')$, due to its similarity with the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;

• E-Step: evaluation of $Q(\theta, \theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);

• M-Step: calculation of $\theta^{(t+1)} = \operatorname*{argmax}_\theta Q(\theta, \theta^{(t)})$.


          Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix $\Sigma$ and different mean vectors $\mu_k$, the mixture density is
\[
f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
= \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}
\exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\} .
\]
At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4), with the current $\theta^{(t)}$ parameters; then the M-step maximizes $Q(\theta, \theta^{(t)})$ (7.7), whose form is as follows:
\[
Q(\theta, \theta^{(t)}) = \sum_{i,k} t_{ik} \log(\pi_k)
- \sum_{i,k} t_{ik} \log\left( (2\pi)^{p/2} |\Sigma|^{1/2} \right)
- \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
\]
\[
= \sum_{k} t_k \log(\pi_k)
- \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}}
- \frac{n}{2} \log(|\Sigma|)
- \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
\]
\[
\equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|)
- \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right) ,
\qquad (7.8)
\]
where
\[
t_k = \sum_{i=1}^{n} t_{ik} .
\qquad (7.9)
\]
The M-step, which maximizes this expression with respect to $\theta$, applies the following updates, defining $\theta^{(t+1)}$:
\[
\pi_k^{(t+1)} = \frac{t_k}{n} ,
\qquad (7.10)
\]
\[
\mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k} ,
\qquad (7.11)
\]
\[
\Sigma^{(t+1)} = \frac{1}{n} \sum_{k} W_k ,
\qquad (7.12)
\]
\[
\text{with} \quad W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .
\qquad (7.13)
\]
The derivations are detailed in Appendix G.
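A compact MATLAB sketch of one EM iteration for this Gaussian model, implementing (7.4) and (7.10)-(7.13) (function and variable names are ours; the E-step uses the same max-shift trick as in Section 4.2.3 to avoid underflow):

```matlab
function [pik, mu, Sigma, T] = em_step(X, pik, mu, Sigma)
% One EM iteration for a Gaussian mixture with common covariance matrix.
% X: n-by-p data, pik: 1-by-K proportions, mu: K-by-p means, Sigma: p-by-p.
[n, p] = size(X); K = numel(pik);
logdens = zeros(n, K);
R = chol(Sigma);                                  % R'*R = Sigma
for k = 1:K
    Z = (X - repmat(mu(k,:), n, 1)) / R;          % whitened residuals
    logdens(:,k) = log(pik(k)) - sum(log(diag(R))) ...
                   - (p/2)*log(2*pi) - 0.5*sum(Z.^2, 2);
end
% E-step: posterior memberships t_ik (7.4), computed stably
m = max(logdens, [], 2);
T = exp(logdens - repmat(m, 1, K));
T = T ./ repmat(sum(T, 2), 1, K);
% M-step: proportions, means and common covariance (7.10)-(7.13)
tk = sum(T, 1);                                   % t_k = sum_i t_ik (7.9)
pik = tk / n;                                     % (7.10)
mu = (T' * X) ./ repmat(tk', 1, p);               % (7.11)
Sigma = zeros(p);
for k = 1:K
    Xc = X - repmat(mu(k,:), n, 1);
    Sigma = Sigma + Xc' * (repmat(T(:,k), 1, p) .* Xc);   % W_k (7.13)
end
Sigma = Sigma / n;                                % (7.12)
end
```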

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix $\Sigma_k$, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, $\Sigma_k = \lambda_k D_k A_k D_k^\top$ (Banfield and Raftery 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity with model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original $p$-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

          Penalized log-likelihood maximization is a popular estimation technique for mixturemodels It is typically achieved by the EM algorithm using mixture models for which theallocation of examples is expressed as a simple function of the input features For exam-ple for Gaussian mixtures with a common covariance matrix the log-ratio of posteriorprobabilities is a linear function of x

          log

          (p(Yk = 1|x)

          p(Y` = 1|x)

          )= xgtΣminus1(microk minus micro`)minus

          1

          2(microk + micro`)

          gtΣminus1(microk minus micro`) + logπkπ`

          In this model a simple way of introducing sparsity in discriminant vectors Σminus1(microk minusmicro`) is to constrain Σ to be diagonal and to favor sparse means microk Indeed for Gaussianmixtures with common diagonal covariance matrix if all means have the same value ondimension j then variable j is useless for class allocation and can be discarded Themeans can be penalized by the L1 norm

          λKsumk=1

          psumj=1

          |microkj |

          as proposed by Pan et al (2006) Pan and Shen (2007) Zhou et al (2009) consider morecomplex penalties on full covariance matrices

\lambda_1 \sum_{k=1}^K \sum_{j=1}^p |\mu_{kj}| + \lambda_2 \sum_{k=1}^K \sum_{j=1}^p \sum_{m=1}^p |(\Sigma_k^{-1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
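To illustrate how such an L1 penalty on the means acts within the M-step, a commonly used update is a soft-thresholding of the weighted means when the common covariance is diagonal. The sketch below is a simplified illustration, assuming standardized (unit-variance) features and function names of ours; it is not the exact estimator of Pan and Shen (2007).

import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0), applied elementwise
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_penalized_means(X, T, lam):
    # X: (n, p) standardized data; T: (n, K) posteriors; lam: penalty parameter.
    tk = T.sum(axis=0)                               # cluster weights t_k
    weighted_sums = T.T @ X                          # sum_i t_ik x_ij, shape (K, p)
    return soft_threshold(weighted_sums, lam) / tk[:, None]   # shrunken cluster means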

          76

          72 Feature Selection in Model-Based Clustering

          Guo et al (2010) propose a variation with a Pairwise Fusion Penalty (PFP)

\lambda \sum_{j=1}^p \sum_{1 \leq k \leq k' \leq K} |\mu_{kj} - \mu_{k'j}| .

This PFP regularization does not shrink the means to zero but towards each other. When the jth feature of all cluster means is driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

\lambda \sum_{j=1}^p \left\| (\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj}) \right\|_\infty .

One group is defined for each variable j as the set of the K means' jth components (μ_1j, ..., μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves genuine feature selection, because it forces null values for the same variable in all cluster means:

\lambda \sqrt{K} \sum_{j=1}^p \sqrt{\sum_{k=1}^K \mu_{kj}^2} .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing to test it is available on the authors' website.
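To illustrate the effect of this group penalty on the means, the corresponding proximal update shrinks, for each variable j, the whole vector (μ_1j, ..., μ_Kj) at once and sets it exactly to zero when its group norm is small. The sketch below is only an illustration under the same simplifying assumptions as above (standardized features, names of ours); it is not the algorithm of Xie et al. (2008b).

import numpy as np

def group_penalized_means(X, T, lam):
    # Groupwise (across clusters) shrinkage of the means: one group per variable j.
    K = T.shape[1]
    tk = T.sum(axis=0)
    mu = (T.T @ X) / tk[:, None]                 # unpenalized means, shape (K, p)
    norms = np.linalg.norm(mu, axis=0)           # group norm of variable j across clusters
    shrink = np.maximum(1.0 - lam * np.sqrt(K) / np.maximum(norms, 1e-12), 0.0)
    return mu * shrink[None, :]                  # column j is zeroed when shrink = 0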

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence; that is, the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as

f(x_i \mid \phi, \pi, \theta, \nu) = \sum_{k=1}^K \pi_k \prod_{j=1}^p \left[ f(x_{ij} \mid \theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij} \mid \nu_j) \right]^{1 - \phi_j} ,

where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, ..., φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency for variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}. Their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion,

\mathrm{tr}\left( (U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U \right) ,   (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U which maximizes (7.14). This sparse approximation is defined as the solution of

\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \left\| X_U - X\tilde{U} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \tilde{u}_k \right\|_1 ,

where X_U = XU is the input data projected in the non-sparse space and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:

\min_{A, B \in \mathbb{R}^{p \times (K-1)}} \sum_{k=1}^K \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1

\text{s.t. } A^\top A = I_{K-1} ,

where H_B ∈ R^{p×K} is a matrix defined conditionally to the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B. R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observations space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.

The last possibility computes the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

\min_{U \in \mathbb{R}^{p \times (K-1)}} \sum_{j=1}^p \left\| \Sigma_{B,j} - U U^\top \Sigma_{B,j} \right\|_2^2

\text{s.t. } U^\top U = I_{K-1} ,

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observations space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain assumptions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices: the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this approach, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^{(1)}: set of selected relevant variables;

• X^{(2)}: set of variables being considered for inclusion into, or exclusion from, X^{(1)};

• X^{(3)}: set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:

f(X \mid Y) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid Y) = f(X^{(3)} \mid X^{(2)}, X^{(1)})\, f(X^{(2)} \mid X^{(1)})\, f(X^{(1)} \mid Y)

• M2:

f(X \mid Y) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid Y) = f(X^{(3)} \mid X^{(2)}, X^{(1)})\, f(X^{(2)}, X^{(1)} \mid Y)

Model M1 means that the variables in X^{(2)} are independent of the clustering Y; model M2 means that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_{12} = \frac{f(X \mid M_1)}{f(X \mid M_2)} ,

where the high-dimensional term f(X^{(3)} \mid X^{(2)}, X^{(1)}) cancels from the ratio:

B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_2)} = \frac{f(X^{(2)} \mid X^{(1)}, M_1)\, f(X^{(1)} \mid M_1)}{f(X^{(2)}, X^{(1)} \mid M_2)} .

This factor is approximated, since the integrated likelihoods f(X^{(1)} \mid M_1) and f(X^{(2)}, X^{(1)} \mid M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)} \mid X^{(1)}, M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}. There is also a BIC approximation for this term.
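As a rough illustration of this model-selection step, the decision for a single candidate variable can be sketched by comparing BIC scores. The function names and the way the clustering BIC terms are obtained are assumptions made for illustration; they do not reproduce the actual implementation of Raftery and Dean (2006).

import numpy as np

def bic_linear_regression(y, X):
    # BIC (larger is better, as in mclust) of the regression of the candidate
    # variable y on the already selected variables X, approximating f(X2 | X1, M1).
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * loglik - (A.shape[1] + 1) * np.log(n)

def include_candidate(bic_clust_with, bic_clust_without, y_candidate, X_selected):
    # bic_clust_with: clustering BIC on X1 plus the candidate (model M2);
    # bic_clust_without: clustering BIC on X1 alone; both computed elsewhere.
    bic_m1 = bic_clust_without + bic_linear_regression(y_candidate, X_selected)
    return bic_clust_with > bic_m1          # positive approximate Bayes factor for M2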

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows to define blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, testing the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


          8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues: no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow to solve the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k) ,

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.

The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2\,\ell_{\mathrm{weight}}(\mu, \Sigma) = \sum_{i=1}^n \sum_{k=1}^K t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|) ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, \mu_k) = \left\| (x_i - \mu_k) B_{\mathrm{LDA}} \right\|_2^2 - 2 \log(\pi_k) .

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:

1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

B_{\mathrm{OS}} = \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y \Theta ,

where Θ are the K − 1 leading eigenvectors of

Y^\top X \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y .

3. Map X to the LDA domain: X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D, with D = \mathrm{diag}\left( \alpha_k^{-1} (1 - \alpha_k^2)^{-\frac{1}{2}} \right).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate distances into posterior probabilities t_{ik} with

t_{ik} \propto \exp\left[ -\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right] .   (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_{ik} converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
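A compact sketch of one iteration of this scheme is given below. The ridge-type penalty matrix Omega, the omission of the D scaling of step 3, and the helper name are simplifications assumed for illustration; this is not the actual Mix-GLOSS code.

import numpy as np

def penalized_os_em_iteration(X, Y, Omega, lam, priors):
    # One iteration: steps 2 to 6 of the clustering algorithm above.
    n, p = X.shape
    K = Y.shape[1]
    G = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y)   # (X'X + lam*Omega)^{-1} X'Y
    M = Y.T @ X @ G                                       # K x K matrix of step 2
    vals, vecs = np.linalg.eigh(M)
    Theta = vecs[:, np.argsort(vals)[::-1][:K - 1]]       # K-1 leading eigenvectors
    B_os = G @ Theta                                      # step 2: p-OS coefficients
    X_lda = X @ B_os                                      # step 3 (D scaling omitted)
    centroids = (Y.T @ X_lda) / Y.sum(axis=0)[:, None]    # step 4
    d = ((X_lda[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # step 5
    logp = -0.5 * (d - 2 * np.log(priors))                # step 6, eq. (8.1)
    T = np.exp(logp - logp.max(axis=1, keepdims=True))
    T /= T.sum(axis=1, keepdims=True)                     # posterior probabilities t_ik
    return B_os, T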

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with penalized OS. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7), so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2} |\Lambda_0|^{n/2} \Gamma_p(\frac{n}{2})} \, |\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left( \Lambda_0^{-1} \Sigma^{-1} \right) \right\} ,

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^p \Gamma\left( \frac{n}{2} + \frac{1 - j}{2} \right) .

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

Q(\theta, \theta') + \log(f(\Sigma \mid \Lambda_0, \nu_0))

= \sum_{k=1}^K t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log|\Lambda_0| - \frac{p(p+1)}{4} \log(\pi) - \sum_{j=1}^p \log \Gamma\left( \frac{n}{2} + \frac{1-j}{2} \right) - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right)

\equiv \sum_{k=1}^K t_k \log \pi_k - \frac{n}{2} \log|\Lambda_0| - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right) ,   (8.2)

with

t_k = \sum_{i=1}^n t_{ik} ,

\nu_n = \nu_0 + n ,

\Lambda_n^{-1} = \Lambda_0^{-1} + S_0 ,

S_0 = \sum_{i=1}^n \sum_{k=1}^K t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator for Σ is

\Sigma_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right) ,   (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
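As a small numerical illustration of (8.3) with this choice of hyperparameters (ν_0 = p + 1 and Λ_0^{-1} = λΩ), the MAP covariance estimate can be computed as below; this is only a sketch of the formula, with S_0 accumulated as in (8.2).

import numpy as np

def sigma_map(X, T, mu, Omega, lam):
    # MAP estimate of the common covariance, eq. (8.3), with nu_0 = p + 1
    # and Lambda_0^{-1} = lam * Omega.
    n, p = X.shape
    S0 = np.zeros((p, p))
    for k in range(T.shape[1]):
        Xc = X - mu[k]
        S0 += (T[:, k, None] * Xc).T @ Xc        # weighted scatter S_0, as in (8.2)
    return (lam * Omega + S0) / (p + 1 + n - p - 1)   # denominator reduces to n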


          9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as depicted in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized by the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

Figure 9.1: Mix-GLOSS loops scheme.

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, so that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the implemented warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2 Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0
    Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
    λ ← 0
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ: compute the gradient at β_j = 0,
        ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( Σ_{m ≠ j} x_m β^m − YΘ )
    Compute λ_max for every feature using (4.32b),
        λ_max^j = (1 / w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA

Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
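A minimal sketch of the λ_max computation used in Algorithm 2, assuming a squared-loss OS fit term J(B) = ½ ‖YΘ − XB‖²_F and given group weights w_j, could be the following (names are ours):

import numpy as np

def lambda_max_per_feature(X, Y, Theta, B, weights):
    # Gradient of the fit term at beta_j = 0 and the per-feature lambda_max of Algorithm 2.
    R = X @ B - Y @ Theta                      # residual with the current coefficients
    lam_max = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        grad_j = X[:, j] @ (R - np.outer(X[:, j], B[j]))  # x_j' (sum_{m!=j} x_m beta^m - Y Theta)
        lam_max[j] = np.linalg.norm(grad_j) / weights[j]
    return lam_max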

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.

Algorithm 3 Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0, Y ← Y0
    else
        B_OS ← 0, Y ← K-means(X, K)
    end if
    convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag( α^{-1} (1 − α^2)^{-1/2} )
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


          M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ, and the priors of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled label matrix YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

          E-Step

          The E-step evaluates the posterior probability matrix T using

t_{ik} \propto \exp\left[ -\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right] .

The convergence of those t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loops structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.

Figure 9.2: Mix-GLOSS model selection diagram (non-penalized runs, warm-started penalized runs over the λ grid, BIC computation, and selection of the λ minimizing the BIC).
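A schematic driver for this third strategy is sketched below. The helper mix_gloss_run (passed as a callable) and the exact form of the modified BIC, counting only the parameters attached to retained variables, are assumptions following the description above, not the actual implementation.

import numpy as np

def modified_bic(loglik, n, K, B):
    # BIC-like score that only counts parameters of non-removed variables
    # (in the spirit of Pan and Shen, 2007); lower is better. The parameter
    # count below is our own simplification.
    active = int(np.sum(np.any(B != 0, axis=1)))        # retained variables
    n_params = (K - 1) + K * active + active * (active + 1) / 2
    return -2 * loglik + np.log(n) * n_params

def select_lambda(X, K, lambdas, mix_gloss_run, n_repetitions=20):
    # mix_gloss_run: callable returning a dict with keys 'B', 'T', 'loglik'
    # (hypothetical interface used only for this sketch).
    # 1) several non-penalized runs, keep the best one by log-likelihood
    best = max((mix_gloss_run(X, K, lam=0.0) for _ in range(n_repetitions)),
               key=lambda r: r['loglik'])
    B, T = best['B'], best['T']
    # 2) one warm-started penalized run per lambda value, scored by the modified BIC
    scored = []
    for lam in sorted(lambdas):
        out = mix_gloss_run(X, K, lam=lam, B_init=B, T_init=T)
        B, T = out['B'], out['T']                       # warm start for the next value
        scored.append((modified_bic(out['loglik'], X.shape[0], K, out['B']), out))
    return min(scored, key=lambda s: s[0])[1]           # model minimizing the BIC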


          10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions, some of the algorithms to test took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov. This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM. This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust / Clustvarsel. Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster. LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows to perform clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS. This is the clustering algorithm implemented using GLOSS (see Section 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are:

• Clustering Error (in percentage). To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows to obtain the ideal 0% clustering error even if the IDs of the clusters and of the real classes differ.

• Number of Discarded Features. This value shows the number of variables whose coefficients have been zeroed; therefore, they are not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. Those algorithms tend to be more memory- and CPU-consuming as the number of variables increases. This is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of relevant variables that are actually selected. Similarly, the FPR is the ratio of non-relevant variables that are wrongly selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and clustering error, respectively, and, as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
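As an illustration of how these measures can be computed, a minimal sketch is given below. It uses a permutation-invariant matching of clusters to classes via the Hungarian algorithm (scipy); this is our own convenience implementation, not necessarily the exact procedure of Wu and Schölkopf (2007).

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred, K):
    # Build the confusion matrix and find the cluster/class matching that
    # maximizes agreement; the error is the fraction of remaining mismatches.
    C = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)       # best one-to-one matching
    return 1.0 - C[rows, cols].sum() / len(y_true)

def tpr_fpr(selected, relevant):
    # selected, relevant: boolean arrays over the p variables.
    selected, relevant = np.asarray(selected), np.asarray(relevant)
    tpr = (selected & relevant).sum() / relevant.sum()
    fpr = (selected & ~relevant).sum() / (~relevant).sum()
    return tpr, fpr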

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data.

Sim 1: K = 4, mean shift, ind. features
                        Err (%)       Var          Time
CS general cov          46 (15)       985 (72)     884h
Fisher EM               58 (87)       784 (52)     1645m
Clustvarsel             602 (107)     378 (291)    383h
LumiWCluster-Kuan       42 (68)       779 (4)      389s
LumiWCluster-Wang       43 (69)       784 (39)     619s
Mix-GLOSS               32 (16)       80 (09)      15h

Sim 2: K = 2, mean shift, dependent features
                        Err (%)       Var          Time
CS general cov          154 (2)       997 (09)     783h
Fisher EM               74 (23)       809 (28)     8m
Clustvarsel             73 (2)        334 (207)    166h
LumiWCluster-Kuan       64 (18)       798 (04)     155s
LumiWCluster-Wang       63 (17)       799 (03)     14s
Mix-GLOSS               77 (2)        841 (34)     2h

Sim 3: K = 4, 1D mean shift, ind. features
                        Err (%)       Var          Time
CS general cov          304 (57)      55 (468)     1317h
Fisher EM               233 (65)      366 (55)     22m
Clustvarsel             658 (115)     232 (291)    542h
LumiWCluster-Kuan       323 (21)      80 (02)      83s
LumiWCluster-Wang       308 (36)      80 (02)      1292s
Mix-GLOSS               347 (92)      81 (88)      21h

Sim 4: K = 4, mean shift, ind. features
                        Err (%)       Var          Time
CS general cov          626 (55)      999 (02)     112h
Fisher EM               567 (104)     55 (48)      195m
Clustvarsel             732 (4)       24 (12)      767h
LumiWCluster-Kuan       692 (112)     99 (2)       876s
LumiWCluster-Wang       697 (119)     991 (21)     825s
Mix-GLOSS               669 (91)      975 (12)     11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms.

              Simulation 1      Simulation 2      Simulation 3      Simulation 4
              TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
MIX-GLOSS     992     015       828     335       884     67        780     12
LUMI-KUAN     992     28        1000    02        1000    005       50      005
FISHER-EM     986     24        888     17        838     5825      620     4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1-10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, is also very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


          Conclusions

          Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis, we have shown the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows to transfer all the resources available for the resolution of regression problems to the solution of linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested with four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. For now, due to time constraints, only artificial datasets have been tested, with positive results.

          Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We plan to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables; however, other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving a functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian that describes the relationships between variables. That can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-equivalent penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested, or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized by Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if finally the work took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but, as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


          Appendix


          A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:

\Sigma_W = \frac{1}{n} \sum_{k=1}^g \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,

\Sigma_B = \frac{1}{n} \sum_{k=1}^g n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .

Property 2. \frac{\partial\, x^\top a}{\partial x} = \frac{\partial\, a^\top x}{\partial x} = a

Property 3. \frac{\partial\, x^\top A x}{\partial x} = (A + A^\top) x

Property 4. \frac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}| (X^{-1})^\top

Property 5. \frac{\partial\, a^\top X b}{\partial X} = a b^\top

Property 6. \frac{\partial}{\partial X} \mathrm{tr}\left( A X^{-1} B \right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix, we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

\min_{\theta_k, \beta_k} \; \left\| Y\theta_k - X\beta_k \right\|_2^2 + \beta_k^\top \Omega_k \beta_k   (B.1)

\text{s.t. } \theta_k^\top Y^\top Y \theta_k = 1 ,

\theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k ,

for k = 1, ..., K − 1.

The Lagrangian associated to Problem (B.1) is

L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \left\| Y\theta_k - X\beta_k \right\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k \left( \theta_k^\top Y^\top Y \theta_k - 1 \right) + \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k .   (B.2)

Setting to zero the gradient of (B.2) with respect to β_k gives the value of the optimal β_k:

\beta_k = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k .   (B.3)

The objective function of (B.1) evaluated at β_k is

\min_{\theta_k} \left\| Y\theta_k - X\beta_k \right\|_2^2 + \beta_k^\top \Omega_k \beta_k = \min_{\theta_k} \theta_k^\top Y^\top \left( I - X(X^\top X + \Omega_k)^{-1} X^\top \right) Y \theta_k

= \max_{\theta_k} \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k .   (B.4)

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem, where the k score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial, due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section, we show how to circumvent this issue by solving an easier eigenvector decomposition.

Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, such that we can rewrite expression (B.4) in a compact way:

\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \mathrm{tr}\left( \Theta^\top M \Theta \right)   (B.5)

\text{s.t. } \Theta^\top Y^\top Y \Theta = I_{K-1} .

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K − 1) × (K − 1) matrix M_Θ be Θ^⊤MΘ. Hence, the classical eigenvector formulation associated to (B.5) is

M_\Theta v = \lambda v ,   (B.6)

where v is an eigenvector and λ the associated eigenvalue of M_Θ. Developing this expression,

v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda .

Making the change of variable w = Θv, we obtain an alternative eigenproblem, where w is an eigenvector of M and λ the associated eigenvalue:

w^\top M w = \lambda .   (B.7)

Therefore, v are the eigenvectors of the eigen-decomposition of matrix M_Θ, and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K − 1) × (K − 1) matrix M_Θ and the K × K matrix M is the K × (K − 1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{-1}X^⊤YΘ in M_Θ:

M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B .

Thus, the eigen-decomposition of the (K − 1) × (K − 1) matrix M_Θ = Θ^⊤Y^⊤XB results in the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv needs to be undone.

To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB. Then, the definitive eigenvectors w are recovered by computing w = Θv. The final step is the reconstruction of the optimal score matrix Θ, using the vectors w as its columns. At this point, we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The B matrix also needs to be "updated", by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

B = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V .
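A small numerical sketch of this computation (names are ours) is given below: the eigen-decomposition is performed on the small (K − 1) × (K − 1) matrix M_Θ = Θ^⊤Y^⊤XB, and Θ and B are then "updated" by the matrix of eigenvectors V.

import numpy as np

def update_scores(X, Y, B, Theta):
    # M_Theta = Theta' Y' X B is symmetric when B = (X'X + Omega)^{-1} X' Y Theta.
    M_theta = Theta.T @ Y.T @ X @ B
    M_theta = (M_theta + M_theta.T) / 2          # symmetrize against round-off
    vals, V = np.linalg.eigh(M_theta)
    V = V[:, np.argsort(vals)[::-1]]             # sort by decreasing eigenvalue
    return Theta @ V, B @ V                      # updated score and coefficient matrices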


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of the M matrix (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m , \quad \text{s.t. } \theta_k^\top \theta_k = 1 .   (B.8)

The score vector constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis:

\left( \sum_{m=1}^{K-1} \alpha_m w_m \right)^\top \left( \sum_{m=1}^{K-1} \alpha_m w_m \right) = 1 ,

which, as per the eigenvector properties, can be reduced to

\sum_{m=1}^{K-1} \alpha_m^2 = 1 .   (B.9)

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

M \theta_k = M \left( \sum_{m=1}^{K-1} \alpha_m w_m \right) = \sum_{m=1}^{K-1} \alpha_m M w_m .

As the w_m are the eigenvectors of the M matrix, the relationship M w_m = λ_m w_m can be used to obtain

M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m .

Left-multiplying both sides by θ_k^⊤, written as its linear combination of eigenvectors, we obtain

\theta_k^\top M \theta_k = \left( \sum_{\ell=1}^{K-1} \alpha_\ell w_\ell \right)^\top \left( \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m \right) .

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤ w_m is zero for any ℓ ≠ m, giving

\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m .


The optimization problem (B.5) for discriminant direction k can be rewritten as

\max_{\theta_k \in \mathbb{R}^{K \times 1}} \left\{ \theta_k^\top M \theta_k \right\} = \max_{\theta_k \in \mathbb{R}^{K \times 1}} \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m ,   (B.10)

with \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m and \sum_{m=1}^{K-1} \alpha_m^2 = 1 .

One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = \sum_{m=1}^{K-1} \alpha_m w_m, the resulting score vector θ_k will be equal to the kth eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

\max_{\beta \in \mathbb{R}^p} \beta^\top \Sigma_B \beta   (C.1a)

\text{s.t. } \beta^\top \Sigma_W \beta = 1 ,   (C.1b)

where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu \left( \beta^\top \Sigma_W \beta - 1 \right) ,

so that its first derivative with respect to β is

\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta .

A necessary optimality condition for β is that this derivative is zero, that is,

\Sigma_B \beta = \nu \Sigma_W \beta .

Provided Σ_W is full rank, we have

\Sigma_W^{-1} \Sigma_B \beta = \nu \beta .   (C.2)

Thus, the solutions β match the definition of an eigenvector of the matrix Σ_W^{-1}Σ_B, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

\beta^\top \Sigma_B \beta = \beta^\top \Sigma_W \Sigma_W^{-1} \Sigma_B \beta

= \nu\, \beta^\top \Sigma_W \beta \quad \text{from (C.2)}

= \nu \quad \text{from (C.1b)} .

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence, ν is the largest eigenvalue of Σ_W^{-1}Σ_B, and β is any eigenvector corresponding to this maximal eigenvalue.
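A minimal numerical sketch of this result, solving the generalized eigenproblem with scipy (names are ours), could be:

import numpy as np
from scipy.linalg import eigh

def fisher_direction(Sigma_B, Sigma_W):
    # Solve Sigma_B beta = nu Sigma_W beta and return the direction associated with
    # the largest eigenvalue, normalized so that beta' Sigma_W beta = 1 (C.1b).
    vals, vecs = eigh(Sigma_B, Sigma_W)          # generalized symmetric eigenproblem
    beta = vecs[:, np.argmax(vals)]
    return beta / np.sqrt(beta @ Sigma_W @ beta)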


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau\in\mathbb{R}^p}\;\min_{B\in\mathbb{R}^{p\times K-1}} \; J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} \qquad\text{(D.1a)}$$
$$\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1 , \qquad\text{(D.1b)}$$
$$\qquad\;\; \tau_j \ge 0 ,\; j = 1,\dots,p . \qquad\text{(D.1c)}$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B\in\mathbb{R}^{p\times K-1}$ be a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$: $B = (\beta^{1\top},\dots,\beta^{p\top})^\top$.

$$L(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j . \qquad\text{(D.2)}$$

The starting point is the Lagrangian (D.2), which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:

$$\frac{\partial L(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0 \;\Rightarrow\; -\lambda w_j^2\frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0 .$$

The last two expressions are related through one property of the Lagrange multipliers, which states that $\nu_j g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ is the corresponding inequality constraint. Then the optimal $\tau_j^\star$ can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j\|\beta^j\|_2 .$$

Plugging this optimal value of $\tau_j^\star$ into constraint (D.1b):

$$\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2} . \qquad\text{(D.3)}$$


With this value of $\tau_j^\star$, Problem (D.1) is equivalent to

$$\min_{B\in\mathbb{R}^{p\times K-1}} \; J(B) + \lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^2 . \qquad\text{(D.4)}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null row vectors $\beta^j$.

The penalty term of (D.1a) can be conveniently presented as $\lambda B^\top\Omega B$, where

$$\Omega = \operatorname{diag}\Big(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\dots,\frac{w_p^2}{\tau_p}\Big) . \qquad\text{(D.5)}$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is

$$(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2} . \qquad\text{(D.6)}$$
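The closed-form quantities (D.3) and (D.6) are straightforward to compute for a given $B$. The following sketch (Python/NumPy, illustrative names, with a small guard against null rows that is not part of the derivation) shows one possible implementation.

```python
import numpy as np

def variational_weights(B, w, eps=1e-12):
    # Optimal tau (D.3) and diagonal of Omega (D.6) for the rows beta^j of B.
    row_norms = np.linalg.norm(B, axis=1)            # ||beta^j||_2
    s = np.sum(w * row_norms)                        # sum_j w_j ||beta^j||_2
    tau = w * row_norms / s                          # eq. (D.3)
    omega_diag = w * s / np.maximum(row_norms, eps)  # eq. (D.6), guarded for null rows
    return tau, omega_diag
```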

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $B\in\mathbb{R}^{p\times K-1}$, the subdifferential of the objective function of Problem (D.4) is the set of matrices $V\in\mathbb{R}^{p\times K-1}$ of the form

$$V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2\Big)\,G , \qquad\text{(D.7)}$$

where $G = (g^{1\top},\dots,g^{p\top})^\top$ is a $p\times(K-1)$ matrix defined as follows. Let $\mathcal{S}(B)$ denote the row support of $B$, $\mathcal{S}(B) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2\neq 0\}$; then we have

$$\forall j\in\mathcal{S}(B),\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j , \qquad\text{(D.8)}$$
$$\forall j\notin\mathcal{S}(B),\quad \|g^j\|_2 \le w_j . \qquad\text{(D.9)}$$


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B$ of the objective function verifying the following conditions are global minima. Let $\mathcal{S}(B)$ denote the row support of $B$, $\mathcal{S}(B) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2\neq 0\}$, and let $\bar{\mathcal{S}}(B)$ be its complement; then we have

$$\forall j\in\mathcal{S}(B),\quad -\frac{\partial J(B)}{\partial\beta^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2\Big)\, w_j\|\beta^j\|_2^{-1}\beta^j , \qquad\text{(D.10a)}$$
$$\forall j\in\bar{\mathcal{S}}(B),\quad \bigg\|\frac{\partial J(B)}{\partial\beta^j}\bigg\|_2 \le 2\lambda w_j\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2\Big) . \qquad\text{(D.10b)}$$

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $B$, the gap in these objectives is null at $\tau$ such that

$$\tau_j = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2} .$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau\in\mathbb{R}^p$ be any feasible vector; we have

$$\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^{\!2} = \Big(\sum_{j=1}^{p} \tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^{\!2} \le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p} w_j^2\frac{\|\beta^j\|_2^2}{\tau_j}\Big) \le \sum_{j=1}^{p} w_j^2\frac{\|\beta^j\|_2^2}{\tau_j} ,$$

where we used the Cauchy-Schwarz inequality in the second step and the definition of the feasibility set of $\tau$ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.
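A quick numerical sanity check of Lemma D.4 on random data is sketched below; the data, weights and tolerance are arbitrary and only serve to illustrate that the variational penalty dominates the squared group-Lasso penalty, with equality at the $\tau^\star$ given by (D.3).

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 3))                  # rows beta^j
w = rng.uniform(0.5, 2.0, size=6)            # positive weights
norms = np.linalg.norm(B, axis=1)

def variational_penalty(tau):
    return np.sum(w ** 2 * norms ** 2 / tau)

group_lasso_penalty = np.sum(w * norms) ** 2          # squared penalty of (D.4)
tau_star = w * norms / np.sum(w * norms)              # eq. (D.3)
tau_rand = rng.dirichlet(np.ones(6))                  # any feasible tau on the simplex

assert variational_penalty(tau_rand) >= group_lasso_penalty - 1e-9
assert np.isclose(variational_penalty(tau_star), group_lasso_penalty)
```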


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B_0$ are optimal for the score values $\Theta_0$, and if the optimal scores $\Theta$ are obtained by a unitary transformation of $\Theta_0$, say $\Theta = \Theta_0 V$ (where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix), then $B = B_0 V$ is optimal conditionally on $\Theta$; that is, $(\Theta, B)$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and show the following proposition.

Proposition E.1. Let $\hat{B}$ be a solution of

$$\min_{B\in\mathbb{R}^{p\times M}} \; \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 , \qquad\text{(E.1)}$$

and let $\tilde{Y} = YV$, where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde{B} = \hat{B}V$ is a solution of

$$\min_{B\in\mathbb{R}^{p\times M}} \; \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 . \qquad\text{(E.2)}$$

Proof. The first-order necessary optimality conditions for $\hat{B}$ are

$$\forall j\in\mathcal{S}(\hat{B}),\quad 2\,x^{j\top}\big(x^{j}\hat{\beta}^{j} - Y\big) + \lambda w_j\|\hat{\beta}^{j}\|_2^{-1}\hat{\beta}^{j} = 0 , \qquad\text{(E.3a)}$$
$$\forall j\in\bar{\mathcal{S}}(\hat{B}),\quad 2\,\big\|x^{j\top}\big(x^{j}\hat{\beta}^{j} - Y\big)\big\|_2 \le \lambda w_j , \qquad\text{(E.3b)}$$

where $\mathcal{S}(\hat{B})\subseteq\{1,\dots,p\}$ denotes the set of non-zero row vectors of $\hat{B}$ and $\bar{\mathcal{S}}(\hat{B})$ is its complement.

First, we note that, from the definition of $\tilde{B}$, we have $\mathcal{S}(\tilde{B}) = \mathcal{S}(\hat{B})$. Then we may rewrite the above conditions as follows:

$$\forall j\in\mathcal{S}(\tilde{B}),\quad 2\,x^{j\top}\big(x^{j}\tilde{\beta}^{j} - \tilde{Y}\big) + \lambda w_j\|\tilde{\beta}^{j}\|_2^{-1}\tilde{\beta}^{j} = 0 , \qquad\text{(E.4a)}$$
$$\forall j\in\bar{\mathcal{S}}(\tilde{B}),\quad 2\,\big\|x^{j\top}\big(x^{j}\tilde{\beta}^{j} - \tilde{Y}\big)\big\|_2 \le \lambda w_j , \qquad\text{(E.4b)}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses that $VV^\top = I$, so that $\forall u\in\mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$.


Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
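The invariance exploited in this appendix can also be checked numerically; the sketch below, with arbitrary random data and a random orthogonal $V$, verifies that both the data-fitting term and the group-Lasso penalty of the objective are unchanged when $Y$ and $B$ are multiplied by $V$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M, lam = 30, 8, 3, 0.5
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))
w = np.ones(p)
V, _ = np.linalg.qr(rng.normal(size=(M, M)))   # random orthogonal (unitary) matrix

def objective(Yc, Bc):
    fit = np.linalg.norm(Yc - X @ Bc, "fro") ** 2
    penalty = lam * np.sum(w * np.linalg.norm(Bc, axis=1))
    return fit + penalty

# both terms are invariant under Y -> YV, B -> BV
assert np.isclose(objective(Y, B), objective(Y @ V, B @ V))
```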


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, with the maximization of the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:

$$L(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Big) \qquad\text{(F.1)}$$
$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) \qquad\text{(F.2)}$$
$$\text{with}\quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_{\ell}\pi'_\ell f_\ell(x_i;\theta'_\ell)} . \qquad\text{(F.3)}$$

In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probability values computed from $\theta'$ at the previous E-step, and $\theta$, without "prime", denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\theta,\theta')$.

Using (F.3), we have

$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) \\
&= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_{\ell}\pi_\ell f_\ell(x_i;\theta_\ell)\Big) \\
&= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta) .
\end{aligned}$$

In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:

$$L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) = Q(\theta,\theta) + H(T) .$$
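This identity is easy to verify numerically; the following sketch (an arbitrary two-component Gaussian mixture with illustrative parameter values, not taken from the thesis) computes $L(\theta)$ directly from (F.1) and compares it with $Q(\theta,\theta) + H(T)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
pis = np.array([0.3, 0.7])                    # mixture proportions
mus = np.array([[0.0, 0.0], [1.0, -1.0]])     # cluster means
Sigma = np.eye(2)                             # common covariance matrix

F = np.column_stack([multivariate_normal.pdf(X, mean=m, cov=Sigma) for m in mus])
T = pis * F                                   # unnormalized posteriors pi_k f_k(x_i)
T /= T.sum(axis=1, keepdims=True)             # posterior probabilities t_ik, eq. (F.3)

loglik = np.sum(np.log(F @ pis))              # direct evaluation of (F.1)
Q = np.sum(T * np.log(pis * F))               # (F.2) evaluated at theta' = theta
H = -np.sum(T * np.log(T))                    # entropy of the posteriors
assert np.isclose(loglik, Q + H)
```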


G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) \\
&= \sum_{k}\Big(\log\pi_k\sum_{i} t_{ik}\Big) - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k) ,
\end{aligned}$$

which has to be maximized subject to

$$\sum_{k}\pi_k = 1 .$$

The Lagrangian of this problem is

$$L(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_{k}\pi_k - 1\Big) .$$

Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.

G.1 Prior probabilities

$$\frac{\partial L(\theta)}{\partial\pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_{i} t_{ik} + \lambda = 0 ,$$

where $\lambda$ is identified from the constraint, leading to

$$\pi_k = \frac{1}{n}\sum_{i} t_{ik} .$$


G.2 Means

$$\frac{\partial L(\theta)}{\partial\mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2}\sum_{i} t_{ik}\,2\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_{i} t_{ik}\,x_i}{\sum_{i} t_{ik}} .$$

G.3 Covariance Matrix

$$\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\Sigma}_{\text{as per property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top .$$
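Gathered together, these updates form the M-step. A compact sketch is given below, where $X$ is the $n\times p$ data matrix, $T$ the $n\times K$ matrix of posterior probabilities, and the function name is illustrative rather than the Mix-GLOSS implementation.

```python
import numpy as np

def m_step(X, T):
    # X is n x p, T is the n x K matrix of posterior probabilities t_ik.
    n, p = X.shape
    nk = T.sum(axis=0)                         # sum_i t_ik for each cluster
    pi = nk / n                                # prior probabilities (G.1)
    mu = (T.T @ X) / nk[:, None]               # cluster means (G.2)
    Sigma = np.zeros((p, p))                   # common (pooled) covariance (G.3)
    for k in range(T.shape[1]):
        D = X - mu[k]
        Sigma += (T[:, k, None] * D).T @ D
    Sigma /= n
    return pi, mu, Sigma
```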


          Bibliography

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Bienarcki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, ArXiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.


T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.


J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, ArXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.


T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.


C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Buhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.


Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York, Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.


M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Scholkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.


M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.



            Contents

            List of figures v

            List of tables vii

            Notation and Symbols ix

            I Context and Foundations 1

            1 Context 5

            2 Regularization for Feature Selection 921 Motivations 9

            22 Categorization of Feature Selection Techniques 11

            23 Regularization 13

            231 Important Properties 14

            232 Pure Penalties 14

            233 Hybrid Penalties 18

            234 Mixed Penalties 19

            235 Sparsity Considerations 19

            236 Optimization Tools for Regularized Problems 21

            II Sparse Linear Discriminant Analysis 25

            Abstract 27

            3 Feature Selection in Fisher Discriminant Analysis 2931 Fisher Discriminant Analysis 29

            32 Feature Selection in LDA Problems 30

            321 Inertia Based 30

            322 Regression Based 32

            4 Formalizing the Objective 3541 From Optimal Scoring to Linear Discriminant Analysis 35

            411 Penalized Optimal Scoring Problem 36

            412 Penalized Canonical Correlation Analysis 37

            i

            Contents

            413 Penalized Linear Discriminant Analysis 39

            414 Summary 40

            42 Practicalities 41

            421 Solution of the Penalized Optimal Scoring Regression 41

            422 Distance Evaluation 42

            423 Posterior Probability Evaluation 43

            424 Graphical Representation 43

            43 From Sparse Optimal Scoring to Sparse LDA 43

            431 A Quadratic Variational Form 44

            432 Group-Lasso OS as Penalized LDA 47

            5 GLOSS Algorithm 4951 Regression Coefficients Updates 49

            511 Cholesky decomposition 52

            512 Numerical Stability 52

            52 Score Matrix 52

            53 Optimality Conditions 53

            54 Active and Inactive Sets 54

            55 Penalty Parameter 54

            56 Options and Variants 55

            561 Scaling Variables 55

            562 Sparse Variant 55

            563 Diagonal Variant 55

            564 Elastic net and Structured Variant 55

            6 Experimental Results 5761 Normalization 57

            62 Decision Thresholds 57

            63 Simulated Data 58

            64 Gene Expression Data 60

            65 Correlated Data 63

            Discussion 63

            III Sparse Clustering Analysis 67

            Abstract 69

            7 Feature Selection in Mixture Models 7171 Mixture Models 71

            711 Model 71

            712 Parameter Estimation The EM Algorithm 72

            ii

            Contents

            72 Feature Selection in Model-Based Clustering 75721 Based on Penalized Likelihood 76722 Based on Model Variants 77723 Based on Model Selection 79

            8 Theoretical Foundations 8181 Resolving EM with Optimal Scoring 81

            811 Relationship Between the M-Step and Linear Discriminant Analysis 81812 Relationship Between Optimal Scoring and Linear Discriminant

            Analysis 82813 Clustering Using Penalized Optimal Scoring 82814 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83

            82 Optimized Criterion 83821 A Bayesian Derivation 84822 Maximum a Posteriori Estimator 85

            9 Mix-GLOSS Algorithm 8791 Mix-GLOSS 87

            911 Outer Loop Whole Algorithm Repetitions 87912 Penalty Parameter Loop 88913 Inner Loop EM Algorithm 89

            92 Model Selection 91

            10Experimental Results 93101 Tested Clustering Algorithms 93102 Results 95103 Discussion 97

            Conclusions 97

            Appendix 103

            A Matrix Properties 105

            B The Penalized-OS Problem is an Eigenvector Problem 107B1 How to Solve the Eigenvector Decomposition 107B2 Why the OS Problem is Solved as an Eigenvector Problem 109

            C Solving Fisherrsquos Discriminant Problem 111

            D Alternative Variational Formulation for the Group-Lasso 113D1 Useful Properties 114D2 An Upper Bound on the Objective Function 115

            iii

            Contents

            E Invariance of the Group-Lasso to Unitary Transformations 117

            F Expected Complete Likelihood and Likelihood 119

            G Derivation of the M-Step Equations 121G1 Prior probabilities 121G2 Means 122G3 Covariance Matrix 122

            Bibliography 123

            iv

            List of Figures

            11 MASH project logo 5

            21 Example of relevant features 1022 Four key steps of feature selection 1123 Admissible sets in two dimensions for different pure norms ||β||p 1424 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 1525 Admissible sets for the Lasso and Group-Lasso 2026 Sparsity patterns for an example with 8 variables characterized by 4 pa-

            rameters 20

            41 Graphical representation of the variational approach to Group-Lasso 45

            51 GLOSS block diagram 5052 Graph and Laplacian matrix for a 3times 3 image 56

            61 TPR versus FPR for all simulations 6062 2D-representations of Nakayama and Sun datasets based on the two first

            discriminant vectors provided by GLOSS and SLDA 6263 USPS digits ldquo1rdquo and ldquo0rdquo 6364 Discriminant direction between digits ldquo1rdquo and ldquo0rdquo 6465 Sparse discriminant direction between digits ldquo1rdquo and ldquo0rdquo 64

            91 Mix-GLOSS Loops Scheme 8892 Mix-GLOSS model selection diagram 92

            101 Class mean vectors for each artificial simulation 94102 TPR versus FPR for all simulations 97

            v

            List of Tables

            61 Experimental results for simulated data supervised classification 5962 Average TPR and FPR for all simulations 6063 Experimental results for gene expression data supervised classification 61

            101 Experimental results for simulated data unsupervised clustering 96102 Average TPR versus FPR for all clustering simulations 96

            vii

            Notation and Symbols

            Throughout this thesis vectors are denoted by lowercase letters in bold font andmatrices by uppercase letters in bold font Unless otherwise stated vectors are columnvectors and parentheses are used to build line vectors from comma-separated lists ofscalars or to build matrices from comma-separated lists of column vectors

            Sets

            N the set of natural numbers N = 1 2 R the set of reals|A| cardinality of a set A (for finite sets the number of elements)A complement of set A

            Data

            X input domainxi input sample xi isin XX design matrix X = (xgt1 x

            gtn )gt

            xj column j of Xyi class indicator of sample i

            Y indicator matrix Y = (ygt1 ygtn )gt

            z complete data z = (xy)Gk set of the indices of observations belonging to class kn number of examplesK number of classesp dimension of Xi j k indices running over N

            Vectors Matrices and Norms

            0 vector with all entries equal to zero1 vector with all entries equal to oneI identity matrixAgt transposed of matrix A (ditto for vector)Aminus1 inverse of matrix Atr(A) trace of matrix A|A| determinant of matrix Adiag(v) diagonal matrix with v on the diagonalv1 L1 norm of vector vv2 L2 norm of vector vAF Frobenius norm of matrix A

            ix

            Notation and Symbols

            Probability

            E [middot] expectation of a random variablevar [middot] variance of a random variableN (micro σ2) normal distribution with mean micro and variance σ2

            W(W ν) Wishart distribution with ν degrees of freedom and W scalematrix

            H (X) entropy of random variable XI (XY ) mutual information between random variables X and Y

            Mixture Models

            yik hard membership of sample i to cluster kfk distribution function for cluster ktik posterior probability of sample i to belong to cluster kT posterior probability matrixπk prior probability or mixture proportion for cluster kmicrok mean vector of cluster kΣk covariance matrix of cluster kθk parameter vector for cluster k θk = (microkΣk)

            θ(t) parameter vector at iteration t of the EM algorithmf(Xθ) likelihood functionL(θ X) log-likelihood functionLC(θ XY) complete log-likelihood function

            Optimization

            J(middot) cost functionL(middot) Lagrangianβ generic notation for the solution wrt β

            βls least squares solution coefficient vectorA active setγ step size to update regularization pathh direction to update regularization path

            x

            Notation and Symbols

            Penalized models

            λ λ1 λ2 penalty parametersPλ(θ) penalty term over a generic parameter vectorβkj coefficient j of discriminant vector kβk kth discriminant vector βk = (βk1 βkp)B matrix of discriminant vectors B = (β1 βKminus1)

            βj jth row of B = (β1gt βpgt)gt

            BLDA coefficient matrix in the LDA domainBCCA coefficient matrix in the CCA domainBOS coefficient matrix in the OS domainXLDA data matrix in the LDA domainXCCA data matrix in the CCA domainXOS data matrix in the OS domainθk score vector kΘ score matrix Θ = (θ1 θKminus1)Y label matrixΩ penalty matrixLCP (θXZ) penalized complete log-likelihood functionΣB between-class covariance matrixΣW within-class covariance matrixΣT total covariance matrix

            ΣB sample between-class covariance matrix

            ΣW sample within-class covariance matrix

            ΣT sample total covariance matrixΛ inverse of covariance matrix or precision matrixwj weightsτj penalty components of the variational approach

            xi

            Part I

            Context and Foundations

            1

            This thesis is divided in three parts In Part I I am introducing the context in whichthis work has been developed the project that funded it and the constraints that we hadto obey Generic are also detailed here to introduce the models and some basic conceptsthat will be used along this document The state of the art of is also reviewed

            The first contribution of this thesis is explained in Part II where I present the super-vised learning algorithm GLOSS and its supporting theory as well as some experimentsto test its performance compared to other state of the art mechanisms Before describingthe algorithm and the experiments its theoretical foundations are provided

            The second contribution is described in Part III with an analogue structure to Part IIbut for the unsupervised domain The clustering algorithm Mix-GLOSS adapts the su-pervised technique from Part II by means of a modified EM This section is also furnishedwith specific theoretical foundations an experimental section and a final discussion

            3

            1 Context

            The MASH project is a research initiative to investigate the open and collaborativedesign of feature extractors for the Machine Learning scientific community The project isstructured around a web platform (httpmash-projecteu) comprising collaborativetools such as wiki-documentation forums coding templates and an experiment centerempowered with non-stop calculation servers The applications targeted by MASH arevision and goal-planning problems either in a 3D virtual environment or with a realrobotic arm

            The MASH consortium is led by the IDIAP Research Institute in Switzerland Theother members are the University of Potsdam in Germany the Czech Technical Uni-versity of Prague the National Institute for Research in Computer Science and Control(INRIA) in France and the National Centre for Scientific Research (CNRS) also in Francethrough the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC)attached to the the University of Technology of Compiegne

            From the point of view of the research the members of the consortium must deal withfour main goals

            1 Software development of website framework and APIrsquos

            2 Classification and goal-planning in high dimensional feature spaces

            3 Interfacing the platform with the 3D virtual environment and the robot arm

            4 Building tools to assist contributors with the development of the feature extractorsand the configuration of the experiments

            S HM A

            Figure 11 MASH project logo

            5

            1 Context

            The work detailed in this text has been done in the context of goal 4 From the verybeginning of the project our role is to provide the users with some feedback regardingthe feature extractors At the moment of writing this thesis the number of publicfeature extractors reaches 75 In addition to the public ones there are also privateextractors that contributors decide not to share with the rest of the community Thelast number I was aware of was about 300 Within those 375 extractors there must besome of them sharing the same theoretical principles or supplying similar features Theframework of the project tests every new piece of code with some datasets of reference inorder to provide a ranking depending on the quality of the estimation However similarperformance of two extractors for a particular dataset does not mean that both are usingthe same variables

            Our engagement was to provide some textual or graphical tools to discover whichextractors compute features similar to other ones Our hypothesis is that many of themuse the same theoretical foundations that should induce a grouping of similar extractorsIf we succeed discovering those groups we would also be able to select representativesThis information can be used in several ways For example from the perspective of a userthat develops feature extractors it would be interesting comparing the performance of hiscode against the K representatives instead to the whole database As another exampleimagine a user wants to obtain the best prediction results for a particular datasetInstead of selecting all the feature extractors creating an extremely high dimensionalspace he could select only the K representatives foreseeing similar results with a fasterexperiment

            As there is no prior knowledge about the latent structure we make use of unsupervisedtechniques Below there is a brief description of the different tools that we developedfor the web platform

            bull Clustering Using Mixture Models This is a well-known technique that mod-els the data as if it was randomly generated from a distribution function Thisdistribution is typically a mixture of Gaussian with unknown mixture proportionsmeans and covariance matrices The number of Gaussian components matchesthe number of expected groups The parameters of the model are computed usingthe EM algorithm and the clusters are built by maximum a posteriori estimationFor the calculation we use mixmod that is a c++ library that can be interfacedwith matlab This library allows working with high dimensional data Furtherinformation regarding mixmod is given by Bienarcki et al (2008) All details con-cerning the tool implemented are given in deliverable ldquomash-deliverable-D71-m12rdquo(Govaert et al 2010)

            bull Sparse Clustering Using Penalized Optimal Scoring This technique in-tends again to perform clustering by modelling the data as a mixture of Gaussiandistributions However instead of using a classic EM algorithm for estimatingthe componentsrsquo parameters the M-step is replaced by a penalized Optimal Scor-ing problem This replacement induces sparsity improving the robustness and theinterpretability of the results Its theory will be explained later in this thesis

            6

            All details concerning the tool implemented can be found in deliverable ldquomash-deliverable-D72-m24rdquo (Govaert et al 2011)

            bull Table Clustering Using The RV Coefficient This technique applies clus-tering methods directly to the tables computed by the feature extractors insteadcreating a single matrix A distance in the extractors space is defined using theRV coefficient that is a multivariate generalization of the Pearsonrsquos correlation co-efficient on the form of an inner product The distance is defined for every pair iand j as RV(OiOj) where Oi and Oj are operators computed from the tables re-turned by feature extractors i and j Once that we have a distance matrix severalstandard techniques may be used to group extractors A detailed description ofthis technique can be found in deliverables ldquomash-deliverable-D71-m12rdquo (Govaertet al 2010) and ldquomash-deliverable-D72-m24rdquo (Govaert et al 2011)

            I am not extending this section with further explanations about the MASH project ordeeper details about the theory that we used to commit our engagements I will simplyrefer to the public deliverables of the project where everything is carefully detailed(Govaert et al 2010 2011)

            7

            2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, so did the numerical issues. Redundant or extremely correlated features may appear if two contributors implement the same extractor under different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of machine learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems, the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and eases interpretation in the unsupervised framework. Removing features must be done wisely to avoid discarding critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes up when there is a restriction on the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008).

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant, and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly, a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005).

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques used to preprocess data in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I propose a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation; this involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Using the selected features to predict the labels.

– Cluster Goodness - Using the selected features to perform clustering and evaluating the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow the evaluation of subsets of features, and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This is the case, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

$$\min_{\beta} \; J(\beta) + \lambda P(\beta) \tag{2.1}$$

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \leq t \tag{2.2}$$

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, which is to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted, in the Bayesian paradigm, as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖p.

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2) \tag{2.3}$$

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties.

Regularizing a linear model with a norm like ‖β‖p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit where |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves, whose global minimum βls is outside the penalties' admissible regions. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse, because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖0 = card{βj | βj ≠ 0}:

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \leq t \tag{2.4}$$

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. The solutions are sparse but unstable.
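The combinatorial nature of this constraint can be made explicit with the toy best-subset search sketched below, for a least squares cost; it enumerates all subsets of at most t variables, which is only feasible for a very small p (the data and the helper name best_subset are assumptions of this example).

import itertools
import numpy as np

def best_subset(X, y, t):
    # Exhaustive search of the least squares fit using at most t variables.
    n, p = X.shape
    best = (np.inf, (), None)
    for size in range(1, t + 1):
        for subset in itertools.combinations(range(p), size):
            coef, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
            rss = np.sum((y - X[:, subset] @ coef) ** 2)
            if rss < best[0]:
                best = (rss, subset, coef)
    return best[1], best[2]

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.standard_normal(30)
print(best_subset(X, y, t=2))  # expected to recover variables 0 and 3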

L1 Penalties. The penalties built using the L1 norm induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \leq t \tag{2.5}$$

Despite all the advantages of the Lasso, the choice of the right penalty is not only a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
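This limitation is easy to observe numerically. The sketch below is a small illustration based on scikit-learn's Lasso (whose penalty parameterization differs from (2.5) by a scaling factor): with p > n, at most n variables can enter the model.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 20, 100                       # fewer samples than features
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:30] = 1.0                 # 30 relevant variables
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.2, max_iter=10000).fit(X, y)
print(np.count_nonzero(lasso.coef_), "non-zero coefficients for n =", n)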

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and to solve a linear system. Thus, an L2 penalized optimization problem looks like

$$\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2 \tag{2.6}$$

The effect of this penalty is the "equalization" of the components of the parameter being penalized. To enlighten this property, let us consider a least squares problem

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \tag{2.7}$$

with solution $\beta_{ls} = (X^\top X)^{-1} X^\top y$. If some input variables are highly correlated, the estimator $\beta_{ls}$ is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 .$$

The solution to this problem is $\beta_{\ell_2} = (X^\top X + \lambda I_p)^{-1} X^\top y$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
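A minimal numerical check of this effect is sketched below: the ridge solution is computed in closed form, and the uniform upward shift of the eigenvalues of X⊤X is verified directly (the data are an assumption of the example).

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 1.0
X = rng.standard_normal((n, p))
X[:, 4] = X[:, 3] + 1e-3 * rng.standard_normal(n)   # two highly correlated columns
y = X @ np.array([1.0, 0.0, 0.0, 2.0, -2.0]) + 0.1 * rng.standard_normal(n)

# Ridge estimator: (X'X + lambda I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Every eigenvalue of X'X is moved upwards by lambda
eig_plain = np.linalg.eigvalsh(X.T @ X)
eig_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(p))
print(np.allclose(eig_ridge, eig_plain + lam))   # True
print(np.round(beta_ridge, 3))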

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \tag{2.8}$$

The effect is an elliptic admissible set, instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component: every λj is optimized to penalize more or less, depending on the influence of βj in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖∞ = max(|x1|, |x2|, ..., |xp|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region fits a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term by itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems, there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as

$$\|\beta\|^{*} = \max_{w \in \mathbb{R}^p} \; \beta^\top w \quad \text{s.t.} \quad \|w\| \leq 1 .$$

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why the L∞ norm is so important, even if it is not as popular as a penalty itself, because the L1 norm is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \tag{2.9}$$

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
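As a hedged illustration (scikit-learn parameterizes the two terms through alpha and l1_ratio rather than λ1 and λ2), the sketch below compares the Lasso and the Elastic net in a p > n setting where a group of correlated variables is relevant; the Elastic net typically retains more of them.

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n, p = 30, 120
Z = rng.standard_normal((n, 1))
X = np.hstack([Z + 0.05 * rng.standard_normal((n, 40)),   # 40 correlated relevant variables
               rng.standard_normal((n, p - 40))])
y = X[:, :40].sum(axis=1) + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X, y)
print("Lasso keeps", np.count_nonzero(lasso.coef_), "variables")
print("Elastic net keeps", np.count_nonzero(enet.coef_), "variables")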


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us identify as $G_\ell$ the group of genes for the $\ell$th process and $d_\ell$ the number of genes (variables) in each group, $\forall \ell \in \{1, \ldots, L\}$. Thus, the dimension of vector β is the sum of the number of genes of every group, $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \right)^{\frac{1}{r}} \tag{2.10}$$

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group $G_\ell$, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
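A direct transcription of (2.10) is sketched below; the representation of groups as a list of index lists is an assumption of this example.

import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    # ||beta||_(r,s): an L_s norm within each group, an L_r norm between groups
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.0, -2.0, 3.0, 0.0])
groups = [[0, 1], [2, 3], [4, 5]]
print(mixed_norm(beta, groups, r=1, s=2))   # group-Lasso norm ||beta||_(1,2)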

Several combinations are available; the most popular is the norm ‖β‖(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖(1,4/3) (Szafranski et al., 2008) or ‖β‖(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for the non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across all parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms, with a proper definition of groups, can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso, (a) L1, and the group-Lasso, (b) L(1,2).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity; (b) L(1,2) group-induced sparsity.
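The sketch below illustrates this pattern on an assumed coefficient matrix B, with one row per variable and one column per parameter: the variables whose whole row is null are the ones that can be removed.

import numpy as np

# Assumed example: 8 variables (rows) x 4 parameters (columns), as in Figure 2.6
B = np.array([[ 1.2, 0.0, -0.3,  0.7],
              [ 0.5, 0.4,  0.0, -0.1],
              [ 0.0, 0.0,  0.0,  0.0],   # variable 3: removable
              [-0.8, 0.2,  0.1,  0.0],
              [ 0.0, 0.0,  0.0,  0.0],   # variable 5: removable
              [ 0.3, 0.0,  0.9,  0.2],
              [ 0.1, 0.6,  0.0,  0.4],
              [ 0.0, 0.0,  0.0,  0.0]])  # variable 8: removable

row_norms = np.sqrt((B ** 2).sum(axis=1))    # group norm of each variable
print("removable variables:", np.where(row_norms == 0)[0] + 1)   # [3 5 8]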

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used in the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):

$$\beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)}) .$$
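A minimal sketch of this update for the Lasso case (J being the residual sum of squares and P the L1 norm) is given below; the constant step size alpha is an assumption made for simplicity, whereas decreasing step sizes are usually required to guarantee convergence.

import numpy as np

def subgradient_lasso(X, y, lam, alpha=1e-3, n_iter=5000):
    # Plain subgradient descent on sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = -2 * X.T @ (y - X @ beta)     # gradient of the quadratic loss
        s_prime = np.sign(beta)           # a subgradient of the L1 norm
        beta = beta - alpha * (s + lam * s_prime)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(100)
print(np.round(subgradient_lasso(X, y, lam=5.0), 2))

Note that the returned coefficients are small but not exactly zero, which illustrates that the solutions produced this way are not sparse.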

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient βj gives

$$\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} .$$

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution βls and updating the values using an iterative thresholding algorithm where $\beta_j^{(t+1)} = S_{\lambda}\big(\frac{\partial J(\beta^{(t)})}{\partial \beta_j}\big)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_{\lambda}\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) = \begin{cases} \dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[1.5ex] \dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[1.5ex] 0 & \text{if } \Big|\dfrac{\partial J(\beta)}{\partial \beta_j}\Big| \leq \lambda \end{cases} \tag{2.11}$$

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
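A minimal sketch of this cyclic soft-thresholding update, applied to the quadratic loss of (2.11), is given below; it is an illustration, not the solver used in this thesis.

import numpy as np

def soft_threshold(a, thr):
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def cd_lasso(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1
    n, p = X.shape
    beta = np.zeros(p)
    z = (X ** 2).sum(axis=0)                         # sum_i x_ij^2 for each j
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_j
            beta[j] = soft_threshold(2 * (X[:, j] @ r_j), lam) / (2 * z[j])
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(100)
print(np.round(cd_lasso(X, y, lam=10.0), 2))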

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero βj; it is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā; in the inactive set we find the indices of the variables whose βj is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set, and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L(1,2) penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
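A schematic working-set loop for the Lasso case is sketched below; it only illustrates the three tasks (optimization restricted to the active set, optimality check, and active set update), and the helper names are assumptions of this example.

import numpy as np

def soft_threshold(a, thr):
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def lasso_on_active(X, y, lam, active, n_iter=200):
    # Optimization task: coordinate descent restricted to the active variables
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in active:
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(2 * (X[:, j] @ r_j), lam) / (2 * (X[:, j] @ X[:, j]))
    return beta

def working_set_lasso(X, y, lam, tol=1e-6):
    # Forward working set: grow the active set with the worst violator
    p = X.shape[1]
    active, beta = [], np.zeros(p)
    while True:
        grad = -2 * X.T @ (y - X @ beta)         # used in the optimality conditions
        violation = np.abs(grad) - lam           # violation of |grad_j| <= lam
        violation[active] = -np.inf              # active variables are already optimized
        j = int(np.argmax(violation))
        if violation[j] <= tol:
            return beta, active
        active.append(j)                         # working set update task
        beta = lasso_on_active(X, y, lam, active)

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 20))
y = 2 * X[:, 3] - X[:, 7] + 0.1 * rng.standard_normal(80)
beta, active = working_set_lasso(X, y, lam=20.0)
print(sorted(active), np.round(beta[sorted(active)], 2))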

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A(t) and its corresponding solution β(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β(t+1) = β(t) + γh. Afterwards, the active and inactive sets A(t+1) and Ā(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size and the variable that should enter the active set from the correlation with the residuals.
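For the Lasso, the full path is readily available in standard software. The sketch below uses scikit-learn's lasso_path (a coordinate-descent path rather than LARS, and with a penalty scaled by the sample size), simply to show how variables enter the model as the penalty decreases.

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3 * X[:, 0] - 2 * X[:, 4] + 0.1 * rng.standard_normal(100)

alphas, coefs, _ = lasso_path(X, y, n_alphas=10)   # coefs: (n_features, n_alphas)
for alpha, beta in zip(alphas, coefs.T):
    print(f"penalty={alpha:.3f}  active variables={np.flatnonzero(beta).tolist()}")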

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β). They are also iterative methods, where the cost function J(β) is linearized in the proximity of the current solution β(t), so that the problem to solve at each iteration looks like

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 , \tag{2.12}$$

where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) . \tag{2.13}$$

The basic algorithm makes use of the solution to (2.13) as the next value of β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
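A minimal sketch of the resulting proximal-gradient (ISTA-like) iteration for the L1 penalty is given below: the solution of (2.13) is the soft-thresholding of the gradient step, and L is set from the largest eigenvalue of 2 X⊤X for the quadratic loss.

import numpy as np

def soft_threshold(a, thr):
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def ista(X, y, lam, n_iter=500):
    # Proximal gradient for sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1
    beta = np.zeros(X.shape[1])
    L = 2 * np.linalg.eigvalsh(X.T @ X)[-1]           # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ beta)
        beta = soft_threshold(beta - grad / L, lam / L)   # solution of (2.13)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 15))
y = X[:, 0] - 2 * X[:, 5] + 0.1 * rng.standard_normal(100)
print(np.round(ista(X, y, lam=10.0), 2))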


            Part II

            Sparse Linear Discriminant Analysis


            Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables, and is thus particularly well suited to the analysis of gene expression data.


            3 Feature Selection in Fisher DiscriminantAnalysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations $x_i \in \mathbb{R}^p$ comprising p features, and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the K classes. It will be convenient to gather the observations in the $n \times p$ matrix $X = (x_1^\top, \ldots, x_n^\top)^\top$ and the corresponding labels in the $n \times K$ matrix $Y = (y_1^\top, \ldots, y_n^\top)^\top$.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \tag{3.1}$$

where β is the discriminant direction used to project the data, and ΣB and ΣW are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,$$

$$\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,$$

where μ is the sample mean of the whole dataset, μk is the sample mean of class k, and Gk indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors βk may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\left( B^\top \Sigma_B B \right)}{\operatorname{tr}\left( B^\top \Sigma_W B \right)} \tag{3.2}$$

where the matrix B is built with the discriminant directions βk as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B \beta_k \\ \text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \leq 1 \\ & \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k . \end{aligned} \tag{3.3}$$

The maximizer of subproblem k is the eigenvector of $\Sigma_W^{-1} \Sigma_B$ associated with the kth largest eigenvalue (see Appendix C).
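A compact numerical sketch of this computation on synthetic data is given below; it solves the generalized eigenproblem ΣB β = λ ΣW β with SciPy, and the small ridge added to ΣW (to keep it invertible) is an assumption of the example.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
K, p, n_k = 3, 5, 40
centers = 3 * rng.standard_normal((K, p))
X = np.vstack([rng.standard_normal((n_k, p)) + centers[k] for k in range(K)])
y = np.repeat(np.arange(K), n_k)

mu = X.mean(axis=0)
Sigma_W = np.zeros((p, p))
Sigma_B = np.zeros((p, p))
for k in range(K):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    Sigma_W += (Xk - mu_k).T @ (Xk - mu_k)
    Sigma_B += len(Xk) * np.outer(mu - mu_k, mu - mu_k)
Sigma_W /= len(X)
Sigma_B /= len(X)

# Discriminant directions: leading eigenvectors of the pencil (Sigma_B, Sigma_W)
eigvals, eigvecs = eigh(Sigma_B, Sigma_W + 1e-8 * np.eye(p))
B = eigvecs[:, ::-1][:, :K - 1]              # the K-1 leading directions
print(np.round(eigvals[::-1][:K - 1], 3))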

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly targets the reduction of the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, which is either Fisher's discriminant analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_W \beta \\ \text{s.t.} \;\; & (\mu_1 - \mu_2)^\top \beta = 1 \\ & \sum_{j=1}^{p} |\beta_j| \leq t , \end{aligned}$$

where μ1 and μ2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\ \text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \leq 1 . \end{aligned}$$

The term to maximize is the projected between-class covariance matrix $\beta_k^\top \Sigma_B^k \beta_k$, subject to an upper bound on the projected within-class covariance matrix $\beta_k^\top \Sigma_W \beta_k$. The penalty Pk(βk) is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant βk vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of $\Sigma_W$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_W^{-1}(\mu_1 - \mu_2)$, they estimate the product directly, through a constrained L1 minimization:

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;\; & \|\beta\|_1 \\ \text{s.t.} \;\; & \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \leq \lambda . \end{aligned}$$

Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

            Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix with the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (yik = 1 if sample i belongs to class k, and yik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al., 2009). Another "popular" choice is yik = 1 if sample i belongs to class k and yik = −1/(K − 1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).

There are some efforts proposing a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, which is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

$$\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,$$

where yi is the binary indicator of the label of pattern xi. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β0 is required.

            Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

$$\min_{\Theta,\, B} \; \| Y\Theta - XB \|_F^2 + \lambda \operatorname{tr}\left( B^\top \Omega B \right) \tag{3.4a}$$
$$\text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{3.4b}$$

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

$$\min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \; \| Y\theta_k - X\beta_k \|^2 + \beta_k^\top \Omega \beta_k \tag{3.5a}$$
$$\text{s.t.} \quad n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 , \tag{3.5b}$$
$$\theta_k^\top Y^\top Y \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , \tag{3.5c}$$

where each βk corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_k \| Y\theta_k - X\beta_k \|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,$$

where λ1 and λ2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \| Y\theta_k - X\beta_k \|_2^2 + \lambda \left( \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K-1} \beta_{kj}^2 } \right)^{2} , \tag{3.6}$$

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.


            4 Formalizing the Objective

In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can either be complete, in dimension K − 1, or partial, for a reduced rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995), and were already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;

• inputs are centered, that is, $X^\top 1_n = 0$;

• the quadratic penalty $\Omega$ is positive-semidefinite and such that $X^\top X + \Omega$ is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in $(\theta, \beta)$, that is, convex in $\theta$ for each $\beta$ value and vice-versa. The problems are, however, non-convex: in particular, if $(\theta, \beta)$ is a solution, then $(-\theta, -\beta)$ is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K−1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by $\beta_K = 0$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore, so as to simplify all expressions. The generic problem solved is thus
\[
\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \; \|Y\theta - X\beta\|^2 + \beta^\top \Omega \beta \tag{4.1a}
\]
\[
\text{s.t.} \quad n^{-1}\, \theta^\top Y^\top Y \theta = 1 \,. \tag{4.1b}
\]

For a given score vector $\theta$, the discriminant direction $\beta$ that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
\[
\beta_{\mathrm{os}} = \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta \,. \tag{4.2}
\]

The objective function (4.1a) is then
\[
\|Y\theta - X\beta_{\mathrm{os}}\|^2 + \beta_{\mathrm{os}}^\top \Omega \beta_{\mathrm{os}}
= \theta^\top Y^\top Y \theta - 2\theta^\top Y^\top X \beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \big(X^\top X + \Omega\big)\beta_{\mathrm{os}}
= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta \,,
\]
where the second equality stems from the definition of $\beta_{\mathrm{os}}$ (4.2). Now, using the fact that the optimal $\theta$ obeys constraint (4.1b), the optimization problem is equivalent to
\[
\max_{\theta:\, n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta \,, \tag{4.3}
\]
which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the kth largest eigenvector of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by
\[
(Y^\top Y)^{-1} Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta = \alpha^2 \theta \,, \tag{4.4}
\]


where $\alpha^2$ is the maximal eigenvalue:¹
\[
n^{-1}\theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta = \alpha^2\, n^{-1}\theta^\top (Y^\top Y)\theta
\quad\Longrightarrow\quad
n^{-1}\theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta = \alpha^2 \,. \tag{4.5}
\]

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:
\[
\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \; n^{-1}\, \theta^\top Y^\top X \beta \tag{4.6a}
\]
\[
\text{s.t.} \quad n^{-1}\, \theta^\top Y^\top Y \theta = 1 \tag{4.6b}
\]
\[
\phantom{\text{s.t.}} \quad n^{-1}\, \beta^\top \big(X^\top X + \Omega\big) \beta = 1 \,. \tag{4.6c}
\]
The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:
\[
n L(\beta, \theta, \nu, \gamma) = \theta^\top Y^\top X \beta - \nu\big(\theta^\top Y^\top Y \theta - n\big) - \gamma\big(\beta^\top (X^\top X + \Omega)\beta - n\big)
\]
\[
\Longrightarrow\quad n\, \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} = X^\top Y \theta - 2\gamma (X^\top X + \Omega)\beta
\quad\Longrightarrow\quad \beta_{\mathrm{cca}} = \frac{1}{2\gamma}\, (X^\top X + \Omega)^{-1} X^\top Y \theta \,.
\]
Then, as $\beta_{\mathrm{cca}}$ obeys (4.6c), we obtain
\[
\beta_{\mathrm{cca}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} \,, \tag{4.7}
\]
so that the optimal objective function (4.6a) can be expressed with $\theta$ alone:
\[
n^{-1}\theta^\top Y^\top X \beta_{\mathrm{cca}}
= \frac{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}}
= \sqrt{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta} \,,
\]
and the optimization problem with respect to $\theta$ can be restated as
\[
\max_{\theta:\, n^{-1}\theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta \,. \tag{4.8}
\]
Hence the p-OS and p-CCA problems produce the same optimal score vectors $\theta$. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):
\[
\beta_{\mathrm{os}} = \alpha\, \beta_{\mathrm{cca}} \,, \tag{4.9}
\]

¹ The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where $\alpha$ is defined by (4.5).

The p-CCA optimization problem can also be written as a function of $\beta$ alone, using the optimality conditions for $\theta$:
\[
n\, \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} = Y^\top X \beta - 2\nu\, Y^\top Y \theta
\quad\Longrightarrow\quad \theta_{\mathrm{cca}} = \frac{1}{2\nu}\, (Y^\top Y)^{-1} Y^\top X \beta \,. \tag{4.10}
\]
Then, as $\theta_{\mathrm{cca}}$ obeys (4.6b), we obtain
\[
\theta_{\mathrm{cca}} = \frac{(Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} \,, \tag{4.11}
\]
leading to the following expression of the optimal objective function:
\[
n^{-1}\theta_{\mathrm{cca}}^\top Y^\top X \beta
= \frac{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}}
= \sqrt{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta} \,.
\]
The p-CCA problem can thus be solved with respect to $\beta$ by plugging this value in (4.6):
\[
\max_{\beta \in \mathbb{R}^p} \; n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta \tag{4.12a}
\]
\[
\text{s.t.} \quad n^{-1}\, \beta^\top \big(X^\top X + \Omega\big)\beta = 1 \,, \tag{4.12b}
\]
where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution; following the reasoning of Appendix C, $\beta_{\mathrm{cca}}$ verifies
\[
n^{-1} X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \big(X^\top X + \Omega\big)\beta_{\mathrm{cca}} \,, \tag{4.13}
\]
where $\lambda$ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:
\[
\begin{aligned}
& n^{-1}\beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \\
\Longrightarrow\;& n^{-1}\alpha^{-1}\beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Longrightarrow\;& n^{-1}\alpha\, \beta_{\mathrm{cca}}^\top X^\top Y \theta = \lambda \\
\Longrightarrow\;& n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Longrightarrow\;& \alpha^2 = \lambda \,.
\end{aligned}
\]
The first line is obtained by obeying constraint (4.12b); the second line by the relationship (4.7), where the denominator is $\alpha$; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one uses the definition of $\alpha$ (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis is defined as follows:
\[
\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta \tag{4.14a}
\]
\[
\text{s.t.} \quad \beta^\top \big(\Sigma_W + n^{-1}\Omega\big)\beta = 1 \,, \tag{4.14b}
\]
where $\Sigma_B$ and $\Sigma_W$ are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation, using the projection operator $Y(Y^\top Y)^{-1}Y^\top$:
\[
\begin{aligned}
\Sigma_T &= \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = n^{-1} X^\top X \\
\Sigma_B &= \frac{1}{n}\sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1} X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X \\
\Sigma_W &= \frac{1}{n}\sum_{k=1}^{K} \sum_{i:\, y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1}\Big(X^\top X - X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X\Big) \,.
\end{aligned}
\]
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as
\[
\begin{aligned}
X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \lambda \Big(X^\top X + \Omega - X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X\Big) \beta_{\mathrm{lda}} \\
X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \frac{\lambda}{1 - \lambda} \big(X^\top X + \Omega\big) \beta_{\mathrm{lda}} \,.
\end{aligned}
\]
The comparison of the last equation with $\beta_{\mathrm{cca}}$ (4.13) shows that $\beta_{\mathrm{lda}}$ and $\beta_{\mathrm{cca}}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that
\[
\beta_{\mathrm{lda}} = (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{cca}} = \alpha^{-1}(1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{os}} \,,
\]
which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
\[
\min_{\Theta,\, B} \; \|Y\Theta - XB\|_F^2 + \lambda\, \mathrm{tr}\big(B^\top \Omega B\big)
\quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \,.
\]
Let A represent the $(K-1)\times(K-1)$ diagonal matrix with elements $\alpha_k$ being the square root of the kth largest eigenvalue of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$; we have
\[
B_{\mathrm{LDA}} = B_{\mathrm{CCA}} \big(I_{K-1} - A^2\big)^{-\frac{1}{2}}
= B_{\mathrm{OS}}\, A^{-1}\big(I_{K-1} - A^2\big)^{-\frac{1}{2}} \,, \tag{4.15}
\]
where $I_{K-1}$ is the $(K-1)\times(K-1)$ identity matrix.

At this point, the feature matrix X, which in the input space has dimensions $n \times p$, can be projected into the optimal scoring domain as an $n \times (K-1)$ matrix $X_{\mathrm{OS}} = X B_{\mathrm{OS}}$, or into the linear discriminant analysis space as an $n \times (K-1)$ matrix $X_{\mathrm{LDA}} = X B_{\mathrm{LDA}}$. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process could be summarized as follows:

1. Solve the p-OS problem as
   \[
   B_{\mathrm{OS}} = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta \,,
   \]
   where $\Theta$ are the K−1 leading eigenvectors of $Y^\top X \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y$.

2. Translate the data samples X into the LDA domain as $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, where $D = A^{-1}\big(I_{K-1} - A^2\big)^{-\frac{1}{2}}$.

3. Compute the matrix M of centroids $\mu_k$ from $X_{\mathrm{LDA}}$ and Y.

4. Evaluate the distances $d(x, \mu_k)$ in the LDA domain as a function of M and $X_{\mathrm{LDA}}$.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, display a graphical representation of the data.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.
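To make Steps 2 and 3 of the above summary concrete, here is a minimal matlab sketch, assuming that Bos (the p-OS coefficients) and alpha (the vector of values $\alpha_k$) have already been computed as described in Section 4.2.1; the variable names are illustrative and do not refer to the GLOSS package.

    D    = diag(1 ./ (alpha .* sqrt(1 - alpha.^2)));  % D = A^-1 (I_{K-1} - A^2)^(-1/2)
    Xlda = X * Bos * D;                               % Step 2: n x (K-1) discriminant variates
    M    = diag(1 ./ sum(Y)) * (Y' * Xlda);           % Step 3: K x (K-1) matrix of centroids mu_k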

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
\[
\min_{\Theta \in \mathbb{R}^{K\times(K-1)},\, B \in \mathbb{R}^{p\times(K-1)}} \; \|Y\Theta - XB\|_F^2 + \lambda\, \mathrm{tr}\big(B^\top \Omega B\big) \tag{4.16a}
\]
\[
\text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \,, \tag{4.16b}
\]
where $\Theta$ are the class scores, B the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in $\Theta$ and B: the optimal $B_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to $\Theta$, and the optimization with respect to B is obtained in closed form, as a linear combination of the optimal scores $\Theta$ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize $\Theta$ to $\Theta^0$ such that $n^{-1}\, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$.

2. Compute $B = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta^0$.

3. Set $\Theta$ to be the K−1 leading eigenvectors of $Y^\top X \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y$.

4. Compute the optimal regression coefficients
   \[
   B_{\mathrm{OS}} = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta \,. \tag{4.17}
   \]

Defining $\Theta^0$ in Step 1, instead of using directly $\Theta$ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top} Y^\top X \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta^0$, which is computed as $\Theta^{0\top} Y^\top X B$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form $\mathrm{tr}(B^\top \Omega B)$. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and $\Theta$. That situation is developed by Clemmensen et al. (2011), where


a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.
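For illustration, a minimal matlab sketch of the four-step procedure for a quadratic penalty is given below. It is not the GLOSS code: it assumes that X (n × p) is centered, Y (n × K) is the class indicator matrix with no empty class, Om is a positive semi-definite p × p penalty matrix, and lambda > 0.

    [n, K]   = size(Y);
    U        = null(ones(1, K));                        % K x (K-1), orthonormal, orthogonal to 1_K
    Th0      = sqrt(n) * diag(1 ./ sqrt(sum(Y))) * U;   % Step 1: n^-1 Th0'*Y'*Y*Th0 = I
    R        = chol(X'*X + lambda*Om);                  % Cholesky factor, R'*R = X'X + lambda*Om
    B0       = R \ (R' \ (X' * (Y * Th0)));             % Step 2: penalized least squares fit
    Mmat     = Th0' * (Y' * (X * B0));                  % = Th0'*Y'*X*(X'X+lambda*Om)^-1*X'*Y*Th0
    [V, S]   = eig((Mmat + Mmat') / 2);                 % Step 3: eigen-analysis (symmetric matrix)
    [s, ord] = sort(diag(S), 'descend');
    V        = V(:, ord);
    Th       = Th0 * V;                                 % optimal scores
    Bos      = B0 * V;                                  % Step 4: optimal regression coefficients
    alpha    = sqrt(max(s, 0) / n);                     % values alpha_k of Section 4.1.4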

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where the sample $x_i$ is assigned to class k if $x_i$ is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (sample estimators $\hat\mu_k$ and $\hat\Sigma_W$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to the class k if the distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2\log\Big(\frac{n_k}{n}\Big) \tag{4.18}
\]
is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\Sigma_{W\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
\[
\Sigma_{W\Omega}^{-1} = \big(n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B\big)^{-1}
= \big(n^{-1}X^\top X - \Sigma_B + n^{-1}\lambda\Omega\big)^{-1}
= \big(\Sigma_W + n^{-1}\lambda\Omega\big)^{-1} \,. \tag{4.19}
\]
Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution $B_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification;

• in the LDA domain (space of discriminant variates $X_{\mathrm{LDA}}$), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension $R < K-1$, by using the first R discriminant directions $\{\beta_k\}_{k=1}^{R}$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is
\[
\big\|(x_i - \mu_k) B_{\mathrm{OS}}\big\|_{\Sigma_{W\Omega}}^2 - 2\log(\pi_k) \,,
\]
where $\pi_k$ is the estimated class prior and $\|\cdot\|_S$ is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is
\[
\Big\|(x_i - \mu_k) B_{\mathrm{OS}} A^{-1}\big(I_{K-1} - A^2\big)^{-\frac{1}{2}}\Big\|_2^2 - 2\log(\pi_k) \,,
\]
which is a plain Euclidean distance.
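The nearest-centroid rule in the p-LDA domain can then be sketched in a few lines of matlab; Xlda and M are the projected samples and centroids introduced earlier, and all names are illustrative.

    pik  = sum(Y)' / size(Y, 1);                   % estimated class priors n_k / n
    dist = zeros(size(Xlda, 1), numel(pik));
    for k = 1:numel(pik)
        dif        = Xlda - M(k, :);               % plain Euclidean geometry in the LDA domain
        dist(:, k) = sum(dif.^2, 2) - 2 * log(pik(k));
    end
    [~, yhat] = min(dist, [], 2);                  % class assignment by minimal adjusted distance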


4.2.3 Posterior Probability Evaluation

Let $d(x, \mu_k)$ be a distance between x and $\mu_k$ defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k = 1 \mid x)$ can be estimated as
\[
p(y_k = 1 \mid x) \;\propto\; \exp\Big(-\frac{d(x, \mu_k)}{2}\Big)
\;\propto\; \pi_k \exp\Big(-\tfrac{1}{2}\big\|(x - \mu_k) B_{\mathrm{OS}} A^{-1}\big(I_{K-1} - A^2\big)^{-\frac{1}{2}}\big\|_2^2\Big) \,. \tag{4.20}
\]
Those probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \mu_k)$ take large values, $\exp\big(-\frac{d(x,\mu_k)}{2}\big)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:
\[
p(y_k = 1 \mid x)
= \frac{\pi_k \exp\big(-\frac{d(x,\mu_k)}{2}\big)}{\sum_\ell \pi_\ell \exp\big(-\frac{d(x,\mu_\ell)}{2}\big)}
= \frac{\pi_k \exp\big(\frac{-d(x,\mu_k) + d_{\max}}{2}\big)}{\sum_\ell \pi_\ell \exp\big(\frac{-d(x,\mu_\ell) + d_{\max}}{2}\big)} \,,
\]
where $d_{\max} = \max_k d(x, \mu_k)$.
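A minimal matlab sketch of this normalization, reusing the matrix dist of adjusted distances and the priors pik from the previous sketch (the shift by the per-sample maximum distance follows the text; shifting by the minimum distance is a common alternative of the same flavor):

    dmax = max(dist, [], 2);                       % d_max, one value per sample
    num  = pik' .* exp(-(dist - dmax) / 2);        % shifted, unnormalized posteriors
    post = num ./ sum(num, 2);                     % each row now sums to one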

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $X_{\mathrm{OS}}$, or of the discriminant variates $X_{\mathrm{LDA}}$, depending on whether we are presenting the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top\Omega\beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and $\Omega$ is positive definite). Quadratic penalties have interesting properties, but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see


Section 2.3.4) that induces groups of zeroes on the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $B^\top\Omega B$.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweigh a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \tag{4.21a}
\]
\[
\text{s.t.} \quad \sum_{j} \tau_j - \sum_{j} w_j \|\beta^j\|_2 \le 0 \tag{4.21b}
\]
\[
\phantom{\text{s.t.}} \quad \tau_j \ge 0 \,, \quad j = 1, \ldots, p \,, \tag{4.21c}
\]
where $B \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$, $B = \big(\beta^{1\top}, \ldots, \beta^{p\top}\big)^\top$, and the $w_j$ are predefined nonnegative weights. The cost function $J(B)$ in our context is the OS regression $\|Y\Theta - XB\|_2^2$; from now on, for the sake of simplicity, we simply write $J(B)$. Here and in what follows, $b/\tau$ is defined by continuation at zero: $b/0 = +\infty$ if $b \ne 0$ and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables $\tau_j$. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2$.

Proof. The Lagrangian of Problem (4.21) is
\[
L = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}
+ \nu_0 \Big( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \|\beta^j\|_2 \Big)
- \sum_{j=1}^{p} \nu_j \tau_j \,.
\]


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus the first-order optimality conditions for $\tau_j$ are
\[
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
\;\Leftrightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Leftrightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0 \,.
\]
The last implication is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$. Complementary slackness states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for constraint $g_j(\tau_j) \le 0$. As a result, the optimal value of $\tau_j$ is
\[
\tau_j^\star = \sqrt{\frac{\lambda w_j^2 \|\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\, w_j \|\beta^j\|_2 \,. \tag{4.22}
\]
We note that $\nu_0 \ne 0$ if there is at least one coefficient $\beta_{jk} \ne 0$; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):
\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \|\beta^j\|_2 = 0 \,, \tag{4.23}
\]
so that $\tau_j^\star = w_j \|\beta^j\|_2$. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \,. \tag{4.24}
\]
We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as $\lambda\, \mathrm{tr}(B^\top \Omega B)$, where
\[
\Omega = \mathrm{diag}\Big(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\Big) \,, \tag{4.25}
\]
with $\tau_j = w_j \|\beta^j\|_2$, resulting in the diagonal components
\[
(\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} \,. \tag{4.26}
\]
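A quick numerical check of (4.25)–(4.26), on random data with illustrative names: with $\tau_j = w_j\|\beta^j\|_2$, the quadratic penalty $\lambda\,\mathrm{tr}(B^\top\Omega B)$ coincides with the group-Lasso penalty $\lambda \sum_j w_j\|\beta^j\|_2$.

    p = 6; K = 4; lambda = 0.3;
    B  = randn(p, K-1);  w = rand(p, 1) + 0.5;
    nr = sqrt(sum(B.^2, 2));                        % ||beta^j||_2, j = 1, ..., p
    Om = diag(w ./ nr);                             % diagonal matrix of (4.26)
    quadpen = lambda * trace(B' * Om * B);          % variational (quadratic) form
    glpen   = lambda * sum(w .* nr);                % standard group-Lasso penalty
    fprintf('difference: %g\n', quadpen - glpen);   % zero up to rounding errors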

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the active set implementation described in Chapter 5.

The first property states that the quadratic formulation is convex whenever J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \|\beta\|_2^2/\tau$, known as the perspective function of $f(\beta) = \|\beta\|_2^2$, is convex in $(\beta, \tau)$ (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B, \tau)$.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $B \in \mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (4.24) is
\[
\Big\{ V \in \mathbb{R}^{p\times(K-1)} : V = \frac{\partial J(B)}{\partial B} + \lambda G \Big\} \,, \tag{4.27}
\]
where $G \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $g^j \in \mathbb{R}^{K-1}$, $G = \big(g^{1\top}, \ldots, g^{p\top}\big)^\top$, defined as follows. Let $S(B)$ denote the columnwise support of B, $S(B) = \{ j \in \{1,\ldots,p\} : \|\beta^j\|_2 \ne 0 \}$; then we have
\[
\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j \,, \tag{4.28}
\]
\[
\forall j \notin S(B), \quad \|g^j\|_2 \le w_j \,. \tag{4.29}
\]


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When $\|\beta^j\|_2 \ne 0$, the gradient of the penalty with respect to $\beta^j$ is
\[
\frac{\partial}{\partial \beta^j}\Big( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \Big) = \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} \,. \tag{4.30}
\]
At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):
\[
\partial_{\beta^j}\Big( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \Big)
= \partial_{\beta^j}\big( \lambda w_j \|\beta^j\|_2 \big)
= \big\{ \lambda w_j v \in \mathbb{R}^{K-1} : \|v\|_2 \le 1 \big\} \,, \tag{4.31}
\]
which gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima:
\[
\forall j \in S^\star, \quad \frac{\partial J(B^\star)}{\partial \beta^j} + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0 \,, \tag{4.32a}
\]
\[
\forall j \notin S^\star, \quad \Big\| \frac{\partial J(B^\star)}{\partial \beta^j} \Big\|_2 \le \lambda w_j \,, \tag{4.32b}
\]
where $S^\star \subseteq \{1,\ldots,p\}$ denotes the set of non-zero row vectors $\beta^{\star j}$, and $\bar S^\star$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
\[
B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p\times(K-1)}} \; \min_{\Theta \in \mathbb{R}^{K\times(K-1)}} \; \frac{1}{2}\|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2
\quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\]


is equivalent to the penalized LDA problem
\[
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p\times(K-1)}} \; \mathrm{tr}\big(B^\top \Sigma_B B\big)
\quad \text{s.t.} \quad B^\top\big(\Sigma_W + n^{-1}\lambda\Omega\big)B = I_{K-1} \,,
\]
where $\Omega = \mathrm{diag}\big(\frac{w_1^2}{\tau_1}, \ldots, \frac{w_p^2}{\tau_p}\big)$, with
\[
\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^j_{\mathrm{os}} = 0 \\ w_j \|\beta^j_{\mathrm{os}}\|_2^{-1} & \text{otherwise.} \end{cases} \tag{4.33}
\]
That is, $B_{\mathrm{LDA}} = B_{\mathrm{OS}}\, \mathrm{diag}\big(\alpha_k^{-1}(1-\alpha_k^2)^{-1/2}\big)$, where $\alpha_k \in (0, 1)$ is the kth leading eigenvalue of
\[
n^{-1} Y^\top X \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \,.
\]
Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note, however, that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form $\mathrm{tr}\big(B^\top \Omega B\big)$.


            5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac{1}{2}\|Y\Theta - XB\|_2^2$.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty $\Omega$. This decomposition is especially attractive, as we then solve (K−1) similar systems
\[
\big(X_A^\top X_A + \lambda\Omega\big)\beta_k = X_A^\top Y \theta^0_k \,, \tag{5.1}
\]


Figure 5.1: GLOSS block diagram (flowchart of the active-set iterations: solve the p-OS problem on the active set, move variables violating the optimality conditions between the active and inactive sets, then compute Θ and update B).


Algorithm 1: Adaptively Penalized Optimal Scoring

    Input: X, Y, B, λ
    Initialize: A ← { j ∈ {1,…,p} : ‖β^j‖₂ > 0 };  Θ⁰ such that n⁻¹ Θ⁰ᵀYᵀYΘ⁰ = I_{K−1};  convergence ← false
    repeat
        % Step 1: solve (4.21) in B, assuming A optimal
        repeat
            Ω ← diag(Ω_A), with ω_j ← ‖β^j‖₂⁻¹
            B_A ← (X_AᵀX_A + λΩ)⁻¹ X_AᵀYΘ⁰
        until condition (4.32a) holds for all j ∈ A
        % Step 2: identify inactivated variables
        for j ∈ A such that ‖β^j‖₂ = 0 do
            if optimality condition (4.32b) holds then
                A ← A \ {j};  go back to Step 1
            end if
        end for
        % Step 3: check the greatest violation of optimality condition (4.32b) in the inactive set Ā
        j* ← argmax_{j ∈ Ā} ‖∂J/∂β^j‖₂
        if ‖∂J/∂β^{j*}‖₂ < λ then
            convergence ← true      % B is optimal
        else
            A ← A ∪ {j*}
        end if
    until convergence
    (s, V) ← eigenanalyze(Θ⁰ᵀYᵀX_A B), that is, Θ⁰ᵀYᵀX_A B V_k = s_k V_k, k = 1,…,K−1
    Θ ← Θ⁰V;  B ← BV;  α_k ← n^(−1/2) s_k^(1/2), k = 1,…,K−1
    Output: Θ, B, α


where $X_A$ denotes the columns of X indexed by A, and $\beta_k$ and $\theta^0_k$ denote the kth columns of B and $\Theta^0$, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" $\Omega$ for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to
\[
\big(X^\top X + \lambda\Omega\big)B = X^\top Y \Theta \,. \tag{5.2}
\]
Defining the Cholesky decomposition as $C^\top C = X^\top X + \lambda\Omega$, (5.2) is solved efficiently as follows:
\[
C^\top C B = X^\top Y \Theta
\quad\Rightarrow\quad C B = C^\top \backslash \big(X^\top Y \Theta\big)
\quad\Rightarrow\quad B = C \backslash \big(C^\top \backslash \big(X^\top Y \Theta\big)\big) \,, \tag{5.3}
\]
where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
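As an illustration, a minimal matlab sketch of (5.3) on the active set, with illustrative names (XA, Om, Th0):

    A   = XA' * XA + lambda * Om;     % Om is the current diagonal penalty on the active set
    C   = chol(A);                    % upper triangular factor, C' * C = A
    RHS = XA' * (Y * Th0);
    B   = C \ (C' \ RHS);             % two triangular solves, shared by the K-1 systems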

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer $\Omega$ is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of $\Omega$ reaches very large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $X^\top X + \lambda\Omega$. This difficulty can be avoided using the following equivalent expression:
\[
B = \Omega^{-\frac{1}{2}}\Big(\Omega^{-\frac{1}{2}} X^\top X\, \Omega^{-\frac{1}{2}} + \lambda I\Big)^{-1}\Omega^{-\frac{1}{2}} X^\top Y \Theta^0 \,, \tag{5.4}
\]
where the conditioning of $\Omega^{-\frac{1}{2}} X^\top X\, \Omega^{-\frac{1}{2}} + \lambda I$ is always well-behaved, provided X is appropriately normalized (recall that $0 \le 1/\omega_j \le 1$). This stabler expression demands more computation and is thus reserved to cases with large $\omega_j$ values; our code is otherwise based on expression (5.2).
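A sketch of the stabler computation (5.4), under the same assumptions as the previous snippet (Om diagonal with positive entries):

    S = diag(1 ./ sqrt(diag(Om)));                        % Om^(-1/2)
    G = S * (XA' * XA) * S + lambda * eye(size(S, 1));    % well-conditioned system matrix
    B = S * (G \ (S * (XA' * (Y * Th0))));                % equivalent to (5.2), more stable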

5.2 Score Matrix

The optimal score matrix $\Theta$ is made of the K−1 leading eigenvectors of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$. This eigen-analysis is actually solved in the form $\Theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \Theta$ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of $\big(X^\top X + \Omega\big)^{-1}$, which


involves the inversion of a p × p matrix. Let $\Theta^0$ be an arbitrary K×(K−1) matrix whose range includes the K−1 leading eigenvectors of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$.¹ Then, solving the K−1 systems (5.3) provides the value of $B^0 = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta^0$. This $B^0$ matrix can be identified in the expression to eigen-analyze, as
\[
\Theta^{0\top} Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 \,.
\]
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix $\Theta^{0\top} Y^\top X B^0 = V \Lambda V^\top$. Defining $\Theta = \Theta^0 V$, we have $\Theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \Theta = \Lambda$, and when $\Theta^0$ is chosen such that $n^{-1}\Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$, we also have that $n^{-1}\Theta^\top Y^\top Y \Theta = I_{K-1}$, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of $\Lambda$ are sorted in decreasing order, $\theta_k$ is an optimal solution to the kth problem of the p-OS series. Finally, once $\Theta$ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from $\Theta^0$ to $\Theta$, that is, $B = B^0 V$. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which $\Omega$ is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix $\Theta$. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function
\[
\frac{1}{2}\|Y\Theta - XB\|_2^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \,. \tag{5.5}
\]
Let J(B) be the data-fitting term $\frac{1}{2}\|Y\Theta - XB\|_2^2$. Its gradient with respect to the jth row of B, $\beta^j$, is the (K−1)-dimensional vector
\[
\frac{\partial J(B)}{\partial \beta^j} = x_j^\top\big(XB - Y\Theta\big) \,,
\]
where $x_j$ is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as
\[
x_j^\top\big(XB - Y\Theta\big) + \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} \,.
\]

¹ As X is centered, $1_K$ belongs to the null space of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$. It is thus sufficient to choose $\Theta^0$ orthogonal to $1_K$ to ensure that its range spans the leading eigenvectors of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set $\Theta^0 = \big(Y^\top Y\big)^{-1/2} U$, where U is a K×(K−1) matrix whose columns are orthonormal vectors orthogonal to $1_K$.


The second optimality condition (4.32b) can be computed for every variable j as
\[
\big\| x_j^\top\big(XB - Y\Theta\big) \big\|_2 \le \lambda w_j \,.
\]
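As an illustration, the first condition can be checked on the active set with a few lines of matlab; active is the vector of indices of the active variables, w the penalty weights, and the tolerance is arbitrary.

    Res = X * B - Y * Th;                                    % residuals of the OS fit
    ok  = true;
    for j = active(:)'                                       % condition (4.32a) on active variables
        gj = X(:, j)' * Res + lambda * w(j) * B(j, :) / norm(B(j, :));
        ok = ok && (norm(gj) < 1e-8);
    end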

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
\[
j^\star = \operatorname*{argmax}_{j} \; \max\Big( \big\| x_j^\top\big(XB - Y\Theta\big) \big\|_2 - \lambda w_j \,,\; 0 \Big) \,.
\]
The exclusion of a variable belonging to the active set A is considered if the norm $\|\beta^j\|_2$ is small and if, after setting $\beta^j$ to zero, the following optimality condition holds:
\[
\big\| x_j^\top\big(XB - Y\Theta\big) \big\|_2 \le \lambda w_j \,.
\]
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
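A matlab sketch of the inclusion rule, with illustrative names (Res is the residual matrix XB − YΘ, active the current index set, w the penalty weights):

    viol = sqrt(sum((X' * Res).^2, 2)) - lambda * w;   % violation of (4.32b), one value per variable
    viol(active) = -Inf;                               % only inactive variables are candidates
    [vmax, jstar] = max(viol);
    if vmax > 0
        active = sort([active; jstar]);                % add the worst violator to the active set
    end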

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter $\lambda_{\max}$ such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter $\lambda_{\max}$, corresponding to a null B matrix, is obtained by computing the optimality condition (4.32b) at B = 0:
\[
\lambda_{\max} = \max_{j \in \{1,\ldots,p\}} \; \frac{1}{w_j} \big\| x_j^\top Y \Theta^0 \big\|_2 \,.
\]
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties $\lambda_1 = \lambda_{\max} > \cdots > \lambda_t > \cdots > \lambda_T = \lambda_{\min} \ge 0$, by regularly decreasing the penalty, $\lambda_{t+1} = \lambda_t/2$, and using a warm-start strategy where the feasible initial guess for $B(\lambda_{t+1})$ is initialized with $B(\lambda_t)$. The final penalty parameter $\lambda_{\min}$ is specified in the optimization process when the maximum number of desired active variables is attained (by default, the minimum of n and p).
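A sketch of this strategy in matlab; gloss_fit is a hypothetical name for a routine solving the p-OS problem at a fixed λ (warm-started with the previous coefficients), and Th0 and w are defined as before.

    lmax   = max(sqrt(sum((X' * (Y * Th0)).^2, 2)) ./ w);   % lambda_max: condition (4.32b) at B = 0
    lambda = lmax;
    B      = zeros(p, K-1);
    while nnz(any(B, 2)) < min(n, p) && lambda > 0
        B      = gloss_fit(X, Y, lambda, B);   % hypothetical solver call, warm start
        lambda = lambda / 2;                   % regularly decreasing penalties
    end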


5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm, or equivalently to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2
= \min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\Big(\Theta^\top Y^\top Y \Theta - 2\Theta^\top Y^\top X B + n B^\top \Sigma_T B\Big)
\]
are replaced by
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\Big(\Theta^\top Y^\top Y \Theta - 2\Theta^\top Y^\top X B + n B^\top \big(\Sigma_B + \mathrm{diag}(\Sigma_W)\big) B\Big) \,.
\]
Note that this variant only requires $\mathrm{diag}(\Sigma_W) + \Sigma_B + n^{-1}\Omega$ to be positive definite, which is a weaker requirement than $\Sigma_T + n^{-1}\Omega$ positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.


Figure 5.2: Neighborhood graph of the pixels of a 3×3 image (numbered 7 8 9 / 4 5 6 / 1 2 3, from top row to bottom row) and the corresponding Laplacian matrix:
\[
\Omega_L = \begin{pmatrix}
 3 & -1 &  0 & -1 & -1 &  0 &  0 &  0 &  0\\
-1 &  5 & -1 & -1 & -1 & -1 &  0 &  0 &  0\\
 0 & -1 &  3 &  0 & -1 & -1 &  0 &  0 &  0\\
-1 & -1 &  0 &  5 & -1 &  0 & -1 & -1 &  0\\
-1 & -1 & -1 & -1 &  8 & -1 & -1 & -1 & -1\\
 0 & -1 & -1 &  0 & -1 &  5 &  0 & -1 & -1\\
 0 &  0 &  0 & -1 & -1 &  0 &  3 & -1 &  0\\
 0 &  0 &  0 & -1 & -1 & -1 & -1 &  5 & -1\\
 0 &  0 &  0 &  0 & -1 & -1 &  0 & -1 &  3
\end{pmatrix}
\]

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix $\Omega_L$ is positive-semidefinite, and the penalty $\beta^\top\Omega_L\beta$ favors, among vectors of identical L2 norm, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector $(1\ 1\ 0\ 1\ 1\ 0\ 0\ 0\ 0)^\top$, which is the indicator of the neighborhood of pixel 1, and it is 17 for the vector $(-1\ 1\ 0\ 1\ 1\ 0\ 0\ 0\ 0)^\top$, with a sign mismatch between pixel 1 and its neighbors.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
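For illustration, the Laplacian matrix of Figure 5.2 can be built as follows for an r×c pixel grid with 8-connectivity (a sketch; the 3×3 case recovers the matrix above, up to the pixel numbering):

    r = 3; c = 3;
    [I, J] = ndgrid(1:r, 1:c);                % row and column coordinates of each pixel
    A = zeros(r * c);
    for a = 1:r*c
        for b = a+1:r*c
            if max(abs(I(a) - I(b)), abs(J(a) - J(b))) == 1   % king-move neighbors
                A(a, b) = 1;  A(b, a) = 1;
            end
        end
    end
    OmL = diag(sum(A)) - A;                   % graph Laplacian, positive semi-definite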


            6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a pure Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites: PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix $\Sigma_T$ to ones, or the diagonal of the within-class covariance matrix $\Sigma_W$ to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing the penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹ The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: mean shift with independent features. There are four classes. If sample i is in class k, then $x_i \sim N(\mu_k, I)$, where $\mu_{1j} = 0.7 \times 1_{(1\le j\le 25)}$, $\mu_{2j} = 0.7 \times 1_{(26\le j\le 50)}$, $\mu_{3j} = 0.7 \times 1_{(51\le j\le 75)}$, $\mu_{4j} = 0.7 \times 1_{(76\le j\le 100)}$.

Simulation 2: mean shift with dependent features. There are two classes. If sample i is in class 1, then $x_i \sim N(0, \Sigma)$, and if i is in class 2, then $x_i \sim N(\mu, \Sigma)$, with $\mu_j = 0.6 \times 1_{(j\le 200)}$. The covariance structure is block diagonal, with 5 blocks, each of dimension 100×100. The blocks have $(j, j')$ element $0.6^{|j-j'|}$. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: one-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then $X_{ij} \sim N(\frac{k-1}{3}, 1)$ if $j \le 100$, and $X_{ij} \sim N(0, 1)$ otherwise.

Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then $x_i \sim N(\mu_k, I)$, with mean vectors defined as follows: $\mu_{1j} \sim N(0, 0.3^2)$ for $j \le 25$ and $\mu_{1j} = 0$ otherwise; $\mu_{2j} \sim N(0, 0.3^2)$ for $26 \le j \le 50$ and $\mu_{2j} = 0$ otherwise; $\mu_{3j} \sim N(0, 0.3^2)$ for $51 \le j \le 75$ and $\mu_{3j} = 0$ otherwise; $\mu_{4j} \sim N(0, 0.3^2)$ for $76 \le j \le 100$ and $\mu_{4j} = 0$ otherwise.

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.
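As an example, Simulation 1 can be reproduced with the following matlab sketch (illustrative, with exactly balanced classes):

    p = 500; K = 4; n = 100;
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25 + (1:25)) = 0.7;       % 25 shifted features per class
    end
    y = repelem((1:K)', n / K);               % 25 samples per class
    X = mu(y, :) + randn(n, p);               % x_i ~ N(mu_k, I)
    Y = full(sparse(1:n, y, 1, n, K));        % class indicator matrix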

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                                              Err (%)       Var            Dir
    Sim 1: K = 4, mean shift, ind. features
      PLDA                                    12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
      SLDA                                    31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
      GLOSS                                   19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
      GLOSS-D                                 11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
    Sim 2: K = 2, mean shift, dependent features
      PLDA                                     9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
      SLDA                                    19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
      GLOSS                                   15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
      GLOSS-D                                  9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
    Sim 3: K = 4, 1D mean shift, ind. features
      PLDA                                    13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
      SLDA                                    57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
      GLOSS                                   31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
      GLOSS-D                                 18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
    Sim 4: K = 4, mean shift, ind. features
      PLDA                                    60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
      SLDA                                    65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
      GLOSS                                   60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
      GLOSS-D                                 58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

                Simulation 1      Simulation 2      Simulation 3      Simulation 4
                TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
    PLDA        99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
    SLDA        73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
    GLOSS       64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
    GLOSS-D     93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non-relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR, but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer.

² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and of the number of selected variables.

                                            Err (%)          Var
    Nakayama: n = 86, p = 22,283, K = 5
      PLDA                                  20.95 (1.3)      10478.7 (2116.3)
      SLDA                                  25.71 (1.7)        252.5 (3.1)
      GLOSS                                 20.48 (1.4)        129.0 (18.6)
    Ramaswamy: n = 198, p = 16,063, K = 14
      PLDA                                  38.36 (6.0)      14873.5 (720.3)
      SLDA                                  —                —
      GLOSS                                 20.61 (6.9)        372.4 (122.1)
    Sun: n = 180, p = 54,613, K = 4
      PLDA                                  33.78 (5.9)      21634.8 (7443.2)
      SLDA                                  36.22 (6.5)        384.4 (16.5)
      GLOSS                                 31.77 (4.5)         93.0 (93.6)

Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

            Each dataset was split into a training set and a test set with respectively 75 and25 of the examples Parameter tuning is performed by 10-fold cross-validation and thetest performances are then evaluated The process is repeated 10 times with randomchoices of training and test set split

            Test error rates and the number of selected variables are presented in Table 63 Theresults for the PLDA algorithm are extracted from Witten and Tibshirani (2011) Thethree methods have comparable prediction performances on the Nakayama and Sundatasets but GLOSS performs better on the Ramaswamy data where the SparseLDApackage failed to return a solution due to numerical problems in the LARS-EN imple-mentation Regarding the number of selected variables GLOSS is again much sparserthan its competitors

            Finally Figure 62 displays the projection of the observations for the Nakayama andSun datasets in the first canonical planes estimated by GLOSS and SLDA For theNakayama dataset groups 1 and 2 are well-separated from the other ones in both rep-resentations but GLOSS is more discriminant in the meta-cluster gathering groups 3to 5 For the Sun dataset SLDA suffers from a high colinearity of its first canonicalvariables that renders the second one almost non-informative As a result group 1 isbetter separated in the first canonical plane with GLOSS

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2: four scatter plots; rows: Nakayama and Sun datasets, columns: GLOSS and SLDA; axes: 1st and 2nd discriminant directions. Nakayama legend: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun legend: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.



Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0" computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix ΩL, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix ΩL in the GLOSS algorithm is straightforward.
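For instance, the Laplacian of the pixel grid can be built as follows. This is a minimal sketch assuming a 4-neighbour adjacency between pixels; the actual graph used in Section 5.6.4 may differ in its details.

    import numpy as np

    def grid_laplacian(h=16, w=16):
        """Graph Laplacian L = D - A of the h x w pixel grid with 4-neighbour adjacency."""
        p = h * w
        A = np.zeros((p, p))
        for r in range(h):
            for c in range(w):
                i = r * w + c
                if c + 1 < w:               # right neighbour
                    A[i, i + 1] = A[i + 1, i] = 1
                if r + 1 < h:               # bottom neighbour
                    A[i, i + w] = A[i + w, i] = 1
        return np.diag(A.sum(axis=1)) - A

    Omega_L = grid_laplacian()              # 256 x 256 penalty matrix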

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow strokes to be detected, and will probably provide better prediction results.



Figure 6.4: Discriminant direction between digits "1" and "0" (left: β for GLOSS; right: β for S-GLOSS).

Figure 6.5: Sparse discriminant direction between digits "1" and "0" (left: β for GLOSS with λ = 0.3; right: β for S-GLOSS with λ = 0.3).


            Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active-set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


            Part III

            Sparse Clustering Analysis


            Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


            7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^\top, \dots, x_n^\top)^\top have been drawn identically from K different subpopulations of the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compound distribution whose density can be expressed as

    f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i), \quad \forall i \in \{1, \dots, n\},

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ for all k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k of each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, ..., π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

    f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \phi(x_i; \theta_k), \quad \forall i \in \{1, \dots, n\},



where θ = (π_1, ..., π_K, θ_1, ..., θ_K) is the parameter of the model.
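This two-stage generative mechanism can be made concrete with a short simulation. The sketch below covers the Gaussian case with a common covariance matrix; the parameter values are arbitrary and only serve as an illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    pi = np.array([0.3, 0.5, 0.2])                     # mixture proportions
    mu = np.array([[0., 0.], [3., 0.], [0., 3.]])      # component means (K x p)
    Sigma = np.eye(2)                                  # common covariance matrix

    def sample_mixture(n):
        # y: multinomial class allocation; x: draw from the allotted component
        y = rng.choice(len(pi), size=n, p=pi)
        x = np.array([rng.multivariate_normal(mu[k], Sigma) for k in y])
        return x, y

    X, y = sample_mixture(500)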

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of a mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

            Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

    L(\theta; X) = \log\left( \prod_{i=1}^{n} f(x_i; \theta) \right)
                 = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right),   (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data.



The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:

    L_C(\theta; X, Y) = \log\left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right)
                      = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} y_{ik} \pi_k f_k(x_i; \theta_k) \right)
                      = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right).   (7.2)

The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k, and y_ik = 0 otherwise.

Defining the soft membership t_ik(θ) as

    t_{ik}(\theta) = p(Y_{ik} = 1 \mid x_i; \theta)   (7.3)
                   = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)}.   (7.4)

To lighten notations, t_ik(θ) will be denoted t_ik when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

    L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right)
                      = \sum_{i,k} y_{ik} \log\left( t_{ik} f(x_i; \theta) \right)
                      = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
                      = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)
                      = \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X),   (7.5)

where \sum_{i,k} y_{ik} \log t_{ik} can be reformulated as

    \sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( p(Y_{ik} = 1 \mid x_i; \theta) \right)
                                  = \sum_{i=1}^{n} \log\left( p(y_i \mid x_i; \theta) \right)
                                  = \log\left( p(Y \mid X; \theta) \right).

As a result, the relationship (7.5) can be rewritten as

    L(\theta; X) = L_C(\theta; Z) - \log\left( p(Y \mid X; \theta) \right).   (7.6)



            Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_ik are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value θ^(t):

    L(\theta; X) = \underbrace{E_{Y \sim p(\cdot \mid X; \theta^{(t)})}\left[ L_C(\theta; X, Y) \right]}_{Q(\theta, \theta^{(t)})}
                 + \underbrace{E_{Y \sim p(\cdot \mid X; \theta^{(t)})}\left[ -\log p(Y \mid X; \theta) \right]}_{H(\theta, \theta^{(t)})}.

In this expression, H(θ, θ^(t)) is the entropy and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then, θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:

    \Delta L = \underbrace{\left( Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right)}_{\geq 0 \text{ by definition of iteration } t+1}
             - \underbrace{\left( H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \right)}_{\leq 0 \text{ by Jensen's inequality}}.

Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

    Q(\theta, \theta') = E_{Y \sim p(Y \mid X; \theta')}\left[ L_C(\theta; X, Y) \right]
                       = \sum_{i,k} p(Y_{ik} = 1 \mid x_i; \theta') \log\left( \pi_k f_k(x_i; \theta_k) \right)
                       = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left( \pi_k f_k(x_i; \theta_k) \right).   (7.7)

Q(θ, θ′), due to its similarity with the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);

• M-step: calculation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).



            Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors μ_k, the mixture density is

    f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
                   = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\}.

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameters θ^(t); then, the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

    Q(\theta, \theta^{(t)}) = \sum_{i,k} t_{ik} \log(\pi_k) - \sum_{i,k} t_{ik} \log\left( (2\pi)^{p/2} |\Sigma|^{1/2} \right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                            = \sum_{k} t_k \log(\pi_k) - \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                            \equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|) - \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right),   (7.8)

where

    t_k = \sum_{i=1}^{n} t_{ik}.   (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ^(t+1):

    \pi_k^{(t+1)} = \frac{t_k}{n},   (7.10)
    \mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k},   (7.11)
    \Sigma^{(t+1)} = \frac{1}{n} \sum_k W_k,   (7.12)
    \text{with } W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top.   (7.13)

The derivations are detailed in Appendix G.
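A direct transcription of these E- and M-step updates, for the Gaussian model with a common covariance matrix, could look as follows. This is a compact sketch of equations (7.4) and (7.10)–(7.13) only: initialization is random and no safeguards against degenerate covariances are included.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_common_cov(X, K, n_iter=100, seed=0):
        n, p = X.shape
        rng = np.random.default_rng(seed)
        pi = np.full(K, 1.0 / K)
        mu = X[rng.choice(n, K, replace=False)]          # random initial centroids
        Sigma = np.cov(X, rowvar=False)
        for _ in range(n_iter):
            # E-step: posterior probabilities t_ik, cf. (7.4), in log scale for stability
            logdens = np.column_stack(
                [multivariate_normal.logpdf(X, mu[k], Sigma) for k in range(K)])
            logT = logdens + np.log(pi)
            logT -= logT.max(axis=1, keepdims=True)
            T = np.exp(logT)
            T /= T.sum(axis=1, keepdims=True)
            # M-step: updates (7.10)-(7.13)
            tk = T.sum(axis=0)
            pi = tk / n
            mu = (T.T @ X) / tk[:, None]
            Sigma = sum((T[:, k, None] * (X - mu[k])).T @ (X - mu[k])
                        for k in range(K)) / n
        return pi, mu, Sigma, T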

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules.



When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

    \log\left( \frac{p(Y_k = 1 \mid x)}{p(Y_\ell = 1 \mid x)} \right) = x^\top \Sigma^{-1} (\mu_k - \mu_\ell) - \frac{1}{2} (\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) + \log\frac{\pi_k}{\pi_\ell}.

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{-1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm

    \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}|,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

    \lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} |(\Sigma_k^{-1})_{jm}|.

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.



Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

    \lambda \sum_{j=1}^{p} \sum_{1 \leq k \leq k' \leq K} |\mu_{kj} - \mu_{k'j}|.

This PFP regularization does not shrink the means to zero, but towards each other. If the jth feature of all cluster means is driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

    \lambda \sum_{j=1}^{p} \left\| (\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj}) \right\|_\infty.

One group is defined for each variable j, as the set of the jth components of the K means, (μ_{1j}, ..., μ_{Kj}). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al., 2008a) and a horizontal mean grouping (HMG). VMG achieves real feature selection, because it forces null values for the same variable in all cluster means:

    \lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K} \mu_{kj}^2}.

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing to test it is available on the authors' website.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is only quickly outlined in that work. We extend it by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independency; that is, the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as



    f(x_i \mid \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[ f(x_{ij} \mid \theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij} \mid \nu_j) \right]^{1 - \phi_j},

where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, ..., φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}. Their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion

    \mathrm{tr}\left( (U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U \right)   (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and the model parameters in the latent space, such that the U matrix enters into the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

    \min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \left\| X_U - X\tilde{U} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \tilde{u}_k \right\|_1,

where X_U = XU is the input data projected in the non-sparse space and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:

    \min_{A, B \in \mathbb{R}^{p \times (K-1)}} \sum_{k=1}^{K} \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1
    \quad \text{s.t. } A^\top A = I_{K-1},

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B.



R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.

The last possibility casts the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

    \min_{U \in \mathbb{R}^{p \times (K-1)}} \sum_{j=1}^{p} \left\| \Sigma_{B,j} - U U^\top \Sigma_{B,j} \right\|_2^2
    \quad \text{s.t. } U^\top U = I_{K-1},

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of Ũ to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain suppositions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion into or exclusion from X^(1);

• X^(3): the set of non-relevant variables.



With those subsets, they define two different models, where Y is the partition to consider:

• M1:

    f(X \mid Y) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid Y) = f(X^{(3)} \mid X^{(2)}, X^{(1)}) \, f(X^{(2)} \mid X^{(1)}) \, f(X^{(1)} \mid Y)

• M2:

    f(X \mid Y) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid Y) = f(X^{(3)} \mid X^{(2)}, X^{(1)}) \, f(X^{(2)}, X^{(1)} \mid Y)

Model M1 means that the variables in X^(2) are independent of the clustering Y, while model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of variable X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

    B_{12} = \frac{f(X \mid M_1)}{f(X \mid M_2)},

where the high-dimensional f(X^(3)|X^(2), X^(1)) cancels from the ratio:

    B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_2)}
           = \frac{f(X^{(2)} \mid X^{(1)}, M_1) \, f(X^{(1)} \mid M_1)}{f(X^{(2)}, X^{(1)} \mid M_2)}.

This factor is approximated, since the integrated likelihoods f(X^(1)|M1) and f(X^(2), X^(1)|M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2)|X^(1), M1), if there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1). There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the quantity of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


            8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression on a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

    d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k),

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.



The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

    2\,\ell_{\mathrm{weight}}(\mu, \Sigma) = -\sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|),

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

    d(x_i, \mu_k) = \left\| (x_i - \mu_k) B_{\mathrm{LDA}} \right\|_2^2 - 2 \log(\pi_k).

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as:



1. Initialize the membership matrix Y (for example, by the K-means algorithm).

2. Solve the p-OS problem as

       B_{OS} = \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y \Theta,

   where Θ are the K − 1 leading eigenvectors of

       Y^\top X \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y.

3. Map X to the LDA domain: X_{LDA} = X B_{OS} D, with D = diag(α_k^{-1}(1 − α_k^2)^{-1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik, with

       t_{ik} \propto \exp\left[ -\frac{d(x, \mu_k) - 2\log(\pi_k)}{2} \right].   (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
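The iteration above can be sketched in a few lines of code. This is a simplified illustration only: a plain quadratic (ridge-type) penalty stands in for the group-Lasso machinery of GLOSS, the eigenproblem is written in its generalized form with the Y^T Y metric so that the α_k² behave as penalized canonical correlations, and no numerical refinements are included.

    import numpy as np
    from scipy.linalg import eigh

    def em_iteration_os(X, T, lam, Omega):
        """One EM iteration with the M-step solved by quadratically penalized optimal scoring.
        X: centered n x p data; T: n x K soft memberships; Omega: p x p penalty matrix."""
        n, p = X.shape
        K = T.shape[1]
        G = np.linalg.inv(X.T @ X + lam * Omega)        # (X'X + lam*Omega)^{-1}
        A = T.T @ X @ G @ X.T @ T                       # Y'X (X'X + lam*Omega)^{-1} X'Y
        alpha2, vecs = eigh(A, T.T @ T)                 # generalized eigenproblem (Y'Y metric)
        idx = np.argsort(alpha2)[::-1][:K - 1]          # K-1 leading eigenvectors
        alpha2 = np.clip(alpha2[idx], 1e-8, 1 - 1e-8)
        Theta = vecs[:, idx] * np.sqrt(n)               # scores such that (1/n) Theta'Y'Y Theta = I
        B_os = G @ X.T @ T @ Theta                      # step 2
        D = np.diag(1.0 / (np.sqrt(alpha2) * np.sqrt(1.0 - alpha2)))
        X_lda = X @ B_os @ D                            # step 3
        tk = T.sum(axis=0)
        pi = tk / n
        Mu = (T.T @ X_lda) / tk[:, None]                # step 4: centroids in the LDA domain
        d2 = ((X_lda[:, None, :] - Mu[None, :, :]) ** 2).sum(axis=2)   # step 5
        logT = np.log(pi) - 0.5 * d2                    # step 6, cf. (8.1), in log scale
        logT -= logT.max(axis=1, keepdims=True)
        T_new = np.exp(logT)
        T_new /= T_new.sum(axis=1, keepdims=True)
        return T_new, B_os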

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with penalized OS. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2).



Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

    f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2} |\Lambda_0|^{n/2} \Gamma_p(n/2)} \, |\Sigma^{-1}|^{(\nu_0 - p - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left( \Lambda_0^{-1} \Sigma^{-1} \right) \right\},

where ν0 is the number of degrees of freedom of the distribution, Λ0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

    \Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\left( n/2 + (1 - j)/2 \right).

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

            84

            82 Optimized Criterion

    Q(\theta, \theta') + \log\left( f(\Sigma \mid \Lambda_0, \nu_0) \right)
      = \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log|\Lambda_0| - \frac{p(p+1)}{4} \log(\pi)
        - \sum_{j=1}^{p} \log \Gamma\left( \frac{n}{2} + \frac{1-j}{2} \right) - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right)
      \equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log|\Lambda_0| - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right),   (8.2)

with

    t_k = \sum_{i=1}^{n} t_{ik}, \quad \nu_n = \nu_0 + n, \quad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \quad S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top.

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

    \Sigma_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right),   (8.3)

where S0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν0 is chosen to be p + 1 and Λ0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).


            9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;



Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the implemented warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached.



The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0
    Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
    λ ← 0
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    {Estimate λ}
    Compute the gradient at β_j = 0:
        ∂J(B)/∂β_j |_{β_j = 0} = x^{j⊤} ( Σ_{m ≠ j} x^m β^m − YΘ )
    Compute λ_max for every feature using (4.32b):
        λ_max^j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
    {Run penalized Mix-GLOSS}
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA

Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y, for every λ in the solution path
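The key quantity of Algorithm 2, the per-feature λ_max derived from the gradient of the data-fitting term at β_j = 0, can be sketched as follows. This is an illustrative sketch only: the weights w_j are assumed to be the group weights of the penalty, and the quantile rule at the end is one possible way of targeting the removal of about 10% of the active variables.

    import numpy as np

    def lambda_max(X, YTheta, B, w):
        """Smallest penalty that zeroes each feature j (cf. (4.32b))."""
        n, p = X.shape
        lam_max = np.zeros(p)
        for j in range(p):
            B_j = B.copy()
            B_j[j, :] = 0.0                          # exclude variable j from the fit
            grad_j = X[:, j] @ (X @ B_j - YTheta)    # dJ/dbeta_j at beta_j = 0
            lam_max[j] = np.linalg.norm(grad_j) / w[j]
        return lam_max

    # A new trial lambda can then be chosen, for instance, as a quantile of lam_max
    # over the currently active variables, so as to drive roughly 10% of them to zero:
    # active = np.abs(B).sum(axis=1) > 0
    # next_lambda = np.quantile(lambda_max(X, YTheta, B, w)[active], 0.10)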

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.



Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0; Y ← Y0
    else
        B_OS ← 0; Y ← K-means(X, K)
    end if
    convergenceEM ← false; tolEM ← 1e-3
repeat
    {M-step}
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    {E-step}
    t_ik as per (8.1)
    L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y



            M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ, and the priors of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix ΘY. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

            E-Step

The E-step evaluates the posterior probability matrix T, using

    t_{ik} \propto \exp\left[ -\frac{d(x, \mu_k) - 2\log(\pi_k)}{2} \right].

The convergence of those t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.
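A generic form of such a sparsity-aware BIC can be sketched as follows. This is a rough sketch only: the effective number of parameters counts only the mean coordinates that remain in the model, which is the spirit of the criterion, but the exact parameter count used by Pan and Shen (2007) and in Mix-GLOSS may differ.

    import numpy as np

    def modified_bic(loglik, n, K, B, n_cov_params):
        """BIC = -2 log L + log(n) * d, with d counting only non-zero mean coordinates.
        B: p x (K-1) coefficient matrix; n_cov_params: parameters of the covariance model."""
        active = np.abs(B).sum(axis=1) > 0          # variables kept in the model
        d = (K - 1) + K * active.sum() + n_cov_params
        return -2.0 * loglik + np.log(n) * d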

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter.



[Figure 9.2 flowchart: inputs X, K, λ, EMITER_MAX, REP_Mix-GLOSS → initial Mix-GLOSS (λ = 0, REP_Mix-GLOSS = 20) → use B and T from the best repetition as StartB and StartT → Mix-GLOSS(λ, StartB, StartT) → compute BIC → choose λ = argmin_λ BIC → outputs: partition, t_ik, π_k, λ_BEST, B, Θ, D, L(θ), active set.]

Figure 9.2: Mix-GLOSS model selection diagram.

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Diagram 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


            10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations, except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.



Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Bienarcki et al., 2008). Further information can be found in the related paper (Maugis et al., 2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see



Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are the following:

• Clustering Error (in percentage). To measure the quality of the partition with a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Scholkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ (see the code sketch below).

• Number of Disposed Features. This value shows the number of variables whose coefficients have been zeroed; they are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of non-relevant variables that are (wrongly) selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
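To make these measures concrete, the following minimal Python sketch computes a label-permutation-invariant clustering error (matching clusters to classes with the Hungarian algorithm, which gives the same invariance as the measure of Wu and Scholkopf, 2007, although not necessarily their exact formulation) together with TPR and FPR from a mask of selected variables. The function names, the use of scipy and the toy values are our own illustrative choices, not code from the thesis.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Misclassification rate after optimally matching cluster IDs to class IDs."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # contingency[i, j] = number of points of class i assigned to cluster j
    contingency = np.array([[np.sum((y_true == c) & (y_pred == k))
                             for k in clusters] for c in classes])
    row, col = linear_sum_assignment(-contingency)   # maximize matched counts
    return 1.0 - contingency[row, col].sum() / len(y_true)

def tpr_fpr(selected, relevant):
    """TPR: fraction of relevant variables selected; FPR: fraction of irrelevant ones selected."""
    selected, relevant = np.asarray(selected, bool), np.asarray(relevant, bool)
    return selected[relevant].mean(), selected[~relevant].mean()

# toy check: p = 100 variables, the first 20 are relevant, 5 false positives
relevant = np.arange(100) < 20
selected = relevant.copy(); selected[50:55] = True
print(tpr_fpr(selected, relevant))                                        # (1.0, 0.0625)
print(clustering_error(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))   # 0.0
```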

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data.

                         Err (%)       Var           Time
Sim 1: K = 4, mean shift, ind. features
  CS general cov         46 (15)       985 (72)      884h
  Fisher EM              58 (87)       784 (52)      1645m
  Clustvarsel            602 (107)     378 (291)     383h
  LumiWCluster-Kuan      42 (68)       779 (4)       389s
  LumiWCluster-Wang      43 (69)       784 (39)      619s
  Mix-GLOSS              32 (16)       80 (09)       15h
Sim 2: K = 2, mean shift, dependent features
  CS general cov         154 (2)       997 (09)      783h
  Fisher EM              74 (23)       809 (28)      8m
  Clustvarsel            73 (2)        334 (207)     166h
  LumiWCluster-Kuan      64 (18)       798 (04)      155s
  LumiWCluster-Wang      63 (17)       799 (03)      14s
  Mix-GLOSS              77 (2)        841 (34)      2h
Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov         304 (57)      55 (468)      1317h
  Fisher EM              233 (65)      366 (55)      22m
  Clustvarsel            658 (115)     232 (291)     542h
  LumiWCluster-Kuan      323 (21)      80 (02)       83s
  LumiWCluster-Wang      308 (36)      80 (02)       1292s
  Mix-GLOSS              347 (92)      81 (88)       21h
Sim 4: K = 4, mean shift, ind. features
  CS general cov         626 (55)      999 (02)      112h
  Fisher EM              567 (104)     55 (48)       195m
  Clustvarsel            732 (4)       24 (12)       767h
  LumiWCluster-Kuan      692 (112)     99 (2)        876s
  LumiWCluster-Wang      697 (119)     991 (21)      825s
  Mix-GLOSS              669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms.

              Simulation 1     Simulation 2     Simulation 3     Simulation 4
              TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
  MIX-GLOSS   992    015       828    335       884    67        780    12
  LUMI-KUAN   992    28        1000   02        1000   005       50     005
  FISHER-EM   986    24        888    17        838    5825      620    4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (MIX-GLOSS, LUMI-KUAN, FISHER-EM) on Simulations 1 to 4.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm. Mix-GLOSS has a small edge in Simulation 1, LumiWCluster performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS recover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


            Conclusions

            Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows us to bring all the resources available for solving regression problems to the solution of linear discrimination problems. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness by inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

            Perspectives

Even if the preliminary results are optimistic, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species


based on shape and texture (Clemmensen et al., 2011), or Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-equivalent penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers is to model the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


            Appendix


            A Matrix Properties

Property 1. By definition, \(\Sigma_W\) and \(\Sigma_B\) are both symmetric matrices:
\[
\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top ,
\qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k\,(\mu_k-\bar{x})(\mu_k-\bar{x})^\top .
\]

Property 2. \(\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a\).

Property 3. \(\dfrac{\partial\, x^\top A x}{\partial x} = (A+A^\top)\,x\).

Property 4. \(\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top\).

Property 5. \(\dfrac{\partial\, a^\top X b}{\partial X} = ab^\top\).

Property 6. \(\dfrac{\partial}{\partial X}\,\mathrm{tr}\big(AX^{-1}B\big) = -\big(X^{-1}BAX^{-1}\big)^\top = -X^{-\top}A^\top B^\top X^{-\top}\).
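Some of these identities (Properties 4 and 6) are easy to verify numerically with central finite differences. The sketch below is our own quick sanity check, not part of the thesis code; the matrix sizes and the random data are arbitrary.

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
p = 4
A, B = rng.normal(size=(p, p)), rng.normal(size=(p, p))
X = rng.normal(size=(p, p)) + p * np.eye(p)        # well-conditioned test matrix
Xinv = np.linalg.inv(X)

# Property 4: d|X^{-1}|/dX = -|X^{-1}| (X^{-1})^T
g4 = num_grad(lambda M: np.linalg.det(np.linalg.inv(M)), X)
assert np.allclose(g4, -np.linalg.det(Xinv) * Xinv.T, atol=1e-5)

# Property 6: d tr(A X^{-1} B)/dX = -(X^{-1} B A X^{-1})^T
g6 = num_grad(lambda M: np.trace(A @ np.linalg.inv(M) @ B), X)
assert np.allclose(g6, -(Xinv @ B @ A @ Xinv).T, atol=1e-5)
print("matrix-calculus identities verified numerically")
```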


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form
\[
\min_{\theta_k,\,\beta_k}\ \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k
\tag{B.1}
\]
\[
\text{s.t.}\quad \theta_k^\top Y^\top Y\theta_k = 1, \qquad \theta_\ell^\top Y^\top Y\theta_k = 0 \quad \forall\,\ell<k ,
\]
for \(k=1,\ldots,K-1\).

The Lagrangian associated with Problem (B.1) is
\[
L_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k
+ \lambda_k\big(\theta_k^\top Y^\top Y\theta_k - 1\big) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k .
\tag{B.2}
\]
Setting the gradient of (B.2) with respect to \(\beta_k\) to zero gives the value of the optimal \(\beta_k\):
\[
\beta_k^\star = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k .
\tag{B.3}
\]

The objective function of (B.1) evaluated at \(\beta_k^\star\) is
\[
\min_{\theta_k}\ \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star
= \min_{\theta_k}\ \theta_k^\top Y^\top\big(I - X(X^\top X+\Omega_k)^{-1}X^\top\big)Y\theta_k
= \max_{\theta_k}\ \theta_k^\top Y^\top X(X^\top X+\Omega_k)^{-1}X^\top Y\theta_k .
\tag{B.4}
\]
If the penalty matrix \(\Omega_k\) is identical for all problems, \(\Omega_k=\Omega\), then (B.4) corresponds to an eigen-problem where the \(k\) score vectors \(\theta_k\) are the eigenvectors of \(Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\).

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like \(Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\) is not trivial due to the \(p\times p\) inverse. With some datasets \(p\) can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let \(M\) be the matrix \(Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\), such that we can rewrite expression (B.4) in a compact way:
\[
\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ \mathrm{tr}\big(\Theta^\top M\Theta\big)
\quad\text{s.t.}\quad \Theta^\top Y^\top Y\Theta = I_{K-1} .
\tag{B.5}
\]
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the \((K-1)\times(K-1)\) matrix \(M_\Theta\) be \(\Theta^\top M\Theta\). Hence the classical eigenvector formulation associated with (B.5) is
\[
M_\Theta v = \lambda v ,
\tag{B.6}
\]
where \(v\) is the eigenvector and \(\lambda\) the associated eigenvalue of \(M_\Theta\). Operating,
\[
v^\top M_\Theta v = \lambda \iff v^\top\Theta^\top M\Theta v = \lambda .
\]
Making the variable change \(w = \Theta v\), we obtain an alternative eigenproblem where the \(w\) are the eigenvectors of \(M\) and \(\lambda\) the associated eigenvalue:
\[
w^\top M w = \lambda .
\tag{B.7}
\]
Therefore \(v\) are the eigenvectors of the eigen-decomposition of the matrix \(M_\Theta\), and \(w\) are the eigenvectors of the eigen-decomposition of the matrix \(M\). Note that the only difference between the \((K-1)\times(K-1)\) matrix \(M_\Theta\) and the \(K\times K\) matrix \(M\) is the \(K\times(K-1)\) matrix \(\Theta\) in the expression \(M_\Theta = \Theta^\top M\Theta\). Then, to avoid the computation of the \(p\times p\) inverse \((X^\top X+\Omega)^{-1}\), we can use the optimal value of the coefficient matrix \(B^\star = (X^\top X+\Omega)^{-1}X^\top Y\Theta\) in \(M_\Theta\):
\[
M_\Theta = \Theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top X B^\star .
\]
Thus, the eigen-decomposition of the \((K-1)\times(K-1)\) matrix \(M_\Theta = \Theta^\top Y^\top X B^\star\) results in the \(v\) eigenvectors of (B.6). To obtain the \(w\) eigenvectors of the alternative formulation (B.7), the variable change \(w = \Theta v\) needs to be undone.

To summarize, we calculate the \(v\) eigenvectors as the eigen-decomposition of the tractable matrix \(M_\Theta\), evaluated as \(\Theta^\top Y^\top X B^\star\). Then the definitive eigenvectors \(w\) are recovered by \(w = \Theta v\). The final step is the reconstruction of the optimal score matrix using the vectors \(w\) as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial \(\Theta\) by the eigenvector matrix \(V\) from decomposition (B.6) reverses the change of variable to restore the \(w\) vectors. The \(B^\star\) matrix also needs to be "updated", by multiplying it by the same matrix of eigenvectors \(V\), in order to account for the initial \(\Theta\) matrix used in the first computation of \(B^\star\):
\[
B^\star \leftarrow (X^\top X+\Omega)^{-1}X^\top Y\Theta V = B^\star V .
\]
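The steps above can be illustrated with a short numerical sketch. This is our own code, not the GLOSS implementation: the ridge penalty Ω = λI, the column-centering of X, and the construction of an initial score matrix satisfying the normalization constraint are illustrative choices made so that the small eigen-problem reproduces the non-trivial eigenvalues of the direct K × K generalized problem.

```python
import numpy as np
from scipy.linalg import eigh, null_space

rng = np.random.default_rng(0)
n, p, K, lam = 60, 150, 3, 1.0
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                               # column-centered design
Y = np.eye(K)[rng.integers(0, K, n)]              # n x K indicator matrix
D = Y.T @ Y                                       # diagonal matrix of class counts

# initial Theta0: Theta0' D Theta0 = I, D-orthogonal to the constant score
Dsqrt = np.sqrt(np.diag(D))
Q = null_space(Dsqrt[None, :])                    # K x (K-1) orthonormal basis
Theta0 = Q / Dsqrt[:, None]                       # = D^{-1/2} Q

# B* = (X'X + Omega)^{-1} X'Y Theta0, via a linear solve (no explicit p x p inverse)
Omega = lam * np.eye(p)
B = np.linalg.solve(X.T @ X + Omega, X.T @ (Y @ Theta0))

# small (K-1)x(K-1) matrix M_Theta = Theta0' Y' X B* and its eigen-decomposition
M_Theta = Theta0.T @ Y.T @ X @ B
evals, V = eigh((M_Theta + M_Theta.T) / 2)

# "update" the scores and the coefficients
Theta, B = Theta0 @ V, B @ V

# check against the direct K x K generalized problem M w = lambda (Y'Y) w
M = Y.T @ X @ np.linalg.solve(X.T @ X + Omega, X.T @ Y)
print(np.sort(evals))
print(np.sort(eigh((M + M.T) / 2, D, eigvals_only=True))[1:])   # drop the trivial null direction
```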


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the Optimal Scoring literature, the score matrix that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix \(M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\).

By definition of the eigen-decomposition, the eigenvectors of the matrix \(M\) (called \(w\) in (B.7)) form a basis, so that any score vector \(\theta\) can be expressed as a linear combination of them:
\[
\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m , \qquad \text{s.t. } \theta_k^\top\theta_k = 1 .
\tag{B.8}
\]
The score vectors' normalization constraint \(\theta_k^\top\theta_k = 1\) can also be expressed as a function of this basis,
\[
\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big)^{\!\top}\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big) = 1 ,
\]
which, as per the eigenvector properties, can be reduced to
\[
\sum_{m=1}^{K-1}\alpha_m^2 = 1 .
\tag{B.9}
\]
Let \(M\) be multiplied by a score vector \(\theta_k\), which can be replaced by its linear combination of eigenvectors \(w_m\) (B.8):
\[
M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m .
\]
As the \(w_m\) are the eigenvectors of the matrix \(M\), the relationship \(Mw_m = \lambda_m w_m\) can be used to obtain
\[
M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m .
\]
Multiplying on the left by \(\theta_k^\top\), expressed as its linear combination of eigenvectors,
\[
\theta_k^\top M\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\Big)^{\!\top}\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\Big) .
\]
This equation can be simplified using the orthogonality property of eigenvectors, according to which \(w_\ell^\top w_m\) is zero for any \(\ell\neq m\), giving
\[
\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m .
\]
The optimization problem (B.5) for discriminant direction \(k\) can then be rewritten as
\[
\max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \theta_k^\top M\theta_k
= \max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m ,
\qquad\text{with } \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \ \text{ and } \ \sum_{m=1}^{K-1}\alpha_m^2 = 1 .
\tag{B.10}
\]
One way of maximizing Problem (B.10) is choosing \(\alpha_m = 1\) for \(m = k\) and \(\alpha_m = 0\) otherwise. Hence, as \(\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m\), the resulting score vector \(\theta_k\) will be equal to the \(k\)th eigenvector \(w_k\).

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix \(M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\).

C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:
\[
\max_{\beta\in\mathbb{R}^p}\ \beta^\top\Sigma_B\beta
\tag{C.1a}
\]
\[
\text{s.t.}\quad \beta^\top\Sigma_W\beta = 1 ,
\tag{C.1b}
\]
where \(\Sigma_B\) and \(\Sigma_W\) are respectively the between-class variance and the within-class variance of the original \(p\)-dimensional data.

The Lagrangian of Problem (C.1) is
\[
L(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\,\big(\beta^\top\Sigma_W\beta - 1\big) ,
\]
so that its first derivative with respect to \(\beta\) is
\[
\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta .
\]
A necessary optimality condition for \(\beta\) is that this derivative is zero, that is,
\[
\Sigma_B\beta = \nu\,\Sigma_W\beta .
\]
Provided \(\Sigma_W\) is full rank, we have
\[
\Sigma_W^{-1}\Sigma_B\beta = \nu\,\beta .
\tag{C.2}
\]
Thus, the solutions \(\beta\) match the definition of an eigenvector of the matrix \(\Sigma_W^{-1}\Sigma_B\) with eigenvalue \(\nu\). To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\[
\beta^\top\Sigma_B\beta = \beta^\top\Sigma_W\Sigma_W^{-1}\Sigma_B\beta
= \nu\,\beta^\top\Sigma_W\beta \quad\text{from (C.2)}
= \nu \quad\text{from (C.1b)} .
\]
That is, the optimal value of the objective function to be maximized is the eigenvalue \(\nu\). Hence \(\nu\) is the largest eigenvalue of \(\Sigma_W^{-1}\Sigma_B\), and \(\beta\) is any eigenvector corresponding to this maximal eigenvalue.
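As a quick numerical illustration (our own sketch, with arbitrary toy data and variable names), the generalized eigenproblem \(\Sigma_B\beta = \nu\Sigma_W\beta\) can be solved directly with scipy's symmetric generalized eigensolver:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, p, K = 300, 5, 3
y = rng.integers(0, K, n)
X = rng.normal(size=(n, p))
X[np.arange(n), y] += 2.0          # class k is shifted along coordinate k

# within- and between-class covariance matrices (Property 1, Appendix A)
xbar = X.mean(axis=0)
Sigma_W, Sigma_B = np.zeros((p, p)), np.zeros((p, p))
for k in range(K):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
    Sigma_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar) / n

# generalized eigenproblem Sigma_B beta = nu Sigma_W beta; keep the leading eigenvector
nus, betas = eigh(Sigma_B, Sigma_W)
beta, nu = betas[:, -1], nus[-1]
print(nu, (beta @ Sigma_B @ beta) / (beta @ Sigma_W @ beta))   # both equal the largest eigenvalue
```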


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
\[
\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j}
\tag{D.1a}
\]
\[
\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1 ,
\tag{D.1b}
\]
\[
\qquad\ \ \tau_j \ge 0,\ \ j=1,\ldots,p .
\tag{D.1c}
\]
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let \(B\in\mathbb{R}^{p\times(K-1)}\) be a matrix composed of row vectors \(\beta^j\in\mathbb{R}^{K-1}\), \(B = \big(\beta^{1\top},\ldots,\beta^{p\top}\big)^\top\).
\[
L(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j}
+ \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j .
\tag{D.2}
\]
The starting point is the Lagrangian (D.2), which is differentiated with respect to \(\tau_j\) to get the optimal value \(\tau_j^\star\):
\[
\left.\frac{\partial L(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\right|_{\tau_j=\tau_j^\star} = 0
\;\Rightarrow\; -\lambda w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0 .
\]
The last two expressions are related through one property of the Lagrange multipliers, which states that \(\nu_j g_j(\tau^\star)=0\), where \(\nu_j\) is the Lagrange multiplier and \(g_j(\tau)\) is the inequality Lagrange condition. Then the optimal \(\tau_j^\star\) can be deduced:
\[
\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^j\|_2 .
\]
Placing this optimal value of \(\tau_j^\star\) into constraint (D.1b),
\[
\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2} .
\tag{D.3}
\]


With this value of \(\tau_j\), Problem (D.1) is equivalent to
\[
\min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)^{2} .
\tag{D.4}
\]
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null vectors \(\beta^j\).

The penalty term of (D.1a) can be conveniently presented as \(\lambda B^\top\Omega B\), where
\[
\Omega = \mathrm{diag}\Big(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\ldots,\frac{w_p^2}{\tau_p}\Big) .
\tag{D.5}
\]
Using the value of \(\tau_j^\star\) from (D.3), each diagonal component of \(\Omega\) is
\[
(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2} .
\tag{D.6}
\]
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all \(B\in\mathbb{R}^{p\times(K-1)}\), the subdifferential of the objective function of Problem (D.4) is
\[
\Big\{V\in\mathbb{R}^{p\times(K-1)} :\ V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)G\Big\} ,
\tag{D.7}
\]
where \(G = \big(g^{1\top},\ldots,g^{p\top}\big)^\top\) is a \(p\times(K-1)\) matrix defined as follows. Let \(S(B)\) denote the row-wise support of \(B\), \(S(B) = \{j\in\{1,\ldots,p\} : \|\beta^j\|_2\neq 0\}\); then we have
\[
\forall j\in S(B),\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j ,
\tag{D.8}
\]
\[
\forall j\notin S(B),\quad \|g^j\|_2 \le w_j .
\tag{D.9}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if \(J(B)\) is strictly convex. All critical points \(B^\star\) of the objective function verifying the following conditions are global minima. Let \(S(B^\star)\) denote the row-wise support of \(B^\star\), \(S(B^\star) = \{j\in\{1,\ldots,p\} : \|\beta^{\star j}\|_2\neq 0\}\), and let \(\bar{S}(B^\star)\) be its complement; then we have
\[
\forall j\in S(B^\star),\quad -\frac{\partial J(B^\star)}{\partial\beta^j}
= 2\lambda\Big(\sum_{j'=1}^{p}w_{j'}\|\beta^{\star j'}\|_2\Big)\,w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} ,
\tag{D.10a}
\]
\[
\forall j\in\bar{S}(B^\star),\quad \Big\|\frac{\partial J(B^\star)}{\partial\beta^j}\Big\|_2
\le 2\lambda\,w_j\Big(\sum_{j'=1}^{p}w_{j'}\|\beta^{\star j'}\|_2\Big) .
\tag{D.10b}
\]

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at \(\tau^\star\) such that
\[
\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2} .
\]
Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let \(\tau\in\mathbb{R}^p\) be any feasible vector; we have
\[
\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)^{2}
= \Big(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^{2}
\le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Big)
\le \sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j} ,
\]
where we used the Cauchy-Schwarz inequality for the first inequality and the definition of the feasibility set of \(\tau\) for the second one.
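The tightness of this bound at τ* is easy to check numerically. The short sketch below (our own, with unit group weights and a random coefficient matrix) evaluates the variational penalty at τ* from (D.3) and compares it with the squared group-Lasso penalty of (D.4):

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 10, 4
B = rng.normal(size=(p, K - 1))
B[rng.random(p) < 0.4] = 0.0                  # some null rows, as in a sparse solution
w = np.ones(p)                                # unit group weights for simplicity

norms = np.linalg.norm(B, axis=1)             # ||beta^j||_2 for each row
tau = w * norms / np.sum(w * norms)           # optimal tau_j* from (D.3)

active = norms > 0                            # null rows contribute 0 to both penalties
variational = np.sum((w[active] * norms[active]) ** 2 / tau[active])
squared_group_lasso = np.sum(w * norms) ** 2
print(variational, squared_group_lasso)       # identical up to rounding
```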


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B0 are optimal for the score values Θ0, and if the optimal scores Θ* are obtained by a unitary transformation of Θ0, say Θ* = Θ0V (where V ∈ R^{M×M} is a unitary matrix), then B* = B0V is optimal conditionally on Θ*, that is, (Θ*, B*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let \(\hat{B}\) be a solution of
\[
\min_{B\in\mathbb{R}^{p\times M}}\ \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2 ,
\tag{E.1}
\]
and let \(\tilde{Y} = YV\), where \(V\in\mathbb{R}^{M\times M}\) is a unitary matrix. Then \(\tilde{B} = \hat{B}V\) is a solution of
\[
\min_{B\in\mathbb{R}^{p\times M}}\ \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2 .
\tag{E.2}
\]

Proof. The first-order necessary optimality conditions for \(\hat{B}\) are
\[
\forall j\in S(\hat{B}),\quad 2\,x^{j\top}\big(X\hat{B} - Y\big) + \lambda w_j\|\hat{\beta}^j\|_2^{-1}\hat{\beta}^j = 0 ,
\tag{E.3a}
\]
\[
\forall j\in\bar{S}(\hat{B}),\quad 2\,\big\|x^{j\top}\big(X\hat{B} - Y\big)\big\|_2 \le \lambda w_j ,
\tag{E.3b}
\]
where \(S(\hat{B})\subseteq\{1,\ldots,p\}\) denotes the set of non-zero row vectors of \(\hat{B}\) and \(\bar{S}(\hat{B})\) is its complement.

First, we note that, from the definition of \(\tilde{B}\), we have \(S(\tilde{B}) = S(\hat{B})\). Then we may rewrite the above conditions as follows:
\[
\forall j\in S(\tilde{B}),\quad 2\,x^{j\top}\big(X\tilde{B} - \tilde{Y}\big) + \lambda w_j\|\tilde{\beta}^j\|_2^{-1}\tilde{\beta}^j = 0 ,
\tag{E.4a}
\]
\[
\forall j\in\bar{S}(\tilde{B}),\quad 2\,\big\|x^{j\top}\big(X\tilde{B} - \tilde{Y}\big)\big\|_2 \le \lambda w_j ,
\tag{E.4b}
\]
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by \(V\), and also uses that \(VV^\top = I\), so that \(\forall u\in\mathbb{R}^M\), \(\|u^\top\|_2 = \|u^\top V\|_2\); Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for \(\tilde{B}\) to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
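This invariance is easy to verify numerically. With uniform weights w_j = 1, the group-Lasso over the rows of B is what scikit-learn's MultiTaskLasso solves (up to the 1/(2n) scaling of the data-fit term, which does not affect the argument). The sketch below is our own check, not the thesis code; the toy data, the penalty level and the tolerances are arbitrary.

```python
import numpy as np
from scipy.stats import ortho_group
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, p, M = 100, 30, 3
X = rng.normal(size=(n, p))
Y = X[:, :5] @ rng.normal(size=(5, M)) + 0.1 * rng.normal(size=(n, M))

V = ortho_group.rvs(M, random_state=0)        # a random unitary (orthogonal) M x M matrix

# group-Lasso with uniform weights, as implemented by MultiTaskLasso
fit = lambda targets: MultiTaskLasso(alpha=0.5, fit_intercept=False,
                                     max_iter=10000, tol=1e-10).fit(X, targets).coef_.T

B_hat = fit(Y)            # p x M solution for Y
B_tilde = fit(Y @ V)      # p x M solution for the rotated targets YV

print(np.allclose(B_tilde, B_hat @ V, atol=1e-4))      # True: the solution simply rotates
print(np.flatnonzero(np.linalg.norm(B_hat, axis=1)),
      np.flatnonzero(np.linalg.norm(B_tilde, axis=1)))  # same row support
```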


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:
\[
L(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Big) ,
\tag{F.1}
\]
\[
Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big) ,
\tag{F.2}
\]
\[
\text{with}\quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_{\ell}\pi'_\ell f_\ell(x_i;\theta'_\ell)} .
\tag{F.3}
\]
In the EM algorithm, θ′ denotes the model parameters at the previous iteration, the \(t_{ik}(\theta')\) are the posterior probability values computed from θ′ at the previous E-step, and θ, without "prime", denotes the parameters of the current iteration, to be obtained by the maximization of Q(θ, θ′).

Using (F.3), we have
\[
Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big)
= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big)
+ \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_{\ell}\pi_\ell f_\ell(x_i;\theta_\ell)\Big)
= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta) .
\]
In particular, after the evaluation of the \(t_{ik}\) in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:
\[
L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big)
= Q(\theta,\theta) + H(T) .
\]
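This identity can be checked directly. The following sketch (our own illustration, with arbitrary parameters and a common covariance matrix) computes the log-likelihood both from its definition (F.1) and as Q(θ, θ) + H(T):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n, p, K = 200, 2, 3
X = rng.normal(size=(n, p)) + rng.integers(0, K, n)[:, None]   # arbitrary data

# arbitrary current parameters theta = (pi_k, mu_k, Sigma), common covariance
pi = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, p))
Sigma = np.eye(p)

# log(pi_k f_k(x_i; theta_k)) for every sample and component
log_pf = np.column_stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma)
                          for k in range(K)])

# E-step: posterior probabilities t_ik(theta)
log_T = log_pf - logsumexp(log_pf, axis=1, keepdims=True)
T = np.exp(log_T)

# direct log-likelihood (F.1) versus Q(theta, theta) + entropy of the posteriors
loglik_direct = logsumexp(log_pf, axis=1).sum()
Q = np.sum(T * log_pf)
H = -np.sum(T * log_T)
print(loglik_direct, Q + H)    # identical up to rounding
```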


            G Derivation of the M-Step Equations

This appendix shows the whole process of obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as
\[
Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big)
= \sum_{k}\log\Big(\pi_k^{\sum_i t_{ik}}\Big) - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma|
- \frac{1}{2}\sum_{i,k} t_{ik}\,(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k) ,
\]
which has to be maximized subject to \(\sum_k\pi_k = 1\).

The Lagrangian of this problem is
\[
L(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_k\pi_k - 1\Big) .
\]
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of \(\pi_k\), \(\mu_k\) and \(\Sigma\).

G.1 Prior probabilities

\[
\frac{\partial L(\theta)}{\partial\pi_k} = 0 \iff \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0 ,
\]
where \(\lambda\) is identified from the constraint, leading to
\[
\pi_k = \frac{1}{n}\sum_i t_{ik} .
\]


G.2 Means

\[
\frac{\partial L(\theta)}{\partial\mu_k} = 0 \iff -\frac{1}{2}\sum_i t_{ik}\,2\,\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}\,x_i}{\sum_i t_{ik}} .
\]

G.3 Covariance Matrix

\[
\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0 \iff
\underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}}
- \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top .
\]
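These three updates translate directly into numpy. The sketch below is our own transcription, not the Mix-GLOSS code; the responsibilities T would come from the preceding E-step.

```python
import numpy as np

def m_step(X, T):
    """M-step for a Gaussian mixture with a common covariance matrix.
    X: (n, p) data; T: (n, K) posterior probabilities t_ik from the E-step."""
    n, p = X.shape
    nk = T.sum(axis=0)                        # soft counts, one per component
    pi = nk / n                               # prior probabilities
    mu = (T.T @ X) / nk[:, None]              # component means
    Sigma = np.zeros((p, p))                  # pooled (common) covariance
    for k in range(T.shape[1]):
        R = X - mu[k]
        Sigma += (T[:, k, None] * R).T @ R
    return pi, mu, Sigma / n

# toy usage with arbitrary responsibilities
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
T = rng.dirichlet(np.ones(4), size=100)       # rows sum to one
pi, mu, Sigma = m_step(X, T)
print(pi.sum(), mu.shape, Sigma.shape)        # 1.0 (4, 3) (3, 3)
```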


            Bibliography

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Bienarcki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, arXiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.

T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.

J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, arXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.

T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision, ICCV 2007, pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.

Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.

M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.

                                                                                                                    • Invariance of the Group-Lasso to Unitary Transformations
                                                                                                                    • Expected Complete Likelihood and Likelihood
                                                                                                                    • Derivation of the M-Step Equations
                                                                                                                      • Prior probabilities
                                                                                                                      • Means
                                                                                                                      • Covariance Matrix
                                                                                                                          • Bibliography

              Contents

4.1.3 Penalized Linear Discriminant Analysis 39
4.1.4 Summary 40
4.2 Practicalities 41
4.2.1 Solution of the Penalized Optimal Scoring Regression 41
4.2.2 Distance Evaluation 42
4.2.3 Posterior Probability Evaluation 43
4.2.4 Graphical Representation 43
4.3 From Sparse Optimal Scoring to Sparse LDA 43
4.3.1 A Quadratic Variational Form 44
4.3.2 Group-Lasso OS as Penalized LDA 47

5 GLOSS Algorithm 49
5.1 Regression Coefficients Updates 49
5.1.1 Cholesky decomposition 52
5.1.2 Numerical Stability 52
5.2 Score Matrix 52
5.3 Optimality Conditions 53
5.4 Active and Inactive Sets 54
5.5 Penalty Parameter 54
5.6 Options and Variants 55
5.6.1 Scaling Variables 55
5.6.2 Sparse Variant 55
5.6.3 Diagonal Variant 55
5.6.4 Elastic net and Structured Variant 55

6 Experimental Results 57
6.1 Normalization 57
6.2 Decision Thresholds 57
6.3 Simulated Data 58
6.4 Gene Expression Data 60
6.5 Correlated Data 63
Discussion 63

III Sparse Clustering Analysis 67

Abstract 69

7 Feature Selection in Mixture Models 71
7.1 Mixture Models 71
7.1.1 Model 71
7.1.2 Parameter Estimation: The EM Algorithm 72
7.2 Feature Selection in Model-Based Clustering 75
7.2.1 Based on Penalized Likelihood 76
7.2.2 Based on Model Variants 77
7.2.3 Based on Model Selection 79

8 Theoretical Foundations 81
8.1 Resolving EM with Optimal Scoring 81
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
8.1.3 Clustering Using Penalized Optimal Scoring 82
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83
8.2 Optimized Criterion 83
8.2.1 A Bayesian Derivation 84
8.2.2 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 87
9.1 Mix-GLOSS 87
9.1.1 Outer Loop: Whole Algorithm Repetitions 87
9.1.2 Penalty Parameter Loop 88
9.1.3 Inner Loop: EM Algorithm 89
9.2 Model Selection 91

10 Experimental Results 93
10.1 Tested Clustering Algorithms 93
10.2 Results 95
10.3 Discussion 97

Conclusions 97

Appendix 103

A Matrix Properties 105

B The Penalized-OS Problem is an Eigenvector Problem 107
B.1 How to Solve the Eigenvector Decomposition 107
B.2 Why the OS Problem is Solved as an Eigenvector Problem 109

C Solving Fisher's Discriminant Problem 111

D Alternative Variational Formulation for the Group-Lasso 113
D.1 Useful Properties 114
D.2 An Upper Bound on the Objective Function 115

E Invariance of the Group-Lasso to Unitary Transformations 117

F Expected Complete Likelihood and Likelihood 119

G Derivation of the M-Step Equations 121
G.1 Prior probabilities 121
G.2 Means 122
G.3 Covariance Matrix 122

Bibliography 123

              List of Figures

1.1 MASH project logo 5

2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms ||β||_p 14
2.4 Two dimensional regularized problems with ||β||_1 and ||β||_2 penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20

4.1 Graphical representation of the variational approach to Group-Lasso 45

5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3 × 3 image 56

6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64

9.1 Mix-GLOSS Loops Scheme 88
9.2 Mix-GLOSS model selection diagram 92

10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97

              List of Tables

6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61

10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96

              Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

              Sets

N        the set of natural numbers, N = {1, 2, ...}
R        the set of reals
|A|      cardinality of a set A (for finite sets, the number of elements)
Ā        complement of set A

              Data

𝒳        input domain
x_i      input sample, x_i ∈ 𝒳
X        design matrix, X = (x_1^⊤, ..., x_n^⊤)^⊤
x^j      column j of X
y_i      class indicator of sample i
Y        indicator matrix, Y = (y_1^⊤, ..., y_n^⊤)^⊤
z        complete data, z = (x, y)
G_k      set of the indices of observations belonging to class k
n        number of examples
K        number of classes
p        dimension of 𝒳
i, j, k  indices running over N

Vectors, Matrices and Norms

0        vector with all entries equal to zero
1        vector with all entries equal to one
I        identity matrix
A^⊤      transpose of matrix A (ditto for vectors)
A^{-1}   inverse of matrix A
tr(A)    trace of matrix A
|A|      determinant of matrix A
diag(v)  diagonal matrix with v on the diagonal
‖v‖_1    L1 norm of vector v
‖v‖_2    L2 norm of vector v
‖A‖_F    Frobenius norm of matrix A


              Probability

E[·]        expectation of a random variable
var[·]      variance of a random variable
N(μ, σ²)    normal distribution with mean μ and variance σ²
W(W, ν)     Wishart distribution with ν degrees of freedom and scale matrix W
H(X)        entropy of random variable X
I(X; Y)     mutual information between random variables X and Y

              Mixture Models

y_ik         hard membership of sample i to cluster k
f_k          distribution function for cluster k
t_ik         posterior probability of sample i to belong to cluster k
T            posterior probability matrix
π_k          prior probability or mixture proportion for cluster k
μ_k          mean vector of cluster k
Σ_k          covariance matrix of cluster k
θ_k          parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)        parameter vector at iteration t of the EM algorithm
f(X; θ)      likelihood function
L(θ; X)      log-likelihood function
L_C(θ; X, Y) complete log-likelihood function

              Optimization

J(·)      cost function
L(·)      Lagrangian
β̂         generic notation for the solution with respect to β
β̂^ls      least squares solution coefficient vector
A         active set
γ         step size to update the regularization path
h         direction to update the regularization path


              Penalized models

λ, λ1, λ2      penalty parameters
P_λ(θ)         penalty term over a generic parameter vector
β_kj           coefficient j of discriminant vector k
β_k            kth discriminant vector, β_k = (β_k1, ..., β_kp)
B              matrix of discriminant vectors, B = (β_1, ..., β_{K−1})
β^j            jth row of B = (β^{1⊤}, ..., β^{p⊤})^⊤
B_LDA          coefficient matrix in the LDA domain
B_CCA          coefficient matrix in the CCA domain
B_OS           coefficient matrix in the OS domain
X_LDA          data matrix in the LDA domain
X_CCA          data matrix in the CCA domain
X_OS           data matrix in the OS domain
θ_k            score vector k
Θ              score matrix, Θ = (θ_1, ..., θ_{K−1})
Y              label matrix
Ω              penalty matrix
L_CP(θ; X, Z)  penalized complete log-likelihood function
Σ_B            between-class covariance matrix
Σ_W            within-class covariance matrix
Σ_T            total covariance matrix
Σ̂_B            sample between-class covariance matrix
Σ̂_W            sample within-class covariance matrix
Σ̂_T            sample total covariance matrix
Λ              inverse of covariance matrix, or precision matrix
w_j            weights
τ_j            penalty components of the variational approach


              Part I

              Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. The models and some basic concepts that will be used along this document are also detailed there, and the state of the art is reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


              1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiegne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of the website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.


Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset: instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Bienarcki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors; a small sketch of this coefficient is given after this list. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
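To make the comparison of extractors concrete, here is a minimal NumPy sketch of the RV coefficient between two tables describing the same objects, assuming the usual definition based on cross-product operators computed from column-centered tables; the function name and the toy data are illustrative and are not part of the MASH platform.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two tables describing the same n objects.

    X (n x p) and Y (n x q) only need to share the number of rows. Each table
    is column-centered and represented by its n x n cross-product operator,
    and the two operators are compared through a matrix correlation.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Ox = Xc @ Xc.T                       # operator of the first table
    Oy = Yc @ Yc.T                       # operator of the second table
    num = np.trace(Ox @ Oy)
    den = np.sqrt(np.trace(Ox @ Ox) * np.trace(Oy @ Oy))
    return num / den

# Toy usage: two "extractors" evaluated on the same 50 objects.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
B = A @ rng.normal(size=(10, 7))         # features derived from the same information
C = rng.normal(size=(50, 7))             # unrelated features
print(rv_coefficient(A, B))              # large value: similar extractors
print(rv_coefficient(A, C))              # small value: dissimilar extractors
```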

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


              2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined, while many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions turn those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina 2004, Fan and Fan 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and eases interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques for preprocessing data, used in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references that does not cover all the possibilities but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process for a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset or even subsequent subsets are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and stable; this happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a low sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_\beta \; J(\beta) + \lambda P(\beta) \qquad (2.1)

\min_\beta \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \qquad (2.2)

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, namely to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖_p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible set corresponding to each pure penalty is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ‖β‖_1 and ‖β‖_2 penalties

Regularizing a linear model with a norm like ‖β‖_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^ℓ1, and for the L2 regularization it is β^ℓ2. The solution β^ℓ1 is sparse because its second component is zero, while both components of β^ℓ2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with L_p norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex, hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖_0 = card{β_j | β_j ≠ 0}:

\min_\beta \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_\beta \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^p |\beta_j| \le t \qquad (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not so easy as a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts beside regression, particularly in the field of feature selection in supervised classification (Mai et al. 2012, Witten and Tibshirani 2011) and clustering (Roth and Lange 2004, Pan et al. 2006, Pan and Shen 2007, Zhou et al. 2009, Guo et al. 2010, Witten and Tibshirani 2010, Bouveyron and Brunet 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu 2000, Donoho et al. 2006, Meinshausen and Buhlmann 2006, Zhao and Yu 2007, Bach 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

\min_\beta \; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)

The effect of this penalty is the "equalization" of the components of the parameter vector that is being penalized. To enlighten this property, let us consider a least squares problem

\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 \qquad (2.7)

with solution β^ls = (XᵀX)⁻¹Xᵀy. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2

The solution to this problem is β^ℓ2 = (XᵀX + λI_p)⁻¹Xᵀy. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
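As a small illustration of this shift of the spectrum, the NumPy sketch below compares the eigenvalues of XᵀX and XᵀX + λI_p on a toy design with two strongly correlated columns and computes the ridge estimator in closed form; the data and the value of λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 30, 5, 10.0

# A design matrix with two nearly collinear columns makes X'X ill-conditioned.
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(size=n)

G = X.T @ X
print(np.linalg.eigvalsh(G))                    # smallest eigenvalue close to zero
print(np.linalg.eigvalsh(G + lam * np.eye(p)))  # every eigenvalue shifted up by lam

# Ridge estimator in closed form: (X'X + lam I)^{-1} X'y
beta_ridge = np.linalg.solve(G + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```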

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^p \frac{\beta_j^2}{(\hat\beta_j^{ls})^2} \qquad (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet 1998, Grandvalet and Canu 2002), where the penalty parameter differs on each component: every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although the L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of L_p norms is the infinity norm, defined as ‖x‖_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ‖β‖_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term in itself; however, it is frequently combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖_* of a norm ‖β‖ is defined as

\|\beta\|_* = \max_{w \in \mathbb{R}^p} \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1

In the case of an L_q norm with q ∈ [1, +∞], the dual norm is the L_r norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There are no reasons for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 \qquad (2.9)

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al. 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
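For a quick experiment, scikit-learn provides ElasticNet and Lasso estimators; note that their (alpha, l1_ratio) parameterization rescales the squared loss by 1/(2n) and mixes the two penalties differently from the (λ1, λ2) pair of (2.9), so the sketch below only illustrates the qualitative effect on a pair of strongly correlated variables, not an exact solution of (2.9).

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(2)
n, p = 40, 100
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)     # two nearly identical features
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# The pure L1 penalty tends to keep only one of the two correlated features,
# while the additional L2 term spreads the weight over both of them.
print(lasso.coef_[:2])
print(enet.coef_[:2])
```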


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓ-th process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β will be the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \right)^{1/r} \qquad (2.10)

The pair (r, s) identifies the norms that are combined: an L_s norm within groups and an L_r norm between groups. The L_s norm penalizes the variables in every group G_ℓ, while the L_r norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups. A small sketch evaluating this norm is given below.
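The short function below evaluates the mixed norm of (2.10) for an arbitrary partition of the coefficients into groups; the group structure and the (r, s) values are toy examples.

```python
import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    """Evaluate ||beta||_(r,s): an L_s norm within each group G_l, an L_r norm between groups."""
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.5, -2.0, 0.3, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

print(mixed_norm(beta, groups, r=1, s=2))   # group-Lasso norm ||beta||_(1,2)
print(np.sum(np.abs(beta)))                 # plain Lasso norm ||beta||_1, for comparison
```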

Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin 2006, Leng 2008, Xie et al. 2008a,b, Meier et al. 2008, Roth and Fischer 2008, Yang et al. 2010, Sanchez Merchante et al. 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L_{1,2} norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al. 2008) or ‖β‖_(1,∞) (Wang and Zhu 2008, Kuan et al. 2010, Vogt and Roth 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al. 2009), the composite absolute penalties (Zhao et al. 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al. 2010, Sprechmann et al. 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L_{1,2} or L_{1,∞} mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso and the Group-Lasso. (a) L1: Lasso; (b) L_{(1,2)}: group-Lasso

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters. (a) L1 induced sparsity; (b) L_{(1,2)} group induced sparsity

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints" implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha\,(s + \lambda s') \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)})

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

\beta_j = \frac{-\lambda\,\mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^n x_{ij}^2}

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating its value using an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^n x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex]
\dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^n x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex]
0 & \text{if } \left| \frac{\partial J(\beta)}{\partial \beta_j} \right| \le \lambda
\end{cases}
\qquad (2.11)

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin 2006, Wu and Lange 2008).
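As an illustration of update (2.11), the sketch below runs cyclic coordinate descent with soft-thresholding on the squared loss of (2.7) plus a Lasso penalty; it is a bare-bones teaching example with an arbitrary fixed number of sweeps, not the algorithm developed later in this thesis.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for  sum_i (y_i - x_i' beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # squared column norms (update denominators)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual: remove the current contribution of variable j
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 20))
true_beta = np.zeros(20)
true_beta[[0, 3, 7]] = [2.0, -1.5, 1.0]
y = X @ true_beta + 0.1 * rng.normal(size=50)
print(np.round(lasso_cd(X, y, lam=5.0), 2))  # only a few coordinates remain non-zero
```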

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero β_j; it is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions and L_{1,2} penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions.
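Stripped of all problem-specific details, a forward working-set loop could be organized as in the skeleton below, where solve_restricted and violation are placeholders for the optimization and optimality-check tasks described above (they are not functions of any existing library), and the tolerance and size limit are arbitrary.

```python
import numpy as np

def working_set_solver(X, y, lam, solve_restricted, violation, tol=1e-6, max_active=50):
    """Generic forward working-set loop for a sparse penalized problem.

    solve_restricted(X_A, y, lam, beta_init) -> coefficients on the active set only.
    violation(X, y, beta, lam) -> one optimality-violation score per variable.
    """
    p = X.shape[1]
    active = []                        # indices of the active set A
    beta = np.zeros(p)
    while len(active) < max_active:
        v = violation(X, y, beta, lam)
        v[active] = 0.0                # only inactive variables are candidates
        j = int(np.argmax(v))
        if v[j] <= tol:                # optimality conditions hold: beta is a solution
            break
        active.append(j)               # the most violating variable enters A
        # warm start from the previous coefficients of the active variables
        beta_active = solve_restricted(X[:, active], y, lam, beta[active])
        beta[:] = 0.0
        beta[active] = beta_active
    return beta, active
```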

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006, Smola et al. 2008, Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).

              Regularization Path The regularization path is the set of solutions that can be reachedwhen solving a series of optimization problems of the form (21) where the penaltyparameter λ is varied It is not an optimization technique per se but it is of practicaluse when the exact regularization path can be easily followed Rosset and Zhu (2007)stated that this path is piecewise linear for those problems where the cost function ispiecewise quadratic and the regularization term is piecewise linear (or vice-versa)

              This concept was firstly applied to Lasso algorithm of Osborne et al (2000b) Howeverit was after the publication of the algorithm called Least Angle Regression (LARS)developed by Efron et al (2004) that those techniques become popular LARS definesthe regularization path using active constraint techniques

              Once that an active set A(t) and its corresponding solution β(t) have been set lookingfor the regularization path means looking for a direction h and a step size γ to updatethe solution as β(t+1) = β(t) + γh Afterwards the active and inactive sets A(t+1) andA(t+1) are updated That can be done looking for the variables that strongly violate theoptimality conditions Hence LARS sets the update step size and which variable shouldenter in the active set from the correlation with residuals

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

\[
\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 .
\tag{2.12}
\]

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the current solution β(t), so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. This can be rewritten as

\[
\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L}\, P(\beta) .
\tag{2.13}
\]

The basic algorithm uses the solution to (2.13) as the next iterate β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: indeed, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
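As an illustration of the update (2.13), the sketch below assumes the squared loss $J(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$ and the Lasso penalty $P(\beta) = \|\beta\|_1$, whose proximal operator is soft-thresholding; the function name and the choice of L are ours.

import numpy as np

def proximal_gradient_lasso(X, y, lam, n_iter=200):
    # ISTA-style iterations: gradient step on J, then the prox of (lam/L)*||.||_1.
    p = X.shape[1]
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2        # upper bound on the Lipschitz constant of grad J
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)      # gradient of the smooth part J at beta^(t)
        z = beta - grad / L              # gradient step, as inside the norm of (2.13)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return beta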


              Part II

              Sparse Linear Discriminant Analysis


              Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models in terms of variables.

In this part, we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data and generate a graphical display. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which such linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of $n$ examples, with observations $x_i \in \mathbb{R}^p$ comprising $p$ features, and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the $K$ classes. It will be convenient to gather the observations in the $n \times p$ matrix $X = (x_1^\top, \ldots, x_n^\top)^\top$ and the corresponding labels in the $n \times K$ matrix $Y = (y_1^\top, \ldots, y_n^\top)^\top$.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

\[
\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} ,
\tag{3.1}
\]

where β is the discriminant direction used to project the data, and $\Sigma_B$ and $\Sigma_W$ are the $p \times p$ between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as

\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,
\qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,
\]

where μ is the sample mean of the whole dataset, $\mu_k$ the sample mean of class k, and $G_k$ indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors $\beta_k$ may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example the maximization of a trace ratio:

\[
\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\!\left(B^\top \Sigma_B B\right)}{\operatorname{tr}\!\left(B^\top \Sigma_W B\right)} ,
\tag{3.2}
\]

where the matrix B is built with the discriminant directions $\beta_k$ as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

\[
\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B \beta_k \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1 \\
& \beta_k^\top \Sigma_W \beta_\ell = 0 , \quad \forall \ell < k .
\end{aligned}
\tag{3.3}
\]

The maximizer of subproblem k is the eigenvector of $\Sigma_W^{-1} \Sigma_B$ associated with the kth largest eigenvalue (see Appendix C).
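The following short sketch makes this eigenvalue view concrete for the multi-class case; it computes $\Sigma_W$, $\Sigma_B$ and the K − 1 leading generalized eigenvectors (the small ridge added to $\Sigma_W$ and the function name are our own choices, not part of the original material).

import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, K):
    # Discriminant directions of (3.3): leading eigenvectors of Sigma_W^{-1} Sigma_B,
    # obtained from the generalized eigenproblem Sigma_B b = lambda Sigma_W b.
    # y is assumed to be an integer-coded label vector in {0, ..., K-1}.
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma_W = np.zeros((p, p))
    Sigma_B = np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sigma_W += (Xk - mu_k).T @ (Xk - mu_k)
        Sigma_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu)
    Sigma_W /= n
    Sigma_B /= n
    evals, V = eigh(Sigma_B, Sigma_W + 1e-8 * np.eye(p))   # ascending eigenvalues
    return V[:, ::-1][:, :K - 1]                            # K-1 leading directions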

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main purpose of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually assessed with univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

\[
\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_W \beta \\
\text{s.t.} \;\; & (\mu_1 - \mu_2)^\top \beta = 1 \\
& \textstyle\sum_{j=1}^{p} |\beta_j| \le t ,
\end{aligned}
\]

where $\mu_1$ and $\mu_2$ are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant rewritten in the form of K − 1 constrained and penalized maximization problems:

\[
\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1 .
\end{aligned}
\]

The term to maximize is the projected between-class covariance $\beta_k^\top \Sigma_B^k \beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top \Sigma_W \beta_k$. The penalty $P_k(\beta_k)$ is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant $\beta_k$ vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of $\Sigma_W$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_W^{-1}(\mu_1 - \mu_2)$, they estimate the product directly through constrained L1 minimization:

\[
\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \left\| \beta \right\|_1 \\
\text{s.t.} \;\; & \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda .
\end{aligned}
\]

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification, and for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity. However, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

              Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n × K matrix containing the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample i belongs to class k, and $y_{ik} = 0$ otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is $y_{ik} = 1$ if sample i belongs to class k and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, to extend Support Vector Machines to multi-class classification (Lee et al., 2004) or to generalize the kernel target alignment measure (Guermeur et al., 2004).
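Both codings are straightforward to build; the sketch below is only meant to fix the notation (function names are ours):

import numpy as np

def dummy_indicator(labels, K):
    # Binary (dummy) coding: Y[i, k] = 1 iff sample i belongs to class k, 0 otherwise.
    n = len(labels)
    Y = np.zeros((n, K))
    Y[np.arange(n), labels] = 1.0
    return Y

def symmetric_indicator(labels, K):
    # Alternative coding: 1 for the true class, -1/(K-1) for all the other classes.
    Y = np.full((len(labels), K), -1.0 / (K - 1))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y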

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to that of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is


obtained by solving

\[
\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,
\]

where $y_i$ is the binary indicator of the label of pattern $x_i$. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept $\beta_0$ is required.

              Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, aiming either at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\begin{align}
\min_{\Theta,\, B} \;\; & \left\| Y\Theta - XB \right\|_F^2 + \lambda \operatorname{tr}\!\left( B^\top \Omega B \right) \tag{3.4a} \\
\text{s.t.} \;\; & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{3.4b}
\end{align}

where $\Theta \in \mathbb{R}^{K \times (K-1)}$ are the class scores, $B \in \mathbb{R}^{p \times (K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

\begin{align}
\min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \;\; & \left\| Y\theta_k - X\beta_k \right\|^2 + \beta_k^\top \Omega \beta_k \tag{3.5a} \\
\text{s.t.} \;\; & n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 , \tag{3.5b} \\
& \theta_k^\top Y^\top Y \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , \tag{3.5c}
\end{align}

where each $\beta_k$ corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\[
\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_k \left\| Y\theta_k - X\beta_k \right\|_2^2 + \lambda_1 \left\| \beta_k \right\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,
\]

where $\lambda_1$ and $\lambda_2$ are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

\[
\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \left\| Y\theta_k - X\beta_k \right\|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K-1} \beta_{kj}^2 } ,
\tag{3.6}
\]

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and provide a publicly available, efficient code for solving this problem.


              4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can either be complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that

• there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;

• inputs are centered, that is, $X^\top \mathbf{1}_n = 0$;

• the quadratic penalty Ω is positive-semidefinite and such that $X^\top X + \Omega$ is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by $\beta_K = 0$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus

\begin{align}
\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & \left\| Y\theta - X\beta \right\|^2 + \beta^\top \Omega \beta \tag{4.1a} \\
\text{s.t.} \;\; & n^{-1}\, \theta^\top Y^\top Y \theta = 1 . \tag{4.1b}
\end{align}

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\[
\beta_{\mathrm{OS}} = \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta .
\tag{4.2}
\]

The objective function (4.1a) is then

\begin{align*}
\left\| Y\theta - X\beta_{\mathrm{OS}} \right\|^2 + \beta_{\mathrm{OS}}^\top \Omega \beta_{\mathrm{OS}}
&= \theta^\top Y^\top Y \theta - 2\, \theta^\top Y^\top X \beta_{\mathrm{OS}} + \beta_{\mathrm{OS}}^\top \left( X^\top X + \Omega \right) \beta_{\mathrm{OS}} \\
&= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta ,
\end{align*}

where the second line stems from the definition of $\beta_{\mathrm{OS}}$ (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\[
\max_{\theta\,:\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta ,
\tag{4.3}
\]

which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the kth largest eigenvector of $Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by

\[
(Y^\top Y)^{-1} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta = \alpha^2 \theta ,
\tag{4.4}
\]


where $\alpha^2$ is the maximal eigenvalue¹:

\begin{align}
n^{-1}\, \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2\, n^{-1}\, \theta^\top (Y^\top Y) \theta \notag \\
n^{-1}\, \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2 . \tag{4.5}
\end{align}

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

\begin{align}
\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & n^{-1}\, \theta^\top Y^\top X \beta \tag{4.6a} \\
\text{s.t.} \;\; & n^{-1}\, \theta^\top Y^\top Y \theta = 1 , \tag{4.6b} \\
& n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 . \tag{4.6c}
\end{align}

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

\begin{align*}
n L(\beta, \theta, \nu, \gamma) &= \theta^\top Y^\top X \beta - \nu \left( \theta^\top Y^\top Y \theta - n \right) - \gamma \left( \beta^\top (X^\top X + \Omega) \beta - n \right) \\
\Rightarrow\; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} &= X^\top Y \theta - 2\gamma\, (X^\top X + \Omega) \beta \\
\Rightarrow\; \beta_{\mathrm{CCA}} &= \frac{1}{2\gamma} (X^\top X + \Omega)^{-1} X^\top Y \theta .
\end{align*}

Then, as $\beta_{\mathrm{CCA}}$ obeys (4.6c), we obtain

\[
\beta_{\mathrm{CCA}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} ,
\tag{4.7}
\]

so that the optimal objective function (4.6a) can be expressed with θ alone:

\begin{align*}
n^{-1}\, \theta^\top Y^\top X \beta_{\mathrm{CCA}}
&= \frac{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} \\
&= \sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta} ,
\end{align*}

and the optimization problem with respect to θ can be restated as

\[
\max_{\theta\,:\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta .
\tag{4.8}
\]

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\[
\beta_{\mathrm{OS}} = \alpha\, \beta_{\mathrm{CCA}} ,
\tag{4.9}
\]

¹The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).


where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\begin{align}
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} &= Y^\top X \beta - 2\nu\, Y^\top Y \theta \notag \\
\Rightarrow\; \theta_{\mathrm{CCA}} &= \frac{1}{2\nu} (Y^\top Y)^{-1} Y^\top X \beta . \tag{4.10}
\end{align}

Then, as $\theta_{\mathrm{CCA}}$ obeys (4.6b), we obtain

\[
\theta_{\mathrm{CCA}} = \frac{(Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} ,
\tag{4.11}
\]

leading to the following expression of the optimal objective function:

\begin{align*}
n^{-1}\, \theta_{\mathrm{CCA}}^\top Y^\top X \beta
&= \frac{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} \\
&= \sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta} .
\end{align*}

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

\begin{align}
\max_{\beta \in \mathbb{R}^p} \;\; & n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta \tag{4.12a} \\
\text{s.t.} \;\; & n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 , \tag{4.12b}
\end{align}

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{\mathrm{CCA}}$ verifies

\[
n^{-1}\, X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{CCA}} = \lambda \left( X^\top X + \Omega \right) \beta_{\mathrm{CCA}} ,
\tag{4.13}
\]

where λ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:

\begin{align*}
& n^{-1}\, \beta_{\mathrm{CCA}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{CCA}} = \lambda \\
\Rightarrow\; & n^{-1} \alpha^{-1}\, \beta_{\mathrm{CCA}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow\; & n^{-1} \alpha\, \beta_{\mathrm{CCA}}^\top X^\top Y \theta = \lambda \\
\Rightarrow\; & n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow\; & \alpha^2 = \lambda .
\end{align*}

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), where the denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

\begin{align}
\max_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_B \beta \tag{4.14a} \\
\text{s.t.} \;\; & \beta^\top \left( \Sigma_W + n^{-1} \Omega \right) \beta = 1 , \tag{4.14b}
\end{align}

where $\Sigma_B$ and $\Sigma_W$ are respectively the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation, using the projection operator $Y (Y^\top Y)^{-1} Y^\top$:

\begin{align*}
\Sigma_T &= \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top = n^{-1}\, X^\top X \\
\Sigma_B &= \frac{1}{n} \sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1}\, X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \\
\Sigma_W &= \frac{1}{n} \sum_{k=1}^{K} \sum_{i\,:\,y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1} \left( X^\top X - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) .
\end{align*}

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\begin{align*}
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{LDA}} &= \lambda \left( X^\top X + \Omega - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) \beta_{\mathrm{LDA}} \\
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{LDA}} &= \frac{\lambda}{1 - \lambda} \left( X^\top X + \Omega \right) \beta_{\mathrm{LDA}} .
\end{align*}

The comparison of the last equation with the equation verified by $\beta_{\mathrm{CCA}}$ (4.13) shows that $\beta_{\mathrm{LDA}}$ and $\beta_{\mathrm{CCA}}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that

\begin{align*}
\beta_{\mathrm{LDA}} &= (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{CCA}} \\
&= \alpha^{-1} (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{OS}} ,
\end{align*}

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

\begin{align*}
\min_{\Theta,\, B} \;\; & \left\| Y\Theta - XB \right\|_F^2 + \lambda \operatorname{tr}\!\left( B^\top \Omega B \right) \\
\text{s.t.} \;\; & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} .
\end{align*}

Let A represent the $(K-1) \times (K-1)$ diagonal matrix with elements $\alpha_k$, the square root of the kth largest eigenvalue of $Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y$; we have

\begin{align}
B_{\mathrm{LDA}} &= B_{\mathrm{CCA}} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \notag \\
&= B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} , \tag{4.15}
\end{align}

where $I_{K-1}$ is the $(K-1) \times (K-1)$ identity matrix.

At this point, the feature matrix X, which in the input space has dimensions n × p, can be projected into the optimal scoring domain as an n × (K − 1) matrix $X_{\mathrm{OS}} = X B_{\mathrm{OS}}$, or into the linear discriminant analysis space as an n × (K − 1) matrix $X_{\mathrm{LDA}} = X B_{\mathrm{LDA}}$. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta ,
\]
where Θ are the K − 1 leading eigenvectors of $Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y$.

2. Translate the data samples X into the LDA domain as $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, where $D = A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}}$.

3. Compute the matrix M of centroids $\mu_k$ from $X_{\mathrm{LDA}}$ and Y.

4. Evaluate the distance $d(x, \mu_k)$ in the LDA domain as a function of M and $X_{\mathrm{LDA}}$.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, produce a graphical representation of the data.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

\begin{align}
\min_{\Theta \in \mathbb{R}^{K \times (K-1)},\, B \in \mathbb{R}^{p \times (K-1)}} \;\; & \left\| Y\Theta - XB \right\|_F^2 + \lambda \operatorname{tr}\!\left( B^\top \Omega B \right) \tag{4.16a} \\
\text{s.t.} \;\; & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{4.16b}
\end{align}

where Θ are the class scores, B the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal $B_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to Θ, and the optimum with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to $\Theta^0$ such that $n^{-1}\, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$.

2. Compute $B = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta^0$.

3. Set Θ to be the K − 1 leading eigenvectors of $Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y$.

4. Compute the optimal regression coefficients
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta .
\tag{4.17}
\]

Defining $\Theta^0$ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top} Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta^0$, which is computed as $\Theta^{0\top} Y^\top X B$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
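For concreteness, the following numpy sketch runs these four steps for a quadratic penalty; it is only an illustration under our own conventions (dummy indicator Y, centered X, and a particular choice of $\Theta^0$ orthonormal in the $Y^\top Y / n$ metric), not the GLOSS implementation itself.

import numpy as np

def penalized_os(X, Y, lam, Omega):
    # Quadratically penalized optimal scoring, following the four steps above.
    # X (n x p) is assumed centered; Y (n x K) is a dummy indicator matrix.
    n, p = X.shape
    K = Y.shape[1]
    # Step 1: Theta0 with n^-1 Theta0' Y'Y Theta0 = I_{K-1}, orthogonal to the constant score
    pi = Y.sum(axis=0) / n                         # class proportions (Y'Y/n is diagonal)
    u = np.sqrt(pi)[:, None]
    Q, _ = np.linalg.qr(np.hstack([u, np.eye(K)[:, :K - 1]]))
    Theta0 = Q[:, 1:] / np.sqrt(pi)[:, None]
    # Step 2: B0 = (X'X + lam*Omega)^{-1} X'Y Theta0
    B0 = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)
    # Step 3: eigen-analysis of the small (K-1)x(K-1) matrix Theta0' Y'X B0
    M = Theta0.T @ (Y.T @ (X @ B0))
    evals, V = np.linalg.eigh((M + M.T) / 2)       # symmetrized for numerical safety
    order = np.argsort(evals)[::-1]
    V, evals = V[:, order], evals[order]
    # Step 4: optimal scores and regression coefficients
    Theta, B_os = Theta0 @ V, B0 @ V
    alpha = np.sqrt(np.clip(evals / n, 0.0, None)) # the alpha_k of Equation (4.5)
    return Theta, B_os, alpha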

This four-step algorithm is valid when the penalty is of the quadratic form $\operatorname{tr}(B^\top \Omega B)$. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample $x_i$ is assigned to class k if $x_i$ is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (the sample estimators $\mu_k$ and $\Sigma_W$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to class k if the distance

\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\left( \frac{n_k}{n} \right)
\tag{4.18}
\]

is minimized over all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\Sigma_{W\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

\begin{align}
\Sigma_{W\Omega}^{-1} &= \left( n^{-1} (X^\top X + \lambda\Omega) - \Sigma_B \right)^{-1} \notag \\
&= \left( n^{-1} X^\top X - \Sigma_B + n^{-1} \lambda\Omega \right)^{-1} \notag \\
&= \left( \Sigma_W + n^{-1} \lambda\Omega \right)^{-1} . \tag{4.19}
\end{align}

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution $B_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification;

• in the LDA domain (the space of discriminant variates $X_{\mathrm{LDA}}$), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension R < K − 1 by using the first R discriminant directions $\{\beta_k\}_{k=1}^{R}$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

\[
\left\| (x_i - \mu_k) B_{\mathrm{OS}} \right\|_{\Sigma_{W\Omega}}^2 - 2 \log(\pi_k) ,
\]

where $\pi_k$ is the estimated class prior and $\|\cdot\|_S$ is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, it is

\[
\left\| (x_i - \mu_k) B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 - 2 \log(\pi_k) ,
\]

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let $d(x, \mu_k)$ be the distance between x and $\mu_k$ defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k = 1 \mid x)$ can be estimated as

\[
p(y_k = 1 \mid x) \;\propto\; \exp\!\left( -\frac{d(x, \mu_k)}{2} \right)
\;\propto\; \pi_k \exp\!\left( -\frac{1}{2} \left\| (x - \mu_k) B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 \right) .
\tag{4.20}
\]

These probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \mu_k)$ take large values, $\exp\!\left(-\frac{d(x,\mu_k)}{2}\right)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

\[
p(y_k = 1 \mid x)
= \frac{\pi_k \exp\!\left( -\frac{d(x,\mu_k)}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( -\frac{d(x,\mu_\ell)}{2} \right)}
= \frac{\pi_k \exp\!\left( -\frac{d(x,\mu_k)}{2} + \frac{d_{\max}}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( -\frac{d(x,\mu_\ell)}{2} + \frac{d_{\max}}{2} \right)} ,
\]

where $d_{\max} = \max_k d(x, \mu_k)$.
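A compact sketch of this normalization step is given below; it follows the same shifting idea, except that the smallest distance is subtracted (so that every exponent is non-positive), which is the usual variant of the trick. The function name and the matrix layout are ours.

import numpy as np

def posteriors_from_distances(d, priors):
    # d: (n x K) array of distances d(x_i, mu_k); priors: length-K numpy vector of pi_k.
    # Shift the distances row-wise before exponentiating to avoid numerical underflow.
    d_shift = d - d.min(axis=1, keepdims=True)
    unnorm = priors[None, :] * np.exp(-0.5 * d_shift)
    return unnorm / unnorm.sum(axis=1, keepdims=True)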

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $X_{\mathrm{OS}}$ or of the discriminant variates $X_{\mathrm{LDA}}$, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
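For instance, a two-dimensional display can be produced with a few lines of plotting code (a sketch under our own naming conventions):

import matplotlib.pyplot as plt

def plot_discriminant_plane(X_lda, labels):
    # Scatter plot of the first two discriminant variates, colored by class.
    plt.scatter(X_lda[:, 0], X_lda[:, 1], c=labels, s=15)
    plt.xlabel("first discriminant direction")
    plt.ylabel("second discriminant direction")
    plt.show()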

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top \Omega \beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $B^\top \Omega B$.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:

\begin{align}
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times (K-1)}} \;\;
& J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \left\| \beta^j \right\|_2^2}{\tau_j} \tag{4.21a} \\
\text{s.t.} \;\;
& \sum_j \tau_j - \sum_j w_j \left\| \beta^j \right\|_2 \le 0 , \tag{4.21b} \\
& \tau_j \ge 0 , \quad j = 1, \ldots, p , \tag{4.21c}
\end{align}

where $B \in \mathbb{R}^{p \times (K-1)}$ is a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$, $B = (\beta^{1\top}, \ldots, \beta^{p\top})^\top$, and the $w_j$ are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss $\|Y\Theta - XB\|_2^2$; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, $b/\tau$ is defined by continuation at zero, as $b/0 = +\infty$ if $b \ne 0$ and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables $\tau_j$. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2$.

Proof. The Lagrangian of Problem (4.21) is

\[
L = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \left\| \beta^j \right\|_2^2}{\tau_j}
+ \nu_0 \Bigg( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \Bigg)
- \sum_{j=1}^{p} \nu_j \tau_j .
\]


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for $\tau_j$ are

\begin{align*}
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
& \;\Leftrightarrow\; -\lambda w_j^2 \frac{\left\| \beta^j \right\|_2^2}{{\tau_j^\star}^2} + \nu_0 - \nu_j = 0 \\
& \;\Leftrightarrow\; -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, {\tau_j^\star}^2 - \nu_j\, {\tau_j^\star}^2 = 0 \\
& \;\Rightarrow\; -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, {\tau_j^\star}^2 = 0 .
\end{align*}

The last line is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$; complementary slackness states that $\nu_j\, g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier of the constraint $g_j(\tau_j) \le 0$. As a result, the optimal value of $\tau_j$ is

\[
\tau_j^\star = \sqrt{ \frac{\lambda w_j^2 \left\| \beta^j \right\|_2^2}{\nu_0} } = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \left\| \beta^j \right\|_2 .
\tag{4.22}
\]

We note that $\nu_0 \ne 0$ if there is at least one coefficient $\beta_{jk} \ne 0$; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 = 0 ,
\tag{4.23}
\]

so that $\tau_j^\star = w_j \|\beta^j\|_2$. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

\[
\min_{B \in \mathbb{R}^{p \times M}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 .
\tag{4.24}
\]

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as $\lambda \operatorname{tr}(B^\top \Omega B)$, where

\[
\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \right) ,
\tag{4.25}
\]

with

\[
\tau_j = w_j \left\| \beta^j \right\|_2 ,
\]

resulting in the diagonal components of Ω:

\[
(\Omega)_{jj} = \frac{w_j}{\left\| \beta^j \right\|_2} .
\tag{4.26}
\]

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \|\beta\|_2^2 / \tau$, known as the perspective function of $f(\beta) = \|\beta\|_2^2$, is convex in $(\beta, \tau)$ (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B, \tau)$.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $B \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (4.24) is

\[
\left\{ V \in \mathbb{R}^{p \times (K-1)} \, : \, V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} ,
\tag{4.27}
\]

where $G \in \mathbb{R}^{p \times (K-1)}$ is a matrix composed of row vectors $g^j \in \mathbb{R}^{K-1}$, $G = (g^{1\top}, \ldots, g^{p\top})^\top$, defined as follows. Let S(B) denote the support of B over its rows, $S(B) = \{ j \in \{1, \ldots, p\} : \|\beta^j\|_2 \ne 0 \}$; then we have

\begin{align}
\forall j \in S(B) , \quad & g^j = w_j \left\| \beta^j \right\|_2^{-1} \beta^j , \tag{4.28} \\
\forall j \notin S(B) , \quad & \left\| g^j \right\|_2 \le w_j . \tag{4.29}
\end{align}


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When $\|\beta^j\|_2 \ne 0$, the gradient of the penalty with respect to $\beta^j$ is

\[
\frac{\partial}{\partial \beta^j} \Bigg( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \Bigg) = \lambda w_j \frac{\beta^j}{\left\| \beta^j \right\|_2} .
\tag{4.30}
\]

At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

\[
\partial_{\beta^j} \Bigg( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \Bigg)
= \partial_{\beta^j} \left( \lambda w_j \left\| \beta^j \right\|_2 \right)
= \left\{ \lambda w_j v \, : \, v \in \mathbb{R}^{K-1},\; \|v\|_2 \le 1 \right\} ,
\tag{4.31}
\]

which gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

\begin{align}
\forall j \in S , \quad & \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \left\| \beta^j \right\|_2^{-1} \beta^j = 0 , \tag{4.32a} \\
\forall j \in \bar{S} , \quad & \left\| \frac{\partial J(B)}{\partial \beta^j} \right\|_2 \le \lambda w_j , \tag{4.32b}
\end{align}

where $S \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors $\beta^j$ and $\bar{S}$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).
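In practice, conditions (4.32a)–(4.32b) can be checked directly from the gradient of the OS cost; the sketch below does so for $J(B) = \frac{1}{2}\|Y\Theta - XB\|_F^2$ (the function name, tolerance and return convention are ours).

import numpy as np

def check_optimality(X, Y, Theta, B, lam, w, tol=1e-6):
    # Returns the indices violating (4.32a) (active rows) and (4.32b) (inactive rows).
    G = -X.T @ (Y @ Theta - X @ B)                 # gradient dJ/dB, one row per feature
    norms = np.linalg.norm(B, axis=1)
    active = norms > 0
    bad_active = [j for j in np.where(active)[0]
                  if np.linalg.norm(G[j] + lam * w[j] * B[j] / norms[j]) > tol]
    bad_inactive = [j for j in np.where(~active)[0]
                    if np.linalg.norm(G[j]) > lam * w[j] + tol]
    return bad_active, bad_inactive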

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

\[
B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p \times (K-1)}} \; \min_{\Theta \in \mathbb{R}^{K \times (K-1)}} \;
\frac{1}{2} \left\| Y\Theta - XB \right\|_F^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2
\quad \text{s.t. } \; n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\]

is equivalent to the penalized LDA problem

\[
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p \times (K-1)}} \; \operatorname{tr}\!\left( B^\top \Sigma_B B \right)
\quad \text{s.t. } \; B^\top \left( \Sigma_W + n^{-1} \lambda\Omega \right) B = I_{K-1} ,
\]

where

\[
\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \ldots, \frac{w_p^2}{\tau_p} \right) ,
\quad \text{with } \;
\Omega_{jj} = \begin{cases}
+\infty & \text{if } \beta^j_{\mathrm{OS}} = 0 \\
w_j \left\| \beta^j_{\mathrm{OS}} \right\|_2^{-1} & \text{otherwise} .
\end{cases}
\tag{4.33}
\]

That is, $B_{\mathrm{LDA}} = B_{\mathrm{OS}}\, \operatorname{diag}\!\big( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \big)$, where $\alpha_k \in (0,1)$ is the kth leading eigenvalue of

\[
n^{-1}\, Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y .
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form $\operatorname{tr}(B^\top \Omega B)$.


              5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac{1}{2} \|Y\Theta - XB\|_F^2$.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below.

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is represented graphically as a block diagram in Figure 5.1 and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach of Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K − 1) similar systems

\[
\left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 ,
\tag{5.1}
\]


Figure 5.1: GLOSS block diagram.


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: $X$, $Y$, $B$, $\lambda$
Initialize: $\mathcal{A} \leftarrow \{ j \in \{1,\ldots,p\} : \|\beta^j\|_2 > 0 \}$; $\Theta^0$ such that $n^{-1}\,\Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$; convergence $\leftarrow$ false
repeat
    Step 1: solve (4.21) in B, assuming $\mathcal{A}$ optimal
    repeat
        $\Omega \leftarrow \operatorname{diag}(\Omega_{\mathcal{A}})$, with $\omega_j \leftarrow \|\beta^j\|_2^{-1}$
        $B_{\mathcal{A}} \leftarrow (X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega)^{-1} X_{\mathcal{A}}^\top Y \Theta^0$
    until condition (4.32a) holds for all $j \in \mathcal{A}$
    Step 2: identify inactivated variables
    for $j \in \mathcal{A}$ such that $\|\beta^j\|_2 = 0$ do
        if optimality condition (4.32b) holds then
            $\mathcal{A} \leftarrow \mathcal{A} \setminus \{j\}$; go back to Step 1
        end if
    end for
    Step 3: check the greatest violation of optimality condition (4.32b) in $\bar{\mathcal{A}}$
    $j^\star \leftarrow \operatorname{argmax}_{j \in \bar{\mathcal{A}}} \|\partial J / \partial \beta^j\|_2$
    if $\|\partial J / \partial \beta^{j^\star}\|_2 < \lambda$ then
        convergence $\leftarrow$ true ($B$ is optimal)
    else
        $\mathcal{A} \leftarrow \mathcal{A} \cup \{j^\star\}$
    end if
until convergence
$(s, V) \leftarrow$ eigenanalyze$(\Theta^{0\top} Y^\top X_{\mathcal{A}} B)$, that is, $\Theta^{0\top} Y^\top X_{\mathcal{A}} B V_k = s_k V_k$, $k = 1, \ldots, K-1$
$\Theta \leftarrow \Theta^0 V$; $B \leftarrow B V$; $\alpha_k \leftarrow n^{-1/2} s_k^{1/2}$, $k = 1, \ldots, K-1$
Output: $\Theta$, $B$, $\alpha$


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth column of B and Θ^0 respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

(X^\top X + \lambda \Omega) B = X^\top Y \Theta .   (5.2)

Defining the Cholesky decomposition as C^\top C = X^\top X + \lambda \Omega, (5.2) is solved efficiently as follows:

C^\top C B = X^\top Y \Theta
C B = C^\top \backslash X^\top Y \Theta
B = C \backslash \left( C^\top \backslash X^\top Y \Theta \right) ,   (5.3)

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
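To make the two triangular solves in (5.3) concrete, here is a minimal matlab sketch, assuming the active-set design matrix XA, the indicator matrix Y, the current scores Theta, the diagonal penalty Omega and the scalar lambda are already available in the workspace (these variable names are illustrative, not the identifiers of the GLOSS package):

    % Solve (X_A' X_A + lambda*Omega) B = X_A' Y Theta with a single Cholesky factor.
    C   = chol(XA' * XA + lambda * Omega);   % upper triangular, C'*C = XA'*XA + lambda*Omega
    RHS = XA' * (Y * Theta);                 % common right-hand side of the K-1 systems
    B   = C \ (C' \ RHS);                    % two triangular solves with mldivide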

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^\top X + \lambda \Omega. This difficulty can be avoided using the following equivalent expression:

B = \Omega^{-1/2} \left( \Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I \right)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 ,   (5.4)

where the conditioning of \Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values. Our code is otherwise based on expression (5.2).
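The stabilized expression (5.4) can be sketched along the same lines; this is only an illustration of the reformulation, under the same naming assumptions as above and assuming Omega is stored as a diagonal matrix:

    % Equivalent, better-conditioned computation when some omega_j are very large.
    Oih = diag(1 ./ sqrt(diag(Omega)));                        % Omega^{-1/2}
    G   = Oih * (XA' * XA) * Oih + lambda * eye(size(XA, 2));  % well-conditioned system matrix
    B   = Oih * (G \ (Oih * (XA' * (Y * Theta))));             % recovers the B of (5.2)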

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. This eigen-analysis is actually solved in the form \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^\top X + \Omega)^{-1}, which involves the inversion of an n × n matrix. Let \Theta^0 be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y.^1 Then, solving the K − 1 systems (5.3) provides the value of B^0 = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0. This B^0 matrix can be identified in the expression to eigen-analyze as

\Theta^{0\top} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix \Theta^{0\top} Y^\top X B^0 = V \Lambda V^\top. Defining \Theta = \Theta^0 V, we have \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Lambda, and when \Theta^0 is chosen such that n^{-1} \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}, we also have n^{-1} \Theta^\top Y^\top Y \Theta = I_{K-1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of \Lambda are sorted in decreasing order, \theta_k is an optimal solution to the p-OS problem. Finally, once \Theta has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from \Theta^0 to \Theta, that is, B = B^0 V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which \Omega is defined by a variational formulation.

^1 As X is centered, 1_K belongs to the null space of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. It is thus sufficient to choose \Theta^0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set \Theta^0 = (Y^\top Y)^{-1/2} U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
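The whole update of the score matrix thus boils down to a small eigen-decomposition. The following hedged matlab sketch illustrates it, assuming B0 has been obtained from (5.3) for an initial Theta0 satisfying n^{-1} Theta0' Y' Y Theta0 = I (all names are illustrative):

    % Eigen-analysis of the (K-1) x (K-1) matrix Theta0' * Y' * X * B0.
    M      = Theta0' * (Y' * (X * B0));
    M      = (M + M') / 2;                 % symmetrize against round-off errors
    [V, D] = eig(M);
    [s, o] = sort(diag(D), 'descend');     % eigenvalues in decreasing order
    V      = V(:, o);
    Theta  = Theta0 * V;                   % optimal scores
    B      = B0 * V;                       % regression coefficients mapped accordingly
    alpha  = sqrt(s / n);                  % alpha_k = n^{-1/2} s_k^{1/2}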

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

\frac{1}{2} \| Y\Theta - XB \|_2^2 + \lambda \sum_{j=1}^{p} w_j \| \beta^j \|_2 .   (5.5)

Let J(B) be the data-fitting term \frac{1}{2}\| Y\Theta - XB \|_2^2. Its gradient with respect to the jth row of B, \beta^j, is the (K − 1)-dimensional vector

\frac{\partial J(B)}{\partial \beta^j} = x_j^\top (XB - Y\Theta) ,

where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be written for every active variable j as

x_j^\top (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\| \beta^j \|_2} = 0 ,

and the second optimality condition (4.32b) can be written for every variable j as

\| x_j^\top (XB - Y\Theta) \|_2 \le \lambda w_j .

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

j^\star = \arg\max_j \; \max\left( \| x_j^\top (XB - Y\Theta) \|_2 - \lambda w_j ,\; 0 \right) .

The exclusion of a variable belonging to the active set A is considered if the norm \| \beta^j \|_2 is small and if, after setting \beta^j to zero, the following optimality condition holds:

\| x_j^\top (XB - Y\Theta) \|_2 \le \lambda w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
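As an illustration of this mechanism, the sketch below scans the inactive variables for the greatest violation of the second optimality condition; the variable names (X, Y, B, Theta, w, lambda, inactive) are assumptions made for the example, not the actual identifiers of the GLOSS code:

    % Find the inactive variable with the greatest violation of (4.32b).
    R    = X * B - Y * Theta;                       % current residual matrix
    viol = -inf(numel(inactive), 1);
    for m = 1:numel(inactive)
        j       = inactive(m);
        viol(m) = norm(X(:, j)' * R) - lambda * w(j);
    end
    [vmax, m] = max(viol);
    if vmax > 0
        jstar = inactive(m);                        % candidate to enter the active set
    end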

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λ_max, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

\lambda_{\max} = \max_{j \in \{1, \dots, p\}} \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ··· > λ_t > ··· > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
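A possible sketch of this initialization, under the same naming assumptions as above and with T denoting the number of path points, is:

    % lambda_max such that B = 0, then a halving schedule down to lambda_T.
    p    = size(X, 2);
    lmax = 0;
    for j = 1:p
        lmax = max(lmax, norm(X(:, j)' * (Y * Theta0)) / w(j));
    end
    lambdas = lmax * 0.5 .^ (0:T-1);   % lambda_{t+1} = lambda_t / 2, solved with warm starts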


5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

\min_{B \in \mathbb{R}^{p \times (K-1)}} \| Y\Theta - XB \|_F^2 = \min_{B \in \mathbb{R}^{p \times (K-1)}} \mathrm{tr}\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n B^\top \Sigma_T B \right)

are replaced by

\min_{B \in \mathbb{R}^{p \times (K-1)}} \mathrm{tr}\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n B^\top (\Sigma_B + \mathrm{diag}(\Sigma_W)) B \right) .

Note that this variant only requires diag(\Sigma_W) + \Sigma_B + n^{-1}\Omega to be positive definite, which is a weaker requirement than \Sigma_T + n^{-1}\Omega positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition,


[Figure 5.2 about here: the 3 × 3 pixel grid, numbered 1 2 3 (bottom row), 4 5 6 (middle row), 7 8 9 (top row), with 8-connectivity edges, and its Laplacian matrix:]

Ω_L =
[  3 −1  0 −1 −1  0  0  0  0
  −1  5 −1 −1 −1 −1  0  0  0
   0 −1  3  0 −1 −1  0  0  0
  −1 −1  0  5 −1  0 −1 −1  0
  −1 −1 −1 −1  8 −1 −1 −1 −1
   0 −1 −1  0 −1  5  0 −1 −1
   0  0  0 −1 −1  0  3 −1  0
   0  0  0 −1 −1 −1 −1  5 −1
   0  0  0  0 −1 −1  0 −1  3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image.

for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^\top Ω_L β favors, among vectors of identical L2 norm, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^\top, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^\top, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
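The construction of such a Laplacian penalty is straightforward. The following matlab sketch builds the 8-connectivity Laplacian of the 3 × 3 grid of Figure 5.2 and evaluates the penalty of the indicator vector discussed above; the digit experiments of Section 6.5 use the same construction on a 16 × 16 grid (variable names are illustrative):

    % Build the Laplacian of the 3x3 pixel grid (8-connectivity) and evaluate beta'*OmegaL*beta.
    [r, c] = ndgrid(1:3, 1:3);                 % pixel coordinates
    n = numel(r);
    A = zeros(n);
    for i = 1:n
        for j = 1:n
            if i ~= j && max(abs(r(i)-r(j)), abs(c(i)-c(j))) <= 1
                A(i, j) = 1;                   % neighboring pixels
            end
        end
    end
    OmegaL  = diag(sum(A, 2)) - A;             % graph Laplacian
    beta    = [1 1 0 1 1 0 0 0 0]';            % indicator of pixel 1 and its neighbors
    penalty = beta' * OmegaL * beta;           % equals 9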


              6 Experimental Results

This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T to ones, or the diagonal of the within-class covariance matrix Σ_W to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.^1

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distributions of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

^1 The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques on the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_{1j} = 0.7 × 1_{(1 ≤ j ≤ 25)}, μ_{2j} = 0.7 × 1_{(26 ≤ j ≤ 50)}, μ_{3j} = 0.7 × 1_{(51 ≤ j ≤ 75)}, μ_{4j} = 0.7 × 1_{(76 ≤ j ≤ 100)}.

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j ≤ 200)}. The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_{ij} ∼ N((k−1)/3, 1) if j ≤ 100, and X_{ij} ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_{1j} ∼ N(0, 0.3²) for j ≤ 25 and μ_{1j} = 0 otherwise; μ_{2j} ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_{2j} = 0 otherwise; μ_{3j} ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_{3j} = 0 otherwise; μ_{4j} ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_{4j} = 0 otherwise.
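As an illustration, a matlab sketch of the data-generating process of Simulation 1 (the other setups only change the means and the covariance) could read as follows; the variable names are ours, and the training/validation/test split is omitted:

    % Simulation 1: K = 4 classes, p = 500 independent features, 25 shifted features per class.
    p = 500; K = 4; n_k = 25;                    % 25 training examples per class (n = 100)
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25+1 : k*25) = 0.7;          % mu_kj = 0.7 on its own block of 25 features
    end
    X = []; y = [];
    for k = 1:K
        X = [X; repmat(mu(k, :), n_k, 1) + randn(n_k, p)];
        y = [y; k * ones(n_k, 1)];
    end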

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far behind by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

               Err (%)       Var            Dir
Sim 1: K = 4, mean shift, ind. features
  PLDA         12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA         31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS        19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D      11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
Sim 2: K = 2, mean shift, dependent features
  PLDA          9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA         19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS        15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D       9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
Sim 3: K = 4, 1D mean shift, ind. features
  PLDA         13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA         57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS        31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D      18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
Sim 4: K = 4, mean shift, ind. features
  PLDA         60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA         65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS        60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D      58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


[Figure 6.1 about here: scatter plot of TPR versus FPR (in %) for GLOSS, GLOSS-D, SLDA and PLDA on Simulations 1 to 4.]

Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

            Simulation 1     Simulation 2     Simulation 3     Simulation 4
            TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
  PLDA      99.0   78.2      96.9   60.3      98.0   15.9      74.3   65.6
  SLDA      73.9   38.5      33.8   16.3      41.6   27.8      50.7   39.5
  GLOSS     64.1   10.6      30.0    4.6      51.1   18.2      26.0   12.1
  GLOSS-D   93.5   39.4      92.1   28.1      95.6   65.5      42.9   29.9

method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples

² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and of the number of selected variables.

                                        Err (%)        Var
Nakayama: n = 86, p = 22,283, K = 5
  PLDA                                  20.95 (1.3)    10478.7 (2116.3)
  SLDA                                  25.71 (1.7)      252.5 (3.1)
  GLOSS                                 20.48 (1.4)      129.0 (18.6)
Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA                                  38.36 (6.0)    14873.5 (720.3)
  SLDA                                  —                —
  GLOSS                                 20.61 (6.9)      372.4 (122.1)
Sun: n = 180, p = 54,613, K = 4
  PLDA                                  33.78 (5.9)    21634.8 (7443.2)
  SLDA                                  36.22 (6.5)      384.4 (16.5)
  GLOSS                                 31.77 (4.5)       93.0 (93.6)

of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical plane estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2 about here: scatter plots of the observations in the first canonical plane (1st discriminant versus 2nd discriminant), with one column per method (GLOSS, SLDA) and one row per dataset. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results have been left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships from Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow the detection of strokes and will probably provide better prediction results.


[Figure 6.4 about here: left panel, β for GLOSS; right panel, β for S-GLOSS.]

Figure 6.4: Discriminant direction between digits "1" and "0".

[Figure 6.5 about here: left panel, β for GLOSS with λ = 0.3; right panel, β for S-GLOSS with λ = 0.3.]

Figure 6.5: Sparse discriminant direction between digits "1" and "0".


              Discussion

GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. To our knowledge, this is the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions produces models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


              Part III

              Sparse Clustering Analysis


              Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


              7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^\top, …, x_n^\top)^\top have been drawn identically from K different subpopulations of the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) , \quad \forall i \in \{1, \dots, n\} ,

where K is the number of components, f_k are the densities of the components and \pi_k are the mixture proportions (\pi_k \in ]0,1[ for all k, and \sum_k \pi_k = 1). Mixture models transcribe that, given the proportions \pi_k and the distributions f_k for each class, the data are generated according to the following mechanism:

- y: each individual is allotted to a class according to a multinomial distribution with parameters \pi_1, \dots, \pi_K;

- x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities \phi(\cdot; \theta_k). The density of the mixture can then be written as

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \phi(x_i; \theta_k) , \quad \forall i \in \{1, \dots, n\} ,


where \theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K) is the parameter of the model.
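To make the generative mechanism concrete, a toy matlab sketch of sampling n points from a Gaussian mixture with common covariance could look as follows; it assumes the Statistics Toolbox (for mnrnd and mvnrnd) and that pi_k (1 × K proportions), mu (K × p means), Sigma (p × p), n and p are given — all names are illustrative:

    % Draw class labels from a multinomial, then draw x_i ~ N(mu_k, Sigma).
    Y      = mnrnd(1, pi_k, n);                       % n x K indicator matrix, one draw per sample
    [~, z] = max(Y, [], 2);                           % class index of each sample
    X      = mu(z, :) + mvnrnd(zeros(1, p), Sigma, n);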

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of a mixture model, Pearson (1894) used the method of moments to estimate the five parameters (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

L(\theta; X) = \log \left( \prod_{i=1}^{n} f(x_i; \theta) \right)
             = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) ,   (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters) and \pi_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or


classification log-likelihood:

L_C(\theta; X, Y) = \log \left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right)
                  = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} y_{ik} \, \pi_k f_k(x_i; \theta_k) \right)
                  = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( \pi_k f_k(x_i; \theta_k) \right) ,   (7.2)

where the last equality holds because exactly one y_{ik} equals one for each i. The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k, and y_{ik} = 0 otherwise.

The soft membership t_{ik}(\theta) is defined as

t_{ik}(\theta) = p(Y_{ik} = 1 | x_i; \theta)   (7.3)
              = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)} .   (7.4)

To lighten notations, t_{ik}(\theta) will be denoted t_{ik} when the parameter \theta is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log \left( \pi_k f_k(x_i; \theta_k) \right)
                  = \sum_{i,k} y_{ik} \log \left( t_{ik} f(x_i; \theta) \right)
                  = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
                  = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)
                  = \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X) ,   (7.5)

where \sum_{i,k} y_{ik} \log t_{ik} can be reformulated as

\sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( p(Y_{ik} = 1 | x_i; \theta) \right)
                              = \sum_{i=1}^{n} \log \left( p(Y_{i k_i} = 1 | x_i; \theta) \right)
                              = \log \left( p(Y | X; \theta) \right) ,

where k_i denotes the class of observation i. As a result, the relationship (7.5) can be rewritten as

L(\theta; X) = L_C(\theta; Z) - \log \left( p(Y | X; \theta) \right) .   (7.6)


Likelihood Maximization

The complete log-likelihood cannot be evaluated because the variables y_{ik} are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value of \theta:

L(\theta; X) = \underbrace{\mathbb{E}_{Y \sim p(\cdot|X;\theta^{(t)})} \left[ L_C(\theta; X, Y) \right]}_{Q(\theta, \theta^{(t)})} + \underbrace{\mathbb{E}_{Y \sim p(\cdot|X;\theta^{(t)})} \left[ -\log p(Y|X;\theta) \right]}_{H(\theta, \theta^{(t)})} .

In this expression, H(\theta, \theta^{(t)}) is an entropy term and Q(\theta, \theta^{(t)}) is the conditional expectation of the complete log-likelihood. Let us define the increment of the log-likelihood as \Delta L = L(\theta^{(t+1)}; X) - L(\theta^{(t)}; X). Then \theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)}) also increases the log-likelihood:

\Delta L = \underbrace{\left( Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by definition of iteration } t+1} + \underbrace{\left( H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by Jensen's inequality}} \ge 0 .

Therefore, it is possible to maximize the likelihood by optimizing Q(\theta, \theta^{(t)}). The relationship between Q(\theta, \theta') and L(\theta; X) is developed in deeper detail in Appendix F, to show how the value of L(\theta; X) can be recovered from Q(\theta, \theta^{(t)}).

For the mixture model problem, Q(\theta, \theta') is

Q(\theta, \theta') = \mathbb{E}_{Y \sim p(Y|X;\theta')} \left[ L_C(\theta; X, Y) \right]
                   = \sum_{i,k} p(Y_{ik} = 1 | x_i; \theta') \log \left( \pi_k f_k(x_i; \theta_k) \right)
                   = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log \left( \pi_k f_k(x_i; \theta_k) \right) .   (7.7)

Due to its similarity with the expression of the complete log-likelihood (7.2), Q(\theta, \theta') is also known as the weighted log-likelihood. In (7.7), the weights t_{ik}(\theta') are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

- Initialization (not iterated): choice of the initial parameter \theta^{(0)};

- E-Step: evaluation of Q(\theta, \theta^{(t)}), using t_{ik}(\theta^{(t)}) (7.4) in (7.7);

- M-Step: calculation of \theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)}).


Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix \Sigma and different mean vectors \mu_k, the mixture density is

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
              = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\} .

At the E-step, the posterior probabilities t_{ik} are computed as in (7.4) with the current parameters \theta^{(t)}; then the M-step maximizes Q(\theta, \theta^{(t)}) (7.7), whose form is as follows:

Q(\theta, \theta^{(t)}) = \sum_{i,k} t_{ik} \log(\pi_k) - \sum_{i,k} t_{ik} \log\left( (2\pi)^{p/2} |\Sigma|^{1/2} \right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                        = \sum_{k} t_k \log(\pi_k) - \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                        \equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|) - \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right) ,   (7.8)

where

t_k = \sum_{i=1}^{n} t_{ik} .   (7.9)

The M-step, which maximizes this expression with respect to \theta, applies the following updates defining \theta^{(t+1)}:

\pi_k^{(t+1)} = \frac{t_k}{n} ,   (7.10)

\mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k} ,   (7.11)

\Sigma^{(t+1)} = \frac{1}{n} \sum_k W_k ,   (7.12)

with W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .   (7.13)

The derivations are detailed in Appendix G.
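The following minimal matlab sketch puts the E-step (7.4) and the M-step updates (7.10)-(7.13) together for a Gaussian mixture with common covariance matrix. It assumes initial values of pi_k (1 × K), mu (K × p) and Sigma (p × p), a data matrix X (n × p), the number of components K and a number of iterations maxit, and it omits any convergence test or numerical safeguard (the names are illustrative, not those of Mix-GLOSS):

    [n, p] = size(X);
    for it = 1:maxit
        % E-step: posterior probabilities t_ik, Equation (7.4)
        T = zeros(n, K);
        for k = 1:K
            d       = X - repmat(mu(k, :), n, 1);
            T(:, k) = pi_k(k) * exp(-0.5 * sum((d / Sigma) .* d, 2)) / sqrt(det(2 * pi * Sigma));
        end
        T = T ./ repmat(sum(T, 2), 1, K);
        % M-step: Equations (7.10)-(7.13)
        tk    = sum(T, 1);
        pi_k  = tk / n;
        Sigma = zeros(p);
        for k = 1:K
            mu(k, :) = (T(:, k)' * X) / tk(k);
            d        = X - repmat(mu(k, :), n, 1);
            Sigma    = Sigma + d' * (d .* repmat(T(:, k), 1, p));   % accumulates W_k
        end
        Sigma = Sigma / n;
    end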

7.2 Feature Selection in Model-Based Clustering

When a common covariance matrix is assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix \Sigma_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, \Sigma_k = \lambda_k D_k A_k D_k^\top (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter we review some techniques for inducing sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

\log \left( \frac{p(Y_k = 1 | x)}{p(Y_\ell = 1 | x)} \right) = x^\top \Sigma^{-1} (\mu_k - \mu_\ell) - \frac{1}{2} (\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) + \log \frac{\pi_k}{\pi_\ell} .

In this model, a simple way of introducing sparsity in the discriminant vectors \Sigma^{-1}(\mu_k - \mu_\ell) is to constrain \Sigma to be diagonal and to favor sparse means \mu_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

\lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} |(\Sigma_k^{-1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

\lambda \sum_{j=1}^{p} \sum_{1 \le k \le k' \le K} |\mu_{kj} - \mu_{k'j}| .

This PFP regularization does not shrink the means to zero, but towards each other. When the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1-infinity penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

\lambda \sum_{j=1}^{p} \left\| (\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj}) \right\|_\infty .

One group is defined for each variable j, as the set of the jth components of the K means, (\mu_{1j}, \dots, \mu_{Kj}). The L1-infinity penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG yields genuine feature selection, because it forces null values for the same variable in all cluster means:

\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K} \mu_{kj}^2 } .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


f(x_i | \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[ f(x_{ij} | \theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij} | \nu_j) \right]^{1 - \phi_j} ,

where f(\cdot|\theta_{jk}) is the distribution function for relevant features and h(\cdot|\nu_j) is the distribution function for the irrelevant ones. The binary vector \phi = (\phi_1, \phi_2, \dots, \phi_p) represents relevance, with \phi_j = 1 if the jth feature is informative and \phi_j = 0 otherwise. The saliency of variable j is then formalized as \rho_j = P(\phi_j = 1), so all the \phi_j are treated as missing variables. The set of parameters is thus \{\pi_k, \theta_{jk}, \nu_j, \rho_j\}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U \in \mathbb{R}^{p \times (K-1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion

\mathrm{tr}\left( (U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U \right) ,   (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, so that the U matrix enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation \tilde{U} of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \left\| X_U - X\tilde{U} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \tilde{u}^k \right\|_1 ,

where X_U = XU is the input data projected onto the non-sparse space and \tilde{u}^k is the kth column vector of the projection matrix \tilde{U}.

The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:

\min_{A, B \in \mathbb{R}^{p \times (K-1)}} \sum_{k=1}^{K} \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1
\quad \text{s.t. } A^\top A = I_{K-1} ,

where H_B \in \mathbb{R}^{p \times K} is a matrix defined conditionally on the posterior probabilities t_{ik}, satisfying H_B H_B^\top = \Sigma_B, and H_{B,k} is the kth column of H_B; R_W \in \mathbb{R}^{p \times p} is an upper


triangular matrix resulting from the Cholesky decomposition of \Sigma_W; \Sigma_W and \Sigma_B are the p × p within-class and between-class covariance matrices in the observation space; and A \in \mathbb{R}^{p \times (K-1)} and B \in \mathbb{R}^{p \times (K-1)} are the solutions of the optimization problem, such that B = [\beta_1, \dots, \beta_{K-1}] is the best sparse approximation of U.

The last possibility suggests computing the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:

\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \sum_{j=1}^{p} \left\| \Sigma_{B,j} - \tilde{U} \tilde{U}^\top \Sigma_{B,j} \right\|_2^2
\quad \text{s.t. } \tilde{U}^\top \tilde{U} = I_{K-1} ,

where \Sigma_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of \tilde{U} to recover orthogonality. This is not necessary with the third option, since the penalized version of the SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

- X^{(1)}: the set of selected relevant variables;

- X^{(2)}: the set of variables being considered for inclusion in, or exclusion from, X^{(1)};

- X^{(3)}: the set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

- M1: f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)} | X^{(1)}) \, f(X^{(1)} | Y) ;

- M2: f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)}, X^{(1)} | Y) .

Model M1 means that the variables in X^{(2)} are independent of the clustering Y, while model M2 states that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_{12} = \frac{f(X | M_1)}{f(X | M_2)} ,

where the high-dimensional term f(X^{(3)} | X^{(2)}, X^{(1)}) cancels from the ratio:

B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} | M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} | M_2)}
       = \frac{f(X^{(2)} | X^{(1)}, M_1) \, f(X^{(1)} | M_1)}{f(X^{(2)}, X^{(1)} | M_2)} .

This factor is approximated, since the integrated likelihoods f(X^{(1)} | M_1) and f(X^{(2)}, X^{(1)} | M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)} | X^{(1)}, M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}. There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


              8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to produce reduced-rank decision rules using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k) ,
\]
where \mu_k are the p-dimensional centroids and \Sigma_W is the p \times p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_{ik} (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
\[
2\, l_{\mathrm{weight}}(\mu, \Sigma) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|) ,
\]
which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_{ik} in the E-step, the distance between the samples x_i and the centroids \mu_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
\[
d(x_i, \mu_k) = \|(x_i - \mu_k) B_{\mathrm{LDA}}\|_2^2 - 2 \log(\pi_k) .
\]

This distance defines the computation of the posterior probabilities t_{ik} in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = (X^\top X + \lambda \Omega)^{-1} X^\top Y \Theta ,
\]
where \Theta are the K − 1 leading eigenvectors of Y^\top X (X^\top X + \lambda \Omega)^{-1} X^\top Y.

3. Map X to the LDA domain: X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D, with D = \mathrm{diag}\big(\alpha_k^{-1} (1 - \alpha_k^2)^{-\frac{1}{2}}\big).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_{ik} with
\[
t_{ik} \propto \exp\left[ - \frac{d(x_i, \mu_k) - 2 \log(\pi_k)}{2} \right] . \tag{8.1}
\]

7. Update the labels using the posterior probabilities matrix: Y = T.

8. Go back to step 2 and iterate until the t_{ik} converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
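The loop above can be sketched in a few lines of linear algebra. The following is a simplified illustration under stated assumptions (a plain ridge penalty Ω = I, a random soft initialization of Y, and no scaling matrix D nor feature selection); it is not the Mix-GLOSS implementation, and all names are illustrative.

```python
# Simplified sketch of the EM-as-penalized-OS loop (steps 1-8); illustrative only.
import numpy as np
from scipy.linalg import eigh

def em_optimal_scoring(X, K, lam=1.0, n_iter=50):
    n, p = X.shape
    rng = np.random.default_rng(0)
    Y = rng.dirichlet(np.ones(K), size=n)        # step 1: soft membership matrix
    Omega = np.eye(p)
    for _ in range(n_iter):
        # step 2: penalized optimal scoring
        A = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y)   # (X'X + lam*Omega)^{-1} X'Y
        M = Y.T @ X @ A                                        # Y'X (X'X + lam*Omega)^{-1} X'Y
        evals, evecs = eigh(M)
        Theta = evecs[:, ::-1][:, :K - 1]                      # K-1 leading eigenvectors
        B_os = A @ Theta
        # step 3: map to the LDA domain (the scaling D is omitted in this sketch)
        X_lda = X @ B_os
        # step 4: centroids and proportions
        pi = Y.sum(axis=0) / n
        mu = (Y.T @ X_lda) / Y.sum(axis=0)[:, None]
        # steps 5-6: distances and posterior probabilities (E-step)
        d = ((X_lda[:, None, :] - mu[None, :, :]) ** 2).sum(-1) - 2 * np.log(pi)
        T = np.exp(-0.5 * d)
        T /= T.sum(axis=1, keepdims=True)
        # step 7: update the labels
        Y = T
    return B_os, T
```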

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized OS problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
\[
f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2}\, |\Lambda_0|^{n/2}\, \Gamma_p(\frac{n}{2})}\;
|\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}}
\exp\left\{ -\frac{1}{2} \mathrm{tr}\left( \Lambda_0^{-1} \Sigma^{-1} \right) \right\} ,
\]
where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function defined as
\[
\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left( \frac{n}{2} + \frac{1-j}{2} \right) .
\]

The posterior distribution can be maximized similarly to the likelihood, through the maximization of
\[
\begin{aligned}
Q(\theta, \theta') + \log f(\Sigma \mid \Lambda_0, \nu_0)
&= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log |\Lambda_0| - \frac{p(p+1)}{4} \log(\pi) \\
&\quad - \sum_{j=1}^{p} \log \Gamma\!\left( \frac{n}{2} + \frac{1-j}{2} \right)
  - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right) \\
&\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log |\Lambda_0|
  - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right) , \quad (8.2)
\end{aligned}
\]
with
\[
t_k = \sum_{i=1}^{n} t_{ik}, \qquad
\nu_n = \nu_0 + n, \qquad
\Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad
S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .
\]

Details of these calculations can be found in textbooks (for example Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator for Σ is
\[
\hat{\Sigma}_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right) , \tag{8.3}
\]
where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
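For illustration, the MAP covariance update in (8.3) can be written directly from the posterior weights; the sketch below assumes the weighted scatter S_0 defined above and the choices Λ_0^{-1} = λΩ and ν_0 = p + 1, with illustrative names.

```python
# Sketch of the MAP estimate of the common within-class covariance (Eq. 8.3),
# assuming nu_0 = p + 1 and Lambda_0^{-1} = lam * Omega; names are illustrative.
import numpy as np

def sigma_map(X, T, mu, Omega, lam):
    """X: (n, p) data, T: (n, K) posteriors, mu: (K, p) centroids."""
    n, p = X.shape
    diff = X[:, None, :] - mu[None, :, :]                 # (n, K, p)
    S0 = np.einsum('nk,nkp,nkq->pq', T, diff, diff)       # weighted scatter S_0
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)
```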


              9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_{ik}.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, so that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the warm start implemented here reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that controls the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0; Y ← K-means(X, K)
    Run non-penalized Mix-GLOSS:
        λ ← 0
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    lastLAMBDA ← false
repeat
    Estimate λ: compute the gradient at β_j = 0,
        ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( Σ_{m ≠ j} x_m β^m − YΘ )
    Compute λ_max for every feature using (4.32b),
        λ_max^j = (1 / w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
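The per-feature λ_max above is simply the norm of the gradient of the quadratic fit at β_j = 0, rescaled by the group weight. A minimal sketch of that computation is given below; it assumes the data X, the current coefficients B, the scored labels YΘ and the weights w, and all names are illustrative.

```python
# Sketch of the per-feature lambda_max used by the automatic lambda selection:
# lambda_max_j = (1 / w_j) * || x_j^T (sum_{m != j} x_m beta^m - Y Theta) ||_2 .
# Assumes X (n, p), B (p, K-1), YTheta (n, K-1), weights w (p,); illustrative only.
import numpy as np

def lambda_max_per_feature(X, B, YTheta, w):
    lam_max = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        B_j = B.copy()
        B_j[j, :] = 0.0                        # exclude feature j from the fit
        grad_j = X[:, j] @ (X @ B_j - YTheta)  # gradient of J at beta_j = 0
        lam_max[j] = np.linalg.norm(grad_j) / w[j]
    return lam_max
```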

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_{ik} is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0; Y ← Y0
    else
        B_OS ← 0; Y ← K-means(X, K)
    end if
    convergenceEM ← false; tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


              M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses the scaled version of the label matrix, YΘ, on X. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

              E-Step

The E-step evaluates the posterior probability matrix T, using
\[
t_{ik} \propto \exp\left[ - \frac{d(x_i, \mu_k) - 2 \log(\pi_k)}{2} \right] .
\]

The convergence of these t_{ik} is used as the stopping criterion for EM.
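In practice this normalization is conveniently done in log-space to avoid numerical underflow; a small illustrative sketch (not the Mix-GLOSS code) is:

```python
# Numerically stable E-step: posteriors t_ik from distances d_ik and priors pi_k,
# computed in log-space (softmax trick); illustrative sketch only.
import numpy as np

def posteriors(d, pi):
    """d: (n, K) distances d(x_i, mu_k), pi: (K,) cluster proportions."""
    logit = -0.5 * (d - 2.0 * np.log(pi))        # log of the unnormalized t_ik
    logit -= logit.max(axis=1, keepdims=True)    # subtract row max for stability
    T = np.exp(logit)
    return T / T.sum(axis=1, keepdims=True)
```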

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.

The third and, up to now, definitive attempt proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0), and the execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter.


Figure 9.2: Mix-GLOSS model selection diagram (initial non-penalized Mix-GLOSS runs with λ = 0, selection of the best repetition, warm-started penalized runs using its B and T, computation of BIC, and choice of λ minimizing BIC).

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


              10 Experimental Results

              The performance of Mix-GLOSS is measured here with the artificial dataset that hasbeen used in Section 6

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests we have reduced the size of the problem, because with the original size of 1200 samples and 500 dimensions some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1, adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are reported as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering with Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a piece of software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests this entrant was discarded, due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an $L_{1,\infty}$ mixed norm. Kuan et al. (2010) introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using either the formulation from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The quantities used to measure the performance are:

• Clustering error (in percentage). To measure the quality of the partition given the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007); a small sketch is given after this list. If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ.

• Number of discarded features. This value shows the number of variables whose coefficients have been zeroed and which are therefore not used in the partitioning. In our datasets only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
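The clustering error mentioned in the first item requires matching cluster IDs to class IDs before counting mistakes. A common way to do this, shown below as an illustrative sketch (not necessarily the exact procedure of Wu and Schölkopf, 2007), is to search for the label matching that minimizes the misclassification rate, e.g. with the Hungarian algorithm.

```python
# Illustrative sketch: clustering error as the misclassification rate under the
# best one-to-one matching between cluster labels and true class labels.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency table: counts of (cluster, class) co-occurrences
    C = np.array([[np.sum((y_pred == cl) & (y_true == k)) for k in classes]
                  for cl in clusters])
    row, col = linear_sum_assignment(-C)          # maximize matched counts
    return 1.0 - C[row, col].sum() / len(y_true)
```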

The adequacy of the selected features is assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are actually selected; similarly, the FPR is the proportion of non-relevant variables that are wrongly selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and, the two versions of LumiWCluster providing almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
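For completeness, a tiny sketch of these two rates, computed from boolean masks of selected and truly relevant variables (illustrative only):

```python
# TPR / FPR of feature selection from boolean masks; illustrative sketch.
import numpy as np

def tpr_fpr(selected, relevant):
    """selected, relevant: boolean arrays of length p."""
    selected, relevant = np.asarray(selected), np.asarray(relevant)
    tpr = np.sum(selected & relevant) / np.sum(relevant)      # recall of relevant vars
    fpr = np.sum(selected & ~relevant) / np.sum(~relevant)    # fraction of noise kept
    return tpr, fpr
```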

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data

Sim 1: K = 4, mean shift, ind. features
    Algorithm             Err (%)       Var           Time
    CS general cov        46 (15)       985 (72)      884h
    Fisher EM             58 (87)       784 (52)      1645m
    Clustvarsel           602 (107)     378 (291)     383h
    LumiWCluster-Kuan     42 (68)       779 (4)       389s
    LumiWCluster-Wang     43 (69)       784 (39)      619s
    Mix-GLOSS             32 (16)       80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
    Algorithm             Err (%)       Var           Time
    CS general cov        154 (2)       997 (09)      783h
    Fisher EM             74 (23)       809 (28)      8m
    Clustvarsel           73 (2)        334 (207)     166h
    LumiWCluster-Kuan     64 (18)       798 (04)      155s
    LumiWCluster-Wang     63 (17)       799 (03)      14s
    Mix-GLOSS             77 (2)        841 (34)      2h

Sim 3: K = 4, 1D mean shift, ind. features
    Algorithm             Err (%)       Var           Time
    CS general cov        304 (57)      55 (468)      1317h
    Fisher EM             233 (65)      366 (55)      22m
    Clustvarsel           658 (115)     232 (291)     542h
    LumiWCluster-Kuan     323 (21)      80 (02)       83s
    LumiWCluster-Wang     308 (36)      80 (02)       1292s
    Mix-GLOSS             347 (92)      81 (88)       21h

Sim 4: K = 4, mean shift, ind. features
    Algorithm             Err (%)       Var           Time
    CS general cov        626 (55)      999 (02)      112h
    Fisher EM             567 (104)     55 (48)       195m
    Clustvarsel           732 (4)       24 (12)       767h
    LumiWCluster-Kuan     692 (112)     99 (2)        876s
    LumiWCluster-Wang     697 (119)     991 (21)      825s
    Mix-GLOSS             669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions for the best performing algorithms

                  Simulation 1      Simulation 2      Simulation 3      Simulation 4
                  TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
    MIX-GLOSS     992     015       828     335       884     67        780     12
    LUMI-KUAN     992     28        1000    02        1000    005       50      005
    FISHER-EM     986     24        888     17        838     5825      620     4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (Mix-GLOSS, LumiWCluster-Kuan, Fisher EM) across the four simulations.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations with respect to all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best solution in terms of fall-out and recall.



              Conclusions

              Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of machine learning. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS) algorithm, which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to be maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. In this case too, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

              Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS; however, more testing would be advisable in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: the identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011) and the Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been used in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was to achieve functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of these possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time constraints on the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms, which do not involve the computation of the true criterion, have been used to stop the EM algorithm and to perform model selection. However, further investigation must be done in this direction to assess the convergence properties of the algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers models the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


              Appendix


              A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:
\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,
\qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .
\]

Property 2. \(\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a\).

Property 3. \(\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top) x\).

Property 4. \(\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}| (X^{-1})^\top\).

Property 5. \(\dfrac{\partial\, a^\top X b}{\partial X} = a b^\top\).

Property 6. \(\dfrac{\partial}{\partial X} \mathrm{tr}\left( A X^{-1} B \right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}\).


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form
\[
\begin{aligned}
\min_{\theta_k, \beta_k} \ & \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \\
\text{s.t. } & \theta_k^\top Y^\top Y \theta_k = 1 , \\
& \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k ,
\end{aligned}
\tag{B.1}
\]
for k = 1, ..., K − 1.

The Lagrangian associated with Problem (B.1) is
\[
L_k(\theta_k, \beta_k, \lambda_k, \nu_k) =
\|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k
+ \lambda_k \left( \theta_k^\top Y^\top Y \theta_k - 1 \right)
+ \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k . \tag{B.2}
\]

Setting the gradient of (B.2) with respect to β_k to zero gives the value of the optimal β_k^*:
\[
\beta_k^* = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.3}
\]

The objective function of (B.1) evaluated at β_k^* is
\[
\begin{aligned}
\min_{\theta_k} \ \|Y\theta_k - X\beta_k^*\|_2^2 + \beta_k^{*\top} \Omega_k \beta_k^*
&= \min_{\theta_k} \ \theta_k^\top Y^\top \left( I - X (X^\top X + \Omega_k)^{-1} X^\top \right) Y \theta_k \\
&= \max_{\theta_k} \ \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.4}
\end{aligned}
\]

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of Y^\top X (X^\top X + Ω)^{-1} X^\top Y.

B.1 How to Solve the Eigenvector Decomposition

Performing an eigen-decomposition of an expression like Y^\top X (X^\top X + Ω)^{-1} X^\top Y is not trivial, due to the p × p inverse: with some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^\top X (X^\top X + Ω)^{-1} X^\top Y, so that expression (B.4) can be rewritten in a compact way:
\[
\begin{aligned}
\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \ & \mathrm{tr}\left( \Theta^\top M \Theta \right) \\
\text{s.t. } & \Theta^\top Y^\top Y \Theta = I_{K-1} .
\end{aligned}
\tag{B.5}
\]

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1) × (K−1) matrix M_Θ be Θ^\top M Θ. Hence the classical eigenvector formulation associated with (B.5) is
\[
M_\Theta v = \lambda v , \tag{B.6}
\]
where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,
\[
v^\top M_\Theta v = \lambda \ \Leftrightarrow\ v^\top \Theta^\top M \Theta v = \lambda .
\]
Making the change of variable w = Θv, we obtain an alternative eigen-problem where the w are the eigenvectors of M and λ the associated eigenvalues:
\[
w^\top M w = \lambda . \tag{B.7}
\]

Therefore v are the eigenvectors of the eigen-decomposition of matrix M_Θ, and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K−1) × (K−1) matrix M_Θ and the K × K matrix M is the K × (K−1) matrix Θ in the expression M_Θ = Θ^\top M Θ. Then, to avoid the computation of the p × p inverse (X^\top X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^\top X + Ω)^{-1} X^\top Y Θ in M_Θ:
\[
\begin{aligned}
M_\Theta &= \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta \\
&= \Theta^\top Y^\top X B .
\end{aligned}
\]

Thus, the eigen-decomposition of the (K−1) × (K−1) matrix M_Θ = Θ^\top Y^\top X B yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv needs to be undone.

To summarize, we calculate the v eigenvectors as the eigen-decomposition of a tractable matrix M_Θ, evaluated as Θ^\top Y^\top X B. Then the definitive eigenvectors w are recovered by computing w = Θv. The final step is the reconstruction of the optimal score matrix Θ^* using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable and restores the w vectors. The B matrix also needs to be "updated", by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:
\[
B^* = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V .
\]
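The trick above only requires the eigen-decomposition of a (K−1) × (K−1) matrix. A minimal numerical sketch of the update is given below, under stated assumptions (a generic quadratic penalty Ω, an initial score matrix Theta0); the names are illustrative and this is not the GLOSS code.

```python
# Sketch of the update described above: solve the p-OS problem for an initial
# score matrix Theta0, eigen-decompose the small matrix M_Theta = Theta0' Y' X B,
# and update Theta and B by the eigenvector matrix V; illustrative only.
import numpy as np

def pos_update(X, Y, Theta0, Omega, lam):
    B = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)  # (B.3) for Theta0
    M_theta = Theta0.T @ Y.T @ X @ B                              # small (K-1)x(K-1) matrix
    evals, V = np.linalg.eigh((M_theta + M_theta.T) / 2)          # symmetrize for safety
    order = np.argsort(evals)[::-1]                               # eigenvalues in descending order
    V = V[:, order]
    return Theta0 @ V, B @ V                                      # "updated" Theta and B
```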


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^\top X (X^\top X + Ω)^{-1} X^\top Y.

By definition of the eigen-decomposition, the eigenvectors of the matrix M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:
\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m , \qquad \text{s.t. } \theta_k^\top \theta_k = 1 . \tag{B.8}
\]
The score vectors' normalization constraint θ_k^\top θ_k = 1 can also be expressed as a function of this basis,
\[
\left( \sum_{m=1}^{K-1} \alpha_m w_m \right)^{\!\top} \left( \sum_{m=1}^{K-1} \alpha_m w_m \right) = 1 ,
\]
which, by the eigenvector properties, reduces to
\[
\sum_{m=1}^{K-1} \alpha_m^2 = 1 . \tag{B.9}
\]

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):
\[
M \theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m .
\]
As the w_m are the eigenvectors of the matrix M, the relationship M w_m = λ_m w_m can be used to obtain
\[
M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m .
\]
Multiplying on the left by θ_k^\top, expressed through its own linear combination of eigenvectors, gives
\[
\theta_k^\top M \theta_k = \left( \sum_{\ell=1}^{K-1} \alpha_\ell w_\ell \right)^{\!\top} \left( \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m \right) .
\]
This equation can be simplified using the orthogonality property of eigenvectors, according to which w_\ell^\top w_m is zero for any \ell ≠ m, giving
\[
\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m .
\]

The optimization Problem (B.5) for discriminant direction k can thus be rewritten as
\[
\begin{aligned}
\max_{\theta_k \in \mathbb{R}^{K \times 1}} \ & \theta_k^\top M \theta_k
= \max_{\theta_k \in \mathbb{R}^{K \times 1}} \ \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m \\
\text{with } & \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m
\quad \text{and} \quad \sum_{m=1}^{K-1} \alpha_m^2 = 1 .
\end{aligned}
\tag{B.10}
\]

One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K-1} α_m w_m, the resulting score vector θ_k will be equal to the k-th eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y^\top X (X^\top X + Ω)^{-1} X^\top Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:
\[
\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \ & \beta^\top \Sigma_B \beta & \text{(C.1a)} \\
\text{s.t. } & \beta^\top \Sigma_W \beta = 1 , & \text{(C.1b)}
\end{aligned}
\]
where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is
\[
L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu \left( \beta^\top \Sigma_W \beta - 1 \right) ,
\]
so that its first derivative with respect to β is
\[
\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta .
\]
A necessary optimality condition for β^* is that this derivative is zero, that is,
\[
\Sigma_B \beta^* = \nu \Sigma_W \beta^* .
\]
Provided Σ_W is full rank, we have
\[
\Sigma_W^{-1} \Sigma_B \beta^* = \nu \beta^* . \tag{C.2}
\]
Thus, the solutions β^* match the definition of an eigenvector of the matrix Σ_W^{-1} Σ_B, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\[
\begin{aligned}
\beta^{*\top} \Sigma_B \beta^* &= \beta^{*\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^* \\
&= \nu\, \beta^{*\top} \Sigma_W \beta^* \qquad \text{from (C.2)} \\
&= \nu \qquad \text{from (C.1b)} .
\end{aligned}
\]
That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of Σ_W^{-1} Σ_B, and β^* is any eigenvector corresponding to this maximal eigenvalue.
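As an illustration, the leading discriminant direction can be obtained numerically from the two scatter matrices by solving the generalized eigen-problem Σ_B β = ν Σ_W β; the sketch below uses scipy for this purpose and its names are illustrative.

```python
# Illustrative sketch: first Fisher discriminant direction as the leading
# eigenvector of the generalized problem Sigma_B beta = nu Sigma_W beta.
import numpy as np
from scipy.linalg import eigh

def fisher_direction(Sigma_B, Sigma_W):
    evals, evecs = eigh(Sigma_B, Sigma_W)        # generalized symmetric eigen-problem
    beta = evecs[:, np.argmax(evals)]            # eigenvector of the largest eigenvalue
    # eigh normalizes so that beta' Sigma_W beta = 1, matching constraint (C.1b)
    return beta
```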


              D Alternative Variational Formulation forthe Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
\[
\begin{aligned}
\min_{\tau \in \mathbb{R}^p}\ \min_{B \in \mathbb{R}^{p \times K-1}} \ & J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} & \text{(D.1a)} \\
\text{s.t. } & \sum_{j=1}^{p} \tau_j = 1 , & \text{(D.1b)} \\
& \tau_j \ge 0 , \quad j = 1, \dots, p . & \text{(D.1c)}
\end{aligned}
\]

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1\top}, …, β^{p\top})^\top:
\[
L(B, \tau, \lambda, \nu_0, \nu_j) = J(B)
+ \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}
+ \nu_0 \left( \sum_{j=1}^{p} \tau_j - 1 \right)
- \sum_{j=1}^{p} \nu_j \tau_j . \tag{D.2}
\]

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j^*:
\[
\frac{\partial L(B, \tau, \lambda, \nu_0, \nu_j)}{\partial \tau_j}\bigg|_{\tau_j = \tau_j^*} = 0
\;\Rightarrow\; -\lambda \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j^{*2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{*2} - \nu_j \tau_j^{*2} = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{*2} = 0 .
\]
The last two expressions are related through one of the properties of the Lagrange multipliers, which states that ν_j g_j(τ^*) = 0, where ν_j is the Lagrange multiplier and g_j(τ) the inequality constraint. The optimal τ_j^* can then be deduced:
\[
\tau_j^* = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \|\beta^j\|_2 .
\]
Plugging this optimal value of τ_j^* into constraint (D.1b),
\[
\sum_{j=1}^{p} \tau_j = 1 \;\Rightarrow\;
\tau_j^* = \frac{w_j \|\beta^j\|_2}{\sum_{j=1}^{p} w_j \|\beta^j\|_2} . \tag{D.3}
\]


With this value of τ_j^*, Problem (D.1) is equivalent to
\[
\min_{B \in \mathbb{R}^{p \times K-1}} \ J(B) + \lambda \left( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \right)^{\!2} . \tag{D.4}
\]
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently written as λ B^\top Ω B, where
\[
\Omega = \mathrm{diag}\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \right) . \tag{D.5}
\]
Using the value of τ_j^* from (D.3), each diagonal component of Ω is
\[
(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2} . \tag{D.6}
\]

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is
\[
\left\{ V \in \mathbb{R}^{p \times (K-1)} :\ V = \frac{\partial J(B)}{\partial B}
+ 2\lambda \left( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \right) G \right\} , \tag{D.7}
\]
where G is a p × (K−1) matrix whose rows g^j are defined as follows. Let S(B) denote the support of B, that is, the set of its non-zero row vectors, S(B) = { j ∈ {1, …, p} : ‖β^j‖_2 ≠ 0 }; then we have
\[
\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j , \tag{D.8}
\]
\[
\forall j \notin S(B), \quad \|g^j\|_2 \le w_j . \tag{D.9}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B^* of the objective function verifying the following conditions are global minima. Let S(B^*) denote the support of B^*, S(B^*) = { j ∈ {1, …, p} : ‖β^{*j}‖_2 ≠ 0 }, and let S̄(B^*) be its complement; then we have
\[
\forall j \in S(B^*), \quad -\frac{\partial J(B^*)}{\partial \beta^j}
= 2\lambda \left( \sum_{j'=1}^{p} w_{j'} \|\beta^{*j'}\|_2 \right) w_j \|\beta^{*j}\|_2^{-1} \beta^{*j} , \tag{D.10a}
\]
\[
\forall j \notin S(B^*), \quad \left\| \frac{\partial J(B^*)}{\partial \beta^j} \right\|_2
\le 2\lambda\, w_j \left( \sum_{j'=1}^{p} w_{j'} \|\beta^{*j'}\|_2 \right) . \tag{D.10b}
\]

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at τ^* such that
\[
\tau_j^* = \frac{w_j \|\beta^j\|_2}{\sum_{j=1}^{p} w_j \|\beta^j\|_2} .
\]

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have
\[
\left( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \right)^{\!2}
= \left( \sum_{j=1}^{p} \tau_j^{1/2}\, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}} \right)^{\!2}
\le \left( \sum_{j=1}^{p} \tau_j \right) \left( \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \right)
\le \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} ,
\]
where we used the Cauchy-Schwarz inequality in the second step and the definition of the feasibility set of τ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B0 are optimal for the score values Θ0, and if the optimal scores Θ^* are obtained by a unitary transformation of Θ0, say Θ^* = Θ0 V (where V ∈ R^{M×M} is a unitary matrix), then B^* = B0 V is optimal conditionally on Θ^*, that is, (Θ^*, B^*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let B^* be a solution of
\[
\min_{B \in \mathbb{R}^{p \times M}} \ \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 , \tag{E.1}
\]
and let \(\tilde{Y} = YV\), where V ∈ R^{M×M} is a unitary matrix. Then \(\tilde{B} = B^* V\) is a solution of
\[
\min_{B \in \mathbb{R}^{p \times M}} \ \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 . \tag{E.2}
\]

Proof. The first-order necessary optimality conditions for B^* are
\[
\forall j \in S(B^*), \quad 2\, x^{j\top} \left( x^j \beta^{*j} - Y \right)
+ \lambda w_j \|\beta^{*j}\|_2^{-1} \beta^{*j} = 0 , \tag{E.3a}
\]
\[
\forall j \notin S(B^*), \quad 2 \left\| x^{j\top} \left( x^j \beta^{*j} - Y \right) \right\|_2 \le \lambda w_j , \tag{E.3b}
\]
where S(B^*) ⊆ {1, …, p} denotes the set of non-zero row vectors of B^*, and S̄(B^*) is its complement.

First, we note that, from the definition of \(\tilde{B}\), we have S(\(\tilde{B}\)) = S(B^*). Then we may rewrite the above conditions as follows:
\[
\forall j \in S(\tilde{B}), \quad 2\, x^{j\top} \left( x^j \tilde{\beta}^{j} - \tilde{Y} \right)
+ \lambda w_j \|\tilde{\beta}^{j}\|_2^{-1} \tilde{\beta}^{j} = 0 , \tag{E.4a}
\]
\[
\forall j \notin S(\tilde{B}), \quad 2 \left\| x^{j\top} \left( x^j \tilde{\beta}^{j} - \tilde{Y} \right) \right\|_2 \le \lambda w_j , \tag{E.4b}
\]
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, and also uses VV^\top = I, so that, for all u ∈ R^M, ‖u^\top‖_2 = ‖u^\top V‖_2; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for \(\tilde{B}\) to be a solution to Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.


F Expected Complete Likelihood and Likelihood

              Section 712 explains that with the maximization of the conditional expectation ofthe complete log-likelihood Q(θθprime) (77) by means of the EM algorithm log-likelihood(71) is also maximized The value of the log-likelihood can be computed using itsdefinition (71) but there is a shorter way to compute it from Q(θθprime) when the latteris available

              L(θ) =

              nsumi=1

              log

              (Ksumk=1

              πkfk(xiθk)

              )(F1)

              Q(θθprime) =nsumi=1

              Ksumk=1

              tik(θprime) log (πkfk(xiθk)) (F2)

              with tik(θprime) =

              πprimekfk(xiθprimek)sum

              ` πprime`f`(xiθ

              prime`)

              (F3)

In the EM algorithm, θ′ denotes the model parameters at the previous iteration, t_{ik}(θ′) are the posterior probabilities computed from θ′ at the previous E-step, and θ, without "prime", denotes the parameters of the current iteration, to be obtained by maximizing Q(θ, θ′).

Using (F.3), we have

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\big( \pi_k f_k(x_i; \theta_k) \big)
                  = \sum_{i,k} t_{ik}(\theta') \log\big( t_{ik}(\theta) \big) + \sum_{i,k} t_{ik}(\theta') \log\left( \sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell) \right)
                  = \sum_{i,k} t_{ik}(\theta') \log\big( t_{ik}(\theta) \big) + L(\theta) .

In particular, after the evaluation of t_{ik} in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log\big( t_{ik}(\theta) \big)
          = Q(\theta, \theta) + H(T) .
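This identity can be checked numerically on a toy univariate Gaussian mixture; the sketch below (with illustrative parameter values) compares the direct evaluation of the log-likelihood (F.1) with the value recovered from Q(θ, θ) and the entropy of the posterior probabilities.

# Check L(theta) = Q(theta, theta) + H(T) on a toy univariate Gaussian mixture.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)                      # toy sample
pi = np.array([0.4, 0.6])                    # illustrative mixture parameters
mu = np.array([-1.0, 2.0])
sigma = np.array([1.0, 1.5])

# Component densities f_k(x_i; theta_k), array of shape (n, K)
f = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Posterior probabilities t_ik (E-step evaluated at theta)
t = pi * f
t /= t.sum(axis=1, keepdims=True)

L_direct = np.sum(np.log((pi * f).sum(axis=1)))   # definition (F.1)
Q = np.sum(t * np.log(pi * f))                    # Q(theta, theta), as in (F.2)
H = -np.sum(t * np.log(t))                        # entropy of the posteriors
print(L_direct, Q + H)                            # agree up to numerical precision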


              G Derivation of the M-Step Equations

This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\big( \pi_k f_k(x_i; \theta_k) \big)
                  = \sum_{k} \Big( \sum_{i} t_{ik} \Big) \log \pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^{\top} \Sigma^{-1} (x_i - \mu_k) ,

which has to be maximized subject to \sum_{k} \pi_k = 1.

The Lagrangian of this problem is

\mathcal{L}(\theta) = Q(\theta, \theta') + \lambda \Big( \sum_{k} \pi_k - 1 \Big) .

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.

G.1 Prior probabilities

\frac{\partial \mathcal{L}(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k} \sum_{i} t_{ik} + \lambda = 0 ,

where λ is identified from the constraint, leading to

\pi_k = \frac{1}{n} \sum_{i} t_{ik} .


G.2 Means

\frac{\partial \mathcal{L}(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2} \sum_{i} t_{ik} \, 2\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik} x_i}{\sum_i t_{ik}} .

G.3 Covariance Matrix

\frac{\partial \mathcal{L}(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\,\Sigma}_{\text{as per property 4}} - \underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^{\top}}_{\text{as per property 5}} = 0

\;\Rightarrow\; \Sigma = \frac{1}{n} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^{\top} .
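These update rules translate directly into code. The sketch below is illustrative: it assumes that the responsibilities t_ik (an n × K array t) have already been computed in the E-step, and returns the three M-step estimates for a Gaussian mixture with a common covariance matrix.

# M-step for a Gaussian mixture with a common covariance matrix,
# given data X (n x p) and responsibilities t (n x K) from the E-step.
import numpy as np

def m_step(X, t):
    n, p = X.shape
    n_k = t.sum(axis=0)                      # soft counts per cluster
    pi = n_k / n                             # prior probabilities (G.1)
    mu = (t.T @ X) / n_k[:, None]            # cluster means (G.2)
    # Common covariance matrix (G.3)
    Sigma = np.zeros((p, p))
    for k in range(t.shape[1]):
        d = X - mu[k]                        # data centered on cluster k
        Sigma += (t[:, k, None] * d).T @ d
    Sigma /= n
    return pi, mu, Sigma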


              Bibliography

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, arXiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.


T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.


J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, arXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.


T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.


C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.


Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York, Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.


M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.


M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.



                Contents

7.2 Feature Selection in Model-Based Clustering
    7.2.1 Based on Penalized Likelihood
    7.2.2 Based on Model Variants
    7.2.3 Based on Model Selection

8 Theoretical Foundations
    8.1 Resolving EM with Optimal Scoring
        8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
        8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
        8.1.3 Clustering Using Penalized Optimal Scoring
        8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
    8.2 Optimized Criterion
        8.2.1 A Bayesian Derivation
        8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
    9.1 Mix-GLOSS
        9.1.1 Outer Loop: Whole Algorithm Repetitions
        9.1.2 Penalty Parameter Loop
        9.1.3 Inner Loop: EM Algorithm
    9.2 Model Selection

10 Experimental Results
    10.1 Tested Clustering Algorithms
    10.2 Results
    10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
    B.1 How to Solve the Eigenvector Decomposition
    B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
    D.1 Useful Properties
    D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
    G.1 Prior probabilities
    G.2 Means
    G.3 Covariance Matrix

Bibliography

                List of Figures

1.1 MASH project logo

2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ||β||p
2.4 Two dimensional regularized problems with ||β||1 and ||β||2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters

4.1 Graphical representation of the variational approach to Group-Lasso

5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image

6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"

9.1 Mix-GLOSS Loops Scheme
9.2 Mix-GLOSS model selection diagram

10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations


                List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification

10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations


                Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N            the set of natural numbers, N = {1, 2, ...}
R            the set of reals
|A|          cardinality of a set A (for finite sets, the number of elements)
Ā            complement of set A

Data

X            input domain
x_i          input sample, x_i ∈ X
X            design matrix, X = (x_1^⊤, ..., x_n^⊤)^⊤
x^j          column j of X
y_i          class indicator of sample i
Y            indicator matrix, Y = (y_1^⊤, ..., y_n^⊤)^⊤
z            complete data, z = (x, y)
G_k          set of the indices of observations belonging to class k
n            number of examples
K            number of classes
p            dimension of X
i, j, k      indices running over N

Vectors, Matrices and Norms

0            vector with all entries equal to zero
1            vector with all entries equal to one
I            identity matrix
A^⊤          transpose of matrix A (ditto for vector)
A^{-1}       inverse of matrix A
tr(A)        trace of matrix A
|A|          determinant of matrix A
diag(v)      diagonal matrix with v on the diagonal
‖v‖_1        L1 norm of vector v
‖v‖_2        L2 norm of vector v
‖A‖_F        Frobenius norm of matrix A

Probability

E[·]         expectation of a random variable
var[·]       variance of a random variable
N(μ, σ²)     normal distribution with mean μ and variance σ²
W(W, ν)      Wishart distribution with ν degrees of freedom and W scale matrix
H(X)         entropy of random variable X
I(X, Y)      mutual information between random variables X and Y

Mixture Models

y_ik         hard membership of sample i to cluster k
f_k          distribution function for cluster k
t_ik         posterior probability of sample i to belong to cluster k
T            posterior probability matrix
π_k          prior probability or mixture proportion for cluster k
μ_k          mean vector of cluster k
Σ_k          covariance matrix of cluster k
θ_k          parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)        parameter vector at iteration t of the EM algorithm
f(X; θ)      likelihood function
L(θ; X)      log-likelihood function
L_C(θ; X, Y) complete log-likelihood function

Optimization

J(·)         cost function
L(·)         Lagrangian
β̂            generic notation for the solution with respect to β
β^ls         least squares solution coefficient vector
A            active set
γ            step size to update the regularization path
h            direction to update the regularization path

Penalized models

λ, λ1, λ2    penalty parameters
P_λ(θ)       penalty term over a generic parameter vector
β_kj         coefficient j of discriminant vector k
β_k          kth discriminant vector, β_k = (β_k1, ..., β_kp)
B            matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j          jth row of B = (β^{1⊤}, ..., β^{p⊤})^⊤
B_LDA        coefficient matrix in the LDA domain
B_CCA        coefficient matrix in the CCA domain
B_OS         coefficient matrix in the OS domain
X_LDA        data matrix in the LDA domain
X_CCA        data matrix in the CCA domain
X_OS         data matrix in the OS domain
θ_k          score vector k
Θ            score matrix, Θ = (θ_1, ..., θ_{K-1})
Y            label matrix
Ω            penalty matrix
L_CP(θ; X, Z) penalized complete log-likelihood function
Σ_B          between-class covariance matrix
Σ_W          within-class covariance matrix
Σ_T          total covariance matrix
Σ̂_B          sample between-class covariance matrix
Σ̂_W          sample within-class covariance matrix
Σ̂_T          sample total covariance matrix
Λ            inverse of covariance matrix, or precision matrix
w_j          weights
τ_j          penalty components of the variational approach

                Part I

                Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here, to introduce the models and some basic concepts that will be used along this document. The state of the art of feature selection is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


                1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of the website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation (a minimal illustration of this principle is sketched after this list). For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using the RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product (see the sketch after this list). The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
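As a minimal illustration of the principle behind the first tool (this sketch uses scikit-learn for conciseness; it is not the mixmod-based implementation of the platform, and all names are illustrative), a Gaussian mixture is fitted by EM and the clusters are obtained by maximum a posteriori assignment.

# Toy illustration of clustering with a Gaussian mixture fitted by EM,
# followed by maximum a posteriori (MAP) cluster assignment.
# Sketch with scikit-learn; not the mixmod-based tool of the project.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 5)),
               rng.normal(3, 1, size=(100, 5))])   # two artificial groups

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                         # EM estimation of proportions, means, covariances
labels = gmm.predict(X)            # MAP assignment to the most probable component
posteriors = gmm.predict_proba(X)  # posterior probabilities t_ik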
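The RV coefficient used by the last tool can be sketched as follows; here the operators are simply taken as O = XX^⊤ computed from column-centered tables, which is an illustrative choice rather than the exact operators used on the platform.

# Sketch of the RV coefficient between two feature tables X and Y describing
# the same n samples; O_X = X X^T and O_Y = Y Y^T play the role of the operators.
import numpy as np

def rv_coefficient(X, Y):
    X = X - X.mean(axis=0)            # column-center each table
    Y = Y - Y.mean(axis=0)
    Ox, Oy = X @ X.T, Y @ Y.T
    return np.trace(Ox @ Oy) / np.sqrt(np.trace(Ox @ Ox) * np.trace(Oy @ Oy))

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
B = A @ rng.normal(size=(10, 4))      # a table built from the same information
C = rng.normal(size=(50, 6))          # an unrelated table
print(rv_coefficient(A, B))           # higher value: the tables carry related information
print(rv_coefficient(A, C))           # lower value: unrelated tables
# A dissimilarity between extractors can then be defined, e.g., as 1 - RV(O_i, O_j).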

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


                2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined, while many algorithms in the field of Machine Learning make use of this statistic.
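The covariance issue is easy to reproduce: with fewer samples than features, the sample covariance matrix is rank-deficient and cannot be inverted, as the following illustrative sketch shows.

# When p > n the sample covariance matrix is rank-deficient, hence singular.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # fewer samples than features
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)        # p x p sample covariance matrix

print(np.linalg.matrix_rank(S))    # at most n - 1 = 19, far below p = 50
# np.linalg.inv(S) would fail or be meaningless; a regularized estimate
# such as S + alpha * np.eye(p) restores invertibility.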

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems, the complexity of calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant, and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information, but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus, the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus, the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus, the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant, so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques used to preprocess data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references, which does not cover all the possibilities but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features, and can be used in wrapper and embedded models.

In this thesis, we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and stable. This is the case, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_{\beta} J(\beta) + \lambda P(\beta)     (2.1)

\min_{\beta} J(\beta) \quad \text{s.t.} \quad P(\beta) \le t     (2.2)

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p.

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)     (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties.

Regularizing a linear model with a norm like ||β||_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum βls is outside the penalties' admissible region. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.

To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence, the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||_0 = card{β_j | β_j ≠ 0}:

\min_{\beta} J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t     (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using the L1 norm induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_{\beta} J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t     (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not only a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts beside regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions where Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).
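As a small illustration of the selection effect (not part of the original development), the following sketch, assuming scikit-learn is available and with arbitrary penalty values, fits a Lasso and a ridge estimator on synthetic data containing only two informative variables: the L1 penalty sets most coefficients exactly to zero, while the quadratic penalty of the next paragraphs only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=50)   # only 2 informative variables

lasso_coef = Lasso(alpha=0.1).fit(X, y).coef_
ridge_coef = Ridge(alpha=1.0).fit(X, y).coef_

print(np.sum(lasso_coef != 0))   # few non-zero coefficients (sparse model)
print(np.sum(ridge_coef != 0))   # all coefficients non-zero (no selection)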

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

\min_{\beta} J(\beta) + \lambda \|\beta\|_2^2     (2.6)

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2     (2.7)

with solution β_ls = (X^T X)^{-1} X^T y. If some input variables are highly correlated, the estimator β_ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

The solution to this problem is β_l2 = (X^T X + λI_p)^{-1} X^T y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
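The following minimal numpy sketch (synthetic data, arbitrary λ, dimensions chosen for illustration only) computes the ridge estimator β_l2 = (X^T X + λI_p)^{-1} X^T y and shows the eigenvalue shift described above:

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 1.0

# Two highly correlated columns make X^T X ill-conditioned.
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 1e-3 * rng.normal(size=n)
beta_true = np.zeros(p); beta_true[:3] = [1.0, -1.0, 2.0]
y = X @ beta_true + 0.1 * rng.normal(size=n)

G = X.T @ X
print(np.linalg.eigvalsh(G).min())                      # near-zero smallest eigenvalue
print(np.linalg.eigvalsh(G + lam * np.eye(p)).min())    # shifted upwards by lam

beta_ls = np.linalg.solve(G, X.T @ y)                   # unstable least squares estimate
beta_l2 = np.linalg.solve(G + lam * np.eye(p), X.T @ y) # ridge estimate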

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_{ls,j})^2}     (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component: every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ||x||_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ||β||_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm ||β||* of a norm ||β|| is defined as

\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not so popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
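As a small numerical illustration of this duality (not part of the original development), the maximizer of β^T w over the L1 unit ball is a signed canonical basis vector, and the attained value is ||β||_∞:

import numpy as np

rng = np.random.default_rng(1)
beta = rng.normal(size=5)

# Dual norm of L1 is Linf: the maximizer over ||w||_1 <= 1 is a signed basis vector.
j = np.argmax(np.abs(beta))
w_star = np.zeros_like(beta)
w_star[j] = np.sign(beta[j])

print(np.linalg.norm(w_star, 1))                     # 1.0, feasible point
print(beta @ w_star, np.linalg.norm(beta, np.inf))   # equal values

# No other feasible w does better (Hoelder's inequality):
w = rng.normal(size=(1000, 5))
w /= np.abs(w).sum(axis=1, keepdims=True)
assert np.all(w @ beta <= np.linalg.norm(beta, np.inf) + 1e-12)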

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2     (2.9)

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotical capability (when n goes to infinity) of always making the right choice of relevant variables.


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓth process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β is the sum of the number of genes of every group: dim(β) = Σ_{ℓ=1}^{L} d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \Bigg)^{1/r}     (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
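A minimal numpy sketch of the mixed norm (2.10) may help fix ideas; the function name and the group encoding below are illustrative choices, not taken from any package:

import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    """Compute ||beta||_(r,s): an Ls norm within groups, an Lr norm between groups."""
    beta, groups = np.asarray(beta, float), np.asarray(groups)
    within = np.array([np.linalg.norm(beta[groups == g], ord=s)
                       for g in np.unique(groups)])
    return np.linalg.norm(within, ord=r)

# Example: 6 coefficients split into L = 3 groups of d_l = 2 variables each.
beta = np.array([0.0, 0.0, 1.0, -2.0, 0.5, 0.0])
groups = np.array([0, 0, 1, 1, 2, 2])
print(mixed_norm(beta, groups, r=1, s=2))        # group-Lasso penalty, ||beta||_(1,2)
print(mixed_norm(beta, groups, r=1, s=np.inf))   # L_(1,inf) variant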

Several combinations are available; the most popular is the norm ||β||_(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L1,2 norm. Many other mixings are possible, such as ||β||_(1,4/3) (Szafranski et al., 2008) or ||β||_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

Figure 2.5: Admissible sets for the Lasso (a: L1) and the group-Lasso (b: L(1,2)).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity, (b) L(1,2) group-induced sparsity.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms with the proper definition of groups can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha (\mathbf{s} + \lambda \mathbf{s}'), \quad \text{where } \mathbf{s} \in \partial J(\beta^{(t)}), \; \mathbf{s}' \in \partial P(\beta^{(t)})

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2}

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β_ls and updating their values using an iterative thresholding algorithm where β_j^{(t+1)} = S_λ(∂J(β^{(t)})/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\
0 & \text{if } |\partial J(\beta)/\partial \beta_j| \le \lambda
\end{cases}     (2.11)

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
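To fix ideas, here is a minimal, self-contained coordinate descent sketch for the squared loss J(β) = ||y − Xβ||² with a Lasso penalty; it follows the soft-thresholding principle above, but it is only an illustration, not Fu's (1998) implementation, and the data and λ value are arbitrary.

import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                    # sum_i x_ij^2 for each column j
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_j
            z = 2 * X[:, j] @ r_j                    # negative gradient at beta_j = 0
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / (2 * col_sq[j])  # soft-threshold
    return beta

# Synthetic example: only the first two coefficients are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=20.0), 2))         # most entries shrunk exactly to zero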

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of variables with non-zero β_j; it is usually denoted A. The complement of the active set is the "inactive set", noted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be reduced to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There exists also a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions: their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).
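A schematic working-set loop for the Lasso, using the coordinate descent update above as the inner solver, could look like the following sketch; it is only an illustration under arbitrary settings, not the algorithm of Osborne et al. (2000a).

import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_working_set(X, y, lam, tol=1e-6, inner_iter=200):
    """Forward working-set scheme for ||y - X beta||^2 + lam ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    active = []                                    # indices of the active set A
    while True:
        r = y - X @ beta
        c = 2 * X.T @ r                            # optimality scores of all variables
        if active:
            c[active] = 0.0                        # only inspect the inactive set
        j = int(np.argmax(np.abs(c)))
        if abs(c[j]) <= lam + tol:                 # no violated optimality condition: done
            return beta
        active.append(j)                           # working set update
        for _ in range(inner_iter):                # restricted subproblem, warm-started
            for k in active:
                r_k = y - X @ beta + X[:, k] * beta[k]
                beta[k] = soft_threshold(2 * X[:, k] @ r_k, lam) / (2 * (X[:, k] ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = 1.5 * X[:, 3] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=100)
beta = lasso_working_set(X, y, lam=20.0)
print(np.flatnonzero(beta))                        # selected features, e.g. [3 7]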

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L1,2 penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A(t) and its corresponding solution β(t) have been set, following the regularization path means looking for a direction h and a step size γ to update the solution as β(t+1) = β(t) + γh. Afterwards, the active and inactive sets A(t+1) and Ā(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and which variable should enter the active set, from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β). They are iterative methods where the cost function J(β) is linearized in the proximity of the current solution β(t), so that the problem to solve at each iteration looks like

\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2     (2.12)

where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta)     (2.13)

The basic algorithm uses the solution to (2.13) as the next value β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
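A minimal sketch of the basic proximal iteration (2.13) for J(β) = ||y − Xβ||² and P(β) = ||β||₁ is given below; the proximal step then reduces to soft-thresholding, and the Lipschitz constant is taken as L = 2 λ_max(X^T X). This is only an illustration (ISTA-like), with arbitrary data and λ.

import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Basic proximal-gradient iteration for ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()     # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ beta)          # gradient of J at beta^(t)
        z = beta - grad / L                       # plain gradient step (lambda = 0 case)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)||.||_1
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)
print(np.round(ista_lasso(X, y, lam=20.0), 2))    # sparse estimate, close to the CD sketch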


                Part II

                Sparse Linear Discriminant Analysis


                Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means by L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced rank classification. The first two or three directions can also be used to project the data so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and label y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n×K matrix Y = (y_1^T, ..., y_n^T)^T.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta}     (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class covariance and within-class covariance matrices, respectively defined (for a K-class problem) as

\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top

\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\boldsymbol{\mu} - \boldsymbol{\mu}_k)(\boldsymbol{\mu} - \boldsymbol{\mu}_k)^\top

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.

This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations for the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

\max_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\left(\mathbf{B}^\top \Sigma_B \mathbf{B}\right)}{\operatorname{tr}\left(\mathbf{B}^\top \Sigma_W \mathbf{B}\right)}     (3.2)

where the matrix B is built with the discriminant directions β_k as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B \beta_k \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1 \\
& \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k
\end{aligned}     (3.3)

The maximizer of subproblem k is the eigenvector of \Sigma_W^{-1} \Sigma_B associated to the kth largest eigenvalue (see Appendix C).
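As an illustration of this eigenvalue formulation, the following numpy/scipy sketch computes the K − 1 discriminant directions on synthetic Gaussian classes by solving the generalized eigenproblem Σ_B β = λ Σ_W β; the data, dimensions and class means are arbitrary.

import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, n_classes):
    """Discriminant directions as leading eigenvectors of the pencil (Sigma_B, Sigma_W)."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in range(n_classes):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sw += (Xk - mu_k).T @ (Xk - mu_k)
        Sb += len(Xk) * np.outer(mu - mu_k, mu - mu_k)
    Sw, Sb = Sw / n, Sb / n
    evals, evecs = eigh(Sb, Sw)                   # Sigma_B b = lambda Sigma_W b
    order = np.argsort(evals)[::-1]
    return evecs[:, order[: n_classes - 1]]       # the K - 1 discriminant directions

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(30, 4))
               for m in ([0, 0, 0, 0], [2, 0, 0, 0], [0, 2, 0, 0])])
y = np.repeat([0, 1, 2], 30)
B = fisher_directions(X, y, n_classes=3)          # 4 x 2 matrix of directions
print(B.shape)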

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly targets the reduction of the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's Discriminant Analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_W \beta \\
\text{s.t.} \;\; & (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \beta = 1 \\
& \textstyle\sum_{j=1}^{p} |\beta_j| \le t
\end{aligned}

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match Problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1
\end{aligned}

The term to maximize is the projected between-class covariance matrix β_k^T Σ_B β_k, subject to an upper bound on the projected within-class covariance matrix β_k^T Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks to zero the less informative variables, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \|\beta\|_1 \\
\text{s.t.} \;\; & \left\| \Sigma \beta - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \right\|_\infty \le \lambda
\end{aligned}

Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.

Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

                Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix containing the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K − 1) otherwise. It was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).
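For concreteness, a small sketch building both indicator matrices from integer labels is given below; the function name and the scheme flag are illustrative choices, not taken from any library.

import numpy as np

def indicator_matrix(labels, n_classes, scheme="dummy"):
    """Build an n x K class indicator matrix from integer labels in {0, ..., K-1}."""
    n = len(labels)
    if scheme == "dummy":                          # y_ik = 1 if class k, else 0
        Y = np.zeros((n, n_classes))
        Y[np.arange(n), labels] = 1.0
    else:                                          # y_ik = 1 if class k, else -1/(K-1)
        Y = np.full((n, n_classes), -1.0 / (n_classes - 1))
        Y[np.arange(n), labels] = 1.0
    return Y

labels = np.array([0, 2, 1, 1])
print(indicator_matrix(labels, 3))
print(indicator_matrix(labels, 3, scheme="pm"))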

Some efforts propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

\min_{\beta \in \mathbb{R}^p, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

where y_i is the binary indicator of the label for pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^T β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting from λ = 0, but a different intercept β_0 is required.

                Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\min_{\Theta, \mathbf{B}} \; \|\mathbf{Y}\Theta - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \operatorname{tr}\left(\mathbf{B}^\top \Omega \mathbf{B}\right)     (3.4a)

\text{s.t.} \quad n^{-1} \Theta^\top \mathbf{Y}^\top \mathbf{Y} \Theta = \mathbf{I}_{K-1}     (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ||·||_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

\min_{\theta_k \in \mathbb{R}^K, \beta_k \in \mathbb{R}^p} \; \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|^2 + \beta_k^\top \Omega \beta_k     (3.5a)

\text{s.t.} \quad n^{-1} \theta_k^\top \mathbf{Y}^\top \mathbf{Y} \theta_k = 1     (3.5b)

\quad\quad \theta_k^\top \mathbf{Y}^\top \mathbf{Y} \theta_\ell = 0, \quad \ell = 1, \ldots, k - 1     (3.5c)

where each β_k corresponds to a discriminant direction.

Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\min_{\beta_k \in \mathbb{R}^p, \theta_k \in \mathbb{R}^K} \; \sum_{k} \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2 \beta_k^\top \Omega \beta_k

where λ1 and λ2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-lasso penalty in the objective function (3.5a):

\min_{\beta_k \in \mathbb{R}^p, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda \Bigg( \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K-1} \beta_{kj}^2} \Bigg)^2     (3.6)

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


                4 Formalizing the Objective

In this chapter we detail the rationale supporting the group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K − 1, or partial for a reduced rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix Y^T Y is full rank;

• inputs are centered, that is, X^T 1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^T X + Ω is full rank.

                bull the quadratic penalty Ω is positive-semidefinite and such that XgtX + Ω is fullrank


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex; in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c) that apply along the route, so as to simplify all expressions. The generic problem solved is thus

\min_{\theta \in \mathbb{R}^K, \beta \in \mathbb{R}^p} \; \|\mathbf{Y}\theta - \mathbf{X}\beta\|^2 + \beta^\top \Omega \beta     (4.1a)

\text{s.t.} \quad n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta = 1     (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\beta_{os} = \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y} \theta     (4.2)

The objective function (4.1a) is then

\begin{aligned}
\|\mathbf{Y}\theta - \mathbf{X}\beta_{os}\|^2 + \beta_{os}^\top \Omega \beta_{os}
&= \theta^\top \mathbf{Y}^\top \mathbf{Y}\theta - 2\,\theta^\top \mathbf{Y}^\top \mathbf{X}\beta_{os} + \beta_{os}^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta_{os} \\
&= \theta^\top \mathbf{Y}^\top \mathbf{Y}\theta - \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta
\end{aligned}

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\max_{\theta:\, n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{Y}\theta = 1} \; \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta     (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^T X(X^T X + Ω)^{-1}X^T Y. Indeed, Appendix C details that Problem (4.3) is solved by

\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1} \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta = \alpha^2 \theta     (4.4)

where α² is the maximal eigenvalue:¹

\begin{aligned}
n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta &= \alpha^2\, n^{-1} \theta^\top \left(\mathbf{Y}^\top \mathbf{Y}\right) \theta \\
n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta &= \alpha^2
\end{aligned}     (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

\max_{\theta \in \mathbb{R}^K, \beta \in \mathbb{R}^p} \; n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \beta     (4.6a)

\text{s.t.} \quad n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta = 1     (4.6b)

\quad\quad n^{-1} \beta^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta = 1     (4.6c)

The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:

\begin{aligned}
n L(\beta, \theta, \nu, \gamma) &= \theta^\top \mathbf{Y}^\top \mathbf{X}\beta - \nu\left(\theta^\top \mathbf{Y}^\top \mathbf{Y}\theta - n\right) - \gamma\left(\beta^\top (\mathbf{X}^\top \mathbf{X} + \Omega)\beta - n\right) \\
\Rightarrow \; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} &= \mathbf{X}^\top \mathbf{Y}\theta - 2\gamma (\mathbf{X}^\top \mathbf{X} + \Omega)\beta \\
\Rightarrow \; \beta_{cca} &= \frac{1}{2\gamma} (\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta
\end{aligned}

Then, as β_cca obeys (4.6c), we obtain

\beta_{cca} = \frac{(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}{\sqrt{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}}     (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}\beta_{cca} = \frac{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}{\sqrt{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}} = \sqrt{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}

and the optimization problem with respect to θ can be restated as

\max_{\theta:\, n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{Y}\theta = 1} \; \theta^\top \mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta     (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\beta_{os} = \alpha\, \beta_{cca}     (4.9)

¹ The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).

where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\begin{aligned}
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} &= \mathbf{Y}^\top \mathbf{X}\beta - 2\nu \mathbf{Y}^\top \mathbf{Y}\theta \\
\Rightarrow \; \theta_{cca} &= \frac{1}{2\nu} (\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta
\end{aligned}     (4.10)

Then, as θ_cca obeys (4.6b), we obtain

\theta_{cca} = \frac{(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}{\sqrt{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}}     (4.11)

leading to the following expression of the optimal objective function:

n^{-1}\theta_{cca}^\top \mathbf{Y}^\top \mathbf{X}\beta = \frac{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}{\sqrt{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}} = \sqrt{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

\max_{\beta \in \mathbb{R}^p} \; n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta     (4.12a)

\text{s.t.} \quad n^{-1} \beta^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta = 1     (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

n^{-1}\mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta_{cca} = \lambda \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta_{cca}     (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

\begin{aligned}
& n^{-1}\beta_{cca}^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta_{cca} = \lambda \\
\Rightarrow \; & n^{-1}\alpha^{-1}\beta_{cca}^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1}\mathbf{X}^\top \mathbf{Y}\theta = \lambda \\
\Rightarrow \; & n^{-1}\alpha\, \beta_{cca}^\top \mathbf{X}^\top \mathbf{Y}\theta = \lambda \\
\Rightarrow \; & n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1}\mathbf{X}^\top \mathbf{Y}\theta = \lambda \\
\Rightarrow \; & \alpha^2 = \lambda
\end{aligned}

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7) whose denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis is defined as follows:

\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta     (4.14a)

\text{s.t.} \quad \beta^\top \left(\Sigma_W + n^{-1}\Omega\right) \beta = 1     (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation, using the projection operator Y(Y^T Y)^{-1}Y^T:

\begin{aligned}
\Sigma_T &= \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top = n^{-1}\mathbf{X}^\top \mathbf{X} \\
\Sigma_B &= \frac{1}{n} \sum_{k=1}^{K} n_k\, \boldsymbol{\mu}_k \boldsymbol{\mu}_k^\top = n^{-1}\mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X} \\
\Sigma_W &= \frac{1}{n} \sum_{k=1}^{K} \sum_{i:\, y_{ik}=1} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top = n^{-1}\left(\mathbf{X}^\top \mathbf{X} - \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\right)
\end{aligned}

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\begin{aligned}
\mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta_{lda} &= \lambda \left(\mathbf{X}^\top \mathbf{X} + \Omega - \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\right)\beta_{lda} \\
\mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta_{lda} &= \frac{\lambda}{1-\lambda} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)\beta_{lda}
\end{aligned}

The comparison of the last equation with the equation verified by β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

\begin{aligned}
\beta_{lda} &= (1 - \alpha^2)^{-1/2}\, \beta_{cca} \\
&= \alpha^{-1}(1 - \alpha^2)^{-1/2}\, \beta_{os}
\end{aligned}

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
\[
\min_{\Theta,\, B} \; \left\| Y\Theta - XB \right\|_F^2 + \lambda\, \mathrm{tr}\!\left( B^\top \Omega B \right)
\quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} .
\]
Let A represent the (K−1) × (K−1) diagonal matrix with elements α_k defined as the square roots of the K−1 leading eigenvalues of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y; we have
\[
B_{\mathrm{LDA}} = B_{\mathrm{CCA}} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}}
= B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} , \tag{4.15}
\]
where I_{K−1} is the (K−1) × (K−1) identity matrix.

At this point, the feature matrix X, which in the input space has dimensions n × p, can be projected into the optimal scoring domain as an n × (K−1) matrix X_OS = XB_OS, or into the linear discriminant analysis space as an n × (K−1) matrix X_LDA = XB_LDA. Classification can be performed in either of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows (a short sketch of steps 2 and 3 is given after the list):

1. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta ,
\]
where Θ are the K−1 leading eigenvectors of Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = XB_OS D, where D = A^{-1}(I_{K−1} − A^2)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distance d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, produce a graphical representation of the data.
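As a minimal MATLAB sketch of steps 2 and 3 (our own code, assuming that B_OS and the vector alpha of the α_k values are available from step 1):

    % Step 2: map the samples into the LDA domain.
    D = diag(1 ./ (alpha .* sqrt(1 - alpha.^2)));  % D = A^{-1} (I - A^2)^{-1/2}
    XLDA = X * BOS * D;                            % n x (K-1) discriminant variates
    % Step 3: class centroids in the LDA domain (Y is the n x K indicator matrix).
    M = (Y' * Y) \ (Y' * XLDA);                    % K x (K-1) matrix of centroids mu_k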


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
\[
\min_{\Theta \in \mathbb{R}^{K \times K-1},\, B \in \mathbb{R}^{p \times K-1}} \; \left\| Y\Theta - XB \right\|_F^2 + \lambda\, \mathrm{tr}\!\left( B^\top \Omega B \right) \tag{4.16a}
\]
\[
\text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{4.16b}
\]
where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ^0 such that n^{-1} Θ^{0⊤}Y^⊤YΘ^0 = I_{K−1}.

2. Compute B = (X^⊤X + λΩ)^{-1} X^⊤YΘ^0.

3. Set Θ to be the K−1 leading eigenvectors of Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤Y.

4. Compute the optimal regression coefficients
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta . \tag{4.17}
\]

Defining Θ^0 in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ^{0⊤}Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤YΘ^0, which is computed as Θ^{0⊤}Y^⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
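The four steps translate into a few lines of MATLAB. The sketch below is ours (not the package implementation); it assumes a centered X (n × p), an indicator matrix Y (n × K), a penalty matrix Omega and a parameter lambda, and the √n factor in Θ^0 enforces the normalization n^{-1}Θ^{0⊤}Y^⊤YΘ^0 = I_{K−1}.

    % Step 1: initial scores Theta0 satisfying the normalization constraint.
    n = size(X, 1);  K = size(Y, 2);
    U = null(ones(1, K));                                 % K x (K-1), orthogonal to 1_K
    Theta0 = sqrt(n) * diag(1 ./ sqrt(diag(Y' * Y))) * U;
    % Step 2: regression coefficients for the initial scores.
    B0 = (X' * X + lambda * Omega) \ (X' * (Y * Theta0));
    % Step 3: eigen-analysis of the small (K-1) x (K-1) matrix Theta0'*Y'*X*B0.
    [V, S] = eig(Theta0' * (Y' * (X * B0)));
    [~, order] = sort(diag(S), 'descend');
    V = V(:, order);
    Theta = Theta0 * V;
    % Step 4: optimal regression coefficients (4.17), recovered as B0 * V.
    BOS = B0 * V;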

This four-step algorithm is valid when the penalty is of the form B^⊤ΩB. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where the sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general the parameters of the model are unknown, and the rule is applied with the parameters estimated from the training data (sample estimators of μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to the class k for which the distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\left( \frac{n_k}{n} \right) \tag{4.18}
\]
is minimized. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this adjustment is inspired by the Gaussian view of LDA and that another definition could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_{WΩ} used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
\[
\Sigma_{W\Omega}^{-1}
= \left( n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B \right)^{-1}
= \left( n^{-1} X^\top X - \Sigma_B + n^{-1}\lambda\Omega \right)^{-1}
= \left( \Sigma_W + n^{-1}\lambda\Omega \right)^{-1} . \tag{4.19}
\]
Before explaining how to compute the distances, let us summarize some clarifying points:

- The solution B_OS of the p-OS problem is enough to accomplish classification.
- In the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances.
- Classification can be done in a reduced-rank space of dimension R < K−1 by using the first R discriminant directions {β_k}_{k=1}^{R}.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is
\[
\left\| (x_i - \mu_k)\, B_{\mathrm{OS}} \right\|_{\Sigma_{W\Omega}}^2 - 2 \log(\pi_k) ,
\]
where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is
\[
\left\| (x_i - \mu_k)\, B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 - 2 \log(\pi_k) ,
\]
which is a plain Euclidean distance.
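A short MATLAB sketch of the nearest-centroid rule in the LDA domain (plain Euclidean distance plus the class-prior adjustment); the code is ours, and XLDA and the centroid matrix M are assumed to be computed as in the sketch of Section 4.1.4.

    % Squared Euclidean distances to the centroids, adjusted for class priors.
    nk = sum(Y, 1)';                        % class counts
    prior = nk / sum(nk);                   % estimated priors pi_k
    d = zeros(size(XLDA, 1), numel(prior));
    for k = 1:numel(prior)
        diff = XLDA - repmat(M(k, :), size(XLDA, 1), 1);
        d(:, k) = sum(diff.^2, 2) - 2 * log(prior(k));
    end
    [~, yhat] = min(d, [], 2);              % maximum a posteriori assignment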


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as
\[
\hat{p}(y_k = 1 \,|\, x)
\propto \exp\!\left( -\frac{d(x, \mu_k)}{2} \right)
\propto \pi_k \exp\!\left( -\frac{1}{2} \left\| (x - \mu_k)\, B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 \right) . \tag{4.20}
\]
These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:
\[
\hat{p}(y_k = 1 \,|\, x)
= \frac{\pi_k \exp\!\left( -\frac{d(x,\mu_k)}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( -\frac{d(x,\mu_\ell)}{2} \right)}
= \frac{\pi_k \exp\!\left( \frac{-d(x,\mu_k) + d_{\max}}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( \frac{-d(x,\mu_\ell) + d_{\max}}{2} \right)} ,
\]
where d_max = max_k d(x, μ_k).
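The shift by d_max is a constant inside the exponentials, so it cancels out after normalization. A MATLAB sketch of this computation (our own, reusing the distance matrix d and the priors from the previous sketch):

    % Posterior probabilities with the d_max shift described in the text.
    nK = size(d, 2);
    dmax = max(d, [], 2);
    num = repmat(prior', size(d, 1), 1) .* exp(-(d - repmat(dmax, 1, nK)) / 2);
    post = num ./ repmat(sum(num, 2), 1, nK);   % each row sums to one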

4.2.4 Graphical Representation

Sometimes it is useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This is accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether the data set is represented in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^⊤Ωβ, under the assumption that Y^⊤Y and X^⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties, but as recalled in Section 2.3, they do not induce sparsity. In this respect L1 penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B^⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper (Tibshirani, 1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times K-1}} \; J(B) + \lambda \sum_{j=1}^{p} w_j^2\, \frac{\left\| \beta^j \right\|_2^2}{\tau_j} \tag{4.21a}
\]
\[
\text{s.t.} \quad \sum_j \tau_j - \sum_j w_j \left\| \beta^j \right\|_2 \le 0 , \tag{4.21b}
\]
\[
\phantom{\text{s.t.}} \quad \tau_j \ge 0 ,\; j = 1, \dots, p , \tag{4.21c}
\]
where B ∈ R^{p×(K−1)} is the matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})^⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss ‖YΘ − XB‖²_F; for simplicity, we keep the generic notation J(B) from now on. Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression as the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and analyzed in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ ∑_{j=1}^{p} w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is
\[
L = J(B) + \lambda \sum_{j=1}^{p} w_j^2\, \frac{\left\| \beta^j \right\|_2^2}{\tau_j}
+ \nu_0 \left( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \right)
- \sum_{j=1}^{p} \nu_j \tau_j .
\]

[Figure 4.1: Graphical representation of the variational approach to the group-Lasso.]

Thus, the first-order optimality conditions for τ_j are
\[
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
\;\Leftrightarrow\; -\lambda w_j^2 \frac{\left\| \beta^j \right\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Leftrightarrow\; -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} - \nu_j\, \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} = 0 .
\]
The last line is obtained from complementary slackness, which implies here ν_j τ_j^⋆ = 0. Complementary slackness states that ν_j g_j(τ_j^⋆) = 0, where ν_j is the Lagrange multiplier for constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is
\[
\tau_j^\star = \sqrt{\frac{\lambda w_j^2 \left\| \beta^j \right\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\, w_j \left\| \beta^j \right\|_2 . \tag{4.22}
\]
We note that ν_0 ≠ 0 if there is at least one coefficient β_{jk} ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):
\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 = 0 , \tag{4.23}
\]
so that τ_j^⋆ = w_j ‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator
\[
\min_{B \in \mathbb{R}^{p \times M}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 . \tag{4.24}
\]

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can conveniently be presented as λ tr(B^⊤ΩB), where
\[
\Omega = \mathrm{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \right) , \tag{4.25}
\]
with τ_j^⋆ = w_j ‖β^j‖_2, resulting in the diagonal components
\[
(\Omega)_{jj} = \frac{w_j}{\left\| \beta^j \right\|_2} . \tag{4.26}
\]
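A small MATLAB sketch of this adaptive penalty (our notation), which also illustrates Lemma 4.1: when all rows of B are non-zero, the quadratic penalty λ tr(B^⊤ΩB) coincides with the group-Lasso penalty λ Σ_j w_j‖β^j‖_2.

    % Adaptive quadratic penalty (4.25)-(4.26) from B (p x K-1) and weights w (p x 1).
    rownorms = sqrt(sum(B.^2, 2));       % ||beta^j||_2, row by row
    Omega = diag(w ./ rownorms);         % (Omega)_jj = w_j / ||beta^j||_2 (Inf on zero rows)
    % Sanity check of Lemma 4.1 (valid when all rows of B are non-zero):
    quadPen  = lambda * trace(B' * Omega * B);
    groupPen = lambda * sum(w .* rownorms);   % the two values coincide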

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems thus extends to the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set strategy described in Section 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²₂/τ, known as the perspective function of f(β) = ‖β‖²₂, is convex in (β, τ) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is
\[
\left\{ V \in \mathbb{R}^{p \times K-1} : V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} , \tag{4.27}
\]
where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})^⊤, defined as follows. Let S(B) denote the support of B, S(B) = {j ∈ 1, …, p : ‖β^j‖_2 ≠ 0}; then we have
\[
\forall j \in S(B) ,\quad g^j = w_j \left\| \beta^j \right\|_2^{-1} \beta^j , \tag{4.28}
\]
\[
\forall j \notin S(B) ,\quad \left\| g^j \right\|_2 \le w_j . \tag{4.29}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is
\[
\frac{\partial}{\partial \beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right) = \lambda\, w_j \frac{\beta^j}{\left\| \beta^j \right\|_2} . \tag{4.30}
\]
At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):
\[
\partial_{\beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right)
= \partial_{\beta^j} \left( \lambda w_j \left\| \beta^j \right\|_2 \right)
= \left\{ \lambda w_j v \in \mathbb{R}^{K-1} : \left\| v \right\|_2 \le 1 \right\} , \tag{4.31}
\]
which gives the expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:
\[
\forall j \in S ,\quad \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \left\| \beta^j \right\|_2^{-1} \beta^j = 0 , \tag{4.32a}
\]
\[
\forall j \notin S ,\quad \left\| \frac{\partial J(B)}{\partial \beta^j} \right\|_2 \le \lambda w_j , \tag{4.32b}
\]
where S ⊆ {1, …, p} denotes the set of non-zero row vectors β^j and S̄ its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily obtained from a direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
\[
B_{\mathrm{OS}} = \mathop{\mathrm{argmin}}_{B \in \mathbb{R}^{p \times K-1}}\; \min_{\Theta \in \mathbb{R}^{K \times K-1}} \; \frac{1}{2} \left\| Y\Theta - XB \right\|_F^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2
\quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\]
is equivalent to the penalized LDA problem
\[
B_{\mathrm{LDA}} = \mathop{\mathrm{argmax}}_{B \in \mathbb{R}^{p \times K-1}} \; \mathrm{tr}\!\left( B^\top \Sigma_B B \right)
\quad \text{s.t.} \quad B^\top \left( \Sigma_W + n^{-1}\lambda\Omega \right) B = I_{K-1} ,
\]
where Ω = diag(w_1²/τ_1, …, w_p²/τ_p), with
\[
\Omega_{jj} =
\begin{cases}
+\infty & \text{if } \beta^j_{\mathrm{os}} = 0 , \\[2pt]
w_j \left\| \beta^j_{\mathrm{os}} \right\|_2^{-1} & \text{otherwise.}
\end{cases} \tag{4.33}
\]
That is, B_LDA = B_OS diag(α_k^{-1}(1 − α_k²)^{-1/2}), where α_k ∈ (0, 1) is the kth leading eigenvalue of
\[
n^{-1}\, Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y .
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B^⊤ΩB).


                5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½‖YΘ − XB‖²_F.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented as a block diagram in Figure 5.1, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. To use the alternative variational approach from Appendix D, Equations (4.21), (4.32a) and (4.32b) must be replaced by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1) × card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K−1) similar systems
\[
\left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 , \tag{5.1}
\]


[Figure 5.1: GLOSS block diagram, depicting the active-set loop: initialize the model (λ, B); form the active set {j : ‖β^j‖_2 > 0}; solve the p-OS problem so that B fulfils the first optimality condition; move variables that violate the conditions between the active and inactive sets; when no further move is needed, compute Θ, update B and stop.]


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← {j ∈ {1, …, p} : ‖β^j‖_2 > 0};  Θ^0 such that n^{-1} Θ^{0⊤}Y^⊤YΘ^0 = I_{K−1};  convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
        B_A ← (X_A^⊤X_A + λΩ)^{-1} X_A^⊤YΘ^0
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for j ∈ A such that ‖β^j‖_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j};  go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in Ā
    j⋆ ← argmax_{j ∉ A} ‖∂J/∂β^j‖_2
    if ‖∂J/∂β^{j⋆}‖_2 < λ then
        convergence ← true    (B is optimal)
    else
        A ← A ∪ {j⋆}
    end if
until convergence
(s, V) ← eigenanalyze(Θ^{0⊤}Y^⊤X_A B), that is, Θ^{0⊤}Y^⊤X_A B V_k = s_k V_k, k = 1, …, K−1
Θ ← Θ^0 V;  B ← BV;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, …, K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth column of B and Θ^0, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to
\[
\left( X^\top X + \lambda\Omega \right) B = X^\top Y \Theta . \tag{5.2}
\]
Defining the Cholesky decomposition as C^⊤C = (X^⊤X + λΩ), (5.2) is solved efficiently as follows:
\[
\begin{aligned}
C^\top C\, B &= X^\top Y \Theta \\
C\, B &= C^\top \backslash\, X^\top Y \Theta \\
B &= C \backslash\, C^\top \backslash\, X^\top Y \Theta , 
\end{aligned} \tag{5.3}
\]
where the symbol "\" is the MATLAB mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
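A minimal MATLAB sketch of (5.2)–(5.3) (our own illustration, not the package code), assuming X restricted to the active set, the current Ω, the scores Θ and the parameter lambda; chol returns the upper-triangular factor C with C^⊤C = X^⊤X + λΩ.

    % Solve (X'X + lambda*Omega) B = X'Y*Theta with a single Cholesky factorization.
    C = chol(X' * X + lambda * Omega);   % upper triangular, C'*C = X'X + lambda*Omega
    RHS = X' * (Y * Theta);              % the K-1 right-hand sides share the same matrix
    B = C \ (C' \ RHS);                  % two triangular solves, as in (5.3)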

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:
\[
B = \Omega^{-1/2} \left( \Omega^{-1/2} X^\top X\, \Omega^{-1/2} + \lambda I \right)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 , \tag{5.4}
\]
where the conditioning of Ω^{-1/2}X^⊤XΩ^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved for cases with large ω_j values; our code is otherwise based on expression (5.2).
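A sketch of the stabler expression (5.4) in MATLAB (our variable names), to be used only when some ω_j become very large:

    % Stable computation of B through the rescaled system (5.4).
    p = size(X, 2);
    Oisqrt = diag(1 ./ sqrt(diag(Omega)));             % Omega^{-1/2}
    M = Oisqrt * (X' * X) * Oisqrt + lambda * eye(p);  % well-conditioned matrix
    B = Oisqrt * (M \ (Oisqrt * (X' * (Y * Theta0))));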

5.2 Score Matrix

The optimal score matrix Θ is made of the K−1 leading eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y. This eigen-analysis is actually solved in the form Θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^⊤X + Ω)^{-1}, which involves the inversion of a p × p matrix. Let Θ^0 be an arbitrary K × (K−1) matrix whose range includes the K−1 leading eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.¹ Then, solving the K−1 systems (5.3) provides the value of B^0 = (X^⊤X + λΩ)^{-1}X^⊤YΘ^0. This B^0 matrix can be identified in the expression to eigenanalyze, as
\[
\Theta^{0\top} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .
\]
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1) × (K−1) matrix Θ^{0⊤}Y^⊤XB^0 = VΛV^⊤. Defining Θ = Θ^0V, we have Θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤YΘ = Λ, and when Θ^0 is chosen such that n^{-1}Θ^{0⊤}Y^⊤YΘ^0 = I_{K−1}, we also have that n^{-1}Θ^⊤Y^⊤YΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, Θ is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ^0 to Θ, that is, B = B^0V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function
\[
\frac{1}{2} \left\| Y\Theta - XB \right\|_F^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 . \tag{5.5}
\]
Let J(B) be the data-fitting term ½‖YΘ − XB‖²_F. Its gradient with respect to the jth row of B, β^j, is the (K−1)-dimensional vector
\[
\frac{\partial J(B)}{\partial \beta^j} = x_j^\top (XB - Y\Theta) ,
\]
where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as
\[
x_j^\top (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\left\| \beta^j \right\|_2} .
\]

¹ As X is centered, 1_K belongs to the null space of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y. It is thus sufficient to choose Θ^0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set Θ^0 = (Y^⊤Y)^{-1/2}U, where U is a K × (K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.


The second optimality condition (4.32b) can be computed for every variable j as
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .
\]
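Both conditions can be checked for all variables at once. The MATLAB sketch below is ours and assumes the current B, Θ, the weights w (p × 1) and the parameter lambda.

    % Gradients of the data-fitting term and optimality checks (4.32a)-(4.32b).
    G = X' * (X * B - Y * Theta);              % p x (K-1), row j is dJ/d(beta^j)
    rownorms = sqrt(sum(B.^2, 2));
    active = rownorms > 0;
    % (4.32a): should be (numerically) zero for every active variable.
    resActive = G(active, :) + lambda * diag(w(active) ./ rownorms(active)) * B(active, :);
    % (4.32b): should be nonpositive for every inactive variable.
    slack = sqrt(sum(G(~active, :).^2, 2)) - lambda * w(~active);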

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
\[
j^\star = \mathop{\mathrm{argmax}}_{j} \; \max\!\left( \left\| x_j^\top (XB - Y\Theta) \right\|_2 - \lambda w_j ,\; 0 \right) .
\]
The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .
\]
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value λ_max of the penalty parameter below which B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value λ_max of the penalty parameter, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:
\[
\lambda_{\max} = \max_{j \in \{1, \dots, p\}} \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 .
\]
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ⋯ > λ_t > ⋯ > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t/2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default the minimum of n and p).
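A MATLAB sketch of λ_max and of the halving path with warm starts; the code is ours, the number of path points T is an arbitrary choice, and gloss_fit is a hypothetical solver call standing for the active-set procedure of Algorithm 1.

    % Largest useful penalty: condition (4.32b) evaluated at B = 0.
    lambda_max = max(sqrt(sum((X' * (Y * Theta0)).^2, 2)) ./ w);
    % Halving path lambda_1 = lambda_max > ... > lambda_T, with warm starts.
    T = 20;                                   % arbitrary number of path points
    lambdas = lambda_max * (1/2).^(0:T-1);
    B = zeros(size(X, 2), size(Theta0, 2));   % feasible initial guess at lambda_max
    for t = 1:T
        % B = gloss_fit(X, Y, lambdas(t), B); % hypothetical solver, warm-started
    end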


5.6 Options and Variants

5.6.1 Scaling Variables

As with most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm, or equivalently to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some MATLAB commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems
\[
\min_{B \in \mathbb{R}^{p \times K-1}} \left\| Y\Theta - XB \right\|_F^2
= \min_{B \in \mathbb{R}^{p \times K-1}} \mathrm{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \Sigma_T B \right)
\]
are replaced by
\[
\min_{B \in \mathbb{R}^{p \times K-1}} \mathrm{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \left( \Sigma_B + \mathrm{diag}(\Sigma_W) \right) B \right) .
\]
Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

[Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image, with
Ω_L =
( 3 −1  0 −1 −1  0  0  0  0
 −1  5 −1 −1 −1 −1  0  0  0
  0 −1  3  0 −1 −1  0  0  0
 −1 −1  0  5 −1  0 −1 −1  0
 −1 −1 −1 −1  8 −1 −1 −1 −1
  0 −1 −1  0 −1  5  0 −1 −1
  0  0  0 −1 −1  0  3 −1  0
  0  0  0 −1 −1 −1 −1  5 −1
  0  0  0  0 −1 −1  0 −1  3 ). ]

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^⊤Ω_Lβ favors, among vectors of identical L2 norm, those having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, which is the indicator of the neighbors of pixel 1, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, with a sign mismatch between pixel 1 and its neighborhood.
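The Laplacian of Figure 5.2 can be rebuilt programmatically. The MATLAB sketch below is ours; it assumes the 8-neighbour pixel adjacency that reproduces the matrix shown in the figure, and generalizes to any r × c pixel grid.

    % Graph Laplacian of the 8-neighbour graph of an r x c pixel grid
    % (r = c = 3 reproduces the matrix of Figure 5.2, up to pixel relabeling).
    r = 3;  c = 3;
    [I, J] = ndgrid(1:r, 1:c);
    coord = [I(:), J(:)];                     % pixel coordinates
    m = r * c;
    A = zeros(m);
    for a = 1:m
        for b = 1:m
            A(a, b) = (a ~= b) && (max(abs(coord(a, :) - coord(b, :))) == 1);
        end
    end
    OmegaL = diag(sum(A, 2)) - A;             % Laplacian D - A; penalty is beta'*OmegaL*beta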

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.


                6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T, or the diagonal of the within-class covariance matrix Σ_W, to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹ The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques on the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated in all simulations, except in Simulation 2 where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is given below.

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_1j = 0.7 × 1_(1≤j≤25), μ_2j = 0.7 × 1_(26≤j≤50), μ_3j = 0.7 × 1_(51≤j≤75), μ_4j = 0.7 × 1_(76≤j≤100).

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_(j≤200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
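For concreteness, a MATLAB sketch of the data generation for Simulation 1 (our code, with the sizes used in the experiments):

    % Simulation 1: K = 4 classes, p = 500 independent features, mean shift of
    % 0.7 on a distinct block of 25 features per class.
    n = 100;  p = 500;  K = 4;
    y = repmat((1:K)', n/K, 1);              % balanced class labels
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25+1 : k*25) = 0.7;      % mu_kj = 0.7 on the kth block
    end
    X = mu(y, :) + randn(n, p);              % x_i ~ N(mu_k, I)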

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far behind by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                                              Err (%)       Var            Dir
  Sim 1, K = 4, mean shift, ind. features
    PLDA                                      12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
    SLDA                                      31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
    GLOSS                                     19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
    GLOSS-D                                   11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
  Sim 2, K = 2, mean shift, dependent features
    PLDA                                       9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
    SLDA                                      19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
    GLOSS                                     15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
    GLOSS-D                                    9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
  Sim 3, K = 4, 1D mean shift, ind. features
    PLDA                                      13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
    SLDA                                      57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
    GLOSS                                     31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
    GLOSS-D                                   18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
  Sim 4, K = 4, mean shift, ind. features
    PLDA                                      60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
    SLDA                                      65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
    GLOSS                                     60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
    GLOSS-D                                   58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


[Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and the four simulations.]

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1      Simulation 2      Simulation 3      Simulation 4
             TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  PLDA       99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
  SLDA       73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
  GLOSS      64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
  GLOSS-D    93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non-relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3 where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer.

² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                                          Err (%)         Var
  Nakayama, n = 86, p = 22,283, K = 5
    PLDA                                  20.95 (1.3)     10,478.7 (2,116.3)
    SLDA                                  25.71 (1.7)     252.5 (3.1)
    GLOSS                                 20.48 (1.4)     129.0 (18.6)
  Ramaswamy, n = 198, p = 16,063, K = 14
    PLDA                                  38.36 (6.0)     14,873.5 (720.3)
    SLDA                                  —               —
    GLOSS                                 20.61 (6.9)     372.4 (122.1)
  Sun, n = 180, p = 54,613, K = 4
    PLDA                                  33.78 (5.9)     21,634.8 (7,443.2)
    SLDA                                  36.22 (6.5)     384.4 (16.5)
    GLOSS                                 31.77 (4.5)     93.0 (93.6)

Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations of the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2: 2D-representations of the Nakayama and Sun datasets (rows) based on the first two discriminant vectors provided by GLOSS and SLDA (columns); the big squares represent class means. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]


[Figure 6.3: USPS digits "1" and "0" (mean images).]

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the sparsity-inducing group-Lasso penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we encode the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time for a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized run of GLOSS is compared with the β resulting from a Laplacian-penalized run of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating both digits.

Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow strokes to be detected, and will probably provide better prediction results.


[Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and β for S-GLOSS (right).]

[Figure 6.5: Sparse discriminant direction between digits "1" and "0": β for GLOSS with λ = 0.3 (left) and β for S-GLOSS with λ = 0.3 (right).]


                Discussion

GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; this is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K−1)-dimensional problem into (K−1) independent p-dimensional problems. The interaction between the (K−1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions makes it possible to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups: many composite or hierarchical penalties that have been proposed for structured data meet this condition.


                Part III

                Sparse Clustering Analysis


                Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. Similarity measures exist for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


                7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^\top, \dots, x_n^\top)^\top have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) , \quad \forall i \in \{1, \dots, n\} ,

where K is the number of components, f_k are the densities of the components and \pi_k are the mixture proportions (\pi_k \in ]0,1[ for all k, and \sum_k \pi_k = 1). Mixture models transcribe that, given the proportions \pi_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters \pi_1, \dots, \pi_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.
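This two-step generative mechanism translates directly into a sampling routine. The following minimal sketch is ours (the function name and arguments are illustrative), assuming Gaussian components with a common covariance matrix, as used later in this chapter:

import numpy as np

def sample_mixture(n, pi, means, cov, rng=None):
    """Draw n samples from a Gaussian mixture with common covariance.

    pi    : (K,) mixture proportions, summing to one
    means : (K, p) component means
    cov   : (p, p) common covariance matrix
    Returns the samples X (n, p) and the latent labels y (n,).
    """
    rng = np.random.default_rng(rng)
    # Step y: allot each individual to a class (multinomial draw)
    y = rng.choice(len(pi), size=n, p=pi)
    # Step x: draw each x_i from the density of its component
    X = np.vstack([rng.multivariate_normal(means[k], cov) for k in y])
    return X, y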

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities \phi(\cdot\,;\theta_k). The density of the mixture can then be written as

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \, \phi(x_i; \theta_k) , \quad \forall i \in \{1, \dots, n\} ,

where \theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used way to estimate the parameters is to maximize the log-likelihood using the EM algorithm, which is typically employed for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step computes the expectation of the likelihood with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood built in the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

                Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

L(\theta; X) = \log\left( \prod_{i=1}^{n} f(x_i; \theta) \right) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) ,    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters) and \pi_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or

classification log-likelihood:

L_C(\theta; X, Y) = \log\left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right)
                 = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} y_{ik} \, \pi_k f_k(x_i; \theta_k) \right)
                 = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right) .    (7.2)

The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k and y_{ik} = 0 otherwise.

Defining the soft membership t_{ik}(\theta) as

t_{ik}(\theta) = p(Y_{ik} = 1 \,|\, x_i; \theta)    (7.3)
             = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)} ,    (7.4)

to lighten notations, t_{ik}(\theta) will be denoted t_{ik} when the parameter \theta is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right)
                 = \sum_{i,k} y_{ik} \log\left( t_{ik} f(x_i; \theta) \right)
                 = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
                 = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)
                 = \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X) ,    (7.5)

where \sum_{i,k} y_{ik} \log t_{ik} can be reformulated as

\sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( p(Y_{ik} = 1 \,|\, x_i; \theta) \right)
                             = \sum_{i=1}^{n} \log\left( p(y_i \,|\, x_i; \theta) \right)
                             = \log\left( p(Y \,|\, X; \theta) \right) .

As a result, the relationship (7.5) can be rewritten as

L(\theta; X) = L_C(\theta; Z) - \log\left( p(Y \,|\, X; \theta) \right) .    (7.6)


                Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_{ik} are unknown. However, the log-likelihood can be estimated by taking expectations of (7.6) conditionally on a current value \theta^{(t)} of the parameter:

L(\theta; X) = \underbrace{E_{Y \sim p(\cdot|X;\theta^{(t)})} \left[ L_C(\theta; X, Y) \right]}_{Q(\theta,\theta^{(t)})} + \underbrace{E_{Y \sim p(\cdot|X;\theta^{(t)})} \left[ -\log p(Y|X;\theta) \right]}_{H(\theta,\theta^{(t)})} .

In this expression, H(\theta,\theta^{(t)}) is the entropy and Q(\theta,\theta^{(t)}) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as \Delta L = L(\theta^{(t+1)}; X) - L(\theta^{(t)}; X). Then \theta^{(t+1)} = \arg\max_{\theta} Q(\theta,\theta^{(t)}) also increases the log-likelihood:

\Delta L = \underbrace{\left( Q(\theta^{(t+1)},\theta^{(t)}) - Q(\theta^{(t)},\theta^{(t)}) \right)}_{\geq 0 \text{ by definition of iteration } t+1} - \underbrace{\left( H(\theta^{(t+1)},\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)}) \right)}_{\leq 0 \text{ by Jensen's inequality}} .

Therefore, it is possible to maximize the likelihood by optimizing Q(\theta,\theta^{(t)}). The relationship between Q(\theta,\theta') and L(\theta; X) is developed in deeper detail in Appendix F, to show how the value of L(\theta; X) can be recovered from Q(\theta,\theta^{(t)}).

For the mixture model problem, Q(\theta,\theta') is

Q(\theta,\theta') = E_{Y \sim p(Y|X;\theta')} \left[ L_C(\theta; X, Y) \right]
                 = \sum_{i,k} p(Y_{ik} = 1 | x_i; \theta') \log\left( \pi_k f_k(x_i; \theta_k) \right)
                 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left( \pi_k f_k(x_i; \theta_k) \right) .    (7.7)

Due to its similitude to the expression of the complete likelihood (7.2), Q(\theta,\theta') is also known as the weighted likelihood. In (7.7), the weights t_{ik}(\theta') are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter \theta^{(0)};

• E-Step: evaluation of Q(\theta,\theta^{(t)}), using t_{ik}(\theta^{(t)}) (7.4) in (7.7);

• M-Step: calculation of \theta^{(t+1)} = \arg\max_{\theta} Q(\theta,\theta^{(t)}).


                Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix \Sigma and different mean vectors \mu_k, the mixture density is

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
             = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\} .

At the E-step, the posterior probabilities t_{ik} are computed as in (7.4) with the current parameters \theta^{(t)}; then, the M-step maximizes Q(\theta,\theta^{(t)}) (7.7), whose form is as follows:

Q(\theta,\theta^{(t)}) = \sum_{i,k} t_{ik} \log(\pi_k) - \sum_{i,k} t_{ik} \log\left( (2\pi)^{p/2} |\Sigma|^{1/2} \right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                     = \sum_{k} t_k \log(\pi_k) - \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                     \equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|) - \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right) ,    (7.8)

where

t_k = \sum_{i=1}^{n} t_{ik} .    (7.9)

The M-step, which maximizes this expression with respect to \theta, applies the following updates defining \theta^{(t+1)}:

\pi_k^{(t+1)} = \frac{t_k}{n} ,    (7.10)

\mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k} ,    (7.11)

\Sigma^{(t+1)} = \frac{1}{n} \sum_k W_k ,    (7.12)

with W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .    (7.13)

The derivations are detailed in Appendix G.
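For illustration, the following minimal sketch (ours, not the Mix-GLOSS code) implements this EM for a Gaussian mixture with common covariance matrix, using the E-step (7.4) and the M-step updates (7.10)-(7.13), with random soft initialization and no safeguards against degenerate covariances:

import numpy as np
from scipy.stats import multivariate_normal

def em_common_cov(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture with common covariance matrix."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    T = rng.dirichlet(np.ones(K), size=n)          # soft memberships t_ik
    for _ in range(n_iter):
        # M-step
        tk = T.sum(axis=0)                          # (7.9)
        pi = tk / n                                 # (7.10)
        mu = (T.T @ X) / tk[:, None]                # (7.11)
        Sigma = np.zeros((p, p))
        for k in range(K):
            R = X - mu[k]
            Sigma += (T[:, k, None] * R).T @ R      # W_k, (7.13)
        Sigma /= n                                  # (7.12)
        # E-step
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mu[k], Sigma) for k in range(K)
        ])
        T = dens / dens.sum(axis=1, keepdims=True)  # (7.4)
    return pi, mu, Sigma, T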

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix \Sigma_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition \Sigma_k = \lambda_k D_k A_k D_k^\top (Banfield and Raftery, 1993).

These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

\log\left( \frac{p(Y_k = 1|x)}{p(Y_\ell = 1|x)} \right) = x^\top \Sigma^{-1}(\mu_k - \mu_\ell) - \frac{1}{2}(\mu_k + \mu_\ell)^\top \Sigma^{-1}(\mu_k - \mu_\ell) + \log\frac{\pi_k}{\pi_\ell} .

In this model, a simple way of introducing sparsity in the discriminant vectors \Sigma^{-1}(\mu_k - \mu_\ell) is to constrain \Sigma to be diagonal and to favor sparse means \mu_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

\lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} |(\Sigma_k^{-1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

\lambda \sum_{j=1}^{p} \sum_{1 \le k < k' \le K} |\mu_{kj} - \mu_{k'j}| .

This PFP regularization does not shrink the means to zero but towards each other: if the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

\lambda \sum_{j=1}^{p} \left\| (\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj}) \right\|_\infty .

One group is defined for each variable j, as the set of the jth components of the K means (\mu_{1j}, \dots, \mu_{Kj}). The L1∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K} \mu_{kj}^2 } .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing.
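To make the differences concrete, the following sketch (ours, not taken from any of the cited packages) evaluates the three mean-based penalties discussed above on a K × p matrix of cluster means:

import numpy as np

def mean_penalties(mu, lam=1.0):
    """Penalty values for a (K, p) matrix of cluster means.

    l1   : lambda * sum_k sum_j |mu_kj|          (L1 penalty on the means)
    linf : lambda * sum_j max_k |mu_kj|          (L1-infinity group penalty)
    vmg  : lambda * sqrt(K) * sum_j ||mu_.j||_2  (group-Lasso, vertical grouping)
    """
    K, p = mu.shape
    l1 = lam * np.abs(mu).sum()
    linf = lam * np.abs(mu).max(axis=0).sum()
    vmg = lam * np.sqrt(K) * np.linalg.norm(mu, axis=0).sum()
    return {"l1": l1, "linf": linf, "vmg": vmg}

# A variable j is dropped only when the whole column mu[:, j] is zero,
# which the group penalties (linf, vmg) encourage directly.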

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity in the discriminant vector. The generalization from quadratic to non-quadratic penalties is only quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy through conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as

f(x_i | \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[ f(x_{ij}|\theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij}|\nu_j) \right]^{1-\phi_j} ,

where f(\cdot|\theta_{jk}) is the distribution function for relevant features and h(\cdot|\nu_j) is the distribution function for the irrelevant ones. The binary vector \phi = (\phi_1, \phi_2, \dots, \phi_p) represents relevance, with \phi_j = 1 if the jth feature is informative and \phi_j = 0 otherwise. The saliency of variable j is then formalized as \rho_j = P(\phi_j = 1), so all \phi_j must be treated as missing variables. The set of parameters is thus \{\pi_k, \theta_{jk}, \nu_j, \rho_j\}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U \in R^{p \times (K-1)}, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion

\mathrm{tr}\left( (U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U \right) ,    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters the M-step equations.
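As an illustration only (not the Fisher-EM implementation), the criterion (7.14) can be evaluated for a candidate projection U as follows; Sigma_W and Sigma_B denote the within- and between-class covariance estimates computed from the current soft assignments:

import numpy as np

def fisher_criterion(U, Sigma_W, Sigma_B):
    """Multi-class Fisher criterion tr((U' Sigma_W U)^{-1} U' Sigma_B U)."""
    SW_U = U.T @ Sigma_W @ U            # (K-1, K-1)
    SB_U = U.T @ Sigma_B @ U            # (K-1, K-1)
    return np.trace(np.linalg.solve(SW_U, SB_U))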

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation \hat{U} of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

\min_{\hat{U} \in R^{p \times (K-1)}} \left\| X_U - X \hat{U} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \hat{u}_k \right\|_1 ,

where X_U = XU is the input data projected in the non-sparse space and \hat{u}_k is the kth column vector of the projection matrix \hat{U}. The second possibility is inspired by Qiao et al. (2009): it reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net,

\min_{A,B \in R^{p \times (K-1)}} \sum_{k=1}^{K} \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1
\quad \text{s.t. } A^\top A = I_{K-1} ,

where H_B \in R^{p \times K} is a matrix defined conditionally on the posterior probabilities t_{ik}, satisfying H_B H_B^\top = \Sigma_B, and H_{B,k} is the kth column of H_B; R_W \in R^{p \times p} is an upper triangular matrix resulting from the Cholesky decomposition of \Sigma_W; \Sigma_W and \Sigma_B are the p \times p within-class and between-class covariance matrices in the observation space; A \in R^{p \times (K-1)} and B \in R^{p \times (K-1)} are the solutions of the optimization problem, such that B = [\beta_1, \dots, \beta_{K-1}] is the best sparse approximation of U.

The last possibility recasts the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:

\min_{U \in R^{p \times (K-1)}} \sum_{j=1}^{p} \left\| \Sigma_{B,j} - U U^\top \Sigma_{B,j} \right\|_2^2 \quad \text{s.t. } U^\top U = I_{K-1} ,

where \Sigma_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and second options must be followed by a singular value decomposition of U to restore orthogonality. This is not necessary with the third option, since the penalized version of the SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this view, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^{(1)}: set of selected relevant variables;

• X^{(2)}: set of variables being considered for inclusion in or exclusion from X^{(1)};

• X^{(3)}: set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:  f(X|Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)} | X^{(1)}) \, f(X^{(1)} | Y)

• M2:  f(X|Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)}, X^{(1)} | Y)

Model M1 means that the variables in X^{(2)} are independent of the clustering Y; Model M2

indicates that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_{12} = \frac{f(X|M_1)}{f(X|M_2)} ,

where the high-dimensional f(X^{(3)} | X^{(2)}, X^{(1)}) cancels from the ratio:

B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} | M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} | M_2)} = \frac{f(X^{(2)} | X^{(1)}, M_1) \, f(X^{(1)} | M_1)}{f(X^{(2)}, X^{(1)} | M_2)} .

This factor is approximated, since the integrated likelihoods f(X^{(1)} | M_1) and f(X^{(2)}, X^{(1)} | M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)} | X^{(1)}, M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}. There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They also define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows blocks of indivisible variables to be defined, which in certain situations improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, testing the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


                8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k) ,

where \mu_k are the p-dimensional centroids and \Sigma_W is the p × p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_{ik} (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2 \, l_{\mathrm{weight}}(\mu, \Sigma) = - \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, d(x_i, \mu_k) - n \log(|\Sigma_W|) ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to the Fisher discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_{ik} in the E-step, the distance between the samples x_i and the centroids \mu_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, \mu_k) = \left\| (x_i - \mu_k) B_{LDA} \right\|_2^2 - 2 \log(\pi_k) .

This distance defines the computation of the posterior probabilities t_{ik} in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

   B_{OS} = \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y \Theta ,

   where \Theta are the K − 1 leading eigenvectors of

   Y^\top X \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y .

3. Map X to the LDA domain: X_{LDA} = X B_{OS} D, with D = \mathrm{diag}\left( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \right).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate the distances into posterior probabilities t_{ik} with

   t_{ik} \propto \exp\left[ - \frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right] .    (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_{ik} converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
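A compact sketch of steps 2 to 6 is given below. It is illustrative only: it follows the ridge-type formulas above for a quadratic penalty matrix Omega, not the group-Lasso machinery of GLOSS; the function names are ours, and Theta is taken here as the leading generalized eigenvectors with respect to Y'Y, which keeps the eigenvalues α² in [0, 1] (the exact normalization used in GLOSS is given in Chapter 4).

import numpy as np
from scipy.linalg import eigh

def pos_step(X, Y, Omega, lam):
    """One penalized optimal scoring solve (steps 2-3)."""
    K = Y.shape[1]
    A = X.T @ X + lam * Omega                      # X'X + lambda*Omega
    XtY = X.T @ Y
    G = Y.T @ X @ np.linalg.solve(A, XtY)          # Y'X (X'X+lam*Omega)^{-1} X'Y
    alpha2, Theta = eigh(G, Y.T @ Y)               # generalized eigenproblem
    order = np.argsort(alpha2)[::-1][:K - 1]       # K-1 leading eigenvectors
    alpha2, Theta = alpha2[order], Theta[:, order]
    B_os = np.linalg.solve(A, XtY @ Theta)         # step 2
    D = np.diag(1.0 / (np.sqrt(alpha2) * np.sqrt(1.0 - alpha2)))
    return B_os, X @ B_os @ D                      # step 3: X in the LDA domain

def e_step(X_lda, T_prev):
    """Steps 4-6: centroids, distances and posteriors in the LDA domain."""
    n, K = T_prev.shape
    tk = T_prev.sum(axis=0)
    pi = tk / n
    M = (T_prev.T @ X_lda) / tk[:, None]                      # step 4: centroids
    d = ((X_lda[:, None, :] - M[None, :, :]) ** 2).sum(-1)    # step 5: squared distances
    logit = -0.5 * (d - 2.0 * np.log(pi))                     # step 6, as in (8.1)
    T = np.exp(logit - logit.max(axis=1, keepdims=True))      # stable normalization
    return T / T.sum(axis=1, keepdims=True)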

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(\theta,\theta') (7.7) so as to maximize the likelihood L(\theta) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix \Sigma (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means, and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(\Sigma | \Lambda_0, \nu_0) = \frac{1}{2^{np/2} |\Lambda_0|^{n/2} \Gamma_p(n/2)} \, |\Sigma^{-1}|^{(\nu_0 - p - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left( \Lambda_0^{-1} \Sigma^{-1} \right) \right\} ,

where \nu_0 is the number of degrees of freedom of the distribution, \Lambda_0 is a p × p scale matrix, and where \Gamma_p is the multivariate gamma function defined as

\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\left( n/2 + (1 - j)/2 \right) .

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

Q(\theta,\theta') + \log\left( f(\Sigma | \Lambda_0, \nu_0) \right)
= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log |\Lambda_0| - \frac{p(p+1)}{4} \log(\pi)
  - \sum_{j=1}^{p} \log \Gamma\left( \frac{n}{2} + \frac{1-j}{2} \right) - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right)
\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log |\Lambda_0| - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right) ,    (8.2)

with t_k = \sum_{i=1}^{n} t_{ik} ,
     \nu_n = \nu_0 + n ,
     \Lambda_n^{-1} = \Lambda_0^{-1} + S_0 ,
     S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to \mu_k and \pi_k is of course not affected by the additional prior term, where only the covariance \Sigma intervenes. The MAP estimator for \Sigma is simply obtained by differentiating (8.2) with respect to \Sigma. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for \Sigma is

\Sigma_{MAP} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right) ,    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if \nu_0 is chosen to be p + 1 and setting \Lambda_0^{-1} = \lambda\Omega, where \Omega is the penalty matrix from the group-Lasso regularization (4.25).


                9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_{ik}.

When several values of the penalty parameter are tested, they are given to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

Figure 9.1: Mix-GLOSS loops scheme.

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the implemented warm-start reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage


of variables that will be removed with the next penalty parameter can be modified to make the feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) has to be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
  B ← 0
  Y ← K-means(X, K)
  Run non-penalized Mix-GLOSS:
    λ ← 0
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  lastLAMBDA ← false
repeat
  Estimate λ: compute the gradient at β_j = 0,
    ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( ∑_{m≠j} x_m β_m − YΘ )
  Compute λ_max for every feature using (4.32b):
    λ_max,j = (1 / w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
  Choose λ so as to remove 10% of the relevant features
  Run penalized Mix-GLOSS:
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA

Output: B, L(θ), t_{ik}, π_k, μ_k, Σ, Y for every λ in the solution path
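The λ_max computation inside Algorithm 2 is a plain gradient-norm evaluation at β_j = 0. A minimal sketch is given below (ours; it assumes uniform weights w_j = 1 unless provided, and a precomputed scaled indicator matrix YΘ):

import numpy as np

def lambda_max(X, YTheta, B, w=None):
    """Per-feature lambda_max: the smallest penalty that zeroes feature j.

    X      : (n, p) centered inputs
    YTheta : (n, K-1) scaled indicator matrix Y @ Theta
    B      : (p, K-1) current coefficients
    """
    p = X.shape[1]
    w = np.ones(p) if w is None else w
    lam = np.empty(p)
    for j in range(p):
        Bj = B.copy()
        Bj[j, :] = 0.0                        # gradient evaluated at beta_j = 0
        grad_j = X[:, j].T @ (X @ Bj - YTheta)
        lam[j] = np.linalg.norm(grad_j, 2) / w[j]
    return lam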

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_{ik} is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then
    B_OS ← B0 ; Y ← Y0
  else
    B_OS ← 0 ; Y ← K-means(X, K)
  end if
  convergenceEM ← false ; tolEM ← 1e-3
repeat
  M-step:
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag( α^{-1} (1 − α²)^{-1/2} )
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
  E-step:
    t_{ik} as per (8.1)
    L(θ) as per (8.2)
  if (1/n) ∑_i |t_{ik} − y_{ik}| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_{ik}, π_k, μ_k, Σ, Y


                M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means \mu_k, the common covariance matrix \Sigma and the priors \pi_k of every component. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version YΘ of the label matrix. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

                E-Step

The E-step evaluates the posterior probability matrix T, using

t_{ik} \propto \exp\left[ - \frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right] .

The convergence of the t_{ik} is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
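As a rough sketch of such a criterion (under our reading of Pan and Shen (2007): the penalty term of the classical BIC is computed with an effective number of parameters that excludes the zeroed mean coefficients; the exact count used in Mix-GLOSS may differ):

import numpy as np

def modified_bic(loglik, n, K, p, means, tol=1e-12):
    """BIC = -2*loglik + log(n) * d_e, with d_e shrunk by zeroed mean entries."""
    nonzero_means = int(np.sum(np.abs(means) > tol))   # only active mean coefficients
    d_e = (K - 1) + nonzero_means + p * (p + 1) // 2   # proportions + means + common Sigma
    return -2.0 * loglik + np.log(n) * d_e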

The third and, up to now, definitive attempt proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested

with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.

Figure 9.2: Mix-GLOSS model selection diagram.


                10 Experimental Results

The performance of Mix-GLOSS is measured here on the artificial datasets that were used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1, adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: this is a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

  After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

  The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1∞ mixed norm. In Kuan et al. (2010), slight changes are introduced in the penalty term, such as weighting parameters that are particularly important for their dataset. The LumiWCluster package allows clustering using either the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are the following:

• Clustering Error (in percentage): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be obtained even if the IDs of the clusters and of the real classes differ (a computation sketch is given after this list).

• Number of Discarded Features: this value shows the number of variables whose coefficients have been zeroed and that are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to consume more memory and CPU as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of the truly relevant variables that are selected; similarly, the FPR is the proportion of the irrelevant variables that are (wrongly) selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations, but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and high clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
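For reference, these measures can be computed as in the sketch below (ours): the clustering error searches the best one-to-one matching between cluster IDs and class labels, in the spirit of Wu and Schölkopf (2007), here realized with the Hungarian algorithm, and TPR/FPR compare the selected variables with the known relevant ones.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred, K):
    """Misclassification rate under the best cluster-to-class matching."""
    C = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)      # maximize matched counts
    return 1.0 - C[rows, cols].sum() / len(y_true)

def tpr_fpr(selected, relevant, p):
    """selected, relevant: index sets of variables; p: total number of variables."""
    selected, relevant = set(selected), set(relevant)
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected - relevant) / (p - len(relevant))
    return tpr, fpr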

Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).


Table 10.1: Experimental results for simulated data: clustering error (Err, %), number of discarded variables (Var) and execution time, as mean (standard deviation) over 25 repetitions.

                      Err (%)      Var           Time
Sim 1: K = 4, mean shift, independent features
  CS general cov      46 (15)      985 (72)      884h
  Fisher EM           58 (87)      784 (52)      1645m
  Clustvarsel         602 (107)    378 (291)     383h
  LumiWCluster-Kuan   42 (68)      779 (4)       389s
  LumiWCluster-Wang   43 (69)      784 (39)      619s
  Mix-GLOSS           32 (16)      80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov      154 (2)      997 (09)      783h
  Fisher EM           74 (23)      809 (28)      8m
  Clustvarsel         73 (2)       334 (207)     166h
  LumiWCluster-Kuan   64 (18)      798 (04)      155s
  LumiWCluster-Wang   63 (17)      799 (03)      14s
  Mix-GLOSS           77 (2)       841 (34)      2h

Sim 3: K = 4, 1D mean shift, independent features
  CS general cov      304 (57)     55 (468)      1317h
  Fisher EM           233 (65)     366 (55)      22m
  Clustvarsel         658 (115)    232 (291)     542h
  LumiWCluster-Kuan   323 (21)     80 (02)       83s
  LumiWCluster-Wang   308 (36)     80 (02)       1292s
  Mix-GLOSS           347 (92)     81 (88)       21h

Sim 4: K = 4, mean shift, independent features
  CS general cov      626 (55)     999 (02)      112h
  Fisher EM           567 (104)    55 (48)       195m
  Clustvarsel         732 (4)      24 (12)       767h
  LumiWCluster-Kuan   692 (112)    99 (2)        876s
  LumiWCluster-Wang   697 (119)    991 (21)      825s
  Mix-GLOSS           669 (91)     975 (12)      11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms.

              Simulation 1      Simulation 2      Simulation 3      Simulation 4
              TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
MIX-GLOSS     992     015       828     335       884     67        780     12
LUMI-KUAN     992     28        1000    02        1000    005       50      005
FISHER-EM     986     24        888     17        838     5825      620     4075


[Figure 10.2 here: scatter plot of TPR (vertical axis, 0 to 100) versus FPR (horizontal axis, 0 to 60) for MIX-GLOSS, LUMI-KUAN and FISHER-EM on Simulations 1 to 4.]

Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other criteria. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most accurate ones, so their long computation times are worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


                Conclusions


                Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness by inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection for the penalty parameter. In this case too, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

                Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least on the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. These algorithms are well suited to genomic data, where the number of samples is smaller than the number of variables, but other high-dimension, low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been used in the literature.

At the programming level, both codes must be revisited to improve their robustness and to optimize their computation, because during the prototyping phase the priority was to achieve functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of these possibilities, such as the diagonal version of GLOSS, have been partially implemented, but they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.
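As an illustration of such a quadratic term, the sketch below builds a graph-Laplacian penalty for variables chained as neighbours (e.g., adjacent pixels) and adds it to a placeholder within-class covariance; this is a toy construction with assumed names, not the penalty shipped with GLOSS.

    # Illustrative sketch (not GLOSS code): a graph-Laplacian quadratic penalty
    # Omega for p variables chained as neighbours, added to a within-class
    # covariance to encode pairwise smoothness between adjacent variables.
    import numpy as np

    def chain_laplacian(p):
        A = np.zeros((p, p))                   # adjacency of a 1-D chain
        idx = np.arange(p - 1)
        A[idx, idx + 1] = A[idx + 1, idx] = 1.0
        D = np.diag(A.sum(axis=1))             # degree matrix
        return D - A                           # Laplacian L = D - A

    p, lam = 5, 0.1
    Omega = lam * chain_laplacian(p)
    Sigma_W = np.eye(p)                        # placeholder within-class covariance
    penalized = Sigma_W + Omega                # Elastic-net-like quadratic term
    print(penalized)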

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results because of the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be carried out in this direction to assess the convergence properties of the algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a substantial effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model in which the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the base model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather all the points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


                Appendix


                A Matrix Properties

Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad \Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\mu_k-\bar{x})(\mu_k-\bar{x})^\top.$$

Property 2. $\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a$.

Property 3. $\dfrac{\partial\, x^\top A x}{\partial x} = (A+A^\top)x$.

Property 4. $\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top$.

Property 5. $\dfrac{\partial\, a^\top X b}{\partial X} = ab^\top$.

Property 6. $\dfrac{\partial}{\partial X}\operatorname{tr}\big(AX^{-1}B\big) = -(X^{-1}BAX^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}$.


                B The Penalized-OS Problem is anEigenvector Problem

In this appendix we answer the question why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form
$$\min_{\theta_k,\beta_k}\ \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k \tag{B.1}$$
$$\text{s.t.}\quad \theta_k^\top Y^\top Y\theta_k = 1,\qquad \theta_\ell^\top Y^\top Y\theta_k = 0\ \ \forall\,\ell<k,$$
for $k = 1,\dots,K-1$.

The Lagrangian associated with Problem (B.1) is
$$L_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k + \lambda_k\big(\theta_k^\top Y^\top Y\theta_k - 1\big) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k. \tag{B.2}$$
Setting the gradient of (B.2) with respect to $\beta_k$ to zero gives the value of the optimal $\beta_k^\star$:
$$\beta_k^\star = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k. \tag{B.3}$$
The objective function of (B.1) evaluated at $\beta_k^\star$ is
$$\min_{\theta_k}\ \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star = \min_{\theta_k}\ \theta_k^\top Y^\top\big(I - X(X^\top X+\Omega_k)^{-1}X^\top\big)Y\theta_k = \max_{\theta_k}\ \theta_k^\top Y^\top X(X^\top X+\Omega_k)^{-1}X^\top Y\theta_k. \tag{B.4}$$
If the penalty matrix $\Omega_k$ is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem where the $K-1$ score vectors $\theta_k$ are the eigenvectors of $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$ is not trivial due to the $p\times p$ inverse. With some datasets, $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.

Let $M$ be the matrix $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$, so that expression (B.4) can be rewritten in a compact way:
$$\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ \operatorname{tr}\big(\Theta^\top M\Theta\big) \quad\text{s.t.}\quad \Theta^\top Y^\top Y\Theta = I_{K-1}. \tag{B.5}$$
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1)\times(K-1)$ matrix $M_\Theta$ be $\Theta^\top M\Theta$. The classical eigenvector formulation associated with (B.5) is then
$$M_\Theta v = \lambda v, \tag{B.6}$$
where $v$ is the eigenvector and $\lambda$ the associated eigenvalue of $M_\Theta$. Operating,
$$v^\top M_\Theta v = \lambda \iff v^\top\Theta^\top M\Theta v = \lambda.$$
Making the change of variable $w = \Theta v$, we obtain an alternative eigenproblem where the $w$ are the eigenvectors of $M$, with $\lambda$ the associated eigenvalue:
$$w^\top M w = \lambda. \tag{B.7}$$
Therefore the $v$ are the eigenvectors of the matrix $M_\Theta$ and the $w$ are the eigenvectors of the matrix $M$. Note that the only difference between the $(K-1)\times(K-1)$ matrix $M_\Theta$ and the $K\times K$ matrix $M$ is the $K\times(K-1)$ matrix $\Theta$ in the expression $M_\Theta = \Theta^\top M\Theta$. Then, to avoid the computation of the $p\times p$ inverse $(X^\top X+\Omega)^{-1}$, we can use the optimal value of the coefficient matrix $B = (X^\top X+\Omega)^{-1}X^\top Y\Theta$ in $M_\Theta$:
$$M_\Theta = \Theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top XB.$$
Thus the eigen-decomposition of the $(K-1)\times(K-1)$ matrix $M_\Theta = \Theta^\top Y^\top XB$ yields the $v$ eigenvectors of (B.6). To obtain the $w$ eigenvectors of the alternative formulation (B.7), the change of variable $w = \Theta v$ needs to be undone.

To summarize, we compute the $v$ eigenvectors from the eigen-decomposition of the tractable matrix $M_\Theta$, evaluated as $\Theta^\top Y^\top XB$; the definitive eigenvectors $w$ are then recovered as $w = \Theta v$, and the final step is the reconstruction of the optimal score matrix $\Theta^\star$ using the vectors $w$ as its columns. At this point we understand what is called in the literature "updating the initial score matrix": multiplying the initial $\Theta$ by the eigenvector matrix $V$ from decomposition (B.6) reverses the change of variable and restores the $w$ vectors. The matrix $B$ also needs to be "updated", by multiplying it by the same matrix of eigenvectors $V$, in order to account for the initial $\Theta$ used in the first computation of $B$:
$$B^\star = (X^\top X+\Omega)^{-1}X^\top Y\Theta V = BV.$$
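A minimal NumPy sketch of this update, assuming an indicator matrix Y and an initial score matrix Theta0 satisfying the constraint of (B.5) (all names are illustrative), could be the following.

    # Sketch of the small (K-1)x(K-1) eigen-decomposition described above.
    import numpy as np

    def update_scores(X, Y, Theta0, Omega):
        # B for the initial scores: B = (X'X + Omega)^{-1} X'Y Theta0
        B = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
        M_theta = Theta0.T @ Y.T @ X @ B           # (K-1)x(K-1), tractable
        evals, V = np.linalg.eigh((M_theta + M_theta.T) / 2)
        V = V[:, np.argsort(evals)[::-1]]          # sort eigenvalues decreasingly
        return Theta0 @ V, B @ V                   # "update" Theta and B: w = Theta0 v, B* = B V

    rng = np.random.default_rng(0)
    n, p, K = 30, 10, 3
    X = rng.normal(size=(n, p))
    labels = np.arange(n) % K
    Y = np.eye(K)[labels]                          # indicator matrix
    Q, _ = np.linalg.qr(rng.normal(size=(K, K - 1)))
    L = np.linalg.cholesky(Y.T @ Y)
    Theta0 = np.linalg.solve(L.T, Q)               # so that Theta0' Y'Y Theta0 = I
    Theta, B = update_scores(X, Y, Theta0, np.eye(p))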


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix $\Theta$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.

By definition of the eigen-decomposition, the eigenvectors of the matrix $M$ (called $w$ in (B.7)) form a basis, so that any score vector $\theta$ can be expressed as a linear combination of them:
$$\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m, \quad\text{s.t.}\quad \theta_k^\top\theta_k = 1. \tag{B.8}$$
The score vector normalization constraint $\theta_k^\top\theta_k = 1$ can also be expressed as a function of this basis:
$$\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big) = 1,$$
which, as per the eigenvector properties, reduces to
$$\sum_{m=1}^{K-1}\alpha_m^2 = 1. \tag{B.9}$$
Let $M$ be multiplied by a score vector $\theta_k$, replaced by its linear combination of eigenvectors $w_m$ (B.8):
$$M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m.$$
As the $w_m$ are the eigenvectors of the matrix $M$, the relationship $Mw_m = \lambda_m w_m$ can be used to obtain
$$M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m.$$
Multiplying on the left by $\theta_k^\top$, expressed as its linear combination of eigenvectors:
$$\theta_k^\top M\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\Big).$$
This equation can be simplified using the orthogonality property of eigenvectors, according to which $w_\ell^\top w_m = 0$ for any $\ell\neq m$, giving
$$\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m.$$
The optimization problem (B.5) for discriminant direction $k$ can thus be rewritten as
$$\max_{\theta_k\in\mathbb{R}^{K}}\ \theta_k^\top M\theta_k = \max_{\theta_k\in\mathbb{R}^{K}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m, \quad\text{with}\quad \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad\text{and}\quad \sum_{m=1}^{K-1}\alpha_m^2 = 1. \tag{B.10}$$
One way of maximizing Problem (B.10) is to choose $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m$, the resulting score vector $\theta_k$ is equal to the $k$-th eigenvector $w_k$.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:
$$\max_{\beta\in\mathbb{R}^p}\ \beta^\top\Sigma_B\beta \tag{C.1a}$$
$$\text{s.t.}\quad \beta^\top\Sigma_W\beta = 1, \tag{C.1b}$$
where $\Sigma_B$ and $\Sigma_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.

The Lagrangian of Problem (C.1) is
$$L(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\big(\beta^\top\Sigma_W\beta - 1\big),$$
so that its first derivative with respect to $\beta$ is
$$\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.$$
A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,
$$\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star.$$
Provided $\Sigma_W$ is full rank, we have
$$\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star. \tag{C.2}$$
Thus the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
$$\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\,\beta^{\star\top}\Sigma_W\beta^\star \ \ \text{from (C.2)} \ =\ \nu \ \ \text{from (C.1b)}.$$
That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
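For illustration, a small NumPy sketch of this eigen-decomposition on toy data (assuming $\Sigma_W$ is full rank; names are illustrative) is given below.

    # Fisher's discriminant directions as eigenvectors of Sigma_W^{-1} Sigma_B.
    import numpy as np

    def fisher_directions(X, y):
        n, p = X.shape
        xbar = X.mean(axis=0)
        S_W = np.zeros((p, p))
        S_B = np.zeros((p, p))
        for k in np.unique(y):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            S_W += (Xk - mu_k).T @ (Xk - mu_k) / n          # within-class variance
            S_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar) / n  # between-class
        evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
        order = np.argsort(evals.real)[::-1]
        return evecs[:, order].real, evals[order].real

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(2, 1, (20, 4))])
    y = np.repeat([0, 1], 20)
    W, lams = fisher_directions(X, y)
    beta = W[:, 0]          # leading eigenvector = Fisher's direction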


                D Alternative Variational Formulation forthe Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
$$\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j} \tag{D.1a}$$
$$\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1, \tag{D.1b}$$
$$\qquad\ \ \tau_j\ge 0,\ \ j=1,\dots,p. \tag{D.1c}$$
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B\in\mathbb{R}^{p\times(K-1)}$ be a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$: $B = \big(\beta^{1\top},\dots,\beta^{p\top}\big)^\top$.

The starting point is the Lagrangian
$$L(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j, \tag{D.2}$$
which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:
$$\frac{\partial L(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0 \;\Rightarrow\; -\lambda\frac{w_j^2\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0 \;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0 \;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0.$$
The last two expressions are related through a property of the Lagrange multipliers, which states that $\nu_j g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ is the inequality constraint. Then the optimal $\tau_j^\star$ can be deduced:
$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^j\|_2.$$
Plugging this optimal value of $\tau_j$ into constraint (D.1b):
$$\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}. \tag{D.3}$$
With this value of $\tau_j^\star$, Problem (D.1) is equivalent to
$$\min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)^2. \tag{D.4}$$
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null row vectors $\beta^j$.

The penalty term of (D.1a) can be conveniently written as $\lambda B^\top\Omega B$, where
$$\Omega = \operatorname{diag}\Big(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\dots,\frac{w_p^2}{\tau_p}\Big). \tag{D.5}$$
Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is
$$(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2}. \tag{D.6}$$
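A small sketch of the computation of $\tau^\star$ (D.3) and of the diagonal of $\Omega$ (D.6) from a current coefficient matrix B could read as follows (illustrative names; rows already at zero receive an effectively infinite penalty, guarded here by a small constant).

    import numpy as np

    def tau_and_omega(B, w, eps=1e-12):
        norms = np.linalg.norm(B, axis=1)          # ||beta^j||_2 for each row j
        weighted = w * norms
        tau = weighted / (weighted.sum() + eps)    # (D.3)
        omega_diag = w**2 / np.maximum(tau, eps)   # (Omega)_jj = w_j^2 / tau_j, cf. (D.6)
        return tau, np.diag(omega_diag)

    rng = np.random.default_rng(0)
    B = rng.normal(size=(6, 2))
    B[3:] = 0.0                                    # zeroed rows: huge penalty keeps them at zero
    tau, Omega = tau_and_omega(B, np.ones(6))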

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $B\in\mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (D.4) is the set of matrices
$$V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)G, \tag{D.7}$$
where $G\in\mathbb{R}^{p\times(K-1)}$ is the matrix with rows $g^j$, $j=1,\dots,p$, defined as follows. Let $S(B)$ denote the row-wise support of $B$, $S(B) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2\neq 0\}$; then we have
$$\forall j\in S(B),\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j, \tag{D.8}$$
$$\forall j\notin S(B),\quad \|g^j\|_2\le w_j. \tag{D.9}$$

This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima. Let $S(B^\star)$ denote the row-wise support of $B^\star$, $S(B^\star) = \{j\in\{1,\dots,p\} : \|\beta^{\star j}\|_2\neq 0\}$, and let $\bar S(B^\star)$ be its complement; then we have
$$\forall j\in S(B^\star),\quad -\frac{\partial J(B^\star)}{\partial\beta^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big)\, w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j}, \tag{D.10a}$$
$$\forall j\in\bar S(B^\star),\quad \Big\|\frac{\partial J(B^\star)}{\partial\beta^j}\Big\|_2 \le 2\lambda w_j\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big). \tag{D.10b}$$

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and for a given $B$, the gap between these objectives is null at $\tau^\star$ such that
$$\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}.$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau\in\mathbb{R}^p$ be any feasible vector; we have
$$\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)^2 = \Big(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^2 \le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Big) \le \sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j},$$
where we used the Cauchy-Schwarz inequality for the first inequality and the definition of the feasibility set of $\tau$ for the second one.

This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.


                E Invariance of the Group-Lasso to UnitaryTransformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B^0$ are optimal for the score values $\Theta^0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta^0$, say $\Theta^\star = \Theta^0 V$ (where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix), then $B^\star = B^0 V$ is optimal conditionally on $\Theta^\star$, that is, $(\Theta^\star, B^\star)$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $B^\star$ be a solution of
$$\min_{B\in\mathbb{R}^{p\times M}}\ \|Y-XB\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2, \tag{E.1}$$
and let $\tilde Y = YV$, where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde B = B^\star V$ is a solution of
$$\min_{B\in\mathbb{R}^{p\times M}}\ \|\tilde Y-XB\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2. \tag{E.2}$$

Proof. The first-order necessary optimality conditions for $B^\star$ are
$$\forall j\in S(B^\star),\quad 2\,x_j^\top\big(XB^\star - Y\big) + \lambda w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = 0, \tag{E.3a}$$
$$\forall j\in\bar S(B^\star),\quad 2\,\big\|x_j^\top\big(XB^\star - Y\big)\big\|_2 \le \lambda w_j, \tag{E.3b}$$
where $x_j$ denotes the $j$-th column of $X$, $S(B^\star)\subseteq\{1,\dots,p\}$ denotes the set of non-zero row vectors of $B^\star$, and $\bar S(B^\star)$ is its complement.

First, we note that, from the definition of $\tilde B$, we have $S(\tilde B) = S(B^\star)$. Then we may rewrite the above conditions as follows:
$$\forall j\in S(\tilde B),\quad 2\,x_j^\top\big(X\tilde B - \tilde Y\big) + \lambda w_j\|\tilde\beta^{j}\|_2^{-1}\tilde\beta^{j} = 0, \tag{E.4a}$$
$$\forall j\in\bar S(\tilde B),\quad 2\,\big\|x_j^\top\big(X\tilde B - \tilde Y\big)\big\|_2 \le \lambda w_j, \tag{E.4b}$$
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$ on the right, and also uses $VV^\top = I$, so that $\forall u\in\mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde B$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
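The key fact behind this proof, namely that the group-Lasso objective is unchanged when Y and B are both multiplied on the right by a unitary V, can be checked numerically with the following sketch (illustrative data and names).

    import numpy as np

    def objective(Y, X, B, lam, w):
        # Frobenius data-fit term plus weighted sum of row norms (group-Lasso penalty)
        return np.linalg.norm(Y - X @ B, 'fro')**2 + lam * np.sum(w * np.linalg.norm(B, axis=1))

    rng = np.random.default_rng(0)
    n, p, M = 40, 6, 3
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, M))
    B = rng.normal(size=(p, M))
    w = np.ones(p)
    V, _ = np.linalg.qr(rng.normal(size=(M, M)))   # a unitary (orthogonal) matrix
    print(np.isclose(objective(Y, X, B, 0.5, w),
                     objective(Y @ V, X, B @ V, 0.5, w)))   # True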


                F Expected Complete Likelihood andLikelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:
$$L(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Big), \tag{F.1}$$
$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big), \tag{F.2}$$
$$\text{with}\quad t_{ik}(\theta') = \frac{\pi_k' f_k(x_i;\theta_k')}{\sum_\ell \pi_\ell' f_\ell(x_i;\theta_\ell')}. \tag{F.3}$$
In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probabilities computed from $\theta'$ at the previous E-step, and $\theta$ (without prime) denotes the parameters of the current iteration, obtained by the maximization of $Q(\theta,\theta')$.

Using (F.3), we have
$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) = \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_\ell \pi_\ell f_\ell(x_i;\theta_\ell)\Big) = \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta).$$
In particular, after the evaluation of the $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:
$$L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) = Q(\theta,\theta) + H(T).$$
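A small numerical sketch of this shortcut, for a toy Gaussian mixture with common covariance (illustrative names, using scipy's multivariate normal density), is given below; the direct evaluation of (F.1) and the value Q(θ, θ) + H(T) should coincide.

    import numpy as np
    from scipy.stats import multivariate_normal

    def loglik_and_Q(X, pi, mus, Sigma):
        # densities pi_k f_k(x_i) for all i, k
        dens = np.column_stack([pi[k] * multivariate_normal(mus[k], Sigma).pdf(X)
                                for k in range(len(pi))])
        T = dens / dens.sum(axis=1, keepdims=True)        # E-step responsibilities
        L_direct = np.sum(np.log(dens.sum(axis=1)))       # (F.1)
        Q = np.sum(T * np.log(dens))                      # (F.2) with theta = theta'
        H = -np.sum(T * np.log(np.maximum(T, 1e-300)))    # entropy of posteriors
        return L_direct, Q + H                            # the two values should match

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    print(loglik_and_Q(X, np.array([0.5, 0.5]), [np.zeros(2), 3 * np.ones(2)], np.eye(2)))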


                G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion to maximize is
$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) = \sum_{k}\Big(\sum_i t_{ik}\Big)\log\pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k),$$
which has to be maximized subject to $\sum_k \pi_k = 1$.

The Lagrangian of this problem is
$$L(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_k \pi_k - 1\Big).$$
Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.

G.1 Prior Probabilities

$$\frac{\partial L(\theta)}{\partial\pi_k} = 0 \iff \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,$$
where $\lambda$ is identified from the constraint, leading to
$$\hat\pi_k = \frac{1}{n}\sum_i t_{ik}.$$

G.2 Means

$$\frac{\partial L(\theta)}{\partial\mu_k} = 0 \iff -\frac{1}{2}\sum_i t_{ik}\,2\,\Sigma^{-1}(\mu_k - x_i) = 0 \;\Rightarrow\; \hat\mu_k = \frac{\sum_i t_{ik}\,x_i}{\sum_i t_{ik}}.$$

G.3 Covariance Matrix

$$\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0 \iff \underbrace{\frac{n}{2}\,\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}} = 0 \;\Rightarrow\; \hat\Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top.$$
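A compact NumPy sketch of these M-step updates, given a matrix T of responsibilities $t_{ik}$ (illustrative names and dummy data), could be written as follows.

    # M-step for a Gaussian mixture with a common covariance matrix.
    import numpy as np

    def m_step(X, T):
        n, p = X.shape
        K = T.shape[1]
        nk = T.sum(axis=0)                       # soft counts per component
        pi = nk / n                              # G.1: prior probabilities
        mus = (T.T @ X) / nk[:, None]            # G.2: weighted means
        Sigma = np.zeros((p, p))                 # G.3: pooled covariance
        for k in range(K):
            Xc = X - mus[k]
            Sigma += (T[:, k, None] * Xc).T @ Xc
        return pi, mus, Sigma / n

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    T = rng.dirichlet(np.ones(2), size=100)      # dummy responsibilities
    pi, mus, Sigma = m_step(X, T)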


                Bibliography

                F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

                F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

                F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

                J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

                A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

                H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

                P J Bickel and E Levina Some theory for Fisherrsquos linear discriminant function lsquonaiveBayesrsquo and some alternatives when there are many more variables than observationsBernoulli 10(6)989ndash1010 2004

C Biernacki, G Celeux, G Govaert, and F Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008

                C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

                C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

                C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

                S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

                L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

                L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


                T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

                S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

                C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

                B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

                L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

                C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

                A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

                D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

                R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

                B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

                Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

                R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

                V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

                J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


                J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

                J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

                W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman, J B Carlin, H S Stern, and D B Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003

                D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert, Y Grandvalet, X Liu, and L F Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010

G Govaert, Y Grandvalet, B Laval, X Liu, and L F Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011

                Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

                Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

                L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

                Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

                J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

                I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

                T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

                T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


                T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

                A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

                J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

                T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

                K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

                P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

                T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

                M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

                Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

                C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

                C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

                H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

                J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

                Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

                C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis, G Celeux, and M L Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b

                L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

                N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

                B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

                B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

                Y Nesterov Gradient methods for minimizing composite functions preprint 2007

                S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065-1071. IEEE, 2011

                M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

                M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

                W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

                W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

                K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

                S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


                Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

                A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

                C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

                S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

                V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

                V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

                V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

                C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

                L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

                Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

                A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

                S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

                P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

                M Szafranski Penalites Hierarchiques pour lrsquoIntegration de Connaissances dans lesModeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


                M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

                R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

                J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

                S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

                D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

                D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

                D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

                M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

                MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

                T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

                B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

                B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

                C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

                J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


                M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

                P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

                P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

                H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

                H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

                H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005



                  List of Figures

1.1  MASH project logo

2.1  Example of relevant features
2.2  The four key steps of feature selection
2.3  Admissible sets in two dimensions for different pure norms $\|\beta\|_p$
2.4  Two dimensional regularized problems with $\|\beta\|_1$ and $\|\beta\|_2$ penalties
2.5  Admissible sets for the Lasso and Group-Lasso
2.6  Sparsity patterns for an example with 8 variables characterized by 4 parameters

4.1  Graphical representation of the variational approach to Group-Lasso

5.1  GLOSS block diagram
5.2  Graph and Laplacian matrix for a 3×3 image

6.1  TPR versus FPR for all simulations
6.2  2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3  USPS digits "1" and "0"
6.4  Discriminant direction between digits "1" and "0"
6.5  Sparse discriminant direction between digits "1" and "0"

9.1  Mix-GLOSS Loops Scheme
9.2  Mix-GLOSS model selection diagram

10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

                  List of Tables

6.1  Experimental results for simulated data, supervised classification
6.2  Average TPR and FPR for all simulations
6.3  Experimental results for gene expression data, supervised classification

10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

                  Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N          the set of natural numbers, N = {1, 2, ...}
R          the set of reals
|A|        cardinality of a set A (for finite sets, the number of elements)
Ā          complement of set A

Data

X          input domain
x_i        input sample, x_i ∈ X
X          design matrix, X = (x_1^T, ..., x_n^T)^T
x^j        column j of X
y_i        class indicator of sample i
Y          indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z          complete data, z = (x, y)
G_k        set of the indices of observations belonging to class k
n          number of examples
K          number of classes
p          dimension of X
i, j, k    indices, running over N

Vectors, Matrices and Norms

0          vector with all entries equal to zero
1          vector with all entries equal to one
I          identity matrix
A^T        transpose of matrix A (ditto for vectors)
A^{-1}     inverse of matrix A
tr(A)      trace of matrix A
|A|        determinant of matrix A
diag(v)    diagonal matrix with v on the diagonal
‖v‖_1      L1 norm of vector v
‖v‖_2      L2 norm of vector v
‖A‖_F      Frobenius norm of matrix A


                  Probability

E[·]          expectation of a random variable
var[·]        variance of a random variable
N(μ, σ²)      normal distribution with mean μ and variance σ²
W(W, ν)       Wishart distribution with ν degrees of freedom and scale matrix W
H(X)          entropy of random variable X
I(X; Y)       mutual information between random variables X and Y

Mixture Models

y_ik          hard membership of sample i to cluster k
f_k           distribution function for cluster k
t_ik          posterior probability of sample i to belong to cluster k
T             posterior probability matrix
π_k           prior probability or mixture proportion for cluster k
μ_k           mean vector of cluster k
Σ_k           covariance matrix of cluster k
θ_k           parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)         parameter vector at iteration t of the EM algorithm
f(X; θ)       likelihood function
L(θ; X)       log-likelihood function
L_C(θ; X, Y)  complete log-likelihood function

Optimization

J(·)          cost function
L(·)          Lagrangian
β̂             generic notation for the solution with respect to β
β^ls          least squares solution coefficient vector
A             active set
γ             step size to update the regularization path
h             direction to update the regularization path


                  Penalized models

λ, λ₁, λ₂     penalty parameters
P_λ(θ)        penalty term over a generic parameter vector
β_kj          coefficient j of discriminant vector k
β_k           kth discriminant vector, β_k = (β_k1, ..., β_kp)
B             matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j           jth row of B, B = (β^{1T}, ..., β^{pT})^T
B_LDA         coefficient matrix in the LDA domain
B_CCA         coefficient matrix in the CCA domain
B_OS          coefficient matrix in the OS domain
X_LDA         data matrix in the LDA domain
X_CCA         data matrix in the CCA domain
X_OS          data matrix in the OS domain
θ_k           score vector k
Θ             score matrix, Θ = (θ_1, ..., θ_{K-1})
Y             label matrix
Ω             penalty matrix
L_CP(θ; X, Z) penalized complete log-likelihood function
Σ_B           between-class covariance matrix
Σ_W           within-class covariance matrix
Σ_T           total covariance matrix
Σ̂_B           sample between-class covariance matrix
Σ̂_W           sample within-class covariance matrix
Σ̂_T           sample total covariance matrix
Λ             inverse of the covariance matrix, or precision matrix
w_j           weights
τ_j           penalty components of the variational approach


                  Part I

                  Context and Foundations


This thesis is divided into three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic background is also detailed here to introduce the models and some basic concepts that will be used along this document. The state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


                  1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of website framework and APIs.

                  2 Classification and goal-planning in high dimensional feature spaces

                  3 Interfacing the platform with the 3D virtual environment and the robot arm

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo

The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine that a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.


All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors; a minimal sketch of this coefficient is given just below. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
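As an illustration of the last tool, here is a minimal sketch of the RV coefficient between two data tables, computed with NumPy from its usual definition based on the cross-product operators XXᵀ and YYᵀ. The synthetic tables and the direct use of raw tables (rather than the operators actually used on the platform) are assumptions made only for this example.

```python
# A minimal sketch (assumption: plain NumPy, raw tables as inputs) of the RV coefficient,
# a multivariate generalization of Pearson's correlation between two data tables.
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two tables observed on the same n samples."""
    X = X - X.mean(axis=0)                 # column-center each table
    Y = Y - Y.mean(axis=0)
    Sx = X @ X.T                           # n x n cross-product operators
    Sy = Y @ Y.T
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 5))               # table returned by a first extractor
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
B = 2.0 * A @ Q                            # rotated and rescaled copy of A
C = rng.normal(size=(40, 7))               # unrelated table

print("RV(A, B) =", round(rv_coefficient(A, B), 3))   # equals 1: RV is invariant to rotation and scaling
print("RV(A, C) =", round(rv_coefficient(A, C), 3))   # markedly smaller for an unrelated table
```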

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


                  2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews for characterizing feature selection techniques. I am proposing a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

  – Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

  – Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

  – Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

  – Complete - No subsets are missed from evaluation. Involves combinatorial searches.

  – Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

  – Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

  – Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

  – Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

  – Dependency Measures - Measuring the correlation between features.

  – Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

  – Predictive Accuracy - Use the selected features to predict the labels.

  – Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

$$\min_{\beta} \; J(\beta) + \lambda P(\beta) \qquad (2.1)$$

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \qquad (2.2)$$

In the expressions (2.1) and (2.2), the parameters $\lambda$ and $t$ have a similar function, that is, to control the trade-off between fitting the data to the model according to $J(\beta)$ and the effect of the penalty $P(\beta)$. The set such that the constraint in (2.2) is verified, $\{\beta : P(\beta) \le t\}$, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
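To make the penalized formulation (2.1) concrete, here is a minimal sketch, under the assumption of a least squares cost $J(\beta)$ and an $L_1$ penalty $P(\beta)$, solved with a generic SciPy optimizer. The synthetic data and the solver choice are illustrative only, not part of the thesis.

```python
# A minimal sketch (not from the thesis) of criterion (2.1): J(beta) + lambda * P(beta),
# with J a least-squares fit and P the L1 norm. Larger lambda shrinks the coefficients more.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]                  # only 3 informative features
y = X @ beta_true + 0.1 * rng.normal(size=50)

def penalized_objective(beta, lam):
    fit = np.sum((y - X @ beta) ** 2)             # data-fitting term J(beta)
    penalty = np.sum(np.abs(beta))                # penalty term P(beta), here the L1 norm
    return fit + lam * penalty

for lam in (0.0, 10.0, 100.0):
    res = minimize(penalized_objective, np.zeros(10), args=(lam,), method="Powell")
    print(f"lambda = {lam:6.1f}   beta = {np.round(res.x, 2)}")
```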

In this section I am reviewing the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms $\|\beta\|_p$

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)$$

for any value of $t \in [0, 1]$. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function $J(\beta)$ and the penalty $P(\beta)$ are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as $P(\beta) = \|\beta\|_p$, convexity holds for $p \ge 1$. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for $p \ge 1$.

Figure 2.4: Two dimensional regularized problems with $\|\beta\|_1$ and $\|\beta\|_2$ penalties

Regularizing a linear model with a norm like $\|\beta\|_p$ means that the larger the component $|\beta_j|$, the more important the feature $x^j$ in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of $|\beta_j| = 0$, $x^j$ is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components ($\beta_1$ or $\beta_2$) is null, that is, if the optimal $\beta$ is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where $J(\beta)$ is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by $P(\beta)$ (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an $L_1$ penalty, has more chances of inducing sparse solutions than the one of an $L_2$ penalty. That idea is displayed in Figure 2.4, where $J(\beta)$ is a quadratic function represented with three isolevel curves whose global minimum $\beta^{ls}$ is outside the penalties' admissible region. The closest point to this $\beta^{ls}$ for the $L_1$ regularization is $\beta^{\ell_1}$, and for the $L_2$ regularization it is $\beta^{\ell_2}$. The solution $\beta^{\ell_1}$ is sparse because its second component is zero, while both components of $\beta^{\ell_2}$ are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an $L_{1/3}$ penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an $L_1$ penalty; however, the non-convex shape of the $L_{1/3}$ ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with $L_p$ norms with $p \le 1$, due to the fact that they are the only ones that have vertexes. On the other side, only norms with $p \ge 1$ are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the $L_1$ penalty.

$L_0$ Penalties. The $L_0$ pseudo-norm of a vector $\beta$ is defined as the number of entries different from zero, that is, $P(\beta) = \|\beta\|_0 = \mathrm{card}\{\beta_j \,|\, \beta_j \neq 0\}$:

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)$$

where the parameter $t$ represents the maximum number of non-zero coefficients in vector $\beta$. The larger the value of $t$ (or the lower the value of $\lambda$ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector $\beta$. If $t$ is equal to the dimensionality of the problem (or if $\lambda = 0$), then the penalty term is not effective and $\beta$ is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
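The combinatorial nature of (2.4) can be made explicit with a small sketch (an illustration only, not taken from the thesis): for a least squares cost, the exact $L_0$-constrained solution requires scanning all subsets of at most $t$ variables, which is only feasible for very small $p$.

```python
# A minimal sketch (not from the thesis) of L0-constrained estimation by exhaustive
# search over subsets, illustrating why it relies on combinatorial optimization.
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
beta_true = np.array([1.5, 0.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + 0.1 * rng.normal(size=30)

def best_subset(X, y, t):
    """Least-squares fit restricted to the best subset of at most t features."""
    n, p = X.shape
    best_rss, best_beta = np.inf, np.zeros(p)
    for k in range(1, t + 1):
        for subset in itertools.combinations(range(p), k):   # combinatorial search
            coef, rss, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
            rss = rss[0] if rss.size else np.sum((y - X[:, subset] @ coef) ** 2)
            if rss < best_rss:
                best_rss = rss
                best_beta = np.zeros(p)
                best_beta[list(subset)] = coef
    return best_beta

print(np.round(best_subset(X, y, t=3), 2))   # solution with ||beta||_0 <= 3
```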

$L_1$ Penalties. The penalties built using $L_1$ norms induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \qquad (2.5)$$

Despite all the advantages of the Lasso, the choice of the right penalty is not only a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples $n$ is lower than the number of variables $p$, then the maximum number of non-zero entries of $\beta$ is $n$. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where $n$ is typically some tens of individuals and $p$ several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
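A brief numerical illustration of this saturation effect (a sketch assuming scikit-learn as the solver, not part of the thesis): with $n < p$, the Lasso cannot select more than $n$ variables.

```python
# A minimal sketch (assumption: scikit-learn used only for illustration) of Lasso sparsity
# when n < p: the number of selected variables cannot exceed n.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 20, 100                               # fewer samples than variables
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 3.0
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)           # alpha plays the role of lambda in (2.1)
n_selected = int(np.sum(lasso.coef_ != 0))
print(f"{n_selected} non-zero coefficients out of {p} (at most {n} can be selected)")
```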

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter ($t$ or $\lambda$, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

$L_2$ Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the $L_2$ norm involves the square root of the sum of all squared components. In practice, when using $L_2$ penalties, the square of the norm is used, to avoid the square root and solve a linear system. Thus an $L_2$ penalized optimization problem looks like

$$\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)$$

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 \qquad (2.7)$$

with solution $\beta^{ls} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$. If some input variables are highly correlated, the estimator $\beta^{ls}$ is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \, .$$

The solution to this problem is $\beta^{\ell_2} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I}_p)^{-1}\mathbf{X}^\top\mathbf{y}$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by $\lambda$. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" in the coefficients reduces the variability of the estimation, which may improve performances.
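The eigenvalue shift behind this stabilization can be checked directly (a minimal sketch with NumPy, not from the thesis; the data are synthetic, with two nearly collinear columns).

```python
# A minimal sketch (not from the thesis) of the ridge solution and of the eigenvalue
# shift (X^T X + lambda I) that stabilizes the estimate when variables are correlated.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 3] + 1e-3 * rng.normal(size=n)     # two nearly collinear columns
y = X @ np.array([1.0, 0.0, 0.0, 2.0, -2.0]) + 0.1 * rng.normal(size=n)

lam = 1.0
gram = X.T @ X
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)

print("smallest eigenvalue of X^T X          :", np.linalg.eigvalsh(gram)[0])
print("smallest eigenvalue of X^T X + lam * I:", np.linalg.eigvalsh(gram + lam * np.eye(p))[0])
print("ridge coefficients:", np.round(beta_ridge, 2))
```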

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \qquad (2.8)$$

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component. There, every $\lambda_j$ is optimized to penalize more or less depending on the influence of $\beta_j$ in the model.

Although the $L_2$ penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

$L_\infty$ Penalties. A special case of $L_p$ norms is the infinity norm, defined as $\|x\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_p|)$. The admissible region for a penalty like $\|\beta\|_\infty \le t$ is displayed in Figure 2.3. For the $L_\infty$ norm, the greyed out region is a square containing all the $\beta$ vectors whose largest coefficient is less than or equal to the value of the penalty parameter $t$.

This norm is not commonly used as a regularization term itself; however, it is frequently combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm $\|\beta\|_*$ of a norm $\|\beta\|$ is defined as

$$\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \; \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1 \, .$$

In the case of an $L_q$ norm with $q \in [1, +\infty]$, the dual norm is the $L_r$ norm such that $1/q + 1/r = 1$. For example, the $L_2$ norm is self-dual and the dual norm of the $L_1$ norm is the $L_\infty$ norm. Thus, this is one of the reasons why $L_\infty$ is so important even if it is not so popular as a penalty itself, because $L_1$ is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
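As a small numerical check of this duality (an illustration, not from the thesis): the dual of the $L_1$ norm evaluated at a vector $\beta$, namely $\max_{\|\mathbf{w}\|_1 \le 1} \beta^\top \mathbf{w}$, is attained at a signed canonical basis vector and equals $\|\beta\|_\infty$.

```python
# A minimal sketch (not from the thesis) verifying that the dual of the L1 norm is the
# L-infinity norm: max beta^T w over the L1 unit ball equals ||beta||_inf.
import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(size=6)

# The maximizer over the L1 unit ball is a signed canonical basis vector located at the
# largest-magnitude coordinate of beta.
j = int(np.argmax(np.abs(beta)))
w_star = np.zeros(6)
w_star[j] = np.sign(beta[j])
print("beta^T w*     =", round(float(beta @ w_star), 4))
print("||beta||_inf  =", round(float(np.abs(beta).max()), 4))

# No feasible w does better (sampled sanity check on the L1 unit sphere).
W = rng.normal(size=(100_000, 6))
W /= np.abs(W).sum(axis=1, keepdims=True)
print("largest sampled beta^T w:", round(float((W @ beta).max()), 4))
```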

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when $n \le p$. As recalled in Section 2.3.2, when $n \le p$ the Lasso penalty can select at most $n$ non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of $L_1$ and $L_2$ penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \qquad (2.9)$$

The term in $\lambda_1$ is a Lasso penalty that induces sparsity in vector $\beta$; on the other side, the term in $\lambda_2$ is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when $n$ goes to infinity) of always making the right choice of relevant variables.
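A minimal sketch of this combination (assuming scikit-learn's ElasticNet as an off-the-shelf solver of a criterion of the form (2.9); its alpha and l1_ratio parametrization corresponds to $(\lambda_1, \lambda_2)$ up to a rescaling):

```python
# A minimal sketch (assumption: scikit-learn as solver) comparing Lasso and Elastic net
# when there are more relevant variables than samples.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:30] = 1.0                                   # 30 relevant variables, only 20 samples
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio balances the two penalties

print("Lasso selects      :", int(np.sum(lasso.coef_ != 0)), "variables (at most n =", n, ")")
print("Elastic net selects:", int(np.sum(enet.coef_ != 0)), "variables")
```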


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by $L$ different groups of genes. Let us identify as $\mathcal{G}_\ell$ the group of genes for the $\ell$th process and $d_\ell$ the number of genes (variables) in each group, $\forall \ell \in \{1, \ldots, L\}$. Thus, the dimension of vector $\beta$ is the sum of the numbers of genes of all the groups, $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in \mathcal{G}_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \Bigg)^{\frac{1}{r}} \qquad (2.10)$$

The pair $(r, s)$ identifies the norms that are combined: an $L_s$ norm within groups and an $L_r$ norm between groups. The $L_s$ norm penalizes the variables in every group $\mathcal{G}_\ell$, while the $L_r$ norm penalizes the within-group norms. The pair $(r, s)$ is set so as to induce different properties in the resulting $\beta$ vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
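The following is a minimal sketch (not from the thesis) of how the mixed norm (2.10) is evaluated for a given partition of the variables into groups; with $(r, s) = (1, 2)$ it is the group-Lasso norm discussed next.

```python
# A minimal sketch (not from the thesis) of the mixed norm ||beta||_(r,s) of (2.10):
# an L_s norm within groups combined with an L_r norm between groups.
import numpy as np

def mixed_norm(beta, groups, r=1, s=2):
    """Compute ||beta||_(r,s) for a list of index groups partitioning beta."""
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 0.0,   1.0, -2.0, 0.5,   0.0, 3.0])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5]), np.array([6, 7])]

print("||beta||_(1,2) =", round(mixed_norm(beta, groups), 3))          # group-Lasso norm
print("||beta||_1     =", round(mixed_norm(beta, groups, r=1, s=1), 3))  # recovered with r = s = 1
```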

Several combinations are available; the most popular is the norm $\|\beta\|_{(1,2)}$, known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure $L_1$ norm and a mixed $L_{1,2}$ norm. Many other mixings are possible, such as $\|\beta\|_{(1,4/3)}$ (Szafranski et al., 2008) or $\|\beta\|_{(1,\infty)}$ (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other $L_1$ penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the $L_1$ norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, $L_{1,2}$ or $L_{1,\infty}$ mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso ((a), $L_1$) and the group-Lasso ((b), $L_{(1,2)}$)

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) $L_1$ induced sparsity, (b) $L_{(1,2)}$ group induced sparsity

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, $\partial J(\beta)$, and the subgradient of the regularizer, $\partial P(\beta)$, can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector $\beta^{(t+1)}$ is updated proportionally to the negative subgradient of the function at the current point $\beta^{(t)}$:

$$\beta^{(t+1)} = \beta^{(t)} - \alpha\,(\mathbf{s} + \lambda \mathbf{s}')\,, \quad \text{where } \mathbf{s} \in \partial J(\beta^{(t)}),\; \mathbf{s}' \in \partial P(\beta^{(t)})\,.$$
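A minimal sketch of this update (an illustration, not from the thesis) for a least squares loss with an $L_1$ regularizer; note that, as stated above, the iterates get close to the sparse solution but their coefficients are not exactly zero.

```python
# A minimal sketch (not from the thesis) of subgradient descent on a Lasso-type
# criterion J(beta) + lambda * ||beta||_1, with J the least-squares loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)

lam, alpha = 1.0, 1e-3
beta = np.zeros(10)
for t in range(5000):
    s = -2 * X.T @ (y - X @ beta)          # gradient of the least-squares loss
    s_prime = np.sign(beta)                # a subgradient of the L1 norm (0 at beta_j = 0)
    beta = beta - alpha * (s + lam * s_prime)

print(np.round(beta, 3))                   # close to beta_true, but not exactly sparse
```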

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient $\beta_j$ gives

$$\hat{\beta}_j = \frac{-\lambda\,\mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2}\,.$$

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution $\beta^{ls}$ and updating their values using an iterative thresholding algorithm where $\beta_j^{(t+1)} = S_\lambda\big(\frac{\partial J(\beta^{(t)})}{\partial \beta_j}\big)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex]
\dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex]
0 & \text{if } \Big|\frac{\partial J(\beta)}{\partial \beta_j}\Big| \le \lambda
\end{cases}
\qquad (2.11)$$

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
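A minimal sketch of such an iterative soft-thresholding scheme (an illustration with the least squares loss of (2.7), not the exact updates of Fu (1998)):

```python
# A minimal sketch (not from the thesis) of coordinate descent for the Lasso with
# soft-thresholding updates, one coefficient at a time while the others are kept fixed.
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding the contribution of coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # soft-thresholding: exact minimizer of the one-dimensional subproblem
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)

print(np.round(lasso_coordinate_descent(X, y, lam=5.0), 3))  # exact zeros on the useless features
```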

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set". This subset stores the indices of the variables with non-zero $\beta_j$. It is usually identified as the set $\mathcal{A}$. The complement of the active set is the "inactive set", noted $\bar{\mathcal{A}}$. In the inactive set we can find the indices of the variables whose $\beta_j$ is zero. Thus, the problem can be simplified to the dimensionality of $\mathcal{A}$.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty $\mathcal{A}$, has the advantage that the first calculations are low dimensional. In addition, the forward view fits better with the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set. Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set $\mathcal{A}$ is augmented with the variable from the inactive set $\bar{\mathcal{A}}$ that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector $\beta$ is a solution of Problem (2.1).
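The following is a generic sketch of this forward strategy (an illustration only; it is not the GLOSS algorithm of Chapter 5, and the hypothetical helper `working_set_lasso` delegates the restricted optimization task to scikit-learn's Lasso). It combines the three tasks: restricted optimization, working set update and optimality checking.

```python
# A generic sketch (an illustration, not the algorithm of this thesis) of a forward
# working set strategy for the Lasso criterion ||y - X beta||^2 + lam * ||beta||_1.
import numpy as np
from sklearn.linear_model import Lasso

def working_set_lasso(X, y, lam, tol=1e-6, max_active=None):
    n, p = X.shape
    active = []                                   # start from an empty active set
    beta = np.zeros(p)
    max_active = max_active or p
    while len(active) < max_active:
        # optimality check on the inactive set: |gradient_j| must not exceed lambda
        grad = -2 * X.T @ (y - X @ beta)
        violations = np.abs(grad) - lam
        violations[active] = -np.inf
        j_star = int(np.argmax(violations))
        if violations[j_star] <= tol:             # no violation: beta is optimal
            break
        active.append(j_star)                     # working set update
        # optimization task restricted to the active variables (warm start not shown)
        sub = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X[:, active], y)
        beta = np.zeros(p)
        beta[active] = sub.coef_
    return beta, active

# Usage (illustrative): beta_hat, selected = working_set_lasso(X, y, lam=5.0)
```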

These active constraints or working set methods, even if they were originally proposed to solve $L_1$ regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and $L_1$ penalties (Roth, 2004), linear functions and $L_{1,2}$ penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of $L_0$, $L_1$ and $L_2$ penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter $\lambda$ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set $\mathcal{A}^{(t)}$ and its corresponding solution $\beta^{(t)}$ have been set, looking for the regularization path means looking for a direction $\mathbf{h}$ and a step size $\gamma$ to update the solution as $\beta^{(t+1)} = \beta^{(t)} + \gamma \mathbf{h}$. Afterwards, the active and inactive sets $\mathcal{A}^{(t+1)}$ and $\bar{\mathcal{A}}^{(t+1)}$ are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and which variable should enter the active set, from the correlation with the residuals.
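A short sketch of such a path (assuming scikit-learn's lars_path as a readily available LARS implementation; nothing here is specific to the thesis) shows the breakpoints at which variables enter the active set:

```python
# A minimal sketch (assumption: scikit-learn's lars_path) of the piecewise linear
# regularization path of the Lasso computed by LARS.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)

# alphas: breakpoints of the path; coefs[:, k]: solution at the k-th breakpoint
alphas, active, coefs = lars_path(X, y, method="lasso")
for alpha, beta in zip(alphas, coefs.T):
    print(f"alpha = {alpha:8.3f}   active variables = {np.flatnonzero(beta).tolist()}")
```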

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz differentiable cost function $J(\beta)$ and a non-differentiable penalty $\lambda P(\beta)$.

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)$$

They are also iterative methods, where the cost function $J(\beta)$ is linearized in the proximity of the solution $\beta$, so that the problem to solve in each iteration looks like (2.12), where the parameter $L > 0$ should be an upper bound on the Lipschitz constant of the gradient $\nabla J$. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big(\beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)})\Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)$$

                  The basic algorithm makes use of the solution to (213) as the next value of β(t+1)However there are faster versions that take advantage of information about previoussteps as the ones described by Nesterov (2007) or the FISTA algorithm (Beck andTeboulle 2009) Proximal methods can be seen as generalizations of gradient updatesIn fact making λ = 0 in equation (213) the standard gradient update rule comes up
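To make the update (2.13) concrete, here is a minimal sketch of the basic proximal-gradient iteration for a quadratic cost J(β) = (1/2)‖y − Xβ‖² and the L1 penalty, whose proximal operator is componentwise soft-thresholding; the data, the step constant L and the number of iterations are illustrative assumptions, not part of the methods studied in this thesis.

    # Minimal sketch of the proximal-gradient update (2.13) for a quadratic cost and an
    # L1 penalty, whose proximal operator is soft-thresholding.
    import numpy as np

    def soft_threshold(z, t):
        """Proximal operator of t * ||.||_1 (componentwise shrinkage)."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def proximal_gradient(X, y, lam, n_iter=200):
        n, p = X.shape
        beta = np.zeros(p)
        L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of grad J (spectral norm squared)
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - y)        # gradient of the quadratic cost at beta^(t)
            beta = soft_threshold(beta - grad / L, lam / L)   # closed-form solution of (2.13)
        return beta

    rng = np.random.default_rng(1)
    X = rng.standard_normal((40, 15))
    y = X[:, 0] - 2 * X[:, 1] + 0.05 * rng.standard_normal(40)
    print(np.round(proximal_gradient(X, y, lam=5.0), 3))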


                  Part II

                  Sparse Linear Discriminant Analysis


                  Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0,1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1, ..., x_n)ᵀ and the corresponding labels in the n×K matrix Y = (y_1, ..., y_n)ᵀ.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

    max_{β∈R^p}  (βᵀ Σ_B β) / (βᵀ Σ_W β) ,        (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

    Σ_W = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (x_i − μ_k)(x_i − μ_k)ᵀ
    Σ_B = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (μ − μ_k)(μ − μ_k)ᵀ ,

where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

    max_{B∈R^{p×(K−1)}}  tr(Bᵀ Σ_B B) / tr(Bᵀ Σ_W B) ,        (3.2)

where the matrix B is built with the discriminant directions β_k as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

    max_{β_k∈R^p}  β_kᵀ Σ_B β_k
    s.t.  β_kᵀ Σ_W β_k ≤ 1
          β_kᵀ Σ_W β_ℓ = 0 ,  ∀ ℓ < k .        (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W⁻¹ Σ_B associated with the kth largest eigenvalue (see Appendix C).
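As a small illustration of how the subproblems (3.3) reduce to an eigen-analysis, the sketch below computes Σ_W and Σ_B on simulated Gaussian classes and extracts the K − 1 leading generalized eigenvectors with scipy; the simulated data are an assumption made only for the example.

    # Minimal sketch: multi-class Fisher discriminant directions as the leading solutions
    # of the generalized eigenproblem Sigma_B v = lambda Sigma_W v.
    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(2)
    K, p, n_k = 3, 5, 30
    means = rng.standard_normal((K, p)) * 3
    X = np.vstack([m + rng.standard_normal((n_k, p)) for m in means])
    y = np.repeat(np.arange(K), n_k)

    n = X.shape[0]
    mu = X.mean(axis=0)
    Sigma_W = np.zeros((p, p))
    Sigma_B = np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
        Sigma_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu) / n

    # Generalized symmetric eigenproblem; eigenvalues come out in ascending order.
    evals, evecs = eigh(Sigma_B, Sigma_W)
    B = evecs[:, ::-1][:, :K - 1]      # the K-1 leading discriminant directions
    print("leading eigenvalues:", np.round(evals[::-1][:K - 1], 3))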

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly targets the reduction of the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis of the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away from each other (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

    min_{β∈R^p}  βᵀ Σ_W β
    s.t.  (μ_1 − μ_2)ᵀ β = 1
          Σ_{j=1}^p |β_j| ≤ t ,

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

    max_{β_k∈R^p}  β_kᵀ Σ_B^k β_k − P_k(β_k)
    s.t.  β_kᵀ Σ_W β_k ≤ 1 .

The term to maximize is the projected between-class covariance β_kᵀ Σ_B β_k, subject to an upper bound on the projected within-class covariance β_kᵀ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem, but instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W⁻¹(μ_1 − μ_2), they estimate the product directly, through constrained L1 minimization:

    min_{β∈R^p}  ‖β‖₁
    s.t.  ‖Σ β − (μ_1 − μ_2)‖_∞ ≤ λ .

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.

Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al. 2000, Friedman et al. 2009).

                  Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n×K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al. 2009). Another "popular" choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).
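The two constructions above can be sketched as follows for a hypothetical label vector y with values in {0, ..., K−1}; the helper names are illustrative.

    # Minimal sketch of the two class indicator matrices mentioned above.
    import numpy as np

    def dummy_indicator(y, K):
        """Binary indicator: Y[i, k] = 1 if sample i belongs to class k, 0 otherwise."""
        Y = np.zeros((len(y), K))
        Y[np.arange(len(y)), y] = 1.0
        return Y

    def symmetric_indicator(y, K):
        """Y[i, k] = 1 for the true class, -1/(K-1) otherwise (Lee et al., 2004)."""
        Y = np.full((len(y), K), -1.0 / (K - 1))
        Y[np.arange(len(y)), y] = 1.0
        return Y

    y = np.array([0, 2, 1, 1])
    print(dummy_indicator(y, K=3))
    print(symmetric_indicator(y, K=3))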

There are some efforts which propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also prevented an empirical test of this conjecture. If the similarity is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

    min_{β∈R^p, β_0∈R}  n⁻¹ Σ_{i=1}^n (y_i − β_0 − x_iᵀβ)² + λ Σ_{j=1}^p |β_j| ,

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule xᵀβ + β_0 > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β_0 is required.

                  Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

    min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(Bᵀ Ω B)        (3.4a)
    s.t.  n⁻¹ Θᵀ Yᵀ Y Θ = I_{K−1} ,        (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

    min_{θ_k∈R^K, β_k∈R^p}  ‖Yθ_k − Xβ_k‖² + β_kᵀ Ω β_k        (3.5a)
    s.t.  n⁻¹ θ_kᵀ Yᵀ Y θ_k = 1        (3.5b)
          θ_kᵀ Yᵀ Y θ_ℓ = 0 ,  ℓ = 1, ..., k − 1 ,        (3.5c)

where each β_k corresponds to a discriminant direction.


Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

    min_{β_k∈R^p, θ_k∈R^K}  Σ_k { ‖Yθ_k − Xβ_k‖₂² + λ_1 ‖β_k‖₁ + λ_2 β_kᵀ Ω β_k } ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

    min_{β_k∈R^p, θ_k∈R^K}  Σ_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖₂² + λ Σ_{j=1}^p ( Σ_{k=1}^{K−1} β_kj² )^{1/2} ,        (3.6)

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.
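For intuition on why criterion (3.6) yields variable-wise sparsity, the small sketch below evaluates the group-Lasso penalty on an illustrative coefficient matrix B: the coefficients of a given feature are grouped across the K − 1 discriminant directions, so a feature is either kept in every direction or discarded from all of them.

    # Minimal sketch of the group-Lasso penalty used in (3.6), on an illustrative matrix B.
    import numpy as np

    def group_lasso_penalty(B):
        """Sum over features of the Euclidean norm of the corresponding row of B."""
        return np.sum(np.sqrt(np.sum(B ** 2, axis=1)))

    B = np.array([[0.0, 0.0],     # feature 1: removed from every direction
                  [1.2, -0.3],    # feature 2: active in both directions
                  [0.0, 0.0]])    # feature 3: removed from every direction
    print(group_lasso_penalty(B))   # 1.2369..., the norm of the only non-zero row
    print(np.abs(B).sum())          # Lasso penalty, for comparison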


                  4 Formalizing the Objective

In this chapter we detail the rationale supporting the group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995) and already used before for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix YᵀY is full rank;
• inputs are centered, that is, Xᵀ1_n = 0;
• the quadratic penalty Ω is positive-semidefinite and such that XᵀX + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraints (3.5c) inherently limit the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1_n (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore, which apply along the route, so as to simplify all expressions. The generic problem solved is thus

    min_{θ∈R^K, β∈R^p}  ‖Yθ − Xβ‖² + βᵀ Ω β        (4.1a)
    s.t.  n⁻¹ θᵀ Yᵀ Y θ = 1 .        (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

    β_os = (XᵀX + Ω)⁻¹ XᵀYθ .        (4.2)

The objective function (4.1a) is then

    ‖Yθ − Xβ_os‖² + β_osᵀ Ω β_os = θᵀYᵀYθ − 2 θᵀYᵀXβ_os + β_osᵀ (XᵀX + Ω) β_os
                                 = θᵀYᵀYθ − θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ ,

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

    max_{θ : n⁻¹θᵀYᵀYθ = 1}  θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ ,        (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of YᵀX (XᵀX + Ω)⁻¹ XᵀY. Indeed, Appendix C details that Problem (4.3) is solved by

    (YᵀY)⁻¹ YᵀX (XᵀX + Ω)⁻¹ XᵀY θ = α² θ ,        (4.4)


where α² is the maximal eigenvalue:¹

    n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ = α² n⁻¹ θᵀ(YᵀY)θ
    n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ = α² .        (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

    max_{θ∈R^K, β∈R^p}  n⁻¹ θᵀYᵀXβ        (4.6a)
    s.t.  n⁻¹ θᵀYᵀYθ = 1        (4.6b)
          n⁻¹ βᵀ(XᵀX + Ω)β = 1 .        (4.6c)

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

    n L(β, θ, ν, γ) = θᵀYᵀXβ − ν(θᵀYᵀYθ − n) − γ(βᵀ(XᵀX + Ω)β − n)
    ⇒ n ∂L(β, θ, γ, ν)/∂β = XᵀYθ − 2γ(XᵀX + Ω)β
    ⇒ β_cca = (1/(2γ)) (XᵀX + Ω)⁻¹ XᵀYθ .

Then, as β_cca obeys (4.6c), we obtain

    β_cca = (XᵀX + Ω)⁻¹ XᵀYθ / ( n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ )^{1/2} ,        (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

    n⁻¹ θᵀYᵀXβ_cca = n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ / ( n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ )^{1/2}
                   = ( n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ )^{1/2} ,

and the optimization problem with respect to θ can be restated as

    max_{θ : n⁻¹θᵀYᵀYθ = 1}  θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ .        (4.8)

Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

    β_os = α β_cca ,        (4.9)

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

    n ∂L(β, θ, γ, ν)/∂θ = YᵀXβ − 2ν YᵀYθ
    ⇒ θ_cca = (1/(2ν)) (YᵀY)⁻¹ YᵀXβ .        (4.10)

Then, as θ_cca obeys (4.6b), we obtain

    θ_cca = (YᵀY)⁻¹ YᵀXβ / ( n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ )^{1/2} ,        (4.11)

leading to the following expression of the optimal objective function:

    n⁻¹ θ_ccaᵀ YᵀXβ = n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ / ( n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ )^{1/2}
                    = ( n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ )^{1/2} .

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

    max_{β∈R^p}  n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ        (4.12a)
    s.t.  n⁻¹ βᵀ(XᵀX + Ω)β = 1 ,        (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution; following the reasoning of Appendix C, β_cca verifies

    n⁻¹ XᵀY (YᵀY)⁻¹ YᵀX β_cca = λ (XᵀX + Ω) β_cca ,        (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

    n⁻¹ β_ccaᵀ XᵀY (YᵀY)⁻¹ YᵀX β_cca = λ
    ⇒ n⁻¹ α⁻¹ β_ccaᵀ XᵀY (YᵀY)⁻¹ YᵀX (XᵀX + Ω)⁻¹ XᵀYθ = λ
    ⇒ n⁻¹ α β_ccaᵀ XᵀYθ = λ
    ⇒ n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ = λ
    ⇒ α² = λ .

The first line is obtained by obeying constraint (4.12b); the second line follows from the relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

    max_{β∈R^p}  βᵀ Σ_B β        (4.14a)
    s.t.  βᵀ(Σ_W + n⁻¹Ω)β = 1 ,        (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y(YᵀY)⁻¹Yᵀ:

    Σ_T = (1/n) Σ_{i=1}^n x_i x_iᵀ = n⁻¹ XᵀX
    Σ_B = (1/n) Σ_{k=1}^K n_k μ_k μ_kᵀ = n⁻¹ XᵀY (YᵀY)⁻¹ YᵀX
    Σ_W = (1/n) Σ_{k=1}^K Σ_{i : y_ik = 1} (x_i − μ_k)(x_i − μ_k)ᵀ = n⁻¹ ( XᵀX − XᵀY (YᵀY)⁻¹ YᵀX ) .

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

    XᵀY (YᵀY)⁻¹ YᵀX β_lda = λ ( XᵀX + Ω − XᵀY (YᵀY)⁻¹ YᵀX ) β_lda
    XᵀY (YᵀY)⁻¹ YᵀX β_lda = (λ/(1 − λ)) (XᵀX + Ω) β_lda .

The comparison of the last equation with β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

    β_lda = (1 − α²)^{−1/2} β_cca
          = α⁻¹ (1 − α²)^{−1/2} β_os ,

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

    min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(Bᵀ Ω B)
    s.t.  n⁻¹ Θᵀ Yᵀ Y Θ = I_{K−1} .

Let A represent the (K−1)×(K−1) diagonal matrix with elements α_k, the square roots of the K−1 largest eigenvalues of YᵀX (XᵀX + Ω)⁻¹ XᵀY; we have

    B_LDA = B_CCA (I_{K−1} − A²)^{−1/2}
          = B_OS A⁻¹ (I_{K−1} − A²)^{−1/2} ,        (4.15)

where I_{K−1} is the (K−1)×(K−1) identity matrix.

At this point, the feature matrix X, which in the input space has dimensions n×p, can be projected into the optimal scoring domain as an n×(K−1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n×(K−1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as B_OS = (XᵀX + λΩ)⁻¹ XᵀYΘ, where Θ are the K−1 leading eigenvectors of YᵀX (XᵀX + λΩ)⁻¹ XᵀY.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A⁻¹ (I_{K−1} − A²)^{−1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Build a graphical representation.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

    min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr(Bᵀ Ω B)        (4.16a)
    s.t.  n⁻¹ Θᵀ Yᵀ Y Θ = I_{K−1} ,        (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n⁻¹ Θ⁰ᵀ Yᵀ Y Θ⁰ = I_{K−1}.

2. Compute B = (XᵀX + λΩ)⁻¹ XᵀYΘ⁰.

3. Set Θ to be the K−1 leading eigenvectors of YᵀX (XᵀX + λΩ)⁻¹ XᵀY.

4. Compute the optimal regression coefficients

       B_OS = (XᵀX + λΩ)⁻¹ XᵀYΘ .        (4.17)

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰ᵀ YᵀX (XᵀX + λΩ)⁻¹ XᵀY Θ⁰, which is computed as Θ⁰ᵀ YᵀX B, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form tr(Bᵀ Ω B). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where the sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (the sample estimators of μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

    d(x_i, μ_k) = (x_i − μ_k)ᵀ Σ_{WΩ}⁻¹ (x_i − μ_k) − 2 log(n_k/n)        (4.18)

is minimized over all k. In expression (4.18), the first term is the Mahalanobis distance in the input space and the second term is an adjustment term for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix Σ_{WΩ} used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

    Σ_{WΩ}⁻¹ = ( n⁻¹(XᵀX + λΩ) − Σ_B )⁻¹
             = ( n⁻¹XᵀX − Σ_B + n⁻¹λΩ )⁻¹
             = ( Σ_W + n⁻¹λΩ )⁻¹ .        (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution B_OS of the p-OS problem is enough to accomplish classification.
• In the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances.
• Classification can be done in a reduced-rank space of dimension R < K−1, by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, it is

    ‖(x_i − μ_k) B_OS‖²_{Σ_WΩ} − 2 log(π_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, it is

    ‖(x_i − μ_k) B_OS A⁻¹ (I_{K−1} − A²)^{−1/2}‖₂² − 2 log(π_k) ,

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

    p̂(y_k = 1|x) ∝ exp( −d(x, μ_k)/2 )
                  ∝ π_k exp( −(1/2) ‖(x − μ_k) B_OS A⁻¹ (I_{K−1} − A²)^{−1/2}‖₂² ) .        (4.20)

These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

    p̂(y_k = 1|x) = π_k exp( −d(x, μ_k)/2 ) / Σ_ℓ π_ℓ exp( −d(x, μ_ℓ)/2 )
                  = π_k exp( (−d(x, μ_k) + d_max)/2 ) / Σ_ℓ π_ℓ exp( (−d(x, μ_ℓ) + d_max)/2 ) ,

where d_max = max_k d(x, μ_k).
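The trick above is easily implemented; the following sketch, with illustrative distances and priors, shows that shifting all distances by d_max leaves the normalized posteriors unchanged while avoiding underflow.

    # Minimal sketch of the underflow-safe posterior normalization described above.
    import numpy as np

    def posteriors(d, priors):
        """d: distances d(x, mu_k); priors: estimated class priors pi_k."""
        d = np.asarray(d, dtype=float)
        shifted = np.exp((-d + d.max()) / 2) * priors   # same ratios as exp(-d/2) * priors
        return shifted / shifted.sum()

    d = np.array([2000.0, 2004.0, 2010.0])              # exp(-1000) would underflow to 0
    print(posteriors(d, priors=np.array([1/3, 1/3, 1/3])))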

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we represent the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form βᵀΩβ, under the assumption that YᵀY and XᵀX + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties, but as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. Therefore, we intend to show that our formulation of the group-Lasso can be written in the quadratic form BᵀΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:

    min_{τ∈R^p} min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖₂² / τ_j        (4.21a)
    s.t.  Σ_j τ_j − Σ_j w_j ‖β^j‖₂ ≤ 0        (4.21b)
          τ_j ≥ 0 ,  j = 1, ..., p ,        (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^1, ..., β^p)ᵀ, and the w_j are predefined nonnegative weights. The cost function J(B) is, in our context, the OS regression ‖YΘ − XB‖₂²; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic penalty into the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖β^j‖₂.

Proof. The Lagrangian of Problem (4.21) is

    L = J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖₂² / τ_j + ν₀ ( Σ_{j=1}^p τ_j − Σ_{j=1}^p w_j ‖β^j‖₂ ) − Σ_{j=1}^p ν_j τ_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j* are

    ∂L/∂τ_j (τ_j*) = 0  ⇔  −λ w_j² ‖β^j‖₂² / τ_j*² + ν₀ − ν_j = 0
                        ⇔  −λ w_j² ‖β^j‖₂² + ν₀ τ_j*² − ν_j τ_j*² = 0
                        ⇒  −λ w_j² ‖β^j‖₂² + ν₀ τ_j*² = 0 .

The last line is obtained from complementary slackness, which implies here ν_j τ_j* = 0 (complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier for the constraint g_j(τ_j) ≤ 0). As a result, the optimal value of τ_j is

    τ_j* = ( λ w_j² ‖β^j‖₂² / ν₀ )^{1/2} = (λ/ν₀)^{1/2} w_j ‖β^j‖₂ .        (4.22)

We note that ν₀ ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus, the inequality constraint (4.21b) is at bound (due to complementary slackness):

    Σ_{j=1}^p τ_j* − Σ_{j=1}^p w_j ‖β^j‖₂ = 0 ,        (4.23)

so that τ_j* = w_j ‖β^j‖₂. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

    min_{B∈R^{p×M}}  J(B) + λ Σ_{j=1}^p w_j ‖β^j‖₂ .        (4.24)

So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation. □


With Lemma 4.1 we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ Bᵀ Ω B, where

    Ω = diag( w_1²/τ_1, w_2²/τ_2, ..., w_p²/τ_p ) ,        (4.25)

with τ_j = w_j ‖β^j‖₂, resulting in the diagonal components of Ω:

    (Ω)_jj = w_j / ‖β^j‖₂ .        (4.26)

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖₂²/τ, known as the perspective function of f(β) = ‖β‖₂², is convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ). □

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

    { V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λG } ,        (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^1, ..., g^p)ᵀ, defined as follows. Let S(B) denote the columnwise support of B, S(B) = { j ∈ {1, ..., p} : ‖β^j‖₂ ≠ 0 }; then we have

    ∀ j ∈ S(B) ,  g^j = w_j ‖β^j‖₂⁻¹ β^j        (4.28)
    ∀ j ∉ S(B) ,  ‖g^j‖₂ ≤ w_j .        (4.29)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖₂ ≠ 0, the gradient of the penalty with respect to β^j is

    ∂( λ Σ_{m=1}^p w_m ‖β^m‖₂ ) / ∂β^j = λ w_j β^j / ‖β^j‖₂ .        (4.30)

At ‖β^j‖₂ = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

    ∂_{β^j} ( λ Σ_{m=1}^p w_m ‖β^m‖₂ ) = ∂_{β^j} ( λ w_j ‖β^j‖₂ ) = { λ w_j v ∈ R^{K−1} : ‖v‖₂ ≤ 1 } ,        (4.31)

which gives expression (4.29). □

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

    ∀ j ∈ S ,  ∂J(B)/∂β^j + λ w_j ‖β^j‖₂⁻¹ β^j = 0        (4.32a)
    ∀ j ∉ S ,  ‖∂J(B)/∂β^j‖₂ ≤ λ w_j ,        (4.32b)

where S ⊆ {1, ..., p} denotes the set of non-zero row vectors β^j and S̄ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).
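The optimality conditions (4.32a)-(4.32b) are what the working-set algorithm of the next chapter monitors; a minimal sketch of such a check is given below, where grad_J stands for the gradient of the smooth cost at B, and the tolerance, data and weights are illustrative assumptions.

    # Minimal sketch of an optimality check based on Lemma 4.4 for problem (4.24).
    import numpy as np

    def check_optimality(B, grad_J, lam, w, tol=1e-6):
        """Return True if B satisfies conditions (4.32a)-(4.32b) up to tol."""
        for j in range(B.shape[0]):
            norm_bj = np.linalg.norm(B[j])
            if norm_bj > 0:                               # active row: stationarity (4.32a)
                residual = grad_J[j] + lam * w[j] * B[j] / norm_bj
                if np.linalg.norm(residual) > tol:
                    return False
            else:                                         # inactive row: subgradient bound (4.32b)
                if np.linalg.norm(grad_J[j]) > lam * w[j] + tol:
                    return False
        return True

    # Toy usage with a quadratic cost J(B) = 0.5 ||Y Theta - X B||^2, whose gradient is
    # X'(X B - Y Theta); the data below are illustrative.
    rng = np.random.default_rng(4)
    X = rng.standard_normal((30, 6))
    R = rng.standard_normal((30, 2))                      # stands for Y Theta
    B = np.zeros((6, 2))
    grad_J = X.T @ (X @ B - R)
    print(check_optimality(B, grad_J, lam=500.0, w=np.ones(6)))   # large lam: B = 0 is optimal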

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

    B_OS = argmin_{B∈R^{p×(K−1)}} min_{Θ∈R^{K×(K−1)}}  (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖β^j‖₂
    s.t.  n⁻¹ Θᵀ Yᵀ Y Θ = I_{K−1}

is equivalent to the penalized LDA problem

    B_LDA = argmax_{B∈R^{p×(K−1)}}  tr(Bᵀ Σ_B B)
    s.t.  Bᵀ(Σ_W + n⁻¹λΩ)B = I_{K−1} ,

where Ω = diag( w_1²/τ_1, ..., w_p²/τ_p ), with

    Ω_jj = +∞ if β^j_os = 0 ,  and  Ω_jj = w_j ‖β^j_os‖₂⁻¹ otherwise.        (4.33)

That is, B_LDA = B_OS diag( α_k⁻¹ (1 − α_k²)^{−1/2} ), where α_k ∈ (0,1) is the kth leading eigenvalue of

    n⁻¹ YᵀX (XᵀX + λΩ)⁻¹ XᵀY .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso. □

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr(Bᵀ Ω B).


                  5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2) ‖YΘ − XB‖₂².

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach of Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving K−1 independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the K−1 problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive as we then solve K−1 similar systems

    (X_Aᵀ X_A + λΩ) β_k = X_Aᵀ Y θ_k⁰ ,        (5.1)


[Block diagram of GLOSS: initialize the model (λ, B); form the active set {j : ‖β^j‖₂ > 0}; solve the p-OS problem so that B satisfies the first optimality condition; move to the inactive set any active variable that has to leave it; test the second optimality condition on the inactive set and move the violating variable, if any, into the active set; when no variable moves anymore, compute Θ, update B and stop.]

Figure 5.1: GLOSS block diagram.


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, ..., p} : ‖β^j‖₂ > 0 }; Θ⁰ such that n⁻¹ Θ⁰ᵀYᵀYΘ⁰ = I_{K−1}; convergence ← false
repeat
  % Step 1: solve (4.21) in B, assuming A optimal
  repeat
    Ω ← diag(Ω_A), with ω_j ← ‖β^j‖₂⁻¹
    B_A ← (X_AᵀX_A + λΩ)⁻¹ X_AᵀYΘ⁰
  until condition (4.32a) holds for all j ∈ A
  % Step 2: identify inactivated variables
  for j ∈ A such that ‖β^j‖₂ = 0 do
    if optimality condition (4.32b) holds then
      A ← A \ {j}; go back to Step 1
    end if
  end for
  % Step 3: check the greatest violation of optimality condition (4.32b) in the set Ā
  ĵ ← argmax_{j∈Ā} ‖∂J/∂β^j‖₂
  if ‖∂J/∂β^ĵ‖₂ < λ then
    convergence ← true (B is optimal)
  else
    A ← A ∪ {ĵ}
  end if
until convergence
(s, V) ← eigenanalyze(Θ⁰ᵀYᵀX_A B), that is, Θ⁰ᵀYᵀX_A B V_k = s_k V_k, k = 1, ..., K−1
Θ ← Θ⁰V; B ← BV; α_k ← n^{−1/2} s_k^{1/2}, k = 1, ..., K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth columns of B and Θ0, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all of them, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

    (X^T X + λΩ) B = X^T Y Θ .    (5.2)

Defining the Cholesky decomposition as C^T C = X^T X + λΩ, (5.2) is solved efficiently as follows:

    C^T C B = X^T Y Θ
        C B = C^T \ (X^T Y Θ)
          B = C \ (C^T \ (X^T Y Θ)) ,    (5.3)

where the symbol "\" is the Matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
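To make this step concrete, here is a minimal Matlab sketch of (5.3); it is not the GLOSS package code, and the names X, Y, Theta0, Omega and lambda stand for the quantities restricted to the current active set.

    % Sketch of (5.3): one Cholesky factorization shared by the K-1 right-hand sides.
    % X: n x card(A), Y: n x K, Theta0: K x (K-1), Omega: diagonal card(A) x card(A).
    C = chol(X'*X + lambda*Omega);   % upper triangular, C'*C = X'*X + lambda*Omega
    RHS = X' * (Y*Theta0);           % the K-1 right-hand sides
    B = C \ (C' \ RHS);              % two triangular solves with mldivide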

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^T X + λΩ. This difficulty can be avoided by using the following equivalent expression:

    B = Ω^{−1/2} ( Ω^{−1/2} X^T X Ω^{−1/2} + λI )^{−1} Ω^{−1/2} X^T Y Θ0 ,    (5.4)

where the conditioning of Ω^{−1/2} X^T X Ω^{−1/2} + λI is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values. Our code is otherwise based on expression (5.2).
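The rescaled computation (5.4) can be sketched as follows; this is an illustration under the assumption that Omega is the diagonal adaptive penalty used by GLOSS, not the actual package code.

    % Sketch of the stabilized expression (5.4), for iterations with large omega_j.
    d = 1 ./ sqrt(diag(Omega));            % entries of Omega^{-1/2}, all in (0, 1]
    Xs = X .* d';                          % columns of X scaled by Omega^{-1/2} (implicit expansion, R2016b+)
    M = Xs'*Xs + lambda*eye(numel(d));     % Omega^{-1/2} X'X Omega^{-1/2} + lambda*I
    B = d .* (M \ (Xs' * (Y*Theta0)));     % left multiplication by Omega^{-1/2}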

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^T X (X^T X + Ω)^{−1} X^T Y. This eigen-analysis is actually solved in the form Θ^T Y^T X (X^T X + Ω)^{−1} X^T Y Θ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^T X + Ω)^{−1}, which involves the inversion of a p × p matrix. Let Θ0 be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y^T X (X^T X + Ω)^{−1} X^T Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B0 = (X^T X + λΩ)^{−1} X^T Y Θ0. This B0 matrix can be identified in the expression to eigen-analyze as

    Θ0^T Y^T X (X^T X + Ω)^{−1} X^T Y Θ0 = Θ0^T Y^T X B0 .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix Θ0^T Y^T X B0 = V Λ V^T. Defining Θ = Θ0 V, we have Θ^T Y^T X (X^T X + Ω)^{−1} X^T Y Θ = Λ, and when Θ0 is chosen such that n^{−1} Θ0^T Y^T Y Θ0 = I_{K−1}, we also have n^{−1} Θ^T Y^T Y Θ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, Θ is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ0 to Θ, that is, B = B0 V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

¹ As X is centered, 1_K belongs to the null space of Y^T X (X^T X + Ω)^{−1} X^T Y. It is thus sufficient to choose Θ0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^T X (X^T X + Ω)^{−1} X^T Y. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set Θ0 = (Y^T Y)^{−1/2} U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
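For illustration, this final rescoring step can be sketched in Matlab as follows; B0, Theta0 and n are the quantities defined above, and the code is an illustrative sketch rather than the package implementation.

    % Eigen-analysis of the small (K-1)x(K-1) matrix Theta0'*Y'*X*B0,
    % then mapping Theta0 -> Theta and B0 -> B as described in the text.
    M = Theta0' * (Y' * (X*B0));
    [V, D] = eig((M + M')/2);                % symmetrized for numerical safety
    [s, order] = sort(diag(D), 'descend');   % eigenvalues in decreasing order
    V = V(:, order);
    Theta = Theta0 * V;                      % optimal scores
    B = B0 * V;                              % corresponding regression coefficients
    alpha = sqrt(max(s, 0) / n);             % alpha_k = n^{-1/2} s_k^{1/2}, cf. Algorithm 1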

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which the optimality conditions (4.32a) and (4.32b) can be deduced. Both expressions require the computation of the gradient of the objective function

    (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖β^j‖₂ .    (5.5)

Let J(B) denote the data-fitting term (1/2) ‖YΘ − XB‖²_F. Its gradient with respect to β^j, the jth row of B, is the (K − 1)-dimensional vector

    ∂J(B)/∂β^j = x_j^T (XB − YΘ) ,

where x_j is the jth column of X. Hence the first optimality condition (4.32a) can be computed, for every active variable j, as

    x_j^T (XB − YΘ) + λ w_j β^j / ‖β^j‖₂ = 0 .

The second optimality condition (4.32b) can be computed, for every variable j, as

    ‖x_j^T (XB − YΘ)‖₂ ≤ λ w_j .

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

    ĵ = argmax_j  max( ‖x_j^T (XB − YΘ)‖₂ − λ w_j , 0 ) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖₂ is small and if, after setting β^j to zero, the following optimality condition holds:

    ‖x_j^T (XB − YΘ)‖₂ ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
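The two tests can be sketched in Matlab as follows; active is a logical vector of size p, w contains the penalty weights, and all names are illustrative rather than those of the GLOSS code.

    % Gradient of the data-fitting term for all variables, and both tests.
    G = X' * (X*B - Y*Theta);                     % p x (K-1), row j is dJ/dbeta^j
    normG = sqrt(sum(G.^2, 2));                   % ||dJ/dbeta^j||_2
    % Inclusion: worst violator of the second optimality condition among inactive variables.
    score = (normG - lambda*w(:)) .* ~active(:);
    [viol, j] = max(score);
    if viol > 0
        active(j) = true;
    end
    % Exclusion: an active variable with a vanishing row that satisfies the condition may leave.
    leaving = active(:) & (sqrt(sum(B.^2, 2)) < 1e-8) & (normG <= lambda*w(:));
    active(leaving) = false;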

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λmax, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λmax, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

    λmax = max_{j ∈ {1,…,p}}  (1/w_j) ‖x_j^T Y Θ0‖₂ .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ1 = λmax > · · · > λt > · · · > λT = λmin ≥ 0, obtained by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy in which the feasible initial guess for B(λ_{t+1}) is B(λ_t). The final penalty parameter λmin is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
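A short Matlab sketch of λmax and of the halving schedule follows; w denotes the penalty weights, and the number of path points T is an assumption of the sketch, not a GLOSS default.

    % lambda_max from condition (4.32b) at B = 0, then a geometric path.
    G0 = X' * (Y*Theta0);                           % p x (K-1) gradients at B = 0
    lambda_max = max(sqrt(sum(G0.^2, 2)) ./ w(:));  % max_j (1/w_j) ||x_j' Y Theta0||_2
    T = 20;                                         % illustrative number of path points
    lambdas = lambda_max ./ 2.^(0:T-1);             % lambda_{t+1} = lambda_t / 2, with warm starts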


5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some Matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requirements, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

    min_{B ∈ R^{p×(K−1)}} ‖YΘ − XB‖²_F = min_{B ∈ R^{p×(K−1)}} tr( Θ^T Y^T Y Θ − 2 Θ^T Y^T X B + n B^T Σ_T B )

are replaced by

    min_{B ∈ R^{p×(K−1)}} tr( Θ^T Y^T Y Θ − 2 Θ^T Y^T X B + n B^T (Σ_B + diag(Σ_W)) B ) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{−1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{−1}Ω positive definite.

5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, together with the corresponding Laplacian matrix:

    Ω_L =
      [  3 −1  0 −1 −1  0  0  0  0 ]
      [ −1  5 −1 −1 −1 −1  0  0  0 ]
      [  0 −1  3  0 −1 −1  0  0  0 ]
      [ −1 −1  0  5 −1  0 −1 −1  0 ]
      [ −1 −1 −1 −1  8 −1 −1 −1 −1 ]
      [  0 −1 −1  0 −1  5  0 −1 −1 ]
      [  0  0  0 −1 −1  0  3 −1  0 ]
      [  0  0  0 −1 −1 −1 −1  5 −1 ]
      [  0  0  0  0 −1 −1  0 −1  3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image (pixels numbered 1 to 9, row-wise from the bottom-left corner, with 8-connected neighborhoods).

The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^T Ω_L β favors, among vectors of identical L2 norm, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^T, which is the indicator of the neighborhood of pixel 1, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^T, with a sign mismatch between pixel 1 and its neighbors.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned, as illustrated in the sketch below.
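As an illustration of how such a Laplacian can be constructed, and of the penalty value quoted above, here is a small Matlab sketch; it follows the 8-connected 3 × 3 grid of Figure 5.2 and is not part of GLOSS.

    % Build the Laplacian of the 8-connected 3x3 pixel graph and evaluate the penalty.
    [r, c] = ndgrid(1:3, 1:3);                 % pixel coordinates on the 3x3 grid
    A = zeros(9);
    for i = 1:9
        for j = 1:9
            if i ~= j && max(abs(r(i)-r(j)), abs(c(i)-c(j))) <= 1
                A(i,j) = 1;                    % neighbors: Chebyshev distance 1 (king moves)
            end
        end
    end
    OmegaL = diag(sum(A,2)) - A;               % graph Laplacian, identical to the matrix of Figure 5.2
    beta = [1 1 0 1 1 0 0 0 0]';
    penalty = beta' * OmegaL * beta;           % equals 9 for the indicator of pixel 1's neighborhood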


                  6 Experimental Results

This chapter presents comparison results between the Group-Lasso Optimal Scoring Solver and two state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in Matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to one either the diagonal of the total covariance matrix Σ_T or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing the penalties with weights. The latter option is implemented in our Matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹ The GLOSS Matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100 and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is given below, followed by an illustrative data-generation sketch.

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with μ_1j = 0.7 × 1(1 ≤ j ≤ 25), μ_2j = 0.7 × 1(26 ≤ j ≤ 50), μ_3j = 0.7 × 1(51 ≤ j ≤ 75), μ_4j = 0.7 × 1(76 ≤ j ≤ 100).

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ); if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1(j ≤ 200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100; the blocks have (j, j′) element 0.6^|j−j′|. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
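To make the protocol concrete, here is a hedged Matlab sketch of the data generation for Simulation 1; all names are illustrative, and the other setups follow the same pattern with their own means and covariances.

    % Simulation 1: four classes, mean shift of 0.7 on 25 class-specific features.
    n = 100; p = 500; K = 4;
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25+1 : k*25) = 0.7;     % relevant features of class k
    end
    y = repelem((1:K)', n/K);               % balanced class labels
    X = mu(y, :) + randn(n, p);             % x_i ~ N(mu_k, I)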

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are obtained by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far behind by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                 Err (%)       Var           Dir
Sim 1: K = 4, mean shift, ind. features
  PLDA           12.6 (0.1)    411.7 (3.7)   3.0 (0.0)
  SLDA           31.9 (0.1)    228.0 (0.2)   3.0 (0.0)
  GLOSS          19.9 (0.1)    106.4 (1.3)   3.0 (0.0)
  GLOSS-D        11.2 (0.1)    251.1 (4.1)   3.0 (0.0)
Sim 2: K = 2, mean shift, dependent features
  PLDA            9.0 (0.4)    337.6 (5.7)   1.0 (0.0)
  SLDA           19.3 (0.1)     99.0 (0.0)   1.0 (0.0)
  GLOSS          15.4 (0.1)     39.8 (0.8)   1.0 (0.0)
  GLOSS-D         9.0 (0.0)    203.5 (4.0)   1.0 (0.0)
Sim 3: K = 4, 1D mean shift, ind. features
  PLDA           13.8 (0.6)    161.5 (3.7)   1.0 (0.0)
  SLDA           57.8 (0.2)    152.6 (2.0)   1.9 (0.0)
  GLOSS          31.2 (0.1)    123.8 (1.8)   1.0 (0.0)
  GLOSS-D        18.5 (0.1)    357.5 (2.8)   1.0 (0.0)
Sim 4: K = 4, mean shift, ind. features
  PLDA           60.3 (0.1)    336.0 (5.8)   3.0 (0.0)
  SLDA           65.9 (0.1)    208.8 (1.6)   2.7 (0.0)
  GLOSS          60.7 (0.2)     74.3 (2.2)   2.7 (0.0)
  GLOSS-D        58.8 (0.1)    162.7 (4.9)   2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all simulations (Simulations 1–4).

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

           Simulation 1     Simulation 2     Simulation 3     Simulation 4
           TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
PLDA       99.0   78.2      96.9   60.3      98.0   15.9      74.3   65.6
SLDA       73.9   38.5      33.8   16.3      41.6   27.8      50.7   39.5
GLOSS      64.1   10.6      30.0    4.6      51.1   18.2      26.0   12.1
GLOSS-D    93.5   39.4      92.1   28.1      95.6   65.5      42.9   29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of selected variables that are actually relevant; similarly, the FPR is the proportion of selected variables that are actually irrelevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011).

² http://www.broadinstitute.org/cancer/software/genepattern/datasets


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                                    Err (%)          Var
Nakayama: n = 86, p = 22,283, K = 5
  PLDA                              20.95 (1.3)      10478.7 (2116.3)
  SLDA                              25.71 (1.7)        252.5 (3.1)
  GLOSS                             20.48 (1.4)        129.0 (18.6)
Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA                              38.36 (6.0)      14873.5 (720.3)
  SLDA                              —                —
  GLOSS                             20.61 (6.9)        372.4 (122.1)
Sun: n = 180, p = 54,613, K = 4
  PLDA                              33.78 (5.9)      21634.8 (7443.2)
  SLDA                              36.22 (6.5)        384.4 (16.5)
  GLOSS                             31.77 (4.5)         93.0 (93.6)

The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736

Each dataset was split into a training set and a test set containing respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets onto the first canonical plane estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant within the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


Figure 6.2: 2D-representations of the Nakayama (top) and Sun (bottom) datasets based on the first two discriminant vectors provided by GLOSS (left) and SLDA (right). The big squares represent class means. Nakayama classes: 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma; Sun classes: 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class covariance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and with S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we encode the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is clearly visible in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). The center of the digit "0", which is probably the most important element for discriminating the two digits, is clearly visible in the discriminant direction obtained by S-GLOSS.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow strokes to be detected and will probably provide better prediction results.


Figure 6.4: Discriminant direction β between digits "1" and "0", for GLOSS (left) and S-GLOSS (right).

Figure 6.5: Sparse discriminant direction β between digits "1" and "0", for GLOSS (left) and S-GLOSS (right), with λ = 0.3.


                  Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is suited to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions yields models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data through the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


                  Part III

                  Sparse Clustering Analysis


                  Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


                  7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and eight years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^T, …, x_n^T)^T have been drawn identically from K different subpopulations of R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compound distribution whose density can be expressed as

    f(x_i) = Σ_{k=1}^K π_k f_k(x_i) ,   ∀i ∈ {1, …, n} ,

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ for all k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

    f(x_i; θ) = Σ_{k=1}^K π_k φ(x_i; θ_k) ,   ∀i ∈ {1, …, n} ,


where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components; that method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used approach for estimating the parameters is the maximization of the log-likelihood using the EM algorithm, which is typically used to maximize the likelihood of models with latent variables when no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the expectation of the complete log-likelihood with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected log-likelihood formed in the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed: in practice, the obtained solution depends on the initialization of the algorithm.

                  Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

    L(θ; X) = log ( Π_{i=1}^n f(x_i; θ) )
            = Σ_{i=1}^n log ( Σ_{k=1}^K π_k f_k(x_i; θ_k) ) ,    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters) and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:


    L_C(θ; X, Y) = log ( Π_{i=1}^n f(x_i, y_i; θ) )
                 = Σ_{i=1}^n log ( Σ_{k=1}^K y_ik π_k f_k(x_i; θ_k) )
                 = Σ_{i=1}^n Σ_{k=1}^K y_ik log ( π_k f_k(x_i; θ_k) ) .    (7.2)

The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k and y_ik = 0 otherwise.

Defining the soft membership t_ik(θ) as

    t_ik(θ) = p(Y_ik = 1 | x_i, θ)    (7.3)
            = π_k f_k(x_i; θ_k) / f(x_i; θ) ,    (7.4)

we will, to lighten notations, denote t_ik(θ) by t_ik when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

    L_C(θ; X, Y) = Σ_{i,k} y_ik log ( π_k f_k(x_i; θ_k) )
                 = Σ_{i,k} y_ik log ( t_ik f(x_i; θ) )
                 = Σ_{i,k} y_ik log t_ik + Σ_{i,k} y_ik log f(x_i; θ)
                 = Σ_{i,k} y_ik log t_ik + Σ_{i=1}^n log f(x_i; θ)
                 = Σ_{i,k} y_ik log t_ik + L(θ; X) ,    (7.5)

where Σ_{i,k} y_ik log t_ik can be reformulated as

    Σ_{i,k} y_ik log t_ik = Σ_{i=1}^n Σ_{k=1}^K y_ik log p(Y_ik = 1 | x_i, θ)
                          = Σ_{i=1}^n log p(y_i | x_i, θ)
                          = log p(Y | X, θ) .

As a result, the relationship (7.5) can be rewritten as

    L(θ; X) = L_C(θ; Z) − log p(Y | X, θ) .    (7.6)


                  Likelihood Maximization

The complete log-likelihood cannot be evaluated because the variables y_ik are unknown. However, the log-likelihood can be handled by taking expectations, conditionally on a current value θ^(t), in (7.6):

    L(θ; X) = E_{Y∼p(·|X,θ^(t))} [ L_C(θ; X, Y) ] + E_{Y∼p(·|X,θ^(t))} [ −log p(Y | X, θ) ]
            =             Q(θ, θ^(t))             +              H(θ, θ^(t)) .

In this expression, H(θ, θ^(t)) is an entropy term and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define the increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:

    ΔL = ( Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t)) ) + ( H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t)) ) ,

where the first term is non-negative by definition of iteration t+1, and the second term, a Kullback–Leibler divergence, is non-negative by Jensen's inequality. Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, which shows how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

    Q(θ, θ′) = E_{Y ∼ p(Y|X,θ′)} [ L_C(θ; X, Y) ]
             = Σ_{i,k} p(Y_ik = 1 | x_i, θ′) log ( π_k f_k(x_i; θ_k) )
             = Σ_{i=1}^n Σ_{k=1}^K t_ik(θ′) log ( π_k f_k(x_i; θ_k) ) .    (7.7)

Because of its similarity to the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-Step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);

• M-Step: computation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).


                  Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix Σ and different mean vectors μ_k, the mixture density is

    f(x_i; θ) = Σ_{k=1}^K π_k f_k(x_i; θ_k)
              = Σ_{k=1}^K π_k (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2) (x_i − μ_k)^T Σ^{−1} (x_i − μ_k) } .

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameters θ^(t); the M-step then maximizes Q(θ, θ^(t)) (7.7), which takes the form

    Q(θ, θ^(t)) = Σ_{i,k} t_ik log π_k − Σ_{i,k} t_ik log( (2π)^{p/2} |Σ|^{1/2} ) − (1/2) Σ_{i,k} t_ik (x_i − μ_k)^T Σ^{−1} (x_i − μ_k)
                = Σ_k t_k log π_k − (np/2) log(2π) − (n/2) log|Σ| − (1/2) Σ_{i,k} t_ik (x_i − μ_k)^T Σ^{−1} (x_i − μ_k)
                ≡ Σ_k t_k log π_k − (n/2) log|Σ| − Σ_{i,k} t_ik ( (1/2) (x_i − μ_k)^T Σ^{−1} (x_i − μ_k) ) ,    (7.8)

where the constant term −(np/2) log(2π) is dropped in the last line, and

    t_k = Σ_{i=1}^n t_ik .    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^(t+1):

    π_k^(t+1) = t_k / n ,    (7.10)

    μ_k^(t+1) = ( Σ_i t_ik x_i ) / t_k ,    (7.11)

    Σ^(t+1) = (1/n) Σ_k W_k ,    (7.12)

    with  W_k = Σ_i t_ik (x_i − μ_k)(x_i − μ_k)^T .    (7.13)

                  The derivations are detailed in Appendix G
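To fix ideas, here is a self-contained Matlab sketch of one EM iteration for this model, implementing (7.4) and (7.10)–(7.13); all variable names are illustrative, and the code is a plain baseline for the common-covariance Gaussian mixture, not Mix-GLOSS.

    % One EM iteration: X is n x p, mu is K x p, Sigma is p x p, pik is 1 x K.
    [n, p] = size(X);
    K = numel(pik);
    % ----- E-step: posterior probabilities t_ik, equation (7.4) -----
    logf = zeros(n, K);
    L = chol(Sigma, 'lower');                          % Sigma = L*L'
    for k = 1:K
        D = (X - mu(k,:)) / L';                        % whitened residuals (implicit expansion, R2016b+)
        logf(:,k) = log(pik(k)) - 0.5*sum(D.^2, 2) ...
                    - 0.5*p*log(2*pi) - sum(log(diag(L)));
    end
    T = exp(logf - max(logf, [], 2));                  % stabilized exponentials
    T = T ./ sum(T, 2);                                % t_ik, rows sum to one
    % ----- M-step: updates (7.10)-(7.13) -----
    tk  = sum(T, 1);                                   % t_k
    pik = tk / n;                                      % (7.10)
    mu  = (T' * X) ./ tk';                             % (7.11)
    Sigma = zeros(p);
    for k = 1:K
        R = X - mu(k,:);
        Sigma = Sigma + R' * (T(:,k) .* R);            % W_k, equation (7.13)
    end
    Sigma = Sigma / n;                                 % (7.12)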

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample-size setting, numerical issues appear in the estimation of the covariance matrix. To avoid these singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, Σ_k = λ_k D_k A_k D_k^T (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques for inducing sparsity in model-based clustering algorithms. Here, sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

    log ( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x^T Σ^{−1}(μ_k − μ_ℓ) − (1/2)(μ_k + μ_ℓ)^T Σ^{−1}(μ_k − μ_ℓ) + log(π_k/π_ℓ) .

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

    λ Σ_{k=1}^K Σ_{j=1}^p |μ_kj| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

    λ_1 Σ_{k=1}^K Σ_{j=1}^p |μ_kj| + λ_2 Σ_{k=1}^K Σ_{j=1}^p Σ_{m=1}^p |(Σ_k^{−1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP),

    λ Σ_{j=1}^p Σ_{1 ≤ k ≤ k′ ≤ K} |μ_kj − μ_k′j| .

This PFP regularization does not shrink the means to zero, but towards each other: when the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

    λ Σ_{j=1}^p ‖(μ_1j, μ_2j, …, μ_Kj)‖_∞ .

One group is defined for each variable j, as the set of the jth components of the K means, (μ_1j, …, μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

    λ √K Σ_{j=1}^p √( Σ_{k=1}^K μ_kj² ) .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing to test it is available on the authors' website.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented this for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector; the generalization from quadratic to non-quadratic penalties is only quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy through conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


    f(x_i | φ, π, θ, ν) = Σ_{k=1}^K π_k Π_{j=1}^p [ f(x_ij | θ_jk) ]^{φ_j} [ h(x_ij | ν_j) ]^{1−φ_j} ,

where f(· | θ_jk) is the distribution function of the relevant features and h(· | ν_j) is the distribution function of the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so that all the φ_j are treated as missing variables. The set of parameters is thus {π_k, θ_jk, ν_j, ρ_j}; its estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher criterion,

    tr( (U^T Σ_W U)^{−1} U^T Σ_B U ) ,    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as functions of the projection matrix U and of the model parameters in the latent space, so that the U matrix enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation Û of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

    min_{Û ∈ R^{p×(K−1)}}  ‖X_U − X Û‖²_F + λ Σ_{k=1}^{K−1} ‖û_k‖_1 ,

where X_U = XU is the input data projected onto the (non-sparse) latent space, and û_k is the kth column vector of the projection matrix Û. The second possibility is inspired by Qiao et al. (2009); it reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:

    min_{A,B ∈ R^{p×(K−1)}}  Σ_{k=1}^K ‖R_W^{−T} H_{B,k} − A B^T H_{B,k}‖²_2 + ρ Σ_{j=1}^{K−1} β_j^T Σ_W β_j + λ Σ_{j=1}^{K−1} ‖β_j‖_1
    s.t.  A^T A = I_{K−1} ,

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^T = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W; Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space; A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.

The last possibility approximates the solution of Fisher's discriminant problem (7.14) by the solution of the following constrained optimization problem:

    min_{U ∈ R^{p×(K−1)}}  Σ_{j=1}^p ‖Σ_{B,j} − U U^T Σ_{B,j}‖²_2
    s.t.  U^T U = I_{K−1} ,

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain assumptions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices: the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion in, or exclusion from, X^(1);

• X^(3): the set of non-relevant variables.


With those subsets, they defined two different models, where Y is the partition to consider:

• M1:
\[
f(X|Y) = f\left(X^{(1)}, X^{(2)}, X^{(3)}\,|\,Y\right) = f\left(X^{(3)}\,|\,X^{(2)}, X^{(1)}\right) f\left(X^{(2)}\,|\,X^{(1)}\right) f\left(X^{(1)}\,|\,Y\right)
\]

• M2:
\[
f(X|Y) = f\left(X^{(1)}, X^{(2)}, X^{(3)}\,|\,Y\right) = f\left(X^{(3)}\,|\,X^{(2)}, X^{(1)}\right) f\left(X^{(2)}, X^{(1)}\,|\,Y\right)
\]

Model M1 means that the variables in X^(2) are independent of the clustering Y, while model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of the variable in X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

\[
B_{12} = \frac{f(X\,|\,M_1)}{f(X\,|\,M_2)},
\]

where the high-dimensional term f(X^(3)|X^(2), X^(1)) cancels from the ratio:

\[
B_{12} = \frac{f\left(X^{(1)}, X^{(2)}, X^{(3)}\,|\,M_1\right)}{f\left(X^{(1)}, X^{(2)}, X^{(3)}\,|\,M_2\right)}
= \frac{f\left(X^{(2)}\,|\,X^{(1)}, M_1\right) f\left(X^{(1)}\,|\,M_1\right)}{f\left(X^{(2)}, X^{(1)}\,|\,M_2\right)}.
\]

This factor is approximated, since the integrated likelihoods f(X^(1)|M_1) and f(X^(2), X^(1)|M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2)|X^(1), M_1), when there is only one variable in X^(2), can be represented as a linear regression of the variable in X^(2) on the variables in X^(1). There is also a BIC approximation for this term.
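To make the comparison concrete, the sketch below illustrates how such a BIC-approximated Bayes factor could be evaluated for a single candidate variable. The BIC values of the two mixture-model fits (`bic_clust_X1`, `bic_clust_X1X2`) are assumed to come from an external clustering routine (e.g. mclust or mixmod); the regression BIC and the overall bookkeeping are illustrative assumptions, not the actual code of Raftery and Dean (2006).

```python
import numpy as np

def bic_gaussian_regression(y, X):
    """BIC (2*loglik - k*log n, higher is better) of the linear regression
    of the candidate variable y on the selected variables X, used to
    approximate f(X2 | X1, M1)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = np.mean(resid ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X1.shape[1] + 1  # regression coefficients + noise variance
    return 2 * loglik - k * np.log(n)

def approx_log_bayes_factor(bic_clust_X1, bic_clust_X1X2, y_candidate, X_selected):
    """log B12 ~ [BIC of clustering on X1 + BIC of regression X2|X1]
                 - BIC of clustering on (X1, X2); positive values favour M1,
    i.e. the candidate variable is judged not relevant for the clustering."""
    bic_m1 = bic_clust_X1 + bic_gaussian_regression(y_candidate, X_selected)
    return bic_m1 - bic_clust_X1X2
```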

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables leads to huge running times. In practice, they cannot be used for the amount of data considered in this thesis.


                  8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues: no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k),
\]

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

\[
2\, l_{\mathrm{weight}}(\mu, \Sigma) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|),
\]

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

\[
d(x_i, \mu_k) = \left\| (x_i - \mu_k)\, B_{LDA} \right\|_2^2 - 2 \log(\pi_k).
\]

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as
\[
B_{OS} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \Theta,
\]
where Θ are the K − 1 leading eigenvectors of
\[
Y^\top X \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y.
\]

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag(α_k^{-1}(1 − α_k^2)^{-1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik with
\[
t_{ik} \propto \exp\left[-\,\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right]. \tag{8.1}
\]

7. Update the labels using the posterior probabilities matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step and item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
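The following sketch illustrates how this alternative EM view could be organized in code. It is only a schematic outline under assumptions: the hypothetical routine `penalized_os` stands in for the penalized optimal scoring fit, the random initialization replaces K-means, and the centroid and distance computations use plain Euclidean geometry in the LDA domain; it is not the Mix-GLOSS implementation itself.

```python
import numpy as np

def em_penalized_os(X, K, penalized_os, lam, n_iter=100, tol=1e-3):
    """EM skeleton where the M-step is a penalized optimal scoring fit.
    `penalized_os(X, Y, lam)` is assumed to return (B_lda, Theta)."""
    n, p = X.shape
    rng = np.random.default_rng(0)
    Y = np.eye(K)[rng.integers(K, size=n)]          # hard initial assignment
    T_old = Y.copy()
    for _ in range(n_iter):
        # ----- M-step: penalized OS regression of the indicator matrix -----
        B_lda, Theta = penalized_os(X, Y, lam)
        X_lda = X @ B_lda
        pi = Y.sum(axis=0) / n                      # cluster proportions
        M = (Y.T @ X_lda) / Y.sum(axis=0)[:, None]  # centroids in the LDA domain
        # ----- E-step: distances -> posterior probabilities, as in (8.1) -----
        d = ((X_lda[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        logit = -(d - 2 * np.log(pi)) / 2
        logit -= logit.max(axis=1, keepdims=True)   # numerical stability
        T = np.exp(logit)
        T /= T.sum(axis=1, keepdims=True)
        if np.abs(T - T_old).mean() < tol:
            break
        T_old, Y = T, T                             # soft labels for the next M-step
    return T, B_lda
```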

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized OS problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

\[
f(\Sigma \,|\, \Lambda_0, \nu_0) = \frac{1}{2^{np/2}\, |\Lambda_0|^{n/2}\, \Gamma_p\!\left(\frac{n}{2}\right)}\; |\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}} \exp\left\{-\frac{1}{2}\operatorname{tr}\!\left(\Lambda_0^{-1}\Sigma^{-1}\right)\right\},
\]

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

\[
\Gamma_p\!\left(\tfrac{n}{2}\right) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left(\tfrac{n}{2} + \tfrac{1-j}{2}\right).
\]

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

\[
\begin{aligned}
Q(\theta, \theta') + \log\left(f(\Sigma \,|\, \Lambda_0, \nu_0)\right)
&= \sum_{k=1}^{K} t_{k} \log \pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log(\pi) \\
&\quad - \sum_{j=1}^{p} \log\Gamma\!\left(\frac{n}{2} + \frac{1-j}{2}\right) - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right) \\
&\equiv \sum_{k=1}^{K} t_{k} \log \pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right), \tag{8.2}
\end{aligned}
\]

with

\[
t_{k} = \sum_{i=1}^{n} t_{ik}, \qquad
\nu_n = \nu_0 + n, \qquad
\Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad
S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^\top.
\]

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator of Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator of Σ is

\[
\widehat{\Sigma}_{MAP} = \frac{1}{\nu_0 + n - p - 1}\left(\Lambda_0^{-1} + S_0\right), \tag{8.3}
\]

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
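For illustration, a minimal numerical sketch of this MAP covariance update is given below. The choice ν_0 = p + 1 and Λ_0^{-1} = λΩ follows the identification above, while the function name and its inputs are assumptions made for the example.

```python
import numpy as np

def map_within_covariance(X, T, mu, lam, Omega):
    """MAP estimate of the common within-class covariance, as in (8.3),
    with nu_0 = p + 1 and Lambda_0^{-1} = lam * Omega.
    X: (n, p) data, T: (n, K) posteriors, mu: (K, p) centroids."""
    n, p = X.shape
    K = mu.shape[0]
    S0 = np.zeros((p, p))
    for k in range(K):
        R = X - mu[k]                    # residuals w.r.t. centroid k
        S0 += (R * T[:, k:k + 1]).T @ R  # posterior-weighted scatter
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)
```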


                  9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, so that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the warm-start implemented here reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0
    Y ← K-means(X, K)
    Run non-penalized Mix-GLOSS:
        λ ← 0
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    lastLAMBDA ← false
repeat
    Estimate λ: compute the gradient at β^j = 0,
        ∂J(B)/∂β^j |_{β^j = 0} = x^{j⊤} ( Σ_{m≠j} x^m β^m − YΘ )
    Compute λmax for every feature using (4.32b):
        λmax_j = (1 / w_j) ‖ ∂J(B)/∂β^j |_{β^j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
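As an aside, the sketch below shows how the per-feature λmax of the "Estimate λ" step could be computed with plain matrix operations. The variable names (`Y_theta`, `weights`) are illustrative assumptions rather than the actual Mix-GLOSS code.

```python
import numpy as np

def lambda_max_per_feature(X, Y_theta, B, weights):
    """Per-feature critical penalty: the smallest lambda that zeroes
    feature j when all other coefficients are kept fixed.
    X: (n, p), Y_theta: (n, K-1) scaled indicator targets,
    B: (p, K-1) current coefficients, weights: (p,) group weights."""
    n, p = X.shape
    lam_max = np.empty(p)
    for j in range(p):
        B_minus_j = B.copy()
        B_minus_j[j, :] = 0.0                  # drop feature j
        residual = X @ B_minus_j - Y_theta     # sum_{m != j} x^m beta^m - Y Theta
        grad_j = X[:, j] @ residual            # gradient of J at beta^j = 0
        lam_max[j] = np.linalg.norm(grad_j) / weights[j]
    return lam_max
```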

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0 ; Y ← Y0
    else
        B_OS ← 0 ; Y ← K-means(X, K)
    end if
    convergenceEM ← false ; tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


                  M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

                  E-Step

The E-step evaluates the posterior probability matrix T using

\[
t_{ik} \propto \exp\left[-\,\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right].
\]

The convergence of these t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm required a lot of computing resources, since the stability selection mechanism needed a certain number of repetitions that turned Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.
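The sketch below gives the flavour of such a sparsity-aware BIC, where the effective number of parameters counts only the non-zero rows of the coefficient matrix. The exact parameter count used by Pan and Shen (2007) or by Mix-GLOSS may differ, so this is an assumption made for illustration.

```python
import numpy as np

def modified_bic(loglik, B, n, K, tol=1e-8):
    """BIC-like criterion penalizing only the parameters that remain
    in the model after feature selection (lower is better here)."""
    active = np.sum(np.linalg.norm(B, axis=1) > tol)  # non-zero rows of B
    # mixture proportions + centroid coordinates restricted to active features
    n_params = (K - 1) + K * active
    return -2.0 * loglik + np.log(n) * n_params
```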

The third and (up to now) definitive attempt proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested


Figure 9.2: Mix-GLOSS model selection diagram (non-penalized repetitions with λ = 0, warm-start of the penalized runs with the best B and T, computation of BIC and choice of the λ minimizing BIC).

with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


                  10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 17, 67, 73 and 300. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper (Maugis et al., 2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using either the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms of Section 10.1. The parameters used to measure the performance are:

• Clustering error (in %): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007); a sketch of this computation is given after this list. If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ.

• Number of discarded features: this value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
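A minimal sketch of this label-permutation-invariant clustering error is given below. It matches cluster IDs to class IDs with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`, which is one common way to implement the measure, not necessarily the exact procedure of Wu and Schölkopf (2007).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Clustering error (in %) after the best one-to-one matching
    between predicted cluster IDs and true class IDs."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency[i, j] = number of samples of class i assigned to cluster j
    contingency = np.array([[np.sum((y_true == c) & (y_pred == k))
                             for k in clusters] for c in classes])
    row, col = linear_sum_assignment(-contingency)   # maximize matched samples
    matched = contingency[row, col].sum()
    return 100.0 * (1.0 - matched / len(y_true))
```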

The adequacy of the selected features is assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of the truly relevant variables that are selected. Similarly, the FPR is the proportion of the truly non-relevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM (Bouveyron and Brunet, 2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.

Results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).


Table 10.1: Experimental results for simulated data.

                        Err (%)       Var           Time

Sim 1: K = 4, mean shift, ind. features
  CS general cov        46 (15)       985 (72)      884h
  Fisher EM             58 (87)       784 (52)      1645m
  Clustvarsel           602 (107)     378 (291)     383h
  LumiWCluster-Kuan     42 (68)       779 (4)       389s
  LumiWCluster-Wang     43 (69)       784 (39)      619s
  Mix-GLOSS             32 (16)       80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov        154 (2)       997 (09)      783h
  Fisher EM             74 (23)       809 (28)      8m
  Clustvarsel           73 (2)        334 (207)     166h
  LumiWCluster-Kuan     64 (18)       798 (04)      155s
  LumiWCluster-Wang     63 (17)       799 (03)      14s
  Mix-GLOSS             77 (2)        841 (34)      2h

Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov        304 (57)      55 (468)      1317h
  Fisher EM             233 (65)      366 (55)      22m
  Clustvarsel           658 (115)     232 (291)     542h
  LumiWCluster-Kuan     323 (21)      80 (02)       83s
  LumiWCluster-Wang     308 (36)      80 (02)       1292s
  Mix-GLOSS             347 (92)      81 (88)       21h

Sim 4: K = 4, mean shift, ind. features
  CS general cov        626 (55)      999 (02)      112h
  Fisher EM             567 (104)     55 (48)       195m
  Clustvarsel           732 (4)       24 (12)       767h
  LumiWCluster-Kuan     692 (112)     99 (2)        876s
  LumiWCluster-Wang     697 (119)     991 (21)      825s
  Mix-GLOSS             669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms.

              Simulation 1    Simulation 2    Simulation 3    Simulation 4
              TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
  MIX-GLOSS   992    015      828    335      884    67       780    12
  LUMI-KUAN   992    28       1000   02       1000   005      50     005
  FISHER-EM   986    24       888    17       838    5825     620    4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (Mix-GLOSS, LumiWCluster-Kuan and Fisher EM) on the four simulations.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest method, with good behavior regarding the other performance measures. At the other end on this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


                  Conclusions

                  Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to take advantage of all the resources available for the resolution of regression problems in the solution of linear discrimination problems. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested on four artificial and three real datasets, outperforming other algorithms of the state of the art in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

                  Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables; however, other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


                  Appendix


                  A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:

\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top, \qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top.
\]

Property 2. \(\dfrac{\partial x^\top a}{\partial x} = \dfrac{\partial a^\top x}{\partial x} = a\).

Property 3. \(\dfrac{\partial x^\top A x}{\partial x} = (A + A^\top)\, x\).

Property 4. \(\dfrac{\partial |X^{-1}|}{\partial X} = -|X^{-1}|\, (X^{-1})^\top\).

Property 5. \(\dfrac{\partial a^\top X b}{\partial X} = a b^\top\).

Property 6. \(\dfrac{\partial}{\partial X} \operatorname{tr}\!\left(A X^{-1} B\right) = -\left(X^{-1} B A X^{-1}\right)^\top = -X^{-\top} A^\top B^\top X^{-\top}\).


                  B The Penalized-OS Problem is anEigenvector Problem

In this appendix we explain why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

\[
\begin{aligned}
\min_{\theta_k, \beta_k} \quad & \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k & (B.1)\\
\text{s.t.} \quad & \theta_k^\top Y^\top Y \theta_k = 1, \\
& \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k,
\end{aligned}
\]

for k = 1, ..., K − 1.

The Lagrangian associated with Problem (B.1) is

\[
L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k\left(\theta_k^\top Y^\top Y \theta_k - 1\right) + \sum_{\ell<k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k. \tag{B.2}
\]

Setting to zero the gradient of (B.2) with respect to β_k gives the value of the optimal β_k:

\[
\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. \tag{B.3}
\]

The objective function of (B.1) evaluated at β_k^\star is

\[
\begin{aligned}
\min_{\theta_k}\; \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star
&= \min_{\theta_k}\; \theta_k^\top Y^\top\left(I - X(X^\top X + \Omega_k)^{-1}X^\top\right) Y \theta_k \\
&= \max_{\theta_k}\; \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. & (B.4)
\end{aligned}
\]

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem, where the K − 1 score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Performing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial, due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, so that we can rewrite expression (B.4) in a compact way:

\[
\begin{aligned}
\max_{\Theta \in \mathbb{R}^{K\times(K-1)}} \quad & \operatorname{tr}\left(\Theta^\top M \Theta\right) & (B.5)\\
\text{s.t.} \quad & \Theta^\top Y^\top Y \Theta = I_{K-1}.
\end{aligned}
\]

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1) × (K−1) matrix M_Θ be Θ^⊤MΘ. Hence, the classical eigenvector formulation associated with (B.5) is

\[
M_\Theta v = \lambda v, \tag{B.6}
\]

where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,

\[
v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda.
\]

Making the change of variable w = Θv, we obtain an alternative eigen-problem where the w are eigenvectors of M and λ the associated eigenvalues:

\[
w^\top M w = \lambda. \tag{B.7}
\]

Therefore, v are the eigenvectors obtained from the eigen-decomposition of the matrix M_Θ, and w are the eigenvectors obtained from the eigen-decomposition of the matrix M. Note that the only difference between the (K−1) × (K−1) matrix M_Θ and the K × K matrix M is the K × (K−1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{-1}X^⊤YΘ in M_Θ:

\[
M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B.
\]

Thus, the eigen-decomposition of the (K−1) × (K−1) matrix M_Θ = Θ^⊤Y^⊤XB yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv needs to be undone.

To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB. Then, the definitive eigenvectors w are recovered by setting w = Θv. The final step is the reconstruction of the optimal score matrix Θ, using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the matrix of eigenvectors V from decomposition (B.6) reverses the change of variable and restores the w vectors. The B matrix also needs to be "updated", by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

\[
B = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V.
\]
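A small numerical sketch of this computation is given below, under the assumption that a symmetric penalty matrix Ω is supplied and that all classes are non-empty; it is meant to illustrate the order of operations rather than reproduce the GLOSS code.

```python
import numpy as np

def optimal_scores(X, Y, Omega):
    """Solve the penalized OS eigen-problem through the small
    (K-1) x (K-1) matrix M_Theta instead of the K x K matrix M."""
    K = Y.shape[1]
    Theta0 = np.eye(K)[:, :K - 1]              # any initial feasible scores
    # normalize so that Theta0' Y'Y Theta0 = I (assumes non-empty classes)
    YtY = Y.T @ Y
    L = np.linalg.cholesky(Theta0.T @ YtY @ Theta0)
    Theta0 = Theta0 @ np.linalg.inv(L).T
    B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
    M_theta = Theta0.T @ Y.T @ X @ B0          # (K-1) x (K-1), cheap to diagonalize
    eigval, V = np.linalg.eigh(M_theta)
    V = V[:, np.argsort(eigval)[::-1]]         # leading eigenvectors first
    Theta = Theta0 @ V                         # update the score matrix
    B = B0 @ V                                 # update the coefficients
    return Theta, B
```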


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of the matrix M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m, \qquad \text{s.t. } \theta_k^\top \theta_k = 1. \tag{B.8}
\]

The score vector normalization constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis:

\[
\left(\sum_{m=1}^{K-1} \alpha_m w_m\right)^{\!\top} \left(\sum_{m=1}^{K-1} \alpha_m w_m\right) = 1,
\]

which, by the eigenvector properties, reduces to

\[
\sum_{m=1}^{K-1} \alpha_m^2 = 1. \tag{B.9}
\]

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

\[
M\theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m.
\]

As the w_m are eigenvectors of the matrix M, the relationship Mw_m = λ_m w_m can be used to obtain

\[
M\theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m.
\]

Left-multiplying both sides by θ_k^⊤, written as its linear combination of eigenvectors, gives

\[
\theta_k^\top M \theta_k = \left(\sum_{\ell=1}^{K-1} \alpha_\ell w_\ell\right)^{\!\top} \left(\sum_{m=1}^{K-1} \alpha_m \lambda_m w_m\right).
\]

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤w_m is zero for any ℓ ≠ m, giving

\[
\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m.
\]


The optimization Problem (B.5) for discriminant direction k can be rewritten as

\[
\max_{\theta_k \in \mathbb{R}^{K}} \;\theta_k^\top M \theta_k
= \max_{\theta_k \in \mathbb{R}^{K}} \;\sum_{m=1}^{K-1} \alpha_m^2 \lambda_m, \tag{B.10}
\]

with

\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \qquad \text{and} \qquad \sum_{m=1}^{K-1} \alpha_m^2 = 1.
\]

One way of maximizing Problem (B.10) is to choose α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K−1} α_m w_m, the resulting score vector θ_k is equal to the kth eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

\[
\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \quad & \beta^\top \Sigma_B \beta & (C.1a)\\
\text{s.t.} \quad & \beta^\top \Sigma_W \beta = 1, & (C.1b)
\end{aligned}
\]

where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

\[
L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu\left(\beta^\top \Sigma_W \beta - 1\right),
\]

so that its first derivative with respect to β is

\[
\frac{\partial L(\beta, \nu)}{\partial \beta} = 2\Sigma_B \beta - 2\nu \Sigma_W \beta.
\]

A necessary optimality condition for β^\star is that this derivative is zero, that is,

\[
\Sigma_B \beta^\star = \nu \Sigma_W \beta^\star.
\]

Provided Σ_W is full rank, we have

\[
\Sigma_W^{-1} \Sigma_B \beta^\star = \nu \beta^\star. \tag{C.2}
\]

Thus, the solutions β^\star match the definition of an eigenvector of the matrix Σ_W^{-1}Σ_B, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

\[
\begin{aligned}
\beta^{\star\top} \Sigma_B \beta^\star &= \beta^{\star\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^\star \\
&= \nu\, \beta^{\star\top} \Sigma_W \beta^\star & \text{from (C.2)} \\
&= \nu & \text{from (C.1b)}.
\end{aligned}
\]

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence, ν is the largest eigenvalue of Σ_W^{-1}Σ_B, and β^\star is any eigenvector corresponding to this maximal eigenvalue.
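In practice this solution can be computed directly as a generalized eigenvalue problem. The short sketch below, based on scipy.linalg.eigh, is an illustration under the assumption that Σ_W is positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(Sigma_B, Sigma_W):
    """Leading Fisher discriminant direction: the eigenvector of the
    generalized problem Sigma_B b = nu Sigma_W b with the largest nu."""
    eigvals, eigvecs = eigh(Sigma_B, Sigma_W)   # generalized symmetric problem
    beta = eigvecs[:, -1]                       # eigenvalues are returned in ascending order
    # eigh normalizes so that beta' Sigma_W beta = 1, matching (C.1b)
    return beta, eigvals[-1]
```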

                  W ΣB and β is any eigenvector correspondingto this maximal eigenvalue

                  111

                  D Alternative Variational Formulation forthe Group-Lasso

                  In this appendix an alternative to the variational form of the group-Lasso (421)presented in Section 431 is proposed

                  minτisinRp

                  minBisinRptimesKminus1

                  J(B) + λ

                  psumj=1

                  w2j

                  ∥∥βj∥∥2

                  2

                  τj(D1a)

                  s tsump

                  j=1 τj = 1 (D1b)

                  τj ge 0 j = 1 p (D1c)

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}: B = (β^{1⊤}, ..., β^{p⊤})^⊤.

\[
L(B, \tau, \lambda, \nu_0, \nu_j) = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \left\|\beta^j\right\|_2^2}{\tau_j} + \nu_0\left(\sum_{j=1}^{p} \tau_j - 1\right) - \sum_{j=1}^{p} \nu_j \tau_j. \tag{D.2}
\]

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j^\star:

\[
\left.\frac{\partial L(B, \tau, \lambda, \nu_0, \nu_j)}{\partial \tau_j}\right|_{\tau_j = \tau_j^\star} = 0
\;\Rightarrow\; -\lambda \frac{w_j^2 \left\|\beta^j\right\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2 \left\|\beta^j\right\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \left\|\beta^j\right\|_2^2 + \nu_0 \tau_j^{\star 2} = 0.
\]

The last two expressions are related through a property of the Lagrange multipliers, which states that ν_j g_j(τ^\star) = 0, where ν_j is the Lagrange multiplier and g_j(τ) is the inequality constraint. Then the optimal τ_j^\star can be deduced:

\[
\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \left\|\beta^j\right\|_2.
\]

Plugging this optimal value of τ_j^\star into constraint (D.1b) gives

\[
\sum_{j=1}^{p} \tau_j^\star = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \left\|\beta^j\right\|_2}{\sum_{j'=1}^{p} w_{j'} \left\|\beta^{j'}\right\|_2}. \tag{D.3}
\]


With this value of τ_j^\star, Problem (D.1) is equivalent to

\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \left(\sum_{j=1}^{p} w_j \left\|\beta^j\right\|_2\right)^{\!2}. \tag{D.4}
\]

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently presented as λB^⊤ΩB, where

\[
\Omega = \operatorname{diag}\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\right). \tag{D.5}
\]

Using the value of τ_j^\star from (D.3), each diagonal component of Ω is

\[
(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \left\|\beta^{j'}\right\|_2}{\left\|\beta^j\right\|_2}. \tag{D.6}
\]

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is

\[
\left\{ V \in \mathbb{R}^{p\times(K-1)} :\; V = \frac{\partial J(B)}{\partial B} + 2\lambda \left(\sum_{j=1}^{p} w_j \left\|\beta^j\right\|_2\right) G \right\}, \tag{D.7}
\]

where G = (g^{1⊤}, ..., g^{p⊤})^⊤ is a p × (K−1) matrix defined as follows. Let S(B) denote the row-wise support of B, S(B) = {j ∈ 1, ..., p : ‖β^j‖₂ ≠ 0}; then we have

\[
\forall j \in S(B), \quad g^j = w_j \left\|\beta^j\right\|_2^{-1} \beta^j, \tag{D.8}
\]

\[
\forall j \notin S(B), \quad \left\|g^j\right\|_2 \le w_j. \tag{D.9}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B of the objective function verifying the following conditions are global minima. Let S(B) denote the row-wise support of B, S(B) = {j ∈ 1, ..., p : ‖β^j‖₂ ≠ 0}, and let S̄(B) be its complement; then we have

\[
\forall j \in S(B), \quad -\frac{\partial J(B)}{\partial \beta^j} = 2\lambda \left(\sum_{j'=1}^{p} w_{j'} \left\|\beta^{j'}\right\|_2\right) w_j \left\|\beta^j\right\|_2^{-1} \beta^j, \tag{D.10a}
\]

\[
\forall j \in \bar{S}(B), \quad \left\|\frac{\partial J(B)}{\partial \beta^j}\right\|_2 \le 2\lambda\, w_j \left(\sum_{j'=1}^{p} w_{j'} \left\|\beta^{j'}\right\|_2\right). \tag{D.10b}
\]

In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $\mathbf{B}$, the gap in these objectives is null at $\boldsymbol{\tau}$ such that
\[
\tau_j = \frac{w_j \big\|\boldsymbol{\beta}^j\big\|_2}{\sum_{j'=1}^p w_{j'} \big\|\boldsymbol{\beta}^{j'}\big\|_2} \enspace .
\]

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\boldsymbol{\tau} \in \mathbb{R}^p$ be any feasible vector; we have
\[
\begin{aligned}
\Bigg( \sum_{j=1}^p w_j \big\|\boldsymbol{\beta}^j\big\|_2 \Bigg)^{\!2}
&= \Bigg( \sum_{j=1}^p \tau_j^{1/2}\, \frac{w_j \big\|\boldsymbol{\beta}^j\big\|_2}{\tau_j^{1/2}} \Bigg)^{\!2} \\
&\leq \Bigg( \sum_{j=1}^p \tau_j \Bigg) \Bigg( \sum_{j=1}^p \frac{w_j^2 \big\|\boldsymbol{\beta}^j\big\|_2^2}{\tau_j} \Bigg) \\
&\leq \sum_{j=1}^p \frac{w_j^2 \big\|\boldsymbol{\beta}^j\big\|_2^2}{\tau_j} \enspace ,
\end{aligned}
\]
where we used the Cauchy-Schwarz inequality in the second line and the definition of the feasibility set of $\boldsymbol{\tau}$ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\boldsymbol{\tau}$ and $\boldsymbol{\beta}$ are intertwined.
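As a quick numerical illustration, not part of the original derivation, the bound of Lemma D.4 can be checked directly with NumPy on random values standing in for the coefficients, the weights and $\lambda$: any feasible $\boldsymbol{\tau}$ gives a variational penalty at least as large as the squared group-Lasso penalty, and the $\boldsymbol{\tau}$ of the lemma closes the gap.

    import numpy as np

    rng = np.random.default_rng(3)
    p, K, lam = 10, 5, 0.7
    B = rng.standard_normal((p, K - 1))
    w = rng.uniform(0.5, 2.0, size=p)
    row_norms = np.linalg.norm(B, axis=1)

    glasso_pen = lam * np.sum(w * row_norms) ** 2       # penalty of (D.4)

    def variational_pen(tau):
        # second term of (D.1a): lambda * sum_j w_j^2 ||beta^j||_2^2 / tau_j
        return lam * np.sum(w ** 2 * row_norms ** 2 / tau)

    tau_rand = rng.dirichlet(np.ones(p))                # arbitrary feasible tau
    assert variational_pen(tau_rand) >= glasso_pen - 1e-9

    tau_star = w * row_norms / np.sum(w * row_norms)    # tau of Lemma D.4
    assert np.isclose(variational_pen(tau_star), glasso_pen)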


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $\mathbf{B}^0$ are optimal for the score values $\boldsymbol{\Theta}^0$, and if the optimal scores $\boldsymbol{\Theta}$ are obtained by a unitary transformation of $\boldsymbol{\Theta}^0$, say $\boldsymbol{\Theta} = \boldsymbol{\Theta}^0 \mathbf{V}$ (where $\mathbf{V} \in \mathbb{R}^{M\times M}$ is a unitary matrix), then $\mathbf{B} = \mathbf{B}^0\mathbf{V}$ is optimal conditionally on $\boldsymbol{\Theta}$; that is, $(\boldsymbol{\Theta}, \mathbf{B})$ is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $\hat{\mathbf{B}}$ be a solution of
\[
\min_{\mathbf{B} \in \mathbb{R}^{p\times M}} \; \big\|\mathbf{Y} - \mathbf{X}\mathbf{B}\big\|_F^2 + \lambda \sum_{j=1}^p w_j \big\|\boldsymbol{\beta}^j\big\|_2 \enspace , \tag{E.1}
\]
and let $\tilde{\mathbf{Y}} = \mathbf{Y}\mathbf{V}$, where $\mathbf{V} \in \mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde{\mathbf{B}} = \hat{\mathbf{B}}\mathbf{V}$ is a solution of
\[
\min_{\mathbf{B} \in \mathbb{R}^{p\times M}} \; \big\|\tilde{\mathbf{Y}} - \mathbf{X}\mathbf{B}\big\|_F^2 + \lambda \sum_{j=1}^p w_j \big\|\boldsymbol{\beta}^j\big\|_2 \enspace . \tag{E.2}
\]

Proof. The first-order necessary optimality conditions for $\hat{\mathbf{B}}$ are
\[
\forall j \in \mathcal{S}(\hat{\mathbf{B}}) \, , \quad 2\,\mathbf{x}_j^\top \big( \mathbf{X}\hat{\mathbf{B}} - \mathbf{Y} \big) + \lambda w_j \big\|\hat{\boldsymbol{\beta}}^j\big\|_2^{-1} \hat{\boldsymbol{\beta}}^j = \mathbf{0} \enspace , \tag{E.3a}
\]
\[
\forall j \notin \mathcal{S}(\hat{\mathbf{B}}) \, , \quad 2\, \Big\| \mathbf{x}_j^\top \big( \mathbf{X}\hat{\mathbf{B}} - \mathbf{Y} \big) \Big\|_2 \leq \lambda w_j \enspace , \tag{E.3b}
\]
where $\mathcal{S}(\hat{\mathbf{B}}) \subseteq \{1,\ldots,p\}$ denotes the set of non-zero row vectors of $\hat{\mathbf{B}}$ and $\bar{\mathcal{S}}(\hat{\mathbf{B}})$ is its complement.

First, we note that, from the definition of $\tilde{\mathbf{B}}$, we have $\mathcal{S}(\tilde{\mathbf{B}}) = \mathcal{S}(\hat{\mathbf{B}})$. Then, we may rewrite the above conditions as follows:
\[
\forall j \in \mathcal{S}(\tilde{\mathbf{B}}) \, , \quad 2\,\mathbf{x}_j^\top \big( \mathbf{X}\tilde{\mathbf{B}} - \tilde{\mathbf{Y}} \big) + \lambda w_j \big\|\tilde{\boldsymbol{\beta}}^j\big\|_2^{-1} \tilde{\boldsymbol{\beta}}^j = \mathbf{0} \enspace , \tag{E.4a}
\]
\[
\forall j \notin \mathcal{S}(\tilde{\mathbf{B}}) \, , \quad 2\, \Big\| \mathbf{x}_j^\top \big( \mathbf{X}\tilde{\mathbf{B}} - \tilde{\mathbf{Y}} \big) \Big\|_2 \leq \lambda w_j \enspace , \tag{E.4b}
\]
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $\mathbf{V}$, and also uses that $\mathbf{V}\mathbf{V}^\top = \mathbf{I}$, so that, $\forall \mathbf{u} \in \mathbb{R}^M$, $\|\mathbf{u}^\top\|_2 = \|\mathbf{u}^\top\mathbf{V}\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{\mathbf{B}}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
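The mechanism behind the proposition can also be observed numerically without solving (E.1): a unitary $\mathbf{V}$ leaves both the data-fit term and the row-norm penalty unchanged when $\mathbf{Y}$ and $\mathbf{B}$ are rotated jointly. The sketch below, not from the thesis, uses NumPy with randomly generated stand-ins for $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{B}$ and the weights.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, M, lam = 20, 6, 3, 0.5
    X = rng.standard_normal((n, p))
    Y = rng.standard_normal((n, M))
    B = rng.standard_normal((p, M))
    w = np.ones(p)

    # random unitary (orthogonal) matrix via a QR decomposition
    V, _ = np.linalg.qr(rng.standard_normal((M, M)))

    def objective(B, Y):
        fit = np.linalg.norm(Y - X @ B, "fro") ** 2
        pen = lam * np.sum(w * np.linalg.norm(B, axis=1))
        return fit + pen

    # joint rotation of Y and B leaves the objective unchanged, so a
    # minimizer for Y rotated by V is a minimizer for Y V
    assert np.isclose(objective(B, Y), objective(B @ V, Y @ V))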


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:

\[
L(\theta) = \sum_{i=1}^n \log \Bigg( \sum_{k=1}^K \pi_k f_k(\mathbf{x}_i;\theta_k) \Bigg) \enspace , \tag{F.1}
\]
\[
Q(\theta,\theta') = \sum_{i=1}^n \sum_{k=1}^K t_{ik}(\theta') \log \big( \pi_k f_k(\mathbf{x}_i;\theta_k) \big) \enspace , \tag{F.2}
\]
with
\[
t_{ik}(\theta') = \frac{\pi'_k f_k(\mathbf{x}_i;\theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(\mathbf{x}_i;\theta'_\ell)} \enspace . \tag{F.3}
\]

In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probabilities computed from $\theta'$ at the previous E-step, and $\theta$, without "prime", denotes the parameters of the current iteration, to be obtained by maximizing $Q(\theta,\theta')$.

Using (F.3), we have
\[
\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta') \log \big( \pi_k f_k(\mathbf{x}_i;\theta_k) \big) \\
&= \sum_{i,k} t_{ik}(\theta') \log \big( t_{ik}(\theta) \big) + \sum_{i,k} t_{ik}(\theta') \log \Bigg( \sum_{\ell} \pi_\ell f_\ell(\mathbf{x}_i;\theta_\ell) \Bigg) \\
&= \sum_{i,k} t_{ik}(\theta') \log \big( t_{ik}(\theta) \big) + L(\theta) \enspace .
\end{aligned}
\]

In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:
\[
\begin{aligned}
L(\theta) &= Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta) \log \big( t_{ik}(\theta) \big) \\
&= Q(\theta,\theta) + H(\mathbf{T}) \enspace .
\end{aligned}
\]
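As an illustration, not from the thesis, this identity can be checked numerically on a toy Gaussian mixture with a common covariance matrix; the sketch below uses NumPy and SciPy with arbitrary values for the proportions, the means and the shared covariance.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(2)
    n, p, K = 50, 2, 3
    X = rng.standard_normal((n, p))

    pi = np.array([0.2, 0.5, 0.3])        # mixture proportions
    mu = rng.standard_normal((K, p))      # component means
    Sigma = np.eye(p)                     # common covariance matrix

    # joint densities pi_k * f_k(x_i; theta_k), one column per component
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma)
                            for k in range(K)])
    T = dens / dens.sum(axis=1, keepdims=True)    # posteriors t_ik (E-step)

    loglik = np.sum(np.log(dens.sum(axis=1)))     # L(theta), definition (F.1)
    Q = np.sum(T * np.log(dens))                  # Q(theta, theta), as in (F.2)
    H = -np.sum(T * np.log(T))                    # entropy of the posteriors
    assert np.isclose(loglik, Q + H)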


                  G Derivation of the M-Step Equations

This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

\[
\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta') \log \big( \pi_k f_k(\mathbf{x}_i;\theta_k) \big) \\
&= \sum_{k} \Big( \sum_{i} t_{ik} \Big) \log \pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{i,k} t_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k) \enspace ,
\end{aligned}
\]
which has to be maximized subject to
\[
\sum_k \pi_k = 1 \enspace .
\]

The Lagrangian of this problem is
\[
\mathcal{L}(\theta) = Q(\theta,\theta') + \lambda \Big( \sum_k \pi_k - 1 \Big) \enspace .
\]

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$.

G.1 Prior probabilities

\[
\frac{\partial \mathcal{L}(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k} \sum_i t_{ik} + \lambda = 0 \enspace ,
\]
where $\lambda$ is identified from the constraint, leading to
\[
\pi_k = \frac{1}{n} \sum_i t_{ik} \enspace .
\]
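Spelling out this identification, a step left implicit above: the stationarity condition gives $\sum_i t_{ik} = -\lambda \pi_k$, and summing over $k$, using $\sum_k \pi_k = 1$ and $\sum_k t_{ik} = 1$ for every $i$, yields
\[
\sum_{k=1}^K \sum_{i=1}^n t_{ik} = n = -\lambda \sum_{k=1}^K \pi_k = -\lambda
\;\Rightarrow\; \lambda = -n \enspace ,
\]
hence $\pi_k = -\frac{1}{\lambda} \sum_i t_{ik} = \frac{1}{n} \sum_i t_{ik}$.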


G.2 Means

\[
\frac{\partial \mathcal{L}(\theta)}{\partial \boldsymbol{\mu}_k} = 0 \;\Leftrightarrow\; -\frac{1}{2} \sum_i t_{ik}\, 2\boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_k - \mathbf{x}_i) = 0
\;\Rightarrow\; \boldsymbol{\mu}_k = \frac{\sum_i t_{ik}\, \mathbf{x}_i}{\sum_i t_{ik}} \enspace .
\]

G.3 Covariance Matrix

\[
\frac{\partial \mathcal{L}(\theta)}{\partial \boldsymbol{\Sigma}^{-1}} = 0 \;\Leftrightarrow\;
\underbrace{\frac{n}{2}\boldsymbol{\Sigma}}_{\text{as per property 4}}
\; - \;
\underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}_{\text{as per property 5}}
= 0
\;\Rightarrow\;
\boldsymbol{\Sigma} = \frac{1}{n} \sum_{i,k} t_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \enspace .
\]
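As a complement, not part of the original appendix, the three M-step updates derived above translate directly into a few lines of NumPy; the function below is a sketch that assumes a data matrix X of shape (n, p) and a responsibility matrix T of shape (n, K) produced by a preceding E-step.

    import numpy as np

    def m_step(X, T):
        """M-step of a Gaussian mixture with a common covariance matrix.

        X : (n, p) data matrix; T : (n, K) posterior probabilities t_ik.
        Returns the updated proportions, means and shared covariance.
        """
        n, p = X.shape
        nk = T.sum(axis=0)                      # sum_i t_ik, one value per component
        pi = nk / n                             # pi_k = (1/n) sum_i t_ik
        mu = (T.T @ X) / nk[:, None]            # mu_k = sum_i t_ik x_i / sum_i t_ik
        Sigma = np.zeros((p, p))
        for k in range(nk.size):
            R = X - mu[k]                       # data centered on component k
            Sigma += (T[:, k, None] * R).T @ R  # sum_i t_ik (x_i - mu_k)(x_i - mu_k)^T
        return pi, mu, Sigma / n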


