
HAL Id: tel-00868847
https://tel.archives-ouvertes.fr/tel-00868847

Submitted on 2 Oct 2013

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Learning algorithms for sparse classification

Luis Francisco Sanchez Merchante

To cite this version: Luis Francisco Sanchez Merchante. Learning algorithms for sparse classification. Computer science. Université de Technologie de Compiègne, 2013. English. NNT: 2013COMP2084. tel-00868847.

By Luis Francisco SANCHEZ MERCHANTE

Thesis presented to obtain the degree of Doctor of the UTC

Learning algorithms for sparse classification

Defended on 7 June 2013

Speciality: Technologies de l'Information et des Systèmes

D2084

Algorithmes d'estimation pour la classification parcimonieuse

Luis Francisco Sanchez Merchante
University of Compiègne

Compiègne, France

"You never know what you will find behind a door. Perhaps that is what life is about: turning doorknobs."

Albert Espinosa

"Be brave. Take risks. Nothing can substitute experience."

Paulo Coelho

Acknowledgements

If this thesis has fallen into your hands and you have the curiosity to read this paragraph, you must know that, even though it is a short section, there are quite a lot of people behind this volume. All of them supported me during the three years, three months and three weeks that it took me to finish this work. However, you will hardly find any names. I think it is a little sad writing people's names in a document that they will probably not see and that will be condemned to gather dust on a bookshelf. It is like losing a wallet with pictures of your beloved family and friends. It makes me feel something like melancholy.

Obviously, this does not mean that I have nothing to be grateful for. I always felt unconditional love and support from my family, and I never felt homesick since my Spanish friends did the best they could to visit me frequently. During my time in Compiègne I met wonderful people that are now friends for life. I am sure that all these people do not need to be listed in this section to know how much I love them. I thank them every time we see each other by giving them the best of myself.

I enjoyed my time in Compiègne. It was an exciting adventure and I do not regret a single thing. I am sure that I will miss these days, but this does not make me sad because, as the Beatles sang in "The End", or Jorge Drexler in "Todo se transforma", the amount that you miss people is equal to the love you gave them and received from them.

The only names I am including are my supervisors', Yves Grandvalet and Gérard Govaert. I do not think it is possible to have had better teaching and supervision, and I am sure that the reason I finished this work was not only their technical advice, but also their close support, humanity and patience.

Contents

List of Figures

List of Tables

Notation and Symbols

I Context and Foundations

1 Context

2 Regularization for Feature Selection
  2.1 Motivations
  2.2 Categorization of Feature Selection Techniques
  2.3 Regularization
    2.3.1 Important Properties
    2.3.2 Pure Penalties
    2.3.3 Hybrid Penalties
    2.3.4 Mixed Penalties
    2.3.5 Sparsity Considerations
    2.3.6 Optimization Tools for Regularized Problems

II Sparse Linear Discriminant Analysis

Abstract

3 Feature Selection in Fisher Discriminant Analysis
  3.1 Fisher Discriminant Analysis
  3.2 Feature Selection in LDA Problems
    3.2.1 Inertia Based
    3.2.2 Regression Based

4 Formalizing the Objective
  4.1 From Optimal Scoring to Linear Discriminant Analysis
    4.1.1 Penalized Optimal Scoring Problem
    4.1.2 Penalized Canonical Correlation Analysis
    4.1.3 Penalized Linear Discriminant Analysis
    4.1.4 Summary
  4.2 Practicalities
    4.2.1 Solution of the Penalized Optimal Scoring Regression
    4.2.2 Distance Evaluation
    4.2.3 Posterior Probability Evaluation
    4.2.4 Graphical Representation
  4.3 From Sparse Optimal Scoring to Sparse LDA
    4.3.1 A Quadratic Variational Form
    4.3.2 Group-Lasso OS as Penalized LDA

5 GLOSS Algorithm
  5.1 Regression Coefficients Updates
    5.1.1 Cholesky Decomposition
    5.1.2 Numerical Stability
  5.2 Score Matrix
  5.3 Optimality Conditions
  5.4 Active and Inactive Sets
  5.5 Penalty Parameter
  5.6 Options and Variants
    5.6.1 Scaling Variables
    5.6.2 Sparse Variant
    5.6.3 Diagonal Variant
    5.6.4 Elastic Net and Structured Variant

6 Experimental Results
  6.1 Normalization
  6.2 Decision Thresholds
  6.3 Simulated Data
  6.4 Gene Expression Data
  6.5 Correlated Data
  Discussion

III Sparse Clustering Analysis

Abstract

7 Feature Selection in Mixture Models
  7.1 Mixture Models
    7.1.1 Model
    7.1.2 Parameter Estimation: The EM Algorithm
  7.2 Feature Selection in Model-Based Clustering
    7.2.1 Based on Penalized Likelihood
    7.2.2 Based on Model Variants
    7.2.3 Based on Model Selection

8 Theoretical Foundations
  8.1 Resolving EM with Optimal Scoring
    8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
    8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
    8.1.3 Clustering Using Penalized Optimal Scoring
    8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
  8.2 Optimized Criterion
    8.2.1 A Bayesian Derivation
    8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
  9.1 Mix-GLOSS
    9.1.1 Outer Loop: Whole Algorithm Repetitions
    9.1.2 Penalty Parameter Loop
    9.1.3 Inner Loop: EM Algorithm
  9.2 Model Selection

10 Experimental Results
  10.1 Tested Clustering Algorithms
  10.2 Results
  10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
  B.1 How to Solve the Eigenvector Decomposition
  B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
  D.1 Useful Properties
  D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
  G.1 Prior Probabilities
  G.2 Means
  G.3 Covariance Matrix

Bibliography

List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 The four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ‖β‖_p
2.4 Two-dimensional regularized problems with ‖β‖_1 and ‖β‖_2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to the Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image
6.1 TPR versus FPR for all simulations
6.2 2D representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS loops scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N  the set of natural numbers, N = {1, 2, ...}
R  the set of reals
|A|  cardinality of a set A (for finite sets, the number of elements)
Ā  complement of set A

Data

X  input domain
x_i  input sample, x_i ∈ X
X  design matrix, X = (x_1^T, ..., x_n^T)^T
x^j  column j of X
y_i  class indicator of sample i
Y  indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z  complete data, z = (x, y)
G_k  set of the indices of observations belonging to class k
n  number of examples
K  number of classes
p  dimension of X
i, j, k  indices running over N

Vectors, Matrices and Norms

0  vector with all entries equal to zero
1  vector with all entries equal to one
I  identity matrix
A^T  transpose of matrix A (ditto for vectors)
A^{-1}  inverse of matrix A
tr(A)  trace of matrix A
|A|  determinant of matrix A
diag(v)  diagonal matrix with v on the diagonal
‖v‖_1  L1 norm of vector v
‖v‖_2  L2 norm of vector v
‖A‖_F  Frobenius norm of matrix A

Probability

E[·]  expectation of a random variable
var[·]  variance of a random variable
N(μ, σ²)  normal distribution with mean μ and variance σ²
W(W, ν)  Wishart distribution with ν degrees of freedom and scale matrix W
H(X)  entropy of random variable X
I(X; Y)  mutual information between random variables X and Y

Mixture Models

y_ik  hard membership of sample i to cluster k
f_k  distribution function for cluster k
t_ik  posterior probability of sample i belonging to cluster k
T  posterior probability matrix
π_k  prior probability or mixture proportion for cluster k
μ_k  mean vector of cluster k
Σ_k  covariance matrix of cluster k
θ_k  parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)  parameter vector at iteration t of the EM algorithm
f(X; θ)  likelihood function
L(θ; X)  log-likelihood function
L_C(θ; X, Y)  complete log-likelihood function

Optimization

J(·)  cost function
L(·)  Lagrangian
β̂  generic notation for the solution with respect to β
β^ls  least squares solution coefficient vector
A  active set
γ  step size to update the regularization path
h  direction to update the regularization path

Penalized Models

λ, λ_1, λ_2  penalty parameters
P_λ(θ)  penalty term over a generic parameter vector
β_kj  coefficient j of discriminant vector k
β_k  kth discriminant vector, β_k = (β_k1, ..., β_kp)
B  matrix of discriminant vectors, B = (β_1, ..., β_{K−1})
β^j  jth row of B = (β^{1T}, ..., β^{pT})^T
B_LDA  coefficient matrix in the LDA domain
B_CCA  coefficient matrix in the CCA domain
B_OS  coefficient matrix in the OS domain
X_LDA  data matrix in the LDA domain
X_CCA  data matrix in the CCA domain
X_OS  data matrix in the OS domain
θ_k  score vector k
Θ  score matrix, Θ = (θ_1, ..., θ_{K−1})
Y  label matrix
Ω  penalty matrix
L_CP(θ; X, Z)  penalized complete log-likelihood function
Σ_B  between-class covariance matrix
Σ_W  within-class covariance matrix
Σ_T  total covariance matrix
Σ̂_B  sample between-class covariance matrix
Σ̂_W  sample within-class covariance matrix
Σ̂_T  sample total covariance matrix
Λ  inverse of the covariance matrix, or precision matrix
w_j  weights
τ_j  penalty components of the variational approach

Part I

Context and Foundations


This thesis is divided into three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here, to introduce the models and some basic concepts that will be used along this document. The state of the art of the relevant techniques is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the research point of view, the members of the consortium must deal with four main goals:

1. Software development of the website framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo

The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code on some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if they were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the implemented tool are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the implemented tool can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using the RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to meet our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and eases interpretation in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction on the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

"Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, used to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I propose below a framework inspired by these references; it does not cover all the possibilities, but it gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

  – Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

  – Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

  – Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion, without being re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

  – Complete - No subsets are missed from evaluation; this involves combinatorial searches.

  – Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

  – Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

  – Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

  – Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

  – Dependency Measures - Measuring the correlation between features.

  – Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

  – Predictive Accuracy - Use the selected features to predict the labels.

  – Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. An example arises when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

$$\min_{\beta} \; J(\beta) + \lambda P(\beta) \qquad (2.1)$$

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \qquad (2.2)$$

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties, which will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖_p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)$$

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ‖β‖_1 and ‖β‖_2 penalties

Regularizing a linear model with a norm like ‖β‖_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^l1, and for the L2 regularization it is β^l2. Solution β^l1 is sparse because its second component is zero, while both components of β^l2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is what we seek. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖_0 = card{β_j | β_j ≠ 0}:

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)$$

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
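The combinatorial nature of the L0 constraint can be made concrete with a tiny brute-force sketch. The code below is my own illustration (not part of the thesis or MASH code): it enumerates every support of size at most t and keeps the one with the lowest residual sum of squares, which quickly becomes intractable as p grows.

```python
import itertools
import numpy as np

def best_subset(X, y, t):
    """Exhaustive search for the L0-constrained least squares problem (2.4).

    Tries every subset of at most t variables; cost grows combinatorially with p.
    """
    n, p = X.shape
    best_rss, best_beta = np.inf, np.zeros(p)
    for k in range(1, t + 1):
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            beta_s = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = np.sum((y - X[:, cols] @ beta_s) ** 2)
            if rss < best_rss:
                best_beta = np.zeros(p)
                best_beta[cols] = beta_s
                best_rss = rss
    return best_beta

# Toy usage: 20 variables, only 3 of them informative
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
beta_true = np.zeros(20)
beta_true[[2, 5, 11]] = [1.5, -2.0, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(50)
print(np.nonzero(best_subset(X, y, t=3))[0])   # should recover {2, 5, 11}
```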

L1 Penalties The penalties built using L1 norms induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \qquad (2.5)$$

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used, to avoid the square root and solve a linear system. Thus, an L2-penalized optimization problem looks like

$$\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)$$

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 \qquad (2.7)$$

with solution $\beta^{ls} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The solution to this problem is $\beta^{l_2} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_p)^{-1} \mathbf{X}^\top \mathbf{y}$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
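As a quick numerical illustration (my own sketch, not code from the thesis), the closed-form ridge solution can be compared with ordinary least squares on nearly collinear inputs: the λI_p term shifts the small eigenvalues of XᵀX and stabilizes the inversion.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator: beta = (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n = 40
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)   # almost a copy of x1: X'X is near singular
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = x1 + 2 * x3 + 0.1 * rng.standard_normal(n)

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]  # wildly unstable coefficients on x1/x2
beta_l2 = ridge(X, y, lam=1.0)                  # "equalized", stable coefficients
print(beta_ls, beta_l2, sep="\n")
```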

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \qquad (2.8)$$

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component; there, every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although L2-penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties A special case of Lp norms is the infinity norm, defined as ‖x‖_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ‖β‖_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term by itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm $\|\beta\|_*$ of a norm $\|\beta\|$ is defined as

$$\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \; \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1$$

In the case of an Lq norm, with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself: because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \qquad (2.9)$$

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
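In practice, criterion (2.9) can be handled with any Lasso solver through a data augmentation trick (cf. Zou and Hastie, 2005): stacking √λ2·I_p under X and p zeros under y turns the ridge term into extra least squares rows. The sketch below is my own illustration; `lasso_solver` is an assumed callable (for instance a coordinate descent routine as in Section 2.3.6), not a function defined in this thesis.

```python
import numpy as np

def elastic_net_augmented(X, y, lam1, lam2, lasso_solver):
    """Solve the (naive) elastic net (2.9) as a Lasso on augmented data.

    The ridge term lam2 * ||beta||_2^2 is absorbed by appending sqrt(lam2)*I
    to X and zeros to y, so that any L1 solver can be reused.
    `lasso_solver(X, y, lam)` is assumed to minimize
    sum_i (y_i - x_i'beta)^2 + lam * sum_j |beta_j|.
    """
    n, p = X.shape
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    return lasso_solver(X_aug, y_aug, lam1)
```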


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by $G_\ell$ the group of genes for the $\ell$th process and by $d_\ell$ the number of genes (variables) in each group, for all $\ell \in \{1, \ldots, L\}$. Thus, the dimension of vector β is the sum of the sizes of all groups, $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \right)^{\frac{1}{r}} \qquad (2.10)$$

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
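The following sketch (my own illustration, with arbitrary group definitions) computes the mixed norm (2.10) for the group-Lasso case (r, s) = (1, 2), optionally weighting each group by the square root of its size as mentioned above.

```python
import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0, size_weighted=True):
    """Mixed (r, s) norm of eq. (2.10): an L_s norm within each group,
    an L_r norm over the vector of (possibly weighted) group norms."""
    group_norms = []
    for g in groups:                       # g is a list of indices (a group G_l)
        w = np.sqrt(len(g)) if size_weighted else 1.0
        group_norms.append(w * np.sum(np.abs(beta[g]) ** s) ** (1.0 / s))
    return np.sum(np.asarray(group_norms) ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.5, -2.0, 0.0, 0.3])
groups = [[0, 1], [2, 3], [4, 5]]          # three groups of two coefficients
print(mixed_norm(beta, groups, r=1, s=2))  # group-Lasso penalty ||beta||_(1,2)
```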

Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al., 2008) or ‖β‖_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms with the proper definition of groups can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for (a) the L1 Lasso and (b) the L(1,2) group-Lasso

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity, (b) L(1,2) group-induced sparsity

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be described as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent Subgradient descent is a generic optimization method that can be used in the setting of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

$$\beta^{(t+1)} = \beta^{(t)} - \alpha \,(\mathbf{s} + \lambda \mathbf{s}'), \quad \text{where } \mathbf{s} \in \partial J(\beta^{(t)}), \; \mathbf{s}' \in \partial P(\beta^{(t)})$$
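As an illustration (my own sketch, with an arbitrary fixed step size), the update above applied to the penalized least squares criterion J(β) = Σ_i (y_i − x_iᵀβ)² with P(β) = ‖β‖_1 reads as follows; note that the iterates are typically not exactly sparse, as pointed out above.

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, alpha=1e-3, n_iter=5000):
    """Subgradient descent on sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        s = -2.0 * X.T @ (y - X @ beta)    # gradient of the quadratic loss
        s_prime = np.sign(beta)            # a subgradient of the L1 norm
        beta = beta - alpha * (s + lam * s_prime)
    return beta
```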

Coordinate Descent Coordinate descent is based on the first-order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first-order derivative with respect to coefficient β_j gives

$$\hat{\beta}_j = \frac{-\lambda \, \mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2}$$

In the literature those algorithms may also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating the values with an iterative thresholding rule where $\beta_j^{(t+1)} = S_\lambda\big(\tfrac{\partial J(\beta^{(t)})}{\partial \beta_j}\big)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex]
\dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex]
0 & \text{if } \Bigl|\frac{\partial J(\beta)}{\partial \beta_j}\Bigr| \le \lambda
\end{cases}
\qquad (2.11)$$

The same principles define "block-coordinate descent" algorithms. In this case, the first-order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).

Active and Inactive Sets Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero β_j; it is usually denoted A. The complement of the active set is the "inactive set", noted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set. Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, the solutions of successive active sets are typically close to each other, so it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).
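The three tasks can be organized in a forward working-set skeleton such as the following. This is a simplified sketch of mine for the Lasso case (not the GLOSS implementation): the subproblem restricted to the active set is solved by coordinate descent, warm started with the current solution, and the KKT check uses the gradient of the quadratic loss.

```python
import numpy as np

def _soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_working_set(X, y, lam, tol=1e-6, inner_iter=200):
    """Forward working-set scheme for the Lasso: grow the active set A with the
    variable that most violates the optimality conditions, then re-optimize the
    subproblem restricted to A (here by coordinate descent, warm started)."""
    n, p = X.shape
    beta = np.zeros(p)
    active = []                                   # the active set A
    while True:
        grad = -2.0 * X.T @ (y - X @ beta)        # gradient of the quadratic loss
        violation = np.abs(grad)
        violation[active] = 0.0                   # only inspect the inactive set
        j_star = int(np.argmax(violation))
        if violation[j_star] <= lam + tol:
            return beta                           # optimality conditions hold
        active.append(j_star)                     # working set update task
        # Optimization task restricted to A, warm started with the current beta
        col_sq = np.sum(X[:, active] ** 2, axis=0)
        for _ in range(inner_iter):
            for idx, j in enumerate(active):
                r_j = y - X @ beta + X[:, j] * beta[j]
                beta[j] = _soft(2.0 * X[:, j] @ r_j, lam) / (2.0 * col_sq[idx])
```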

These active constraints or working set methods, even if they were originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties, for example linear functions and L1 penalties (Roth, 2004), linear functions and L(1,2) penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and which variable should enter the active set, from the correlation with the residuals.
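For a quick look at such a path in practice, scikit-learn ships a LARS implementation; the sketch below is my own usage example (it assumes scikit-learn is installed, which is not a dependency of this thesis) and computes the breakpoints of the piecewise-linear Lasso path.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 10))
beta_true = np.zeros(10)
beta_true[[1, 4]] = [2.0, -1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(60)

# alphas: penalty values at the breakpoints of the piecewise-linear path
# coefs[:, k]: the solution at breakpoint k (columns of the coefficient path)
alphas, active, coefs = lars_path(X, y, method="lasso")
print(alphas)
print(active)   # order in which the variables enter the active set
```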

Proximal Methods Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)$$

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)$$

The basic algorithm uses the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
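For P(β) = ‖β‖_1, the minimizer of (2.13) is the soft-thresholding of the gradient step, which gives the classical ISTA iteration. Below is a minimal sketch of mine for the penalized least squares case, where L is taken as twice the squared spectral norm of X, a valid Lipschitz bound for this loss.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) on sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1.

    Each step solves (2.13): a gradient step on the smooth part followed by the
    proximal operator of (lam/L) * ||.||_1, i.e. soft-thresholding.
    """
    n, p = X.shape
    L = 2.0 * np.linalg.norm(X, ord=2) ** 2     # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        z = beta - grad / L                      # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox step
    return beta
```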


Part II

Sparse Linear Discriminant Analysis


Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables deriving variants, such as an LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, generating a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher DiscriminantAnalysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived. Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider data consisting of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0,1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1, ..., x_n)^⊤ and the corresponding labels in the n×K matrix Y = (y_1, ..., y_n)^⊤.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

max_{β∈R^p}  (β^⊤ Σ_B β) / (β^⊤ Σ_W β) ,        (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

Σ_W = (1/n) Σ_{k=1}^{K} Σ_{i∈G_k} (x_i − μ_k)(x_i − μ_k)^⊤
Σ_B = (1/n) Σ_{k=1}^{K} Σ_{i∈G_k} (μ − μ_k)(μ − μ_k)^⊤ ,

where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

max_{B∈R^{p×(K−1)}}  tr(B^⊤ Σ_B B) / tr(B^⊤ Σ_W B) ,        (3.2)

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

max_{β_k∈R^p}  β_k^⊤ Σ_B β_k
s.t.  β_k^⊤ Σ_W β_k ≤ 1 ,
      β_k^⊤ Σ_W β_ℓ = 0 , ∀ℓ < k .        (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated with the kth largest eigenvalue (see Appendix C).
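For small problems with an invertible Σ_W, the solution of the series (3.3) can be computed directly as a generalized eigenvalue problem. The following NumPy/SciPy sketch is only illustrative (our own function names); it assumes a centered X and a positive-definite Σ_W.

    import numpy as np
    from scipy.linalg import eigh

    def fisher_directions(X, y, K):
        # X: (n, p) centered data; y: (n,) labels in {0, ..., K-1}
        n, p = X.shape
        mu = X.mean(axis=0)
        Sigma_W = np.zeros((p, p))
        Sigma_B = np.zeros((p, p))
        for k in range(K):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
            Sigma_B += Xk.shape[0] * np.outer(mu - mu_k, mu - mu_k) / n
        # generalized symmetric eigenproblem Sigma_B v = e Sigma_W v (eigenvalues in ascending order)
        evals, evecs = eigh(Sigma_B, Sigma_W)
        return evecs[:, ::-1][:, :K - 1]   # the K-1 leading discriminant directions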

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually assessed by univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far apart (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

min_{β∈R^p}  β^⊤ Σ_W β
s.t.  (μ_1 − μ_2)^⊤ β = 1 ,
      Σ_{j=1}^{p} |β_j| ≤ t ,

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

max_{β_k∈R^p}  β_k^⊤ Σ_B^k β_k − P_k(β_k)
s.t.  β_k^⊤ Σ_W β_k ≤ 1 .

The term to maximize is the projected between-class covariance β_k^⊤ Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k^⊤ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks the less informative variables to zero, while the fused Lasso encourages a piecewise-constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem, but instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

min_{β∈R^p}  ‖β‖_1
s.t.  ‖Σ β − (μ_1 − μ_2)‖_∞ ≤ λ .

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed above are conceived for binary classification, and for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity. However, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al. 2000, Friedman et al. 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n×K matrix with the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al. 2009). Another "popular" choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to optimal scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similarity is confirmed, their formalization would be very close to that of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

min_{β∈R^p, β_0∈R}  n^{-1} Σ_{i=1}^{n} (y_i − β_0 − x_i^⊤ β)² + λ Σ_{j=1}^{p} |β_j| ,

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting from λ = 0, but a different intercept β_0 is required.

Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the class indicators together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(B^⊤ Ω B)        (3.4a)
s.t.  n^{-1} Θ^⊤ Y^⊤ Y Θ = I_{K−1} ,        (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

min_{θ_k∈R^K, β_k∈R^p}  ‖Yθ_k − Xβ_k‖² + β_k^⊤ Ω β_k        (3.5a)
s.t.  n^{-1} θ_k^⊤ Y^⊤ Y θ_k = 1 ,        (3.5b)
      θ_k^⊤ Y^⊤ Y θ_ℓ = 0 , ℓ = 1, ..., k−1 ,        (3.5c)

where each β_k corresponds to a discriminant direction.


Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

min_{β_k∈R^p, θ_k∈R^K}  Σ_k ‖Yθ_k − Xβ_k‖²_2 + λ_1 ‖β_k‖_1 + λ_2 β_k^⊤ Ω β_k ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

min_{β_k∈R^p, θ_k∈R^K}  Σ_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖²_2 + λ Σ_{j=1}^{p} ( Σ_{k=1}^{K−1} β²_{kj} )^{1/2} ,        (3.6)

which is the criterion that was chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm closely followed the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.
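The grouping structure of criterion (3.6) is what produces variable-wise sparsity: each group gathers the coefficients of one variable across all K − 1 discriminant directions, that is, one row of B. The snippet below, an illustrative sketch with our own names, simply evaluates this penalty (the weights default to one, as in (3.6); GLOSS uses predefined nonnegative weights w_j).

    import numpy as np

    def group_lasso_penalty(B, lam, w=None):
        # one group per variable: the j-th row of B holds its coefficients in the K-1 directions
        row_norms = np.sqrt((B ** 2).sum(axis=1))
        w = np.ones(B.shape[0]) if w is None else w
        return lam * np.sum(w * row_norms)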


4 Formalizing the Objective

In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995) and already used for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix Y^⊤Y is full rank;

• inputs are centered, that is, X^⊤ 1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore, so as to simplify all expressions. The generic problem solved is thus

min_{θ∈R^K, β∈R^p}  ‖Yθ − Xβ‖² + β^⊤ Ω β        (4.1a)
s.t.  n^{-1} θ^⊤ Y^⊤ Y θ = 1 .        (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

β_OS = (X^⊤X + Ω)^{-1} X^⊤ Y θ .        (4.2)

The objective function (4.1a) is then

‖Yθ − Xβ_OS‖² + β_OS^⊤ Ω β_OS = θ^⊤Y^⊤Yθ − 2 θ^⊤Y^⊤X β_OS + β_OS^⊤ (X^⊤X + Ω) β_OS
                              = θ^⊤Y^⊤Yθ − θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ ,

where the second line stems from the definition (4.2) of β_OS. Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

max_{θ : n^{-1} θ^⊤Y^⊤Yθ = 1}  θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ ,        (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

(Y^⊤Y)^{-1} Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = α² θ ,        (4.4)


where α² is the maximal eigenvalue:¹

n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = α² n^{-1} θ^⊤ (Y^⊤Y) θ
n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = α² .        (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

max_{θ∈R^K, β∈R^p}  n^{-1} θ^⊤Y^⊤X β        (4.6a)
s.t.  n^{-1} θ^⊤Y^⊤Y θ = 1 ,        (4.6b)
      n^{-1} β^⊤ (X^⊤X + Ω) β = 1 .        (4.6c)

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

n L(β, θ, ν, γ) = θ^⊤Y^⊤X β − ν (θ^⊤Y^⊤Y θ − n) − γ (β^⊤ (X^⊤X + Ω) β − n)
⇒  n ∂L(β, θ, γ, ν)/∂β = X^⊤Y θ − 2γ (X^⊤X + Ω) β
⇒  β_CCA = (1/2γ) (X^⊤X + Ω)^{-1} X^⊤Y θ .

Then, as β_CCA obeys (4.6c), we obtain

β_CCA = (X^⊤X + Ω)^{-1} X^⊤Y θ / ( n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ )^{1/2} ,        (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

n^{-1} θ^⊤Y^⊤X β_CCA = n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ / ( n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ )^{1/2}
                     = ( n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ )^{1/2} ,

and the optimization problem with respect to θ can be restated as

max_{θ : n^{-1} θ^⊤Y^⊤Yθ = 1}  θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ .        (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

β_OS = α β_CCA ,        (4.9)

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

n ∂L(β, θ, γ, ν)/∂θ = Y^⊤X β − 2ν Y^⊤Y θ
⇒  θ_CCA = (1/2ν) (Y^⊤Y)^{-1} Y^⊤X β .        (4.10)

Then, as θ_CCA obeys (4.6b), we obtain

θ_CCA = (Y^⊤Y)^{-1} Y^⊤X β / ( n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β )^{1/2} ,        (4.11)

leading to the following expression of the optimal objective function

n^{-1} θ_CCA^⊤ Y^⊤X β = n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β / ( n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β )^{1/2}
                      = ( n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β )^{1/2} .

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

max_{β∈R^p}  n^{-1} β^⊤X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β        (4.12a)
s.t.  n^{-1} β^⊤ (X^⊤X + Ω) β = 1 ,        (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution; following the reasoning of Appendix C, β_CCA verifies

n^{-1} X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β_CCA = λ (X^⊤X + Ω) β_CCA ,        (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

n^{-1} β_CCA^⊤ X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β_CCA = λ
⇒  n^{-1} α^{-1} β_CCA^⊤ X^⊤Y (Y^⊤Y)^{-1} Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = λ
⇒  n^{-1} α β_CCA^⊤ X^⊤Y θ = λ
⇒  n^{-1} θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y θ = λ
⇒  α² = λ .

The first line is obtained from constraint (4.12b); the second line from relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again relationship (4.7); and the last one the definition (4.5) of α.


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), penalized Linear Discriminant Analysis is defined as follows:

max_{β∈R^p}  β^⊤ Σ_B β        (4.14a)
s.t.  β^⊤ (Σ_W + n^{-1} Ω) β = 1 ,        (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form, amenable to a simple matrix representation using the projection operator Y (Y^⊤Y)^{-1} Y^⊤:

Σ_T = (1/n) Σ_{i=1}^{n} x_i x_i^⊤ = n^{-1} X^⊤X
Σ_B = (1/n) Σ_{k=1}^{K} n_k μ_k μ_k^⊤ = n^{-1} X^⊤Y (Y^⊤Y)^{-1} Y^⊤X
Σ_W = (1/n) Σ_{k=1}^{K} Σ_{i : y_ik = 1} (x_i − μ_k)(x_i − μ_k)^⊤ = n^{-1} ( X^⊤X − X^⊤Y (Y^⊤Y)^{-1} Y^⊤X ) .

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β_LDA = λ ( X^⊤X + Ω − X^⊤Y (Y^⊤Y)^{-1} Y^⊤X ) β_LDA
X^⊤Y (Y^⊤Y)^{-1} Y^⊤X β_LDA = (λ / (1 − λ)) ( X^⊤X + Ω ) β_LDA .

The comparison of the last equation with (4.13) shows that β_LDA and β_CCA are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

β_LDA = (1 − α²)^{-1/2} β_CCA
      = α^{-1} (1 − α²)^{-1/2} β_OS ,

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(B^⊤ Ω B)
s.t.  n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1} .

Let A represent the (K−1)×(K−1) diagonal matrix with elements α_k, the square root of the kth largest eigenvalue of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y; we have

B_LDA = B_CCA (I_{K−1} − A²)^{-1/2}
      = B_OS A^{-1} (I_{K−1} − A²)^{-1/2} ,        (4.15)

where I_{K−1} is the (K−1)×(K−1) identity matrix.

At this point, the feature matrix X, which has dimensions n×p in the input space, can be projected into the optimal scoring domain as the n×(K−1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as the n×(K−1) matrix X_LDA = X B_LDA. Classification can be performed in any of these domains, provided the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as B_OS = (X^⊤X + λΩ)^{-1} X^⊤Y Θ, where Θ are the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1} (I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Produce a graphical representation.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr(B^⊤ Ω B)        (4.16a)
s.t.  n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1} ,        (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰^⊤ Y^⊤Y Θ⁰ = I_{K−1}.

2. Compute B = (X^⊤X + λΩ)^{-1} X^⊤Y Θ⁰.

3. Set Θ to be the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y.

4. Compute the optimal regression coefficients

B_OS = (X^⊤X + λΩ)^{-1} X^⊤Y Θ .        (4.17)

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰^⊤ Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y Θ⁰, which is computed as Θ⁰^⊤ Y^⊤X B, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
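The sketch below follows the four steps above for a quadratic penalty Ω, including the Θ⁰ trick. It is an illustrative NumPy sketch under the assumptions stated in Section 4.1 (centered X, 0/1 indicator matrix Y, invertible X^⊤X + λΩ); the function names are ours, and it is not the authors' matlab implementation.

    import numpy as np

    def initial_scores(Y):
        # Theta0 with n^{-1} Theta0' Y'Y Theta0 = I_{K-1}, orthogonal to the constant score
        n, K = Y.shape
        Q, _ = np.linalg.qr(np.hstack([np.ones((K, 1)), np.random.randn(K, K - 1)]))
        U = Q[:, 1:]                                  # orthonormal columns orthogonal to 1_K
        counts = Y.sum(axis=0)                        # class sizes (diagonal of Y'Y)
        return np.sqrt(n) * (U / np.sqrt(counts)[:, None])

    def penalized_os(X, Y, Omega, lam):
        n, K = Y.shape
        Theta0 = initial_scores(Y)
        M = X.T @ X + lam * Omega
        B0 = np.linalg.solve(M, X.T @ Y @ Theta0)     # Step 2
        S = Theta0.T @ (Y.T @ X) @ B0                 # Theta0' Y'X (X'X + lam Omega)^{-1} X'Y Theta0
        evals, V = np.linalg.eigh((S + S.T) / 2)      # Step 3 (symmetrized for numerical safety)
        order = np.argsort(evals)[::-1]
        Theta = Theta0 @ V[:, order]                  # optimal scores
        B_os = B0 @ V[:, order]                       # Step 4: optimal regression coefficients
        alpha = np.sqrt(np.maximum(evals[order], 0.0) / n)
        return B_os, Theta, alpha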

This four-step algorithm is valid when the penalty is of the form tr(B^⊤ Ω B). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown and the rule is applied with parameters estimated from training data (the sample estimators of μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_WΩ^{-1} (x_i − μ_k) − 2 log(n_k / n)        (4.18)

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

Σ_WΩ^{-1} = ( n^{-1} (X^⊤X + λΩ) − Σ_B )^{-1}
          = ( n^{-1} X^⊤X − Σ_B + n^{-1} λΩ )^{-1}
          = ( Σ_W + n^{-1} λΩ )^{-1} .        (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution B_OS of the p-OS problem is enough to accomplish classification;

• in the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension R < K − 1, by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

‖(x_i − μ_k) B_OS‖²_{Σ_WΩ} − 2 log(π_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is

‖(x_i − μ_k) B_OS A^{-1} (I_{K−1} − A²)^{-1/2}‖²_2 − 2 log(π_k) ,

which is a plain Euclidean distance.
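The following sketch applies the nearest-centroid rule in the LDA domain, where the metric is plainly Euclidean. It is an illustrative NumPy sketch with our own function names, assuming the discriminant variates X_LDA have already been computed as in Section 4.1.4.

    import numpy as np

    def nearest_centroid_lda(X_lda, Y, Xnew_lda):
        # X_lda: (n, K-1) training variates, Y: (n, K) 0/1 indicators, Xnew_lda: (m, K-1) to classify
        n, K = Y.shape
        counts = Y.sum(axis=0)
        priors = counts / n
        M = (Y.T @ X_lda) / counts[:, None]                      # class centroids in the LDA domain
        d2 = ((Xnew_lda[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        scores = d2 - 2.0 * np.log(priors)[None, :]              # Euclidean distance adjusted by class priors
        return np.argmin(scores, axis=1), scores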


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1 | x) can be estimated as

p(y_k = 1 | x) ∝ exp( −d(x, μ_k)/2 )
              ∝ π_k exp( −(1/2) ‖(x − μ_k) B_OS A^{-1} (I_{K−1} − A²)^{-1/2}‖²_2 ) .        (4.20)

These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

p(y_k = 1 | x) = π_k exp(−d(x, μ_k)/2) / Σ_ℓ π_ℓ exp(−d(x, μ_ℓ)/2)
              = π_k exp(−(d(x, μ_k) − d_max)/2) / Σ_ℓ π_ℓ exp(−(d(x, μ_ℓ) − d_max)/2) ,

where d_max = max_k d(x, μ_k).
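This stabilization is a standard shifted-exponential (log-sum-exp style) normalization. The short sketch below, illustrative only and with our own names, turns a matrix of distances and the class priors into posterior probabilities using the same shift.

    import numpy as np

    def class_posteriors(d, priors):
        # d: (m, K) distances d(x, mu_k); priors: (K,) estimated class proportions
        dmax = d.max(axis=1, keepdims=True)                 # shift by the largest distance per sample
        w = priors[None, :] * np.exp(-(d - dmax) / 2.0)     # proportional to pi_k exp(-d_k/2), underflow-safe
        return w / w.sum(axis=1, keepdims=True)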

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether the dataset is presented in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^⊤Ωβ, under the assumption that Y^⊤Y and X^⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros on the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B^⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:

min_{τ∈R^p}  min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^{p} w_j² ‖β^j‖²_2 / τ_j        (4.21a)
s.t.  Σ_j τ_j − Σ_j w_j ‖β^j‖_2 ≤ 0 ,        (4.21b)
      τ_j ≥ 0 , j = 1, ..., p ,        (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, ..., β^{p⊤})^⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss ‖YΘ − XB‖²_2; from now on, for simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero, with b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence between our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and analyzed in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^{p} w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is

L = J(B) + λ Σ_{j=1}^{p} w_j² ‖β^j‖²_2 / τ_j + ν_0 ( Σ_{j=1}^{p} τ_j − Σ_{j=1}^{p} w_j ‖β^j‖_2 ) − Σ_{j=1}^{p} ν_j τ_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j are

∂L/∂τ_j (τ_j*) = 0  ⇔  −λ w_j² ‖β^j‖²_2 / τ_j*² + ν_0 − ν_j = 0
                  ⇔  −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² − ν_j τ_j*² = 0
                  ⇒  −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² = 0 .

The last line is obtained from complementary slackness, which implies here ν_j τ_j* = 0 (complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier of the constraint g_j(τ_j) ≤ 0). As a result, the optimal value of τ_j is

τ_j* = ( λ w_j² ‖β^j‖²_2 / ν_0 )^{1/2} = (λ/ν_0)^{1/2} w_j ‖β^j‖_2 .        (4.22)

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

Σ_{j=1}^{p} τ_j* − Σ_{j=1}^{p} w_j ‖β^j‖_2 = 0 ,        (4.23)

so that τ_j* = w_j ‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^{p} w_j ‖β^j‖_2 .        (4.24)

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ B^⊤ΩB, where

Ω = diag( w_1²/τ_1, w_2²/τ_2, ..., w_p²/τ_p ) ,        (4.25)

with τ_j = w_j ‖β^j‖_2, resulting in the diagonal components

(Ω)_jj = w_j / ‖β^j‖_2 .        (4.26)

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set described in Chapter 5.
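In practice, the reweighting (4.25)–(4.26) is what turns the group-Lasso into an adaptive quadratic penalty: Ω is recomputed from the current B at each pass. A minimal illustrative sketch (our own names; rows with a zero norm receive an infinite entry, meaning the variable is excluded):

    import numpy as np

    def omega_from_B(B, w):
        # diagonal penalty (4.26): Omega_jj = w_j / ||beta^j||_2, +inf for zeroed rows
        row_norms = np.sqrt((B ** 2).sum(axis=1))
        with np.errstate(divide='ignore'):
            diag = np.where(row_norms > 0.0, w / row_norms, np.inf)
        return np.diag(diag)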

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2 / τ, known as the perspective function of f(β) = ‖β‖²_2, is jointly convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

{ V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λ G } ,        (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, ..., g^{p⊤})^⊤, defined as follows. Let S(B) denote the support of B, S(B) = { j ∈ {1, ..., p} : ‖β^j‖_2 ≠ 0 }; then we have

∀j ∈ S(B) ,  g^j = w_j ‖β^j‖_2^{-1} β^j ,        (4.28)
∀j ∉ S(B) ,  ‖g^j‖_2 ≤ w_j .        (4.29)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is

∂( λ Σ_{m=1}^{p} w_m ‖β^m‖_2 ) / ∂β^j = λ w_j β^j / ‖β^j‖_2 .        (4.30)

At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

∂_{β^j} ( λ Σ_{m=1}^{p} w_m ‖β^m‖_2 ) = ∂_{β^j} ( λ w_j ‖β^j‖_2 ) = { λ w_j v ∈ R^{K−1} : ‖v‖_2 ≤ 1 } .        (4.31)

This gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B* of the objective function verifying the following conditions are global minima:

∀j ∈ S* ,  ∂J(B*)/∂β^j + λ w_j ‖β*^j‖_2^{-1} β*^j = 0 ,        (4.32a)
∀j ∉ S* ,  ‖∂J(B*)/∂β^j‖_2 ≤ λ w_j ,        (4.32b)

where S* ⊆ {1, ..., p} denotes the set of non-zero row vectors β*^j, and its complement indexes the zero rows.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

B_OS = argmin_{B∈R^{p×(K−1)}} min_{Θ∈R^{K×(K−1)}}  (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^{p} w_j ‖β^j‖_2
       s.t.  n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1}

is equivalent to the penalized LDA problem

B_LDA = argmax_{B∈R^{p×(K−1)}}  tr(B^⊤ Σ_B B)
        s.t.  B^⊤ (Σ_W + n^{-1} λΩ) B = I_{K−1} ,

where Ω = diag( w_1²/τ_1, ..., w_p²/τ_p ), with

Ω_jj = +∞ if β_OS^j = 0 ,  and  Ω_jj = w_j ‖β_OS^j‖_2^{-1} otherwise.        (4.33)

That is, B_LDA = B_OS diag( α_k^{-1} (1 − α_k²)^{-1/2} ), where α_k ∈ (0,1) is the kth leading eigenvalue of

n^{-1} Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr(B^⊤ΩB).

5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2) ‖YΘ − XB‖²_2.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented as a block diagram in Figure 5.1 and formalized in more detail in Algorithm 1. Note that this formulation uses the equations of the variational approach detailed in Section 4.3.1; if the alternative variational approach of Appendix D is used instead, Equations (4.21), (4.32a) and (4.32b) must be replaced by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive as we then solve (K − 1) similar systems

( X_A^⊤ X_A + λΩ ) β_k = X_A^⊤ Y θ⁰_k ,        (5.1)


Figure 5.1: GLOSS block diagram.


Algorithm 1 Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, ..., p} : ‖β^j‖_2 > 0 } ;  Θ⁰ such that n^{-1} Θ⁰^⊤Y^⊤Y Θ⁰ = I_{K−1} ;  convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
        B_A ← ( X_A^⊤ X_A + λΩ )^{-1} X_A^⊤ Y Θ⁰
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for j ∈ A such that ‖β^j‖_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j} ; go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) outside A
    ĵ ← argmax_{j∉A} ‖∂J/∂β^j‖_2
    if ‖∂J/∂β^ĵ‖_2 < λ then
        convergence ← true   % B is optimal
    else
        A ← A ∪ {ĵ}
    end if
until convergence
(s, V) ← eigenanalyze( Θ⁰^⊤Y^⊤X_A B ), that is, Θ⁰^⊤Y^⊤X_A B V_k = s_k V_k, k = 1, ..., K−1
Θ ← Θ⁰ V ;  B ← B V ;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, ..., K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ⁰_k denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is enough to solve all of them, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in a different "penalty" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

( X^⊤X + λΩ ) B = X^⊤Y Θ .        (5.2)

Defining the Cholesky decomposition C^⊤C = X^⊤X + λΩ, (5.2) is solved efficiently as follows:

C^⊤C B = X^⊤Y Θ
C B = C^⊤ \ ( X^⊤Y Θ )
B = C \ ( C^⊤ \ ( X^⊤Y Θ ) ) ,        (5.3)

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
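In Python, the same one-factorization / multiple-triangular-solves pattern can be written with SciPy's Cholesky helpers. This is only an illustrative sketch of (5.2)–(5.3) with our own function name, not the matlab code.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def solve_regression_coefficients(XA, Y, Theta0, Omega, lam):
        # solve (X_A' X_A + lam * Omega) B = X_A' Y Theta0 for all K-1 right-hand sides at once
        M = XA.T @ XA + lam * Omega
        factor = cho_factor(M)                     # single Cholesky factorization of the common matrix
        return cho_solve(factor, XA.T @ Y @ Theta0)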

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

B = Ω^{-1/2} ( Ω^{-1/2} X^⊤X Ω^{-1/2} + λ I )^{-1} Ω^{-1/2} X^⊤Y Θ⁰ ,        (5.4)

where the conditioning of Ω^{-1/2} X^⊤X Ω^{-1/2} + λI is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved to cases with large ω_j values; our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. This eigen-analysis is actually solved in the form Θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^⊤X + Ω)^{-1}, which involves the inversion of an n×n matrix. Let Θ⁰ be an arbitrary K×(K−1) matrix whose range includes the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B⁰ = (X^⊤X + λΩ)^{-1} X^⊤Y Θ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as

Θ⁰^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ⁰ = Θ⁰^⊤Y^⊤X B⁰ .

Thus, the solution of the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix Θ⁰^⊤Y^⊤X B⁰ = V Λ V^⊤. Defining Θ = Θ⁰V, we have Θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ = Λ, and when Θ⁰ is chosen such that n^{-1} Θ⁰^⊤Y^⊤Y Θ⁰ = I_{K−1}, we also have n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

¹As X is centered, 1_K belongs to the null space of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set Θ⁰ = (Y^⊤Y)^{-1/2} U, where U is a K×(K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4; optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the gradient of the objective function

(1/2) ‖YΘ − XB‖²_2 + λ Σ_{j=1}^{p} w_j ‖β^j‖_2 .        (5.5)

Let J(B) be the data-fitting term (1/2) ‖YΘ − XB‖²_2. Its gradient with respect to the jth row of B, β^j, is the (K−1)-dimensional vector

∂J(B)/∂β^j = x_j^⊤ ( XB − YΘ ) ,

where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as

x_j^⊤ ( XB − YΘ ) + λ w_j β^j / ‖β^j‖_2 = 0 .


The second optimality condition (4.32b) can be computed for every variable j as

‖ x_j^⊤ ( XB − YΘ ) ‖_2 ≤ λ w_j .

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease of the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion in the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease of the objective function:

ĵ = argmax_j  max( ‖ x_j^⊤ ( XB − YΘ ) ‖_2 − λ w_j , 0 ) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:

‖ x_j^⊤ ( XB − YΘ ) ‖_2 ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
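The two tests above only require the column-wise gradient norms ‖x_j^⊤(XB − YΘ)‖_2. The sketch below, illustrative only and with our own names, computes these norms and the resulting inclusion violations for all variables at once.

    import numpy as np

    def optimality_checks(X, Y, B, Theta, lam, w):
        # gradient norms ||x_j' (XB - Y Theta)||_2 for j = 1..p, and inclusion violations (Section 5.4)
        R = X @ B - Y @ Theta
        grad_norms = np.linalg.norm(X.T @ R, axis=1)
        inclusion_violation = np.maximum(grad_norms - lam * w, 0.0)
        return grad_norms, inclusion_violation

    # a candidate enters the active set when its violation is the largest positive one;
    # an active variable with a vanishing row of B leaves the set when grad_norms[j] <= lam * w[j]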

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem for this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λ_max, corresponding to a null B matrix, is obtained by computing the optimality condition (4.32b) at B = 0:

λ_max = max_{j∈{1,...,p}}  (1/w_j) ‖ x_j^⊤ Y Θ⁰ ‖_2 .

The algorithm then computes a series of solutions along the regularization path defined by a sequence of penalties λ_1 = λ_max > ... > λ_t > ... > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
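A small sketch of this path construction (illustrative only; the names and the stopping rule are ours), assuming a hypothetical routine gloss_solve(X, Y, lam, B_init) standing in for the solver of Sections 5.1–5.4:

    import numpy as np

    def lambda_max(X, Y, Theta0, w):
        # optimality condition (4.32b) at B = 0: lambda_max = max_j ||x_j' Y Theta0||_2 / w_j
        return np.max(np.linalg.norm(X.T @ (Y @ Theta0), axis=1) / w)

    def regularization_path(X, Y, Theta0, w, gloss_solve, max_active=None):
        # gloss_solve(X, Y, lam, B_init) is a placeholder, not an actual GLOSS API
        p = X.shape[1]
        max_active = min(X.shape[0], p) if max_active is None else max_active
        lam = lambda_max(X, Y, Theta0, w)
        B = np.zeros((p, Y.shape[1] - 1))
        path = []
        while np.count_nonzero(np.linalg.norm(B, axis=1)) < max_active and lam > 0:
            lam = lam / 2.0                         # regularly decreasing penalties, warm-started
            B = gloss_solve(X, Y, lam, B)
            path.append((lam, B))
        return path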


5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some data structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al. 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

min_{B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F = min_{B∈R^{p×(K−1)}}  tr( Θ^⊤Y^⊤YΘ − 2 Θ^⊤Y^⊤XB + n B^⊤ Σ_T B )

are replaced by

min_{B∈R^{p×(K−1)}}  tr( Θ^⊤Y^⊤YΘ − 2 Θ^⊤Y^⊤XB + n B^⊤ ( Σ_B + diag(Σ_W) ) B ) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition for their penalized discriminant analysis model, constraining the discriminant directions to be spatially smooth.


Figure 5.2: Graph and Laplacian matrix for a 3×3 image. Pixels are connected to their neighbors (including diagonals), and the corresponding Laplacian is

Ω_L = [  3 −1  0 −1 −1  0  0  0  0
        −1  5 −1 −1 −1 −1  0  0  0
         0 −1  3  0 −1 −1  0  0  0
        −1 −1  0  5 −1  0 −1 −1  0
        −1 −1 −1 −1  8 −1 −1 −1 −1
         0 −1 −1  0 −1  5  0 −1 −1
         0  0  0 −1 −1  0  3 −1  0
         0  0  0 −1 −1 −1 −1  5 −1
         0  0  0  0 −1 −1  0 −1  3 ] .

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive-semidefinite, and the penalty β^⊤Ω_Lβ favors, among vectors of identical L2 norm, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1,1,0,1,1,0,0,0,0)^⊤, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector (−1,1,0,1,1,0,0,0,0)^⊤, which has a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.


6 Experimental Results

This section presents comparison results between the Group-Lasso Optimal Scoring Solver and two other state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al. 2011), which applies an elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T to ones, or the diagonal of the within-class covariance matrix Σ_W to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing the penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption; in particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split in a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_1j = 0.7 × 1_(1≤j≤25), μ_2j = 0.7 × 1_(26≤j≤50), μ_3j = 0.7 × 1_(51≤j≤75), μ_4j = 0.7 × 1_(76≤j≤100).

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ); if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_(j≤200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100×100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic the correlation of gene expression data.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
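As an illustration of the protocol, the sketch below generates data following Simulation 1 (300 examples per class, to be split into the 100/100/1000 train/validation/test sets mentioned above). It is our own illustrative code, not the script used for the experiments.

    import numpy as np

    def simulation1(n_per_class=300, p=500, shift=0.7, seed=0):
        # four classes; class k has mean 0.7 on features 25(k-1)+1 .. 25k and 0 elsewhere
        rng = np.random.default_rng(seed)
        K = 4
        mu = np.zeros((K, p))
        for k in range(K):
            mu[k, 25 * k:25 * (k + 1)] = shift
        X = np.vstack([rng.normal(loc=mu[k], scale=1.0, size=(n_per_class, p)) for k in range(K)])
        y = np.repeat(np.arange(K), n_per_class)
        return X, y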

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

             Err (%)       Var            Dir

Sim 1: K = 4, mean shift, ind. features
  PLDA       12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA       31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS      19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D    11.2 (0.1)    251.1 (4.1)    3.0 (0.0)

Sim 2: K = 2, mean shift, dependent features
  PLDA        9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA       19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS      15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D     9.0 (0.0)    203.5 (4.0)    1.0 (0.0)

Sim 3: K = 4, 1D mean shift, ind. features
  PLDA       13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA       57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS      31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D    18.5 (0.1)    357.5 (2.8)    1.0 (0.0)

Sim 4: K = 4, mean shift, ind. features
  PLDA       60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA       65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS      60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D    58.8 (0.1)    162.7 (4.9)    2.9 (0.0)



Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

           Simulation 1      Simulation 2      Simulation 3      Simulation 4
           TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR

  PLDA     99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
  SLDA     73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
  GLOSS    64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
  GLOSS-D  93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
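Both rates can be computed directly from the index set of selected variables, as in this minimal sketch (the index sets shown are hypothetical).

import numpy as np

def tpr_fpr(selected, relevant, p):
    """TPR: fraction of relevant variables that are selected.
       FPR: fraction of irrelevant variables that are selected."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr

# e.g. the first 100 of p = 500 variables are relevant, 120 variables were selected
tpr, fpr = tpr_fpr(selected=range(120), relevant=range(100), p=500)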

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

2. http://www.broadinstitute.org/cancer/software/genepattern/datasets
3. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and of the number of selected variables.

             Err (%)         Var

Nakayama: n = 86, p = 22,283, K = 5
  PLDA       20.95 (1.3)     10478.7 (2116.3)
  SLDA       25.71 (1.7)       252.5 (3.1)
  GLOSS      20.48 (1.4)       129.0 (18.6)

Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA       38.36 (6.0)     14873.5 (720.3)
  SLDA       —               —
  GLOSS      20.61 (6.9)       372.4 (122.1)

Sun: n = 180, p = 54,613, K = 4
  PLDA       33.78 (5.9)     21634.8 (7443.2)
  SLDA       36.22 (6.5)       384.4 (16.5)
  GLOSS      31.77 (4.5)        93.0 (93.6)


Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.
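A sketch of this protocol is given below; `fit_predict` is a placeholder for any of the compared classifiers (assumed to tune its penalty by 10-fold cross-validation on the training part) and is not part of the actual implementations.

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

def evaluate(X, y, fit_predict, n_repeats=10, seed=0):
    """10 random 75/25 splits; returns mean and std of the test error rate."""
    errors = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, stratify=y, random_state=seed + r)
        y_hat = fit_predict(X_tr, y_tr, X_te, cv=StratifiedKFold(n_splits=10))
        errors.append(np.mean(y_hat != y_te))
    return np.mean(errors), np.std(errors)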

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

4. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2 shows, for the Nakayama (top row) and Sun (bottom row) datasets, the samples projected on the first two discriminant directions (1st discriminant on the x-axis, 2nd discriminant on the y-axis), estimated by GLOSS (left column) and SLDA (right column). Nakayama legend: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun legend: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships from Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.
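As an illustration, such a Laplacian matrix can be built from the 4-neighbour pixel grid of a 16 × 16 image as sketched below; the exact weighting used for Ω_L in S-GLOSS may differ.

import numpy as np

def grid_laplacian(h=16, w=16):
    """Graph Laplacian L = D - A of the 4-neighbour pixel grid (256 x 256 here)."""
    n = h * w
    A = np.zeros((n, n))
    for i in range(h):
        for j in range(w):
            u = i * w + j
            if j + 1 < w:                      # right neighbour
                A[u, u + 1] = A[u + 1, u] = 1
            if i + 1 < h:                      # bottom neighbour
                A[u, u + w] = A[u + w, u] = 1
    D = np.diag(A.sum(axis=1))
    return D - A

Omega_L = grid_laplacian()   # 256 x 256 penalty matrix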

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow the detection of strokes, and will probably provide better prediction results.


Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and for S-GLOSS (right).

Figure 6.5: Sparse discriminant direction between digits "1" and "0": β for GLOSS (left) and for S-GLOSS (right), both with λ = 0.3.


Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; this is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


Part III

Sparse Clustering Analysis


Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified with different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^⊤, …, x_n^⊤)^⊤ have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

f(x_i) = ∑_{k=1}^K π_k f_k(x_i) ,  ∀i ∈ {1, …, n} ,

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ ∀k and ∑_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data is generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

f(x_i; θ) = ∑_{k=1}^K π_k φ(x_i; θ_k) ,  ∀i ∈ {1, …, n} ,


where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.
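For a Gaussian parametric family with a common covariance matrix (the model used later in this chapter), the mixture density can be evaluated as in the following sketch.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(X, pis, mus, Sigma):
    # f(x_i; theta) = sum_k pi_k N(x_i; mu_k, Sigma)
    return sum(pi_k * multivariate_normal.pdf(X, mean=mu_k, cov=Sigma)
               for pi_k, mu_k in zip(pis, mus))

# toy usage: two components in R^2
X = np.random.default_rng(0).standard_normal((5, 2))
f = mixture_density(X, pis=[0.4, 0.6], mus=[np.zeros(2), np.ones(2)], Sigma=np.eye(2))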

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood from the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

L(θ; X) = log ( ∏_{i=1}^n f(x_i; θ) ) = ∑_{i=1}^n log ( ∑_{k=1}^K π_k f_k(x_i; θ_k) )    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or


classification log-likelihood:

L_C(θ; X, Y) = log ( ∏_{i=1}^n f(x_i, y_i; θ) )
             = ∑_{i=1}^n log ( ∑_{k=1}^K y_ik π_k f_k(x_i; θ_k) )
             = ∑_{i=1}^n ∑_{k=1}^K y_ik log(π_k f_k(x_i; θ_k))    (7.2)

The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k, and y_ik = 0 otherwise.

Defining the soft membership t_ik(θ) as

t_ik(θ) = p(Y_ik = 1 | x_i; θ)    (7.3)
        = π_k f_k(x_i; θ_k) / f(x_i; θ) ,    (7.4)

to lighten notations, t_ik(θ) will be denoted t_ik when the parameter θ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

L_C(θ; X, Y) = ∑_{i,k} y_ik log(π_k f_k(x_i; θ_k))
             = ∑_{i,k} y_ik log(t_ik f(x_i; θ))
             = ∑_{i,k} y_ik log t_ik + ∑_{i,k} y_ik log f(x_i; θ)
             = ∑_{i,k} y_ik log t_ik + ∑_{i=1}^n log f(x_i; θ)
             = ∑_{i,k} y_ik log t_ik + L(θ; X)    (7.5)

where ∑_{i,k} y_ik log t_ik can be reformulated as

∑_{i,k} y_ik log t_ik = ∑_{i=1}^n ∑_{k=1}^K y_ik log(p(Y_ik = 1 | x_i; θ))
                      = ∑_{i=1}^n log p(y_i | x_i; θ)
                      = log p(Y | X; θ) .

As a result, the relationship (7.5) can be rewritten as

L(θ; X) = L_C(θ; Z) − log p(Y | X; θ) .    (7.6)


Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_ik are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value θ^(t):

L(θ; X) = E_{Y∼p(·|X; θ^(t))}[L_C(θ; X, Y)] + E_{Y∼p(·|X; θ^(t))}[− log p(Y | X; θ)]
        =            Q(θ, θ^(t))           +            H(θ, θ^(t)) .

In this expression, H(θ, θ^(t)) is the entropy and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:

ΔL = (Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t))) + (H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t))) ,

where the first term is nonnegative by definition of iteration t+1, and the second term is nonnegative by Jensen's inequality.

Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

Q(θ, θ′) = E_{Y∼p(Y|X; θ′)}[L_C(θ; X, Y)]
         = ∑_{i,k} p(Y_ik = 1 | x_i; θ′) log(π_k f_k(x_i; θ_k))
         = ∑_{i=1}^n ∑_{k=1}^K t_ik(θ′) log(π_k f_k(x_i; θ_k)) .    (7.7)

Due to its similarity with the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);

• M-step: calculation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).


Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix Σ and different mean vectors μ_k, the mixture density is

f(x_i; θ) = ∑_{k=1}^K π_k f_k(x_i; θ_k)
          = ∑_{k=1}^K π_k (2π)^{−p/2} |Σ|^{−1/2} exp{ −½ (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) } .

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current θ^(t) parameters; then the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

Q(θ, θ^(t)) = ∑_{i,k} t_ik log(π_k) − ∑_{i,k} t_ik log((2π)^{p/2} |Σ|^{1/2}) − ½ ∑_{i,k} t_ik (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
            = ∑_k t_k log(π_k) − (np/2) log(2π) [constant term] − (n/2) log(|Σ|) − ½ ∑_{i,k} t_ik (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
            ≡ ∑_k t_k log(π_k) − (n/2) log(|Σ|) − ∑_{i,k} t_ik ( ½ (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) ) ,    (7.8)

where

t_k = ∑_{i=1}^n t_ik .    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^(t+1):

π_k^(t+1) = t_k / n    (7.10)

μ_k^(t+1) = ( ∑_i t_ik x_i ) / t_k    (7.11)

Σ^(t+1) = (1/n) ∑_k W_k    (7.12)

with  W_k = ∑_i t_ik (x_i − μ_k)(x_i − μ_k)^⊤ .    (7.13)

The derivations are detailed in Appendix G.
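A direct transcription of the E-step (7.4) and of the M-step updates (7.10)-(7.13) is sketched below; it only illustrates the formulas and is not the Mix-GLOSS implementation.

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigma):
    """Posterior probabilities t_ik of (7.4) for a common-covariance Gaussian mixture."""
    dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigma)
                            for k in range(len(pis))])
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(X, T):
    """Updates (7.10)-(7.13): proportions, means and pooled covariance."""
    n, p = X.shape
    tk = T.sum(axis=0)                           # (7.9)
    pis = tk / n                                 # (7.10)
    mus = (T.T @ X) / tk[:, None]                # (7.11)
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        Xc = X - mus[k]
        Sigma += (T[:, k, None] * Xc).T @ Xc     # W_k of (7.13)
    return pis, mus, Sigma / n                   # (7.12)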

7.2 Feature Selection in Model-Based Clustering

When a common covariance matrix is assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

log ( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x^⊤ Σ^{−1} (μ_k − μ_ℓ) − ½ (μ_k + μ_ℓ)^⊤ Σ^{−1} (μ_k − μ_ℓ) + log(π_k / π_ℓ) .

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

λ ∑_{k=1}^K ∑_{j=1}^p |μ_kj| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

λ_1 ∑_{k=1}^K ∑_{j=1}^p |μ_kj| + λ_2 ∑_{k=1}^K ∑_{j=1}^p ∑_{m=1}^p |(Σ_k^{−1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

λ ∑_{j=1}^p ∑_{1≤k<k′≤K} |μ_kj − μ_k′j| .

This PFP regularization does not shrink the means to zero but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

λ ∑_{j=1}^p ||(μ_1j, μ_2j, …, μ_Kj)||_∞ .

One group is defined for each variable j as the set of the jth components of the K means, (μ_1j, …, μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

λ √K ∑_{j=1}^p √( ∑_{k=1}^K μ_kj² ) .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing to test it is available on the authors' website.
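For reference, the VMG group-Lasso penalty above amounts to the following computation on the K × p matrix of cluster means (a small sketch with names of our own).

import numpy as np

def vmg_group_penalty(mu, lam):
    # one group per variable j: lambda * sqrt(K) * sum_j ||(mu_1j, ..., mu_Kj)||_2
    K = mu.shape[0]
    return lam * np.sqrt(K) * np.linalg.norm(mu, axis=0).sum()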

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as

f(x_i | φ, π, θ, ν) = ∑_{k=1}^K π_k ∏_{j=1}^p [f(x_ij | θ_jk)]^{φ_j} [h(x_ij | ν_j)]^{1−φ_j} ,

where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion,

tr( (U^⊤ Σ_W U)^{−1} U^⊤ Σ_B U ) ,    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and the model parameters in the latent space, such that the matrix U enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

min_{Ũ ∈ R^{p×(K−1)}}  ||X_U − XŨ||²_F + λ ∑_{k=1}^{K−1} ||ũ_k||_1 ,

where X_U = XU is the input data projected in the non-sparse space, and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:

min_{A, B ∈ R^{p×(K−1)}}  ∑_{k=1}^K || R_W^{−⊤} H_{B,k} − A B^⊤ H_{B,k} ||²_2 + ρ ∑_{j=1}^{K−1} β_j^⊤ Σ_W β_j + λ ∑_{j=1}^{K−1} ||β_j||_1

s.t.  A^⊤ A = I_{K−1} ,

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W; Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.

The last possibility suggests the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

min_{U ∈ R^{p×(K−1)}}  ∑_{j=1}^p || Σ_{B,j} − U U^⊤ Σ_{B,j} ||²_2   s.t.  U^⊤ U = I_{K−1} ,

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed". Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized".

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): set of selected relevant variables;

• X^(2): set of variables being considered for inclusion in or exclusion from X^(1);

• X^(3): set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2) | X^(1)) f(X^(1) | Y)

• M2:  f(X | Y) = f(X^(1), X^(2), X^(3) | Y) = f(X^(3) | X^(2), X^(1)) f(X^(2), X^(1) | Y)

Model M1 means that the variables in X^(2) are independent of the clustering Y; model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of variable X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_12 = f(X | M1) / f(X | M2) ,

where the high-dimensional f(X^(3) | X^(2), X^(1)) cancels from the ratio:

B_12 = f(X^(1), X^(2), X^(3) | M1) / f(X^(1), X^(2), X^(3) | M2)
     = [ f(X^(2) | X^(1), M1) f(X^(1) | M1) ] / f(X^(2), X^(1) | M2) .

This factor is approximated, since the integrated likelihoods f(X^(1) | M1) and f(X^(2), X^(1) | M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2) | X^(1), M1), if there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1). There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to produce reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow solving the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_W^{−1} (x_i − μ_k) ,

where the μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.


The likelihood equations of the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2 l_weight(μ, Σ) = ∑_{i=1}^n ∑_{k=1}^K t_ik d(x_i, μ_k) − n log(|Σ_W|) ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, μ_k) = ||(x_i − μ_k) B_LDA||²_2 − 2 log(π_k) .

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

   B_OS = (X^⊤X + λΩ)^{−1} X^⊤ Y Θ ,

   where Θ are the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{−1} X^⊤Y.

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag( α_k^{−1} (1 − α_k²)^{−1/2} ).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik with

   t_ik ∝ exp[ −(d(x_i, μ_k) − 2 log(π_k)) / 2 ] .    (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
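The following sketch transcribes steps 1 to 8 in code. `solve_pos` is a placeholder for a penalized optimal scoring solver returning (B_OS, Θ, α), standing in for GLOSS, and a random hard assignment replaces the K-means initialization of step 1 for brevity.

import numpy as np

def mix_os_em(X, K, lam, solve_pos, n_iter=100, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    T = np.eye(K)[rng.integers(0, K, n)]             # step 1: initial (hard) memberships
    for _ in range(n_iter):
        B, Theta, alpha = solve_pos(X, T, lam)       # step 2: penalized optimal scoring
        D = np.diag(1.0 / (alpha * np.sqrt(1.0 - alpha ** 2)))
        Z = X @ B @ D                                # step 3: map to the LDA domain
        pis = T.sum(axis=0) / n
        M = (T.T @ Z) / T.sum(axis=0)[:, None]       # step 4: centroids in the LDA domain
        d = ((Z[:, None, :] - M[None]) ** 2).sum(-1)     # step 5: squared distances
        logp = -(d - 2 * np.log(pis)) / 2                # step 6: posteriors, as in (8.1)
        T_new = np.exp(logp - logp.max(axis=1, keepdims=True))
        T_new /= T_new.sum(axis=1, keepdims=True)
        if np.abs(T_new - T).mean() < tol:           # step 8: stop when the t_ik converge
            return B, T_new
        T = T_new                                    # step 7: update the (soft) labels
    return B, T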

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with penalized OS. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(Σ | Λ_0, ν_0) = ( 1 / ( 2^{np/2} |Λ_0|^{n/2} Γ_p(n/2) ) ) |Σ^{−1}|^{(ν_0−p−1)/2} exp{ −½ tr(Λ_0^{−1} Σ^{−1}) } ,

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

Γ_p(n/2) = π^{p(p−1)/4} ∏_{j=1}^p Γ( n/2 + (1−j)/2 ) .

The posterior distribution can be maximized similarly to the likelihood, through the maximization of

Q(θ, θ′) + log(f(Σ | Λ_0, ν_0))
  = ∑_{k=1}^K t_k log π_k − ((n+1)p/2) log 2 − (n/2) log|Λ_0| − (p(p+1)/4) log(π)
    − ∑_{j=1}^p log Γ( n/2 + (1−j)/2 ) − ((ν_n − p − 1)/2) log|Σ| − ½ tr(Λ_n^{−1} Σ^{−1})
  ≡ ∑_{k=1}^K t_k log π_k − (n/2) log|Λ_0| − ((ν_n − p − 1)/2) log|Σ| − ½ tr(Λ_n^{−1} Σ^{−1}) ,    (8.2)

with  t_k = ∑_{i=1}^n t_ik ,
      ν_n = ν_0 + n ,
      Λ_n^{−1} = Λ_0^{−1} + S_0 ,
      S_0 = ∑_{i=1}^n ∑_{k=1}^K t_ik (x_i − μ_k)(x_i − μ_k)^⊤ .

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, in which only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

Σ_MAP = ( 1 / (ν_0 + n − p − 1) ) (Λ_0^{−1} + S_0) ,    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and setting Λ_0^{−1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).


9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the implemented warm-start reduces the computation time by a factor of 8, compared with using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0 ; Y ← K-means(X, K)
    {Run non-penalized Mix-GLOSS}
    λ ← 0 ; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    lastLAMBDA ← false
repeat
    {Estimate λ}
    Compute the gradient at β_j = 0:  ∂J(B)/∂β_j |_{β_j=0} = x^{j⊤} ( ∑_{m≠j} x^m β_m − YΘ )
    Compute λ^max for every feature using (4.32b):  λ_j^max = (1/w_j) || ∂J(B)/∂β_j |_{β_j=0} ||_2
    Choose λ so as to remove 10% of the relevant features
    {Run penalized Mix-GLOSS}
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0 ; Y ← Y0
    else
        B_OS ← 0 ; Y ← K-means(X, K)
    end if
    convergenceEM ← false ; tolEM ← 1e-3
repeat
    {M-step}
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag( α^{−1} (1 − α²)^{−1/2} )
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    {E-step}
    t_ik as per (8.1)
    L(θ) as per (8.2)
    if (1/n) ∑_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ, and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression; here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T using

t_ik ∝ exp[ −(d(x_i, μ_k) − 2 log(π_k)) / 2 ] .

The convergence of these t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions, which transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0), and the execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time.


[Figure 9.2 sketches the model selection mechanism: an initial non-penalized Mix-GLOSS (λ = 0, 20 repetitions) is run on (X, K); the B and T matrices from the best repetition are used to warm-start the penalized Mix-GLOSS runs over the λ grid; BIC is computed for each run, and λ is chosen as the minimizer of BIC. The outputs are the partition, the t_ik, π_k, the selected λ, B, Θ, D, L(θ) and the active set.]

Figure 9.2: Mix-GLOSS model selection diagram.

Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software package implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters that are particularly important for their dataset. The package LumiWCluster allows performing clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are the following:

• Clustering Error (in percentage): To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows obtaining the ideal 0% clustering error even if the IDs of the clusters and of the real classes differ (a sketch of such a permutation-invariant error is given after this list).

• Number of Disposed Features: This value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. Those algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
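A common way to obtain such a label-permutation-invariant error is to match clusters to classes with an optimal one-to-one assignment, as sketched below; this is our reading of the measure, and the exact definition of Wu and Schölkopf (2007) may differ.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred, K):
    """y_true, y_pred: integer labels in {0, ..., K-1}."""
    C = np.zeros((K, K), dtype=int)            # contingency table: classes x clusters
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)     # best cluster-to-class matching
    return 1.0 - C[rows, cols].sum() / len(y_true)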

The adequacy of the selected features was also assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR): the TPR is the proportion of relevant variables that are selected, and the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and high clustering error, respectively, and as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are thus Fisher EM (Bouveyron and Brunet, 2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data.

                          Err. (%)       Var.          Time

Sim 1: K = 4, mean shift, ind. features
  CS general cov          46 (15)        985 (72)      884h
  Fisher EM               58 (87)        784 (52)      1645m
  Clustvarsel             602 (107)      378 (291)     383h
  LumiWCluster-Kuan       42 (68)        779 (4)       389s
  LumiWCluster-Wang       43 (69)        784 (39)      619s
  Mix-GLOSS               32 (16)        80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov          154 (2)        997 (09)      783h
  Fisher EM               74 (23)        809 (28)      8m
  Clustvarsel             73 (2)         334 (207)     166h
  LumiWCluster-Kuan       64 (18)        798 (04)      155s
  LumiWCluster-Wang       63 (17)        799 (03)      14s
  Mix-GLOSS               77 (2)         841 (34)      2h

Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov          304 (57)       55 (468)      1317h
  Fisher EM               233 (65)       366 (55)      22m
  Clustvarsel             658 (115)      232 (291)     542h
  LumiWCluster-Kuan       323 (21)       80 (02)       83s
  LumiWCluster-Wang       308 (36)       80 (02)       1292s
  Mix-GLOSS               347 (92)       81 (88)       21h

Sim 4: K = 4, mean shift, ind. features
  CS general cov          626 (55)       999 (02)      112h
  Fisher EM               567 (104)      55 (48)       195m
  Clustvarsel             732 (4)        24 (12)       767h
  LumiWCluster-Kuan       692 (112)      99 (2)        876s
  LumiWCluster-Wang       697 (119)      991 (21)      825s
  Mix-GLOSS               669 (91)       975 (12)      11h

Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms.

              Simulation 1      Simulation 2      Simulation 3      Simulation 4
              TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR

MIX-GLOSS     992     015       828     335       884     67        780     12
LUMI-KUAN     992     28        1000    02        1000    005       50      005
FISHER-EM     986     24        888     17        838     5825      620     4075


[Figure: scatter plot of TPR (%) versus FPR (%) for MIX-GLOSS, LUMI-KUAN and FISHER-EM on Simulations 1–4.]

Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort was spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to best in terms of fall-out and recall.


Conclusions

Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory was verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which proved its effectiveness by inducing extremely parsimonious models without renouncing any predictive capability. GLOSS was tested on four artificial and three real datasets, outperforming state-of-the-art algorithms in almost all situations.

In Part III this theory was adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection for the penalty parameter. Also in this case, the theory was put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We plan to test it at least on the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. These algorithms are well suited to genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been used in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computations, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables. This can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of these possibilities have been partially implemented, such as the diagonal version of GLOSS, but they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used to stop the EM algorithm and to perform model selection. However, further investigations must be carried out in this direction to assess the convergence properties of the algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather all the points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


Appendix


A Matrix Properties

Property 1. By definition, $\boldsymbol{\Sigma}_W$ and $\boldsymbol{\Sigma}_B$ are both symmetric matrices:
$$\boldsymbol{\Sigma}_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i \in C_k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \ , \qquad \boldsymbol{\Sigma}_B = \frac{1}{n}\sum_{k=1}^{g} n_k (\boldsymbol{\mu}_k - \bar{\mathbf{x}})(\boldsymbol{\mu}_k - \bar{\mathbf{x}})^\top \ .$$

Property 2. $\dfrac{\partial \mathbf{x}^\top \mathbf{a}}{\partial \mathbf{x}} = \dfrac{\partial \mathbf{a}^\top \mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}$

Property 3. $\dfrac{\partial \mathbf{x}^\top \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$

Property 4. $\dfrac{\partial |\mathbf{X}^{-1}|}{\partial \mathbf{X}} = -|\mathbf{X}^{-1}|\,(\mathbf{X}^{-1})^\top$

Property 5. $\dfrac{\partial \mathbf{a}^\top \mathbf{X} \mathbf{b}}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^\top$

Property 6. $\dfrac{\partial}{\partial \mathbf{X}} \mathrm{tr}\left(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}\right) = -(\mathbf{X}^{-1}\mathbf{B}\mathbf{A}\mathbf{X}^{-1})^\top = -\mathbf{X}^{-\top}\mathbf{A}^\top\mathbf{B}^\top\mathbf{X}^{-\top}$
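As a quick numerical sanity check of these identities (a sketch of ours, not part of the original material), Properties 3 and 5 can be verified by finite differences:

import numpy as np

rng = np.random.default_rng(0)
p, eps = 5, 1e-6
A = rng.normal(size=(p, p))
X = rng.normal(size=(p, p))
x, a, b = rng.normal(size=p), rng.normal(size=p), rng.normal(size=p)

# Property 3: d(x'Ax)/dx = (A + A')x
num_grad = np.array([((x + eps * e) @ A @ (x + eps * e) - x @ A @ x) / eps
                     for e in np.eye(p)])
print(np.allclose(num_grad, (A + A.T) @ x, atol=1e-4))

# Property 5: d(a'Xb)/dX = ab'
num_grad = np.array([[(a @ (X + eps * np.outer(np.eye(p)[i], np.eye(p)[j])) @ b
                       - a @ X @ b) / eps for j in range(p)] for i in range(p)])
print(np.allclose(num_grad, np.outer(a, b), atol=1e-4))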


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

$$\min_{\boldsymbol{\theta}_k, \boldsymbol{\beta}_k} \ \|\mathbf{Y}\boldsymbol{\theta}_k - \mathbf{X}\boldsymbol{\beta}_k\|_2^2 + \boldsymbol{\beta}_k^\top \boldsymbol{\Omega}_k \boldsymbol{\beta}_k \qquad \text{(B.1)}$$
$$\text{s.t.} \quad \boldsymbol{\theta}_k^\top \mathbf{Y}^\top\mathbf{Y}\boldsymbol{\theta}_k = 1 \ , \qquad \boldsymbol{\theta}_\ell^\top \mathbf{Y}^\top\mathbf{Y}\boldsymbol{\theta}_k = 0 \quad \forall \ell < k \ ,$$

for $k = 1, \ldots, K-1$. The Lagrangian associated with Problem (B.1) is

$$\mathcal{L}_k(\boldsymbol{\theta}_k, \boldsymbol{\beta}_k, \lambda_k, \boldsymbol{\nu}_k) = \|\mathbf{Y}\boldsymbol{\theta}_k - \mathbf{X}\boldsymbol{\beta}_k\|_2^2 + \boldsymbol{\beta}_k^\top \boldsymbol{\Omega}_k \boldsymbol{\beta}_k + \lambda_k(\boldsymbol{\theta}_k^\top \mathbf{Y}^\top\mathbf{Y}\boldsymbol{\theta}_k - 1) + \sum_{\ell < k} \nu_\ell\, \boldsymbol{\theta}_\ell^\top \mathbf{Y}^\top\mathbf{Y}\boldsymbol{\theta}_k \ . \qquad \text{(B.2)}$$

Setting the gradient of (B.2) with respect to $\boldsymbol{\beta}_k$ to zero gives the value of the optimal $\boldsymbol{\beta}_k^\star$:
$$\boldsymbol{\beta}_k^\star = (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega}_k)^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\theta}_k \ . \qquad \text{(B.3)}$$

The objective function of (B.1) evaluated at $\boldsymbol{\beta}_k^\star$ is
$$\min_{\boldsymbol{\theta}_k} \ \|\mathbf{Y}\boldsymbol{\theta}_k - \mathbf{X}\boldsymbol{\beta}_k^\star\|_2^2 + \boldsymbol{\beta}_k^{\star\top} \boldsymbol{\Omega}_k \boldsymbol{\beta}_k^\star = \min_{\boldsymbol{\theta}_k} \ \boldsymbol{\theta}_k^\top \mathbf{Y}^\top\big(\mathbf{I} - \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega}_k)^{-1}\mathbf{X}^\top\big)\mathbf{Y}\boldsymbol{\theta}_k$$
$$= \max_{\boldsymbol{\theta}_k} \ \boldsymbol{\theta}_k^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega}_k)^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\theta}_k \ . \qquad \text{(B.4)}$$

If the penalty matrix $\boldsymbol{\Omega}_k$ is identical for all problems, $\boldsymbol{\Omega}_k = \boldsymbol{\Omega}$, then (B.4) corresponds to an eigen-problem where the score vectors $\boldsymbol{\theta}_k$ are the eigenvectors of $\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like $\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$ is not trivial because of the $p \times p$ inverse. For some datasets $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let $\mathbf{M}$ be the matrix $\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$, so that expression (B.4) can be rewritten in a compact way:
$$\max_{\boldsymbol{\Theta} \in \mathbb{R}^{K \times (K-1)}} \ \mathrm{tr}\big(\boldsymbol{\Theta}^\top\mathbf{M}\boldsymbol{\Theta}\big) \qquad \text{s.t.} \quad \boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{Y}\boldsymbol{\Theta} = \mathbf{I}_{K-1} \ . \qquad \text{(B.5)}$$

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1) \times (K-1)$ matrix $\mathbf{M}_{\boldsymbol{\Theta}}$ be $\boldsymbol{\Theta}^\top\mathbf{M}\boldsymbol{\Theta}$. The classical eigenvector formulation associated with (B.5) is then
$$\mathbf{M}_{\boldsymbol{\Theta}}\mathbf{v} = \lambda\mathbf{v} \ , \qquad \text{(B.6)}$$

where $\mathbf{v}$ is an eigenvector and $\lambda$ the associated eigenvalue of $\mathbf{M}_{\boldsymbol{\Theta}}$. Operating,
$$\mathbf{v}^\top\mathbf{M}_{\boldsymbol{\Theta}}\mathbf{v} = \lambda \iff \mathbf{v}^\top\boldsymbol{\Theta}^\top\mathbf{M}\boldsymbol{\Theta}\mathbf{v} = \lambda \ .$$

Making the change of variable $\mathbf{w} = \boldsymbol{\Theta}\mathbf{v}$, we obtain an alternative eigen-problem where the $\mathbf{w}$ are the eigenvectors of $\mathbf{M}$ and $\lambda$ the associated eigenvalue:
$$\mathbf{w}^\top\mathbf{M}\mathbf{w} = \lambda \ . \qquad \text{(B.7)}$$

Therefore $\mathbf{v}$ are the eigenvectors in the eigen-decomposition of matrix $\mathbf{M}_{\boldsymbol{\Theta}}$ and $\mathbf{w}$ are the eigenvectors in the eigen-decomposition of matrix $\mathbf{M}$. Note that the only difference between the $(K-1) \times (K-1)$ matrix $\mathbf{M}_{\boldsymbol{\Theta}}$ and the $K \times K$ matrix $\mathbf{M}$ is the $K \times (K-1)$ matrix $\boldsymbol{\Theta}$ in the expression $\mathbf{M}_{\boldsymbol{\Theta}} = \boldsymbol{\Theta}^\top\mathbf{M}\boldsymbol{\Theta}$. Then, to avoid the computation of the $p \times p$ inverse $(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}$, we can use the optimal value of the coefficient matrix $\mathbf{B} = (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\Theta}$ in $\mathbf{M}_{\boldsymbol{\Theta}}$:
$$\mathbf{M}_{\boldsymbol{\Theta}} = \boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\Theta} = \boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B} \ .$$

Thus, the eigen-decomposition of the $(K-1) \times (K-1)$ matrix $\mathbf{M}_{\boldsymbol{\Theta}} = \boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B}$ yields the $\mathbf{v}$ eigenvectors of (B.6). To obtain the $\mathbf{w}$ eigenvectors of the alternative formulation (B.7), the change of variable $\mathbf{w} = \boldsymbol{\Theta}\mathbf{v}$ needs to be undone.

To summarize, we calculate the $\mathbf{v}$ eigenvectors from the eigen-decomposition of the tractable matrix $\mathbf{M}_{\boldsymbol{\Theta}}$, evaluated as $\boldsymbol{\Theta}^\top\mathbf{Y}^\top\mathbf{X}\mathbf{B}$. The definitive eigenvectors $\mathbf{w}$ are then recovered as $\mathbf{w} = \boldsymbol{\Theta}\mathbf{v}$. The final step is the reconstruction of the optimal score matrix using the vectors $\mathbf{w}$ as its columns. At this point we understand what is called in the literature "updating the initial score matrix": multiplying the initial $\boldsymbol{\Theta}$ by the eigenvector matrix $\mathbf{V}$ of decomposition (B.6) reverses the change of variable and restores the $\mathbf{w}$ vectors. The matrix $\mathbf{B}$ also needs to be "updated", by multiplying it by the same matrix of eigenvectors $\mathbf{V}$, in order to account for the initial $\boldsymbol{\Theta}$ used in the first computation of $\mathbf{B}$:
$$\mathbf{B}^\star = (\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}\boldsymbol{\Theta}\mathbf{V} = \mathbf{B}\mathbf{V} \ .$$
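The following NumPy sketch illustrates this update (illustrative only, not the GLOSS implementation; it assumes an initial score matrix Theta0 satisfying the constraint of (B.5) and a common penalty matrix Omega):

import numpy as np

def update_scores(X, Y, Theta0, Omega):
    # Solve the regression for the initial scores (a p x p linear system, no explicit inverse),
    # then eigen-decompose the small (K-1) x (K-1) matrix M_Theta = Theta0' Y' X B
    # instead of the K x K matrix M = Y' X (X'X + Omega)^{-1} X' Y.
    B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
    M_theta = Theta0.T @ Y.T @ X @ B0
    evals, V = np.linalg.eigh((M_theta + M_theta.T) / 2.0)
    V = V[:, np.argsort(evals)[::-1]]      # sort eigenvectors by decreasing eigenvalue
    Theta = Theta0 @ V                     # undo the change of variable w = Theta v
    B = B0 @ V                             # update the coefficients accordingly
    return Theta, B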



B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix $\boldsymbol{\Theta}$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $\mathbf{M} = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$.

By definition of the eigen-decomposition, the eigenvectors of $\mathbf{M}$ (called $\mathbf{w}$ in (B.7)) form a basis, so that any score vector $\boldsymbol{\theta}_k$ can be expressed as a linear combination of them:
$$\boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m \ , \quad \text{s.t.} \quad \boldsymbol{\theta}_k^\top\boldsymbol{\theta}_k = 1 \ . \qquad \text{(B.8)}$$

The score vector constraint $\boldsymbol{\theta}_k^\top\boldsymbol{\theta}_k = 1$ can also be expressed as a function of this basis,
$$\Big(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m\Big) = 1 \ ,$$
which, by the properties of eigenvectors, reduces to
$$\sum_{m=1}^{K-1}\alpha_m^2 = 1 \ . \qquad \text{(B.9)}$$

Let $\mathbf{M}$ be multiplied by a score vector $\boldsymbol{\theta}_k$, which can be replaced by its linear combination of eigenvectors $\mathbf{w}_m$ (B.8):
$$\mathbf{M}\boldsymbol{\theta}_k = \mathbf{M}\sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m = \sum_{m=1}^{K-1}\alpha_m\mathbf{M}\mathbf{w}_m \ .$$
As the $\mathbf{w}_m$ are the eigenvectors of $\mathbf{M}$, the relationship $\mathbf{M}\mathbf{w}_m = \lambda_m\mathbf{w}_m$ can be used to obtain
$$\mathbf{M}\boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m \ .$$

Left-multiplying both sides by $\boldsymbol{\theta}_k^\top$, written as its linear combination of eigenvectors, gives
$$\boldsymbol{\theta}_k^\top\mathbf{M}\boldsymbol{\theta}_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell\mathbf{w}_\ell\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m\mathbf{w}_m\Big) \ .$$
This equation can be simplified using the orthogonality property of eigenvectors, according to which $\mathbf{w}_\ell^\top\mathbf{w}_m$ is zero for any $\ell \neq m$, giving
$$\boldsymbol{\theta}_k^\top\mathbf{M}\boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \ .$$


The optimization problem (B.5) for discriminant direction $k$ can be rewritten as
$$\max_{\boldsymbol{\theta}_k \in \mathbb{R}^{K}} \ \boldsymbol{\theta}_k^\top\mathbf{M}\boldsymbol{\theta}_k = \max_{\boldsymbol{\theta}_k \in \mathbb{R}^{K}} \ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \qquad \text{(B.10)}$$
$$\text{with} \quad \boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m \quad \text{and} \quad \sum_{m=1}^{K-1}\alpha_m^2 = 1 \ .$$

One way of maximizing Problem (B.10) is to choose $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\boldsymbol{\theta}_k = \sum_{m=1}^{K-1}\alpha_m\mathbf{w}_m$, the resulting score vector $\boldsymbol{\theta}_k$ is equal to the $k$th eigenvector $\mathbf{w}_k$.

As a summary, it can be concluded that the solution of the original problem (B.1) can be obtained from an eigenvector decomposition of the matrix $\mathbf{M} = \mathbf{Y}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \boldsymbol{\Omega})^{-1}\mathbf{X}^\top\mathbf{Y}$.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

$$\max_{\boldsymbol{\beta} \in \mathbb{R}^p} \ \boldsymbol{\beta}^\top\boldsymbol{\Sigma}_B\boldsymbol{\beta} \qquad \text{(C.1a)}$$
$$\text{s.t.} \quad \boldsymbol{\beta}^\top\boldsymbol{\Sigma}_W\boldsymbol{\beta} = 1 \ , \qquad \text{(C.1b)}$$
where $\boldsymbol{\Sigma}_B$ and $\boldsymbol{\Sigma}_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.

The Lagrangian of Problem (C.1) is
$$\mathcal{L}(\boldsymbol{\beta}, \nu) = \boldsymbol{\beta}^\top\boldsymbol{\Sigma}_B\boldsymbol{\beta} - \nu(\boldsymbol{\beta}^\top\boldsymbol{\Sigma}_W\boldsymbol{\beta} - 1) \ ,$$
so that its first derivative with respect to $\boldsymbol{\beta}$ is
$$\frac{\partial\mathcal{L}(\boldsymbol{\beta}, \nu)}{\partial\boldsymbol{\beta}} = 2\boldsymbol{\Sigma}_B\boldsymbol{\beta} - 2\nu\boldsymbol{\Sigma}_W\boldsymbol{\beta} \ .$$

A necessary optimality condition for $\boldsymbol{\beta}^\star$ is that this derivative is zero, that is,
$$\boldsymbol{\Sigma}_B\boldsymbol{\beta}^\star = \nu\boldsymbol{\Sigma}_W\boldsymbol{\beta}^\star \ .$$
Provided $\boldsymbol{\Sigma}_W$ is full rank, we have
$$\boldsymbol{\Sigma}_W^{-1}\boldsymbol{\Sigma}_B\boldsymbol{\beta}^\star = \nu\boldsymbol{\beta}^\star \ . \qquad \text{(C.2)}$$

Thus, the solutions $\boldsymbol{\beta}^\star$ match the definition of an eigenvector of the matrix $\boldsymbol{\Sigma}_W^{-1}\boldsymbol{\Sigma}_B$ with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
$$\boldsymbol{\beta}^{\star\top}\boldsymbol{\Sigma}_B\boldsymbol{\beta}^\star = \boldsymbol{\beta}^{\star\top}\boldsymbol{\Sigma}_W\boldsymbol{\Sigma}_W^{-1}\boldsymbol{\Sigma}_B\boldsymbol{\beta}^\star = \nu\,\boldsymbol{\beta}^{\star\top}\boldsymbol{\Sigma}_W\boldsymbol{\beta}^\star \quad \text{from (C.2)} \quad = \nu \quad \text{from (C.1b)} \ .$$
That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence, $\nu$ is the largest eigenvalue of $\boldsymbol{\Sigma}_W^{-1}\boldsymbol{\Sigma}_B$ and $\boldsymbol{\beta}^\star$ is any eigenvector corresponding to this maximal eigenvalue.
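For illustration, a small NumPy/SciPy sketch of this result (our own, with hypothetical function names; it assumes $\boldsymbol{\Sigma}_W$ is invertible) solves the equivalent generalized eigen-problem $\boldsymbol{\Sigma}_B\boldsymbol{\beta} = \nu\boldsymbol{\Sigma}_W\boldsymbol{\beta}$:

import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, y):
    # Build the within- and between-class covariance matrices of Property 1,
    # then solve Sigma_B beta = nu Sigma_W beta and keep the leading eigenvector.
    n, p = X.shape
    xbar = X.mean(axis=0)
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk) / n
        Sb += len(Xk) * np.outer(mk - xbar, mk - xbar) / n
    evals, evecs = eigh(Sb, Sw)            # generalized symmetric eigen-problem
    return evecs[:, -1], evals[-1]         # beta and nu for the largest eigenvalue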


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\boldsymbol{\tau} \in \mathbb{R}^p} \ \min_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \ J(\mathbf{B}) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j} \qquad \text{(D.1a)}$$
$$\text{s.t.} \quad \sum_{j=1}^{p}\tau_j = 1 \ , \qquad \text{(D.1b)}$$
$$\qquad\ \ \tau_j \geq 0 \ , \quad j = 1, \ldots, p \ . \qquad \text{(D.1c)}$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$ be a matrix composed of row vectors $\boldsymbol{\beta}^j \in \mathbb{R}^{K-1}$, $\mathbf{B} = (\boldsymbol{\beta}^{1\top}, \ldots, \boldsymbol{\beta}^{p\top})^\top$.
$$\mathcal{L}(\mathbf{B}, \boldsymbol{\tau}, \lambda, \nu_0, \boldsymbol{\nu}) = J(\mathbf{B}) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j \ . \qquad \text{(D.2)}$$

The starting point is the Lagrangian (D.2), which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:
$$\frac{\partial\mathcal{L}(\mathbf{B}, \boldsymbol{\tau}, \lambda, \nu_0, \boldsymbol{\nu})}{\partial\tau_j}\bigg|_{\tau_j = \tau_j^\star} = 0 \ \Rightarrow\ -\lambda w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0 \ \Rightarrow\ -\lambda w_j^2\|\boldsymbol{\beta}^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0 \ \Rightarrow\ -\lambda w_j^2\|\boldsymbol{\beta}^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0 \ .$$
The last two expressions are related through the complementary slackness property of Lagrange multipliers, which states that $\nu_j g_j(\boldsymbol{\tau}^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\boldsymbol{\tau})$ is the corresponding inequality constraint. The optimal $\tau_j^\star$ can then be deduced:
$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j\|\boldsymbol{\beta}^j\|_2 \ .$$

Plugging this optimal value of $\tau_j$ into constraint (D.1b),
$$\sum_{j=1}^{p}\tau_j = 1 \ \Rightarrow\ \tau_j^\star = \frac{w_j\|\boldsymbol{\beta}^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{j'}\|_2} \ . \qquad \text{(D.3)}$$


With this value of $\tau_j^\star$, Problem (D.1) is equivalent to
$$\min_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \ J(\mathbf{B}) + \lambda\Big(\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2\Big)^2 \ . \qquad \text{(D.4)}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null row vectors $\boldsymbol{\beta}^j$.
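As a numerical illustration (a sketch of ours, with arbitrary values), the variational penalty of (D.1a) evaluated at the optimal $\tau^\star$ of (D.3) indeed coincides with the squared penalty of (D.4):

import numpy as np

rng = np.random.default_rng(0)
p, K = 8, 4
B = rng.normal(size=(p, K - 1))
B[rng.random(p) < 0.4] = 0.0                    # some null rows, as in a sparse solution
w = np.ones(p)

norms = np.linalg.norm(B, axis=1)
squared_penalty = np.sum(w * norms) ** 2        # penalty of (D.4)

tau = w * norms / np.sum(w * norms)             # optimal tau of (D.3)
active = tau > 0                                # terms with a null row vanish in (D.1a)
variational_penalty = np.sum((w[active] * norms[active]) ** 2 / tau[active])

print(np.isclose(squared_penalty, variational_penalty))   # True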

The penalty term of (D.1a) can be conveniently presented as $\lambda\mathbf{B}^\top\boldsymbol{\Omega}\mathbf{B}$, where
$$\boldsymbol{\Omega} = \mathrm{diag}\Big(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\Big) \ . \qquad \text{(D.5)}$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\boldsymbol{\Omega}$ is
$$(\boldsymbol{\Omega})_{jj} = \frac{w_j\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{j'}\|_2}{\|\boldsymbol{\beta}^j\|_2} \ . \qquad \text{(D.6)}$$

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $\mathbf{B} \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (D.4) is
$$\bigg\{\mathbf{V} \in \mathbb{R}^{p \times (K-1)} : \mathbf{V} = \frac{\partial J(\mathbf{B})}{\partial\mathbf{B}} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2\Big)\mathbf{G}\bigg\} \ , \qquad \text{(D.7)}$$
where $\mathbf{G} = (\mathbf{g}^{1\top}, \ldots, \mathbf{g}^{p\top})^\top$ is a $p \times (K-1)$ matrix defined as follows. Let $\mathcal{S}(\mathbf{B})$ denote the row support of $\mathbf{B}$, $\mathcal{S}(\mathbf{B}) = \{j \in 1, \ldots, p : \|\boldsymbol{\beta}^j\|_2 \neq 0\}$; then we have
$$\forall j \in \mathcal{S}(\mathbf{B}) \ , \quad \mathbf{g}^j = w_j\|\boldsymbol{\beta}^j\|_2^{-1}\boldsymbol{\beta}^j \ , \qquad \text{(D.8)}$$
$$\forall j \notin \mathcal{S}(\mathbf{B}) \ , \quad \|\mathbf{g}^j\|_2 \leq w_j \ . \qquad \text{(D.9)}$$


This condition results in an equality for the "active" non-zero row vectors $\boldsymbol{\beta}^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(\mathbf{B})$ is strictly convex. All critical points $\mathbf{B}^\star$ of the objective function verifying the following conditions are global minima. Let $\mathcal{S}(\mathbf{B}^\star)$ denote the row support of $\mathbf{B}^\star$, $\mathcal{S}(\mathbf{B}^\star) = \{j \in 1, \ldots, p : \|\boldsymbol{\beta}^{\star j}\|_2 \neq 0\}$, and let $\bar{\mathcal{S}}(\mathbf{B}^\star)$ be its complement; then we have
$$\forall j \in \mathcal{S}(\mathbf{B}^\star) \ , \quad -\frac{\partial J(\mathbf{B}^\star)}{\partial\boldsymbol{\beta}^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{\star j'}\|_2\Big)\, w_j\|\boldsymbol{\beta}^{\star j}\|_2^{-1}\boldsymbol{\beta}^{\star j} \ , \qquad \text{(D.10a)}$$
$$\forall j \in \bar{\mathcal{S}}(\mathbf{B}^\star) \ , \quad \bigg\|\frac{\partial J(\mathbf{B}^\star)}{\partial\boldsymbol{\beta}^j}\bigg\|_2 \leq 2\lambda w_j\Big(\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{\star j'}\|_2\Big) \ . \qquad \text{(D.10b)}$$

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $\mathbf{B}$, the gap between these objectives is null at $\boldsymbol{\tau}^\star$ such that
$$\tau_j^\star = \frac{w_j\|\boldsymbol{\beta}^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\boldsymbol{\beta}^{j'}\|_2} \ .$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\boldsymbol{\tau} \in \mathbb{R}^p$ be any feasible vector; we have
$$\Big(\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2\Big)^2 = \Bigg(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\boldsymbol{\beta}^j\|_2}{\tau_j^{1/2}}\Bigg)^2 \leq \Big(\sum_{j=1}^{p}\tau_j\Big)\Bigg(\sum_{j=1}^{p} w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j}\Bigg) \leq \sum_{j=1}^{p} w_j^2\,\frac{\|\boldsymbol{\beta}^j\|_2^2}{\tau_j} \ ,$$
where we used the Cauchy–Schwarz inequality for the first inequality and the definition of the feasibility set of $\boldsymbol{\tau}$ for the second one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because there the definitions of the feasible sets of $\boldsymbol{\tau}$ and $\boldsymbol{\beta}$ are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $\mathbf{B}^0$ are optimal for the score values $\boldsymbol{\Theta}^0$, and if the optimal scores $\boldsymbol{\Theta}^\star$ are obtained by a unitary transformation of $\boldsymbol{\Theta}^0$, say $\boldsymbol{\Theta}^\star = \boldsymbol{\Theta}^0\mathbf{V}$ (where $\mathbf{V} \in \mathbb{R}^{M \times M}$ is a unitary matrix), then $\mathbf{B}^\star = \mathbf{B}^0\mathbf{V}$ is optimal conditionally on $\boldsymbol{\Theta}^\star$, that is, $(\boldsymbol{\Theta}^\star, \mathbf{B}^\star)$ is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $\mathbf{B}^\star$ be a solution of
$$\min_{\mathbf{B} \in \mathbb{R}^{p \times M}} \ \|\mathbf{Y} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2 \ , \qquad \text{(E.1)}$$
and let $\tilde{\mathbf{Y}} = \mathbf{Y}\mathbf{V}$, where $\mathbf{V} \in \mathbb{R}^{M \times M}$ is a unitary matrix. Then $\tilde{\mathbf{B}} = \mathbf{B}^\star\mathbf{V}$ is a solution of
$$\min_{\mathbf{B} \in \mathbb{R}^{p \times M}} \ \|\tilde{\mathbf{Y}} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\boldsymbol{\beta}^j\|_2 \ . \qquad \text{(E.2)}$$

Proof. The first-order necessary optimality conditions for $\mathbf{B}^\star$ are
$$\forall j \in \mathcal{S}(\mathbf{B}^\star) \ , \quad 2\mathbf{x}^{j\top}\big(\mathbf{x}^j\boldsymbol{\beta}^{\star j} - \mathbf{Y}\big) + \lambda w_j\|\boldsymbol{\beta}^{\star j}\|_2^{-1}\boldsymbol{\beta}^{\star j} = \mathbf{0} \ , \qquad \text{(E.3a)}$$
$$\forall j \notin \mathcal{S}(\mathbf{B}^\star) \ , \quad 2\big\|\mathbf{x}^{j\top}\big(\mathbf{x}^j\boldsymbol{\beta}^{\star j} - \mathbf{Y}\big)\big\|_2 \leq \lambda w_j \ , \qquad \text{(E.3b)}$$
where $\mathcal{S}(\mathbf{B}^\star) \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors of $\mathbf{B}^\star$ and $\bar{\mathcal{S}}(\mathbf{B}^\star)$ is its complement.

First, we note that, from the definition of $\tilde{\mathbf{B}}$, we have $\mathcal{S}(\tilde{\mathbf{B}}) = \mathcal{S}(\mathbf{B}^\star)$. Then we may rewrite the above conditions as follows:
$$\forall j \in \mathcal{S}(\tilde{\mathbf{B}}) \ , \quad 2\mathbf{x}^{j\top}\big(\mathbf{x}^j\tilde{\boldsymbol{\beta}}^{j} - \tilde{\mathbf{Y}}\big) + \lambda w_j\|\tilde{\boldsymbol{\beta}}^{j}\|_2^{-1}\tilde{\boldsymbol{\beta}}^{j} = \mathbf{0} \ , \qquad \text{(E.4a)}$$
$$\forall j \notin \mathcal{S}(\tilde{\mathbf{B}}) \ , \quad 2\big\|\mathbf{x}^{j\top}\big(\mathbf{x}^j\tilde{\boldsymbol{\beta}}^{j} - \tilde{\mathbf{Y}}\big)\big\|_2 \leq \lambda w_j \ , \qquad \text{(E.4b)}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $\mathbf{V}$, using that $\mathbf{V}\mathbf{V}^\top = \mathbf{I}$, so that $\forall\mathbf{u} \in \mathbb{R}^M$, $\|\mathbf{u}^\top\|_2 = \|\mathbf{u}^\top\mathbf{V}\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{\mathbf{B}}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
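A quick numerical check of this invariance (a sketch of ours; the names and values are illustrative) shows that the group-Lasso objective takes the same value at $(\mathbf{B}, \mathbf{Y})$ and $(\mathbf{B}\mathbf{V}, \mathbf{Y}\mathbf{V})$ for any $\mathbf{B}$, which is the property underlying Proposition E.1:

import numpy as np

rng = np.random.default_rng(1)
n, p, M = 30, 10, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))
V, _ = np.linalg.qr(rng.normal(size=(M, M)))    # a random unitary (orthogonal) matrix
lam, w = 0.5, np.ones(p)

def objective(B, Y):
    # Frobenius loss plus weighted group-Lasso penalty on the rows of B
    return np.linalg.norm(Y - X @ B, "fro") ** 2 + lam * np.sum(w * np.linalg.norm(B, axis=1))

print(np.isclose(objective(B, Y), objective(B @ V, Y @ V)))   # True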


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\boldsymbol{\theta}, \boldsymbol{\theta}')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from $Q(\boldsymbol{\theta}, \boldsymbol{\theta}')$ when the latter is available:

$$\mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^{n}\log\Bigg(\sum_{k=1}^{K}\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\Bigg) \ , \qquad \text{(F.1)}$$
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\boldsymbol{\theta}')\log\big(\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\big) \ , \qquad \text{(F.2)}$$
$$\text{with} \quad t_{ik}(\boldsymbol{\theta}') = \frac{\pi_k' f_k(\mathbf{x}_i; \boldsymbol{\theta}_k')}{\sum_{\ell}\pi_\ell' f_\ell(\mathbf{x}_i; \boldsymbol{\theta}_\ell')} \ . \qquad \text{(F.3)}$$

In the EM algorithm, $\boldsymbol{\theta}'$ denotes the model parameters at the previous iteration, $t_{ik}(\boldsymbol{\theta}')$ are the posterior probabilities computed from $\boldsymbol{\theta}'$ at the previous E-step, and $\boldsymbol{\theta}$ without prime denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\boldsymbol{\theta}, \boldsymbol{\theta}')$.

Using (F.3), we have
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}') = \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\big(\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\big) = \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\big(t_{ik}(\boldsymbol{\theta})\big) + \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\Bigg(\sum_{\ell}\pi_\ell f_\ell(\mathbf{x}_i; \boldsymbol{\theta}_\ell)\Bigg) = \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\big(t_{ik}(\boldsymbol{\theta})\big) + \mathcal{L}(\boldsymbol{\theta}) \ .$$

In particular, after the evaluation of $t_{ik}$ in the E-step, where $\boldsymbol{\theta} = \boldsymbol{\theta}'$, the log-likelihood can be computed from the value of $Q(\boldsymbol{\theta}, \boldsymbol{\theta})$ (7.7) and the entropy of the posterior probabilities:
$$\mathcal{L}(\boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}) - \sum_{i,k} t_{ik}(\boldsymbol{\theta})\log\big(t_{ik}(\boldsymbol{\theta})\big) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}) + H(\mathbf{T}) \ .$$
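This identity is easy to check numerically; the sketch below (ours, with arbitrary toy parameters, not taken from the thesis experiments) evaluates both sides for a two-component Gaussian mixture:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
pis = np.array([0.4, 0.6])
mus = [np.zeros(2), np.ones(2)]
Sigma = np.eye(2)

dens = np.column_stack([pi * multivariate_normal(mu, Sigma).pdf(X)
                        for pi, mu in zip(pis, mus)])   # pi_k f_k(x_i), an n x K array
L = np.sum(np.log(dens.sum(axis=1)))                    # log-likelihood (F.1)
T = dens / dens.sum(axis=1, keepdims=True)              # posteriors t_ik (F.3), E-step
Q = np.sum(T * np.log(dens))                            # expected complete log-likelihood (F.2)
H = -np.sum(T * np.log(T))                              # entropy of the posteriors
print(np.isclose(L, Q + H))                             # True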


G Derivation of the M-Step Equations

This appendix shows the whole process of obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}') = \sum_{i,k} t_{ik}(\boldsymbol{\theta}')\log\big(\pi_k f_k(\mathbf{x}_i; \boldsymbol{\theta}_k)\big) = \sum_{k}\log(\pi_k)\sum_{i} t_{ik} - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{i,k} t_{ik}(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}_k) \ ,$$
which has to be maximized subject to $\sum_k\pi_k = 1$.

The Lagrangian of this problem is
$$\mathcal{L}(\boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}') + \lambda\Big(\sum_{k}\pi_k - 1\Big) \ .$$
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$.

G.1 Prior probabilities

$$\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\pi_k} = 0 \iff \frac{1}{\pi_k}\sum_{i} t_{ik} + \lambda = 0 \ ,$$
where $\lambda$ is identified from the constraint, leading to
$$\pi_k = \frac{1}{n}\sum_{i} t_{ik} \ .$$


G.2 Means

$$\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\mu}_k} = 0 \iff -\frac{1}{2}\sum_{i} t_{ik}\, 2\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_k - \mathbf{x}_i) = 0 \ \Rightarrow\ \boldsymbol{\mu}_k = \frac{\sum_{i} t_{ik}\mathbf{x}_i}{\sum_{i} t_{ik}} \ .$$

G.3 Covariance Matrix

$$\frac{\partial\mathcal{L}(\boldsymbol{\theta})}{\partial\boldsymbol{\Sigma}^{-1}} = 0 \iff \underbrace{\frac{n}{2}\boldsymbol{\Sigma}}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}_{\text{as per Property 5}} = 0 \ \Rightarrow\ \boldsymbol{\Sigma} = \frac{1}{n}\sum_{i,k} t_{ik}(\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \ .$$
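Gathered in code, the three updates read as follows (a NumPy sketch of ours, not the Mix-GLOSS implementation; T is assumed to be the n × K matrix of posteriors $t_{ik}$ produced by the E-step):

import numpy as np

def m_step(X, T):
    # M-step of a Gaussian mixture with a common covariance matrix (Appendix G).
    n, p = X.shape
    nk = T.sum(axis=0)                      # sum_i t_ik
    pis = nk / n                            # G.1: pi_k = (1/n) sum_i t_ik
    mus = (T.T @ X) / nk[:, None]           # G.2: mu_k = sum_i t_ik x_i / sum_i t_ik
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):             # G.3: common covariance matrix
        R = X - mus[k]
        Sigma += (T[:, k, None] * R).T @ R
    return pis, mus, Sigma / n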


Bibliography

F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisherrsquos linear discriminant function lsquonaiveBayesrsquo and some alternatives when there are many more variables than observationsBernoulli 10(6)989ndash1010 2004

C Bienarcki G Celeux G Govaert and F Langrognet MIXMOD Statistical Docu-mentation httpwwwmixmodorg 2008

C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chap-man amp HallCRC 2003

D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation base-line for clustering Technical Report D71-m12 Massive Sets of Heuristics for MachineLearning httpssecuremash-projecteufilesmash-deliverable-D71-m12pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implemen-tations of original clustering Technical Report D72-m24 Massive Sets of Heuristicsfor Machine Learning httpssecuremash-projecteufilesmash-deliverable-D72-m24pdf 2011

Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and ML Martin-Magniette Selvarclust software for variable se-lection in model-based clustering rdquohttpwwwmathuniv-toulousefr~maugisSelvarClustHomepagehtmlrdquo 2009b

L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

Y Nesterov Gradient methods for minimizing composite functions preprint 2007

S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour lrsquoIntegration de Connaissances dans lesModeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005



    Par Luis Francisco SANCHEZ MERCHANTE

    Thegravese preacutesenteacutee pour lrsquoobtention du grade de Docteur de lrsquoUTC

    Learning algorithms for sparse classification

    Soutenue le 07 juin 2013

    Speacutecialiteacute Technologies de lrsquoInformation et des Systegravemes

    D2084

    Algorithmes drsquoestimation pour laclassification parcimonieuse

    Luis Francisco Sanchez MerchanteUniversity of Compiegne

    CompiegneFrance

    ldquoNunca se sabe que encontrara uno tras una puerta Quiza en eso consistela vida en girar pomosrdquo

    Albert Espinosa

    ldquoBe brave Take risks Nothing can substitute experiencerdquo

    Paulo Coelho

    Acknowledgements

    If this thesis has fallen into your hands and you have the curiosity to read this para-graph you must know that even though it is a short section there are quite a lot ofpeople behind this volume All of them supported me during the three years threemonths and three weeks that it took me to finish this work However you will hardlyfind any names I think it is a little sad writing peoplersquos names in a document that theywill probably not see and that will be condemned to gather dust on a bookshelf It islike losing a wallet with pictures of your beloved family and friends It makes me feelsomething like melancholy

    Obviously this does not mean that I have nothing to be grateful for I always feltunconditional love and support from my family and I never felt homesick since my spanishfriends did the best they could to visit me frequently During my time in CompiegneI met wonderful people that are now friends for life I am sure that all this people donot need to be listed in this section to know how much I love them I thank them everytime we see each other by giving them the best of myself

    I enjoyed my time in Compiegne It was an exciting adventure and I do not regreta single thing I am sure that I will miss these days but this does not make me sadbecause as the Beatles sang in ldquoThe endrdquo or Jorge Drexler in ldquoTodo se transformardquo theamount that you miss people is equal to the love you gave them and received from them

    The only names I am including are my supervisorsrsquo Yves Grandvalet and GerardGovaert I do not think it is possible to have had better teaching and supervision andI am sure that the reason I finished this work was not only thanks to their technicaladvice but also but also thanks to their close support humanity and patience

    Contents

    List of figures v

    List of tables vii

    Notation and Symbols ix

    I Context and Foundations 1

    1 Context 5

    2 Regularization for Feature Selection 921 Motivations 9

    2.2 Categorization of Feature Selection Techniques
    2.3 Regularization
        2.3.1 Important Properties
        2.3.2 Pure Penalties
        2.3.3 Hybrid Penalties
        2.3.4 Mixed Penalties
        2.3.5 Sparsity Considerations
        2.3.6 Optimization Tools for Regularized Problems

    II Sparse Linear Discriminant Analysis

    Abstract

    3 Feature Selection in Fisher Discriminant Analysis
        3.1 Fisher Discriminant Analysis
        3.2 Feature Selection in LDA Problems
            3.2.1 Inertia Based
            3.2.2 Regression Based

    4 Formalizing the Objective
        4.1 From Optimal Scoring to Linear Discriminant Analysis
            4.1.1 Penalized Optimal Scoring Problem
            4.1.2 Penalized Canonical Correlation Analysis
            4.1.3 Penalized Linear Discriminant Analysis
            4.1.4 Summary
        4.2 Practicalities
            4.2.1 Solution of the Penalized Optimal Scoring Regression
            4.2.2 Distance Evaluation
            4.2.3 Posterior Probability Evaluation
            4.2.4 Graphical Representation
        4.3 From Sparse Optimal Scoring to Sparse LDA
            4.3.1 A Quadratic Variational Form
            4.3.2 Group-Lasso OS as Penalized LDA

    5 GLOSS Algorithm
        5.1 Regression Coefficients Updates
            5.1.1 Cholesky decomposition
            5.1.2 Numerical Stability
        5.2 Score Matrix
        5.3 Optimality Conditions
        5.4 Active and Inactive Sets
        5.5 Penalty Parameter
        5.6 Options and Variants
            5.6.1 Scaling Variables
            5.6.2 Sparse Variant
            5.6.3 Diagonal Variant
            5.6.4 Elastic net and Structured Variant

    6 Experimental Results
        6.1 Normalization
        6.2 Decision Thresholds
        6.3 Simulated Data
        6.4 Gene Expression Data
        6.5 Correlated Data
        Discussion

    III Sparse Clustering Analysis

    Abstract

    7 Feature Selection in Mixture Models
        7.1 Mixture Models
            7.1.1 Model
            7.1.2 Parameter Estimation: The EM Algorithm
        7.2 Feature Selection in Model-Based Clustering
            7.2.1 Based on Penalized Likelihood
            7.2.2 Based on Model Variants
            7.2.3 Based on Model Selection

    8 Theoretical Foundations
        8.1 Resolving EM with Optimal Scoring
            8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
            8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
            8.1.3 Clustering Using Penalized Optimal Scoring
            8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
        8.2 Optimized Criterion
            8.2.1 A Bayesian Derivation
            8.2.2 Maximum a Posteriori Estimator

    9 Mix-GLOSS Algorithm
        9.1 Mix-GLOSS
            9.1.1 Outer Loop: Whole Algorithm Repetitions
            9.1.2 Penalty Parameter Loop
            9.1.3 Inner Loop: EM Algorithm
        9.2 Model Selection

    10 Experimental Results
        10.1 Tested Clustering Algorithms
        10.2 Results
        10.3 Discussion

    Conclusions

    Appendix

    A Matrix Properties

    B The Penalized-OS Problem is an Eigenvector Problem
        B.1 How to Solve the Eigenvector Decomposition
        B.2 Why the OS Problem is Solved as an Eigenvector Problem

    C Solving Fisher's Discriminant Problem

    D Alternative Variational Formulation for the Group-Lasso
        D.1 Useful Properties
        D.2 An Upper Bound on the Objective Function

    E Invariance of the Group-Lasso to Unitary Transformations

    F Expected Complete Likelihood and Likelihood

    G Derivation of the M-Step Equations
        G.1 Prior probabilities
        G.2 Means
        G.3 Covariance Matrix

    Bibliography

    List of Figures

    1.1 MASH project logo
    2.1 Example of relevant features
    2.2 Four key steps of feature selection
    2.3 Admissible sets in two dimensions for different pure norms ||β||_p
    2.4 Two dimensional regularized problems with ||β||_1 and ||β||_2 penalties
    2.5 Admissible sets for the Lasso and Group-Lasso
    2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
    4.1 Graphical representation of the variational approach to Group-Lasso
    5.1 GLOSS block diagram
    5.2 Graph and Laplacian matrix for a 3 × 3 image
    6.1 TPR versus FPR for all simulations
    6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
    6.3 USPS digits "1" and "0"
    6.4 Discriminant direction between digits "1" and "0"
    6.5 Sparse discriminant direction between digits "1" and "0"
    9.1 Mix-GLOSS Loops Scheme
    9.2 Mix-GLOSS model selection diagram
    10.1 Class mean vectors for each artificial simulation
    10.2 TPR versus FPR for all simulations

    List of Tables

    6.1 Experimental results for simulated data, supervised classification
    6.2 Average TPR and FPR for all simulations
    6.3 Experimental results for gene expression data, supervised classification
    10.1 Experimental results for simulated data, unsupervised clustering
    10.2 Average TPR versus FPR for all clustering simulations

    Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

    Sets

N        the set of natural numbers, N = {1, 2, ...}
R        the set of reals
|A|      cardinality of a set A (for finite sets, the number of elements)
A̅        complement of set A

    Data

X        input domain
x_i      input sample, x_i ∈ X
X        design matrix, X = (x_1^T, ..., x_n^T)^T
x^j      column j of X
y_i      class indicator of sample i
Y        indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z        complete data, z = (x, y)
G_k      set of the indices of observations belonging to class k
n        number of examples
K        number of classes
p        dimension of X
i, j, k  indices running over N

Vectors, Matrices and Norms

0          vector with all entries equal to zero
1          vector with all entries equal to one
I          identity matrix
A^T        transpose of matrix A (ditto for vectors)
A^{-1}     inverse of matrix A
tr(A)      trace of matrix A
|A|        determinant of matrix A
diag(v)    diagonal matrix with v on the diagonal
||v||_1    L1 norm of vector v
||v||_2    L2 norm of vector v
||A||_F    Frobenius norm of matrix A


    Probability

E[·]         expectation of a random variable
var[·]       variance of a random variable
N(μ, σ²)     normal distribution with mean μ and variance σ²
W(W, ν)      Wishart distribution with ν degrees of freedom and W scale matrix
H(X)         entropy of random variable X
I(X; Y)      mutual information between random variables X and Y

    Mixture Models

y_ik          hard membership of sample i to cluster k
f_k           distribution function for cluster k
t_ik          posterior probability of sample i to belong to cluster k
T             posterior probability matrix
π_k           prior probability or mixture proportion for cluster k
μ_k           mean vector of cluster k
Σ_k           covariance matrix of cluster k
θ_k           parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)         parameter vector at iteration t of the EM algorithm
f(X; θ)       likelihood function
L(θ; X)       log-likelihood function
L_C(θ; X, Y)  complete log-likelihood function

    Optimization

J(·)          cost function
L(·)          Lagrangian
β̂             generic notation for the solution with respect to β
β^ls          least squares solution coefficient vector
A             active set
γ             step size to update the regularization path
h             direction to update the regularization path


    Penalized models

λ, λ_1, λ_2    penalty parameters
P_λ(θ)         penalty term over a generic parameter vector
β_kj           coefficient j of discriminant vector k
β_k            kth discriminant vector, β_k = (β_k1, ..., β_kp)
B              matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j            jth row of B = (β^{1T}, ..., β^{pT})^T
B_LDA          coefficient matrix in the LDA domain
B_CCA          coefficient matrix in the CCA domain
B_OS           coefficient matrix in the OS domain
X_LDA          data matrix in the LDA domain
X_CCA          data matrix in the CCA domain
X_OS           data matrix in the OS domain
θ_k            score vector k
Θ              score matrix, Θ = (θ_1, ..., θ_{K-1})
Y              label matrix
Ω              penalty matrix
L_CP(θ; X, Z)  penalized complete log-likelihood function
Σ_B            between-class covariance matrix
Σ_W            within-class covariance matrix
Σ_T            total covariance matrix
Σ̂_B            sample between-class covariance matrix
Σ̂_W            sample within-class covariance matrix
Σ̂_T            sample total covariance matrix
Λ              inverse of covariance matrix, or precision matrix
w_j            weights
τ_j            penalty components of the variational approach


    Part I

    Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. The generic models and some basic concepts that will be used along this document are also detailed here, and the state of the art of feature selection by regularization is reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state of the art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


    1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiegne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, some of them must share the same theoretical principles or supply similar features. The framework of the project tests every new piece of code with some reference datasets in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of the Pearson correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


    2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical imaging are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems, the computational complexity increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environment, and provides easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in data preprocessing, used to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a categorizing framework that integrates supervised and unsupervised feature selection algorithms. Both references are excellent reviews to characterize feature selection techniques. I am proposing a framework inspired by these references; it does not cover all the possibilities, but it gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

  – Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

  – Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

  – Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

  – Complete - No subsets are missed from evaluation. Involves combinatorial searches.

  – Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

  – Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

  – Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

  – Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

  – Dependency Measures - Measuring the correlation between features.

  – Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

  – Predictive Accuracy - Use the selected features to predict the labels.

  – Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness make it possible to evaluate subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues of ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_{\beta} \; J(\beta) + \lambda P(\beta)   (2.1)

\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t   (2.2)

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)   (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ||β||_1 and ||β||_2 penalties

Regularizing a linear model with a norm like ||β||_p means that the larger the component |β_j|, the more important the feature x^j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x^j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^ℓ1, and for the L2 regularization it is β^ℓ2. Solution β^ℓ1 is sparse because its second component is zero, while both components of β^ℓ2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes, which would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.

To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with L_p norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||_0 = card{β_j | β_j ≠ 0}:

\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t   (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. It has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t   (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not just a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions where Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used, to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2   (2.6)

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2   (2.7)

with solution β^ls = (X^T X)^{-1} X^T y. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

The solution to this problem is β^ℓ2 = (X^T X + λ I_p)^{-1} X^T y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
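To make these closed-form expressions concrete, here is a minimal NumPy sketch (my own illustration, not from the thesis; the simulated data, penalty level and variable names are invented) comparing the least squares and ridge estimators on a design with two nearly collinear columns:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 30, 5
    X = rng.normal(size=(n, p))
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)      # two nearly collinear columns
    beta_true = np.array([1.0, 1.0, 0.0, 0.5, -0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    lam = 1.0
    # Least squares: (X'X)^{-1} X'y, unstable when X'X is ill-conditioned
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
    # Ridge: (X'X + lam I)^{-1} X'y, every eigenvalue is shifted upwards by lam
    beta_l2 = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    print("cond(X'X)         =", np.linalg.cond(X.T @ X))
    print("cond(X'X + lam I) =", np.linalg.cond(X.T @ X + lam * np.eye(p)))
    print("beta_ls =", np.round(beta_ls, 2))
    print("beta_l2 =", np.round(beta_l2, 2))

The condition number drops sharply once λI_p is added, which is precisely the "equalization" effect described above.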

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\hat{\beta}_j^{ls})^2}   (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component: every λ_j is optimized to penalize more or less, depending on the influence of β_j in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of L_p norms is the infinity norm, defined as ||x||_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ||β||_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ||β||_* of a norm ||β|| is defined as

\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \; \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1

In the case of an L_q norm with q ∈ [1, +∞], the dual norm is the L_r norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not so popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation; we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2   (2.9)

The term in λ_1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ_2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us identify as G_ℓ the group of genes for the ℓth process and d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β is the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^{L} d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \right)^{1/r}   (2.10)

The pair (r, s) identifies the norms that are combined: an L_s norm within groups and an L_r norm between groups. The L_s norm penalizes the variables in every group G_ℓ, while the L_r norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
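As a concrete illustration of definition (2.10), the following short NumPy sketch (mine, with invented coefficients and groups) evaluates the mixed norm for an arbitrary group structure; the optional weighting of the outer norm mentioned above is omitted for simplicity:

    import numpy as np

    def mixed_norm(beta, groups, r, s):
        # ||beta||_(r,s): an L_s norm within each group, then an L_r norm between groups
        within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
        return np.sum(within ** r) ** (1.0 / r)

    beta = np.array([0.0, 0.0, 0.0, 1.5, -2.0, 0.3, 0.0, -0.7])
    groups = [[0, 1, 2], [3, 4, 5], [6, 7]]               # G_1, G_2, G_3

    print("group-Lasso norm ||beta||_(1,2):", mixed_norm(beta, groups, r=1, s=2))
    print("plain L1 norm ||beta||_1       :", np.sum(np.abs(beta)))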

Several combinations are available; the most popular is the norm ||β||_{(1,2)}, known as group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L_{1,2} norm. Many other mixings are possible, such as ||β||_{(1,4/3)} (Szafranski et al., 2008) or ||β||_{(1,∞)} (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L_{1,2} or L_{1,∞} mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for (a) the Lasso (L1) and (b) the group-Lasso (L_{(1,2)})

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1 induced sparsity; (b) L_{(1,2)} group induced sparsity
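The difference between the two sparsity patterns can be checked programmatically. The toy sketch below (my own illustration; the coefficient values are made up so that the zeroed rows match the variables 3, 5 and 8 of Figure 2.6) stores the 8 × 4 matrix of parameters and shows that only a row-wise (group) pattern allows variables to be dismissed:

    import numpy as np

    # 8 variables (rows) described by 4 parameters (columns), as in Figure 2.6
    B_entrywise = np.array([[0.9, 0.0, 0.0, 0.4],
                            [0.0, 0.7, 0.0, 0.0],
                            [0.0, 0.0, 0.3, 0.0],
                            [0.5, 0.0, 0.0, 0.0],
                            [0.0, 0.2, 0.0, 0.1],
                            [0.0, 0.0, 0.0, 0.6],
                            [0.8, 0.0, 0.1, 0.0],
                            [0.0, 0.0, 0.4, 0.0]])
    B_groupwise = B_entrywise.copy()
    B_groupwise[[2, 4, 7], :] = 0.0        # a group penalty on rows zeroes whole rows

    def removable(B):
        # a variable can be dismissed only if all of its parameters are null
        return 1 + np.flatnonzero(np.all(B == 0.0, axis=1))   # 1-based variable indices

    print("removable with entry-wise sparsity:", removable(B_entrywise))   # none
    print("removable with group sparsity     :", removable(B_groupwise))   # variables 3, 5, 8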

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha (\mathbf{s} + \lambda \mathbf{s}') \quad \text{where } \mathbf{s} \in \partial J(\beta^{(t)}), \; \mathbf{s}' \in \partial P(\beta^{(t)})
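For illustration only (not part of the thesis), here is a minimal NumPy sketch of this update for the Lasso setting, with J(β) = Σ_i (y_i − x_i'β)² and P(β) = ||β||_1, taking sign(β) as a subgradient of the penalty. The step size, number of iterations and data are arbitrary choices of mine; note that, as stated above, the iterates get small but are essentially never exactly zero:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 10
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:3] = [2.0, -1.0, 0.5]
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    lam, alpha = 5.0, 1e-3
    beta = np.zeros(p)
    for t in range(5000):
        s = -2.0 * X.T @ (y - X @ beta)      # gradient of the quadratic loss J
        s_prime = np.sign(beta)              # a subgradient of the L1 penalty P
        beta = beta - alpha * (s + lam * s_prime)

    print("exact zeros in beta:", int(np.sum(beta == 0.0)))   # typically 0: not sparse
    print(np.round(beta, 3))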

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2}

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating their values using an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\[1ex]
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\[1ex]
0 & \text{if } \left|\partial J(\beta)/\partial \beta_j\right| \le \lambda
\end{cases}   (2.11)

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
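The sketch below (my own didactic implementation, not the algorithm of Fu (1998) nor the one developed in this thesis) applies this coordinate-wise soft-thresholding update to the Lasso with a quadratic loss; the data and the penalty level are invented for the example:

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
        """Cyclic coordinate descent for min_beta sum_i (y_i - x_i' beta)^2 + lam sum_j |beta_j|."""
        n, p = X.shape
        beta = np.zeros(p)
        denom = 2.0 * np.sum(X ** 2, axis=0)              # 2 sum_i x_ij^2
        for _ in range(n_sweeps):
            for j in range(p):
                r_j = y - X @ beta + X[:, j] * beta[j]    # partial residual, variable j left out
                z = 2.0 * X[:, j] @ r_j                   # -dJ/dbeta_j evaluated at beta_j = 0
                beta[j] = soft_threshold(z, lam) / denom[j]
        return beta

    rng = np.random.default_rng(2)
    X = rng.normal(size=(40, 8))
    beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.0])
    y = X @ beta_true + 0.1 * rng.normal(size=40)
    print(np.round(lasso_coordinate_descent(X, y, lam=10.0), 3))

Unlike the subgradient iteration, each update is an exact minimization in one coordinate, so irrelevant coefficients are set exactly to zero.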

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of variables with non-zero β_j. It is usually identified as the set A. The complement of the active set is the "inactive set", noted A̅; in the inactive set we can find the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There exists also a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, the solutions of successive active sets are typically close to each other, so it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set A̅ that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set, and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L_{1,2} penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
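The following sketch (a simplified illustration of mine, applied to the Lasso with a quadratic loss; it is neither GLOSS nor the algorithm of Osborne et al. (2000a)) shows how the three tasks interact: an inner solver restricted to the active set, a check of the optimality conditions on the inactive set, and a working set update with the most violating variable:

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_working_set(X, y, lam, tol=1e-6, max_outer=100):
        n, p = X.shape
        beta = np.zeros(p)
        active = []                                   # forward growing: start from an empty A
        for _ in range(max_outer):
            # 1) optimization task, restricted to the active set (warm-started by construction)
            for _ in range(100):
                for j in active:
                    r_j = y - X @ beta + X[:, j] * beta[j]
                    beta[j] = soft_threshold(2.0 * X[:, j] @ r_j, lam) / (2.0 * X[:, j] @ X[:, j])
            # 2) optimality conditions: every inactive j must satisfy |2 x_j'(y - X beta)| <= lam
            violation = np.abs(2.0 * X.T @ (y - X @ beta)) - lam
            violation[active] = -np.inf               # active variables are handled by the inner solver
            j_star = int(np.argmax(violation))
            if violation[j_star] <= tol:
                break                                 # no violation left: beta is (approximately) optimal
            # 3) working set update: add the most violating variable
            active.append(j_star)
        return beta, sorted(active)

    rng = np.random.default_rng(3)
    X = rng.normal(size=(60, 20))
    beta_true = np.zeros(20)
    beta_true[[0, 5, 12]] = [2.0, -1.5, 1.0]
    y = X @ beta_true + 0.1 * rng.normal(size=60)
    beta_hat, active = lasso_working_set(X, y, lam=15.0)
    print("active set:", active)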

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and A̅^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size and which variable should enter the active set from the correlation with the residuals.
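LARS follows the exact piecewise-linear path with this active set bookkeeping; the toy sketch below (mine) only illustrates the idea in the special case of an orthonormal design, where the Lasso path has the closed form β_j(λ) = S(β_j^ls, λ/2) and its piecewise linearity in λ can be read directly from the output:

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    rng = np.random.default_rng(4)
    Q, _ = np.linalg.qr(rng.normal(size=(50, 5)))       # orthonormal design: Q'Q = I
    beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
    y = Q @ beta_true + 0.05 * rng.normal(size=50)
    beta_ls = Q.T @ y                                   # least squares coefficients

    lam_max = 2.0 * np.max(np.abs(beta_ls))             # smallest lam for which beta(lam) = 0
    for lam in np.linspace(lam_max, 0.0, 6):
        print(f"lam = {lam:5.2f}   beta =", np.round(soft_threshold(beta_ls, lam / 2.0), 3))

Variables enter the path one by one as λ decreases, the first one being the most correlated with y, which is precisely the intuition exploited by LARS; for a general design, the same behaviour requires the active set machinery described above.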

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz differentiable cost function J(β) and a non-differentiable penalty λP(β):

\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2   (2.12)

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \tfrac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta)   (2.13)

The basic algorithm makes use of the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
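Below is a minimal sketch of the basic proximal update (2.13) for the Lasso case, where the proximal operator of the L1 penalty is the soft-thresholding operator (the sketch and its data are my own; the accelerated variants of Nesterov (2007) and FISTA are not implemented):

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def ista_lasso(X, y, lam, n_iter=500):
        """Proximal gradient (ISTA) for min_beta ||y - X beta||_2^2 + lam ||beta||_1."""
        n, p = X.shape
        L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()           # Lipschitz constant of grad J
        beta = np.zeros(p)
        for _ in range(n_iter):
            grad = -2.0 * X.T @ (y - X @ beta)                # gradient step on J ...
            beta = soft_threshold(beta - grad / L, lam / L)   # ... then the prox of (lam/L)||.||_1
        return beta

    rng = np.random.default_rng(5)
    X = rng.normal(size=(60, 12))
    beta_true = np.zeros(12)
    beta_true[[1, 4]] = [1.5, -2.0]
    y = X @ beta_true + 0.1 * rng.normal(size=60)
    print(np.round(ista_lasso(X, y, lam=10.0), 3))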


    Part II

    Sparse Linear Discriminant Analysis


    Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced rank classification. The first two or three directions can also be used to project the data, to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables, and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and label y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n×K matrix Y = (y_1^T, ..., y_n^T)^T.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta}   (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class covariance and within-class covariance matrices respectively, defined (for a K-class problem) as

\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^\top

\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations for the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

\max_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; \frac{\mathrm{tr}\left(\mathbf{B}^\top \Sigma_B \mathbf{B}\right)}{\mathrm{tr}\left(\mathbf{B}^\top \Sigma_W \mathbf{B}\right)}   (3.2)

where the B matrix is built with the discriminant directions β_k as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

\begin{cases}
\max_{\beta_k \in \mathbb{R}^p} \; \beta_k^\top \Sigma_B \beta_k \\
\text{s.t. } \beta_k^\top \Sigma_W \beta_k \le 1, \\
\phantom{\text{s.t. }} \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k
\end{cases}   (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated to the kth largest eigenvalue (see Appendix C).
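As an illustration of this formulation (a sketch of mine, not the GLOSS algorithm developed later), the discriminant directions can be computed from the sample covariance matrices defined above as the leading eigenvectors of Σ_W^{-1} Σ_B; the small ridge term added to Σ_W is purely a numerical safeguard of my own, not part of the original formulation:

    import numpy as np

    def fisher_lda_directions(X, y, n_classes, eps=1e-8):
        n, p = X.shape
        mu = X.mean(axis=0)
        Sigma_W = np.zeros((p, p))
        Sigma_B = np.zeros((p, p))
        for k in range(n_classes):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n                 # within-class covariance
            Sigma_B += len(Xk) * np.outer(mu - mu_k, mu - mu_k) / n    # between-class covariance
        M = np.linalg.solve(Sigma_W + eps * np.eye(p), Sigma_B)        # Sigma_W^{-1} Sigma_B
        evals, evecs = np.linalg.eig(M)
        order = np.argsort(evals.real)[::-1]
        return evecs.real[:, order[:n_classes - 1]]                    # at most K-1 directions

    rng = np.random.default_rng(6)
    means = np.array([[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]], dtype=float)
    X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in means])
    y = np.repeat(np.arange(3), 30)
    B = fisher_lda_directions(X, y, n_classes=3)
    print(B.shape)          # (4, 2): two discriminant directions for K = 3 classes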

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity has as main target reducing the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity inducing extension, that is, either Fisher's Discriminant Analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance), and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of Sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for Sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

\begin{cases}
\min_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_W \beta \\
\text{s.t. } (\mu_1 - \mu_2)^\top \beta = 1, \\
\phantom{\text{s.t. }} \sum_{j=1}^{p} |\beta_j| \le t
\end{cases}

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

\[
\begin{aligned}
\max_{\beta_k\in\mathbb{R}^{p}} \quad & \beta_k^\top \Sigma_B^{k} \beta_k - P_k(\beta_k) \\
\text{s.t.} \quad & \beta_k^\top \Sigma_W \beta_k \le 1 .
\end{aligned}
\]

The term to maximize is the projected between-class covariance β_k^⊤Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k^⊤Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

\[
\begin{aligned}
\min_{\beta\in\mathbb{R}^{p}} \quad & \|\beta\|_1 \\
\text{s.t.} \quad & \left\| \Sigma\beta - (\mu_1-\mu_2) \right\|_\infty \le \lambda .
\end{aligned}
\]

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

    Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n × K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k, and y_ik = 0 otherwise) is commonly used for linking multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is y_ik = 1 if sample i belongs to class k, and y_ik = −1/(K−1) otherwise. It was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).
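As a short illustration (not part of the original text), both codings can be built in MATLAB from a label vector y with values in {1,…,K}; variable names are illustrative.

    % Build the two class indicator matrices mentioned above.
    n = numel(y);  K = max(y);
    Ydummy = zeros(n, K);
    Ydummy(sub2ind([n, K], (1:n)', y(:))) = 1;    % 0/1 dummy coding
    Ysym = -1/(K-1) * ones(n, K);                 % +1 / -1/(K-1) coding
    Ysym(sub2ind([n, K], (1:n)', y(:))) = 1;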

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to that of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

\[
\min_{\beta\in\mathbb{R}^{p},\,\beta_0\in\mathbb{R}} \; n^{-1}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top\beta\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| ,
\]

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest that any other generic sparsity-inducing penalty may be used. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting for λ = 0, but a different intercept β_0 is required.
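A rough MATLAB sketch of such a Lasso-penalized least squares classifier is given below; it uses lasso() from the Statistics and Machine Learning Toolbox and is only an illustration under assumed names (y01 holds 0/1 labels), not the authors' code.

    lambda = 0.1;                               % penalty level, to be tuned
    [beta, info] = lasso(X, double(y01), 'Lambda', lambda);
    beta0 = info.Intercept;
    scores = X * beta + beta0;
    yhat = scores > 0.5;                        % with 0/1 responses, 0.5 is the natural
                                                % cut; as noted above, the intercept or
                                                % threshold should be re-tuned in practice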

    Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables one to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\[
\begin{aligned}
\min_{\Theta,\,B} \quad & \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\left(B^\top\Omega B\right) & (3.4a)\\
\text{s.t.} \quad & n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} , & (3.4b)
\end{aligned}
\]

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

\[
\begin{aligned}
\min_{\theta_k\in\mathbb{R}^{K},\,\beta_k\in\mathbb{R}^{p}} \quad & \|Y\theta_k - X\beta_k\|^2 + \beta_k^\top\Omega\beta_k & (3.5a)\\
\text{s.t.} \quad & n^{-1}\,\theta_k^\top Y^\top Y\theta_k = 1 & (3.5b)\\
& \theta_k^\top Y^\top Y\theta_\ell = 0 , \quad \ell = 1,\dots,k-1 , & (3.5c)
\end{aligned}
\]

where each β_k corresponds to a discriminant direction.


Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\[
\min_{\beta_k\in\mathbb{R}^{p},\,\theta_k\in\mathbb{R}^{K}} \; \sum_{k} \|Y\theta_k - X\beta_k\|_2^2 + \lambda_1\|\beta_k\|_1 + \lambda_2\,\beta_k^\top\Omega\beta_k ,
\]

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

\[
\min_{\beta_k\in\mathbb{R}^{p},\,\theta_k\in\mathbb{R}^{K}} \; \sum_{k=1}^{K-1} \|Y\theta_k - X\beta_k\|_2^2 + \lambda\sum_{j=1}^{p}\sqrt{\sum_{k=1}^{K-1}\beta_{kj}^2} , \qquad (3.6)
\]

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.


    4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables the derivation of variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

- there is no empty class, that is, the diagonal matrix Y⊤Y is full rank;

- inputs are centered, that is, X⊤1_n = 0;

- the quadratic penalty Ω is positive semidefinite and such that X⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice versa. The problems are, however, non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraints (3.5c) inherently limit the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), so as to simplify all expressions. The generic problem solved is thus

\[
\begin{aligned}
\min_{\theta\in\mathbb{R}^{K},\,\beta\in\mathbb{R}^{p}} \quad & \|Y\theta - X\beta\|^2 + \beta^\top\Omega\beta & (4.1a)\\
\text{s.t.} \quad & n^{-1}\,\theta^\top Y^\top Y\theta = 1 . & (4.1b)
\end{aligned}
\]

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\[
\beta_{\mathrm{OS}} = \left(X^\top X + \Omega\right)^{-1} X^\top Y\theta . \qquad (4.2)
\]

The objective function (4.1a) is then

\[
\begin{aligned}
\|Y\theta - X\beta_{\mathrm{OS}}\|^2 + \beta_{\mathrm{OS}}^\top\Omega\beta_{\mathrm{OS}}
&= \theta^\top Y^\top Y\theta - 2\,\theta^\top Y^\top X\beta_{\mathrm{OS}} + \beta_{\mathrm{OS}}^\top\left(X^\top X + \Omega\right)\beta_{\mathrm{OS}}\\
&= \theta^\top Y^\top Y\theta - \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta ,
\end{aligned}
\]

where the second line stems from the definition of β_OS (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\[
\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta , \qquad (4.3)
\]

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

\[
\left(Y^\top Y\right)^{-1} Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta = \alpha^2\theta , \qquad (4.4)
\]

where α² is the maximal eigenvalue¹:

\[
\begin{aligned}
n^{-1}\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2\, n^{-1}\theta^\top Y^\top Y\theta\\
n^{-1}\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2 . \qquad (4.5)
\end{aligned}
\]

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

\[
\begin{aligned}
\max_{\theta\in\mathbb{R}^{K},\,\beta\in\mathbb{R}^{p}} \quad & n^{-1}\theta^\top Y^\top X\beta & (4.6a)\\
\text{s.t.} \quad & n^{-1}\,\theta^\top Y^\top Y\theta = 1 & (4.6b)\\
& n^{-1}\,\beta^\top\left(X^\top X + \Omega\right)\beta = 1 . & (4.6c)
\end{aligned}
\]

The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:

\[
\begin{aligned}
nL(\beta,\theta,\nu,\gamma) &= \theta^\top Y^\top X\beta - \nu\left(\theta^\top Y^\top Y\theta - n\right) - \gamma\left(\beta^\top(X^\top X + \Omega)\beta - n\right)\\
\Rightarrow\; n\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\beta} &= X^\top Y\theta - 2\gamma(X^\top X + \Omega)\beta\\
\Rightarrow\; \beta_{\mathrm{CCA}} &= \frac{1}{2\gamma}(X^\top X + \Omega)^{-1}X^\top Y\theta .
\end{aligned}
\]

Then, as β_CCA obeys (4.6c), we obtain

\[
\beta_{\mathrm{CCA}} = \frac{(X^\top X + \Omega)^{-1}X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta}} , \qquad (4.7)
\]

so that the optimal objective function (4.6a) can be expressed with θ alone:

\[
n^{-1}\theta^\top Y^\top X\beta_{\mathrm{CCA}}
= \frac{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta}}
= \sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta} ,
\]

and the optimization problem with respect to θ can be restated as

\[
\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1}X^\top Y\theta . \qquad (4.8)
\]

Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\[
\beta_{\mathrm{OS}} = \alpha\,\beta_{\mathrm{CCA}} , \qquad (4.9)
\]

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).

where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\[
\begin{aligned}
n\frac{\partial L(\beta,\theta,\gamma,\nu)}{\partial\theta} &= Y^\top X\beta - 2\nu\,Y^\top Y\theta\\
\Rightarrow\; \theta_{\mathrm{CCA}} &= \frac{1}{2\nu}(Y^\top Y)^{-1}Y^\top X\beta . \qquad (4.10)
\end{aligned}
\]

Then, as θ_CCA obeys (4.6b), we obtain

\[
\theta_{\mathrm{CCA}} = \frac{(Y^\top Y)^{-1}Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}} , \qquad (4.11)
\]

leading to the following expression of the optimal objective function:

\[
n^{-1}\theta_{\mathrm{CCA}}^\top Y^\top X\beta
= \frac{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta}}
= \sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta} .
\]

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

\[
\begin{aligned}
\max_{\beta\in\mathbb{R}^{p}} \quad & n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta & (4.12a)\\
\text{s.t.} \quad & n^{-1}\,\beta^\top\left(X^\top X + \Omega\right)\beta = 1 , & (4.12b)
\end{aligned}
\]

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution; following the reasoning of Appendix C, β_CCA verifies

\[
n^{-1}X^\top Y(Y^\top Y)^{-1}Y^\top X\beta_{\mathrm{CCA}} = \lambda\left(X^\top X + \Omega\right)\beta_{\mathrm{CCA}} , \qquad (4.13)
\]

where λ is the maximal eigenvalue, shown below to be equal to α²:

\[
\begin{aligned}
& n^{-1}\beta_{\mathrm{CCA}}^\top X^\top Y(Y^\top Y)^{-1}Y^\top X\beta_{\mathrm{CCA}} = \lambda\\
\Rightarrow\;& n^{-1}\alpha^{-1}\beta_{\mathrm{CCA}}^\top X^\top Y(Y^\top Y)^{-1}Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta = \lambda\\
\Rightarrow\;& n^{-1}\alpha\,\beta_{\mathrm{CCA}}^\top X^\top Y\theta = \lambda\\
\Rightarrow\;& n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\theta = \lambda\\
\Rightarrow\;& \alpha^2 = \lambda .
\end{aligned}
\]

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), where the denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), penalized Linear Discriminant Analysis is defined as follows:

\[
\begin{aligned}
\max_{\beta\in\mathbb{R}^{p}} \quad & \beta^\top\Sigma_B\beta & (4.14a)\\
\text{s.t.} \quad & \beta^\top\left(\Sigma_W + n^{-1}\Omega\right)\beta = 1 , & (4.14b)
\end{aligned}
\]

where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y(Y⊤Y)^{-1}Y⊤:

\[
\begin{aligned}
\Sigma_T &= \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = n^{-1}X^\top X\\
\Sigma_B &= \frac{1}{n}\sum_{k=1}^{K} n_k\,\mu_k\mu_k^\top = n^{-1}X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\\
\Sigma_W &= \frac{1}{n}\sum_{k=1}^{K}\sum_{i:\,y_{ik}=1}\left(x_i-\mu_k\right)\left(x_i-\mu_k\right)^\top
= n^{-1}\left(X^\top X - X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\right) .
\end{aligned}
\]

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\[
\begin{aligned}
X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\beta_{\mathrm{LDA}} &= \lambda\left(X^\top X + \Omega - X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\right)\beta_{\mathrm{LDA}}\\
X^\top Y\left(Y^\top Y\right)^{-1}Y^\top X\beta_{\mathrm{LDA}} &= \frac{\lambda}{1-\lambda}\left(X^\top X + \Omega\right)\beta_{\mathrm{LDA}} .
\end{aligned}
\]

The comparison of the last equation with β_CCA (4.13) shows that β_LDA and β_CCA are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

\[
\beta_{\mathrm{LDA}} = (1-\alpha^2)^{-1/2}\,\beta_{\mathrm{CCA}} = \alpha^{-1}(1-\alpha^2)^{-1/2}\,\beta_{\mathrm{OS}} ,
\]

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

\[
\begin{aligned}
\min_{\Theta,\,B} \quad & \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\left(B^\top\Omega B\right)\\
\text{s.t.} \quad & n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} .
\end{aligned}
\]

Let A represent the (K − 1) × (K − 1) diagonal matrix with elements α_k, the square root of the kth largest eigenvalue of Y⊤X(X⊤X + Ω)^{-1}X⊤Y; we have

\[
B_{\mathrm{LDA}} = B_{\mathrm{CCA}}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} = B_{\mathrm{OS}}\,A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} , \qquad (4.15)
\]

where I_{K−1} is the (K − 1) × (K − 1) identity matrix.

At this point, the feature matrix X, which in the input space has dimensions n × p, can be projected into the optimal scoring domain as an n × (K − 1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n × (K − 1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows (a code sketch is given after this list):

1. Solve the p-OS problem as B_OS = (X⊤X + λΩ)^{-1} X⊤YΘ, where Θ are the K − 1 leading eigenvectors of Y⊤X(X⊤X + λΩ)^{-1}X⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1}(I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, produce a graphical representation.
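The following MATLAB sketch transcribes steps 1 and 2 for a quadratic penalty Ω and a given λ, with X centered and Y an n × K indicator matrix; it is an illustration under assumed names, not the GLOSS package itself (the group-Lasso case is handled later through an adaptive Ω).

    [n, K] = size(Y);
    U = null(ones(1, K));                         % K x (K-1), orthonormal, orthogonal to 1_K
    Theta0 = sqrt(n) * (sqrtm(Y' * Y) \ U);       % n^-1 * Theta0'*Y'*Y*Theta0 = I_{K-1}
    B0 = (X' * X + lambda * Omega) \ (X' * Y * Theta0);   % regression part of step 1
    M = Theta0' * (Y' * X) * B0;                  % small (K-1) x (K-1) symmetric matrix
    [V, S] = eig((M + M') / 2);
    [s, order] = sort(diag(S), 'descend');
    V = V(:, order);
    Theta = Theta0 * V;                           % optimal scores
    BOS = B0 * V;                                 % p-OS regression coefficients
    alpha = sqrt(s / n);                          % diagonal of A, see (4.5)
    D = diag(1 ./ (alpha .* sqrt(1 - alpha.^2))); % step 2: OS -> LDA scaling
    XLDA = X * BOS * D;                           % discriminant variates

The eigen-analysis is deliberately performed on the small (K − 1) × (K − 1) matrix, a point developed in Section 4.2.1.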


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

\[
\begin{aligned}
\min_{\Theta\in\mathbb{R}^{K\times(K-1)},\,B\in\mathbb{R}^{p\times(K-1)}} \quad & \|Y\Theta - XB\|_F^2 + \lambda\,\mathrm{tr}\left(B^\top\Omega B\right) & (4.16a)\\
\text{s.t.} \quad & n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1} , & (4.16b)
\end{aligned}
\]

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}.

2. Compute B = (X⊤X + λΩ)^{-1} X⊤YΘ⁰.

3. Set Θ to be the K − 1 leading eigenvectors of Y⊤X(X⊤X + λΩ)^{-1}X⊤Y.

4. Compute the optimal regression coefficients

\[
B_{\mathrm{OS}} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta . \qquad (4.17)
\]

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰⊤Y⊤X(X⊤X + λΩ)^{-1}X⊤YΘ⁰, which is computed as Θ⁰⊤Y⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form tr(B⊤ΩB). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (the sample estimators of μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

\[
d(x_i,\mu_k) = (x_i-\mu_k)^\top\Sigma_{W\Omega}^{-1}(x_i-\mu_k) - 2\log\left(\frac{n_k}{n}\right) \qquad (4.18)
\]

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

\[
\begin{aligned}
\Sigma_{W\Omega}^{-1} &= \left(n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B\right)^{-1}\\
&= \left(n^{-1}X^\top X - \Sigma_B + n^{-1}\lambda\Omega\right)^{-1}\\
&= \left(\Sigma_W + n^{-1}\lambda\Omega\right)^{-1} . \qquad (4.19)
\end{aligned}
\]

Before explaining how to compute the distances, let us summarize some clarifying points (a code sketch follows this list):

- The solution B_OS of the p-OS problem is enough to accomplish classification.

- In the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances.

- Classification can be done in a reduced-rank space of dimension R < K − 1 by using the first R discriminant directions {β_k}_{k=1}^{R}.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

\[
\left\|(x_i-\mu_k)\,B_{\mathrm{OS}}\right\|^2_{\Sigma_{W\Omega}} - 2\log(\pi_k) ,
\]

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is

\[
\left\|(x_i-\mu_k)\,B_{\mathrm{OS}}\,A^{-1}\left(I_{K-1}-A^2\right)^{-\frac{1}{2}}\right\|_2^2 - 2\log(\pi_k) ,
\]

which is a plain Euclidean distance.
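A minimal MATLAB sketch of the nearest-centroid rule in the LDA domain is given below; it reuses XLDA, Y and n from the earlier pipeline sketch, and all names are illustrative.

    prior = sum(Y, 1) / n;                       % class proportions n_k / n
    M = (Y' * Y) \ (Y' * XLDA);                  % K x (K-1) class centroids
    d = zeros(n, K);
    for k = 1:K
        diff = XLDA - repmat(M(k, :), n, 1);     % plain Euclidean distance here
        d(:, k) = sum(diff.^2, 2) - 2 * log(prior(k));
    end
    [~, yhat] = min(d, [], 2);                   % maximum a posteriori assignment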


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k, defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

\[
\hat{p}(y_k=1|x) \;\propto\; \exp\left(-\frac{d(x,\mu_k)}{2}\right)
\;\propto\; \pi_k \exp\left(-\frac{1}{2}\left\|(x-\mu_k)\,B_{\mathrm{OS}}\,A^{-1}\left(I_{K-1}-A^2\right)^{-\frac{1}{2}}\right\|_2^2\right) . \qquad (4.20)
\]

These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

\[
\hat{p}(y_k=1|x)
= \frac{\pi_k\exp\left(-\frac{d(x,\mu_k)}{2}\right)}{\sum_{\ell}\pi_\ell\exp\left(-\frac{d(x,\mu_\ell)}{2}\right)}
= \frac{\pi_k\exp\left(\frac{-d(x,\mu_k)+d_{\max}}{2}\right)}{\sum_{\ell}\pi_\ell\exp\left(\frac{-d(x,\mu_\ell)+d_{\max}}{2}\right)} ,
\]

where d_max = max_k d(x, μ_k).
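The shift trick translates directly into MATLAB; in this sketch, d is the n × K distance matrix and prior the 1 × K class proportions from the previous sketch (names are illustrative).

    dmax = max(d, [], 2);                        % per-sample largest distance
    W = exp((-d + repmat(dmax, 1, K)) / 2);      % shifted exponentials, no underflow
    W = W .* repmat(prior, n, 1);
    post = W ./ repmat(sum(W, 2), 1, K);         % normalized posterior probabilities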

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or the discriminant variates X_LDA, depending on whether we are presenting the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β⊤Ωβ, under the assumption that Y⊤Y and X⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:

\[
\begin{aligned}
\min_{\tau\in\mathbb{R}^{p}}\;\min_{B\in\mathbb{R}^{p\times(K-1)}} \quad & J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^{j}\|_2^2}{\tau_j} & (4.21a)\\
\text{s.t.} \quad & \sum_{j}\tau_j - \sum_{j} w_j\|\beta^{j}\|_2 \le 0 & (4.21b)\\
& \tau_j \ge 0 , \quad j = 1,\dots,p , & (4.21c)
\end{aligned}
\]

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})⊤, and the w_j are predefined nonnegative weights. In our context, the cost function J(B) is the OS regression ‖YΘ − XB‖²_2; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero: b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic penalty into the convex hull of a family of quadratic penalties defined by the variables τ_j. This is shown graphically in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^{p} w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is

\[
L = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^{j}\|_2^2}{\tau_j}
+ \nu_0\left(\sum_{j=1}^{p}\tau_j - \sum_{j=1}^{p} w_j\|\beta^{j}\|_2\right) - \sum_{j=1}^{p}\nu_j\tau_j .
\]


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j are

\[
\begin{aligned}
\frac{\partial L}{\partial\tau_j}(\tau_j^\star) = 0
&\;\Leftrightarrow\; -\lambda w_j^2\,\frac{\|\beta^{j}\|_2^2}{\tau_j^{\star\,2}} + \nu_0 - \nu_j = 0\\
&\;\Leftrightarrow\; -\lambda w_j^2\|\beta^{j}\|_2^2 + \nu_0\,\tau_j^{\star\,2} - \nu_j\,\tau_j^{\star\,2} = 0\\
&\;\Rightarrow\; -\lambda w_j^2\|\beta^{j}\|_2^2 + \nu_0\,\tau_j^{\star\,2} = 0 .
\end{aligned}
\]

The last line is obtained from complementary slackness, which implies here ν_j τ_j^⋆ = 0 (complementary slackness states that ν_j g_j(τ_j^⋆) = 0, where ν_j is the Lagrange multiplier for constraint g_j(τ_j) ≤ 0). As a result, the optimal value of τ_j is

\[
\tau_j^\star = \sqrt{\frac{\lambda w_j^2\|\beta^{j}\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^{j}\|_2 . \qquad (4.22)
\]

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is saturated (due to complementary slackness):

\[
\sum_{j=1}^{p}\tau_j^\star - \sum_{j=1}^{p} w_j\|\beta^{j}\|_2 = 0 , \qquad (4.23)
\]

so that τ_j^⋆ = w_j‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

\[
\min_{B\in\mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda\sum_{j=1}^{p} w_j\|\beta^{j}\|_2 . \qquad (4.24)
\]

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ tr(B⊤ΩB), where

\[
\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\dots,\frac{w_p^2}{\tau_p}\right) , \qquad (4.25)
\]

with τ_j = w_j‖β^j‖_2, resulting in the diagonal components

\[
(\Omega)_{jj} = \frac{w_j}{\|\beta^{j}\|_2} . \qquad (4.26)
\]
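As a rough illustration (not the GLOSS implementation itself), the reweighting implied by (4.25)–(4.26) amounts to a few MATLAB lines; w, lambda and Theta0 are assumed as in the earlier sketches, and eps guards the division for vanishing rows.

    rownorm = sqrt(sum(B.^2, 2));               % ||beta^j||_2 for each row of B
    Omega = diag(w ./ max(rownorm, eps));       % (Omega)_jj = w_j / ||beta^j||_2
    % one reweighted penalized least squares pass:
    B = (X' * X + lambda * Omega) \ (X' * Y * Theta0);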

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2/τ, known as the perspective function of f(β) = ‖β‖²_2, is jointly convex in (β, τ) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

\[
\left\{ V\in\mathbb{R}^{p\times(K-1)} :\; V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} , \qquad (4.27)
\]

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})⊤, defined as follows. Let S(B) denote the row support of B, S(B) = {j ∈ {1,…,p} : ‖β^j‖_2 ≠ 0}; then we have

\[
\forall j\in S(B),\quad g^{j} = w_j\|\beta^{j}\|_2^{-1}\beta^{j} , \qquad (4.28)
\]
\[
\forall j\notin S(B),\quad \|g^{j}\|_2 \le w_j . \qquad (4.29)
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is

\[
\frac{\partial}{\partial\beta^{j}}\left(\lambda\sum_{m=1}^{p} w_m\|\beta^{m}\|_2\right) = \lambda w_j\,\frac{\beta^{j}}{\|\beta^{j}\|_2} . \qquad (4.30)
\]

At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

\[
\partial_{\beta^{j}}\left(\lambda\sum_{m=1}^{p} w_m\|\beta^{m}\|_2\right)
= \partial_{\beta^{j}}\left(\lambda w_j\|\beta^{j}\|_2\right)
= \left\{\lambda w_j v \in\mathbb{R}^{K-1} :\; \|v\|_2 \le 1\right\} . \qquad (4.31)
\]

This gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

\[
\forall j\in S,\quad \frac{\partial J(B)}{\partial\beta^{j}} + \lambda w_j\|\beta^{j}\|_2^{-1}\beta^{j} = 0 , \qquad (4.32a)
\]
\[
\forall j\notin S,\quad \left\|\frac{\partial J(B)}{\partial\beta^{j}}\right\|_2 \le \lambda w_j , \qquad (4.32b)
\]

where S ⊆ {1,…,p} denotes the set of non-zero row vectors β^j, and its complement gathers the remaining indices.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

\[
\begin{aligned}
B_{\mathrm{OS}} = \operatorname*{argmin}_{B\in\mathbb{R}^{p\times(K-1)}}\;\min_{\Theta\in\mathbb{R}^{K\times(K-1)}} \quad & \frac{1}{2}\|Y\Theta - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^{j}\|_2\\
\text{s.t.} \quad & n^{-1}\,\Theta^\top Y^\top Y\Theta = I_{K-1}
\end{aligned}
\]

is equivalent to the penalized LDA problem

\[
\begin{aligned}
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B\in\mathbb{R}^{p\times(K-1)}} \quad & \mathrm{tr}\left(B^\top\Sigma_B B\right)\\
\text{s.t.} \quad & B^\top\left(\Sigma_W + n^{-1}\lambda\Omega\right)B = I_{K-1} ,
\end{aligned}
\]

where

\[
\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1},\dots,\frac{w_p^2}{\tau_p}\right) ,
\quad\text{with}\quad
\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^{j}_{\mathrm{OS}} = 0\\ w_j\,\|\beta^{j}_{\mathrm{OS}}\|_2^{-1} & \text{otherwise.} \end{cases} \qquad (4.33)
\]

That is, B_LDA = B_OS diag(α_k^{-1}(1 − α_k²)^{-1/2}), where α_k ∈ (0, 1) is the kth leading eigenvalue of

\[
n^{-1}\,Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1}X^\top Y .
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note, however, that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B⊤ΩB).


    5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½‖YΘ − XB‖²_2.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations of the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K − 1) similar systems

\[
\left(X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega\right)\beta_k = X_{\mathcal{A}}^\top Y\theta_k^{0} , \qquad (5.1)
\]


Figure 5.1: GLOSS block diagram. (Flow chart: initialize the model with λ and B; form the active set {j : ‖β^j‖_2 > 0}; solve the p-OS problem so that B satisfies the first optimality condition; move to the inactive set any active variable that must leave it; test the second optimality condition on the inactive set and activate the worst violator, if any; once no move is required, compute Θ, update B, and stop.)


Algorithm 1 Adaptively Penalized Optimal Scoring

    Input: X, Y, B, λ
    Initialize: A ← {j ∈ {1,…,p} : ‖β^j‖_2 > 0};
                Θ⁰ such that n^{-1} Θ⁰⊤Y⊤YΘ⁰ = I_{K−1};
                convergence ← false
    repeat
        % Step 1: solve (4.21) in B, assuming A optimal
        repeat
            Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
            B_A ← (X_A⊤X_A + λΩ)^{-1} X_A⊤YΘ⁰
        until condition (4.32a) holds for all j ∈ A
        % Step 2: identify inactivated variables
        for j ∈ A such that ‖β^j‖_2 = 0 do
            if optimality condition (4.32b) holds then
                A ← A \ {j};  go back to Step 1
            end if
        end for
        % Step 3: check the greatest violation of optimality condition (4.32b) outside A
        ĵ ← argmax_{j ∉ A} ‖∂J/∂β^j‖_2
        if ‖∂J/∂β^ĵ‖_2 < λ then
            convergence ← true   % B is optimal
        else
            A ← A ∪ {ĵ}
        end if
    until convergence
    (s, V) ← eigenanalyze(Θ⁰⊤Y⊤X_A B), that is, Θ⁰⊤Y⊤X_A B V_k = s_k V_k, k = 1,…,K−1
    Θ ← Θ⁰V;  B ← BV;  α_k ← n^{-1/2} s_k^{1/2}, k = 1,…,K−1
    Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

\[
\left(X^\top X + \lambda\Omega\right)B = X^\top Y\Theta . \qquad (5.2)
\]

Defining the Cholesky decomposition C⊤C = (X⊤X + λΩ), (5.2) is solved efficiently as follows:

\[
\begin{aligned}
C^\top C\,B &= X^\top Y\Theta\\
C\,B &= C^\top\backslash\, X^\top Y\Theta\\
B &= C\backslash\left(C^\top\backslash\, X^\top Y\Theta\right) , \qquad (5.3)
\end{aligned}
\]

where the symbol "\" is the MATLAB mldivide operator, which solves linear systems efficiently (here by triangular backsubstitution). The GLOSS code implements (5.3).

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

\[
B = \Omega^{-1/2}\left(\Omega^{-1/2}X^\top X\Omega^{-1/2} + \lambda I\right)^{-1}\Omega^{-1/2}X^\top Y\Theta^{0} , \qquad (5.4)
\]

where the conditioning of Ω^{-1/2}X⊤XΩ^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values; our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. This eigen-analysis is actually solved in the form Θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X⊤X + Ω)^{-1}, which involves the inversion of a p × p matrix. Let Θ⁰ be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B⁰ = (X⊤X + λΩ)^{-1}X⊤YΘ⁰. This B⁰ matrix can be identified in the expression to eigenanalyze as

\[
\Theta^{0\top}Y^\top X\left(X^\top X + \Omega\right)^{-1}X^\top Y\Theta^{0} = \Theta^{0\top}Y^\top X B^{0} .
\]

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix Θ⁰⊤Y⊤XB⁰ = VΛV⊤. Defining Θ = Θ⁰V, we have Θ⊤Y⊤X(X⊤X + Ω)^{-1}X⊤YΘ = Λ, and when Θ⁰ is chosen such that n^{-1}Θ⁰⊤Y⊤YΘ⁰ = I_{K−1}, we also have n^{-1}Θ⊤Y⊤YΘ = I_{K−1}, so the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ⁰ to Θ, that is, B = B⁰V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.
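A compact MATLAB transcription of this trick (names as in the earlier sketches; an illustration, not the package code) could read:

    B0 = (X' * X + lambda * Omega) \ (X' * Y * Theta0);  % K-1 solves of (5.2)
    M  = Theta0' * (Y' * X) * B0;                        % small (K-1) x (K-1) matrix
    [V, L] = eig((M + M') / 2);                          % no large matrix is inverted
    [~, order] = sort(diag(L), 'descend');
    V = V(:, order);
    Theta = Theta0 * V;                                  % optimal scores
    B = B0 * V;                                          % mapped regression coefficients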

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

\[
\frac{1}{2}\|Y\Theta - XB\|_2^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^{j}\|_2 . \qquad (5.5)
\]

Let J(B) be the data-fitting term ½‖YΘ − XB‖²_2. Its gradient with respect to the jth row of B, β^j, is the (K − 1)-dimensional vector

\[
\frac{\partial J(B)}{\partial\beta^{j}} = x_j^\top\left(XB - Y\Theta\right) ,
\]

where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as

\[
x_j^\top\left(XB - Y\Theta\right) + \lambda w_j\,\frac{\beta^{j}}{\|\beta^{j}\|_2} = 0 .
\]

¹As X is centered, 1_K belongs to the null space of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. It is thus sufficient to choose Θ⁰ orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y⊤X(X⊤X + Ω)^{-1}X⊤Y. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set Θ⁰ = (Y⊤Y)^{-1/2}U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.

The second optimality condition (4.32b) can be computed for every variable j as

\[
\left\| x_j^\top\left(XB - Y\Theta\right) \right\|_2 \le \lambda w_j .
\]

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

\[
j^{\star} = \operatorname*{argmax}_{j}\;\max\left(\left\| x_j^\top\left(XB - Y\Theta\right)\right\|_2 - \lambda w_j ,\; 0\right) .
\]

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:

\[
\left\| x_j^\top\left(XB - Y\Theta\right) \right\|_2 \le \lambda w_j .
\]

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
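A minimal MATLAB sketch of this bookkeeping is given below, assuming a logical p × 1 vector `active`, weights w and a penalty lambda (illustrative names, not the GLOSS code).

    R = X * B - Y * Theta;                       % current residuals
    grad_norm = sqrt(sum((X' * R).^2, 2));       % ||x_j'(XB - Y*Theta)||_2, j = 1..p
    violation = max(grad_norm - lambda * w, 0);
    violation(active) = 0;                       % only inspect the inactive set
    [vmax, jstar] = max(violation);
    if vmax > 0
        active(jstar) = true;                    % include the worst violator
    end
    % an active variable j with ||beta^j||_2 ~ 0 is removed when, with beta^j set
    % to zero, grad_norm(j) <= lambda * w(j) still holds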

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λ_max, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

\[
\lambda_{\max} = \max_{j\in\{1,\dots,p\}}\;\frac{1}{w_j}\left\| x_j^\top Y\Theta^{0}\right\|_2 .
\]

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ··· > λ_t > ··· > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t/2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
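A schematic MATLAB rendering of this path strategy is shown below; it is illustrative only (the inner solver is left as a placeholder), with p, K, w and Theta0 assumed as in the earlier sketches.

    lambda_max = max(sqrt(sum((X' * (Y * Theta0)).^2, 2)) ./ w);  % (4.32b) at B = 0
    T = 10;                                  % number of path points, arbitrary here
    lambdas = lambda_max ./ 2.^(0:T-1);      % lambda_{t+1} = lambda_t / 2
    B = zeros(p, K-1);                       % warm start for the first penalty value
    for t = 1:T
        % solve the group-Lasso p-OS problem at lambdas(t), warm-started with B,
        % and stop early once the desired number of active variables is reached
    end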


5.6 Options and Variants

5.6.1 Scaling Variables

As most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some MATLAB commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

\[
\min_{B\in\mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2
= \min_{B\in\mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left(\Theta^\top Y^\top Y\Theta - 2\,\Theta^\top Y^\top XB + n\,B^\top\Sigma_T B\right)
\]

are replaced by

\[
\min_{B\in\mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left(\Theta^\top Y^\top Y\Theta - 2\,\Theta^\top Y^\top XB + n\,B^\top\left(\Sigma_B + \mathrm{diag}\left(\Sigma_W\right)\right)B\right) .
\]

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image (the 9 pixels form an 8-connected grid graph; the Laplacian Ω_L carries the node degrees on its diagonal and a −1 entry for each pair of neighboring pixels).

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of pixels in a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semidefinite, and the penalty β⊤Ω_Lβ favors, among vectors of identical L2 norms, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, which is the indicator of the neighbors of pixel 1, and it is 17 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)⊤, with a sign mismatch between pixel 1 and its neighborhood.
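The Laplacian of such a pixel grid is easy to build; the following MATLAB sketch (illustrative, not the GLOSS code) assumes the 8-connected neighborhood suggested by the degrees in Figure 5.2.

    r = 3; c = 3; p = r * c;
    [I, J] = ndgrid(1:r, 1:c);
    A = zeros(p, p);
    for a = 1:p
        for b = a+1:p
            if max(abs(I(a) - I(b)), abs(J(a) - J(b))) == 1   % neighboring pixels
                A(a, b) = 1; A(b, a) = 1;
            end
        end
    end
    OmegaL = diag(sum(A, 2)) - A;    % graph Laplacian: degree matrix minus adjacency
    % beta' * OmegaL * beta is small when neighboring coefficients are similar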

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.


    6 Experimental Results

This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an elastic net penalty to the OS problem. With the aim of testing its parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites: PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from those of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T to ones, or the diagonal of the within-class covariance matrix Σ_W to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is given below.

Simulation 1: mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_{1j} = 0.7 × 1_{(1≤j≤25)}, μ_{2j} = 0.7 × 1_{(26≤j≤50)}, μ_{3j} = 0.7 × 1_{(51≤j≤75)}, μ_{4j} = 0.7 × 1_{(76≤j≤100)}.

Simulation 2: mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j≤200)}. The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: one-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_{1j} ∼ N(0, 0.3²) for j ≤ 25 and μ_{1j} = 0 otherwise; μ_{2j} ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_{2j} = 0 otherwise; μ_{3j} ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_{3j} = 0 otherwise; μ_{4j} ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_{4j} = 0 otherwise.
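For concreteness, a minimal MATLAB sketch generating one draw of Simulation 1 is shown below (illustrative only; the class sizes are balanced in expectation rather than exactly).

    n = 100; p = 500; K = 4;
    y = randi(K, n, 1);                          % class labels
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25 + (1:25)) = 0.7;          % 25 shifted features per class
    end
    X = mu(y, :) + randn(n, p);                  % x_i ~ N(mu_k, I)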

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

    The results are summarized in Table 61 Overall the best predictions are performedby PLDA and GLOS-D that both benefit of the knowledge of the true within-classcovariance structure Then among SLDA and GLOSS that both ignore this structureour proposal has a clear edge The error rates are far away from the Bayesrsquo error ratesbut the sample size is small with regard to the number of relevant variables Regardingsparsity the clear overall winner is GLOSS followed far away by SLDA which is the only

    58

    63 Simulated Data

    Table 61 Experimental results for simulated data averages with standard deviationscomputed over 25 repetitions of the test error rate the number of selectedvariables and the number of discriminant directions selected on the validationset

    Err () Var Dir

    Sim 1 K = 4 mean shift ind features

    PLDA 126 (01) 4117 (37) 30 (00)SLDA 319 (01) 2280 (02) 30 (00)GLOSS 199 (01) 1064 (13) 30 (00)GLOSS-D 112 (01) 2511 (41) 30 (00)

    Sim 2 K = 2 mean shift dependent features

    PLDA 90 (04) 3376 (57) 10 (00)SLDA 193 (01) 990 (00) 10 (00)GLOSS 154 (01) 398 (08) 10 (00)GLOSS-D 90 (00) 2035 (40) 10 (00)

    Sim 3 K = 4 1D mean shift ind features

    PLDA 138 (06) 1615 (37) 10 (00)SLDA 578 (02) 1526 (20) 19 (00)GLOSS 312 (01) 1238 (18) 10 (00)GLOSS-D 185 (01) 3575 (28) 10 (00)

    Sim 4 K = 4 mean shift ind features

    PLDA 603 (01) 3360 (58) 30 (00)SLDA 659 (01) 2088 (16) 27 (00)GLOSS 607 (02) 743 (22) 27 (00)GLOSS-D 588 (01) 1627 (49) 29 (00)

    59

    6 Experimental Results

    0 10 20 30 40 50 60 70 8020

    30

    40

    50

    60

    70

    80

    90

    100TPR Vs FPR

    gloss

    glossd

    slda

    plda

    Simulation1

    Simulation2

    Simulation3

    Simulation4

    Figure 61 TPR versus FPR (in ) for all algorithms and simulations

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1     Simulation 2     Simulation 3     Simulation 4
             TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
  PLDA       99.0    78.2     96.9    60.3     98.0    15.9     74.3    65.6
  SLDA       73.9    38.5     33.8    16.3     41.6    27.8     50.7    39.5
  GLOSS      64.1    10.6     30.0     4.6     51.1    18.2     26.0    12.1
  GLOSS-D    93.5    39.4     92.1    28.1     95.6    65.5     42.9    29.9

SLDA is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of truly relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
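As an illustration, the two rates can be computed from the set of selected variables as in the following sketch (the function name and arguments are ours; in the simulations above, the first 100 variables are the relevant ones).

```python
import numpy as np

def tpr_fpr(selected, relevant, p):
    """TPR: fraction of relevant variables that are selected.
       FPR: fraction of irrelevant variables that are selected."""
    sel = np.zeros(p, dtype=bool); sel[np.asarray(selected, dtype=int)] = True
    rel = np.zeros(p, dtype=bool); rel[np.asarray(relevant, dtype=int)] = True
    return sel[rel].mean(), sel[~rel].mean()

# Example: a method selecting variables 0..149 out of p = 500,
# with variables 0..99 truly relevant, gets TPR = 1.0 and FPR = 0.125.
tpr, fpr = tpr_fpr(selected=np.arange(150), relevant=np.arange(100), p=500)
```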

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama2 dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011).

2. http://www.broadinstitute.org/cancer/software/genepattern/datasets
3. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and the number of selected variables.

                   Err (%)        Var
  Nakayama, n = 86, p = 22,283, K = 5
    PLDA           20.95 (1.3)    10478.7 (2116.3)
    SLDA           25.71 (1.7)      252.5 (3.1)
    GLOSS          20.48 (1.4)      129.0 (18.6)
  Ramaswamy, n = 198, p = 16,063, K = 14
    PLDA           38.36 (6.0)    14873.5 (720.3)
    SLDA             -                 -
    GLOSS          20.61 (6.9)      372.4 (122.1)
  Sun, n = 180, p = 54,613, K = 4
    PLDA           33.78 (5.9)    21634.8 (7443.2)
    SLDA           36.22 (6.5)      384.4 (16.5)
    GLOSS          31.77 (4.5)       93.0 (93.6)

The Ramaswamy3 dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun4 dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.
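The evaluation protocol can be summarized by the following pseudo-implementation (fit_gloss and error_rate are placeholders for the actual estimator and error computation, not functions from an existing package; the use of scikit-learn utilities is our choice for the sketch).

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

def evaluate(X, y, lambdas, fit_gloss, error_rate, n_repeats=10, seed=0):
    """Repeated 75/25 splits; the penalty is tuned by 10-fold CV on the training part."""
    errors = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, stratify=y, random_state=seed + r)
        cv_err = np.zeros(len(lambdas))
        for tr_idx, va_idx in KFold(n_splits=10, shuffle=True,
                                    random_state=seed + r).split(X_tr):
            for i, lam in enumerate(lambdas):
                model = fit_gloss(X_tr[tr_idx], y_tr[tr_idx], lam)
                cv_err[i] += error_rate(model, X_tr[va_idx], y_tr[va_idx])
        best_lam = lambdas[int(np.argmin(cv_err))]
        model = fit_gloss(X_tr, y_tr, best_lam)
        errors.append(error_rate(model, X_te, y_te))
    return np.mean(errors), np.std(errors)
```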

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

4. http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure: four scatter plots in the plane of the first two discriminant directions (rows: Nakayama, Sun; columns: GLOSS, SLDA); axes are the 1st and 2nd discriminant. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships from Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.
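A possible construction of such a penalty matrix is sketched below, assuming a 4-neighbor proximity on the 16 × 16 pixel grid (this is our illustration of the idea; the exact neighborhood encoded in Figure 5.2 may differ).

```python
import numpy as np

def grid_laplacian(h=16, w=16):
    """Graph Laplacian of the 4-neighbor pixel grid (h*w nodes)."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in ((0, 1), (1, 0)):       # right and bottom neighbors
                rr, cc = r + dr, c + dc
                if rr < h and cc < w:
                    j = rr * w + cc
                    A[i, j] = A[j, i] = 1.0
    D = np.diag(A.sum(axis=1))
    return D - A                                  # Laplacian L = D - A

Omega_L = grid_laplacian()                        # 256 x 256 penalty matrix
```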

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow to detect strokes, and will probably provide better prediction results.


[Figure: β for GLOSS (left) and β for S-GLOSS (right).]

Figure 6.4: Discriminant direction between digits "1" and "0".

[Figure: β for GLOSS with λ = 0.3 (left) and β for S-GLOSS with λ = 0.3 (right).]

Figure 6.5: Sparse discriminant direction between digits "1" and "0".


    Discussion

GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, up to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, either regarding its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data, from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


    Part III

    Sparse Clustering Analysis


    Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified with different means and a common covariance matrix.

As in the supervised framework, the traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other model-based sparse clustering mechanisms at the state of the art in Chapter 10.


    7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^⊤, ..., x_n^⊤)^⊤ have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

    f(x_i) = \sum_{k=1}^{K} π_k f_k(x_i),   ∀ i ∈ {1, ..., n},

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ ∀k and \sum_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, ..., π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

    f(x_i; θ) = \sum_{k=1}^{K} π_k φ(x_i; θ_k),   ∀ i ∈ {1, ..., n},


where θ = (π_1, ..., π_K, θ_1, ..., θ_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphic methods, maximum likelihood methods, and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

    Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

    L(θ; X) = \log \left( \prod_{i=1}^{n} f(x_i; θ) \right) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} π_k f_k(x_i; θ_k) \right),    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or


classification log-likelihood:

    L_C(θ; X, Y) = \log \left( \prod_{i=1}^{n} f(x_i, y_i; θ) \right)
                 = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} y_{ik} π_k f_k(x_i; θ_k) \right)
                 = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( π_k f_k(x_i; θ_k) \right).    (7.2)

The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k, and y_{ik} = 0 otherwise.

The soft membership t_{ik}(θ) is defined as

    t_{ik}(θ) = p(Y_{ik} = 1 | x_i; θ)    (7.3)
              = π_k f_k(x_i; θ_k) / f(x_i; θ).    (7.4)

To lighten notations, t_{ik}(θ) will be denoted t_{ik} when the parameter θ is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

    L_C(θ; X, Y) = \sum_{i,k} y_{ik} \log \left( π_k f_k(x_i; θ_k) \right)
                 = \sum_{i,k} y_{ik} \log \left( t_{ik} f(x_i; θ) \right)
                 = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; θ)
                 = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; θ)
                 = \sum_{i,k} y_{ik} \log t_{ik} + L(θ; X),    (7.5)

where \sum_{i,k} y_{ik} \log t_{ik} can be reformulated as

    \sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( p(Y_{ik} = 1 | x_i; θ) \right)
                                  = \sum_{i=1}^{n} \log \left( p(Y_{ik} = 1 | x_i; θ) \right)
                                  = \log \left( p(Y | X; θ) \right).

As a result, the relationship (7.5) can be rewritten as

    L(θ; X) = L_C(θ; Z) − \log \left( p(Y | X; θ) \right).    (7.6)


    Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_{ik} are unknown. However, it is possible to work with the expectation of the log-likelihood, taking expectations in (7.6) conditionally on a current value θ^{(t)} of the parameter:

    L(θ; X) = E_{Y ∼ p(·|X; θ^{(t)})} [L_C(θ; X, Y)] + E_{Y ∼ p(·|X; θ^{(t)})} [−\log p(Y | X; θ)],

where the first term is denoted Q(θ, θ^{(t)}) and the second H(θ, θ^{(t)}). In this expression, H(θ, θ^{(t)}) is the entropy term and Q(θ, θ^{(t)}) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ∆L = L(θ^{(t+1)}; X) − L(θ^{(t)}; X). Then θ^{(t+1)} = argmax_θ Q(θ, θ^{(t)}) also increases the log-likelihood:

    ∆L = \left( Q(θ^{(t+1)}, θ^{(t)}) − Q(θ^{(t)}, θ^{(t)}) \right) + \left( H(θ^{(t+1)}, θ^{(t)}) − H(θ^{(t)}, θ^{(t)}) \right),

where the first difference is non-negative by definition of iteration t + 1, and the second is non-negative by Jensen's inequality (it is a Kullback-Leibler divergence). Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^{(t)}). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^{(t)}).

For the mixture model problem, Q(θ, θ′) is

    Q(θ, θ′) = E_{Y ∼ p(Y|X; θ′)} [L_C(θ; X, Y)]
             = \sum_{i,k} p(Y_{ik} = 1 | x_i; θ′) \log \left( π_k f_k(x_i; θ_k) \right)
             = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(θ′) \log \left( π_k f_k(x_i; θ_k) \right).    (7.7)

Due to its similarity with the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_{ik}(θ′) are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^{(0)};

• E-step: evaluation of Q(θ, θ^{(t)}), using t_{ik}(θ^{(t)}) (7.4) in (7.7);

• M-step: calculation of θ^{(t+1)} = argmax_θ Q(θ, θ^{(t)}).


    Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix Σ and different mean vectors μ_k, the mixture density is

    f(x_i; θ) = \sum_{k=1}^{K} π_k f_k(x_i; θ_k)
              = \sum_{k=1}^{K} π_k \frac{1}{(2π)^{p/2} |Σ|^{1/2}} \exp \left\{ −\frac{1}{2} (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) \right\}.

At the E-step, the posterior probabilities t_{ik} are computed as in (7.4) with the current θ^{(t)} parameters; then the M-step maximizes Q(θ, θ^{(t)}) (7.7), whose form is as follows:

    Q(θ, θ^{(t)}) = \sum_{i,k} t_{ik} \log π_k − \sum_{i,k} t_{ik} \log \left( (2π)^{p/2} |Σ|^{1/2} \right) − \frac{1}{2} \sum_{i,k} t_{ik} (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
                  = \sum_{k} t_k \log π_k − \frac{np}{2} \log(2π) − \frac{n}{2} \log |Σ| − \frac{1}{2} \sum_{i,k} t_{ik} (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k)
                  ≡ \sum_{k} t_k \log π_k − \frac{n}{2} \log |Σ| − \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i − μ_k)^⊤ Σ^{−1} (x_i − μ_k) \right),    (7.8)

where the term in np/2 \log(2π) is constant and

    t_k = \sum_{i=1}^{n} t_{ik}.    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ^{(t+1)}:

    π_k^{(t+1)} = t_k / n,    (7.10)

    μ_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k},    (7.11)

    Σ^{(t+1)} = \frac{1}{n} \sum_k W_k,    (7.12)

    with W_k = \sum_i t_{ik} (x_i − μ_k)(x_i − μ_k)^⊤.    (7.13)

The derivations are detailed in Appendix G.
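These updates translate directly into code. The sketch below implements one EM iteration for a Gaussian mixture with common covariance matrix, following (7.4) and (7.10)-(7.13); it is a plain illustration of the standard algorithm, not the Mix-GLOSS implementation.

```python
import numpy as np

def e_step(X, pi, mu, Sigma):
    """Posterior probabilities t_ik of Eq. (7.4) for a common-covariance Gaussian mixture."""
    n, p = X.shape
    K = len(pi)
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    log_t = np.empty((n, K))
    for k in range(K):
        d = X - mu[k]
        maha = np.einsum('ij,jk,ik->i', d, Sigma_inv, d)   # (x_i-mu_k)' Sigma^-1 (x_i-mu_k)
        log_t[:, k] = np.log(pi[k]) - 0.5 * (maha + logdet + p * np.log(2 * np.pi))
    log_t -= log_t.max(axis=1, keepdims=True)              # stabilize before normalizing
    T = np.exp(log_t)
    return T / T.sum(axis=1, keepdims=True)

def m_step(X, T):
    """Updates (7.10)-(7.13)."""
    n, p = X.shape
    t_k = T.sum(axis=0)                                    # (7.9)
    pi = t_k / n                                           # (7.10)
    mu = (T.T @ X) / t_k[:, None]                          # (7.11)
    Sigma = np.zeros((p, p))
    for k in range(T.shape[1]):
        d = X - mu[k]
        Sigma += (T[:, k, None] * d).T @ d                 # W_k, Eq. (7.13)
    return pi, mu, Sigma / n                               # (7.12)
```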

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

    \log \left( \frac{p(Y_k = 1 | x)}{p(Y_ℓ = 1 | x)} \right) = x^⊤ Σ^{−1} (μ_k − μ_ℓ) − \frac{1}{2} (μ_k + μ_ℓ)^⊤ Σ^{−1} (μ_k − μ_ℓ) + \log \frac{π_k}{π_ℓ}.

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm

    λ \sum_{k=1}^{K} \sum_{j=1}^{p} |μ_{kj}|,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

    λ_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |μ_{kj}| + λ_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} |(Σ_k^{−1})_{jm}|.

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

    λ \sum_{j=1}^{p} \sum_{1 ≤ k < k' ≤ K} |μ_{kj} − μ_{k'j}|.

This PFP regularization does not shrink the means to zero but towards each other. If the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L_{1,∞} penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

    λ \sum_{j=1}^{p} \| (μ_{1j}, μ_{2j}, ..., μ_{Kj}) \|_∞.

One group is defined for each variable j, as the set of the jth components of the K means, (μ_{1j}, ..., μ_{Kj}). The L_{1,∞} penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

    λ \sqrt{K} \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K} μ_{kj}^2 }.

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing.
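To make the differences between these penalties concrete, the following sketch evaluates them on a K × p matrix of cluster means (this is our illustration; weights and scaling constants vary between the cited papers).

```python
import numpy as np

def lasso_penalty(mu):            # Pan and Shen (2007): sum_k sum_j |mu_kj|
    return np.abs(mu).sum()

def linf_group_penalty(mu):       # Wang and Zhu (2008): sum_j max_k |mu_kj|
    return np.abs(mu).max(axis=0).sum()

def group_lasso_vmg_penalty(mu):  # Xie et al. (2008b), VMG: sqrt(K) * sum_j ||mu_.j||_2
    K = mu.shape[0]
    return np.sqrt(K) * np.linalg.norm(mu, axis=0).sum()

mu = np.array([[0.0, 0.7, 0.0],   # K = 2 clusters, p = 3 variables
               [0.0, 0.0, 0.5]])
# The L_inf and group-Lasso penalties group the K coefficients of each variable,
# so zeroing a whole column removes that variable from the decision rule.
```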

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions from the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independency; that is, the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


    f(x_i | φ, π, θ, ν) = \sum_{k=1}^{K} π_k \prod_{j=1}^{p} \left[ f(x_{ij} | θ_{jk}) \right]^{φ_j} \left[ h(x_{ij} | ν_j) \right]^{1 − φ_j},

where f(·|θ_{jk}) is the distribution function for the relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, ..., φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_{jk}, ν_j, ρ_j}. Their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion

    \operatorname{tr} \left( (U^⊤ Σ_W U)^{−1} U^⊤ Σ_B U \right),    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and the model parameters in the latent space, such that the U matrix enters into the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

    \min_{Ũ ∈ R^{p×(K−1)}} \left\| X_U − X Ũ \right\|_F^2 + λ \sum_{k=1}^{K−1} \| ũ_k \|_1,

where X_U = XU is the input data projected in the non-sparse space, and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:

    \min_{A, B ∈ R^{p×(K−1)}} \sum_{k=1}^{K} \left\| R_W^{−⊤} H_{B,k} − A B^⊤ H_{B,k} \right\|_2^2 + ρ \sum_{j=1}^{K−1} β_j^⊤ Σ_W β_j + λ \sum_{j=1}^{K−1} \| β_j \|_1
    s.t. A^⊤ A = I_{K−1},

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_{ik}, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W; Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space; A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.

The last possibility computes the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:

    \min_{U ∈ R^{p×(K−1)}} \sum_{j=1}^{p} \left\| Σ_{B,j} − U U^⊤ Σ_{B,j} \right\|_2^2
    s.t. U^⊤ U = I_{K−1},

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of Ũ to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain suppositions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this view, Raftery and Dean (2006) model the observations with a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^{(1)}: set of selected relevant variables;

• X^{(2)}: set of variables being considered for inclusion in or exclusion from X^{(1)};

• X^{(3)}: set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:
    f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) f(X^{(2)} | X^{(1)}) f(X^{(1)} | Y);

• M2:
    f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) f(X^{(2)}, X^{(1)} | Y).

Model M1 means that the variables in X^{(2)} are independent of the clustering Y; model M2

states that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

    B_{12} = \frac{f(X | M_1)}{f(X | M_2)},

where the high-dimensional f(X^{(3)} | X^{(2)}, X^{(1)}) cancels from the ratio:

    B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} | M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} | M_2)} = \frac{f(X^{(2)} | X^{(1)}, M_1) \, f(X^{(1)} | M_1)}{f(X^{(2)}, X^{(1)} | M_2)}.

This factor is approximated, since the integrated likelihoods f(X^{(1)} | M_1) and f(X^{(2)}, X^{(1)} | M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)} | X^{(1)}, M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}. There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows to define blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the quantity of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


    8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to derive reduced-rank decision rules, using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow to solve the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

    d(x_i, μ_k) = (x_i − μ_k)^⊤ Σ_W^{−1} (x_i − μ_k),

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_{ik} (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

    2 l_{weight}(μ, Σ) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, d(x_i, μ_k) − n \log(|Σ_W|),

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_{ik} in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

    d(x_i, μ_k) = \| (x_i − μ_k) B_{LDA} \|_2^2 − 2 \log(π_k).

This distance defines the computation of the posterior probabilities t_{ik} in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

       B_OS = (X^⊤ X + λΩ)^{−1} X^⊤ Y Θ,

   where Θ are the K − 1 leading eigenvectors of Y^⊤ X (X^⊤ X + λΩ)^{−1} X^⊤ Y.

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag(α_k^{−1} (1 − α_k^2)^{−1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate distances into posterior probabilities t_{ik} with

       t_{ik} ∝ \exp \left[ − \frac{d(x_i, μ_k) − 2 \log(π_k)}{2} \right].    (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_{ik} converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
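A compact sketch of the two formulaic steps is given below, with a quadratic penalty matrix Ω standing in for the working penalty of GLOSS; it follows the steps as stated above, while the normalization of the scores, the mapping to the LDA domain and the scaling matrix D are omitted for brevity (this is an illustration, not the Mix-GLOSS inner solver).

```python
import numpy as np

def penalized_os_step(X, Y, Omega, lam):
    """Step 2: B_OS = (X'X + lam*Omega)^{-1} X'Y Theta, with Theta the
    K-1 leading eigenvectors of Y'X (X'X + lam*Omega)^{-1} X'Y."""
    K = Y.shape[1]
    A = np.linalg.inv(X.T @ X + lam * Omega)
    M = Y.T @ X @ A @ X.T @ Y
    evals, evecs = np.linalg.eigh(M)              # ascending eigenvalues
    Theta = evecs[:, ::-1][:, :K - 1]             # K-1 leading eigenvectors
    B_os = A @ X.T @ Y @ Theta
    return B_os, Theta

def posteriors(dist2, pi):
    """Step 6, Eq. (8.1): t_ik proportional to exp(-(d(x_i, mu_k) - 2 log pi_k)/2)."""
    log_t = -0.5 * (dist2 - 2.0 * np.log(pi))     # dist2: n x K matrix of squared distances
    log_t -= log_t.max(axis=1, keepdims=True)     # numerical stabilization
    T = np.exp(log_t)
    return T / T.sum(axis=1, keepdims=True)
```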

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized


optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

    f(Σ | Λ_0, ν_0) = \frac{1}{2^{np/2} |Λ_0|^{n/2} Γ_p(n/2)} |Σ^{−1}|^{(ν_0 − p − 1)/2} \exp \left\{ −\frac{1}{2} \operatorname{tr}(Λ_0^{−1} Σ^{−1}) \right\},

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

    Γ_p(n/2) = π^{p(p−1)/4} \prod_{j=1}^{p} Γ\left( \frac{n}{2} + \frac{1 − j}{2} \right).

The posterior distribution can be maximized, similarly to the likelihood, through the


maximization of

    Q(θ, θ′) + \log f(Σ | Λ_0, ν_0)
      = \sum_{k=1}^{K} t_k \log π_k − \frac{(n+1)p}{2} \log 2 − \frac{n}{2} \log |Λ_0| − \frac{p(p+1)}{4} \log π
        − \sum_{j=1}^{p} \log Γ\left( \frac{n}{2} + \frac{1 − j}{2} \right) − \frac{ν_n − p − 1}{2} \log |Σ| − \frac{1}{2} \operatorname{tr}(Λ_n^{−1} Σ^{−1})
      ≡ \sum_{k=1}^{K} t_k \log π_k − \frac{n}{2} \log |Λ_0| − \frac{ν_n − p − 1}{2} \log |Σ| − \frac{1}{2} \operatorname{tr}(Λ_n^{−1} Σ^{−1}),    (8.2)

with

    t_k = \sum_{i=1}^{n} t_{ik},
    ν_n = ν_0 + n,
    Λ_n^{−1} = Λ_0^{−1} + S_0,
    S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i − μ_k)(x_i − μ_k)^⊤.

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by deriving (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

    Σ_MAP = \frac{1}{ν_0 + n − p − 1} \left( Λ_0^{−1} + S_0 \right),    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and setting Λ_0^{−1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).


    9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_{ik}.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates the local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the warm-start implemented here reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage


of variables that will be removed with the next penalty parameter can be modified to make the feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, Equations (4.32b) must be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0
    Y ← K-means(X, K)
    Run non-penalized Mix-GLOSS:
        λ ← 0
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    lastLAMBDA ← false
repeat
    Estimate λ:
        Compute the gradient at β_j = 0:
            ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( \sum_{m ≠ j} x_m β_m − YΘ )
        Compute λ_max for every feature using (4.32b):
            λ_max^j = (1/w_j) \| ∂J(B)/∂β_j |_{β_j = 0} \|_2
        Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_{ik}, π_k, μ_k, Σ, Y for every λ in the solution path
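The λ_max computation in Algorithm 2 can be sketched as follows (B, YΘ and the group weights w_j are assumed to come from the current fit; reading the 10% rule as a quantile over the active features is our interpretation of the heuristic).

```python
import numpy as np

def lambda_max_per_feature(X, YTheta, B, w):
    """lambda_max_j = || x_j' (sum_{m != j} x_m beta_m - Y Theta) ||_2 / w_j."""
    R = X @ B - YTheta                        # full residual of the OS regression
    lam_max = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        r_j = R - np.outer(X[:, j], B[j])     # residual with feature j removed
        lam_max[j] = np.linalg.norm(X[:, j] @ r_j) / w[j]
    return lam_max

# One way to implement the "remove about 10% of the relevant features" rule:
# active = np.flatnonzero(np.linalg.norm(B, axis=1) > 0)
# next_lambda = np.quantile(lambda_max_per_feature(X, YTheta, B, w)[active], 0.10)
```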

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_{ik} is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0, Y ← Y0
    else
        B_OS ← 0, Y ← K-means(X, K)
    end if
    convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{−1} (1 − α^2)^{−1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_{ik} as per (8.1)
        L(θ) as per (8.2)
    if (1/n) \sum_i |t_{ik} − y_{ik}| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_{ik}, π_k, μ_k, Σ, Y


    M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ and the priors π_k of every component. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

    E-Step

The E-step evaluates the posterior probability matrix T, using

    t_{ik} ∝ \exp \left[ − \frac{d(x_i, μ_k) − 2 \log(π_k)}{2} \right].

The convergence of those t_{ik} is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested


[Figure: flowchart of the model selection mechanism. An initial non-penalized Mix-GLOSS (λ = 0, 20 repetitions) is run on X with the chosen K, maximum EM iterations and number of repetitions; the B and T matrices of the best repetition warm-start a single penalized Mix-GLOSS run per λ value; BIC is computed for each λ, and λ is chosen so as to minimize BIC, yielding the partition, t_{ik}, π_k, λ_BEST, B, Θ, D, L(θ) and the active set.]

Figure 9.2: Mix-GLOSS model selection diagram.

with no significant differences in the quality of the clustering, but reducing dramatically the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.
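As an illustration of the criterion, the sketch below evaluates a BIC in the spirit of Pan and Shen (2007), where the count of free parameters only includes the coefficients that have not been zeroed; the exact parameter count used inside Mix-GLOSS may differ.

```python
import numpy as np

def modified_bic(loglik, n, K, B):
    """BIC = -2 log L + log(n) * d, with d counting only non-zero coefficients.

    B is the p x (K-1) coefficient matrix; a variable whose row is entirely
    null is considered removed and does not contribute to the count.
    """
    nonzero_rows = int(np.count_nonzero(np.linalg.norm(B, axis=1) > 0))
    d = (K - 1) + nonzero_rows * (K - 1)      # mixture proportions + active coefficients
    return -2.0 * loglik + np.log(n) * d
```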


    10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations, except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1, adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering, using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering libraries mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded, due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L_{1,∞} mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The LumiWCluster package allows to perform clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see


Chapter 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The parameters used to measure the performance are:

• Clustering error (in percentage): to measure the quality of the partition, with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Scholkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows to obtain the ideal 0% of clustering error even if the IDs of the clusters and of the real classes differ (a possible implementation is sketched after this list).

• Number of disposed features: this value shows the number of variables whose coefficients have been zeroed, so that they are not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of Execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to consume more memory and CPU as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of the relevant variables that are selected (recall), and the FPR is the proportion of the irrelevant variables that are selected (fall-out). The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results we compare TPR and FPR on the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS. A small illustrative sketch of how these performance measures can be computed is given below.
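For concreteness, the following sketch (ours, not the code used in these experiments) illustrates how such measures can be computed. The best cluster-to-class matching is one common way to obtain the labeling invariance of the clustering error described above; Wu and Scholkopf (2007) give the precise definition used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Error rate (in %) after the best one-to-one matching of cluster IDs to class IDs."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Confusion matrix: rows are true classes, columns are predicted clusters.
    conf = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                     for c in classes])
    rows, cols = linear_sum_assignment(-conf)        # matching that maximizes agreement
    return 100.0 * (1.0 - conf[rows, cols].sum() / len(y_true))

def tpr_fpr(selected, relevant, p):
    """TPR and FPR (in %) of the selected variable indices, given the relevant ones."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(p)) - relevant
    tpr = 100.0 * len(selected & relevant) / len(relevant)
    fpr = 100.0 * len(selected & irrelevant) / len(irrelevant)
    return tpr, fpr
```

In our simulations, `relevant` would contain the first 20 variables and p = 100, so a perfect selection gives TPR = 100% and FPR = 0%.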

Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).


Table 10.1: Experimental results for simulated data

Sim 1: K = 4, mean shift, ind. features
                        Err (%)        Var            Time
  CS general cov        46 (15)        985 (72)       884h
  Fisher EM             58 (87)        784 (52)       1645m
  Clustvarsel           602 (107)      378 (291)      383h
  LumiWCluster-Kuan     42 (68)        779 (4)        389s
  LumiWCluster-Wang     43 (69)        784 (39)       619s
  Mix-GLOSS             32 (16)        80 (09)        15h

Sim 2: K = 2, mean shift, dependent features
                        Err (%)        Var            Time
  CS general cov        154 (2)        997 (09)       783h
  Fisher EM             74 (23)        809 (28)       8m
  Clustvarsel           73 (2)         334 (207)      166h
  LumiWCluster-Kuan     64 (18)        798 (04)       155s
  LumiWCluster-Wang     63 (17)        799 (03)       14s
  Mix-GLOSS             77 (2)         841 (34)       2h

Sim 3: K = 4, 1D mean shift, ind. features
                        Err (%)        Var            Time
  CS general cov        304 (57)       55 (468)       1317h
  Fisher EM             233 (65)       366 (55)       22m
  Clustvarsel           658 (115)      232 (291)      542h
  LumiWCluster-Kuan     323 (21)       80 (02)        83s
  LumiWCluster-Wang     308 (36)       80 (02)        1292s
  Mix-GLOSS             347 (92)       81 (88)        21h

Sim 4: K = 4, mean shift, ind. features
                        Err (%)        Var            Time
  CS general cov        626 (55)       999 (02)       112h
  Fisher EM             567 (104)      55 (48)        195m
  Clustvarsel           732 (4)        24 (12)        767h
  LumiWCluster-Kuan     692 (112)      99 (2)         876s
  LumiWCluster-Wang     697 (119)      991 (21)       825s
  Mix-GLOSS             669 (91)       975 (12)       11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms

              Simulation 1      Simulation 2      Simulation 3      Simulation 4
              TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  MIX-GLOSS   992     015       828     335       884     67        780     12
  LUMI-KUAN   992     28        1000    02        1000    005       50      005
  FISHER-EM   986     24        888     17        838     5825      620     4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations (MIX-GLOSS, LUMI-KUAN and FISHER-EM on Simulations 1 to 4).

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best in terms of fall-out and recall.


    Conclusions

    Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows all the resources available for solving regression problems to be brought to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection for the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

    Perspectives

Even if the preliminary results are encouraging, Mix-GLOSS has not been sufficiently tested. We plan to test it at least on the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample setting (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the referenced literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement elastic-net-like penalties. Some of those possibilities, such as the diagonal version of GLOSS, have been partially implemented, but they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms, which do not involve the computation of the real criterion, have been used for stopping the EM algorithm and for model selection. However, further investigation must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex, but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


    Appendix


    A Matrix Properties

Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:

$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top, \qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top.$$

Property 2. $\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a$

Property 3. $\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top)\,x$

Property 4. $\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top$

Property 5. $\dfrac{\partial\, a^\top X b}{\partial X} = a b^\top$

Property 6. $\dfrac{\partial}{\partial X}\operatorname{tr}\!\left(A X^{-1} B\right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}$
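As a sanity check of the derivative conventions used above, the short NumPy sketch below (ours, purely illustrative; the helper `num_grad` and the tolerances are assumptions) verifies Properties 5 and 6 by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
a, b = rng.normal(size=(p, 1)), rng.normal(size=(p, 1))
X = rng.normal(size=(p, p)) + p * np.eye(p)   # well-conditioned, invertible

def num_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function f with respect to X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# Property 5: d(a'Xb)/dX = ab'
assert np.allclose(num_grad(lambda M: (a.T @ M @ b).item(), X), a @ b.T, atol=1e-5)

# Property 6: d tr(A X^{-1} B)/dX = -(X^{-1} B A X^{-1})'
A, B = rng.normal(size=(p, p)), rng.normal(size=(p, p))
expected = -(np.linalg.inv(X) @ B @ A @ np.linalg.inv(X)).T
assert np.allclose(num_grad(lambda M: np.trace(A @ np.linalg.inv(M) @ B), X),
                   expected, atol=1e-4)
```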


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form

$$\min_{\theta_k,\,\beta_k} \; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \qquad \text{(B.1)}$$
$$\text{s.t.} \quad \theta_k^\top Y^\top Y \theta_k = 1, \qquad \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k,$$

for k = 1, ..., K−1.

The Lagrangian associated with Problem (B.1) is

$$L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k\big(\theta_k^\top Y^\top Y \theta_k - 1\big) + \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k. \qquad \text{(B.2)}$$

Setting the gradient of (B.2) with respect to $\beta_k$ to zero gives the optimal $\beta_k^\star$:

$$\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. \qquad \text{(B.3)}$$

The objective function of (B.1) evaluated at $\beta_k^\star$ is

$$\min_{\theta_k} \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star
= \min_{\theta_k} \; \theta_k^\top Y^\top\big(I - X(X^\top X + \Omega_k)^{-1}X^\top\big) Y \theta_k$$
$$= \max_{\theta_k} \; \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. \qquad \text{(B.4)}$$

If the penalty matrix $\Omega_k$ is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem where the $k$ score vectors $\theta_k$ are the eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like $Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$ is not trivial because of the $p \times p$ inverse: with some datasets $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let $M$ be the matrix $Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$, so that expression (B.4) can be rewritten in a compact way:

$$\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \operatorname{tr}\big(\Theta^\top M \Theta\big) \qquad \text{(B.5)}$$
$$\text{s.t.} \quad \Theta^\top Y^\top Y \Theta = I_{K-1}.$$

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1) \times (K-1)$ matrix $M_\Theta$ be $\Theta^\top M \Theta$. Then the classical eigenvector formulation associated with (B.5) is

$$M_\Theta v = \lambda v, \qquad \text{(B.6)}$$

where $v$ is an eigenvector and $\lambda$ the associated eigenvalue of $M_\Theta$. Operating on this expression,

$$v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda.$$

Making the change of variable $w = \Theta v$, we obtain an alternative eigen-problem where the $w$ are the eigenvectors of $M$ and $\lambda$ the associated eigenvalues:

$$w^\top M w = \lambda. \qquad \text{(B.7)}$$

Therefore the $v$ are the eigenvectors of the eigen-decomposition of matrix $M_\Theta$, and the $w$ are the eigenvectors of the eigen-decomposition of matrix $M$. Note that the only difference between the $(K-1) \times (K-1)$ matrix $M_\Theta$ and the $K \times K$ matrix $M$ is the $K \times (K-1)$ matrix $\Theta$ in the expression $M_\Theta = \Theta^\top M \Theta$. Then, to avoid the computation of the $p \times p$ inverse $(X^\top X + \Omega)^{-1}$, we can use the optimal value of the coefficient matrix $B^\star = (X^\top X + \Omega)^{-1} X^\top Y \Theta$ in $M_\Theta$:

$$M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B^\star.$$

Thus the eigen-decomposition of the $(K-1) \times (K-1)$ matrix $M_\Theta = \Theta^\top Y^\top X B^\star$ yields the $v$ eigenvectors of (B.6). To obtain the $w$ eigenvectors of the alternative formulation (B.7), the change of variable $w = \Theta v$ needs to be undone.

To summarize, we compute the $v$ eigenvectors from the eigen-decomposition of the tractable matrix $M_\Theta$, evaluated as $\Theta^\top Y^\top X B^\star$. Then the definitive eigenvectors $w$ are recovered by $w = \Theta v$. The final step is the reconstruction of the optimal score matrix $\Theta^\star$ using the vectors $w$ as its columns. At this point we understand what is called in the literature "updating the initial score matrix": multiplying the initial $\Theta$ by the matrix of eigenvectors $V$ from decomposition (B.6) reverses the change of variable to restore the $w$ vectors. The $B$ matrix also needs to be "updated", by multiplying $B^\star$ by the same matrix of eigenvectors $V$, in order to account for the initial $\Theta$ matrix used in the first computation of $B^\star$:

$$(X^\top X + \Omega)^{-1} X^\top Y \Theta V = B^\star V.$$
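A minimal NumPy sketch of this update (ours, not the GLOSS implementation) could look as follows, assuming X is the n×p data matrix, Y the n×K indicator matrix, Theta0 an initial score matrix satisfying the constraint of (B.5), and Omega a symmetric penalty matrix; the symmetrization and sorting steps are implementation choices.

```python
import numpy as np

def update_scores(X, Y, Theta0, Omega):
    """Small-dimension eigen-update sketched in B.1.

    X      : (n, p) data matrix
    Y      : (n, K) class indicator matrix
    Theta0 : (K, K-1) initial score matrix with Theta0' Y'Y Theta0 = I
    Omega  : (p, p) symmetric penalty matrix
    Returns the updated score matrix and regression coefficients.
    """
    # Penalized regression of the scored indicators: B = (X'X + Omega)^{-1} X'Y Theta0,
    # computed with a linear solve rather than an explicit p x p inverse.
    B = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)

    # Small (K-1) x (K-1) matrix M_Theta = Theta0' Y'X B; its eigenvectors are the v of (B.6).
    M_Theta = Theta0.T @ Y.T @ X @ B
    eigvals, V = np.linalg.eigh((M_Theta + M_Theta.T) / 2)   # symmetrize for stability
    order = np.argsort(eigvals)[::-1]                        # sort by decreasing eigenvalue
    V = V[:, order]

    # Undo the change of variable w = Theta v: update scores and coefficients.
    return Theta0 @ V, B @ V
```

The key point is that only a (K−1)×(K−1) eigen-decomposition is performed, while the p-dimensional computation reduces to the linear solve that is needed anyway to obtain the regression coefficients.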


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix $\Theta^\star$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$.

By definition of the eigen-decomposition, the eigenvectors of $M$ (called $w$ in (B.7)) form a basis, so that any score vector $\theta$ can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m, \quad \text{s.t.} \quad \theta_k^\top \theta_k = 1. \qquad \text{(B.8)}$$

The unit-norm constraint $\theta_k^\top \theta_k = 1$ can also be expressed as a function of this basis:

$$\Big(\sum_{m=1}^{K-1} \alpha_m w_m\Big)^{\!\top} \Big(\sum_{m=1}^{K-1} \alpha_m w_m\Big) = 1,$$

which, as per the eigenvector properties, can be reduced to

$$\sum_{m=1}^{K-1} \alpha_m^2 = 1. \qquad \text{(B.9)}$$

Let $M$ be multiplied by a score vector $\theta_k$, which can be replaced by its linear combination of eigenvectors $w_m$ from (B.8):

$$M\theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m.$$

As the $w_m$ are the eigenvectors of $M$, the relationship $M w_m = \lambda_m w_m$ can be used to obtain

$$M\theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m.$$

Multiplying on the left by $\theta_k^\top$, expressed as its linear combination of eigenvectors,

$$\theta_k^\top M \theta_k = \Big(\sum_{\ell=1}^{K-1} \alpha_\ell w_\ell\Big)^{\!\top} \Big(\sum_{m=1}^{K-1} \alpha_m \lambda_m w_m\Big).$$

This equation can be simplified using the orthogonality of the eigenvectors, according to which $w_\ell^\top w_m = 0$ for any $\ell \neq m$, giving

$$\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m.$$


The optimization problem (B.5) for discriminant direction $k$ can thus be rewritten as

$$\max_{\theta_k \in \mathbb{R}^{K}} \; \theta_k^\top M \theta_k = \max_{\theta_k \in \mathbb{R}^{K}} \; \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m, \qquad \text{(B.10)}$$
$$\text{with} \quad \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \quad \text{and} \quad \sum_{m=1}^{K-1} \alpha_m^2 = 1.$$

One way of maximizing Problem (B.10) is to choose $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m$, the resulting score vector $\theta_k$ is equal to the $k$th eigenvector $w_k$.

As a summary, it can be concluded that the solution of the original problem (B.1) can be achieved by an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X + \Omega)^{-1}X^\top Y$.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

$$\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta \qquad \text{(C.1a)}$$
$$\text{s.t.} \quad \beta^\top \Sigma_W \beta = 1, \qquad \text{(C.1b)}$$

where $\Sigma_B$ and $\Sigma_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.

The Lagrangian of Problem (C.1) is

$$L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu\big(\beta^\top \Sigma_W \beta - 1\big),$$

so that its first derivative with respect to $\beta$ is

$$\frac{\partial L(\beta, \nu)}{\partial \beta} = 2\Sigma_B \beta - 2\nu \Sigma_W \beta.$$

A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,

$$\Sigma_B \beta^\star = \nu \Sigma_W \beta^\star.$$

Provided $\Sigma_W$ is full rank, we have

$$\Sigma_W^{-1} \Sigma_B \beta^\star = \nu \beta^\star. \qquad \text{(C.2)}$$

Thus the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top} \Sigma_B \beta^\star = \beta^{\star\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^\star$$
$$= \nu\, \beta^{\star\top} \Sigma_W \beta^\star \qquad \text{from (C.2)}$$
$$= \nu \qquad \text{from (C.1b)}.$$

That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
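Numerically, (C.1) is a generalized symmetric eigenproblem and can be solved directly; the sketch below (ours, for illustration, assuming $\Sigma_W$ is full rank as above) shows this with SciPy.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(X, y):
    """Leading Fisher discriminant direction of data X (n x p) with labels y."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk) / n                     # within-class variance
        Sb += len(Xk) * np.outer(mk - xbar, mk - xbar) / n    # between-class variance
    # Generalized eigenproblem Sb v = nu Sw v (Sw must be positive definite);
    # eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = eigh(Sb, Sw)
    beta = eigvecs[:, -1]                                     # eigenvector of largest nu
    return beta / np.sqrt(beta @ Sw @ beta)                   # enforce beta' Sw beta = 1
```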


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j} \qquad \text{(D.1a)}$$
$$\text{s.t.} \quad \sum_{j=1}^{p} \tau_j = 1, \qquad \text{(D.1b)}$$
$$\qquad\;\; \tau_j \ge 0, \quad j = 1, \dots, p. \qquad \text{(D.1c)}$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B \in \mathbb{R}^{p \times (K-1)}$ be a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$: $B = (\beta^{1\top}, \dots, \beta^{p\top})^\top$.

The starting point is the Lagrangian

$$L(B, \tau, \lambda, \nu_0, \nu) = J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p} \tau_j - 1\Big) - \sum_{j=1}^{p} \nu_j \tau_j, \qquad \text{(D.2)}$$

which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:

$$\frac{\partial L(B, \tau, \lambda, \nu_0, \nu)}{\partial \tau_j}\bigg|_{\tau_j = \tau_j^\star} = 0
\;\Rightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0$$
$$\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0.$$

The last two expressions are related through a property of the Lagrange multipliers (complementary slackness), which states that $\nu_j g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ is the corresponding inequality constraint. Then the optimal $\tau_j^\star$ can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \|\beta^j\|_2.$$

Plugging this optimal value of $\tau_j$ into constraint (D.1b),

$$\sum_{j=1}^{p} \tau_j^\star = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}. \qquad \text{(D.3)}$$


With this value of $\tau_j^\star$, Problem (D.1) is equivalent to

$$\min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \Big(\sum_{j=1}^{p} w_j \|\beta^j\|_2\Big)^{\!2}. \qquad \text{(D.4)}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null vectors $\beta^j$.

The penalty term of (D.1a) can be conveniently presented as $\lambda B^\top \Omega B$, where

$$\Omega = \operatorname{diag}\Big(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p}\Big). \qquad \text{(D.5)}$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is

$$(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2}. \qquad \text{(D.6)}$$
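In an alternating scheme built on this variational form, one would, at each step, recompute $\tau^\star$ from the current coefficients via (D.3) and form the quadratic penalty through (D.5)–(D.6). A minimal sketch of these two computations is given below (ours; the function name and the eps safeguard for zeroed rows are assumptions, not part of the GLOSS code).

```python
import numpy as np

def variational_weights(B, w, eps=1e-12):
    """Compute tau* (D.3) and the diagonal of Omega (D.6) from the current coefficients.

    B : (p, K-1) matrix of regression coefficients (rows beta^j)
    w : (p,) vector of group weights w_j
    """
    norms = np.sqrt((B ** 2).sum(axis=1))          # ||beta^j||_2 for each row
    s = np.sum(w * norms)                          # sum_j w_j ||beta^j||_2
    tau = (w * norms) / max(s, eps)                # optimal tau*_j, summing to one
    omega_diag = w ** 2 / np.maximum(tau, eps)     # (Omega)_jj = w_j^2 / tau*_j
    return tau, omega_diag
```

The safeguarded division reflects the fact that $\tau_j^\star$ vanishes exactly when $\beta^j$ is zeroed, which is where the sparsity of the solution comes from.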

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $B \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (D.4) is the set of matrices

$$V \in \mathbb{R}^{p \times (K-1)}, \quad V = \frac{\partial J(B)}{\partial B} + 2\lambda \Big(\sum_{j=1}^{p} w_j \|\beta^j\|_2\Big) G, \qquad \text{(D.7)}$$

where $G = (g^{1\top}, \dots, g^{p\top})^\top$ is a $p \times (K-1)$ matrix defined as follows. Let $S(B)$ denote the row support of $B$, $S(B) = \{j \in \{1, \dots, p\} : \|\beta^j\|_2 \neq 0\}$; then we have

$$\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j, \qquad \text{(D.8)}$$
$$\forall j \notin S(B), \quad \|g^j\|_2 \le w_j. \qquad \text{(D.9)}$$

    2le wj (D9)


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B$ of the objective function verifying the following conditions are global minima. Let $S(B)$ denote the row support of $B$, $S(B) = \{j \in \{1, \dots, p\} : \|\beta^j\|_2 \neq 0\}$, and let $\bar{S}(B)$ be its complement; then we have

$$\forall j \in S(B), \quad -\frac{\partial J(B)}{\partial \beta^j} = 2\lambda \Big(\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2\Big)\, w_j \|\beta^j\|_2^{-1} \beta^j, \qquad \text{(D.10a)}$$
$$\forall j \in \bar{S}(B), \quad \Big\|\frac{\partial J(B)}{\partial \beta^j}\Big\|_2 \le 2\lambda\, w_j \Big(\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2\Big). \qquad \text{(D.10b)}$$

In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at $\tau^\star$ such that

$$\tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}.$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau \in \mathbb{R}^p$ be any feasible vector; we have

$$\Big(\sum_{j=1}^{p} w_j \|\beta^j\|_2\Big)^{\!2} = \Big(\sum_{j=1}^{p} \tau_j^{1/2}\, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}}\Big)^{\!2}
\le \Big(\sum_{j=1}^{p} \tau_j\Big) \Big(\sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j}\Big)
\le \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j},$$

where we used the Cauchy–Schwarz inequality for the first inequality and the definition of the feasibility set of $\tau$ for the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B^0$ are optimal for the score values $\Theta^0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta^0$, say $\Theta^\star = \Theta^0 V$ (where $V \in \mathbb{R}^{M \times M}$ is a unitary matrix), then $B^\star = B^0 V$ is optimal conditionally on $\Theta^\star$, that is, $(\Theta^\star, B^\star)$ is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $B^\star$ be a solution of

$$\min_{B \in \mathbb{R}^{p \times M}} \; \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2, \qquad \text{(E.1)}$$

and let $\tilde{Y} = YV$, where $V \in \mathbb{R}^{M \times M}$ is a unitary matrix. Then $\tilde{B} = B^\star V$ is a solution of

$$\min_{B \in \mathbb{R}^{p \times M}} \; \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2. \qquad \text{(E.2)}$$

Proof. The first-order necessary optimality conditions for $B^\star$ are

$$\forall j \in S(B^\star), \quad 2\, x_j^\top\big(X B^\star - Y\big) + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0, \qquad \text{(E.3a)}$$
$$\forall j \notin S(B^\star), \quad 2\, \big\|x_j^\top\big(X B^\star - Y\big)\big\|_2 \le \lambda w_j, \qquad \text{(E.3b)}$$

where $S(B^\star) \subseteq \{1, \dots, p\}$ denotes the set of non-zero row vectors of $B^\star$ and $\bar{S}(B^\star)$ is its complement.

First, we note that from the definition of $\tilde{B}$ we have $S(\tilde{B}) = S(B^\star)$. Then we may rewrite the above conditions as follows:

$$\forall j \in S(\tilde{B}), \quad 2\, x_j^\top\big(X \tilde{B} - \tilde{Y}\big) + \lambda w_j \|\tilde{\beta}^{j}\|_2^{-1} \tilde{\beta}^{j} = 0, \qquad \text{(E.4a)}$$
$$\forall j \notin S(\tilde{B}), \quad 2\, \big\|x_j^\top\big(X \tilde{B} - \tilde{Y}\big)\big\|_2 \le \lambda w_j, \qquad \text{(E.4b)}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses that $VV^\top = I$, so that $\forall u \in \mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$. Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
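The two ingredients of this proof, invariance of the data-fitting term and of the row norms under a unitary transformation, can be checked numerically; the sketch below (ours, with random data) illustrates them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M = 30, 8, 3
X, Y, B = rng.normal(size=(n, p)), rng.normal(size=(n, M)), rng.normal(size=(p, M))

# Random unitary (orthogonal) matrix V obtained from a QR decomposition.
V, _ = np.linalg.qr(rng.normal(size=(M, M)))
Y_tilde, B_tilde = Y @ V, B @ V

# The residual norm and the row norms entering the group-Lasso penalty are
# both unchanged by the rotation, so the objective value is preserved.
assert np.isclose(np.linalg.norm(Y - X @ B), np.linalg.norm(Y_tilde - X @ B_tilde))
assert np.allclose(np.linalg.norm(B, axis=1), np.linalg.norm(B_tilde, axis=1))
```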


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta, \theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta, \theta')$ when the latter is available:

$$L(\theta) = \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)\Big), \qquad \text{(F.1)}$$

$$Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\big(\pi_k f_k(x_i; \theta_k)\big), \qquad \text{(F.2)}$$

$$\text{with} \quad t_{ik}(\theta') = \frac{\pi_k' f_k(x_i; \theta_k')}{\sum_{\ell} \pi_\ell' f_\ell(x_i; \theta_\ell')}. \qquad \text{(F.3)}$$

In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probabilities computed from $\theta'$ at the previous E-step, and $\theta$, without "prime", denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\theta, \theta')$.

Using (F.3), we have

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\big(\pi_k f_k(x_i; \theta_k)\big)$$
$$= \sum_{i,k} t_{ik}(\theta') \log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta') \log\Big(\sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell)\Big)$$
$$= \sum_{i,k} t_{ik}(\theta') \log\big(t_{ik}(\theta)\big) + L(\theta).$$

In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed from the value of $Q(\theta, \theta)$ (7.7) and the entropy of the posterior probabilities:

$$L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log\big(t_{ik}(\theta)\big) = Q(\theta, \theta) + H(T).$$
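This identity is convenient for monitoring the EM algorithm; the following sketch (ours, with arbitrary parameter values for a Gaussian mixture with a common covariance matrix) checks it numerically.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, K, d = 200, 3, 2
X = rng.normal(size=(n, d))

# Arbitrary mixture parameters, purely for illustration.
pi = np.array([0.5, 0.3, 0.2])
mu = rng.normal(size=(K, d))
Sigma = np.eye(d)

# Weighted component densities pi_k f_k(x_i) and posterior probabilities t_ik (E-step).
dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma) for k in range(K)])
t = dens / dens.sum(axis=1, keepdims=True)

L = np.sum(np.log(dens.sum(axis=1)))     # log-likelihood (F.1)
Q = np.sum(t * np.log(dens))             # Q(theta, theta) from (F.2)
H = -np.sum(t * np.log(t))               # entropy of the posterior probabilities
assert np.isclose(L, Q + H)              # L(theta) = Q(theta, theta) + H(T)
```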


    G Derivation of the M-Step Equations

This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\big(\pi_k f_k(x_i; \theta_k)\big)$$
$$= \sum_{k} \log(\pi_k) \sum_{i} t_{ik} - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k),$$

which has to be maximized subject to $\sum_k \pi_k = 1$.

The Lagrangian of this problem is

$$L(\theta) = Q(\theta, \theta') + \lambda\Big(\sum_k \pi_k - 1\Big).$$

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.

G.1 Prior Probabilities

$$\frac{\partial L(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,$$

where $\lambda$ is identified from the constraint, leading to

$$\hat{\pi}_k = \frac{1}{n}\sum_i t_{ik}.$$


G.2 Means

$$\frac{\partial L(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2}\sum_i t_{ik}\, 2\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \hat{\mu}_k = \frac{\sum_i t_{ik}\, x_i}{\sum_i t_{ik}}.$$

G.3 Covariance Matrix

$$\frac{\partial L(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} \; - \; \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \hat{\Sigma} = \frac{1}{n}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top.$$
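These update formulas translate directly into code; a minimal NumPy sketch of such an M-step (ours, not the Mix-GLOSS implementation) could read:

```python
import numpy as np

def m_step(X, T):
    """M-step for a Gaussian mixture with a common covariance matrix.

    X : (n, p) data matrix
    T : (n, K) posterior probabilities t_ik from the E-step
    Returns the updated priors, means and pooled covariance matrix.
    """
    n, p = X.shape
    nk = T.sum(axis=0)                     # sum_i t_ik for each component
    pi = nk / n                            # pi_k = (1/n) sum_i t_ik
    mu = (T.T @ X) / nk[:, None]           # mu_k = sum_i t_ik x_i / sum_i t_ik
    Sigma = np.zeros((p, p))
    for k in range(mu.shape[0]):
        R = X - mu[k]                      # deviations from the k-th mean
        Sigma += (R * T[:, k:k + 1]).T @ R # sum_i t_ik (x_i - mu_k)(x_i - mu_k)'
    return pi, mu, Sigma / n
```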


    Bibliography

    F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

    F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

    F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

    J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

    A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

    H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function 'naive Bayes' and some alternatives when there are many more variables than observations Bernoulli 10(6)989–1010 2004

C Bienarcki G Celeux G Govaert and F Langrognet MIXMOD Statistical Documentation http://www.mixmod.org 2008

    C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

    C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

    C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

    S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

    L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

    L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


    T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

    S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

    C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

    B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

    L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

    C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

    A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

    D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

    R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

    B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

    Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

    R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

    V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

    J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


    J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

    J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

    W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chapman & Hall/CRC 2003

    D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation baseline for clustering Technical Report D71-m12 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implementations of original clustering Technical Report D72-m24 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf 2011

    Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

    Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

    L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

    Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

    J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

    I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

    T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

    T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


    T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

    A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

    J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

    T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

    K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

    P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

    T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

    M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

    Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

    C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

    C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

    H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

    J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

    Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

    C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and M L Martin-Magniette Selvarclust software for variable selection in model-based clustering http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html 2009b

    L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

    N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

    B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

    B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

    Y Nesterov Gradient methods for minimizing composite functions preprint 2007

    S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

    B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

    M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

    M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

    W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

    W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

    K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

    S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


    Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

    A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

    C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

    S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

    V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

    V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

    V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

    C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

    L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

    Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

    A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

    S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

    P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour l'Integration de Connaissances dans les Modeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


    M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

    R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

    J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

    S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

    D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

    D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

    D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

    M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

    MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

    T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

    B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

    B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

    C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

    J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


    M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

    P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

    P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

    H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

    H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

    H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005

    130

    • SANCHEZ MERCHANTE PDTpdf
    • Thesis Luis Francisco Sanchez Merchantepdf
      • List of figures
      • List of tables
      • Notation and Symbols
      • Context and Foundations
        • Context
        • Regularization for Feature Selection
          • Motivations
          • Categorization of Feature Selection Techniques
          • Regularization
            • Important Properties
            • Pure Penalties
            • Hybrid Penalties
            • Mixed Penalties
            • Sparsity Considerations
            • Optimization Tools for Regularized Problems
              • Sparse Linear Discriminant Analysis
                • Abstract
                • Feature Selection in Fisher Discriminant Analysis
                  • Fisher Discriminant Analysis
                  • Feature Selection in LDA Problems
                    • Inertia Based
                    • Regression Based
                        • Formalizing the Objective
                          • From Optimal Scoring to Linear Discriminant Analysis
                            • Penalized Optimal Scoring Problem
                            • Penalized Canonical Correlation Analysis
                            • Penalized Linear Discriminant Analysis
                            • Summary
                              • Practicalities
                                • Solution of the Penalized Optimal Scoring Regression
                                • Distance Evaluation
                                • Posterior Probability Evaluation
                                • Graphical Representation
                                  • From Sparse Optimal Scoring to Sparse LDA
                                    • A Quadratic Variational Form
                                    • Group-Lasso OS as Penalized LDA
                                        • GLOSS Algorithm
                                          • Regression Coefficients Updates
                                            • Cholesky decomposition
                                            • Numerical Stability
                                              • Score Matrix
                                              • Optimality Conditions
                                              • Active and Inactive Sets
                                              • Penalty Parameter
                                              • Options and Variants
                                                • Scaling Variables
                                                • Sparse Variant
                                                • Diagonal Variant
                                                • Elastic net and Structured Variant
                                                    • Experimental Results
                                                      • Normalization
                                                      • Decision Thresholds
                                                      • Simulated Data
                                                      • Gene Expression Data
                                                      • Correlated Data
                                                        • Discussion
                                                          • Sparse Clustering Analysis
                                                            • Abstract
                                                            • Feature Selection in Mixture Models
                                                              • Mixture Models
                                                                • Model
                                                                • Parameter Estimation The EM Algorithm
                                                                  • Feature Selection in Model-Based Clustering
                                                                    • Based on Penalized Likelihood
                                                                    • Based on Model Variants
                                                                    • Based on Model Selection
                                                                        • Theoretical Foundations
                                                                          • Resolving EM with Optimal Scoring
                                                                            • Relationship Between the M-Step and Linear Discriminant Analysis
                                                                            • Relationship Between Optimal Scoring and Linear Discriminant Analysis
                                                                            • Clustering Using Penalized Optimal Scoring
                                                                            • From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
                                                                              • Optimized Criterion
                                                                                • A Bayesian Derivation
                                                                                • Maximum a Posteriori Estimator
                                                                                    • Mix-GLOSS Algorithm
                                                                                      • Mix-GLOSS
                                                                                        • Outer Loop Whole Algorithm Repetitions
                                                                                        • Penalty Parameter Loop
                                                                                        • Inner Loop EM Algorithm
                                                                                          • Model Selection
                                                                                            • Experimental Results
                                                                                              • Tested Clustering Algorithms
                                                                                              • Results
                                                                                              • Discussion
                                                                                                  • Conclusions
                                                                                                  • Appendix
                                                                                                    • Matrix Properties
                                                                                                    • The Penalized-OS Problem is an Eigenvector Problem
                                                                                                      • How to Solve the Eigenvector Decomposition
                                                                                                      • Why the OS Problem is Solved as an Eigenvector Problem
                                                                                                        • Solving Fishers Discriminant Problem
                                                                                                        • Alternative Variational Formulation for the Group-Lasso
                                                                                                          • Useful Properties
                                                                                                          • An Upper Bound on the Objective Function
                                                                                                            • Invariance of the Group-Lasso to Unitary Transformations
                                                                                                            • Expected Complete Likelihood and Likelihood
                                                                                                            • Derivation of the M-Step Equations
                                                                                                              • Prior probabilities
                                                                                                              • Means
                                                                                                              • Covariance Matrix
                                                                                                                  • Bibliography

      Algorithmes drsquoestimation pour laclassification parcimonieuse

      Luis Francisco Sanchez MerchanteUniversity of Compiegne

      CompiegneFrance

      ldquoNunca se sabe que encontrara uno tras una puerta Quiza en eso consistela vida en girar pomosrdquo

      Albert Espinosa

      ldquoBe brave Take risks Nothing can substitute experiencerdquo

      Paulo Coelho

      Acknowledgements

      If this thesis has fallen into your hands and you have the curiosity to read this para-graph you must know that even though it is a short section there are quite a lot ofpeople behind this volume All of them supported me during the three years threemonths and three weeks that it took me to finish this work However you will hardlyfind any names I think it is a little sad writing peoplersquos names in a document that theywill probably not see and that will be condemned to gather dust on a bookshelf It islike losing a wallet with pictures of your beloved family and friends It makes me feelsomething like melancholy

      Obviously this does not mean that I have nothing to be grateful for I always feltunconditional love and support from my family and I never felt homesick since my spanishfriends did the best they could to visit me frequently During my time in CompiegneI met wonderful people that are now friends for life I am sure that all this people donot need to be listed in this section to know how much I love them I thank them everytime we see each other by giving them the best of myself

      I enjoyed my time in Compiegne It was an exciting adventure and I do not regreta single thing I am sure that I will miss these days but this does not make me sadbecause as the Beatles sang in ldquoThe endrdquo or Jorge Drexler in ldquoTodo se transformardquo theamount that you miss people is equal to the love you gave them and received from them

      The only names I am including are my supervisorsrsquo Yves Grandvalet and GerardGovaert I do not think it is possible to have had better teaching and supervision andI am sure that the reason I finished this work was not only thanks to their technicaladvice but also but also thanks to their close support humanity and patience

      Contents

List of Figures

List of Tables

Notation and Symbols

I Context and Foundations

1 Context

2 Regularization for Feature Selection
   2.1 Motivations
   2.2 Categorization of Feature Selection Techniques
   2.3 Regularization
       2.3.1 Important Properties
       2.3.2 Pure Penalties
       2.3.3 Hybrid Penalties
       2.3.4 Mixed Penalties
       2.3.5 Sparsity Considerations
       2.3.6 Optimization Tools for Regularized Problems

II Sparse Linear Discriminant Analysis

Abstract

3 Feature Selection in Fisher Discriminant Analysis
   3.1 Fisher Discriminant Analysis
   3.2 Feature Selection in LDA Problems
       3.2.1 Inertia Based
       3.2.2 Regression Based

4 Formalizing the Objective
   4.1 From Optimal Scoring to Linear Discriminant Analysis
       4.1.1 Penalized Optimal Scoring Problem
       4.1.2 Penalized Canonical Correlation Analysis
       4.1.3 Penalized Linear Discriminant Analysis
       4.1.4 Summary
   4.2 Practicalities
       4.2.1 Solution of the Penalized Optimal Scoring Regression
       4.2.2 Distance Evaluation
       4.2.3 Posterior Probability Evaluation
       4.2.4 Graphical Representation
   4.3 From Sparse Optimal Scoring to Sparse LDA
       4.3.1 A Quadratic Variational Form
       4.3.2 Group-Lasso OS as Penalized LDA

5 GLOSS Algorithm
   5.1 Regression Coefficients Updates
       5.1.1 Cholesky decomposition
       5.1.2 Numerical Stability
   5.2 Score Matrix
   5.3 Optimality Conditions
   5.4 Active and Inactive Sets
   5.5 Penalty Parameter
   5.6 Options and Variants
       5.6.1 Scaling Variables
       5.6.2 Sparse Variant
       5.6.3 Diagonal Variant
       5.6.4 Elastic net and Structured Variant

6 Experimental Results
   6.1 Normalization
   6.2 Decision Thresholds
   6.3 Simulated Data
   6.4 Gene Expression Data
   6.5 Correlated Data
   Discussion

III Sparse Clustering Analysis

Abstract

7 Feature Selection in Mixture Models
   7.1 Mixture Models
       7.1.1 Model
       7.1.2 Parameter Estimation: The EM Algorithm
   7.2 Feature Selection in Model-Based Clustering
       7.2.1 Based on Penalized Likelihood
       7.2.2 Based on Model Variants
       7.2.3 Based on Model Selection

8 Theoretical Foundations
   8.1 Resolving EM with Optimal Scoring
       8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
       8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
       8.1.3 Clustering Using Penalized Optimal Scoring
       8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
   8.2 Optimized Criterion
       8.2.1 A Bayesian Derivation
       8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
   9.1 Mix-GLOSS
       9.1.1 Outer Loop: Whole Algorithm Repetitions
       9.1.2 Penalty Parameter Loop
       9.1.3 Inner Loop: EM Algorithm
   9.2 Model Selection

10 Experimental Results
   10.1 Tested Clustering Algorithms
   10.2 Results
   10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
   B.1 How to Solve the Eigenvector Decomposition
   B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
   D.1 Useful Properties
   D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
   G.1 Prior probabilities
   G.2 Means
   G.3 Covariance Matrix

Bibliography

      List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ||β||_p
2.4 Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image
6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS Loops Scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

      List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

      Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

      Sets

N : the set of natural numbers, N = {1, 2, ...}
R : the set of reals
|A| : cardinality of a set A (for finite sets, the number of elements)
Ā : complement of set A

      Data

X : input domain
x_i : input sample, x_i ∈ X
X : design matrix, X = (x_1^T, ..., x_n^T)^T
x^j : column j of X
y_i : class indicator of sample i
Y : indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z : complete data, z = (x, y)
G_k : set of the indices of observations belonging to class k
n : number of examples
K : number of classes
p : dimension of X
i, j, k : indices running over N

Vectors, Matrices and Norms

0 : vector with all entries equal to zero
1 : vector with all entries equal to one
I : identity matrix
A^T : transpose of matrix A (ditto for vector)
A^{-1} : inverse of matrix A
tr(A) : trace of matrix A
|A| : determinant of matrix A
diag(v) : diagonal matrix with v on the diagonal
||v||_1 : L1 norm of vector v
||v||_2 : L2 norm of vector v
||A||_F : Frobenius norm of matrix A


      Probability

E[·] : expectation of a random variable
var[·] : variance of a random variable
N(μ, σ²) : normal distribution with mean μ and variance σ²
W(W, ν) : Wishart distribution with ν degrees of freedom and scale matrix W
H(X) : entropy of random variable X
I(X; Y) : mutual information between random variables X and Y

      Mixture Models

y_ik : hard membership of sample i to cluster k
f_k : distribution function for cluster k
t_ik : posterior probability of sample i to belong to cluster k
T : posterior probability matrix
π_k : prior probability or mixture proportion for cluster k
μ_k : mean vector of cluster k
Σ_k : covariance matrix of cluster k
θ_k : parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t) : parameter vector at iteration t of the EM algorithm
f(X; θ) : likelihood function
L(θ; X) : log-likelihood function
L_C(θ; X, Y) : complete log-likelihood function

      Optimization

J(·) : cost function
L(·) : Lagrangian
β̂ : generic notation for the solution with respect to β
β_ls : least squares solution coefficient vector
A : active set
γ : step size to update the regularization path
h : direction to update the regularization path


      Penalized models

λ, λ_1, λ_2 : penalty parameters
P_λ(θ) : penalty term over a generic parameter vector
β_kj : coefficient j of discriminant vector k
β_k : kth discriminant vector, β_k = (β_k1, ..., β_kp)
B : matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j : jth row of B, B = (β^{1T}, ..., β^{pT})^T
B_LDA : coefficient matrix in the LDA domain
B_CCA : coefficient matrix in the CCA domain
B_OS : coefficient matrix in the OS domain
X_LDA : data matrix in the LDA domain
X_CCA : data matrix in the CCA domain
X_OS : data matrix in the OS domain
θ_k : score vector k
Θ : score matrix, Θ = (θ_1, ..., θ_{K-1})
Y : label matrix
Ω : penalty matrix
L_CP(θ; X, Z) : penalized complete log-likelihood function
Σ_B : between-class covariance matrix
Σ_W : within-class covariance matrix
Σ_T : total covariance matrix
Σ̂_B : sample between-class covariance matrix
Σ̂_W : sample within-class covariance matrix
Σ̂_T : sample total covariance matrix
Λ : inverse of covariance matrix, or precision matrix
w_j : weights
τ_j : penalty components of the variational approach


      Part I

      Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. The models and some basic concepts that will be used along this document are also detailed here, and the relevant state of the art is reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


      1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiegne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of the website framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo.


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below there is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, which is a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors (a small illustrative sketch is given below). A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
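To make the pairwise comparison concrete, here is a minimal Python/numpy sketch, purely illustrative and not the MASH platform code, that computes the RV coefficient between two column-centered tables sharing the same rows; the function name, the toy data and the use of the n×n operators XX^T are assumptions made for this example.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two data tables X (n x p) and Y (n x q) observed on the same n samples."""
    # Center the columns so that the coefficient compares configurations, not offsets
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # n x n operators associated with each table
    Sx = Xc @ Xc.T
    Sy = Yc @ Yc.T
    # Normalized inner product between the two operators
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

# Toy usage: two "extractors" evaluated on the same 50 samples
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
B = A[:, :3] + 0.1 * rng.normal(size=(50, 3))        # strongly related to A
print(rv_coefficient(A, B))                          # close to 1
print(rv_coefficient(A, rng.normal(size=(50, 5))))   # close to 0
```

A dissimilarity such as 1 - RV(O_i, O_j) can then feed any standard clustering method to group the extractors.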

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


      2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined; many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions turn those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008).

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005).

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references, which does not cover all the possibilities but gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness make it possible to evaluate subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_{\beta}\; J(\beta) + \lambda P(\beta)    (2.1)

\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad P(\beta) \le t    (2.2)

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I am reviewing pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p.

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2,\quad f(t x_1 + (1-t)x_2) \le t f(x_1) + (1-t) f(x_2)    (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties.

Regularizing a linear model with a norm like ||β||_p means that the larger the component |β_j|, the more important the feature x^j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x^j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β_ls is outside the penalties' admissible region. The closest point to this β_ls for the L1 regularization is β_l1, and for the L2 regularization it is β_l2. Solution β_l1 is sparse because its second component is zero, while both components of β_l2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} region results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||_0 = card{β_j | β_j ≠ 0}:

\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \|\beta\|_0 \le t    (2.4)

where parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \sum_{j=1}^{p} |\beta_j| \le t    (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not so easy a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

Lasso is a popular tool that has been used in multiple contexts beside regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012a,b).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions where Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

\min_{\beta}\; J(\beta) + \lambda \|\beta\|_2^2    (2.6)

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2    (2.7)

with solution β_ls = (X^T X)^{-1} X^T y. If some input variables are highly correlated, the estimator β_ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

The solution to this problem is β_l2 = (X^T X + λ I_p)^{-1} X^T y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
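As an illustration of this stabilizing effect, the following minimal numpy sketch (the function name and the toy data are assumptions made for the example) computes the closed-form ridge estimator and shows how the λI term tames the coefficients of two nearly collinear columns.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p = 30, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 3] + 1e-3 * rng.normal(size=n)    # two strongly correlated columns
y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=n)

print(ridge(X, y, lam=0.0))   # near-singular X'X: wild coefficients on the correlated pair
print(ridge(X, y, lam=1.0))   # eigenvalues shifted upwards by lambda: stabilized estimate
```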

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{\mathrm{ls}})^2}    (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component. There, every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although the L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ||x||_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ||β||_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region fits a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ||β||_* of a norm ||β|| is defined as

\|\beta\|_* = \max_{w \in \mathbb{R}^p}\; \beta^\top w \quad \text{s.t.}\quad \|w\| \le 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual, and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2    (2.9)

The term in λ_1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ_2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
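In practice an off-the-shelf solver can be used for (2.9). The sketch below relies on scikit-learn's ElasticNet; note that this library rescales the loss by 1/(2n) and encodes the two penalty levels through alpha and l1_ratio, so the correspondence with (λ_1, λ_2) holds only up to those scaling conventions, and the data here are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
n, p = 40, 100                       # more variables than samples
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:10] = 1.0                 # only 10 relevant features
y = X @ beta_true + 0.5 * rng.normal(size=n)

# l1_ratio balances the L1 (sparsity) and L2 (stability/consistency) terms
model = ElasticNet(alpha=0.1, l1_ratio=0.7, max_iter=10000).fit(X, y)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```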


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us identify as G_ℓ the group of genes for the ℓth process and d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus the dimension of vector β will be the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^{L} d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \Biggl( \sum_{\ell} \Bigl( \sum_{j \in G_\ell} |\beta_j|^s \Bigr)^{r/s} \Biggr)^{1/r}    (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
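The following short helper, an illustrative sketch whose group encoding and optional sqrt(d_ℓ) weighting are assumptions commonly made in practice, evaluates the mixed L(1,2) norm of (2.10) for a given partition of the indices.

```python
import numpy as np

def mixed_norm_1_2(beta, groups, weight_by_size=True):
    """Group-Lasso norm: L1 across groups of the (optionally weighted) L2 norms within groups."""
    total = 0.0
    for g in groups:                              # g is a list of indices, one group G_l
        w = np.sqrt(len(g)) if weight_by_size else 1.0
        total += w * np.linalg.norm(beta[g])      # L2 norm of the sub-vector of group G_l
    return total

beta = np.array([0.0, 0.0, 1.5, -2.0, 0.0, 3.0])
groups = [[0, 1], [2, 3], [4, 5]]
print(mixed_norm_1_2(beta, groups))   # the first group contributes nothing and can be dropped entirely
```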

Several combinations are available; the most popular is the norm ||β||_(1,2), known as group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ||β||_(1,4/3) (Szafranski et al., 2008) or ||β||_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

Figure 2.5: Admissible sets for the Lasso ((a) L1) and the group-Lasso ((b) L(1,2)).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters ((a) L1-induced sparsity, (b) L(1,2) group-induced sparsity).

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms with the proper definition of groups can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.
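A compact way to see this row-wise pattern is block soft-thresholding, which is the proximal operator of the group-Lasso penalty when each group gathers all the parameters of one variable; the sketch below (illustrative code, assumed names and data) zeroes whole rows of an 8×4 coefficient matrix, exactly the kind of pattern displayed in the right of Figure 2.6.

```python
import numpy as np

def row_soft_threshold(B, lam):
    """Block soft-thresholding of each row of B (one row = all parameters of one variable)."""
    out = np.zeros_like(B)
    for j, row in enumerate(B):
        norm = np.linalg.norm(row)
        if norm > lam:                        # shrink the whole row, or remove the variable entirely
            out[j] = (1.0 - lam / norm) * row
    return out

rng = np.random.default_rng(3)
B = rng.normal(size=(8, 4))                   # 8 variables characterized by 4 parameters
B_sparse = row_soft_threshold(B, lam=2.0)
print(np.nonzero(np.linalg.norm(B_sparse, axis=1) == 0)[0])   # indices of the removed variables
```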

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints" implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\ s' \in \partial P(\beta^{(t)})
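For instance, for the Lasso case where J(β) is the residual sum of squares, a minimal sketch of this update reads as follows; the decreasing step size and the scaling by the Lipschitz constant are assumptions made to keep the toy iteration stable, not part of the generic method.

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, n_iter=2000):
    """Subgradient descent on sum of squared residuals + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    L = 2 * np.linalg.norm(X, 2) ** 2              # scale used to pick a safe step size
    for t in range(1, n_iter + 1):
        s = 2 * X.T @ (X @ beta - y)               # gradient of the quadratic loss J
        s_prime = np.sign(beta)                    # one valid subgradient of the L1 penalty
        beta = beta - (s + lam * s_prime) / (L * np.sqrt(t))   # decreasing step size
    return beta                                     # close to, but in general not exactly, sparse
```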

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

\beta_j = \frac{-\lambda\,\mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2}

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β_ls and updating their values using an iterative thresholding algorithm where β_j^{(t+1)} = S_λ(∂J(β^{(t)})/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\
0 & \text{if } |\partial J(\beta)/\partial \beta_j| \le \lambda
\end{cases}    (2.11)
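A minimal implementation of this shooting-style scheme for the Lasso case, with the partial derivative computed from the residual that excludes variable j, could look as follows; the function names, the warm start and the fixed number of sweeps are assumptions made for this illustrative sketch.

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_shooting(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for sum of squared residuals + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0] if n >= p else np.zeros(p)   # warm start
    col_sq = (X ** 2).sum(axis=0)                     # the sum_i x_ij^2 denominators
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]    # partial residual excluding variable j
            # thresholding rule (2.11): beta_j is set to zero when the partial gradient is small
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta

# toy usage
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=50)
print(lasso_shooting(X, y, lam=5.0))
```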

The same principles define "block-coordinate descent" algorithms. In this case, first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of variables with non-zero β_j and is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā; in the inactive set we can find the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There exists also a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).
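The sketch below illustrates these three tasks on the Lasso problem min ||y - Xβ||² + λ||β||₁, with a plain coordinate-descent routine standing in for the optimization task; it is only a schematic rendition of the working-set idea under these assumptions, not the algorithm developed later in this thesis.

```python
import numpy as np

def _soft(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def _cd_on_active(X, y, beta, active, lam, sweeps=50):
    # optimization task: coordinate descent restricted to the active variables (warm-started)
    for _ in range(sweeps):
        for j in active:
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = _soft(X[:, j] @ r_j, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta

def lasso_working_set(X, y, lam, tol=1e-6):
    """Forward active-set loop for min ||y - X beta||^2 + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    active = []
    while True:
        grad = 2 * X.T @ (X @ beta - y)       # ingredient of the optimality conditions
        viol = np.abs(grad) - lam             # optimality requires |grad_j| <= lam when beta_j = 0
        viol[active] = -np.inf                # active variables are handled by the inner solver
        j = int(np.argmax(viol))
        if viol[j] <= tol:                    # no violator left in the inactive set: done
            return beta, active
        active.append(j)                      # working-set update: add the worst violator
        beta = _cd_on_active(X, y, beta, active, lam)
```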

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L(1,2) penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-planes approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes in different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size and which variable should enter the active set from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz differentiable cost function J(β) and a non-differentiable penalty λP(β):

\min_{\beta \in \mathbb{R}^p}\; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2    (2.12)

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the solution β, so that the problem to solve in each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

\min_{\beta \in \mathbb{R}^p}\; \frac{1}{2} \left\| \beta - \Bigl( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Bigr) \right\|_2^2 + \frac{\lambda}{L} P(\beta)    (2.13)

The basic algorithm makes use of the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, making λ = 0 in equation (2.13), the standard gradient update rule comes up.
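For the L1 penalty, the solution of (2.13) is given by soft-thresholding, which yields the basic proximal gradient (ISTA) iteration sketched below; the quadratic loss, the computation of L from the spectral norm of X and the fixed iteration count are assumptions made for this illustrative example.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """ISTA: proximal gradient iterations for min ||y - X beta||^2 + lam * ||beta||_1."""
    L = 2 * np.linalg.norm(X, 2) ** 2              # Lipschitz constant of the gradient of the loss
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)
        z = beta - grad / L                         # gradient step on the smooth part (2.13)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # proximal step = soft-thresholding
    return beta
```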


      Part II

      Sparse Linear Discriminant Analysis


      Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n×K matrix Y = (y_1^T, ..., y_n^T)^T.

      Fisherrsquos discriminant problem was first proposed for two-class problems for the analy-sis of the famous iris dataset as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance

      maxβisinRp

      βgtΣBβ

      βgtΣWβ (31)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p × p between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as

\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{G}_k} (x_i - \mu_k)(x_i - \mu_k)^\top , \qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{G}_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,

where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}( B^\top \Sigma_B B )}{\operatorname{tr}( B^\top \Sigma_W B )} , \qquad (3.2)

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B \beta_k \\
\text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 \\
& \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k .
\end{aligned} \qquad (3.3)

The maximizer of subproblem k is the eigenvector of \Sigma_W^{-1} \Sigma_B associated with the k-th largest eigenvalue (see Appendix C).
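As an illustration of this eigenvalue characterization, the following matlab sketch computes the sample covariance matrices and the discriminant directions for hypothetical data X with labels y; it is a naive implementation intended only to make the formulas concrete.

    % Fisher's discriminant directions as leading eigenvectors of inv(Sigma_W)*Sigma_B
    n = 150; p = 4; K = 3;
    X = randn(n, p); y = randi(K, n, 1);          % hypothetical data and labels
    mu = mean(X, 1);
    SigmaW = zeros(p); SigmaB = zeros(p);
    for k = 1:K
      Xk = X(y == k, :);
      muk = mean(Xk, 1);
      Xc = bsxfun(@minus, Xk, muk);
      SigmaW = SigmaW + Xc' * Xc / n;                               % within-class covariance
      SigmaB = SigmaB + size(Xk, 1) * (muk - mu)' * (muk - mu) / n; % between-class covariance
    end
    [V, D] = eig(SigmaW \ SigmaB);
    [~, order] = sort(real(diag(D)), 'descend');
    B = real(V(:, order(1:K-1)));                 % K-1 leading discriminant directions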

3.2 Feature Selection in LDA Problems

LDA is often used as a data-reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main purpose of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually assessed by univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, which is either Fisher's discriminant analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes based on inertia principles: mass centers should be far away from each other (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_W \beta \\
\text{s.t. } & (\mu_1 - \mu_2)^\top \beta = 1 \\
& \textstyle\sum_{j=1}^{p} |\beta_j| \le t ,
\end{aligned}

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant rewritten in the form of K − 1 constrained and penalized maximization problems:

\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B^{k} \beta_k - P_k(\beta_k) \\
\text{s.t. } & \beta_k^\top \Sigma_W \beta_k \le 1 .
\end{aligned}

The term to maximize is the projected between-class covariance β_k^⊤ Σ_B β_k, subject to an upper bound on the projected within-class covariance β_k^⊤ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general-purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise-constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly, through constrained L1 minimization:

\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \|\beta\|_1 \\
\text{s.t. } & \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda .
\end{aligned}

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al. 2000, Friedman et al. 2009).

Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n × K matrix encoding the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used for linking multi-class classification with linear regression (Friedman et al. 2009). Another popular choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise; it was used, for example, for extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).
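For instance, the dummy indicator matrix and the symmetric coding mentioned above can be built in a couple of lines (matlab sketch, with a hypothetical label vector y):

    y = [1 2 3 2 1 3]';                 % hypothetical label vector with K = 3 classes
    n = numel(y); K = max(y);
    Y = full(sparse((1:n)', y, 1, n, K));  % dummy indicator: Y(i,k) = 1 iff y(i) == k
    Ysym = Y - (1 - Y) / (K - 1);          % symmetric coding: 1 for the class, -1/(K-1) otherwise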

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to optimal scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is


obtained by solving

\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,

where y_i is the binary label indicator of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest that any other generic sparsity-inducing penalty may be used. The decision rule x^⊤β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting from λ = 0, but a different intercept β_0 is required.

      Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the class indicators together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\begin{aligned}
\min_{\Theta,\, B} \;\; & \| Y\Theta - XB \|_F^2 + \lambda \operatorname{tr}( B^\top \Omega B ) & (3.4a) \\
\text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , & (3.4b)
\end{aligned}

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the ordering that arises naturally when considering the following series of K − 1 problems:

\begin{aligned}
\min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \;\; & \| Y\theta_k - X\beta_k \|^2 + \beta_k^\top \Omega \beta_k & (3.5a) \\
\text{s.t. } & n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 & (3.5b) \\
& \theta_k^\top Y^\top Y \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , & (3.5c)
\end{aligned}

where each β_k corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k} \| Y\theta_k - X\beta_k \|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \| Y\theta_k - X\beta_k \|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K-1} \beta_{kj}^2 } , \qquad (3.6)

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


      4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can either be complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995) and were already used for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of penalized optimal scoring (p-OS) problems and the series of penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix Y^⊤Y is full rank;
• inputs are centered, that is, X^⊤ 1_n = 0;
• the quadratic penalty Ω is positive-semidefinite and such that X^⊤X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First, note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the K-th problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus

\begin{aligned}
\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & \| Y\theta - X\beta \|^2 + \beta^\top \Omega \beta & (4.1a) \\
\text{s.t. } & n^{-1}\, \theta^\top Y^\top Y \theta = 1 . & (4.1b)
\end{aligned}

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\beta_{\mathrm{os}} = \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta . \qquad (4.2)

The objective function (4.1a) is then

\begin{aligned}
\| Y\theta - X\beta_{\mathrm{os}} \|^2 + \beta_{\mathrm{os}}^\top \Omega \beta_{\mathrm{os}}
&= \theta^\top Y^\top Y \theta - 2\, \theta^\top Y^\top X \beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \left( X^\top X + \Omega \right) \beta_{\mathrm{os}} \\
&= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta ,
\end{aligned}

where the second line stems from the definition (4.2) of β_os. Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\max_{\theta :\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta , \qquad (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the k-th largest eigenvector of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. Indeed, Appendix C details that Problem (4.3) is solved by

(Y^\top Y)^{-1} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta = \alpha^2 \theta , \qquad (4.4)


where α² is the maximal eigenvalue¹:

\begin{aligned}
n^{-1}\, \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2\, n^{-1}\, \theta^\top (Y^\top Y)\, \theta \\
n^{-1}\, \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2 . \qquad (4.5)
\end{aligned}

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized canonical correlation analysis (p-CCA) problem between variables X and Y is defined as follows:

\begin{aligned}
\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & n^{-1}\, \theta^\top Y^\top X \beta & (4.6a) \\
\text{s.t. } & n^{-1}\, \theta^\top Y^\top Y \theta = 1 & (4.6b) \\
& n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 . & (4.6c)
\end{aligned}

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

\begin{aligned}
n L(\beta, \theta, \nu, \gamma) &= \theta^\top Y^\top X \beta - \nu \left( \theta^\top Y^\top Y \theta - n \right) - \gamma \left( \beta^\top (X^\top X + \Omega) \beta - n \right) \\
\Rightarrow \quad n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} &= X^\top Y \theta - 2\gamma\, (X^\top X + \Omega) \beta \\
\Rightarrow \quad \beta_{\mathrm{cca}} &= \frac{1}{2\gamma} (X^\top X + \Omega)^{-1} X^\top Y \theta .
\end{aligned}

Then, as β_cca obeys (4.6c), we obtain

\beta_{\mathrm{cca}} = \frac{ (X^\top X + \Omega)^{-1} X^\top Y \theta }{ \sqrt{ n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta } } , \qquad (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

\begin{aligned}
n^{-1}\, \theta^\top Y^\top X \beta_{\mathrm{cca}}
&= \frac{ n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta }{ \sqrt{ n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta } } \\
&= \sqrt{ n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta } ,
\end{aligned}

and the optimization problem with respect to θ can be restated as

\max_{\theta :\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta . \qquad (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\beta_{\mathrm{os}} = \alpha\, \beta_{\mathrm{cca}} , \qquad (4.9)

where α is defined by (4.5).

¹ The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\begin{aligned}
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} &= Y^\top X \beta - 2\nu\, Y^\top Y \theta \\
\Rightarrow \quad \theta_{\mathrm{cca}} &= \frac{1}{2\nu} (Y^\top Y)^{-1} Y^\top X \beta . \qquad (4.10)
\end{aligned}

Then, as θ_cca obeys (4.6b), we obtain

\theta_{\mathrm{cca}} = \frac{ (Y^\top Y)^{-1} Y^\top X \beta }{ \sqrt{ n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta } } , \qquad (4.11)

leading to the following expression of the optimal objective function:

\begin{aligned}
n^{-1}\, \theta_{\mathrm{cca}}^\top Y^\top X \beta
&= \frac{ n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta }{ \sqrt{ n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta } } \\
&= \sqrt{ n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta } .
\end{aligned}

The p-CCA problem can thus be solved with respect to β by plugging this value into (4.6):

\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \;\; & n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta & (4.12a) \\
\text{s.t. } & n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 , & (4.12b)
\end{aligned}

where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

n^{-1}\, X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \left( X^\top X + \Omega \right) \beta_{\mathrm{cca}} , \qquad (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

\begin{aligned}
& n^{-1}\, \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \\
\Rightarrow \;& n^{-1} \alpha^{-1}\, \beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow \;& n^{-1} \alpha\, \beta_{\mathrm{cca}}^\top X^\top Y \theta = \lambda \\
\Rightarrow \;& n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow \;& \alpha^2 = \lambda .
\end{aligned}

The first line is obtained from constraint (4.12b); the second line follows from the relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one uses the definition (4.5) of α.


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized linear discriminant analysis problem is defined as follows:

\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_B \beta & (4.14a) \\
\text{s.t. } & \beta^\top \left( \Sigma_W + n^{-1} \Omega \right) \beta = 1 , & (4.14b)
\end{aligned}

where Σ_B and Σ_W are, respectively, the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y(Y^⊤Y)^{-1}Y^⊤:

\begin{aligned}
\Sigma_T &= \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top = n^{-1} X^\top X , \\
\Sigma_B &= \frac{1}{n} \sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1} X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X , \\
\Sigma_W &= \frac{1}{n} \sum_{k=1}^{K} \sum_{i : y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1} \left( X^\top X - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) .
\end{aligned}

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\begin{aligned}
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \lambda \left( X^\top X + \Omega - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) \beta_{\mathrm{lda}} \\
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \frac{\lambda}{1-\lambda} \left( X^\top X + \Omega \right) \beta_{\mathrm{lda}} .
\end{aligned}

The comparison of the last equation with the characterization (4.13) of β_cca shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

\begin{aligned}
\beta_{\mathrm{lda}} &= (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{cca}} \\
&= \alpha^{-1} (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{os}} ,
\end{aligned}

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the k-th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

\begin{aligned}
\min_{\Theta,\, B} \;\; & \| Y\Theta - XB \|_F^2 + \lambda \operatorname{tr}( B^\top \Omega B ) \\
\text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} .
\end{aligned}

Let A represent the (K − 1) × (K − 1) diagonal matrix whose elements α_k are the square roots of the K − 1 largest eigenvalues of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y; we have

\begin{aligned}
B_{\mathrm{LDA}} &= B_{\mathrm{CCA}} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \\
&= B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} , \qquad (4.15)
\end{aligned}

where I_{K−1} is the (K − 1) × (K − 1) identity matrix.

At this point, the feature matrix X, which has dimensions n × p in the input space, can be projected into the optimal scoring domain as an n × (K − 1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n × (K − 1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains, provided the appropriate distance (based on the penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as

   B_OS = (X^⊤X + λΩ)^{-1} X^⊤Y Θ ,

   where Θ holds the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1} (I_{K−1} − A²)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Produce a graphical representation, if desired.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

\begin{aligned}
\min_{\Theta \in \mathbb{R}^{K \times (K-1)},\, B \in \mathbb{R}^{p \times (K-1)}} \;\; & \| Y\Theta - XB \|_F^2 + \lambda \operatorname{tr}( B^\top \Omega B ) & (4.16a) \\
\text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , & (4.16b)
\end{aligned}

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form, as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ0 such that n^{-1} Θ0^⊤ Y^⊤Y Θ0 = I_{K−1}.

2. Compute B = (X^⊤X + λΩ)^{-1} X^⊤Y Θ0.

3. Set Θ to be the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y.

4. Compute the optimal regression coefficients

   B_OS = (X^⊤X + λΩ)^{-1} X^⊤Y Θ . \qquad (4.17)

Defining Θ0 in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ0^⊤ Y^⊤X (X^⊤X + λΩ)^{-1} X^⊤Y Θ0, which is computed as Θ0^⊤ Y^⊤X B, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form tr(B^⊤ΩB). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with parameters estimated from training data (sample estimators μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\left( \frac{n_k}{n} \right) \qquad (4.18)

is minimized over all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes based on the estimated prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

\begin{aligned}
\Sigma_{W\Omega}^{-1} &= \left( n^{-1} (X^\top X + \lambda\Omega) - \Sigma_B \right)^{-1} \\
&= \left( n^{-1} X^\top X - \Sigma_B + n^{-1}\lambda\Omega \right)^{-1} \\
&= \left( \Sigma_W + n^{-1}\lambda\Omega \right)^{-1} . \qquad (4.19)
\end{aligned}

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution B_OS of the p-OS problem is enough to accomplish classification;
• in the LDA domain (the space of discriminant variates X_LDA), classification is based on Euclidean distances;
• classification can be done in a reduced-rank space of dimension R < K − 1, by using only the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

\left\| (x_i - \mu_k) B_{\mathrm{OS}} \right\|_{\Sigma_{W\Omega}}^2 - 2 \log(\pi_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is

\left\| (x_i - \mu_k) B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 - 2 \log(\pi_k) ,

which is a plain Euclidean distance.
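As an illustration, nearest-centroid classification in the LDA domain then reduces to a few lines (matlab sketch; the discriminant variates Xlda, the centroid matrix M and the class proportions are hypothetical placeholders):

    n = 20; K = 3;
    Xlda = randn(n, K-1); M = randn(K, K-1);      % hypothetical discriminant variates and centroids
    prior = ones(K, 1) / K;                       % estimated class proportions
    d = zeros(n, K);
    for k = 1:K
      diffk = bsxfun(@minus, Xlda, M(k, :));
      d(:, k) = sum(diffk.^2, 2) - 2 * log(prior(k));  % Euclidean distance + prior adjustment
    end
    [~, yhat] = min(d, [], 2);                    % maximum a posteriori class assignment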


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

\begin{aligned}
\hat{p}(y_k = 1 \,|\, x) &\propto \exp\!\left( - \frac{d(x, \mu_k)}{2} \right) \\
&\propto \pi_k \exp\!\left( - \frac{1}{2} \left\| (x - \mu_k) B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 \right) . \qquad (4.20)
\end{aligned}

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

\begin{aligned}
\hat{p}(y_k = 1 \,|\, x) &= \frac{ \pi_k \exp\!\left( - \frac{d(x, \mu_k)}{2} \right) }{ \sum_{\ell} \pi_\ell \exp\!\left( - \frac{d(x, \mu_\ell)}{2} \right) } \\
&= \frac{ \pi_k \exp\!\left( - \frac{d(x, \mu_k)}{2} + \frac{d_{\max}}{2} \right) }{ \sum_{\ell} \pi_\ell \exp\!\left( - \frac{d(x, \mu_\ell)}{2} + \frac{d_{\max}}{2} \right) } ,
\end{aligned}

where d_max = max_k d(x, μ_k).
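A direct transcription of this normalization trick reads as follows (matlab sketch with hypothetical distances and priors):

    d = 500 + 10 * rand(5, 3);                       % hypothetical squared distances (n-by-K)
    prior = [0.2 0.3 0.5];                           % hypothetical class proportions
    shift = max(d, [], 2);                           % d_max for each sample, as in the text
    num = bsxfun(@times, prior, exp(bsxfun(@minus, shift, d) / 2));
    post = bsxfun(@rdivide, num, sum(num, 2));       % normalized posteriors, rows sum to one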

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^⊤Ωβ, under the assumption that Y^⊤Y and X^⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B^⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:

\begin{aligned}
\min_{\tau \in \mathbb{R}^p} \min_{B \in \mathbb{R}^{p \times (K-1)}} \;\; & J(B) + \lambda \sum_{j=1}^{p} \frac{ w_j^2 \left\| \beta^j \right\|_2^2 }{ \tau_j } & (4.21a) \\
\text{s.t. } & \sum_{j} \tau_j - \sum_{j} w_j \left\| \beta^j \right\|_2 \le 0 & (4.21b) \\
& \tau_j \ge 0 , \quad j = 1, \ldots, p , & (4.21c)
\end{aligned}

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})^⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss ‖YΘ − XB‖²₂; for now, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero, that is, b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖β^j‖₂.

Proof. The Lagrangian of Problem (4.21) is

L = J(B) + \lambda \sum_{j=1}^{p} \frac{ w_j^2 \left\| \beta^j \right\|_2^2 }{ \tau_j } + \nu_0 \left( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \right) - \sum_{j=1}^{p} \nu_j \tau_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j are

\begin{aligned}
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
&\Leftrightarrow - \lambda w_j^2 \frac{ \left\| \beta^j \right\|_2^2 }{ \tau_j^{\star 2} } + \nu_0 - \nu_j = 0 \\
&\Leftrightarrow - \lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} - \nu_j\, \tau_j^{\star 2} = 0 \\
&\Rightarrow - \lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} = 0 .
\end{aligned}

The last line is obtained from complementary slackness, which implies here ν_j τ_j^⋆ = 0. (Complementary slackness states that ν_j g_j(τ_j^⋆) = 0, where ν_j is the Lagrange multiplier for the constraint g_j(τ_j) ≤ 0.) As a result, the optimal value of τ_j is

\tau_j^\star = \sqrt{ \frac{ \lambda w_j^2 \left\| \beta^j \right\|_2^2 }{ \nu_0 } } = \sqrt{ \frac{\lambda}{\nu_0} }\, w_j \left\| \beta^j \right\|_2 . \qquad (4.22)

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus, the inequality constraint (4.21b) is at bound (due to complementary slackness):

\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 = 0 , \qquad (4.23)

so that τ_j^⋆ = w_j ‖β^j‖₂. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso problem

\min_{B \in \mathbb{R}^{p \times (K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 . \qquad (4.24)

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as λ tr(B^⊤ΩB), where

\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \right) , \qquad (4.25)

with τ_j = w_j ‖β^j‖₂, resulting in the diagonal components

(\Omega)_{jj} = \frac{ w_j }{ \left\| \beta^j \right\|_2 } . \qquad (4.26)
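In practice, the adaptive penalty is recomputed from the current coefficients as in the following sketch (matlab; B and the weights w are hypothetical, and the division is guarded to avoid the infinite entries associated with null rows):

    B = randn(6, 2); w = ones(6, 1);       % hypothetical coefficient rows beta^j and weights
    rownorm = sqrt(sum(B.^2, 2));          % ||beta^j||_2
    omega = w ./ max(rownorm, eps);        % (Omega)_jj = w_j / ||beta^j||_2
    Omega = diag(omega);
    % a reweighted ridge step would then read: B = (X'*X + lambda*Omega) \ (X'*Y*Theta);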

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thus established for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set strategy described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²₂/τ, known as the perspective function of f(β) = ‖β‖²₂, is convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

\left\{ V \in \mathbb{R}^{p \times (K-1)} : \; V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} , \qquad (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})^⊤, defined as follows. Let S(B) denote the support of B, S(B) = { j ∈ {1, …, p} : ‖β^j‖₂ ≠ 0 }; then we have

\begin{aligned}
\forall j \in S(B) , \quad & g^j = w_j \left\| \beta^j \right\|_2^{-1} \beta^j , & (4.28) \\
\forall j \notin S(B) , \quad & \left\| g^j \right\|_2 \le w_j . & (4.29)
\end{aligned}


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖₂ ≠ 0, the gradient of the penalty with respect to β^j is

\frac{\partial}{\partial \beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right) = \lambda w_j \frac{ \beta^j }{ \left\| \beta^j \right\|_2 } . \qquad (4.30)

At ‖β^j‖₂ = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

\partial_{\beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right) = \partial_{\beta^j} \left( \lambda w_j \left\| \beta^j \right\|_2 \right) = \left\{ \lambda w_j v : \; v \in \mathbb{R}^{K-1} , \; \| v \|_2 \le 1 \right\} . \qquad (4.31)

This gives the expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B⋆ of the objective function verifying the following conditions are global minima:

\begin{aligned}
\forall j \in S^\star , \quad & \frac{\partial J(B^\star)}{\partial \beta^j} + \lambda w_j \left\| \beta^{\star j} \right\|_2^{-1} \beta^{\star j} = 0 , & (4.32a) \\
\forall j \notin S^\star , \quad & \left\| \frac{\partial J(B^\star)}{\partial \beta^j} \right\|_2 \le \lambda w_j , & (4.32b)
\end{aligned}

where S⋆ ⊆ {1, …, p} denotes the set of indices of the non-zero row vectors β^{⋆j}, and the second condition applies to its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

\begin{aligned}
B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p \times (K-1)}} \min_{\Theta \in \mathbb{R}^{K \times (K-1)}} \;\; & \frac{1}{2} \| Y\Theta - XB \|_F^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \\
\text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\end{aligned}

is equivalent to the penalized LDA problem

\begin{aligned}
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p \times (K-1)}} \;\; & \operatorname{tr}( B^\top \Sigma_B B ) \\
\text{s.t. } & B^\top \left( \Sigma_W + n^{-1} \lambda \Omega \right) B = I_{K-1} ,
\end{aligned}

where

\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \ldots, \frac{w_p^2}{\tau_p} \right) , \quad \text{with} \quad
\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^j_{\mathrm{os}} = 0 , \\ w_j \left\| \beta^j_{\mathrm{os}} \right\|_2^{-1} & \text{otherwise.} \end{cases} \qquad (4.33)

That is, B_LDA = B_OS diag(α_k^{-1} (1 − α_k²)^{-1/2}), where α_k ∈ (0, 1) is the k-th leading eigenvalue of

n^{-1}\, Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption on the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B^⊤ΩB).


      5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½ ‖YΘ − XB‖²₂.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach of Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K − 1) similar systems

\left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda \Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 , \qquad (5.1)


Figure 5.1: GLOSS block diagram.


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, …, p} : ‖β^j‖₂ > 0 };  Θ0 such that n^{-1} Θ0^⊤ Y^⊤Y Θ0 = I_{K−1};  convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖₂^{-1}
        B_A ← (X_A^⊤ X_A + λΩ)^{-1} X_A^⊤ Y Θ0
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for all j ∈ A such that ‖β^j‖₂ = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j};  go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in the complement of A
    ĵ ← argmax_{j ∉ A} ‖∂J/∂β^j‖₂
    if ‖∂J/∂β^ĵ‖₂ < λ then
        convergence ← true   (B is optimal)
    else
        A ← A ∪ {ĵ}
    end if
until convergence
(s, V) ← eigenanalyze(Θ0^⊤ Y^⊤ X_A B), that is, Θ0^⊤ Y^⊤ X_A B v_k = s_k v_k, k = 1, …, K−1
Θ ← Θ0 V;  B ← B V;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, …, K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the k-th columns of B and Θ0, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition suffices to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

\left( X^\top X + \lambda\Omega \right) B = X^\top Y \Theta . \qquad (5.2)

Defining the Cholesky decomposition as C^⊤C = X^⊤X + λΩ, (5.2) is solved efficiently as follows:

\begin{aligned}
C^\top C\, B &= X^\top Y \Theta \\
C\, B &= C^\top \backslash \left( X^\top Y \Theta \right) \\
B &= C \backslash \left( C^\top \backslash \left( X^\top Y \Theta \right) \right) , \qquad (5.3)
\end{aligned}

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
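A minimal matlab transcription of (5.3), with hypothetical sizes and penalty, is given below; the same triangular factor C serves all K − 1 right-hand sides.

    n = 50; p = 10; K = 4; lambda = 0.5;
    X = randn(n, p); Omega = eye(p);              % hypothetical data and current penalty
    YTheta = randn(n, K-1);                       % stands for Y*Theta in (5.2)
    C = chol(X' * X + lambda * Omega);            % upper triangular, C'*C = X'*X + lambda*Omega
    B = C \ (C' \ (X' * YTheta));                 % two triangular solves share one factorization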

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:

B = \Omega^{-\frac{1}{2}} \left( \Omega^{-\frac{1}{2}} X^\top X\, \Omega^{-\frac{1}{2}} + \lambda I \right)^{-1} \Omega^{-\frac{1}{2}} X^\top Y \Theta^0 , \qquad (5.4)

where the conditioning of Ω^{-1/2} X^⊤X Ω^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values; our code is otherwise based on expression (5.2).

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. This eigen-analysis is actually solved in the form Θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^⊤X + Ω)^{-1}, which involves the inversion of a p × p matrix. Let Θ0 be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B0 = (X^⊤X + λΩ)^{-1} X^⊤Y Θ0. This B0 matrix can be identified in the expression to eigen-analyze, as

\Theta^{0\top} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix Θ0^⊤ Y^⊤X B0 = V Λ V^⊤. Defining Θ = Θ0 V, we have Θ^⊤Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y Θ = Λ, and when Θ0 is chosen such that n^{-1} Θ0^⊤Y^⊤Y Θ0 = I_{K−1}, we also have that n^{-1} Θ^⊤Y^⊤Y Θ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, θ_k is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ0 to Θ, that is, B = B0 V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active-set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

\frac{1}{2} \| Y\Theta - XB \|_2^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 . \qquad (5.5)

      row of B βj is the (K minus 1)-dimensional vector

      partJ(B)

      partβj= xj

      gt(XBminusYΘ)

      where xj is the column j of X Hence the first optimality condition (432a) can becomputed for every variable j as

      xjgt

      (XBminusYΘ) + λwjβj∥∥βj∥∥

      2

¹ As X is centered, 1_K belongs to the null space of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. It is thus sufficient to choose Θ0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^⊤X (X^⊤X + Ω)^{-1} X^⊤Y. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set Θ0 = (Y^⊤Y)^{-1/2} U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.


The second optimality condition (4.32b) can be computed for every variable j as

\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:

j^\star = \operatorname*{argmax}_{j} \; \max\!\left( \left\| x_j^\top (XB - Y\Theta) \right\|_2 - \lambda w_j ,\; 0 \right) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖₂ is small and if, after setting β^j to zero, the following optimality condition holds:

\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
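The scan for the strongest violator can be sketched as follows (matlab; all inputs are hypothetical placeholders, and B is taken null as at the start of the algorithm):

    n = 40; p = 20; K = 3; lambda = 0.3;
    X = randn(n, p); w = ones(p, 1);                         % hypothetical data and penalty weights
    Y = full(sparse((1:n)', randi(K, n, 1), 1, n, K));
    Theta = null(ones(1, K));                                % placeholder score matrix, K-by-(K-1)
    B = zeros(p, K-1); active = false(p, 1);                 % start from an empty active set
    R = X * B - Y * Theta;                                   % residual term XB - Y*Theta
    viol = sqrt(sum((X' * R).^2, 2)) - lambda * w;           % ||x_j'(XB - Y*Theta)||_2 - lambda*w_j
    viol(active) = -Inf;                                     % only inactive variables are candidates
    [vmax, jstar] = max(viol);
    if vmax > 0
      active(jstar) = true;                                  % add the strongest violator
    end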

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ, until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λ_max, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

\lambda_{\max} = \max_{j \in \{1, \ldots, p\}} \; \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ··· > λ_t > ··· > λ_T = λ_min ≥ 0, obtained by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
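The computation of λ_max and of the halving path with warm starts can be sketched as follows (matlab; the inner solver call gloss_fit is a hypothetical placeholder, left as a comment):

    n = 40; p = 20; K = 3;
    X = randn(n, p); w = ones(p, 1);                            % hypothetical data and weights
    Y = full(sparse((1:n)', randi(K, n, 1), 1, n, K));
    Theta0 = sqrt(n) * (sqrtm(Y' * Y) \ null(ones(1, K)));      % feasible initial scores
    lambda_max = max(sqrt(sum((X' * Y * Theta0).^2, 2)) ./ w);  % max_j ||x_j'*Y*Theta0||_2 / w_j
    lambdas = lambda_max * 2.^(0:-1:-10);                       % lambda_{t+1} = lambda_t / 2
    B = zeros(p, K-1);
    for t = 1:numel(lambdas)
      % B = gloss_fit(X, Y, lambdas(t), B);   % hypothetical solver call, warm-started with B
    end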


5.6 Options and Variants

5.6.1 Scaling Variables

As for most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small-sample-size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al. 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

\min_{B \in \mathbb{R}^{p \times (K-1)}} \| Y\Theta - XB \|_F^2 = \min_{B \in \mathbb{R}^{p \times (K-1)}} \operatorname{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \Sigma_T B \right)

are replaced by

\min_{B \in \mathbb{R}^{p \times (K-1)}} \operatorname{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \left( \Sigma_B + \operatorname{diag}(\Sigma_W) \right) B \right) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

    7 8 9
    4 5 6
    1 2 3

    Ω_L =
    [  3 −1  0 −1 −1  0  0  0  0
      −1  5 −1 −1 −1 −1  0  0  0
       0 −1  3  0 −1 −1  0  0  0
      −1 −1  0  5 −1  0 −1 −1  0
      −1 −1 −1 −1  8 −1 −1 −1 −1
       0 −1 −1  0 −1  5  0 −1 −1
       0  0  0 −1 −1  0  3 −1  0
       0  0  0 −1 −1 −1 −1  5 −1
       0  0  0  0 −1 −1  0 −1  3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive-semidefinite, and the penalty β^⊤Ω_L β favors, among vectors of identical L2 norm, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, which is the indicator of pixel 1 and its neighbors, and it is larger, 21, for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, which has a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
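For reference, the Laplacian of Figure 5.2 can be rebuilt programmatically as follows (matlab sketch for an 8-neighbour pixel graph; the grid size is a parameter):

    r = 3; c = 3;                                   % grid size
    idx = reshape(1:r*c, r, c);                     % pixel numbering on the grid
    E = [reshape(idx(1:end-1, :), [], 1),       reshape(idx(2:end, :), [], 1);        % vertical
         reshape(idx(:, 1:end-1), [], 1),       reshape(idx(:, 2:end), [], 1);        % horizontal
         reshape(idx(1:end-1, 1:end-1), [], 1), reshape(idx(2:end, 2:end), [], 1);    % diagonal
         reshape(idx(2:end, 1:end-1), [], 1),   reshape(idx(1:end-1, 2:end), [], 1)]; % anti-diagonal
    A = sparse(E(:, 1), E(:, 2), 1, r*c, r*c); A = A + A';   % symmetric 8-neighbour adjacency
    OmegaL = diag(sum(A, 2)) - A;                            % graph Laplacian
    v = [1 1 0 1 1 0 0 0 0]';                                % a corner pixel and its neighbours
    penalty = full(v' * OmegaL * v);                         % equals 9 on the 3-by-3 grid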


      6 Experimental Results

This chapter presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA: penalized LDA (PLDA) (Witten and Tibshirani 2011), which applies a Lasso penalty within Fisher's LDA framework, and sparse linear discriminant analysis (SLDA) (Clemmensen et al. 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to ones either the diagonal of the total covariance matrix Σ_T or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.1
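As an illustration, here is a minimal MATLAB sketch of the first strategy (rescaling the observations prior to the analysis); the within-class version assumes hard labels stored in an n × K indicator matrix Y, and all variable names are ours.

```matlab
% Total-covariance normalization: set diag(Sigma_T) to ones.
Xc  = X - mean(X, 1);                    % center the features
sdT = std(Xc, 1, 1);                     % total standard deviation of each feature
XnT = Xc ./ sdT;

% Within-class normalization: set diag(Sigma_W) to ones.
R   = Xc - Y * ((Y' * Y) \ (Y' * Xc));   % residuals after removing class means
sdW = sqrt(sum(R.^2, 1) / size(X, 1));   % within-class standard deviations
XnW = Xc ./ sdW;
```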

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.
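A minimal sketch of this idea for a binary problem, assuming the discriminant scores s = Xβ have already been computed; the candidate thresholds and variable names are ours.

```matlab
% Select the decision threshold that minimizes the training error (binary case).
% s: n-by-1 vector of discriminant scores, y: n-by-1 labels in {1, 2}.
cand = sort(s);                          % candidate thresholds: the observed scores
err  = zeros(size(cand));
for t = 1:numel(cand)
  yhat   = 1 + (s > cand(t));            % classify as class 2 above the threshold
  err(t) = mean(yhat ~= y);
end
[~, best] = min(err);
threshold = cand(best);                  % empirically optimal threshold
```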

1 The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100 and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with μ_1j = 0.7 × 1_{(1≤j≤25)}, μ_2j = 0.7 × 1_{(26≤j≤50)}, μ_3j = 0.7 × 1_{(51≤j≤75)}, μ_4j = 0.7 × 1_{(76≤j≤100)}.

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j≤200)}. The covariance structure is block-diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
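A minimal MATLAB sketch of the data generation for Simulation 1 (the other setups only change the means and the covariance); the sizes follow the protocol above and the variable names are ours.

```matlab
% Generate one training set for Simulation 1.
n = 100; p = 500; K = 4;
mu = zeros(K, p);                        % class means: 0.7 on disjoint blocks of 25 variables
for k = 1:K
  mu(k, (k-1)*25+1 : k*25) = 0.7;
end
y = repmat(1:K, 1, n/K)';                % balanced class labels
X = mu(y, :) + randn(n, p);              % x_i ~ N(mu_k, I)
```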

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables and the number of discriminant directions selected on the validation set.

                                        Err. (%)      Var.           Dir.
  Sim. 1: K = 4, mean shift, ind. features
    PLDA                                12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
    SLDA                                31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
    GLOSS                               19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
    GLOSS-D                             11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
  Sim. 2: K = 2, mean shift, dependent features
    PLDA                                 9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
    SLDA                                19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
    GLOSS                               15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
    GLOSS-D                              9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
  Sim. 3: K = 4, 1D mean shift, ind. features
    PLDA                                13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
    SLDA                                57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
    GLOSS                               31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
    GLOSS-D                             18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
  Sim. 4: K = 4, mean shift, ind. features
    PLDA                                60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
    SLDA                                65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
    GLOSS                               60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
    GLOSS-D                             58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all four simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

               Simulation 1     Simulation 2     Simulation 3     Simulation 4
               TPR     FPR      TPR     FPR      TPR     FPR      TPR     FPR
  PLDA         99.0    78.2     96.9    60.3     98.0    15.9     74.3    65.6
  SLDA         73.9    38.5     33.8    16.3     41.6    27.8     50.7    39.5
  GLOSS        64.1    10.6     30.0     4.6     51.1    18.2     26.0    12.1
  GLOSS-D      93.5    39.4     92.1    28.1     95.6    65.5     42.9    29.9

method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the true positive rate (TPR) and the false positive rate (FPR): the TPR is the proportion of truly relevant variables that are selected, and the FPR is the proportion of truly irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).
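As a small illustration of these two criteria, the following MATLAB snippet assumes that the first 100 variables are the relevant ones (as in the four simulations) and that `selected` contains the indices returned by a given method; the names are ours.

```matlab
relevant = 1:100;                        % ground-truth relevant variables
irrelev  = 101:500;                      % remaining, irrelevant variables
TPR = numel(intersect(selected, relevant)) / numel(relevant);
FPR = numel(intersect(selected, irrelev)) / numel(irrelev);
```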

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama2 dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani 2011).

2 http://www.broadinstitute.org/cancer/software/genepattern/datasets
3 http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rate and the number of selected variables.

                                              Err. (%)         Var.
  Nakayama (n = 86, p = 22,283, K = 5)
    PLDA                                      20.95 (1.3)      10478.7 (2116.3)
    SLDA                                      25.71 (1.7)        252.5 (3.1)
    GLOSS                                     20.48 (1.4)        129.0 (18.6)
  Ramaswamy (n = 198, p = 16,063, K = 14)
    PLDA                                      38.36 (6.0)      14873.5 (720.3)
    SLDA                                        —                   —
    GLOSS                                     20.61 (6.9)        372.4 (122.1)
  Sun (n = 180, p = 54,613, K = 4)
    PLDA                                      33.78 (5.9)      21634.8 (7443.2)
    SLDA                                      36.22 (6.5)        384.4 (16.5)
    GLOSS                                     31.77 (4.5)         93.0 (93.6)

The Ramaswamy3 dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun4 dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations of the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

4 http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS (left) and SLDA (right). Top row: Nakayama, with classes 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma. Bottom row: Sun, with classes 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas. The axes are the first and second discriminant directions; the big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to introduce this prior knowledge easily.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class covariance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 in a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is clearly visible in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplacian-penalized execution of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating both digits.

Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalty parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels, which allows strokes to be detected and will probably provide better prediction results.


Figure 6.4: Discriminant direction between digits "1" and "0" (left: β for GLOSS; right: β for S-GLOSS).

Figure 6.5: Sparse discriminant direction between digits "1" and "0" (left: β for GLOSS with λ = 0.3; right: β for S-GLOSS with λ = 0.3).


      Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; this is, up to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the penalized discriminant analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


      Part III

      Sparse Clustering Analysis


      Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


      7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^⊤, …, x_n^⊤)^⊤ have been drawn identically from K different subpopulations of the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compound distribution whose density can be expressed as
\[
f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) \quad \forall i \in \{1,\dots,n\} ,
\]
where K is the number of components, f_k are the densities of the components and π_k are the mixture proportions (π_k ∈ ]0, 1[ ∀k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(· ; θ_k). The density of the mixture can then be written as
\[
f(x_i;\theta) = \sum_{k=1}^{K} \pi_k\, \phi(x_i;\theta_k) \quad \forall i \in \{1,\dots,n\} ,
\]
where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is by maximizing the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood for models with latent variables, for which no analytical solution is available (Dempster et al. 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

      Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:
\[
L(\theta; X) = \log\left(\prod_{i=1}^{n} f(x_i;\theta)\right)
             = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} \pi_k f_k(x_i;\theta_k)\right) , \tag{7.1}
\]
where n is the number of samples, K is the number of components of the mixture (or number of clusters) and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:
\[
L_C(\theta; X, Y) = \log\left(\prod_{i=1}^{n} f(x_i, y_i;\theta)\right)
 = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} y_{ik}\,\pi_k f_k(x_i;\theta_k)\right)
 = \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\big(\pi_k f_k(x_i;\theta_k)\big) . \tag{7.2}
\]
The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k, and y_{ik} = 0 otherwise.

Defining the soft membership t_{ik}(θ) as
\[
t_{ik}(\theta) = p(Y_{ik} = 1 \,|\, x_i;\theta) \tag{7.3}
\]
\[
\phantom{t_{ik}(\theta)} = \frac{\pi_k f_k(x_i;\theta_k)}{f(x_i;\theta)} , \tag{7.4}
\]
to lighten notations, t_{ik}(θ) will be denoted t_{ik} when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
\[
\begin{aligned}
L_C(\theta; X, Y) &= \sum_{i,k} y_{ik}\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{i,k} y_{ik}\log\big(t_{ik} f(x_i;\theta)\big)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i,k} y_{ik}\log f(x_i;\theta)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + \sum_{i=1}^{n}\log f(x_i;\theta)\\
&= \sum_{i,k} y_{ik}\log t_{ik} + L(\theta; X) ,
\end{aligned} \tag{7.5}
\]
where Σ_{i,k} y_{ik} log t_{ik} can be reformulated as
\[
\begin{aligned}
\sum_{i,k} y_{ik}\log t_{ik} &= \sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\big(p(Y_{ik}=1|x_i;\theta)\big)\\
&= \sum_{i=1}^{n}\log\big(p(Y_{ik}=1|x_i;\theta)\big)\\
&= \log\big(p(Y\,|\,X;\theta)\big) .
\end{aligned}
\]
As a result, the relationship (7.5) can be rewritten as
\[
L(\theta; X) = L_C(\theta; Z) - \log\big(p(Y\,|\,X;\theta)\big) . \tag{7.6}
\]


      Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_{ik} are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6) conditionally on a current value θ^{(t)}:
\[
L(\theta; X) = \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\big[L_C(\theta; X, Y)\big]}_{Q(\theta,\theta^{(t)})}
 + \underbrace{\mathbb{E}_{Y\sim p(\cdot|X;\theta^{(t)})}\big[-\log p(Y|X;\theta)\big]}_{H(\theta,\theta^{(t)})} .
\]
In this expression, H(θ, θ^{(t)}) is an entropy term and Q(θ, θ^{(t)}) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ΔL = L(θ^{(t+1)}; X) − L(θ^{(t)}; X). Then θ^{(t+1)} = argmax_θ Q(θ, θ^{(t)}) also increases the log-likelihood:
\[
\Delta L = \underbrace{\big(Q(\theta^{(t+1)},\theta^{(t)}) - Q(\theta^{(t)},\theta^{(t)})\big)}_{\ge 0 \text{ by definition of iteration } t+1}
 + \underbrace{\big(H(\theta^{(t+1)},\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)})\big)}_{\ge 0 \text{ by Jensen's inequality}} \;\ge\; 0 .
\]
Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^{(t)}). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^{(t)}).

For the mixture model problem, Q(θ, θ′) is
\[
\begin{aligned}
Q(\theta,\theta') &= \mathbb{E}_{Y\sim p(Y|X;\theta')}\big[L_C(\theta; X, Y)\big]\\
&= \sum_{i,k} p(Y_{ik}=1|x_i;\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big)\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) .
\end{aligned} \tag{7.7}
\]
Q(θ, θ′), due to its similitude to the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights t_{ik}(θ′) are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^{(0)};

• E-step: evaluation of Q(θ, θ^{(t)}), using t_{ik}(θ^{(t)}) (7.4) in (7.7);

• M-step: computation of θ^{(t+1)} = argmax_θ Q(θ, θ^{(t)}).

      Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors μ_k, the mixture density is
\[
f(x_i;\theta) = \sum_{k=1}^{K} \pi_k f_k(x_i;\theta_k)
             = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}
               \exp\left\{-\tfrac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right\} .
\]
At the E-step, the posterior probabilities t_{ik} are computed as in (7.4) with the current parameters θ^{(t)}; then the M-step maximizes Q(θ, θ^{(t)}) (7.7), whose form is as follows:
\[
\begin{aligned}
Q(\theta,\theta^{(t)}) &= \sum_{i,k} t_{ik}\log(\pi_k) - \sum_{i,k} t_{ik}\log\big((2\pi)^{p/2}|\Sigma|^{1/2}\big)
 - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\\
&= \sum_{k} t_k\log(\pi_k) - \underbrace{\frac{np}{2}\log(2\pi)}_{\text{constant term}}
 - \frac{n}{2}\log(|\Sigma|) - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\\
&\equiv \sum_{k} t_k\log(\pi_k) - \frac{n}{2}\log(|\Sigma|)
 - \sum_{i,k} t_{ik}\left(\frac{1}{2}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k)\right) ,
\end{aligned} \tag{7.8}
\]
where
\[
t_k = \sum_{i=1}^{n} t_{ik} . \tag{7.9}
\]
The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^{(t+1)}:
\[
\pi_k^{(t+1)} = \frac{t_k}{n} , \tag{7.10}
\]
\[
\mu_k^{(t+1)} = \frac{\sum_i t_{ik}\, x_i}{t_k} , \tag{7.11}
\]
\[
\Sigma^{(t+1)} = \frac{1}{n}\sum_k W_k , \tag{7.12}
\]
\[
\text{with}\quad W_k = \sum_i t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top . \tag{7.13}
\]

      The derivations are detailed in Appendix G
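As a summary of (7.4) and (7.10)–(7.13), the following MATLAB sketch implements one EM iteration for the common-covariance Gaussian mixture; it is a bare-bones illustration (no penalty, no safeguard against a singular Σ), and all variable names are ours.

```matlab
% One EM iteration for a K-component Gaussian mixture with common covariance.
% X: n-by-p data, pk: 1-by-K proportions, mu: K-by-p means, Sigma: p-by-p.
[n, p] = size(X);
K = numel(pk);

% E-step: posterior probabilities t_ik, Equation (7.4), via log-densities.
logdens = zeros(n, K);
L = chol(Sigma, 'lower');
for k = 1:K
  Z = (X - mu(k,:)) / L';                          % whitened residuals
  logdens(:,k) = log(pk(k)) - 0.5 * sum(Z.^2, 2) ...
                 - 0.5 * p * log(2*pi) - sum(log(diag(L)));
end
T = exp(logdens - max(logdens, [], 2));            % stabilized exponentials
T = T ./ sum(T, 2);                                % t_ik, rows sum to one

% M-step: updates (7.10)-(7.13).
tk    = sum(T, 1);                                 % soft class sizes t_k (7.9)
pk    = tk / n;                                    % proportions (7.10)
mu    = (T' * X) ./ tk';                           % weighted means (7.11)
Sigma = zeros(p);
for k = 1:K
  R     = X - mu(k,:);
  Sigma = Sigma + R' * (T(:,k) .* R);              % accumulate W_k (7.13)
end
Sigma = Sigma / n;                                 % common covariance (7.12)
```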

7.2 Feature Selection in Model-Based Clustering

When a common covariance matrix is assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989); Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:
\[
\log\left(\frac{p(Y_k=1|x)}{p(Y_\ell=1|x)}\right)
 = x^\top\Sigma^{-1}(\mu_k-\mu_\ell) - \frac{1}{2}(\mu_k+\mu_\ell)^\top\Sigma^{-1}(\mu_k-\mu_\ell) + \log\frac{\pi_k}{\pi_\ell} .
\]

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{-1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm:
\[
\lambda\sum_{k=1}^{K}\sum_{j=1}^{p} |\mu_{kj}| ,
\]
as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:
\[
\lambda_1\sum_{k=1}^{K}\sum_{j=1}^{p} |\mu_{kj}|
 + \lambda_2\sum_{k=1}^{K}\sum_{j=1}^{p}\sum_{m=1}^{p} \big|(\Sigma_k^{-1})_{jm}\big| .
\]
In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a pairwise fusion penalty (PFP):
\[
\lambda\sum_{j=1}^{p}\sum_{1\le k\le k'\le K} |\mu_{kj} - \mu_{k'j}| .
\]
This PFP regularization does not shrink the means towards zero, but towards each other: if the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:
\[
\lambda\sum_{j=1}^{p} \big\|(\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj})\big\|_\infty .
\]
One group is defined for each variable j, as the set of the K means' jth components (μ_1j, …, μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. 2008a) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

\[
\lambda\sqrt{K}\,\sum_{j=1}^{p}\sqrt{\sum_{k=1}^{K}\mu_{kj}^2} .
\]
The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing it.
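To make the differences between these penalties concrete, here is a small MATLAB sketch that evaluates them on a K × p matrix of cluster means M (the variable names are ours).

```matlab
% Evaluate the penalties of Section 7.2.1 on a K-by-p matrix of cluster means M.
[K, p] = size(M);

penL1  = sum(abs(M(:)));                                % Lasso penalty on the means
penPFP = 0;                                             % pairwise fusion penalty
for k = 1:K-1
  for kp = k+1:K
    penPFP = penPFP + sum(abs(M(k,:) - M(kp,:)));
  end
end
penLinf  = sum(max(abs(M), [], 1));                     % L1,inf penalty, per variable
penGroup = sqrt(K) * sum(sqrt(sum(M.^2, 1)));           % group-Lasso (VMG) penalty
```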

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as
\[
f(x_i\,|\,\boldsymbol{\phi},\boldsymbol{\pi},\boldsymbol{\theta},\boldsymbol{\nu})
 = \sum_{k=1}^{K}\pi_k\prod_{j=1}^{p}\big[f(x_{ij}|\theta_{jk})\big]^{\phi_j}\big[h(x_{ij}|\nu_j)\big]^{1-\phi_j} ,
\]
where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}; their estimation is done by means of the EM algorithm (Law et al. 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher criterion
\[
\operatorname{tr}\Big(\big(U^\top\Sigma_W U\big)^{-1} U^\top\Sigma_B U\Big) \tag{7.14}
\]
so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters into the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U which maximizes (7.14). This sparse approximation is defined as the solution of
\[
\min_{\tilde U\in\mathbb{R}^{p\times(K-1)}} \big\|X_U - X\tilde U\big\|_F^2 + \lambda\sum_{k=1}^{K-1}\big\|\tilde u_k\big\|_1 ,
\]
where X_U = XU is the input data projected onto the non-sparse space, and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:
\[
\begin{aligned}
\min_{A,B\in\mathbb{R}^{p\times(K-1)}}\quad & \sum_{k=1}^{K}\big\|R_W^{-\top}H_{B,k} - AB^\top H_{B,k}\big\|_2^2
 + \rho\sum_{j=1}^{K-1}\beta_j^\top\Sigma_W\beta_j + \lambda\sum_{j=1}^{K-1}\big\|\beta_j\big\|_1\\
\text{s.t.}\quad & A^\top A = I_{K-1} ,
\end{aligned}
\]
where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.

The last possibility suggests computing the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:
\[
\begin{aligned}
\min_{U\in\mathbb{R}^{p\times(K-1)}}\quad & \sum_{j=1}^{p}\big\|\Sigma_{B,j} - UU^\top\Sigma_{B,j}\big\|_2^2\\
\text{s.t.}\quad & U^\top U = I_{K-1} ,
\end{aligned}
\]
where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U to recover orthogonality. This is not necessary with the third option, since the penalized version of the SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^{(1)}: the set of selected relevant variables;

• X^{(2)}: the set of variables being considered for inclusion into or exclusion from X^{(1)};

• X^{(3)}: the set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:
\[
f(X|Y) = f\big(X^{(1)},X^{(2)},X^{(3)}\,\big|\,Y\big)
 = f\big(X^{(3)}\,\big|\,X^{(2)},X^{(1)}\big)\,f\big(X^{(2)}\,\big|\,X^{(1)}\big)\,f\big(X^{(1)}\,\big|\,Y\big)
\]

• M2:
\[
f(X|Y) = f\big(X^{(1)},X^{(2)},X^{(3)}\,\big|\,Y\big)
 = f\big(X^{(3)}\,\big|\,X^{(2)},X^{(1)}\big)\,f\big(X^{(2)},X^{(1)}\,\big|\,Y\big)
\]

Model M1 means that the variables in X^{(2)} are independent of the clustering Y; model M2 states that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding on the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor
\[
B_{12} = \frac{f(X|M_1)}{f(X|M_2)} ,
\]
where the high-dimensional f(X^{(3)}|X^{(2)},X^{(1)}) cancels from the ratio:
\[
B_{12} = \frac{f\big(X^{(1)},X^{(2)},X^{(3)}\,\big|\,M_1\big)}{f\big(X^{(1)},X^{(2)},X^{(3)}\,\big|\,M_2\big)}
       = \frac{f\big(X^{(2)}\,\big|\,X^{(1)},M_1\big)\,f\big(X^{(1)}\,\big|\,M_1\big)}{f\big(X^{(2)},X^{(1)}\,\big|\,M_2\big)} .
\]

This factor is approximated, since the integrated likelihoods f(X^{(1)}|M_1) and f(X^{(2)},X^{(1)}|M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)}|X^{(1)},M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}; there is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


      8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to derive reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty; as with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
\[
d(x_i,\mu_k) = (x_i-\mu_k)^\top\Sigma_W^{-1}(x_i-\mu_k) ,
\]
where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.

The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
\[
2\,l_{\mathrm{weight}}(\mu,\Sigma) = -\sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\, d(x_i,\mu_k) - n\log(|\Sigma_W|) ,
\]
which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. It is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to the Fisher discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
\[
d(x_i,\mu_k) = \big\|(x_i-\mu_k)B_{\mathrm{LDA}}\big\|_2^2 - 2\log(\pi_k) .
\]
This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y\Theta ,
\]
where Θ are the K − 1 leading eigenvectors of
\[
Y^\top X\big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y .
\]

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag(α_k^{-1}(1 − α_k²)^{-1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik, with
\[
t_{ik} \propto \exp\left[-\frac{d(x_i,\mu_k) - 2\log(\pi_k)}{2}\right] . \tag{8.1}
\]

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, of this alternative view of the EM algorithm for Gaussian mixtures.
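A minimal MATLAB sketch of step 2, using a plain quadratic penalty matrix Omega as a stand-in for the adaptive penalty handled by GLOSS; the scaling matrix D of step 3 is omitted, since the α_k are defined in Chapter 4, and all variable names are ours.

```matlab
% Penalized optimal scoring solve of step 2 (quadratic penalty as a stand-in
% for the working penalty of GLOSS).
% X: n-by-p centered data, Y: n-by-K membership matrix (hard or soft),
% Omega: p-by-p penalty matrix, lambda: penalty parameter.
K = size(Y, 2);
A = X' * X + lambda * Omega;              % penalized Gram matrix
G = A \ (X' * Y);                         % A^{-1} X'Y
M = Y' * X * G;                           % K-by-K matrix whose eigenvectors give Theta
[V, E] = eig((M + M') / 2);               % symmetrize before the eigen-decomposition
[~, order] = sort(diag(E), 'descend');
Theta = V(:, order(1:K-1));               % K-1 leading eigenvectors
B_OS  = G * Theta;                        % p-by-(K-1) coefficient matrix
```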

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS problem. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7), so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized OS problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
\[
f(\Sigma|\Lambda_0,\nu_0) = \frac{1}{2^{np/2}\,|\Lambda_0|^{n/2}\,\Gamma_p(\frac{n}{2})}\;
|\Sigma^{-1}|^{\frac{\nu_0-p-1}{2}} \exp\left\{-\frac{1}{2}\operatorname{tr}\!\big(\Lambda_0^{-1}\Sigma^{-1}\big)\right\} ,
\]
where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as
\[
\Gamma_p(n/2) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\!\big(n/2 + (1-j)/2\big) .
\]

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of
\[
\begin{aligned}
Q(\theta,\theta') &+ \log\big(f(\Sigma|\Lambda_0,\nu_0)\big)\\
&= \sum_{k=1}^{K} t_k\log\pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p-1)}{4}\log(\pi)
 - \sum_{j=1}^{p}\log\Gamma\!\left(\frac{n}{2}+\frac{1-j}{2}\right)
 - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\!\big(\Lambda_n^{-1}\Sigma^{-1}\big)\\
&\equiv \sum_{k=1}^{K} t_k\log\pi_k - \frac{n}{2}\log|\Lambda_0|
 - \frac{\nu_n-p-1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\!\big(\Lambda_n^{-1}\Sigma^{-1}\big) ,
\end{aligned} \tag{8.2}
\]
with
\[
t_k = \sum_{i=1}^{n} t_{ik} , \qquad \nu_n = \nu_0 + n , \qquad
\Lambda_n^{-1} = \Lambda_0^{-1} + S_0 , \qquad
S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top .
\]

Details of these calculations can be found in textbooks (for example Bishop 2006; Gelman et al. 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by deriving (8.2) with respect to Σ; the details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is
\[
\widehat{\Sigma}_{\mathrm{MAP}} = \frac{1}{\nu_0+n-p-1}\big(\Lambda_0^{-1} + S_0\big) , \tag{8.3}
\]
where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class covariance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and setting Λ_0^{-1} = λΩ, where Ω is the penalty matrix of the group-Lasso regularization (4.25).


      9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates the local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

Figure 9.1: Mix-GLOSS loops scheme.

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have checked that the warm-start implemented here reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage


of variables that will be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize: B ← 0, Y ← K-means(X, K)
Run non-penalized Mix-GLOSS: λ ← 0, (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ:
        compute the gradient at β_j = 0:  ∂J(B)/∂β_j |_{β_j=0} = x_j^⊤ ( Σ_{m≠j} x_m β_m − YΘ )
        compute λ_max for every feature using (4.32b):  λ_max^j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j=0} ‖_2
        choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS: (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
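A minimal MATLAB sketch of the λ_max computation used above, at the fully sparse point B = 0 (so that the residual is simply −YΘ); the weights w_j are taken equal to one here, which is an assumption, since their exact definition is given with (4.32b) in Chapter 5.

```matlab
% Gradient of the data-fitting term at B = 0 and per-feature lambda_max.
% X: n-by-p centered data, YTheta: n-by-(K-1) scaled indicator matrix Y*Theta.
w = ones(size(X, 2), 1);                  % assumed unit group weights
G = X' * YTheta;                          % minus the gradient at B = 0, p-by-(K-1)
lambda_max = sqrt(sum(G.^2, 2)) ./ w;     % ||gradient_j||_2 / w_j for each feature j
% A trial lambda can then be chosen among these values, e.g. so that about
% 10% of the currently relevant features would be removed.
```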

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) are available then
        B_OS ← B0, Y ← Y0
    else
        B_OS ← 0, Y ← K-means(X, K)
    end if
    convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{-1}(1 − α²)^{-1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y

      M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression; here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

      E-Step

The E-step evaluates the posterior probability matrix T, using
\[
t_{ik} \propto \exp\left[-\frac{d(x_i,\mu_k) - 2\log(\pi_k)}{2}\right] .
\]
The convergence of these t_ik is used as the stopping criterion for EM.

      92 Model Selection

      Here model selection refers to the choice of the penalty parameter Up to now wehave not conducted experiments where the number of clusters has to be automaticallyselected

      In a first attempt we tried a classical structure where clustering was performed severaltimes from different initializations for all penalty parameter values Then using the log-likelihood criterion the best repetition for every value of the penalty parameter waschosen The definitive λ was selected by means of the stability criterion described byLange et al (2002) This algorithm took lots of computing resources since the stabilityselection mechanism required a certain number of repetitions that transformed Mix-GLOSS in a lengthy four nested loops structure

      In a second attempt we replaced the stability based model selection algorithm by theevaluation of a modified version of BIC (Pan and Shen 2007) This version of BIC lookslike the traditional one (Schwarz 1978) but takes into consideration the variables thathave been removed This mechanism even if it turned out to be faster required alsolarge computation time

The third and definitive attempt (up to now) proceeds with several runs of Mix-GLOSS for the non-penalized case (λ = 0). The run with the best log-likelihood is chosen, so the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized run are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism; this time, only one repetition of the algorithm is done for every value of the penalty parameter.


Figure 9.2: Mix-GLOSS model selection diagram. The initial Mix-GLOSS is run with λ = 0 and 20 repetitions; the B and T of the best repetition are used as StartB and StartT to warm-start Mix-GLOSS for every λ; BIC is computed for each fit and λ is chosen as the minimizer of BIC.

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.
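A compact sketch of this selection scheme is given below. It is only a skeleton, under the assumption that two callables exist: a hypothetical mix_gloss(X, K, lam, B0, Y0) returning a fit with its log-likelihood, coefficients and posteriors, and a hypothetical bic(fit) implementing the modified BIC mentioned above; neither is part of the thesis code.

    def select_lambda(X, K, lam_grid, mix_gloss, bic, n_rep=20):
        """Model selection as in Figure 9.2 (sketch): repeat the unpenalized
        problem, keep the best run, warm-start one penalized run per lambda,
        and return the fit with the smallest BIC."""
        best = max((mix_gloss(X, K, lam=0.0) for _ in range(n_rep)),
                   key=lambda fit: fit["loglik"])      # best of n_rep runs at lambda = 0
        scored = []
        for lam in lam_grid:                           # one warm-started run per lambda
            fit = mix_gloss(X, K, lam=lam, B0=best["B"], Y0=best["T"])
            scored.append((bic(fit), lam, fit))
        return min(scored, key=lambda t: t[0])         # (BIC, lambda, fit) minimizing BIC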


      10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust / Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al. 2009b) is software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al. 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010) they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The LumiWCluster package allows clustering to be performed using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see Section 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational form of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are:

• Clustering Error (in percentage): To measure the quality of the partition with a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007); a sketch of this computation is given after this list. If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined makes it possible to obtain the ideal 0% clustering error even if the IDs of the clusters and of the real classes differ.

• Number of Discarded Features: This value shows the number of variables whose coefficients have been zeroed, and that are therefore not used in the partitioning. In our datasets only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
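The sketch announced in the first item above computes a label-permutation-invariant clustering error: cluster IDs are matched to class IDs by the assignment that maximizes agreement. It is only an illustration using SciPy's Hungarian algorithm; the exact definition of Wu and Schölkopf (2007) may differ in details.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_error(y_true, y_pred):
        """Fraction of misclassified points under the best one-to-one matching
        between cluster IDs and class IDs (label-permutation invariant)."""
        classes, clusters = np.unique(y_true), np.unique(y_pred)
        # contingency[i, j] = number of points of class i assigned to cluster j
        C = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                      for c in classes])
        row, col = linear_sum_assignment(-C)        # maximize matched points
        return 1.0 - C[row, col].sum() / len(y_true)

    # identical partitions with different IDs give 0% error
    print(clustering_error(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))  # 0.0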

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and high clustering error, respectively, and as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.

Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).


Table 10.1: Experimental results for simulated data

                        Err (%)       Var           Time
  Sim 1: K = 4, mean shift, ind. features
    CS general cov      46 (15)       985 (72)      884h
    Fisher EM           58 (87)       784 (52)      1645m
    Clustvarsel         602 (107)     378 (291)     383h
    LumiWCluster-Kuan   42 (68)       779 (4)       389s
    LumiWCluster-Wang   43 (69)       784 (39)      619s
    Mix-GLOSS           32 (16)       80 (09)       15h
  Sim 2: K = 2, mean shift, dependent features
    CS general cov      154 (2)       997 (09)      783h
    Fisher EM           74 (23)       809 (28)      8m
    Clustvarsel         73 (2)        334 (207)     166h
    LumiWCluster-Kuan   64 (18)       798 (04)      155s
    LumiWCluster-Wang   63 (17)       799 (03)      14s
    Mix-GLOSS           77 (2)        841 (34)      2h
  Sim 3: K = 4, 1D mean shift, ind. features
    CS general cov      304 (57)      55 (468)      1317h
    Fisher EM           233 (65)      366 (55)      22m
    Clustvarsel         658 (115)     232 (291)     542h
    LumiWCluster-Kuan   323 (21)      80 (02)       83s
    LumiWCluster-Wang   308 (36)      80 (02)       1292s
    Mix-GLOSS           347 (92)      81 (88)       21h
  Sim 4: K = 4, mean shift, ind. features
    CS general cov      626 (55)      999 (02)      112h
    Fisher EM           567 (104)     55 (48)       195m
    Clustvarsel         732 (4)       24 (12)       767h
    LumiWCluster-Kuan   692 (112)     99 (2)        876s
    LumiWCluster-Wang   697 (119)     991 (21)      825s
    Mix-GLOSS           669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions for the best performing algorithms

                Simulation 1    Simulation 2    Simulation 3    Simulation 4
                TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
    Mix-GLOSS   992    015      828    335      884    67       780    12
    Lumi-Kuan   992    28       1000   02       1000   005      50     005
    Fisher-EM   986    24       888    17       838    5825     620    4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (Mix-GLOSS, Lumi-Kuan, Fisher-EM) on the four simulations.

10.3 Discussion

After reviewing Tables 10.1–10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu 2008; Kuan et al. 2010) is by far the fastest method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu 2008; Kuan et al. 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al. 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations; Fisher EM (Bouveyron and Brunet 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best solution in terms of fall-out and recall.


      Conclusions


      Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational form of the group-Lasso penalty to preserve this equivalence, allowing penalized optimal scoring regressions to be used for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without sacrificing predictive capabilities. GLOSS has been tested with four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement the model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

      Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible. Identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al. 2011), and Stirling faces (Roth and Lange 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet 2012a), have also been used in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; that can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the true criterion have been used to stop the EM algorithm and to perform model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a significant effort was devoted to outlier detection and block clustering. One of the most successful mechanisms for detecting outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


      Appendix


      A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:

$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,$$

$$\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .$$

Property 2. $\frac{\partial x^\top a}{\partial x} = \frac{\partial a^\top x}{\partial x} = a$

Property 3. $\frac{\partial x^\top A x}{\partial x} = (A + A^\top)x$

Property 4. $\frac{\partial |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top$

Property 5. $\frac{\partial a^\top X b}{\partial X} = ab^\top$

Property 6. $\frac{\partial}{\partial X}\,\mathrm{tr}\big(AX^{-1}B\big) = -(X^{-1}BAX^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}$
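These identities are easy to check numerically. The sketch below is a small finite-difference verification of Property 6 on random matrices; it is only an illustration (matrix sizes, the perturbed entry and the tolerance are arbitrary choices).

    import numpy as np

    rng = np.random.default_rng(4)
    p = 4
    A, B = rng.normal(size=(p, p)), rng.normal(size=(p, p))
    X = np.eye(p) + 0.1 * rng.normal(size=(p, p))     # well-conditioned, invertible

    def f(X):
        return np.trace(A @ np.linalg.inv(X) @ B)

    # analytic gradient from Property 6: -(X^{-1} B A X^{-1})^T
    Xinv = np.linalg.inv(X)
    grad = -(Xinv @ B @ A @ Xinv).T

    # central finite difference on one arbitrary entry (i, j)
    eps, (i, j) = 1e-6, (1, 2)
    E = np.zeros_like(X); E[i, j] = eps
    print(np.isclose((f(X + E) - f(X - E)) / (2 * eps), grad[i, j], rtol=1e-4))  # True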


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has this form:

$$\min_{\theta_k, \beta_k} \; \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \quad \text{s.t.} \quad \theta_k^\top Y^\top Y\theta_k = 1, \;\; \theta_\ell^\top Y^\top Y\theta_k = 0 \;\; \forall \ell < k ,$$   (B.1)

for k = 1, ..., K − 1. The Lagrangian associated with Problem (B.1) is

$$L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k(\theta_k^\top Y^\top Y\theta_k - 1) + \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y\theta_k .$$   (B.2)

Setting to zero the gradient of (B.2) with respect to β_k gives the value of the optimal β_k*:

$$\beta_k^\star = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k .$$   (B.3)

The objective function of (B.1) evaluated at β_k* is

$$\min_{\theta_k} \; \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star = \min_{\theta_k} \; \theta_k^\top Y^\top\big(I - X(X^\top X + \Omega_k)^{-1}X^\top\big)Y\theta_k = \max_{\theta_k} \; \theta_k^\top Y^\top X(X^\top X + \Omega_k)^{-1}X^\top Y\theta_k .$$   (B.4)

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, such that we can rewrite expression (B.4) in a compact way:

$$\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \; \mathrm{tr}\big(\Theta^\top M\Theta\big) \quad \text{s.t.} \quad \Theta^\top Y^\top Y\Theta = I_{K-1} .$$   (B.5)

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1) × (K−1) matrix M_Θ be Θ^⊤MΘ. Hence the classical eigenvector formulation associated with (B.5) is

$$M_\Theta v = \lambda v ,$$   (B.6)

where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,

$$v^\top M_\Theta v = \lambda \iff v^\top \Theta^\top M\Theta v = \lambda .$$

Making the variable change w = Θv, we obtain an alternative eigen-problem where the w are the eigenvectors of M and λ the associated eigenvalues:

$$w^\top M w = \lambda .$$   (B.7)

Therefore v are the eigenvectors of the eigen-decomposition of matrix M_Θ and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K−1) × (K−1) matrix M_Θ and the K × K matrix M is the K × (K−1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{-1}X^⊤YΘ in M_Θ:

$$M_\Theta = \Theta^\top Y^\top X(X^\top X + \Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top XB .$$

To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB. Then the definitive eigenvectors w are recovered by applying w = Θv. The final step is the reconstruction of the optimal score matrix Θ* using the vectors w as its columns. At this point we understand what the literature calls "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The B matrix also needs to be "updated" by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

$$B^\star = (X^\top X + \Omega)^{-1}X^\top Y\Theta V = BV .$$
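The following NumPy sketch illustrates this update, assuming X, Y, an initial score matrix Theta and the regression coefficients B (as returned by the penalized regression) are available; the symmetrization and the eigenvalue ordering are implementation choices, not part of the derivation.

    import numpy as np

    def update_scores(X, Y, Theta, B):
        """Eigen-decompose the small (K-1)x(K-1) matrix M_Theta = Theta^T Y^T X B
        instead of the p x p form, then rotate Theta and B by the eigenvectors V."""
        M = Theta.T @ Y.T @ X @ B              # (K-1) x (K-1), cheap to form
        M = 0.5 * (M + M.T)                    # symmetric up to numerical error
        eigval, V = np.linalg.eigh(M)          # ascending eigenvalues
        V = V[:, ::-1]                         # decreasing order of eigenvalues
        return Theta @ V, B @ V                # "updated" score and coefficient matrices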


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of M (called w in (B.7)) form a basis, so that any score vector θ_k can be expressed as a linear combination of them:

$$\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \quad \text{s.t.} \quad \theta_k^\top\theta_k = 1 .$$   (B.8)

The score vectors' normalization constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis,

$$\Big(\sum_{m=1}^{K-1} \alpha_m w_m\Big)^\top \Big(\sum_{m=1}^{K-1} \alpha_m w_m\Big) = 1 ,$$

which, as per the eigenvector properties, can be reduced to

$$\sum_{m=1}^{K-1} \alpha_m^2 = 1 .$$   (B.9)

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

$$M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m .$$

As the w_m are the eigenvectors of M, the relationship M w_m = λ_m w_m can be used to obtain

$$M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m .$$

Left-multiplying by θ_k^⊤, written as its linear combination of eigenvectors,

$$\theta_k^\top M\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\Big) .$$

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤w_m is zero for any ℓ ≠ m, giving

$$\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m .$$


The optimization problem (B.5) for discriminant direction k can be rewritten as

$$\max_{\theta_k \in \mathbb{R}^{K\times 1}} \; \theta_k^\top M\theta_k = \max_{\theta_k \in \mathbb{R}^{K\times 1}} \; \sum_{m=1}^{K-1}\alpha_m^2\lambda_m \quad \text{with} \quad \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad \text{and} \quad \sum_{m=1}^{K-1}\alpha_m^2 = 1 .$$   (B.10)

One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K-1} α_m w_m, the resulting score vector θ_k will be equal to the kth eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

$$\max_{\beta \in \mathbb{R}^p} \; \beta^\top\Sigma_B\beta \quad \text{s.t.} \quad \beta^\top\Sigma_W\beta = 1 ,$$   (C.1a, C.1b)

where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

$$L(\beta, \nu) = \beta^\top\Sigma_B\beta - \nu(\beta^\top\Sigma_W\beta - 1) ,$$

so that its first derivative with respect to β is

$$\frac{\partial L(\beta, \nu)}{\partial \beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta .$$

A necessary optimality condition for β* is that this derivative is zero, that is,

$$\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star .$$

Provided Σ_W is full rank, we have

$$\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star .$$   (C.2)

Thus the solutions β* match the definition of an eigenvector of matrix Σ_W^{-1}Σ_B with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

$$\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\,\beta^{\star\top}\Sigma_W\beta^\star \;\;\text{from (C.2)}\; = \nu \;\;\text{from (C.1b)} .$$

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of Σ_W^{-1}Σ_B and β* is any eigenvector corresponding to this maximal eigenvalue.
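In practice, (C.2) is conveniently solved as a generalized symmetric eigenproblem rather than by forming Σ_W^{-1}. The sketch below is one way to do this with SciPy, assuming Σ_B and Σ_W are given as symmetric matrices with Σ_W positive definite (the variable names are illustrative).

    import numpy as np
    from scipy.linalg import eigh

    def fisher_direction(Sigma_B, Sigma_W):
        """First discriminant direction: eigenvector of Sigma_W^{-1} Sigma_B for
        the largest eigenvalue, computed as the generalized eigenproblem
        Sigma_B beta = nu Sigma_W beta (Problem C.1)."""
        nu, betas = eigh(Sigma_B, Sigma_W)     # generalized eigh, ascending eigenvalues
        beta = betas[:, -1]                    # eigenvector of the largest eigenvalue
        # eigh normalizes the eigenvectors so that beta^T Sigma_W beta = 1,
        # which is exactly constraint (C.1b)
        return nu[-1], beta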


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p\times K-1}} \; J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j}$$   (D.1a)
$$\text{s.t.} \quad \sum_{j=1}^p \tau_j = 1 ,$$   (D.1b)
$$\qquad \tau_j \geq 0, \;\; j = 1, \ldots, p .$$   (D.1c)

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×K−1} be a matrix composed of row vectors β^j ∈ R^{K−1}: B = (β^{1⊤}, ..., β^{p⊤})^⊤.

$$L(B, \tau, \lambda, \nu_0, \nu) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^p \tau_j - 1\Big) - \sum_{j=1}^p \nu_j\tau_j .$$   (D.2)

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j*:

$$\frac{\partial L(B, \tau, \lambda, \nu_0, \nu)}{\partial \tau_j}\bigg|_{\tau_j = \tau_j^\star} = 0 \;\Rightarrow\; -\lambda w_j^2\frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0 \;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0 \;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0 .$$

The last two expressions are related through a property of the Lagrange multipliers stating that ν_j g_j(τ*) = 0, where ν_j is the Lagrange multiplier and g_j(τ) the inequality constraint. Then the optimal τ_j* can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j\|\beta^j\|_2 .$$

Placing this optimal value of τ_j* into constraint (D.1b):

$$\sum_{j=1}^p \tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j=1}^p w_j\|\beta^j\|_2} .$$   (D.3)


With this value of τ_j*, Problem (D.1) is equivalent to

$$\min_{B \in \mathbb{R}^{p\times K-1}} \; J(B) + \lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^2 .$$   (D.4)

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently presented as λB^⊤ΩB, where

$$\Omega = \mathrm{diag}\Big(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\Big) .$$   (D.5)

Using the value of τ_j* from (D.3), each diagonal component of Ω is

$$(\Omega)_{jj} = \frac{w_j\sum_{\ell=1}^p w_\ell\|\beta^\ell\|_2}{\|\beta^j\|_2} .$$   (D.6)

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×K−1}, the subdifferential of the objective function of Problem (D.4) is

$$\bigg\{ V \in \mathbb{R}^{p\times K-1} : V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)\,G \bigg\} ,$$   (D.7)

where G = (g^{1⊤}, ..., g^{p⊤})^⊤ is a p × (K−1) matrix defined as follows. Let S(B) denote the support of B, that is, the set of its non-zero row vectors, S(B) = { j ∈ 1, ..., p : ||β^j||_2 ≠ 0 }; then we have

$$\forall j \in S(B), \quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j ,$$   (D.8)
$$\forall j \notin S(B), \quad \|g^j\|_2 \leq w_j .$$   (D.9)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B* of the objective function verifying the following conditions are global minima. Let S(B*) denote the support of B*, S(B*) = { j ∈ 1, ..., p : ||β*^j||_2 ≠ 0 }, and let its complement be the set of remaining indices; then we have

$$\forall j \in S(B^\star), \quad -\frac{\partial J(B^\star)}{\partial \beta^j} = 2\lambda\Big(\sum_{\ell=1}^{p} w_\ell\|\beta^{\star\ell}\|_2\Big)\, w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} ,$$   (D.10a)
$$\forall j \notin S(B^\star), \quad \bigg\|\frac{\partial J(B^\star)}{\partial \beta^j}\bigg\|_2 \leq 2\lambda\, w_j\Big(\sum_{\ell=1}^{p} w_\ell\|\beta^{\star\ell}\|_2\Big) .$$   (D.10b)

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at τ* such that

$$\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j=1}^p w_j\|\beta^j\|_2} .$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have

$$\Big(\sum_{j=1}^p w_j\|\beta^j\|_2\Big)^2 = \Bigg(\sum_{j=1}^p \tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Bigg)^2 \leq \Big(\sum_{j=1}^p \tau_j\Big)\Bigg(\sum_{j=1}^p \frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Bigg) \leq \sum_{j=1}^p \frac{w_j^2\|\beta^j\|_2^2}{\tau_j} ,$$

where we used the Cauchy-Schwarz inequality in the second step and the definition of the feasibility set of τ in the last one.
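The bound and the equality at τ* are easy to check numerically. The following is a small NumPy illustration (random B, random weights; the sizes and seeds are arbitrary), not part of the thesis code.

    import numpy as np

    rng = np.random.default_rng(1)
    p, K = 30, 4
    B = rng.normal(size=(p, K - 1))
    w = rng.uniform(0.5, 2.0, size=p)
    norms = np.linalg.norm(B, axis=1)                    # ||beta^j||_2

    group_lasso_sq = (w @ norms) ** 2                    # (sum_j w_j ||beta^j||)^2

    tau_star = w * norms / np.sum(w * norms)             # optimal tau from (D.3)
    variational = np.sum(w**2 * norms**2 / tau_star)     # sum_j w_j^2 ||beta^j||^2 / tau_j

    tau_rand = rng.dirichlet(np.ones(p))                 # arbitrary feasible tau
    upper = np.sum(w**2 * norms**2 / tau_rand)

    print(np.isclose(group_lasso_sq, variational))       # True: equality at tau*
    print(upper >= group_lasso_sq - 1e-12)               # True: upper bound elsewhere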


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1) because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B^0 are optimal for the score values Θ^0, and if the optimal scores Θ* are obtained by a unitary transformation of Θ^0, say Θ* = Θ^0 V (where V ∈ R^{M×M} is a unitary matrix), then B* = B^0 V is optimal conditionally on Θ*, that is, (Θ*, B*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let B* be a solution of

$$\min_{B \in \mathbb{R}^{p\times M}} \; \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 ,$$   (E.1)

and let Ỹ = YV, where V ∈ R^{M×M} is a unitary matrix. Then B̃ = B*V is a solution of

$$\min_{B \in \mathbb{R}^{p\times M}} \; \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 .$$   (E.2)

Proof. The first-order necessary optimality conditions for B* are

$$\forall j \in S(B^\star), \quad 2x_j^\top\big(XB^\star - Y\big) + \lambda w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = 0 ,$$   (E.3a)
$$\forall j \notin S(B^\star), \quad 2\big\|x_j^\top\big(XB^\star - Y\big)\big\|_2 \leq \lambda w_j ,$$   (E.3b)

where S(B*) ⊆ {1, ..., p} denotes the set of non-zero row vectors of B* and its complement is the set of remaining indices.

First, we note that, from the definition of B̃, we have S(B̃) = S(B*). Then we may rewrite the above conditions as follows:

$$\forall j \in S(\tilde{B}), \quad 2x_j^\top\big(X\tilde{B} - \tilde{Y}\big) + \lambda w_j\|\tilde{\beta}^{j}\|_2^{-1}\tilde{\beta}^{j} = 0 ,$$   (E.4a)
$$\forall j \notin S(\tilde{B}), \quad 2\big\|x_j^\top\big(X\tilde{B} - \tilde{Y}\big)\big\|_2 \leq \lambda w_j ,$$   (E.4b)

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, and also uses that VV^⊤ = I, so that ∀u ∈ R^M, ||u^⊤||_2 = ||u^⊤V||_2. Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for B̃ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
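The invariance itself can be illustrated without solving any optimization problem: rotating Y by a unitary V and the candidate B by the same V leaves the group-Lasso OS objective unchanged, because the Frobenius norm and the row norms are unitarily invariant. The sketch below is a toy numeric check with arbitrary random data (names and sizes are illustrative).

    import numpy as np
    from scipy.stats import ortho_group

    rng = np.random.default_rng(2)
    n, p, M = 50, 10, 3
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, M))
    B = rng.normal(size=(p, M))
    w = np.ones(p)
    lam = 0.7

    def objective(Y, B):
        """Group-Lasso optimal scoring objective of (E.1)."""
        return (np.linalg.norm(Y - X @ B, "fro") ** 2
                + lam * np.sum(w * np.linalg.norm(B, axis=1)))

    V = ortho_group.rvs(M, random_state=3)          # random orthogonal (unitary) matrix
    print(np.isclose(objective(Y, B), objective(Y @ V, B @ V)))  # True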


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:

$$L(\theta) = \sum_{i=1}^n \log\Big(\sum_{k=1}^K \pi_k f_k(x_i; \theta_k)\Big) ,$$   (F.1)

$$Q(\theta, \theta') = \sum_{i=1}^n\sum_{k=1}^K t_{ik}(\theta')\log\big(\pi_k f_k(x_i; \theta_k)\big) ,$$   (F.2)

$$\text{with} \quad t_{ik}(\theta') = \frac{\pi_k' f_k(x_i; \theta_k')}{\sum_\ell \pi_\ell' f_\ell(x_i; \theta_\ell')} .$$   (F.3)

In the EM algorithm, θ′ denotes the model parameters at the previous iteration, t_ik(θ′) are the posterior probabilities computed from θ′ at the previous E-step, and θ (without prime) denotes the parameters of the current iteration, to be obtained by maximizing Q(θ, θ′).

Using (F.3), we have

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i; \theta_k)\big) = \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_\ell \pi_\ell f_\ell(x_i; \theta_\ell)\Big) = \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta) .$$

In particular, after the evaluation of the t_ik in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

$$L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) = Q(\theta, \theta) + H(T) .$$
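The identity L(θ) = Q(θ, θ) + H(T) can be verified on a toy Gaussian mixture; the sketch below is only an illustration with arbitrary parameters, not related to the thesis data.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 2))
    pi = np.array([0.3, 0.7])
    mu = np.array([[0.0, 0.0], [1.0, -1.0]])
    Sigma = np.eye(2)

    # component densities f_k(x_i) and posteriors t_ik (E-step at theta)
    F = np.column_stack([multivariate_normal.pdf(X, mean=m, cov=Sigma) for m in mu])
    T = pi * F
    T /= T.sum(axis=1, keepdims=True)

    loglik = np.sum(np.log(F @ pi))                    # direct definition (F.1)
    Q = np.sum(T * np.log(pi * F))                     # Q(theta, theta) as in (F.2)
    H = -np.sum(T * np.log(T))                         # entropy of the posteriors
    print(np.isclose(loglik, Q + H))                   # True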


      G Derivation of the M-Step Equations

This appendix shows the whole process of obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i; \theta_k)\big) = \sum_{k}\Big(\sum_i t_{ik}\Big)\log\pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}(x_i - \mu_k)^\top\Sigma^{-1}(x_i - \mu_k) ,$$

which has to be maximized subject to Σ_k π_k = 1.

The Lagrangian of this problem is

$$\mathcal{L}(\theta) = Q(\theta, \theta') + \lambda\Big(\sum_k \pi_k - 1\Big) .$$

Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.

G.1 Prior Probabilities

$$\frac{\partial \mathcal{L}(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0 ,$$

where λ is identified from the constraint, leading to

$$\pi_k = \frac{1}{n}\sum_i t_{ik} .$$


G.2 Means

$$\frac{\partial \mathcal{L}(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2}\sum_i t_{ik}\, 2\Sigma^{-1}(\mu_k - x_i) = 0 \;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}x_i}{\sum_i t_{ik}} .$$

G.3 Covariance Matrix

$$\frac{\partial \mathcal{L}(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0 \;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top .$$
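For reference, the three closed-form updates just derived translate directly into code. The following is a minimal NumPy sketch (not the thesis implementation) computing π_k, μ_k and the common Σ from a matrix of posteriors T of shape n × K.

    import numpy as np

    def m_step(X, T):
        """Closed-form updates derived above: priors, means and the common
        covariance matrix of the Gaussian mixture, given posteriors T (n x K)."""
        n, p = X.shape
        Nk = T.sum(axis=0)                       # sum_i t_ik, one value per component
        pi = Nk / n                              # pi_k = (1/n) sum_i t_ik
        mu = (T.T @ X) / Nk[:, None]             # mu_k = sum_i t_ik x_i / sum_i t_ik
        Sigma = np.zeros((p, p))
        for k in range(T.shape[1]):
            diff = X - mu[k]
            Sigma += (T[:, k, None] * diff).T @ diff
        return pi, mu, Sigma / n                 # Sigma = (1/n) sum_{i,k} t_ik (x_i-mu_k)(x_i-mu_k)^T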


      Bibliography

      F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

      F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

      F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

      J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

      A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

      H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

P J Bickel and E Levina Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations Bernoulli 10(6)989–1010 2004

C Biernacki G Celeux G Govaert and F Langrognet MIXMOD Statistical Documentation http://www.mixmod.org 2008

      C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

      C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

      C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

      S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

      L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

      L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


      T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

      S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

      C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

      B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

      L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

      C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

      A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

      D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

      R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

      B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

      Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

      R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

      V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

      J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


      J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

      J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

      W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

      A Gelman J B Carlin H S Stern and D B Rubin Bayesian Data Analysis Chap-man amp HallCRC 2003

      D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert Y Grandvalet X Liu and L F Sanchez Merchante Implementation baseline for clustering Technical Report D7.1-m12 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D7.1-m12.pdf 2010

G Govaert Y Grandvalet B Laval X Liu and L F Sanchez Merchante Implementations of original clustering Technical Report D7.2-m24 Massive Sets of Heuristics for Machine Learning https://secure.mash-project.eu/files/mash-deliverable-D7.2-m24.pdf 2011

      Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

      Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

      L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

      Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

      J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

      I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

      T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

      T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


      T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

      A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

      J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

      T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

      K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

      P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

      T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

      M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

      Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

      C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

      C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

      H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

      J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

      Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

      C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis G Celeux and M L Martin-Magniette SelvarClust: software for variable selection in model-based clustering http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html 2009b

      L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

      N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

      B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

      B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

      Y Nesterov Gradient methods for minimizing composite functions preprint 2007

      S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

      B Ng and R Abugharbieh Generalized group sparse classifiers with application in fMRIbrain decoding In Computer Vision and Pattern Recognition (CVPR) 2011 IEEEConference on pages 1065ndash1071 IEEE 2011

      M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

      M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

      W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

      W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

      K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

      S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


      Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

      A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

      C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

      S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

      V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

      V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

      V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

      C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

      L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

      Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

      A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

      S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

      P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

M Szafranski Penalites Hierarchiques pour l'Integration de Connaissances dans les Modeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


      M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

      R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

      J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

      S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

      D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

      D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

      D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

      M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

      MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

      T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

      B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

      B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

      C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

      J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


      M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

      P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

      P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

      H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

      H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

      H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005


                                                                                                • Tested Clustering Algorithms
                                                                                                • Results
                                                                                                • Discussion
                                                                                                    • Conclusions
                                                                                                    • Appendix
                                                                                                      • Matrix Properties
                                                                                                      • The Penalized-OS Problem is an Eigenvector Problem
                                                                                                        • How to Solve the Eigenvector Decomposition
                                                                                                        • Why the OS Problem is Solved as an Eigenvector Problem
                                                                                                          • Solving Fishers Discriminant Problem
                                                                                                          • Alternative Variational Formulation for the Group-Lasso
                                                                                                            • Useful Properties
                                                                                                            • An Upper Bound on the Objective Function
                                                                                                              • Invariance of the Group-Lasso to Unitary Transformations
                                                                                                              • Expected Complete Likelihood and Likelihood
                                                                                                              • Derivation of the M-Step Equations
                                                                                                                • Prior probabilities
                                                                                                                • Means
                                                                                                                • Covariance Matrix
                                                                                                                    • Bibliography


Contents

List of Figures
List of Tables
Notation and Symbols

I  Context and Foundations

1  Context

2  Regularization for Feature Selection
   2.1  Motivations
   2.2  Categorization of Feature Selection Techniques
   2.3  Regularization
        2.3.1  Important Properties
        2.3.2  Pure Penalties
        2.3.3  Hybrid Penalties
        2.3.4  Mixed Penalties
        2.3.5  Sparsity Considerations
        2.3.6  Optimization Tools for Regularized Problems

II  Sparse Linear Discriminant Analysis

Abstract

3  Feature Selection in Fisher Discriminant Analysis
   3.1  Fisher Discriminant Analysis
   3.2  Feature Selection in LDA Problems
        3.2.1  Inertia Based
        3.2.2  Regression Based

4  Formalizing the Objective
   4.1  From Optimal Scoring to Linear Discriminant Analysis
        4.1.1  Penalized Optimal Scoring Problem
        4.1.2  Penalized Canonical Correlation Analysis
        4.1.3  Penalized Linear Discriminant Analysis
        4.1.4  Summary
   4.2  Practicalities
        4.2.1  Solution of the Penalized Optimal Scoring Regression
        4.2.2  Distance Evaluation
        4.2.3  Posterior Probability Evaluation
        4.2.4  Graphical Representation
   4.3  From Sparse Optimal Scoring to Sparse LDA
        4.3.1  A Quadratic Variational Form
        4.3.2  Group-Lasso OS as Penalized LDA

5  GLOSS Algorithm
   5.1  Regression Coefficients Updates
        5.1.1  Cholesky decomposition
        5.1.2  Numerical Stability
   5.2  Score Matrix
   5.3  Optimality Conditions
   5.4  Active and Inactive Sets
   5.5  Penalty Parameter
   5.6  Options and Variants
        5.6.1  Scaling Variables
        5.6.2  Sparse Variant
        5.6.3  Diagonal Variant
        5.6.4  Elastic net and Structured Variant

6  Experimental Results
   6.1  Normalization
   6.2  Decision Thresholds
   6.3  Simulated Data
   6.4  Gene Expression Data
   6.5  Correlated Data

Discussion

III  Sparse Clustering Analysis

Abstract

7  Feature Selection in Mixture Models
   7.1  Mixture Models
        7.1.1  Model
        7.1.2  Parameter Estimation: The EM Algorithm
   7.2  Feature Selection in Model-Based Clustering
        7.2.1  Based on Penalized Likelihood
        7.2.2  Based on Model Variants
        7.2.3  Based on Model Selection

8  Theoretical Foundations
   8.1  Resolving EM with Optimal Scoring
        8.1.1  Relationship Between the M-Step and Linear Discriminant Analysis
        8.1.2  Relationship Between Optimal Scoring and Linear Discriminant Analysis
        8.1.3  Clustering Using Penalized Optimal Scoring
        8.1.4  From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
   8.2  Optimized Criterion
        8.2.1  A Bayesian Derivation
        8.2.2  Maximum a Posteriori Estimator

9  Mix-GLOSS Algorithm
   9.1  Mix-GLOSS
        9.1.1  Outer Loop: Whole Algorithm Repetitions
        9.1.2  Penalty Parameter Loop
        9.1.3  Inner Loop: EM Algorithm
   9.2  Model Selection

10  Experimental Results
   10.1  Tested Clustering Algorithms
   10.2  Results
   10.3  Discussion

Conclusions

Appendix

A  Matrix Properties

B  The Penalized-OS Problem is an Eigenvector Problem
   B.1  How to Solve the Eigenvector Decomposition
   B.2  Why the OS Problem is Solved as an Eigenvector Problem

C  Solving Fisher's Discriminant Problem

D  Alternative Variational Formulation for the Group-Lasso
   D.1  Useful Properties
   D.2  An Upper Bound on the Objective Function

E  Invariance of the Group-Lasso to Unitary Transformations

F  Expected Complete Likelihood and Likelihood

G  Derivation of the M-Step Equations
   G.1  Prior probabilities
   G.2  Means
   G.3  Covariance Matrix

Bibliography

List of Figures

1.1  MASH project logo
2.1  Example of relevant features
2.2  Four key steps of feature selection
2.3  Admissible sets in two dimensions for different pure norms ‖β‖p
2.4  Two dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties
2.5  Admissible sets for the Lasso and Group-Lasso
2.6  Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1  Graphical representation of the variational approach to Group-Lasso
5.1  GLOSS block diagram
5.2  Graph and Laplacian matrix for a 3×3 image
6.1  TPR versus FPR for all simulations
6.2  2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3  USPS digits "1" and "0"
6.4  Discriminant direction between digits "1" and "0"
6.5  Sparse discriminant direction between digits "1" and "0"
9.1  Mix-GLOSS loops scheme
9.2  Mix-GLOSS model selection diagram
10.1  Class mean vectors for each artificial simulation
10.2  TPR versus FPR for all simulations

List of Tables

6.1  Experimental results for simulated data, supervised classification
6.2  Average TPR and FPR for all simulations
6.3  Experimental results for gene expression data, supervised classification
10.1  Experimental results for simulated data, unsupervised clustering
10.2  Average TPR versus FPR for all clustering simulations

Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets
N  the set of natural numbers, N = {1, 2, . . .}
R  the set of reals
|A|  cardinality of a set A (for finite sets, the number of elements)
Ā  complement of set A

Data
X  input domain
xi  input sample, xi ∈ X
X  design matrix, X = (x1⊤, . . . , xn⊤)⊤
xj  column j of X
yi  class indicator of sample i
Y  indicator matrix, Y = (y1⊤, . . . , yn⊤)⊤
z  complete data, z = (x, y)
Gk  set of the indices of observations belonging to class k
n  number of examples
K  number of classes
p  dimension of X
i, j, k  indices running over N

Vectors, Matrices and Norms
0  vector with all entries equal to zero
1  vector with all entries equal to one
I  identity matrix
A⊤  transpose of matrix A (ditto for vectors)
A−1  inverse of matrix A
tr(A)  trace of matrix A
|A|  determinant of matrix A
diag(v)  diagonal matrix with v on the diagonal
‖v‖1  L1 norm of vector v
‖v‖2  L2 norm of vector v
‖A‖F  Frobenius norm of matrix A

Probability
E[·]  expectation of a random variable
var[·]  variance of a random variable
N(µ, σ²)  normal distribution with mean µ and variance σ²
W(W, ν)  Wishart distribution with ν degrees of freedom and scale matrix W
H(X)  entropy of random variable X
I(X; Y)  mutual information between random variables X and Y

Mixture Models
yik  hard membership of sample i to cluster k
fk  distribution function for cluster k
tik  posterior probability of sample i to belong to cluster k
T  posterior probability matrix
πk  prior probability or mixture proportion for cluster k
µk  mean vector of cluster k
Σk  covariance matrix of cluster k
θk  parameter vector for cluster k, θk = (µk, Σk)
θ(t)  parameter vector at iteration t of the EM algorithm
f(X; θ)  likelihood function
L(θ; X)  log-likelihood function
LC(θ; X, Y)  complete log-likelihood function

Optimization
J(·)  cost function
L(·)  Lagrangian
β̂  generic notation for the solution with respect to β
βls  least squares solution coefficient vector
A  active set
γ  step size to update the regularization path
h  direction to update the regularization path

Penalized Models
λ, λ1, λ2  penalty parameters
Pλ(θ)  penalty term over a generic parameter vector
βkj  coefficient j of discriminant vector k
βk  kth discriminant vector, βk = (βk1, . . . , βkp)
B  matrix of discriminant vectors, B = (β1, . . . , βK−1)
βj  jth row of B, B = (β1⊤, . . . , βp⊤)⊤
BLDA  coefficient matrix in the LDA domain
BCCA  coefficient matrix in the CCA domain
BOS  coefficient matrix in the OS domain
XLDA  data matrix in the LDA domain
XCCA  data matrix in the CCA domain
XOS  data matrix in the OS domain
θk  score vector k
Θ  score matrix, Θ = (θ1, . . . , θK−1)
Y  label matrix
Ω  penalty matrix
LCP(θ; X, Z)  penalized complete log-likelihood function
ΣB  between-class covariance matrix
ΣW  within-class covariance matrix
ΣT  total covariance matrix
Σ̂B  sample between-class covariance matrix
Σ̂W  sample within-class covariance matrix
Σ̂T  sample total covariance matrix
Λ  inverse of the covariance matrix, or precision matrix
wj  weights
τj  penalty components of the variational approach

        Part I

        Context and Foundations


This thesis is divided into three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here to introduce the models and some basic concepts that will be used throughout this document, and the state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared with other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


        1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the research point of view, the members of the consortium must deal with four main goals:

1. Software development of the website framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it were randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm and the clusters are built by maximum a posteriori estimation. For the calculations we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractor space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011). A minimal sketch of the RV computation is given below.
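The following Python sketch (assuming numpy is available) illustrates how an RV coefficient between two extractor tables might be computed; the function name and the toy data are illustrative and are not the MASH platform code.

    import numpy as np

    def rv_coefficient(Xi, Xj):
        """RV coefficient between two tables observed on the same n samples.

        Each table is summarized by its n x n cross-product operator;
        the RV coefficient is then a cosine between those two operators.
        """
        Xi = Xi - Xi.mean(axis=0)          # center each table column-wise
        Xj = Xj - Xj.mean(axis=0)
        Oi = Xi @ Xi.T                     # n x n configuration operators
        Oj = Xj @ Xj.T
        return np.trace(Oi @ Oj) / (np.linalg.norm(Oi) * np.linalg.norm(Oj))

    # Toy usage: two extractors computing features for the same 50 samples
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 10))
    B = np.hstack([A[:, :5], rng.normal(size=(50, 3))])   # partially redundant extractor
    print(rv_coefficient(A, B))   # closer to 1 when the tables carry similar information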

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to meet our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).

        2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There has been a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and eases interpretation in the unsupervised framework. Removing features must be done wisely to avoid discarding critical information.

When talking about dimensionality reduction, there are two families of techniques that could be confused:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. A problem arises when there is a restriction on the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

        Redundant features are those which provide distinguishing information but are cu-mulative to another feature or group of features that provide substantially the same dis-tinguishing information Using previous example consider illustrative ldquodietrdquo and ldquodo-mesticationrdquo features Dogs and cats both have similar carnivorous diets while squirrelsconsume nuts and so forth Thus the ldquodietrdquo feature can efficiently distinguish squirrelsfrom dogs and cats although it provides little information to distinguish between dogsand cats Dogs and cats are also both typically domesticated animals while squirrels arewild animals Thus the ldquodomesticationrdquo feature provides substantially the same infor-mation as the ldquodietrdquo feature namely distinguishing squirrels from dogs and cats but notdistinguishing between dogs and cats Thus the ldquodietrdquo and ldquodomesticationrdquo features arecumulative and one can identify one of these features as redundant so as to be filteredout However unlike irrelevant features care should be taken with redundant featuresto ensure that one retains enough of the redundant features to provide the relevant dis-tinguishing information In the foregoing example on may wish to filter out either the

        10

        22 Categorization of Feature Selection Techniques

        Figure 22 The four key steps of feature selection according to Liu and Yu (2005)

        ldquodietrdquo feature or the ldquodomesticationrdquo feature but if one removes both the ldquodietrdquo and theldquodomesticationrdquo features then useful distinguishing information is lost

        There are some tricks to build robust estimators when the number of features exceedsthe number of samples Ignoring some of the dependencies among variables and replacingthe covariance matrix by a diagonal approximation are two of them Another populartechnique and the one chosen in this thesis is imposing regularity conditions

        22 Categorization of Feature Selection Techniques

        Feature selection is one of the most frequent techniques in preprocessing data in orderto remove irrelevant redundant or noisy features Nevertheless the risk of removingsome informative dimensions is always there thus the relevance of the remaining subsetof features must be measured

        I am reproducing here the scheme that generalizes any feature selection process as itis shown by Liu and Yu (2005) Figure 22 provides a very intuitive scheme with thefour key steps in a feature selection algorithm

        The classification of those algorithms can respond to different criteria Guyon andElisseeff (2003) propose a check list that summarizes the steps that may be taken tosolve a feature selection problem guiding the user through several techniques Liu andYu (2005) propose a framework that integrates supervised and unsupervised featureselection algorithms through a categorizing framework Both references are excellentreviews to characterize feature selection techniques according to their characteristicsI am proposing a framework inspired by these references that does not cover all thepossibilities but which gives a good summary about existing possibilities

• Depending on the type of integration with the machine learning algorithm, we have:

  – Filter Models: Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance from the mining algorithm.

  – Wrapper Models: Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

  – Embedded Models: They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process for a given mining algorithm.

• Depending on the feature searching technique:

  – Complete: No subsets are missed from evaluation; involves combinatorial searches.

  – Sequential: Features are added (forward searches) or removed (backward searches) one at a time.

  – Random: The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

  – Distance Measures: Choosing the features that maximize the difference in separability, divergence or discrimination measures.

  – Information Measures: Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

  – Dependency Measures: Measuring the correlation between features.

  – Consistency Measures: Finding a minimum number of features that separate classes as consistently as the full set of features can.

  – Predictive Accuracy: Using the selected features to predict the labels.

  – Cluster Goodness: Using the selected features to perform clustering and evaluating the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues on ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_{\beta} J(\beta) + \lambda P(\beta) \quad (2.1)

\min_{\beta} J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \quad (2.2)

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \quad (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties

Regularizing a linear model with a norm like ‖β‖p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum βls is outside the penalties' admissible regions. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 norm results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, since they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖0 = card{βj : βj ≠ 0}:

\min_{\beta} J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \quad (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
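As an illustration of this combinatorial nature, the sketch below (hypothetical Python code, assuming numpy and a squared loss for J) solves the constraint in (2.4) by exhaustive enumeration of subsets, which is only feasible for very small p.

    import numpy as np
    from itertools import combinations

    def best_subset(X, y, t):
        """Exhaustive search for the L0-constrained least squares problem (2.4).

        Every subset of at most t features is tried, which is only tractable
        for small p and illustrates why L0 penalties lead to combinatorial schemes.
        """
        n, p = X.shape
        best_rss, best_beta = np.inf, np.zeros(p)
        for k in range(t + 1):
            for subset in combinations(range(p), k):
                beta = np.zeros(p)
                if subset:
                    S = list(subset)
                    beta[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
                rss = np.sum((y - X @ beta) ** 2)
                if rss < best_rss:
                    best_rss, best_beta = rss, beta
        return best_beta

    # Usage on a toy problem: at most 2 non-zero coefficients are allowed
    # beta_hat = best_subset(X, y, t=2)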

L1 Penalties. Penalties built using the L1 norm induce sparsity and stability. The resulting estimator was named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_{\beta} J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \quad (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not simply a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).
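The sparsity induced by (2.5) and the saturation effect when n < p can be observed with a few lines of Python; the snippet below is an illustrative sketch assuming scikit-learn is available (its parameter alpha plays the role of λ in the equivalent formulation (2.1)).

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 30, 100                       # fewer samples than features
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 2.0                  # only 5 informative variables
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    lasso = Lasso(alpha=0.1).fit(X, y)
    print("non-zero coefficients:", np.sum(lasso.coef_ != 0))
    # With n < p, at most n coefficients can be non-zero, whatever alpha is.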

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus an L2 penalized optimization problem looks like

\min_{\beta} J(\beta) + \lambda \|\beta\|_2^2 \quad (2.6)

The effect of this penalty is the "equalization" of the components of the parameter vector that is being penalized. To highlight this property, let us consider a least squares problem

\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \quad (2.7)

with solution βls = (X⊤X)−1X⊤y. If some input variables are highly correlated, the estimator βls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

The solution to this problem is βl2 = (X⊤X + λIp)−1X⊤y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
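A minimal numerical check of this eigenvalue shift, assuming numpy, could look like the following sketch; the nearly collinear design is artificial.

    import numpy as np

    def ridge(X, y, lam):
        """Closed-form ridge estimator (X'X + lambda*I)^(-1) X'y."""
        n, p = X.shape
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))
    X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=50)    # two nearly collinear columns
    y = rng.normal(size=50)

    print(np.linalg.eigvalsh(X.T @ X)[:2])                      # smallest eigenvalues, near zero
    print(np.linalg.eigvalsh(X.T @ X + 1.0 * np.eye(5))[:2])    # shifted upwards by lambda
    print(ridge(X, y, lam=1.0))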

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrote, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \quad (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs for each component: every λj is optimized to penalize more or less depending on the influence of βj in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖∞ = max(|x1|, |x2|, . . . , |xp|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as

\|\beta\|^* = \max_{w \in \mathbb{R}^p} \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why the L∞ norm is so important, even if it is not as popular as a penalty itself as the L1 norm is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason to use pure penalties only in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \quad (2.9)

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
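The different behaviour of the two penalties on correlated variables can be sketched as follows; this is an illustration assuming scikit-learn, whose ElasticNet parameterizes λ1 and λ2 through alpha and l1_ratio rather than directly.

    import numpy as np
    from sklearn.linear_model import ElasticNet, Lasso

    rng = np.random.default_rng(0)
    n, p = 40, 200
    X = rng.normal(size=(n, p))
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # two strongly correlated relevant features
    y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

    # The Lasso tends to keep only one of the two correlated variables...
    print(np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_))
    # ...while the elastic net (L1 + L2) tends to keep both.
    print(np.flatnonzero(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_))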


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by Gℓ the group of genes for the ℓ-th process and by dℓ the number of genes (variables) in each group, for all ℓ ∈ {1, . . . , L}. Thus the dimension of vector β is the sum of the number of genes of every group, dim(β) = Σℓ=1..L dℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in \mathcal{G}_\ell} |\beta_j|^s \Big)^{r/s} \Bigg)^{1/r} \quad (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group Gℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.

Several combinations are available; the most popular is the norm ‖β‖(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L1,2 norm. Many other mixings are possible, such as ‖β‖(1,4/3) (Szafranski et al., 2008) or ‖β‖(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).
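In practice, the ‖β‖(1,2) penalty is often handled through its proximal operator, which shrinks each group norm as a block. The following is a minimal sketch, assuming numpy and a predefined list of group index sets; it is not the algorithm used later in this thesis.

    import numpy as np

    def group_soft_threshold(beta, groups, threshold):
        """Proximal operator of the (1,2) mixed norm: each group is shrunk as a block.

        `groups` is a list of index arrays; a whole group is set to zero when its
        L2 norm falls below `threshold`, which is how the group-Lasso removes
        variables group-wise rather than coordinate-wise.
        """
        out = beta.copy()
        for g in groups:
            norm = np.linalg.norm(beta[g])
            out[g] = 0.0 if norm <= threshold else (1 - threshold / norm) * beta[g]
        return out

    beta = np.array([0.5, -0.2, 3.0, 2.5, 0.1, 0.05])
    groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
    print(group_soft_threshold(beta, groups, threshold=0.6))
    # the first and last groups vanish entirely; the second is shrunk but kept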

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L1,2 or L1,∞ mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso and the Group-Lasso ((a) L1, Lasso; (b) L(1,2), group-Lasso)

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters ((a) L1-induced sparsity; (b) L(1,2) group-induced sparsity)
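The contrast between the two panels of Figure 2.6 can be mimicked numerically: entrywise soft-thresholding scatters zeros anywhere in the coefficient matrix, whereas thresholding row norms keeps or removes a variable as a whole. The snippet below is an illustrative numpy sketch with made-up coefficients.

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(8, 4))          # 8 variables (rows) x 4 parameters (columns)
    B[[2, 4, 7], :] *= 0.1               # three weak, uninformative variables
    lam = 0.5

    # Entrywise (Lasso-like) soft-thresholding: zeros may appear anywhere in the matrix
    B_l1 = np.sign(B) * np.maximum(np.abs(B) - lam, 0.0)

    # Row-wise (group-Lasso-like) thresholding: each variable is kept or dropped as a block
    row_norms = np.linalg.norm(B, axis=1, keepdims=True)
    B_group = np.maximum(1 - lam / np.maximum(row_norms, 1e-12), 0.0) * B

    print("rows zeroed by the entrywise rule:", np.flatnonzero(np.all(B_l1 == 0, axis=1)))
    print("rows zeroed by the row-wise rule :", np.flatnonzero(np.all(B_group == 0, axis=1)))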

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints" implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used in the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\ s' \in \partial P(\beta^{(t)})
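A bare-bones version of this update for the Lasso (quadratic loss plus L1 penalty) is sketched below; the step size, the iteration count and the use of np.sign as a subgradient of the L1 norm are arbitrary illustrative choices.

    import numpy as np

    def lasso_subgradient_descent(X, y, lam, step=1e-3, n_iter=5000):
        """Minimal sketch of subgradient descent on J(beta) + lam * ||beta||_1,
        with J the residual sum of squares. Slow, and iterates are not exactly sparse."""
        n, p = X.shape
        beta = np.zeros(p)
        for t in range(n_iter):
            grad_J = 2 * X.T @ (X @ beta - y)    # gradient of the quadratic loss
            sub_P = np.sign(beta)                # a subgradient of the L1 norm
            beta = beta - step * (grad_J + lam * sub_P)
        return beta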

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient βj gives

\beta_j = \frac{-\lambda\,\operatorname{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2}

In the literature, those algorithms are also referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution βls and updating the values with an iterative thresholding algorithm where βj(t+1) = Sλ(∂J(β(t))/∂βj). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_{\lambda}\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\
0 & \text{if } |\partial J(\beta)/\partial \beta_j| \le \lambda
\end{cases}
\quad (2.11)
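For the squared loss, this update amounts to soft-thresholding a partial least squares solution for each coordinate in turn. The following sketch (assuming numpy) is a generic "shooting"-style implementation in the spirit of (2.11), not the exact code of Fu (1998).

    import numpy as np

    def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
        """Coordinate descent sketch for the squared loss with an L1 penalty.

        Each coefficient is updated in turn by soft-thresholding its partial
        least squares solution, all other coefficients being kept fixed.
        """
        n, p = X.shape
        beta = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0)
        for sweep in range(n_sweeps):
            for j in range(p):
                residual_j = y - X @ beta + X[:, j] * beta[j]   # residual ignoring feature j
                rho = X[:, j] @ residual_j
                # soft-thresholding step for coordinate j
                beta[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq[j]
        return beta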

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set". This subset stores the indices of the variables with non-zero βj. It is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā. In the inactive set we can find the indices of the variables whose βj is zero. Thus the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1-regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions and L1,2 penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006; Smola et al. 2008; Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A^(t) and its corresponding solution β^(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β^(t+1) = β^(t) + γh. Afterwards, the active and inactive sets A^(t+1) and Ā^(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and the variable that should enter the active set, from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

    min_{β∈R^p}  J(β^(t)) + ∇J(β^(t))^T (β − β^(t)) + λ P(β) + (L/2) ‖β − β^(t)‖²_2 .        (2.12)

They are iterative methods where the cost function J(β) is linearized in the proximity of the current solution β^(t), so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

    min_{β∈R^p}  (1/2) ‖ β − ( β^(t) − (1/L) ∇J(β^(t)) ) ‖²_2 + (λ/L) P(β) .        (2.13)

The basic algorithm uses the solution to (2.13) as the next value β^(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
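The following Python/NumPy sketch illustrates the basic proximal update on the Lasso, where the proximal step reduces to soft-thresholding; it is an assumed minimal implementation of this scheme (often called ISTA), not FISTA nor any code referenced in this chapter.

    import numpy as np

    def ista_lasso(X, y, lam, n_iter=500):
        """Proximal gradient for 0.5 * ||y - X b||_2^2 + lam * ||b||_1."""
        n, p = X.shape
        L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
        beta = np.zeros(p)
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - y)        # gradient of the smooth part
            z = beta - grad / L                # plain gradient step
            beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
        return beta

Setting lam = 0 in this sketch turns each iteration into a plain gradient step, mirroring the remark above.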


        Part II

        Sparse Linear Discriminant Analysis


        Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and label y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n×K matrix Y = (y_1^T, ..., y_n^T)^T.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

    max_{β∈R^p}  ( β^T Σ_B β ) / ( β^T Σ_W β ) ,        (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as

    Σ_W = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (x_i − μ_k)(x_i − μ_k)^T ,
    Σ_B = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (μ − μ_k)(μ − μ_k)^T ,

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.
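As an illustration, the following Python/NumPy sketch computes Σ_W and Σ_B from a data matrix and integer class labels; the function and variable names are hypothetical.

    import numpy as np

    def scatter_matrices(X, y, K):
        """Within-class and between-class covariance matrices as defined above.

        X: (n, p) data matrix; y: (n,) integer labels in {0, ..., K-1}.
        """
        n, p = X.shape
        mu = X.mean(axis=0)
        Sigma_W = np.zeros((p, p))
        Sigma_B = np.zeros((p, p))
        for k in range(K):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            Sigma_W += (Xk - mu_k).T @ (Xk - mu_k)
            Sigma_B += len(Xk) * np.outer(mu - mu_k, mu - mu_k)
        return Sigma_W / n, Sigma_B / n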


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

    max_{B∈R^{p×(K−1)}}  tr( B^T Σ_B B ) / tr( B^T Σ_W B ) ,        (3.2)

where the matrix B is built with the discriminant directions β_k as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

    max_{β_k∈R^p}  β_k^T Σ_B β_k
    s.t.  β_k^T Σ_W β_k ≤ 1 ,
          β_k^T Σ_W β_ℓ = 0 , ∀ℓ < k .        (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W^{-1} Σ_B associated with the kth largest eigenvalue (see Appendix C).
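A minimal sketch of this eigen-solution, assuming the scatter matrices computed above are available and Σ_W is invertible (a small ridge may be added otherwise): scipy.linalg.eigh solves the generalized eigenproblem Σ_B v = λ Σ_W v and returns Σ_W-orthonormal eigenvectors, which matches the constraints of (3.3).

    import numpy as np
    from scipy.linalg import eigh

    def fisher_directions(Sigma_B, Sigma_W, n_directions):
        """Leading eigenvectors of Sigma_W^{-1} Sigma_B via a generalized eigenproblem."""
        eigvals, eigvecs = eigh(Sigma_B, Sigma_W)        # ascending eigenvalues
        order = np.argsort(eigvals)[::-1][:n_directions]
        return eigvecs[:, order], eigvals[order]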

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main target of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or regression-based formulations.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

    min_{β∈R^p}  β^T Σ_W β
    s.t.  (μ_1 − μ_2)^T β = 1 ,
          Σ_{j=1}^p |β_j| ≤ t ,

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

    max_{β_k∈R^p}  β_k^T Σ_B^k β_k − P_k(β_k)
    s.t.  β_k^T Σ_W β_k ≤ 1 .

The term to maximize is the projected between-class covariance β_k^T Σ_B^k β_k, subject to an upper bound on the projected within-class covariance β_k^T Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

    min_{β∈R^p}  ‖β‖_1
    s.t.  ‖Σβ − (μ_1 − μ_2)‖_∞ ≤ λ .

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al. 2000; Friedman et al. 2009).

        Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n×K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al. 2009). Another popular choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K − 1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

    min_{β∈R^p, β_0∈R}  n^{-1} Σ_{i=1}^n (y_i − β_0 − x_i^T β)² + λ Σ_{j=1}^p |β_j| ,

where y_i is the binary indicator of the label for pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^T β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting from λ = 0, but a different intercept β_0 is required.

        Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

    min_{Θ, B}  ‖YΘ − XB‖²_F + λ tr( B^T Ω B )        (3.4a)
    s.t.  n^{-1} Θ^T Y^T Y Θ = I_{K−1} ,        (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the ordering that arises naturally when considering the following series of K − 1 problems:

    min_{θ_k∈R^K, β_k∈R^p}  ‖Yθ_k − Xβ_k‖² + β_k^T Ω β_k        (3.5a)
    s.t.  n^{-1} θ_k^T Y^T Y θ_k = 1 ,        (3.5b)
          θ_k^T Y^T Y θ_ℓ = 0 , ℓ = 1, ..., k − 1 ,        (3.5c)

where each β_k corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005; Leng 2008; Grosenick et al. 2008; Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

    min_{β_k∈R^p, θ_k∈R^K}  Σ_k ‖Yθ_k − Xβ_k‖²_2 + λ_1 ‖β_k‖_1 + λ_2 β_k^T Ω β_k ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

    min_{β_k∈R^p, θ_k∈R^K}  Σ_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖²_2 + λ Σ_{j=1}^p √( Σ_{k=1}^{K−1} β_kj² ) ,        (3.6)

which is the criterion that was chosen in this thesis. The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.
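To illustrate why the penalty in (3.6) removes the same features from all discriminant directions, the short sketch below applies the proximal operator of a row-wise group-Lasso penalty to a small coefficient matrix; this is a generic illustration of the penalty's effect, written in Python/NumPy with hypothetical names, not Leng's algorithm nor the GLOSS solver.

    import numpy as np

    def group_soft_threshold(B, t):
        """Proximal operator of t * sum_j ||B[j, :]||_2: shrinks each row of B,
        zeroing it entirely when its Euclidean norm falls below t."""
        norms = np.linalg.norm(B, axis=1, keepdims=True)
        scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
        return scale * B

    # A row whose norm falls below the threshold is removed from every column at
    # once, so the selected features are common to all discriminant directions.
    B = np.array([[0.1, -0.05], [2.0, 1.0], [0.0, 0.3]])
    print(group_soft_threshold(B, 0.5))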


        4 Formalizing the Objective

In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K − 1, or partial for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984; Hastie et al. 1994; Hastie and Tibshirani 1996; Hastie et al. 1995) and already used before for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix Y^T Y is full rank;

• inputs are centered, that is, X^T 1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^T X + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus

    min_{θ∈R^K, β∈R^p}  ‖Yθ − Xβ‖² + β^T Ω β        (4.1a)
    s.t.  n^{-1} θ^T Y^T Y θ = 1 .        (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

    β_os = ( X^T X + Ω )^{-1} X^T Y θ .        (4.2)

The objective function (4.1a) is then

    ‖Yθ − Xβ_os‖² + β_os^T Ω β_os = θ^T Y^T Y θ − 2 θ^T Y^T X β_os + β_os^T ( X^T X + Ω ) β_os
                                  = θ^T Y^T Y θ − θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ ,

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

    max_{θ : n^{-1} θ^T Y^T Y θ = 1}  θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ ,        (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^T X ( X^T X + Ω )^{-1} X^T Y. Indeed, Appendix C details that Problem (4.3) is solved by

    ( Y^T Y )^{-1} Y^T X ( X^T X + Ω )^{-1} X^T Y θ = α² θ ,        (4.4)


where α² is the maximal eigenvalue¹:

    n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ = α² n^{-1} θ^T ( Y^T Y ) θ
    n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ = α² .        (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

    max_{θ∈R^K, β∈R^p}  n^{-1} θ^T Y^T X β        (4.6a)
    s.t.  n^{-1} θ^T Y^T Y θ = 1 ,        (4.6b)
          n^{-1} β^T ( X^T X + Ω ) β = 1 .        (4.6c)

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

    n L(β, θ, ν, γ) = θ^T Y^T X β − ν ( θ^T Y^T Y θ − n ) − γ ( β^T ( X^T X + Ω ) β − n )

    ⇒ n ∂L(β, θ, γ, ν)/∂β = X^T Y θ − 2γ ( X^T X + Ω ) β
    ⇒ β_cca = (1/(2γ)) ( X^T X + Ω )^{-1} X^T Y θ .

Then, as β_cca obeys (4.6c), we obtain

    β_cca = ( X^T X + Ω )^{-1} X^T Y θ / √( n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ ) ,        (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

    n^{-1} θ^T Y^T X β_cca = n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ / √( n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ )
                           = √( n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ ) ,

        and the optimization problem with respect to θ can be restated as

    max_{θ : n^{-1} θ^T Y^T Y θ = 1}  θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ .        (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

    β_os = α β_cca ,        (4.9)

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).


where α is defined by (4.5). The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

    n ∂L(β, θ, γ, ν)/∂θ = Y^T X β − 2ν Y^T Y θ
    ⇒ θ_cca = (1/(2ν)) ( Y^T Y )^{-1} Y^T X β .        (4.10)

Then, as θ_cca obeys (4.6b), we obtain

    θ_cca = ( Y^T Y )^{-1} Y^T X β / √( n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β ) ,        (4.11)

        leading to the following expression of the optimal objective function

    n^{-1} θ_cca^T Y^T X β = n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β / √( n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β )
                           = √( n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β ) .

The p-CCA problem can thus be solved with respect to β by plugging this value into (4.6):

    max_{β∈R^p}  n^{-1} β^T X^T Y ( Y^T Y )^{-1} Y^T X β        (4.12a)
    s.t.  n^{-1} β^T ( X^T X + Ω ) β = 1 ,        (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

    n^{-1} X^T Y ( Y^T Y )^{-1} Y^T X β_cca = λ ( X^T X + Ω ) β_cca ,        (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

    n^{-1} β_cca^T X^T Y ( Y^T Y )^{-1} Y^T X β_cca = λ
    ⇒ n^{-1} α^{-1} β_cca^T X^T Y ( Y^T Y )^{-1} Y^T X ( X^T X + Ω )^{-1} X^T Y θ = λ
    ⇒ n^{-1} α β_cca^T X^T Y θ = λ
    ⇒ n^{-1} θ^T Y^T X ( X^T X + Ω )^{-1} X^T Y θ = λ
    ⇒ α² = λ .

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), whose denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

    max_{β∈R^p}  β^T Σ_B β        (4.14a)
    s.t.  β^T ( Σ_W + n^{-1} Ω ) β = 1 ,        (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y ( Y^T Y )^{-1} Y^T:

    Σ_T = (1/n) Σ_{i=1}^n x_i x_i^T = n^{-1} X^T X ,
    Σ_B = (1/n) Σ_{k=1}^K n_k μ_k μ_k^T = n^{-1} X^T Y ( Y^T Y )^{-1} Y^T X ,
    Σ_W = (1/n) Σ_{k=1}^K Σ_{i : y_ik = 1} (x_i − μ_k)(x_i − μ_k)^T = n^{-1} ( X^T X − X^T Y ( Y^T Y )^{-1} Y^T X ) .

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

    X^T Y ( Y^T Y )^{-1} Y^T X β_lda = λ ( X^T X + Ω − X^T Y ( Y^T Y )^{-1} Y^T X ) β_lda
    X^T Y ( Y^T Y )^{-1} Y^T X β_lda = ( λ / (1 − λ) ) ( X^T X + Ω ) β_lda .

The comparison of the last equation with the one verified by β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it follows that

    β_lda = (1 − α²)^{-1/2} β_cca
          = α^{-1} (1 − α²)^{-1/2} β_os ,

        which ends the path from p-OS to p-LDA


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

    min_{Θ, B}  ‖YΘ − XB‖²_F + λ tr( B^T Ω B )
    s.t.  n^{-1} Θ^T Y^T Y Θ = I_{K−1} .

Let A represent the (K − 1)×(K − 1) diagonal matrix whose elements α_k are the square roots of the K − 1 leading eigenvalues of Y^T X ( X^T X + Ω )^{-1} X^T Y; we have

    B_LDA = B_CCA ( I_{K−1} − A² )^{-1/2}
          = B_OS A^{-1} ( I_{K−1} − A² )^{-1/2} ,        (4.15)

where I_{K−1} is the (K − 1)×(K − 1) identity matrix. At this point, the feature matrix X, which in the input space has dimensions n×p, can be projected into the optimal scoring domain as an n×(K − 1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n×(K − 1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows (a schematic implementation is sketched after the list):

1. Solve the p-OS problem as B_OS = ( X^T X + λΩ )^{-1} X^T Y Θ, where Θ are the K − 1 leading eigenvectors of Y^T X ( X^T X + λΩ )^{-1} X^T Y.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A^{-1} ( I_{K−1} − A² )^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Produce a graphical representation if desired.
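The sketch below strings these steps together in Python/NumPy for a fixed quadratic penalty Ω (the identity by default); it is a schematic illustration of the summary above, not the GLOSS code (which is written in Matlab and uses the adaptive group-Lasso penalty), and all names are hypothetical.

    import numpy as np
    from scipy.linalg import eigh

    def pos_lda_classify(X, y, K, lam, Omega=None):
        """Steps 1-5 above for a quadratic penalty Omega.

        X: centered (n, p) data matrix; y: integer labels in {0, ..., K-1}.
        """
        n, p = X.shape
        Y = np.eye(K)[y]                                  # n x K indicator matrix
        Omega = np.eye(p) if Omega is None else Omega
        XtX_pen = X.T @ X + lam * Omega
        XtY = X.T @ Y

        # Step 1: Theta = K-1 leading generalized eigenvectors, then B_OS.
        M = XtY.T @ np.linalg.solve(XtX_pen, XtY)         # Y'X (X'X + lam*Omega)^{-1} X'Y
        evals, evecs = eigh(M, Y.T @ Y)                   # ascending eigenvalues
        idx = np.argsort(evals)[::-1][:K - 1]
        alpha2 = evals[idx]                               # alpha_k^2, expected in (0, 1)
        Theta = np.sqrt(n) * evecs[:, idx]                # n^{-1} Theta' Y'Y Theta = I
        B_os = np.linalg.solve(XtX_pen, XtY @ Theta)

        # Step 2: project onto the LDA domain.
        D = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))
        X_lda = X @ B_os @ D

        # Steps 3-5: centroids, Euclidean distances corrected by priors, MAP rule.
        centroids = np.vstack([X_lda[y == k].mean(axis=0) for k in range(K)])
        log_priors = np.log(np.bincount(y, minlength=K) / n)
        dist = ((X_lda[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return np.argmin(dist - 2 * log_priors, axis=1), B_os, Theta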


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

    min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr( B^T Ω B )        (4.16a)
    s.t.  n^{-1} Θ^T Y^T Y Θ = I_{K−1} ,        (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n^{-1} Θ⁰ᵀ Y^T Y Θ⁰ = I_{K−1}.

2. Compute B = ( X^T X + λΩ )^{-1} X^T Y Θ⁰.

3. Set Θ to be the K − 1 leading eigenvectors of Y^T X ( X^T X + λΩ )^{-1} X^T Y.

4. Compute the optimal regression coefficients

    B_OS = ( X^T X + λΩ )^{-1} X^T Y Θ .        (4.17)

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰ᵀ Y^T X ( X^T X + λΩ )^{-1} X^T Y Θ⁰, which is computed as Θ⁰ᵀ Y^T X B, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form B^T Ω B. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the Nearest Centroid rule, where sample x_i is assigned to class k if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown and the rule is applied with the parameters estimated from training data (sample estimators μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

    d(x_i, μ_k) = (x_i − μ_k)^T Σ_WΩ^{-1} (x_i − μ_k) − 2 log( n_k / n )        (4.18)

is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment term for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009; Mai et al. 2012). The matrix Σ_WΩ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

    Σ_WΩ^{-1} = ( n^{-1} ( X^T X + λΩ ) − Σ_B )^{-1}
              = ( n^{-1} X^T X − Σ_B + n^{-1} λΩ )^{-1}
              = ( Σ_W + n^{-1} λΩ )^{-1} .        (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution B_OS of the p-OS problem is enough to accomplish classification;

• in the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension R < K − 1 by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, it is

    ‖(x_i − μ_k) B_OS‖²_{Σ_WΩ} − 2 log(π_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, it is

    ‖(x_i − μ_k) B_OS A^{-1} ( I_{K−1} − A² )^{-1/2}‖²_2 − 2 log(π_k) ,

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

    p(y_k = 1|x) ∝ exp( −d(x, μ_k)/2 )
                 ∝ π_k exp( −(1/2) ‖(x_i − μ_k) B_OS A^{-1} ( I_{K−1} − A² )^{-1/2}‖²_2 ) .        (4.20)

Those probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

    p(y_k = 1|x) = π_k exp( −d(x, μ_k)/2 ) / Σ_ℓ π_ℓ exp( −d(x, μ_ℓ)/2 )
                 = π_k exp( ( −d(x, μ_k) + d_max )/2 ) / Σ_ℓ π_ℓ exp( ( −d(x, μ_ℓ) + d_max )/2 ) ,

where d_max = max_k d(x, μ_k).
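A minimal Python/NumPy version of this normalization, applied to the vector of distances of a single sample; names are illustrative.

    import numpy as np

    def posteriors_from_distances(d, priors):
        """Class posteriors from distances d (length K) and priors (length K),
        using the shift by d_max described above to avoid underflow."""
        d = np.asarray(d, dtype=float)
        d_max = d.max()
        unnorm = priors * np.exp(0.5 * (d_max - d))   # pi_k * exp((-d_k + d_max) / 2)
        return unnorm / unnorm.sum()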

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^T Ω β, under the assumption that Y^T Y and X^T X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties, but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B^T Ω B.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998; Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:

    min_{τ∈R^p} min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖²_2 / τ_j        (4.21a)
    s.t.  Σ_j τ_j − Σ_j w_j ‖β^j‖_2 ≤ 0 ,        (4.21b)
          τ_j ≥ 0 ,  j = 1, ..., p ,        (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1T}, ..., β^{pT})^T, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression ‖YΘ − XB‖²_2; for now, for the sake of simplicity, we keep the generic notation J(B). Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999; Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables τ_j. That is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is

    L = J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖²_2 / τ_j + ν_0 ( Σ_{j=1}^p τ_j − Σ_{j=1}^p w_j ‖β^j‖_2 ) − Σ_{j=1}^p ν_j τ_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j are

    ∂L/∂τ_j (τ_j*) = 0  ⇔  −λ w_j² ‖β^j‖²_2 / τ_j*² + ν_0 − ν_j = 0
                        ⇔  −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² − ν_j τ_j*² = 0
                        ⇒  −λ w_j² ‖β^j‖²_2 + ν_0 τ_j*² = 0 .

The last line is obtained from complementary slackness, which implies here ν_j τ_j* = 0. Complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier for constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is

    τ_j* = √( λ w_j² ‖β^j‖²_2 / ν_0 ) = √(λ/ν_0) w_j ‖β^j‖_2 .        (4.22)

We note that ν_0 ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

    Σ_{j=1}^p τ_j − Σ_{j=1}^p w_j ‖β^j‖_2 = 0 ,        (4.23)

so that τ_j* = w_j ‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso formulation:

    min_{B∈R^{p×M}}  J(B) + λ Σ_{j=1}^p w_j ‖β^j‖_2 .        (4.24)

So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ B^T Ω B, where

    Ω = diag( w_1²/τ_1, w_2²/τ_2, ..., w_p²/τ_p ) ,        (4.25)

with τ_j = w_j ‖β^j‖_2, resulting in the Ω diagonal components

    (Ω)_jj = w_j / ‖β^j‖_2 .        (4.26)

And, as stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²_2/τ, known as the perspective function of f(β) = ‖β‖²_2, is convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

    { V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λ G } ,        (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1T}, ..., g^{pT})^T, defined as follows. Let S(B) denote the columnwise support of B, S(B) = { j ∈ {1, ..., p} : ‖β^j‖_2 ≠ 0 }; then we have

    ∀j ∈ S(B),  g^j = w_j ‖β^j‖_2^{-1} β^j ,        (4.28)
    ∀j ∉ S(B),  ‖g^j‖_2 ≤ w_j .        (4.29)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is

    ∂( λ Σ_{m=1}^p w_m ‖β^m‖_2 ) / ∂β^j = λ w_j β^j / ‖β^j‖_2 .        (4.30)

At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

    ∂_{β^j}( λ Σ_{m=1}^p w_m ‖β^m‖_2 ) = ∂_{β^j}( λ w_j ‖β^j‖_2 ) = { λ w_j v ∈ R^{K−1} : ‖v‖_2 ≤ 1 } .        (4.31)

That gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

    ∀j ∈ S,  ∂J(B)/∂β^j + λ w_j ‖β^j‖_2^{-1} β^j = 0 ,        (4.32a)
    ∀j ∉ S,  ‖∂J(B)/∂β^j‖_2 ≤ λ w_j ,        (4.32b)

where S ⊆ {1, ..., p} denotes the set of non-zero row vectors β^j and S̄ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

    B_OS = argmin_{B∈R^{p×(K−1)}} min_{Θ∈R^{K×(K−1)}}  (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖β^j‖_2
    s.t.  n^{-1} Θ^T Y^T Y Θ = I_{K−1}


is equivalent to the penalized LDA problem

    B_LDA = argmax_{B∈R^{p×(K−1)}}  tr( B^T Σ_B B )
    s.t.  B^T ( Σ_W + n^{-1} λ Ω ) B = I_{K−1} ,

where Ω = diag( w_1²/τ_1, ..., w_p²/τ_p ), with

    Ω_jj = +∞ if β^j_os = 0 ,  and  Ω_jj = w_j ‖β^j_os‖_2^{-1} otherwise .        (4.33)

That is, B_LDA = B_OS diag( α_k^{-1} (1 − α_k²)^{-1/2} ), where α_k ∈ (0, 1) is the kth leading eigenvalue of

    n^{-1} Y^T X ( X^T X + λΩ )^{-1} X^T Y .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008; Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note, however, that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr( B^T Ω B ).


        5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased/decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2) ‖YΘ − XB‖²_2.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K − 1) similar systems

    ( X_A^T X_A + λΩ ) β_k = X_A^T Y θ_k^0 ,        (5.1)


Figure 5.1: GLOSS block diagram (initialize the model (λ, B); solve the p-OS problem on the active set; exchange variables between the active and inactive sets according to the optimality conditions; on convergence, compute Θ, update B and stop).


Algorithm 1 Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, ..., p} : ‖β^j‖_2 > 0 }, Θ⁰ such that n^{-1} Θ⁰ᵀ Y^T Y Θ⁰ = I_{K−1}, convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
        B_A ← ( X_A^T X_A + λΩ )^{-1} X_A^T Y Θ⁰
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for all j ∈ A such that ‖β^j‖_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j}; go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in Ā
    ĵ ← argmax_{j∈Ā} ‖∂J/∂β^j‖_2
    if ‖∂J/∂β^ĵ‖_2 < λ then
        convergence ← true   (B is optimal)
    else
        A ← A ∪ {ĵ}
    end if
until convergence
(s, V) ← eigenanalyze( Θ⁰ᵀ Y^T X_A B_A ), that is, Θ⁰ᵀ Y^T X_A B_A V_k = s_k V_k, k = 1, ..., K − 1
Θ ← Θ⁰ V;  B ← B V;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, ..., K − 1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth columns of B and Θ⁰, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

    ( X^T X + λΩ ) B = X^T Y Θ .        (5.2)

Defining the Cholesky decomposition as C^T C = ( X^T X + λΩ ), (5.2) is solved efficiently as follows:

    C^T C B = X^T Y Θ
    C B = C^T \ X^T Y Θ
    B = C \ ( C^T \ X^T Y Θ ) ,        (5.3)

where the symbol "\" is the Matlab mldivide operator, which efficiently solves linear systems. The GLOSS code implements (5.3).
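A sketch of this computation in Python/SciPy, reusing a single Cholesky factorization for the K − 1 right-hand sides; the Matlab GLOSS code relies on mldivide instead, and the function name here is hypothetical.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def solve_penalized_systems(X_A, Y, Theta0, lam, Omega):
        """Solve (X_A' X_A + lam * Omega) B = X_A' Y Theta0 for all columns at once,
        reusing one Cholesky factorization as in (5.3)."""
        A = X_A.T @ X_A + lam * Omega       # same matrix for every right-hand side
        rhs = X_A.T @ (Y @ Theta0)          # |A| x (K-1) right-hand sides
        c, low = cho_factor(A)
        return cho_solve((c, low), rhs)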

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^T X + λΩ. This difficulty can be avoided using the following equivalent expression:

    B = Ω^{-1/2} ( Ω^{-1/2} X^T X Ω^{-1/2} + λ I )^{-1} Ω^{-1/2} X^T Y Θ⁰ ,        (5.4)

where the conditioning of Ω^{-1/2} X^T X Ω^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved for cases with large ω_j values. Our code is otherwise based on expression (5.2).

        52 Score Matrix

        The optimal score matrix Θ is made of the K minus 1 leading eigenvectors of

        YgtX(XgtX + Ω

        )minus1XgtY This eigen-analysis is actually solved in the form

        ΘgtYgtX(XgtX + Ω

        )minus1XgtYΘ (see Section 421 and Appendix B) The latter eigen-

        vector decomposition does not require the costly computation of(XgtX + Ω

        )minus1that

        52

        53 Optimality Conditions

        involves the inversion of an n times n matrix Let Θ0 be an arbitrary K times (K minus 1) ma-

        trix whose range includes the Kminus1 leading eigenvectors of YgtX(XgtX + Ω

        )minus1XgtY 1

        Then solving the Kminus1 systems (53) provides the value of B0 = (XgtX+λΩ)minus1XgtYΘ0This B0 matrix can be identified in the expression to eigenanalyze as

        Θ0gtYgtX(XgtX + Ω

        )minus1XgtYΘ0 = Θ0gtYgtXB0

        Thus the solution to penalized OS problem can be computed trough the singular

        value decomposition of the (K minus 1)times (K minus 1) matrix Θ0gtYgtXB0 = VΛVgt Defining

        Θ = Θ0V we have ΘgtYgtX(XgtX + Ω

        )minus1XgtYΘ = Λ and when Θ0 is chosen such

        that nminus1 Θ0gtYgtYΘ0 = IKminus1 we also have that nminus1 ΘgtYgtYΘ = IKminus1 holding theconstraints of the p-OS problem Hence assuming that the diagonal elements of Λ aresorted in decreasing order θk is an optimal solution to the p-OS problem Finally onceΘ has been computed the corresponding optimal regression coefficients B satisfying(52) are simply recovered using the mapping from Θ0 to Θ that is B = B0VAppendix E details why the computational trick described here for quadratic penaltiescan be applied to the group-Lasso for which Ω is defined by a variational formulation

        53 Optimality Conditions

        GLOSS uses an active set optimization technique to obtain the optimal values of thecoefficient matrix B and the score matrix Θ To be a solution the coefficient matrix mustobey Lemmas 43 and 44 Optimality conditions (432a) and (432b) can be deducedfrom those lemmas Both expressions require the computation of the gradient of theobjective function

        1

        2YΘminusXB22 + λ

        psumj=1

        wj∥∥βj∥∥

        2(55)

        Let J(B) be the data-fitting term 12 YΘminusXB22 Its gradient with respect to the jth

        row of B βj is the (K minus 1)-dimensional vector

        partJ(B)

        partβj= xj

        gt(XBminusYΘ)

        where xj is the column j of X Hence the first optimality condition (432a) can becomputed for every variable j as

        xjgt

        (XBminusYΘ) + λwjβj∥∥βj∥∥

        2

        1 As X is centered 1K belongs to the null space of YgtX(XgtX + Ω

        )minus1XgtY It is thus suffi-

        cient to choose Θ0 orthogonal to 1K to ensure that its range spans the leading eigenvectors of

        YgtX(XgtX + Ω

        )minus1XgtY In practice to comply with this desideratum and conditions (35b) and

        (35c) we set Θ0 =(YgtY

        )minus12U where U is a Ktimes (Kminus1) matrix whose columns are orthonormal

        vectors orthogonal to 1K

        53

        5 GLOSS Algorithm

        The second optimality condition (432b) can be computed for every variable j as∥∥∥xjgt (XBminusYΘ)∥∥∥

        2le λwj

        54 Active and Inactive Sets

        The feature selection mechanism embedded in GLOSS selects the variables that pro-vide the greatest decrease in the objective function This is accomplished by means ofthe optimality conditions (432a) and (432b) Let A be the active set with the variablesthat have already been considered relevant A variable j can be considered for inclusioninto the active set if it violates the second optimality condition We proceed one variableat a time by choosing the one that is expected to produce the greatest decrease in theobjective function

        j = maxj

        ∥∥∥xjgt (XBminusYΘ)∥∥∥

        2minus λwj 0

        The exclusion of a variable belonging to the active set A is considered if the norm∥∥βj∥∥

        2

        is small and if after setting βj to zero the following optimality condition holds∥∥∥xjgt (XBminusYΘ)∥∥∥

        2le λwj

        The process continue until no variable in the active set violates the first optimalitycondition and no variable in the inactive set violates the second optimality condition

        55 Penalty Parameter

        The penalty parameter can be specified by the user in which case GLOSS solves theproblem with this value of λ The other strategy is to compute the solution path forseveral values of λ GLOSS looks then for the maximum value of the penalty parameterλmax such that B 6= 0 and solve the p-OS problem for decreasing values of λ until aprescribed number of features are declared active

        The maximum value of the penalty parameter λmax corresponding to a null B matrixis obtained by computing the optimality condition (432b) at B = 0

        λmax = maxjisin1p

        1

        wj

        ∥∥∥xjgtYΘ0∥∥∥

        2

        The algorithm then computes a series of solutions along the regularization path definedby a series of penalties λ1 = λmax gt middot middot middot gt λt gt middot middot middot gt λT = λmin ge 0 by regularlydecreasing the penalty λt+1 = λt2 and using a warm-start strategy where the feasibleinitial guess for B(λt+1) is initialized with B(λt) The final penalty parameter λmin

        is specified in the optimization process when the maximum number of desired activevariables is attained (by default the minimum of n and p)

        54

        56 Options and Variants

        56 Options and Variants

        561 Scaling Variables

        As most penalization schemes GLOSS is sensitive to the scaling of variables Itthus makes sense to normalize them before applying the algorithm or equivalently toaccommodate weights in the penalty This option is available in the algorithm

        562 Sparse Variant

        This version replaces some matlab commands used in the standard version of GLOSSby the sparse equivalents commands In addition some mathematical structures areadapted for sparse computation

        563 Diagonal Variant

        We motivated the group-Lasso penalty by sparsity requisites but robustness consid-erations could also drive its usage since LDA is known to be unstable when the numberof examples is small compared to the number of variables In this context LDA hasbeen experimentally observed to benefit from unrealistic assumptions on the form of theestimated within-class covariance matrix Indeed the diagonal approximation that ig-nores correlations between genes may lead to better classification in microarray analysisBickel and Levina (2004) shown that this crude approximation provides a classifier withbest worst-case performances than the LDA decision rule in small sample size regimeseven if variables are correlated

        The equivalence proof between penalized OS and penalized LDA (Hastie et al 1995)reveals that quadratic penalties in the OS problem are equivalent to penalties on thewithin-class covariance matrix in the LDA formulation This proof suggests a slightvariant of penalized OS corresponding to penalized LDA with diagonal within-classcovariance matrix where the least square problems

        minBisinRptimesKminus1

        YΘminusXB2F = minBisinRptimesKminus1

        tr(ΘgtYgtYΘminus 2ΘgtYgtXB + nBgtΣTB

        )are replaced by

        minBisinRptimesKminus1

        tr(ΘgtYgtYΘminus 2ΘgtYgtXB + nBgt(ΣB + diag (ΣW))B

        )Note that this variant only requires diag(ΣW)+ΣB +nminus1Ω to be positive definite whichis a weaker requirement than ΣT + nminus1Ω positive definite

        564 Elastic net and Structured Variant

        For some learning problems the structure of correlations between variables is partiallyknown Hastie et al (1995) applied this idea to the field of handwritten digits recognition

        55

        5 GLOSS Algorithm

        7 8 9

        4 5 6

        1 2 3

        - ΩL =

        3 minus1 0 minus1 minus1 0 0 0 0minus1 5 minus1 minus1 minus1 minus1 0 0 00 minus1 3 0 minus1 minus1 0 0 0minus1 minus1 0 5 minus1 0 minus1 minus1 0minus1 minus1 minus1 minus1 8 minus1 minus1 minus1 minus10 minus1 minus1 0 minus1 5 0 minus1 minus10 0 0 minus1 minus1 0 3 minus1 00 0 0 minus1 minus1 minus1 minus1 5 minus10 0 0 0 minus1 minus1 0 minus1 3

        Figure 52 Graph and Laplacian matrix for a 3times 3 image

        for their penalized discriminant analysis model to constrain the discriminant directionsto be spatially smooth

        When an image is represented as a vector of pixels it is reasonable to assume posi-tive correlations between the variables corresponding to neighboring pixels Figure 52represents the neighborhood graph of pixels in an 3 times 3 image with the correspondingLaplacian matrix The Laplacian matrix ΩL is semi-positive definite and the penaltyβgtΩLβ favors among vectors of identical L2 norms the ones having similar coeffi-cients in the neighborhoods of the graph For example this penalty is 9 for the vector(1 1 0 1 1 0 0 0 0)gt which is the indicator of the neighbors of pixel 1 and it is 17 forthe vector (minus1 1 0 1 1 0 0 0 0)gt with sign mismatch between pixel 1 and its neighbor-hood

        This smoothness penalty can be imposed jointly with the group-Lasso From thecomputational point of view GLOSS hardly needs to be modified The smoothnesspenalty has just to be added to group-Lasso penalty As the new penalty is convex andquadratic (thus smooth) there is no additional burden in the overall algorithm Thereis however an additional hyperparameter to be tuned

        56

        6 Experimental Results

        This section presents some comparison results between the Group Lasso Optimal Scor-ing Solver algorithm and two other classifiers at the state of the art proposed to performsparse LDA Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani 2011)which applies a Lasso penalty into a Fisherrsquos LDA framework and the Sparse LinearDiscriminant Analysis (SLDA) (Clemmensen et al 2011) which applies an Elastic netpenalty to the OS problem With the aim of testing the parsimony capacities the latteralgorithm was tested without any quadratic penalty that is with a Lasso penalty Theimplementation of PLDA and SLDA is available from the authorsrsquo website PLDA is anR implementation and SLDA is coded in matlab All the experiments used the sametraining validation and test sets Note that they differ significantly from the ones ofWitten and Tibshirani (2011) in Simulation 4 for which there was a typo in their paper

        61 Normalization

        With shrunken estimates the scaling of features has important outcomes For thelinear discriminants considered here the two most common normalization strategiesconsist in setting either the diagonal of the total covariance matrix ΣT to ones orthe diagonal of the within-class covariance matrix ΣW to ones These options can beimplemented either by scaling the observations accordingly prior to the analysis or byproviding penalties with weights The latter option is implemented in our matlabpackage 1

        62 Decision Thresholds

        The derivations of LDA based on the analysis of variance or on the regression ofclass indicators do not rely on the normality of the class-conditional distribution forthe observations Hence their applicability extends beyond the realm of Gaussian dataBased on this observation Friedman et al (2009 chapter 4) suggest to investigate otherdecision thresholds than the ones stemming from the Gaussian mixture assumptionIn particular they propose to select the decision thresholds that empirically minimizetraining error This option was tested using validation sets or cross-validation

        1The GLOSS matlab code can be found in the software section of wwwhdsutcfr~grandval

        57

        6 Experimental Results

        63 Simulated Data

        We first compare the three techniques in the simulation study of Witten and Tibshirani(2011) which considers four setups with 1200 examples equally distributed betweenclasses They are split in a training set of size n = 100 a validation set of size 100 anda test set of size 1000 We are in the small sample regime with p = 500 variables out ofwhich 100 differ between classes Independent variables are generated for all simulationsexcept for Simulation 2 where they are slightly correlated In Simulations 2 and 3 classesare optimally separated by a single projection of the original variables while the twoother scenarios require three discriminant directions The Bayesrsquo error was estimatedto be respectively 17 67 73 and 300 The exact definition of every setup asprovided in Witten and Tibshirani (2011) is

        Simulation1 Mean shift with independent features There are four classes If samplei is in class k then xi sim N(microk I) where micro1j = 07 times 1(1lejle25) micro2j = 07 times 1(26lejle50)micro3j = 07times 1(51lejle75) micro4j = 07times 1(76lejle100)

        Simulation2 Mean shift with dependent features There are two classes If samplei is in class 1 then xi sim N(0Σ) and if i is in class 2 then xi sim N(microΣ) withmicroj = 06 times 1(jle200) The covariance structure is block diagonal with 5 blocks each of

        dimension 100times 100 The blocks have (j jprime) element 06|jminusjprime| This covariance structure

        is intended to mimic gene expression data correlation

        Simulation3 One-dimensional mean shift with independent features There are fourclasses and the features are independent If sample i is in class k then Xij sim N(kminus1

        3 1)if j le 100 and Xij sim N(0 1) otherwise

        Simulation4 Mean shift with independent features and no linear ordering Thereare four classes If sample i is in class k then xi sim N(microk I) With mean vectorsdefined as follows micro1j sim N(0 032) for j le 25 and micro1j = 0 otherwise micro2j sim N(0 032)for 26 le j le 50 and micro2j = 0 otherwise micro3j sim N(0 032) for 51 le j le 75 and micro3j = 0otherwise micro4j sim N(0 032) for 76 le j le 100 and micro4j = 0 otherwise

        Note that this protocol is detrimental to GLOSS as each relevant variable only affectsa single class mean out of K The setup is favorable to PLDA in the sense that mostwithin-class covariance matrix are diagonal We thus also tested the diagonal GLOSSvariant discussed in Section 563

        The results are summarized in Table 61 Overall the best predictions are performedby PLDA and GLOS-D that both benefit of the knowledge of the true within-classcovariance structure Then among SLDA and GLOSS that both ignore this structureour proposal has a clear edge The error rates are far away from the Bayesrsquo error ratesbut the sample size is small with regard to the number of relevant variables Regardingsparsity the clear overall winner is GLOSS followed far away by SLDA which is the only

        58

        63 Simulated Data

        Table 61 Experimental results for simulated data averages with standard deviationscomputed over 25 repetitions of the test error rate the number of selectedvariables and the number of discriminant directions selected on the validationset

        Err () Var Dir

        Sim 1 K = 4 mean shift ind features

        PLDA 126 (01) 4117 (37) 30 (00)SLDA 319 (01) 2280 (02) 30 (00)GLOSS 199 (01) 1064 (13) 30 (00)GLOSS-D 112 (01) 2511 (41) 30 (00)

        Sim 2 K = 2 mean shift dependent features

        PLDA 90 (04) 3376 (57) 10 (00)SLDA 193 (01) 990 (00) 10 (00)GLOSS 154 (01) 398 (08) 10 (00)GLOSS-D 90 (00) 2035 (40) 10 (00)

        Sim 3 K = 4 1D mean shift ind features

        PLDA 138 (06) 1615 (37) 10 (00)SLDA 578 (02) 1526 (20) 19 (00)GLOSS 312 (01) 1238 (18) 10 (00)GLOSS-D 185 (01) 3575 (28) 10 (00)

        Sim 4 K = 4 mean shift ind features

        PLDA 603 (01) 3360 (58) 30 (00)SLDA 659 (01) 2088 (16) 27 (00)GLOSS 607 (02) 743 (22) 27 (00)GLOSS-D 588 (01) 1627 (49) 29 (00)

        59

        6 Experimental Results

        0 10 20 30 40 50 60 70 8020

        30

        40

        50

        60

        70

        80

        90

        100TPR Vs FPR

        gloss

        glossd

        slda

        plda

        Simulation1

        Simulation2

        Simulation3

        Simulation4

        Figure 61 TPR versus FPR (in ) for all algorithms and simulations

        Table 62 Average TPR and FPR (in ) computed over 25 repetitions

        Simulation1 Simulation2 Simulation3 Simulation4TPR FPR TPR FPR TPR FPR TPR FPR

        PLDA 990 782 969 603 980 159 743 656

        SLDA 739 385 338 163 416 278 507 395

        GLOSS 641 106 300 46 511 182 260 121

        GLOSS-D 935 394 921 281 956 655 429 299

        method that do not succeed in uncovering a low-dimensional representation in Simulation3 The adequacy of the selected features was assessed by the True Positive Rate (TPR)and the False Positive Rate (FPR) The TPR is defined as the ratio of selected variablesthat are actually relevant Similarly the FPR is the ratio of selected variables that areactually non relevant The best algorithm would be the one that selects all the relevantvariables and rejects all the others That is TPR = 1 and FPR = 0 simultaneouslyPLDA has the best TPR but a terrible FPR except in simulation 3 where it dominatesall the other methods GLOSS has by far the best FPR with overall TPR slightly belowSLDA Results are displayed in Figure 61 (both in percentages) (or in Table 62 )

        64 Gene Expression Data

        We now compare GLOSS to PLDA and SLDA on three genomic datasets TheNakayama2 dataset contains 105 examples of 22283 gene expressions for categorizing10 soft tissue tumors It was reduced to the 86 examples belonging to the 5 dominantcategories (Witten and Tibshirani 2011) The Ramaswamy3 dataset contains 198 exam-

        2httpwwwbroadinstituteorgcancersoftwaregenepatterndatasets3httpwwwncbinlmnihgovsitesGDSbrowseracc=GDS2736

        60

        64 Gene Expression Data

        Table 63 Experimental results for gene expression data averages over 10 trainingtestsets splits with standard deviations of the test error rates and the numberof selected variables

        Err () Var

        Nakayama n = 86 p = 22 283 K = 5

        PLDA 2095 (13) 104787 (21163)SLDA 2571 (17) 2525 (31)GLOSS 2048 (14) 1290 (186)

        Ramaswamy n = 198 p = 16 063 K = 14

        PLDA 3836 (60) 148735 (7203)SLDA mdash mdashGLOSS 2061 (69) 3724 (1221)

        Sun n = 180 p = 54 613 K = 4

        PLDA 3378 (59) 216348 (74432)SLDA 3622 (65) 3844 (165)GLOSS 3177 (45) 930 (936)

        ples of 16063 gene expressions for categorizing 14 classes of cancer Finally the Sun4

        dataset contains 180 examples of 54613 gene expressions for categorizing 4 classes oftumors

        Each dataset was split into a training set and a test set with respectively 75 and25 of the examples Parameter tuning is performed by 10-fold cross-validation and thetest performances are then evaluated The process is repeated 10 times with randomchoices of training and test set split

        Test error rates and the number of selected variables are presented in Table 63 Theresults for the PLDA algorithm are extracted from Witten and Tibshirani (2011) Thethree methods have comparable prediction performances on the Nakayama and Sundatasets but GLOSS performs better on the Ramaswamy data where the SparseLDApackage failed to return a solution due to numerical problems in the LARS-EN imple-mentation Regarding the number of selected variables GLOSS is again much sparserthan its competitors

        Finally Figure 62 displays the projection of the observations for the Nakayama andSun datasets in the first canonical planes estimated by GLOSS and SLDA For theNakayama dataset groups 1 and 2 are well-separated from the other ones in both rep-resentations but GLOSS is more discriminant in the meta-cluster gathering groups 3to 5 For the Sun dataset SLDA suffers from a high colinearity of its first canonicalvariables that renders the second one almost non-informative As a result group 1 isbetter separated in the first canonical plane with GLOSS

        4httpwwwncbinlmnihgovsitesGDSbrowseracc=GDS1962

        61

        6 Experimental Results

        GLOSS SLDA

        Naka

        yam

        a

        minus25000 minus20000 minus15000 minus10000 minus5000 0 5000

        minus25

        minus2

        minus15

        minus1

        minus05

        0

        05

        1

        x 104

        1) Synovial sarcoma

        2) Myxoid liposarcoma

        3) Dedifferentiated liposarcoma

        4) Myxofibrosarcoma

        5) Malignant fibrous histiocytoma

        2n

        dd

        iscr

        imin

        ant

        minus2000 0 2000 4000 6000 8000 10000 12000 14000

        2000

        4000

        6000

        8000

        10000

        12000

        14000

        16000

        1) Synovial sarcoma

        2) Myxoid liposarcoma

        3) Dedifferentiated liposarcoma

        4) Myxofibrosarcoma

        5) Malignant fibrous histiocytoma

        Su

        n

        minus1 minus05 0 05 1 15 2

        x 104

        05

        1

        15

        2

        25

        3

        35

        x 104

        1) NonTumor

        2) Astrocytomas

        3) Glioblastomas

        4) Oligodendrogliomas

        1st discriminant

        2n

        dd

        iscr

        imin

        ant

        minus2 minus15 minus1 minus05 0

        x 104

        0

        05

        1

        15

        2

        x 104

        1) NonTumor

        2) Astrocytomas

        3) Glioblastomas

        4) Oligodendrogliomas

        1st discriminant

        Figure 62 2D-representations of Nakayama and Sun datasets based on the two first dis-criminant vectors provided by GLOSS and SLDA The big squares representclass means

        62

        65 Correlated Data

        Figure 63 USPS digits ldquo1rdquo and ldquo0rdquo

        65 Correlated Data

        When the features are known to be highly correlated the discrimination algorithmcan be improved by using this information in the optimization problem The structuredvariant of GLOSS presented in Section 564 S-GLOSS from now on was conceived tointroduce easily this prior knowledge

        The experiments described in this section are intended to illustrate the effect of com-bining the group-Lasso sparsity inducing penalty with a quadratic penalty used as asurrogate of the unknown within-class variance matrix This preliminary experimentdoes not include comparisons with other algorithms More comprehensive experimentalresults have been left for future works

        For this illustration we have used a subset of the USPS handwritten digit datasetmade of of 16times 16 pixels representing digits from 0 to 9 For our purpose we comparethe discriminant direction that separates digits ldquo1rdquo and ldquo0rdquo computed with GLOSS andS-GLOSS The mean image of every digit is showed in Figure 63

        As in Section 564 we have represented the pixel proximity relationships from Figure52 into a penalty matrix ΩL but this time in a 256-nodes graph Introducing this new256times 256 Laplacian penalty matrix ΩL in the GLOSS algorithm is straightforward

        The effect of this penalty is fairly evident in Figure 64 where the discriminant vectorβ resulting of a non-penalized execution of GLOSS is compared with the β resultingfrom a Laplace penalized execution of S-GLOSS (without group-Lasso penalty) Weperfectly distinguish the center of the digit ldquo0rdquo in the discriminant direction obtainedby S-GLOSS that is probably the most important element to discriminate both digits

        Figure 65 display the discriminant direction β obtained by GLOSS and S-GLOSSfor a non-zero group-Lasso penalty with an identical penalization parameter (λ = 03)Even if both solutions are sparse the discriminant vector from S-GLOSS keeps connectedpixels that allow to detect strokes and will probably provide better prediction results

        63

        6 Experimental Results

        β for GLOSS β for S-GLOSS

        Figure 64 Discriminant direction between digits ldquo1rdquo and ldquo0rdquo

        β for GLOSS and λ = 03 β for S-GLOSS and λ = 03

        Figure 65 Sparse discriminant direction between digits ldquo1rdquo and ldquo0rdquo

        64

        Discussion

        GLOSS is an efficient algorithm that performs sparse LDA based on the regressionof class indicators Our proposal is equivalent to a penalized LDA problem This isup to our knowledge the first approach that enjoys this property in the multi-classsetting This relationship is also amenable to accommodate interesting constraints onthe equivalent penalized LDA problem such as imposing a diagonal structure of thewithin-class covariance matrix

        Computationally GLOSS is based on an efficient active set strategy that is amenableto the processing of problems with a large number of variables The inner optimizationproblem decouples the p times (K minus 1)-dimensional problem into (K minus 1) independent p-dimensional problems The interaction between the (K minus 1) problems is relegated tothe computation of the common adaptive quadratic penalty The algorithm presentedhere is highly efficient in medium to high dimensional setups which makes it a goodcandidate for the analysis of gene expression data

        The experimental results confirm the relevance of the approach which behaves wellcompared to its competitors either regarding its prediction abilities or its interpretabil-ity (sparsity) Generally compared to the competing approaches GLOSS providesextremely parsimonious discriminants without compromising prediction performancesEmploying the same features in all discriminant directions enables to generate modelsthat are globally extremely parsimonious with good prediction abilities The resultingsparse discriminant directions also allow for visual inspection of data from the low-dimensional representations that can be produced

        The approach has many potential extensions that have not yet been implemented Afirst line of development is to consider a broader class of penalties For example plainquadratic penalties can also be added to the group-penalty to encode priors about thewithin-class covariance structure in the spirit of the Penalized Discriminant Analysis ofHastie et al (1995) Also besides the group-Lasso our framework can be customized toany penalty that is uniformly spread within groups and many composite or hierarchicalpenalties that have been proposed for structured data meet this condition

        65

        Part III

        Sparse Clustering Analysis

        67

        Abstract

        Clustering can be defined as a grouping task of samples such that all the elementsbelonging to one cluster are more ldquosimilarrdquo to each other than to the objects belongingto the other groups There are similarity measures for any data structure databaserecords or even multimedia objects (audio video) The similarity concept is closelyrelated to the idea of distance which is a specific dissimilarity

        Model-based clustering aims to describe an heterogeneous population with a proba-bilistic model that represent each group with a its own distribution Here the distribu-tions will be Gaussians and the different populations are identified with different meansand common covariance matrix

        As in the supervised framework the traditional clustering techniques perform worsewhen the number of irrelevant features increases In this part we develop Mix-GLOSSwhich builds on the supervised GLOSS algorithm to address unsupervised problemsresulting in a clustering mechanism with embedded feature selection

        Chapter 7 reviews different techniques of inducing sparsity in model-based clusteringalgorithms The theory that motivates our original formulation of the EM algorithm isdeveloped in Chapter 8 followed by the description of the algorithm in Chapter 9 Its per-formance is assessed and compared to other model-based sparse clustering mechanismsat the state of the art in Chapter 10

        69

        7 Feature Selection in Mixture Models

        71 Mixture Models

        One of the most popular clustering algorithm is K-means that aims to partition nobservations into K clusters Each observation is assigned to the cluster with the nearestmean (MacQueen 1967) A generalization of K-means can be made through probabilisticmodels which represents K subpopulations by a mixture of distributions Since their firstuse by Newcomb (1886) for the detection of outlier points and 8 years later by Pearson(1894) to identify two separate populations of crabs finite mixtures of distributions havebeen employed to model a wide variety of random phenomena These models assumethat measurements are taken from a set of individuals each of which belongs to oneout of a number of different classes while any individualrsquos particular class is unknownMixture models can thus address the heterogeneity of a population and are especiallywell suited to the problem of clustering

        711 Model

        We assume that the observed data X = (xgt1 xgtn )gt have been drawn identically

        from K different subpopulations in the domain Rp The generative distribution is afinite mixture model that is the data are assumed to be generated from a compoundeddistribution whose density can be expressed as

        f(xi) =

        Ksumk=1

        πkfk(xi) foralli isin 1 n

        where K is the number of components fk are the densities of the components and πk arethe mixture proportions (πk isin]0 1[ forallk and

        sumk πk = 1) Mixture models transcribe that

        given the proportions πk and the distributions fk for each class the data is generatedaccording to the following mechanism

        bull y each individual is allotted to a class according to a multinomial distributionwith parameters π1 πK

        bull x each xi is assumed to arise from a random vector with probability densityfunction fk

        In addition it is usually assumed that the component densities fk belong to a para-metric family of densities φ(middotθk) The density of the mixture can then be written as

        f(xiθ) =

        Ksumk=1

        πkφ(xiθk) foralli isin 1 n

        71

        7 Feature Selection in Mixture Models

        where θ = (π1 πK θ1 θK) is the parameter of the model

        712 Parameter Estimation The EM Algorithm

        For the estimation of parameters of the mixture model Pearson (1894) used themethod of moments to estimate the five parameters (micro1 micro2 σ

        21 σ

        22 π) of a univariate

        Gaussian mixture model with two components That method required him to solvepolynomial equations of degree nine There are also graphic methods maximum likeli-hood methods and Bayesian approaches

        The most widely used process to estimate the parameters is by maximizing the log-likelihood using the EM algorithm It is typically used to maximize the likelihood formodels with latent variables for which no analytical solution is available (Dempsteret al 1977)

        The EM algorithm iterates two steps called the expectation step (E) and the max-imization step (M) Each expectation step involves the computation of the likelihoodexpectation with respect to the hidden variables while each maximization step esti-mates the parameters by maximizing the E-step expected likelihood

        Under mild regularity assumptions this mechanism converges to a local maximumof the likelihood However the type of problems targeted is typically characterized bythe existence of several local maxima and global convergence cannot be guaranteed Inpractice the obtained solution depends on the initialization of the algorithm

        Maximum Likelihood Definitions

        The likelihood is is commonly expressed in its logarithmic version

        L(θ X) = log

        (nprodi=1

        f(xiθ)

        )

        =nsumi=1

        log

        (Ksumk=1

        πkfk(xiθk)

        ) (71)

        where n in the number of samples K is the number of components of the mixture (ornumber of clusters) and πk are the mixture proportions

        To obtain maximum likelihood estimates the EM algorithm works with the jointdistribution of the observations x and the unknown latent variables y which indicatethe cluster membership of every sample The pair z = (xy) is called the completedata The log-likelihood of the complete data is called the complete log-likelihood or

        72

        71 Mixture Models

        classification log-likelihood

        LC(θ XY) = log

        (nprodi=1

        f(xiyiθ)

        )

        =

        nsumi=1

        log

        (Ksumk=1

        yikπkfk(xiθk)

        )

        =nsumi=1

        Ksumk=1

        yik log (πkfk(xiθk)) (72)

        The yik are the binary entries of the indicator matrix Y with yik = 1 if the observation ibelongs to the cluster k and yik = 0 otherwise

        Defining the soft membership tik(θ) as

        tik(θ) = p(Yik = 1|xiθ) (73)

        =πkfk(xiθk)

        f(xiθ) (74)

        To lighten notations tik(θ) will be denoted tik when parameter θ is clear from contextThe regular (71) and complete (72) log-likelihood are related as follows

        LC(θ XY) =sumik

        yik log (πkfk(xiθk))

        =sumik

        yik log (tikf(xiθ))

        =sumik

        yik log tik +sumik

        yik log f(xiθ)

        =sumik

        yik log tik +nsumi=1

        log f(xiθ)

        =sumik

        yik log tik + L(θ X) (75)

        wheresum

        ik yik log tik can be reformulated as

        sumik

        yik log tik =nsumi=1

        Ksumk=1

        yik log(p(Yik = 1|xiθ))

        =

        nsumi=1

        log(p(Yik = 1|xiθ))

        = log (p(Y |Xθ))

        As a result the relationship (75) can be rewritten as

        L(θ X) = LC(θ Z)minus log (p(Y |Xθ)) (76)

        73

        7 Feature Selection in Mixture Models

        Likelihood Maximization

        The complete log-likelihood cannot be assessed because the variables yik are unknownHowever it is possible to estimate the value of log-likelihood taking expectations condi-tionally to a current value of θ on (76)

        L(θ X) = EYsimp(middot|Xθ(t)) [LC(θ X Y ))]︸ ︷︷ ︸Q(θθ(t))

        +EYsimp(middot|Xθ(t)) [minus log p(Y |Xθ)]︸ ︷︷ ︸H(θθ(t))

        In this expression H(θθ(t)) is the entropy and Q(θθ(t)) is the conditional expecta-tion of the complete log-likelihood Let us define an increment of the log-likelihood as∆L = L(θ(t+1) X)minus L(θ(t) X) Then θ(t+1) = argmaxθQ(θθ(t)) also increases thelog-likelihood

        ∆L = (Q(θ(t+1)θ(t))minusQ(θ(t)θ(t)))︸ ︷︷ ︸ge0 by definition of iteration t+1

        minus (H(θ(t+1)θ(t))minusH(θ(t)θ(t)))︸ ︷︷ ︸le0 by Jensen Inequality

        Therefore it is possible to maximize the likelihood by optimizing Q(θθ(t)) The rela-tionship between Q(θθprime) and L(θ X) is developed in deeper detail in Appendix F toshow how the value of L(θ X) can be recovered from Q(θθ(t))

        For the mixture model problem Q(θθprime) is

        Q(θθprime) = EYsimp(Y |Xθprime) [LC(θ X Y ))]

        =sumik

        p(Yik = 1|xiθprime) log(πkfk(xiθk))

        =nsumi=1

        Ksumk=1

        tik(θprime) log (πkfk(xiθk)) (77)

        Q(θθprime) due to its similitude to the expression of the complete likelihood (72) is alsoknown as the weighted likelihood In (77) the weights tik(θ

        prime) are the posterior proba-bilities of cluster memberships

        Hence the EM algorithm sketched above results in

        bull Initialization (not iterated) choice of the initial parameter θ(0)

        bull E-Step Evaluation of Q(θθ(t)) using tik(θ(t)) (74) in (77)

        bull M-Step Calculation of θ(t+1) = argmaxθQ(θθ(t))

        74

        72 Feature Selection in Model-Based Clustering

        Gaussian Model

        In the particular case of a Gaussian mixture model with common covariance matrixΣ and different vector means microk the mixture density is

        f(xiθ) =Ksumk=1

        πkfk(xiθk)

        =

        Ksumk=1

        πk1

        (2π)p2 |Σ|

        12

        exp

        minus1

        2(xi minus microk)

        gtΣminus1(xi minus microk)

        At the E-step the posterior probabilities tik are computed as in (74) with the currentθ(t) parameters then the M-Step maximizes Q(θθ(t)) (77) whose form is as follows

        Q(θθ(t)) =sumik

        tik log(πk)minussumik

        tik log(

        (2π)p2 |Σ|

        12

        )minus 1

        2

        sumik

        tik(xi minus microk)gtΣminus1(xi minus microk)

        =sumk

        tk log(πk)minusnp

        2log(2π)︸ ︷︷ ︸

        constant term

        minusn2

        log(|Σ|)minus 1

        2

        sumik

        tik(xi minus microk)gtΣminus1(xi minus microk)

        equivsumk

        tk log(πk)minusn

        2log(|Σ|)minus

        sumik

        tik

        (1

        2(xi minus microk)

        gtΣminus1(xi minus microk)

        ) (78)

        where

        tk =nsumi=1

        tik (79)

        The M-step which maximizes this expression with respect to θ applies the followingupdates defining θ(t+1)

        π(t+1)k =

        tkn

        (710)

        micro(t+1)k =

        sumi tikxitk

        (711)

        Σ(t+1) =1

        n

        sumk

        Wk (712)

        with Wk =sumi

        tik(xi minus microk)(xi minus microk)gt (713)

        The derivations are detailed in Appendix G

        72 Feature Selection in Model-Based Clustering

        When common covariance matrices are assumed Gaussian mixtures are related toLDA with partitions defined by linear decision rules When every cluster has its own

        75

        7 Feature Selection in Mixture Models

        covariance matrix Σk Gaussian mixtures are associated to quadratic discriminant anal-ysis (QDA) with quadratic boundaries

        In the high-dimensional low-sample setting numerical issues appear in the estimationof the covariance matrix To avoid those singularities regularization may be applied Aregularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989)Bensmail and Celeux (1996) extended this algorithm but rewriting the covariance matrixin terms of its eigenvalue decomposition Σk = λkDkAkD

        gtk (Banfield and Raftery 1993)

        These regularization schemes address singularity and stability issues but they do notinduce parsimonious models

        In this Chapter we review some techniques to induce sparsity with model-based clus-tering algorithms This sparsity refers to the rule that assigns examples to classesclustering is still performed in the original p-dimensional space but the decision rulecan be expressed with only a few coordinates of this high-dimensional space

        721 Based on Penalized Likelihood

        Penalized log-likelihood maximization is a popular estimation technique for mixturemodels It is typically achieved by the EM algorithm using mixture models for which theallocation of examples is expressed as a simple function of the input features For exam-ple for Gaussian mixtures with a common covariance matrix the log-ratio of posteriorprobabilities is a linear function of x

        log

        (p(Yk = 1|x)

        p(Y` = 1|x)

        )= xgtΣminus1(microk minus micro`)minus

        1

        2(microk + micro`)

        gtΣminus1(microk minus micro`) + logπkπ`

        In this model a simple way of introducing sparsity in discriminant vectors Σminus1(microk minusmicro`) is to constrain Σ to be diagonal and to favor sparse means microk Indeed for Gaussianmixtures with common diagonal covariance matrix if all means have the same value ondimension j then variable j is useless for class allocation and can be discarded Themeans can be penalized by the L1 norm

        λKsumk=1

        psumj=1

        |microkj |

        as proposed by Pan et al (2006) Pan and Shen (2007) Zhou et al (2009) consider morecomplex penalties on full covariance matrices

        λ1

        Ksumk=1

        psumj=1

        |microkj |+ λ2

        Ksumk=1

        psumj=1

        psumm=1

        |(Σminus1k )jm|

        In their algorithm they make use the graphical Lasso to estimate the covariances Evenif their formulation induces sparsity on the parameters their combination of L1 penaltiesdoes not directly target decision rules based on few variables and thus does not guaranteeparsimonious models

        76

        72 Feature Selection in Model-Based Clustering

        Guo et al (2010) propose a variation with a Pairwise Fusion Penalty (PFP)

        λ

        psumj=1

        sum16k6kprime6K

        |microkj minus microkprimej |

        This PFP regularization is not shrinking the means to zero but towards to each otherThe jth feature for all cluster means are driven to the same value that variable can beconsidered as non-informative

        A L1infin penalty is used by Wang and Zhu (2008) and Kuan et al (2010) to penalizethe likelihood encouraging null groups of features

        λ

        psumj=1

        (micro1j micro2j microKj)infin

        One group is defined for each variable j as the set of the K meanrsquos jth component(micro1j microKj) The L1infin penalty forces zeros at the group level favoring the removalof the corresponding feature This method seems to produce parsimonious models andgood partitions within a reasonable computing time In addition the code is publiclyavailable Xie et al (2008b) apply a group-Lasso penalty Their principle describesa vertical mean grouping (VMG with the same groups as Xie et al (2008a)) and ahorizontal mean grouping (HMG) VMG allows to get real feature selection because itforces null values for the same variable in all cluster means

        λradicK

        psumj=1

        radicradicradicradic Ksum

        k=1

        micro2kj

        The clustering algorithm of VMG differs from ours but the group penalty proposed isthe same however no code is available on the authorsrsquo website that allows to test

        The optimization of a penalized likelihood by means of an EM algorithm can be refor-mulated rewriting the maximization expressions from the M-step as a penalized optimalscoring regression Roth and Lange (2004) implemented it for two cluster problems us-ing a L1 penalty to encourage sparsity on the discriminant vector The generalizationfrom quadratic to non-quadratic penalties is quickly outlined in this work We extendthis works by considering an arbitrary number of clusters and by formalizing the linkbetween penalized optimal scoring and penalized likelihood estimation

        722 Based on Model Variants

        The algorithm proposed by Law et al (2004) takes a different stance The authorsdefine feature relevancy considering conditional independency That is the jth feature ispresumed uninformative if its distribution is independent of the class labels The densityis expressed as

        77

        7 Feature Selection in Mixture Models

        f(xi|φ πθν) =Ksumk=1

        πk

        pprodj=1

        [f(xij |θjk)]φj [h(xij |νj)]1minusφj

        where f(middot|θjk) is the distribution function for relevant features and h(middot|νj) is the distri-bution function for the irrelevant ones The binary vector φ = (φ1 φ2 φp) representsrelevance with φj = 1 if the jth feature is informative and φj = 0 otherwise Thesaliency for variable j is then formalized as ρj = P (φj = 1) So all φj must be treatedas missing variables Thus the set of parameters is πk θjk νj ρj Theirestimation is done by means of the EM algorithm (Law et al 2004)

        An original and recent technique is the Fisher-EM algorithm proposed by Bouveyronand Brunet (2012ba) The Fisher-EM is a modified version of EM that runs in a latentspace This latent space is defined by an orthogonal projection matrix U isin RptimesKminus1

        which is updated inside the EM loop with a new step called the Fisher step (F-step fromnow on) which maximizes a multi-class Fisherrsquos criterion

        tr(

        (UgtΣWU)minus1UgtΣBU) (714)

        so as to maximize the separability of the data The E-step is the standard one computingthe posterior probabilities Then the F-step updates the projection matrix that projectsthe data to the latent space Finally the M-step estimates the parameters by maximizingthe conditional expectation of the complete log-likelihood Those parameters can berewritten as a function of the projection matrix U and the model parameters in thelatent space such that the U matrix enters into the M-step equations

        To induce feature selection Bouveyron and Brunet (2012a) suggest three possibilitiesThe first one results in the best sparse orthogonal approximation U of the matrix Uwhich maximizes (714) This sparse approximation is defined as the solution of

        minUisinRptimesKminus1

        ∥∥∥XU minusXU∥∥∥2

        F+ λ

        Kminus1sumk=1

        ∥∥∥uk∥∥∥1

        where XU = XU is the input data projected in the non-sparse space and uk is thekth column vector of the projection matrix U The second possibility is inspired byQiao et al (2009) and reformulates Fisherrsquos discriminant (714) used to compute theprojection matrix as a regression criterion penalized by a mixture of Lasso and Elasticnet

        minABisinRptimesKminus1

        Ksumk=1

        ∥∥∥RminusgtW HBk minusABgtHBk

        ∥∥∥2

        2+ ρ

        Kminus1sumj=1

        βgtj ΣWβj + λ

        Kminus1sumj=1

        ∥∥βj∥∥1

        s t AgtA = IKminus1

        where HB isin RptimesK is a matrix defined conditionally to the posterior probabilities tiksatisfying HBHgtB = ΣB and HBk is the kth column of HB RW isin Rptimesp is an upper

        78

        72 Feature Selection in Model-Based Clustering

        triangular matrix resulting from the Cholesky decomposition of ΣW ΣW and ΣB arethe p times p within-class and between-class covariance matrices in the observations spaceA isin RptimesKminus1 and B isin RptimesKminus1 are the solutions of the optimization problem such thatB = [β1 βKminus1] is the best sparse approximation of U

        The last possibility suggests the solution of the Fisherrsquos discriminant (714) as thesolution of the following constrained optimization problem

        minUisinRptimesKminus1

        psumj=1

        ∥∥∥ΣBj minus UUgtΣBj

        ∥∥∥2

        2

        s t UgtU = IKminus1

        whereΣBj is the jth column of the between covariance matrix in the observations spaceThis problem can be solved by a penalized version of the singular value decompositionproposed by (Witten et al 2009) resulting in a sparse approximation of U

        To comply with the constraint stating that the columns of U are orthogonal the firstand the second options must be followed by a singular vector decomposition of U to getorthogonality This is not necessary with the third option since the penalized version ofSVD already guarantees orthogonality

        However there is a lack of guarantees regarding convergence Bouveyron states ldquotheupdate of the orientation matrix U in the F-step is done by maximizing the Fishercriterion and not by directly maximizing the expected complete log-likelihood as requiredin the EM algorithm theory From this point of view the convergence of the Fisher-EM algorithm cannot therefore be guaranteedrdquo Immediately after this paragraph wecan read that under certain suppositions their algorithms converge ldquothe model []which assumes the equality and the diagonality of covariance matrices the F-step of theFisher-EM algorithm satisfies the convergence conditions of the EM algorithm theoryand the convergence of the Fisher-EM algorithm can be guaranteed in this case For theother discriminant latent mixture models although the convergence of the Fisher-EMprocedure cannot be guaranteed our practical experience has shown that the Fisher-EMalgorithm rarely fails to converge with these models if correctly initializedrdquo

        723 Based on Model Selection

        Some clustering algorithms recast the feature selection problem as model selectionproblem According to this Raftery and Dean (2006) model the observations as amixture model of Gaussians distributions To discover a subset of relevant features (andits superfluous complementary) they define three subsets of variables

        bull X(1) set of selected relevant variables

        bull X(2) set of variables being considered for inclusion or exclusion of X(1)

        bull X(3) set of non relevant variables

        79

        7 Feature Selection in Mixture Models

        With those subsets they defined two different models where Y is the partition toconsider

        bull M1

        f (X|Y) = f(X(1)X(2)X(3)|Y

        )= f

        (X(3)|X(2)X(1)

        )f(X(2)|X(1)

        )f(X(1)|Y

        )bull M2

        f (X|Y) = f(X(1)X(2)X(3)|Y

        )= f

        (X(3)|X(2)X(1)

        )f(X(2)X(1)|Y

        )Model M1 means that variables in X(2) are independent on clustering Y Model M2

        shows that variables in X(2) depend on clustering Y To simplify the algorithm subsetX(2) is only updated one variable at a time Therefore deciding the relevance of variableX(2) deals with a model selection between M1 and M2 The selection is done via theBayes factor

        B12 =f (X|M1)

        f (X|M2)

        where the high-dimensional f(X(3)|X(2)X(1)) cancels from the ratio

        B12 =f(X(1)X(2)X(3)|M1

        )f(X(1)X(2)X(3)|M2

        )=f(X(2)|X(1)M1

        )f(X(1)|M1

        )f(X(2)X(1)|M2

        )

        This factor is approximated since the integrated likelihoods f(X(1)|M1

        )and

        f(X(2)X(1)|M2

        )are difficult to calculate exactly Raftery and Dean (2006) use the

        BIC approximation The computation of f(X(2)|X(1)M1

        ) if there is only one variable

        in X(2) can be represented as a linear regression of variable X(2) on the variables inX(1) There is also a BIC approximation for this term

        Maugis et al (2009a) have proposed a variation of the algorithm developed by Rafteryand Dean They define three subsets of variables the relevant and irrelevant subsets(X(1) and X(3)) remains the same but X(2) is reformulated as a subset of relevantvariables that explains the irrelevance through a multidimensional regression This algo-rithm also uses of a backward stepwise strategy instead of the forward stepwise used byRaftery and Dean (2006) Their algorithm allows to define blocks of indivisible variablesthat in certain situations improve the clustering and its interpretability

        Both algorithms are well motivated and appear to produce good results however thequantity of computation needed to test the different subset of variables requires a hugecomputation time In practice they cannot be used for the amount of data consideredin this thesis

        80

        8 Theoretical Foundations

        In this chapter we develop Mix-GLOSS which uses the GLOSS algorithm conceivedfor supervised classification (see Section 5) to solve clustering problems The goal here issimilar that is providing an assignements of examples to clusters based on few features

        We use a modified version of the EM algorithm whose M-step is formulated as apenalized linear regression of a scaled indicator matrix that is a penalized optimalscoring problem This idea was originally proposed by Hastie and Tibshirani (1996)to perform reduced-rank decision rules using less than K minus 1 discriminant directionsTheir motivation was mainly driven by stability issues no sparsity-inducing mechanismwas introduced in the construction of discriminant directions Roth and Lange (2004)pursued this idea by for binary clustering problems where sparsity was introduced bya Lasso penalty applied to the OS problem Besides extending the work of Roth andLange (2004) to an arbitrary number of clusters we draw links between the OS penaltyand the parameters of the Gaussian model

        In the subsequent sections we provide the principles that allow to solve the M-stepas an optimal scoring problem The feature selection technique is embedded by meansof a group-Lasso penalty We must then guarantee that the equivalence between theM-step and the OS problem holds for our penalty As with GLOSS this is accomplishedwith a variational approach of group-Lasso Finally some considerations regarding thecriterion that is optimized with this modified EM are provided

        81 Resolving EM with Optimal Scoring

        In the previous chapters EM was presented as an iterative algorithm that computesa maximum likelihood estimate through the maximization of the expected complete log-likelihood This section explains how a penalized OS regression embedded into an EMalgorithm produces a penalized likelihood estimate

        811 Relationship Between the M-Step and Linear Discriminant Analysis

        LDA is typically used in a supervised learning framework for classification and dimen-sion reduction It looks for a projection of the data where the ratio of between-classvariance to within-class variance is maximized (see Appendix C) Classification in theLDA domain is based on the Mahalanobis distance

        d(ximicrok) = (xi minus microk)gtΣminus1

        W (xi minus microk)

        where microk are the p-dimensional centroids and ΣW is the p times p common within-classcovariance matrix

        81

        8 Theoretical Foundations

        The likelihood equations in the M-Step (711) and (712) can be interpreted as themean and covariance estimates of a weighted and augmented LDA problem Hastie andTibshirani (1996) where the n observations are replicated K times and weighted by tik(the posterior probabilities computed at the E-step)

        Having replicated the data vectors Hastie and Tibshirani (1996) remark that the pa-rameters maximizing the mixture likelihood in the M-step of the EM algorithm (711)and (712) can also be defined as the maximizers of the weighted and augmented likeli-hood

        2lweight(microΣ) =nsumi=1

        Ksumk=1

        tikd(ximicrok)minus n log(|ΣW|)

        which arises when considering a weighted and augmented LDA problem This viewpointprovides the basis for an alternative maximization of penalized maximum likelihood inGaussian mixtures

        812 Relationship Between Optimal Scoring and Linear DiscriminantAnalysis

        The equivalence between penalized optimal scoring problems and a penalized lineardiscriminant analysis has already been detailed in Section 41 in the supervised learningframework This is a critical part of the link between the M-step of an EM algorithmand optimal scoring regression

        813 Clustering Using Penalized Optimal Scoring

        The solution of the penalized optimal scoring regression in the M-step is a coefficientmatrix BOS analytically related to the Fisherrsquos discriminative directions BLDA for thedata (XY) where Y is the current (hard or soft) cluster assignement In order tocompute the posterior probabilities tik in the E-step the distance between the samplesxi and the centroids microk must be evaluated Depending wether we are working in theinput domain OS or LDA domain different expressions are used for the distances (seeSection 422 for more details) Mix-GLOSS works in the LDA domain based on thefollowing expression

        d(ximicrok) = (xminus microk)BLDA22 minus 2 log(πk)

        This distance defines the computation of the posterior probabilities tik in the E-step (seeSection 423) Putting together all those elements the complete clustering algorithmcan be summarized as

        82

        82 Optimized Criterion

        1 Initialize the membership matrix Y (for example by K-means algorithm)

        2 Solve the p-OS problem as

        BOS =(XgtX + λΩ

        )minus1XgtYΘ

        where Θ are the K minus 1 leading eigenvectors of

        YgtX(XgtX + λΩ

        )minus1XgtY

        3 Map X to the LDA domain XLDA = XBOSD with D = diag(αminus1k (1minusα2

        k)minus 1

        2 )

        4 Compute the centroids M in the LDA domain

        5 Evaluate distances in the LDA domain

        6 Translate distances into posterior probabilities tik with

        tik prop exp

        [minusd(x microk)minus 2 log(πk)

        2

        ] (81)

        7 Update the labels using the posterior probabilities matrix Y = T

        8 Go back to step 2 and iterate until tik converge

        Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alter-native view of the EM algorithm for Gaussian mixtures

        814 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

        In the previous section we schemed a clustering algorithm that replaces the M-stepwith penalized OS This modified version of EM holds for any quadratic penalty We ex-tend this equivalence to sparsity-inducing penalties through the a quadratic variationalapproach to the group-Lasso provided in Section 43 We now look for a formal equiva-lence between this penalty and penalized maximum likelihood for Gaussian mixtures

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ') (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by an optimal scoring regression is equivalent; replacing the M-step by a penalized optimal scoring problem is also possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2}\, |\Lambda_0|^{n/2}\, \Gamma_p\!\left(\frac{n}{2}\right)}\, |\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}} \exp\left\{-\frac{1}{2}\, \mathrm{tr}\!\left(\Lambda_0^{-1}\Sigma^{-1}\right)\right\} ,

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

\Gamma_p\!\left(\frac{n}{2}\right) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left(\frac{n}{2} + \frac{1-j}{2}\right) .

The posterior distribution can be maximized similarly to the likelihood, through the maximization of

Q(\theta, \theta') + \log f(\Sigma \mid \Lambda_0, \nu_0)

= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log \pi - \sum_{j=1}^{p} \log \Gamma\!\left(\frac{n}{2} + \frac{1-j}{2}\right) - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\,\mathrm{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right)

\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\,\mathrm{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right) ,   (8.2)

with

t_k = \sum_{i=1}^{n} t_{ik} , \qquad \nu_n = \nu_0 + n , \qquad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0 , \qquad S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(x_i - \mu_k)(x_i - \mu_k)^\top .

Details of these calculations can be found in textbooks (for example Bishop, 2006; Gelman et al., 2003).
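As an illustration, the short sketch below evaluates the right-hand side of (8.2), up to additive constants, for λ > 0 and under the identification Λ_0^{-1} = λΩ used later in this chapter. The function name and interface are ours, not part of the Mix-GLOSS code.

import numpy as np

def penalized_criterion(X, T, pi, Mu, Sigma, lam, Omega, nu0):
    """Value of (8.2) up to constants, with Lambda_0^{-1} = lam * Omega (illustrative sketch)."""
    n, p = X.shape
    tk = T.sum(axis=0)                              # t_k = sum_i t_ik
    nu_n = nu0 + n
    Lambda0_inv = lam * Omega
    S0 = np.zeros((p, p))                           # S_0 = sum_{i,k} t_ik (x_i - mu_k)(x_i - mu_k)^T
    for k in range(T.shape[1]):
        Xc = X - Mu[k]
        S0 += (T[:, [k]] * Xc).T @ Xc
    Lambda_n_inv = Lambda0_inv + S0
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_L0inv = np.linalg.slogdet(Lambda0_inv)   # log|Lambda_0| = -log|Lambda_0^{-1}|
    return (tk @ np.log(pi)
            + 0.5 * n * logdet_L0inv
            - 0.5 * (nu_n - p - 1) * logdet_Sigma
            - 0.5 * np.trace(np.linalg.solve(Sigma, Lambda_n_inv)))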

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, in which only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

\hat{\Sigma}_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1}\left(\Lambda_0^{-1} + S_0\right) ,   (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
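With this identification, the MAP covariance update reduces to the one-liner sketched below (illustrative names; note that with ν_0 = p + 1 the denominator reduces to n).

import numpy as np

def sigma_map(S0, Omega, lam, n, p):
    """MAP covariance (8.3) under nu_0 = p + 1 and Lambda_0^{-1} = lam * Omega (sketch)."""
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)   # denominator equals n here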


        9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X,

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically),

• the number of clusters K,

• the maximum number of iterations for the EM algorithm,

• the convergence tolerance for the EM algorithm,

• the number of whole repetitions of the clustering algorithm,

• a p × (K − 1) initial coefficient matrix (optional),

• an n × K initial posterior probability matrix (optional).

Figure 9.1: Mix-GLOSS loops scheme

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that this warm start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.
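The structure of this loop can be sketched as follows. The names are hypothetical: `em_solver(X, K, B, Y, lam)` stands for one full EM run (Algorithm 3 below) returning the coefficient matrix B and the label/posterior matrix Y.

import numpy as np

def lambda_path(X, K, lambdas, em_solver):
    """Warm-started loop over the penalty parameters (illustrative sketch)."""
    B, Y = None, None                         # first run is initialized inside the solver (K-means)
    path = {}
    for lam in sorted(lambdas):               # ascending order
        B, Y = em_solver(X, K, B, Y, lam)     # warm start from the previous solution
        path[lam] = (B, Y)
        if not np.any(B):                     # null coefficient matrix: stop the path
            break
    return path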

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) has to be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
  B ← 0; Y ← K-means(X, K)
  Run non-penalized Mix-GLOSS: λ ← 0; (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  lastLAMBDA ← false
repeat
  Estimate λ:
    Compute the gradient at β^j = 0:
      \frac{\partial J(B)}{\partial \beta^j}\Big|_{\beta^j = 0} = x_j^\top \Big( \sum_{m \neq j} x_m \beta^m - Y\Theta \Big)
    Compute λ_max for every feature using (4.32b):
      \lambda_{\max}^j = \frac{1}{w_j} \Big\| \frac{\partial J(B)}{\partial \beta^j}\Big|_{\beta^j = 0} \Big\|_2
    Choose λ so as to remove 10% of the relevant features
  Run penalized Mix-GLOSS: (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
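The per-feature λ_max computation of Algorithm 2 can be sketched in a few lines of Python (our names; `Y_Theta` stands for the scaled indicator matrix YΘ and `w` for the group weights). λ is then typically chosen between these values so that roughly 10% of the currently relevant features fall below their λ_max.

import numpy as np

def lambda_max_per_feature(X, Y_Theta, B, w):
    """Smallest penalty that zeroes each feature, from the gradient at beta^j = 0 (sketch)."""
    n, p = X.shape
    lam_max = np.zeros(p)
    for j in range(p):
        B_minus_j = B.copy()
        B_minus_j[j, :] = 0.0                            # keep only the other features' contribution
        grad_j = X[:, j] @ (X @ B_minus_j - Y_Theta)     # x_j^T (sum_{m != j} x_m beta^m - Y Theta)
        lam_max[j] = np.linalg.norm(grad_j, 2) / w[j]
    return lam_max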

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.

Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then
    B_OS ← B0; Y ← Y0
  else
    B_OS ← 0; Y ← K-means(X, K)
  end if
  convergenceEM ← false; tolEM ← 1e-3
repeat
  M-step:
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
  E-step:
    t_ik as per (8.1)
    L(θ) as per (8.2)
  if (1/n) Σ_i |t_ik − y_ik| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ and the prior probability of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

E-Step

The E-step evaluates the posterior probability matrix T using

t_{ik} \propto \exp\left[-\,\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right] .

The convergence of these t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm required a lot of computing resources, since the stability selection mechanism needed a certain number of repetitions, which transformed Mix-GLOSS into a lengthy four-nested-loop structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter.


[Figure 9.2 depicts the model selection flow: an initial Mix-GLOSS run with λ = 0 and 20 repetitions, taking X, K, λ, the maximum number of EM iterations and the number of repetitions as inputs; the B and T of the best repetition are used as StartB and StartT to warm-start Mix-GLOSS(λ, StartB, StartT) for each trial λ; BIC is computed for each run and λ is chosen as its minimizer, yielding the partition, t_ik, π_k, λ_BEST, B, Θ, D, L(θ) and the active set.]

Figure 9.2: Mix-GLOSS model selection diagram

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


        10 Experimental Results

The performance of Mix-GLOSS is measured here on the artificial dataset that was used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1, adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.

Figure 10.1: Class mean vectors for each artificial simulation

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The criteria used to measure performance are:

• Clustering error (in percentage): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ.

• Number of discarded features: this value shows the number of variables whose coefficients have been zeroed; they are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases. This is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected (recall), and the FPR is the proportion of irrelevant variables that are selected (fall-out). The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
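For reference, these two rates can be computed as in the following sketch, where `relevant` marks the first 20 variables of our simulations and `selected` the non-zero rows of the estimated coefficient matrix (names are ours).

import numpy as np

def tpr_fpr(selected, relevant):
    """TPR (recall) and FPR (fall-out) of a feature selection (illustrative sketch)."""
    selected = np.asarray(selected, dtype=bool)
    relevant = np.asarray(relevant, dtype=bool)
    tpr = selected[relevant].mean()            # share of the relevant variables that are kept
    fpr = selected[~relevant].mean()           # share of the irrelevant variables that are kept
    return tpr, fpr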

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data (averages over 25 repetitions, standard deviations in parentheses)

Simulation 1: K = 4, mean shift, independent features
                        Err (%)        Var            Time
  CS general cov        4.6 (1.5)      98.5 (7.2)     88.4 h
  Fisher EM             5.8 (8.7)      78.4 (5.2)     164.5 m
  Clustvarsel           60.2 (10.7)    37.8 (29.1)    38.3 h
  LumiWCluster-Kuan     4.2 (6.8)      77.9 (4.0)     38.9 s
  LumiWCluster-Wang     4.3 (6.9)      78.4 (3.9)     61.9 s
  Mix-GLOSS             3.2 (1.6)      80.0 (0.9)     1.5 h

Simulation 2: K = 2, mean shift, dependent features
  CS general cov        15.4 (2.0)     99.7 (0.9)     78.3 h
  Fisher EM             7.4 (2.3)      80.9 (2.8)     8 m
  Clustvarsel           7.3 (2.0)      33.4 (20.7)    16.6 h
  LumiWCluster-Kuan     6.4 (1.8)      79.8 (0.4)     15.5 s
  LumiWCluster-Wang     6.3 (1.7)      79.9 (0.3)     14 s
  Mix-GLOSS             7.7 (2.0)      84.1 (3.4)     2 h

Simulation 3: K = 4, 1D mean shift, independent features
  CS general cov        30.4 (5.7)     55.0 (46.8)    131.7 h
  Fisher EM             23.3 (6.5)     36.6 (5.5)     22 m
  Clustvarsel           65.8 (11.5)    23.2 (29.1)    54.2 h
  LumiWCluster-Kuan     32.3 (2.1)     80.0 (0.2)     83 s
  LumiWCluster-Wang     30.8 (3.6)     80.0 (0.2)     129.2 s
  Mix-GLOSS             34.7 (9.2)     81.0 (8.8)     2.1 h

Simulation 4: K = 4, mean shift, independent features
  CS general cov        62.6 (5.5)     99.9 (0.2)     11.2 h
  Fisher EM             56.7 (10.4)    55.0 (4.8)     19.5 m
  Clustvarsel           73.2 (4.0)     24.0 (1.2)     76.7 h
  LumiWCluster-Kuan     69.2 (11.2)    99.0 (2.0)     87.6 s
  LumiWCluster-Wang     69.7 (11.9)    99.1 (2.1)     82.5 s
  Mix-GLOSS             66.9 (9.1)     97.5 (1.2)     1.1 h

Table 10.2: TPR versus FPR (in %), averaged over the 25 repetitions, for the best performing algorithms

                     Simulation 1      Simulation 2      Simulation 3      Simulation 4
                     TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  Mix-GLOSS          99.2    0.15      82.8    3.35      88.4    6.7       78.0    1.2
  LumiWCluster-Kuan  99.2    2.8       100.0   0.2       100.0   0.05      5.0     0.05
  Fisher EM          98.6    2.4       88.8    1.7       83.8    58.25     62.0    40.75


[Figure 10.2 plots TPR against FPR (both in %) for MIX-GLOSS, LUMI-KUAN and FISHER-EM on the four simulations.]

Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing the irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.

Conclusions

        Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which proved its effectiveness, inducing extremely parsimonious models without giving up any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to be maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. For now, due to time constraints, only artificial datasets have been tested, with positive results.

        Perspectives

Even if the preliminary results are encouraging, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be advisable in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables; however, other high-dimension low-sample-size (HDLSS) domains are also possible. Identification of male or female silhouettes, of fungal species or of fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables. This can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of these possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized by Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms, which do not involve the computation of the real criterion, have been used for stopping the EM algorithm and for model selection. However, further investigations must be carried out in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.

Appendix

        A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:

\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top , \qquad \Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .

Property 2. \frac{\partial x^\top a}{\partial x} = \frac{\partial a^\top x}{\partial x} = a

Property 3. \frac{\partial x^\top A x}{\partial x} = (A + A^\top) x

Property 4. \frac{\partial |X^{-1}|}{\partial X} = -|X^{-1}| (X^{-1})^\top

Property 5. \frac{\partial a^\top X b}{\partial X} = a b^\top

Property 6. \frac{\partial}{\partial X} \mathrm{tr}\!\left(A X^{-1} B\right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we explain why the solution of a penalized optimal scoring regression involves an eigenvector decomposition. The p-OS problem has the form

\min_{\theta_k, \beta_k} \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k   (B.1)
s.t. \theta_k^\top Y^\top Y \theta_k = 1 ,
     \theta_\ell^\top Y^\top Y \theta_k = 0 , \quad \forall \ell < k ,

for k = 1, ..., K − 1. The Lagrangian associated with Problem (B.1) is

L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k(\theta_k^\top Y^\top Y \theta_k - 1) + \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k .   (B.2)

Setting the gradient of (B.2) with respect to β_k to zero gives the optimal β_k:

\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k .   (B.3)

The objective function of (B.1) evaluated at β_k^⋆ is

\min_{\theta_k} \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star = \min_{\theta_k} \theta_k^\top Y^\top \left(I - X(X^\top X + \Omega_k)^{-1} X^\top\right) Y \theta_k
 = \max_{\theta_k} \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k .   (B.4)

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Performing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial, due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, so that expression (B.4) can be rewritten in a compact way:

\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \mathrm{tr}\!\left(\Theta^\top M \Theta\right)   (B.5)
s.t. \Theta^\top Y^\top Y \Theta = I_{K-1} .

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K − 1) × (K − 1) matrix M_Θ be Θ^⊤MΘ. Hence, the classical eigenvector formulation associated with (B.5) is

M_\Theta v = \lambda v ,   (B.6)

where v is the eigenvector and λ the associated eigenvalue of M_Θ. Operating,

v^\top M_\Theta v = \lambda \iff v^\top \Theta^\top M \Theta v = \lambda .

Making the change of variable w = Θv, we obtain an alternative eigen-problem where the w are the eigenvectors of M and λ the associated eigenvalue:

w^\top M w = \lambda .   (B.7)

Therefore, v are the eigenvectors of the matrix M_Θ and w are the eigenvectors of the matrix M. Note that the only difference between the (K − 1) × (K − 1) matrix M_Θ and the K × K matrix M is the K × (K − 1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{-1}X^⊤YΘ in M_Θ:

M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B .

Thus, the eigen-decomposition of the (K − 1) × (K − 1) matrix M_Θ = Θ^⊤Y^⊤XB yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv must be undone.

To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB. The definitive eigenvectors w are then recovered as w = Θv, and the final step is the reconstruction of the optimal score matrix Θ using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable and restores the w vectors. The B matrix also needs to be "updated", by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

B^{\mathrm{up}} = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V .
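The whole "update" amounts to a few matrix products and a small eigen-decomposition, as in the following NumPy sketch (our names; the symmetrization line only removes round-off asymmetry, since M_Θ = Θ^⊤MΘ is symmetric in theory).

import numpy as np

def update_scores(X, Y, B, Theta):
    """Eigen-decompose the small (K-1) x (K-1) matrix M_Theta = Theta^T Y^T X B
    instead of the p x p problem, then rotate Theta and B by the eigenvectors V (sketch)."""
    M_theta = Theta.T @ Y.T @ X @ B
    M_theta = 0.5 * (M_theta + M_theta.T)     # enforce symmetry numerically
    evals, V = np.linalg.eigh(M_theta)        # eigenvectors v of (B.6), eigenvalues ascending
    V = V[:, ::-1]                            # order by decreasing eigenvalue
    return Theta @ V, B @ V                   # w = Theta v, and the matching update of B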


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m , \quad \text{s.t. } \theta_k^\top \theta_k = 1 .   (B.8)

The score vectors' normalization constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis:

\left(\sum_{m=1}^{K-1} \alpha_m w_m\right)^\top \left(\sum_{m=1}^{K-1} \alpha_m w_m\right) = 1 ,

which, by the orthonormality of the eigenvectors, reduces to

\sum_{m=1}^{K-1} \alpha_m^2 = 1 .   (B.9)

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

M \theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m .

As the w_m are the eigenvectors of M, the relationship M w_m = λ_m w_m can be used to obtain

M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m .

Multiplying both sides on the left by θ_k^⊤, written as its linear combination of eigenvectors,

\theta_k^\top M \theta_k = \left(\sum_{\ell=1}^{K-1} \alpha_\ell w_\ell\right)^\top \left(\sum_{m=1}^{K-1} \alpha_m \lambda_m w_m\right) .

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤w_m = 0 for any ℓ ≠ m, giving

\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m .


The optimization problem (B.5) for discriminant direction k can be rewritten as

\max_{\theta_k \in \mathbb{R}^{K}} \left\{ \theta_k^\top M \theta_k \right\} = \max_{\theta_k \in \mathbb{R}^{K}} \left\{ \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m \right\}   (B.10)

with \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m and \sum_{m=1}^{K-1} \alpha_m^2 = 1 .

One way of maximizing Problem (B.10) is to choose α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K−1} α_m w_m, the resulting score vector θ_k will be equal to the k-th eigenvector w_k.

In summary, the solution to the original problem (B.1) can be obtained from an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

\max_{\beta \in \mathbb{R}^p} \beta^\top \Sigma_B \beta   (C.1a)
s.t. \beta^\top \Sigma_W \beta = 1 ,   (C.1b)

where Σ_B and Σ_W are respectively the between-class and the within-class covariance matrices of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu(\beta^\top \Sigma_W \beta - 1) ,

so that its first derivative with respect to β is

\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta .

A necessary optimality condition for β is that this derivative is zero, that is,

\Sigma_B \beta = \nu \Sigma_W \beta .

Provided Σ_W is full rank, we have

\Sigma_W^{-1} \Sigma_B \beta = \nu \beta .   (C.2)

Thus, the solutions β match the definition of an eigenvector of the matrix Σ_W^{-1}Σ_B with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

\beta^\top \Sigma_B \beta = \beta^\top \Sigma_W \Sigma_W^{-1} \Sigma_B \beta
 = \nu\, \beta^\top \Sigma_W \beta    from (C.2)
 = \nu    from (C.1b) .

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence, ν is the largest eigenvalue of Σ_W^{-1}Σ_B, and β is any eigenvector corresponding to this maximal eigenvalue.
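In practice this is a symmetric-definite generalized eigenproblem, which can be solved directly without forming Σ_W^{-1}Σ_B, as in the sketch below (illustrative names; Σ_W is assumed full rank as above).

import numpy as np
from scipy.linalg import eigh

def fisher_direction(Sigma_B, Sigma_W):
    """Leading Fisher direction as the top eigenvector of Sigma_B beta = nu Sigma_W beta (C.2)."""
    evals, evecs = eigh(Sigma_B, Sigma_W)   # generalized symmetric-definite eigenproblem
    beta = evecs[:, -1]                     # eigenvector of the largest eigenvalue nu
    # eigh normalizes the eigenvectors so that beta^T Sigma_W beta = 1, matching (C.1b)
    return beta, evals[-1]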


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

\min_{\tau \in \mathbb{R}^p} \min_{B \in \mathbb{R}^{p \times (K-1)}} J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}   (D.1a)
s.t. \sum_{j=1}^{p} \tau_j = 1 ,   (D.1b)
     \tau_j \geq 0 , \quad j = 1, \ldots, p .   (D.1c)

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}: B = (β^{1⊤}, ..., β^{p⊤})^⊤.

L(B, \tau, \lambda, \nu_0, \nu_j) = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} + \nu_0 \left(\sum_{j=1}^{p} \tau_j - 1\right) - \sum_{j=1}^{p} \nu_j \tau_j .   (D.2)

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j*:

\frac{\partial L(B, \tau, \lambda, \nu_0, \nu_j)}{\partial \tau_j}\bigg|_{\tau_j = \tau_j^\star} = 0
 \;\Rightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
 \;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
 \;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0 .

The last two expressions are related through one of the properties of the Lagrange multipliers, which states that ν_j g_j(τ*) = 0, where ν_j is the Lagrange multiplier and g_j(τ) is the corresponding inequality constraint. Then, the optimal τ_j* can be deduced:

\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j \|\beta^j\|_2 .

Placing this optimal value of τ_j* into constraint (D.1b):

\sum_{j=1}^{p} \tau_j^\star = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} .   (D.3)


With this value of τ_j*, Problem (D.1) is equivalent to

\min_{B \in \mathbb{R}^{p \times (K-1)}} J(B) + \lambda \left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right)^{2} .   (D.4)

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently written as λB^⊤ΩB, where

\Omega = \mathrm{diag}\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\right) .   (D.5)

Using the value of τ_j* from (D.3), each diagonal component of Ω is

(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2} .   (D.6)

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is

\left\{ V \in \mathbb{R}^{p \times (K-1)} : V = \frac{\partial J(B)}{\partial B} + 2\lambda \left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right) G \right\} ,   (D.7)

where G = (g^{1\top}, \ldots, g^{p\top})^\top is a p × (K − 1) matrix defined as follows. Let S(B) denote the row-wise support of B, S(B) = {j ∈ {1, ..., p} : ‖β^j‖_2 ≠ 0}; then we have

\forall j \in S(B) , \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j ,   (D.8)
\forall j \notin S(B) , \quad \|g^j\|_2 \leq w_j .   (D.9)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B* of the objective function verifying the following conditions are global minima. Let S(B*) denote the row-wise support of B*, S(B*) = {j ∈ {1, ..., p} : ‖β^{*j}‖_2 ≠ 0}, and let S̄(B*) be its complement; then we have

\forall j \in S(B^\star) , \quad -\frac{\partial J(B^\star)}{\partial \beta^j} = 2\lambda \left(\sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2\right) w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} ,   (D.10a)
\forall j \in \bar{S}(B^\star) , \quad \left\|\frac{\partial J(B^\star)}{\partial \beta^j}\right\|_2 \leq 2\lambda w_j \left(\sum_{j'=1}^{p} w_{j'} \|\beta^{\star j'}\|_2\right) .   (D.10b)

In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4). For a given B, the gap between these objectives is null at τ such that

\tau_j = \frac{w_j \|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2} .

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have

\left(\sum_{j=1}^{p} w_j \|\beta^j\|_2\right)^2 = \left(\sum_{j=1}^{p} \tau_j^{1/2}\, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}}\right)^2
 \leq \left(\sum_{j=1}^{p} \tau_j\right) \left(\sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}\right)
 \leq \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} ,

where we used the Cauchy-Schwarz inequality in the second line and the definition of the feasibility set of τ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B_0 are optimal for the score values Θ_0, and if the optimal scores Θ* are obtained by a unitary transformation of Θ_0, say Θ* = Θ_0 V (where V ∈ R^{M×M} is a unitary matrix), then B* = B_0 V is optimal conditionally on Θ*, that is, (Θ*, B*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let B* be a solution of

\min_{B \in \mathbb{R}^{p \times M}} \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 ,   (E.1)

and let \tilde{Y} = YV, where V ∈ R^{M×M} is a unitary matrix. Then \tilde{B} = B^\star V is a solution of

\min_{B \in \mathbb{R}^{p \times M}} \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 .   (E.2)

Proof. The first-order necessary optimality conditions for B* are

\forall j \in S(B^\star) , \quad 2\, x_j^\top \left(x_j \beta^{\star j} - Y\right) + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0 ,   (E.3a)
\forall j \in \bar{S}(B^\star) , \quad 2 \left\| x_j^\top \left(x_j \beta^{\star j} - Y\right) \right\|_2 \leq \lambda w_j ,   (E.3b)

where S(B*) ⊆ {1, ..., p} denotes the set of non-zero row vectors of B* and S̄(B*) is its complement.

First, we note that, from the definition of B̃, we have S(B̃) = S(B*). Then we may rewrite the above conditions as follows:

\forall j \in S(\tilde{B}) , \quad 2\, x_j^\top \left(x_j \tilde{\beta}^{j} - \tilde{Y}\right) + \lambda w_j \|\tilde{\beta}^{j}\|_2^{-1} \tilde{\beta}^{j} = 0 ,   (E.4a)
\forall j \in \bar{S}(\tilde{B}) , \quad 2 \left\| x_j^\top \left(x_j \tilde{\beta}^{j} - \tilde{Y}\right) \right\|_2 \leq \lambda w_j ,   (E.4b)

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, using VV^⊤ = I so that, for all u ∈ R^M, ‖u^⊤‖_2 = ‖u^⊤V‖_2. Equation (E.4b) is also


obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for B̃ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ') (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from Q(θ, θ') when the latter is available:

L(\theta) = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)\right) ,   (F.1)

Q(\theta, \theta') = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right) ,   (F.2)

with \quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i; \theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(x_i; \theta'_\ell)} .   (F.3)

In the EM algorithm, θ' denotes the model parameters at the previous iteration, the t_ik(θ') are the posterior probabilities computed from θ' at the previous E-step, and θ, without "prime", denotes the parameters of the current iteration, to be obtained by the maximization of Q(θ, θ').

Using (F.3), we have

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right)
 = \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + \sum_{i,k} t_{ik}(\theta') \log\left(\sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell)\right)
 = \sum_{i,k} t_{ik}(\theta') \log(t_{ik}(\theta)) + L(\theta) .

In particular, after the evaluation of the t_ik in the E-step, where θ = θ', the log-likelihood can be computed from the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log(t_{ik}(\theta)) = Q(\theta, \theta) + H(T) .


        G Derivation of the M-Step Equations

This appendix shows the whole process for obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\left(\pi_k f_k(x_i; \theta_k)\right)
 = \sum_{k} \left(\sum_{i} t_{ik}\right) \log \pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) ,

which has to be maximized subject to \sum_{k} \pi_k = 1.

The Lagrangian of this problem is

L(\theta) = Q(\theta, \theta') + \lambda \left(\sum_{k} \pi_k - 1\right) .

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.

G.1 Prior Probabilities

\frac{\partial L(\theta)}{\partial \pi_k} = 0 \iff \frac{1}{\pi_k} \sum_{i} t_{ik} + \lambda = 0 ,

where λ is identified from the constraint, leading to

\hat{\pi}_k = \frac{1}{n} \sum_{i} t_{ik} .


G.2 Means

\frac{\partial L(\theta)}{\partial \mu_k} = 0 \iff -\frac{1}{2} \sum_{i} t_{ik}\, 2\, \Sigma^{-1} (\mu_k - x_i) = 0
 \;\Rightarrow\; \hat{\mu}_k = \frac{\sum_{i} t_{ik} x_i}{\sum_{i} t_{ik}} .

G.3 Covariance Matrix

\frac{\partial L(\theta)}{\partial \Sigma^{-1}} = 0 \iff \underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}_{\text{as per Property 5}} = 0
 \;\Rightarrow\; \hat{\Sigma} = \frac{1}{n} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .


        Bibliography

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, Arxiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.


T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.


J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, ArXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D7.1-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D7.2-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.


T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.


C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Buhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.


Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York, Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Penalites Hierarchiques pour l'Integration de Connaissances dans les Modeles Statistiques. PhD thesis, Universite de Technologie de Compiegne, 2008.


M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Scholkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.


M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.




Contents

List of Figures
List of Tables
Notation and Symbols

I Context and Foundations

1 Context

2 Regularization for Feature Selection
  2.1 Motivations
  2.2 Categorization of Feature Selection Techniques
  2.3 Regularization
    2.3.1 Important Properties
    2.3.2 Pure Penalties
    2.3.3 Hybrid Penalties
    2.3.4 Mixed Penalties
    2.3.5 Sparsity Considerations
    2.3.6 Optimization Tools for Regularized Problems

II Sparse Linear Discriminant Analysis

Abstract

3 Feature Selection in Fisher Discriminant Analysis
  3.1 Fisher Discriminant Analysis
  3.2 Feature Selection in LDA Problems
    3.2.1 Inertia Based
    3.2.2 Regression Based

4 Formalizing the Objective
  4.1 From Optimal Scoring to Linear Discriminant Analysis
    4.1.1 Penalized Optimal Scoring Problem
    4.1.2 Penalized Canonical Correlation Analysis
    4.1.3 Penalized Linear Discriminant Analysis
    4.1.4 Summary
  4.2 Practicalities
    4.2.1 Solution of the Penalized Optimal Scoring Regression
    4.2.2 Distance Evaluation
    4.2.3 Posterior Probability Evaluation
    4.2.4 Graphical Representation
  4.3 From Sparse Optimal Scoring to Sparse LDA
    4.3.1 A Quadratic Variational Form
    4.3.2 Group-Lasso OS as Penalized LDA

5 GLOSS Algorithm
  5.1 Regression Coefficients Updates
    5.1.1 Cholesky decomposition
    5.1.2 Numerical Stability
  5.2 Score Matrix
  5.3 Optimality Conditions
  5.4 Active and Inactive Sets
  5.5 Penalty Parameter
  5.6 Options and Variants
    5.6.1 Scaling Variables
    5.6.2 Sparse Variant
    5.6.3 Diagonal Variant
    5.6.4 Elastic net and Structured Variant

6 Experimental Results
  6.1 Normalization
  6.2 Decision Thresholds
  6.3 Simulated Data
  6.4 Gene Expression Data
  6.5 Correlated Data
  Discussion

III Sparse Clustering Analysis

Abstract

7 Feature Selection in Mixture Models
  7.1 Mixture Models
    7.1.1 Model
    7.1.2 Parameter Estimation: The EM Algorithm
  7.2 Feature Selection in Model-Based Clustering
    7.2.1 Based on Penalized Likelihood
    7.2.2 Based on Model Variants
    7.2.3 Based on Model Selection

8 Theoretical Foundations
  8.1 Resolving EM with Optimal Scoring
    8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
    8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
    8.1.3 Clustering Using Penalized Optimal Scoring
    8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
  8.2 Optimized Criterion
    8.2.1 A Bayesian Derivation
    8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
  9.1 Mix-GLOSS
    9.1.1 Outer Loop: Whole Algorithm Repetitions
    9.1.2 Penalty Parameter Loop
    9.1.3 Inner Loop: EM Algorithm
  9.2 Model Selection

10 Experimental Results
  10.1 Tested Clustering Algorithms
  10.2 Results
  10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
  B.1 How to Solve the Eigenvector Decomposition
  B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
  D.1 Useful Properties
  D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
  G.1 Prior probabilities
  G.2 Means
  G.3 Covariance Matrix

Bibliography

List of Figures

1.1 MASH project logo
2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ‖β‖p
2.4 Two dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters
4.1 Graphical representation of the variational approach to Group-Lasso
5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image
6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"
9.1 Mix-GLOSS Loops Scheme
9.2 Mix-GLOSS model selection diagram
10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification
10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

          Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

          Sets

N        the set of natural numbers, N = {1, 2, . . .}
R        the set of reals
|A|      cardinality of a set A (for finite sets, the number of elements)
Ā        complement of set A

          Data

X          input domain
xi         input sample, xi ∈ X
X          design matrix, X = (x1⊤, . . . , xn⊤)⊤
xj         column j of X
yi         class indicator of sample i
Y          indicator matrix, Y = (y1⊤, . . . , yn⊤)⊤
z          complete data, z = (x, y)
Gk         set of the indices of observations belonging to class k
n          number of examples
K          number of classes
p          dimension of X
i, j, k    indices running over N

Vectors, Matrices and Norms

0          vector with all entries equal to zero
1          vector with all entries equal to one
I          identity matrix
A⊤         transpose of matrix A (ditto for vectors)
A−1        inverse of matrix A
tr(A)      trace of matrix A
|A|        determinant of matrix A
diag(v)    diagonal matrix with v on the diagonal
‖v‖1       L1 norm of vector v
‖v‖2       L2 norm of vector v
‖A‖F       Frobenius norm of matrix A


          Probability

E[·]         expectation of a random variable
var[·]       variance of a random variable
N(µ, σ2)     normal distribution with mean µ and variance σ2
W(W, ν)      Wishart distribution with ν degrees of freedom and W scale matrix
H(X)         entropy of random variable X
I(X, Y)      mutual information between random variables X and Y

          Mixture Models

yik          hard membership of sample i to cluster k
fk           distribution function for cluster k
tik          posterior probability of sample i to belong to cluster k
T            posterior probability matrix
πk           prior probability or mixture proportion for cluster k
µk           mean vector of cluster k
Σk           covariance matrix of cluster k
θk           parameter vector for cluster k, θk = (µk, Σk)
θ(t)         parameter vector at iteration t of the EM algorithm
f(X; θ)      likelihood function
L(θ; X)      log-likelihood function
LC(θ; X, Y)  complete log-likelihood function

          Optimization

J(·)     cost function
L(·)     Lagrangian
β̂        generic notation for the solution with respect to β
βls      least squares solution coefficient vector
A        active set
γ        step size to update the regularization path
h        direction to update the regularization path


          Penalized models

λ, λ1, λ2      penalty parameters
Pλ(θ)          penalty term over a generic parameter vector
βkj            coefficient j of discriminant vector k
βk             kth discriminant vector, βk = (βk1, . . . , βkp)
B              matrix of discriminant vectors, B = (β1, . . . , βK−1)
βj             jth row of B = (β1⊤, . . . , βp⊤)⊤
BLDA           coefficient matrix in the LDA domain
BCCA           coefficient matrix in the CCA domain
BOS            coefficient matrix in the OS domain
XLDA           data matrix in the LDA domain
XCCA           data matrix in the CCA domain
XOS            data matrix in the OS domain
θk             score vector k
Θ              score matrix, Θ = (θ1, . . . , θK−1)
Y              label matrix
Ω              penalty matrix
LCP(θ; X, Z)   penalized complete log-likelihood function
ΣB             between-class covariance matrix
ΣW             within-class covariance matrix
ΣT             total covariance matrix
Σ̂B             sample between-class covariance matrix
Σ̂W             sample within-class covariance matrix
Σ̂T             sample total covariance matrix
Λ              inverse of covariance matrix, or precision matrix
wj             weights
τj             penalty components of the variational approach


          Part I

          Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here to introduce the models and some basic concepts that will be used along this document. The state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


          1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of the website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors on a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we will also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Bienarcki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the space of extractors is defined using the RV coefficient, a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors (a small numerical sketch of this coefficient is given right after this list). A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
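To give a concrete flavour of this third tool, here is a minimal sketch of an RV-coefficient distance between feature tables. It assumes that the operators Oi are the centered Gram matrices XiXi⊤ built from each extractor's table; the data and the number of extractors are purely illustrative, not taken from the MASH platform.

```python
import numpy as np

def rv_coefficient(Xa, Xb):
    """RV coefficient between two column-centered tables sharing the same n rows."""
    Xa = Xa - Xa.mean(axis=0)
    Xb = Xb - Xb.mean(axis=0)
    Oa, Ob = Xa @ Xa.T, Xb @ Xb.T                 # operators built from each table
    num = np.trace(Oa @ Ob)                       # inner product between the two operators
    den = np.sqrt(np.trace(Oa @ Oa) * np.trace(Ob @ Ob))
    return num / den

rng = np.random.default_rng(0)
n = 50
tables = [rng.normal(size=(n, p)) for p in (5, 8, 12)]   # outputs of 3 hypothetical extractors
K = len(tables)
dist = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        dist[i, j] = 1.0 - rv_coefficient(tables[i], tables[j])  # small value = similar extractors
print(np.round(dist, 3))
```

Any standard technique, hierarchical clustering for instance, can then be applied to this distance matrix to group the extractors.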

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


          2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.
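As a minimal numerical illustration of the diagonal approximation mentioned above (synthetic numbers, not MASH data), the following sketch shows that with fewer samples than features the sample covariance matrix is singular, whereas its diagonal approximation can still be inverted:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                         # fewer samples than features
X = rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)            # p x p sample covariance, rank at most n - 1
print(np.linalg.matrix_rank(S))        # strictly smaller than p: S is singular

S_diag = np.diag(np.diag(S))           # keep only the variances, ignore dependencies
print(np.all(np.diag(S_diag) > 0))     # all diagonal entries positive: the inverse exists
```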

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.
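As an illustration of a filter model driven by an information measure, the sketch below ranks features by an estimate of mutual information with the labels and keeps the top ones (synthetic data; the use of scikit-learn's mutual_info_classif is just one possible choice of criterion):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# synthetic problem: 20 features, only a handful of them informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=2, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)  # information measure for each feature
ranking = np.argsort(scores)[::-1]                  # decreasing relevance
selected = ranking[:5]                              # filtered subset passed to the mining algorithm
print(selected, np.round(scores[selected], 3))
```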

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\[
\min_{\beta} \; J(\beta) + \lambda P(\beta) \tag{2.1}
\]

\[
\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \tag{2.2}
\]

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\[
\forall (\mathbf{x}_1, \mathbf{x}_2) \in \mathcal{X}^2, \quad f(t\mathbf{x}_1 + (1-t)\mathbf{x}_2) \le t f(\mathbf{x}_1) + (1-t) f(\mathbf{x}_2) \tag{2.3}
\]

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Null coefficients usually furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties

Regularizing a linear model with a norm like ‖β‖p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit of |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves, whose global minimum βls lies outside the penalties' admissible regions. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖0 = card{βj | βj ≠ 0}:

\[
\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \tag{2.4}
\]

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
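The combinatorial nature of the L0 constraint can be seen in the following best-subset sketch, which enumerates all subsets of at most t variables for a tiny least squares problem (hypothetical data; this brute-force search is only feasible for very small p):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, p, t = 40, 6, 2                                # at most t non-zero coefficients
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, 0., 0., -2., 0., 0.])
y = X @ beta_true + 0.1 * rng.normal(size=n)

best_rss, best_subset = np.inf, None
for k in range(1, t + 1):
    for subset in combinations(range(p), k):      # every subset of size at most t
        Xs = X[:, subset]
        coef = np.linalg.lstsq(Xs, y, rcond=None)[0]
        rss = np.sum((y - Xs @ coef) ** 2)
        if rss < best_rss:
            best_rss, best_subset = rss, subset
print(best_subset)                                # recovers the two informative columns (0, 3)
```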

$L_1$ Penalties The penalties built using $L_1$ norms induce sparsity and stability. The corresponding estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta}\; J(\beta) \quad \text{s.t.}\quad \sum_{j=1}^{p} |\beta_j| \le t . \qquad (2.5)$$

Despite all the advantages of the Lasso, the choice of the right penalty is not simply a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples $n$ is lower than the number of variables $p$, the maximum number of non-zero entries of $\beta$ is $n$. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where $n$ is typically some tens of individuals and $p$ several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter ($t$ or $\lambda$, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

$L_2$ Penalties The graphical interpretation of pure norm penalties in Figure 2.3 shows that the $L_2$ norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the $L_2$ norm involves the square root of the sum of all squared components. In practice, when using $L_2$ penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an $L_2$ penalized optimization problem looks like

$$\min_{\beta}\; J(\beta) + \lambda \|\beta\|_2^2 . \qquad (2.6)$$

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To illustrate this property, let us consider a least squares problem

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 , \qquad (2.7)$$

with solution $\beta^{\mathrm{ls}} = (X^\top X)^{-1} X^\top y$. If some input variables are highly correlated, the estimator $\beta^{\mathrm{ls}}$ is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 .$$

The solution to this problem is $\beta^{\ell_2} = (X^\top X + \lambda I_p)^{-1} X^\top y$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by $\lambda$. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{\mathrm{ls}})^2} . \qquad (2.8)$$

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component: every $\lambda_j$ is optimized to penalize more or less, depending on the influence of $\beta_j$ in the model.

Although $L_2$ penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

$L_\infty$ Penalties A special case of $L_p$ norms is the infinity norm, defined as $\|x\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_p|)$. The admissible region for a penalty like $\|\beta\|_\infty \le t$ is displayed in Figure 2.3. For the $L_\infty$ norm, the greyed out region is a square containing all the $\beta$ vectors whose largest coefficient is less than or equal to the value of the penalty parameter $t$.

This norm is not commonly used as a regularization term by itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems, there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm $\|\beta\|^{*}$ of a norm $\|\beta\|$ is defined as

$$\|\beta\|^{*} = \max_{w \in \mathbb{R}^p} \; \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1 .$$

In the case of an $L_q$ norm with $q \in [1, +\infty]$, the dual norm is the $L_r$ norm such that $1/q + 1/r = 1$. For example, the $L_2$ norm is self-dual, and the dual norm of the $L_1$ norm is the $L_\infty$ norm. This is one of the reasons why $L_\infty$ is important, even if it is not as popular a penalty as $L_1$ is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation: we can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), whose objective is to improve the Lasso penalization when $n \le p$. As recalled in Section 2.3.2, when $n \le p$ the Lasso penalty can select at most $n$ non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of $L_1$ and $L_2$ penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta}\; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 . \qquad (2.9)$$

The term in $\lambda_1$ is a Lasso penalty that induces sparsity in vector $\beta$; on the other side, the term in $\lambda_2$ is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when $n$ goes to infinity) of always making the right choice of relevant variables.


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by $L$ different groups of genes. Let us denote by $\mathcal{G}_\ell$ the group of genes of the $\ell$th process and by $d_\ell$ the number of genes (variables) in each group, $\forall \ell \in \{1, \ldots, L\}$. Thus, the dimension of vector $\beta$ is the sum of the number of genes of every group: $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in \mathcal{G}_\ell} |\beta_j|^s \Big)^{r/s} \Bigg)^{1/r} . \qquad (2.10)$$

The pair $(r, s)$ identifies the norms that are combined: an $L_s$ norm within groups and an $L_r$ norm between groups. The $L_s$ norm penalizes the variables in every group $\mathcal{G}_\ell$, while the $L_r$ norm penalizes the within-group norms. The pair $(r, s)$ is chosen so as to induce different properties in the resulting $\beta$ vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
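A small sketch of how the mixed norm (2.10), possibly with outer weights, can be evaluated; the function name, the grouping and the weights are hypothetical choices of ours, not part of the original text.

```python
import numpy as np

def mixed_norm(beta, groups, r, s, weights=None):
    """||beta||_(r,s): an L_s norm within each group, an L_r norm across groups."""
    groups = [np.asarray(g) for g in groups]
    if weights is None:
        weights = np.ones(len(groups))
    inner = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum((weights * inner) ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.0, -2.0, 0.5])
groups = [[0, 1], [2, 3, 4]]               # hypothetical grouping of the 5 variables
print(mixed_norm(beta, groups, r=1, s=2))  # group-Lasso penalty ||beta||_(1,2)
```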

Several combinations are available; the most popular is the norm $\|\beta\|_{(1,2)}$, known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure $L_1$ norm and a mixed $L_{1,2}$ norm. Many other mixings are possible, such as $\|\beta\|_{(1,4/3)}$ (Szafranski et al., 2008) or $\|\beta\|_{(1,\infty)}$ (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter, I have reviewed several possibilities for inducing sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to featurewise parsimonious models. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other $L_1$ penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the $L_1$ norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, $L_{1,2}$ or $L_{1,\infty}$ mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one on the right of Figure 2.6.


Figure 2.5: Admissible sets for the Lasso ((a) $L_1$) and the group-Lasso ((b) $L_{(1,2)}$).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters ((a) $L_1$-induced sparsity; (b) $L_{(1,2)}$ group-induced sparsity).


This pattern displays a solution where variables 3, 5 and 8 are removed.

2.3.6 Optimization Tools for Regularized Problems

Caramanis et al. (2012) provide a good collection of mathematical techniques and optimization methods for solving regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques classified into four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It can be described as an "active constraints" algorithm implemented along a regularization path, where the cost function is approximated by secant hyper-planes. Further details are given in the dedicated Chapter 5.

Subgradient Descent Subgradient descent is a generic optimization method that can be used for penalized problems where the subgradient of the loss function, $\partial J(\beta)$, and the subgradient of the regularizer, $\partial P(\beta)$, can be computed efficiently. On the one hand, it is essentially blind to the problem structure; on the other hand, many iterations are needed, so convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector $\beta^{(t+1)}$ is updated proportionally to the negative subgradient of the function at the current point $\beta^{(t)}$:

$$\beta^{(t+1)} = \beta^{(t)} - \alpha\,(s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)}) .$$
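As an illustration, a minimal NumPy sketch of this update for a Lasso-type problem, with $J(\beta) = \|y - X\beta\|_2^2$ and $P(\beta) = \|\beta\|_1$; the step size `alpha` and the iteration count are arbitrary choices of ours, not prescribed by the text.

```python
import numpy as np

def subgradient_lasso(X, y, lam, alpha=1e-3, n_iter=5000):
    """Plain subgradient descent for min_beta ||y - X beta||^2 + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = -2.0 * X.T @ (y - X @ beta)   # gradient of the quadratic loss
        s_prime = np.sign(beta)           # a subgradient of the L1 norm (0 at beta_j = 0)
        beta -= alpha * (s + lam * s_prime)
    return beta
```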

Coordinate Descent Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient $\beta_j$ gives

$$\beta_j = \frac{-\lambda\,\operatorname{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} .$$

In the literature, these algorithms are also referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution $\beta^{\mathrm{ls}}$ and updating the values with an iterative thresholding scheme where $\beta_j^{(t+1)} = S_\lambda\!\big(\frac{\partial J(\beta^{(t)})}{\partial \beta_j}\big)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) = \begin{cases} \dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex] \dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex] 0 & \text{if } \left|\frac{\partial J(\beta)}{\partial \beta_j}\right| \le \lambda \end{cases} \qquad (2.11)$$

The same principles define "block-coordinate descent" algorithms; in that case, the first order conditions are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
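A minimal sketch of such a cyclic soft-thresholding / coordinate descent scheme for the Lasso with squared loss, in the spirit of (2.11); the sweep count and the zero initialization are arbitrary choices of ours.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S_gamma(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_descent_lasso(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for min_beta ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)   # the sum_i x_ij^2 term of the update
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual excluding the contribution of coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta
```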

Active and Inactive Sets Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero $\beta_j$; it is usually denoted $\mathcal{A}$. The complement of the active set is the "inactive set", denoted $\bar{\mathcal{A}}$, which contains the indices of the variables whose $\beta_j$ is zero. Thus, the problem can be reduced to the dimensionality of $\mathcal{A}$.

Osborne et al. (2000a) proposed the first of these algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty $\mathcal{A}$ has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are expected to be selected.

Working set algorithms have to deal with three main tasks. First, there is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the descent direction of the objective function, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set $\mathcal{A}$ is augmented with the variable from the inactive set $\bar{\mathcal{A}}$ that most violates the optimality conditions of Problem (2.1). Finally, there is also a task for computing the optimality conditions; their expressions are essential for selecting the next variable to add to the active set and for testing whether a particular vector $\beta$ is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve $L_1$ regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and $L_1$ penalties (Roth, 2004), linear functions and $L_{1,2}$ penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of $L_0$, $L_1$ and $L_2$ penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the subgradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter $\lambda$ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the Least Angle Regression (LARS) algorithm, developed by Efron et al. (2004), that these techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set $\mathcal{A}^{(t)}$ and its corresponding solution $\beta^{(t)}$ have been determined, following the regularization path means looking for a direction $h$ and a step size $\gamma$ to update the solution as $\beta^{(t+1)} = \beta^{(t)} + \gamma h$. Afterwards, the active and inactive sets $\mathcal{A}^{(t+1)}$ and $\bar{\mathcal{A}}^{(t+1)}$ are updated, by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and the variable that should enter the active set, from the correlation with the residuals.

Proximal Methods Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function $J(\beta)$ and a non-differentiable penalty $\lambda P(\beta)$:

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 . \qquad (2.12)$$

They are iterative methods where the cost function $J(\beta)$ is linearized in the proximity of the solution $\beta^{(t)}$, so that the problem to solve at each iteration looks like (2.12), where the parameter $L > 0$ should be an upper bound on the Lipschitz constant of the gradient $\nabla J$. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) . \qquad (2.13)$$

The basic algorithm uses the solution to (2.13) as the next iterate $\beta^{(t+1)}$. However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates; in fact, setting $\lambda = 0$ in equation (2.13) recovers the standard gradient update rule.
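A minimal sketch of the basic proximal iteration (2.13) for the Lasso case, where the proximal operator of $\frac{\lambda}{L}\|\cdot\|_1$ is soft-thresholding; taking $L$ as the largest eigenvalue of $2X^\top X$ and a fixed iteration count are our own choices for the sketch.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Basic proximal gradient (ISTA-like) for min_beta ||y - X beta||^2 + lam * ||beta||_1."""
    L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()   # upper bound on the Lipschitz constant of grad J
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        z = beta - grad / L                       # gradient step on the smooth part
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
    return beta
```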


          Part II

          Sparse Linear Discriminant Analysis


          Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with $L_1$ penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part, we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data and generate a graphical display. The algorithm is detailed in Chapter 5, and our experimental results in Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe $n$ labeled observations belonging to $K$ groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of $n$ examples, with observations $x_i \in \mathbb{R}^p$ comprising $p$ features, and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the $K$ classes. It will be convenient to gather the observations in the $n \times p$ matrix $X = (x_1^\top, \ldots, x_n^\top)^\top$ and the corresponding labels in the $n \times K$ matrix $Y = (y_1^\top, \ldots, y_n^\top)^\top$.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_{\mathrm{B}} \beta}{\beta^\top \Sigma_{\mathrm{W}} \beta} , \qquad (3.1)$$

where $\beta$ is the discriminant direction used to project the data, and $\Sigma_{\mathrm{B}}$ and $\Sigma_{\mathrm{W}}$ are the $p \times p$ between-class and within-class covariance matrices, respectively, defined (for a $K$-class problem) as

$$\Sigma_{\mathrm{W}} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{G}_k} (x_i - \mu_k)(x_i - \mu_k)^\top , \qquad \Sigma_{\mathrm{B}} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in \mathcal{G}_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,$$

where $\mu$ is the sample mean of the whole dataset, $\mu_k$ the sample mean of class $k$, and $\mathcal{G}_k$ indexes the observations of class $k$.


This analysis can be extended to the multi-class framework with $K$ groups. In this case, $K-1$ discriminant vectors $\beta_k$ may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\!\left(B^\top \Sigma_{\mathrm{B}} B\right)}{\operatorname{tr}\!\left(B^\top \Sigma_{\mathrm{W}} B\right)} , \qquad (3.2)$$

where the matrix $B$ is built with the discriminant directions $\beta_k$ as columns. Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of $K-1$ subproblems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;& \beta_k^\top \Sigma_{\mathrm{B}} \beta_k \\ \text{s.t. } & \beta_k^\top \Sigma_{\mathrm{W}} \beta_k \le 1 , \\ & \beta_k^\top \Sigma_{\mathrm{W}} \beta_\ell = 0 , \quad \forall \ell < k . \end{aligned} \qquad (3.3)$$

The maximizer of subproblem $k$ is the eigenvector of $\Sigma_{\mathrm{W}}^{-1}\Sigma_{\mathrm{B}}$ associated with the $k$th largest eigenvalue (see Appendix C).

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the $K-1$ discriminant directions summarize the $p$ original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly aims at reducing the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually assessed with univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to endow LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away from each other (large between-class variance), and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;& \beta^\top \Sigma_{\mathrm{W}} \beta \\ \text{s.t. } & (\mu_1 - \mu_2)^\top \beta = 1 , \\ & \textstyle\sum_{j=1}^{p} |\beta_j| \le t , \end{aligned}$$

where $\mu_1$ and $\mu_2$ are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant rewritten in the form of $K-1$ constrained and penalized maximization problems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;& \beta_k^\top \Sigma_{\mathrm{B}}^{k} \beta_k - P_k(\beta_k) \\ \text{s.t. } & \beta_k^\top \Sigma_{\mathrm{W}} \beta_k \le 1 . \end{aligned}$$

The term to maximize is the projected between-class covariance $\beta_k^\top \Sigma_{\mathrm{B}} \beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top \Sigma_{\mathrm{W}} \beta_k$. The penalty $P_k(\beta_k)$ is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant $\beta_k$ vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem, but instead of performing separate estimations of $\Sigma_{\mathrm{W}}$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_{\mathrm{W}}^{-1}(\mu_1 - \mu_2)$, they estimate the product directly through constrained $L_1$ minimization:

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;& \|\beta\|_1 \\ \text{s.t. } & \left\| \hat{\Sigma} \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda . \end{aligned}$$

Sparsity is encouraged by the $L_1$ norm of vector $\beta$, and the parameter $\lambda$ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. Among those that address multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For $K > 2$, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging in the multi-class case (Duda et al., 2000; Friedman et al., 2009).

          Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix $Y$ is an $n \times K$ matrix containing the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample $i$ belongs to class $k$ and $y_{ik} = 0$ otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is $y_{ik} = 1$ if sample $i$ belongs to class $k$ and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or in generalizing the kernel target alignment measure (Guermeur et al., 2004).

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression; some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude were confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector $\beta$ is obtained by solving

$$\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,$$

where $y_i$ is the binary indicator of the label of pattern $x_i$. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built using the $\beta$ vector obtained for $\lambda = 0$, but a different intercept $\beta_0$ is required.

          Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables recovering exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix $\Omega$, leading to a problem expressed in compact form as

$$\min_{\Theta,\, B} \; \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left(B^\top \Omega B\right) \qquad (3.4a)$$
$$\text{s.t. } n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \qquad (3.4b)$$

where $\Theta \in \mathbb{R}^{K \times (K-1)}$ are the class scores, $B \in \mathbb{R}^{p \times (K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of $K-1$ problems:

$$\begin{aligned} \min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \;& \|Y\theta_k - X\beta_k\|^2 + \beta_k^\top \Omega \beta_k & (3.5a) \\ \text{s.t. } & n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 , & (3.5b) \\ & \theta_k^\top Y^\top Y \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , & (3.5c) \end{aligned}$$

where each $\beta_k$ corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_k \|Y\theta_k - X\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters and $\Omega$ is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|Y\theta_k - X\beta_k\|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K-1} \beta_{kj}^2} , \qquad (3.6)$$

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.


          4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables deriving variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6) due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For $K$ classes, this representation can be either complete, in dimension $K-1$, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

- there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;
- inputs are centered, that is, $X^\top 1_n = 0$;
- the quadratic penalty $\Omega$ is positive-semidefinite and such that $X^\top X + \Omega$ is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript $k$ to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in $(\theta, \beta)$, that is, convex in $\theta$ for each $\beta$ value and vice-versa. The problems are however non-convex: in particular, if $(\theta^\star, \beta^\star)$ is a solution, then $(-\theta^\star, -\beta^\star)$ is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to $K$, since we assumed that there are no empty classes. Moreover, as $X$ is centered, the $K-1$ first optimal scores are orthogonal to $1$ (and the $K$th problem would be solved by $\beta_K = 0$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), so as to simplify all expressions. The generic problem solved is thus

$$\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \; \|Y\theta - X\beta\|^2 + \beta^\top \Omega \beta \qquad (4.1a)$$
$$\text{s.t. } n^{-1}\, \theta^\top Y^\top Y \theta = 1 . \qquad (4.1b)$$

For a given score vector $\theta$, the discriminant direction $\beta$ that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

$$\beta_{\mathrm{OS}} = \left(X^\top X + \Omega\right)^{-1} X^\top Y \theta . \qquad (4.2)$$

The objective function (4.1a) is then

$$\begin{aligned} \|Y\theta - X\beta_{\mathrm{OS}}\|^2 + \beta_{\mathrm{OS}}^\top \Omega \beta_{\mathrm{OS}} &= \theta^\top Y^\top Y\theta - 2\,\theta^\top Y^\top X\beta_{\mathrm{OS}} + \beta_{\mathrm{OS}}^\top\left(X^\top X + \Omega\right)\beta_{\mathrm{OS}} \\ &= \theta^\top Y^\top Y\theta - \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta , \end{aligned}$$

where the second line stems from the definition of $\beta_{\mathrm{OS}}$ (4.2). Now, using the fact that the optimal $\theta$ obeys constraint (4.1b), the optimization problem is equivalent to

$$\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta , \qquad (4.3)$$

which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the $k$th largest eigenvector of $Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by

$$(Y^\top Y)^{-1} Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta = \alpha^2 \theta , \qquad (4.4)$$


where $\alpha^2$ is the maximal eigenvalue¹:

$$\begin{aligned} n^{-1}\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2\, n^{-1}\theta^\top(Y^\top Y)\theta \\ n^{-1}\theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta &= \alpha^2 . \end{aligned} \qquad (4.5)$$

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables $X$ and $Y$ is defined as follows:

$$\begin{aligned} \max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;& n^{-1}\theta^\top Y^\top X\beta & (4.6a) \\ \text{s.t. } & n^{-1}\, \theta^\top Y^\top Y\theta = 1 , & (4.6b) \\ & n^{-1}\, \beta^\top\left(X^\top X + \Omega\right)\beta = 1 . & (4.6c) \end{aligned}$$

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

$$n L(\beta, \theta, \nu, \gamma) = \theta^\top Y^\top X\beta - \nu\,(\theta^\top Y^\top Y\theta - n) - \gamma\,(\beta^\top(X^\top X + \Omega)\beta - n)$$
$$\Rightarrow \; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} = X^\top Y\theta - 2\gamma(X^\top X + \Omega)\beta$$
$$\Rightarrow \; \beta_{\mathrm{CCA}} = \frac{1}{2\gamma}(X^\top X + \Omega)^{-1} X^\top Y\theta .$$

Then, as $\beta_{\mathrm{CCA}}$ obeys (4.6c), we obtain

$$\beta_{\mathrm{CCA}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta}} , \qquad (4.7)$$

so that the optimal objective function (4.6a) can be expressed with $\theta$ alone:

$$n^{-1}\theta^\top Y^\top X\beta_{\mathrm{CCA}} = \frac{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta}{\sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta}} = \sqrt{n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta} ,$$

and the optimization problem with respect to $\theta$ can be restated as

$$\max_{\theta:\; n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y\theta . \qquad (4.8)$$

Hence, the p-OS and p-CCA problems produce the same optimal score vectors $\theta$. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

$$\beta_{\mathrm{OS}} = \alpha\, \beta_{\mathrm{CCA}} , \qquad (4.9)$$

¹The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where $\alpha$ is defined by (4.5).

The p-CCA optimization problem can also be written as a function of $\beta$ alone, using the optimality conditions for $\theta$:

$$n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} = Y^\top X\beta - 2\nu\, Y^\top Y\theta \quad \Rightarrow \quad \theta_{\mathrm{CCA}} = \frac{1}{2\nu}(Y^\top Y)^{-1} Y^\top X\beta . \qquad (4.10)$$

Then, as $\theta_{\mathrm{CCA}}$ obeys (4.6b), we obtain

$$\theta_{\mathrm{CCA}} = \frac{(Y^\top Y)^{-1} Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta}} , \qquad (4.11)$$

leading to the following expression of the optimal objective function:

$$n^{-1}\theta_{\mathrm{CCA}}^\top Y^\top X\beta = \frac{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta}{\sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta}} = \sqrt{n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta} .$$

The p-CCA problem can thus be solved with respect to $\beta$ by plugging this value into (4.6):

$$\begin{aligned} \max_{\beta \in \mathbb{R}^p} \;& n^{-1}\beta^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta & (4.12a) \\ \text{s.t. } & n^{-1}\, \beta^\top\left(X^\top X + \Omega\right)\beta = 1 , & (4.12b) \end{aligned}$$

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{\mathrm{CCA}}$ verifies

$$n^{-1} X^\top Y(Y^\top Y)^{-1} Y^\top X\beta_{\mathrm{CCA}} = \lambda\left(X^\top X + \Omega\right)\beta_{\mathrm{CCA}} , \qquad (4.13)$$

where $\lambda$ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:

$$\begin{aligned} & n^{-1}\beta_{\mathrm{CCA}}^\top X^\top Y(Y^\top Y)^{-1} Y^\top X\beta_{\mathrm{CCA}} = \lambda \\ \Rightarrow\; & n^{-1}\alpha^{-1}\beta_{\mathrm{CCA}}^\top X^\top Y(Y^\top Y)^{-1} Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta = \lambda \\ \Rightarrow\; & n^{-1}\alpha\,\beta_{\mathrm{CCA}}^\top X^\top Y\theta = \lambda \\ \Rightarrow\; & n^{-1}\theta^\top Y^\top X(X^\top X + \Omega)^{-1} X^\top Y\theta = \lambda \\ \Rightarrow\; & \alpha^2 = \lambda . \end{aligned}$$

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), whose denominator is $\alpha$; the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of $\alpha$ (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

$$\begin{aligned} \max_{\beta \in \mathbb{R}^p} \;& \beta^\top \Sigma_{\mathrm{B}} \beta & (4.14a) \\ \text{s.t. } & \beta^\top(\Sigma_{\mathrm{W}} + n^{-1}\Omega)\beta = 1 , & (4.14b) \end{aligned}$$

where $\Sigma_{\mathrm{B}}$ and $\Sigma_{\mathrm{W}}$ are respectively the sample between-class and within-class covariance matrices of the original $p$-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix $X$ is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a matrix representation using the projection operator $Y\left(Y^\top Y\right)^{-1} Y^\top$:

$$\Sigma_{\mathrm{T}} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = n^{-1} X^\top X$$
$$\Sigma_{\mathrm{B}} = \frac{1}{n}\sum_{k=1}^{K} n_k\, \mu_k\mu_k^\top = n^{-1} X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X$$
$$\Sigma_{\mathrm{W}} = \frac{1}{n}\sum_{k=1}^{K}\sum_{i:\, y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1}\left(X^\top X - X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\right) .$$

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

$$X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta_{\mathrm{LDA}} = \lambda\left(X^\top X + \Omega - X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\right)\beta_{\mathrm{LDA}}$$
$$X^\top Y\left(Y^\top Y\right)^{-1} Y^\top X\beta_{\mathrm{LDA}} = \frac{\lambda}{1-\lambda}\left(X^\top X + \Omega\right)\beta_{\mathrm{LDA}} .$$

The comparison of the last equation with the one verified by $\beta_{\mathrm{CCA}}$ (4.13) shows that $\beta_{\mathrm{LDA}}$ and $\beta_{\mathrm{CCA}}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that

$$\beta_{\mathrm{LDA}} = (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{CCA}} = \alpha^{-1}(1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{OS}} ,$$

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the $k$th problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

$$\begin{aligned} \min_{\Theta,\, B} \;& \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left(B^\top \Omega B\right) \\ \text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y\Theta = I_{K-1} . \end{aligned}$$

Let $A$ represent the $(K-1)\times(K-1)$ diagonal matrix with elements $\alpha_k$, the square root of the $k$th largest eigenvalue of $Y^\top X\left(X^\top X + \Omega\right)^{-1} X^\top Y$; we have

$$B_{\mathrm{LDA}} = B_{\mathrm{CCA}}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} = B_{\mathrm{OS}}\, A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}} , \qquad (4.15)$$

where $I_{K-1}$ is the $(K-1)\times(K-1)$ identity matrix.

At this point, the feature matrix $X$, which in the input space has dimensions $n \times p$, can be projected into the optimal scoring domain as an $n \times (K-1)$ matrix $X_{\mathrm{OS}} = X B_{\mathrm{OS}}$, or into the linear discriminant analysis space as an $n \times (K-1)$ matrix $X_{\mathrm{LDA}} = X B_{\mathrm{LDA}}$. Classification can be performed in any of those domains, provided the appropriate distance (based on the penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows (a code sketch of these steps is given after the list):

1. Solve the p-OS problem as
   $$B_{\mathrm{OS}} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta ,$$
   where $\Theta$ are the $K-1$ leading eigenvectors of $Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y$.

2. Translate the data samples $X$ into the LDA domain as $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, where $D = A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}}$.

3. Compute the matrix $M$ of centroids $\mu_k$ from $X_{\mathrm{LDA}}$ and $Y$.

4. Evaluate the distances $d(x, \mu_k)$ in the LDA domain as a function of $M$ and $X_{\mathrm{LDA}}$.

5. Translate distances into posterior probabilities and assign every sample $i$ to a class $k$ following the maximum a posteriori rule.

6. Optionally, produce a graphical representation.
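A minimal NumPy sketch of steps 1-5 for the quadratic case $\Omega = I_p$ is shown below; `Y` is the one-hot indicator matrix, all variable names are ours, and the sketch assumes non-empty classes and $0 < \alpha_k^2 < 1$.

```python
import numpy as np

def pos_lda_classify(X, Y, X_new, lam):
    """Steps 1-5 with Omega = I_p: p-OS fit, projection to the LDA domain (4.15),
    then nearest (prior-adjusted) centroid assignment."""
    n, p = X.shape
    K = Y.shape[1]
    x_bar = X.mean(axis=0)
    Xc, Xn = X - x_bar, X_new - x_bar                             # centered inputs
    R = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ Y)    # (X'X + lam I)^{-1} X'Y
    M = Y.T @ Xc @ R                                              # Y'X (X'X + lam I)^{-1} X'Y
    D = Y.T @ Y                                                   # diagonal class counts
    Dm12 = np.diag(1.0 / np.sqrt(np.diag(D)))
    evals, V = np.linalg.eigh(Dm12 @ M @ Dm12)                    # eigenproblem (4.4)
    order = np.argsort(evals)[::-1][:K - 1]
    alpha2 = evals[order]                                         # alpha_k^2, assumed in (0, 1)
    Theta = np.sqrt(n) * Dm12 @ V[:, order]                       # n^{-1} Theta' D Theta = I
    B_os = R @ Theta                                              # step 1
    Dmat = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))        # A^{-1}(I - A^2)^{-1/2}
    X_lda, Xn_lda = Xc @ B_os @ Dmat, Xn @ B_os @ Dmat            # step 2
    centroids = Dm12 @ Dm12 @ Y.T @ X_lda                         # step 3: class means
    log_prior = np.log(np.diag(D) / n)
    d2 = ((Xn_lda[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)   # step 4
    return np.argmin(d2 - 2.0 * log_prior[None, :], axis=1)            # step 5 (MAP)
```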


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

$$\begin{aligned} \min_{\Theta \in \mathbb{R}^{K\times(K-1)},\, B \in \mathbb{R}^{p\times(K-1)}} \;& \|Y\Theta - XB\|_F^2 + \lambda \operatorname{tr}\!\left(B^\top\Omega B\right) & (4.16a) \\ \text{s.t. } & n^{-1}\, \Theta^\top Y^\top Y\Theta = I_{K-1} , & (4.16b) \end{aligned}$$

where $\Theta$ are the class scores, $B$ the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in $\Theta$ and $B$: the optimal $B_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to $\Theta$, and the optimization with respect to $B$ is obtained in closed form, as a linear combination of the optimal scores $\Theta$ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize $\Theta$ to $\Theta^0$ such that $n^{-1}\, \Theta^{0\top} Y^\top Y\Theta^0 = I_{K-1}$.

2. Compute $B = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta^0$.

3. Set $\Theta$ to be the $K-1$ leading eigenvectors of $Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y$.

4. Compute the optimal regression coefficients
   $$B_{\mathrm{OS}} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta . \qquad (4.17)$$

Defining $\Theta^0$ in Step 1, instead of using directly $\Theta$ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top} Y^\top X\left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y\Theta^0$, which is computed as $\Theta^{0\top} Y^\top X B$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form $B^\top\Omega B$. However, when an $L_1$ penalty is applied in (4.16), the optimization algorithm requires iterative updates of $B$ and $\Theta$. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample $x_i$ is assigned to class $k$ if it is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with parameters estimated from training data (sample estimators $\hat{\mu}_k$ and $\hat{\Sigma}_{\mathrm{W}}$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to class $k$ if the distance

$$d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{\mathrm{W}\Omega}^{-1} (x_i - \mu_k) - 2\log\!\left(\frac{n_k}{n}\right) \qquad (4.18)$$

is minimized among all $k$. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class $k$. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\Sigma_{\mathrm{W}\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

$$\Sigma_{\mathrm{W}\Omega}^{-1} = \left(n^{-1}(X^\top X + \lambda\Omega) - \Sigma_{\mathrm{B}}\right)^{-1} = \left(n^{-1} X^\top X - \Sigma_{\mathrm{B}} + n^{-1}\lambda\Omega\right)^{-1} = \left(\Sigma_{\mathrm{W}} + n^{-1}\lambda\Omega\right)^{-1} . \qquad (4.19)$$

Before explaining how to compute the distances, let us summarize some clarifying points:

- The solution $B_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification.
- In the LDA domain (the space of discriminant variates $X_{\mathrm{LDA}}$), classification is based on Euclidean distances.
- Classification can be done in a reduced-rank space of dimension $R < K-1$ by using the first $R$ discriminant directions $\{\beta_k\}_{k=1}^{R}$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

$$\|(x_i - \mu_k) B_{\mathrm{OS}}\|_{\Sigma_{\mathrm{W}\Omega}}^2 - 2\log(\pi_k) ,$$

where $\pi_k$ is the estimated class prior and $\|\cdot\|_{S}$ is the Mahalanobis distance assuming within-class covariance $S$. If classification is done in the p-LDA domain, it is

$$\left\|(x_i - \mu_k) B_{\mathrm{OS}} A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}}\right\|_2^2 - 2\log(\pi_k) ,$$

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let $d(x, \mu_k)$ be the distance between $x$ and $\mu_k$ defined as in (4.18). Under the assumption that the classes are Gaussian, the posterior probabilities $p(y_k = 1 \,|\, x)$ can be estimated as

$$p(y_k = 1 \,|\, x) \propto \exp\!\left(-\frac{d(x, \mu_k)}{2}\right) \propto \pi_k \exp\!\left(-\frac{1}{2}\left\|(x - \mu_k) B_{\mathrm{OS}} A^{-1}\left(I_{K-1} - A^2\right)^{-\frac{1}{2}}\right\|_2^2\right) . \qquad (4.20)$$

Those probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \mu_k)$ take large values, $\exp\left(-\frac{d(x, \mu_k)}{2}\right)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

$$p(y_k = 1 \,|\, x) = \frac{\pi_k \exp\!\left(-\frac{d(x, \mu_k)}{2}\right)}{\sum_{\ell} \pi_\ell \exp\!\left(-\frac{d(x, \mu_\ell)}{2}\right)} = \frac{\pi_k \exp\!\left(-\frac{d(x, \mu_k)}{2} + \frac{d_{\max}}{2}\right)}{\sum_{\ell} \pi_\ell \exp\!\left(-\frac{d(x, \mu_\ell)}{2} + \frac{d_{\max}}{2}\right)} ,$$

where $d_{\max} = \max_k d(x, \mu_k)$.
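A small sketch of this normalization; the variant below shifts every exponent by the largest one (equivalently, by the smallest distance), which is the usual log-sum-exp safeguard against both underflow and overflow, in the same spirit as the trick above.

```python
import numpy as np

def posteriors(d, priors):
    """Posterior probabilities from squared distances d (m x K) and class priors (K,)."""
    logits = np.log(priors)[None, :] - d / 2.0
    logits -= logits.max(axis=1, keepdims=True)   # shift exponents: largest becomes 0
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```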

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $X_{\mathrm{OS}}$ or of the discriminant variates $X_{\mathrm{LDA}}$, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top\Omega\beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and $\Omega$ is positive definite). Quadratic penalties have interesting properties, but, as recalled in Section 2.3, they do not induce sparsity. In this respect, $L_1$ penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces zeros in the groups of coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $B^\top\Omega B$.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j^2 \, \frac{\|\beta^j\|_2^2}{\tau_j}
\qquad (4.21a)
\]
\[
\text{s.t.} \quad \sum_j \tau_j - \sum_j w_j \|\beta^j\|_2 \le 0 ,
\qquad (4.21b)
\]
\[
\phantom{\text{s.t.}} \quad \tau_j \ge 0 , \; j = 1, \dots, p ,
\qquad (4.21c)
\]
where $B \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$, $B = (\beta^{1\top}, \dots, \beta^{p\top})^\top$, and $w_j$ are predefined nonnegative weights. The cost function $J(B)$ in our context is the OS regression loss $\frac{1}{2}\|Y\Theta - XB\|_2^2$; from now on, for the sake of simplicity, we keep the generic notation $J(B)$. Here and in what follows, $b/\tau$ is defined by continuation at zero, as $b/0 = +\infty$ if $b \neq 0$ and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variable $\tau_j$. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2$.

Proof. The Lagrangian of Problem (4.21) is
\[
L = J(B) + \lambda \sum_{j=1}^{p} w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j}
+ \nu_0 \Big( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \|\beta^j\|_2 \Big)
- \sum_{j=1}^{p} \nu_j \tau_j .
\]


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for $\tau_j$ are
\[
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
\;\Leftrightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Leftrightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0 .
\]
The last line is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$. Complementary slackness states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for constraint $g_j(\tau_j) \le 0$. As a result, the optimal value of $\tau_j$ is
\[
\tau_j^\star = \sqrt{ \frac{\lambda w_j^2 \|\beta^j\|_2^2}{\nu_0} }
= \sqrt{\frac{\lambda}{\nu_0}} \, w_j \|\beta^j\|_2 .
\qquad (4.22)
\]

We note that $\nu_0 \neq 0$ if there is at least one coefficient $\beta_{jk} \neq 0$; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):
\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \|\beta^j\|_2 = 0 ,
\qquad (4.23)
\]
so that $\tau_j^\star = w_j \|\beta^j\|_2$. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso problem
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 .
\qquad (4.24)
\]

So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as $\lambda B^\top \Omega B$, where
\[
\Omega = \mathrm{diag}\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \right) ,
\qquad (4.25)
\]
with $\tau_j = w_j \|\beta^j\|_2$, resulting in the diagonal components of $\Omega$:
\[
(\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} .
\qquad (4.26)
\]
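As a quick numerical sanity check of Lemma 4.1 and of (4.25)-(4.26), the following MATLAB fragment (with arbitrary made-up data, names ours) verifies that, at the optimal $\tau_j = w_j\|\beta^j\|_2$, the quadratic penalty equals the group-Lasso penalty.

```matlab
% Check that the variational form recovers the group-Lasso penalty (Lemma 4.1).
rng(0); p = 6; K = 4; lambda = 0.5;
B = randn(p, K-1); B(3,:) = 0;                 % one inactive row of coefficients
w = ones(p, 1);                                % predefined nonnegative weights
rownorm = sqrt(sum(B.^2, 2));                  % ||beta^j||_2 for each row
tau = w .* rownorm;                            % optimal tau_j, from (4.22)-(4.23)
active = tau > 0;                              % 0/0 is handled by continuation at zero
quad   = lambda * sum(w(active).^2 .* rownorm(active).^2 ./ tau(active));
gLasso = lambda * sum(w .* rownorm);
fprintf('quadratic form: %.4f, group-Lasso: %.4f\n', quad, gLasso);
```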

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Section 5.

The first property states that the quadratic formulation is convex when $J$ is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If $J$ is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \|\beta\|_2^2 / \tau$, known as the perspective function of $f(\beta) = \|\beta\|_2^2$, is convex in $(\beta, \tau)$ (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B, \tau)$.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $B \in \mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (4.24) is
\[
\left\{ V \in \mathbb{R}^{p\times(K-1)} : V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} ,
\qquad (4.27)
\]
where $G \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $g^j \in \mathbb{R}^{K-1}$, $G = (g^{1\top}, \dots, g^{p\top})^\top$, defined as follows. Let $\mathcal{S}(B)$ denote the columnwise support of $B$, $\mathcal{S}(B) = \{ j \in \{1, \dots, p\} : \|\beta^j\|_2 \neq 0 \}$; then we have
\[
\forall j \in \mathcal{S}(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j ,
\qquad (4.28)
\]
\[
\forall j \notin \mathcal{S}(B), \quad \|g^j\|_2 \le w_j .
\qquad (4.29)
\]


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When $\|\beta^j\|_2 \neq 0$, the gradient of the penalty with respect to $\beta^j$ is
\[
\frac{\partial \left( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \right)}{\partial \beta^j}
= \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} .
\qquad (4.30)
\]
At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):
\[
\partial_{\beta^j} \left( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \right)
= \partial_{\beta^j} \left( \lambda w_j \|\beta^j\|_2 \right)
= \left\{ \lambda w_j v \in \mathbb{R}^{K-1} : \|v\|_2 \le 1 \right\} ,
\qquad (4.31)
\]
which gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if $J$ is strictly convex. All critical points $B$ of the objective function verifying the following conditions are global minima:
\[
\forall j \in \mathcal{S}, \quad \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \|\beta^j\|_2^{-1} \beta^j = 0 ,
\qquad (4.32a)
\]
\[
\forall j \in \bar{\mathcal{S}}, \quad \left\| \frac{\partial J(B)}{\partial \beta^j} \right\|_2 \le \lambda w_j ,
\qquad (4.32b)
\]
where $\mathcal{S} \subseteq \{1, \dots, p\}$ denotes the set of non-zero row vectors $\beta^j$ and $\bar{\mathcal{S}}$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
\[
B_{OS} = \operatorname*{argmin}_{B \in \mathbb{R}^{p\times(K-1)}} \; \min_{\Theta \in \mathbb{R}^{K\times(K-1)}} \;
\frac{1}{2} \|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2
\quad \text{s.t.} \quad n^{-1} \Theta^\top Y^\top Y \Theta = I_{K-1}
\]


is equivalent to the penalized LDA problem
\[
B_{LDA} = \operatorname*{argmax}_{B \in \mathbb{R}^{p\times(K-1)}} \; \mathrm{tr}\left( B^\top \Sigma_B B \right)
\quad \text{s.t.} \quad B^\top (\Sigma_W + n^{-1}\lambda\Omega) B = I_{K-1} ,
\]
where $\Omega = \mathrm{diag}\left( \frac{w_1^2}{\tau_1}, \dots, \frac{w_p^2}{\tau_p} \right)$, with
\[
\Omega_{jj} =
\begin{cases}
+\infty & \text{if } \beta^j_{OS} = 0 , \\
w_j \|\beta^j_{OS}\|_2^{-1} & \text{otherwise.}
\end{cases}
\qquad (4.33)
\]
That is, $B_{LDA} = B_{OS}\, \mathrm{diag}\left( \alpha_k^{-1} (1-\alpha_k^2)^{-1/2} \right)$, where $\alpha_k \in (0,1)$ is the $k$th leading eigenvalue of
\[
n^{-1}\, Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y .
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for $K = 2$, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form $\mathrm{tr}\left( B^\top \Omega B \right)$.
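In practice, Proposition 4.1 translates into a one-line rescaling of the optimal scoring directions. A MATLAB sketch, assuming the matrix B_os and the column vector alpha of the $K-1$ leading eigenvalues are already available (hypothetical variable names):

```matlab
% Rescale optimal scoring directions into LDA discriminant directions
% (Proposition 4.1): B_LDA = B_OS * diag( alpha_k^-1 * (1 - alpha_k^2)^(-1/2) ).
B_lda = B_os * diag(1 ./ (alpha .* sqrt(1 - alpha.^2)));
```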


          5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased/decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac{1}{2}\|Y\Theta - XB\|_2^2$.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say $B = 0$, thus defining the set $\mathcal{A}$ of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below.

1. Update the coefficient matrix $B$ within the current active set $\mathcal{A}$, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix $B$ within the current active set $\mathcal{A}$. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving $(K-1)$ independent $\mathrm{card}(\mathcal{A})$-dimensional problems instead of a single $(K-1)\times\mathrm{card}(\mathcal{A})$-dimensional problem. The interaction between the $(K-1)$ problems is relegated to the common adaptive quadratic penalty $\Omega$. This decomposition is especially attractive, as we then solve $(K-1)$ similar systems
\[
\left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 ,
\qquad (5.1)
\]


Figure 5.1: GLOSS block diagram. The chart initializes the model ($\lambda$, $B$) and the active set (all $j$ with $\|\beta^j\|_2 > 0$), then loops: solve the p-OS problem until the first optimality condition holds on the active set, move to the inactive set any active variable that must leave it, and test the second optimality condition on the inactive set to activate new variables; when no variable moves, $\Theta$ is computed and $B$ is updated.


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1,...,p} : ||β^j||_2 > 0 },
            Θ^0 such that n^{-1} Θ^{0⊤} Y^⊤ Y Θ^0 = I_{K-1},
            convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ||β^j||_2^{-1}
        B_A ← ( X_A^⊤ X_A + λΩ )^{-1} X_A^⊤ Y Θ^0
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for j ∈ A such that ||β^j||_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j}; go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in the complement of A
    j* ← argmax_{j ∉ A} || ∂J/∂β^j ||_2
    if || ∂J/∂β^{j*} ||_2 < λ then
        convergence ← true   (B is optimal)
    else
        A ← A ∪ {j*}
    end if
until convergence
(s, V) ← eigenanalyze( Θ^{0⊤} Y^⊤ X_A B ), that is,
         Θ^{0⊤} Y^⊤ X_A B V_k = s_k V_k, k = 1,...,K-1
Θ ← Θ^0 V;  B ← B V;  α_k ← n^{-1/2} s_k^{1/2}, k = 1,...,K-1
Output: Θ, B, α
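The following MATLAB sketch mirrors the control flow of Algorithm 1 under simplifying assumptions (unit weights $w_j = 1$, fixed score matrix $\Theta^0$, a fixed number of inner reweighting iterations and crude tolerances). It illustrates the three steps, not the actual GLOSS implementation.

```matlab
function B = gloss_sketch(X, Y, Theta0, lambda, maxit)
% Schematic active-set loop for adaptively penalized optimal scoring (sketch only).
[n, p] = size(X); K1 = size(Theta0, 2);
B = zeros(p, K1); A = false(p, 1);              % start with an empty active set
R = X' * (X * B - Y * Theta0);                  % p x (K-1) gradient of J
for it = 1:maxit
    % Step 3: activate the strongest violator of the second optimality condition
    viol = sqrt(sum(R.^2, 2)); viol(A) = -inf;
    [v, j] = max(viol);
    if v < lambda, break; end                   % all optimality conditions hold
    A(j) = true;
    B(j,:) = -R(j,:) / (X(:,j)' * X(:,j) + lambda);   % small warm start for the new row
    % Step 1: iteratively reweighted penalized least squares on the active set
    for inner = 1:50
        nrm = max(sqrt(sum(B(A,:).^2, 2)), 1e-8);
        Omega = diag(1 ./ nrm);                 % adaptive quadratic penalty (4.26), w_j = 1
        B(A,:) = (X(:,A)' * X(:,A) + lambda * Omega) \ (X(:,A)' * Y * Theta0);
    end
    % Step 2: deactivate variables whose coefficients have vanished
    gone = A & (sqrt(sum(B.^2, 2)) < 1e-6);
    B(gone,:) = 0; A(gone) = false;
    R = X' * (X * B - Y * Theta0);              % refresh the gradient
end
end
```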


where $X_{\mathcal{A}}$ denotes the columns of $X$ indexed by $\mathcal{A}$, and $\beta_k$ and $\theta_k^0$ denote the $k$th column of $B$ and $\Theta^0$, respectively. These linear systems only differ in the right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" $\Omega$ for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the $(K-1)$ systems together, (5.1) leads to
\[
(X^\top X + \lambda\Omega) B = X^\top Y \Theta .
\qquad (5.2)
\]
Defining the Cholesky decomposition as $C^\top C = (X^\top X + \lambda\Omega)$, (5.2) is solved efficiently as follows:
\[
C^\top C B = X^\top Y \Theta
\;\Leftrightarrow\; C B = C^\top \backslash \left( X^\top Y \Theta \right)
\;\Leftrightarrow\; B = C \backslash \left( C^\top \backslash \left( X^\top Y \Theta \right) \right) ,
\qquad (5.3)
\]
where the symbol "$\backslash$" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
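A minimal MATLAB transcription of (5.3), assuming X, Y, Theta, Omega and lambda are available in the workspace:

```matlab
% Solve (X'X + lambda*Omega) B = X'Y*Theta for all K-1 columns with a single
% Cholesky factorization, following (5.3).
C = chol(X' * X + lambda * Omega);   % upper triangular, C'*C = X'X + lambda*Omega
B = C \ (C' \ (X' * Y * Theta));     % two triangular solves shared by all columns
```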

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer $\Omega$ is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of $\Omega$ reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $X^\top X + \lambda\Omega$. This difficulty can be avoided using the following equivalent expression:
\[
B = \Omega^{-1/2} \left( \Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I \right)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 ,
\qquad (5.4)
\]
where the conditioning of $\Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I$ is always well-behaved provided $X$ is appropriately normalized (recall that $0 \le 1/\omega_j \le 1$). This more stable expression demands more computation and is thus reserved to cases with large $\omega_j$ values. Our code is otherwise based on expression (5.2).
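A possible MATLAB transcription of (5.4), again with hypothetical variable names; Omega is the diagonal penalty matrix of (4.33):

```matlab
% More stable update (5.4) when some omega_j become very large.
s  = 1 ./ sqrt(diag(Omega));                  % Omega^(-1/2), stored as a vector
Xs = bsxfun(@times, X, s');                   % X * Omega^(-1/2): scale columns of X
B  = bsxfun(@times, ...
       (Xs' * Xs + lambda * eye(numel(s))) \ (Xs' * Y * Theta0), s);  % left-multiply by Omega^(-1/2)
```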

5.2 Score Matrix

The optimal score matrix $\Theta$ is made of the $K-1$ leading eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$. This eigen-analysis is actually solved in the form $\Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta$ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of $(X^\top X + \Omega)^{-1}$, which involves the inversion of a $p \times p$ matrix. Let $\Theta^0$ be an arbitrary $K\times(K-1)$ matrix whose range includes the $K-1$ leading eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$.¹ Then, solving the $K-1$ systems (5.3) provides the value of $B^0 = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0$. This $B^0$ matrix can be identified in the expression to eigenanalyze as
\[
\Theta^{0\top} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .
\]
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the $(K-1)\times(K-1)$ matrix $\Theta^{0\top} Y^\top X B^0 = V \Lambda V^\top$. Defining $\Theta = \Theta^0 V$, we have $\Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Lambda$, and when $\Theta^0$ is chosen such that $n^{-1}\Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$, we also have that $n^{-1}\Theta^\top Y^\top Y \Theta = I_{K-1}$, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of $\Lambda$ are sorted in decreasing order, $\Theta$ is an optimal solution to the p-OS problem. Finally, once $\Theta$ has been computed, the corresponding optimal regression coefficients $B$ satisfying (5.2) are simply recovered using the mapping from $\Theta^0$ to $\Theta$, that is, $B = B^0 V$. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which $\Omega$ is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix $B$ and the score matrix $\Theta$. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which optimality conditions (4.32a) and (4.32b) can be deduced. Both expressions require the computation of the gradient of the objective function
\[
\frac{1}{2} \|Y\Theta - XB\|_2^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 .
\qquad (5.5)
\]
Let $J(B)$ be the data-fitting term $\frac{1}{2}\|Y\Theta - XB\|_2^2$. Its gradient with respect to the $j$th row of $B$, $\beta^j$, is the $(K-1)$-dimensional vector
\[
\frac{\partial J(B)}{\partial \beta^j} = x_j^\top (XB - Y\Theta) ,
\]
where $x_j$ is the $j$th column of $X$. Hence the first optimality condition (4.32a) can be computed for every variable $j$ as
\[
x_j^\top (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} = 0 .
\]

¹ As $X$ is centered, $1_K$ belongs to the null space of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$. It is thus sufficient to choose $\Theta^0$ orthogonal to $1_K$ to ensure that its range spans the leading eigenvectors of $Y^\top X (X^\top X + \Omega)^{-1} X^\top Y$. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set $\Theta^0 = (Y^\top Y)^{-1/2} U$, where $U$ is a $K\times(K-1)$ matrix whose columns are orthonormal vectors orthogonal to $1_K$.


The second optimality condition (4.32b) can be computed for every variable $j$ as
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .
\]

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let $\mathcal{A}$ be the active set, containing the variables that have already been considered relevant. A variable $j$ can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
\[
j^\star = \operatorname*{argmax}_{j} \; \max\left( \left\| x_j^\top (XB - Y\Theta) \right\|_2 - \lambda w_j ,\; 0 \right) .
\]
The exclusion of a variable belonging to the active set $\mathcal{A}$ is considered if the norm $\|\beta^j\|_2$ is small and if, after setting $\beta^j$ to zero, the following optimality condition holds:
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .
\]
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
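For illustration, the activation test of this section can be written in MATLAB as follows (A is a logical vector marking the active set, w the vector of penalty weights; all names are ours):

```matlab
% Activation test: pick the strongest violator of condition (4.32b).
G = X' * (X * B - Y * Theta);                  % gradient of J, one row per variable
score = sqrt(sum(G.^2, 2)) - lambda * w;       % positive value => (4.32b) is violated
score(A) = -inf;                               % restrict the test to inactive variables
[worst, jstar] = max(score);
if worst > 0, A(jstar) = true; end             % activate one variable at a time
```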

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of $\lambda$. The other strategy is to compute the solution path for several values of $\lambda$: GLOSS then looks for the maximum value of the penalty parameter, $\lambda_{\max}$, such that $B \neq 0$, and solves the p-OS problem for decreasing values of $\lambda$, until a prescribed number of features are declared active.

The maximum value of the penalty parameter $\lambda_{\max}$, corresponding to a null $B$ matrix, is obtained by computing the optimality condition (4.32b) at $B = 0$:
\[
\lambda_{\max} = \max_{j \in \{1,\dots,p\}} \; \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 .
\]
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties $\lambda_1 = \lambda_{\max} > \dots > \lambda_t > \dots > \lambda_T = \lambda_{\min} \ge 0$, by regularly decreasing the penalty, $\lambda_{t+1} = \lambda_t / 2$, and using a warm-start strategy where the feasible initial guess for $B(\lambda_{t+1})$ is initialized with $B(\lambda_t)$. The final penalty parameter $\lambda_{\min}$ is specified in the optimization process when the maximum number of desired active variables is attained (by default, the minimum of $n$ and $p$).
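A sketch of the path strategy in MATLAB; gloss_sketch_warm stands for any warm-started solver of the p-OS problem (a hypothetical function name), and p, n, K, T, w and Theta0 are assumed to be defined:

```matlab
% Largest useful penalty (B = 0) and a halving regularization path.
lambda_max = max(sqrt(sum((X' * (Y * Theta0)).^2, 2)) ./ w);   % (4.32b) at B = 0
lambdas = lambda_max * 2 .^ -(0:T-1);          % lambda_{t+1} = lambda_t / 2
B = zeros(p, K-1);
for t = 1:numel(lambdas)
    B = gloss_sketch_warm(X, Y, Theta0, lambdas(t), B);  % warm start from previous solution
    if nnz(any(B, 2)) >= min(n, p), break; end           % enough variables are active
end
```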


5.6 Options and Variants

5.6.1 Scaling Variables

As with most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm, or equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis: Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al. 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2
= \min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left( \Theta^\top Y^\top Y \Theta - 2\,\Theta^\top Y^\top X B + n B^\top \Sigma_T B \right)
\]
are replaced by
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\left( \Theta^\top Y^\top Y \Theta - 2\,\Theta^\top Y^\top X B + n B^\top (\Sigma_B + \mathrm{diag}(\Sigma_W)) B \right) .
\]
Note that this variant only requires $\mathrm{diag}(\Sigma_W) + \Sigma_B + n^{-1}\Omega$ to be positive definite, which is a weaker requirement than $\Sigma_T + n^{-1}\Omega$ positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

          55

          5 GLOSS Algorithm

[3×3 pixel grid, numbered bottom-to-top: 7 8 9 / 4 5 6 / 1 2 3]

Ω_L =
  3 -1  0 -1 -1  0  0  0  0
 -1  5 -1 -1 -1 -1  0  0  0
  0 -1  3  0 -1 -1  0  0  0
 -1 -1  0  5 -1  0 -1 -1  0
 -1 -1 -1 -1  8 -1 -1 -1 -1
  0 -1 -1  0 -1  5  0 -1 -1
  0  0  0 -1 -1  0  3 -1  0
  0  0  0 -1 -1 -1 -1  5 -1
  0  0  0  0 -1 -1  0 -1  3

Figure 5.2: Graph and Laplacian matrix for a 3×3 image.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of pixels in a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix $\Omega_L$ is positive semi-definite, and the penalty $\beta^\top \Omega_L \beta$ favors, among vectors of identical $L_2$ norms, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector $(1,1,0,1,1,0,0,0,0)^\top$, which is the indicator of the neighbors of pixel 1, and it is 17 for the vector $(-1,1,0,1,1,0,0,0,0)^\top$, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty has just to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
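For illustration, the Laplacian matrix of Figure 5.2 can be generated for an arbitrary image size with a few lines of base MATLAB (8-neighborhood); the resulting $\Omega_L$ is then simply added to the quadratic part of the GLOSS criterion.

```matlab
% Laplacian penalty of the 8-neighbor graph of an r-by-c image.
r = 3; c = 3;                                     % 3x3 reproduces Figure 5.2; 16x16 for USPS
[I, J] = ndgrid(1:r, 1:c);
coord = [I(:), J(:)];                             % pixel coordinates
D = max(abs(bsxfun(@minus, coord(:,1), coord(:,1)')), ...
        abs(bsxfun(@minus, coord(:,2), coord(:,2)')));   % Chebyshev distances
Adj = double(D == 1);                             % 8-neighborhood adjacency
OmegaL = diag(sum(Adj, 2)) - Adj;                 % graph Laplacian
% the smoothness term beta' * OmegaL * beta is added to the quadratic penalty
```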


          6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al. 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing the parsimony capacities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites: PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix $\Sigma_T$ to ones, or the diagonal of the within-class covariance matrix $\Sigma_W$ to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest to investigate other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split in a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample $i$ is in class $k$, then $x_i \sim N(\mu_k, I)$, where $\mu_{1j} = 0.7 \times 1_{(1 \le j \le 25)}$, $\mu_{2j} = 0.7 \times 1_{(26 \le j \le 50)}$, $\mu_{3j} = 0.7 \times 1_{(51 \le j \le 75)}$, $\mu_{4j} = 0.7 \times 1_{(76 \le j \le 100)}$.

Simulation 2: Mean shift with dependent features. There are two classes. If sample $i$ is in class 1, then $x_i \sim N(0, \Sigma)$, and if $i$ is in class 2, then $x_i \sim N(\mu, \Sigma)$, with $\mu_j = 0.6 \times 1_{(j \le 200)}$. The covariance structure is block diagonal, with 5 blocks, each of dimension $100 \times 100$. The blocks have $(j, j')$ element $0.6^{|j-j'|}$. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes, and the features are independent. If sample $i$ is in class $k$, then $X_{ij} \sim N(\frac{k-1}{3}, 1)$ if $j \le 100$, and $X_{ij} \sim N(0, 1)$ otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample $i$ is in class $k$, then $x_i \sim N(\mu_k, I)$, with mean vectors defined as follows: $\mu_{1j} \sim N(0, 0.3^2)$ for $j \le 25$, and $\mu_{1j} = 0$ otherwise; $\mu_{2j} \sim N(0, 0.3^2)$ for $26 \le j \le 50$, and $\mu_{2j} = 0$ otherwise; $\mu_{3j} \sim N(0, 0.3^2)$ for $51 \le j \le 75$, and $\mu_{3j} = 0$ otherwise; $\mu_{4j} \sim N(0, 0.3^2)$ for $76 \le j \le 100$, and $\mu_{4j} = 0$ otherwise.

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of $K$. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                                               Err (%)       Var            Dir
Sim 1: K = 4, mean shift, ind. features
  PLDA                                         12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA                                         31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS                                        19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D                                      11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
Sim 2: K = 2, mean shift, dependent features
  PLDA                                          9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA                                         19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS                                        15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D                                       9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
Sim 3: K = 4, 1D mean shift, ind. features
  PLDA                                         13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA                                         57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS                                        31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D                                      18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
Sim 4: K = 4, mean shift, ind. features
  PLDA                                         60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA                                         65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS                                        60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D                                      58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all four simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

           Simulation 1      Simulation 2      Simulation 3      Simulation 4
           TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
PLDA       99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
SLDA       73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
GLOSS      64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
GLOSS-D    93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer.

²http://www.broadinstitute.org/cancer/software/genepattern/datasets
³http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                                        Err (%)         Var
Nakayama: n = 86, p = 22,283, K = 5
  PLDA                                  20.95 (1.3)     10478.7 (2116.3)
  SLDA                                  25.71 (1.7)       252.5 (3.1)
  GLOSS                                 20.48 (1.4)       129.0 (18.6)
Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA                                  38.36 (6.0)     14873.5 (720.3)
  SLDA                                  —               —
  GLOSS                                 20.61 (6.9)       372.4 (122.1)
Sun: n = 180, p = 54,613, K = 4
  PLDA                                  33.78 (5.9)     21634.8 (7443.2)
  SLDA                                  36.22 (6.5)       384.4 (16.5)
  GLOSS                                 31.77 (4.5)        93.0 (93.6)

Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set, with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution, due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets in the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high collinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS (left) and SLDA (right); the big squares represent class means. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results have been left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16×16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of every digit is shown in Figure 6.3.

As in Section 5.6.4, we have represented the pixel proximity relationships from Figure 5.2 into a penalty matrix $\Omega_L$, but this time in a 256-node graph. Introducing this new 256×256 Laplacian penalty matrix $\Omega_L$ in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector $\beta$ resulting from a non-penalized execution of GLOSS is compared with the $\beta$ resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction $\beta$ obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter ($\lambda = 0.3$). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow detecting strokes, and it will probably provide better prediction results.


Figure 6.4: Discriminant direction between digits "1" and "0" (left: β for GLOSS; right: β for S-GLOSS).

Figure 6.5: Sparse discriminant direction between digits "1" and "0" (left: β for GLOSS, λ = 0.3; right: β for S-GLOSS, λ = 0.3).


          Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the $p\times(K-1)$-dimensional problem into $(K-1)$ independent $p$-dimensional problems. The interaction between the $(K-1)$ problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty, to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


          Part III

          Sparse Clustering Analysis


          Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified with different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


          7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition $n$ observations into $K$ clusters, each observation being assigned to the cluster with the nearest mean (MacQueen 1967). A generalization of K-means can be made through probabilistic models, which represent $K$ subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data $X = (x_1^\top, \dots, x_n^\top)^\top$ have been drawn identically from $K$ different subpopulations in the domain $\mathbb{R}^p$. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as
\[
f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) , \quad \forall i \in \{1, \dots, n\} ,
\]
where $K$ is the number of components, $f_k$ are the densities of the components, and $\pi_k$ are the mixture proportions ($\pi_k \in\, ]0,1[\; \forall k$, and $\sum_k \pi_k = 1$). Mixture models transcribe that, given the proportions $\pi_k$ and the distributions $f_k$ for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters $\pi_1, \dots, \pi_K$;

• x: each $x_i$ is assumed to arise from a random vector with probability density function $f_k$.

In addition, it is usually assumed that the component densities $f_k$ belong to a parametric family of densities $\phi(\cdot\,; \theta_k)$. The density of the mixture can then be written as
\[
f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \, \phi(x_i; \theta_k) , \quad \forall i \in \{1, \dots, n\} ,
\]


where $\theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K)$ is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi)$ of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is to maximize the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al. 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

          Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:
\[
L(\theta; X) = \log\left( \prod_{i=1}^{n} f(x_i; \theta) \right)
= \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) ,
\qquad (7.1)
\]
where $n$ is the number of samples, $K$ is the number of components of the mixture (or number of clusters), and $\pi_k$ are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations $x$ and the unknown latent variables $y$, which indicate the cluster membership of every sample. The pair $z = (x, y)$ is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or


classification log-likelihood:
\[
L_C(\theta; X, Y) = \log\left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right)
= \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} y_{ik} \, \pi_k f_k(x_i; \theta_k) \right)
= \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right) .
\qquad (7.2)
\]
The $y_{ik}$ are the binary entries of the indicator matrix $Y$, with $y_{ik} = 1$ if the observation $i$ belongs to the cluster $k$, and $y_{ik} = 0$ otherwise.

Define the soft membership $t_{ik}(\theta)$ as
\[
t_{ik}(\theta) = p(Y_{ik} = 1 | x_i; \theta)
\qquad (7.3)
\]
\[
\phantom{t_{ik}(\theta)} = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)} .
\qquad (7.4)
\]
To lighten notations, $t_{ik}(\theta)$ will be denoted $t_{ik}$ when the parameter $\theta$ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:
\[
L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right)
= \sum_{i,k} y_{ik} \log\left( t_{ik} f(x_i; \theta) \right)
= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
= \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)
= \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X) ,
\qquad (7.5)
\]
where $\sum_{i,k} y_{ik} \log t_{ik}$ can be reformulated as
\[
\sum_{i,k} y_{ik} \log t_{ik}
= \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( p(Y_{ik} = 1 | x_i; \theta) \right)
= \sum_{i=1}^{n} \log\left( p(y_i | x_i; \theta) \right)
= \log\left( p(Y | X; \theta) \right) .
\]
As a result, the relationship (7.5) can be rewritten as
\[
L(\theta; X) = L_C(\theta; Z) - \log\left( p(Y | X; \theta) \right) .
\qquad (7.6)
\]


          Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables $y_{ik}$ are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value $\theta^{(t)}$ of $\theta$:
\[
L(\theta; X) =
\underbrace{\mathbb{E}_{Y \sim p(\cdot|X, \theta^{(t)})} \left[ L_C(\theta; X, Y) \right]}_{Q(\theta, \theta^{(t)})}
+ \underbrace{\mathbb{E}_{Y \sim p(\cdot|X, \theta^{(t)})} \left[ -\log p(Y|X, \theta) \right]}_{H(\theta, \theta^{(t)})} .
\]
In this expression, $H(\theta, \theta^{(t)})$ is an entropy term and $Q(\theta, \theta^{(t)})$ is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as $\Delta L = L(\theta^{(t+1)}; X) - L(\theta^{(t)}; X)$. Then $\theta^{(t+1)} = \operatorname*{argmax}_\theta Q(\theta, \theta^{(t)})$ also increases the log-likelihood:
\[
\Delta L = \underbrace{\left( Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by definition of iteration } t+1}
+ \underbrace{\left( H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by Jensen's inequality}} \ge 0 .
\]
Therefore, it is possible to maximize the likelihood by optimizing $Q(\theta, \theta^{(t)})$. The relationship between $Q(\theta, \theta')$ and $L(\theta; X)$ is developed in deeper detail in Appendix F, to show how the value of $L(\theta; X)$ can be recovered from $Q(\theta, \theta^{(t)})$.

For the mixture model problem, $Q(\theta, \theta')$ is
\[
Q(\theta, \theta') = \mathbb{E}_{Y \sim p(Y|X, \theta')} \left[ L_C(\theta; X, Y) \right]
= \sum_{i,k} p(Y_{ik} = 1 | x_i; \theta') \log\left( \pi_k f_k(x_i; \theta_k) \right)
= \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left( \pi_k f_k(x_i; \theta_k) \right) .
\qquad (7.7)
\]
$Q(\theta, \theta')$, due to its similarity with the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights $t_{ik}(\theta')$ are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter $\theta^{(0)}$;

• E-Step: evaluation of $Q(\theta, \theta^{(t)})$, using $t_{ik}(\theta^{(t)})$ (7.4) in (7.7);

• M-Step: calculation of $\theta^{(t+1)} = \operatorname*{argmax}_\theta Q(\theta, \theta^{(t)})$.


          Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix $\Sigma$ and different mean vectors $\mu_k$, the mixture density is
\[
f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
= \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}
\exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\} .
\]
At the E-step, the posterior probabilities $t_{ik}$ are computed as in (7.4), with the current $\theta^{(t)}$ parameters; then the M-step maximizes $Q(\theta, \theta^{(t)})$ (7.7), whose form is as follows:
\[
Q(\theta, \theta^{(t)}) = \sum_{i,k} t_{ik} \log(\pi_k)
- \sum_{i,k} t_{ik} \log\left( (2\pi)^{p/2} |\Sigma|^{1/2} \right)
- \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
\]
\[
= \sum_{k} t_k \log(\pi_k)
- \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}}
- \frac{n}{2} \log(|\Sigma|)
- \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
\]
\[
\equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|)
- \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right) ,
\qquad (7.8)
\]
where
\[
t_k = \sum_{i=1}^{n} t_{ik} .
\qquad (7.9)
\]
The M-step, which maximizes this expression with respect to $\theta$, applies the following updates, defining $\theta^{(t+1)}$:
\[
\pi_k^{(t+1)} = \frac{t_k}{n} ,
\qquad (7.10)
\]
\[
\mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k} ,
\qquad (7.11)
\]
\[
\Sigma^{(t+1)} = \frac{1}{n} \sum_{k} W_k ,
\qquad (7.12)
\]
\[
\text{with} \quad W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .
\qquad (7.13)
\]
The derivations are detailed in Appendix G.
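A compact MATLAB sketch of one EM iteration for this Gaussian model, implementing (7.4) and (7.10)-(7.13) (function and variable names are ours; the E-step uses the same max-shift trick as in Section 4.2.3 to avoid underflow):

```matlab
function [pik, mu, Sigma, T] = em_step(X, pik, mu, Sigma)
% One EM iteration for a Gaussian mixture with common covariance matrix.
% X: n-by-p data, pik: 1-by-K proportions, mu: K-by-p means, Sigma: p-by-p.
[n, p] = size(X); K = numel(pik);
logdens = zeros(n, K);
R = chol(Sigma);                                  % R'*R = Sigma
for k = 1:K
    Z = (X - repmat(mu(k,:), n, 1)) / R;          % whitened residuals
    logdens(:,k) = log(pik(k)) - sum(log(diag(R))) ...
                   - (p/2)*log(2*pi) - 0.5*sum(Z.^2, 2);
end
% E-step: posterior memberships t_ik (7.4), computed stably
m = max(logdens, [], 2);
T = exp(logdens - repmat(m, 1, K));
T = T ./ repmat(sum(T, 2), 1, K);
% M-step: proportions, means and common covariance (7.10)-(7.13)
tk = sum(T, 1);                                   % t_k = sum_i t_ik (7.9)
pik = tk / n;                                     % (7.10)
mu = (T' * X) ./ repmat(tk', 1, p);               % (7.11)
Sigma = zeros(p);
for k = 1:K
    Xc = X - repmat(mu(k,:), n, 1);
    Sigma = Sigma + Xc' * (repmat(T(:,k), 1, p) .* Xc);   % W_k (7.13)
end
Sigma = Sigma / n;                                % (7.12)
end
```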

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix $\Sigma_k$, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, $\Sigma_k = \lambda_k D_k A_k D_k^\top$ (Banfield and Raftery 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity with model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original $p$-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

          Penalized log-likelihood maximization is a popular estimation technique for mixturemodels It is typically achieved by the EM algorithm using mixture models for which theallocation of examples is expressed as a simple function of the input features For exam-ple for Gaussian mixtures with a common covariance matrix the log-ratio of posteriorprobabilities is a linear function of x

          log

          (p(Yk = 1|x)

          p(Y` = 1|x)

          )= xgtΣminus1(microk minus micro`)minus

          1

          2(microk + micro`)

          gtΣminus1(microk minus micro`) + logπkπ`

          In this model a simple way of introducing sparsity in discriminant vectors Σminus1(microk minusmicro`) is to constrain Σ to be diagonal and to favor sparse means microk Indeed for Gaussianmixtures with common diagonal covariance matrix if all means have the same value ondimension j then variable j is useless for class allocation and can be discarded Themeans can be penalized by the L1 norm

          λKsumk=1

          psumj=1

          |microkj |

          as proposed by Pan et al (2006) Pan and Shen (2007) Zhou et al (2009) consider morecomplex penalties on full covariance matrices

\lambda_1 \sum_{k=1}^K \sum_{j=1}^p |\mu_{kj}| + \lambda_2 \sum_{k=1}^K \sum_{j=1}^p \sum_{m=1}^p |(\Sigma_k^{-1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.
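To illustrate how such an L1 penalty on the means acts within the M-step, a commonly used update is a soft-thresholding of the weighted means when the common covariance is diagonal. The sketch below is a simplified illustration, assuming standardized (unit-variance) features and function names of ours; it is not the exact estimator of Pan and Shen (2007).

import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0), applied elementwise
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_penalized_means(X, T, lam):
    # X: (n, p) standardized data; T: (n, K) posteriors; lam: penalty parameter.
    tk = T.sum(axis=0)                               # cluster weights t_k
    weighted_sums = T.T @ X                          # sum_i t_ik x_ij, shape (K, p)
    return soft_threshold(weighted_sums, lam) / tk[:, None]   # shrunken cluster means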

          76

          72 Feature Selection in Model-Based Clustering

          Guo et al (2010) propose a variation with a Pairwise Fusion Penalty (PFP)

\lambda \sum_{j=1}^p \sum_{1 \leq k \leq k' \leq K} |\mu_{kj} - \mu_{k'j}| .

This PFP regularization does not shrink the means to zero but towards each other. When the jth feature of all cluster means is driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

\lambda \sum_{j=1}^p \left\| (\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj}) \right\|_\infty .

One group is defined for each variable j as the set of the K means' jth components (μ_1j, ..., μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves genuine feature selection, because it forces null values for the same variable in all cluster means:

\lambda \sqrt{K} \sum_{j=1}^p \sqrt{\sum_{k=1}^K \mu_{kj}^2} .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing to test it is available on the authors' website.
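To illustrate the effect of this group penalty on the means, the corresponding proximal update shrinks, for each variable j, the whole vector (μ_1j, ..., μ_Kj) at once and sets it exactly to zero when its group norm is small. The sketch below is only an illustration under the same simplifying assumptions as above (standardized features, names of ours); it is not the algorithm of Xie et al. (2008b).

import numpy as np

def group_penalized_means(X, T, lam):
    # Groupwise (across clusters) shrinkage of the means: one group per variable j.
    K = T.shape[1]
    tk = T.sum(axis=0)
    mu = (T.T @ X) / tk[:, None]                 # unpenalized means, shape (K, p)
    norms = np.linalg.norm(mu, axis=0)           # group norm of variable j across clusters
    shrink = np.maximum(1.0 - lam * np.sqrt(K) / np.maximum(norms, 1e-12), 0.0)
    return mu * shrink[None, :]                  # column j is zeroed when shrink = 0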

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity on the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independence; that is, the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as

f(x_i \mid \phi, \pi, \theta, \nu) = \sum_{k=1}^K \pi_k \prod_{j=1}^p \left[ f(x_{ij} \mid \theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij} \mid \nu_j) \right]^{1 - \phi_j} ,

where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, ..., φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency for variable j is then formalized as ρ_j = P(φ_j = 1), so all the φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}. Their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion,

\mathrm{tr}\left( (U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U \right) ,   (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation Ũ of the matrix U which maximizes (7.14). This sparse approximation is defined as the solution of

\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \left\| X_U - X\tilde{U} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \tilde{u}_k \right\|_1 ,

where X_U = XU is the input data projected in the non-sparse space and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:

\min_{A, B \in \mathbb{R}^{p \times (K-1)}} \sum_{k=1}^K \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1

\text{s.t. } A^\top A = I_{K-1} ,

where H_B ∈ R^{p×K} is a matrix defined conditionally to the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B. R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observations space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.

The last possibility computes the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

\min_{U \in \mathbb{R}^{p \times (K-1)}} \sum_{j=1}^p \left\| \Sigma_{B,j} - U U^\top \Sigma_{B,j} \right\|_2^2

\text{s.t. } U^\top U = I_{K-1} ,

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observations space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain assumptions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices: the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this approach, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^{(1)}: set of selected relevant variables;

• X^{(2)}: set of variables being considered for inclusion into, or exclusion from, X^{(1)};

• X^{(3)}: set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:

f(X \mid Y) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid Y) = f(X^{(3)} \mid X^{(2)}, X^{(1)})\, f(X^{(2)} \mid X^{(1)})\, f(X^{(1)} \mid Y)

• M2:

f(X \mid Y) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid Y) = f(X^{(3)} \mid X^{(2)}, X^{(1)})\, f(X^{(2)}, X^{(1)} \mid Y)

Model M1 means that the variables in X^{(2)} are independent of the clustering Y; model M2 means that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_{12} = \frac{f(X \mid M_1)}{f(X \mid M_2)} ,

where the high-dimensional term f(X^{(3)} \mid X^{(2)}, X^{(1)}) cancels from the ratio:

B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_2)} = \frac{f(X^{(2)} \mid X^{(1)}, M_1)\, f(X^{(1)} \mid M_1)}{f(X^{(2)}, X^{(1)} \mid M_2)} .

This factor is approximated, since the integrated likelihoods f(X^{(1)} \mid M_1) and f(X^{(2)}, X^{(1)} \mid M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)} \mid X^{(1)}, M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}. There is also a BIC approximation for this term.
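As a rough illustration of this model-selection step, the decision for a single candidate variable can be sketched by comparing BIC scores. The function names and the way the clustering BIC terms are obtained are assumptions made for illustration; they do not reproduce the actual implementation of Raftery and Dean (2006).

import numpy as np

def bic_linear_regression(y, X):
    # BIC (larger is better, as in mclust) of the regression of the candidate
    # variable y on the already selected variables X, approximating f(X2 | X1, M1).
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * loglik - (A.shape[1] + 1) * np.log(n)

def include_candidate(bic_clust_with, bic_clust_without, y_candidate, X_selected):
    # bic_clust_with: clustering BIC on X1 plus the candidate (model M2);
    # bic_clust_without: clustering BIC on X1 alone; both computed elsewhere.
    bic_m1 = bic_clust_without + bic_linear_regression(y_candidate, X_selected)
    return bic_clust_with > bic_m1          # positive approximate Bayes factor for M2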

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows to define blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, testing the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


          8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues: no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow to solve the M-step as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k) ,

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.

The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2\,\ell_{\mathrm{weight}}(\mu, \Sigma) = \sum_{i=1}^n \sum_{k=1}^K t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|) ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminative directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, \mu_k) = \left\| (x_i - \mu_k) B_{\mathrm{LDA}} \right\|_2^2 - 2 \log(\pi_k) .

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:

1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

B_{\mathrm{OS}} = \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y \Theta ,

where Θ are the K − 1 leading eigenvectors of

Y^\top X \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y .

3. Map X to the LDA domain: X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D, with D = \mathrm{diag}\left( \alpha_k^{-1} (1 - \alpha_k^2)^{-\frac{1}{2}} \right).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate distances into posterior probabilities t_{ik} with

t_{ik} \propto \exp\left[ -\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right] .   (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_{ik} converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
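A compact sketch of one iteration of this scheme is given below. The ridge-type penalty matrix Omega, the omission of the D scaling of step 3, and the helper name are simplifications assumed for illustration; this is not the actual Mix-GLOSS code.

import numpy as np

def penalized_os_em_iteration(X, Y, Omega, lam, priors):
    # One iteration: steps 2 to 6 of the clustering algorithm above.
    n, p = X.shape
    K = Y.shape[1]
    G = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y)   # (X'X + lam*Omega)^{-1} X'Y
    M = Y.T @ X @ G                                       # K x K matrix of step 2
    vals, vecs = np.linalg.eigh(M)
    Theta = vecs[:, np.argsort(vals)[::-1][:K - 1]]       # K-1 leading eigenvectors
    B_os = G @ Theta                                      # step 2: p-OS coefficients
    X_lda = X @ B_os                                      # step 3 (D scaling omitted)
    centroids = (Y.T @ X_lda) / Y.sum(axis=0)[:, None]    # step 4
    d = ((X_lda[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # step 5
    logp = -0.5 * (d - 2 * np.log(priors))                # step 6, eq. (8.1)
    T = np.exp(logp - logp.max(axis=1, keepdims=True))
    T /= T.sum(axis=1, keepdims=True)                     # posterior probabilities t_ik
    return B_os, T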

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with penalized OS. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7), so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2} |\Lambda_0|^{n/2} \Gamma_p(\frac{n}{2})} \, |\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left( \Lambda_0^{-1} \Sigma^{-1} \right) \right\} ,

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^p \Gamma\left( \frac{n}{2} + \frac{1 - j}{2} \right) .

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

Q(\theta, \theta') + \log(f(\Sigma \mid \Lambda_0, \nu_0))

= \sum_{k=1}^K t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log|\Lambda_0| - \frac{p(p+1)}{4} \log(\pi) - \sum_{j=1}^p \log \Gamma\left( \frac{n}{2} + \frac{1-j}{2} \right) - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right)

\equiv \sum_{k=1}^K t_k \log \pi_k - \frac{n}{2} \log|\Lambda_0| - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right) ,   (8.2)

with

t_k = \sum_{i=1}^n t_{ik} ,

\nu_n = \nu_0 + n ,

\Lambda_n^{-1} = \Lambda_0^{-1} + S_0 ,

S_0 = \sum_{i=1}^n \sum_{k=1}^K t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator for Σ is

\Sigma_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right) ,   (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
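As a small numerical illustration of (8.3) with this choice of hyperparameters (ν_0 = p + 1 and Λ_0^{-1} = λΩ), the MAP covariance estimate can be computed as below; this is only a sketch of the formula, with S_0 accumulated as in (8.2).

import numpy as np

def sigma_map(X, T, mu, Omega, lam):
    # MAP estimate of the common covariance, eq. (8.3), with nu_0 = p + 1
    # and Lambda_0^{-1} = lam * Omega.
    n, p = X.shape
    S0 = np.zeros((p, p))
    for k in range(T.shape[1]):
        Xc = X - mu[k]
        S0 += (T[:, k, None] * Xc).T @ Xc        # weighted scatter S_0, as in (8.2)
    return (lam * Omega + S0) / (p + 1 + n - p - 1)   # denominator reduces to n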


          9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as depicted in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized by the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

Figure 9.1: Mix-GLOSS loops scheme.

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, so that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the implemented warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2 Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0
    Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
    λ ← 0
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    Estimate λ: compute the gradient at β_j = 0,
        ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( Σ_{m ≠ j} x_m β^m − YΘ )
    Compute λ_max for every feature using (4.32b),
        λ_max^j = (1 / w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA

Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
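A minimal sketch of the λ_max computation used in Algorithm 2, assuming a squared-loss OS fit term J(B) = ½ ‖YΘ − XB‖²_F and given group weights w_j, could be the following (names are ours):

import numpy as np

def lambda_max_per_feature(X, Y, Theta, B, weights):
    # Gradient of the fit term at beta_j = 0 and the per-feature lambda_max of Algorithm 2.
    R = X @ B - Y @ Theta                      # residual with the current coefficients
    lam_max = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        grad_j = X[:, j] @ (R - np.outer(X[:, j], B[j]))  # x_j' (sum_{m!=j} x_m beta^m - Y Theta)
        lam_max[j] = np.linalg.norm(grad_j) / weights[j]
    return lam_max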

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.

Algorithm 3 Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0, Y ← Y0
    else
        B_OS ← 0, Y ← K-means(X, K)
    end if
    convergenceEM ← false, tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag( α^{-1} (1 − α^2)^{-1/2} )
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


          M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ, and the priors of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled label matrix YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

          E-Step

          The E-step evaluates the posterior probability matrix T using

t_{ik} \propto \exp\left[ -\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right] .

The convergence of those t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loops structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.

Figure 9.2: Mix-GLOSS model selection diagram (non-penalized runs, warm-started penalized runs over the λ grid, BIC computation, and selection of the λ minimizing the BIC).
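A schematic driver for this third strategy is sketched below. The helper mix_gloss_run (passed as a callable) and the exact form of the modified BIC, counting only the parameters attached to retained variables, are assumptions following the description above, not the actual implementation.

import numpy as np

def modified_bic(loglik, n, K, B):
    # BIC-like score that only counts parameters of non-removed variables
    # (in the spirit of Pan and Shen, 2007); lower is better. The parameter
    # count below is our own simplification.
    active = int(np.sum(np.any(B != 0, axis=1)))        # retained variables
    n_params = (K - 1) + K * active + active * (active + 1) / 2
    return -2 * loglik + np.log(n) * n_params

def select_lambda(X, K, lambdas, mix_gloss_run, n_repetitions=20):
    # mix_gloss_run: callable returning a dict with keys 'B', 'T', 'loglik'
    # (hypothetical interface used only for this sketch).
    # 1) several non-penalized runs, keep the best one by log-likelihood
    best = max((mix_gloss_run(X, K, lam=0.0) for _ in range(n_repetitions)),
               key=lambda r: r['loglik'])
    B, T = best['B'], best['T']
    # 2) one warm-started penalized run per lambda value, scored by the modified BIC
    scored = []
    for lam in sorted(lambdas):
        out = mix_gloss_run(X, K, lam=lam, B_init=B, T_init=T)
        B, T = out['B'], out['T']                       # warm start for the next value
        scored.append((modified_bic(out['loglik'], X.shape[0], K, out['B']), out))
    return min(scored, key=lambda s: s[0])[1]           # model minimizing the BIC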


          10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions, some of the algorithms to test took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov. This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM. This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust / Clustvarsel. Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster. LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows to perform clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS. This is the clustering algorithm implemented using GLOSS (see Section 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are:

• Clustering Error (in percentage). To measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows to obtain the ideal 0% clustering error even if the IDs of the clusters and of the real classes differ.

• Number of Discarded Features. This value shows the number of variables whose coefficients have been zeroed; therefore, they are not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. Those algorithms tend to be more memory- and CPU-consuming as the number of variables increases. This is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of relevant variables that are actually selected. Similarly, the FPR is the ratio of non-relevant variables that are wrongly selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and clustering error, respectively, and, as the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
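As an illustration of how these measures can be computed, a minimal sketch is given below. It uses a permutation-invariant matching of clusters to classes via the Hungarian algorithm (scipy); this is our own convenience implementation, not necessarily the exact procedure of Wu and Schölkopf (2007).

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred, K):
    # Build the confusion matrix and find the cluster/class matching that
    # maximizes agreement; the error is the fraction of remaining mismatches.
    C = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)       # best one-to-one matching
    return 1.0 - C[rows, cols].sum() / len(y_true)

def tpr_fpr(selected, relevant):
    # selected, relevant: boolean arrays over the p variables.
    selected, relevant = np.asarray(selected), np.asarray(relevant)
    tpr = (selected & relevant).sum() / relevant.sum()
    fpr = (selected & ~relevant).sum() / (~relevant).sum()
    return tpr, fpr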

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data.

Sim 1: K = 4, mean shift, ind. features
                        Err (%)       Var          Time
CS general cov          46 (15)       985 (72)     884h
Fisher EM               58 (87)       784 (52)     1645m
Clustvarsel             602 (107)     378 (291)    383h
LumiWCluster-Kuan       42 (68)       779 (4)      389s
LumiWCluster-Wang       43 (69)       784 (39)     619s
Mix-GLOSS               32 (16)       80 (09)      15h

Sim 2: K = 2, mean shift, dependent features
                        Err (%)       Var          Time
CS general cov          154 (2)       997 (09)     783h
Fisher EM               74 (23)       809 (28)     8m
Clustvarsel             73 (2)        334 (207)    166h
LumiWCluster-Kuan       64 (18)       798 (04)     155s
LumiWCluster-Wang       63 (17)       799 (03)     14s
Mix-GLOSS               77 (2)        841 (34)     2h

Sim 3: K = 4, 1D mean shift, ind. features
                        Err (%)       Var          Time
CS general cov          304 (57)      55 (468)     1317h
Fisher EM               233 (65)      366 (55)     22m
Clustvarsel             658 (115)     232 (291)    542h
LumiWCluster-Kuan       323 (21)      80 (02)      83s
LumiWCluster-Wang       308 (36)      80 (02)      1292s
Mix-GLOSS               347 (92)      81 (88)      21h

Sim 4: K = 4, mean shift, ind. features
                        Err (%)       Var          Time
CS general cov          626 (55)      999 (02)     112h
Fisher EM               567 (104)     55 (48)      195m
Clustvarsel             732 (4)       24 (12)      767h
LumiWCluster-Kuan       692 (112)     99 (2)       876s
LumiWCluster-Wang       697 (119)     991 (21)     825s
Mix-GLOSS               669 (91)      975 (12)     11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms.

              Simulation 1      Simulation 2      Simulation 3      Simulation 4
              TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
MIX-GLOSS     992     015       828     335       884     67        780     12
LUMI-KUAN     992     28        1000    02        1000    005       50      005
FISHER-EM     986     24        888     17        838     5825      620     4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1-10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, is also very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


          Conclusions

          Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis, we have shown the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows to transfer all the resources available for the resolution of regression problems to the solution of linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested with four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. For now, due to time constraints, only artificial datasets have been tested, with positive results.

          Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We plan to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables; however, other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving a functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis, and the programming structure used for its implementation, allow easy alterations to the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian that describes the relationships between variables. That can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-equivalent penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested, or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized by Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Not knowing this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if finally the work took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but, as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


          Appendix


          A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:

\Sigma_W = \frac{1}{n} \sum_{k=1}^g \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,

\Sigma_B = \frac{1}{n} \sum_{k=1}^g n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .

Property 2. \frac{\partial\, x^\top a}{\partial x} = \frac{\partial\, a^\top x}{\partial x} = a

Property 3. \frac{\partial\, x^\top A x}{\partial x} = (A + A^\top) x

Property 4. \frac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}| (X^{-1})^\top

Property 5. \frac{\partial\, a^\top X b}{\partial X} = a b^\top

Property 6. \frac{\partial}{\partial X} \mathrm{tr}\left( A X^{-1} B \right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix, we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

\min_{\theta_k, \beta_k} \; \left\| Y\theta_k - X\beta_k \right\|_2^2 + \beta_k^\top \Omega_k \beta_k   (B.1)

\text{s.t. } \theta_k^\top Y^\top Y \theta_k = 1 ,

\theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k ,

for k = 1, ..., K − 1.

The Lagrangian associated to Problem (B.1) is

L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \left\| Y\theta_k - X\beta_k \right\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k \left( \theta_k^\top Y^\top Y \theta_k - 1 \right) + \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k .   (B.2)

Setting to zero the gradient of (B.2) with respect to β_k gives the value of the optimal β_k:

\beta_k = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k .   (B.3)

The objective function of (B.1) evaluated at β_k is

\min_{\theta_k} \left\| Y\theta_k - X\beta_k \right\|_2^2 + \beta_k^\top \Omega_k \beta_k = \min_{\theta_k} \theta_k^\top Y^\top \left( I - X(X^\top X + \Omega_k)^{-1} X^\top \right) Y \theta_k

= \max_{\theta_k} \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k .   (B.4)

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem, where the k score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial, due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section, we show how to circumvent this issue by solving an easier eigenvector decomposition.

Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, such that we can rewrite expression (B.4) in a compact way:

\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \mathrm{tr}\left( \Theta^\top M \Theta \right)   (B.5)

\text{s.t. } \Theta^\top Y^\top Y \Theta = I_{K-1} .

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K − 1) × (K − 1) matrix M_Θ be Θ^⊤MΘ. Hence, the classical eigenvector formulation associated to (B.5) is

M_\Theta v = \lambda v ,   (B.6)

where v is an eigenvector and λ the associated eigenvalue of M_Θ. Developing this expression,

v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda .

Making the change of variable w = Θv, we obtain an alternative eigenproblem, where w is an eigenvector of M and λ the associated eigenvalue:

w^\top M w = \lambda .   (B.7)

Therefore, v are the eigenvectors of the eigen-decomposition of matrix M_Θ, and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K − 1) × (K − 1) matrix M_Θ and the K × K matrix M is the K × (K − 1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{-1}X^⊤YΘ in M_Θ:

M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B .

Thus, the eigen-decomposition of the (K − 1) × (K − 1) matrix M_Θ = Θ^⊤Y^⊤XB results in the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv needs to be undone.

To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB. Then, the definitive eigenvectors w are recovered by computing w = Θv. The final step is the reconstruction of the optimal score matrix Θ, using the vectors w as its columns. At this point, we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable to restore the w vectors. The B matrix also needs to be "updated", by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

B = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V .
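A small numerical sketch of this computation (names are ours) is given below: the eigen-decomposition is performed on the small (K − 1) × (K − 1) matrix M_Θ = Θ^⊤Y^⊤XB, and Θ and B are then "updated" by the matrix of eigenvectors V.

import numpy as np

def update_scores(X, Y, B, Theta):
    # M_Theta = Theta' Y' X B is symmetric when B = (X'X + Omega)^{-1} X' Y Theta.
    M_theta = Theta.T @ Y.T @ X @ B
    M_theta = (M_theta + M_theta.T) / 2          # symmetrize against round-off
    vals, V = np.linalg.eigh(M_theta)
    V = V[:, np.argsort(vals)[::-1]]             # sort by decreasing eigenvalue
    return Theta @ V, B @ V                      # updated score and coefficient matrices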


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of the M matrix (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m , \quad \text{s.t. } \theta_k^\top \theta_k = 1 .   (B.8)

The score vector constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis:

\left( \sum_{m=1}^{K-1} \alpha_m w_m \right)^\top \left( \sum_{m=1}^{K-1} \alpha_m w_m \right) = 1 ,

which, as per the eigenvector properties, can be reduced to

\sum_{m=1}^{K-1} \alpha_m^2 = 1 .   (B.9)

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

M \theta_k = M \left( \sum_{m=1}^{K-1} \alpha_m w_m \right) = \sum_{m=1}^{K-1} \alpha_m M w_m .

As the w_m are the eigenvectors of the M matrix, the relationship M w_m = λ_m w_m can be used to obtain

M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m .

Left-multiplying both sides by θ_k^⊤, written as its linear combination of eigenvectors, we obtain

\theta_k^\top M \theta_k = \left( \sum_{\ell=1}^{K-1} \alpha_\ell w_\ell \right)^\top \left( \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m \right) .

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤ w_m is zero for any ℓ ≠ m, giving

\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m .


The optimization problem (B.5) for discriminant direction k can be rewritten as

\max_{\theta_k \in \mathbb{R}^{K \times 1}} \left\{ \theta_k^\top M \theta_k \right\} = \max_{\theta_k \in \mathbb{R}^{K \times 1}} \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m ,   (B.10)

with \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m and \sum_{m=1}^{K-1} \alpha_m^2 = 1 .

One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = \sum_{m=1}^{K-1} \alpha_m w_m, the resulting score vector θ_k will be equal to the kth eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

\max_{\beta \in \mathbb{R}^p} \beta^\top \Sigma_B \beta   (C.1a)

\text{s.t. } \beta^\top \Sigma_W \beta = 1 ,   (C.1b)

where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu \left( \beta^\top \Sigma_W \beta - 1 \right) ,

so that its first derivative with respect to β is

\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta .

A necessary optimality condition for β is that this derivative is zero, that is,

\Sigma_B \beta = \nu \Sigma_W \beta .

Provided Σ_W is full rank, we have

\Sigma_W^{-1} \Sigma_B \beta = \nu \beta .   (C.2)

Thus, the solutions β match the definition of an eigenvector of the matrix Σ_W^{-1}Σ_B, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

\beta^\top \Sigma_B \beta = \beta^\top \Sigma_W \Sigma_W^{-1} \Sigma_B \beta

= \nu\, \beta^\top \Sigma_W \beta \quad \text{from (C.2)}

= \nu \quad \text{from (C.1b)} .

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence, ν is the largest eigenvalue of Σ_W^{-1}Σ_B, and β is any eigenvector corresponding to this maximal eigenvalue.
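A minimal numerical sketch of this result, solving the generalized eigenproblem with scipy (names are ours), could be:

import numpy as np
from scipy.linalg import eigh

def fisher_direction(Sigma_B, Sigma_W):
    # Solve Sigma_B beta = nu Sigma_W beta and return the direction associated with
    # the largest eigenvalue, normalized so that beta' Sigma_W beta = 1 (C.1b).
    vals, vecs = eigh(Sigma_B, Sigma_W)          # generalized symmetric eigenproblem
    beta = vecs[:, np.argmax(vals)]
    return beta / np.sqrt(beta @ Sigma_W @ beta)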


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:

$$\min_{\tau\in\mathbb{R}^p}\;\min_{B\in\mathbb{R}^{p\times K-1}} \; J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j} \qquad\text{(D.1a)}$$
$$\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1 , \qquad\text{(D.1b)}$$
$$\qquad\;\; \tau_j \ge 0 ,\; j = 1,\dots,p . \qquad\text{(D.1c)}$$

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B\in\mathbb{R}^{p\times K-1}$ be a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$: $B = (\beta^{1\top},\dots,\beta^{p\top})^\top$.

$$L(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\frac{\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j . \qquad\text{(D.2)}$$

The starting point is the Lagrangian (D.2), which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:

$$\frac{\partial L(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0 \;\Rightarrow\; -\lambda w_j^2\frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0$$
$$\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0 .$$

The last two expressions are related through one property of the Lagrange multipliers, which states that $\nu_j g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ is the corresponding inequality constraint. Then the optimal $\tau_j^\star$ can be deduced:

$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\, w_j\|\beta^j\|_2 .$$

Plugging this optimal value of $\tau_j^\star$ into constraint (D.1b):

$$\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2} . \qquad\text{(D.3)}$$


With this value of $\tau_j^\star$, Problem (D.1) is equivalent to

$$\min_{B\in\mathbb{R}^{p\times K-1}} \; J(B) + \lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^2 . \qquad\text{(D.4)}$$

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null row vectors $\beta^j$.

The penalty term of (D.1a) can be conveniently presented as $\lambda B^\top\Omega B$, where

$$\Omega = \operatorname{diag}\Big(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\dots,\frac{w_p^2}{\tau_p}\Big) . \qquad\text{(D.5)}$$

Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is

$$(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2} . \qquad\text{(D.6)}$$
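The closed-form quantities (D.3) and (D.6) are straightforward to compute for a given $B$. The following sketch (Python/NumPy, illustrative names, with a small guard against null rows that is not part of the derivation) shows one possible implementation.

```python
import numpy as np

def variational_weights(B, w, eps=1e-12):
    # Optimal tau (D.3) and diagonal of Omega (D.6) for the rows beta^j of B.
    row_norms = np.linalg.norm(B, axis=1)            # ||beta^j||_2
    s = np.sum(w * row_norms)                        # sum_j w_j ||beta^j||_2
    tau = w * row_norms / s                          # eq. (D.3)
    omega_diag = w * s / np.maximum(row_norms, eps)  # eq. (D.6), guarded for null rows
    return tau, omega_diag
```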

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $B\in\mathbb{R}^{p\times K-1}$, the subdifferential of the objective function of Problem (D.4) is the set of matrices $V\in\mathbb{R}^{p\times K-1}$ of the form

$$V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2\Big)\,G , \qquad\text{(D.7)}$$

where $G = (g^{1\top},\dots,g^{p\top})^\top$ is a $p\times(K-1)$ matrix defined as follows. Let $\mathcal{S}(B)$ denote the row support of $B$, $\mathcal{S}(B) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2\neq 0\}$; then we have

$$\forall j\in\mathcal{S}(B),\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j , \qquad\text{(D.8)}$$
$$\forall j\notin\mathcal{S}(B),\quad \|g^j\|_2 \le w_j . \qquad\text{(D.9)}$$


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B$ of the objective function verifying the following conditions are global minima. Let $\mathcal{S}(B)$ denote the row support of $B$, $\mathcal{S}(B) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2\neq 0\}$, and let $\bar{\mathcal{S}}(B)$ be its complement; then we have

$$\forall j\in\mathcal{S}(B),\quad -\frac{\partial J(B)}{\partial\beta^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2\Big)\, w_j\|\beta^j\|_2^{-1}\beta^j , \qquad\text{(D.10a)}$$
$$\forall j\in\bar{\mathcal{S}}(B),\quad \bigg\|\frac{\partial J(B)}{\partial\beta^j}\bigg\|_2 \le 2\lambda w_j\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2\Big) . \qquad\text{(D.10b)}$$

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $B$, the gap in these objectives is null at $\tau$ such that

$$\tau_j = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p} w_{j'}\|\beta^{j'}\|_2} .$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau\in\mathbb{R}^p$ be any feasible vector; we have

$$\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)^{\!2} = \Big(\sum_{j=1}^{p} \tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^{\!2} \le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p} w_j^2\frac{\|\beta^j\|_2^2}{\tau_j}\Big) \le \sum_{j=1}^{p} w_j^2\frac{\|\beta^j\|_2^2}{\tau_j} ,$$

where we used the Cauchy-Schwarz inequality in the second step and the definition of the feasibility set of $\tau$ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.
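A quick numerical sanity check of Lemma D.4 on random data is sketched below; the data, weights and tolerance are arbitrary and only serve to illustrate that the variational penalty dominates the squared group-Lasso penalty, with equality at the $\tau^\star$ given by (D.3).

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 3))                  # rows beta^j
w = rng.uniform(0.5, 2.0, size=6)            # positive weights
norms = np.linalg.norm(B, axis=1)

def variational_penalty(tau):
    return np.sum(w ** 2 * norms ** 2 / tau)

group_lasso_penalty = np.sum(w * norms) ** 2          # squared penalty of (D.4)
tau_star = w * norms / np.sum(w * norms)              # eq. (D.3)
tau_rand = rng.dirichlet(np.ones(6))                  # any feasible tau on the simplex

assert variational_penalty(tau_rand) >= group_lasso_penalty - 1e-9
assert np.isclose(variational_penalty(tau_star), group_lasso_penalty)
```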


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B_0$ are optimal for the score values $\Theta_0$, and if the optimal scores $\Theta$ are obtained by a unitary transformation of $\Theta_0$, say $\Theta = \Theta_0 V$ (where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix), then $B = B_0 V$ is optimal conditionally on $\Theta$; that is, $(\Theta, B)$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and show the following proposition.

Proposition E.1. Let $\hat{B}$ be a solution of

$$\min_{B\in\mathbb{R}^{p\times M}} \; \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 , \qquad\text{(E.1)}$$

and let $\tilde{Y} = YV$, where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde{B} = \hat{B}V$ is a solution of

$$\min_{B\in\mathbb{R}^{p\times M}} \; \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p} w_j\|\beta^j\|_2 . \qquad\text{(E.2)}$$

Proof. The first-order necessary optimality conditions for $\hat{B}$ are

$$\forall j\in\mathcal{S}(\hat{B}),\quad 2\,x^{j\top}\big(x^{j}\hat{\beta}^{j} - Y\big) + \lambda w_j\|\hat{\beta}^{j}\|_2^{-1}\hat{\beta}^{j} = 0 , \qquad\text{(E.3a)}$$
$$\forall j\in\bar{\mathcal{S}}(\hat{B}),\quad 2\,\big\|x^{j\top}\big(x^{j}\hat{\beta}^{j} - Y\big)\big\|_2 \le \lambda w_j , \qquad\text{(E.3b)}$$

where $\mathcal{S}(\hat{B})\subseteq\{1,\dots,p\}$ denotes the set of non-zero row vectors of $\hat{B}$ and $\bar{\mathcal{S}}(\hat{B})$ is its complement.

First, we note that, from the definition of $\tilde{B}$, we have $\mathcal{S}(\tilde{B}) = \mathcal{S}(\hat{B})$. Then we may rewrite the above conditions as follows:

$$\forall j\in\mathcal{S}(\tilde{B}),\quad 2\,x^{j\top}\big(x^{j}\tilde{\beta}^{j} - \tilde{Y}\big) + \lambda w_j\|\tilde{\beta}^{j}\|_2^{-1}\tilde{\beta}^{j} = 0 , \qquad\text{(E.4a)}$$
$$\forall j\in\bar{\mathcal{S}}(\tilde{B}),\quad 2\,\big\|x^{j\top}\big(x^{j}\tilde{\beta}^{j} - \tilde{Y}\big)\big\|_2 \le \lambda w_j , \qquad\text{(E.4b)}$$

where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$, and also uses that $VV^\top = I$, so that $\forall u\in\mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$.


Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{B}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
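The invariance exploited in this appendix can also be checked numerically; the sketch below, with arbitrary random data and a random orthogonal $V$, verifies that both the data-fitting term and the group-Lasso penalty of the objective are unchanged when $Y$ and $B$ are multiplied by $V$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, M, lam = 30, 8, 3, 0.5
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, M))
B = rng.normal(size=(p, M))
w = np.ones(p)
V, _ = np.linalg.qr(rng.normal(size=(M, M)))   # random orthogonal (unitary) matrix

def objective(Yc, Bc):
    fit = np.linalg.norm(Yc - X @ Bc, "fro") ** 2
    penalty = lam * np.sum(w * np.linalg.norm(Bc, axis=1))
    return fit + penalty

# both terms are invariant under Y -> YV, B -> BV
assert np.isclose(objective(Y, B), objective(Y @ V, B @ V))
```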


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, with the maximization of the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:

$$L(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Big) \qquad\text{(F.1)}$$
$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) \qquad\text{(F.2)}$$
$$\text{with}\quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_{\ell}\pi'_\ell f_\ell(x_i;\theta'_\ell)} . \qquad\text{(F.3)}$$

In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probability values computed from $\theta'$ at the previous E-step, and $\theta$, without "prime", denotes the parameters of the current iteration, to be obtained by the maximization of $Q(\theta,\theta')$.

Using (F.3), we have

$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) \\
&= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_{\ell}\pi_\ell f_\ell(x_i;\theta_\ell)\Big) \\
&= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta) .
\end{aligned}$$

In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:

$$L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) = Q(\theta,\theta) + H(T) .$$
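This identity is easy to verify numerically; the following sketch (an arbitrary two-component Gaussian mixture with illustrative parameter values, not taken from the thesis) computes $L(\theta)$ directly from (F.1) and compares it with $Q(\theta,\theta) + H(T)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
pis = np.array([0.3, 0.7])                    # mixture proportions
mus = np.array([[0.0, 0.0], [1.0, -1.0]])     # cluster means
Sigma = np.eye(2)                             # common covariance matrix

F = np.column_stack([multivariate_normal.pdf(X, mean=m, cov=Sigma) for m in mus])
T = pis * F                                   # unnormalized posteriors pi_k f_k(x_i)
T /= T.sum(axis=1, keepdims=True)             # posterior probabilities t_ik, eq. (F.3)

loglik = np.sum(np.log(F @ pis))              # direct evaluation of (F.1)
Q = np.sum(T * np.log(pis * F))               # (F.2) evaluated at theta' = theta
H = -np.sum(T * np.log(T))                    # entropy of the posteriors
assert np.isclose(loglik, Q + H)
```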


G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

$$\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) \\
&= \sum_{k}\Big(\log\pi_k\sum_{i} t_{ik}\Big) - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k) ,
\end{aligned}$$

which has to be maximized subject to

$$\sum_{k}\pi_k = 1 .$$

The Lagrangian of this problem is

$$L(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_{k}\pi_k - 1\Big) .$$

Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.

G.1 Prior probabilities

$$\frac{\partial L(\theta)}{\partial\pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k}\sum_{i} t_{ik} + \lambda = 0 ,$$

where $\lambda$ is identified from the constraint, leading to

$$\pi_k = \frac{1}{n}\sum_{i} t_{ik} .$$


G.2 Means

$$\frac{\partial L(\theta)}{\partial\mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2}\sum_{i} t_{ik}\,2\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_{i} t_{ik}\,x_i}{\sum_{i} t_{ik}} .$$

G.3 Covariance Matrix

$$\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\Sigma}_{\text{as per property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top .$$
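Gathered together, these updates form the M-step. A compact sketch is given below, where $X$ is the $n\times p$ data matrix, $T$ the $n\times K$ matrix of posterior probabilities, and the function name is illustrative rather than the Mix-GLOSS implementation.

```python
import numpy as np

def m_step(X, T):
    # X is n x p, T is the n x K matrix of posterior probabilities t_ik.
    n, p = X.shape
    nk = T.sum(axis=0)                         # sum_i t_ik for each cluster
    pi = nk / n                                # prior probabilities (G.1)
    mu = (T.T @ X) / nk[:, None]               # cluster means (G.2)
    Sigma = np.zeros((p, p))                   # common (pooled) covariance (G.3)
    for k in range(T.shape[1]):
        D = X - mu[k]
        Sigma += (T[:, k, None] * D).T @ D
    Sigma /= n
    return pi, mu, Sigma
```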


          Bibliography

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Bienarcki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, ArXiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.


T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.


J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, ArXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.


T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.


C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Buhlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.


Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Scholkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York, Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.


M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Scholkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.


M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.



            Contents

            List of figures v

            List of tables vii

            Notation and Symbols ix

            I Context and Foundations 1

            1 Context 5

            2 Regularization for Feature Selection 921 Motivations 9

            22 Categorization of Feature Selection Techniques 11

            23 Regularization 13

            231 Important Properties 14

            232 Pure Penalties 14

            233 Hybrid Penalties 18

            234 Mixed Penalties 19

            235 Sparsity Considerations 19

            236 Optimization Tools for Regularized Problems 21

            II Sparse Linear Discriminant Analysis 25

            Abstract 27

            3 Feature Selection in Fisher Discriminant Analysis 2931 Fisher Discriminant Analysis 29

            32 Feature Selection in LDA Problems 30

            321 Inertia Based 30

            322 Regression Based 32

            4 Formalizing the Objective 3541 From Optimal Scoring to Linear Discriminant Analysis 35

            411 Penalized Optimal Scoring Problem 36

            412 Penalized Canonical Correlation Analysis 37

            i

            Contents

            413 Penalized Linear Discriminant Analysis 39

            414 Summary 40

            42 Practicalities 41

            421 Solution of the Penalized Optimal Scoring Regression 41

            422 Distance Evaluation 42

            423 Posterior Probability Evaluation 43

            424 Graphical Representation 43

            43 From Sparse Optimal Scoring to Sparse LDA 43

            431 A Quadratic Variational Form 44

            432 Group-Lasso OS as Penalized LDA 47

            5 GLOSS Algorithm 4951 Regression Coefficients Updates 49

            511 Cholesky decomposition 52

            512 Numerical Stability 52

            52 Score Matrix 52

            53 Optimality Conditions 53

            54 Active and Inactive Sets 54

            55 Penalty Parameter 54

            56 Options and Variants 55

            561 Scaling Variables 55

            562 Sparse Variant 55

            563 Diagonal Variant 55

            564 Elastic net and Structured Variant 55

            6 Experimental Results 5761 Normalization 57

            62 Decision Thresholds 57

            63 Simulated Data 58

            64 Gene Expression Data 60

            65 Correlated Data 63

            Discussion 63

            III Sparse Clustering Analysis 67

            Abstract 69

            7 Feature Selection in Mixture Models 7171 Mixture Models 71

            711 Model 71

            712 Parameter Estimation The EM Algorithm 72

            ii

            Contents

            72 Feature Selection in Model-Based Clustering 75721 Based on Penalized Likelihood 76722 Based on Model Variants 77723 Based on Model Selection 79

            8 Theoretical Foundations 8181 Resolving EM with Optimal Scoring 81

            811 Relationship Between the M-Step and Linear Discriminant Analysis 81812 Relationship Between Optimal Scoring and Linear Discriminant

            Analysis 82813 Clustering Using Penalized Optimal Scoring 82814 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83

            82 Optimized Criterion 83821 A Bayesian Derivation 84822 Maximum a Posteriori Estimator 85

            9 Mix-GLOSS Algorithm 8791 Mix-GLOSS 87

            911 Outer Loop Whole Algorithm Repetitions 87912 Penalty Parameter Loop 88913 Inner Loop EM Algorithm 89

            92 Model Selection 91

            10Experimental Results 93101 Tested Clustering Algorithms 93102 Results 95103 Discussion 97

            Conclusions 97

            Appendix 103

            A Matrix Properties 105

            B The Penalized-OS Problem is an Eigenvector Problem 107B1 How to Solve the Eigenvector Decomposition 107B2 Why the OS Problem is Solved as an Eigenvector Problem 109

            C Solving Fisherrsquos Discriminant Problem 111

            D Alternative Variational Formulation for the Group-Lasso 113D1 Useful Properties 114D2 An Upper Bound on the Objective Function 115

            iii

            Contents

            E Invariance of the Group-Lasso to Unitary Transformations 117

            F Expected Complete Likelihood and Likelihood 119

            G Derivation of the M-Step Equations 121G1 Prior probabilities 121G2 Means 122G3 Covariance Matrix 122

            Bibliography 123

            iv

            List of Figures

            11 MASH project logo 5

            21 Example of relevant features 1022 Four key steps of feature selection 1123 Admissible sets in two dimensions for different pure norms ||β||p 1424 Two dimensional regularized problems with ||β||1 and ||β||2 penalties 1525 Admissible sets for the Lasso and Group-Lasso 2026 Sparsity patterns for an example with 8 variables characterized by 4 pa-

            rameters 20

            41 Graphical representation of the variational approach to Group-Lasso 45

            51 GLOSS block diagram 5052 Graph and Laplacian matrix for a 3times 3 image 56

            61 TPR versus FPR for all simulations 6062 2D-representations of Nakayama and Sun datasets based on the two first

            discriminant vectors provided by GLOSS and SLDA 6263 USPS digits ldquo1rdquo and ldquo0rdquo 6364 Discriminant direction between digits ldquo1rdquo and ldquo0rdquo 6465 Sparse discriminant direction between digits ldquo1rdquo and ldquo0rdquo 64

            91 Mix-GLOSS Loops Scheme 8892 Mix-GLOSS model selection diagram 92

            101 Class mean vectors for each artificial simulation 94102 TPR versus FPR for all simulations 97

            v

            List of Tables

            61 Experimental results for simulated data supervised classification 5962 Average TPR and FPR for all simulations 6063 Experimental results for gene expression data supervised classification 61

            101 Experimental results for simulated data unsupervised clustering 96102 Average TPR versus FPR for all clustering simulations 96

            vii

            Notation and Symbols

            Throughout this thesis vectors are denoted by lowercase letters in bold font andmatrices by uppercase letters in bold font Unless otherwise stated vectors are columnvectors and parentheses are used to build line vectors from comma-separated lists ofscalars or to build matrices from comma-separated lists of column vectors

            Sets

            N the set of natural numbers N = 1 2 R the set of reals|A| cardinality of a set A (for finite sets the number of elements)A complement of set A

            Data

            X input domainxi input sample xi isin XX design matrix X = (xgt1 x

            gtn )gt

            xj column j of Xyi class indicator of sample i

            Y indicator matrix Y = (ygt1 ygtn )gt

            z complete data z = (xy)Gk set of the indices of observations belonging to class kn number of examplesK number of classesp dimension of Xi j k indices running over N

            Vectors Matrices and Norms

            0 vector with all entries equal to zero1 vector with all entries equal to oneI identity matrixAgt transposed of matrix A (ditto for vector)Aminus1 inverse of matrix Atr(A) trace of matrix A|A| determinant of matrix Adiag(v) diagonal matrix with v on the diagonalv1 L1 norm of vector vv2 L2 norm of vector vAF Frobenius norm of matrix A

            ix

            Notation and Symbols

            Probability

            E [middot] expectation of a random variablevar [middot] variance of a random variableN (micro σ2) normal distribution with mean micro and variance σ2

            W(W ν) Wishart distribution with ν degrees of freedom and W scalematrix

            H (X) entropy of random variable XI (XY ) mutual information between random variables X and Y

            Mixture Models

            yik hard membership of sample i to cluster kfk distribution function for cluster ktik posterior probability of sample i to belong to cluster kT posterior probability matrixπk prior probability or mixture proportion for cluster kmicrok mean vector of cluster kΣk covariance matrix of cluster kθk parameter vector for cluster k θk = (microkΣk)

            θ(t) parameter vector at iteration t of the EM algorithmf(Xθ) likelihood functionL(θ X) log-likelihood functionLC(θ XY) complete log-likelihood function

            Optimization

            J(middot) cost functionL(middot) Lagrangianβ generic notation for the solution wrt β

            βls least squares solution coefficient vectorA active setγ step size to update regularization pathh direction to update regularization path

            x

            Notation and Symbols

            Penalized models

            λ λ1 λ2 penalty parametersPλ(θ) penalty term over a generic parameter vectorβkj coefficient j of discriminant vector kβk kth discriminant vector βk = (βk1 βkp)B matrix of discriminant vectors B = (β1 βKminus1)

            βj jth row of B = (β1gt βpgt)gt

            BLDA coefficient matrix in the LDA domainBCCA coefficient matrix in the CCA domainBOS coefficient matrix in the OS domainXLDA data matrix in the LDA domainXCCA data matrix in the CCA domainXOS data matrix in the OS domainθk score vector kΘ score matrix Θ = (θ1 θKminus1)Y label matrixΩ penalty matrixLCP (θXZ) penalized complete log-likelihood functionΣB between-class covariance matrixΣW within-class covariance matrixΣT total covariance matrix

            ΣB sample between-class covariance matrix

            ΣW sample within-class covariance matrix

            ΣT sample total covariance matrixΛ inverse of covariance matrix or precision matrixwj weightsτj penalty components of the variational approach

            xi

            Part I

            Context and Foundations

            1

            This thesis is divided in three parts In Part I I am introducing the context in whichthis work has been developed the project that funded it and the constraints that we hadto obey Generic are also detailed here to introduce the models and some basic conceptsthat will be used along this document The state of the art of is also reviewed

            The first contribution of this thesis is explained in Part II where I present the super-vised learning algorithm GLOSS and its supporting theory as well as some experimentsto test its performance compared to other state of the art mechanisms Before describingthe algorithm and the experiments its theoretical foundations are provided

            The second contribution is described in Part III with an analogue structure to Part IIbut for the unsupervised domain The clustering algorithm Mix-GLOSS adapts the su-pervised technique from Part II by means of a modified EM This section is also furnishedwith specific theoretical foundations an experimental section and a final discussion

            3

            1 Context

            The MASH project is a research initiative to investigate the open and collaborativedesign of feature extractors for the Machine Learning scientific community The project isstructured around a web platform (httpmash-projecteu) comprising collaborativetools such as wiki-documentation forums coding templates and an experiment centerempowered with non-stop calculation servers The applications targeted by MASH arevision and goal-planning problems either in a 3D virtual environment or with a realrobotic arm

            The MASH consortium is led by the IDIAP Research Institute in Switzerland Theother members are the University of Potsdam in Germany the Czech Technical Uni-versity of Prague the National Institute for Research in Computer Science and Control(INRIA) in France and the National Centre for Scientific Research (CNRS) also in Francethrough the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC)attached to the the University of Technology of Compiegne

            From the point of view of the research the members of the consortium must deal withfour main goals

            1 Software development of website framework and APIrsquos

            2 Classification and goal-planning in high dimensional feature spaces

            3 Interfacing the platform with the 3D virtual environment and the robot arm

            4 Building tools to assist contributors with the development of the feature extractorsand the configuration of the experiments

            S HM A

            Figure 11 MASH project logo

            5

            1 Context

            The work detailed in this text has been done in the context of goal 4 From the verybeginning of the project our role is to provide the users with some feedback regardingthe feature extractors At the moment of writing this thesis the number of publicfeature extractors reaches 75 In addition to the public ones there are also privateextractors that contributors decide not to share with the rest of the community Thelast number I was aware of was about 300 Within those 375 extractors there must besome of them sharing the same theoretical principles or supplying similar features Theframework of the project tests every new piece of code with some datasets of reference inorder to provide a ranking depending on the quality of the estimation However similarperformance of two extractors for a particular dataset does not mean that both are usingthe same variables

            Our engagement was to provide some textual or graphical tools to discover whichextractors compute features similar to other ones Our hypothesis is that many of themuse the same theoretical foundations that should induce a grouping of similar extractorsIf we succeed discovering those groups we would also be able to select representativesThis information can be used in several ways For example from the perspective of a userthat develops feature extractors it would be interesting comparing the performance of hiscode against the K representatives instead to the whole database As another exampleimagine a user wants to obtain the best prediction results for a particular datasetInstead of selecting all the feature extractors creating an extremely high dimensionalspace he could select only the K representatives foreseeing similar results with a fasterexperiment

            As there is no prior knowledge about the latent structure we make use of unsupervisedtechniques Below there is a brief description of the different tools that we developedfor the web platform

            bull Clustering Using Mixture Models This is a well-known technique that mod-els the data as if it was randomly generated from a distribution function Thisdistribution is typically a mixture of Gaussian with unknown mixture proportionsmeans and covariance matrices The number of Gaussian components matchesthe number of expected groups The parameters of the model are computed usingthe EM algorithm and the clusters are built by maximum a posteriori estimationFor the calculation we use mixmod that is a c++ library that can be interfacedwith matlab This library allows working with high dimensional data Furtherinformation regarding mixmod is given by Bienarcki et al (2008) All details con-cerning the tool implemented are given in deliverable ldquomash-deliverable-D71-m12rdquo(Govaert et al 2010)

            bull Sparse Clustering Using Penalized Optimal Scoring This technique in-tends again to perform clustering by modelling the data as a mixture of Gaussiandistributions However instead of using a classic EM algorithm for estimatingthe componentsrsquo parameters the M-step is replaced by a penalized Optimal Scor-ing problem This replacement induces sparsity improving the robustness and theinterpretability of the results Its theory will be explained later in this thesis

            6

            All details concerning the tool implemented can be found in deliverable ldquomash-deliverable-D72-m24rdquo (Govaert et al 2011)

            bull Table Clustering Using The RV Coefficient This technique applies clus-tering methods directly to the tables computed by the feature extractors insteadcreating a single matrix A distance in the extractors space is defined using theRV coefficient that is a multivariate generalization of the Pearsonrsquos correlation co-efficient on the form of an inner product The distance is defined for every pair iand j as RV(OiOj) where Oi and Oj are operators computed from the tables re-turned by feature extractors i and j Once that we have a distance matrix severalstandard techniques may be used to group extractors A detailed description ofthis technique can be found in deliverables ldquomash-deliverable-D71-m12rdquo (Govaertet al 2010) and ldquomash-deliverable-D72-m24rdquo (Govaert et al 2011)

            I am not extending this section with further explanations about the MASH project ordeeper details about the theory that we used to commit our engagements I will simplyrefer to the public deliverables of the project where everything is carefully detailed(Govaert et al 2010 2011)

            7

            2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, so did the numerical issues. Redundant or extremely correlated features may appear if two contributors implement the same extractor under different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of machine learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems, the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and eases interpretation in the unsupervised framework. Removing features must be done wisely to avoid discarding critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes up when there is a restriction on the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008).

As a basic rule, we can use reduction techniques by feature transformation when the majority of the features are relevant, and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation of the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly, a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005).

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques used to preprocess data in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I reproduce here the scheme that generalizes any feature selection process, as shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I propose a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - Filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - Wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block, while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation form a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation; this involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Using the selected features to predict the labels.

– Cluster Goodness - Using the selected features to perform clustering and evaluating the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow the evaluation of subsets of features, and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This is the case, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

$$\min_{\beta} \; J(\beta) + \lambda P(\beta) \tag{2.1}$$

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \leq t \tag{2.2}$$

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, which is to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set for which the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted, in the Bayesian paradigm, as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖p.

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2) \tag{2.3}$$

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computational resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and of the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ‖β‖1 and ‖β‖2 penalties.

Regularizing a linear model with a norm like ‖β‖p means that the larger the component |βj|, the more important the feature xj in the estimation. On the contrary, the closer it is to zero, the more dispensable it is. In the limit where |βj| = 0, xj is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β1 or β2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2), where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves, whose global minimum βls is outside the penalties' admissible regions. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse, because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the number and the "sharpness" of the vertexes of the greyed out area. For example, an L1/3 penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L1/3 ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖0 = card{βj | βj ≠ 0}:

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \leq t \tag{2.4}$$

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ, if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. The solutions are sparse but unstable.
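The combinatorial nature of this constraint can be made explicit with the toy best-subset search sketched below, for a least squares cost; it enumerates all subsets of at most t variables, which is only feasible for a very small p (the data and the helper name best_subset are assumptions of this example).

import itertools
import numpy as np

def best_subset(X, y, t):
    # Exhaustive search of the least squares fit using at most t variables.
    n, p = X.shape
    best = (np.inf, (), None)
    for size in range(1, t + 1):
        for subset in itertools.combinations(range(p), size):
            coef, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
            rss = np.sum((y - X[:, subset] @ coef) ** 2)
            if rss < best[0]:
                best = (rss, subset, coef)
    return best[1], best[2]

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.standard_normal(30)
print(best_subset(X, y, t=2))  # expected to recover variables 0 and 3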

L1 Penalties. The penalties built using the L1 norm induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \leq t \tag{2.5}$$

Despite all the advantages of the Lasso, the choice of the right penalty is not only a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one of them, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
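This limitation is easy to observe numerically. The sketch below is a small illustration based on scikit-learn's Lasso (whose penalty parameterization differs from (2.5) by a scaling factor): with p > n, at most n variables can enter the model.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 20, 100                       # fewer samples than features
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:30] = 1.0                 # 30 relevant variables
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.2, max_iter=10000).fit(X, y)
print(np.count_nonzero(lasso.coef_), "non-zero coefficients for n =", n)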

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and to solve a linear system. Thus, an L2 penalized optimization problem looks like

$$\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2 \tag{2.6}$$

The effect of this penalty is the "equalization" of the components of the parameter being penalized. To enlighten this property, let us consider a least squares problem

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \tag{2.7}$$

with solution $\beta_{ls} = (X^\top X)^{-1} X^\top y$. If some input variables are highly correlated, the estimator $\beta_{ls}$ is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 .$$

The solution to this problem is $\beta_{\ell_2} = (X^\top X + \lambda I_p)^{-1} X^\top y$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performance.
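A minimal numerical check of this effect is sketched below: the ridge solution is computed in closed form, and the uniform upward shift of the eigenvalues of X⊤X is verified directly (the data are an assumption of the example).

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 1.0
X = rng.standard_normal((n, p))
X[:, 4] = X[:, 3] + 1e-3 * rng.standard_normal(n)   # two highly correlated columns
y = X @ np.array([1.0, 0.0, 0.0, 2.0, -2.0]) + 0.1 * rng.standard_normal(n)

# Ridge estimator: (X'X + lambda I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Every eigenvalue of X'X is moved upwards by lambda
eig_plain = np.linalg.eigvalsh(X.T @ X)
eig_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(p))
print(np.allclose(eig_ridge, eig_plain + lam))   # True
print(np.round(beta_ridge, 3))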

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \tag{2.8}$$

The effect is an elliptic admissible set, instead of the ball of ridge regression. Another example is adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component: every λj is optimized to penalize more or less, depending on the influence of βj in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ‖x‖∞ = max(|x1|, |x2|, ..., |xp|). The admissible region for a penalty like ‖β‖∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region fits a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term by itself; however, it frequently appears in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems, there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm ‖β‖* of a norm ‖β‖ is defined as

$$\|\beta\|^{*} = \max_{w \in \mathbb{R}^p} \; \beta^\top w \quad \text{s.t.} \quad \|w\| \leq 1 .$$

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why the L∞ norm is so important, even if it is not as popular as a penalty itself, because the L1 norm is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \tag{2.9}$$

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
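As a hedged illustration (scikit-learn parameterizes the two terms through alpha and l1_ratio rather than λ1 and λ2), the sketch below compares the Lasso and the Elastic net in a p > n setting where a group of correlated variables is relevant; the Elastic net typically retains more of them.

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n, p = 30, 120
Z = rng.standard_normal((n, 1))
X = np.hstack([Z + 0.05 * rng.standard_normal((n, 40)),   # 40 correlated relevant variables
               rng.standard_normal((n, p - 40))])
y = X[:, :40].sum(axis=1) + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X, y)
print("Lasso keeps", np.count_nonzero(lasso.coef_), "variables")
print("Elastic net keeps", np.count_nonzero(enet.coef_), "variables")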


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us identify as $G_\ell$ the group of genes for the $\ell$th process and $d_\ell$ the number of genes (variables) in each group, $\forall \ell \in \{1, \ldots, L\}$. Thus, the dimension of vector β is the sum of the number of genes of every group, $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \right)^{\frac{1}{r}} \tag{2.10}$$

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group $G_\ell$, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
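A direct transcription of (2.10) is sketched below; the representation of groups as a list of index lists is an assumption of this example.

import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    # ||beta||_(r,s): an L_s norm within each group, an L_r norm between groups
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.0, -2.0, 3.0, 0.0])
groups = [[0, 1], [2, 3], [4, 5]]
print(mixed_norm(beta, groups, r=1, s=2))   # group-Lasso norm ||beta||_(1,2)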

Several combinations are available; the most popular is the norm ‖β‖(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L(1,2) norm. Many other mixings are possible, such as ‖β‖(1,4/3) (Szafranski et al., 2008) or ‖β‖(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for the non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across all parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms, with a proper definition of groups, can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso, (a) L1, and the group-Lasso, (b) L(1,2).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity; (b) L(1,2) group-induced sparsity.
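The sketch below illustrates this pattern on an assumed coefficient matrix B, with one row per variable and one column per parameter: the variables whose whole row is null are the ones that can be removed.

import numpy as np

# Assumed example: 8 variables (rows) x 4 parameters (columns), as in Figure 2.6
B = np.array([[ 1.2, 0.0, -0.3,  0.7],
              [ 0.5, 0.4,  0.0, -0.1],
              [ 0.0, 0.0,  0.0,  0.0],   # variable 3: removable
              [-0.8, 0.2,  0.1,  0.0],
              [ 0.0, 0.0,  0.0,  0.0],   # variable 5: removable
              [ 0.3, 0.0,  0.9,  0.2],
              [ 0.1, 0.6,  0.0,  0.4],
              [ 0.0, 0.0,  0.0,  0.0]])  # variable 8: removable

row_norms = np.sqrt((B ** 2).sum(axis=1))    # group norm of each variable
print("removable variables:", np.where(row_norms == 0)[0] + 1)   # [3 5 8]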

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used in the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):

$$\beta^{(t+1)} = \beta^{(t)} - \alpha (s + \lambda s'), \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)}) .$$
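A minimal sketch of this update for the Lasso case (J being the residual sum of squares and P the L1 norm) is given below; the constant step size alpha is an assumption made for simplicity, whereas decreasing step sizes are usually required to guarantee convergence.

import numpy as np

def subgradient_lasso(X, y, lam, alpha=1e-3, n_iter=5000):
    # Plain subgradient descent on sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = -2 * X.T @ (y - X @ beta)     # gradient of the quadratic loss
        s_prime = np.sign(beta)           # a subgradient of the L1 norm
        beta = beta - alpha * (s + lam * s_prime)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(100)
print(np.round(subgradient_lasso(X, y, lam=5.0), 2))

Note that the returned coefficients are small but not exactly zero, which illustrates that the solutions produced this way are not sparse.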

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient βj gives

$$\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} .$$

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution βls and updating the values using an iterative thresholding algorithm where $\beta_j^{(t+1)} = S_{\lambda}\big(\frac{\partial J(\beta^{(t)})}{\partial \beta_j}\big)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_{\lambda}\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) = \begin{cases} \dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[1.5ex] \dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \dfrac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[1.5ex] 0 & \text{if } \Big|\dfrac{\partial J(\beta)}{\partial \beta_j}\Big| \leq \lambda \end{cases} \tag{2.11}$$

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
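A minimal sketch of this cyclic soft-thresholding update, applied to the quadratic loss of (2.11), is given below; it is an illustration, not the solver used in this thesis.

import numpy as np

def soft_threshold(a, thr):
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def cd_lasso(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1
    n, p = X.shape
    beta = np.zeros(p)
    z = (X ** 2).sum(axis=0)                         # sum_i x_ij^2 for each j
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_j
            beta[j] = soft_threshold(2 * (X[:, j] @ r_j), lam) / (2 * z[j])
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(100)
print(np.round(cd_lasso(X, y, lam=10.0), 2))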

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero βj; it is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā; in the inactive set we find the indices of the variables whose βj is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set, and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L(1,2) penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.
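A schematic working-set loop for the Lasso case is sketched below; it only illustrates the three tasks (optimization restricted to the active set, optimality check, and active set update), and the helper names are assumptions of this example.

import numpy as np

def soft_threshold(a, thr):
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def lasso_on_active(X, y, lam, active, n_iter=200):
    # Optimization task: coordinate descent restricted to the active variables
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in active:
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(2 * (X[:, j] @ r_j), lam) / (2 * (X[:, j] @ X[:, j]))
    return beta

def working_set_lasso(X, y, lam, tol=1e-6):
    # Forward working set: grow the active set with the worst violator
    p = X.shape[1]
    active, beta = [], np.zeros(p)
    while True:
        grad = -2 * X.T @ (y - X @ beta)         # used in the optimality conditions
        violation = np.abs(grad) - lam           # violation of |grad_j| <= lam
        violation[active] = -np.inf              # active variables are already optimized
        j = int(np.argmax(violation))
        if violation[j] <= tol:
            return beta, active
        active.append(j)                         # working set update task
        beta = lasso_on_active(X, y, lam, active)

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 20))
y = 2 * X[:, 3] - X[:, 7] + 0.1 * rng.standard_normal(80)
beta, active = working_set_lasso(X, y, lam=20.0)
print(sorted(active), np.round(beta[sorted(active)], 2))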

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A(t) and its corresponding solution β(t) have been set, looking for the regularization path means looking for a direction h and a step size γ to update the solution as β(t+1) = β(t) + γh. Afterwards, the active and inactive sets A(t+1) and Ā(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size and the variable that should enter the active set from the correlation with the residuals.
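For the Lasso, the full path is readily available in standard software. The sketch below uses scikit-learn's lasso_path (a coordinate-descent path rather than LARS, and with a penalty scaled by the sample size), simply to show how variables enter the model as the penalty decreases.

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3 * X[:, 0] - 2 * X[:, 4] + 0.1 * rng.standard_normal(100)

alphas, coefs, _ = lasso_path(X, y, n_alphas=10)   # coefs: (n_features, n_alphas)
for alpha, beta in zip(alphas, coefs.T):
    print(f"penalty={alpha:.3f}  active variables={np.flatnonzero(beta).tolist()}")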

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β). They are also iterative methods, where the cost function J(β) is linearized in the proximity of the current solution β(t), so that the problem to solve at each iteration looks like

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 , \tag{2.12}$$

where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) . \tag{2.13}$$

The basic algorithm makes use of the solution to (2.13) as the next value of β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13), the standard gradient update rule comes up.
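A minimal sketch of the resulting proximal-gradient (ISTA-like) iteration for the L1 penalty is given below: the solution of (2.13) is the soft-thresholding of the gradient step, and L is set from the largest eigenvalue of 2 X⊤X for the quadratic loss.

import numpy as np

def soft_threshold(a, thr):
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def ista(X, y, lam, n_iter=500):
    # Proximal gradient for sum_i (y_i - x_i'beta)^2 + lam * ||beta||_1
    beta = np.zeros(X.shape[1])
    L = 2 * np.linalg.eigvalsh(X.T @ X)[-1]           # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ beta)
        beta = soft_threshold(beta - grad / L, lam / L)   # solution of (2.13)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 15))
y = X[:, 0] - 2 * X[:, 5] + 0.1 * rng.standard_normal(100)
print(np.round(ista(X, y, lam=10.0), 2))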


            Part II

            Sparse Linear Discriminant Analysis


            Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables, and is thus particularly well suited to the analysis of gene expression data.


            3 Feature Selection in Fisher DiscriminantAnalysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations $x_i \in \mathbb{R}^p$ comprising p features, and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the K classes. It will be convenient to gather the observations in the $n \times p$ matrix $X = (x_1^\top, \ldots, x_n^\top)^\top$ and the corresponding labels in the $n \times K$ matrix $Y = (y_1^\top, \ldots, y_n^\top)^\top$.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

$$\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} \tag{3.1}$$

where β is the discriminant direction used to project the data, and ΣB and ΣW are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

$$\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,$$

$$\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,$$

where μ is the sample mean of the whole dataset, μk is the sample mean of class k, and Gk indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors βk may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

$$\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\left( B^\top \Sigma_B B \right)}{\operatorname{tr}\left( B^\top \Sigma_W B \right)} \tag{3.2}$$

where the matrix B is built with the discriminant directions βk as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B \beta_k \\ \text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \leq 1 \\ & \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k . \end{aligned} \tag{3.3}$$

The maximizer of subproblem k is the eigenvector of $\Sigma_W^{-1} \Sigma_B$ associated with the kth largest eigenvalue (see Appendix C).
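A compact numerical sketch of this computation on synthetic data is given below; it solves the generalized eigenproblem ΣB β = λ ΣW β with SciPy, and the small ridge added to ΣW (to keep it invertible) is an assumption of the example.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
K, p, n_k = 3, 5, 40
centers = 3 * rng.standard_normal((K, p))
X = np.vstack([rng.standard_normal((n_k, p)) + centers[k] for k in range(K)])
y = np.repeat(np.arange(K), n_k)

mu = X.mean(axis=0)
Sigma_W = np.zeros((p, p))
Sigma_B = np.zeros((p, p))
for k in range(K):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    Sigma_W += (Xk - mu_k).T @ (Xk - mu_k)
    Sigma_B += len(Xk) * np.outer(mu - mu_k, mu - mu_k)
Sigma_W /= len(X)
Sigma_B /= len(X)

# Discriminant directions: leading eigenvectors of the pencil (Sigma_B, Sigma_W)
eigvals, eigvecs = eigh(Sigma_B, Sigma_W + 1e-8 * np.eye(p))
B = eigvecs[:, ::-1][:, :K - 1]              # the K-1 leading directions
print(np.round(eigvals[::-1][:K - 1], 3))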

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly targets the reduction of the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, which is either Fisher's discriminant analysis (variance-based) or regression-based.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_W \beta \\ \text{s.t.} \;\; & (\mu_1 - \mu_2)^\top \beta = 1 \\ & \sum_{j=1}^{p} |\beta_j| \leq t , \end{aligned}$$

where μ1 and μ2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

$$\begin{aligned} \max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\ \text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \leq 1 . \end{aligned}$$

The term to maximize is the projected between-class covariance matrix $\beta_k^\top \Sigma_B^k \beta_k$, subject to an upper bound on the projected within-class covariance matrix $\beta_k^\top \Sigma_W \beta_k$. The penalty Pk(βk) is added to avoid singularities and to induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks less informative variables to zero, and the fused Lasso encourages a piecewise constant βk vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of $\Sigma_W$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_W^{-1}(\mu_1 - \mu_2)$, they estimate the product directly, through a constrained L1 minimization:

$$\begin{aligned} \min_{\beta \in \mathbb{R}^p} \;\; & \|\beta\|_1 \\ \text{s.t.} \;\; & \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \leq \lambda . \end{aligned}$$

Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

            Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix with the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (yik = 1 if sample i belongs to class k, and yik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al., 2009). Another "popular" choice is yik = 1 if sample i belongs to class k and yik = −1/(K − 1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).

There are some efforts proposing a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, which is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required. The lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

$$\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,$$

where yi is the binary indicator of the label of pattern xi. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β0 is required.

            Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification, and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

$$\min_{\Theta,\, B} \; \| Y\Theta - XB \|_F^2 + \lambda \operatorname{tr}\left( B^\top \Omega B \right) \tag{3.4a}$$
$$\text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{3.4b}$$

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

$$\min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \; \| Y\theta_k - X\beta_k \|^2 + \beta_k^\top \Omega \beta_k \tag{3.5a}$$
$$\text{s.t.} \quad n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 , \tag{3.5b}$$
$$\theta_k^\top Y^\top Y \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , \tag{3.5c}$$

where each βk corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005), introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_k \| Y\theta_k - X\beta_k \|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,$$

where λ1 and λ2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

$$\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \| Y\theta_k - X\beta_k \|_2^2 + \lambda \left( \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K-1} \beta_{kj}^2 } \right)^{2} , \tag{3.6}$$

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here we formally link (3.6) to penalized LDA and propose a publicly available, efficient code for solving this problem.


            4 Formalizing the Objective

In this chapter we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can either be complete, in dimension K − 1, or partial, for a reduced rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995), and were already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;

• inputs are centered, that is, $X^\top 1_n = 0$;

• the quadratic penalty $\Omega$ is positive-semidefinite and such that $X^\top X + \Omega$ is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in $(\theta, \beta)$, that is, convex in $\theta$ for each $\beta$ value and vice-versa. The problems are, however, non-convex: in particular, if $(\theta, \beta)$ is a solution, then $(-\theta, -\beta)$ is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K−1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by $\beta_K = 0$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore, so as to simplify all expressions. The generic problem solved is thus
\[
\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \; \|Y\theta - X\beta\|^2 + \beta^\top \Omega \beta \tag{4.1a}
\]
\[
\text{s.t.} \quad n^{-1}\, \theta^\top Y^\top Y \theta = 1 \,. \tag{4.1b}
\]

For a given score vector $\theta$, the discriminant direction $\beta$ that minimizes the p-OS criterion (4.1) is the penalized least squares estimator
\[
\beta_{\mathrm{os}} = \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta \,. \tag{4.2}
\]

The objective function (4.1a) is then
\[
\|Y\theta - X\beta_{\mathrm{os}}\|^2 + \beta_{\mathrm{os}}^\top \Omega \beta_{\mathrm{os}}
= \theta^\top Y^\top Y \theta - 2\theta^\top Y^\top X \beta_{\mathrm{os}} + \beta_{\mathrm{os}}^\top \big(X^\top X + \Omega\big)\beta_{\mathrm{os}}
= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta \,,
\]
where the second equality stems from the definition of $\beta_{\mathrm{os}}$ (4.2). Now, using the fact that the optimal $\theta$ obeys constraint (4.1b), the optimization problem is equivalent to
\[
\max_{\theta:\, n^{-1}\theta^\top Y^\top Y\theta = 1} \; \theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta \,, \tag{4.3}
\]
which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the kth largest eigenvector of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by
\[
(Y^\top Y)^{-1} Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta = \alpha^2 \theta \,, \tag{4.4}
\]


where $\alpha^2$ is the maximal eigenvalue:¹
\[
n^{-1}\theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta = \alpha^2\, n^{-1}\theta^\top (Y^\top Y)\theta
\quad\Longrightarrow\quad
n^{-1}\theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta = \alpha^2 \,. \tag{4.5}
\]

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:
\[
\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \; n^{-1}\, \theta^\top Y^\top X \beta \tag{4.6a}
\]
\[
\text{s.t.} \quad n^{-1}\, \theta^\top Y^\top Y \theta = 1 \tag{4.6b}
\]
\[
\phantom{\text{s.t.}} \quad n^{-1}\, \beta^\top \big(X^\top X + \Omega\big) \beta = 1 \,. \tag{4.6c}
\]
The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:
\[
n L(\beta, \theta, \nu, \gamma) = \theta^\top Y^\top X \beta - \nu\big(\theta^\top Y^\top Y \theta - n\big) - \gamma\big(\beta^\top (X^\top X + \Omega)\beta - n\big)
\]
\[
\Longrightarrow\quad n\, \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} = X^\top Y \theta - 2\gamma (X^\top X + \Omega)\beta
\quad\Longrightarrow\quad \beta_{\mathrm{cca}} = \frac{1}{2\gamma}\, (X^\top X + \Omega)^{-1} X^\top Y \theta \,.
\]
Then, as $\beta_{\mathrm{cca}}$ obeys (4.6c), we obtain
\[
\beta_{\mathrm{cca}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} \,, \tag{4.7}
\]
so that the optimal objective function (4.6a) can be expressed with $\theta$ alone:
\[
n^{-1}\theta^\top Y^\top X \beta_{\mathrm{cca}}
= \frac{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}}
= \sqrt{n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta} \,,
\]
and the optimization problem with respect to $\theta$ can be restated as
\[
\max_{\theta:\, n^{-1}\theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \theta \,. \tag{4.8}
\]
Hence the p-OS and p-CCA problems produce the same optimal score vectors $\theta$. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):
\[
\beta_{\mathrm{os}} = \alpha\, \beta_{\mathrm{cca}} \,, \tag{4.9}
\]

¹ The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where $\alpha$ is defined by (4.5).

The p-CCA optimization problem can also be written as a function of $\beta$ alone, using the optimality conditions for $\theta$:
\[
n\, \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} = Y^\top X \beta - 2\nu\, Y^\top Y \theta
\quad\Longrightarrow\quad \theta_{\mathrm{cca}} = \frac{1}{2\nu}\, (Y^\top Y)^{-1} Y^\top X \beta \,. \tag{4.10}
\]
Then, as $\theta_{\mathrm{cca}}$ obeys (4.6b), we obtain
\[
\theta_{\mathrm{cca}} = \frac{(Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} \,, \tag{4.11}
\]
leading to the following expression of the optimal objective function:
\[
n^{-1}\theta_{\mathrm{cca}}^\top Y^\top X \beta
= \frac{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}}
= \sqrt{n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta} \,.
\]
The p-CCA problem can thus be solved with respect to $\beta$ by plugging this value in (4.6):
\[
\max_{\beta \in \mathbb{R}^p} \; n^{-1}\beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta \tag{4.12a}
\]
\[
\text{s.t.} \quad n^{-1}\, \beta^\top \big(X^\top X + \Omega\big)\beta = 1 \,, \tag{4.12b}
\]
where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution; following the reasoning of Appendix C, $\beta_{\mathrm{cca}}$ verifies
\[
n^{-1} X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \big(X^\top X + \Omega\big)\beta_{\mathrm{cca}} \,, \tag{4.13}
\]
where $\lambda$ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:
\[
\begin{aligned}
& n^{-1}\beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{cca}} = \lambda \\
\Longrightarrow\;& n^{-1}\alpha^{-1}\beta_{\mathrm{cca}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Longrightarrow\;& n^{-1}\alpha\, \beta_{\mathrm{cca}}^\top X^\top Y \theta = \lambda \\
\Longrightarrow\;& n^{-1}\theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Longrightarrow\;& \alpha^2 = \lambda \,.
\end{aligned}
\]
The first line is obtained by obeying constraint (4.12b); the second line by the relationship (4.7), where the denominator is $\alpha$; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one uses the definition of $\alpha$ (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis is defined as follows:
\[
\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta \tag{4.14a}
\]
\[
\text{s.t.} \quad \beta^\top \big(\Sigma_W + n^{-1}\Omega\big)\beta = 1 \,, \tag{4.14b}
\]
where $\Sigma_B$ and $\Sigma_W$ are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation, using the projection operator $Y(Y^\top Y)^{-1}Y^\top$:
\[
\begin{aligned}
\Sigma_T &= \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top = n^{-1} X^\top X \\
\Sigma_B &= \frac{1}{n}\sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1} X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X \\
\Sigma_W &= \frac{1}{n}\sum_{k=1}^{K} \sum_{i:\, y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1}\Big(X^\top X - X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X\Big) \,.
\end{aligned}
\]
Using these formulae, the solution to the p-LDA problem (4.14) is obtained as
\[
\begin{aligned}
X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \lambda \Big(X^\top X + \Omega - X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X\Big) \beta_{\mathrm{lda}} \\
X^\top Y \big(Y^\top Y\big)^{-1} Y^\top X \beta_{\mathrm{lda}} &= \frac{\lambda}{1 - \lambda} \big(X^\top X + \Omega\big) \beta_{\mathrm{lda}} \,.
\end{aligned}
\]
The comparison of the last equation with $\beta_{\mathrm{cca}}$ (4.13) shows that $\beta_{\mathrm{lda}}$ and $\beta_{\mathrm{cca}}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that
\[
\beta_{\mathrm{lda}} = (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{cca}} = \alpha^{-1}(1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{os}} \,,
\]
which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
\[
\min_{\Theta,\, B} \; \|Y\Theta - XB\|_F^2 + \lambda\, \mathrm{tr}\big(B^\top \Omega B\big)
\quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \,.
\]
Let A represent the $(K-1)\times(K-1)$ diagonal matrix with elements $\alpha_k$ being the square root of the kth largest eigenvalue of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$; we have
\[
B_{\mathrm{LDA}} = B_{\mathrm{CCA}} \big(I_{K-1} - A^2\big)^{-\frac{1}{2}}
= B_{\mathrm{OS}}\, A^{-1}\big(I_{K-1} - A^2\big)^{-\frac{1}{2}} \,, \tag{4.15}
\]
where $I_{K-1}$ is the $(K-1)\times(K-1)$ identity matrix.

At this point, the feature matrix X, which in the input space has dimensions $n \times p$, can be projected into the optimal scoring domain as an $n \times (K-1)$ matrix $X_{\mathrm{OS}} = X B_{\mathrm{OS}}$, or into the linear discriminant analysis space as an $n \times (K-1)$ matrix $X_{\mathrm{LDA}} = X B_{\mathrm{LDA}}$. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process could be summarized as follows:

1. Solve the p-OS problem as
   \[
   B_{\mathrm{OS}} = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta \,,
   \]
   where $\Theta$ are the K−1 leading eigenvectors of $Y^\top X \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y$.

2. Translate the data samples X into the LDA domain as $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, where $D = A^{-1}\big(I_{K-1} - A^2\big)^{-\frac{1}{2}}$.

3. Compute the matrix M of centroids $\mu_k$ from $X_{\mathrm{LDA}}$ and Y.

4. Evaluate the distances $d(x, \mu_k)$ in the LDA domain as a function of M and $X_{\mathrm{LDA}}$.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, display a graphical representation of the data.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Section 4.2.1, Section 4.2.2 and Section 4.2.3, respectively.
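To make Steps 2 and 3 of the above summary concrete, here is a minimal matlab sketch, assuming that Bos (the p-OS coefficients) and alpha (the vector of values $\alpha_k$) have already been computed as described in Section 4.2.1; the variable names are illustrative and do not refer to the GLOSS package.

    D    = diag(1 ./ (alpha .* sqrt(1 - alpha.^2)));  % D = A^-1 (I_{K-1} - A^2)^(-1/2)
    Xlda = X * Bos * D;                               % Step 2: n x (K-1) discriminant variates
    M    = diag(1 ./ sum(Y)) * (Y' * Xlda);           % Step 3: K x (K-1) matrix of centroids mu_k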

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
\[
\min_{\Theta \in \mathbb{R}^{K\times(K-1)},\, B \in \mathbb{R}^{p\times(K-1)}} \; \|Y\Theta - XB\|_F^2 + \lambda\, \mathrm{tr}\big(B^\top \Omega B\big) \tag{4.16a}
\]
\[
\text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} \,, \tag{4.16b}
\]
where $\Theta$ are the class scores, B the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in $\Theta$ and B: the optimal $B_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to $\Theta$, and the optimization with respect to B is obtained in closed form, as a linear combination of the optimal scores $\Theta$ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize $\Theta$ to $\Theta^0$ such that $n^{-1}\, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$.

2. Compute $B = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta^0$.

3. Set $\Theta$ to be the K−1 leading eigenvectors of $Y^\top X \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y$.

4. Compute the optimal regression coefficients
   \[
   B_{\mathrm{OS}} = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta \,. \tag{4.17}
   \]

Defining $\Theta^0$ in Step 1, instead of using directly $\Theta$ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top} Y^\top X \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta^0$, which is computed as $\Theta^{0\top} Y^\top X B$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form $\mathrm{tr}(B^\top \Omega B)$. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and $\Theta$. That situation is developed by Clemmensen et al. (2011), where


a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.
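For illustration, a minimal matlab sketch of the four-step procedure for a quadratic penalty is given below. It is not the GLOSS code: it assumes that X (n × p) is centered, Y (n × K) is the class indicator matrix with no empty class, Om is a positive semi-definite p × p penalty matrix, and lambda > 0.

    [n, K]   = size(Y);
    U        = null(ones(1, K));                        % K x (K-1), orthonormal, orthogonal to 1_K
    Th0      = sqrt(n) * diag(1 ./ sqrt(sum(Y))) * U;   % Step 1: n^-1 Th0'*Y'*Y*Th0 = I
    R        = chol(X'*X + lambda*Om);                  % Cholesky factor, R'*R = X'X + lambda*Om
    B0       = R \ (R' \ (X' * (Y * Th0)));             % Step 2: penalized least squares fit
    Mmat     = Th0' * (Y' * (X * B0));                  % = Th0'*Y'*X*(X'X+lambda*Om)^-1*X'*Y*Th0
    [V, S]   = eig((Mmat + Mmat') / 2);                 % Step 3: eigen-analysis (symmetric matrix)
    [s, ord] = sort(diag(S), 'descend');
    V        = V(:, ord);
    Th       = Th0 * V;                                 % optimal scores
    Bos      = B0 * V;                                  % Step 4: optimal regression coefficients
    alpha    = sqrt(max(s, 0) / n);                     % values alpha_k of Section 4.1.4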

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where the sample $x_i$ is assigned to class k if $x_i$ is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (sample estimators $\hat\mu_k$ and $\hat\Sigma_W$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to the class k if the distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2\log\Big(\frac{n_k}{n}\Big) \tag{4.18}
\]
is minimized among all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\Sigma_{W\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
\[
\Sigma_{W\Omega}^{-1} = \big(n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B\big)^{-1}
= \big(n^{-1}X^\top X - \Sigma_B + n^{-1}\lambda\Omega\big)^{-1}
= \big(\Sigma_W + n^{-1}\lambda\Omega\big)^{-1} \,. \tag{4.19}
\]
Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution $B_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification;

• in the LDA domain (space of discriminant variates $X_{\mathrm{LDA}}$), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension $R < K-1$, by using the first R discriminant directions $\{\beta_k\}_{k=1}^{R}$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is
\[
\big\|(x_i - \mu_k) B_{\mathrm{OS}}\big\|_{\Sigma_{W\Omega}}^2 - 2\log(\pi_k) \,,
\]
where $\pi_k$ is the estimated class prior and $\|\cdot\|_S$ is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is
\[
\Big\|(x_i - \mu_k) B_{\mathrm{OS}} A^{-1}\big(I_{K-1} - A^2\big)^{-\frac{1}{2}}\Big\|_2^2 - 2\log(\pi_k) \,,
\]
which is a plain Euclidean distance.
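The nearest-centroid rule in the p-LDA domain can then be sketched in a few lines of matlab; Xlda and M are the projected samples and centroids introduced earlier, and all names are illustrative.

    pik  = sum(Y)' / size(Y, 1);                   % estimated class priors n_k / n
    dist = zeros(size(Xlda, 1), numel(pik));
    for k = 1:numel(pik)
        dif        = Xlda - M(k, :);               % plain Euclidean geometry in the LDA domain
        dist(:, k) = sum(dif.^2, 2) - 2 * log(pik(k));
    end
    [~, yhat] = min(dist, [], 2);                  % class assignment by minimal adjusted distance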


4.2.3 Posterior Probability Evaluation

Let $d(x, \mu_k)$ be a distance between x and $\mu_k$ defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k = 1 \mid x)$ can be estimated as
\[
p(y_k = 1 \mid x) \;\propto\; \exp\Big(-\frac{d(x, \mu_k)}{2}\Big)
\;\propto\; \pi_k \exp\Big(-\tfrac{1}{2}\big\|(x - \mu_k) B_{\mathrm{OS}} A^{-1}\big(I_{K-1} - A^2\big)^{-\frac{1}{2}}\big\|_2^2\Big) \,. \tag{4.20}
\]
Those probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \mu_k)$ take large values, $\exp\big(-\frac{d(x,\mu_k)}{2}\big)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:
\[
p(y_k = 1 \mid x)
= \frac{\pi_k \exp\big(-\frac{d(x,\mu_k)}{2}\big)}{\sum_\ell \pi_\ell \exp\big(-\frac{d(x,\mu_\ell)}{2}\big)}
= \frac{\pi_k \exp\big(\frac{-d(x,\mu_k) + d_{\max}}{2}\big)}{\sum_\ell \pi_\ell \exp\big(\frac{-d(x,\mu_\ell) + d_{\max}}{2}\big)} \,,
\]
where $d_{\max} = \max_k d(x, \mu_k)$.
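A minimal matlab sketch of this normalization, reusing the matrix dist of adjusted distances and the priors pik from the previous sketch (the shift by the per-sample maximum distance follows the text; shifting by the minimum distance is a common alternative of the same flavor):

    dmax = max(dist, [], 2);                       % d_max, one value per sample
    num  = pik' .* exp(-(dist - dmax) / 2);        % shifted, unnormalized posteriors
    post = num ./ sum(num, 2);                     % each row now sums to one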

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $X_{\mathrm{OS}}$, or of the discriminant variates $X_{\mathrm{LDA}}$, depending on whether we are presenting the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top\Omega\beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and $\Omega$ is positive definite). Quadratic penalties have interesting properties, but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see


Section 2.3.4) that induces groups of zeroes on the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $B^\top\Omega B$.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweigh a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \tag{4.21a}
\]
\[
\text{s.t.} \quad \sum_{j} \tau_j - \sum_{j} w_j \|\beta^j\|_2 \le 0 \tag{4.21b}
\]
\[
\phantom{\text{s.t.}} \quad \tau_j \ge 0 \,, \quad j = 1, \ldots, p \,, \tag{4.21c}
\]
where $B \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$, $B = \big(\beta^{1\top}, \ldots, \beta^{p\top}\big)^\top$, and the $w_j$ are predefined nonnegative weights. The cost function $J(B)$ in our context is the OS regression $\|Y\Theta - XB\|_2^2$; from now on, for the sake of simplicity, we simply write $J(B)$. Here and in what follows, $b/\tau$ is defined by continuation at zero: $b/0 = +\infty$ if $b \ne 0$ and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables $\tau_j$. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2$.

Proof. The Lagrangian of Problem (4.21) is
\[
L = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}
+ \nu_0 \Big( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \|\beta^j\|_2 \Big)
- \sum_{j=1}^{p} \nu_j \tau_j \,.
\]


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus the first-order optimality conditions for $\tau_j$ are
\[
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
\;\Leftrightarrow\; -\lambda w_j^2 \frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Leftrightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{\star 2} = 0 \,.
\]
The last implication is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$. Complementary slackness states that $\nu_j g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier for constraint $g_j(\tau_j) \le 0$. As a result, the optimal value of $\tau_j$ is
\[
\tau_j^\star = \sqrt{\frac{\lambda w_j^2 \|\beta^j\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\, w_j \|\beta^j\|_2 \,. \tag{4.22}
\]
We note that $\nu_0 \ne 0$ if there is at least one coefficient $\beta_{jk} \ne 0$; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):
\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \|\beta^j\|_2 = 0 \,, \tag{4.23}
\]
so that $\tau_j^\star = w_j \|\beta^j\|_2$. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \,. \tag{4.24}
\]
We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as $\lambda\, \mathrm{tr}(B^\top \Omega B)$, where
\[
\Omega = \mathrm{diag}\Big(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\Big) \,, \tag{4.25}
\]
with $\tau_j = w_j \|\beta^j\|_2$, resulting in the diagonal components
\[
(\Omega)_{jj} = \frac{w_j}{\|\beta^j\|_2} \,. \tag{4.26}
\]
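A quick numerical check of (4.25)–(4.26), on random data with illustrative names: with $\tau_j = w_j\|\beta^j\|_2$, the quadratic penalty $\lambda\,\mathrm{tr}(B^\top\Omega B)$ coincides with the group-Lasso penalty $\lambda \sum_j w_j\|\beta^j\|_2$.

    p = 6; K = 4; lambda = 0.3;
    B  = randn(p, K-1);  w = rand(p, 1) + 0.5;
    nr = sqrt(sum(B.^2, 2));                        % ||beta^j||_2, j = 1, ..., p
    Om = diag(w ./ nr);                             % diagonal matrix of (4.26)
    quadpen = lambda * trace(B' * Om * B);          % variational (quadratic) form
    glpen   = lambda * sum(w .* nr);                % standard group-Lasso penalty
    fprintf('difference: %g\n', quadpen - glpen);   % zero up to rounding errors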

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the active set implementation described in Chapter 5.

The first property states that the quadratic formulation is convex whenever J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \|\beta\|_2^2/\tau$, known as the perspective function of $f(\beta) = \|\beta\|_2^2$, is convex in $(\beta, \tau)$ (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B, \tau)$.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $B \in \mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (4.24) is
\[
\Big\{ V \in \mathbb{R}^{p\times(K-1)} : V = \frac{\partial J(B)}{\partial B} + \lambda G \Big\} \,, \tag{4.27}
\]
where $G \in \mathbb{R}^{p\times(K-1)}$ is a matrix composed of row vectors $g^j \in \mathbb{R}^{K-1}$, $G = \big(g^{1\top}, \ldots, g^{p\top}\big)^\top$, defined as follows. Let $S(B)$ denote the columnwise support of B, $S(B) = \{ j \in \{1,\ldots,p\} : \|\beta^j\|_2 \ne 0 \}$; then we have
\[
\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j \,, \tag{4.28}
\]
\[
\forall j \notin S(B), \quad \|g^j\|_2 \le w_j \,. \tag{4.29}
\]


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When $\|\beta^j\|_2 \ne 0$, the gradient of the penalty with respect to $\beta^j$ is
\[
\frac{\partial}{\partial \beta^j}\Big( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \Big) = \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} \,. \tag{4.30}
\]
At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):
\[
\partial_{\beta^j}\Big( \lambda \sum_{m=1}^{p} w_m \|\beta^m\|_2 \Big)
= \partial_{\beta^j}\big( \lambda w_j \|\beta^j\|_2 \big)
= \big\{ \lambda w_j v \in \mathbb{R}^{K-1} : \|v\|_2 \le 1 \big\} \,, \tag{4.31}
\]
which gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima:
\[
\forall j \in S^\star, \quad \frac{\partial J(B^\star)}{\partial \beta^j} + \lambda w_j \|\beta^{\star j}\|_2^{-1} \beta^{\star j} = 0 \,, \tag{4.32a}
\]
\[
\forall j \notin S^\star, \quad \Big\| \frac{\partial J(B^\star)}{\partial \beta^j} \Big\|_2 \le \lambda w_j \,, \tag{4.32b}
\]
where $S^\star \subseteq \{1,\ldots,p\}$ denotes the set of non-zero row vectors $\beta^{\star j}$, and $\bar S^\star$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
\[
B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p\times(K-1)}} \; \min_{\Theta \in \mathbb{R}^{K\times(K-1)}} \; \frac{1}{2}\|Y\Theta - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2
\quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\]


is equivalent to the penalized LDA problem
\[
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p\times(K-1)}} \; \mathrm{tr}\big(B^\top \Sigma_B B\big)
\quad \text{s.t.} \quad B^\top\big(\Sigma_W + n^{-1}\lambda\Omega\big)B = I_{K-1} \,,
\]
where $\Omega = \mathrm{diag}\big(\frac{w_1^2}{\tau_1}, \ldots, \frac{w_p^2}{\tau_p}\big)$, with
\[
\Omega_{jj} = \begin{cases} +\infty & \text{if } \beta^j_{\mathrm{os}} = 0 \\ w_j \|\beta^j_{\mathrm{os}}\|_2^{-1} & \text{otherwise.} \end{cases} \tag{4.33}
\]
That is, $B_{\mathrm{LDA}} = B_{\mathrm{OS}}\, \mathrm{diag}\big(\alpha_k^{-1}(1-\alpha_k^2)^{-1/2}\big)$, where $\alpha_k \in (0, 1)$ is the kth leading eigenvalue of
\[
n^{-1} Y^\top X \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \,.
\]
Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note, however, that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form $\mathrm{tr}\big(B^\top \Omega B\big)$.


            5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac{1}{2}\|Y\Theta - XB\|_2^2$.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach from Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty $\Omega$. This decomposition is especially attractive, as we then solve (K−1) similar systems
\[
\big(X_A^\top X_A + \lambda\Omega\big)\beta_k = X_A^\top Y \theta^0_k \,, \tag{5.1}
\]


Figure 5.1: GLOSS block diagram (flowchart of the active-set iterations: solve the p-OS problem on the active set, move variables violating the optimality conditions between the active and inactive sets, then compute Θ and update B).


Algorithm 1: Adaptively Penalized Optimal Scoring

    Input: X, Y, B, λ
    Initialize: A ← { j ∈ {1,…,p} : ‖β^j‖₂ > 0 };  Θ⁰ such that n⁻¹ Θ⁰ᵀYᵀYΘ⁰ = I_{K−1};  convergence ← false
    repeat
        % Step 1: solve (4.21) in B, assuming A optimal
        repeat
            Ω ← diag(Ω_A), with ω_j ← ‖β^j‖₂⁻¹
            B_A ← (X_AᵀX_A + λΩ)⁻¹ X_AᵀYΘ⁰
        until condition (4.32a) holds for all j ∈ A
        % Step 2: identify inactivated variables
        for j ∈ A such that ‖β^j‖₂ = 0 do
            if optimality condition (4.32b) holds then
                A ← A \ {j};  go back to Step 1
            end if
        end for
        % Step 3: check the greatest violation of optimality condition (4.32b) in the inactive set Ā
        j* ← argmax_{j ∈ Ā} ‖∂J/∂β^j‖₂
        if ‖∂J/∂β^{j*}‖₂ < λ then
            convergence ← true      % B is optimal
        else
            A ← A ∪ {j*}
        end if
    until convergence
    (s, V) ← eigenanalyze(Θ⁰ᵀYᵀX_A B), that is, Θ⁰ᵀYᵀX_A B V_k = s_k V_k, k = 1,…,K−1
    Θ ← Θ⁰V;  B ← BV;  α_k ← n^(−1/2) s_k^(1/2), k = 1,…,K−1
    Output: Θ, B, α


where $X_A$ denotes the columns of X indexed by A, and $\beta_k$ and $\theta^0_k$ denote the kth columns of B and $\Theta^0$, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" $\Omega$ for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to
\[
\big(X^\top X + \lambda\Omega\big)B = X^\top Y \Theta \,. \tag{5.2}
\]
Defining the Cholesky decomposition as $C^\top C = X^\top X + \lambda\Omega$, (5.2) is solved efficiently as follows:
\[
C^\top C B = X^\top Y \Theta
\quad\Rightarrow\quad C B = C^\top \backslash \big(X^\top Y \Theta\big)
\quad\Rightarrow\quad B = C \backslash \big(C^\top \backslash \big(X^\top Y \Theta\big)\big) \,, \tag{5.3}
\]
where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
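As an illustration, a minimal matlab sketch of (5.3) on the active set, with illustrative names (XA, Om, Th0):

    A   = XA' * XA + lambda * Om;     % Om is the current diagonal penalty on the active set
    C   = chol(A);                    % upper triangular factor, C' * C = A
    RHS = XA' * (Y * Th0);
    B   = C \ (C' \ RHS);             % two triangular solves, shared by the K-1 systems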

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer $\Omega$ is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of $\Omega$ reaches very large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of $X^\top X + \lambda\Omega$. This difficulty can be avoided using the following equivalent expression:
\[
B = \Omega^{-\frac{1}{2}}\Big(\Omega^{-\frac{1}{2}} X^\top X\, \Omega^{-\frac{1}{2}} + \lambda I\Big)^{-1}\Omega^{-\frac{1}{2}} X^\top Y \Theta^0 \,, \tag{5.4}
\]
where the conditioning of $\Omega^{-\frac{1}{2}} X^\top X\, \Omega^{-\frac{1}{2}} + \lambda I$ is always well-behaved, provided X is appropriately normalized (recall that $0 \le 1/\omega_j \le 1$). This stabler expression demands more computation and is thus reserved to cases with large $\omega_j$ values; our code is otherwise based on expression (5.2).
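A sketch of the stabler computation (5.4), under the same assumptions as the previous snippet (Om diagonal with positive entries):

    S = diag(1 ./ sqrt(diag(Om)));                        % Om^(-1/2)
    G = S * (XA' * XA) * S + lambda * eye(size(S, 1));    % well-conditioned system matrix
    B = S * (G \ (S * (XA' * (Y * Th0))));                % equivalent to (5.2), more stable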

5.2 Score Matrix

The optimal score matrix $\Theta$ is made of the K−1 leading eigenvectors of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$. This eigen-analysis is actually solved in the form $\Theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \Theta$ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of $\big(X^\top X + \Omega\big)^{-1}$, which


involves the inversion of a p × p matrix. Let $\Theta^0$ be an arbitrary K×(K−1) matrix whose range includes the K−1 leading eigenvectors of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$.¹ Then, solving the K−1 systems (5.3) provides the value of $B^0 = \big(X^\top X + \lambda\Omega\big)^{-1} X^\top Y \Theta^0$. This $B^0$ matrix can be identified in the expression to eigen-analyze, as
\[
\Theta^{0\top} Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 \,.
\]
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1)×(K−1) matrix $\Theta^{0\top} Y^\top X B^0 = V \Lambda V^\top$. Defining $\Theta = \Theta^0 V$, we have $\Theta^\top Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y \Theta = \Lambda$, and when $\Theta^0$ is chosen such that $n^{-1}\Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$, we also have that $n^{-1}\Theta^\top Y^\top Y \Theta = I_{K-1}$, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of $\Lambda$ are sorted in decreasing order, $\theta_k$ is an optimal solution to the kth problem of the p-OS series. Finally, once $\Theta$ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from $\Theta^0$ to $\Theta$, that is, $B = B^0 V$. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which $\Omega$ is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix $\Theta$. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function
\[
\frac{1}{2}\|Y\Theta - XB\|_2^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 \,. \tag{5.5}
\]
Let J(B) be the data-fitting term $\frac{1}{2}\|Y\Theta - XB\|_2^2$. Its gradient with respect to the jth row of B, $\beta^j$, is the (K−1)-dimensional vector
\[
\frac{\partial J(B)}{\partial \beta^j} = x_j^\top\big(XB - Y\Theta\big) \,,
\]
where $x_j$ is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as
\[
x_j^\top\big(XB - Y\Theta\big) + \lambda w_j \frac{\beta^j}{\|\beta^j\|_2} \,.
\]

¹ As X is centered, $1_K$ belongs to the null space of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$. It is thus sufficient to choose $\Theta^0$ orthogonal to $1_K$ to ensure that its range spans the leading eigenvectors of $Y^\top X \big(X^\top X + \Omega\big)^{-1} X^\top Y$. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set $\Theta^0 = \big(Y^\top Y\big)^{-1/2} U$, where U is a K×(K−1) matrix whose columns are orthonormal vectors orthogonal to $1_K$.


The second optimality condition (4.32b) can be computed for every variable j as
\[
\big\| x_j^\top\big(XB - Y\Theta\big) \big\|_2 \le \lambda w_j \,.
\]
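As an illustration, the first condition can be checked on the active set with a few lines of matlab; active is the vector of indices of the active variables, w the penalty weights, and the tolerance is arbitrary.

    Res = X * B - Y * Th;                                    % residuals of the OS fit
    ok  = true;
    for j = active(:)'                                       % condition (4.32a) on active variables
        gj = X(:, j)' * Res + lambda * w(j) * B(j, :) / norm(B(j, :));
        ok = ok && (norm(gj) < 1e-8);
    end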

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
\[
j^\star = \operatorname*{argmax}_{j} \; \max\Big( \big\| x_j^\top\big(XB - Y\Theta\big) \big\|_2 - \lambda w_j \,,\; 0 \Big) \,.
\]
The exclusion of a variable belonging to the active set A is considered if the norm $\|\beta^j\|_2$ is small and if, after setting $\beta^j$ to zero, the following optimality condition holds:
\[
\big\| x_j^\top\big(XB - Y\Theta\big) \big\|_2 \le \lambda w_j \,.
\]
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
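A matlab sketch of the inclusion rule, with illustrative names (Res is the residual matrix XB − YΘ, active the current index set, w the penalty weights):

    viol = sqrt(sum((X' * Res).^2, 2)) - lambda * w;   % violation of (4.32b), one value per variable
    viol(active) = -Inf;                               % only inactive variables are candidates
    [vmax, jstar] = max(viol);
    if vmax > 0
        active = sort([active; jstar]);                % add the worst violator to the active set
    end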

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter $\lambda_{\max}$ such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter $\lambda_{\max}$, corresponding to a null B matrix, is obtained by computing the optimality condition (4.32b) at B = 0:
\[
\lambda_{\max} = \max_{j \in \{1,\ldots,p\}} \; \frac{1}{w_j} \big\| x_j^\top Y \Theta^0 \big\|_2 \,.
\]
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties $\lambda_1 = \lambda_{\max} > \cdots > \lambda_t > \cdots > \lambda_T = \lambda_{\min} \ge 0$, by regularly decreasing the penalty, $\lambda_{t+1} = \lambda_t/2$, and using a warm-start strategy where the feasible initial guess for $B(\lambda_{t+1})$ is initialized with $B(\lambda_t)$. The final penalty parameter $\lambda_{\min}$ is specified in the optimization process when the maximum number of desired active variables is attained (by default, the minimum of n and p).
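A sketch of this strategy in matlab; gloss_fit is a hypothetical name for a routine solving the p-OS problem at a fixed λ (warm-started with the previous coefficients), and Th0 and w are defined as before.

    lmax   = max(sqrt(sum((X' * (Y * Th0)).^2, 2)) ./ w);   % lambda_max: condition (4.32b) at B = 0
    lambda = lmax;
    B      = zeros(p, K-1);
    while nnz(any(B, 2)) < min(n, p) && lambda > 0
        B      = gloss_fit(X, Y, lambda, B);   % hypothetical solver call, warm start
        lambda = lambda / 2;                   % regularly decreasing penalties
    end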


5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of variables. It thus makes sense to normalize them before applying the algorithm, or equivalently to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \|Y\Theta - XB\|_F^2
= \min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\Big(\Theta^\top Y^\top Y \Theta - 2\Theta^\top Y^\top X B + n B^\top \Sigma_T B\Big)
\]
are replaced by
\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \mathrm{tr}\Big(\Theta^\top Y^\top Y \Theta - 2\Theta^\top Y^\top X B + n B^\top \big(\Sigma_B + \mathrm{diag}(\Sigma_W)\big) B\Big) \,.
\]
Note that this variant only requires $\mathrm{diag}(\Sigma_W) + \Sigma_B + n^{-1}\Omega$ to be positive definite, which is a weaker requirement than $\Sigma_T + n^{-1}\Omega$ positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.


Figure 5.2: Neighborhood graph of the pixels of a 3×3 image (numbered 7 8 9 / 4 5 6 / 1 2 3, from top row to bottom row) and the corresponding Laplacian matrix:
\[
\Omega_L = \begin{pmatrix}
 3 & -1 &  0 & -1 & -1 &  0 &  0 &  0 &  0\\
-1 &  5 & -1 & -1 & -1 & -1 &  0 &  0 &  0\\
 0 & -1 &  3 &  0 & -1 & -1 &  0 &  0 &  0\\
-1 & -1 &  0 &  5 & -1 &  0 & -1 & -1 &  0\\
-1 & -1 & -1 & -1 &  8 & -1 & -1 & -1 & -1\\
 0 & -1 & -1 &  0 & -1 &  5 &  0 & -1 & -1\\
 0 &  0 &  0 & -1 & -1 &  0 &  3 & -1 &  0\\
 0 &  0 &  0 & -1 & -1 & -1 & -1 &  5 & -1\\
 0 &  0 &  0 &  0 & -1 & -1 &  0 & -1 &  3
\end{pmatrix}
\]

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3×3 image, with the corresponding Laplacian matrix. The Laplacian matrix $\Omega_L$ is positive-semidefinite, and the penalty $\beta^\top\Omega_L\beta$ favors, among vectors of identical L2 norm, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector $(1\ 1\ 0\ 1\ 1\ 0\ 0\ 0\ 0)^\top$, which is the indicator of the neighborhood of pixel 1, and it is 17 for the vector $(-1\ 1\ 0\ 1\ 1\ 0\ 0\ 0\ 0)^\top$, with a sign mismatch between pixel 1 and its neighbors.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
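For illustration, the Laplacian matrix of Figure 5.2 can be built as follows for an r×c pixel grid with 8-connectivity (a sketch; the 3×3 case recovers the matrix above, up to the pixel numbering):

    r = 3; c = 3;
    [I, J] = ndgrid(1:r, 1:c);                % row and column coordinates of each pixel
    A = zeros(r * c);
    for a = 1:r*c
        for b = a+1:r*c
            if max(abs(I(a) - I(b)), abs(J(a) - J(b))) == 1   % king-move neighbors
                A(a, b) = 1;  A(b, a) = 1;
            end
        end
    end
    OmL = diag(sum(A)) - A;                   % graph Laplacian, positive semi-definite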


            6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a pure Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites: PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important outcomes. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix $\Sigma_T$ to ones, or the diagonal of the within-class covariance matrix $\Sigma_W$ to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing the penalties with weights. The latter option is implemented in our matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹ The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: mean shift with independent features. There are four classes. If sample i is in class k, then $x_i \sim N(\mu_k, I)$, where $\mu_{1j} = 0.7 \times 1_{(1\le j\le 25)}$, $\mu_{2j} = 0.7 \times 1_{(26\le j\le 50)}$, $\mu_{3j} = 0.7 \times 1_{(51\le j\le 75)}$, $\mu_{4j} = 0.7 \times 1_{(76\le j\le 100)}$.

Simulation 2: mean shift with dependent features. There are two classes. If sample i is in class 1, then $x_i \sim N(0, \Sigma)$, and if i is in class 2, then $x_i \sim N(\mu, \Sigma)$, with $\mu_j = 0.6 \times 1_{(j\le 200)}$. The covariance structure is block diagonal, with 5 blocks, each of dimension 100×100. The blocks have $(j, j')$ element $0.6^{|j-j'|}$. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: one-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then $X_{ij} \sim N(\frac{k-1}{3}, 1)$ if $j \le 100$, and $X_{ij} \sim N(0, 1)$ otherwise.

Simulation 4: mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then $x_i \sim N(\mu_k, I)$, with mean vectors defined as follows: $\mu_{1j} \sim N(0, 0.3^2)$ for $j \le 25$ and $\mu_{1j} = 0$ otherwise; $\mu_{2j} \sim N(0, 0.3^2)$ for $26 \le j \le 50$ and $\mu_{2j} = 0$ otherwise; $\mu_{3j} \sim N(0, 0.3^2)$ for $51 \le j \le 75$ and $\mu_{3j} = 0$ otherwise; $\mu_{4j} \sim N(0, 0.3^2)$ for $76 \le j \le 100$ and $\mu_{4j} = 0$ otherwise.

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA, in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.
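As an example, Simulation 1 can be reproduced with the following matlab sketch (illustrative, with exactly balanced classes):

    p = 500; K = 4; n = 100;
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25 + (1:25)) = 0.7;       % 25 shifted features per class
    end
    y = repelem((1:K)', n / K);               % 25 samples per class
    X = mu(y, :) + randn(n, p);               % x_i ~ N(mu_k, I)
    Y = full(sparse(1:n, y, 1, n, K));        % class indicator matrix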

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far away by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                                              Err (%)       Var            Dir
    Sim 1: K = 4, mean shift, ind. features
      PLDA                                    12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
      SLDA                                    31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
      GLOSS                                   19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
      GLOSS-D                                 11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
    Sim 2: K = 2, mean shift, dependent features
      PLDA                                     9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
      SLDA                                    19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
      GLOSS                                   15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
      GLOSS-D                                  9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
    Sim 3: K = 4, 1D mean shift, ind. features
      PLDA                                    13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
      SLDA                                    57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
      GLOSS                                   31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
      GLOSS-D                                 18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
    Sim 4: K = 4, mean shift, ind. features
      PLDA                                    60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
      SLDA                                    65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
      GLOSS                                   60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
      GLOSS-D                                 58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

                Simulation 1      Simulation 2      Simulation 3      Simulation 4
                TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
    PLDA        99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
    SLDA        73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
    GLOSS       64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
    GLOSS-D     93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non-relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR, but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer.

² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and of the number of selected variables.

                                            Err (%)          Var
    Nakayama: n = 86, p = 22,283, K = 5
      PLDA                                  20.95 (1.3)      10478.7 (2116.3)
      SLDA                                  25.71 (1.7)        252.5 (3.1)
      GLOSS                                 20.48 (1.4)        129.0 (18.6)
    Ramaswamy: n = 198, p = 16,063, K = 14
      PLDA                                  38.36 (6.0)      14873.5 (720.3)
      SLDA                                  —                —
      GLOSS                                 20.61 (6.9)        372.4 (122.1)
    Sun: n = 180, p = 54,613, K = 4
      PLDA                                  33.78 (5.9)      21634.8 (7443.2)
      SLDA                                  36.22 (6.5)        384.4 (16.5)
      GLOSS                                 31.77 (4.5)         93.0 (93.6)

Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

            Each dataset was split into a training set and a test set with respectively 75 and25 of the examples Parameter tuning is performed by 10-fold cross-validation and thetest performances are then evaluated The process is repeated 10 times with randomchoices of training and test set split

            Test error rates and the number of selected variables are presented in Table 63 Theresults for the PLDA algorithm are extracted from Witten and Tibshirani (2011) Thethree methods have comparable prediction performances on the Nakayama and Sundatasets but GLOSS performs better on the Ramaswamy data where the SparseLDApackage failed to return a solution due to numerical problems in the LARS-EN imple-mentation Regarding the number of selected variables GLOSS is again much sparserthan its competitors

            Finally Figure 62 displays the projection of the observations for the Nakayama andSun datasets in the first canonical planes estimated by GLOSS and SLDA For theNakayama dataset groups 1 and 2 are well-separated from the other ones in both rep-resentations but GLOSS is more discriminant in the meta-cluster gathering groups 3to 5 For the Sun dataset SLDA suffers from a high colinearity of its first canonicalvariables that renders the second one almost non-informative As a result group 1 isbetter separated in the first canonical plane with GLOSS

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2: four scatter plots; rows: Nakayama and Sun datasets, columns: GLOSS and SLDA; axes: 1st and 2nd discriminant directions. Nakayama legend: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun legend: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.



Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0" computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships of Figure 5.2 into a penalty matrix ΩL, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix ΩL in the GLOSS algorithm is straightforward.
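For instance, the Laplacian of the pixel grid can be built as follows. This is a minimal sketch assuming a 4-neighbour adjacency between pixels; the actual graph used in Section 5.6.4 may differ in its details.

    import numpy as np

    def grid_laplacian(h=16, w=16):
        """Graph Laplacian L = D - A of the h x w pixel grid with 4-neighbour adjacency."""
        p = h * w
        A = np.zeros((p, p))
        for r in range(h):
            for c in range(w):
                i = r * w + c
                if c + 1 < w:               # right neighbour
                    A[i, i + 1] = A[i + 1, i] = 1
                if r + 1 < h:               # bottom neighbour
                    A[i, i + w] = A[i + w, i] = 1
        return np.diag(A.sum(axis=1)) - A

    Omega_L = grid_laplacian()              # 256 x 256 penalty matrix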

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow strokes to be detected, and will probably provide better prediction results.



Figure 6.4: Discriminant direction between digits "1" and "0" (left: β for GLOSS; right: β for S-GLOSS).

Figure 6.5: Sparse discriminant direction between digits "1" and "0" (left: β for GLOSS with λ = 0.3; right: β for S-GLOSS with λ = 0.3).


            Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. This is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active-set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performances. Employing the same features in all discriminant directions enables the generation of models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can also be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


            Part III

            Sparse Clustering Analysis


            Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


            7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^\top, \dots, x_n^\top)^\top have been drawn identically from K different subpopulations of the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compound distribution whose density can be expressed as

    f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i), \quad \forall i \in \{1, \dots, n\},

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ for all k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k of each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, ..., π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

    f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \phi(x_i; \theta_k), \quad \forall i \in \{1, \dots, n\},



where θ = (π_1, ..., π_K, θ_1, ..., θ_K) is the parameter of the model.
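This two-stage generative mechanism can be made concrete with a short simulation. The sketch below covers the Gaussian case with a common covariance matrix; the parameter values are arbitrary and only serve as an illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    pi = np.array([0.3, 0.5, 0.2])                     # mixture proportions
    mu = np.array([[0., 0.], [3., 0.], [0., 3.]])      # component means (K x p)
    Sigma = np.eye(2)                                  # common covariance matrix

    def sample_mixture(n):
        # y: multinomial class allocation; x: draw from the allotted component
        y = rng.choice(len(pi), size=n, p=pi)
        x = np.array([rng.multivariate_normal(mu[k], Sigma) for k in y])
        return x, y

    X, y = sample_mixture(500)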

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of a mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

            Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

    L(\theta; X) = \log\left( \prod_{i=1}^{n} f(x_i; \theta) \right)
                 = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right),   (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters), and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data.



The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:

    L_C(\theta; X, Y) = \log\left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right)
                      = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} y_{ik} \pi_k f_k(x_i; \theta_k) \right)
                      = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right).   (7.2)

The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k, and y_ik = 0 otherwise.

Defining the soft membership t_ik(θ) as

    t_{ik}(\theta) = p(Y_{ik} = 1 \mid x_i; \theta)   (7.3)
                   = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)}.   (7.4)

To lighten notations, t_ik(θ) will be denoted t_ik when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

    L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right)
                      = \sum_{i,k} y_{ik} \log\left( t_{ik} f(x_i; \theta) \right)
                      = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
                      = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)
                      = \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X),   (7.5)

where \sum_{i,k} y_{ik} \log t_{ik} can be reformulated as

    \sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( p(Y_{ik} = 1 \mid x_i; \theta) \right)
                                  = \sum_{i=1}^{n} \log\left( p(y_i \mid x_i; \theta) \right)
                                  = \log\left( p(Y \mid X; \theta) \right).

As a result, the relationship (7.5) can be rewritten as

    L(\theta; X) = L_C(\theta; Z) - \log\left( p(Y \mid X; \theta) \right).   (7.6)



            Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_ik are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value θ^(t):

    L(\theta; X) = \underbrace{E_{Y \sim p(\cdot \mid X; \theta^{(t)})}\left[ L_C(\theta; X, Y) \right]}_{Q(\theta, \theta^{(t)})}
                 + \underbrace{E_{Y \sim p(\cdot \mid X; \theta^{(t)})}\left[ -\log p(Y \mid X; \theta) \right]}_{H(\theta, \theta^{(t)})}.

In this expression, H(θ, θ^(t)) is the entropy and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then, θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:

    \Delta L = \underbrace{\left( Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right)}_{\geq 0 \text{ by definition of iteration } t+1}
             - \underbrace{\left( H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \right)}_{\leq 0 \text{ by Jensen's inequality}}.

Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, to show how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

    Q(\theta, \theta') = E_{Y \sim p(Y \mid X; \theta')}\left[ L_C(\theta; X, Y) \right]
                       = \sum_{i,k} p(Y_{ik} = 1 \mid x_i; \theta') \log\left( \pi_k f_k(x_i; \theta_k) \right)
                       = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left( \pi_k f_k(x_i; \theta_k) \right).   (7.7)

Q(θ, θ′), due to its similarity with the expression of the complete likelihood (7.2), is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);

• M-step: calculation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).



            Gaussian Model

In the particular case of a Gaussian mixture model with a common covariance matrix Σ and different mean vectors μ_k, the mixture density is

    f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
                   = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\}.

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameters θ^(t); then, the M-step maximizes Q(θ, θ^(t)) (7.7), whose form is as follows:

    Q(\theta, \theta^{(t)}) = \sum_{i,k} t_{ik} \log(\pi_k) - \sum_{i,k} t_{ik} \log\left( (2\pi)^{p/2} |\Sigma|^{1/2} \right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                            = \sum_{k} t_k \log(\pi_k) - \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                            \equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|) - \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right),   (7.8)

where

    t_k = \sum_{i=1}^{n} t_{ik}.   (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates, defining θ^(t+1):

    \pi_k^{(t+1)} = \frac{t_k}{n},   (7.10)
    \mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k},   (7.11)
    \Sigma^{(t+1)} = \frac{1}{n} \sum_k W_k,   (7.12)
    \text{with } W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top.   (7.13)

The derivations are detailed in Appendix G.
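A direct transcription of these E- and M-step updates, for the Gaussian model with a common covariance matrix, could look as follows. This is a compact sketch of equations (7.4) and (7.10)–(7.13) only: initialization is random and no safeguards against degenerate covariances are included.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_common_cov(X, K, n_iter=100, seed=0):
        n, p = X.shape
        rng = np.random.default_rng(seed)
        pi = np.full(K, 1.0 / K)
        mu = X[rng.choice(n, K, replace=False)]          # random initial centroids
        Sigma = np.cov(X, rowvar=False)
        for _ in range(n_iter):
            # E-step: posterior probabilities t_ik, cf. (7.4), in log scale for stability
            logdens = np.column_stack(
                [multivariate_normal.logpdf(X, mu[k], Sigma) for k in range(K)])
            logT = logdens + np.log(pi)
            logT -= logT.max(axis=1, keepdims=True)
            T = np.exp(logT)
            T /= T.sum(axis=1, keepdims=True)
            # M-step: updates (7.10)-(7.13)
            tk = T.sum(axis=0)
            pi = tk / n
            mu = (T.T @ X) / tk[:, None]
            Sigma = sum((T[:, k, None] * (X - mu[k])).T @ (X - mu[k])
                        for k in range(K)) / n
        return pi, mu, Sigma, T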

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules.



When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition Σ_k = λ_k D_k A_k D_k^⊤ (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

    \log\left( \frac{p(Y_k = 1 \mid x)}{p(Y_\ell = 1 \mid x)} \right) = x^\top \Sigma^{-1} (\mu_k - \mu_\ell) - \frac{1}{2} (\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) + \log\frac{\pi_k}{\pi_\ell}.

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{-1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm

    \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}|,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

    \lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} |(\Sigma_k^{-1})_{jm}|.

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.



Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

    \lambda \sum_{j=1}^{p} \sum_{1 \leq k \leq k' \leq K} |\mu_{kj} - \mu_{k'j}|.

This PFP regularization does not shrink the means to zero, but towards each other. If the jth feature of all cluster means is driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

    \lambda \sum_{j=1}^{p} \left\| (\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj}) \right\|_\infty.

One group is defined for each variable j, as the set of the jth components of the K means, (μ_{1j}, ..., μ_{Kj}). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al., 2008a) and a horizontal mean grouping (HMG). VMG achieves real feature selection, because it forces null values for the same variable in all cluster means:

    \lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K} \mu_{kj}^2}.

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing to test it is available on the authors' website.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is only quickly outlined in that work. We extend it by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy considering conditional independency; that is, the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as



    f(x_i \mid \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[ f(x_{ij} \mid \theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij} \mid \nu_j) \right]^{1 - \phi_j},

where f(·|θ_jk) is the distribution function for relevant features and h(·|ν_j) is the distribution function for the irrelevant ones. The binary vector φ = (φ_1, φ_2, ..., φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so all φ_j must be treated as missing variables. Thus, the set of parameters is {π_k, θ_jk, ν_j, ρ_j}. Their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes a multi-class Fisher's criterion

    \mathrm{tr}\left( (U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U \right)   (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and the model parameters in the latent space, such that the U matrix enters into the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse approximation Ũ of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

    \min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \left\| X_U - X\tilde{U} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \tilde{u}_k \right\|_1,

where X_U = XU is the input data projected in the non-sparse space and ũ_k is the kth column vector of the projection matrix Ũ. The second possibility is inspired by Qiao et al. (2009) and reformulates Fisher's discriminant (7.14), used to compute the projection matrix, as a regression criterion penalized by a mixture of Lasso and Elastic net:

    \min_{A, B \in \mathbb{R}^{p \times (K-1)}} \sum_{k=1}^{K} \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1
    \quad \text{s.t. } A^\top A = I_{K-1},

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^⊤ = Σ_B, and H_{B,k} is the kth column of H_B.



R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W. Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space. A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, ..., β_{K−1}] is the best sparse approximation of U.

The last possibility casts the solution of Fisher's discriminant (7.14) as the solution of the following constrained optimization problem:

    \min_{U \in \mathbb{R}^{p \times (K-1)}} \sum_{j=1}^{p} \left\| \Sigma_{B,j} - U U^\top \Sigma_{B,j} \right\|_2^2
    \quad \text{s.t. } U^\top U = I_{K-1},

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of Ũ to recover orthogonality. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain suppositions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion into or exclusion from X^(1);

• X^(3): the set of non-relevant variables.



With those subsets, they define two different models, where Y is the partition to consider:

• M1:

    f(X \mid Y) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid Y) = f(X^{(3)} \mid X^{(2)}, X^{(1)}) \, f(X^{(2)} \mid X^{(1)}) \, f(X^{(1)} \mid Y)

• M2:

    f(X \mid Y) = f(X^{(1)}, X^{(2)}, X^{(3)} \mid Y) = f(X^{(3)} \mid X^{(2)}, X^{(1)}) \, f(X^{(2)}, X^{(1)} \mid Y)

Model M1 means that the variables in X^(2) are independent of the clustering Y, while model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of variable X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

    B_{12} = \frac{f(X \mid M_1)}{f(X \mid M_2)},

where the high-dimensional f(X^(3)|X^(2), X^(1)) cancels from the ratio:

    B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} \mid M_2)}
           = \frac{f(X^{(2)} \mid X^{(1)}, M_1) \, f(X^{(1)} \mid M_1)}{f(X^{(2)}, X^{(1)} \mid M_2)}.

This factor is approximated, since the integrated likelihoods f(X^(1)|M1) and f(X^(2), X^(1)|M2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2)|X^(1), M1), if there is only one variable in X^(2), can be represented as a linear regression of variable X^(2) on the variables in X^(1). There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the quantity of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


            8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression on a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

    d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k),

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.



The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

    2\,\ell_{\mathrm{weight}}(\mu, \Sigma) = -\sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|),

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

    d(x_i, \mu_k) = \left\| (x_i - \mu_k) B_{\mathrm{LDA}} \right\|_2^2 - 2 \log(\pi_k).

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting all those elements together, the complete clustering algorithm can be summarized as:



1. Initialize the membership matrix Y (for example, by the K-means algorithm).

2. Solve the p-OS problem as

       B_{OS} = \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y \Theta,

   where Θ are the K − 1 leading eigenvectors of

       Y^\top X \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y.

3. Map X to the LDA domain: X_{LDA} = X B_{OS} D, with D = diag(α_k^{-1}(1 − α_k^2)^{-1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik, with

       t_{ik} \propto \exp\left[ -\frac{d(x, \mu_k) - 2\log(\pi_k)}{2} \right].   (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
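The iteration above can be sketched in a few lines of code. This is a simplified illustration only: a plain quadratic (ridge-type) penalty stands in for the group-Lasso machinery of GLOSS, the eigenproblem is written in its generalized form with the Y^T Y metric so that the α_k² behave as penalized canonical correlations, and no numerical refinements are included.

    import numpy as np
    from scipy.linalg import eigh

    def em_iteration_os(X, T, lam, Omega):
        """One EM iteration with the M-step solved by quadratically penalized optimal scoring.
        X: centered n x p data; T: n x K soft memberships; Omega: p x p penalty matrix."""
        n, p = X.shape
        K = T.shape[1]
        G = np.linalg.inv(X.T @ X + lam * Omega)        # (X'X + lam*Omega)^{-1}
        A = T.T @ X @ G @ X.T @ T                       # Y'X (X'X + lam*Omega)^{-1} X'Y
        alpha2, vecs = eigh(A, T.T @ T)                 # generalized eigenproblem (Y'Y metric)
        idx = np.argsort(alpha2)[::-1][:K - 1]          # K-1 leading eigenvectors
        alpha2 = np.clip(alpha2[idx], 1e-8, 1 - 1e-8)
        Theta = vecs[:, idx] * np.sqrt(n)               # scores such that (1/n) Theta'Y'Y Theta = I
        B_os = G @ X.T @ T @ Theta                      # step 2
        D = np.diag(1.0 / (np.sqrt(alpha2) * np.sqrt(1.0 - alpha2)))
        X_lda = X @ B_os @ D                            # step 3
        tk = T.sum(axis=0)
        pi = tk / n
        Mu = (T.T @ X_lda) / tk[:, None]                # step 4: centroids in the LDA domain
        d2 = ((X_lda[:, None, :] - Mu[None, :, :]) ** 2).sum(axis=2)   # step 5
        logT = np.log(pi) - 0.5 * d2                    # step 6, cf. (8.1), in log scale
        logT -= logT.max(axis=1, keepdims=True)
        T_new = np.exp(logT)
        T_new /= T_new.sum(axis=1, keepdims=True)
        return T_new, B_os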

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with penalized OS. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2).



Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference, limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

    f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2} |\Lambda_0|^{n/2} \Gamma_p(n/2)} \, |\Sigma^{-1}|^{(\nu_0 - p - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left( \Lambda_0^{-1} \Sigma^{-1} \right) \right\},

where ν0 is the number of degrees of freedom of the distribution, Λ0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

    \Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\left( n/2 + (1 - j)/2 \right).

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

            84

            82 Optimized Criterion

    Q(\theta, \theta') + \log\left( f(\Sigma \mid \Lambda_0, \nu_0) \right)
      = \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log|\Lambda_0| - \frac{p(p+1)}{4} \log(\pi)
        - \sum_{j=1}^{p} \log \Gamma\left( \frac{n}{2} + \frac{1-j}{2} \right) - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right)
      \equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log|\Lambda_0| - \frac{\nu_n - p - 1}{2} \log|\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right),   (8.2)

with

    t_k = \sum_{i=1}^{n} t_{ik}, \quad \nu_n = \nu_0 + n, \quad \Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \quad S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top.

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for Σ is

    \Sigma_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right),   (8.3)

where S0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν0 is chosen to be p + 1 and Λ0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).


            9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as schemed in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;



Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have verified that the implemented warm-start reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached.



The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternate variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0
    Y ← K-means(X, K)
Run non-penalized Mix-GLOSS:
    λ ← 0
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
lastLAMBDA ← false
repeat
    {Estimate λ}
    Compute the gradient at β_j = 0:
        ∂J(B)/∂β_j |_{β_j = 0} = x^{j⊤} ( Σ_{m ≠ j} x^m β^m − YΘ )
    Compute λ_max for every feature using (4.32b):
        λ_max^j = (1/w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
    {Run penalized Mix-GLOSS}
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA

Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y, for every λ in the solution path
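The key quantity of Algorithm 2, the per-feature λ_max derived from the gradient of the data-fitting term at β_j = 0, can be sketched as follows. This is an illustrative sketch only: the weights w_j are assumed to be the group weights of the penalty, and the quantile rule at the end is one possible way of targeting the removal of about 10% of the active variables.

    import numpy as np

    def lambda_max(X, YTheta, B, w):
        """Smallest penalty that zeroes each feature j (cf. (4.32b))."""
        n, p = X.shape
        lam_max = np.zeros(p)
        for j in range(p):
            B_j = B.copy()
            B_j[j, :] = 0.0                          # exclude variable j from the fit
            grad_j = X[:, j] @ (X @ B_j - YTheta)    # dJ/dbeta_j at beta_j = 0
            lam_max[j] = np.linalg.norm(grad_j) / w[j]
        return lam_max

    # A new trial lambda can then be chosen, for instance, as a quantile of lam_max
    # over the currently active variables, so as to drive roughly 10% of them to zero:
    # active = np.abs(B).sum(axis=1) > 0
    # next_lambda = np.quantile(lambda_max(X, YTheta, B, w)[active], 0.10)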

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.



Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0; Y ← Y0
    else
        B_OS ← 0; Y ← K-means(X, K)
    end if
    convergenceEM ← false; tolEM ← 1e-3
repeat
    {M-step}
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    {E-step}
    t_ik as per (8.1)
    L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y



            M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ, and the priors of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix ΘY. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

            E-Step

The E-step evaluates the posterior probability matrix T, using

    t_{ik} \propto \exp\left[ -\frac{d(x, \mu_k) - 2\log(\pi_k)}{2} \right].

The convergence of those t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy four-nested-loop structure.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.
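A generic form of such a sparsity-aware BIC can be sketched as follows. This is a rough sketch only: the effective number of parameters counts only the mean coordinates that remain in the model, which is the spirit of the criterion, but the exact parameter count used by Pan and Shen (2007) and in Mix-GLOSS may differ.

    import numpy as np

    def modified_bic(loglik, n, K, B, n_cov_params):
        """BIC = -2 log L + log(n) * d, with d counting only non-zero mean coordinates.
        B: p x (K-1) coefficient matrix; n_cov_params: parameters of the covariance model."""
        active = np.abs(B).sum(axis=1) > 0          # variables kept in the model
        d = (K - 1) + K * active.sum() + n_cov_params
        return -2.0 * loglik + np.log(n) * d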

The third and definitive attempt (up to now) proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter.



[Figure 9.2 flowchart: inputs X, K, λ, EMITER_MAX, REP_Mix-GLOSS → initial Mix-GLOSS (λ = 0, REP_Mix-GLOSS = 20) → use B and T from the best repetition as StartB and StartT → Mix-GLOSS(λ, StartB, StartT) → compute BIC → choose λ = argmin_λ BIC → outputs: partition, t_ik, π_k, λ_BEST, B, Θ, D, L(θ), active set.]

Figure 9.2: Mix-GLOSS model selection diagram.

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Diagram 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


            10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations, except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem, because with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to approximately maintain the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following state-of-the-art methods:

• CS general cov: This is a model-based clustering with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: This method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.



Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: Implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Bienarcki et al., 2008). Further information can be found in the related paper (Maugis et al., 2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), they introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: This is the clustering algorithm implemented using GLOSS (see



Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are the following:

• Clustering Error (in percentage). To measure the quality of the partition with a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Scholkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ (see the code sketch below).

• Number of Disposed Features. This value shows the number of variables whose coefficients have been zeroed; they are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of non-relevant variables that are (wrongly) selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
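To make these measures concrete, the following minimal Python sketch computes a label-permutation-invariant clustering error (matching clusters to classes with the Hungarian algorithm, which gives the same invariance as the measure of Wu and Scholkopf, 2007, although not necessarily their exact formulation) together with TPR and FPR from a mask of selected variables. The function names, the use of scipy and the toy values are our own illustrative choices, not code from the thesis.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Misclassification rate after optimally matching cluster IDs to class IDs."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # contingency[i, j] = number of points of class i assigned to cluster j
    contingency = np.array([[np.sum((y_true == c) & (y_pred == k))
                             for k in clusters] for c in classes])
    row, col = linear_sum_assignment(-contingency)   # maximize matched counts
    return 1.0 - contingency[row, col].sum() / len(y_true)

def tpr_fpr(selected, relevant):
    """TPR: fraction of relevant variables selected; FPR: fraction of irrelevant ones selected."""
    selected, relevant = np.asarray(selected, bool), np.asarray(relevant, bool)
    return selected[relevant].mean(), selected[~relevant].mean()

# toy check: p = 100 variables, the first 20 are relevant, 5 false positives
relevant = np.arange(100) < 20
selected = relevant.copy(); selected[50:55] = True
print(tpr_fpr(selected, relevant))                                        # (1.0, 0.0625)
print(clustering_error(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))   # 0.0
```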

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data.

                         Err (%)       Var           Time
Sim 1: K = 4, mean shift, ind. features
  CS general cov         46 (15)       985 (72)      884h
  Fisher EM              58 (87)       784 (52)      1645m
  Clustvarsel            602 (107)     378 (291)     383h
  LumiWCluster-Kuan      42 (68)       779 (4)       389s
  LumiWCluster-Wang      43 (69)       784 (39)      619s
  Mix-GLOSS              32 (16)       80 (09)       15h
Sim 2: K = 2, mean shift, dependent features
  CS general cov         154 (2)       997 (09)      783h
  Fisher EM              74 (23)       809 (28)      8m
  Clustvarsel            73 (2)        334 (207)     166h
  LumiWCluster-Kuan      64 (18)       798 (04)      155s
  LumiWCluster-Wang      63 (17)       799 (03)      14s
  Mix-GLOSS              77 (2)        841 (34)      2h
Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov         304 (57)      55 (468)      1317h
  Fisher EM              233 (65)      366 (55)      22m
  Clustvarsel            658 (115)     232 (291)     542h
  LumiWCluster-Kuan      323 (21)      80 (02)       83s
  LumiWCluster-Wang      308 (36)      80 (02)       1292s
  Mix-GLOSS              347 (92)      81 (88)       21h
Sim 4: K = 4, mean shift, ind. features
  CS general cov         626 (55)      999 (02)      112h
  Fisher EM              567 (104)     55 (48)       195m
  Clustvarsel            732 (4)       24 (12)       767h
  LumiWCluster-Kuan      692 (112)     99 (2)        876s
  LumiWCluster-Wang      697 (119)     991 (21)      825s
  Mix-GLOSS              669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms.

              Simulation 1     Simulation 2     Simulation 3     Simulation 4
              TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
  MIX-GLOSS   992    015       828    335       884    67        780    12
  LUMI-KUAN   992    28        1000   02        1000   005       50     005
  FISHER-EM   986    24        888    17        838    5825      620    4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (MIX-GLOSS, LUMI-KUAN, FISHER-EM) on Simulations 1 to 4.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm. Mix-GLOSS has a small edge in Simulation 1, LumiWCluster performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS recover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


            Conclusions

            Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems allows us to bring all the resources available for solving regression problems to the solution of linear discrimination problems. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness by inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

            Perspectives

Even if the preliminary results are optimistic, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimensional low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species


based on shape and texture (Clemmensen et al., 2011), or Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-equivalent penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers is to model the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


            Appendix


            A Matrix Properties

Property 1. By definition, \(\Sigma_W\) and \(\Sigma_B\) are both symmetric matrices:
\[
\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top ,
\qquad
\Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k\,(\mu_k-\bar{x})(\mu_k-\bar{x})^\top .
\]

Property 2. \(\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a\).

Property 3. \(\dfrac{\partial\, x^\top A x}{\partial x} = (A+A^\top)\,x\).

Property 4. \(\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top\).

Property 5. \(\dfrac{\partial\, a^\top X b}{\partial X} = ab^\top\).

Property 6. \(\dfrac{\partial}{\partial X}\,\mathrm{tr}\big(AX^{-1}B\big) = -\big(X^{-1}BAX^{-1}\big)^\top = -X^{-\top}A^\top B^\top X^{-\top}\).
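Some of these identities (Properties 4 and 6) are easy to verify numerically with central finite differences. The sketch below is our own quick sanity check, not part of the thesis code; the matrix sizes and the random data are arbitrary.

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
p = 4
A, B = rng.normal(size=(p, p)), rng.normal(size=(p, p))
X = rng.normal(size=(p, p)) + p * np.eye(p)        # well-conditioned test matrix
Xinv = np.linalg.inv(X)

# Property 4: d|X^{-1}|/dX = -|X^{-1}| (X^{-1})^T
g4 = num_grad(lambda M: np.linalg.det(np.linalg.inv(M)), X)
assert np.allclose(g4, -np.linalg.det(Xinv) * Xinv.T, atol=1e-5)

# Property 6: d tr(A X^{-1} B)/dX = -(X^{-1} B A X^{-1})^T
g6 = num_grad(lambda M: np.trace(A @ np.linalg.inv(M) @ B), X)
assert np.allclose(g6, -(Xinv @ B @ A @ Xinv).T, atol=1e-5)
print("matrix-calculus identities verified numerically")
```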


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form
\[
\min_{\theta_k,\,\beta_k}\ \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k
\tag{B.1}
\]
\[
\text{s.t.}\quad \theta_k^\top Y^\top Y\theta_k = 1, \qquad \theta_\ell^\top Y^\top Y\theta_k = 0 \quad \forall\,\ell<k ,
\]
for \(k=1,\ldots,K-1\).

The Lagrangian associated with Problem (B.1) is
\[
L_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k
+ \lambda_k\big(\theta_k^\top Y^\top Y\theta_k - 1\big) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k .
\tag{B.2}
\]
Setting the gradient of (B.2) with respect to \(\beta_k\) to zero gives the value of the optimal \(\beta_k\):
\[
\beta_k^\star = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k .
\tag{B.3}
\]

The objective function of (B.1) evaluated at \(\beta_k^\star\) is
\[
\min_{\theta_k}\ \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star
= \min_{\theta_k}\ \theta_k^\top Y^\top\big(I - X(X^\top X+\Omega_k)^{-1}X^\top\big)Y\theta_k
= \max_{\theta_k}\ \theta_k^\top Y^\top X(X^\top X+\Omega_k)^{-1}X^\top Y\theta_k .
\tag{B.4}
\]
If the penalty matrix \(\Omega_k\) is identical for all problems, \(\Omega_k=\Omega\), then (B.4) corresponds to an eigen-problem where the \(k\) score vectors \(\theta_k\) are the eigenvectors of \(Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\).

B.1 How to Solve the Eigenvector Decomposition

Making an eigen-decomposition of an expression like \(Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\) is not trivial due to the \(p\times p\) inverse. With some datasets \(p\) can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let \(M\) be the matrix \(Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\), such that we can rewrite expression (B.4) in a compact way:
\[
\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ \mathrm{tr}\big(\Theta^\top M\Theta\big)
\quad\text{s.t.}\quad \Theta^\top Y^\top Y\Theta = I_{K-1} .
\tag{B.5}
\]
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the \((K-1)\times(K-1)\) matrix \(M_\Theta\) be \(\Theta^\top M\Theta\). Hence the classical eigenvector formulation associated with (B.5) is
\[
M_\Theta v = \lambda v ,
\tag{B.6}
\]
where \(v\) is the eigenvector and \(\lambda\) the associated eigenvalue of \(M_\Theta\). Operating,
\[
v^\top M_\Theta v = \lambda \iff v^\top\Theta^\top M\Theta v = \lambda .
\]
Making the variable change \(w = \Theta v\), we obtain an alternative eigenproblem where the \(w\) are the eigenvectors of \(M\) and \(\lambda\) the associated eigenvalue:
\[
w^\top M w = \lambda .
\tag{B.7}
\]
Therefore \(v\) are the eigenvectors of the eigen-decomposition of the matrix \(M_\Theta\), and \(w\) are the eigenvectors of the eigen-decomposition of the matrix \(M\). Note that the only difference between the \((K-1)\times(K-1)\) matrix \(M_\Theta\) and the \(K\times K\) matrix \(M\) is the \(K\times(K-1)\) matrix \(\Theta\) in the expression \(M_\Theta = \Theta^\top M\Theta\). Then, to avoid the computation of the \(p\times p\) inverse \((X^\top X+\Omega)^{-1}\), we can use the optimal value of the coefficient matrix \(B^\star = (X^\top X+\Omega)^{-1}X^\top Y\Theta\) in \(M_\Theta\):
\[
M_\Theta = \Theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top X B^\star .
\]
Thus, the eigen-decomposition of the \((K-1)\times(K-1)\) matrix \(M_\Theta = \Theta^\top Y^\top X B^\star\) results in the \(v\) eigenvectors of (B.6). To obtain the \(w\) eigenvectors of the alternative formulation (B.7), the variable change \(w = \Theta v\) needs to be undone.

To summarize, we calculate the \(v\) eigenvectors as the eigen-decomposition of the tractable matrix \(M_\Theta\), evaluated as \(\Theta^\top Y^\top X B^\star\). Then the definitive eigenvectors \(w\) are recovered by \(w = \Theta v\). The final step is the reconstruction of the optimal score matrix using the vectors \(w\) as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial \(\Theta\) by the eigenvector matrix \(V\) from decomposition (B.6) reverses the change of variable to restore the \(w\) vectors. The \(B^\star\) matrix also needs to be "updated", by multiplying it by the same matrix of eigenvectors \(V\), in order to account for the initial \(\Theta\) matrix used in the first computation of \(B^\star\):
\[
B^\star \leftarrow (X^\top X+\Omega)^{-1}X^\top Y\Theta V = B^\star V .
\]
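The steps above can be illustrated with a short numerical sketch. This is our own code, not the GLOSS implementation: the ridge penalty Ω = λI, the column-centering of X, and the construction of an initial score matrix satisfying the normalization constraint are illustrative choices made so that the small eigen-problem reproduces the non-trivial eigenvalues of the direct K × K generalized problem.

```python
import numpy as np
from scipy.linalg import eigh, null_space

rng = np.random.default_rng(0)
n, p, K, lam = 60, 150, 3, 1.0
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                               # column-centered design
Y = np.eye(K)[rng.integers(0, K, n)]              # n x K indicator matrix
D = Y.T @ Y                                       # diagonal matrix of class counts

# initial Theta0: Theta0' D Theta0 = I, D-orthogonal to the constant score
Dsqrt = np.sqrt(np.diag(D))
Q = null_space(Dsqrt[None, :])                    # K x (K-1) orthonormal basis
Theta0 = Q / Dsqrt[:, None]                       # = D^{-1/2} Q

# B* = (X'X + Omega)^{-1} X'Y Theta0, via a linear solve (no explicit p x p inverse)
Omega = lam * np.eye(p)
B = np.linalg.solve(X.T @ X + Omega, X.T @ (Y @ Theta0))

# small (K-1)x(K-1) matrix M_Theta = Theta0' Y' X B* and its eigen-decomposition
M_Theta = Theta0.T @ Y.T @ X @ B
evals, V = eigh((M_Theta + M_Theta.T) / 2)

# "update" the scores and the coefficients
Theta, B = Theta0 @ V, B @ V

# check against the direct K x K generalized problem M w = lambda (Y'Y) w
M = Y.T @ X @ np.linalg.solve(X.T @ X + Omega, X.T @ Y)
print(np.sort(evals))
print(np.sort(eigh((M + M.T) / 2, D, eigvals_only=True))[1:])   # drop the trivial null direction
```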


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the Optimal Scoring literature, the score matrix that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix \(M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\).

By definition of the eigen-decomposition, the eigenvectors of the matrix \(M\) (called \(w\) in (B.7)) form a basis, so that any score vector \(\theta\) can be expressed as a linear combination of them:
\[
\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m , \qquad \text{s.t. } \theta_k^\top\theta_k = 1 .
\tag{B.8}
\]
The score vectors' normalization constraint \(\theta_k^\top\theta_k = 1\) can also be expressed as a function of this basis,
\[
\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big)^{\!\top}\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big) = 1 ,
\]
which, as per the eigenvector properties, can be reduced to
\[
\sum_{m=1}^{K-1}\alpha_m^2 = 1 .
\tag{B.9}
\]
Let \(M\) be multiplied by a score vector \(\theta_k\), which can be replaced by its linear combination of eigenvectors \(w_m\) (B.8):
\[
M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m .
\]
As the \(w_m\) are the eigenvectors of the matrix \(M\), the relationship \(Mw_m = \lambda_m w_m\) can be used to obtain
\[
M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m .
\]
Multiplying on the left by \(\theta_k^\top\), expressed as its linear combination of eigenvectors,
\[
\theta_k^\top M\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\Big)^{\!\top}\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\Big) .
\]
This equation can be simplified using the orthogonality property of eigenvectors, according to which \(w_\ell^\top w_m\) is zero for any \(\ell\neq m\), giving
\[
\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m .
\]
The optimization problem (B.5) for discriminant direction \(k\) can then be rewritten as
\[
\max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \theta_k^\top M\theta_k
= \max_{\theta_k\in\mathbb{R}^{K\times 1}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m ,
\qquad\text{with } \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \ \text{ and } \ \sum_{m=1}^{K-1}\alpha_m^2 = 1 .
\tag{B.10}
\]
One way of maximizing Problem (B.10) is choosing \(\alpha_m = 1\) for \(m = k\) and \(\alpha_m = 0\) otherwise. Hence, as \(\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m\), the resulting score vector \(\theta_k\) will be equal to the \(k\)th eigenvector \(w_k\).

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix \(M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\).

C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:
\[
\max_{\beta\in\mathbb{R}^p}\ \beta^\top\Sigma_B\beta
\tag{C.1a}
\]
\[
\text{s.t.}\quad \beta^\top\Sigma_W\beta = 1 ,
\tag{C.1b}
\]
where \(\Sigma_B\) and \(\Sigma_W\) are respectively the between-class variance and the within-class variance of the original \(p\)-dimensional data.

The Lagrangian of Problem (C.1) is
\[
L(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\,\big(\beta^\top\Sigma_W\beta - 1\big) ,
\]
so that its first derivative with respect to \(\beta\) is
\[
\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta .
\]
A necessary optimality condition for \(\beta\) is that this derivative is zero, that is,
\[
\Sigma_B\beta = \nu\,\Sigma_W\beta .
\]
Provided \(\Sigma_W\) is full rank, we have
\[
\Sigma_W^{-1}\Sigma_B\beta = \nu\,\beta .
\tag{C.2}
\]
Thus, the solutions \(\beta\) match the definition of an eigenvector of the matrix \(\Sigma_W^{-1}\Sigma_B\) with eigenvalue \(\nu\). To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\[
\beta^\top\Sigma_B\beta = \beta^\top\Sigma_W\Sigma_W^{-1}\Sigma_B\beta
= \nu\,\beta^\top\Sigma_W\beta \quad\text{from (C.2)}
= \nu \quad\text{from (C.1b)} .
\]
That is, the optimal value of the objective function to be maximized is the eigenvalue \(\nu\). Hence \(\nu\) is the largest eigenvalue of \(\Sigma_W^{-1}\Sigma_B\), and \(\beta\) is any eigenvector corresponding to this maximal eigenvalue.
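As a quick numerical illustration (our own sketch, with arbitrary toy data and variable names), the generalized eigenproblem \(\Sigma_B\beta = \nu\Sigma_W\beta\) can be solved directly with scipy's symmetric generalized eigensolver:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, p, K = 300, 5, 3
y = rng.integers(0, K, n)
X = rng.normal(size=(n, p))
X[np.arange(n), y] += 2.0          # class k is shifted along coordinate k

# within- and between-class covariance matrices (Property 1, Appendix A)
xbar = X.mean(axis=0)
Sigma_W, Sigma_B = np.zeros((p, p)), np.zeros((p, p))
for k in range(K):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
    Sigma_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar) / n

# generalized eigenproblem Sigma_B beta = nu Sigma_W beta; keep the leading eigenvector
nus, betas = eigh(Sigma_B, Sigma_W)
beta, nu = betas[:, -1], nus[-1]
print(nu, (beta @ Sigma_B @ beta) / (beta @ Sigma_W @ beta))   # both equal the largest eigenvalue
```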


D Alternative Variational Formulation for the Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
\[
\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j}
\tag{D.1a}
\]
\[
\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1 ,
\tag{D.1b}
\]
\[
\qquad\ \ \tau_j \ge 0,\ \ j=1,\ldots,p .
\tag{D.1c}
\]
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let \(B\in\mathbb{R}^{p\times(K-1)}\) be a matrix composed of row vectors \(\beta^j\in\mathbb{R}^{K-1}\), \(B = \big(\beta^{1\top},\ldots,\beta^{p\top}\big)^\top\).
\[
L(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p} w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j}
+ \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j .
\tag{D.2}
\]
The starting point is the Lagrangian (D.2), which is differentiated with respect to \(\tau_j\) to get the optimal value \(\tau_j^\star\):
\[
\left.\frac{\partial L(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\right|_{\tau_j=\tau_j^\star} = 0
\;\Rightarrow\; -\lambda w_j^2\,\frac{\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0 .
\]
The last two expressions are related through one property of the Lagrange multipliers, which states that \(\nu_j g_j(\tau^\star)=0\), where \(\nu_j\) is the Lagrange multiplier and \(g_j(\tau)\) is the inequality Lagrange condition. Then the optimal \(\tau_j^\star\) can be deduced:
\[
\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^j\|_2 .
\]
Placing this optimal value of \(\tau_j^\star\) into constraint (D.1b),
\[
\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2} .
\tag{D.3}
\]


With this value of \(\tau_j\), Problem (D.1) is equivalent to
\[
\min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)^{2} .
\tag{D.4}
\]
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null vectors \(\beta^j\).

The penalty term of (D.1a) can be conveniently presented as \(\lambda B^\top\Omega B\), where
\[
\Omega = \mathrm{diag}\Big(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\ldots,\frac{w_p^2}{\tau_p}\Big) .
\tag{D.5}
\]
Using the value of \(\tau_j^\star\) from (D.3), each diagonal component of \(\Omega\) is
\[
(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2} .
\tag{D.6}
\]
In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all \(B\in\mathbb{R}^{p\times(K-1)}\), the subdifferential of the objective function of Problem (D.4) is
\[
\Big\{V\in\mathbb{R}^{p\times(K-1)} :\ V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)G\Big\} ,
\tag{D.7}
\]
where \(G = \big(g^{1\top},\ldots,g^{p\top}\big)^\top\) is a \(p\times(K-1)\) matrix defined as follows. Let \(S(B)\) denote the row-wise support of \(B\), \(S(B) = \{j\in\{1,\ldots,p\} : \|\beta^j\|_2\neq 0\}\); then we have
\[
\forall j\in S(B),\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j ,
\tag{D.8}
\]
\[
\forall j\notin S(B),\quad \|g^j\|_2 \le w_j .
\tag{D.9}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if \(J(B)\) is strictly convex. All critical points \(B^\star\) of the objective function verifying the following conditions are global minima. Let \(S(B^\star)\) denote the row-wise support of \(B^\star\), \(S(B^\star) = \{j\in\{1,\ldots,p\} : \|\beta^{\star j}\|_2\neq 0\}\), and let \(\bar{S}(B^\star)\) be its complement; then we have
\[
\forall j\in S(B^\star),\quad -\frac{\partial J(B^\star)}{\partial\beta^j}
= 2\lambda\Big(\sum_{j'=1}^{p}w_{j'}\|\beta^{\star j'}\|_2\Big)\,w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} ,
\tag{D.10a}
\]
\[
\forall j\in\bar{S}(B^\star),\quad \Big\|\frac{\partial J(B^\star)}{\partial\beta^j}\Big\|_2
\le 2\lambda\,w_j\Big(\sum_{j'=1}^{p}w_{j'}\|\beta^{\star j'}\|_2\Big) .
\tag{D.10b}
\]

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at \(\tau^\star\) such that
\[
\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2} .
\]
Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let \(\tau\in\mathbb{R}^p\) be any feasible vector; we have
\[
\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)^{2}
= \Big(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^{2}
\le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Big)
\le \sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j} ,
\]
where we used the Cauchy-Schwarz inequality for the first inequality and the definition of the feasibility set of \(\tau\) for the second one.
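The tightness of this bound at τ* is easy to check numerically. The short sketch below (our own, with unit group weights and a random coefficient matrix) evaluates the variational penalty at τ* from (D.3) and compares it with the squared group-Lasso penalty of (D.4):

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 10, 4
B = rng.normal(size=(p, K - 1))
B[rng.random(p) < 0.4] = 0.0                  # some null rows, as in a sparse solution
w = np.ones(p)                                # unit group weights for simplicity

norms = np.linalg.norm(B, axis=1)             # ||beta^j||_2 for each row
tau = w * norms / np.sum(w * norms)           # optimal tau_j* from (D.3)

active = norms > 0                            # null rows contribute 0 to both penalties
variational = np.sum((w[active] * norms[active]) ** 2 / tau[active])
squared_group_lasso = np.sum(w * norms) ** 2
print(variational, squared_group_lasso)       # identical up to rounding
```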


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B0 are optimal for the score values Θ0, and if the optimal scores Θ* are obtained by a unitary transformation of Θ0, say Θ* = Θ0V (where V ∈ R^{M×M} is a unitary matrix), then B* = B0V is optimal conditionally on Θ*, that is, (Θ*, B*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let \(\hat{B}\) be a solution of
\[
\min_{B\in\mathbb{R}^{p\times M}}\ \|Y - XB\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2 ,
\tag{E.1}
\]
and let \(\tilde{Y} = YV\), where \(V\in\mathbb{R}^{M\times M}\) is a unitary matrix. Then \(\tilde{B} = \hat{B}V\) is a solution of
\[
\min_{B\in\mathbb{R}^{p\times M}}\ \|\tilde{Y} - XB\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2 .
\tag{E.2}
\]

Proof. The first-order necessary optimality conditions for \(\hat{B}\) are
\[
\forall j\in S(\hat{B}),\quad 2\,x^{j\top}\big(X\hat{B} - Y\big) + \lambda w_j\|\hat{\beta}^j\|_2^{-1}\hat{\beta}^j = 0 ,
\tag{E.3a}
\]
\[
\forall j\in\bar{S}(\hat{B}),\quad 2\,\big\|x^{j\top}\big(X\hat{B} - Y\big)\big\|_2 \le \lambda w_j ,
\tag{E.3b}
\]
where \(S(\hat{B})\subseteq\{1,\ldots,p\}\) denotes the set of non-zero row vectors of \(\hat{B}\) and \(\bar{S}(\hat{B})\) is its complement.

First, we note that, from the definition of \(\tilde{B}\), we have \(S(\tilde{B}) = S(\hat{B})\). Then we may rewrite the above conditions as follows:
\[
\forall j\in S(\tilde{B}),\quad 2\,x^{j\top}\big(X\tilde{B} - \tilde{Y}\big) + \lambda w_j\|\tilde{\beta}^j\|_2^{-1}\tilde{\beta}^j = 0 ,
\tag{E.4a}
\]
\[
\forall j\in\bar{S}(\tilde{B}),\quad 2\,\big\|x^{j\top}\big(X\tilde{B} - \tilde{Y}\big)\big\|_2 \le \lambda w_j ,
\tag{E.4b}
\]
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by \(V\), and also uses that \(VV^\top = I\), so that \(\forall u\in\mathbb{R}^M\), \(\|u^\top\|_2 = \|u^\top V\|_2\); Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for \(\tilde{B}\) to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
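This invariance is easy to verify numerically. With uniform weights w_j = 1, the group-Lasso over the rows of B is what scikit-learn's MultiTaskLasso solves (up to the 1/(2n) scaling of the data-fit term, which does not affect the argument). The sketch below is our own check, not the thesis code; the toy data, the penalty level and the tolerances are arbitrary.

```python
import numpy as np
from scipy.stats import ortho_group
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n, p, M = 100, 30, 3
X = rng.normal(size=(n, p))
Y = X[:, :5] @ rng.normal(size=(5, M)) + 0.1 * rng.normal(size=(n, M))

V = ortho_group.rvs(M, random_state=0)        # a random unitary (orthogonal) M x M matrix

# group-Lasso with uniform weights, as implemented by MultiTaskLasso
fit = lambda targets: MultiTaskLasso(alpha=0.5, fit_intercept=False,
                                     max_iter=10000, tol=1e-10).fit(X, targets).coef_.T

B_hat = fit(Y)            # p x M solution for Y
B_tilde = fit(Y @ V)      # p x M solution for the rotated targets YV

print(np.allclose(B_tilde, B_hat @ V, atol=1e-4))      # True: the solution simply rotates
print(np.flatnonzero(np.linalg.norm(B_hat, axis=1)),
      np.flatnonzero(np.linalg.norm(B_tilde, axis=1)))  # same row support
```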


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood Q(θ, θ′) (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from Q(θ, θ′) when the latter is available:
\[
L(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Big) ,
\tag{F.1}
\]
\[
Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big) ,
\tag{F.2}
\]
\[
\text{with}\quad t_{ik}(\theta') = \frac{\pi'_k f_k(x_i;\theta'_k)}{\sum_{\ell}\pi'_\ell f_\ell(x_i;\theta'_\ell)} .
\tag{F.3}
\]
In the EM algorithm, θ′ denotes the model parameters at the previous iteration, the \(t_{ik}(\theta')\) are the posterior probability values computed from θ′ at the previous E-step, and θ, without "prime", denotes the parameters of the current iteration, to be obtained by the maximization of Q(θ, θ′).

Using (F.3), we have
\[
Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big)
= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big)
+ \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_{\ell}\pi_\ell f_\ell(x_i;\theta_\ell)\Big)
= \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta) .
\]
In particular, after the evaluation of the \(t_{ik}\) in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:
\[
L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big)
= Q(\theta,\theta) + H(T) .
\]
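This identity can be checked directly. The following sketch (our own illustration, with arbitrary parameters and a common covariance matrix) computes the log-likelihood both from its definition (F.1) and as Q(θ, θ) + H(T):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n, p, K = 200, 2, 3
X = rng.normal(size=(n, p)) + rng.integers(0, K, n)[:, None]   # arbitrary data

# arbitrary current parameters theta = (pi_k, mu_k, Sigma), common covariance
pi = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, p))
Sigma = np.eye(p)

# log(pi_k f_k(x_i; theta_k)) for every sample and component
log_pf = np.column_stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma)
                          for k in range(K)])

# E-step: posterior probabilities t_ik(theta)
log_T = log_pf - logsumexp(log_pf, axis=1, keepdims=True)
T = np.exp(log_T)

# direct log-likelihood (F.1) versus Q(theta, theta) + entropy of the posteriors
loglik_direct = logsumexp(log_pf, axis=1).sum()
Q = np.sum(T * log_pf)
H = -np.sum(T * log_T)
print(loglik_direct, Q + H)    # identical up to rounding
```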


            G Derivation of the M-Step Equations

This appendix shows the whole process of obtaining expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as
\[
Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\,\log\big(\pi_k f_k(x_i;\theta_k)\big)
= \sum_{k}\log\Big(\pi_k^{\sum_i t_{ik}}\Big) - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma|
- \frac{1}{2}\sum_{i,k} t_{ik}\,(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k) ,
\]
which has to be maximized subject to \(\sum_k\pi_k = 1\).

The Lagrangian of this problem is
\[
L(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_k\pi_k - 1\Big) .
\]
The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of \(\pi_k\), \(\mu_k\) and \(\Sigma\).

G.1 Prior probabilities

\[
\frac{\partial L(\theta)}{\partial\pi_k} = 0 \iff \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0 ,
\]
where \(\lambda\) is identified from the constraint, leading to
\[
\pi_k = \frac{1}{n}\sum_i t_{ik} .
\]


G.2 Means

\[
\frac{\partial L(\theta)}{\partial\mu_k} = 0 \iff -\frac{1}{2}\sum_i t_{ik}\,2\,\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik}\,x_i}{\sum_i t_{ik}} .
\]

G.3 Covariance Matrix

\[
\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0 \iff
\underbrace{\frac{n}{2}\Sigma}_{\text{as per Property 4}}
- \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}} = 0
\;\Rightarrow\; \Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top .
\]
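These three updates translate directly into numpy. The sketch below is our own transcription, not the Mix-GLOSS code; the responsibilities T would come from the preceding E-step.

```python
import numpy as np

def m_step(X, T):
    """M-step for a Gaussian mixture with a common covariance matrix.
    X: (n, p) data; T: (n, K) posterior probabilities t_ik from the E-step."""
    n, p = X.shape
    nk = T.sum(axis=0)                        # soft counts, one per component
    pi = nk / n                               # prior probabilities
    mu = (T.T @ X) / nk[:, None]              # component means
    Sigma = np.zeros((p, p))                  # pooled (common) covariance
    for k in range(T.shape[1]):
        R = X - mu[k]
        Sigma += (T[:, k, None] * R).T @ R
    return pi, mu, Sigma / n

# toy usage with arbitrary responsibilities
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
T = rng.dirichlet(np.ones(4), size=100)       # rows sum to one
pi, mu, Sigma = m_step(X, T)
print(pi.sum(), mu.shape, Sigma.shape)        # 1.0 (4, 3) (3, 3)
```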


            Bibliography

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Bienarcki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, arXiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.

T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.

J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, arXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.

T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision, ICCV 2007, pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast, incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.

Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.

M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.

                                                                                                                    • Invariance of the Group-Lasso to Unitary Transformations
                                                                                                                    • Expected Complete Likelihood and Likelihood
                                                                                                                    • Derivation of the M-Step Equations
                                                                                                                      • Prior probabilities
                                                                                                                      • Means
                                                                                                                      • Covariance Matrix
                                                                                                                          • Bibliography

              Contents

4.1.3 Penalized Linear Discriminant Analysis 39
4.1.4 Summary 40
4.2 Practicalities 41
4.2.1 Solution of the Penalized Optimal Scoring Regression 41
4.2.2 Distance Evaluation 42
4.2.3 Posterior Probability Evaluation 43
4.2.4 Graphical Representation 43
4.3 From Sparse Optimal Scoring to Sparse LDA 43
4.3.1 A Quadratic Variational Form 44
4.3.2 Group-Lasso OS as Penalized LDA 47

5 GLOSS Algorithm 49
5.1 Regression Coefficients Updates 49
5.1.1 Cholesky decomposition 52
5.1.2 Numerical Stability 52
5.2 Score Matrix 52
5.3 Optimality Conditions 53
5.4 Active and Inactive Sets 54
5.5 Penalty Parameter 54
5.6 Options and Variants 55
5.6.1 Scaling Variables 55
5.6.2 Sparse Variant 55
5.6.3 Diagonal Variant 55
5.6.4 Elastic net and Structured Variant 55

6 Experimental Results 57
6.1 Normalization 57
6.2 Decision Thresholds 57
6.3 Simulated Data 58
6.4 Gene Expression Data 60
6.5 Correlated Data 63
Discussion 63

III Sparse Clustering Analysis 67

Abstract 69

7 Feature Selection in Mixture Models 71
7.1 Mixture Models 71
7.1.1 Model 71
7.1.2 Parameter Estimation: The EM Algorithm 72
7.2 Feature Selection in Model-Based Clustering 75
7.2.1 Based on Penalized Likelihood 76
7.2.2 Based on Model Variants 77
7.2.3 Based on Model Selection 79

8 Theoretical Foundations 81
8.1 Resolving EM with Optimal Scoring 81
8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis 81
8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis 82
8.1.3 Clustering Using Penalized Optimal Scoring 82
8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis 83
8.2 Optimized Criterion 83
8.2.1 A Bayesian Derivation 84
8.2.2 Maximum a Posteriori Estimator 85

9 Mix-GLOSS Algorithm 87
9.1 Mix-GLOSS 87
9.1.1 Outer Loop: Whole Algorithm Repetitions 87
9.1.2 Penalty Parameter Loop 88
9.1.3 Inner Loop: EM Algorithm 89
9.2 Model Selection 91

10 Experimental Results 93
10.1 Tested Clustering Algorithms 93
10.2 Results 95
10.3 Discussion 97

Conclusions 97

Appendix 103

A Matrix Properties 105

B The Penalized-OS Problem is an Eigenvector Problem 107
B.1 How to Solve the Eigenvector Decomposition 107
B.2 Why the OS Problem is Solved as an Eigenvector Problem 109

C Solving Fisher's Discriminant Problem 111

D Alternative Variational Formulation for the Group-Lasso 113
D.1 Useful Properties 114
D.2 An Upper Bound on the Objective Function 115

E Invariance of the Group-Lasso to Unitary Transformations 117

F Expected Complete Likelihood and Likelihood 119

G Derivation of the M-Step Equations 121
G.1 Prior probabilities 121
G.2 Means 122
G.3 Covariance Matrix 122

Bibliography 123

              List of Figures

1.1 MASH project logo 5

2.1 Example of relevant features 10
2.2 Four key steps of feature selection 11
2.3 Admissible sets in two dimensions for different pure norms ||β||_p 14
2.4 Two dimensional regularized problems with ||β||_1 and ||β||_2 penalties 15
2.5 Admissible sets for the Lasso and Group-Lasso 20
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters 20

4.1 Graphical representation of the variational approach to Group-Lasso 45

5.1 GLOSS block diagram 50
5.2 Graph and Laplacian matrix for a 3 × 3 image 56

6.1 TPR versus FPR for all simulations 60
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA 62
6.3 USPS digits "1" and "0" 63
6.4 Discriminant direction between digits "1" and "0" 64
6.5 Sparse discriminant direction between digits "1" and "0" 64

9.1 Mix-GLOSS Loops Scheme 88
9.2 Mix-GLOSS model selection diagram 92

10.1 Class mean vectors for each artificial simulation 94
10.2 TPR versus FPR for all simulations 97

              List of Tables

6.1 Experimental results for simulated data, supervised classification 59
6.2 Average TPR and FPR for all simulations 60
6.3 Experimental results for gene expression data, supervised classification 61

10.1 Experimental results for simulated data, unsupervised clustering 96
10.2 Average TPR versus FPR for all clustering simulations 96

              Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

              Sets

N        the set of natural numbers, N = {1, 2, ...}
R        the set of reals
|A|      cardinality of a set A (for finite sets, the number of elements)
Ā        complement of set A

              Data

𝒳        input domain
x_i      input sample, x_i ∈ 𝒳
X        design matrix, X = (x_1^⊤, ..., x_n^⊤)^⊤
x^j      column j of X
y_i      class indicator of sample i
Y        indicator matrix, Y = (y_1^⊤, ..., y_n^⊤)^⊤
z        complete data, z = (x, y)
G_k      set of the indices of observations belonging to class k
n        number of examples
K        number of classes
p        dimension of 𝒳
i, j, k  indices running over N

Vectors, Matrices and Norms

0        vector with all entries equal to zero
1        vector with all entries equal to one
I        identity matrix
A^⊤      transpose of matrix A (ditto for vectors)
A^{-1}   inverse of matrix A
tr(A)    trace of matrix A
|A|      determinant of matrix A
diag(v)  diagonal matrix with v on the diagonal
‖v‖_1    L1 norm of vector v
‖v‖_2    L2 norm of vector v
‖A‖_F    Frobenius norm of matrix A


              Probability

E[·]        expectation of a random variable
var[·]      variance of a random variable
N(μ, σ²)    normal distribution with mean μ and variance σ²
W(W, ν)     Wishart distribution with ν degrees of freedom and scale matrix W
H(X)        entropy of random variable X
I(X; Y)     mutual information between random variables X and Y

              Mixture Models

y_ik         hard membership of sample i to cluster k
f_k          distribution function for cluster k
t_ik         posterior probability of sample i to belong to cluster k
T            posterior probability matrix
π_k          prior probability or mixture proportion for cluster k
μ_k          mean vector of cluster k
Σ_k          covariance matrix of cluster k
θ_k          parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)        parameter vector at iteration t of the EM algorithm
f(X; θ)      likelihood function
L(θ; X)      log-likelihood function
L_C(θ; X, Y) complete log-likelihood function

              Optimization

J(·)      cost function
L(·)      Lagrangian
β̂         generic notation for the solution with respect to β
β̂^ls      least squares solution coefficient vector
A         active set
γ         step size to update the regularization path
h         direction to update the regularization path


              Penalized models

λ, λ1, λ2      penalty parameters
P_λ(θ)         penalty term over a generic parameter vector
β_kj           coefficient j of discriminant vector k
β_k            kth discriminant vector, β_k = (β_k1, ..., β_kp)
B              matrix of discriminant vectors, B = (β_1, ..., β_{K−1})
β^j            jth row of B = (β^{1⊤}, ..., β^{p⊤})^⊤
B_LDA          coefficient matrix in the LDA domain
B_CCA          coefficient matrix in the CCA domain
B_OS           coefficient matrix in the OS domain
X_LDA          data matrix in the LDA domain
X_CCA          data matrix in the CCA domain
X_OS           data matrix in the OS domain
θ_k            score vector k
Θ              score matrix, Θ = (θ_1, ..., θ_{K−1})
Y              label matrix
Ω              penalty matrix
L_CP(θ; X, Z)  penalized complete log-likelihood function
Σ_B            between-class covariance matrix
Σ_W            within-class covariance matrix
Σ_T            total covariance matrix
Σ̂_B            sample between-class covariance matrix
Σ̂_W            sample within-class covariance matrix
Σ̂_T            sample total covariance matrix
Λ              inverse of covariance matrix, or precision matrix
w_j            weights
τ_j            penalty components of the variational approach


              Part I

              Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. The models and some basic concepts that will be used along this document are also detailed there, and the state of the art is reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


              1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiegne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of the website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.


Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user wants to obtain the best prediction results for a particular dataset: instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Bienarcki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique again intends to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(Oi, Oj), where Oi and Oj are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors; a small sketch of this coefficient is given after this list. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
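To make the comparison of extractors concrete, here is a minimal NumPy sketch of the RV coefficient between two tables describing the same objects, assuming the usual definition based on cross-product operators computed from column-centered tables; the function name and the toy data are illustrative and are not part of the MASH platform.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two tables describing the same n objects.

    X (n x p) and Y (n x q) only need to share the number of rows. Each table
    is column-centered and represented by its n x n cross-product operator,
    and the two operators are compared through a matrix correlation.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Ox = Xc @ Xc.T                       # operator of the first table
    Oy = Yc @ Yc.T                       # operator of the second table
    num = np.trace(Ox @ Oy)
    den = np.sqrt(np.trace(Ox @ Ox) * np.trace(Oy @ Oy))
    return num / den

# Toy usage: two "extractors" evaluated on the same 50 objects.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
B = A @ rng.normal(size=(10, 7))         # features derived from the same information
C = rng.normal(size=(50, 7))             # unrelated features
print(rv_coefficient(A, B))              # large value: similar extractors
print(rv_coefficient(A, C))              # small value: dissimilar extractors
```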

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


              2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined, while many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions turn those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina 2004, Fan and Fan 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and eases interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text:

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques for preprocessing data, used in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps of a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references that does not cover all the possibilities but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process for a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset or even subsequent subsets are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieve this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and stable; this happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a low sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_\beta \; J(\beta) + \lambda P(\beta) \qquad (2.1)

\min_\beta \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \qquad (2.2)

In expressions (2.1) and (2.2), the parameters λ and t have a similar function, namely to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. The penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.

Figure 2.3: Admissible sets in two dimensions for different pure norms ‖β‖_p

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ‖β‖_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible set corresponding to each pure penalty is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two dimensional regularized problems with ‖β‖_1 and ‖β‖_2 penalties

Regularizing a linear model with a norm like ‖β‖_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, such as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than that of an L2 penalty. This idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum β^ls is outside the penalties' admissible regions. The closest point to this β^ls for the L1 regularization is β^ℓ1, and for the L2 regularization it is β^ℓ2. The solution β^ℓ1 is sparse because its second component is zero, while both components of β^ℓ2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with L_p norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex, hence the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ‖β‖_0 = card{β_j | β_j ≠ 0}:

\min_\beta \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using L1 norms induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_\beta \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^p |\beta_j| \le t \qquad (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not so easy as a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, then the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts beside regression, particularly in the field of feature selection in supervised classification (Mai et al. 2012, Witten and Tibshirani 2011) and clustering (Roth and Lange 2004, Pan et al. 2006, Pan and Shen 2007, Zhou et al. 2009, Guo et al. 2010, Witten and Tibshirani 2010, Bouveyron and Brunet 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature, defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large. Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu 2000, Donoho et al. 2006, Meinshausen and Buhlmann 2006, Zhao and Yu 2007, Bach 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou 2006).

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

\min_\beta \; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)

The effect of this penalty is the "equalization" of the components of the parameter vector that is being penalized. To enlighten this property, let us consider a least squares problem

\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 \qquad (2.7)

with solution β^ls = (XᵀX)⁻¹Xᵀy. If some input variables are highly correlated, the estimator β^ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2

The solution to this problem is β^ℓ2 = (XᵀX + λI_p)⁻¹Xᵀy. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
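As a small illustration of this shift of the spectrum, the NumPy sketch below compares the eigenvalues of XᵀX and XᵀX + λI_p on a toy design with two strongly correlated columns and computes the ridge estimator in closed form; the data and the value of λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 30, 5, 10.0

# A design matrix with two nearly collinear columns makes X'X ill-conditioned.
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(size=n)

G = X.T @ X
print(np.linalg.eigvalsh(G))                    # smallest eigenvalue close to zero
print(np.linalg.eigvalsh(G + lam * np.eye(p)))  # every eigenvalue shifted up by lam

# Ridge estimator in closed form: (X'X + lam I)^{-1} X'y
beta_ridge = np.linalg.solve(G + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```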

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^p \frac{\beta_j^2}{(\hat\beta_j^{ls})^2} \qquad (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet 1998, Grandvalet and Canu 2002), where the penalty parameter differs on each component: every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although the L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of L_p norms is the infinity norm, defined as ‖x‖_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ‖β‖_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term in itself; however, it is frequently combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm ‖β‖_* of a norm ‖β‖ is defined as

\|\beta\|_* = \max_{w \in \mathbb{R}^p} \beta^\top w \quad \text{s.t.} \quad \|w\| \le 1

In the case of an L_q norm with q ∈ [1, +∞], the dual norm is the L_r norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not as popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).

2.3.3 Hybrid Penalties

There are no reasons for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie 2005), whose objective is to improve the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_\beta \; \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 \qquad (2.9)

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al. 2009), that is, the asymptotic capability (when n goes to infinity) of always making the right choice of relevant variables.
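For a quick experiment, scikit-learn provides ElasticNet and Lasso estimators; note that their (alpha, l1_ratio) parameterization rescales the squared loss by 1/(2n) and mixes the two penalties differently from the (λ1, λ2) pair of (2.9), so the sketch below only illustrates the qualitative effect on a pair of strongly correlated variables, not an exact solution of (2.9).

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(2)
n, p = 40, 100
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)     # two nearly identical features
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# The pure L1 penalty tends to keep only one of the two correlated features,
# while the additional L2 term spreads the weight over both of them.
print(lasso.coef_[:2])
print(enet.coef_[:2])
```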


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓ-th process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β will be the sum of the number of genes of every group, dim(β) = Σ_{ℓ=1}^L d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \left( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \right)^{1/r} \qquad (2.10)

The pair (r, s) identifies the norms that are combined: an L_s norm within groups and an L_r norm between groups. The L_s norm penalizes the variables in every group G_ℓ, while the L_r norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups. A small sketch evaluating this norm is given below.
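The short function below evaluates the mixed norm of (2.10) for an arbitrary partition of the coefficients into groups; the group structure and the (r, s) values are toy examples.

```python
import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    """Evaluate ||beta||_(r,s): an L_s norm within each group G_l, an L_r norm between groups."""
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 1.5, -2.0, 0.3, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

print(mixed_norm(beta, groups, r=1, s=2))   # group-Lasso norm ||beta||_(1,2)
print(np.sum(np.abs(beta)))                 # plain Lasso norm ||beta||_1, for comparison
```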

Several combinations are available; the most popular is the norm ‖β‖_(1,2), known as the group-Lasso (Yuan and Lin 2006, Leng 2008, Xie et al. 2008a,b, Meier et al. 2008, Roth and Fischer 2008, Yang et al. 2010, Sanchez Merchante et al. 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L_{1,2} norm. Many other mixings are possible, such as ‖β‖_(1,4/3) (Szafranski et al. 2008) or ‖β‖_(1,∞) (Wang and Zhu 2008, Kuan et al. 2010, Vogt and Roth 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al. 2009), the composite absolute penalties (Zhao et al. 2009), or combinations of mixed and pure norms such as Lasso and group-Lasso (Friedman et al. 2010, Sprechmann et al. 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one on the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown on the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L_{1,2} or L_{1,∞} mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one on the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso and the Group-Lasso. (a) L1: Lasso; (b) L_{(1,2)}: group-Lasso

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters. (a) L1 induced sparsity; (b) L_{(1,2)} group induced sparsity

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints" implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β^(t+1) is updated proportionally to the negative subgradient of the function at the current point β^(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha\,(s + \lambda s') \quad \text{where } s \in \partial J(\beta^{(t)}),\; s' \in \partial P(\beta^{(t)})

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

\beta_j = \frac{-\lambda\,\mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^n x_{ij}^2}

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β^ls and updating its value using an iterative thresholding algorithm where β_j^(t+1) = S_λ(∂J(β^(t))/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^n x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex]
\dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2\sum_{i=1}^n x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex]
0 & \text{if } \left| \frac{\partial J(\beta)}{\partial \beta_j} \right| \le \lambda
\end{cases}
\qquad (2.11)

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin 2006, Wu and Lange 2008).
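As an illustration of update (2.11), the sketch below runs cyclic coordinate descent with soft-thresholding on the squared loss of (2.7) plus a Lasso penalty; it is a bare-bones teaching example with an arbitrary fixed number of sweeps, not the algorithm developed later in this thesis.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for  sum_i (y_i - x_i' beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # squared column norms (update denominators)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual: remove the current contribution of variable j
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 20))
true_beta = np.zeros(20)
true_beta[[0, 3, 7]] = [2.0, -1.5, 1.0]
y = X @ true_beta + 0.1 * rng.normal(size=50)
print(np.round(lasso_cd(X, y, lam=5.0), 2))  # only a few coordinates remain non-zero
```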

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of the variables with non-zero β_j; it is usually identified as the set A. The complement of the active set is the "inactive set", noted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be simplified to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth 2004), linear functions and L_{1,2} penalties (Roth and Fischer 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al. 2003). The algorithm developed in this work belongs to this family of solutions.
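Stripped of all problem-specific details, a forward working-set loop could be organized as in the skeleton below, where solve_restricted and violation are placeholders for the optimization and optimality-check tasks described above (they are not functions of any existing library), and the tolerance and size limit are arbitrary.

```python
import numpy as np

def working_set_solver(X, y, lam, solve_restricted, violation, tol=1e-6, max_active=50):
    """Generic forward working-set loop for a sparse penalized problem.

    solve_restricted(X_A, y, lam, beta_init) -> coefficients on the active set only.
    violation(X, y, beta, lam) -> one optimality-violation score per variable.
    """
    p = X.shape[1]
    active = []                        # indices of the active set A
    beta = np.zeros(p)
    while len(active) < max_active:
        v = violation(X, y, beta, lam)
        v[active] = 0.0                # only inactive variables are candidates
        j = int(np.argmax(v))
        if v[j] <= tol:                # optimality conditions hold: beta is a solution
            break
        active.append(j)               # the most violating variable enters A
        # warm start from the previous coefficients of the active variables
        beta_active = solve_restricted(X[:, active], y, lam, beta[active])
        beta[:] = 0.0
        beta[active] = beta_active
    return beta, active
```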

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims 2006, Smola et al. 2008, Franc and Sonnenburg 2008) or Multiple Kernel Learning (Sonnenburg et al. 2006).

              Regularization Path The regularization path is the set of solutions that can be reachedwhen solving a series of optimization problems of the form (21) where the penaltyparameter λ is varied It is not an optimization technique per se but it is of practicaluse when the exact regularization path can be easily followed Rosset and Zhu (2007)stated that this path is piecewise linear for those problems where the cost function ispiecewise quadratic and the regularization term is piecewise linear (or vice-versa)

              This concept was firstly applied to Lasso algorithm of Osborne et al (2000b) Howeverit was after the publication of the algorithm called Least Angle Regression (LARS)developed by Efron et al (2004) that those techniques become popular LARS definesthe regularization path using active constraint techniques

              Once that an active set A(t) and its corresponding solution β(t) have been set lookingfor the regularization path means looking for a direction h and a step size γ to updatethe solution as β(t+1) = β(t) + γh Afterwards the active and inactive sets A(t+1) andA(t+1) are updated That can be done looking for the variables that strongly violate theoptimality conditions Hence LARS sets the update step size and which variable shouldenter in the active set from the correlation with residuals

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β):

\[
\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 .
\tag{2.12}
\]

They are also iterative methods, where the cost function J(β) is linearized in the proximity of the current solution β(t), so that the problem to solve at each iteration looks like (2.12), where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. This can be rewritten as

\[
\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L}\, P(\beta) .
\tag{2.13}
\]

The basic algorithm uses the solution to (2.13) as the next iterate β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: indeed, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
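As an illustration of the update (2.13), the sketch below assumes the squared loss $J(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$ and the Lasso penalty $P(\beta) = \|\beta\|_1$, whose proximal operator is soft-thresholding; the function name and the choice of L are ours.

import numpy as np

def proximal_gradient_lasso(X, y, lam, n_iter=200):
    # ISTA-style iterations: gradient step on J, then the prox of (lam/L)*||.||_1.
    p = X.shape[1]
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2        # upper bound on the Lipschitz constant of grad J
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)      # gradient of the smooth part J at beta^(t)
        z = beta - grad / L              # gradient step, as inside the norm of (2.13)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return beta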


              Part II

              Sparse Linear Discriminant Analysis


              Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Chapter 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models in terms of variables.

In this part, we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of the data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data and generate a graphical display. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performance. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates the classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of the data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which such linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part, we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of $n$ examples, with observations $x_i \in \mathbb{R}^p$ comprising $p$ features, and labels $y_i \in \{0,1\}^K$ indicating the exclusive assignment of observation $x_i$ to one of the $K$ classes. It will be convenient to gather the observations in the $n \times p$ matrix $X = (x_1^\top, \ldots, x_n^\top)^\top$ and the corresponding labels in the $n \times K$ matrix $Y = (y_1^\top, \ldots, y_n^\top)^\top$.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

\[
\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta} ,
\tag{3.1}
\]

where β is the discriminant direction used to project the data, and $\Sigma_B$ and $\Sigma_W$ are the $p \times p$ between-class and within-class covariance matrices, respectively, defined (for a K-class problem) as

\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,
\qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mu - \mu_k)(\mu - \mu_k)^\top ,
\]

where μ is the sample mean of the whole dataset, $\mu_k$ the sample mean of class k, and $G_k$ indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors $\beta_k$ may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example the maximization of a trace ratio:

\[
\max_{B \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\!\left(B^\top \Sigma_B B\right)}{\operatorname{tr}\!\left(B^\top \Sigma_W B\right)} ,
\tag{3.2}
\]

where the matrix B is built with the discriminant directions $\beta_k$ as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

\[
\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B \beta_k \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1 \\
& \beta_k^\top \Sigma_W \beta_\ell = 0 , \quad \forall \ell < k .
\end{aligned}
\tag{3.3}
\]

The maximizer of subproblem k is the eigenvector of $\Sigma_W^{-1} \Sigma_B$ associated with the kth largest eigenvalue (see Appendix C).
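The following short sketch makes this eigenvalue view concrete for the multi-class case; it computes $\Sigma_W$, $\Sigma_B$ and the K − 1 leading generalized eigenvectors (the small ridge added to $\Sigma_W$ and the function name are our own choices, not part of the original material).

import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, K):
    # Discriminant directions of (3.3): leading eigenvectors of Sigma_W^{-1} Sigma_B,
    # obtained from the generalized eigenproblem Sigma_B b = lambda Sigma_W b.
    # y is assumed to be an integer-coded label vector in {0, ..., K-1}.
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma_W = np.zeros((p, p))
    Sigma_B = np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sigma_W += (Xk - mu_k).T @ (Xk - mu_k)
        Sigma_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu)
    Sigma_W /= n
    Sigma_B /= n
    evals, V = eigh(Sigma_B, Sigma_W + 1e-8 * np.eye(p))   # ascending eigenvalues
    return V[:, ::-1][:, :K - 1]                            # K-1 leading directions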

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. The main purpose of this sparsity is to reduce the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually assessed with univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where Fisher's discriminant (3.1) is solved as

\[
\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_W \beta \\
\text{s.t.} \;\; & (\mu_1 - \mu_2)^\top \beta = 1 \\
& \textstyle\sum_{j=1}^{p} |\beta_j| \le t ,
\end{aligned}
\]

where $\mu_1$ and $\mu_2$ are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using Fisher's discriminant rewritten in the form of K − 1 constrained and penalized maximization problems:

\[
\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1 .
\end{aligned}
\]

The term to maximize is the projected between-class covariance $\beta_k^\top \Sigma_B^k \beta_k$, subject to an upper bound on the projected within-class covariance $\beta_k^\top \Sigma_W \beta_k$. The penalty $P_k(\beta_k)$ is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant $\beta_k$ vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of $\Sigma_W$ and $(\mu_1 - \mu_2)$ to obtain the optimal solution $\beta = \Sigma_W^{-1}(\mu_1 - \mu_2)$, they estimate the product directly through constrained L1 minimization:

\[
\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \left\| \beta \right\|_1 \\
\text{s.t.} \;\; & \left\| \Sigma \beta - (\mu_1 - \mu_2) \right\|_\infty \le \lambda .
\end{aligned}
\]

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.


Most of the algorithms reviewed are conceived for binary classification, and for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity. However, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

              Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n × K matrix containing the class labels of all samples. There are several well-known types in the literature. For example, the binary or dummy indicator ($y_{ik} = 1$ if sample i belongs to class k, and $y_{ik} = 0$ otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is $y_{ik} = 1$ if sample i belongs to class k and $y_{ik} = -1/(K-1)$ otherwise; it was used, for example, to extend Support Vector Machines to multi-class classification (Lee et al., 2004) or to generalize the kernel target alignment measure (Guermeur et al., 2004).
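Both codings are straightforward to build; the sketch below is only meant to fix the notation (function names are ours):

import numpy as np

def dummy_indicator(labels, K):
    # Binary (dummy) coding: Y[i, k] = 1 iff sample i belongs to class k, 0 otherwise.
    n = len(labels)
    Y = np.zeros((n, K))
    Y[np.arange(n), labels] = 1.0
    return Y

def symmetric_indicator(labels, K):
    # Alternative coding: 1 for the true class, -1/(K-1) for all the other classes.
    Y = np.full((len(labels), K), -1.0 / (K - 1))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y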

Some works propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis for the high-dimensional, low-sample-size setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to that of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is


obtained by solving

\[
\min_{\beta \in \mathbb{R}^p,\, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| ,
\]

where $y_i$ is the binary indicator of the label of pattern $x_i$. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule $x^\top \beta + \beta_0 > 0$ is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept $\beta_0$ is required.

              Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, aiming either at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\begin{align}
\min_{\Theta,\, B} \;\; & \left\| Y\Theta - XB \right\|_F^2 + \lambda \operatorname{tr}\!\left( B^\top \Omega B \right) \tag{3.4a} \\
\text{s.t.} \;\; & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{3.4b}
\end{align}

where $\Theta \in \mathbb{R}^{K \times (K-1)}$ are the class scores, $B \in \mathbb{R}^{p \times (K-1)}$ are the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

\begin{align}
\min_{\theta_k \in \mathbb{R}^K,\, \beta_k \in \mathbb{R}^p} \;\; & \left\| Y\theta_k - X\beta_k \right\|^2 + \beta_k^\top \Omega \beta_k \tag{3.5a} \\
\text{s.t.} \;\; & n^{-1}\, \theta_k^\top Y^\top Y \theta_k = 1 , \tag{3.5b} \\
& \theta_k^\top Y^\top Y \theta_\ell = 0 , \quad \ell = 1, \ldots, k-1 , \tag{3.5c}
\end{align}

where each $\beta_k$ corresponds to a discriminant direction.


Several sparse LDA methods have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\[
\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_k \left\| Y\theta_k - X\beta_k \right\|_2^2 + \lambda_1 \left\| \beta_k \right\|_1 + \lambda_2\, \beta_k^\top \Omega \beta_k ,
\]

where $\lambda_1$ and $\lambda_2$ are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

\[
\min_{\beta_k \in \mathbb{R}^p,\, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \left\| Y\theta_k - X\beta_k \right\|_2^2 + \lambda \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K-1} \beta_{kj}^2 } ,
\tag{3.6}
\]

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here, we formally link (3.6) to penalized LDA and provide a publicly available, efficient code for solving this problem.


              4 Formalizing the Objective

In this chapter, we detail the rationale supporting the Group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), which selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of the data. For K classes, this representation can either be complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) and in penalized LDA (p-LDA), by going through canonical correlation analysis. We first provide some properties of the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that

• there is no empty class, that is, the diagonal matrix $Y^\top Y$ is full rank;

• inputs are centered, that is, $X^\top \mathbf{1}_n = 0$;

• the quadratic penalty Ω is positive-semidefinite and such that $X^\top X + \Omega$ is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by $\beta_K = 0$). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we no longer mention these orthogonality constraints (3.5c), which apply along the route, so as to simplify all expressions. The generic problem solved is thus

\begin{align}
\min_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & \left\| Y\theta - X\beta \right\|^2 + \beta^\top \Omega \beta \tag{4.1a} \\
\text{s.t.} \;\; & n^{-1}\, \theta^\top Y^\top Y \theta = 1 . \tag{4.1b}
\end{align}

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\[
\beta_{\mathrm{OS}} = \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta .
\tag{4.2}
\]

The objective function (4.1a) is then

\begin{align*}
\left\| Y\theta - X\beta_{\mathrm{OS}} \right\|^2 + \beta_{\mathrm{OS}}^\top \Omega \beta_{\mathrm{OS}}
&= \theta^\top Y^\top Y \theta - 2\, \theta^\top Y^\top X \beta_{\mathrm{OS}} + \beta_{\mathrm{OS}}^\top \left( X^\top X + \Omega \right) \beta_{\mathrm{OS}} \\
&= \theta^\top Y^\top Y \theta - \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta ,
\end{align*}

where the second line stems from the definition of $\beta_{\mathrm{OS}}$ (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\[
\max_{\theta\,:\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta ,
\tag{4.3}
\]

which shows that the optimization of the p-OS problem with respect to $\theta_k$ boils down to finding the kth largest eigenvector of $Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y$. Indeed, Appendix C details that Problem (4.3) is solved by

\[
(Y^\top Y)^{-1} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta = \alpha^2 \theta ,
\tag{4.4}
\]


where $\alpha^2$ is the maximal eigenvalue¹:

\begin{align}
n^{-1}\, \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2\, n^{-1}\, \theta^\top (Y^\top Y) \theta \notag \\
n^{-1}\, \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta &= \alpha^2 . \tag{4.5}
\end{align}

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

\begin{align}
\max_{\theta \in \mathbb{R}^K,\, \beta \in \mathbb{R}^p} \;\; & n^{-1}\, \theta^\top Y^\top X \beta \tag{4.6a} \\
\text{s.t.} \;\; & n^{-1}\, \theta^\top Y^\top Y \theta = 1 , \tag{4.6b} \\
& n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 . \tag{4.6c}
\end{align}

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

\begin{align*}
n L(\beta, \theta, \nu, \gamma) &= \theta^\top Y^\top X \beta - \nu \left( \theta^\top Y^\top Y \theta - n \right) - \gamma \left( \beta^\top (X^\top X + \Omega) \beta - n \right) \\
\Rightarrow\; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} &= X^\top Y \theta - 2\gamma\, (X^\top X + \Omega) \beta \\
\Rightarrow\; \beta_{\mathrm{CCA}} &= \frac{1}{2\gamma} (X^\top X + \Omega)^{-1} X^\top Y \theta .
\end{align*}

Then, as $\beta_{\mathrm{CCA}}$ obeys (4.6c), we obtain

\[
\beta_{\mathrm{CCA}} = \frac{(X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} ,
\tag{4.7}
\]

so that the optimal objective function (4.6a) can be expressed with θ alone:

\begin{align*}
n^{-1}\, \theta^\top Y^\top X \beta_{\mathrm{CCA}}
&= \frac{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}{\sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta}} \\
&= \sqrt{n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta} ,
\end{align*}

and the optimization problem with respect to θ can be restated as

\[
\max_{\theta\,:\, n^{-1} \theta^\top Y^\top Y \theta = 1} \; \theta^\top Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \theta .
\tag{4.8}
\]

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\[
\beta_{\mathrm{OS}} = \alpha\, \beta_{\mathrm{CCA}} ,
\tag{4.9}
\]

¹The awkward notation $\alpha^2$ for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5), for example).


where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\begin{align}
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} &= Y^\top X \beta - 2\nu\, Y^\top Y \theta \notag \\
\Rightarrow\; \theta_{\mathrm{CCA}} &= \frac{1}{2\nu} (Y^\top Y)^{-1} Y^\top X \beta . \tag{4.10}
\end{align}

Then, as $\theta_{\mathrm{CCA}}$ obeys (4.6b), we obtain

\[
\theta_{\mathrm{CCA}} = \frac{(Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} ,
\tag{4.11}
\]

leading to the following expression of the optimal objective function:

\begin{align*}
n^{-1}\, \theta_{\mathrm{CCA}}^\top Y^\top X \beta
&= \frac{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}{\sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta}} \\
&= \sqrt{n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta} .
\end{align*}

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

\begin{align}
\max_{\beta \in \mathbb{R}^p} \;\; & n^{-1}\, \beta^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta \tag{4.12a} \\
\text{s.t.} \;\; & n^{-1}\, \beta^\top \left( X^\top X + \Omega \right) \beta = 1 , \tag{4.12b}
\end{align}

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, $\beta_{\mathrm{CCA}}$ verifies

\[
n^{-1}\, X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{CCA}} = \lambda \left( X^\top X + \Omega \right) \beta_{\mathrm{CCA}} ,
\tag{4.13}
\]

where λ is the maximal eigenvalue, shown below to be equal to $\alpha^2$:

\begin{align*}
& n^{-1}\, \beta_{\mathrm{CCA}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X \beta_{\mathrm{CCA}} = \lambda \\
\Rightarrow\; & n^{-1} \alpha^{-1}\, \beta_{\mathrm{CCA}}^\top X^\top Y (Y^\top Y)^{-1} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow\; & n^{-1} \alpha\, \beta_{\mathrm{CCA}}^\top X^\top Y \theta = \lambda \\
\Rightarrow\; & n^{-1}\, \theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \theta = \lambda \\
\Rightarrow\; & \alpha^2 = \lambda .
\end{align*}

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7), where the denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

\begin{align}
\max_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_B \beta \tag{4.14a} \\
\text{s.t.} \;\; & \beta^\top \left( \Sigma_W + n^{-1} \Omega \right) \beta = 1 , \tag{4.14b}
\end{align}

where $\Sigma_B$ and $\Sigma_W$ are respectively the sample between-class and within-class covariance matrices of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation, using the projection operator $Y (Y^\top Y)^{-1} Y^\top$:

\begin{align*}
\Sigma_T &= \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top = n^{-1}\, X^\top X \\
\Sigma_B &= \frac{1}{n} \sum_{k=1}^{K} n_k\, \mu_k \mu_k^\top = n^{-1}\, X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \\
\Sigma_W &= \frac{1}{n} \sum_{k=1}^{K} \sum_{i\,:\,y_{ik}=1} (x_i - \mu_k)(x_i - \mu_k)^\top = n^{-1} \left( X^\top X - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) .
\end{align*}

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\begin{align*}
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{LDA}} &= \lambda \left( X^\top X + \Omega - X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \right) \beta_{\mathrm{LDA}} \\
X^\top Y \left( Y^\top Y \right)^{-1} Y^\top X \beta_{\mathrm{LDA}} &= \frac{\lambda}{1 - \lambda} \left( X^\top X + \Omega \right) \beta_{\mathrm{LDA}} .
\end{align*}

The comparison of the last equation with the equation verified by $\beta_{\mathrm{CCA}}$ (4.13) shows that $\beta_{\mathrm{LDA}}$ and $\beta_{\mathrm{CCA}}$ are proportional, and that $\lambda/(1-\lambda) = \alpha^2$. Using constraints (4.12b) and (4.14b), it comes that

\begin{align*}
\beta_{\mathrm{LDA}} &= (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{CCA}} \\
&= \alpha^{-1} (1 - \alpha^2)^{-1/2}\, \beta_{\mathrm{OS}} ,
\end{align*}

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

\begin{align*}
\min_{\Theta,\, B} \;\; & \left\| Y\Theta - XB \right\|_F^2 + \lambda \operatorname{tr}\!\left( B^\top \Omega B \right) \\
\text{s.t.} \;\; & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} .
\end{align*}

Let A represent the $(K-1) \times (K-1)$ diagonal matrix with elements $\alpha_k$, the square root of the kth largest eigenvalue of $Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y$; we have

\begin{align}
B_{\mathrm{LDA}} &= B_{\mathrm{CCA}} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \notag \\
&= B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} , \tag{4.15}
\end{align}

where $I_{K-1}$ is the $(K-1) \times (K-1)$ identity matrix.

At this point, the feature matrix X, which in the input space has dimensions n × p, can be projected into the optimal scoring domain as an n × (K − 1) matrix $X_{\mathrm{OS}} = X B_{\mathrm{OS}}$, or into the linear discriminant analysis space as an n × (K − 1) matrix $X_{\mathrm{LDA}} = X B_{\mathrm{LDA}}$. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta ,
\]
where Θ are the K − 1 leading eigenvectors of $Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y$.

2. Translate the data samples X into the LDA domain as $X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D$, where $D = A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}}$.

3. Compute the matrix M of centroids $\mu_k$ from $X_{\mathrm{LDA}}$ and Y.

4. Evaluate the distance $d(x, \mu_k)$ in the LDA domain as a function of M and $X_{\mathrm{LDA}}$.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, produce a graphical representation of the data.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

\begin{align}
\min_{\Theta \in \mathbb{R}^{K \times (K-1)},\, B \in \mathbb{R}^{p \times (K-1)}} \;\; & \left\| Y\Theta - XB \right\|_F^2 + \lambda \operatorname{tr}\!\left( B^\top \Omega B \right) \tag{4.16a} \\
\text{s.t.} \;\; & n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{4.16b}
\end{align}

where Θ are the class scores, B the regression coefficients, and $\|\cdot\|_F$ is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal $B_{\mathrm{OS}}$ does not intervene in the optimality conditions with respect to Θ, and the optimum with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to $\Theta^0$ such that $n^{-1}\, \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$.

2. Compute $B = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta^0$.

3. Set Θ to be the K − 1 leading eigenvectors of $Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y$.

4. Compute the optimal regression coefficients
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta .
\tag{4.17}
\]

Defining $\Theta^0$ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on $\Theta^{0\top} Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta^0$, which is computed as $\Theta^{0\top} Y^\top X B$, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
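For concreteness, the following numpy sketch runs these four steps for a quadratic penalty; it is only an illustration under our own conventions (dummy indicator Y, centered X, and a particular choice of $\Theta^0$ orthonormal in the $Y^\top Y / n$ metric), not the GLOSS implementation itself.

import numpy as np

def penalized_os(X, Y, lam, Omega):
    # Quadratically penalized optimal scoring, following the four steps above.
    # X (n x p) is assumed centered; Y (n x K) is a dummy indicator matrix.
    n, p = X.shape
    K = Y.shape[1]
    # Step 1: Theta0 with n^-1 Theta0' Y'Y Theta0 = I_{K-1}, orthogonal to the constant score
    pi = Y.sum(axis=0) / n                         # class proportions (Y'Y/n is diagonal)
    u = np.sqrt(pi)[:, None]
    Q, _ = np.linalg.qr(np.hstack([u, np.eye(K)[:, :K - 1]]))
    Theta0 = Q[:, 1:] / np.sqrt(pi)[:, None]
    # Step 2: B0 = (X'X + lam*Omega)^{-1} X'Y Theta0
    B0 = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)
    # Step 3: eigen-analysis of the small (K-1)x(K-1) matrix Theta0' Y'X B0
    M = Theta0.T @ (Y.T @ (X @ B0))
    evals, V = np.linalg.eigh((M + M.T) / 2)       # symmetrized for numerical safety
    order = np.argsort(evals)[::-1]
    V, evals = V[:, order], evals[order]
    # Step 4: optimal scores and regression coefficients
    Theta, B_os = Theta0 @ V, B0 @ V
    alpha = np.sqrt(np.clip(evals / n, 0.0, None)) # the alpha_k of Equation (4.5)
    return Theta, B_os, alpha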

This four-step algorithm is valid when the penalty is of the quadratic form $\operatorname{tr}(B^\top \Omega B)$. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where sample $x_i$ is assigned to class k if $x_i$ is closer (in terms of the shared within-class Mahalanobis distance) to centroid $\mu_k$ than to any other centroid $\mu_\ell$. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (the sample estimators $\mu_k$ and $\Sigma_W$). If $\mu_k$ are the centroids in the input space, sample $x_i$ is assigned to class k if the distance

\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\left( \frac{n_k}{n} \right)
\tag{4.18}
\]

is minimized over all k. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al., 2009; Mai et al., 2012). The matrix $\Sigma_{W\Omega}$ used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

\begin{align}
\Sigma_{W\Omega}^{-1} &= \left( n^{-1} (X^\top X + \lambda\Omega) - \Sigma_B \right)^{-1} \notag \\
&= \left( n^{-1} X^\top X - \Sigma_B + n^{-1} \lambda\Omega \right)^{-1} \notag \\
&= \left( \Sigma_W + n^{-1} \lambda\Omega \right)^{-1} . \tag{4.19}
\end{align}

Before explaining how to compute the distances, let us summarize some clarifying points:

• the solution $B_{\mathrm{OS}}$ of the p-OS problem is enough to accomplish classification;

• in the LDA domain (the space of discriminant variates $X_{\mathrm{LDA}}$), classification is based on Euclidean distances;

• classification can be done in a reduced-rank space of dimension R < K − 1 by using the first R discriminant directions $\{\beta_k\}_{k=1}^{R}$.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is

\[
\left\| (x_i - \mu_k) B_{\mathrm{OS}} \right\|_{\Sigma_{W\Omega}}^2 - 2 \log(\pi_k) ,
\]

where $\pi_k$ is the estimated class prior and $\|\cdot\|_S$ is the Mahalanobis norm assuming within-class covariance S. If classification is done in the p-LDA domain, it is

\[
\left\| (x_i - \mu_k) B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 - 2 \log(\pi_k) ,
\]

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let $d(x, \mu_k)$ be the distance between x and $\mu_k$ defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities $p(y_k = 1 \mid x)$ can be estimated as

\[
p(y_k = 1 \mid x) \;\propto\; \exp\!\left( -\frac{d(x, \mu_k)}{2} \right)
\;\propto\; \pi_k \exp\!\left( -\frac{1}{2} \left\| (x - \mu_k) B_{\mathrm{OS}} A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 \right) .
\tag{4.20}
\]

These probabilities must be normalized to ensure that they sum to one. When the distances $d(x, \mu_k)$ take large values, $\exp\!\left(-\frac{d(x,\mu_k)}{2}\right)$ can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

\[
p(y_k = 1 \mid x)
= \frac{\pi_k \exp\!\left( -\frac{d(x,\mu_k)}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( -\frac{d(x,\mu_\ell)}{2} \right)}
= \frac{\pi_k \exp\!\left( -\frac{d(x,\mu_k)}{2} + \frac{d_{\max}}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( -\frac{d(x,\mu_\ell)}{2} + \frac{d_{\max}}{2} \right)} ,
\]

where $d_{\max} = \max_k d(x, \mu_k)$.
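A compact sketch of this normalization step is given below; it follows the same shifting idea, except that the smallest distance is subtracted (so that every exponent is non-positive), which is the usual variant of the trick. The function name and the matrix layout are ours.

import numpy as np

def posteriors_from_distances(d, priors):
    # d: (n x K) array of distances d(x_i, mu_k); priors: length-K numpy vector of pi_k.
    # Shift the distances row-wise before exponentiating to avoid numerical underflow.
    d_shift = d - d.min(axis=1, keepdims=True)
    unnorm = priors[None, :] * np.exp(-0.5 * d_shift)
    return unnorm / unnorm.sum(axis=1, keepdims=True)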

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This can be accomplished by plotting the first two or three dimensions of the regression fits $X_{\mathrm{OS}}$ or of the discriminant variates $X_{\mathrm{LDA}}$, depending on whether we present the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.
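For instance, a two-dimensional display can be produced with a few lines of plotting code (a sketch under our own naming conventions):

import matplotlib.pyplot as plt

def plot_discriminant_plane(X_lda, labels):
    # Scatter plot of the first two discriminant variates, colored by class.
    plt.scatter(X_lda[:, 0], X_lda[:, 1], c=labels, s=15)
    plt.xlabel("first discriminant direction")
    plt.ylabel("second discriminant direction")
    plt.show()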

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form $\beta^\top \Omega \beta$, under the assumption that $Y^\top Y$ and $X^\top X + \lambda\Omega$ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties but, as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection between p-LDA and p-OS such as the one stated by Hastie et al. (1995).

In this section, we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form $B^\top \Omega B$.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper of Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:

\begin{align}
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times (K-1)}} \;\;
& J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \left\| \beta^j \right\|_2^2}{\tau_j} \tag{4.21a} \\
\text{s.t.} \;\;
& \sum_j \tau_j - \sum_j w_j \left\| \beta^j \right\|_2 \le 0 , \tag{4.21b} \\
& \tau_j \ge 0 , \quad j = 1, \ldots, p , \tag{4.21c}
\end{align}

where $B \in \mathbb{R}^{p \times (K-1)}$ is a matrix composed of row vectors $\beta^j \in \mathbb{R}^{K-1}$, $B = (\beta^{1\top}, \ldots, \beta^{p\top})^\top$, and the $w_j$ are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss $\|Y\Theta - XB\|_2^2$; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, $b/\tau$ is defined by continuation at zero, as $b/0 = +\infty$ if $b \ne 0$ and $0/0 = 0$. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression into the convex hull of a family of quadratic penalties defined by the variables $\tau_j$. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (there is an alternative variational formulation, detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in $\beta^j$ in (4.21) acts as the group-Lasso penalty $\lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2$.

Proof. The Lagrangian of Problem (4.21) is

\[
L = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \left\| \beta^j \right\|_2^2}{\tau_j}
+ \nu_0 \Bigg( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \Bigg)
- \sum_{j=1}^{p} \nu_j \tau_j .
\]


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for $\tau_j$ are

\begin{align*}
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
& \;\Leftrightarrow\; -\lambda w_j^2 \frac{\left\| \beta^j \right\|_2^2}{{\tau_j^\star}^2} + \nu_0 - \nu_j = 0 \\
& \;\Leftrightarrow\; -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, {\tau_j^\star}^2 - \nu_j\, {\tau_j^\star}^2 = 0 \\
& \;\Rightarrow\; -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, {\tau_j^\star}^2 = 0 .
\end{align*}

The last line is obtained from complementary slackness, which implies here $\nu_j \tau_j^\star = 0$; complementary slackness states that $\nu_j\, g_j(\tau_j^\star) = 0$, where $\nu_j$ is the Lagrange multiplier of the constraint $g_j(\tau_j) \le 0$. As a result, the optimal value of $\tau_j$ is

\[
\tau_j^\star = \sqrt{ \frac{\lambda w_j^2 \left\| \beta^j \right\|_2^2}{\nu_0} } = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \left\| \beta^j \right\|_2 .
\tag{4.22}
\]

We note that $\nu_0 \ne 0$ if there is at least one coefficient $\beta_{jk} \ne 0$; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):

\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 = 0 ,
\tag{4.23}
\]

so that $\tau_j^\star = w_j \|\beta^j\|_2$. Plugging this value into (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

\[
\min_{B \in \mathbb{R}^{p \times M}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 .
\tag{4.24}
\]

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1, we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently written as $\lambda \operatorname{tr}(B^\top \Omega B)$, where

\[
\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p} \right) ,
\tag{4.25}
\]

with

\[
\tau_j = w_j \left\| \beta^j \right\|_2 ,
\]

resulting in the diagonal components of Ω:

\[
(\Omega)_{jj} = \frac{w_j}{\left\| \beta^j \right\|_2} .
\tag{4.26}
\]

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function $g(\beta, \tau) = \|\beta\|_2^2 / \tau$, known as the perspective function of $f(\beta) = \|\beta\|_2^2$, is convex in $(\beta, \tau)$ (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to $(B, \tau)$.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all $B \in \mathbb{R}^{p \times (K-1)}$, the subdifferential of the objective function of Problem (4.24) is

\[
\left\{ V \in \mathbb{R}^{p \times (K-1)} \, : \, V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} ,
\tag{4.27}
\]

where $G \in \mathbb{R}^{p \times (K-1)}$ is a matrix composed of row vectors $g^j \in \mathbb{R}^{K-1}$, $G = (g^{1\top}, \ldots, g^{p\top})^\top$, defined as follows. Let S(B) denote the support of B over its rows, $S(B) = \{ j \in \{1, \ldots, p\} : \|\beta^j\|_2 \ne 0 \}$; then we have

\begin{align}
\forall j \in S(B) , \quad & g^j = w_j \left\| \beta^j \right\|_2^{-1} \beta^j , \tag{4.28} \\
\forall j \notin S(B) , \quad & \left\| g^j \right\|_2 \le w_j . \tag{4.29}
\end{align}


This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When $\|\beta^j\|_2 \ne 0$, the gradient of the penalty with respect to $\beta^j$ is

\[
\frac{\partial}{\partial \beta^j} \Bigg( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \Bigg) = \lambda w_j \frac{\beta^j}{\left\| \beta^j \right\|_2} .
\tag{4.30}
\]

At $\|\beta^j\|_2 = 0$, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):

\[
\partial_{\beta^j} \Bigg( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \Bigg)
= \partial_{\beta^j} \left( \lambda w_j \left\| \beta^j \right\|_2 \right)
= \left\{ \lambda w_j v \, : \, v \in \mathbb{R}^{K-1},\; \|v\|_2 \le 1 \right\} ,
\tag{4.31}
\]

which gives expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

\begin{align}
\forall j \in S , \quad & \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \left\| \beta^j \right\|_2^{-1} \beta^j = 0 , \tag{4.32a} \\
\forall j \in \bar{S} , \quad & \left\| \frac{\partial J(B)}{\partial \beta^j} \right\|_2 \le \lambda w_j , \tag{4.32b}
\end{align}

where $S \subseteq \{1, \ldots, p\}$ denotes the set of non-zero row vectors $\beta^j$ and $\bar{S}$ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with a direct analysis of the variational problem (4.21).
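In practice, conditions (4.32a)–(4.32b) can be checked directly from the gradient of the OS cost; the sketch below does so for $J(B) = \frac{1}{2}\|Y\Theta - XB\|_F^2$ (the function name, tolerance and return convention are ours).

import numpy as np

def check_optimality(X, Y, Theta, B, lam, w, tol=1e-6):
    # Returns the indices violating (4.32a) (active rows) and (4.32b) (inactive rows).
    G = -X.T @ (Y @ Theta - X @ B)                 # gradient dJ/dB, one row per feature
    norms = np.linalg.norm(B, axis=1)
    active = norms > 0
    bad_active = [j for j in np.where(active)[0]
                  if np.linalg.norm(G[j] + lam * w[j] * B[j] / norms[j]) > tol]
    bad_inactive = [j for j in np.where(~active)[0]
                    if np.linalg.norm(G[j]) > lam * w[j] + tol]
    return bad_active, bad_inactive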

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

\[
B_{\mathrm{OS}} = \operatorname*{argmin}_{B \in \mathbb{R}^{p \times (K-1)}} \; \min_{\Theta \in \mathbb{R}^{K \times (K-1)}} \;
\frac{1}{2} \left\| Y\Theta - XB \right\|_F^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2
\quad \text{s.t. } \; n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\]

is equivalent to the penalized LDA problem

\[
B_{\mathrm{LDA}} = \operatorname*{argmax}_{B \in \mathbb{R}^{p \times (K-1)}} \; \operatorname{tr}\!\left( B^\top \Sigma_B B \right)
\quad \text{s.t. } \; B^\top \left( \Sigma_W + n^{-1} \lambda\Omega \right) B = I_{K-1} ,
\]

where

\[
\Omega = \operatorname{diag}\!\left( \frac{w_1^2}{\tau_1}, \ldots, \frac{w_p^2}{\tau_p} \right) ,
\quad \text{with } \;
\Omega_{jj} = \begin{cases}
+\infty & \text{if } \beta^j_{\mathrm{OS}} = 0 \\
w_j \left\| \beta^j_{\mathrm{OS}} \right\|_2^{-1} & \text{otherwise} .
\end{cases}
\tag{4.33}
\]

That is, $B_{\mathrm{LDA}} = B_{\mathrm{OS}}\, \operatorname{diag}\!\big( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \big)$, where $\alpha_k \in (0,1)$ is the kth leading eigenvalue of

\[
n^{-1}\, Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y .
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form $\operatorname{tr}(B^\top \Omega B)$.


              5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with $J(B) = \frac{1}{2} \|Y\Theta - XB\|_F^2$.

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below.

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First, the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more $\beta^j$ may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to the inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is represented graphically as a block diagram in Figure 5.1 and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach of Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K − 1) independent card(A)-dimensional problems instead of a single (K − 1) × card(A)-dimensional problem. The interaction between the (K − 1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K − 1) similar systems

\[
\left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 ,
\tag{5.1}
\]


Figure 5.1: GLOSS block diagram.


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: $X$, $Y$, $B$, $\lambda$
Initialize: $\mathcal{A} \leftarrow \{ j \in \{1,\ldots,p\} : \|\beta^j\|_2 > 0 \}$; $\Theta^0$ such that $n^{-1}\,\Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}$; convergence $\leftarrow$ false
repeat
    Step 1: solve (4.21) in B, assuming $\mathcal{A}$ optimal
    repeat
        $\Omega \leftarrow \operatorname{diag}(\Omega_{\mathcal{A}})$, with $\omega_j \leftarrow \|\beta^j\|_2^{-1}$
        $B_{\mathcal{A}} \leftarrow (X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega)^{-1} X_{\mathcal{A}}^\top Y \Theta^0$
    until condition (4.32a) holds for all $j \in \mathcal{A}$
    Step 2: identify inactivated variables
    for $j \in \mathcal{A}$ such that $\|\beta^j\|_2 = 0$ do
        if optimality condition (4.32b) holds then
            $\mathcal{A} \leftarrow \mathcal{A} \setminus \{j\}$; go back to Step 1
        end if
    end for
    Step 3: check the greatest violation of optimality condition (4.32b) in $\bar{\mathcal{A}}$
    $j^\star \leftarrow \operatorname{argmax}_{j \in \bar{\mathcal{A}}} \|\partial J / \partial \beta^j\|_2$
    if $\|\partial J / \partial \beta^{j^\star}\|_2 < \lambda$ then
        convergence $\leftarrow$ true ($B$ is optimal)
    else
        $\mathcal{A} \leftarrow \mathcal{A} \cup \{j^\star\}$
    end if
until convergence
$(s, V) \leftarrow$ eigenanalyze$(\Theta^{0\top} Y^\top X_{\mathcal{A}} B)$, that is, $\Theta^{0\top} Y^\top X_{\mathcal{A}} B V_k = s_k V_k$, $k = 1, \ldots, K-1$
$\Theta \leftarrow \Theta^0 V$; $B \leftarrow B V$; $\alpha_k \leftarrow n^{-1/2} s_k^{1/2}$, $k = 1, \ldots, K-1$
Output: $\Theta$, $B$, $\alpha$


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth column of B and Θ^0 respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

(X^\top X + \lambda \Omega) B = X^\top Y \Theta .   (5.2)

Defining the Cholesky decomposition as C^\top C = X^\top X + \lambda \Omega, (5.2) is solved efficiently as follows:

C^\top C B = X^\top Y \Theta
C B = C^\top \backslash X^\top Y \Theta
B = C \backslash \left( C^\top \backslash X^\top Y \Theta \right) ,   (5.3)

where the symbol "\" is the matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
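To make the two triangular solves in (5.3) concrete, here is a minimal matlab sketch, assuming the active-set design matrix XA, the indicator matrix Y, the current scores Theta, the diagonal penalty Omega and the scalar lambda are already available in the workspace (these variable names are illustrative, not the identifiers of the GLOSS package):

    % Solve (X_A' X_A + lambda*Omega) B = X_A' Y Theta with a single Cholesky factor.
    C   = chol(XA' * XA + lambda * Omega);   % upper triangular, C'*C = XA'*XA + lambda*Omega
    RHS = XA' * (Y * Theta);                 % common right-hand side of the K-1 systems
    B   = C \ (C' \ RHS);                    % two triangular solves with mldivide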

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^\top X + \lambda \Omega. This difficulty can be avoided using the following equivalent expression:

B = \Omega^{-1/2} \left( \Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I \right)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 ,   (5.4)

where the conditioning of \Omega^{-1/2} X^\top X \Omega^{-1/2} + \lambda I is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values. Our code is otherwise based on expression (5.2).
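The stabilized expression (5.4) can be sketched along the same lines; this is only an illustration of the reformulation, under the same naming assumptions as above and assuming Omega is stored as a diagonal matrix:

    % Equivalent, better-conditioned computation when some omega_j are very large.
    Oih = diag(1 ./ sqrt(diag(Omega)));                        % Omega^{-1/2}
    G   = Oih * (XA' * XA) * Oih + lambda * eye(size(XA, 2));  % well-conditioned system matrix
    B   = Oih * (G \ (Oih * (XA' * (Y * Theta))));             % recovers the B of (5.2)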

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. This eigen-analysis is actually solved in the form \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^\top X + \Omega)^{-1}, which involves the inversion of an n × n matrix. Let \Theta^0 be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y.^1 Then, solving the K − 1 systems (5.3) provides the value of B^0 = (X^\top X + \lambda\Omega)^{-1} X^\top Y \Theta^0. This B^0 matrix can be identified in the expression to eigen-analyze as

\Theta^{0\top} Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix \Theta^{0\top} Y^\top X B^0 = V \Lambda V^\top. Defining \Theta = \Theta^0 V, we have \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Lambda, and when \Theta^0 is chosen such that n^{-1} \Theta^{0\top} Y^\top Y \Theta^0 = I_{K-1}, we also have n^{-1} \Theta^\top Y^\top Y \Theta = I_{K-1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of \Lambda are sorted in decreasing order, \theta_k is an optimal solution to the p-OS problem. Finally, once \Theta has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from \Theta^0 to \Theta, that is, B = B^0 V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which \Omega is defined by a variational formulation.

^1 As X is centered, 1_K belongs to the null space of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. It is thus sufficient to choose \Theta^0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^\top X (X^\top X + \Omega)^{-1} X^\top Y. In practice, to comply with this desideratum and conditions (3.5b) and (3.5c), we set \Theta^0 = (Y^\top Y)^{-1/2} U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
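The whole update of the score matrix thus boils down to a small eigen-decomposition. The following hedged matlab sketch illustrates it, assuming B0 has been obtained from (5.3) for an initial Theta0 satisfying n^{-1} Theta0' Y' Y Theta0 = I (all names are illustrative):

    % Eigen-analysis of the (K-1) x (K-1) matrix Theta0' * Y' * X * B0.
    M      = Theta0' * (Y' * (X * B0));
    M      = (M + M') / 2;                 % symmetrize against round-off errors
    [V, D] = eig(M);
    [s, o] = sort(diag(D), 'descend');     % eigenvalues in decreasing order
    V      = V(:, o);
    Theta  = Theta0 * V;                   % optimal scores
    B      = B0 * V;                       % regression coefficients mapped accordingly
    alpha  = sqrt(s / n);                  % alpha_k = n^{-1/2} s_k^{1/2}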

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function

\frac{1}{2} \| Y\Theta - XB \|_2^2 + \lambda \sum_{j=1}^{p} w_j \| \beta^j \|_2 .   (5.5)

Let J(B) be the data-fitting term \frac{1}{2}\| Y\Theta - XB \|_2^2. Its gradient with respect to the jth row of B, \beta^j, is the (K − 1)-dimensional vector

\frac{\partial J(B)}{\partial \beta^j} = x_j^\top (XB - Y\Theta) ,

where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be written for every active variable j as

x_j^\top (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\| \beta^j \|_2} = 0 ,

and the second optimality condition (4.32b) can be written for every variable j as

\| x_j^\top (XB - Y\Theta) \|_2 \le \lambda w_j .

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

j^\star = \arg\max_j \; \max\left( \| x_j^\top (XB - Y\Theta) \|_2 - \lambda w_j ,\; 0 \right) .

The exclusion of a variable belonging to the active set A is considered if the norm \| \beta^j \|_2 is small and if, after setting \beta^j to zero, the following optimality condition holds:

\| x_j^\top (XB - Y\Theta) \|_2 \le \lambda w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
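As an illustration of this mechanism, the sketch below scans the inactive variables for the greatest violation of the second optimality condition; the variable names (X, Y, B, Theta, w, lambda, inactive) are assumptions made for the example, not the actual identifiers of the GLOSS code:

    % Find the inactive variable with the greatest violation of (4.32b).
    R    = X * B - Y * Theta;                       % current residual matrix
    viol = -inf(numel(inactive), 1);
    for m = 1:numel(inactive)
        j       = inactive(m);
        viol(m) = norm(X(:, j)' * R) - lambda * w(j);
    end
    [vmax, m] = max(viol);
    if vmax > 0
        jstar = inactive(m);                        % candidate to enter the active set
    end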

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λ_max, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λ_max, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

\lambda_{\max} = \max_{j \in \{1, \dots, p\}} \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ··· > λ_t > ··· > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
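A possible sketch of this initialization, under the same naming assumptions as above and with T denoting the number of path points, is:

    % lambda_max such that B = 0, then a halving schedule down to lambda_T.
    p    = size(X, 2);
    lmax = 0;
    for j = 1:p
        lmax = max(lmax, norm(X(:, j)' * (Y * Theta0)) / w(j));
    end
    lambdas = lmax * 0.5 .^ (0:T-1);   % lambda_{t+1} = lambda_t / 2, solved with warm starts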


5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

\min_{B \in \mathbb{R}^{p \times (K-1)}} \| Y\Theta - XB \|_F^2 = \min_{B \in \mathbb{R}^{p \times (K-1)}} \mathrm{tr}\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n B^\top \Sigma_T B \right)

are replaced by

\min_{B \in \mathbb{R}^{p \times (K-1)}} \mathrm{tr}\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n B^\top (\Sigma_B + \mathrm{diag}(\Sigma_W)) B \right) .

Note that this variant only requires diag(\Sigma_W) + \Sigma_B + n^{-1}\Omega to be positive definite, which is a weaker requirement than \Sigma_T + n^{-1}\Omega positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition,


[Figure 5.2 about here: the 3 × 3 pixel grid, numbered 1 2 3 (bottom row), 4 5 6 (middle row), 7 8 9 (top row), with 8-connectivity edges, and its Laplacian matrix:]

Ω_L =
[  3 −1  0 −1 −1  0  0  0  0
  −1  5 −1 −1 −1 −1  0  0  0
   0 −1  3  0 −1 −1  0  0  0
  −1 −1  0  5 −1  0 −1 −1  0
  −1 −1 −1 −1  8 −1 −1 −1 −1
   0 −1 −1  0 −1  5  0 −1 −1
   0  0  0 −1 −1  0  3 −1  0
   0  0  0 −1 −1 −1 −1  5 −1
   0  0  0  0 −1 −1  0 −1  3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image.

for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^\top Ω_L β favors, among vectors of identical L2 norm, the ones having similar coefficients in the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^\top, which is the indicator of pixel 1 and its neighbors, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^\top, with a sign mismatch between pixel 1 and its neighborhood.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.
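The construction of such a Laplacian penalty is straightforward. The following matlab sketch builds the 8-connectivity Laplacian of the 3 × 3 grid of Figure 5.2 and evaluates the penalty of the indicator vector discussed above; the digit experiments of Section 6.5 use the same construction on a 16 × 16 grid (variable names are illustrative):

    % Build the Laplacian of the 3x3 pixel grid (8-connectivity) and evaluate beta'*OmegaL*beta.
    [r, c] = ndgrid(1:3, 1:3);                 % pixel coordinates
    n = numel(r);
    A = zeros(n);
    for i = 1:n
        for j = 1:n
            if i ~= j && max(abs(r(i)-r(j)), abs(c(i)-c(j))) <= 1
                A(i, j) = 1;                   % neighboring pixels
            end
        end
    end
    OmegaL  = diag(sum(A, 2)) - A;             % graph Laplacian
    beta    = [1 1 0 1 1 0 0 0 0]';            % indicator of pixel 1 and its neighbors
    penalty = beta' * OmegaL * beta;           % equals 9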


              6 Experimental Results

This section presents comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within a Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T to ones, or the diagonal of the within-class covariance matrix Σ_W to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our matlab package.^1

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distributions of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

^1 The GLOSS matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques on the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is:

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_{1j} = 0.7 × 1_{(1 ≤ j ≤ 25)}, μ_{2j} = 0.7 × 1_{(26 ≤ j ≤ 50)}, μ_{3j} = 0.7 × 1_{(51 ≤ j ≤ 75)}, μ_{4j} = 0.7 × 1_{(76 ≤ j ≤ 100)}.

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_{(j ≤ 200)}. The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_{ij} ∼ N((k−1)/3, 1) if j ≤ 100, and X_{ij} ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_{1j} ∼ N(0, 0.3²) for j ≤ 25 and μ_{1j} = 0 otherwise; μ_{2j} ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_{2j} = 0 otherwise; μ_{3j} ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_{3j} = 0 otherwise; μ_{4j} ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_{4j} = 0 otherwise.
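As an illustration, a matlab sketch of the data-generating process of Simulation 1 (the other setups only change the means and the covariance) could read as follows; the variable names are ours, and the training/validation/test split is omitted:

    % Simulation 1: K = 4 classes, p = 500 independent features, 25 shifted features per class.
    p = 500; K = 4; n_k = 25;                    % 25 training examples per class (n = 100)
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25+1 : k*25) = 0.7;          % mu_kj = 0.7 on its own block of 25 features
    end
    X = []; y = [];
    for k = 1:K
        X = [X; repmat(mu(k, :), n_k, 1) + randn(n_k, p)];
        y = [y; k * ones(n_k, 1)];
    end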

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far behind by SLDA, which is the only


Table 6.1: Experimental results for simulated data: averages, with standard deviations, computed over 25 repetitions, of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

               Err (%)       Var            Dir
Sim 1: K = 4, mean shift, ind. features
  PLDA         12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
  SLDA         31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
  GLOSS        19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
  GLOSS-D      11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
Sim 2: K = 2, mean shift, dependent features
  PLDA          9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
  SLDA         19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
  GLOSS        15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
  GLOSS-D       9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
Sim 3: K = 4, 1D mean shift, ind. features
  PLDA         13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
  SLDA         57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
  GLOSS        31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
  GLOSS-D      18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
Sim 4: K = 4, mean shift, ind. features
  PLDA         60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
  SLDA         65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
  GLOSS        60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
  GLOSS-D      58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


[Figure 6.1 about here: scatter plot of TPR versus FPR (in %) for GLOSS, GLOSS-D, SLDA and PLDA on Simulations 1 to 4.]

Figure 6.1: TPR versus FPR (in %) for all algorithms and simulations.

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

            Simulation 1     Simulation 2     Simulation 3     Simulation 4
            TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
  PLDA      99.0   78.2      96.9   60.3      98.0   15.9      74.3   65.6
  SLDA      73.9   38.5      33.8   16.3      41.6   27.8      50.7   39.5
  GLOSS     64.1   10.6      30.0    4.6      51.1   18.2      26.0   12.1
  GLOSS-D   93.5   39.4      92.1   28.1      95.6   65.5      42.9   29.9

method that does not succeed in uncovering a low-dimensional representation in Simulation 3. The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are selected; similarly, the FPR is the proportion of irrelevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors. It was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples

² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits, with standard deviations, of the test error rates and of the number of selected variables.

                                        Err (%)        Var
Nakayama: n = 86, p = 22,283, K = 5
  PLDA                                  20.95 (1.3)    10478.7 (2116.3)
  SLDA                                  25.71 (1.7)      252.5 (3.1)
  GLOSS                                 20.48 (1.4)      129.0 (18.6)
Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA                                  38.36 (6.0)    14873.5 (720.3)
  SLDA                                  —                —
  GLOSS                                 20.61 (6.9)      372.4 (122.1)
Sun: n = 180, p = 54,613, K = 4
  PLDA                                  33.78 (5.9)    21634.8 (7443.2)
  SLDA                                  36.22 (6.5)      384.4 (16.5)
  GLOSS                                 31.77 (4.5)       93.0 (93.6)

of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets on the first canonical plane estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2 about here: scatter plots of the observations in the first canonical plane (1st discriminant versus 2nd discriminant), with one column per method (GLOSS, SLDA) and one row per dataset. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]

Figure 6.2: 2D-representations of the Nakayama and Sun datasets based on the first two discriminant vectors provided by GLOSS and SLDA. The big squares represent class means.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results have been left for future work.

For this illustration, we have used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we have encoded the pixel proximity relationships from Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). We perfectly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element to discriminate both digits.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow the detection of strokes and will probably provide better prediction results.


[Figure 6.4 about here: left panel, β for GLOSS; right panel, β for S-GLOSS.]

Figure 6.4: Discriminant direction between digits "1" and "0".

[Figure 6.5 about here: left panel, β for GLOSS with λ = 0.3; right panel, β for S-GLOSS with λ = 0.3.]

Figure 6.5: Sparse discriminant direction between digits "1" and "0".


              Discussion

GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem. To our knowledge, this is the first approach that enjoys this property in the multi-class setting. This relationship is also amenable to accommodating interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions produces models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


              Part III

              Sparse Clustering Analysis


              Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse when the number of irrelevant features increases. In this part we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


              7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^\top, …, x_n^\top)^\top have been drawn identically from K different subpopulations of the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) , \quad \forall i \in \{1, \dots, n\} ,

where K is the number of components, f_k are the densities of the components and \pi_k are the mixture proportions (\pi_k \in ]0,1[ for all k, and \sum_k \pi_k = 1). Mixture models transcribe that, given the proportions \pi_k and the distributions f_k for each class, the data are generated according to the following mechanism:

- y: each individual is allotted to a class according to a multinomial distribution with parameters \pi_1, \dots, \pi_K;

- x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities \phi(\cdot; \theta_k). The density of the mixture can then be written as

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \phi(x_i; \theta_k) , \quad \forall i \in \{1, \dots, n\} ,


where \theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K) is the parameter of the model.
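To make the generative mechanism concrete, a toy matlab sketch of sampling n points from a Gaussian mixture with common covariance could look as follows; it assumes the Statistics Toolbox (for mnrnd and mvnrnd) and that pi_k (1 × K proportions), mu (K × p means), Sigma (p × p), n and p are given — all names are illustrative:

    % Draw class labels from a multinomial, then draw x_i ~ N(mu_k, Sigma).
    Y      = mnrnd(1, pi_k, n);                       % n x K indicator matrix, one draw per sample
    [~, z] = max(Y, [], 2);                           % class index of each sample
    X      = mu(z, :) + mvnrnd(zeros(1, p), Sigma, n);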

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of a mixture model, Pearson (1894) used the method of moments to estimate the five parameters (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used process to estimate the parameters is the maximization of the log-likelihood using the EM algorithm. It is typically used to maximize the likelihood of models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the likelihood expectation with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the E-step expected likelihood.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

L(\theta; X) = \log \left( \prod_{i=1}^{n} f(x_i; \theta) \right)
             = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) ,   (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters) and \pi_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or


classification log-likelihood:

L_C(\theta; X, Y) = \log \left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right)
                  = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} y_{ik} \, \pi_k f_k(x_i; \theta_k) \right)
                  = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( \pi_k f_k(x_i; \theta_k) \right) ,   (7.2)

where the last equality holds because exactly one y_{ik} equals one for each i. The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k, and y_{ik} = 0 otherwise.

The soft membership t_{ik}(\theta) is defined as

t_{ik}(\theta) = p(Y_{ik} = 1 | x_i; \theta)   (7.3)
              = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)} .   (7.4)

To lighten notations, t_{ik}(\theta) will be denoted t_{ik} when the parameter \theta is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log \left( \pi_k f_k(x_i; \theta_k) \right)
                  = \sum_{i,k} y_{ik} \log \left( t_{ik} f(x_i; \theta) \right)
                  = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
                  = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)
                  = \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X) ,   (7.5)

where \sum_{i,k} y_{ik} \log t_{ik} can be reformulated as

\sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \left( p(Y_{ik} = 1 | x_i; \theta) \right)
                              = \sum_{i=1}^{n} \log \left( p(Y_{i k_i} = 1 | x_i; \theta) \right)
                              = \log \left( p(Y | X; \theta) \right) ,

where k_i denotes the class of observation i. As a result, the relationship (7.5) can be rewritten as

L(\theta; X) = L_C(\theta; Z) - \log \left( p(Y | X; \theta) \right) .   (7.6)


Likelihood Maximization

The complete log-likelihood cannot be evaluated because the variables y_{ik} are unknown. However, it is possible to estimate the value of the log-likelihood by taking expectations in (7.6), conditionally on a current value of \theta:

L(\theta; X) = \underbrace{\mathbb{E}_{Y \sim p(\cdot|X;\theta^{(t)})} \left[ L_C(\theta; X, Y) \right]}_{Q(\theta, \theta^{(t)})} + \underbrace{\mathbb{E}_{Y \sim p(\cdot|X;\theta^{(t)})} \left[ -\log p(Y|X;\theta) \right]}_{H(\theta, \theta^{(t)})} .

In this expression, H(\theta, \theta^{(t)}) is an entropy term and Q(\theta, \theta^{(t)}) is the conditional expectation of the complete log-likelihood. Let us define the increment of the log-likelihood as \Delta L = L(\theta^{(t+1)}; X) - L(\theta^{(t)}; X). Then \theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)}) also increases the log-likelihood:

\Delta L = \underbrace{\left( Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by definition of iteration } t+1} + \underbrace{\left( H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \right)}_{\ge 0 \text{ by Jensen's inequality}} \ge 0 .

Therefore, it is possible to maximize the likelihood by optimizing Q(\theta, \theta^{(t)}). The relationship between Q(\theta, \theta') and L(\theta; X) is developed in deeper detail in Appendix F, to show how the value of L(\theta; X) can be recovered from Q(\theta, \theta^{(t)}).

For the mixture model problem, Q(\theta, \theta') is

Q(\theta, \theta') = \mathbb{E}_{Y \sim p(Y|X;\theta')} \left[ L_C(\theta; X, Y) \right]
                   = \sum_{i,k} p(Y_{ik} = 1 | x_i; \theta') \log \left( \pi_k f_k(x_i; \theta_k) \right)
                   = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log \left( \pi_k f_k(x_i; \theta_k) \right) .   (7.7)

Due to its similarity with the expression of the complete log-likelihood (7.2), Q(\theta, \theta') is also known as the weighted log-likelihood. In (7.7), the weights t_{ik}(\theta') are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

- Initialization (not iterated): choice of the initial parameter \theta^{(0)};

- E-Step: evaluation of Q(\theta, \theta^{(t)}), using t_{ik}(\theta^{(t)}) (7.4) in (7.7);

- M-Step: calculation of \theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)}).


Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix \Sigma and different mean vectors \mu_k, the mixture density is

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
              = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\} .

At the E-step, the posterior probabilities t_{ik} are computed as in (7.4) with the current parameters \theta^{(t)}; then the M-step maximizes Q(\theta, \theta^{(t)}) (7.7), whose form is as follows:

Q(\theta, \theta^{(t)}) = \sum_{i,k} t_{ik} \log(\pi_k) - \sum_{i,k} t_{ik} \log\left( (2\pi)^{p/2} |\Sigma|^{1/2} \right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                        = \sum_{k} t_k \log(\pi_k) - \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                        \equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|) - \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right) ,   (7.8)

where

t_k = \sum_{i=1}^{n} t_{ik} .   (7.9)

The M-step, which maximizes this expression with respect to \theta, applies the following updates defining \theta^{(t+1)}:

\pi_k^{(t+1)} = \frac{t_k}{n} ,   (7.10)

\mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k} ,   (7.11)

\Sigma^{(t+1)} = \frac{1}{n} \sum_k W_k ,   (7.12)

with W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .   (7.13)

The derivations are detailed in Appendix G.
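The following minimal matlab sketch puts the E-step (7.4) and the M-step updates (7.10)-(7.13) together for a Gaussian mixture with common covariance matrix. It assumes initial values of pi_k (1 × K), mu (K × p) and Sigma (p × p), a data matrix X (n × p), the number of components K and a number of iterations maxit, and it omits any convergence test or numerical safeguard (the names are illustrative, not those of Mix-GLOSS):

    [n, p] = size(X);
    for it = 1:maxit
        % E-step: posterior probabilities t_ik, Equation (7.4)
        T = zeros(n, K);
        for k = 1:K
            d       = X - repmat(mu(k, :), n, 1);
            T(:, k) = pi_k(k) * exp(-0.5 * sum((d / Sigma) .* d, 2)) / sqrt(det(2 * pi * Sigma));
        end
        T = T ./ repmat(sum(T, 2), 1, K);
        % M-step: Equations (7.10)-(7.13)
        tk    = sum(T, 1);
        pi_k  = tk / n;
        Sigma = zeros(p);
        for k = 1:K
            mu(k, :) = (T(:, k)' * X) / tk(k);
            d        = X - repmat(mu(k, :), n, 1);
            Sigma    = Sigma + d' * (d .* repmat(T(:, k), 1, p));   % accumulates W_k
        end
        Sigma = Sigma / n;
    end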

7.2 Feature Selection in Model-Based Clustering

When a common covariance matrix is assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own


covariance matrix \Sigma_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, \Sigma_k = \lambda_k D_k A_k D_k^\top (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter we review some techniques for inducing sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

\log \left( \frac{p(Y_k = 1 | x)}{p(Y_\ell = 1 | x)} \right) = x^\top \Sigma^{-1} (\mu_k - \mu_\ell) - \frac{1}{2} (\mu_k + \mu_\ell)^\top \Sigma^{-1} (\mu_k - \mu_\ell) + \log \frac{\pi_k}{\pi_\ell} .

In this model, a simple way of introducing sparsity in the discriminant vectors \Sigma^{-1}(\mu_k - \mu_\ell) is to constrain \Sigma to be diagonal and to favor sparse means \mu_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

\lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} |(\Sigma_k^{-1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

\lambda \sum_{j=1}^{p} \sum_{1 \le k \le k' \le K} |\mu_{kj} - \mu_{k'j}| .

This PFP regularization does not shrink the means to zero, but towards each other. When the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1-infinity penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

\lambda \sum_{j=1}^{p} \left\| (\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj}) \right\|_\infty .

One group is defined for each variable j, as the set of the jth components of the K means, (\mu_{1j}, \dots, \mu_{Kj}). The L1-infinity penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time. In addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG yields genuine feature selection, because it forces null values for the same variable in all cluster means:

\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K} \mu_{kj}^2 } .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector. The generalization from quadratic to non-quadratic penalties is quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy in terms of conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


f(x_i | \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[ f(x_{ij} | \theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij} | \nu_j) \right]^{1 - \phi_j} ,

where f(\cdot|\theta_{jk}) is the distribution function for relevant features and h(\cdot|\nu_j) is the distribution function for the irrelevant ones. The binary vector \phi = (\phi_1, \phi_2, \dots, \phi_p) represents relevance, with \phi_j = 1 if the jth feature is informative and \phi_j = 0 otherwise. The saliency of variable j is then formalized as \rho_j = P(\phi_j = 1), so all the \phi_j are treated as missing variables. The set of parameters is thus \{\pi_k, \theta_{jk}, \nu_j, \rho_j\}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U \in \mathbb{R}^{p \times (K-1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion

\mathrm{tr}\left( (U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U \right) ,   (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, so that the U matrix enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one results in the best sparse orthogonal approximation \tilde{U} of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \left\| X_U - X\tilde{U} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \tilde{u}^k \right\|_1 ,

where X_U = XU is the input data projected onto the non-sparse space and \tilde{u}^k is the kth column vector of the projection matrix \tilde{U}.

The second possibility is inspired by Qiao et al. (2009) and reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:

\min_{A, B \in \mathbb{R}^{p \times (K-1)}} \sum_{k=1}^{K} \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1
\quad \text{s.t. } A^\top A = I_{K-1} ,

where H_B \in \mathbb{R}^{p \times K} is a matrix defined conditionally on the posterior probabilities t_{ik}, satisfying H_B H_B^\top = \Sigma_B, and H_{B,k} is the kth column of H_B; R_W \in \mathbb{R}^{p \times p} is an upper


triangular matrix resulting from the Cholesky decomposition of \Sigma_W; \Sigma_W and \Sigma_B are the p × p within-class and between-class covariance matrices in the observation space; and A \in \mathbb{R}^{p \times (K-1)} and B \in \mathbb{R}^{p \times (K-1)} are the solutions of the optimization problem, such that B = [\beta_1, \dots, \beta_{K-1}] is the best sparse approximation of U.

The last possibility suggests computing the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:

\min_{\tilde{U} \in \mathbb{R}^{p \times (K-1)}} \sum_{j=1}^{p} \left\| \Sigma_{B,j} - \tilde{U} \tilde{U}^\top \Sigma_{B,j} \right\|_2^2
\quad \text{s.t. } \tilde{U}^\top \tilde{U} = I_{K-1} ,

where \Sigma_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of \tilde{U} to recover orthogonality. This is not necessary with the third option, since the penalized version of the SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

- X^{(1)}: the set of selected relevant variables;

- X^{(2)}: the set of variables being considered for inclusion in, or exclusion from, X^{(1)};

- X^{(3)}: the set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

- M1: f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)} | X^{(1)}) \, f(X^{(1)} | Y) ;

- M2: f(X | Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)}, X^{(1)} | Y) .

Model M1 means that the variables in X^{(2)} are independent of the clustering Y, while model M2 states that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_{12} = \frac{f(X | M_1)}{f(X | M_2)} ,

where the high-dimensional term f(X^{(3)} | X^{(2)}, X^{(1)}) cancels from the ratio:

B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} | M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} | M_2)}
       = \frac{f(X^{(2)} | X^{(1)}, M_1) \, f(X^{(1)} | M_1)}{f(X^{(2)}, X^{(1)} | M_2)} .

This factor is approximated, since the integrated likelihoods f(X^{(1)} | M_1) and f(X^{(2)}, X^{(1)} | M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)} | X^{(1)}, M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}. There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy, instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables, which in certain situations improves the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


              8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to produce reduced-rank decision rules using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach to the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k) ,
\]
where \mu_k are the p-dimensional centroids and \Sigma_W is the p \times p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_{ik} (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood
\[
2\, l_{\mathrm{weight}}(\mu, \Sigma) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|) ,
\]
which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_{ik} in the E-step, the distance between the samples x_i and the centroids \mu_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:
\[
d(x_i, \mu_k) = \|(x_i - \mu_k) B_{\mathrm{LDA}}\|_2^2 - 2 \log(\pi_k) .
\]

This distance defines the computation of the posterior probabilities t_{ik} in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = (X^\top X + \lambda \Omega)^{-1} X^\top Y \Theta ,
\]
where \Theta are the K − 1 leading eigenvectors of Y^\top X (X^\top X + \lambda \Omega)^{-1} X^\top Y.

3. Map X to the LDA domain: X_{\mathrm{LDA}} = X B_{\mathrm{OS}} D, with D = \mathrm{diag}\big(\alpha_k^{-1} (1 - \alpha_k^2)^{-\frac{1}{2}}\big).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_{ik} with
\[
t_{ik} \propto \exp\left[ - \frac{d(x_i, \mu_k) - 2 \log(\pi_k)}{2} \right] . \tag{8.1}
\]

7. Update the labels using the posterior probabilities matrix: Y = T.

8. Go back to step 2 and iterate until the t_{ik} converge.

Items 2 to 5 can be interpreted as the M-step and Item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
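The loop above can be sketched in a few lines of linear algebra. The following is a simplified illustration under stated assumptions (a plain ridge penalty Ω = I, a random soft initialization of Y, and no scaling matrix D nor feature selection); it is not the Mix-GLOSS implementation, and all names are illustrative.

```python
# Simplified sketch of the EM-as-penalized-OS loop (steps 1-8); illustrative only.
import numpy as np
from scipy.linalg import eigh

def em_optimal_scoring(X, K, lam=1.0, n_iter=50):
    n, p = X.shape
    rng = np.random.default_rng(0)
    Y = rng.dirichlet(np.ones(K), size=n)        # step 1: soft membership matrix
    Omega = np.eye(p)
    for _ in range(n_iter):
        # step 2: penalized optimal scoring
        A = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y)   # (X'X + lam*Omega)^{-1} X'Y
        M = Y.T @ X @ A                                        # Y'X (X'X + lam*Omega)^{-1} X'Y
        evals, evecs = eigh(M)
        Theta = evecs[:, ::-1][:, :K - 1]                      # K-1 leading eigenvectors
        B_os = A @ Theta
        # step 3: map to the LDA domain (the scaling D is omitted in this sketch)
        X_lda = X @ B_os
        # step 4: centroids and proportions
        pi = Y.sum(axis=0) / n
        mu = (Y.T @ X_lda) / Y.sum(axis=0)[:, None]
        # steps 5-6: distances and posterior probabilities (E-step)
        d = ((X_lda[:, None, :] - mu[None, :, :]) ** 2).sum(-1) - 2 * np.log(pi)
        T = np.exp(-0.5 * d)
        T /= T.sum(axis=1, keepdims=True)
        # step 7: update the labels
        Y = T
    return B_os, T
```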

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized OS problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:
\[
f(\Sigma \mid \Lambda_0, \nu_0) = \frac{1}{2^{np/2}\, |\Lambda_0|^{n/2}\, \Gamma_p(\frac{n}{2})}\;
|\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}}
\exp\left\{ -\frac{1}{2} \mathrm{tr}\left( \Lambda_0^{-1} \Sigma^{-1} \right) \right\} ,
\]
where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function defined as
\[
\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left( \frac{n}{2} + \frac{1-j}{2} \right) .
\]

The posterior distribution can be maximized similarly to the likelihood, through the maximization of
\[
\begin{aligned}
Q(\theta, \theta') + \log f(\Sigma \mid \Lambda_0, \nu_0)
&= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log |\Lambda_0| - \frac{p(p+1)}{4} \log(\pi) \\
&\quad - \sum_{j=1}^{p} \log \Gamma\!\left( \frac{n}{2} + \frac{1-j}{2} \right)
  - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right) \\
&\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log |\Lambda_0|
  - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right) , \quad (8.2)
\end{aligned}
\]
with
\[
t_k = \sum_{i=1}^{n} t_{ik}, \qquad
\nu_n = \nu_0 + n, \qquad
\Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad
S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .
\]

Details of these calculations can be found in textbooks (for example Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator for Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood detailed in Appendix G. The resulting estimator for Σ is
\[
\hat{\Sigma}_{\mathrm{MAP}} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right) , \tag{8.3}
\]
where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
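For illustration, the MAP covariance update in (8.3) can be written directly from the posterior weights; the sketch below assumes the weighted scatter S_0 defined above and the choices Λ_0^{-1} = λΩ and ν_0 = p + 1, with illustrative names.

```python
# Sketch of the MAP estimate of the common within-class covariance (Eq. 8.3),
# assuming nu_0 = p + 1 and Lambda_0^{-1} = lam * Omega; names are illustrative.
import numpy as np

def sigma_map(X, T, mu, Omega, lam):
    """X: (n, p) data, T: (n, K) posteriors, mu: (K, p) centroids."""
    n, p = X.shape
    diff = X[:, None, :] - mu[None, :, :]                 # (n, K, p)
    S0 = np.einsum('nk,nkp,nkq->pq', T, diff, diff)       # weighted scatter S_0
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)
```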


              9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_{ik}.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, so that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the warm start implemented here reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that controls the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) must be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0; Y ← K-means(X, K)
    Run non-penalized Mix-GLOSS:
        λ ← 0
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    lastLAMBDA ← false
repeat
    Estimate λ: compute the gradient at β_j = 0,
        ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( Σ_{m ≠ j} x_m β^m − YΘ )
    Compute λ_max for every feature using (4.32b),
        λ_max^j = (1 / w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
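The per-feature λ_max above is simply the norm of the gradient of the quadratic fit at β_j = 0, rescaled by the group weight. A minimal sketch of that computation is given below; it assumes the data X, the current coefficients B, the scored labels YΘ and the weights w, and all names are illustrative.

```python
# Sketch of the per-feature lambda_max used by the automatic lambda selection:
# lambda_max_j = (1 / w_j) * || x_j^T (sum_{m != j} x_m beta^m - Y Theta) ||_2 .
# Assumes X (n, p), B (p, K-1), YTheta (n, K-1), weights w (p,); illustrative only.
import numpy as np

def lambda_max_per_feature(X, B, YTheta, w):
    lam_max = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        B_j = B.copy()
        B_j[j, :] = 0.0                        # exclude feature j from the fit
        grad_j = X[:, j] @ (X @ B_j - YTheta)  # gradient of J at beta_j = 0
        lam_max[j] = np.linalg.norm(grad_j) / w[j]
    return lam_max
```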

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_{ik} is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0; Y ← Y0
    else
        B_OS ← 0; Y ← K-means(X, K)
    end if
    convergenceEM ← false; tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


              M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means μ_k, the common covariance matrix Σ and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses the scaled version of the label matrix, YΘ, on X. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

              E-Step

The E-step evaluates the posterior probability matrix T, using
\[
t_{ik} \propto \exp\left[ - \frac{d(x_i, \mu_k) - 2 \log(\pi_k)}{2} \right] .
\]

The convergence of these t_{ik} is used as the stopping criterion for EM.
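In practice this normalization is conveniently done in log-space to avoid numerical underflow; a small illustrative sketch (not the Mix-GLOSS code) is:

```python
# Numerically stable E-step: posteriors t_ik from distances d_ik and priors pi_k,
# computed in log-space (softmax trick); illustrative sketch only.
import numpy as np

def posteriors(d, pi):
    """d: (n, K) distances d(x_i, mu_k), pi: (K,) cluster proportions."""
    logit = -0.5 * (d - 2.0 * np.log(pi))        # log of the unnormalized t_ik
    logit -= logit.max(axis=1, keepdims=True)    # subtract row max for stability
    T = np.exp(logit)
    return T / T.sum(axis=1, keepdims=True)
```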

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978) but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.

The third and, up to now, definitive attempt proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0), and the execution with the best log-likelihood is chosen; the repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter.


Figure 9.2: Mix-GLOSS model selection diagram (initial non-penalized Mix-GLOSS runs with λ = 0, selection of the best repetition, warm-started penalized runs using its B and T, computation of BIC, and choice of λ minimizing BIC).

This version has been tested with no significant differences in the quality of the clustering, while dramatically reducing the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


              10 Experimental Results

              The performance of Mix-GLOSS is measured here with the artificial dataset that hasbeen used in Section 6

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests we have reduced the size of the problem, because with the original size of 1200 samples and 500 dimensions some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1, adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are reported as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering with Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a piece of software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests this entrant was discarded, due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an $L_{1,\infty}$ mixed norm. Kuan et al. (2010) introduce some slight changes in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using either the formulation from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach to the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The quantities used to measure the performance are:

• Clustering error (in percentage). To measure the quality of the partition given the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007); a small sketch is given after this list. If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ.

• Number of discarded features. This value shows the number of variables whose coefficients have been zeroed and which are therefore not used in the partitioning. In our datasets only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds). Finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
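The clustering error mentioned in the first item requires matching cluster IDs to class IDs before counting mistakes. A common way to do this, shown below as an illustrative sketch (not necessarily the exact procedure of Wu and Schölkopf, 2007), is to search for the label matching that minimizes the misclassification rate, e.g. with the Hungarian algorithm.

```python
# Illustrative sketch: clustering error as the misclassification rate under the
# best one-to-one matching between cluster labels and true class labels.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency table: counts of (cluster, class) co-occurrences
    C = np.array([[np.sum((y_pred == cl) & (y_true == k)) for k in classes]
                  for cl in clusters])
    row, col = linear_sum_assignment(-C)          # maximize matched counts
    return 1.0 - C[row, col].sum() / len(y_true)
```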

The adequacy of the selected features is assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of relevant variables that are actually selected; similarly, the FPR is the proportion of non-relevant variables that are wrongly selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and, the two versions of LumiWCluster providing almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
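For completeness, a tiny sketch of these two rates, computed from boolean masks of selected and truly relevant variables (illustrative only):

```python
# TPR / FPR of feature selection from boolean masks; illustrative sketch.
import numpy as np

def tpr_fpr(selected, relevant):
    """selected, relevant: boolean arrays of length p."""
    selected, relevant = np.asarray(selected), np.asarray(relevant)
    tpr = np.sum(selected & relevant) / np.sum(relevant)      # recall of relevant vars
    fpr = np.sum(selected & ~relevant) / np.sum(~relevant)    # fraction of noise kept
    return tpr, fpr
```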

Results, in percentages, are displayed in Figure 10.2 and in Table 10.2.


Table 10.1: Experimental results for simulated data

Sim 1: K = 4, mean shift, ind. features
    Algorithm             Err (%)       Var           Time
    CS general cov        46 (15)       985 (72)      884h
    Fisher EM             58 (87)       784 (52)      1645m
    Clustvarsel           602 (107)     378 (291)     383h
    LumiWCluster-Kuan     42 (68)       779 (4)       389s
    LumiWCluster-Wang     43 (69)       784 (39)      619s
    Mix-GLOSS             32 (16)       80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
    Algorithm             Err (%)       Var           Time
    CS general cov        154 (2)       997 (09)      783h
    Fisher EM             74 (23)       809 (28)      8m
    Clustvarsel           73 (2)        334 (207)     166h
    LumiWCluster-Kuan     64 (18)       798 (04)      155s
    LumiWCluster-Wang     63 (17)       799 (03)      14s
    Mix-GLOSS             77 (2)        841 (34)      2h

Sim 3: K = 4, 1D mean shift, ind. features
    Algorithm             Err (%)       Var           Time
    CS general cov        304 (57)      55 (468)      1317h
    Fisher EM             233 (65)      366 (55)      22m
    Clustvarsel           658 (115)     232 (291)     542h
    LumiWCluster-Kuan     323 (21)      80 (02)       83s
    LumiWCluster-Wang     308 (36)      80 (02)       1292s
    Mix-GLOSS             347 (92)      81 (88)       21h

Sim 4: K = 4, mean shift, ind. features
    Algorithm             Err (%)       Var           Time
    CS general cov        626 (55)      999 (02)      112h
    Fisher EM             567 (104)     55 (48)       195m
    Clustvarsel           732 (4)       24 (12)       767h
    LumiWCluster-Kuan     692 (112)     99 (2)        876s
    LumiWCluster-Wang     697 (119)     991 (21)      825s
    Mix-GLOSS             669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions for the best performing algorithms

                  Simulation 1      Simulation 2      Simulation 3      Simulation 4
                  TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
    MIX-GLOSS     992     015       828     335       884     67        780     12
    LUMI-KUAN     992     28        1000    02        1000    005       50      005
    FISHER-EM     986     24        888     17        838     5825      620     4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (Mix-GLOSS, LumiWCluster-Kuan, Fisher EM) across the four simulations.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations with respect to all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other performance measures. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation time is worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best or close to the best solution in terms of fall-out and recall.



              Conclusions

              Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of machine learning. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver (GLOSS) algorithm, which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As in the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to be maximized at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. In this case too, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

              Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS; however, more testing would be advisable in both cases. These algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables, but other high-dimension low-sample-size (HDLSS) domains are also possible: the identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011) and the Stirling faces (Roth and Lange, 2004) are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a) have also been used in the literature.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was to achieve functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version of GLOSS and Mix-GLOSS should be made available in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of these possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time constraints on the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms, which do not involve the computation of the true criterion, have been used to stop the EM algorithm and to perform model selection. However, further investigation must be done in this direction to assess the convergence properties of the algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers models the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


              Appendix


              A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:
\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top ,
\qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top .
\]

Property 2. \(\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a\).

Property 3. \(\dfrac{\partial\, x^\top A x}{\partial x} = (A + A^\top) x\).

Property 4. \(\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}| (X^{-1})^\top\).

Property 5. \(\dfrac{\partial\, a^\top X b}{\partial X} = a b^\top\).

Property 6. \(\dfrac{\partial}{\partial X} \mathrm{tr}\left( A X^{-1} B \right) = -(X^{-1} B A X^{-1})^\top = -X^{-\top} A^\top B^\top X^{-\top}\).


B The Penalized-OS Problem is an Eigenvector Problem

In this appendix we answer the question of why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form
\[
\begin{aligned}
\min_{\theta_k, \beta_k} \ & \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k \\
\text{s.t. } & \theta_k^\top Y^\top Y \theta_k = 1 , \\
& \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k ,
\end{aligned}
\tag{B.1}
\]
for k = 1, ..., K − 1.

The Lagrangian associated with Problem (B.1) is
\[
L_k(\theta_k, \beta_k, \lambda_k, \nu_k) =
\|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k
+ \lambda_k \left( \theta_k^\top Y^\top Y \theta_k - 1 \right)
+ \sum_{\ell < k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k . \tag{B.2}
\]

Setting the gradient of (B.2) with respect to β_k to zero gives the value of the optimal β_k^*:
\[
\beta_k^* = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.3}
\]

The objective function of (B.1) evaluated at β_k^* is
\[
\begin{aligned}
\min_{\theta_k} \ \|Y\theta_k - X\beta_k^*\|_2^2 + \beta_k^{*\top} \Omega_k \beta_k^*
&= \min_{\theta_k} \ \theta_k^\top Y^\top \left( I - X (X^\top X + \Omega_k)^{-1} X^\top \right) Y \theta_k \\
&= \max_{\theta_k} \ \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k . \tag{B.4}
\end{aligned}
\]

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem where the k score vectors θ_k are the eigenvectors of Y^\top X (X^\top X + Ω)^{-1} X^\top Y.

B.1 How to Solve the Eigenvector Decomposition

Performing an eigen-decomposition of an expression like Y^\top X (X^\top X + Ω)^{-1} X^\top Y is not trivial, due to the p × p inverse: with some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^\top X (X^\top X + Ω)^{-1} X^\top Y, so that expression (B.4) can be rewritten in a compact way:
\[
\begin{aligned}
\max_{\Theta \in \mathbb{R}^{K \times (K-1)}} \ & \mathrm{tr}\left( \Theta^\top M \Theta \right) \\
\text{s.t. } & \Theta^\top Y^\top Y \Theta = I_{K-1} .
\end{aligned}
\tag{B.5}
\]

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1) × (K−1) matrix M_Θ be Θ^\top M Θ. Hence the classical eigenvector formulation associated with (B.5) is
\[
M_\Theta v = \lambda v , \tag{B.6}
\]
where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,
\[
v^\top M_\Theta v = \lambda \ \Leftrightarrow\ v^\top \Theta^\top M \Theta v = \lambda .
\]
Making the change of variable w = Θv, we obtain an alternative eigen-problem where the w are the eigenvectors of M and λ the associated eigenvalues:
\[
w^\top M w = \lambda . \tag{B.7}
\]

Therefore v are the eigenvectors of the eigen-decomposition of matrix M_Θ, and w are the eigenvectors of the eigen-decomposition of matrix M. Note that the only difference between the (K−1) × (K−1) matrix M_Θ and the K × K matrix M is the K × (K−1) matrix Θ in the expression M_Θ = Θ^\top M Θ. Then, to avoid the computation of the p × p inverse (X^\top X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^\top X + Ω)^{-1} X^\top Y Θ in M_Θ:
\[
\begin{aligned}
M_\Theta &= \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta \\
&= \Theta^\top Y^\top X B .
\end{aligned}
\]

Thus, the eigen-decomposition of the (K−1) × (K−1) matrix M_Θ = Θ^\top Y^\top X B yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv needs to be undone.

To summarize, we calculate the v eigenvectors as the eigen-decomposition of a tractable matrix M_Θ, evaluated as Θ^\top Y^\top X B. Then the definitive eigenvectors w are recovered by computing w = Θv. The final step is the reconstruction of the optimal score matrix Θ^* using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the eigenvector matrix V from decomposition (B.6) reverses the change of variable and restores the w vectors. The B matrix also needs to be "updated", by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:
\[
B^* = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V .
\]
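The trick above only requires the eigen-decomposition of a (K−1) × (K−1) matrix. A minimal numerical sketch of the update is given below, under stated assumptions (a generic quadratic penalty Ω, an initial score matrix Theta0); the names are illustrative and this is not the GLOSS code.

```python
# Sketch of the update described above: solve the p-OS problem for an initial
# score matrix Theta0, eigen-decompose the small matrix M_Theta = Theta0' Y' X B,
# and update Theta and B by the eigenvector matrix V; illustrative only.
import numpy as np

def pos_update(X, Y, Theta0, Omega, lam):
    B = np.linalg.solve(X.T @ X + lam * Omega, X.T @ Y @ Theta0)  # (B.3) for Theta0
    M_theta = Theta0.T @ Y.T @ X @ B                              # small (K-1)x(K-1) matrix
    evals, V = np.linalg.eigh((M_theta + M_theta.T) / 2)          # symmetrize for safety
    order = np.argsort(evals)[::-1]                               # eigenvalues in descending order
    V = V[:, order]
    return Theta0 @ V, B @ V                                      # "updated" Theta and B
```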


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^\top X (X^\top X + Ω)^{-1} X^\top Y.

By definition of the eigen-decomposition, the eigenvectors of the matrix M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:
\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m , \qquad \text{s.t. } \theta_k^\top \theta_k = 1 . \tag{B.8}
\]
The score vectors' normalization constraint θ_k^\top θ_k = 1 can also be expressed as a function of this basis,
\[
\left( \sum_{m=1}^{K-1} \alpha_m w_m \right)^{\!\top} \left( \sum_{m=1}^{K-1} \alpha_m w_m \right) = 1 ,
\]
which, by the eigenvector properties, reduces to
\[
\sum_{m=1}^{K-1} \alpha_m^2 = 1 . \tag{B.9}
\]

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):
\[
M \theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m .
\]
As the w_m are the eigenvectors of the matrix M, the relationship M w_m = λ_m w_m can be used to obtain
\[
M \theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m .
\]
Multiplying on the left by θ_k^\top, expressed through its own linear combination of eigenvectors, gives
\[
\theta_k^\top M \theta_k = \left( \sum_{\ell=1}^{K-1} \alpha_\ell w_\ell \right)^{\!\top} \left( \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m \right) .
\]
This equation can be simplified using the orthogonality property of eigenvectors, according to which w_\ell^\top w_m is zero for any \ell ≠ m, giving
\[
\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m .
\]

The optimization Problem (B.5) for discriminant direction k can thus be rewritten as
\[
\begin{aligned}
\max_{\theta_k \in \mathbb{R}^{K \times 1}} \ & \theta_k^\top M \theta_k
= \max_{\theta_k \in \mathbb{R}^{K \times 1}} \ \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m \\
\text{with } & \theta_k = \sum_{m=1}^{K-1} \alpha_m w_m
\quad \text{and} \quad \sum_{m=1}^{K-1} \alpha_m^2 = 1 .
\end{aligned}
\tag{B.10}
\]

One way of maximizing Problem (B.10) is choosing α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K-1} α_m w_m, the resulting score vector θ_k will be equal to the k-th eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y^\top X (X^\top X + Ω)^{-1} X^\top Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:
\[
\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \ & \beta^\top \Sigma_B \beta & \text{(C.1a)} \\
\text{s.t. } & \beta^\top \Sigma_W \beta = 1 , & \text{(C.1b)}
\end{aligned}
\]
where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is
\[
L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu \left( \beta^\top \Sigma_W \beta - 1 \right) ,
\]
so that its first derivative with respect to β is
\[
\frac{\partial L(\beta, \nu)}{\partial \beta} = 2 \Sigma_B \beta - 2 \nu \Sigma_W \beta .
\]
A necessary optimality condition for β^* is that this derivative is zero, that is,
\[
\Sigma_B \beta^* = \nu \Sigma_W \beta^* .
\]
Provided Σ_W is full rank, we have
\[
\Sigma_W^{-1} \Sigma_B \beta^* = \nu \beta^* . \tag{C.2}
\]
Thus, the solutions β^* match the definition of an eigenvector of the matrix Σ_W^{-1} Σ_B, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
\[
\begin{aligned}
\beta^{*\top} \Sigma_B \beta^* &= \beta^{*\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^* \\
&= \nu\, \beta^{*\top} \Sigma_W \beta^* \qquad \text{from (C.2)} \\
&= \nu \qquad \text{from (C.1b)} .
\end{aligned}
\]
That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence ν is the largest eigenvalue of Σ_W^{-1} Σ_B, and β^* is any eigenvector corresponding to this maximal eigenvalue.
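As an illustration, the leading discriminant direction can be obtained numerically from the two scatter matrices by solving the generalized eigen-problem Σ_B β = ν Σ_W β; the sketch below uses scipy for this purpose and its names are illustrative.

```python
# Illustrative sketch: first Fisher discriminant direction as the leading
# eigenvector of the generalized problem Sigma_B beta = nu Sigma_W beta.
import numpy as np
from scipy.linalg import eigh

def fisher_direction(Sigma_B, Sigma_W):
    evals, evecs = eigh(Sigma_B, Sigma_W)        # generalized symmetric eigen-problem
    beta = evecs[:, np.argmax(evals)]            # eigenvector of the largest eigenvalue
    # eigh normalizes so that beta' Sigma_W beta = 1, matching constraint (C.1b)
    return beta
```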


              D Alternative Variational Formulation forthe Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
\[
\begin{aligned}
\min_{\tau \in \mathbb{R}^p}\ \min_{B \in \mathbb{R}^{p \times K-1}} \ & J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} & \text{(D.1a)} \\
\text{s.t. } & \sum_{j=1}^{p} \tau_j = 1 , & \text{(D.1b)} \\
& \tau_j \ge 0 , \quad j = 1, \dots, p . & \text{(D.1c)}
\end{aligned}
\]

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1\top}, …, β^{p\top})^\top:
\[
L(B, \tau, \lambda, \nu_0, \nu_j) = J(B)
+ \lambda \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j}
+ \nu_0 \left( \sum_{j=1}^{p} \tau_j - 1 \right)
- \sum_{j=1}^{p} \nu_j \tau_j . \tag{D.2}
\]

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j^*:
\[
\frac{\partial L(B, \tau, \lambda, \nu_0, \nu_j)}{\partial \tau_j}\bigg|_{\tau_j = \tau_j^*} = 0
\;\Rightarrow\; -\lambda \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j^{*2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{*2} - \nu_j \tau_j^{*2} = 0
\;\Rightarrow\; -\lambda w_j^2 \|\beta^j\|_2^2 + \nu_0 \tau_j^{*2} = 0 .
\]
The last two expressions are related through one of the properties of the Lagrange multipliers, which states that ν_j g_j(τ^*) = 0, where ν_j is the Lagrange multiplier and g_j(τ) the inequality constraint. The optimal τ_j^* can then be deduced:
\[
\tau_j^* = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \|\beta^j\|_2 .
\]
Plugging this optimal value of τ_j^* into constraint (D.1b),
\[
\sum_{j=1}^{p} \tau_j = 1 \;\Rightarrow\;
\tau_j^* = \frac{w_j \|\beta^j\|_2}{\sum_{j=1}^{p} w_j \|\beta^j\|_2} . \tag{D.3}
\]


With this value of τ_j^*, Problem (D.1) is equivalent to
\[
\min_{B \in \mathbb{R}^{p \times K-1}} \ J(B) + \lambda \left( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \right)^{\!2} . \tag{D.4}
\]
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4); in particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently written as λ B^\top Ω B, where
\[
\Omega = \mathrm{diag}\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \right) . \tag{D.5}
\]
Using the value of τ_j^* from (D.3), each diagonal component of Ω is
\[
(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \|\beta^{j'}\|_2}{\|\beta^j\|_2} . \tag{D.6}
\]

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is
\[
\left\{ V \in \mathbb{R}^{p \times (K-1)} :\ V = \frac{\partial J(B)}{\partial B}
+ 2\lambda \left( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \right) G \right\} , \tag{D.7}
\]
where G is a p × (K−1) matrix whose rows g^j are defined as follows. Let S(B) denote the support of B, that is, the set of its non-zero row vectors, S(B) = { j ∈ {1, …, p} : ‖β^j‖_2 ≠ 0 }; then we have
\[
\forall j \in S(B), \quad g^j = w_j \|\beta^j\|_2^{-1} \beta^j , \tag{D.8}
\]
\[
\forall j \notin S(B), \quad \|g^j\|_2 \le w_j . \tag{D.9}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B^* of the objective function verifying the following conditions are global minima. Let S(B^*) denote the support of B^*, S(B^*) = { j ∈ {1, …, p} : ‖β^{*j}‖_2 ≠ 0 }, and let S̄(B^*) be its complement; then we have
\[
\forall j \in S(B^*), \quad -\frac{\partial J(B^*)}{\partial \beta^j}
= 2\lambda \left( \sum_{j'=1}^{p} w_{j'} \|\beta^{*j'}\|_2 \right) w_j \|\beta^{*j}\|_2^{-1} \beta^{*j} , \tag{D.10a}
\]
\[
\forall j \notin S(B^*), \quad \left\| \frac{\partial J(B^*)}{\partial \beta^j} \right\|_2
\le 2\lambda\, w_j \left( \sum_{j'=1}^{p} w_{j'} \|\beta^{*j'}\|_2 \right) . \tag{D.10b}
\]

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given B, the gap between these objectives is null at τ^* such that
\[
\tau_j^* = \frac{w_j \|\beta^j\|_2}{\sum_{j=1}^{p} w_j \|\beta^j\|_2} .
\]

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let τ ∈ R^p be any feasible vector; we have
\[
\left( \sum_{j=1}^{p} w_j \|\beta^j\|_2 \right)^{\!2}
= \left( \sum_{j=1}^{p} \tau_j^{1/2}\, \frac{w_j \|\beta^j\|_2}{\tau_j^{1/2}} \right)^{\!2}
\le \left( \sum_{j=1}^{p} \tau_j \right) \left( \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} \right)
\le \sum_{j=1}^{p} \frac{w_j^2 \|\beta^j\|_2^2}{\tau_j} ,
\]
where we used the Cauchy-Schwarz inequality in the second step and the definition of the feasibility set of τ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of τ and β are intertwined.


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients B0 are optimal for the score values Θ0, and if the optimal scores Θ^* are obtained by a unitary transformation of Θ0, say Θ^* = Θ0 V (where V ∈ R^{M×M} is a unitary matrix), then B^* = B0 V is optimal conditionally on Θ^*, that is, (Θ^*, B^*) is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let B^* be a solution of
\[
\min_{B \in \mathbb{R}^{p \times M}} \ \|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 , \tag{E.1}
\]
and let \(\tilde{Y} = YV\), where V ∈ R^{M×M} is a unitary matrix. Then \(\tilde{B} = B^* V\) is a solution of
\[
\min_{B \in \mathbb{R}^{p \times M}} \ \|\tilde{Y} - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j \|\beta^j\|_2 . \tag{E.2}
\]

Proof. The first-order necessary optimality conditions for B^* are
\[
\forall j \in S(B^*), \quad 2\, x^{j\top} \left( x^j \beta^{*j} - Y \right)
+ \lambda w_j \|\beta^{*j}\|_2^{-1} \beta^{*j} = 0 , \tag{E.3a}
\]
\[
\forall j \notin S(B^*), \quad 2 \left\| x^{j\top} \left( x^j \beta^{*j} - Y \right) \right\|_2 \le \lambda w_j , \tag{E.3b}
\]
where S(B^*) ⊆ {1, …, p} denotes the set of non-zero row vectors of B^*, and S̄(B^*) is its complement.

First, we note that, from the definition of \(\tilde{B}\), we have S(\(\tilde{B}\)) = S(B^*). Then we may rewrite the above conditions as follows:
\[
\forall j \in S(\tilde{B}), \quad 2\, x^{j\top} \left( x^j \tilde{\beta}^{j} - \tilde{Y} \right)
+ \lambda w_j \|\tilde{\beta}^{j}\|_2^{-1} \tilde{\beta}^{j} = 0 , \tag{E.4a}
\]
\[
\forall j \notin S(\tilde{B}), \quad 2 \left\| x^{j\top} \left( x^j \tilde{\beta}^{j} - \tilde{Y} \right) \right\|_2 \le \lambda w_j , \tag{E.4b}
\]
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by V, and also uses VV^\top = I, so that, for all u ∈ R^M, ‖u^\top‖_2 = ‖u^\top V‖_2; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for \(\tilde{B}\) to be a solution to Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.


F Expected Complete Likelihood and Likelihood

              Section 712 explains that with the maximization of the conditional expectation ofthe complete log-likelihood Q(θθprime) (77) by means of the EM algorithm log-likelihood(71) is also maximized The value of the log-likelihood can be computed using itsdefinition (71) but there is a shorter way to compute it from Q(θθprime) when the latteris available

              L(θ) =

              nsumi=1

              log

              (Ksumk=1

              πkfk(xiθk)

              )(F1)

              Q(θθprime) =nsumi=1

              Ksumk=1

              tik(θprime) log (πkfk(xiθk)) (F2)

              with tik(θprime) =

              πprimekfk(xiθprimek)sum

              ` πprime`f`(xiθ

              prime`)

              (F3)

In the EM algorithm, θ′ denotes the model parameters at the previous iteration, t_{ik}(θ′) are the posterior probabilities computed from θ′ at the previous E-step, and θ, without "prime", denotes the parameters of the current iteration, to be obtained by maximizing Q(θ, θ′).

Using (F.3), we have

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\big( \pi_k f_k(x_i; \theta_k) \big)
                  = \sum_{i,k} t_{ik}(\theta') \log\big( t_{ik}(\theta) \big) + \sum_{i,k} t_{ik}(\theta') \log\left( \sum_{\ell} \pi_\ell f_\ell(x_i; \theta_\ell) \right)
                  = \sum_{i,k} t_{ik}(\theta') \log\big( t_{ik}(\theta) \big) + L(\theta) .

In particular, after the evaluation of t_{ik} in the E-step, where θ = θ′, the log-likelihood can be computed using the value of Q(θ, θ) (7.7) and the entropy of the posterior probabilities:

L(\theta) = Q(\theta, \theta) - \sum_{i,k} t_{ik}(\theta) \log\big( t_{ik}(\theta) \big)
          = Q(\theta, \theta) + H(T) .
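This identity can be checked numerically on a toy univariate Gaussian mixture; the sketch below (with illustrative parameter values) compares the direct evaluation of the log-likelihood (F.1) with the value recovered from Q(θ, θ) and the entropy of the posterior probabilities.

# Check L(theta) = Q(theta, theta) + H(T) on a toy univariate Gaussian mixture.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)                      # toy sample
pi = np.array([0.4, 0.6])                    # illustrative mixture parameters
mu = np.array([-1.0, 2.0])
sigma = np.array([1.0, 1.5])

# Component densities f_k(x_i; theta_k), array of shape (n, K)
f = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Posterior probabilities t_ik (E-step evaluated at theta)
t = pi * f
t /= t.sum(axis=1, keepdims=True)

L_direct = np.sum(np.log((pi * f).sum(axis=1)))   # definition (F.1)
Q = np.sum(t * np.log(pi * f))                    # Q(theta, theta), as in (F.2)
H = -np.sum(t * np.log(t))                        # entropy of the posteriors
print(L_direct, Q + H)                            # agree up to numerical precision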


              G Derivation of the M-Step Equations

This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

Q(\theta, \theta') = \sum_{i,k} t_{ik}(\theta') \log\big( \pi_k f_k(x_i; \theta_k) \big)
                  = \sum_{k} \Big( \sum_{i} t_{ik} \Big) \log \pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^{\top} \Sigma^{-1} (x_i - \mu_k) ,

which has to be maximized subject to \sum_{k} \pi_k = 1.

The Lagrangian of this problem is

\mathcal{L}(\theta) = Q(\theta, \theta') + \lambda \Big( \sum_{k} \pi_k - 1 \Big) .

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of π_k, μ_k and Σ.

G.1 Prior probabilities

\frac{\partial \mathcal{L}(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k} \sum_{i} t_{ik} + \lambda = 0 ,

where λ is identified from the constraint, leading to

\pi_k = \frac{1}{n} \sum_{i} t_{ik} .


G.2 Means

\frac{\partial \mathcal{L}(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; -\frac{1}{2} \sum_{i} t_{ik} \, 2\Sigma^{-1}(\mu_k - x_i) = 0
\;\Rightarrow\; \mu_k = \frac{\sum_i t_{ik} x_i}{\sum_i t_{ik}} .

G.3 Covariance Matrix

\frac{\partial \mathcal{L}(\theta)}{\partial \Sigma^{-1}} = 0 \;\Leftrightarrow\; \underbrace{\frac{n}{2}\,\Sigma}_{\text{as per property 4}} - \underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^{\top}}_{\text{as per property 5}} = 0

\;\Rightarrow\; \Sigma = \frac{1}{n} \sum_{i,k} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^{\top} .
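These update rules translate directly into code. The sketch below is illustrative: it assumes that the responsibilities t_ik (an n × K array t) have already been computed in the E-step, and returns the three M-step estimates for a Gaussian mixture with a common covariance matrix.

# M-step for a Gaussian mixture with a common covariance matrix,
# given data X (n x p) and responsibilities t (n x K) from the E-step.
import numpy as np

def m_step(X, t):
    n, p = X.shape
    n_k = t.sum(axis=0)                      # soft counts per cluster
    pi = n_k / n                             # prior probabilities (G.1)
    mu = (t.T @ X) / n_k[:, None]            # cluster means (G.2)
    # Common covariance matrix (G.3)
    Sigma = np.zeros((p, p))
    for k in range(t.shape[1]):
        d = X - mu[k]                        # data centered on cluster k
        Sigma += (t[:, k, None] * d).T @ d
    Sigma /= n
    return pi, mu, Sigma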


              Bibliography

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, pages 19–54, 2011.

F. R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML, 2008.

F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

H. Bensmail and G. Celeux. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association, 91(436):1743–1748, 1996.

P. J. Bickel and E. Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

C. Biernacki, G. Celeux, G. Govaert, and F. Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

C. Bouveyron and C. Brunet. Discriminative variable selection for clustering with the sparse Fisher-EM algorithm. Technical Report 1204.2067, arXiv e-prints, 2012a.

C. Bouveyron and C. Brunet. Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing, 22(1):301–324, 2012b.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

L. Breiman. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384, 1995.

L. Breiman and R. Ihaka. Nonlinear discriminant analysis via ACE and scaling. Technical Report 40, University of California, Berkeley, 1984.


T. Cai and W. Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577, 2011.

S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. Advances in Neural Information Processing Systems, page 445, 1999.

C. Caramanis, S. Mannor, and H. Xu. Robust optimization in machine learning. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 369–402. MIT Press, 2012.

B. Chidlovskii and L. Lecerf. Scalable feature selection for multi-class problems. In W. Daelemans, B. Goethals, and K. Morik, editors, Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 227–240. Springer, 2008.

L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll. Sparse discriminant analysis. Technometrics, 53(4):406–413, 2011.

C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977. ISSN 0035-9246.

D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2000.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

V. Franc and S. Sonnenburg. Optimized cutting plane algorithm for support vector machines. In Proceedings of the 25th International Conference on Machine Learning, pages 320–327. ACM, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.


J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical Report 1001.0736, arXiv e-prints, 2010.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003.

D. Ghosh and A. M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2:147–154, 2005.

G. Govaert, Y. Grandvalet, X. Liu, and L. F. Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010.

G. Govaert, Y. Grandvalet, B. Laval, X. Liu, and L. F. Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011.

Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In Perspectives in Neural Computing, volume 98, pages 201–206, 1998.

Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. Advances in Neural Information Processing Systems, 15:553–560, 2002.

L. Grosenick, S. Greer, and B. Knutson. Interpretable classifiers for fMRI improve prediction of purchases. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(6):539–548, 2008.

Y. Guermeur, G. Pollastri, A. Elisseeff, D. Zelus, H. Paugam-Moisy, and P. Baldi. Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing, 56:305–327, 2004.

J. Guo, E. Levina, G. Michailidis, and J. Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):155–176, 1996.

T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428):1255–1270, 1994.


T. Hastie, A. Buja, and R. Tibshirani. Penalized discriminant analysis. The Annals of Statistics, 23(1):73–102, 1995.

A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

J. Huang, S. Ma, H. Xie, and C. H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226. ACM, 2006.

K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.

P. F. Kuan, S. Wang, X. Zhou, and H. Chu. A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26(22):2849–2855, 2010.

T. Lange, M. Braun, V. Roth, and J. Buhmann. Stability-based model selection. Advances in Neural Information Processing Systems, 15:617–624, 2002.

M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166, 2004.

Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67–81, 2004.

C. Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417–425, 2008.

C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273, 2006.

H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, 2012.

C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709, 2009a.


C. Maugis, G. Celeux, and M. L. Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b.

L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(1):53–71, 2008.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pages 641–648. ACM, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Fast pixel/part selection with sparse eigenvectors. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pages 1–8, 2007.

Y. Nesterov. Gradient methods for minimizing composite functions. Preprint, 2007.

S. Newcomb. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics, 8(4):343–366, 1886.

B. Ng and R. Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065–1071. IEEE, 2011.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000a.

M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000b.

W. Pan and X. Shen. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research, 8:1145–1164, 2007.

W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22(19):2388–2395, 2006.

K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185:71–110, 1894.

S. Perkins, K. Lacker, and J. Theiler. Grafting: fast incremental feature selection by gradient descent in function space. Journal of Machine Learning Research, 3:1333–1356, 2003.


Z. Qiao, L. Zhou, and J. Huang. Sparse linear discriminant analysis with applications to high dimensional low sample size data. International Journal of Applied Mathematics, 39(1), 2009.

A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 10(2):159–203, 1948.

S. Rosset and J. Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, 35(3):1012–1030, 2007.

V. Roth. The generalized lasso. IEEE Transactions on Neural Networks, 15(1):16–28, 2004.

V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 848–855, 2008.

V. Roth and T. Lange. Feature selection in clustering problems. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 473–480. MIT Press, 2004.

C. Sammut and G. I. Webb. Encyclopedia of Machine Learning. Springer-Verlag New York, Inc., 2010.

L. F. Sanchez Merchante, Y. Grandvalet, and G. Govaert. An efficient approach to sparse linear discriminant analysis. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

A. J. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Advances in Neural Information Processing Systems, 20:1377–1384, 2008.

S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar. Collaborative hierarchical sparse modeling. In Information Sciences and Systems (CISS), 2010 44th Annual Conference on, pages 1–6. IEEE, 2010.

M. Szafranski. Pénalités Hiérarchiques pour l'Intégration de Connaissances dans les Modèles Statistiques. PhD thesis, Université de Technologie de Compiègne, 2008.


M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. Hierarchical penalization. Advances in Neural Information Processing Systems, 2008.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

J. E. Vogt and V. Roth. The group-lasso: l1,∞ regularization versus l1,2 regularization. In Pattern Recognition, 32nd DAGM Symposium, Lecture Notes in Computer Science, 2010.

S. Wang and J. Zhu. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics, 64(2):440–448, 2008.

D. Witten and R. Tibshirani. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 73(5):753–772, 2011.

D. M. Witten and R. Tibshirani. A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726, 2010.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

M. Wu and B. Schölkopf. A local learning approach for clustering. Advances in Neural Information Processing Systems, 19:1529, 2007.

M. C. Wu, L. Zhang, Z. Wang, D. C. Christiani, and X. Lin. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145–1151, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, pages 224–244, 2008.

B. Xie, W. Pan, and X. Shen. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics, 2:168–172, 2008a.

B. Xie, W. Pan, and X. Shen. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64(3):921–930, 2008b.

C. Yang, X. Wan, Q. Yang, H. Xue, and W. Yu. Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group lasso. BMC Bioinformatics, 11(Suppl 1):S18, 2010.

J. Ye. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pages 1087–1093. ACM, 2007.


M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67, 2006.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):2541, 2007.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zhou, W. Pan, and X. Shen. Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496, 2009.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.



                Contents

7.2 Feature Selection in Model-Based Clustering
    7.2.1 Based on Penalized Likelihood
    7.2.2 Based on Model Variants
    7.2.3 Based on Model Selection

8 Theoretical Foundations
    8.1 Resolving EM with Optimal Scoring
        8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis
        8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis
        8.1.3 Clustering Using Penalized Optimal Scoring
        8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis
    8.2 Optimized Criterion
        8.2.1 A Bayesian Derivation
        8.2.2 Maximum a Posteriori Estimator

9 Mix-GLOSS Algorithm
    9.1 Mix-GLOSS
        9.1.1 Outer Loop: Whole Algorithm Repetitions
        9.1.2 Penalty Parameter Loop
        9.1.3 Inner Loop: EM Algorithm
    9.2 Model Selection

10 Experimental Results
    10.1 Tested Clustering Algorithms
    10.2 Results
    10.3 Discussion

Conclusions

Appendix

A Matrix Properties

B The Penalized-OS Problem is an Eigenvector Problem
    B.1 How to Solve the Eigenvector Decomposition
    B.2 Why the OS Problem is Solved as an Eigenvector Problem

C Solving Fisher's Discriminant Problem

D Alternative Variational Formulation for the Group-Lasso
    D.1 Useful Properties
    D.2 An Upper Bound on the Objective Function

E Invariance of the Group-Lasso to Unitary Transformations

F Expected Complete Likelihood and Likelihood

G Derivation of the M-Step Equations
    G.1 Prior probabilities
    G.2 Means
    G.3 Covariance Matrix

Bibliography

                List of Figures

1.1 MASH project logo

2.1 Example of relevant features
2.2 Four key steps of feature selection
2.3 Admissible sets in two dimensions for different pure norms ||β||p
2.4 Two dimensional regularized problems with ||β||1 and ||β||2 penalties
2.5 Admissible sets for the Lasso and Group-Lasso
2.6 Sparsity patterns for an example with 8 variables characterized by 4 parameters

4.1 Graphical representation of the variational approach to Group-Lasso

5.1 GLOSS block diagram
5.2 Graph and Laplacian matrix for a 3×3 image

6.1 TPR versus FPR for all simulations
6.2 2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3 USPS digits "1" and "0"
6.4 Discriminant direction between digits "1" and "0"
6.5 Sparse discriminant direction between digits "1" and "0"

9.1 Mix-GLOSS Loops Scheme
9.2 Mix-GLOSS model selection diagram

10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations


                List of Tables

6.1 Experimental results for simulated data, supervised classification
6.2 Average TPR and FPR for all simulations
6.3 Experimental results for gene expression data, supervised classification

10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations


                Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N            the set of natural numbers, N = {1, 2, ...}
R            the set of reals
|A|          cardinality of a set A (for finite sets, the number of elements)
Ā            complement of set A

Data

X            input domain
x_i          input sample, x_i ∈ X
X            design matrix, X = (x_1^⊤, ..., x_n^⊤)^⊤
x^j          column j of X
y_i          class indicator of sample i
Y            indicator matrix, Y = (y_1^⊤, ..., y_n^⊤)^⊤
z            complete data, z = (x, y)
G_k          set of the indices of observations belonging to class k
n            number of examples
K            number of classes
p            dimension of X
i, j, k      indices running over N

Vectors, Matrices and Norms

0            vector with all entries equal to zero
1            vector with all entries equal to one
I            identity matrix
A^⊤          transpose of matrix A (ditto for vector)
A^{-1}       inverse of matrix A
tr(A)        trace of matrix A
|A|          determinant of matrix A
diag(v)      diagonal matrix with v on the diagonal
‖v‖_1        L1 norm of vector v
‖v‖_2        L2 norm of vector v
‖A‖_F        Frobenius norm of matrix A

Probability

E[·]         expectation of a random variable
var[·]       variance of a random variable
N(μ, σ²)     normal distribution with mean μ and variance σ²
W(W, ν)      Wishart distribution with ν degrees of freedom and W scale matrix
H(X)         entropy of random variable X
I(X, Y)      mutual information between random variables X and Y

Mixture Models

y_ik         hard membership of sample i to cluster k
f_k          distribution function for cluster k
t_ik         posterior probability of sample i to belong to cluster k
T            posterior probability matrix
π_k          prior probability or mixture proportion for cluster k
μ_k          mean vector of cluster k
Σ_k          covariance matrix of cluster k
θ_k          parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)        parameter vector at iteration t of the EM algorithm
f(X; θ)      likelihood function
L(θ; X)      log-likelihood function
L_C(θ; X, Y) complete log-likelihood function

Optimization

J(·)         cost function
L(·)         Lagrangian
β̂            generic notation for the solution with respect to β
β^ls         least squares solution coefficient vector
A            active set
γ            step size to update the regularization path
h            direction to update the regularization path

Penalized models

λ, λ1, λ2    penalty parameters
P_λ(θ)       penalty term over a generic parameter vector
β_kj         coefficient j of discriminant vector k
β_k          kth discriminant vector, β_k = (β_k1, ..., β_kp)
B            matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j          jth row of B = (β^{1⊤}, ..., β^{p⊤})^⊤
B_LDA        coefficient matrix in the LDA domain
B_CCA        coefficient matrix in the CCA domain
B_OS         coefficient matrix in the OS domain
X_LDA        data matrix in the LDA domain
X_CCA        data matrix in the CCA domain
X_OS         data matrix in the OS domain
θ_k          score vector k
Θ            score matrix, Θ = (θ_1, ..., θ_{K-1})
Y            label matrix
Ω            penalty matrix
L_CP(θ; X, Z) penalized complete log-likelihood function
Σ_B          between-class covariance matrix
Σ_W          within-class covariance matrix
Σ_T          total covariance matrix
Σ̂_B          sample between-class covariance matrix
Σ̂_W          sample within-class covariance matrix
Σ̂_T          sample total covariance matrix
Λ            inverse of covariance matrix, or precision matrix
w_j          weights
τ_j          penalty components of the variational approach

                Part I

                Context and Foundations


This thesis is divided in three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic foundations are also detailed here, to introduce the models and some basic concepts that will be used along this document. The state of the art of feature selection is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to Part II, but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM algorithm. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


                1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of the website, framework and APIs.

2. Classification and goal-planning in high dimensional feature spaces.

3. Interfacing the platform with the 3D virtual environment and the robot arm.

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo


The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role has been to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community; the last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user that develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine a user who wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation (a minimal illustration of this principle is sketched after this list). For the calculation we use mixmod, a C++ library that can be interfaced with MATLAB. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D71-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis. All details concerning the tool implemented can be found in deliverable "mash-deliverable-D72-m24" (Govaert et al., 2011).

• Table Clustering Using the RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product (see the sketch after this list). The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors. A detailed description of this technique can be found in deliverables "mash-deliverable-D71-m12" (Govaert et al., 2010) and "mash-deliverable-D72-m24" (Govaert et al., 2011).
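As a minimal illustration of the principle behind the first tool (this sketch uses scikit-learn for conciseness; it is not the mixmod-based implementation of the platform, and all names are illustrative), a Gaussian mixture is fitted by EM and the clusters are obtained by maximum a posteriori assignment.

# Toy illustration of clustering with a Gaussian mixture fitted by EM,
# followed by maximum a posteriori (MAP) cluster assignment.
# Sketch with scikit-learn; not the mixmod-based tool of the project.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 5)),
               rng.normal(3, 1, size=(100, 5))])   # two artificial groups

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                         # EM estimation of proportions, means, covariances
labels = gmm.predict(X)            # MAP assignment to the most probable component
posteriors = gmm.predict_proba(X)  # posterior probabilities t_ik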
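The RV coefficient used by the last tool can be sketched as follows; here the operators are simply taken as O = XX^⊤ computed from column-centered tables, which is an illustrative choice rather than the exact operators used on the platform.

# Sketch of the RV coefficient between two feature tables X and Y describing
# the same n samples; O_X = X X^T and O_Y = Y Y^T play the role of the operators.
import numpy as np

def rv_coefficient(X, Y):
    X = X - X.mean(axis=0)            # column-center each table
    Y = Y - Y.mean(axis=0)
    Ox, Oy = X @ X.T, Y @ Y.T
    return np.trace(Ox @ Oy) / np.sqrt(np.trace(Ox @ Ox) * np.trace(Oy @ Oy))

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
B = A @ rng.normal(size=(10, 4))      # a table built from the same information
C = rng.normal(size=(50, 6))          # an unrelated table
print(rv_coefficient(A, B))           # higher value: the tables carry related information
print(rv_coefficient(A, C))           # lower value: unrelated tables
# A dissimilarity between extractors can then be defined, e.g., as 1 - RV(O_i, O_j).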

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


                2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined, while many algorithms in the field of Machine Learning make use of this statistic.
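The covariance issue is easy to reproduce: with fewer samples than features, the sample covariance matrix is rank-deficient and cannot be inverted, as the following illustrative sketch shows.

# When p > n the sample covariance matrix is rank-deficient, hence singular.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # fewer samples than features
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)        # p x p sample covariance matrix

print(np.linalg.matrix_rank(S))    # at most n - 1 = 19, far below p = 50
# np.linalg.inv(S) would fail or be meaningless; a regularized estimate
# such as S + alpha * np.eye(p) restores invertibility.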

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than a "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems, the complexity of calculus increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in the supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformation summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis and Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper, because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features, from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant, and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information, but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus, the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus, the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats, but not distinguishing between dogs and cats. Thus, the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant, so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features, then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection, according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques used to preprocess data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews to characterize feature selection techniques according to their characteristics. I am proposing a framework inspired by these references, which does not cover all the possibilities but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

– Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

– Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore, the criterion to optimize and the criterion to evaluate may be different. Those algorithms are computationally expensive.

– Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal, because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

– Complete - No subsets are missed from evaluation. Involves combinatorial searches.

– Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

– Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

– Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

– Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

– Dependency Measures - Measuring the correlation between features.

– Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

– Predictive Accuracy - Use the selected features to predict the labels.

– Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features, and can be used in wrapper and embedded models.

In this thesis, we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow to provide a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and stable. This is the case, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

\min_{\beta} J(\beta) + \lambda P(\beta)     (2.1)

\min_{\beta} J(\beta) \quad \text{s.t.} \quad P(\beta) \le t     (2.2)

In the expressions (2.1) and (2.2), the parameters λ and t have a similar function, that is, to control the trade-off between fitting the data to the model according to J(β) and the effect of the penalty P(β). The set such that the constraint in (2.2) is verified, {β : P(β) ≤ t}, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.

In this section I review pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms ||β||_p.

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)     (2.3)

for any value of t ∈ [0, 1]. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function J(β) and the penalty P(β) are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as P(β) = ||β||_p, convexity holds for p ≥ 1. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for p ≥ 1.

Figure 2.4: Two-dimensional regularized problems with ||β||_1 and ||β||_2 penalties.

Regularizing a linear model with a norm like ||β||_p means that the larger the component |β_j|, the more important the feature x_j in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of |β_j| = 0, x_j is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components (β_1 or β_2) is null, that is, if the optimal β is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where J(β) is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by P(β) (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an L1 penalty, has more chances of inducing sparse solutions than the one of an L2 penalty. That idea is displayed in Figure 2.4, where J(β) is a quadratic function represented with three isolevel curves whose global minimum βls is outside the penalties' admissible region. The closest point to this βls for the L1 regularization is βl1, and for the L2 regularization it is βl2. Solution βl1 is sparse because its second component is zero, while both components of βl2 are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an L_{1/3} penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an L1 penalty; however, the non-convex shape of the L_{1/3} ball results in difficulties during optimization that will not happen with a convex shape.

To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with Lp norms with p ≤ 1, due to the fact that they are the only ones that have vertexes. On the other side, only norms with p ≥ 1 are convex; hence, the only pure penalty that builds a convex problem with a sparse solution is the L1 penalty.

L0 Penalties. The L0 pseudo-norm of a vector β is defined as the number of entries different from zero, that is, P(β) = ||β||_0 = card{β_j | β_j ≠ 0}:

\min_{\beta} J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t     (2.4)

where the parameter t represents the maximum number of non-zero coefficients in vector β. The larger the value of t (or the lower the value of λ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector β. If t is equal to the dimensionality of the problem (or if λ = 0), then the penalty term is not effective and β is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.

L1 Penalties. The penalties built using the L1 norm induce sparsity and stability. The resulting estimator has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

\min_{\beta} J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t     (2.5)

Despite all the advantages of the Lasso, the choice of the right penalty is not only a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples n is lower than the number of variables p, the maximum number of non-zero entries of β is n. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where n is typically some tens of individuals and p several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.

The Lasso is a popular tool that has been used in multiple contexts beside regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter (t or λ, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions where Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).
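As a small illustration of the selection effect (not part of the original development), the following sketch, assuming scikit-learn is available and with arbitrary penalty values, fits a Lasso and a ridge estimator on synthetic data containing only two informative variables: the L1 penalty sets most coefficients exactly to zero, while the quadratic penalty of the next paragraphs only shrinks them.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=50)   # only 2 informative variables

lasso_coef = Lasso(alpha=0.1).fit(X, y).coef_
ridge_coef = Ridge(alpha=1.0).fit(X, y).coef_

print(np.sum(lasso_coef != 0))   # few non-zero coefficients (sparse model)
print(np.sum(ridge_coef != 0))   # all coefficients non-zero (no selection)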

L2 Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the L2 norm involves the square root of the sum of all squared components. In practice, when using L2 penalties, the square of the norm is used to avoid the square root and solve a linear system. Thus, an L2 penalized optimization problem looks like

\min_{\beta} J(\beta) + \lambda \|\beta\|_2^2     (2.6)

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2     (2.7)

with solution β_ls = (X^T X)^{-1} X^T y. If some input variables are highly correlated, the estimator β_ls is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

The solution to this problem is β_l2 = (X^T X + λI_p)^{-1} X^T y. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by λ. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" of the coefficients reduces the variability of the estimation, which may improve performances.
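The following minimal numpy sketch (synthetic data, arbitrary λ, dimensions chosen for illustration only) computes the ridge estimator β_l2 = (X^T X + λI_p)^{-1} X^T y and shows the eigenvalue shift described above:

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 1.0

# Two highly correlated columns make X^T X ill-conditioned.
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 1e-3 * rng.normal(size=n)
beta_true = np.zeros(p); beta_true[:3] = [1.0, -1.0, 2.0]
y = X @ beta_true + 0.1 * rng.normal(size=n)

G = X.T @ X
print(np.linalg.eigvalsh(G).min())                      # near-zero smallest eigenvalue
print(np.linalg.eigvalsh(G + lam * np.eye(p)).min())    # shifted upwards by lam

beta_ls = np.linalg.solve(G, X.T @ y)                   # unstable least squares estimate
beta_l2 = np.linalg.solve(G + lam * np.eye(p), X.T @ y) # ridge estimate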

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_{ls,j})^2}     (2.8)

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component: every λ_j is optimized to penalize more or less depending on the influence of β_j in the model.

Although L2 penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

L∞ Penalties. A special case of Lp norms is the infinity norm, defined as ||x||_∞ = max(|x_1|, |x_2|, ..., |x_p|). The admissible region for a penalty like ||β||_∞ ≤ t is displayed in Figure 2.3. For the L∞ norm, the greyed out region is a square containing all the β vectors whose largest coefficient is less than or equal to the value of the penalty parameter t.

This norm is not commonly used as a regularization term itself; however, it frequently appears combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing the optimality conditions of sparse regularized problems. The dual norm ||β||* of a norm ||β|| is defined as

\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1

In the case of an Lq norm with q ∈ [1, +∞], the dual norm is the Lr norm such that 1/q + 1/r = 1. For example, the L2 norm is self-dual and the dual norm of the L1 norm is the L∞ norm. This is one of the reasons why L∞ is so important even if it is not so popular as a penalty itself, because L1 is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
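As a small numerical illustration of this duality (not part of the original development), the maximizer of β^T w over the L1 unit ball is a signed canonical basis vector, and the attained value is ||β||_∞:

import numpy as np

rng = np.random.default_rng(1)
beta = rng.normal(size=5)

# Dual norm of L1 is Linf: the maximizer over ||w||_1 <= 1 is a signed basis vector.
j = np.argmax(np.abs(beta))
w_star = np.zeros_like(beta)
w_star[j] = np.sign(beta[j])

print(np.linalg.norm(w_star, 1))                     # 1.0, feasible point
print(beta @ w_star, np.linalg.norm(beta, np.inf))   # equal values

# No other feasible w does better (Hoelder's inequality):
w = rng.normal(size=(1000, 5))
w /= np.abs(w).sum(axis=1, keepdims=True)
assert np.all(w @ beta <= np.linalg.norm(beta, np.inf) + 1e-12)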

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when n ≤ p. As recalled in Section 2.3.2, when n ≤ p the Lasso penalty can select at most n non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of L1 and L2 penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

\min_{\beta} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2     (2.9)

The term in λ1 is a Lasso penalty that induces sparsity in vector β; on the other side, the term in λ2 is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotical capability (when n goes to infinity) of always making the right choice of relevant variables.


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by L different groups of genes. Let us denote by G_ℓ the group of genes for the ℓth process and by d_ℓ the number of genes (variables) in each group, for all ℓ ∈ {1, ..., L}. Thus, the dimension of vector β is the sum of the number of genes of every group: dim(β) = Σ_{ℓ=1}^{L} d_ℓ. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

\|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in G_\ell} |\beta_j|^s \Big)^{r/s} \Bigg)^{1/r}     (2.10)

The pair (r, s) identifies the norms that are combined: an Ls norm within groups and an Lr norm between groups. The Ls norm penalizes the variables in every group G_ℓ, while the Lr norm penalizes the within-group norms. The pair (r, s) is set so as to induce different properties in the resulting β vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
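A minimal numpy sketch of the mixed norm (2.10) may help fix ideas; the function name and the group encoding below are illustrative choices, not taken from any package:

import numpy as np

def mixed_norm(beta, groups, r=1.0, s=2.0):
    """Compute ||beta||_(r,s): an Ls norm within groups, an Lr norm between groups."""
    beta, groups = np.asarray(beta, float), np.asarray(groups)
    within = np.array([np.linalg.norm(beta[groups == g], ord=s)
                       for g in np.unique(groups)])
    return np.linalg.norm(within, ord=r)

# Example: 6 coefficients split into L = 3 groups of d_l = 2 variables each.
beta = np.array([0.0, 0.0, 1.0, -2.0, 0.5, 0.0])
groups = np.array([0, 0, 1, 1, 2, 2])
print(mixed_norm(beta, groups, r=1, s=2))        # group-Lasso penalty, ||beta||_(1,2)
print(mixed_norm(beta, groups, r=1, s=np.inf))   # L_(1,inf) variant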

Several combinations are available; the most popular is the norm ||β||_(1,2), known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure L1 norm and a mixed L1,2 norm. Many other mixings are possible, such as ||β||_(1,4/3) (Szafranski et al., 2008) or ||β||_(1,∞) (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities that induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models featurewise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other L1 penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the L1 norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

Figure 2.5: Admissible sets for the Lasso (a: L1) and the group-Lasso (b: L(1,2)).

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) L1-induced sparsity, (b) L(1,2) group-induced sparsity.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, L(1,2) or L(1,∞) mixed norms with the proper definition of groups can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, ∂J(β), and the subgradient of the regularizer, ∂P(β), can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector β(t+1) is updated proportionally to the negative subgradient of the function at the current point β(t):

\beta^{(t+1)} = \beta^{(t)} - \alpha (\mathbf{s} + \lambda \mathbf{s}'), \quad \text{where } \mathbf{s} \in \partial J(\beta^{(t)}), \; \mathbf{s}' \in \partial P(\beta^{(t)})

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient β_j gives

\beta_j = \frac{-\lambda\, \mathrm{sign}(\beta_j) - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2}

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution β_ls and updating their values using an iterative thresholding algorithm where β_j^{(t+1)} = S_λ(∂J(β^{(t)})/∂β_j). The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j > \lambda \\
\dfrac{-\lambda - \partial J(\beta)/\partial \beta_j}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \partial J(\beta)/\partial \beta_j < -\lambda \\
0 & \text{if } |\partial J(\beta)/\partial \beta_j| \le \lambda
\end{cases}     (2.11)

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
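To fix ideas, here is a minimal, self-contained coordinate descent sketch for the squared loss J(β) = ||y − Xβ||² with a Lasso penalty; it follows the soft-thresholding principle above, but it is only an illustration, not Fu's (1998) implementation, and the data and λ value are arbitrary.

import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                    # sum_i x_ij^2 for each column j
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_j
            z = 2 * X[:, j] @ r_j                    # negative gradient at beta_j = 0
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / (2 * col_sq[j])  # soft-threshold
    return beta

# Synthetic example: only the first two coefficients are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=20.0), 2))         # most entries shrunk exactly to zero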

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set", which stores the indices of variables with non-zero β_j; it is usually denoted A. The complement of the active set is the "inactive set", noted Ā, which contains the indices of the variables whose β_j is zero. Thus, the problem can be reduced to the dimensionality of A.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There exists also a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy that starts with an empty A has the advantage that the first calculations are low dimensional. In addition, the forward view fits better the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set; Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set A is augmented with the variable from the inactive set Ā that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions: their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector β is a solution of Problem (2.1).
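A schematic working-set loop for the Lasso, using the coordinate descent update above as the inner solver, could look like the following sketch; it is only an illustration under arbitrary settings, not the algorithm of Osborne et al. (2000a).

import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_working_set(X, y, lam, tol=1e-6, inner_iter=200):
    """Forward working-set scheme for ||y - X beta||^2 + lam ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    active = []                                    # indices of the active set A
    while True:
        r = y - X @ beta
        c = 2 * X.T @ r                            # optimality scores of all variables
        if active:
            c[active] = 0.0                        # only inspect the inactive set
        j = int(np.argmax(np.abs(c)))
        if abs(c[j]) <= lam + tol:                 # no violated optimality condition: done
            return beta
        active.append(j)                           # working set update
        for _ in range(inner_iter):                # restricted subproblem, warm-started
            for k in active:
                r_k = y - X @ beta + X[:, k] * beta[k]
                beta[k] = soft_threshold(2 * X[:, k] @ r_k, lam) / (2 * (X[:, k] ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = 1.5 * X[:, 3] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=100)
beta = lasso_working_set(X, y, lam=20.0)
print(np.flatnonzero(beta))                        # selected features, e.g. [3 7]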

These active constraints or working set methods, even if they were originally proposed to solve L1 regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and L1 penalties (Roth, 2004), linear functions and L1,2 penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of L0, L1 and L2 penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built from several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1), where the penalty parameter λ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set A(t) and its corresponding solution β(t) have been set, following the regularization path means looking for a direction h and a step size γ to update the solution as β(t+1) = β(t) + γh. Afterwards, the active and inactive sets A(t+1) and Ā(t+1) are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and which variable should enter the active set, from the correlation with the residuals.

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz-differentiable cost function J(β) and a non-differentiable penalty λP(β). They are iterative methods where the cost function J(β) is linearized in the proximity of the current solution β(t), so that the problem to solve at each iteration looks like

\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2     (2.12)

where the parameter L > 0 should be an upper bound on the Lipschitz constant of the gradient ∇J. That can be rewritten as

\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big( \beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)}) \Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta)     (2.13)

The basic algorithm uses the solution to (2.13) as the next value β(t+1). However, there are faster versions that take advantage of information about previous steps, such as the ones described by Nesterov (2007) or the FISTA algorithm (Beck and Teboulle, 2009). Proximal methods can be seen as generalizations of gradient updates: in fact, setting λ = 0 in equation (2.13) recovers the standard gradient update rule.
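A minimal sketch of the basic proximal iteration (2.13) for J(β) = ||y − Xβ||² and P(β) = ||β||₁ is given below; the proximal step then reduces to soft-thresholding, and the Lipschitz constant is taken as L = 2 λ_max(X^T X). This is only an illustration (ISTA-like), with arbitrary data and λ.

import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Basic proximal-gradient iteration for ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()     # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ beta)          # gradient of J at beta^(t)
        z = beta - grad / L                       # plain gradient step (lambda = 0 case)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)||.||_1
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)
print(np.round(ista_lasso(X, y, lam=20.0), 2))    # sparse estimate, close to the CD sketch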


                Part II

                Sparse Linear Discriminant Analysis


                Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means by L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally, through a regression approach to LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced rank classification. The first two or three directions can also be used to project the data so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities, but rather on some inertia principles (Fisher, 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and label y_i ∈ {0, 1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1^T, ..., x_n^T)^T and the corresponding labels in the n×K matrix Y = (y_1^T, ..., y_n^T)^T.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

\max_{\beta \in \mathbb{R}^p} \; \frac{\beta^\top \Sigma_B \beta}{\beta^\top \Sigma_W \beta}     (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class covariance and within-class covariance matrices, respectively defined (for a K-class problem) as

\Sigma_W = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top

\Sigma_B = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (\boldsymbol{\mu} - \boldsymbol{\mu}_k)(\boldsymbol{\mu} - \boldsymbol{\mu}_k)^\top

where μ is the sample mean of the whole dataset, μ_k the sample mean of class k, and G_k indexes the observations of class k.

This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations for the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

\max_{\mathbf{B} \in \mathbb{R}^{p \times (K-1)}} \; \frac{\operatorname{tr}\left(\mathbf{B}^\top \Sigma_B \mathbf{B}\right)}{\operatorname{tr}\left(\mathbf{B}^\top \Sigma_W \mathbf{B}\right)}     (3.2)

where the matrix B is built with the discriminant directions β_k as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B \beta_k \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1 \\
& \beta_k^\top \Sigma_W \beta_\ell = 0 \quad \forall \ell < k
\end{aligned}     (3.3)

The maximizer of subproblem k is the eigenvector of \Sigma_W^{-1} \Sigma_B associated to the kth largest eigenvalue (see Appendix C).
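As an illustration of this eigenvalue formulation, the following numpy/scipy sketch computes the K − 1 discriminant directions on synthetic Gaussian classes by solving the generalized eigenproblem Σ_B β = λ Σ_W β; the data, dimensions and class means are arbitrary.

import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, n_classes):
    """Discriminant directions as leading eigenvectors of the pencil (Sigma_B, Sigma_W)."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in range(n_classes):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sw += (Xk - mu_k).T @ (Xk - mu_k)
        Sb += len(Xk) * np.outer(mu - mu_k, mu - mu_k)
    Sw, Sb = Sw / n, Sb / n
    evals, evecs = eigh(Sb, Sw)                   # Sigma_B b = lambda Sigma_W b
    order = np.argsort(evals)[::-1]
    return evecs[:, order[: n_classes - 1]]       # the K - 1 discriminant directions

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(30, 4))
               for m in ([0, 0, 0, 0], [2, 0, 0, 0], [0, 2, 0, 0])])
y = np.repeat([0, 1, 2], 30)
B = fisher_directions(X, y, n_classes=3)          # 4 x 2 matrix of directions
print(B.shape)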

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly targets the reduction of the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness in the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevancy of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis for the sparsity-inducing extension, that is, either Fisher's Discriminant Analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically-constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al., 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \beta^\top \Sigma_W \beta \\
\text{s.t.} \;\; & (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \beta = 1 \\
& \textstyle\sum_{j=1}^{p} |\beta_j| \le t
\end{aligned}

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match Problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

\begin{aligned}
\max_{\beta_k \in \mathbb{R}^p} \;\; & \beta_k^\top \Sigma_B^k \beta_k - P_k(\beta_k) \\
\text{s.t.} \;\; & \beta_k^\top \Sigma_W \beta_k \le 1
\end{aligned}

The term to maximize is the projected between-class covariance matrix β_k^T Σ_B β_k, subject to an upper bound on the projected within-class covariance matrix β_k^T Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks to zero the less informative variables, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem. But instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W^{-1}(μ_1 − μ_2), they estimate the product directly through constrained L1 minimization:

\begin{aligned}
\min_{\beta \in \mathbb{R}^p} \;\; & \|\beta\|_1 \\
\text{s.t.} \;\; & \left\| \Sigma \beta - (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \right\|_\infty \le \lambda
\end{aligned}

Sparsity is encouraged by the L1 norm of vector β, and the parameter λ is used to tune the optimization.

Most of the algorithms reviewed are conceived for binary classification. And for those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al., 2000; Friedman et al., 2009).

                Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al., 2009). An indicator matrix Y is an n×K matrix containing the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used to link multi-class classification with linear regression (Friedman et al., 2009). Another popular choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K − 1) otherwise. It was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al., 2004) or for generalizing the kernel target alignment measure (Guermeur et al., 2004).
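For concreteness, a small sketch building both indicator matrices from integer labels is given below; the function name and the scheme flag are illustrative choices, not taken from any library.

import numpy as np

def indicator_matrix(labels, n_classes, scheme="dummy"):
    """Build an n x K class indicator matrix from integer labels in {0, ..., K-1}."""
    n = len(labels)
    if scheme == "dummy":                          # y_ik = 1 if class k, else 0
        Y = np.zeros((n, n_classes))
        Y[np.arange(n), labels] = 1.0
    else:                                          # y_ik = 1 if class k, else -1/(K-1)
        Y = np.full((n, n_classes), -1.0 / (n_classes - 1))
        Y[np.arange(n), labels] = 1.0
    return Y

labels = np.array([0, 2, 1, 1])
print(indicator_matrix(labels, 3))
print(indicator_matrix(labels, 3, scheme="pm"))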

Some efforts propose a formulation of the least squares problem based on a new class indicator matrix (Ye, 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition, shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also restrained an empirical test of this conjecture. If the similitude is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

\min_{\beta \in \mathbb{R}^p, \beta_0 \in \mathbb{R}} \; n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

where y_i is the binary indicator of the label for pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule x^T β + β_0 > 0 is the LDA classifier when it is built using the β vector resulting from λ = 0, but a different intercept β_0 is required.

                Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al., 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani, 1996) or more conservative ones (Hastie et al., 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

\min_{\Theta, \mathbf{B}} \; \|\mathbf{Y}\Theta - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \operatorname{tr}\left(\mathbf{B}^\top \Omega \mathbf{B}\right)     (3.4a)

\text{s.t.} \quad n^{-1} \Theta^\top \mathbf{Y}^\top \mathbf{Y} \Theta = \mathbf{I}_{K-1}     (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ||·||_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

\min_{\theta_k \in \mathbb{R}^K, \beta_k \in \mathbb{R}^p} \; \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|^2 + \beta_k^\top \Omega \beta_k     (3.5a)

\text{s.t.} \quad n^{-1} \theta_k^\top \mathbf{Y}^\top \mathbf{Y} \theta_k = 1     (3.5b)

\quad\quad \theta_k^\top \mathbf{Y}^\top \mathbf{Y} \theta_\ell = 0, \quad \ell = 1, \ldots, k - 1     (3.5c)

where each β_k corresponds to a discriminant direction.

Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan, 2005; Leng, 2008; Grosenick et al., 2008; Clemmensen et al., 2011). Grosenick et al. (2008) proposed a variant of the lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

\min_{\beta_k \in \mathbb{R}^p, \theta_k \in \mathbb{R}^K} \; \sum_{k} \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda_1 \|\beta_k\|_1 + \lambda_2 \beta_k^\top \Omega \beta_k

where λ1 and λ2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-lasso penalty in the objective function (3.5a):

\min_{\beta_k \in \mathbb{R}^p, \theta_k \in \mathbb{R}^K} \; \sum_{k=1}^{K-1} \|\mathbf{Y}\theta_k - \mathbf{X}\beta_k\|_2^2 + \lambda \Bigg( \sum_{j=1}^{p} \sqrt{\sum_{k=1}^{K-1} \beta_{kj}^2} \Bigg)^2     (3.6)

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.


                4 Formalizing the Objective

In this chapter we detail the rationale supporting the group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina, 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K − 1, or partial for a reduced rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka, 1984; Hastie et al., 1994; Hastie and Tibshirani, 1996; Hastie et al., 1995) and already used before for sparsity-inducing penalties (Roth and Lange, 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by group-Lasso and penalized LDA (Sanchez Merchante et al., 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter we assume that:

• there is no empty class, that is, the diagonal matrix Y^T Y is full rank;

• inputs are centered, that is, X^T 1_n = 0;

• the quadratic penalty Ω is positive-semidefinite and such that X^T X + Ω is full rank.

                bull the quadratic penalty Ω is positive-semidefinite and such that XgtX + Ω is fullrank


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex; in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraint (3.5c) inherently limits the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1 (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention anymore these orthogonality constraints (3.5c) that apply along the route, so as to simplify all expressions. The generic problem solved is thus

\min_{\theta \in \mathbb{R}^K, \beta \in \mathbb{R}^p} \; \|\mathbf{Y}\theta - \mathbf{X}\beta\|^2 + \beta^\top \Omega \beta     (4.1a)

\text{s.t.} \quad n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta = 1     (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

\beta_{os} = \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y} \theta     (4.2)

The objective function (4.1a) is then

\begin{aligned}
\|\mathbf{Y}\theta - \mathbf{X}\beta_{os}\|^2 + \beta_{os}^\top \Omega \beta_{os}
&= \theta^\top \mathbf{Y}^\top \mathbf{Y}\theta - 2\,\theta^\top \mathbf{Y}^\top \mathbf{X}\beta_{os} + \beta_{os}^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta_{os} \\
&= \theta^\top \mathbf{Y}^\top \mathbf{Y}\theta - \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta
\end{aligned}

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

\max_{\theta:\, n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{Y}\theta = 1} \; \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta     (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of Y^T X(X^T X + Ω)^{-1}X^T Y. Indeed, Appendix C details that Problem (4.3) is solved by

\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1} \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta = \alpha^2 \theta     (4.4)

where α² is the maximal eigenvalue:¹

\begin{aligned}
n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta &= \alpha^2\, n^{-1} \theta^\top \left(\mathbf{Y}^\top \mathbf{Y}\right) \theta \\
n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta &= \alpha^2
\end{aligned}     (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

\max_{\theta \in \mathbb{R}^K, \beta \in \mathbb{R}^p} \; n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{X} \beta     (4.6a)

\text{s.t.} \quad n^{-1} \theta^\top \mathbf{Y}^\top \mathbf{Y} \theta = 1     (4.6b)

\quad\quad n^{-1} \beta^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta = 1     (4.6c)

The solutions to (4.6) are obtained by finding saddle points of the Lagrangian:

\begin{aligned}
n L(\beta, \theta, \nu, \gamma) &= \theta^\top \mathbf{Y}^\top \mathbf{X}\beta - \nu\left(\theta^\top \mathbf{Y}^\top \mathbf{Y}\theta - n\right) - \gamma\left(\beta^\top (\mathbf{X}^\top \mathbf{X} + \Omega)\beta - n\right) \\
\Rightarrow \; n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \beta} &= \mathbf{X}^\top \mathbf{Y}\theta - 2\gamma (\mathbf{X}^\top \mathbf{X} + \Omega)\beta \\
\Rightarrow \; \beta_{cca} &= \frac{1}{2\gamma} (\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta
\end{aligned}

Then, as β_cca obeys (4.6c), we obtain

\beta_{cca} = \frac{(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}{\sqrt{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}}     (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}\beta_{cca} = \frac{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}{\sqrt{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}} = \sqrt{n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1} \mathbf{X}^\top \mathbf{Y}\theta}

and the optimization problem with respect to θ can be restated as

\max_{\theta:\, n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{Y}\theta = 1} \; \theta^\top \mathbf{Y}^\top \mathbf{X}\left(\mathbf{X}^\top \mathbf{X} + \Omega\right)^{-1} \mathbf{X}^\top \mathbf{Y}\theta     (4.8)

Hence, the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

\beta_{os} = \alpha\, \beta_{cca}     (4.9)

¹ The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).

where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

\begin{aligned}
n \frac{\partial L(\beta, \theta, \gamma, \nu)}{\partial \theta} &= \mathbf{Y}^\top \mathbf{X}\beta - 2\nu \mathbf{Y}^\top \mathbf{Y}\theta \\
\Rightarrow \; \theta_{cca} &= \frac{1}{2\nu} (\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta
\end{aligned}     (4.10)

Then, as θ_cca obeys (4.6b), we obtain

\theta_{cca} = \frac{(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}{\sqrt{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}}     (4.11)

leading to the following expression of the optimal objective function:

n^{-1}\theta_{cca}^\top \mathbf{Y}^\top \mathbf{X}\beta = \frac{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}{\sqrt{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}} = \sqrt{n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta}

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

\max_{\beta \in \mathbb{R}^p} \; n^{-1}\beta^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta     (4.12a)

\text{s.t.} \quad n^{-1} \beta^\top \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta = 1     (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important since it will be used to link p-CCA to p-LDA. We thus derive its solution: following the reasoning of Appendix C, β_cca verifies

n^{-1}\mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta_{cca} = \lambda \left(\mathbf{X}^\top \mathbf{X} + \Omega\right) \beta_{cca}     (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

\begin{aligned}
& n^{-1}\beta_{cca}^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}\beta_{cca} = \lambda \\
\Rightarrow \; & n^{-1}\alpha^{-1}\beta_{cca}^\top \mathbf{X}^\top \mathbf{Y}(\mathbf{Y}^\top \mathbf{Y})^{-1} \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1}\mathbf{X}^\top \mathbf{Y}\theta = \lambda \\
\Rightarrow \; & n^{-1}\alpha\, \beta_{cca}^\top \mathbf{X}^\top \mathbf{Y}\theta = \lambda \\
\Rightarrow \; & n^{-1}\theta^\top \mathbf{Y}^\top \mathbf{X}(\mathbf{X}^\top \mathbf{X} + \Omega)^{-1}\mathbf{X}^\top \mathbf{Y}\theta = \lambda \\
\Rightarrow \; & \alpha^2 = \lambda
\end{aligned}

The first line is obtained by obeying constraint (4.12b), the second line by the relationship (4.7) whose denominator is α, the third line comes from (4.4), the fourth line uses again the relationship (4.7), and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis is defined as follows:

\max_{\beta \in \mathbb{R}^p} \; \beta^\top \Sigma_B \beta     (4.14a)

\text{s.t.} \quad \beta^\top \left(\Sigma_W + n^{-1}\Omega\right) \beta = 1     (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation, using the projection operator Y(Y^T Y)^{-1}Y^T:

\begin{aligned}
\Sigma_T &= \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top = n^{-1}\mathbf{X}^\top \mathbf{X} \\
\Sigma_B &= \frac{1}{n} \sum_{k=1}^{K} n_k\, \boldsymbol{\mu}_k \boldsymbol{\mu}_k^\top = n^{-1}\mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X} \\
\Sigma_W &= \frac{1}{n} \sum_{k=1}^{K} \sum_{i:\, y_{ik}=1} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top = n^{-1}\left(\mathbf{X}^\top \mathbf{X} - \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\right)
\end{aligned}

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

\begin{aligned}
\mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta_{lda} &= \lambda \left(\mathbf{X}^\top \mathbf{X} + \Omega - \mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\right)\beta_{lda} \\
\mathbf{X}^\top \mathbf{Y}\left(\mathbf{Y}^\top \mathbf{Y}\right)^{-1}\mathbf{Y}^\top \mathbf{X}\beta_{lda} &= \frac{\lambda}{1-\lambda} \left(\mathbf{X}^\top \mathbf{X} + \Omega\right)\beta_{lda}
\end{aligned}

The comparison of the last equation with the equation verified by β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

\begin{aligned}
\beta_{lda} &= (1 - \alpha^2)^{-1/2}\, \beta_{cca} \\
&= \alpha^{-1}(1 - \alpha^2)^{-1/2}\, \beta_{os}
\end{aligned}

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:
\[
\min_{\Theta,\, B} \; \left\| Y\Theta - XB \right\|_F^2 + \lambda\, \mathrm{tr}\!\left( B^\top \Omega B \right)
\quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} .
\]
Let A represent the (K−1) × (K−1) diagonal matrix with elements α_k defined as the square roots of the K−1 leading eigenvalues of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y; we have
\[
B_{\mathrm{LDA}} = B_{\mathrm{CCA}} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}}
= B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} , \tag{4.15}
\]
where I_{K−1} is the (K−1) × (K−1) identity matrix.

At this point, the feature matrix X, which in the input space has dimensions n × p, can be projected into the optimal scoring domain as an n × (K−1) matrix X_OS = XB_OS, or into the linear discriminant analysis space as an n × (K−1) matrix X_LDA = XB_LDA. Classification can be performed in either of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows (a short sketch of steps 2 and 3 is given after the list):

1. Solve the p-OS problem as
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta ,
\]
where Θ are the K−1 leading eigenvectors of Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤Y.

2. Translate the data samples X into the LDA domain as X_LDA = XB_OS D, where D = A^{-1}(I_{K−1} − A^2)^{-1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distance d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Optionally, produce a graphical representation of the data.
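As a minimal MATLAB sketch of steps 2 and 3 (our own code, assuming that B_OS and the vector alpha of the α_k values are available from step 1):

    % Step 2: map the samples into the LDA domain.
    D = diag(1 ./ (alpha .* sqrt(1 - alpha.^2)));  % D = A^{-1} (I - A^2)^{-1/2}
    XLDA = X * BOS * D;                            % n x (K-1) discriminant variates
    % Step 3: class centroids in the LDA domain (Y is the n x K indicator matrix).
    M = (Y' * Y) \ (Y' * XLDA);                    % K x (K-1) matrix of centroids mu_k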


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:
\[
\min_{\Theta \in \mathbb{R}^{K \times K-1},\, B \in \mathbb{R}^{p \times K-1}} \; \left\| Y\Theta - XB \right\|_F^2 + \lambda\, \mathrm{tr}\!\left( B^\top \Omega B \right) \tag{4.16a}
\]
\[
\text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1} , \tag{4.16b}
\]
where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al., 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ^0 such that n^{-1} Θ^{0⊤}Y^⊤YΘ^0 = I_{K−1}.

2. Compute B = (X^⊤X + λΩ)^{-1} X^⊤YΘ^0.

3. Set Θ to be the K−1 leading eigenvectors of Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤Y.

4. Compute the optimal regression coefficients
\[
B_{\mathrm{OS}} = \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y \Theta . \tag{4.17}
\]

Defining Θ^0 in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ^{0⊤}Y^⊤X(X^⊤X + λΩ)^{-1}X^⊤YΘ^0, which is computed as Θ^{0⊤}Y^⊤XB, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.
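The four steps translate into a few lines of MATLAB. The sketch below is ours (not the package implementation); it assumes a centered X (n × p), an indicator matrix Y (n × K), a penalty matrix Omega and a parameter lambda, and the √n factor in Θ^0 enforces the normalization n^{-1}Θ^{0⊤}Y^⊤YΘ^0 = I_{K−1}.

    % Step 1: initial scores Theta0 satisfying the normalization constraint.
    n = size(X, 1);  K = size(Y, 2);
    U = null(ones(1, K));                                 % K x (K-1), orthogonal to 1_K
    Theta0 = sqrt(n) * diag(1 ./ sqrt(diag(Y' * Y))) * U;
    % Step 2: regression coefficients for the initial scores.
    B0 = (X' * X + lambda * Omega) \ (X' * (Y * Theta0));
    % Step 3: eigen-analysis of the small (K-1) x (K-1) matrix Theta0'*Y'*X*B0.
    [V, S] = eig(Theta0' * (Y' * (X * B0)));
    [~, order] = sort(diag(S), 'descend');
    V = V(:, order);
    Theta = Theta0 * V;
    % Step 4: optimal regression coefficients (4.17), recovered as B0 * V.
    BOS = B0 * V;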

This four-step algorithm is valid when the penalty is of the form B^⊤ΩB. However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an Elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and Elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where the sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general the parameters of the model are unknown, and the rule is applied with the parameters estimated from the training data (sample estimators of μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to the class k for which the distance
\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_{W\Omega}^{-1} (x_i - \mu_k) - 2 \log\!\left( \frac{n_k}{n} \right) \tag{4.18}
\]
is minimized. In expression (4.18), the first term is the Mahalanobis distance in the input space, and the second term is an adjustment for unequal class sizes that estimates the prior probability of class k. Note that this adjustment is inspired by the Gaussian view of LDA and that another definition could be used (Friedman et al., 2009; Mai et al., 2012). The matrix Σ_{WΩ} used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:
\[
\Sigma_{W\Omega}^{-1}
= \left( n^{-1}(X^\top X + \lambda\Omega) - \Sigma_B \right)^{-1}
= \left( n^{-1} X^\top X - \Sigma_B + n^{-1}\lambda\Omega \right)^{-1}
= \left( \Sigma_W + n^{-1}\lambda\Omega \right)^{-1} . \tag{4.19}
\]
Before explaining how to compute the distances, let us summarize some clarifying points:

- The solution B_OS of the p-OS problem is enough to accomplish classification.
- In the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances.
- Classification can be done in a reduced-rank space of dimension R < K−1 by using the first R discriminant directions {β_k}_{k=1}^{R}.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, the distance is
\[
\left\| (x_i - \mu_k)\, B_{\mathrm{OS}} \right\|_{\Sigma_{W\Omega}}^2 - 2 \log(\pi_k) ,
\]
where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, the distance is
\[
\left\| (x_i - \mu_k)\, B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 - 2 \log(\pi_k) ,
\]
which is a plain Euclidean distance.
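A short MATLAB sketch of the nearest-centroid rule in the LDA domain (plain Euclidean distance plus the class-prior adjustment); the code is ours, and XLDA and the centroid matrix M are assumed to be computed as in the sketch of Section 4.1.4.

    % Squared Euclidean distances to the centroids, adjusted for class priors.
    nk = sum(Y, 1)';                        % class counts
    prior = nk / sum(nk);                   % estimated priors pi_k
    d = zeros(size(XLDA, 1), numel(prior));
    for k = 1:numel(prior)
        diff = XLDA - repmat(M(k, :), size(XLDA, 1), 1);
        d(:, k) = sum(diff.^2, 2) - 2 * log(prior(k));
    end
    [~, yhat] = min(d, [], 2);              % maximum a posteriori assignment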


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as
\[
\hat{p}(y_k = 1 \,|\, x)
\propto \exp\!\left( -\frac{d(x, \mu_k)}{2} \right)
\propto \pi_k \exp\!\left( -\frac{1}{2} \left\| (x - \mu_k)\, B_{\mathrm{OS}}\, A^{-1} \left( I_{K-1} - A^2 \right)^{-\frac{1}{2}} \right\|_2^2 \right) . \tag{4.20}
\]
These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:
\[
\hat{p}(y_k = 1 \,|\, x)
= \frac{\pi_k \exp\!\left( -\frac{d(x,\mu_k)}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( -\frac{d(x,\mu_\ell)}{2} \right)}
= \frac{\pi_k \exp\!\left( \frac{-d(x,\mu_k) + d_{\max}}{2} \right)}{\sum_\ell \pi_\ell \exp\!\left( \frac{-d(x,\mu_\ell) + d_{\max}}{2} \right)} ,
\]
where d_max = max_k d(x, μ_k).
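The shift by d_max is a constant inside the exponentials, so it cancels out after normalization. A MATLAB sketch of this computation (our own, reusing the distance matrix d and the priors from the previous sketch):

    % Posterior probabilities with the d_max shift described in the text.
    nK = size(d, 2);
    dmax = max(d, [], 2);
    num = repmat(prior', size(d, 1), 1) .* exp(-(d - repmat(dmax, 1, nK)) / 2);
    post = num ./ repmat(sum(num, 2), 1, nK);   % each row sums to one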

4.2.4 Graphical Representation

Sometimes it is useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but it can suffice to inspect the data. This is accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether the data set is represented in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form β^⊤Ωβ, under the assumption that Y^⊤Y and X^⊤X + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties, but as recalled in Section 2.3, they do not induce sparsity. In this respect L1 penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeroes in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. We therefore intend to show that our formulation of the group-Lasso can be written in the quadratic form B^⊤ΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso were proposed shortly after the original Lasso paper (Tibshirani, 1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet, 1998; Canu and Grandvalet, 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al., 2012).

Our formulation of the group-Lasso is shown below:
\[
\min_{\tau \in \mathbb{R}^p} \; \min_{B \in \mathbb{R}^{p \times K-1}} \; J(B) + \lambda \sum_{j=1}^{p} w_j^2\, \frac{\left\| \beta^j \right\|_2^2}{\tau_j} \tag{4.21a}
\]
\[
\text{s.t.} \quad \sum_j \tau_j - \sum_j w_j \left\| \beta^j \right\|_2 \le 0 , \tag{4.21b}
\]
\[
\phantom{\text{s.t.}} \quad \tau_j \ge 0 ,\; j = 1, \dots, p , \tag{4.21c}
\]
where B ∈ R^{p×(K−1)} is the matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^{1⊤}, …, β^{p⊤})^⊤, and the w_j are predefined nonnegative weights. The cost function J(B) in our context is the OS regression loss ‖YΘ − XB‖²_F; for simplicity, we keep the generic notation J(B) from now on. Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet, 1999; Bach et al., 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic expression as the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation with the standard group-Lasso (an alternative variational formulation is detailed and analyzed in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ ∑_{j=1}^{p} w_j ‖β^j‖_2.

Proof. The Lagrangian of Problem (4.21) is
\[
L = J(B) + \lambda \sum_{j=1}^{p} w_j^2\, \frac{\left\| \beta^j \right\|_2^2}{\tau_j}
+ \nu_0 \left( \sum_{j=1}^{p} \tau_j - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 \right)
- \sum_{j=1}^{p} \nu_j \tau_j .
\]

[Figure 4.1: Graphical representation of the variational approach to the group-Lasso.]

Thus, the first-order optimality conditions for τ_j are
\[
\frac{\partial L}{\partial \tau_j}(\tau_j^\star) = 0
\;\Leftrightarrow\; -\lambda w_j^2 \frac{\left\| \beta^j \right\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Leftrightarrow\; -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} - \nu_j\, \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \left\| \beta^j \right\|_2^2 + \nu_0\, \tau_j^{\star 2} = 0 .
\]
The last line is obtained from complementary slackness, which implies here ν_j τ_j^⋆ = 0. Complementary slackness states that ν_j g_j(τ_j^⋆) = 0, where ν_j is the Lagrange multiplier for constraint g_j(τ_j) ≤ 0. As a result, the optimal value of τ_j is
\[
\tau_j^\star = \sqrt{\frac{\lambda w_j^2 \left\| \beta^j \right\|_2^2}{\nu_0}} = \sqrt{\frac{\lambda}{\nu_0}}\, w_j \left\| \beta^j \right\|_2 . \tag{4.22}
\]
We note that ν_0 ≠ 0 if there is at least one coefficient β_{jk} ≠ 0; thus the inequality constraint (4.21b) is at bound (due to complementary slackness):
\[
\sum_{j=1}^{p} \tau_j^\star - \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 = 0 , \tag{4.23}
\]
so that τ_j^⋆ = w_j ‖β^j‖_2. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator
\[
\min_{B \in \mathbb{R}^{p \times M}} \; J(B) + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 . \tag{4.24}
\]

We have thus presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation.


With Lemma 4.1 we have proved that, under constraints (4.21b)–(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can conveniently be presented as λ tr(B^⊤ΩB), where
\[
\Omega = \mathrm{diag}\!\left( \frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \dots, \frac{w_p^2}{\tau_p} \right) , \tag{4.25}
\]
with τ_j^⋆ = w_j ‖β^j‖_2, resulting in the diagonal components
\[
(\Omega)_{jj} = \frac{w_j}{\left\| \beta^j \right\|_2} . \tag{4.26}
\]
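A small MATLAB sketch of this adaptive penalty (our notation), which also illustrates Lemma 4.1: when all rows of B are non-zero, the quadratic penalty λ tr(B^⊤ΩB) coincides with the group-Lasso penalty λ Σ_j w_j‖β^j‖_2.

    % Adaptive quadratic penalty (4.25)-(4.26) from B (p x K-1) and weights w (p x 1).
    rownorms = sqrt(sum(B.^2, 2));       % ||beta^j||_2, row by row
    Omega = diag(w ./ rownorms);         % (Omega)_jj = w_j / ||beta^j||_2 (Inf on zero rows)
    % Sanity check of Lemma 4.1 (valid when all rows of B are non-zero):
    quadPen  = lambda * trace(B' * Omega * B);
    groupPen = lambda * sum(w .* rownorms);   % the two values coincide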

As stated at the beginning of this section, the equivalence between p-LDA and p-OS problems thus extends to the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set strategy described in Section 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖²₂/τ, known as the perspective function of f(β) = ‖β‖²₂, is convex in (β, τ) (see e.g. Boyd and Vandenberghe, 2004, Chapter 3), and the constraints (4.21b)–(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ).

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is
\[
\left\{ V \in \mathbb{R}^{p \times K-1} : V = \frac{\partial J(B)}{\partial B} + \lambda G \right\} , \tag{4.27}
\]
where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^{1⊤}, …, g^{p⊤})^⊤, defined as follows. Let S(B) denote the support of B, S(B) = {j ∈ 1, …, p : ‖β^j‖_2 ≠ 0}; then we have
\[
\forall j \in S(B) ,\quad g^j = w_j \left\| \beta^j \right\|_2^{-1} \beta^j , \tag{4.28}
\]
\[
\forall j \notin S(B) ,\quad \left\| g^j \right\|_2 \le w_j . \tag{4.29}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖_2 ≠ 0, the gradient of the penalty with respect to β^j is
\[
\frac{\partial}{\partial \beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right) = \lambda\, w_j \frac{\beta^j}{\left\| \beta^j \right\|_2} . \tag{4.30}
\]
At ‖β^j‖_2 = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al., 2011):
\[
\partial_{\beta^j} \left( \lambda \sum_{m=1}^{p} w_m \left\| \beta^m \right\|_2 \right)
= \partial_{\beta^j} \left( \lambda w_j \left\| \beta^j \right\|_2 \right)
= \left\{ \lambda w_j v \in \mathbb{R}^{K-1} : \left\| v \right\|_2 \le 1 \right\} , \tag{4.31}
\]
which gives the expression (4.29).

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:
\[
\forall j \in S ,\quad \frac{\partial J(B)}{\partial \beta^j} + \lambda w_j \left\| \beta^j \right\|_2^{-1} \beta^j = 0 , \tag{4.32a}
\]
\[
\forall j \notin S ,\quad \left\| \frac{\partial J(B)}{\partial \beta^j} \right\|_2 \le \lambda w_j , \tag{4.32b}
\]
where S ⊆ {1, …, p} denotes the set of non-zero row vectors β^j and S̄ its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily obtained from a direct analysis of the variational problem (4.21).

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem
\[
B_{\mathrm{OS}} = \mathop{\mathrm{argmin}}_{B \in \mathbb{R}^{p \times K-1}}\; \min_{\Theta \in \mathbb{R}^{K \times K-1}} \; \frac{1}{2} \left\| Y\Theta - XB \right\|_F^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2
\quad \text{s.t.} \quad n^{-1}\, \Theta^\top Y^\top Y \Theta = I_{K-1}
\]
is equivalent to the penalized LDA problem
\[
B_{\mathrm{LDA}} = \mathop{\mathrm{argmax}}_{B \in \mathbb{R}^{p \times K-1}} \; \mathrm{tr}\!\left( B^\top \Sigma_B B \right)
\quad \text{s.t.} \quad B^\top \left( \Sigma_W + n^{-1}\lambda\Omega \right) B = I_{K-1} ,
\]
where Ω = diag(w_1²/τ_1, …, w_p²/τ_p), with
\[
\Omega_{jj} =
\begin{cases}
+\infty & \text{if } \beta^j_{\mathrm{os}} = 0 , \\[2pt]
w_j \left\| \beta^j_{\mathrm{os}} \right\|_2^{-1} & \text{otherwise.}
\end{cases} \tag{4.33}
\]
That is, B_LDA = B_OS diag(α_k^{-1}(1 − α_k²)^{-1/2}), where α_k ∈ (0, 1) is the kth leading eigenvalue of
\[
n^{-1}\, Y^\top X \left( X^\top X + \lambda\Omega \right)^{-1} X^\top Y .
\]

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso.

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al., 2008; Clemmensen et al., 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold any more, since the Lasso penalty does not result in an equivalent quadratic penalty of the simple form tr(B^⊤ΩB).


                5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al., 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer, 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = ½‖YΘ − XB‖²_F.

The algorithm belongs to the working-set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented as a block diagram in Figure 5.1, and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. To use the alternative variational approach from Appendix D, Equations (4.21), (4.32a) and (4.32b) must be replaced by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving (K−1) independent card(A)-dimensional problems instead of a single (K−1) × card(A)-dimensional problem. The interaction between the (K−1) problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive, as we then solve (K−1) similar systems
\[
\left( X_{\mathcal{A}}^\top X_{\mathcal{A}} + \lambda\Omega \right) \beta_k = X_{\mathcal{A}}^\top Y \theta_k^0 , \tag{5.1}
\]


[Figure 5.1: GLOSS block diagram, depicting the active-set loop: initialize the model (λ, B); form the active set {j : ‖β^j‖_2 > 0}; solve the p-OS problem so that B fulfils the first optimality condition; move variables that violate the conditions between the active and inactive sets; when no further move is needed, compute Θ, update B and stop.]


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← {j ∈ {1, …, p} : ‖β^j‖_2 > 0};  Θ^0 such that n^{-1} Θ^{0⊤}Y^⊤YΘ^0 = I_{K−1};  convergence ← false
repeat
    % Step 1: solve (4.21) in B, assuming A optimal
    repeat
        Ω ← diag(Ω_A), with ω_j ← ‖β^j‖_2^{-1}
        B_A ← (X_A^⊤X_A + λΩ)^{-1} X_A^⊤YΘ^0
    until condition (4.32a) holds for all j ∈ A
    % Step 2: identify inactivated variables
    for j ∈ A such that ‖β^j‖_2 = 0 do
        if optimality condition (4.32b) holds then
            A ← A \ {j};  go back to Step 1
        end if
    end for
    % Step 3: check the greatest violation of optimality condition (4.32b) in Ā
    j⋆ ← argmax_{j ∉ A} ‖∂J/∂β^j‖_2
    if ‖∂J/∂β^{j⋆}‖_2 < λ then
        convergence ← true    (B is optimal)
    else
        A ← A ∪ {j⋆}
    end if
until convergence
(s, V) ← eigenanalyze(Θ^{0⊤}Y^⊤X_A B), that is, Θ^{0⊤}Y^⊤X_A B V_k = s_k V_k, k = 1, …, K−1
Θ ← Θ^0 V;  B ← BV;  α_k ← n^{-1/2} s_k^{1/2}, k = 1, …, K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth column of B and Θ^0, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all systems, whereas a blockwise Newton-Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky decomposition

Dropping the subscripts and considering the (K−1) systems together, (5.1) leads to
\[
\left( X^\top X + \lambda\Omega \right) B = X^\top Y \Theta . \tag{5.2}
\]
Defining the Cholesky decomposition as C^⊤C = (X^⊤X + λΩ), (5.2) is solved efficiently as follows:
\[
\begin{aligned}
C^\top C\, B &= X^\top Y \Theta \\
C\, B &= C^\top \backslash\, X^\top Y \Theta \\
B &= C \backslash\, C^\top \backslash\, X^\top Y \Theta , 
\end{aligned} \tag{5.3}
\]
where the symbol "\" is the MATLAB mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
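A minimal MATLAB sketch of (5.2)–(5.3) (our own illustration, not the package code), assuming X restricted to the active set, the current Ω, the scores Θ and the parameter lambda; chol returns the upper-triangular factor C with C^⊤C = X^⊤X + λΩ.

    % Solve (X'X + lambda*Omega) B = X'Y*Theta with a single Cholesky factorization.
    C = chol(X' * X + lambda * Omega);   % upper triangular, C'*C = X'X + lambda*Omega
    RHS = X' * (Y * Theta);              % the K-1 right-hand sides share the same matrix
    B = C \ (C' \ RHS);                  % two triangular solves, as in (5.3)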

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^⊤X + λΩ. This difficulty can be avoided using the following equivalent expression:
\[
B = \Omega^{-1/2} \left( \Omega^{-1/2} X^\top X\, \Omega^{-1/2} + \lambda I \right)^{-1} \Omega^{-1/2} X^\top Y \Theta^0 , \tag{5.4}
\]
where the conditioning of Ω^{-1/2}X^⊤XΩ^{-1/2} + λI is always well-behaved, provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This stabler expression demands more computation and is thus reserved for cases with large ω_j values; our code is otherwise based on expression (5.2).
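A sketch of the stabler expression (5.4) in MATLAB (our variable names), to be used only when some ω_j become very large:

    % Stable computation of B through the rescaled system (5.4).
    p = size(X, 2);
    Oisqrt = diag(1 ./ sqrt(diag(Omega)));             % Omega^{-1/2}
    M = Oisqrt * (X' * X) * Oisqrt + lambda * eye(p);  % well-conditioned matrix
    B = Oisqrt * (M \ (Oisqrt * (X' * (Y * Theta0))));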

5.2 Score Matrix

The optimal score matrix Θ is made of the K−1 leading eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y. This eigen-analysis is actually solved in the form Θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤YΘ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^⊤X + Ω)^{-1}, which involves the inversion of a p × p matrix. Let Θ^0 be an arbitrary K × (K−1) matrix whose range includes the K−1 leading eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.¹ Then, solving the K−1 systems (5.3) provides the value of B^0 = (X^⊤X + λΩ)^{-1}X^⊤YΘ^0. This B^0 matrix can be identified in the expression to eigenanalyze, as
\[
\Theta^{0\top} Y^\top X \left( X^\top X + \Omega \right)^{-1} X^\top Y \Theta^0 = \Theta^{0\top} Y^\top X B^0 .
\]
Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K−1) × (K−1) matrix Θ^{0⊤}Y^⊤XB^0 = VΛV^⊤. Defining Θ = Θ^0V, we have Θ^⊤Y^⊤X(X^⊤X + Ω)^{-1}X^⊤YΘ = Λ, and when Θ^0 is chosen such that n^{-1}Θ^{0⊤}Y^⊤YΘ^0 = I_{K−1}, we also have that n^{-1}Θ^⊤Y^⊤YΘ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, Θ is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ^0 to Θ, that is, B = B^0V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4. Optimality conditions (4.32a) and (4.32b) can be deduced from those lemmas. Both expressions require the computation of the gradient of the objective function
\[
\frac{1}{2} \left\| Y\Theta - XB \right\|_F^2 + \lambda \sum_{j=1}^{p} w_j \left\| \beta^j \right\|_2 . \tag{5.5}
\]
Let J(B) be the data-fitting term ½‖YΘ − XB‖²_F. Its gradient with respect to the jth row of B, β^j, is the (K−1)-dimensional vector
\[
\frac{\partial J(B)}{\partial \beta^j} = x_j^\top (XB - Y\Theta) ,
\]
where x_j is the jth column of X. Hence, the first optimality condition (4.32a) can be computed for every variable j as
\[
x_j^\top (XB - Y\Theta) + \lambda w_j \frac{\beta^j}{\left\| \beta^j \right\|_2} .
\]

¹ As X is centered, 1_K belongs to the null space of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y. It is thus sufficient to choose Θ^0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set Θ^0 = (Y^⊤Y)^{-1/2}U, where U is a K × (K−1) matrix whose columns are orthonormal vectors orthogonal to 1_K.


The second optimality condition (4.32b) can be computed for every variable j as
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .
\]
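Both conditions can be checked for all variables at once. The MATLAB sketch below is ours and assumes the current B, Θ, the weights w (p × 1) and the parameter lambda.

    % Gradients of the data-fitting term and optimality checks (4.32a)-(4.32b).
    G = X' * (X * B - Y * Theta);              % p x (K-1), row j is dJ/d(beta^j)
    rownorms = sqrt(sum(B.^2, 2));
    active = rownorms > 0;
    % (4.32a): should be (numerically) zero for every active variable.
    resActive = G(active, :) + lambda * diag(w(active) ./ rownorms(active)) * B(active, :);
    % (4.32b): should be nonpositive for every inactive variable.
    slack = sqrt(sum(G(~active, :).^2, 2)) - lambda * w(~active);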

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, by choosing the one that is expected to produce the greatest decrease in the objective function:
\[
j^\star = \mathop{\mathrm{argmax}}_{j} \; \max\!\left( \left\| x_j^\top (XB - Y\Theta) \right\|_2 - \lambda w_j ,\; 0 \right) .
\]
The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖_2 is small and if, after setting β^j to zero, the following optimality condition holds:
\[
\left\| x_j^\top (XB - Y\Theta) \right\|_2 \le \lambda w_j .
\]
The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value λ_max of the penalty parameter below which B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value λ_max of the penalty parameter, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:
\[
\lambda_{\max} = \max_{j \in \{1, \dots, p\}} \frac{1}{w_j} \left\| x_j^\top Y \Theta^0 \right\|_2 .
\]
The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ_1 = λ_max > ⋯ > λ_t > ⋯ > λ_T = λ_min ≥ 0, by regularly decreasing the penalty, λ_{t+1} = λ_t/2, and using a warm-start strategy where the feasible initial guess for B(λ_{t+1}) is initialized with B(λ_t). The final penalty parameter λ_min is determined during the optimization process, when the maximum number of desired active variables is attained (by default the minimum of n and p).
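A MATLAB sketch of λ_max and of the halving path with warm starts; the code is ours, the number of path points T is an arbitrary choice, and gloss_fit is a hypothetical solver call standing for the active-set procedure of Algorithm 1.

    % Largest useful penalty: condition (4.32b) evaluated at B = 0.
    lambda_max = max(sqrt(sum((X' * (Y * Theta0)).^2, 2)) ./ w);
    % Halving path lambda_1 = lambda_max > ... > lambda_T, with warm starts.
    T = 20;                                   % arbitrary number of path points
    lambdas = lambda_max * (1/2).^(0:T-1);
    B = zeros(size(X, 2), size(Theta0, 2));   % feasible initial guess at lambda_max
    for t = 1:T
        % B = gloss_fit(X, Y, lambdas(t), B); % hypothetical solver, warm-started
    end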


5.6 Options and Variants

5.6.1 Scaling Variables

As with most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm, or equivalently to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some MATLAB commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requisites, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS, corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems
\[
\min_{B \in \mathbb{R}^{p \times K-1}} \left\| Y\Theta - XB \right\|_F^2
= \min_{B \in \mathbb{R}^{p \times K-1}} \mathrm{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \Sigma_T B \right)
\]
are replaced by
\[
\min_{B \in \mathbb{R}^{p \times K-1}} \mathrm{tr}\!\left( \Theta^\top Y^\top Y \Theta - 2\, \Theta^\top Y^\top X B + n\, B^\top \left( \Sigma_B + \mathrm{diag}(\Sigma_W) \right) B \right) .
\]
Note that this variant only requires diag(Σ_W) + Σ_B + n^{-1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{-1}Ω positive definite.

5.6.4 Elastic net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition, for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

[Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image, with
Ω_L =
( 3 −1  0 −1 −1  0  0  0  0
 −1  5 −1 −1 −1 −1  0  0  0
  0 −1  3  0 −1 −1  0  0  0
 −1 −1  0  5 −1  0 −1 −1  0
 −1 −1 −1 −1  8 −1 −1 −1 −1
  0 −1 −1  0 −1  5  0 −1 −1
  0  0  0 −1 −1  0  3 −1  0
  0  0  0 −1 −1 −1 −1  5 −1
  0  0  0  0 −1 −1  0 −1  3 ). ]

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, with the corresponding Laplacian matrix. The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^⊤Ω_Lβ favors, among vectors of identical L2 norm, those having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, which is the indicator of the neighbors of pixel 1, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^⊤, with a sign mismatch between pixel 1 and its neighborhood.
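The Laplacian of Figure 5.2 can be rebuilt programmatically. The MATLAB sketch below is ours; it assumes the 8-neighbour pixel adjacency that reproduces the matrix shown in the figure, and generalizes to any r × c pixel grid.

    % Graph Laplacian of the 8-neighbour graph of an r x c pixel grid
    % (r = c = 3 reproduces the matrix of Figure 5.2, up to pixel relabeling).
    r = 3;  c = 3;
    [I, J] = ndgrid(1:r, 1:c);
    coord = [I(:), J(:)];                     % pixel coordinates
    m = r * c;
    A = zeros(m);
    for a = 1:m
        for b = 1:m
            A(a, b) = (a ~= b) && (max(abs(coord(a, :) - coord(b, :))) == 1);
        end
    end
    OmegaL = diag(sum(A, 2)) - A;             % Laplacian D - A; penalty is beta'*OmegaL*beta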

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty just has to be added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned.


                6 Experimental Results

This section presents some comparison results between the Group-Lasso Optimal Scoring Solver algorithm and two other state-of-the-art classifiers proposed to perform sparse LDA. Those algorithms are Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in MATLAB. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting either the diagonal of the total covariance matrix Σ_T, or the diagonal of the within-class covariance matrix Σ_W, to ones. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing penalties with weights. The latter option is implemented in our MATLAB package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, Chapter 4) suggest investigating other decision thresholds than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹ The GLOSS MATLAB code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques on the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100, and a test set of size 1000. We are in the small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated in all simulations, except in Simulation 2 where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is given below.

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), where μ_1j = 0.7 × 1_(1≤j≤25), μ_2j = 0.7 × 1_(26≤j≤50), μ_3j = 0.7 × 1_(51≤j≤75), μ_4j = 0.7 × 1_(76≤j≤100).

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ), and if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1_(j≤200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100. The blocks have (j, j′) element 0.6^{|j−j′|}. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
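For concreteness, a MATLAB sketch of the data generation for Simulation 1 (our code, with the sizes used in the experiments):

    % Simulation 1: K = 4 classes, p = 500 independent features, mean shift of
    % 0.7 on a distinct block of 25 features per class.
    n = 100;  p = 500;  K = 4;
    y = repmat((1:K)', n/K, 1);              % balanced class labels
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25+1 : k*25) = 0.7;      % mu_kj = 0.7 on the kth block
    end
    X = mu(y, :) + randn(n, p);              % x_i ~ N(mu_k, I)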

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are performed by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far behind by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                                              Err (%)       Var            Dir
  Sim 1, K = 4, mean shift, ind. features
    PLDA                                      12.6 (0.1)    411.7 (3.7)    3.0 (0.0)
    SLDA                                      31.9 (0.1)    228.0 (0.2)    3.0 (0.0)
    GLOSS                                     19.9 (0.1)    106.4 (1.3)    3.0 (0.0)
    GLOSS-D                                   11.2 (0.1)    251.1 (4.1)    3.0 (0.0)
  Sim 2, K = 2, mean shift, dependent features
    PLDA                                       9.0 (0.4)    337.6 (5.7)    1.0 (0.0)
    SLDA                                      19.3 (0.1)     99.0 (0.0)    1.0 (0.0)
    GLOSS                                     15.4 (0.1)     39.8 (0.8)    1.0 (0.0)
    GLOSS-D                                    9.0 (0.0)    203.5 (4.0)    1.0 (0.0)
  Sim 3, K = 4, 1D mean shift, ind. features
    PLDA                                      13.8 (0.6)    161.5 (3.7)    1.0 (0.0)
    SLDA                                      57.8 (0.2)    152.6 (2.0)    1.9 (0.0)
    GLOSS                                     31.2 (0.1)    123.8 (1.8)    1.0 (0.0)
    GLOSS-D                                   18.5 (0.1)    357.5 (2.8)    1.0 (0.0)
  Sim 4, K = 4, mean shift, ind. features
    PLDA                                      60.3 (0.1)    336.0 (5.8)    3.0 (0.0)
    SLDA                                      65.9 (0.1)    208.8 (1.6)    2.7 (0.0)
    GLOSS                                     60.7 (0.2)     74.3 (2.2)    2.7 (0.0)
    GLOSS-D                                   58.8 (0.1)    162.7 (4.9)    2.9 (0.0)


[Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and the four simulations.]

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

             Simulation 1      Simulation 2      Simulation 3      Simulation 4
             TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
  PLDA       99.0    78.2      96.9    60.3      98.0    15.9      74.3    65.6
  SLDA       73.9    38.5      33.8    16.3      41.6    27.8      50.7    39.5
  GLOSS      64.1    10.6      30.0     4.6      51.1    18.2      26.0    12.1
  GLOSS-D    93.5    39.4      92.1    28.1      95.6    65.5      42.9    29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is defined as the ratio of selected variables that are actually relevant; similarly, the FPR is the ratio of selected variables that are actually non-relevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3 where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011). The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer.

² http://www.broadinstitute.org/cancer/software/genepattern/datasets
³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                                          Err (%)         Var
  Nakayama, n = 86, p = 22,283, K = 5
    PLDA                                  20.95 (1.3)     10,478.7 (2,116.3)
    SLDA                                  25.71 (1.7)     252.5 (3.1)
    GLOSS                                 20.48 (1.4)     129.0 (18.6)
  Ramaswamy, n = 198, p = 16,063, K = 14
    PLDA                                  38.36 (6.0)     14,873.5 (720.3)
    SLDA                                  —               —
    GLOSS                                 20.61 (6.9)     372.4 (122.1)
  Sun, n = 180, p = 54,613, K = 4
    PLDA                                  33.78 (5.9)     21,634.8 (7,443.2)
    SLDA                                  36.22 (6.5)     384.4 (16.5)
    GLOSS                                 31.77 (4.5)     93.0 (93.6)

Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

Each dataset was split into a training set and a test set with respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations of the Nakayama and Sun datasets on the first canonical planes estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant in the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


[Figure 6.2: 2D-representations of the Nakayama and Sun datasets (rows) based on the first two discriminant vectors provided by GLOSS and SLDA (columns); the big squares represent class means. Nakayama classes: 1) Synovial sarcoma, 2) Myxoid liposarcoma, 3) Dedifferentiated liposarcoma, 4) Myxofibrosarcoma, 5) Malignant fibrous histiocytoma. Sun classes: 1) NonTumor, 2) Astrocytomas, 3) Glioblastomas, 4) Oligodendrogliomas.]


[Figure 6.3: USPS digits "1" and "0" (mean images).]

6.5 Correlated Data

When the features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the sparsity-inducing group-Lasso penalty with a quadratic penalty used as a surrogate of the unknown within-class variance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we encode the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time for a 256-node graph. Introducing this new 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is fairly evident in Figure 6.4, where the discriminant vector β resulting from a non-penalized run of GLOSS is compared with the β resulting from a Laplacian-penalized run of S-GLOSS (without group-Lasso penalty). We clearly distinguish the center of the digit "0" in the discriminant direction obtained by S-GLOSS, which is probably the most important element for discriminating both digits.

Figure 6.5 displays the discriminant directions β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow strokes to be detected, and will probably provide better prediction results.


[Figure 6.4: Discriminant direction between digits "1" and "0": β for GLOSS (left) and β for S-GLOSS (right).]

[Figure 6.5: Sparse discriminant direction between digits "1" and "0": β for GLOSS with λ = 0.3 (left) and β for S-GLOSS with λ = 0.3 (right).]


                Discussion

GLOSS is an efficient algorithm that performs sparse LDA, based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; this is, to our knowledge, the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is amenable to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K−1)-dimensional problem into (K−1) independent p-dimensional problems. The interaction between the (K−1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors, regarding either its prediction abilities or its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions makes it possible to generate models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data from the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups: many composite or hierarchical penalties that have been proposed for structured data meet this condition.


                Part III

                Sparse Clustering Analysis


                Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. Similarity measures exist for any data structure: database records or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussian, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part, we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


                7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and 8 years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^\top, \dots, x_n^\top)^\top have been drawn identically from K different subpopulations in the domain R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compounded distribution whose density can be expressed as

f(x_i) = \sum_{k=1}^{K} \pi_k f_k(x_i) , \quad \forall i \in \{1, \dots, n\} ,

where K is the number of components, f_k are the densities of the components and \pi_k are the mixture proportions (\pi_k \in ]0,1[ for all k, and \sum_k \pi_k = 1). Mixture models transcribe that, given the proportions \pi_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters \pi_1, \dots, \pi_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.
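This two-step generative mechanism translates directly into a sampling routine. The following minimal sketch is ours (the function name and arguments are illustrative), assuming Gaussian components with a common covariance matrix, as used later in this chapter:

import numpy as np

def sample_mixture(n, pi, means, cov, rng=None):
    """Draw n samples from a Gaussian mixture with common covariance.

    pi    : (K,) mixture proportions, summing to one
    means : (K, p) component means
    cov   : (p, p) common covariance matrix
    Returns the samples X (n, p) and the latent labels y (n,).
    """
    rng = np.random.default_rng(rng)
    # Step y: allot each individual to a class (multinomial draw)
    y = rng.choice(len(pi), size=n, p=pi)
    # Step x: draw each x_i from the density of its component
    X = np.vstack([rng.multivariate_normal(means[k], cov) for k in y])
    return X, y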

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities \phi(\cdot\,;\theta_k). The density of the mixture can then be written as

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k \, \phi(x_i; \theta_k) , \quad \forall i \in \{1, \dots, n\} ,

where \theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi) of a univariate Gaussian mixture model with two components. That method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used way to estimate the parameters is to maximize the log-likelihood using the EM algorithm, which is typically employed for models with latent variables, for which no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step computes the expectation of the likelihood with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected likelihood built in the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed. In practice, the obtained solution depends on the initialization of the algorithm.

                Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

L(\theta; X) = \log\left( \prod_{i=1}^{n} f(x_i; \theta) \right) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k) \right) ,    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters) and \pi_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood or

classification log-likelihood:

L_C(\theta; X, Y) = \log\left( \prod_{i=1}^{n} f(x_i, y_i; \theta) \right)
                 = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} y_{ik} \, \pi_k f_k(x_i; \theta_k) \right)
                 = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right) .    (7.2)

The y_{ik} are the binary entries of the indicator matrix Y, with y_{ik} = 1 if observation i belongs to cluster k and y_{ik} = 0 otherwise.

Defining the soft membership t_{ik}(\theta) as

t_{ik}(\theta) = p(Y_{ik} = 1 \,|\, x_i; \theta)    (7.3)
             = \frac{\pi_k f_k(x_i; \theta_k)}{f(x_i; \theta)} ,    (7.4)

to lighten notations, t_{ik}(\theta) will be denoted t_{ik} when the parameter \theta is clear from context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

L_C(\theta; X, Y) = \sum_{i,k} y_{ik} \log\left( \pi_k f_k(x_i; \theta_k) \right)
                 = \sum_{i,k} y_{ik} \log\left( t_{ik} f(x_i; \theta) \right)
                 = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i,k} y_{ik} \log f(x_i; \theta)
                 = \sum_{i,k} y_{ik} \log t_{ik} + \sum_{i=1}^{n} \log f(x_i; \theta)
                 = \sum_{i,k} y_{ik} \log t_{ik} + L(\theta; X) ,    (7.5)

where \sum_{i,k} y_{ik} \log t_{ik} can be reformulated as

\sum_{i,k} y_{ik} \log t_{ik} = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log\left( p(Y_{ik} = 1 \,|\, x_i; \theta) \right)
                             = \sum_{i=1}^{n} \log\left( p(y_i \,|\, x_i; \theta) \right)
                             = \log\left( p(Y \,|\, X; \theta) \right) .

As a result, the relationship (7.5) can be rewritten as

L(\theta; X) = L_C(\theta; Z) - \log\left( p(Y \,|\, X; \theta) \right) .    (7.6)


                Likelihood Maximization

The complete log-likelihood cannot be assessed because the variables y_{ik} are unknown. However, the log-likelihood can be estimated by taking expectations of (7.6) conditionally on a current value \theta^{(t)} of the parameter:

L(\theta; X) = \underbrace{E_{Y \sim p(\cdot|X;\theta^{(t)})} \left[ L_C(\theta; X, Y) \right]}_{Q(\theta,\theta^{(t)})} + \underbrace{E_{Y \sim p(\cdot|X;\theta^{(t)})} \left[ -\log p(Y|X;\theta) \right]}_{H(\theta,\theta^{(t)})} .

In this expression, H(\theta,\theta^{(t)}) is the entropy and Q(\theta,\theta^{(t)}) is the conditional expectation of the complete log-likelihood. Let us define an increment of the log-likelihood as \Delta L = L(\theta^{(t+1)}; X) - L(\theta^{(t)}; X). Then \theta^{(t+1)} = \arg\max_{\theta} Q(\theta,\theta^{(t)}) also increases the log-likelihood:

\Delta L = \underbrace{\left( Q(\theta^{(t+1)},\theta^{(t)}) - Q(\theta^{(t)},\theta^{(t)}) \right)}_{\geq 0 \text{ by definition of iteration } t+1} - \underbrace{\left( H(\theta^{(t+1)},\theta^{(t)}) - H(\theta^{(t)},\theta^{(t)}) \right)}_{\leq 0 \text{ by Jensen's inequality}} .

Therefore, it is possible to maximize the likelihood by optimizing Q(\theta,\theta^{(t)}). The relationship between Q(\theta,\theta') and L(\theta; X) is developed in deeper detail in Appendix F, to show how the value of L(\theta; X) can be recovered from Q(\theta,\theta^{(t)}).

For the mixture model problem, Q(\theta,\theta') is

Q(\theta,\theta') = E_{Y \sim p(Y|X;\theta')} \left[ L_C(\theta; X, Y) \right]
                 = \sum_{i,k} p(Y_{ik} = 1 | x_i; \theta') \log\left( \pi_k f_k(x_i; \theta_k) \right)
                 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}(\theta') \log\left( \pi_k f_k(x_i; \theta_k) \right) .    (7.7)

Due to its similitude to the expression of the complete likelihood (7.2), Q(\theta,\theta') is also known as the weighted likelihood. In (7.7), the weights t_{ik}(\theta') are the posterior probabilities of cluster memberships.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter \theta^{(0)};

• E-Step: evaluation of Q(\theta,\theta^{(t)}), using t_{ik}(\theta^{(t)}) (7.4) in (7.7);

• M-Step: calculation of \theta^{(t+1)} = \arg\max_{\theta} Q(\theta,\theta^{(t)}).


                Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix \Sigma and different mean vectors \mu_k, the mixture density is

f(x_i; \theta) = \sum_{k=1}^{K} \pi_k f_k(x_i; \theta_k)
             = \sum_{k=1}^{K} \pi_k \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right\} .

At the E-step, the posterior probabilities t_{ik} are computed as in (7.4) with the current parameters \theta^{(t)}; then, the M-step maximizes Q(\theta,\theta^{(t)}) (7.7), whose form is as follows:

Q(\theta,\theta^{(t)}) = \sum_{i,k} t_{ik} \log(\pi_k) - \sum_{i,k} t_{ik} \log\left( (2\pi)^{p/2} |\Sigma|^{1/2} \right) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                     = \sum_{k} t_k \log(\pi_k) - \underbrace{\frac{np}{2} \log(2\pi)}_{\text{constant term}} - \frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i,k} t_{ik} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k)
                     \equiv \sum_{k} t_k \log(\pi_k) - \frac{n}{2} \log(|\Sigma|) - \sum_{i,k} t_{ik} \left( \frac{1}{2} (x_i - \mu_k)^\top \Sigma^{-1} (x_i - \mu_k) \right) ,    (7.8)

where

t_k = \sum_{i=1}^{n} t_{ik} .    (7.9)

The M-step, which maximizes this expression with respect to \theta, applies the following updates defining \theta^{(t+1)}:

\pi_k^{(t+1)} = \frac{t_k}{n} ,    (7.10)

\mu_k^{(t+1)} = \frac{\sum_i t_{ik} x_i}{t_k} ,    (7.11)

\Sigma^{(t+1)} = \frac{1}{n} \sum_k W_k ,    (7.12)

with W_k = \sum_i t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .    (7.13)

The derivations are detailed in Appendix G.
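For illustration, the following minimal sketch (ours, not the Mix-GLOSS code) implements this EM for a Gaussian mixture with common covariance matrix, using the E-step (7.4) and the M-step updates (7.10)-(7.13), with random soft initialization and no safeguards against degenerate covariances:

import numpy as np
from scipy.stats import multivariate_normal

def em_common_cov(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture with common covariance matrix."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    T = rng.dirichlet(np.ones(K), size=n)          # soft memberships t_ik
    for _ in range(n_iter):
        # M-step
        tk = T.sum(axis=0)                          # (7.9)
        pi = tk / n                                 # (7.10)
        mu = (T.T @ X) / tk[:, None]                # (7.11)
        Sigma = np.zeros((p, p))
        for k in range(K):
            R = X - mu[k]
            Sigma += (T[:, k, None] * R).T @ R      # W_k, (7.13)
        Sigma /= n                                  # (7.12)
        # E-step
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mu[k], Sigma) for k in range(K)
        ])
        T = dens / dens.sum(axis=1, keepdims=True)  # (7.4)
    return pi, mu, Sigma, T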

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix \Sigma_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample setting, numerical issues appear in the estimation of the covariance matrix. To avoid those singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition \Sigma_k = \lambda_k D_k A_k D_k^\top (Banfield and Raftery, 1993).

These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques to induce sparsity in model-based clustering algorithms. This sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

\log\left( \frac{p(Y_k = 1|x)}{p(Y_\ell = 1|x)} \right) = x^\top \Sigma^{-1}(\mu_k - \mu_\ell) - \frac{1}{2}(\mu_k + \mu_\ell)^\top \Sigma^{-1}(\mu_k - \mu_\ell) + \log\frac{\pi_k}{\pi_\ell} .

In this model, a simple way of introducing sparsity in the discriminant vectors \Sigma^{-1}(\mu_k - \mu_\ell) is to constrain \Sigma to be diagonal and to favor sparse means \mu_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

\lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

\lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| + \lambda_2 \sum_{k=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{p} |(\Sigma_k^{-1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP):

\lambda \sum_{j=1}^{p} \sum_{1 \le k < k' \le K} |\mu_{kj} - \mu_{k'j}| .

This PFP regularization does not shrink the means to zero but towards each other: if the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

\lambda \sum_{j=1}^{p} \left\| (\mu_{1j}, \mu_{2j}, \dots, \mu_{Kj}) \right\|_\infty .

One group is defined for each variable j, as the set of the jth components of the K means (\mu_{1j}, \dots, \mu_{Kj}). The L1∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

\lambda \sqrt{K} \sum_{j=1}^{p} \sqrt{ \sum_{k=1}^{K} \mu_{kj}^2 } .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code is available on the authors' website that would allow testing.
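To make the differences concrete, the following sketch (ours, not taken from any of the cited packages) evaluates the three mean-based penalties discussed above on a K × p matrix of cluster means:

import numpy as np

def mean_penalties(mu, lam=1.0):
    """Penalty values for a (K, p) matrix of cluster means.

    l1   : lambda * sum_k sum_j |mu_kj|          (L1 penalty on the means)
    linf : lambda * sum_j max_k |mu_kj|          (L1-infinity group penalty)
    vmg  : lambda * sqrt(K) * sum_j ||mu_.j||_2  (group-Lasso, vertical grouping)
    """
    K, p = mu.shape
    l1 = lam * np.abs(mu).sum()
    linf = lam * np.abs(mu).max(axis=0).sum()
    vmg = lam * np.sqrt(K) * np.linalg.norm(mu, axis=0).sum()
    return {"l1": l1, "linf": linf, "vmg": vmg}

# A variable j is dropped only when the whole column mu[:, j] is zero,
# which the group penalties (linf, vmg) encourage directly.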

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization expressions of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented it for two-cluster problems, using an L1 penalty to encourage sparsity in the discriminant vector. The generalization from quadratic to non-quadratic penalties is only quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy through conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as

f(x_i | \phi, \pi, \theta, \nu) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{p} \left[ f(x_{ij}|\theta_{jk}) \right]^{\phi_j} \left[ h(x_{ij}|\nu_j) \right]^{1-\phi_j} ,

where f(\cdot|\theta_{jk}) is the distribution function for relevant features and h(\cdot|\nu_j) is the distribution function for the irrelevant ones. The binary vector \phi = (\phi_1, \phi_2, \dots, \phi_p) represents relevance, with \phi_j = 1 if the jth feature is informative and \phi_j = 0 otherwise. The saliency of variable j is then formalized as \rho_j = P(\phi_j = 1), so all \phi_j must be treated as missing variables. The set of parameters is thus \{\pi_k, \theta_{jk}, \nu_j, \rho_j\}; their estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U \in R^{p \times (K-1)}, which is updated inside the EM loop with a new step called the Fisher step (F-step from now on), which maximizes the multi-class Fisher criterion

\mathrm{tr}\left( (U^\top \Sigma_W U)^{-1} U^\top \Sigma_B U \right) ,    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data to the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as a function of the projection matrix U and of the model parameters in the latent space, such that the U matrix enters the M-step equations.
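As an illustration only (not the Fisher-EM implementation), the criterion (7.14) can be evaluated for a candidate projection U as follows; Sigma_W and Sigma_B denote the within- and between-class covariance estimates computed from the current soft assignments:

import numpy as np

def fisher_criterion(U, Sigma_W, Sigma_B):
    """Multi-class Fisher criterion tr((U' Sigma_W U)^{-1} U' Sigma_B U)."""
    SW_U = U.T @ Sigma_W @ U            # (K-1, K-1)
    SB_U = U.T @ Sigma_B @ U            # (K-1, K-1)
    return np.trace(np.linalg.solve(SW_U, SB_U))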

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation \hat{U} of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

\min_{\hat{U} \in R^{p \times (K-1)}} \left\| X_U - X \hat{U} \right\|_F^2 + \lambda \sum_{k=1}^{K-1} \left\| \hat{u}_k \right\|_1 ,

where X_U = XU is the input data projected in the non-sparse space and \hat{u}_k is the kth column vector of the projection matrix \hat{U}. The second possibility is inspired by Qiao et al. (2009): it reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net,

\min_{A,B \in R^{p \times (K-1)}} \sum_{k=1}^{K} \left\| R_W^{-\top} H_{B,k} - A B^\top H_{B,k} \right\|_2^2 + \rho \sum_{j=1}^{K-1} \beta_j^\top \Sigma_W \beta_j + \lambda \sum_{j=1}^{K-1} \left\| \beta_j \right\|_1
\quad \text{s.t. } A^\top A = I_{K-1} ,

where H_B \in R^{p \times K} is a matrix defined conditionally on the posterior probabilities t_{ik}, satisfying H_B H_B^\top = \Sigma_B, and H_{B,k} is the kth column of H_B; R_W \in R^{p \times p} is an upper triangular matrix resulting from the Cholesky decomposition of \Sigma_W; \Sigma_W and \Sigma_B are the p \times p within-class and between-class covariance matrices in the observation space; A \in R^{p \times (K-1)} and B \in R^{p \times (K-1)} are the solutions of the optimization problem, such that B = [\beta_1, \dots, \beta_{K-1}] is the best sparse approximation of U.

The last possibility recasts the solution of the Fisher discriminant (7.14) as the solution of the following constrained optimization problem:

\min_{U \in R^{p \times (K-1)}} \sum_{j=1}^{p} \left\| \Sigma_{B,j} - U U^\top \Sigma_{B,j} \right\|_2^2 \quad \text{s.t. } U^\top U = I_{K-1} ,

where \Sigma_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and second options must be followed by a singular value decomposition of U to restore orthogonality. This is not necessary with the third option, since the penalized version of the SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that under certain assumptions their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices, the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this view, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^{(1)}: set of selected relevant variables;

• X^{(2)}: set of variables being considered for inclusion in or exclusion from X^{(1)};

• X^{(3)}: set of non-relevant variables.


With those subsets, they define two different models, where Y is the partition to consider:

• M1:  f(X|Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)} | X^{(1)}) \, f(X^{(1)} | Y)

• M2:  f(X|Y) = f(X^{(1)}, X^{(2)}, X^{(3)} | Y) = f(X^{(3)} | X^{(2)}, X^{(1)}) \, f(X^{(2)}, X^{(1)} | Y)

Model M1 means that the variables in X^{(2)} are independent of the clustering Y; Model M2

indicates that the variables in X^{(2)} depend on the clustering Y. To simplify the algorithm, the subset X^{(2)} is only updated one variable at a time. Therefore, deciding the relevance of variable X^{(2)} amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

B_{12} = \frac{f(X|M_1)}{f(X|M_2)} ,

where the high-dimensional f(X^{(3)} | X^{(2)}, X^{(1)}) cancels from the ratio:

B_{12} = \frac{f(X^{(1)}, X^{(2)}, X^{(3)} | M_1)}{f(X^{(1)}, X^{(2)}, X^{(3)} | M_2)} = \frac{f(X^{(2)} | X^{(1)}, M_1) \, f(X^{(1)} | M_1)}{f(X^{(2)}, X^{(1)} | M_2)} .

This factor is approximated, since the integrated likelihoods f(X^{(1)} | M_1) and f(X^{(2)}, X^{(1)} | M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^{(2)} | X^{(1)}, M_1), if there is only one variable in X^{(2)}, can be represented as a linear regression of variable X^{(2)} on the variables in X^{(1)}. There is also a BIC approximation for this term.

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They also define three subsets of variables: the relevant and irrelevant subsets (X^{(1)} and X^{(3)}) remain the same, but X^{(2)} is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise search used by Raftery and Dean (2006). Their algorithm allows blocks of indivisible variables to be defined, which in certain situations improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, testing the different subsets of variables requires a huge computation time. In practice, they cannot be used for the amount of data considered in this thesis.


                8 Theoretical Foundations

In this chapter, we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using fewer than K − 1 discriminant directions. Their motivation was mainly driven by stability issues; no sparsity-inducing mechanism was introduced in the construction of discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k) ,

where \mu_k are the p-dimensional centroids and \Sigma_W is the p × p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_{ik} (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

2 \, l_{\mathrm{weight}}(\mu, \Sigma) = - \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, d(x_i, \mu_k) - n \log(|\Sigma_W|) ,

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1, in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to the Fisher discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_{ik} in the E-step, the distance between the samples x_i and the centroids \mu_k must be evaluated. Depending on whether we are working in the input domain, the OS domain or the LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

d(x_i, \mu_k) = \left\| (x_i - \mu_k) B_{LDA} \right\|_2^2 - 2 \log(\pi_k) .

This distance defines the computation of the posterior probabilities t_{ik} in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as follows:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as

   B_{OS} = \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y \Theta ,

   where \Theta are the K − 1 leading eigenvectors of

   Y^\top X \left( X^\top X + \lambda \Omega \right)^{-1} X^\top Y .

3. Map X to the LDA domain: X_{LDA} = X B_{OS} D, with D = \mathrm{diag}\left( \alpha_k^{-1} (1 - \alpha_k^2)^{-1/2} \right).

4. Compute the centroids M in the LDA domain.

5. Evaluate the distances in the LDA domain.

6. Translate the distances into posterior probabilities t_{ik} with

   t_{ik} \propto \exp\left[ - \frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right] .    (8.1)

7. Update the labels using the posterior probability matrix: Y = T.

8. Go back to step 2 and iterate until the t_{ik} converge.

Items 2 to 5 can be interpreted as the M-step, and Item 6 as the E-step, in this alternative view of the EM algorithm for Gaussian mixtures.
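A compact sketch of steps 2 to 6 is given below. It is illustrative only: it follows the ridge-type formulas above for a quadratic penalty matrix Omega, not the group-Lasso machinery of GLOSS; the function names are ours, and Theta is taken here as the leading generalized eigenvectors with respect to Y'Y, which keeps the eigenvalues α² in [0, 1] (the exact normalization used in GLOSS is given in Chapter 4).

import numpy as np
from scipy.linalg import eigh

def pos_step(X, Y, Omega, lam):
    """One penalized optimal scoring solve (steps 2-3)."""
    K = Y.shape[1]
    A = X.T @ X + lam * Omega                      # X'X + lambda*Omega
    XtY = X.T @ Y
    G = Y.T @ X @ np.linalg.solve(A, XtY)          # Y'X (X'X+lam*Omega)^{-1} X'Y
    alpha2, Theta = eigh(G, Y.T @ Y)               # generalized eigenproblem
    order = np.argsort(alpha2)[::-1][:K - 1]       # K-1 leading eigenvectors
    alpha2, Theta = alpha2[order], Theta[:, order]
    B_os = np.linalg.solve(A, XtY @ Theta)         # step 2
    D = np.diag(1.0 / (np.sqrt(alpha2) * np.sqrt(1.0 - alpha2)))
    return B_os, X @ B_os @ D                      # step 3: X in the LDA domain

def e_step(X_lda, T_prev):
    """Steps 4-6: centroids, distances and posteriors in the LDA domain."""
    n, K = T_prev.shape
    tk = T_prev.sum(axis=0)
    pi = tk / n
    M = (T_prev.T @ X_lda) / tk[:, None]                      # step 4: centroids
    d = ((X_lda[:, None, :] - M[None, :, :]) ** 2).sum(-1)    # step 5: squared distances
    logit = -0.5 * (d - 2.0 * np.log(pi))                     # step 6, as in (8.1)
    T = np.exp(logit - logit.max(axis=1, keepdims=True))      # stable normalization
    return T / T.sum(axis=1, keepdims=True)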

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(\theta,\theta') (7.7) so as to maximize the likelihood L(\theta) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized optimal scoring problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix \Sigma (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means, and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

f(\Sigma | \Lambda_0, \nu_0) = \frac{1}{2^{np/2} |\Lambda_0|^{n/2} \Gamma_p(n/2)} \, |\Sigma^{-1}|^{(\nu_0 - p - 1)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\left( \Lambda_0^{-1} \Sigma^{-1} \right) \right\} ,

where \nu_0 is the number of degrees of freedom of the distribution, \Lambda_0 is a p × p scale matrix, and where \Gamma_p is the multivariate gamma function defined as

\Gamma_p(n/2) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\left( n/2 + (1 - j)/2 \right) .

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

Q(\theta,\theta') + \log\left( f(\Sigma | \Lambda_0, \nu_0) \right)
= \sum_{k=1}^{K} t_k \log \pi_k - \frac{(n+1)p}{2} \log 2 - \frac{n}{2} \log |\Lambda_0| - \frac{p(p+1)}{4} \log(\pi)
  - \sum_{j=1}^{p} \log \Gamma\left( \frac{n}{2} + \frac{1-j}{2} \right) - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right)
\equiv \sum_{k=1}^{K} t_k \log \pi_k - \frac{n}{2} \log |\Lambda_0| - \frac{\nu_n - p - 1}{2} \log |\Sigma| - \frac{1}{2} \mathrm{tr}\left( \Lambda_n^{-1} \Sigma^{-1} \right) ,    (8.2)

with t_k = \sum_{i=1}^{n} t_{ik} ,
     \nu_n = \nu_0 + n ,
     \Lambda_n^{-1} = \Lambda_0^{-1} + S_0 ,
     S_0 = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top .

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to \mu_k and \pi_k is of course not affected by the additional prior term, where only the covariance \Sigma intervenes. The MAP estimator for \Sigma is simply obtained by differentiating (8.2) with respect to \Sigma. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator for \Sigma is

\Sigma_{MAP} = \frac{1}{\nu_0 + n - p - 1} \left( \Lambda_0^{-1} + S_0 \right) ,    (8.3)

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if \nu_0 is chosen to be p + 1 and setting \Lambda_0^{-1} = \lambda\Omega, where \Omega is the penalty matrix from the group-Lasso regularization (4.25).


                9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter, I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_{ik}.

When several values of the penalty parameter are tested, they are given to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

Figure 9.1: Mix-GLOSS loops scheme.

For each algorithm repetition, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, such that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the implemented warm-start reduces the computation time by a factor of 8, with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage


of variables that will be removed with the next penalty parameter can be modified to make the feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, Equation (4.32b) has to be replaced by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
  B ← 0
  Y ← K-means(X, K)
  Run non-penalized Mix-GLOSS:
    λ ← 0
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  lastLAMBDA ← false
repeat
  Estimate λ: compute the gradient at β_j = 0,
    ∂J(B)/∂β_j |_{β_j = 0} = x_j^⊤ ( ∑_{m≠j} x_m β_m − YΘ )
  Compute λ_max for every feature using (4.32b):
    λ_max,j = (1 / w_j) ‖ ∂J(B)/∂β_j |_{β_j = 0} ‖_2
  Choose λ so as to remove 10% of the relevant features
  Run penalized Mix-GLOSS:
    (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
  if number of relevant variables in B > minVAR then
    lastLAMBDA ← false
  else
    lastLAMBDA ← true
  end if
until lastLAMBDA

Output: B, L(θ), t_{ik}, π_k, μ_k, Σ, Y for every λ in the solution path
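The λ_max computation inside Algorithm 2 is a plain gradient-norm evaluation at β_j = 0. A minimal sketch is given below (ours; it assumes uniform weights w_j = 1 unless provided, and a precomputed scaled indicator matrix YΘ):

import numpy as np

def lambda_max(X, YTheta, B, w=None):
    """Per-feature lambda_max: the smallest penalty that zeroes feature j.

    X      : (n, p) centered inputs
    YTheta : (n, K-1) scaled indicator matrix Y @ Theta
    B      : (p, K-1) current coefficients
    """
    p = X.shape[1]
    w = np.ones(p) if w is None else w
    lam = np.empty(p)
    for j in range(p):
        Bj = B.copy()
        Bj[j, :] = 0.0                        # gradient evaluated at beta_j = 0
        grad_j = X[:, j].T @ (X @ Bj - YTheta)
        lam[j] = np.linalg.norm(grad_j, 2) / w[j]
    return lam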

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_{ik} is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
  if (B0, Y0) available then
    B_OS ← B0 ; Y ← Y0
  else
    B_OS ← 0 ; Y ← K-means(X, K)
  end if
  convergenceEM ← false ; tolEM ← 1e-3
repeat
  M-step:
    (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
    X_LDA = X B_OS diag( α^{-1} (1 − α²)^{-1/2} )
    π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
  E-step:
    t_{ik} as per (8.1)
    L(θ) as per (8.2)
  if (1/n) ∑_i |t_{ik} − y_{ik}| < tolEM then
    convergenceEM ← true
  end if
  Y ← T
until convergenceEM
Y ← MAP(T)

Output: B_OS, Θ, L(θ), t_{ik}, π_k, μ_k, Σ, Y


                M-Step

The M-step deals with the estimation of the model parameters, that is, the cluster means \mu_k, the common covariance matrix \Sigma and the priors \pi_k of every component. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version YΘ of the label matrix. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

                E-Step

The E-step evaluates the posterior probability matrix T, using

t_{ik} \propto \exp\left[ - \frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2} \right] .

The convergence of the t_{ik} is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm took a lot of computing resources, since the stability selection mechanism required a certain number of repetitions that transformed Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, also required a large computation time.
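As a rough sketch of such a criterion (under our reading of Pan and Shen (2007): the penalty term of the classical BIC is computed with an effective number of parameters that excludes the zeroed mean coefficients; the exact count used in Mix-GLOSS may differ):

import numpy as np

def modified_bic(loglik, n, K, p, means, tol=1e-12):
    """BIC = -2*loglik + log(n) * d_e, with d_e shrunk by zeroed mean entries."""
    nonzero_means = int(np.sum(np.abs(means) > tol))   # only active mean coefficients
    d_e = (K - 1) + nonzero_means + p * (p + 1) // 2   # proportions + means + common Sigma
    return -2.0 * loglik + np.log(n) * d_e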

The third and, up to now, definitive attempt proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested

with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.

Figure 9.2: Mix-GLOSS model selection diagram.


                10 Experimental Results

The performance of Mix-GLOSS is measured here on the artificial datasets that were used in Chapter 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1, adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments, allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: this is a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "FisherEM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper, Maugis et al. (2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

  After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

  The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1∞ mixed norm. In Kuan et al. (2010), slight changes are introduced in the penalty term, such as weighting parameters that are particularly important for their dataset. The LumiWCluster package allows clustering using either the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms from Section 10.1. The performance measures are the following:

• Clustering Error (in percentage): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007). If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be obtained even if the IDs of the clusters and of the real classes differ (a computation sketch is given after this list).

• Number of Discarded Features: this value shows the number of variables whose coefficients have been zeroed and that are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to consume more memory and CPU as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of the truly relevant variables that are selected; similarly, the FPR is the proportion of the irrelevant variables that are (wrongly) selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations, but only for three algorithms: CS general cov and Clustvarsel were discarded due to high computing time and high clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM by Bouveyron and Brunet (2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.
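For reference, these measures can be computed as in the sketch below (ours): the clustering error searches the best one-to-one matching between cluster IDs and class labels, in the spirit of Wu and Schölkopf (2007), here realized with the Hungarian algorithm, and TPR/FPR compare the selected variables with the known relevant ones.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred, K):
    """Misclassification rate under the best cluster-to-class matching."""
    C = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    rows, cols = linear_sum_assignment(-C)      # maximize matched counts
    return 1.0 - C[rows, cols].sum() / len(y_true)

def tpr_fpr(selected, relevant, p):
    """selected, relevant: index sets of variables; p: total number of variables."""
    selected, relevant = set(selected), set(relevant)
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected - relevant) / (p - len(relevant))
    return tpr, fpr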

Results, in percentages, are displayed in Figure 10.2 (or in Table 10.2).


Table 10.1: Experimental results for simulated data: clustering error (Err, %), number of discarded variables (Var) and execution time, as mean (standard deviation) over 25 repetitions.

                      Err (%)      Var           Time
Sim 1: K = 4, mean shift, independent features
  CS general cov      46 (15)      985 (72)      884h
  Fisher EM           58 (87)      784 (52)      1645m
  Clustvarsel         602 (107)    378 (291)     383h
  LumiWCluster-Kuan   42 (68)      779 (4)       389s
  LumiWCluster-Wang   43 (69)      784 (39)      619s
  Mix-GLOSS           32 (16)      80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov      154 (2)      997 (09)      783h
  Fisher EM           74 (23)      809 (28)      8m
  Clustvarsel         73 (2)       334 (207)     166h
  LumiWCluster-Kuan   64 (18)      798 (04)      155s
  LumiWCluster-Wang   63 (17)      799 (03)      14s
  Mix-GLOSS           77 (2)       841 (34)      2h

Sim 3: K = 4, 1D mean shift, independent features
  CS general cov      304 (57)     55 (468)      1317h
  Fisher EM           233 (65)     366 (55)      22m
  Clustvarsel         658 (115)    232 (291)     542h
  LumiWCluster-Kuan   323 (21)     80 (02)       83s
  LumiWCluster-Wang   308 (36)     80 (02)       1292s
  Mix-GLOSS           347 (92)     81 (88)       21h

Sim 4: K = 4, mean shift, independent features
  CS general cov      626 (55)     999 (02)      112h
  Fisher EM           567 (104)    55 (48)       195m
  Clustvarsel         732 (4)      24 (12)       767h
  LumiWCluster-Kuan   692 (112)    99 (2)        876s
  LumiWCluster-Wang   697 (119)    991 (21)      825s
  Mix-GLOSS           669 (91)     975 (12)      11h

Table 10.2: TPR versus FPR (in %), average computed over 25 repetitions, for the best performing algorithms.

              Simulation 1      Simulation 2      Simulation 3      Simulation 4
              TPR     FPR       TPR     FPR       TPR     FPR       TPR     FPR
MIX-GLOSS     992     015       828     335       884     67        780     12
LUMI-KUAN     992     28        1000    02        1000    005       50      005
FISHER-EM     986     24        888     17        838     5825      620     4075


[Figure 10.2 here: scatter plot of TPR (vertical axis, 0 to 100) versus FPR (horizontal axis, 0 to 60) for MIX-GLOSS, LUMI-KUAN and FISHER-EM on Simulations 1 to 4.]

Figure 10.2: TPR versus FPR (in %) for the best performing algorithms and simulations.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. According to the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest kind of method, with good behavior regarding the other criteria. At the other end of this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most accurate ones, so their long computation times are worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster (Zhou et al., 2009) performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


                Conclusions


                Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have demonstrated the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to bring all the resources available for solving regression problems to bear on linear discrimination. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II we have used a variational approach to the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness by inducing extremely parsimonious models without renouncing any predictive capability. GLOSS has been tested on four artificial and three real datasets, outperforming other state-of-the-art algorithms in almost all situations.

In Part III this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection for the penalty parameter. In this case too, the theory has been put into practice with the implementation of Mix-GLOSS. So far, due to time constraints, only artificial datasets have been tested, with positive results.

                Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least on the same real datasets that we used with GLOSS; however, more testing would be recommended in both cases. These algorithms are well suited to genomic data, where the number of samples is smaller than the number of variables, but other high-dimension, low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been used in the literature.

At the programming level, both codes must be revisited to improve their robustness and to optimize their computation, because during the prototyping phase the priority was to achieve functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix; spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as a Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of these possibilities, such as the diagonal version of GLOSS, have been partially implemented, but they have not been properly tested or even updated with the latest algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.
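As an illustration of such a quadratic term, the sketch below builds a graph-Laplacian penalty for variables chained as neighbours (e.g., adjacent pixels) and adds it to a placeholder within-class covariance; this is a toy construction with assumed names, not the penalty shipped with GLOSS.

    # Illustrative sketch (not GLOSS code): a graph-Laplacian quadratic penalty
    # Omega for p variables chained as neighbours, added to a within-class
    # covariance to encode pairwise smoothness between adjacent variables.
    import numpy as np

    def chain_laplacian(p):
        A = np.zeros((p, p))                   # adjacency of a 1-D chain
        idx = np.arange(p - 1)
        A[idx, idx + 1] = A[idx + 1, idx] = 1.0
        D = np.diag(A.sum(axis=1))             # degree matrix
        return D - A                           # Laplacian L = D - A

    p, lam = 5, 0.1
    Omega = lam * chain_laplacian(p)
    Sigma_W = np.eye(p)                        # placeholder within-class covariance
    penalized = Sigma_W + Omega                # Elastic-net-like quadratic term
    print(penalized)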

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results because of the time constraints. Not knowing this criterion does not prevent successful runs of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be carried out in this direction to assess the convergence properties of the algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a substantial effort was devoted to the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model in which the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the base model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather all the points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


                Appendix


                A Matrix Properties

Property 1. By definition, $\Sigma_W$ and $\Sigma_B$ are both symmetric matrices:
$$\Sigma_W = \frac{1}{n}\sum_{k=1}^{g}\sum_{i\in C_k}(x_i-\mu_k)(x_i-\mu_k)^\top, \qquad \Sigma_B = \frac{1}{n}\sum_{k=1}^{g} n_k(\mu_k-\bar{x})(\mu_k-\bar{x})^\top.$$

Property 2. $\dfrac{\partial\, x^\top a}{\partial x} = \dfrac{\partial\, a^\top x}{\partial x} = a$.

Property 3. $\dfrac{\partial\, x^\top A x}{\partial x} = (A+A^\top)x$.

Property 4. $\dfrac{\partial\, |X^{-1}|}{\partial X} = -|X^{-1}|\,(X^{-1})^\top$.

Property 5. $\dfrac{\partial\, a^\top X b}{\partial X} = ab^\top$.

Property 6. $\dfrac{\partial}{\partial X}\operatorname{tr}\big(AX^{-1}B\big) = -(X^{-1}BAX^{-1})^\top = -X^{-\top}A^\top B^\top X^{-\top}$.


                B The Penalized-OS Problem is anEigenvector Problem

In this appendix we answer the question why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form
$$\min_{\theta_k,\beta_k}\ \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k \tag{B.1}$$
$$\text{s.t.}\quad \theta_k^\top Y^\top Y\theta_k = 1,\qquad \theta_\ell^\top Y^\top Y\theta_k = 0\ \ \forall\,\ell<k,$$
for $k = 1,\dots,K-1$.

The Lagrangian associated with Problem (B.1) is
$$L_k(\theta_k,\beta_k,\lambda_k,\nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top\Omega_k\beta_k + \lambda_k\big(\theta_k^\top Y^\top Y\theta_k - 1\big) + \sum_{\ell<k}\nu_\ell\,\theta_\ell^\top Y^\top Y\theta_k. \tag{B.2}$$
Setting the gradient of (B.2) with respect to $\beta_k$ to zero gives the value of the optimal $\beta_k^\star$:
$$\beta_k^\star = (X^\top X + \Omega_k)^{-1}X^\top Y\theta_k. \tag{B.3}$$
The objective function of (B.1) evaluated at $\beta_k^\star$ is
$$\min_{\theta_k}\ \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top}\Omega_k\beta_k^\star = \min_{\theta_k}\ \theta_k^\top Y^\top\big(I - X(X^\top X+\Omega_k)^{-1}X^\top\big)Y\theta_k = \max_{\theta_k}\ \theta_k^\top Y^\top X(X^\top X+\Omega_k)^{-1}X^\top Y\theta_k. \tag{B.4}$$
If the penalty matrix $\Omega_k$ is identical for all problems, $\Omega_k = \Omega$, then (B.4) corresponds to an eigen-problem where the $K-1$ score vectors $\theta_k$ are the eigenvectors of $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.

B.1 How to Solve the Eigenvector Decomposition

Computing an eigen-decomposition of an expression like $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$ is not trivial due to the $p\times p$ inverse. With some datasets, $p$ can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.

Let $M$ be the matrix $Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$, so that expression (B.4) can be rewritten in a compact way:
$$\max_{\Theta\in\mathbb{R}^{K\times(K-1)}}\ \operatorname{tr}\big(\Theta^\top M\Theta\big) \quad\text{s.t.}\quad \Theta^\top Y^\top Y\Theta = I_{K-1}. \tag{B.5}$$
If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the $(K-1)\times(K-1)$ matrix $M_\Theta$ be $\Theta^\top M\Theta$. The classical eigenvector formulation associated with (B.5) is then
$$M_\Theta v = \lambda v, \tag{B.6}$$
where $v$ is the eigenvector and $\lambda$ the associated eigenvalue of $M_\Theta$. Operating,
$$v^\top M_\Theta v = \lambda \iff v^\top\Theta^\top M\Theta v = \lambda.$$
Making the change of variable $w = \Theta v$, we obtain an alternative eigenproblem where the $w$ are the eigenvectors of $M$, with $\lambda$ the associated eigenvalue:
$$w^\top M w = \lambda. \tag{B.7}$$
Therefore the $v$ are the eigenvectors of the matrix $M_\Theta$ and the $w$ are the eigenvectors of the matrix $M$. Note that the only difference between the $(K-1)\times(K-1)$ matrix $M_\Theta$ and the $K\times K$ matrix $M$ is the $K\times(K-1)$ matrix $\Theta$ in the expression $M_\Theta = \Theta^\top M\Theta$. Then, to avoid the computation of the $p\times p$ inverse $(X^\top X+\Omega)^{-1}$, we can use the optimal value of the coefficient matrix $B = (X^\top X+\Omega)^{-1}X^\top Y\Theta$ in $M_\Theta$:
$$M_\Theta = \Theta^\top Y^\top X(X^\top X+\Omega)^{-1}X^\top Y\Theta = \Theta^\top Y^\top XB.$$
Thus the eigen-decomposition of the $(K-1)\times(K-1)$ matrix $M_\Theta = \Theta^\top Y^\top XB$ yields the $v$ eigenvectors of (B.6). To obtain the $w$ eigenvectors of the alternative formulation (B.7), the change of variable $w = \Theta v$ needs to be undone.

To summarize, we compute the $v$ eigenvectors from the eigen-decomposition of the tractable matrix $M_\Theta$, evaluated as $\Theta^\top Y^\top XB$; the definitive eigenvectors $w$ are then recovered as $w = \Theta v$, and the final step is the reconstruction of the optimal score matrix $\Theta^\star$ using the vectors $w$ as its columns. At this point we understand what is called in the literature "updating the initial score matrix": multiplying the initial $\Theta$ by the eigenvector matrix $V$ from decomposition (B.6) reverses the change of variable and restores the $w$ vectors. The matrix $B$ also needs to be "updated", by multiplying it by the same matrix of eigenvectors $V$, in order to account for the initial $\Theta$ used in the first computation of $B$:
$$B^\star = (X^\top X+\Omega)^{-1}X^\top Y\Theta V = BV.$$
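A minimal NumPy sketch of this update, assuming an indicator matrix Y and an initial score matrix Theta0 satisfying the constraint of (B.5) (all names are illustrative), could be the following.

    # Sketch of the small (K-1)x(K-1) eigen-decomposition described above.
    import numpy as np

    def update_scores(X, Y, Theta0, Omega):
        # B for the initial scores: B = (X'X + Omega)^{-1} X'Y Theta0
        B = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
        M_theta = Theta0.T @ Y.T @ X @ B           # (K-1)x(K-1), tractable
        evals, V = np.linalg.eigh((M_theta + M_theta.T) / 2)
        V = V[:, np.argsort(evals)[::-1]]          # sort eigenvalues decreasingly
        return Theta0 @ V, B @ V                   # "update" Theta and B: w = Theta0 v, B* = B V

    rng = np.random.default_rng(0)
    n, p, K = 30, 10, 3
    X = rng.normal(size=(n, p))
    labels = np.arange(n) % K
    Y = np.eye(K)[labels]                          # indicator matrix
    Q, _ = np.linalg.qr(rng.normal(size=(K, K - 1)))
    L = np.linalg.cholesky(Y.T @ Y)
    Theta0 = np.linalg.solve(L.T, Q)               # so that Theta0' Y'Y Theta0 = I
    Theta, B = update_scores(X, Y, Theta0, np.eye(p))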


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix $\Theta$ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.

By definition of the eigen-decomposition, the eigenvectors of the matrix $M$ (called $w$ in (B.7)) form a basis, so that any score vector $\theta$ can be expressed as a linear combination of them:
$$\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m, \quad\text{s.t.}\quad \theta_k^\top\theta_k = 1. \tag{B.8}$$
The score vector normalization constraint $\theta_k^\top\theta_k = 1$ can also be expressed as a function of this basis:
$$\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m w_m\Big) = 1,$$
which, as per the eigenvector properties, reduces to
$$\sum_{m=1}^{K-1}\alpha_m^2 = 1. \tag{B.9}$$
Let $M$ be multiplied by a score vector $\theta_k$, replaced by its linear combination of eigenvectors $w_m$ (B.8):
$$M\theta_k = M\sum_{m=1}^{K-1}\alpha_m w_m = \sum_{m=1}^{K-1}\alpha_m M w_m.$$
As the $w_m$ are the eigenvectors of the matrix $M$, the relationship $Mw_m = \lambda_m w_m$ can be used to obtain
$$M\theta_k = \sum_{m=1}^{K-1}\alpha_m\lambda_m w_m.$$
Multiplying on the left by $\theta_k^\top$, expressed as its linear combination of eigenvectors:
$$\theta_k^\top M\theta_k = \Big(\sum_{\ell=1}^{K-1}\alpha_\ell w_\ell\Big)^\top\Big(\sum_{m=1}^{K-1}\alpha_m\lambda_m w_m\Big).$$
This equation can be simplified using the orthogonality property of eigenvectors, according to which $w_\ell^\top w_m = 0$ for any $\ell\neq m$, giving
$$\theta_k^\top M\theta_k = \sum_{m=1}^{K-1}\alpha_m^2\lambda_m.$$
The optimization problem (B.5) for discriminant direction $k$ can thus be rewritten as
$$\max_{\theta_k\in\mathbb{R}^{K}}\ \theta_k^\top M\theta_k = \max_{\theta_k\in\mathbb{R}^{K}}\ \sum_{m=1}^{K-1}\alpha_m^2\lambda_m, \quad\text{with}\quad \theta_k = \sum_{m=1}^{K-1}\alpha_m w_m \quad\text{and}\quad \sum_{m=1}^{K-1}\alpha_m^2 = 1. \tag{B.10}$$
One way of maximizing Problem (B.10) is to choose $\alpha_m = 1$ for $m = k$ and $\alpha_m = 0$ otherwise. Hence, as $\theta_k = \sum_{m=1}^{K-1}\alpha_m w_m$, the resulting score vector $\theta_k$ is equal to the $k$-th eigenvector $w_k$.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix $M = Y^\top X(X^\top X+\Omega)^{-1}X^\top Y$.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:
$$\max_{\beta\in\mathbb{R}^p}\ \beta^\top\Sigma_B\beta \tag{C.1a}$$
$$\text{s.t.}\quad \beta^\top\Sigma_W\beta = 1, \tag{C.1b}$$
where $\Sigma_B$ and $\Sigma_W$ are respectively the between-class variance and the within-class variance of the original $p$-dimensional data.

The Lagrangian of Problem (C.1) is
$$L(\beta,\nu) = \beta^\top\Sigma_B\beta - \nu\big(\beta^\top\Sigma_W\beta - 1\big),$$
so that its first derivative with respect to $\beta$ is
$$\frac{\partial L(\beta,\nu)}{\partial\beta} = 2\Sigma_B\beta - 2\nu\Sigma_W\beta.$$
A necessary optimality condition for $\beta^\star$ is that this derivative is zero, that is,
$$\Sigma_B\beta^\star = \nu\Sigma_W\beta^\star.$$
Provided $\Sigma_W$ is full rank, we have
$$\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\beta^\star. \tag{C.2}$$
Thus the solutions $\beta^\star$ match the definition of an eigenvector of the matrix $\Sigma_W^{-1}\Sigma_B$, with eigenvalue $\nu$. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:
$$\beta^{\star\top}\Sigma_B\beta^\star = \beta^{\star\top}\Sigma_W\Sigma_W^{-1}\Sigma_B\beta^\star = \nu\,\beta^{\star\top}\Sigma_W\beta^\star \ \ \text{from (C.2)} \ =\ \nu \ \ \text{from (C.1b)}.$$
That is, the optimal value of the objective function to be maximized is the eigenvalue $\nu$. Hence $\nu$ is the largest eigenvalue of $\Sigma_W^{-1}\Sigma_B$, and $\beta^\star$ is any eigenvector corresponding to this maximal eigenvalue.
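For illustration, a small NumPy sketch of this eigen-decomposition on toy data (assuming $\Sigma_W$ is full rank; names are illustrative) is given below.

    # Fisher's discriminant directions as eigenvectors of Sigma_W^{-1} Sigma_B.
    import numpy as np

    def fisher_directions(X, y):
        n, p = X.shape
        xbar = X.mean(axis=0)
        S_W = np.zeros((p, p))
        S_B = np.zeros((p, p))
        for k in np.unique(y):
            Xk = X[y == k]
            mu_k = Xk.mean(axis=0)
            S_W += (Xk - mu_k).T @ (Xk - mu_k) / n          # within-class variance
            S_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar) / n  # between-class
        evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
        order = np.argsort(evals.real)[::-1]
        return evecs[:, order].real, evals[order].real

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(2, 1, (20, 4))])
    y = np.repeat([0, 1], 20)
    W, lams = fisher_directions(X, y)
    beta = W[:, 0]          # leading eigenvector = Fisher's direction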


                D Alternative Variational Formulation forthe Group-Lasso

In this appendix, an alternative to the variational form of the group-Lasso (4.21) presented in Section 4.3.1 is proposed:
$$\min_{\tau\in\mathbb{R}^p}\ \min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j} \tag{D.1a}$$
$$\text{s.t.}\quad \sum_{j=1}^{p}\tau_j = 1, \tag{D.1b}$$
$$\qquad\ \ \tau_j\ge 0,\ \ j=1,\dots,p. \tag{D.1c}$$
Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let $B\in\mathbb{R}^{p\times(K-1)}$ be a matrix composed of row vectors $\beta^j\in\mathbb{R}^{K-1}$: $B = \big(\beta^{1\top},\dots,\beta^{p\top}\big)^\top$.

The starting point is the Lagrangian
$$L(B,\tau,\lambda,\nu_0,\nu_j) = J(B) + \lambda\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j} + \nu_0\Big(\sum_{j=1}^{p}\tau_j - 1\Big) - \sum_{j=1}^{p}\nu_j\tau_j, \tag{D.2}$$
which is differentiated with respect to $\tau_j$ to get the optimal value $\tau_j^\star$:
$$\frac{\partial L(B,\tau,\lambda,\nu_0,\nu_j)}{\partial\tau_j}\bigg|_{\tau_j=\tau_j^\star} = 0 \;\Rightarrow\; -\lambda\frac{w_j^2\|\beta^j\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0 \;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} - \nu_j\tau_j^{\star 2} = 0 \;\Rightarrow\; -\lambda w_j^2\|\beta^j\|_2^2 + \nu_0\tau_j^{\star 2} = 0.$$
The last two expressions are related through a property of the Lagrange multipliers, which states that $\nu_j g_j(\tau^\star) = 0$, where $\nu_j$ is the Lagrange multiplier and $g_j(\tau)$ is the inequality constraint. Then the optimal $\tau_j^\star$ can be deduced:
$$\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j\|\beta^j\|_2.$$
Plugging this optimal value of $\tau_j$ into constraint (D.1b):
$$\sum_{j=1}^{p}\tau_j = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}. \tag{D.3}$$
With this value of $\tau_j^\star$, Problem (D.1) is equivalent to
$$\min_{B\in\mathbb{R}^{p\times(K-1)}}\ J(B) + \lambda\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)^2. \tag{D.4}$$
This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null row vectors $\beta^j$.

The penalty term of (D.1a) can be conveniently written as $\lambda B^\top\Omega B$, where
$$\Omega = \operatorname{diag}\Big(\frac{w_1^2}{\tau_1},\frac{w_2^2}{\tau_2},\dots,\frac{w_p^2}{\tau_p}\Big). \tag{D.5}$$
Using the value of $\tau_j^\star$ from (D.3), each diagonal component of $\Omega$ is
$$(\Omega)_{jj} = \frac{w_j\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}{\|\beta^j\|_2}. \tag{D.6}$$
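A small sketch of the computation of $\tau^\star$ (D.3) and of the diagonal of $\Omega$ (D.6) from a current coefficient matrix B could read as follows (illustrative names; rows already at zero receive an effectively infinite penalty, guarded here by a small constant).

    import numpy as np

    def tau_and_omega(B, w, eps=1e-12):
        norms = np.linalg.norm(B, axis=1)          # ||beta^j||_2 for each row j
        weighted = w * norms
        tau = weighted / (weighted.sum() + eps)    # (D.3)
        omega_diag = w**2 / np.maximum(tau, eps)   # (Omega)_jj = w_j^2 / tau_j, cf. (D.6)
        return tau, np.diag(omega_diag)

    rng = np.random.default_rng(0)
    B = rng.normal(size=(6, 2))
    B[3:] = 0.0                                    # zeroed rows: huge penalty keeps them at zero
    tau, Omega = tau_and_omega(B, np.ones(6))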

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If $J$ is convex, Problem (D.1) is convex.

In what follows, $J$ will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all $B\in\mathbb{R}^{p\times(K-1)}$, the subdifferential of the objective function of Problem (D.4) is the set of matrices
$$V = \frac{\partial J(B)}{\partial B} + 2\lambda\Big(\sum_{j=1}^{p} w_j\|\beta^j\|_2\Big)G, \tag{D.7}$$
where $G\in\mathbb{R}^{p\times(K-1)}$ is the matrix with rows $g^j$, $j=1,\dots,p$, defined as follows. Let $S(B)$ denote the row-wise support of $B$, $S(B) = \{j\in\{1,\dots,p\} : \|\beta^j\|_2\neq 0\}$; then we have
$$\forall j\in S(B),\quad g^j = w_j\|\beta^j\|_2^{-1}\beta^j, \tag{D.8}$$
$$\forall j\notin S(B),\quad \|g^j\|_2\le w_j. \tag{D.9}$$

This condition results in an equality for the "active" non-zero vectors $\beta^j$ and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if $J(B)$ is strictly convex. All critical points $B^\star$ of the objective function verifying the following conditions are global minima. Let $S(B^\star)$ denote the row-wise support of $B^\star$, $S(B^\star) = \{j\in\{1,\dots,p\} : \|\beta^{\star j}\|_2\neq 0\}$, and let $\bar S(B^\star)$ be its complement; then we have
$$\forall j\in S(B^\star),\quad -\frac{\partial J(B^\star)}{\partial\beta^j} = 2\lambda\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big)\, w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j}, \tag{D.10a}$$
$$\forall j\in\bar S(B^\star),\quad \Big\|\frac{\partial J(B^\star)}{\partial\beta^j}\Big\|_2 \le 2\lambda w_j\Big(\sum_{j'=1}^{p} w_{j'}\|\beta^{\star j'}\|_2\Big). \tag{D.10b}$$

In particular, Lemma D.3 provides a well-defined appraisal of the support of the solution, which is not easily handled from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and for a given $B$, the gap between these objectives is null at $\tau^\star$ such that
$$\tau_j^\star = \frac{w_j\|\beta^j\|_2}{\sum_{j'=1}^{p}w_{j'}\|\beta^{j'}\|_2}.$$

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\tau\in\mathbb{R}^p$ be any feasible vector; we have
$$\Big(\sum_{j=1}^{p}w_j\|\beta^j\|_2\Big)^2 = \Big(\sum_{j=1}^{p}\tau_j^{1/2}\,\frac{w_j\|\beta^j\|_2}{\tau_j^{1/2}}\Big)^2 \le \Big(\sum_{j=1}^{p}\tau_j\Big)\Big(\sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j}\Big) \le \sum_{j=1}^{p}\frac{w_j^2\|\beta^j\|_2^2}{\tau_j},$$
where we used the Cauchy-Schwarz inequality for the first inequality and the definition of the feasibility set of $\tau$ for the second one.

This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\tau$ and $\beta$ are intertwined.


                E Invariance of the Group-Lasso to UnitaryTransformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $B^0$ are optimal for the score values $\Theta^0$, and if the optimal scores $\Theta^\star$ are obtained by a unitary transformation of $\Theta^0$, say $\Theta^\star = \Theta^0 V$ (where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix), then $B^\star = B^0 V$ is optimal conditionally on $\Theta^\star$, that is, $(\Theta^\star, B^\star)$ is a global solution corresponding to the optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $B^\star$ be a solution of
$$\min_{B\in\mathbb{R}^{p\times M}}\ \|Y-XB\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2, \tag{E.1}$$
and let $\tilde Y = YV$, where $V\in\mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde B = B^\star V$ is a solution of
$$\min_{B\in\mathbb{R}^{p\times M}}\ \|\tilde Y-XB\|_F^2 + \lambda\sum_{j=1}^{p}w_j\|\beta^j\|_2. \tag{E.2}$$

Proof. The first-order necessary optimality conditions for $B^\star$ are
$$\forall j\in S(B^\star),\quad 2\,x_j^\top\big(XB^\star - Y\big) + \lambda w_j\|\beta^{\star j}\|_2^{-1}\beta^{\star j} = 0, \tag{E.3a}$$
$$\forall j\in\bar S(B^\star),\quad 2\,\big\|x_j^\top\big(XB^\star - Y\big)\big\|_2 \le \lambda w_j, \tag{E.3b}$$
where $x_j$ denotes the $j$-th column of $X$, $S(B^\star)\subseteq\{1,\dots,p\}$ denotes the set of non-zero row vectors of $B^\star$, and $\bar S(B^\star)$ is its complement.

First, we note that, from the definition of $\tilde B$, we have $S(\tilde B) = S(B^\star)$. Then we may rewrite the above conditions as follows:
$$\forall j\in S(\tilde B),\quad 2\,x_j^\top\big(X\tilde B - \tilde Y\big) + \lambda w_j\|\tilde\beta^{j}\|_2^{-1}\tilde\beta^{j} = 0, \tag{E.4a}$$
$$\forall j\in\bar S(\tilde B),\quad 2\,\big\|x_j^\top\big(X\tilde B - \tilde Y\big)\big\|_2 \le \lambda w_j, \tag{E.4b}$$
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $V$ on the right, and also uses $VV^\top = I$, so that $\forall u\in\mathbb{R}^M$, $\|u^\top\|_2 = \|u^\top V\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde B$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
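The key fact behind this proof, namely that the group-Lasso objective is unchanged when Y and B are both multiplied on the right by a unitary V, can be checked numerically with the following sketch (illustrative data and names).

    import numpy as np

    def objective(Y, X, B, lam, w):
        # Frobenius data-fit term plus weighted sum of row norms (group-Lasso penalty)
        return np.linalg.norm(Y - X @ B, 'fro')**2 + lam * np.sum(w * np.linalg.norm(B, axis=1))

    rng = np.random.default_rng(0)
    n, p, M = 40, 6, 3
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, M))
    B = rng.normal(size=(p, M))
    w = np.ones(p)
    V, _ = np.linalg.qr(rng.normal(size=(M, M)))   # a unitary (orthogonal) matrix
    print(np.isclose(objective(Y, X, B, 0.5, w),
                     objective(Y @ V, X, B @ V, 0.5, w)))   # True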


                F Expected Complete Likelihood andLikelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed using its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:
$$L(\theta) = \sum_{i=1}^{n}\log\Big(\sum_{k=1}^{K}\pi_k f_k(x_i;\theta_k)\Big), \tag{F.1}$$
$$Q(\theta,\theta') = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big), \tag{F.2}$$
$$\text{with}\quad t_{ik}(\theta') = \frac{\pi_k' f_k(x_i;\theta_k')}{\sum_\ell \pi_\ell' f_\ell(x_i;\theta_\ell')}. \tag{F.3}$$
In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probabilities computed from $\theta'$ at the previous E-step, and $\theta$ (without prime) denotes the parameters of the current iteration, obtained by the maximization of $Q(\theta,\theta')$.

Using (F.3), we have
$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) = \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + \sum_{i,k} t_{ik}(\theta')\log\Big(\sum_\ell \pi_\ell f_\ell(x_i;\theta_\ell)\Big) = \sum_{i,k} t_{ik}(\theta')\log\big(t_{ik}(\theta)\big) + L(\theta).$$
In particular, after the evaluation of the $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:
$$L(\theta) = Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta)\log\big(t_{ik}(\theta)\big) = Q(\theta,\theta) + H(T).$$
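A small numerical sketch of this shortcut, for a toy Gaussian mixture with common covariance (illustrative names, using scipy's multivariate normal density), is given below; the direct evaluation of (F.1) and the value Q(θ, θ) + H(T) should coincide.

    import numpy as np
    from scipy.stats import multivariate_normal

    def loglik_and_Q(X, pi, mus, Sigma):
        # densities pi_k f_k(x_i) for all i, k
        dens = np.column_stack([pi[k] * multivariate_normal(mus[k], Sigma).pdf(X)
                                for k in range(len(pi))])
        T = dens / dens.sum(axis=1, keepdims=True)        # E-step responsibilities
        L_direct = np.sum(np.log(dens.sum(axis=1)))       # (F.1)
        Q = np.sum(T * np.log(dens))                      # (F.2) with theta = theta'
        H = -np.sum(T * np.log(np.maximum(T, 1e-300)))    # entropy of posteriors
        return L_direct, Q + H                            # the two values should match

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    print(loglik_and_Q(X, np.array([0.5, 0.5]), [np.zeros(2), 3 * np.ones(2)], np.eye(2)))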


                G Derivation of the M-Step Equations

This appendix shows the whole process to obtain expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion to maximize is
$$Q(\theta,\theta') = \sum_{i,k} t_{ik}(\theta')\log\big(\pi_k f_k(x_i;\theta_k)\big) = \sum_{k}\Big(\sum_i t_{ik}\Big)\log\pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)^\top\Sigma^{-1}(x_i-\mu_k),$$
which has to be maximized subject to $\sum_k \pi_k = 1$.

The Lagrangian of this problem is
$$L(\theta) = Q(\theta,\theta') + \lambda\Big(\sum_k \pi_k - 1\Big).$$
Partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\mu_k$ and $\Sigma$.

G.1 Prior Probabilities

$$\frac{\partial L(\theta)}{\partial\pi_k} = 0 \iff \frac{1}{\pi_k}\sum_i t_{ik} + \lambda = 0,$$
where $\lambda$ is identified from the constraint, leading to
$$\hat\pi_k = \frac{1}{n}\sum_i t_{ik}.$$

G.2 Means

$$\frac{\partial L(\theta)}{\partial\mu_k} = 0 \iff -\frac{1}{2}\sum_i t_{ik}\,2\,\Sigma^{-1}(\mu_k - x_i) = 0 \;\Rightarrow\; \hat\mu_k = \frac{\sum_i t_{ik}\,x_i}{\sum_i t_{ik}}.$$

G.3 Covariance Matrix

$$\frac{\partial L(\theta)}{\partial\Sigma^{-1}} = 0 \iff \underbrace{\frac{n}{2}\,\Sigma}_{\text{as per Property 4}} - \underbrace{\frac{1}{2}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top}_{\text{as per Property 5}} = 0 \;\Rightarrow\; \hat\Sigma = \frac{1}{n}\sum_{i,k} t_{ik}(x_i-\mu_k)(x_i-\mu_k)^\top.$$
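A compact NumPy sketch of these M-step updates, given a matrix T of responsibilities $t_{ik}$ (illustrative names and dummy data), could be written as follows.

    # M-step for a Gaussian mixture with a common covariance matrix.
    import numpy as np

    def m_step(X, T):
        n, p = X.shape
        K = T.shape[1]
        nk = T.sum(axis=0)                       # soft counts per component
        pi = nk / n                              # G.1: prior probabilities
        mus = (T.T @ X) / nk[:, None]            # G.2: weighted means
        Sigma = np.zeros((p, p))                 # G.3: pooled covariance
        for k in range(K):
            Xc = X - mus[k]
            Sigma += (T[:, k, None] * Xc).T @ Xc
        return pi, mus, Sigma / n

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    T = rng.dirichlet(np.ones(2), size=100)      # dummy responsibilities
    pi, mus, Sigma = m_step(X, T)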


                Bibliography

                F Bach R Jenatton J Mairal and G Obozinski Convex optimization with sparsity-inducing norms Optimization for Machine Learning pages 19ndash54 2011

                F R Bach Bolasso model consistent lasso estimation through the bootstrap InProceedings of the 25th international conference on Machine learning ICML 2008

                F R Bach R Jenatton J Mairal and G Obozinski Optimization with sparsity-inducing penalties Foundations and Trends in Machine Learning 4(1)1ndash106 2012

                J D Banfield and A E Raftery Model-based Gaussian and non-Gaussian clusteringBiometrics pages 803ndash821 1993

                A Beck and M Teboulle A fast iterative shrinkage-thresholding algorithm for linearinverse problems SIAM Journal on Imaging Sciences 2(1)183ndash202 2009

                H Bensmail and G Celeux Regularized Gaussian discriminant analysis through eigen-value decomposition Journal of the American statistical Association 91(436)1743ndash1748 1996

                P J Bickel and E Levina Some theory for Fisherrsquos linear discriminant function lsquonaiveBayesrsquo and some alternatives when there are many more variables than observationsBernoulli 10(6)989ndash1010 2004

C Biernacki, G Celeux, G Govaert, and F Langrognet. MIXMOD Statistical Documentation. http://www.mixmod.org, 2008

                C M Bishop Pattern Recognition and Machine Learning Springer New York 2006

                C Bouveyron and C Brunet Discriminative variable selection for clustering with thesparse Fisher-EM algorithm Technical Report 12042067 Arxiv e-prints 2012a

                C Bouveyron and C Brunet Simultaneous model-based clustering and visualization inthe Fisher discriminative subspace Statistics and Computing 22(1)301ndash324 2012b

                S Boyd and L Vandenberghe Convex optimization Cambridge university press 2004

                L Breiman Better subset regression using the nonnegative garrote Technometrics 37(4)373ndash384 1995

                L Breiman and R Ihaka Nonlinear discriminant analysis via ACE and scaling TechnicalReport 40 University of California Berkeley 1984


                T Cai and W Liu A direct estimation approach to sparse linear discriminant analysisJournal of the American Statistical Association 106(496)1566ndash1577 2011

                S Canu and Y Grandvalet Outcomes of the equivalence of adaptive ridge with leastabsolute shrinkage Advances in Neural Information Processing Systems page 4451999

                C Caramanis S Mannor and H Xu Robust optimization in machine learning InS Sra S Nowozin and S J Wright editors Optimization for Machine Learningpages 369ndash402 MIT Press 2012

                B Chidlovskii and L Lecerf Scalable feature selection for multi-class problems InW Daelemans B Goethals and K Morik editors Machine Learning and KnowledgeDiscovery in Databases volume 5211 of Lecture Notes in Computer Science pages227ndash240 Springer 2008

                L Clemmensen T Hastie D Witten and B Ersboslashll Sparse discriminant analysisTechnometrics 53(4)406ndash413 2011

                C De Mol E De Vito and L Rosasco Elastic-net regularization in learning theoryJournal of Complexity 25(2)201ndash230 2009

                A P Dempster N M Laird and D B Rubin Maximum likelihood from incompletedata via the em algorithm Journal of the Royal Statistical Society Series B (Method-ological) 39(1)1ndash38 1977 ISSN 0035-9246

                D L Donoho M Elad and V N Temlyakov Stable recovery of sparse overcompleterepresentations in the presence of noise IEEE Transactions on Information Theory52(1)6ndash18 2006

                R O Duda P E Hart and D G Stork Pattern Classification Wiley 2000

                B Efron T Hastie I Johnstone and R Tibshirani Least angle regression The Annalsof statistics 32(2)407ndash499 2004

                Jianqing Fan and Yingying Fan High dimensional classification using features annealedindependence rules Annals of statistics 36(6)2605 2008

                R A Fisher The use of multiple measurements in taxonomic problems Annals ofHuman Genetics 7(2)179ndash188 1936

                V Franc and S Sonnenburg Optimized cutting plane algorithm for support vectormachines In Proceedings of the 25th international conference on Machine learningpages 320ndash327 ACM 2008

                J Friedman T Hastie and R Tibshirani The Elements of Statistical Learning DataMining Inference and Prediction Springer 2009


                J Friedman T Hastie and R Tibshirani A note on the group lasso and a sparse grouplasso Technical Report 10010736 ArXiv e-prints 2010

                J H Friedman Regularized discriminant analysis Journal of the American StatisticalAssociation 84(405)165ndash175 1989

                W J Fu Penalized regressions the bridge versus the lasso Journal of Computationaland Graphical Statistics 7(3)397ndash416 1998

A Gelman, J B Carlin, H S Stern, and D B Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2003

                D Ghosh and A M Chinnaiyan Classification and selection of biomarkers in genomicdata using lasso Journal of Biomedicine and Biotechnology 2147ndash154 2005

G Govaert, Y Grandvalet, X Liu, and L F Sanchez Merchante. Implementation baseline for clustering. Technical Report D71-m12, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D71-m12.pdf, 2010

G Govaert, Y Grandvalet, B Laval, X Liu, and L F Sanchez Merchante. Implementations of original clustering. Technical Report D72-m24, Massive Sets of Heuristics for Machine Learning, https://secure.mash-project.eu/files/mash-deliverable-D72-m24.pdf, 2011

                Y Grandvalet Least absolute shrinkage is equivalent to quadratic penalization InPerspectives in Neural Computing volume 98 pages 201ndash206 1998

                Y Grandvalet and S Canu Adaptive scaling for feature selection in svms Advances inNeural Information Processing Systems 15553ndash560 2002

                L Grosenick S Greer and B Knutson Interpretable classifiers for fMRI improveprediction of purchases IEEE Transactions on Neural Systems and RehabilitationEngineering 16(6)539ndash548 2008

                Y Guermeur G Pollastri A Elisseeff D Zelus H Paugam-Moisy and P Baldi Com-bining protein secondary structure prediction models with ensemble methods of opti-mal complexity Neurocomputing 56305ndash327 2004

                J Guo E Levina G Michailidis and J Zhu Pairwise variable selection for high-dimensional model-based clustering Biometrics 66(3)793ndash804 2010

                I Guyon and A Elisseeff An introduction to variable and feature selection Journal ofMachine Learning Research 31157ndash1182 2003

                T Hastie and R Tibshirani Discriminant analysis by Gaussian mixtures Journal ofthe Royal Statistical Society Series B (Methodological) 58(1)155ndash176 1996

                T Hastie R Tibshirani and A Buja Flexible discriminant analysis by optimal scoringJournal of the American Statistical Association 89(428)1255ndash1270 1994


                T Hastie A Buja and R Tibshirani Penalized discriminant analysis The Annals ofStatistics 23(1)73ndash102 1995

                A E Hoerl and R W Kennard Ridge regression Biased estimation for nonorthogonalproblems Technometrics 12(1)55ndash67 1970

                J Huang S Ma H Xie and C H Zhang A group bridge approach for variableselection Biometrika 96(2)339ndash355 2009

                T Joachims Training linear svms in linear time In Proceedings of the 12th ACMSIGKDD international conference on Knowledge discovery and data mining pages217ndash226 ACM 2006

                K Knight and W Fu Asymptotics for lasso-type estimators The Annals of Statistics28(5)1356ndash1378 2000

                P F Kuan S Wang X Zhou and H Chu A statistical framework for illumina DNAmethylation arrays Bioinformatics 26(22)2849ndash2855 2010

                T Lange M Braun V Roth and J Buhmann Stability-based model selection Ad-vances in Neural Information Processing Systems 15617ndash624 2002

                M H C Law M A T Figueiredo and A K Jain Simultaneous feature selectionand clustering using mixture models IEEE Transactions on Pattern Analysis andMachine Intelligence 26(9)1154ndash1166 2004

                Y Lee Y Lin and G Wahba Multicategory support vector machines Journal of theAmerican Statistical Association 99(465)67ndash81 2004

                C Leng Sparse optimal scoring for multiclass cancer diagnosis and biomarker detectionusing microarray data Computational Biology and Chemistry 32(6)417ndash425 2008

                C Leng Y Lin and G Wahba A note on the lasso and related procedures in modelselection Statistica Sinica 16(4)1273 2006

                H Liu and L Yu Toward integrating feature selection algorithms for classification andclustering IEEE Transactions on Knowledge and Data Engineering 17(4)491ndash5022005

                J MacQueen Some methods for classification and analysis of multivariate observa-tions In Proceedings of the fifth Berkeley Symposium on Mathematical Statistics andProbability volume 1 pages 281ndash297 University of California Press 1967

                Q Mai H Zou and M Yuan A direct approach to sparse discriminant analysis inultra-high dimensions Biometrika 99(1)29ndash42 2012

                C Maugis G Celeux and M L Martin-Magniette Variable selection for clusteringwith Gaussian mixture models Biometrics 65(3)701ndash709 2009a


C Maugis, G Celeux, and M L Martin-Magniette. SelvarClust: software for variable selection in model-based clustering. http://www.math.univ-toulouse.fr/~maugis/SelvarClustHomepage.html, 2009b

                L Meier S Van De Geer and P Buhlmann The group lasso for logistic regressionJournal of the Royal Statistical Society Series B (Statistical Methodology) 70(1)53ndash71 2008

                N Meinshausen and P Buhlmann High-dimensional graphs and variable selection withthe lasso The Annals of Statistics 34(3)1436ndash1462 2006

                B Moghaddam Y Weiss and S Avidan Generalized spectral bounds for sparse LDAIn Proceedings of the 23rd international conference on Machine learning pages 641ndash648 ACM 2006

                B Moghaddam Y Weiss and S Avidan Fast pixelpart selection with sparse eigen-vectors In IEEE 11th International Conference on Computer Vision 2007 ICCV2007 pages 1ndash8 2007

                Y Nesterov Gradient methods for minimizing composite functions preprint 2007

                S Newcomb A generalized theory of the combination of observations so as to obtainthe best result American Journal of Mathematics 8(4)343ndash366 1886

B Ng and R Abugharbieh. Generalized group sparse classifiers with application in fMRI brain decoding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1065-1071. IEEE, 2011

                M R Osborne B Presnell and B A Turlach On the lasso and its dual Journal ofComputational and Graphical statistics 9(2)319ndash337 2000a

                M R Osborne B Presnell and B A Turlach A new approach to variable selection inleast squares problems IMA Journal of Numerical Analysis 20(3)389ndash403 2000b

                W Pan and X Shen Penalized model-based clustering with application to variableselection Journal of Machine Learning Research 81145ndash1164 2007

                W Pan X Shen A Jiang and R P Hebbel Semi-supervised learning via penalizedmixture model with application to microarray sample classification Bioinformatics22(19)2388ndash2395 2006

                K Pearson Contributions to the mathematical theory of evolution Philosophical Trans-actions of the Royal Society of London 18571ndash110 1894

                S Perkins K Lacker and J Theiler Grafting Fast incremental feature selection bygradient descent in function space Journal of Machine Learning Research 31333ndash1356 2003


                Z Qiao L Zhou and J Huang Sparse linear discriminant analysis with applications tohigh dimensional low sample size data International Journal of Applied Mathematics39(1) 2009

                A E Raftery and N Dean Variable selection for model-based clustering Journal ofthe American Statistical Association 101(473)168ndash178 2006

                C R Rao The utilization of multiple measurements in problems of biological classi-fication Journal of the Royal Statistical Society Series B (Methodological) 10(2)159ndash203 1948

                S Rosset and J Zhu Piecewise linear regularized solution paths The Annals of Statis-tics 35(3)1012ndash1030 2007

                V Roth The generalized lasso IEEE Transactions on Neural Networks 15(1)16ndash282004

                V Roth and B Fischer The group-lasso for generalized linear models uniqueness ofsolutions and efficient algorithms In W W Cohen A McCallum and S T Roweiseditors Machine Learning Proceedings of the Twenty-Fifth International Conference(ICML 2008) volume 307 of ACM International Conference Proceeding Series pages848ndash855 2008

                V Roth and T Lange Feature selection in clustering problems In S Thrun L KSaul and B Scholkopf editors Advances in Neural Information Processing Systems16 pages 473ndash480 MIT Press 2004

                C Sammut and G I Webb Encyclopedia of Machine Learning Springer-Verlag NewYork Inc 2010

                L F Sanchez Merchante Y Grandvalet and G Govaert An efficient approach to sparselinear discriminant analysis In Proceedings of the 29th International Conference onMachine Learning ICML 2012

                Gideon Schwarz Estimating the dimension of a model The annals of statistics 6(2)461ndash464 1978

                A J Smola SVN Vishwanathan and Q Le Bundle methods for machine learningAdvances in Neural Information Processing Systems 201377ndash1384 2008

                S Sonnenburg G Ratsch C Schafer and B Scholkopf Large scale multiple kernellearning Journal of Machine Learning Research 71531ndash1565 2006

                P Sprechmann I Ramirez G Sapiro and Y Eldar Collaborative hierarchical sparsemodeling In Information Sciences and Systems (CISS) 2010 44th Annual Conferenceon pages 1ndash6 IEEE 2010

                M Szafranski Penalites Hierarchiques pour lrsquoIntegration de Connaissances dans lesModeles Statistiques PhD thesis Universite de Technologie de Compiegne 2008


                M Szafranski Y Grandvalet and P Morizet-Mahoudeaux Hierarchical penalizationAdvances in Neural Information Processing Systems 2008

                R Tibshirani Regression shrinkage and selection via the lasso Journal of the RoyalStatistical Society Series B (Methodological) pages 267ndash288 1996

                J E Vogt and V Roth The group-lasso l1 regularization versus l12 regularization InPattern Recognition 32-nd DAGM Symposium Lecture Notes in Computer Science2010

                S Wang and J Zhu Variable selection for model-based high-dimensional clustering andits application to microarray data Biometrics 64(2)440ndash448 2008

                D Witten and R Tibshirani Penalized classification using Fisherrsquos linear discriminantJournal of the Royal Statistical Society Series B (Statistical Methodology) 73(5)753ndash772 2011

                D M Witten and R Tibshirani A framework for feature selection in clustering Journalof the American Statistical Association 105(490)713ndash726 2010

                D M Witten R Tibshirani and T Hastie A penalized matrix decomposition withapplications to sparse principal components and canonical correlation analysis Bio-statistics 10(3)515ndash534 2009

                M Wu and B Scholkopf A local learning approach for clustering Advances in NeuralInformation Processing Systems 191529 2007

                MC Wu L Zhang Z Wang DC Christiani and X Lin Sparse linear discriminantanalysis for simultaneous testing for the significance of a gene setpathway and geneselection Bioinformatics 25(9)1145ndash1151 2009

                T T Wu and K Lange Coordinate descent algorithms for lasso penalized regressionThe Annals of Applied Statistics pages 224ndash244 2008

                B Xie W Pan and X Shen Penalized model-based clustering with cluster-specificdiagonal covariance matrices and grouped variables Electronic Journal of Statistics2168ndash172 2008a

                B Xie W Pan and X Shen Variable selection in penalized model-based clustering viaregularization on grouped parameters Biometrics 64(3)921ndash930 2008b

                C Yang X Wan Q Yang H Xue and W Yu Identifying main effects and epistaticinteractions from large-scale snp data via adaptive group lasso BMC bioinformatics11(Suppl 1)S18 2010

                J Ye Least squares linear discriminant analysis In Proceedings of the 24th internationalconference on Machine learning pages 1087ndash1093 ACM 2007


                M Yuan and Y Lin Model selection and estimation in regression with grouped variablesJournal of the Royal Statistical Society Series B (Statistical Methodology) 68(1)49ndash67 2006

                P Zhao and B Yu On model selection consistency of lasso Journal of Machine LearningResearch 7(2)2541 2007

                P Zhao G Rocha and B Yu The composite absolute penalties family for grouped andhierarchical variable selection The Annals of Statistics 37(6A)3468ndash3497 2009

                H Zhou W Pan and X Shen Penalized model-based clustering with unconstrainedcovariance matrices Electronic Journal of Statistics 31473ndash1496 2009

                H Zou The adaptive lasso and its oracle properties Journal of the American StatisticalAssociation 101(476)1418ndash1429 2006

                H Zou and T Hastie Regularization and variable selection via the elastic net Journal ofthe Royal Statistical Society Series B (Statistical Methodology) 67(2)301ndash320 2005



                  List of Figures

1.1  MASH project logo

2.1  Example of relevant features
2.2  The four key steps of feature selection
2.3  Admissible sets in two dimensions for different pure norms $\|\beta\|_p$
2.4  Two dimensional regularized problems with $\|\beta\|_1$ and $\|\beta\|_2$ penalties
2.5  Admissible sets for the Lasso and Group-Lasso
2.6  Sparsity patterns for an example with 8 variables characterized by 4 parameters

4.1  Graphical representation of the variational approach to Group-Lasso

5.1  GLOSS block diagram
5.2  Graph and Laplacian matrix for a 3×3 image

6.1  TPR versus FPR for all simulations
6.2  2D-representations of Nakayama and Sun datasets based on the two first discriminant vectors provided by GLOSS and SLDA
6.3  USPS digits "1" and "0"
6.4  Discriminant direction between digits "1" and "0"
6.5  Sparse discriminant direction between digits "1" and "0"

9.1  Mix-GLOSS Loops Scheme
9.2  Mix-GLOSS model selection diagram

10.1 Class mean vectors for each artificial simulation
10.2 TPR versus FPR for all simulations

                  List of Tables

6.1  Experimental results for simulated data, supervised classification
6.2  Average TPR and FPR for all simulations
6.3  Experimental results for gene expression data, supervised classification

10.1 Experimental results for simulated data, unsupervised clustering
10.2 Average TPR versus FPR for all clustering simulations

                  Notation and Symbols

Throughout this thesis, vectors are denoted by lowercase letters in bold font and matrices by uppercase letters in bold font. Unless otherwise stated, vectors are column vectors, and parentheses are used to build line vectors from comma-separated lists of scalars, or to build matrices from comma-separated lists of column vectors.

Sets

N          the set of natural numbers, N = {1, 2, ...}
R          the set of reals
|A|        cardinality of a set A (for finite sets, the number of elements)
Ā          complement of set A

Data

X          input domain
x_i        input sample, x_i ∈ X
X          design matrix, X = (x_1^T, ..., x_n^T)^T
x^j        column j of X
y_i        class indicator of sample i
Y          indicator matrix, Y = (y_1^T, ..., y_n^T)^T
z          complete data, z = (x, y)
G_k        set of the indices of observations belonging to class k
n          number of examples
K          number of classes
p          dimension of X
i, j, k    indices, running over N

Vectors, Matrices and Norms

0          vector with all entries equal to zero
1          vector with all entries equal to one
I          identity matrix
A^T        transpose of matrix A (ditto for vectors)
A^{-1}     inverse of matrix A
tr(A)      trace of matrix A
|A|        determinant of matrix A
diag(v)    diagonal matrix with v on the diagonal
‖v‖_1      L1 norm of vector v
‖v‖_2      L2 norm of vector v
‖A‖_F      Frobenius norm of matrix A


                  Probability

E[·]          expectation of a random variable
var[·]        variance of a random variable
N(μ, σ²)      normal distribution with mean μ and variance σ²
W(W, ν)       Wishart distribution with ν degrees of freedom and scale matrix W
H(X)          entropy of random variable X
I(X; Y)       mutual information between random variables X and Y

Mixture Models

y_ik          hard membership of sample i to cluster k
f_k           distribution function for cluster k
t_ik          posterior probability of sample i to belong to cluster k
T             posterior probability matrix
π_k           prior probability or mixture proportion for cluster k
μ_k           mean vector of cluster k
Σ_k           covariance matrix of cluster k
θ_k           parameter vector for cluster k, θ_k = (μ_k, Σ_k)
θ^(t)         parameter vector at iteration t of the EM algorithm
f(X; θ)       likelihood function
L(θ; X)       log-likelihood function
L_C(θ; X, Y)  complete log-likelihood function

Optimization

J(·)          cost function
L(·)          Lagrangian
β̂             generic notation for the solution with respect to β
β^ls          least squares solution coefficient vector
A             active set
γ             step size to update the regularization path
h             direction to update the regularization path


                  Penalized models

λ, λ₁, λ₂     penalty parameters
P_λ(θ)        penalty term over a generic parameter vector
β_kj          coefficient j of discriminant vector k
β_k           kth discriminant vector, β_k = (β_k1, ..., β_kp)
B             matrix of discriminant vectors, B = (β_1, ..., β_{K-1})
β^j           jth row of B, B = (β^{1T}, ..., β^{pT})^T
B_LDA         coefficient matrix in the LDA domain
B_CCA         coefficient matrix in the CCA domain
B_OS          coefficient matrix in the OS domain
X_LDA         data matrix in the LDA domain
X_CCA         data matrix in the CCA domain
X_OS          data matrix in the OS domain
θ_k           score vector k
Θ             score matrix, Θ = (θ_1, ..., θ_{K-1})
Y             label matrix
Ω             penalty matrix
L_CP(θ; X, Z) penalized complete log-likelihood function
Σ_B           between-class covariance matrix
Σ_W           within-class covariance matrix
Σ_T           total covariance matrix
Σ̂_B           sample between-class covariance matrix
Σ̂_W           sample within-class covariance matrix
Σ̂_T           sample total covariance matrix
Λ             inverse of the covariance matrix, or precision matrix
w_j           weights
τ_j           penalty components of the variational approach


                  Part I

                  Context and Foundations


This thesis is divided into three parts. In Part I, I introduce the context in which this work has been developed, the project that funded it and the constraints that we had to obey. Generic background is also detailed here to introduce the models and some basic concepts that will be used along this document. The state of the art is also reviewed.

The first contribution of this thesis is explained in Part II, where I present the supervised learning algorithm GLOSS and its supporting theory, as well as some experiments to test its performance compared to other state-of-the-art mechanisms. Before describing the algorithm and the experiments, its theoretical foundations are provided.

The second contribution is described in Part III, with a structure analogous to that of Part II but for the unsupervised domain. The clustering algorithm Mix-GLOSS adapts the supervised technique from Part II by means of a modified EM. This part is also furnished with specific theoretical foundations, an experimental section and a final discussion.


                  1 Context

The MASH project is a research initiative to investigate the open and collaborative design of feature extractors for the Machine Learning scientific community. The project is structured around a web platform (http://mash-project.eu) comprising collaborative tools such as wiki-documentation, forums, coding templates and an experiment center empowered with non-stop calculation servers. The applications targeted by MASH are vision and goal-planning problems, either in a 3D virtual environment or with a real robotic arm.

The MASH consortium is led by the IDIAP Research Institute in Switzerland. The other members are the University of Potsdam in Germany, the Czech Technical University of Prague, the National Institute for Research in Computer Science and Control (INRIA) in France, and the National Centre for Scientific Research (CNRS), also in France, through the laboratory of Heuristics and Diagnosis for Complex Systems (HEUDIASYC) attached to the University of Technology of Compiègne.

From the point of view of the research, the members of the consortium must deal with four main goals:

1. Software development of website framework and APIs.

                  2 Classification and goal-planning in high dimensional feature spaces

                  3 Interfacing the platform with the 3D virtual environment and the robot arm

4. Building tools to assist contributors with the development of the feature extractors and the configuration of the experiments.

Figure 1.1: MASH project logo

The work detailed in this text has been done in the context of goal 4. From the very beginning of the project, our role is to provide the users with some feedback regarding the feature extractors. At the moment of writing this thesis, the number of public feature extractors reaches 75. In addition to the public ones, there are also private extractors that contributors decide not to share with the rest of the community. The last number I was aware of was about 300. Within those 375 extractors, there must be some of them sharing the same theoretical principles or supplying similar features. The framework of the project tests every new piece of code with some datasets of reference in order to provide a ranking depending on the quality of the estimation. However, similar performance of two extractors for a particular dataset does not mean that both are using the same variables.

Our engagement was to provide some textual or graphical tools to discover which extractors compute features similar to other ones. Our hypothesis is that many of them use the same theoretical foundations, which should induce a grouping of similar extractors. If we succeed in discovering those groups, we would also be able to select representatives. This information can be used in several ways. For example, from the perspective of a user who develops feature extractors, it would be interesting to compare the performance of his code against the K representatives instead of against the whole database. As another example, imagine that a user wants to obtain the best prediction results for a particular dataset. Instead of selecting all the feature extractors, creating an extremely high dimensional space, he could select only the K representatives, foreseeing similar results with a faster experiment.

As there is no prior knowledge about the latent structure, we make use of unsupervised techniques. Below is a brief description of the different tools that we developed for the web platform.

• Clustering Using Mixture Models. This is a well-known technique that models the data as if it was randomly generated from a distribution function. This distribution is typically a mixture of Gaussians with unknown mixture proportions, means and covariance matrices. The number of Gaussian components matches the number of expected groups. The parameters of the model are computed using the EM algorithm, and the clusters are built by maximum a posteriori estimation. For the calculation we use mixmod, a C++ library that can be interfaced with Matlab. This library allows working with high dimensional data. Further information regarding mixmod is given by Biernacki et al. (2008). All details concerning the tool implemented are given in deliverable "mash-deliverable-D7.1-m12" (Govaert et al., 2010).

• Sparse Clustering Using Penalized Optimal Scoring. This technique intends again to perform clustering by modelling the data as a mixture of Gaussian distributions. However, instead of using a classic EM algorithm for estimating the components' parameters, the M-step is replaced by a penalized Optimal Scoring problem. This replacement induces sparsity, improving the robustness and the interpretability of the results. Its theory will be explained later in this thesis.


All details concerning the tool implemented can be found in deliverable "mash-deliverable-D7.2-m24" (Govaert et al., 2011).

• Table Clustering Using The RV Coefficient. This technique applies clustering methods directly to the tables computed by the feature extractors, instead of creating a single matrix. A distance in the extractors space is defined using the RV coefficient, which is a multivariate generalization of Pearson's correlation coefficient in the form of an inner product. The distance is defined for every pair i and j as RV(O_i, O_j), where O_i and O_j are operators computed from the tables returned by feature extractors i and j. Once we have a distance matrix, several standard techniques may be used to group extractors; a minimal sketch of this coefficient is given just below. A detailed description of this technique can be found in deliverables "mash-deliverable-D7.1-m12" (Govaert et al., 2010) and "mash-deliverable-D7.2-m24" (Govaert et al., 2011).
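As an illustration of the last tool, here is a minimal sketch of the RV coefficient between two data tables, computed with NumPy from its usual definition based on the cross-product operators XXᵀ and YYᵀ. The synthetic tables and the direct use of raw tables (rather than the operators actually used on the platform) are assumptions made only for this example.

```python
# A minimal sketch (assumption: plain NumPy, raw tables as inputs) of the RV coefficient,
# a multivariate generalization of Pearson's correlation between two data tables.
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two tables observed on the same n samples."""
    X = X - X.mean(axis=0)                 # column-center each table
    Y = Y - Y.mean(axis=0)
    Sx = X @ X.T                           # n x n cross-product operators
    Sy = Y @ Y.T
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 5))               # table returned by a first extractor
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
B = 2.0 * A @ Q                            # rotated and rescaled copy of A
C = rng.normal(size=(40, 7))               # unrelated table

print("RV(A, B) =", round(rv_coefficient(A, B), 3))   # equals 1: RV is invariant to rotation and scaling
print("RV(A, C) =", round(rv_coefficient(A, C), 3))   # markedly smaller for an unrelated table
```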

I am not extending this section with further explanations about the MASH project or deeper details about the theory that we used to fulfil our engagements. I will simply refer to the public deliverables of the project, where everything is carefully detailed (Govaert et al., 2010, 2011).


                  2 Regularization for Feature Selection

With the advances in technology, data is becoming larger and larger, resulting in high dimensional ensembles of information. Genomics, textual indexation and medical images are some examples of data that can easily exceed thousands of dimensions. The first experiments aiming to cluster the data from the MASH project (see Chapter 1) intended to work with the whole dimensionality of the samples. As the number of feature extractors rose, the numerical issues also rose. Redundancy or extremely correlated features may happen if two contributors implement the same extractor with different names. When the number of features exceeded the number of samples, we started to deal with singular covariance matrices, whose inverses are not defined. Many algorithms in the field of Machine Learning make use of this statistic.

2.1 Motivations

There is a quite recent effort in the direction of handling high dimensional data. Traditional techniques can be adapted, but quite often large dimensions render those techniques useless. Linear Discriminant Analysis was shown to be no better than "random guessing" of the object labels when the dimension is larger than the sample size (Bickel and Levina, 2004; Fan and Fan, 2008).

As a rule of thumb, in discriminant and clustering problems the complexity of the calculations increases with the number of objects in the database, the number of features (dimensionality) and the number of classes or clusters. One way to reduce this complexity is to reduce the number of features. This reduction induces more robust estimators, allows faster learning and predictions in supervised environments, and easier interpretations in the unsupervised framework. Removing features must be done wisely to avoid removing critical information.

When talking about dimensionality reduction, there are two families of techniques that could induce confusion:

• Reduction by feature transformations summarizes the dataset with fewer dimensions by creating combinations of the original attributes. These techniques are less effective when there are many irrelevant attributes (noise). Principal Component Analysis or Independent Component Analysis are two popular examples.

• Reduction by feature selection removes irrelevant dimensions, preserving the integrity of the informative features from the original dataset. The problem comes out when there is a restriction in the number of variables to preserve, and discarding the exceeding dimensions leads to a loss of information. Prediction with feature selection is computationally cheaper because only relevant features are used, and the resulting models are easier to interpret. The Lasso operator is an example of this category.

Figure 2.1: Example of relevant features from Chidlovskii and Lecerf (2008)

As a basic rule, we can use the reduction techniques by feature transformation when the majority of the features are relevant and when there is a lot of redundancy or correlation. On the contrary, feature selection techniques are useful when there are plenty of useless or noisy features (irrelevant information) that need to be filtered out. In the paper of Chidlovskii and Lecerf (2008) we find a great explanation about the difference between irrelevant and redundant features. The following two paragraphs are almost exact reproductions of their text.

"Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if each of dogs, cats and squirrels have about the same distribution of brown, black and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.

Redundant features are those which provide distinguishing information but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative "diet" and "domestication" features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus the "diet" feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus the "domestication" feature provides substantially the same information as the "diet" feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus the "diet" and "domestication" features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the "diet" feature or the "domestication" feature, but if one removes both the "diet" and the "domestication" features then useful distinguishing information is lost."

Figure 2.2: The four key steps of feature selection according to Liu and Yu (2005)

There are some tricks to build robust estimators when the number of features exceeds the number of samples. Ignoring some of the dependencies among variables and replacing the covariance matrix by a diagonal approximation are two of them. Another popular technique, and the one chosen in this thesis, is imposing regularity conditions.

2.2 Categorization of Feature Selection Techniques

Feature selection is one of the most frequent techniques in preprocessing data, in order to remove irrelevant, redundant or noisy features. Nevertheless, the risk of removing some informative dimensions is always there, thus the relevance of the remaining subset of features must be measured.

I am reproducing here the scheme that generalizes any feature selection process, as it is shown by Liu and Yu (2005). Figure 2.2 provides a very intuitive scheme with the four key steps in a feature selection algorithm.

The classification of those algorithms can respond to different criteria. Guyon and Elisseeff (2003) propose a check list that summarizes the steps that may be taken to solve a feature selection problem, guiding the user through several techniques. Liu and Yu (2005) propose a framework that integrates supervised and unsupervised feature selection algorithms through a categorizing framework. Both references are excellent reviews for characterizing feature selection techniques. I am proposing a framework inspired by these references that does not cover all the possibilities, but which gives a good summary of the existing ones.

• Depending on the type of integration with the machine learning algorithm, we have:

  – Filter Models - The filter models work as a preprocessing step, using an independent evaluation criterion to select a subset of variables without assistance of the mining algorithm.

  – Wrapper Models - The wrapper models require a classification or clustering algorithm and use its prediction performance to assess the relevance of the subset selection. The feature selection is done in the optimization block while the feature subset evaluation is done in a different one. Therefore the criterion to optimize and to evaluate may be different. Those algorithms are computationally expensive.

  – Embedded Models - They perform variable selection inside the learning machine, with the selection being made at the training step. That means that there is only one criterion: the optimization and the evaluation are a single block, and the features are selected to optimize this unique criterion and do not need to be re-evaluated in a later phase. That makes them more efficient, since no validation or test process is needed for every variable subset investigated. However, they are less universal because they are specific to the training process of a given mining algorithm.

• Depending on the feature searching technique:

  – Complete - No subsets are missed from evaluation. Involves combinatorial searches.

  – Sequential - Features are added (forward searches) or removed (backward searches) one at a time.

  – Random - The initial subset, or even subsequent subsets, are randomly chosen to escape local optima.

• Depending on the evaluation technique:

  – Distance Measures - Choosing the features that maximize the difference in separability, divergence or discrimination measures.

  – Information Measures - Choosing the features that maximize the information gain, that is, minimizing the posterior uncertainty.

  – Dependency Measures - Measuring the correlation between features.

  – Consistency Measures - Finding a minimum number of features that separate classes as consistently as the full set of features can.

  – Predictive Accuracy - Use the selected features to predict the labels.

  – Cluster Goodness - Use the selected features to perform clustering and evaluate the result (cluster compactness, scatter separability, maximum likelihood).

The distance, information, correlation and consistency measures are typical of variable ranking algorithms, commonly used in filter models. Predictive accuracy and cluster goodness allow evaluating subsets of features and can be used in wrapper and embedded models.

In this thesis we developed some algorithms following the embedded paradigm, either in the supervised or the unsupervised framework. Integrating the subset selection problem in the overall learning problem may be computationally demanding, but it is appealing from a conceptual viewpoint: there is a perfect match between the formalized goal and the process dedicated to achieving this goal, thus avoiding many problems arising in filter or wrapper methods. Practically, it is however intractable to solve exactly hard selection problems when the number of features exceeds a few tens. Regularization techniques allow providing a sensible approximate answer to the selection problem with reasonable computing resources, and their recent study has demonstrated powerful theoretical and empirical results. The following section introduces the tools that will be employed in Parts II and III.

2.3 Regularization

In the machine learning domain, the term "regularization" refers to a technique that introduces some extra assumptions or knowledge in the resolution of an optimization problem. The most popular point of view presents regularization as a mechanism to prevent overfitting, but it can also help to fix some numerical issues in ill-posed problems (like some matrix singularities when solving a linear system), besides other interesting properties like the capacity to induce sparsity, thus producing models that are easier to interpret.

An ill-posed problem violates the rules defined by Jacques Hadamard, according to whom the solution to a mathematical problem has to exist, be unique and be stable. This happens, for example, when the number of samples is smaller than their dimensionality and we try to infer some generic laws from such a small sample of the population. Regularization transforms an ill-posed problem into a well-posed one. To do that, some a priori knowledge is introduced in the solution through a regularization term that penalizes a criterion J with a penalty P. Below are the two most popular formulations:

$$\min_{\beta} \; J(\beta) + \lambda P(\beta) \qquad (2.1)$$

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad P(\beta) \le t \qquad (2.2)$$

In the expressions (2.1) and (2.2), the parameters $\lambda$ and $t$ have a similar function, that is, to control the trade-off between fitting the data to the model according to $J(\beta)$ and the effect of the penalty $P(\beta)$. The set such that the constraint in (2.2) is verified, $\{\beta : P(\beta) \le t\}$, is called the admissible set. This penalty term can also be understood as a measure that quantifies the complexity of the model (as in the definition of Sammut and Webb, 2010). Note that regularization terms can also be interpreted in the Bayesian paradigm as prior distributions on the parameters of the model. In this thesis both views will be taken.
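To make the penalized formulation (2.1) concrete, here is a minimal sketch, under the assumption of a least squares cost $J(\beta)$ and an $L_1$ penalty $P(\beta)$, solved with a generic SciPy optimizer. The synthetic data and the solver choice are illustrative only, not part of the thesis.

```python
# A minimal sketch (not from the thesis) of criterion (2.1): J(beta) + lambda * P(beta),
# with J a least-squares fit and P the L1 norm. Larger lambda shrinks the coefficients more.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]                  # only 3 informative features
y = X @ beta_true + 0.1 * rng.normal(size=50)

def penalized_objective(beta, lam):
    fit = np.sum((y - X @ beta) ** 2)             # data-fitting term J(beta)
    penalty = np.sum(np.abs(beta))                # penalty term P(beta), here the L1 norm
    return fit + lam * penalty

for lam in (0.0, 10.0, 100.0):
    res = minimize(penalized_objective, np.zeros(10), args=(lam,), method="Powell")
    print(f"lambda = {lam:6.1f}   beta = {np.round(res.x, 2)}")
```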

In this section I am reviewing the pure, mixed and hybrid penalties that will be used in the following chapters to implement feature selection. I first list important properties that may pertain to any type of penalty.


Figure 2.3: Admissible sets in two dimensions for different pure norms $\|\beta\|_p$

2.3.1 Important Properties

Penalties may have different properties that can be more or less interesting depending on the problem and the expected solution. The most important properties for our purposes here are convexity, sparsity and stability.

Convexity. Regarding optimization, convexity is a desirable property that eases finding global solutions. A convex function verifies

$$\forall (x_1, x_2) \in \mathcal{X}^2, \quad f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad (2.3)$$

for any value of $t \in [0, 1]$. Replacing the inequality by a strict inequality, we obtain the definition of strict convexity. A regularized expression like (2.2) is convex if the function $J(\beta)$ and the penalty $P(\beta)$ are both convex.

Sparsity. Usually, null coefficients furnish models that are easier to interpret. When sparsity does not harm the quality of the predictions, it is a desirable property, which moreover entails less memory usage and fewer computation resources.

Stability. There are numerous notions of stability or robustness, which measure how the solution varies when the input is perturbed by small changes. This perturbation can be adding, removing or replacing a few elements in the training set. Adding regularization, in addition to preventing overfitting, is a means to favor the stability of the solution.

2.3.2 Pure Penalties

For pure penalties, defined as $P(\beta) = \|\beta\|_p$, convexity holds for $p \ge 1$. This is graphically illustrated in Figure 2.3, borrowed from Szafranski (2008), whose Chapter 3 is an excellent review of regularization techniques and the algorithms to solve them. In this figure, the shape of the admissible sets corresponding to different pure penalties is greyed out. Since the convexity of the penalty corresponds to the convexity of the set, we see that this property is verified for $p \ge 1$.

Figure 2.4: Two dimensional regularized problems with $\|\beta\|_1$ and $\|\beta\|_2$ penalties

Regularizing a linear model with a norm like $\|\beta\|_p$ means that the larger the component $|\beta_j|$, the more important the feature $x^j$ in the estimation. On the contrary, the closer to zero, the more dispensable it is. In the limit of $|\beta_j| = 0$, $x^j$ is not involved in the model. If many dimensions can be dismissed, then we can speak of sparsity.

A graphical interpretation of sparsity, borrowed from Marie Szafranski, is given in Figure 2.4. In a 2D problem, a solution can be considered as sparse if any of its components ($\beta_1$ or $\beta_2$) is null, that is, if the optimal $\beta$ is located on one of the coordinate axes. Let us consider a search algorithm that minimizes an expression like (2.2) where $J(\beta)$ is a quadratic function. When the solution to the unconstrained problem does not belong to the admissible set defined by $P(\beta)$ (greyed out area), the solution to the constrained problem is as close as possible to the global minimum of the cost function inside the grey region. Depending on the shape of this region, the probability of having a sparse solution varies. A region with vertexes, as the one corresponding to an $L_1$ penalty, has more chances of inducing sparse solutions than the one of an $L_2$ penalty. That idea is displayed in Figure 2.4, where $J(\beta)$ is a quadratic function represented with three isolevel curves whose global minimum $\beta^{ls}$ is outside the penalties' admissible region. The closest point to this $\beta^{ls}$ for the $L_1$ regularization is $\beta^{\ell_1}$, and for the $L_2$ regularization it is $\beta^{\ell_2}$. The solution $\beta^{\ell_1}$ is sparse because its second component is zero, while both components of $\beta^{\ell_2}$ are different from zero.

After reviewing the regions from Figure 2.3, we can relate the capacity of generating sparse solutions to the quantity and the "sharpness" of the vertexes of the greyed out area. For example, an $L_{1/3}$ penalty has a support region with sharper vertexes that would induce a sparse solution even more strongly than an $L_1$ penalty; however, the non-convex shape of the $L_{1/3}$ ball results in difficulties during optimization that will not happen with a convex shape.


To summarize, a convex problem with a sparse solution is desired. But with pure penalties, sparsity is only possible with $L_p$ norms with $p \le 1$, due to the fact that they are the only ones that have vertexes. On the other side, only norms with $p \ge 1$ are convex; hence the only pure penalty that builds a convex problem with a sparse solution is the $L_1$ penalty.

$L_0$ Penalties. The $L_0$ pseudo-norm of a vector $\beta$ is defined as the number of entries different from zero, that is, $P(\beta) = \|\beta\|_0 = \mathrm{card}\{\beta_j \,|\, \beta_j \neq 0\}$:

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \|\beta\|_0 \le t \qquad (2.4)$$

where the parameter $t$ represents the maximum number of non-zero coefficients in vector $\beta$. The larger the value of $t$ (or the lower the value of $\lambda$ if we use the equivalent expression in (2.1)), the fewer the number of zeros induced in vector $\beta$. If $t$ is equal to the dimensionality of the problem (or if $\lambda = 0$), then the penalty term is not effective and $\beta$ is not altered. In general, the computation of the solutions relies on combinatorial optimization schemes. Their solutions are sparse but unstable.
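The combinatorial nature of (2.4) can be made explicit with a small sketch (an illustration only, not taken from the thesis): for a least squares cost, the exact $L_0$-constrained solution requires scanning all subsets of at most $t$ variables, which is only feasible for very small $p$.

```python
# A minimal sketch (not from the thesis) of L0-constrained estimation by exhaustive
# search over subsets, illustrating why it relies on combinatorial optimization.
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
beta_true = np.array([1.5, 0.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta_true + 0.1 * rng.normal(size=30)

def best_subset(X, y, t):
    """Least-squares fit restricted to the best subset of at most t features."""
    n, p = X.shape
    best_rss, best_beta = np.inf, np.zeros(p)
    for k in range(1, t + 1):
        for subset in itertools.combinations(range(p), k):   # combinatorial search
            coef, rss, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
            rss = rss[0] if rss.size else np.sum((y - X[:, subset] @ coef) ** 2)
            if rss < best_rss:
                best_rss = rss
                best_beta = np.zeros(p)
                best_beta[list(subset)] = coef
    return best_beta

print(np.round(best_subset(X, y, t=3), 2))   # solution with ||beta||_0 <= 3
```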

$L_1$ Penalties. The penalties built using $L_1$ norms induce sparsity and stability. This penalty has been named the Lasso (Least Absolute Shrinkage and Selection Operator) by Tibshirani (1996):

$$\min_{\beta} \; J(\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t \qquad (2.5)$$

Despite all the advantages of the Lasso, the choice of the right penalty is not only a question of convexity and sparsity. For example, concerning the Lasso, Osborne et al. (2000a) have shown that when the number of examples $n$ is lower than the number of variables $p$, then the maximum number of non-zero entries of $\beta$ is $n$. Therefore, if there is a strong correlation between several variables, this penalty risks dismissing all but one, resulting in a hardly interpretable model. In a field like genomics, where $n$ is typically some tens of individuals and $p$ several thousands of genes, the performance of the algorithm and the interpretability of the genetic relationships are severely limited.
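A brief numerical illustration of this saturation effect (a sketch assuming scikit-learn as the solver, not part of the thesis): with $n < p$, the Lasso cannot select more than $n$ variables.

```python
# A minimal sketch (assumption: scikit-learn used only for illustration) of Lasso sparsity
# when n < p: the number of selected variables cannot exceed n.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 20, 100                               # fewer samples than variables
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 3.0
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)           # alpha plays the role of lambda in (2.1)
n_selected = int(np.sum(lasso.coef_ != 0))
print(f"{n_selected} non-zero coefficients out of {p} (at most {n} can be selected)")
```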

The Lasso is a popular tool that has been used in multiple contexts besides regression, particularly in the field of feature selection in supervised classification (Mai et al., 2012; Witten and Tibshirani, 2011) and clustering (Roth and Lange, 2004; Pan et al., 2006; Pan and Shen, 2007; Zhou et al., 2009; Guo et al., 2010; Witten and Tibshirani, 2010; Bouveyron and Brunet, 2012b,a).

The consistency of the problems regularized by a Lasso penalty is also a key feature. Defining consistency as the capability of always making the right choice of relevant variables when the number of individuals is infinitely large, Leng et al. (2006) have shown that when the penalty parameter ($t$ or $\lambda$, depending on the formulation) is chosen by minimization of the prediction error, the Lasso penalty does not lead to consistent models. There is a large bibliography defining conditions under which Lasso estimators become consistent (Knight and Fu, 2000; Donoho et al., 2006; Meinshausen and Buhlmann, 2006; Zhao and Yu, 2007; Bach, 2008). In addition to those papers, some authors have introduced modifications to improve the interpretability and the consistency of the Lasso, such as the adaptive Lasso (Zou, 2006).

$L_2$ Penalties. The graphical interpretation of pure norm penalties in Figure 2.3 shows that this norm does not induce sparsity, due to its lack of vertexes. Strictly speaking, the $L_2$ norm involves the square root of the sum of all squared components. In practice, when using $L_2$ penalties, the square of the norm is used, to avoid the square root and solve a linear system. Thus an $L_2$ penalized optimization problem looks like

$$\min_{\beta} \; J(\beta) + \lambda \|\beta\|_2^2 \qquad (2.6)$$

The effect of this penalty is the "equalization" of the components of the parameter that is being penalized. To enlighten this property, let us consider a least squares problem

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 \qquad (2.7)$$

with solution $\beta^{ls} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$. If some input variables are highly correlated, the estimator $\beta^{ls}$ is very unstable. To fix this numerical instability, Hoerl and Kennard (1970) proposed ridge regression, which regularizes Problem (2.7) with a quadratic penalty:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \, .$$

The solution to this problem is $\beta^{\ell_2} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I}_p)^{-1}\mathbf{X}^\top\mathbf{y}$. All eigenvalues, in particular the small ones corresponding to the correlated dimensions, are now moved upwards by $\lambda$. This can be enough to avoid the instability induced by small eigenvalues. This "equalization" in the coefficients reduces the variability of the estimation, which may improve performances.
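The eigenvalue shift behind this stabilization can be checked directly (a minimal sketch with NumPy, not from the thesis; the data are synthetic, with two nearly collinear columns).

```python
# A minimal sketch (not from the thesis) of the ridge solution and of the eigenvalue
# shift (X^T X + lambda I) that stabilizes the estimate when variables are correlated.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 3] + 1e-3 * rng.normal(size=n)     # two nearly collinear columns
y = X @ np.array([1.0, 0.0, 0.0, 2.0, -2.0]) + 0.1 * rng.normal(size=n)

lam = 1.0
gram = X.T @ X
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)

print("smallest eigenvalue of X^T X          :", np.linalg.eigvalsh(gram)[0])
print("smallest eigenvalue of X^T X + lam * I:", np.linalg.eigvalsh(gram + lam * np.eye(p))[0])
print("ridge coefficients:", np.round(beta_ridge, 2))
```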

As with the Lasso operator, there are several variations of ridge regression. For example, Breiman (1995) proposed the nonnegative garrotte, which looks like a ridge regression where each variable is penalized adaptively. To do that, the least squares solution is used to define the penalty parameter attached to each coefficient:

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \frac{\beta_j^2}{(\beta_j^{ls})^2} \qquad (2.8)$$

The effect is an elliptic admissible set instead of the ball of ridge regression. Another example is the adaptive ridge regression (Grandvalet, 1998; Grandvalet and Canu, 2002), where the penalty parameter differs on each component. There, every $\lambda_j$ is optimized to penalize more or less depending on the influence of $\beta_j$ in the model.

Although the $L_2$ penalized problems are stable, they are not sparse. That makes those models harder to interpret, mainly in high dimensions.

$L_\infty$ Penalties. A special case of $L_p$ norms is the infinity norm, defined as $\|x\|_\infty = \max(|x_1|, |x_2|, \ldots, |x_p|)$. The admissible region for a penalty like $\|\beta\|_\infty \le t$ is displayed in Figure 2.3. For the $L_\infty$ norm, the greyed out region is a square containing all the $\beta$ vectors whose largest coefficient is less than or equal to the value of the penalty parameter $t$.

This norm is not commonly used as a regularization term itself; however, it is frequently combined in mixed penalties, as shown in Section 2.3.4. In addition, in the optimization of penalized problems there exists the concept of dual norms. Dual norms arise in the analysis of estimation bounds and in the design of algorithms that address optimization problems by solving an increasing sequence of small subproblems (working set algorithms). The dual norm plays a direct role in computing optimality conditions of sparse regularized problems. The dual norm $\|\beta\|_*$ of a norm $\|\beta\|$ is defined as

$$\|\beta\|_* = \max_{\mathbf{w} \in \mathbb{R}^p} \; \beta^\top \mathbf{w} \quad \text{s.t.} \quad \|\mathbf{w}\| \le 1 \, .$$

In the case of an $L_q$ norm with $q \in [1, +\infty]$, the dual norm is the $L_r$ norm such that $1/q + 1/r = 1$. For example, the $L_2$ norm is self-dual and the dual norm of the $L_1$ norm is the $L_\infty$ norm. Thus, this is one of the reasons why $L_\infty$ is so important even if it is not so popular as a penalty itself, because $L_1$ is. An extensive explanation about dual norms and the algorithms that make use of them can be found in Bach et al. (2011).
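As a small numerical check of this duality (an illustration, not from the thesis): the dual of the $L_1$ norm evaluated at a vector $\beta$, namely $\max_{\|\mathbf{w}\|_1 \le 1} \beta^\top \mathbf{w}$, is attained at a signed canonical basis vector and equals $\|\beta\|_\infty$.

```python
# A minimal sketch (not from the thesis) verifying that the dual of the L1 norm is the
# L-infinity norm: max beta^T w over the L1 unit ball equals ||beta||_inf.
import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(size=6)

# The maximizer over the L1 unit ball is a signed canonical basis vector located at the
# largest-magnitude coordinate of beta.
j = int(np.argmax(np.abs(beta)))
w_star = np.zeros(6)
w_star[j] = np.sign(beta[j])
print("beta^T w*     =", round(float(beta @ w_star), 4))
print("||beta||_inf  =", round(float(np.abs(beta).max()), 4))

# No feasible w does better (sampled sanity check on the L1 unit sphere).
W = rng.normal(size=(100_000, 6))
W /= np.abs(W).sum(axis=1, keepdims=True)
print("largest sampled beta^T w:", round(float((W @ beta).max()), 4))
```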

2.3.3 Hybrid Penalties

There is no reason for using pure penalties in isolation. We can combine them and try to obtain different benefits from each of them. The most popular example is the Elastic net regularization (Zou and Hastie, 2005), with the objective of improving the Lasso penalization when $n \le p$. As recalled in Section 2.3.2, when $n \le p$ the Lasso penalty can select at most $n$ non-null features. Thus, in situations where there are more relevant variables, the Lasso penalty risks selecting only some of them. To avoid this effect, a combination of $L_1$ and $L_2$ penalties has been proposed. For the least squares example (2.7) from Section 2.3.2, the Elastic net is

$$\min_{\beta} \; \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \qquad (2.9)$$

The term in $\lambda_1$ is a Lasso penalty that induces sparsity in vector $\beta$; on the other side, the term in $\lambda_2$ is a ridge regression penalty that provides universal strong consistency (De Mol et al., 2009), that is, the asymptotic capability (when $n$ goes to infinity) of always making the right choice of relevant variables.
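A minimal sketch of this combination (assuming scikit-learn's ElasticNet as an off-the-shelf solver of a criterion of the form (2.9); its alpha and l1_ratio parametrization corresponds to $(\lambda_1, \lambda_2)$ up to a rescaling):

```python
# A minimal sketch (assumption: scikit-learn as solver) comparing Lasso and Elastic net
# when there are more relevant variables than samples.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:30] = 1.0                                   # 30 relevant variables, only 20 samples
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio balances the two penalties

print("Lasso selects      :", int(np.sum(lasso.coef_ != 0)), "variables (at most n =", n, ")")
print("Elastic net selects:", int(np.sum(enet.coef_ != 0)), "variables")
```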


2.3.4 Mixed Penalties

Imagine a linear regression problem where each variable is a gene. Depending on the application, several biological processes can be identified by $L$ different groups of genes. Let us identify as $\mathcal{G}_\ell$ the group of genes for the $\ell$th process and $d_\ell$ the number of genes (variables) in each group, $\forall \ell \in \{1, \ldots, L\}$. Thus, the dimension of vector $\beta$ is the sum of the numbers of genes of all the groups, $\dim(\beta) = \sum_{\ell=1}^{L} d_\ell$. Mixed norms are a type of norms that take those groups into consideration. The general expression is shown below:

$$\|\beta\|_{(r,s)} = \Bigg( \sum_{\ell} \Big( \sum_{j \in \mathcal{G}_\ell} |\beta_j|^s \Big)^{\frac{r}{s}} \Bigg)^{\frac{1}{r}} \qquad (2.10)$$

The pair $(r, s)$ identifies the norms that are combined: an $L_s$ norm within groups and an $L_r$ norm between groups. The $L_s$ norm penalizes the variables in every group $\mathcal{G}_\ell$, while the $L_r$ norm penalizes the within-group norms. The pair $(r, s)$ is set so as to induce different properties in the resulting $\beta$ vector. Note that the outer norm is often weighted to adjust for the different cardinalities of the groups, in order to avoid favoring the selection of the largest groups.
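The following is a minimal sketch (not from the thesis) of how the mixed norm (2.10) is evaluated for a given partition of the variables into groups; with $(r, s) = (1, 2)$ it is the group-Lasso norm discussed next.

```python
# A minimal sketch (not from the thesis) of the mixed norm ||beta||_(r,s) of (2.10):
# an L_s norm within groups combined with an L_r norm between groups.
import numpy as np

def mixed_norm(beta, groups, r=1, s=2):
    """Compute ||beta||_(r,s) for a list of index groups partitioning beta."""
    within = np.array([np.sum(np.abs(beta[g]) ** s) ** (1.0 / s) for g in groups])
    return np.sum(within ** r) ** (1.0 / r)

beta = np.array([0.0, 0.0, 0.0,   1.0, -2.0, 0.5,   0.0, 3.0])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5]), np.array([6, 7])]

print("||beta||_(1,2) =", round(mixed_norm(beta, groups), 3))          # group-Lasso norm
print("||beta||_1     =", round(mixed_norm(beta, groups, r=1, s=1), 3))  # recovered with r = s = 1
```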

Several combinations are available; the most popular is the norm $\|\beta\|_{(1,2)}$, known as the group-Lasso (Yuan and Lin, 2006; Leng, 2008; Xie et al., 2008a,b; Meier et al., 2008; Roth and Fischer, 2008; Yang et al., 2010; Sanchez Merchante et al., 2012). Figure 2.5 shows the difference between the admissible sets of a pure $L_1$ norm and a mixed $L_{1,2}$ norm. Many other mixings are possible, such as $\|\beta\|_{(1,4/3)}$ (Szafranski et al., 2008) or $\|\beta\|_{(1,\infty)}$ (Wang and Zhu, 2008; Kuan et al., 2010; Vogt and Roth, 2010). Modifications of mixed norms have also been proposed, such as the group bridge penalty (Huang et al., 2009), the composite absolute penalties (Zhao et al., 2009), or combinations of mixed and pure norms, such as Lasso and group-Lasso (Friedman et al., 2010; Sprechmann et al., 2010) or group-Lasso and ridge penalty (Ng and Abugharbieh, 2011).

2.3.5 Sparsity Considerations

In this chapter I have reviewed several possibilities to induce sparsity in the solution of optimization problems. However, having sparse solutions does not always lead to parsimonious models feature-wise. For example, if we have four parameters per feature, we look for solutions where all four parameters are null for non-informative variables.

The Lasso and the other $L_1$ penalties encourage solutions such as the one in the left of Figure 2.6. If the objective is sparsity, then the $L_1$ norm does the job. However, if we aim at feature selection and if the number of parameters per variable exceeds one, this type of sparsity does not target the removal of variables.

To be able to dismiss some features, the sparsity pattern must encourage null values for the same variable across parameters, as shown in the right of Figure 2.6. This can be achieved with mixed penalties that define groups of features. For example, $L_{1,2}$ or $L_{1,\infty}$ mixed norms, with the proper definition of groups, can induce sparsity patterns such as the one in the right of Figure 2.6, which displays a solution where variables 3, 5 and 8 are removed.

Figure 2.5: Admissible sets for the Lasso ((a), $L_1$) and the group-Lasso ((b), $L_{(1,2)}$)

Figure 2.6: Sparsity patterns for an example with 8 variables characterized by 4 parameters: (a) $L_1$ induced sparsity, (b) $L_{(1,2)}$ group induced sparsity

2.3.6 Optimization Tools for Regularized Problems

In Caramanis et al. (2012) there is a good collection of mathematical techniques and optimization methods to solve regularized problems. Another good reference is the thesis of Szafranski (2008), which also reviews some techniques, classified in four categories. Those techniques, even if they belong to different categories, can be used separately or combined to produce improved optimization algorithms.

In fact, the algorithm implemented in this thesis is inspired by three of those techniques. It could be defined as an algorithm of "active constraints", implemented following a regularization path that is updated by approaching the cost function with secant hyper-planes. Deeper details are given in the dedicated Chapter 5.

Subgradient Descent. Subgradient descent is a generic optimization method that can be used for the settings of penalized problems where the subgradient of the loss function, $\partial J(\beta)$, and the subgradient of the regularizer, $\partial P(\beta)$, can be computed efficiently. On the one hand, it is essentially blind to the problem structure. On the other hand, many iterations are needed, so the convergence is slow and the solutions are not sparse. Basically, it is a generalization of the iterative gradient descent algorithm, where the solution vector $\beta^{(t+1)}$ is updated proportionally to the negative subgradient of the function at the current point $\beta^{(t)}$:

$$\beta^{(t+1)} = \beta^{(t)} - \alpha\,(\mathbf{s} + \lambda \mathbf{s}')\,, \quad \text{where } \mathbf{s} \in \partial J(\beta^{(t)}),\; \mathbf{s}' \in \partial P(\beta^{(t)})\,.$$
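A minimal sketch of this update (an illustration, not from the thesis) for a least squares loss with an $L_1$ regularizer; note that, as stated above, the iterates get close to the sparse solution but their coefficients are not exactly zero.

```python
# A minimal sketch (not from the thesis) of subgradient descent on a Lasso-type
# criterion J(beta) + lambda * ||beta||_1, with J the least-squares loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)

lam, alpha = 1.0, 1e-3
beta = np.zeros(10)
for t in range(5000):
    s = -2 * X.T @ (y - X @ beta)          # gradient of the least-squares loss
    s_prime = np.sign(beta)                # a subgradient of the L1 norm (0 at beta_j = 0)
    beta = beta - alpha * (s + lam * s_prime)

print(np.round(beta, 3))                   # close to beta_true, but not exactly sparse
```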

Coordinate Descent. Coordinate descent is based on the first order optimality conditions of the criterion (2.1). In the case of penalties like the Lasso, setting to zero the first order derivative with respect to coefficient $\beta_j$ gives

$$\hat{\beta}_j = \frac{-\lambda\,\mathrm{sign}(\beta_j) - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2}\,.$$

In the literature, those algorithms can also be referred to as "iterative thresholding" algorithms, because the optimization can be solved by soft-thresholding in an iterative process. As an example, Fu (1998) implements this technique, initializing every coefficient with the least squares solution $\beta^{ls}$ and updating their values using an iterative thresholding algorithm where $\beta_j^{(t+1)} = S_\lambda\big(\frac{\partial J(\beta^{(t)})}{\partial \beta_j}\big)$. The objective function is optimized with respect to one variable at a time, while all others are kept fixed:

$$S_\lambda\!\left(\frac{\partial J(\beta)}{\partial \beta_j}\right) =
\begin{cases}
\dfrac{\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} > \lambda \\[2ex]
\dfrac{-\lambda - \frac{\partial J(\beta)}{\partial \beta_j}}{2 \sum_{i=1}^{n} x_{ij}^2} & \text{if } \frac{\partial J(\beta)}{\partial \beta_j} < -\lambda \\[2ex]
0 & \text{if } \Big|\frac{\partial J(\beta)}{\partial \beta_j}\Big| \le \lambda
\end{cases}
\qquad (2.11)$$

The same principles define "block-coordinate descent" algorithms. In this case, the first order derivatives are applied to the equations of a group-Lasso penalty (Yuan and Lin, 2006; Wu and Lange, 2008).
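A minimal sketch of such an iterative soft-thresholding scheme (an illustration with the least squares loss of (2.7), not the exact updates of Fu (1998)):

```python
# A minimal sketch (not from the thesis) of coordinate descent for the Lasso with
# soft-thresholding updates, one coefficient at a time while the others are kept fixed.
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding the contribution of coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # soft-thresholding: exact minimizer of the one-dimensional subproblem
            beta[j] = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)

print(np.round(lasso_coordinate_descent(X, y, lam=5.0), 3))  # exact zeros on the useless features
```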

Active and Inactive Sets. Active set algorithms are also referred to as "active constraints" or "working set" methods. These algorithms define a subset of variables called the "active set". This subset stores the indices of the variables with non-zero $\beta_j$. It is usually identified as the set $\mathcal{A}$. The complement of the active set is the "inactive set", noted $\bar{\mathcal{A}}$. In the inactive set we can find the indices of the variables whose $\beta_j$ is zero. Thus, the problem can be simplified to the dimensionality of $\mathcal{A}$.

Osborne et al. (2000a) proposed the first of those algorithms to solve quadratic problems with Lasso penalties. Their algorithm starts from an empty active set that is updated incrementally (forward growing). There also exists a backward view, where relevant variables are allowed to leave the active set; however, the forward philosophy, which starts with an empty $\mathcal{A}$, has the advantage that the first calculations are low dimensional. In addition, the forward view fits better with the feature selection intuition, where few features are intended to be selected.

Working set algorithms have to deal with three main tasks. There is an optimization task, where a minimization problem has to be solved using only the variables from the active set. Osborne et al. (2000a) solve a linear approximation of the original problem to determine the objective function descent direction, but any other method can be considered. In general, as the solutions of successive active sets are typically close to each other, it is a good idea to use the solution of the previous iteration as the initialization of the current one (warm start). Besides the optimization task, there is a working set update task, where the active set $\mathcal{A}$ is augmented with the variable from the inactive set $\bar{\mathcal{A}}$ that most violates the optimality conditions of Problem (2.1). Finally, there is also a task to compute the optimality conditions. Their expressions are essential in the selection of the next variable to add to the active set and to test whether a particular vector $\beta$ is a solution of Problem (2.1).
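The following is a generic sketch of this forward strategy (an illustration only; it is not the GLOSS algorithm of Chapter 5, and the hypothetical helper `working_set_lasso` delegates the restricted optimization task to scikit-learn's Lasso). It combines the three tasks: restricted optimization, working set update and optimality checking.

```python
# A generic sketch (an illustration, not the algorithm of this thesis) of a forward
# working set strategy for the Lasso criterion ||y - X beta||^2 + lam * ||beta||_1.
import numpy as np
from sklearn.linear_model import Lasso

def working_set_lasso(X, y, lam, tol=1e-6, max_active=None):
    n, p = X.shape
    active = []                                   # start from an empty active set
    beta = np.zeros(p)
    max_active = max_active or p
    while len(active) < max_active:
        # optimality check on the inactive set: |gradient_j| must not exceed lambda
        grad = -2 * X.T @ (y - X @ beta)
        violations = np.abs(grad) - lam
        violations[active] = -np.inf
        j_star = int(np.argmax(violations))
        if violations[j_star] <= tol:             # no violation: beta is optimal
            break
        active.append(j_star)                     # working set update
        # optimization task restricted to the active variables (warm start not shown)
        sub = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X[:, active], y)
        beta = np.zeros(p)
        beta[active] = sub.coef_
    return beta, active

# Usage (illustrative): beta_hat, selected = working_set_lasso(X, y, lam=5.0)
```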

These active constraints or working set methods, even if they were originally proposed to solve $L_1$ regularized quadratic problems, can also be adapted to generic functions and penalties: for example, linear functions and $L_1$ penalties (Roth, 2004), linear functions and $L_{1,2}$ penalties (Roth and Fischer, 2008), or even logarithmic cost functions and combinations of $L_0$, $L_1$ and $L_2$ penalties (Perkins et al., 2003). The algorithm developed in this work belongs to this family of solutions.

Hyper-Planes Approximation. Hyper-plane approximations solve a regularized problem using a piecewise linear approximation of the original cost function. This convex approximation is built using several secant hyper-planes at different points, obtained from the sub-gradient of the cost function at these points.

This family of algorithms implements an iterative mechanism where the number of hyper-planes increases at every iteration. These techniques are useful with large populations, since the number of iterations needed to converge does not depend on the size of the dataset. On the contrary, if few hyper-planes are used, then the quality of the approximation is not good enough and the solution can be unstable.

This family of algorithms is not as popular as the previous one, but some examples can be found in the domain of Support Vector Machines (Joachims, 2006; Smola et al., 2008; Franc and Sonnenburg, 2008) or Multiple Kernel Learning (Sonnenburg et al., 2006).

Regularization Path. The regularization path is the set of solutions that can be reached when solving a series of optimization problems of the form (2.1) where the penalty parameter $\lambda$ is varied. It is not an optimization technique per se, but it is of practical use when the exact regularization path can be easily followed. Rosset and Zhu (2007) stated that this path is piecewise linear for those problems where the cost function is piecewise quadratic and the regularization term is piecewise linear (or vice-versa).

This concept was first applied to the Lasso algorithm of Osborne et al. (2000b). However, it was after the publication of the algorithm called Least Angle Regression (LARS), developed by Efron et al. (2004), that those techniques became popular. LARS defines the regularization path using active constraint techniques.

Once an active set $\mathcal{A}^{(t)}$ and its corresponding solution $\beta^{(t)}$ have been set, looking for the regularization path means looking for a direction $\mathbf{h}$ and a step size $\gamma$ to update the solution as $\beta^{(t+1)} = \beta^{(t)} + \gamma \mathbf{h}$. Afterwards, the active and inactive sets $\mathcal{A}^{(t+1)}$ and $\bar{\mathcal{A}}^{(t+1)}$ are updated. That can be done by looking for the variables that most strongly violate the optimality conditions. Hence, LARS sets the update step size, and which variable should enter the active set, from the correlation with the residuals.
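A short sketch of such a path (assuming scikit-learn's lars_path as a readily available LARS implementation; nothing here is specific to the thesis) shows the breakpoints at which variables enter the active set:

```python
# A minimal sketch (assumption: scikit-learn's lars_path) of the piecewise linear
# regularization path of the Lasso computed by LARS.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)

# alphas: breakpoints of the path; coefs[:, k]: solution at the k-th breakpoint
alphas, active, coefs = lars_path(X, y, method="lasso")
for alpha, beta in zip(alphas, coefs.T):
    print(f"alpha = {alpha:8.3f}   active variables = {np.flatnonzero(beta).tolist()}")
```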

Proximal Methods. Proximal methods optimize an objective function of the form (2.1), resulting from the addition of a Lipschitz differentiable cost function $J(\beta)$ and a non-differentiable penalty $\lambda P(\beta)$.

$$\min_{\beta \in \mathbb{R}^p} \; J(\beta^{(t)}) + \nabla J(\beta^{(t)})^\top (\beta - \beta^{(t)}) + \lambda P(\beta) + \frac{L}{2} \left\| \beta - \beta^{(t)} \right\|_2^2 \qquad (2.12)$$

They are also iterative methods, where the cost function $J(\beta)$ is linearized in the proximity of the solution $\beta$, so that the problem to solve in each iteration looks like (2.12), where the parameter $L > 0$ should be an upper bound on the Lipschitz constant of the gradient $\nabla J$. That can be rewritten as

$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2} \left\| \beta - \Big(\beta^{(t)} - \frac{1}{L} \nabla J(\beta^{(t)})\Big) \right\|_2^2 + \frac{\lambda}{L} P(\beta) \qquad (2.13)$$

                  The basic algorithm makes use of the solution to (213) as the next value of β(t+1)However there are faster versions that take advantage of information about previoussteps as the ones described by Nesterov (2007) or the FISTA algorithm (Beck andTeboulle 2009) Proximal methods can be seen as generalizations of gradient updatesIn fact making λ = 0 in equation (213) the standard gradient update rule comes up
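To make the update (2.13) concrete, here is a minimal sketch of the basic proximal-gradient iteration for a quadratic cost J(β) = (1/2)‖y − Xβ‖² and the L1 penalty, whose proximal operator is componentwise soft-thresholding; the data, the step constant L and the number of iterations are illustrative assumptions, not part of the methods studied in this thesis.

    # Minimal sketch of the proximal-gradient update (2.13) for a quadratic cost and an
    # L1 penalty, whose proximal operator is soft-thresholding.
    import numpy as np

    def soft_threshold(z, t):
        """Proximal operator of t * ||.||_1 (componentwise shrinkage)."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def proximal_gradient(X, y, lam, n_iter=200):
        n, p = X.shape
        beta = np.zeros(p)
        L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of grad J (spectral norm squared)
        for _ in range(n_iter):
            grad = X.T @ (X @ beta - y)        # gradient of the quadratic cost at beta^(t)
            beta = soft_threshold(beta - grad / L, lam / L)   # closed-form solution of (2.13)
        return beta

    rng = np.random.default_rng(1)
    X = rng.standard_normal((40, 15))
    y = X[:, 0] - 2 * X[:, 1] + 0.05 * rng.standard_normal(40)
    print(np.round(proximal_gradient(X, y, lam=5.0), 3))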


                  Part II

                  Sparse Linear Discriminant Analysis


                  Abstract

Linear discriminant analysis (LDA) aims to describe data by a linear combination of features that best separates the classes. It may be used for classifying future observations or for describing those classes.

There is a vast bibliography about sparse LDA methods, reviewed in Chapter 3. Sparsity is typically induced by regularizing the discriminant vectors or the class means with L1 penalties (see Section 2). Section 2.3.5 discussed why this sparsity-inducing penalty may not guarantee parsimonious models regarding variables.

In this part we develop the group-Lasso Optimal Scoring Solver (GLOSS), which addresses a sparse LDA problem globally through a regression approach of LDA. Our analysis, presented in Chapter 4, formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004). The group-Lasso penalty selects the same features in all discriminant directions, leading to a more interpretable low-dimensional representation of data. The discriminant directions can be used in their totality, or the first ones may be chosen to produce a reduced-rank classification. The first two or three directions can also be used to project the data, so as to generate a graphical display of the data. The algorithm is detailed in Chapter 5, and our experimental results of Chapter 6 demonstrate that, compared to the competing approaches, the models are extremely parsimonious without compromising prediction performances. The algorithm efficiently processes medium to large numbers of variables and is thus particularly well suited to the analysis of gene expression data.


3 Feature Selection in Fisher Discriminant Analysis

3.1 Fisher Discriminant Analysis

Linear discriminant analysis (LDA) aims to describe n labeled observations belonging to K groups by a linear combination of features which characterizes or separates classes. It is used for two main purposes: classifying future observations, or describing the essential differences between classes, either by providing a visual representation of data or by revealing the combinations of features that discriminate between classes. There are several frameworks in which linear combinations can be derived; Friedman et al. (2009) dedicate a whole chapter to linear methods for classification. In this part we focus on Fisher's discriminant analysis, which is a standard tool for linear discriminant analysis whose formulation does not rely on posterior probabilities but rather on some inertia principles (Fisher 1936).

We consider that the data consist of a set of n examples, with observations x_i ∈ R^p comprising p features, and labels y_i ∈ {0,1}^K indicating the exclusive assignment of observation x_i to one of the K classes. It will be convenient to gather the observations in the n×p matrix X = (x_1, ..., x_n)ᵀ and the corresponding labels in the n×K matrix Y = (y_1, ..., y_n)ᵀ.

Fisher's discriminant problem was first proposed for two-class problems, for the analysis of the famous iris dataset, as the maximization of the ratio of the projected between-class covariance to the projected within-class covariance:

    max_{β∈R^p}  (βᵀ Σ_B β) / (βᵀ Σ_W β) ,        (3.1)

where β is the discriminant direction used to project the data, and Σ_B and Σ_W are the p×p between-class and within-class covariance matrices, respectively defined (for a K-class problem) as

    Σ_W = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (x_i − μ_k)(x_i − μ_k)ᵀ
    Σ_B = (1/n) Σ_{k=1}^K Σ_{i∈G_k} (μ − μ_k)(μ − μ_k)ᵀ ,

where μ is the sample mean of the whole dataset, μ_k is the sample mean of class k, and G_k indexes the observations of class k.


This analysis can be extended to the multi-class framework with K groups. In this case, K − 1 discriminant vectors β_k may be computed. Such a generalization was first proposed by Rao (1948). Several formulations of the multi-class Fisher's discriminant are available, for example as the maximization of a trace ratio:

    max_{B∈R^{p×(K−1)}}  tr(Bᵀ Σ_B B) / tr(Bᵀ Σ_W B) ,        (3.2)

where the matrix B is built with the discriminant directions β_k as columns.

Solving the multi-class criterion (3.2) is an ill-posed problem; a better formulation is based on a series of K − 1 subproblems:

    max_{β_k∈R^p}  β_kᵀ Σ_B β_k
    s.t.  β_kᵀ Σ_W β_k ≤ 1
          β_kᵀ Σ_W β_ℓ = 0 ,  ∀ ℓ < k .        (3.3)

The maximizer of subproblem k is the eigenvector of Σ_W⁻¹ Σ_B associated with the kth largest eigenvalue (see Appendix C).
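As a small illustration of how the subproblems (3.3) reduce to an eigen-analysis, the sketch below computes Σ_W and Σ_B on simulated Gaussian classes and extracts the K − 1 leading generalized eigenvectors with scipy; the simulated data are an assumption made only for the example.

    # Minimal sketch: multi-class Fisher discriminant directions as the leading solutions
    # of the generalized eigenproblem Sigma_B v = lambda Sigma_W v.
    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(2)
    K, p, n_k = 3, 5, 30
    means = rng.standard_normal((K, p)) * 3
    X = np.vstack([m + rng.standard_normal((n_k, p)) for m in means])
    y = np.repeat(np.arange(K), n_k)

    n = X.shape[0]
    mu = X.mean(axis=0)
    Sigma_W = np.zeros((p, p))
    Sigma_B = np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        Sigma_W += (Xk - mu_k).T @ (Xk - mu_k) / n
        Sigma_B += len(Xk) * np.outer(mu_k - mu, mu_k - mu) / n

    # Generalized symmetric eigenproblem; eigenvalues come out in ascending order.
    evals, evecs = eigh(Sigma_B, Sigma_W)
    B = evecs[:, ::-1][:, :K - 1]      # the K-1 leading discriminant directions
    print("leading eigenvalues:", np.round(evals[::-1][:K - 1], 3))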

3.2 Feature Selection in LDA Problems

LDA is often used as a data reduction technique, where the K − 1 discriminant directions summarize the p original variables. However, all variables intervene in the definition of these discriminant directions, and this behavior may be troublesome.

Several modifications of LDA have been proposed to generate sparse discriminant directions. Sparse LDA reveals discriminant directions that only involve a few variables. This sparsity mainly targets the reduction of the dimensionality of the problem (as in genetic analysis), but parsimonious classification is also motivated by the need for interpretable models, robustness of the solution, or computational constraints.

The easiest approach to sparse LDA performs variable selection before discrimination. The relevance of each feature is usually based on univariate statistics, which are fast and convenient to compute, but whose very partial view of the overall classification problem may lead to dramatic information loss. As a result, several approaches have been devised in recent years to construct LDA with wrapper and embedded feature selection capabilities.

They can be categorized according to the LDA formulation that provides the basis of the sparsity-inducing extension, that is, either Fisher's discriminant analysis (variance-based) or a regression-based formulation.

3.2.1 Inertia Based

The Fisher discriminant seeks a projection maximizing the separability of classes from inertia principles: mass centers should be far away from each other (large between-class variance) and classes should be concentrated around their mass centers (small within-class variance). This view motivates a first series of sparse LDA formulations.

Moghaddam et al. (2006) propose an algorithm for sparse LDA in binary classification, where sparsity originates in a hard cardinality constraint. The formalization is based on the Fisher's discriminant (3.1), reformulated as a quadratically constrained quadratic program (3.3). Computationally, the algorithm implements a combinatorial search, with some eigenvalue properties that are used to avoid exploring subsets of possible solutions. Extensions of this approach have been developed, with new sparsity bounds for the two-class discrimination problem and shortcuts to speed up the evaluation of eigenvalues (Moghaddam et al. 2007).

Also for binary problems, Wu et al. (2009) proposed a sparse LDA applied to gene expression data, where the Fisher's discriminant (3.1) is solved as

    min_{β∈R^p}  βᵀ Σ_W β
    s.t.  (μ_1 − μ_2)ᵀ β = 1
          Σ_{j=1}^p |β_j| ≤ t ,

where μ_1 and μ_2 are the vectors of mean gene expression values corresponding to the two groups. The expression to optimize and the first constraint match problem (3.1); the second constraint encourages parsimony.

Witten and Tibshirani (2011) describe a multi-class technique using the Fisher's discriminant, rewritten in the form of K − 1 constrained and penalized maximization problems:

    max_{β_k∈R^p}  β_kᵀ Σ_B^k β_k − P_k(β_k)
    s.t.  β_kᵀ Σ_W β_k ≤ 1 .

The term to maximize is the projected between-class covariance β_kᵀ Σ_B β_k, subject to an upper bound on the projected within-class covariance β_kᵀ Σ_W β_k. The penalty P_k(β_k) is added to avoid singularities and induce sparsity. The authors suggest weighted versions of the regular Lasso and fused Lasso penalties for general purpose data: the Lasso shrinks the less informative variables to zero, and the fused Lasso encourages a piecewise constant β_k vector. The R code is available from the website of Daniela Witten.

Cai and Liu (2011) use the Fisher's discriminant to solve a binary LDA problem, but instead of performing separate estimations of Σ_W and (μ_1 − μ_2) to obtain the optimal solution β = Σ_W⁻¹(μ_1 − μ_2), they estimate the product directly, through constrained L1 minimization:

    min_{β∈R^p}  ‖β‖₁
    s.t.  ‖Σ β − (μ_1 − μ_2)‖_∞ ≤ λ .

Sparsity is encouraged by the L1 norm of the vector β, and the parameter λ is used to tune the optimization.

Most of the algorithms reviewed are conceived for binary classification. For those that are envisaged for multi-class scenarios, the Lasso is the most popular way to induce sparsity; however, as discussed in Section 2.3.5, the Lasso is not the best tool to encourage parsimonious models when there are multiple discriminant directions.

3.2.2 Regression Based

In binary classification, LDA has been known to be equivalent to linear regression of scaled class labels since Fisher (1936). For K > 2, many studies show that multivariate linear regression of a specific class indicator matrix can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case (Duda et al. 2000, Friedman et al. 2009).

                  Predefined Indicator Matrix

Multi-class classification is usually linked with linear regression through the definition of an indicator matrix (Friedman et al. 2009). An indicator matrix Y is an n×K matrix with the class labels for all samples. There are several well-known types in the literature. For example, the binary or dummy indicator (y_ik = 1 if sample i belongs to class k and y_ik = 0 otherwise) is commonly used in linking multi-class classification with linear regression (Friedman et al. 2009). Another "popular" choice is y_ik = 1 if sample i belongs to class k and y_ik = −1/(K−1) otherwise; it was used, for example, in extending Support Vector Machines to multi-class classification (Lee et al. 2004) or for generalizing the kernel target alignment measure (Guermeur et al. 2004).
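The two constructions above can be sketched as follows for a hypothetical label vector y with values in {0, ..., K−1}; the helper names are illustrative.

    # Minimal sketch of the two class indicator matrices mentioned above.
    import numpy as np

    def dummy_indicator(y, K):
        """Binary indicator: Y[i, k] = 1 if sample i belongs to class k, 0 otherwise."""
        Y = np.zeros((len(y), K))
        Y[np.arange(len(y)), y] = 1.0
        return Y

    def symmetric_indicator(y, K):
        """Y[i, k] = 1 for the true class, -1/(K-1) otherwise (Lee et al., 2004)."""
        Y = np.full((len(y), K), -1.0 / (K - 1))
        Y[np.arange(len(y)), y] = 1.0
        return Y

    y = np.array([0, 2, 1, 1])
    print(dummy_indicator(y, K=3))
    print(symmetric_indicator(y, K=3))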

There are some efforts which propose a formulation of the least squares problem based on a new class indicator matrix (Ye 2007). This new indicator matrix allows the definition of LS-LDA (Least Squares Linear Discriminant Analysis), which holds a rigorous equivalence with multi-class LDA under a mild condition that is shown empirically to hold in many applications involving high-dimensional data.

Qiao et al. (2009) propose a discriminant analysis in the high-dimensional, low-sample setting, which incorporates variable selection in a Fisher's LDA formulated as a generalized eigenvalue problem, which is then recast as a least squares regression. Sparsity is obtained by means of a Lasso penalty on the discriminant vectors. Even if this is not mentioned in the article, their formulation looks very close in spirit to Optimal Scoring regression. Some rather clumsy steps in the developments hinder the comparison, so that further investigations are required; the lack of publicly available code also prevented an empirical test of this conjecture. If the similarity is confirmed, their formalization would be very close to the one of Clemmensen et al. (2011), reviewed in the following section.

In a recent paper, Mai et al. (2012) take advantage of the equivalence between ordinary least squares and LDA problems to propose a binary classifier solving a penalized least squares problem with a Lasso penalty. The sparse version of the projection vector β is obtained by solving

    min_{β∈R^p, β_0∈R}  n⁻¹ Σ_{i=1}^n (y_i − β_0 − x_iᵀβ)² + λ Σ_{j=1}^p |β_j| ,

where y_i is the binary indicator of the label of pattern x_i. Even if the authors focus on the Lasso penalty, they also suggest any other generic sparsity-inducing penalty. The decision rule xᵀβ + β_0 > 0 is the LDA classifier when it is built using the resulting β vector for λ = 0, but a different intercept β_0 is required.

                  Optimal Scoring

In binary classification, the regression of (scaled) class indicators enables to recover exactly the LDA discriminant direction. For more than two classes, regressing predefined indicator matrices may be impaired by the masking effect, where the scores assigned to a class situated between two other ones never dominate (Hastie et al. 1994). Optimal scoring (OS) circumvents the problem by assigning "optimal scores" to the classes. This route was opened by Fisher (1936) for binary classification and pursued for more than two classes by Breiman and Ihaka (1984), with the aim of developing a non-linear extension of discriminant analysis based on additive models. They named their approach optimal scaling, for it optimizes the scaling of the indicators of classes together with the discriminant functions. Their approach was later disseminated under the name optimal scoring by Hastie et al. (1994), who proposed several extensions of LDA, either aiming at constructing more flexible discriminants (Hastie and Tibshirani 1996) or more conservative ones (Hastie et al. 1995).

As an alternative method to solve LDA problems, Hastie et al. (1995) proposed to incorporate a smoothness prior on the discriminant directions in the OS problem, through a positive-definite penalty matrix Ω, leading to a problem expressed in compact form as

    min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(Bᵀ Ω B)        (3.4a)
    s.t.  n⁻¹ Θᵀ Yᵀ Y Θ = I_{K−1} ,        (3.4b)

where Θ ∈ R^{K×(K−1)} are the class scores, B ∈ R^{p×(K−1)} are the regression coefficients, and ‖·‖_F is the Frobenius norm. This compact form does not render the order that arises naturally when considering the following series of K − 1 problems:

    min_{θ_k∈R^K, β_k∈R^p}  ‖Yθ_k − Xβ_k‖² + β_kᵀ Ω β_k        (3.5a)
    s.t.  n⁻¹ θ_kᵀ Yᵀ Y θ_k = 1        (3.5b)
          θ_kᵀ Yᵀ Y θ_ℓ = 0 ,  ℓ = 1, ..., k − 1 ,        (3.5c)

where each β_k corresponds to a discriminant direction.


Several sparse LDA have been derived by introducing non-quadratic sparsity-inducing penalties in the OS regression problem (Ghosh and Chinnaiyan 2005, Leng 2008, Grosenick et al. 2008, Clemmensen et al. 2011). Grosenick et al. (2008) proposed a variant of the Lasso-based penalized OS of Ghosh and Chinnaiyan (2005) by introducing an elastic-net penalty in binary class problems. A generalization to multi-class problems was suggested by Clemmensen et al. (2011), where the objective function (3.5a) is replaced by

    min_{β_k∈R^p, θ_k∈R^K}  Σ_k { ‖Yθ_k − Xβ_k‖₂² + λ_1 ‖β_k‖₁ + λ_2 β_kᵀ Ω β_k } ,

where λ_1 and λ_2 are regularization parameters and Ω is a penalization matrix, often taken to be the identity for the elastic net. The code for SLDA is available from the website of Line Clemmensen.

Another generalization of the work of Ghosh and Chinnaiyan (2005) was proposed by Leng (2008), with an extension to the multi-class framework based on a group-Lasso penalty in the objective function (3.5a):

    min_{β_k∈R^p, θ_k∈R^K}  Σ_{k=1}^{K−1} ‖Yθ_k − Xβ_k‖₂² + λ Σ_{j=1}^p ( Σ_{k=1}^{K−1} β_kj² )^{1/2} ,        (3.6)

which is the criterion that was chosen in this thesis.

The following chapters present our theoretical and algorithmic contributions regarding this formulation. The proposal of Leng (2008) was heuristically driven, and his algorithm followed closely the group-Lasso algorithm of Yuan and Lin (2006), which is not very efficient (the experiments of Leng (2008) are limited to small data sets with hundreds of examples and 1000 preselected genes, and no code is provided). Here we formally link (3.6) to penalized LDA and propose a publicly available efficient code for solving this problem.
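For intuition on why criterion (3.6) yields variable-wise sparsity, the small sketch below evaluates the group-Lasso penalty on an illustrative coefficient matrix B: the coefficients of a given feature are grouped across the K − 1 discriminant directions, so a feature is either kept in every direction or discarded from all of them.

    # Minimal sketch of the group-Lasso penalty used in (3.6), on an illustrative matrix B.
    import numpy as np

    def group_lasso_penalty(B):
        """Sum over features of the Euclidean norm of the corresponding row of B."""
        return np.sum(np.sqrt(np.sum(B ** 2, axis=1)))

    B = np.array([[0.0, 0.0],     # feature 1: removed from every direction
                  [1.2, -0.3],    # feature 2: active in both directions
                  [0.0, 0.0]])    # feature 3: removed from every direction
    print(group_lasso_penalty(B))   # 1.2369..., the norm of the only non-zero row
    print(np.abs(B).sum())          # Lasso penalty, for comparison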


                  4 Formalizing the Objective

In this chapter we detail the rationale supporting the group-Lasso Optimal Scoring Solver (GLOSS) algorithm. GLOSS addresses a sparse LDA problem globally, through a regression approach. Our analysis formally relates GLOSS to Fisher's discriminant analysis, and also enables to derive variants, such as LDA assuming a diagonal within-class covariance structure (Bickel and Levina 2004).

The sparsity arises from the group-Lasso penalty (3.6), due to Leng (2008), that selects the same features in all discriminant directions, thus providing an interpretable low-dimensional representation of data. For K classes, this representation can be either complete, in dimension K − 1, or partial, for a reduced-rank classification. The first two or three discriminants can also be used to display a graphical summary of the data.

The derivation of penalized LDA as a penalized optimal scoring regression is quite tedious, but it is required here, since the algorithm hinges on this equivalence. The main lines have been derived in several places (Breiman and Ihaka 1984, Hastie et al. 1994, Hastie and Tibshirani 1996, Hastie et al. 1995) and already used before for sparsity-inducing penalties (Roth and Lange 2004). However, the published demonstrations were quite elusive on a number of points, leading to generalizations that were not supported in a rigorous way. To our knowledge, we disclosed the first formal equivalence between the optimal scoring regression problem penalized by the group-Lasso and penalized LDA (Sanchez Merchante et al. 2012).

4.1 From Optimal Scoring to Linear Discriminant Analysis

Following Hastie et al. (1995), we now show the equivalence between the series of problems encountered in penalized optimal scoring (p-OS) problems and in penalized LDA (p-LDA) problems, by going through canonical correlation analysis. We first provide some properties about the solutions of an arbitrary problem in the p-OS series (3.5).

Throughout this chapter, we assume that:

• there is no empty class, that is, the diagonal matrix YᵀY is full rank;
• inputs are centered, that is, Xᵀ1_n = 0;
• the quadratic penalty Ω is positive-semidefinite and such that XᵀX + Ω is full rank.


4.1.1 Penalized Optimal Scoring Problem

For the sake of simplicity, we now drop the subscript k to refer to any problem in the p-OS series (3.5). First note that Problems (3.5) are biconvex in (θ, β), that is, convex in θ for each β value and vice-versa. The problems are however non-convex: in particular, if (θ, β) is a solution, then (−θ, −β) is also a solution.

The orthogonality constraints (3.5c) inherently limit the number of possible problems in the series to K, since we assumed that there are no empty classes. Moreover, as X is centered, the K − 1 first optimal scores are orthogonal to 1_n (and the Kth problem would be solved by β_K = 0). All the problems considered here can be solved by a singular value decomposition of a real symmetric matrix, so that the orthogonality constraints are easily dealt with. Hence, in the sequel, we do not mention these orthogonality constraints (3.5c) anymore, which apply along the route, so as to simplify all expressions. The generic problem solved is thus

    min_{θ∈R^K, β∈R^p}  ‖Yθ − Xβ‖² + βᵀ Ω β        (4.1a)
    s.t.  n⁻¹ θᵀ Yᵀ Y θ = 1 .        (4.1b)

For a given score vector θ, the discriminant direction β that minimizes the p-OS criterion (4.1) is the penalized least squares estimator

    β_os = (XᵀX + Ω)⁻¹ XᵀYθ .        (4.2)

The objective function (4.1a) is then

    ‖Yθ − Xβ_os‖² + β_osᵀ Ω β_os = θᵀYᵀYθ − 2 θᵀYᵀXβ_os + β_osᵀ (XᵀX + Ω) β_os
                                 = θᵀYᵀYθ − θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ ,

where the second line stems from the definition of β_os (4.2). Now, using the fact that the optimal θ obeys constraint (4.1b), the optimization problem is equivalent to

    max_{θ : n⁻¹θᵀYᵀYθ = 1}  θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ ,        (4.3)

which shows that the optimization of the p-OS problem with respect to θ_k boils down to finding the kth largest eigenvector of YᵀX (XᵀX + Ω)⁻¹ XᵀY. Indeed, Appendix C details that Problem (4.3) is solved by

    (YᵀY)⁻¹ YᵀX (XᵀX + Ω)⁻¹ XᵀY θ = α² θ ,        (4.4)


where α² is the maximal eigenvalue:¹

    n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ = α² n⁻¹ θᵀ(YᵀY)θ
    n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ = α² .        (4.5)

4.1.2 Penalized Canonical Correlation Analysis

As per Hastie et al. (1995), the penalized Canonical Correlation Analysis (p-CCA) problem between variables X and Y is defined as follows:

    max_{θ∈R^K, β∈R^p}  n⁻¹ θᵀYᵀXβ        (4.6a)
    s.t.  n⁻¹ θᵀYᵀYθ = 1        (4.6b)
          n⁻¹ βᵀ(XᵀX + Ω)β = 1 .        (4.6c)

The solutions to (4.6) are obtained by finding the saddle points of the Lagrangian:

    n L(β, θ, ν, γ) = θᵀYᵀXβ − ν(θᵀYᵀYθ − n) − γ(βᵀ(XᵀX + Ω)β − n)
    ⇒ n ∂L(β, θ, γ, ν)/∂β = XᵀYθ − 2γ(XᵀX + Ω)β
    ⇒ β_cca = (1/(2γ)) (XᵀX + Ω)⁻¹ XᵀYθ .

Then, as β_cca obeys (4.6c), we obtain

    β_cca = (XᵀX + Ω)⁻¹ XᵀYθ / ( n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ )^{1/2} ,        (4.7)

so that the optimal objective function (4.6a) can be expressed with θ alone:

    n⁻¹ θᵀYᵀXβ_cca = n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ / ( n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ )^{1/2}
                   = ( n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ )^{1/2} ,

and the optimization problem with respect to θ can be restated as

    max_{θ : n⁻¹θᵀYᵀYθ = 1}  θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ .        (4.8)

Hence the p-OS and p-CCA problems produce the same optimal score vectors θ. The regression coefficients are thus proportional, as shown by (4.2) and (4.7):

    β_os = α β_cca ,        (4.9)

¹The awkward notation α² for the eigenvalue was chosen here to ease comparison with Hastie et al. (1995). It is easy to check that this eigenvalue is indeed non-negative (see Equation (4.5) for example).


where α is defined by (4.5).

The p-CCA optimization problem can also be written as a function of β alone, using the optimality conditions for θ:

    n ∂L(β, θ, γ, ν)/∂θ = YᵀXβ − 2ν YᵀYθ
    ⇒ θ_cca = (1/(2ν)) (YᵀY)⁻¹ YᵀXβ .        (4.10)

Then, as θ_cca obeys (4.6b), we obtain

    θ_cca = (YᵀY)⁻¹ YᵀXβ / ( n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ )^{1/2} ,        (4.11)

leading to the following expression of the optimal objective function:

    n⁻¹ θ_ccaᵀ YᵀXβ = n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ / ( n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ )^{1/2}
                    = ( n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ )^{1/2} .

The p-CCA problem can thus be solved with respect to β by plugging this value in (4.6):

    max_{β∈R^p}  n⁻¹ βᵀXᵀY (YᵀY)⁻¹ YᵀXβ        (4.12a)
    s.t.  n⁻¹ βᵀ(XᵀX + Ω)β = 1 ,        (4.12b)

where the positive objective function has been squared compared to (4.6). This formulation is important, since it will be used to link p-CCA to p-LDA. We thus derive its solution; following the reasoning of Appendix C, β_cca verifies

    n⁻¹ XᵀY (YᵀY)⁻¹ YᵀX β_cca = λ (XᵀX + Ω) β_cca ,        (4.13)

where λ is the maximal eigenvalue, shown below to be equal to α²:

    n⁻¹ β_ccaᵀ XᵀY (YᵀY)⁻¹ YᵀX β_cca = λ
    ⇒ n⁻¹ α⁻¹ β_ccaᵀ XᵀY (YᵀY)⁻¹ YᵀX (XᵀX + Ω)⁻¹ XᵀYθ = λ
    ⇒ n⁻¹ α β_ccaᵀ XᵀYθ = λ
    ⇒ n⁻¹ θᵀYᵀX (XᵀX + Ω)⁻¹ XᵀYθ = λ
    ⇒ α² = λ .

The first line is obtained by obeying constraint (4.12b); the second line follows from the relationship (4.7), whose denominator is α; the third line comes from (4.4); the fourth line uses again the relationship (4.7); and the last one the definition of α (4.5).


4.1.3 Penalized Linear Discriminant Analysis

Still following Hastie et al. (1995), the penalized Linear Discriminant Analysis problem is defined as follows:

    max_{β∈R^p}  βᵀ Σ_B β        (4.14a)
    s.t.  βᵀ(Σ_W + n⁻¹Ω)β = 1 ,        (4.14b)

where Σ_B and Σ_W are respectively the sample between-class and within-class variances of the original p-dimensional data. This problem may be solved by an eigenvector decomposition, as detailed in Appendix C.

As the feature matrix X is assumed to be centered, the sample total, between-class and within-class covariance matrices can be written in a simple form that is amenable to a simple matrix representation using the projection operator Y(YᵀY)⁻¹Yᵀ:

    Σ_T = (1/n) Σ_{i=1}^n x_i x_iᵀ = n⁻¹ XᵀX
    Σ_B = (1/n) Σ_{k=1}^K n_k μ_k μ_kᵀ = n⁻¹ XᵀY (YᵀY)⁻¹ YᵀX
    Σ_W = (1/n) Σ_{k=1}^K Σ_{i : y_ik = 1} (x_i − μ_k)(x_i − μ_k)ᵀ = n⁻¹ ( XᵀX − XᵀY (YᵀY)⁻¹ YᵀX ) .

Using these formulae, the solution to the p-LDA problem (4.14) is obtained as

    XᵀY (YᵀY)⁻¹ YᵀX β_lda = λ ( XᵀX + Ω − XᵀY (YᵀY)⁻¹ YᵀX ) β_lda
    XᵀY (YᵀY)⁻¹ YᵀX β_lda = (λ/(1 − λ)) (XᵀX + Ω) β_lda .

The comparison of the last equation with β_cca (4.13) shows that β_lda and β_cca are proportional, and that λ/(1 − λ) = α². Using constraints (4.12b) and (4.14b), it comes that

    β_lda = (1 − α²)^{−1/2} β_cca
          = α⁻¹ (1 − α²)^{−1/2} β_os ,

which ends the path from p-OS to p-LDA.


4.1.4 Summary

The three previous subsections considered a generic form of the kth problem in the p-OS series. The relationships unveiled above also hold for the compact notation gathering all problems (3.4), which is recalled below:

    min_{Θ,B}  ‖YΘ − XB‖²_F + λ tr(Bᵀ Ω B)
    s.t.  n⁻¹ Θᵀ Yᵀ Y Θ = I_{K−1} .

Let A represent the (K−1)×(K−1) diagonal matrix with elements α_k, the square roots of the K−1 largest eigenvalues of YᵀX (XᵀX + Ω)⁻¹ XᵀY; we have

    B_LDA = B_CCA (I_{K−1} − A²)^{−1/2}
          = B_OS A⁻¹ (I_{K−1} − A²)^{−1/2} ,        (4.15)

where I_{K−1} is the (K−1)×(K−1) identity matrix.

At this point, the feature matrix X, which in the input space has dimensions n×p, can be projected into the optimal scoring domain as an n×(K−1) matrix X_OS = X B_OS, or into the linear discriminant analysis space as an n×(K−1) matrix X_LDA = X B_LDA. Classification can be performed in any of those domains if the appropriate distance (penalized within-class covariance matrix) is applied.

With the aim of performing classification, the whole process can be summarized as follows:

1. Solve the p-OS problem as B_OS = (XᵀX + λΩ)⁻¹ XᵀYΘ, where Θ are the K−1 leading eigenvectors of YᵀX (XᵀX + λΩ)⁻¹ XᵀY.

2. Translate the data samples X into the LDA domain as X_LDA = X B_OS D, where D = A⁻¹ (I_{K−1} − A²)^{−1/2}.

3. Compute the matrix M of centroids μ_k from X_LDA and Y.

4. Evaluate the distances d(x, μ_k) in the LDA domain as a function of M and X_LDA.

5. Translate distances into posterior probabilities and assign every sample i to a class k following the maximum a posteriori rule.

6. Build a graphical representation.


The solution of the penalized optimal scoring regression and the computation of the distance and posterior matrices are detailed in Sections 4.2.1, 4.2.2 and 4.2.3, respectively.

4.2 Practicalities

4.2.1 Solution of the Penalized Optimal Scoring Regression

Following Hastie et al. (1994) and Hastie et al. (1995), a quadratically penalized LDA problem can be presented as a quadratically penalized OS problem:

    min_{Θ∈R^{K×(K−1)}, B∈R^{p×(K−1)}}  ‖YΘ − XB‖²_F + λ tr(Bᵀ Ω B)        (4.16a)
    s.t.  n⁻¹ Θᵀ Yᵀ Y Θ = I_{K−1} ,        (4.16b)

where Θ are the class scores, B the regression coefficients, and ‖·‖_F is the Frobenius norm.

Though non-convex, the OS problem is readily solved by a decomposition in Θ and B: the optimal B_OS does not intervene in the optimality conditions with respect to Θ, and the optimization with respect to B is obtained in closed form as a linear combination of the optimal scores Θ (Hastie et al. 1995). The algorithm may seem a bit tortuous considering the properties mentioned above, as it proceeds in four steps:

1. Initialize Θ to Θ⁰ such that n⁻¹ Θ⁰ᵀ Yᵀ Y Θ⁰ = I_{K−1}.

2. Compute B = (XᵀX + λΩ)⁻¹ XᵀYΘ⁰.

3. Set Θ to be the K−1 leading eigenvectors of YᵀX (XᵀX + λΩ)⁻¹ XᵀY.

4. Compute the optimal regression coefficients

       B_OS = (XᵀX + λΩ)⁻¹ XᵀYΘ .        (4.17)

Defining Θ⁰ in Step 1, instead of using directly Θ as expressed in Step 3, drastically reduces the computational burden of the eigen-analysis: the latter is performed on Θ⁰ᵀ YᵀX (XᵀX + λΩ)⁻¹ XᵀY Θ⁰, which is computed as Θ⁰ᵀ YᵀX B, thus avoiding a costly matrix inversion. The solution of the penalized optimal scoring problem as an eigenvector decomposition is detailed and justified in Appendix B.

This four-step algorithm is valid when the penalty is of the form tr(Bᵀ Ω B). However, when an L1 penalty is applied in (4.16), the optimization algorithm requires iterative updates of B and Θ. That situation is developed by Clemmensen et al. (2011), where a Lasso or an elastic net penalty is used to induce sparsity in the OS problem. Furthermore, these Lasso and elastic net penalties do not enjoy the equivalence with LDA problems.

4.2.2 Distance Evaluation

The simplest classification rule is the nearest centroid rule, where the sample x_i is assigned to class k if x_i is closer (in terms of the shared within-class Mahalanobis distance) to centroid μ_k than to any other centroid μ_ℓ. In general, the parameters of the model are unknown, and the rule is applied with the parameters estimated from training data (the sample estimators of μ_k and Σ_W). If μ_k are the centroids in the input space, sample x_i is assigned to class k if the distance

    d(x_i, μ_k) = (x_i − μ_k)ᵀ Σ_{WΩ}⁻¹ (x_i − μ_k) − 2 log(n_k/n)        (4.18)

is minimized over all k. In expression (4.18), the first term is the Mahalanobis distance in the input space and the second term is an adjustment term for unequal class sizes that estimates the prior probability of class k. Note that this is inspired by the Gaussian view of LDA, and that another definition of the adjustment term could be used (Friedman et al. 2009, Mai et al. 2012). The matrix Σ_{WΩ} used in (4.18) is the penalized within-class covariance matrix, which can be decomposed into a penalized and a non-penalized component:

    Σ_{WΩ}⁻¹ = ( n⁻¹(XᵀX + λΩ) − Σ_B )⁻¹
             = ( n⁻¹XᵀX − Σ_B + n⁻¹λΩ )⁻¹
             = ( Σ_W + n⁻¹λΩ )⁻¹ .        (4.19)

Before explaining how to compute the distances, let us summarize some clarifying points:

• The solution B_OS of the p-OS problem is enough to accomplish classification.
• In the LDA domain (space of discriminant variates X_LDA), classification is based on Euclidean distances.
• Classification can be done in a reduced-rank space of dimension R < K−1, by using the first R discriminant directions {β_k}_{k=1}^R.

As a result, the expression of the distance (4.18) depends on the domain where the classification is performed. If we classify in the p-OS domain, it is

    ‖(x_i − μ_k) B_OS‖²_{Σ_WΩ} − 2 log(π_k) ,

where π_k is the estimated class prior and ‖·‖_S is the Mahalanobis distance assuming within-class covariance S. If classification is done in the p-LDA domain, it is

    ‖(x_i − μ_k) B_OS A⁻¹ (I_{K−1} − A²)^{−1/2}‖₂² − 2 log(π_k) ,

which is a plain Euclidean distance.


4.2.3 Posterior Probability Evaluation

Let d(x, μ_k) be the distance between x and μ_k defined as in (4.18). Under the assumption that classes are Gaussian, the posterior probabilities p(y_k = 1|x) can be estimated as

    p̂(y_k = 1|x) ∝ exp( −d(x, μ_k)/2 )
                  ∝ π_k exp( −(1/2) ‖(x − μ_k) B_OS A⁻¹ (I_{K−1} − A²)^{−1/2}‖₂² ) .        (4.20)

These probabilities must be normalized to ensure that they sum to one. When the distances d(x, μ_k) take large values, exp(−d(x, μ_k)/2) can take extremely small values, generating underflow issues. A classical trick to fix this numerical issue is detailed below:

    p̂(y_k = 1|x) = π_k exp( −d(x, μ_k)/2 ) / Σ_ℓ π_ℓ exp( −d(x, μ_ℓ)/2 )
                  = π_k exp( (−d(x, μ_k) + d_max)/2 ) / Σ_ℓ π_ℓ exp( (−d(x, μ_ℓ) + d_max)/2 ) ,

where d_max = max_k d(x, μ_k).
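The trick above is easily implemented; the following sketch, with illustrative distances and priors, shows that shifting all distances by d_max leaves the normalized posteriors unchanged while avoiding underflow.

    # Minimal sketch of the underflow-safe posterior normalization described above.
    import numpy as np

    def posteriors(d, priors):
        """d: distances d(x, mu_k); priors: estimated class priors pi_k."""
        d = np.asarray(d, dtype=float)
        shifted = np.exp((-d + d.max()) / 2) * priors   # same ratios as exp(-d/2) * priors
        return shifted / shifted.sum()

    d = np.array([2000.0, 2004.0, 2010.0])              # exp(-1000) would underflow to 0
    print(posteriors(d, priors=np.array([1/3, 1/3, 1/3])))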

4.2.4 Graphical Representation

Sometimes it can be useful to have a graphical display of the data set. Using only the two or three most discriminant directions may not provide the best separation between classes, but can suffice to inspect the data. That can be accomplished by plotting the first two or three dimensions of the regression fits X_OS or of the discriminant variates X_LDA, depending on whether we represent the dataset in the OS or in the LDA domain. Other attributes, such as the centroids or the shape of the within-class variance, can also be represented.

4.3 From Sparse Optimal Scoring to Sparse LDA

The equivalence stated in Section 4.1 holds for quadratic penalties of the form βᵀΩβ, under the assumption that YᵀY and XᵀX + λΩ are full rank (fulfilled when there are no empty classes and Ω is positive definite). Quadratic penalties have interesting properties, but as recalled in Section 2.3, they do not induce sparsity. In this respect, L1 penalties are preferable, but they lack a connection, such as the one stated by Hastie et al. (1995), between p-LDA and p-OS.

In this section we introduce the tools used to obtain sparse models while maintaining the equivalence between p-LDA and p-OS problems. We use a group-Lasso penalty (see Section 2.3.4) that induces groups of zeros in the coefficients corresponding to the same feature in all discriminant directions, resulting in truly parsimonious models. Our derivation uses a variational formulation of the group-Lasso to generalize the equivalence drawn by Hastie et al. (1995) for quadratic penalties. Therefore, we intend to show that our formulation of the group-Lasso can be written in the quadratic form BᵀΩB.

4.3.1 A Quadratic Variational Form

Quadratic variational forms of the Lasso and group-Lasso have been proposed shortly after the original Lasso paper of Hastie and Tibshirani (1996), as a means to address optimization issues, but also as an inspiration for generalizing the Lasso penalty (Grandvalet 1998, Canu and Grandvalet 1999). The algorithms based on these quadratic variational forms iteratively reweight a quadratic penalty. They are now often outperformed by more efficient strategies (Bach et al. 2012).

Our formulation of the group-Lasso is shown below:

    min_{τ∈R^p} min_{B∈R^{p×(K−1)}}  J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖₂² / τ_j        (4.21a)
    s.t.  Σ_j τ_j − Σ_j w_j ‖β^j‖₂ ≤ 0        (4.21b)
          τ_j ≥ 0 ,  j = 1, ..., p ,        (4.21c)

where B ∈ R^{p×(K−1)} is a matrix composed of row vectors β^j ∈ R^{K−1}, B = (β^1, ..., β^p)ᵀ, and the w_j are predefined nonnegative weights. The cost function J(B) is, in our context, the OS regression ‖YΘ − XB‖₂²; from now on, for the sake of simplicity, we simply write J(B). Here and in what follows, b/τ is defined by continuation at zero, as b/0 = +∞ if b ≠ 0 and 0/0 = 0. Note that variants of (4.21) have been proposed elsewhere (see e.g. Canu and Grandvalet 1999, Bach et al. 2012, and references therein).

The intuition behind our approach is that, using the variational formulation, we recast a non-quadratic penalty into the convex hull of a family of quadratic penalties indexed by the variables τ_j. This is graphically shown in Figure 4.1.

Let us start by proving the equivalence of our variational formulation and the standard group-Lasso (an alternative variational formulation is detailed and demonstrated in Appendix D).

Lemma 4.1. The quadratic penalty in β^j in (4.21) acts as the group-Lasso penalty λ Σ_{j=1}^p w_j ‖β^j‖₂.

Proof. The Lagrangian of Problem (4.21) is

    L = J(B) + λ Σ_{j=1}^p w_j² ‖β^j‖₂² / τ_j + ν₀ ( Σ_{j=1}^p τ_j − Σ_{j=1}^p w_j ‖β^j‖₂ ) − Σ_{j=1}^p ν_j τ_j .


Figure 4.1: Graphical representation of the variational approach to the group-Lasso.

Thus, the first-order optimality conditions for τ_j* are

    ∂L/∂τ_j (τ_j*) = 0  ⇔  −λ w_j² ‖β^j‖₂² / τ_j*² + ν₀ − ν_j = 0
                        ⇔  −λ w_j² ‖β^j‖₂² + ν₀ τ_j*² − ν_j τ_j*² = 0
                        ⇒  −λ w_j² ‖β^j‖₂² + ν₀ τ_j*² = 0 .

The last line is obtained from complementary slackness, which implies here ν_j τ_j* = 0 (complementary slackness states that ν_j g_j(τ_j*) = 0, where ν_j is the Lagrange multiplier for the constraint g_j(τ_j) ≤ 0). As a result, the optimal value of τ_j is

    τ_j* = ( λ w_j² ‖β^j‖₂² / ν₀ )^{1/2} = (λ/ν₀)^{1/2} w_j ‖β^j‖₂ .        (4.22)

We note that ν₀ ≠ 0 if there is at least one coefficient β_jk ≠ 0; thus, the inequality constraint (4.21b) is at bound (due to complementary slackness):

    Σ_{j=1}^p τ_j* − Σ_{j=1}^p w_j ‖β^j‖₂ = 0 ,        (4.23)

so that τ_j* = w_j ‖β^j‖₂. Using this value in (4.21a), it is possible to conclude that Problem (4.21) is equivalent to the standard group-Lasso operator

    min_{B∈R^{p×M}}  J(B) + λ Σ_{j=1}^p w_j ‖β^j‖₂ .        (4.24)

So we have presented a convex quadratic variational form of the group-Lasso and demonstrated its equivalence with the standard group-Lasso formulation. □


With Lemma 4.1 we have proved that, under constraints (4.21b)-(4.21c), the quadratic problem (4.21a) is equivalent to the standard formulation of the group-Lasso (4.24). The penalty term of (4.21a) can be conveniently presented as λ Bᵀ Ω B, where

    Ω = diag( w_1²/τ_1, w_2²/τ_2, ..., w_p²/τ_p ) ,        (4.25)

with τ_j = w_j ‖β^j‖₂, resulting in the diagonal components of Ω:

    (Ω)_jj = w_j / ‖β^j‖₂ .        (4.26)

As stated at the beginning of this section, the equivalence between p-LDA problems and p-OS problems is thus demonstrated for the variational formulation. This equivalence is crucial to the derivation of the link between sparse OS and sparse LDA; it furthermore suggests a convenient implementation. We sketch below some properties that are instrumental in the implementation of the active set algorithm described in Chapter 5.

The first property states that the quadratic formulation is convex when J is convex, thus providing an easy control of optimality and convergence.

Lemma 4.2. If J is convex, Problem (4.21) is convex.

Proof. The function g(β, τ) = ‖β‖₂²/τ, known as the perspective function of f(β) = ‖β‖₂², is convex in (β, τ) (see e.g. Boyd and Vandenberghe 2004, Chapter 3), and the constraints (4.21b)-(4.21c) define convex admissible sets; hence Problem (4.21) is jointly convex with respect to (B, τ). □

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma 4.3. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (4.24) is

    { V ∈ R^{p×(K−1)} : V = ∂J(B)/∂B + λG } ,        (4.27)

where G ∈ R^{p×(K−1)} is a matrix composed of row vectors g^j ∈ R^{K−1}, G = (g^1, ..., g^p)ᵀ, defined as follows. Let S(B) denote the columnwise support of B, S(B) = { j ∈ {1, ..., p} : ‖β^j‖₂ ≠ 0 }; then we have

    ∀ j ∈ S(B) ,  g^j = w_j ‖β^j‖₂⁻¹ β^j        (4.28)
    ∀ j ∉ S(B) ,  ‖g^j‖₂ ≤ w_j .        (4.29)


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Proof. When ‖β^j‖₂ ≠ 0, the gradient of the penalty with respect to β^j is

    ∂( λ Σ_{m=1}^p w_m ‖β^m‖₂ ) / ∂β^j = λ w_j β^j / ‖β^j‖₂ .        (4.30)

At ‖β^j‖₂ = 0, the gradient of the objective function is not continuous, and the optimality conditions then make use of the subdifferential (Bach et al. 2011):

    ∂_{β^j} ( λ Σ_{m=1}^p w_m ‖β^m‖₂ ) = ∂_{β^j} ( λ w_j ‖β^j‖₂ ) = { λ w_j v ∈ R^{K−1} : ‖v‖₂ ≤ 1 } ,        (4.31)

which gives expression (4.29). □

Lemma 4.4. Problem (4.21) admits at least one solution, which is unique if J is strictly convex. All critical points B of the objective function verifying the following conditions are global minima:

    ∀ j ∈ S ,  ∂J(B)/∂β^j + λ w_j ‖β^j‖₂⁻¹ β^j = 0        (4.32a)
    ∀ j ∉ S ,  ‖∂J(B)/∂β^j‖₂ ≤ λ w_j ,        (4.32b)

where S ⊆ {1, ..., p} denotes the set of non-zero row vectors β^j and S̄ is its complement.

Lemma 4.4 provides a simple appraisal of the support of the solution, which would not be as easily handled with the direct analysis of the variational problem (4.21).
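The optimality conditions (4.32a)-(4.32b) are what the working-set algorithm of the next chapter monitors; a minimal sketch of such a check is given below, where grad_J stands for the gradient of the smooth cost at B, and the tolerance, data and weights are illustrative assumptions.

    # Minimal sketch of an optimality check based on Lemma 4.4 for problem (4.24).
    import numpy as np

    def check_optimality(B, grad_J, lam, w, tol=1e-6):
        """Return True if B satisfies conditions (4.32a)-(4.32b) up to tol."""
        for j in range(B.shape[0]):
            norm_bj = np.linalg.norm(B[j])
            if norm_bj > 0:                               # active row: stationarity (4.32a)
                residual = grad_J[j] + lam * w[j] * B[j] / norm_bj
                if np.linalg.norm(residual) > tol:
                    return False
            else:                                         # inactive row: subgradient bound (4.32b)
                if np.linalg.norm(grad_J[j]) > lam * w[j] + tol:
                    return False
        return True

    # Toy usage with a quadratic cost J(B) = 0.5 ||Y Theta - X B||^2, whose gradient is
    # X'(X B - Y Theta); the data below are illustrative.
    rng = np.random.default_rng(4)
    X = rng.standard_normal((30, 6))
    R = rng.standard_normal((30, 2))                      # stands for Y Theta
    B = np.zeros((6, 2))
    grad_J = X.T @ (X @ B - R)
    print(check_optimality(B, grad_J, lam=500.0, w=np.ones(6)))   # large lam: B = 0 is optimal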

4.3.2 Group-Lasso OS as Penalized LDA

With all the previous ingredients, the group-Lasso Optimal Scoring Solver for performing sparse LDA can be introduced.

Proposition 4.1. The group-Lasso OS problem

    B_OS = argmin_{B∈R^{p×(K−1)}} min_{Θ∈R^{K×(K−1)}}  (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖β^j‖₂
    s.t.  n⁻¹ Θᵀ Yᵀ Y Θ = I_{K−1}

is equivalent to the penalized LDA problem

    B_LDA = argmax_{B∈R^{p×(K−1)}}  tr(Bᵀ Σ_B B)
    s.t.  Bᵀ(Σ_W + n⁻¹λΩ)B = I_{K−1} ,

where Ω = diag( w_1²/τ_1, ..., w_p²/τ_p ), with

    Ω_jj = +∞ if β^j_os = 0 ,  and  Ω_jj = w_j ‖β^j_os‖₂⁻¹ otherwise.        (4.33)

That is, B_LDA = B_OS diag( α_k⁻¹ (1 − α_k²)^{−1/2} ), where α_k ∈ (0,1) is the kth leading eigenvalue of

    n⁻¹ YᵀX (XᵀX + λΩ)⁻¹ XᵀY .

Proof. The proof simply consists in applying the result of Hastie et al. (1995), which holds for quadratic penalties, to the quadratic variational form of the group-Lasso. □

The proposition applies in particular to the Lasso-based OS approaches to sparse LDA (Grosenick et al. 2008, Clemmensen et al. 2011) for K = 2, that is, for binary classification, or more generally for a single discriminant direction. Note however that it leads to a slightly different decision rule if the decision threshold is chosen a priori, according to the Gaussian assumption for the features. For more than one discriminant direction, the equivalence does not hold anymore, since the Lasso penalty does not result in an equivalent quadratic penalty in the simple form tr(Bᵀ Ω B).


                  5 GLOSS Algorithm

The efficient approaches developed for the Lasso take advantage of the sparsity of the solution by solving a series of small linear systems, whose sizes are incrementally increased or decreased (Osborne et al. 2000a). This approach was also pursued for the group-Lasso in its standard formulation (Roth and Fischer 2008). We adapt this algorithmic framework to the variational form (4.21), with J(B) = (1/2) ‖YΘ − XB‖₂².

The algorithm belongs to the working set family of optimization methods (see Section 2.3.6). It starts from a sparse initial guess, say B = 0, thus defining the set A of "active" variables, currently identified as non-zero. Then it iterates the three steps summarized below:

1. Update the coefficient matrix B within the current active set A, where the optimization problem is smooth. First the quadratic penalty is updated, and then a standard penalized least squares fit is computed.

2. Check the optimality conditions (4.32) with respect to the active variables. One or more β^j may be declared inactive when they vanish from the current solution.

3. Check the optimality conditions (4.32) with respect to inactive variables. If they are satisfied, the algorithm returns the current solution, which is optimal. If they are not satisfied, the variable corresponding to the greatest violation is added to the active set.

This mechanism is graphically represented in Figure 5.1 as a block diagram and formalized in more detail in Algorithm 1. Note that this formulation uses the equations from the variational approach detailed in Section 4.3.1. If we want to use the alternative variational approach of Appendix D, then we have to replace Equations (4.21), (4.32a) and (4.32b) by (D.1), (D.10a) and (D.10b), respectively.

5.1 Regression Coefficients Updates

Step 1 of Algorithm 1 updates the coefficient matrix B within the current active set A. The quadratic variational form of the problem suggests a blockwise optimization strategy, consisting in solving K−1 independent card(A)-dimensional problems instead of a single (K−1)×card(A)-dimensional problem. The interaction between the K−1 problems is relegated to the common adaptive quadratic penalty Ω. This decomposition is especially attractive as we then solve K−1 similar systems

    (X_Aᵀ X_A + λΩ) β_k = X_Aᵀ Y θ_k⁰ ,        (5.1)


[Block diagram of GLOSS: initialize the model (λ, B); form the active set {j : ‖β^j‖₂ > 0}; solve the p-OS problem so that B satisfies the first optimality condition; move to the inactive set any active variable that has to leave it; test the second optimality condition on the inactive set and move the violating variable, if any, into the active set; when no variable moves anymore, compute Θ, update B and stop.]

Figure 5.1: GLOSS block diagram.


Algorithm 1: Adaptively Penalized Optimal Scoring

Input: X, Y, B, λ
Initialize: A ← { j ∈ {1, ..., p} : ‖β^j‖₂ > 0 }; Θ⁰ such that n⁻¹ Θ⁰ᵀYᵀYΘ⁰ = I_{K−1}; convergence ← false
repeat
  % Step 1: solve (4.21) in B, assuming A optimal
  repeat
    Ω ← diag(Ω_A), with ω_j ← ‖β^j‖₂⁻¹
    B_A ← (X_AᵀX_A + λΩ)⁻¹ X_AᵀYΘ⁰
  until condition (4.32a) holds for all j ∈ A
  % Step 2: identify inactivated variables
  for j ∈ A such that ‖β^j‖₂ = 0 do
    if optimality condition (4.32b) holds then
      A ← A \ {j}; go back to Step 1
    end if
  end for
  % Step 3: check the greatest violation of optimality condition (4.32b) in the set Ā
  ĵ ← argmax_{j∈Ā} ‖∂J/∂β^j‖₂
  if ‖∂J/∂β^ĵ‖₂ < λ then
    convergence ← true (B is optimal)
  else
    A ← A ∪ {ĵ}
  end if
until convergence
(s, V) ← eigenanalyze(Θ⁰ᵀYᵀX_A B), that is, Θ⁰ᵀYᵀX_A B V_k = s_k V_k, k = 1, ..., K−1
Θ ← Θ⁰V; B ← BV; α_k ← n^{−1/2} s_k^{1/2}, k = 1, ..., K−1
Output: Θ, B, α


where X_A denotes the columns of X indexed by A, and β_k and θ_k^0 denote the kth columns of B and Θ0, respectively. These linear systems only differ in their right-hand-side term, so that a single Cholesky decomposition is necessary to solve all of them, whereas a blockwise Newton–Raphson method based on the standard group-Lasso formulation would result in different "penalties" Ω for each system.

5.1.1 Cholesky Decomposition

Dropping the subscripts and considering the (K − 1) systems together, (5.1) leads to

    (X^T X + λΩ) B = X^T Y Θ .    (5.2)

Defining the Cholesky decomposition as C^T C = X^T X + λΩ, (5.2) is solved efficiently as follows:

    C^T C B = X^T Y Θ
        C B = C^T \ (X^T Y Θ)
          B = C \ (C^T \ (X^T Y Θ)) ,    (5.3)

where the symbol "\" is the Matlab mldivide operator, which solves linear systems efficiently. The GLOSS code implements (5.3).
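To make this step concrete, here is a minimal Matlab sketch of (5.3); it is not the GLOSS package code, and the names X, Y, Theta0, Omega and lambda stand for the quantities restricted to the current active set.

    % Sketch of (5.3): one Cholesky factorization shared by the K-1 right-hand sides.
    % X: n x card(A), Y: n x K, Theta0: K x (K-1), Omega: diagonal card(A) x card(A).
    C = chol(X'*X + lambda*Omega);   % upper triangular, C'*C = X'*X + lambda*Omega
    RHS = X' * (Y*Theta0);           % the K-1 right-hand sides
    B = C \ (C' \ RHS);              % two triangular solves with mldivide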

5.1.2 Numerical Stability

The OS regression coefficients are obtained by (5.2), where the penalizer Ω is iteratively updated by (4.33). In this iterative process, when a variable is about to leave the active set, the corresponding entry of Ω reaches large values, thereby driving some OS regression coefficients to zero. These large values may cause numerical stability problems in the Cholesky decomposition of X^T X + λΩ. This difficulty can be avoided by using the following equivalent expression:

    B = Ω^{−1/2} ( Ω^{−1/2} X^T X Ω^{−1/2} + λI )^{−1} Ω^{−1/2} X^T Y Θ0 ,    (5.4)

where the conditioning of Ω^{−1/2} X^T X Ω^{−1/2} + λI is always well-behaved provided X is appropriately normalized (recall that 0 ≤ 1/ω_j ≤ 1). This more stable expression demands more computation and is thus reserved for cases with large ω_j values. Our code is otherwise based on expression (5.2).
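The rescaled computation (5.4) can be sketched as follows; this is an illustration under the assumption that Omega is the diagonal adaptive penalty used by GLOSS, not the actual package code.

    % Sketch of the stabilized expression (5.4), for iterations with large omega_j.
    d = 1 ./ sqrt(diag(Omega));            % entries of Omega^{-1/2}, all in (0, 1]
    Xs = X .* d';                          % columns of X scaled by Omega^{-1/2} (implicit expansion, R2016b+)
    M = Xs'*Xs + lambda*eye(numel(d));     % Omega^{-1/2} X'X Omega^{-1/2} + lambda*I
    B = d .* (M \ (Xs' * (Y*Theta0)));     % left multiplication by Omega^{-1/2}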

5.2 Score Matrix

The optimal score matrix Θ is made of the K − 1 leading eigenvectors of Y^T X (X^T X + Ω)^{−1} X^T Y. This eigen-analysis is actually solved in the form Θ^T Y^T X (X^T X + Ω)^{−1} X^T Y Θ (see Section 4.2.1 and Appendix B). The latter eigenvector decomposition does not require the costly computation of (X^T X + Ω)^{−1}, which involves the inversion of a p × p matrix. Let Θ0 be an arbitrary K × (K − 1) matrix whose range includes the K − 1 leading eigenvectors of Y^T X (X^T X + Ω)^{−1} X^T Y.¹ Then, solving the K − 1 systems (5.3) provides the value of B0 = (X^T X + λΩ)^{−1} X^T Y Θ0. This B0 matrix can be identified in the expression to eigen-analyze as

    Θ0^T Y^T X (X^T X + Ω)^{−1} X^T Y Θ0 = Θ0^T Y^T X B0 .

Thus, the solution to the penalized OS problem can be computed through the singular value decomposition of the (K − 1) × (K − 1) matrix Θ0^T Y^T X B0 = V Λ V^T. Defining Θ = Θ0 V, we have Θ^T Y^T X (X^T X + Ω)^{−1} X^T Y Θ = Λ, and when Θ0 is chosen such that n^{−1} Θ0^T Y^T Y Θ0 = I_{K−1}, we also have n^{−1} Θ^T Y^T Y Θ = I_{K−1}, so that the constraints of the p-OS problem hold. Hence, assuming that the diagonal elements of Λ are sorted in decreasing order, Θ is an optimal solution to the p-OS problem. Finally, once Θ has been computed, the corresponding optimal regression coefficients B satisfying (5.2) are simply recovered using the mapping from Θ0 to Θ, that is, B = B0 V. Appendix E details why the computational trick described here for quadratic penalties can be applied to the group-Lasso, for which Ω is defined by a variational formulation.

¹ As X is centered, 1_K belongs to the null space of Y^T X (X^T X + Ω)^{−1} X^T Y. It is thus sufficient to choose Θ0 orthogonal to 1_K to ensure that its range spans the leading eigenvectors of Y^T X (X^T X + Ω)^{−1} X^T Y. In practice, to comply with this desideratum and with conditions (3.5b) and (3.5c), we set Θ0 = (Y^T Y)^{−1/2} U, where U is a K × (K − 1) matrix whose columns are orthonormal vectors orthogonal to 1_K.
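For illustration, this final rescoring step can be sketched in Matlab as follows; B0, Theta0 and n are the quantities defined above, and the code is an illustrative sketch rather than the package implementation.

    % Eigen-analysis of the small (K-1)x(K-1) matrix Theta0'*Y'*X*B0,
    % then mapping Theta0 -> Theta and B0 -> B as described in the text.
    M = Theta0' * (Y' * (X*B0));
    [V, D] = eig((M + M')/2);                % symmetrized for numerical safety
    [s, order] = sort(diag(D), 'descend');   % eigenvalues in decreasing order
    V = V(:, order);
    Theta = Theta0 * V;                      % optimal scores
    B = B0 * V;                              % corresponding regression coefficients
    alpha = sqrt(max(s, 0) / n);             % alpha_k = n^{-1/2} s_k^{1/2}, cf. Algorithm 1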

5.3 Optimality Conditions

GLOSS uses an active set optimization technique to obtain the optimal values of the coefficient matrix B and the score matrix Θ. To be a solution, the coefficient matrix must obey Lemmas 4.3 and 4.4, from which the optimality conditions (4.32a) and (4.32b) can be deduced. Both expressions require the computation of the gradient of the objective function

    (1/2) ‖YΘ − XB‖²_F + λ Σ_{j=1}^p w_j ‖β^j‖₂ .    (5.5)

Let J(B) denote the data-fitting term (1/2) ‖YΘ − XB‖²_F. Its gradient with respect to β^j, the jth row of B, is the (K − 1)-dimensional vector

    ∂J(B)/∂β^j = x_j^T (XB − YΘ) ,

where x_j is the jth column of X. Hence the first optimality condition (4.32a) can be computed, for every active variable j, as

    x_j^T (XB − YΘ) + λ w_j β^j / ‖β^j‖₂ = 0 .

The second optimality condition (4.32b) can be computed, for every variable j, as

    ‖x_j^T (XB − YΘ)‖₂ ≤ λ w_j .

5.4 Active and Inactive Sets

The feature selection mechanism embedded in GLOSS selects the variables that provide the greatest decrease in the objective function. This is accomplished by means of the optimality conditions (4.32a) and (4.32b). Let A be the active set, containing the variables that have already been considered relevant. A variable j can be considered for inclusion into the active set if it violates the second optimality condition. We proceed one variable at a time, choosing the one that is expected to produce the greatest decrease in the objective function:

    ĵ = argmax_j  max( ‖x_j^T (XB − YΘ)‖₂ − λ w_j , 0 ) .

The exclusion of a variable belonging to the active set A is considered if the norm ‖β^j‖₂ is small and if, after setting β^j to zero, the following optimality condition holds:

    ‖x_j^T (XB − YΘ)‖₂ ≤ λ w_j .

The process continues until no variable in the active set violates the first optimality condition and no variable in the inactive set violates the second optimality condition.
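The two tests can be sketched in Matlab as follows; active is a logical vector of size p, w contains the penalty weights, and all names are illustrative rather than those of the GLOSS code.

    % Gradient of the data-fitting term for all variables, and both tests.
    G = X' * (X*B - Y*Theta);                     % p x (K-1), row j is dJ/dbeta^j
    normG = sqrt(sum(G.^2, 2));                   % ||dJ/dbeta^j||_2
    % Inclusion: worst violator of the second optimality condition among inactive variables.
    score = (normG - lambda*w(:)) .* ~active(:);
    [viol, j] = max(score);
    if viol > 0
        active(j) = true;
    end
    % Exclusion: an active variable with a vanishing row that satisfies the condition may leave.
    leaving = active(:) & (sqrt(sum(B.^2, 2)) < 1e-8) & (normG <= lambda*w(:));
    active(leaving) = false;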

5.5 Penalty Parameter

The penalty parameter can be specified by the user, in which case GLOSS solves the problem with this value of λ. The other strategy is to compute the solution path for several values of λ: GLOSS then looks for the maximum value of the penalty parameter, λmax, such that B ≠ 0, and solves the p-OS problem for decreasing values of λ until a prescribed number of features are declared active.

The maximum value of the penalty parameter, λmax, corresponding to a null B matrix, is obtained by evaluating the optimality condition (4.32b) at B = 0:

    λmax = max_{j ∈ {1,…,p}}  (1/w_j) ‖x_j^T Y Θ0‖₂ .

The algorithm then computes a series of solutions along the regularization path defined by a series of penalties λ1 = λmax > · · · > λt > · · · > λT = λmin ≥ 0, obtained by regularly decreasing the penalty, λ_{t+1} = λ_t / 2, and using a warm-start strategy in which the feasible initial guess for B(λ_{t+1}) is B(λ_t). The final penalty parameter λmin is determined during the optimization process, when the maximum number of desired active variables is attained (by default, the minimum of n and p).
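A short Matlab sketch of λmax and of the halving schedule follows; w denotes the penalty weights, and the number of path points T is an assumption of the sketch, not a GLOSS default.

    % lambda_max from condition (4.32b) at B = 0, then a geometric path.
    G0 = X' * (Y*Theta0);                           % p x (K-1) gradients at B = 0
    lambda_max = max(sqrt(sum(G0.^2, 2)) ./ w(:));  % max_j (1/w_j) ||x_j' Y Theta0||_2
    T = 20;                                         % illustrative number of path points
    lambdas = lambda_max ./ 2.^(0:T-1);             % lambda_{t+1} = lambda_t / 2, with warm starts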


5.6 Options and Variants

5.6.1 Scaling Variables

Like most penalization schemes, GLOSS is sensitive to the scaling of the variables. It thus makes sense to normalize them before applying the algorithm or, equivalently, to accommodate weights in the penalty. This option is available in the algorithm.

5.6.2 Sparse Variant

This version replaces some Matlab commands used in the standard version of GLOSS by their sparse equivalents. In addition, some mathematical structures are adapted for sparse computation.

5.6.3 Diagonal Variant

We motivated the group-Lasso penalty by sparsity requirements, but robustness considerations could also drive its usage, since LDA is known to be unstable when the number of examples is small compared to the number of variables. In this context, LDA has been experimentally observed to benefit from unrealistic assumptions on the form of the estimated within-class covariance matrix. Indeed, the diagonal approximation, which ignores correlations between genes, may lead to better classification in microarray analysis. Bickel and Levina (2004) showed that this crude approximation provides a classifier with better worst-case performance than the LDA decision rule in small sample size regimes, even if variables are correlated.

The equivalence proof between penalized OS and penalized LDA (Hastie et al., 1995) reveals that quadratic penalties in the OS problem are equivalent to penalties on the within-class covariance matrix in the LDA formulation. This proof suggests a slight variant of penalized OS corresponding to penalized LDA with a diagonal within-class covariance matrix, where the least squares problems

    min_{B ∈ R^{p×(K−1)}} ‖YΘ − XB‖²_F = min_{B ∈ R^{p×(K−1)}} tr( Θ^T Y^T Y Θ − 2 Θ^T Y^T X B + n B^T Σ_T B )

are replaced by

    min_{B ∈ R^{p×(K−1)}} tr( Θ^T Y^T Y Θ − 2 Θ^T Y^T X B + n B^T (Σ_B + diag(Σ_W)) B ) .

Note that this variant only requires diag(Σ_W) + Σ_B + n^{−1}Ω to be positive definite, which is a weaker requirement than Σ_T + n^{−1}Ω positive definite.

5.6.4 Elastic Net and Structured Variant

For some learning problems, the structure of correlations between variables is partially known. Hastie et al. (1995) applied this idea to the field of handwritten digit recognition for their penalized discriminant analysis model, to constrain the discriminant directions to be spatially smooth.

When an image is represented as a vector of pixels, it is reasonable to assume positive correlations between the variables corresponding to neighboring pixels. Figure 5.2 represents the neighborhood graph of the pixels of a 3 × 3 image, together with the corresponding Laplacian matrix:

    Ω_L =
      [  3 −1  0 −1 −1  0  0  0  0 ]
      [ −1  5 −1 −1 −1 −1  0  0  0 ]
      [  0 −1  3  0 −1 −1  0  0  0 ]
      [ −1 −1  0  5 −1  0 −1 −1  0 ]
      [ −1 −1 −1 −1  8 −1 −1 −1 −1 ]
      [  0 −1 −1  0 −1  5  0 −1 −1 ]
      [  0  0  0 −1 −1  0  3 −1  0 ]
      [  0  0  0 −1 −1 −1 −1  5 −1 ]
      [  0  0  0  0 −1 −1  0 −1  3 ]

Figure 5.2: Graph and Laplacian matrix for a 3 × 3 image (pixels numbered 1 to 9, row-wise from the bottom-left corner, with 8-connected neighborhoods).

The Laplacian matrix Ω_L is positive semi-definite, and the penalty β^T Ω_L β favors, among vectors of identical L2 norm, the ones having similar coefficients within the neighborhoods of the graph. For example, this penalty is 9 for the vector (1, 1, 0, 1, 1, 0, 0, 0, 0)^T, which is the indicator of the neighborhood of pixel 1, and it is 21 for the vector (−1, 1, 0, 1, 1, 0, 0, 0, 0)^T, with a sign mismatch between pixel 1 and its neighbors.

This smoothness penalty can be imposed jointly with the group-Lasso. From the computational point of view, GLOSS hardly needs to be modified: the smoothness penalty is simply added to the group-Lasso penalty. As the new penalty is convex and quadratic (thus smooth), there is no additional burden in the overall algorithm. There is, however, an additional hyperparameter to be tuned, as illustrated in the sketch below.
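As an illustration of how such a Laplacian can be constructed, and of the penalty value quoted above, here is a small Matlab sketch; it follows the 8-connected 3 × 3 grid of Figure 5.2 and is not part of GLOSS.

    % Build the Laplacian of the 8-connected 3x3 pixel graph and evaluate the penalty.
    [r, c] = ndgrid(1:3, 1:3);                 % pixel coordinates on the 3x3 grid
    A = zeros(9);
    for i = 1:9
        for j = 1:9
            if i ~= j && max(abs(r(i)-r(j)), abs(c(i)-c(j))) <= 1
                A(i,j) = 1;                    % neighbors: Chebyshev distance 1 (king moves)
            end
        end
    end
    OmegaL = diag(sum(A,2)) - A;               % graph Laplacian, identical to the matrix of Figure 5.2
    beta = [1 1 0 1 1 0 0 0 0]';
    penalty = beta' * OmegaL * beta;           % equals 9 for the indicator of pixel 1's neighborhood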


                  6 Experimental Results

This chapter presents comparison results between the Group-Lasso Optimal Scoring Solver and two state-of-the-art classifiers proposed to perform sparse LDA: Penalized LDA (PLDA) (Witten and Tibshirani, 2011), which applies a Lasso penalty within Fisher's LDA framework, and Sparse Linear Discriminant Analysis (SLDA) (Clemmensen et al., 2011), which applies an Elastic net penalty to the OS problem. With the aim of testing parsimony capabilities, the latter algorithm was tested without any quadratic penalty, that is, with a Lasso penalty. The implementations of PLDA and SLDA are available from the authors' websites; PLDA is an R implementation and SLDA is coded in Matlab. All the experiments used the same training, validation and test sets. Note that they differ significantly from the ones of Witten and Tibshirani (2011) in Simulation 4, for which there was a typo in their paper.

6.1 Normalization

With shrunken estimates, the scaling of features has important consequences. For the linear discriminants considered here, the two most common normalization strategies consist in setting to one either the diagonal of the total covariance matrix Σ_T or the diagonal of the within-class covariance matrix Σ_W. These options can be implemented either by scaling the observations accordingly prior to the analysis, or by providing the penalties with weights. The latter option is implemented in our Matlab package.¹

6.2 Decision Thresholds

The derivations of LDA based on the analysis of variance or on the regression of class indicators do not rely on the normality of the class-conditional distribution of the observations. Hence, their applicability extends beyond the realm of Gaussian data. Based on this observation, Friedman et al. (2009, chapter 4) suggest investigating decision thresholds other than the ones stemming from the Gaussian mixture assumption. In particular, they propose to select the decision thresholds that empirically minimize the training error. This option was tested using validation sets or cross-validation.

¹ The GLOSS Matlab code can be found in the software section of www.hds.utc.fr/~grandval.


6.3 Simulated Data

We first compare the three techniques in the simulation study of Witten and Tibshirani (2011), which considers four setups with 1200 examples equally distributed between classes. They are split into a training set of size n = 100, a validation set of size 100 and a test set of size 1000. We are in the small sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for Simulation 2, where they are slightly correlated. In Simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 1.7%, 6.7%, 7.3% and 30.0%. The exact definition of every setup, as provided in Witten and Tibshirani (2011), is given below, followed by an illustrative data-generation sketch.

Simulation 1: Mean shift with independent features. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with μ_1j = 0.7 × 1(1 ≤ j ≤ 25), μ_2j = 0.7 × 1(26 ≤ j ≤ 50), μ_3j = 0.7 × 1(51 ≤ j ≤ 75), μ_4j = 0.7 × 1(76 ≤ j ≤ 100).

Simulation 2: Mean shift with dependent features. There are two classes. If sample i is in class 1, then x_i ∼ N(0, Σ); if i is in class 2, then x_i ∼ N(μ, Σ), with μ_j = 0.6 × 1(j ≤ 200). The covariance structure is block diagonal, with 5 blocks, each of dimension 100 × 100; the blocks have (j, j′) element 0.6^|j−j′|. This covariance structure is intended to mimic gene expression data correlation.

Simulation 3: One-dimensional mean shift with independent features. There are four classes and the features are independent. If sample i is in class k, then X_ij ∼ N((k−1)/3, 1) if j ≤ 100, and X_ij ∼ N(0, 1) otherwise.

Simulation 4: Mean shift with independent features and no linear ordering. There are four classes. If sample i is in class k, then x_i ∼ N(μ_k, I), with mean vectors defined as follows: μ_1j ∼ N(0, 0.3²) for j ≤ 25 and μ_1j = 0 otherwise; μ_2j ∼ N(0, 0.3²) for 26 ≤ j ≤ 50 and μ_2j = 0 otherwise; μ_3j ∼ N(0, 0.3²) for 51 ≤ j ≤ 75 and μ_3j = 0 otherwise; μ_4j ∼ N(0, 0.3²) for 76 ≤ j ≤ 100 and μ_4j = 0 otherwise.
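To make the protocol concrete, here is a hedged Matlab sketch of the data generation for Simulation 1; all names are illustrative, and the other setups follow the same pattern with their own means and covariances.

    % Simulation 1: four classes, mean shift of 0.7 on 25 class-specific features.
    n = 100; p = 500; K = 4;
    mu = zeros(K, p);
    for k = 1:K
        mu(k, (k-1)*25+1 : k*25) = 0.7;     % relevant features of class k
    end
    y = repelem((1:K)', n/K);               % balanced class labels
    X = mu(y, :) + randn(n, p);             % x_i ~ N(mu_k, I)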

Note that this protocol is detrimental to GLOSS, as each relevant variable only affects a single class mean out of K. The setup is favorable to PLDA in the sense that most within-class covariance matrices are diagonal. We thus also tested the diagonal GLOSS variant discussed in Section 5.6.3.

The results are summarized in Table 6.1. Overall, the best predictions are obtained by PLDA and GLOSS-D, which both benefit from the knowledge of the true within-class covariance structure. Then, among SLDA and GLOSS, which both ignore this structure, our proposal has a clear edge. The error rates are far away from the Bayes' error rates, but the sample size is small with regard to the number of relevant variables. Regarding sparsity, the clear overall winner is GLOSS, followed far behind by SLDA, which is the only method that does not succeed in uncovering a low-dimensional representation in Simulation 3.


Table 6.1: Experimental results for simulated data: averages (with standard deviations) computed over 25 repetitions of the test error rate, the number of selected variables, and the number of discriminant directions selected on the validation set.

                 Err (%)       Var           Dir
Sim 1: K = 4, mean shift, ind. features
  PLDA           12.6 (0.1)    411.7 (3.7)   3.0 (0.0)
  SLDA           31.9 (0.1)    228.0 (0.2)   3.0 (0.0)
  GLOSS          19.9 (0.1)    106.4 (1.3)   3.0 (0.0)
  GLOSS-D        11.2 (0.1)    251.1 (4.1)   3.0 (0.0)
Sim 2: K = 2, mean shift, dependent features
  PLDA            9.0 (0.4)    337.6 (5.7)   1.0 (0.0)
  SLDA           19.3 (0.1)     99.0 (0.0)   1.0 (0.0)
  GLOSS          15.4 (0.1)     39.8 (0.8)   1.0 (0.0)
  GLOSS-D         9.0 (0.0)    203.5 (4.0)   1.0 (0.0)
Sim 3: K = 4, 1D mean shift, ind. features
  PLDA           13.8 (0.6)    161.5 (3.7)   1.0 (0.0)
  SLDA           57.8 (0.2)    152.6 (2.0)   1.9 (0.0)
  GLOSS          31.2 (0.1)    123.8 (1.8)   1.0 (0.0)
  GLOSS-D        18.5 (0.1)    357.5 (2.8)   1.0 (0.0)
Sim 4: K = 4, mean shift, ind. features
  PLDA           60.3 (0.1)    336.0 (5.8)   3.0 (0.0)
  SLDA           65.9 (0.1)    208.8 (1.6)   2.7 (0.0)
  GLOSS          60.7 (0.2)     74.3 (2.2)   2.7 (0.0)
  GLOSS-D        58.8 (0.1)    162.7 (4.9)   2.9 (0.0)


Figure 6.1: TPR versus FPR (in %) for all algorithms (GLOSS, GLOSS-D, SLDA, PLDA) and all simulations (Simulations 1–4).

Table 6.2: Average TPR and FPR (in %) computed over 25 repetitions.

           Simulation 1     Simulation 2     Simulation 3     Simulation 4
           TPR    FPR       TPR    FPR       TPR    FPR       TPR    FPR
PLDA       99.0   78.2      96.9   60.3      98.0   15.9      74.3   65.6
SLDA       73.9   38.5      33.8   16.3      41.6   27.8      50.7   39.5
GLOSS      64.1   10.6      30.0    4.6      51.1   18.2      26.0   12.1
GLOSS-D    93.5   39.4      92.1   28.1      95.6   65.5      42.9   29.9

The adequacy of the selected features was assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of selected variables that are actually relevant; similarly, the FPR is the proportion of selected variables that are actually irrelevant. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. PLDA has the best TPR but a terrible FPR, except in Simulation 3, where it dominates all the other methods. GLOSS has by far the best FPR, with an overall TPR slightly below SLDA. Results are displayed in Figure 6.1 and in Table 6.2 (both in percentages).

6.4 Gene Expression Data

We now compare GLOSS to PLDA and SLDA on three genomic datasets. The Nakayama² dataset contains 105 examples of 22,283 gene expressions for categorizing 10 soft tissue tumors; it was reduced to the 86 examples belonging to the 5 dominant categories (Witten and Tibshirani, 2011).

² http://www.broadinstitute.org/cancer/software/genepattern/datasets


Table 6.3: Experimental results for gene expression data: averages over 10 training/test set splits (with standard deviations) of the test error rates and the number of selected variables.

                                    Err (%)          Var
Nakayama: n = 86, p = 22,283, K = 5
  PLDA                              20.95 (1.3)      10478.7 (2116.3)
  SLDA                              25.71 (1.7)        252.5 (3.1)
  GLOSS                             20.48 (1.4)        129.0 (18.6)
Ramaswamy: n = 198, p = 16,063, K = 14
  PLDA                              38.36 (6.0)      14873.5 (720.3)
  SLDA                              —                —
  GLOSS                             20.61 (6.9)        372.4 (122.1)
Sun: n = 180, p = 54,613, K = 4
  PLDA                              33.78 (5.9)      21634.8 (7443.2)
  SLDA                              36.22 (6.5)        384.4 (16.5)
  GLOSS                             31.77 (4.5)         93.0 (93.6)

The Ramaswamy³ dataset contains 198 examples of 16,063 gene expressions for categorizing 14 classes of cancer. Finally, the Sun⁴ dataset contains 180 examples of 54,613 gene expressions for categorizing 4 classes of tumors.

³ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2736

Each dataset was split into a training set and a test set containing respectively 75% and 25% of the examples. Parameter tuning is performed by 10-fold cross-validation, and the test performances are then evaluated. The process is repeated 10 times, with random choices of the training and test set split.

Test error rates and the number of selected variables are presented in Table 6.3. The results for the PLDA algorithm are extracted from Witten and Tibshirani (2011). The three methods have comparable prediction performances on the Nakayama and Sun datasets, but GLOSS performs better on the Ramaswamy data, where the SparseLDA package failed to return a solution due to numerical problems in the LARS-EN implementation. Regarding the number of selected variables, GLOSS is again much sparser than its competitors.

Finally, Figure 6.2 displays the projection of the observations for the Nakayama and Sun datasets onto the first canonical plane estimated by GLOSS and SLDA. For the Nakayama dataset, groups 1 and 2 are well separated from the other ones in both representations, but GLOSS is more discriminant within the meta-cluster gathering groups 3 to 5. For the Sun dataset, SLDA suffers from a high colinearity of its first canonical variables, which renders the second one almost non-informative. As a result, group 1 is better separated in the first canonical plane with GLOSS.

⁴ http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1962


Figure 6.2: 2D-representations of the Nakayama (top) and Sun (bottom) datasets based on the first two discriminant vectors provided by GLOSS (left) and SLDA (right). The big squares represent class means. Nakayama classes: 1) synovial sarcoma, 2) myxoid liposarcoma, 3) dedifferentiated liposarcoma, 4) myxofibrosarcoma, 5) malignant fibrous histiocytoma; Sun classes: 1) non-tumor, 2) astrocytomas, 3) glioblastomas, 4) oligodendrogliomas.


Figure 6.3: USPS digits "1" and "0".

6.5 Correlated Data

When features are known to be highly correlated, the discrimination algorithm can be improved by using this information in the optimization problem. The structured variant of GLOSS presented in Section 5.6.4, S-GLOSS from now on, was conceived to easily introduce this prior knowledge.

The experiments described in this section are intended to illustrate the effect of combining the group-Lasso sparsity-inducing penalty with a quadratic penalty used as a surrogate of the unknown within-class covariance matrix. This preliminary experiment does not include comparisons with other algorithms; more comprehensive experimental results are left for future work.

For this illustration, we used a subset of the USPS handwritten digit dataset, made of 16 × 16 pixel images representing digits from 0 to 9. For our purpose, we compare the discriminant direction that separates digits "1" and "0", computed with GLOSS and with S-GLOSS. The mean image of each digit is shown in Figure 6.3.

As in Section 5.6.4, we encode the pixel proximity relationships of Figure 5.2 into a penalty matrix Ω_L, but this time on a 256-node graph. Introducing this 256 × 256 Laplacian penalty matrix Ω_L in the GLOSS algorithm is straightforward.

The effect of this penalty is clearly visible in Figure 6.4, where the discriminant vector β resulting from a non-penalized execution of GLOSS is compared with the β resulting from a Laplace-penalized execution of S-GLOSS (without group-Lasso penalty). The center of the digit "0", which is probably the most important element for discriminating the two digits, is clearly visible in the discriminant direction obtained by S-GLOSS.

Figure 6.5 displays the discriminant direction β obtained by GLOSS and S-GLOSS for a non-zero group-Lasso penalty, with an identical penalization parameter (λ = 0.3). Even if both solutions are sparse, the discriminant vector from S-GLOSS keeps connected pixels that allow strokes to be detected and will probably provide better prediction results.


Figure 6.4: Discriminant direction β between digits "1" and "0", for GLOSS (left) and S-GLOSS (right).

Figure 6.5: Sparse discriminant direction β between digits "1" and "0", for GLOSS (left) and S-GLOSS (right), with λ = 0.3.


                  Discussion

GLOSS is an efficient algorithm that performs sparse LDA based on the regression of class indicators. Our proposal is equivalent to a penalized LDA problem; to our knowledge, it is the first approach that enjoys this property in the multi-class setting. This relationship also makes it possible to accommodate interesting constraints on the equivalent penalized LDA problem, such as imposing a diagonal structure on the within-class covariance matrix.

Computationally, GLOSS is based on an efficient active set strategy that is suited to the processing of problems with a large number of variables. The inner optimization problem decouples the p × (K − 1)-dimensional problem into (K − 1) independent p-dimensional problems. The interaction between the (K − 1) problems is relegated to the computation of the common adaptive quadratic penalty. The algorithm presented here is highly efficient in medium to high dimensional setups, which makes it a good candidate for the analysis of gene expression data.

The experimental results confirm the relevance of the approach, which behaves well compared to its competitors regarding both its prediction abilities and its interpretability (sparsity). Generally, compared to the competing approaches, GLOSS provides extremely parsimonious discriminants without compromising prediction performance. Employing the same features in all discriminant directions yields models that are globally extremely parsimonious, with good prediction abilities. The resulting sparse discriminant directions also allow for visual inspection of the data through the low-dimensional representations that can be produced.

The approach has many potential extensions that have not yet been implemented. A first line of development is to consider a broader class of penalties. For example, plain quadratic penalties can be added to the group penalty to encode priors about the within-class covariance structure, in the spirit of the Penalized Discriminant Analysis of Hastie et al. (1995). Also, besides the group-Lasso, our framework can be customized to any penalty that is uniformly spread within groups, and many composite or hierarchical penalties that have been proposed for structured data meet this condition.


                  Part III

                  Sparse Clustering Analysis


                  Abstract

Clustering can be defined as the task of grouping samples such that all the elements belonging to one cluster are more "similar" to each other than to the objects belonging to the other groups. There are similarity measures for any data structure: database records, or even multimedia objects (audio, video). The similarity concept is closely related to the idea of distance, which is a specific dissimilarity.

Model-based clustering aims to describe a heterogeneous population with a probabilistic model that represents each group with its own distribution. Here, the distributions will be Gaussians, and the different populations are identified by different means and a common covariance matrix.

As in the supervised framework, traditional clustering techniques perform worse as the number of irrelevant features increases. In this part we develop Mix-GLOSS, which builds on the supervised GLOSS algorithm to address unsupervised problems, resulting in a clustering mechanism with embedded feature selection.

Chapter 7 reviews different techniques for inducing sparsity in model-based clustering algorithms. The theory that motivates our original formulation of the EM algorithm is developed in Chapter 8, followed by the description of the algorithm in Chapter 9. Its performance is assessed and compared to other state-of-the-art model-based sparse clustering mechanisms in Chapter 10.


                  7 Feature Selection in Mixture Models

7.1 Mixture Models

One of the most popular clustering algorithms is K-means, which aims to partition n observations into K clusters, each observation being assigned to the cluster with the nearest mean (MacQueen, 1967). A generalization of K-means can be made through probabilistic models, which represent K subpopulations by a mixture of distributions. Since their first use by Newcomb (1886) for the detection of outlier points, and eight years later by Pearson (1894) to identify two separate populations of crabs, finite mixtures of distributions have been employed to model a wide variety of random phenomena. These models assume that measurements are taken from a set of individuals, each of which belongs to one out of a number of different classes, while any individual's particular class is unknown. Mixture models can thus address the heterogeneity of a population and are especially well suited to the problem of clustering.

7.1.1 Model

We assume that the observed data X = (x_1^T, …, x_n^T)^T have been drawn identically from K different subpopulations of R^p. The generative distribution is a finite mixture model, that is, the data are assumed to be generated from a compound distribution whose density can be expressed as

    f(x_i) = Σ_{k=1}^K π_k f_k(x_i) ,   ∀i ∈ {1, …, n} ,

where K is the number of components, f_k are the densities of the components, and π_k are the mixture proportions (π_k ∈ ]0, 1[ for all k, and Σ_k π_k = 1). Mixture models transcribe that, given the proportions π_k and the distributions f_k for each class, the data are generated according to the following mechanism:

• y: each individual is allotted to a class according to a multinomial distribution with parameters π_1, …, π_K;

• x: each x_i is assumed to arise from a random vector with probability density function f_k.

In addition, it is usually assumed that the component densities f_k belong to a parametric family of densities φ(·; θ_k). The density of the mixture can then be written as

    f(x_i; θ) = Σ_{k=1}^K π_k φ(x_i; θ_k) ,   ∀i ∈ {1, …, n} ,


where θ = (π_1, …, π_K, θ_1, …, θ_K) is the parameter of the model.

7.1.2 Parameter Estimation: The EM Algorithm

For the estimation of the parameters of the mixture model, Pearson (1894) used the method of moments to estimate the five parameters (μ_1, μ_2, σ_1², σ_2², π) of a univariate Gaussian mixture model with two components; that method required him to solve polynomial equations of degree nine. There are also graphical methods, maximum likelihood methods and Bayesian approaches.

The most widely used approach for estimating the parameters is the maximization of the log-likelihood using the EM algorithm, which is typically used to maximize the likelihood of models with latent variables when no analytical solution is available (Dempster et al., 1977).

The EM algorithm iterates two steps, called the expectation step (E) and the maximization step (M). Each expectation step involves the computation of the expectation of the complete log-likelihood with respect to the hidden variables, while each maximization step estimates the parameters by maximizing the expected log-likelihood formed in the E-step.

Under mild regularity assumptions, this mechanism converges to a local maximum of the likelihood. However, the type of problems targeted is typically characterized by the existence of several local maxima, and global convergence cannot be guaranteed: in practice, the obtained solution depends on the initialization of the algorithm.

                  Maximum Likelihood Definitions

The likelihood is commonly expressed in its logarithmic version:

    L(θ; X) = log ( Π_{i=1}^n f(x_i; θ) )
            = Σ_{i=1}^n log ( Σ_{k=1}^K π_k f_k(x_i; θ_k) ) ,    (7.1)

where n is the number of samples, K is the number of components of the mixture (or number of clusters) and π_k are the mixture proportions.

To obtain maximum likelihood estimates, the EM algorithm works with the joint distribution of the observations x and the unknown latent variables y, which indicate the cluster membership of every sample. The pair z = (x, y) is called the complete data. The log-likelihood of the complete data is called the complete log-likelihood, or classification log-likelihood:


    L_C(θ; X, Y) = log ( Π_{i=1}^n f(x_i, y_i; θ) )
                 = Σ_{i=1}^n log ( Σ_{k=1}^K y_ik π_k f_k(x_i; θ_k) )
                 = Σ_{i=1}^n Σ_{k=1}^K y_ik log ( π_k f_k(x_i; θ_k) ) .    (7.2)

The y_ik are the binary entries of the indicator matrix Y, with y_ik = 1 if observation i belongs to cluster k and y_ik = 0 otherwise.

Defining the soft membership t_ik(θ) as

    t_ik(θ) = p(Y_ik = 1 | x_i, θ)    (7.3)
            = π_k f_k(x_i; θ_k) / f(x_i; θ) ,    (7.4)

we will, to lighten notations, denote t_ik(θ) by t_ik when the parameter θ is clear from the context. The regular (7.1) and complete (7.2) log-likelihoods are related as follows:

    L_C(θ; X, Y) = Σ_{i,k} y_ik log ( π_k f_k(x_i; θ_k) )
                 = Σ_{i,k} y_ik log ( t_ik f(x_i; θ) )
                 = Σ_{i,k} y_ik log t_ik + Σ_{i,k} y_ik log f(x_i; θ)
                 = Σ_{i,k} y_ik log t_ik + Σ_{i=1}^n log f(x_i; θ)
                 = Σ_{i,k} y_ik log t_ik + L(θ; X) ,    (7.5)

where Σ_{i,k} y_ik log t_ik can be reformulated as

    Σ_{i,k} y_ik log t_ik = Σ_{i=1}^n Σ_{k=1}^K y_ik log p(Y_ik = 1 | x_i, θ)
                          = Σ_{i=1}^n log p(y_i | x_i, θ)
                          = log p(Y | X, θ) .

As a result, the relationship (7.5) can be rewritten as

    L(θ; X) = L_C(θ; Z) − log p(Y | X, θ) .    (7.6)


                  Likelihood Maximization

The complete log-likelihood cannot be evaluated because the variables y_ik are unknown. However, the log-likelihood can be handled by taking expectations, conditionally on a current value θ^(t), in (7.6):

    L(θ; X) = E_{Y∼p(·|X,θ^(t))} [ L_C(θ; X, Y) ] + E_{Y∼p(·|X,θ^(t))} [ −log p(Y | X, θ) ]
            =             Q(θ, θ^(t))             +              H(θ, θ^(t)) .

In this expression, H(θ, θ^(t)) is an entropy term and Q(θ, θ^(t)) is the conditional expectation of the complete log-likelihood. Let us define the increment of the log-likelihood as ΔL = L(θ^(t+1); X) − L(θ^(t); X). Then θ^(t+1) = argmax_θ Q(θ, θ^(t)) also increases the log-likelihood:

    ΔL = ( Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t)) ) + ( H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t)) ) ,

where the first term is non-negative by definition of iteration t+1, and the second term, a Kullback–Leibler divergence, is non-negative by Jensen's inequality. Therefore, it is possible to maximize the likelihood by optimizing Q(θ, θ^(t)). The relationship between Q(θ, θ′) and L(θ; X) is developed in deeper detail in Appendix F, which shows how the value of L(θ; X) can be recovered from Q(θ, θ^(t)).

For the mixture model problem, Q(θ, θ′) is

    Q(θ, θ′) = E_{Y ∼ p(Y|X,θ′)} [ L_C(θ; X, Y) ]
             = Σ_{i,k} p(Y_ik = 1 | x_i, θ′) log ( π_k f_k(x_i; θ_k) )
             = Σ_{i=1}^n Σ_{k=1}^K t_ik(θ′) log ( π_k f_k(x_i; θ_k) ) .    (7.7)

Because of its similarity to the expression of the complete likelihood (7.2), Q(θ, θ′) is also known as the weighted likelihood. In (7.7), the weights t_ik(θ′) are the posterior probabilities of cluster membership.

Hence, the EM algorithm sketched above results in:

• Initialization (not iterated): choice of the initial parameter θ^(0);

• E-Step: evaluation of Q(θ, θ^(t)), using t_ik(θ^(t)) (7.4) in (7.7);

• M-Step: computation of θ^(t+1) = argmax_θ Q(θ, θ^(t)).


                  Gaussian Model

In the particular case of a Gaussian mixture model with common covariance matrix Σ and different mean vectors μ_k, the mixture density is

    f(x_i; θ) = Σ_{k=1}^K π_k f_k(x_i; θ_k)
              = Σ_{k=1}^K π_k (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2) (x_i − μ_k)^T Σ^{−1} (x_i − μ_k) } .

At the E-step, the posterior probabilities t_ik are computed as in (7.4) with the current parameters θ^(t); the M-step then maximizes Q(θ, θ^(t)) (7.7), which takes the form

    Q(θ, θ^(t)) = Σ_{i,k} t_ik log π_k − Σ_{i,k} t_ik log( (2π)^{p/2} |Σ|^{1/2} ) − (1/2) Σ_{i,k} t_ik (x_i − μ_k)^T Σ^{−1} (x_i − μ_k)
                = Σ_k t_k log π_k − (np/2) log(2π) − (n/2) log|Σ| − (1/2) Σ_{i,k} t_ik (x_i − μ_k)^T Σ^{−1} (x_i − μ_k)
                ≡ Σ_k t_k log π_k − (n/2) log|Σ| − Σ_{i,k} t_ik ( (1/2) (x_i − μ_k)^T Σ^{−1} (x_i − μ_k) ) ,    (7.8)

where the constant term −(np/2) log(2π) is dropped in the last line, and

    t_k = Σ_{i=1}^n t_ik .    (7.9)

The M-step, which maximizes this expression with respect to θ, applies the following updates defining θ^(t+1):

    π_k^(t+1) = t_k / n ,    (7.10)

    μ_k^(t+1) = ( Σ_i t_ik x_i ) / t_k ,    (7.11)

    Σ^(t+1) = (1/n) Σ_k W_k ,    (7.12)

    with  W_k = Σ_i t_ik (x_i − μ_k)(x_i − μ_k)^T .    (7.13)

                  The derivations are detailed in Appendix G
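To fix ideas, here is a self-contained Matlab sketch of one EM iteration for this model, implementing (7.4) and (7.10)–(7.13); all variable names are illustrative, and the code is a plain baseline for the common-covariance Gaussian mixture, not Mix-GLOSS.

    % One EM iteration: X is n x p, mu is K x p, Sigma is p x p, pik is 1 x K.
    [n, p] = size(X);
    K = numel(pik);
    % ----- E-step: posterior probabilities t_ik, equation (7.4) -----
    logf = zeros(n, K);
    L = chol(Sigma, 'lower');                          % Sigma = L*L'
    for k = 1:K
        D = (X - mu(k,:)) / L';                        % whitened residuals (implicit expansion, R2016b+)
        logf(:,k) = log(pik(k)) - 0.5*sum(D.^2, 2) ...
                    - 0.5*p*log(2*pi) - sum(log(diag(L)));
    end
    T = exp(logf - max(logf, [], 2));                  % stabilized exponentials
    T = T ./ sum(T, 2);                                % t_ik, rows sum to one
    % ----- M-step: updates (7.10)-(7.13) -----
    tk  = sum(T, 1);                                   % t_k
    pik = tk / n;                                      % (7.10)
    mu  = (T' * X) ./ tk';                             % (7.11)
    Sigma = zeros(p);
    for k = 1:K
        R = X - mu(k,:);
        Sigma = Sigma + R' * (T(:,k) .* R);            % W_k, equation (7.13)
    end
    Sigma = Sigma / n;                                 % (7.12)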

7.2 Feature Selection in Model-Based Clustering

When common covariance matrices are assumed, Gaussian mixtures are related to LDA, with partitions defined by linear decision rules. When every cluster has its own covariance matrix Σ_k, Gaussian mixtures are associated with quadratic discriminant analysis (QDA), with quadratic boundaries.

In the high-dimensional, low-sample-size setting, numerical issues appear in the estimation of the covariance matrix. To avoid these singularities, regularization may be applied. A regularized trade-off between LDA and QDA (RDA) was proposed by Friedman (1989). Bensmail and Celeux (1996) extended this algorithm by rewriting the covariance matrix in terms of its eigenvalue decomposition, Σ_k = λ_k D_k A_k D_k^T (Banfield and Raftery, 1993). These regularization schemes address singularity and stability issues, but they do not induce parsimonious models.

In this chapter, we review some techniques for inducing sparsity in model-based clustering algorithms. Here, sparsity refers to the rule that assigns examples to classes: clustering is still performed in the original p-dimensional space, but the decision rule can be expressed with only a few coordinates of this high-dimensional space.

7.2.1 Based on Penalized Likelihood

Penalized log-likelihood maximization is a popular estimation technique for mixture models. It is typically achieved by the EM algorithm, using mixture models for which the allocation of examples is expressed as a simple function of the input features. For example, for Gaussian mixtures with a common covariance matrix, the log-ratio of posterior probabilities is a linear function of x:

    log ( p(Y_k = 1 | x) / p(Y_ℓ = 1 | x) ) = x^T Σ^{−1}(μ_k − μ_ℓ) − (1/2)(μ_k + μ_ℓ)^T Σ^{−1}(μ_k − μ_ℓ) + log(π_k/π_ℓ) .

In this model, a simple way of introducing sparsity in the discriminant vectors Σ^{−1}(μ_k − μ_ℓ) is to constrain Σ to be diagonal and to favor sparse means μ_k. Indeed, for Gaussian mixtures with a common diagonal covariance matrix, if all means have the same value on dimension j, then variable j is useless for class allocation and can be discarded. The means can be penalized by the L1 norm,

    λ Σ_{k=1}^K Σ_{j=1}^p |μ_kj| ,

as proposed by Pan et al. (2006) and Pan and Shen (2007). Zhou et al. (2009) consider more complex penalties on full covariance matrices:

    λ_1 Σ_{k=1}^K Σ_{j=1}^p |μ_kj| + λ_2 Σ_{k=1}^K Σ_{j=1}^p Σ_{m=1}^p |(Σ_k^{−1})_{jm}| .

In their algorithm, they make use of the graphical Lasso to estimate the covariances. Even if their formulation induces sparsity on the parameters, their combination of L1 penalties does not directly target decision rules based on few variables, and thus does not guarantee parsimonious models.


Guo et al. (2010) propose a variation with a Pairwise Fusion Penalty (PFP),

    λ Σ_{j=1}^p Σ_{1 ≤ k ≤ k′ ≤ K} |μ_kj − μ_k′j| .

This PFP regularization does not shrink the means to zero, but towards each other: when the jth components of all cluster means are driven to the same value, that variable can be considered as non-informative.

An L1,∞ penalty is used by Wang and Zhu (2008) and Kuan et al. (2010) to penalize the likelihood, encouraging null groups of features:

    λ Σ_{j=1}^p ‖(μ_1j, μ_2j, …, μ_Kj)‖_∞ .

One group is defined for each variable j, as the set of the jth components of the K means, (μ_1j, …, μ_Kj). The L1,∞ penalty forces zeros at the group level, favoring the removal of the corresponding feature. This method seems to produce parsimonious models and good partitions within a reasonable computing time; in addition, the code is publicly available. Xie et al. (2008b) apply a group-Lasso penalty. Their principle describes a vertical mean grouping (VMG, with the same groups as Xie et al. (2008a)) and a horizontal mean grouping (HMG). VMG achieves real feature selection because it forces null values for the same variable in all cluster means:

    λ √K Σ_{j=1}^p √( Σ_{k=1}^K μ_kj² ) .

The clustering algorithm of VMG differs from ours, but the proposed group penalty is the same; however, no code allowing to test it is available on the authors' website.

The optimization of a penalized likelihood by means of an EM algorithm can be reformulated by rewriting the maximization of the M-step as a penalized optimal scoring regression. Roth and Lange (2004) implemented this for two-cluster problems, using an L1 penalty to encourage sparsity of the discriminant vector; the generalization from quadratic to non-quadratic penalties is only quickly outlined in their work. We extend this work by considering an arbitrary number of clusters and by formalizing the link between penalized optimal scoring and penalized likelihood estimation.

7.2.2 Based on Model Variants

The algorithm proposed by Law et al. (2004) takes a different stance. The authors define feature relevancy through conditional independence: the jth feature is presumed uninformative if its distribution is independent of the class labels. The density is expressed as


    f(x_i | φ, π, θ, ν) = Σ_{k=1}^K π_k Π_{j=1}^p [ f(x_ij | θ_jk) ]^{φ_j} [ h(x_ij | ν_j) ]^{1−φ_j} ,

where f(· | θ_jk) is the distribution function of the relevant features and h(· | ν_j) is the distribution function of the irrelevant ones. The binary vector φ = (φ_1, φ_2, …, φ_p) represents relevance, with φ_j = 1 if the jth feature is informative and φ_j = 0 otherwise. The saliency of variable j is then formalized as ρ_j = P(φ_j = 1), so that all the φ_j are treated as missing variables. The set of parameters is thus {π_k, θ_jk, ν_j, ρ_j}; its estimation is done by means of the EM algorithm (Law et al., 2004).

An original and recent technique is the Fisher-EM algorithm proposed by Bouveyron and Brunet (2012b,a). Fisher-EM is a modified version of EM that runs in a latent space. This latent space is defined by an orthogonal projection matrix U ∈ R^{p×(K−1)}, which is updated inside the EM loop with a new step, called the Fisher step (F-step from now on), which maximizes a multi-class Fisher criterion,

    tr( (U^T Σ_W U)^{−1} U^T Σ_B U ) ,    (7.14)

so as to maximize the separability of the data. The E-step is the standard one, computing the posterior probabilities. Then, the F-step updates the projection matrix that projects the data onto the latent space. Finally, the M-step estimates the parameters by maximizing the conditional expectation of the complete log-likelihood. Those parameters can be rewritten as functions of the projection matrix U and of the model parameters in the latent space, so that the U matrix enters the M-step equations.

To induce feature selection, Bouveyron and Brunet (2012a) suggest three possibilities. The first one computes the best sparse orthogonal approximation Û of the matrix U that maximizes (7.14). This sparse approximation is defined as the solution of

    min_{Û ∈ R^{p×(K−1)}}  ‖X_U − X Û‖²_F + λ Σ_{k=1}^{K−1} ‖û_k‖_1 ,

where X_U = XU is the input data projected onto the (non-sparse) latent space, and û_k is the kth column vector of the projection matrix Û. The second possibility is inspired by Qiao et al. (2009); it reformulates the Fisher discriminant (7.14) used to compute the projection matrix as a regression criterion penalized by a mixture of Lasso and Elastic net:

    min_{A,B ∈ R^{p×(K−1)}}  Σ_{k=1}^K ‖R_W^{−T} H_{B,k} − A B^T H_{B,k}‖²_2 + ρ Σ_{j=1}^{K−1} β_j^T Σ_W β_j + λ Σ_{j=1}^{K−1} ‖β_j‖_1
    s.t.  A^T A = I_{K−1} ,

where H_B ∈ R^{p×K} is a matrix defined conditionally on the posterior probabilities t_ik, satisfying H_B H_B^T = Σ_B, and H_{B,k} is the kth column of H_B; R_W ∈ R^{p×p} is an upper triangular matrix resulting from the Cholesky decomposition of Σ_W; Σ_W and Σ_B are the p × p within-class and between-class covariance matrices in the observation space; A ∈ R^{p×(K−1)} and B ∈ R^{p×(K−1)} are the solutions of the optimization problem, such that B = [β_1, …, β_{K−1}] is the best sparse approximation of U.

The last possibility approximates the solution of Fisher's discriminant problem (7.14) by the solution of the following constrained optimization problem:

    min_{U ∈ R^{p×(K−1)}}  Σ_{j=1}^p ‖Σ_{B,j} − U U^T Σ_{B,j}‖²_2
    s.t.  U^T U = I_{K−1} ,

where Σ_{B,j} is the jth column of the between-class covariance matrix in the observation space. This problem can be solved by the penalized version of the singular value decomposition proposed by Witten et al. (2009), resulting in a sparse approximation of U.

To comply with the constraint stating that the columns of U are orthogonal, the first and the second options must be followed by a singular value decomposition of U. This is not necessary with the third option, since the penalized version of SVD already guarantees orthogonality.

However, there is a lack of guarantees regarding convergence. Bouveyron states: "the update of the orientation matrix U in the F-step is done by maximizing the Fisher criterion and not by directly maximizing the expected complete log-likelihood as required in the EM algorithm theory. From this point of view, the convergence of the Fisher-EM algorithm cannot therefore be guaranteed." Immediately after this paragraph, we can read that, under certain assumptions, their algorithm converges: "the model [...] which assumes the equality and the diagonality of covariance matrices: the F-step of the Fisher-EM algorithm satisfies the convergence conditions of the EM algorithm theory and the convergence of the Fisher-EM algorithm can be guaranteed in this case. For the other discriminant latent mixture models, although the convergence of the Fisher-EM procedure cannot be guaranteed, our practical experience has shown that the Fisher-EM algorithm rarely fails to converge with these models if correctly initialized."

7.2.3 Based on Model Selection

Some clustering algorithms recast the feature selection problem as a model selection problem. Following this idea, Raftery and Dean (2006) model the observations as a mixture of Gaussian distributions. To discover a subset of relevant features (and its superfluous complement), they define three subsets of variables:

• X^(1): the set of selected relevant variables;

• X^(2): the set of variables being considered for inclusion in, or exclusion from, X^(1);

• X^(3): the set of non-relevant variables.


With those subsets, they defined two different models, where Y is the partition to consider:

• M1:
\[
f(X|Y) = f\left(X^{(1)}, X^{(2)}, X^{(3)}\,|\,Y\right) = f\left(X^{(3)}\,|\,X^{(2)}, X^{(1)}\right) f\left(X^{(2)}\,|\,X^{(1)}\right) f\left(X^{(1)}\,|\,Y\right)
\]

• M2:
\[
f(X|Y) = f\left(X^{(1)}, X^{(2)}, X^{(3)}\,|\,Y\right) = f\left(X^{(3)}\,|\,X^{(2)}, X^{(1)}\right) f\left(X^{(2)}, X^{(1)}\,|\,Y\right)
\]

Model M1 means that the variables in X^(2) are independent of the clustering Y, while model M2 states that the variables in X^(2) depend on the clustering Y. To simplify the algorithm, the subset X^(2) is only updated one variable at a time. Therefore, deciding the relevance of the variable in X^(2) amounts to a model selection between M1 and M2. The selection is done via the Bayes factor

\[
B_{12} = \frac{f(X\,|\,M_1)}{f(X\,|\,M_2)},
\]

where the high-dimensional term f(X^(3)|X^(2), X^(1)) cancels from the ratio:

\[
B_{12} = \frac{f\left(X^{(1)}, X^{(2)}, X^{(3)}\,|\,M_1\right)}{f\left(X^{(1)}, X^{(2)}, X^{(3)}\,|\,M_2\right)}
= \frac{f\left(X^{(2)}\,|\,X^{(1)}, M_1\right) f\left(X^{(1)}\,|\,M_1\right)}{f\left(X^{(2)}, X^{(1)}\,|\,M_2\right)}.
\]

This factor is approximated, since the integrated likelihoods f(X^(1)|M_1) and f(X^(2), X^(1)|M_2) are difficult to calculate exactly; Raftery and Dean (2006) use the BIC approximation. The computation of f(X^(2)|X^(1), M_1), when there is only one variable in X^(2), can be represented as a linear regression of the variable in X^(2) on the variables in X^(1). There is also a BIC approximation for this term.
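To make the comparison concrete, the sketch below illustrates how such a BIC-approximated Bayes factor could be evaluated for a single candidate variable. The BIC values of the two mixture-model fits (`bic_clust_X1`, `bic_clust_X1X2`) are assumed to come from an external clustering routine (e.g. mclust or mixmod); the regression BIC and the overall bookkeeping are illustrative assumptions, not the actual code of Raftery and Dean (2006).

```python
import numpy as np

def bic_gaussian_regression(y, X):
    """BIC (2*loglik - k*log n, higher is better) of the linear regression
    of the candidate variable y on the selected variables X, used to
    approximate f(X2 | X1, M1)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = np.mean(resid ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X1.shape[1] + 1  # regression coefficients + noise variance
    return 2 * loglik - k * np.log(n)

def approx_log_bayes_factor(bic_clust_X1, bic_clust_X1X2, y_candidate, X_selected):
    """log B12 ~ [BIC of clustering on X1 + BIC of regression X2|X1]
                 - BIC of clustering on (X1, X2); positive values favour M1,
    i.e. the candidate variable is judged not relevant for the clustering."""
    bic_m1 = bic_clust_X1 + bic_gaussian_regression(y_candidate, X_selected)
    return bic_m1 - bic_clust_X1X2
```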

Maugis et al. (2009a) have proposed a variation of the algorithm developed by Raftery and Dean. They define three subsets of variables: the relevant and irrelevant subsets (X^(1) and X^(3)) remain the same, but X^(2) is reformulated as a subset of relevant variables that explains the irrelevance through a multidimensional regression. This algorithm also uses a backward stepwise strategy instead of the forward stepwise strategy used by Raftery and Dean (2006). Their algorithm allows the definition of blocks of indivisible variables that, in certain situations, improve the clustering and its interpretability.

Both algorithms are well motivated and appear to produce good results; however, the amount of computation needed to test the different subsets of variables leads to huge running times. In practice, they cannot be used for the amount of data considered in this thesis.


                  8 Theoretical Foundations

In this chapter we develop Mix-GLOSS, which uses the GLOSS algorithm conceived for supervised classification (see Chapter 5) to solve clustering problems. The goal here is similar, that is, providing an assignment of examples to clusters based on few features.

We use a modified version of the EM algorithm whose M-step is formulated as a penalized linear regression of a scaled indicator matrix, that is, a penalized optimal scoring problem. This idea was originally proposed by Hastie and Tibshirani (1996) to build reduced-rank decision rules using less than K − 1 discriminant directions. Their motivation was mainly driven by stability issues: no sparsity-inducing mechanism was introduced in the construction of the discriminant directions. Roth and Lange (2004) pursued this idea for binary clustering problems, where sparsity was introduced by a Lasso penalty applied to the OS problem. Besides extending the work of Roth and Lange (2004) to an arbitrary number of clusters, we draw links between the OS penalty and the parameters of the Gaussian model.

In the subsequent sections, we provide the principles that allow the M-step to be solved as an optimal scoring problem. The feature selection technique is embedded by means of a group-Lasso penalty. We must then guarantee that the equivalence between the M-step and the OS problem holds for our penalty. As with GLOSS, this is accomplished with a variational approach of the group-Lasso. Finally, some considerations regarding the criterion that is optimized with this modified EM are provided.

8.1 Resolving EM with Optimal Scoring

In the previous chapters, EM was presented as an iterative algorithm that computes a maximum likelihood estimate through the maximization of the expected complete log-likelihood. This section explains how a penalized OS regression embedded into an EM algorithm produces a penalized likelihood estimate.

8.1.1 Relationship Between the M-Step and Linear Discriminant Analysis

LDA is typically used in a supervised learning framework for classification and dimension reduction. It looks for a projection of the data where the ratio of between-class variance to within-class variance is maximized (see Appendix C). Classification in the LDA domain is based on the Mahalanobis distance

\[
d(x_i, \mu_k) = (x_i - \mu_k)^\top \Sigma_W^{-1} (x_i - \mu_k),
\]

where μ_k are the p-dimensional centroids and Σ_W is the p × p common within-class covariance matrix.


The likelihood equations in the M-step, (7.11) and (7.12), can be interpreted as the mean and covariance estimates of a weighted and augmented LDA problem (Hastie and Tibshirani, 1996), where the n observations are replicated K times and weighted by t_ik (the posterior probabilities computed at the E-step).

Having replicated the data vectors, Hastie and Tibshirani (1996) remark that the parameters maximizing the mixture likelihood in the M-step of the EM algorithm, (7.11) and (7.12), can also be defined as the maximizers of the weighted and augmented likelihood

\[
2\, l_{\mathrm{weight}}(\mu, \Sigma) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}\, d(x_i, \mu_k) - n \log(|\Sigma_W|),
\]

which arises when considering a weighted and augmented LDA problem. This viewpoint provides the basis for an alternative maximization of the penalized maximum likelihood in Gaussian mixtures.

8.1.2 Relationship Between Optimal Scoring and Linear Discriminant Analysis

The equivalence between penalized optimal scoring problems and penalized linear discriminant analysis has already been detailed in Section 4.1 in the supervised learning framework. This is a critical part of the link between the M-step of an EM algorithm and optimal scoring regression.

8.1.3 Clustering Using Penalized Optimal Scoring

The solution of the penalized optimal scoring regression in the M-step is a coefficient matrix B_OS, analytically related to Fisher's discriminant directions B_LDA for the data (X, Y), where Y is the current (hard or soft) cluster assignment. In order to compute the posterior probabilities t_ik in the E-step, the distance between the samples x_i and the centroids μ_k must be evaluated. Depending on whether we are working in the input, OS or LDA domain, different expressions are used for the distances (see Section 4.2.2 for more details). Mix-GLOSS works in the LDA domain, based on the following expression:

\[
d(x_i, \mu_k) = \left\| (x_i - \mu_k)\, B_{LDA} \right\|_2^2 - 2 \log(\pi_k).
\]

This distance defines the computation of the posterior probabilities t_ik in the E-step (see Section 4.2.3). Putting together all those elements, the complete clustering algorithm can be summarized as:


1. Initialize the membership matrix Y (for example by the K-means algorithm).

2. Solve the p-OS problem as
\[
B_{OS} = \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y \Theta,
\]
where Θ are the K − 1 leading eigenvectors of
\[
Y^\top X \left(X^\top X + \lambda\Omega\right)^{-1} X^\top Y.
\]

3. Map X to the LDA domain: X_LDA = X B_OS D, with D = diag(α_k^{-1}(1 − α_k^2)^{-1/2}).

4. Compute the centroids M in the LDA domain.

5. Evaluate distances in the LDA domain.

6. Translate distances into posterior probabilities t_ik with
\[
t_{ik} \propto \exp\left[-\,\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right]. \tag{8.1}
\]

7. Update the labels using the posterior probabilities matrix: Y = T.

8. Go back to step 2 and iterate until the t_ik converge.

Items 2 to 5 can be interpreted as the M-step and item 6 as the E-step in this alternative view of the EM algorithm for Gaussian mixtures.
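The following sketch illustrates how this alternative EM view could be organized in code. It is only a schematic outline under assumptions: the hypothetical routine `penalized_os` stands in for the penalized optimal scoring fit, the random initialization replaces K-means, and the centroid and distance computations use plain Euclidean geometry in the LDA domain; it is not the Mix-GLOSS implementation itself.

```python
import numpy as np

def em_penalized_os(X, K, penalized_os, lam, n_iter=100, tol=1e-3):
    """EM skeleton where the M-step is a penalized optimal scoring fit.
    `penalized_os(X, Y, lam)` is assumed to return (B_lda, Theta)."""
    n, p = X.shape
    rng = np.random.default_rng(0)
    Y = np.eye(K)[rng.integers(K, size=n)]          # hard initial assignment
    T_old = Y.copy()
    for _ in range(n_iter):
        # ----- M-step: penalized OS regression of the indicator matrix -----
        B_lda, Theta = penalized_os(X, Y, lam)
        X_lda = X @ B_lda
        pi = Y.sum(axis=0) / n                      # cluster proportions
        M = (Y.T @ X_lda) / Y.sum(axis=0)[:, None]  # centroids in the LDA domain
        # ----- E-step: distances -> posterior probabilities, as in (8.1) -----
        d = ((X_lda[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        logit = -(d - 2 * np.log(pi)) / 2
        logit -= logit.max(axis=1, keepdims=True)   # numerical stability
        T = np.exp(logit)
        T /= T.sum(axis=1, keepdims=True)
        if np.abs(T - T_old).mean() < tol:
            break
        T_old, Y = T, T                             # soft labels for the next M-step
    return T, B_lda
```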

8.1.4 From Sparse Optimal Scoring to Sparse Linear Discriminant Analysis

In the previous section, we sketched a clustering algorithm that replaces the M-step with a penalized OS regression. This modified version of EM holds for any quadratic penalty. We extend this equivalence to sparsity-inducing penalties through the quadratic variational approach to the group-Lasso provided in Section 4.3. We now look for a formal equivalence between this penalty and penalized maximum likelihood for Gaussian mixtures.

8.2 Optimized Criterion

In the classical EM for Gaussian mixtures, the M-step maximizes the weighted likelihood Q(θ, θ′) (7.7) so as to maximize the likelihood L(θ) (see Section 7.1.2). Replacing the M-step by a penalized optimal scoring problem is possible, and the link between the penalized OS problem and penalized LDA holds, but it remains to relate this penalized LDA problem to a penalized maximum likelihood criterion for the Gaussian mixture.

This penalized likelihood cannot be rigorously interpreted as a maximum a posteriori criterion, in particular because the penalty only operates on the covariance matrix Σ (there is no prior on the means and proportions of the mixture). We however believe that the Bayesian interpretation provides some insight, and we detail it in what follows.

8.2.1 A Bayesian Derivation

This section sketches a Bayesian treatment of inference limited to our present needs, where penalties are to be interpreted as prior distributions over the parameters of the probabilistic model to be estimated. Further details can be found in Bishop (2006, Section 2.3.6) and in Gelman et al. (2003, Section 3.6).

The model proposed in this thesis considers a classical maximum likelihood estimation for the means and a penalized common covariance matrix. This penalization can be interpreted as arising from a prior on this parameter.

The prior over the covariance matrix of a Gaussian variable is classically expressed as a Wishart distribution, since it is a conjugate prior:

\[
f(\Sigma \,|\, \Lambda_0, \nu_0) = \frac{1}{2^{np/2}\, |\Lambda_0|^{n/2}\, \Gamma_p\!\left(\frac{n}{2}\right)}\; |\Sigma^{-1}|^{\frac{\nu_0 - p - 1}{2}} \exp\left\{-\frac{1}{2}\operatorname{tr}\!\left(\Lambda_0^{-1}\Sigma^{-1}\right)\right\},
\]

where ν_0 is the number of degrees of freedom of the distribution, Λ_0 is a p × p scale matrix, and Γ_p is the multivariate gamma function, defined as

\[
\Gamma_p\!\left(\tfrac{n}{2}\right) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left(\tfrac{n}{2} + \tfrac{1-j}{2}\right).
\]

The posterior distribution can be maximized, similarly to the likelihood, through the maximization of

\[
\begin{aligned}
Q(\theta, \theta') + \log\left(f(\Sigma \,|\, \Lambda_0, \nu_0)\right)
&= \sum_{k=1}^{K} t_{k} \log \pi_k - \frac{(n+1)p}{2}\log 2 - \frac{n}{2}\log|\Lambda_0| - \frac{p(p+1)}{4}\log(\pi) \\
&\quad - \sum_{j=1}^{p} \log\Gamma\!\left(\frac{n}{2} + \frac{1-j}{2}\right) - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right) \\
&\equiv \sum_{k=1}^{K} t_{k} \log \pi_k - \frac{n}{2}\log|\Lambda_0| - \frac{\nu_n - p - 1}{2}\log|\Sigma| - \frac{1}{2}\operatorname{tr}\!\left(\Lambda_n^{-1}\Sigma^{-1}\right), \tag{8.2}
\end{aligned}
\]

with

\[
t_{k} = \sum_{i=1}^{n} t_{ik}, \qquad
\nu_n = \nu_0 + n, \qquad
\Lambda_n^{-1} = \Lambda_0^{-1} + S_0, \qquad
S_0 = \sum_{i=1}^{n}\sum_{k=1}^{K} t_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^\top.
\]

Details of these calculations can be found in textbooks (for example, Bishop, 2006; Gelman et al., 2003).

8.2.2 Maximum a Posteriori Estimator

The maximization of (8.2) with respect to μ_k and π_k is of course not affected by the additional prior term, where only the covariance Σ intervenes. The MAP estimator of Σ is simply obtained by differentiating (8.2) with respect to Σ. The details of the calculations follow the same lines as the ones for maximum likelihood, detailed in Appendix G. The resulting estimator of Σ is

\[
\widehat{\Sigma}_{MAP} = \frac{1}{\nu_0 + n - p - 1}\left(\Lambda_0^{-1} + S_0\right), \tag{8.3}
\]

where S_0 is the matrix defined in Equation (8.2). The maximum a posteriori estimator of the within-class covariance matrix (8.3) can thus be identified with the penalized within-class variance (4.19) resulting from the p-OS regression (4.16a), if ν_0 is chosen to be p + 1 and Λ_0^{-1} = λΩ, where Ω is the penalty matrix from the group-Lasso regularization (4.25).
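For illustration, a minimal numerical sketch of this MAP covariance update is given below. The choice ν_0 = p + 1 and Λ_0^{-1} = λΩ follows the identification above, while the function name and its inputs are assumptions made for the example.

```python
import numpy as np

def map_within_covariance(X, T, mu, lam, Omega):
    """MAP estimate of the common within-class covariance, as in (8.3),
    with nu_0 = p + 1 and Lambda_0^{-1} = lam * Omega.
    X: (n, p) data, T: (n, K) posteriors, mu: (K, p) centroids."""
    n, p = X.shape
    K = mu.shape[0]
    S0 = np.zeros((p, p))
    for k in range(K):
        R = X - mu[k]                    # residuals w.r.t. centroid k
        S0 += (R * T[:, k:k + 1]).T @ R  # posterior-weighted scatter
    nu0 = p + 1
    return (lam * Omega + S0) / (nu0 + n - p - 1)
```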


                  9 Mix-GLOSS Algorithm

Mix-GLOSS is an algorithm for unsupervised classification that embeds feature selection, resulting in parsimonious decision rules. It is based on the GLOSS algorithm developed in Chapter 5, which has been adapted for clustering. In this chapter I describe the details of the implementation of Mix-GLOSS and of the model selection mechanism.

9.1 Mix-GLOSS

The implementation of Mix-GLOSS involves three nested loops, as sketched in Figure 9.1. The inner one is an EM algorithm that, for a given value of the regularization parameter λ, iterates between an M-step, where the parameters of the model are estimated, and an E-step, where the corresponding posterior probabilities are computed. The main outputs of the EM are the coefficient matrix B, which projects the input data X onto the best subspace (in Fisher's sense), and the posteriors t_ik.

When several values of the penalty parameter are tested, we give them to the algorithm in ascending order, and the algorithm is initialized with the solution found for the previous λ value. This process continues until all the penalty parameter values have been tested, if a vector of penalty parameters was provided, or until a given sparsity is achieved, as measured by the number of variables estimated to be relevant.

The outer loop implements complete repetitions of the clustering algorithm for all the penalty parameter values, with the purpose of choosing the best execution. This loop alleviates local minima issues by resorting to multiple initializations of the partition.

9.1.1 Outer Loop: Whole Algorithm Repetitions

This loop performs a user-defined number of repetitions of the clustering algorithm. It takes as inputs:

• the centered n × p feature matrix X;

• the vector of penalty parameter values to be tried (an option is to provide an empty vector and let the algorithm set trial values automatically);

• the number of clusters K;

• the maximum number of iterations for the EM algorithm;

• the convergence tolerance for the EM algorithm;

• the number of whole repetitions of the clustering algorithm;


Figure 9.1: Mix-GLOSS loops scheme.

• a p × (K − 1) initial coefficient matrix (optional);

• an n × K initial posterior probability matrix (optional).

For each repetition of the algorithm, an initial label matrix Y is needed. This matrix may contain either hard or soft assignments. If no such matrix is available, K-means is used to initialize the process. If we have an initial guess for the coefficient matrix B, it can also be fed into Mix-GLOSS to warm-start the process.

9.1.2 Penalty Parameter Loop

The penalty parameter loop goes through all the values of the input vector λ. These values are sorted in ascending order, so that the resulting B and Y matrices can be used to warm-start the EM loop for the next value of the penalty parameter. If some λ value results in a null coefficient matrix, the algorithm halts. We have observed that the warm-start implemented here reduces the computation time by a factor of 8 with respect to using a null B matrix and a K-means execution for the initial Y label matrix.

Mix-GLOSS may be fed with an empty vector of penalty parameters, in which case a first non-penalized execution of Mix-GLOSS is done, and its resulting coefficient matrix B and posterior matrix Y are used to estimate a trial value of λ that should remove about 10% of the relevant features. This estimation is repeated until a minimum number of relevant variables is reached. The parameter that sets the estimated percentage of variables to be removed with the next penalty parameter can be modified to make feature selection more or less aggressive.

Algorithm 2 details the implementation of the automatic selection of the penalty parameter. If the alternative variational approach from Appendix D is used, we have to replace Equation (4.32b) by (D.10b).

Algorithm 2: Automatic selection of λ

Input: X, K, λ = ∅, minVAR
Initialize:
    B ← 0
    Y ← K-means(X, K)
    Run non-penalized Mix-GLOSS:
        λ ← 0
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    lastLAMBDA ← false
repeat
    Estimate λ: compute the gradient at β^j = 0,
        ∂J(B)/∂β^j |_{β^j = 0} = x^{j⊤} ( Σ_{m≠j} x^m β^m − YΘ )
    Compute λmax for every feature using (4.32b):
        λmax_j = (1 / w_j) ‖ ∂J(B)/∂β^j |_{β^j = 0} ‖_2
    Choose λ so as to remove 10% of the relevant features
    Run penalized Mix-GLOSS:
        (B, Y) ← Mix-GLOSS(X, K, B, Y, λ)
    if the number of relevant variables in B > minVAR then
        lastLAMBDA ← false
    else
        lastLAMBDA ← true
    end if
until lastLAMBDA
Output: B, L(θ), t_ik, π_k, μ_k, Σ, Y for every λ in the solution path
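As an aside, the sketch below shows how the per-feature λmax of the "Estimate λ" step could be computed with plain matrix operations. The variable names (`Y_theta`, `weights`) are illustrative assumptions rather than the actual Mix-GLOSS code.

```python
import numpy as np

def lambda_max_per_feature(X, Y_theta, B, weights):
    """Per-feature critical penalty: the smallest lambda that zeroes
    feature j when all other coefficients are kept fixed.
    X: (n, p), Y_theta: (n, K-1) scaled indicator targets,
    B: (p, K-1) current coefficients, weights: (p,) group weights."""
    n, p = X.shape
    lam_max = np.empty(p)
    for j in range(p):
        B_minus_j = B.copy()
        B_minus_j[j, :] = 0.0                  # drop feature j
        residual = X @ B_minus_j - Y_theta     # sum_{m != j} x^m beta^m - Y Theta
        grad_j = X[:, j] @ residual            # gradient of J at beta^j = 0
        lam_max[j] = np.linalg.norm(grad_j) / weights[j]
    return lam_max
```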

9.1.3 Inner Loop: EM Algorithm

The inner loop implements the actual clustering algorithm by means of successive maximizations of a penalized likelihood criterion. Once convergence of the posterior probabilities t_ik is achieved, the maximum a posteriori rule is applied to classify all examples. Algorithm 3 describes this inner loop.


Algorithm 3: Mix-GLOSS for one value of λ

Input: X, K, B0, Y0, λ
Initialize:
    if (B0, Y0) available then
        B_OS ← B0 ; Y ← Y0
    else
        B_OS ← 0 ; Y ← K-means(X, K)
    end if
    convergenceEM ← false ; tolEM ← 1e-3
repeat
    M-step:
        (B_OS, Θ, α) ← GLOSS(X, Y, B_OS, λ)
        X_LDA = X B_OS diag(α^{-1}(1 − α^2)^{-1/2})
        π_k, μ_k and Σ as per (7.10), (7.11) and (7.12)
    E-step:
        t_ik as per (8.1)
        L(θ) as per (8.2)
    if (1/n) Σ_i |t_ik − y_ik| < tolEM then
        convergenceEM ← true
    end if
    Y ← T
until convergenceEM
Y ← MAP(T)
Output: B_OS, Θ, L(θ), t_ik, π_k, μ_k, Σ, Y


                  M-Step

The M-step deals with the estimation of the model parameters, that is, the clusters' means μ_k, the common covariance matrix Σ and the prior of every component π_k. In a classical M-step, this is done explicitly by maximizing the likelihood expression. Here, this maximization is implicitly performed by penalized optimal scoring (see Section 8.1). The core of this step is a GLOSS execution that regresses X on the scaled version of the label matrix, YΘ. For the first iteration of EM, if no initialization is available, Y results from a K-means execution. In subsequent iterations, Y is updated as the posterior probability matrix T resulting from the E-step.

                  E-Step

The E-step evaluates the posterior probability matrix T using

\[
t_{ik} \propto \exp\left[-\,\frac{d(x_i, \mu_k) - 2\log(\pi_k)}{2}\right].
\]

The convergence of these t_ik is used as the stopping criterion for EM.

9.2 Model Selection

Here, model selection refers to the choice of the penalty parameter. Up to now, we have not conducted experiments where the number of clusters has to be automatically selected.

In a first attempt, we tried a classical structure where clustering was performed several times, from different initializations, for all penalty parameter values. Then, using the log-likelihood criterion, the best repetition for every value of the penalty parameter was chosen. The definitive λ was selected by means of the stability criterion described by Lange et al. (2002). This algorithm required a lot of computing resources, since the stability selection mechanism needed a certain number of repetitions that turned Mix-GLOSS into a lengthy structure of four nested loops.

In a second attempt, we replaced the stability-based model selection algorithm by the evaluation of a modified version of BIC (Pan and Shen, 2007). This version of BIC looks like the traditional one (Schwarz, 1978), but takes into consideration the variables that have been removed. This mechanism, even if it turned out to be faster, still required a large computation time.
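The sketch below gives the flavour of such a sparsity-aware BIC, where the effective number of parameters counts only the non-zero rows of the coefficient matrix. The exact parameter count used by Pan and Shen (2007) or by Mix-GLOSS may differ, so this is an assumption made for illustration.

```python
import numpy as np

def modified_bic(loglik, B, n, K, tol=1e-8):
    """BIC-like criterion penalizing only the parameters that remain
    in the model after feature selection (lower is better here)."""
    active = np.sum(np.linalg.norm(B, axis=1) > tol)  # non-zero rows of B
    # mixture proportions + centroid coordinates restricted to active features
    n_params = (K - 1) + K * active
    return -2.0 * loglik + np.log(n) * n_params
```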

The third and (up to now) definitive attempt proceeds with several executions of Mix-GLOSS for the non-penalized case (λ = 0). The execution with the best log-likelihood is chosen. The repetitions are only performed for the non-penalized problem. The coefficient matrix B and the posterior matrix T resulting from the best non-penalized execution are used to warm-start a new Mix-GLOSS execution. This second execution of Mix-GLOSS is done using the values of the penalty parameter provided by the user or computed by the automatic selection mechanism. This time, only one repetition of the algorithm is done for every value of the penalty parameter. This version has been tested


Figure 9.2: Mix-GLOSS model selection diagram (non-penalized repetitions with λ = 0, warm-start of the penalized runs with the best B and T, computation of BIC and choice of the λ minimizing BIC).

with no significant differences in the quality of the clustering, but with a dramatic reduction of the computation time. Figure 9.2 summarizes the mechanism that implements the model selection of the penalty parameter λ.


                  10 Experimental Results

The performance of Mix-GLOSS is measured here with the artificial dataset that has been used in Section 6.

This synthetic database is interesting because it covers four different situations where feature selection can be applied. Basically, it considers four setups with 1200 examples equally distributed between classes. It is a small-sample regime, with p = 500 variables, out of which 100 differ between classes. Independent variables are generated for all simulations except for simulation 2, where they are slightly correlated. In simulations 2 and 3, classes are optimally separated by a single projection of the original variables, while the two other scenarios require three discriminant directions. The Bayes' error was estimated to be respectively 17, 67, 73 and 300. The exact description of every setup has already been given in Section 6.3.

In our tests, we have reduced the volume of the problem because, with the original size of 1200 samples and 500 dimensions, some of the algorithms to be tested took several days (even weeks) to finish. Hence, the definitive database was chosen to maintain approximately the Bayes' error of the original one, but with five times fewer examples and dimensions (n = 240, p = 100). Figure 10.1 has been adapted from Witten and Tibshirani (2011) to the dimensionality of our experiments and allows a better understanding of the different simulations.

The simulation protocol involves 25 repetitions of each setup, generating a different dataset for each repetition. Thus, the results of the tested algorithms are provided as the average value and the standard deviation over the 25 repetitions.

10.1 Tested Clustering Algorithms

This section compares Mix-GLOSS with the following methods from the state of the art:

• CS general cov: a model-based clustering method with unconstrained covariance matrices, based on the regularization of the likelihood function using L1 penalties, followed by a classical EM algorithm. Further details can be found in Zhou et al. (2009). We use the R function available on the website of Wei Pan.

• Fisher EM: this method models and clusters the data in a discriminative and low-dimensional latent subspace (Bouveyron and Brunet, 2012b,a). Feature selection is induced by means of the "sparsification" of the projection matrix (three possibilities are suggested by Bouveyron and Brunet, 2012a). The corresponding R package "Fisher EM" is available from the website of Charles Bouveyron or from the Comprehensive R Archive Network website.


Figure 10.1: Class mean vectors for each artificial simulation.

• SelvarClust/Clustvarsel: implements a method of variable selection for clustering using Gaussian mixture models, as a modification of the Raftery and Dean (2006) algorithm. SelvarClust (Maugis et al., 2009b) is a software implemented in C++ that makes use of the clustering library mixmod (Biernacki et al., 2008). Further information can be found in the related paper (Maugis et al., 2009a). The software can be downloaded from the SelvarClust project homepage; there is a link to the project from Cathy Maugis's website.

After several tests, this entrant was discarded due to the amount of computing time required by the greedy selection technique, which basically involves two executions of a classical clustering algorithm (with mixmod) for every single variable whose inclusion needs to be considered.

The substitute for SelvarClust has been the algorithm that inspired it, that is, the method developed by Raftery and Dean (2006). There is an R package named Clustvarsel that can be downloaded from the website of Nema Dean or from the Comprehensive R Archive Network website.

• LumiWCluster: LumiWCluster is an R package available from the homepage of Pei Fen Kuan. This algorithm is inspired by Wang and Zhu (2008), who propose a penalty for the likelihood that incorporates group information through an L1,∞ mixed norm. In Kuan et al. (2010), some slight changes are introduced in the penalty term, such as weighting parameters, that are particularly important for their dataset. The package LumiWCluster allows clustering using either the expression from Wang and Zhu (2008) (called LumiWCluster-Wang) or the one from Kuan et al. (2010) (called LumiWCluster-Kuan).

• Mix-GLOSS: this is the clustering algorithm implemented using GLOSS (see Chapter 9). It makes use of an EM algorithm and of the equivalences between the M-step and an LDA problem, and between a p-LDA problem and a p-OS problem. It penalizes an OS regression with a variational approach of the group-Lasso penalty (see Section 8.1.4) that induces zeros in all discriminant directions for the same variable.

10.2 Results

Table 10.1 shows the results of the experiments for all the algorithms of Section 10.1. The parameters used to measure the performance are:

• Clustering error (in %): to measure the quality of the partition with the a priori knowledge of the real classes, the clustering error is computed as explained in Wu and Schölkopf (2007); a sketch of this computation is given after this list. If the obtained partition and the real labeling are the same, then the clustering error is 0%. The way this measure is defined allows the ideal 0% clustering error to be reached even if the IDs of the clusters and of the real classes differ.

• Number of discarded features: this value shows the number of variables whose coefficients have been zeroed, and which are therefore not used in the partitioning. In our datasets, only the first 20 features are relevant for the discrimination; the last 80 variables can be discarded. Hence, a good result for the tested algorithms should be around 80.

• Time of execution (in hours, minutes or seconds): finally, the time needed to execute the 25 repetitions of each simulation setup is also measured. These algorithms tend to be more memory- and CPU-consuming as the number of variables increases; this is one of the reasons why the dimensionality of the original problem was reduced.
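A minimal sketch of this label-permutation-invariant clustering error is given below. It matches cluster IDs to class IDs with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`, which is one common way to implement the measure, not necessarily the exact procedure of Wu and Schölkopf (2007).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(y_true, y_pred):
    """Clustering error (in %) after the best one-to-one matching
    between predicted cluster IDs and true class IDs."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency[i, j] = number of samples of class i assigned to cluster j
    contingency = np.array([[np.sum((y_true == c) & (y_pred == k))
                             for k in clusters] for c in classes])
    row, col = linear_sum_assignment(-contingency)   # maximize matched samples
    matched = contingency[row, col].sum()
    return 100.0 * (1.0 - matched / len(y_true))
```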

The adequacy of the selected features is assessed by the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR is the proportion of the truly relevant variables that are selected. Similarly, the FPR is the proportion of the truly non-relevant variables that are selected. The best algorithm would be the one that selects all the relevant variables and rejects all the others, that is, TPR = 1 and FPR = 0 simultaneously. In order to avoid cluttered results, we compare TPR and FPR for the four simulations but only for three algorithms: CS general cov and Clustvarsel were discarded due to their high computing time and clustering error, respectively, and since the two versions of LumiWCluster provide almost the same TPR and FPR, only one is displayed. The three remaining algorithms are Fisher EM (Bouveyron and Brunet, 2012a), the version of LumiWCluster by Kuan et al. (2010), and Mix-GLOSS.

Results, in percentages, are displayed in Figure 10.2 (and in Table 10.2).


Table 10.1: Experimental results for simulated data.

                        Err (%)       Var           Time

Sim 1: K = 4, mean shift, ind. features
  CS general cov        46 (15)       985 (72)      884h
  Fisher EM             58 (87)       784 (52)      1645m
  Clustvarsel           602 (107)     378 (291)     383h
  LumiWCluster-Kuan     42 (68)       779 (4)       389s
  LumiWCluster-Wang     43 (69)       784 (39)      619s
  Mix-GLOSS             32 (16)       80 (09)       15h

Sim 2: K = 2, mean shift, dependent features
  CS general cov        154 (2)       997 (09)      783h
  Fisher EM             74 (23)       809 (28)      8m
  Clustvarsel           73 (2)        334 (207)     166h
  LumiWCluster-Kuan     64 (18)       798 (04)      155s
  LumiWCluster-Wang     63 (17)       799 (03)      14s
  Mix-GLOSS             77 (2)        841 (34)      2h

Sim 3: K = 4, 1D mean shift, ind. features
  CS general cov        304 (57)      55 (468)      1317h
  Fisher EM             233 (65)      366 (55)      22m
  Clustvarsel           658 (115)     232 (291)     542h
  LumiWCluster-Kuan     323 (21)      80 (02)       83s
  LumiWCluster-Wang     308 (36)      80 (02)       1292s
  Mix-GLOSS             347 (92)      81 (88)       21h

Sim 4: K = 4, mean shift, ind. features
  CS general cov        626 (55)      999 (02)      112h
  Fisher EM             567 (104)     55 (48)       195m
  Clustvarsel           732 (4)       24 (12)       767h
  LumiWCluster-Kuan     692 (112)     99 (2)        876s
  LumiWCluster-Wang     697 (119)     991 (21)      825s
  Mix-GLOSS             669 (91)      975 (12)      11h

Table 10.2: TPR versus FPR (in %), averages computed over 25 repetitions for the best performing algorithms.

              Simulation 1    Simulation 2    Simulation 3    Simulation 4
              TPR    FPR      TPR    FPR      TPR    FPR      TPR    FPR
  MIX-GLOSS   992    015      828    335      884    67       780    12
  LUMI-KUAN   992    28       1000   02       1000   005      50     005
  FISHER-EM   986    24       888    17       838    5825     620    4075


Figure 10.2: TPR versus FPR (in %) for the best performing algorithms (Mix-GLOSS, LumiWCluster-Kuan and Fisher EM) on the four simulations.

10.3 Discussion

After reviewing Tables 10.1 and 10.2 and Figure 10.2, we see that there is no definitive winner in all situations regarding all criteria. Depending on the objectives and constraints of the problem, the following observations deserve to be highlighted.

LumiWCluster (Wang and Zhu, 2008; Kuan et al., 2010) is by far the fastest method, with good behavior regarding the other performance measures. At the other end on this criterion, CS general cov is extremely slow, and Clustvarsel, though twice as fast, also takes very long to produce an output. Of course, the speed criterion does not say much by itself: the implementations use different programming languages and different stopping criteria, and we do not know what effort has been spent on each implementation. That being said, the slowest algorithms are not the most precise ones, so their long computation times are worth mentioning here.

The quality of the partition varies depending on the simulation and the algorithm: Mix-GLOSS has a small edge in Simulation 1, LumiWCluster performs better in Simulation 2, while Fisher EM (Bouveyron and Brunet, 2012a) does slightly better in Simulations 3 and 4.

From the feature selection point of view, LumiWCluster (Kuan et al., 2010) and Mix-GLOSS succeed in removing irrelevant variables in all the situations, while Fisher EM (Bouveyron and Brunet, 2012a) and Mix-GLOSS discover the relevant ones. Mix-GLOSS consistently performs best, or close to the best solution, in terms of fall-out and recall.


                  Conclusions

                  Summary

The linear regression of scaled indicator matrices, or optimal scoring, is a versatile technique with applicability in many fields of the machine learning domain. By means of regularization, an optimal scoring regression can be strengthened to be more robust, avoid overfitting, counteract ill-posed problems, or remove correlated or noisy variables.

In this thesis we have proved the utility of penalized optimal scoring in the fields of multi-class linear discrimination and clustering.

The equivalence between LDA and OS problems makes it possible to take advantage of all the resources available for the resolution of regression problems in the solution of linear discrimination problems. In their penalized versions, this equivalence holds under certain conditions that have not always been obeyed when OS has been used to solve LDA problems.

In Part II, we have used a variational approach of the group-Lasso penalty to preserve this equivalence, granting the use of penalized optimal scoring regressions for the solution of linear discrimination problems. This theory has been verified with the implementation of our Group-Lasso Optimal Scoring Solver algorithm (GLOSS), which has proved its effectiveness, inducing extremely parsimonious models without renouncing any predictive capabilities. GLOSS has been tested on four artificial and three real datasets, outperforming other algorithms of the state of the art in almost all situations.

In Part III, this theory has been adapted, by means of an EM algorithm, to the unsupervised domain. As for the supervised case, the theory must guarantee the equivalence between penalized LDA and penalized OS. The difficulty of this method resides in the computation of the criterion to maximize at every iteration of the EM loop, which is typically used to detect the convergence of the algorithm and to implement model selection of the penalty parameter. Also in this case, the theory has been put into practice with the implementation of Mix-GLOSS. By now, due to time constraints, only artificial datasets have been tested, with positive results.

                  Perspectives

Even if the preliminary results are promising, Mix-GLOSS has not been sufficiently tested. We have planned to test it at least with the same real datasets that we used with GLOSS. However, more testing would be recommended in both cases. Those algorithms are well suited for genomic data, where the number of samples is smaller than the number of variables; however, other high-dimension low-sample-size (HDLSS) domains are also possible: identification of male or female silhouettes, fungal species or fish species based on shape and texture (Clemmensen et al., 2011), or the Stirling faces (Roth and Lange, 2004), are only some examples. Moreover, we are not constrained to the HDLSS domain: the USPS handwritten digits database (Roth and Lange, 2004), or the well-known Fisher's Iris dataset and six other UCI datasets (Bouveyron and Brunet, 2012a), have also been tested in the bibliography.

At the programming level, both codes must be revisited to improve their robustness and optimize their computation, because during the prototyping phase the priority was achieving functional code. An old version of GLOSS, numerically more stable but less efficient, has been made available to the public. A better suited and documented version should be made available for GLOSS and Mix-GLOSS in the short term.

The theory developed in this thesis and the programming structure used for its implementation allow easy alterations of the algorithm by modifying the within-class covariance matrix. Diagonal versions of the model can be obtained by discarding all the elements but the diagonal of the covariance matrix. Spherical models could also be implemented easily. Prior information concerning the correlation between features can be included by adding a quadratic penalty term, such as the Laplacian that describes the relationships between variables; this can be used to implement pairwise penalties when the dataset is formed by pixels. Quadratic penalty matrices can also be added to the within-class covariance to implement Elastic-net-like penalties. Some of those possibilities have been partially implemented, such as the diagonal version of GLOSS; however, they have not been properly tested or even updated with the last algorithmic modifications. Their equivalents for the unsupervised domain have not yet been proposed, due to the time deadlines for the publication of this thesis.

From the point of view of the supporting theory, we did not succeed in finding the exact criterion that is maximized in Mix-GLOSS. We believe it must be a kind of penalized, or even hyper-penalized, likelihood, but we decided to prioritize the experimental results due to the time constraints. Ignoring this criterion does not prevent successful simulations of Mix-GLOSS: other mechanisms that do not involve the computation of the real criterion have been used for stopping the EM algorithm and for model selection. However, further investigations must be done in this direction to assess the convergence properties of this algorithm.

At the beginning of this thesis, even if the work finally took the direction of feature selection, a big effort was made in the domains of outlier detection and block clustering. One of the most successful mechanisms for the detection of outliers consists in modelling the population with a mixture model where the outliers are described by a uniform distribution. This technique does not need any prior knowledge about the number or the percentage of outliers. As the basic model of this thesis is a mixture of Gaussians, our impression is that it should not be difficult to introduce a new uniform component to gather together all those points that do not fit the Gaussian mixture. On the other hand, the application of penalized optimal scoring to block clustering looks more complex; but as block clustering is typically defined as a mixture model whose parameters are estimated by means of an EM algorithm, it could be possible to re-interpret that estimation using a penalized optimal scoring regression.


                  Appendix


                  A Matrix Properties

Property 1. By definition, Σ_W and Σ_B are both symmetric matrices:

\[
\Sigma_W = \frac{1}{n} \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top, \qquad
\Sigma_B = \frac{1}{n} \sum_{k=1}^{g} n_k (\mu_k - \bar{x})(\mu_k - \bar{x})^\top.
\]

Property 2. \(\dfrac{\partial x^\top a}{\partial x} = \dfrac{\partial a^\top x}{\partial x} = a\).

Property 3. \(\dfrac{\partial x^\top A x}{\partial x} = (A + A^\top)\, x\).

Property 4. \(\dfrac{\partial |X^{-1}|}{\partial X} = -|X^{-1}|\, (X^{-1})^\top\).

Property 5. \(\dfrac{\partial a^\top X b}{\partial X} = a b^\top\).

Property 6. \(\dfrac{\partial}{\partial X} \operatorname{tr}\!\left(A X^{-1} B\right) = -\left(X^{-1} B A X^{-1}\right)^\top = -X^{-\top} A^\top B^\top X^{-\top}\).


                  B The Penalized-OS Problem is anEigenvector Problem

In this appendix we explain why the solution of a penalized optimal scoring regression involves the computation of an eigenvector decomposition. The p-OS problem has the form

\[
\begin{aligned}
\min_{\theta_k, \beta_k} \quad & \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k & (B.1)\\
\text{s.t.} \quad & \theta_k^\top Y^\top Y \theta_k = 1, \\
& \theta_\ell^\top Y^\top Y \theta_k = 0 \quad \forall \ell < k,
\end{aligned}
\]

for k = 1, ..., K − 1.

The Lagrangian associated with Problem (B.1) is

\[
L_k(\theta_k, \beta_k, \lambda_k, \nu_k) = \|Y\theta_k - X\beta_k\|_2^2 + \beta_k^\top \Omega_k \beta_k + \lambda_k\left(\theta_k^\top Y^\top Y \theta_k - 1\right) + \sum_{\ell<k} \nu_\ell\, \theta_\ell^\top Y^\top Y \theta_k. \tag{B.2}
\]

Setting to zero the gradient of (B.2) with respect to β_k gives the value of the optimal β_k:

\[
\beta_k^\star = (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. \tag{B.3}
\]

The objective function of (B.1) evaluated at β_k^\star is

\[
\begin{aligned}
\min_{\theta_k}\; \|Y\theta_k - X\beta_k^\star\|_2^2 + \beta_k^{\star\top} \Omega_k \beta_k^\star
&= \min_{\theta_k}\; \theta_k^\top Y^\top\left(I - X(X^\top X + \Omega_k)^{-1}X^\top\right) Y \theta_k \\
&= \max_{\theta_k}\; \theta_k^\top Y^\top X (X^\top X + \Omega_k)^{-1} X^\top Y \theta_k. & (B.4)
\end{aligned}
\]

If the penalty matrix Ω_k is identical for all problems, Ω_k = Ω, then (B.4) corresponds to an eigen-problem, where the K − 1 score vectors θ_k are the eigenvectors of Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

B.1 How to Solve the Eigenvector Decomposition

Performing an eigen-decomposition of an expression like Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y is not trivial, due to the p × p inverse. With some datasets, p can be extremely large, making this inverse intractable. In this section we show how to circumvent this issue by solving an easier eigenvector decomposition.


Let M be the matrix Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y, so that we can rewrite expression (B.4) in a compact way:

\[
\begin{aligned}
\max_{\Theta \in \mathbb{R}^{K\times(K-1)}} \quad & \operatorname{tr}\left(\Theta^\top M \Theta\right) & (B.5)\\
\text{s.t.} \quad & \Theta^\top Y^\top Y \Theta = I_{K-1}.
\end{aligned}
\]

If (B.5) is an eigenvector problem, it can be reformulated in the traditional way. Let the (K−1) × (K−1) matrix M_Θ be Θ^⊤MΘ. Hence, the classical eigenvector formulation associated with (B.5) is

\[
M_\Theta v = \lambda v, \tag{B.6}
\]

where v is an eigenvector and λ the associated eigenvalue of M_Θ. Operating,

\[
v^\top M_\Theta v = \lambda \;\Leftrightarrow\; v^\top \Theta^\top M \Theta v = \lambda.
\]

Making the change of variable w = Θv, we obtain an alternative eigen-problem where the w are eigenvectors of M and λ the associated eigenvalues:

\[
w^\top M w = \lambda. \tag{B.7}
\]

Therefore, v are the eigenvectors obtained from the eigen-decomposition of the matrix M_Θ, and w are the eigenvectors obtained from the eigen-decomposition of the matrix M. Note that the only difference between the (K−1) × (K−1) matrix M_Θ and the K × K matrix M is the K × (K−1) matrix Θ in the expression M_Θ = Θ^⊤MΘ. Then, to avoid the computation of the p × p inverse (X^⊤X + Ω)^{-1}, we can use the optimal value of the coefficient matrix B = (X^⊤X + Ω)^{-1}X^⊤YΘ in M_Θ:

\[
M_\Theta = \Theta^\top Y^\top X (X^\top X + \Omega)^{-1} X^\top Y \Theta = \Theta^\top Y^\top X B.
\]

Thus, the eigen-decomposition of the (K−1) × (K−1) matrix M_Θ = Θ^⊤Y^⊤XB yields the v eigenvectors of (B.6). To obtain the w eigenvectors of the alternative formulation (B.7), the change of variable w = Θv needs to be undone.

To summarize, we compute the v eigenvectors from the eigen-decomposition of the tractable matrix M_Θ, evaluated as Θ^⊤Y^⊤XB. Then, the definitive eigenvectors w are recovered by setting w = Θv. The final step is the reconstruction of the optimal score matrix Θ, using the vectors w as its columns. At this point we understand what in the literature is called "updating the initial score matrix": multiplying the initial Θ by the matrix of eigenvectors V from decomposition (B.6) reverses the change of variable and restores the w vectors. The B matrix also needs to be "updated", by multiplying B by the same matrix of eigenvectors V, in order to account for the initial Θ matrix used in the first computation of B:

\[
B = (X^\top X + \Omega)^{-1} X^\top Y \Theta V = B V.
\]
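A small numerical sketch of this computation is given below, under the assumption that a symmetric penalty matrix Ω is supplied and that all classes are non-empty; it is meant to illustrate the order of operations rather than reproduce the GLOSS code.

```python
import numpy as np

def optimal_scores(X, Y, Omega):
    """Solve the penalized OS eigen-problem through the small
    (K-1) x (K-1) matrix M_Theta instead of the K x K matrix M."""
    K = Y.shape[1]
    Theta0 = np.eye(K)[:, :K - 1]              # any initial feasible scores
    # normalize so that Theta0' Y'Y Theta0 = I (assumes non-empty classes)
    YtY = Y.T @ Y
    L = np.linalg.cholesky(Theta0.T @ YtY @ Theta0)
    Theta0 = Theta0 @ np.linalg.inv(L).T
    B0 = np.linalg.solve(X.T @ X + Omega, X.T @ Y @ Theta0)
    M_theta = Theta0.T @ Y.T @ X @ B0          # (K-1) x (K-1), cheap to diagonalize
    eigval, V = np.linalg.eigh(M_theta)
    V = V[:, np.argsort(eigval)[::-1]]         # leading eigenvectors first
    Theta = Theta0 @ V                         # update the score matrix
    B = B0 @ V                                 # update the coefficients
    return Theta, B
```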


B.2 Why the OS Problem is Solved as an Eigenvector Problem

In the optimal scoring literature, the score matrix Θ that optimizes Problem (B.1) is obtained by means of an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.

By definition of the eigen-decomposition, the eigenvectors of the matrix M (called w in (B.7)) form a basis, so that any score vector θ can be expressed as a linear combination of them:

\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m, \qquad \text{s.t. } \theta_k^\top \theta_k = 1. \tag{B.8}
\]

The score vector normalization constraint θ_k^⊤θ_k = 1 can also be expressed as a function of this basis:

\[
\left(\sum_{m=1}^{K-1} \alpha_m w_m\right)^{\!\top} \left(\sum_{m=1}^{K-1} \alpha_m w_m\right) = 1,
\]

which, by the eigenvector properties, reduces to

\[
\sum_{m=1}^{K-1} \alpha_m^2 = 1. \tag{B.9}
\]

Let M be multiplied by a score vector θ_k, which can be replaced by its linear combination of eigenvectors w_m (B.8):

\[
M\theta_k = M \sum_{m=1}^{K-1} \alpha_m w_m = \sum_{m=1}^{K-1} \alpha_m M w_m.
\]

As the w_m are eigenvectors of the matrix M, the relationship Mw_m = λ_m w_m can be used to obtain

\[
M\theta_k = \sum_{m=1}^{K-1} \alpha_m \lambda_m w_m.
\]

Left-multiplying both sides by θ_k^⊤, written as its linear combination of eigenvectors, gives

\[
\theta_k^\top M \theta_k = \left(\sum_{\ell=1}^{K-1} \alpha_\ell w_\ell\right)^{\!\top} \left(\sum_{m=1}^{K-1} \alpha_m \lambda_m w_m\right).
\]

This equation can be simplified using the orthogonality property of eigenvectors, according to which w_ℓ^⊤w_m is zero for any ℓ ≠ m, giving

\[
\theta_k^\top M \theta_k = \sum_{m=1}^{K-1} \alpha_m^2 \lambda_m.
\]


The optimization Problem (B.5) for discriminant direction k can be rewritten as

\[
\max_{\theta_k \in \mathbb{R}^{K}} \;\theta_k^\top M \theta_k
= \max_{\theta_k \in \mathbb{R}^{K}} \;\sum_{m=1}^{K-1} \alpha_m^2 \lambda_m, \tag{B.10}
\]

with

\[
\theta_k = \sum_{m=1}^{K-1} \alpha_m w_m \qquad \text{and} \qquad \sum_{m=1}^{K-1} \alpha_m^2 = 1.
\]

One way of maximizing Problem (B.10) is to choose α_m = 1 for m = k and α_m = 0 otherwise. Hence, as θ_k = Σ_{m=1}^{K−1} α_m w_m, the resulting score vector θ_k is equal to the kth eigenvector w_k.

As a summary, it can be concluded that the solution to the original problem (B.1) can be obtained by an eigenvector decomposition of the matrix M = Y^⊤X(X^⊤X + Ω)^{-1}X^⊤Y.


C Solving Fisher's Discriminant Problem

The classical Fisher's discriminant problem seeks a projection that best separates the class centers while every class remains compact. This is formalized as looking for a projection such that the projected data has maximal between-class variance under a unitary constraint on the within-class variance:

\[
\begin{aligned}
\max_{\beta \in \mathbb{R}^p} \quad & \beta^\top \Sigma_B \beta & (C.1a)\\
\text{s.t.} \quad & \beta^\top \Sigma_W \beta = 1, & (C.1b)
\end{aligned}
\]

where Σ_B and Σ_W are respectively the between-class variance and the within-class variance of the original p-dimensional data.

The Lagrangian of Problem (C.1) is

\[
L(\beta, \nu) = \beta^\top \Sigma_B \beta - \nu\left(\beta^\top \Sigma_W \beta - 1\right),
\]

so that its first derivative with respect to β is

\[
\frac{\partial L(\beta, \nu)}{\partial \beta} = 2\Sigma_B \beta - 2\nu \Sigma_W \beta.
\]

A necessary optimality condition for β^\star is that this derivative is zero, that is,

\[
\Sigma_B \beta^\star = \nu \Sigma_W \beta^\star.
\]

Provided Σ_W is full rank, we have

\[
\Sigma_W^{-1} \Sigma_B \beta^\star = \nu \beta^\star. \tag{C.2}
\]

Thus, the solutions β^\star match the definition of an eigenvector of the matrix Σ_W^{-1}Σ_B, with eigenvalue ν. To characterize this eigenvalue, we note that the objective function (C.1a) can be expressed as follows:

\[
\begin{aligned}
\beta^{\star\top} \Sigma_B \beta^\star &= \beta^{\star\top} \Sigma_W \Sigma_W^{-1} \Sigma_B \beta^\star \\
&= \nu\, \beta^{\star\top} \Sigma_W \beta^\star & \text{from (C.2)} \\
&= \nu & \text{from (C.1b)}.
\end{aligned}
\]

That is, the optimal value of the objective function to be maximized is the eigenvalue ν. Hence, ν is the largest eigenvalue of Σ_W^{-1}Σ_B, and β^\star is any eigenvector corresponding to this maximal eigenvalue.
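In practice this solution can be computed directly as a generalized eigenvalue problem. The short sketch below, based on scipy.linalg.eigh, is an illustration under the assumption that Σ_W is positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_direction(Sigma_B, Sigma_W):
    """Leading Fisher discriminant direction: the eigenvector of the
    generalized problem Sigma_B b = nu Sigma_W b with the largest nu."""
    eigvals, eigvecs = eigh(Sigma_B, Sigma_W)   # generalized symmetric problem
    beta = eigvecs[:, -1]                       # eigenvalues are returned in ascending order
    # eigh normalizes so that beta' Sigma_W beta = 1, matching (C.1b)
    return beta, eigvals[-1]
```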

                  W ΣB and β is any eigenvector correspondingto this maximal eigenvalue

                  111

                  D Alternative Variational Formulation forthe Group-Lasso

                  In this appendix an alternative to the variational form of the group-Lasso (421)presented in Section 431 is proposed

                  minτisinRp

                  minBisinRptimesKminus1

                  J(B) + λ

                  psumj=1

                  w2j

                  ∥∥βj∥∥2

                  2

                  τj(D1a)

                  s tsump

                  j=1 τj = 1 (D1b)

                  τj ge 0 j = 1 p (D1c)

Following the approach detailed in Section 4.3.1, its equivalence with the standard group-Lasso formulation is demonstrated here. Let B ∈ R^{p×(K−1)} be a matrix composed of row vectors β^j ∈ R^{K−1}: B = (β^{1⊤}, ..., β^{p⊤})^⊤.

\[
L(B, \tau, \lambda, \nu_0, \nu_j) = J(B) + \lambda \sum_{j=1}^{p} \frac{w_j^2 \left\|\beta^j\right\|_2^2}{\tau_j} + \nu_0\left(\sum_{j=1}^{p} \tau_j - 1\right) - \sum_{j=1}^{p} \nu_j \tau_j. \tag{D.2}
\]

The starting point is the Lagrangian (D.2), which is differentiated with respect to τ_j to get the optimal value τ_j^\star:

\[
\left.\frac{\partial L(B, \tau, \lambda, \nu_0, \nu_j)}{\partial \tau_j}\right|_{\tau_j = \tau_j^\star} = 0
\;\Rightarrow\; -\lambda \frac{w_j^2 \left\|\beta^j\right\|_2^2}{\tau_j^{\star 2}} + \nu_0 - \nu_j = 0
\;\Rightarrow\; -\lambda w_j^2 \left\|\beta^j\right\|_2^2 + \nu_0 \tau_j^{\star 2} - \nu_j \tau_j^{\star 2} = 0
\;\Rightarrow\; -\lambda w_j^2 \left\|\beta^j\right\|_2^2 + \nu_0 \tau_j^{\star 2} = 0.
\]

The last two expressions are related through a property of the Lagrange multipliers, which states that ν_j g_j(τ^\star) = 0, where ν_j is the Lagrange multiplier and g_j(τ) is the inequality constraint. Then the optimal τ_j^\star can be deduced:

\[
\tau_j^\star = \sqrt{\frac{\lambda}{\nu_0}}\; w_j \left\|\beta^j\right\|_2.
\]

Plugging this optimal value of τ_j^\star into constraint (D.1b) gives

\[
\sum_{j=1}^{p} \tau_j^\star = 1 \;\Rightarrow\; \tau_j^\star = \frac{w_j \left\|\beta^j\right\|_2}{\sum_{j'=1}^{p} w_{j'} \left\|\beta^{j'}\right\|_2}. \tag{D.3}
\]


With this value of τ_j^\star, Problem (D.1) is equivalent to

\[
\min_{B \in \mathbb{R}^{p\times(K-1)}} \; J(B) + \lambda \left(\sum_{j=1}^{p} w_j \left\|\beta^j\right\|_2\right)^{\!2}. \tag{D.4}
\]

This problem is a slight alteration of the standard group-Lasso, as the penalty is squared compared to the usual form. This square only affects the strength of the penalty, and the usual properties of the group-Lasso apply to the solution of Problem (D.4). In particular, its solution is expected to be sparse, with some null vectors β^j.

The penalty term of (D.1a) can be conveniently presented as λB^⊤ΩB, where

\[
\Omega = \operatorname{diag}\left(\frac{w_1^2}{\tau_1}, \frac{w_2^2}{\tau_2}, \ldots, \frac{w_p^2}{\tau_p}\right). \tag{D.5}
\]

Using the value of τ_j^\star from (D.3), each diagonal component of Ω is

\[
(\Omega)_{jj} = \frac{w_j \sum_{j'=1}^{p} w_{j'} \left\|\beta^{j'}\right\|_2}{\left\|\beta^j\right\|_2}. \tag{D.6}
\]

In the following paragraphs, the optimality conditions and properties developed for the quadratic variational approach detailed in Section 4.3.1 are also derived for this alternative formulation.

D.1 Useful Properties

Lemma D.1. If J is convex, Problem (D.1) is convex.

In what follows, J will be a convex quadratic (hence smooth) function, in which case a necessary and sufficient optimality condition is that zero belongs to the subdifferential of the objective function, whose expression is provided in the following lemma.

Lemma D.2. For all B ∈ R^{p×(K−1)}, the subdifferential of the objective function of Problem (D.4) is

\[
\left\{ V \in \mathbb{R}^{p\times(K-1)} :\; V = \frac{\partial J(B)}{\partial B} + 2\lambda \left(\sum_{j=1}^{p} w_j \left\|\beta^j\right\|_2\right) G \right\}, \tag{D.7}
\]

where G = (g^{1⊤}, ..., g^{p⊤})^⊤ is a p × (K−1) matrix defined as follows. Let S(B) denote the row-wise support of B, S(B) = {j ∈ 1, ..., p : ‖β^j‖₂ ≠ 0}; then we have

\[
\forall j \in S(B), \quad g^j = w_j \left\|\beta^j\right\|_2^{-1} \beta^j, \tag{D.8}
\]

\[
\forall j \notin S(B), \quad \left\|g^j\right\|_2 \le w_j. \tag{D.9}
\]


This condition results in an equality for the "active" non-zero vectors β^j and an inequality for the other ones, which both provide essential building blocks of our algorithm.

Lemma D.3. Problem (D.4) admits at least one solution, which is unique if J(B) is strictly convex. All critical points B of the objective function verifying the following conditions are global minima. Let S(B) denote the row-wise support of B, S(B) = {j ∈ 1, ..., p : ‖β^j‖₂ ≠ 0}, and let S̄(B) be its complement; then we have

\[
\forall j \in S(B), \quad -\frac{\partial J(B)}{\partial \beta^j} = 2\lambda \left(\sum_{j'=1}^{p} w_{j'} \left\|\beta^{j'}\right\|_2\right) w_j \left\|\beta^j\right\|_2^{-1} \beta^j, \tag{D.10a}
\]

\[
\forall j \in \bar{S}(B), \quad \left\|\frac{\partial J(B)}{\partial \beta^j}\right\|_2 \le 2\lambda\, w_j \left(\sum_{j'=1}^{p} w_{j'} \left\|\beta^{j'}\right\|_2\right). \tag{D.10b}
\]

In particular, Lemma D.3 provides a well-defined characterization of the support of the solution, which is not easily obtained from the direct analysis of the variational problem (D.1).

D.2 An Upper Bound on the Objective Function

Lemma D.4. The objective function of the variational form (D.1) is an upper bound on the group-Lasso objective function (D.4), and, for a given $\mathbf{B}$, the gap in these objectives is null at $\boldsymbol{\tau}$ such that
\[
\tau_j = \frac{w_j \big\|\boldsymbol{\beta}^j\big\|_2}{\sum_{j'=1}^p w_{j'} \big\|\boldsymbol{\beta}^{j'}\big\|_2} \enspace .
\]

Proof. The objective functions of (D.1) and (D.4) only differ in their second term. Let $\boldsymbol{\tau} \in \mathbb{R}^p$ be any feasible vector; we have
\[
\begin{aligned}
\Bigg( \sum_{j=1}^p w_j \big\|\boldsymbol{\beta}^j\big\|_2 \Bigg)^{\!2}
&= \Bigg( \sum_{j=1}^p \tau_j^{1/2}\, \frac{w_j \big\|\boldsymbol{\beta}^j\big\|_2}{\tau_j^{1/2}} \Bigg)^{\!2} \\
&\leq \Bigg( \sum_{j=1}^p \tau_j \Bigg) \Bigg( \sum_{j=1}^p \frac{w_j^2 \big\|\boldsymbol{\beta}^j\big\|_2^2}{\tau_j} \Bigg) \\
&\leq \sum_{j=1}^p \frac{w_j^2 \big\|\boldsymbol{\beta}^j\big\|_2^2}{\tau_j} \enspace ,
\end{aligned}
\]
where we used the Cauchy-Schwarz inequality in the second line and the definition of the feasibility set of $\boldsymbol{\tau}$ in the last one.


This lemma only holds for the alternative variational formulation described in this appendix. It is difficult to obtain the same result for the first variational form (Section 4.3.1), because the definitions of the feasible sets of $\boldsymbol{\tau}$ and $\boldsymbol{\beta}$ are intertwined.
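As a quick numerical illustration, not part of the original derivation, the bound of Lemma D.4 can be checked directly with NumPy on random values standing in for the coefficients, the weights and $\lambda$: any feasible $\boldsymbol{\tau}$ gives a variational penalty at least as large as the squared group-Lasso penalty, and the $\boldsymbol{\tau}$ of the lemma closes the gap.

    import numpy as np

    rng = np.random.default_rng(3)
    p, K, lam = 10, 5, 0.7
    B = rng.standard_normal((p, K - 1))
    w = rng.uniform(0.5, 2.0, size=p)
    row_norms = np.linalg.norm(B, axis=1)

    glasso_pen = lam * np.sum(w * row_norms) ** 2       # penalty of (D.4)

    def variational_pen(tau):
        # second term of (D.1a): lambda * sum_j w_j^2 ||beta^j||_2^2 / tau_j
        return lam * np.sum(w ** 2 * row_norms ** 2 / tau)

    tau_rand = rng.dirichlet(np.ones(p))                # arbitrary feasible tau
    assert variational_pen(tau_rand) >= glasso_pen - 1e-9

    tau_star = w * row_norms / np.sum(w * row_norms)    # tau of Lemma D.4
    assert np.isclose(variational_pen(tau_star), glasso_pen)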


E Invariance of the Group-Lasso to Unitary Transformations

The computational trick described in Section 5.2 for quadratic penalties can be applied to the group-Lasso provided that the following holds: if the regression coefficients $\mathbf{B}^0$ are optimal for the score values $\boldsymbol{\Theta}^0$, and if the optimal scores $\boldsymbol{\Theta}$ are obtained by a unitary transformation of $\boldsymbol{\Theta}^0$, say $\boldsymbol{\Theta} = \boldsymbol{\Theta}^0 \mathbf{V}$ (where $\mathbf{V} \in \mathbb{R}^{M\times M}$ is a unitary matrix), then $\mathbf{B} = \mathbf{B}^0\mathbf{V}$ is optimal conditionally on $\boldsymbol{\Theta}$; that is, $(\boldsymbol{\Theta}, \mathbf{B})$ is a global solution of the corresponding optimal scoring problem. To show this, we use the standard group-Lasso formulation and prove the following proposition.

Proposition E.1. Let $\hat{\mathbf{B}}$ be a solution of
\[
\min_{\mathbf{B} \in \mathbb{R}^{p\times M}} \; \big\|\mathbf{Y} - \mathbf{X}\mathbf{B}\big\|_F^2 + \lambda \sum_{j=1}^p w_j \big\|\boldsymbol{\beta}^j\big\|_2 \enspace , \tag{E.1}
\]
and let $\tilde{\mathbf{Y}} = \mathbf{Y}\mathbf{V}$, where $\mathbf{V} \in \mathbb{R}^{M\times M}$ is a unitary matrix. Then $\tilde{\mathbf{B}} = \hat{\mathbf{B}}\mathbf{V}$ is a solution of
\[
\min_{\mathbf{B} \in \mathbb{R}^{p\times M}} \; \big\|\tilde{\mathbf{Y}} - \mathbf{X}\mathbf{B}\big\|_F^2 + \lambda \sum_{j=1}^p w_j \big\|\boldsymbol{\beta}^j\big\|_2 \enspace . \tag{E.2}
\]

Proof. The first-order necessary optimality conditions for $\hat{\mathbf{B}}$ are
\[
\forall j \in \mathcal{S}(\hat{\mathbf{B}}) \, , \quad 2\,\mathbf{x}_j^\top \big( \mathbf{X}\hat{\mathbf{B}} - \mathbf{Y} \big) + \lambda w_j \big\|\hat{\boldsymbol{\beta}}^j\big\|_2^{-1} \hat{\boldsymbol{\beta}}^j = \mathbf{0} \enspace , \tag{E.3a}
\]
\[
\forall j \notin \mathcal{S}(\hat{\mathbf{B}}) \, , \quad 2\, \Big\| \mathbf{x}_j^\top \big( \mathbf{X}\hat{\mathbf{B}} - \mathbf{Y} \big) \Big\|_2 \leq \lambda w_j \enspace , \tag{E.3b}
\]
where $\mathcal{S}(\hat{\mathbf{B}}) \subseteq \{1,\ldots,p\}$ denotes the set of non-zero row vectors of $\hat{\mathbf{B}}$ and $\bar{\mathcal{S}}(\hat{\mathbf{B}})$ is its complement.

First, we note that, from the definition of $\tilde{\mathbf{B}}$, we have $\mathcal{S}(\tilde{\mathbf{B}}) = \mathcal{S}(\hat{\mathbf{B}})$. Then, we may rewrite the above conditions as follows:
\[
\forall j \in \mathcal{S}(\tilde{\mathbf{B}}) \, , \quad 2\,\mathbf{x}_j^\top \big( \mathbf{X}\tilde{\mathbf{B}} - \tilde{\mathbf{Y}} \big) + \lambda w_j \big\|\tilde{\boldsymbol{\beta}}^j\big\|_2^{-1} \tilde{\boldsymbol{\beta}}^j = \mathbf{0} \enspace , \tag{E.4a}
\]
\[
\forall j \notin \mathcal{S}(\tilde{\mathbf{B}}) \, , \quad 2\, \Big\| \mathbf{x}_j^\top \big( \mathbf{X}\tilde{\mathbf{B}} - \tilde{\mathbf{Y}} \big) \Big\|_2 \leq \lambda w_j \enspace , \tag{E.4b}
\]
where (E.4a) is obtained by multiplying both sides of Equation (E.3a) by $\mathbf{V}$, and also uses that $\mathbf{V}\mathbf{V}^\top = \mathbf{I}$, so that, $\forall \mathbf{u} \in \mathbb{R}^M$, $\|\mathbf{u}^\top\|_2 = \|\mathbf{u}^\top\mathbf{V}\|_2$; Equation (E.4b) is also obtained from the latter relationship. Conditions (E.4) are then recognized as the first-order necessary conditions for $\tilde{\mathbf{B}}$ to be a solution of Problem (E.2). As the latter is convex, these conditions are sufficient, which concludes the proof.
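The mechanism behind the proposition can also be observed numerically without solving (E.1): a unitary $\mathbf{V}$ leaves both the data-fit term and the row-norm penalty unchanged when $\mathbf{Y}$ and $\mathbf{B}$ are rotated jointly. The sketch below, not from the thesis, uses NumPy with randomly generated stand-ins for $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{B}$ and the weights.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, M, lam = 20, 6, 3, 0.5
    X = rng.standard_normal((n, p))
    Y = rng.standard_normal((n, M))
    B = rng.standard_normal((p, M))
    w = np.ones(p)

    # random unitary (orthogonal) matrix via a QR decomposition
    V, _ = np.linalg.qr(rng.standard_normal((M, M)))

    def objective(B, Y):
        fit = np.linalg.norm(Y - X @ B, "fro") ** 2
        pen = lam * np.sum(w * np.linalg.norm(B, axis=1))
        return fit + pen

    # joint rotation of Y and B leaves the objective unchanged, so a
    # minimizer for Y rotated by V is a minimizer for Y V
    assert np.isclose(objective(B, Y), objective(B @ V, Y @ V))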


F Expected Complete Likelihood and Likelihood

Section 7.1.2 explains that, by maximizing the conditional expectation of the complete log-likelihood $Q(\theta,\theta')$ (7.7) by means of the EM algorithm, the log-likelihood (7.1) is also maximized. The value of the log-likelihood can be computed from its definition (7.1), but there is a shorter way to compute it from $Q(\theta,\theta')$ when the latter is available:

\[
L(\theta) = \sum_{i=1}^n \log \Bigg( \sum_{k=1}^K \pi_k f_k(\mathbf{x}_i;\theta_k) \Bigg) \enspace , \tag{F.1}
\]
\[
Q(\theta,\theta') = \sum_{i=1}^n \sum_{k=1}^K t_{ik}(\theta') \log \big( \pi_k f_k(\mathbf{x}_i;\theta_k) \big) \enspace , \tag{F.2}
\]
with
\[
t_{ik}(\theta') = \frac{\pi'_k f_k(\mathbf{x}_i;\theta'_k)}{\sum_{\ell} \pi'_\ell f_\ell(\mathbf{x}_i;\theta'_\ell)} \enspace . \tag{F.3}
\]

In the EM algorithm, $\theta'$ denotes the model parameters at the previous iteration, $t_{ik}(\theta')$ are the posterior probabilities computed from $\theta'$ at the previous E-step, and $\theta$, without "prime", denotes the parameters of the current iteration, to be obtained by maximizing $Q(\theta,\theta')$.

Using (F.3), we have
\[
\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta') \log \big( \pi_k f_k(\mathbf{x}_i;\theta_k) \big) \\
&= \sum_{i,k} t_{ik}(\theta') \log \big( t_{ik}(\theta) \big) + \sum_{i,k} t_{ik}(\theta') \log \Bigg( \sum_{\ell} \pi_\ell f_\ell(\mathbf{x}_i;\theta_\ell) \Bigg) \\
&= \sum_{i,k} t_{ik}(\theta') \log \big( t_{ik}(\theta) \big) + L(\theta) \enspace .
\end{aligned}
\]

In particular, after the evaluation of $t_{ik}$ in the E-step, where $\theta = \theta'$, the log-likelihood can be computed using the value of $Q(\theta,\theta)$ (7.7) and the entropy of the posterior probabilities:
\[
\begin{aligned}
L(\theta) &= Q(\theta,\theta) - \sum_{i,k} t_{ik}(\theta) \log \big( t_{ik}(\theta) \big) \\
&= Q(\theta,\theta) + H(\mathbf{T}) \enspace .
\end{aligned}
\]
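As an illustration, not from the thesis, this identity can be checked numerically on a toy Gaussian mixture with a common covariance matrix; the sketch below uses NumPy and SciPy with arbitrary values for the proportions, the means and the shared covariance.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(2)
    n, p, K = 50, 2, 3
    X = rng.standard_normal((n, p))

    pi = np.array([0.2, 0.5, 0.3])        # mixture proportions
    mu = rng.standard_normal((K, p))      # component means
    Sigma = np.eye(p)                     # common covariance matrix

    # joint densities pi_k * f_k(x_i; theta_k), one column per component
    dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma)
                            for k in range(K)])
    T = dens / dens.sum(axis=1, keepdims=True)    # posteriors t_ik (E-step)

    loglik = np.sum(np.log(dens.sum(axis=1)))     # L(theta), definition (F.1)
    Q = np.sum(T * np.log(dens))                  # Q(theta, theta), as in (F.2)
    H = -np.sum(T * np.log(T))                    # entropy of the posteriors
    assert np.isclose(loglik, Q + H)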


                  G Derivation of the M-Step Equations

This appendix details the derivation of expressions (7.10), (7.11) and (7.12) in the context of a Gaussian mixture model with a common covariance matrix. The criterion is defined as

\[
\begin{aligned}
Q(\theta,\theta') &= \sum_{i,k} t_{ik}(\theta') \log \big( \pi_k f_k(\mathbf{x}_i;\theta_k) \big) \\
&= \sum_{k} \Big( \sum_{i} t_{ik} \Big) \log \pi_k - \frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{i,k} t_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k) \enspace ,
\end{aligned}
\]
which has to be maximized subject to
\[
\sum_k \pi_k = 1 \enspace .
\]

The Lagrangian of this problem is
\[
\mathcal{L}(\theta) = Q(\theta,\theta') + \lambda \Big( \sum_k \pi_k - 1 \Big) \enspace .
\]

The partial derivatives of the Lagrangian are set to zero to obtain the optimal values of $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$.

G.1 Prior probabilities

\[
\frac{\partial \mathcal{L}(\theta)}{\partial \pi_k} = 0 \;\Leftrightarrow\; \frac{1}{\pi_k} \sum_i t_{ik} + \lambda = 0 \enspace ,
\]
where $\lambda$ is identified from the constraint, leading to
\[
\pi_k = \frac{1}{n} \sum_i t_{ik} \enspace .
\]
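Spelling out this identification, a step left implicit above: the stationarity condition gives $\sum_i t_{ik} = -\lambda \pi_k$, and summing over $k$, using $\sum_k \pi_k = 1$ and $\sum_k t_{ik} = 1$ for every $i$, yields
\[
\sum_{k=1}^K \sum_{i=1}^n t_{ik} = n = -\lambda \sum_{k=1}^K \pi_k = -\lambda
\;\Rightarrow\; \lambda = -n \enspace ,
\]
hence $\pi_k = -\frac{1}{\lambda} \sum_i t_{ik} = \frac{1}{n} \sum_i t_{ik}$.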


G.2 Means

\[
\frac{\partial \mathcal{L}(\theta)}{\partial \boldsymbol{\mu}_k} = 0 \;\Leftrightarrow\; -\frac{1}{2} \sum_i t_{ik}\, 2\boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_k - \mathbf{x}_i) = 0
\;\Rightarrow\; \boldsymbol{\mu}_k = \frac{\sum_i t_{ik}\, \mathbf{x}_i}{\sum_i t_{ik}} \enspace .
\]

G.3 Covariance Matrix

\[
\frac{\partial \mathcal{L}(\theta)}{\partial \boldsymbol{\Sigma}^{-1}} = 0 \;\Leftrightarrow\;
\underbrace{\frac{n}{2}\boldsymbol{\Sigma}}_{\text{as per property 4}}
\; - \;
\underbrace{\frac{1}{2} \sum_{i,k} t_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}_{\text{as per property 5}}
= 0
\;\Rightarrow\;
\boldsymbol{\Sigma} = \frac{1}{n} \sum_{i,k} t_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \enspace .
\]
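As a complement, not part of the original appendix, the three M-step updates derived above translate directly into a few lines of NumPy; the function below is a sketch that assumes a data matrix X of shape (n, p) and a responsibility matrix T of shape (n, K) produced by a preceding E-step.

    import numpy as np

    def m_step(X, T):
        """M-step of a Gaussian mixture with a common covariance matrix.

        X : (n, p) data matrix; T : (n, K) posterior probabilities t_ik.
        Returns the updated proportions, means and shared covariance.
        """
        n, p = X.shape
        nk = T.sum(axis=0)                      # sum_i t_ik, one value per component
        pi = nk / n                             # pi_k = (1/n) sum_i t_ik
        mu = (T.T @ X) / nk[:, None]            # mu_k = sum_i t_ik x_i / sum_i t_ik
        Sigma = np.zeros((p, p))
        for k in range(nk.size):
            R = X - mu[k]                       # data centered on component k
            Sigma += (T[:, k, None] * R).T @ R  # sum_i t_ik (x_i - mu_k)(x_i - mu_k)^T
        return pi, mu, Sigma / n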


