Statistics 202: Data Mining, Week 10
© Jonathan Taylor, December 5, 2012
Based in part on slides from the textbook and slides of Susan Holmes

Part I: Linear regression & LASSO

Linear Regression

We've talked mostly about classification, where the outcome is categorical. In regression, the outcome is continuous. Given Y ∈ R^n and X ∈ R^(n×p), the least squares regression problem is

    β̂ = argmin_{β ∈ R^p} (1/2) ||Y − Xβ||_2^2

If p ≤ n and X^T X is invertible, this has a unique solution:

    β̂ = (X^T X)^{−1} X^T Y

If p > n, or X^T X is otherwise not invertible, the solution is not unique.
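To make this concrete, here is a minimal NumPy sketch (not from the slides; the simulated data and variable names are illustrative). It computes β̂ by solving the normal equations, and also calls np.linalg.lstsq, which returns a minimum-norm least squares solution even when X^T X is not invertible:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5
    X = rng.standard_normal((n, p))
    beta_true = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
    Y = X @ beta_true + 0.1 * rng.standard_normal(n)

    # Unique solution when X^T X is invertible; solve() avoids forming an explicit inverse
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

    # lstsq also covers p > n / rank-deficient X (minimum-norm least squares)
    beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)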
A lot of interesting work in high-dimensional statistics / machine learning over the last few years involves studying problems of the form

    β̂_λ = argmin_{β ∈ R^p} L(β) + λ P(β)

where

• L is a loss function like the support vector loss, logistic loss, squared error loss, etc.
• P is a convex penalty that imparts “structure” on the solutions.
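As a concrete instance, taking L(β) = (1/2) ||y − Xβ||_2^2 and P(β) = ||β||_1 gives the LASSO, which can be solved by coordinate descent with soft-thresholding. The following is a rough sketch, not the course's reference implementation; in practice one would use a tested routine such as sklearn.linear_model.Lasso (whose objective scales the loss by 1/(2n), so its alpha corresponds to λ/n here):

    import numpy as np

    def soft_threshold(z, t):
        # proximal operator of t * |.| : shrink z toward zero by t
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_cd(X, y, lam, n_iter=200):
        # coordinate descent for (1/2) ||y - X b||^2 + lam * ||b||_1
        n, p = X.shape
        beta = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0)       # x_j^T x_j for each column (assumed nonzero)
        r = y - X @ beta                    # full residual
        for _ in range(n_iter):
            for j in range(p):
                r = r + X[:, j] * beta[j]   # partial residual: add back x_j * b_j
                beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
                r = r - X[:, j] * beta[j]
        return beta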
Lots of interesting questions remain . . .
Try STATS315 for a more detailed introduction to the LASSO . . .
Part II: Clustering

What is cluster analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Partitional: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical: a set of nested clusters organized as a hierarchical tree. Each data object is in exactly one subset for any horizontal cut of the tree . . .
FIGURE 14.4. Simulated data in the plane, clustered into three classes (represented by orange, blue and green) by the K-means clustering algorithm.
Hierarchical arrangements successively group the clusters themselves, so that at each level of the hierarchy, clusters within the same group are more similar to each other than those in different groups.
Cluster analysis is also used to form descriptive statistics to ascertain whether or not the data consists of a set of distinct subgroups, each group representing objects with substantially different properties. This latter goal requires an assessment of the degree of difference between the objects assigned to the respective clusters.
Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it. This can only come from subject matter considerations. The situation is somewhat similar to the specification of a loss or cost function in prediction problems (supervised learning). There the cost associated with an inaccurate prediction depends on considerations outside the data.
Figure 14.4 shows some simulated data clustered into three groups via the popular K-means algorithm. In this case two of the clusters are not well separated, so that “segmentation” more accurately describes the part of this process than “clustering.” K-means clustering starts with guesses for the three cluster centers. Then it alternates the following steps until convergence (a code sketch follows the list):
• for each data point, the closest cluster center (in Euclidean distance) is identified;
• each cluster center is replaced by the coordinate-wise average of all data points that are closest to it.
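Here is a minimal sketch of that two-step (Lloyd) iteration, assuming NumPy; the names are mine, and for simplicity it assumes no cluster ever becomes empty:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # plain Lloyd iteration: assign to the nearest center, then re-average
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # step 1: nearest center (Euclidean) for every point
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # step 2: each center becomes the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels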
FIGURE 14.11. (Left panel): observed (green) and expected (blue) values of log W_K for the simulated data of Figure 14.4. Both curves have been translated to equal zero at one cluster. (Right panel): Gap curve, equal to the difference between the observed and expected values of log W_K. The Gap estimate K∗ is the smallest K producing a gap within one standard deviation of the gap at K + 1; here K∗ = 2.
This gives K∗ = 2, which looks reasonable from Figure 14.4.
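For reference, a rough sketch of how such a gap curve can be computed, taking W_K to be the pooled within-cluster sum of squares (the Euclidean case) and comparing log W_K with its average over B datasets drawn uniformly over the data's bounding box. This is an illustrative reconstruction, not code from the slides; it leans on scipy.cluster.vq.kmeans2 for the clustering step, and one then picks the smallest K with Gap(K) ≥ Gap(K+1) − s_{K+1}, as in the caption above:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def log_WK(X, labels, k):
        # log of pooled within-cluster sum of squares (assumes non-empty clusters)
        W = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                for j in range(k))
        return np.log(W)

    def gap_statistic(X, k, B=20, seed=0):
        rng = np.random.default_rng(seed)
        _, labels = kmeans2(X, k, minit='++')
        obs = log_WK(X, labels, k)
        lo, hi = X.min(axis=0), X.max(axis=0)
        ref = np.empty(B)
        for b in range(B):
            Xb = rng.uniform(lo, hi, size=X.shape)   # uniform reference data
            _, lb = kmeans2(Xb, k, minit='++')
            ref[b] = log_WK(Xb, lb, k)
        # return Gap(K) and the standard error s_K (with the 1 + 1/B correction)
        return ref.mean() - obs, ref.std() * np.sqrt(1 + 1 / B)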
14.3.12 Hierarchical Clustering
The results of applying K-means or K-medoids clustering algorithms depend on the choice for the number of clusters to be searched and a starting configuration assignment. In contrast, hierarchical clustering methods do not require such specifications. Instead, they require the user to specify a measure of dissimilarity between (disjoint) groups of observations, based on the pairwise dissimilarities among the observations in the two groups. As the name suggests, they produce hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. At the lowest level, each cluster contains a single observation. At the highest level there is only one cluster containing all of the data.
Strategies for hierarchical clustering divide into two basic paradigms: agglomerative (bottom-up) and divisive (top-down). Agglomerative strategies start at the bottom and at each level recursively merge a selected pair of clusters into a single cluster. This produces a grouping at the next higher level with one less cluster. The pair chosen for merging consists of the two groups with the smallest intergroup dissimilarity. Divisive methods start at the top and at each level recursively split one of the existing clusters at that level into two new clusters.
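As a concrete illustration of the agglomerative paradigm, here is a short SciPy sketch (the two-blob data and parameters are invented); linkage builds the merge tree from pairwise dissimilarities and fcluster cuts it at a chosen level:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

    D = pdist(X)                          # pairwise dissimilarities (Euclidean)
    Z = linkage(D, method='average')      # agglomerative merge tree, average linkage
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
    # dendrogram(Z) draws the tree (requires matplotlib)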
Same as K-means, except that the centroid is estimated not by the average, but by the observation having minimum pairwise distance with the other cluster members.
Advantage: the centroid is one of the observations, which is useful, e.g., when features are 0 or 1. Also, one only needs pairwise distances for K-medoids rather than the raw observations (see the sketch below).
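Since only pairwise distances are needed, a K-medoids sketch can work from a distance matrix alone. The following is an illustrative reconstruction (names are mine, empty clusters are not handled), not the textbook's PAM algorithm:

    import numpy as np

    def medoid(D, members):
        # index (within `members`) minimizing total distance to the other members
        sub = D[np.ix_(members, members)]
        return members[sub.sum(axis=1).argmin()]

    def k_medoids(D, k, n_iter=50, seed=0):
        # K-medoids from a pairwise distance matrix D alone (no raw features)
        rng = np.random.default_rng(seed)
        med = rng.choice(len(D), size=k, replace=False)
        for _ in range(n_iter):
            labels = D[:, med].argmin(axis=1)   # assign each point to nearest medoid
            new = np.array([medoid(D, np.where(labels == j)[0]) for j in range(k)])
            if set(new) == set(med):
                break
            med = new
        return med, labels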
FIGURE 14.12. Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data.
Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.

The extent to which the hierarchical structure produced by a dendrogram actually represents the data itself can be judged by the cophenetic correlation coefficient. This is the correlation between the N(N − 1)/2 pairwise observation dissimilarities d_{ii′} input to the algorithm and their corresponding cophenetic dissimilarities C_{ii′} derived from the dendrogram. The cophenetic dissimilarity C_{ii′} between two observations (i, i′) is the intergroup dissimilarity at which observations i and i′ are first joined together in the same cluster.

The cophenetic dissimilarity is a very restrictive dissimilarity measure.
First, the C_{ii′} over the observations must contain many ties, since only N − 1 of the total N(N − 1)/2 values can be distinct. Also these dissimilarities obey the ultrametric inequality

    C_{ii′} ≤ max{ C_{ik}, C_{i′k} }

for any third observation k.
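Both quantities are available in SciPy: scipy.cluster.hierarchy.cophenet returns the cophenetic correlation and the C_{ii′} given a linkage tree and the input dissimilarities d_{ii′}. A small sketch on random data for illustration:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    X = rng.standard_normal((30, 4))

    d = pdist(X)                      # input dissimilarities d_ii'
    Z = linkage(d, method='average')
    c, coph_d = cophenet(Z, d)        # c: cophenetic correlation; coph_d: the C_ii'
    print(round(c, 3))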