Statistics 202: Data Mining, Week 10
© Jonathan Taylor, December 5, 2012
Based in part on slides from the textbook and slides of Susan Holmes

Part I: Linear regression & LASSO

Linear Regression

We've talked mostly about classification, where the outcome is categorical. In regression, the outcome is continuous. Given Y ∈ R^n and X ∈ R^(n×p), the least squares regression problem is

    β̂ = argmin_{β ∈ R^p} (1/2) ||Y − Xβ||_2^2

If p ≤ n and X^T X is invertible, this has a unique solution:

    β̂ = (X^T X)^{−1} X^T Y

If p > n, or X^T X is otherwise not invertible, the solution is not unique.
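To make this concrete, here is a minimal NumPy sketch (not from the slides; the simulated data and variable names are illustrative). It computes β̂ by solving the normal equations, and also calls np.linalg.lstsq, which returns a minimum-norm least squares solution even when X^T X is not invertible:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5
    X = rng.standard_normal((n, p))
    beta_true = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
    Y = X @ beta_true + 0.1 * rng.standard_normal(n)

    # Unique solution when X^T X is invertible; solve() avoids forming an explicit inverse
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

    # lstsq also covers p > n / rank-deficient X (minimum-norm least squares)
    beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)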
A lot of interesting work in high-dimensional statistics / machine learning over the last few years involves studying problems of the form

    β̂_λ = argmin_{β ∈ R^p} L(β) + λ P(β)

where

• L is a loss function like the support vector loss, logistic loss, squared error loss, etc.
• P is a convex penalty that imparts “structure” on the solutions.
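As a concrete instance, taking L(β) = (1/2) ||y − Xβ||_2^2 and P(β) = ||β||_1 gives the LASSO, which can be solved by coordinate descent with soft-thresholding. The following is a rough sketch, not the course's reference implementation; in practice one would use a tested routine such as sklearn.linear_model.Lasso (whose objective scales the loss by 1/(2n), so its alpha corresponds to λ/n here):

    import numpy as np

    def soft_threshold(z, t):
        # proximal operator of t * |.| : shrink z toward zero by t
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_cd(X, y, lam, n_iter=200):
        # coordinate descent for (1/2) ||y - X b||^2 + lam * ||b||_1
        n, p = X.shape
        beta = np.zeros(p)
        col_sq = (X ** 2).sum(axis=0)       # x_j^T x_j for each column (assumed nonzero)
        r = y - X @ beta                    # full residual
        for _ in range(n_iter):
            for j in range(p):
                r = r + X[:, j] * beta[j]   # partial residual: add back x_j * b_j
                beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
                r = r - X[:, j] * beta[j]
        return beta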
Lots of interesting questions remain . . .
Try STATS315 for a more detailed introduction to the LASSO . . .
Part II: Clustering

What is cluster analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Partitional: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical: a set of nested clusters organized as a hierarchical tree. Each data object is in exactly one subset for any horizontal cut of the tree . . .
FIGURE 14.4. Simulated data in the plane, clustered into three classes (represented by orange, blue and green) by the K-means clustering algorithm.
Hierarchical arrangements successively group the clusters themselves, so that at each level of the hierarchy, clusters within the same group are more similar to each other than those in different groups.
Cluster analysis is also used to form descriptive statistics to ascertain whether or not the data consists of a set of distinct subgroups, each group representing objects with substantially different properties. This latter goal requires an assessment of the degree of difference between the objects assigned to the respective clusters.
Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it. This can only come from subject matter considerations. The situation is somewhat similar to the specification of a loss or cost function in prediction problems (supervised learning). There the cost associated with an inaccurate prediction depends on considerations outside the data.
Figure 14.4 shows some simulated data clustered into three groups via the popular K-means algorithm. In this case two of the clusters are not well separated, so that “segmentation” more accurately describes the part of this process than “clustering.” K-means clustering starts with guesses for the three cluster centers. Then it alternates the following steps until convergence (a code sketch follows the list):
• for each data point, the closest cluster center (in Euclidean distance) is identified;
• each cluster center is replaced by the coordinate-wise average of all data points that are closest to it.
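Here is a minimal sketch of that two-step (Lloyd) iteration, assuming NumPy; the names are mine, and for simplicity it assumes no cluster ever becomes empty:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # plain Lloyd iteration: assign to the nearest center, then re-average
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # step 1: nearest center (Euclidean) for every point
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # step 2: each center becomes the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels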
FIGURE 14.11. (Left panel): observed (green) and expected (blue) values of log W_K for the simulated data of Figure 14.4. Both curves have been translated to equal zero at one cluster. (Right panel): Gap curve, equal to the difference between the observed and expected values of log W_K. The Gap estimate K∗ is the smallest K producing a gap within one standard deviation of the gap at K + 1; here K∗ = 2.
This gives K∗ = 2, which looks reasonable from Figure 14.4.
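For reference, a rough sketch of how such a gap curve can be computed, taking W_K to be the pooled within-cluster sum of squares (the Euclidean case) and comparing log W_K with its average over B datasets drawn uniformly over the data's bounding box. This is an illustrative reconstruction, not code from the slides; it leans on scipy.cluster.vq.kmeans2 for the clustering step, and one then picks the smallest K with Gap(K) ≥ Gap(K+1) − s_{K+1}, as in the caption above:

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def log_WK(X, labels, k):
        # log of pooled within-cluster sum of squares (assumes non-empty clusters)
        W = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                for j in range(k))
        return np.log(W)

    def gap_statistic(X, k, B=20, seed=0):
        rng = np.random.default_rng(seed)
        _, labels = kmeans2(X, k, minit='++')
        obs = log_WK(X, labels, k)
        lo, hi = X.min(axis=0), X.max(axis=0)
        ref = np.empty(B)
        for b in range(B):
            Xb = rng.uniform(lo, hi, size=X.shape)   # uniform reference data
            _, lb = kmeans2(Xb, k, minit='++')
            ref[b] = log_WK(Xb, lb, k)
        # return Gap(K) and the standard error s_K (with the 1 + 1/B correction)
        return ref.mean() - obs, ref.std() * np.sqrt(1 + 1 / B)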
14.3.12 Hierarchical Clustering
The results of applying K-means or K-medoids clustering algorithms depend on the choice for the number of clusters to be searched and a starting configuration assignment. In contrast, hierarchical clustering methods do not require such specifications. Instead, they require the user to specify a measure of dissimilarity between (disjoint) groups of observations, based on the pairwise dissimilarities among the observations in the two groups. As the name suggests, they produce hierarchical representations in which the clusters at each level of the hierarchy are created by merging clusters at the next lower level. At the lowest level, each cluster contains a single observation. At the highest level there is only one cluster containing all of the data.
Strategies for hierarchical clustering divide into two basic paradigms: agglomerative (bottom-up) and divisive (top-down). Agglomerative strategies start at the bottom and at each level recursively merge a selected pair of clusters into a single cluster. This produces a grouping at the next higher level with one less cluster. The pair chosen for merging consists of the two groups with the smallest intergroup dissimilarity. Divisive methods start at the top and at each level recursively split one of the existing clusters at that level into two new clusters.
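As a concrete illustration of the agglomerative paradigm, here is a short SciPy sketch (the two-blob data and parameters are invented); linkage builds the merge tree from pairwise dissimilarities and fcluster cuts it at a chosen level:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

    D = pdist(X)                          # pairwise dissimilarities (Euclidean)
    Z = linkage(D, method='average')      # agglomerative merge tree, average linkage
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
    # dendrogram(Z) draws the tree (requires matplotlib)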
Same as K-means, except that the centroid is estimated not by the average, but by the observation having minimum pairwise distance with the other cluster members.
Advantage: the centroid is one of the observations, which is useful, e.g., when features are 0 or 1. Also, one only needs pairwise distances for K-medoids rather than the raw observations (see the sketch below).
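Since only pairwise distances are needed, a K-medoids sketch can work from a distance matrix alone. The following is an illustrative reconstruction (names are mine, empty clusters are not handled), not the textbook's PAM algorithm:

    import numpy as np

    def medoid(D, members):
        # index (within `members`) minimizing total distance to the other members
        sub = D[np.ix_(members, members)]
        return members[sub.sum(axis=1).argmin()]

    def k_medoids(D, k, n_iter=50, seed=0):
        # K-medoids from a pairwise distance matrix D alone (no raw features)
        rng = np.random.default_rng(seed)
        med = rng.choice(len(D), size=k, replace=False)
        for _ in range(n_iter):
            labels = D[:, med].argmin(axis=1)   # assign each point to nearest medoid
            new = np.array([medoid(D, np.where(labels == j)[0]) for j in range(k)])
            if set(new) == set(med):
                break
            med = new
        return med, labels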
FIGURE 14.12. Dendrogram from agglomerative hierarchical clustering with average linkage to the human tumor microarray data.
Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.

The extent to which the hierarchical structure produced by a dendrogram actually represents the data itself can be judged by the cophenetic correlation coefficient. This is the correlation between the N(N − 1)/2 pairwise observation dissimilarities d_{ii′} input to the algorithm and their corresponding cophenetic dissimilarities C_{ii′} derived from the dendrogram. The cophenetic dissimilarity C_{ii′} between two observations (i, i′) is the intergroup dissimilarity at which observations i and i′ are first joined together in the same cluster.

The cophenetic dissimilarity is a very restrictive dissimilarity measure.
First, the C_{ii′} over the observations must contain many ties, since only N − 1 of the total N(N − 1)/2 values can be distinct. Also these dissimilarities obey the ultrametric inequality

    C_{ii′} ≤ max{ C_{ik}, C_{i′k} }

for any third observation k.
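Both quantities are available in SciPy: scipy.cluster.hierarchy.cophenet returns the cophenetic correlation and the C_{ii′} given a linkage tree and the input dissimilarities d_{ii′}. A small sketch on random data for illustration:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, cophenet
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    X = rng.standard_normal((30, 4))

    d = pdist(X)                      # input dissimilarities d_ii'
    Z = linkage(d, method='average')
    c, coph_d = cophenet(Z, d)        # c: cophenetic correlation; coph_d: the C_ii'
    print(round(c, 3))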