SLIDES RECYCLED FROM ppt slides by Darlene Goldstein http://statwww.epfl.ch/davison/teaching/Microarrays/ Supervised Learning, Classification, Discrimination

Jan 01, 2016

Jonah Francis
Page 1:

SLIDES RECYCLED FROM

ppt slides by Darlene Goldstein

http://statwww.epfl.ch/davison/teaching/Microarrays/

Supervised Learning, Classification, Discrimination

Page 2:

Gene expression data

Data on G genes for n samples. Rows are genes and columns are mRNA samples; entry (i, j) is the gene expression level of gene i in mRNA sample j.

Gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...

Page 3:

Machine learning tasks

• Task: assign objects to classes (groups) on the basis of measurements made on the objects

• Unsupervised: classes unknown, want to discover them from the data (cluster analysis)

• Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Page 4:

Discrimination

• Objects (e.g. arrays) are to be classified as belonging to one of a number of predefined classes {1, 2, …, K}

• Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X_1, …, X_G)

• Aim: predict Y from X.

Page 5:

Example: Tumor Classification

• Reliable and precise classification essential for successful cancer treatment

• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables

• Uncertainties in diagnosis remain; likely that existing classes are heterogeneous

• Characterize molecular variations among tumors by monitoring gene expression (microarray)

• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)

Page 6:

Tumor Classification Using Gene Expression Data

Three main types of statistical problems associated with tumor classification:

• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering)

• Classification of malignancies into known classes (supervised learning – discrimination)

• Identification of “marker” genes that characterize the different tumor classes (feature or variable selection).

Page 7:

Classifiers

• A predictor or classifier partitions the space of gene expression profiles into K disjoint subsets, A_1, ..., A_K, such that for a sample with expression profile X = (X_1, ..., X_G) ∈ A_k the predicted class is k

• Classifiers are built from a learning set (LS) L = (X_1, Y_1), ..., (X_n, Y_n)

• Classifier C built from a learning set L: C(·, L): 𝒳 → {1, 2, ..., K}, where 𝒳 is the space of gene expression profiles

• Predicted class for observation X: C(X, L) = k if X is in A_k

Page 8:

Decision Theory (I)

• Can view classification as statistical decision theory: must decide which of the classes an object belongs to

• Use the observed feature vector X to aid in decision making

• Denote population proportion of objects of class k as π_k = p(Y = k)

• Assume objects in class k have feature vectors with density p_k(X) = p(X|Y = k)

Page 9:

Decision Theory (II)

• One criterion for assessing classifier quality is the misclassification rate, p(C(X) ≠ Y)

• A loss function L(i,j) quantifies the loss incurred by erroneously classifying a member of class i as class j

• The risk function R(C) for a classifier is the expected (average) loss:

R(C) = E[L(Y,C(X))]

Page 10:

Decision Theory (III)

• Typically L(i,i) = 0

• In many cases can assume symmetric loss with L(i,j) = 1 for i ≠ j (so that different types of errors are equivalent)

• In this case, the risk is simply the misclassification probability

• There are some important examples, such as in diagnosis, where the loss function is not symmetric
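The relation between loss, risk and misclassification rate can be checked on a small example. The sketch below is only an illustration: the labels and both loss matrices are made-up values, and `empirical_risk` is a hypothetical helper that averages the loss over a set of labeled cases.

```python
def empirical_risk(y_true, y_pred, loss):
    """Average loss L(true class, predicted class) over a set of labeled cases."""
    total = sum(loss[i][j] for i, j in zip(y_true, y_pred))
    return total / len(y_true)

# Toy labels (assumed data): class 0 = healthy, class 1 = diseased.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]

# Symmetric 0-1 loss: L(i,i) = 0 and L(i,j) = 1 for i != j.
zero_one = [[0, 1],
            [1, 0]]

# Asymmetric loss (illustrative values): missing a diseased case (1 -> 0)
# costs five times as much as a false alarm (0 -> 1).
asymmetric = [[0, 1],
              [5, 0]]

misclassification_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
risk_01 = empirical_risk(y_true, y_pred, zero_one)
risk_asym = empirical_risk(y_true, y_pred, asymmetric)

print(misclassification_rate, risk_01, risk_asym)
```

Under the 0-1 loss the empirical risk coincides with the misclassification rate; under the asymmetric loss the two missed diseased cases dominate the risk.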

Page 11:

Maximum likelihood discriminant rule

• A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest

• For known class conditional densities p_k(X), the maximum likelihood (ML) discriminant rule predicts the class of an observation X by

C(X) = argmax_k p_k(X)
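A minimal sketch of the ML discriminant rule for a single feature, assuming the class conditional densities are known univariate Gaussians. The three (mean, standard deviation) pairs are invented for illustration, and `ml_discriminant` is a hypothetical name.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Assumed known class conditional densities for K = 3 classes: (mean, sd).
class_params = {1: (-1.0, 1.0), 2: (0.5, 0.5), 3: (3.0, 1.5)}

def ml_discriminant(x, params=class_params):
    """Return the class k with the largest density at x (argmax over k)."""
    return max(params, key=lambda k: gaussian_pdf(x, *params[k]))

for x in (-2.0, 0.5, 4.0):
    print(x, ml_discriminant(x))
```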

Page 12:

Fisher Linear Discriminant Analysis

First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA):

1. finds linear combinations of the gene expression profiles X = (X_1, ..., X_G) with large ratios of between-groups to within-groups sums of squares (the discriminant variables);

2. predicts the class of an observation X by the class whose mean vector is closest to X in terms of the discriminant variables

Page 13:

Gaussian ML Discriminant Rules

• For multivariate Gaussian (normal) class densities X|Y = k ~ N(μ_k, Σ_k), the ML classifier is

C(X) = argmin_k {(X − μ_k) Σ_k^{-1} (X − μ_k)' + log|Σ_k|}

• In general, this is a quadratic rule (Quadratic discriminant analysis, or QDA)

• In practice, population mean vectors μ_k and covariance matrices Σ_k are estimated by corresponding sample quantities

Page 14:

Gaussian ML Discriminant Rules

• When all class densities have the same covariance matrix, Σ_k = Σ, the discriminant rule is linear (Linear discriminant analysis, or LDA; FLDA for K = 2):

C(X) = argmin_k (X − μ_k) Σ^{-1} (X − μ_k)'

• When all class densities have the same diagonal covariance matrix Σ = diag(σ_1^2, …, σ_G^2), the discriminant rule is again linear (Diagonal linear discriminant analysis, or DLDA)
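With a diagonal covariance matrix the LDA rule reduces to a sum over genes of squared differences from the class mean, each divided by that gene's variance. The sketch below illustrates DLDA under that reading; the pooled within-class variance estimate, the toy data and the names `fit_dlda` and `predict_dlda` are assumptions, not taken from the slides.

```python
from collections import defaultdict

def fit_dlda(X, y):
    """Estimate class mean vectors and pooled per-feature variances."""
    n, G = len(X), len(X[0])
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    means = {k: [sum(r[j] for r in rows) / len(rows) for j in range(G)]
             for k, rows in groups.items()}
    # Pooled within-class variance of each feature (the diagonal of Sigma).
    variances = []
    for j in range(G):
        ss = sum((row[j] - means[label][j]) ** 2 for row, label in zip(X, y))
        variances.append(ss / (n - len(groups)))
    return means, variances

def predict_dlda(x, means, variances):
    """argmin over k of sum_j (x_j - mean_kj)^2 / variance_j."""
    def distance(k):
        return sum((xj - mj) ** 2 / vj
                   for xj, mj, vj in zip(x, means[k], variances))
    return min(means, key=distance)

# Toy learning set (made-up values): 6 samples, 2 genes, 2 classes.
X_learn = [[0.9, 0.1], [1.1, -0.2], [1.0, 0.3],
           [-1.0, 0.0], [-0.8, 0.2], [-1.2, -0.1]]
y_learn = [1, 1, 1, 2, 2, 2]

means, variances = fit_dlda(X_learn, y_learn)
print(predict_dlda([0.7, 0.0], means, variances))
print(predict_dlda([-0.5, 0.5], means, variances))
```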

Page 15:

Nearest Neighbor Classification

• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation)

• k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows:
  – find the k observations in the learning set closest to X
  – predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.

• The number of neighbors k can be chosen by cross-validation (more on this later)
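The two steps of the k-nearest neighbor rule can be sketched as follows, assuming Euclidean distance and a small made-up learning set. `knn_predict` is a hypothetical helper; ties in the vote are broken in favor of the class met first among the nearest neighbors, which is a choice made here and not something the slides specify.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(x, X_learn, y_learn, k=3, distance=euclidean):
    """Classify x by majority vote among its k nearest learning set cases."""
    # Step 1: find the k observations in the learning set closest to x.
    ranked = sorted(zip(X_learn, y_learn), key=lambda pair: distance(x, pair[0]))
    # Step 2: majority vote among their class labels.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy learning set (assumed values): two features, two classes.
X_learn = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_learn = ["A", "A", "A", "B", "B", "B"]

print(knn_predict([0.5, 0.5], X_learn, y_learn, k=3))
print(knn_predict([5.5, 5.2], X_learn, y_learn, k=3))
```

A different distance, such as one minus correlation, can be passed through the `distance` argument.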

Page 16:

How to construct a tree predictor

BINARY RECURSIVE PARTITIONING

• Binary: split parent node into two child nodes

• Recursive: each child node can be treated as parent node

• Partitioning: data set is partitioned into mutually exclusive subsets in each split

Page 17:

Tree construction

[Figure: example binary classification tree for high-risk vs. low-risk cases. The internal nodes ask "Is BP <= 91?", "Is age <= 62.5?" and "Is ST present?", with Yes/No branches; each node shows the percentages of High and Low cases, and each terminal node is classified as high risk or low risk.]

Page 18:

Classification Trees

• Partition the feature space into a set of rectangles, then fit a simple model in each one

• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself)

• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier

• rpart function in R (rpart package)

Page 19:

Classification Tree

Page 20:

Three Aspects of Tree Construction

• Split Selection Rule

• Split-stopping Rule

• Class assignment Rule

Different approaches to these three issues (e.g. CART: Classification And Regression Trees, Breiman et al. (1984); C4.5 and C5.0, Quinlan (1993)).

Page 21:

Three Rules (CART)

• Splitting: At each node, choose split maximizing decrease in impurity (e.g. Gini index, entropy, misclassification error)

• Split-stopping: Grow large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with lowest misclassification rate

• Class assignment: For each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node
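The splitting rule can be illustrated for one numeric variable with the Gini index as impurity measure. This is a sketch only: the blood pressure values and risk labels are invented, candidate thresholds are taken as midpoints between consecutive distinct values (a common convention, assumed here), and `gini` and `best_split` are hypothetical helpers.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Threshold t for the split 'value <= t' with the largest impurity decrease."""
    n = len(labels)
    parent = gini(labels)
    best_t, best_gain = None, 0.0
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        # Impurity of the children, weighted by the share of cases in each.
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - children
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy node (assumed data): blood pressure and risk label for 8 cases.
bp = [85, 88, 90, 95, 100, 110, 120, 130]
risk = ["high", "high", "high", "low", "low", "low", "low", "high"]

threshold, gain = best_split(bp, risk)
print(threshold, round(gain, 3))
```

Growing a tree repeats this search over all variables at every node; pruning and cross-validation of the subtrees are not shown.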

Page 26:

Other Classifiers Include…

• Support vector machines (SVMs)

• Neural networks

• Random forest predictors

• HUNDREDS more…

Page 27:

Feature selection and missing data

• Feature selection
  – Automatic with trees
  – For discriminant analysis (DA) and nearest neighbors (NN), a preliminary selection is needed
  – Need to account for selection when assessing performance

• Missing data
  – Automatic imputation with trees
  – Otherwise, impute (or ignore)

Page 28:

Performance Assessment

• Error rate

• Test set error

• Learning set error (aka resubstitution error)

• Cross-validation

Page 29:

Performance assessment (I)

• Resubstitution estimation: error rate on the learning set
  – Problem: downward bias

• Test set estimation: divide cases in learning set into two sets, L1 and L2; classifier built using L1, error rate computed for L2. L1 and L2 must be independent and identically distributed (iid).
  – Problem: reduced effective sample size

Page 30:

Performance assessment (II)

• V-fold cross-validation (CV) estimation: Cases in learning set randomly divided into V subsets of (nearly) equal size. Build classifiers leaving one set out; test set error rates computed on left out set and averaged.
  – Bias-variance tradeoff: smaller V can give larger bias but smaller variance

• Out-of-bag estimation: only used when dealing with “bagged” predictors
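V-fold cross-validation as described above can be sketched in a few lines. The nearest-mean base classifier, the toy data and all function names are assumptions chosen to keep the example self-contained; any classifier with a fit step and a predict step could be plugged in.

```python
import random

def v_fold_indices(n, V, seed=0):
    """Randomly divide case indices 0..n-1 into V subsets of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[v::V] for v in range(V)]

def cv_error(X, y, fit, predict, V=5, seed=0):
    """Average of the error rates computed on each left-out subset."""
    rates = []
    for fold in v_fold_indices(len(X), V, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in fold)
        rates.append(wrong / len(fold))
    return sum(rates) / len(rates)

# Hypothetical base classifier: assign each case to the class with the nearest mean.
def fit_nearest_mean(X, y):
    means = {}
    for k in set(y):
        rows = [row for row, label in zip(X, y) if label == k]
        means[k] = [sum(col) / len(rows) for col in zip(*rows)]
    return means

def predict_nearest_mean(means, x):
    return min(means, key=lambda k: sum((a - b) ** 2 for a, b in zip(x, means[k])))

# Toy learning set (made-up values): one feature, two well separated classes.
X = [[0.1 * i] for i in range(10)] + [[5.0 + 0.1 * i] for i in range(10)]
y = [0] * 10 + [1] * 10

folds = v_fold_indices(len(X), V=5)
error = cv_error(X, y, fit_nearest_mean, predict_nearest_mean, V=5)
print(error)
```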

Page 31:

Performance assessment (III)

• Common error to do feature selection using all of the data, then CV only for model building and classification

• However, usually features are unknown and the intended inference includes feature selection. Then, CV estimates as above tend to be downward biased.

• Features should be selected using only the part of the learning set that is used to build the model in each CV fold (and not the entire learning set)
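One way to keep feature selection inside the cross-validation loop is sketched below. The BSS/WSS ranking, the nearest-mean classifier, the simulated data and every function name are assumptions for illustration; the point is only that `select_features` is called on the training part of each fold and never on the left-out cases.

```python
import random

def bss_wss(X, y, j):
    """Ratio of between- to within-class sums of squares for feature j."""
    col = [row[j] for row in X]
    overall = sum(col) / len(col)
    bss = wss = 0.0
    for k in set(y):
        vals = [v for v, label in zip(col, y) if label == k]
        mean_k = sum(vals) / len(vals)
        bss += len(vals) * (mean_k - overall) ** 2
        wss += sum((v - mean_k) ** 2 for v in vals)
    return bss / wss if wss > 0 else float("inf")

def select_features(X, y, m):
    """Indices of the m features with the largest BSS/WSS ratio."""
    G = len(X[0])
    return sorted(range(G), key=lambda j: bss_wss(X, y, j), reverse=True)[:m]

def fit_nearest_mean(X, y):
    means = {}
    for k in set(y):
        rows = [r for r, label in zip(X, y) if label == k]
        means[k] = [sum(c) / len(rows) for c in zip(*rows)]
    return means

def predict_nearest_mean(means, x):
    return min(means, key=lambda k: sum((a - b) ** 2 for a, b in zip(x, means[k])))

def cv_error_with_selection(X, y, m, V=5, seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    rates = []
    for fold in [idx[v::V] for v in range(V)]:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # Feature selection sees the training part of this fold only.
        keep = select_features(X_tr, y_tr, m)
        model = fit_nearest_mean([[r[j] for j in keep] for r in X_tr], y_tr)
        wrong = sum(predict_nearest_mean(model, [X[i][j] for j in keep]) != y[i]
                    for i in fold)
        rates.append(wrong / len(fold))
    return sum(rates) / len(rates)

# Simulated data (assumed): gene 0 separates the classes, genes 1-9 are noise.
rng = random.Random(1)
y = [i % 2 for i in range(20)]
X = [[10.0 * label + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(9)]
     for label in y]

print(select_features(X, y, 1), cv_error_with_selection(X, y, m=1))
```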

Page 32:

Aggregating classifiers

• Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set; the multiple versions of the predictor are aggregated by voting.

• Let C(·, L_b) denote the classifier built from the bth perturbed learning set L_b, and let w_b denote the weight given to predictions made by this classifier. The predicted class for an observation x is given by

argmax_k ∑_b w_b I(C(x, L_b) = k)
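The weighted vote can be written directly from the formula. In this sketch the list of per-classifier predictions and the weights are made-up values, and `aggregate_votes` is a hypothetical helper.

```python
from collections import defaultdict

def aggregate_votes(predictions, weights=None):
    """Return the class k maximizing the summed weights of classifiers predicting k."""
    if weights is None:
        weights = [1.0] * len(predictions)   # unweighted (plurality) voting
    totals = defaultdict(float)
    for k, w in zip(predictions, weights):
        totals[k] += w
    return max(totals, key=totals.get)

# Predictions C(x, L_b) of B = 5 classifiers for one observation x (assumed values).
predictions = ["AML", "ALL", "ALL", "AML", "AML"]

print(aggregate_votes(predictions))                             # equal weights
print(aggregate_votes(predictions, [0.1, 0.4, 0.4, 0.1, 0.1]))  # unequal weights
```

With equal weights the majority class wins; with unequal weights two heavily weighted classifiers can outvote three lightly weighted ones.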

Page 33:

Bagging

• Bagging = Bootstrap aggregating

• Nonparametric Bootstrap (standard bagging): perturbed learning sets drawn at random with replacement from the learning set; predictors built for each perturbed dataset and aggregated by plurality voting (w_b = 1)

• Parametric Bootstrap: perturbed learning sets are generated from multivariate Gaussian distributions

• Convex pseudo-data (Breiman 1996)
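Standard (nonparametric bootstrap) bagging can be sketched as below. A 1-nearest-neighbor classifier stands in as the base predictor purely to keep the code short; that choice, the toy data and the function names are assumptions, and in practice bagging is mostly applied to unstable predictors such as trees.

```python
import random
from collections import Counter

def fit_1nn(X, y):
    """The 'model' of a 1-nearest-neighbor classifier is the learning set itself."""
    return list(zip(X, y))

def predict_1nn(model, x):
    nearest = min(model, key=lambda pair: sum((a - b) ** 2 for a, b in zip(x, pair[0])))
    return nearest[1]

def bagged_predict(x, X, y, fit, predict, B=25, seed=0):
    """Plurality vote over predictors built from B bootstrap learning sets."""
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(B):
        # Perturbed learning set: n cases drawn at random with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        votes[predict(model, x)] += 1        # every predictor has weight 1
    return votes.most_common(1)[0][0]

# Toy learning set (made-up values): one feature, two classes.
X = [[0.0], [0.3], [0.6], [0.9], [1.2], [1.5],
     [5.0], [5.3], [5.6], [5.9], [6.2], [6.5]]
y = ["ALL"] * 6 + ["AML"] * 6

print(bagged_predict([0.4], X, y, fit_1nn, predict_1nn))
print(bagged_predict([6.0], X, y, fit_1nn, predict_1nn))
```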

Page 34:

Aggregation By-products: Out-of-bag estimation of error rate

• Out-of-bag error rate estimate: unbiased

• Use the left out cases from each bootstrap sample as a test set

• Classify these test set cases, and compare to the class labels of the learning set to get the out-of-bag estimate of the error rate
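The out-of-bag estimate can be sketched as follows: each case is classified only by the predictors whose bootstrap sample did not contain it, and the resulting vote is compared with the case's class label. The 1-nearest-neighbor base predictor, the toy data and the function names are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def fit_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    nearest = min(model, key=lambda pair: sum((a - b) ** 2 for a, b in zip(x, pair[0])))
    return nearest[1]

def oob_error(X, y, fit, predict, B=50, seed=0):
    """Out-of-bag error rate of a bagged predictor built from B bootstrap samples."""
    rng = random.Random(seed)
    n = len(X)
    votes = defaultdict(Counter)   # case index -> votes from predictors that never saw it
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            if i not in in_bag:    # case i was left out of this bootstrap sample
                votes[i][predict(model, X[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    if not scored:
        return None                # no case was ever out of bag
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored)

# Toy learning set (made-up values): one feature, two well separated classes.
X = [[0.0], [0.3], [0.6], [0.9], [1.2], [1.5],
     [5.0], [5.3], [5.6], [5.9], [6.2], [6.5]]
y = ["ALL"] * 6 + ["AML"] * 6

oob = oob_error(X, y, fit_1nn, predict_1nn)
print(oob)
```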

Page 35:

Aggregation By-products: Case-wise information

• Class probability estimates (votes) ∈ (0,1): the proportion of votes for the “winning” class; gives a measure of prediction confidence

• Vote margins ∈ (–1,1): the proportion of votes for the true class minus the maximum of the proportion of votes for each of the other classes; can be used to detect mislabeled (learning set) cases
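Both case-wise quantities follow directly from the list of votes cast for one case. In this sketch the votes are made-up values and `vote_summary` is a hypothetical helper.

```python
from collections import Counter

def vote_summary(votes, true_class=None):
    """Winning class, its vote proportion and, if the true class is known, the margin."""
    counts = Counter(votes)
    B = len(votes)
    winner, top = counts.most_common(1)[0]
    summary = {"predicted": winner, "vote_proportion": top / B}
    if true_class is not None:
        p_true = counts[true_class] / B
        p_other = max((c / B for k, c in counts.items() if k != true_class),
                      default=0.0)
        summary["margin"] = p_true - p_other
    return summary

# Votes of B = 8 aggregated classifiers for one case (assumed values).
votes = ["FL"] * 5 + ["DLBCL"] * 2 + ["B-CLL"]

print(vote_summary(votes, true_class="FL"))
print(vote_summary(votes, true_class="DLBCL"))
```

A clearly negative margin, as in the second call, is the kind of signal that can flag a possibly mislabeled learning set case.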

Page 36:

Aggregation By-products: Variable Importance Statistics

• Measure of predictive power

• For each tree, randomly permute the values of the jth variable for the out-of-bag cases, use to get new classifications

• Several possible importance measures

Page 37:

Aggregation By-products: Intrinsic Case Proximities

• Proportion of trees for which cases i and j are in the same terminal node

• “Clustering”

• Outlier detection:

1/sum(squared proximities of cases in same class)

Page 38:

Boosting

• Freund and Schapire (1997), Breiman (1998)

• Data resampled adaptively so that the weights in the resampling are increased for those cases most often misclassified

• Predictor aggregation done by weighted voting
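The slides do not spell out the weight update, so the sketch below uses the AdaBoost.M1 update of Freund and Schapire as one concrete instance: the weights of correctly classified cases are multiplied by beta = eps / (1 - eps), which after renormalization raises the resampling probability of the misclassified cases. The starting weights, the misclassification pattern and the function name are made-up for illustration.

```python
import math
import random

def adaboost_m1_update(weights, misclassified):
    """One AdaBoost.M1 round: new resampling weights and the classifier's vote weight."""
    eps = sum(w for w, miss in zip(weights, misclassified) if miss)  # weighted error
    if not 0.0 < eps < 0.5:
        raise ValueError("AdaBoost.M1 needs a weighted error strictly between 0 and 0.5")
    beta = eps / (1.0 - eps)
    # Down-weight correctly classified cases, then renormalize to sum to 1.
    updated = [w if miss else w * beta for w, miss in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated], math.log(1.0 / beta)

# Five cases with equal starting weights; the current classifier misses case 2.
weights = [0.2] * 5
misclassified = [False, False, True, False, False]

new_weights, vote_weight = adaboost_m1_update(weights, misclassified)

# Next perturbed learning set: resample cases with the updated probabilities.
rng = random.Random(0)
resample = rng.choices(range(5), weights=new_weights, k=5)
print(new_weights, round(vote_weight, 3), resample)
```

The returned vote weight, log(1/beta), is what the weighted voting step would use for this classifier.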

Page 39:

Comparison of classifiers

• Dudoit, Fridlyand, Speed (JASA, 2002)

• FLDA

• DLDA

• DQDA

• NN

• CART

• Bagging and boosting

Page 40:

Comparison study datasets

• Leukemia – Golub et al. (1999): n = 72 samples, G = 3,571 genes; 3 classes (B-cell ALL, T-cell ALL, AML)

• Lymphoma – Alizadeh et al. (2000): n = 81 samples, G = 4,682 genes; 3 classes (B-CLL, FL, DLBCL)

• NCI 60 – Ross et al. (2000): n = 64 samples, G = 5,244 genes; 8 classes

Page 41:

Leukemia data, 2 classes: Test set error rates; 150 LS/TS runs

Page 42:

Leukemia data, 3 classes: Test set error rates; 150 LS/TS runs

Page 43:

Lymphoma data, 3 classes: Test set error rates; 150 LS/TS runs

Page 44:

NCI 60 data: Test set error rates; 150 LS/TS runs

Page 45:

Results

• In the main comparison of Dudoit et al., NN and DLDA had the smallest error rates, FLDA had the highest

• For the lymphoma and leukemia datasets, increasing the number of genes to G=200 didn't greatly affect the performance of the various classifiers; there was an improvement for the NCI 60 dataset.

• More careful selection of a small number of genes (10) improved the performance of FLDA dramatically

Page 46:

Comparison study – Discussion (I)

• “Diagonal” LDA: ignoring correlation between genes helped here

• Unlike classification trees and nearest neighbors, LDA is unable to take into account gene interactions

• Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions

Page 47:

Comparison study – Discussion (II)

• Classification trees are capable of handling and revealing interactions between variables

• Useful by-product of aggregated classifiers: prediction votes, variable importance statistics

• Variable selection: A crude criterion such as BSS/WSS may not identify the genes that discriminate between all the classes and may not reveal interactions between genes

• With larger training sets, expect improvement in performance of aggregated classifiers

Page 48:

Acknowledgements

• Sandrine Dudoit

• Jane Fridlyand

• Yee Hwa (Jean) Yang

• Terry Speed