Supervised Learning and Classification
Xiaole Shirley Liu and Jun Liu

Transcript
Page 1:

Supervised Learning and Classification

Xiaole Shirley Liu and Jun Liu

Page 2:

Outline
• Dimension reduction
  – Principal Component Analysis (PCA)
  – Other approaches such as MDS, SOM, etc.
• Unsupervised learning for classification
  – Clustering and KNN
• Supervised learning for classification
  – CART, SVM
• Expression and genome resources

Page 3:

Dimension Reduction

• High-dimensional data points are difficult to visualize
• Always good to plot data in 2D
  – Easier to detect or confirm the relationship among data points
  – Catch stupid mistakes (e.g. in clustering)
• Two ways to reduce:
  – By genes: some experiments are similar or have little information
  – By experiments: some genes are similar or have little information

Page 4:

Principal Component Analysis

• Optimal linear transformation: chooses a new coordinate system for the data set so that the projections onto the new axes, taken in order of the principal components, capture the maximum variance
• Components are orthogonal (mutually uncorrelated)
• A few PCs may capture most of the variation in the original data
• E.g. reduce 2D data into 1D

Page 5:

Principal Component Analysis (PCA)

Page 6:

Example: human SNP marker data

Page 7:

PCA for 800 randomly selected SNPs

Page 8:

PCA for 400 randomly selected SNPs

Page 9:

PCA for 200 randomly selected SNPs

Page 10:

PCA for 100 randomly selected SNPs

Page 11:

PCA for 50 randomly selected SNPs

Page 12:

Interpretations and Insights

• PCA can discover aggregated subtle effects/differences in high-dimensional data
• PCA finds linear directions onto which to project the data. It is purely unsupervised, so it is indifferent to the "importance" of particular directions; it simply seeks the most "variable" directions.
• There are generalizations for supervised PCA.

Page 13:

Principal Component Analysis

• Achieved by singular value decomposition (SVD): X = U D Vᵀ
• X is the original N × p data
  – E.g. N genes, p experiments
• V is p × p, the projection directions
  – Orthogonal matrix: VᵀV = I_p
  – v1 is the direction of the first projection
  – Each direction is a linear combination (relative importance) of the experiments (or of the genes, if PCA is done on samples)
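
A minimal sketch (not from the slides) of how the SVD-based PCA described above can be computed with numpy; the toy matrix X, its size, and the variable names are illustrative assumptions.

```python
import numpy as np

# Toy data: N = 6 observations (e.g. genes), p = 3 variables (e.g. experiments)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

# Center each column, then X = U D V^T
Xc = X - X.mean(axis=0)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt (columns of V) are the projection directions v1, v2, ...
# U[:, k] * d[k] gives the k-th principal component scores
scores = U * d                       # same as Xc @ Vt.T
var_explained = d**2 / np.sum(d**2)  # fraction of variance per component

print(scores[:, :2])   # first two principal components
print(var_explained)
```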

Page 14:

Quick Linear Algebra Review

• N × p matrix: N rows, p columns
• Matrix multiplication:
  – An N × k matrix multiplied with a k × p matrix gives an N × p matrix
• Diagonal matrix, e.g. diag(5, 3, 2, 0.5)
• Identity matrix I_p, e.g. I_4 = diag(1, 1, 1, 1)
• Orthogonal matrix:
  – r = c, UᵀU = I_p
• Orthonormal (column-orthonormal) matrix:
  – r ≥ c, UᵀU = I_p
• Transpose example:
  [[3, 5], [2, 0], [0, 1]] = [[3, 2, 0], [5, 0, 1]]ᵀ

Page 15:

Quick Linear Algebra Review

• Example: an orthogonal 2 × 2 matrix
• Transformation, multiplication, and the identity matrix
• [Worked 2 × 2 example showing that UᵀU = I_2 for an orthogonal matrix U.]

Page 16:

Some basic mathematics

• Goal of PCA step 1: find a direction that has the "maximum variation"
  – Let the direction be a (column) unit vector c (p-dimensional)
  – The projection of x onto c is $\langle x, c \rangle = \sum_{k=1}^{p} c_k x_k$
  – So we want to maximize $\sum_{i=1}^{n} (v_i - \bar v)^2$, where $v_i = \langle x_i, c \rangle$, subject to $\|c\| = 1$
  – Algebra:
    $\sum_{i=1}^{n} (v_i - \bar v)^2 = \sum_{i=1}^{n} c^T (x_i - \bar x)(x_i - \bar x)^T c = c^T \Big[ \sum_{i=1}^{n} (x_i - \bar x)(x_i - \bar x)^T \Big] c = (n-1)\, c^T \hat\Sigma\, c$
  – So the maximum is achieved as the largest eigenvalue of the sample covariance matrix $\hat\Sigma$, with c the corresponding eigenvector
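
A small numerical check (not from the slides) of the claim above: the leading eigenvector of the sample covariance matrix gives the direction of maximum projected variance. The toy data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.5])  # anisotropic toy data

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (X.shape[0] - 1)      # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
c = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

# Variance of the projection onto c equals the largest eigenvalue
proj_var = np.var(Xc @ c, ddof=1)
print(proj_var, eigvals[-1])          # the two numbers agree
```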

Page 17:

A simple illustration

Page 18:

PCA and SVD

• PCA: eigen-decomposition of XᵀX; SVD: X = U D Vᵀ
• U is N × p, the relative projection of the points
• D is a p × p scaling factor
  – Diagonal matrix, d1 ≥ d2 ≥ d3 ≥ … ≥ dp ≥ 0, e.g. D = diag(5, 3, 2, 0.5)
• u_{i1} d_1 is the distance along v1 from the origin (first principal component)
  – Expression value projected onto v1
  – v2 is the 2nd projection direction, u_{i2} d_2 is the 2nd principal component, and so on
• Variance captured by the first m principal components:
  $\sum_{i=1}^{m} d_i^2 \Big/ \sum_{j=1}^{p} d_j^2$

Page 19:

PCA

X (N × p, original data) × V (p × p, projection directions) = U (N × p, projected values) × D (p × p, scale)

1st principal component:
X11 V11 + X12 V21 + X13 V31 + … = X11' = U11 D11
X21 V11 + X22 V21 + X23 V31 + … = X21' = U21 D11

2nd principal component:
X11 V12 + X12 V22 + X13 V32 + … = X12' = U12 D22
X21 V12 + X22 V22 + X23 V32 + … = X22' = U22 D22

Page 20:

PCA

[Figure: data points shown with the principal directions v1 and v2.]

Page 21:

PCA on Genes Example
• Cell cycle genes, 13 time points, reduced to 2D
• Genes: 1: G1; 4: S; 2: G2; 3: M

Page 22:

PCA Example
Variance in the data explained by the first n principal components

Page 23:

PCA Example
• The coefficients of the first 8 principal directions (v1, v2, v3, v4, …)
• This is an example of PCA to reduce samples
• Can do PCA to reduce the genes as well
  – Use the first 2-3 PCs to plot samples; giving more weight to the more differentially expressed genes can often reveal sample classification

Page 24:

Microarray Classification

[Table: expression values for ~35 probe sets (rows, e.g. 39089_at, 35862_at, …) across 5 normal samples (m412a–m430a) and 9 multiple myeloma (MM) samples (m282–m424a); the task is to classify an unknown sample, marked "?".]

Page 25:

Supervised Learning

• Statistical methods abound, e.g. regression
• A special case is the classification problem: Yi is the class label, and the Xi's are the covariates
• We learn the relationship between the Xi's and Y from training data, and predict Y for future data where only the Xi's are known

Page 26:

Supervised Learning Example

• Use gene expression data to distinguish and predict:
  – Tumor vs. normal samples, or sub-classes of tumor samples
  – Long-survival patients vs. short-survival patients
  – Metastatic tumors vs. non-metastatic tumors

Page 27:

Clustering Classification

• Which known samples does the unknown sample cluster with?
• No guarantee that the unknown sample will cluster with the known samples
• Try different clustering methods (semi-supervised)
  – E.g. change linkage, use a subset of genes

Page 28:

K Nearest Neighbor

• For an observation X with unknown label, find the K observations in the training data closest (e.g. by correlation) to X
• Predict the label of X based on a majority vote among the K nearest neighbors
• K can be determined by the predictability of known samples; semi-supervised again!
• Offers little insight into mechanism
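
A minimal sketch (not part of the slides) of the KNN rule described above, using scikit-learn; the toy data, the choice of K = 3, and the use of Euclidean distance (rather than correlation) are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Toy training data: 20 samples x 5 "genes", two class labels
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, size=(10, 5)),
                     rng.normal(2, 1, size=(10, 5))])
y_train = np.array([0] * 10 + [1] * 10)

knn = KNeighborsClassifier(n_neighbors=3)   # K chosen for illustration only
knn.fit(X_train, y_train)

# Predict the label of a new observation by majority vote of its 3 nearest neighbors
x_new = rng.normal(1.8, 1, size=(1, 5))
print(knn.predict(x_new))

# "K can be determined by predictability of known samples": cross-validated accuracy
for k in (1, 3, 5, 7):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5).mean()
    print(k, round(acc, 2))
```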

Page 29:

Page 30:

Extensions of the Nearest Neighbor Rule

• Class prior weights. Votes may be weighted according to the neighbor's class.
• Distance weights. Assign weights to the observed neighbors ("evidence") that are inversely proportional to their distance from the test sample.
• Differential misclassification costs. Votes may be adjusted based on the class to be called.

Page 31:

Other Well-known Classification Methods

• Linear Discriminant Analysis (LDA)
• Logistic Regression
• Classification and Regression Trees (CART)
• Neural Networks (NN)
• Support Vector Machines (SVM)

The following presentations of linear methods for classification, LDA, and logistic regression are mainly based on Hastie, Tibshirani and Friedman (2001), The Elements of Statistical Learning.

Page 32:

A general framework: the Bayes Classifier

• Consider a two-class problem (can be any number of classes)
• Training data: (Xi, Yi) -- Yi is the class label, and the Xi's are the covariates
• Learn the conditional distributions
  – P(X | Y=1) and P(X | Y=0)
• Learn (or impose) the prior weight on Y
• Use the Bayes rule:
  $P(Y=1 \mid X) = \dfrac{P(X \mid Y=1)\, P(Y=1)}{P(X \mid Y=1)\, P(Y=1) + P(X \mid Y=0)\, P(Y=0)}$
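
A small sketch (not from the slides) of the Bayes rule above, assuming one-dimensional Gaussian class-conditional densities; all numbers are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities and priors (illustrative values)
p_y1 = 0.3                      # P(Y=1)
p_y0 = 1 - p_y1                 # P(Y=0)
f1 = norm(loc=2.0, scale=1.0)   # P(X | Y=1)
f0 = norm(loc=0.0, scale=1.0)   # P(X | Y=0)

def posterior_y1(x):
    """Bayes rule: P(Y=1 | X=x)."""
    num = f1.pdf(x) * p_y1
    den = num + f0.pdf(x) * p_y0
    return num / den

for x in (-1.0, 1.0, 3.0):
    print(x, round(posterior_y1(x), 3))
```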

Page 33:

Supervised Learning Performance Assessment

• If the error rate is estimated from the whole learning data set, it could overfit the data (do well now, but poorly on future observations)
• Divide observations into L1 and L2
  – Build the classifier using L1
  – Compute the classifier error rate using L2
  – Requirement: L1 and L2 are iid (independent & identically distributed)
• N-fold cross validation
  – Divide the data into N subsets (equal size), build the classifier on (N-1) subsets, compute the error rate on the left-out subset
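
A short sketch (not from the slides) of N-fold cross validation as described above, using scikit-learn; the classifier choice (KNN) and N = 5 are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)   # toy labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in kf.split(X):
    # Build on the (N-1) training folds, estimate the error on the left-out fold
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))

print("estimated error rate:", np.mean(errors))
```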

Page 34:

Fisher's Linear Discriminant Analysis

• First collect differentially expressed genes
• Find the linear projection that maximizes class separability (between-group to within-group sum of squares)
• Can be used for dimension reduction as well

Page 35:

LDA

• 1D: find the two group means, cut at some point (the middle, say)
• 2D: connect the two group means with a line, use a line parallel to the "main direction" of the data that passes through somewhere between them, i.e., project onto $\hat\Sigma^{-1}(\hat\mu_1 - \hat\mu_2)$
• Limitations:
  – Does not consider non-linear relationships
  – Assumes the class means capture most of the information
• Weighted voting: a variation of LDA
  – Each informative gene is given a different weight based on how informative it is at classifying samples (e.g. its t-statistic)

Page 36:

In practice: estimating Gaussian Distributions

• Prior probabilities: $\hat\pi_k = N_k / N$
• Class centers: $\hat\mu_k = \sum_{g_i = k} x_i / N_k$
• Covariance matrix: $\hat\Sigma = \sum_{k} \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$
• Decision boundary for (y, x): find the k that maximizes
  $\delta_k(x) = x^T \hat\Sigma^{-1} \hat\mu_k - \tfrac{1}{2}\, \hat\mu_k^T \hat\Sigma^{-1} \hat\mu_k + \log \hat\pi_k$
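
A compact sketch (not from the slides) fitting LDA with scikit-learn, which estimates the same quantities listed above (priors, class means, pooled covariance); the toy data are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 4)),
               rng.normal(1.5, 1, size=(30, 4))])
y = np.array([0] * 30 + [1] * 30)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(lda.priors_)       # estimated pi_k
print(lda.means_)        # estimated class centers mu_k
print(lda.covariance_)   # pooled covariance estimate
print(lda.predict(X[:3]))
```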

Page 37:

Logistic Regression

• Data: (yi, xi), i = 1, …, n (binary responses)
• Model:
  $\log \dfrac{P(y_i = 1 \mid x_i)}{P(y_i = 0 \mid x_i)} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$
• In practice, one estimates the β's using the training data (can use R)
• The decision boundary is determined by the linear part of the model, i.e., classify yi = 1 if
  $\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} > 0$
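
A minimal sketch (not part of the slides) of fitting this model; the slide mentions R, but to stay consistent with the other examples here this uses scikit-learn's LogisticRegression, and the data and coefficients are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Toy labels generated from a linear logit (beta_0 = -0.5, beta = [2, -1, 0])
logit = -0.5 + X @ np.array([2.0, -1.0, 0.0])
y = (rng.uniform(size=100) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)   # estimated beta's
print(model.predict(X[:5]))            # classify y=1 where the fitted logit > 0
```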

Page 38:

Diabetes Data Set

Page 39:

Connections with LDA

Page 40:

Remarks

• Simple methods such as nearest neighbor classification are competitive with more complex approaches, such as aggregated classification trees or support vector machines (Dudoit and Fridlyand, 2003)
• Screening genes down to G = 10 to 100 is advisable
• Models may include other predictor variables (such as age and sex)
• Outcomes may be continuous (e.g., blood pressure, cholesterol level, etc.)

Page 41:

Classification And Regression Tree (CART)

• Split the data using a set of binary (or multiple-value) decisions
• The root node (all data) has certain impurities; we need to split the data to reduce the impurities

Page 42:

CART

• Measures of impurity
  – Entropy: $-\sum_{\text{class}} P(\text{class}) \log_2 P(\text{class})$
  – Gini index impurity: $1 - \sum_{\text{class}} P(\text{class})^2$
• Example with Gini: multiply the impurity by the number of samples in the node
  – Root node (e.g. 8 normal & 14 cancer):
    $22 \left[ 1 - \left( \tfrac{8}{22} \right)^2 - \left( \tfrac{14}{22} \right)^2 \right] \approx 10.18$
  – Try a split on gene xi (xi ≥ 0: 13 cancer; xi < 0: 1 cancer & 8 normal):
    $13 \left[ 1 - \left( \tfrac{13}{13} \right)^2 \right] + 9 \left[ 1 - \left( \tfrac{1}{9} \right)^2 - \left( \tfrac{8}{9} \right)^2 \right] \approx 1.78$
  – Split at the gene with the biggest reduction in impurity
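
A tiny helper (not from the slides) that reproduces the Gini arithmetic above; the function name and inputs are illustrative.

```python
def weighted_gini(counts):
    """n * (1 - sum(p_k^2)) for a node with the given class counts."""
    n = sum(counts)
    return n * (1 - sum((c / n) ** 2 for c in counts))

root = weighted_gini([8, 14])                            # 8 normal, 14 cancer
split = weighted_gini([0, 13]) + weighted_gini([1, 8])   # xi >= 0 node + xi < 0 node

print(round(root, 2))          # ~10.18
print(round(split, 2))         # ~1.78
print(round(root - split, 2))  # impurity reduction for this split
```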

Page 43:

CART

• Assume independence of partitions; the same level may split on different genes
• Stop splitting:
  – When the impurity is small enough
  – When the number of samples in a node is small
• Pruning to reduce overfitting:
  – Use the training set to split, a test set for pruning
  – Each split has a cost, compared to the gain at that split

Page 44:

Boosting

• Boosting is a method of improving the effectiveness of predictors.
• Boosting relies on the existence of weak learners.
• A weak learner is a "rough and moderately inaccurate" predictor, but one that can predict better than chance.
• Boosting shows the strength of weak learnability.

Page 45:

The Rules for Boosting

• Set all weights of training examples equal
• Train a weak learner on the weighted examples
• See how well the weak learner performs on the data and give it a weight based on how well it did
• Re-weight the training examples and repeat
• When done, predict by weighted voting
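
The loop above is essentially the AdaBoost procedure; a short sketch (not from the slides) using scikit-learn's implementation, whose default weak learner is a depth-1 decision tree. The data and parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)   # a nonlinear toy rule

# Each boosting round re-weights the examples; the final prediction is a weighted vote
boosted = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(boosted.score(X, y))   # training accuracy of the weighted-vote ensemble
```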

Page 46:

Artificial Neural Network

• ANN: models neurons (feedforward NN)
• Perceptron: the simplest ANN
  – xi: inputs (e.g. expression values of different genes)
  – wi: weights (e.g. how much each gene contributes, +/-)
  – y: output, e.g.
    $\sum_i w_i x_i \ge \text{threshold} \Rightarrow \text{cancer}; \qquad \sum_i w_i x_i < \text{threshold} \Rightarrow \text{normal}$
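
A toy sketch (not from the slides) of the perceptron decision rule above; the weights, threshold, and class labels are illustrative assumptions.

```python
import numpy as np

def perceptron_predict(x, w, threshold):
    """Return 'cancer' if the weighted sum of inputs reaches the threshold, else 'normal'."""
    return "cancer" if np.dot(w, x) >= threshold else "normal"

w = np.array([0.8, -0.5, 1.2])   # per-gene contributions (+/-), illustrative
threshold = 0.6
print(perceptron_predict(np.array([1.0, 0.2, 0.5]), w, threshold))
print(perceptron_predict(np.array([0.1, 0.9, 0.1]), w, threshold))
```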

Page 47:

ANN

• Multi-Layered Perceptron
• A 3-layer ANN can solve any nonlinear continuous problem
  – Picking the number of layers and the number of nodes per layer is not easy
• Weight training:
  – Back propagation
  – Minimize the error between observed and predicted outputs, e.g. $\tfrac{1}{2} \sum_i \big( y_i - g(x_i) \big)^2$, where the network output has the nested form $g(x) = f\Big( w_0 + \sum_k w_k\, f\big( w_{k0} + \sum_j w_{kj} x_j \big) \Big)$
  – Black box
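
A brief sketch (not part of the slides) of a multi-layer perceptron trained by back-propagation, using scikit-learn; the one-hidden-layer architecture, toy data, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # a nonlinear (circular) class boundary

# One hidden layer; weights are trained by back-propagation of the prediction error
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0).fit(X, y)
print(mlp.score(X, y))
```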

Page 48:

Support Vector Machine

• SVM: which hyperplane is the best?

Page 49:

Support Vector Machine

• SVM finds the hyperplane that maximizes the margin
• The margin is determined by the support vectors (samples that lie on the class edge); the other samples are irrelevant

Page 50:

Support Vector Machine

• SVM finds the hyperplane that maximizes the margin
• The margin is determined by the support vectors; other samples are irrelevant
• Extensions:
  – Soft edge: support vectors get different weights
  – Non-separable case: slack variables > 0; maximize (margin - number of bad points)

Page 51:

Nonlinear SVM

• Project the data into a higher-dimensional space with a kernel function, so that the classes can be separated by a hyperplane
• A few implemented kernel functions are available in Matlab & BioConductor; the choice is usually by trial and error and personal experience
• Example kernel: K(x, y) = (x·y)²
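
A brief sketch (not part of the slides) using scikit-learn rather than Matlab/BioConductor; the polynomial kernel of degree 2 mirrors the K(x, y) = (x·y)² example, and the data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # not linearly separable in 2D

# Polynomial kernel of degree 2, i.e. K(x, y) = (gamma * x.y + coef0)^2
svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0).fit(X, y)
print(len(svm.support_))   # number of support vectors defining the margin
print(svm.score(X, y))
```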

Page 52:

Most Widely Used Sequence IDs

• GenBank: all submitted sequences
• EST: Expressed Sequence Tags (mRNA); some redundancy, might have contamination
• UniGene: computationally derived gene-based transcribed sequence clusters
• Entrez Gene: comprehensive catalog of genes and associated information, ~ the traditional concept of a "gene"
• RefSeq: reference sequences for mRNAs and proteins, individual transcripts (splice variants)

Page 53:

UCSC Genome Browser

• Can display custom tracks

Page 54:

Entrez: Main NCBI Search Engine

Page 55:

Public Microarray Databases

• SMD: Stanford Microarray Database; most Stanford and collaborators' cDNA arrays
• GEO: Gene Expression Omnibus, an NCBI repository for gene expression and hybridization data, growing quickly
• Oncomine: Cancer Microarray Database
  – Published cancer-related microarrays
  – Raw data all processed, nice interface

Page 56:

Outline
• Gene ontology
  – Check differential expression and clustering, GSEA
• Microarray clustering:
  – Unsupervised
    • Clustering, KNN, PCA
  – Supervised learning for classification
    • CART, SVM
• Expression and genome resources

Page 57:

Acknowledgment
• Kevin Coombes & Keith Baggerly
• Darlene Goldstein
• Mark Craven
• George Gerber
• Gabriel Eichler
• Ying Xie
• Terry Speed & Group
• Larry Hunter
• Wing Wong & Cheng Li
• Ping Ma, Xin Lu, Pengyu Hong
• Mark Reimers
• Marco Ramoni
• Jenia Semyonov