CS/CBB 545 - Data Mining: Spectral Methods (PCA, SVD) #2 - Application. Mark Gerstein, Yale University, gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)



2

Intuition on interpretation of SVD in terms of genes and conditions

3

SVD for microarray data (Alter et al., PNAS 2000)

4

5

Notation

• m=1000 genes
  – row-vectors
  – 10 eigengenes (vi) of dimension 10 conditions

• n=10 conditions (assays)
  – column vectors
  – 10 eigenconditions (ui) of dimension 1000 genes
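A minimal sketch of the notation above, assuming NumPy and a random matrix as a stand-in for real expression data. The names `A`, `eigengenes` and `eigenconditions` are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10                      # genes, conditions (sizes from the notation slide)
A = rng.standard_normal((m, n))      # synthetic stand-in for an expression matrix

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

eigenconditions = U.T   # 10 vectors u_i, each of dimension 1000 genes
eigengenes = Vt         # 10 vectors v_i, each of dimension 10 conditions

print(eigenconditions.shape, eigengenes.shape)
```

Each gene (a row of `A`) is then a combination of the 10 eigengenes, and each condition (a column of `A`) a combination of the 10 eigenconditions.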

6

Understanding Eigengenes (vi) in terms of PCA on (large) gene-gene correlation matrix

7

Understanding Eigenconditions (ui) in terms of PCA on (small) condition-condition correlation matrix

Bra-ket notation
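One way to read the two slides above, sketched numerically. Assumptions: A is genes × conditions as in the notation slide, and the plain products A Aᵀ and Aᵀ A stand in for the gene-gene and condition-condition correlation matrices (no centering or scaling). The PCA vectors of the large gene-gene matrix are the ui, and each eigengene vi is the ui-weighted combination of the gene rows; the condition side is symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 10))          # synthetic genes x conditions matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

gene_gene = A @ A.T      # large: 1000 x 1000
cond_cond = A.T @ A      # small: 10 x 10

u1, v1 = U[:, 0], Vt[0]

# u_1 is an eigenvector of the gene-gene matrix, v_1 of the condition-condition
# matrix, both with eigenvalue s_1**2.
ok_u = np.allclose(gene_gene @ u1, s[0] ** 2 * u1)
ok_v = np.allclose(cond_cond @ v1, s[0] ** 2 * v1)

# Link between the two pictures: the eigengene is a u_1-weighted sum of gene rows,
# the eigencondition a v_1-weighted sum of condition columns.
ok_eigengene = np.allclose(A.T @ u1, s[0] * v1)
ok_eigencondition = np.allclose(A @ v1, s[0] * u1)

print(ok_u, ok_v, ok_eigengene, ok_eigencondition)
```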

8

Plotting Experiments in Low Dimension Subspace

9

Close up on Eigengenes

10

Copyright ©2000 by the National Academy of Sciences

Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106

Genes sorted by correlation with top 2 eigengenes

11

Same thing, different experiment: genes sorted by relative correlation with first two eigengenes for alpha-factor experiment
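A rough sketch of this kind of sorting, on synthetic periodic data rather than the yeast datasets. It assumes "relative correlation" means the angle of each gene in the plane of its correlations with the two top eigengenes; the phase model and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 12
t = np.arange(n)
phase = rng.uniform(0, 2 * np.pi, size=m)             # hypothetical per-gene phase
A = np.cos(2 * np.pi * t / n - phase[:, None])        # periodic synthetic expression
A += 0.1 * rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
e1, e2 = Vt[0], Vt[1]                                 # top two eigengenes


def corr_with(rows, vec):
    """Pearson correlation of every row of `rows` with the vector `vec`."""
    r = rows - rows.mean(axis=1, keepdims=True)
    v = vec - vec.mean()
    return (r @ v) / (np.linalg.norm(r, axis=1) * np.linalg.norm(v))


c1, c2 = corr_with(A, e1), corr_with(A, e2)
angle = np.arctan2(c2, c1)        # position of each gene in the (c1, c2) plane
order = np.argsort(angle)         # gene ordering for the sorted heat map
A_sorted = A[order]
```

Plotting `A_sorted` as a heat map would show the expression peak moving steadily across the conditions from top to bottom.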

12

Normalized elutriation expression in the subspace associated with the cell cycle

13

Biplot Applied to Genes and Conditions

See grouping of arrays and genes on same plot
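A small sketch of biplot coordinates, assuming the common convention of scaling the gene coordinates by the singular values and leaving the array coordinates unscaled. Other scalings exist, and the column-centering step is an optional choice made here.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 6))       # synthetic genes x arrays matrix
A = A - A.mean(axis=0)                 # column-centering (a common, optional choice)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
gene_xy = U[:, :k] * s[:k]             # one 2-D point per gene
array_xy = Vt[:k].T                    # one 2-D point per array

# Inner products between gene points and array points give the best
# rank-2 approximation of the data matrix.
rank2 = gene_xy @ array_xy.T
residual = np.linalg.norm(A - rank2)
```

Scattering `gene_xy` and `array_xy` on the same axes gives the joint picture of genes and arrays.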

(c) M Gerstein '06, gerstein.info/talks

14

Spectral Biclustering


15

Biclustering to associate particular genes with certain phenotypes

[Figure: Matrix of raw data (rows: Genes; columns: Conditions) → ? → Shuffled Matrix (containing checkerboard “biclusters” of conditions with marker genes), with rows labeled Reordered Genes (Sorted according to a classification vector) and columns labeled Reordered Conditions (Sorted according to a classification vector)]


16

Pomeroy et al., Nature 415 (2002) 436: Prediction of central nervous system embryonal tumor outcome based on gene expression

5 types of brain tumors


17

Intuition on Identification of Blocky Matrices


18

[Figure: checkerboard matrix A with constant blocks (rows: Gene cluster 1, Gene cluster 2; columns: tumor 1, tumor 2, tumor 3), multiplying a vector x whose entries a, b, c are constant within each tumor type, to give a vector y whose entries D, E are constant within each gene cluster]

$A x = y$

Gene partition vector


19

[Figure: transposed checkerboard matrix $A^{T}$ (rows: conditions; columns: genes), multiplying the vector y with entries D, E to give a vector x' that is constant within each tumor type]

$A^{T} y = x'$

Tissue partition vector


20

[Figure: the checkerboard matrix A and its transpose $A^{T}$ applied in sequence to the classification vectors]

$A x = y$, $A^{T} y = x'$

$A^{T} A x = x'$

Biclustering by SVD


21

Identify checkerboard matrices by their action on classification vectors: Formulation as “eigenproblem”

[Figure: Checkerboard Matrix A (rows: Genes; columns: Conditions) acting on the Condition Classification Vect. x to give the Gene Classification Vector y; its transpose $A^{T}$ (rows: Conditions; columns: Genes) acting on y to give x']

$A^{T} A x = x'$

$A A^{T} y = y'$
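A toy check of this formulation. The block values, cluster sizes and the helper `is_step_like` are made up for illustration. A step-like condition vector stays step-like after applying A and then Aᵀ, and the singular vectors of a noiseless checkerboard matrix with nonzero singular values are themselves step-like.

```python
import numpy as np

gene_cluster = np.repeat([0, 1], [4, 3])         # 7 genes in 2 clusters (hypothetical sizes)
tumor_type = np.repeat([0, 1, 2], [3, 2, 4])     # 9 conditions in 3 tumor types
block_means = np.array([[6.0, 4.0, 8.0],
                        [3.0, 7.0, 5.0]])        # made-up block values
A = block_means[gene_cluster][:, tumor_type]     # 7 x 9 checkerboard matrix


def is_step_like(vec, labels, tol=1e-9):
    """True if `vec` is constant within every group defined by `labels`."""
    return all(vec[labels == g].max() - vec[labels == g].min() < tol
               for g in np.unique(labels))


x = np.array([1.0, 2.0, -1.0])[tumor_type]       # condition vector with entries a, b, c
y = A @ x                                        # constant within each gene cluster
x_new = A.T @ y                                  # constant within each tumor type again

# SVD solves A^T A v = s^2 v and A A^T u = s^2 u.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
n_nonzero = int(np.sum(s > 1e-9))                # rank of the checkerboard (2 here)
```

In this noiseless case, grouping equal entries of `U[:, 0]` and `U[:, 1]` recovers the gene clusters, and `Vt[0]` and `Vt[1]` do the same for the tumor types.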


22

SVD to Solve Eigenproblem

[Botstein]


23

Yuval Kluger et al. Genome Res. 2003; 13: 703-716

Figure 1. Overview of important parts of the biclustering process


24

[Figure: matrix A with noisy entries in place of the constant blocks, multiplying the vector with entries a, b, c; result vector with entries D, E]

Gene partition with noisy data


25

[Figure: row-rescaled matrix acting on the vector with entries a, b, c; result vector with entries D, E]

$R^{-1} A x = y$

Normalization Rescales Rows and Columns to Same Means


26

[Figure: column-rescaled transposed matrix acting on the vector with entries D, E; result vector with entries a, b, c]

$C^{-1} A^{T} y = x'$

Rescale columns
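A sketch of the rescaling step under stated assumptions. The two rescaling slides are read as R⁻¹ A x = y and C⁻¹ Aᵀ y = x', with R and C the diagonal matrices of row sums and column sums (the convention of Kluger et al. 2003, not spelled out on the slides). The multiplicative per-gene and per-condition scale factors are a hypothetical distortion chosen for illustration. The coupled problem can then be solved through the SVD of R^(-1/2) A C^(-1/2).

```python
import numpy as np

rng = np.random.default_rng(4)
gene_cluster = np.repeat([0, 1], [20, 15])
tumor_type = np.repeat([0, 1, 2], [6, 5, 7])
block_means = np.array([[6.0, 4.0, 8.0], [3.0, 7.0, 5.0]])   # made-up block values
checker = block_means[gene_cluster][:, tumor_type]

# Hypothetical distortion: every gene and every condition has its own scale factor.
A = checker * rng.uniform(0.5, 2.0, size=(35, 1)) * rng.uniform(0.5, 2.0, size=(1, 18))


def is_step_like(vec, labels, tol=1e-9):
    """True if `vec` is constant within every group defined by `labels`."""
    return all(vec[labels == g].max() - vec[labels == g].min() < tol
               for g in np.unique(labels))


r = A.sum(axis=1)                                   # row sums (diagonal of R)
c = A.sum(axis=0)                                   # column sums (diagonal of C)
A_hat = A / np.sqrt(r)[:, None] / np.sqrt(c)[None, :]   # R^(-1/2) A C^(-1/2)

U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)

x = Vt[1] / np.sqrt(c)      # candidate condition classification vector
y = (A @ x) / r             # R^-1 A x
x_new = (A.T @ y) / c       # C^-1 A^T y, equal to s[1]**2 * x
```

The top singular value of `A_hat` is 1 and carries no partition information, which is why the second pair of singular vectors is used here.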


27

Representative Cancer Data set

• Lymphoma Data from Dalla-Favera et al. at Columbia

• Informatics from Stolovitzky & Califano at IBM

• Supervised learning identified some characteristic genes associated with different types of lymphoma


28

Patients (samples) sorted according to projection onto blocky classification eigenvector (u2)

Genes sorted according to projection onto blocky classification eigenvector (v2)

Matrix values represent outer products of two blocky classification eigenvectors

Results on Representative Cancer Data set
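A synthetic illustration of this kind of display, not the lymphoma data. Assumptions: rows are genes and columns are patients, so which factor plays the role of u2 or v2 depends on the orientation of the data matrix; the two-class signal, offset and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
gene_cluster = rng.permutation(np.repeat([0, 1], [20, 20]))   # 40 genes, 2 clusters
sample_class = rng.permutation(np.repeat([0, 1], [6, 6]))     # 12 patients, 2 classes
g = 2 * gene_cluster - 1                                      # +1 / -1 gene labels
h = 2 * sample_class - 1                                      # +1 / -1 patient labels

# Constant background + checkerboard signal + noise (all values hypothetical).
A = 5.0 + 2.0 * np.outer(g, h) + 0.3 * rng.standard_normal((40, 12))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The first singular pair mostly captures the constant background;
# the second pair carries the two-class structure.
row_order = np.argsort(U[:, 1])                   # genes sorted by second left vector
col_order = np.argsort(Vt[1])                     # patients sorted by second right vector

blocky = s[1] * np.outer(U[:, 1], Vt[1])          # outer product of the two eigenvectors
blocky_sorted = blocky[row_order][:, col_order]   # shows a clean checkerboard
A_sorted = A[row_order][:, col_order]             # same ordering applied to the data
```

Displaying `blocky_sorted` next to `A_sorted` separates the signal carried by one pair of eigenvectors from the full data.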


29

Actual Data with Normalization and Sorting


30

Actual Data just with Sorting (no normalization)


31

Actual Data (no normalization or sorting)


32

Actual Data just with Sorting (no normalization)


33

Actual Data with Normalization and Sorting


34

Patients (samples) sorted according to projection onto blocky classification eigenvector (u2)

Genes sorted according to projection onto blocky classification eigenvector (v2)

Matrix values represent outer products of two blocky classification eigenvectors

Just signal from top classification eigenvectors


35

Low Dimension Representation


36

Patients (samples) sorted according to projection onto blocky classification eigenvector (u2)

Genes sorted according to projection onto blocky classification eigenvector (v2)

Actual Values of Projections onto Classification Eigenvectors


37

Classification of Cancers Based on Projection onto two top classification eigenvectors: Better with Normalization

Normalized (“bistochastization”)

CLL DLCL FL DLCL

Straight SVD

Four types of Cancer in Dalla-Favera dataset
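A minimal sketch of bistochastization, assuming it means repeating the row and column rescaling until all row sums are equal and all column sums are equal. The function name, the target sums, the stopping rule and the random data are choices made here, not details from the slides.

```python
import numpy as np


def bistochastize(A, n_iter=1000, tol=1e-10):
    """Alternately rescale rows and columns of a positive matrix until
    every row sums to n_cols and every column sums to n_rows."""
    B = np.asarray(A, dtype=float).copy()
    m, n = B.shape
    for _ in range(n_iter):
        B /= B.sum(axis=1, keepdims=True) / n     # every row now sums to n
        B /= B.sum(axis=0, keepdims=True) / m     # every column now sums to m
        if np.abs(B.sum(axis=1) - n).max() < tol:
            break
    return B


rng = np.random.default_rng(6)
A = rng.uniform(1.0, 10.0, size=(30, 8))          # synthetic positive data matrix
B = bistochastize(A)

U, s, Vt = np.linalg.svd(B, full_matrices=False)
# The first singular pair of B is constant (trivial), so the next two right
# singular vectors serve as 2-D coordinates for the 8 samples.
sample_xy = Vt[1:3].T
```

Scattering `sample_xy` with points labeled by cancer type would give the kind of two-eigenvector plot this slide compares against straight SVD.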


38

Golub, TR et al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999, 286

biclustering bistochastization

SVD bi-normalization Normalized cuts

ALL (B) ALL (T) AML
