UNSUPERVISED ANALYSIS •GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS. •GOAL B: DIVIDE TISSUES TO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE. CLUSTERING

Dec 18, 2015

Page 1

UNSUPERVISED ANALYSIS

• GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS.

• GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE.

CLUSTERING

Page 2

Giraffe

DEFINITION OF THE CLUSTERING PROBLEM

Page 3

CLUSTER ANALYSIS YIELDS DENDROGRAM

T (RESOLUTION)

Page 4

Giraffe + Okapi

BUT WHAT ABOUT THE OKAPI?

Page 5

STATEMENT OF THE PROBLEM

GIVEN DATA POINTS Xi, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.

AIMS:

PARTITION THE DATA INTO M CLUSTERS, POINTS OF SAME CLUSTER - "MORE SIMILAR"

M ALSO TO BE DETERMINED!

GENERATE DENDROGRAM,

IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS

"ILL POSED": WHAT IS "MORE SIMILAR"?

RESOLUTION
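The question of what "more similar" means can be made concrete with a small sketch. It compares two common choices, Euclidean distance and a distance based on Pearson correlation, on three made-up expression profiles. The profile values and the names `gene_a`, `gene_b`, `gene_c` are assumptions for illustration only and do not come from the slides.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated profiles,
    2 for perfectly anti-correlated ones. Undefined for constant profiles
    (their standard deviation is zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Hypothetical expression profiles over four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same trend as gene_a, ten times the amplitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # same scale as gene_a, opposite trend

d_ab_euclid = euclidean(gene_a, gene_b)
d_ac_euclid = euclidean(gene_a, gene_c)
d_ab_corr = pearson_distance(gene_a, gene_b)
d_ac_corr = pearson_distance(gene_a, gene_c)

print(f"Euclidean:   a-b = {d_ab_euclid:.2f}, a-c = {d_ac_euclid:.2f}")
print(f"Correlation: a-b = {d_ab_corr:.2f}, a-c = {d_ac_corr:.2f}")
```

With Euclidean distance, gene_a is closer to gene_c; with the correlation-based distance, it is closer to gene_b. The two measures would therefore group these profiles differently, which is one reason the problem is ill posed until a similarity measure is chosen.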

Page 6

CLUSTER ANALYSIS YIELDS DENDROGRAM

[Figure: dendrogram with vertical axis T; the leaves give a LINEAR ORDERING OF DATA, labeled YOUNG to OLD]

Page 7

Agglomerative Hierarchical Clustering

[Figure: five data points labeled 1 to 5 and the resulting dendrogram, with leaves in the order 5, 2, 4, 1, 3; the vertical axis is the distance between joined clusters]

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.

The dendrogram induces a linear ordering of the data points.
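The agglomerative procedure and the three linkage rules can be sketched in a few lines. This is a naive illustration, not an efficient implementation; the function name `agglomerate` and the five example points are assumptions made for the sketch, not taken from the slides.

```python
import math

def agglomerate(points, linkage="single"):
    """Naive agglomerative clustering.

    Returns (merges, leaf_order). Each merge is (cluster_a, cluster_b, distance),
    where clusters are tuples of point indices. leaf_order is the linear
    ordering of the points induced by the sequence of merges.
    """
    def cluster_distance(a, b):
        pairs = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":      # closest pair
            return min(pairs)
        if linkage == "complete":    # farthest pair
            return max(pairs)
        if linkage == "average":     # mean over all pairs
            return sum(pairs) / len(pairs)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Greedy step: find the two closest clusters under the chosen rule.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        a, b = clusters[p], clusters[q]
        merges.append((a, b, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        # Concatenating the members keeps every cluster contiguous,
        # so the final tuple is a valid dendrogram leaf order.
        clusters.append(a + b)
    return merges, clusters[0]

# Hypothetical points: a loose group near the origin and a tight pair far away.
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (6.0, 0.0), (6.4, 0.0)]
results = {m: agglomerate(points, m) for m in ("single", "complete", "average")}

for method, (merges, order) in results.items():
    heights = [round(d, 2) for _, _, d in merges]
    print(method, "merge heights:", heights, "leaf order:", order)
```

All three rules join the same clusters here, but at different heights, so the dendrograms differ in shape even when the tree topology agrees.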

Page 8

Hierarchical Clustering - Summary

• Results depend on distance update method

• Greedy iterative process

• NOT robust against noise

• No inherent measure to identify stable clusters

Page 9

2 good clouds

COMPACT WELL SEPARATED CLOUDS – EVERYTHING WORKS

Page 10

2 flat clouds

2 FLAT CLOUDS - SINGLE LINKAGE WORKS

Page 11

filament

SINGLE LINKAGE SENSITIVE TO NOISE

Page 12

Average linkage

[Figure: five data points labeled 1 to 5 and the resulting dendrogram, with leaves in the order 5, 2, 4, 1, 3; the vertical axis is the distance between joined clusters]

Need to define the distance between the new cluster and the other clusters.

Average Linkage: average distance between all pairs

Mean Linkage: distance between centroids
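The difference between the two rules can be seen on a deliberately extreme example. In this sketch, the two clusters and the helper names are assumptions chosen for illustration: both clusters share the same centroid, so the mean-linkage distance is zero while the average-linkage distance is not.

```python
import math

def average_linkage(a, b):
    """Average of the distances over all pairs (one point from each cluster)."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def mean_linkage(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))

# Hypothetical clusters: both are symmetric about the origin.
cluster_a = [(0.0, 1.0), (0.0, -1.0)]
cluster_b = [(0.0, 2.0), (0.0, -2.0)]

avg = average_linkage(cluster_a, cluster_b)   # pair distances are 1, 3, 3, 1
mean = mean_linkage(cluster_a, cluster_b)     # both centroids are (0, 0)
print("average linkage:", avg)
print("mean linkage:   ", mean)
```

Real data rarely produces coincident centroids, but the example shows that the two definitions are not interchangeable and can rank candidate merges differently.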

Page 13

nature 2002 breast cancer

Page 14

STATEMENT OF THE PROBLEM

GIVEN DATA POINTS Xi, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.

AIMS:

PARTITION THE DATA INTO M CLUSTERS, POINTS OF SAME CLUSTER - "MORE SIMILAR"

M ALSO TO BE DETERMINED!

GENERATE DENDROGRAM,

IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS

"ILL POSED": WHAT IS "MORE SIMILAR"?

RESOLUTION

Page 15

how many clusters?

3 LARGE

MANY small (SPC)

toy problem SPC

Page 16

other methods

Page 17

K-means

Iteration = 0

• Start with random positions of centroids.

Page 18

K-means

Iteration = 1

• Start with random positions of centroids.

• Assign data points to centroids

Page 19

K-means

Iteration = 1

• Start with random positions of centroids.

• Assign data points to centroids

• Move centroids to center of assigned points

Page 20

K-means

Iteration = 3

• Start with random positions of centroids.

• Assign data points to centroids

• Move centroids to center of assigned points

• Iterate till minimal cost
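The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: each point is assigned to its nearest centroid in Euclidean distance, the cost is the sum of squared distances to the assigned centroids, and, when no starting centroids are supplied, they are drawn at random from the data points. The function name `kmeans`, its parameters, and the example data are hypothetical.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain K-means. Returns (centroids, labels, cost)."""
    if init is None:
        # Assumption: random start = k distinct data points chosen at random.
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(init)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the cost cannot decrease further
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, cost

# Hypothetical data: two compact, well-separated groups of four points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
# Explicit starting centroids make this run reproducible.
centroids, labels, cost = kmeans(points, k=2, init=[(0.0, 0.0), (11.0, 11.0)])
print("centroids:", centroids)
print("labels:   ", labels)
print("cost:     ", cost)
```

Passing `init=None` uses the random start instead; different starting positions can lead to different final partitions, so in practice the algorithm is often run several times and the lowest-cost result kept.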

Page 21

K-means - Summary

• Result depends on initial centroids' position

• Fast algorithm: compute distances from data points to centroids

• Must preset K

• Fails for non-spherical distributions

Page 22

TSS vs K
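A curve of this kind can be produced by running K-means for several values of K and recording the final cost. The sketch below assumes that TSS denotes the total within-cluster sum of squared distances; the slide does not define it. The data, the function name `kmeans_cost`, and the deterministic choice of starting centroids (evenly spaced data points, used here only to make the run reproducible) are all assumptions for illustration.

```python
import math

def kmeans_cost(points, init, max_iter=100):
    """Run K-means from the given starting centroids and return the final
    within-cluster sum of squared distances."""
    centroids = list(init)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical one-dimensional data with three groups near 1, 11 and 21.
points = [(0.0,), (1.0,), (2.0,),
          (10.0,), (11.0,), (12.0,),
          (20.0,), (21.0,), (22.0,)]

costs = {}
for k in range(1, 5):
    # Deterministic start: k data points spread evenly through the list.
    init = points[:: len(points) // k][:k]
    costs[k] = kmeans_cost(points, init)

for k, cost in costs.items():
    print(f"K = {k}: cost = {cost}")
```

On this data the cost drops sharply up to K = 3 and only slightly afterwards. Such a bend in the curve is a common heuristic for choosing K, although on real data it is often less pronounced.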

Page 23

Iris setosa

Iris versicolor

Iris virginica

50 specimens from each group

4 numbers for each flower

150 data points in 4-dimensional space

irises

Page 24

150 points in d=4

3 large clusters

d=4

Page 25

Output of SPC

Stable clusters "live" for large T

Page 26

Choosing a value for T

Page 27

Same data - Average Linkage

No analog for

Page 28

Same data - Average Linkage

Examining this cluster

Page 29

[Fig. 2A: Coupled Two-Way Clustering (CTWC) of 358 Genes and 36 Samples. GLIOBLASTOMA: M. HEGI et al., CHUV, CLONTECH ARRAYS. Axis labels: GENES, T. Cluster labels: S1(G1), S2, S3, G12, G5. Sample groups: A(II), ScGBM, PrGBM, CL.]

Page 30

AB004904  STAT-induced STAT inhibitor 3

M32977    VEGF

M35410    IGFBP2

X51602    VEGFR1

M96322    gravin

AB004903  STAT-induced STAT inhibitor 2

X52946    PTN

J04111    c-jun

X79067    TIS11B

[Fig. 2B: Super-Paramagnetic Clustering of All Samples Using Stable Gene Cluster G5. Cluster labels: S1(G5), S10, S11, S12, S13, S14.]