
Post on 18-Dec-2015


UNSUPERVISED ANALYSIS

• GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS.

• GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE.
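Goal A needs a number that says how correlated two expression profiles are. The sketch below assumes Pearson correlation as that measure (the slides do not name one); the gene names and expression values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """Close to 0 for strongly positively correlated profiles, up to 2 for anti-correlated ones."""
    return 1.0 - pearson(x, y)

# Hypothetical expression levels of three genes across five tissues.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [10.0, 20.0, 30.0, 40.0, 50.0],  # same shape as geneA, larger scale
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],       # unrelated shape
}

d_ab = correlation_distance(profiles["geneA"], profiles["geneB"])
d_ac = correlation_distance(profiles["geneA"], profiles["geneC"])
```

With this choice geneA and geneB end up close together even though their absolute levels differ tenfold, which a plain Euclidean distance would not capture.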

CLUSTERING

Unsupervised analysis

Giraffe

DEFINITION OF THE CLUSTERING PROBLEM

CLUSTER ANALYSIS YIELDS DENDROGRAM

T (RESOLUTION)

Giraffe + Okapi

BUT WHAT ABOUT THE OKAPI?

STATEMENT OF THE PROBLEM

GIVEN DATA POINTS X_i, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.

AIMS:

• PARTITION THE DATA INTO M CLUSTERS, POINTS OF SAME CLUSTER - "MORE SIMILAR"

• M ALSO TO BE DETERMINED!

• GENERATE DENDROGRAM

• IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS

"ILL POSED": WHAT IS "MORE SIMILAR"?

RESOLUTION

CLUSTER ANALYSIS YIELDS DENDROGRAM

[Figure: dendrogram with vertical axis T; LINEAR ORDERING OF DATA along the leaves, labeled YOUNG ... OLD]

Agglomerative Hierarchical Clustering

[Figure: five points labeled 1-5 and their dendrogram; leaf labels 5, 2, 4, 1, 3]

Distance between joined clusters

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.
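The linkage rules above can be written as small functions of two clusters. This is a minimal sketch assuming Euclidean distance between points; the function names and the two toy clusters are illustrative, not taken from the slides.

```python
import math

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)

def average_linkage(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def centroid_linkage(a, b):
    """Distance between cluster centers."""
    return math.dist(centroid(a), centroid(b))

# Two hypothetical clusters of two points each.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

distances = {
    "single": single_linkage(cluster_a, cluster_b),
    "complete": complete_linkage(cluster_a, cluster_b),
    "average": average_linkage(cluster_a, cluster_b),
    "centroid": centroid_linkage(cluster_a, cluster_b),
}
```

On this toy pair the four rules give different numbers (single and centroid 3, complete about 3.61, average about 3.30), which is why the resulting dendrogram depends on the rule chosen.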

Dendrogram

The dendrogram induces a linear ordering of the data points.
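To see where the linear ordering comes from, here is a minimal sketch of greedy agglomerative clustering that records each merge and keeps the leaves of merged clusters side by side. The `agglomerate` function, its `linkage` argument and the one-dimensional toy points are assumptions made for illustration.

```python
import math

def agglomerate(points, linkage=min):
    """Greedy agglomerative clustering.

    `linkage` reduces the list of pairwise distances between two clusters:
    min gives single linkage, max gives complete linkage.
    Returns (merges, leaf_order); each merge is (leaves_a, leaves_b, height).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under the chosen linkage.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage([math.dist(points[p], points[q])
                             for p in clusters[i] for q in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Concatenating the leaf lists keeps every cluster contiguous.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges, clusters[0]

# Five hypothetical points on a line.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges, leaf_order = agglomerate(points)
heights = [h for _, _, h in merges]
```

The members of every merged cluster sit next to each other in `leaf_order`. The ordering is not unique: the two branches of any merge can be swapped without changing the tree.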

Hierarchical Clustering - Summary

• Results depend on distance update method

• Greedy iterative process

• NOT robust against noise

• No inherent measure to identify stable clusters

COMPACT, WELL-SEPARATED CLOUDS - EVERYTHING WORKS

2 FLAT CLOUDS - SINGLE LINKAGE WORKS

FILAMENT - SINGLE LINKAGE SENSITIVE TO NOISE
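The filament effect can be reproduced on toy data. Cutting a single-linkage dendrogram at height t yields the connected components of the graph that links all points no farther apart than t, so a thin chain of noise points is enough to fuse two clouds. The coordinates and the threshold below are made up for illustration.

```python
import math

def single_linkage_clusters(points, threshold):
    """Clusters obtained by cutting a single-linkage dendrogram at `threshold`.

    Implemented as connected components of the graph whose edges join
    points that are at most `threshold` apart.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        members = []
        while stack:
            i = stack.pop()
            members.append(i)
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= threshold}
            unvisited -= near
            stack.extend(near)
        clusters.append(sorted(members))
    return sorted(clusters)

clouds = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0),       # left cloud
          (10.0, 0.0), (11.0, 0.0), (10.5, 1.0)]    # right cloud
filament = [(3.0, 0.0), (5.0, 0.0), (7.0, 0.0), (9.0, 0.0)]  # noise bridge

without_noise = single_linkage_clusters(clouds, threshold=2.5)
with_noise = single_linkage_clusters(clouds + filament, threshold=2.5)
```

Without the bridge the two clouds come out as separate clusters; with only four extra points they collapse into one.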

Average linkage

[Figure: five points labeled 1-5 and their dendrogram; leaf labels 5, 2, 4, 1, 3]

Distance between joined clusters

Need to define the distance between the new cluster and the other clusters.

Average Linkage: average distance between all pairs

Mean Linkage: distance between centroids

Dendrogram (Nature 2002, breast cancer)


how many clusters?

3 LARGE / MANY small (SPC)

toy problem SPC

other methods

K-means

• Start with random positions of centroids.

• Assign each data point to its nearest centroid.

• Move centroids to center of assigned points.

• Iterate till minimal cost.

[Figures: snapshots at Iteration = 0, 1 and 3]

K-means - Summary

• Result depends on initial centroids’ position

• Fast algorithm: compute distances from data points to centroids

• Must preset K

• Fails for non-spherical distributions
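The K-means steps can be sketched in a few lines of plain Python. This is an illustrative version, not the implementation behind the slides: it assumes Euclidean distance, draws random initial centroids from the data points unless `init` is given, and stops when no point changes cluster. The names `kmeans`, `init` and `seed` are hypothetical.

```python
import math
import random

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (centroids, assignment, cost) for K-means on a list of tuples."""
    rng = random.Random(seed)
    # Assumption: random starting centroids are drawn from the data itself.
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the cost can no longer decrease
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return centroids, assignment, cost

# Two hypothetical square blobs of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, assignment, cost = kmeans(points, k=2, init=[(0.0, 0.0), (10.0, 10.0)])
```

Starting with one centroid in each blob recovers the two blobs. Other starting positions can converge to a different partition with a higher cost, which is the dependence on initial centroids noted in the summary.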

TSS vs K
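A curve of this kind can be computed by running K-means for several values of K and recording the cost. The sketch assumes TSS denotes the total within-cluster sum of squared distances, and it uses a deterministic farthest-first initialization (an assumption made here so the example is reproducible, not something the slides specify). The data are three made-up groups on a line.

```python
import math

def farthest_first(points, k):
    """Deterministic start: repeatedly pick the point farthest from those chosen."""
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points,
                          key=lambda p: min(math.dist(p, c) for c in chosen)))
    return chosen

def kmeans_tss(points, k, max_iter=100):
    """Run K-means and return the total within-cluster sum of squares."""
    centroids = farthest_first(points, k)
    assignment = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
               for p in points]
        if new == assignment:
            break
        assignment = new
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return sum(math.dist(p, centroids[a]) ** 2
               for p, a in zip(points, assignment))

# Three hypothetical groups of three points each.
points = [(x,) for x in (0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0)]
tss = {k: kmeans_tss(points, k) for k in range(1, 5)}
```

On this data the cost drops sharply up to K = 3 and only slightly afterwards. A bend of that kind is one common heuristic for choosing K.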

Iris setosa

Iris versicolor

Iris virginica

50 specimens from each group

4 numbers for each flower

150 data points in 4-dimensional space (d=4)

3 large clusters

Output of SPC

Stable clusters “live” for large T

Choosing a value for T

Same data - Average Linkage

No analog for

Examining this cluster

Coupled Two-Way Clustering (CTWC) of 358 Genes and 36 Samples (Fig. 2A)

[Figure labels: A(II), ScGBM, PrGBM, CL; GENES; S1(G1), S2, S3; T; G12, G5]

GLIOBLASTOMA: M. HEGI et al., CHUV, CLONTECH ARRAYS

AB004904  STAT-induced STAT inhibitor 3
M32977    VEGF
M35410    IGFBP2
X51602    VEGFR1
M96322    gravin
AB004903  STAT-induced STAT inhibitor 2
X52946    PTN
J04111    c-jun
X79067    TIS11B

Super-Paramagnetic Clustering of All Samples Using Stable Gene Cluster G5 (Fig. 2B)

[Figure labels: S1(G5), S10, S11, S12, S13, S14]
