UNSUPERVISED ANALYSIS. GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS.
Post on 18-Dec-2015
UNSUPERVISED ANALYSIS
•GOAL A: FIND GROUPS OF GENES THAT HAVE
CORRELATED EXPRESSION PROFILES. THESE GENES ARE
BELIEVED TO BELONG TO THE SAME BIOLOGICAL
PROCESS.
•GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR
GENE EXPRESSION PROFILES. THESE TISSUES ARE
EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL)
STATE.
CLUSTERING
DEFINITION OF THE CLUSTERING PROBLEM
[Figure: giraffe]
CLUSTER ANALYSIS YIELDS DENDROGRAM
[Dendrogram axis: T (RESOLUTION)]
[Figure: giraffe + okapi]
BUT WHAT ABOUT THE OKAPI?
STATEMENT OF THE PROBLEM
GIVEN DATA POINTS Xi, i=1,2,...,N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.
AIMS:
• PARTITION THE DATA INTO M CLUSTERS, POINTS OF SAME CLUSTER - "MORE SIMILAR". M ALSO TO BE DETERMINED!
• GENERATE DENDROGRAM
• IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS
"ILL POSED": WHAT IS "MORE SIMILAR"?
RESOLUTION
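Why "more similar" is ill posed can be seen on a toy example: the same three profiles give a different nearest neighbor under two common distance measures. The sketch below is illustrative only; the gene profiles and the helper names (`euclidean`, `pearson_distance`, `nearest`) are hypothetical and not taken from the slides.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: small when two profiles rise and fall together."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def nearest(target, candidates, dist):
    """Name of the candidate profile closest to `target` under `dist`."""
    return min(candidates, key=lambda name: dist(target, candidates[name]))

# Hypothetical expression profiles of three genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "gene_b": [10.0, 20.0, 30.0, 40.0],  # same trend as gene_a, larger scale
    "gene_c": [4.0, 3.0, 2.0, 1.0],      # same scale as gene_a, opposite trend
}

closest_euclidean = nearest(gene_a, candidates, euclidean)
closest_pearson = nearest(gene_a, candidates, pearson_distance)
print(closest_euclidean, closest_pearson)
```

Euclidean distance pairs gene_a with the anti-correlated gene_c, while the correlation-based distance pairs it with gene_b, so the clusters found depend on which notion of similarity is chosen.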
CLUSTER ANALYSIS YIELDS DENDROGRAM
[Dendrogram: axis T; LINEAR ORDERING OF DATA from YOUNG to OLD]
Agglomerative Hierarchical Clustering
[Figure: five points labeled 1-5 and their dendrogram, leaves in the order 5 2 4 1 3; axis: distance between joined clusters]
Need to define the distance between the new cluster and the other clusters.
Single Linkage: distance between closest pair.
Complete Linkage: distance between farthest pair.
Average Linkage: average distance between all pairs,
or distance between cluster centers.
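The linkage rules above can be written down directly. This is a minimal sketch assuming Euclidean distance between points; the function names and the two example clusters are illustrative and not part of the slides.

```python
import math
from itertools import product

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Average distance over all pairs of points, one from each cluster."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(points):
    """Coordinate-wise mean of a cluster."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def centroid_linkage(a, b):
    """Distance between the cluster centers."""
    return math.dist(centroid(a), centroid(b))

# Two hypothetical clusters of 2-D points.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, round(rule(cluster_a, cluster_b), 3))
```

For these two clusters the four rules do not agree: single and centroid linkage give 3, complete linkage gives the square root of 13, and average linkage falls in between.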
Dendrogram
The dendrogram induces a linear ordering of the data points
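One way to see where the linear ordering comes from is a naive agglomerative procedure that keeps each cluster's members in leaf order and concatenates the two lists at every merge. The sketch below is an assumption-laden illustration (Euclidean distance, brute-force search for the closest pair of clusters, hypothetical function name `agglomerate`), not the implementation used for the slides.

```python
import math
from itertools import combinations, product

def agglomerate(points, linkage=min):
    """Greedy agglomerative clustering of a non-empty list of points.

    `linkage` reduces the pairwise distances between two clusters to one
    number: min gives single linkage, max gives complete linkage.
    Returns (merges, leaf_order), where each merge is
    (left_members, right_members, merge_distance).
    """
    # Each cluster is a list of point indices kept in dendrogram leaf order.
    clusters = [[i] for i in range(len(points))]
    merges = []

    def cluster_distance(pair):
        a, b = pair
        return linkage(math.dist(points[i], points[j]) for i, j in product(a, b))

    while len(clusters) > 1:
        # Greedy step: join the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((list(a), list(b), cluster_distance((a, b))))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a + b)  # concatenating leaf lists yields the ordering
    return merges, clusters[0]

# Hypothetical 1-D data: two tight pairs and one distant point.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
merges, leaf_order = agglomerate(points, linkage=min)
heights = [height for _, _, height in merges]
print(heights)
print(leaf_order)
```

The ordering is not unique: the two branches at any merge can be swapped without changing the tree, so a dendrogram with N leaves is consistent with 2^(N-1) linear orderings.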
Hierarchical Clustering - Summary
• Results depend on distance update method
• Greedy iterative process
• NOT robust against noise
• No inherent measure to identify stable clusters
2 good clouds: COMPACT, WELL SEPARATED CLOUDS - EVERYTHING WORKS
2 flat clouds: SINGLE LINKAGE WORKS
Filament: SINGLE LINKAGE SENSITIVE TO NOISE
Average linkage
[Figure: five points labeled 1-5 and their dendrogram, leaves in the order 5 2 4 1 3; axis: distance between joined clusters]
Need to define the distance between the new cluster and the other clusters.
Average Linkage: average distance between all pairs
Mean Linkage: distance between centroids
Dendrogram
nature 2002 breast cancer
how many clusters?
3 LARGE; MANY small (SPC)
toy problem SPC
other methods
K-means
[Figure panels labeled Iteration = 0, 1, 1, 3]
•Start with random positions of centroids.
•Assign data points to centroids
•Move centroids to center of assigned points
•Iterate till minimal cost
K-means - Summary
• Result depends on initial centroids' position
• Fast algorithm: compute distances from data points to centroids
• Must preset K
• Fails for non-spherical distributions
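The K-means steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: Euclidean distance, initial centroids drawn from the data points themselves (the slides only say "random positions"), and a stop when assignments no longer change. The function name `kmeans` and its parameters `max_iter` and `seed` are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of coordinate tuples.

    Returns (centroids, assignment), where assignment[i] is the index of
    the centroid that point i belongs to.
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct, randomly chosen data points.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every data point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the cost can no longer decrease
        assignment = new_assignment
        # Move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Two hypothetical compact, well separated clouds.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

K must be passed in by the caller, and a different `seed` can give a different result on less cleanly separated data, which matches the first and third points of the summary.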
TSS vs K
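The slide does not define TSS. Assuming it denotes the total within-cluster sum of squared distances to the cluster centers, the sketch below shows how that quantity falls as K grows. The data points and the partitions for each K are hand-picked for illustration, not produced by any algorithm from the slides.

```python
def total_sum_of_squares(points, labels):
    """Sum over clusters of squared distances from members to their center."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    tss = 0.0
    for members in clusters.values():
        center = [sum(coords) / len(members) for coords in zip(*members)]
        tss += sum(
            sum((x - c) ** 2 for x, c in zip(point, center))
            for point in members
        )
    return tss

# Hypothetical 1-D data with two obvious groups.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]

# Hand-chosen partitions for K = 1..4 (labels[i] is the cluster of point i).
partitions = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
    4: [0, 1, 2, 3],
}
tss_by_k = {k: total_sum_of_squares(points, labels)
            for k, labels in partitions.items()}
print(tss_by_k)
```

TSS always reaches zero when every point is its own cluster, so the useful signal is where the curve flattens: here the large drop happens between K=1 and K=2, and further splits gain little.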
Iris setosa
Iris versicolor
Iris virginica
50 specimens from each group
4 numbers for each flower
150 data points in 4-dimensional space
3 large clusters
Output of SPC
Stable clusters "live" for large T
Choosing a value for T
Same data - Average Linkage
No analog for
Examining this cluster
Coupled Two-Way Clustering (CTWC) of 358 Genes and 36 Samples
Fig. 2A
GLIOBLASTOMA: M. HEGI et al., CHUV, CLONTECH ARRAYS
[Figure labels: A(II), ScGBM, PrGBM, CL; GENES; T; S1(G1), S2, S3; G12, G5]
AB004904 STAT-induced STAT inhibitor 3
M32977 VEGF
M35410 IGFBP2
X51602 VEGFR1
M96322 gravin
AB004903 STAT-induced STAT inhibitor 2
X52946 PTN
J04111 c-jun
X79067 TIS11B
Super-Paramagnetic Clustering of All Samples Using Stable Gene Cluster G5
Fig. 2B
[Figure labels: S1(G5), S10, S11, S12, S13, S14]