Discriminant Analysis Cluster Analysis
8/3/2019 Ahmed Rebai DA-Cluster
http://slidepdf.com/reader/full/ahmed-rebai-da-cluster 1/54
Ahmed Rebai Bioinformatics and Comparative Genome Analysis March 2007
Classification and clustering
Classification:
- known number of classes
- based on a training set
- used to classify future observations
Clustering:
- unknown number of classes
- no prior knowledge
- used to understand (explore) data
Classification is a form of supervised learning; clustering is a form of unsupervised learning.
Linear Discriminant Analysis
Classification
In classification you do have a class label (o and x) for each data element, and each element is defined in terms of its G1 and G2 values.
You are trying to find a model that splits the data elements into their existing classes.
You then assume that this model can be used to assign new data points (x? and y? in the plot) to the right class.
[Figure: two scatter plots of the two classes (* and o) against G1 and G2, captioned "Supervised Learning"; the new points to be classified are marked x? and y?.]
Linear Discriminant Analysis
Proposed by Fisher (1936) for classifying an observation into one of two possible groups based on many measurements $x_1, x_2, \ldots, x_p$.
Seek a linear transformation of the variables, $Y = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p$, such that the separation between the group means on the transformed scale is the best possible.
How to calculate the $a_i$?
The coefficients maximize the ratio of the between-group sum of squares to the within-group sum of squares, that is,
$a^T B a / a^T W a$
The vector $a$ is the eigenvector of $W^{-1} B$ corresponding to the largest eigenvalue.
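One way to see why an eigenvector appears is to set the gradient of the ratio to zero. The short derivation below is only a sketch: it assumes that $B$ and $W$ are symmetric and that $W$ is invertible, and $\lambda$ is a symbol introduced here for the value of the ratio.

```latex
\lambda(a) = \frac{a^{T} B a}{a^{T} W a}, \qquad
\frac{\partial \lambda}{\partial a}
  = \frac{2\,B a\,(a^{T} W a) - 2\,W a\,(a^{T} B a)}{(a^{T} W a)^{2}} = 0
\;\Longrightarrow\; B a = \lambda\, W a
\;\Longrightarrow\; W^{-1} B\, a = \lambda\, a .
```

At a stationary point the ratio equals the eigenvalue $\lambda$, so the maximum is reached at the eigenvector belonging to the largest eigenvalue.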
For two groups
We find $a = C^{-1}(\bar{x}_1 - \bar{x}_2)$, where
- $\bar{x}_1$ and $\bar{x}_2$ are the mean vectors of the groups; for Group 1, $\bar{x}_1 = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_p)^T$
- $C$ is the pooled covariance matrix of the groups
We find
$B = \left(\frac{n_1 n_2}{n}\right) d\, d^T$, where $d = \bar{x}_1 - \bar{x}_2$
$W^{-1} B$ has only one nonzero eigenvalue, which is
$\mathrm{Tr}(W^{-1} B) = \left(\frac{n_1 n_2}{n}\right) d^T W^{-1} d$
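The two-group formulas can be checked numerically. The sketch below is a minimal illustration in plain Python for p = 2 variables; the two small groups are made-up data, the helper names are hypothetical, and it assumes $n = n_1 + n_2$ and $C = W / (n - 2)$. It computes $a = C^{-1}(\bar{x}_1 - \bar{x}_2)$ and then checks that $a$ satisfies $W^{-1} B a = \lambda a$ with $\lambda = (n_1 n_2 / n)\, d^T W^{-1} d$.

```python
# Fisher's two-group discriminant for p = 2 variables, standard library only.
# The data are invented purely for illustration.

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the group mean (2 x 2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        dev = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += dev[i] * dev[j]
    return s

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mat_vec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

group1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
group2 = [(6.0, 5.0), (7.0, 7.0), (8.0, 6.0), (7.0, 5.0)]
n1, n2 = len(group1), len(group2)
n = n1 + n2

m1, m2 = mean_vector(group1), mean_vector(group2)
d = [m1[0] - m2[0], m1[1] - m2[1]]          # d = mean1 - mean2

# W: pooled within-group sum of squares; C: pooled covariance (assumed W / (n - 2)).
s1, s2 = scatter_matrix(group1, m1), scatter_matrix(group2, m2)
W = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
C = [[W[i][j] / (n - 2) for j in range(2)] for i in range(2)]

a = mat_vec(inverse_2x2(C), d)              # discriminant coefficients

# Check the eigenvector property W^{-1} B a = lambda * a.
W_inv = inverse_2x2(W)
B = [[(n1 * n2 / n) * d[i] * d[j] for j in range(2)] for i in range(2)]
w_inv_d = mat_vec(W_inv, d)
lam = (n1 * n2 / n) * (d[0] * w_inv_d[0] + d[1] * w_inv_d[1])
lhs = mat_vec(W_inv, mat_vec(B, a))
rhs = [lam * a[0], lam * a[1]]

def score(x):
    """Discriminant score Y = a1*x1 + a2*x2 for one observation."""
    return a[0] * x[0] + a[1] * x[1]

print("a =", a, "lambda =", lam)
```

With this sign convention, observations from group 1 receive higher scores than those from group 2, so a new observation could be assigned by comparing its score with a cut-off between the two projected group means.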
Cluster Analysis
Arranging objects into groups is a natural and necessary skill that we all share.
Try to place these faces into groups?
You can measure variables
case  sex  glasses  moustache  smile  hat
1     m    y        n          y      n
2     f    n        n          y      n
3     m    y        n          n      n
4     m    n        n          n      n
5     m    n        n          y?     n
6     m    n        y          n      y
7     m    y        n          y      n
8     m    n        n          y      n
9     m    y        y          y      n
10    f    n        n          n      n
11    m    n        y          n      n
12    f    n        n          n      n
Data
A set of n objects (observations)
measured for p variables
Variables can be binary, continuous, or a mixture of both
Rationale
A set of tools for building groups (clusters)
from multivariate data objects
Groups should have homogeneous properties
The clusters should be as homogeneous as possible and the differences among the various groups as large as possible
Homogeneity and Separation Principles
Homogeneity: Elements within a cluster are close to each other
Separation: Elements in different clusters are further apart from each other
...clustering is not an easy task!
Given these points, a clustering algorithm might make two distinct clusters as follows
Bad Clustering
This clustering violates both
Homogeneity and Separation
Close distances
from points in
separate clusters
Far distances from
points in the same
cluster
Clustering Techniques
Agglomerative: Start with every element in its own cluster, and iteratively join clusters together
Divisive: Start with one cluster and iteratively divide it into smaller clusters
Hierarchical: Organize elements into a tree; leaves represent genes and the lengths of the branches represent the distances between genes. Similar genes lie within the same subtrees
Two fundamental steps
Choice of a proximity measure: each pair of observations (objects) is checked for the similarity of their values. A similarity measure is defined to measure the 'closeness' of the objects.
Choice of group-building algorithm: on the basis of the proximity measure, the objects are assigned to groups so that the differences between groups become large
Proximity measure
The proximity is defined by a square matrix D containing measures of similarity (or dissimilarity) between each pair of objects (similarity or distance matrix)
Similarity of binary variables
            Obj
            1    0
Obj    1    a1   a2
       0    a3   a4
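The four cell counts can be turned into a similarity in several ways. As a rough sketch, the code below assumes that a1 counts variables where both objects are 1, a4 counts variables where both are 0, and a2 and a3 count the two kinds of mismatch; it then computes two commonly used coefficients, simple matching and Jaccard. The choice of these two coefficients and the function names are assumptions made here, not something stated on the slide. The example vectors encode glasses, moustache, smile and hat (y = 1, n = 0) for cases 1 and 9 of the faces table.

```python
def binary_counts(u, v):
    """Return (a1, a2, a3, a4) for two equal-length 0/1 vectors."""
    a1 = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # both present
    a2 = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)  # only in u
    a3 = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)  # only in v
    a4 = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)  # both absent
    return a1, a2, a3, a4

def simple_matching(u, v):
    """Share of variables on which the two objects agree."""
    a1, a2, a3, a4 = binary_counts(u, v)
    return (a1 + a4) / (a1 + a2 + a3 + a4)

def jaccard(u, v):
    """Like simple matching, but joint absences (a4) are ignored."""
    a1, a2, a3, _ = binary_counts(u, v)
    denominator = a1 + a2 + a3
    # Convention chosen here: two all-zero objects get similarity 0.
    return a1 / denominator if denominator else 0.0

# glasses, moustache, smile, hat for cases 1 and 9 of the faces table
case_1 = [1, 0, 1, 0]
case_9 = [1, 1, 1, 0]

print(binary_counts(case_1, case_9))
print(simple_matching(case_1, case_9), jaccard(case_1, case_9))
```

Both coefficients treat a2 and a3 symmetrically, so the result does not depend on which object is placed in the rows of the table.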
Similarity between continuous variables
$L_r$-norm distances
Commonly the $L_2$-norm (Euclidean distance) is used
This assumes that the variables are measured on the same scale; if not, the variables should be standardized
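As a small illustration of these points, the sketch below computes the $L_r$-norm distance $\left(\sum_k |x_k - y_k|^r\right)^{1/r}$ and standardizes each variable to mean 0 and unit sample standard deviation before measuring distances. The z-score is only one common way to standardize and is an assumption here, as are the function names and the height/weight data.

```python
import math

def lr_distance(x, y, r=2):
    """L_r-norm distance; r = 2 gives Euclidean, r = 1 city-block distance."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def standardize(rows):
    """Rescale every column to mean 0 and sample standard deviation 1."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1))
           for j in range(p)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in rows]

# Hypothetical objects measured on different scales: height (cm), weight (kg).
objects = [[170.0, 60.0], [180.0, 80.0], [160.0, 55.0]]
scaled = standardize(objects)

raw_distance = lr_distance(objects[0], objects[1])
scaled_distance = lr_distance(scaled[0], scaled[1])
print(raw_distance, scaled_distance)
```

Without standardization, the variable with the largest numeric range tends to dominate the distance.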
Contingency tables
The distance between rows is the chi-square distance
Cluster algorithms: agglomerative
Rationale
Example: an 8-point problem
And this gives
K-means clustering
There is no need to calculate a distance
matrix at first
You decide on the number of clusters you want to divide your objects into
The computer randomly assigns each
object to one of the K clusters
Now we calculate the distance between
each object and the center of each cluster
K-means clustering
If the object is closer to the center of
another cluster than the one it is currently
assigned to, it is reassigned to the closer
cluster
Recalculate the centroids
Do a number of iterations of this procedure
until the clusters no longer change and the
algorithm stops
K-Means Clustering: Lloyd Algorithm
1. Lloyd Algorithm
2.   Arbitrarily assign the k cluster centers
3.   while the cluster centers keep changing
4.     Assign each data point to the cluster $C_i$ corresponding to the closest cluster representative (center) $x_i$ ($1 \le i \le k$)
5.     After the assignment of all n data points, compute new cluster representatives according to the center of gravity of each existing cluster, that is, the new cluster representative is $\frac{1}{|C|} \sum_{v \in C} v$ for every cluster C
*This may lead to merely a locally optimal clustering.
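The pseudocode above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the initial centers are passed in explicitly so that the run is reproducible (the slide only says they are chosen arbitrarily), an empty cluster simply keeps its previous center, and the six 2-D points, the `max_iter` cap and the function names are all assumptions made for the example.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_center(point, centers):
    """Index of the center nearest to point (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

def center_of_gravity(points):
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def lloyd(points, initial_centers, max_iter=100):
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        # Step 4: assign every point to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest_center(p, centers)].append(p)
        # Step 5: move each center to the center of gravity of its cluster.
        new_centers = [center_of_gravity(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:   # step 3: stop once nothing changes
            break
        centers = new_centers
    labels = [closest_center(p, centers) for p in points]
    return centers, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (5.0, 4.0), (4.5, 5.0), (5.0, 5.0)]
centers, labels = lloyd(points, initial_centers=[(1.0, 1.0), (5.0, 5.0)])
print(centers, labels)
```

Because the result depends on the starting centers, a common precaution is to run the procedure from several different starts and keep the partition with the smallest total within-cluster distance.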
[Figure: four successive K-means plots of expression in condition 2 against expression in condition 1 (both axes from 0 to 5), each showing the cluster centers k1, k2 and k3.]
The problem is
You get what you asked for: the number of final clusters is the number you chose at the beginning
One solution is to try different choices of the number of clusters
You can use other techniques (PCA) to get an idea of the number of 'major' clusters
K-means vs hierarchical clustering
This method differs from hierarchical clustering in many ways. In particular,
- There is no hierarchy; the data are partitioned. You will be presented only with the final cluster membership for each case.
- There is no role for the dendrogram in k-means clustering.
- You must supply the number of clusters (k) into which the data are to be grouped.
Inferring Gene Functionality
Researchers want to know the functions of new genes
Simply comparing the new gene sequences to known DNA sequences often does not give away the actual function of the gene
For 40% of sequenced genes, functionality cannot be ascertained by only comparing to sequences of other known genes
Microarrays allow biologists to infer gene function even when there is not enough evidence to infer function based on similarity alone
Microarray Analysis
Microarrays measure the activity (expression level) of the gene under varying conditions/time points
Expression level is estimated by measuring the amount of mRNA for that particular gene
A gene is active if it is being transcribed
More mRNA usually indicates more gene activity
Microarray Experiments
Analyze mRNA produced from cells in the tissue with the environmental conditions you are testing
Produce cDNA from mRNA (DNA is more stable)
Attach phosphor to cDNA to see when a particular gene is expressed
Different color phosphors are available to compare many samples at once
Hybridize cDNA over the microarray
Scan the microarray with a phosphor-illuminating laser
Illumination reveals transcribed genes
Scan the microarray multiple times for the different color phosphors
Using Microarrays
Each box represents one gene's expression over time
Track the sample over a period of time to see gene expression over time
Track two different samples under the same conditions to see the difference in gene expressions
Using Microarrays (cont'd)
Green: expressed only in the control cell
Red: expressed only in the experimental cell
Yellow: equally expressed in both samples
Black: NOT expressed in either control or experimental cells
Microarray Data
Microarray data are usually transformed into an intensity matrix (below)
The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related
Clustering comes into play
Time:    Time X  Time Y  Time Z
Gene 1   10      8       10
Gene 2   10      0       9
Gene 3   4       8.6     3
Gene 4   7       8       3
Gene 5   1       2       3
Intensity (expression level) of gene at measured time
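To make the idea of correlating genes concrete, the sketch below computes the Pearson correlation between expression profiles taken from the intensity matrix above. Using Pearson correlation as the measure is an assumption made for illustration (the slide does not say which measure is intended), and the helper name is hypothetical.

```python
import math

# Rows of the intensity matrix: expression at Time X, Time Y, Time Z.
intensity = {
    "Gene 1": (10, 8, 10),
    "Gene 2": (10, 0, 9),
    "Gene 3": (4, 8.6, 3),
    "Gene 4": (7, 8, 3),
    "Gene 5": (1, 2, 3),
}

def pearson(u, v):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cross = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cross / math.sqrt(ss_u * ss_v)

r_12 = pearson(intensity["Gene 1"], intensity["Gene 2"])
r_13 = pearson(intensity["Gene 1"], intensity["Gene 3"])
print(f"Gene 1 vs Gene 2: {r_12:.3f}")
print(f"Gene 1 vs Gene 3: {r_13:.3f}")
```

In this small table, Gene 1 and Gene 2 rise and fall together across the three time points, while Gene 1 and Gene 3 move in opposite directions; with only three time points such correlations are of course very noisy.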
Clustering of Microarray Data
Plot each datum as a point in N-dimensional space
Make a distance matrix for the distance between every two gene points in the N-dimensional space
Genes with a small distance share the same expression characteristics and might be functionally related or similar!
Clustering reveals groups of functionally related genes
Hierarchical clustering
Step 1: Transform the genes * experiments matrix into a genes * genes distance matrix
         Exp 1  Exp 2  Exp 3  Exp 4
Gene A
Gene B
Gene C

         Gene A  Gene B  Gene C
Gene A   0
Gene B   ?       0
Gene C   ?       ?       0
Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains
Data and distance matrix
Genes  Patient 1  Patient 2
A      90         190
B      190        390
C      90         110
D      200        400
E      150        200

     A     B      C      D      E
A    0.0   223.6  80.0   237.1  60.8
B          0.0    297.3  14.1   194.2
C                 0.0    310.2  108.2
D                        0.0    206.2
E                               0.0
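The distance matrix above can be reproduced from the data with the Euclidean distance between gene profiles. The sketch below is a minimal check in plain Python; the variable and function names are illustrative choices.

```python
import math

# Expression of each gene in patient 1 and patient 2, as in the table above.
expression = {
    "A": (90, 190),
    "B": (190, 390),
    "C": (90, 110),
    "D": (200, 400),
    "E": (150, 200),
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

genes = sorted(expression)
distance = {(g, h): euclidean(expression[g], expression[h])
            for g in genes for h in genes}

# Print the full symmetric matrix, one row per gene, rounded to one decimal.
print("  " + " ".join(f"{h:>6}" for h in genes))
for g in genes:
    print(g + " " + " ".join(f"{distance[g, h]:6.1f}" for h in genes))
```

Genes B and D are the closest pair (14.1), so an agglomerative method would join them first.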
Hierarchical clustering (continued)
        G1   G2   G3   G4   G5
G1      0
G2      2    0
G3      6    5    0
G4      10   9    4    0
G5      9    8    5    3    0

        G(12)  G3   G4   G5
G(12)   0
G3      6      0
G4      10     4    0
G5      9      5    3    0

        G(12)  G3   G(45)
G(12)   0
G3      6      0
G(45)   10     5    0

Stage  Groups
P5     [1], [2], [3], [4], [5]
P4     [1 2], [3], [4], [5]
P3     [1 2], [3], [4 5]
P2     [1 2], [3 4 5]
P1     [1 2 3 4 5]
[Figure: dendrogram with leaves 1 2 3 4 5]
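The stages above can be reproduced with a short agglomerative procedure. The reduced matrices are consistent with taking the largest pairwise distance between two clusters (complete linkage); that reading is inferred from the numbers rather than stated on the slide, and the function name and data layout below are illustrative assumptions.

```python
# Pairwise distances between G1..G5, taken from the first matrix above.
pairwise = {
    (1, 2): 2, (1, 3): 6, (2, 3): 5, (1, 4): 10, (2, 4): 9,
    (3, 4): 4, (1, 5): 9, (2, 5): 8, (3, 5): 5, (4, 5): 3,
}
dist = {frozenset(pair): d for pair, d in pairwise.items()}

def complete_linkage(dist, labels):
    """Agglomerative clustering; cluster distance = largest member distance."""
    clusters = [(label,) for label in labels]

    def cluster_distance(c1, c2):
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    stages = [list(clusters)]   # partition after each merge
    heights = []                # distance at which each merge happens
    while len(clusters) > 1:
        candidates = [(cluster_distance(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        height, i, j = min(candidates)          # closest pair of clusters
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        clusters.sort()
        stages.append(list(clusters))
        heights.append(height)
    return stages, heights

stages, heights = complete_linkage(dist, labels=[1, 2, 3, 4, 5])
for number, groups in zip(range(5, 0, -1), stages):
    print(f"P{number}", groups)
print("merge heights:", heights)
```

The merge heights are the levels at which the corresponding branches would join in the dendrogram.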
Clustering of Microarray Data (cont'd)
Clusters
Hierarchical Clustering
Hierarchical Clustering: Example