Ahmed Rebai DA-Cluster


Discriminant Analysis and Cluster Analysis



Ahmed Rebai Bioinformatics and Comparative Genome Analysis March 2007

Classification and clustering

Classification

known number of classes

based on a training set

used to classify future observations

Clustering

unknown number of classes

no prior knowledge

used to understand (explore) data

Classification is a form of supervised learning; clustering is a form of unsupervised learning



Linear Discriminant Analysis




Classification

In classification you do have a class label (o or *) for each data element, and each element is defined in terms of its G1 and G2 values.

You are trying to find a model that splits the data elements into their existing classes

You then assume that this model can be used to assign new data points x and y to the right class

[Figure: two scatter plots of the labelled points (* and o) on axes G1 and G2; the second panel, titled "Supervised Learning", adds the new points x? and y? to be classified]




Linear Discriminant Analysis

Proposed by Fisher (1936) for classifying an observation into one of two possible groups based on many measurements $x_1, x_2, \dots, x_p$.

Seek a linear transformation of the variables, $Y = a_1 x_1 + a_2 x_2 + \dots + a_p x_p$, such that the separation between the group means on the transformed scale is the best




How to calculate the $a_i$?

Coefficients that maximize the ratio of the between-group sum of squares to the within-group sum of squares, that is $a^T B a / a^T W a$

The vector $a$ is the eigenvector of $W^{-1}B$ corresponding to the largest eigenvalue
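The eigenvector recipe can be sketched numerically. This is a minimal illustration, not the lecture's own code: the function name `lda_direction` and the toy data are assumptions, and $B$ and $W$ are taken to be the usual between-group and within-group sums of squares and products.

```python
import numpy as np

def lda_direction(X, labels):
    """Return a unit vector a maximizing a^T B a / a^T W a, plus the top eigenvalue."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))  # between-group sums of squares and products
    W = np.zeros((p, p))  # within-group sums of squares and products
    for g in np.unique(labels):
        Xg = X[labels == g]
        mean_g = Xg.mean(axis=0)
        diff = (mean_g - grand_mean).reshape(-1, 1)
        B += len(Xg) * diff @ diff.T
        centered = Xg - mean_g
        W += centered.T @ centered
    # W^{-1} B is not symmetric in general, so use the general eigen-solver.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    top = np.argmax(eigvals.real)
    a = eigvecs[:, top].real
    return a / np.linalg.norm(a), eigvals[top].real, B, W

# Hypothetical toy data: two groups measured on p = 2 variables.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.5],
              [6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

a, lam, B, W = lda_direction(X, labels)
ratio = (a @ B @ a) / (a @ W @ a)  # equals the largest eigenvalue
print(a, lam, ratio)
```

At the returned vector the ratio $a^T B a / a^T W a$ equals the largest eigenvalue, and no other direction gives a larger ratio.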




For two groups

We find $a = C^{-1}(\bar{x}_1 - \bar{x}_2)$, where

$\bar{x}_1$ and $\bar{x}_2$ are the mean vectors of the groups, e.g. $\bar{x}_1 = (\bar{X}_1, \bar{X}_2, \dots, \bar{X}_p)^T$ for Group 1

$C$ is the pooled covariance matrix of the groups




We find

$B = \left(\frac{n_1 n_2}{n}\right) d\, d^T$, with $d = (\bar{x}_1 - \bar{x}_2)$

$W^{-1}B$ has only one eigenvalue, which is

$\mathrm{Tr}(W^{-1}B) = \left(\frac{n_1 n_2}{n}\right) d^T W^{-1} d$
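A quick numerical check of the two-group result, as a sketch on simulated data. The sample sizes, means and seed are hypothetical, $n = n_1 + n_2$, and the pooled covariance is assumed to be $C = W/(n-2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical samples: two groups measured on p = 3 variables.
X1 = rng.normal(loc=[0.0, 0.0, 0.0], size=(15, 3))
X2 = rng.normal(loc=[2.0, 1.0, 0.5], size=(10, 3))
n1, n2 = len(X1), len(X2)
n = n1 + n2

d = X1.mean(axis=0) - X2.mean(axis=0)          # difference of mean vectors
W = (X1 - X1.mean(axis=0)).T @ (X1 - X1.mean(axis=0)) \
    + (X2 - X2.mean(axis=0)).T @ (X2 - X2.mean(axis=0))
C = W / (n - 2)                                # pooled covariance (assumed definition)
B = (n1 * n2 / n) * np.outer(d, d)             # between-group matrix for two groups

a = np.linalg.solve(C, d)                      # a = C^{-1} (x1bar - x2bar)
M = np.linalg.solve(W, B)                      # W^{-1} B
lam = (n1 * n2 / n) * d @ np.linalg.solve(W, d)

print(np.trace(M), lam)                        # the trace equals the single eigenvalue
print(np.allclose(M @ a, lam * a))             # a is the corresponding eigenvector
```

Because $B$ has rank one, $W^{-1}B$ has a single non-zero eigenvalue, so its trace equals that eigenvalue.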



Cluster Analysis

Arranging objects into groups is a natural and necessary skill that we all share.




Try to place these faces into groups?




You can measure variables

case  sex  glasses  moustache  smile  hat
1     m    y        n          y      n
2     f    n        n          y      n
3     m    y        n          n      n
4     m    n        n          n      n
5     m    n        n          y?     n
6     m    n        y          n      y
7     m    y        n          y      n
8     m    n        n          y      n
9     m    y        y          y      n
10    f    n        n          n      n
11    m    n        y          n      n
12    f    n        n          n      n







Data

A set of n objects (observations)

measured for p variables

Variables can be binary, continuous, or a mixture of both





Rationale

A set of tools for building groups (clusters) from multivariate data objects

Groups should have homogeneous properties

The clusters should be as homogeneous as possible and the differences among the various groups as large as possible





Homogeneity and Separation Principles

Homogeneity: Elements within a cluster are close to each other

Separation: Elements in different clusters are further apart from each other

...clustering is not an easy task!

Given these points, a clustering algorithm might make two distinct clusters as follows





Bad Clustering

This clustering violates both

Homogeneity and Separation

Close distances

from points in

separate clusters

Far distances from

points in the same

cluster







Clustering Techniques

Agglomerative: Start with every element in its own cluster, and iteratively join clusters together

Divisive: Start with one cluster and iteratively divide it into smaller clusters

Hierarchical: Organize elements into a tree; leaves represent genes and the lengths of the branches represent the distances between genes. Similar genes lie within the same subtrees





Two fundamental steps

Choice of a proximity measure: each pair of observations (objects) is checked for the similarity of their values. A similarity measure is defined to measure the 'closeness' of the objects.

Choice of group-building algorithm: on the basis of the proximity measure, the objects are assigned to groups so that differences between groups become large





Proximity measure

The proximity is defined by a square matrix D containing measures of similarity (or dissimilarity) between each pair of objects (similarity or distance matrix)





Similarity of binary variables

                Obj i = 1   Obj i = 0
Obj j = 1       a1          a2
Obj j = 0       a3          a4
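The four counts can be turned into a similarity score. The sketch below assumes that a1 counts variables equal to 1 in both objects, a2 those with Obj i = 0 and Obj j = 1, a3 those with Obj i = 1 and Obj j = 0, and a4 those equal to 0 in both. It then computes two commonly used coefficients, simple matching and Jaccard, which are named here as examples and are not taken from the slide. The 0/1 coding (m = 1, f = 0; y = 1, n = 0) and the function name `binary_counts` are also assumptions.

```python
def binary_counts(xi, xj):
    """Count agreements and disagreements between two binary vectors."""
    a1 = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 1)  # both 1
    a2 = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 1)  # i = 0, j = 1
    a3 = sum(1 for u, v in zip(xi, xj) if u == 1 and v == 0)  # i = 1, j = 0
    a4 = sum(1 for u, v in zip(xi, xj) if u == 0 and v == 0)  # both 0
    return a1, a2, a3, a4

# Cases 1 and 2 of the faces table: sex, glasses, moustache, smile, hat.
case_1 = [1, 1, 0, 1, 0]  # m y n y n
case_2 = [0, 0, 0, 1, 0]  # f n n y n

a1, a2, a3, a4 = binary_counts(case_1, case_2)
simple_matching = (a1 + a4) / (a1 + a2 + a3 + a4)  # counts joint absences as agreement
jaccard = a1 / (a1 + a2 + a3)                      # ignores joint absences
print((a1, a2, a3, a4), simple_matching, jaccard)
```

The two coefficients differ only in whether joint absences (a4) count as agreement, which matters when most variables are 0 for most objects.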





Similarity between continuous variables

$L_r$-norm distances

Commonly the $L_2$-norm (Euclidean distance) is used

This assumes that the variables are measured on the same scale; if not, the variables should be standardized
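As a sketch, the $L_r$-norm distance between two observations $x$ and $y$ can be written with the standard definition $d_r(x, y) = \left(\sum_k |x_k - y_k|^r\right)^{1/r}$. That formula is supplied here and is not recovered from the slide. The function names and the height/weight data are hypothetical, and standardization is taken to mean centering each variable and dividing by its sample standard deviation.

```python
import numpy as np

def lr_distance(x, y, r=2):
    """L_r-norm distance; r=1 is the city-block distance, r=2 the Euclidean distance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** r) ** (1.0 / r))

def standardize(X):
    """Center each column and scale it to unit sample standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

d1 = lr_distance([0.0, 0.0], [3.0, 4.0], r=1)
d2 = lr_distance([0.0, 0.0], [3.0, 4.0], r=2)
print(d1, d2)

# Hypothetical data on very different scales: height in cm, weight in kg.
X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 50.0], [175.0, 90.0]])
Z = standardize(X)
print(lr_distance(X[0], X[1]), lr_distance(Z[0], Z[1]))
```

Without standardization the variable with the larger numeric range dominates the distance. After standardization both variables contribute on a comparable scale.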





Contingency tables

The distance between rows is the chi-square distance
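A minimal sketch of the chi-square distance between the rows of a contingency table. It assumes the usual definition from correspondence analysis: squared differences between row profiles, each weighted by the inverse of the corresponding column mass. The function name and the table are hypothetical.

```python
import numpy as np

def chi_square_row_distances(table):
    """Pairwise chi-square distances between the rows of a contingency table."""
    N = np.asarray(table, dtype=float)
    row_profiles = N / N.sum(axis=1, keepdims=True)  # each row divided by its total
    col_mass = N.sum(axis=0) / N.sum()               # column totals over grand total
    diff = row_profiles[:, None, :] - row_profiles[None, :, :]
    return np.sqrt((diff ** 2 / col_mass).sum(axis=2))

# Hypothetical 3 x 3 table; rows 0 and 1 have identical profiles.
table = np.array([[20, 10, 10],
                  [40, 20, 20],
                  [10, 30, 20]])
D = chi_square_row_distances(table)
print(np.round(D, 3))
```

Rows with the same proportions across columns are at distance zero whatever their totals, which is the point of comparing profiles and not raw counts.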





Cluster algorithms: agglomerative





Rationale





Example: an 8-point problem







And this gives





K-means clustering

There is no need to calculate a distance matrix at first

You decide on the number of clusters you want to divide your objects into

The computer randomly assigns each object to one of the K clusters

Now we calculate the distance between each object and the center of each cluster





K-means clustering

If the object is closer to the center of

another cluster than the one it is currently

assigned to, it is reassigned to the closer

cluster

Recalculate the centroids

Do a number of iterations of this procedure

until the clusters no longer change and the

algorithm stops





K-Means Clustering: Lloyd Algorithm

1. Lloyd Algorithm
2. Arbitrarily assign the k cluster centers
3. while the cluster centers keep changing
4.     Assign each data point to the cluster $C_i$ corresponding to the closest cluster representative (center) $x_i$ ($1 \le i \le k$)
5.     After the assignment of all n data points, compute new cluster representatives according to the center of gravity of each existing cluster, that is, the new cluster representative is $\frac{1}{|C|}\sum_{v \in C} v$ for every cluster $C$

*This may lead to merely a locally optimal clustering.
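The pseudocode can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not a reference implementation: the function name `lloyd_kmeans`, the seed, the iteration cap, and the choice of k random data points as the "arbitrary" initial centers are all assumptions, as is keeping the old center when a cluster becomes empty. The points are hypothetical.

```python
import numpy as np

def lloyd_kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Arbitrary initial centers: k distinct data points chosen at random (assumption).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to the cluster with the closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New center = center of gravity (mean) of each cluster;
        # an empty cluster keeps its previous center.
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers stopped changing
            break
        centers = new_centers
    return labels, centers

# Hypothetical expression values of nine genes under two conditions.
points = np.array([[0.5, 0.5], [0.7, 0.9], [1.0, 0.6],
                   [4.0, 4.2], [4.3, 4.5], [4.5, 4.0],
                   [0.8, 4.2], [1.0, 4.5], [0.6, 4.6]])
labels, centers = lloyd_kmeans(points, k=3)
print(labels)
print(centers)
```

Different seeds can give different final partitions, which illustrates the remark about locally optimal clusterings. A common remedy is to run the procedure from several starting points and keep the partition with the smallest within-cluster sum of squares.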




[Figures: four successive k-means plots of expression in condition 1 (x-axis, 0 to 5) against expression in condition 2 (y-axis, 0 to 5), each showing the cluster centers k1, k2 and k3]





The problem is

You get what you asked for : the number of

final clusters is the number you choose at

the beginning

One solution is to try different choices of

the number of cluster

Can use other techniques (PCA) to get an

idea on the number of µmajor¶ clusters





K-means vs hierarchical clustering

This method differs from hierarchical clustering in many ways. In particular,

- There is no hierarchy; the data are partitioned. You will be presented only with the final cluster membership for each case.

- There is no role for the dendrogram in k-means clustering.

- You must supply the number of clusters (k) into which the data are to be grouped.







Inferring Gene Functionality

Researchers want to know the functions of new genes

Simply comparing the new gene sequences to known DNA sequences often does not give away the actual function of a gene

For 40% of sequenced genes, functionality cannot be ascertained by only comparing to sequences of other known genes

Microarrays allow biologists to infer gene function even when there is not enough evidence to infer function based on similarity alone





Microarray Analysis

Microarrays measure the activity (expression level) of the gene under varying conditions/time points

Expression level is estimated by measuring the amount of mRNA for that particular gene

A gene is active if it is being transcribed

More mRNA usually indicates more gene activity





Microarray Experiments

Analyze mRNA produced from cells in the tissue with the environmental conditions you are testing

Produce cDNA from mRNA (DNA is more stable)

Attach phosphor to cDNA to see when a particular gene is expressed

Different color phosphors are available to compare many samples at once

Hybridize cDNA over the microarray

Scan the microarray with a phosphor-illuminating laser

Illumination reveals transcribed genes

Scan microarray multiple times for the different color phosphors





Using Microarrays

Each box represents one gene's expression over time

Track the sample over a period of time to see gene expression over time

Track two different samples under the same conditions to see the difference in gene expressions





Using Microarrays (cont'd)

Green: expressed only from control

Red: expressed only from experimental cell

Yellow: equally expressed in both samples

Black: NOT expressed in either control or experimental cells





Microarray Data

Microarray data are usually transformed into an intensity matrix (below)

The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related

Clustering comes into play

Time:    Time X  Time Y  Time Z
Gene 1   10      8       10
Gene 2   10      0       9
Gene 3   4       8.6     3
Gene 4   7       8       3
Gene 5   1       2       3

Each entry is the intensity (expression level) of the gene at the measured time





Clustering of Microarray Data

Plot each datum as a point in N-dimensional space

Make a distance matrix for the distance between every two gene points in the N-dimensional space

Genes with a small distance share the same expression characteristics and might be functionally related or similar!

Clustering reveals groups of functionally related genes
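These steps can be sketched on the five-gene intensity matrix from the Microarray Data slide (times X, Y, Z, so N = 3). Euclidean distance is an assumption here; the slides do not fix the metric for this example.

```python
import numpy as np

genes = ["Gene 1", "Gene 2", "Gene 3", "Gene 4", "Gene 5"]
# Rows: genes; columns: intensity at Time X, Time Y, Time Z.
intensity = np.array([[10, 8, 10],
                      [10, 0, 9],
                      [4, 8.6, 3],
                      [7, 8, 3],
                      [1, 2, 3]], dtype=float)

# Each gene is a point in 3-dimensional space; compute all pairwise distances.
diff = intensity[:, None, :] - intensity[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))

# Mask the zero diagonal, then find the closest pair of distinct genes.
masked = D + np.diag(np.full(len(genes), np.inf))
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print(np.round(D, 2))
print(genes[i], genes[j], round(D[i, j], 2))
```

For this table the closest pair is Gene 3 and Gene 4 (distance about 3.06), so under the reasoning above they would be the first candidates for being functionally related.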





Hierarchical clustering

Step 1: Transform the genes × experiments matrix into a genes × genes distance matrix

         Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

         Gene A  Gene B  Gene C
Gene A   0
Gene B   ?       0
Gene C   ?       ?       0

Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains





Data and distance matrix

Data (expression of genes A to E in two patients):

Gene   Patient 1   Patient 2
A       90         190
B      190         390
C       90         110
D      200         400
E      150         200

Distance matrix:

        A      B      C      D      E
A      0.0  223.6   80.0  237.1   60.8
B             0.0  297.3   14.1  194.2
C                    0.0  310.2  108.2
D                           0.0  206.2
E                                  0.0
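The slide's distance matrix can be reproduced from the two-patient data with the Euclidean distance. The sketch below reads the data as Patient 1 = (90, 190, 90, 200, 150) and Patient 2 = (190, 390, 110, 400, 200) for genes A to E; that reading is an assumption about the flattened table, and the check against the slide's values supports it.

```python
import numpy as np

genes = ["A", "B", "C", "D", "E"]
# Rows: genes A..E; columns: expression in Patient 1 and Patient 2.
expr = np.array([[90, 190],
                 [190, 390],
                 [90, 110],
                 [200, 400],
                 [150, 200]], dtype=float)

# Euclidean distance between every pair of genes.
D = np.linalg.norm(expr[:, None, :] - expr[None, :, :], axis=2)

# Upper-triangle values as printed on the slide.
slide = {("A", "B"): 223.6, ("A", "C"): 80.0, ("A", "D"): 237.1, ("A", "E"): 60.8,
         ("B", "C"): 297.3, ("B", "D"): 14.1, ("B", "E"): 194.2,
         ("C", "D"): 310.2, ("C", "E"): 108.2, ("D", "E"): 206.2}

max_gap = max(abs(D[genes.index(g), genes.index(h)] - v) for (g, h), v in slide.items())
print(np.round(D, 1))
print(max_gap)  # largest difference from the slide's rounded values
```

Genes B and D are by far the closest pair (14.1), so an agglomerative method would join them first.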





Hierarchical clustering (continued)

        G1   G2   G3   G4   G5
G1       0
G2       2    0
G3       6    5    0
G4      10    9    4    0
G5       9    8    5    3    0

        G(12)  G3   G4   G5
G(12)    0
G3       6     0
G4      10     4    0
G5       9     5    3    0

        G(12)  G3   G(45)
G(12)    0
G3       6     0
G(45)   10     5    0

Stage  Groups
P5     [1], [2], [3], [4], [5]
P4     [1 2], [3], [4], [5]
P3     [1 2], [3], [4 5]
P2     [1 2], [3 4 5]
P1     [1 2 3 4 5]

[Figure: dendrogram with leaves 1 2 3 4 5]
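The stages above can be reproduced with a short agglomerative loop. The updated matrices on the slide are consistent with complete linkage (the distance between two groups is the largest pairwise distance between their members), so that rule is assumed here. The function name `complete_linkage` is hypothetical, and genes are indexed from 0 in the code.

```python
import numpy as np

def complete_linkage(D):
    """Agglomerative clustering with complete linkage; returns stages and merge heights."""
    D = np.asarray(D, dtype=float)
    clusters = [(i,) for i in range(len(D))]
    stages = [list(clusters)]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: largest distance between members of the two groups.
                dist = max(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
        clusters.sort()
        stages.append(list(clusters))
        heights.append(float(dist))
    return stages, heights

# Full symmetric version of the first distance matrix (G1..G5 -> indices 0..4).
D = [[0, 2, 6, 10, 9],
     [2, 0, 5, 9, 8],
     [6, 5, 0, 4, 5],
     [10, 9, 4, 0, 3],
     [9, 8, 5, 3, 0]]

stages, heights = complete_linkage(D)
for stage in stages:
    print(stage)
print(heights)
```

The successive partitions correspond to stages P5 to P1, with merges at distances 2, 3, 5 and 10. These heights are where the branches of the dendrogram join.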





Clustering of Microarray Data (cont'd)

Clusters





Hierarchical Clustering





Hierarchical Clustering: Example



