Anindya Bhattacharya and Rajat K. De Bioinformatics, 2008

Introduction

Divisive Correlation Clustering Algorithm

Results

Conclusions


Correlation Clustering


Correlation clustering was proposed by Bansal et al. in Machine Learning, 2004.

It is based on the notion of graph partitioning.


How to construct the graph?
Nodes: genes.
Edges: correlations between the genes.

Two types of edges:
Positive edge.
Negative edge.


For example:

Positive correlation coefficient between genes X and Y: positive edge.

Negative correlation coefficient between genes X and Y: negative edge.

[Figure: graph construction over gene nodes labeled A through H, followed by graph partitioning of the same nodes into Cluster 1 and Cluster 2.]
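The graph construction above can be sketched in a few lines of Python. This is only an illustration: the gene names and correlation values are hypothetical, and treating a coefficient of exactly 0 as a negative edge is an assumption, since the slides do not say how that case is handled.

```python
def edge_sign(corr_value):
    """Return '+' for a positive edge and '-' for a negative edge."""
    # Assumption: a coefficient of exactly 0 is treated as negative here;
    # the slides do not specify this case.
    return "+" if corr_value > 0 else "-"

# Hypothetical pairwise correlation coefficients between genes.
correlations = {("A", "B"): 0.9, ("A", "C"): -0.4, ("B", "C"): -0.7}

# The signed graph: nodes are genes, every gene pair carries a signed edge.
nodes = sorted({gene for pair in correlations for gene in pair})
edges = {pair: edge_sign(value) for pair, value in correlations.items()}
```

Only the sign of each coefficient survives in this graph; the magnitude is discarded.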


How to measure the quality of clusters?
The number of agreements.
The number of disagreements.

The number of agreements: the number of gene pairs (edges) whose placement is consistent with their edge sign.

The number of disagreements: the number of gene pairs (edges) whose placement contradicts their edge sign.


For example:

[Figure: five genes A, B, C, D and E partitioned into Cluster 1 and Cluster 2.]

The measure of agreements is the sum of:
(1) the number of positive edges within the same cluster
(2) the number of negative edges between different clusters
In the example: 4 + 4 = 8

The measure of disagreements is the sum of:
(1) the number of negative edges within the same cluster
(2) the number of positive edges between different clusters
In the example: 0 + 2 = 2
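As a rough sketch, the two measures can be counted as follows. The cluster memberships and edge signs below are hypothetical: the figure itself did not survive, so they were chosen only to reproduce the counts 8 and 2 from the example.

```python
def count_agreements(edges, cluster_of):
    """Return (agreements, disagreements) for a signed graph and a clustering."""
    agreements = disagreements = 0
    for (g, h), sign in edges.items():
        same_cluster = cluster_of[g] == cluster_of[h]
        # Agreement: '+' edge inside a cluster, or '-' edge across clusters.
        if (sign == "+") == same_cluster:
            agreements += 1
        else:
            disagreements += 1
    return agreements, disagreements

# Hypothetical clustering of five genes into two clusters.
cluster_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}

# Hypothetical edge signs, chosen only so the counts match the example.
edges = {
    ("A", "B"): "+", ("A", "C"): "+", ("B", "C"): "+", ("D", "E"): "+",
    ("A", "D"): "-", ("A", "E"): "-", ("B", "D"): "-", ("C", "E"): "-",
    ("B", "E"): "+", ("C", "D"): "+",
}

result = count_agreements(edges, cluster_of)
print(result)  # (8, 2)
```

Every edge is either an agreement or a disagreement, so the two counts always sum to the number of gene pairs (10 for five genes).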


Goal: minimization of disagreements or, equivalently, maximization of agreements.

However, this problem was proved to be NP-complete by Bansal et al., 2004.

Another problem is that the magnitudes of the correlation coefficients are not taken into account.


Pearson correlation coefficient

Terms and measurements used in DCCA

Divisive Correlation Clustering Algorithm


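Consider a set of $n$ genes, $X = \{x_1, x_2, \ldots, x_n\}$, for each of which $m$ expression values are given.

The Pearson correlation coefficient between two genes $x_i$ and $x_j$ is defined as:

$$Corr(x_i, x_j) = \frac{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)^2 \, \sum_{l=1}^{m} (x_{jl} - \bar{x}_j)^2}}$$

$x_{il}$: $l$th sample value of gene $x_i$

$\bar{x}_i$: mean value of gene $x_i$ from $m$ samples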
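A minimal sketch of the Pearson correlation coefficient in plain Python, assuming each gene is given as a list of its sample values. The function name `pearson` and the example profiles are illustrative, not taken from the paper.

```python
from math import sqrt

def pearson(xi, xj):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    m = len(xi)
    mean_i = sum(xi) / m
    mean_j = sum(xj) / m
    numerator = sum((a - mean_i) * (b - mean_j) for a, b in zip(xi, xj))
    spread_i = sum((a - mean_i) ** 2 for a in xi)
    spread_j = sum((b - mean_j) ** 2 for b in xj)
    # Undefined (division by zero) when either profile is constant.
    return numerator / sqrt(spread_i * spread_j)

# Hypothetical expression profiles over three samples.
same_pattern = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_pattern = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Profiles that rise and fall together give a value near +1, and profiles that move in opposite directions give a value near -1, regardless of scale.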


$Corr(x_i, x_j) > 0$: $x_i$ and $x_j$ are positively correlated, with the degree of correlation given by its magnitude.

$Corr(x_i, x_j) < 0$: $x_i$ and $x_j$ are negatively correlated with value $|Corr(x_i, x_j)|$.


We define some terms and measurements used in DCCA:
Attraction
Repulsion
Attraction/Repulsion value
Average correlation value


Attraction: There is an attraction between $x_i$ and $x_j$ if $Corr(x_i, x_j) > 0$.

Repulsion: There is a repulsion between $x_i$ and $x_j$ if $Corr(x_i, x_j) < 0$.

Attraction/Repulsion value: The magnitude of $Corr(x_i, x_j)$ is the strength of attraction or repulsion.


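The genes will be grouped into $K$ disjoint clusters $C_1, C_2, \ldots, C_K$.

Average correlation value: The average correlation value for a gene $x_i$ with respect to cluster $C_p$ is defined as:

$$AVGC_{ip} = \frac{1}{|C_p|} \sum_{x_j \in C_p} Corr(x_i, x_j)$$

$|C_p|$: the number of data points in $C_p$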


The average correlation value indicates the average correlation of a gene $x_i$ with the other genes inside the cluster $C_p$.

The average correlation value reflects the degree of inclusion of $x_i$ in cluster $C_p$.
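A small sketch of the average correlation value, under the assumption that it is the sum of the gene's correlations with all members of the cluster divided by the cluster size. The lookup table `corr` and the gene names are hypothetical.

```python
def average_correlation(gene, cluster, corr):
    """Average correlation of `gene` with the members of `cluster`.

    Assumption: the sum runs over every member of the cluster and is
    divided by the cluster size; the cluster must not be empty.
    """
    return sum(corr[gene][other] for other in cluster) / len(cluster)

# Hypothetical symmetric correlation table for three genes.
corr = {
    "x1": {"x1": 1.0, "x2": 0.8, "x3": -0.6},
    "x2": {"x1": 0.8, "x2": 1.0, "x3": -0.5},
    "x3": {"x1": -0.6, "x2": -0.5, "x3": 1.0},
}

# x3 is repelled by both members of the cluster {x1, x2}.
avg_x3 = average_correlation("x3", ["x1", "x2"], corr)
```

A clearly negative value, as for `x3` here, suggests the gene does not belong in that cluster.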


Divisive Correlation Clustering Algorithm

[Figure: DCCA takes as input n genes $x_1, \ldots, x_n$, each measured over m samples, and outputs K disjoint clusters $C_1, C_2, \ldots, C_K$.]


Step 1:

Step 2: for each iteration, do:

Step 2-i:

Step 2-ii:

Step 2-iii:

[Figure: clusters $C_1, C_2, \ldots, C_p$.]

Which cluster has the largest repulsion value? Cluster C.
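One possible reading of this selection step, sketched in Python. It assumes that the cluster with the largest repulsion value is the one containing the gene pair with the most negative correlation; the slide text does not spell this out, and the names `most_repulsive_cluster`, `pair_corr` and `corr` are hypothetical.

```python
from itertools import combinations

def most_repulsive_cluster(clusters, corr):
    """Return (index, value) for the cluster holding the most negative pair.

    Returns (None, 0.0) when no cluster contains a negatively correlated pair.
    """
    best_index, best_value = None, 0.0
    for index, cluster in enumerate(clusters):
        for g, h in combinations(cluster, 2):
            if corr(g, h) < best_value:
                best_index, best_value = index, corr(g, h)
    return best_index, best_value

# Hypothetical within-cluster correlations.
pair_corr = {
    frozenset(("a", "b")): -0.2,
    frozenset(("c", "d")): 0.7,
    frozenset(("c", "e")): -0.9,
    frozenset(("d", "e")): -0.3,
}

def corr(g, h):
    """Symmetric lookup of the hypothetical correlation table."""
    return pair_corr[frozenset((g, h))]

clusters = [["a", "b"], ["c", "d", "e"]]
selected = most_repulsive_cluster(clusters, corr)
```

Here the second cluster is selected because the pair (c, e) has the strongest repulsion.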


Step 2-iv:

[Figure: cluster C, containing genes $x_i$, $x_j$ and the remaining genes $x_k$, is split into two clusters $C_p$ and $C_q$, with $x_i$ and $x_j$ placed in different clusters.]


Step 2-v:

[Figure: for each gene $x_k$, the cluster among $C_1, C_2, \ldots, C_K$ with the highest average correlation value is found, and a copy of $x_k$ is placed in it. The result is $C_{NEW}$, the new clusters.]


Step 2-vi:

[Figure: the new clusters $C_{NEW}$ are compared with the current clusters $C_1, C_2, \ldots, C_K$.]

Any change?
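Steps 2-v and 2-vi might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the average correlation is taken as the mean correlation with all members of a cluster, clusters are assumed non-empty, and the gene names and correlation values are hypothetical.

```python
def reassign(genes, clusters, corr):
    """Place a copy of each gene in the cluster with the highest average correlation."""
    new_clusters = [[] for _ in clusters]
    for gene in genes:
        averages = [
            sum(corr(gene, other) for other in cluster) / len(cluster)
            for cluster in clusters
        ]
        best = max(range(len(clusters)), key=averages.__getitem__)
        new_clusters[best].append(gene)
    return new_clusters

# Hypothetical pairwise correlations between four genes.
pair_corr = {
    frozenset(("a", "b")): 0.9, frozenset(("a", "c")): -0.8,
    frozenset(("b", "c")): -0.7, frozenset(("a", "d")): -0.6,
    frozenset(("b", "d")): -0.5, frozenset(("c", "d")): 0.8,
}

def corr(g, h):
    """Symmetric lookup; a gene is perfectly correlated with itself."""
    return 1.0 if g == h else pair_corr[frozenset((g, h))]

genes = ["a", "b", "c", "d"]
clusters = [["a", "b", "c"], ["d"]]

# Repeat the reassignment until the clusters stop changing.
passes = 0
while True:
    new_clusters = reassign(genes, clusters, corr)
    passes += 1
    if new_clusters == clusters:  # "Any change?" answered with no
        break
    clusters = new_clusters
```

In this toy example gene `c` is misplaced at first; it moves next to `d` in the first pass, and the second pass confirms that nothing changes any more.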


Performance comparison:
A synthetic dataset ADS
Nine gene expression datasets


A synthetic dataset ADS:

[Figure: the ADS data, which contain three groups.]

Experimental results:

[Figure: clustering correctly.]

Experimental results:

[Figure: undesired clusters.]

Five yeast datasets: Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, Yeast Cho et al.

Four mammalian datasets: GDS958 Wild type, GDS958 Knocked out, GDS1423, GDS2745.


Performance comparison: the z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in the cluster.

Mutual information between a clustering result $C$ and an attribute $A_i$:

$$MI(C, A_i) = H(C) + H(A_i) - H(C, A_i)$$

$H(C, A_i)$: the entropies for each cluster-attribute pair.

$H(C)$: the entropies for the clustering result independent of attributes.

$H(A_i)$: the entropies for each of the $N_A$ attributes independent of clusters.
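A minimal sketch of mutual information between a clustering and a single attribute, using the standard identity MI = H(C) + H(A) - H(C, A). How the values are combined over all attributes is not recoverable from the slide, and the labels below are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mutual_information(cluster_labels, attribute_labels):
    """MI between cluster membership and one attribute, gene by gene."""
    joint = list(zip(cluster_labels, attribute_labels))
    return entropy(cluster_labels) + entropy(attribute_labels) - entropy(joint)

# Hypothetical labels for four genes.
cluster_labels = [0, 0, 1, 1]
informative = [True, True, False, False]    # attribute follows the clusters
uninformative = [True, False, True, False]  # attribute ignores the clusters

mi_high = mutual_information(cluster_labels, informative)
mi_low = mutual_information(cluster_labels, uninformative)
```

An attribute that lines up with the clusters carries one full bit of information about membership here, while the unrelated attribute carries none.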


The z-score is defined as:

$$z = \frac{MI_{real} - MI_{random}}{s_{random}}$$

$MI_{real}$: the computed MI for the clustered data, using the attribute database.

MI for random clusterings is computed by randomly assigning genes to clusters of uniform size and repeating until a distribution of values is obtained.

$MI_{random}$: the mean of these MI-values.

$s_{random}$: the standard deviation of these MI-values.
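The z-score itself is a one-line computation once the MI values are available. The sketch below assumes the sample standard deviation (`statistics.stdev`); whether the original tool uses the sample or the population form is not stated in the slides, and the numbers are hypothetical.

```python
from statistics import mean, stdev

def z_score(mi_real, mi_random_values):
    """z-score of the real MI against MI values from random clusterings."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return (mi_real - mean(mi_random_values)) / stdev(mi_random_values)

# Hypothetical MI values: one real clustering, three random clusterings.
z = z_score(0.9, [0.2, 0.3, 0.4])
```

The result says how many standard deviations the real clustering lies above what random assignment would achieve.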


A higher value of z indicates that genes are better clustered by function, that is, a more biologically relevant clustering result.

Gibbons' ClusterJudge tool is used to calculate the z-score for the five yeast datasets.


Experimental results:


Pros:
DCCA is able to obtain clustering solutions with high biological significance from gene-expression datasets.

DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.


Cons:
The computational cost of repairing any misplacement that occurs in the clustering step is high.

DCCA will not work if the dataset contains fewer than 3 samples, because the correlation value will then always be either +1 or -1.
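The two-sample case can be checked directly from the standard Pearson formula. The derivation below is a short worked example, with the shorthand $d_i$ introduced here only for illustration.

```latex
% With m = 2 samples, the two deviations from the mean are equal and opposite:
%   x_{i1} - \bar{x}_i = d_i, \quad x_{i2} - \bar{x}_i = -d_i,
%   where d_i = (x_{i1} - x_{i2}) / 2.
\[
Corr(x_i, x_j)
  = \frac{d_i d_j + (-d_i)(-d_j)}{\sqrt{(2 d_i^2)(2 d_j^2)}}
  = \frac{2\, d_i d_j}{2\, |d_i|\, |d_j|}
  = \operatorname{sign}(d_i d_j) = \pm 1
  \qquad (d_i \neq 0,\ d_j \neq 0).
\]
```

So with two samples every gene pair is either perfectly attracted or perfectly repelled, and the magnitude carries no information. If either gene has identical values in both samples, the coefficient is undefined.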
