Comp. Genomics Recitation 10 Clustering and analysis of microarrays

Jan 18, 2016

Charla Crawford

Comp. Genomics

Recitation 10: Clustering and analysis of microarrays

Exercise 1

• A microarray that contains probes for all N metabolic enzymes of the bacterium D. Angerous was used for the following time-series experiment: the bacterial population was exposed to a drug, and gene expression was measured every hour for M hours.

• The expression values are discretized to {-1,0,1}

Exercise 1

• Find the longest expression pattern that is common to at least k enzymes. Each enzyme may start the pattern at a different time.

      T1   T2   T3   T4   T5   T6   T7
E1    -1    0    1   -1    0    1   -1
E2    -1   -1    0    1   -1    0    1
E3    -1   -1   -1    0    1   -1    0
E4     0    0    0    0    0    0    0
E5     1    1    1    1    0    1    1
E6     1   -1   -1   -1   -1    0   -1

k = 3

Solution

• Treat each expression vector as a string

• Build a generalized suffix tree of the N strings, in O(MN) time

• Find longest k-common substring
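The idea can be tried out on the table from Exercise 1 with a much simpler (and slower) stand-in for the suffix tree. The sketch below is not the O(MN) suffix-tree solution: it binary-searches the pattern length and counts fixed-length windows with a hash table, which works because every prefix of a k-common pattern is itself k-common. The function name `longest_k_common_pattern` is made up for this illustration.

```python
from collections import Counter

def longest_k_common_pattern(rows, k):
    """Return one longest contiguous pattern that occurs in at least k rows."""
    def common_of_length(length):
        counts = Counter()
        for row in rows:
            # A set, so that each row votes at most once per distinct window.
            windows = {tuple(row[i:i + length])
                       for i in range(len(row) - length + 1)}
            counts.update(windows)
        for pattern, count in counts.items():
            if count >= k:
                return pattern
        return None

    lo, hi, best = 1, max(len(row) for row in rows), ()
    while lo <= hi:
        mid = (lo + hi) // 2
        found = common_of_length(mid)
        if found is not None:
            best, lo = found, mid + 1   # a pattern exists, try longer ones
        else:
            hi = mid - 1                # nothing this long, try shorter ones
    return best

rows = [
    [-1,  0,  1, -1,  0,  1, -1],   # E1
    [-1, -1,  0,  1, -1,  0,  1],   # E2
    [-1, -1, -1,  0,  1, -1,  0],   # E3
    [ 0,  0,  0,  0,  0,  0,  0],   # E4
    [ 1,  1,  1,  1,  0,  1,  1],   # E5
    [ 1, -1, -1, -1, -1,  0, -1],   # E6
]
result = longest_k_common_pattern(rows, 3)
print(result)   # (-1, 0, 1, -1, 0), shared by E1, E2 and E3 at different offsets
```

Each length check hashes O(NM) windows of up to M symbols, so this is polynomial but slower than the suffix-tree approach.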

Exercise 2

• Expression of N genes was measured under a certain condition using a microarray.

• No discretization was performed.

• Give a polynomial-time algorithm for clustering these genes into exactly k clusters C_1, ..., C_k.

• The objective function (to be minimized) is

$$\sum_{l=1}^{k} \max\{\, d(i,j) \mid i,j \in C_l \,\}$$

Pictorially

[Figure: genes G1, ..., G6 marked along an expression-level axis.]

If {G3,G4,G5} is a cluster, its contribution to the objective function is d(G3,G5).

Solution

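• Create a weighted directed graph: every gene is a node, and the edge from i to j has weight d(i,j-1) if i’s expression is lower than j’s (otherwise ∞). With the genes numbered by increasing expression, the edge from i to j stands for the cluster {i, ..., j-1}.

[Figure: the graph on G1, ..., G6 with the clustering {G1,G2}, {G3,G4,G5}, {G6}.]

The path in the graph that corresponds to this clustering is G1 → G3 → G6. The value of the objective function is d(G1,G2)+d(G3,G5)+0.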

Solution

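• Next: find the shortest path that visits exactly k nodes

• Dynamic programming:

$$P_j(k) = \min_{l = k, \ldots, j-1} \{\, P_l(k-1) + d(l,j) \,\}$$

Here d(l,j) stands for the weight of the edge from l to j. The minimum starts from l = k because P_l(k-1) = ∞ if l < k.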
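The same dynamic program can be written over sorted expression values instead of explicit graph nodes. This is a minimal sketch, assuming a single measurement per gene and d(i,j) = |x_i − x_j|, so that the cost of a cluster of consecutive sorted values is its largest value minus its smallest. The name `cluster_1d` and the table layout are choices made for this illustration, not taken from the slides.

```python
def cluster_1d(values, k):
    """Split values into exactly k clusters, minimizing the sum of cluster ranges."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= number of values")
    inf = float("inf")
    # best[c][j]: minimum cost of splitting xs[:j] into exactly c clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    # cut[c][j]: start index of the last cluster in that optimal split.
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                # The last cluster is xs[i:j]; its cost is its range.
                cand = best[c - 1][i] + (xs[j - 1] - xs[i])
                if cand < best[c][j]:
                    best[c][j] = cand
                    cut[c][j] = i
    # Walk the cut table backwards to recover the clusters.
    clusters, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return best[k][n], clusters

cost, clusters = cluster_1d([5, 1, 20, 6, 2, 7], 3)
print(cost, clusters)   # 3.0 [[1, 2], [5, 6, 7], [20]]
```

The three nested loops give O(k·N²) time after sorting, which is polynomial as the exercise requires.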

Exercise 3

• A microarray experiment with N genes and M conditions was conducted

• Describe a polynomial algorithm that determines whether the genes can be clustered into 2 clusters such that the maximum distance d(Gi,Gj) in each cluster < W

Illustration

G1   0 1 1
G2   0 0 1
G3   1 1 1
G4   1 1 0

W = 2

Solution

• Create a graph with a node for every gene

• Add an edge (i,j) if d(i,j) ≥ W, i.e. if i and j cannot share a cluster

• Check if the resulting graph is bipartite: run BFS, and if you discover an edge (u,v) to a gray node such that the depths of u and v are both even or both odd, answer “no”.
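A minimal sketch of this check follows. It assumes that a pair conflicts when its distance is at least W (since both genes of a cluster must be closer than W), and it takes the distance as a caller-supplied function because the slides do not fix one. The name `two_clusters` and the toy data are made up for illustration.

```python
from collections import deque

def two_clusters(items, dist, w):
    """Return two clusters with all within-cluster distances < w, or None."""
    n = len(items)
    # Conflict graph: i and j are joined if they cannot share a cluster.
    conflict = [[j for j in range(n)
                 if j != i and dist(items[i], items[j]) >= w]
                for i in range(n)]
    side = [None] * n
    for start in range(n):              # the graph may be disconnected
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in conflict[u]:
                if side[v] is None:
                    side[v] = 1 - side[u]    # opposite BFS parity
                    queue.append(v)
                elif side[v] == side[u]:
                    return None              # odd cycle: not bipartite
    first = [items[i] for i in range(n) if side[i] == 0]
    second = [items[i] for i in range(n) if side[i] == 1]
    return first, second

def abs_diff(a, b):
    return abs(a - b)

print(two_clusters([0, 1, 10, 11], abs_diff, 5))   # ([0, 1], [10, 11])
print(two_clusters([0, 5, 10], abs_diff, 5))       # None
```

Building the conflict graph takes O(N²) distance evaluations and the BFS is linear in the graph size, so the whole test is polynomial.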

Solution

Not Bipartite

Exercise 4

• We are given a microarray with N genes and M experiments

• We want to cluster the genes into k clusters such that the distance between genes that belong to the same cluster will be < W

• Can you give a polynomial algorithm that solves this problem?

Solution

• Probably not

• More specifically, if we could solve this problem in polynomial time, we could solve a large class of problems that are widely believed to be unsolvable in polynomial time

Solution

• How can we show that we can probably not find a solution in polynomial time?

• We will take a problem for which this has already been shown

• We will construct a polynomial time reduction to our problem

• So, if our problem could be solved efficiently the “hard” problem could also be solved efficiently

Graph description

The following graph can describe our problem:

[Figure: a graph on the nodes G1, ..., G6.]

There’s an edge (Gi,Gj) if the distance between Gi and Gj is less than W

Graph description

Clustering with k=3:

3COL

3-Colorability: Given a graph G, can we color its vertices with 3 different colors such that no two adjacent nodes have the same color?

Comparing the problems

• What is common to both these problems?

• In both we “cluster” the nodes

• What are the differences?

• First, in 3COL there are only 3 clusters instead of k

• Second, the elements that belong to the same group in 3COL must not have edges between them

Reduction

• Now that we understand the differences, we can take a graph G that is an input to 3COL, and transform it to a graph G’ and a constant k that are the input to the k-clustering problem

• We assume that we have a polynomial k-clustering algorithm, and we apply it to (G’,k) and translate the solution to 3COL

Reduction

• Given the first difference that we noted, what should be the value of k?

• We set k to 3, i.e. the algorithm should find exactly 3 clusters

• How do we change G to get G’?

• G’ has the complement edges of G
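The transformation is small enough to sketch directly. The code below builds the edge set of G’ from G and checks whether a proposed grouping is a legal clustering in G’, meaning that every pair inside a cluster is joined by an edge of G’. The helper names `complement_edges` and `is_legal_clustering` and the 5-cycle example are made up for illustration. The sketch only verifies a given clustering; it does not solve either problem.

```python
from itertools import combinations

def complement_edges(nodes, edges):
    """Edges of G': all node pairs that are not edges of G."""
    present = {frozenset(e) for e in edges}
    return {frozenset(pair) for pair in combinations(nodes, 2)
            if frozenset(pair) not in present}

def is_legal_clustering(clusters, nodes, close_pairs):
    """True if the clusters cover the nodes exactly once and each is a clique."""
    covered = [v for cluster in clusters for v in cluster]
    if sorted(covered) != sorted(nodes):
        return False
    return all(frozenset(pair) in close_pairs
               for cluster in clusters
               for pair in combinations(cluster, 2))

# G is a 5-cycle, which is 3-colorable but not 2-colorable.
nodes = [1, 2, 3, 4, 5]
g_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
g_prime = complement_edges(nodes, g_edges)

coloring = [[1, 3], [2, 4], [5]]        # color classes of a valid 3-coloring of G
not_coloring = [[1, 2], [3, 4], [5]]    # 1 and 2 are adjacent in G

print(is_legal_clustering(coloring, nodes, g_prime))      # True
print(is_legal_clustering(not_coloring, nodes, g_prime))  # False
```

Building G’ takes O(|V|²) time, so the reduction itself is polynomial.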

Example

Proof

Suppose that G is 3-colorable. Let V1, V2, V3 be the sets of nodes that receive the first, second and third color. There are no edges in G between any pair of nodes in V1, so every pair of nodes in V1 is joined by an edge in G’, and therefore V1 forms a legal cluster in G’. Similarly, the nodes of V2 and V3 form clusters. Since V1 ∪ V2 ∪ V3 contains all the nodes, all the genes are clustered in the 3 corresponding clusters.

Proof, second direction

Suppose that G’ admits a clustering into 3 legal clusters. These clusters correspond to 3 node sets in G such that within each set there are no edges between pairs of nodes. Therefore, assigning a different color to every set is a 3-coloring of G.

HW 2 question 5

• Uniform lifted alignment: an alignment in which, at each level, all strings are lifted from the same side, either all from the right or all from the left.

• Prove that the optimal uniform lifted alignment has cost at most twice that of the optimal alignment tree.

• Give a polynomial algorithm to find the optimal uniform lifted alignment.

HW 2 question 5

• Uniform lifted alignment, proof:

• Assume we had the optimal tree T*.

• Transform it in the following way:

• To assign string at level k, consider:

• Pick the minimal sum.

HW 2 – question 5 – cont’d

• Assign each ‘non-zero’ edge (T,S) to a path in the optimal tree:

• The path from leaf (T) to node (S*).

[Figure: tree fragment with a node labeled S (S* in the optimal tree) above nodes labeled T and S.]

Together, these paths cover all edges of the tree.

HW 2 – question 5 – cont’d

By the triangle inequality: D(S, T) ≤ D(S, S*) + D(S*, T)

By the choice of left/right: Σs [D(S,S*) + D(S*,T)] ≤ Σs [D(S*,T) + D(S*,T)] = Σs 2D(S*,T) ⇒ a one-sided tree with cost at most twice the optimal.

HW 2 – question 5 – cont’d

• Algorithm:

• Preprocess pairwise sequence distances.

• Try all different assignments of left/right for each level, and pick the minimal one.

• Running time (n sequences of length m):

• Preprocessing: O(m^2 n^2).

• Height h, so 2^h different assignments.

• Cost calculation per tree: O(n).

• Total: O(m^2 n^2 + 2^h n), which is polynomial when h = O(log n).
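The enumeration can be sketched as follows, under two simplifying assumptions that are not in the slides: the tree is a complete binary tree whose 2^h leaves are given left to right, and Hamming distance stands in for the pairwise alignment distance D. The names `hamming` and `best_uniform_lifting` and the four toy sequences are hypothetical.

```python
from itertools import product

def hamming(a, b):
    """Stand-in distance for this sketch; the slides use an alignment distance."""
    return sum(x != y for x, y in zip(a, b))

def best_uniform_lifting(leaves, dist=hamming):
    """Try every left/right choice per level; return (min cost, choice per level)."""
    n = len(leaves)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes 2**h leaves")
    h = n.bit_length() - 1
    # Preprocessing: all pairwise distances between leaf sequences.
    d = [[dist(a, b) for b in leaves] for a in leaves]
    best_cost, best_choice = float("inf"), None
    for choice in product("LR", repeat=h):   # 2**h assignments
        labels = list(range(n))              # leaf index labelling each node
        cost = 0
        for side in choice:                  # choice[0] is the level above the leaves
            pick = 0 if side == "L" else 1
            parents = []
            for i in range(0, len(labels), 2):
                p = labels[i + pick]         # lift the chosen child's label
                cost += d[p][labels[i]] + d[p][labels[i + 1]]
                parents.append(p)
            labels = parents
        if cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice

lifting_cost, lifting_choice = best_uniform_lifting(["AAAA", "AAAT", "AATT", "TTTT"])
print(lifting_cost)   # 5
```

Each assignment is evaluated with one pass over the tree's n − 1 internal nodes, matching the O(n) per-tree cost above.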