Comp. Genomics Recitation 10 Clustering and analysis of microarrays.

Comp. Genomics

Recitation 10Clustering and analysis of microarrays

Exercise 1

• A microarray that contains probes for all the N metabolic enzymes of the bacterium D.Angerous was used for the following time-series microarray experiment: The bacteria population were exposed to a drug, and gene expression was measured every hour for M hours.

• The expression values are discretized to {-1,0,1}

Exercise 1

• Find the longest expression pattern that is common to at least k enzymes. Each enzyme may start the pattern at a different time.

T1 T2 T3 T4 T5 T6 T7

E1 -1 0 1 -1 0 1 -1

E2 -1 -1 0 1 -1 0 1

E3 -1 -1 -1 0 1 -1 0

E4 0 0 0 0 0 0 0

E5 1 1 1 1 0 1 1

E6 1 -1 -1 -1 -1 0 -1

K=3

Solution

• Treat each expression vector as a string

• Create a generalized suffix tree O(MN)

• Find longest k-common substring

Exercise 2

• Expression of N genes was measured under a certain condition using a microarray.

• No discretization was performed.• Give a polynomial time algorithm for

clustering these genes into exactly k clusters.

• The objective function is

k

llCjijid

1

},|),(max{

Pictorially

G1 G2 G3 G4 G5 G6Expression level

If {G3,G4,G5}is a cluster, its contribution to theobjective function is d(G3,G5)

Solution

• Create a weighted directed graph, every gene is a node and the edge from i to j has weight d(i,j-1) if i’s expression is lower than j’s (otherwise ∞)G1 G2 G3 G4 G5 G6

The path in the graph that corresponds to this clustering is G1G3G6. The value of the objective function is d(G1,G2)+d(G3,G5)+0

Solution

• Next:• Find the shortest path that visits

exactly k nodes• Dynamic programming:

),()1(min)(1,..,

jldkPkP ljkl

j

Start from k because if l<k Pl(k-1)=∞

Exercise 3

• A microarray experiment with N genes and M conditions was conducted

• Describe a polynomial algorithm that determines whether the genes can be clustered into 2 clusters such that the maximum distance d(Gi,Gj) in each cluster < W

Illustration

0 1 1

0 0 1

111

1 1 0

W=2

G1

G2

G3

G4

Solution

• Create a graph with a node for every gene

• Add an edge (i,j) if d(i,j)> W• Check if the resulting graph is

bipartite: Run BFS, if you discover an edge (u,v) to a gray node and the depths of u and v are both even or both odd, answer: “no”.

Solution

Not Bipartite

Exercise 4

• We are given a microarray with N genes and M experiments

• We want to cluster the genes into k clusters such that the distance between genes that belong to the same cluster will be < W

• Can you give a polynomial algorithm that solves this problem?

Solution

• Probably not• More specifically, if we could solve

this problem in polynomial time, we could solve a large class of problem that are widely believed to be unsolvable in polynomial time

Solution

• How can we show that we can probably not find a solution in polynomial time?

• We will take a problem for which this has already been shown

• We will construct a polynomial time reduction to our problem

• So, if our problem could be solved efficiently the “hard” problem could also be solved efficiently

Graph description

The following graph can describe our problem:

G1 G2

G3G6

G5 G4

There’s an edge (Gi,Gj) if the distance between Gi and Gj is less than W

Graph description

Clustering with k=3:

3COL

3-Colorability: Given a graph G, can we dye its vertices with 3 different colors such that no two adjacent nodes have the same color?

Comparing the problems

• What is common to both these problems?• In both we “cluster” the nodes• What are the differences?• First, in 3COL there are only 3 clusters

instead of k• Second, the elements that belong to the

same group in 3COL must not have edges between them

Reduction

• Now that we understand the differences, we can take a graph G that is an input to 3COL, and transform it to a graph G’ and a constant k that are the input to the k-clustering problem

• We assume that we have a polynomial k-clustering algorithm, and we apply it to (G’,k) and translate the solution to 3COL

Reduction

• Given the first difference that we noted, what should be the value of k?

• We set k to 3, i.e. the algorithm should find exactly 3 clusters

• How do we change G to get G’?• G’ has the complement edges of G

Example

Proof

Suppose that G is 3 colorable. Let V1,V2,V3 be the groups of nodes that can be colored by distinct colors. There are no edges between any pair of nodes in V1, and therefore it forms a legal cluster in G’. Similarly, the nodes of V2 and V3 form clusters. Since V1UV2UV3 contains all the nodes all the genes are clustered in the 3 corresponding clusters.

Proof, second direction

Suppose that G’ contains a clustering to 3 legal clusters. These clusters correspond to 3 nodes sets in G such that within each set there are no edges between pairs of nodes. Therefore, assigning a different color to every set is a 3-coloring.

HW 2 question 5

• Uniform lifted alignment – alignment in which for each level all string are either lifted from right or left.

• Prove that the optimal uniform lifted alignment has cost at most twice of the optimal alignment tree.

• Give a polynomial algorithm to find the optimal uniform lifted alignment.

HW 2 question 5

• Uniform lifted alignment, proof:• Assume we had the optimal tree T*.• Transform it in the following way:• To assign string at level k, consider:

• Pick the minimal sum.

HW 2 – question 5 – cont’d

• Assign each ‘non-zero’ edge (T,S) to a path in the optimal tree:

• The path from leaf (T) to node (S*).

S (S*)

T S

T

Together, these paths cover all edges of the tree.


By triangle inequality:D(S, T) ≤ D(S, S*) + D(S*, T)

S (S*)

T S

T

By choice of left/right:Σs D(S,S*)+D(S*,T) ≤ Σs D(S*,T)+D(S*,T) =Σs 2D(S*,T) => One-sided tree with cost at most twice the optimal.


• Algorithm:• Preprocess pairwise sequence

distances.• Try all different assignments for a

left/right for each level, and pick the minimal one.

• Running time (n sequences of length m):• Proprocessing: O(m2n2).• Height h, different assignment 2h.• Calculation cost of tree O(n).

Comp. Genomics Recitation 10 Clustering and analysis of microarrays.

Documents