CSCI 256 Data Structures and Algorithm Analysis Lecture 16 Some slides by Kevin Wayne copyright 2005, Pearson Addison Wesley all rights reserved, and some.

Post on 19-Jan-2018



CSCI 256 Data Structures and Algorithm Analysis Lecture 16

Some slides by Kevin Wayne copyright 2005, Pearson Addison Wesley all rights reserved, and some by Iker Gondra

Dynamic Programming Review

• Recipe
– Characterize structure of problem
– Recursively define value of optimal solution
– Compute value of optimal solution
– Construct optimal solution from computed information

• Dynamic programming techniques
– Binary choice: weighted interval scheduling
– Multi-way choice: segmented least squares
– Adding a new variable: knapsack
– Dynamic programming over intervals: RNA secondary structure

RNA Secondary Structure

• RNA: String B = b1b2…bn over alphabet { A, C, G, U }
• Secondary structure: RNA is single-stranded, so it tends to loop back and form base pairs with itself. This structure is essential for understanding the behavior of the molecule.

[Figure: an RNA strand folded back on itself, with complementary bases paired.]

Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA

complementary base pairs: A-U, C-G

RNA Secondary Structure

• Secondary structure: A set of pairs S = { (bi, bj) } that satisfy

– [Watson-Crick] S is a matching and each pair in S is a Watson-Crick complement: A-U, U-A, C-G, or G-C

– [No sharp turns] The ends of each pair are separated by at least 4 intervening bases: if (bi, bj) ∈ S, then i < j - 4

– [Non-crossing] If (bi, bj) and (bk, bl) are two pairs in S, then we cannot have i < k < j < l
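The three conditions above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `is_secondary_structure` and the representation of S as a list of 1-based index pairs (i, j) with i < j are assumptions made for illustration.

```python
# Hypothetical helper (not from the slides): tests whether a list of
# 1-based index pairs (i, j), i < j, is a secondary structure of `rna`.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def is_secondary_structure(rna, pairs):
    used = set()
    for i, j in pairs:
        # Indices must lie inside the string, with i to the left of j.
        if not (1 <= i < j <= len(rna)):
            return False
        # [Watson-Crick] the two bases must be complements ...
        if (rna[i - 1], rna[j - 1]) not in WATSON_CRICK:
            return False
        # ... and S must be a matching: no base appears in two pairs.
        if i in used or j in used:
            return False
        used.update((i, j))
        # [No sharp turns] i < j - 4.
        if not i < j - 4:
            return False
    # [Non-crossing] no two pairs (i, j), (k, l) with i < k < j < l.
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True
```

For the string ACCGGUAGU used later in the lecture, the pairs (1,9) and (2,8) pass all three checks, while (1,6) together with (3,8) fails the non-crossing check.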

[Figure: three example structures, labelled "ok", "sharp turn", and "crossing".]

RNA Secondary Structure

• Out of all the secondary structures that are possible for a single RNA molecule, which ones are likely to arise?
– Free energy: The usual hypothesis is that an RNA molecule will form the secondary structure with the optimum total free energy, which we approximate by the number of base pairs
– Goal: Given an RNA molecule B = b1b2…bn, find a secondary structure S that maximizes the number of base pairs

http://www.genebee.msu.su/services/rna2_reduced.html

RNA Secondary Structure: Subproblems

• First attempt: OPT(j) = maximum number of base pairs in a secondary structure on the substring b1b2…bj. Either
– j is not involved in a pair
• Find the optimal secondary structure in b1b2…bj-1, which has value OPT(j-1)
– j pairs with t for some t < j – 4

[Figure: positions 1, t, and j marked on the string.]

RNA Secondary Structure: Subproblems

If j pairs with some t where t < j – 4, then the non-crossing rule tells us that we can't have a base pair (k,l) where k < t < l < j; that is, we can't have (k,l) where 1 ≤ k ≤ t-1 and t+1 ≤ l ≤ j-1.

This means that any other pair (k,l) in an optimal structure lies either in b1,b2,…,bt-1 or in bt+1,…,bj-1.

So we must look at two subproblems which are decoupled due to the noncrossing constraint:

• Find the optimal secondary structure in: b1b2…bt-1

• Find the optimal secondary structure in: bt+1bt+2…bj-1

RNA Secondary Structure: Subproblems

• What is different here?
• The second subproblem is not on our list of subproblems, because it does not begin with b1.
• We need more subproblems!
• We need to be able to work with subproblems that do not begin with b1.

Dynamic Programming Over Intervals

• Notation: OPT(i, j) = maximum number of base pairs in a secondary structure of the substring bibi+1…bj

– Case 1: i ≥ j - 4
• OPT(i, j) = 0 by the no-sharp-turns condition

– Case 2: i < j - 4
If base bj is not involved in a pair:
• OPT(i, j) = OPT(i, j-1)
If base bj pairs with bt for some t with i ≤ t < j - 4, the non-crossing constraint means that:
• OPT(i, j) = 1 + max_t { OPT(i, t-1) + OPT(t+1, j-1) }, where the max is taken over t such that i ≤ t < j - 4 and bt and bj are Watson-Crick complements

• Hence if i < j – 4 we want the maximum of the two values for OPT(i, j):

• OPT(i, j) = max ( OPT(i, j-1), 1 + max_t { OPT(i, t-1) + OPT(t+1, j-1) } )
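The recurrence can be transcribed almost literally as a memoized recursive function. This is a sketch under stated assumptions rather than the lecture's own code: the names `max_base_pairs`, `opt`, and `COMPLEMENT` are made up for illustration, indices are 1-based as on the slides, and Python's recursion limit makes it suitable only for short strings.

```python
from functools import lru_cache

# Watson-Crick complements: A-U, U-A, C-G, G-C.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def max_base_pairs(rna):
    """Return OPT(1, n) for the string rna (hypothetical helper)."""

    @lru_cache(maxsize=None)
    def opt(i, j):
        # Case 1: i >= j - 4, so no pair fits (no sharp turns).
        if i >= j - 4:
            return 0
        # Case 2a: b_j is not involved in a pair.
        best = opt(i, j - 1)
        # Case 2b: b_j pairs with b_t for some i <= t < j - 4.
        for t in range(i, j - 4):
            if COMPLEMENT.get(rna[t - 1]) == rna[j - 1]:
                best = max(best, 1 + opt(i, t - 1) + opt(t + 1, j - 1))
        return best

    return opt(1, len(rna))
```

Memoization means each pair (i, j) is evaluated once, so the recursion does the same work as filling a table, just in a different order.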

Dynamic Programming Over Intervals

• In what order should we solve the subproblems?
– Looking at the recurrence relation, we see that we are invoking solutions to subproblems on shorter intervals
– We need to evaluate Opt for the shortest intervals first; this is different from the subset sum (and knapsack) strategy of going row by row
– To achieve this, set an auxiliary variable k to a constant and use the values of i and j that keep j - i = k
– As k gets larger, the interval for the subproblem bi,bi+1,…,bj grows

Dynamic Programming Over Intervals

– Running time: O(n^3). Why?

RNA(b1,…,bn) {
   Initialize Opt[i, j] = 0 whenever i ≥ j-4 (i.e., i+4 ≥ j)
   for k = 5, 6, …, n-1
      for i = 1, 2, …, n-k
         set j = i + k
         Compute Opt[i, j] using the recurrence
   return Opt[1, n]
}

Running time analysis:

• There are O(n^2) subproblems to solve, and evaluating the recurrence for each one takes O(n) time (because we have to find the max over the t's such that bt and bj are an allowable pair)

• So the running time is O(n^3)
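The pseudocode can be sketched in Python as follows. The function name `rna_table` and the list-of-lists table layout are assumptions for illustration; the table has an extra row and column so that the empty intervals Opt[i, i-1] and Opt[t+1, j-1] can be read as 0 without special cases.

```python
# Watson-Crick complements: A-U, U-A, C-G, G-C.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def rna_table(rna):
    """Fill Opt[i][j] bottom-up, shortest intervals first (1-based)."""
    n = len(rna)
    # Rows 0..n+1, columns 0..n, all initialised to 0; this covers
    # every entry with i >= j - 4, including empty intervals.
    opt = [[0] * (n + 1) for _ in range(n + 2)]
    for k in range(5, n):               # k = 5, 6, ..., n-1
        for i in range(1, n - k + 1):   # i = 1, 2, ..., n-k
            j = i + k
            best = opt[i][j - 1]        # b_j unpaired
            for t in range(i, j - 4):   # i <= t < j - 4
                if COMPLEMENT.get(rna[t - 1]) == rna[j - 1]:
                    best = max(best, 1 + opt[i][t - 1] + opt[t + 1][j - 1])
            opt[i][j] = best
    return opt
```

The three nested loops, each over at most n values, make the O(n^3) bound visible directly; `rna_table(s)[1][len(s)]` is the value Opt[1, n] returned by the pseudocode.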

Example: ACCGGUAGU

• Recall: base pairs allowed: AU, UA, CG, GC
• What is the basic array that we need to fill? (here n = 9)

Example: ACCGGUAGU

• Note: if i > j, let Opt(i,j) = 0 (Why?)
• We need two dimensions to represent the array M of values for Opt(i,j): one for the left endpoint of the interval being considered, and one for the right endpoint
• Some initial values are 0: whenever i ≥ j – 4 (Why?)
• Begin with k = 5; loop over the i's from 1 to 4 (= 9 – 5)
• For Opt(1,6): t = 1 is the only t with 1 ≤ t < 6 - 4, and b1b6 is AU, an allowable base pair, so
• Opt(1,6) = max( Opt(1,5), 1 + Opt(1,0) + Opt(2,5) ) = max( 0, 1+0+0 ) = 1
• For Opt(2,7): t = 2 is the only t with 2 ≤ t < 7 - 4, but b2b7 is CA, not an allowable base pair, so no t satisfies the conditions and Opt(2,7) = Opt(2,6) = 0
• Next value?

Example: ACCGGUAGU

• Opt(3,8): t = 3 is the only possible t, and b3b8 is CG, an allowable base pair, so

Opt(3,8) = max( Opt(3,7), 1 + Opt(3,2) + Opt(4,7) ) = max( 0, 1+0+0 ) = 1

Next value to calculate?

Example: ACCGGUAGU

• It is Opt(4,9)
• Now let k = 6 and compute Opt(1,7), then Opt(2,8), then Opt(3,9), …
• Note that for Opt(1,7) both t = 1 and t = 2 satisfy the inequality i ≤ t < j – 4 (i = 1 and j = 7); do both give allowable base pairs?

• This is a fully worked example in the text; check out more values to make sure you are following the algorithm
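The last step of the recipe, constructing an optimal solution from the computed values, is not shown on the slides. One possible sketch, under the assumption that we simply walk back through the filled table and re-test the recurrence at each interval, is given below; the function name `optimal_pairs` and the explicit stack are illustrative choices, not the lecture's method.

```python
# Watson-Crick complements: A-U, U-A, C-G, G-C.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def optimal_pairs(rna):
    """Return one optimal set of 1-based pairs (t, j), sorted."""
    n = len(rna)
    # Fill the table by increasing interval length k = j - i.
    opt = [[0] * (n + 1) for _ in range(n + 2)]
    for k in range(5, n):
        for i in range(1, n - k + 1):
            j = i + k
            best = opt[i][j - 1]
            for t in range(i, j - 4):
                if COMPLEMENT.get(rna[t - 1]) == rna[j - 1]:
                    best = max(best, 1 + opt[i][t - 1] + opt[t + 1][j - 1])
            opt[i][j] = best

    # Trace back: at each interval decide which case gave Opt[i][j].
    pairs = []
    stack = [(1, n)]
    while stack:
        i, j = stack.pop()
        if i >= j - 4:
            continue                    # too short to hold any pair
        if opt[i][j] == opt[i][j - 1]:
            stack.append((i, j - 1))    # b_j can be left unpaired
            continue
        for t in range(i, j - 4):
            if (COMPLEMENT.get(rna[t - 1]) == rna[j - 1]
                    and opt[i][j] == 1 + opt[i][t - 1] + opt[t + 1][j - 1]):
                pairs.append((t, j))    # b_t pairs with b_j
                stack.append((i, t - 1))
                stack.append((t + 1, j - 1))
                break
    return sorted(pairs)
```

For the example string ACCGGUAGU this returns the pairs (1,9) and (2,8), that is, A-U and C-G, which agrees with Opt(1,9) = 2. Other optimal structures may exist; the sketch returns whichever one the tie-breaking order finds first.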
