Comp. Genomics Recitation 11 SCFG

Comp. Genomics

Jan 22, 2016


brone

Page 1: Comp. Genomics

Comp. Genomics

Recitation 11: SCFG

Page 2: Comp. Genomics

Exercise

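[Figure: a two-state HMM with states $W_1$ and $W_2$ and transition labels $p$, $1-p$, $q$, $1-q$.]

The two states have different emission probabilities (e.g. DNA compositions).

Convert to SCFG.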

Page 3: Comp. Genomics

Solution

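• $W_1 \to aW_1 \mid cW_1 \mid \dots \mid aW_2 \mid cW_2 \mid \dots \mid tW_2$

• $W_2 \to aW_2 \mid cW_2 \mid \dots \mid aW_1 \mid cW_1 \mid \dots \mid tW_1$

$p(W_1 \to aW_1) = e_{W_1}(a)\,p$

$p(W_1 \to aW_2) = e_{W_1}(a)\,(1-p)$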

Page 4: Comp. Genomics

Solution

• The other rules are trivial.

• Regular ⊂ CF: the resulting grammar is regular, and every regular grammar is also context-free.
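The conversion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `hmm_to_scfg`, the numeric values of `p` and `q`, the emission tables, and the reading of `q` as the self-loop probability of W2 are all hypothetical and not taken from the slides.

```python
def hmm_to_scfg(transitions, emissions):
    """Build rules  state -> symbol next_state  with probability e_state(symbol) * t(state, next_state)."""
    rules = {}
    for state, outgoing in transitions.items():
        for symbol, emit_prob in emissions[state].items():
            for next_state, trans_prob in outgoing.items():
                rules[(state, symbol, next_state)] = emit_prob * trans_prob
    return rules


# Hypothetical parameters (assumption: q is the self-loop of W2).
p, q = 0.9, 0.8
transitions = {
    "W1": {"W1": p, "W2": 1 - p},
    "W2": {"W2": q, "W1": 1 - q},
}
emissions = {
    "W1": {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4},  # AT-rich composition
    "W2": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},  # GC-rich composition
}

rules = hmm_to_scfg(transitions, emissions)
for (lhs, symbol, rhs), prob in sorted(rules.items()):
    print(f"{lhs} -> {symbol} {rhs}   {prob:.3f}")
```

Because each state's emissions and outgoing transitions both sum to 1, the rule probabilities of each non-terminal sum to 1 as well.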

Page 5: Comp. Genomics

Exercise

• Convert the production rule $W \to aWbW$ to Chomsky normal form. If the probability of the original production is $p$, show the probabilities for the productions in your normal form version.

Page 6: Comp. Genomics

Solution

• Chomsky normal form requires that all production rules are of the form $W_v \to W_y W_z$ or $W_z \to a$.

• We define four new non-terminals: $W_1$, $W_2$, $W_a$, $W_b$.

• Old rule: $W \to aWbW$

• The new rules are:

$W \to W_1 W_2$

$W_1 \to W_a W$, $W_2 \to W_b W$

$W_a \to a$, $W_b \to b$

Page 7: Comp. Genomics

Solution

• For every non-terminal, the sum of probabilities of all production rules must be 1

• Since the new non-terminals have only one rule, their rules will be assigned probability 1

• The rule $W \to W_1 W_2$ will therefore have probability $p$, the same as the rule that we eliminated.
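As a quick sanity check of this argument, the sketch below expands the helper non-terminals of the normal-form rules and multiplies the rule probabilities along the way. The value of `p`, the dictionary layout and the function name `expand` are illustrative assumptions; the other rules of W, which share the remaining probability $1-p$, are left out.

```python
p = 0.3  # hypothetical probability of the original rule W -> a W b W

# Normal-form rules: lhs -> (rhs, probability).
cnf_rules = {
    "W": (("W1", "W2"), p),
    "W1": (("Wa", "W"), 1.0),
    "W2": (("Wb", "W"), 1.0),
    "Wa": (("a",), 1.0),
    "Wb": (("b",), 1.0),
}
helpers = {"W1", "W2", "Wa", "Wb"}  # the four new non-terminals


def expand(symbols):
    """Rewrite helper non-terminals until only original symbols remain.

    Returns the resulting symbol list and the product of the probabilities used.
    """
    result, prob = [], 1.0
    for symbol in symbols:
        if symbol in helpers:
            rhs, rule_prob = cnf_rules[symbol]
            expanded, sub_prob = expand(rhs)
            result.extend(expanded)
            prob *= rule_prob * sub_prob
        else:
            result.append(symbol)
    return result, prob


rhs, first_prob = cnf_rules["W"]
symbols, helper_prob = expand(rhs)
total = first_prob * helper_prob
print(symbols, total)
```

The expansion recovers the right-hand side of the old rule, and the total probability equals `p` because every helper rule has probability 1.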

Page 8: Comp. Genomics

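Question 4 from Moed B (second exam sitting), 5772 (2011/12)

• Let $x = x_1 x_2 \dots x_n$ be an RNA string over the alphabet ACGU.

• A two-dimensional folding of the string is a collection of disjoint pairs of indices between 1 and $n$ that is nested: if positions $a, b$ are paired and positions $c, d$ are paired, with $a < b$, $c < d$ and $a < c$, then $c < b < d$ is impossible.

• A base can be paired with another base in the sequence; a base that is not paired is called free. Pairing is possible between the bases A and U, and between C and G. For a pair $(i, j)$, we define the pair $(i+1, j-1)$ as its adjacent pair.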

Page 9: Comp. Genomics

Question 4 from Moed B, 5772 (continued)

• We define a simple energy model of folding as follows:

• If a base is free, it has no energy contribution.

• If a base is paired, it has a (negative) energy contribution if and only if its adjacent pair is also paired.

• Describe a dynamic programming algorithm, as efficient as possible, that finds a folding maximizing the number of pairs whose adjacent pair is also paired (i.e., a folding with minimum energy).

Page 10: Comp. Genomics

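Solution to Question 4, Moed B, 5772

• Define $A(i, j)$ = a minimum-energy folding between $i$ and $j$ (scored by the number of pairs whose adjacent pair is also paired).

• $W(i, j)$ = a minimum-energy folding when $i$ and $j$ are not paired with each other.

• $V(i, j)$ = a minimum-energy folding when $i$ and $j$ are paired with each other.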

Page 11: Comp. Genomics

Solution to Question 4, Moed B, 5772 (continued)

• $A(i, j) = \max(W(i, j), V(i, j))$ if $x_i$ and $x_j$ can be paired, $W(i, j)$ otherwise

• $W(i, j) = \max_{i \le k < j} \big(A(i, k) + A(k+1, j)\big)$

• $V(i, j) = \max(1 + V(i+1, j-1), W(i+1, j-1))$ if $x_{i+1}$ and $x_{j-1}$ can be paired, $W(i+1, j-1)$ otherwise

• $A(i, i) = 0$, $A(i, i+1) = 0$
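A memoized Python sketch of the three tables may help to see how the recurrences interact. It rests on a few assumptions that are not spelled out in the slides: indices are 0-based, empty and single-base intervals score 0, and the stacking term $1 + V(i+1, j-1)$ is used only when $x_{i+1}$ and $x_{j-1}$ can themselves be paired. The function name `max_stacked_pairs` is hypothetical, and the recursion is meant for short sequences only.

```python
from functools import lru_cache

PAIRABLE = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def max_stacked_pairs(x):
    """Maximum number of pairs (i, j) whose adjacent pair (i+1, j-1) is also paired."""

    def can_pair(i, j):
        return i < j and (x[i], x[j]) in PAIRABLE

    @lru_cache(maxsize=None)
    def A(i, j):
        if j <= i:  # empty interval or a single base
            return 0
        best = W(i, j)
        if can_pair(i, j):
            best = max(best, V(i, j))
        return best

    @lru_cache(maxsize=None)
    def W(i, j):
        # i and j are not paired with each other: split the interval in two.
        if j <= i:
            return 0
        return max(A(i, k) + A(k + 1, j) for k in range(i, j))

    @lru_cache(maxsize=None)
    def V(i, j):
        # i and j are paired with each other; the pair scores 1 only if
        # the adjacent pair (i+1, j-1) is paired as well.
        if not can_pair(i + 1, j - 1):
            return W(i + 1, j - 1)
        return max(1 + V(i + 1, j - 1), W(i + 1, j - 1))

    return A(0, len(x) - 1)


for seq in ["GGCC", "GGGCCC", "GCGC", "AAAA"]:
    print(seq, max_stacked_pairs(seq))
```

There are $O(n^2)$ table entries and the split in `W` costs $O(n)$ per entry, so this sketch runs in $O(n^3)$ time.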

Page 12: Comp. Genomics

EM algorithm for SCFG

1. Initial estimate.

2. Calculate expectations: $E(X \to YZ)$, $E(X)$

3. Update rule: $P^{t+1}(X \to YZ) = E(X \to YZ) / E(X)$

4. Repeat until convergence.
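Step 3 can be sketched as a plain normalization of expected counts. The function name `m_step` and the example counts are hypothetical. The sketch also assumes that $E(X)$ equals the sum of the expected counts of all rules with $X$ on the left-hand side, which holds because every use of $X$ applies exactly one of its rules.

```python
def m_step(expected_rule_counts):
    """Update rule: P(X -> rhs) = E(X -> rhs) / E(X)."""
    totals = {}
    for (lhs, _rhs), count in expected_rule_counts.items():
        totals[lhs] = totals.get(lhs, 0.0) + count
    return {
        (lhs, rhs): count / totals[lhs]
        for (lhs, rhs), count in expected_rule_counts.items()
        if totals[lhs] > 0  # non-terminals that were never used are skipped
    }


# Hypothetical expected counts from an E-step.
counts = {("X", ("Y", "Z")): 3.0, ("X", ("a",)): 1.0, ("Y", ("b",)): 2.0}
new_probs = m_step(counts)
print(new_probs)
```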

Page 13: Comp. Genomics

Probability calculation | x, Θ

• The probability that state $v$ is used as a root in the derivation of $x_i, \dots, x_j$:

$$P(v \Rightarrow x_i, \dots, x_j \mid x, \Theta) = \frac{1}{P(x \mid \Theta)}\, \alpha(i, j, v)\, \beta(i, j, v)$$

• The probability that the rule $v \to yz$ is used in deriving $x_i, \dots, x_j$ ($v$ is the root):

$$P(v \to yz \mid x, \Theta) = \frac{1}{P(x \mid \Theta)}\, \beta(i, j, v) \sum_{k=i}^{j-1} \alpha(i, k, y)\, \alpha(k+1, j, z)\, t_v(y, z)$$

Page 14: Comp. Genomics

Expectation calculation

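• The expected number of times state $v$ is used in a derivation:

$$c(v) = \frac{1}{P(x \mid \Theta)} \sum_{i=1}^{L} \sum_{j=i}^{L} \alpha(i, j, v)\, \beta(i, j, v)$$

• The expected number of times the rule $v \to yz$ is used:

$$c(v \to yz) = \frac{1}{P(x \mid \Theta)} \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} \sum_{k=i}^{j-1} \beta(i, j, v)\, \alpha(i, k, y)\, \alpha(k+1, j, z)\, t_v(y, z)$$

Here $\alpha$ is the inside probability and $\beta$ is the outside probability.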

Page 15: Comp. Genomics

EM for SCFG

• How to compute the new probability for $v \to yz$?

$$\frac{c(v \to yz)}{c(v)}$$

• What about $v \to a$?

$$\frac{c(v \to a)}{c(v)} = \frac{\sum_{i \mid x_i = a} \beta(i, i, v)\, e_v(a)}{\sum_{i=1}^{L} \sum_{j=i}^{L} \alpha(i, j, v)\, \beta(i, j, v)}$$

Page 16: Comp. Genomics

Example

• Suppose that our data contains the following sentence:

He hangs pictures without frames

[Figure: parse tree T1 of the sentence: (S (N He) (V (V hangs) (N (N pictures) (P (PP without) (N frames)))))]

Page 17: Comp. Genomics

Example

The sentence was generated using the following production rules:

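$S \to N\,V$ with probability $p(S \to N\,V)$

$V \to V\,N$ …

$N \to N\,P$ …

$P \to PP\ N$

$N \to \text{He}$

$V \to \text{hangs}$

$N \to \text{pictures}$

$PP \to \text{without}$

$N \to \text{frames}$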

Page 18: Comp. Genomics

Example

• The likelihood of this sentence is:

$$\sum_j P(\text{sentence}, T_j) = P(\text{sentence}, T_1) = p(S \to NV)\, p(V \to VN)\, p(N \to NP)\, p(P \to PP\ N)\, p(N \to \text{He})\, p(V \to \text{hangs})\, p(N \to \text{pictures})\, p(PP \to \text{without})\, p(N \to \text{frames})$$

We believe in our sentence! We start with some initial probabilities and want to improve the likelihood of the sentence using the EM algorithm.
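The likelihood of a single parse tree is just the product of the probabilities of the rules it uses, which the sketch below computes recursively for T1. The numeric probabilities are hypothetical initial values chosen so that each non-terminal's rules sum to 1. The nested-tuple tree encoding and the function name `tree_probability` are likewise illustrative assumptions.

```python
from math import prod

# Hypothetical initial rule probabilities (not from the slides).
p = {
    ("S", ("N", "V")): 1.0,
    ("V", ("V", "N")): 0.5,
    ("V", ("hangs",)): 0.5,
    ("N", ("N", "P")): 0.1,
    ("N", ("He",)): 0.3,
    ("N", ("pictures",)): 0.3,
    ("N", ("frames",)): 0.3,
    ("P", ("PP", "N")): 1.0,
    ("PP", ("without",)): 1.0,
}

# Parse tree T1 as nested tuples: (label, child, ...) or (label, word).
T1 = ("S",
      ("N", "He"),
      ("V",
       ("V", "hangs"),
       ("N",
        ("N", "pictures"),
        ("P", ("PP", "without"), ("N", "frames")))))


def tree_probability(tree, probs):
    """Product of the probabilities of all rules used in the tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return probs[(label, (children[0],))]  # lexical rule, e.g. N -> He
    rhs = tuple(child[0] for child in children)
    return probs[(label, rhs)] * prod(tree_probability(c, probs) for c in children)


print(tree_probability(T1, p))
```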

Page 19: Comp. Genomics

Example

• To make it more interesting, let’s add another production rule:

$V \to V\,N\,P$

[Figure: parse tree T2 of "He hangs pictures without frames": (S (N He) (V (V hangs) (N pictures) (P (PP without) (N frames))))]

Page 20: Comp. Genomics

Example

• But now the grammar is no longer Chomsky normal form

• We will turn it into Chomsky normal form as follows:

• $V \to V\ \text{N-P}$, with $p(V \to V\ \text{N-P}) = p(V \to VNP)$

• $\text{N-P} \to N\,P$, with $p(\text{N-P} \to N\,P) = 1.0$

Page 21: Comp. Genomics

Example

• Compute inside probabilities

$$\alpha_{ij}(X) = \sum_{Y, Z} \sum_{k=i}^{j-1} p(X \to YZ)\, \alpha_{ik}(Y)\, \alpha_{k+1,j}(Z)$$
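A minimal Python sketch of the inside recursion on the example sentence follows. The grammar is the normal-form version of the example, but all numeric probabilities are hypothetical. The dictionary layout, the 1-based inclusive indexing of `alpha[(i, j, X)]` and the function name `inside` are assumptions made for illustration.

```python
from collections import defaultdict

# (X, Y, Z): p(X -> Y Z); hypothetical values.
binary = {
    ("S", "N", "V"): 1.0,
    ("V", "V", "N"): 0.3,
    ("V", "V", "N-P"): 0.2,
    ("N-P", "N", "P"): 1.0,
    ("N", "N", "P"): 0.1,
    ("P", "PP", "N"): 1.0,
}
# (X, word): p(X -> word); hypothetical values.
lexical = {
    ("N", "He"): 0.3, ("N", "pictures"): 0.3, ("N", "frames"): 0.3,
    ("V", "hangs"): 0.5, ("PP", "without"): 1.0,
}


def inside(words, binary, lexical):
    """Return alpha[(i, j, X)]: probability that X derives words i..j."""
    L = len(words)
    alpha = defaultdict(float)
    # Initialization: the diagonal boxes come from the lexical rules.
    for i, w in enumerate(words, start=1):
        for (X, word), prob in lexical.items():
            if word == w:
                alpha[(i, i, X)] = prob
    # Fill boxes by increasing span so both children are ready when needed.
    for span in range(2, L + 1):
        for i in range(1, L - span + 2):
            j = i + span - 1
            for (X, Y, Z), prob in binary.items():
                for k in range(i, j):
                    alpha[(i, j, X)] += prob * alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
    return alpha


words = "He hangs pictures without frames".split()
alpha = inside(words, binary, lexical)
print(alpha[(1, 5, "S")])  # likelihood of the sentence, summed over T1 and T2
```

With these values the two parse trees contribute 0.000405 (T1) and 0.0027 (T2), so `alpha[(1, 5, "S")]` should come out as their sum.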

Page 22: Comp. Genomics

Example

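[Figure: chart over the words He (1), hangs (2), pictures (3), without (4), frames (5), with the diagonal boxes filled in: N at (1,1), V at (2,2), N at (3,3), PP at (4,4), N at (5,5).]

$\alpha_{11}(N) = p(N \to \text{He})$

$\alpha_{22}(V) = p(V \to \text{hangs})$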

Page 23: Comp. Genomics

Example

[Figure: the same chart with the next diagonal added: S at (1,2), V at (2,3), P at (4,5).]

$\alpha_{12}(S) = p(S \to NV)\, \alpha_{11}(N)\, \alpha_{22}(V)$

$\alpha_{23}(V) = p(V \to VN)\, \alpha_{22}(V)\, \alpha_{33}(N)$

Page 24: Comp. Genomics

Example

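[Figure: the completed chart. In addition to the earlier entries, box (1,3) holds S, box (3,5) holds N and N-P, box (2,5) holds V, and box (1,5) holds S. Box (3,5) accounts for substring 3-5; box (1,3) accounts for substring 1-3.]

$\alpha_{15}(S) = p(S \to NV)\, \alpha_{11}(N)\, \alpha_{25}(V)$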

Page 25: Comp. Genomics

Example

• Compute outside probabilities

$$\beta_{ij}(Z) = \sum_{Y, X} \sum_{k=1}^{i-1} p(X \to YZ)\, \alpha_{k,i-1}(Y)\, \beta_{kj}(X) + \sum_{Y, X} \sum_{k=j+1}^{L} p(X \to ZY)\, \alpha_{j+1,k}(Y)\, \beta_{ik}(X)$$
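The outside recursion can be sketched in the same style. The block is self-contained, so it repeats a compact inside pass that the outside pass depends on. As before, the numeric probabilities, the dictionary layout and the function names `inside` and `outside` are hypothetical, and boxes are indexed 1-based and inclusive.

```python
from collections import defaultdict

BINARY = {  # (X, Y, Z): p(X -> Y Z); hypothetical values
    ("S", "N", "V"): 1.0, ("V", "V", "N"): 0.3, ("V", "V", "N-P"): 0.2,
    ("N-P", "N", "P"): 1.0, ("N", "N", "P"): 0.1, ("P", "PP", "N"): 1.0,
}
LEXICAL = {  # (X, word): p(X -> word); hypothetical values
    ("N", "He"): 0.3, ("N", "pictures"): 0.3, ("N", "frames"): 0.3,
    ("V", "hangs"): 0.5, ("PP", "without"): 1.0,
}


def inside(words):
    L = len(words)
    alpha = defaultdict(float)
    for i, w in enumerate(words, start=1):
        for (X, word), prob in LEXICAL.items():
            if word == w:
                alpha[(i, i, X)] = prob
    for span in range(2, L + 1):
        for i in range(1, L - span + 2):
            j = i + span - 1
            for (X, Y, Z), prob in BINARY.items():
                for k in range(i, j):
                    alpha[(i, j, X)] += prob * alpha[(i, k, Y)] * alpha[(k + 1, j, Z)]
    return alpha


def outside(words, alpha, start="S"):
    """Return beta[(i, j, Z)], filled from the longest span down to single words."""
    L = len(words)
    beta = defaultdict(float)
    beta[(1, L, start)] = 1.0
    for span in range(L - 1, 0, -1):
        for i in range(1, L - span + 2):
            j = i + span - 1
            for (X, left, right), prob in BINARY.items():
                # First term: (i, j) is the right child; the left sibling spans k..i-1.
                for k in range(1, i):
                    beta[(i, j, right)] += prob * alpha[(k, i - 1, left)] * beta[(k, j, X)]
                # Second term: (i, j) is the left child; the right sibling spans j+1..k.
                for k in range(j + 1, L + 1):
                    beta[(i, j, left)] += prob * alpha[(j + 1, k, right)] * beta[(i, k, X)]
    return beta


words = "He hangs pictures without frames".split()
alpha = inside(words)
beta = outside(words, alpha)
print(beta[(2, 5, "V")], beta[(5, 5, "N")])
```

A useful check is that, for any word position $i$, the sum over $X$ of $\alpha_{ii}(X)\,\beta_{ii}(X)$ equals the sentence likelihood $\alpha_{1L}(S)$.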

Page 26: Comp. Genomics

Example

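$$\beta_{ij}(Z) = \sum_{Y, X} \sum_{k=1}^{i-1} p(X \to YZ)\, \alpha_{k,i-1}(Y)\, \beta_{kj}(X) + \dots$$

[Figure: X spans k..j; its left child Y spans k..i-1 and its right child Z spans i..j.]

$$\dots + \sum_{Y, X} \sum_{k=j+1}^{L} p(X \to ZY)\, \alpha_{j+1,k}(Y)\, \beta_{ik}(X)$$

[Figure: X spans i..k; its left child Z spans i..j and its right child Y spans j+1..k.]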

Page 27: Comp. Genomics

Example

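[Figure: chart over He, hangs, pictures, without, frames, with the boxes for S at (1,5), V at (2,5) and N at (5,5) marked.]

$\beta_{15}(S) = 1$

$\beta_{25}(V) = \alpha_{11}(N)\, \beta_{15}(S)\, p(S \to NV)$

$\beta_{55}(N) = p(P \to PP\ N)\, \alpha_{44}(PP)\, \beta_{45}(P)$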

Page 28: Comp. Genomics

Example

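Let's improve $p(V \to VN)$. The expected number of times it is used:

$$c(v \to yz) = \frac{1}{P(x \mid \Theta)} \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} \sum_{k=i}^{j-1} \beta(i, j, v)\, \alpha(i, k, y)\, \alpha(k+1, j, z)\, t_v(y, z)$$

$$c(V \to VN) = \frac{1}{\alpha_{15}(S)} \sum_{i=1}^{4} \sum_{j=i+1}^{5} \sum_{k=i}^{j-1} \beta(i, j, V)\, \alpha(i, k, V)\, \alpha(k+1, j, N)\, p(V \to VN)$$

We saw the rule $V \to VN$ in two squares, and some squares don't have $\alpha$, $\beta$, so we get:

$$c(V \to VN) = \frac{p(V \to VN)}{\alpha_{15}(S)} \big[\beta_{25}(V)\, \alpha_{22}(V)\, \alpha_{35}(N) + \beta_{23}(V)\, \alpha_{22}(V)\, \alpha_{33}(N)\big]$$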

Page 29: Comp. Genomics

Example

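The expected number of times that V is visited:

$$c(v) = \frac{1}{P(x \mid \Theta)} \sum_{i=1}^{L} \sum_{j=i}^{L} \alpha(i, j, v)\, \beta(i, j, v)$$

where $P(x \mid \Theta) = \alpha_{15}(S)$ in this example.

This is actually the same as:

$$c(V) = c(V \to VN) + c(V \to VNP) + c(V \to \text{hangs})$$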

Page 30: Comp. Genomics

Example

• In order to get the new $p(V \to VN)$, we divide and get:

$$\frac{c(V \to VN)}{c(V \to VN) + c(V \to VNP) + c(V \to \text{hangs})}$$

• Similarly, for $p(V \to \text{hangs})$, we get:

$$\frac{c(V \to \text{hangs})}{c(V \to VN) + c(V \to VNP) + c(V \to \text{hangs})}$$
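For this small example the expected counts can also be obtained by brute force, without the inside-outside recursions. Under the example grammar the sentence has exactly two parse trees (T1 and T2), so each count is the posterior-weighted number of times the rule appears in each tree. The sketch below uses this enumeration shortcut rather than inside-outside itself. The probabilities are hypothetical, and the rule strings and variable names are illustrative.

```python
from collections import Counter
from math import prod

# Hypothetical current rule probabilities (not from the slides).
p = {
    "S -> N V": 1.0,
    "V -> V N": 0.3, "V -> V N P": 0.2, "V -> hangs": 0.5,
    "N -> N P": 0.1, "N -> He": 0.3, "N -> pictures": 0.3, "N -> frames": 0.3,
    "P -> PP N": 1.0, "PP -> without": 1.0,
}

# Rules used by each parse tree of "He hangs pictures without frames".
lexical = ["N -> He", "V -> hangs", "N -> pictures", "PP -> without", "N -> frames"]
trees = {
    "T1": Counter(["S -> N V", "V -> V N", "N -> N P", "P -> PP N"] + lexical),
    "T2": Counter(["S -> N V", "V -> V N P", "P -> PP N"] + lexical),
}

# Joint probability of the sentence with each tree, and the sentence likelihood.
joint = {name: prod(p[r] ** n for r, n in rules.items()) for name, rules in trees.items()}
likelihood = sum(joint.values())

# Expected counts: rule occurrences weighted by the posterior of each tree.
expected = Counter()
for name, rules in trees.items():
    for rule, n in rules.items():
        expected[rule] += n * joint[name] / likelihood

# Re-estimate the rules of V by dividing by c(V).
v_rules = ["V -> V N", "V -> V N P", "V -> hangs"]
c_V = sum(expected[r] for r in v_rules)
new_p = {r: expected[r] / c_V for r in v_rules}
print(new_p)
```

Since V is used exactly twice in either tree, `c_V` is 2 here and the new value of `p(V -> hangs)` stays at 0.5, while the remaining mass is split between the two binary rules according to the posterior of T1 and T2.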

Page 31: Comp. Genomics

The CYK algorithm

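• Initialization: for $i = 1 \dots L$, $v = 1 \dots M$:

$$\gamma(i, i, v) = \log e_v(x_i)$$

• Iteration: for $i = 1 \dots L-1$, $j = i+1 \dots L$, $v = 1 \dots M$:

$$\gamma(i, j, v) = \max_{y, z} \max_{k = i \dots j-1} \{\gamma(i, k, y) + \gamma(k+1, j, z) + \log t_v(y, z)\}$$

• Termination: score of the optimal parse tree $\pi^*$ for sentence $x$:

$$\log P(x, \pi^* \mid \Theta) = \gamma(1, L, 1)$$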

Page 32: Comp. Genomics

The CYK algorithm

• Looks similar to the inside algorithm, but we take the maximum instead of summing (consider the forward algorithm vs. Viterbi)
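The sketch below mirrors the inside recursion with the sum replaced by a maximum, computed in log space. It uses the example grammar with hypothetical probabilities. Boxes are filled by increasing span length so that sub-spans are always available, which is an implementation choice of this sketch. The function name `cyk` is illustrative, and no traceback is kept, so only the score of the best parse is returned.

```python
import math

# (X, Y, Z): p(X -> Y Z); hypothetical values.
binary = {
    ("S", "N", "V"): 1.0, ("V", "V", "N"): 0.3, ("V", "V", "N-P"): 0.2,
    ("N-P", "N", "P"): 1.0, ("N", "N", "P"): 0.1, ("P", "PP", "N"): 1.0,
}
# (X, word): p(X -> word); hypothetical values.
lexical = {
    ("N", "He"): 0.3, ("N", "pictures"): 0.3, ("N", "frames"): 0.3,
    ("V", "hangs"): 0.5, ("PP", "without"): 1.0,
}


def cyk(words, binary, lexical, start="S"):
    """Log probability of the best parse; -inf if the sentence cannot be parsed."""
    L = len(words)
    gamma = {}  # gamma[(i, j, X)], 1-based inclusive; missing entries mean -inf
    # Initialization from the lexical rules.
    for i, w in enumerate(words, start=1):
        for (X, word), prob in lexical.items():
            if word == w:
                gamma[(i, i, X)] = math.log(prob)
    # Iteration: take the best split and rule instead of summing over them.
    for span in range(2, L + 1):
        for i in range(1, L - span + 2):
            j = i + span - 1
            for (X, Y, Z), prob in binary.items():
                for k in range(i, j):
                    left = gamma.get((i, k, Y))
                    right = gamma.get((k + 1, j, Z))
                    if left is None or right is None:
                        continue
                    score = left + right + math.log(prob)
                    if score > gamma.get((i, j, X), -math.inf):
                        gamma[(i, j, X)] = score
    # Termination.
    return gamma.get((1, L, start), -math.inf)


words = "He hangs pictures without frames".split()
best = cyk(words, binary, lexical)
print(best, math.exp(best))
```

With these values the best parse is T2, whose probability (0.0027) is larger than that of T1 (0.000405).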

Page 33: Comp. Genomics

Summary

Goal                      HMM algorithm      Time       SCFG algorithm   Time
Optimal alignment         Viterbi            |Q|^2 L    CYK              |M|^3 L^3
P(x|Θ)                    forward            |Q|^2 L    inside           |M|^3 L^3
EM parameter estimation   forward-backward   |Q|^2 L    inside-outside   |M|^3 L^3

M: SCFG symbols, Q: HMM states, L: data length

Page 34: Comp. Genomics

Summary

Goal                      HMM algorithm      Space      SCFG algorithm   Space
Optimal alignment         Viterbi            |Q| L      CYK              |M| L^2
P(x|Θ)                    forward            |Q| L      inside           |M| L^2
EM parameter estimation   forward-backward   |Q| L      inside-outside   |M| L^2