Page 1: Minimum Edit Distance

Minimum Edit Distance

Minimum Edit Distance in Computational Biology

Page 2: Minimum Edit Distance

Dan Jurafsky

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Page 3: Minimum Edit Distance

Why sequence alignment?

• Comparing genes or regions from different species
  • to find important regions
  • determine function
  • uncover evolutionary forces
• Assembling fragments to sequence DNA
• Comparing individuals to look for mutations

Page 4: Minimum Edit Distance

Alignments in two fields

• In Natural Language Processing
  • We generally talk about distance (minimized)
  • And weights
• In Computational Biology
  • We generally talk about similarity (maximized)
  • And scores

Page 5: Minimum Edit Distance

The Needleman-Wunsch Algorithm

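• Initialization:

$$D(i,0) = -i \cdot d, \qquad D(0,j) = -j \cdot d$$

• Recurrence Relation:

$$D(i,j) = \max \begin{cases} D(i-1,j) - d \\ D(i,j-1) - d \\ D(i-1,j-1) + s[x(i),y(j)] \end{cases}$$

• Termination: D(N,M) is the optimal global alignment score (the similarity counterpart of distance)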
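The initialization, recurrence, and termination above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch` and the default scores (match +1, mismatch -1, gap penalty d = 1) are assumptions chosen for the example; the slide leaves s and d unspecified.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment score D(N, M) for strings x and y.

    match / mismatch stand in for the substitution score s (assumed values);
    d is the gap penalty, given as a positive number and subtracted.
    """
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, n + 1):
        D[i][0] = -i * d
    for j in range(1, m + 1):
        D[0][j] = -j * d

    # Recurrence: best of gap in y, gap in x, or aligning x(i) with y(j).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j] - d,
                          D[i][j - 1] - d,
                          D[i - 1][j - 1] + s)

    # Termination: the bottom-right cell holds the score.
    return D[n][m]


print(needleman_wunsch("ATCAT", "ATTATC"))  # 2
```

Because gaps are subtracted and matches are added, a larger value means a better alignment, which is the opposite direction from edit distance.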

Page 6: Minimum Edit Distance

The Needleman-Wunsch Matrix

Slide adapted from Serafim Batzoglou

[Figure: the alignment matrix, with x1 … xM across the top and y1 … yN down the side.]

(Note that the origin is at the upper left.)

Page 7: Minimum Edit Distance

A variant of the basic algorithm:

• Maybe it is OK to have an unlimited number of gaps at the beginning and end:

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

• If so, we don’t want to penalize gaps at the ends

Slide from Serafim Batzoglou

Page 8: Minimum Edit Distance

Different types of overlaps

Slide from Serafim Batzoglou

Example: 2 overlapping “reads” from a sequencing project

Example: Search for a mouse gene within a human chromosome

Page 9: Minimum Edit Distance

The Overlap Detection variant

Changes:

1. Initialization: for all i, j,

$$F(i,0) = 0, \qquad F(0,j) = 0$$

2. Termination:

$$F_{OPT} = \max \begin{cases} \max_i F(i,N) \\ \max_j F(M,j) \end{cases}$$

Slide from Serafim Batzoglou

[Figure: the alignment matrix, with x1 … xM across the top and y1 … yN down the side.]
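The two changes can be sketched as follows. This is an illustrative version under stated assumptions: the name `overlap_score` and the scoring values (match +1, mismatch -1, d = 1) are not from the slides, and the interior recurrence is assumed to be the unchanged Needleman-Wunsch one.

```python
def overlap_score(x, y, match=1, mismatch=-1, d=1):
    """Best overlap score F_OPT between x (length M) and y (length N).

    Scoring values are assumed for illustration only.
    """
    M, N = len(x), len(y)

    # Change 1: first row and first column are all zero,
    # so gaps at the beginning of either string are free.
    F = [[0] * (N + 1) for _ in range(M + 1)]

    # Interior cells use the same recurrence as the global algorithm.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + s)

    # Change 2: the alignment may stop anywhere in the last column
    # (all of y used) or the last row (all of x used), so trailing
    # gaps are free as well.
    best_last_column = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_column, best_last_row)


# Suffix of one read overlapping the prefix of another.
print(overlap_score("AAACCC", "CCCGGG"))  # 3
# A short string found inside a longer one.
print(overlap_score("TTTACGTTT", "ACG"))  # 3
```

Unlike the local variant, interior cells are not clamped at zero here; only the end gaps are free.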

Page 10: Minimum Edit Distance

The Local Alignment Problem

Given two strings x = x1……xM, y = y1……yN

Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum

x = aaaacccccggggtta
y = ttcccgggaaccaacc

Slide from Serafim Batzoglou

Page 11: Minimum Edit Distance

The Smith-Waterman algorithm

Idea: Ignore badly aligning regions

Modifications to Needleman-Wunsch:

Initialization:

$$F(0,j) = 0, \qquad F(i,0) = 0$$

Iteration:

$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j) - d \\ F(i,j-1) - d \\ F(i-1,j-1) + s(x_i, y_j) \end{cases}$$

Slide from Serafim Batzoglou

Page 12: Minimum Edit Distance

The Smith-Waterman algorithm

Termination:

1. If we want the best local alignment…

$$F_{OPT} = \max_{i,j} F(i,j)$$

Find F_OPT and trace back

2. If we want all local alignments scoring > t

?? For all i, j find F(i, j) > t, and trace back?

Complicated by overlapping local alignments

Slide from Serafim Batzoglou
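Case 1 (find F_OPT and trace back) can be sketched as below. This is a minimal illustration under assumptions: the function name `smith_waterman`, the scoring values (match +1, mismatch -1, d = 1), and the tie-breaking rules (first best cell in row order, diagonal move preferred during traceback) are choices made for the example, not something the slides prescribe.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Return (F_OPT, aligned part of x, aligned part of y).

    Scores and tie-breaking are assumptions made for this sketch.
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]  # first row/column stay 0
    best, bi, bj = 0, 0, 0

    # Fill: same recurrence as the global case, but never below 0.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + s)
            if F[i][j] > best:  # remember the first cell holding the maximum
                best, bi, bj = F[i][j], i, j

    # Traceback: walk from the best cell until a zero cell is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:    # x_i aligned with y_j
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:      # x_i aligned with a gap
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:                                 # y_j aligned with a gap
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("ATCAT", "ATTATC"))  # (3, 'ATC', 'ATC')
```

Several cells can share the maximum score, and tracebacks from different high-scoring cells can pass through the same cells; that sharing is what makes case 2 harder than repeating this procedure for every cell above t.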

Page 13: Minimum Edit Distance

Local alignment example

          A  T  T  A  T  C
       0  0  0  0  0  0  0
    A  0
    T  0
    C  0
    A  0
    T  0

X = ATCAT
Y = ATTATC

Let: m = 1 (1 point for match), d = 1 (-1 point for del/ins/sub)

Page 14: Minimum Edit Distance

Local alignment example

          A  T  T  A  T  C
       0  0  0  0  0  0  0
    A  0  1  0  0  1  0  0
    T  0  0  2  1  0  2  1
    C  0  0  1  1  0  1  3
    A  0  1  0  0  2  1  2
    T  0  0  2  1  1  3  2

X = ATCAT
Y = ATTATC
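As a worked check of the recurrence, two cells can be derived by hand with m = 1 and d = 1. The indexing is an assumption read off the table layout: i counts rows (symbols of X) and j counts columns (symbols of Y).

$$F(2,2) = \max\{0,\ F(1,2) - 1,\ F(2,1) - 1,\ F(1,1) + 1\} = \max\{0,\ -1,\ -1,\ 2\} = 2$$

$$F(3,2) = \max\{0,\ F(2,2) - 1,\ F(3,1) - 1,\ F(2,1) - 1\} = \max\{0,\ 1,\ -1,\ -1\} = 1$$

The first cell extends the A/A match diagonally with a T/T match. The second compares C with T, so the diagonal move loses a point and the best option is a gap taken from the cell above.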

Page 17: Minimum Edit Distance

Minimum Edit Distance

Minimum Edit Distance in Computational Biology

