CS 466 Introduction to Bioinformatics Lecture 4 Mohammed El-Kebir September 6, 2019


Mar 14, 2020


CS 466 Introduction to Bioinformatics

Lecture 4

Mohammed El-Kebir
September 6, 2019

Outline
1. Fitting alignment
2. Local alignment
3. Gapped alignment
4. BLOSUM scoring matrix

Reading:
• Jones and Pevzner. Chapters 6.6-6.9
• Lecture notes

NGS Characterized by Short Reads

Genome: millions to billions of nucleotides

Next-generation DNA sequencing: tens to hundreds of millions of short reads
Short read: 100 nucleotides

… GGTAGTTAG …
… TATAATTAG …
… AGCCATTAG …
… CGTACCTAG …
… CATTCAGTAG …
… GGTAAACTAG …

Allow for inexact matches due to:
• Sequencing errors
• Polymorphisms/mutations in reference genome

Question: How to account for discrepancy between lengths of reference and short read?

Human reference genome is 3,300,000,000 nucleotides, while a short read is 100 nucleotides. Global sequence alignment will not work!

Fitting Alignment

For short read alignment, we want to align the complete short read 𝐯 ∈ Σ^𝑚 to a substring of the reference genome 𝐰 ∈ Σ^𝑛. Note that 𝑚 ≪ 𝑛.

Fitting Alignment problem: Given strings 𝐯 ∈ Σ^𝑚 and 𝐰 ∈ Σ^𝑛 and scoring function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰.


Fitting Alignment – Naive Approach

• Consider all contiguous non-empty substrings of 𝐰, defined by 1 ≤ 𝑖 ≤ 𝑗 ≤ 𝑛
• How many? Answer: $n + \binom{n}{2} = n(n+1)/2$
• What are their total lengths?
• What is the running time?
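The naive approach can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `global_score` and `naive_fitting_score` are hypothetical, and the scores (match 1, mismatch and indel -1) are example values. Since the substrings of 𝐰 have total length n(n+1)(n+2)/6 and each global alignment costs time proportional to 𝑚 times the substring length, this brute force takes 𝑂(𝑚𝑛³) time.

```python
def global_score(v, w, match=1, mismatch=-1, indel=-1):
    """Score of an optimal global alignment of v and w (linear gap cost)."""
    prev = [j * indel for j in range(len(w) + 1)]
    for i in range(1, len(v) + 1):
        cur = [i * indel]
        for j in range(1, len(w) + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            cur.append(max(prev[j] + indel, cur[j - 1] + indel, prev[j - 1] + sub))
        prev = cur
    return prev[-1]


def naive_fitting_score(v, w):
    """Try every non-empty substring w[i:j] and keep the best global score."""
    n = len(w)
    substrings = [w[i:j] for i in range(n) for j in range(i + 1, n + 1)]
    # There are n + (n choose 2) = n(n+1)/2 non-empty substrings.
    assert len(substrings) == n * (n + 1) // 2
    return max(global_score(v, sub) for sub in substrings)


print(naive_fitting_score("AGG", "TACGGC"))
```

The sketch assumes 𝐰 is non-empty; it is meant to make the cost of enumerating substrings concrete, not to be used on genome-scale input.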


Fitting Alignment – Dynamic Programming

\[
s[i, j] = \max
\begin{cases}
0, & \text{if } i = 0,\\
s[i-1, j] + \delta(v_i, -), & \text{if } i > 0,\\
s[i, j-1] + \delta(-, w_j), & \text{if } i > 0 \text{ and } j > 0,\\
s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0.
\end{cases}
\]

\[
s^* = \max\{s[m, 0], \ldots, s[m, n]\}
\]

Start anywhere on first row. End anywhere on last row.

[DP table: rows 𝐯 = AGG, columns 𝐰 = TACGGC]

Question: Let match score be 1, mismatch/indel score be -1. What is 𝑠∗?

Question: Same scores. What is optimal global alignment and score?

𝐯: - A - G G -
𝐰: T A C G G C
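The fitting recurrence can be tried out with a short Python sketch. The function name `fitting_alignment_score` and its keyword parameters are illustrative assumptions, not part of the lecture; the default scores are those of the question above (match 1, mismatch and indel -1).

```python
def fitting_alignment_score(v, w, match=1, mismatch=-1, indel=-1):
    """Best score of aligning all of v against some substring of w."""
    m, n = len(v), len(w)
    # Row 0 is all zeros: the alignment may start anywhere on the first row.
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        # Column 0 has only the vertical predecessor.
        s[i][0] = s[i - 1][0] + indel
        for j in range(1, n + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(
                s[i - 1][j] + indel,    # v_i aligned to a gap
                s[i][j - 1] + indel,    # w_j aligned to a gap
                s[i - 1][j - 1] + sub,  # v_i aligned to w_j
            )
    # The alignment may end anywhere on the last row.
    return max(s[m])


print(fitting_alignment_score("AGG", "TACGGC"))
```

Only the initialization of the first row and the final maximum over the last row differ from global alignment, so the running time stays 𝑂(𝑚𝑛).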

Fitting Alignment – Dynamic Programming
• Online: https://valiec.github.io/AlignmentVisualizer/index.html


Local Alignment – Biological Motivation

Proteins are composed of functional units called domains. Such domains may occur in different proteins even across species.

[Figure: proteins ABL1 and SHKA. From Pfam database (http://pfam.sanger.ac.uk/)]

Local Alignment problem: Given strings 𝐯 ∈ Σ^𝑚 and 𝐰 ∈ Σ^𝑛 and scoring function 𝛿, find a substring of 𝐯 and a substring of 𝐰 whose alignment has maximum global alignment score 𝑠∗ among all global alignments of all substrings of 𝐯 and 𝐰.

Global, Fitting and Local Alignment

Global Alignment problem: Given strings 𝐯 ∈ Σ^𝑚 and 𝐰 ∈ Σ^𝑛 and scoring function 𝛿, find an alignment of 𝐯 and 𝐰 with maximum score.

Fitting Alignment problem: Given strings 𝐯 ∈ Σ^𝑚 and 𝐰 ∈ Σ^𝑛 and scoring function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰.

Local Alignment problem: Given strings 𝐯 ∈ Σ^𝑚 and 𝐰 ∈ Σ^𝑛 and scoring function 𝛿, find a substring of 𝐯 and a substring of 𝐰 whose alignment has maximum global alignment score 𝑠∗ among all global alignments of all substrings of 𝐯 and 𝐰.

Local Alignment – Naive Algorithm

Brute force:
1. Generate all pairs (𝐯′, 𝐰′) of substrings of 𝐯 and 𝐰
2. For each pair (𝐯′, 𝐰′), solve global alignment problem.

Question: There are $\binom{m}{2}\binom{n}{2}$ pairs of substrings. But they have different lengths. What is the running time?
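One way to approach the running-time question is to sum the cost of each global alignment over all pairs of non-empty substrings. The derivation below is a sketch under the assumption that aligning 𝐯′ and 𝐰′ costs time proportional to |𝐯′| · |𝐰′|; it counts substrings exactly rather than with the binomial approximation.

```latex
\[
\sum_{\mathbf{v}'} \sum_{\mathbf{w}'} |\mathbf{v}'| \, |\mathbf{w}'|
= \Big( \sum_{k=1}^{m} k \, (m - k + 1) \Big) \Big( \sum_{\ell=1}^{n} \ell \, (n - \ell + 1) \Big)
= \frac{m(m+1)(m+2)}{6} \cdot \frac{n(n+1)(n+2)}{6}
= O(m^3 n^3)
\]
```

Here 𝐯 has 𝑚 − 𝑘 + 1 substrings of length 𝑘, and similarly for 𝐰, so the brute force is polynomial but far too slow for long sequences.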

Key Idea

Global alignment:
• Start at (0,0) and end at (𝑚, 𝑛)

Local alignment:
• Start and end anywhere

[Figure from Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms. Cambridge, MA, USA: MIT Press, 2004. p. 182]


Local Alignment Recurrence

\[
s[i, j] = \max
\begin{cases}
0, & \\
s[i-1, j] + \delta(v_i, -), & \text{if } i > 0,\\
s[i, j-1] + \delta(-, w_j), & \text{if } j > 0,\\
s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0.
\end{cases}
\]

\[
s^* = \max_{i, j} s[i, j]
\]

The 0 option is available at every cell (𝑖, 𝑗), not only at (0, 0): this is what allows the alignment to start anywhere.

Start anywhere. End anywhere.

Running time: 𝑂(𝑚𝑛)

Local Alignment – Dynamic Programming
• Online: https://valiec.github.io/AlignmentVisualizer/index.html

Question: Let match score be 2, mismatch score be -2 and indel be -4. What is 𝑠∗?

[DP table: rows 𝐯 = AGG, columns 𝐰 = TACGGC]

𝐯: G G
𝐰: G G
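The local alignment question can be checked with a small Python sketch. The function name `local_alignment_score` is an illustrative assumption; the defaults are the scores from the question (match 2, mismatch -2, indel -4), and the 0 option is taken at every cell so that an alignment may start anywhere.

```python
def local_alignment_score(v, w, match=2, mismatch=-2, indel=-4):
    """Best global alignment score over all pairs of substrings of v and w."""
    m, n = len(v), len(w)
    # Row 0 and column 0 stay at 0: every cell may restart an alignment.
    s = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(
                0,                      # start a new alignment here
                s[i - 1][j] + indel,    # v_i aligned to a gap
                s[i][j - 1] + indel,    # w_j aligned to a gap
                s[i - 1][j - 1] + sub,  # v_i aligned to w_j
            )
            best = max(best, s[i][j])   # the alignment may end anywhere
    return best


print(local_alignment_score("AGG", "TACGGC"))
```

For 𝐯 = AGG and 𝐰 = TACGGC the best local alignment is GG against GG, consistent with the alignment shown on the slide.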


Scoring Gaps

Let 𝐯 = AAC and 𝐰 = ACAAC

Match 𝛿(𝑐, 𝑐) = 1; Mismatch 𝛿(𝑐, 𝑑) = −1 (where 𝑐 ≠ 𝑑); Indel 𝛿(𝑐, −) = 𝛿(−, 𝑐) = −2

Left alignment:
𝐯: A - - A C
𝐰: A C A A C

Right alignment:
𝐯: A - A - C
𝐰: A C A A C

Both alignments have 3 matches and 2 indels. Score: 3 · 1 + 2 · (−2) = −1

Question: Which alignment is better?


Scoring Gaps – Affine Gap Penalties

Desired: Lower penalty for consecutive gaps than interspersed gaps.

Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc.

Affine gap penalty: Two penalties: (i) gap open penalty 𝜌 ≥ 0 and (ii) gap extension penalty 𝜎 ≥ 0. A stretch of 𝑘 consecutive gaps has score −(𝜌 + 𝜎𝑘).

Left alignment:
𝐯: A - - A C
𝐰: A C A A C

Right alignment:
𝐯: A - A - C
𝐰: A C A A C

Let 𝜌 = 10 and 𝜎 = 1. Left: 3 · 1 − (10 + 1 · 2) = −9. Right: 3 · 1 − (10 + 1 · 1) − (10 + 1 · 1) = −19.
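The arithmetic for the two alignments can be reproduced with a small scoring helper. This is a sketch under assumptions: the function name `affine_alignment_score` is hypothetical, alignments are passed as two equal-length rows with `-` for gaps, and a stretch of 𝑘 consecutive gaps in the same row is charged 𝜌 + 𝜎𝑘.

```python
def affine_alignment_score(row_v, row_w, match=1, mismatch=-1, rho=10, sigma=1):
    """Score a given alignment (two gapped rows) under affine gap penalties."""
    assert len(row_v) == len(row_w)
    score = 0
    prev_gap = None  # row that held a gap in the previous column, if any
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            gap_row = "v" if a == "-" else "w"
            if gap_row != prev_gap:
                score -= rho   # a new stretch of gaps is opened
            score -= sigma     # every gap position pays the extension penalty
            prev_gap = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


print(affine_alignment_score("A--AC", "ACAAC"))  # left alignment
print(affine_alignment_score("A-A-C", "ACAAC"))  # right alignment
```

Setting `rho=0` and `sigma=2` recovers the linear indel score of −2 per gap, under which both alignments score the same.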


Affine Gap Penalty Alignment – Naive Approach

Idea: Insert horizontal (deletion) and vertical (insertion) edges spanning 𝑘 > 1 gaps with score −(𝜌 + 𝜎𝑘).

[Figure: alignment graph showing old edges and new edges]

Question: What’s the running time?
Question: What’s the recurrence?
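One possible recurrence for the naive approach, written here as a sketch rather than as the lecture's official answer, lets each cell look back over every gap length 𝑘 in its row and column:

```latex
\[
s[i, j] = \max
\begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0,\\
\max_{1 \le k \le j} \{ s[i, j-k] - (\rho + \sigma k) \}, & \text{if } j > 0,\\
\max_{1 \le k \le i} \{ s[i-k, j] - (\rho + \sigma k) \}, & \text{if } i > 0.
\end{cases}
\]
```

Each of the (𝑚 + 1)(𝑛 + 1) cells inspects up to 𝑖 + 𝑗 predecessors, which gives 𝑂(𝑚𝑛(𝑚 + 𝑛)) time overall. Nothing in this recurrence uses the affine form of the penalty, so the same scheme works for an arbitrary gap penalty function of 𝑘.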


Affine Gap Penalty Alignment

Idea: Three separate recurrences:
(i) Gap in first sequence 𝑠→[𝑖, 𝑗]
(ii) Match/mismatch 𝑠↘[𝑖, 𝑗]
(iii) Gap in second sequence 𝑠↓[𝑖, 𝑗]

\[
s^{\rightarrow}[i, j] = \max
\begin{cases}
s^{\rightarrow}[i, j-1] - \sigma, & \text{if } j > 1,\\
s^{\searrow}[i, j-1] - (\sigma + \rho), & \text{if } j > 0,
\end{cases}
\]

\[
s^{\searrow}[i, j] = \max
\begin{cases}
0, & \text{if } i = 0 \text{ and } j = 0,\\
s^{\rightarrow}[i, j], & \text{if } j > 0,\\
s^{\downarrow}[i, j], & \text{if } i > 0,\\
s^{\searrow}[i-1, j-1] + \delta(v_i, w_j), & \text{if } i > 0 \text{ and } j > 0,
\end{cases}
\]

\[
s^{\downarrow}[i, j] = \max
\begin{cases}
s^{\downarrow}[i-1, j] - \sigma, & \text{if } i > 1,\\
s^{\searrow}[i-1, j] - (\sigma + \rho), & \text{if } i > 0.
\end{cases}
\]

Running time: 𝑂(𝑚𝑛)

[Figure from Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms. Cambridge, MA, USA: MIT Press, 2004. p. 186]

Affine Gap Penalty Alignment – Example

𝐯 = AAC, 𝐰 = ACAAC

Let 𝜌 = 10 and 𝜎 = 1. Match = 1. Mismatch = -1
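The example can be worked through with a Python sketch of the three-table recurrence for global alignment with affine gap penalties. The function name `affine_global_score` and the table names `right`, `down` and `diag` (standing for 𝑠→, 𝑠↓ and 𝑠↘) are illustrative assumptions; undefined entries are represented by negative infinity.

```python
NEG_INF = float("-inf")


def affine_global_score(v, w, match=1, mismatch=-1, rho=10, sigma=1):
    """Optimal global alignment score with gap open rho and extension sigma."""
    m, n = len(v), len(w)
    right = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a horizontal gap edge
    down = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # ends with a vertical gap edge
    diag = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # best score at (i, j)
    for i in range(m + 1):
        for j in range(n + 1):
            if j > 0:
                # Extend an open horizontal gap, or open a new one.
                right[i][j] = max(right[i][j - 1] - sigma,
                                  diag[i][j - 1] - (sigma + rho))
            if i > 0:
                # Extend an open vertical gap, or open a new one.
                down[i][j] = max(down[i - 1][j] - sigma,
                                 diag[i - 1][j] - (sigma + rho))
            best = 0 if i == 0 and j == 0 else NEG_INF
            best = max(best, right[i][j], down[i][j])
            if i > 0 and j > 0:
                sub = match if v[i - 1] == w[j - 1] else mismatch
                best = max(best, diag[i - 1][j - 1] + sub)
            diag[i][j] = best
    return diag[m][n]


print(affine_global_score("AAC", "ACAAC"))
```

With 𝜌 = 10 and 𝜎 = 1 the optimum uses a single stretch of two gaps, which is the left alignment from the earlier slide. Each cell does constant work across the three tables, so the running time is 𝑂(𝑚𝑛).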

Gapped Alignment – Additional Insights

• Naive approach supports arbitrary gap penalties given two sequences 𝐯 ∈ Σ^𝑚 and 𝐰 ∈ Σ^𝑛. This results in an 𝑂(𝑚𝑛(𝑚 + 𝑛)) algorithm.

• Alignment with convex gap penalties given two sequences 𝐯 ∈ Σ^𝑚 and 𝐰 ∈ Σ^𝑛 can be computed in 𝑂(𝑚𝑛 log 𝑚) time. See: Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA.


Substitution Matrices

• Given a pair (𝐯, 𝐰) of aligned sequences, we want to assign a score that measures the relative likelihood that the sequences are related as opposed to being unrelated
• We need two models:
  • Random model 𝑅: each letter 𝑎 ∈ Σ occurs independently with probability $q_a$
  • Match model 𝑀: aligned pair (𝑎, 𝑏) ∈ Σ × Σ occurs with joint probability $p_{a,b}$

\[
\Pr(\mathbf{v}, \mathbf{w} \mid R) = \prod_i q_{v_i} \cdot \prod_i q_{w_i}
\qquad
\Pr(\mathbf{v}, \mathbf{w} \mid M) = \prod_i p_{v_i, w_i}
\]

\[
\log \frac{\Pr(\mathbf{v}, \mathbf{w} \mid M)}{\Pr(\mathbf{v}, \mathbf{w} \mid R)} = \sum_i s(v_i, w_i)
\quad \text{where} \quad
s(a, b) = \log \frac{p_{a,b}}{q_a q_b}
\]

Page 35: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters

BLOSUM (Blocks Substitution Matrices)• Henikoff and Henikoff, 1992• Computed using ungapped alignments of protein

segments (blocks) from BLOCKS database• Thousands of such blocks go into computing a

single BLOSUM matrix• Example of a one such block (right):

• 31 positions (columns)• 61 sequences (rows)

• Given threshold 𝐿, block is pruned down to largest set 𝐶 of sequences that have at least 𝐿% sequence identity to another sequence in 𝐶• How to compute 𝐶?

35


BLOSUM (Blocks Substitution Matrices)

\[
\log \frac{\Pr(\mathbf{v}, \mathbf{w} \mid M)}{\Pr(\mathbf{v}, \mathbf{w} \mid R)} = \sum_i s(v_i, w_i)
\quad \text{where} \quad
s(a, b) = \frac{1}{\lambda} \log \frac{p_{a,b}}{q_a q_b}
\]

• Null model frequencies $q_a$, $q_b$ of letters 𝑎 and 𝑏:
  • Count the number of occurrences of 𝑎 (𝑏) in all blocks
  • Divide by sum of lengths of each block (sequences * positions)

• Match model frequency $p_{a,b}$:
  • Count the number of pairs (𝑎, 𝑏) in all columns of all blocks
  • Divide by the total number of pairs within columns: $\sum_C n(C) \binom{m(C)}{2}$
  • 𝑚(𝐶) is the number of sequences in block 𝐶
  • 𝑛(𝐶) is the number of positions in block 𝐶

Example: (𝜆 = 0.5)

A A T
S A L
T A L
T A V
A A L

\[
q_A = \frac{7}{15}, \qquad
q_T = \frac{3}{15}, \qquad
p_{A,T} = \frac{4}{30}, \qquad
s(A, T) = 2 \cdot \log \frac{4/30}{\frac{7}{15} \cdot \frac{3}{15}} \approx 0.3
\]
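The counts in this example can be reproduced with a short Python sketch. The function names below are illustrative assumptions, and the use of a base-10 logarithm is inferred from the stated value of about 0.3 (a natural logarithm would give roughly 0.71); the slide does not state the base.

```python
from fractions import Fraction
from itertools import combinations
import math

# The 5 x 3 example block: one string per sequence (row).
block = ["AAT", "SAL", "TAL", "TAV", "AAL"]


def letter_frequency(block, a):
    """Null model q_a: occurrences of a divided by sequences * positions."""
    total = sum(len(seq) for seq in block)
    return Fraction(sum(seq.count(a) for seq in block), total)


def pair_frequency(block, a, b):
    """Match model p_{a,b}: unordered pairs (a, b) within columns over all such pairs."""
    hits = total = 0
    for column in zip(*block):
        for x, y in combinations(column, 2):
            total += 1
            if {x, y} == {a, b}:
                hits += 1
    return Fraction(hits, total)


def log_odds_score(block, a, b, lam=0.5):
    """s(a, b) = (1 / lam) * log10(p_ab / (q_a * q_b)); the base 10 is an assumption."""
    ratio = pair_frequency(block, a, b) / (letter_frequency(block, a) * letter_frequency(block, b))
    return (1 / lam) * math.log10(float(ratio))


print(letter_frequency(block, "A"), letter_frequency(block, "T"))
print(pair_frequency(block, "A", "T"))
print(round(log_odds_score(block, "A", "T"), 1))
```

A real BLOSUM computation pools counts over thousands of blocks after pruning by the threshold 𝐿; this sketch handles a single small block only.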

BLOSUM62

Source: https://doi.org/10.1038/nbt0804-1035

Take Home Messages
1. Edit distance
2. Global alignment
3. Fitting alignment
4. Local alignment
5. Gapped alignment
6. BLOSUM substitution matrix

Edit distance is shortest path in DAG

Global alignment is longest path in DAG

Small tweaks enable different extensions

Reading:
• Jones and Pevzner. Chapters 6.6-6.9
• Lecture notes