Algorithmic Approaches for Biological Data, Lecture #20

Katherine St. John

City University of New York

American Museum of Natural History

20 April 2016

Outline

Aligning with Gaps and Substitution Matrices

Global versus Local Alignment

Searching Graphs: Breadth First & Depth First

K. St. John (CUNY & AMNH) Algorithms #20 20 April 2016 2 / 16

Pairwise Sequence Alignment

         A   G   A   G
     0  -1  -2  -3  -4
A   -1   1
G   -2
G   -3

[Figure: the recurrence for s(i, j), shown pictorially and as equations]

Aligning with Gaps and Substitution Matrices

The basic dynamic programming format can be adjusted for different gap and substitution models. In the recurrence:

δ: the gap penalty

σ: scores matches/mismatches.
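
As a rough illustration of how δ and σ plug into the dynamic programming table, here is a minimal sketch of the global score matrix. It assumes the global recurrence is the one given later for Smith-Waterman without the 0 option, and that the first row and column are multiples of -δ (as in the 0 -1 -2 -3 -4 row of the example table). The names `global_score_matrix` and `unit_sigma` are hypothetical, chosen only for this example.

```python
def unit_sigma(x, y):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def global_score_matrix(a, b, sigma, delta):
    """Fill the global alignment table for sequences a (rows) and b (columns).

    Assumed recurrence:
        s(i, j) = max(sigma + s(i-1, j-1), -delta + s(i, j-1), -delta + s(i-1, j))
    """
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: aligning a prefix against nothing costs only gaps.
    for i in range(1, n + 1):
        s[i][0] = -delta * i
    for j in range(1, m + 1):
        s[0][j] = -delta * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                sigma(a[i - 1], b[j - 1]) + s[i - 1][j - 1],  # match or mismatch
                -delta + s[i][j - 1],                          # gap in a
                -delta + s[i - 1][j],                          # gap in b
            )
    return s


# Rows AGG, columns AGAG, delta = 1, as in the example table.
table = global_score_matrix("AGG", "AGAG", unit_sigma, 1)
```

The bottom-right cell of `table` holds the best global score; swapping in a different `sigma` or `delta` changes the model without changing the loop.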

Gaps Are Treated Equally

A G A G

0 -1 -2 -3 -4

A -1 1

G -2

G -3

Commonly use an affine gap penalty function:

- h: penalty associated with opening a gap
- g: (smaller) penalty associated with extending the gap.

To implement this efficiently, use 2 additional matrices that keep track of the gaps (one for each sequence).
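
One way the two extra matrices could be used is sketched below. This is only an illustration under stated assumptions: a gap of length k is assumed to cost h + g*k, the score is maximized, and the names `affine_global_score`, `down`, and `right` are hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, sigma, h, g):
    """Best global alignment score with an assumed gap cost of h + g*k for length k."""
    n, m = len(a), len(b)
    s = [[NEG_INF] * (m + 1) for _ in range(n + 1)]      # best score overall
    down = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # alignment ends with a gap in b
    right = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    s[0][0] = 0
    for i in range(1, n + 1):
        down[i][0] = s[i][0] = -(h + g * i)   # one long gap in b
    for j in range(1, m + 1):
        right[0][j] = s[0][j] = -(h + g * j)  # one long gap in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an existing gap (cost g) or open a new one (cost h + g).
            down[i][j] = max(down[i - 1][j] - g, s[i - 1][j] - (h + g))
            right[i][j] = max(right[i][j - 1] - g, s[i][j - 1] - (h + g))
            s[i][j] = max(
                s[i - 1][j - 1] + sigma(a[i - 1], b[j - 1]),
                down[i][j],
                right[i][j],
            )
    return s[n][m]


def unit_sigma(x, y):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1
```

With h = 0 this reduces to the equal-gap model with δ = g. With h > 0, one long gap scores better than several short gaps of the same total length.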

Affine Gap

Burr Settles, U Wisconsin, 2008

Using Substitution Matrices

     A   C   G   T
A    1  -1  -1  -1
C   -1   1  -1  -1
G   -1  -1   1  -1
T   -1  -1  -1   1

We can view σ(i, j) as a substitution matrix.

Substitution matrices are commonly used for protein sequences.

PAM = Point Accepted Mutation (often expanded as Percent Accepted Mutation)

- Dayhoff et al., 1978
- Used for closely related protein sequences
- Based on global alignment

BLOSUM = Blocks Substitution Matrix

- Henikoff & Henikoff, 1992
- Used for more divergent sequences
- Based on local alignment
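
To make the "σ as a matrix" view concrete, the nucleotide table above can be stored as a nested dictionary and looked up by character. This is a small sketch; `NUCLEOTIDE_SIGMA`, `sigma`, and `ungapped_score` are names made up for this example, and a protein matrix such as PAM or BLOSUM would be stored the same way with 20 rows and columns.

```python
# The 4 x 4 table from the slide: +1 on the diagonal, -1 elsewhere.
NUCLEOTIDE_SIGMA = {
    x: {y: (1 if x == y else -1) for y in "ACGT"}
    for x in "ACGT"
}


def sigma(x, y, matrix=NUCLEOTIDE_SIGMA):
    """Look up the score for aligning character x with character y."""
    return matrix[x][y]


def ungapped_score(a, b, matrix=NUCLEOTIDE_SIGMA):
    """Score two equal-length sequences position by position, with no gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(sigma(x, y, matrix) for x, y in zip(a, b))
```

For example, `ungapped_score("AGAG", "AGGG")` adds three matches and one mismatch.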

Global versus Local Alignment

Paul Reiners, IBM, 2008

Global: Needleman & Wunsch, 1970.

Local: Smith & Waterman, 1981.

Instead of looking for the global best score, look for the best score for subsequences of the initial sequences.

Examples:

- finding motifs (conserved patterns) across sequences,
- comparing sequences against longer sequences (e.g. BLAST search).

Smith-Waterman Algorithm

Paul Reiners, IBM, 2008

The equation is slightly different:

$$
s(i,j) = \max
\begin{cases}
\sigma(i,j) + s(i-1,\, j-1) \\
-\delta + s(i,\, j-1) \\
-\delta + s(i-1,\, j) \\
0
\end{cases}
$$

Initialize: first row and first column set to 0’s

Traceback: find the maximum value of s(i, j) anywhere in the matrix and trace back from that cell; stop when we get to a cell with 0.
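
The recurrence, initialization, and traceback above can be sketched as follows. This is a minimal illustration, not a tuned implementation: it assumes σ is +1 for a match and -1 for a mismatch, breaks ties in favor of the diagonal move, and returns only one best local alignment. The function names are hypothetical.

```python
def unit_sigma(x, y):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def smith_waterman(a, b, sigma, delta):
    """Return (best score, aligned piece of a, aligned piece of b, score matrix)."""
    n, m = len(a), len(b)
    # First row and first column stay 0.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                sigma(a[i - 1], b[j - 1]) + s[i - 1][j - 1],
                -delta + s[i][j - 1],
                -delta + s[i - 1][j],
                0,
            )
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)
    # Traceback: start at the maximum cell, stop at the first cell holding 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == sigma(a[i - 1], b[j - 1]) + s[i - 1][j - 1]:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == -delta + s[i - 1][j]:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom)), s


# Rows TTAAG, columns AAGA, delta = 2.
score, piece_a, piece_b, matrix = smith_waterman("TTAAG", "AAGA", unit_sigma, 2)
```

Because every cell is floored at 0, a poorly matching prefix never drags down a good local match that starts later.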

In Pairs: Local Alignment

      A  A  G  A
T
T
A
A
G

Use σ from Monday, but δ = 2.

What are the best local alignments?

        A  A  G  A
     0  0  0  0  0
T    0  0  0  0  0
T    0  0  0  0  0
A    0  1  1  0  1
A    0  1  2  0  1
G    0  0  0  3  1

In Pairs: Searching Graphs

Bastert et al., 2002

Develop a strategy to visit every node of the graph (i.e. what data structures are needed?)

The bookkeeping is important.

Two common strategies:

- Breadth First Search (BFS): visit all the neighbors, then visit all the neighbors' neighbors, etc.
- Depth First Search (DFS): for each neighbor, visit its neighbors, and continue as far down as possible.

Bookkeeping is important:

- Keep a “To Do” list (priority queue) of nodes still to visit; a plain queue gives BFS and a stack gives DFS.
- Mark nodes as you visit them, so you know not to visit them again.
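
Both strategies and the bookkeeping can be sketched with one loop. This is a minimal example that assumes the graph is stored as a dictionary of adjacency lists; the function name `search` and the sample graph are invented for illustration.

```python
from collections import deque


def search(graph, start, strategy="bfs"):
    """Visit every node reachable from start; return nodes in visiting order."""
    todo = deque([start])   # the "To Do" list
    visited = set()         # marks nodes already visited
    order = []
    while todo:
        # BFS takes the oldest entry (queue); DFS takes the newest (stack).
        node = todo.popleft() if strategy == "bfs" else todo.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        neighbors = graph[node]
        if strategy == "dfs":
            # Reverse so the first listed neighbor is explored first.
            neighbors = reversed(neighbors)
        for neighbor in neighbors:
            if neighbor not in visited:
                todo.append(neighbor)
    return order


sample_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
bfs_order = search(sample_graph, "A", "bfs")
dfs_order = search(sample_graph, "A", "dfs")
```

The only difference between the two strategies is which end of the "To Do" list the next node is taken from.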

Recap

Dynamic Programming: will do local & global alignments in lab today.

More on searching graphs on Monday.

Email lab reports to [email protected].

Challenges available at rosalind.info.
