Page 1: The Needleman-Wunsch Algorithm for Sequence Alignment

Saul B. Needleman & Christian D. Wunsch (1969)

KAVINDRI DILSHANI

H.M.K.G BANDARA

PARINDA RAJAPAKSHE

Page 2: The Needleman-Wunsch Algorithm for Sequence Alignment

ABOUT RESEARCH

Title

• “A general method applicable to the search for similarities in the amino acid sequence of two proteins”

Authors

• Saul B. Needleman & Christian D. Wunsch, Department of Biochemistry, Northwestern University, and Nuclear Medicine Service, V.A. Research Hospital, Chicago, USA. (1969) [Cited by 8474]

S.B. Needleman & C.D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins”, J. Mol. Biol. (1970) 48, 443-453.

Page 3: The Needleman-Wunsch Algorithm for Sequence Alignment

OUTLINE

• Introduction

- Sequence Alignment

- Approaches

- Needleman-Wunsch Algorithm vs. Dynamic Programming

• Example

- Optimal Alignment Score

- Optimal Alignment

- Algorithm Cost

• Applications

- Results & Discussion

- Methodology

- Usefulness

Page 4: The Needleman-Wunsch Algorithm for Sequence Alignment

INTRODUCTION

Page 5: The Needleman-Wunsch Algorithm for Sequence Alignment

SEQUENCE ALIGNMENT

• Sequence alignment is a way of arranging two or more sequences of characters to identify regions of similarity.

• Identification of residue-residue correspondences

• Sequence: an ordered string of letters.

• Sequences in bioinformatics:

• DNA sequences

• RNA sequences

• Protein sequences

Page 6: The Needleman-Wunsch Algorithm for Sequence Alignment

MOTIVATION

• Find homologous proteins

– Allows prediction of structure and function

• Locate similar subsequences in DNA

– e.g., allows identification of regulatory elements

– Infer biological similarities

• Locate DNA sequences that might overlap

– Helps in sequence assembly

Page 7: The Needleman-Wunsch Algorithm for Sequence Alignment

SEQUENCE ALIGNMENT - RESULTS

• Input: Two sequences over the same alphabet. - GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA

• Output: An alignment of the two sequences.

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A


Legend: insertions / deletions (indels), perfect matches, mismatches

Page 8: The Needleman-Wunsch Algorithm for Sequence Alignment

APPROACHES

Sequence Alignment

• Qualitative

- Dot-plot

• Quantitative

- Global

- Local

- Multiple

Page 9: The Needleman-Wunsch Algorithm for Sequence Alignment

QUALITATIVE

• Dot-plot

- Pictorial representation of the relationship between two sequences

- Uses a table or a matrix

- Does not quantify the similarity!
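As a rough illustration of the dot-plot idea, the sketch below builds the comparison matrix for two short sequences and marks every cell where the residues are identical. The function names `dot_plot` and `render` and the example sequences are assumptions made for this illustration.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[list[bool]]:
    """Return a matrix with True wherever seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a: str, seq_b: str) -> str:
    """Draw the dot plot as text: seq_b across the top, seq_a down the side."""
    rows = ["  " + " ".join(seq_b)]
    for a, row in zip(seq_a, dot_plot(seq_a, seq_b)):
        rows.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(rows)


if __name__ == "__main__":
    print(render("PILLAR", "SIMILARITY"))
```

Runs of marks along a diagonal suggest a shared region, but the picture alone gives no number for how similar the sequences are.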

Page 10: The Needleman-Wunsch Algorithm for Sequence Alignment

QUANTITATIVE

• Construction of the best alignment between the sequences.

• Assessment of the similarity from the alignment (numerically quantified).

Page 11: The Needleman-Wunsch Algorithm for Sequence Alignment

GLOBAL SEQUENCE ALIGNMENT

• The best alignment over the entire length of two sequences.

• Suitable when the two sequences are of similar length, with a significant degree of similarity throughout.

Page 12: The Needleman-Wunsch Algorithm for Sequence Alignment

LOCAL SEQUENCE ALIGNMENT

• Compares short portions of a sequence, or a whole library of sequences, with short portions of another.

• Suitable when comparing substantially different sequences, which possibly differ significantly in length and have only short patches of similarity.

Page 13: The Needleman-Wunsch Algorithm for Sequence Alignment

MULTIPLE SEQUENCE ALIGNMENT

• Simultaneous alignment of more than two sequences.

• Suitable when searching for subtle conserved sequence patterns in a protein family, and when more than two sequences of the protein family are available.

Page 14: The Needleman-Wunsch Algorithm for Sequence Alignment

EXAMPLE

S1 : SIMILARITY    S2 : PILLAR    S3 : MOLARITY

Global

SIMILARITY
PI-LLAR---

Local

MILAR
ILLAR

Multiple

SIMILARITY
PI-LLAR---
--MOLARITY

Page 15: The Needleman-Wunsch Algorithm for Sequence Alignment

HOW TO QUANTIFY ?

• Introduces a Scoring Schema

• Set of rules which assigns the alignment score to any given alignment of two sequences.

• Alignment score : goodness of alignment

• Scoring schema

- Substitution scores

- Gap penalties

Page 16: The Needleman-Wunsch Algorithm for Sequence Alignment

THE SUBSTITUTION MATRIX

• Simple scoring schema for residue substitution

• The residue substitution costs can be expressed with an N x N matrix (N is 4 for DNA and 20 for proteins).

     C   T   A   G
C    1  -1  -1  -1
T   -1   1  -1  -1
A   -1  -1   1  -1
G   -1  -1  -1   1

Page 17: The Needleman-Wunsch Algorithm for Sequence Alignment

EXAMPLE

• Consider the "best" alignment of ATGGCGT and ATGAGT

• +1 as a reward for a match, -1 as the penalty for a mismatch, and ignore gaps (a gap column scores 0)

ATGGCGT
ATG_AGT

Score: +1 + 1 + 1 + 0 - 1 + 1 + 1 = 4

Alternative alignment

ATGGCGT
A_TGAGT

Score: +1 + 0 - 1 + 1 - 1 + 1 + 1 = 2
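The column-by-column scoring used in this example can be sketched as follows. The function name `alignment_score` and its parameters are illustrative assumptions; gap columns contribute 0, as the slide specifies.

```python
def alignment_score(row_a: str, row_b: str, match: int = 1,
                    mismatch: int = -1, gap_char: str = "_") -> int:
    """Score two equal-length alignment rows column by column.

    Columns containing a gap contribute 0, as in the slide's example.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap_char or b == gap_char:
            continue  # gaps are ignored in this simple scheme
        score += match if a == b else mismatch
    return score


print(alignment_score("ATGGCGT", "ATG_AGT"))  # 4
print(alignment_score("ATGGCGT", "A_TGAGT"))  # 2
```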

Page 18: The Needleman-Wunsch Algorithm for Sequence Alignment

BETTER MATRIX

• Certain changes in DNA/protein sequences are more likely to occur naturally than others.

• Proteins are composed of twenty amino acids, and the physico-chemical properties of individual amino acids vary considerably.

• It is important to incorporate evolutionary relationships into this substitution schema.

Page 19: The Needleman-Wunsch Algorithm for Sequence Alignment

EVOLUTIONARY SUBSTITUTION MATRIX

• PAM ("point accepted mutation") family

- PAM250, PAM120, etc.

• BLOSUM ("Blocks substitution matrix") family - BLOSUM62, BLOSUM50, etc.

• Derived from the analysis of known alignments of closely related proteins

• Assigns variable weights to different substitution operations.

Page 20: The Needleman-Wunsch Algorithm for Sequence Alignment

BLOSUM 62

Page 21: The Needleman-Wunsch Algorithm for Sequence Alignment

GAPS

• A gap is a consecutive run of spaces in an alignment and may be introduced in either sequence (it represents the insertion or deletion of residues).

• Objective: an optimal sequence alignment that is also meaningful.

• Is it good?

– Interrupts the entire polymer chain

– In DNA, shifts the reading frame

– Hence gaps carry a penalty

Page 22: The Needleman-Wunsch Algorithm for Sequence Alignment

GAP PENALTIES

• Constant: a gap of any size receives the same negative penalty, -g.

• Linear: the penalty depends linearly on the size of the gap. The parameter -g is the penalty per unit length of a gap.

• Affine: the gap introduction (opening) cost is greater than the gap extension cost. For a gap of length L with opening cost o and extension cost e: g = o + (L-1)e, with |e| smaller than |o|.
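The three gap models can be sketched as small functions. This is a minimal illustration that assumes g, o and e are given as positive magnitudes and that the returned value is the (negative) contribution to the alignment score; the function names are hypothetical.

```python
def constant_gap(length: int, g: float) -> float:
    """Every gap costs the same, whatever its length."""
    return -g if length > 0 else 0


def linear_gap(length: int, g: float) -> float:
    """Each gap position costs g."""
    return -g * length


def affine_gap(length: int, opening: float, extension: float) -> float:
    """The first gap position costs `opening`; each further one costs `extension`."""
    if length <= 0:
        return 0
    return -(opening + (length - 1) * extension)


# A gap of length 3 under each model (g = 2, o = 5, e = 1)
print(constant_gap(3, 2), linear_gap(3, 2), affine_gap(3, 5, 1))  # -2 -6 -7
```

With the affine model, one long gap is cheaper than several short gaps of the same total length, which is the reason for making extension cheaper than opening.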

Page 23: The Needleman-Wunsch Algorithm for Sequence Alignment

ENRICHED SCORING SCHEMA

• The scoring scheme provides a quantitative measure of how good an alignment is relative to alternative alignments.

• Does this scoring scheme tell us how to find the best alignment?

Page 24: The Needleman-Wunsch Algorithm for Sequence Alignment

BASIC APPROACH

• Brute-force approach

- Generate the list of all possible alignments between two sequences and score them

- Select the alignment with the best score

- The number of possible global alignments between two sequences of length N is about $\frac{2^{2N}}{\sqrt{\pi N}}$

- For two sequences of 250 residues this is ~ $10^{149}$
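The figure for 250 residues can be checked numerically. The sketch below assumes the slide's expression is the usual large-N approximation of the central binomial coefficient C(2N, N), and compares the two; the function names are illustrative.

```python
import math


def exact_count(n: int) -> int:
    """Central binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


def estimate_log10(n: int) -> float:
    """log10 of 2^(2n) / sqrt(pi * n), computed in log space to avoid overflow."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


n = 250
print("exact order of magnitude:", len(str(exact_count(n))) - 1)
print("estimated log10:", round(estimate_log10(n), 2))
```

Both values come out at about 149, which agrees with the order of magnitude quoted on the slide.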

Page 25: The Needleman-Wunsch Algorithm for Sequence Alignment

NEEDLEMAN-WUNSCH ALGORITHM

Page 26: The Needleman-Wunsch Algorithm for Sequence Alignment

NEEDLEMAN-WUNSCH ALGORITHM

• Reduces the massive number of possibilities that need to be considered, yet still guarantees that the best solution will be found.

• Global sequence alignment technique.

• Builds up the best alignment by using optimal alignments of smaller subsequences.

• Dynamic Programming

Page 27: The Needleman-Wunsch Algorithm for Sequence Alignment

DYNAMIC PROGRAMMING

• Dynamic Programming is an algorithmic paradigm.

- Breaking problem into sub problems

- Stores results of sub problems

- Avoids computing the same results again.

• Main properties of the problem

- Overlapping Sub problems

- Optimal substructure

Page 28: The Needleman-Wunsch Algorithm for Sequence Alignment

OVERLAPPING SUB PROBLEMS

• Segregate the main problem into sub-problems.

• Mainly used when solutions of the same sub problems are needed again and again.

• Computed solutions to sub problems are stored in a table/matrix so that they do not have to be recomputed.

• Dynamic Programming is not useful when there are no common (overlapping) sub problems.
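To make the idea of overlapping sub problems concrete, here is a small sketch that scores a global alignment by recursion over prefixes and stores each result with `functools.lru_cache`. The scoring values (+1 match, -1 mismatch, -2 gap) and the function name `best_score` are assumptions chosen for illustration.

```python
from functools import lru_cache


def best_score(s1: str, s2: str, match: int = 1, mismatch: int = -1,
               gap: int = -2) -> tuple[int, int]:
    """Return (best global alignment score, number of distinct sub problems solved)."""

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # solve(i, j) is the best score for aligning s1[:i] with s2[:j]
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        sigma = match if s1[i - 1] == s2[j - 1] else mismatch
        return max(solve(i - 1, j - 1) + sigma,
                   solve(i - 1, j) + gap,
                   solve(i, j - 1) + gap)

    score = solve(len(s1), len(s2))
    # Cache misses count the sub problems that were actually computed.
    return score, solve.cache_info().misses


print(best_score("TGGTG", "ATCGT"))  # (-2, 36)
```

Without the cache the same (i, j) pairs would be recomputed many times; with it, each of the (5 + 1) x (5 + 1) = 36 prefix pairs is solved once.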

Page 29: The Needleman-Wunsch Algorithm for Sequence Alignment

OPTIMAL SUBSTRUCTURE

• A problem has optimal substructure if an optimal solution can be constructed efficiently from optimal solutions of its sub problems.

• The optimal global solution contains the optimal solutions of all its sub problems.

• Dynamic Programming is not useful when there isn’t optimal substructure in the problem.

Page 30: The Needleman-Wunsch Algorithm for Sequence Alignment

HOW IT WORKS?

• Governed by three steps

- Break the problem into smaller sub problems.

- Solve the smaller problems optimally.

- Use the sub-problem solutions to construct an optimal solution for the original problem.

• The Needleman-Wunsch algorithm applies the dynamic programming paradigm to obtain the optimal global alignment and the corresponding score.

Page 31: The Needleman-Wunsch Algorithm for Sequence Alignment

WORKOUT

Page 32: The Needleman-Wunsch Algorithm for Sequence Alignment

Definitions

• A scoring function (σ)

defines the score to give to a substitution mutation

e.g. +1 for a match, -1 for a mismatch

• A gap penalty

defines the score to give to an insertion or deletion

e.g. -1

• A recurrence relation

defines what actions we repeat at each iteration (step) of the algorithm

$$
T(i, j) = \max
\begin{cases}
T(i-1,\, j-1) + \sigma(S_1(i), S_2(j)) \\
T(i-1,\, j) + \text{gap penalty} \\
T(i,\, j-1) + \text{gap penalty}
\end{cases}
$$

Page 33: The Needleman-Wunsch Algorithm for Sequence Alignment

Steps

• Step 1

– Fill up a matrix (table) T using the recurrence relation

• Step 2

– The trace back step uses the filled-in matrix T to work out the best alignment

Page 34: The Needleman-Wunsch Algorithm for Sequence Alignment

Work Out

• Sequences

S1 = TGGTG

S2 = ATCGT

• Scoring function

For matches : +1

For mismatches : -1

Substitution Matrix

    A   C   G   T
A  +1  -1  -1  -1
C  -1  +1  -1  -1
G  -1  -1  +1  -1
T  -1  -1  -1  +1

Page 35: The Needleman-Wunsch Algorithm for Sequence Alignment

Work Out cont..

• Initializing the table

         i=0  i=1  i=2  i=3  i=4  i=5
                T    G    G    T    G
j=0
j=1   A
j=2   T
j=3   C
j=4   G
j=5   T

Fill order: left to right, top to bottom

Step 1 - The value of T(0,0) is set to zero at the start

Page 36: The Needleman-Wunsch Algorithm for Sequence Alignment

Work Out cont..

         i=0  i=1  i=2  i=3  i=4  i=5
                T    G    G    T    G
j=0        0
j=1   A
j=2   T
j=3   C
j=4   G
j=5   T

$$
T(i, j) = \max
\begin{cases}
T(i-1,\, j-1) + \sigma(S_1(i), S_2(j)) & \text{(previous column and previous row)} \\
T(i-1,\, j) + \text{gap penalty} & \text{(previous column, same row)} \\
T(i,\, j-1) + \text{gap penalty} & \text{(same column, previous row)}
\end{cases}
$$

Gap penalty = -2

    A   C   G   T
A  +1  -1  -1  -1
C  -1  +1  -1  -1
G  -1  -1  +1  -1
T  -1  -1  -1  +1

Page 37: The Needleman-Wunsch Algorithm for Sequence Alignment

Work Out cont..

         i=0  i=1  i=2  i=3  i=4  i=5
                T    G    G    T    G
j=0        0   -2   -4   -6   -8  -10
j=1   A   -2   -1   -3   -5   -7   -9
j=2   T
j=3   C
j=4   G
j=5   T

    A   C   G   T
A  +1  -1  -1  -1
C  -1  +1  -1  -1
G  -1  -1  +1  -1
T  -1  -1  -1  +1

Gap penalty = -2

$$
T(i, j) = \max
\begin{cases}
T(i-1,\, j-1) + \sigma(S_1(i), S_2(j)) \\
T(i-1,\, j) + \text{gap penalty} \\
T(i,\, j-1) + \text{gap penalty}
\end{cases}
$$

Page 38: The Needleman-Wunsch Algorithm for Sequence Alignment

Work Out cont..

         i=0  i=1  i=2  i=3  i=4  i=5
                T    G    G    T    G
j=0        0   -2   -4   -6   -8  -10
j=1   A   -2   -1   -3   -5   -7   -9
j=2   T   -4   -1   -2   -4   -4   -6
j=3   C   -6   -3   -2   -3   -5   -5
j=4   G   -8   -5   -2   -1   -3   -4
j=5   T  -10   -7   -4   -3    0   -2

Gap penalty = -2
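The fill step for this worked example can be reproduced with a short sketch. It assumes the scoring used in the slides (+1 match, -1 mismatch, gap penalty -2) and stores the table as `table[j][i]`, so that rows follow S2 and columns follow S1 as in the slide layout; the function name `fill_table` is illustrative.

```python
def fill_table(s1: str, s2: str, match: int = 1, mismatch: int = -1,
               gap: int = -2) -> list[list[int]]:
    """Fill the score table; table[j][i] scores aligning s1[:i] with s2[:j]."""
    cols, rows = len(s1) + 1, len(s2) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        table[0][i] = table[0][i - 1] + gap   # first row: gaps only
    for j in range(1, rows):
        table[j][0] = table[j - 1][0] + gap   # first column: gaps only
    for j in range(1, rows):                  # top to bottom
        for i in range(1, cols):              # left to right
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            table[j][i] = max(table[j - 1][i - 1] + sigma,  # T(i-1, j-1) + sigma
                              table[j][i - 1] + gap,        # T(i-1, j) + gap
                              table[j - 1][i] + gap)        # T(i, j-1) + gap
    return table


for row in fill_table("TGGTG", "ATCGT"):
    print(" ".join(f"{value:>3}" for value in row))
```

The bottom-right entry, -2, is the optimal global alignment score for TGGTG and ATCGT.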

Page 39: The Needleman-Wunsch Algorithm for Sequence Alignment

Work Out cont..

Trace Back

         i=0  i=1  i=2  i=3  i=4  i=5
                T    G    G    T    G
j=0        0   -2   -4   -6   -8  -10
j=1   A   -2   -1   -3   -5   -7   -9
j=2   T   -4   -1   -2   -4   -4   -6
j=3   C   -6   -3   -2   -3   -5   -5
j=4   G   -8   -5   -2   -1   -3   -4
j=5   T  -10   -7   -4   -3    0   -2

Page 43: The Needleman-Wunsch Algorithm for Sequence Alignment

Work Out cont..

Trace Back

         i=0  i=1  i=2  i=3  i=4  i=5
                T    G    G    T    G
j=0        0   -2   -4   -6   -8  -10
j=1   A   -2   -1   -3   -5   -7   -9
j=2   T   -4   -1   -2   -4   -4   -6
j=3   C   -6   -3   -2   -3   -5   -5
j=4   G   -8   -5   -2   -1   -3   -4
j=5   T  -10   -7   -4   -3    0   -2

S1= TGGTG S2= ATCGT

S1:  - T G G T G
       |   | |
S2:  A T C G T -

→ Score = 3 - 1 - 4 = -2 (three matches, one mismatch, two gaps)
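A compact sketch of both steps, fill and trace back, is shown below for the same example. It assumes the slide's scoring (+1, -1, gap -2). When several moves explain a cell it prefers the diagonal, then the left neighbour, then the upper one; that tie-breaking order is an arbitrary choice of this sketch.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s1, aligned_s2) for one optimal global alignment."""
    cols, rows = len(s1) + 1, len(s2) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        table[0][i] = i * gap
    for j in range(1, rows):
        table[j][0] = j * gap
    for j in range(1, rows):
        for i in range(1, cols):
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            table[j][i] = max(table[j - 1][i - 1] + sigma,
                              table[j][i - 1] + gap,
                              table[j - 1][i] + gap)

    # Trace back from the bottom-right cell to the top-left cell.
    out1, out2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            if table[j][i] == table[j - 1][i - 1] + sigma:
                out1.append(s1[i - 1])      # residue aligned with residue
                out2.append(s2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[j][i] == table[j][i - 1] + gap:
            out1.append(s1[i - 1])          # gap in s2
            out2.append("-")
            i -= 1
        else:
            out1.append("-")                # gap in s1
            out2.append(s2[j - 1])
            j -= 1
    return table[-1][-1], "".join(reversed(out1)), "".join(reversed(out2))


print(needleman_wunsch("TGGTG", "ATCGT"))  # (-2, '-TGGTG', 'ATCGT-')
```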

Page 44: The Needleman-Wunsch Algorithm for Sequence Alignment

Work Out cont..

          W    H    A    T
     0   -2   -4   -6   -8
W   -2    1   -1   -3   -5
H   -4   -1    2    0   -2
Y   -6   -3    0    1   -1

Pink traceback

W H A T
| |
W H - Y

Orange traceback

W H A T
| |
W H Y -

match: +1

mismatch: -1

gap: -2

Two possible trace backs?
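When two moves explain the same cell value, the trace back branches and each branch gives an equally good alignment. The sketch below enumerates every optimal alignment by following all such branches. It assumes the slide's scoring (+1, -1, gap -2), and the function name is illustrative; the number of optimal alignments can grow quickly, so this is meant for short sequences only.

```python
def all_optimal_alignments(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, sorted list of every optimal (aligned_s1, aligned_s2))."""
    cols, rows = len(s1) + 1, len(s2) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        table[0][i] = i * gap
    for j in range(1, rows):
        table[j][0] = j * gap
    for j in range(1, rows):
        for i in range(1, cols):
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            table[j][i] = max(table[j - 1][i - 1] + sigma,
                              table[j][i - 1] + gap,
                              table[j - 1][i] + gap)

    def walk(i, j):
        """Yield every alignment of s1[:i] and s2[:j] that achieves table[j][i]."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            if table[j][i] == table[j - 1][i - 1] + sigma:
                for a, b in walk(i - 1, j - 1):
                    yield a + s1[i - 1], b + s2[j - 1]
        if i > 0 and table[j][i] == table[j][i - 1] + gap:
            for a, b in walk(i - 1, j):
                yield a + s1[i - 1], b + "-"     # gap in s2
        if j > 0 and table[j][i] == table[j - 1][i] + gap:
            for a, b in walk(i, j - 1):
                yield a + "-", b + s2[j - 1]     # gap in s1

    return table[-1][-1], sorted(walk(len(s1), len(s2)))


print(all_optimal_alignments("WHAT", "WHY"))
# (-1, [('WHAT', 'WH-Y'), ('WHAT', 'WHY-')])
```

Both alignments contain two matches, one mismatch and one gap, so they share the score 2 - 1 - 2 = -1.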

Page 46: The Needleman-Wunsch Algorithm for Sequence Alignment

Performance

• The N-W algorithm takes time proportional to $n^2$

• Assessing all possible alignments one by one takes time proportional to $\binom{2n}{n}$

$n^2 < \binom{2n}{n}$

N-W is much faster than assessing all possible alignments one by one
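A quick tabulation shows how fast the gap between the two costs widens. This sketch only evaluates the two expressions from the slide for a few sequence lengths; the function name `growth_table` is an assumption.

```python
import math


def growth_table(sizes):
    """Return (n, n**2, C(2n, n)) for each sequence length n."""
    return [(n, n ** 2, math.comb(2 * n, n)) for n in sizes]


for n, table_work, brute_force in growth_table([5, 10, 20, 50]):
    print(f"n={n:<3} n^2={table_work:<6} C(2n,n)={brute_force:.3e}")
```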

Page 47: The Needleman-Wunsch Algorithm for Sequence Alignment

APPLICATIONS

Page 48: The Needleman-Wunsch Algorithm for Sequence Alignment

Role of weighting factors in evaluating a maximum match

Proteins expected to exhibit homology

• Whale myoglobin

• Human β-hemoglobin

Proteins not expected to exhibit homology

• Bovine pancreatic ribonuclease

• Hen’s egg lysozyme

Page 49: The Needleman-Wunsch Algorithm for Sequence Alignment

APPLICATION OF THE METHOD

• Identification of the types of amino acid pairs

• Establish variable sets consisting of values to be assigned to each type of pair

• Determine a value for the penalty

Page 50: The Needleman-Wunsch Algorithm for Sequence Alignment

TYPES OF AMINO ACID PAIRS

• Type 3: pairs having a maximum of three corresponding bases in their codons

• Type 2: pairs having a maximum of two corresponding bases in their codons

• Type 1: pairs having a maximum of one corresponding base in their codons

• Type 0: pairs having no possible corresponding bases in their codons
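The pair type can be computed by comparing every codon of one amino acid with every codon of the other and keeping the best agreement. The sketch below assumes the standard genetic code, written with the DNA alphabet (T in place of U) and one-letter amino acid codes; the codon assignments actually used in the paper may differ in detail, and the names `CODONS` and `pair_type` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS: dict[str, list[str]] = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    if aa != "*":                       # skip stop codons
        CODONS.setdefault(aa, []).append(codon)


def pair_type(aa1: str, aa2: str) -> int:
    """Maximum number of positions at which any two codons of the pair agree."""
    return max(sum(x == y for x, y in zip(c1, c2))
               for c1 in CODONS[aa1] for c2 in CODONS[aa2])


print(pair_type("M", "M"))  # 3: identical amino acids
print(pair_type("F", "L"))  # 2: e.g. TTT vs TTA
print(pair_type("M", "W"))  # 1: ATG vs TGG
print(pair_type("M", "Y"))  # 0: ATG vs TAT / TAC
```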

Page 51: The Needleman-Wunsch Algorithm for Sequence Alignment

METHODOLOGY

• Reading the amino acid sequences to be compared into the computer

• Maximum-correspondence array

– Contains all possible pairs of amino acids

– Assigns each pair to its corresponding type

• Generating the two-dimensional array row by row

• Assigning the variable set containing the type values, and the appropriate value from that set, to the appropriate cell of the comparison array

Page 52: The Needleman-Wunsch Algorithm for Sequence Alignment

Nucleotide sequences of RNA codons recognized by AA-tRNA*

* Marshall RE, Caskey CT, Nirenberg M. Fine structure of RNA codewords recognized by bacterial, amphibian, and mammalian transfer RNA. Science. 1967 Feb 17;155(3764):820–826.

Page 53: The Needleman-Wunsch Algorithm for Sequence Alignment

METHODOLOGY

• Determination of the maximum match by the procedure of successive summations

• Randomizing the amino acid sequence of only one member of the protein pair

– Sequences of β-hemoglobin and ribonuclease

– Randomization procedure: a sequence-shuffling routine based on computer-generated random numbers

• Repeating the cycle of sequence randomization and maximum-match determination

• Estimating the average and standard deviation of the random values for each variable set
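The randomization cycle described on this slide can be sketched as follows. This is only an outline under stated assumptions: the simple +1 / -1 / -2 scoring from the worked example stands in for the paper's variable sets, Python's `random.Random.shuffle` stands in for the original shuffling routine, and the function names are hypothetical.

```python
import random
import statistics


def max_match(s1, s2, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score, keeping only two table rows in memory."""
    prev = [i * gap for i in range(len(s1) + 1)]
    for j in range(1, len(s2) + 1):
        curr = [j * gap]
        for i in range(1, len(s1) + 1):
            sigma = match if s1[i - 1] == s2[j - 1] else mismatch
            curr.append(max(prev[i - 1] + sigma, curr[i - 1] + gap, prev[i] + gap))
        prev = curr
    return prev[-1]


def randomization_test(s1, s2, cycles=10, seed=0):
    """Return (real score, mean and standard deviation of scores for shuffled s2)."""
    rng = random.Random(seed)
    real = max_match(s1, s2)
    shuffled_scores = []
    for _ in range(cycles):
        letters = list(s2)
        rng.shuffle(letters)              # randomize only one member of the pair
        shuffled_scores.append(max_match(s1, "".join(letters)))
    return real, statistics.mean(shuffled_scores), statistics.stdev(shuffled_scores)


real, mean, stdev = randomization_test("SIMILARITY", "MOLARITY")
print(real, mean, stdev)
```

A real score that lies several standard deviations above the mean of the shuffled scores suggests that the similarity is unlikely to be due to composition alone.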

Page 54: The Needleman-Wunsch Algorithm for Sequence Alignment

RESULTS AND DISCUSSION

• A small random sample size (ten)

• Assumption: for each set of variables, the random values would be distributed in the fashion of the normal-error curve

• The values of the first six random sets in the β-hemoglobin–myoglobin comparison were converted to standard measures

• Probit plot

Page 55: The Needleman-Wunsch Algorithm for Sequence Alignment
Page 56: The Needleman-Wunsch Algorithm for Sequence Alignment

β-HEMOGLOBIN – MYOGLOBIN MAXIMUM MATCHES

Page 57: The Needleman-Wunsch Algorithm for Sequence Alignment

RIBONUCLEASE – LYSOZYME MAXIMUM MATCHES

Page 58: The Needleman-Wunsch Algorithm for Sequence Alignment

β-HEMOGLOBIN – MYOGLOBIN MAXIMUM MATCHES

Page 59: The Needleman-Wunsch Algorithm for Sequence Alignment

β-HEMOGLOBIN – MYOGLOBIN MAXIMUM MATCHES

Page 60: The Needleman-Wunsch Algorithm for Sequence Alignment

β-HEMOGLOBIN – MYOGLOBIN MAXIMUM MATCHES

Page 61: The Needleman-Wunsch Algorithm for Sequence Alignment

β-HEMOGLOBIN – MYOGLOBIN MAXIMUM MATCHES

Page 62: The Needleman-Wunsch Algorithm for Sequence Alignment

β-HEMOGLOBIN – MYOGLOBIN MAXIMUM MATCHES

Page 63: The Needleman-Wunsch Algorithm for Sequence Alignment

USEFULNESS

• To detect homology and define its nature

• Assumption:

– Homologous proteins are the result of gene duplication and subsequent mutations

• Construct several hypothetical amino-acid sequences that would be expected to show homology

– Following the duplications, point mutations occur at a constant or variable rate

• After a relatively short period of time, pairs will have nearly identical sequences

Page 64: The Needleman-Wunsch Algorithm for Sequence Alignment

DETECTION OF THE HIGH DEGREE OF HOMOLOGY PRESENT

• Use of values for non-identical pairs

• Assigning a relatively high penalty for gaps

• Attaching substantial weight

• Reducing the penalty

• Assessing a very small or even negative penalty factor

Page 65: The Needleman-Wunsch Algorithm for Sequence Alignment

THE NATURE OF HOMOLOGY

• Indication?

– Variables which maximize the significance of the difference between real and random proteins

Page 66: The Needleman-Wunsch Algorithm for Sequence Alignment

EVOLUTIONARY DIVERGENCE

• Similar populations accumulate differences over evolutionary time, and so become increasingly distinct

Page 67: The Needleman-Wunsch Algorithm for Sequence Alignment

EVOLUTIONARY DIVERGENCE

• “Divergent evolution” can also be applied to characteristics in molecular biology, e.g. to genes and proteins derived from two or more homologous genes

• Assignment of weight to type 2 pairs

– Enhances the significance of the results

– Substantial evolutionary divergence

Page 68: The Needleman-Wunsch Algorithm for Sequence Alignment

EVOLUTIONARY DIVERGENCE

• Exception??

– Evolutionary divergence manifested by cytochrome and other heme proteins

• Non-random mutations along the genes

Page 69: The Needleman-Wunsch Algorithm for Sequence Alignment

THE DEGREE & TYPE OF HOMOLOGY

• Differ between protein pairs

• Due to the difference

– No a priori best set of cell and operation values

– No best set of values to detect only slight homology

Page 70: The Needleman-Wunsch Algorithm for Sequence Alignment

METHODS OF DETERMINING THE DEGREE OF HOMOLOGY

• Counting the number of non-identical pairs in the homologous comparison

• Counting the number of mutations represented by the non-identical pairs

• Measure of evolutionary distance

Page 71: The Needleman-Wunsch Algorithm for Sequence Alignment

QUICK WRAPUP

