B.sc biochem i bobi u 3.1 sequence alignment

Course: B.Sc Biochemistry

Subject: Basic of Bioinformatics

Unit: III

OUTLINE

Sequence Alignment Scoring Alignments and Substitution Matrices Inserting Gaps Dynamic Programming Database Searches

Sequence Alignment

Comparing sequences for– Similarity– Homology

Prediction of function of genes and proteins Construction of phylogeny Finding motifs

Sequence Alignment - HOMOLOGY

Orthologues : any gene pairwise relation where the ancestor node is a speciation event. Often have similar function

Paralogues : any gene pairwise relation where the ancestor node is a duplication event. Paralogs tend to have different functions


1.


2.

Sequence Alignment - PHYLOGENY

3.

Sequence Alignment – PROTEIN FUNCTIONS

4.

Scoring Alignments and Substitution Matrices

The quality of an alignment is measured by giving it a quantitative score

The simplest way of quatifying similarity between two sequences is percentage identity.– Simply measured by counting the number of

identical bases or amino acids matched between the aligned sequences.


The dot-plot gives a visual assesment of similarity based on identity.

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]5.


Percentage identity is a relatively crude measure and does bot give a complete picture of the degree of similarity of two sequences.

Scoring identical matches 1 and mismatches as 0 ignores the fact that the type of amino acids involved is highly significant.


Genuine matches may not be identical:

Seq1: T H I S I S A S E Q U E N C E

Seq1: T H A T _ _ _ S E Q U E N C E

Isoleucine – Alanine: both hydrophobic

Serine – Threonine : both polar


Scoring pairs of amino acids:– with similar properties higher scores– With different properties lower scores


To assign scores for alignmens use SUBSTITUTION MATRICES

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

5.


Different types of substitution matrices are being used based on:– The number of mutations required for

convertion of one amino acid to the other– Similarities in physicochemical properties.


PAM substitution matrices:– Use closely related protein sequences to

derive substitution frequencies– Accepted Point Mutations per 100 residues

250 PAM 250 mutation on 100 residues


BLOSUM substitution matrices:– BLOcks of Amino Acid SUbstitution Matrix – Use mutation data from highly conserved

local regions– BLOSUM 62 62% identity


Which matrix to use ?– Depends on the problem properties,– Distantly related sequences : PAM 250 –

BLOSUM 50– Closely related sequences: PAM 120,

BLOSUM 80


Which matrix to use ?– Some special purpose matrices (SLIM and

PHAT are designed for membrane proteins)– The length of the sequende is important

Short sequences PAM 40 or BLOSUM 80 Long sequences PAM 250 or BLOSUM 50


BLOSUM – 62 and PAM 120

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 6.

Inserting Gaps

Gap insertion requires a scoring penalty (gap penalty).

To achieve correct matches gaps are required

Alignment programs use gap penalties to limit the introduction of gaps in the alignments

Inserting Gaps

Insertions tend to be several residues long rather than just a single residue long– Fewer insertions and deletions occur in sequences

of structural importance– Smaller penalty on lengthening an existing gap

(gap extension penalty) than introducing a new gap

– Gap penaly is high the number of gaps will be decreased

– Gap penalty is low more and large gaps will be inserted.

Inserting Gaps

Choosing gap penalties:– Linear– Affine

Gap open penalty Gap extension penlty

Dynamic Programming

Global and Local alignments

Pairwise and Multiple alignments

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 7.

For a pair of sequences there is a large number of possible alignments.

2 sequences of length 1000 have appriximately 10600 different alignments.

Dynamic Programming

Dynamic Programming:– Problem can be divided into many smaller parts.– Optimal alignment will not contain parts that are

not themselves optimal.– Start from sufficiently short sub-sequences.– Alignement is additive:

Dynamic Programming

Needleman and Wunsch were the first to propose this method.

Find optimal global alignments. Align sequences:

– Seq1: x (x1x2x3…xm)

– Seq1: y (y1y2y3…yn)

Dynamic Programming

s(a,b) = score of aligning a and b F(i,j) = optimal similarity of X(1:i) and Y(1:j) Recurrence relation:

– F(i,0) = Σ s(X(k), gap), 0 <= k <= i

– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j

– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),

F(i-1,j) + s(X(i),gap),

F(i-1, j-1) + s(X(i), Y(j)]

– Assume linear gap penalty

Dynamic Programming

Dynamic Programming

Matrix S of optimal scores of sub-sequence alignments.


9.

Dynamic Programming

S(I, T) = -1,

10.

Dynamic Programming

S(I, H) = -3,

S(I, gap) = -8,

S(gap, H) = -8Recurrence relation:

F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)]


11.

Dynamic Programming


12.

Dynamic Programming

–Linear gap penalty (E=4)


13.

Dynamic Programming

Semi – global alignment:– When we treat terminal gaps differently than

internal gaps– How to modify dynamic programming to be able

to make semi – global alignment ?

Dynamic Programming

Local alignment:– If we compare a sequence to whole genome– Find sub-strings whose optimal global

alignment value is maximum

Dynamic Programming

What is the difference between global and local alignment ?

Can we define the recuernce relation of local alignment similar to global alignment ?

Recurrence relation of GLOBAL ALIGNMENT:

(Needleman & Wunsch)

– F(i,0) = Σ s(X(k), gap), 0 <= k <= i

– F(0,j) =Σ s(gap, B(k)), 0 <= k <= j

– F(i,j) = max [ F(i,j-1) + s(gap,Y(j),

F(i-1,j) + s(X(i),gap),

F(i-1, j-1) + s(X(i), Y(j)]

Dynamic Programming

Recurrence relation of LOCAL ALIGNMENT:

(Smith-Waterman)

– F(i,0) = 0

– F(0,j) = 0

– F(i,j) = max [ 0,

F(i,j-1) + s(gap,Y(j),

F(i-1,j) + s(X(i),gap),

F(i-1, j-1) + s(X(i), Y(j)]

Dynamic Programming

Database Searches

FASTA and BLAST Use some heuristics Dynamic Programming Complexity

– Time O(n*m)– Space O(n*m)

Database Searches FASTA

Good local alignment should have some exact match subsequence.

Find all k-tuples. (k=1-2 for proteins, 3-6 for DNA sequences)

Protein k – tuples nc, sp, … (k = 2) Nucleotide k – tuples TAAA, CTCC,…(k = 4)


If k = 3 for nucleotide sequences.– There will be 64 possible k – tuples– Assign a number e( ):

e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3

Each 3 – tuples are represented as xi xi+1xi+2

Assign a number to each 3 – tuple

– Ci = e(xi)42 + e(xi+1)41 + e(xi+2)40

– For example: AAA AAA 042 + 041 + 040 = 0 CAA 142 + 041 + 040 = 16


Find each occurance of k – tuples in the sequences.

Chaining Look – Up Tables Consider TAAAACTCTAAC (if k = 3):

3 - tuples Position

AAA (0) 2, 3

AAC (1) 4, 10

AAG (2) 0

AAT (3) 0

… …

Database Searches BLAST

Use short words to search the database sequence.

Searches for k – mers that will score above a threshold (T) value when aligned with query k - mer (Remember FASTA looks for k – tuples which are identical).

Use a scheme based on finite state automata (Remember FASTA use hashing and chaining fot rapid identification of k - tuples)


From Query Sequence, create query words (for protein sequences word size is 3)


Blast uses a list of high scoring words created from words similar to query words. Considers the words with a score bigger than a threshold value.


Scan each database sequence for an exact match to the list of words.

Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S".


Keep only the extended matches that have a score at least S.

Determine statistical significance of each remaining match.


http://blast.ncbi.nlm.nih.gov/Blast.cgi

1.

14.


15.


16.


17.


18.

Database Searches HISTORY

1970: NW 1980: SW 1985: FASTA 1989: BLAST

Books and Web References

Books Name :

1. Introduction To Bioinformatics by T. K. Attwood

2. BioInformatics by Sangita

3. Basic Bioinformatics by S.Ignacimuthu, s.j.

http://en.wikipedia.org/wiki/Sequence_alignment http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf http://www.ks.uiuc.edu/Training/Tutorials/science/bioinformatics-tutorial/

bioinformatics.pdf M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008, Garland

Science Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A

practical guide to the analysis of genes and proteins”, 2001, Wiley.

54

Images References

1.http://gorbi.irb.hr/files/5712/7497/9729/Slide09.jpg 2.http://www.ensembl.org/info/genome/compara/

tree_example1.png 3.http://www.nature.com/nature/journal/v496/n7445/images/

nature12027-f1.2.jpg 4.

http://upload.wikimedia.org/wikipedia/commons/e/e6/Spombe_Pop2p_protein_structure_rainbow.png

5. & 6. Book: Basic Bioinformatics by S.Ignacimuthu, s.j. 7. to 13. Book: Basic Bioinformatics by S.Ignacimuthu, s.j. 14. to 18. http://blast.ncbi.nlm.nih.gov/Blast.cgi

B.sc biochem i bobi u 3.1 sequence alignment

Documents

use substitution matrices

substitution frequencies

sequence alignment phylogeny

outline sequence alignment

special purpose matrices

related sequences

related protein sequences

aligned sequences