Course: B.Sc Biochemistry Subject: Basic of Bioinformatics Unit: III
Transcript
Page 1: B.sc biochem i bobi u 3.1 sequence alignment

Course: B.Sc Biochemistry

Subject: Basic of Bioinformatics

Unit: III

Page 2: B.sc biochem i bobi u 3.1 sequence alignment

OUTLINE

– Sequence Alignment

– Scoring Alignments and Substitution Matrices

– Inserting Gaps

– Dynamic Programming

– Database Searches

Page 3: B.sc biochem i bobi u 3.1 sequence alignment

Sequence Alignment

Comparing sequences for:

– Similarity

– Homology

Prediction of the function of genes and proteins

Construction of phylogenies

Finding motifs

Page 4: B.sc biochem i bobi u 3.1 sequence alignment

Sequence Alignment - HOMOLOGY

Orthologues: any pair of genes whose common ancestor node is a speciation event. Orthologues often have similar functions.

Paralogues: any pair of genes whose common ancestor node is a duplication event. Paralogues tend to have different functions.

Page 5: B.sc biochem i bobi u 3.1 sequence alignment

Sequence Alignment - HOMOLOGY

1.

Page 6: B.sc biochem i bobi u 3.1 sequence alignment

Sequence Alignment - HOMOLOGY

2.

Page 7: B.sc biochem i bobi u 3.1 sequence alignment

Sequence Alignment - PHYLOGENY

3.

Page 8: B.sc biochem i bobi u 3.1 sequence alignment

Sequence Alignment – PROTEIN FUNCTIONS

4.

Page 9: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

The quality of an alignment is measured by giving it a quantitative score

The simplest way of quantifying similarity between two sequences is percentage identity.

– Measured simply by counting the number of identical bases or amino acids matched between the aligned sequences.
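As a minimal sketch of this measure, the function below counts identical columns in an already aligned pair of sequences. The function name, the use of "_" as the gap symbol, and the choice of the full alignment length as the denominator are assumptions made for illustration; other conventions for the denominator exist.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "_") -> float:
    """Percentage of alignment columns that hold the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    # A column counts as identical only if both residues match and are not gaps.
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    # Assumption: the denominator is the total number of alignment columns.
    return 100.0 * identical / len(aligned_a)


# Toy alignment written without spaces: 10 identical columns out of 15.
print(round(percent_identity("THISISASEQUENCE", "THAT___SEQUENCE"), 1))
```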

Page 10: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

The dot-plot gives a visual assessment of similarity based on identity.

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 5.
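A dot-plot can be sketched in a few lines: one sequence runs along the rows, the other along the columns, and a mark is placed wherever the two residues are identical. The function name and the "*" / "." symbols are illustrative choices; real dot-plot tools usually add window filtering, which is omitted here.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per residue of seq_x; '*' marks identity with seq_y."""
    return [
        "".join("*" if a == b else "." for b in seq_y)
        for a in seq_x
    ]


# Similar regions show up as diagonal runs of '*'.
for row in dot_plot("THISSEQ", "THATSEQ"):
    print(row)
```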

Page 11: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

Percentage identity is a relatively crude measure and does not give a complete picture of the degree of similarity of two sequences.

Scoring identical matches as 1 and mismatches as 0 ignores the fact that the type of amino acids involved is highly significant.

Page 12: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

Genuine matches may not be identical:

Seq1: T H I S I S A S E Q U E N C E

Seq2: T H A T _ _ _ S E Q U E N C E

Isoleucine – Alanine: both hydrophobic

Serine – Threonine : both polar

Page 13: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

Scoring pairs of amino acids:

– With similar properties: higher scores

– With different properties: lower scores
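The idea of rewarding similar residues can be illustrated with a toy scoring function. The property groups and the score values (2, 1, -1) below are invented for illustration and are not taken from any published substitution matrix; gapped columns are simply skipped, since gap penalties are a separate topic.

```python
# Hypothetical property groups, for illustration only.
HYDROPHOBIC = set("AVLIMFW")
POLAR = set("STNQ")
GROUPS = [HYDROPHOBIC, POLAR]


def toy_score(a: str, b: str) -> int:
    """Identical residues score highest, same-group residues score lower."""
    if a == b:
        return 2
    if any(a in group and b in group for group in GROUPS):
        return 1
    return -1


def score_alignment(row_a: str, row_b: str, gap: str = "_") -> int:
    """Sum the toy scores over all ungapped columns of an alignment."""
    return sum(
        toy_score(a, b)
        for a, b in zip(row_a, row_b)
        if a != gap and b != gap
    )


print(toy_score("I", "A"), toy_score("S", "T"), toy_score("I", "S"))
```

With these toy values, the pairs isoleucine/alanine and serine/threonine both score higher than a pair from different groups.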

Page 14: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

To assign scores for alignments, use SUBSTITUTION MATRICES

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

5.

Page 15: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

Different types of substitution matrices are used, based on:

– The number of mutations required for conversion of one amino acid to the other

– Similarities in physicochemical properties.

Page 16: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

PAM substitution matrices:

– Use closely related protein sequences to derive substitution frequencies

– PAM: Accepted Point Mutations per 100 residues

– PAM 250: 250 accepted point mutations per 100 residues

Page 17: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

BLOSUM substitution matrices:

– BLOcks SUbstitution Matrix

– Use mutation data from highly conserved local regions

– BLOSUM 62: derived from blocks of sequences clustered at 62% identity

Page 18: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

Which matrix to use?

– Depends on the properties of the problem

– Distantly related sequences: PAM 250, BLOSUM 50

– Closely related sequences: PAM 120, BLOSUM 80

Page 19: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

Which matrix to use?

– Some special-purpose matrices exist (SLIM and PHAT are designed for membrane proteins)

– The length of the sequence is important:

Short sequences: PAM 40 or BLOSUM 80

Long sequences: PAM 250 or BLOSUM 50

Page 20: B.sc biochem i bobi u 3.1 sequence alignment

Scoring Alignments and Substitution Matrices

BLOSUM – 62 and PAM 120

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 6.

Page 21: B.sc biochem i bobi u 3.1 sequence alignment

Inserting Gaps

Gap insertion requires a scoring penalty (gap penalty).

Gaps are required to achieve correct matches.

Alignment programs use gap penalties to limit the introduction of gaps in the alignments

Page 22: B.sc biochem i bobi u 3.1 sequence alignment

Inserting Gaps

Insertions tend to be several residues long rather than just a single residue long

– Fewer insertions and deletions occur in sequences of structural importance

– Smaller penalty for lengthening an existing gap (gap extension penalty) than for introducing a new gap

– If the gap penalty is high, the number of gaps will decrease

– If the gap penalty is low, more and larger gaps will be inserted.

Page 23: B.sc biochem i bobi u 3.1 sequence alignment

Inserting Gaps

Choosing gap penalties:

– Linear

– Affine: gap open penalty and gap extension penalty
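The difference between the two schemes can be sketched as two small functions. The default penalty values and the convention that the gap open penalty already covers the first gapped residue are assumptions for illustration; programs differ on both.

```python
def linear_gap_penalty(length: int, per_residue: int = 8) -> int:
    """Linear scheme: every gapped residue costs the same amount."""
    return per_residue * length


def affine_gap_penalty(length: int, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Affine scheme: opening a gap is expensive, extending it is cheap.

    Assumption: gap_open pays for the first gapped residue, and each
    further residue in the same gap costs gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One gap of length 3 versus three separate gaps of length 1 (affine scheme).
print(affine_gap_penalty(3), 3 * affine_gap_penalty(1))
```

Under the affine scheme a single long gap is much cheaper than several short ones, which matches the observation that insertions tend to be several residues long.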

Page 24: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming

Global and Local alignments

Pairwise and Multiple alignments

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum] 7.

Page 25: B.sc biochem i bobi u 3.1 sequence alignment

For a pair of sequences there is a large number of possible alignments.

Two sequences of length 1000 have approximately 10^600 different alignments.
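The order of magnitude of this figure can be checked numerically. The sketch below assumes that the number of alignments of two sequences of length n is approximated by the central binomial coefficient C(2n, n); that counting model and the function name are assumptions, since the slide does not say how its estimate was obtained.

```python
import math


def alignment_count_estimate(n: int) -> int:
    """Assumed counting model: C(2n, n) alignments for two length-n sequences."""
    return math.comb(2 * n, n)


# A number with 601 decimal digits is of the order of 10^600.
digits = len(str(alignment_count_estimate(1000)))
print(digits)
```

Whatever the exact counting model, the number is far too large to enumerate, which is the motivation for dynamic programming.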

Dynamic Programming

Page 26: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming:

– The problem can be divided into many smaller parts.

– An optimal alignment will not contain parts that are not themselves optimal.

– Start from sufficiently short sub-sequences.

– The alignment score is additive.

Dynamic Programming

Page 27: B.sc biochem i bobi u 3.1 sequence alignment

Needleman and Wunsch were the first to propose this method.

Find optimal global alignments. Align sequences:

– Seq1: x (x1x2x3…xm)

– Seq2: y (y1y2y3…yn)

Dynamic Programming

Page 28: B.sc biochem i bobi u 3.1 sequence alignment

s(a,b) = score of aligning a and b

F(i,j) = optimal similarity of X(1:i) and Y(1:j)

Recurrence relation:

– F(i,0) = Σ s(X(k), gap), 1 <= k <= i

– F(0,j) = Σ s(gap, Y(k)), 1 <= k <= j

– F(i,j) = max [ F(i,j-1) + s(gap, Y(j)),

F(i-1,j) + s(X(i), gap),

F(i-1,j-1) + s(X(i), Y(j)) ]

– Assume linear gap penalty
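The recurrence can be turned into a short scoring routine. This is a minimal sketch that returns only the optimal global score and does no traceback; the function names and the +1 / -1 match / mismatch scores with a gap score of -1 are illustrative assumptions, and a substitution matrix lookup could be passed in instead of `simple_score`.

```python
def simple_score(a: str, b: str) -> int:
    """Hypothetical residue score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def needleman_wunsch_score(x: str, y: str, score=simple_score, gap: int = -1) -> int:
    """Optimal global alignment score with a linear gap score (gap < 0)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary conditions: a prefix aligned entirely against gaps.
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] + gap
    # Fill the table row by row using the three-way recurrence.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i][j - 1] + gap,                          # gap in x
                F[i - 1][j] + gap,                          # gap in y
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x_i with y_j
            )
    return F[m][n]


print(needleman_wunsch_score("GATTACA", "GCATGCU"))
```

Recovering the alignment itself would require storing, for each cell, which of the three terms gave the maximum and tracing back from F(m,n).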

Dynamic Programming

Page 29: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming

Matrix S of optimal scores of sub-sequence alignments.

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

9.

Page 30: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming

S(I, T) = -1,

10.

Page 31: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming

S(I, H) = -3,

S(I, gap) = -8,

S(gap, H) = -8

Recurrence relation:

F(i,j) = max [ F(i,j-1) + s(gap, Y(j)), F(i-1,j) + s(X(i), gap), F(i-1,j-1) + s(X(i), Y(j)) ]

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

11.

Page 32: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

12.

Page 33: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming

–Linear gap penalty (E=4)

[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

13.

Page 34: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming

Semi-global alignment:

– Used when terminal gaps are treated differently from internal gaps

– How can dynamic programming be modified to produce a semi-global alignment?
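One common answer to this question is to leave terminal gaps unpenalised: the first row and column of the table are initialised to zero, and the final score is the best value in the last row or last column rather than the bottom-right cell. The sketch below assumes that variant (free end gaps in both sequences); the function names and score values are illustrative.

```python
def simple_score(a: str, b: str) -> int:
    """Hypothetical residue score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def semi_global_score(x: str, y: str, score=simple_score, gap: int = -2) -> int:
    """Global-style recurrence with free leading and trailing gaps."""
    m, n = len(x), len(y)
    # First row and column stay at 0, so leading gaps cost nothing.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i][j - 1] + gap,
                F[i - 1][j] + gap,
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
            )
    # Trailing gaps are free: take the best cell in the last row or column.
    return max(max(F[m]), max(F[i][n] for i in range(m + 1)))


print(semi_global_score("CGT", "AACGTAA"))
```

The short sequence is placed inside the long one without paying for the unaligned flanks, which a purely global score would penalise.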

Page 35: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming

Local alignment:

– Used, for example, when comparing a sequence to a whole genome

– Find the sub-strings whose optimal global alignment score is maximal

Page 36: B.sc biochem i bobi u 3.1 sequence alignment

Dynamic Programming

What is the difference between global and local alignment ?

Can we define the recurrence relation of local alignment in a similar way to that of global alignment?

Page 37: B.sc biochem i bobi u 3.1 sequence alignment

Recurrence relation of GLOBAL ALIGNMENT:

(Needleman & Wunsch)

– F(i,0) = Σ s(X(k), gap), 1 <= k <= i

– F(0,j) = Σ s(gap, Y(k)), 1 <= k <= j

– F(i,j) = max [ F(i,j-1) + s(gap, Y(j)),

F(i-1,j) + s(X(i), gap),

F(i-1,j-1) + s(X(i), Y(j)) ]

Dynamic Programming

Page 38: B.sc biochem i bobi u 3.1 sequence alignment

Recurrence relation of LOCAL ALIGNMENT:

(Smith-Waterman)

– F(i,0) = 0

– F(0,j) = 0

– F(i,j) = max [ 0,

F(i,j-1) + s(gap, Y(j)),

F(i-1,j) + s(X(i), gap),

F(i-1,j-1) + s(X(i), Y(j)) ]
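The local recurrence differs from the global one in two places: the boundaries are zero, and every cell is floored at zero, so the best score can occur anywhere in the table. The sketch below returns only that best score; the function names and the +2 / -1 / -2 scoring values are illustrative assumptions.

```python
def simple_score(a: str, b: str) -> int:
    """Hypothetical residue score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def smith_waterman_score(x: str, y: str, score=simple_score, gap: int = -2) -> int:
    """Best local alignment score with a linear gap score (gap < 0)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # F(i,0) = F(0,j) = 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                0,                                           # start a new local alignment
                F[i][j - 1] + gap,
                F[i - 1][j] + gap,
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
            )
            best = max(best, F[i][j])  # the optimum may lie in any cell
    return best


print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Here the shared core ACGT is found even though the flanking residues differ.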

Dynamic Programming

Page 39: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches

FASTA and BLAST: use heuristics

Dynamic programming complexity:

– Time O(n*m)

– Space O(n*m)

Page 40: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches FASTA

A good local alignment should contain some exactly matching subsequence.

Find all k-tuples (k = 1–2 for proteins, 3–6 for DNA sequences).

Protein k-tuples: nc, sp, … (k = 2)

Nucleotide k-tuples: TAAA, CTCC, … (k = 4)

Page 41: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches FASTA

If k = 3 for nucleotide sequences:

– There will be 4^3 = 64 possible k-tuples

– Assign a number e( ) to each base:

e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3

Each 3-tuple is represented as x_i x_{i+1} x_{i+2}

Assign a number to each 3-tuple:

– C_i = e(x_i)·4^2 + e(x_{i+1})·4^1 + e(x_{i+2})·4^0

– For example:

AAA: 0·4^2 + 0·4^1 + 0·4^0 = 0

CAA: 1·4^2 + 0·4^1 + 0·4^0 = 16
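The numbering of k-tuples amounts to reading the tuple as a base-4 number. A minimal sketch, with the function name chosen for illustration and the base codes taken from the slide:

```python
# Base codes as given on the slide.
E = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_ktuple(word: str) -> int:
    """Read a nucleotide k-tuple as a base-4 number, first base most significant."""
    code = 0
    for base in word:
        code = code * 4 + E[base]
    return code


print(encode_ktuple("AAA"), encode_ktuple("CAA"), encode_ktuple("TTT"))
```

For k = 3 the codes run from 0 (AAA) to 63 (TTT), so each of the 64 possible tuples gets its own slot in a look-up table.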

Page 42: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches FASTA

Find each occurrence of the k-tuples in the sequences.

Chaining and look-up tables. Consider TAAAACTCTAAC (k = 3):

3 - tuples Position

AAA (0) 2, 3

AAC (1) 4, 10

AAG (2) 0

AAT (3) 0

… …
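The table above can be reproduced with a short routine. In this sketch a Python dictionary of lists stands in for the hashed look-up table with chaining; that substitution, the function name, and the 1-based positions (chosen to match the table) are assumptions for illustration.

```python
from collections import defaultdict


def build_lookup_table(seq: str, k: int) -> dict[str, list[int]]:
    """Map each k-tuple to the 1-based positions at which it starts in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return dict(table)


table = build_lookup_table("TAAAACTCTAAC", 3)
print(table["AAA"], table["AAC"])
```

Tuples that never occur, such as AAG and AAT, are simply absent from the dictionary, which corresponds to the 0 entries in the table.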

Page 43: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

Use short words to search the database sequence.

Searches for k-mers that score above a threshold value (T) when aligned with a query k-mer (remember that FASTA looks for identical k-tuples).

Uses a scheme based on finite state automata (remember that FASTA uses hashing and chaining for rapid identification of k-tuples).

Page 44: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

From Query Sequence, create query words (for protein sequences word size is 3)

Page 45: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

BLAST uses a list of high-scoring words created from words similar to the query words. It considers only the words with a score greater than a threshold value.
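The construction of this word list can be sketched by brute force: enumerate every possible word of the same length and keep those that score at least T against the query word. The five-letter alphabet, the property groups, the score values, and the use of "at least T" as the cut-off are all invented for illustration; BLAST itself uses a full substitution matrix over the 20 amino acids.

```python
from itertools import product

# Hypothetical reduced alphabet and property groups, for illustration only.
ALPHABET = "ILVDE"
GROUPS = [set("ILV"), set("DE")]


def toy_score(a: str, b: str) -> int:
    """Toy scores: identical 4, same group 2, different group -2."""
    if a == b:
        return 4
    if any(a in g and b in g for g in GROUPS):
        return 2
    return -2


def neighbourhood(word: str, threshold: int) -> list[str]:
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


words = neighbourhood("LID", threshold=8)
print(len(words), words)
```

Raising the threshold shrinks the list towards the exact query word, while lowering it admits more distant words and increases the number of database hits to examine.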

Page 46: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

Scan each database sequence for an exact match to the list of words.

Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S".
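The extension step can be sketched as an ungapped walk to the right and then to the left of the word hit, keeping the best score seen. The stopping rule used here (give up once the running score falls `x_drop` below the best score), the +1 / -1 scores, and all names are assumptions for illustration rather than BLAST's actual parameters.

```python
def extend_hit(query, subject, q_start, s_start, k,
               match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of a length-k word hit; returns (score, query span)."""
    def col(qi, si):
        return match if query[qi] == subject[si] else mismatch

    # Score of the seed word itself.
    best = running = sum(col(q_start + i, s_start + i) for i in range(k))

    # Extend to the right, remembering how far the best score reached.
    right = 0
    q, s = q_start + k, s_start + k
    while q < len(query) and s < len(subject):
        running += col(q, s)
        q += 1
        s += 1
        if running > best:
            best, right = running, q - (q_start + k)
        elif best - running >= x_drop:
            break

    # Extend to the left, starting again from the best score so far.
    running = best
    left = 0
    q, s = q_start - 1, s_start - 1
    while q >= 0 and s >= 0:
        running += col(q, s)
        if running > best:
            best, left = running, q_start - q
        elif best - running >= x_drop:
            break
        q -= 1
        s -= 1

    # Half-open span [start, end) of the extended segment in the query.
    return best, (q_start - left, q_start + k + right)


score, span = extend_hit("TTACGTACGG", "GGACGTACCC", q_start=3, s_start=3, k=3)
print(score, span)
```

An extended segment would then be kept only if its score reaches the threshold S.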

Page 47: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

Keep only the extended matches that have a score of at least S.

Determine statistical significance of each remaining match.

Page 48: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

http://blast.ncbi.nlm.nih.gov/Blast.cgi


14.

Page 49: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

15.

Page 50: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

16.

Page 51: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

17.

Page 52: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches BLAST

18.

Page 53: B.sc biochem i bobi u 3.1 sequence alignment

Database Searches HISTORY

1970: Needleman-Wunsch (NW)

1981: Smith-Waterman (SW)

1985: FASTA (first published as FASTP)

1990: BLAST

Page 54: B.sc biochem i bobi u 3.1 sequence alignment

Books and Web References

Books:

1. Introduction To Bioinformatics by T. K. Attwood

2. BioInformatics by Sangita

3. Basic Bioinformatics by S.Ignacimuthu, s.j.

http://en.wikipedia.org/wiki/Sequence_alignment

http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf

http://www.ks.uiuc.edu/Training/Tutorials/science/bioinformatics-tutorial/bioinformatics.pdf

M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008, Garland Science

Andreas D. Baxevanis, B. F. Francis Ouellette, “Bioinformatics: A practical guide to the analysis of genes and proteins”, 2001, Wiley.


Page 55: B.sc biochem i bobi u 3.1 sequence alignment

Images References

1. http://gorbi.irb.hr/files/5712/7497/9729/Slide09.jpg

2. http://www.ensembl.org/info/genome/compara/tree_example1.png

3. http://www.nature.com/nature/journal/v496/n7445/images/nature12027-f1.2.jpg

4. http://upload.wikimedia.org/wikipedia/commons/e/e6/Spombe_Pop2p_protein_structure_rainbow.png

5. & 6. Book: Basic Bioinformatics by S. Ignacimuthu, s.j.

7. to 13. Book: Basic Bioinformatics by S. Ignacimuthu, s.j.

14. to 18. http://blast.ncbi.nlm.nih.gov/Blast.cgi