Dr. S. Parthasarathy, Bharathidasan Univ., Trichy
16 FEB 2006
Sequence Alignment Algorithms – Application to Bioinformatics Tool Development
Dr. S. Parthasarathy
Reader and Head
Department of Bioinformatics
Bharathidasan University
Tiruchirappalli – 620 024
Unit of genetic distance: centiMorgan (cM), an arbitrary unit named for Thomas Hunt Morgan (e.g. 1 cM = 0.01 recombination frequency)
Introduction
Biological Data: Genome Projects
16 February 2001 15 February 2001
Biological Data: Recombinant DNA Technology
Old Revolution
  1940 – Role of DNA as the genetic material was confirmed
  1953 – Discovery of DNA structure by James Watson & Francis Crick
  1966 – Establishment of the genetic code
  1967 – DNA ligase was isolated (joins two strands of DNA together): molecular glue
  1970 – Isolation of restriction enzyme: molecular scissors
  1972 – Recombinant DNA molecules were generated at Stanford University, USA
  1973 – DNA fragments were joined to the plasmid pSC101 isolated from E. coli. They could replicate when introduced into E. coli.
The discoveries of 1972 & 1973 triggered off the biggest scientific revolution: Genetic Engineering
Biological Data Explosion
GenBank (National Center for Biotechnology Information, NCBI, USA): 44 Gbp of DNA & 40 million sequences (up to 2004)
Protein Data Bank (PDB; Research Collaboratory for Structural Bioinformatics, RCSB, USA): 29,000 structures (2004)
QUALITY of data: HIGH
  Experimental error in modern genomic sequencing is extremely low
QUANTITY of data: HUGE
  With recombinant DNA technology & genomic sequencing, the size of sequence databases is increasing very rapidly
SEQUENCE versus STRUCTURE databases
  Sequence databases are far larger than structure databases
Leads to Bioinformatics
What?
What is Bioinformatics?
Define Bioinformatics
Bioinformatics - Definition
Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological data. We use computer programs to make inferences from the biological data, to make connections among them and to derive useful and interesting predictions.
"The marriage of biology and computer science has created a new field called 'Bioinformatics'." - Arthur M. Lesk
Illustrations on the slide:
\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) - d \\
F(i,\,j-1) - d
\end{cases}
\]
PEPTIDESE QSEDITPEP
atcggcatgcatcagtcatgcaactg
Biology: Basic Definitions
Cell: the building block of living organisms
  Eukaryotic cells or organisms have the nucleus separated from the cytoplasm by a nuclear membrane, and the genetic material is borne on a number of chromosomes consisting of DNA and protein
Chromosome: the physical basis of heredity
  Deeply staining rod-like structures present within the nuclei of eukaryotes
  Contain DNA and protein arranged in a compact manner
  Replicate identically during cell division
  The same number of chromosomes is present in the cells of a particular species (e.g. Human: 22, X and Y)
Genome: Basic Definitions
Genome: a complete set of chromosomes inherited from one parent
Gene: one of the units of inherited material carried on chromosomes. Genes are arranged in a linear fashion on DNA. Each represents one character, which is recognized by its effect on the individual bearing the gene in its cells. There are many thousands of genes in each nucleus.
DNA (deoxyribonucleic acid): made up of FOUR bases
  a t g c – adenine, thymine, guanine, cytosine
Protein: made up of TWENTY different amino acids
  A T G C ... – Alanine, Threonine, Glycine, Cysteine, ...
Central Dogma
DNA (transcription) mRNA (translation) Protein
  DNA:     CCTGAGCCAACTATTGATGAA
  mRNA:    CCUGAGCCAACUAUUGAUGAA
  Protein: PEPTIDE
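The example on this slide can be checked with a few lines of code. The following is a minimal Python sketch, assuming the standard genetic code; the function names `transcribe` and `translate` are illustrative and do not come from the slides, and transcription is modelled simply as replacing T by U in the coding strand.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA (or DNA) sequence codon by codon, stopping at a stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for k in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[k:k + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

The seven codons CCU GAG CCA ACU AUU GAU GAA code for Pro, Glu, Pro, Thr, Ile, Asp and Glu, which in one-letter code spells the word on the slide.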
Genome Data: Human & Model Organisms
Most mapping and sequencing technologies were developed from studies of simpler non-human organisms
Bioinformatics - Structure Prediction
  Homology modelling – InsightII, SwissPDBViewer, Biosuite
  'ab initio' method – Monte Carlo simulation
Protein Structure Classification
  SCOP - Structural Classification Of Proteins
  CATH - Class, Architecture, Topology, Homologous superfamily
  FSSP - Fold classification based on Structure-Structure alignment of Proteins, obtained by DALI (Distance-matrix ALIgnment)
Bioinformatics Tasks
Protein Engineering
  Mutations: alter a particular amino acid/base for a desired effect
  Site-directed mutagenesis: identify the potential sites where alterations can be made
  Applications
    Pharmaceutical – molecular-modelling-based drug design
    Medical – gene therapy
DNA Bending
  Application to genomes
  (Ref: M.G.Munteanu, K.Vlahovicek, S.Parthasarathy, I.Simon and S.Pongor, Rod Models of DNA: Sequence-dependent anisotropic elastic modelling of local phenomena, Trends in Biochemical Sciences, 23 (1998) 341-347)
Bioinformatics Tasks: Genomics & Proteomics
Genomics is the study of the structure, content, evolution and functions of genes in genomes
Aims of Genomics
  To establish an integrated web based database and research interface
  To assemble physical, genetic and cytological maps of the genome
  To identify and annotate the complete set of genes encoded within a genome
  To provide the resources for comparison with other genomes
Proteomics – Proteome
The proteome is the complete collection of proteins in a cell/tissue/organism at a particular time. Unlike genomes, which are stable over the lifetime of the organism, proteomes change rapidly as each cell responds to its changing environment and produces new proteins in different amounts.
The genome is a more stable entity. An organism has only one genome but many proteomes.
For an organism, there may be one body-wide proteome, about 200 tissue proteomes and about a trillion (~10^12) individual cell proteomes.
Proteomics – Definition
The study of proteomes, which includes determining the 3D shapes of proteins, their roles inside cells and the molecules with which they interact, and defining which proteins are present and how much of each is present at a given time.
Proteomics – Applications
To correlate proteins on the basis of their expression profiles.
To observe patterns in protein synthesis; changes in these patterns can be used as an indicator of the state of the cell and its gene expression.
To characterize bacterial pathogens and to develop novel antimicrobials.
To identify regions of the bacterial genome that encode pathogenic determinants.
To develop drugs and in toxicology – Structural Proteomics
Proteomics as a tool for plant genetics and breeding
Systems Biology
Systems Biology is a new perspective and an emerging field for research in the post-genomic era.
It aims at system-level understanding of biological systems.
It studies whole cells/tissues/organisms not by a traditional reductionist approach but by holistic means, in a reiterative attempt to model the complete cell/tissue/organism.
It treats the system as an integrated and interacting network of genes, proteins and biochemical reactions which give rise to life.
Sequence Alignment Algorithms
Similarity and Homology
Sequence Comparison - Issues
Types of alignments
Algorithms Used
Sequence similarity and homology
Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo. There exists significant similarity between a new sequence and already known sequences, which is fortunate for computational sequence analysis.
Similarity – Measurement of resemblance and differences, independent of the source of resemblance.
Homology – The sequences and the organisms in which they occur are descended from a common ancestor.
If two related sequences are homologous, then we can transfer information about structure and/or function, by homology.
3-D Structure and Homology
3-D structure patterns (motifs) of proteins are much more evolutionarily conserved than amino acid sequences - This type of Homology search could prove more fruitful
Particular motifs may serve similar functions in several different proteins, information that would be valuable in genome analysis
Only a few protein motifs can be recognised at the sequence level
Development of more analytic capabilities to facilitate grouping protein sequences into motif families will make homology searches more useful
Sequence Comparison: Issues
Types of alignment
  Global – end-to-end matching (Needleman-Wunsch)
  Local – matching of portions or subsequences (Smith-Waterman)
Scoring system used to rank alignments
  PAM & BLOSUM matrices
Algorithms used to find optimal (or good) scoring alignments
  Heuristic
  Dynamic Programming
  Hidden Markov Model (HMM)
Statistical methods used to evaluate the significance of an alignment score
  Z-score, P-value and E-value
A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee.
In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs.
Dynamic Programming
  The algorithm finds optimal alignments, given an additive alignment score, dynamically. (We will discuss it shortly.)
  This type of algorithm is guaranteed to find the optimal scoring alignment or set of alignments.
HMM
  Based on probability theory; very versatile.
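Global Alignment: Needleman-Wunsch Algorithm
Formula
\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\
F(i-1,\,j) - d & \text{(H)} \\
F(i,\,j-1) - d & \text{(V)}
\end{cases}
\]
Here s(x_i, y_j) is the score for aligning residues x_i and y_j, and d is the gap penalty.
Each cell F(i,j) is filled from three neighbouring cells: F(i-1,j-1) (D), F(i-1,j) (H) and F(i,j-1) (V).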
Global Alignment: Needleman-Wunsch Algorithm
Gap penalties
  Linear score: f(g) = -gd
  Affine score: f(g) = -d - (g-1)e
  d = gap open penalty, e = gap extend penalty, g = gap length
Trace back
  Start from the value in the bottom right corner and trace back to the top left corner (i.e. the alignment always runs end to end).
Algorithm complexity
  It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
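The recurrence and the trace back described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the talk: it assumes a linear gap penalty d and a simple match/mismatch score in place of a PAM or BLOSUM matrix, and the name `needleman_wunsch` and its default parameter values are chosen here for the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def s(a, b):
        # Assumed scoring function: a real tool would look up PAM/BLOSUM here.
        return match if a == b else mismatch

    # F has (n+1) x (m+1) cells; the first row and column hold pure gap costs.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # D
                          F[i - 1][j] - d,                          # H
                          F[i][j - 1] - d)                          # V

    # Trace back from the bottom right corner to the top left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

Both the table fill and the stored matrix are proportional to n x m, which is where the O(nm) time and memory figures come from.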
Local Alignment: Smith-Waterman Algorithm
Same as the global alignment algorithm, with TWO differences:
  1. F(i,j) takes the value 0 (zero) if all other options have values less than 0.
  2. The alignment can end anywhere in the matrix. Take the highest value of F(i,j) over the whole matrix and start the trace back from there.
Local Alignment: Smith-Waterman Algorithm
Formula
\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\
F(i-1,\,j) - d & \text{(H)} \\
F(i,\,j-1) - d & \text{(V)} \\
0 & \text{(if all other values are } < 0\text{)}
\end{cases}
\]
Each cell F(i,j) is filled from three neighbouring cells: F(i-1,j-1) (D), F(i-1,j) (H) and F(i,j-1) (V).
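A minimal Python sketch of the local alignment recurrence is given below. As with any short illustration, several choices are assumptions: a linear gap penalty, a match/mismatch score instead of a substitution matrix, and the function name `smith_waterman` with its default values. The trace back starts at the highest-scoring cell and stops at the first cell whose value is 0.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment of x and y; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def s(a, b):
        # Assumed scoring function, standing in for a substitution matrix.
        return match if a == b else mismatch

    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # D
                          F[i - 1][j] - d,                          # H
                          F[i][j - 1] - d)                          # V
            if F[i][j] > best:  # remember the first highest-scoring cell
                best, bi, bj = F[i][j], i, j

    # Trace back from the best cell until a zero is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


# The flanking letters x and y are arbitrary symbols that never match.
print(smith_waterman("xxACGTxx", "yyACGTyy"))  # (8, 'ACGT', 'ACGT')
```

Only the shared core ACGT is reported, whereas a global alignment of the same two strings would be forced to pair the non-matching flanks as well.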
Web based server development
Design the web page to get the data
Use cgi-bin or Perl script to parse the submitted data
Invoke the corresponding program to get the appropriate results
Send the results either by e-mail or to the web page directly
Application to Bioinformatics Tool Development
To predict a fold for a protein sequence
  To predict possible folds for a given protein sequence whose structure is not known
  To develop a fold recognition technique / tool that is sensitive in detecting folds of given protein sequences in the twilight zone (sequences sharing less than 25% identity)
  Application of the fold recognition strategy to genomic annotation
  Exploration of suitable fold recognition techniques that are sensitive in detecting similar folds despite low sequence similarity
  Identification of functional motifs in proteins at the sequence (1D) and structure (3D) level
  Development of a protocol that aids in the rapid classification and annotation of genomic data based on functional motifs
Methodology
Reduction of 3D-structure to 1D-environment string. Environment at each residue position is a function of local secondary structure and extent of exposure to the solvent (based on 3D-1D profile method developed by Eisenberg et al., 1991).
Extract residue environment profiles of the available protein structures.
A scoring matrix is generated from a library of profiles. Each matrix element is the information value of a residue in the given environment.
A library of environment strings is created for the available protein fold structures.
The probe sequence is queried against this library to look for best matches.
Workflow
[Diagram; elements shown: 3D-Profiles, 1D-Environment Sequence, Fold library, Scoring Table, New Sequence, PredictFold, FOLD PREDICTION, Annotate New sequence]
Residue Environments
[Figure labels] Solvent exposure: Exposed, Partially buried, Buried
[Figure labels] Secondary structure: Helix, Strand, Coil
Residue Environments
The residue environments are described by
  1. the area (A) of the residue buried in the protein
  2. the fraction (f) of side-chain area that is covered by polar atoms (O and N)
  3. the local secondary structure
Residue Environments
CLASS             Area (A), Å^2     Fraction (f)
BURIED 1 (B1)     A > 114           f < 0.45
BURIED 2 (B2)     A > 114           0.45 < f < 0.58
BURIED 3 (B3)     A > 114           f > 0.58
PARTIAL 1 (P1)    40 < A < 114      f < 0.67
PARTIAL 2 (P2)    40 < A < 114      f > 0.67
EXPOSED (E0)      A < 40            f > 0.67
Residue Environment classes
We have 6 classes based on the extent of exposure to solvent
We have 3 classes based on secondary structure – Alpha Helix (A), Beta Sheet (B) & Coil (C)
Total: 6 x 3 = 18 environments (e.g. B1A, B1B, B1C, B2A, B2B, B2C, B3A, B3B, B3C)
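The assignment of a residue to one of the 18 environments can be sketched in Python as below. This is an illustration under stated assumptions, not the PredictFold code: the function names are invented for the example, values falling exactly on a threshold are assigned to the higher class, and every residue with A < 40 is treated as exposed (E0) regardless of f, although the table lists f > 0.67 for that class.

```python
EXPOSURE_CLASSES = ("B1", "B2", "B3", "P1", "P2", "E0")
SECONDARY_CLASSES = ("A", "B", "C")  # alpha helix, beta sheet, coil


def exposure_class(area, polar_fraction):
    """Map buried area A (in square angstroms) and polar fraction f to an exposure class."""
    if area > 114:                       # buried
        if polar_fraction < 0.45:
            return "B1"
        if polar_fraction < 0.58:
            return "B2"
        return "B3"
    if area > 40:                        # partially buried
        return "P1" if polar_fraction < 0.67 else "P2"
    return "E0"                          # exposed (assumption: f is ignored here)


def environment(area, polar_fraction, secondary):
    """Combine the exposure class and secondary structure into a label such as 'B1A'."""
    if secondary not in SECONDARY_CLASSES:
        raise ValueError("secondary structure must be 'A', 'B' or 'C'")
    return exposure_class(area, polar_fraction) + secondary


def environment_string(residues):
    """Reduce a structure, given as (A, f, secondary) tuples, to a list of environment labels."""
    return [environment(a, f, ss) for a, f, ss in residues]


print(environment_string([(120.0, 0.30, "A"), (80.0, 0.70, "B"), (20.0, 0.90, "C")]))
# ['B1A', 'P2B', 'E0C']
```

Applied to every residue of a known structure, this kind of mapping yields the 1D environment string that is later aligned against a probe sequence.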
Scoring Table
The scoring table used in this case is a 20 x 18 matrix, constructed from a statistical analysis of the profile library (consisting of 1200 protein structures) provided by the PROFILES_3D module of Insight II (Accelrys Inc.)
The scores S_ij are calculated using the formula
\[
S_{ij} = 100 \times \ln\left[ \frac{P(i:j)}{P_i} \right]
\]
where P(i : j) is the probability of finding residue i in the environment j and P_i is the overall probability of finding residue i in any environment.
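The formula can be illustrated with a short Python sketch that derives scores from raw counts. The data layout (a dictionary of per-environment residue counts), the function name `score_table`, and the reading of P(i : j) as the frequency of residue i among all residues observed in environment j are assumptions made for this example; pairs that were never observed are left out because the logarithm of zero is undefined.

```python
import math


def score_table(counts):
    """Compute S_ij = 100 * ln(P(i:j) / P_i) from counts[environment][residue]."""
    grand_total = sum(sum(res.values()) for res in counts.values())

    # Overall count of each residue across all environments (used for P_i).
    residue_totals = {}
    for res in counts.values():
        for residue, c in res.items():
            residue_totals[residue] = residue_totals.get(residue, 0) + c

    table = {}
    for env, res in counts.items():
        env_total = sum(res.values())
        for residue, c in res.items():
            if c == 0:
                continue  # unobserved pair: no finite score
            p_ij = c / env_total                          # P(i : j)
            p_i = residue_totals[residue] / grand_total   # P_i
            table[(residue, env)] = 100.0 * math.log(p_ij / p_i)
    return table


# Toy counts, invented for illustration only.
toy = {"B1A": {"L": 30, "K": 10}, "E0C": {"L": 10, "K": 30}}
scores = score_table(toy)
print(round(scores[("L", "B1A")], 1))  # 40.5
print(round(scores[("K", "B1A")], 1))  # -69.3
```

A positive score means the residue is seen in that environment more often than its overall frequency would suggest, and a negative score means it is seen less often.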
Scoring Table
The scoring table contains a measure of the compatibility of the 20 amino acids with the 18 environmental classes.
The individual matrix elements are propensities (information values) for the amino acid residues.
Fold Library
Scan PDB to identify all the structures having these folds
Identify a representative structure with resolution 2.5 Å or better
Quality of the structure (occupancy, R-factor, stereochemistry)
968 chains
DALI / FSSP Fold Library
DALI: http://www.ebi.ac.uk/dali
Lisa Holm and Chris Sander, Touring protein fold space with DALI/FSSP, Nucleic Acids Research, (1998), 26, 316-319
Lisa Holm and Chris Sander, Mapping the Protein Universe, Science, (1996), 273, 595-602
Sequence Comparison: Details
Type of alignment: Local (matching of portions or subsequences), Smith-Waterman Algorithm
Scoring table: 3D-1D matrix
Algorithm used: Dynamic Programming
Alignment score: Z-score
Local Alignment: Smith-Waterman Algorithm
Formula
\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(D)} \\
F(i-1,\,j) - d & \text{(H)} \\
F(i,\,j-1) - d & \text{(V)} \\
0 & \text{(if all other values are } < 0\text{)}
\end{cases}
\]
Each cell F(i,j) is filled from three neighbouring cells: F(i-1,j-1) (D), F(i-1,j) (H) and F(i,j-1) (V).
Gap Penalties
Linear score: f(g) = -gd
Affine score: f(g) = -d - (g-1)e
  d = gap open penalty, e = gap extend penalty, g = gap length
Gap penalty values used are d = 500 and e = 50
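The two gap-scoring schemes can be compared with a few lines of Python. The function names are illustrative; the values d = 500 and e = 50 are the ones quoted on this slide.

```python
def linear_gap(g, d):
    """Linear gap score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score f(g) = -d - (g-1)*e: d opens the gap, e extends it."""
    return -d - (g - 1) * e


for g in (1, 2, 3, 10):
    print(g, linear_gap(g, 500), affine_gap(g, 500, 50))
# 1 -500 -500
# 2 -1000 -550
# 3 -1500 -600
# 10 -5000 -950
```

With these values a single gap of length 10 costs 950 under the affine scheme, less than two separate gaps of length 1 (1000), so the affine score favours a few long gaps over many short ones.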
Local Alignment
Trace back
  The alignment can end anywhere in the matrix
  Take the highest value of F(i,j) over the whole matrix and start the trace back from there.
Algorithm complexity
  It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
Significance of an Alignment Score
Statistical methods used to evaluate the significance of an alignment score: Z-score, P-value and E-value
Z-score = (score – mean) / std. dev.
  Measures how unusual our original match is. Z ≥ 5 is considered significant.
P-value measures the probability that the alignment is no better than random. (Z and P depend on the distribution of the scores.)
  P ≤ 10^-100: exact match.
E-value is the expected number of sequences that give the same Z-score or better. (E = P x size of the database)
  E ≤ 0.02: sequences probably homologous.
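The Z-score and E-value definitions translate directly into code. The sketch below is illustrative: the function names are invented, and it assumes the mean and standard deviation are taken over a set of background scores using the population standard deviation, a choice the slide does not specify.

```python
import statistics


def z_score(score, background_scores):
    """Z = (score - mean) / std. dev., computed against a list of background scores."""
    mean = statistics.mean(background_scores)
    std = statistics.pstdev(background_scores)  # population std. dev. (assumption)
    return (score - mean) / std


def e_value(p_value, database_size):
    """E = P x size of the database."""
    return p_value * database_size


print(z_score(150, [90, 110]))  # 5.0
print(e_value(1e-5, 968))       # about 0.0097
```

In the first call the background scores have mean 100 and standard deviation 10, so a score of 150 lies five standard deviations above the mean.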
Benchmarking
All 968 proteins in the fold library were profiled against each of the other members
A histogram showing, for each rank, the number of sequences whose self score obtained that rank is given in the figure.
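One way to compute such a rank histogram is sketched below in Python. It assumes the all-against-all scores are available as a square matrix in which entry [i][k] is the score of sequence i against library member k, so that the diagonal holds the self scores; the matrix layout and the function names are assumptions for this example, and the toy numbers are invented.

```python
from collections import Counter


def self_rank(scores, self_index):
    """Rank of the self score in one row: 1 + number of other entries scoring higher."""
    self_score = scores[self_index]
    higher = sum(1 for k, s in enumerate(scores)
                 if k != self_index and s > self_score)
    return 1 + higher


def rank_histogram(score_matrix):
    """Count how many sequences obtain each self rank."""
    return Counter(self_rank(row, i) for i, row in enumerate(score_matrix))


toy_matrix = [
    [10, 3, 2],   # self score 10 is the best in its row: rank 1
    [5, 4, 1],    # self score 4 is beaten by 5: rank 2
    [1, 2, 9],    # self score 9 is the best in its row: rank 1
]
print(dict(rank_histogram(toy_matrix)))  # {1: 2, 2: 1}
```

A sequence with rank 1 recognises its own fold as the best match, which is the outcome the benchmark counts as a success.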
Benchmarking
[Figure: histogram of the number of sequences (y-axis, 0 to 1000) against rank (x-axis, bins 1 to 8). Bar values: 797, 63, 3, 3, 2, 17, 29, 54.]
Benchmarking
Report
  797 retain the self as the highest score
  63 report the self to have the second highest score
  There were about 100 proteins that have ranks between 5 and 100.
Limitations
  Prediction is restricted to the 968 folds in the library
  The algorithm is insensitive to partially folded sequences
  Specific to globular proteins and not for membrane proteins
  Sequences that fold in the presence of cofactors and ligands are not accounted for
Web based server development
Design the web page to get the data
Use cgi-bin or Perl script to parse the submitted data
Invoke the corresponding program to get the appropriate results
Send the results either by e-mail or to the web page directly
Prepare a 'user manual' to describe the salient features of the server
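The parsing step can be illustrated with a small sketch. The slides name cgi-bin or Perl for this job; the example below uses Python instead, purely for illustration. The form field names `sequence` and `email`, the function name `parse_submission`, and the rule that only the 20 standard amino-acid letters are accepted are all assumptions, not details of the actual server.

```python
from urllib.parse import parse_qs

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def parse_submission(form_body):
    """Parse a URL-encoded form body and return (sequence, email).

    Accepts plain or FASTA-formatted input in the hypothetical 'sequence' field.
    Raises ValueError if the sequence is empty or contains other characters.
    """
    fields = parse_qs(form_body)
    raw = fields.get("sequence", [""])[0]
    # Drop FASTA header lines (starting with '>') and join the remaining lines.
    lines = [ln.strip() for ln in raw.splitlines() if not ln.startswith(">")]
    sequence = "".join(lines).upper()
    if not sequence or set(sequence) - VALID_RESIDUES:
        raise ValueError("sequence is empty or contains invalid characters")
    email = fields.get("email", [None])[0]
    return sequence, email


body = "sequence=%3Eq1%0APEPTIDE%0A&email=user%40example.org"
print(parse_submission(body))  # ('PEPTIDE', 'user@example.org')
```

Validating the input before invoking the alignment program keeps malformed submissions from reaching it; the returned e-mail address would then be used for the option of sending results by e-mail.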
Conclusions
PredictFold – A program to predict possible folds for a new protein sequence based on the 3D-1D profile method
Benchmarking results show the reliability of the method
There is a lot of scope for further improvement
Future Directions
To update the fold library by including more known folds
To also use the predicted secondary structure information of the given sequence
To optimise the source code for efficient, automatic handling of genome sequences
To combine results from other algorithms (ORF, HMM, etc.) to detect remote homologs
To develop & maintain a web-based server for fold recognition
BT versus IT
Bioinformatics, including Biotechnology (BT), requires a lot of Information Technology (IT) skills for genomic annotation projects
Bioinformatics is also one of the potential areas for IT professionals
Genome projects will be the next huge task for IT industries (like the Y2K problem in the past)
BT will take on IT soon … in the near future …
Conclusions
Developing Web based Bioinformatics tools
  Develop/modify useful algorithms
  Generate computer source codes
  Create/maintain Web based server
Using existing Web based tools efficiently
Ethical issues
  Bioethics & Biosafety: Always ensure that no bioinformatics tool harmful to the environment & society is developed or used by you
  Cloning of humans, Terminator technology, GM food, etc.
References (latest)
Arthur M. Lesk, Introduction to Bioinformatics, Oxford University Press, New Delhi (2003).
D. Higgins and W. Taylor (Eds), Bioinformatics: Sequence, Structure and Databanks, Oxford University Press, New Delhi (2000).
R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge Univ. Press, Cambridge, UK (1998).
A. Baxevanis and B. F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (Third Edition), Wiley-Interscience, Hoboken, NJ (2005).
G. Gibson and S. V. Muse, A Primer of Genome Science, Sinauer Associates, USA (2002).
N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, Ane Books, New Delhi (2005).
Michael S. Waterman, Introduction to Computational Biology, Chapman & Hall (1995).
J. A. Glasel and M. P. Deutscher (Eds), Introduction to Biophysical Methods for Protein and Nucleic Acid Research, Academic Press, New York (1995).
D. S. T. Nicholl, An Introduction to Genetic Engineering (Second Edition), Cambridge Univ. Press, UK (2002).