JM - http://folding.chmcc.org 1
Introduction to Bioinformatics: Lecture II
From Molecular Processes to String Matching
Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
Outline of the lecture
Sequence approximation in computational molecular biology: the premise and the limits
Getting ready for analysis of exact string matching and sequence alignment algorithms: some definitions and interplay with biology
The notion of string/sequence similarity
Substitution matrices for sequence alignment
Before we start: literature watch
A draft of the rat genome has been published! (Rat Genome Sequencing Project Consortium, Nature 428)
What are the first conclusions from the comparison with other mammalian genomes?
What approaches and tools have been used to perform this comparative analysis?
Genome sizes: human (H) 2.9 Gb; mouse (M) 2.5 Gb; rat (R) 2.75 Gb
Rat genome: unique, 0.7 Gb; common with both H and M, 1.1 Gb
Biological Polymers and Central Dogma
Bio-polymer (alphabet)    Process (algorithm)
DNA (A, T, G, C)          replication, transcription
mRNA (U, A, C, G)         splicing, translation
Proteins (20 a.a.)        folding, interactions
Lipids, polysaccharides, membranes, signal transduction, environmental signals etc.
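The "process as algorithm" view above can be illustrated with a minimal Python sketch of transcription and translation treated as string operations. It assumes the standard genetic code and a coding-strand DNA input, and it ignores splicing and every other biological detail; the function names `transcribe` and `translate` are illustrative choices, not part of the lecture.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each RNA codon (U instead of T) to a one-letter amino acid; '*' marks a stop codon.
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Return the mRNA string for a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGTGCTGTCTTAA"))` reads the codons AUG, GUG, CUG, UCU and stops at UAA.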
Complexity of “DNA computing”
http://www.genecrc.org/site/lc/lc2d.htm
Get the relevant sequences to compare them: conservation and differences
Problem                                   Algorithms                         Programs
Sequencing (fragment assembly problem)    The Shortest Superstring Problem   Phrap (Green, 1994)
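To make the link between fragment assembly and the Shortest Superstring Problem concrete, here is a sketch of the classic greedy heuristic: repeatedly merge the two fragments with the largest suffix/prefix overlap. It is an illustration only. It assumes error-free fragments from a single strand, it is an approximation rather than an exact solver, and it is not a description of how Phrap works.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy approximation to the shortest common superstring of the fragments."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = sorted(
        f for f in set(fragments)
        if not any(f != g and f in g for g in fragments)
    )
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, writing the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""
```

With the toy reads `["ATGGTG", "GTGCTG", "CTGTCT"]` the heuristic reconstructs a 12-letter string that contains every read as a substring.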
Exercise: find the sequence of 1mba in the PDB and “blast” it against the nr database at NCBI
An example: sperm whale vs. human myoglobin:
Limits of the sequence approximation
• All the information, and the various fingerprints of information processing at the molecular level (via interactions etc.), including adjustment to physiologically relevant external signals, seems to be encoded in nucleotide and protein sequences
• However, there are limits to this simple approximation: actually understanding molecular processes requires structure, chemistry, kinetics and thermodynamics
• On the other hand, a deeper understanding of the nature of biological objects and processes greatly facilitates sequence-based studies by suggesting critical features, similarity measures etc.
Strings, sequences and string operations
String vs. sequence duality will be important for exact vs. inexact string matching
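The string vs. sequence distinction can be sketched in a few lines of Python, assuming the usual computer-science definitions: a substring is a contiguous block of characters, whereas a subsequence keeps the order of the characters but may skip positions. The slide's own definitions are not reproduced in this transcript, so the function names and the naive matcher below are illustrative assumptions.

```python
def naive_match(pattern, text):
    """Exact matching: return all 0-based start positions of pattern in text."""
    n, m = len(pattern), len(text)
    return [i for i in range(m - n + 1) if text[i:i + n] == pattern]


def is_substring(pattern, text):
    """True if pattern occurs in text as a contiguous block."""
    return pattern in text


def is_subsequence(pattern, text):
    """True if the characters of pattern occur in text in order, gaps allowed."""
    remaining = iter(text)
    # 'ch in remaining' consumes the iterator up to and including the first match.
    return all(ch in remaining for ch in pattern)
```

For instance, `"GTT"` is a subsequence of `"GATATATA"` but not a substring of it, which is the kind of gap that inexact matching and alignment have to accommodate.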
Beyond the letters: how to find better models (e.g. GC content for gene finding)
http://www.imb-jena.de/IMAGE_BPDIR.html
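As a small illustration of a model that goes beyond individual letters, the sketch below computes GC content over a sliding window. The window and step sizes are arbitrary illustrative parameters, and real gene finders combine many more signals than this single statistic.

```python
def gc_content(seq):
    """Fraction of G and C letters in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window, step=1):
    """GC content of each full-length window, moving along seq by step."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

Calling `sliding_gc("GGCCAATT", 4, 2)` scans the windows GGCC, CCAA and AATT, so the profile drops from GC-rich to AT-rich along the sequence.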
Another example: active sites, functional motifs and multiple alignment
Distance and similarity measures
Edit distance vs. substitution score
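The contrast can be sketched as follows, assuming unit costs for insertion, deletion and substitution in the edit distance (the Levenshtein distance), and a hypothetical +1/-1 match/mismatch score for an ungapped comparison of two equal-length strings. A distance is minimized, whereas a similarity score is maximized; the numerical values here are illustrative choices, not taken from the lecture.

```python
def edit_distance(s, t):
    """Minimum number of unit-cost insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # substitute, free if the letters match
            ))
        prev = curr
    return prev[-1]


def ungapped_score(s, t, match=1, mismatch=-1):
    """Similarity score of two equal-length strings compared position by position."""
    if len(s) != len(t):
        raise ValueError("ungapped comparison needs strings of equal length")
    return sum(match if a == b else mismatch for a, b in zip(s, t))
```

Replacing the fixed `match` and `mismatch` values with a lookup in a substitution matrix gives the kind of score used for protein alignment.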
Substitution matrices for protein sequence alignment: learning and extrapolating from examples
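PAM matrices (Dayhoff et al.): extrapolating to longer evolutionary times from data for very similar proteins with more than 85% sequence identity (short evolutionary time),
$$ s(a,b \mid t) = \log \frac{P(b \mid a, t)}{q_b}, \qquad \text{e.g.} \quad P(b \mid a, 2) = \sum_{c} P(b \mid c, 1)\, P(c \mid a, 1), $$
where $q_b$ is the background frequency of residue $b$.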
BLOSUM matrices (Henikoff & Henikoff): multiple alignments of more distantly related proteins (e.g. BLOSUM50 with 50% sequence identity),
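The PAM extrapolation step above amounts to taking powers of a one-step substitution matrix and then forming log-odds scores. The sketch below shows this on a hypothetical three-letter alphabet with made-up probabilities and a uniform background; none of the numbers come from Dayhoff's data, and the convention `P[a][b] = P(b | a, 1)` (rows sum to 1) is an assumption of the example.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(P1, t):
    """P(b | a, t) as the t-th power of the one-step matrix P1."""
    n = len(P1)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P1)
    return result


def log_odds(P_t, q):
    """Score matrix s(a, b | t) = log(P(b | a, t) / q_b)."""
    n = len(q)
    return [[math.log(P_t[a][b] / q[b]) for b in range(n)] for a in range(n)]


# Hypothetical toy model: three residue types, 1% chance of each substitution per step.
P1 = [[0.98, 0.01, 0.01],
      [0.01, 0.98, 0.01],
      [0.01, 0.01, 0.98]]
Q = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform background frequencies

P2 = transition_matrix(P1, 2)
S2 = log_odds(P2, Q)
```

In this toy model identical pairs score positively and substitutions negatively, and the gap between the two narrows as `t` grows, which matches the intuition that scores become less stringent for more distant sequences.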