2019.10.01. Bioinformatics 2 Bioinformati cs 2 − 1 st lecture Prof. László Poppe BME Department of Organic Chemistry and Technology Bioinformatics – proteomics Lecture and computer room practice
2019.10.01. Bioinformatics 2
Bioinformatics 2 − 1st lecture
Prof. László Poppe
BME Department of Organic Chemistry and Technology
Bioinformatics – proteomics
Lecture and computer room practice
2 Bioinformatics 22019.10.01.
Bioinformatics – What is it ?
Bioinformatics:
In broad sense: storage, analysis and explanation of biological
information by the aid of computational methods.
In more restricted sense: handling, analyzing and explaining
biological sequence and structure data.
3 Bioinformatics 22019.10.01.
Bioinformatics – What is it ?
Bioinformatics
More detailed definition - Oxford English Dictionary:
(Molecular) bio – informatics: bioinformatics is conceptualising
biology in terms of molecules (in the sense of physical chemistry) and
applying "informatics techniques" (derived from disciplines such as
applied maths, computer science and statistics) to understand and
organise the information associated with these molecules, on a large
scale. In short, bioinformatics is a management information system
for molecular biology and has many practical applications.
4 Bioinformatics 22019.10.01.
Bioinformatics – What is it ?
Bioinformatics – Computational biology
More detailed definitions - definition Committee, National Institute of Mental Health
Bioinformatics: Research, development, or application ofcomputational tools and approaches for expanding the use ofbiological, medical, behavioral or health data,including those toacquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling andcomputational simulation techniques to the study of biological,behavioral, and social systems.
5 Bioinformatics 22019.10.01.
Definition and history of bioinfomatics
Sequence analysis – nucleotide and protein sequences and their relationships, pairwise and multiple alingment, phylogenetic analysis
Prediction of the secondary structure from sequence
Domain analysis, function prediction from sequence
Relationship between genetics and structure data and molecular function or role in metabolism
Aspects of protein structure data. Role of various interactions in maintaining structure. Structural classes of proteins.
Methods of protein structure determination (protein crystallography, NMR)
Methods for modeling 3D structure of proteins. Basic methods of homology modeling: template based and ab initio methods
Interaction of the proteins with small molecules and biological macromolecules, dimnamics of proteins
Databases related to proteomics (nucleotide and protein sequence databases, structure databases, function related databases)
Bioinformatics programs and program systems for proteomics applications (alone standing and Web-based applications)
Practical applications of bioinformatics
Bioinformatics 2 – Summary of themes
6 Bioinformatics 22019.10.01.
Bioinformatics – Early beginnings
1951 – Pauling & Corey: structure of alfa-helix and beta-sheet
1953 – Watson & Crick: DNA double helices structure (based on Franklin & Wilkins X-ray structure).
1954 – Perutz's group: heavy atom method for solving the phase problem in protein crystallography
1955 – F. Sanger: the first protein structure (bovine insulin).
1958 – J. Kilby (Texas Instr.): First integrated circuit / ARPA (Advanced Research Projects Agency, USA) is established
1962 – Pauling’s theory on molecular evolution
1965 – Margaret Dayhoff‘: Atlas of Protein Sequences
1969 – ARPANET: connection of computers of UCSB (Stanford) and UCLA (University of Utah)
1970 – Needleman-Wunsch algorithm: sequence comparison.
1971 – Ray Tomlinson (BBN): e-mail
1972 – Paul Berg’s group: the first recombinant DNA molecule
1973 – Brookhaven Protein DataBank (PDB) announcement / Robert Metcalfe (Harvard University) -Ethernet.
1974 – Vint Cerf & Robert Khan: the "internet" and Transmission Control Protocol (TCP).
1975 – Microsoft Co. (Bill Gates & Paul Allen) / 2D elektrophoresis (P. H. O'Farrell)
1976 – Unix-To-Unix Copy Protocol (UUCP) - Bell Labs / Southern Blot technique (E. M. Southern).
1977 – Brookhaven PDB full description / DNA sequencing (A. Maxam, W. Gilbert & F. Sanger) andsoftware (Staden)
1978 – The Usenet connection (T. Truscott, J. Ellis & S. Bellovin).
7 Bioinformatics 22019.10.01.
1980 – Sequence of a full gene (FX174 - 5386 base pairs / 9 proteins) / Multidimensional NMR for protein structure determination (Kumar, A.; Ernst, R.R.; Wüthrich, K.).
1981 – Smith-Waterman algorithm (sequence alignment) / IBM - Personal Computer (PC) / Sequence motif concept (Doolittle)
1982 – Genetics Computer Group (GCG) - Wisconsin Suite molecular biology tools / GenBank Release 3 / Lambda phage genome sequence
1983 – Compact Disc (CD) / Sequence database searching algorithm (Wilbur-Lipman) / DNA clone(cosmid) libraries / PCR (Polymerase Chain Reaction): DNA analysis is enabled
1984 – Jon Postel: Domain Name System (DNS) / Macintosh (Apple Computer)
1985 – FASTP/FASTN algorithm / Human Genome Initiative idea is borning
1986 – Human Genome Initiative is established / "Genomics" term / SWISS-PROT database is established
1987 – Yeast artifical chromosome (YAC) / E. coli mapping / PERL (Practical Extraction Report Language)/ NIH NIGMS – beginnings of genome projects
1988 – National Center for Biotechnology Information (NCBI) is established / EMBnet network for database distribution / Human Genome Intiative starts / FASTA algorithm (Pearson and Lupman)
1989 – Oxford Molceular Group, Ltd.(OMG) is established (Anaconds, Asp, Cameleon – molecular modeling, drug design, protein design).
Bioinformatics – Beginnings
8 Bioinformatics 22019.10.01.
Bioinformatics – Near past
1990 – BLAST program (Altschul, et.al.) / (M. Levitt, C. Lee): Look & SegMod (molecular modeling and protein design) / HGP plan - USA Congress (start of a 15 year long project)
1991 – Genf (CERN) - World Wide Web / Expressed sequence tags (ESTs) / Human chromosome map database (GDB) is established
1992 – Humane genome – Low resolution genetic map
1993 – IMAGE consortium – co-ordinated cDNA gene sequencing and mapping / LBNL - novel transposon-aided chromosome-seqencing / GRAIL Internet based sequence-interpretation service (ORNL)
1994 – Netscape Co (Navigator) / PRINTS database: protein motifs (Attwood & Beck) / EMBL European Bioinformatics Institute / Second-generation DNA clone libraries of all human chromosomes
1995 – The first bacterial (Haemophilus influenzea) genome (1.8) sequence ( Fleischmann et al) / Sequencing the smallest bacterium (Mycoplasma genitalium) – the least number of genes for independent life
1996 – The genome of baker’s yeast (Sacharomyces cerevisiae 12.1 Mb) / Prosite database (Bairoch, et.al) / Affymetrix – the first commercial DNA chip
1997 – The E. coli (4.7 Mbp) genome / National Human Genome Research Institute (NHGRI)
1998 – Swiss Institute of Bioinformatics is established
1999 – The first full sequence of a human chromosome
9 Bioinformatics 22019.10.01.
Bioinformatics – Recent past
2000 – Bacterial (Pseudomonas aeruginosa, 6.3 Mbp), plant (A. thaliana, 100 Mb) and insect (Drosophila melanogaster, 180 Mbp) genome sequences / További humán kromoszómák szekvenálása
2001 – Human genome (3000 Mbp) is published / Full sequencing of several human chromosomesaccording to the high level standards of the Human Genome Project
2002 – Structural Bioinformatics and GeneFormatics are unified / Mouse Genome Sequencing Consortium –shot-gun sequence of the mouse genome
2003 – The Human Genome Project is finished
2004 – Rat Genome Sequencing Consortium – genome of brown rat (Rattus norvegicus)
2008 – Start of the 1000 Genomes Project – The aim of the project is to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. This marks thestart of „ Personalised Medicine”
2013 – The Nobel Prize in Chemistry 2013 (M. Karplus, M. Levitt, A. Warshel) „for the development of multiscale models for complex chemical systems”
2016 - 1000 Genomes Project: more than 30,000 x coverage of the human genome. There is currently no bioinformatics tool to look for in the full data mass.
.
10 Bioinformatics 22019.10.01.
Neutral-Apolar 3-letter 1-letter
Glycine Gly G
L-Alanine Ala A
L-Valine Val V
L-Izoleucine Ile I
L-Leucine Leu L
L-Phenylalanine Phe F
L-Proline Pro P
L-Metionine Met M
Neutral-Polar
L-Serine Ser S
L-Threonine Thr T
L-Tyrosine Tyr Y
L-Triptophane Trp W
L-Asparagine Asn N
L-Glutamine Gln Q
L-Cysteine Cys C
Acidic
L-Aspartic acid Asp D
L-Glutamic acid Glu E
Basic
L-Lysine Lys K
L-Arginine Arg R
L-Histidine His H
11 Bioinformatics 22019.10.01.
ProteinsStructure - Folding
Protein structures – organization levels
Primary structure („folding-free” state - sequence)
Secondary structure (stable local conformations: -helices, -sheets)
Terctiary structure (global chain conformation: domains, subunits)
Quaternary structure (multiple chain conformations)
Intra- and intermolcular disulfide bonds
12 Bioinformatics 22019.10.01.DNA aminoacid sequence ”folding”
Primary structureMNKKEWEEKYVKPLLERSPERKKEFKTSSGIVVDRLYTPEDVEIDYENKL
GYPGVYPFTRGVYPTMYRGRLWTMRQYAGFGTAEETNRRYRYLLEQGQTG
LSVAFDLPTQIGYDSDHPMALGEVGKVGVAIDTIEDMEILFNGIPLGKVS
TSMTINSTCAQILSMYVAVAEKQGVERANLRGTVQNDMLKEYIARGTYIF
PPEPSLRLATDIIMFCAKEMPKWNSISISGYHMEEAGATPVQEVAFTLAD
GITYVEKVIERGMDVDSFAPRLSFFFAAGNNFLEEIAKFRAARRLWARIM
KERFNAKNPRSMMLRFHVQTAGCTLTAQQPENNIVRVALQALAAVLGGCQ
SLHTNSFDEALCLPTEKAVRIALRTQQIIAEESGVADVVDPLGGSYYIEW
LTDRIEEEAMKYIEKIDEMGGMIKAIESGYVQREIQKSAYEKQKAIDEGE
ITVVGVNKYQIEEEIQIELLRVDKAVVEKQIRRLQEFRKNRDAKKVEEAL
RLRKAAEKEDENLMPYVLDAVKARATLGEMTDALRDVFGEFRAPEIF
(ie. the amino acid sequence)
ProteinsPrimary and secondary structure
Secondary structure
13 Bioinformatics 22019.10.01.
ProteinsTertiary and quaternary structure
Tertiary and quaternary structure active conformation
Enzymes (catalytic proteins):
positive catalysis (fit of the substrate to the catalytic parts of the active site)
negative catalysis (protection of the substrate, biological ”protecting group”)
Tertiary structure Quaternary structure
14 Bioinformatics 22019.10.01.
ProteinsThe active site of enzymes
The active site of carboxypeptidase A; (a) shematic representation of the active site; (b) the active site of the protein with Cbz-Gly-Phe substrate
(as it is assumed to occupy the active site).
15 Bioinformatics 22019.10.01.
Processes following protein expressionProtein folding and degradation
16 Bioinformatics 22019.10.01.
The genetic codeStructure of DNA
17 Bioinformatics 22019.10.01.
The genetic codeStructure of DNA
18 Bioinformatics 22019.10.01.
Gene expressionThe central dogma
19 Bioinformatics 22019.10.01.
Gene expressionTranscription
20 Bioinformatics 22019.10.01.
The standard genetic codeRedundancy in protein – DNA direction
21 Bioinformatics 22019.10.01.
Gene expressionThe transcription – translation processes
22 Bioinformatics 22019.10.01.
Gene expressionThe translation process
23 Bioinformatics 22019.10.01.
Gene expressionReading frames – Importance of start-stop codons
24 Bioinformatics 22019.10.01.
Gene multiplicationIn vitro multiplication of DNA by PCR
The PCR cycle:
1. - DNA melting (~90oC)
2. – Replication of the
complementary strand
(synthesis at ~70oC by a
thermostable polymerase)
Primers
(large excess)
and dNTP’s.
(deoxynucleotide
triphosphates)
Single strand DNA
Result:
exponential multiplication
25 Bioinformatics 22019.10.01.
Gene multiplicationCloning and multiplication of genes
Recombinant cell
Wild type host cell
Cleaved plasmid
Plasmid with the
desired gene
Desired gene
26 Bioinformatics 22019.10.01.
Aims of bioinformatics
Creation and maintenence of databases. Organization of the data
layout so that researchers can easily retrieve and extend the existing
information.
Development of methods and procedures for analyzing data. The data
are useless without analysis.
Application of the developed tools and methods for data analysis and
biological interpretation of the results.
27 Bioinformatics 22019.10.01.
Types of biomolecular information andbioinformatics methods
Source of data Size of data Bioinformatics topics
Crude DNA−sequences ~201 million sequences - 235 billion
ases (gene) [GenBank]
[+488 million sequences,
2.165 billion bases (WGS: whole
genom shotgun)] date: 09. 2017.
· Coding and non−coding regions
· Introns and exons
· Gene products predictions
· Forensic analysis
Protein sequences 89.9 million sequences (UniProtKB)
(~ 300 amino acids, in each)
(0.56 million Swiss-Prot + 89.4
million TrEMBL) date: 09. 2017.
· Sequence alignment
· Multiple alignments
· Conserved sequence motifs
Macromolecular structures
(DNA, RNA, protein)
~133 thousand structures
(~ 1000 atomic coordinates in each)
(RCSB PDB) date: 09. 2017.
· 3D structure alignments
· Protein geometrics
· Surface, volume, shape
calculation
· Intermolecular interactions
· Molecular simulations (energy,
molecular motions, docking)
28 Bioinformatics 22019.10.01.
Types of biomolecular information andbioinformatics methods
Source of data Size of data Bioinformatics topics
Genomes ~ 25 300 full genomes (ca 1,6×106 -
3×109 bases in each)
(NCBI Genome)
[ ~ 193 000 published raw genomes]
(NCBI WGS) date: 09. 2017.
· Repetitions
· Structure - gene relationships
· Phylogenetic analysis
· Genome sized projects
(eg. metabolic pathways)
· Disease - gene relationships
Gene expression data ~ 88 000 gene expression datasets
(NCBI GEO) date: 09. 2017.
(one of the largest: ca. 20 time points
for the ca. 6000 genes of yeast
· Expression pattern correlations
· Relationship of expression with
structural and biochemical data
Other:
Literature
Metabolic pathways
~27 million articles (Medline)
~45 milllion references (CAplus)
518 metabolic maps
~ with 533 000 references (KEGG)
date: 09. 2017.
· Electronic libraries / automatic
literature surveys
· Knowledge bases
· Reaction pathway simulations
29 Bioinformatics 22019.10.01.
Data Grouping based on similarities
repetitive sequences in the genomes
genes can be grouped according to function (eg. enzyme activity or metabolic pathways)
different proteins often have similar sequences
the number of basic structures of proteins is limited ( according to estimations: maximum 10,
000)
Based on real biological similarities, much of the information can be sorted into groups
Biological systems consist of a finite number of component parts
30 Bioinformatics 22019.10.01.
Pattern recognition and prediction
The two basic operations of the bioinformatics are pattern recognition and prediction
Pattern recognition: finding similarities
Search for a conserved feature which is characteristic to a certain function / structure
base on proteins with already known similar function / structure
Use of the recognised feature to identify function / structure of novel sequences
Condition: the novel sequence should belong to a protein of a cetain degree of alredy
known similarity
Prediction:
Prediction of function or structure: based on similarity or by ab initio methods
The basic wish of bioinformatics – structure predicted from sequence
31 Bioinformatics 22019.10.01.
MNKKEWEEKYVKPLLERSPERKKEFKTSSGIVVDRLYTPEDVEIDYENKL
GYPGVYPFTRGVYPTMYRGRLWTMRQYAGFGTAEETNRRYRYLLEQGQTG
LSVAFDLPTQIGYDSDHPMALGEVGKVGVAIDTIEDMEILFNGIPLGKVS
TSMTINSTCAQILSMYVAVAEKQGVERANLRGTVQNDMLKEYIARGTYIF
PPEPSLRLATDIIMFCAKEMPKWNSISISGYHMEEAGATPVQEVAFTLAD
GITYVEKVIERGMDVDSFAPRLSFFFAAGNNFLEEIAKFRAARRLWARIM
KERFNAKNPRSMMLRFHVQTAGCTLTAQQPENNIVRVALQALAAVLGGCQ
SLHTNSFDEALCLPTEKAVRIALRTQQIIAEESGVADVVDPLGGSYYIEW
LTDRIEEEAMKYIEKIDEMGGMIKAIESGYVQREIQKSAYEKQKAIDEGE
ITVVGVNKYQIEEEIQIELLRVDKAVVEKQIRRLQEFRKNRDAKKVEEAL
RLRKAAEKEDENLMPYVLDAVKARATLGEMTDALRDVFGEFRAPEIF
Problems of structure prediction from sequence
Folding: the amino acid sequence determines
the spatial structure, but still do not understand
how
Only the secondary structure can be predicted
by limited reliability
It remains so in the near future
32 Bioinformatics 22019.10.01.
Differences in 2D – 3D data
The gap between the known protein sequences and structures of known proteins increases in
time
Large information deficit – bioinformatics
may play an important role
Ca. 2000 more sequences as 3D structures
Kb. 2107 known sequences but only ca.
100,000 unique 3D structures
The gap is continuously increasing (genome
programs) [almost 1 novel sequence / second
but only about 10 novel structures / day]
33 Bioinformatics 22019.10.01.
Genome projects
Genome sequencing „BAC to BAC” sequencing
„whole genom shotgun” seguencing
Completed genomes (ca. 25 000 full genomes); running genome projects (~85 000):
• Yeast
• Caenorhabditis elegans (worm)
• Drosophila melanogaster (common fruit fly)
• Arabidopsis thaliana (mouse-ear cress)
• Human
Completed genome projects:
• ca. 14 600 prokaryotes
• ca. 2 500 eukaryotes
• ca. 200 archaea
• ca. 7 400 of viruses and phages
GOLD: https://gold.jgi.doe.gov/
34 Bioinformatics 22019.10.01.
Sequence analysis
The most important bioinformatics method: search for new sequences belonging to
proteins of unknown structure / function which are similar to sequences of proteins with
known structure / function.
Sequence alignment
Sequence identity: percentage of the same amino acid pairs in the aligned sequences
With decrease of sequence identity the portability of function / structure decreases
35 Bioinformatics 22019.10.01.
Sequence alignment
36 Bioinformatics 22019.10.01.
Importance of the degree of sequence matching
Degree of sequence mathing:
<30 %: inadequate models
30-60 %: adequate models, with uncertain
regions
>60 %: high quality models, with less than
1Å average deviation of C from
experimental structures
Fibroblast growth factor model (based on
rat GF ceratinocita structure, 40% identity)
compared to the experimental structure (X-
ray data).
M. J. Foster: Micron 2002, 33, 365-384.
37 Bioinformatics 22019.10.01.
Two types of homology
Orthology:
Orthologs are genes that are related by vertical descent from a common ancestor and
encode proteins with the same function in different species. They serve the same
function in the two species.
Example: carboxyl esterases (ie. their genes) in humans and pigs.
Paralogy:
Paralogs are homologous genes that have evolved by duplication and code for protein
with similar, but not identical functions.
Example: the enzymes of the histidine biosynthesis (their genes) in humans (the are very
similar in structure, but catalyze different reactions).
Types of homology
38 Bioinformatics 22019.10.01.
• Structure: conserved rregions vs. variable regions
• Alignment methods
• Structure refinement – (over-refinement)
• Evaluation of the structure quality (PROCHECK / WhatIf ...)
1CPC 2FAL
The structural fit (threading) problem
A high degree of structural similarity
can be observed at low sequence
matching
Comparison:
structures of the C-fikociamine
(1CPC) and mioglobin of sea hare
(2FAL)
M. J. Foster: Micron 2002, 33, 365-384.
39 Bioinformatics 22019.10.01.
EMBL (EMBL-EBI, etc.)
NCBI (Medline, Genbank, etc.)
Expasy (UniProtKB/SwissProt, etc)
Bioinformatics Websites
40 Bioinformatics 22019.10.01.
Bioinformatics databaseshttp://www.oxfordjournals.org/our_journals/nar/database/c/
2017 NAR Database Summary Paper Category List
Nucleotide Sequence Databases
RNA sequence databases
Protein sequence databases
Structure Databases
Genomics Databases (non-vertebrate)
Metabolic and Signaling Pathways
Human and other Vertebrate Genomes
Human Genes and Diseases
Microarray Data and other Gene Expression Databases
Proteomics Resources
Other Molecular Biology Databases
Organelle databases
Plant databases
Immunological databases
Cell biology