Introduction Bioinformatics (Ba Biochemistry and Biotechnology)
Kathleen Marchal 1
CHAPTER 1: INTRODUCTION TO BIOINFORMATICS
Table of contents: Introduction to bioinformatics
Chapter 1: Introduction to bioinformatics
Foreword: bioinformatics
Bioinformatics, a research domain at the crossroads of different disciplines
Driving force for bioinformatics
  The high-throughput-ization of molecular biology
  The sequencing revolution
Different subfields in bioinformatics research
Structural genomics
  Assembly
  Structural annotation
  Biological application: genome sequencing
Comparative genomics
  Overview
  Biological application 1: evolutionary biology
  Biological application 2: studying genome evolution
  Biological application 3: metagenomics (G. Venter)
Functional genomics & systems biology
  Systems biology
  Biological application: synthetic biology from an engineering point of view: rational design
Conclusion
Updated 23/01/2015
FOREWORD: BIOINFORMATICS
Bioinformatics, although a relatively recent term, has existed for more than 400 years. After all, Galileo wrote
“the book of nature is written in the language of mathematics!”.
The use of mathematical models to explain biological phenomena and to analyze data is
certainly not new. Until recently, however, it was common practice only in certain subdomains of biology (e.g.
population genetics, phylogeny, molecular modeling, etc.).
Important technological innovations in molecular biology in the early 1990s changed this
profoundly. The application of high-throughput technologies (genomics, transcriptomics,
proteomics, metabolomics) makes it possible to map the DNA sequence of entire genomes in a very short
time, to analyze the expression of thousands of genes or proteins in an organism, to evaluate the nature and
concentration of all metabolites, and to identify the interactions between these different genetic
entities. This has led to an unparalleled data explosion. An Excel spreadsheet no longer suffices to analyze
these data; an interdisciplinary approach is required.
This data explosion has also drastically broadened “biological” thinking (also called the
new biology). The final objective of molecular biology, “gaining insight into
the functioning and evolution of organisms”, has remained the same. The way to reach this goal has changed.
Until a few years ago, functional molecular biological research studied genes, proteins and other
molecules one by one as isolated entities. The use of the new technologies
now situates the function of a gene in a global context, namely as part of a complex regulatory
network. From this new perspective the organism is regarded as a system that interacts with
its environment. Its behavior is determined by the complex dynamic interactions between
genes/proteins/metabolites at the level of the regulatory network. Moreover, because data
from different model organisms are available, the cellular mechanisms can be compared between
organisms.
An organism represented as a system that interacts with its environment. Through the action of regulatory networks, an organism continuously
adapts to changing environmental signals. These adaptations result in an altered behavior or phenotype. The regulatory networks
can be regarded as the biological signal-processing systems.
Traditional studies of biological systems were largely descriptive. The systems approach to
biology, however, implies a thorough quantitative and integrated analysis of complex data.
Under the influence of this new trend the term "bioinformatics" emerged (first used around
1993) and high-throughput functional molecular biology became part of “systems biology”.
Like molecular biology, systems biology and bioinformatics are research domains with many
subdisciplines (structural, functional, comparative bioinformatics).
Bioinformatics questions arise from biology. The computational sciences provide an
arsenal of standard algorithms and principles. The two must be combined in a meaningful way,
taking into account both the specific properties of the algorithm used and those
of the biological problem. Reconciling algorithms from the exact sciences with experimental
data from stochastic biological systems is the main challenge here. For that
reason, solving a biological problem computationally can easily take a few years of research,
but it leads to valuable results that in some cases surpass traditional biological
research.
The future of bioinformatics
Bioinformatics is therefore not a “hype”. As molecular biological technology evolves, it will continue
to gain importance. The most successful molecular biology laboratories will undoubtedly be
those that steer their “wet lab” research by means of the predictions of advanced computational research.
The future of both molecular biology and bioinformatics lies in developing research
in which the boundary between the “wet lab” and the computational aspect fades.
Aim of the bioinformatics course
The aim of the course is twofold:
The first and probably most important objective is to convince you that bioinformatics
is an essential part of your curriculum. With a number of examples and achievements from the
field I hope to convince you that ‘bioinformatics’ and ‘systems biology’ will change our lives and
our thinking. The molecular biologist of the 21st century will not only have a well-developed
knowledge of biology, but must also be familiar with important principles from
mathematics, statistics and information technology. Such an integration of biological insight with analytical
and problem-solving thinking is characteristic of bioinformatics.
A second aim of the course is to familiarize you with the use of bioinformatics tools.
The bioinformatics domain, however, is very broad and rapidly expanding. It is therefore impossible to cover all tools
and subfields. We will discuss a number of important and widely used examples.
It is important that you realize that meaningful bioinformatics is more than just applying
tools: it is essential to also understand the underlying mathematical principles of these tools
and, at the same time, to gain insight into the data generation protocols and the biological complexity. This
implies that bioinformatics is more than an aid to molecular biological research:
it is a full-fledged research domain in its own right.
BIOINFORMATICS, A RESEARCH DOMAIN AT THE CROSSROADS OF DIFFERENT DISCIPLINES
Bioinformatics is an interdisciplinary research area at the interface between the biological and computational
sciences. It is the scientific field that deals with the computational management of all kinds of molecular biological
information. Most bioinformatics work deals either with analyzing biological data
or with organizing biological information. The ultimate goal of the field is to discover new
biological insights and to create a global perspective from which unifying principles in biology can be
discerned. Bioinformatics is the application of computer science to biological science, but also the creation
of computer science for biological science. Bioinformatics emerged as an important discipline shortly after
the development of DNA sequencing technologies in the 1970s, although the word “bioinformatics” did not
start appearing in the biomedical literature until around 1990.
DRIVING FORCE FOR BIOINFORMATICS:
THE HIGH-THROUGHPUT-IZATION OF MOLECULAR BIOLOGY
Traditional genetics and molecular biology were directed toward understanding the role of a particular
gene or protein in a molecular biological process. A gene is sequenced to predict its function or to manipulate
its activity or expression. Traditional molecular biology thus focused on single genes. With the advent of novel
molecular biological techniques such as genome-scale sequencing, large-scale expression analysis (gene and
protein expression: microarrays, 2D electrophoresis, mass spectrometry), and large-scale identification of
protein-protein interactions (yeast two-hybrid, protein chips) or protein-DNA interactions (chromatin
immunoprecipitation), the scale of molecular biology has changed. One no longer focuses on a single gene;
instead many genes or proteins are analyzed simultaneously, i.e. at high-throughput level (transcriptomics,
translatomics, interactomics, metabolomics). This novel approach offers an advantage: one can study the
function or the expression of a gene in the global context of the cell. A gene does not act on its own; it
is always embedded in a larger network (systems biology). These holistic approaches allow a better
understanding of fundamental molecular biological processes.
On the other hand, high-throughput approaches pose several novel challenges to molecular biology. The
analysis of such large-scale data is no longer trivial: simple spreadsheet tools such as Excel are no longer
sufficient, and more advanced data mining procedures become necessary. Another urgent problem is how
to store and organize all the information.
There is, in fact, an inseparable relationship between the experimental and the computational aspects.
On the one hand, data resulting from high-throughput experimentation require intensive
computational interpretation and evaluation.
On the other hand, computational methods produce predictions that remain uncertain and should be
reviewed and confirmed through experiments.
THE SEQUENCING REVOLUTION
The evolution of sequencing technologies provides the most prominent example of a technological revolution of
the 21st century.
1953: James Watson and Francis Crick publish their classic paper describing the double-helical structure
of DNA. In 1962, Watson and Crick, together with Maurice Wilkins, receive the Nobel Prize for their discovery.
1955: Frederick Sanger (a British biochemist) determines, together with colleagues, the complete amino
acid sequence of bovine insulin. He concludes that each protein has a unique sequence. This
achievement earns him his first Nobel Prize in Chemistry in 1958.
1966: Over the course of several years, Marshall Nirenberg, Har Gobind Khorana, Severo Ochoa and their
colleagues elucidate the genetic code.
1967: Walter Fitch and Emanuel Margoliash publish the paper ‘Construction of phylogenetic trees’. This
marks the start of phylogenetics based on sequence data.
1972: The first complete nucleotide sequence of a gene is determined by a group at Ghent University. It is
the sequence of the gene coding for the bacteriophage MS2 coat protein. The encoded protein is 129 amino acids long.
1976: The genome of the enterobacterial phage MS2 is the first genome to be completely sequenced. This is
accomplished by Walter Fiers and his team, building on their 1972 milestone of the first gene to
be completely sequenced.
1977: Frederick Sanger and colleagues develop what is later called the “Sanger sequencing” method.
It becomes the most widely used sequencing method for approximately 25 years.
2001: Special issues of Science (Feb. 16, 2001) and Nature (Feb. 15, 2001) publish the first drafts of the
human genome sequence. Former president Bill Clinton gives a press conference, together with Francis
Collins and Craig Venter. To everyone’s (?) surprise, the human genome contains only about 23,000
protein-coding genes.
2006: Introduction of next (second) generation sequencing technologies. These lead to a huge increase
in the amount of genome sequence data produced.
NGS technology
Sequencing the human genome: 13 years / 3 billion dollars
Watson's genome (454 technology): 20 people / 2 months. Total cost: 1,000,000 dollars
Expected: the 1000 dollar human genome
2008: The 1000 Genomes Project, an international research effort to establish a detailed catalogue
of human genetic variation, is launched. Scientists plan to sequence the genomes of at least one thousand
anonymous participants from a number of different ethnic groups within the following three years, using
newly developed sequencing technologies that are faster and less expensive.
The revolution goes on….
DIFFERENT SUBFIELDS IN BIOINFORMATICS RESEARCH
This close merging of molecular biology and computational biology has given rise to new research
fields and applications. Each of these research fields requires a specific kind of bioinformatics expertise.
Three main fields can be distinguished:
Structural genomics
o Input: raw sequence data
o Goal: annotation
o Bioinformatics Tools: genome assembly, gene/promoter/intron prediction
Comparative genomics
o Input: annotated genomes
o Goal: annotation, evolutionary genomics
o Bioinformatics Tools: sequence alignment, tree construction
Functional genomics.
o Input: experimental information
o Goal: function assignment, systems biology
o Bioinformatics Tools: microarray analysis, network reconstruction, data integration
Note that the field of molecular dynamics and protein modeling is not covered in this course.
For some purposes, different subfields have to be combined, i.e. the distinction is not always as clear-cut as it
seems.
Genome annotation is an example.
As genomes are collected they need to be annotated. This means that we have
to identify the location of the genes on the genome (structural annotation) and
to assign a function to each of the potential genes (functional annotation).
In structural annotation, the question to be answered is 'where are the genes?'. One needs to localize the
gene elements on the sequence (chromosome) and find the coding sequences, intergenic sequences,
exon/intron boundaries, promoters, 5'UTR and 3'UTR regions, and so on.
In functional annotation, one tries to get information on the function of genes. Often, it is possible to get
hints on the biochemical function of the gene products by finding homologs in protein databases or by
studying the biochemical characteristics of the gene (proteome, transcriptome analysis).
In the following, each of the bioinformatics subfields will be briefly described and illustrated with a biological
case study.
Fig. The three bioinformatics subfields: functional genomics / systems biology; structural genomics / annotation; comparative genomics / evolutionary biology.
STRUCTURAL GENOMICS
ASSEMBLY
Structural genomics is based on raw sequence data. The first step in structural genomics consists of
assembling raw sequence fragments into contigs or whole genomes. The complexity of the assembly process
depends on the sequencing technique used. Two major sequencing approaches are described below.
For more information see also
http://www.genomenewsnetwork.org/articles/06_00/sequence_primer.shtml
http://www.bio.davidson.edu/courses/genomics/method/shotgun.html
Sequencing and assembly
TOP DOWN SEQUENCING/ DE NOVO ASSEMBLY
The first genome sequencing approach, “top down”, is based on a known order of DNA fragments. To
sequence larger molecules such as human chromosomes:
1) Individual chromosomes are broken into random fragments of approximately 150 kb.
2) These fragments are then cloned into BACs (vectors).
3) In an intensive but largely automated laboratory procedure, the resulting library is screened for clusters
of fragments, called contigs, which have overlapping or common sequences. These contigs are then joined to
produce an integrated physical map of the genome based on the order of the BACs. Once the correct map
has been identified, unique overlapping clones are chosen for sequencing.
4) However, these clones are too large for direct sequencing. One procedure for sequencing them
is to subclone them further into smaller fragments of a size suitable for sequencing (500 bp).
5) From the DNA sequences of approximately 500 bp, genome sequences are assembled using the fragment
order on the physical map as a guide.
This method of creating physical maps of genomes and then using this map to guide the sequencing was used
by the Public Human genome Consortium to create a draft of the human genome. This carefully crafted, but
laborious procedure was designed to produce a sequence of the human genome that was based on a top
down approach, at each stage using the physical map to guide the placement of sequences (Lander, Nature
2001). The reasoning behind this strategy was the avoidance of sequence repeats that might otherwise
confound obtaining the correct genome sequence.
Fig. Top down sequencing: 1. Genome fragmentation; 2. BAC library; 3. Physical map; 4. Subclone library; 5. Genome assembly.
SHOT GUN SEQUENCING/ DE NOVO ASSEMBLY
A contrasting “bottom-up” method has been devised, in which the genome sequence is derived solely from overlaps in large
numbers of random sequences, without using a physical map as a guide. This alternative
method, called shotgun sequencing, attempts to assemble a linear map from subclone sequences without
knowing their order on the chromosome. Contigs are assembled computationally, based on the alignment of all possible sequence
pairs. This method is now routinely used to sequence microbial genomes and the cloned
fragments of larger clones (see also metagenomics).
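The idea of assembling contigs from pairwise overlaps can be sketched in a few lines of Python. This is a minimal illustration and not the algorithm of any particular assembler: it assumes error-free reads from a single strand, uses exact suffix-prefix matches instead of real alignments, and merges greedily. The function names, the `min_len` parameter and the toy reads are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_a, best_b = 0, None, None
        # Compare all ordered pairs, as in "alignment of all possible pairs".
        for a, b in permutations(contigs, 2):
            size = overlap(a, b, min_len)
            if size > best_len:
                best_len, best_a, best_b = size, a, b
        if best_len == 0:
            return contigs  # no overlaps left: what remains are the contigs
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])


# Toy reads (hypothetical) sampled from one short sequence.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "CTGAAAG"]
print(greedy_assemble(toy_reads))
```

Comparing all pairs scales quadratically with the number of reads, which is one reason why real assemblers rely on more elaborate data structures.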
The shotgun method was used by Celera Genomics to sequence the human genome (Venter, Science 2001;
http://www.jcvi.org/). There has since been controversy as to whether the draft of the Venter group
owed significantly to their use of the public data, or whether it was derived from the overlaps in a highly
redundant set of fragments by automatic computational methods alone (the shotgun method).
Fig. Shotgun sequencing: 1. Genome fragmentation; 2. Library; 3. Sequences; 4. Genome assembly.
Fuss about the public versus the commercial effort (Lander versus Venter):
http://www.dnalc.org/view/15326-Analysis-in-public-and-private-Human-Genome-Projects-Eric-Lander-.html
For large genomes a combination of top down and bottom up sequencing is used, as illustrated below.
NEXT GENERATION SEQUENCE TECHNOLOGY
With next generation sequencing technology, however, shotgun sequencing is customarily
used, relying on short reads (Illumina, 150 bp) and libraries of different insert sizes, e.g. a standard 500 bp
library plus a 3 kb library. With the latest PacBio technology very long reads become possible, making
shotgun sequencing even more feasible. The figure below shows why assembly can get difficult and
why a good design is important. Assembly is complicated by the presence of repeats, which make the
alignment between reads ambiguous. This problem is aggravated because reads can contain
errors (some technologies are more error prone than others). The use of reads generated from libraries of
different insert lengths, or the use of long reads, can resolve the ambiguities. In general, the quality of the
assembly depends on the coverage and the length of the reads. The longer the reads, the less
pronounced the problem of ambiguous alignments between reads. In addition, a high coverage is
needed to guarantee sufficient overlap between reads.
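Why longer reads reduce the ambiguity caused by repeats can be illustrated with a small Python sketch. It is a toy model under stated assumptions: reads are error-free, every start position is sampled, and a read counts as ambiguous when it matches the genome exactly at more than one position. The genome string and the function names are hypothetical.

```python
def count_placements(genome: str, read: str) -> int:
    """Count the positions at which `read` matches `genome` exactly."""
    return sum(
        genome.startswith(read, start)
        for start in range(len(genome) - len(read) + 1)
    )


def ambiguous_fraction(genome: str, read_len: int) -> float:
    """Fraction of error-free reads of this length that fit in several places."""
    starts = range(len(genome) - read_len + 1)
    ambiguous = sum(
        count_placements(genome, genome[s:s + read_len]) > 1 for s in starts
    )
    return ambiguous / len(starts)


# Toy genome (hypothetical) in which the 7 bp segment GACGTAC occurs twice.
toy_genome = "TTGACGTACCCATGGACGTACGGA"
for read_len in (4, 6, 8):
    print(read_len, round(ambiguous_fraction(toy_genome, read_len), 2))
```

Reads shorter than the repeat can fall entirely inside it and then fit at both copies; reads longer than the repeat always include unique flanking sequence, so their placement is unambiguous.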
From Nagarajan Nature reviews, 2013
RESEQUENCING/ DE NOVO ASSEMBLY
Sequencing for de novo assembly is in general expensive, as it requires multiple libraries, high coverage
and preferably long reads. Once a good reference genome has been assembled (by de novo assembly), this
reference genome can be used as a scaffold (replacing the physical map) to guide reference-based assembly.
Reference-based assembly is used for resequencing, e.g. when the genome of a human individual is
sequenced to search for genetic variants (mutations compared to the reference genome). Because
a reference genome is available, resequencing is cheaper (lower coverage is possible).
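The principle of reference-based assembly followed by variant detection can be sketched as follows. This is a deliberately naive illustration, not how production read mappers or variant callers work: reads are placed by brute-force Hamming distance (substitutions only, no insertions or deletions), and a variant is reported when the majority base at a position differs from the reference. All names, thresholds and sequences are hypothetical.

```python
from collections import Counter, defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference: str, read: str, max_mismatches: int = 1):
    """Return the start of the best placement, or None if none is close enough."""
    best_start, best_dist = None, max_mismatches + 1
    for start in range(len(reference) - len(read) + 1):
        dist = hamming(reference[start:start + len(read)], read)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start


def call_variants(reference: str, reads: list[str], min_support: int = 2) -> dict:
    """Pile up mapped reads and report positions whose majority base differs."""
    pileup = defaultdict(Counter)
    for read in reads:
        start = map_read(reference, read)
        if start is None:
            continue  # read could not be placed on the reference
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    variants = {}
    for pos, counts in sorted(pileup.items()):
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support >= min_support:
            variants[pos] = (reference[pos], base)
    return variants


# Toy data (hypothetical): the sample carries G->A at reference position 5.
reference = "ACGTTGCAAGGCTA"
reads = ["ACGTTACA", "GTTACAAG", "TACAAGGC", "CAAGGCTA"]
print(call_variants(reference, reads))
```

Requiring several reads to support a variant (`min_support`) is one simple way in which coverage helps to separate true variants from sequencing errors.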
STRUCTURAL ANNOTATION
Once a genome is assembled, structural elements such as genes, introns, exons, splice sites, promoters,
repeated elements etc. need to be located in it (the structural analysis). Distinct gene prediction
algorithms have been developed. Methods for ab initio gene prediction are based on supervised machine
learning techniques (1). The model (e.g. a hidden Markov model or a neural network) is trained on a set of
known genes (or promoters or introns) and subsequently used to predict the location of unknown genes (or
promoters or introns) in an organism. Features (properties of the genome) that are extracted from the
training set and thus help to recognize genes are, for instance, codon usage (which differs
between coding and non-coding regions) and splice site recognition sequences (when predicting splice sites).
Because codon usage and splice junctions differ between organisms, a model must be trained for
each novel genome.
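How a feature such as codon usage can be learned from known genes and then used for prediction is sketched below. This is a strongly simplified stand-in for the hidden Markov models and neural networks mentioned above: it only compares codon frequencies between a coding and a non-coding training set and scores a candidate region by a log-odds sum. The training sequences, function names and pseudocount are hypothetical, and sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import Counter


def codons(seq: str) -> list[str]:
    """Split `seq` into in-frame codons, ignoring a trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def train(sequences: list[str], pseudocount: float = 1.0) -> dict[str, float]:
    """Estimate smoothed codon frequencies over all 64 codons."""
    counts = Counter(c for seq in sequences for c in codons(seq))
    alphabet = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
    total = sum(counts[c] for c in alphabet) + pseudocount * len(alphabet)
    return {c: (counts[c] + pseudocount) / total for c in alphabet}


def log_odds(seq: str, coding: dict[str, float], background: dict[str, float]) -> float:
    """Positive scores mean the codon usage resembles the coding training set."""
    return sum(math.log(coding[c] / background[c]) for c in codons(seq))


# Toy training sets (hypothetical): known genes versus non-coding sequence.
coding_model = train(["ATGGCTGCTGAAGCTTAA", "ATGGAAGCTGCTGCTTAA"])
background_model = train(["TTTTATATTTAATATTTT", "TATATTTTTTAAATATAT"])

print(log_odds("GCTGAAGCT", coding_model, background_model))  # gene-like codons
print(log_odds("TTTATATTT", coding_model, background_model))  # background-like
```

Because the frequencies are estimated from the training set, the same code trained on genes of another organism would yield a different model, which is the reason a model has to be retrained for each novel genome.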
Once the complete genome is known, genome maps can be constructed that indicate the position of each
gene on the genome. Comparing the gene maps of different organisms allows identification of translocations or
other chromosomal rearrangements (important in cancer research).
Fig. genome map of the bacterium A. tumefaciens
BIOLOGICAL APPLICATION: GENOME SEQUENCING
GENOME SEQUENCES OF MODEL ORGANISMS
The first bacterial genome to be sequenced was that of Haemophilus influenzae (sequenced by the TIGR
institute (http://www.tigr.org) in 1995). The success of sequencing this genome in a relatively short time
heralded the sequencing of a large number of additional prokaryotic organisms. To date, the genomes of 96 of
these species have been sequenced, among which the model organisms E. coli and B. subtilis.
Later on, eukaryotic genomes were sequenced. In 2001, drafts of the human genome sequence were published by two
distinct research groups in parallel: a commercial group, Celera, and an academic sequencing consortium
(Sanger Center). Nowadays the sequences of several eukaryotic model organisms have been determined and
the number of sequences is steadily increasing.
Microbial genomes:
http://www.ncbi.nlm.nih.gov/genomes/static/micr.html
Genome resources at NCBI:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
Vertebrate genomes:
http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=7227
http://www.ncbi.nlm.nih.gov/genome/guide/human/
http://www.ncbi.nlm.nih.gov/genome/guide/mouse/index.html
http://www.ncbi.nlm.nih.gov/genome/guide/rat/index.html
http://www.ncbi.nlm.nih.gov/genome/guide/zebrafish/index.html
http://www.ensembl.org/
Yeast genomes:
http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?chr=scerevisiae.inf
Plant genomes:
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/Resources_1.html#arab
From Nature Reviews Genetics 4, 251-262 (2003);
GENOME SEQUENCES AT SPECIES LEVEL
In the early days most genomes were sequenced by the classical Sanger sequencing approach (see figure
below), but nowadays next-generation sequencing (NGS) has taken over. Developments in nanotechnology
in particular have given rise to novel technologies for sequencing and
synthesizing DNA. Next-generation sequencing can process millions of sequence
reads in parallel rather than 96 at a time. All NGS platforms share a common technological feature: massively
parallel sequencing of clonally amplified or single DNA molecules that are spatially separated in a flow cell
(for a recent review see Metzker, M.L. (2010) Nature Reviews Genetics 11:31-46) (see figure below). This
design is a paradigm shift from that of Sanger sequencing, which is based on the electrophoretic separation
of chain-termination products produced in individual sequencing reactions.
Fig. Sanger sequencing methodology: The dideoxynucleotide termination DNA sequencing technology
invented by Fred Sanger and colleagues in 1977, formed the basis for DNA sequencing from its inception
through 2004. Originally based on radioactive labeling, the method was automated by the use of fluorescent
labeling coupled with excitation and detection on dedicated instruments, with fragment separation by slab
gel and ultimately by capillary gel electrophoresis.
Overview of next generation sequencing technologies.
For instance Illumina:
http://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
The breakthroughs in these technologies are unprecedented and follow Moore's law. For
sequencing technology, it is to be expected that within a few years we will have the 1000 dollar genome,
which allows the genome of a human to be sequenced within a few hours for 1000 dollars.
(For comparison: the human genome is 3.4 Gb = 3.4 billion base pairs and has 20,000-25,000 genes.)
As a result, recent sequencing projects have started to focus on sequencing different individuals of the same species
(e.g. the 1000 Genomes Project, http://www.1000genomes.org/) rather than only sequencing representatives of
different species. This has been made possible by the lower sequencing cost of next generation
sequencing approaches. It opens novel perspectives for, amongst others, personalized medicine,
sequence-based trait selection and evolution experiments.
The ‘1000 genomes project’
The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to
provide a comprehensive resource on human genetic variation. As with other major human genome
reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide
scientific community through freely accessible public databases.
The goal of the 1000 Genomes Project is to find the genetic variants that have frequencies of at least 1% in
the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a
person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The
many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome.
The pieces are then aligned to the reference sequence and joined together. To find the complete genomic
sequence of one person with current sequencing platforms requires sequencing that person's DNA the
equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across
the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered
by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome
will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely
that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for
detecting structural variants, and allows sequencing errors to be corrected.
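The relation between sequencing depth and the part of the genome that is actually seen can be made concrete with a small calculation. The sketch below assumes an idealized model that is not stated in the text: reads land uniformly at random, so that the number of reads covering a position is approximately Poisson distributed with the mean coverage as its mean, and at a diploid locus each of the two chromosomes receives half of the coverage. The function names are hypothetical.

```python
import math


def mean_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times each base is sequenced (e.g. 28 for 28X)."""
    return n_reads * read_len / genome_len


def fraction_covered(coverage: float) -> float:
    """Expected fraction of positions covered by at least one read."""
    return 1.0 - math.exp(-coverage)


def both_chromosomes_seen(coverage: float) -> float:
    """Chance that both chromosome copies at a position are each seen at least once."""
    per_copy = coverage / 2.0
    return (1.0 - math.exp(-per_copy)) ** 2


for c in (1, 4, 28):
    print(f"{c}X: covered {fraction_covered(c):.3f}, "
          f"both copies seen {both_chromosomes_seen(c):.3f}")
```

Under these assumptions roughly 63% of the genome is covered at 1X, which matches the statement that much of the sequence is missed at that depth, while at 28X practically every position is covered on both chromosomes.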
The data now available to scientists contains 99% of all genetic variants that occur in the populations studied,
down to the level of rare variations that only occur in 1 out of every 100 people. "The whole point of this
resource is that we're moving to a point where individuals are being sequenced in clinical settings and what
you want to do there is sift through the variants you find in an individual and interpret them," said Professor
Gil McVean of Oxford University, a lead author for the study.
The information will be pored over by thousands of researchers, who will analyze and interpret the DNA
variations between people in a bid to work out which ones are implicated in disease. In addition to the DNA
sequences, the 1,000 Genomes Project has stored cell samples from all the people it has sequenced, to allow
future scientific projects to look at the biological effect of the DNA variations they might want to study. One
way to do this is through a genome-wide association study (GWAS).
Genome wide association studies (GWAS): Any two human genomes differ in millions of different ways.
There are small variations in the individual nucleotides of the genomes (SNPs) as well as many larger
variations, such as deletions, insertions and copy number variations. Any of these may cause alterations in
an individual's traits, or phenotype, which can be anything from disease risk to physical properties such as
height. In a genetic association study one asks whether an allele of a genetic variant is found more often in
individuals with the phenotype of interest (e.g. with the disease being studied) than would be expected from
its frequency in individuals without the disease.
Fig. Overview of a genome-wide association study, from W. Gregory Feero et al., 2010, The New England Journal of
Medicine.
The most common approach in GWA studies is the case-control setup, which compares two large groups of
individuals: one healthy control group and one case group affected by a disease. All individuals in each group
are genotyped for the majority of common known SNPs. The exact number of SNPs depends on the
genotyping technology, but is typically one million or more. For each of these SNPs it is then investigated
whether the allele frequency differs significantly between the case and the control group. In such setups, the
fundamental unit for reporting effect sizes is the odds ratio. The odds ratio is the ratio between two
odds, which in the context of GWA studies are the odds of carrying a specific allele in the case group and the
odds of carrying the same allele in the control group (the odds being the number of individuals with the
allele divided by the number without it). When the allele frequency in the case group is much higher than in
the control group, the odds ratio will be higher than 1, and vice versa for a lower allele frequency.
Additionally, a P-value for the significance of the odds ratio is typically calculated. Finding odds ratios that
are significantly different from 1 is the objective of the GWA study because this shows that a SNP is
associated with the disease (see course Integrative biology).
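The calculation for a single SNP can be sketched as follows. The counts are invented for illustration, and the Pearson chi-square test on the 2x2 table is only one common way to obtain the P-value. The course text does not prescribe a specific test, so that choice is an assumption.

```python
import math


def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    odds_cases = case_with / case_without
    odds_controls = control_with / control_without
    return odds_cases / odds_controls


def chi_square_p_value(case_with, case_without, control_with, control_without):
    """P-value of a Pearson chi-square test (1 degree of freedom) on the 2x2 table."""
    a, b, c, d = case_with, case_without, control_with, control_without
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: 600 of 1000 cases and 500 of 1000 controls carry the allele.
table = (600, 400, 500, 500)
print("odds ratio:", odds_ratio(*table))
print("P-value:", chi_square_p_value(*table))
```

An odds ratio above 1 with a small P-value would mark this hypothetical SNP as associated with the disease. In a real study the same calculation is repeated for every genotyped SNP.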
COMPARATIVE GENOMICS
OVERVIEW
The basic idea of comparative genomics is the comparison of sequences between genomes. Sequence
alignment methodologies form the basic tools for comparative genomics (BLAST, ClustalW, Needleman-Wunsch,
Markov models…) (see chapter sequence alignment).
VALIDATE GENE PREDICTION
1) Comparative genomics can be used to aid or validate gene predictions: Since gene prediction methods
based on sequence features only (ab initio gene prediction) are only partially accurate, gene identification is
facilitated by high-throughput sequencing of partial cDNA copies of expressed genes (called expressed
sequence tags or ESTs). The presence of ESTs confirms that the predicted gene is transcribed. More thorough
sequencing of full-length cDNA clones may be necessary to confirm the structure of genes. Gene prediction
methods that take into account not only sequence features (codon usage, intron-exon recognition sites) but
also sequence homology (with ESTs, cDNAs, proteins) are called extrinsic gene finding methods (and are in
fact a combination of structural and comparative genomics). An example is the Genscan method discussed
in Ensembl.
HOMOLOGY BASED ANNOTATION:
The amino acid sequence of proteins encoded by the predicted genes can be used as a query sequence in a
database similarity search. A match of a predicted protein sequence to one or more database sequences not
only serves to validate the gene prediction, but also can give indications on the function of the gene.
Not all genes will give hits in database searches. Some proteins might be unique for a certain organism or
might not have been characterized before. In such cases it might also be important to search for characteristic
domains (conserved amino acid patterns that can be aligned) that represent a structural fold or a biochemical
feature (see chapter pattern searches).
STUDYING GENE FAMILIES
Another important goal of comparative genomics is the study of protein families: proteins come in families.
In theory all species inherit their genetic material from a common ancestor. In the simplest case all
descendants have the same number of genes, and each species has one copy of the same ancestral gene that
underwent changes during evolution (such genes are referred to as orthologs). Orthologs are genes that are so
highly conserved by sequence in different
genomes that the proteins they encode are strongly predicted to have the same structure and function and
to have arisen from a common ancestor through speciation. Some parts of the gene/protein that are under
selective pressure will not tolerate changes during evolution and will remain conserved. These usually
correspond to the functional regions of the protein. Multiple alignment can thus be used to infer the
important domains in a protein.
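How conserved positions can be read from a multiple alignment is sketched below. The toy alignment is invented, and scoring each column by the fraction of sequences that share its most common residue is just one simple conservation measure chosen for illustration.

```python
from collections import Counter


def column_conservation(alignment):
    """Per alignment column, the fraction of sequences sharing the most common residue.

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        # Dividing by the number of sequences means gaps lower the score.
        scores.append(most_common_count / len(column))
    return scores


def conserved_positions(alignment, threshold=1.0):
    """0-based indices of columns whose conservation reaches `threshold`."""
    scores = column_conservation(alignment)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Hypothetical aligned protein fragments from three species.
toy_alignment = ["MKTAY", "MRTAY", "MKSA-"]
print(conserved_positions(toy_alignment))
```

In this toy example only the first and fourth columns are identical in all three sequences, so they would be the candidates for functionally important positions.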
However, during evolution part of a genome (or a complete genome) can sometimes duplicate. When multiple
homologs of the same gene are present in one genome, we refer to them as paralogs.
Although the definition of paralogs and orthologs is in theory straightforward their occurrence in genomes
can give rise to complex evolutionary relations within a gene family (see figure). Genes in two species that
have directly evolved from a single gene in the last common ancestor (orthologs by definition) are most likely
to share the function. However, often, these sequences have duplicated after the speciation event (i.e. after
the two species diverged from each other). In this case one-to-many or many-to-many relationships between
genes originate. In such cases, it is non-trivial to determine which of the orthologs is functionally equivalent
to the ortholog in the other species. It may be only one, but several genes could also have redundant
functions (especially if the duplication event took place recently).
Fig. Example of a cluster of a protein family (sigma factors in E. coli). Orthologs are often determined as reciprocal best
blast hits.
This is an important issue to account for when attempting to extrapolate the function of a gene to its
homologs in other species.
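The reciprocal best hit criterion mentioned in the figure caption can be sketched as follows. The gene names and scores are invented, and the input is assumed to be a table of pairwise similarity scores (for example BLAST bit scores) in both search directions.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score.
    On ties the first pair encountered is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: genes a1, a2 in species A; b1, b2, b3 in species B.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 180, ("a2", "b3"): 175}
b_vs_a = {("b1", "a1"): 248, ("b2", "a1"): 88, ("b2", "a2"): 179, ("b3", "a2"): 176}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy data b2 and b3 both match a2 almost equally well, as would be expected after a recent duplication in species B, yet only one of them is reported. This is the one-to-many situation in which a simple reciprocal best hit can hide a functionally relevant homolog.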
Fig. A schematic gene tree showing speciation (S), duplication (D) and loss events (L). Human 1, Mouse 1, Human 2 and
Worm 1 are all orthologs as the last common event in their evolution was a speciation. Human 1 and Human 2 are
paralogs as the last event in their common evolution was a duplication. Adapted from (Alexeyenko et al., 2006). In Figure
1 Human 1 and Human 2 are out-paralogs relative to mouse since they resulted from a duplication event that occurred
before their speciation with Mouse.
Identifying conserved regions in multiple sequence alignments to infer functional importance does not have
to be limited to the coding region and can equally well be applied to non-coding regions, in which case we
refer to phylogenetic footprinting. Functional regions here often refer to regulatory motifs. The principle is
the same as for protein coding regions except that different types of alignment procedures need to be used.
BIOLOGICAL APPLICATION 1: EVOLUTIONARY BIOLOGY
With all these genome sequences at hand, comparative genomics allows studying our own evolution.
In September 2005, an international team published the genome of our closest relative, the chimpanzee.
With the human genome already in hand, researchers could begin to line up chimp and human DNA and
examine, one by one, the 40 million evolutionary events that separate chimps from us. The genome data
confirm our close kinship with chimps: We differ by only about 1% in the nucleotide bases that can be aligned
between our two species, and the average protein differs by less than two amino acids.
Given the dramatic behavioral and developmental differences that have arisen since their divergence from a
common ancestor 6-7 million years ago, the question arises of how these phenotypic differences are
reflected at the genome sequence level.
Recent studies have shown that mainly genes involved in smell and hearing are significantly different
between humans and chimpanzees.
Also changes in gene regulatory binding sequences (promoters, enhancers, and silencers) are likely to have
contributed to divergence between humans and chimps. Using a comparative approach, it has been shown
that regulatory binding sites lost in human, but still present in chimp are located in specific genomic regions
and are associated with genes involved in sensory perception.
In this figure, two examples are given of regulatory binding sites that changed between human and chimp.
Note how small differences in sequences can have such large phenotypic influences.
BIOLOGICAL APPLICATION 2: STUDYING GENOME EVOLUTION
Studying genomes has revealed that during evolution genes, and sometimes also full genomes, can duplicate.
This gives rise to multiple copies of the same gene in a single species (paralogs).
These duplicates are often the source of novel genetic material on which evolution can act. The most likely
fate of a duplicated gene is gene loss. Indeed most duplications are deleterious. In that case redundant genes
are removed from the genomes or evolve into pseudogenes.
However, through mutation and natural selection, one of the copies can develop a new function, leaving the
second copy to cover for the original function (evolution can experiment because there is a spare copy). If
paralogs (duplicated genes) coexist during evolution, they therefore usually have divergent or complementary
functions. Either their proteins become involved in novel biochemical pathways or the
paralogs have distinct expression domains. The latter means, for example, that the original gene was expressed
in both the roots and the leaves of a plant, but that after duplication one copy is expressed exclusively in the
roots while the other copy becomes responsible for expression in the leaves. Both copies become complementary
because their expression domains differ (this can be caused by mutations in the regulatory elements). The
latter phenomenon is referred to as subfunctionalization.
Fig. Gene duplication over time. An ancestral gene with Function 1 is duplicated into Copy 1 and Copy 2, which
initially both carry Function 1; later one copy retains Function 1 while the other acquires New Function 2.
By comparing genomes against themselves, researchers have found that all flowering plants underwent 3 full
genome duplication events.
Because for some genes a multiplication level of 8 was observed, the occurrence of 3 whole genome
duplication events was assumed (1->2->4->8 copies).
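The arithmetic behind this inference can be written down as a small sketch. It assumes that copies arise only through whole genome doublings, and it gives a lower bound, since copies may also have been lost after a duplication. The function name is hypothetical.

```python
def min_duplication_rounds(max_copies):
    """Smallest number of genome doublings that can yield `max_copies` copies of one gene."""
    if max_copies < 1:
        raise ValueError("the copy number must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    return (max_copies - 1).bit_length()


print(min_duplication_rounds(8))  # 1 -> 2 -> 4 -> 8, so 3 rounds
```

A gene family with, say, 5 retained copies would also imply at least 3 rounds under this assumption, because 2 rounds can produce at most 4 copies.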
These whole genome duplications could be dated and the oldest could be mapped to fossil records. The
duplication coincided with the origin and specialization of flowering plants during the Early Cretaceous.
Fossil records of plant evolution
The extant (or modern) angiosperms, i.e. flowering plants did not appear until the Early Cretaceous (145–
125 Mya), when the final combination of angiosperm features occurred, as supported by evidence from
micro- and macrofossils. During the Aptian, 125–112 Mya (Figure below) species diversity was low and pollen
and megafossils were rare components of terrestrial floras. Angiosperm fossils show a dramatic increase in
diversity between the Albian (112–99.6 Mya) and the Cenomanian (99.6–93.5 Mya) at a global scale (Figure
1). The angiosperm radiation yielded species with new growth architectures and new ecological roles. Early
angiosperms had small flowers with a limited number of parts that were probably pollinated by a variety of
insect taxa but specialized for none. Accordingly, Cenomanian flowers do not yet provide strong evidence for
specialization of pollination syndromes. However, by the Turonian (93.5–89.3 Mya), flowering plants had a
wide variety of features that are, in extant species, closely associated with several types of specialized insect
pollination and with high species diversity within angiosperm subclades. The evolution of larger seed size in
many angiosperm lineages during the early Cenozoic (from 65 Mya) points to animal-mediated dispersal
and shade-tolerant life-history strategies. In summary, fossils with affinities to diverse angiosperm lineages,
including monocots, are all found in Early Cretaceous floras. However, the question remains why this was
such a decisive time in the evolution of plants. Could whole-genome duplication events have had a key role in
the origin of angiosperms and their morphological and ecological diversification?
Evidence:
Many angiosperms have experienced one or more episodes of polyploidy in their ancestry.
o Duplicated genes and genomes can provide the raw material for evolutionary diversification
and the functional divergence of duplicated genes
o The dates of the duplication events correspond to time periods of large expansion in
angiosperms as recorded based on fossils.
Bioinformatics methods used
Field of comparative genomics and phylogeny. Methodologies mainly based on sequence alignment and
phylogenetic tree construction.
BIOLOGICAL APPLICATION 3: METAGENOMICS (G. VENTER)
METAGENOMICS: DNA SEQUENCING OF ENVIRONMENTAL SAMPLES
Nature Reviews Genetics 6, 805-814 (2005); doi:10.1038/nrg1709
Although genomics has classically focused on pure, easy-to-obtain samples, such as microbes that grow
readily in culture or large animals and plants, these organisms represent only a fraction of the living or once-
living organisms of interest. Many species are difficult to study in isolation because they fail to grow in
laboratory culture, depend on other organisms for critical processes, or have become extinct. Methods that
are based on DNA sequencing circumvent these obstacles, as DNA can be isolated directly from living or dead
cells in various contexts. Such methods have led to the emergence of a new field, which is referred to as
metagenomics.
DNA sequencing can provide insights into organisms that are difficult to study because they are
inaccessible by conventional methods such as laboratory culture. Examples are organisms that exist
only in tight association with other organisms, including various obligate symbionts and pathogens,
members of natural microbial consortia, and an extinct cave bear.
Isolation and sequencing of DNA from mixed communities of organisms (metagenomics) has
revealed surprising insights into diversity and evolution.
Partially assembled or unassembled genomic sequence from complex microbial communities has
revealed the existence of novel and environment-specific genes.
The application of high-throughput shotgun sequencing to environmental samples has recently provided global
views of those communities not obtainable from 16S rRNA or BAC clone–sequencing surveys. The sequence
data have also posed challenges to genome assembly, which suggests that complex communities will demand
enormous sequencing expenditure for the assembly of even the most predominant members.
However, for metagenomic data, this complete assembly may not always be necessary or feasible.
Determining the proteins encoded by a community, rather than the types of organisms producing them,
suggests a means to distinguish samples on the basis of the functions selected for by the local environment
and reveals insights into features of that environment.
For instance, examination of higher order processes reveals known differences in energy production (e.g.,
photosynthesis in the oligotrophic waters of the Sargasso Sea and starch and sucrose metabolism in soil) or
in population density and interspecies communication (overrepresentation of conjugation systems, plasmids,
and antibiotic biosynthesis in soil; Fig. 4, lower left). The predicted metaproteome, based on the fragmented
metagenomic analyses performed here, could be used to predict features of the sampled environments such
as energy sources or even pollution levels.
Metagenomics databases are currently being set up, for instance:
http://www.megx.net/index.php?navi=EasyGenomes
EXAMPLE G. VENTER SARGASSO SEA
Boston (04/16/04)—This Spring, J. Craig Venter is sailing around the French Polynesian Islands scooping up
bucketfuls (figuratively) of seawater in an ambitious voyage to sample microbial genomes found in the
world's oceans. His 95-foot yacht, Sorcerer II, has been outfitted with all manner of technical equipment to
accommodate the task, as well as a few surfboards should that opportunity arise.
Venter and colleagues report finding 1.2 million genes, including almost 70,000 entirely novel genes, from an
estimated 1,800 genomic species, including 148 novel bacterial phylotypes. This diversity is staggering and
to a large extent unexpected. "We chose the Sargasso seas because it was supposed to be a marine desert,"
says Venter wryly. "The assumption was low diversity there because of the extremely low nutrients." His
team sequenced a total of 1.045 billion base pairs of non-redundant sequence. At the height of the work,
"over 100 million letters of genetic code were sequenced every 24 hours." The results have been deposited
in GenBank. You can go and search for them.
PALEOGENOMICS
Mammoth genome
A very recent application is the use of metagenomics approaches to sequence the mammoth genome:
Usually mitochondrial genomes are sequenced from extinct species, as mitochondrial DNA is abundantly
present in eukaryotic cells and thus easier to sequence. In permafrost settings, theoretical calculations predict
DNA fragment survival up to 1 million years (11, 12). When preserved in such conditions, sequencing
of genomic DNA is still possible.
1 g of bone was used to extract DNA, which was subsequently used for library construction and
sequencing with a technology that had recently become available (13, 19).
The mammalian fraction dominated the identifiable fraction of the metagenome.
Nonvertebrate eukaryotic and prokaryotic species occur at approximately equal ratios, with a paucity
of fungal species and nematodes.
Hits against grass species outnumber the ones from Brassicales by a ratio of 3:1, which could be
indicative of ancient pastures on which the mammoth is believed to have grazed.
From Poinar et al., Science 2006.
Ancient salt crystals
Bacteria have been found associated with a variety of ancient samples; however, few studies are
generally accepted due to questions about sample quality and contamination. In 1995, Cano and Borucki
isolated a strain of Bacillus sphaericus from an extinct bee trapped in 25-30 million-year-old amber. More
recently, a report about the isolation of a 250 million-year-old halotolerant bacterium from a primary salt
crystal has been published. Halite crystals from the dissolution pipe at the 569 m level of the Salado
Formation were taken for sampling. A fluid volume of 9 µl was recovered from an inclusion in the crystal
and inoculated into two different media: casein-derived amino acids medium (CAS) and glycerol-acetate
medium (GA). Only the CAS enrichment yielded the bacterium, designated 2-9-3.
Once the bacterium has been isolated, the next step in the research is its taxonomic classification.
Two important genotypic markers widely used in recent bacterial taxonomy are 16S rRNA gene sequence
data and DNA-DNA hybridization data. Many researchers have reported a correlation between 16S rRNA gene
sequence similarity values and genomic DNA relatedness. It has been proposed that phenotypically related
bacterial strains showing 70% or greater genomic DNA relatedness constitute a single bacterial species. In
contrast, those having <70% but >20% relatedness are considered to be different species within a genus.
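The thresholds above can be written as a simple decision rule. This is only a sketch of the proposal as stated in the text: it assumes the strains are already known to be phenotypically related, and the function name and return labels are hypothetical.

```python
def classify_by_dna_relatedness(percent_relatedness):
    """Apply the 70% / 20% DNA-DNA relatedness thresholds to a pair of strains."""
    if not 0 <= percent_relatedness <= 100:
        raise ValueError("relatedness must be a percentage between 0 and 100")
    if percent_relatedness >= 70:
        return "same species"
    if percent_relatedness > 20:
        return "different species within a genus"
    # The text gives no rule for 20% or lower, so no conclusion is drawn here.
    return "no conclusion from these thresholds"


for value in (85, 45, 10):
    print(value, "->", classify_by_dna_relatedness(value))
```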
These analyses showed that the organism was most similar to Bacillus marismortui (99% similarity, S)
and Virgibacillus pantothenticus (97.5% S). Phylogenetic analysis showed that isolate 2-9-3 is part of a distinct
lineage within the larger Bacillus cluster.
Additional info
Metagenomics and industrial applications. Nat Rev Microbiol. 2005 Jun;3(6):510-6. Review.
The metagenomics of soil. Nat Rev Microbiol. 2005 Jun;3(6):470-8. Review.
http://www.bio-itworld.com/news/041604_report4889.html
The gut microbiome project
http://en.wikipedia.org/wiki/Microbiome
FUNCTIONAL GENOMICS & SYSTEMS BIOLOGY
SYSTEMS BIOLOGY
Systems biology is a field that originated in the early 1990s: it stems from ‘molecular biology’ but reflects a novel holistic way of
thinking: understanding complex biological phenomena in their entirety. In systems biology a cell is
considered as a system that interacts with its environment. It receives dynamically changing environmental
cues and transduces these signals into the observed behavior (phenotype or dynamically changing
physiological responses). This signal transduction is mediated by the regulatory network (below). Genetic
entities (proteins), located on top of a regulation cascade, are activated by external cues. They further
transduce the signal downstream in the cascade via protein-protein interactions, chemical modifications of
intermediate proteins, etc., into transcriptional activation and subsequent translation. Ultimately, these
processes turn the genetic code into functional entities, the proteins. The action of regulatory networks
determines how well cells can react or adapt to novel conditions. This signaling network in a cell can be
compared with the electronic circuitry on a microchip. It also consists of individual components (often called
modules). Systems biology is the science that tries to decode the design principles of biological systems. It
can be used for both fundamental and applied purposes. A typical example of a fundamental application is
the domain of evolutionary systems biology which has as a goal studying the impact of network rewiring on
adaptive behavior and organism evolution.
Figure: The cell as a signal transduction system. The signaling circuitry can be considered as having
a modular composition, in which each part forms an individual functional unit.
BIOLOGICAL APPLICATION: SYNTHETIC BIOLOGY FROM AN ENGINEERING POINT OF VIEW:
RATIONAL DESIGN
Synthetic biology best reflects the paradigm shift in biological thinking that occurred during the last decades.
It reflects how biology is approached from a computational/engineering point of view.
The major difference between ‘genetic engineering/biotechnology’ and ‘synthetic biology’ is a
novel mindset: the idea of rational design. Synthetic biology relies on the identification, reuse or adaptation
of existing parts of systems to construct reduced systems tailored to an aim whose starting assumptions
might be very different from those of the natural system. The idea of using parts stems from the parallel
between electronic circuits and biological systems. Each component of the system can be seen as an
individual transistor. By combining the different signaling components a circuitry can be designed that has
operational characteristics or that gives rise to functionalities which do not occur as such in nature.
The premise of synthetic biology is thus built on the modularity of signal transduction pathways. This
modularity allows constructing and synthesizing artificial biological systems by combining “microchip
design principles” with libraries of molecular modules to obtain a desired microbial functionality. According
to this vision, Synthetic Biology should be able to rely on a list of standardized parts (amino acids, bases,
proteins, genes, circuits, cells, etc) whose properties have been characterized quantitatively and on software
modeling tools that would help putting parts together to create a new biological function.
The idea behind the MIT ‘Registry of Standard Biological Parts’ (http://parts.mit.edu) is that, as more libraries
of parts are constructed, and provided that all these parts are well documented and standardized, in
the end one can immediately select the appropriate part from the library, and the tedious step of making a
mutant library or synthesizing all possible sequences and characterizing their input and output characteristics
can be omitted.
Currently the Registry is a collection of ~3200 genetic parts that can be mixed and matched to build synthetic
biology devices and systems. Founded in 2003 at MIT, the Registry is part of the Synthetic Biology
community's efforts to make biology easier to engineer. It provides a resource of available genetic parts to
iGEM teams and academic labs.
Current challenges in synthetic biology:
The premise of synthetic biology is built on the modularity of signal transduction pathways. Artificial
biological systems are synthesized by combining parts with desired functionalities and kinetic behavior, as
predicted by a model-based design. However, the generation of parts with the proper characteristics is
still very laborious and ad hoc (large libraries are made randomly, see figure below, and all parts within such a
library need to be characterized experimentally).
A fundamental systems understanding of how regulation or a certain kinetic behavior is encoded could
further rationalize the design of modules and contribute to a better standardization (that is, the key features
in the primary sequence that drive a specific expression behavior: motifs, motif spacing, nucleosome
positioning, etc.).
Modeling is the key to both systems and synthetic biology (see figure below). For systems biology modeling
aims at getting a fundamental understanding of the host cellular behavior, while for synthetic biology ‘model-
based design’ is used to determine the circuit topology and its parameters subject to predefined design
requirements. Such design requirements should not only consider desired input/output characteristics
(linear, oscillating behavior, bistability), but also take into account the ease with which certain parts can be
manipulated in the lab. A challenging task is making design models that determine design parameters
conditioned on systems properties of the global cellular system.
CONCLUSION
Bioinformatics has several subdisciplines (functional genomics/systems biology; comparative
genomics; structural genomics).
Each subdomain is an evolving research domain, with novel tools being produced at an ever
increasing pace.
The real life examples of the different subdomains show how computational analysis (bioinformatics) in
combination with novel data generation procedures will change the way ‘molecular biologists’ think and
perform research (referred to as the new biology).
How can bioinformatics change the world?
Bioinformatics has numerous application domains, and will for instance revolutionize the medical field,
because it for example will make personalized medicine possible. Now that the genome (the blueprints of
life) of everyone can (and will) get sequenced, we can start investigating why some persons are susceptible
to certain diseases while others are not, and why a certain treatment works for some and not for others. In
agriculture, we can for instance study why certain plants are more resistant to drought than others, which is
important in these days of global warming and climate change. Bioinformatics can also address more
fundamental questions in evolutionary biology, such as whether Neanderthals and the ancestor of modern
humans ever had sex (the answer is yes), questions that can only be addressed with bioinformatics or
computational biology.
If Bioinformatics will become so prominent and is referred to as ‘the new biology’, how will this affect the
more classical wet lab science?
Of course without data there is no bioinformatics. But there is indeed a tendency that increasingly, data
generation becomes robotized or outsourced. This has a consequence that wet lab scientists have more time
left to spend on the design of their experiment and will be confronted at a much earlier stage with the analysis
of their data, and the problems related to this. What do you hope to get out of your data, how will you
summarize all these data, what is the hypothesis you want to formulate, and so on? So rather than focusing
on a single gene, they will need to start thinking more globally, solving the bigger picture and that is what
the term ‘new biology’ is referring to. This is now often considered the problem of the bioinformatician but
obviously, the wet lab scientist of the future will have to adopt at least some of those skills. So the distinction
between a bioinformatician and a wet lab scientist (systems biologist) will become fuzzier and in the coming
decades, we expect that about one third of the people in the life sciences will be bioinformaticians or at least
use some sort of bioinformatics in their research. However, although genome hackers and number crunchers
can learn a lot from the loads of data generated, wet lab work will always be necessary. Bioinformatics is
also often about making predictions, but of course these still will need to be validated in the lab. On the other
hand, for some specific fields such as evolutionary research, bioinformatics is often sufficient or even the
only way to obtain results.
What is the most important skill of being a bioinformatician?
First of all you need to be a generalist rather than a specialist. You need to know a bit of everything but
nothing too much in detail (that can even be disadvantageous I think). To give an example: wet lab scientists
typically have a very detailed view on biology: biological systems have randomly evolved into emerging
complex systems that cannot be captured in a few rules. There are more exceptions than fixed rules in
biology. Engineers on the other hand model systems and these models depend on predefined rules. As a
bioinformatician you need to keep both parties happy: a good formalization of a biological question should
reduce the problem to a model that is mathematically tractable but that still captures the intricacies the
biologist is interested in. Finding the right assumptions and simplifications builds on this generic knowledge.
This generic knowledge is also key to the scientific intuition you need to have as a bioinformatician. As was
already mentioned: with bioinformatics we can solve research questions that could not be addressed before.
There is so much data out there that when you integrate it all you can tackle research questions that go far
beyond what was accessible or could be dreamt of by a single person or even a single lab. The difficulty often
is defining these novel research questions or hypotheses no one has ever thought of before. This again
requires very good interdisciplinary knowledge of how the data was generated, what type of information
it contains, how it can be integrated, etc. Biological scientists need to know some bioinformatics.