An Introductory Course on BIOINFORMATICS, Liviu Ciortuz, 2007

An Introductory Course on BIOINFORMATICSciortuz/SLIDES/2006/bio-intro.pdf · Discovering Genomics, Proteomics, and Bioinformatics, (2nd ed.) Malcolm Campbell, Laurie Hayer Benjamin

May 20, 2020


An Introductory Course on

BIOINFORMATICS

Liviu Ciortuz, 2007


Plan

1 What is bioinformatics?

Why study it?

2 Bibliography

3 A molecular biology primer

3.1 The cell

3.2 The DNA

3.3 The central dogma of molecular biology

3.4 Exemplifying genetic diseases: Thalassemia

4 Discovery question


1 What is Bioinformatics?

Bioinformatics is a multidisciplinary science focusing on
the application of computational methods and mathematical statistics to molecular biology.

Bioinformatics is also called

Computational Biology (USA)

Computational Molecular Biology

Computational Genomics

The related "...ics" family of subdomains: Genomics, Proteomics, Phylogenetics, Pharmacogenomics, ...


Why should I teach/study bioinformatics?

Because bioinformatics is

an opportunity to use some of the most interesting computational techniques...

to understand some of the deep mysteries of life and diseases

Note: The next 3 slides are from Thomas Nordahl Petersen, University of Copenhagen


Example: Parkinson disease

a degenerative disorder of the central nervous system

due to the loss of brain cells which produce dopamine,

a neurotransmitter important for the initiation of movement

Muhammad Ali and Pope John Paul II had Parkinson disease..., as did my father


Dopamine produced by cells in the substantia nigra activates neurones in the striatum/basal ganglia


Cure for Parkinson disease?

Parkinson disease may be cured provided that new dopamine-producing cells replace the dead ones.

As a medical experiment, dopamine-producing brain cells from aborted foetuses have been transplanted into the brains of Parkinson patients, and in some cases this cured the disease. Brain tissue from approx. 6 foetuses was needed. Major ethical problems!

Searching for a protein drug is the only valid option.

The genes producing dopamine are still unknown. Until now, only genes involved in dopamine transport have been identified.


2 Bibliography for this course

• Biological sequence analysis:

Probabilistic models of proteins and nucleic acids

R. Durbin, S. Eddy, A. Krogh, G. Mitchison,

Cambridge University Press, 1998

• An Introduction to Bioinformatics Algorithms

Neil Jones, Pavel Pevzner

MIT Press, 2004

◦ Computational Molecular Biology: An Algorithmic Approach

Pavel Pevzner

MIT Press, 2000

◦ Introduction to Computational Genomics: A Case Studies Approach

Nello Cristianini, Matthew Hahn

Cambridge University Press, 2006

• Introduction to Computational Molecular Biology

Joao Setubal, Joao Meidanis

PWS Publishing Company, 1997


Bibliography (II), more “Bio...”

• Cell Biology, (2nd ed.)
Gerald Karp
McGraw-Hill, 1979

• Discovering Genomics, Proteomics, and Bioinformatics, (2nd ed.)
Malcolm Campbell, Laurie Heyer
Benjamin Cummings, 2006

• Fundamental Concepts of Bioinformatics
Dan Krane, Michael Raymer
Benjamin Cummings, 2003

• Introduction to Bioinformatics
Arthur Lesk
Oxford University Press, 2002

• Bioinformatics
David Mount
Cold Spring Harbor Laboratory Press, 2001


Bibliography (III), more “...informatics”

• Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology
Dan Gusfield
Cambridge University Press, 1997

• Jewels of Stringology
M. Crochemore and W. Rytter
World Scientific Press, 2002

• Flexible Pattern Matching in Strings: Practical on-line search algorithms for texts and biological sequences
Gonzalo Navarro, Mathieu Raffinot
Cambridge University Press, 2002

• Bioinformatics: the Machine Learning Approach
Pierre Baldi, Søren Brunak
MIT Press, 2001

• Statistical Methods in Bioinformatics: An Introduction
Warren Ewens, Gregory Grant
Springer, 2001


Prof. Larry Hunter about

Pevzner’s “Computational Molecular Biology”

“This is an awesome compendium of specific problems and well-defined algorithms to solve them.

Requires comfort with algorithmic computer science to make much sense of it, and doesn’t provide a lot of background about why these particular problems are important.

However, many problems (in sequence analysis in particular) are very elegantly solved here.

Also, this is a good place to find inspiration when you need a more effective way to solve your own problem.”


“Computational Molecular Biology” Content

[1. Computational Gene Hunting]

2. Restriction Mapping

3. Map Assembly

4. Sequencing

5. DNA Arrays

6. Sequence Comparison

7. Multiple Alignment

8. Finding Signals in DNA

9. Gene Prediction

10. Genome Rearrangements

11. Computational Proteomics

12. Problems

13. All You Need to Know about Molecular Biology


“Bioinformatics Algorithms” Content

1. A Molecular Biology Primer

2. Exhaustive Search

3. Greedy Algorithms

4. Dynamic Programming Algorithms

5. Divide-and-Conquer Algorithms

6. Graph Algorithms

7. Combinatorial Pattern Matching

8. Clustering and Trees

9. Randomized Algorithms


“Bioinformatics Algorithms” Content

1. A Molecular Biology Primer

2. Exhaustive Search:

Mapping DNA, Finding signals

3. Greedy Algorithms:

Finding signals, Genome rearrangements

4. Dynamic Programming Algorithms:

Comparing sequences, Predicting genes

5. Divide-and-Conquer Algorithms:

Comparing sequences

6. Graph Algorithms:

Sequencing DNA, Identifying proteins, DNA arrays

7. Combinatorial Pattern Matching:

Comparing sequences, Repeat analysis

8. Clustering and Trees:

Molecular evolution

9. Randomized Algorithms:

Finding signals


“Biological Sequence Analysis” Content

1. Hidden Markov Models

2. Profile identification in genetic sequences using HMMs

3. Alignment of pairs of DNA/protein sequences

4. Alignment of pairs of DNA/protein sequences using HMMs

5. Multiple alignment of DNA/protein sequences

6. Multiple alignment of DNA/protein sequences using HMMs

7. Phylogenetics; probabilistic models

8. Probabilistic CFGs

9. Alignment of RNA sequences using PCFGs


3 A Molecular Biology Primer

3.1 The Cell

The cell is the fundamental working unit of every organism.

Instead of having brains, cells make decisions through complex networks of chemical reactions called pathways:

• synthesize new materials

• break other materials down for spare parts

• signal to eat, replicate or die

There are two different types of cells/organisms: Prokaryotes and Eukaryotes.


Life depends on 3 critical molecules

DNA: made of A, C, G, T nucleotides (“bases”)

• holds information on how the cell works

RNA: made of A, C, G, U nucleotides

• provides templates for synthesizing proteins

• transfers short pieces of information to different parts of the cell

Proteins: made of (20) amino acids

• form enzymes that send signals to other cells and regulate gene activity

• make up the cellular structure

• form the body’s major components (e.g. hair, skin, etc.)


Some basic terminology

Genome: the complete set of one organism’s DNA

• a small bacterial genome contains approx. 600,000 base pairs

• human: approx. 3 billion, on 23 pairs of chromosomes

• each chromosome contains many genes

Gene: the basic functional and physical unit of heredity,

a specific sequence of bases that encodes instructions on

how to make proteins


3.2 The DNA Helix

Discovered in 1953

(following hints by Erwin Chargaff and Rosalind Franklin) by

James Watson (biologist) and Francis Crick (physicist, PhD student)

Nobel Prize, 1962


James Watson

and

Francis Crick


DNA copied/“replicated”

3.3 The Central Dogma of Molecular Biology

DNA → RNA → proteins


The Central Dogma of

Molecular Biology

DNA → RNA → proteins

in Eukaryotes


A Romanian won the Nobel Prize in molecular biology

In 1956 George Emil Palade showed that proteins are manufactured in the cytoplasm on RNA-rich organelles called ribosomes.


DNA to Amino Acid Coding Table

Each codon (triplet of DNA nucleotides) corresponds to one of the 20 amino acids or to a stop signal.

Among the 64 codons there is a start codon and three stop codons.

The redundancy in the table (one amino acid may be encoded by several different codons) is a kind of defence against mutations...
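The coding table can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code written with one-letter amino acid symbols and `*` for stop; the names `CODON_TABLE`, `translate` and `codons_per_aa` are made up for this example and do not come from the slides.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed in TCAG order for each of the three positions.
# One letter per codon; '*' marks a stop codon.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Redundancy of the code: how many codons map to each amino acid (or to stop).
codons_per_aa = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    print(len(CODON_TABLE), "codons")
    print(codons_per_aa["*"], "stop codons")
    print(translate("ATGGCTTAA"))
```

Counting the entries of `codons_per_aa` shows the redundancy directly: some amino acids have six codons, while methionine (the usual start codon ATG) has only one.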


3.4 Thalassemia — a genetic disease

due to faulty DNA replication

A mutation in a gene is a change in the DNA’s sequence of nucleotides.

Sometimes even a mistake at just one position can have a profound effect.

Here is a small but devastating mutation in the gene for hemoglobin, the protein which carries oxygen in the blood.

good gene: AACCAG

mutant gene: AACTAG
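To see why this single change matters, the two fragments can be compared position by position and then read as codons. The sketch below assumes that the six bases shown are read in frame starting at the first base (the slide does not state the reading frame); the helper names and the three-entry codon dictionary are illustrative only.

```python
def point_mutations(reference: str, mutant: str) -> list[tuple[int, str, str]]:
    """Return (position, reference base, mutant base) for every mismatch."""
    if len(reference) != len(mutant):
        raise ValueError("sequences must have the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(reference, mutant))
        if a != b
    ]


# Only the three codons needed for this example (standard genetic code).
CODONS = {"AAC": "Asn", "CAG": "Gln", "TAG": "STOP"}


def read_codons(dna: str) -> list[str]:
    """Split the fragment into triplets and look each one up."""
    return [CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3)]


if __name__ == "__main__":
    good, bad = "AACCAG", "AACTAG"
    print(point_mutations(good, bad))  # one mismatch, at 0-based position 3
    print(read_codons(good))           # Asn, Gln
    print(read_codons(bad))            # Asn, STOP
```

Under that reading-frame assumption, the C to T change turns the glutamine codon CAG into the stop codon TAG, so protein synthesis would end prematurely.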


from “The Cartoon Guide to Genetics”, Larry Gonick, Mark Wheelis


Note

In Cyprus, a screening policy (including pre-natal screening and abortion), introduced in the 1970s to reduce the incidence of thalassemia, has reduced the number of children born with the hereditary blood disease from 1 out of every 158 births to almost 0.


4 Discovery Question:

How do we read DNA sequences?

Knowing how DNA replication works, and assuming that you can get the molecular mass of any given DNA fragment,

design a strategy to get the “reading” of the base composition of an unknown DNA sequence (i.e. the output should be a string over the alphabet {A, C, G, T}).

What if, due to physical limitations, only fragments of relatively short length (500-700 bases) can be treated in the above way, but the genome that you want to “read” is much larger (10^6 bases or more)?


Fred Sanger’s Method, Nobel Prize, 1980

In 1977 Sanger sequenced the DNA of the ΦX174 phage virus (5386 nucleotides).

From Discovering Genomics, Proteomics, and Bioinformatics, Campbell and Heyer, 2006
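The idea behind chain-termination sequencing can be illustrated with a toy simulation. This is a heavily simplified sketch, not the laboratory protocol: it assumes four separate reactions, one per base, each yielding copies that stop right after that base, and it uses fragment length as a stand-in for the molecular mass mentioned in the discovery question. It also ignores the fact that the copy is complementary to the template. All function names are hypothetical.

```python
def terminated_fragment_lengths(sequence: str, base: str) -> list[int]:
    """Lengths of the partial copies that end right after an occurrence of `base`."""
    return [i + 1 for i, b in enumerate(sequence) if b == base]


def read_sequence(lengths_by_base: dict[str, list[int]]) -> str:
    """Sort all fragments by length; the terminating bases, in order, spell the sequence."""
    tagged = sorted(
        (length, base)
        for base, lengths in lengths_by_base.items()
        for length in lengths
    )
    return "".join(base for _, base in tagged)


def sanger_read(sequence: str) -> str:
    """Simulate the four reactions, then reconstruct the sequence from fragment lengths."""
    lanes = {b: terminated_fragment_lengths(sequence, b) for b in "ACGT"}
    return read_sequence(lanes)


if __name__ == "__main__":
    print(terminated_fragment_lengths("GATTACA", "A"))
    print(sanger_read("GATTACA"))
```

Each position of the sequence produces exactly one fragment length in exactly one of the four reactions, which is why sorting by length recovers the order of the bases.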


Scaling up Sanger’s method to whole genome sequencing

Problems:

• limited size of the reads: 500–700 nucleotides

• genomes are much larger (human: 3 × 10^9), and contain lots of repeats (human: more than 50%)

• sequencing errors: 1-3%
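A quick calculation gives a feel for the scale of the problem. The sketch below computes only a lower bound on the number of reads, assuming error-free reads that tile the genome; the `coverage` parameter (how many times each base is read on average) is an assumption added for illustration and is not a figure from the slides.

```python
def min_reads(genome_length: int, read_length: int, coverage: int = 1) -> int:
    """Smallest number of reads of a given length that yields the requested coverage."""
    total_bases = genome_length * coverage
    return -(-total_bases // read_length)  # integer ceiling division


if __name__ == "__main__":
    human = 3 * 10**9
    for read_length in (500, 700):
        print(read_length, min_reads(human, read_length))
```

Even at a coverage of 1, a genome of 3 × 10^9 bases needs several million reads of 500 to 700 nucleotides, and real projects need considerably more because reads must overlap in order to be assembled.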

Solutions:

• use overlapping reads, then assemble them

• BAC-by-BAC sequencing

• use tandem reads to cope with repeats
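The first solution, assembling overlapping reads, can be sketched with a simple greedy strategy: repeatedly merge the two reads with the longest suffix/prefix overlap. This is a toy illustration under strong assumptions (error-free reads, no repeats, all reads from the same strand); it is not the method of any particular assembler, and the function names and the `min_len` threshold are made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining reads do not overlap: leave them as separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTACG", "ACGGA"]))
```

With repeats longer than the reads, several merges can look equally good and a greedy choice may join the wrong pieces, which is one reason the other solutions in the list above are needed.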

Recommended reading: An Introduction to Bioinformatics Algorithms, Jones & Pevzner, Ch. 8.
