Introduction to Bioinformatics Dr. Yael Mandel-Gutfreund TA: Oleg Rokhlenko.


Course Objectives

• To introduce the discipline of bioinformatics
• To familiarize students with the major biological questions that bioinformatics tools can address
• To introduce the major tools used for sequence and structure analysis and explain, in general terms, how they work and what their limitations are


Course Requirements

1. Submit written assignments.
– 9/12 short class assignments, 4/4 home assignments
– Each assignment is to be done and submitted in pairs (except the first two class assignments)
– Pairs are ideally composed of one person from computer science and one person from life science
2. A final project or a take-home exam, submitted in pairs.
3. The course web site: http://webcourse.cs.technion.ac.il/236523


Grading

• 10% class assignments
• 30% home assignments
• 60% final project/test


Literature list

• Gibas, C., Jambeck, P. Developing Bioinformatics Computer Skills. O'Reilly, 2001.
• Lesk, A. M. Introduction to Bioinformatics. Oxford University Press, 2002.
• Mount, D. W. Bioinformatics: Sequence and Genome Analysis. 2nd ed., Cold Spring Harbor Laboratory Press, 2004.

Advanced Reading

• Jones, N. C., Pevzner, P. A. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.


Course Outline

• Introduction to bioinformatics
• Bioinformatics databases
• Pairwise and multiple sequence alignment
• Searching for sequences in databases
• Searching for motifs in sequences
• Phylogenetics
• RNA secondary structure
• Protein structure: secondary and tertiary structure
• Protein families: motifs, domains, clustering
• The Human Genome Project
• Gene prediction, alternative splicing
• Gene expression analysis (DNA microarrays)
• Comparative genomics, biological networks


Introduction to Bioinformatics

• What is Bioinformatics?
• From DNA to Genome
• What’s next? The post-genomic era


What is Bioinformatics?

“the field of science in which biology, computer science, and information technology merge to form a single discipline.

Ultimate goal: to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.”


Central Paradigm in Molecular Biology

Gene (DNA) → [Transcription] → mRNA → [Translation] → Protein

DNA → RNA → Protein → Symptoms (Phenotype)
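The flow from DNA to mRNA to protein can be illustrated with a minimal Python sketch. It assumes the standard genetic code, treats the input as the coding strand, and stops at the first stop codon; the function names and the toy sequence are illustrative and do not come from the course material.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Transcription: return the mRNA matching a coding-strand DNA sequence."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translation: read codons from position 0 until a stop codon or the end."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGGGACAATTGTGA"   # hypothetical toy gene
    mrna = transcribe(gene)    # AUGGGACAAUUGUGA
    print(mrna, translate(mrna))
```

A real gene would also need handling of introns, reverse-strand genes, and ambiguous bases, which this sketch ignores.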


21st-century biology: from a purely lab-based science to an information science


Central Paradigm of Bioinformatics

Genetic Information → Molecular Structure → Biochemical Function → Symptoms


From DNA to Genome

[Timeline figure, 1955 to 1985. Milestones shown:]
• Watson and Crick DNA model
• Sanger sequences insulin protein
• Dayhoff’s Atlas of Protein Sequences
• ARPANET (early Internet)
• N-W sequence alignment
• PDB (Protein Data Bank)
• Sanger dideoxy DNA sequencing
• GenBank database
• PCR (Polymerase Chain Reaction)

[Timeline figure, 1985 to 2000. Milestones shown:]
• Algorithms: FASTA, BLAST
• Databases and centers: SWISS-PROT database, USA’s NCBI, Europe’s EBI, Israel’s INN
• WWW (World Wide Web)
• Genome projects: Human Genome Initiative, first bacterial genome, yeast genome, Celera Genomics, first human genome draft


Complete Genomes

• 1994: 0
• 1995: 1
• 2004: 234
– eukaryotes: 20
– bacteria: 194
– archaea: 19


What’s Next? The “post-genomics” era

Goal: to understand the functional networks of a living cell

• Annotation
• Comparative genomics
• Structural genomics
• Functional genomics


Annotation

Open reading frames

Functional sites

Structure, function


CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATGCGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAACTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTCAGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGAAGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAATAT GGA CAA TTG GTT TCT TCT CTG AAT .................... TGAAAAACGTA


[Figure: the same sequence with the following features marked]
• TF binding site
• Promoter
• Transcription Start Site
• Ribosome binding site
• ORF = Open Reading Frame
• CDS = Coding Sequence
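One of the annotation steps, locating open reading frames, can be sketched as a scan for an ATG followed by an in-frame stop codon. The sketch below is a simplified illustration, not the method used by any particular gene finder: it looks only at the three forward-strand frames, and the function name, the `min_codons` parameter, and the toy sequence are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon; `end` is exclusive and includes the stop codon.
    `min_codons` counts the codons before the stop, ATG included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close it and keep scanning this frame
    return orfs


if __name__ == "__main__":
    toy = "CCATGGGACAATTGTGACC"  # hypothetical sequence with one short ORF
    for start, end, frame in find_orfs(toy):
        print(frame, start, end, toy[start:end])
```

Real annotation would also scan the reverse complement and use evidence such as ribosome binding sites and promoters, as the feature list above suggests.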


Comparative genomics

Comparing ORFs

Identifying orthologs

Inferring structure and function

Comparing functional sites

Inferring regulatory networks


Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.

Conservation of the IGFALS gene (insulin-like growth factor binding protein, acid-labile subunit) between human and mouse.


Ultraconserved Elements in the Human Genome
Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W. James Kent, John S. Mattick, David Haussler

There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. Along with more than 5000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than are proteins and appear to be essential for the ontogeny of mammals and other vertebrates.
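The notion of "absolutely conserved" segments can be made concrete with a small sketch. It assumes the orthologous regions have already been aligned (with `-` marking gaps) and reports maximal runs of columns that are identical in every sequence. This is only an illustration of the definition, not the procedure used in the paper; the function names and toy sequences are hypothetical, and percent identity is computed here over ungapped columns, which is one of several conventions.

```python
def conserved_runs(aligned, min_len):
    """Return (start, end) column ranges, end exclusive, where all aligned
    sequences are identical and gap-free for at least `min_len` columns."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    runs, start = [], None
    for col in range(length + 1):  # one extra step closes a trailing run
        identical = (
            col < length
            and aligned[0][col] != "-"
            and all(seq[col] == aligned[0][col] for seq in aligned)
        )
        if identical and start is None:
            start = col
        elif not identical and start is not None:
            if col - start >= min_len:
                runs.append((start, col))
            start = None
    return runs


def percent_identity(a, b):
    """Percent of ungapped aligned columns where a and b carry the same base."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Hypothetical toy alignment; real elements are 200 bp or longer.
    human = "ACGTACGTTAGC"
    mouse = "ACGTACGATAGC"
    rat = "ACGTACG-TAGC"
    print(conserved_runs([human, mouse, rat], min_len=5))
    print(percent_identity(human, mouse))
```

With genome-scale alignments, `min_len=200` would correspond to the threshold quoted in the abstract.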


Functional genomics

Genome-wide profiling of:
• mRNA levels
• Protein levels

Co-expression of genes and/or proteins

Identifying protein-protein interaction

Networks of interactions
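Co-expression is often quantified by correlating expression profiles across conditions. The sketch below uses the Pearson correlation coefficient as one common choice; the gene names, expression values, and the 0.9 threshold are made up for illustration, and other similarity measures are equally plausible.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above `threshold`."""
    genes = sorted(profiles)
    return [
        (g1, g2)
        for i, g1 in enumerate(genes)
        for g2 in genes[i + 1:]
        if pearson(profiles[g1], profiles[g2]) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical mRNA levels measured under four conditions.
    profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    print(coexpressed_pairs(profiles))
```

Pairs that pass the threshold can be read as edges, which is one simple way to arrive at the networks of interactions mentioned above.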


Understanding the function of genes and other parts of the genome


Structural genomics

Assign structure to all proteins encoded in a genome


Structural Genomics Expectations

• Currently: 27,761 structures in PDB, ~300 unique folds
• Estimate: 1000-3000 unique folds in “structure space”


Database Types

Sequence databases
• General: GenBank, EMBL, PIR, SWISS-PROT
• Special: TF binding sites, promoters, genomes

Structure databases
• General: PDB
• Special: specific protein families, folds

Databases of experimental results
• Co-expressed genes, protein-protein interactions, etc.


• World Wide Web
– USA National Center for Biotechnology Information: www.ncbi.nlm.nih.gov
– European Bioinformatics Institute: www.ebi.ac.uk
– ExPASy Molecular Biology Server: www.expasy.org
– Israeli National Node: inn.org.il

http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm


Entrez – NCBI Engine

• Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.

http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?itool=toolbar

http://www.ncbi.nlm.nih.gov/Entrez/index.html


Nucleotide

The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB.

April 2004: 38,989,342,565 bases


PubMed

• MEDLINE publication database
– Over 17,000 journals
– Some other citations
• Papers from the 1960s onward
– Over 12,000,000 entries
• Alerting services
– http://www.pubcrawler.ie/
– http://www.biomail.org/


OMIM

• Online Mendelian Inheritance in Man
– Genes and genetic disorders
– Edited by a team at Johns Hopkins
– Updated daily
• Entries
– 10,670 single-locus phenotypes (*)
– 1,294 multi-locus phenotypes (#)
– 2,415 unclassified phenotypes


Searching PubMed

• Unstructured searches
– Automatic term mapping
• Structured searches
– Field names, e.g. [au], [ta], [dp], [ti]
– Boolean operators, e.g. AND, OR, NOT, ()
• Additional features
– Subsets, limits
– Clipboard, history
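A structured PubMed search combines field tags with Boolean operators, and the same query string can be sent programmatically. The sketch below builds such a query and a request URL for NCBI's E-utilities ESearch service. The E-utilities endpoint is not mentioned in the slides and is brought in here as an assumption; the helper names and the example search terms are illustrative. No network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not part of the original slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach a PubMed field tag, e.g. tagged('2004', 'dp') -> '2004[dp]'."""
    return f"{term}[{field}]"


def build_query(author, title_word, year):
    """Combine author, title word, and publication date with Boolean AND."""
    return (
        f"({tagged(author, 'au')} AND {tagged(title_word, 'ti')})"
        f" AND {tagged(year, 'dp')}"
    )


def esearch_url(query, db="pubmed", retmax=20):
    """Return an ESearch URL for the query; the caller decides how to fetch it."""
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    query = build_query("Bejerano G", "ultraconserved", "2004")
    print(query)
    print(esearch_url(query))
```

The same query string can be pasted directly into the PubMed search box, where [au], [ti], and [dp] restrict the match to author, title, and publication date.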


Searching OMIM

• Search fields
– Disease name, e.g. hypertension
– Cytogenetic location, e.g. 1p31.6
– Inheritance, e.g. autosomal dominant
• Browsing interfaces
– Alphabetical by disease
– Genetic map
• Additional features like PubMed
