Introduction to Bioinformatics Dr. Yael Mandel-Gutfreund TA: Oleg Rokhlenko
Jan 18, 2018
Course Objectives
• To introduce the bioinformatics discipline
• To familiarize students with the major biological questions that bioinformatics tools can address
• To introduce the major tools used for sequence and structure analysis and to explain, in general terms, how they work and what their limitations are
Course Requirements
1. Submit written assignments.
   1. 9/12 short class assignments and 4/4 home assignments
   2. Each assignment is to be done and submitted in pairs (except the first two class assignments).
   3. The pairs are ideally composed of a person from computer science and a person from life science.
2. A final project or a take-home exam, submitted in pairs.
3. The course web site: http://webcourse.cs.technion.ac.il/236523
Grading
• 10% class assignments
• 30% home assignments
• 60% final project/test
Literature list
• Gibas, C., Jambeck, P. Developing Bioinformatics Computer Skills. O'Reilly, 2001.
• Lesk, A. M. Introduction to Bioinformatics. Oxford University Press, 2002.
• Mount, D. W. Bioinformatics: Sequence and Genome Analysis. 2nd ed., Cold Spring Harbor Laboratory Press, 2004.
Advanced Reading
• Jones, N. C., Pevzner, P. A. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.
Course Outline
• Introduction to bioinformatics
• Bioinformatics databases
• Pairwise and multiple sequence alignment
• Searching for sequences in databases
• Searching for motifs in sequences
• Phylogenetics
• RNA secondary structure
• Protein structure: secondary and tertiary structure
• Protein families: motifs, domains, clustering
• The Human Genome Project
• Gene prediction, alternative splicing
• Gene expression analysis (DNA microarrays)
• Comparative genomics, biological networks
Introduction to Bioinformatics
• What is Bioinformatics?
• From DNA to Genome
• What’s next? The post-genomic era
What is Bioinformatics?
“the field of science in which biology, computer science, and information technology merge to form a single discipline.
Ultimate goal: to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.”
Central Paradigm in Molecular Biology
Gene (DNA) --Transcription--> mRNA --Translation--> Protein
DNA -> RNA -> Protein -> Symptoms (Phenotype)
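The two steps of the paradigm can be illustrated with a minimal sketch. It assumes the input is the coding strand of the DNA, uses the standard genetic code, and starts translating at the first base. The function names and the example sequence are illustrative and are not taken from the course material.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA (every T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon until a stop codon or the end.

    Raises KeyError for characters other than A, C, G, U/T.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Illustrative sequence: start codon, eight codons, stop codon.
    mrna = transcribe("ATGGGACAATTGGTTTCTTCTCTGAATTGA")
    print(mrna)
    print(translate(mrna))  # MGQLVSSLN
```

Building the table from the two strings keeps the listing short; a hand-written dictionary of all 64 codons would be equivalent.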
21st-century biology: from a purely lab-based science to an information science
Central Paradigm of Bioinformatics
Genetic Information -> Molecular Structure -> Biochemical Function -> Symptoms
From DNA to Genome
Timeline, 1955-1985. Milestones shown:
• Watson and Crick DNA model
• Sanger sequences insulin protein
• Dayhoff’s Atlas of Protein Sequences
• ARPANET (early Internet)
• N-W (Needleman-Wunsch) sequence alignment
• PDB (Protein Data Bank)
• Sanger dideoxy DNA sequencing
• GenBank database
• PCR (Polymerase Chain Reaction)
Timeline, 1990-2000. Milestones shown:
• SWISS-PROT database
• USA’s NCBI
• Europe’s EBI
• Israel’s INN
• WWW (World Wide Web)
• FASTA algorithm
• BLAST algorithm
• Human Genome Initiative
• First bacterial genome
• Yeast genome
• Celera Genomics
• First human genome draft
Complete Genomes
• 1994: 0
• 1995: 1
• 2004: 234
– eukaryotes: 20
– bacteria: 194
– archaea: 19
What’s Next?
The “post-genomics” era
Goal: to understand the functional networks of a living cell
• Annotation
• Comparative genomics
• Structural genomics
• Functional genomics
Annotation
Open reading frames
Functional sites
Structure, function
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATGCGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAACTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTCAGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGAAGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAATAT GGA CAA TTG GTT TCT TCT CTG AAT .................... TGAAAAACGTA
Features annotated on the sequence above:
• TF binding site
• Promoter
• Ribosome binding site
• Transcription start site
• ORF = Open Reading Frame
• CDS = Coding Sequence
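Finding open reading frames can be sketched as a scan of each reading frame for an ATG followed by an in-frame stop codon. This is a simplified illustration, not a gene finder: it assumes the forward strand only, ignores start codons nested inside an already open ORF, and uses a hypothetical `min_codons` cutoff. The function name and test sequence are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the 0-based index of the A in ATG; end is exclusive and
    includes the stop codon; frame is 0, 1, or 2.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count the codons from ATG up to, but excluding, the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGGGACAATTGTGACC"))  # [(2, 17, 2)]
```

A fuller treatment would also scan the reverse complement and score candidate ORFs, which is beyond this sketch.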
Comparative genomics
Comparing ORFs
Identifying orthologs
Concluding on structure and function
Comparing functional sites
Concluding on regulatory networks
Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.
Conservation of IGFALS (insulin-like growth factor binding protein, acid-labile subunit) between human and mouse.
Ultraconserved Elements in the Human Genome
Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W. James Kent, John S. Mattick, David Haussler
There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. Along with more than 5000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than are proteins and appear to be essential for the ontogeny of mammals and other vertebrates.
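The notion of "absolutely conserved" segments (100% identity, no insertions or deletions) can be illustrated with a minimal sketch. It assumes two sequences that are already aligned, with equal length and '-' marking gaps. The function name `conserved_runs` and the toy sequences are hypothetical; the paper compared three whole genomes, and its actual pipeline is not reproduced here.

```python
def conserved_runs(aln_a, aln_b, min_len=200):
    """Return (start, end) alignment columns of runs that are identical
    and gap-free in both sequences, at least min_len columns long.
    end is exclusive.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    start = None  # first column of the run currently being extended
    for i, (a, b) in enumerate(zip(aln_a, aln_b)):
        identical = a == b and a != "-"
        if identical and start is None:
            start = i
        elif not identical and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and len(aln_a) - start >= min_len:
        runs.append((start, len(aln_a)))
    return runs


if __name__ == "__main__":
    # Toy alignment with a small threshold so the output is easy to follow.
    print(conserved_runs("ACGTACGT-AC", "ACGTTCGTTAC", min_len=3))
    # [(0, 4), (5, 8)]
```

With `min_len=200` the same scan would report only segments of the length discussed in the abstract.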
Functional genomics
Genome-wide profiling of:
• mRNA levels
• Protein levels
Co-expression of genes and/or proteins
Identifying protein-protein interactions
Networks of interactions
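One common way to quantify co-expression is the Pearson correlation between expression profiles. The slides do not name a measure, so this choice is an assumption, as are the gene names, the expression values, and the 0.9 threshold in the sketch below.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above threshold."""
    genes = sorted(profiles)
    pairs = []
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            if pearson(profiles[gene_a], profiles[gene_b]) >= threshold:
                pairs.append((gene_a, gene_b))
    return pairs


if __name__ == "__main__":
    # Hypothetical expression levels across four conditions.
    profiles = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.5],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    print(coexpressed_pairs(profiles))  # [('geneA', 'geneB')]
```

The resulting pairs can be read as edges of a co-expression network, which connects this step to the networks of interactions listed above.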
Understanding the function of genes and other parts of the genome
Structural genomics
Assign structure to all proteins encoded in a genome
Structural Genomics Expectations
• Currently: ~300 unique folds in PDB (27,761 structures)
• Estimate: 1,000-3,000 unique folds in “structure space”
Database Types
Sequence databases
• General: GenBank, EMBL, PIR, Swiss-Prot
• Special: TF binding sites, promoters
• Genomes
Structure databases
• General: PDB
• Special: specific protein families, folds
Databases of experimental results
• Co-expressed genes, protein-protein interactions, etc.
• World Wide Web
– USA National Center for Biotechnology Information: www.ncbi.nlm.nih.gov
– European Bioinformatics Institute: www.ebi.ac.uk
– ExPASy Molecular Biology Server: www.expasy.org
– Israeli National Node: inn.org.il
http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm
Entrez – NCBI Engine
• Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?itool=toolbar
Nucleotide
The Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB.
April 2004: 38,989,342,565 bases
PubMed
• MEDLINE publication database
– Over 17,000 journals
– Some other citations
• Papers from the 1960s onward
– Over 12,000,000 entries
• Alerting services
– http://www.pubcrawler.ie/
– http://www.biomail.org/
OMIM
• Online Mendelian Inheritance in Man
– Genes and genetic disorders
– Edited by a team at Johns Hopkins
– Updated daily
• Entries
– 10670 single-locus phenotypes (*)
– 1294 multi-locus phenotypes (#)
– 2415 unclassified phenotypes
Searching PubMed
• Unstructured searches
– Automatic term mapping
• Structured searches
– Field names, e.g. [au], [ta], [dp], [ti]
– Boolean operators, e.g. AND, OR, NOT, ()
• Additional features
– Subsets, limits
– Clipboard, history
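A structured PubMed query can be assembled from field tags and Boolean operators as in the sketch below. The helper names (`field`, `all_of`, `any_of`, `esearch_url`) and the example search terms are hypothetical. The URL points at the NCBI E-utilities ESearch endpoint, which is not mentioned in the slides and is included only as one assumed way to submit the query programmatically; the sketch builds the URL without contacting the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (an assumption; not from the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term, tag):
    """Attach a PubMed field tag, e.g. field('Haussler D', 'au')."""
    return f"{term}[{tag}]"


def all_of(*clauses):
    """Combine clauses with AND inside parentheses."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses):
    """Combine clauses with OR inside parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def esearch_url(query, retmax=20):
    """Build (but do not fetch) an ESearch URL for a PubMed query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    query = all_of(
        field("Haussler D", "au"),                    # author
        any_of(field("ultraconserved", "ti"),         # title words
               field("conserved elements", "ti")),
        field("2004", "dp"),                          # publication date
    )
    print(query)
    print(esearch_url(query))
```

The same query string can be pasted directly into the PubMed search box; the URL form is only needed for scripted access.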
Searching OMIM
• Search fields
– Disease name, e.g. hypertension
– Cytogenetic location, e.g. 1p31.6
– Inheritance, e.g. autosomal dominant
• Browsing interfaces
– Alphabetical by disease
– Genetic map
• Additional features like PubMed