Computational Genomics
Irit Gat-Viks & Ron Shamir & Haim Wolfson Fall 2015-16
CG © 2015
What’s in class this week
• Motivation
• Administration
• Some very basic biology
• Some very basic biotechnology
• Examples of our type of computational problems
Bioinformatics
• The information science of biology: organize, store, analyze, visualize biological data
• Responds to the explosion of biological data, and builds on the IT revolution
Paradigm shift in biological research
Classical biology: focus on a single gene or sub-system. Hypothesis driven
Systems biology: measure (or model) the behavior of numerous parts of an entire biological system. Hypothesis generating
Large-scale data; Bioinformatics
Personalized medicine
Administration
• ~5 home assignments as part of a home exam, to be done independently (50%)
• Final exam (50%)
• Must pass the Final to pass the course (TAU rules)
• Classes: Tue 12:15-13:30; Thu 14:45-16:00
• TA: Ron Zeira (Thu 16-17).
Administration (cont.)
• Web page of the course: http://www.cs.tau.ac.il/~rshamir/cg/15/
• Includes slides and full lecture scribes of previous years on each of the classes.
Bibliography
• No single textbook covers the course :-(
• See the full bibliography list in the website (also for basic biology)
• Key sources:
– Gusfield: Algorithms on Strings, Trees and Sequences
– Durbin et al.: Biological Sequence Analysis
– Pevzner: Computational Molecular Biology
– Pevzner and Shamir (eds.): Bioinformatics for Biologists
learn.genetics.utah.edu
Lecture 1: Introduction
1. Basic biology
2. Basic biotechnology
+ some computational challenges arising along the way
Slides prepared mainly by Ron Shamir and Adi Akavia
1. Basic Biology
• Touches on Chapters 1-8 in “The Cell” by Alberts et al.
The Cell
• Basic unit of life.
• Carries complete characteristics of the species.
• All cells store hereditary information in DNA.
• All cells transform DNA to proteins, which are “the robots of the cell” and determine cell’s structure and function.
• Two classes: eukaryotes (with nucleus) and prokaryotes (without).
http://regentsprep.org/Regents/biology/units/organization/cell.gif
Nucleotide Chain Double helix
sugar
phosphate
Nucleotides/Bases: Adenine (A), Guanine (G), Cytosine (C), Thymine (T).
Weak hydrogen bonds between base pairs
Strong covalent bonds (phosphodiester linkage) between sugars
Gregor Mendel: laws of inheritance, “gene” (1866)
Watson and Crick: DNA structure (1953)
kidzsoft/src/rnadnatutor.html -5-10/98s/f2http://www.cs.utexas.edu/users/almstrum/s
DNA (Deoxyribonucleic acid)
• Bases:
– Purines: Adenine (A), Guanine (G)
– Pyrimidines: Cytosine (C), Thymine (T)
• Bonds:
– G - C
– A - T
• Oriented from 5’ to 3’.
• Located in the cell nucleus (in eukaryotes)
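The pairing and orientation rules above can be sketched in a few lines of Python. Because the two strands run in opposite directions, the partner of a strand, read 5’ to 3’, is its reversed complement. The name `reverse_complement` is illustrative and not part of the course material.

```python
# Watson-Crick pairing: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3' like the input."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("ACTCCG"))  # CGGAGT
```

Applying the function twice returns the original strand, which is a quick sanity check on any complement table.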
DNA and Chromosomes
• DNA is packaged (10^5-fold)
• Chromatin: complex of DNA and proteins that pack it (histones)
• Chromosome: contiguous stretch of DNA
• Diploid: two homologous chromosomes, one from each parent
• Genome: totality of DNA material
Replication
Replication fork
Genes
• Gene: a segment of DNA that specifies a protein.
• The transformation of a gene into a protein is called expression.
• Genes are < 3% of human DNA
• The rest - non-coding (used to be called “junk DNA”)
– RNA elements
– Regulatory regions
– Retrotransposons
– Pseudogenes
– and more…
Gene Structure
CG © Ron Shamir 2010
The Gene Finding Problem
Given a DNA sequence, predict the location of genes (open reading frames), exons and introns.
• A simple solution: seeking stop codons.
• 6 ways of interpreting a DNA sequence (3 reading frames on each of the 2 strands)
• In most cases of eukaryotic DNA, a segment encodes only one gene.
• Difficulty in eukaryotic DNA: introns & exons
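The "seek stop codons in all six frames" idea can be sketched as follows. This is only an illustration under stated assumptions: ORFs start at ATG and end at TAA, TAG or TGA (the standard start and stop codons), introns are ignored, and the function name and the `min_codons` threshold are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    return "".join(COMPLEMENT[b] for b in reversed(dna))


def open_reading_frames(dna, min_codons=2):
    """List (strand, frame, orf) for every ATG...stop stretch in the six frames.

    min_codons counts the codons before the stop codon (assumed threshold).
    """
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the frame
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        found.append((strand, frame, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return found


print(open_reading_frames("CCATGAAATGACC"))  # [('+', 2, 'ATGAAATGA')]
```

On eukaryotic DNA this simple scan breaks down, since an intron can interrupt the frame between two exons of the same gene.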
Proteins
• Build the cell and drive most of its functions.
• Polymers of amino-acids (20 total), linked by peptide bonds.
• Oriented (from amino to carboxyl group).
• Fold into 3D structure of lowest energy.
DNA → (transcription) → RNA → (translation) → protein
• DNA: the hard disk
• RNA: one program
• Protein: its output
http://www.ornl.gov/hgmis/publicat/tko/index.htm
RNA (Ribonucleic acid)
• Bases:
– Adenine (A)
– Guanine (G)
– Cytosine (C)
– Uracil (U); replaces T
• Oriented from 5’ (phosphate) to 3’ (sugar).
• Single-stranded => flexible backbone => secondary structure => catalytic role.
The RNA Folding Problem
Given an RNA sequence, predict its folding (secondary structure): the one that creates a maximum number of matched pairs.
http://www.phys.ens.fr/~wiese/highlights/RNA-folding.html
GCCUUAAUGCACAUGGGCAAGCCCACGUAGCUAGUCGCGCGACACCAGUCCCAAAUAUGUUCACCCAACUCGCCUGACCGUCCCGCAGUAGCUAUACUACCGACUCCUACGCGGUUGAAACUAGACUUUUCUAGCGAGCUGUCAUAGGUAUGGUGCACUGUCUUUAAUUUUGUAUUGGGCCAGGCACGAAAGGCUUGGAAGUAAGGCCCCGCUUGACCCGAGAGGUGACAAUAGCGGCCAGGUGUAACGAUACGCGGGUGGCACGUACCCCAAACAAUUAAUCACACUGCCCGGGCUCACAUUAAUCAUGCCAUUCGUUGCCGAUCCGACCCAUAAGGAUGUGUAUGCCUCAUUCCCGGUCGGGGCGGCGACUGUUAACGCAUGAGAACUGAUUAGAUCUCGUGGUAGUGCUUGUCAAAUAGAAUGAGGCCAUUCCACAGACAUAGCGUUUCCCAUGAGCUAGGGGUCCCAUGUCCAGGUCCCCUAAAUAAAAGAGUCUCAC
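The "maximum number of matched pairs" objective is commonly solved with a Nussinov-style dynamic program over subsequences. The sketch below is illustrative only: it assumes Watson-Crick pairs (A-U, G-C) only, non-crossing pairs, and a hypothetical `min_loop` parameter for the minimum number of unpaired bases between two paired positions. It returns the pair count, not the structure.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(rna, min_loop=0):
    """Maximum number of nested (non-crossing) base pairs in rna."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = max pairs within rna[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # 3
```

The table has O(n^2) entries and each takes O(n) work, so the running time is cubic in the sequence length.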
Transcription
http://www.iacr.bbsrc.ac.uk/notebook/courses/guide/words/transcriptiongif.htm
Template
The Genetic Code
• Codon - a triplet of bases, codes a specific amino acid (except the stop codons)
• Stop codons - signal termination of the protein synthesis process
• Different codons may code the same amino acid
http://ntri.tamuk.edu/cell/ribosomes.html
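As a small illustration of the codon rules above, the standard genetic code can be built as a lookup table and applied to an mRNA string. The compact 64-letter string lists the amino acids (one-letter codes, `*` for stop) for codons in T, C, A, G order; the function name `translate` is an assumption of this sketch, and translation here simply starts at the first base.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons are enumerated TTT, TTC, TTA, TTG, TCT, ... to match AMINO.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna):
    """Translate codon by codon from position 0 until a stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]
        if amino == "*":  # stop codon: terminate synthesis
            break
        protein.append(amino)
    return "".join(protein)


print(translate("AUGGCCUAA"))  # MA
print(translate("AUGUUUUUC"))  # MFF (UUU and UUC both code for F)
```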
Translation
http://biology.kenyon.edu/courses/biol114/Chap05/Chapter05.html#Protein
Expression and Regulation
DNA (gene) → (transcription) → RNA → (translation) → Protein
Transcription factors (TFs): proteins that control transcription by binding to specific DNA sequence motifs.
Proteins: The Cellular Machines
The Protein Folding Problem
• Given a sequence of amino acids, predict the 3D structure of the protein.
• Motivation: functionality of protein is determined by its 3D structure.
• Solution Approaches:
– Homology
– Threading
– de novo (= from scratch)
The Human Genome: numbers
• 23 pairs of chromosomes
• ~3,200,000,000 bases
• ~21,000 genes
• Gene length: 1000-3000 bases, spanning 30-40K bases
Model Organisms
• Eukaryotes; increasing complexity
• Easy to grow, manipulate.
• Budding yeast: 1 cell, 6K genes
• Nematode worm: 959 cells, 19K genes
• Fruit fly: vertebrate-like, 14K genes
• Mouse: mammal, 30K genes
• Lots of common ground with humans: many / most genes are common – but with mutations
Sequence Alignment problems
• Given two sequences, find their best alignment: match with insertion/deletion of min cost.
• Same for best match of contiguous subsequences.
• Same for several sequences.
• “Workhorse” of Bioinformatics!
• Key challenge: huge volume of data (more on this later)
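A minimal sketch of the two-sequence case, using the classical dynamic program for global alignment. The unit `mismatch` and `gap` costs are assumptions of this example; real scoring schemes differ. Only the optimal cost is returned, and only two rows of the table are kept in memory.

```python
def alignment_cost(a, b, mismatch=1, gap=1):
    """Minimum total cost of aligning a with b (global alignment)."""
    # prev[j] = cost of aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] against the empty prefix: all gaps
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(alignment_cost("ACGT", "AGT"))  # 1 (one deletion)
```

The time is proportional to the product of the two lengths, which is exactly what becomes painful at the data volumes mentioned above.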
Introduction II: Basic Biotechnology and computational challenges
Ron Shamir and Roded Sharan CG, Fall 2014-15
Restriction Enzymes
• Natural role: break foreign DNA entering the cell.
• Ability:
– Breaks the phosphodiester bonds of a DNA upon appearance of a certain cleavage (cut) sequence.
– Different sequence for each enzyme
– Hundreds of different enzymes known.
• Digestion = application of restriction enzymes to a sequence.
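Digestion can be simulated on a string. The sketch below models a single strand only and uses hypothetical names (`digest`, `cut_offset`); as a concrete example it uses the enzyme EcoRI, which recognizes GAATTC and cuts after the first G.

```python
def digest(dna, site, cut_offset):
    """Cut dna at every occurrence of site.

    cut_offset is the position inside the site where the cut falls
    (1 for EcoRI: G^AATTC).
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])  # tail after the last cut
    return fragments


pieces = digest("AAGAATTCGGGGAATTCTT", "GAATTC", 1)
print(pieces)                    # ['AAG', 'AATTCGGGG', 'AATTCTT']
print([len(p) for p in pieces])  # [3, 9, 7]
```

The fragment lengths are what a gel measures; the order of the fragments is lost, which is what makes the digest problems later in the lecture non-trivial.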
Cloning
Cloning vector (plasmids) + Foreign DNA → Recombinant DNA → Introduction into host cell → Use of antibiotics to grow recombinant cells
PCR
[Figure: Cycles 1-3, each consisting of Denaturation, Annealing, Extension]
http://www.atdbio.com/content/20/Sequencing-forensic-analysis-and-genetic-analysis
Gel Electrophoresis
• Use: “race” digested DNA fragments through electrically charged gel
• Goals:
– Separate a mixture of DNA fragments
– Measure length of DNA fragments
• How does it work:
– smaller molecules travel faster than larger ones
– same size and shape ⇒ the same movement speed
http://dlab.reed.edu/projects/vgm/vgm/VGMProjectFolder/VGM/RED/RED.ISG/mapping.html
The Double Digest Problem
Given 3 sets of distances {X_i}, {Y_i}, {Z_i}, reconstruct cut sites A_1 < … < A_n and B_1 < … < B_m s.t.
– {A_i - A_{i-1}} = {X}, {B_i - B_{i-1}} = {Y}
– for C = A ∪ B (ordered), {C_i - C_{i-1}} = {Z}
Complexity: NP-hard, many heuristics.
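To make the problem statement concrete, here is a brute-force sketch that tries every ordering of the X and Y fragments and checks whether the merged cut sites reproduce Z. It is exponential and only meant for tiny inputs; all names are illustrative, and it assumes the three fragment multisets come from the same molecule (equal total length).

```python
from itertools import accumulate, permutations


def fragment_lengths(sites, total):
    """Sorted fragment lengths produced by cutting [0, total] at sites."""
    points = [0] + sorted(set(sites)) + [total]
    return sorted(q - p for p, q in zip(points, points[1:]))


def double_digest(x, y, z):
    """Return cut-site lists (A, B) consistent with x, y, z, or None."""
    total = sum(x)
    if sum(y) != total or sum(z) != total:
        return None
    target = sorted(z)
    for px in set(permutations(x)):
        a = list(accumulate(px))[:-1]  # interior cut positions of enzyme A
        for py in set(permutations(y)):
            b = list(accumulate(py))[:-1]  # interior cut positions of enzyme B
            if fragment_lengths(a + b, total) == target:
                return a, b
    return None


solution = double_digest([2, 3, 5], [4, 6], [2, 2, 3, 3])
print(solution)  # one consistent pair of cut-site lists
```

Solutions are generally not unique (the mirror image of a valid map is also valid), so the sketch returns whichever consistent map it meets first.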
The Partial Digest Problem
• Problem: Given a (multi-)set of distances {|X_i - X_j|}, 1 ≤ i < j ≤ n, reconstruct the original series X_1, …, X_n
• Complexity: unknown (yet)
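A common practical approach is backtracking: repeatedly take the largest unexplained distance and try to place a point that far from the left end or from the right end. The sketch below illustrates that idea; it is not claimed to be the method taught in the course, the names are hypothetical, and it may backtrack heavily on hard inputs.

```python
from collections import Counter
from itertools import combinations


def pairwise_distances(points):
    """Sorted multiset of all pairwise distances between the points."""
    return sorted(b - a for a, b in combinations(sorted(points), 2))


def partial_digest(distances):
    """Return one point set (starting at 0) realizing distances, or None."""
    remaining = Counter(distances)
    width = max(remaining)  # the largest distance spans the whole molecule
    remaining[width] -= 1
    points = {0, width}

    def place():
        left = +remaining  # unary plus drops entries with zero count
        if not left:
            return sorted(points)
        y = max(left)
        for candidate in (y, width - y):
            needed = Counter(abs(candidate - p) for p in points)
            if all(remaining[d] >= c for d, c in needed.items()):
                remaining.subtract(needed)
                points.add(candidate)
                result = place()
                if result is not None:
                    return result
                points.remove(candidate)  # undo and try the other side
                remaining.update(needed)
        return None

    return place()


dists = pairwise_distances([0, 2, 4, 7, 10])
print(partial_digest(dists))  # [0, 3, 6, 8, 10], the mirror image of the input
```

The mirror image has exactly the same distance multiset, so it is an equally valid answer.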
Sequencing
• Sequencing: determining the sequence of bases in a given DNA molecule.
• Classical approach: gel electrophoresis
• Basic idea: knowing the lengths of all prefixes ending with letter X gives a partial seq
• Creating DNA strands of different lengths: catalyzing replication in environment with “terminator” A*.
• Repeat separately with C*, G*, T*
• Abilities: reconstructs sequences of 500-1000 nucleotides.
• ---A-----A-
• -CC---CC---
• T---T------
• -----G----G
http://www.atdbio.com/content/20/Sequencing-forensic-analysis-and-genetic-analysis
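The "lengths of all prefixes ending with letter X" idea can be sketched as a merge of four lanes. In this illustration each lane is given as the set of prefix lengths (1-based) observed with that terminator; the lane contents and the function name `read_lanes` are made up for the example, and a length seen in no lane is reported as `N`.

```python
def read_lanes(lanes):
    """Rebuild a sequence from per-base sets of terminated prefix lengths."""
    by_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            if length in by_length:
                raise ValueError(f"two bases claim length {length}")
            by_length[length] = base
    n = max(by_length, default=0)
    # A fragment of length i ends with the i-th base of the sequence.
    return "".join(by_length.get(i, "N") for i in range(1, n + 1))


lanes = {
    "A": {4, 10},
    "C": {2, 3, 7, 8},
    "T": {1, 5},
    "G": {6, 11},
}
print(read_lanes(lanes))  # TCCATGCCNAG
```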
The Sequence Assembly Problem
• Given a set of substrings, find the shortest (super)string containing all the members of the set.
http://www.ornl.gov/hgmis/graphics/slides/images1.html
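A simple heuristic for this problem is greedy merging: repeatedly join the two strings with the largest suffix-prefix overlap. The sketch below is a heuristic illustration only. It always returns a valid superstring but not necessarily the shortest one, it ignores sequencing errors and reverse complements, and its names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Merge reads greedily by maximum overlap into one superstring."""
    # Drop duplicates and reads contained in another read.
    reads = [r for r in set(reads) if not any(r != s and r in s for s in reads)]
    reads.sort()  # deterministic tie-breaking
    while len(reads) > 1:
        k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(reads)
            for j, b in enumerate(reads)
            if i != j
        )
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""


print(greedy_superstring(["AGGT", "GGTC", "GTCA"]))  # AGGTCA
```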
Rearrangement
Rearrangement is a change in the order of complete segments along a chromosome.
http://www.copernicusproject.ucr.edu/ssi/HSBiologyResources.htm
Genome Rearrangements
Challenges:
• Reconstruct the evolutionary path of rearrangements
• Shortest sequence of rearrangements between two permutations
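For intuition, the "shortest sequence of rearrangements" can be computed exactly on very small inputs by breadth-first search. The sketch assumes the only rearrangement is a reversal of a contiguous segment and that gene orientation is ignored (unsigned permutations); it explores up to n! permutations, so it is an illustration and not a practical algorithm. The function name is hypothetical.

```python
from collections import deque


def reversal_distance(source, target):
    """Fewest segment reversals that turn source into target (BFS)."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("inputs must be permutations of the same elements")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        for i in range(len(perm)):
            for j in range(i + 2, len(perm) + 1):  # segments of length >= 2
                nxt = perm[:i] + perm[i:j][::-1] + perm[j:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None


print(reversal_distance([3, 2, 1], [1, 2, 3]))  # 1
print(reversal_distance([2, 3, 1], [1, 2, 3]))  # 2
```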
More problems in sequencing data
• Solve all the problems above (alignment, gene finding, rearrangements,…) on really huge datasets
• Need to handle practical problems of efficiency – time and space
• Need to overcome large noise (errors) due to data size
DNA Microarrays
Hybridization
• DNA double strands form by “gluing” of complementary single strands
• Complementarity rule: A-T, G-C
ACTCCG
||||||
TGAGGC
Gene Expression Arrays
• Assumption: transcription level indicates gene’s importance in a specific condition.
Given the expression profiles of normal vs. disease:
– Build an algorithm to predict if a new sample is normal or disease (classifier)
– Cluster disease profiles into sub-classes
– Cluster genes into functional groups
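One of the simplest classifiers for the first task is nearest centroid: average the training profiles of each class and assign a new sample to the class with the closest average. The sketch below is a toy illustration with invented expression values for three genes. It is not the method of any particular study, and all names are hypothetical.

```python
def centroid(profiles):
    """Gene-wise mean of a list of expression profiles."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(labelled):
    """labelled maps a class label to its list of training profiles."""
    return {label: centroid(profiles) for label, profiles in labelled.items()}


def predict(centroids, sample):
    """Label of the centroid closest to the sample."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], sample))


# Invented toy data: three genes per sample.
training = {
    "normal": [[1.0, 1.0, 5.0], [1.2, 0.8, 5.2]],
    "disease": [[4.0, 4.2, 1.0], [3.8, 4.0, 0.8]],
}
model = train(training)
print(predict(model, [1.1, 1.0, 4.9]))  # normal
print(predict(model, [4.1, 3.9, 1.1]))  # disease
```

With thousands of genes and only dozens of samples, choosing which genes to include matters as much as the classification rule itself.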
Breast cancer treatment
van ’t Veer et al., Nature 2002
FIN
Classifier construction
[Figure: Patients vs. 70-gene signature; groups: No met., Met.]
• Classify to minimize incorrect assignments in the no met. class.
• 17/19 correct predictions on test set.
Human Variation
• DNA of two human beings is ~99.9% identical
• Phenotype and disease variation is due to these 1/1000 mutations
Challenges:
• Associate mutations to specific disease
• Deal with huge datasets (noise and statistics)
Challenges in network analysis: identifying modules
Complexity summary
• ~21,000 genes in the genome
• Hard to identify
• Harder to figure their function
• Even harder to figure how they work together
Promise and problems in comparative genomics (2009)
               genes   arabidopsis  c. elegans  h. sapiens  s. cerevisiae
arabidopsis    26207   -            0.19        0.24        0.42
c. elegans     19992   0.26         -           0.38        0.38
h. sapiens     21673   0.30         0.28        -           0.43
s. cerevisiae   5884   0.21         0.13        0.19        -
a(x,y) = fraction of genes in genome y that have strict orthologs in genome x
Source: 10/2009