NCSU Summer Institute of Statistical Genetics, Raleigh 2004: Genome Science Session II: Genome Sequencing
Jan 20, 2016
Genome and EST sequencing
The Summer Institute 2004
•Sequencing Technologies
•Informatics Tools
•Sequencing project approaches
•EST sequencing projects
•Genome sequencing projects
•What we have learned
Some Terms
•Complementary – nucleotide sequences that will form specific hybrids
•Hybridize – to form a duplex
•Label – a molecular tag that facilitates detection
•Oligonucleotide – a short single-stranded piece of nucleic acid
•Anneal – to incubate nucleic acid species together under conditions that promote specific hybridization
Why study genomes
•Molecular biology and biochemistry need a point of entry
•Genetics is reliant on phenotype
•Hypothesis-driven versus data production - parallels with early naturalists and modern-day physics
•Identify similarities and differences amongst diverse life forms
Data mining vs. Data Dredging
Gene structural features
[Figure: gene structure diagram with labels: sequence read, cDNA, genomic DNA, exon, polyA tail]
•Hybridization of complementary strands
•Specificity of base pairing
•Almost any DNA is clonable
•You can have the same sequence - but different genes
Sequencing Technologies
•Basic principles
  •Dideoxy chain termination
  •Electrophoretic separation
  •Visualization
•Innovations
  •Fluorescent tags
  •Thermocycling
  •Capillary electrophoresis
•Novel methodologies
  •Sequencing by hybridization
  •Mass spectrometry
  •Nanopore sequencing
•Other things of note
Primer extension
5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
                            3’-AAATCTAGCTAAGCT-5’
•The extended molecule is the reverse complement of the target
•The extended molecule can be tagged for visualization
•Extension occurs via a 3’ hydroxyl group
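The annealing and extension steps can be sketched in a few lines of Python. This is a minimal illustration, not a model of polymerase chemistry: the function names are hypothetical, the primer is assumed to match its site exactly, and the primer is written 5’ to 3’ (the reverse of how it is drawn under the template).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extend_primer(template: str, primer: str) -> str:
    """Anneal `primer` to `template` (both 5'->3') and extend it fully.

    The primer pairs with the template site that is its reverse complement.
    Bases are then added to the primer's 3' end, copying the template
    upstream (5') of that site.
    """
    site = reverse_complement(primer)
    start = template.find(site)
    if start < 0:
        raise ValueError("primer does not anneal to this template")
    return primer + reverse_complement(template[:start])


template = "ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA"
primer = "TCGAATCGATCTAAA"  # 3'-AAATCTAGCTAAGCT-5' rewritten 5'->3'

product = extend_primer(template, primer)
print(product)
# Because this primer sits at the template's 3' end, the full product
# is the reverse complement of the whole template.
print(product == reverse_complement(template))
```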
Dideoxy chain termination
Dideoxynucleotides (ddNTPs) terminate extension because they lack a 3’-OH
Mixing ddATP with dATP creates a pool of extension products that terminate at each available A
The termination products can be separated by size and visualized by labeling either the ddNTP or the primer
5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
                     3’-ATCGGTCAAATCTAGCTAAGCT-5’
                 3’-ATCGATCGGTCAAATCTAGCTAAGCT-5’
            3’-ATGGCATCGATCGGTCAAATCTAGCTAAGCT-5’
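As a rough sketch, the pool of ddATP termination products can be enumerated by walking along the template from the primer site and stopping wherever an A would be incorporated, that is, opposite each template T. The function name and the 0-based `primer_start` index are assumptions made for this illustration. Products are written 3’ to 5’ so they line up under the template.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq: str) -> str:
    """Base-by-base complement, kept in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)


def ddatp_products(template: str, primer_start: int) -> list:
    """List every product that ddATP termination can yield.

    `template` is 5'->3'; the primer pairs with template[primer_start:].
    Extension moves toward the template's 5' end, and ddATP can only be
    incorporated opposite a template T.
    """
    products = []
    for i in range(primer_start - 1, -1, -1):
        if template[i] == "T":
            # Product spans template position i through the primer site.
            products.append(complement(template[i:]))
    return products


template = "ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA"
primer_start = 28  # 0-based index where the 15-base primer site begins

products = ddatp_products(template, primer_start)
print(template)
for product in products:
    # Right-align each product so it sits under the bases it pairs with.
    print(" " * (len(template) - len(product)) + product)
```

In a real reaction the relative abundance of each product depends on the ddATP to dATP ratio. This sketch only lists which products are possible.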
Sanger sequencing
ddNTP/dNTP mixtures are made up for each of the four nucleotides - adenine, cytosine, guanine, thymine
Proportion of dideoxy to deoxy NTP determines the frequency of termination
Products from the four reactions are separated by size and DNA sequence is inferred
[Figure: sequencing gel images with lanes labeled T C G A and A G C T]
Invert gel to read the sequence 5’ to 3’
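Reading a four-lane gel amounts to sorting all bands by fragment size and noting which lane each band came from. The sketch below assumes clean data with exactly one band per position. The band sizes are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def read_gel(lanes: Dict[str, List[int]]) -> str:
    """Infer the synthesized strand (5'->3') from band sizes in four lanes.

    `lanes` maps each terminator base to the fragment lengths seen in its
    lane. Shorter fragments terminated closer to the primer, so sorting
    every band by length gives the order in which bases were added.
    """
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


# Hypothetical band sizes, counted as bases added beyond the primer.
lanes = {"A": [3], "C": [1, 4], "G": [2], "T": [5]}
print(read_gel(lanes))  # CGACT
```

The smallest fragments run furthest, which is why the gel is read from the bottom up (or inverted) to obtain the 5’ to 3’ order.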
Fluorescent sequencing
Each ddNTP is labeled with a different fluor - now all four products can be run in the same gel lane
Fluorescence is detected using a laser scanner to produce a false color image
Electropherograms (chromatograms) are produced that display peak intensity for each fluor
Can also differentially label the primer to achieve the same end
Cycle sequencing with PCR
Sanger sequencing can require large amounts of template
Polymerase chain reaction exponentially amplifies specific DNAs
Use of ddNTPs allows the combination of amplification and dideoxy terminator sequencing
High-throughput sequencing
•Dideoxy terminator sequencing is robust and flexible
•Microtiter format
•PCR-based cycle sequencing requires less template
•Fluorescent sequencing increased gel capacity 4X
•Supporting robotics upstream of sequencing process
•Computational tools
•Capillary sequencers
Capillary gels
Slab gels make life in the sequencing lab difficult for many reasons:
•Pouring the gel is time consuming and prone to error
•The microtiter plate format (sequencing reactions) has spacing that is different than the gel loading comb - cumbersome
•Assembly and disassembly of the sequencing apparatus is messy and time consuming
•Manual lane tracking is time consuming and prone to error
•Gels never run perfectly - lanes can sometimes run together making lane tracking difficult
Capillary gels help because:
•Each sequencing reaction is run in a separate capillary - there is no lane tracking to worry over
•Matrix for the capillary gel is robotically assembled, injected and QC’d
•Robotic loading of samples is compatible with walk-away capability
Informatics essentials
•Basecallers convert trace data to sequence
•Assemblers form contiguous sequences from small chunks
•Viewers/Editors allow the scientist to interactively work with data
•Databases store sequencing data - from electropherograms to annotation
•Analysis tools compare the sequence against databases of sequences and use algorithms to make educated guesses about the structure and function of a given sequence
Basecalling
•Is the spacing of the peaks what is expected?
•Is there a peak in the electropherogram?
•What fluor is responsible for this peak?
•Since noise ensures the presence of more than one peak, which peak is the correct peak?
•What is the probability that the base that is assigned is the correct base?
•Phred score - Phred 20 (1 error in 100 bases) is a typical quality standard
•TraceTuner - algorithm is similar to Phred but reportedly more accurate with ABI3700 traces, plus accelerated execution
•Others are available
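The Phred quality value is defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong, so Q20 corresponds to 1 error in 100 bases. A minimal sketch of the conversion follows. The helper names are hypothetical and are not part of Phred or TraceTuner.

```python
import math


def phred_to_error(q: float) -> float:
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality value."""
    return -10 * math.log10(p)


def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a read, given per-base scores."""
    return sum(phred_to_error(q) for q in qualities)


for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error in {round(1 / phred_to_error(q))} bases")
```

Summing the per-base error probabilities gives the expected number of errors in a read, which is one simple way to decide where to trim low-quality ends.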
Assembly
• Production of a single contiguous sequence from multiple sequence reads
• The best assembly programs (including Phrap) use probability scores directly from the output of basecallers such as Phred
• Phrap was designed for genome sequencing projects - EST assemblies make different assumptions
• Final assembly products include contigs and singletons
• Accuracy of the contig consensus sequence is based on error models propagated from basecalling software
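To make the idea of contigs and singletons concrete, here is a toy greedy assembler. It is not how Phrap works: it assumes error-free reads, ignores quality scores and reverse-complement reads, and merges on exact suffix-prefix overlaps only. The reads and the `min_len` threshold are made up for the example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns (contigs, singletons): contigs were built from two or more
    reads, singletons are reads that never overlapped anything.
    """
    pieces = [(read, 1) for read in reads]  # (sequence, number of reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, (a, _) in enumerate(pieces):
            for j, (b, _) in enumerate(pieces):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break
        (a, count_a), (b, count_b) = pieces[best_i], pieces[best_j]
        merged = (a + b[best_len:], count_a + count_b)
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    contigs = [seq for seq, count in pieces if count > 1]
    singletons = [seq for seq, count in pieces if count == 1]
    return contigs, singletons


reads = ["ACGTGATCCC", "GATCCCAGTA", "CAGTACCGTAG", "TTTTGGGGAA"]
contigs, singletons = greedy_assemble(reads)
print("contigs:", contigs)
print("singletons:", singletons)
```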
Viewers/editors
Consed
Storage and analysis of sequence
http://www.ebi.ac.uk/Databases/index.html
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
http://www.ncbi.nlm.nih.gov/
• The amount of sequence information deposited in databases is increasing at a very rapid rate
• Tools to manage sequence data are imperfect and in development
• Development of controlled vocabularies and gene ontologies will facilitate database integration
• Analytical tools and algorithm development are growth industries
Impact of database structure
•Flat file databases are great for speed but are not built for integration
•Lack of controlled vocabularies impedes efficient and reliable searching and inhibits integration
•GenBank uses a controlled index and vocabulary - sort of
•Example of searching for genomic sequence, EST sequence and complete cDNAs
•Relational databases are great for integration but can be slow and changing the schema takes an act of Congress
•Flat file databases with robust re-indexing routines have the advantage of speed and the ability to integrate different data types
Sequencing by hybridization
[Figure: a 12-spot array of oligos of known sequence; target fragments such as GCATGC, CATGCA, ATGCAT and TGCATG hybridize to individual spots]
Determine constituent sequences by hybridizing to oligos of known sequence
Assemble sequence fragments into contiguous sequence
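The assembly step of sequencing by hybridization can be sketched as chaining the detected k-mers by their (k-1)-base overlaps. The code below is an idealized illustration: it assumes every k-mer of the target is detected exactly once with no false positives, and the function names and the non-repetitive example target are hypothetical. It also shows why repeats are a problem: a repetitive string such as TGCATGCATGCATGCATG contains only four distinct 6-mers, so its length cannot be recovered.

```python
def spectrum_of(target: str, k: int) -> set:
    """All distinct k-mers of `target`, as an ideal array would report them."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def reconstruct(spectrum) -> str:
    """Rebuild a sequence by chaining k-mers on their (k-1)-base overlaps."""
    kmers = set(spectrum)
    k = len(next(iter(kmers)))
    suffixes = {kmer[1:] for kmer in kmers}
    # The first k-mer is the only one whose prefix is nobody's suffix.
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting probe; target may be repetitive")
    by_prefix = {}
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    sequence = starts[0]
    used = {starts[0]}
    while True:
        tail = sequence[-(k - 1):]
        candidates = [c for c in by_prefix.get(tail, []) if c not in used]
        if not candidates:
            break
        if len(candidates) > 1:
            raise ValueError("ambiguous extension; repeats confound assembly")
        sequence += candidates[0][-1]
        used.add(candidates[0])
    if used != kmers:
        raise ValueError("some probes could not be placed")
    return sequence


print(reconstruct(spectrum_of("ACGTGATCCCAG", 6)))
print(sorted(spectrum_of("TGCATGCATGCATGCATG", 6)))
```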
Sequencing by mass spectrometry
[Figure: example oligos GCATGC, TGCATG, ATGCAT]
Obtain mass spectra from a reference panel of oligos
Fragment the unknown and obtain its mass spectra
Deconvolute the data
454
US Genomics
Nanopore sequencing
•Two solution-filled compartments separated by a membrane with a channel
•Ions flow through the channel in response to an applied voltage
•DNA is negatively charged and will be drawn through the channel
•Channel size allows DNA molecules to be drawn into and through the channel one at a time
•Current is reduced when the channel is occupied by DNA
•Duration of the current drop is proportional to the length of the DNA
•Extent of current drop is indicative of physicochemical properties of DNA - thus, one can infer sequence from the trace
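A simple way to picture the analysis is to scan the current trace for stretches where the current falls below the open-channel level, recording how long each blockade lasts and how deep it is. The sketch below is an assumption-laden toy: the threshold, the units, and the sample trace are invented, and real traces are noisy and need filtering.

```python
def find_blockades(trace, open_current, threshold=0.8):
    """Find runs of samples where current drops below threshold * open_current.

    Returns a list of (start_index, duration_in_samples, mean_current).
    Duration relates to molecule length; depth relates to what occupies
    the channel.
    """
    events = []
    start = None
    for i, current in enumerate(trace):
        blocked = current < threshold * open_current
        if blocked and start is None:
            start = i  # a molecule has entered the channel
        elif not blocked and start is not None:
            segment = trace[start:i]
            events.append((start, i - start, sum(segment) / len(segment)))
            start = None  # the channel is open again
    if start is not None:  # trace ended while the channel was occupied
        segment = trace[start:]
        events.append((start, len(trace) - start, sum(segment) / len(segment)))
    return events


# Hypothetical trace in arbitrary current units, one value per sample.
trace = [100, 100, 40, 40, 40, 100, 100, 30, 30, 100]
for start, duration, depth in find_blockades(trace, open_current=100):
    print(f"event at sample {start}: {duration} samples, mean current {depth}")
```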
Sequencing project approaches
•EST projects
•Map-based: assembly based on physical ordering of clones
•Shotgun: assembly based on computational ordering of sequences
•Combination strategies: minimal scaffolding from physical maps, fill in the blanks by shotgun and directed sequencing
EST sequencing projects
•Only the expressed genome is sequenced, thereby avoiding the “junk”
•Relatively inexpensive and fast - accessible to small laboratories
•May fail to capture many genes because the appropriate biological condition leading to expression is not captured
•May overestimate gene number due to non-overlapping sequences from the same gene
Project is the operative word
Libraries of overlapping clones
•Library clones can be ordered by the presence of restriction sites, known sequences, etc.
•Assembly of contiguous sequences is straightforward because the clones form an ordered array
Map-based sequencing
•Produce large insert libraries in BACs, cosmids, etc. to “cover” the genome multiple times
•Determine a minimal tiling path of clones by restriction mapping, hybridization of end-based probes or end sequencing
•Ordered sets of clones are subcloned into pools of small clones
•Smaller clones can be ordered or sequenced by shotgun methods
•Fewer sequencing runs = lower costs
•Obtaining an ordered array of clones can be time consuming
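Once clones have been placed on a physical map, choosing a minimal tiling path resembles the classic interval-covering problem. The sketch below assumes each clone's start and end coordinates are already known, which in practice is the hard part, and uses a greedy rule: from the current position, pick the overlapping clone that reaches furthest. Clone names and coordinates are hypothetical.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones that together cover [region_start, region_end).

    `clones` maps a clone name to its (start, end) map coordinates.
    Greedy choice: among clones that overlap the current position,
    take the one extending furthest to the right.
    """
    path = []
    position = region_start
    while position < region_end:
        candidates = [
            (end, name)
            for name, (start, end) in clones.items()
            if start <= position < end
        ]
        if not candidates:
            raise ValueError(f"gap in clone coverage at position {position}")
        end, name = max(candidates)
        path.append(name)
        position = end
    return path


# Hypothetical BAC placements, coordinates in kilobases.
clones = {
    "BAC1": (0, 150),
    "BAC2": (50, 180),
    "BAC3": (100, 300),
    "BAC4": (170, 320),
    "BAC5": (290, 450),
    "BAC6": (310, 400),
}
print(minimal_tiling_path(clones, 0, 450))  # ['BAC1', 'BAC3', 'BAC5']
```

Only the clones on the path need to be subcloned and sequenced, which is where the saving in sequencing runs comes from.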
Shotgun sequencing
• Produce sequences from random clones irrespective of their physical order along the chromosomes
• Clones can be small insert or large insert because alignment takes into account only the sequence - not properties of the physical clones
• Assemble sequences to produce contigs
• Identify gaps in contiguous sequence and undersequenced areas
• Perform directed sequencing to fill in the gaps
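Gaps and undersequenced areas can be found by computing read depth along the assembly and flagging stretches that fall below a chosen minimum. This is a minimal sketch assuming each read's placement is already known as a half-open (start, end) interval. The coordinates and depth thresholds are illustrative only.

```python
def depth_profile(read_spans, length):
    """Count how many reads cover each position of a sequence."""
    depth = [0] * length
    for start, end in read_spans:  # half-open intervals [start, end)
        for i in range(max(start, 0), min(end, length)):
            depth[i] += 1
    return depth


def low_coverage_regions(depth, min_depth):
    """Return half-open (start, end) regions where depth < min_depth."""
    regions = []
    start = None
    for i, d in enumerate(depth):
        if d < min_depth and start is None:
            start = i
        elif d >= min_depth and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Hypothetical read placements on a 30-base stretch.
read_spans = [(0, 10), (5, 15), (20, 30)]
depth = depth_profile(read_spans, 30)
print("gaps:", low_coverage_regions(depth, 1))
print("covered fewer than twice:", low_coverage_regions(depth, 2))
```

Regions with zero depth are candidates for directed sequencing, and regions below the target depth are candidates for additional reads.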
Shotgun sequencing issues
•Assembly is computationally intensive
•Repetitive sequences have to be masked so that they do not confound the preliminary alignment
•First pass alignment based upon non-masked sequences to produce contiguous sequence fragments
•Alignments must account for potential polymorphisms
•Repetitive sequences still need to be aligned - their treatment is however distinct from non-repetitive sequences
•Resolution of conflicts in the assembly is challenging
•When is a genome truly finished?
•The press release is only the beginning of the process
Complementary strategies
• Pure shotgun approaches are likely to leave significant gaps
• Directed sequencing of specific regions is necessary to fill in the gaps
• Pure map-based strategies are cumbersome and time consuming and do not take advantage of efficiencies of scale found in modern industrial sequencing
• A complementary approach combines data from both approaches
• There are adherents to working from the bottom-up and working from the top-down
Genome sequencing projects
What is a gene
•ESTs and cDNAs identify those parts of the genome that are actually transcribed
•Transcripts have structural features including starts, stops and open reading frames
•Computers can be trained to “sniff” for relevant features in the sequence
•Genefinding algorithms construct probability models based on presence of one or more gene-like features
•Coordination with genetic features gives a comfort level because it is empirical
•Computational methods that rely on similarity to “known” genes in databases can be perilous - a sort of regressive uncertainty
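One of the simplest gene-like features a program can look for is an open reading frame: a start codon followed in the same frame by a stop codon. The sketch below scans only the forward strand, uses the standard start (ATG) and stop (TAA, TAG, TGA) codons, and applies an arbitrary minimum length. Real genefinders combine many such signals in a probability model, so this is an illustration of one feature, not a genefinder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Find forward-strand ORFs as half-open (start, end) index pairs.

    An ORF here runs from an ATG to the first in-frame stop codon
    (stop included) and must contain at least `min_codons` codons
    before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Hypothetical sequence containing one short ORF.
seq = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```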
BLAST
BLAST Example
Example Sequence
How to make a human
The Human Genome Project
http://www.nature.com/genomics/human/index.html
http://www.sciencemag.org/content/vol291/issue5507/
Genome information challenges
Link to Ensembl
Link to Entrez Genomes
Link to SachDB
Link to TAIR
Link to KEGG
Link to ExPASy
Link to FlyBase
• Data integration from sequence, mutant analysis, mapping, expression analysis, metabolic profiling, and other data types will be the primary challenge to biological science in the 21st century
• Informatics tools are in their infancy
• The literature is growing at a rate surpassing sequence data
• Importance of statistics cannot be overstated
• Gene annotation is regressive
• Danger of balkanization of data?
• Is natural language processing the holy grail?