-
Course outline
Goal: Learn basic programming and bioinformatic skills to complete a project using available NGS data
Structure:
• Lectures (4)
• Journal club (4)
• Workshops (4)
Grading:
• Problem sets (3)
• Class participation (journal club)
• Project report (oral and written)
-
Introduction to genome sequencing: Approaches and Platforms
Bio472, Spring 2014
Amanda Larracuente
-
Outline
1. History
2. Basic assembly approaches
3. First generation technology
4. Second generation technology
5. Third generation technology
6. Challenges
-
Progress in genome sequencing
NHGRI at genome.gov
1. History
-
History: Sanger sequencing
• Introduced in 1975
• 1982: Bacteriophage lambda
• 1995: H. influenzae
• 1996: Yeast
• 1998: C. elegans
• 2000: Drosophila melanogaster
• 2000: Arabidopsis
• 2001: Human
-
Sequence reads
• Reads
  • Sequence output from a DNA fragment
  • Base qualities
• Paired-end reads
  • Reads from both ends of a DNA fragment
  • Similar to or same as mate pairs (depending on platform)
[Figure: a DNA fragment with paired-end reads at its two ends]
2. Basic Assembly Approaches
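Base qualities are commonly reported as Phred scores, where Q = -10 log10(p) and p is the probability that the base call is wrong. The sketch below decodes a quality string and converts each score to an error probability. The read, the quality string, and the Phred+33 ASCII offset are assumptions made for illustration, not something specified on the slide.

```python
def decode_quality_string(qual, offset=33):
    """Decode an ASCII quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q):
    """Convert a Phred score Q into the probability of a wrong base call."""
    return 10 ** (-q / 10)


read = "GATCGTGTCC"   # hypothetical 10-base read
qual = "IIIIIIII5+"   # hypothetical quality string, one character per base

for base, q in zip(read, decode_quality_string(qual)):
    print(base, q, f"{phred_to_error_prob(q):.4f}")
```

A score of 20 corresponds to a 1 in 100 chance of error, and a score of 40 to 1 in 10,000.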
-
Genome assemblies
Human male karyotype http://www.genome.gov
10^9 short sequencing reads, assembled into a 3 Gb whole genome
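To put the read count in context, mean coverage can be estimated as (number of reads × read length) / genome size. The slide does not give a read length, so the 100 bp used below is an assumption chosen only to make the arithmetic concrete.

```python
def mean_coverage(num_reads, read_length_bp, genome_size_bp):
    """Average number of times each base of the genome is sequenced."""
    return num_reads * read_length_bp / genome_size_bp


# 10^9 reads and a 3 Gb genome come from the slide; 100 bp reads are assumed.
coverage = mean_coverage(num_reads=1e9, read_length_bp=100, genome_size_bp=3e9)
print(f"{coverage:.1f}x")
```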
-
Whole Genome Shotgun (WGS) approach
1. Shear genome into 3-5 kb fragments, clone into plasmids and sequence
2. Find overlapping reads
3. Assemble overlapping reads into contigs
4. Assemble contigs into scaffolds
5. Link scaffolds into “finished” sequence corresponding to chromosomes
[Figure labels: overlapping reads, contig, mate pairs, scaffold, chromosomes, finished assembly (GATCGTGTCCCATTGTCAGATCGTG)]
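Steps 2 and 3 can be sketched as a toy greedy overlap assembler: find the longest suffix-prefix overlap between any two sequences, merge them, and repeat. This is a minimal illustration, not how production assemblers work. It ignores sequencing errors, reverse complements, and reads contained inside other reads, and the read length and spacing are assumptions.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the two sequences with the longest overlap into a contig."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more: the leftovers are separate contigs
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)] + [merged]
    return seqs


genome = "GATCGTGTCCCATTGTCAGATCGTG"                 # example sequence from the slide
reads = [genome[i:i + 13] for i in (0, 4, 8, 12)]    # hypothetical 13-base reads
print(greedy_assemble(reads))
```

With shorter reads the example sequence misassembles, because its first and last seven bases are identical. That is the repeat problem in miniature.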
-
Hierarchical Approach
1. Shear genome into 150 kb fragments and put in BACs
2. Create map of BACs to genome and create a tiling path
3. Shotgun sequence individual BACs from tiling path
4. Assemble BAC sequences
5. Use sequenced tiling path to reconstruct genome
[Figure labels: BACs (100-150 kb inserts), tiling path, mate pairs, scaffold, chromosomes, finished assembly (GATCGTGTCCCATTGTCAGATCGTG)]
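Choosing a tiling path (step 2) amounts to picking a small set of overlapping BACs that covers the region. One simple way to sketch this, assuming each BAC has already been mapped to start and end coordinates, is a greedy interval cover. The BAC names and coordinates below are hypothetical.

```python
def tiling_path(bacs, region_start, region_end):
    """Greedy tiling path: at each step take the BAC that overlaps the covered
    region and extends furthest to the right. bacs is a list of (name, start, end)."""
    path, covered = [], region_start
    while covered < region_end:
        candidates = [b for b in bacs if b[1] <= covered < b[2]]
        if not candidates:
            raise ValueError(f"gap in BAC map at position {covered}")
        name, _, end = max(candidates, key=lambda b: b[2])
        path.append(name)
        covered = end
    return path


# Hypothetical mapped BACs: (name, start, end) in bp
bacs = [
    ("BAC_A", 0, 150_000),
    ("BAC_B", 50_000, 190_000),
    ("BAC_C", 120_000, 270_000),
    ("BAC_D", 140_000, 260_000),
    ("BAC_E", 250_000, 400_000),
]
print(tiling_path(bacs, 0, 400_000))
```

Only the BACs on the returned path need to be shotgun sequenced. The redundant clones (BAC_B and BAC_D here) are skipped.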
-
Comparing assembly approaches
• Whole Genome Shotgun
  • Faster
  • Assembly is a huge computational effort
  • Celera Genomics approach to human genome
• Hierarchical
  • Slower
  • Labor-intensive
  • Higher quality assembly in difficult-to-assemble regions
  • Publicly funded Human Genome Project: took >10 years and cost $3 billion
-
First generation sequencing technology
Shear genomic DNA
Subclone into vectors
Bacterial replication
Isolate amplified clones
Capillary sequencing
3. First generation technology
-
Sanger sequencing
• Chain termination
• Fluorescently labeled, modified nucleotides
• Capillary gel electrophoresis
[Figure: DNA polymerase extends a sequence primer along template DNA. Chain-terminated products of every length from 1 to 21 nt (T, TC, TCT, and so on up to TCTAGAGCCTGCAATACGATC) are separated by fragment size on a capillary gel.]
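The logic of chain termination can be sketched in a few lines: synthesis stops wherever a labeled terminator is incorporated, so the reaction yields one fragment of every possible length, each tagged with its final base. Sorting the fragments by size and reading the labels in order recovers the sequence. The template below is a hypothetical example, and strand orientation is ignored to keep the sketch short.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesized_strand(template):
    """Base-by-base complement of the template: the strand the polymerase builds."""
    return "".join(COMPLEMENT[b] for b in template)


def terminated_fragments(strand):
    """One fragment per position, recorded as (length, labeled terminal base)."""
    return [(n, strand[n - 1]) for n in range(1, len(strand) + 1)]


def read_gel(fragments):
    """Electrophoresis sorts fragments by size; reading labels shortest to longest gives the sequence."""
    return "".join(base for _, base in sorted(fragments))


template = "AGATCTCGGACGTTATGCTAG"   # hypothetical template
strand = synthesized_strand(template)
fragments = terminated_fragments(strand)
random.shuffle(fragments)            # fragments are mixed together in the reaction
print(read_gel(fragments))
```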
-
Applications
• Sequencing PCR fragments
• Sequencing off plasmids
• Sequencing genomes
• Sequencing cDNA libraries
-
Second generation sequencing technology
Shear genomic DNA
Solid support fixation
Amplification
Base detection
Wash and scan
4. Second generation technology
-
454 pyrosequencing
a. Isolate gDNA, fragment and ligate adapters
b. Bind to beads and carry out emulsion PCR (emPCR; 1 fragment/bead)
c. Break emulsion and add beads to fiber-optic slide
d. Pyrosequencing reaction, 1 nt added at a time (peak height corresponds to the number of nucleotides incorporated)
Rothberg and Leamon 2008
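Step d can be illustrated with an idealized flowgram: nucleotides are flowed one type at a time, and the signal in each flow is proportional to how many bases were incorporated, so a homopolymer run shows up as a single taller peak. This is a noise-free sketch. The cyclic flow order "TACG" and the example sequence are assumptions.

```python
def flowgram(sequence, flow_order="TACG", max_flows=400):
    """Return (nucleotide, signal) per flow; signal = bases incorporated in that flow."""
    signals, pos, flow = [], 0, 0
    while pos < len(sequence) and flow < max_flows:
        nuc = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(sequence) and sequence[pos] == nuc:
            run += 1   # the polymerase keeps extending through a homopolymer
            pos += 1
        signals.append((nuc, run))
        flow += 1
    return signals


def call_bases(signals):
    """Turn peak heights back into a base sequence."""
    return "".join(nuc * run for nuc, run in signals)


signals = flowgram("TTAGGGC")   # hypothetical sequence being synthesized
print(signals)
print(call_bases(signals))
```

In real data the peak heights are noisy, so telling a run of 5 from a run of 6 is hard. This is one way to see why errors in homopolymers tend to be insertions and deletions.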
-
Illumina
• Fragment gDNA
• Ligate adapters
• Fix fragments on solid surface
• Bridge amplification to generate clusters
• Sequence one end (using reversible terminators)
• If paired-end, regenerate cluster and sequence the other end
Figure from Mardis 2013
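The relationship between a fragment and its two paired-end reads can be sketched as follows: read 1 is the start of the fragment, and read 2 is read from the opposite strand, so it is the reverse complement of the fragment's other end. The fragment sequence and the read length are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Read 1 from one end of the fragment, read 2 from the other end (opposite strand)."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2


fragment = "GATCGTGTCCCATTGTCAGATCGTG"   # hypothetical fragment
read1, read2 = paired_end_reads(fragment, read_length=8)
print(read1, read2)
```

The two reads point toward each other, and the known fragment size tells the assembler roughly how far apart they should map.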
-
Ion Torrent*
1. Shear DNA, ligate adapters
2. Attach fragments to beads and amplify with emPCR
3. Place beads in wells on plate
4. Flow nucleotides over wells, one at a time
5. DNA polymerase incorporates bases and gives off H+
6. Mini semiconductor reads pH change
*more like 2.5-generation technology
http://www.lifetechnologies.com
-
Applications
• Genome re-sequencing (reference-based assembly)
• Genome sequencing (de novo assembly)
• Transcriptome sequencing (RNA-seq)
• Sequencing DNA associated with proteins (ChIP-seq)
-
Third generation sequencing technology
Shear genomic DNA
Solid support fixation
No amplification
Base detection
Single-molecule sequencing
5. Third generation technology
-
Single molecule sequencing
e.g. Pacific Biosciences (PacBio)
• Single-molecule real-time (SMRT) sequencing
• Real-time detection of fluorescently labeled nucleotides
• Some reads >10 kb
• High error rate
Eid et al. 2009
-
Applications
• Low-depth: Scaffolding contigs (de novo assembly)
• High-depth: Genome sequencing of repetitive regions or structural rearrangements
-
Comparison of NGS technologies (non-exhaustive)
Method            Strategy                   Read length            Error type  Error rate   Output per run
454               Synthesis/pyrosequencing   Up to 700 bp           Indels      1%           400-600 Mbp
SOLiD             DNA ligase                 75 bp                  AT bias     >0.01-0.06%  20-30 Gbp
Illumina (HiSeq)  Synthesis/DNA polymerase   150 bp                 Subs.       >0.1%        600 Gbp
Ion Torrent       H+ detection               90 bp                  Indels      1.5%         1 Gbp
PacBio            Single molecule/synthesis  >2.5 kb (up to 10 kb)  Insertions  15%          75-100 Mbp (5-10 Mbp usable)
6. Challenges
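One way to read the comparison is to multiply read length by error rate, which gives the expected number of errors in a single read. The figures below are taken from the table and treated as point values (the table gives Illumina's rate only as a lower bound, and SOLiD's as a range, so SOLiD is left out). This is rough arithmetic, not a benchmark.

```python
def expected_errors(read_length_bp, error_rate):
    """Expected number of erroneous bases in one read."""
    return read_length_bp * error_rate


# (read length in bp, per-base error rate), approximated from the table
platforms = {
    "454": (700, 0.01),
    "Illumina (HiSeq)": (150, 0.001),
    "Ion Torrent": (90, 0.015),
    "PacBio": (2500, 0.15),
}

for name, (length, rate) in platforms.items():
    print(f"{name}: about {expected_errors(length, rate):.2f} errors per read")
```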
-
The $1000 genome: Illumina!
Reported January 14, 2014:
“The HiSeq X™ Ten, composed of 10 HiSeq X Systems, is the first sequencing platform that breaks the $1000 barrier for a 30x human genome. The HiSeq X Ten System is ideal for population-scale projects focused on the discovery of genotypic variation to understand and improve human health”
http://investor.illumina.com/
-
Summary of technology
• Point:
  • Sequencing is cheap and easy
  • Within reach of individual labs
• Current challenge
  • Computational
  • Data management
NHGRI at genome.gov
-
Repetitive DNA
• Interspersed repeats (e.g. transposable elements)
• Tandem repeats (e.g. satellites, CNVs)
-
Challenges for repetitive DNA
• Repeat unit longer than read length (e.g. transposable elements)
• Repeat unit longer than insert sizes (e.g. transposable elements)
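The underlying problem can be shown with exact matching: a read that lies entirely inside a repeated region matches many positions, while a read that reaches into unique flanking sequence matches only one. The toy genome below is a tandem array of AATATGG between unique flanks. The copy number of 9 and the two reads are assumptions chosen for illustration.

```python
def find_all(genome, read):
    """All start positions at which the read matches the genome exactly."""
    positions, start = [], genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


UNIT = "AATATGG"
genome = "CCTGCGAT" + UNIT * 9 + "TGTACCC"   # copy number of 9 is assumed

repeat_read = "ATATGGAATATGG"   # lies entirely within the repeat array
anchored_read = "GCGATAATATGG"  # starts in the unique left flank

print(len(find_all(genome, repeat_read)), "placements for the repeat read")
print(len(find_all(genome, anchored_read)), "placement for the anchored read")
```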
-
Challenges for repetitive DNA
True genomic sequence:
CCTGCGATAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGTGTACCC
[Figure: short reads drawn from the tandem repeat and its unique flanks, shown beneath the true genomic sequence]
-
Challenges for repetitive DNA
Single end libraries
[Figure: reads aligned beneath the true genomic sequence, with the resulting assembly]
-
Challenges for repetitive DNA
Paired end libraries
[Figure: reads aligned beneath the true genomic sequence, with the resulting assembly]
-
Challenges for repetitive DNA
Paired end + Mate pair libraries
[Figure: reads aligned beneath the true genomic sequence, with the resulting assembly]
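One way mate pairs can help is through their known insert size: if two mates anchored in unique flanks map much closer together in an assembly than the library insert size predicts, sequence between them is probably missing, for example a collapsed tandem repeat. The sketch below compares a hypothetical true sequence with a hypothetical collapsed assembly. All sequences, the insert size, and the tolerance are assumptions, and both mates are given in forward orientation for simplicity.

```python
def mate_span(assembly, mate1, mate2):
    """Distance from the start of mate1 to the end of mate2 in the assembly."""
    start = assembly.find(mate1)
    end = assembly.find(mate2) + len(mate2)
    return end - start


def missing_sequence_estimate(expected_insert_bp, observed_span_bp, tolerance_bp=5):
    """Estimated bases missing between the mates (0 if within tolerance)."""
    diff = expected_insert_bp - observed_span_bp
    return diff if abs(diff) > tolerance_bp else 0


UNIT = "AATATGG"
left, right = "CCTGCGAT", "TGTACCC"          # unique flanks, used as the two mates
true_genome = left + UNIT * 9 + right        # hypothetical: 9 repeat units
collapsed = left + UNIT * 2 + right          # hypothetical assembly with 2 units

insert_size = mate_span(true_genome, left, right)   # what the library would report
observed = mate_span(collapsed, left, right)
missing = missing_sequence_estimate(insert_size, observed)
print(missing, "bp missing, about", missing // len(UNIT), "repeat units")
```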
-
Repeats cause
• Misassemblies
• Complex rearrangements
• Gaps
-
Next gen applications and repeats
• WGS with Sanger
  • Repetitive DNA unstable in cloning vectors
  • Paired end/Mate pairs help with assembly
• 454 pyrosequencing
  • Problems with homopolymers
  • Paired end/Mate pairs help with assembly
• Illumina
  • Repetitive elements longer than read length
  • Deep coverage and mate pairs help with assembly
• PacBio
  • Problem is very high error rate: correcting it requires deep PacBio coverage or short reads
  • Long reads span repeats
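Deep coverage helps with a high error rate because errors in different reads rarely hit the same position, so a per-position majority vote across reads recovers the true base. The sketch below assumes the reads are already aligned, equal in length, and contain only substitution errors. Real long-read errors also include insertions and deletions, which need a proper alignment step. The reads are made up for illustration.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority base at each column of a set of equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads))


# Three hypothetical noisy reads of the same region, each with one error
reads = [
    "GATCGTGT",
    "GATCGAGT",   # error at position 5
    "GTTCGTGT",   # error at position 1
]
print(consensus(reads))
```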
-
Further reading:
• Metzker. 2010. Sequencing technologies - the next generation. Nature Reviews Genetics 11:31-46.
• Mardis. 2013. Next-Generation Sequencing Platforms. Ann. Rev. Anal. Chem. 6:287-303.
• Treangen and Salzberg. 2012. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics 13:36-46.
-
Project background reading
• Brennecke, J, AA Aravin, A Stark, M Dus, M Kellis, R Sachidanandam, GJ Hannon. 2007. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell 128:1089-1103.
• Lemos, B, LO Araripe, DL Hartl. 2008. Polymorphic Y chromosomes harbor cryptic variation with manifold functional consequences. Science 319:91-93.
• Nagao, A, T Mituyama, H Huang, D Chen, MC Siomi, H Siomi. 2010. Biogenesis pathways of piRNAs loaded onto AGO3 in the Drosophila testis. RNA 16:2503-2515.
• Filion, GJ, JG van Bemmel, U Braunschweig, et al. 2010. Systematic protein location mapping reveals five principal chromatin types in Drosophila cells. Cell 143:212-224.
-
Papers
• Akbari, OS, I Antoshechkin, BA Hay, PM Ferree. 2013. Transcriptome profiling of Nasonia vitripennis testis reveals novel transcripts expressed from the selfish B chromosome, paternal sex ratio. G3 (Bethesda) 3:1597-1605.
• Blumenstiel, JP, X Chen, M He, CM Bergman. 2014. An Age-of-Allele Test of Neutrality for Transposable Element Insertions. Genetics 196:523-538.
• Rogers, RL, JM Cridland, L Shao, TT Hu, P Andolfatto, KR Thornton. 2014. Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans. ArXiv preprint.
• Kelleher, ES, DA Barbash. 2013. Analysis of piRNA-mediated silencing of active TEs in Drosophila melanogaster suggests limits on the evolution of host genome defense. Molecular Biology and Evolution 30:1816-1819.
-
Getting set up to run graphical software on BlueHive
• Please go to:
  https://www.circ.rochester.edu/wiki/index.php/Getting_Started
  and https://www.circ.rochester.edu/wiki/index.php/NX_Cluster
• Install an X11 application if needed