Next–generation DNA sequencing technologies – theory & practice
Dec 15, 2015
Next-Generation sequencing (NGS) technologies – overview
NGS targeted re-sequencing – fishing out the regions of interest
NGS workflow: data collection and processing – the exome sequencing pipeline
Outline
PART I: NGS technologies
Next-Generation sequencing (NGS) technologies – overview
The automated Sanger method is considered a ‘first-generation’ technology; newer methods are referred to as next-generation sequencing (NGS).
DNA Sequencing – the next generation
1953 Discovery of DNA double helix structure
1977
◦ A Maxam and W Gilbert "DNA seq by chemical degradation"
◦ F Sanger "DNA sequencing with chain-terminating inhibitors"
1984 DNA sequence of the Epstein-Barr virus, 170 kb
1987 Applied Biosystems - first automated sequencer
1991 Sequencing of human genome in Venter's lab
1996 P. Nyrén and M Ronaghi - pyrosequencing
2001 A draft sequence of the human genome
2003 human genome completed
2004 454 Life Sciences markets first NGS machine
Landmarks in DNA sequencing
Random genome sequencing
• 25 Mb
• 300k reads
• 110 bp
Sanger sequencing
• Targeted
• 700-1000 bp
The newer technologies constitute various strategies that rely on a combination of
◦ Library/template preparation
◦ Sequencing and imaging
Commercially available technologies
◦ Roche – 454: GS FLX Titanium, GS Junior
◦ Illumina: HiSeq 2000, MiSeq
◦ Life: SOLiD 5500xl, Ion Torrent
◦ Helicos BioSciences – HeliScope
◦ Pacific Biosciences – PacBio RS
Produce a non-biased source of nucleic acid material from the genome
Template preparation: STEP1
Current methods:
◦ Randomly break genomic DNA into smaller fragments
◦ Ligate adaptors
◦ Attach or immobilize the template to a solid surface or support
◦ The spatially separated template sites allow thousands to billions of sequencing reactions to be performed simultaneously
Template preparation
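The first two template preparation steps (random fragmentation, adaptor ligation) can be mimicked in silico. The sketch below is only an illustration: the function names, the mean fragment length and the adaptor sequences `AAAA` / `TTTT` are hypothetical placeholders, not values from any kit or protocol.

```python
import random


def fragment(genome, mean_len=200, rng=None):
    """Cut a sequence at random positions into fragments of roughly mean_len bases."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the toy example is reproducible
    n_cuts = max(0, len(genome) // mean_len - 1)
    # Choose distinct cut positions strictly inside the sequence.
    cuts = sorted(rng.sample(range(1, len(genome)), n_cuts))
    bounds = [0] + cuts + [len(genome)]
    return [genome[a:b] for a, b in zip(bounds, bounds[1:])]


def ligate_adaptors(fragments, left="AAAA", right="TTTT"):
    """Attach placeholder adaptor sequences (hypothetical) to both fragment ends."""
    return [left + f + right for f in fragments]


# Toy 1 kb "genome" of random bases.
seed_rng = random.Random(1)
toy_genome = "".join(seed_rng.choice("ACGT") for _ in range(1000))
library = ligate_adaptors(fragment(toy_genome, mean_len=200))
```

Each element of `library` stands for one template molecule that would then be immobilized at its own site on the support.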
Clonal amplification
◦ Roche – 454
◦ Illumina – HiSeq
◦ Life – SOLiD
Single molecule sequencing
◦ Helicos BioSciences – HeliScope
◦ Pacific Biosciences – PacBio RS
Template preparation
In solution – emulsion PCR (emPCR)
◦ Roche – 454
◦ Life – SOLiD
Solid phase – Bridge PCR
◦ Illumina – HiSeq
Template preparation: Clonal amplification
Template preparation: Clonal amplification - emPCR
Sequencing
SOLiD 454
Pyrosequencing
Picotitre plate Pyrosequencing
Sequencing by ligation
Template preparation: Clonal amplification – Bridge PCR
Template preparation: Single molecule templates
HeliScope PacBio
HiSeq Heliscope
The major advance offered by NGS is the ability to cheaply produce an enormous volume of data
The arrival of NGS technologies in the marketplace has changed the way we think about scientific approaches in basic, applied and clinical research
PART II: NGS targeted resequencing
fishing out the regions of interest
The beginning
Random genome sequencing
???
???
Sanger sequencing
• Targeted
• 700-1000 bp
◦ Library/template preparation
◦ Library enrichment for target
◦ Sequencing and imaging
Target enrichment strategies
Random genome sequencing
Hybrid Capture
PCR based
Sanger sequencing
Target enrichment strategies
Target enrichment strategies: MIP
Hybrid Capture
In solution
• Agilent
• Nimblegen
• ...
Solid phase
• Agilent
• Nimblegen
• Febit
• ...
Hybrid Capture
In solution
• Relatively cheap
• High throughput is possible
• Small amounts of DNA sufficient
Solid phase
• Straightforward method
• Flexible
• Higher amounts of DNA
Target enrichment strategies
PCR based approaches
• Uniplex
• Multiplex
• Fluidigm
• Raindance
• Multiplicon
• Long-range PCR products
• Raindance
PCR based approaches: Raindance
PCR based approaches: Fluidigm
• 48.48 Access Array
Target enrichment strategies
PART III: NGS workflow
data collection and processing – the exome sequencing pipeline
The human genome
◦ Genome = 3 Gb
◦ Exome = 30 Mb
◦ 180 000 exons
Protein coding genes
◦ constitute only approximately 1% of the human genome
◦ It is estimated that 85% of the mutations with large effects on disease-related traits can be found in exons or splice sites
Whole Exome Sequencing
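The "approximately 1%" figure follows directly from the sizes quoted above. A quick check, using only the slide's own numbers (the variable names are illustrative):

```python
GENOME_BP = 3_000_000_000   # 3 Gb, from the slide
EXOME_BP = 30_000_000       # 30 Mb, from the slide
N_EXONS = 180_000           # from the slide

# Fraction of the genome covered by the exome.
exome_fraction = EXOME_BP / GENOME_BP

# Average exon length implied by the same figures.
mean_exon_bp = EXOME_BP / N_EXONS

print(f"exome fraction: {exome_fraction:.1%}")
print(f"mean exon length: {mean_exon_bp:.0f} bp")
```

Sequencing only this fraction is what makes exome sequencing so much cheaper per sample than whole-genome sequencing.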
gDNA: 3 Gb
Exome: 38 Mb
NGS
Exome sequencing
Date        exome capture   Seq - 2.5 Gbases   total cost
1/01/2010   1100            5900               7000
1/08/2010   860             2600               3460
1/01/2011   300             1000               1300
The past, present & future
HiSeq specifications:
◦ 2 flow cells
◦ 16 lanes (8 per flow cell)
◦ 200-300 Gbases per flow cell
◦ 10 days for a single run
Exome throughput
◦ 96 @ 60x coverage per run
◦ 3000 @ 60x coverage per year
Exome sequencing capacity
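The capacity figures can be cross-checked with a few lines of arithmetic. This is a rough sketch: the function names are illustrative, the 38 Mb target size is taken from the exome slide, and the assumption that runs are performed back to back all year is mine.

```python
def exomes_per_year(exomes_per_run=96, days_per_run=10, days_per_year=365):
    """Upper bound on yearly throughput if runs follow each other without downtime."""
    full_runs = days_per_year // days_per_run
    return full_runs * exomes_per_run


def raw_gbases_per_exome(gbases_per_flow_cell, flow_cells=2, exomes_per_run=96):
    """Raw sequence available per exome when a run is shared by exomes_per_run samples."""
    return gbases_per_flow_cell * flow_cells / exomes_per_run


def on_target_gbases(coverage=60, target_mb=38):
    """Bases that must land on the target to reach the requested mean coverage."""
    return coverage * target_mb / 1000


print(exomes_per_year())                  # upper bound per year
print(raw_gbases_per_exome(200))          # low end of the 200-300 Gbases range
print(raw_gbases_per_exome(300))          # high end
print(on_target_gbases())                 # Gbases needed on target for 60x
```

Under these assumptions the upper bound is somewhat above the quoted 3000 exomes per year, which leaves room for maintenance and failed runs, and the raw yield per exome exceeds the on-target requirement, which leaves room for off-target reads and duplicates.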
Data processing workflow
Data formatting & QC
Mapping & QC
Variant calling
Variant annotation
Variant filtering/comparison
Data processing
DATA GENERATION, DATA PROCESSING, DATA STORAGE, REPORTING & VALIDATION, RESULTS INTERPRETATION
DATA GENERATION: prepare sample library, perform exome capture, perform sequencing
DATA GENERATION, DATA PROCESSING: image processing, base calling
DATA STORAGE: sequence data, 10-15 Gb / exome
NGS data processing: overview
1. Mapping
2. Duplicate marking
3. Local realignment
4. Base quality recalibration
5. Analysis-ready mapped reads
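To make step 2 concrete, here is a toy version of duplicate marking: reads that map to the same chromosome, position and strand are treated as copies of one original fragment, and all but the best-quality read are flagged. The `Read` record and `mark_duplicates` function are hypothetical simplifications; production pipelines use dedicated tools (for example Picard MarkDuplicates or samtools markdup) that also handle read pairs and clipping.

```python
from collections import namedtuple

# Minimal stand-in for a mapped read (hypothetical record, not a BAM parser).
Read = namedtuple("Read", "name chrom pos strand mean_qual")


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing (chrom, pos, strand) are grouped; the read with the highest
    mean quality in each group is kept, the rest are flagged.
    """
    best = {}
    for r in reads:
        key = (r.chrom, r.pos, r.strand)
        if key not in best or r.mean_qual > best[key].mean_qual:
            best[key] = r
    kept = {r.name for r in best.values()}
    return {r.name for r in reads if r.name not in kept}


reads = [
    Read("r1", "chr1", 100, "+", 30),
    Read("r2", "chr1", 100, "+", 35),  # same start and strand as r1, better quality
    Read("r3", "chr1", 100, "-", 20),  # opposite strand, not a duplicate of r1/r2
    Read("r4", "chr2", 5, "+", 10),
]
duplicates = mark_duplicates(reads)
```

Flagging rather than deleting keeps the reads in the file, so later steps can decide whether to ignore them.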
DATA PROCESSING: QC sequencing, mapping sequences, QC capture exp
DATA PROCESSING: QC NGS, mapping, QC HC
DATA PROCESSING: variant calling, variant annotation
DATA STORAGE: mapping results, 5 Gb / exome
DATA STORAGE: variant calls, 100 Mb / exome
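The three storage figures (sequence data, mapping results, variant calls) add up quickly at full capacity. The sketch below assumes that "Gb" and "Mb" on the slides mean gigabytes and megabytes with 1 Gb = 1000 Mb, and reuses the 3000 exomes per year from the capacity slide; the function name is illustrative.

```python
MB_PER_GB = 1000  # assumption: decimal units


def storage_per_exome_mb(sequence_gb, mapping_gb=5, variants_mb=100):
    """Total storage for one exome in Mb: sequence data + mapping results + variant calls."""
    return sequence_gb * MB_PER_GB + mapping_gb * MB_PER_GB + variants_mb


EXOMES_PER_YEAR = 3000  # from the capacity slide

low_mb = storage_per_exome_mb(10)    # low end of the 10-15 Gb range
high_mb = storage_per_exome_mb(15)   # high end

# Convert yearly totals to Tb (1 Tb = 1 000 000 Mb under the decimal assumption).
yearly_low_tb = low_mb * EXOMES_PER_YEAR / 1_000_000
yearly_high_tb = high_mb * EXOMES_PER_YEAR / 1_000_000

print(f"per exome: {low_mb / MB_PER_GB:.1f}-{high_mb / MB_PER_GB:.1f} Gb")
print(f"per year:  {yearly_low_tb:.1f}-{yearly_high_tb:.1f} Tb")
```

Raw sequence data dominates the total, while variant calls are a negligible share.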
SNPs vs Indels
[Chart: variant counts by type (INDEL, SNP); axis range 0 to 1200000]
exonic vs non-exonic
[Chart: variant counts by category (stopgain SNV, nonsynonymous SNV, nonframeshift insertion, nonframeshift deletion, non-coding, frameshift insertion, frameshift deletion); axis range 0 to 1000000]
Exonic
[Chart: variant counts by category (synonymous SNV, stoploss SNV, stopgain SNV, nonsynonymous SNV, nonframeshift insertion, nonframeshift deletion, frameshift insertion, frameshift deletion); axis range 0 to 20000]
Exonic
[Chart: variant counts by category (stoploss SNV, stopgain SNV, nonframeshift insertion, nonframeshift deletion, frameshift insertion, frameshift deletion); axis range 0 to 500]
DATA PROCESSING: variant filtering; database of known variants (public & private)
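Variant filtering against known variants amounts to a set lookup. The following is a minimal sketch with made-up toy records: the dictionary layout, the `filter_variants` function and the choice to drop synonymous SNVs are assumptions for illustration, while the effect labels are the categories shown in the charts above.

```python
def variant_key(v):
    """Identify a variant by chromosome, position, reference and alternate allele."""
    return (v["chrom"], v["pos"], v["ref"], v["alt"])


def filter_variants(calls, known, drop_effects=("synonymous SNV",)):
    """Keep calls that are absent from the known-variant set and not in drop_effects."""
    kept = []
    for v in calls:
        if variant_key(v) in known:
            continue  # already present in the public or private database
        if v["effect"] in drop_effects:
            continue  # unlikely to change the protein
        kept.append(v)
    return kept


# Toy data, invented for the example.
calls = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G", "effect": "nonsynonymous SNV"},
    {"chrom": "1", "pos": 2000, "ref": "C", "alt": "T", "effect": "synonymous SNV"},
    {"chrom": "2", "pos": 3000, "ref": "G", "alt": "A", "effect": "stopgain SNV"},
]
known = {("1", 1000, "A", "G")}

candidates = filter_variants(calls, known)
```

Only the stopgain variant survives here: the first call is already known and the second is synonymous.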
DATA GENERATION, DATA PROCESSING
◦ Image processing, base calling
◦ QC sequencing, mapping sequences, QC capture exp
◦ Variant calling, variant annotation
◦ Variant filtering; database of known variants (public & private)
DATA STORAGE
◦ Sequence data: 10-15 Gb / exome
◦ Mapping results: 5 Gb / exome
◦ Variant calls: 100 Mb / exome
REPORTING & VALIDATION
◦ Validated variants in candidate genes
RESULTS INTERPRETATION