Next-generation sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department Harvard Nanocourse October 7, 2009
Feb 24, 2016
New sequencing technologies…
… offer vast throughput … & many applications
[Figure: sequencing platforms by read length (x-axis, 10 bp to 1,000 bp) and bases per machine run (y-axis, 1 Mb to 100 Gb)]
• Illumina/Solexa, AB/SOLiD sequencers (10-50 Gb in 25-100 bp reads)
• Roche/454 pyrosequencer (100 Mb-1 Gb in 200-450 bp reads)
• ABI capillary sequencer
Genome resequencing for variation discovery
SNPs
short INDELs
structural variations
• the most immediate application area
Genome resequencing for mutational profiling
Organismal reference sequence
• likely to change “classical genetics” and mutational analysis
De novo genome sequencing
Lander et al. Nature 2001
• difficult problem with short reads
• promising, especially as reads get longer
Identification of protein-bound DNA
Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007)
Transcription factor binding sites (Robertson et al. Nature Methods, 2007)
DNA methylation (Meissner et al. Nature 2008)
• natural applications for next-gen. sequencers
Transcriptome sequencing: transcript discovery
Mortazavi et al. Nature Methods 2008
Ruby et al. Cell, 2006
• high-throughput, but short reads pose challenges
Transcriptome sequencing: expression profiling
Jones-Rhoades et al. PLoS Genetics, 2007
Cloonan et al. Nature Methods, 2008
• high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays
… & enable personal genome sequencing
The re-sequencing informatics pipeline
(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling
(v) data viewing, hypothesis generation
The variation discovery “toolbox”
• base callers
• read mappers
• SNP callers
• SV callers
• assembly viewers
Base error characteristics vary
[Figure: base error types, with panels labelled Illumina and 454]
• Substitutions: 95.34%
• Deletions: 3.23%
• Insertions: 1.43%
Read lengths vary
[Figure: read length ranges per platform; x-axis: read length [bp], 0 to 400]
• ~200-450 (variable)
• 25-100 (fixed)
• 25-50 (fixed)
• 25-60 (variable)
Sequence traces are machine-specific
Base calling is increasingly left to machine manufacturers
Representational biases
Fragment duplication
Read mapping is like a jigsaw
… you get the pieces …
… and they give you the picture on the box
Unique pieces are easier to place than others…
Multiply-mapping reads
• Reads from repeats cannot be uniquely mapped back to their true region of origin
• “Traditional” repeat masking does not capture repeats at the scale of the read length
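A minimal sketch of the point about repeats at the scale of the read length: it counts which read-length windows of a reference occur exactly once, since only those can be mapped back uniquely. The function name `unique_fraction` and the toy reference are illustrative assumptions, and the reverse strand and mismatches are ignored for brevity.

```python
from collections import Counter

def unique_fraction(reference: str, read_length: int) -> float:
    """Fraction of read-length windows in `reference` that occur exactly once."""
    windows = [reference[i:i + read_length]
               for i in range(len(reference) - read_length + 1)]
    counts = Counter(windows)
    unique = sum(1 for w in windows if counts[w] == 1)
    return unique / len(windows)

# Toy reference in which the segment "GACGTACGT" occurs twice.
toy_reference = "TTGACGTACGTCCAGGACGTACGTAAT"
for k in (4, 8, 12):
    print(k, round(unique_fraction(toy_reference, k), 2))
```

Longer windows span the edges of the repeated segment, so the uniquely placeable fraction rises with read length.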
Dealing with multiple mapping
Paired-end (PE) reads
fragment length: 100 – 600bp
Korbel et al. Science 2007
fragment length: 1 – 10kb
PE reads are now the standard for whole-genome short-read sequencing
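One way paired-end reads help with multiple mapping can be sketched as follows: when one mate has several candidate positions, keep only the placements whose implied fragment length falls in the expected library range (100 – 600 bp here). The function name, the coordinates, and the simplification to a single chromosome without strand checks are all assumptions made for illustration.

```python
def consistent_pairs(mate1_hits, mate2_hits, read_length, min_len=100, max_len=600):
    """Return (pos1, pos2, fragment_length) placements inside the expected range.

    Positions are leftmost coordinates on the same chromosome; the fragment
    length is the distance between the mates plus one read length.
    """
    pairs = []
    for p1 in mate1_hits:
        for p2 in mate2_hits:
            fragment = abs(p2 - p1) + read_length
            if min_len <= fragment <= max_len:
                pairs.append((p1, p2, fragment))
    return pairs

# Mate 1 maps uniquely; mate 2 lies in a repeat with three candidate positions.
print(consistent_pairs([10_000], [10_250, 48_000, 910_300], read_length=36))
```

Only the candidate about 250 bp away survives, so the uniquely mapped mate anchors its repeat-derived partner.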
Gapped alignments (for INDELs)
The MOSAIK read mapper
• gapped mapper
• option to report multiple map locations
• aligns 454, Illumina, SOLiD, Helicos reads
• works with standard file formats (SRF, FASTQ, SAM/BAM)
Michael Strömberg
Alignment post-processing
[Figure: actual base quality (y-axis) vs. called base quality (x-axis)]
• quality value re-calibration
• duplicate fragment removal
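Quality value re-calibration can be sketched as a lookup table from called quality to the empirically observed one. This is a simplified illustration, not the method of any particular tool: it bins only by called quality, assumes mismatches at non-polymorphic positions are sequencing errors, and uses add-one smoothing; the function name and the synthetic counts are assumptions.

```python
from collections import defaultdict
from math import log10

def recalibrate(observations):
    """Map each called quality to an empirical Phred-scaled quality.

    `observations` holds (called_quality, is_mismatch) pairs for aligned bases
    at positions assumed to be non-polymorphic.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for quality, is_mismatch in observations:
        totals[quality] += 1
        errors[quality] += int(is_mismatch)
    table = {}
    for quality in totals:
        # Add-one smoothing keeps the error rate above zero in small bins.
        rate = (errors[quality] + 1) / (totals[quality] + 2)
        table[quality] = round(-10 * log10(rate))
    return table

# Synthetic example: 1,000 bases called at Q30, 9 of which mismatch.
synthetic = [(30, True)] * 9 + [(30, False)] * 991
print(recalibrate(synthetic))
```

In this synthetic case the bases called at Q30 behave like roughly Q20 bases, which is the kind of deviation from the diagonal that re-calibration corrects.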
Data storage requirements
Alignment visualization
• too much data – indexed browsing
• too much detail – color coding, show/hide
SNP calling: old problem, new data
sequencing error vs. polymorphism
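$$
\Pr(G_1,\ldots,G_i,\ldots,G_n \mid B) =
\frac{\Pr(G_1,\ldots,G_i,\ldots,G_n)\,\prod_{i=1}^{n}\sum_{k}\Pr(B_i \mid T_i^{k})\,\Pr(T_i^{k} \mid G_i)}
     {\sum_{G_{l_1},\ldots,G_{l_n}}\Pr(G_{l_1},\ldots,G_{l_i},\ldots,G_{l_n})\,\prod_{i=1}^{n}\sum_{k}\Pr(B_i \mid T_i^{k})\,\Pr(T_i^{k} \mid G_{l_i})}
$$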
Allele calling in next-gen data
SNP
INS
New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection
SNP calling in multi-sample read sets
Reads per sample:
B1: -----a----------a----------c----------c-----
Bi: -----a----------a----------a----------a----------c-----
Bn: -----c----------c----------c----------c-----

“genotype likelihoods”
P(B1=aacc|G1=aa)    P(B1=aacc|G1=cc)    P(B1=aacc|G1=ac)
P(Bi=aaaac|Gi=aa)   P(Bi=aaaac|Gi=cc)   P(Bi=aaaac|Gi=ac)
P(Bn=cccc|Gn=aa)    P(Bn=cccc|Gn=cc)    P(Bn=cccc|Gn=ac)

Prior(G1,..,Gi,..,Gn)

“genotype probabilities”
P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc)   P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc)   P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc)   P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc)   P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc)   P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc)   P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc)

P(SNP)
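The flow from genotype likelihoods through a joint prior to genotype probabilities and P(SNP) can be sketched for the three example samples (aacc, aaaac, cccc). This is a toy illustration and not the GigaBayes implementation: the uniform per-base error rate, the bi-allelic a/c site, and the `simple_prior` function with its `theta` parameter are assumptions chosen to keep the example short.

```python
from itertools import product
from math import prod

GENOTYPES = ("aa", "ac", "cc")

def genotype_likelihood(bases, genotype, error=0.01):
    """P(B | G): each read samples one of the two chromosomes, then may err."""
    p = 1.0
    for base in bases:
        p *= sum((1 - error) if allele == base else error for allele in genotype) / 2
    return p

def simple_prior(combo, theta=0.001):
    """Hypothetical joint prior: mass 1 - theta on the two monomorphic
    configurations, theta spread evenly over all polymorphic ones."""
    if len(set(combo)) == 1 and combo[0] != "ac":
        return (1 - theta) / 2
    return theta / (len(GENOTYPES) ** len(combo) - 2)

def joint_posterior(samples, prior=simple_prior, error=0.01):
    """P(G1..Gn | B1..Bn) by enumerating every genotype combination."""
    combos = list(product(GENOTYPES, repeat=len(samples)))
    weights = [
        prior(c) * prod(genotype_likelihood(b, g, error) for b, g in zip(samples, c))
        for c in combos
    ]
    total = sum(weights)
    return {c: w / total for c, w in zip(combos, weights)}

def p_snp(posterior):
    """One minus the probability that all samples are the same homozygote."""
    monomorphic = sum(p for c, p in posterior.items()
                      if len(set(c)) == 1 and c[0] != "ac")
    return 1 - monomorphic

def marginal(posterior, index):
    """Genotype probabilities for one sample."""
    out = dict.fromkeys(GENOTYPES, 0.0)
    for combo, p in posterior.items():
        out[combo[index]] += p
    return out

post = joint_posterior(["aacc", "aaaac", "cccc"])
print(p_snp(post))
print(marginal(post, 0))
```

Enumerating all 3^n combinations is only workable for a handful of samples; it is used here because it mirrors the sum in the denominator directly.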
Trio sequencing
[Equation: Pr(G_C | G_M, G_F), the probability of each child genotype (11, 12, 22) for every combination of mother (G_M) and father (G_F) genotypes]
• the child inherits one chromosome from each parent
• there is a small probability for a de novo (germ-line or somatic) mutation in the child
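The two statements about inheritance and de novo mutation translate into a transmission table Pr(G_C | G_M, G_F). The sketch below assumes a bi-allelic site with alleles 1 and 2 and a symmetric per-transmission mutation probability `mu`; both the parameter and the function names are illustrative, not taken from the original slides.

```python
from itertools import product

def transmit(parent, mu):
    """Distribution over the allele a parent passes on, allowing mutation."""
    dist = {"1": 0.0, "2": 0.0}
    for allele in parent:
        other = "2" if allele == "1" else "1"
        dist[allele] += 0.5 * (1 - mu)  # allele passed on unchanged
        dist[other] += 0.5 * mu         # allele mutated during transmission
    return dist

def child_genotype_probs(mother, father, mu=1e-8):
    """Pr(G_C | G_M, G_F) for child genotypes 11, 12 and 22."""
    from_mother = transmit(mother, mu)
    from_father = transmit(father, mu)
    probs = {"11": 0.0, "12": 0.0, "22": 0.0}
    for (a_m, p_m), (a_f, p_f) in product(from_mother.items(), from_father.items()):
        probs["".join(sorted(a_m + a_f))] += p_m * p_f
    return probs

print(child_genotype_probs("12", "12", mu=0.0))
print(child_genotype_probs("11", "11", mu=1e-3))
```

With `mu=0` the table reduces to plain Mendelian ratios; a non-zero `mu` gives small but non-zero probability to genotypes such as a 12 child of two 11 parents.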
Standard data formats
SRF/FASTQ
SAM/BAM
GLF/VCF
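As a small illustration of the first of these formats, a FASTQ record is four lines: a header, the bases, a separator, and one quality character per base. The reader below is a minimal sketch that assumes strict four-line records and the Phred+33 quality encoding; some instruments of this period wrote other offsets, so the offset is an assumption to check against the data.

```python
def parse_fastq(lines, offset=33):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        sequence = next(it).strip()
        next(it)  # '+' separator line
        quality_line = next(it).strip()
        qualities = [ord(ch) - offset for ch in quality_line]
        yield header.strip()[1:], sequence, qualities

example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
for name, sequence, qualities in parse_fastq(example):
    print(name, sequence, qualities)
```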
Human genome polymorphism projects
common SNPs
Human genome polymorphism discovery
The 1000 Genomes Project
Pilot 1
Pilot 2
Pilot 3
1000G Pilot 3 – exon sequencing
• Targets: 1K genes / 10K targets
• Capture: solid / liquid phase
• Sequencing: 454 / Illumina, SE / PE
• Data producers: Baylor, Broad, Sanger, Wash. U.
• Informatics methods: multiple read mapping & SNP calling programs
Coverage varies
On/off target capture
ref allele*: 45%
non-ref allele*: 54%
Target region
SNP (outside target region)
Fragment duplication – revisited
Reference allele bias
(*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346
ref allele*: 54%
non-ref allele*: 45%
SNP calling findings
                          BCM/454         BI/SLX          WUGSC/SLX   SC/SLX
# Samples                 32              23              16          11
<read depth> per sample   35 X            62 X            117 X       51 X
# SNPs called             7,200 – 8,400   4,500 – 4,700   3,700       3,500 – 3,700
% dbSNPs                  39 - 55         65 - 72         68          75 - 85
Ts/Tv (#SNP)              1.7 – 2.6       1.9 – 2.3       2.3         2.5 – 2.6
# Novel SNPs              3,998           1,550           1,947       892
• based on a method comparison / testing exercise
• 80 samples drawn from the 4 Centers
• read mapping / SNP calling by the Baylor pipeline (BCM/454 data); the Broad and the BC pipelines (all 80 samples)
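The Ts/Tv figures in the findings are the ratio of transitions (A<->G, C<->T) to transversions (all other substitutions) among the called SNPs. A minimal sketch of that calculation follows; the function name and the three-SNP input are illustrative, and each SNP is assumed to be a (reference base, alternate base) pair of distinct uppercase bases.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snps):
    """Transition/transversion ratio for (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    transversions = len(snps) - transitions
    # With no transversions the ratio is unbounded.
    return transitions / transversions if transversions else float("inf")

print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```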
Overlap between call sets
[Figure: Venn diagram of Broad calls and BC calls, with statistics for each of its three regions]
# SNP calls:     452      2,296    413
# dbSNPs:        172      1,862    24
% dbSNPs:        38.05%   81.10%   5.81%
Ts/Tv ratio:     1.21     3.40     0.23
The 1000G Structural Variation Discovery Effort
Structural variation detection
Feuk et al. Nature Reviews Genetics, 2006
SV detection – resolution
[Figure: relative numbers of events vs. CNV event length [bp]; labels: Expected CNVs, Karyotype, Micro-array, Sequencing]
Detection Approaches
• Read depth: good for big CNVs
• Paired-end: all types of SV
• Split-reads: good break-point resolution
• De novo assembly: ~ the future
SV slides courtesy of Chip Stewart, Boston College
Read depth (RD)
Statistical & systematic biases
Single molecule sequencing?
GC Bias
Coverage bias
RD resolution
[Figure: Illumina observed read counts (per kb) vs. expected read counts (per kb), shaded by density (log10); bands labelled CN = 1, CN = 2, CN = 3]
CNV events detected with RD
[Figure: 40 kb deletion in individual “NA12878”; read-depth tracks of log2(o/e) for Helicos, 454 and Illumina data, in windows of 1 kb, 2 kb and 5 kb; x-axis: Chromosome 2 position [Mb]]
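The log2(o/e) quantity on these read-depth tracks can be sketched as follows: for each window, compare the observed read count with the count expected for a normal diploid region and convert the ratio to a copy-number estimate. The function name, the rounding rule, and the example counts are assumptions for illustration; real expected counts would first be corrected for GC content and mappability.

```python
from math import log2

def call_copy_number(observed, expected, ploidy=2):
    """Return (log2(o/e), estimated copy number) for each window."""
    calls = []
    for o, e in zip(observed, expected):
        ratio = o / e
        # log2(0) is undefined, so empty windows are reported as -inf.
        log_ratio = log2(ratio) if ratio > 0 else float("-inf")
        calls.append((log_ratio, round(ploidy * ratio)))
    return calls

# Illustrative window counts: normal, heterozygous deletion, duplication.
print(call_copy_number([100, 50, 150], [100, 100, 100]))
```

A heterozygous deletion halves the depth (log2 ratio of -1) and a single-copy gain raises it by half (about +0.58), which is why larger windows or deeper coverage are needed to separate CN = 2 from CN = 3.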
SV detection with PE read map positions
[Figure: paired-end reads from the individual mapped to the reference; labels: LM, Ldel, Ldup, Linv]
• Deletion (Ldel)
• Tandem duplication (Ldup)
• Inversion (Linv)
Fragment length distributions
• long fragments ~ better fragment coverage and sensitivity to large events (454)
• tighter distributions ~ better breakpoint resolution and sensitivity for shorter events (Illumina)
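Why a tighter fragment length distribution helps with shorter events can be sketched with a simple outlier test: a read pair spanning a deletion maps further apart on the reference than the library allows, and the excess estimates the deletion length. The z-score cutoff, the function name, and the example lengths are illustrative assumptions; production callers cluster several supporting pairs before reporting an event.

```python
from statistics import mean, stdev

def flag_deletion_pairs(spans, library_lengths, z_cutoff=3.0):
    """Return (mapped span, implied deletion length) for outlier read pairs."""
    mu = mean(library_lengths)
    sigma = stdev(library_lengths)
    flagged = []
    for span in spans:
        # Pairs mapping much further apart than the library suggest a deletion.
        if (span - mu) / sigma > z_cutoff:
            flagged.append((span, round(span - mu)))
    return flagged

library = [190, 200, 210, 200, 195, 205]  # illustrative fragment lengths [bp]
print(flag_deletion_pairs([198, 215, 505], library))
```

With a wider library distribution the same 305 bp excess would sit closer to the cutoff, so short deletions become harder to separate from normal fragment length variation.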
The SV/CNV event display
• chromosome overview
• fragment lengths
• read depth
• event track
300 bp deletion in chromosome of NA12878 by Illumina paired-end data from the 1000 Genomes project
Chip Stewart
Deletion event lengths
ALUY
L1
Mobile element insertions: PE reads
• Used with short-read data (Illumina, in our case)
• Detect clusters of 5’ & 3’ read pairs with one end mapping to a mobile element
• Clusters far from annotated elements are candidate insertion events
[Figure: ALU / L1 / SVA element with 5’ end and 3’ end read clusters]
Mobile element insertions: Split reads
• Requires longer reads (454)
• Reads “mapping into” a mobile element not present in the reference genome sequence are candidate insertion events
Mobile element insertions: trio members
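ALU insertions: detection in 1000G Pilot 2 CEU trio PE Illumina data
• NA12878: 733
• NA12891: 684
• NA12892: 664
[Figure: Venn diagram of the three trio members]
• all three: 356
• NA12878 and NA12891 only: 185
• NA12878 and NA12892 only: 182
• NA12891 and NA12892 only: 47
• NA12878 only: 10
• NA12891 only: 96
• NA12892 only: 79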
Mobile element insertions: PE vs. Split-reads
ALU insertions in NA12878
• 454 Split-Read: 817
• SLX PE: 733
[Figure: Venn diagram with region counts 569, 247, 163]
BC event lists in 1000 Genomes data
SV type                     Pilot 1                        Pilot 2
                            (140 samples, low coverage)    (6 samples, high coverage)
deletions                   5,555                          4,718
tandem duplications         540                            406
mobile element insertions   3,276                          2,013
SV calls / validation in 1000G datasets
Software access
http://bioinformatics.bc.edu/marthlab/Software_Release
Credits
Elaine Mardis
Andy Clark
Aravinda Chakravarti
Michael Egholm
Scott Kahn
Francisco de la Vega
Patrice Milos
John Thompson
R01 HG004719
R01 HG003698
R21 AI081220
RC2 HG5552
Lab
Several positions are available: grad students / postdocs / programmers
Can we find exon CNVs in Pilot 3 data?
SVs in exon sequencing data