How to Standardise and Assemble Raw Data into Sequences: What Does it Mean for a Laboratory to Use Such Technologies? Dr Joseph Hughes 11th OIE Seminar Saskatoon - 17th June 2015
Decreasing sequencing cost
[Figure: cost per raw megabase of DNA sequence, log scale from $0.01 to $10,000.00, Jul-98 to Sep-17]
http://www.genome.gov/sequencingcosts
Democratization of sequencing
http://omicsmaps.com
Applications of high-throughput sequencing
• Whole genome sequencing
• Genome variability within a host
• De-novo assembly of novel viruses
• Metagenomics of communities
Considerations for a genome assembly pipeline
• Flexible pipeline: handles unknown genotypes or virus samples
• Platform independent: works with data from different platforms
• Virus independent: works on any virus
• Scalable to hundreds or thousands of samples
• Accuracy of SNP calling in the genome (outbreak analysis where samples are more closely related)
Known reference / Unknown reference
Pre-assembly processing
• Check format (sff, fastq)
• Convert to FASTQ
• Remove adaptor contaminants
• Remove host genome contamination
• Quality & length trimming
Reference assembly
De-novo assembly
Contig merge
Scaffolding contigs
Validation
Consensus
Variant calling
Classification
Assembly
Post-assembly processing
Annotation
Genome comparison
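The quality and length trimming step listed under pre-assembly processing can be sketched as follows. This is a minimal illustration, not the method of any particular tool: the thresholds (min_q=20, min_len=30), the Phred+33 offset and the function names are assumptions, and dedicated trimmers use more refined criteria.

```python
def trim_read(seq, qual, min_q=20, min_len=30, offset=33):
    """Trim low-quality bases from the 3' end, then apply a length filter.

    Returns (seq, qual) for the trimmed read, or None if it is too short.
    Thresholds and offset are illustrative assumptions.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]


def parse_fastq(lines):
    """Yield (header, seq, qual) from well-formed FASTQ lines (4 per record)."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual
```

Reads for which trim_read returns None would simply be dropped before assembly.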
Examples
1. 1999-2001 in Northern Italy: emergence of highly pathogenic avian influenza H7N1
• Identify known molecular markers for viral pathogenicity in intra-host viral populations
• OIE & FAO reference lab for Influenza
2. 2010 in the Netherlands: die-off of >1000 wild water frogs and newts
• Isolation, characterisation and relationship to known viruses of the Dutch frog killer
• Van Beurden et al. (2014). Genome Announc.
[Image: hybrid edible frog (Pelophylax kl. esculentus)]
Example 1: Characterization of HPAI signature mutations
Monne et al. (2014). Journal of Virology
Pre-assembly processing
trim_galore and FastQC for quality control
Reference assemblers?
• Hash-based tools: Mosaik, Novoalign, Stampy, Tanoti
• Burrows-Wheeler Transform-based tools: BWA, Bowtie2, NextGenMap
Too many to choose from
http://www.bioinformatics.cvr.ac.uk/Tanoti
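For readers unfamiliar with the Burrows-Wheeler Transform named above, a toy version is sketched here. It sorts all rotations of the text, which is only practical for short strings; production aligners build a compressed index over the transform by far more efficient means. The "$" sentinel and function names are illustrative choices.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text via sorted rotations."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row ending in the sentinel is the original text plus sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]
```

The transform groups identical characters that share a following context, which is what makes the indexed text both compressible and searchable.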
[Figure: log10(DOC) by position for each genome segment (panels: HA, M, NA, NP, NS, PA, PB1, PB2). Labels: Bowtie2 and Stampy; Tanoti]
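Assuming DOC in the plots denotes depth of coverage, a per-position depth profile can be computed from alignment intervals roughly as follows. This sketch treats each aligned read as a gapless (start, end) interval on the reference (1-based, inclusive) and ignores indels and clipping; the function names are hypothetical.

```python
import math


def depth_of_coverage(alignments, ref_len):
    """Per-position depth from (start, end) intervals, 1-based inclusive."""
    diff = [0] * (ref_len + 2)
    for start, end in alignments:
        diff[start] += 1      # coverage rises at the read start
        diff[end + 1] -= 1    # and falls just past the read end
    depth, running = [], 0
    for pos in range(1, ref_len + 1):
        running += diff[pos]
        depth.append(running)
    return depth


def log10_depth(depth):
    """log10 of depth, with None where depth is zero."""
    return [math.log10(d) if d > 0 else None for d in depth]
```

Plotting log10_depth against position for each segment gives a profile of the kind shown on the slide.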
Tablet - assembly
Variant calling: detecting true mutations
• Many tools: LoFreq, Vphaser, DiversiTools
• Using replicates to validate mutations (e.g. FMDV experiments)
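One simple way to use replicates is to keep only variants called independently in more than one replicate. The sketch below is an illustration under assumed inputs: the data layout, the 1% frequency threshold and the two-replicate requirement are hypothetical and are not taken from the talk or from any of the tools listed.

```python
def validated_variants(replicates, min_freq=0.01, min_replicates=2):
    """Keep variants seen at or above min_freq in enough replicates.

    replicates: list of dicts mapping (segment, position, alt_base) to
    the observed frequency in that replicate (hypothetical layout).
    Returns a dict mapping each retained variant to its frequencies.
    """
    seen = {}
    for calls in replicates:
        for variant, freq in calls.items():
            if freq >= min_freq:
                seen.setdefault(variant, []).append(freq)
    return {v: f for v, f in seen.items() if len(f) >= min_replicates}
```

Variants that appear in a single replicate only are more likely to be PCR or sequencing errors than true low-frequency mutations.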
One LPAI sample collected after the identification of HPAI with an HA cleavage site and multiple HPAI-associated mutations at extremely low frequency
[Figure: two heatmaps, "Frequency of LPAI in HPAI samples" and "Frequency of HPAI in LPAI samples"; axes: amino acid changes and samples]
Amino acid changes: PB2_I398T, PB1_D154G, PB1_G216S, PB1_E745K, PA_T61I, PA_K115N, PA_K252E, HA_A130T, HA_T146A, HA_E228A, HA_T454A, HA_R554K, NP_A349T, NP_N376S, NA_K173R, M1_A166V, NS1_I136V, NS1_N139D, NS1_-225R
Sample labels (in extraction order): X4756.99, X4827.99, X4828.99, X4911.99, X4708.99, X4618.99, X4618.99.1, X4749.99; X4295.99, X3675.99, X4829.99, X1744.99, X2732.99, X3283.99
Example 2: Isolation and Sequencing
• From dead wild water frog in September 2013
• Suspension from pooled internal organs
• Inoculated on BF-2 cells (bluegill fry fibroblast cells)
• DNA extracted using DNeasy kit (viral purity of 67%)
• DNA sheared by sonication
• KAPA library preparation
• MiSeq (Illumina) Machine #2 test run: total run 26,700,000 reads including 50% PhiX (16 Gb)
• 13,127,123 paired-end 300 bp reads from the sample (7.9 Gb)
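The yields quoted above can be checked with simple arithmetic. The sketch assumes the read counts are read pairs and that every read is the full 300 bp; under those assumptions the figures of about 16 Gb and 7.9 Gb are reproduced. The function name is illustrative.

```python
def paired_end_yield_gb(read_pairs, read_length):
    """Total bases in gigabases for paired-end reads (two reads per pair)."""
    return read_pairs * 2 * read_length / 1e9


# Assumption: counts are read pairs, all reads are full length (300 bp).
total_run = paired_end_yield_gb(26_700_000, 300)   # about 16 Gb
sample = paired_end_yield_gb(13_127_123, 300)      # about 7.9 Gb

print(f"total run: {total_run:.1f} Gb, sample: {sample:.1f} Gb")
```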
Assembly
• Abyss-pe de-novo assembler reconstructed the full genome in a single contig of 107,260 bp
• 5 different regions had ambiguous/repetitive sequences
• Re-sequencing of ambiguous regions with Sanger sequencing
[Figure: map of the 107,260 bp contig with five regions marked "?". Coordinates shown: 1, 1692, 1693, 21168, 21359, 38364, 38387, 66887, 67100, 73322, 73434, 107260]
Finishing assembly
• CodonCode Aligner for assembling and checking the Sanger sequences
• SequencePatcher.pl to stitch the Sanger sequences into the de-novo contig
• iCORN2
• Final genome: 107,260 bp => 107,772 bp
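The idea of stitching re-sequenced regions into a contig can be sketched as coordinate-based replacement. This is not the implementation of SequencePatcher.pl, whose internals are not described here; the function name and the (start, end, replacement) input layout are assumptions for illustration.

```python
def patch_contig(contig, patches):
    """Replace 1-based inclusive regions of contig with new sequences.

    patches: list of (start, end, replacement) tuples that must not overlap
    (hypothetical layout, coordinates refer to the unpatched contig).
    """
    ordered = sorted(patches)
    for (_, prev_end, _), (start, _, _) in zip(ordered, ordered[1:]):
        if start <= prev_end:
            raise ValueError("patches overlap")
    # Apply from the 3' end so that earlier coordinates stay valid.
    for start, end, replacement in reversed(ordered):
        contig = contig[:start - 1] + replacement + contig[end:]
    return contig
```

Because a replacement may be longer or shorter than the region it replaces, the patched sequence can differ in length from the original contig.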
Annotating
• BLAST to find the most similar annotated genome
• Common Midwife Toad Virus (CMTV) from Spain
• Transfer of annotations from CMTV to the full genome (RATT)
• Identifies inappropriate start codons, frame-shifts
• Correction of transferred models using Artemis
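The kind of check that flags inappropriate start codons and frame-shifts in a transferred gene model can be sketched as below. It is a simplified illustration, not how RATT or Artemis work internally: it assumes an unspliced coding sequence on the forward strand, the standard stop codons and ATG as the only accepted start.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq, start_codons=("ATG",)):
    """Return a list of problems found in a coding sequence (forward strand)."""
    seq = seq.upper()
    problems = []
    if seq[:3] not in start_codons:
        problems.append("inappropriate start codon")
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frame-shift)")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    if any(c in STOP_CODONS for c in codons[:-1]):
        problems.append("internal stop codon (possible frame-shift)")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    return problems
```

Models that return a non-empty list are the ones that would need manual correction.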
[Figure: scale bar 20 kb. Taxa: RGV JQ654586, STIV EU627010, FV3 KJ175144, FV3 AY548484, TFV AF389451, CGSIV KF512820, ADRV KF033124, ADRV KC865735, CMTV NL, CMTV JQ231222, ATV AY150217, EHNV FJ433873, ESV JQ724856. Node values: 84, 95, 100, 100, 76, 100, 100, 100]
Standard formats
• FASTQ: quality score depends on the technology and base caller
• SAM: soon v1.5 extensions
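Because the FASTQ quality encoding depends on the platform and base caller, the same quality string can decode to different scores. The sketch below decodes a quality string with a chosen ASCII offset (33 is the common convention; 64 was used by some older Illumina pipelines) and converts a Phred score Q to an error probability, P = 10^(-Q/10). Function names are illustrative.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    scores = [ord(ch) - offset for ch in quality]
    if min(scores, default=0) < 0:
        raise ValueError("negative score: wrong offset for this encoding?")
    return scores


def error_probability(q):
    """Probability that a base call is wrong, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)
```

A negative decoded score is a quick sign that the wrong offset was assumed for the file.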
Genome standards: 5 categories
Ladner et al. (2014) mBio
Category            1       2           3          4       5
% genome covered    >50%    ~80-90%     ~90-99%    100%    100%
HTS coverage        -       ~15-30 x    >100 x     RACE    ~400-1000 x
Challenges: Rates of increase in data
[Figure: hard disk storage (MB/$) and DNA sequencing (bp/$) by year, 1990 to 2012, log scales]
• Hard disk storage (MB/$): doubling time 14 months
• Pre-NGS (bp/$): doubling time 19 months
• NGS (bp/$): doubling time 5 months
http://genomebiology.com/2010/11/5/207
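The gap between the doubling times can be made concrete with a short calculation: a quantity with doubling time T grows by a factor of 2^(t/T) after t months. The five-year horizon below is an arbitrary choice for illustration, and the function name is hypothetical.

```python
def fold_increase(months, doubling_time_months):
    """Growth factor after `months` for a given doubling time."""
    return 2 ** (months / doubling_time_months)


years = 5  # arbitrary illustrative horizon
storage = fold_increase(12 * years, 14)  # hard disk storage (MB/$)
ngs = fold_increase(12 * years, 5)       # NGS output (bp/$)

print(f"storage: x{storage:.1f}, NGS: x{ngs:.0f}, ratio: {ngs / storage:.0f}")
```

Over the same period, sequence output per dollar outgrows storage per dollar by roughly two orders of magnitude under these doubling times.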
Challenges: resources and technologies
• Shift towards more data: labs need dedicated bioinformaticians
• Rule of thumb: invest as much in computers and data scientists as in sequencing equipment and lab technicians
• Non-uniform coverage, repeat regions, systematic biases, PCR errors, sequencing errors, sequence length
CVR bioinformatics team
Director of OIE Collaborating Centre for Viral Genomics and Bioinformatics
Director of Centre for Virus Research