NIST Program to Develop Genomic Reference Materials
Justin Zook and Marc Salit
Scope of NIST work
• Human Whole Genome RMs
• Synthetic DNA constructs
• Microbial Whole Genome RMs
RM Development Process
1. Select and procure materials
2. Characterize materials
3. Process and integrate data from multiple platforms
4. Confirm selected genotypes
5. Write Report of Analysis
6. Develop methods for end users to obtain performance metrics from the materials
Proposed Timeline for Human RMs
Proposed Timeline for Synthetic Structures
Columns in the original chart: Title, Effort, and a timeline spanning 2011 to 2016. Effort is shown in parentheses; entries marked ** list no effort.

1) Human RMs (535w)
1.1) Select/Procure human DNA for RM (32w)
1.2) **NIST receives packaged DNA for RM/SRM
1.3) Develop bioinformatics pipeline for data integration (97w)
1.4) Human Primary Sequencing (147w)
1.5) Human Homogeneity assessment (8w)
1.6) Analyze homogeneity data and produce preliminary SNP calls for RM (10w)
1.7) Write human RM Report of Analysis (10w)
1.8) Process Human RM for release (24w)
1.9) **Human RM officially released
1.10) Human Sequencing data integration (25w)
1.11) Human Validation (20w)
1.12) Human other characterization methods (48w)
1.13) Analyze validation data and refine sequencing calls (12w)
1.14) Develop pipeline for SVs and test (40w)
1.15) Write Human SRM Report of Analysis (8w)
1.16) Process Human SRM for release (24w)
1.17) **Human SRM officially released
1.18) Procure local data storage (10w)
1.19) Procure Bioinformatics data analysis tools (10w)
1.20) Procure Automated sample prep instrumentation (10w)

2) Microbial RMs (279w)
2.1) Select/Procure microbial DNA for RMs (31w)
2.2) Microbial Primary Sequencing (124w)
2.3) Microbial Homogeneity assessment (6w)
2.4) Microbial Sequencing data integration (40w)
2.4.1) Mapping/Alignment (10w)
2.4.2) Variant calling (12w)
2.4.3) Form consensus variant calls (12w)
2.4.4) Select sites for validation (6w)
2.5) Microbial Validation (8w)
2.6) Microbial other characterization methods (8w)
2.7) Analyze validation data and refine calls (20w)
2.8) Write Microbial SRM Report of Analysis (18w)
2.9) Process Microbial SRM for release (24w)
2.10) **Microbial SRM officially released

3) Synthetic DNA constructs (334w)
3.1) Design constructs (12w)
3.2) Test constructs in silico (10w)
3.3) Procure synthetic DNA for testing (20w)
3.4) Sequence preliminary synthetic DNA (124w)
3.5) Compare preliminary sequencing data (12w)
3.6) Design final RM constructs (8w)
3.7) Procure synthetic DNA for SRMs (16w)
3.8) Sequence synthetic SRMs (88w)
3.9) Sequencing data integration (12w)
3.10) Write synthetic SRM Report of Analysis (8w)
3.11) Process synthetic SRM for release (24w)
3.12) **Synthetic SRM officially released
Proposed Characterization Methods for Whole Genomes
Whole Genome Sequencing
• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
• Illumina
• Complete Genomics
• Upcoming technologies?
  – Ion Proton?
  – Oxford Nanopore?
• 3x replication of sequencing (3 library preps)
Other
• Genotyping microarrays
• Array CGH
• Targeted sequencing
• Fosmid sequencing?
• Optical Mapping?
[Figure: pedigree diagram with rows Father, Mother; NA12878, Husband; Son, Daughter]
Integration of Existing Data to Form Consensus Genotype Calls
Find all possible variant sites
Find sites where all datasets agree
Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
For each site, remove datasets with decreasingly atypical characteristics until all datasets agree
Even if all datasets agree, identify them as uncertain if few have typical characteristics
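The per-site arbitration described in the steps above can be sketched roughly as follows. This is a minimal illustration and not the NIST pipeline: the `Call` record, the single `atypicality` score, and the `atypical_cutoff` and `min_typical` thresholds are all hypothetical stand-ins for whatever bias annotations and cutoffs the real integration uses.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # hypothetical dataset label
    genotype: str       # "hom_ref", "het", or "hom_var"
    atypicality: float  # hypothetical score; larger means more atypical


def consensus_genotype(calls, atypical_cutoff=1.0, min_typical=2):
    """Return (genotype, status) for one candidate variant site.

    Assumed rule: drop the most atypical dataset first and repeat until the
    remaining datasets agree, then mark the site uncertain if too few of the
    agreeing datasets look typical.
    """
    if not calls:
        return ("no_call", "uncertain")

    # Sort ascending so the most atypical dataset is always the last element.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # remove the most atypical remaining dataset

    # Even when all remaining datasets agree, require enough typical ones.
    typical = [c for c in remaining if c.atypicality < atypical_cutoff]
    status = "confident" if len(typical) >= min_typical else "uncertain"
    return (remaining[0].genotype, status)


if __name__ == "__main__":
    site = [
        Call("dataset_A", "het", 0.1),
        Call("dataset_B", "het", 0.2),
        Call("dataset_C", "hom_ref", 3.0),  # disagreeing and atypical
    ]
    print(consensus_genotype(site))
```

Collapsing every bias signal into one score keeps the sketch short; a real integration would more plausibly track several annotations per dataset and decide atypicality per annotation.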
Consensus has lower FN rate than individual datasets

Rows: HiSeq – GATK. Columns: Illumina Omni SNP Array.
                             | Homozygous Reference | Heterozygous  | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.45M                | 7.24k (1.34%) | 5.28k (0.65%)      | N/A
Heterozygous                 | 196 (0.03%)          | 411k (60.7%)  | 133 (0.02%)        | N/A
Homozygous Variant           | 154 (0.02%)          | 150 (0.02%)   | 249k (37.0%)       | N/A

Rows: Integrated Consensus Genotypes. Columns: Illumina Omni SNP Array.
                             | Homozygous Reference | Heterozygous  | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.45M                | 613 (0.09%)   | 977 (0.15%)        | N/A
Heterozygous                 | 241 (0.04%)          | 414k (61.5%)  | 173 (0.03%)        | N/A
Homozygous Variant           | 152 (0.02%)          | 61 (0.01%)    | 249k (36.9%)       | N/A
Uncertain                    | 5458 (0.81%)         | 3421 (0.51%)  | 4808 (0.71%)       | N/A

Annotations on both tables: “FNs” and “FPs*”
* Note that most or all of the putative FPs seem to actually be FNs on the microarray
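A small arithmetic check of the slide's claim, using counts copied from the two SNP-array comparisons above. It assumes that the putative FNs are the sites the array calls Heterozygous or Homozygous Variant while the sequencing-based call set reports Homozygous Reference/No Call; the slide marks “FNs” without defining the cells, so that reading is an assumption. The values 7.24k and 5.28k are rounded on the slide, so the totals are approximate.

```python
# Cells taken from the "Homozygous Reference/No Call" row of each table:
# sites the SNP array calls variant but the call set does not (assumed FN cells).
missed_by_call_set = {
    "HiSeq-GATK": {"array_het": 7_240, "array_hom_var": 5_280},  # 7.24k, 5.28k (rounded)
    "Integrated consensus": {"array_het": 613, "array_hom_var": 977},
}

# Total putative false negatives per call set, relative to the SNP array.
putative_fn = {
    name: sum(cells.values()) for name, cells in missed_by_call_set.items()
}

for name, count in putative_fn.items():
    print(f"{name}: {count} putative FNs relative to the SNP array")

ratio = putative_fn["HiSeq-GATK"] / putative_fn["Integrated consensus"]
print(f"HiSeq-GATK has roughly {ratio:.1f}x as many putative FNs as the consensus")
```

The consensus also labels some array-variant sites Uncertain (3421 and 4808 in its table); those sites are counted neither as FNs nor as correct calls here.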
SNP arrays overestimate performance

Rows: HiSeq – GATK. Columns: Illumina Omni SNP Array.
                             | Homozygous Reference | Heterozygous  | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.45M                | 7.24k (1.34%) | 5.28k (0.65%)      | N/A
Heterozygous                 | 196 (0.03%)          | 411k (60.7%)  | 133 (0.02%)        | N/A
Homozygous Variant           | 154 (0.02%)          | 150 (0.02%)   | 249k (37.0%)       | N/A

Rows: HiSeq – GATK. Columns: Integrated Consensus Genotypes.
                             | Homozygous Reference | Heterozygous  | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.52M                | 157k (4.68%)  | 30.3k (0.90%)      | 4.17M
Heterozygous                 | 47 (0.00%)           | 1.90M (56.4%) | 34 (0.00%)         | 16.9k (0.50%)
Homozygous Variant           | 1 (0.00%)            | 298 (0.01%)   | 1.19M (35.3%)      | 73.3k (2.18%)

Annotations: “FNs” and “FPs*” (SNP array table); “FNs” and “FPs” (consensus table)
Samtools has higher FP and lower FN than GATK

Rows: HiSeq – samtools. Columns: Integrated Consensus Genotypes.
                             | Homozygous Reference | Heterozygous  | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.51M                | 49.6k (1.47%) | 6.74k (0.20%)      | 3.93M
Heterozygous                 | 3141 (0.09%)         | 2.00M (59.6%) | 74 (0.00%)         | 175k (5.19%)
Homozygous Variant           | 21 (0.00%)           | 777 (0.02%)   | 1.21M (36.0%)      | 192k (5.71%)

Rows: HiSeq – GATK. Columns: Integrated Consensus Genotypes.
                             | Homozygous Reference | Heterozygous  | Homozygous Variant | Uncertain
Homozygous Reference/No Call | 1.52M                | 157k (4.68%)  | 30.3k (0.90%)      | 4.17M
Heterozygous                 | 47 (0.00%)           | 1.90M (56.4%) | 34 (0.00%)         | 16.9k (0.50%)
Homozygous Variant           | 1 (0.00%)            | 298 (0.01%)   | 1.19M (35.3%)      | 73.3k (2.18%)

Annotations on both tables: “FNs” and “FPs”
Performance Metrics: Characteristics of Mis-calls
[Figure: panels for QUAL/Depth of Coverage and Strand Bias, comparing HiSeq/GATK genotypes (Heterozygous, Hom. Variant, Hom. Ref./No call) against Consensus Genotypes (Heterozygous, Hom. Variant, Hom. Ref., Uncertain)]
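One way a user might summarize a characteristic such as QUAL/Depth of Coverage for each combination of caller genotype and consensus genotype is sketched below. The per-site dictionary layout and its key names (`caller_gt`, `consensus_gt`, `qual`, `depth`) are hypothetical, and the median is only one possible summary; the slide's figure may show full distributions instead.

```python
from collections import defaultdict
from statistics import median


def qual_per_depth_by_cell(sites):
    """Median QUAL/depth for each (caller genotype, consensus genotype) pair.

    `sites` is an iterable of dicts with hypothetical keys
    "caller_gt", "consensus_gt", "qual", and "depth".
    """
    groups = defaultdict(list)
    for site in sites:
        if site["depth"] > 0:  # skip sites with no coverage to avoid dividing by zero
            cell = (site["caller_gt"], site["consensus_gt"])
            groups[cell].append(site["qual"] / site["depth"])
    return {cell: median(values) for cell, values in groups.items()}


if __name__ == "__main__":
    example_sites = [
        {"caller_gt": "het", "consensus_gt": "het", "qual": 300.0, "depth": 30},
        {"caller_gt": "het", "consensus_gt": "het", "qual": 200.0, "depth": 40},
        {"caller_gt": "het", "consensus_gt": "hom_ref", "qual": 20.0, "depth": 40},
    ]
    for cell, value in qual_per_depth_by_cell(example_sites).items():
        print(cell, value)
```

Comparing the summary for concordant cells with that for discordant cells gives a rough sense of whether mis-calls cluster at atypical values of the characteristic.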
Challenges with assessing performance
• All variant types are not equal
• Nearby variants are often difficult to align
• All regions of the genome are not equal
  – Homopolymers, STRs, duplications
  – Can be similar or different in different genomes
• Labeling difficult variants as “uncertain” in the Reference Material leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
• It’s important to consider data from multiple platforms and library preparations when characterizing a Reference Material
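Two of the points above (genotypes fall in more than two categories, and excluding “uncertain” sites raises apparent accuracy) can be illustrated with a toy concordance table. This is a sketch under stated assumptions: the site-to-genotype dictionaries, the category names, and the `apparent_concordance` definition (agreeing sites divided by compared sites) are illustrative choices, not metrics defined in the slides.

```python
from collections import Counter


def concordance_table(test_calls, reference_calls):
    """Count (test genotype, reference genotype) pairs over shared sites."""
    shared = test_calls.keys() & reference_calls.keys()
    return Counter((test_calls[s], reference_calls[s]) for s in shared)


def apparent_concordance(table, exclude_uncertain=True):
    """Fraction of compared sites where the test genotype equals the reference.

    With exclude_uncertain=True, sites the reference labels "uncertain" are
    left out of both numerator and denominator.
    """
    kept = {
        cell: n
        for cell, n in table.items()
        if not (exclude_uncertain and cell[1] == "uncertain")
    }
    total = sum(kept.values())
    agree = sum(n for (test_gt, ref_gt), n in kept.items() if test_gt == ref_gt)
    return agree / total if total else float("nan")


if __name__ == "__main__":
    # Toy data keyed by a hypothetical site identifier.
    reference = {1: "het", 2: "hom_var", 3: "uncertain", 4: "hom_ref"}
    test = {1: "het", 2: "het", 3: "het", 4: "hom_ref"}

    table = concordance_table(test, reference)
    print(apparent_concordance(table, exclude_uncertain=True))
    print(apparent_concordance(table, exclude_uncertain=False))
```

In the toy data the difficult site is labeled uncertain in the reference, so leaving it out raises the apparent concordance, which is the effect the slide cautions about.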