genomeinabottle.org
Genome in a Bottle Consortium August 2015
NIST, Gaithersburg, MD
Reference Materials for Clinical Applications of Human Genome Sequencing
Marc Salit, Ph.D. and Justin Zook, Ph.D.
National Institute of Standards and Technology
NIST Released the GIAB Pilot Genome
as RM 8398 in May 2015
GIAB Scope
• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.
• A principal motivation for this consortium is to enable performance assessment of sequencing and science-based regulatory oversight of clinical sequencing.
Genome in a Bottle Consortium Development
• NIST met with sequencing technology developers to assess standards needs
  – Stanford, June 2011
• Open, exploratory workshop
  – ASHG, Montreal, Canada
  – October 2011
• Small workshop at NIST to develop consortium for human genome reference materials
  – FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others
  – developed draft work plan
  – April 2012
• Open, public meetings of GIAB
  – August 2012 at NIST
  – March 2013 at Xgen
  – August 2013 at NIST
  – January 2014 at Stanford
  – August 2014 at NIST
  – January 2015 at Stanford
  – August 2015 at NIST
  – January 28-29, 2016 at Stanford
• Website
  – www.genomeinabottle.org
Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
  – integration of data from multiple platforms
  – sequencing and analysis
• Enable regulated applications
[Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]
NGS Validation Process using Genomes in Bottles
Workflow steps:
• Sample
• gDNA isolation
• Library Prep
• Sequencing
• Alignment/Mapping
• Variant Calling
• Confidence Estimates
• Downstream Analysis
Figure labels: Pre-Analytical Process; Analytical Process; Genome in a Bottle Scope; Clinical Interpretation; GIAB Data
Genome in a Bottle Consortium (GIAB)
Hosted by US National Institute of Standards and Technology
Goal: Provide infrastructure for performance assessment of NGS
• Appropriately consented, widely available DNA samples, distributed by the Coriell Institute
  – Also, QCed Reference Material (RM) versions from controlled lots will be available from NIST
  – Pilot NIST RM 8398: tinyurl.com/giabpilot
• High-accuracy reference data for these samples
• Tools to facilitate their use
  – With the Global Alliance Data Working Group Benchmarking Team
ga4gh.org
High-confidence SNP/indel calls
Zook et al., Nature Biotechnology, 2014.
• methods to develop SNP/indel call set described in manuscript
• broad and quick adoption of call set for benchmarking
  – struck a nerve
Highlights
This workshop
• Progress Update
• Breakouts
  – Analyses for PGP GIAB Trios
  – Other RMs
• GIAB Roadmap
  – Coordinating analyses
  – Other RM plans
  – Papers?
• Using GIAB Products for analytical validation of clinical NGS assays
Future GIAB work
• Beyond support, improvement/development and maintenance of existing GIAB products…
  – What future work should GIAB do that would uniquely take advantage of the momentum we’ve built?
Agenda
Thursday
• Welcome and Status Update
• Break
• Breakout presentations
  – Analysis Team
  – Other Reference Materials
• Lunch (on your own in cafeteria)
• GIAB Roadmap
• Break
• Breakouts to plan to carry out the roadmap
• Plenary to discuss Roadmap plans
Friday
• Additional Analysis breakout if needed
• Using GIAB products for Analytical Validation
• Break
• GIAB products for analytical validation?
• Lunch (on your own in cafeteria)
• Steering committee meeting
Agenda
Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
  – Topic #1: Moving beyond the 'easy' variants and regions of the genome
  – Topic #2: Selecting future genomes for Reference Materials
Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot Reference Material
• Discussion of plans to release pilot Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans and discussion
• Steering committee Overview
• First meeting of the Steering Committee (others adjourn)
Please Note
Slides will be made available on SlideShare after the workshop (see genomeinabottle.org).
Tweets are welcome unless the speaker requests otherwise. Please use #giab as the hashtag.
GIAB Roadmap: Where are we, Where are we going?
• Reference Materials
  – Germline
  – Somatic
• Informatics
  – Analysis of GIAB data
  – Benchmarking
• Documentary Standards/Publications
  – Documentation of methods
  – Supporting Use
GIAB
• Germline Genomes
  – Pilot RM: High-confidence SNPs/indels; RM Release; High-confidence SVs
  – PGP RMs: High-confidence SNPs/indels; RM Release; High-confidence SVs
  – Other ancestries: Do we need trios? Other large families?
  – Sample panels: Many samples with clinically important mutations; Pharmacogenomics
  – In depth analyses: Characterize harder parts of the genome; Diploid de novo assemblies; Assign confidence scores to variants in RMs
• Somatic mutation RMs
  – Interlaboratory study
  – ctDNA/cfDNA/fetal DNA
  – Whole cancer genomes
• Benchmarking tools
  – Define performance metrics
  – Stratification - Assign confidence to types of variants
• Documents/Publications
  – Analyses
  – Best practices/analytic validation
  – Documentary standards
Others working in this space…
Well-characterized genomes
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Human Longevity, Inc.
• Hydatidiform mole haploid cell line
• Genome Reference Consortium
• 1000 Genomes SV group
Performance Metrics
• Global Alliance for Genomics and Health Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
What should GIAB do?
• Beyond support, improvement/development and maintenance of existing in-process GIAB products…
  – What future work should GIAB do that would take advantage of the momentum and unique community we’ve built?
GIAB Progress Update
August 2015
NIST Human Genome Reference Materials (RMs)
• NIST RM 8398 is available!
  – tinyurl.com/giabpilot
  – DNA isolated from large growth cell cultures
  – Stable, homogeneous
  – Best for regulated uses
  – DNA from same cell line at Coriell (NA12878)
• New AJ and Asian Samples
  – Available from Coriell now
  – NIST RM available in 2016
Using high-confidence NIST-GIAB genotypes for NA12878
• NIST has released several versions of high-confidence genotypes for its pilot RM
• These data are presently being used for benchmarking
  – prior to release of RMs
  – SNPs & indels
• ~77% of the genome
• Data on FTP now well-organized
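A figure such as "~77% of the genome" can be checked by summing the interval lengths in a high-confidence BED file and dividing by a genome length. The sketch below is illustrative only: the function names and toy intervals are assumptions, the intervals are assumed not to overlap, and the genome length is supplied by the caller rather than taken from the slides.

```python
def bed_covered_bases(lines):
    """Sum interval lengths from BED-formatted lines (0-based, half-open).

    Assumes intervals do not overlap; header lines are skipped.
    """
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        start, end = int(fields[1]), int(fields[2])
        total += end - start
    return total


def fraction_covered(lines, genome_length):
    """Fraction of a genome of the given length covered by the BED intervals."""
    return bed_covered_bases(lines) / genome_length


# Toy example (hypothetical intervals, not GIAB data)
toy_bed = ["chr1\t0\t100", "chr1\t200\t250"]
print(fraction_covered(toy_bed, genome_length=1000))
```

With real data, the lines would come from the released high-confidence BED file and the denominator from whichever reference length the analysis uses.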
GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of data underlying each call
Uses of GIAB NA12878
Oncology – Molecular and Cellular Tumor Markers
“Next Generation” Sequencing (NGS) guidelines for somatic genetic variant detection
www.bioplanet.com/gcat
Global Alliance for Genomics and Health Benchmarking Task Team
• Formed June 2014 to develop methods and tools for comparing variant calls to a benchmark
• Developed standardized definitions for performance metrics like TP, FP, and FN
• Initial focus on germline SNPs/indels
• Developing benchmarking tools
  – Comparison engine
  – Pluggable web interface with modules for:
    – Reporting/calculation of metrics
    – Visualization/user interface
• Working with Genome in a Bottle Consortium to host data and calls from their well-characterized genomes
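Once TP, FP, and FN counts are defined, summary metrics follow directly. The sketch below uses the standard textbook definitions of precision, recall (sensitivity), and F1. The slides do not say which metrics the Benchmarking Task Team reports, so the class and method names here are illustrative assumptions, not the team's tooling.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Benchmark comparison counts (hypothetical container)."""
    tp: int  # true positives: benchmark variants found in the callset
    fp: int  # false positives: callset variants absent from the benchmark
    fn: int  # false negatives: benchmark variants missing from the callset

    def precision(self) -> float:
        called = self.tp + self.fp
        return self.tp / called if called else 0.0

    def recall(self) -> float:
        truth = self.tp + self.fn
        return self.tp / truth if truth else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Toy counts, not real benchmarking results
c = Counts(tp=90, fp=10, fn=30)
print(c.precision(), c.recall(), c.f1())
```

Returning 0.0 for empty denominators is a convenience choice for this sketch; a real tool might report the metric as undefined instead.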
www.bioplanet.com/gcat
Example User Interface
Global Alliance for Genomics and Health Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
How should we interpret this complex variant on chr21?
Global Alliance for Genomics and Health Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
Beyond simple T/F classification: Genotype errors
Truth | Callset | Description | Proposed Name(s) | CM#1 region match | CM#2 allele match | CM#3 genotype match
0/1 | 1/1 | zygosity/genotype error | GE | TP | 1TP, 1GE | FN
1/1 | 0/1 | zygosity/genotype error | GE | TP | 1TP, 1GE | FN
1/2 | 0/1, 1/1, 0/2, 2/2 | common allele, FN allele | GE_FN | TP | 1TP, 1GE, 1FN | FN
0/1 | 1/2 | common allele, FP allele | GE_FP | TP | 1TP, 1GE, 1FP | FP, FN
1/1 | 1/2 | common allele, FP allele | GE_FP | TP | 1TP, 1GE, 1FP | FP, FN
1/2 | 1/3 | common allele, FP allele, FN allele | GE_FP_FN | TP | 1TP, 1GE, 1FP, 1FN | FP, FN
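The proposed names in the genotype-error table can be derived mechanically from which non-reference alleles the truth and callset genotypes share. The sketch below is one possible reading of the table, not the Benchmarking Task Team's implementation. The function names and the "MATCH" and "NO_COMMON_ALLELE" labels are assumptions introduced here for cases the slide does not cover.

```python
def parse_gt(gt):
    """Parse a genotype string like '0/1' or '1|2' into a sorted tuple of ints."""
    return tuple(sorted(int(a) for a in gt.replace("|", "/").split("/")))


def classify_genotype_error(truth, call):
    """Return a proposed name (GE, GE_FN, GE_FP, GE_FP_FN) for a truth/call pair."""
    t, c = parse_gt(truth), parse_gt(call)
    if t == c:
        return "MATCH"  # not in the slide: genotypes agree exactly
    t_alt = {a for a in t if a != 0}
    c_alt = {a for a in c if a != 0}
    if not (t_alt & c_alt):
        return "NO_COMMON_ALLELE"  # not in the slide: no shared variant allele
    name = "GE"
    if c_alt - t_alt:  # callset has an allele absent from truth
        name += "_FP"
    if t_alt - c_alt:  # truth has an allele absent from the callset
        name += "_FN"
    return name


for truth, call in [("0/1", "1/1"), ("1/2", "0/1"), ("0/1", "1/2"), ("1/2", "1/3")]:
    print(truth, call, classify_genotype_error(truth, call))
```

Phasing is ignored here: '1|2' and '2|1' are treated as the same unordered genotype.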
Global Alliance for Genomics and Health Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
Beyond simple T/F classification: no-calls and half-calls
Truth | Callset | Description | Proposed Name(s) | CM#1 region match | CM#2 allele match | CM#3 genotype match
0/1 | ./1 | half-call, TP allele | HC_TP | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE | TP
1/1 | ./1 | half-call, TP allele | HC_TP | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE | FN
0/1, 1/1 | ./0 | half-call, FN allele(s) | HC_FN | NC, NCV, TP | 1NC, 1NCV, 1FN | FN
1/2 | ./0 | half-call, FN allele(s) | HC_FN | NC, NCV, TP | 1NC, 2NCV, 2FN | FN
1/2 | ./1, ./2 | half-call, TP allele, FN allele | HC_TP_FN | NC, NCV, TP | 1NC, 1NCV, 1TP, 1GE, 1FN | FN
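The half-call names can be derived in a similar way: look at the single allele that was actually called and compare it with the truth alleles. The sketch below is an illustration of the table's logic under that reading. The function name and the "HC_OTHER" label (for pairs the slide does not list) are assumptions introduced here.

```python
def classify_half_call(truth, call):
    """Name a half-call such as './1' against a truth genotype such as '0/1'."""
    t_alt = {a for a in truth.split("/") if a != "0"}
    called = [a for a in call.split("/") if a != "."]
    if len(called) != 1:
        raise ValueError("expected exactly one called allele, e.g. './1'")
    allele = called[0]
    if allele != "0" and allele not in t_alt:
        return "HC_OTHER"  # not in the slide: called allele absent from truth
    has_tp = allele in t_alt             # the called allele is a true variant allele
    has_fn = bool(t_alt - {allele})      # some truth variant allele was not called
    if has_tp and has_fn:
        return "HC_TP_FN"
    if has_tp:
        return "HC_TP"
    if has_fn:
        return "HC_FN"
    return "HC_OTHER"  # not in the slide: e.g. truth 0/0 with call ./0


for truth, call in [("0/1", "./1"), ("1/1", "./0"), ("1/2", "./2")]:
    print(truth, call, classify_half_call(truth, call))
```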
Stratifying False Positives
[Figure: false positives stratified by GC content and by tandem repeat (TR) unit size: 1, 2, 3, 4, <7, >=7]
Credit: Abby Beeler, Ellie Wood
GA4GH - Stratification
Data from GIAB PGP Trios
Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end | 150x150bp | ~300x/individual | on SRA/FTP | SNPs/indels/some SVs
Illumina Long Mate pair | ~6000 bp insert | ~20x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~20-30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics | | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | LFR | | on SRA/FTP | SNPs/indels/phasing
Ion Proton | Exome | 1000x/individual | on SRA/FTP | SNPs/indels in exome
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | Raw reads and assemblies on FTP | SVs/assembly
10X | Linked reads | 30-45x/individual | on FTP | SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
GIAB Analysis Group – New Data Sets
Leaders
• Francisco de la Vega
  – Annai Systems
• Chris Mason
  – Weill Cornell Medical Center
• Tina Graves
  – Washington University
• Valerie Schneider
  – NCBI
• and Justin and Marc
Status
• Analysis Group Responsibilities:
  – https://docs.google.com/document/d/10eA0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXHhtNH1uzw/edit?usp=sharing
• Analysis Milestones:
  – https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
• Analysis Methods:
  – https://docs.google.com/spreadsheets/d/1Je2g85H7oK6kMXbBOoqQ1FMNrvGnFuUJTJn7deyYiS8/edit?usp=sharing
• Analysis Plan:
  – https://drive.google.com/file/d/0B7Ao1qqJJDHQdnVEaVdqbWdEdkE/view?usp=sharing
• Collecting Data into a Central FTP Site
• Recruiting people to help with the work. This could be you. We need volunteers!
Goal: Establish and distribute a set of authoritative benchmark variant calls of all types and sizes, as well as homozygous reference regions, on GIAB PGP trios
Data Release Plans: Real-time, Open, Public Release
Individual Datasets
• Uploaded to GIAB FTP site as they are collected
• Includes raw reads, aligned reads, and variant/reference calls
Integrated High-confidence Calls
• First develop SNP, indel, and homozygous reference calls
• Then develop SV and non-SV calls
• Released calls are versioned
• Preliminary callsets will be made available to be critiqued
SNP/Indel Integration Method Update
• Implementing refined integration methods on DNAnexus
  – Others can readily reproduce results
  – Consistent results for all GIAB genomes
• Validating with released NA12878 RM data
  – Planned completion Sep 2015
• Then, apply to PGP trios
  – Plan to analyze AJ trio by Nov 2015
  – Release of NIST RMs in early 2016
Integration to form high-confidence SNP/indel calls
VCFs with 0 FP PASS and 0 FN PASS+filtered in BED files
• If 1+ datasets PASS and all PASSing datasets have the same genotype
  – High-confidence variant, include in high-confidence regions
• If all datasets are filtered or outside BED
  – Unless alignments are manually inspected: not high-confidence, exclude ±50 bp from high-confidence regions
• If PASSing datasets disagree about genotype or variant
  – Unless alignments are manually inspected: not high-confidence, exclude ±50 bp from high-confidence regions
• If inside BED and not in VCF for 1+ datasets, and no datasets have PASSing variants
  – High-confidence region
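The integration rules above can be sketched as a per-site decision function. This is a simplified illustration, not the NIST integration pipeline. The `Observation` type, the status strings, and the outcome labels are hypothetical names introduced here, and the manual-inspection exception is not modeled.

```python
from collections import namedtuple

# Hypothetical per-dataset observation at one site.
# status is one of: "PASS", "FILTERED", "OUTSIDE_BED", "IN_BED_NO_VARIANT"
Observation = namedtuple("Observation", ["dataset", "status", "genotype"])

HIGH_CONF_VARIANT = "high-confidence variant"
HIGH_CONF_REGION = "high-confidence region"
EXCLUDE = "not high-confidence: exclude 50 bp on each side"


def integrate_site(observations):
    """Apply the slide's four rules to all dataset observations at one site."""
    passing = [o.genotype for o in observations if o.status == "PASS"]
    if passing:
        # Rule 1: 1+ PASS and every PASSing dataset has the same genotype.
        if len(set(passing)) == 1:
            return HIGH_CONF_VARIANT
        # Rule 3: PASSing datasets disagree about genotype or variant.
        return EXCLUDE
    # Rule 4: inside BED with no variant for 1+ datasets, and nothing PASSes.
    if any(o.status == "IN_BED_NO_VARIANT" for o in observations):
        return HIGH_CONF_REGION
    # Rule 2: every dataset is filtered or outside its BED.
    return EXCLUDE


site = [
    Observation("illumina", "PASS", "0/1"),
    Observation("cg", "PASS", "0/1"),
    Observation("ion", "FILTERED", None),
]
print(integrate_site(site))
```

A fuller version would also record which rule fired at each site, so that excluded regions can be queued for manual inspection.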
Forming high-confidence calls on AJ Trio
• Generate candidate calls with multiple analysis methods from multiple types of data
• Compare/integrate candidate calls and manually inspect data to understand differences; refine calls?
• Generate integrated calls with several methods (MetaSV, Parliament, svclassify, others?)
• Combine integrated calls (with heuristics and/or machine learning) to generate high-confidence calls
Milestones: https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
Milestone dates: August 30, 2015; Nov 1, 2015; Dec 1, 2015; Jan 26, 2016
Analysis Progress: AJ Trio
• SNPs/indels
  – Several candidate callsets
  – NIST working on integration
• Assembly
  – 2 de novo assemblies of AJ trio (MHAP and Falcon/Bionano)
  – Will be used by at least 2 groups for SV calling
• Structural variants
  – Candidate calls being generated by 14+ groups with >14 different algorithms and 6 datasets
  – 3 integration methods: MetaSV, Parliament, svclassify
• Long-range Phasing
  – 2 phased calls so far (CG LFR and 10X)
  – Integration methods needed!