Genome in a Bottle Consortium January 2015
Stanford University
Reference Materials for Clinical Applications of Human Genome Sequencing
Marc Salit, Ph.D. and Justin Zook, Ph.D.
National Institute of Standards and Technology
Advances in Biological/Medical Measurement Science (ABMS @ Stanford)
Jan2015 GIAB intro, Update, and Data Analysis Planning
GIAB Scope
• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.
• A principal motivation for this consortium is to enable performance assessment of sequencing and science-based regulatory oversight of clinical sequencing.
Genome in a Bottle Consortium Development
• NIST met with sequencing technology developers to assess standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small, invitational workshop at NIST to develop consortium for human genome reference materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
• Website
– www.genomeinabottle.org
Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
– integration of data from multiple platforms
– sequencing and analysis
• Enable regulated applications
[Figure: Comparison of SNP Calls for NA12878 on 2 platforms, 3 analysis methods]
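A comparison like the one in the figure can be approximated by treating each callset as a set of variant records and counting what is shared and what is private. The sketch below is illustrative only: the callset names and the variant records are made-up toy values, and real comparisons also need variant normalization, which is not shown here.

```python
# Toy, made-up variant calls keyed by hypothetical callset names.
# Each call is (chromosome, position, reference allele, alternate allele).
def summarize_concordance(callsets):
    """Return calls shared by every callset and calls private to each one."""
    names = list(callsets)
    shared = set.intersection(*(callsets[n] for n in names))
    private = {}
    for name in names:
        # Union of every other callset; empty if there is only one callset.
        others = set().union(*(callsets[n] for n in names if n != name))
        private[name] = callsets[name] - others
    return shared, private


toy_callsets = {
    "platform1_pipelineA": {("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("2", 500, "G", "A")},
    "platform1_pipelineB": {("1", 1000, "A", "G"), ("1", 2000, "C", "T")},
    "platform2_pipelineC": {("1", 1000, "A", "G"), ("2", 900, "T", "C")},
}

shared, private = summarize_concordance(toy_callsets)
print(len(shared), "calls shared by all callsets")
for name, calls in private.items():
    print(name, "has", len(calls), "private calls")
```

Calls that are private to one platform or one analysis method are the natural candidates for investigating sources of bias and error.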
Measurement Process
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials will be developed to characterize performance of part of the process
– materials will be certified for their variants against a reference sequence, with confidence estimates
[Figure label: generic measurement process]
• NIST working with GIAB to select genomes
• Current plan
– NA12878 HapMap sample as Pilot sample
  • part of 17-member pedigree
– trios from PGP as more complete set
  • 2 trios, focus on children
Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy' variants and regions of the genome
– Topic #2: Selecting future genomes for Reference Materials
Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot Reference Material
• Discussion of plans to release pilot Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans and discussion
• Steering committee Overview
• First meeting of the Steering Committee (others adjourn)
Please Note
Slides will be made available on SlideShare after the workshop (see genomeinabottle.org).
Tweets are welcome unless the speaker requests otherwise. Please use #giab as the hashtag.
What’s the future of GIAB?
• What is GIAB uniquely positioned to do?
– how will we know when we’re done?
• If we do other stuff, are we the best cohort to do it?
• Other biogeographical ancestry groups?
• Cancer?
– spike-in controls
– whole-genomes
  • tumor/normal?
• Create list of mutations for spike-ins for germline
• Somatic genomes other than cancer
• Prenatal
• Forensics – decay of DNA
• Transcriptome?
• Epigenome?
• Interpretation standards?
– functional
– clinical
Others working in this space…
Well-characterized genomes
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Human Longevity, Inc.
• Hydatidiform mole haploid cell line
• Genome Reference Consortium
Performance Metrics
• Global Alliance for Genomics and Health Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
Plan for analyses of new PGP RM Trio data
January 2015
Data Release Plans
Individual Datasets
• Uploaded to GIAB FTP site as it is collected
• May include raw reads, aligned reads, and variant/reference calls
Integrated High-confidence Calls
• First develop SNP, indel, and homozygous reference calls
• Then develop SV and non-SV calls
• Released calls are versioned
• Preliminary callsets will be made available to be critiqued
Pilot RM (NA12878)
• Developing reproducible methods for new integrated high-confidence SNPs/indels
• Illumina Platinum Genomes released phased pedigree calls in Dec 2014
– Blog will be posted
– also working on SVs
• Developing SV calls
– High-confidence deletions and pre-print will be released Feb 2015
• Planned release as NIST RM8398 in April 2015
Ashkenazim PGP trio
Short reads
• Completed
– 300x Illumina paired end on trio
– Complete Genomics
– Ion exome
• Scheduled
– Illumina mate-pair
– possibly SOLiD
Long reads
• Completed
– 20x/8x/8x PacBio
– BioNano Genomics
• Scheduled
– 60x/30x/30x PacBio (or more)
– custom moleculo
Ashkenazim Jewish PGP RM Trio
| Dataset | Characteristics | Coverage | Availability | Good for… |
| --- | --- | --- | --- | --- |
| Illumina Paired-end | 150x150bp | ~300x/individual | Fastq on ftp | SNPs/indels/some SVs |
| Illumina Long Mate pair | ~6000 bp insert | ~40x/individual | Feb-Mar 2015 | SVs |
| Illumina “moleculo” | Custom library | ~30x by long fragments | Feb-Mar 2015 | SVs/phasing/assembly |
| Complete Genomics | | 100x/individual | On ftp | SNPs/indels/some SVs |
| Complete Genomics | LFR | | ?? | SNPs/indels/phasing |
| Ion Proton | Exome | 1000x/individual | On SRA | SNPs/indels in exome |
| BioNanoGenomics | | | Feb 2015 | SVs/assembly |
| PacBio | ~10kb reads | ~120-150x on AJ trio | Finished ~Mar 2015 | SVs/phasing/assembly/STRs |
Asian PGP trio
• Similar sequencing to Ashkenazim trio except for PacBio
• Only son will be NIST RM
SNP/Indel Integration Method Update
• Implementing new integration methods on DNAnexus
– Easier for others to reproduce results
– Easier to apply same methods to new genomes
• First, analyzing NA12878 RM data with new methods to ensure they work well
• Then, apply to PGP trios
SV Integration Methods
• Inputs
– Reference genome, RepeatMasker data
– Aligned sequence data (BAM file)
– List of structural variants (BED file)
• SVClassify
– Up to 180 annotations per SV
– Up to 35 selected annotations per SV
• Unsupervised clustering
• One-class methods
– Support vector machine
– L1 distance
Multidimensional scaling plot for visualizing the 8 clusters. We use a 3-dimensional representation of the data space, which associates 3 MDS coordinates with each site, one for each dimension. This figure plots MDS-3 against MDS-1.
Multi-dimensional scaling showing separation of 8 clusters
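The slides do not say which MDS variant or distance measure was used. As a rough illustration of how each site can be given 3 MDS coordinates from pairwise distances, here is a minimal sketch of classical (Torgerson) MDS in plain Python. It assumes a symmetric distance matrix that is close to Euclidean; the function name, the power-iteration solver, and the toy points are all assumptions made for this example.

```python
import math
import random


def classical_mds(dist, k=3, iters=500, seed=0):
    """Embed n sites in k dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]
    for dim in range(k):
        v = [rng.random() for _ in range(n)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):  # power iteration for the leading eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # eigenvalue estimate
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][dim] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Toy points (made up) whose pairwise distances should be reproduced.
points = [(3, 0, 0), (-3, 0, 0), (0, 2, 0), (0, -2, 0), (0, 0, 1), (0, 0, -1)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds(dist, k=3)
```

With this layout, a plot of MDS-3 against MDS-1 would use `embedding[i][2]` and `embedding[i][0]` for site `i`.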
ROC curves for One-class Classification
One-class scores for Random Non-SVs and “Validated” Callsets
Number of sites from each candidate callset that have k=3 L1 Classification scores in each range, where the score is the proportion p of random sites that are closer to the center than each candidate site. These numbers are after filtering sites for which the flanking regions have low mapping quality or high coverage.
| Callset | <0.68 | 0.68-0.90 | 0.90-0.99 | >0.99 |
| --- | --- | --- | --- | --- |
| Random | 2,599 | 773 | 279 | 17 |
| Personalis | 4 | 4 | 182 | 1,783 |
| 1000 Genomes | 38 | 65 | 557 | 1,493 |
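The score described in the caption can be sketched as follows. This is a minimal illustration, not the consortium's implementation: it assumes the "center" is the per-annotation mean of the random sites and that annotations are already on comparable scales, neither of which is stated in the slides. The function names and the toy annotation vectors are hypothetical.

```python
from bisect import bisect_left


def l1_distance(a, b):
    """Sum of absolute differences between two annotation vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def column_means(sites):
    """Per-annotation mean, used here as the assumed center."""
    return [sum(col) / len(col) for col in zip(*sites)]


def one_class_scores(random_sites, candidate_sites):
    """Score = proportion of random sites closer to the center than the candidate."""
    center = column_means(random_sites)
    random_d = sorted(l1_distance(s, center) for s in random_sites)
    scores = []
    for site in candidate_sites:
        d = l1_distance(site, center)
        closer = bisect_left(random_d, d)  # random sites strictly closer
        scores.append(closer / len(random_d))
    return scores


# Toy annotation vectors (made up for illustration).
random_sites = [[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1], [2, 2], [-2, -2]]
candidates = [[0.5, 0], [1, 1], [3, 3]]
scores = one_class_scores(random_sites, candidates)
```

A score near 1 means the candidate site looks unlike almost all random non-SV sites. Reading the table the same way, 17 of 3,668 random sites (about 0.5%) score above 0.99, compared with 1,783 of 1,973 Personalis sites (about 90%) and 1,493 of 2,153 1000 Genomes sites (about 69%).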