NIST Program to Develop Genomic Reference Materials
Justin Zook and Marc Salit

Jun 26, 2015

Page 1: NIST program to develop genomic reference materials

NIST Program to Develop Genomic Reference Materials

Justin Zook and Marc Salit

Page 2: NIST program to develop genomic reference materials

Scope of NIST work

• Human Whole Genome RMs
• Synthetic DNA constructs
• Microbial Whole Genome RMs

Page 3: NIST program to develop genomic reference materials

RM Development Process

1. Select and procure materials
2. Characterize materials
3. Process and integrate data from multiple platforms
4. Confirm selected genotypes
5. Write Report of Analysis
6. Develop methods for end users to obtain performance metrics from the materials

Page 4: NIST program to develop genomic reference materials

Proposed Timeline for Human RMs

Page 5: NIST program to develop genomic reference materials

Proposed Timeline for Synthetic Structures

Gantt chart task list (columns: Title, Effort; chart time axis 2011 to 2016)

1) Human RMs: 535w
1.1) Select/Procure human DNA for RM: 32w
1.2) **NIST receives packaged DNA for RM/SRM
1.3) Develop bioinformatics pipeline for data integration: 97w
1.4) Human Primary Sequencing: 147w
1.5) Human Homogeneity assessment: 8w
1.6) Analyze homogeneity data and produce preliminary SNP calls for RM: 10w
1.7) Write human RM Report of Analysis: 10w
1.8) Process Human RM for release: 24w
1.9) **Human RM officially released
1.10) Human Sequencing data integration: 25w
1.11) Human Validation: 20w
1.12) Human other characterization methods: 48w
1.13) Analyze validation data and refine sequencing calls: 12w
1.14) Develop pipeline for SVs and test: 40w
1.15) Write Human SRM Report of Analysis: 8w
1.16) Process Human SRM for release: 24w
1.17) **Human SRM officially released
1.18) Procure local data storage: 10w
1.19) Procure Bioinformatics data analysis tools: 10w
1.20) Procure Automated sample prep instrumentation: 10w

2) Microbial RMs: 279w
2.1) Select/Procure microbial DNA for RMs: 31w
2.2) Microbial Primary Sequencing: 124w
2.3) Microbial Homogeneity assessment: 6w
2.4) Microbial Sequencing data integration: 40w
2.4.1) Mapping/Alignment: 10w
2.4.2) Variant calling: 12w
2.4.3) Form consensus variant calls: 12w
2.4.4) Select sites for validation: 6w
2.5) Microbial Validation: 8w
2.6) Microbial other characterization methods: 8w
2.7) Analyze validation data and refine calls: 20w
2.8) Write Microbial SRM Report of Analysis: 18w
2.9) Process Microbial SRM for release: 24w
2.10) **Microbial SRM officially released

3) Synthetic DNA constructs: 334w
3.1) Design constructs: 12w
3.2) Test constructs in silico: 10w
3.3) Procure synthetic DNA for testing: 20w
3.4) Sequence preliminary synthetic DNA: 124w
3.5) Compare preliminary sequencing data: 12w
3.6) Design final RM constructs: 8w
3.7) Procure synthetic DNA for SRMs: 16w
3.8) Sequence synthetic SRMs: 88w
3.9) Sequencing data integration: 12w
3.10) Write synthetic SRM Report of Analysis: 8w
3.11) Process synthetic SRM for release: 24w
3.12) **Synthetic SRM officially released
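The effort figures in the timeline can be cross-checked with simple arithmetic. The sketch below assumes that "w" means weeks, that the entries marked ** carry no effort, and that subtasks 2.4.1 to 2.4.4 roll up into task 2.4 (so they are not counted twice). The variable names are illustrative only.

```python
# Effort per task in weeks (assumed meaning of "w"), transcribed from the slide.
# Entries marked ** on the slide have no effort and are omitted.
human = [32, 97, 147, 8, 10, 10, 24, 25, 20, 48, 12, 40, 8, 24, 10, 10, 10]
microbial = [31, 124, 6, 40, 8, 8, 20, 18, 24]   # 40 is task 2.4 as a whole
microbial_2_4_subtasks = [10, 12, 12, 6]          # tasks 2.4.1 to 2.4.4
synthetic = [12, 10, 20, 124, 12, 8, 16, 88, 12, 8, 24]

totals = {
    "Human RMs": sum(human),
    "Microbial RMs": sum(microbial),
    "Synthetic DNA constructs": sum(synthetic),
}

for name, weeks in totals.items():
    print(f"{name}: {weeks}w")
print(f"Task 2.4 subtasks: {sum(microbial_2_4_subtasks)}w")
```

Under these assumptions the sums reproduce the group totals on the slide (535w, 279w, 334w). Because each total exceeds the length of the 2011 to 2016 chart axis, the figures read as summed effort across overlapping tasks rather than calendar duration.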

Page 6: NIST program to develop genomic reference materials

Proposed Characterization Methods for Whole Genomes

Whole Genome Sequencing
• ABI 5500 (1kb, 6kb, and 10kb mate-pair libraries)
• Illumina
• Complete Genomics
• Upcoming technologies?
  – Ion Proton?
  – Oxford Nanopore?
• 3x replication of sequencing (3 library preps)

Other
• Genotyping microarrays
• Array CGH
• Targeted sequencing
• Fosmid sequencing?
• Optical Mapping?

Pedigree figure: Father, Mother; NA12878, Husband; Son, Daughter

Page 7: NIST program to develop genomic reference materials

Integration of Existing Data to Form Consensus Genotype Calls

Find all possible variant sites

Find sites where all datasets agree

Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias

For each site, remove datasets with decreasingly atypical characteristics until all datasets agree

Even if all datasets agree, identify them as uncertain if few have typical characteristics
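The arbitration steps above can be sketched in a few lines of Python. This is a minimal illustration and not the NIST pipeline: the `Call` record, the single numeric `atypicality` score, and the `typical_cutoff` and `min_typical` thresholds are all hypothetical stand-ins for whatever characteristics (coverage, strand bias, mapping quality and so on) are actually used.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # name of the sequencing dataset
    genotype: str       # e.g. "0/0", "0/1", "1/1"
    atypicality: float  # hypothetical score; higher means more atypical


def consensus_genotype(calls, typical_cutoff=1.0, min_typical=2):
    """Return a consensus genotype for one site, or "uncertain"."""
    if not calls:
        return "uncertain"
    # Order from most typical to most atypical.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    # Drop the most atypical dataset until the remaining ones agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    # Even when the datasets agree, require enough typical-looking support.
    n_typical = sum(c.atypicality <= typical_cutoff for c in remaining)
    if n_typical < min_typical:
        return "uncertain"
    return remaining[0].genotype


site = [
    Call("A", "0/1", 0.1),
    Call("B", "0/1", 0.2),
    Call("C", "0/0", 3.0),  # disagrees and looks atypical, so it is removed
]
print(consensus_genotype(site))
```

In this sketch a site where every dataset agrees can still come back as "uncertain" if fewer than `min_typical` of the agreeing datasets look typical, which mirrors the last step on the slide.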

Page 8: NIST program to develop genomic reference materials

Consensus has lower FN rate than individual datasets

HiSeq – GATK genotypes (rows) vs. Illumina Omni SNP Array genotypes (columns):

                               Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call   1.45M                  7.24k (1.34%)   5.28k (0.65%)        N/A
Heterozygous                   196 (0.03%)            411k (60.7%)    133 (0.02%)          N/A
Homozygous Variant             154 (0.02%)            150 (0.02%)     249k (37.0%)         N/A

Integrated Consensus Genotypes (rows) vs. Illumina Omni SNP Array genotypes (columns):

                               Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call   1.45M                  613 (0.09%)     977 (0.15%)          N/A
Heterozygous                   241 (0.04%)            414k (61.5%)    173 (0.03%)          N/A
Homozygous Variant             152 (0.02%)            61 (0.01%)      249k (36.9%)         N/A
Uncertain                      5458 (0.81%)           3421 (0.51%)    4808 (0.71%)         N/A

Cells annotated on both tables: “FNs”, “FPs*”

* Note that most or all of the putative FPs seem to actually be FNs on the microarray

Page 9: NIST program to develop genomic reference materials

SNP arrays overestimate performance

HiSeq – GATK genotypes (rows) vs. Illumina Omni SNP Array genotypes (columns):

                               Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call   1.45M                  7.24k (1.34%)   5.28k (0.65%)        N/A
Heterozygous                   196 (0.03%)            411k (60.7%)    133 (0.02%)          N/A
Homozygous Variant             154 (0.02%)            150 (0.02%)     249k (37.0%)         N/A

Cells annotated: “FNs”, “FPs*”

HiSeq – GATK genotypes (rows) vs. Integrated Consensus Genotypes (columns):

                               Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call   1.52M                  157k (4.68%)    30.3k (0.90%)        4.17M
Heterozygous                   47 (0.00%)             1.90M (56.4%)   34 (0.00%)           16.9k (0.50%)
Homozygous Variant             1 (0.00%)              298 (0.01%)     1.19M (35.3%)        73.3k (2.18%)

Cells annotated: “FNs”, “FPs”

Page 10: NIST program to develop genomic reference materials

Samtools has higher FP and lower FN than GATK

HiSeq – samtools genotypes (rows) vs. Integrated Consensus Genotypes (columns):

                               Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call   1.51M                  49.6k (1.47%)   6.74k (0.20%)        3.93M
Heterozygous                   3141 (0.09%)           2.00M (59.6%)   74 (0.00%)           175k (5.19%)
Homozygous Variant             21 (0.00%)             777 (0.02%)     1.21M (36.0%)        192k (5.71%)

Cells annotated: “FNs”, “FPs”

HiSeq – GATK genotypes (rows) vs. Integrated Consensus Genotypes (columns):

                               Homozygous Reference   Heterozygous    Homozygous Variant   Uncertain
Homozygous Reference/No Call   1.52M                  157k (4.68%)    30.3k (0.90%)        4.17M
Heterozygous                   47 (0.00%)             1.90M (56.4%)   34 (0.00%)           16.9k (0.50%)
Homozygous Variant             1 (0.00%)              298 (0.01%)     1.19M (35.3%)        73.3k (2.18%)

Cells annotated: “FNs”, “FPs”
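The slide's headline can be checked against the two tables with a short calculation. The sketch below assumes that rows are the HiSeq caller's genotypes and columns are the integrated consensus genotypes, that "FNs" are caller Hom. Ref./No Call sites where the consensus is a variant, and that "FPs" are caller variant sites where the consensus is Homozygous Reference. The counts are the rounded values shown on the slide, and `putative_errors` is an illustrative helper name.

```python
# Rows: caller genotype (Hom. Ref./No Call, Heterozygous, Hom. Variant).
# Columns: consensus genotype (Hom. Ref., Heterozygous, Hom. Variant, Uncertain).
# Counts are the rounded figures from the slide.
samtools = [
    [1_510_000, 49_600, 6_740, 3_930_000],
    [3_141, 2_000_000, 74, 175_000],
    [21, 777, 1_210_000, 192_000],
]
gatk = [
    [1_520_000, 157_000, 30_300, 4_170_000],
    [47, 1_900_000, 34, 16_900],
    [1, 298, 1_190_000, 73_300],
]


def putative_errors(matrix):
    """Return (FN, FP) counts under the cell assignment assumed above."""
    fn = matrix[0][1] + matrix[0][2]  # called ref/no call, consensus is variant
    fp = matrix[1][0] + matrix[2][0]  # called variant, consensus is hom. ref.
    return fn, fp


for name, matrix in (("samtools", samtools), ("GATK", gatk)):
    fn, fp = putative_errors(matrix)
    print(f"{name}: FN={fn}, FP={fp}")
```

With these assumptions samtools shows fewer putative FNs and more putative FPs than GATK, in line with the slide title. Sites in the Uncertain column are left out of both counts.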

Page 11: NIST program to develop genomic reference materials

Performance Metrics: Characteristics of Mis-calls

[Figure: grid of panels with HiSeq/GATK genotype (Heterozygous, Hom. Variant, Hom. Ref./No call) on one side and Consensus Genotypes (Heterozygous, Hom. Variant, Hom. Ref., Uncertain) on the other; axes: QUAL/Depth of Coverage and Strand Bias]

Page 12: NIST program to develop genomic reference materials

Challenges with assessing performance

• All variant types are not equal
• Nearby variants are often difficult to align
• All regions of the genome are not equal
  – Homopolymers, STRs, duplications
  – Can be similar or different in different genomes
• Labeling difficult variants as “uncertain” in the Reference Material leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
• It’s important to consider data from multiple platforms and library preparations when characterizing a Reference Material
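The point about “uncertain” labels can be illustrated with a toy calculation. All numbers below are hypothetical and chosen only to show the direction of the effect: a caller that is wrong more often at difficult sites looks more accurate once those sites are labeled uncertain and excluded from scoring.

```python
# Hypothetical site list: 950 easy sites (5 miscalled) and
# 50 difficult sites (20 miscalled) that the reference labels "uncertain".
sites = (
    [{"uncertain": False, "correct": True}] * 945
    + [{"uncertain": False, "correct": False}] * 5
    + [{"uncertain": True, "correct": True}] * 30
    + [{"uncertain": True, "correct": False}] * 20
)


def apparent_accuracy(sites, exclude_uncertain):
    """Fraction of scored sites where the call matches the reference."""
    scored = [s for s in sites if not (exclude_uncertain and s["uncertain"])]
    return sum(s["correct"] for s in scored) / len(scored)


print(f"all sites scored:   {apparent_accuracy(sites, False):.4f}")
print(f"uncertain excluded: {apparent_accuracy(sites, True):.4f}")
```

In this made-up example the same calls score higher once the difficult sites are excluded, which is why the fraction of the genome labeled uncertain matters when comparing performance figures.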