Development & applications of a segregation-phasing ground truth
Francisco M. De La Vega, D.Sc.
Visiting Scholar, Department of Genetics
Stanford University School of Medicine
In collaboration with Real Time Genomics, Inc.
GENOME-IN-A-BOTTLE WORKSHOP
Evaluating Variant Calls
O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine 5, 28 (2013).
Beyond Venn Diagrams
Experimental validation (e.g. Sanger, qPCR)
    Expensive
    Limited by platform success
    Statistical sample
Reference orthogonal data available for some genomes
    SNP array data
        Sparse
    Fosmid sequencing data
        Incomplete
Reference genomes sequenced by multiple platforms
    Arbitration methods (e.g. NIST, Genome-in-a-Bottle)
    Low FP, but unknown FN (genome-wide)
    Biases?
Mendelian segregation as “ground truth”
CEPH/Utah Pedigree 1463
Sequenced by CGI and Illumina (Platinum Genomes)
Started with 2x100bp 50X WGS Illumina Platinum data
Aligned & variant called with rtgVariant 1.1, filtered by quality score (AVR≥0.15)
across the samples, excluding problematic sites
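A minimal sketch of the quality filter mentioned above. It assumes the AVR score is stored as a per-sample FORMAT field named AVR in a standard tab-separated VCF data line, and that a site is kept only when every sample reporting a score meets the threshold. The function names and the "all samples must pass" rule are illustrative assumptions, not the rtgVariant implementation.

```python
def sample_avr_scores(vcf_line):
    """Return the per-sample AVR values found on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "AVR" not in format_keys:
        return []
    idx = format_keys.index("AVR")
    scores = []
    for sample in fields[9:]:
        values = sample.split(":")
        # Trailing FORMAT values may be omitted; "." marks a missing value.
        if idx < len(values) and values[idx] != ".":
            scores.append(float(values[idx]))
    return scores


def passes_avr(vcf_line, min_avr=0.15):
    """Keep a site only if at least one sample has a score and all scores pass."""
    scores = sample_avr_scores(vcf_line)
    return bool(scores) and min(scores) >= min_avr
```

Requiring every scored sample to pass is a conservative choice for a pedigree, where one poorly supported genotype can break the segregation pattern at a site.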
Example: Heterozygous variant segregation
Segregation of heterozygous variants to offspring
[Figure: four histograms of variant count (y-axis) by # of offspring segregating (x-axis, 1 to 11); panels: SNV, MNP, indel, All Variants]
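The counts plotted in this figure can be illustrated with a short sketch. It assumes each site is given as a (father, mother, children) triple of VCF-style genotype strings, and that only sites where one parent is heterozygous 0/1 and the other is homozygous reference are tallied. The function names and this site selection are assumptions made for illustration, not the presenter's pipeline.

```python
from collections import Counter


def parse_gt(gt):
    """Turn a genotype string such as '0/1' or '1|0' into a sorted allele tuple."""
    return tuple(sorted(int(a) for a in gt.replace("|", "/").split("/")))


def offspring_segregating(father, mother, children):
    """Count children carrying the alternate allele at a het x hom-ref site.

    Returns None when the parental genotypes are not exactly one 0/1 and one 0/0.
    """
    parents = sorted([parse_gt(father), parse_gt(mother)])
    if parents != [(0, 0), (0, 1)]:
        return None
    return sum(1 for child in children if 1 in parse_gt(child))


def segregation_histogram(sites):
    """Tally, over all qualifying sites, how many offspring carry the variant."""
    counts = Counter()
    for father, mother, children in sites:
        n = offspring_segregating(father, mother, children)
        if n is not None:
            counts[n] += 1
    return counts
```

Under Mendelian segregation each child inherits the variant allele from the heterozygous parent with probability 1/2, so for true variants the histogram over many sites should look roughly binomial in the number of children.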
Steps for haplotype phasing in large family
Check calls vs haplotype framework
Connect haplotype islands
Phase contiguity extension
Identify crossovers
Phasing labels given parent and child genotypes
Parents (fa/fb  ma/mb)    Child    Phasing labels
0/0  0/1                  0/0      fa/ma, fb/ma
                          0/1      fa/mb, fb/mb
0/1  0/1                  0/0      fa/ma
                          0/1      fa/mb, fb/ma
                          1/1      fb/mb
0/0  1/2                  0/1      fa/ma, fb/ma
                          0/2      fa/mb, fb/mb
0/1  1/2                  0/1      fa/ma
                          0/2      fa/mb
                          1/1      fb/ma
                          1/2      fb/mb
0/1  2/3                  0/2      fa/ma
                          0/3      fa/mb
                          1/2      fb/ma
                          1/3      fb/mb
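The labels in this table can be derived mechanically. The sketch below assumes that, at a single site, the father's genotype is written in the order fa/fb and the mother's in the order ma/mb, and it lists every pairing of one paternal and one maternal haplotype that reproduces the child's genotype. The function name is hypothetical.

```python
from itertools import product


def phasing_labels(father, mother, child):
    """List the parental haplotype pairs consistent with a child's genotype.

    father and mother are allele tuples ordered (fa, fb) and (ma, mb);
    child is an unordered allele tuple. An empty list means the child's
    genotype cannot be explained by Mendelian inheritance at this site.
    """
    paternal = zip(("fa", "fb"), father)
    maternal = zip(("ma", "mb"), mother)
    labels = []
    for (f_name, f_allele), (m_name, m_allele) in product(paternal, maternal):
        if sorted((f_allele, m_allele)) == sorted(child):
            labels.append(f"{f_name}/{m_name}")
    return labels


print(phasing_labels((0, 0), (0, 1), (0, 1)))  # ['fa/mb', 'fb/mb']
print(phasing_labels((0, 1), (1, 2), (1, 1)))  # ['fb/ma']
```

Sites where both parents are heterozygous for different alleles give exactly one label per child, so they are the most informative ones. A site that returns an empty list for any child is a candidate genotyping error or de novo event.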
Identification of recombination crossovers
Chr 1, Mother
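One way to picture crossover identification for a single child: walk along the chromosome and report each interval where the inherited maternal haplotype label switches. The sketch below assumes the labels have already been made consistent across sites (the earlier phasing steps) and that uninformative sites are marked None. The function name and input layout are hypothetical.

```python
def find_crossovers(positions, haplotypes):
    """Return (left, right) position pairs bracketing each haplotype switch.

    positions is sorted along the chromosome; haplotypes holds the inherited
    label at each position ('ma' or 'mb'), or None where uninformative.
    """
    crossovers = []
    last_pos, last_hap = None, None
    for pos, hap in zip(positions, haplotypes):
        if hap is None:
            continue  # skip sites that do not distinguish the two haplotypes
        if last_hap is not None and hap != last_hap:
            crossovers.append((last_pos, pos))
        last_pos, last_hap = pos, hap
    return crossovers


print(find_crossovers([100, 200, 300, 400, 500],
                      ["ma", "ma", None, "mb", "mb"]))  # [(200, 400)]
```

A practical version would likely require a minimum run of consistent sites on each side of a switch, so that an isolated genotyping error is not reported as two adjacent crossovers.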
Probability of a set of genotypes being phase-consistent by chance
Given that there are d different genotypes across both the parents and children, and that genotype i occurs n_i times [a further definition and the equation itself are missing from this transcript; see Cleary et al. (2014)].
Cleary, J. G., et al. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data. bioRxiv (2014). doi:10.1101/001958
Probability of a set of genotypes being phase-consistent by chance – some examples
RTG, Hamilton, New Zealand
    John Cleary
    Ross Braithwaite
    Len Trigg
RTG, San Bruno, CA
    Sahar Malakshah
    Minita Shah
Michael Eberle, Illumina, Inc. – Platinum Project data
Complete Genomics, Inc. – CEPH pedigree data
Justin Zook – NIST
Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab)
This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA.