ANALYSIS OF STRUCTURAL VARIANTS FROM NEXT GENERATION SEQUENCING Hemang Parikh, Ph.D. NIST
Jun 24, 2015
Challenges for identifying true SVs
This Venn diagram shows the numbers of unique and shared structural variants (SVs) found by different sequencing-based discovery approaches that have been used in the 1000 Genomes Project
Hence we decided to develop methods to look for evidence of SVs in mapped sequencing reads from multiple sequencing technologies
From Alkan et al. (2011)
• Coverage (mean and standard deviation)
• Paired-end distance/insert size (mean and standard deviation)
• # of discordant paired-end reads
• Soft clipping of the reads (mean and standard deviation)
• Mapping quality (mean and standard deviation)
• # of heterozygous and homozygous SNP genotype calls
• % of GC content
Validation parameters for each SV
Inputs:
• Reference sequence
• RepeatMasker data
• Aligned sequence data (BAM file)
• List of structural variants (BED file)
Processing: Perl script
Output: about 180 annotations per SV
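To make the annotation step concrete, here is a minimal Python sketch of how a few of the listed parameters could be summarized for one SV region. It is not the Perl script used in this work. The function names, the input layout (per-base depths, per-read mapping qualities, and the reference sequence of the region), and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def gc_percent(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def annotate_region(depths, mapqs, seq):
    """Summarize one region (hypothetical helper, not the authors' code).

    depths: per-base read depth across the region
    mapqs:  mapping quality of each read overlapping the region
    seq:    reference sequence of the region
    """
    return {
        "coverage_mean": mean(depths),
        "coverage_sd": pstdev(depths),  # population SD, an assumption
        "mapq_mean": mean(mapqs),
        "mapq_sd": pstdev(mapqs),
        "gc_percent": gc_percent(seq),
    }


example = annotate_region(
    depths=[30, 32, 28, 30],
    mapqs=[60, 60, 20, 60],
    seq="ACGTGGCC",
)
print(example)
```

In practice the depths and mapping qualities would be extracted from the BAM file for each interval in the BED file. That extraction is omitted here to keep the sketch free of external dependencies.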
NA12878 Data Sets—RM for GIAB
• Illumina (250 bp long sequences with 50X coverage)
• Illumina NIST (150 bp long sequences with 300X coverage)
• Illumina Platinum Genome (100 bp long sequences with 200X coverage)
• Illumina Moleculo
• Pacific Biosciences
Deletions Gold Sets for NA12878
• Personalis (n=2,306)
• The 1000 Genomes pilot (n=2,773)
• Complete Genomics (n=2,032)
• Conrad et al. (n=515)
• Kidd et al. (n=317)
• McCarroll et al. (n=128)
• The 1000 Genomes, aCGH array based (n=3,901)
• Roche NimbleGen 42 million, aCGH array based (n=719)
• Randomly generated (n=2,306)
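The slides do not say how the random set was generated. One plausible approach, sketched below purely as an assumption, is to draw one random region per gold-set deletion with the same size, so that the two sets have the same count and size distribution. The function name `random_regions`, the chromosome-length dictionary, and the length-weighted chromosome choice are hypothetical.

```python
import random


def random_regions(sizes, chrom_lengths, seed=0):
    """Draw one random region per requested size (illustrative only).

    sizes:         region sizes in bp, e.g. taken from a gold-set BED file
    chrom_lengths: mapping of chromosome name to length in bp
    Returns BED-style (chrom, start, end) tuples, 0-based and half-open.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    chroms = list(chrom_lengths)
    weights = [chrom_lengths[c] for c in chroms]
    longest = max(weights)
    regions = []
    for size in sizes:
        if size > longest:
            raise ValueError(f"size {size} exceeds every chromosome length")
        # Pick a chromosome in proportion to its length, retrying until
        # the region fits.
        while True:
            chrom = rng.choices(chroms, weights=weights, k=1)[0]
            if size <= chrom_lengths[chrom]:
                break
        start = rng.randint(0, chrom_lengths[chrom] - size)
        regions.append((chrom, start, start + size))
    return regions


toy_lengths = {"chr1": 10_000, "chr2": 6_000}
regions = random_regions([100, 5_000], toy_lengths)
print(regions)
```

A real analysis would probably also exclude assembly gaps and regions overlapping known SVs. The sketch does not attempt that.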
Personalis deletions call set (n=2,306)
[Figure: histogram of Log10 (SV Size) against counts]
• BAM-level evidence in the vicinity of each SV, in most of the 19 CEPH pedigree samples
• SV breakpoints were identified
• Some SVs were validated with PCR
[Figure: Illumina NIST. Histograms of Log10 (M coverage) against counts, with panels for Personalis and Random genome]
Identifying likely SVs and likely non-SVs
[Figure: histogram of Log10 (M coverage) against counts for the Random genome set]
• Identify the 99th percentile value of an annotation parameter in the Random genome set
• Compare this value with the same annotation parameter from the SV Gold Set
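The thresholding step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The nearest-rank percentile definition, the one-sided "greater than" comparison, and the function names are assumptions. For some parameters the informative tail may be the lower one.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def flag_against_random(gold_values, random_values, pct=99):
    """Flag gold-set values that exceed the random-set percentile.

    Returns the threshold and one boolean per gold-set SV.
    """
    threshold = percentile_nearest_rank(random_values, pct)
    return threshold, [value > threshold for value in gold_values]


# Toy data: one annotation parameter for 100 random regions and 4 gold-set SVs.
random_values = list(range(1, 101))
gold_values = [50, 99, 100, 250]
threshold, flags = flag_against_random(gold_values, random_values)
print(threshold, flags)
```

Repeating this for every annotation parameter gives, for each SV, a vector of flags showing which parameters fall outside the range seen in random regions.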
Annotating with Illumina NIST and Illumina Moleculo
Personalis SV Gold Set for Illumina NIST annotation parameters
• L Insert size
• L Soft Clipped
• L # of discordant paired-ends reads
• M Coverage
• M Coverage SD
• M Mapping quality
• M Insert size
• M Soft Clipped
• M # of discordant paired-ends reads
Personalis SV Gold Set for Illumina Moleculo annotation parameters
• L Soft Clipped
• M Coverage
• M Coverage SD
• M Mapping quality
• M Soft Clipped
(A) Personalis (rows: Illumina Moleculo; columns: Illumina NIST)
        0     1     2     3     4     5     6     7     8     9    10
0      21    96   323   350   231   126    80    40    10     2     1
1       4    19    45    59    61    29    16     9     9     0     1
2       1    22   108   200   214   111    69    36     8     3     0
3       0     0     0     1     1     0     0     0     0     0     0
4       0     0     0     0     0     0     0     0     0     0     0
5       0     0     0     0     0     0     0     0     0     0     0

(B) Random genome (rows: Illumina Moleculo; columns: Illumina NIST)
        0     1     2     3     4     5     6     7     8     9    10
0    2059    94    18     6     2     3     1     0     0     0     0
1      62    15    12     5     1     3     2     0     0     1     0
2      13     3     5     0     0     0     0     1     0     0     0
3       0     0     0     0     0     0     0     0     0     0     0
4       0     0     0     0     0     0     0     0     0     0     0
5       0     0     0     0     0     0     0     0     0     0     0
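One plausible reading of the counts above, which the slides do not state explicitly, is that each cell holds the number of SVs with a given number of annotation parameters beyond the random-genome threshold in each dataset. Under that assumption, the cross-tabulation could be built as in the Python sketch below. The function names and the input layout (SV identifier mapped to a list of per-parameter flags) are hypothetical.

```python
from collections import Counter


def count_flags(flags_by_sv):
    """Number of flagged annotation parameters for each SV."""
    return {sv: sum(flags) for sv, flags in flags_by_sv.items()}


def cross_tabulate(nist_flags, moleculo_flags):
    """Count SVs by (Moleculo flag count, Illumina NIST flag count).

    Only SVs present in both inputs are counted.
    """
    nist = count_flags(nist_flags)
    moleculo = count_flags(moleculo_flags)
    table = Counter()
    for sv in nist.keys() & moleculo.keys():
        table[(moleculo[sv], nist[sv])] += 1
    return table


# Toy flags for three SVs: True means the parameter exceeded its threshold.
nist_flags = {
    "sv1": [True, True, False],
    "sv2": [False, False, False],
    "sv3": [True, True, False],
}
moleculo_flags = {
    "sv1": [True, False],
    "sv2": [False, False],
    "sv3": [True, False],
}
table = cross_tabulate(nist_flags, moleculo_flags)
print(sorted(table.items()))
```

With real data, a set dominated by the (0, 0) cell behaves like the random regions, while SVs with several flagged parameters in both datasets are stronger candidates.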
Conclusions
• Graphical visualization of the annotation parameters shows a clear distinction between true positive and false positive SVs
• A key advantage of the proposed method is its simplicity and flexibility to generate various annotation parameters from aligned sequence data based on different sequencing datasets from the same genome
• This allows integration of multiple sequencing datasets to identify high-confidence SV and non-SV calls that can be used as a benchmark to assess false positive and false negative rates
• We are currently testing classification methods based on the annotation parameters to generate both high-confidence SV calls and high-confidence non-SV calls for NA12878
Acknowledgements
NIST
Marc Salit
Justin Zook
Hariharan Iyer
Desu Chen
Sumona Sarkar
Jennifer McDaniel
Lindsay Vang
David Catoe
Nathanael Olson
Genome in a Bottle Consortium
Personalis Inc.
Mark Pratt
Gabor Bartha
Jason Harris
Illumina Inc.
Michael Eberle
Stanford University
Michael Snyder
Amin Zia
Somalee Datta
Cuiping Pan
Sean Michael Boyle
Rajini Haraksingh
Natalie Jaeger