SUPPLEMENTARY INFORMATION
Materials and Methods
Whole genome sequencing
Exome Sequencing
Variant Validation
Identification of DLBCL Cancer Genes
Gene Expression Microarray Analysis
Gene Annotation and GO Term Enrichment
Biological Validation
Materials and Methods
Sample acquisition and processing
Archival lymphoma tumors (N=73) and normal tissue (N=34) from 73 patients were obtained
from the institutions that constitute the Hematologic Malignancies Research Consortium
(HMRC) 1. These cases were anonymized, shipped to Duke University, and processed in
accordance with a protocol approved by the Institutional Review Board at Duke University.
RNA and genomic DNA were extracted from these 73 cases in addition to 21 DLBCL cell lines
using column-based methods described previously1.
Whole genome sequencing
Library Preparation
Whole genome sequencing libraries were prepared using methods
described in the “Sample Preparation” section of the Agilent SureSelect protocol (pre-capture
portion). Genomic DNA was sheared to 500 bp using Covaris settings: duty cycle-10%,
intensity-5, frequency-200 cycles/burst, duration-135s, waterbath temperature-4°C, and
quantified by BioAnalyzer (Agilent) using the DNA1000 chip. Then, it was end-repaired,
A-tailed, and ligated to Illumina paired-end adapters at a ratio of 2μl per μg of DNA as quantified
by BioAnalyzer. The ligated library was amplified for 6 cycles using Illumina PE PCR primers
and 2x Phusion HF Master Mix. Post-PCR, the library was purified and assayed on BioAnalyzer
to determine size and concentration. Libraries were diluted to 5 pM for Illumina clustering and
paired-end sequenced over 9 days.
Sequence Alignment
Raw reads in FASTQ format 2 were masked for Illumina adapter sequences, barcodes, and
Phred-scaled base qualities of 10 and less using GATK3. All the alignments were output as BAM files 4
and merged using Picard (http://picard.sourceforge.net). PCR/optical duplicates were marked
with Picard, and base quality recalibration and localized Indel realignments were performed
using GATK 3. Read alignments were visualized with Integrative Genomics Viewer5.
SAMtools mpileup with settings “-C50 -m3 -F0.0002” was run for the samples concurrently and
output to a VCF file. Individual SNVs and Indels were annotated with gene names and predicted
function using SequenceVariantAnalyzer6, dbSNP130, HapMap v3 allele frequencies, 1000
Genome Project pilot 1 allele frequencies and CCDS Gene IDs using BEDTools, AWK, and
Exon primers described previously16 were used to amplify regions of interest. Each 12.5 μl
reaction contained: 5 ng genomic DNA of interest, 6.25 μl of High Fidelity PCR Master Mix
(Roche, 12 140 314 001), and 300 nM of each primer. Amplification was carried out as
described by the manufacturer (94°C 2:00, 10 cycles of 94°C 0:10, 50°C 1:10, 72°C 0:45, 20
cycles of 94°C 0:15, 50°C 0:30, 74°C 0:45 incremented by 5s per cycle, 72°C 2:30). The
reaction specificity was verified by agarose gel, and reactions were purified with Agencourt
AMPure XP beads using manufacturer instructions (Beckman Coulter, A63881).
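The reaction composition above can be checked with simple dilution arithmetic (C1 x V1 = C2 x V2). The sketch below is illustrative only: the 10 μM primer stock concentration is an assumption, since the text gives only the final primer concentration (300 nM) and the reaction and master mix volumes.

```python
def primer_volume_ul(final_nm, reaction_ul, stock_um):
    """Volume of primer stock (in ul) giving final_nm in a reaction of reaction_ul."""
    if stock_um <= 0:
        raise ValueError("stock concentration must be positive")
    # C1 * V1 = C2 * V2, with the stock converted from uM to nM.
    return final_nm * reaction_ul / (stock_um * 1000.0)


REACTION_UL = 12.5        # from the text
MASTER_MIX_UL = 6.25      # from the text (2x mix at half the reaction volume)
PRIMER_FINAL_NM = 300.0   # from the text
ASSUMED_STOCK_UM = 10.0   # hypothetical stock concentration, not from the text

per_primer_ul = primer_volume_ul(PRIMER_FINAL_NM, REACTION_UL, ASSUMED_STOCK_UM)
# Volume left for template DNA plus water after master mix and both primers.
remaining_ul = REACTION_UL - MASTER_MIX_UL - 2 * per_primer_ul

print(f"primer stock per primer: {per_primer_ul:.3f} ul")
print(f"volume left for DNA and water: {remaining_ul:.2f} ul")
```

With a different stock concentration only `ASSUMED_STOCK_UM` needs to change.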
We have validated 118 variants at 32 unique loci in 26 genes using Sanger sequencing. We found the
concordance between Sanger genotype calls and NGS calls to be excellent (111/118, 94%). We expect the
concordance rates for variants we did not sample here to be comparable. The 7 discordant cases
all occurred because next-generation sequencing called a variant that was not supported by
Sanger sequencing (false positive). We did not observe any cases where a variant determined to
be absent by NGS was found to actually be present by Sanger sequencing (false negative).
Therefore, we are highly confident that NGS calls for somatic variants are real because our NGS
calls for absence of variants are highly accurate.
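The validation figures reported above can be reproduced with a few lines of arithmetic. This is a minimal sketch rather than the authors' analysis code; the function name and the returned field names are hypothetical.

```python
def validation_summary(n_validated, n_concordant, n_false_negative=0):
    """Summarize agreement between NGS calls and Sanger sequencing."""
    if not 0 <= n_concordant <= n_validated or n_validated == 0:
        raise ValueError("need 0 <= n_concordant <= n_validated and n_validated > 0")
    n_discordant = n_validated - n_concordant
    if not 0 <= n_false_negative <= n_discordant:
        raise ValueError("false negatives cannot exceed discordant calls")
    return {
        "concordance": n_concordant / n_validated,
        "discordant": n_discordant,
        # Discordant calls that are not false negatives are false positives.
        "false_positive": n_discordant - n_false_negative,
        "false_negative": n_false_negative,
    }


# Counts taken from the text: 118 validated variants, 111 concordant,
# and no false negatives observed.
summary = validation_summary(118, 111, n_false_negative=0)
print(f"concordance: {summary['concordance']:.1%}")
print(f"false positives: {summary['false_positive']}")
```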
Our NGS accuracy is high because we have deliberately used conservative cutoffs for identifying
our mutations, which must be supported by a minimum of 5 reads and quality score
corresponding to an error rate of <100. Thus, our identified mutations agreed well with our 3
methods of validation (Sanger, RainDance, and SNP array). NGS (or any form of sequencing for
that matter) has the same accuracy for genotyping regardless of the population frequency of the
variant it measures, so our high concordance rates for SNP array calls are indicative of the
accuracy of our experimental and bioinformatic pipeline methods for identifying known and novel
variants.
Table S6: Summary of Sanger sequencing results. Genes with an asterisk are depicted in the sample chromatograms in Figure S2. Function: NS=Nonsynonymous. Concordant: total experiments consistent between NGS and Sanger. Discordant: inconsistent between NGS and Sanger. NGS:+, Sanger:+: comparison of results where NGS was calling a variant as present. NGS:-, Sanger:-: comparison of results where NGS called a sample negative for the variant.
Figure S2: Representative chromatograms from Sanger experiments summarized in Table S6 (Starred genes). The top trace for each experiment depicts the trace expected if the genomic sequence matches the reference genome. The bottom trace is the chromatogram actually observed in DLBCL samples.
MLL3 Variants
Figure S3: Example variants discovered within the MLL3 gene, which was the most recurrently mutated in this study. Exome sequencing reads are visualized using the Integrative Genomics Viewer; grey color indicates matches to the reference genome, and mismatches are labelled in colored letters. A. Somatic mutation in DLBCL835 (top), clearly absent from the matched normal sequence (bottom). B. Six additional mutations are shown.
Identification of DLBCL Cancer Genes
The overall schema for identifying DLBCL cancer genes is summarized in Figure S4.
Genes mutated in DLBCL were identified by analyzing the 73 primary tumor samples. The
initial set of DLBCL mutations was determined from the 34 DLBCL primary tumors with
paired normal samples, which constituted the discovery set. Data from cell lines were not used in
this analysis.
Figure S4: Summary of DLBCL variant and cancer gene identification from 73 primary tumor samples.
For each of these cases, we identified mutations that were present in tumor but absent from the
paired normal cases (somatically mutated). We eliminated common genetic variants by
excluding those that occurred in the general population as identified from the following sources:
dbSNP 17, publicly available data (pilot 1) from the 1000 Genomes Project 18, 256 recently
published exomes from otherwise healthy individuals19-21, one additional HapMap exome that we
sequenced in this study, and those found to have a minor allele frequency of greater than 1% in
the 6500 exome dataset from the NHLBI Exome Sequencing Project.
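The filtering logic described above (present in tumor, absent from the paired normal, not a known population variant, and not above 1% minor allele frequency in the Exome Sequencing Project data) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: variants are represented as hypothetical `(chrom, pos, ref, alt)` tuples, and the population sources are assumed to have been merged into one set beforehand.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def candidate_somatic(
    tumor: Iterable[Variant],
    normal: Set[Variant],
    known_population: Set[Variant],
    esp_maf: Dict[Variant, float],
    maf_cutoff: float = 0.01,
) -> List[Variant]:
    """Keep tumor variants that are absent from the paired normal and not common."""
    kept = []
    for variant in tumor:
        if variant in normal:
            continue  # germline: also seen in the paired normal
        if variant in known_population:
            continue  # listed in dbSNP, 1000 Genomes pilot 1, or control exomes
        if esp_maf.get(variant, 0.0) > maf_cutoff:
            continue  # minor allele frequency above 1% in the ESP data
        kept.append(variant)
    return kept


# Toy data, invented for illustration only.
tumor_calls = [("1", 100, "A", "G"), ("2", 200, "C", "T"),
               ("3", 300, "G", "A"), ("4", 400, "T", "C")]
normal_calls = {("1", 100, "A", "G")}
population = {("2", 200, "C", "T")}
esp = {("3", 300, "G", "A"): 0.05, ("4", 400, "T", "C"): 0.001}

somatic = candidate_somatic(tumor_calls, normal_calls, population, esp)
print(somatic)
```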
We identified 5884 variants that were somatically mutated in at least one of these 34 tumor-normal
pairs. From this list, we identified 2589 variants that represented frameshift, nonsense, missense,
or loss of a stop codon changes, corresponding to 2140 genes. These 2140 genes were examined
in all primary DLBCL cases and found to have 4928 frameshift, nonsense, missense or loss of a
stop codon variants. Among these variants were also 125 variants from 58 genes that were
identified as potential mutational hotspots, occurring 4 or more times in DLBCLs (and none in
controls). We estimated the functional impact of each of these 4928 variants on the encoded
protein using a program that outputs a functional index score (described below). We also tallied
the number of variants by gene and noted whether the gene had been previously annotated as a
cancer gene in the COSMIC database 22. Finally, we estimated the rate of nonsynonymous
variation in these genes in normal controls. We limited this analysis to 257 previously sequenced
exomes from otherwise healthy individuals because these cases have exonic coverage similar to that of
our DLBCLs and were processed using methods identical to those used to characterize the
DLBCL exomes.
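The per-gene tally and the hotspot definition described above (a variant occurring 4 or more times in DLBCLs and never in controls) can be sketched as below. The record layout `(sample, gene, variant_id)` is hypothetical, and counting occurrences as distinct samples is an assumption; the text does not say how occurrences were counted.

```python
from collections import Counter


def tally_by_gene(calls):
    """Count variant calls per gene from (sample, gene, variant_id) records."""
    return Counter(gene for _, gene, _ in calls)


def find_hotspots(tumor_calls, control_variant_ids, min_occurrences=4):
    """Return (gene, variant_id) pairs seen in at least min_occurrences distinct
    tumor samples and never seen in controls."""
    samples_by_variant = {}
    for sample, gene, variant_id in tumor_calls:
        samples_by_variant.setdefault((gene, variant_id), set()).add(sample)
    return sorted(
        key
        for key, samples in samples_by_variant.items()
        if len(samples) >= min_occurrences and key[1] not in control_variant_ids
    )


# Toy records with placeholder gene and variant names, invented for illustration.
calls = (
    [(f"T{i}", "GENE_A", "varA1") for i in range(4)]    # 4 tumors, not in controls
    + [(f"T{i}", "GENE_B", "varB1") for i in range(5)]  # 5 tumors, but in controls
    + [("T0", "GENE_C", "varC1")]                       # single occurrence
)
controls = {"varB1"}

print(tally_by_gene(calls))
print(find_hotspots(calls, controls))
```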
We generated a statistical model for genes likely to be drivers. It takes into account 4 features:
gene size, background nonsynonymous mutation rates in normal samples, somatically acquired
events, and the rate of these events in carriers. Given that mutations are rare and the number of
genes is high relative to the number of samples, standard regression techniques do not apply.
Also chi-squared tests of independence, or other similar tests, for each individual gene, besides
having the obvious problem of multiple testing, would never account for important mutants that
occur with very low frequency but have other important characteristics. After filtering for genes
in which we observed a minimum of 1 somatic event and 1 additional rare event from the same
class or presence in the COSMIC database, we ranked genes based on their distance from known
cancer genes.
We calculated the distance of a gene from a pool of known cancer genes based on the 4 variables
listed above. Let μ and Σ be the mean vector and covariance matrix for the population of known
cancer genes, from which we calculate cancer gene candidate j’s distance. We use the well-known
Mahalanobis distance

$$D = (g_j - \mu)^{T} \Sigma^{-1} (g_j - \mu), \quad \text{where} \quad \Sigma = \frac{n_1 \Sigma_1 + n_2 \Sigma_2}{n_1 + n_2 - 2}.$$
This distance has several desirable properties. Unlike the Euclidean metric, which can only
deal with circular forms and is the special case of D where Σ=I, the identity matrix, it accounts
for the differing variances of the variables and the correlations between them. Here, because
μ and Σ are unknown, we estimate them from the sample mean and sample covariance matrix for
the population. In the special case where we assume that the population is multivariate normal, D
is drawn from the chi-squared distribution with p degrees of freedom. Therefore we can test
whether each individual gene belongs to the population of cancer genes or not, based on this
assumption.
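As an illustration of the distance and the chi-squared test described above, the following sketch computes D for one candidate gene and the corresponding tail probability. It is not the authors' code: the feature values are invented, the linear solve is a plain Gaussian elimination written to avoid external dependencies, and the chi-squared tail uses the closed form that holds only for an even number of degrees of freedom (here p = 4).

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis_d(g, mu, sigma):
    """D = (g - mu)^T Sigma^{-1} (g - mu), computed without forming the inverse."""
    diff = [gi - mi for gi, mi in zip(g, mu)]
    return sum(d * y for d, y in zip(diff, solve(sigma, diff)))


def chi2_sf_even(x, k):
    """P(X > x) for a chi-squared variable with an even number k of degrees of freedom."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("closed form requires a positive even k")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, k // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Invented example: 4 features for one candidate gene against an assumed
# mean vector and (diagonal) covariance matrix of the known cancer genes.
mu = [0.0, 0.0, 0.0, 0.0]
sigma = [[4.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
gene = [2.0, 1.0, 0.0, 0.0]

d = mahalanobis_d(gene, mu, sigma)
print(d, chi2_sf_even(d, k=4))
```

In practice a statistics library would normally supply both the matrix solve and the chi-squared tail probability; the hand-written versions here only keep the example self-contained.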
We compared two populations: known cancer genes from the literature and our candidate novel
cancer genes. The p-value measuring the level of distance is calculated from the F-distribution
F(d1,d2) with d1 and d2 degrees of freedom respectively:

$$\frac{n_1 n_2 (n_1 + n_2 - p - 1)}{(n_1 + n_2)\, p\, (n_1 + n_2 - 2)}\, D \sim F(p,\; n - p - 1)$$

Here $d_1 = p$ and $d_2 = n - p - 1$, with $n = n_1 + n_2$.
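The scaling of D into an F-distributed statistic can be sketched as follows. This is a minimal illustration that assumes n = n1 + n2 in the second degrees-of-freedom term, as in the standard two-sample result; the sample sizes and the value of D in the example are invented. The p-value itself would then be read from the upper tail of F(p, n - p - 1) using a statistics library.

```python
def f_statistic(d, n1, n2, p):
    """Scale the distance D into an F statistic.

    Returns (F, (df1, df2)) with df1 = p and df2 = n1 + n2 - p - 1.
    """
    n = n1 + n2  # assumption: n denotes the combined sample size
    df2 = n - p - 1
    if min(n1, n2, p) <= 0 or df2 <= 0 or n <= 2:
        raise ValueError("sample sizes too small for p variables")
    scale = (n1 * n2 * df2) / (n * p * (n - 2))
    return scale * d, (p, df2)


# Invented example: 50 known cancer genes, 100 candidates, 4 variables, D = 10.
f_value, dof = f_statistic(10.0, n1=50, n2=100, p=4)
print(f_value, dof)
```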
Genes closest in distribution ($P < 10^{-6}$) to known cancer genes were identified as DLBCL cancer
genes. Previously annotated cancer genes were required to have 1 involved case, while genes
that were not previously annotated as cancer genes were required to have a minimum of two
involved primary tumor DLBCL cases. Using these statistics, we found that 90% of the known
cancer genes and the newly identified DLBCL cancer genes had at least one variant with a
functional index of 0.9 or higher, and a rate of nonsynonymous variants of less than one per case
in the unmatched controls.
Using these criteria, we identified a total of 426 genes that were recurrently mutated. Excluding
those for which more than two-thirds of the variants also were found by the NHLBI Exome
Sequencing Project, 322 genes remained. Of these, 52 genes were previously annotated as
cancer genes. These 322 genes (Dataset S3) comprised 1418 variants, which are listed in Dataset
S4.
Figure S5 (below) shows a side-by-side comparison of DLBCL cancer genes identified in this
study and those that were previously identified in COSMIC. The distributions were largely
identical in the two gene groups.
Overlap with other studies
We observed partial overlap in gene lists between our study and three other DLBCL studies using
similar methodologies and deep sequencing23-25, shown in Figure 5. We explored whether the
degree of overlap changed when genes from the Lohr study were stratified by frequency. As
expected, as the frequency of genes increased, the overlap increased (Figure S6, below). Similar
results were observed when using genes from the other studies.
Figure S5
Figure S5A depicts the distribution of the number of cases affected by the genes annotated in COSMIC (orange line) as well as recurrently mutated genes in our data (blue line).
Figure S5B shows the distribution of non-synonymous variation in normal controls for the genes annotated in COSMIC (orange line) and the recurrently mutated genes in our data (blue line).
Figure S5C illustrates the distribution of the computed functional index scores for variants in the genes annotated in COSMIC (orange line) compared to that for the recurrently mutated genes in our data (blue line).
[Figure S6 chart. Y-axis: % overlap (0 to 100). X-axis: number of patients in Lohr study with gene somatically mutated: >0 (58 genes), >1 (58 genes), >2 (54 genes), >3 (36 genes), >4 (28 genes), >5 (17 genes). Series: Our work, Pasqualucci et al, Morin et al.]
However, even among these more frequently mutated genes, the genes that overlapped between
studies differed. While some gene mutations were observed in multiple studies, including
MYC26, B2M27 and PRDM1 23-25,28, a substantial portion of genes identified in each study were
not identified in the others. These observations, again, highlight the underlying genetic
heterogeneity of these tumors and the importance of biological validation of these findings.
We further explored whether this degree of heterogeneity also occurred in two published studies
that applied exome sequencing to define the genetics of head and neck cancer. While one study29
reports 462 genes as recurrently mutated in the disease and the other30 reports 199 genes, only 20
genes overlapped (Venn diagram, Figure S7). These findings suggest that our observations
regarding the heterogeneity of DLBCLs may also hold true in other cancers.
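The overlap comparisons in this section amount to set intersections on gene lists, optionally stratified by how many patients carry a somatic mutation in each gene. The sketch below illustrates that calculation with placeholder gene names and counts. Expressing the overlap as a percentage of the reference study's genes is an assumption, since the text does not state which denominator was used.

```python
def percent_overlap(reference_genes, other_genes):
    """Percentage of reference genes that also appear in the other gene list."""
    reference = set(reference_genes)
    if not reference:
        return 0.0
    return 100.0 * len(reference & set(other_genes)) / len(reference)


def overlap_by_frequency(patients_per_gene, other_genes, thresholds):
    """For each threshold t, keep reference genes mutated in more than t patients
    and report (number of genes kept, percent overlap with other_genes)."""
    result = {}
    for t in thresholds:
        subset = {gene for gene, n in patients_per_gene.items() if n > t}
        result[t] = (len(subset), percent_overlap(subset, other_genes))
    return result


# Placeholder data, invented for illustration: patients per gene in a
# reference study, and the gene list reported by a second study.
reference_counts = {"GENE_A": 6, "GENE_B": 3, "GENE_C": 1, "GENE_D": 1}
second_study = {"GENE_A", "GENE_B", "GENE_X"}

table = overlap_by_frequency(reference_counts, second_study, thresholds=[0, 1, 4])
for threshold, (n_genes, pct) in table.items():
    print(f">{threshold} ({n_genes} genes): {pct:.0f}% overlap")
```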
Figure S6: The accompanying chart shows the degree of overlap between different studies and that of Lohr et al, as a function of the frequency of somatic events. We found increasing overlap as the number of somatic events increases.
Figure S7: The accompanying Venn diagram depicts the comparison of overlapping gene mutations from two head and neck cancer studies. The number in parentheses indicates the number of genes identified in the study.
Genes with low coverage
We noted a number of genes that were not covered in our analysis, either due to not being a
Version 36 CCDS gene (e.g. MLL2) or due to methodological reasons (e.g. RERE). All genes
with fewer than four supporting reads in all samples were systematically excluded from analysis.
Genes with average depth less than 4 are listed in Table S7. These genes collectively comprise
phosphatase inhibitor (Sigma cat #P-5726) and PMSF (100mM stock). The lysate was vortexed
10s, incubated on ice for 10 minutes, vortexed again, and then incubated on ice again. Then, it
was sonicated using Covaris settings: duty cycle:5%, intensity:4, cycles/burst:200 for two 1-minute
pulses. The lysate was centrifuged at 4°C for 10 minutes at 16,000 rcf, and the supernatant was
transferred to a fresh tube.
100μg of protein was used per well for PI3K activity measurement by ELISA (Echelon
Biosciences part number K-1000s) per manufacturer instructions.
Figure S8: PI3Kinase ELISA Activity measurements of FL5 mAkt cells transfected with wild-type PIK3CD or mutant PIK3CD. Kinase activity of cells subject to IL3 removal compared to those not subject to removal is shown for each transfection.
References
1. Jima, D.D. et al. Deep sequencing of the small RNA transcriptome of normal and malignant human B cells identifies hundreds of novel microRNAs. Blood 116, e118-27 (2010).
2. Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L. & Rice, P.M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38, 1767-71 (2010).
3. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297-303 (2010).
4. Parmigiani, G. et al. Design and analysis issues in genome-wide somatic mutation studies of cancer. Genomics 93, 17-21 (2009).
5. Robinson, J.T. et al. Integrative genomics viewer. Nat Biotechnol 29, 24-6 (2011).
6. Ge, D. et al. SVA: Software for Annotating and Visualizing Sequenced Human Genomes. Bioinformatics (2011).
7. Reva, B., Antipin, Y. & Sander, C. Determinants of protein function revealed by combinatorial entropy optimization. Genome Biol 8, R232 (2007).
8. Chiang, D.Y. et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods 6, 99-103 (2009).
9. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6, 677-81 (2009).
10. Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res 19, 1639-45 (2009).
11. Needleman, S.B. & Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53 (1970).
12. Pruitt, K.D. et al. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res 19, 1316-23 (2009).
13. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-2 (2010).
14. Kamai, T. et al. Increased Rac1 activity and Pak1 overexpression are associated with lymphovascular invasion and lymph node metastasis of upper urinary tract cancer. BMC Cancer 10, 164 (2010).
15. Altshuler, D.M. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-8 (2010).
16. Wood, L.D. et al. The genomic landscapes of human breast and colorectal cancers. Science 318, 1108-13 (2007).
17. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308-11 (2001).
18. Siva, N. 1000 Genomes project. Nat Biotechnol 26, 256 (2008).
19. Ng, S.B. et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461, 272-6 (2009).
20. Yi, X. et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329, 75-8 (2010).
21. Li, Y. et al. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat Genet 42, 969-72 (2010).
22. Forbes, S.A. et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res 39, D945-50 (2011).
23. Morin, R.D. et al. Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma. Nature 476, 298-303 (2011).
24. Pasqualucci, L. et al. Analysis of the coding genome of diffuse large B-cell lymphoma. Nat Genet 43, 830-7 (2011).
25. Lohr, J.G. et al. Discovery and prioritization of somatic mutations in diffuse large B-cell lymphoma (DLBCL) by whole-exome sequencing. Proc Natl Acad Sci U S A 109, 3879-84 (2012).
26. Pasqualucci, L. et al. Hypermutation of multiple proto-oncogenes in B-cell diffuse large-cell lymphomas. Nature 412, 341-6 (2001).
27. Challa-Malladi, M. et al. Combined genetic inactivation of beta2-Microglobulin and CD58 reveals frequent escape from immune recognition in diffuse large B cell lymphoma. Cancer Cell 20, 728-40 (2011).
28. Mandelbaum, J. et al. BLIMP1 is a tumor suppressor gene frequently disrupted in activated B cell-like diffuse large B cell lymphoma. Cancer Cell 18, 568-79 (2010).
29. Agrawal, N. et al. Exome sequencing of head and neck squamous cell carcinoma reveals inactivating mutations in NOTCH1. Science 333, 1154-7 (2011).
30. Stransky, N. et al. The mutational landscape of head and neck squamous cell carcinoma. Science 333, 1157-60 (2011).
31. Dave, S.S. et al. Molecular diagnosis of Burkitt's lymphoma. N Engl J Med 354, 2431-42 (2006).
32. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000).
33. Zhang, J. et al. Patterns of microRNA expression characterize stages of human B cell differentiation. Blood 113, 4586-94 (2009).
34. Kelley, L.A. & Sternberg, M.J. Protein structure prediction on the Web: a case study using the