BioMed Central
BMC Bioinformatics
Open Access

Methodology article

Estimation of tumor heterogeneity using CGH array data

Kai Wang 1,2, Jian Li 1, Shengting Li 1, Lars Bolund 1 and Carsten Wiuf* 2

Address: 1 Institute of Human Genetics, University of Aarhus, Aarhus, Denmark and 2 BiRC – Bioinformatics Research Center, University of Aarhus, Aarhus, Denmark

Email: Kai Wang - [email protected]; Jian Li - [email protected]; Shengting Li - [email protected]; Lars Bolund - [email protected]; Carsten Wiuf* - [email protected]

* Corresponding author

Published: 9 January 2009
Received: 30 July 2008
Accepted: 9 January 2009

BMC Bioinformatics 2009, 10:12 doi:10.1186/1471-2105-10-12

This article is available from: http://www.biomedcentral.com/1471-2105/10/12

© 2009 Wang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Array-based comparative genomic hybridization (CGH) is a commonly used approach to detect DNA copy number variation in genome-wide screens. Several statistical methods have been proposed to define genomic segments with different copy numbers in tumors. However, most tumors are heterogeneous and show variation in DNA copy numbers across tumor cells. The challenge is to reveal the copy number profiles of the subpopulations in a tumor and to estimate the percentage of each subpopulation.

Results: We describe a relation between experimental data and exact DNA copy number and develop a statistical method to reveal the heterogeneity of tumors containing a mixture of different-stage cells. Furthermore, we validate the method on simulated data and apply the method to 29 pairs of breast primary tumors and their matched lymph node metastases.

Conclusion: We demonstrate a new method for CGH array analysis that allows a tumor sample to be classified according to its heterogeneity. The method gives an interpretable series of copy number profiles, one for each major subpopulation in a tumor. The profiles facilitate identification of copy number alterations in cancer development.

Background

Heterogeneity is an important characteristic of most cancers. It manifests itself in various ways, for example as heterogeneity in gene expression, protein abundance and/or genomic DNA copy number [1-3]. In this paper we focus exclusively on heterogeneity in genomic DNA copy number. Genomic DNA copy number variation in a tumor reflects concomitant or successive development of various foci and indicates that malignant transformation of cells is a dynamic evolutionary process. Numerous studies have demonstrated that the development of tumors involves accumulation of various genetic alterations [4-8]. Comparative genomic hybridization (CGH), matrix-based BAC/oligo array CGH, and oligonucleotide-based arrays are techniques that are frequently applied to elucidate intertumor heterogeneity across cancers, patients or stages; the genomic profile of a tumor is presented at a fixed time point and averaged across different cells in the tumor.

In contrast, intratumor heterogeneity is rarely reported [9]. Laser-capture micro-dissection is a powerful tool to select a few phenotypically homogeneous tumor cells, and thus a way to circumvent the problem of averaging across many potentially inhomogeneous tumor cells. Methods for whole genome amplification enable researchers to obtain
sufficient DNA for CGH analysis even from a few cells [10]. In this way the genomic profile of a small (homogeneous) region in the tumor can be studied, whereas the heterogeneity of the tumor might be elucidated by investigating several different regions across the tumor. Naturally, the latter is time-consuming and labor-intensive and, to our knowledge, has not been reported.

With the above in mind, we have developed a statistical method to study tumor heterogeneity. It takes CGH array data from individual tumors as input; one tumor sample is represented by one array and contains DNA from a potentially heterogeneous cell population. Our method estimates the number of dominant tumor subpopulations, the percentages of the subpopulations in the sample, and the copy number profiles of the dominant subpopulations. The method also estimates the percentage of normal cells. Normal cells are diploid (two copies of all genomic DNA) and typically consist of nonmalignant epithelium, fibroblasts and/or penetrated lymphocytes. To validate the method we have simulated data according to a model derived from real CGH data. Additionally, we have mixed some real tumor samples with normal DNA to obtain samples with partially known profiles. Subsequently, we applied our method to 29 paired primary and lymph node metastasis breast cancer samples.

Our method can be considered a classifier in the sense that it assigns a number of subpopulations to a given tumor sample. Alternatively, the method might be considered a model selection procedure over an extensive number of models: we seek the model that explains the data best, optimizing over the number of subpopulations and the copy number profile for each subpopulation.

Results and discussion

Calibration experiment

A series of calibration experiments was conducted to test the array CGH platform in our laboratory. The majority of samples were from normal males and females (diploid samples), but some samples were from patients with genomic abnormalities, e.g. trisomies and monosomies. Importantly, all these samples are assumed to be homogeneous, i.e. all cells in a sample have the same copy number alteration(s). We fit a linear model to describe the relationship between log-copy number and log-intensity, as described in Methods (The copy number model). The parameters of the linear model, y = αx + β, are estimated to be α̂ = 0.6049, 95% CI: (0.5542, 0.6556), and β̂ = -0.039, 95% CI: (-0.085, 0.0067), respectively. The observed and fitted values show high correlation; the squared Pearson correlation coefficient R² is greater than 0.98. The observed values and the regression line are shown [see Additional file 1].
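As a rough illustration of such a calibration fit, the sketch below estimates α and β by ordinary least squares from pairs (x, y), where x is a known log copy-number ratio and y the corresponding log-intensity ratio. The data points are synthetic and noise-free, and the function name `fit_line` is hypothetical; this is a minimal sketch under those assumptions, not the authors' code.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = alpha * x + beta; returns (alpha, beta, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return alpha, beta, r2

# Synthetic calibration points (assumed values, for illustration only):
# copy-number ratios 0.5, 1, 1.5, 2 and 2.5 relative to the reference.
ratios = [0.5, 1.0, 1.5, 2.0, 2.5]
xs = [math.log(r) for r in ratios]
ys = [0.6 * x - 0.04 for x in xs]  # generated from a known line, without noise

alpha, beta, r2 = fit_line(xs, ys)
```

With real calibration arrays the points scatter around the line, and the confidence intervals quoted above would come from the usual regression standard errors, which this sketch does not compute.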

Heterogeneity in real tumors

We applied the procedure described in Methods (Classification of samples) to estimate the level of genomic heterogeneity in 29 pairs of primary tumor and lymph node metastasis.

The method allows us to estimate the number of dominant subpopulations, the copy number profile and the percentage of cells in each subpopulation. Our method assumes a model of sequential tumor evolution in which each subpopulation evolves from the previous one by introducing new aberrations, or by making aberrations in the previous population more extreme, i.e. by further increasing or further decreasing copy numbers; see Methods (Mixture modeling of tumor samples) for details.

The results from the analysis of the primary tumors and lymph node metastases are shown in detail [see Additional file 2] and summarized in Table 1. To estimate the complexity of a tumor, we introduce the following measure, called the Aberration Index (AI),

$$\mathrm{AI}_k = \frac{\sum_i m_i\,|C_{ik} - 2|}{\sum_i m_i}, \qquad (1)$$

Table 1: Subpopulation summary of the 29 pairs of primary and metastasis samples

      #    P1           P2           P3          AI1           AI2           AI3           Total         Pure
T-2   15   25.3 (8.8)   -            -           1.13 (0.76)   -             -             0.25 (0.13)   1.13 (0.76)
T-3   13   33.4 (9.4)   14.2 (5.1)   -           0.34 (0.13)   1.18 (0.26)   -             0.28 (0.10)   0.59 (0.20)
T-4   1    23 (-)       34 (-)       8 (-)       0.25 (-)      0.53 (-)      1.40 (-)      0.35 (-)      0.54 (-)
M-2   16   24.8 (8.3)   -            -           1.07 (1.00)   -             -             0.21 (0.11)   1.07 (1.00)
M-3   10   32.5 (8.7)   15.1 (6.1)   -           0.45 (0.23)   1.33 (0.41)   -             0.33 (0.12)   0.71 (0.30)
M-4   3    16.7 (3.1)   31 (6)       8.7 (2.1)   0.18 (0.04)   0.48 (0.11)   1.33 (0.17)   0.29 (0.05)   0.52 (0.09)

The table summarizes the analysis of the 29 pairs of samples [see Additional file 2]. Shown are mean values with standard deviations in parentheses. T-i: primary tumor with i subpopulations; M-i: metastasis with i subpopulations; #: number of samples; Pk: percentage pk of (abnormal) subpopulation k ≥ 1; AIk: Aberration Index for subpopulation k; Total: weighted sum of AIk, Σk pk AIk; Pure: AI•, normalized weighted sum of AIk, cf. equation (2).


where mi is the number of clones in segment i, Cik is the copy number of segment i in subpopulation k, and |x| denotes the absolute value of x. Here we assume clones are uniformly spaced across the genome; if this is not the case, the contribution from each clone can be weighted by its distance to neighboring clones. The estimated subpopulations are named P0, P1, P2, ..., and ordered according to increasing AI. P0 consists of normal cells only and has AI0 = 0.
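To make the index concrete, the following sketch computes AI for each subpopulation from per-segment clone counts and copy numbers, reading the index as the clone-weighted mean absolute deviation from two copies. It also forms the two summaries named in the footnote of Table 1: the weighted sum "Total" and its normalization by the abnormal fraction, "Pure". The segment layout, percentages and function names are made up for illustration and are not taken from the paper's data.

```python
def aberration_index(clone_counts, copy_numbers):
    """Clone-weighted mean of |C_ik - 2| over segments (our reading of the AI).

    clone_counts[i]  -- m_i, number of clones in segment i
    copy_numbers[i]  -- C_ik, absolute copy number of segment i (2 = normal)
    """
    total_clones = sum(clone_counts)
    weighted = sum(m * abs(c - 2) for m, c in zip(clone_counts, copy_numbers))
    return weighted / total_clones

def total_and_pure(fractions, indices):
    """fractions[k] = p_k and indices[k] = AI_k for k = 0, 1, ...; index 0 is normal."""
    total = sum(p * ai for p, ai in zip(fractions, indices))
    pure = total / (1.0 - fractions[0])  # normalize by the abnormal fraction
    return total, pure

# Hypothetical genome with three segments of 100, 50 and 50 clones.
m = [100, 50, 50]
profiles = [
    [2, 2, 2],  # P0: normal cells
    [2, 3, 2],  # P1: single-copy gain in the second segment
    [2, 4, 1],  # P2: the gain becomes more extreme, plus a new loss
]
ai = [aberration_index(m, c) for c in profiles]
total, pure = total_and_pure([0.5, 0.3, 0.2], ai)
```

In this toy sample the subpopulations come out ordered by increasing AI (0, 0.25, 0.75), matching the naming convention P0, P1, P2.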

Tumors with only one abnormal subpopulation have higher average AI than tumors with many abnormal subpopulations; see Table 1. Also, the average complexity of all abnormal subpopulations in a tumor,

$$\mathrm{AI}_{\bullet} = \frac{\sum_{k>0} p_k\,\mathrm{AI}_k}{1 - p_0}, \qquad (2)$$

("Pure" in Table 1), where pk denotes the percentage of subpopulation k, decreases with the number of subpopulations. This reflects that the percentage of the most complex subpopulation is generally not very high, whereas the percentage of the least complex is relatively much higher.

We clustered all subpopulation profiles, rather than just the overall profiles of the samples. The result, shown in Figure 1, presents the similarity among all subpopulations across the 29 tumor pairs. In 16 cases (out of 29) all primary and metastasis subpopulations cluster together (shown in blue in the figure), i.e. the subpopulations of the metastasis are more similar to the subpopulations of the primary tumor than to subpopulations of other samples. The high similarity between the primary tumor and the lymph node metastasis from the same patient indicates that biological characteristics of the primary tumor are maintained in the lymph node metastasis. In other samples the primary tumor and the metastasis show much less similarity, and even within a sample, the subpopulations can be very dissimilar. The yellow cluster in Figure 1 consists mainly of subpopulations with low AI, i.e. few genomic aberrations.

Table 2 shows the relationship between the estimated number of subpopulations in the primary tumor and in the corresponding metastasis. There is no clear relationship between the two numbers: only 14 pairs out of 29 show the same subpopulation number. Of the 16 tumor pairs that clustered together in Figure 1 (see the paragraph above), 10 pairs have the same subpopulation number, while in 4 cases the metastasis shows a lower number than the primary tumor.

The average percentage of the normal cell subpopulation in a tumor is around 60%, which is higher than we expected. According to the pathologists involved in removing the tumors by surgery, the samples contain at least 70% malignant cells. However, this percentage is judged by eye and reflects how large a fraction of the tumor appears to consist of normal cells. It is likely to be an overestimate of the malignant fraction because malignant cells are generally bigger than normal cells [11]. Also, some normal cells are typically removed before the samples are subjected to array analysis.

Figure 1: Cluster diagram of the 29 pairs of tumors. The 29 pairs of primary and metastasis samples were divided into 89 subpopulations using the method described in the paper. For two leaves with the same ID (e.g. T53), P1 refers to the abnormal subpopulation with the fewest aberrations, P2 (if it exists) refers to the subpopulation with more aberrations than P1, and P3 (if it exists) to the one with the most aberrations. The percentage of each subpopulation is also included. The cluster diagram was generated using average linkage clustering based on the estimated copy numbers for all 3340 clones. Reducing the number of clones produces very similar results.

Finally, based on our results, we can make some predictions about possible subpopulation developments in paired tumor samples (Figure 2). For example, the P1 (P2) subpopulations from the samples T51 and M51 cluster together, and they have likely migrated from the primary tumor to the lymph node as a whole. In contrast, the subpopulations of M84 do not cluster together with T84-P1, which might indicate that the metastasis has arisen from T84-P2 in the primary tumor.

Simulation results

We simulated samples based on the 29 pairs of real samples (in total 58 samples), as described in Methods (Simulation). For each real sample we fit copy number profiles assuming 2, 3 or 4 subpopulations and use these profiles as templates for simulation of artificial log-intensities. For example, for a real sample with an estimated 2 subpopulations, we fit 2, 3 and 4 subpopulations and simulate log-intensities based on these profiles. In this case, the profiles of the four subpopulations are very similar because the real sample is best explained by two subpopulations. Subsequently, we applied our method to the simulated samples; the results are summarized in Tables 3 and 4.

We learn several things from the simulations (Table 3). Generally, it is possible to predict the number of subpopulations under the chosen simulation model. However, samples with 3 or 4 similar subpopulations (Table 3: Real = 2, Simulated = 3 or 4) are difficult to predict correctly (48% and 15%, respectively), whereas samples with few, less similar subpopulations are much easier to predict; in Table 3, simulated samples with 2 subpopulations achieve a prediction accuracy of at least 87%.
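The accuracies quoted above can be recomputed from the counts reported in Table 3. The short sketch below does this for the templates whose real sample was best explained by two subpopulations; the variable and function names are ours, and only the counts come from the table.

```python
# Counts from Table 3 for "Real = 2": keys are the simulated number of
# subpopulations, values are how often 2, 3 or 4 subpopulations were predicted.
counts_real2 = {
    2: [115, 9, 0],
    3: [61, 59, 4],
    4: [57, 48, 19],
}
predicted_labels = [2, 3, 4]

def accuracy(simulated_k, row):
    """Fraction of simulated samples whose predicted number equals simulated_k."""
    correct = row[predicted_labels.index(simulated_k)]
    return correct / sum(row)

accuracies = {k: round(accuracy(k, row), 2) for k, row in counts_real2.items()}
```

Each row sums to 124 simulated samples, and the rounded fractions reproduce the 0.93, 0.48 and 0.15 listed in the table.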

Next, we estimated the accuracy of the predicted subpopulation percentages and the corresponding profiles (Table 4). When the subpopulation number was predicted correctly, the estimated subpopulation percentages and copy number profiles were compared to the values used in the simulation. Table 4 shows that the copy numbers of the abnormal subpopulations are predicted correctly for more than 80% of the clones, with higher accuracy obtained when there are few subpopulations than when there are many.

Our method of estimation assumes a model of sequential tumor evolution. To test the method's robustness to violations of the model we did the following; see Methods (Simulation) for details. First we simulated a sample according to the model. Then, for each segment, we modified the copy numbers by adding or subtracting a Poisson-distributed number of copies. The results are shown in Table 5. Even when the percentage of segments violating the model is high, we predict the correct subpopulation number in the majority of the cases.
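The exact perturbation scheme is given in Methods (Simulation) and is not reproduced here, so the following is only a sketch of the general idea under our own assumptions: each abnormal copy number is shifted by a Poisson(λ) number of copies with a random sign, floored at zero, and a segment counts as violating the sequential model when its copy numbers across P0, P1, P2, ... are no longer monotone. All names and the example profile are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def perturb(profile, lam, rng):
    """Add or subtract a Poisson(lam) number of copies to each entry (floored at 0)."""
    out = []
    for copies in profile:
        shift = poisson(lam, rng) * rng.choice((-1, 1))
        out.append(max(0, copies + shift))
    return out

def is_sequential(segment):
    """segment = copy numbers of one segment in P0, P1, ..., with P0 normal (2).

    True if the values never decrease or never increase, i.e. gains only
    become more extreme or losses only become more extreme.
    """
    pairs = list(zip(segment, segment[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

rng = random.Random(0)
# Hypothetical sample: rows are segments, columns are P0, P1, P2.
segments = [[2, 2, 3], [2, 1, 1], [2, 3, 4], [2, 2, 2]]
# Keep P0 fixed at two copies and perturb only the abnormal subpopulations.
perturbed = [[seg[0]] + perturb(seg[1:], 0.5, rng) for seg in segments]
violating = sum(not is_sequential(seg) for seg in perturbed) / len(perturbed)
```

The quantity `violating` plays the role of the "Segments" row in Table 5, although the values there come from the authors' own simulation design.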

Validation experiment

The method was also validated using experimental data. We performed a series of hybridization experiments with different combinations of malignant and normal DNA as follows:

1) The tumor DNA is hybridized with normal DNA (pooled female healthy population).
2) A combination of 85% tumor DNA and 15% normal DNA is hybridized with normal DNA (pooled female healthy population).
3) A combination of 70% tumor DNA and 30% normal DNA is hybridized with normal DNA (pooled female healthy population).

The hybridizations were performed using two different tumor DNA samples. For each of the hybridizations we estimated the number of subpopulations and percentages (Table 6). Since we do not know the composition of the tumor DNA, we do not know the true number of subpopulations and percentages, but based on our estimates we can predict what the estimated number of subpopulations and percentages should be in the mixed samples.

Table 2: Primary tumor vs. metastasis

            Metastasis
Primary     2     3     4
2           9     4     2
3           7     5     1
4           0     1     0

Shown is the estimated number of subpopulations in the primary tumor and the corresponding lymph node metastasis.

Figure 2: Subpopulation development. Here we show possible subpopulation development in two paired tumor samples. The dashed lines connect the subpopulations from the same sample. The solid and dashed lines represent the most likely and the least likely development path, respectively. From the top down, the subpopulations contain more and more genomic aberrations.

Table 6 shows that the predictions generally are in accordance with our expectations. All discrepancies between experimental and estimated percentages fall within the error bounds reported in Table 4. However, one of the tumor samples is best explained by three subpopulations (S1 in Table 6), whereas the two mixed samples ("S1 with 15%" and "S1 with 30%") are best explained by two subpopulations. By adding normal cells, the signal from aberrant clones becomes diluted and it becomes more difficult to distinguish different abnormal subpopulations.

Conclusion

Tumor heterogeneity is an important aspect of tumor evolution and progression. However, this aspect has, to the best of our knowledge, largely been ignored in the analysis of CGH and SNP array data [12,13]. In Refs. [12,13], a fraction of the tumor is assumed to consist of normal cells, which weaken the signal from the aberrant cells. Only Ref. [13] estimates the fraction of normal cells directly, but we cannot compare this method to ours since it was developed for SNP array data. The method described in Ref. [12] does not output the frequency of normal cells.

We have introduced a novel algorithm to estimate tumor heterogeneity and evaluated its performance on simulated and real tumor data. The method adds to our understanding of the genomic aberration profile, the quantification of genomic instability in the tumor, and the heterogeneity of the tumor.

One of the main difficulties in developing quantitative methods for array CGH data is the lack of knowledge about how tumors evolve and differentiate. Better and more accurate models could be developed if more were known about tumor evolution. Therefore, it might be difficult to make decisions on a strict mathematical basis only, because the underlying hypotheses might be difficult to test or validate with current data sets. The appropriateness of the novel methodology can only be evaluated in the long run, by whether its conclusions demonstrate utility for improving biological understanding and clinical decisions. Our approach is one possible algorithm to interpret the biology of the tumor genomic profile.

In CGH array analysis, copy number changes are measured relative to a reference level. Generally, the reference level is not known and the median (or mean) log-intensity is typically assumed to correspond to two copies; losses and amplifications are then measured relative to the median log-intensity level. Our method makes the same assumption. This implies that a tumor sample consisting of e.g. two subpopulations, one diploid and one n-ploid, would be identified as purely diploid. Each clone will have a log-intensity value that reflects the mixture of the two subpopulations and will, erroneously, be equated with two copies. However, if the two subpopulations are not euploid, our method might be able to disentangle the two subpopulations. This situation is not unlike traditional CGH array analysis, where the tumor sample will be identified as one homogeneous population. Only if additional information is available, e.g. from karyotyping, can the reference level be properly adjusted.
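The consequence of the median assumption can be seen in a small noise-free example. The sketch below assumes that the log intensity ratio of a clone is α times the log of its average copy number relative to two reference copies (array-specific constants are ignored), mixes 60% diploid with 40% tetraploid cells, and then subtracts the array median. The proportions, the number of clones and the function name are illustrative assumptions.

```python
import math
import statistics

ALPHA = 0.6049  # slope reported for the calibration experiment

def log_ratios(mixed_copy_numbers, reference_copies=2.0):
    """Noise-free log intensity ratios, normalized by subtracting the array median."""
    raw = [ALPHA * math.log(c / reference_copies) for c in mixed_copy_numbers]
    median = statistics.median(raw)
    return [value - median for value in raw]

# Hypothetical sample: 60% diploid cells and 40% tetraploid cells, five clones.
diploid = [2, 2, 2, 2, 2]
tetraploid = [4, 4, 4, 4, 4]
mixed = [0.6 * d + 0.4 * t for d, t in zip(diploid, tetraploid)]  # 2.8 copies per clone

normalized = log_ratios(mixed)
normalized_diploid = log_ratios(diploid)
```

Both arrays end up with a log-ratio of zero at every clone, so after median normalization the mixed sample cannot be told apart from a purely diploid one, which is the situation described in the paragraph above.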

We anticipate various lines of improvement, both in the chosen statistical methodology (e.g. adopting a Bayesian framework to control the vast number of copy number parameters) and in the mathematical modelling of tumor progression. These advances should be developed in tandem with the richer and larger data sets that are likely to become available with improved genomic technology. Our method (and improvements of it) can also be applied to SNP array data. Recent methods for SNP array analysis, e.g. [13,14], distinguish the two possible alleles; this might be useful for providing more accurate inference on copy numbers and on the copy number level of the reference population, because each SNP carries two observations and not just one as for CGH arrays.

Table 3: Prediction accuracy for simulated samples

Real   Simulated   Predicted as 2   Predicted as 3   Predicted as 4   Correct (fraction)
2      2           115              9                0                0.93
2      3           61               59               4                0.48
2      4           57               48               19               0.15
3      2           80               11               1                0.87
3      3           8                78               6                0.85
3      4           6                60               26               0.28
4      2           15               1                0                0.94
4      3           0                13               3                0.81
4      4           0                1                15               0.94

For each real sample four simulated samples were created; in total 174 samples. The real sample was used as template for the simulated samples. In the table, the simulation results are shown according to the estimated number of subpopulations in the real samples. Real: estimated number of subpopulations in the real sample; Simulated: the number of subpopulations in the simulated sample; Predicted: the predicted number of subpopulations in the simulated sample.

Table 4: Accuracy of copy numbers and percentages

            #Subpopulations
            2             3             4
A (in %)    2.05 (2.26)   2.76 (2.50)   4.78 (4.01)
B (in %)    89.5 (13.6)   82.8 (10.0)   80.5 (9.0)

The table shows the accuracy of the estimated copy numbers and subpopulation percentages when the number of subpopulations is correctly predicted. A) The average absolute difference between the estimated and true percentages in the simulated samples. B) The average percentage of clones for which the copy number was predicted correctly, excluding the normal subpopulation. Standard deviations in parentheses.

Methods

Materials

Cell lines with known copy number gains and losses were used to establish a copy number model. Here, we used several cell lines, including trisomy 13, trisomy 18, trisomy 21 and 49, XXXXX. Normal male and female DNA were also used.

Twenty-nine pairs of primary breast tumors and their matched lymph node metastases were provided by Copenhagen University Hospital. The project was approved by the Scientific and Ethical Committee of the Copenhagen and Frederiksberg Municipalities.

Arrays covering the whole genome with elements produced from bacterial artificial chromosome (BAC) clones were obtained from the Wellcome Trust Sanger Institute. The human DNA fragments of the 3340 BAC clones are spaced at approximately 1 Mb intervals across each chromosome arm. The experimental process is explained in detail in [15]. Briefly, each clone is spotted on slides in a neighboring triplicate pattern. Annotations of the clones are based on the 1-Mb clone information published by the Wellcome Trust Sanger Institute and updated using the 38_36 version of the 1-Mb clone information released by Ensembl.

Normalization of arrays

The intensities of Cy3 (tumor sample) and Cy5 (reference) were extracted from 16-bit TIF files using the Tracker (Applied Precision) software. Subsequently, the data were subjected to quality assessment and a filtering process to remove clones of poor quality.

Clones were removed from the subsequent analysis if one of the following conditions was fulfilled:

a) The spot is labeled "Undetected" by Tracker.
b) The Sanger annotation of the clone is inconsistent with the Ensembl annotation (see above).
c) The spot's Cy5 (reference) intensity is less than two times the standard deviation (SD) of its background intensity.
d) Only one spot out of the three replicates is left after the above procedure.
e) The coefficient of variation (CV) of the intensity ratios Cy3/Cy5 for the clone exceeds 0.08.
f) The clone maps to chromosome Y.
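The filtering rules can be sketched as a single function applied to the replicate spots of one clone. The record layout (`flag`, `cy3`, `cy5`, `background_sd`), the reduction of the annotation check to a chromosome comparison, and the use of the sample standard deviation for the CV are all our assumptions; the Tracker output format is not described in the text.

```python
import statistics

def keep_clone(spots, sanger_chrom, ensembl_chrom, max_cv=0.08):
    """Apply conditions a) to f) to the replicate spots of one clone.

    spots: list of dicts with keys 'flag', 'cy3', 'cy5', 'background_sd'
    (a hypothetical record layout). Returns the usable Cy3/Cy5 ratios, or
    None if the clone is removed.
    """
    if sanger_chrom != ensembl_chrom:   # b) inconsistent annotation (simplified)
        return None
    if ensembl_chrom == "Y":            # f) clone maps to chromosome Y
        return None
    usable = [
        s for s in spots
        if s["flag"] != "Undetected"               # a) undetected spot
        and s["cy5"] >= 2 * s["background_sd"]     # c) weak reference signal
    ]
    if len(usable) <= 1:                # d) at most one replicate left
        return None
    ratios = [s["cy3"] / s["cy5"] for s in usable]
    cv = statistics.stdev(ratios) / statistics.mean(ratios)
    if cv > max_cv:                     # e) replicates disagree too much
        return None
    return ratios

# Hypothetical triplicate: two good spots and one undetected spot.
good = [
    {"flag": "OK", "cy3": 1000.0, "cy5": 980.0, "background_sd": 40.0},
    {"flag": "OK", "cy3": 1020.0, "cy5": 1000.0, "background_sd": 35.0},
    {"flag": "Undetected", "cy3": 10.0, "cy5": 5.0, "background_sd": 30.0},
]
ratios = keep_clone(good, "13", "13")
```

Here the clone survives with two usable replicates; the same spots annotated to chromosome Y, or with only one usable replicate, would be dropped.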

Table 5: Robustness of the method

                   3 subpopulations                          4 subpopulations
λ                  1             0.5           0.25          1             0.5           0.25
Correct            27            27            33            27            21            28
Incorrect          19            19            13            13            19            12
A (in %)           4.74 (4.68)   4.15 (3.62)   3.52 (3.70)   7.70 (7.95)   5.43 (4.23)   5.59 (5.03)
B (in %)           5.37 (6.89)   5.59 (6.19)   5.43 (7.13)   7.43 (9.64)   5.65 (5.35)   6.93 (6.98)
Segments (in %)    37.0          24.4          11.5          65.1          45.1          25.6

The table shows results for simulated data not fulfilling the assumption of sequential tumor evolution. With increasing λ, an increasing number of samples are incorrectly classified. A) The average absolute difference between the estimated and true percentages in the simulated samples when the number of subpopulations is predicted correctly. B) The average absolute difference between the estimated and true percentages of the normal subpopulation. Standard deviations in parentheses. 'Segments' is the percentage of segments violating the model of sequential tumor evolution.

Table 6: Validation experiment

                             Estimated based on (X = S1, S2)
              Experiment     X            X with 15%    X with 30%
S1            76,24          -            79,21         77,23
S1 with 15%   82,18          80,20        -             81,19
S1 with 30%   84,16          83,17        85,15         -
S1            60,29,11       -            72,20,8       70,21,9
S1 with 15%   76,17,7        66,25,9      -             75,18,7
S1 with 30%   79,15,6        72,20,8      80,14,6       -
S2            62,24,14       -            49,33,18      50,31,19
S2 with 15%   57,28,15       68,20,12     -             57,27,16
S2 with 30%   65,22,13       73,17,10     65,23,12      -

In the "Experiment" column, the estimated subpopulation percentages from the 6 experiments are shown: two pure tumor samples (S1 and S2), and four samples with tumor mixed with 15% or 30% normal cells. The best fit for S1 is three subpopulations, whereas it is two for "S1 with 15%" and "S1 with 30%"; therefore we show results for both two and three subpopulations to facilitate comparison. The remaining three columns contain percentages estimated from the Experiment column; e.g. to estimate the percentages of the sample with 85% malignant cells and 15% normal cells ("S2 with 15%") from the sample S2, compute (0.62·85 + 15, 0.24·85, 0.14·85) = (68, 20, 12).
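The arithmetic in the caption generalizes to a pair of small helper functions: one predicts the percentages after a given fraction of normal DNA has been mixed in, the other inverts that step. The function names are ours; the input percentages are taken from the "Experiment" column of Table 6.

```python
def add_normal(percentages, normal_added):
    """Predict percentages after replacing a fraction of the sample by normal DNA.

    percentages[0] is the normal subpopulation (in %); normal_added is the
    admixed fraction, e.g. 0.15 for "with 15%".
    """
    tumor_share = 1.0 - normal_added
    mixed = [p * tumor_share for p in percentages]
    mixed[0] += 100.0 * normal_added
    return mixed

def remove_normal(percentages, normal_added):
    """Inverse of add_normal: recover the unmixed sample from a mixed one."""
    tumor_share = 1.0 - normal_added
    pure = [p / tumor_share for p in percentages]
    pure[0] = (percentages[0] - 100.0 * normal_added) / tumor_share
    return pure

s2 = [62, 24, 14]                                   # S2, three subpopulations
predicted = [round(p) for p in add_normal(s2, 0.15)]
s1_mixed = [82, 18]                                 # "S1 with 15%", two subpopulations
back = [round(p) for p in remove_normal(s1_mixed, 0.15)]
```

The first call reproduces the (68, 20, 12) of the caption, and the second reproduces the entry 79,21 listed for S1 in the "X with 15%" column.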


Finally, the ratios of Cy3/Cy5 intensities are calculated and log-transformed. Subsequently, the median of the log-ratios from the whole array is subtracted from each log-ratio to normalize all spots.
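A minimal sketch of this normalization step is shown below. The text does not state the base of the logarithm, so the natural logarithm is assumed here; the intensities and the function name are made up for illustration.

```python
import math
import statistics

def normalize_log_ratios(cy3, cy5):
    """Log-transform Cy3/Cy5 ratios and subtract the array-wide median."""
    log_ratios = [math.log(t / r) for t, r in zip(cy3, cy5)]
    median = statistics.median(log_ratios)
    return [value - median for value in log_ratios]

# Hypothetical intensities for five spots on one array.
cy3 = [1200.0, 950.0, 1010.0, 1800.0, 490.0]
cy5 = [1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
normalized = normalize_log_ratios(cy3, cy5)
```

After this step the median spot has a normalized log-ratio of zero, which is what ties the median intensity to the two-copy reference level.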

The copy number model

We modeled the Cy3/Cy5 intensity ratios in the following way. Assume that the test sample is homogeneous, i.e. a given clone has the same copy number in all cells in the sample, and that the clones are divided into distinct segments such that all clones in a segment have the same copy number.

Let xij be the ratio of clone j in segment i, and let C0i andC1i be the copy numbers of the reference sample and thetest sample, respectively. We assume

where γ is a constant depending on the quality of the DNAin the tube, amplification, scanning and other hybridiza-tion and experimental conditions. The error term εij isassumed to have mean zero and common variance, and αis a constant that is justified from calibration experiments[see Additional file 1] and appears to be sample independ-ent (but likely platform specific). The model assumes thevariance is proportional to the true ratio of copy numbersin the test and reference samples.

Let xR be the median over all intensity ratios (over all seg-ments). Then

where C1R (C0R) is the copy number of the test (reference)

sample corresponding to xR and ε' is an error term (not

equal to ε). The error term is defined such that (1 + εij)/(1

+ εR) = 1 + . Typically, the majority of clones in a tumor

sample have copy number two and we assume C1R = 2. In

general, C0i/C0R = 1 in the reference sample, unless the ref-

erence sample has only one chromosome X and C0i/C0R =

1/2 for chromosome X clones.

Put Ci = C1i/C1R, Ci ∈ {0, , 1, , ...}, then zij = (1 +

). We refer to zij as the normalized (intensity) ratio.

With this notation we have

Further, assume

log(zij) ~ N(αlog(Ci) + β, σ2), (6)

where β is the mean of and σ2 the variance. Equation

(3) ensures the variance is independent of the copynumber. If a series of experiments with known copy num-bers are available, the parameters in equation (6) can beestimated using linear regression.

In order to determine α and β in equation (6), we used thenormal references and the samples with known copynumber aberrations. We have data corresponding to thefollowing ratios: 0.5 (46, XY versus 46, XX), 1.5 (47,XX+13 versus the normal reference, and 47, XX+18 versusthe normal reference), 2 (chr X, the normal females versusthe normal males), and 2.5 (49, XXXXX versus 46, XX).

Mixture modeling of tumor samplesThe intensity ratios in a tumor sample is modeled using amixture model approach. Specifically, the log-ratio log(zij)has intensity given by

where

K is the number of subpopulations, pk is the percentage ofthe kth subpopulation, Σkpk = 1, and Cik ≥ 0 is the copynumber in the kth subpopulation relative to the copynumber of the test sample. Normally, the same region indifferent subpopulations will not experience both gainsand losses [16].

Therefore, we restrict our model parameters in the follow-ing way. The first subpopulation, with percentage p0, isassumed to be normal; i.e. Ci0 = 1 for all clones in this sub-population. We assume the other subpopulations arederived from each other, such that either

1 = Ci0 ≤ Cik ≤ Ci,k+1 (9)

or

1 = Ci0 ≥ Cik ≥ Ci,k+1. (10)

xC iC i

ij ij=⎛

⎝⎜

⎠⎟ +γα

10

1( ),e (3)

zxijxR

C iC RC RC i

ij ij= =⎛

⎝⎜

⎠⎟ +1 0

1 01

α

( ),e ′ (4)

e ′ij

12

32 Ci

α

e ′ij

log( ) log( ) log( )

log( ) .

zij Ci ij

Ci ij

= + +

≈ +

α

α

1 e

e

′(5)

e ′ij

log( ) ~ ( log( ) , ),zij N Ciα β σ+ 2 (7)

C p Ci k ik

k

K

==

∑0

, (8)
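
As a rough illustration of equations (6) to (8), the sketch below fits α and β by ordinary least squares on calibration data and then evaluates the expected log-ratio of a segment in a mixed sample. It is not the authors' implementation: the function names are hypothetical, base-2 logarithms are an assumption, and the calibration values are synthetic (generated from α = 0.8, β = 0.05) rather than measured.

```python
import math


def fit_calibration(copy_ratios, mean_log_ratios):
    """Least-squares fit of log(z) = alpha * log(C) + beta (equation (6))."""
    xs = [math.log2(c) for c in copy_ratios]
    ys = list(mean_log_ratios)
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = y_bar - alpha * x_bar
    return alpha, beta


def mixture_copy_number(percentages, copy_numbers):
    """Equation (8): C_i as the p_k-weighted sum of the C_ik of one segment."""
    return sum(p * c for p, c in zip(percentages, copy_numbers))


def expected_log_ratio(alpha, beta, c_i):
    """Mean of log(z_ij) in equation (7); undefined for c_i = 0."""
    return alpha * math.log2(c_i) + beta


# Known ratios listed in the text; the "observed" means are synthetic.
known_ratios = [0.5, 1.5, 2.0, 2.5]
synthetic_means = [0.8 * math.log2(c) + 0.05 for c in known_ratios]
alpha, beta = fit_calibration(known_ratios, synthetic_means)

# One segment: 50% normal cells, 30% with relative copy number 1.5, 20% with 2.
c_i = mixture_copy_number([0.5, 0.3, 0.2], [1.0, 1.5, 2.0])
print(alpha, beta, c_i, expected_log_ratio(alpha, beta, c_i))
```

Because the averaging in equation (8) happens before the logarithm is taken, a gain present in only part of the sample produces a log-ratio between the normal level and the level of a pure sample.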


Figure 3. Classifier. Here we show three examples of classification of real samples. From top to bottom, the three samples are classified as 2, 3, and 4 subpopulations, respectively. Each subplot shows two empirical distributions and a blue line representing the $\mathrm{NLS}_K$ of the query sample. In the first column, the black curve is the smoothed empirical distribution (SED) of $\mathrm{NLS}^{b}_{22}$ (simulated as two subpopulations and fitted as two) and the red curve is the SED of $\mathrm{NLS}^{b}_{32}$ (simulated as three and fitted as two). In the second column, the red curve is the SED of $\mathrm{NLS}^{b}_{33}$ (simulated as three and fitted as three) and the green curve is the SED of $\mathrm{NLS}^{b}_{43}$ (simulated as four and fitted as three). Finally, in the last column, the green curve is the SED of $\mathrm{NLS}^{b}_{44}$ (simulated as four and fitted as four) and the yellow curve is the SED of $\mathrm{NLS}^{b}_{54}$ (simulated as five and fitted as four). The number in each subplot shows how many samples (in %) in the left distribution obtain a value greater than the value indicated by the blue line.


That is, we consider subpopulation $k + 1$ to be derived from subpopulation $k$ by either A) introducing a new copy number aberration ($C_{ik} = 1$, but $C_{i,k+1} \ne 1$), B) increasing an existing copy number gain, or C) increasing an existing copy number loss.
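
The ordering constraints (9) and (10) and the three kinds of step A) to C) can be checked mechanically. The sketch below is only an illustration; the function names `is_sequential` and `transition_type` are hypothetical, and a profile is assumed to be the tuple of relative copy numbers of one segment, starting with the normal subpopulation.

```python
def is_sequential(profile):
    """True if a segment's copy numbers satisfy constraint (9) or (10).

    profile: relative copy numbers (C_i0, C_i1, ..., C_iK) with C_i0 = 1.
    """
    if profile[0] != 1:
        return False
    pairs = list(zip(profile, profile[1:]))
    only_gains = all(a <= b for a, b in pairs)   # constraint (9)
    only_losses = all(a >= b for a, b in pairs)  # constraint (10)
    return only_gains or only_losses


def transition_type(c_k, c_next):
    """Label the step from subpopulation k to k + 1 for one segment."""
    if c_next == c_k:
        return "unchanged"
    if c_k == 1:
        return "A: new aberration"
    if c_next > c_k > 1:
        return "B: increased gain"
    if c_next < c_k < 1:
        return "C: increased loss"
    return "not allowed"


print(is_sequential((1, 1, 1.5, 2)))  # gains only
print(is_sequential((1, 1.5, 0.5)))   # gain followed by loss
print(transition_type(1.5, 2))
```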

Estimation of copy numbers and percentages
To estimate the copy numbers and the percentages of the subpopulations, we first divide the clones into segments, such that all clones in a segment have the same copy number profile. To segment the clones, we used DNAcopy [17] implemented in R. A comparison study of several segmentation approaches has been done recently [18], and DNAcopy came out best.

After segmentation, all clones in one segment are assigned the same value, namely the mean of the intensity values in that particular segment. Missing clone values mapping within a segment are given the same value as the segment, while missing clone values located between segments have values imputed using the minimum absolute value of the two flanking segments. The copy number level closest to zero is declared unchanged ("normal level") and corresponds to two copies. In the final step, all segments are normalized by subtracting the value of the normal level.
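
The post-segmentation steps can be sketched as below. This is a simplified illustration under stated assumptions: the segment boundaries are taken as given (for instance exported from a segmentation tool), missing clones inside a segment are simply skipped when averaging, and imputation between segments is not shown. The function name `normalize_segments` is hypothetical.

```python
def normalize_segments(log_ratios, segments):
    """Segment means shifted so that the level closest to zero becomes zero.

    log_ratios: per-clone log-ratios in genome order (None marks a missing clone).
    segments:   (start, end) index pairs, end exclusive, one pair per segment.
    """
    means = []
    for start, end in segments:
        observed = [v for v in log_ratios[start:end] if v is not None]
        means.append(sum(observed) / len(observed))
    # The level closest to zero is declared the "normal level" (two copies).
    normal_level = min(means, key=abs)
    return [mean - normal_level for mean in means]


# Toy data: a normal segment, a gained segment and a lost segment.
toy = [0.1, 0.1, None, 0.1, 0.6, 0.6, -0.4, -0.4]
print(normalize_segments(toy, [(0, 4), (4, 6), (6, 8)]))
```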

Denote by $\hat{\sigma}^2$ the residual error

$$\hat{\sigma}^2 = \frac{1}{M} \sum_{i,j} \left( \log(z_{ij}) - \hat{\mu}_i \right)^2, \tag{11}$$

where $M$ is the total number of clones, $m_i$ the number of clones in segment $i$, and

$$\hat{\mu}_i = \frac{1}{m_i} \sum_{j} \log(z_{ij})$$

the mean intensity of segment $i$.

For a given number of subpopulations, $K$ = 2, 3, or 4, we use least squares to fit the parameters (copy numbers, percentages); i.e. for each $K$ we minimize

$$LS = \sum_{i,j} \left( \log(z_{ij}) - \hat{\alpha} \log(C_i) - \hat{\beta} \right)^2, \tag{12}$$

over $p_k$, $k = 0, \ldots, K$ and $C_{ik}$ with the constraints given in equations (9) and (10). Here $(\hat{\alpha}, \hat{\beta})$ is obtained from the samples with known copy number alterations and assumed to be known in equation (12).

Alternatively, one can minimize

$$LS' = \sum_{i} m_i \left( \hat{\mu}_i - \hat{\alpha} \log(C_i) - \hat{\beta} \right)^2, \tag{13}$$

where $m_i$ is the number of clones in segment $i$. Equation (13) involves summation over fewer terms than equation (12) and might thus be preferred. For fixed $K$, there are $K - 1$ percentage parameters and $K - 1$ copy numbers for each segment; in total $(K - 1) + n(K - 1) = (n + 1)(K - 1)$ parameters, where $n$ is the number of segments. The number of parameters scales with the number of clones; however, since the copy numbers assume integer values we do not obtain a perfect fit to the log-intensities in equation (12) or (13). To facilitate comparison between different experiments, we use the normalized quantity $\mathrm{NLS} = LS'/n$, where $n$ is the number of segments in one experiment.

Classification of samples
To classify a sample we go through the following steps. The estimation procedure outlined in the previous section is applied.

Estimation of subpopulation number and parameters
1) Apply DNAcopy to obtain a list of segments.

2) Fit $K$ subpopulations to obtain $\mathrm{NLS}_K$, $K$ = 2, 3, 4, with corresponding percentages $(p_0, p_1, \ldots, p_K)$ and copy number profiles $(C_{i1}, \ldots, C_{iK})$. The first subpopulation is supposed to consist of pure normal cells.

Simulation of bootstrap samples
To simulate bootstrap samples the estimated copy number profiles are applied. Noise is added to the profiles to obtain log-intensity values.

3) Choose $\alpha$ and $\beta$ according to the estimated normal distributions obtained by linear regression. The distributions are restricted to the 95% CI to avoid extreme values.

4) For fixed $K$, simulate log-intensity values for each estimated copy number profile, $k = 1, \ldots, K$, by adding noise: for a clone with copy number $C$, compute the mean log-intensity $\beta + \alpha \log(C)$, in accordance with equation (6), and add noise according to a normal distribution $N(0, \hat{\sigma}^2)$, with $\hat{\sigma}^2$ as in equation (11).

5) Repeat the previous step $B$ times for each sample and each value of $K$ to obtain simulated samples with $K$ = 2, 3 or 4 subpopulations. For each simulated sample fit 2, 3 and 4 subpopulations according to steps 1 and 2, and calculate the corresponding $\mathrm{NLS}^{b}_{KC}$. Here $b$ denotes the $b$th simulated/bootstrapped sample with $K$ subpopulations, fitted to $C$ subpopulations, $C$ = 2, 3, 4.
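
The quantities in equations (11) and (13) and the noise step of the bootstrap can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, base-2 logarithms are an assumption, the minimization over percentages and copy numbers is not shown, and the simulation draws clones around the mixture mean of each segment, which is one possible reading of step 4.

```python
import math
import random


def residual_variance(segment_log_ratios):
    """Equation (11): pooled squared deviation from the segment means."""
    total, count = 0.0, 0
    for values in segment_log_ratios:
        mu = sum(values) / len(values)
        total += sum((v - mu) ** 2 for v in values)
        count += len(values)
    return total / count


def nls(segment_means, segment_sizes, percentages, profiles, alpha, beta):
    """Equation (13) divided by the number of segments n.

    profiles[i] holds the relative copy numbers (C_i0, ..., C_iK) of segment i.
    """
    ls_prime = 0.0
    for mu, m, profile in zip(segment_means, segment_sizes, profiles):
        c_i = sum(p * c for p, c in zip(percentages, profile))  # equation (8)
        ls_prime += m * (mu - alpha * math.log2(c_i) - beta) ** 2
    return ls_prime / len(segment_means)


def simulate_sample(segment_sizes, percentages, profiles, alpha, beta, sigma2, rng):
    """One bootstrap sample: per-clone log-ratios around each segment's mean."""
    sd = math.sqrt(sigma2)
    sample = []
    for m, profile in zip(segment_sizes, profiles):
        c_i = sum(p * c for p, c in zip(percentages, profile))
        mean = alpha * math.log2(c_i) + beta
        sample.append([rng.gauss(mean, sd) for _ in range(m)])
    return sample


rng = random.Random(1)  # fixed seed only to make the toy run repeatable
toy = simulate_sample([4, 2], [0.5, 0.5], [(1, 1), (1, 2)], 1.0, 0.0, 0.01, rng)
print(residual_variance(toy))
```

In a full run, each simulated sample would be refitted with 2, 3 and 4 subpopulations and the resulting NLS values collected into the empirical distributions used for classification.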

Evaluation of $\mathrm{NLS}_K$ from real samples
In the final step the $\mathrm{NLS}_K$, $K$ = 2, 3, 4, from a real sample are compared to the bootstrapped samples to find the optimal number of subpopulations for the real sample.

6) If $\mathrm{NLS}_2$ is below the 95th percentile of the empirical distribution of $\mathrm{NLS}^{b}_{22}$, accept the sample as two subpopulations; otherwise

7) If $\mathrm{NLS}_3$ is below the 95th percentile of the empirical distribution of $\mathrm{NLS}^{b}_{33}$, accept the sample as three subpopulations; otherwise

8) Accept the sample as four subpopulations.

The part described in steps 6–8 is illustrated in Figure 3. The whole procedure is a bootstrap procedure; for a real sample the fitted profiles (one for each subpopulation) are compared to simulated samples with the same profiles as the fitted. For a (supposedly) normal sample, one can start with a single population, $K$ = 1 (only normal cells).
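
Steps 6–8 amount to a short sequential test, sketched below. The function names are hypothetical, and the nearest-rank definition of the empirical percentile is an assumption, since the text does not say how the 95th percentile is computed.

```python
import math


def percentile(values, q):
    """Nearest-rank empirical percentile, for 0 < q <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def choose_subpopulations(nls_real, nls_boot):
    """Sequential rule of steps 6-8.

    nls_real: {2: NLS_2, 3: NLS_3} for the real sample.
    nls_boot: {2: [bootstrap NLS, simulated as 2 and fitted as 2],
               3: [bootstrap NLS, simulated as 3 and fitted as 3]}.
    """
    for k in (2, 3):
        if nls_real[k] < percentile(nls_boot[k], 95):
            return k
    return 4


# Toy bootstrap distributions (not real results).
boot = {2: list(range(1, 101)), 3: list(range(1, 101))}
print(choose_subpopulations({2: 99, 3: 10}, boot))
```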

Simulation
To test the classifier we chose some of the simulated samples and used these as input to the bootstrap procedure described above. For each real sample we chose four simulated samples as input and compared the result to the known input.

We also tested how robust the classifier is to deviations from the assumption of sequential tumor evolution. This we did by adding or subtracting a Poisson number of copies to the original copy number. For each segment, $S_i X_i$ copies were added to the original copy number in subpopulation $i$. Here, $P(S_i = 1) = P(S_i = -1) = 0.5$ and $X_i$ is Poisson($\lambda$). If the copy number fell below 0, it was set to zero. The parameter $\lambda$ was varied over $\lambda$ = 0.25, 0.5 and 1. For each real sample with $K$ = 3 estimated subpopulations we simulated 2 samples (in total 2·23 = 46 simulated samples) in this way, and for each real sample with $K$ = 4 we simulated 4 samples (in total 4·10 = 40 simulated samples).
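
The perturbation used in the robustness test can be sketched as below. The function names are hypothetical, the copy numbers are assumed to be absolute integer counts, and the Poisson sampler (Knuth's multiplication method) is a choice made here because the Python standard library has no Poisson generator; the text does not say how the variates were drawn.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def perturb_copy_numbers(copy_numbers, lam, rng):
    """Add S * X copies per segment, S = +1 or -1 with equal probability,
    X ~ Poisson(lam); results below zero are set to zero."""
    perturbed = []
    for copies in copy_numbers:
        sign = rng.choice((-1, 1))
        perturbed.append(max(0, copies + sign * poisson(lam, rng)))
    return perturbed


rng = random.Random(7)  # fixed seed only to make the toy run repeatable
for lam in (0.25, 0.5, 1.0):
    print(lam, perturb_copy_numbers([2, 3, 0, 1, 4], lam, rng))
```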

Authors' contributions
KW and CW developed the method with input from the other authors, KW and SL implemented the method, KW and JL carried out the data analysis, and KW and CW wrote the manuscript with input from JL and LB; JL performed all experiments. All authors read and approved the final manuscript.

Additional material

Acknowledgements
KW has been supported by the PhD-school for Industrial-related Molecular Biology at Aarhus University. CW is supported by the Danish Cancer Society. The work has been financed by the Danish Platform for Integrative Biology of the National Research Foundation, the will of Edith Stern and the "Race Against Breast Cancer".

References
1. Dexter D, Leith J: Tumour heterogeneity and drug resistance. J Clin Oncol 1986, 4(2):244-257.

2. Heppner G, Yamashina K, Miller B, Miller F: Tumour heterogeneity in metastasis. Prog Clin Biol Res 1986, 212:45-59.

3. Heppner G: Tumour heterogeneity. Cancer Res 1984, 44(6):2259-2265.

4. Black D: The genetics of breast cancer. Eur J Cancer 1994, 30A(13):1957-1961.

5. Devilee P, Cornelisse C: Somatic genetic changes in human breast cancer. Biochim Biophys Acta 1994, 1198(2–3):113-130.

6. el Ashry D, Lippman M: Molecular biology of breast carcinoma. World J Surg 1994, 18:12-20.

7. Ford D, Easton D: The genetics of breast and ovarian cancer. Br J Cancer 1995, 72(4):805-812.

8. O'Connell P, Pekkel V, Fuqua S, Osborne C, Allred D: Molecular genetic studies of early breast cancer evolution. Breast Cancer Res Treat 1994, 32:5-12.

9. Andersen CL, Wiuf C, Kruhoffer M, Korsgaard M, Laurberg S, Orntoft TF: Frequent occurrence of uniparental disomy in colorectal cancer. Carcinogenesis 2007, 28:38-48.

10. Aubele M, Mattis A, Zitzelsberger H, Walch A, Kremer M, Hutzler P, Hofler H, Werner M: Intratumoural heterogeneity in breast carcinoma revealed by laser-microdissection and comparative genomic hybridization. Cancer Genet Cytogenet 1999, 110(2):94-102.

11. Khan MZ, Haleem A, Hassani HA, Kfoury H: Cytopathological grading, as a predictor of histopathological grade, in ductal carcinoma (NOS) of breast, on air-dried Diff-Quik smears. Diagn Cytopathol 2003, 29(4):185-93.

12. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 2004, 90:132-153.

Additional file 1
Regression analysis. The figure shows the observed averaged intensity values from clones with known copy number changes (e.g. trisomies) and the linear regression fit to the observed values. The x-axis represents the known log2 copy number ratio (copy number divided by 2) and the y-axis represents the observed log2 intensity ratio. The blue spots represent the observed averaged intensities and red spots show the predicted values.
[http://www.biomedcentral.com/content/supplementary/1471-2105-10-12-S1.pdf]

Additional file 2
Detailed results for the 29 pairs of primary tumors and lymph node metastasis. The table shows results for the 29 pairs of tumors organized in two times eight columns. ID: Name of sample, Pops: Estimated number of subpopulations, %: Subpopulation percentages, AIk: Aberration Index for subpopulation k, excluding the normal subpopulation, Total: Weighted sum of AIk, Σk pk AIk, Pure: AI•, normalized weighted sum of AIk, Σk pk AIk/(1 - p0).
[http://www.biomedcentral.com/content/supplementary/1471-2105-10-12-S2.pdf]


13. Lamy P, Andersen C, Dyrskjot L, Torring N, Wiuf C: A Hidden Markov Model to estimate population mixture and allelic copy-numbers in cancers using Affymetrix SNP arrays. BMC Bioinformatics 2007.

14. LaFramboise T, Weir BA, Zhao X, Beroukhim R, Li C, Harrington D, Sellers WR, Meyerson M: Allele-specific amplification in cancer revealed by SNP array analysis. PLoS Comput Biol 2005, 1(6):e65.

15. Li J, Gromov P, Gromova I, Moreira J, Timmermans-Wielenga V, Rank F, Wang K, Li S, Li H, Wiuf C, Yang H, Zhang X, Bolund L, Celis J: Omics-based profiling of carcinoma of the breast and matched regional lymph node metastasis. Proteomics 2008, 8(23-24):5038-5052.

16. Lips EH, van Eijk R, de Graaf EJR, Doornebosch PG, de Miranda NFCC, Oosting J, Karsten T, Eilers PHC, Tollenaar RAEM, van Wezel T, Morreau H: Progression and tumor heterogeneity analysis in early rectal cancer. Clin Cancer Res 2008, 14(3):772-81.

17. Olshen AB, Venkatraman ES: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004, 5(4):557-572.

18. Willenbrock H, Fridlyand J: A Comparison Study: Applying Segmentation to Array CGH Data for Downstream Analyses. Bioinformatics 2005, 21(22):4084-4091.
