-
Inference of tumor subclonal composition and evolution by the use
of single-cell and bulk DNA sequencing data
by
Salem Malikić
M.Sc., Simon Fraser University, 2014
B.Sc., University of Sarajevo, 2011
Thesis Submitted in Partial Fulfillment of the Requirements for
the Degree of
Doctor of Philosophy
in the
School of Computing Science
Faculty of Applied Sciences
© Salem Malikić 2019
SIMON FRASER UNIVERSITY
Summer 2019
All rights reserved. However, in accordance with the Copyright
Act of Canada, this work may be reproduced without authorization
under the conditions for “Fair Dealing.” Therefore, limited
reproduction of this work for the purposes of private study,
research, education, satire, parody, criticism, review and news
reporting is likely to be in accordance with the law, particularly
if cited appropriately.
-
Approval
Name: Salem Malikić
Degree: Doctor of Philosophy (Computing Science)
Title: Inference of tumor subclonal composition and evolution by
the use of single-cell and bulk DNA sequencing data
Examining Committee:
Chair: Jian Pei, Professor
Leonid Chindelevitch, Senior Supervisor, Assistant Professor
S. Cenk Sahinalp, Co-Supervisor, Professor, School of Informatics,
Computing and Engineering, Indiana University
Cedric Chauve, Supervisor, Professor
Colin Collins, Supervisor, Professor, Department of Urologic
Sciences, University of British Columbia
Maxwell W Libbrecht, Internal Examiner, Assistant Professor
Russell Schwartz, External Examiner, Professor, Department of
Biological Sciences, Carnegie Mellon University
Date Defended: August 21, 2019
-
Abstract
Cancer is a genetic disease characterized by the emergence of
genetically distinct populations of cells (subclones) through the
random acquisition of mutations at the level of single cells and
shifting prevalences at the subclone level through selective
advantages conferred by driver mutations. This interplay creates
complex mixtures of tumor cell populations which exhibit different
susceptibility to targeted cancer therapies and are suspected to be
the cause of treatment failure. Therefore, it is of great interest
to obtain a better understanding of the evolutionary histories of
individual tumors and their subclonal composition. In this thesis
we present three methods for the inference of tumor subclonal
composition and evolution by the use of bulk and/or single-cell DNA
sequencing data.
First, we present CTPsingle, a method which aims to infer tumor
subclonal composition from single-sample bulk sequencing data.
CTPsingle consists of two steps: (i) robust clustering of
mutations using beta-binomial mixture modelling and (ii) inference
of tumor phylogenies by the use of integer linear programming. On
simulated data, we show that CTPsingle is able to infer the purity
and the clonality of single-sample tumors with high accuracy even
when restricted to a coverage depth as low as ∼30×. CTPsingle is
currently used to infer clonality as a part of the Evolution and
Heterogeneity Working Group of the Pan Cancer Analysis of Whole
Genomes project, where sequencing data of over 2700 tumors are
analyzed.
Next, we present B-SCITE, the first available computational
approach that infers tumor phylogenies from combined single-cell and
bulk sequencing data. B-SCITE is a probabilistic method which
searches for the tumor phylogenetic tree maximizing the joint
likelihood of the two data types. Tree search in B-SCITE is
performed by the use of a customized MCMC search over the space of
labeled rooted trees. Using a comprehensive set of simulated data,
we show that B-SCITE systematically outperforms existing methods
with respect to tree reconstruction accuracy and subclone
identification. On real tumor data, mutation histories generated by
B-SCITE show high concordance with expert-generated trees.
In the third part, we introduce PhISCS, the first method which
integrates single-cell and bulk sequencing data while accounting for
the possible existence of mutations affected by undetected copy
number aberrations, as well as mutations for which the commonly
used and recently debated Infinite Sites Assumption is violated.
PhISCS is a combinatorial method and, in contrast to the available
alternatives, which are mostly based on probabilistic search
schemes, it can provide a guarantee of optimality of the reported
solutions. We provide two different implementations of PhISCS: (i)
an implementation based on integer linear programming and (ii) an
implementation based on constraint satisfaction programming. We
show that the latter has lower running time on most of the
instances that we used to assess the performance of the two
implementations. These results suggest that in some applications
constraint satisfaction programming might be a viable alternative
to the commonly used integer linear programming. We also
demonstrate the utility of PhISCS in analyzing real sequencing data,
where it reports more plausible and parsimonious tumor phylogenies
than the available alternatives.
Keywords: Intra-tumor heterogeneity; Tumor evolution;
Single-cell DNA sequencing; Bulk DNA sequencing; Infinite sites
assumption; Markov chain Monte Carlo; Joint probabilistic model;
Integer linear programming; Constraint satisfaction programming
-
Acknowledgements
First and foremost, I would like to thank my supervisor Dr. S.
Cenk Sahinalp for his extensive guidance, support and patience
during my studies. I especially thank him for the endless effort he
put into training me in the scientific field. I am also very
thankful to the other supervisors, Dr. Leonid Chindelevitch, Dr.
Cedric Chauve and Dr. Colin Collins, for following my work in the
past years and providing many suggestions that helped improve it.
In addition, I thank Dr. Maxwell Libbrecht for his insightful
questions and helpful suggestions during the depth exam and thesis
defence, where he served as an Examiner. This thesis was
considerably improved by the input from Dr. Russell Schwartz, to
whom I am very grateful for serving as an External Examiner, for
his very detailed proofreading of the thesis and for providing
numerous suggestions. Also, I thank Dr. Jian Pei for chairing the
defence.
This work would have been impossible without the many collaborators
with whom I worked during my master’s and doctoral studies. I thank
Dr. Nilgun Donmez and Dr. Andrew McPherson for introducing me to the
studies of tumor heterogeneity, providing extensive guidance in the
first years of my research and their vast contribution to the
development of CITUP, which was the basis of my master’s thesis, and
CTPsingle, which is the first method presented in this thesis.
B-SCITE, the second method presented in the thesis, is a result of a
collaboration with the Computational Biology Group at ETH Zurich
led by Dr. Niko Beerenwinkel. I spent five months in Switzerland
working together with Dr. Beerenwinkel and two of the members of his
group, Dr. Katharina Jahn and Dr. Jack Kuipers. I thank them all for
their hospitality and collaboration on B-SCITE. Work on PhISCS,
which is the third presented method, is a joint effort of the labs
led by Dr. Sahinalp and Dr. Iman Hajirasouliha from Cornell
University. In addition to Dr. Sahinalp and Dr. Hajirasouliha, here
I would like to thank Simone Cicolella, Ehsan Haghshenas, Md.
Khaledur Rahman, Camir Ricketts, Daniel Seidman and Dr. Faraz Hach
for their contributions to this project. My special thanks go to
Farid Rashidi Mehrabadi, who put endless effort into PhISCS,
contributing to the data analysis, code preparation and methods
design.
Some of the research that I was involved in is not included in
the thesis. However, it helped me gain valuable experience in
method development and collaborative research, and gave me an
opportunity to attend several scientific conferences. For these
collaborations, I would first like to thank Dr. Ibrahim Numanagic
and Michael Ford, with whom I worked on the development of methods
for genotyping highly polymorphic genes. I also thank Nikolai
Karpov and Md. Khaledur Rahman for our joint work on the
development of a similarity measure for comparing trees of tumor
evolution. With Dr. Sahand Khakabimamaghani I worked on the
development of a method for collaborative intra-tumor heterogeneity
detection. I thank Dr. Khakabimamaghani for leading this work and
for many insightful discussions about tumor heterogeneity and
potential new ways of solving several important problems in the
field. I also thank all members of the Evolution and Heterogeneity
Working Group of the Pan Cancer Analysis of Whole Genomes (PCAWG)
project, of which I have been a member since 2014.
I would also like to acknowledge insightful feedback received
from other colleagues from Dr. Sahinalp’s lab and the Laboratory
for Advanced Genome Analysis at the Vancouver Prostate Center: Dr.
Yen Yi Lin, Ermin Hodzic, Can Kockan, Dr. Alex Gawronski, Iman
Sarrafi, Hossein Asghari and Dr. Raunak Shrestha.
I am indebted to many teachers and professors who helped me
develop passion and enthusiasm for Mathematics and the other
scientific fields. Here, I especially thank Nermin Suljic, Ali
Lafcioglu, Dr. Hasan Jamak and Dr. Dino Oglic. Furthermore, I thank
all of the people from Bosna Sema Educational Institutions for
providing an excellent environment and support during my high school
and undergraduate studies, as well as Canadian granting agencies,
in particular NSERC, for supporting my research.
I devote special thanks to two of my colleagues and friends,
Dr. Ibrahim Numanagic and Ermin Hodzic. Our friendship dates back to
the time of our undergraduate studies at the University of Sarajevo
in Bosnia and Herzegovina. Had Ibrahim not come to SFU in 2011, it
is very unlikely that I would have ever ended up studying here. He
introduced me to Dr. Sahinalp and provided extensive help with
everything. I have been living with Ermin during the whole course
of my PhD studies, and he has been a great roommate, colleague and
friend.
I am very grateful to my dear aunt Faiza and uncle Dzevad,
together with their family, for providing moral support and for all
the help that they provided during my internship in Switzerland
(where they are currently living). I also thank the people from
Contextual Genomics, a company where I have been working over the
past eight months, for providing an excellent work environment and
for the great understanding that they showed while I was preparing
the thesis.
Last, but not least, I would like to express deep gratitude to
my parents, Sadeta and Faiz, and sister Faiza, for their
unconditional love and support. I devote this thesis to them and to
my sweet little niece Amina.
-
Table of Contents
Approval
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
  1.1 Genetic basis of cancer and evidence for the existence of genetic intra-tumor heterogeneity
  1.2 Cancer onset and evolution of cancerous cells
    1.2.1 Clonal theory and branching model of tumor evolution
    1.2.2 Other theories of tumor evolution
  1.3 Clinical relevance of intra-tumor heterogeneity
  1.4 Motivation, Contributions and Thesis Organization
2 Background
  2.1 Next Generation Sequencing
    2.1.1 Preparing the input of NGS experiment
    2.1.2 Output of NGS experiment
    2.1.3 The uses of NGS data in studies of intra-tumor heterogeneity and tumor evolution
  2.2 Inference of tumor subclonal composition and evolution from bulk sequencing data
    2.2.1 Variant and reference read counts as a proxy for the fraction of cells harboring mutation
    2.2.2 Clustering of mutations based on the read counts
    2.2.3 Basic principles in inferring trees of tumor evolution
    2.2.4 Theoretical limitations
    2.2.5 Potential benefits of the use of multiple samples
    2.2.6 Methods for the inference of clonal trees based on the use of SNVs
    2.2.7 Methods based on the use of CNAs and other types of mutations
  2.3 Inference of tumor subclonal composition and evolution from single-cell sequencing data
    2.3.1 The main characteristics of single-cell sequencing data
    2.3.2 Strengths and weaknesses of single-cell sequencing data in reconstructing trees of tumor evolution
    2.3.3 The existing methods for studying ITH and evolution by the use of SNVs from single-cell sequencing data
    2.3.4 Analysis of CNAs from single-cell sequencing data
  2.4 Inference of tumor evolution and subclonal composition by integrative use of single-cell and bulk sequencing data
3 Clonality inference from single tumor samples using low coverage sequencing data
  3.1 Introduction
    3.1.1 Related work
  3.2 Methods
    3.2.1 Input processing
    3.2.2 Robust clustering using beta-binomial mixture modelling
    3.2.3 Estimation of tumor purity
    3.2.4 Inference of tree of tumor evolution
  3.3 Results
    3.3.1 Simulations
    3.3.2 Applications in real data analysis
  3.4 Discussion
4 Integrative inference of subclonal tumor evolution from single-cell and bulk sequencing data
  4.1 Introduction
    4.1.1 Background
    4.1.2 Contributions
  4.2 Methods
    4.2.1 Tree models of tumor evolution
    4.2.2 Input data
    4.2.3 Tree scoring based on bulk sequencing data
    4.2.4 Tree scoring based on single-cell data
    4.2.5 Combined B-SCITE approach
    4.2.6 Compression of mutation trees into clonal trees
  4.3 Results
    4.3.1 Performance assessment on simulated data
    4.3.2 Application to real data
  4.4 Discussion
5 A combinatorial approach for sub-perfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data
  5.1 Introduction
  5.2 Methods
    5.2.1 Input data
    5.2.2 PhISCS-I for tumor phylogeny inference via single-cell sequencing (SCS) data with no mutation elimination allowed
    5.2.3 Allowing mutations elimination in PhISCS-I
    5.2.4 Additional ILP constraints to integrate VAFs derived from bulk sequencing data into PhISCS-I
    5.2.5 PhISCS-B for tumor phylogeny inference via SCS data
    5.2.6 Additional Boolean constraints to integrate VAFs derived from bulk sequencing data into PhISCS-B
  5.3 Results on simulated data
    5.3.1 Comparative running time analysis of PhISCS-I and PhISCS-B
    5.3.2 Measuring accuracy in tree inference
    5.3.3 Comparing the accuracy of PhISCS and alternative methods
  5.4 Results on real sequencing data
  5.5 Discussion
6 Conclusion
  6.1 Future Directions
Bibliography
Appendix A Supplementary Material for CTPsingle: Clonality inference from single tumor sample using low coverage sequencing data
  A.1 Simulation set up
  A.2 Calculation of evaluation measures and run-time settings for AncesTree, LICHeE and PyClone
Appendix B Supplementary Material for B-SCITE: Integrative inference of subclonal tumor evolution from single-cell and bulk sequencing data
  B.1 Details of generating simulated data
  B.2 Phylogenetic accuracy measures
  B.3 Derivation of the Binomial distribution approximation formula
  B.4 Details of running ddClone, OncoNEM, SCITE, PhyloWGS and B-SCITE
  B.5 Details of input data pre-processing for ALL, TNBC and CRC patients
  B.6 Supplementary figures
Appendix C Supplementary Material for PhISCS: A combinatorial approach for sub-perfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data
  C.1 Generalizing the triple-VAF constraints to arbitrary number of mutations
  C.2 Simulation models used for benchmarking tumor phylogeny inference methods
    C.2.1 Generating simulated data without ISA violations
    C.2.2 Simulation of mutations violating ISA
    C.2.3 Simulations involving mutations from regions affected by Copy Number Gains
  C.3 TPTED measure for comparing tumor phylogenies
  C.4 Benchmarking SCITE, SiFit, B-SCITE and PhISCS
  C.5 Details of obtaining and pre-processing real data
  C.6 Source codes of Max-SAT solvers used for the implementation of CSP formulation of PhISCS
-
List of Tables
Table 5.1 Comparison of running times of ILP and CSP implementations of PhISCS
-
List of Figures
Figure 1.1 Branching clonal model and clonal tree of tumor evolution
Figure 1.2 Alternative illustration of the branching clonal model and a clonal tree of tumor evolution in case where no losses of mutations are allowed
Figure 1.3 Linear and neutral models of tumor evolution
Figure 2.1 An example of copy number event affecting region harboring SNV
Figure 2.2 Clonal tree of tumor evolution and plot of distribution of cellular prevalences (estimated based on the read counts with sequencing depth of 200×)
Figure 2.3 Clonal tree of tumor evolution and plot of distribution of cellular prevalences (estimated based on the read counts with sequencing depth of 50×)
Figure 2.4 Desirable clustering of mutations from hypothetical examples with 200× and 50× coverage datasets
Figure 2.5 Limitation of bulk sequencing data in separating mutations of the same prevalence
Figure 2.6 Multiple clonal trees consistent with the mutation frequencies observed in bulk data
Figure 2.7 An overview of single-cell sequencing experiment and output data
Figure 2.8 Strengths and weaknesses of single-cell sequencing data in inferring pairwise order of mutations in tree of tumor evolution
Figure 2.9 Bulk data can improve phylogenetic inference by reducing the effects of noise in single-cell sequencing data
Figure 3.1 Comparison of purity inference accuracy of CTPsingle, PyClone, LICHeE and AncesTree
Figure 3.2 Comparison of CTPsingle, PyClone, LICHeE and AncesTree based on the absolute difference between the true and predicted number of subclones
Figure 3.3 Comparison of CTPsingle, PyClone, LICHeE and AncesTree based on the quadratic mean of difference of true and predicted lineage frequencies
Figure 3.4 Effect of false positive SNVs and copy number aberrations on the performance of CTPsingle
Figure 3.5 Performance of CTPsingle on simulated datasets containing increased number of subclones
Figure 4.1 Comparison of the inference of tumor evolution based on single-cell and bulk sequencing data
Figure 4.2 Schematic overview of B-SCITE
Figure 4.3 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure 4.4 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure 4.5 Comparison of ancestor-descendant accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure 4.6 Comparison of different-lineages accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure 4.7 The effect of CNAs on the co-clustering accuracy measure of phylogenetic inference of B-SCITE with bulk data coverage of 10,000×
Figure 4.8 Comparison of co-clustering accuracy measure of phylogenetic inference of B-SCITE and PhyloWGS on simulated data with 1, 2 and 4 bulk samples, varying bulk coverage and 25 sampled single cells
Figure 4.9 Comparison of co-clustering accuracy measure of phylogenetic inference of B-SCITE and PhyloWGS on simulated data with 1, 2 and 4 bulk samples, varying bulk coverage and 50 sampled single cells
Figure 4.10 Comparison of co-clustering accuracy measure of phylogenetic inference of B-SCITE and PhyloWGS on simulated data with 1, 2 and 4 bulk samples, varying bulk coverage and 100 sampled single cells
Figure 4.11 Mutation histories inferred by CTPsingle, SCITE and B-SCITE for Patient 1 from childhood leukemia study (Gawad et al. 2014)
Figure 4.12 Mutation histories inferred by CTPsingle, SCITE and B-SCITE for Patient 2 from childhood leukemia study (Gawad et al. 2014)
Figure 4.13 Mutation histories inferred by the original study, SCITE and B-SCITE for triple-negative breast cancer patient (Wang et al. 2014)
Figure 4.14 Mutation histories inferred by B-SCITE for two colorectal patients with liver metastasis (Leung et al. 2017)
Figure 5.1 Comparisons of PhISCS and SiFit based on the normalized Robinson-Foulds distance
Figure 5.2 Comparison of PhISCS with SCITE based on the normalized MLTSM similarity measure.
Figure 5.3 Comparison of PhISCS with SCITE based on TPTED dissimilarity measure.
Figure 5.4 Comparison of PhISCS with SCITE on larger number of subclones and larger number of mutations.
Figure 5.5 Comparison of PhISCS and B-SCITE according to both MLTSM and its dual MLTD measures.
Figure 5.6 Mutation histories inferred by PhISCS for patient with primary colorectal cancer and liver metastasis (patient CRC2 from Leung et al. 2017)
Figure 5.7 Mutation histories inferred by SCITE, B-SCITE and PhISCS for Patient 2 from childhood leukemia study (Gawad et al. 2014)
Figure A.1 The distribution of lineage frequencies and fraction of mutations per subclone across all simulation datasets generated in CTPsingle
Figure A.2 Comparison of CTPsingle and CITUP on the simulated data
Figure B.1 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 6 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure B.2 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.3 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 20 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.4 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 40 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.5 Comparison of adjusted Rand index accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.6 Comparison of adjusted Rand index accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 20 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.7 Comparison of adjusted Rand index accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 40 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.8 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE as a function of the false negative rate. False positive rate was set to 0.00001.
Figure B.9 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE as a function of the false negative rate, but with highly elevated false positive rate of 0.01.
Figure B.10 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 6 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
Figure B.11 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.12 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 20 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.13 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 40 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
Figure B.14 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE as a function of the false negative rate. False positive rate was set to 0.00001.
Figure B.15 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE as a function of the false negative rate, but with highly elevated false positive rate of 0.01.
Figure B.16 The effect of CNAs on the co-clustering accuracy measure of phylogenetic inference of B-SCITE with increased bulk data coverage of 1,000,000×
Figure B.17 The effect of CNAs on the ancestor-descendant accuracy measure of phylogenetic inference of B-SCITE with bulk data coverage of 10,000×
Figure B.18 The effect of CNAs on the ancestor-descendant accuracy measure of phylogenetic inference of B-SCITE with increased bulk data coverage of 1,000,000×
Figure B.19 The effect of CNAs on the different-lineage accuracy measure of phylogenetic inference of B-SCITE with bulk data coverage of 10,000×
Figure B.20 The effect of CNAs on the different-lineage accuracy measure of phylogenetic inference of B-SCITE with increased bulk data coverage of 1,000,000×
Figure B.21 Multiple clonal trees compatible with clustering of mutations inferred by CTPsingle for ALL patient
Figure B.22 Clonal trees for ALL and TNBC patients derived from B-SCITE mutation trees by node compression
-
Chapter 1
Introduction
1.1 Genetic basis of cancer and evidence for the existence of
genetic intra-tumor heterogeneity
Cancer is a common name for a group of over 200 diseases
characterized by uncontrolled cell division. In the case of leukemia
(cancer of blood or bone marrow), cancer manifests itself by
overproduction of abnormal white blood cells, whereas in the other
cancer types uncontrolled cell divisions result in the formation of
abnormal masses of cells, also known as malignant tumors. In
addition, cancerous cells typically have the potential to leave the
primary site of cancer origin and invade distant tissues, forming
metastases [91, 127].
Cancer is nowadays widely recognized as a disease of the genome [91,
105, 13]. It is the most common genetic disease and is estimated to
have been responsible for nearly 10 million deaths worldwide in 2018
alone [10]. Genetic mutations are one of the key causes of cancer
onset, growth, spread and treatment resistance. At the time of
clinical diagnosis, genomes of cancerous cells typically harbor a
large number of mutations detectable from data generated by
currently available DNA sequencing technologies. According to some
estimates, most tumors harbor between 1,000 and 20,000 single
nucleotide variants (SNVs), and a few to hundreds of copy number
aberrations (CNAs) and other structural rearrangements [100].
While the role and importance of genetic mutations in cancer
onset and progression have been studied (at a limited resolution)
for a long time, completion of the first draft of the human
reference genome in 2001 [22] and technological advancements in DNA
sequencing, in particular the introduction of next-generation
sequencing (NGS) technologies in 2004 [109], enabled researchers to
study genomic profiles of individual tumors at unprecedented scale
and resolution. These developments enabled sequencing large parts or
even whole genomes of individual tumors, as well as sequencing large
cohorts of tumor samples [173]. They were followed by the
development of computational methods for detection of various types
of somatic mutations such as single-nucleotide variants (SNVs),
small insertions and deletions
(indels), large-scale insertions, inversions, translocations,
copy number aberrations (CNAs) and others [178, 78].
Sequencing data generated in the past years have revealed a striking
degree of genetic intra-tumor diversity in cancer. In [50], mutation
profiling of four patients with metastatic renal-cell carcinoma was
performed. It revealed the existence of mutations present in some,
but not in all, tumor sites, implying the existence of spatial
genetic intra-tumor heterogeneity. This diversity was observed not
only between physically separated primary and metastatic tumor
sites, but also among distinct regions of the primary tumor that
were sequenced independently. In another study, the authors tracked
tumor progression in three chronic lymphocytic leukemia patients
[147]. For each patient, five blood samples were obtained at
different timepoints of disease progression and a subset of
mutations was selected for targeted deep amplicon sequencing. The
average depth of sequencing coverage achieved by amplicon sequencing
was 100,000×, yielding highly reliable variant allele frequencies
for the selected sets of mutations. Large differences in the variant
allele frequencies obtained for many pairs of mutations are a clear
indicator of the existence of genetically distinct cells and of
temporal genetic intra-tumor heterogeneity. Similar findings were
reported in [99, 83, 13, 116, 64, 179, 49, 172, 88] and many other
studies.
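
To make the notion of a variant allele frequency (VAF) concrete, the following minimal sketch computes the VAF of a mutation as the fraction of reads covering its locus that support the variant allele. It also reports a standard error under a simple binomial read-sampling model, which is an assumption made here for illustration, to indicate why a depth of about 100,000× gives far more reliable estimates than a depth of 100×. The function names and read counts are hypothetical and are not taken from [147].

```python
import math


def variant_allele_frequency(variant_reads: int, total_reads: int) -> float:
    """Fraction of reads covering a locus that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    return variant_reads / total_reads


def binomial_standard_error(vaf: float, total_reads: int) -> float:
    """Standard error of a VAF estimate, assuming binomial read sampling."""
    return math.sqrt(vaf * (1.0 - vaf) / total_reads)


# Hypothetical read counts for one mutation at two sequencing depths.
for variant_reads, total_reads in [(25, 100), (25_000, 100_000)]:
    vaf = variant_allele_frequency(variant_reads, total_reads)
    se = binomial_standard_error(vaf, total_reads)
    print(f"depth={total_reads:>7}  VAF={vaf:.3f}  standard error={se:.4f}")
```

Under this simplified model the standard error shrinks with the square root of the depth, so two mutations whose frequencies differ by a few percent can be told apart at very high coverage but not at low coverage.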
In addition to genetic intra-tumor heterogeneity, there are
several other types of heterogeneity in tumors of a single patient
(e.g., epigenetic intra-tumor heterogeneity). However, since our
main focus is on genetic intra-tumor heterogeneity, we adopt the
convention that, in the rest of the thesis, the term intra-tumor
heterogeneity, abbreviated as ITH, refers to this type of
heterogeneity, unless stated otherwise.
1.2 Cancer onset and evolution of cancerous cells
In this section we discuss several existing theories of cancer
onset and evolution. Our main goal is to address one of the
fundamental questions about ITH: what are the mechanisms by which
ITH emerges during tumor growth, and does it play any role in tumor
progression?
Most of the available tumor sequencing data support the hypothesis
of a single-cell origin of cancer. According to this hypothesis,
cancer originates from a single cell, also known as the cancer
founding cell, which acquires a set of mutations giving it some
proliferative advantage over the neighboring healthy cells. The
evidence supporting this can be found in studies where multiple
regions of the same tumor were sequenced and it was observed that
all regions share a common set of mutations [50, 166]. Additional
evidence is provided by studies where mutational profiling of
individual tumor cells was performed and sets of mutations present
in all cancerous cells were identified [179, 49, 172]. Some studies
also suggest that a small fraction of tumors might have multiple
cells of origin [183, 152]. Such tumors are known in the literature as
multicentric [70, 24] (the terms multifocal and polycentric are also
used as synonyms, although in some publications the term multifocal
has a different meaning [136]).
Mutagenesis (the production of genetic mutations) in cancer is a
dynamic process. Existing mutations typically cause defects in the
mechanisms that ensure the accuracy of DNA replication during cell
division. As a consequence, cancerous cells are usually
characterized by an elevated mutation rate in comparison to other
cells. During cell division, as well as due to exogenous factors
(e.g., tobacco smoke), cancerous cells acquire new mutations
distinguishing them from the adjacent cells. There exist several
theories about tumor evolution, which have different implications
for the impact of the newly acquired mutations and the role of
intra-tumor heterogeneity in tumor progression. Below we summarize
the most important of these theories.
1.2.1 Clonal theory and branching model of tumor evolution
In 1976, Peter Nowell proposed the clonal theory of cancer
evolution, which posits that cancer is an evolutionary process
driven by the acquisition of somatic mutations [121]. During tumor
growth, descendants of the cancer founding cell acquire new
mutations that are later passed on to their descendants, and the
process continues over time. Consequently, at some timepoint, one of
the descendants of the cancer founding cell can accumulate a
critical set of mutations giving it some selective advantage over
the other cells. The emergence of such a cell is then followed by
the expansion of the population of its descendants, which leads to
the formation of a genetically highly similar set of cells, better
known as a subclone¹. This process continues over time and, at the
time of clinical diagnosis, tumors usually consist of multiple
subclones characterized by distinct sets of somatic mutations. The
model of tumor evolution that follows the clonal theory and allows
co-existence of multiple subclones over time is known as branching
clonal evolution and is shown in Figure 1.1.
Tree of tumor evolution
The process of tumor evolution can be depicted by a clonal tree
(of tumor evolution), here also referred to as a tumor phylogeny,
shown in Figure 1.1. In a clonal tree, individual nodes represent
subclones, with the root node representing either the population of
healthy cells or the first population of cancerous cells². Mutations
are placed at the subclone (node) of their first occurrence. The
first population of cancerous cells is also known as the cancer
founding clone, and the mutations that it harbors (that are shared
among all cancerous cells) as clonal or trunk mutations. In addition
to a mutational label, a frequency label is also commonly assigned
to each node. Frequency labels usually represent the prevalence of
the corresponding subclone in the tumor sample or the average
variant allele frequency of mutations present in the node's
mutational label. In cases where multiple tumor samples are
sequenced, frequency labels can be represented as vectors of real
numbers.
¹ In this thesis we will use the definition of a subclone as a set of
genetically highly similar cells (a similar definition can be found
in [148]). Consequently, and for the sake of simplicity, the
population of healthy cells will be treated as one of the subclones.
² Note that in the case of multicentric tumors it is necessary
that the root node represents the population of healthy cells.
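[Figure 1.1 image: tumor growth (left; axes: time, tumor size) and
clonal tree of tumor evolution (right; node frequency labels 0%,
20%, 20%, 35%, 25%). Legend: healthy cell, first cancer cell, clonal
(trunk) mutations, set of mutations, subclone.]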
Figure 1.1: Tumor growth according to the branching clonal model of
tumor evolution (left) and a clonal tree of tumor evolution (right).
On the left, healthy cells are shown at the top as purple circles.
The first cancerous cell and the set of mutations that it harbors
are respectively depicted as a blue circle and a red star. In the
branching clonal model of tumor evolution, multiple subclones,
depicted on the left as triangles of the same color, emerge over
time and co-exist in the tumor (with the possibility of being
outcompeted and eliminated). The emergence of subclonal populations
is driven by the acquisition of somatic mutations. Sets of mutations
distinguishing a subclone from its most recent ancestor (parent)
are depicted as stars of different colors. A clonal tree of tumor
evolution, shown on the right, is a convenient way of depicting
tumor clonal evolution. Individual nodes of a clonal tree represent
subclones, and mutations are placed at the node (subclone) of their
first occurrence or, equivalently, on the edge connecting the node
with its parent. Frequency labels are also commonly assigned to the
nodes of the tree. Here, each frequency represents the prevalence
of the corresponding subclone in this hypothetical tumor at the time
of obtaining the tumor biopsy tissue (the latest timepoint in the
left part of the figure). Note that some of the subclones that
existed in the course of tumor evolution might have been outcompeted
before the time of obtaining the biopsy. Such subclones are assigned
zero prevalence and, although they are absent from the sequenced
sample, their existence in the tumor evolutionary history can, in
some cases, be inferred from the sequencing data.
A mutation tree can be defined as a clonal tree of the highest
granularity, where only a single mutation is placed at each node.
Other tree representations of clonal evolution are discussed in [67]
(e.g., binary genealogical trees), but here we restrict ourselves to
the representation by clonal/mutation trees. Later, in Section
4.2.1, we also provide formal mathematical definitions of clonal and
mutation trees.
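[Figure 1.2 image: branching clonal model along a time axis (left)
and clonal tree (right) with node frequency labels 20%, 20%, 35%,
25% and 0%.]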
Figure 1.2: Alternative illustration of the branching clonal
model (left) and a clonal tree of tumor evolution (right) under the
assumption that mutations present in a subclone are not lost in any
of its descendants. As in Figure 1.1, circles in the left part
represent cells and different sets of mutations are depicted as
stars of different colors. In the clonal tree, each edge is labeled
with the set of mutations distinguishing the child subclone from its
parent, whereas the mutational label of each node consists of the
set of all mutations harbored by the corresponding subclone. Note
that, under the assumption of no losses of mutations, the tree on
the right is equivalent to the clonal tree from Figure 1.1.
Under the assumption that none of the mutations present in some
subclone are lost in any of its descendants, we can also depict the
process of tumor growth and the tree of tumor evolution as shown in
Figure 1.2.
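
The relationship between the two labelings can be sketched in code. Each node stores only the mutations gained on the edge from its parent; under the no-loss assumption, the full mutational label of a node is the union of these edge labels along the path from the root. The tree below is a hypothetical example: node names, mutation names and topology are assumptions made for illustration and do not reproduce Figure 1.1 or Figure 1.2.

```python
from typing import Dict, FrozenSet, Optional

# child -> parent; the root has parent None (hypothetical topology).
PARENT: Dict[str, Optional[str]] = {
    "healthy": None,
    "founder": "healthy",
    "A": "founder",
    "B": "founder",
    "C": "A",
}

# Mutations gained on the edge entering each node (empty for the root).
NEW_MUTATIONS: Dict[str, FrozenSet[str]] = {
    "healthy": frozenset(),
    "founder": frozenset({"m1", "m2"}),  # clonal (trunk) mutations
    "A": frozenset({"m3"}),
    "B": frozenset({"m4"}),
    "C": frozenset({"m5"}),
}


def genotype(node: str) -> FrozenSet[str]:
    """All mutations harbored by a subclone, assuming no mutation is lost."""
    mutations = set()
    current: Optional[str] = node
    while current is not None:  # walk from the node up to the root
        mutations |= NEW_MUTATIONS[current]
        current = PARENT[current]
    return frozenset(mutations)


for node in PARENT:
    print(node, sorted(genotype(node)))
```

Splitting every multi-mutation edge label into a chain of single-mutation nodes would turn such a clonal tree into a mutation tree, although the order of mutations within one label is not determined by the clonal tree itself.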
1.2.2 Other theories of tumor evolution
Although the clonal theory of tumor evolution is well
established, with a lot of real datasets supporting this model³,
alternative models of tumor evolution can also be found in the
literature.
³ In this section we use theory and model interchangeably,
assuming the same meaning of the two terms.
According to the multistep tumorigenesis model proposed by Fearon
and Vogelstein in 1990, tumor progression follows a linear evolution
[44]. In this model, analogously to the clonal theory of cancer
evolution, acquisition of a set of somatic mutations can provide a
selective advantage to the host cell and lead to the emergence of a
new subclone. However, the model proposes that the acquired
mutations provide such a strong selective advantage to the newly
formed subclone that it soon outcompetes the existing one(s).
Consequently, most of the time, tumors are expected to be largely
homogeneous, with the bulk of the tumor mass consisting of a single,
dominant subclone (see Figure 1.3). Next generation sequencing data
increasingly dispute this simple model of tumor evolution. There is
now overwhelming evidence that tumor evolution is a more complex
and, in many cases, branching process where multiple subclones
co-exist in the same tumor [50, 49, 172].
[Figure 1.3 image: two panels, "Linear evolution" and "Neutral evolution".]
Figure 1.3: Linear and neutral models of tumor evolution.
Coloring is analogous to that in Figure 1.1. In the linear model of
tumor evolution, the set of mutations that drives the emergence of a
new subclone provides it with such a selective advantage that it
soon outcompetes the other subclone(s). The neutral model of tumor
evolution posits that intra-tumor heterogeneity is a byproduct of
tumor growth, but mutations acquired by a cell do not confer on it a
significant selective advantage. Consequently, according to this
model, a large number of genetically distinct cells and a wide
spectrum of mutational variant allele frequencies are expected to be
observed in the tumor biopsy sample.
The neutral model of tumor evolution is another model that has
gained attention in the past years [175]. It posits that most of the
driver mutations are acquired at the early stages of tumor growth
and that mutations occurring later typically do not provide a
selective advantage to the host cells. In contrast to the linear
theory, characterized by selective sweeps and a largely homogeneous
tumor, a tumor following the neutral model of evolution is expected
to have a large number of genetically distinct populations of cells
and a wide spectrum of mutational variant allele frequencies (see
Figure 1.3). The methodology used in [175] to demonstrate widespread
neutral evolution among tumors was recently disputed in [103]
and [161]. However, it is likely that some tumors evolve according
to the neutral model, and further research and evidence supporting
this model will be required in the future. A similar argument
applies to the punctuated tumor evolution model, also known as the
‘big bang’ model [156], which posits that most of the ITH arises at
the early stages of tumor growth and is followed by stable expansion
of one or several subclonal populations [160, 30, 156].
Cancer stem cell theory, which is based on the hypothesis that tumor
growth is driven by a rare subpopulation of cells, dubbed cancer
stem cells, is beyond the scope of this thesis, and we refer readers
to [20] for more details about this theory.
Here, we will use the clonal branching theory as a gold standard for
simulating tumor evolution, although two of the three methods
presented in the following chapters do not require a tumor to
strictly follow this model of evolution, nor do they require it to
be of single-cell origin.
1.3 Clinical relevance of intra-tumor heterogeneity
Before discussing the clinical relevance of ITH, we quote the very
first sentence from one of the latest reviews on the topic:
"Intratumor heterogeneity, which fosters tumor evolution, is
a key challenge in cancer medicine." [104].
Numerous studies in the past years suggest that ITH has several
potential clinical implications. For instance, in [93] and [108] a
correlation between subclonal diversity and progression to
esophageal adenocarcinoma in Barrett's esophagus was reported. In
chronic lymphocytic leukemia, the presence of a subclonal driver was
found to be an independent risk factor for rapid disease progression
[85]. The extent of ITH has also been linked to the tumor's
metastatic potential and to disease-free survival. Findings from
[180] suggest that patients developing metastasis in triple-negative
breast cancer had a significantly higher measure of ITH in the
primary tumor, whereas a study of colorectal cancer [71] reported
that a high degree of ITH in the primary tumor was correlated with
an increased rate of liver metastasis and shorter disease-free
survival.
The presence of extensive ITH and the ability of a tumor to acquire
new mutations are considered to be among the key causes of treatment
failure. In most cases, even if a drug works at first, it will not
work over the long term [72]. Radiation and chemotherapy can promote
the emergence of new subclones resistant to treatment, but treatment
resistance can also be driven by a minor or dormant subclone already
existing in the tumor prior to treatment initiation
[23, 42, 147, 111, 41, 117].
While there is increasing evidence that ITH can be exploited in the
clinic as a prognostic indicator [126], research into the design of
effective treatments that will cure cancer or, at least, prevent its
uncontrolled growth and turn it into a chronic disease with low
impact on
the quality of life [4], is still in its inception. A tumor's
ability to adapt to treatment and pervasive intra- and inter-tumor
heterogeneity, even among cancers of the same type, greatly
complicate the design of clinical trials. We expect that
technological advancements in sequencing, imaging, information
sharing and many other fields will facilitate the design of larger
clinical trials and inspire the discovery of novel therapeutic
targets and treatment regimens. Adaptive therapy was proposed as a
potential treatment strategy to prevent uncontrolled tumor growth
[48]. Its main idea lies in continuously modulating treatment in
order to maintain a fixed tumor population while avoiding complete
elimination of the subclones sensitive to treatment. Namely,
elimination of these subclones is typically followed by uncontrolled
growth of chemoresistant populations. On the other hand, allowing a
fraction of chemosensitive subclones to survive can provide a means
to suppress proliferation of the less fit but chemoresistant
subclones through competition for limited resources between
subclones. Although adaptive therapy is an interesting approach for
controlling tumor growth, its successful implementation in clinical
practice will require a good understanding of the tumor subclonal
composition, the fitness of individual subclones and the selective
advantage of the chemosensitive over the chemoresistant subclones.
In addition to genetic ITH, which is our main focus, there are
also other types of ITH. For example, in [131] methylation profiling
of localized lung adenocarcinomas revealed a correlation between the
extent of DNA methylation ITH and tumor size. In the same study, it
was also found that, on average, most of the somatic DNA mutations
were shared among all of the sequenced tumor regions, suggesting
that they occurred at the early stages of tumor progression, whereas
only a quarter of the differentially methylated probes were shared
among all regions. These findings indicate that tumor-specific DNA
methylation might be associated with the later branched evolution
observed in the set of patients analyzed in this study [131]. It is
also known that gene expression in individual tumor cells belonging
to the same subclonal population can be influenced by their position
in the tumor (e.g., center of a tumor vs. its boundary) [62, 157].
Integrating genetic ITH with other types of ITH and other important
factors (e.g., interaction of tumor cells with the microenvironment)
will be of great importance in future studies of tumor growth and
progression.
1.4 Motivation, Contributions and Thesis Organization
We expect that, in the foreseeable future, analysis of tumor
subclonal composition and evolution will become more common in
clinical practice and aid clinicians in diagnostics, prognostics, as
well as in making treatment decisions and designing the best
therapies. Due to the extensive intra- and inter-tumor
heterogeneity, such therapies will most likely be tailored to the
genetic makeup of individual tumors and consist of a combination of
several drugs targeting different subclonal populations. In this
context, knowledge of the clonal tree of tumor evolution can also be
highly valuable, as it reveals divergent
subclonal populations (i.e., subclones evolving on different
branches of the tree) that might need to be targeted separately,
particularly in cases where drugs targeting clonal mutations fail to
provide the desired results in halting tumor growth.
Future studies of shared patterns of tumor evolution among large
cohorts of cancer patients will also benefit from methods for
accurate inference of clonal trees for individual patients. These
studies will further improve our understanding of cancer onset and
progression. Shared patterns can provide novel insights about the
most significant mutations that promote tumor growth (i.e., driver
mutations) and treatment resistance, but also about the advantages
and disadvantages that the simultaneous presence of sets of
mutations confers on the host cells. Furthermore, we expect that
patterns of tumor evolution will be used for predicting the next
steps in tumor evolution. Successful prediction of tumor
evolutionary behavior requires deterministic patterns of tumor
evolution and is expected to be an important subject for future
cancer research. One recent study, involving over 100 patients
diagnosed with clear-cell renal cell carcinoma, found evidence for
the deterministic nature of clonal evolution in this cancer type
[166]. However, these findings need to be validated on larger
cohorts, and similar studies conducted for other cancer types.
Metastasis is estimated to be responsible for ∼90% of
cancer-related deaths [107, 158]. Tracking metastatic seeding
patterns is one of the most important problems in better
understanding the biological background of metastasis. This task is
very challenging, as metastatic seeding is a complex process. In
addition to metastatic seeding from the primary site, existing
metastases can give rise to new ones. Re-seeding between two
existing metastatic sites adds a further level of complexity to the
whole problem [16]. However, all cancerous cells of a given patient
are related through a common (shared) tree of tumor evolution, which
can provide answers to many questions related to the metastatic
process. In the past years, we have been witnessing increasing
interest in research on various aspects of metastasis [55, 106, 63,
139], and specialized methods for studying metastatic seeding
patterns were developed recently [40, 135]. We recommend [165] for a
thorough review of the subject.
All of the above illustrates the importance of a better
understanding of ITH and tumor evolution at the level of the
individual patient. The inference of tumor subclonal composition
and evolution can be performed by the use of various signals
originating from detected mutations and will be discussed in more
detail in Chapter 2. The pioneering work studying ITH and evolution
typically focused on a small number of selected genomic alterations
from several tumor samples and involved manually reconstructing the
subclonal composition and/or phylogeny of these tumors [147, 50].
Developments in DNA sequencing technologies enabled large-scale
cancer sequencing efforts where whole exomes/genomes of thousands of
tumor samples were sequenced [173]. Manual analysis of each of the
individual
tumors from such large data cohorts clearly became impractical and
necessitated the development of automated computational methods.
The vast majority of all currently available tumor DNA
sequencing data was obtained via bulk sequencing, where DNA from
millions of cells is pooled and sequenced together, giving only an
average signal over a large number of cells. Studying tumor
evolution and subclonal composition from such data requires the
development of computational methods specialized for deconvolution
of mixed signals while handling intrinsic properties of sequencing
data. We devote a separate section in Chapter 2 to the advantages
and limitations of the use of bulk sequencing data in this context.
In Chapter 3 we introduce CTPsingle, a method for the inference of
tumor subclonal composition and evolution from bulk sequencing data.
CTPsingle assumes that cancer originates from a single cell and
follows the clonal theory of evolution. It is currently used as one
of the methods to infer clonality in the Evolution and Heterogeneity
Working Group of the Pan-Cancer Analysis of Whole Genomes project
[173]. Like other similar tools, CTPsingle faces some theoretical
limitations due to the limited resolution of bulk sequencing data.
Some of the limitations of bulk data can be resolved by the use of
data obtained by recently introduced single-cell sequencing. Several
methods for single-cell sequencing developed in the past years
generate data of the ultimate resolution for studying ITH and
evolution. However, obtaining a large number of single cells that
are well representative of the subclonal populations present in a
tumor is still very challenging due to the cost of single-cell
sequencing and non-uniform sampling of individual tumor cells.
Furthermore, single-cell data is contaminated with various types of
noise, which is a major obstacle to the analysis of tumor evolution
by direct application of standard phylogenetic techniques.
Therefore, successful use of this type of data in tumor
phylogenetics requires the development of specialized computational
methods. We devote a separate section in Chapter 2 to more detailed
background on the main characteristics of this data type and to a
summary of developments in the design of related computational
methods.
Importantly, bulk data is not exploited in any of the previously
developed methods for studying tumor evolution from single-cell
data. As the strengths and weaknesses of the two data types are to a
large extent complementary with respect to phylogeny inference,
performing both bulk and single-cell sequencing may be a competitive
strategy for tumor phylogeny reconstruction. In Chapters 4 and 5 we
introduce B-SCITE and PhISCS, the first two methods for tumor
phylogeny inference that leverage the complementary strengths of
single-cell and bulk sequencing data in a joint inference framework.
In addition to superior performance over the existing alternatives
on a comprehensive set of simulated data, we also show that these
tools generate more realistic mutation histories on several real
datasets. For PhISCS, we provide implementations based on both
integer linear programming (ILP) and constraint satisfaction
programming (CSP) and show that, at least in the context of tumor
phylogeny inference, CSP might be a time-efficient alternative to
the ubiquitously
used ILP. In contrast to the existing methods for inferring
trees of tumor evolution from single-cell data, which are based on
probabilistic search schemes, PhISCS provides a guarantee of
optimality for the reported solutions or a bound on the best
achievable objective. PhISCS is also the first method that
integrates single-cell and bulk sequencing data while accounting
for possible violations of the commonly used and recently debated
[81] infinite sites assumption.
Since all three methods are based on the use of single
nucleotide variants, most of the attention in Chapter 2 is devoted
to the description of related methods based on the use of this type
of mutation.
Although they are not our main contributions, it is worth
mentioning that in this thesis we introduce potentially interesting
strategies for generating simulated data, in particular mutations
violating the infinite sites assumption and mutations affected by
copy number aberrations. We also propose several distinct and novel
measures for comparing trees of tumor evolution and use them in
comparing the performance of different methods for tree
reconstruction. Methods for comparing trees of tumor evolution are
currently lacking, and we believe that the proposed measures will
inspire future research on the topic. Some other interesting and
important directions for future research are presented in Chapter
6.
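
To indicate what a measure for comparing trees of tumor evolution can look like, the sketch below computes one plausible form of an ancestor-descendant accuracy: the fraction of ordered (ancestor, descendant) pairs of the true tree that are also ancestor-descendant pairs in the inferred tree. This definition, the tree encoding and the example trees are assumptions made for illustration; the measures actually used in this thesis are defined in the later chapters and may differ.

```python
from typing import Dict, Optional, Set, Tuple

Tree = Dict[str, Optional[str]]  # node -> parent (None for the root)


def ancestor_descendant_pairs(tree: Tree) -> Set[Tuple[str, str]]:
    """All ordered (ancestor, descendant) pairs in a rooted tree."""
    pairs = set()
    for node in tree:
        ancestor = tree[node]
        while ancestor is not None:  # climb from the node to the root
            pairs.add((ancestor, node))
            ancestor = tree[ancestor]
    return pairs


def ancestor_descendant_accuracy(true_tree: Tree, inferred_tree: Tree) -> float:
    """Fraction of true ancestor-descendant pairs kept by the inferred tree."""
    true_pairs = ancestor_descendant_pairs(true_tree)
    if not true_pairs:
        return 1.0  # nothing to preserve
    inferred_pairs = ancestor_descendant_pairs(inferred_tree)
    return len(true_pairs & inferred_pairs) / len(true_pairs)


# Hypothetical trees over the same nodes.
true_tree = {"root": None, "a": "root", "b": "a", "c": "a"}
inferred_tree = {"root": None, "a": "root", "b": "a", "c": "root"}
print(ancestor_descendant_accuracy(true_tree, inferred_tree))
```

A measure of this kind looks only at one type of relationship; pairs placed on different branches or in the same node would have to be scored by separate, complementary measures.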
Chapter 2
Background
In this chapter we provide background on the existing
computational approaches for deciphering ITH and inferring trees of
tumor evolution. Based on the input data used, we classify methods
into three main groups: (i) methods designed only for bulk
sequencing data, (ii) methods designed only for single-cell
sequencing data, and (iii) methods combining both single-cell and
bulk sequencing data. We devote a separate section to each group of
methods and also describe the main advantages and limitations of
bulk and single-cell sequencing data in studying ITH and evolution.
The rapid developments in the design of algorithms and methods
discussed in this thesis would be largely impossible without the
completion of the first draft of the human reference genome,
announced in 2001 [22], and the invention of novel approaches for
DNA sequencing that followed soon afterwards. In 2004, several
technologies for massively parallel DNA sequencing, better known as
Next Generation Sequencing (NGS), were introduced [109]. Since all
methods discussed in this work rely on data generated by some of
the available NGS platforms, we devote the first section of this
chapter to NGS. In the following sections we discuss the main
concepts and the existing methods for inferring tumor subclonal
composition and/or trees of tumor evolution.
2.1 Next Generation Sequencing
Next generation sequencing, also called second-generation
sequencing, is a common name used for several sequencing
technologies that first appeared in 2004 and gradually replaced
traditional Sanger sequencing. Due to the high cost of sequencing at
the early stages of technology development, its use was largely
limited to academic research laboratories. The first NGS sequencers
were used for sequencing selected genomic regions of interest and,
rarely, for sequencing whole genomes [174, 170].
However, the broad potential of NGS technologies was recognized
soon after their introduction, and we have been witnessing large
investments and rapid technological advancements in
the field over the past 15 years. Some of the main drivers of
innovation include market competition among companies providing
sequencing infrastructure, but also large financial support through
public research funding agencies. For example, the National Human
Genome Research Institute awarded more than 100 million USD for
developments in NGS between 2004 and 2008 [109, 146]. As a result,
the cost of sequencing has plummeted and whole genome sequencing
(WGS) can nowadays be routinely performed, while the use of genetic
tests is becoming common clinical practice [80, 73].
Developments in sequencing technologies also led to the
development of different NGS platforms. Most of the particular
details of the equipment and laboratory steps required to perform an
NGS experiment fall largely outside the scope of this thesis and,
for more thorough reading, we refer to some of the numerous reviews
published on these topics [109, 153, 53]. Here, we focus only on
summarizing the details of the NGS sequencing process and the
generated output data that are most relevant to the development of
the computational methods discussed later in the thesis.
2.1.1 Preparing the input of NGS experiment
One of the first steps that needs to be performed in an NGS
experiment is input DNA preparation. Depending on the intended use
of the data and the financial, technical and other resources
available, there are different strategies for extraction of cellular
DNA and preparation of the final DNA that is later provided as input
to the sequencing machine. We distinguish between different DNA
preparation strategies in terms of the number of tumor cells
represented in the final DNA (bulk vs. single-cell sequencing) and
between different approaches to sequencing in terms of the size of
the sequenced region (targeted sequencing, whole exome and whole
genome sequencing). We now briefly describe these two
classifications, without getting into application details that are
discussed later.
Bulk vs. single-cell sequencing
Each NGS experiment requires a minimum amount of DNA to be
provided as input to the sequencing machine. According to some
estimates, this amount equals the amount of DNA found in
approximately 80,000 single cells [169]. For this reason, the
extraction of DNA from hundreds of thousands or millions of single
cells is usually one of the first steps of input DNA preparation.
DNA extraction can be followed by additional DNA amplification
steps (e.g., in the targeted sequencing approach discussed below).
Nevertheless, in this case the sequenced DNA is a mixture of DNA
originating from hundreds of thousands or millions of single
cells.
Sequencing where the initial DNA material is obtained using this
approach is known as bulk sequencing.
On the other hand, since the amount of DNA present in a single
cell is insufficient for sequencing, sequencing the DNA of a
single cell first requires precise extraction of the cell's DNA,
followed by several rounds of DNA amplification in order to reach
the amount of DNA required for sequencing. Sequencing where DNA is
prepared in this way is known as single-cell sequencing (SCS).
Targeted, whole exome and whole genome sequencing
The process of input DNA preparation and sequencing also depends
on the intended use of the data. The aim of a sequencing
experiment might be to obtain data about a particular region
(target) of interest or to interrogate genomic variation at the
whole exome or whole genome level. Based on the size of the
sequenced region, we divide sequencing experiments into the
following three groups:
1. Targeted sequencing, which is used when we are interested in
examining genetic variants in a pre-defined set of genomic
regions. Examples include validating a mutation of interest and
sequencing a selected set of genes known to harbor mutations
causing a particular disease. In clinical uses of sequencing in
cancer treatment, targeted sequencing of genes for which known
treatment options exist is becoming common practice.
2. Whole Exome Sequencing (WES), which is used to search for
mutations in protein-coding regions (exons) of genes. These regions
are expected to harbor the majority of deleterious mutations.
3. Whole Genome Sequencing (WGS), where the goal is to obtain
sequencing data of the whole genome.
2.1.2 Output of NGS experiment
The typical output of an NGS experiment consists of millions or,
more recently, billions of sequencing reads (short fragments of
the sequenced DNA). Nowadays, most NGS data is generated using
Illumina's short-read sequencing technology, which provides data
of high throughput and accuracy. Reads generated by this
technology are usually 100 to 150 base pairs long and have between
0.1% and 1% erroneously called bases.
Sequencing depth, also called sequencing coverage, is another
important parameter of an NGS dataset. Coverage at a given
position is defined as the number of reads that overlap this
position after read mapping (the process that determines, for each
read or part of a read, the most similar location in the Human
Reference Genome). The average coverage of a given set of genomic
regions is the mean coverage of the positions falling into these
regions.
Sequencing breadth is usually defined as the percentage of
positions with sequencing depth greater than or equal to some
given constant c. When computing sequencing breadth, only
positions that were intended to be sequenced (e.g., the set of all
exons in WES) are considered.
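The definitions of depth, average coverage and breadth given above
can be made concrete with a minimal sketch. It assumes that reads
have already been mapped and are represented as half-open
intervals [start, end) on a single targeted region; the function
names and the toy reads are illustrative assumptions, not part of
any particular tool.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open interval [start, end)


def per_position_depth(reads: Iterable[Interval], region: Interval) -> List[int]:
    """Number of mapped reads overlapping each position of `region`."""
    start, end = region
    depth = [0] * (end - start)
    for r_start, r_end in reads:
        # Clip the mapped read to the targeted region before counting.
        for pos in range(max(r_start, start), min(r_end, end)):
            depth[pos - start] += 1
    return depth


def average_coverage(depth: List[int]) -> float:
    """Mean depth over the positions of the targeted region."""
    return sum(depth) / len(depth)


def breadth(depth: List[int], c: int) -> float:
    """Percentage of targeted positions with depth >= c."""
    return 100.0 * sum(d >= c for d in depth) / len(depth)


# Toy data (hypothetical): three mapped reads over a 10-position target.
reads = [(0, 6), (2, 8), (4, 10)]
depth = per_position_depth(reads, (0, 10))
print(depth)
print(average_coverage(depth), breadth(depth, 2))
```

Only positions inside the targeted region enter the computation,
mirroring the convention that breadth is evaluated over positions
that were intended to be sequenced.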
2.1.3 The uses of NGS data in studies of intra-tumor
heterogeneity and tumor evolution
NGS nowadays enables cost-effective interrogation of genomic
regions of interest or even whole genomes. Due to the importance
of genetic aberrations in cancer onset and progression, sequencing
of tumor samples has been one of the main applications of NGS
since its very beginnings. Detection of various types of mutations
by whole exome or whole genome sequencing facilitates
identification of the key cancer driver mutations and of the
mutational burden among distinct cancer types. Targeted sequencing
in search of genomic aberrations for which known treatment options
are available is offered at affordable prices (on the order of
hundreds of US dollars) and is starting to become clinical routine
(e.g., in a study involving 1281 oncologists in the United States,
75.6% reported using NGS tests to guide treatment decisions for
their patients [46]).
In addition to enabling large-scale studies of inter-tumor
heterogeneity [162, 173], which is characterized by distinct sets
of mutations present among distinct patients, developments in
sequencing technologies have also facilitated the exploration and
better understanding of the extent of genetic ITH and tumor
evolution (see Section 1.1 for a summary of the first studies
using NGS for analyzing ITH and evolution).
Due to cost and technological constraints, bulk sequencing is
still the dominant approach in tumor sequencing. Most analyses of
ITH and evolution from bulk sequencing data start with whole exome
or whole genome sequencing [50, 55, 119, 147, 173, 172, 88]. While
data produced by WES or WGS has high sequencing breadth, its
typical depth ranges between 30× and 100× [172, 49]. Due to this
low sequencing depth, variant allele frequencies, defined as the
fraction of reads supporting the variant allele of a reported
mutation, are usually characterized by high variance. An
additional limitation of such data lies in detecting rare
mutations, since their signals are usually not discernible from
the sequencing noise. Therefore, WES or WGS data is very commonly
used for identifying putative variants and is then followed by
targeted sequencing of a selected subset of these putative
variants [147, 55, 172]. Targeted sequencing is performed in order
to identify true variants and obtain highly reliable variant
allele frequencies that are later used in the analysis, or to
validate findings obtained from WES or WGS data. Custom sequencing
panels targeting hundreds of genes were recently used in [88] and
[166]. These panels can provide data of higher coverage at lower
cost than standard WES, but many important mutations in genes not
covered by the panels can be missed.
The first method for single-cell sequencing was introduced in 2011
[116]. Although there have been many developments in single-cell
sequencing since 2011 [81, 186], isolating, amplifying and
sequencing the DNA of a larger number of individual cells (which
is necessary in order to get an appropriate input for inferring
tumor subclonal frequencies and the tree of tumor evolution) is
still challenging and expensive in comparison to bulk sequencing.
Single-cell data is also characterized by elevated noise rates,
with many false negative and some false positive mutation calls.
Occasionally, DNA from two or more single cells may be extracted
together, resulting in doublet noise and output data that reflects
the DNA of multiple cells. Non-uniform extraction of single cells
can also result in sampling biases, where the numbers of cells
sampled from subclones are not proportional to their cellular
prevalences. The effect of this type of noise is significantly
lower in bulk sequencing. Nevertheless, despite various types of
noise, single-cell sequencing yields data of the highest possible
resolution and has great potential to revolutionize studies of
tumor evolution. This type of data can also help in detecting some
of the rare mutations that can be missed by standard bulk
approaches. Some evidence for this was provided in [172] and [88].
However, detection of rare mutations by SCS depends on the quality
of the generated data and on the sampling of single cells (due to
noise in SCS data, usually at least two cells harboring a mutation
need to be sampled in order to get reliable mutation calls).
2.2 Inference of tumor subclonal composition and evolution from
bulk sequencing data
Due to the input DNA preparation strategy used, a bulk sequencing
dataset consists of a set of reads originating from a mixture of a
large number of different cells and therefore yields only an
aggregate signal about their DNA. Consequently, in the case of
bulk sequencing of heterogeneous tumor tissue, none of the
subclonal populations is observed directly and, in order to infer
tumor subclonal composition, we need computational methods for
deconvolution of the observed aggregate signals. Each such method
faces the very challenging problem of inferring an unknown number
of tumor subclones, their unknown prevalences and the unknown sets
of somatic mutations harbored by individual subclones. In
addition, many of the methods also infer a clonal tree of tumor
evolution (defined in Section 1.2.1) and exploit data obtained by
sequencing multiple samples of the same patient [81].
Due to their high prevalence in many tumors [19], well-developed
methods for their identification from bulk data [178] and
simplicity of use in modeling tumor subclonal composition and
evolution, single-nucleotide variants (SNVs) are the most widely
used type of mutation among the existing methods for studying ITH
and evolution [97, 60, 38, 39, 128, 142, 66, 130, 33, 150, 125,
189, 159, 145, 110, 69, 120, 164, 114, 133]. In addition, several
methods based on the use of copy number aberrations (CNAs) [58,
124, 123, 25, 185] and a few based on the use of other types of
variants (e.g., large insertions) [37, 21] have also been
developed in the past years. As will be illustrated below, CNAs
overlapping the genomic position of a given SNV can have a
substantial impact on the fraction of reads supporting the variant
allele, as well as on the total number of reads spanning the
position. Since these two numbers (for each SNV) are the main
input for all methods based on the use of SNVs, prior knowledge of
the copy number status of regions harboring SNVs is a required
part of the input for most such methods. Some methods expect exact
estimates of copy number (e.g., [33, 39]), whereas others use this
information only to filter out mutations from non-diploid regions
(e.g., [38, 97]). Note that, in principle, the read count data
should be sufficient input to the methods based on the use of SNVs
(since the copy number status of the mutated loci can be inferred
from the read counts). However, as copy number events typically
affect larger genomic regions, it would then be highly desirable
to consider not only reads spanning the genomic positions of the
mutated loci, but many other, if not all, reads. This would come
at the cost of considerably increased model complexity and very
challenging optimization problems, and is one of the main reasons
that copy number analysis is performed as a separate step prior to
running most SNV-based methods.
We now present some of the main concepts used in methods based on
SNVs. We start with the simple case where bulk sequencing data of
a single tumor sample is available. We also discuss challenges
arising when CNA events overlap with the position of an SNV and
summarize theoretical limitations. These limitations are usually a
consequence of the limited resolution of bulk NGS data and are
particularly pronounced when only a single sequenced sample is
available. In such cases, some of the limitations are very likely
to affect the correctness of the solution, regardless of the
method used in the analysis. However, in many cases, most of the
limitations can be resolved if data from multiple samples from the
same patient is available. Given the importance of multiple
samples when they are available, we devote a separate subsection
to the potential benefits of the joint use of data obtained by
sequencing multiple tumor samples (of the same patient). We assume
that all SNVs considered in the input are from positions that are
homozygous in the matching normal tissue (which is usually
sequenced in addition to tumor tissue in order to distinguish
somatic from germline variants). At the end of this section, we
discuss methods based on the use of CNAs and other types of
variants.
2.2.1 Variant and reference read counts as a proxy for the
fraction of cells harboring mutation
Assume that we are given a heterozygous SNV, denoted a, that
belongs to a region of the genome not affected by any CNA event.
Let $v_a$ and $t_a$ respectively denote the number of variant
reads and the total number of reads covering this SNV. Assume for
the moment that we have a perfect sampling of reads from the
sequenced cells, where each cell harboring the mutation
contributes two reads to the bulk data: one from the chromosome
with the mutation (i.e., a read supporting the variant allele) and
one from the chromosome without the mutation (i.e., a read
supporting the reference allele). We also assume that each cell
not harboring mutation a contributes two reads, both supporting
the reference allele. Under these assumptions, the total number of
cells in the sample equals $t_a / 2$. On the other hand, the total
number of cells harboring mutation a equals $v_a$. Together, these
imply that the fraction of cells harboring mutation a, also
referred to as the cellular prevalence of a and here denoted
CP(a), equals:
$$\mathrm{CP}(a) = \frac{v_a}{t_a / 2} = \frac{2 v_a}{t_a}. \tag{2.1}$$
In practice, the sampling of variant and total reads in bulk data
is not as ideal as assumed above. However, assuming that reads are
emitted uniformly at random, which is usually a sound assumption
for bulk sequencing, Equation 2.1 provides an estimate of the
cellular prevalence of a given mutation. Clearly, the precision of
this estimate depends on the depth of sequencing coverage. For
example, in low-coverage data with $t_a = 40$, a single-read
difference in the number of observed variant reads $v_a$ changes
the estimate of the cellular prevalence of a by 2/40 = 0.05 (i.e.,
5%). Since the effect of variance in $v_a$ on the estimated
cellular prevalence decreases with increasing sequencing depth,
deep sequencing is necessary in applications where highly accurate
estimates of mutational cellular prevalences are required.
The effect of copy number aberrations on the observed variant
allele frequencies
In the case where an SNV belongs to a genomic region affected by a
CNA event, the formula given in Equation 2.1 cannot be used
directly for computing the fraction of cells harboring the SNV.
This is illustrated by the example shown in Figure 2.1. In this
example, we consider two mutations, M1 and M2, and assume that M2
belongs to a genomic region not affected by any CNA event. On the
other hand, as shown in the figure, M1 belongs to a genomic region
affected by a CNA event which involves a gain of the reference
allele (green).
[Figure 2.1 labels: cell populations with prevalences 22%, 18% and
60%, annotated with mutations M1 and M2.]
Figure 2.1: Copy number gain of reference allele (green
rectangle) occurring after M1.
Under perfectly uniform sampling and without noise in the
sequencing data, the number of reads supporting the reference
allele and spanning the genomic position of M1 is proportional to
0.22 · 2 + 0.18 · 1 + 0.60 · 2 = 1.82. For the variant allele,
this number is proportional to 0.18 · 1 + 0.60 · 1 = 0.78. Now, if
we apply the formula from Equation 2.1, we get that the cellular
prevalence of M1 equals 2 · 0.78/(0.78 + 1.82) = 0.6, a value
which is very different from the true value of 0.78. Furthermore,
this value is equal to the estimated cellular prevalence of M2
and, in the absence of multiple bulk samples, this would result in
incorrect merging of the two mutations into the same
cluster/subclone (as will be discussed below).
This example clearly illustrates the importance of appropriately
adjusting variant read counts in cases where one or multiple CNA
events overlap the genomic position of a given mutation. In such
cases, estimating the CP value from read counts requires
additional information about each of the CNA events affecting the
genomic position of the mutation. However, detecting all CNA
events and the number of copies of regions lost or gained during
each event is a very challenging problem in itself. This problem
is further complicated by the presence of ITH [123]. The relative
timing of CNA events and the first occurrence of the SNV is yet
another complexity that needs to be modelled properly. For all
these reasons, many of the existing methods are restricted to SNVs
from diploid regions [97, 38, 128, 60, 189]. Some of the methods
considering mutations from non-diploid regions (e.g., PhyloWGS
[33]) expect copy number estimates as input, whereas a few (e.g.,
Canopy [69]) infer CNA events together with SNVs.
2.2.2 Clustering of mutations based on the read counts
Here, we assume that tumor growth follows the branching clonal
theory of tumor evolution (see Section 1.2.1) and restrict
ourselves to tumors harboring SNVs. At this point, we do not go
into the details of the minimum number of SNVs needed in order to
analyze ITH and evolution. For each SNV, we assume that it occurs
exactly once and is never lost (this is known as the Infinite
Sites Assumption and will be discussed later in more detail). In
our notation, we introduce p(S), which denotes the parent of
cancerous subclone S in the clonal tree.
Recalling clonal theory and the Infinite Sites Assumption, the set
of cells harboring SNVs that appear for the first time at S (i.e.,
mutations that distinguish S from its parent p(S)) consists of the
cells of subclone S and all cells of its descendant subclones.
This implies that all mutations appearing for the first time in S
are expected to have similar values of the true cellular
prevalence (not necessarily identical, since mutation accumulation
between p(S) and S is a gradual process in which some mutations
happen earlier). The emergence of a subclone S also requires that
its first (founding) cell harbors one or multiple driver mutations
that are absent from p(S). Such mutations, acquired between the
two subclonal expansions, p(S) and S, provide some selective
advantage or disadvantage to S, distinguishing it (in terms of
selective potential) from p(S). The acquisition of driver
mutations is a lengthy process requiring several cell divisions
until one of the descendant cells of p(S) acquires a critical set
of mutations, producing the founder cell of S. Due to the
imperfect DNA replication process, during these cell divisions
newly born cells acquire "hitchhiker" (selectively neutral)
mutations that are later passed to their descendants. In addition,
external factors (e.g., ultraviolet light) might also contribute
to the acquisition of new mutations. In summary, in the observed
data we expect the existence of clusters of mutations, each
occurring between two subclonal expansions, distinguishing a child
subclone from its parent, and having highly similar cellular
prevalence values.
All of the above is illustrated in Figure 2.2, where (a) shows the
tree of tumor evolution for a hypothetical tumor and (b) shows the
distribution of cellular prevalences computed from read counts
available in bulk data using Equation 2.1. When simulating read
counts, we assume that each locus has coverage equal to 200×. From
Figure 2.2(b), it is evident that the distribution of cellular
prevalences can be well explained by three distinct Gaussian
distributions with means around 10%, 30% and between 50% and 55%,
as expected based on the frequencies in Figure 2.2(a). An example
of a desirable clustering of mutations in this case is shown in
Figure 2.4(a).
[Figure 2.2 panels: (a) ground truth tree with subclone
frequencies 45%, 15%, 10% and 30%; mutations of the 15% subclone
are present in (15+10+30)% = 55% of cells. (b) Histogram for
COVERAGE = 200; x-axis: cellular prevalence (based on the read
counts), y-axis: number of mutations.]
Figure 2.2: (a) Clonal tree of evolution of a hypothetical tumor.
(b) Number of mutations (y-axis) having the observed cellular
prevalence (x-axis). Cellular prevalence values were computed
using Equation 2.1 and are shown as percentages (we assume that
all mutations are from diploid regions). Coverage at each position
was set to 200× (i.e., $t_a = 200$ for each mutation a) and the
number of variant reads (denoted previously as $v_a$) was drawn
from a Binomial distribution with success probability depending on
the cellular prevalence of the mutation. The red, blue and green
clusters of mutations consist of 265, 300 and 65 mutations,
respectively.
In order to illustrate the effect of sequencing depth on the
observed values, Figure 2.3 shows an example with the same
parameter settings as in Figure 2.2, except that the sequencing
coverage was set to 50×. As Figure 2.3(b) illustrates, detection
of the green cluster of mutations becomes very challenging when
working with sequencing data of low coverage, despite the fact
that its frequency of 30% is well separated from the frequencies
of the other two clusters (10% and 55%) and that, in the ground
truth, over 10% (65 out of 630) of all mutations belong to this
cluster. An example of a desirable clustering in this case is
shown in Figure 2.4(b).
[Figure 2.3 panels: (a) ground truth tree with subclone
frequencies 45%, 15%, 10% and 30%; mutations of the 15% subclone
are present in (15+10+30)% = 55% of cells. (b) Histogram for
COVERAGE = 50; x-axis: cellular prevalence (based on the read
counts), y-axis: number of mutations.]
Figure 2.3: An example of the distribution of mutational cellular
prevalences observed from bulk data. All settings are the same as
in Figure 2.2, except that the sequencing depth equals 50×.
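The simulations behind Figures 2.2 and 2.3 can be approximated by
the following sketch, which is not the code used to produce the
figures. It assumes that the success probability of the Binomial
distribution equals CP/2 (consistent with Equation 2.1 for
heterozygous SNVs in diploid regions). The cluster sizes and the
30% prevalence of the 65-mutation cluster are taken from the
text; assigning 265 mutations to the 55% cluster and 300 to the
10% cluster is an assumption, since the text does not state which
of the red and blue clusters has which prevalence.

```python
import random
from statistics import pstdev
from typing import Dict, List

# (true cellular prevalence, number of mutations).
# Assumption: the 265/300 split between the 55% and 10% clusters.
CLUSTERS = [(0.55, 265), (0.10, 300), (0.30, 65)]


def simulate(coverage: int, seed: int = 0) -> Dict[float, List[float]]:
    """Simulated CP estimates per cluster at a fixed coverage t_a."""
    rng = random.Random(seed)
    observed: Dict[float, List[float]] = {}
    for cp, n_mutations in CLUSTERS:
        estimates = []
        for _ in range(n_mutations):
            # v_a ~ Binomial(t_a, cp / 2), drawn as a sum of Bernoulli trials.
            v_a = sum(rng.random() < cp / 2 for _ in range(coverage))
            estimates.append(2 * v_a / coverage)  # Equation 2.1
        observed[cp] = estimates
    return observed


for coverage in (200, 50):
    for cp, estimates in simulate(coverage).items():
        # Spread of the observed CP values around the true prevalence.
        print(coverage, cp, round(pstdev(estimates), 3))
```

The wider spread at 50× makes neighboring clusters overlap, which
is in line with the difficulty of detecting the small 30% cluster
at low coverage.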
Both of the above examples assume that only a single tumor sample
is sequenced. In practice, bulk sequencing data of two or more
samples are sometimes available [147]. In these cases, all of the
above remains the same, except that for each mutation, instead of
its cellular prevalence, we consider a vector of its cellular
prevalences across all samples.
Weak parsimony assumption
The goal of most available methods for deciphering ITH is to
obtain a clustering of mutations, with each cluster consisting of
mutations from the same subclonal expansion (as illustrated in the
previous figures). One of the most frequently used assumptions
among existing methods is the weak parsimony assumption. According
to this assumption, SNVs with similar cellular prevalence values
are present in similar sets of cells (i.e., occur for the first
time at the same subclone) and are therefore clustered together.
Here, the similarity of cellular prevalence values is largely
determined by the depth of sequencing coverage. For example, while
CP values of 25% and 30% are significantly different in the case
of bulk data of coverage 10,000×, these values are very similar in
a dataset where coverage equals 40×. In general, there is some
degree of trade-off between the number of mutations and sequencing
coverage. Targeted deep sequencing data enables sequencing smaller
numbers of mutations but, assuming appropriate adjustments for the
possible CNA events are