Inference of tumor subclonal composition and evolution by the use of single-cell and bulk DNA sequencing data by Salem Malikić M.Sc., Simon Fraser University, 2014 B.Sc., University of Sarajevo, 2011 Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the School of Computing Science Faculty of Applied Sciences © Salem Malikić 2019 SIMON FRASER UNIVERSITY Summer 2019 All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, education, satire, parody, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.
  • Inference of tumor subclonal composition and evolution by the use of single-cell and bulk DNA sequencing data

    by

    Salem Malikić

    M.Sc., Simon Fraser University, 2014
    B.Sc., University of Sarajevo, 2011

    Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

    Doctor of Philosophy

    in the
    School of Computing Science
    Faculty of Applied Sciences

    © Salem Malikić 2019
    SIMON FRASER UNIVERSITY

    Summer 2019

    All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, education, satire, parody, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

  • Approval

    Name: Salem Malikić

    Degree: Doctor of Philosophy (Computing Science)

    Title: Inference of tumor subclonal composition and evolution by the use of single-cell and bulk DNA sequencing data

    Examining Committee: Chair: Jian Pei, Professor

    Leonid Chindelevitch, Senior Supervisor, Assistant Professor
    S. Cenk Sahinalp, Co-Supervisor, Professor, School of Informatics, Computing and Engineering, Indiana University
    Cedric Chauve, Supervisor, Professor
    Colin Collins, Supervisor, Professor, Department of Urologic Sciences, University of British Columbia
    Maxwell W Libbrecht, Internal Examiner, Assistant Professor
    Russell Schwartz, External Examiner, Professor, Department of Biological Sciences, Carnegie Mellon University

    Date Defended: August 21, 2019


  • Abstract

    Cancer is a genetic disease characterized by the emergence of genetically distinct populations of cells (subclones) through the random acquisition of mutations at the level of single cells and shifting prevalences at the subclone level through selective advantages conferred by driver mutations. This interplay creates complex mixtures of tumor cell populations which exhibit different susceptibility to targeted cancer therapies and are suspected to be the cause of treatment failure. Therefore, it is of great interest to obtain a better understanding of the evolutionary histories of individual tumors and their subclonal composition. In this thesis we present three methods for the inference of tumor subclonal composition and evolution by the use of bulk and/or single-cell DNA sequencing data.

    First, we present CTPsingle, a method which aims to infer tumor subclonal composition from single-sample bulk sequencing data. CTPsingle consists of two steps: (i) robust clustering of mutations using beta-binomial mixture modelling and (ii) inference of tumor phylogenies by the use of integer linear programming. On simulated data, we show that CTPsingle is able to infer the purity and the clonality of single-sample tumors with high accuracy even when restricted to a coverage depth as low as ∼30×. CTPsingle is currently used to infer clonality as part of the Evolution and Heterogeneity Working Group of the Pan Cancer Analysis of Whole Genomes project, where sequencing data of over 2700 tumors are analyzed.

    Next, we present B-SCITE, the first available computational approach that infers tumor phylogenies from combined single-cell and bulk sequencing data. B-SCITE is a probabilistic method which searches for the tumor phylogenetic tree that maximizes the joint likelihood of the two data types. Tree search in B-SCITE is performed by a customized MCMC search over the space of labeled rooted trees. Using a comprehensive set of simulated data, we show that B-SCITE systematically outperforms existing methods with respect to tree reconstruction accuracy and subclone identification. On real tumor data, mutation histories generated by B-SCITE show high concordance with expert-generated trees.

    In the third part, we introduce PhISCS, the first method which integrates single-cell and bulk sequencing data while accounting for the possible existence of mutations affected by undetected copy number aberrations, as well as mutations for which the commonly used and recently debated Infinite Sites Assumption is violated. PhISCS is a combinatorial method and, in contrast to the available alternatives, which are mostly based on probabilistic search schemes, it can provide a guarantee of optimality of the reported solutions. We provide two different implementations of PhISCS: (i) an implementation based on integer linear programming and (ii) an implementation based on constraint satisfaction programming. We show that the latter has lower running time on most of the instances that we used to assess the performance of the two implementations. These results suggest that in some applications constraint satisfaction programming might be a viable alternative to the commonly used integer linear programming. We also demonstrate the utility of PhISCS in analyzing real sequencing data, where it reports more plausible and parsimonious tumor phylogenies than the available alternatives.

    Keywords: Intra-tumor heterogeneity; Tumor evolution; Single-cell DNA sequencing; Bulk DNA sequencing; Infinite sites assumption; Markov chain Monte Carlo; Joint probabilistic model; Integer linear programming; Constraint satisfaction programming


  • Acknowledgements

    First and foremost, I would like to thank my supervisor Dr. S. Cenk Sahinalp for his extensive guidance, support and patience during my studies. I especially thank him for the endless effort he put into training me in the scientific field. I am also very thankful to the other supervisors: Dr. Leonid Chindelevitch, Dr. Cedric Chauve and Dr. Colin Collins for following my work in the past years and providing many suggestions that helped improve it. In addition, I thank Dr. Maxwell Libbrecht for his insightful questions and helpful suggestions during the depth exam and thesis defence, where he served as an Examiner. This thesis was considerably improved by the input from Dr. Russell Schwartz, to whom I am very grateful for serving as an External Examiner, for his very detailed proofreading of the thesis and for providing numerous suggestions. Also, I thank Dr. Jian Pei for chairing the defence.

    This work would have been impossible without the many collaborators with whom I worked during my master’s and doctoral studies. I thank Dr. Nilgun Donmez and Dr. Andrew McPherson for introducing me to the studies of tumor heterogeneity, providing extensive guidance in the first years of my research and their vast contribution to the development of CITUP, which was the basis of my master’s thesis, and CTPsingle, which is the first method presented in this thesis. B-SCITE, the second method presented in the thesis, is a result of a collaboration with the Computational Biology Group at ETH Zurich led by Dr. Niko Beerenwinkel. I spent five months in Switzerland working together with Dr. Beerenwinkel and two of the members of his group, Dr. Katharina Jahn and Dr. Jack Kuipers. I thank them all for their hospitality and collaboration on B-SCITE. Work on PhISCS, which is the third presented method, is a joint effort of the labs led by Dr. Sahinalp and Dr. Iman Hajirasouliha from Cornell University. In addition to Dr. Sahinalp and Dr. Hajirasouliha, here I would like to thank Simone Cicolella, Ehsan Haghshenas, Md. Khaledur Rahman, Camir Ricketts, Daniel Seidman and Dr. Faraz Hach for their contributions to this project. My special thanks go to Farid Rashidi Mehrabadi, who put endless effort into PhISCS, contributing to the data analysis, code preparation and methods design.

    Some of the research that I was involved in is not included in the thesis. However, it helped me gain valuable experience in method development and collaborative research, and gave me an opportunity to attend several scientific conferences. For these collaborations, I would first like to thank Dr. Ibrahim Numanagic and Michael Ford, with whom I worked on the development of methods for genotyping highly polymorphic genes. I also thank Nikolai Karpov and Md. Khaledur Rahman for joint work on the development of a similarity measure for comparing trees of tumor evolution. With Dr. Sahand Khakabimamaghani I worked on the development of a method for collaborative intra-tumor heterogeneity detection. I thank Dr. Khakabimamaghani for leading this work and for many insightful discussions about tumor heterogeneity and potential new ways of solving several important problems in the field. I also thank all members of the Evolution and Heterogeneity Working Group of the Pan Cancer Analysis of Whole Genomes (PCAWG) project, of which I have been a member since 2014.

    I would also like to acknowledge insightful feedback received from other colleagues from Dr. Sahinalp’s lab and the Laboratory for Advanced Genome Analysis at Vancouver Prostate Center: Dr. Yen Yi Lin, Ermin Hodzic, Can Kockan, Dr. Alex Gawronski, Iman Sarrafi, Hossein Asghari and Dr. Raunak Shrestha.

    I am indebted to many teachers and professors who helped me develop passion and enthusiasm for Mathematics and other scientific fields. Here, I especially thank Nermin Suljic, Ali Lafcioglu, Dr. Hasan Jamak and Dr. Dino Oglic. Furthermore, I thank all of the people from Bosna Sema Educational Institutions for providing an excellent environment and support during my high school and undergraduate studies, as well as the Canadian granting agencies, in particular NSERC, for supporting my research.

    I devote special thanks to two of my colleagues and friends, Dr. Ibrahim Numanagic and Ermin Hodzic. Our friendship dates back to the time of our undergraduate studies at the University of Sarajevo in Bosnia and Herzegovina. Without Ibrahim coming to SFU in 2011, it is very unlikely that I would have ever ended up studying here. He introduced me to Dr. Sahinalp and provided extensive help with everything. I have lived with Ermin throughout my PhD studies, and he has been a great roommate, colleague and friend.

    I am very grateful to my dear aunt Faiza and uncle Dzevad together with their family for providing moral support and for all the help that they provided during my internship in Switzerland (where they are currently living). I also thank people from Contextual Genomics, a company where I have been working over the past eight months, for providing an excellent work environment and for the great understanding that they showed while I was preparing the thesis.

    Last, but not least, I would like to express deep gratitude to my parents, Sadeta and Faiz, and sister Faiza, for their unconditional love and support. I devote this thesis to them and to my sweet little niece Amina.


  • Table of Contents

    Approval
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures

    1 Introduction
        1.1 Genetic basis of cancer and evidence for the existence of genetic intra-tumor heterogeneity
        1.2 Cancer onset and evolution of cancerous cells
            1.2.1 Clonal theory and branching model of tumor evolution
            1.2.2 Other theories of tumor evolution
        1.3 Clinical relevance of intra-tumor heterogeneity
        1.4 Motivation, Contributions and Thesis Organization

    2 Background
        2.1 Next Generation Sequencing
            2.1.1 Preparing the input of NGS experiment
            2.1.2 Output of NGS experiment
            2.1.3 The uses of NGS data in studies of intra-tumor heterogeneity and tumor evolution
        2.2 Inference of tumor subclonal composition and evolution from bulk sequencing data
            2.2.1 Variant and reference read counts as a proxy for the fraction of cells harboring mutation
            2.2.2 Clustering of mutations based on the read counts
            2.2.3 Basic principles in inferring trees of tumor evolution
            2.2.4 Theoretical limitations
            2.2.5 Potential benefits of the use of multiple samples
            2.2.6 Methods for the inference of clonal trees based on the use of SNVs
            2.2.7 Methods based on the use of CNAs and other types of mutations
        2.3 Inference of tumor subclonal composition and evolution from single-cell sequencing data
            2.3.1 The main characteristics of single-cell sequencing data
            2.3.2 Strengths and weaknesses of single-cell sequencing data in reconstructing trees of tumor evolution
            2.3.3 The existing methods for studying ITH and evolution by the use of SNVs from single-cell sequencing data
            2.3.4 Analysis of CNAs from single-cell sequencing data
        2.4 Inference of tumor evolution and subclonal composition by integrative use of single-cell and bulk sequencing data

    3 Clonality inference from single tumor samples using low coverage sequencing data
        3.1 Introduction
            3.1.1 Related work
        3.2 Methods
            3.2.1 Input processing
            3.2.2 Robust clustering using beta-binomial mixture modelling
            3.2.3 Estimation of tumor purity
            3.2.4 Inference of tree of tumor evolution
        3.3 Results
            3.3.1 Simulations
            3.3.2 Applications in real data analysis
        3.4 Discussion

    4 Integrative inference of subclonal tumor evolution from single-cell and bulk sequencing data
        4.1 Introduction
            4.1.1 Background
            4.1.2 Contributions
        4.2 Methods
            4.2.1 Tree models of tumor evolution
            4.2.2 Input data
            4.2.3 Tree scoring based on bulk sequencing data
            4.2.4 Tree scoring based on single-cell data
            4.2.5 Combined B-SCITE approach
            4.2.6 Compression of mutation trees into clonal trees
        4.3 Results
            4.3.1 Performance assessment on simulated data
            4.3.2 Application to real data
        4.4 Discussion

    5 A combinatorial approach for sub-perfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data
        5.1 Introduction
        5.2 Methods
            5.2.1 Input data
            5.2.2 PhISCS-I for tumor phylogeny inference via single-cell sequencing (SCS) data with no mutation elimination allowed
            5.2.3 Allowing mutations elimination in PhISCS-I
            5.2.4 Additional ILP constraints to integrate VAFs derived from bulk sequencing data into PhISCS-I
            5.2.5 PhISCS-B for tumor phylogeny inference via SCS data
            5.2.6 Additional Boolean constraints to integrate VAFs derived from bulk sequencing data into PhISCS-B
        5.3 Results on simulated data
            5.3.1 Comparative running time analysis of PhISCS-I and PhISCS-B
            5.3.2 Measuring accuracy in tree inference
            5.3.3 Comparing the accuracy of PhISCS and alternative methods
        5.4 Results on real sequencing data
        5.5 Discussion

    6 Conclusion
        6.1 Future Directions

    Bibliography

    Appendix A Supplementary Material for CTPsingle: Clonality inference from single tumor sample using low coverage sequencing data
        A.1 Simulation set up
        A.2 Calculation of evaluation measures and run-time settings for AncesTree, LICHeE and PyClone

    Appendix B Supplementary Material for B-SCITE: Integrative inference of subclonal tumor evolution from single-cell and bulk sequencing data
        B.1 Details of generating simulated data
        B.2 Phylogenetic accuracy measures
        B.3 Derivation of the Binomial distribution approximation formula
        B.4 Details of running ddClone, OncoNEM, SCITE, PhyloWGS and B-SCITE
        B.5 Details of input data pre-processing for ALL, TNBC and CRC patients
        B.6 Supplementary figures

    Appendix C Supplementary Material for PhISCS: A combinatorial approach for sub-perfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data
        C.1 Generalizing the triple-VAF constraints to arbitrary number of mutations
        C.2 Simulation models used for benchmarking tumor phylogeny inference methods
            C.2.1 Generating simulated data without ISA violations
            C.2.2 Simulation of mutations violating ISA
            C.2.3 Simulations involving mutations from regions affected by Copy Number Gains
        C.3 TPTED measure for comparing tumor phylogenies
        C.4 Benchmarking SCITE, SiFit, B-SCITE and PhISCS
        C.5 Details of obtaining and pre-processing real data
        C.6 Source codes of Max-SAT solvers used for the implementation of CSP formulation of PhISCS

  • List of Tables

    Table 5.1 Comparison of running times of ILP and CSP implementations of PhISCS

  • List of Figures

    Figure 1.1 Branching clonal model and clonal tree of tumor evolution
    Figure 1.2 Alternative illustration of the branching clonal model and a clonal tree of tumor evolution in case where no losses of mutations are allowed
    Figure 1.3 Linear and neutral models of tumor evolution
    Figure 2.1 An example of copy number event affecting region harboring SNV
    Figure 2.2 Clonal tree of tumor evolution and plot of distribution of cellular prevalences (estimated based on the read counts with sequencing depth of 200×)
    Figure 2.3 Clonal tree of tumor evolution and plot of distribution of cellular prevalences (estimated based on the read counts with sequencing depth of 50×)
    Figure 2.4 Desirable clustering of mutations from hypothetical examples with 200× and 50× coverage datasets
    Figure 2.5 Limitation of bulk sequencing data in separating mutations of the same prevalence
    Figure 2.6 Multiple clonal trees consistent with the mutation frequencies observed in bulk data
    Figure 2.7 An overview of single-cell sequencing experiment and output data
    Figure 2.8 Strengths and weaknesses of single-cell sequencing data in inferring pairwise order of mutations in tree of tumor evolution
    Figure 2.9 Bulk data can improve phylogenetic inference by reducing the effects of noise in single-cell sequencing data
    Figure 3.1 Comparison of purity inference accuracy of CTPsingle, PyClone, LICHeE and AncesTree
    Figure 3.2 Comparison of CTPsingle, PyClone, LICHeE and AncesTree based on the absolute difference between the true and predicted number of subclones
    Figure 3.3 Comparison of CTPsingle, PyClone, LICHeE and AncesTree based on the quadratic mean of difference of true and predicted lineage frequencies

    Figure 3.4 Effect of false positive SNVs and copy number aberrations on the performance of CTPsingle
    Figure 3.5 Performance of CTPsingle on simulated datasets containing increased number of subclones
    Figure 4.1 Comparison of the inference of tumor evolution based on single-cell and bulk sequencing data
    Figure 4.2 Schematic overview of B-SCITE
    Figure 4.3 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
    Figure 4.4 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
    Figure 4.5 Comparison of ancestor-descendant accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
    Figure 4.6 Comparison of different-lineages accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
    Figure 4.7 The effect of CNAs on the co-clustering accuracy measure of phylogenetic inference of B-SCITE with bulk data coverage of 10,000×
    Figure 4.8 Comparison of co-clustering accuracy measure of phylogenetic inference of B-SCITE and PhyloWGS on simulated data with 1, 2 and 4 bulk samples, varying bulk coverage and 25 sampled single cells
    Figure 4.9 Comparison of co-clustering accuracy measure of phylogenetic inference of B-SCITE and PhyloWGS on simulated data with 1, 2 and 4 bulk samples, varying bulk coverage and 50 sampled single cells

    Figure 4.10 Comparison of co-clustering accuracy measure of phylogenetic inference of B-SCITE and PhyloWGS on simulated data with 1, 2 and 4 bulk samples, varying bulk coverage and 100 sampled single cells
    Figure 4.11 Mutation histories inferred by CTPsingle, SCITE and B-SCITE for Patient 1 from childhood leukemia study (Gawad et al. 2014)
    Figure 4.12 Mutation histories inferred by CTPsingle, SCITE and B-SCITE for Patient 2 from childhood leukemia study (Gawad et al. 2014)
    Figure 4.13 Mutation histories inferred by the original study, SCITE and B-SCITE for triple-negative breast cancer patient (Wang et al. 2014)
    Figure 4.14 Mutation histories inferred by B-SCITE for two colorectal patients with liver metastasis (Leung et al. 2017)
    Figure 5.1 Comparisons of PhISCS and SiFit based on the normalized Robinson-Foulds distance
    Figure 5.2 Comparison of PhISCS with SCITE based on the normalized MLTSM similarity measure.
    Figure 5.3 Comparison of PhISCS with SCITE based on TPTED dissimilarity measure.
    Figure 5.4 Comparison of PhISCS with SCITE on larger number of subclones and larger number of mutations.
    Figure 5.5 Comparison of PhISCS and B-SCITE according to both MLTSM and its dual MLTD measures.
    Figure 5.6 Mutation histories inferred by PhISCS for patient with primary colorectal cancer and liver metastasis (patient CRC2 from Leung et al. 2017)
    Figure 5.7 Mutation histories inferred by SCITE, B-SCITE and PhISCS for Patient 2 from childhood leukemia study (Gawad et al. 2014)
    Figure A.1 The distribution of lineage frequencies and fraction of mutations per subclone across all simulation datasets generated in CTPsingle
    Figure A.2 Comparison of CTPsingle and CITUP on the simulated data
    Figure B.1 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 6 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.

    Figure B.2 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
    Figure B.3 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 20 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
    Figure B.4 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 40 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
    Figure B.5 Comparison of adjusted Rand index accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
    Figure B.6 Comparison of adjusted Rand index accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 20 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
    Figure B.7 Comparison of adjusted Rand index accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE for simulated clonal trees with 40 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.
    Figure B.8 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE as a function of the false negative rate. False positive rate was set to 0.00001.
    Figure B.9 Comparison of v-measure accuracy of mutation clustering by ddClone, OncoNEM and B-SCITE as a function of the false negative rate, but with highly elevated false positive rate of 0.01.
    Figure B.10 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 6 nodes and 50 mutations. Three different rates of doublet noise were added to single-cell data which consists of 25 genotypes drawn under various values of sampling distortion parameter.
    Figure B.11 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 10 nodes and 50 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.

  • Figure B.12 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 20 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.

    Figure B.13 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE for simulated clonal trees with 40 nodes and 100 mutations. 25, 50 and 100 genotypes were drawn under various values of sampling distortion parameter.

    Figure B.14 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE as a function of the false negative rate. False positive rate was set to 0.00001.

    Figure B.15 Comparison of co-clustering accuracy measure of phylogenetic inference of OncoNEM, SCITE and B-SCITE as a function of the false negative rate, but with highly elevated false positive rate of 0.01.

    Figure B.16 The effect of CNAs on the co-clustering accuracy measure of phylogenetic inference of B-SCITE with increased bulk data coverage of 1,000,000×

    Figure B.17 The effect of CNAs on the ancestor-descendant accuracy measure of phylogenetic inference of B-SCITE with bulk data coverage of 10,000×

    Figure B.18 The effect of CNAs on the ancestor-descendant accuracy measure of phylogenetic inference of B-SCITE with increased bulk data coverage of 1,000,000×

    Figure B.19 The effect of CNAs on the different-lineage accuracy measure of phylogenetic inference of B-SCITE with bulk data coverage of 10,000×

    Figure B.20 The effect of CNAs on the different-lineage accuracy measure of phylogenetic inference of B-SCITE with increased bulk data coverage of 1,000,000×

    Figure B.21 Multiple clonal trees compatible with clustering of mutations inferred by CTPsingle for ALL patient

    Figure B.22 Clonal trees for ALL and TNBC patients derived from B-SCITE mutation trees by node compression


  • Chapter 1

    Introduction

    1.1 Genetic basis of cancer and evidence for the existence of genetic intra-tumor heterogeneity

    Cancer is a common name for a group of over 200 diseases characterized by uncontrolled cell division. In the case of leukemia (cancer of blood or bone marrow), cancer manifests itself by overproduction of abnormal white blood cells, whereas in the other cancer types uncontrolled cell divisions result in the formation of abnormal masses of cells, also known as malignant tumors. In addition, cancerous cells typically have the potential to leave the primary site of cancer origin and invade distant tissues, forming metastases [91, 127].

    Cancer is nowadays widely recognized as a disease of the genome [91, 105, 13]. It is the most common genetic disease and is estimated to have been responsible for nearly 10 million deaths worldwide in 2018 alone [10]. Genetic mutations are one of the key causes of cancer onset, growth, spread and treatment resistance. At the time of clinical diagnosis, genomes of cancerous cells typically harbor a large number of mutations detectable from data generated by currently available DNA sequencing technologies. According to some estimates, most tumors harbor between 1,000 and 20,000 single nucleotide variants (SNVs), and from a few to hundreds of copy number aberrations (CNAs) and other structural rearrangements [100].

    While the role and importance of genetic mutations in cancer onset and progression have been studied (at a limited resolution) for a long time, completion of the first draft of the human reference genome in 2001 [22] and technological advancements in DNA sequencing, in particular the introduction of next-generation sequencing (NGS) technologies in 2004 [109], enabled researchers to study genomic profiles of individual tumors at unprecedented scale and resolution. These developments enabled sequencing large parts or even whole genomes of individual tumors, as well as sequencing large cohorts of tumor samples [173]. They were followed by the development of computational methods for detection of various types of somatic mutations such as single-nucleotide variants (SNVs), small insertions and deletions


  • (indels), large-scale insertions, inversions, translocations, copy number aberrations (CNAs) and others [178, 78].

    Sequencing data generated in the past years has revealed a striking degree of genetic intra-tumor diversity in cancer. In [50], mutation profiling of four patients with metastatic renal-cell carcinoma was performed. It revealed the existence of mutations present in some, but not all, tumor sites, implying the existence of spatial genetic intra-tumor heterogeneity. This diversity was observed not only between physically separated primary and metastatic tumor sites, but also among distinct regions of the primary tumor that were sequenced independently. In another study, the authors tracked tumor progression in three chronic lymphocytic leukemia patients [147]. For each patient, five blood samples were obtained at different timepoints of disease progression and a subset of mutations was selected for targeted deep amplicon sequencing. The average depth of sequencing coverage achieved by amplicon sequencing was 100,000×, yielding highly reliable variant allele frequencies for the selected sets of mutations. Large differences in the values of the obtained variant allele frequencies for many pairs of mutations are a clear indicator of the existence of genetically distinct cells and temporal genetic intra-tumor heterogeneity. Similar findings were reported in [99, 83, 13, 116, 64, 179, 49, 172, 88] and many other studies.
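The link between sequencing depth and the reliability of a variant allele frequency (VAF) can be illustrated with a minimal Python sketch. It assumes a simple binomial read-sampling model, which is not stated in the text and ignores amplification bias and sequencing errors. The function names and read counts are hypothetical.

```python
import math


def variant_allele_frequency(variant_reads: int, total_reads: int) -> float:
    """Fraction of reads covering a locus that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    return variant_reads / total_reads


def vaf_standard_error(vaf: float, total_reads: int) -> float:
    """Binomial standard error of a VAF estimate: sqrt(p * (1 - p) / n).

    Assumption: reads are independent draws, which is a simplification.
    """
    return math.sqrt(vaf * (1.0 - vaf) / total_reads)


# Hypothetical read counts for the same true VAF at two sequencing depths.
for variant_reads, depth in ((20, 100), (20_000, 100_000)):
    vaf = variant_allele_frequency(variant_reads, depth)
    print(depth, vaf, vaf_standard_error(vaf, depth))
```

Under this simplified model the uncertainty shrinks with the square root of the depth, so two mutations whose VAFs differ by a few percent can be told apart far more confidently at very deep coverage than at 100×.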

    In addition to genetic intra-tumor heterogeneity, there are several other types of heterogeneity in tumors of a single patient (e.g., epigenetic intra-tumor heterogeneity). However, since our main focus is on genetic intra-tumor heterogeneity, we adopt the convention that, in the rest of the thesis, the term intra-tumor heterogeneity, abbreviated as ITH, refers to this type of heterogeneity, unless stated otherwise.

    1.2 Cancer onset and evolution of cancerous cells

    In this section we will discuss several existing theories of cancer onset and evolution. Our main goal is to attempt to answer one of the fundamental questions about ITH: what are the mechanisms by which ITH emerges during tumor growth, and does it play any role in tumor progression?

    Most of the available tumor sequencing data support the hypothesis of a single-cell origin of cancer. According to this hypothesis, cancer originates from a single cell, also known as the cancer founding cell, which acquires a set of mutations giving it some proliferative advantage over the neighboring healthy cells. The evidence supporting this can be found in studies where multiple regions of the same tumor were sequenced and it was observed that all regions share a common set of mutations [50, 166]. Additional evidence is provided by studies where mutational profiling of individual tumor cells was performed and sets of mutations present in all cancerous cells were identified [179, 49, 172]. Some studies also suggest that a small fraction of tumors might have multiple cells of origin [183, 152]. Such tumors are known in the literature as


  • multicentric [70, 24] (the terms multifocal and polycentric are also used as synonyms, although in some publications the term multifocal has a different meaning [136]).

    Mutagenesis (the production of genetic mutations) in cancer is a dynamic process. Existing mutations typically cause defects in mechanisms which ensure the accuracy of DNA replication during the process of cell division. As a consequence, cancerous cells are usually characterized by an elevated mutation rate in comparison to the other cells. During the process of cell division, as well as due to exogenous factors (e.g., tobacco smoke), cancerous cells acquire new mutations distinguishing them from the adjacent cells. There exist several theories about tumor evolution, which have different implications for the impact of the newly acquired mutations and the role of intra-tumor heterogeneity in tumor progression. Below we summarize the most important of these theories.

    1.2.1 Clonal theory and branching model of tumor evolution

    In 1976, Peter Nowell proposed the clonal theory of cancer evolution, which posits that cancer is an evolutionary process driven by the acquisition of somatic mutations [121]. During tumor growth, descendants of the cancer founding cell acquire new mutations that are later passed on to their descendants, and the process continues over time. Consequently, at some timepoint, one of the descendants of the cancer founding cell can accumulate a critical set of mutations giving it some selective advantage in comparison to the other cells. The emergence of such a cell is then followed by the expansion of the population of its descendants, which leads to the formation of a genetically highly similar set of cells, better known as a subclone¹. This process then continues over time and, at the time of clinical diagnosis, tumors usually consist of multiple subclones characterized by distinct sets of somatic mutations. The model of tumor evolution that follows clonal theory and allows co-existence of multiple subclones over time is known as branching clonal evolution and is shown in Figure 1.1.

    Tree of tumor evolution

    The process of tumor evolution can be depicted by a clonal tree (of tumor evolution), here also referred to as a tumor phylogeny, shown in Figure 1.1. In a clonal tree, individual nodes represent subclones, with the root node representing either the population of healthy cells or the first population of cancerous cells². Mutations are placed at the subclone (node) of their first occurrence. The first population of cancerous cells is also known as the cancer founding clone and mutations that it harbors (that are shared among all cancerous cells) as clonal

    ¹ In this thesis we will use the definition of a subclone as a set of genetically highly similar cells (a similar definition can be found in [148]). Consequently, and for the sake of simplicity, the population of healthy cells will be treated as one of the subclones.

    ² Note that in the case of multicentric tumors it is necessary that the root node represents the population of healthy cells.


  • or trunk mutations. In addition to a mutational label, a frequency label is also commonly assigned to each node. Frequency labels usually represent the prevalence of the corresponding subclone in the tumor sample or the average variant allele frequency of mutations present in the node’s mutational label. In cases where multiple tumor samples are sequenced, frequency labels can be represented as vectors of real numbers.

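    [Figure 1.1. Left panel: tumor growth (labels: time, tumor size). Right panel: clonal tree of tumor evolution with subclone prevalences 0%, 20%, 20%, 35% and 25%. Legend: healthy cell; first cancer cell; clonal (trunk) mutations; set of mutations; subclone.]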

    Figure 1.1: Tumor growth according to the branching clonal model of tumor evolution (left) and clonal tree of tumor evolution (right). On the left, healthy cells are shown at the top as purple circles. The first cancerous cell and the set of mutations that it harbors are respectively depicted as a blue circle and a red star. In the branching clonal model of tumor evolution, multiple subclones, depicted on the left as triangles of the same color, emerge over time and co-exist in the tumor (with the possibility of being outcompeted and eliminated). The emergence of subclonal populations is driven by the acquisition of somatic mutations. Sets of mutations distinguishing a subclone from its most recent ancestor (parent) are depicted as stars of different colors. A clonal tree of tumor evolution, shown in the right figure, is a convenient way of depicting tumor clonal evolution. Individual nodes of a clonal tree represent subclones and mutations are placed at the node (subclone) of their first occurrence or, equivalently, on the edge connecting the node with its parent. Frequency labels are also commonly assigned to the nodes of the tree. Here, each frequency represents the prevalence of the corresponding subclone in this hypothetical tumor at the time of obtaining tumor biopsy tissue (the latest timepoint in the left part of the figure). Note that some of the subclones that existed in the course of tumor evolution might have been outcompeted before the time of obtaining the biopsy. Such subclones are assigned zero prevalence and, although they are absent from the sequenced sample, their existence in the tumor evolutionary history can, in some cases, be inferred from the sequencing data.
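The clonal tree described above, with mutations placed at the node of their first occurrence and a prevalence attached to each node, can be sketched as a small Python data structure. This is an illustrative sketch only: the class and function names, the mutation names and the tree topology are hypothetical (only the prevalences are those shown in Figure 1.1), the healthy-cell population is omitted, and it assumes that no mutation is lost in descendants.

```python
from dataclasses import dataclass, field


@dataclass
class Subclone:
    """A node of a clonal tree."""
    name: str
    new_mutations: frozenset   # mutations first occurring in this subclone
    prevalence: float          # fraction of sampled cells in this subclone
    children: list = field(default_factory=list)


def mutation_prevalence(root):
    """Map each mutation to the fraction of sampled cells that harbor it.

    Assuming no mutation is ever lost, a mutation is present in the subclone
    of its first occurrence and in all of that subclone's descendants.
    """
    result = {}

    def subtree_total(node):
        total = node.prevalence + sum(subtree_total(c) for c in node.children)
        for mutation in node.new_mutations:
            result[mutation] = total
        return total

    subtree_total(root)
    return result


# Hypothetical topology; only the prevalences are taken from Figure 1.1.
b = Subclone("B", frozenset({"m4"}), 0.35)
c = Subclone("C", frozenset({"m5"}), 0.25)
a = Subclone("A", frozenset({"m3"}), 0.0, [b, c])   # outcompeted subclone
d = Subclone("D", frozenset({"m6"}), 0.20)
founder = Subclone("founder", frozenset({"m1", "m2"}), 0.20, [a, d])

for mutation, fraction in sorted(mutation_prevalence(founder).items()):
    print(mutation, round(fraction, 2))
```

In this toy tree the outcompeted subclone A has zero prevalence, yet its mutation is still carried by all cells of its descendants B and C. This is one way in which an extinct subclone can leave a detectable trace in the sequencing data.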


  • A mutation tree can be defined as a clonal tree of the highest granularity, where only a single mutation is placed at each node. There are other tree representations of clonal evolution discussed in [67] (e.g., it can be depicted by a binary genealogical tree [67]), but here we will restrict ourselves to the representation by clonal/mutation trees. Later, in Section 4.2.1, we also provide a formal mathematical definition of clonal and mutation trees.

    [Figure 1.2. Left panel label: time. Clonal tree node prevalences: 20%, 20%, 35%, 25% and 0%.]

    Figure 1.2: Alternative illustration of the branching clonal model (left) and a clonal tree of tumor evolution (right) under the assumption that mutations present in a subclone are not lost in any of its descendants. As in Figure 1.1, circles in the left part represent cells and different sets of mutations are depicted as stars of different colors. In a clonal tree, each edge is labeled with the set of mutations distinguishing the child subclone from its parent, whereas the mutational label of each node consists of the set of all mutations harbored by the corresponding subclone. Note that, under the assumption of no losses of mutations, the tree on the right is equivalent to the clonal tree from Figure 1.1.

    Under the assumption that none of the mutations present in some subclone are lost in any of its descendants, we can also depict the process of tumor growth and tree of tumor evolution as shown in Figure 1.2.
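The equivalence between the two representations can be made concrete with a short Python sketch: under the no-loss assumption, the full mutational label of a node is the union of the edge labels on the path from the root to that node. The function name, node names and mutation names below are hypothetical.

```python
def full_mutation_labels(parent_of, edge_mutations):
    """Compute the set of all mutations harbored by each subclone.

    parent_of maps every node to its parent (the root maps to None).
    edge_mutations maps a node to the mutations on the edge from its parent.
    Assumes that mutations are never lost in descendants.
    """
    labels = {}

    def label(node):
        if node not in labels:
            parent = parent_of[node]
            inherited = label(parent) if parent is not None else frozenset()
            labels[node] = inherited | frozenset(edge_mutations.get(node, ()))
        return labels[node]

    for node in parent_of:
        label(node)
    return labels


# Hypothetical tree rooted at the population of healthy cells.
parent_of = {"healthy": None, "founder": "healthy", "A": "founder",
             "B": "A", "C": "A", "D": "founder"}
edge_mutations = {"founder": {"m1", "m2"}, "A": {"m3"},
                  "B": {"m4"}, "C": {"m5"}, "D": {"m6"}}

labels = full_mutation_labels(parent_of, edge_mutations)
print(sorted(labels["B"]))
```

Going in the opposite direction, the edge label of a node is recovered as the set difference between its full label and the full label of its parent.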

    1.2.2 Other theories of tumor evolution

    Although the clonal theory of tumor evolution is well established, with many real datasets supporting this model³, alternative models of tumor evolution can also be found in the literature.

    According to the multistep tumorigenesis model proposed by Fearon and Vogelstein in 1990, tumor progression follows a linear evolution [44]. In this model, analogous to the clonal theory of cancer evolution, acquisition of a set of somatic mutations can provide a selective

    ³ In this section we use theory and model interchangeably, assuming the same meaning of the two terms.


  • advantage to the host cell and lead to the emergence of a new subclone. However, the model proposes that the acquired mutations provide such a strong selective advantage to the newly formed subclone that it soon outcompetes the existing one(s). Consequently, most of the time, tumors are expected to be largely homogeneous, with the bulk of the tumor mass consisting of a single dominant subclone (see Figure 1.3). Next-generation sequencing data increasingly disputes this simple model of tumor evolution. There is now overwhelming evidence that tumor evolution is a more complex and, in many cases, branching process where multiple subclones co-exist in the same tumor [50, 49, 172].

    [Figure 1.3. Panel titles: Linear evolution; Neutral evolution.]

    Figure 1.3: Linear and neutral models of tumor evolution. Coloring is analogous to that in Figure 1.1. In the linear model of tumor evolution, the set of mutations that drives the emergence of a new subclone provides it with such a selective advantage that it soon outcompetes the other subclone(s). The neutral model of tumor evolution posits that intra-tumor heterogeneity is a byproduct of tumor growth, but mutations acquired by a cell do not confer a significant selective advantage on it. Consequently, according to this model, a large number of genetically distinct cells and a wide spectrum of mutational variant allele frequencies are expected to be observed in the tumor biopsy sample.

    The neutral model of tumor evolution is another model that has gained attention in the past years [175]. It posits that most of the driver mutations are acquired at the early stages of tumor growth and mutations occurring later typically do not provide a selective advantage to the host cells. In contrast to the linear theory, characterized by selective sweeps and a largely homogeneous tumor, a tumor following the neutral model of evolution is expected to have a large number of genetically distinct populations of cells and a wide spectrum of mutational variant allele frequencies (see Figure 1.3). The methodology used in [175] to demonstrate widespread neutral evolution among tumors was recently disputed in [103]


  • and [161]. However, it is likely that some of the tumors evolve according to the neutral model and further research and evidence supporting this model will be required in the future. A similar argument applies to the punctuated tumor evolution model, also known as the “big bang” model [156], which posits that most of the ITH occurs at the early stages of tumor growth and is followed by stable expansion of one or several subclonal populations [160, 30, 156].

    Cancer stem cell theory, which is based on the hypothesis that tumor growth is driven by a rare subpopulation of cells, dubbed cancer stem cells, is beyond the scope of this thesis and we refer readers to [20] for more details about this theory.

    Here, we will use the clonal branching theory as a gold standard for simulating tumor evolution, although two of the three methods presented in the following chapters do not require a tumor to strictly follow this model of evolution, nor do they require it to be of single-cell origin.

    1.3 Clinical relevance of intra-tumor heterogeneity

    Before discussing the clinical relevance of ITH, we quote the very first sentence from one of the latest reviews on the topic: "Intratumor heterogeneity, which fosters tumor evolution, is a key challenge in cancer medicine." [104].

    Numerous studies in the past years suggest that ITH has several potential clinical implications. For instance, in [93] and [108] a correlation between subclonal diversity and progression to esophageal adenocarcinoma in Barrett’s esophagus was reported. In chronic lymphocytic leukemia, the presence of a subclonal driver was found to be an independent risk factor for rapid disease progression [85]. The extent of ITH has also been linked to tumor metastatic potential and disease-free survival. Findings from [180] suggest that patients developing metastasis in triple-negative breast cancer had a significantly higher measure of ITH in the primary tumor, whereas a study of colorectal cancer [71] reported that a high degree of ITH in the primary tumor was correlated with an increased rate of liver metastasis and shorter disease-free survival.

    The presence of extensive ITH and the ability of a tumor to acquire new mutations are considered to be among the key causes of treatment failure. In most cases, even if a drug works at first, it will not work over the long term [72]. Radiation and chemotherapy can promote the emergence of new subclones resistant to treatment, but treatment resistance can also be driven by a minor or dormant subclone already existing in the tumor prior to treatment initiation [23, 42, 147, 111, 41, 117].

    While there is increasing evidence that ITH can be exploited in clinics as a prognostic indicator [126], research in the design of effective treatments that will cure cancer or, at least, prevent its uncontrolled growth and turn it into a chronic disease with low impact on


  • the quality of life [4], is still in its infancy. A tumor’s ability to adapt to treatment and pervasive intra- and inter-tumor heterogeneity, even among cancers of the same type, greatly complicate the design of clinical trials. We expect that technological advancements in sequencing, imaging, information sharing and many other fields will facilitate the design of larger clinical trials and inspire the discovery of novel therapeutic targets and treatment regimens. Adaptive therapy was proposed as a potential treatment strategy to prevent uncontrolled tumor growth [48]. Its main idea lies in continuously modulating treatment in order to achieve a fixed tumor population while avoiding complete elimination of subclones sensitive to treatment. Namely, elimination of these subclones is typically followed by uncontrolled growth of chemoresistant populations. On the other hand, allowing a fraction of chemosensitive subclones to survive can provide a means to suppress proliferation of the less fit but chemoresistant subclones through the competition for limited resources between subclones. Although adaptive therapy is an interesting approach for controlling tumor growth, its successful implementation in clinical practice will require a good understanding of the tumor subclonal composition, fitness of individual subclones and selective advantage of the chemosensitive over the chemoresistant subclones.

    In addition to genetic ITH, which is our main focus, there are also other types of ITH. For example, in [131] methylation profiling of localized lung adenocarcinomas revealed a correlation between the extent of DNA methylation ITH and tumor size. In the same study, it was also found that, on average, most of the somatic DNA mutations were shared among all of the sequenced tumor regions, suggesting that they occurred at the early stages of tumor progression, whereas only a quarter of the differentially methylated probes were shared among all regions. These findings indicate that tumor-specific DNA methylation might be associated with later branched evolution observed in the set of patients analyzed in this study [131]. It is also known that gene expression in individual tumor cells belonging to the same subclonal population can be influenced by their position in the tumor (e.g., center of a tumor vs. its boundary) [62, 157]. Integrating genetic ITH with other types of ITH and other important factors (e.g., interaction of tumor cells with the microenvironment) will be of great importance in future studies of tumor growth and progression.

    1.4 Motivation, Contributions and Thesis Organization

    We expect that, in the foreseeable future, analysis of tumor subclonal composition and evolution will become more common in clinical practice and aid clinicians in diagnostics, prognostics, as well as in making treatment decisions and designing the best therapies. Due to the extensive intra- and inter-tumor heterogeneity, such therapies will most likely be tailored according to the genetic makeup of individual tumors and consist of a combination of several drugs targeting different subclonal populations. In this context, the knowledge of the clonal tree of tumor evolution can also be highly valuable as it reveals divergent


  • subclonal populations (i.e., subclones evolving on different branches of the tree) that might need to be targeted separately, particularly in cases where drugs targeting clonal mutations fail to provide desired results in halting tumor growth.

    Future studies of shared patterns of tumor evolution among large cohorts of cancer patients will also benefit from methods for accurate inference of clonal trees for individual patients. These studies will further improve our understanding of cancer onset and progression. Shared patterns can provide novel insights about the most significant mutations that promote tumor growth (i.e., driver mutations) and treatment resistance, but also about the advantages and disadvantages that the simultaneous presence of sets of mutations confers on the host cells. Furthermore, we expect that patterns of tumor evolution will be used for predicting the next steps in tumor evolution. Successful prediction of tumor evolutionary behavior requires deterministic patterns of tumor evolution and is expected to be an important subject for future cancer research. One recent study, involving over 100 patients diagnosed with clear-cell renal cell carcinoma, found evidence for the deterministic nature of clonal evolution in this cancer type [166]. However, these findings need to be validated on larger cohorts and similar studies need to be conducted for other cancer types.

    Metastasis is estimated to be responsible for ∼90% of cancer-related deaths [107, 158]. Tracking metastatic seeding patterns is one of the most important problems in better understanding the biological background of metastasis. This task is very challenging as metastatic seeding is a complex process. In addition to metastatic seeding from the primary site, existing metastases can give rise to new ones. Re-seeding between two existing metastatic sites adds a further level of complexity to the whole problem [16]. However, all cancerous cells of a given patient are related through a common (shared) tree of tumor evolution, which can provide answers to many questions related to the metastatic process. In the past years, we have been witnessing increasing interest in the research of various aspects of metastasis [55, 106, 63, 139] and specialized methods for studying metastatic seeding patterns were developed recently [40, 135]. We recommend [165] for a thorough review of the subject.

    All of the above illustrates the importance of better understanding ITH and tumor evolution at the level of the individual patient. The inference of tumor subclonal composition and evolution can be performed by the use of various signals originating from detected mutations and will be discussed in more detail in Chapter 2. The pioneering work studying ITH and evolution typically focused on a small number of selected genomic alterations from several tumor samples and involved manually reconstructing the subclonal composition and/or phylogeny of these tumors [147, 50]. Developments in DNA sequencing technologies enabled large-scale cancer sequencing efforts where whole exomes/genomes of thousands of tumor samples were sequenced [173]. Manual analysis of each of the individual


  • tumors from such large data cohorts clearly became impractical and required the development of automated computational methods.

    The vast majority of all currently available tumor DNA sequencing data was obtained via bulk sequencing, where the DNA of millions of cells is pooled and sequenced together, giving only an average signal over a large number of cells. Studying tumor evolution and subclonal composition from such data requires the development of computational methods specialized for deconvolution of mixed signals while handling intrinsic properties of sequencing data. We devote a separate section in Chapter 2 to provide more details about the advantages and limitations of the use of bulk sequencing data in this context. In Chapter 3 we introduce CTPsingle, a method for the inference of tumor subclonal composition and evolution from bulk sequencing data. CTPsingle assumes that cancer originates from a single cell and follows the clonal theory of evolution. It is currently used as one of the methods to infer clonality in the Evolution and Heterogeneity Working Group of the Pan-Cancer Analysis of Whole Genomes project [173]. Like other similar tools, CTPsingle is also faced with some theoretical limitations due to the limited resolution of bulk sequencing data.

    Some of the limitations of bulk data can be resolved by the use of data obtained by recently introduced single-cell sequencing. Several methods for single-cell sequencing developed in the past years generate data of the ultimate resolution for studying ITH and evolution. However, obtaining a large number of single cells that are a good representative of the subclonal populations present in a tumor is still very challenging due to the cost of single-cell sequencing and non-uniform sampling of individual tumor cells. Furthermore, single-cell data is contaminated with various types of noise, which is a major obstacle for the analysis of tumor evolution by direct application of standard phylogenetic techniques. Therefore, successful use of this type of data in tumor phylogenetics requires the development of specialized computational methods. We devote a separate section in Chapter 2 to provide more detailed background on the main characteristics of this data type and to summarize developments in the design of related computational methods.
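To give a feel for the noise mentioned above, the following Python sketch corrupts a binary cell-by-mutation genotype matrix with false positive and false negative calls, the two error types named in the figure captions of Appendix B. It is a simplified illustration and not the simulation procedure used in this thesis: entries are flipped independently, missing entries and doublets are not modeled, and the function name, the genotype matrix and the false negative rate are hypothetical.

```python
import random


def add_genotyping_noise(genotypes, false_positive_rate, false_negative_rate, rng):
    """Return a noisy copy of a binary cell-by-mutation matrix.

    A true 1 is reported as 0 with probability false_negative_rate and
    a true 0 is reported as 1 with probability false_positive_rate.
    """
    noisy = []
    for row in genotypes:
        noisy_row = []
        for value in row:
            if value == 1:
                observed = 0 if rng.random() < false_negative_rate else 1
            else:
                observed = 1 if rng.random() < false_positive_rate else 0
            noisy_row.append(observed)
        noisy.append(noisy_row)
    return noisy


# Hypothetical true genotypes: rows are cells, columns are mutations.
true_genotypes = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]

rng = random.Random(0)  # fixed seed for reproducibility
observed = add_genotyping_noise(true_genotypes, 0.00001, 0.2, rng)
for row in observed:
    print(row)
```

Because false negatives are typically far more frequent than false positives in such settings, a mutation that is truly shared by all cells can appear to be absent from some of them, which is one reason why standard phylogenetic techniques cannot be applied directly.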

    Importantly, bulk data is not exploited in any of the previously developed methods for studying tumor evolution from single-cell data. As the strengths and weaknesses of the two data types are to a large extent complementary with respect to phylogeny inference, performing both bulk and single-cell sequencing simultaneously may be a competitive strategy for tumor phylogeny reconstruction. In Chapters 4 and 5 we introduce B-SCITE and PhISCS, the first two methods for tumor phylogeny inference that leverage complementary strengths of single-cell and bulk sequencing data in a joint inference framework. In addition to superior performance over the existing alternatives on a comprehensive set of simulated data, we also show that these tools generate more realistic mutation histories on several real datasets.

    For PhISCS, we provide implementations based on both integer linear programming (ILP) and constraint satisfaction programming (CSP) and show that, at least in the context of tumor phylogeny inference, CSP might be a time-efficient alternative to the ubiquitously


  • used ILP. In contrast to the existing methods for inferring trees of tumor evolution from single-cell data, which are based on probabilistic search schemes, PhISCS provides a guarantee of optimality for the reported solutions or a bound on the best achievable objective. PhISCS is also the first method that integrates single-cell and bulk sequencing data while accounting for possible violations of the commonly used and recently debated [81] infinite sites assumption.
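To illustrate what a violation of the infinite sites assumption looks like in noise-free single-cell data, the sketch below applies the classical three-gamete test to a binary cell-by-mutation matrix. Under the infinite sites assumption, no pair of mutations may show all three patterns (1,0), (0,1) and (1,1) across cells. This is a standard textbook check and not the ILP or CSP formulation of PhISCS; the function name and the example matrices are hypothetical.

```python
from itertools import combinations


def three_gamete_conflicts(matrix):
    """Return pairs of columns (mutations) that violate the three-gamete rule.

    Rows are cells and columns are mutations, with 1 meaning "present".
    A pair of columns conflicts if some cell has only the first mutation,
    some cell has only the second, and some cell has both.
    """
    n_cols = len(matrix[0]) if matrix else 0
    conflicts = []
    for i, j in combinations(range(n_cols), 2):
        patterns = {(row[i], row[j]) for row in matrix}
        if {(1, 0), (0, 1), (1, 1)} <= patterns:
            conflicts.append((i, j))
    return conflicts


# Consistent with a mutation tree: mutation 0 is ancestral to 1 and 2.
consistent = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]

# Mutations 0 and 1 appear alone and together, so no tree without
# mutation loss or recurrence can explain them.
conflicting = [
    [1, 0],
    [0, 1],
    [1, 1],
]

print(three_gamete_conflicts(consistent))
print(three_gamete_conflicts(conflicting))
```

In real single-cell data such conflicts arise mostly from genotyping noise rather than from true violations, which is why methods in this area must decide which entries to correct and which conflicts to attribute to mutation loss or recurrence.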

    Since all three methods are based on the use of single nucleotide variants, the majority of the attention in Chapter 2 is devoted to the description of related methods based on the use of this type of mutation.

    Although they are not our main contributions, it is worth mentioning that in this thesis we introduce potentially interesting strategies for generating simulated data, in particular mutations violating the infinite sites assumption and mutations affected by copy number aberrations. We also propose several distinct and novel measures for comparing trees of tumor evolution and use them in comparisons of performance of different methods for tree reconstruction. Methods for comparing trees of tumor evolution are currently lacking and we believe that the proposed measures will inspire future research on the topic. Some other interesting and important directions for future research are presented in Chapter 6.


  • Chapter 2

    Background

    In this chapter we provide background on the existing computational approaches for deciphering ITH and inferring trees of tumor evolution. Based on the input data used, we classify methods into three main groups: (i) methods designed only for bulk sequencing data, (ii) methods designed only for single-cell sequencing data, and (iii) methods combining both single-cell and bulk sequencing data. We devote a separate section to each group of methods and also provide a description of the main advantages and limitations of bulk and single-cell sequencing data in studying ITH and evolution.

    The rapid developments in the design of algorithms and methods discussed in this thesis would be largely impossible without the completion of the first draft of the human reference genome announced in 2001 [22] and the invention of novel approaches for DNA sequencing that followed soon afterwards. In 2004 several technologies for massively parallel DNA sequencing, better known as Next Generation Sequencing (NGS), were introduced [109]. Since all methods discussed in this work rely on data generated by some of the available NGS platforms, we devote the first section of this chapter to NGS. In the following sections we discuss the main concepts and the existing methods for inferring tumor subclonal composition and/or trees of tumor evolution.

    2.1 Next Generation Sequencing

    Next generation sequencing, also called second-generation sequencing, is a common name used for several sequencing technologies that first appeared in 2004 and gradually replaced traditional Sanger sequencing. Due to the high cost of sequencing at the early stages of technology development, its use was largely limited to academic research laboratories. The first NGS sequencers were used for sequencing selected genomic regions of interest and, rarely, for sequencing whole genomes [174, 170].

    However, since their introduction, the broad potential of NGS technologies has been recognized and we have witnessed large investments and rapid technological advancements in the field over the past 15 years. Some of the main drivers of innovation include market competition among companies providing sequencing infrastructure, but also large financial support through public research funding agencies. For example, the National Human Genome Research Institute awarded more than 100 million USD for developments in NGS between 2004 and 2008 [109, 146]. As a result, the cost of sequencing has plummeted and whole genome sequencing (WGS) can nowadays be routinely performed, while the use of genetic tests is becoming common clinical practice [80, 73].

    Developments in sequencing technologies also led to the development of different NGS platforms. A description of the details of the equipment and laboratory steps required to perform an NGS experiment falls largely outside the scope of this thesis and, for more thorough reading, we refer to some of the numerous reviews published on these topics [109, 153, 53]. Here, we focus only on summarizing the details of the NGS sequencing process and the generated output data that are most relevant to the development of the computational methods discussed later in the thesis.

    2.1.1 Preparing the input of NGS experiment

    One of the first steps that needs to be performed in an NGS experiment is input DNA preparation. Depending on the intended use of the data and the financial, technical and other resources available, there are different strategies for extracting cellular DNA and preparing the final DNA that is later provided as input to the sequencing machine. We distinguish between different DNA preparation strategies in terms of the number of tumor cells represented in the final DNA (bulk vs. single-cell sequencing) and between different approaches to sequencing in terms of the size of the sequenced region (targeted sequencing, whole exome and whole genome sequencing). We now briefly describe these two classifications, without getting into application details that are discussed later.

    Bulk vs. single-cell sequencing

    Each NGS experiment requires a minimum amount of DNA to be provided as input to the sequencing machine. According to some estimates, this amount is equal to the amount of DNA found in approximately 80,000 single cells [169]. For this reason, the extraction of DNA from hundreds of thousands or millions of single cells is usually one of the first steps of input DNA preparation. DNA extraction steps can be followed by some additional DNA amplification steps (e.g., in the targeted sequencing approach discussed below). Nevertheless, in this case the sequenced DNA is a mixture of DNA originating from hundreds of thousands or millions of single cells.

    Sequencing where the initial DNA material is obtained using this approach is also known as bulk sequencing.


  • On the other hand, since the amount of DNA present in a single cell is insufficient to perform sequencing, sequencing DNA from a single cell first requires precise extraction of the cell’s DNA followed by several rounds of DNA amplification in order to reach the desired amount of DNA used for sequencing. Sequencing where DNA is prepared in this way is better known as single-cell sequencing (SCS).

    Targeted, whole exome and whole genome sequencing

    The process of input DNA preparation and sequencing also depends on the intended use of the data. The aim of performing a sequencing experiment might be to obtain data about a particular region (target) of interest or to interrogate genomic variation at the whole exome or the whole genome level. Based on the size of the sequenced region, we divide sequencing experiments into the following three groups:

    1. Targeted sequencing, which is used in cases where we are interested in examining genetic variants in a pre-defined set of genomic regions. Some examples of the use of targeted sequencing include validating a mutation of interest and sequencing a selected set of genes known to harbor mutations causing a particular disease. In the clinical uses of sequencing in cancer treatment, targeted sequencing of genes for which known treatment options exist is becoming common practice.

    2. Whole Exome Sequencing (WES), which is used to search for mutations in protein-coding regions (exons) of genes. These regions are expected to harbor the majority of deleterious mutations.

    3. Whole Genome Sequencing (WGS), where the goal is to obtain sequencing data of the whole genome.

    2.1.2 Output of NGS experiment

    The typical output of an NGS experiment consists of millions or, more recently, billions of sequencing reads (short fragments of the sequenced DNA). Nowadays, most NGS data is generated using Illumina’s short read sequencing technology, which provides data of high throughput and accuracy. Reads generated by this technology are usually 100 to 150 base pairs long and have between 0.1% and 1% erroneously called bases.

    Sequencing depth, also called sequencing coverage, is another important parameter of an NGS dataset. Coverage at a given position can be defined as the number of reads that overlap with this position after read mapping (during read mapping, locations in the Human Reference Genome that are most similar to individual reads or their parts are determined). Average coverage of a given set of genomic regions is defined as the mean value of coverage of the positions falling into these regions.


  • Sequencing breadth is usually defined as the percentage of positions which have sequencing depth greater than or equal to some given constant c. When computing sequencing breadth, only positions that were intended to be sequenced (e.g., the set of all exons in WES) are considered.
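    The definitions of depth, average coverage and breadth can be made concrete with a small sketch. It is illustrative only: reads are assumed to be already mapped and represented as half-open intervals of reference positions, breadth is returned as a fraction rather than a percentage, and the function names (per_position_depth, average_coverage, breadth) are hypothetical rather than taken from any tool used in this thesis.

```python
from typing import List, Tuple


def per_position_depth(reads: List[Tuple[int, int]],
                       region_start: int,
                       region_end: int) -> List[int]:
    """Depth at every position of the half-open region [region_start, region_end).

    Each read is a half-open interval (start, end) of mapped reference positions.
    """
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        # Clip the read to the targeted region before counting.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


def average_coverage(depth: List[int]) -> float:
    """Mean depth over the positions of the targeted region."""
    return sum(depth) / len(depth)


def breadth(depth: List[int], c: int) -> float:
    """Fraction of targeted positions with depth greater than or equal to c."""
    return sum(1 for d in depth if d >= c) / len(depth)


if __name__ == "__main__":
    # Three toy reads mapped to a 10-position target region.
    toy_reads = [(0, 5), (3, 8), (4, 6)]
    d = per_position_depth(toy_reads, 0, 10)
    print(d, average_coverage(d), breadth(d, 1), breadth(d, 2))
```

    Only positions inside the targeted region enter the computation, which mirrors the convention that breadth is evaluated over positions intended to be sequenced.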

    2.1.3 The uses of NGS data in studies of intra-tumor heterogeneity and tumor evolution

    NGS nowadays enables cost-effective interrogation of genomic regions of interest or even whole genomes. Due to the importance of genetic aberrations in cancer onset and progression, sequencing of tumor samples has been one of the main applications of NGS since its very beginnings. Detection of various types of mutations by the use of whole exome or whole genome sequencing facilitates identification of the key cancer driver mutations and mutational burden among distinct cancer types. Targeted sequencing in search of the genomic aberrations for which known treatment options are available is offered at affordable prices (on the order of hundreds of US dollars) and is starting to become clinical routine (e.g., in a study involving 1281 oncologists in the United States, 75.6% reported using NGS tests to guide treatment decisions for their patients [46]).

    In addition to enabling large scale studies of inter-tumor heterogeneity [162, 173], which is characterized by distinct sets of mutations present among distinct patients, developments in sequencing technologies also facilitated the exploration and better understanding of the extent of genetic ITH and tumor evolution (see Section 1.1 for a summary of the first studies using NGS for analyzing ITH and evolution).

    Due to cost and technological constraints, bulk sequencing is still the dominant approach in tumor sequencing. Most analyses of ITH and evolution from bulk sequencing data start with whole exome or whole genome sequencing [50, 55, 119, 147, 173, 172, 88]. While data produced by WES or WGS is of high sequencing breadth, its typical depth ranges between 30× and 100× [172, 49]. Due to low sequencing depth, variant allele frequencies, defined as the fraction of reads supporting the variant allele of a reported mutation, are usually characterized by high variance. An additional limitation of such data is in detecting rare mutations, as mutation signals are usually not discernible from sequencing noise. Therefore it is also very common that WES or WGS data is used for identifying putative variants and is then followed by targeted sequencing of a selected subset of the putative variants [147, 55, 172]. Targeted sequencing is performed in order to identify true variants and obtain highly reliable variant allele frequencies that are later used in the analysis or for validating findings obtained from WES or WGS data. Custom sequencing panels targeting hundreds of genes were recently used in [88] and [166]. These panels can provide data of higher coverage at lower cost than standard WES, but many of the important mutations from genes not covered by the panels can be missed.


  • The first method for single-cell sequencing was introduced in 2011 [116]. Although there have been many developments in single-cell sequencing since 2011 [81, 186], isolating, amplifying and sequencing the DNA of a larger number of individual cells (which is necessary in order to get an appropriate input for inferring tumor subclonal frequencies and the tree of tumor evolution) is still challenging and expensive in comparison to bulk sequencing. Single-cell data is also characterized by elevated noise rates, with many false negative and some false positive mutation calls. Occasionally DNA from two or more single cells may be extracted together, resulting in doublet noise and output data that reflects the DNA of multiple cells. Non-uniform extraction of single cells can also result in sampling biases where the numbers of cells sampled from subclones are not proportional to their cellular prevalences. The effect of this type of noise is significantly lower in bulk sequencing. Nevertheless, despite various types of noise, single-cell sequencing yields data of the highest possible resolution and has great potential to revolutionize the studies of tumor evolution. This type of data can also help in detecting some of the rare mutations that can be missed by standard bulk approaches. Some evidence for this was provided in [172] and [88]. However, detection of rare mutations by SCS depends on the quality of the generated data and the sampling of single cells (due to noise in SCS data, usually at least two cells harboring a mutation need to be sampled in order to get reliable mutation calls).

    2.2 Inference of tumor subclonal composition and evolution from bulk sequencing data

    Due to the input DNA preparation strategy used, a bulk sequencing dataset consists of a set of reads originating from a mixture of a large number of different cells and therefore yields only an aggregate signal about their DNA. Consequently, in the case of bulk sequencing of heterogeneous tumor tissue, none of the subclonal populations is observed directly and, in order to infer tumor subclonal composition, we need computational methods for deconvolution of the observed aggregate signals. Each such method is faced with the very challenging problem of inferring an unknown number of tumor subclones of unknown prevalences and unknown sets of somatic mutations harbored by individual subclones. In addition, many of the methods also infer a clonal tree of tumor evolution (defined in Section 1.2.1) and exploit data obtained by sequencing multiple samples of the same patient [81].

    Due to their high prevalence in many tumors [19], well developed methods for identification from bulk data [178] and simplicity of use in modeling tumor subclonal composition and evolution, single-nucleotide variants (SNVs) are the most widely used type of mutations among the existing methods for studying ITH and evolution [97, 60, 38, 39, 128, 142, 66, 130, 33, 150, 125, 189, 159, 145, 110, 69, 120, 164, 114, 133]. In addition, several methods based on the use of copy number aberrations (CNAs) [58, 124, 123, 25, 185] and a few based on the use of other types of variants (e.g., large insertions) [37, 21] were also developed in the past years. As will be illustrated below, CNAs overlapping with the genomic position of a given SNV can have a substantial impact on the fraction of reads supporting the variant allele, as well as on the total number of reads spanning the position. Since these two numbers (for each SNV) are the main input for all of the methods based on the use of SNVs, prior knowledge of the copy number status of regions harboring SNVs is a required part of the input for most such methods. Some of the methods expect exact estimates of copy number (e.g., [33, 39]) whereas others use this information only to filter out mutations from non-diploid regions (e.g., [38, 97]). Note that, in principle, the read count data should be sufficient input to the methods based on the use of SNVs (since the copy number status of the mutated loci can be inferred from the read counts). However, as copy number events typically affect larger genomic regions, it would then be highly desirable to consider not only reads spanning genomic positions of the mutated loci, but many other, if not all, reads. This would come at the cost of considerably increased model complexity and very challenging optimization problems and is one of the main reasons that copy number analysis is performed as a separate step prior to running most of the SNV-based methods.

    We will now present some of the main concepts used in the methods based on the use of SNVs. We start with the simple case where bulk sequencing data of a single tumor sample is available. We also discuss challenges arising when CNA event(s) overlap with the position of an SNV and summarize theoretical limitations. These limitations are usually a consequence of the limited resolution of bulk NGS data and they are particularly pronounced when only a single sequenced sample is available. In such cases, some of the limitations are very likely to influence the correctness of the solution, regardless of the method used in the analysis. However, in many cases, most of the limitations can be resolved if data from multiple different samples from the same patient is available. Due to the importance of the use of multiple samples when they are available, we devote a separate subsection to discussing the potential benefits of the joint use of data obtained by sequencing multiple tumor samples (of the same patient). We assume that all SNVs considered in the input are from positions that are homozygous in the matching normal tissue (which is usually sequenced in addition to tumor tissue in order to distinguish somatic and germline variants). At the end of this section, we discuss methods based on the use of CNAs and other types of variants.

    2.2.1 Variant and reference read counts as a proxy for the fraction of cells harboring mutation

    Assume that we are given a heterozygous SNV, denoted a, that belongs to a region of the genome not affected by any CNA event. Let $v_a$ and $t_a$ respectively denote the number of variant reads and the total number of reads at the position of this SNV. Assume for the moment that we have a perfect sampling of reads from the sequenced cells, where from each cell harboring the mutation two reads are present in the bulk data: one from the chromosome with the mutation (i.e., a read supporting the variant allele) and one from the chromosome without the mutation (i.e., a read supporting the reference allele). We also assume that each cell not harboring mutation a contributes two reads, each of them supporting the reference allele. Under these assumptions, the total number of cells in the sample equals $t_a/2$. On the other hand, the total number of cells harboring mutation a equals $v_a$. Together, these imply that the fraction of cells harboring mutation a, also referred to as the cellular prevalence of a and here denoted as CP(a), equals:

    $$\mathrm{CP}(a) = \frac{v_a}{t_a/2} = \frac{2 v_a}{t_a}. \tag{2.1}$$

    In practice, the sampling of variant and total reads in the bulk data is not as ideal as in the above. However, assuming that the reads are emitted uniformly at random, which is usually a sound assumption for bulk sequencing, the formula from Equation 2.1 provides an estimate of the cellular prevalence of a given mutation. Clearly, the precision of this estimate depends on the depth of sequencing coverage. For example, in the case of low coverage data, e.g., if $t_a = 40$, a single read difference in the number of observed variant reads $v_a$ changes the estimate of the cellular prevalence of a by 0.05 (i.e., 5%). Since the effect of variance in the values of $v_a$ on the estimated cellular prevalence of a mutation decreases as the depth of sequencing coverage increases, performing deep sequencing is necessary in applications where highly accurate estimates of mutational cellular prevalences are required.
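    The estimate from Equation 2.1 and its sensitivity to a single read can be checked with a short sketch. The function name cellular_prevalence is a hypothetical helper introduced here for illustration; it assumes a heterozygous SNV in a diploid region, exactly as in the derivation above, and does not clip estimates that exceed 1 because of sampling noise.

```python
def cellular_prevalence(variant_reads: int, total_reads: int) -> float:
    """Estimate CP(a) = 2 * v_a / t_a for a heterozygous SNV in a diploid region."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    return 2.0 * variant_reads / total_reads


if __name__ == "__main__":
    # One extra variant read shifts the estimate by 2 / t_a.
    for total in (40, 200, 1000):
        base = cellular_prevalence(total // 4, total)
        bumped = cellular_prevalence(total // 4 + 1, total)
        print(total, round(bumped - base, 4))
```

    For a total of 40 reads the shift per read is 2/40 = 0.05, matching the example in the text, and it shrinks in proportion to 1/t_a as coverage grows.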

    The effect of copy number aberrations on the observed variant allele frequencies

    In the case where an SNV belongs to a genomic region affected by some CNA event, the formula given in Equation 2.1 cannot be directly used for computing the fraction of cells harboring the SNV.

    Figure 2.1: Copy number gain of reference allele (green rectangle) occurring after M1. [Figure not reproduced; its labels show the fractions 22%, 18% and 60% and the mutations M1 and M2.]

    This is illustrated by the example shown in Figure 2.1. In this example, we consider two mutations, M1 and M2, and assume that M2 belongs to a genomic region not affected by any CNA event. On the other hand, as shown in the figure, M1 belongs to a genomic region affected by a CNA event which involves a gain of the reference allele (green).

    In perfectly uniform sampling without noise in the sequencing data, the number of reads supporting the reference allele and spanning the genomic position of M1 is proportional to 0.22 · 2 + 0.18 · 1 + 0.60 · 2 = 1.82. For the variant allele, this number is proportional to 0.18 · 1 + 0.60 · 1 = 0.78. Now, if we apply the formula from Equation 2.1 we get that the cellular prevalence of M1 equals 2 · 0.78/(0.78 + 1.82) = 0.6, a value which is very different from the true value of 0.78. Furthermore, this value is equal to the estimated cellular prevalence of M2 and, in the absence of multiple bulk samples, this would result in incorrect merging of the two mutations into the same cluster/subclone (as will be discussed below).
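    The arithmetic of this example can be reproduced with a short sketch. The per-subclone copy numbers are read off the sums in the text (a 22% population with two reference copies and no M1, an 18% population with one reference and one variant copy, and a 60% population with two reference copies and one variant copy); interpreting them as three cell populations, and the helper name naive_and_true_cp, are assumptions made here for illustration.

```python
from typing import List, Tuple

# (fraction of cells, reference copies at the M1 locus, variant copies at the M1 locus)
Subclone = Tuple[float, int, int]


def naive_and_true_cp(subclones: List[Subclone]) -> Tuple[float, float]:
    """Return (CP from Equation 2.1 applied blindly, true cellular prevalence)."""
    ref_signal = sum(frac * ref for frac, ref, _ in subclones)
    var_signal = sum(frac * var for frac, _, var in subclones)
    # Equation 2.1 assumes a diploid locus: CP = 2 * variant / total.
    naive = 2.0 * var_signal / (var_signal + ref_signal)
    # The true prevalence is the fraction of cells carrying at least one variant copy.
    true = sum(frac for frac, _, var in subclones if var > 0)
    return naive, true


if __name__ == "__main__":
    figure_2_1 = [
        (0.22, 2, 0),  # cells without M1
        (0.18, 1, 1),  # cells with M1, locus still diploid
        (0.60, 2, 1),  # cells with M1 and a gained reference copy
    ]
    naive, true = naive_and_true_cp(figure_2_1)
    print(round(naive, 2), round(true, 2))
```

    Under these assumptions the blind estimate comes out at about 0.6 while the true prevalence is about 0.78, which is the discrepancy described above.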

    This example clearly illustrates the importance of appropriate adjustment of variant read counts in cases where one or multiple CNA events overlap with the genomic position of a given mutation. In such cases, estimating the CP value from read counts requires additional information about each of the CNA events affecting the genomic position of the mutation. However, detecting all CNA events and the number of copies of regions lost or gained during each of the events is a very challenging problem in itself. This problem is further complicated by the presence of ITH [123]. The relative timing of CNA events and the first occurrence of the SNV is yet another complexity that needs to be modelled properly. For all these reasons, many of the existing methods are constrained to the use of SNVs from diploid regions [97, 38, 128, 60, 189]. Some of the methods considering mutations from non-diploid regions (e.g., PhyloWGS [33]) expect copy number estimates, whereas in a few (e.g., Canopy [69]) CNA events are inferred together with SNVs.

    2.2.2 Clustering of mutations based on the read counts

    Here, we assume that tumor growth follows the branching clonal theory of tumor evolution (see Section 1.2.1) and restrict ourselves to tumors harboring SNVs. At this point, we do not get into details of the minimum number of SNVs that we expect in order to analyze ITH and evolution. For each SNV, we assume that it occurs exactly once and is never lost (this is known as the Infinite Sites Assumption and will be discussed later in more detail). In our notation, we introduce p(S), which denotes the parent of cancerous subclone S in the clonal tree.

    Recalling clonal theory and the Infinite Sites Assumption, the set of cells harboring SNVs that appear for the first time at S (i.e., mutations that distinguish S from its parent p(S)) consists of cells from subclone S and all cells from its descendant subclones. This implies that all mutations appearing for the first time in S are expected to have similar values of the true cellular prevalence (not necessarily identical, since mutation accumulation between p(S) and S is a gradual process where some of the mutations happen earlier). The emergence of a subclone S also requires that its first (founding) cell harbors one or multiple driver mutations that are absent from p(S). Such mutations, acquired between the two subclonal expansions, p(S) and S, provide some selective advantage or disadvantage to S, distinguishing it (in terms of the selective potential) from p(S). The acquisition of the driver mutations is a lengthy process requiring several cellular divisions until one of the descendant cells of p(S) acquires a critical set of mutations producing the founder cell of S. Due to an imperfect DNA replication process, during these cellular divisions newly born cells acquire "hitchhiker" (selectively neutral) mutations that are later passed to their descendants. In addition, some external factors (e.g., ultraviolet light) might also contribute to the acquisition of new mutations. In summary, in the observed data we expect the existence of clusters of mutations, occurring between the two subclonal expansions and distinguishing the child subclone from its parent, and also having highly similar cellular prevalence values.

    All of the above is illustrated in Figure 2.2, where in (a) we show the tree of tumor evolution for some hypothetical tumor and in (b) the distribution of the expected cellular prevalences computed from read counts available in bulk data using the formula from Equation 2.1. When simulating read counts we assume that each locus has coverage equal to 200×. From Figure 2.2(b), it is evident that the distribution of cellular prevalences can be well explained by three distinct Gaussian distributions with means around 10%, 30% and between 50% and 55%, as expected based on the frequencies in Figure 2.2(a). An example of desirable clustering of mutations in this case is shown in Figure 2.4(a).

    [Figure 2.2 (image not reproduced). Panel (a), "Ground truth tree": node labels 10%, 30%, 15% and 45%, with the annotation "is present in (15+10+30)% = 55% of cells". Panel (b), "COVERAGE = 200": histogram with x-axis "Cellular prevalence (based on the read counts)" (0 to 100) and y-axis "Number of mutations".]

    Figure 2.2: (a) Clonal tree of evolution of a hypothetical tumor. (b) Number of mutations (y-axis) having the observed cellular prevalence (x-axis). Cellular prevalence values were computed using the formula from Equation 2.1 and are shown as percentages (we assume that all mutations are from diploid regions). Coverage at each position was set to 200× (i.e., $t_a = 200$ for each mutation a) and the number of variant reads (denoted previously as $v_a$) was drawn from a Binomial distribution with success probability depending on the cellular prevalence of the mutation. Red, blue and green clusters of mutations consist of 265, 300 and 65 mutations, respectively.
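    A simulation of the kind described in the caption can be sketched as follows. This is not the author's simulation code. It assumes that, for a heterozygous SNV in a diploid region, the probability that a read supports the variant equals half the cellular prevalence (the inverse of Equation 2.1). The green cluster (65 mutations at 30%) follows the text, whereas assigning 265 mutations to the 55% cluster and 300 to the 10% cluster is an assumption, since the text does not say which of the red and blue clusters is which. The function name simulate_observed_cp is hypothetical.

```python
import random
from typing import Dict, List, Tuple


def simulate_observed_cp(cluster_sizes: Dict[float, int],
                         coverage: int,
                         seed: int = 0) -> List[Tuple[float, float]]:
    """Return (true CP, observed CP) pairs, one per simulated mutation.

    cluster_sizes maps the true cellular prevalence of a cluster to the
    number of mutations it contains.
    """
    rng = random.Random(seed)
    observed = []
    for true_cp, n_mutations in cluster_sizes.items():
        # Assumption: heterozygous SNV, diploid locus, so half of the reads
        # coming from mutated cells carry the variant allele.
        p_variant = true_cp / 2.0
        for _ in range(n_mutations):
            # Binomial draw written as a sum of Bernoulli trials (stdlib only).
            variant_reads = sum(rng.random() < p_variant for _ in range(coverage))
            observed.append((true_cp, 2.0 * variant_reads / coverage))
    return observed


if __name__ == "__main__":
    # Cluster sizes: 65 at 30% is from the text; the other assignment is assumed.
    clusters = {0.55: 265, 0.10: 300, 0.30: 65}
    for cov in (200, 50):
        sim = simulate_observed_cp(clusters, coverage=cov)
        for cp in clusters:
            values = [obs for true, obs in sim if true == cp]
            print(cov, cp, round(min(values), 2), round(max(values), 2))
```

    Printing the range of observed values per cluster at 200× and at 50× gives a rough, text-only impression of how much the clusters spread as coverage drops.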

    In order to illustrate the effect of depth of sequencing coverage on the observed values, in Figure 2.3 we show an example with the same setting of parameters as in Figure 2.2, except for the sequencing coverage, which was set to 50×. As the example from Figure 2.3(b) illustrates, detection of the green cluster of mutations becomes very challenging when working with sequencing data of low coverage, despite the fact that its frequency of 30% is well separated from the frequencies of the other two clusters (10% and 55%) and that, in the ground truth, over 10% (65 out of 630) of all mutations belong to this cluster. An example of desirable clustering in this case is shown in Figure 2.4(b).

    [Figure 2.3 (image not reproduced). Panel (a), "Ground truth tree": node labels 10%, 30%, 15% and 45%, with the annotation "is present in (15+10+30)% = 55% of cells". Panel (b), "COVERAGE = 50": histogram with x-axis "Cellular prevalence (based on the read counts)" (0 to 100) and y-axis "Number of mutations".]

    Figure 2.3: An example of the distribution of mutational cellular prevalences observed from bulk data. All settings are the same as in Figure 2.2, except that sequencing depth equals 50×.
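    The loss of resolution between Figures 2.2 and 2.3 can also be quantified analytically. The sketch below assumes the Binomial read model described in the caption of Figure 2.2, with variant-read probability equal to half the cellular prevalence (heterozygous SNV, diploid locus); the helper name cp_estimate_sd is hypothetical. Under this model the estimate 2·v_a/t_a has standard deviation 2·sqrt(p(1 − p)/t_a) with p = CP/2.

```python
import math


def cp_estimate_sd(true_cp: float, coverage: int) -> float:
    """Standard deviation of the Equation 2.1 estimate under Binomial sampling.

    Assumes v_a ~ Binomial(coverage, true_cp / 2), i.e. a heterozygous SNV
    in a diploid region.
    """
    p = true_cp / 2.0
    return 2.0 * math.sqrt(p * (1.0 - p) / coverage)


if __name__ == "__main__":
    for coverage in (200, 50):
        for cp in (0.10, 0.30, 0.55):
            print(coverage, cp, round(cp_estimate_sd(cp, coverage), 3))
```

    For the 30% cluster this gives a standard deviation of roughly 0.05 at 200× and roughly 0.10 at 50×. Reducing coverage fourfold doubles the spread, so at 50× the 30% cluster, which is also the smallest, overlaps substantially with its neighbours.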

    Both of the above examples assume that only a single tumor sample is sequenced. In practice, bulk sequencing data of two or more samples are sometimes available [147]. In these cases, all of the above remains the same, except that for each mutation, instead of its cellular prevalence, we consider a vector of its cellular prevalences across all samples.

    Weak parsimony assumption

    The goal of most of the available methods for deciphering ITH is to obtain a clustering of mutations, with each cluster consisting of mutations from the same subclonal expansion (as illustrated in the previous figures). One of the most frequently used assumptions among the existing methods is the weak parsimony assumption. According to this assumption, SNVs with similar cellular prevalence values are present in a similar set of cells (i.e., occur for the first time at the same subclone) and therefore get clustered together. Here, similarity of cellular prevalence values is largely determined by the depth of sequencing coverage. For example, while CP values of 25% and 30% are significantly different in the case of bulk data of coverage 10,000×, these values are very similar in a dataset where coverage equals 40×. In general, there is some degree of trade-off between the number of mutations and sequencing coverage. Targeted deep sequencing data enables sequencing smaller numbers of mutations but, assuming appropriate adjustments for the possible CNA events are