Cell States and Cell Fate: Statistical and Computational Models in (Epi)Genomics
Citation: Fernandez, Daniel. 2015. Cell States and Cell Fate: Statistical and Computational Models in (Epi)Genomics. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Permanent link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:14226043
Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Chapter 2, chapter 3 and chapter 5 are based on the following (in preparation) papers
(they can also be found at the website http://www.dnaiel.com).
Julio D. Perez, Nimrod D. Rubinstein, Daniel Fernandez, Stephen W. Santoro, Leigh A. Needleman, John J. Choi, Mariela Zirlinger, Jun S. Liu and Catherine Dulac. Dynamic Regulation and Functional significance of Parent-of-Origin Allelic Expression in the Adult and Developing Brain. Submitted. (2014)
Fernandez, D., Rubinstein, N., Jun Li, M., Perez J., Dulac C. and Liu, JS. Allele-specific Expression across Multiple Tissues in human studies. In preparation. (2014)
Goren A., Xing J., Dixit A., Fernandez D., Velenich A., Durham T., Liu, JS., Regev, A. and Bernstein BE. Faithfulness of stem cell models: comparative analysis of transcriptome and genomic chromatin structure in early mammalian development. In preparation. (2014)
Chapters 2 and 3 are based on collaborative work with Professor Catherine Dulac and her
lab members Nimrod Rubinstein and Julio Perez. Chapter 4 is based on collaborative work
with Professor Catherine Dulac, Nimrod Rubinstein, Jiexing Wu, and Jun Li. Chapter 5
is based on collaborative work with the labs of Professors Aviv Regev and Bradley Bernstein, and
their lab members Alon Goren, Atray Dixit and Yang Li (a member of Jun Liu's lab).
Acknowledgments
I want to start by thanking my advisor, Professor Jun Liu. I am extremely honored,
grateful and lucky for the opportunity to work with and learn from him. His Bayesian eye and
his Monte Carlo 'moves' very much resemble those of an eagle: sharp, fast, accurate and
precise. He showed me how to do research by setting an example of integrity and high
standards of research himself, and by questioning my ideas, assumptions and models with
an inquisitive mind.
I also want to thank my statistical committee members, Professors Joseph Blitzstein
and Tirthankar Dasgupta. I thank Professor Joseph Blitzstein for teaching me 210, 211 and
many other statistical ideas, for writing amazing posts on Quora (I am his follower there), and
for showing me how complex ideas can be translated into beautiful, not necessarily simple,
but rather simpler, levels of abstraction. I thank Tirthankar for his patience and great
support throughout my PhD, for teaching me about experimental design, and for being a
great contributor to my research.
I also want to thank our scientific collaborators in biology and my biological committee
member, Professor Catherine Dulac, and her lab members Julio Perez and Nimrod Rubinstein
for their constant support, helpful discussions and hard work in validating the results and
understanding the relevance of the findings.
Moreover, I am forever indebted to my first biological and medical collaborator, Dr.
Bradley Bernstein, and his lab members Atray Dixit, Alon Goren, Birgit Knoechel and Ryan
Russell. In their lab I first saw a sequencing machine and 'bench' work; their ambition
and ability to look at a scientific problem from many different angles will always resonate
with me.
But the life of a PhD is a rich one, and teaching what we have learnt is a core part of
our statistics department. I would also like to thank the CompBio course team, Professors
Jun Liu and Shirley Liu, and TFs Alejandro and Lin, from whom I learned how to combine
ideas from statistics, computer science and biology. I was extremely lucky to TF for them
and to learn from them.
Also, I want to thank all members of Jun's lab: we had great moments together and smart
discussions, and we always helped one another. I want to especially thank Ke Deng, Ming
Hu, Jun Li, Yang Li and Jiexing Wu. But I was also part of a bigger community, the
statistics department at Harvard, and I thank all members of the department. I especially
thank my classmates Valeria Espinoza, Simeng Han, Jonathan Hennessy, Bo Jiang, Joseph
Kelly, Nathan Stein, Xiao Tong, Samuel Wong and Xiaojin Xu; our own department
administrator Betsey Cogswell; and the staff members Steven Finch, Jimmy Matejek, Alice
Moses, Dale Rinkel and Maureen Stanton.
Last, but not least, thanks to my family: my noble wife Jane, my parents Arturo and
Magdalena, my brothers Arturo and Juan Cristobal and my sister Magdalena, my in-laws
Pedro, Cata and Maca, my godsons R2D2 and Pedrito, and all my nephews and nieces.
Also thanks to my in-laws, Dr. and Ms. Hong, my little sister-in-law Anne Hong and my
brother-in-law el Jefe. I am glad to have a big and loving family that very much cares for
me and is there for me unconditionally.
To my joyful and caring Jane, who inspires me every day, and brings me the joy
and peace needed to focus on my research.
Chapter 1
Introduction
A technological advance of a major sort almost always is overestimated in the
short run for its consequences - and underestimated in the long run.
- Francis Collins
The ability to read the whole DNA (whole genome) and to accurately measure the molecular
state of a cell(s)¹ is now possible due to major technological achievements over the past
few decades, from early Sanger sequencing in the 1970s (Sanger and Coulson (1975)
and Sanger et al. (1977)) to the Next Generation Sequencing (NGS) technologies of today
(Metzker (2009) and Koboldt et al. (2013)). These technologies, in turn, have allowed us to
read the entire human genome (the Human Genome Project, HGP; International Human Genome
Sequencing Consortium (2001) and Venter et al. (2001)). The HGP opened a Pandora's box of
possibilities for the study of our biology and for the study of disease. Now, sequencing
technologies can be used to study the proteome, the transcriptome (RNA-Seq), transcriptional
regulation via protein binding (ChIP-Seq), the epigenetic landscape (ChIP-Seq, DNase-Seq, WGBS,
ChIA-PET, DamID-Seq), and the chromatin structure of a cell (Hi-C), among
several other molecular measures of a cell's activity (or current state).
¹ Note that in this thesis we interchangeably refer to a cell as a single cell, or as a population of cells (believed to be in a similar condition/state or coming from the same tissue). Most of our experiments were done in tissues containing on the order of thousands to millions of cells, but the methods could easily be applied to single-cell experiments.
This gives us a much more comprehensive understanding of cell and molecular biology,
allowing us to tackle a variety of questions starting from fundamental biology such as the
basis of epigenetic inheritance, the study of cell identity, cell differentiation, embryonic and
stem cell development, to the importance of genetics and epigenetics in maintaining the
proper state of cells.
In this thesis we present several statistical and computational models that help us understand
and properly analyze genomics and epigenomics data, hence the title of the thesis:
Cell States and Cell Fate: Statistical and Computational Models in (Epi)Genomics. However,
before describing the models we present a brief introduction to molecular biology and
the sequencing technologies used throughout this thesis. This serves a twofold objective:
on one hand, it helps the reader better understand the concepts and the motivation for
our research, making this thesis more or less self-contained; on the other hand, it helps
motivate the statistical models that are presented. Following the words of George Box, "all
models are wrong, but some are useful," I would further fine-tune them to say that "models
that do not take the data generating process into consideration are most likely not useful."
1.1 The Cell: its components and the central dogma of MB
A eukaryotic cell is an extremely complex and dynamic system, with millions of molecules
interacting with each other in order to maintain its function and state².
Nonetheless, part of this complexity can be explained through the central dogma of
molecular biology, and its key actors: the DNA, the chromosome, the chromatin, the RNA,
the RNA polymerase, the proteins, and the ribosome. The central dogma states that the
information on how to make a protein is encoded in the DNA, and that it flows from DNA to
RNA to protein.
Figure 1.1: The central dogma of molecular biology.
The DNA can be viewed as the information required for a cell to live, divide, differentiate,
and maintain its state and function. The central dogma of MB explains how
such information is read, copied, and translated into molecular machinery called proteins.
In broad terms, the central dogma describes the steps that lead from reading
the DNA to creating a protein. First, the DNA polymerase complex replicates
(makes a copy of) the DNA, so that the information is passed on when a cell divides. Then, the
RNA polymerase complex transcribes (makes a second copy of) a gene (a given region/locus in
the DNA) into messenger RNA. Finally, the ribosome complex reads the messenger RNA transcript
and translates it into protein. It is important to note that a key intermediate step in complex
organisms, such as higher-order eukaryotes, is alternative splicing (Berget et al. (1977)), where
the spliceosome complex removes the introns and keeps only a subset of the exons of an mRNA
transcript, chosen among a combinatorial number of possibilities (figure 1.2). It is then the
spliced mRNA transcript that is translated into protein, and not the original mRNA. The
importance and pervasiveness of alternative splicing has been extensively studied in recent
years, and it is of key importance in eukaryotes since it greatly increases the diversity of
proteins (Black (2003)). In humans, for example, about 95% of multi-exon genes are alternatively
spliced (Pan et al. (2008)). This argues that the minimal fundamental unit of information is not
the gene (a region/sequence in the DNA) but rather each of its forms, since each form may
produce a very different protein; therefore, the same gene may have several distinct functions,
participate in different processes, and be part of different pathways.
² By function we mean the physiological function the cell has in the organism, e.g., a red blood cell is a cell whose major purpose is to carry oxygen. By cell state we mean the physiological condition of the cell at a given time.
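The flow from DNA to RNA to protein, with alternative splicing as the intermediate step, can be illustrated with a minimal Python sketch. The toy gene, the exon coordinates and the heavily reduced codon table below are illustrative assumptions of ours (they are not data from this thesis), and the sketch ignores strand orientation, untranslated regions and splice-site recognition.

```python
# Toy illustration of transcription, alternative splicing and translation.
# The gene, exon coordinates and reduced codon table are hypothetical.

CODON_TABLE = {
    "AUG": "M", "GCU": "A", "UUU": "F", "AAA": "K", "GGA": "G",
    "UAA": "*", "UAG": "*", "UGA": "*",  # "*" marks a stop codon
}


def transcribe(coding_strand):
    """Return the pre-mRNA matching the coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna, exons):
    """Join the selected exons, given as half-open (start, end) intervals."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Translate codon by codon from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X: codon not in the toy table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# exon 1 (0-6), intron (6-15), exon 2 (15-21), intron (21-27), exon 3 (27-33)
GENE = "ATGGCT" + "GTAAGTCAG" + "TTTAAA" + "GTCCAG" + "GGATAA"
pre_mrna = transcribe(GENE)

isoform_a = splice(pre_mrna, [(0, 6), (15, 21), (27, 33)])  # keeps all three exons
isoform_b = splice(pre_mrna, [(0, 6), (27, 33)])            # skips exon 2

print(translate(isoform_a))  # MAFKG
print(translate(isoform_b))  # MAG
```

The same gene yields two different proteins depending on which exons are kept, which is the sense in which each isoform, rather than the gene, acts as the unit of information.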
1.1.1 The genome
The genome is the sequence of all nucleotides present in the DNA molecule. With the use
of sequencing technologies we have been able to read the whole genome of several species, such
as the 3,082,436,951 letters (nucleotides) of the human genome, and the 2.5 Gb (gigabases, i.e.,
billions of bases/nucleotides) of the mouse genome. However, we need to keep in mind that each
mouse and each human is unique: no two humans, except identical twins, carry the same genome.
We are a random combination of our paternal and maternal genomes - in a process called
homologous recombination, our paternal and maternal genomes are crossed over (mixed)
during meiosis to create a new and unique offspring. This seemingly trivial
fact, when added to the fact that DNA gets randomly mutated across generations (and
positively selected), forms the basis of evolution and the history of species. In other words,
we are the product of two stochastic processes that are simple, but profound in their consequences.
Figure 1.2: Alternative Splicing step. This step is unique to complex organisms and it is an additional transcriptional step.
1.1.2 The Transcriptome and the Proteome
The parts of the DNA that are transcribed and ultimately translated into protein are what we
call genes. However, as mentioned at the beginning of this section, there is one more layer of
complexity. In higher species, due to alternative splicing, after the transcriptional step
(copying the whole gene sequence to mRNA), a given gene gives rise to several different forms,
the so-called isoforms (multiple forms of a gene). It is those isoforms that are translated into
protein, giving rise to several forms of a protein. Alternative splicing has been shown to be
extremely important: allowing a single gene to give rise to multiple products increases the
diversity of proteins produced by our roughly 20,000 genes, which in turn helps
differentiate us from other species with which we share a large part of our genome.
As proteins form the basis of cell regulation and cell function, we are mostly interested
in estimating protein abundance, but current technologies are more suitable for estimating
isoform/transcript abundances. This is how two genomic concepts came into existence: the
transcriptome and the proteome. The transcriptome is the set of all transcripts inside a cell,
and the proteome is the set of all proteins inside a cell. The degree to which the transcriptome
is a proxy for the proteome is still a matter of debate, and proteome technologies still have a
long road ahead before we can accurately measure all the proteins inside a eukaryotic cell.
1.1.3 Cell regulation via Protein-Binding
We have explained the process by which proteins are synthesized, but all cells carry the
same genomic information, and all cells in our body come from a single cell, the zygote.
Why, then, can cells with the exact same DNA be of very different types, exhibiting completely
different morphology, state and function? We have more than 200 different cell types in
our body, and certainly no one would dispute that a neuron looks and acts extremely differently
from a blood cell, but what biological mechanism allows the existence and maintenance of
such differences between cell types?
This can be partially understood through gene regulation, i.e., the control of the rate at which
different genes are expressed in a given cell. In other words, through gene regulation a cell can
control its state and function by 'expressing' (transcribing and translating into protein)
different genes, which in turn perform different functions, at different times and places.
A well-known mechanism for gene regulation is the interaction between protein and DNA,
by which a protein/enzyme binds to the DNA in order to control the expression level of a
gene or a set of genes. Based on their role, such proteins and the DNA sites they bind have
been named specificity factors, repressors, general transcription factors, activators,
enhancers and silencers (figure 1.3).
Figure 1.3: Diagram of classes of transcription factors and their activity.
Specificity factors alter the specificity of RNA polymerase for a given promoter or set
of promoters, making it more or less likely to bind to them (e.g., sigma factors used in
prokaryotic transcription). Repressors bind to non-coding sequences on the DNA strand that
are close to or overlapping the promoter region, impeding RNA polymerase's progress along
the strand, thus impeding the expression of the gene. General transcription factors position
RNA polymerase at the start of a protein-coding sequence and then release the polymerase
to transcribe the mRNA. Activators enhance the interaction between RNA polymerase and a
particular promoter, encouraging the expression of the gene. Activators do this by increasing
the attraction of RNA polymerase for the promoter, through interactions with subunits of
the RNA polymerase or indirectly by changing the structure of the DNA. Enhancers are
sites on the DNA helix that are bound by activators in order to loop the DNA, bringing
a specific promoter to the initiation complex. Silencers are regions of DNA that are bound
by transcription factors in order to silence gene expression. The mechanism is very similar
to that of enhancers.
Although transcription factors play a major role in gene regulation, one could imagine
that there must be a heritable mechanism that activates or represses different transcription
factors in different cells, and that stably maintains different regulatory networks across
different cell types. Such a mechanism is called epigenetics. In order to better understand
epigenetics we first introduce the chromatin and chromosome structure, and then explain
how the epigenome is believed to play a major role in gene regulation and, ultimately, in cell
differentiation, cell identity and cell function.
1.2 Chromatin Structure and epigenetics
Each living eukaryotic cell needs to solve an extremely hard problem: how to fit an
approximately 1.5 meter long molecule (DNA) into a cell nucleus that is only a few micrometers
in diameter. The DNA is only part of the story. The whole DNA molecule is contained within a
larger superstructure. This superstructure was first observed by Walther Flemming in 1879, who
used staining techniques and the microscope to observe the contents of the nucleus of a cell. He
stained the "fibrous network" inside the nucleus, which he termed chromatin, "stainable
material" (from the Greek word chroma, meaning color).
Later on it was discovered that the hard task of packaging DNA is accomplished by
specialized proteins that bind to and fold the DNA, generating a series of coils and loops
that provide increasingly higher levels of organization, preventing the DNA from becoming
an unmanageable tangle. Thus, chromatin can be described as the complex of DNA and
protein that makes up the contents of the nucleus of a cell. Amazingly, although the DNA is
very tightly folded, it is compacted in a way that allows it to easily become available to the
many proteins in the cell that replicate it, repair it, and use its genes to produce proteins.
The importance of chromatin in packaging the DNA inside the nucleus in a way that is dynamic
and accessible to other proteins is now well accepted. Furthermore,
chromatin plays a major role in strengthening the DNA to allow mitosis and in preventing DNA
damage, and ultimately it is believed to be central to gene regulation and to maintaining
the state and function of a given cell.
Figure 1.4: Chromatin Structure and its several layers of organization.
1.2.1 Chromatin Structure
Kornberg (1974) studied in depth the first layer of chromatin organization, proposing
that chromatin was composed of DNA and a repeating unit of histones, where about 147
base pairs of DNA would wrap around eight histone proteins forming what is called the
nucleosome.
Due to its high compaction, chromatin can be seen as highly structured while at the same
time highly dynamic and flexible, with several layers of organization. In its first
organizational layer lies the DNA and its helix-like structure. At the second layer of
organization the DNA wraps around histone octamers (consisting of 2 copies of each of the core
histones, H2A, H2B, H3 and H4), forming a chain of nucleosomes, where consecutive nucleosomes
are separated by a histone H1 and short linker DNA, as in figure 1.4. The nucleosome plus the
H1 histone is called the chromatosome. At this layer of organization the chromatin looks like
beads on a string, and it is called euchromatin. Euchromatin is more accessible and ready for
transcription.
Next, in the third layer of organization the nucleosomes fold up to produce a 30 nm
fibre. Then, the fibre coils twice, forming a highly compacted structure, called chromatin
in interphase or, after subsequent coiling, the chromosome in metaphase. The chromatin from the
third layer and up is called heterochromatin, and it is highly compressed and not accessible
for transcription.
1.2.2 Cell regulation via Epigenetics
In biology it is common that structure determines function, and the chromatin is no
exception to the rule. The mechanisms by which the chromatin structure and the biochemical
state of several chromatin components regulate gene expression, and moreover by which
closely related cells in the lineage tree "inherit" such information, are still a matter of
active research, in a field called epigenetics. Epigenetics can be defined as the stable and
heritable information that is distinct from DNA sequences and fostered by specialized mechanisms.
These mechanisms include DNA methylation, small interfering RNAs, histone variants, and
histone post-translational modifications (PTMs). To date, however, only DNA methylation
has been shown to be stably inherited between cell divisions. Although some histone PTMs
are expected to contribute to the transmission of epigenetic information, others participate
in the process of transcription - the so-called active marks - and others are likely to be
restricted to structural functions.
1.2.3 DNA methylation
DNA methylation is the biochemical process by which a methyl group is added to the
cytosine or adenine DNA nucleotides. In multicellular eukaryotes, DNA methylation seems
to be confined to cytosine bases and is associated with a repressed chromatin state and
inhibition of gene expression. In adult somatic cells, DNA methylation typically occurs in a
CpG dinucleotide context; non-CpG methylation is prevalent in embryonic stem cells, and
has also been implicated in neural development.
DNA methylation is essential for viability in mice, because targeted disruption of the
DNA methyltransferase enzymes results in lethality. There are two general mechanisms by
which DNA methylation inhibits gene expression: first, modification of cytosine bases can
inhibit the association of some DNA binding factors with their cognate DNA recognition
sequences; and second, proteins that recognize methyl-CpG can elicit the repressive potential
of methylated DNA. Methyl-CpG-binding proteins (MBPs) use transcriptional co-repressor
molecules to silence transcription and to modify surrounding chromatin, providing a link
between DNA methylation and chromatin remodelling and modification.
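Since methylation in somatic cells is described above as occurring mostly in a CpG context, a small sketch may help fix ideas: it locates the CpG dinucleotides in a sequence and computes the fraction of them that are methylated. The example sequence and the set of methylated positions are hypothetical inputs chosen for illustration; in practice such calls would come from an assay such as WGBS.

```python
def cpg_sites(sequence):
    """Return the 0-based positions of the C in every CpG dinucleotide."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def methylation_level(sequence, methylated_positions):
    """Fraction of CpG sites whose cytosine is in methylated_positions."""
    sites = cpg_sites(sequence)
    if not sites:
        return 0.0
    methylated = sum(1 for position in sites if position in methylated_positions)
    return methylated / len(sites)


# Hypothetical sequence with CpG sites at positions 1, 5 and 8
example = "ACGTTCGACGGA"
print(cpg_sites(example))                  # [1, 5, 8]
print(methylation_level(example, {1, 8}))  # two of the three CpG sites are methylated
```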
1.2.4 Histone Modification
Another type of epigenetic mechanism for gene regulation is the biochemical modification
of the histone tails. Histones undergo post-translational modifications that alter their
interaction with DNA and nuclear proteins. The H3 and H4 histones have long tails protruding
from the nucleosome, which can be covalently modified at several places. Modifications
of the tail include methylation, acetylation, phosphorylation, ubiquitination, SUMOylation,
citrullination, and ADP-ribosylation. The core of the histones H2A, H2B, and H3 can also be
modified. Combinations of modifications are thought to constitute a code, the so-called
"histone code". Histone modifications act in diverse biological processes such as gene
regulation, DNA repair, chromosome condensation (mitosis) and spermatogenesis (meiosis).
In summary, the epigenome plays a major role in gene regulation and in how cells differentiate
and maintain their identity across the lineage tree. The relationship is analogous to that
between hardware and software in computer systems: the genome, like the hardware, is the same
for all the cells, while the epigenome, like the software, changes from cell to cell in order
to control what the hardware is doing.
1.3 Sequencing Technologies and Experimental Protocols
Sequencing technologies can be loosely defined as the set of technologies that allow us
to read, at single-nucleotide resolution, the information contained in DNA molecules. However,
in order to achieve this goal several manual and automated steps must be carried out.
Thus, sequencing technologies include a number of methods that are grouped broadly as
template preparation, sequencing and imaging. The unique combination of specific protocols
distinguishes one technology from another and determines the type of data produced by
each platform. Currently the major platforms available include 454 Life Sciences
(Roche), Applied Biosystems SOLiD, Illumina, Pacific Biosciences and Ion Torrent, among several
others. In recent years Illumina has become one of the most widely used sequencing
technologies, and in this thesis we mainly use sequencing data generated on the Illumina
sequencing platform, in combination with specific experimental protocols depending on the
question of interest.
It is worth mentioning that we do not have any preference in terms of sequencing
technologies, and each technology has its limitations and advantages. Illumina, for example,
requires several PCR amplification steps, but it has a lower per-base error rate than
PacBio. On the other hand, PacBio can sequence at the single-molecule level with no need
for amplification steps, but it has a higher per-base error rate than Illumina. We believe that
the appropriate choice of technology should be an integral part of the experimental design,
and therefore understanding the limitations of each technology is very important.
1.3.1 Pyrosequencing
Pyrosequencing is based on iteratively complementing single strands and simultaneously
reading out the signal emitted from the nucleotide being incorporated (also called sequencing
by synthesis, sequencing during extension). Electrophoresis is therefore no longer required to
generate an ordered read out of the nucleotides, as the read out is now done simultaneously
with the sequence extension.
In the pyrosequencing process, one nucleotide at a time is washed over several copies of
the sequence to be determined, causing polymerases to incorporate the nucleotide if it is
complementary to the template strand. The incorporation stops once the longest possible stretch
of complementary nucleotides has been synthesized by the polymerase. In the process of
incorporation, one pyrophosphate per nucleotide is released and converted to ATP by an ATP
sulfurylase. The ATP drives the light reaction of the luciferases present, and the emitted light
signal is measured. To prevent the dATP provided for the sequencing reaction from being used
directly in the light reaction, deoxyadenosine-5'-(α-thio)-triphosphate (dATPαS), which
is not a substrate of the luciferase, is used for the base incorporation reaction. Standard
deoxyribose nucleotides are used for all other nucleotides. After capturing the light
intensity, the remaining unincorporated nucleotides are washed away and the next nucleotide is
provided.
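The flow-by-flow logic described above can be sketched in a few lines of Python. This is an idealized simulation under assumptions of ours: the flow order "TACG" is arbitrary, the light signal is taken to be exactly equal to the number of nucleotides incorporated in a flow (one pyrophosphate per nucleotide, no noise), and strand orientation is ignored.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pyrosequence(template, flow_order="TACG", max_cycles=50):
    """Simulate idealized pyrosequencing flows over a template strand.

    Returns a flowgram: a list of (nucleotide, signal) pairs, where the
    signal is the number of bases incorporated during that flow.
    """
    expected = [COMPLEMENT[base] for base in template.upper()]
    position = 0
    flowgram = []
    for _ in range(max_cycles):
        for nucleotide in flow_order:
            if position >= len(expected):
                return flowgram
            incorporated = 0
            # Incorporation continues through a whole homopolymer stretch
            while position < len(expected) and expected[position] == nucleotide:
                incorporated += 1
                position += 1
            flowgram.append((nucleotide, incorporated))
    return flowgram


def call_bases(flowgram):
    """Read the synthesized strand back from the flowgram signals."""
    return "".join(nucleotide * signal for nucleotide, signal in flowgram)


flows = pyrosequence("AACGT")
print(flows[0])           # ('T', 2): two T's are incorporated in the first flow
print(call_bases(flows))  # TTGCA
```

In real data the signal for long homopolymer stretches is noisy, which is why distinguishing, say, five from six identical bases is a known difficulty of this chemistry; the sketch deliberately leaves that out.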
In this thesis we used pyrosequencing methods for validation, but Roche 454 Life Sciences
uses pyrosequencing methods for high-throughput sequencing.
1.3.2 Illumina: Sequencing-by-Synthesis with Reversible Fluorescent Terminators
Here we present an overview of the Illumina sequencing technology and the steps necessary
to sequence millions of DNA molecules per run, ECO (2007). The first step is to prepare
the sample to be sequenced. This varies depending on the experimental protocol (whole
genome sequencing, ChIP-Seq, etc.), but it almost always ends with a so-called library, i.e.,
a large number of 200-500 bp double-stranded DNA fragments. Then, Illumina-specific
adapters are ligated to the ends of each fragment, and the fragments are isolated
and amplified using limited cycles of PCR. The next step is called cluster generation, and it
involves the set of all DNA fragments from the sample preparation and an Illumina 8-channel
flow cell (figure 1.5), where each channel/lane contains single-stranded oligonucleotides that
correspond to the sequences of the adapters ligated during the sample preparation step. In
the cluster generation step, single-stranded adapter-ligated fragments are bound to the flow
cell surface and then amplified to form individual DNA-fragment clusters (more
details of this step in figure 1.6). The final step is the sequencing-by-synthesis step: the
flow cell is loaded into the sequencing machine for automated cycles of extension and
imaging. In each cycle, one fluorescent base is added to each cluster (in parallel),
and the base is called as A, T, C or G by an algorithm that identifies the emission
color. At the end of all cycles, all bases have been called for all the fragments present in
the lane. The machine outputs a fastq file, i.e., a file listing all the fragments present in
the lane, each with its base calls and their call qualities.
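To make the output format concrete, the following sketch reads records from a fastq stream and converts the quality string into per-base error probabilities. It assumes the common four-lines-per-record layout and the Phred+33 quality encoding; both are assumptions about the file at hand (older Illumina pipelines used a different offset), and the record shown is made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        quality_line = handle.readline().strip()
        # Assumption: Phred+33 encoding, i.e. quality = ASCII code - 33
        qualities = [ord(character) - 33 for character in quality_line]
        yield header[1:], sequence, qualities


def error_probability(phred_quality):
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-phred_quality / 10)


# A single made-up record
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for name, sequence, qualities in read_fastq(example):
    print(name, sequence, qualities)  # read1 ACGT [40, 40, 20, 0]
```

A Phred quality of 20 corresponds to a 1 in 100 chance of a wrong call and 40 to 1 in 10,000, which is the kind of per-base uncertainty a probabilistic model of the data can incorporate.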
Figure 1.5: Illumina flow cell.
Next, we describe the experimental protocols for the data we used throughout this thesis:
Whole-Genome sequencing, RNA-Seq, ChIP-Seq and DamID-Seq. Understanding
the experimental protocol is very important since it serves a twofold purpose: better
communication and understanding of the data, and the possibility of translating some of the
data generating mechanisms into appropriate probabilistic models.
Figure 1.6: Illumina Workflow.
1.3.3 Whole-Genome Sequencing
The Human Genome Project was a whole-genome sequencing project. The whole DNA
information contained in a mix of individuals' DNA was sequenced and transformed into a
large string of As, Ts, Cs and Gs, separated by chromosome. This information is now publicly
available online in the UCSC Genome Browser and many other sites. However,
the human genome was not the only genome to be sequenced; many more species
were sequenced. As of today more than 200 species have been sequenced, and there are still
ongoing major sequencing projects, such as the Metagenome project, the Cancer Genome Atlas
(TCGA), Personal Genomes, and the 1000 Genomes Project (1KGP).
It is worth mentioning that although most whole genome sequencing projects use similar
technologies, the objective may vary. For example, in the early days of sequencing the main
objective was to assemble the genomes of several species (among them humans). This is a hard
computational problem and requires the use of several sequencing technologies in addition
to efficient dynamic programming alignment algorithms - analogous to completing a jigsaw
puzzle. Many computational and statistical developments came from this area in the
1990s and early 2000s. Now that we have most reference sequences, projects like the TCGA
project rely on the human genome sequence information to characterize non-heritable genomic
variations in cancer samples. On the other hand, the aim of the metagenome project is
to create a reference set of genomes of all the microorganisms living in a certain environment
(gut, liver, etc.).
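As a reminder of what a dynamic programming alignment looks like, here is a minimal sketch of a global alignment score in the style of Needleman-Wunsch. The scoring scheme (match +1, mismatch -1, gap -1) is an arbitrary choice for illustration, and the sketch returns only the score; the assemblers used in genome projects rely on far more elaborate and efficient variants.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACT"))   # 2: three matches and one gap
print(global_alignment_score("ACGT", "ACGT"))  # 4
```

Each cell of the table depends only on its three neighbours, so the cost is proportional to the product of the two sequence lengths, which is what makes the approach feasible for read-sized sequences.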
The technology used today to sequence whole genomes is also based on Illumina sequencing
technologies: the whole DNA is sheared into fragments and a library is prepared as
the input to sequencing machines, such as the Illumina machine described in section 1.3.2.
1.3.4 RNA-Seq
RNA-Seq is an experimental protocol based on sequencing technologies that allows us
to estimate the abundance of all the RNA transcripts in a cell. It has quickly become
the standard protocol for quantifying gene expression, largely replacing microarray-based
experiments. This is due to a variety of reasons but, to us, the most compelling is the
analysis flexibility that RNA-Seq based experiments provide. With RNA-Seq experiments,
the application of NGS data has expanded considerably to studies of transcriptome assembly,
alternative splicing, allele-specific expression, etc. In addition, its dynamic range and noise
levels are considered better than those of its microarray-based counterparts.
In brief, an RNA-Seq experiment consists of taking a representative set of RNAs out
of the population of RNAs, shearing them, and then sequencing the fragments in order to
estimate the relative transcript abundances inside the cell. In more detail, following Wang
et al. (2009), the protocol is as follows. A set of representative RNA molecules is pulled down:
if we are interested only in coding sequences (sequences that are later translated into
protein), only poly(A)+ RNAs are selected; otherwise, all RNAs are pulled down.
Due to the limitations of current sequencing technologies these long RNA molecules
cannot be sequenced directly; therefore, they are fragmented into shorter
fragments. Next, given that Illumina machines do not sequence RNA, these short RNA molecules
are used to create cDNA through the use of reverse transcriptase (see footnote 3). The rest of
the protocol is as described in section 1.3.2: the fragmented cDNA molecules are prepared into
a library with adaptors and given to the sequencing machine for sequencing.
1.3.5 ChIP-Seq
In order to study cell regulation, in its 1.0 or 2.0 form, such as DNA-protein binding,
transcription factor binding, histone modification profiles, and DNase hypersensitive sites, among
others, ChIP-Seq is the current leading technology. Closely related technologies
such as protein binding microarrays and DamID arrays are also used, but we do not
describe them in this thesis.
ChIP-Seq stands for Chromatin Immunoprecipitation followed by sequencing.
3 To do this, random hexamer (6 bp) primers bind to their (random) complementary target sequences in different transcripts/regions, producing short cDNA fragments (range 5-50000, mean/mode 200-500).
Figure 1.7: RNA-Seq experimental protocol.
In short, a ChIP-Seq experiment starts with purifying the chromatin and cross-linking the DNA and the
protein of interest. Next, the chromatin is sonicated (or fragmented using MNase treatment)
and the cross-linked DNA-protein complex is pulled down using an antibody that recognizes
the protein of interest. Next, the cross-linking is reversed, separating the protein of interest
from the chromatin. Finally, the remaining DNA is purified and amplified, and the libraries are
prepared and sequenced as in section 1.3.2. A more detailed explanation of the experimental
protocol can be found in Furey (2012) and Park (2009).
Figure 1.8: ChIP-Seq experimental protocol.
Chapter 2
Allelic Imbalance and Allele-specific Expression
Every complete set of chromosomes contains the full code; so there are, as a
rule, two copies of the latter in the fertilized egg cell, which forms the earliest
stage of the future individual.
- Erwin Schrödinger
In this chapter we extend several RNA-Seq models, such as Trapnell et al. (2010) and Glaus
et al. (2011), to the case of allele-specific gene/isoform expression. In the case of experiments
for which we have the individual's genotype and its respective haplotype blocks, our method
can estimate the expression of each gene/isoform at the allelic level, but we lose the
parent-of-origin information. Moreover, in the case of experiments for which we have the paternal
and maternal genomes, our method can estimate the expression of each of the parental
alleles.
We present Bayesian and frequentist models of allelic expression in an RNA-Seq experiment,
combined with phasing information, such as paternal/maternal phasing in the study of
imprinting, or haplotype phasing in the case of population-level studies. Next, we provide a
computationally efficient EM algorithm to estimate allelic expression, and its Bayesian
counterpart, a Gibbs sampling approach to sample the allelic expression from its posterior. We
then define a differential-expression test for differences between the haplotypes, and
we provide a simulation study to show the performance of our approach in comparison with
other popular published methods, Robinson et al. (2010). The methods presented in this
chapter serve as the basis for the study of imprinting in chapter 3 and the study of ASE-cis-eQTL
and imprinting in chapter 4.
2.1 Introduction
Several models have been proposed for the estimation of gene/transcript expression
and/or alternative splicing using RNA-Seq experiments. The state-of-the-art methods can
be divided into two major groups. On the one side, a set of models that we call count-based
methods estimate gene expression by creating an artificial "gene model", where a gene is
defined as the intersection (or union) of all its possible transcripts, and then the expression
is estimated as the sum of all reads that map to such "gene model". It has been pointed out by
several authors that such models are undesirable since they do not take into consideration
the data generating process of RNA-Seq.
On the other side, several models such as Jiang and Wong (2009), Trapnell et al. (2010),
Glaus et al. (2011) and Turro et al. (2011), that we call transcript-based models, model the
RNA-seq data at the transcript-level, assigning reads to each of the transcripts (alternative
isoforms) of a gene according to their probabilities in the model. In such models, in order
to estimate the expression of a gene, one has to first estimate the expression of each of its
transcripts, and then the expression of a gene can be estimated as the weighted average
expression of each of the transcripts coming from such gene.
We believe that the transcript-based models are more in line with the biological reality
since they model the data at the transcript level. As Sidney Brenner pointed out with respect
to the discovery of genes: "I think the most important thing there was that immediately you
could say, boy if we could find out how the sequence of bases corresponds to the sequence of
amino acids, because now we could define the gene not just as a blob, not just as a bead on a
string, but we could define the gene now as a length of DNA". In other words, the
transcript-based models adhere to the physical reality of a gene, by which a specific sequence is copied
into an RNA transcript, and not to a "hypothetical construct", or so-called "gene model".
Furthermore, since in transcript-based models the data is modeled at the molecular level
of transcripts, they incorporate an explicit parameter for the relative abundance of
transcripts in their likelihood, making the results easier to interpret. In other words, cells have
transcripts inside the nucleus, not reads or counts, and having a parameter that estimates
the relative abundance of transcripts is a clear advantage.
However, most transcript-based models, except Turro et al. (2011), can only quantify
transcript expression in the single-chromosome case, where only a single measure of
expression per transcript is calculated. This is a simplification of the biology since humans are
diploid organisms and have two allelic copies of each transcript: the one inherited from the father
and the one inherited from the mother. More specifically, we have 23 pairs of chromosomes,
where one chromosome of each pair was inherited from each parent.
2.1.1 Biological mechanisms of Allelic Imbalance
Allelic imbalance (AI) is a phenomenon where the two alleles of a given gene are expressed
at different levels in a given cell, either because of epigenetic inactivation of one of the two
alleles, or because of genetic variation in regulatory regions. The major known phenomena
of allelic-imbalance are X-chromosome inactivation, mitotically stable autosomal monoallelic
expression, imprinting and cis-regulation at the allele-specific level.
X-inactivation is a process by which cells in the epiblast (cells that give rise to the
embryo) randomly choose one of the X-chromosomes to become inactive, and this decision
is maintained throughout the lineage of a given cell, giving rise to mosaic patterns of
expression across tissues in the developed organism. The icon of X-inactivation is the female
calico cat, which exhibits patches of black and orange fur due to the random inactivation
of one of the alleles of the X-located gene responsible for the coloration of its fur, in which
one allele gives rise to black fur and the other allele gives rise to orange fur. Autosomal
monoallelic expression is similar to X-inactivation in the sense that some cells randomly
choose to inactivate one of the alleles, while other cells of the same type choose the opposite
allele. Also, as in X-inactivation, this decision is maintained throughout the divisions of the
same cell, but different clonal populations choose the inactive allele randomly. Imprinting,
on the other hand, is a very different mechanism: it occurs in the cells of the germline, and
it is parent-of-origin specific, in the sense that either the paternal or the maternal allele is
inactivated. Finally, the last source of allelic imbalance is cis-regulation at the allele-specific
level.
In summary, in a single cell, in females, one would always observe all of the genes of
a randomly chosen X-chromosome to be inactive, several autosomal genes to be randomly
silenced, hundreds of genes to be silenced in the germline in a parent-of-origin manner, and on
the order of hundreds to thousands of genes to exhibit some allelic bias due to cis-regulatory
variation.
any direction, depending on LD between the causal variant and the gene location
Table 2.1: Known Biological mechanisms leading to allelic imbalance.
In this chapter we present a model to estimate ASE (allele-specific expression) using
RNA-Seq data from a single experiment. Furthermore, these estimates can be used to study
some of the described mechanisms of allelic imbalance in a more detailed, genome-wide
manner. In population studies, for which we have genotypic information, they can help us
understand allele-specific effects across several individuals, and how allelic differences (see
footnote 1) affect expression. It has been shown that there are several allelic differences that
affect disease risk and, potentially, response to drug treatment. This is of key importance in
GWAS, eQTL studies, drug response, and the still pending promise of personalized medicine.
We pursue this effort in chapter 4 of this thesis.
Furthermore, in chapter 3 we show how, by combining RNA-Seq data with whole genome
sequencing of the offspring's parents, we can quantify parent-of-origin expression and therefore
delve into the biology (and effects) of imprinting.
Nevertheless, it has been a long road to be able to study allele-specific expression and
parent-of-origin expression at a genome-wide scale. In the earlier days of microarray
experiments for gene expression this was not feasible since one would have needed allele-specific
probes to quantify differential expression of alleles. However, with NGS technologies, by
combining RNA-Seq experiments with SNP arrays or whole genome sequencing of the parents,
studies of allele-specific expression, or parent-of-origin expression, respectively, have become
more common. Although conceptually possible, such studies present several computational and
statistical challenges that are addressed in this thesis: mapping bias to a reference
genome, Degner et al. (2009), phasing of the reads, and high uncertainty due to read ambiguity
with respect to the allele of origin.
1 By different alleles we mean different sequences of a transcript due to natural variation in the population.
2.1.2 Our Estimand of Interest, and the Data
Our estimand of interest is the difference in expression between the two alleles of a gene
(see footnote 2). Let us call this estimand the allelic difference. In order to estimate the allelic
difference we need two sources of information. We first need RNA-Seq data from each individual,
and next we need the paternal and maternal genomes, or the phased alleles of each individual.
In this chapter we use the paternal and maternal genomes to build the parental-specific
isoforms, and then we use the RNA-Seq data to quantify their respective expression.
More specifically, we assume that we have the following data: the phased diploid genomes,
the set of all (possible) transcripts (this is called the annotation table) that could have been
transcribed from each of the alleles, and the set of reads that were generated by the RNA-Seq
experiment.
As described in section 1.3.4, we do not directly observe the number of mRNA copies
inside the cell, but rather a set of reads that were generated from the experiment. Thus, the
data generating process (DGP) can be viewed as a two-stage sampling process: (1) transcripts
are sampled from the cell, and subsequently (2) fragments/reads are sampled from the transcripts.
Let us denote the set of all observed reads as $R$, and the set of all possible transcripts as
$T$. Our estimand of interest in an RNA-Seq experiment is the relative abundance, $\rho_t$, of a
given transcript $t \in T$. In an RNA-Seq experiment we only observe whether a given read maps to a
given transcript, but we do not know which transcript originated the read. This information
can be summarized in the (observed) indicator matrix $I$, where $I_{ijk} = 1$ if read $i$ was mapped
to transcript $j$, allele $k$, and 0 otherwise. Similarly we can define the (unobserved) indicator
matrix $Z$, where $Z_{ijk} = 1$ if read $i$ originated from transcript $j$, allele $k$, and 0 otherwise.
The matrix $I$ can be observed when the reads are mapped back to the paternal and maternal
genomes using short-read aligners, whereas the matrix $Z$ is unobserved because it represents
the true data generating process by which a given transcript was fragmented into a given set
of reads. We summarize these ideas and the notation in a diagram of the DGP, figure 2.1.
2 Throughout this thesis we define the alleles of a gene as the two copies of a gene in a given individual, i.e., the maternal copy of the gene and the paternal copy of the gene.
Figure 2.1: A simplified view of the RNA-Seq data generating process. The reality is slightly more complex because there are multiple cells, and there is uneven coverage of reads due to sequencing bias, random primers being not so random, and easier-to-access regions due to RNA secondary structure, among other complexities.
2.1.3 Normalized Measures of RNA expression
It follows from the description of the DGP that, without knowledge (e.g., from spike-ins) of the
total abundance of transcript molecules inside the cell, it is not possible to estimate the absolute
abundance of transcripts, and therefore RNA-Seq experiments can at best provide
measures of the relative abundance of transcripts. We denoted this measure by $\rho_t$ in the
description of the DGP, and we defined it as the relative abundance of transcript $t$ in the
cell.
In the current literature there are two widely accepted normalized (across samples)
measures of expression for $\rho_t$: Reads Per Kilobase per Million mapped reads (RPKM) and
Transcripts Per Million (TPM). However, studies have shown on theoretical grounds
that RPKM is less interpretable, that it is biased when comparing abundances between samples,
and that it may inflate the number of false positives when comparing two samples, Wagner et al.
(2012).
We can go one step further and define this measure in physical terms. $\rho_t$ is the relative
molar RNA concentration (rmc) of each transcript $t$, defined as
$$\mathrm{rmc}_t = \frac{\mathrm{mRNA}_t}{\mathrm{mRNA}_{total}}, \qquad \text{with} \qquad \mathrm{mRNA}_{total} = \sum_{t \in T} \mathrm{mRNA}_t.$$
It is straightforward to show that $\widehat{\mathrm{rmc}} = 1/|T|$, i.e., the average
rmc is a constant equal to 1 over the total number of transcripts in the experiment. In
theory, we would like $\rho_t$ to be directly proportional to rmc. Thus, our interest is to know
which of the two measures, TPM or RPKM, exhibits such a property. Let us set aside the
multiple alignment issue for the sake of comparing the two measures. Then,
$$\mathrm{RPKM}_t = 10^9 \times \frac{c_t}{L'_t N} \tag{2.1}$$
where $c_t$ is the number of reads mapping to transcript $t$, $L'_t$ is the effective length of transcript
$t$, and $N$ is the number of mappable reads in the sequencing experiment. It is easy to show
that the sum of RPKMs depends on $N$, the number of mapped reads; therefore it is platform
dependent and thus not directly proportional to $\mathrm{rmc}_t$.
On the other side, TPM is defined as:
$$\mathrm{TPM}_t = 10^6 \times Z \times \frac{c_t}{L'_t N} \tag{2.2}$$
where $Z$ is a normalization factor added in order to make this measure technology-independent.
It can easily be shown that this measure is directly proportional to
$\mathrm{rmc}_t$, and it is thus our preferred measure of expression.
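As an illustration of equations (2.1) and (2.2), the following is a minimal Python sketch on made-up counts. It assumes that $N$ is the sum of the per-transcript counts and that the normalization factor is $Z = N / \sum_j (c_j / L'_j)$, which is the choice that makes the TPM values of a sample sum to one million; the text does not give $Z$ explicitly, so this is an assumption, and the function names are hypothetical.

```python
def rpkm(counts, eff_lengths):
    """RPKM_t = 1e9 * c_t / (L'_t * N), following equation (2.1).

    Assumption: N is taken as the sum of the per-transcript counts.
    """
    n_mapped = sum(counts)
    return [1e9 * c / (l * n_mapped) for c, l in zip(counts, eff_lengths)]


def tpm(counts, eff_lengths):
    """TPM_t = 1e6 * Z * c_t / (L'_t * N), following equation (2.2).

    Assumption: Z = N / sum_j(c_j / L'_j), so that the values sum to 1e6.
    """
    n_mapped = sum(counts)
    z = n_mapped / sum(c / l for c, l in zip(counts, eff_lengths))
    return [1e6 * z * c / (l * n_mapped) for c, l in zip(counts, eff_lengths)]


# Toy example: three transcripts with made-up counts and effective lengths.
counts = [100, 200, 700]
eff_lengths = [1000.0, 2000.0, 3500.0]

print(sum(tpm(counts, eff_lengths)))   # fixed total, by construction
print(sum(rpkm(counts, eff_lengths)))  # total depends on the sample
```

Under this choice of $Z$ the TPM total is the same for every sample, whereas the RPKM total changes with the counts and lengths of the sample at hand.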
In the case of allele-specific effects we are interested in measuring $y_t = \rho_{tp} - \rho_{tm}$, i.e.,
the difference in expression between the paternal and maternal alleles. For such cases, it has
been argued that the choice of measure is not important. Nevertheless, this assumption
is not true. Let us assume we ran $E$ experiments and we know that the allele-specific effect
is the same in each of the experiments. Then, if we use RPKM as the measure for $\rho_t$, we
have $y_t^{(i)} = \alpha^{(i)} \times (\mathrm{rmc}_{tp} - \mathrm{rmc}_{tm})$, $i = 1, \dots, E$, and therefore our final estimate of the
allele-specific effect will have a bias factor given by $\alpha$. The bias factor $\alpha$ is a function of the
sequencing depth in each of the experiments, so it follows that the difference between the
paternal and maternal RPKMs is not directly proportional to the difference between the paternal
and maternal relative molar concentrations.
2.2 Computational Workflow
A key step in any analysis of next generation sequencing data is the computational
workflow to properly process the data. One has to develop computationally efficient, robust and
carefully designed computational workflows because these experiments tend to produce large
amounts of data. In this thesis, we analyzed more than 100 terabytes of data, and without
the aid of high performance computing clusters and other well-established algorithms many
of the proposed models would not be feasible to implement. Here we describe what we
consider the optimal approach to process large amounts of RNA-Seq data into a format that is
suitable for our model and can serve as an input for our algorithm.
As we mentioned earlier, we assume that we have the set of all reads generated by the
RNA-Seq experiment. This is standard and is contained within a FASTQ file, where each group
of 4 lines contains the read id, the read sequence, a separator line, and the per-base read
quality. This file needs to be inspected for quality control, such that reads contaminated
during the library preparation steps and/or reads with low quality bases are filtered out. Once
this QC is done we have a new, higher confidence set of reads, but we still do not know where in
the genome they originated. Thus, a key step is the alignment of the reads to the reference genome.
In this step one has to keep in mind that the reads originated from transcripts, and
therefore the reads that cover exon junctions will map to regions that are far apart in the genome.
Because of this, it is recommended to map the reads to the genome using 'transcriptome-aware'
aligners, such as TopHat and STAR. Another common strategy is to create a file
with all the annotated transcripts and their respective sequences and map the reads to the
sequences of all transcripts. In our workflow we decided to use transcriptome-aware aligners.
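The quality-control step can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in this thesis: it assumes Phred+33 quality encoding, and the mean-quality threshold of 20 and the function names are hypothetical choices.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean base quality, assuming Phred+33 encoding (an assumption)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def quality_filter(records, min_mean_quality=20.0):
    """Keep reads whose mean base quality reaches the (hypothetical) threshold."""
    for read_id, seq, qual in records:
        if qual and mean_phred(qual) >= min_mean_quality:
            yield read_id, seq, qual


# Two toy records: 'I' encodes quality 40 and '!' encodes quality 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
kept = [read_id for read_id, _, _ in quality_filter(read_fastq(example))]
print(kept)  # ['read1']
```

In practice dedicated QC and trimming tools are used for this step; the sketch only shows the structure of the file and the kind of filter involved.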
However, in our case we are interested in allele-specific expression and therefore our
transcriptome is double the size of the haploid transcriptome. Therefore, we need to use
'in-silico' parental genomes and map the reads generated from the RNA-Seq experiment to
each genome independently. This approach has two advantages: it allows us to phase the
reads, meaning to identify which genome, paternal or maternal, the reads came from;
and at the same time it helps us overcome the bias that arises when mapping reads
to a haploid genomic reference.
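To make the idea of phasing concrete, here is a deliberately simplified Python sketch. It assumes that for each read we already have the number of mismatches of its best alignment to each in-silico parental genome (or None if it did not align); the function name and the hard-assignment rule are hypothetical. In the models of this chapter, ambiguous reads are not discarded but kept as candidates for both alleles in the indicator matrix $I$.

```python
def phase_read(mismatches_paternal, mismatches_maternal):
    """Label a read by comparing its two independent alignments.

    Each argument is the mismatch count of the read's best alignment to that
    parental genome, or None if the read did not align to it.
    """
    if mismatches_paternal is None and mismatches_maternal is None:
        return "unmapped"
    if mismatches_maternal is None:
        return "paternal"
    if mismatches_paternal is None:
        return "maternal"
    if mismatches_paternal < mismatches_maternal:
        return "paternal"
    if mismatches_maternal < mismatches_paternal:
        return "maternal"
    return "ambiguous"  # equally good on both genomes: no allele information


print(phase_read(0, 1))     # paternal
print(phase_read(2, 2))     # ambiguous
print(phase_read(None, 0))  # maternal
```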
A short summary of the computational workflow is presented below.
Figure 2.2: Computational Workflow to process RNA-Seq data for the analysis of allele-specific expression.
2.3 Frequentist Model of Allele-Specific Expression
After describing the data and the DGP we are ready to define the likelihood function for
our RNA-Seq model:
$$L(\rho) = p(R \mid \rho) = \prod_{i=1}^{N} \sum_{j=1}^{M} \sum_{a=1}^{2} \frac{\rho_{ja} \times L_{ja}}{\sum_{j'=1}^{M} \sum_{a'=1}^{2} \rho_{j'a'} \times L_{j'a'}} \times I_{ija} \times p(r_i \mid I_{ija}) \tag{2.3}$$
where,
$\rho = \{\rho_{ja}\}$, $j = 1, \dots, M$, $a = 1, 2$, is the relative abundance of each allele of each of the $M$ transcripts in the annotation,
and $L_{ja}$ is the length of transcript $j$, allele $a$.
$R = \{r_1, \dots, r_N\}$ is the set of all reads generated by the RNA-Seq experiment.
$I$ is the matrix of 0s and 1s, where $I_{ija} = 1$ indicates that read $i$ was mapped to transcript $j$, allele $a$.
The first term in the likelihood represents the probability of sampling reads from a given
transcript in the RNA-Seq experiment. This probability is a combination of a transcript’s
length and abundance. I.e., more fragments will be sampled from abundant transcripts
than less abundant transcripts. But also, because of the fragmentation step in the RNA-
Seq experiment, more fragments will be generated from long transcripts than from short
transcripts.
The second term, $p(r_i \mid I_{ija})$, represents the probability that read $i$ originated from
transcript $j$, allele $a$. Intuitively, this probability models the read- and transcript-specific
biases that have been observed in RNA-Seq experiments. Here we use the fragment bias
proposed by Roberts et al. (2011).
We also want to emphasize the importance of understanding the sequencing experiment
since this gives us good insight into the appropriate probabilistic model for our data.
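The likelihood in equation (2.3) can be evaluated directly on a small example. The Python sketch below is illustrative only: the data layout (one dictionary per read, holding $p(r_i \mid I_{ija})$ for the transcript alleles the read mapped to) and the function names are assumptions made for this example, and the bias terms are set to 1 rather than computed from a fragment-bias model.

```python
import math


def theta_from_rho(rho, lengths):
    """First term of (2.3): rho_ja * L_ja, normalized over all (j, a)."""
    total = sum(rho[key] * lengths[key] for key in rho)
    return {key: rho[key] * lengths[key] / total for key in rho}


def log_likelihood(rho, lengths, alignments):
    """Log of equation (2.3).

    alignments[i] maps (transcript, allele) -> p(r_i | I_ija) for every
    transcript allele read i mapped to; entries with I_ija = 0 are absent.
    """
    theta = theta_from_rho(rho, lengths)
    return sum(
        math.log(sum(theta[key] * c for key, c in read.items()))
        for read in alignments
    )


# Toy example: one transcript, two alleles of equal length and abundance.
PAT, MAT = ("t1", "pat"), ("t1", "mat")
rho = {PAT: 0.5, MAT: 0.5}
lengths = {PAT: 1000, MAT: 1000}
alignments = [
    {PAT: 1.0},            # read overlapping a SNP: maps to one allele only
    {PAT: 1.0, MAT: 1.0},  # read without SNPs: maps to both alleles
]
print(log_likelihood(rho, lengths, alignments))
```

The first read contributes $\log 0.5$ and the second contributes $\log 1 = 0$, which shows how only allele-informative reads discriminate between the two alleles.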
2.3.1 Maximum Likelihood Estimation
We use the EM algorithm, Dempster et al. (1977), to obtain estimates for the parameter
of interest, $\rho$.
To simplify the notation of the EM algorithm, let us define:
$$p(r_i \mid I_{ija}) = c_{ija} \tag{2.4}$$
and,
$$\theta_{ja} = \frac{\rho_{ja} \times L_{ja}}{\sum_{j'=1}^{M} \sum_{a'=1}^{2} \rho_{j'a'} \times L_{j'a'}} \tag{2.5}$$
Note that,
$$\sum_{j=1}^{M} \sum_{a=1}^{2} \theta_{ja} = 1 \tag{2.6}$$
In addition, there is a 1-1 transformation between $\theta_{ja}$ and $\rho_{ja}$, lemma A.0.1.
Thus, in the simplified notation, the likelihood can be rewritten as,
$$L(\rho) = p(R \mid \rho) = \prod_{i=1}^{N} \sum_{j=1}^{M} \sum_{a=1}^{2} \theta_{ja} \times I_{ija} \times c_{ija} \tag{2.7}$$
Then, augmenting the observed data with the missing data $Z$, we obtain the complete-data likelihood,
$$p(R, Z \mid \theta) = \prod_{i=1}^{N} \prod_{j=1}^{M} \prod_{a=1}^{2} (\theta_{ja} \times c_{ija})^{z_{ija}} \tag{2.8}$$
Thus, the complete-data likelihood is the product of the read probabilities as if we knew
which transcript generated each read.
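Now we can write the $k$-th iteration of the EM algorithm as follows,
(E-Step)
$$Q(\theta \mid \theta^{(k)}) = E_{Z \mid \theta^{(k)}} \log P(R, Z \mid \theta) = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{a=1}^{2} E\left[z_{ija} \mid \theta^{(k)}\right] \times \log P(r_i, z_{ija} \mid \theta_{ja}) \tag{2.9}$$
And,
$$E\left[z_{ija} \mid \theta^{(k)}\right] = \frac{c_{ija} \times \theta_{ja}^{(k)}}{\sum_{j'=1}^{M} \sum_{a'=1}^{2} c_{ij'a'} \times \theta_{j'a'}^{(k)}} \tag{2.10}$$
(M-Step)
$$\theta^{(k+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(k)}) \tag{2.11}$$
Thus,
$$\theta_{ja}^{(k+1)} = \frac{\sum_{i=1}^{N} E\left[z_{ija} \mid \theta^{(k)}\right]}{N} \tag{2.12}$$
where $E\left[z_{ija} \mid \theta^{(k)}\right]$ is the same as in (2.10).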
Note. This model may be weakly identifiable in cases where the two alleles have few
differences, in terms of the number of SNPs or indels. In such cases, the EM algorithm still
guarantees convergence to, at least, a local maximum. Moreover, the estimate of the identifiable
sum, $\theta_{j1} + \theta_{j2}$, is guaranteed to be its maximum likelihood estimate.
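The E-step and M-step updates above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this thesis: the data layout (one dictionary of $c_{ija}$ values per read), the uniform initialization, the stopping rule, and the function names are all assumptions. The helper `rho_from_theta` inverts the length weighting under the additional assumption that the $\rho$ values sum to 1.

```python
def em_allelic_expression(alignments, keys, n_iter=200, tol=1e-10):
    """Iterate the E-step and M-step updates for theta.

    alignments[i] maps (transcript, allele) -> c_ija for the transcript
    alleles read i mapped to; keys lists every (transcript, allele).
    """
    theta = {key: 1.0 / len(keys) for key in keys}
    n_reads = len(alignments)
    for _ in range(n_iter):
        expected = {key: 0.0 for key in keys}
        for read in alignments:
            # E-step: posterior probability that each candidate generated the read.
            weights = {key: c * theta[key] for key, c in read.items()}
            norm = sum(weights.values())
            for key, w in weights.items():
                expected[key] += w / norm
        # M-step: expected fraction of reads generated by each transcript allele.
        new_theta = {key: expected[key] / n_reads for key in keys}
        change = max(abs(new_theta[key] - theta[key]) for key in keys)
        theta = new_theta
        if change < tol:
            break
    return theta


def rho_from_theta(theta, lengths):
    """Undo the length weighting, assuming the rho values sum to 1."""
    scaled = {key: theta[key] / lengths[key] for key in theta}
    total = sum(scaled.values())
    return {key: value / total for key, value in scaled.items()}


# Toy data: 3 reads unique to the paternal allele, 1 unique to the maternal
# allele, and 4 reads that map equally well to both.
PAT, MAT = ("t1", "pat"), ("t1", "mat")
alignments = [{PAT: 1.0}] * 3 + [{MAT: 1.0}] + [{PAT: 1.0, MAT: 1.0}] * 4
theta_hat = em_allelic_expression(alignments, [PAT, MAT])
print(theta_hat)
print(rho_from_theta(theta_hat, {PAT: 1000, MAT: 1000}))
```

In this toy data set the update reduces to $\theta_{pat} \leftarrow (3 + 4\,\theta_{pat})/8$, whose fixed point is $\theta_{pat} = 0.75$: the ambiguous reads are split in the same 3:1 ratio as the allele-informative reads.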
2.3.2 Bootstrap approach to obtain Confidence Intervals
The EM algorithm gives us only point estimates of the parameter of interest, $\rho_t$, the
relative abundance of transcript $t$. Since in our probabilistic model
reads are generated from a given transcript according to its length $l_t$ and relative
abundance $\rho_t$, we can use the point estimates of $\rho_t$ to reassign each read to a given transcript.
In more detail, our resampling approach is as follows. First, for each read we compute
the probability of being generated by a given transcript (out of all the transcripts that the
read maps to). Let us call this probability $p_{ij} = P(Z_{ij} = 1)$, the probability that read
$i$ was generated by transcript $j$,
$$p_{ij} = \frac{c_{ij} \rho_j}{\sum_{j' : \{I_{ij'} = 1\}} c_{ij'} \rho_{j'}} \tag{2.13}$$
Next, we take each read and assign it to a given transcript according to its probability $p_{ij}$.
The resampling method proposed here relies on randomly assigning reads to transcripts
(in adherence with the RNA-Seq DGP) in order to obtain the variance and covariance of
our abundance estimates for each transcript. Intuitively, if transcript A shares many reads
with transcript B their estimates will be highly correlated, and if transcript A has very few reads
and/or many sequences shared with other genomic regions it will exhibit high variance in its
abundance estimate.
Note: Without loss of generality we specified one transcript per locus, but in the
allele-specific case we always deal with two alleles per transcript.
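The resampling scheme can be sketched in Python as follows. This is a sketch under stated assumptions rather than the procedure used in this thesis: each resample reassigns every read according to $p_{ij}$ from equation (2.13), and the abundance summary recorded per resample is simply the fraction of reads assigned to each transcript, which is an assumption since the text does not specify it. Function names and the number of resamples are hypothetical.

```python
import random


def assignment_probabilities(read, rho):
    """p_ij from equation (2.13) for one read (read maps transcript -> c_ij)."""
    weights = {j: c * rho[j] for j, c in read.items()}
    norm = sum(weights.values())
    return {j: w / norm for j, w in weights.items()}


def resample_fractions(alignments, rho, n_resamples=200, seed=0):
    """Randomly reassign reads and record the read fraction per transcript."""
    rng = random.Random(seed)
    transcripts = sorted(rho)
    probs = [assignment_probabilities(read, rho) for read in alignments]
    draws = []
    for _ in range(n_resamples):
        counts = dict.fromkeys(transcripts, 0)
        for p in probs:
            names = list(p)
            chosen = rng.choices(names, weights=[p[j] for j in names])[0]
            counts[chosen] += 1
        draws.append([counts[j] / len(alignments) for j in transcripts])
    return transcripts, draws


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Toy example: transcripts A and B share all 20 reads.
rho_hat = {"A": 0.5, "B": 0.5}
shared_reads = [{"A": 1.0, "B": 1.0}] * 20
names, draws = resample_fractions(shared_reads, rho_hat)
frac_a = [d[0] for d in draws]
frac_b = [d[1] for d in draws]
print(covariance(frac_a, frac_a), covariance(frac_a, frac_b))
```

Because every read is shared, each read gained by A is lost by B, so the two resampled fractions are strongly (negatively) correlated, which is the situation described in the paragraph above.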
2.4 Bayesian Model of Allele-Specific Expression
One advantage of defining a Bayesian model for allele-specific expression is that we can
directly sample from the posterior, thus avoiding the use of a bootstrap approach to obtain
the standard error of our expression estimates.
The model presented here is an extension of the BitSeq model, Glaus et al. (2011), to
the allele-specific case, where the main difference is that now the indicator matrix $I$ has
twice as many transcripts as in the non-allele-specific case.
In order to build the allele-specific expression model we use the same notation as in
figure 2.1. We define our model in terms of $\theta$ instead of $\rho$ in order to
simplify the sampling steps. However, there is a 1-1 transformation between both parameters,
lemma A.0.1. Also, to simplify the notation, we refer to the number of transcripts as $T$,
which in the allele-specific setting is twice the number of transcripts in
the haploid setting of the BitSeq model.
Thus, we can write the joint distribution of the observed data, the missing data and the
parameters as,
$$\begin{aligned}
p(R, I, Z, \theta) &= \prod_{i=1}^{N} p(r_i \mid Z_i, I_i, \theta) \times p(Z_i, I_i, \theta) \\
&= \prod_{i=1}^{N} p(r_i \mid Z_i, I_i, \theta) \times p(Z_i \mid I_i, \theta) \times \pi(\theta) \\
&= \prod_{i=1}^{N} p(r_i \mid Z_i, I_i) \times p(Z_i \mid I_i, \theta) \times \pi(\theta)
\end{aligned} \tag{2.14}$$
where,
• $Z_i$ corresponds to the categorical variable indicating which transcript generated read
$i$: $Z_i = t$ such that $I_{it} \neq 0$.
• $\theta = \{\theta_1, \dots, \theta_T\}$ is the relative abundance of reads generated by each of the $T$
transcripts.
• $p(r_i \mid Z_i, I_i)$ can be interpreted as the probability of an observed alignment. We model
the probability of an observed alignment as the product of the probability of sequencing a read
from a given position and the probability of sequencing a read sequence with its matches and
mismatches with respect to the genomic reference. Thus, this probability corrects for
positional, alignment-mismatch, and sequencing biases. In the case of a single-end
experiment this probability can be formulated as $p(r_i \mid Z_i, I_i) = p(\text{position} \mid Z_i, I_i) \times
p(\text{read sequence} \mid Z_i, I_i)$. In the case of a paired-end experiment we also
incorporate the fragment length, $L$, and its distribution. Thus, in this case, $p(r_i \mid Z_i, I_i) =
p(\text{position} \mid L, Z_i, I_i) \times p(L \mid Z_i, I_i) \times p(\text{read1 sequence} \mid Z_i, I_i) \times p(\text{read2 sequence} \mid Z_i, I_i)$.
• $\pi(\theta)$ is the prior probability for the relative abundance of reads generated by a given
transcript. We model $\pi(\theta)$ as,
$$\pi(\theta) = \mathrm{Dir}(\alpha) \tag{2.15}$$
The general version of the model also incorporates the probability that a given read was
generated by noise and not by any of the transcripts it aligned to. In order to do so,
we define a latent variable $Z_{noise}$ and the associated probability that the read was not
generated by any transcript, $\theta_{noise}$. When adding this extra latent variable and noise
parameter, the joint distribution of the data becomes,
$$p(R, I, Z, \theta, Z_{noise}, \theta_{noise}) = \prod_{i=1}^{N} p(r_i \mid Z_i, I_i) \times p(Z_i \mid I_i, Z_{noise}, \theta) \times p(Z_{noise} \mid \theta) \times \pi(\theta) \times \pi(\theta_{noise}) \tag{2.16}$$
Without loss of generality we will work with the model without the extra noise term in
the next sections, but our implementation runs with the noise term.
2.4.1 Gibbs Sampler
Let us write the joint distribution of the observed data, the latent variables, and the
model parameters,
$$p(R, I, Z, \theta) = \prod_{i=1}^{N} p(r_i \mid Z_i, I_i) \times p(Z_i \mid I_i, \theta) \times \pi(\theta) \tag{2.17}$$
In the Gibbs sampler we first assign a read to a given transcript, i.e., we sample $Z_i \mid I_i, \theta$,
and next we sample the transcript abundance given the read assignment, i.e., $\theta \mid Z_i$. Doing
this iteratively guarantees convergence to the joint posterior distribution.
Nevertheless, this strategy can be quite slow since we have to sample
read assignments and transcript abundances in a read-by-read manner. A faster approach would be to
use the collapsed Gibbs sampler version of this algorithm.
Algorithm 1 Gibbs Sampler for estimating allele-specific expression parameters $\theta$
1. Initialize the parameters.
2. Sample each parameter iteratively (until convergence) from its full conditional, as follows.
For alignment of read $i = 1, \dots, N$
(a) Sample $Z_i \mid I_i, \theta \sim \mathrm{Cat}(Z_i \mid \phi_i)$, where $\phi_{it} \propto p(r_i \mid I_i, Z_i) \times \theta_t$, $t = 1, \dots, T$. Note that $\phi_{it}$ needs to be normalized such that $\sum_{t : \{I_{it} \neq 0\}} \phi_{it} = 1$, i.e., the read can only have been generated by one of the transcripts it aligned to.
(b) Sample $\theta \mid Z_i \sim \mathrm{Dir}(\theta \mid \alpha + C_1, \dots, \alpha + C_T)$, where $C_t = \sum_{i=1}^{N} \delta(Z_i = t)$, i.e., $C_t$ counts the total number of reads that were assigned as generated by transcript $t$ in a given iteration.
3. Take only the samples after the burn-in and compute posterior statistics for the parameters of interest.
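Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used in this thesis: it omits the noise term, uses a symmetric Dirichlet prior with a hypothetical default of $\alpha = 1$, and updates $\theta$ once per sweep after all read assignments have been sampled (a standard blocked variant, chosen here for simplicity). The data layout and function names are assumptions.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def gibbs_allelic_expression(alignments, n_transcripts, alpha=1.0,
                             n_iter=2000, burn_in=500, seed=0):
    """alignments[i] maps transcript index t -> p(r_i | I_i, Z_i = t)."""
    rng = random.Random(seed)
    theta = [1.0 / n_transcripts] * n_transcripts
    samples = []
    for it in range(n_iter):
        counts = [0] * n_transcripts
        for read in alignments:
            # Step (a): sample the transcript that generated this read.
            candidates = list(read)
            phi = [read[t] * theta[t] for t in candidates]  # choices() normalizes
            z = rng.choices(candidates, weights=phi)[0]
            counts[z] += 1
        # Step (b): sample theta given the assignment counts C_t.
        theta = sample_dirichlet([alpha + c for c in counts], rng)
        if it >= burn_in:
            samples.append(theta)
    return samples


def posterior_mean(samples):
    """Component-wise mean of the retained samples."""
    n = len(samples)
    return [sum(s[t] for s in samples) / n for t in range(len(samples[0]))]


# Toy data: index 0 = paternal allele, index 1 = maternal allele.
reads = [{0: 1.0}] * 30 + [{1: 1.0}] * 10 + [{0: 1.0, 1: 1.0}] * 20
print(posterior_mean(gibbs_allelic_expression(reads, n_transcripts=2)))
```

The retained samples can also be used for the correlation-based diagnostics discussed later in this chapter, since each sample is a full draw of $\theta$.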
2.5 Hierarchical Model of Allele-Specific Expression across Multiple Experiments
We have presented methods to obtain expression estimates per experiment, but in several
situations we may want to estimate the grand mean across all experiments. In the hierarchical
model we assume exchangeability across experiments with the same factors, since there
is no reason to expect experiment-level differences when the factors are shared. The
hierarchical model borrows information across exchangeable experiments, and it tends to give
better predictions than complete pooling or no pooling, Gelman (2006).
This model is similar to the previous model, but now we impose a common hyperprior
distribution on each $\theta^{(e)}$, i.e., the abundance of reads coming from a given transcript in a
given experiment. We show a graphical representation of the model in figure 2.3. Thus, we
can write the joint distribution of the parameters and the observed data as,
$$p(R, I, Z, \theta, \mu, \sigma) = p(\sigma) \times p(\mu) \times \prod_{e=1}^{E} \left\{ \pi\left(\theta^{(e)} \mid (\mu, \sigma)\right) \times \prod_{i=1}^{N} p\left(r_i^{(e)} \mid Z_i^{(e)}, I_i^{(e)}\right) \times p\left(Z_i^{(e)} \mid I_i^{(e)}, \theta^{(e)}\right) \right\} \tag{2.18}$$
where,
• $Z_i^{(e)}$ corresponds to the categorical variable indicating which transcript generated read
$i$ in the $e$-th experiment: $Z_i^{(e)} = t$ such that $I_{it}^{(e)} \neq 0$.
• $\theta^{(e)} = \{\theta_1^{(e)}, \dots, \theta_T^{(e)}\}$ is the relative abundance of reads generated by each of the $T$
transcripts in experiment $e$.
• $p(r_i^{(e)} \mid Z_i^{(e)}, I_i^{(e)})$ can be interpreted as the probability of an observed alignment for a
given read in a given experiment.
• $\pi\left(\theta^{(e)} \mid (\mu, \sigma)\right)$ is the prior probability for the relative abundance of reads generated by
a given transcript in a given experiment. We model this probability as $\mathrm{Dir}(\mu \cdot \sigma)$.
• $p(\mu)$ is the prior probability for the hyperparameter $\mu$. The hyperparameter $\mu$ can be
interpreted as the grand mean of the abundance of reads for each transcript across
all experiments. This parameter is shared among all exchangeable experiments.
• $p(\sigma)$ is the prior probability for the hyperparameter $\sigma$. The hyperparameter $\sigma$ can be
interpreted as a scaling factor, allowing different scaling between different experiments.
Now, using the joint distribution we derive a Metropolis-within-Gibbs sampler to obtain
the posterior estimates for each of the parameters in the model,
Algorithm 2 Metropolis-within-Gibbs Sampler for estimating the grand mean across all experiments
1. Initialize the parameters.
2. Sample each parameter iteratively (until convergence) from its full conditional, as follows.
For experiment $e = 1, \ldots, E$
For alignment of read $i = 1, \ldots, n_e$
(a) Sample $Z_i^{(e)} \mid I_i^{(e)}, \theta^{(e)} \sim \mathrm{Cat}\left(Z_i^{(e)} \mid \phi_i^{(e)}\right)$, where $\phi_{it} \propto p(r_i^{(e)} \mid I_i^{(e)}, Z_i^{(e)}) \times \theta_t^{(e)}$, $t = 1, \ldots, T$.
(b) Sample $\theta^{(e)} \mid Z_i^{(e)} \sim \mathrm{Dir}\left(\theta^{(e)} \mid \mu_1\sigma + C_1^{(e)}, \ldots, \mu_T\sigma + C_T^{(e)}\right)$, where $C_t^{(e)} = \sum_{i=1}^{N} \delta(Z_i^{(e)} = t)$. I.e., $C_t^{(e)}$ counts the total number of reads that were assigned
as generated by transcript $t$ in a given experiment $e$, in a given iteration.
End of iteration over reads and experiments; now we sample the hyperparameters:
(c) Sample $\mu \mid \theta^{(1)}, \ldots, \theta^{(E)}, \sigma$ from its posterior using the Metropolis-Hastings algorithm and a proposal distribution given by $\mu^* \mid \mu^{(s)} \sim \mathrm{Dir}\left(\mu^* \mid \mu^{(s)}\right)$ - the posterior
was obtained by discarding all terms not involving $\mu$ in the joint distribution,
\[
p\left(\mu \mid \theta^{(1)}, \ldots, \theta^{(E)}, \sigma\right) \propto \prod_{e=1}^{E} \mathrm{Dir}\left(\theta^{(e)} \mid \mu, \sigma\right) \times \mathrm{Dir}(1)
\]
(d) Sample $\sigma \mid \theta^{(1)}, \ldots, \theta^{(E)}, \mu$ from its posterior using the Metropolis-Hastings algorithm and a proposal distribution given by $\sigma^* \mid \sigma^{(s)} \sim \mathrm{LN}\left(\sigma^{(s)}, 1\right)$ - the posterior
was obtained by discarding all terms not involving $\sigma$ in the joint distribution,
\[
p\left(\sigma \mid \theta^{(1)}, \ldots, \theta^{(E)}, \mu\right) \propto \prod_{e=1}^{E} \mathrm{Dir}\left(\theta^{(e)} \mid \mu, \sigma\right) \times \mathrm{Gamma}(\lambda_1, \lambda_2)
\]
3. Take only the samples after the burn-in and compute posterior statistics for the parameters of interest.
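Steps (a) and (b) of the sampler above can be sketched for a single experiment as follows. This is only an illustrative sketch, not the implementation used in this work: the function name `gibbs_sweep` and the dense likelihood matrix `lik` (with zeros for transcripts a read is not compatible with) are assumptions made for the example, and the Metropolis-Hastings updates of steps (c) and (d) are omitted.

```python
import numpy as np

def gibbs_sweep(lik, theta, mu, sigma, rng):
    """One pass over steps (a) and (b) for a single experiment (illustrative only).

    lik[i, t] stands for p(r_i | Z_i = t, I_i); it is zero when read i is not
    compatible with transcript t. All names here are hypothetical.
    """
    n_reads, n_tx = lik.shape
    # (a) phi_it is proportional to p(r_i | I_i, Z_i = t) * theta_t
    phi = lik * theta
    phi = phi / phi.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(n_tx, p=phi[i]) for i in range(n_reads)])
    # C_t counts the reads currently assigned to transcript t
    counts = np.bincount(z, minlength=n_tx)
    # (b) theta | Z ~ Dir(mu_1 * sigma + C_1, ..., mu_T * sigma + C_T)
    theta_new = rng.dirichlet(mu * sigma + counts)
    return z, counts, theta_new
```

In a full sampler this sweep would be repeated for every experiment before the hyperparameters are updated.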
Figure 2.3: Graphical Representation of the Hierarchical Model across experiments.
2.6 Identifiability Issues
The models presented can be weakly or strongly identifiable depending on the data. There
are three main experimental factors driving the identifiability of the model: the type
of reads (single-end or paired-end), the read length (25 bp or 150 bp), and the SNP density in a
given individual. RNA-Seq experiments with single-end, short reads on individuals
or mouse models with low SNP density yield very weakly identifiable models. In
contrast, experiments with paired-end, long reads and high SNP density in the individual
yield much more identifiable models. To address the identifiability problem we
combine our estimates using a collapsing procedure for allele-specific isoforms that are
weakly identifiable. We compute the correlation of the posterior samples of θi and θj, and
if the correlation is above a given threshold we combine the estimates from isoform i and
isoform j. Although this is an ad hoc procedure, it helps us tremendously in reducing our
data to identifiable isoforms. An alternative approach would be to use variable selection
models, such as isoDot, to select isoforms that are supported by the data and then fit our
model only on the supported isoforms.
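The collapsing procedure can be sketched as below. This is a hedged illustration rather than the code used here: the function name, the default threshold of 0.8, the use of the absolute correlation (weakly identifiable pairs typically show strongly negative posterior correlation), and the choice to sum the collapsed samples are all assumptions of the example.

```python
import numpy as np

def collapse_correlated_isoforms(samples, threshold=0.8):
    """Group isoforms whose posterior samples are strongly correlated.

    samples: array of shape (n_draws, n_isoforms) with posterior draws of theta.
    Returns the groups of isoform indices and the summed draws per group.
    Assumption: we threshold the absolute correlation.
    """
    n_isoforms = samples.shape[1]
    corr = np.abs(np.corrcoef(samples, rowvar=False))
    parent = list(range(n_isoforms))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_isoforms):
        for j in range(i + 1, n_isoforms):
            if corr[i, j] > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for t in range(n_isoforms):
        groups.setdefault(find(t), []).append(t)
    groups = sorted(groups.values())
    collapsed = np.column_stack([samples[:, g].sum(axis=1) for g in groups])
    return groups, collapsed
```

Summing the draws within a group keeps the total abundance of the collapsed isoforms, which is the quantity the data can actually identify.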
2.7 Simulation Results: To Count or not to Count?
As mentioned earlier, transcript-based models provide interpretable estimates of transcript-
specific expression and a straightforward way to combine such estimates into gene expression
estimates. In this section we compare our methods with the, up to now, commonly used
method of counting the number of reads per SNP in a given transcript, Robinson et al.
(2010).
To simulate an RNA-Seq experiment we used TuxSimAllele, a diploid version of the
published simulator TuxSim, Trapnell et al. (2013). Our simulator, TuxSimAllele, takes as
input an expression file and a parameters file containing RNA-Seq experimental parameters,
as well as DNA-specific parameters. The expression file contains the target expression for each
of the annotated transcripts for each allele. The RNA-Seq specific parameters are related
to the technical parameters of the experiment: the length of the reads, the type of reads
(single-end or paired-end), the mean cDNA fragment length and its variance, and the total
number of cDNA fragments that were sequenced (sequencing depth). The genome-specific
parameters are the genome of the individual and the set of variations (SNPs and Indels)
in the individual’s genome. Here we assumed a SNP density (in exons) in line with the
biological reality and the current technology - human alleles vary about 1 SNP per 1 Kb
(0.001 SNP density), and mouse isogenic lines alleles vary about 1 SNP per 100 bp (0.01).
Since our objective is to identify allele-specific bias across multiple experiments, we
decided to simulate allele-specific bias across multiple experiments. In order to do so we create
a base expression file for all allelic transcripts in the annotation. In the base expression file
we specify the RPKM, ρt, for each of the allelic transcripts in the annotation. Also in the
base expression file we select a subset of genes that will exhibit allele-specific bias in an
isoform-specific manner. In more concrete terms, let us say our target gene A has two isoforms,
B and C. We can imagine several allele-specific effects for such a gene. For example, gene A
may express only one isoform, say isoform B, and isoform B may be biased; or
gene A may express isoform B in a paternal manner and isoform C in a maternal manner,
and so on for all other 4 possible combinations, figure 2.4. Note that, in real data,
for the general case of a gene with k isoforms3, we would observe at most 3 × k combinations
of expressed isoforms and their respective allelic effect (paternal, maternal or bias).
Furthermore, if we use a count-based method we would see many ’complex’ patterns, such as
SNPs belonging to the same gene showing opposite biases. Thus, the count-based methods
present a major conundrum: which bias direction should one choose? It turns out that
expression happens at the isoform level, and therefore the bias at the SNP level does not
have any biological interpretation.
Figure 2.4: All possible allele-specific effects for a gene with two isoforms.
In order to show this conceptual problem with the count based method we conduct
simulations of allelic bias under several scenarios. Furthermore, we also want to show that
our methods estimate the correct direction of the bias, and that the false positives and
false negatives are well-calibrated under simple models of differential expression between the
alleles.
First let us describe how we simulated the biological variability across experiments. We
3 The average number of isoforms per gene according to the gencode annotation is about 10 isoforms per gene, so as we can see there are many possible complex combinations of isoform expression and allelic bias.
assume, in line with empirical data, that the number of reads in a given isoform, $X_t$, follows
a $\mathrm{NB}[\mu_t, \sigma_t^2]$, with
\[
\mu_t = \rho_t \times \frac{L_t}{1000} \times \frac{D}{1000000},
\]
and a variance, fitted with a lowess (using real data), to be a function of the mean. Thus for
each experiment, given our base expression file, which specifies ρt, and the parameters file,
which specifies D, we can simulate the RNA-Seq experiment and generate reads accordingly.
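As a rough sketch of this step, the expected read count and a negative binomial draw can be written as follows. The function names are hypothetical, and the quadratic mean-variance relation (variance = mean + dispersion × mean²) is only a stand-in assumption for the lowess fit on real data described above.

```python
import numpy as np

def expected_reads(rpkm, length_bp, depth):
    """mu_t = rho_t * (L_t / 1000) * (D / 1000000)."""
    return rpkm * (length_bp / 1_000) * (depth / 1_000_000)

def simulate_counts(rpkm, length_bp, depth, dispersion=0.1, rng=None):
    """Draw read counts X_t ~ NB with mean mu_t (illustrative only).

    Assumption: var_t = mu_t + dispersion * mu_t**2 replaces the lowess-fitted
    mean-variance curve, which is not reproduced here.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = expected_reads(np.asarray(rpkm, dtype=float),
                        np.asarray(length_bp, dtype=float), depth)
    # NumPy parameterizes the NB by (n, p) with mean n * (1 - p) / p.
    n = 1.0 / dispersion
    p = 1.0 / (1.0 + dispersion * mu)
    return rng.negative_binomial(n, p)
```

For example, a transcript with an RPKM of 10, a length of 2000 bp and a depth of one million fragments has an expected count of 20 reads.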
We simulated expression data for all the genes/isoforms in the gencode annotation and we
perturbed (simulated as having paternal/maternal bias) only about 0.5% of the genes (in line
with the numbers previously reported). We performed the simulations under several scenarios,
and we compared the EM and Bayesian approaches described in this chapter with
the most commonly used count methods for estimating allelic biases. Table 2.2 explains the
different simulated scenarios, and table 2.3 and figure 2.5 show the performance of each
of the methods under all the simulated scenarios. We can see that, under the assumption
that the paternal and maternal effects are consistent across all isoforms of a gene, the count
method performs in line with the EM and Bayesian allele-specific models described earlier;
under any other scenario, however, our proposed models perform better.
2.8 Discussion
We presented a model and an efficient computational workflow for finding allele-specific
expression estimates at the transcript level, which also allows us to compute estimates
at the gene level by taking the weighted sum of all the transcripts present in a given gene.
Nevertheless, the model is only as good as the data, and several experimental decisions can
be made to reduce the uncertainty of our estimates. First, one would ideally estimate
allele-specific expression for individuals with a well-sequenced genome, and with large DNA
Perturbation | Description
Consistent | All isoforms of the perturbed gene exhibit an allelic effect in the same direction. I.e., following our gene A example, the paternal copies of isoform B and isoform C would be higher than the maternal copies of isoforms B and C by a factor in [1.5, 3].
Inconsistent | Isoforms of the perturbed gene exhibit an allelic effect in opposite directions. I.e., continuing our gene A example, the paternal copy of isoform B would be higher than the maternal copy of isoform B by a factor in [1.5, 3], and the maternal copy of isoform C would be higher than the paternal copy of isoform C by a factor in [1.5, 3]. Thus, in this case, isoform B has a paternal bias, whereas isoform C has a maternal bias.
Rotation-Scale | Expression levels for all isoforms in the gene are randomly rotated and scaled. I.e., in our example, the paternal copy of isoform B would be scaled up by a factor in [1.5, 3], and the paternal copy of isoform C would also be scaled up by a factor in [1.5, 3]. Then we would rotate the expression levels of the paternal copies of isoforms B and C.
Rotation | Expression levels between the parental copies are rotated. Same as Rotation-Scale, but now expression levels are only rotated.
Single Transcript | We randomly select X isoforms to scale up/down with respect to paternal/maternal expression.
Table 2.2: Table with all the allele-specific perturbations that were simulated.
sequence difference between the alleles (i.e., large nucleotide composition difference between
the maternal and paternal copies of a gene). In humans we expect around 15 million SNPs
Figure 2.5: These are the ROC curves for the 5 different methods under all perturbation scenarios.
between two individuals, while the difference between two mouse strains can be higher (about 1% of the
genome, or equivalently 30 million SNPs).
Second, we would ideally be able to sequence large molecules, thus reducing the
mapping uncertainty. Current Illumina technologies can easily sequence 100 bp paired-end
reads, and PacBio technologies can sequence much longer reads but with a much higher
error rate. We expect that in the longer term this will be less of an issue and we will be able
to sequence much longer molecules, thus reducing the posterior variance of our estimates
considerably.
Third, sequencing depth can help us considerably in estimating allele-specific
expression for lowly expressed genes, and also helps us obtain more reproducible and
robust estimates of the allelic expression difference.
Chapter 3
Design of Experiments in the study of
Parental-Specific Expression
To consult the statistician after an experiment is finished is often merely to ask
him to conduct a post-mortem examination. He can perhaps say what the
experiment died of.
- Ronald Fisher
In this chapter we study parentally-biased gene/isoform expression across several brain
regions1 during early and adult developmental stages, and in other tissues such as liver and
muscle. Traditionally, parentally-biased gene expression, when observed across most cell
types and in an all-or-none fashion, is called imprinting. Imprinting is an epigenetic phenomenon
by which we inherit only one ’working’ copy of a so-called imprinted gene.
I.e., either the maternal or the paternal copy is actively transcribed in the offspring, and the
other copy (paternal or maternal, respectively) is silenced, and therefore, not transcribed,
1 We analyzed RNA-Seq data for the cerebellum, dorsal raphe and arcuate; and pyrosequencing data for validation in at least 10 different brain regions.
figure 3.1.
Figure 3.1: IGF2 and H19 are two well-known imprinted genes - when the two copies are not imprinted the offspring shows Beckwith-Wiedemann overgrowth syndrome.
At first glance imprinting seems counterintuitive and hard to explain by
evolution - both copies should be equally expressed as a way to compensate in case one of
the alleles has a deleterious mutation. However, the paradox is only apparent, and it can
be explained by the kinship theory of genomic imprinting, Wilkins and Haig (2003). The
kinship theory, also known as the conflict theory of imprinting, is better understood if we
think about litter-bearing mammals (dogs, mice, etc.), i.e., mammals that tend to produce
more than one offspring, usually in large numbers and sometimes from multiple fathers. The
kinship theory proposes that, for certain genes, there is a conflict between the evolutionary
goals of the mother and the father. It is in the best interest of the father to make his offspring
the strongest and the fittest of the litter, while the mother needs to balance her
resources such that all her offspring in the litter obtain a fair share of the limited resources
coming from the placenta. Thus, the conflict can be resolved through the use of imprinting -
for example, the father would tend to enhance the expression of growth-related factors such
that his offspring is the strongest, and the mother, to compensate for such an effect, would tend to
repress growth-related genes.
To better understand the kinship theory let us take as an example two well-known imprinted
genes, the growth-related genes H19 and IGF2 shown in figure 3.1. It is known
that in mammals only the paternal copy of IGF2 is expressed in the offspring, and in the
case of H19 only the maternal copy is expressed. Since both are growth-related factors,
IGF2 being a growth-enhancing gene and H19 being a growth-suppressor gene, this shows the
kinship theory in action.
It has also been shown that imprinting plays a major role in development and is key to
the proper function of an organism. In some cases the lack of imprinting can be lethal to the
mother and the offspring, and in other cases it is associated with major disorders. It has been
shown to be strongly associated with the Prader-Willi, Angelman and Beckwith-Wiedemann
syndromes, and it has also been shown to play an important role in several types of cancer,
such as leukemia, prostate cancer, and colorectal cancer. Moreover, several studies suggest
that imprinting may play an active role in the brain and in social behavior, Garfield et al.
(2011). Thus, imprinting is a strong candidate for a better understanding of socially-related
diseases such as autism and schizophrenia, Badcock and Crespi (2008).
However, few studies have looked at imprinting in a systematic, genome-wide manner.
Hence our interest in understanding its effects using a more general approach, by measuring
them across several developmental time points, several tissues and in a genome-wide manner.
In this chapter we assume that we have estimates of the parental-specific gene expression
for each experiment - one possible model for achieving such goal is to use the models proposed
in the previous chapter, but as technology evolves there may be more accurate ways to
estimate parental-specific expression. Next, we propose a model on such estimates to help
us identify reproducible and consistent differences in expression between the two alleles of
a gene, the maternal and the paternal copies.
In order to identify genes that exhibit parent-of-origin effects we present a careful experimental
design that accounts for many of the important factors. Next, we fit the Bayesian
regression and variable selection model to our data and obtain estimates of the main and
interaction effects. Finally, we compare our results with known imprinted genes, and we
validate some of our predictions using pyrosequencing follow-up experiments.
3.1 Experimental Design
To properly study parent-of-origin bias one has to consider a careful experimental design
or it could lead to misleading/wrong conclusions. First, it is well-known that the DNA
sequence has a large effect over expression, given that sequence differences in promoter,
enhancer and/or insulator regions give rise to different transcriptional regulation programs.
In our design we control for sequence-specific differences in expression by using two
reciprocal mouse crosses. In the initial cross the paternal genome comes from one strain,
and the maternal genome from a different strain. In the reciprocal cross the paternal and
maternal strains are inverted. The mouse cross is one of the 2-level factors in the experiment.
The cross is not the only factor that could likely affect parental bias. This bias may vary
across tissues, ages, and/or gender. In order to exemplify this, let us take the gender factor.
Males have one copy of chromosome X, and females have two copies of chromosome X. Thus,
at the very least, there cannot be imprinting in X-chromosome genes for males.
Moreover, several studies have suggested that imprinting does not exhibit a homogeneous
pattern across cell types, Garfield et al. (2011). With this in mind, we did experiments
in several brain regions, such as Cerebellum, Arcuate and Dorsal Raphe, and an even larger
survey of brain regions during the validation step. It is also well known that DNA methylation,
a key mechanism associated with genomic imprinting and gene silencing, is dynamic
and likely to change between different developmental stages.
Although we repeated the experimental design across several tissues, without loss of
generality, we shall focus on the design for one brain region, the cerebellum.
With all these factors in mind, we conducted a paired $2^3$ factorial design with several
replicates per pair. The experimental units are mice from reciprocal crosses, F1i and F1r, of two
mouse strains, C57 and CAST. Then, for each cross we take cells from their cerebellum at
two stages of development, 8 days and 8 weeks; and each experiment consists of a single-end
59 bp RNA-Seq experiment (section 1.3.4). The three 2-level factors are Sex (S): M (1), F
(−1); Cross (C): F1i (1), F1r (−1) and Age (A): P8 (1), P64 (−1).
Our estimand of interest is the difference, in expression, between the paternal copy of
the gene and the maternal copy of the gene. Thus, ideally, we would measure the paternal
expression, zp, and maternal expression, zm directly and our response would be the difference
between the two copies2, z = zp−zm. Intuitively, if the difference (z) is 0 across all replicates
we can conclude that there is no parent-of-origin effect, but if such difference is greater or
less than 0 there is a parent-of-origin effect.
However, the problem is slightly more complex and in an RNA-Seq experiment one cannot
directly observe the number of copies coming from each of the alleles, but rather we observe
reads coming from each of the alleles. Therefore, to circumvent this issue, we use an RNA-
Seq model to estimate the allele-specific expression for each gene (chapter 2). Let us denote
2 By paternal expression we mean the number of RNA molecules that have been transcribed from the paternal DNA, and by maternal expression we mean the number of molecules that have been transcribed from the maternal DNA.
our parameter (estimand) of interest as z (equivalent to θ in our RNA-Seq model, equation
2.17). Let us also denote its posterior mean as y = E(z | RNA-Seq data) and its posterior
variance as ε = Var(z | RNA-Seq data); then, using Bayesian asymptotic theory, we know that
z ∼ N(y, ε). Moreover, under a vague prior for z we can further assume that y and ε closely
approximate the maximum likelihood estimator of z and the inverse of its Fisher information.
Consequently, approximately, y ∼ N(z, ε).
We can think of our response as a response y with known measurement errors ε. It is
analogous to measuring the response in an experimental design where we do not directly
observe the response, but rather an estimate of the response plus some error3.
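A minimal sketch of how the response and its measurement error could be obtained from MCMC output is given below. The function name and the `burn_in` argument are hypothetical, and we assume that draws of the paternal minus maternal difference z are already available from the RNA-Seq model.

```python
import numpy as np

def summarize_allelic_difference(z_samples, burn_in=0):
    """Posterior mean (y) and posterior variance (eps) of z from MCMC draws.

    z_samples: 1-D sequence of posterior draws of the paternal minus maternal
    difference for one transcript in one replicate (assumed input).
    """
    kept = np.asarray(z_samples, dtype=float)[burn_in:]
    y = kept.mean()
    eps = kept.var(ddof=1)
    return y, eps
```

Each replicate then contributes one pair (y, ε) to the regression described in the following sections.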
The table describing the full experimental design is presented below, table 3.1. As one
can see in the table, the design is a balanced design with 6 replicates per factor combination.
Thus, for example, the first row in the table corresponds to the first factor combination, where
the tissue sample for the RNA-Seq experiment was extracted from a male offspring (S = 1),
coming from an F1i cross (C = 1), and a cerebellum with 8 days of development (A = 1).
ε represents the estimated errors associated with each estimated response, and $n_{rep} = 6$ means
that the experiment was repeated (independently) for 6 different biological replicates of such
factor combination.
We are mainly interested in computing the main effect of the difference between the
paternal and the maternal gene expression, after controlling for all the factors mentioned
3 Note that the expression estimate and its standard error were estimated from the probabilistic model in chapter 2. Albeit not perfect, in the sense that a full model across replicates where we estimate the expression and the factor effects simultaneously may have been more appropriate, this is a tractable and reasonable way to simplify the problem. Furthermore, for a single replicate estimating the expression and its standard errors takes about 6 to 7 hours on 6 cores and uses about 10 to 20 GB of RAM, thus a full model would probably be highly impractical.
Table 3.1: Paired $2^3$ factorial design with repeated measurements - six replicates per factor combination.
above. In order to do so we propose a Bayesian Regression Allelic Imbalance Model, with
the acronym BRAIM. The model name serves two purposes: on the one hand it stresses
the fact that the model is a Bayesian regression on the difference between the alleles, and on
the other hand it stresses that most of our studies of imprinted genes
were done across several brain regions.
In this model we model all the main effects and interaction effects of each factor over the
response (difference in expression between the alleles). Continuing with the example in the
table, we model the main effects of S, C and A; their interaction effects SC, SA, CA; plus
an intercept term that estimates the average gene/isoform expression difference (across all
factors) between the paternal and maternal alleles.
3.2 BRAIM: Bayesian Regression Allelic Imbalance Model
Our proposed model, BRAIM, has two key features. First, it is not just a simple
linear regression model: it also incorporates a variable selection feature to estimate and
select the relevant main effects and interaction effects. Second, it incorporates into the
model the fact that our observations are estimates of the true allelic expression, and thus it is
necessary to propagate the uncertainty of our estimates into the model - genes/isoforms with
fewer reads and few SNPs will have higher uncertainty than genes/isoforms with more reads
and many SNPs, so ignoring the uncertainty in our estimates will lead to a high number of
false positives and biased conclusions.
A relatively similar model in the context of variable selection was first proposed by George
and McCulloch (1993), and later extended to experimental designs that involve complex
aliasing by Chipman et al. (1997). Our model takes these ideas and extends them to a $2^3$ factorial
paired design with measurement errors.
Let us denote the estimate of our response as y. Let us also denote the standard error
of our response estimate as, ε4.
Then, we can model y as normally distributed,
y|z ∼ MVN (z, E) (3.1)
where y is an estimate of the response, of dimension n × 1; z can be interpreted as the
unobserved/latent true value of the response and is of dimension n × 1; and $E = \mathrm{diag}(\varepsilon_i^2)$ is
the covariance matrix of the response estimate, modeled as independent and of dimension
n × n. Note that n represents the total number of observations. Continuing with the example
from the previous section, n would correspond to $2^k \times n_{rep} = 2^3 \times 6 = 48$, since that is the
total number of samples for our given paired 3-factor (2-level each factor) experimental
design.
4 Here it is important to note that in RNA-Seq data we do not directly observe the expression of a gene/isoform since our observed data are reads, and we do not directly observe which isoform generated them. This uncertainty must be propagated into our model for allele-specific expression and thus, as described earlier, our observed response y is an estimate of the true difference in expression between the paternal and maternal allele.
Next, we model the unobserved true value of the response, z, as,
z|β, σ2 ∼ MVN (Xβ,Σ) (3.2)
where $X = (X_1, \ldots, X_p)$ is the experimental design matrix, with $X_1$ being a column vector
of 1s, $\beta = (\beta_1, \ldots, \beta_p)$ are the regression parameters for the intercept plus the main and
interaction effects, and $\Sigma = \mathrm{diag}(\sigma^2)$. Note that in our model $\beta_1$ is the intercept and it
can be interpreted as the average difference between the paternal and maternal expression
values. Continuing with our example, $p = k + \binom{k}{2} + 1 = 3 + \frac{3!}{2!\,1!} + 1 = 7$. Let us also add an
additional notation, $p = m + 1 + m^*$, where $m$ is the number of main effects and $m^*$ is the
number of interaction effects. Note that in a balanced design as in our example, $m = k$, but
in other experimental designs such coincidence may not always hold true.
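To make the dimensions concrete, the design matrix for the running example can be built as in the sketch below. The function name and column ordering are assumptions of the example; it uses the ±1 coding of S, C and A and all pairwise interactions, which gives n = 48 rows and p = 7 columns.

```python
import itertools
import numpy as np

def build_design_matrix(n_rep=6, factors=("S", "C", "A")):
    """Design matrix for a replicated 2^k factorial with pairwise interactions.

    Columns: intercept, k main effects (coded +1/-1), k-choose-2 interactions.
    """
    k = len(factors)
    levels = np.array(list(itertools.product([1, -1], repeat=k)), dtype=float)
    main = np.repeat(levels, n_rep, axis=0)          # shape (2^k * n_rep, k)
    pairs = list(itertools.combinations(range(k), 2))
    inter = np.column_stack([main[:, a] * main[:, b] for a, b in pairs])
    X = np.column_stack([np.ones(len(main)), main, inter])
    names = ["Intercept", *factors, *[factors[a] + factors[b] for a, b in pairs]]
    return X, names
```

Because the design is balanced, the columns of this matrix are mutually orthogonal, so each effect is estimated independently of the others.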
Next, we take a similar approach to the aforementioned models and model the βs as a
mixture of normals, such that the first normal is centered around zero with a small variance,
representing a non-significant effect, and the second normal is also centered around zero but
has a much larger variance, allowing for a significant (non-zero) effect.
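\[
\beta_k \mid \tau_k, c_k, \delta_k \sim
\begin{cases}
N\left(0, \tau_k^2\right), & \delta_k = 0 \\
N\left(0, (c_k \tau_k)^2\right), & \delta_k = 1
\end{cases}
\tag{3.3}
\]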
The δs can be interpreted as an indicator random variable that indicates if a given factor
has an effect on the response. In the case we are interested only in the main effects, we can
model the δs as Bernoulli i.i.d., with probability pi.
\[
\pi(\delta) \propto \prod_{i=1}^{k} p_i \tag{3.4}
\]
In case we are interested in main effects and interaction effects, we should have in mind the
principles of effect sparsity, effect hierarchy and effect heredity. Thus, our model for δ in
such case is,
\[
\pi(\delta) \propto \prod_{i=1}^{m+1} p_i \prod_{i=1}^{m^*} p\left(\delta_i \mid \mathrm{pa}(\delta_i)\right) \tag{3.5}
\]
where, m is the number of main effects, m∗ is the number of interactions in the model
(m + m∗ = k, i.e., the total number of effects incorporated into the model). Also, pa(δi)
represents all the parents of effect i.
We can also write the distribution of β as a multivariate normal,
β|σ2, τ , c, δ ∼ MVN (0,Σδ) (3.6)
where,
\[
\Sigma_\delta = \mathrm{diag}\left( \left(\sigma c_k^{\delta_k} \tau_k\right)^2 \right) \tag{3.7}
\]
Finally, we put a conjugate prior on σ2,
σ2 ∼ IG (ν/2, νλ/2) (3.8)
3.2.1 Gibbs Sampling
Let us write the joint distribution of the observed data, y = (y1, . . . , yn), the covariance
diagonal matrix with observation errors, $E = \mathrm{diag}(\varepsilon_i^2)$, and the vector of parameters and
and, in the case we are only interested in the main effects, the prior for δ is a multinomial
with probabilities pi.
π (δ) ∼ multinom(p) (3.14)
Note that each probability pk is the prior probability that factor k will have a main effect
on the response.
Now we derive the full conditionals so we can define the Gibbs Sampling algorithm,
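1. Full conditional for z - Obtained by dropping the terms not involving z in the joint
distribution.
\[
\begin{aligned}
f\left(z \mid \sigma^2, \beta, y\right) &\propto f(y \mid z, E)\, \pi\left(z \mid \beta, \sigma^2\right) \\
&\propto |E|^{-1/2} \exp\left\{ -\tfrac{1}{2} (y - z)^T E^{-1} (y - z) \right\} \times |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (z - X\beta)^T \Sigma^{-1} (z - X\beta) \right\} \\
&\propto \exp\left\{ -\tfrac{1}{2} \left[ (y - z)^T E^{-1} (y - z) + (z - X\beta)^T \Sigma^{-1} (z - X\beta) \right] \right\}
\end{aligned}
\]
Now, using the expansion form of the multivariate normal, lemma B.0.2, we obtain,
\[
f\left(z \mid \sigma^2, \beta, y\right) \propto \exp\left\{ -\tfrac{1}{2} \left[ (z - \mu_z)^T \Lambda_z^{-1} (z - \mu_z) \right] \right\}
\]
where,
\[
\mu_z = \Lambda_z \left( \Sigma^{-1} X\beta + E^{-1} y \right), \qquad \Lambda_z = \left( E^{-1} + \Sigma^{-1} \right)^{-1}
\]
Thus,
\[
z \mid \sigma^2, \beta, y \sim \mathrm{MVN}(\mu_z, \Lambda_z)
\]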
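2. Full conditional for β - Obtained by dropping the terms not involving β in the joint
distribution.
\[
\begin{aligned}
f\left(\beta \mid z, \sigma^2, \delta\right) &\propto \pi\left(z \mid \beta, \sigma^2\right) \pi\left(\beta \mid \delta, \sigma^2\right) \\
&\propto |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (z - X\beta)^T \Sigma^{-1} (z - X\beta) \right\} \times \sigma^{-(k+1)} \exp\left\{ -\tfrac{1}{2} \beta^T \Sigma_\delta^{-1} \beta \right\} \\
&\propto \exp\left\{ -\tfrac{1}{2} \left[ (z - X\beta)^T \Sigma^{-1} (z - X\beta) + \beta^T \Sigma_\delta^{-1} \beta \right] \right\}
\end{aligned}
\]
Now, using the expansion form of the multivariate normal with regression coefficients,
lemma B.0.2, we obtain,
\[
f\left(\beta \mid z, \sigma^2, \delta\right) \propto \exp\left\{ -\tfrac{1}{2} \left[ (\beta - \mu_\beta)^T \Lambda_\beta^{-1} (\beta - \mu_\beta) \right] \right\}
\]
where,
\[
\mu_\beta = \Lambda_\beta \left( X^T \Sigma^{-1} z \right), \qquad \Lambda_\beta = \left( X^T \Sigma^{-1} X + \Sigma_\delta^{-1} \right)^{-1}
\]
Thus,
\[
\beta \mid z, \sigma^2, \delta \sim \mathrm{MVN}\left(\mu_\beta, \Lambda_\beta\right)
\]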
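3. Full conditional for σ² - Obtained by dropping the terms not involving σ² in the joint
distribution.
\[
\begin{aligned}
f\left(\sigma^2 \mid z, \beta, \delta\right) &\propto \pi\left(z \mid \beta, \sigma^2\right) \pi\left(\beta \mid \delta, \sigma^2\right) \pi\left(\sigma^2\right) \\
&\propto \left[\sigma^2\right]^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} (z - X\beta)^T (z - X\beta) \right\} \times [\sigma]^{-(k+1)} \exp\left\{ -\frac{1}{2\sigma^2} \beta^T \Sigma_\delta^{-1} \beta \right\} \times \left[\sigma^2\right]^{-\nu/2 - 1} \exp\left\{ -\frac{\nu\lambda}{2\sigma^2} \right\} \\
&\propto \left[\sigma^2\right]^{-(N + k + 1 + \nu)/2 - 1} \exp\left\{ -\frac{1}{2\sigma^2} \left[ \nu\lambda + (z - X\beta)^T (z - X\beta) + \beta^T \Sigma_\delta^{-1} \beta \right] \right\}
\end{aligned}
\]
Thus,
\[
\sigma^2 \mid z, \beta, \delta \sim \mathrm{IG}\left( \tfrac{1}{2} (N + k + 1 + \nu),\; \tfrac{1}{2} \left[ \nu\lambda + (z - X\beta)^T (z - X\beta) + \beta^T \Sigma_\delta^{-1} \beta \right] \right).
\]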
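4. Full conditional for δ - Obtained by dropping all the terms not involving δ in the joint
distribution.
\[
f\left(\delta \mid z, \beta, \sigma^2\right) \propto \pi\left(\beta \mid \delta, \sigma^2\right) \pi(\delta)
\]
However, the joint conditional for δ is unknown, and therefore it is more suitable
to sample each $\delta_i$ one at a time, given the set $\delta_{[-i]} = \{\delta_1, \ldots, \delta_{i-1}, \delta_{i+1}, \ldots, \delta_{k+1}\}$.
Using the equation above, we obtain,
\[
f\left(\delta_i \mid \delta_{[-i]}, z, \beta, \sigma^2\right) \propto \pi\left(\beta \mid \delta_i, \delta_{[-i]}, \sigma^2\right) \pi\left(\delta_i, \delta_{[-i]}\right)
\]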
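Algorithm 3 BRAIM Gibbs Sampler.
1. Initialize the parameters.
2. Sample each parameter iteratively (until convergence) from its full conditional, as follows,
(a) Sample $z \mid \sigma^2, \beta, y \sim N_n(\mu_z, \Lambda_z)$,
with $\mu_z = \Lambda_z \left(\Sigma^{-1} X\beta + E^{-1} y\right)$ and $\Lambda_z = \left(E^{-1} + \Sigma^{-1}\right)^{-1}$
(b) Sample $\beta \mid z, \sigma^2, \delta \sim N_k\left(\mu_\beta, \Lambda_\beta\right)$,
with $\mu_\beta = \Lambda_\beta \left(X^T \Sigma^{-1} z\right)$ and $\Lambda_\beta = \left(X^T \Sigma^{-1} X + \Sigma_\delta^{-1}\right)^{-1}$
(c) Sample
$\sigma^2 \mid z, \beta, \delta \sim \mathrm{IG}\left( \tfrac{1}{2}(N + k + 1 + \nu),\; \tfrac{1}{2}\left[ \nu\lambda + (z - X\beta)^T (z - X\beta) + \beta^T \Sigma_\delta^{-1} \beta \right] \right)$
(d) For $i = 1, \ldots, k + 1$, sample $\delta_i$ with probability as in equation 3.15.
3. Take only the samples after the burn-in and compute posterior statistics for the parameters of interest.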
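Thus,
\[
\begin{aligned}
p\left(\delta_i = 1 \mid \delta_{[-i]}, \beta, \sigma^2, z\right)
&= \frac{\pi\left(\beta \mid \delta_i = 1, \delta_{[-i]}, \sigma^2\right) \pi\left(\delta_i = 1, \delta_{[-i]}\right)}{\pi\left(\beta \mid \delta_i = 1, \delta_{[-i]}, \sigma^2\right) \pi\left(\delta_i = 1, \delta_{[-i]}\right) + \pi\left(\beta \mid \delta_i = 0, \delta_{[-i]}, \sigma^2\right) \pi\left(\delta_i = 0, \delta_{[-i]}\right)} \\
&= \frac{\pi\left(\delta_i = 1, \delta_{[-i]}\right)}{\pi\left(\delta_i = 1, \delta_{[-i]}\right) + \dfrac{\pi\left(\beta \mid \delta_i = 0, \delta_{[-i]}, \sigma^2\right)}{\pi\left(\beta \mid \delta_i = 1, \delta_{[-i]}, \sigma^2\right)}\, \pi\left(\delta_i = 0, \delta_{[-i]}\right)}
\end{aligned}
\tag{3.15}
\]
where $\frac{\pi\left(\beta \mid \delta_i = 0, \delta_{[-i]}, \sigma^2\right)}{\pi\left(\beta \mid \delta_i = 1, \delta_{[-i]}, \sigma^2\right)}$ is the ratio of the normal mixture for β, equation 3.12.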
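The four sampling steps can be sketched in code for the main-effects case. The sketch below is an illustration under stated assumptions, not the implementation used in this work: the function name and default values are hypothetical; τ, c and the inclusion probability are taken as scalars shared by all coefficients; the δs have independent Bernoulli priors; and the prior on β is read as N(0, σ² diag((c^δ τ)²)), which keeps the σ² update conjugate.

```python
import numpy as np

def braim_gibbs(y, eps, X, tau=0.1, c=10.0, p_incl=0.5, nu=2.0, lam=0.1,
                n_iter=2000, burn_in=500, seed=0):
    """Illustrative Gibbs sampler for the main-effects model.

    y: response estimates; eps: their standard errors; X: design matrix.
    Returns post-burn-in draws of beta and delta.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    e2 = np.asarray(eps, dtype=float) ** 2
    XtX = X.T @ X
    beta, delta, sigma2 = np.zeros(p), np.ones(p, dtype=int), 1.0
    beta_draws, delta_draws = [], []
    for it in range(n_iter):
        # (a) z | sigma2, beta, y: E and Sigma are diagonal, so update elementwise
        prec_z = 1.0 / e2 + 1.0 / sigma2
        mean_z = (X @ beta / sigma2 + y / e2) / prec_z
        z = mean_z + rng.standard_normal(n) / np.sqrt(prec_z)
        # (b) beta | z, sigma2, delta, with prior variance sigma2 * d_k
        d = (tau * c ** delta) ** 2
        A = XtX + np.diag(1.0 / d)
        mean_b = np.linalg.solve(A, X.T @ z)
        L = np.linalg.cholesky(A)
        beta = mean_b + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(p))
        # (c) sigma2 | z, beta, delta ~ IG(shape, rate)
        resid = z - X @ beta
        rate = 0.5 * (nu * lam + resid @ resid + np.sum(beta ** 2 / d))
        sigma2 = 1.0 / rng.gamma(0.5 * (n + p + nu), 1.0 / rate)
        # (d) delta_k | beta_k, sigma2: ratio of the two normal densities
        sd0 = np.sqrt(sigma2) * tau
        sd1 = c * sd0
        log_odds = (np.log(p_incl) - np.log1p(-p_incl) - np.log(c)
                    + 0.5 * beta ** 2 * (1.0 / sd0 ** 2 - 1.0 / sd1 ** 2))
        prob1 = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -700, 700)))
        delta = (rng.random(p) < prob1).astype(int)
        if it >= burn_in:
            beta_draws.append(beta)
            delta_draws.append(delta)
    return np.array(beta_draws), np.array(delta_draws)
```

The column means of the returned δ draws estimate the posterior probability that each effect is selected into the model.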
3.3 Choice of Prior Parameters
Several prior parameters need to be specified. The normal mixture prior on β has parameters
τ and c, the inverse gamma prior for σ² has parameters ν and λ, and each δ has
prior probability $p_k$ of being selected. Since we are mainly interested in a good ranking of
the probability of a factor being significant, we need to tune these parameters to make the
posteriors comparable across isoforms/genes. Thus, we treat these parameters as tuning
constants as well as representations of our prior information.
Let us focus on τ and c now. The parameter c acts as a multiplying constant that
determines how much larger an important effect has to be than a null effect in order to be
considered significant. I.e., if we fix c = 10 it means that an important effect is around
10 times larger than a negligible one. With this in mind we tune the value of τ such that
significant effects have a high posterior probability of being selected. If τ is too large then
almost no effect would be considered significant, and if τ is too small almost every effect
would be considered significant. Thus, a good tuning point for τ is key.
An improper prior for σ, i.e., ν = 0, is inappropriate, since it allows σ to be close
to 0 and results in overinflated posteriors for selecting a variable as significant. Thus a
relatively uninformative prior would be a good choice - it has been suggested to use a value
for ν close to 2 and then choose λ such that the prior mean for σ equals $\sqrt{\mathrm{Var}(y)}/5$.
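One way to turn this recommendation into a number is sketched below. We assume the σ² ∼ IG(ν/2, νλ/2) prior of equation 3.8 and read "the prior mean for σ" as E[σ] with σ = √σ², which equals √(νλ/2) Γ((ν − 1)/2) / Γ(ν/2) and exists only for ν > 1. The function names are hypothetical.

```python
import math

def prior_mean_sigma(nu, lam):
    """E[sigma] when sigma^2 ~ IG(nu / 2, nu * lam / 2); requires nu > 1."""
    a, b = nu / 2.0, nu * lam / 2.0
    return math.sqrt(b) * math.gamma(a - 0.5) / math.gamma(a)

def lambda_for_prior_mean(nu, target_mean):
    """Solve prior_mean_sigma(nu, lam) == target_mean for lam."""
    a = nu / 2.0
    return (2.0 / nu) * (target_mean * math.gamma(a) / math.gamma(a - 0.5)) ** 2
```

With ν = 2 the relation reduces to E[σ] = √(πλ), so for a given transcript one would set the target to the sample standard deviation of y divided by 5 and solve for λ.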
Finally, the prior probability for δ = 1 has a very straightforward interpretation. pk is
the probability that a given factor can be considered as significant.
3.4 Analysis and Results
We applied the main effects BRAIM model to our response. As described earlier, the
response was the difference between paternal and maternal expression for a given transcript,
and for each transcript, we estimated the posterior probabilities of each factor being selected
in the model and their main effects. The factors considered in the model were Sex (S), Cross
(C) and Age (A).
Before fitting the model we filtered out lowly expressed transcripts. We considered a
transcript to be expressed if at least one replicate had more than 10 reads mapping
to it; transcripts that did not meet this criterion were discarded from further analysis.
We used the Ensembl annotation of
transcripts (footnote 5), and we added retrotransposons and functional RNAs, totaling approximately
140,000 transcripts. After filtering out lowly expressed transcripts, we retained 47,676
transcripts (from 26,651 genes) with higher expression levels, which we used for our analysis.
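The expression filter described above can be sketched as follows. The transcript identifiers and read counts are made up for illustration, and the function name is hypothetical.

```python
def is_expressed(read_counts, min_reads=10):
    """True if at least one replicate has MORE than `min_reads` mapped reads."""
    return any(count > min_reads for count in read_counts)

# Hypothetical read counts per replicate for three transcripts.
counts = {
    "tx_a": [0, 3, 10, 7],         # never exceeds 10 reads, so it is discarded
    "tx_b": [2, 11, 0, 5],         # one replicate with 11 reads, so it is kept
    "tx_c": [150, 98, 120, 133],   # well expressed in every replicate
}

expressed = {tx: c for tx, c in counts.items() if is_expressed(c)}
print(sorted(expressed))
```

Note that the threshold is strict: a transcript with exactly 10 reads in its best replicate is not retained.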
We fitted the model independently for each of the 47,676 transcripts, using the Gibbs
sampling procedure described in Algorithm 1. Thus, for each transcript we obtained
Ns MCMC samples for the following parameters: βI, βS, βC, βA, δI, δS, δC, δA,
where,
• βI corresponds to the intercept samples, and represents the average difference between
the paternal and maternal transcripts across all possible factor combinations. Thus,
βI can be interpreted as the parent-of-origin effect for a given transcript.
• βS, βC and βA correspond to the main effects of Sex, Cross and Age. The main effect
of Cross can be interpreted as a strain effect, or a haplotype-specific effect as opposed
to the parent-of-origin effect we are mostly interested in.
• $\sum(\delta_I = 1)/N_s$, $\sum(\delta_S = 1)/N_s$, $\sum(\delta_C = 1)/N_s$ and $\sum(\delta_A = 1)/N_s$, i.e., the fraction of
MCMC samples in which each indicator equals 1, correspond to the posterior probability of the intercept
being selected into the model, and the posterior probabilities of the Sex, Cross and Age
factors being selected into the model. If the intercept has a high probability of being
selected we are more confident that there is a strong difference between the paternal
and maternal transcripts across any factor combination.
Footnote 5: http://www.ensembl.org/index.html
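The posterior probabilities in the last bullet are simple Monte Carlo averages of the indicator draws. A minimal sketch, with made-up draws and a very small Ns for readability:

```python
def posterior_inclusion_probability(delta_samples):
    """Fraction of MCMC draws in which the indicator delta equals 1."""
    return sum(1 for d in delta_samples if d == 1) / len(delta_samples)

# Hypothetical indicator draws for one transcript (Ns = 10 here for brevity).
delta_I = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]   # intercept, i.e., parent-of-origin effect
delta_S = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # Sex

print(posterior_inclusion_probability(delta_I))
print(posterior_inclusion_probability(delta_S))
```

In practice Ns would be in the thousands, and any burn-in draws would be dropped before averaging; both choices are left out of this sketch.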
Let us take three illustrative examples, where the estimates of βI and their posterior
probability of being selected vary from negligible to highly significant. Our first example
is gene A530017D24Rik, transcript ENSMUST00000101077. We chose this gene because it
represents a very common pattern in our data: it shows a strong strain effect but no
parent-of-origin effect. In figure C.1 we show the paternal and maternal expression for each
of the replicates, together with the posterior probabilities for each of the factors to be
selected into the model. The MCMC samples for each of the variables in the model are
shown in the pages following the main figure (figures C.2 to C.6).
The second and third examples are two different transcripts of the same gene, H13. Let us first focus on
transcript ENSMUST00000109825 (figure C.7). This example shows the canonical imprinting
pattern, where the paternal copy of the transcript is highly expressed and the maternal copy
is silenced. The MCMC samples for each of the variables in the model are shown in the
pages following the main figure (figures C.8 to C.12).
Now let us focus on another transcript of gene H13, ENSMUST00000148156 (figure C.13).
This example also shows a canonical imprinting pattern, where in this case the maternal copy
of the transcript is highly expressed and the paternal copy is silenced. The MCMC samples
for each of the variables in the model are shown in the pages following the main figure
(figures C.14 to C.18).
We could show such figures for the MCMC traces of each of the 47,676 transcripts
analyzed, but this would use far too many pages. Instead, we computed the
median of the posterior samples of each parameter for each transcript, i.e., we obtained 47,676
posterior medians for each parameter. Since there
are eight parameters (βI, βS, βC, βA, δI, δS, δC, δA), we plot the histograms of the posterior
medians across all transcripts in figure 3.2. In the figure we can appreciate that,
• βI is centered around 0 and the majority of the genes show very little parent-of-origin
effect, with the exception of some interesting outliers.
• βS is centered around 0 and the majority of the genes show very little Sex effect.
• βC is centered around 0 but a large number of genes exhibit a strong Cross effect.
• βA is centered around 0 and the majority of the genes show almost no Age effect,
except for a small minority.
• $\sum(\delta_I = 1)/N_s$, i.e., the posterior probability of the intercept being included in the model.
We can see that for most genes the posterior probability is below 0.95. However,
an interesting pattern arises: there are several genes with posterior probability 1, and
for those we are extremely confident that the mean difference is significant across
all factors. There are about 150 transcripts with PP > 0.95. We call these genes
significant, and we select them for subsequent sensitivity analysis.
• $\sum(\delta_S = 1)/N_s$, i.e., the posterior probability of the Sex factor being included in the model. We
can see that for almost all genes the posterior probability is below 0.95, and we can
conclude that Sex is almost never a significant factor.
• $\sum(\delta_C = 1)/N_s$, i.e., the posterior probability of the Cross factor being included in the model.
We can see that for many transcripts PP > 0.95, and more than 200 genes show
a significant Cross effect. This is expected and in line with the biology, since
different strains contain differences in sequence and therefore very different regulation
programs.
• $\sum(\delta_A = 1)/N_s$, i.e., the posterior probability of the Age factor being included in the model.
We can see that for a few transcripts PP > 0.95. These transcripts exhibit an
age-specific effect on imprinting.
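The per-transcript summaries used for the histograms, and the 0.95 cutoff used to call an effect significant, can be sketched as follows. The data structure and the transcript names are hypothetical; only the median summary and the cutoff come from the text.

```python
import statistics

CUTOFF = 0.95  # posterior probability cutoff used in the text

# Hypothetical MCMC output: transcript -> (beta_I draws, delta_I draws).
samples = {
    "tx_1": ([0.1, -0.2, 0.0, 0.1], [0, 0, 1, 0]),
    "tx_2": ([2.9, 3.1, 3.0, 3.2], [1, 1, 1, 1]),
}

summary = {}
for tx, (beta_draws, delta_draws) in samples.items():
    summary[tx] = {
        "beta_median": statistics.median(beta_draws),  # value shown in the histogram
        "pp": sum(delta_draws) / len(delta_draws),     # posterior inclusion probability
    }

# Transcripts whose parent-of-origin effect passes the cutoff.
significant = [tx for tx, s in summary.items() if s["pp"] > CUTOFF]
print(significant)
```

The same summary would be computed for each of the other factors (Sex, Cross, Age) by swapping in the corresponding draws.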
The distribution of the PPs (posterior probabilities) of the parental effect shows that
most transcripts are not inferred to be under imprinting regulation (figure 3.2). It also clearly
shows a group of transcripts with PP > 0.95 for the parental effect, which we set as our
cutoff for calling an effect significant, i.e., for calling the parental bias imprinting. Notably, the
distribution of the PPs of the cross effect (figure 3.2) clearly shows how widely prevalent it is
in these hybrids, indicating that both crosses are necessary for inferring genomic imprinting or
monoallelic expression (as opposed to Deng et al., 2014). In addition, none of the imprinted
transcripts has a sex effect with PP above our 0.95 cutoff, indicating that
genomic imprinting is sex invariant in the mouse cerebellum. Conversely, a small group
of imprinted transcripts have age-effect PPs above the 0.95 cutoff, indicating
age-regulated imprinting (detailed below).
Independent Validation using PyroSequencing
Among the 124 genes inferred to be imprinted (represented by 169 transcripts), 74 were
previously reported as such (either identified before 2010 or independently validated by
DeVeale et al., 2012, or Gregg et al., 2010). The remaining 50 genes had not been described
as imprinted before (figure 3.3A). To independently evaluate imprinting in all these
candidates we used pyrosequencing, a real-time sequencing-by-synthesis approach relying on light
emission after nucleotide incorporation (Wang and Elbein, 2007). As positive and negative
controls, we tested, respectively, 11 known imprinted genes and 11 randomly selected genes
with no significant parental effects according to our RNA-seq analysis. We tested an average
of two SNPs per gene across 12 P60 and/or 12 P8 individual CBs, different from those used
in our RNA-seq experiment. We estimated parental effects in these data using the model we
developed for the same task for RNA-seq data, applied to the case where there is no measurement uncertainty.
This confirmed significant parental effects for 41 of the candidate novel imprinted genes and
the expected significant and non-significant parental effects for all positive and negative
controls, respectively (figure 3.5). Among the novel and known imprinted genes we observed a
slightly higher number of genes with a paternal effect (figure 3.3B). Interestingly, five genes
preferentially express different isoforms from the maternal and paternal alleles (figure 3.3B
and see below).
Figure 3.2: Posterior probability for each main effect to be included in the model, i.e., Parent-of-origin, Cross, Age and Sex main effects.
Figure 3.3: List of known and novel (as predicted by our method) imprinted genes, and the distribution of the bias in terms of allelic preference.
The distribution of the parental biases of the novel and known imprinted genes span the
entire range of slightly above 50:50
Notwithstanding, several true imprinted genes may not meet this parental bias PP cutoff
(i.e., false negatives). We therefore subjectively selected for pyrosequencing evaluation 18
genes with PPs of their parental effect below 0.95. This successfully confirmed a significant
parental effect in 10 of these 18 genes. As a positive control for this step, we chose the
Casd1 gene, which has been reported to be imprinted in other tissues (Ono et al., 2003) and
obtained a parental effect PP of 0.89.
Isoform-Specific Imprinting
Not only are transcript-level estimates of expression necessary for obtaining accurate
gene-level estimates of expression, they are also important for inferring parental effects
specific to certain gene isoforms, if such exist. Indeed, isoform-specific imprinting is known
to occur in genes harboring a paternally expressed gene within an intron, resulting in the
generation of different isoforms from the maternal and paternal alleles (McCole and Oakey,
2008). Although it is unclear exactly how such regulation arises, transcriptional interference
by the inner paternally expressed gene likely plays an important role (McCole and Oakey,
2008). Our analysis detected almost all of the previously reported cases of genes with
isoforms subjected to such regulation (figure 3.7) and further detected additional imprinted
transcripts in some of these loci. This includes either or both of two short transcripts of
the Herc3 gene (indistinguishable by our sequence data) from a promoter upstream of the
large, 25-exon-long transcript (Gencode transcript IDs: ENSMUST00000141600.1 and
ENSMUST00000122981.1), which are preferentially expressed from the maternal allele (figure 3.7).
Other known cases of isoform-specific imprinting are due to differential methylation of
alternative promoters, as in the case of the Gnas (Peters and Williamson, 2007) and Grb10
(Arnaud et al., 2003) genes. Surprisingly, at the paternally expressed Mest gene we detect a
novel maternally expressed short isoform (Gencode transcript ID: ENSMUST00000149496.1)
whose transcription starts at exon 9 (figure 3.7). This transcript is presumably a non-coding
RNA since no ORF could be found in it. These results therefore emphasize the importance
of analyzing allelic-biased expression at the transcript level.
Figure 3.4: Degrees of parental bias between known and novel imprinted genes.
Figure 3.5: Selected novel imprinted genes and their pyrosequencing validation.
Figure 3.6: Scatter plot of the bias-ratio estimates using pyrosequencing and the estimates of the bias ratio using RNA-Seq. There is a large agreement between both technologies.
Developmental Regulation of Genomic Imprinting in the Cerebellum
Figure 3.7: Isoform-specific imprinting. Rian is maternally biased; H13, Herc3 and Mest exhibit some isoforms that are maternally biased and some isoforms that are paternally biased.
The importance of imprinting during development, particularly in the brain, has been
clearly established (Wilkinson et al., 2007). Furthermore, age-dependent regulation of the
expression and/or imprinted state of some genes has also been described (Gregg et al., 2010).
Here we took advantage of cerebellar postnatal maturation to detect imprinted genes
regulated during important milestones of neuronal development. We detected 57 imprinted genes
(50% of all imprinted genes expressed in the CB) for which the parental bias and/or
the total expression level (paternal + maternal) is regulated according to developmental
stage (age effect PP > 0.95; figure 3.8A). This includes 11 genes in which both the parental
bias and the total expression level are affected by age, 17 genes in which only the parental
bias is affected by age, and 29 genes in which only the total expression level changes with
age (21 in which the parental bias remains age invariant and eight that are not expressed
in the adult). Two striking patterns are apparent among the age-regulated genes. The first
is that in a disproportionately high number of age-regulated genes both the parental bias and
the total expression level are stronger in P8 CB (21 genes, compared to 9 in which both the
parental bias and the total expression level are higher in P60, and 11 and 9 genes in which
either the parental bias is higher in P8 and the total expression level higher in P60 or the
opposite, respectively; P-value = 0.048; χ² test). The fact that the CB undergoes important
developmental milestones at P8 may suggest that these genes are actively involved in this
process. The second pattern is that the age effect on the magnitude of the parental bias
and on the level of total expression are positively correlated (Pearson correlation coefficient
= 0.34; P-value = 3.1 × 10⁻⁴). Such a pattern may arise artificially if the power to detect
parental biases is strongly correlated with expression levels. This option, however, is not
supported by our data. Alternatively, it is possible that the preferentially expressed
allele and/or the non-preferentially expressed allele experience a significant change in
expression levels along development, thereby altering both the magnitude of the parental bias
and the level of the total expression. To test this hypothesis we fitted our model to the
data where we defined the response as either the paternal expression levels or the maternal
expression levels. This analysis indeed revealed that age-regulated imprinting (age effect
PP > 0.95) is achieved either by a significant change in the expression level of the preferred
allele (31 out of 57 genes, shown along the X-axis in figure 3.8C), a significant change in
the expression level of the non-preferred allele (13 out of 57 genes, shown along the Y-axis
in figure 3.8C), or a significant change in the expression levels of both alleles (13 out of
57 genes, shown along the diagonal in figure 3.8C), indicating that altering the expression
level of the preferred allele is the common mode through which age-regulated imprinting is
achieved (P-value = 0.003; χ² test).
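The two χ² tests quoted in this section can be checked with a short sketch. It assumes that each test is a goodness-of-fit test of the observed gene counts against equal expected counts in every category; the text does not state which variant of the test was used, so this is an assumption, and the function names are illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
        return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))
    # Odd df = 2m + 1: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=1}^{m} (x/2)^(k - 1/2) / Gamma(k + 1/2)
    series = sum(half ** (k - 0.5) / math.gamma(k + 0.5) for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series

def goodness_of_fit(observed):
    """Chi-square statistic and P-value against equal expected counts."""
    expected = sum(observed) / len(observed)
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, chi2_sf(stat, len(observed) - 1)

# Direction of the age effect on bias and total expression (four categories).
stat_dir, p_dir = goodness_of_fit([21, 9, 11, 9])
# Change in preferred allele, non-preferred allele, or both (three categories).
stat_mode, p_mode = goodness_of_fit([31, 13, 13])
print(round(stat_dir, 2), round(p_dir, 3))
print(round(stat_mode, 2), round(p_mode, 3))
```

Under this assumption the sketch gives P-values of approximately 0.048 and 0.003, in agreement with the values reported above.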
Some of the genes whose parental bias in the CB is affected by age are associated with
developmental processes such as cell proliferation, differentiation, and survival. For instance,
the Asb4 gene, which regulates embryonic stem-cell differentiation (Townley-Tilson et al.,
2014), exhibits strong maternally biased expression during cerebellar development but
is biallelically expressed during adulthood. This change in parental bias is achieved by a
significant decrease in maternal expression and an increase in paternal expression, which,
interestingly, maintains the total level of expression approximately constant. The growth
suppressor Grb10 gene, on the other hand, exhibits biallelic expression at P8 but exclusive
paternal expression in the adult, which is achieved by suppression of the maternal allele
(figure 3.8E). Strikingly, we observe a switch in the parental effect for the transcription factor
Zim1, from maternal bias during development to paternal bias in the adult CB, which is
achieved by a reduction in the maternal expression level from above the paternal expression
level to below it. We additionally detect a novel imprinted locus at the distal end of
chromosome 1, which exhibits age-dependent regulation. The genes that comprise this locus,
Ier5, Mr1, Stx6, and the putative BC034090 gene, which we name here Impar (for Imprinted
and Age Regulated), show a maternal bias during CB development but biallelic
expression in the adult (figure 3.8C). This shift in parental bias is achieved, for all genes in the
locus, by a reduction in the expression level of the maternal allele and, to a lesser extent, of the
paternal allele. It is interesting to note that Stx6 is believed to regulate neuronal migration
and formation of processes (Kabayama et al., 2008; Tiwari et al., 2011), two events necessary
for the integration of granule cells into the cerebellar circuit, which occurs at the P8 stage. It will
therefore be important to determine whether the imprinting regulation of this gene affects
these processes. Finally, we additionally observe age effects on the parental biases of specific
isoforms of Herc3, Mest, and H13, which all show isoform-specific imprinting.
Figure 3.8: Developmental regulation of Imprinting
Genomic Locations of Imprinted Genes
As described above our method detects imprinting where the parental expression bias
ranges from weak expression biases to absolute silencing of an allele (figure 3.9A). Inter-
estingly, a substantial proportion of these novel imprinted genes localize to the vicinity of
are uniform across the tissues we sampled or result from averaging variable magnitudes of
parental biases specific to individual cell types, as in the case of the nephron cell-type-specific
imprinting of the Gnas gene (Weinstein et al., 2000) and the neuron-specific imprinting
of the Snx14 gene (Huang et al., 2014). Resolving between the two options for all the
weakly biased imprinted genes clearly requires more sensitive high-throughput approaches.
However, the imprinting regulation of these genes, in at least some of the CB cell types,
is corroborated by several lines of evidence. First, many of these genes are located near or within
known imprinted clusters, and this localization is observed to be evolutionarily conserved,
suggesting it is under purifying selection. In addition, some of the isolated weakly biased
genes have putative imprinting control regions. Second, the parental biases of many of these
genes show both tissue and developmental-stage specificities. Third, many of these genes
are implicated in the same biological pathways, including cell survival and
apoptosis, as strongly biased genes.
Figure 3.11: Temporal regulation of imprinting
Figure 3.12: ROC for sensitivity analysis when changing the model parameter values.
Chapter 4
Allele-Specific Regulation in Human
Population across Multiple Tissues
About 99 percent of genes in humans have counterparts in the mouse. Eighty
percent have identical, one-to-one counterparts.
- Eric Lander
4.1 The GTEx Project Consortium
During my PhD, thanks to my advisor, I was lucky to have access to genotype data for
many individuals (185), and their respective gene expression across multiple ex vivo tissues
(10). The project I am referring to is the Genotype-Tissue Expression project (GTEx), and
the overarching aim of the project, as stated on its website, is to provide to the scientific
community a resource with which to study human gene expression and regulation and its
relationship to genetic variation.
This is an ambitious goal, but the data certainly provide enough information to identify, at
least in a given population of individuals, which variants may regulate the
expression of a gene. Such variants are called eQTLs, which stands for expression
quantitative trait loci. There is a clear limitation, in that the resource is not an atlas of
all the regulatory elements in the genome: the data will certainly miss regulatory
regions that do not span a genetic variant, and will miss regulatory regions for SNPs with very
low MAF (minor allele frequency) due to the low number of replicates.
Nevertheless, several other projects have tried to map regulatory activity through the
use of ChIP sequencing methods, in order to better understand gene regulation across multiple
cell types. For example, the ENCODE project used several cell lines to identify TF binding
sites, histone modification sites, DNA methylation regions and DNase hypersensitivity sites.
Another closely related project, with the aim of identifying regulatory regions in ex vivo
tissues as opposed to cell lines cultured in vitro, is the Epigenomics Roadmap project. Finally,
it is worth mentioning the FANTOM project, which just published an atlas of enhancers
across multiple tissues. These projects are complementary to the GTEx consortium since
they can help us in the interpretation of the eQTL results.
In this chapter of the thesis we extend our allelic imbalance models from previous chapters
to applications in the context of population studies, such as the GTEx project.
4.2 Experimental Design and Computational Workflow
The GTEx project collected gene expression across multiple tissues, and genotype data
from blood samples across multiple post-mortem donors (individuals). For a graphical ver-
sion of the GTEx experimental design please refer to figure 4.1.
Figure 4.1: Graphical representation of GTEx experimental design.
More specifically, the GTEx project consortium has collected RNA-Seq samples from as
many as 26 human tissues with varying sample numbers (10 tissues have at least 80 samples
in the v3 data). RNA-Seq was performed using the Illumina TruSeq library construction
protocol, which uses a non-strand-specific polyA+ library to produce 76-bp paired-end reads.
DNA samples were sent to the Broad Institute Genetic Analysis Platform for genotyping, where they
were placed on 96-well plates and genotyped using the Illumina HumanOmni5-4v1 SNP array. Individual
phenotypes were extensively recorded as covariates to account for sex, population structure,
and other surrogate variables.
We utilize the GTEx phs000424.v3.p1 data to perform our analysis and model construction.
This release contains genotype data for 185 individuals (including the 5M SNP array and exome array)
and 898 RNA-Seq samples from 14 tissues. The full experimental design is shown in table 4.1.
We applied an in-house computational workflow (figures 4.2 and 4.3) to quantify
allele-specific transcript expression in each individual genome.
Tissue Organ   Tissue Name             Tissue Code   RNA-Seq Samples (n)
Adipose        Adipose Subcutaneous    Sc            88
Heart          Left Ventricle          LV            80
Lung           Lung                    Lu            107
Whole Blood    Whole Blood             WBl           210
Artery         Tibial                  Tb            184
Brain          Amygdala                Am            23
Brain          Caudate                 Ca            33
Brain          Cerebellar Hemisphere   CH            22
Brain          Cerebellum              Cb            28
Brain          Cortex                  Ctx           25
Brain          Frontal Cortex          FC            23
Brain          Hippocampus             Hp            27
Brain          Hypothalamus            Hy            22
Brain          Nucleus Accumbens       NAc           26
Table 4.1: Number of RNA-Seq samples per tissue.
As shown in figure 4.2, for each GTEx individual, we first merged the genome-wide
SNP array data and exome array data to generate a combined genotype data set. We filtered
out the variants with minor allele frequency less than 5%. Then, we performed genotype
phasing using SHAPEIT2 to get the paternal and maternal alleles for each individual. We
also imputed a larger number of SNPs and variants using IMPUTE2, which in turn uses
data from the 1000 Genomes Project. We controlled the quality of the imputed variants by
setting a strict cutoff on imputation certainty.
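The minor allele frequency filter in the workflow above can be sketched as follows. The genotype encoding (0, 1 or 2 copies of the alternate allele per individual), the variant names and the function name are assumptions made for illustration; the phasing and imputation steps are performed by SHAPEIT2 and IMPUTE2 and are not reproduced here.

```python
def minor_allele_frequency(dosages):
    """MAF from per-individual dosages (0, 1 or 2 copies of the alternate allele)."""
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1.0 - alt_freq)

# Hypothetical genotypes: variant ID -> alternate-allele dosage per individual.
genotypes = {
    "snp_common": [0, 1, 2, 1, 0, 1, 0, 0, 1, 2],  # alternate allele frequency 0.40
    "snp_rare": [0] * 19 + [1],                    # alternate allele frequency 0.025
}

MAF_CUTOFF = 0.05  # variants with MAF below 5% are filtered out
kept = [v for v, d in genotypes.items() if minor_allele_frequency(d) >= MAF_CUTOFF]
print(kept)
```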
We further constructed a personal diploid genome for each GTEx individual using both the
original array-genotyped alleles and the imputed alleles. We processed the GTEx paired-end RNA-Seq data
with a common pipeline, which included adapter removal (TrimGalore) and read-quality
control (FastQC).
We applied both sophisticated algorithms and an in-house strategy to quantitatively measure
allele-specific expression at the transcript level (figure 4.3). We used the GENCODE annotation
to extract a personal transcriptome from the individual diploid genome generated in the previous
step. We performed splice-junction-aware genome alignment of the tissue-level RNA-Seq data using
STAR2, which maps sequencing reads to each personal transcriptome. Finally, we used
the model described in chapter 2 to quantify the transcript expression level for both the paternal
and maternal transcriptomes.
Figure 4.2: The computational workflow for processing GTEx DNA and RNA samples.
Figure 4.3: The computational workflow used to process the RNA-Seq data for quantifying allele-specific transcript expression.
4.3 Hi-Braim
The major difference between human observational studies and the mouse study is that in
humans there is a large diversity in the genomic sequences between individuals, and therefore
between the two alleles. Thus, in order to properly study allele-specific expression in humans
we need to have in mind two major sources of allelic imbalance: cis-aseQTL, and Imprinting.
4.3.1 Defining cis and trans eQTL and ASE
As we mentioned in previous chapters, we refer to Allele-specific expression (ASE) as
the quantification of the expression of a given allele (paternal or maternal allele) for a given
gene/isoform.
In this section our interest is to use such information to identify regulatory variants
that may control the expression of one of the alleles. It has been proposed in the literature
that genes may be regulated in two major forms, cis and trans: cis is the Latin
word meaning "on the same side of", and trans means "on the other side of". Specifically, we
refer to cis-regulation as a type of regulation that happens in an allele-specific manner, i.e.,
a mutation on the maternal allele can only affect the expression of the maternal copy of a
given gene/isoform, but cannot influence the expression of the paternal allele (figure 4.4a).
Several mechanisms could explain regulation in this manner, such as DNA methylation of a
specific allele, or a cis-eQTL located in the transcription factor binding site of a specific gene.
On the other hand, we refer to trans-regulation as regulation that happens regardless
of the allele of origin, i.e., both alleles are affected by the variant to the same extent
(figure 4.4b). A plausible mechanism for this type of regulation is that a variant
changes the expression of a protein that regulates the expression of a gene regardless of the
allele of origin, thus making similar changes to both the maternal and paternal copies of
the gene.
It is important to note that if we take the traditional eQTL mapping approaches, such
as regressing the total expression of a gene (the sum of the expression coming
from the paternal and maternal alleles) on the genotype, we cannot distinguish between cis- and
trans-acting regulatory programs (figure 4.4c,d).
This poses the question: how can we then distinguish between cis and trans eQTLs? In
order to do so we can use the allele-specific expression estimates. Let us take the difference
in expression between the two alleles, y = yp − ym. In the case of a trans-eQTL,
regardless of the genotype of the candidate eQTL, we expect the mean difference to be 0
across all individuals. In the case of a cis-eQTL, however, we would expect the mean
difference across all individuals to be larger than 0 when the genotype is heterozygous, but
we would expect the difference to be 0 when the genotype is homozygous (figure 4.5).
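The argument can be illustrated with a toy data set. The expression values below are invented and noise free, and the two-letter genotype strings are an assumed encoding in which the first letter is the allele whose expression is listed first. The two scenarios have identical total expression per genotype, so only the allelic difference separates them.

```python
from statistics import mean

# Hypothetical data: (genotype, expression of 1st allele, expression of 2nd allele).
cis_eqtl = [("CT", 30.0, 10.0), ("CT", 30.0, 10.0), ("CC", 30.0, 30.0), ("TT", 10.0, 10.0)]
trans_eqtl = [("CT", 20.0, 20.0), ("CT", 20.0, 20.0), ("CC", 30.0, 30.0), ("TT", 10.0, 10.0)]

def mean_total_by_genotype(data):
    """Mean total expression (sum of both alleles) for each genotype."""
    genotypes = sorted({g for g, _, _ in data})
    return {g: mean(a + b for gg, a, b in data if gg == g) for g in genotypes}

def mean_difference_by_zygosity(data):
    """Mean allelic difference (1st minus 2nd allele) in heterozygotes and homozygotes."""
    het = [a - b for g, a, b in data if g[0] != g[1]]
    hom = [a - b for g, a, b in data if g[0] == g[1]]
    return {"het": mean(het), "hom": mean(hom)}

# Total expression looks the same in both scenarios ...
print(mean_total_by_genotype(cis_eqtl) == mean_total_by_genotype(trans_eqtl))
# ... but the allelic difference in heterozygotes does not.
print(mean_difference_by_zygosity(cis_eqtl))
print(mean_difference_by_zygosity(trans_eqtl))
```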
Figure 4.4: (a) An example of a cis-eQTL: when the genotype of one allele is G, the expression of the gene located on the same allele is higher than when the genotype of the allele is C. (b) An example of a trans-eQTL: the allele expression is balanced, but the total expression of the gene (the sum of the expression of the two alleles) is higher when SNP A is present. (c,d) If we do not consider the allele-specific expression and only look at the total expression of the gene, we cannot distinguish between a cis- and a trans-eQTL.
4.3.2 Hi-Braim with No Imprinting
In this section we propose an allelic imbalance model across multiple tissues that accounts
for two major sources of allelic imbalance: cis-regulation in an allele-specific manner
(cis-aseQTL) and imprinting. In order to do so we propose a Hierarchical Bayesian Regression
Allelic Imbalance Model across multiple tissues and individuals, Hi-BRAIM.
Figure 4.5: An example of a cis-eQTL. Individuals (i) and (ii) have heterozygous SNPs in their exons, and therefore we can quantify their allele-specific expression. Individual (iii), however, has no heterozygous SNPs in the exons, and therefore we cannot estimate its allele-specific expression. The candidate SNP is CT in individual (i) and CC in individual (ii). We observe from the figure that this is a cis-eQTL, since the number of reads coming from the C allele is higher than the number of reads coming from the T allele in individual (i). In contrast, the number of reads coming from each of the C alleles is the same in individual (ii). The fact that we observe a positive difference between the two alleles, as defined by the cis-eQTL, makes CT a candidate for allele-specific regulation.
Let us focus on one candidate SNP at a time. Without loss of generality, let us say we
know that in our population of interest the major allele is T (with population frequency
pM), and the minor allele is C (with population frequency pm = 1 − pM). We expect the
values for this SNP, in a sample of size n from our population of interest, to be homozygous
TT (in n × pM × pM individuals), heterozygous CT or TC (in 2 × n × pM × pm individuals)
and homozygous CC (in n × pm × pm individuals). Our estimand of interest, ASE(T,C), is the average
difference in expression in our population of interest between the T and the C alleles of a
given gene/isoform. If the estimate of the difference between the alleles (and its confidence
interval) is always strictly positive or strictly negative, we can assume that there is an effect
of the SNP on the expression of its alleles; in more specific terms, a single nucleotide base
difference, upstream of the gene/isoform, has an effect on the number of copies generated by
each allele.
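As a rough guide to how many informative (heterozygous) individuals to expect, the genotype proportions above can be evaluated numerically. The sketch assumes Hardy-Weinberg proportions, which the products of allele frequencies in the text imply, counts CT and TC together as a single heterozygous class, and uses n = 185 with a minor allele frequency of 5% purely as an example.

```python
def expected_genotype_counts(n, maf):
    """Expected genotype counts under Hardy-Weinberg proportions (assumed)."""
    p_major, p_minor = 1.0 - maf, maf
    return {
        "TT": n * p_major ** 2,                 # homozygous major
        "CT_or_TC": 2 * n * p_major * p_minor,  # heterozygous, both phases pooled
        "CC": n * p_minor ** 2,                 # homozygous minor
    }

counts = expected_genotype_counts(n=185, maf=0.05)
for genotype, expected in counts.items():
    print(genotype, round(expected, 2))
```

At this frequency fewer than 20 of the 185 individuals are expected to be heterozygous, which illustrates why SNPs with low MAF carry little information about allele-specific regulation.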
Let us assume we have estimates for the gene/isoform expression of each of the alleles
(in TPM), yT and yC, and their respective estimation errors ε(yT) and ε(yC). Let us
denote our response, i.e., the difference in expression between the two alleles, as y = yT − yC,
and its associated estimation error as ε = ε(y). In practice, we estimate this response
independently across all individuals in all tissues using models for allele-specific expression
and RNA-Seq data as in Chapter 2.
Let us first focus on all the samples taken from individuals for whom the target SNP is
heterozygous, CT or TC in our example. We model $y^{(het)}_{t,j}$, i.e., the difference in expression
between the T and the C allele in tissue t of individual j, as normally distributed with mean
$z^{(het)}_{t,j}$ and variance $[\varepsilon^{(het)}_{t,j}]^2$,

$$y^{(het)}_{t,j} = z^{(het)}_{t,j} + \varepsilon^{(het)}_{t,j}, \qquad \varepsilon^{(het)}_{t,j} \sim N\left(0, \left[\varepsilon^{(het)}_{t,j}\right]^2\right)$$

where $j = 1, \ldots, n^{(het)}_t$, and $n^{(het)}_t$ is the number of heterozygous individuals in a given tissue
$t = 1, \ldots, T$. Note that $z^{(het)}_{t,j}$ is a latent variable that represents the true (unobserved) value
of the difference in expression for the alleles corresponding to the heterozygous SNPs, and
we model it as,

$$z^{(het)}_{t,j} \sim N\left(\mu^{(het)}_t, \sigma_t^2\right) \tag{4.1}$$

$\mu^{(het)}_t$ represents the average difference in expression across all individuals with a heterozygous
SNP in tissue t. We assume a linear model for $\mu^{(het)}_t$ as follows,

$$\mu^{(het)}_t = \alpha^{(het)}_t + X^{(het)}\beta \tag{4.2}$$

where $\alpha^{(het)}_t$ represents the average difference in expression in a given tissue across all
individuals with a heterozygous target SNP, and we put a prior on it, $\alpha^{(het)}_t \sim N(\mu_\alpha, \sigma_\alpha^2)$, with
hyperprior parameters $\mu_\alpha$ and $\sigma_\alpha^2$.
The matrix $X^{(het)}$ is of dimension $n^{(het)}_t \times p$ and it represents the matrix of covariates for all
individuals with a heterozygous target SNP in a given tissue, and β is a vector of dimension
p; we assume a non-informative prior for each of its entries.
Nevertheless, in our sample population not all individuals are heterozygous for the target SNP. Thus, for the individuals with a homozygous target SNP, i.e., with target SNP values of CC and TT in our example, we model them as having mean 0 but sharing the same variance as the heterozygous individuals, thus,

\[
y^{(hom)}_{t,j} = z^{(hom)}_{t,j} + \varepsilon^{(hom)}_{t,j}, \qquad
\varepsilon^{(hom)}_{t,j} \sim N\left(0, \left[\varepsilon^{(hom)}_{t,j}\right]^2\right).
\]

Moreover, we assume that the true difference in expression between alleles also has mean 0 in the homozygous case, $z^{(hom)}_{t,j} \sim N(0, \sigma^2_t)$. This is the key difference between the two groups of individuals. In the case of the heterozygous group we assumed the difference to be positive, with parameter $\alpha^{(het)}_t$, and in the case of the homozygous group we assumed the mean difference to be 0.
Note that $\sigma^2_t$ represents the variance of the (true) difference between the alleles for tissue $t$, and we assume it to be common to individuals with homozygous and heterozygous target SNPs. The advantage of this assumption is that we gain sample size: the individuals with homozygous target SNPs help us estimate the variance for the group of interest, the heterozygous group.
Finally, we put a prior $\sigma^2_t \sim IG(a, b)$, with hyperparameters $a$ and $b$, and a prior $\sigma^2_\alpha \sim IG(a^*, b^*)$, with hyperparameters $a^*$ and $b^*$.
Detecting Imprinting
The previous model would work for genes that do not exhibit any parent-of-origin effect, which is roughly 97% or more of the genes in the human genome. However, for the small percentage of genes that may exhibit some parental bias in their expression, the previous model would be oversimplified and would not detect such effects. Let us define our response as $y = \delta \cdot (y_p - y_m)$, where $\delta$ is an indicator that takes the value 1 if $y_p$ actually corresponds to the paternal genome, and $-1$ otherwise. The intermediate sign variable $\delta$ helps us estimate the parental bias, but since $\delta$ is missing, we would ideally integrate over $\delta$ in our model in order to obtain an estimate of the parent-of-origin effect.
4.3.3 Hi-Braim with Imprinting
In this section we propose an allelic imbalance model across multiple tissues that accounts for two major sources of allelic imbalance: cis-regulation in an allele-specific manner (cis-aseQTL) and imprinting. To do so, we extend the previous model to a Hierarchical Mixture model, in which we model imprinting using a latent variable (missing data) that indicates whether the difference corresponds to the paternal allele minus the maternal allele, or vice versa.
Let us define the difference in expression between the two alleles for tissue $t$ and individual $j$ as $y_{t,j} = y^{(p)}_{t,j} - y^{(m)}_{t,j}$. Then, we can model $y_{t,j} \sim N(z_{t,j}, [\varepsilon_{t,j}]^2)$. Next, we model the true difference in expression, $z_{t,j}$, as a mixture of normals¹ with a linear imprinting effect $\nu_t$, a SNP-specific effect $\alpha_t$ (Figure 4.6), and a covariate effect $\beta$,

\[
z_{t,j} \mid \delta_{t,j}, \nu_t, \alpha_t, \beta, \sigma^2_t \sim
\begin{cases}
N(\nu_t + \alpha_t + X\beta, \sigma^2_t), & \delta_{t,j} = 1 \\
N(-\nu_t + \alpha_t + X\beta, \sigma^2_t), & \delta_{t,j} = -1
\end{cases}
\tag{4.3}
\]

¹ The mixture comes from the fact that the true paternal allele is unknown. Thus, for example, a 0/100 imprinted gene will look like a mixture, since some of the responses will correspond to the paternal minus the maternal allele, and some of the differences will correspond to exactly the opposite case.

Figure 4.6: The allelic imbalance in this figure can be explained with two sources in mind. $\nu$ represents the imprinting effect and $\alpha$ represents the cis-effect, only present in the heterozygous individuals.
Note that analogous to our model with no imprinting the cis-effects can be decomposed
into homozygous SNPs for which we assume the difference in expression between the alleles
to be 0, and therefore no αt and β effects are in the model, and heterozygous SNPs for which
we model the difference as having an effect modeled by αt and β.
Next, we specify the prior distributions of our model parameters. We put a log-normal prior on $\nu_t \sim LN(\log \mu_\nu, \sigma^2_\nu)$ because $\nu_t$ can only take positive values. We put a normal prior on $\alpha_t \sim N(\mu_\alpha, \sigma^2_\alpha)$, and a noninformative prior on $\beta$, $p(\beta) = 1$. We put an inverse-gamma prior on $\sigma^2_t \sim IG(a, b)$, and we assume $p(\delta_{t,j} = 1) = 1/2$.
Finally, we specify the hyperprior distributions. We assume a non-informative hyperprior on $\mu_\alpha$, an inverse-gamma hyperprior on $\sigma^2_\alpha \sim IG(a^*, b^*)$ and on $\sigma^2_\nu \sim IG(c^*, d^*)$. And lastly, a log-normal hyperprior on $\log \mu_\nu \sim LN(0, 10^2)$.
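To make the generative reading of the mixture model (4.3) concrete, the following is a minimal simulation sketch for a single tissue. It is an illustration under stated assumptions rather than the implementation used in this work: the function name `simulate_hibraim` and its arguments are hypothetical, the covariate term $X\beta$ is omitted, and $\nu_t$, $\alpha_t$, $\sigma_t$ and the observation error are passed in as fixed values instead of being drawn from their priors.

```python
import random


def simulate_hibraim(n_het, n_hom, nu_t, alpha_t, sigma_t, obs_sd, seed=0):
    """Simulate allelic differences for one tissue under the mixture model (4.3).

    Assumptions of this sketch: no covariates (X beta = 0) and a single
    observation-error standard deviation `obs_sd` shared by all samples.
    Returns a list of (group, delta, z, y) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for group, n in (("het", n_het), ("hom", n_hom)):
        for _ in range(n):
            # Unknown parental orientation: p(delta = 1) = 1/2.
            delta = rng.choice((1, -1))
            # The cis-effect alpha_t is present only for heterozygous SNPs.
            cis = alpha_t if group == "het" else 0.0
            mu = delta * nu_t + cis
            z = rng.gauss(mu, sigma_t)   # true (latent) difference
            y = rng.gauss(z, obs_sd)     # observed difference
            rows.append((group, delta, z, y))
    return rows


if __name__ == "__main__":
    for row in simulate_hibraim(3, 3, nu_t=2.0, alpha_t=1.0, sigma_t=0.5, obs_sd=0.1):
        print(row)
```

With a strong imprinting effect and small variances, the simulated `y` values for the homozygous group cluster around $\pm\nu_t$, which is the bimodal pattern the mixture is meant to capture.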
Adaptive MCMC within Gibbs Sampling
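Let us write the joint density of the observed data, $y = (y^{(het)}, y^{(hom)})$, the diagonal covariance matrix of observation errors, $E = \mathrm{diag}\left([\varepsilon^{(het)}]^2, [\varepsilon^{(hom)}]^2\right)$, and the vector of parameters and latent variables, $\theta = (z^{(hom)}, z^{(het)}, \nu, \alpha, \beta, \sigma^2, \mu_\alpha, \sigma^2_\alpha, a, b, a^*, b^*, c^*, d^*)$,

\begin{align*}
f(y, E; \theta) = \prod_{t=1}^{T} & \prod_{j=1}^{n^{(hom)}_t} \left[ f\left(y^{(hom)}_{t,j} \mid z^{(hom)}_{t,j}, [\varepsilon^{(hom)}]^2\right) f\left(z^{(hom)}_{t,j} \mid \delta^{(hom)}_{t,j}, \mu^{(hom)}_{t,j}, \sigma^2_t\right) f\left(\delta^{(hom)}_{t,j}\right) \right] \\
& \prod_{j=1}^{n^{(het)}_t} \left[ f\left(y^{(het)}_{t,j} \mid z^{(het)}_{t,j}, [\varepsilon^{(het)}]^2\right) f\left(z^{(het)}_{t,j} \mid \delta^{(het)}_{t,j}, \mu^{(het)}_{t,j}, \sigma^2_t\right) f\left(\delta^{(het)}_{t,j}\right) \right] \\
& f\left(\nu_t \mid \sigma^2_\nu\right) f\left(\alpha^{(het)}_t \mid \mu_\alpha, \sigma^2_\alpha\right) f\left(\sigma^2_t \mid a, b\right) f\left(\sigma^2_\nu \mid c^*, d^*\right) f\left(\sigma^2_\alpha \mid a^*, b^*\right)
\end{align*}

where

\[
\mu^{(hom)}_{t,j} = \delta^{(hom)}_{t,j} \nu_t, \qquad \text{and} \qquad
\mu^{(het)}_{t,j} = \alpha^{(het)}_t + X\beta + \delta^{(het)}_{t,j} \nu_t.
\]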
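We first derived a Gibbs sampler, but it converged very slowly to the true values. Thus we decided to reduce the number of parameters in our model by integrating the joint density over the latent variables $\delta$ and $z$. Let us focus on the part of the posterior that depends on $\delta$ and $z$; for this part we can write

\begin{align*}
f_1(y, E; \theta) &\propto \prod_{t=1}^{T} \prod_{j=1}^{n_t} \left[ f\left(y_{t,j} \mid z_{t,j}, [\varepsilon_{t,j}]^2\right) f\left(z_{t,j} \mid \delta_{t,j}, \mu_{t,j}, \sigma^2_t\right) f\left(\delta_{t,j}\right) \right] \\
&\propto \prod_{t=1}^{T} \prod_{j=1}^{n_t} \exp\left\{ -\frac{(y_{t,j} - z_{t,j})^2}{2 [\varepsilon_{t,j}]^2} \right\}
\left( \exp\left\{ -\frac{(z_{t,j} - \mu_{t,j})^2}{2 \sigma^2_{t,j}} \right\} \right)^{\delta_{t,j}}
\left( \exp\left\{ -\frac{(z_{t,j} - \mu_{t,j})^2}{2 \sigma^2_{t,j}} \right\} \right)^{1 - \delta_{t,j}}
\end{align*}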
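Let us first integrate (sum) over all the possible values of $\delta$. Since $\delta$ can only take two values, the integration is merely the sum of the two mixture components,

\begin{align*}
f_2(y, E; \theta) &\propto \prod_{t=1}^{T} \prod_{j=1}^{n_t} \exp\left\{ -\frac{(y_{t,j} - z_{t,j})^2}{2 [\varepsilon_{t,j}]^2} \right\}
\left( \exp\left\{ -\frac{(z_{t,j} - \mu^{+}_{t,j})^2}{2 \sigma^2_{t,j}} \right\} + \exp\left\{ -\frac{(z_{t,j} - \mu^{-}_{t,j})^2}{2 \sigma^2_{t,j}} \right\} \right) \\
&\propto \prod_{t=1}^{T} \prod_{j=1}^{n_t} \left[ \exp\left\{ -\frac{(y_{t,j} - z_{t,j})^2}{2 [\varepsilon_{t,j}]^2} \right\} \exp\left\{ -\frac{(z_{t,j} - \mu^{+}_{t,j})^2}{2 \sigma^2_{t,j}} \right\}
+ \exp\left\{ -\frac{(y_{t,j} - z_{t,j})^2}{2 [\varepsilon_{t,j}]^2} \right\} \exp\left\{ -\frac{(z_{t,j} - \mu^{-}_{t,j})^2}{2 \sigma^2_{t,j}} \right\} \right]
\end{align*}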
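As a numerical illustration of summing over the two values of $\delta$, the sketch below evaluates the log of the two-component mixture density of $z$ in a numerically stable way. It is a hedged example, not the sampler used here: the function names are hypothetical, and unlike the proportional form above it keeps the normalizing constants and the prior weight $p(\delta = \pm 1) = 1/2$ explicitly.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def log_z_density_marginal(z, mu_plus, mu_minus, var):
    """Log of sum over delta of p(z | delta) p(delta), with p(delta = +1) = 1/2.

    mu_plus and mu_minus are the component means for delta = +1 and delta = -1.
    The log-sum-exp trick avoids underflow when both components are tiny.
    """
    a = log_normal_pdf(z, mu_plus, var)
    b = log_normal_pdf(z, mu_minus, var)
    m = max(a, b)
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))


if __name__ == "__main__":
    # Example: imprinting effect nu = 2 and no cis-effect, so the means are +2 and -2.
    print(log_z_density_marginal(1.5, 2.0, -2.0, 0.25))
```

Working on the log scale matters in practice because the product over tissues and individuals multiplies many such terms.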
Now we have the sum of the multiplication of two-normal densities. We can integrate
\[
p(Y, Z, I, S \mid \theta, \omega, \mu, \sigma, \lambda, \pi_S, \pi_I) = p(Y \mid Z, \mu, \sigma, \lambda)\, p(Z \mid \theta, \omega, S)\, p(I \mid \pi_I)\, p(S \mid \pi_S). \tag{5.14}
\]

Note that the unknown variables in the model include all missing data and parameters. We set $\pi_S$ to be a small number (default 0.01) to reflect our belief that only a small proportion of genes are biomarkers. The set of unknown variables is $\{Z, I, S, \mu, \sigma, \lambda, \theta, \omega, \pi_I, K\}$. We propose a Gibbs sampler to sample from the posterior distribution of the missing data and parameters. We integrated out $\mu$, $\sigma$, $\lambda$ and $\theta$, which yields an efficient collapsed Gibbs sampler.
Gibbs sampling algorithm
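To derive the collapsed Gibbs sampling algorithm, let us first assume $Z$ is known and look at the conditional posterior distribution of $S$ and $I$. The likelihood function for the missing data $Z$ is

\[
p(Z \mid I, S, \theta, \omega) = \prod_{g: S_g = 0} \omega_g^{n_{g,1}} (1 - \omega_g)^{n_{g,0}} \prod_{g: S_g = 1} \prod_{k=1}^{K} \theta_{g,k}^{n_{g,k,1}} (1 - \theta_{g,k})^{n_{g,k,0}}, \tag{5.15}
\]

where $\{n_{g,0}, n_{g,1}\}$ are the numbers of cells in which gene $g$ is inactive or active across all cells, and $\{n_{g,k,0}, n_{g,k,1}\}$ are the numbers of inactive or active cells in sub-population $k$ for gene $g$. We adopt a $\mathrm{Beta}(\alpha_\omega, \alpha_\omega)$ prior for $\omega_g$ and a $\mathrm{Beta}(\alpha_\theta, \alpha_\theta)$ prior for $\theta_{g,k}$, where the default values of $\alpha_\omega$ and $\alpha_\theta$ are set to 0.5. We integrate out the $\omega$'s and $\theta$'s and obtain the marginal likelihood,

\begin{align}
p(Z \mid I, S) &= \prod_{g: S_g = 0} \frac{B(\alpha_\omega + n_{g,1}, \alpha_\omega + n_{g,0})}{B(\alpha_\omega, \alpha_\omega)} \prod_{g: S_g = 1} \prod_{k=1}^{K} \frac{B(\alpha_\theta + n_{g,k,1}, \alpha_\theta + n_{g,k,0})}{B(\alpha_\theta, \alpha_\theta)} \notag \\
&= \prod_{g: S_g = 0} \left[ \frac{\Gamma(\alpha_\omega + n_{g,1})\, \Gamma(\alpha_\omega + n_{g,0})}{\Gamma(2\alpha_\omega + C)} \frac{\Gamma(2\alpha_\omega)}{\Gamma(\alpha_\omega)\, \Gamma(\alpha_\omega)} \right]
\prod_{g: S_g = 1} \prod_{k=1}^{K} \left[ \frac{\Gamma(\alpha_\theta + n_{g,k,1})\, \Gamma(\alpha_\theta + n_{g,k,0})}{\Gamma(2\alpha_\theta + C_k)} \frac{\Gamma(2\alpha_\theta)}{\Gamma(\alpha_\theta)\, \Gamma(\alpha_\theta)} \right] \tag{5.16}
\end{align}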
where $C$ is the total number of cells, $C_k$ denotes the number of cells in sub-population $k$ according to the current $I$, and $B(\cdot, \cdot)$ and $\Gamma(\cdot)$ denote the beta and gamma functions. The posterior distribution of $I$ and $S$ is proportional to the product of the likelihood function $p(Z \mid I, S)$ and the prior distributions of $I$ and $S$,

\[
p(I, S \mid Z) \propto p(Z \mid I, S)\, p(I)\, p(S). \tag{5.17}
\]

We let $p(S_g = 1) = \pi_S$, where $\pi_S$ reflects our prior knowledge of the proportion of genes differentially expressed in sub-populations. By default (in the current implementation), we assume that a small proportion of genes are differentially expressed (biomarkers) across sub-populations and we set the prior probability $\pi_S = 0.05$.
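Our goal is to draw $S$ (the biomarker selection indicators) and $I$ (the cell sub-population indicators) from their marginal posterior distribution, $p(S, I \mid Z)$. We initialize $Z$, $S$ and $I$ according to their prior distributions and use the Gibbs sampling algorithm to update each $S_g$ and $I_c$. Here we derive the conditional distributions for updating $S_g$ and $I_c$. The probability of selecting gene $g$ as a biomarker, conditional on $Z$ and $I$, can be calculated as follows,

\begin{align}
p(S_g = 1 \mid Z, I) &\propto p(S_g = 1)\, p(Z \mid S_g = 1, I) \notag \\
&= \pi_S \prod_{k=1}^{K} \left[ \frac{\Gamma(\alpha_\theta + n_{g,k,1})\, \Gamma(\alpha_\theta + n_{g,k,0})}{\Gamma(2\alpha_\theta + C_k)} \frac{\Gamma(2\alpha_\theta)}{\Gamma(\alpha_\theta)\, \Gamma(\alpha_\theta)} \right], \tag{5.18} \\
p(S_g = 0 \mid Z, I) &\propto p(S_g = 0)\, p(Z \mid S_g = 0, I) \notag \\
&= (1 - \pi_S) \left( \frac{\Gamma(\alpha_\omega + n_{g,1})\, \Gamma(\alpha_\omega + n_{g,0})}{\Gamma(2\alpha_\omega + C)} \frac{\Gamma(2\alpha_\omega)}{\Gamma(\alpha_\omega)\, \Gamma(\alpha_\omega)} \right). \tag{5.19}
\end{align}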
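The two unnormalized probabilities (5.18) and (5.19) can be evaluated stably on the log scale with the log-gamma function. The following is a minimal sketch for a single gene, assuming the per-sub-population counts have already been tabulated; the function names and the input format (a list of `(n_active, n_inactive)` pairs) are illustrative choices, not part of the original implementation.

```python
import math


def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal(n1, n0, alpha):
    """Log of B(alpha + n1, alpha + n0) / B(alpha, alpha), as in (5.16)."""
    return log_beta_fn(alpha + n1, alpha + n0) - log_beta_fn(alpha, alpha)


def prob_biomarker(counts_by_group, pi_s=0.05, alpha_theta=0.5, alpha_omega=0.5):
    """Normalized p(S_g = 1 | Z, I) for one gene, following (5.18) and (5.19).

    counts_by_group: list of (n_active, n_inactive), one pair per sub-population.
    """
    log_p1 = math.log(pi_s) + sum(
        log_marginal(n1, n0, alpha_theta) for n1, n0 in counts_by_group
    )
    total1 = sum(n1 for n1, _ in counts_by_group)
    total0 = sum(n0 for _, n0 in counts_by_group)
    log_p0 = math.log(1.0 - pi_s) + log_marginal(total1, total0, alpha_omega)
    m = max(log_p1, log_p0)
    w1, w0 = math.exp(log_p1 - m), math.exp(log_p0 - m)
    return w1 / (w1 + w0)


if __name__ == "__main__":
    print(prob_biomarker([(45, 5), (5, 45)]))    # strong contrast between groups
    print(prob_biomarker([(25, 25), (25, 25)]))  # no contrast between groups
```

A gene whose activity differs sharply between sub-populations receives a probability close to 1, while a gene with identical activity in both groups is penalized below the prior $\pi_S$ because the extra parameters are not needed.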
In the sampler, once we have sampled the set of biomarkers according to their posterior probability, we want to assign each of the cells to a sub-population. To do so, we need to update the posterior for the cell population indicator $I_c$ of each cell $c$, given the new set of biomarkers.
In a mixture of finite mixtures model, the prior probability of cell $c$ joining an existing sub-population $k$, conditional on the sub-population assignments of the other cells, is

\[
p(I_c = k \mid I_{[-c]}) \propto (C_k + 1)\, \kappa(C, K), \tag{5.20}
\]

and the prior probability of forming a new sub-population is

\[
p(I_c = K + 1 \mid I_{[-c]}) \propto \kappa(C, K + 1), \tag{5.21}
\]

where $K + 1$ represents the new sub-population, and the function $\kappa(c, k) = E\left[K_{(c)} / K^{(k)}\right]$. Here,

\begin{align}
K_{(t)} &= K (K - 1) \cdots (K - t + 1), \tag{5.22} \\
K^{(n)} &= K (K + 1) \cdots (K + n - 1). \tag{5.23}
\end{align}

The numbers $\kappa(c, k)$ can be precomputed efficiently using the following recursion,

\begin{align}
\kappa(c, k) &= \kappa(c - 1, k - 1) - (c + t - 2)\, \kappa(c, k - 1), \tag{5.24} \\
\kappa(c, 0) &= E\left[1 / K^{(n)}\right] = P(S > n) / \gamma^n, \tag{5.25}
\end{align}

where $\gamma$ is the Poisson distribution parameter for $K$.
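The conditional distribution of $I_c$ given $Z$, $S$ and $I_{[-c]}$ is

\begin{align}
p(I_c = k \mid Z, S, I_{[-c]}) &\propto p(I_c = k \mid I_{[-c]})\, p(Z \mid I_c = k, I_{[-c]}, S), \tag{5.26} \\
&\propto (C_k + 1)\, \kappa(C, K)\, p(Z \mid I_c = k, I_{[-c]}, S), \tag{5.27}
\end{align}

and

\begin{align}
p(I_c = K + 1 \mid Z, S, I_{[-c]}) &\propto p(I_c = K + 1 \mid I_{[-c]})\, p(Z \mid I_c = K + 1, I_{[-c]}, S), \tag{5.28} \\
&\propto \kappa(C, K + 1)\, p(Z \mid I_c = K + 1, I_{[-c]}, S), \tag{5.29}
\end{align}

where

\[
p(Z \mid I_c = k, I_{[-c]}, S) \propto \prod_{g: S_g = 1} \prod_{l=1}^{K} \left[ \frac{\Gamma(\alpha_\theta + n^*_{g,l,1})\, \Gamma(\alpha_\theta + n^*_{g,l,0})}{\Gamma(2\alpha_\theta + C^*_l)} \frac{\Gamma(2\alpha_\theta)}{\Gamma(\alpha_\theta)\, \Gamma(\alpha_\theta)} \right], \tag{5.30}
\]

where $\{n^*_{g,l,0}, n^*_{g,l,1}\}$ are the numbers of inactive or active cells in sub-population $l$ for each gene $g$, and $C^*_l$ denotes the number of cells in sub-population $l$, all conditional on the current sub-population assignments $I$ except that $I_c = k$.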
Conditional on a given $Z$, we can use the conditional distributions above to update $S$ and $I$. To complete one Gibbs sampling iteration, we also need to update $Z$ conditional on $S$, $I$, $\theta$, $\omega$, $Y$, $\mu$, $\sigma$ and $\lambda$. In the previous calculations of the conditional distributions of $S$ and $I$, we integrated out the parameters $\theta$ and $\omega$. Fortunately, we can apply the same scheme to integrate out $\theta$, $\omega$, $\mu$, $\sigma$ and $\lambda$ from the conditional distribution of $Z$. Therefore, in the Gibbs sampling process we only need to sample from the joint posterior distribution of $Z$, $S$ and $I$. This collapsed Gibbs sampling scheme substantially improves the mixing of the Markov chain and the computational efficiency.
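The observed likelihood function of $Y_g = \{Y_{1,g}, \dots, Y_{C,g}\}$ conditional on $Z_g = \{Z_{1,g}, \dots, Z_{C,g}\}$ and the other parameters is

\begin{align}
p(Y_g \mid Z_g, \mu_g, \sigma_g, \lambda_g) &= \prod_{c=1}^{C} \left[ \mathcal{E}(Y_{c,g} \mid \lambda_g) \right]^{I(Z_{c,g} = 0)} \left[ N(Y_{c,g} \mid \mu_g, \sigma^2_g) \right]^{I(Z_{c,g} = 1)} \tag{5.31} \\
&= \prod_{c=1}^{C} \left[ \lambda_g \exp(-\lambda_g Y_{c,g}) \right]^{I(Z_{c,g} = 0)} \left[ (2\pi\sigma^2_g)^{-1/2} \exp\left( -\frac{1}{2\sigma^2_g} (Y_{c,g} - \mu_g)^2 \right) \right]^{I(Z_{c,g} = 1)}, \tag{5.32}
\end{align}

where $\mathcal{E}(\cdot \mid \cdot)$ and $N(\cdot \mid \cdot, \cdot)$ are the Exponential and Normal density functions. First, by Exponential-Gamma conjugacy, we can integrate out $\lambda_g$ from the likelihood function over its $\mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$ prior distribution, here $\mathrm{Gamma}(1, 1)$,

\begin{align}
p(Y_g \mid Z_g, \mu_g, \sigma_g) &= \int p(Y_g \mid Z_g, \mu_g, \sigma_g, \lambda_g)\, p(\lambda_g)\, d\lambda_g \tag{5.33} \\
&= \frac{\beta_\lambda^{\alpha_\lambda}}{\Gamma(\alpha_\lambda)} \frac{\Gamma(\alpha_\lambda + n_{g,0})}{\left( \beta_\lambda + \sum_{c: Z_{c,g} = 0} Y_{c,g} \right)^{\alpha_\lambda + n_{g,0}}} \cdot \prod_{c=1}^{C} \left[ (2\pi\sigma^2_g)^{-1/2} \exp\left( -\frac{1}{2\sigma^2_g} (Y_{c,g} - \mu_g)^2 \right) \right]^{I(Z_{c,g} = 1)}. \tag{5.34}
\end{align}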
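Then, we integrate $\mu_g$ and $\sigma_g$ out of the likelihood function over their prior distributions,

\begin{align}
& p(Y_g \mid Z_g) \tag{5.35} \\
&= \int p(Y_g \mid Z_g, \mu_g, \sigma_g)\, p(\mu_g)\, p(\sigma^2_g)\, d\mu_g\, d\sigma^2_g \tag{5.36} \\
&= \frac{\beta_\lambda^{\alpha_\lambda}}{\Gamma(\alpha_\lambda)} \frac{\Gamma(\alpha_\lambda + n_{g,0})}{\left( \beta_\lambda + \sum_{c: Z_{c,g} = 0} Y_{c,g} \right)^{\alpha_\lambda + n_{g,0}}} \cdot \tag{5.37} \\
&\quad \frac{\beta_\sigma^{\alpha_\sigma} (2\pi)^{-\frac{n_{g,1}}{2}}}{\Gamma(\alpha_\sigma) \sqrt{n_{g,1} + 1}} \frac{\Gamma\left( \frac{n_{g,1}}{2} + \alpha_\sigma \right)}{\left[ \beta_\sigma + \frac{1}{2} \left( \sum_{c: Z_{c,g} = 1} Y^2_{c,g} + \mu_0^2 - \frac{\left( \sum_{c: Z_{c,g} = 1} Y_{c,g} + \mu_0 \right)^2}{n_{g,1} + 1} \right) \right]^{\frac{n_{g,1}}{2} + \alpha_\sigma}}. \tag{5.38}
\end{align}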
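The closed form in (5.37) and (5.38) is straightforward to evaluate on the log scale. The sketch below is a hedged illustration for one gene: the function name is hypothetical, and the default values for $\mu_0$, $\alpha_\sigma$ and $\beta_\sigma$ are placeholders chosen for the example (only the Gamma(1, 1) prior on $\lambda_g$ is stated in the text).

```python
import math


def log_marginal_y(y, z, alpha_lam=1.0, beta_lam=1.0,
                   mu0=0.0, alpha_sig=1.0, beta_sig=1.0):
    """Log of p(Y_g | Z_g) following (5.37) and (5.38) for one gene.

    y: expression values of the gene across cells.
    z: 0/1 activity states, one per cell (0 = Exponential, 1 = Normal component).
    mu0, alpha_sig, beta_sig defaults are assumptions made for this sketch.
    """
    off = [v for v, s in zip(y, z) if s == 0]
    on = [v for v, s in zip(y, z) if s == 1]
    n0, n1 = len(off), len(on)

    # Exponential likelihood with lambda integrated over Gamma(alpha_lam, beta_lam).
    log_exp_part = (alpha_lam * math.log(beta_lam) - math.lgamma(alpha_lam)
                    + math.lgamma(alpha_lam + n0)
                    - (alpha_lam + n0) * math.log(beta_lam + sum(off)))

    # Normal likelihood with mu and sigma^2 integrated out.
    s1 = sum(on)
    s2 = sum(v * v for v in on)
    beta_post = beta_sig + 0.5 * (s2 + mu0 ** 2 - (s1 + mu0) ** 2 / (n1 + 1))
    log_norm_part = (alpha_sig * math.log(beta_sig)
                     - 0.5 * n1 * math.log(2.0 * math.pi)
                     - math.lgamma(alpha_sig) - 0.5 * math.log(n1 + 1)
                     + math.lgamma(0.5 * n1 + alpha_sig)
                     - (0.5 * n1 + alpha_sig) * math.log(beta_post))
    return log_exp_part + log_norm_part


if __name__ == "__main__":
    print(log_marginal_y([0.1, 0.3, 5.2, 4.8], [0, 0, 1, 1]))
```

Two small sanity checks follow from the formula: a single inactive observation $Y = 2$ gives $\int \lambda e^{-3\lambda} d\lambda = 1/9$, and a single active observation $Y = 0$ with the placeholder hyperparameters gives a density of $1/4$.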
Now we have the marginal likelihood of $Y_g$ given $Z_g$, and we can update each $Z_{c,g}$ by the following conditional distribution. The conditional probability of $Z_{c,g}$ being 1 is

\begin{align}
& p(Z_{c,g} = 1 \mid Y_g, Z_{[-c],g}, S_g, I_c, \theta, \omega) \tag{5.39} \\
&= \frac{p(Z_{c,g} = 1, Y_g, Z_{[-c],g} \mid S_g, I_c, \theta, \omega)}{p(Z_{c,g} = 0, Y_g, Z_{[-c],g} \mid S_g, I_c, \theta, \omega) + p(Z_{c,g} = 1, Y_g, Z_{[-c],g} \mid S_g, I_c, \theta, \omega)} \tag{5.40} \\
&= \frac{p(Z_{c,g} = 1, Y_g \mid Z_{[-c],g}, S_g, I_c, \theta, \omega)}{p(Z_{c,g} = 0, Y_g \mid Z_{[-c],g}, S_g, I_c, \theta, \omega) + p(Z_{c,g} = 1, Y_g \mid Z_{[-c],g}, S_g, I_c, \theta, \omega)} \tag{5.41} \\
&= \frac{p(Y_g \mid Z_{c,g} = 1, Z_{[-c],g})\, p(Z_{c,g} = 1 \mid S_g, I_c, \theta, \omega)}{p(Y_g \mid Z_{c,g} = 0, Z_{[-c],g})\, p(Z_{c,g} = 0 \mid S_g, I_c, \theta, \omega) + p(Y_g \mid Z_{c,g} = 1, Z_{[-c],g})\, p(Z_{c,g} = 1 \mid S_g, I_c, \theta, \omega)}. \tag{5.42}
\end{align}
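Let $z$ be either 1 or 0. Note that if we can find $p(Z_{c,g} = z \mid Z_{[-c],g}, S_g, I_c)$ by integrating out $\theta$ and $\omega$ from $p(Z_{c,g} = z \mid S_g, I_c, \theta, \omega)$, then the conditional distribution of $Z_{c,g}$ being $z$ can be simplified to

\begin{align}
& p(Z_{c,g} = z \mid Y_g, Z_{[-c],g}, S_g, I_c) \tag{5.43} \\
&= \frac{p(Z_{c,g} = z, Y_g \mid Z_{[-c],g}, S_g, I_c)}{p(Z_{c,g} = 0, Y_g \mid Z_{[-c],g}, S_g, I_c) + p(Z_{c,g} = 1, Y_g \mid Z_{[-c],g}, S_g, I_c)} \tag{5.44} \\
&= \frac{p(Y_g \mid Z_{c,g} = z, Z_{[-c],g})\, p(Z_{c,g} = z \mid Z_{[-c],g}, S_g, I_c)}{p(Y_g \mid Z_{c,g} = 0, Z_{[-c],g})\, p(Z_{c,g} = 0 \mid Z_{[-c],g}, S_g, I_c) + p(Y_g \mid Z_{c,g} = 1, Z_{[-c],g})\, p(Z_{c,g} = 1 \mid Z_{[-c],g}, S_g, I_c)}. \tag{5.45}
\end{align}

For $z = 1$, the analytical form of $p(Z_{c,g} = z \mid Z_{[-c],g}, S_g, I_c)$ is simply the conditional mean of $\theta_{g,I_c}$ given $Z_{[I_c/c],g}$, due to Bernoulli-Beta conjugacy, where $Z_{[I_c/c],g} = \{Z_{c',g} : I_{c'} = I_c \text{ and } c' \neq c\}$. When gene $g$ is selected as a biomarker, i.e. $S_g = 1$,

\begin{align}
p(Z_{c,g} = 1 \mid Z_{[-c],g}, S_g = 1, I_c) &= \int p(Z_{c,g} = 1 \mid S_g = 1, \theta_{g,I_c})\, p(\theta_{g,I_c} \mid Z_{[I_c/c],g})\, d\theta_{g,I_c} \tag{5.46} \\
&= \int \theta_{g,I_c}\, p(\theta_{g,I_c} \mid Z_{[I_c/c],g})\, d\theta_{g,I_c} \tag{5.47} \\
&= \frac{\alpha_\theta + n_{g,[I_c/c],1}}{2\alpha_\theta + n_{g,[I_c/c]}}, \tag{5.48}
\end{align}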
where $n_{g,[I_c/c]}$ is the total number of elements in $Z_{[I_c/c],g}$ and $n_{g,[I_c/c],1}$ is the total number of 1s in $Z_{[I_c/c],g}$. When $S_g = 0$,

\begin{align}
p(Z_{c,g} = 1 \mid Z_{[-c],g}, S_g = 0, I_c) &= \int p(Z_{c,g} = 1 \mid S_g = 0, \omega_g)\, p(\omega_g \mid Z_{[-c],g})\, d\omega_g \tag{5.49} \\
&= \int \omega_g\, p(\omega_g \mid Z_{[-c],g})\, d\omega_g \tag{5.50} \\
&= \frac{\alpha_\omega + n_{g,[-c],1}}{2\alpha_\omega + C - 1}, \tag{5.51}
\end{align}

where $n_{g,[-c],1}$ is the number of 1s in $Z_g$ except $Z_{c,g}$. Note that by combining (5.48), (5.51) and (5.38) we can already evaluate (5.45).
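The two posterior-mean expressions (5.48) and (5.51) differ only in which cells are pooled. A minimal sketch, with hypothetical function and argument names, makes the distinction explicit: for a selected gene only the other cells sharing the sub-population of cell $c$ contribute, whereas for a non-selected gene all other cells do.

```python
def prob_active_given_others(z_others, same_group, s_g,
                             alpha_theta=0.5, alpha_omega=0.5):
    """p(Z_{c,g} = 1 | Z_{[-c],g}, S_g, I_c) following (5.48) and (5.51).

    z_others:   0/1 states of gene g in every cell except c.
    same_group: booleans, True where the other cell shares I_c with cell c.
    s_g:        1 if gene g is currently selected as a biomarker, else 0.
    """
    if s_g == 1:
        # Pool only cells in the same sub-population as cell c (5.48).
        pool = [z for z, same in zip(z_others, same_group) if same]
        alpha = alpha_theta
    else:
        # Pool all other cells (5.51); len(pool) equals C - 1.
        pool = list(z_others)
        alpha = alpha_omega
    return (alpha + sum(pool)) / (2.0 * alpha + len(pool))


if __name__ == "__main__":
    z_others = [1, 1, 0, 0]
    same_group = [True, True, False, False]
    print(prob_active_given_others(z_others, same_group, s_g=1))
    print(prob_active_given_others(z_others, same_group, s_g=0))
```

In the example the selected-gene probability is pulled towards the activity level of the cell's own sub-population, while the non-selected probability reflects the overall activity of the gene.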
The Gibbs sampling algorithm is straightforward given the calculated conditional distributions. In iteration $t$ of the Gibbs sampler, we conduct the following conditional updates:
1. Draw $S_g$ from $p(S_g \mid Z, I)$, for all $g = 1, \dots, G$.
2. Draw $I_c$ from $p(I_c \mid Z, S, I_{[-c]})$, for all $c = 1, \dots, C$.
3. Draw $Z_{c,g}$ from $p(Z_{c,g} \mid Y_g, Z_{[-c],g}, S_g, I_c)$, for all $g = 1, \dots, G$ and $c = 1, \dots, C$.
5.2.3 Selecting the number of groups K
Approximating marginal likelihood functions
In this part, we show how to select the number of sub-populations, $K$. The confidence in observing heterogeneity can be evaluated by checking whether $K > 1$ is supported by the data. There are multiple ways of selecting the number of groups $K$. The first is to use a Dirichlet process mixture model to simultaneously sample from the space of $K$ and $I$. However, it has been shown that the Dirichlet process mixture model yields an inconsistent estimate of the number of groups $K$. Another natural way of estimating $K$ is Bayesian model selection, i.e. calculating the marginal likelihoods $p(Z \mid M_k)$, where $M_k$ denotes the model in which the mixture model is constrained to have $K$ groups. The calculation of the marginal likelihood $p(Z \mid M_k)$ involves integrating out all parameters over their prior distributions,

\[
p(Z \mid M_k) = \int \int \sum_{S} \sum_{I} p(Z \mid S, I, \theta, \omega, M_k)\, p(S)\, p(I \mid M_k)\, p(\theta \mid M_k)\, p(\omega)\, d\theta\, d\omega, \tag{5.52}
\]

where $\sum_S \sum_I$ represents the summation over all possible values of the discrete variables $S$ and $I$. Note that the integral in Eq. 5.52 does not have an analytical solution, and numerical integration is also infeasible due to the high dimension. However, we can use the samples drawn from Gibbs sampling to approximate the marginal likelihood by the method of Chib (1995). In particular, conditional on a specific $S$, the following equation holds for an arbitrary $(\theta^*, \omega^*)$,

\[
\ln p(Z \mid S, M_k) = \ln p(Z \mid S, \theta^*, \omega^*, M_k) + \ln p(\theta^*) + \ln p(\omega^*) - \ln p(\theta^*, \omega^* \mid S, Z, M_k). \tag{5.53}
\]
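The identity behind (5.53) can be checked exactly in the simplest conjugate case: a single Bernoulli probability with a symmetric Beta prior, where the posterior ordinate is available in closed form. The sketch below is only a toy illustration of the identity (the function names are hypothetical and no mixture or MCMC is involved); in the full model the posterior ordinate has to be approximated from Gibbs samples.

```python
import math


def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_beta_pdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_beta_fn(a, b)


def chib_log_marginal(n1, n0, alpha, theta_star):
    """ln p(Z) = ln p(Z | theta*) + ln p(theta*) - ln p(theta* | Z)."""
    log_lik = n1 * math.log(theta_star) + n0 * math.log(1.0 - theta_star)
    log_prior = log_beta_pdf(theta_star, alpha, alpha)
    log_post = log_beta_pdf(theta_star, alpha + n1, alpha + n0)
    return log_lik + log_prior - log_post


def exact_log_marginal(n1, n0, alpha):
    """Closed-form Beta-Bernoulli marginal likelihood of the 0/1 sequence."""
    return log_beta_fn(alpha + n1, alpha + n0) - log_beta_fn(alpha, alpha)


if __name__ == "__main__":
    for theta_star in (0.2, 0.5, 0.9):
        print(theta_star, chib_log_marginal(7, 3, 0.5, theta_star),
              exact_log_marginal(7, 3, 0.5))
```

Every choice of $\theta^*$ returns the same value here because the posterior ordinate is exact; with a Monte Carlo estimate of the ordinate, the choice of $\theta^*$ affects the accuracy of the approximation.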
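In the above equation, $p(Z \mid S, \theta^*, \omega^*, M_k)$ can be calculated analytically as the product of observed likelihood functions,

\[
p(Z \mid S, \theta^*, \omega^*, M_k) = \prod_{S_g = 0} \prod_{c=1}^{C} (\omega^*_g)^{Z_{c,g}} (1 - \omega^*_g)^{1 - Z_{c,g}} \prod_{c=1}^{C} \frac{1}{K} \sum_{k=1}^{K} \prod_{S_g = 1} (\theta^*_{g,k})^{Z_{c,g}} (1 - \theta^*_{g,k})^{1 - Z_{c,g}}. \tag{5.54}
\]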
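We approximate $\ln p(\theta^*, \omega^* \mid S, Z, M_k)$ by fixing $S$ and running the Gibbs sampler for $M$ iterations to draw samples $\{I^{(1)}, \dots, I^{(M)}\}$ from $p(I \mid S, Z, M_k)$; then

\begin{align}
p(\theta^*, \omega^* \mid S, Z, M_k) &= \sum_{I} p(\theta^*, \omega^* \mid S, I, Z, M_k)\, p(I \mid S, Z, M_k) \notag \\
&\approx \frac{1}{M} \sum_{m=1}^{M} p(\theta^*, \omega^* \mid S, I^{(m)}, Z, M_k), \tag{5.55}
\end{align}

where

\begin{align}
p(\theta^*, \omega^* \mid S, I, Z, M_k) &= \prod_{S_g = 0} p(\omega^*_g \mid S, I, Z, M_k) \prod_{S_g = 1} p(\theta^*_g \mid S, I, Z, M_k) \tag{5.56} \\
&= \prod_{S_g = 0} \mathrm{Beta}(\omega^*_g \mid \alpha_\omega + n_{g,1}, \alpha_\omega + n_{g,0}) \cdot \tag{5.57} \\
&\quad \prod_{S_g = 1} \prod_{k=1}^{K} \mathrm{Beta}(\theta^*_{g,k} \mid \alpha_\theta + n_{g,k,1}, \alpha_\theta + n_{g,k,0}). \tag{5.58}
\end{align}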
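Plugging 5.55 into 5.53, we get the approximation for the marginal likelihood $\ln p(Z \mid S, M_k)$. Even though 5.53 is valid for any valid $\theta^*$ and $\omega^*$, an arbitrary choice can lead to a very poor approximation. Here we choose each $\theta^*_{g,k}$ or $\omega^*_g$ to be its posterior mean given $S$, where $\omega^*_g$ can be calculated exactly,

\[
\omega^*_g \equiv E[\omega_g \mid S, Z] = \frac{\alpha_\omega + n_{g,1}}{2\alpha_\omega + C}, \tag{5.59}
\]

and $\theta^*_{g,k}$ can be calculated from the samples $I^{(1)}, \dots, I^{(M)}$ from the collapsed Gibbs sampler,

\begin{align}
\theta^*_{g,k} &\equiv E[\theta_{g,k} \mid S, Z, M_k] \tag{5.60} \\
&= E\left[ E[\theta_{g,k} \mid S, I, Z, M_k] \mid Z, S, M_k \right] \tag{5.61} \\
&= \frac{1}{M} \sum_{m=1}^{M} E\left[ \theta_{g,k} \mid S, Z, I^{(m)} \right] \tag{5.62} \\
&= \frac{1}{M} \sum_{m=1}^{M} \frac{\alpha_\theta + n^{(m)}_{g,k,1}}{2\alpha_\theta + C_k}. \tag{5.63}
\end{align}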
It is extremely difficult to further integrate out $S$ from $p(Z \mid S, M_k)$, since different values of $S$ give different dimensionalities of $(\theta, \omega)$. Therefore, instead of $p(Z \mid M_k)$ we calculate $p(Z \mid \hat{S}_k, M_k)$, where $\hat{S}_k$ is an estimator of $S$ under model $M_k$. We choose the estimator for gene $g$, $\hat{S}_{k,g}$, to be the marginal mode of the posterior distribution, i.e.,

\begin{align}
\hat{S}_{k,g} &= 1, \quad p(S_g = 1 \mid Z, M_k) > 0.5, \tag{5.64} \\
\hat{S}_{k,g} &= 0, \quad p(S_g = 1 \mid Z, M_k) < 0.5. \tag{5.65}
\end{align}
Selecting K with marginal likelihoods
In the previous section we proposed to approximate $p(Z \mid \hat{S}_k, M_k)$ by MCMC samples, where the notation $\hat{S}_k$ indicates that each $k$ has its own estimator. $p(Z \mid \hat{S}_k, M_k)$ quantifies the overall fit of $\hat{S}_k$ and $M_k$ to the data. Therefore we can use $p(Z \mid \hat{S}_k, M_k)$ to compare models with different numbers of groups $k$. In particular, we can use $p(Z \mid \hat{S}_k, M_1)$, the marginal likelihood for one single population, as the baseline and calculate $\Delta_k$, which is defined as the following log Bayes factor,

\[
\Delta_k = \ln p(Z \mid \hat{S}_k, M_k) - \ln p(Z \mid \hat{S}_k, M_1). \tag{5.66}
\]

Note that if $\theta$ and $\omega$ have the same prior distribution, i.e. $\alpha_\omega = \alpha_\theta$ in our setting, the $\theta$s and $\omega$s are equivalent under $M_1$; therefore $Z$ is independent of $S$, i.e. $\ln p(Z \mid \hat{S}_k, M_1) = \ln p(Z \mid M_1)$. We call $M_1$ the "homogeneous model" because it allows only one single cluster. $\Delta_k$ quantifies how much better a model with $k$ sub-populations fits the data than the homogeneous model. Note that the calculation of $\Delta_k$ can be simplified due to the cancellation of the likelihood functions for $Z_g$ with $S_g = 0$,
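\begin{align}
\Delta_k &= \ln \frac{p(Z \mid \hat{S}_k, M_k)}{p(Z \mid \hat{S}_k, M_1)} \tag{5.67} \\
&= \ln \frac{\int \int \sum_I p(Z \mid \hat{S}_k, I, \theta, \omega, M_k)\, p(I \mid M_k)\, p(\theta \mid M_k)\, p(\omega)\, d\theta\, d\omega}{\int \int \sum_I p(Z \mid \hat{S}_k, I, \theta, \omega, M_1)\, p(I \mid M_1)\, p(\theta \mid M_1)\, p(\omega)\, d\theta\, d\omega} \tag{5.68} \\
&= \ln \frac{\int \int \sum_I \prod_{\hat{S}_{k,g} = 0} p(Z_g \mid \omega_g) \prod_{\hat{S}_{k,g} = 1} p(Z_g \mid I, \theta, M_k)\, p(I \mid M_k)\, p(\theta \mid M_k)\, p(\omega)\, d\theta\, d\omega}{\int \int \sum_I \prod_{\hat{S}_{k,g} = 0} p(Z_g \mid \omega_g) \prod_{\hat{S}_{k,g} = 1} p(Z_g \mid I, \theta, M_1)\, p(I \mid M_1)\, p(\theta \mid M_1)\, p(\omega)\, d\theta\, d\omega} \tag{5.69} \\
&= \ln \frac{\left[ \int \sum_I \prod_{\hat{S}_{k,g} = 0} p(Z_g \mid \omega_g)\, p(\omega)\, d\omega \right] \int \int \sum_I \prod_{\hat{S}_{k,g} = 1} p(Z_g \mid I, \theta, M_k)\, p(I \mid M_k)\, p(\theta \mid M_k)\, d\theta}{\left[ \int \sum_I \prod_{\hat{S}_{k,g} = 0} p(Z_g \mid \omega_g)\, p(\omega)\, d\omega \right] \int \int \sum_I \prod_{\hat{S}_{k,g} = 1} p(Z_g \mid I, \theta, M_1)\, p(I \mid M_1)\, p(\theta \mid M_1)\, d\theta} \tag{5.70} \\
&= \ln \frac{\int \int \sum_I \prod_{\hat{S}_{k,g} = 1} p(Z_g \mid I, \theta, M_k)\, p(I \mid M_k)\, p(\theta \mid M_k)\, d\theta}{\int \int \sum_I \prod_{\hat{S}_{k,g} = 1} p(Z_g \mid I, \theta, M_1)\, p(I \mid M_1)\, p(\theta \mid M_1)\, d\theta} \tag{5.71} \\
&= \ln \frac{p(\{Z_g : \hat{S}_{k,g} = 1\} \mid \hat{S}_{k,g}, M_k)}{p(\{Z_g : \hat{S}_{k,g} = 1\} \mid \hat{S}_{k,g}, M_1)}. \tag{5.72}
\end{align}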
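Note that since $M_1$ is essentially a single-component model, we can calculate $p(\{Z_g : \hat{S}_{k,g} = 1\} \mid \hat{S}_{k,g}, M_1)$ analytically,

\begin{align}
p(\{Z_g : \hat{S}_{k,g} = 1\} \mid \hat{S}_{k,g}, M_1) &= \int \prod_{S_g = 1} p(Z_g \mid \theta, M_1)\, p(\theta \mid M_1)\, d\theta \tag{5.73} \\
&= \prod_{S_g = 1} \int p(Z_g \mid \theta_{g,1})\, p(\theta_{g,1})\, d\theta_{g,1} \tag{5.74} \\
&= \prod_{S_g = 1} \left[ \frac{\Gamma(\alpha_\theta + n_{g,1})\, \Gamma(\alpha_\theta + n_{g,0})}{\Gamma(2\alpha_\theta + C)} \frac{\Gamma(2\alpha_\theta)}{\Gamma(\alpha_\theta)\, \Gamma(\alpha_\theta)} \right]. \tag{5.75}
\end{align}

Therefore, we only need to approximate the marginal likelihood $p(\{Z_g : \hat{S}_{k,g} = 1\} \mid \hat{S}_{k,g}, M_k)$. Additionally, Equation 5.72 also indicates that if $\hat{S}_{k,g} = 0$ for all $g$, then $\Delta_k = 0$.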
5.2.4 Simulation studies
We conduct simulation studies to test the performance of our algorithm in identifying biomarker genes and partitioning the cells into sub-populations. In the following five scenarios we simulated data with different parameter configurations to benchmark the performance of the algorithm under different signal-to-noise ratios. For each scenario, we simulated the binary active/inactive states for $G$ genes across $C$ cells. Without loss of generality, we let $K = 2$ be the number of simulated sub-populations, and each cell is randomly assigned to one of the two sub-populations. Among the $G$ genes, $N_S$ genes are differentially expressed in the two sub-populations, and the other genes are homogeneous across all cells. For heterogeneously expressed genes, we let their probability of being active be $p_E$ and $(1 - p_E)$ in the two sub-populations. These two probabilities do not necessarily have to sum to one, but here we simulate data in this way to guarantee the contrast between sub-populations. For homogeneously expressed genes, we randomly draw their probability of being active from a $\mathrm{Beta}(a, a)$ distribution; in the current study we let $a = 1$, and the resulting $\mathrm{Beta}(1, 1)$ distribution is simply the Uniform distribution on $[0, 1]$. For each of the simulated scenarios, we run 1000 iterations of our Gibbs sampler on each simulated data set and estimate the posterior distributions of the parameters $S = (S_1, \dots, S_G)$ and $I = (I_1, \dots, I_C)$. In the display of results, we rank the genes by their posterior probability of being selected, $p(S_g = 1 \mid Z)$.
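The simulation design described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the script used to produce the reported figures: the function name, argument names and seed handling are hypothetical, the differentially expressed genes are placed in the first columns, and cells are assigned to the two sub-populations with equal probability.

```python
import random


def simulate_states(n_cells=100, n_genes=200, n_markers=10, p_e=0.85, a=1.0, seed=0):
    """Simulate a binary active/inactive matrix Z with K = 2 sub-populations.

    Returns (z, groups): z[c][g] is 0/1, groups[c] is 0 or 1.
    The first n_markers genes are differentially expressed.
    """
    rng = random.Random(seed)
    # Each cell joins one of the two sub-populations with equal probability.
    groups = [rng.randrange(2) for _ in range(n_cells)]
    # Homogeneous genes: one activity probability per gene, drawn from Beta(a, a).
    base = [rng.betavariate(a, a) for _ in range(n_genes - n_markers)]
    z = []
    for c in range(n_cells):
        # Marker genes are active with probability p_e or 1 - p_e by group.
        p_marker = p_e if groups[c] == 0 else 1.0 - p_e
        row = [int(rng.random() < p_marker) for _ in range(n_markers)]
        row += [int(rng.random() < p) for p in base]
        z.append(row)
    return z, groups


if __name__ == "__main__":
    z, groups = simulate_states()
    print(len(z), len(z[0]), sum(groups))
```

Setting `n_markers=0` gives a data set with no differentially expressed genes, so the group labels carry no signal in the data, while lowering `n_markers` or moving `p_e` towards 0.5 weakens the signal.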
Scenario 1: C = 100, G = 200, NS = 0. We simulated the expression of 200 genes across 100 homogeneous cells. Specifically, there is no sub-population among the 100 cells. The heatmap of simulated data is shown in Figure 5.8. One can see clearly that the partitioning by hierarchical clustering was random because there is no sub-population in the data. We run our Gibbs sampler on this simulated data. The estimated posterior probabilities of the 200 genes being selected are plotted at the top of Figure 5.9. We can see that when there is no sub-population heterogeneity and there are no differentially expressed genes, the posterior probabilities of genes being selected as biomarkers are all very low, which is consistent with our expectation. The posterior probabilities of cells being assigned to sub-population 1 are all around 50%, which directly reflects the uncertainty of the partitioning. The algorithm gives the correct answer: there is no sub-population in the simulated dataset.
Figure 5.8: Scenario 1: (Left) Heatmap of simulated data. (Right) Hierarchical clustering of cells by Euclidean distance metrics.
[Figure 5.9 plots. Top panel: posterior probability of being selected (0 to 1) for each of the 200 genes. Bottom panel: posterior probability of being in sub-population 1 (0 to 1) for each cell.]
Figure 5.9: Scenario 1: (Top) Posterior probability of genes being selected. (Bottom) Posterior probability of cells being in sub-population 1. Dashed line: 50% probability cutoff.
Scenario 2: C = 100, G = 200, NS = 10, pE = 0.85. We simulate 200 genes for 100 cells. The 100 cells are randomly assigned to the two sub-populations with equal probability. Among the 200 genes, 10 are differentially expressed and 190 are homogeneously expressed. In the two sub-populations, the differentially expressed genes have either 85% or 15% probability of being active. The heatmap of simulated data is shown in Figure 5.10. The left panel shows the heatmap of the raw simulated data. Each row represents one cell and each column represents one gene. The row colorbar indicates the sub-population of each row, and the columns highlighted by the column colorbar are the true differentially expressed genes. From the left panel, one cannot partition the cells into the right sub-populations or identify the systematically differentially expressed genes by eye. We run our Gibbs sampler on this simulated data; the trace plots of the log-likelihood function of 3 independent MCMC chains, as well as the autocorrelation plot of chain 1, are shown in Figure 5.11. Note that the Markov chains mixed well and converged to the stationary distribution quickly, within one or two hundred iterations. The estimated posterior probabilities of the 200 genes being selected are plotted at the top of Figure 5.12. The first NS = 10 genes are the true differentially expressed genes, and our algorithm successfully picked them out. The estimated posterior probabilities of the 100 cells being assigned to sub-population 1 are shown at the bottom of Figure 5.12. From the figures we can see that the algorithm perfectly selected the true heterogeneous genes and partitioned the cells into the correct sub-populations. The heatmap in Figure 5.13 shows the partitioning of the 100 cells, with the row colorbar indicating the sub-population membership of each cell. Our algorithm correctly partitioned all of the 100 cells into their true sub-populations. A zoom-in of the heatmap of the 100 cells on the top 20 genes with the highest posterior probabilities of being selected (including the 10 true genes) is shown in Figure 5.13. Note that this output shows the clear heterogeneity between the two sub-populations.
Figure 5.10: Scenario 2: (Left) Heatmap of simulated data. (Right) Hierarchical clustering of cells by Euclidean distance metrics.
[Figure 5.11 plots. Left panel: log-likelihood versus iteration for MCMC chains 1, 2 and 3. Right panel: autocorrelation of the log-likelihood trace versus lag.]
Figure 5.11: Scenario 2: MCMC diagnostic plots: (Left) Log-likelihood trace plots for 3 independent Gibbs sampling MCMC chains. (Right) Autocorrelation plots for log-likelihood trace of chain 1.
[Figure 5.12 plots. Top panel: posterior probability of being selected (0 to 1) per gene, with the true heterogeneous genes marked. Bottom panel: posterior probability of being in sub-population 1 (0 to 1) per cell, with the true sub-population 1 cells marked.]
Figure 5.12: Scenario 2: (Top) Posterior probability of genes being selected. (Bottom) Posterior probability of cells being in sub-population 1. Dashed line: 50% probability cutoff.
Figure 5.13: Scenario 2: (Left) Heatmap for results of partitioning of cells and selected genes. (Right) Zoom-in of top 20 selected genes. The genes highlighted have posterior probability of being selected > 0.5.
Scenario 3: C = 100, G = 200, NS = 5, pE = 0.85. We again simulated 200 genes for 100 cells, but we reduced the number of differentially expressed genes from 10 to 5 to see whether we can still partition the cells correctly with less information. The 100 cells are randomly assigned to the two sub-populations with equal probability. The heatmap of simulated data is shown in Figure 5.14. Because there is less signal in the data, the hierarchical clustering algorithm clearly did even worse than in Scenario 2; its partitioning was essentially random. We run our Gibbs sampler on this simulated data; the trace plots of the log-likelihood function of 3 independent MCMC chains, as well as the autocorrelation plot of chain 1, are shown in Figure 5.15. Note that the Markov chains mixed well and converged to the stationary distribution quickly, within one or two hundred iterations. The estimated posterior probabilities of the 200 genes being selected are plotted at the top of Figure 5.16. The first NS = 5 genes are the true differentially expressed genes, and our algorithm successfully picked them out. The estimated posterior probabilities of the 100 cells being assigned to sub-population 1 are shown at the bottom of Figure 5.16. From the figures we can see that the algorithm still perfectly selected the true heterogeneous genes and partitioned the cells into the correct sub-populations even though there are only 5 differentially expressed genes. The heatmap in Figure 5.17 shows the partitioning of the 100 cells, with the row colorbar indicating the sub-population membership of each cell. Our algorithm correctly partitioned all of the 100 cells into their true sub-populations. A zoom-in of the heatmap of the 100 cells on the top 20 genes with the highest posterior probabilities of being selected (including the 5 true genes) is shown in Figure 5.17. Note that this output shows the clear heterogeneity between the two sub-populations.
Figure 5.14: Scenario 3: (Left) Heatmap of simulated data. (Right) Hierarchical clustering of cells by Euclidean distance metrics.
[Figure 5.15 plots. Left panel: log-likelihood versus iteration for MCMC chains 1, 2 and 3. Right panel: autocorrelation of the log-likelihood trace versus lag.]
Figure 5.15: Scenario 3: MCMC diagnostic plots: (Left) Log-likelihood trace plots for 3 independent Gibbs sampling MCMC chains. (Right) Autocorrelation plots for log-likelihood trace of chain 1.
[Figure 5.16 plots. Top panel: posterior probability of being selected (0 to 1) per gene, with the true heterogeneous genes marked. Bottom panel: posterior probability of being in sub-population 1 (0 to 1) per cell, with the true sub-population 1 cells marked.]
Figure 5.16: Scenario 3: (Top) Posterior probability of genes being selected. (Bottom) Posterior probability of cells being in sub-population 1. Dashed line: 50% probability cutoff.
Figure 5.17: Scenario 3: (Left) Heatmap for results of partitioning of cells and selected genes. (Right) Zoom-in of top 20 selected genes. The genes highlighted have posterior probability of being selected > 0.5.
Scenario 4: C = 100, G = 1000, NS = 10, pE = 0.85. In this scenario we test the algorithm's ability to handle a large set of genes. We simulated 1000 genes for 100 cells. The 100 cells are randomly assigned to the two sub-populations with equal probability. Among the 1000 genes, 10 are differentially expressed and 990 are homogeneously expressed. The heatmap of simulated data is shown in Figure 5.18. We run our Gibbs sampler on this simulated data; the trace plots of the log-likelihood function of 3 independent MCMC chains, as well as the autocorrelation plot of chain 1, are shown in Figure 5.19. Note that the Markov chains mixed well and converged to the stationary distribution quickly, within one or two hundred iterations. The estimated posterior probabilities of the 1000 genes being selected are plotted at the top of Figure 5.20. The first NS = 10 genes are the true differentially expressed genes, and our algorithm successfully picked them out. The estimated posterior probabilities of the 100 cells being assigned to sub-population 1 are shown at the bottom of Figure 5.20. From the figures we can see that the algorithm picked out 10 out of 10 true positive genes with only two false positives, and perfectly partitioned the cells into the correct sub-populations even though only 10 of the 1000 genes are differentially expressed. The heatmap in Figure 5.21 shows the partitioning of the 100 cells, with the row colorbar indicating the sub-population membership of each cell. Our algorithm correctly partitioned all of the 100 cells into their true sub-populations. A zoom-in of the heatmap of the 100 cells on the top 20 genes with the highest posterior probabilities of being selected (including the 10 true genes) is shown in Figure 5.21. Note that this output shows the clear heterogeneity between the two sub-populations.
Figure 5.18: Scenario 4: (Left) Heatmap of simulated data. (Right) Hierarchical clusteringof cells by Euclidean distance metrics.
161
0 500 1000 1500
−51
250
−51
150
−51
050
−50
950
Iteration
Log−
likel
ihoo
d
MCMC Chain 1MCMC Chain 2MCMC Chain 3
0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
Lag
Aut
ocor
rela
tion
of lo
g−lik
elih
ood
trac
e
Figure 5.19: Scenario 4: MCMC diagnostic plots: (Left) Log-likelihood trace plots for 3independent Gibbs sampling MCMC chains. (Right) Autocorrelation plots for log-likelihoodtrace of chain 1.
162
Figure 5.20: Scenario 4: (Top) Posterior probability of genes being selected (x-axis: 1000 genes; legend: true heterogeneous genes). (Bottom) Posterior probability of cells being in sub-population 1 (x-axis: cells; legend: true sub-population 1 cells). Dashed line: 50% probability cutoff.
Figure 5.21: Scenario 4: Zoom-in of top 20 selected genes. The genes highlighted have posterior probability of being selected > 0.5.
Scenario 5: C = 100, G = 500, NS = 10, pE = 0.9. We simulated 500 genes for 100 cells.
Instead of assigning the 100 cells to the two sub-populations with equal probability, we
randomly selected 10 cells to be in sub-population 1 and the other 90 cells to be in sub-population
2, which makes the sizes of the two sub-populations unbalanced. This scenario tests the
performance of our method on unbalanced cell population data. The heatmap of the simulated data is
shown in Figure 5.22. Hierarchical clustering did even worse here because of the
unbalanced sub-population sizes: its partitioning was essentially random, and the
dendrogram branches, which measure the difference between cells, are very long.
We ran our Gibbs sampler on this simulated
data; the trace plots of the log-likelihood for 3 independent MCMC chains, together
with the autocorrelation plot for chain 1, are shown in Figure 5.23. The Markov
chains mixed well and converged to the stationary distribution quickly, within one or two hundred
iterations. The estimated posterior probabilities of the 500 genes being selected are plotted
in the top panel of Figure 5.24. The first NS = 10 genes are the truly differentially expressed genes, and our
algorithm successfully picked them out. The estimated posterior probabilities of the 100
cells being assigned to sub-population 1 are shown in the bottom panel of Figure 5.24. The
figures show that the algorithm still perfectly selected the true heterogeneous
genes and partitioned the cells into the correct sub-populations even though there are only 10
differentially expressed genes. The heatmap in Figure 5.25 shows the partitioning of the 100
cells, with the row color bar indicating the sub-population membership of each cell. Our algorithm
correctly assigned all 100 cells to their true sub-populations. A zoom-in of the
heatmap of the 100 cells on the 20 genes with the highest posterior probabilities of being selected (including the
10 true genes) is shown in Figure 5.25. This output shows the
clear heterogeneity between the two sub-populations.
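As an illustration of how posterior selection probabilities and the 50% cutoff can be computed from MCMC output, the following Python sketch averages binary indicator draws over iterations. It is a minimal sketch under stated assumptions: the function name `posterior_selection_probability`, the burn-in length, and the synthetic indicator draws are all hypothetical and do not come from the Gibbs sampler used in this chapter.

```python
import numpy as np

def posterior_selection_probability(indicator_draws, burn_in=0):
    """Estimate P(selected | data) per gene as the mean of 0/1 indicator
    draws (rows: MCMC iterations, columns: genes) after discarding burn-in."""
    draws = np.asarray(indicator_draws, dtype=float)[burn_in:]
    return draws.mean(axis=0)

# Assumption: synthetic indicator draws for G = 500 genes, where the first
# NS = 10 genes are selected in most iterations and the rest only rarely.
rng = np.random.default_rng(1)
n_draws, n_genes, n_true = 1000, 500, 10
draw_prob = np.full(n_genes, 0.02)
draw_prob[:n_true] = 0.95
draws = rng.random((n_draws, n_genes)) < draw_prob

post = posterior_selection_probability(draws, burn_in=200)
selected = np.flatnonzero(post > 0.5)   # genes above the 50% cutoff
```

The same averaging applies to cell membership indicators, giving the posterior probability of each cell being in sub-population 1.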
Figure 5.22: Scenario 5: (Left) Heatmap of simulated data. (Right) Hierarchical clustering of cells by Euclidean distance metrics.
Figure 5.23: Scenario 5: MCMC diagnostic plots: (Left) Log-likelihood trace plots for 3 independent Gibbs sampling MCMC chains (x-axis: iteration; y-axis: log-likelihood; legend: MCMC Chain 1, 2, 3). (Right) Autocorrelation plots for log-likelihood trace of chain 1 (x-axis: lag, 0 to 30; y-axis: autocorrelation of log-likelihood trace).
Figure 5.24: Scenario 5: (Top) Posterior probability of genes being selected (x-axis: 500 genes; legend: true heterogeneous genes). (Bottom) Posterior probability of cells being in sub-population 1 (x-axis: cells; legend: true sub-population 1 cells). Dashed line: 50% probability cutoff.
Figure 5.25: Scenario 5: (Left) Heatmap for results of partitioning of cells and selected genes. (Right) Zoom-in of top 20 selected genes. The genes highlighted have posterior probability of being selected > 0.5.
From the five simulated scenarios we conclude that our algorithm is able to identify a group of
synchronized, heterogeneously expressed genes and to partition the cells into sub-populations.
In the next section we apply the algorithm to a real dataset.
5.2.5 Application on real hematopoietic stem cell data
We applied our algorithm to the hematopoietic stem cell data of Guo et al. (2013).
There are six single-cell populations, each containing 20 to 50 cells. The
names of the cell populations and the number of cells in each population are listed in Table 5.1. The
expression of 179 genes was quantified by quantitative PCR (qPCR). We use our method
to select biomarker genes that are differentially expressed between sub-populations and to partition
the cells into heterogeneous sub-populations. The final results are listed in Table 5.1.
Bcl2, have posterior probabilities > 0.4 to be selected as biomarkers. All of the 18 cells have
clear posterior probabilities of being partitioned into sub-population 1 or 2.
MPP: Ten genes, Cdkn3, Cdkn2c, Ccnb2, Cdkn1a, Tob1, Ccna2, Cdk1, Bcl2, have
posterior probabilities > 0.5 to be selected as biomarkers. All of the 48 cells have clear
posterior probabilities of being partitioned into sub-population 1 or 2.
Appendix A
ASE Models
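Lemma A.0.1 Let $\theta_1, \ldots, \theta_M, L_1, \ldots, L_M$ be real numbers such that $L_j \neq 0$ and $0 \le \theta_j \le 1$ for all $j$, $\sum_{j=1}^{M} \theta_j = 1$ and $\sum_{j=1}^{M} \theta_j \times L_j \neq 0$. Let
\[
\rho_j = \frac{\theta_j \times L_j}{\sum_{k=1}^{M} \theta_k \times L_k};
\]
then
\[
\theta_j = \frac{\rho_j / L_j}{\sum_{k=1}^{M} \rho_k / L_k}.
\]
Proof
\[
\rho_j = \frac{\theta_j \times L_j}{\sum_{i=1}^{M} \theta_i \times L_i}
\;\Rightarrow\;
\sum_{k=1}^{M} \frac{\rho_k}{L_k}
= \sum_{k=1}^{M} \left( \frac{\theta_k}{\sum_{i=1}^{M} \theta_i \times L_i} \right)
= \frac{1}{\sum_{i=1}^{M} \theta_i \times L_i}
= \frac{\rho_j}{\theta_j \times L_j}
\;\Rightarrow\;
\theta_j = \frac{\rho_j / L_j}{\sum_{k=1}^{M} \rho_k / L_k}.
\]
The second equality uses $\sum_{k=1}^{M} \theta_k = 1$. The division by $\theta_j \times L_j$ requires $\theta_j > 0$; when $\theta_j = 0$ we have $\rho_j = 0$ and the final identity holds trivially.
Note. This lemma shows that there is a one-to-one transformation between our parameter
of interest, the relative abundance of transcript $j$, $\rho_j$, and the relative abundance of reads
generated by transcript $j$, $\theta_j$.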
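The lemma can be checked numerically with a short round trip from θ to ρ and back. The following Python sketch is only an illustration; the function names and the numerical values of θ and L are hypothetical.

```python
import numpy as np

def theta_to_rho(theta, L):
    """rho_j = theta_j * L_j / sum_k theta_k * L_k (definition in the lemma)."""
    w = theta * L
    return w / w.sum()

def rho_to_theta(rho, L):
    """theta_j = (rho_j / L_j) / sum_k (rho_k / L_k) (conclusion of the lemma)."""
    w = rho / L
    return w / w.sum()

# Assumption: arbitrary illustrative values with sum(theta) = 1 and L_j != 0.
theta = np.array([0.5, 0.3, 0.2])
L = np.array([1000.0, 2000.0, 500.0])

rho = theta_to_rho(theta, L)
theta_back = rho_to_theta(rho, L)   # should recover theta
```

Recovering the original θ from ρ in this way is what makes the transformation one-to-one.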
Appendix B
Mathematical Derivations
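Proposition B.0.2 If
\[
f(z \mid \mu, y, S, \Sigma) \propto \exp\left\{ -\frac{1}{2} \left[ (z - \mu)^T S^{-1} (z - \mu) + (y - z)^T \Sigma^{-1} (y - z) \right] \right\},
\]
then $z \sim N(\mu_z, \Lambda_z)$, with $\mu_z = \Lambda_z \left( S^{-1} \mu + \Sigma^{-1} y \right)$ and $\Lambda_z = \left( S^{-1} + \Sigma^{-1} \right)^{-1}$.
Proof The proof can be found in Gelman et al. (2013), p. 85.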
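A quick numerical sanity check of this result is to confirm that the log of the kernel above and the log of the $N(\mu_z, \Lambda_z)$ kernel differ only by a constant that does not depend on $z$. The Python sketch below does this for randomly generated inputs; all numerical values and helper names are hypothetical and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

def random_spd(rng, d):
    """Random symmetric positive definite matrix (illustrative covariance)."""
    a = rng.normal(size=(d, d))
    return a @ a.T + d * np.eye(d)

S, Sigma = random_spd(rng, d), random_spd(rng, d)
mu, y = rng.normal(size=d), rng.normal(size=d)

S_inv, Sigma_inv = np.linalg.inv(S), np.linalg.inv(Sigma)
Lambda_z = np.linalg.inv(S_inv + Sigma_inv)          # posterior covariance
mu_z = Lambda_z @ (S_inv @ mu + Sigma_inv @ y)       # posterior mean

def log_kernel(z):
    """Log of the unnormalized density in the proposition."""
    return -0.5 * ((z - mu) @ S_inv @ (z - mu) + (y - z) @ Sigma_inv @ (y - z))

def log_normal_kernel(z):
    """Log kernel of N(mu_z, Lambda_z), without the normalizing constant."""
    return -0.5 * (z - mu_z) @ np.linalg.inv(Lambda_z) @ (z - mu_z)

# The difference should be the same constant at every evaluation point z.
zs = rng.normal(size=(5, d))
offsets = np.array([log_kernel(z) - log_normal_kernel(z) for z in zs])
```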
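Proposition B.0.3 If $y \mid \beta, \Sigma \sim N_n(X\beta, \Sigma)$ and $\beta \mid D \sim N_k(0, D)$,
then $\beta \mid y, \Sigma, D \sim N_k(\mu_\beta, \Lambda_\beta)$, with $\mu_\beta = \Lambda_\beta X^T \Sigma^{-1} y$ and $\Lambda_\beta = \left( X^T \Sigma^{-1} X + D^{-1} \right)^{-1}$.
Proof The proof can be found in Lindley and Smith (1972).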
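One way to sanity-check these formulas numerically is to verify that the gradient of the log joint density of $(y, \beta)$ with respect to $\beta$ vanishes at $\mu_\beta$, and that the negative Hessian equals $\Lambda_\beta^{-1}$. The Python sketch below does this on randomly generated inputs; the dimensions, matrices, and names are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2

# Assumption: arbitrary design matrix, error covariance, and prior covariance.
X = rng.normal(size=(n, k))
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)      # symmetric positive definite
D = np.diag([2.0, 0.5])
y = rng.normal(size=n)

Sigma_inv = np.linalg.inv(Sigma)
D_inv = np.linalg.inv(D)
Lambda_beta = np.linalg.inv(X.T @ Sigma_inv @ X + D_inv)   # posterior covariance
mu_beta = Lambda_beta @ X.T @ Sigma_inv @ y                # posterior mean

def grad_log_joint(beta):
    """Gradient in beta of log N(y | X beta, Sigma) + log N(beta | 0, D)."""
    return X.T @ Sigma_inv @ (y - X @ beta) - D_inv @ beta

grad_at_mean = grad_log_joint(mu_beta)        # should be numerically zero
neg_hessian = X.T @ Sigma_inv @ X + D_inv     # should equal inv(Lambda_beta)
```

Because the log joint density is quadratic in $\beta$, a zero gradient at $\mu_\beta$ together with a negative Hessian of $\Lambda_\beta^{-1}$ identifies the Gaussian posterior stated in the proposition.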
Appendix C
Examples of MCMC chains for BRAIM model
Figure C.1: Plot of the paternal and maternal expression estimates across replicates, and the Posterior Probabilities of the BVSIE model for gene A530017D24Rik, transcript ENSMUST00000101077.
Figure C.2: MCMC samples for $\beta_I$.
Figure C.3: MCMC samples for $\beta_C$.
Figure C.4: MCMC samples for $\beta_S$.
Figure C.5: MCMC samples for $\sigma^2$.
Figure C.6: MCMC samples for the likelihood function.
Figure C.7: Plot of the paternal and maternal expression estimates across replicates, and the Posterior Probabilities of the BVSIE model for H13, transcript ENSMUST00000109825.
Figure C.8: MCMC samples for $\beta_I$.
Figure C.9: MCMC samples for $\beta_C$.
Figure C.10: MCMC samples for $\beta_S$.
Figure C.11: MCMC samples for $\sigma^2$.
Figure C.12: MCMC samples for the likelihood function.
Figure C.13: Plot of the paternal and maternal expression estimates across replicates, and the Posterior Probabilities of the BVSIE model, gene H13, transcript ENSMUST00000148156.
Figure C.14: MCMC samples for $\beta_I$.
Figure C.15: MCMC samples for $\beta_C$.
Figure C.16: MCMC samples for $\beta_S$.
Figure C.17: MCMC samples for $\sigma^2$.
Figure C.18: MCMC samples for the likelihood function.
Appendix D
MCMC chains for Hi-BRAIM
D.1 Simulations
Figure D.1: MCMC samples for $\alpha_t$. Panels a to e show trace plots (x-axis: iter; y-axis label: α(het)), each with the corresponding autocorrelation function (ACF) against lag (0 to 30).
Figure D.2: MCMC samples for $\beta$. Panels a to c show trace plots of $\beta_\alpha$ (x-axis: iter, 0 to 800), each with the corresponding autocorrelation function (ACF) against lag (0 to 30).
Figure D.3: MCMC samples for $\nu_t$. Panels a to e show trace plots (x-axis: iter; y-axis label: ν(het)), each with the corresponding autocorrelation function (ACF) against lag (0 to 30).
Figure D.4: MCMC samples for $\sigma^2_t$. Panels a to e show trace plots of $\sigma^2$ (x-axis: iter), each with the corresponding autocorrelation function (ACF) against lag (0 to 30).
Figure D.5: MCMC samples for $\sigma^2_\alpha$. Trace plot (x-axis: iter, 0 to 1000) and autocorrelation function (ACF) against lag (0 to 30).
Figure D.6: MCMC samples for $\sigma^2_\nu$. Trace plot (x-axis: iter, 0 to 1000) and autocorrelation function (ACF) against lag (0 to 30).
Figure D.7: MCMC samples for $\mu_\alpha$. Trace plot (x-axis: iter, 0 to 1000) and autocorrelation function (ACF) against lag (0 to 30).
Figure D.8: MCMC samples for the log posterior. Trace plot (x-axis: iter, 0 to 1000; y-axis label: log.posterior) and autocorrelation function (ACF) against lag (0 to 30).
Bibliography
Christopher Badcock and Bernard Crespi. Battle of the sexes may set the brain. Nature, 454(7208):1054–1055, 2008.
Susan M Berget, Claire Moore, and Phillip A Sharp. Spliced segments at the 5' terminus of adenovirus 2 late mRNA. Proceedings of the National Academy of Sciences, 74(8):3171–3175, 1977.
Douglas L Black. Mechanisms of alternative pre-messenger RNA splicing. Annual Review of Biochemistry, 72(1):291–336, 2003.
Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, 1995.
Hugh Chipman, Michael Hamada, and CFJ Wu. A Bayesian variable-selection approach for analyzing designed experiments with complex aliasing. Technometrics, 39(4):372–381, 1997.
Jacob F Degner, John C Marioni, Athma A Pai, Joseph K Pickrell, Everlyne Nkadori, Yoav Gilad, and Jonathan K Pritchard. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics, 25(24):3207–3212, 2009.
Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.
ECO. Seqanswers, October 2007. http://seqanswers.com/forums/showthread.php?t=21.
Terrence S Furey. ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nature Reviews Genetics, 13(12):840–852, 2012.
Alastair S Garfield, Michael Cowley, Florentia M Smith, Kim Moorwood, Joanne E Stewart-Cox, Kerry Gilroy, Sian Baker, Jing Xia, Jeffrey W Dalley, Laurence D Hurst, et al. Distinct physiological and behavioural functions for parental alleles of imprinted Grb10. Nature, 469(7331):534–538, 2011.
Andrew Gelman. Multilevel (hierarchical) modeling: what it can and cannot do. Technometrics, 48(3), 2006.
Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. CRC Press, 2013.
Edward I George and Robert E McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
Peter Glaus, Antti Honkela, and Magnus Rattray. Bayesian model of transcript differential expression in RNA-seq data with biological variation. 2011.
Guoji Guo, Sidinh Luc, Eugenio Marco, Ta-Wei Lin, Cong Peng, Marc A Kerenyi, Semir Beyaz, Woojin Kim, Jian Xu, Partha Pratim Das, et al. Mapping cellular hierarchy by single-cell analysis of the cell surface repertoire. Cell Stem Cell, 13(4):492–505, 2013.
Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in RNA-seq. Bioinformatics, 25(8):1026–1032, 2009.
Daniel C Koboldt, Karyn Meltz Steinberg, David E Larson, Richard K Wilson, and Elaine R Mardis. The next-generation sequencing revolution and its impact on genomics. Cell, 155(1):27–38, 2013.
Roger D Kornberg. Chromatin structure: a repeating unit of histones and DNA. Science, 184(4139):868–871, 1974.
Dennis V Lindley and Adrian FM Smith. Bayes estimates for the linear model. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–41, 1972.
Michael L Metzker. Sequencing technologies - the next generation. Nature Reviews Genetics, 11(1):31–46, 2009.
Qun Pan, Ofer Shai, Leo J Lee, Brendan J Frey, and Benjamin J Blencowe. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics, 40(12):1413–1415, 2008.
Peter J Park. ChIP–seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10):669–680, 2009.
Adam Roberts, Cole Trapnell, Julie Donaghey, John L Rinn, Lior Pachter, et al. Improving RNA-seq expression estimates by correcting for fragment bias. Genome Biol, 12(3):R22, 2011.
Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010.
Fred Sanger and Alan R Coulson. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology, 94(3):441–448, 1975.
Frederick Sanger, Steven Nicklen, and Alan R Coulson. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467, 1977.
Alex K Shalek, Rahul Satija, Xian Adiconis, Rona S Gertner, Jellert T Gaublomme, Raktima Raychowdhury, Schragi Schwartz, Nir Yosef, Christine Malboeuf, Diana Lu, et al. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature, 2013.
the International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, February 2001. ISSN 0028-0836. doi: 10.1038/35057062. URL http://dx.doi.org/10.1038/35057062.
Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5):511–515, 2010.
Cole Trapnell, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn, and Lior Pachter. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology, 31(1):46–53, 2013.
Ernest Turro, Shu-Yi Su, Angela Goncalves, LJ Coin, Sylvia Richardson, and Alex Lewin. Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol, 12(2):R13, 2011.
J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O. Smith, M. Yandell, C. A. Evans, and R. A. Holt. The Sequence of the Human Genome. Science, 291(5507):1304–1351, February 2001. ISSN 1095-9203. doi: 10.1126/science.1058040. URL http://dx.doi.org/10.1126/science.1058040.
Matti Vihola. Robust adaptive Metropolis algorithm with coerced acceptance rate. Statistics and Computing, 22(5):997–1008, 2012.
Gunter P Wagner, Koryu Kin, and Vincent J Lynch. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences, 131(4):281–285, 2012.
Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.
Jon F Wilkins and David Haig. What good is genomic imprinting: the function of parent-specific gene expression. Nature Reviews Genetics, 4(5):359–368, 2003.