Probability, Entropy, and Adaptive
Immune System Repertoires
Zachary Michael Sethna
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Physics
Adviser: Professor Curtis Callan
September 2018
© Copyright by Zachary Michael Sethna, 2018.
All rights reserved.
Abstract
The adaptive immune system, composed of white blood cells called
lymphocytes (B
and T cells) that circulate in the lymph and blood, is a
precision tool that tags
and removes foreign peptides. Such peptides, also called
antigens or epitopes, are
identified by a specific binding to elements of a library or
repertoire of unique proteins
called receptors (e.g. antibodies or T cell receptors). A
repertoire must be large and
diverse enough so that at least one receptor will be able to
recognize any pathogen
epitope the organism is likely to encounter. This diversity is
achieved by stochastic
rearrangement of the germline DNA to create novel
complementarity determining
region sequences (CDR3) in a process called V(D)J
recombination.
In this thesis we utilize previously developed generative models
of V(D)J recombi-
nation events, and infer the model parameters from large
datasets of DNA sequences.
The generation probability (Pgen) of a nucleotide or amino acid
CDR3 is the sum
of all model probabilities of V(D)J recombination events that
generate the sequence.
While previously it was only feasible to compute Pgen of
nucleotide sequences, we
introduce a novel dynamic programming algorithm that efficiently
computes Pgen of
amino acid sequences. We use this Pgen for several applications.
First we examine
how the diversity of a repertoire, characterized by the model
entropy, scales with the
number of insertions in the V(D)J process. This is used to
describe the maturation
of the T cell repertoire of mice from embryos to young adults.
Next, we introduce a
statistical model of hypermutation in B cells and infer the
parameters from a human
repertoire, providing a principled quantification of the biases
in hypermutation rates.
Lastly, we examine the statistics of the receptors shared
amongst a cohort of more
than 600 individual humans and show that the statistics and
identities of so-called
‘public’ sequences are determined directly from Pgen.
We highlight possible clinical applications and attempt to place
this work in the
context of a full theory of the adaptive immune system.
Acknowledgements
I don’t have the words to express my thanks to my advisor Curt
Callan. Curt has been
a consummate advisor, providing support, advice, direction, and
countless opportu-
nities. I came into grad school with somewhat scattered
interests, yet Curt showed
me, by example, how to find a path forward through dedication,
collaboration, and
boundless curiosity. Curt has always been willing to entertain
my crazy, inchoate
ideas, and with only a few incisive questions give them shape
(though it often takes
me days to catch up and realize this). Curt, thank you for all
of your time and effort,
thank you for being my mentor. Thank you.
I also thank my collaborators on both sides of the pond. I have
learned so much
from the insights and clarity of Aleksandra Walczak and Thierry
Mora. Their ability
to parse the underlying science, translate it into math, and
then communicate this
effectively is something I hope to one day be able to emulate.
Yuval Elhanati has
made my time here much more productive and enjoyable. Not only
did Yuval provide
crucial assistance with every step of the research, but he
provided a sympathetic ear
and was willing to talk about whatever the topic of the day was.
Quentin Marcou is
not only a wonderful collaborator, but a welcoming friend.
Thanks to Ben Greenbaum and Vinod Balachandran for great
discussions, data,
and continuing collaboration.
I would also like to thank Anand Murugan, whom I have never met,
but whose
code I’ve spent uncounted hours working with.
Biophysics
The professors in biophysics have been hugely influential on my
perspective on science
and life, and I would like to thank them. I must start by
thanking Bill Bialek, and not
only for being on my committee. His vision, instant
understanding of any topic, and
personality have made his conversations something to be sought
after. I would like to
thank Bob Austin, not only for being a reader of this thesis,
but for the many crazy
conversations and a shared appreciation of scotch. I also want
to thank Josh Shaevitz
for efficiently cutting to the bone of any issue, Thomas Gregor
for teaching me much
during my time as a TA for ISC, and Ned Wingreen for somehow
always knowing
everything about any biological system. You all have made
Princeton biophysics not
only a superb place to do research, but a friendly and welcoming
environment.
The biophysics community also has had several postdocs and
graduate students
over the years that I would like to thank for teaching me much
and making my
time here so much fun. Andreas Mayer for great discussions on
immunology. I’ve
immensely enjoyed speculating about Information Geometry with
Ben Machta. I’d
also like to thank Leenoy Meshulam, Henry Mattingly, Dima Krotov, Ashley Linder, Ugne Klibaite, Ben Bratton, Gordon Berman, Michael Tikhonov, Xiaowen Chen, Guannan Liu, Mochi Liu, Alex Song, Sagar Setru, Mark Ioffe, and Jeff Nguyen.
Physics
The greater physics community has made Jadwin Hall a second home
for these years.
I’d like to thank Herman Verlinde for all of his work in
organizing the grad program.
A special thanks to Suzanne Staggs for being on my committee.
Thanks to Jessica
Heslin, Barbara Mooring, and Kate Brosowsky for the invaluable
administrative as-
sistance – without you we grad students would be helpless. Sumit
Saluja has been a
lifesaver with helping me get my code running on the server.
Also, a shoutout to the
softball team – especially the impressive Ed Groth.
Friends
Naturally, I must thank my fellow grad students who’ve been
through the wringer with
me and yet made my time here enjoyable. There are too many
people to name, so undoubtedly I have accidentally forgotten some people: I must beg
your forgiveness! I’d
like to thank Aitor Lewkowycz for the science, fun, keen
insight, and advice. Aaron
Levy for the innumerable discussions about life, politics, and
science. Will Coulton
for always being a good sport and a positive influence in every
scenario. Josh Hard-
enbrook for always calling me out when he thinks I am wrong.
Dave Zajac for helping
me ‘study’ for prelims with uncounted games of pool. Christian
Jepsen for his impec-
cable taste. Joaquin Turiaci and Debayan Mitra for the many fun
nights of beer and
foosball. Shai Chester for the fun and ridiculous stories, but
NOT for any ‘help’ in my
work. Farzan Beroz for the many philosophical and science
discussions. Lauren Mc-
Gough for the many discussions about stat mech,
information theory, and life.
Kenan Diab also understands the important things in a grad
student’s life: softball,
starcraft, MTG, and beer. Ilya Belopolski for doing many prelim
problems together
while DJ’ing with some select music. DJ Strouse for our annual
run-ins at APS and
the many good conversations about information theory and machine
learning. Bin Xu
for his always cheerful demeanor and great scientific
discussions. Mallika Randeria
for her friendship and advice. Tom Hazard, softball captain
extraordinaire. Many
thanks to Shawn Westerdale, Anne Gambrel, Guangyong Koh, Ed
Young, Matt Her-
nandez, Lee Gunderson, Sarthak Parikh, Grisha Tarnopolsky, Vlad Kirilin, Matteo Ippoliti, Luca Iliesiu, and Trithep Devakul. Thanks to
everyone.
Family
Lastly, I must thank the whole of my family for being so
supportive of me since
before I can remember. I come from a unique family, filled with
medical doctors and
physicists, such that when I go home I am frequently grilled on
my research. Coming
from such a background, it is no surprise that I’ve effectively
split the difference
between physics and medicine in this thesis.
It would be hard to overstate the influence my uncle, Jim
Sethna, has had on
me: I’ve quite literally followed in his footsteps in getting a
PhD in physics from
Princeton. Thank you Uncle Jim for all of your advice, support,
and even academic
mentorship. I cannot tell you how much it means to me.
My grandparents, Patarasp Sethna, Shirley Sethna, Marjory
Sethna, Joshua Lyn-
field, and Yelva Lynfield, have always been examples to me,
both in their achieve-
ments and morality. Sadly, not all of my grandparents will see
me graduate; however,
I am confident that all of them would both be proud and approve
of my time here.
I also thank my sisters, Julia and Sharon Sethna, for always
providing a ready
distraction when needed.
Finally, I would like to thank my parents Ruth Lynfield and
Michael Sethna, with-
out whom not only would this thesis not have been possible but I
never would have
been in the position in the first place. Your love, support,
direction, and parenting
have got me to this point. Mom, your talents and commitment to helping people are inspiring. Your work in infectious diseases and epidemiology has clearly colored my
interests. And Dad, your elevation of science and logic above
all else has shaped the
way I think. You have frequently ‘joked’ that studying math,
physics, and science is
‘holy work’ – a sentiment I certainly share. Thank you both for
everything.
“The idea is like grass. It craves light, likes crowds,
thrives
on crossbreeding, grows better for being stepped on.”
- Ursula K. Le Guin, The Dispossessed
Contents
Abstract  iii
Acknowledgements  iv
List of Tables  xiii
List of Figures  xiv
1 Introduction  1
  1.1 Adaptive immune system  1
    1.1.1 B cells  2
    1.1.2 T cells  3
    1.1.3 The DNA problem  4
  1.2 V(D)J recombination  4
  1.3 Repertoire sequencing and analysis  7
  1.4 Organization of thesis  8
2 Generative Model  9
  2.1 V(D)J recombination models  9
    2.1.1 VDJ generative model  11
    2.1.2 Model Validation  11
    2.1.3 VJ generative model  12
    2.1.4 Pgen  12
  2.2 Model Entropy  13
    2.2.1 Entropy of Precomb  14
    2.2.2 Entropy of Pgen  18
    2.2.3 The Pgen distribution  18
  2.3 Inference  19
    2.3.1 Errors and Mismatches  21
    2.3.2 Expectation Maximization algorithm  24
    2.3.3 Implementation  27
3 V(D)J recombination to sequences: Precomb → Pgen  28
    3.0.1 Probability Spaces (mathematical aside)  29
  3.1 Too many states! The free energy problem  29
  3.2 Dynamic Programming  31
  3.3 OLGA  33
    3.3.1 Notation, 3′ and 5′ vectors  34
    3.3.2 VDJ recombination: V, M, D, N, and J  37
    3.3.3 VJ recombination: V, M, and J  43
    3.3.4 Validation  44
    3.3.5 Comparison to existing methods  46
  3.4 Some applications of OLGA computed Pgen  48
    3.4.1 Pgen distributions and diversity  48
    3.4.2 Generation probability of epitope-specific TCRs  49
    3.4.3 Predicting the frequencies  51
    3.4.4 Generation probability of sequence motifs  53
4 The repertoires ‘Of Mice and Men’  55
  4.1 Of Mice... (mouse TRB)  55
    4.1.1 Generative model  57
    4.1.2 Changing insertion profile → Increasing diversity  58
    4.1.3 Mixture mode  60
    4.1.4 Toy model of mouse repertoire maturation  64
    4.1.5 Discussion  65
  4.2 ...and Men (human IGH)  67
    4.2.1 Analysis approach  67
    4.2.2 Generative Model, Allele identification  68
    4.2.3 Hypermutation  70
    4.2.4 Discussion  73
5 Sharing  74
  5.1 The Sharing Distribution  76
    5.1.1 Analytical calculation of the sharing distribution from the Pgen distribution  77
    5.1.2 Sharing modified by selection  81
  5.2 Extrapolation to full repertoires and beyond  83
  5.3 Predicting the publicness of sequences  86
    5.3.1 Sharing and TCR generation probability  86
    5.3.2 PUBLIC: Classifier of public vs. private TCRs based on generation probability  89
  5.4 Discussion  91
6 Conclusion  93
A Information Theory  96
  A.1 Entropy  96
  A.2 Mutual Information  98
  A.3 Kullback-Leibler divergence  99
B Probabilistic vs Deterministic inference  100
C Proof of Expectation Maximization algorithm  103
D Mouse Appendix  105
  D.1 Data  105
  D.2 Model parameters and validation  106
E Human B cells Appendix  113
  E.1 Repertoire entropy  113
  E.2 Inference of alleles and their chromosome distribution  114
  E.3 Model parameters and validation  116
F Sharing Appendix  122
  F.1 Sampling effects  122
  F.2 Monte Carlo simulation  124
    F.2.1 Sequence data  124
Bibliography  126
List of Tables
3.1 Distance metrics for OLGA VDJ validation  45
3.2 Time performance and scaling of possible methods  47
3.3 P^func_gen of TCR motifs  54
3.4 Pgen of invariant T cell (iNKT and MAIT cells) TRA motifs  54
4.1 Breakdown of B cell sequences and models  67
D.1 Mouse dataset summary  106
E.1 Heterozygous V allele information (Individual A)  116
E.2 Heterozygous D and J allele information (Individual A)  116
F.1 Mice dataset sample sizes  125
List of Figures
1.1 Schematic of VDJ recombination  5
2.1 Distribution functions: P(−E = log Pgen)  19
3.1 CDR3 indexing cartoon  34
3.2 Validation of OLGA VDJ algorithm  44
3.3 Validation of OLGA VJ algorithm  46
3.4 Precomb and Pgen distributions  48
3.5 Pgen of human TRB sequences for hepatitis C and influenza A epitopes  50
3.6 Pgen distributions for virus specific TRB sequences  51
3.7 Scatter of mean occurrence frequencies vs Pgen  52
4.1 Age-dependent insertion length distributions  56
4.2 Sequence entropy for thymic repertoires  59
4.3 Repertoire maturation schematic  61
4.4 Mean effective TdT level ᾱ and entropy vs age  63
4.5 Amount of mixing: variance of α vs age  64
4.6 Allele organization on chromosomes  69
4.7 Sequence dependence of somatic hypermutations  71
5.1 Pipeline for computing the distribution of shared sequences  76
5.2 Sharing distribution for 14 mice  78
5.3 Sharing distribution for 658 humans  79
5.4 Number of unique CDR3s in pooled repertoires  84
5.5 Fraction of total repertoire composed of ‘public’ sequences  85
5.6 Mouse Pgen distributions by sharing number  87
5.7 Human Pgen distributions by sharing number  88
5.8 PUBLIC classifier schematic  89
5.9 Performance of the PUBLIC classifier  90
B.1 Probabilistic vs Deterministic marginal distributions  101
D.1 Gene usages by mouse age  107
D.2 Deletion profiles by mouse age  108
D.3 Frequencies of non-templated insertions  109
D.4 Mouse model MI validation  110
D.5 Variation of V and J gene usage across biological replicates  111
D.6 Variation of deletion profiles across biological replicates  112
E.1 Entropy of B cell model  113
E.2 B cell gene usages  117
E.3 B cell deletion profiles  118
E.4 B cell non-templated nucleotide frequencies  119
E.5 PinsVD and PinsDJ over replicates  120
E.6 B cell model MI validation  120
E.7 B cell model insertion Markov model validation  121
F.1 Downsampling in sharing analyses  123
Chapter 1
Introduction
1.1 Adaptive immune system
The adaptive immune system evolved to provide animals with a
precision tool to
identify and remove anything ‘foreign’ to the animal. This is
done by having a
large library, or repertoire, of proteins called receptors that
bind specifically to some
small fragment of a protein called an epitope or antigen. This
binding or affinity
is determined by physical properties such as electrostatics,
hydrophobicity, van der Waals forces, steric effects, etc. By specificity we mean that
this receptor will only
bind to a very limited number of epitopes and have only limited
affinity for other
epitopes1. Crucially, this specificity allows the adaptive
immune system to weed
out any receptors that recognize self peptides, which would
trigger an autoimmune
response. However, this repertoire must be large and diverse
enough to be able to
identify any foreign peptide to ensure that microbes and
cancerous cells are quickly
identified and dealt with. In this thesis we will characterize
just how staggeringly
diverse these adaptive immune system repertoires are.
In order to generate and regulate these receptors, the adaptive
immune system
has a special class of cells called lymphocytes, of which there
are two main subtypes:
1 Frequently the amount of ‘cross-reactivity’ is assumed to be negligible.
B cells and T cells. Each lymphocyte has a single receptor, of
which it expresses many
copies, in order to recognize epitopes. These lymphocyte
receptors are protein com-
plexes composed of two amino acid chains, a larger one and a
smaller one. Each chain
has largely conserved portions (in order to standardize the way
the adaptive immune
system uses these receptors) along with highly variable regions
that provide the spe-
cific binding to epitopes. The most highly variable region, and
the one that largely
determines the affinity of a receptor to an epitope, is called
the complementarity-
determining region 3 or CDR32. We will often be a little sloppy
and refer to the
‘receptor’ and the CDR3 of a single chain interchangeably. Once
a ‘naive’ lympho-
cyte is activated by specifically binding to an epitope, it will
proliferate and some of
these cells will be archived as ‘memory’ cells to quickly
reactivate and eliminate the
antigen if the organism is ever exposed to it again.
1.1.1 B cells
B cells are lymphocytes that produce, and secrete, receptors
called antibodies. Anti-
bodies are composed of a heavy chain (IGH) and a light chain
(IGL). These receptors
can either be free in the plasma or expressed on the membrane of
B cells3. These
antibodies bind specifically to antigens. An antibody bound to
an antigen serves as a
tag for the rest of the immune system to attack the antigen.
Furthermore, antibodies
can directly neutralize microbes by binding to surface proteins and ‘gumming up’ their
operation. Foreign peptides in solution can also be made to
precipitate by antibodies
coagulating many of the peptides together.
2 There are two other variable loops, CDR1 and CDR2, that are determined by the V germline templates. As a result the variation of these loops is limited. While the CDR1 and CDR2 loops are important biologically, particularly for major histocompatibility complex (MHC) recognition of T cells, we focus exclusively on the CDR3 region in this thesis. Unlike the CDR1 and CDR2 loops, the CDR3 region spans the region of the receptor sequence where the DNA editing process called V(D)J recombination occurs (1.2). We define the boundaries of the CDR3 region to be the conserved amino acid residues cysteine (C) on the 5′ end and a phenylalanine (F) or tryptophan (W) on the 3′ end. These conserved residues are important to ensure the receptor folds and works properly.
3 If expressed on a membrane an antibody is frequently referred to as a B cell receptor (BCR). We are sometimes sloppy and will refer to antibodies in general as BCRs to parallel TCRs.
The amazing specificity of antibodies is generated through a
process called hy-
permutation [Teng and Papavasiliou, 2007]. Following the
successful recognition of
an antigen, a B cell proliferates and its receptor sequence
undergoes random point
mutations. These cells are then selected for affinity to the
epitope. The result is an
evolutionary process within a single individual, producing
receptors with dramatically
increased affinity to the epitope. We will present a
quantitative model of hypermu-
tation in chapter 4.
1.1.2 T cells
Whereas antibodies bind directly to epitopes in solution, T
cells have their epitope
recognition mediated by other cells. In animals with adaptive
immune systems, cells
display a protein complex called major histocompatibility
complex (MHC) on their
membrane. This protein complex can then be ‘loaded up’ with a
peptide fragment by
the cell, and a T cell receptor (TCR) can then recognize the
peptide - MHC complex
(pMHC)4. Cells load up the MHC complex with chopped up peptides
internal to the
cell, giving the T cell a snapshot of the current protein
synthesis of the cell. This
provides an excellent mechanism for the T cell to be able to
identify if a cell was
infected by a virus or has become cancerous. Also, if a cell is
infected by a virus it is
possible that peptides internal to the viral capsid (and thus
not an accessible epitope
to an antibody/BCR) could be loaded up into pMHC, providing
additional epitopes
for the adaptive immune system to tag.
Similar to antibodies, TCRs are composed of two chains, an α
chain (TRA) and
a β chain (TRB). Ideally we would analyze the full receptor
composed of TRA-TRB
pairs; however, it is experimentally difficult to perform high-throughput sequencing that
4 This is the interaction between cytotoxic or CD8+ T cells and the MHC I complex. There is an additional MHC complex (MHC II) that is expressed by a class of cells called antigen presenting cells (APC) that actively uptake and present peptides. There are also several other classes of T cells, which perform a variety of roles. For the purposes of this thesis we focus on CD8+ T cells and the MHC I complex.
accurately pairs TRA and TRB chains. Instead, many sequencing
analyses focus on
only one chain. For much of this thesis we will focus on TRBs in
both humans and
mice, as the TRB chain is not only much more diverse than the TRA chain but is also
the chain that determines much of the receptor-epitope
specificity.
1.1.3 The DNA problem
The massive diversity of receptors needed for a functioning
repertoire poses a very
interesting problem. These receptors are proteins, coded for by
DNA sequences. Each
unique receptor demands a unique DNA sequence. The number of
unique receptors
in a repertoire utterly dwarfs the number of coding genes in a
genome. For example,
a human TRB repertoire might have 10^8–10^10 unique receptors, whereas the number of coding genes in the human genome is approximated to be of the order of 10^4–10^5.
Clearly the human genome cannot directly store the DNA sequences
of every receptor
in a repertoire. This prompts the question of how such a diversity of receptors can be generated from limited DNA.
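A rough back-of-envelope calculation makes the gap concrete (the segment counts below are illustrative round numbers, not the exact human TRB gene counts used later in the thesis): germline combinatorics alone falls orders of magnitude short of the repertoire size, so the junctional editing described in the next section must supply most of the diversity.

```python
# Approximate human TRB germline segment counts (illustrative round
# numbers, not the exact counts used in later chapters)
n_V, n_D, n_J = 50, 2, 13

# Germline combinatorics alone: thousands of combinations, not billions
germline_combos = n_V * n_D * n_J  # = 1300

# Allowing ~10 random junctional nucleotides multiplies each combination
# by 4^10 ~ 1e6 variants, bridging the gap to the observed 1e8-1e10 range
with_junctions = germline_combos * 4 ** 10

print(germline_combos, with_junctions)
```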
1.2 V(D)J recombination
The solution to the apparent conundrum laid out in the previous
section is a process
called V(D)J recombination wherein the actual DNA sequences of
developing B cells
and T cells get recombined, generating novel genes that
translate to unique CDR3
amino acid sequences. While highly regulated, this process
allows the adaptive im-
mune repertoire to generate the necessary diversity to
specifically recognize foreign
antigens/epitopes. This discovery led to Susumu Tonegawa’s 1987
Nobel Prize in Physiology or Medicine [Hozumi and Tonegawa, 1976]. The rest of the thesis
will involve proba-
bilistically modeling this V(D)J recombination.
Figure 1.1: Schematic of VDJ recombination
[Diagram omitted: arrangement of V, D, and J genes with RSS sites, RAG-mediated D–J and V–D joining, and TdT insertions N1 and N2.]
Simplification of the stages of VDJ recombination for TRB. Shows the arrangement of example V, D, and J genes on the chromosome, along with the RSS regions (orange stripes). For the TRB gene locus the D and J genes are arranged as above, which implies the topological constraint that D2 and J1-∗ genes are never jointly used. Non-templated nucleotides, indicated by N1 and N2, are inserted at the VD and DJ junctions by the TdT complex.
V(D)J recombination has become an extremely well studied process
over the past
40 years and the critical enzymes have been identified and
studied. Of particular
interest to this thesis will be the enzymes recombination
activating genes (RAG) 1
and 2, and terminal deoxynucleotidyl transferase (TdT), both of
which are uniquely
expressed in lymphocytes. VDJ recombination leads to the
generation of sequences
that produce IGH and TRB chains, while VJ recombination produces
IGL and TRA
chains.
Before recombination, the germline chromosome has two or three
types of genetic
templates: variable (V), diversity (D), and joining (J). For
each type of template,
there are multiple genes (e.g. there are 35 TRBV genes in mice)
which are identified by immediately adjacent, highly stereotyped, 7-mer
nucleotide sequences called
recombination signal sequences (RSS). During VDJ recombination5,
RAG enzymes
bind specifically to the RSS of a J gene and of a D gene and
make an incision that cuts
out the intervening DNA. This cutting of the DNA can be messy,
possibly deleting
away parts of the D and J genes, or leaving some single stranded
DNA hanging, which
will get repaired by inserting in reverse complementary
palindromic nucleotides. The
D and J genes are then spliced together, possibly with
non-templated nucleotide in-
sertions from the TdT enzyme. A similar slicing and splicing
process then happens
at the V-D junction.
To remove the biology, and make this clear on an abstract level,
VDJ recombina-
tion acts by choosing a particular gene (strings of nucleotides)
for each of the V, D,
and J segments, deleting away some of the nucleotides of those
genes (or inserting
reverse palindromic nucleotides), and then inserting random
nucleotides at the VD
and DJ junctions as the sequence is spliced together to read
(from 5′ to 3′) VDJ.
This provides a new DNA sequence, where all of the edits
(splicing, deleting, and
inserting) correspond to the CDR3 region of the receptor.
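The abstract string picture above can be sketched in a few lines of code. The following toy simulation (illustrative only: the gene segments, deletion ranges, and insertion lengths are made-up values, not the inferred model parameters of chapter 2) assembles a sequence by choosing V, D, and J strings, trimming their ends, and inserting random nucleotides at the two junctions:

```python
import random

# Toy germline segments (hypothetical sequences, not real TRB genes)
V_GENES = ["TGTGCCAGCAGC", "TGTGCCACCAGC"]
D_GENES = ["GGGACAGGG", "GGGACTAGC"]
J_GENES = ["AACTATGGCTAC", "AACACTGAAGCT"]

def toy_vdj_recombination(rng=random):
    """Assemble one toy VDJ sequence: choose segments, trim their ends,
    and insert random nucleotides at the VD and DJ junctions."""
    v = rng.choice(V_GENES)
    d = rng.choice(D_GENES)
    j = rng.choice(J_GENES)
    # Random numbers of deletions from the segment ends
    v = v[:len(v) - rng.randint(0, 3)]                   # 3' end of V
    d = d[rng.randint(0, 2):len(d) - rng.randint(0, 2)]  # both ends of D
    j = j[rng.randint(0, 3):]                            # 5' end of J
    # Non-templated insertions at the two junctions (TdT's role)
    n1 = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 4)))
    n2 = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 4)))
    return v + n1 + d + n2 + j  # read 5' to 3': V, N1, D, N2, J

print(toy_vdj_recombination())
```

Every random choice in this sketch corresponds to one of the event variables whose probabilities the generative model parameterizes.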
This V(D)J recombination process has no guarantee of success, or
of producing
a DNA sequence that can translate to a functional protein. As
there are random
numbers of deletions and insertions, the DNA sequence may have
frame shifts or stop
codons in it. If this happens, and a V(D)J recombination event
on a chromosome
leads to a nonproductive sequence, the cell may try again on the
second chromosome.
If this second recombination leads to a functional receptor, the
cell will have two
rearranged chromosomes: one functional and expressed, and the
nonfunctional one
silenced by allelic exclusion. This fortunate quirk will prove
crucial later in this thesis.
Once a T cell or B cell has a functional receptor there is some
quality control that
occurs. The cell undergoes both positive selection (e.g.
checking a TCR interacts
5 In VJ recombination, there is no D gene, and the V and J genes are directly spliced together.
well with MHC) and negative selection (i.e. removing cells with
high affinity to self
epitopes). This somatic selection process is crucial both to ensure useful receptors and to prevent autoimmune responses, and it skews the repertoire at a statistical level.
Models characterizing the statistics of this selection process
have been introduced by
my collaborators, particularly Yuval Elhanati [Elhanati et al.,
2014], and are discussed
in the papers that are referenced in chapter 4 [Elhanati et al.,
2015, Sethna et al.,
2017].
1.3 Repertoire sequencing and analysis
Advances in high throughput sequencing [Robins et al., 2010a]
have allowed for large
scale sequencing of lymphocytes in a blood or tissue sample: the
sample is broken
down, the DNA extracted, and specialized primers amplify the DNA
sequence of the
CDR3 region before sequencing. Such experiments are now becoming
so routine that
there is interest in using them for medical diagnostic and
immunotherapy purposes.
Almost all of the data discussed in this thesis was sequenced
using a protocol pio-
neered by Harlan Robins [Robins et al., 2010a], who has started
a company, Adaptive
Biotechnologies, to provide repertoire sequencing services.
These experiments can successfully sequence millions of cells (or more), producing datasets of ∼10⁴–10⁶ unique DNA sequences. The
availability of datasets of
such size and quality allows for serious statistical analyses to
quantify the underly-
ing biology as well as the possibility to explore more
theoretical questions. Being
physicists, the approach we will take in this thesis is to
construct a statistical model,
i.e. a parameterized probability distribution, of V(D)J
recombination that reflects
the underlying biological processes. These large datasets are
then used to infer the
model parameters. The model parameters will provide quantitative
descriptions of the
V(D)J recombination machinery, and the model itself provides a
distribution of the
probability of generating any receptor (Pgen) that can be used
to answer theoretical
questions like characterizing the diversity of a repertoire.
1.4 Organization of thesis
This thesis is broken into two main parts. The first covers chapters 2 and 3 and
provides the mathematical framework for the rest of the thesis.
The class of generative
models used to analyze the generation probability (Pgen) of
adaptive immune system
repertoires (first introduced in Murugan et al. [2012]) is
described, and the inference
process, expectation maximization (EM), used to fit the model
parameters is laid out.
We also show how one of the main metrics we use, the entropy of
a model, can be
computed and broken down into different components. In addition,
the computational
challenges associated with computing Pgen of sequences are discussed, in particular the
exponential explosion of the number of recombination events that
generate amino acid
CDR3 sequences. We then demonstrate the novel dynamic
programming algorithm,
OLGA [Sethna et al., 2018], that we developed to efficiently
solve this problem and
make the computation of Pgen for amino acid CDR3 sequences not only tractable, but fast.
The second part, spanning chapters 4 and 5, dives into the
applications of the
modeling framework defined in the first part. The first part of
chapter 4 describes
the work from Sethna et al. [2017] analyzing the maturation of
mouse repertoires from
embryo to young adult. The second half of chapter 4 lays out a
model quantifying
hypermutation in B cells [Elhanati et al., 2015]. Finally,
chapter 5 demonstrates how
Pgen explains the curious observation of so-called ‘public’
sequences.
Chapter 2
Generative Model
2.1 V(D)J recombination models
The definition, selection, and inference of a generative model
of V(D)J recombination
is the foundation for all of the work that comes later. Such a
generative model defines
a probability measure over the state space of V(D)J
recombination events, which can
be extended to define probabilities of particular receptors or
collections of receptors.
We begin by introducing a general model framework by requiring
that the model
respects the biology of the V(D)J recombination process. To do
this we define the
state (sample) space of V(D)J recombination events by
combinations of the stochastic
events in the DNA splicing itself (i.e. gene choice,
deletions/palindromic insertions,
and insertions). For example, we can describe the state (sample) space of VDJ recombination events as:

$$\Omega_e = \left\{ (V, D, J, d_V, d_D, d'_D, d_J, \{m_i\}, \{n_i\}) \right\} \qquad (2.1)$$
where V, D, and J are the gene choices; dV, dD (5′/left), d′D (3′/right), and dJ are deletions (including palindromic insertions); and {mi} and {ni} are the specific nucleotide sequences which are inserted at the VD and DJ junctions, respectively¹.
This also allows us to define a fully general model family for
the recombination event
e ∈ Ωe:
$$P_{\text{recomb}}(e) = P(V, d_V, \{m_i\}, d_D, D, d'_D, \{n_i\}, d_J, J) \qquad (2.2)$$
We cannot use the fully general model above, which defines a
unique probability for
each combination of recombination events, due to the exponential
explosion of param-
eters. The challenge is to construct sub-models which have few
enough parameters
to be inferred, yet still sufficiently describe the observed
sequences. In general this is
done by positing the independence and dependence of the various
splicing events and
then checking if the factorization captures the necessary
correlations. The specific
models used are factorized to reflect the spatial correlations
along the chromosome.
For VDJ recombination, these models assume that the V choice is
independent
of the D/J choice (the latter two being correlated by virtue of
the order in which
the genes are laid out on the chromosome, see Fig. 1.1), the
deletion profiles depend
only on the gene choice, and lastly that the insertions are
independent of the genomic
contributions and each other. There is still an exponential
blowup of parameters
unless a simpler (fewer parameters) model for the inserted
sequences is introduced.
We use a model that is a product of a length distribution and a
dinucleotide Markov
model. This model factorization and dinucleotide Markov model is
first introduced
and validated in Murugan et al. [2012], however important
exceptions to this factor-
ization will be discussed in Chapter 4 in the contexts of mouse
T cells [Sethna et al.,
2017] and human B cells (Elhanati et al. [2015]). For VJ
recombination, these models
assume the V/J choice is correlated, the deletion profiles
depend only on the gene choice,
and lastly the insertion region is independent of the genomic
contribution.
¹The subscript index i is read from 5′ to 3′.
2.1.1 VDJ generative model
The VDJ recombination model is defined as:

$$
\begin{aligned}
P_{\text{recomb}}(e) = {} & P_V(V)\, P_{DJ}(D, J)\, P_{\text{delV}}(d_V|V)\, P_{\text{delJ}}(d_J|J)\, P_{\text{delD}}(d_D, d'_D|D) \\
& \times P_{\text{insVD}}(\ell_{VD})\, p_0(m_1) \prod_{i=2}^{\ell_{VD}} S_{VD}(m_i|m_{i-1}) \\
& \times P_{\text{insDJ}}(\ell_{DJ})\, q_0(n_{\ell_{DJ}}) \prod_{i=1}^{\ell_{DJ}-1} S_{DJ}(n_i|n_{i+1})
\end{aligned}
\qquad (2.3)
$$
where the inserted nucleotide sequences {mi} and {ni} have lengths ℓVD and ℓDJ with insertion length distributions PinsVD(ℓVD) and PinsDJ(ℓDJ), SVD and SDJ are the respective dinucleotide Markov transition matrices, and finally, p0 and q0 are the nucleotide biases for the first insertion at each junction². Note that an inserted sequence of length 0 (i.e. no insertions at a junction) is also allowed and has probability PinsVD(0) or PinsDJ(0) depending on the splicing junction.
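Because the model factorizes, a recombination event can be sampled one factor at a time. The sketch below draws a single VDJ event from toy parameter tables; all numbers are hypothetical placeholders rather than inferred values, and the DJ insertion junction is omitted for brevity:

```python
import random

# Toy parameter tables (hypothetical placeholder values, not inferred).
P_V = {"V1": 0.6, "V2": 0.4}
P_DJ = {("D1", "J1"): 0.5, ("D1", "J2"): 0.3, ("D2", "J2"): 0.2}
P_delV = {"V1": {0: 0.7, 1: 0.3}, "V2": {0: 0.5, 1: 0.5}}
P_delJ = {"J1": {0: 1.0}, "J2": {0: 0.6, 1: 0.4}}
P_delD = {"D1": {(0, 0): 1.0}, "D2": {(0, 0): 1.0}}
P_insVD = {0: 0.3, 1: 0.4, 2: 0.3}                     # VD insertion-length distribution
p0 = {m: 0.25 for m in "ACGT"}                         # first-nucleotide bias
S_VD = {m: {n: 0.25 for n in "ACGT"} for m in "ACGT"}  # dinucleotide transitions

def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} table."""
    outcomes, weights = zip(*dist.items())
    return rng.choices(outcomes, weights=weights)[0]

def sample_event(rng):
    V = draw(P_V, rng)
    D, J = draw(P_DJ, rng)                 # D and J drawn jointly (correlated)
    dV, dJ = draw(P_delV[V], rng), draw(P_delJ[J], rng)
    dD, dDp = draw(P_delD[D], rng)         # 5' and 3' D deletions drawn jointly
    ins = []
    for i in range(draw(P_insVD, rng)):    # Markov chain over VD insertions
        ins.append(draw(p0 if i == 0 else S_VD[ins[-1]], rng))
    return (V, D, J, dV, dD, dDp, dJ, "".join(ins))

event = sample_event(random.Random(0))
print(event)
```

Repeated draws from such a sampler are exactly the Monte Carlo simulations used later to estimate repertoire entropies.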
2.1.2 Model Validation
As mentioned above, it is important to check that the
factorization of the model
structure is correct. To address this issue, we examine the
correlations between
various marginal variables of the model (i.e. the stochastic
recombination events: V,
delV, J, insVD, etc) by examining the mutual information of each
pair.
To determine if we have captured the correct correlations in the
data, we compare
the precise mutual information computed directly from the model,
to the estimated
mutual information determined by the expectation over the data
(using the Treves-
Panzeri correction [Treves et al., 1998] to account for finite
sample size).
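As a concrete illustration of this check, here is a minimal plug-in estimator of the mutual information between two marginal variables from paired samples; the Treves–Panzeri finite-sample correction used in the text is deliberately omitted to keep the sketch short:

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in MI estimate (bits) from a list of (x, y) samples.
    (The Treves-Panzeri finite-sample correction used in the text is omitted.)"""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Perfectly correlated pair: MI equals the 1-bit marginal entropy.
mi_corr = mutual_information([(0, 0), (1, 1)] * 50)
# Independent, balanced pair: MI is exactly 0 for this sample.
mi_ind = mutual_information([(x, y) for x in (0, 1) for y in (0, 1)] * 25)
print(mi_corr, mi_ind)  # 1.0 0.0
```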
The generative model has zero mutual information, by
construction, between independent marginal pairs, e.g. the number of VD insertions and
the choice of J gene.
2Note, we often make the further approximation that the
insertion Markov model is at steady-state, i.e. we set p0 and q0 to
be the steady state distributions of SVD and SDJ respectively.
Variables that correlate with each other either directly or
indirectly, e.g. between D
and J gene choice, or between D choice and number of D deletions
may have non-zero
mutual information. In order to quickly gauge if a model is
consistent (or inconsistent) with the model factorization, we use plots like Fig. D.4, where the MI computed
from the model is below the diagonal and the expectation over
the data is above the
diagonal. If the plot is symmetric about the diagonal, then the
model is self-consistent with the data. Indeed, the total missed mutual information
is, to leading order,
precisely the amount of information our factorized model missed
due to its structure.
To validate the dinucleotide Markov model for insertions, we
compare the expected
trinucleotide frequencies to the observed trinucleotide
frequencies.
We will perform these checks in Chapter 4 when we look at mouse
T cells [Sethna
et al., 2017] and human B cells [Elhanati et al., 2015].
2.1.3 VJ generative model
Analogous to the VDJ model, we define the model factorization
for the generative
model of VJ recombination. The primary distinction is that there
is no D gene, nor
is there an N2 insertion region (DJ junction). Also, as there is
evidence of repeated
splicing attempts for the TCRα chain, the V and J gene usages
are allowed to be
correlated [Elhanati et al., 2016].
$$P_{\text{recomb}}(e) = P_{VJ}(V, J)\, P_{\text{delV}}(d_V|V)\, P_{\text{delJ}}(d_J|J)\, P_{\text{insVJ}}(\ell_{VJ})\, p_0(m_1) \prod_{i=2}^{\ell_{VJ}} S_{VJ}(m_i|m_{i-1}) \qquad (2.4)$$
2.1.4 Pgen
Our model, Precomb, defines a probability measure over the state
(sample) space Ωe of
recombination events. However, this model can, in theory, be
extended to other state
(sample) spaces of much more interest scientifically and
biologically. In particular,
we examine the state spaces of DNA nucleotide sequence reads,
CDR3 nucleotide
sequences, and CDR3 amino acid sequences (or collections/motifs
of amino acid CDR3
sequences). This is done by summing over all recombination
events that generate one
of the ‘coarse grained’ states to give the probability of
generating a particular CDR3
sequence or receptor.
$$P_{\text{gen}}(\text{seq}) = \sum_{e | \text{seq}} P_{\text{recomb}}(e) \qquad (2.5)$$
This generation probability, or ‘Pgen’, of a sequence or
receptor will be used con-
tinuously throughout this thesis. We will return to this idea of
extending or ‘coarse
graining’ the probability space in greater detail in Chapter
3.
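The coarse graining of Eq. 2.5 is just a push-forward of the probability measure; a minimal sketch with a hypothetical three-event model:

```python
from collections import defaultdict

# Hypothetical event probabilities: keys stand in for recombination events e,
# values for Precomb(e); `generates` maps each event to its CDR3 sequence.
P_recomb = {"e1": 0.5, "e2": 0.25, "e3": 0.25}
generates = {"e1": "TGTGCC", "e2": "TGTGCC", "e3": "TGTTCC"}

def pgen_all(P_recomb, generates):
    """Coarse-grain Precomb into Pgen by summing over e | seq (Eq. 2.5)."""
    out = defaultdict(float)
    for e, p in P_recomb.items():
        out[generates[e]] += p
    return dict(out)

print(pgen_all(P_recomb, generates))  # {'TGTGCC': 0.75, 'TGTTCC': 0.25}
```

Note that the total measure of 1 is preserved; only the resolution of the state space changes.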
2.2 Model Entropy
Before we introduce our method for inferring the model
parameters, we first introduce
a concept that we will return to repeatedly: the entropy of a
model. One of the
advantages of having a probabilistic model of V(D)J
recombination is that we can
use the (Shannon) entropy (Appendix A) of the distribution as a
well defined measure
of the ‘diversity’ of a repertoire. We examine the entropy of
both Precomb and Pgen.
First we show how to compute the entropy S(Precomb) directly
from the model, and
how it decomposes into contributions from the gene choice, the
deletions, and the
insertions. We also show explicitly how changing the insertion
length distribution
has an outsized impact on the entropy. Then we discuss how to
approximate S(Pgen)
by Monte Carlo simulation. Throughout this section we do not specify the units in which to express the entropy; however, we will most frequently talk about entropy in units of bits (so the base of the log is 2, i.e. log₂)³.
³Personally, I think everything should be done in nats (log base e); however, for most people it is easier to parse bits (log base 2) or dits (log base 10).
2.2.1 Entropy of Precomb
The entropy⁴ of a VDJ recombination model is:

$$H(P_{\text{recomb}}) = -\langle \log P_{\text{recomb}} \rangle_{\Omega_e} = -\left\langle \log\!\left( P_V P_{DJ} P_{\text{delV}} P_{\text{delJ}} P_{\text{delD}} P_{\{m_i\}} P_{\{n_i\}} \right) \right\rangle_{\Omega_e} \qquad (2.6)$$
Now, we can break the total entropy expression into independent
components, and
compute the entropy of each of the components independently.
Genes/Deletions entropic contribution
The gene/deletion contributions are fairly straightforward to
compute. Examining
the V templates:
$$
\begin{aligned}
H(P_V(V)\, P_{\text{delV}}(d_V|V)) &= -\sum_{V, d_V} P_V(V) P_{\text{delV}}(d_V|V) \left[ \log P_V(V) + \log P_{\text{delV}}(d_V|V) \right] \\
&= -\sum_V P_V(V) \log P_V(V) - \sum_{V, d_V} P_V(V) P_{\text{delV}}(d_V|V) \log P_{\text{delV}}(d_V|V) \\
&= H(P_V) + \sum_V P_V(V)\, H(P_{\text{delV}}(d_V|V)) \\
&= H(P_V) + \langle H(P_{\text{delV}}) \rangle_V
\end{aligned}
\qquad (2.7)
$$
In an analogous fashion we can determine H(PDJ), ⟨H(PdelD)⟩D, and ⟨H(PdelJ)⟩J. We say that the entropy contribution from the choice of germline template is H(PV) + H(PDJ), while the deletion entropic contribution is ⟨H(PdelV)⟩V + ⟨H(PdelD)⟩D + ⟨H(PdelJ)⟩J.
⁴We indicate entropy by H, not S, in this section so as not to confuse notation with the dinucleotide transition matrices SVD and SDJ.
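The decomposition in Eq. 2.7 is easy to verify numerically; the sketch below compares the joint entropy of PV(V)PdelV(dV|V) against H(PV) + ⟨H(PdelV)⟩V for toy (hypothetical) marginals:

```python
from math import log2

# Toy marginals (hypothetical): gene usage and conditional deletion profiles.
P_V = {"V1": 0.75, "V2": 0.25}
P_delV = {"V1": {0: 0.5, 1: 0.5}, "V2": {0: 1.0}}

def H(dist):
    """Shannon entropy in bits of a {outcome: probability} table."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Left-hand side: entropy of the joint distribution P_V(V) * P_delV(dV|V).
joint = {(V, d): P_V[V] * q for V, cond in P_delV.items() for d, q in cond.items()}
lhs = H(joint)
# Right-hand side of Eq. 2.7: H(P_V) + <H(P_delV)>_V.
rhs = H(P_V) + sum(P_V[V] * H(P_delV[V]) for V in P_V)
print(abs(lhs - rhs) < 1e-9)  # True
```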
Insertion entropic contribution
The entropy of the insertions is much trickier to compute as we
will have to sum
the Markov model probabilities over all possible insertion
sequences. We drop the
VD/DJ subscripts as the computations are identical.
$$
\begin{aligned}
H(P_{\{m_i\}}) &= -\sum_{\{m_i\}} P_{\{m_i\}}(\{m_i\}) \log P_{\{m_i\}}(\{m_i\}) \\
&= -\sum_{\ell} \sum_{\{m_i\}|\ell} P_{\text{ins}}(\ell)\, P_{\{m_i\}|\ell}(\{m_i\}) \left[ \log P_{\text{ins}}(\ell) + \log P_{\{m_i\}|\ell}(\{m_i\}) \right] \\
&= -\sum_{\ell} P_{\text{ins}}(\ell) \log P_{\text{ins}}(\ell) - \sum_{\ell} P_{\text{ins}}(\ell) \sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\}) \\
&= H(P_{\text{ins}}) - \sum_{\ell} P_{\text{ins}}(\ell) \sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\})
\end{aligned}
\qquad (2.8)
$$

where,

$$P_{\{m_i\}|\ell}(\{m_i\}) = p_0(m_1) \prod_{i=2}^{\ell} S(m_i|m_{i-1}). \qquad (2.9)$$
In order to make the dependence of this entropy on the average insertion length (⟨ℓ⟩) more explicit, we will make the approximation that the Markov model is at steady state (i.e. p0 = pss, the steady-state distribution of S).
We will now prove inductively that for ℓ ≥ 1:

$$
H(P_{\{m_i\}|\ell}) = -\sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\}) = H(p_{ss}) - (\ell - 1) \sum_m p_{ss}(m) \sum_n S(n|m) \log S(n|m)
\qquad (2.10)
$$
Initial Step: ℓ = 1
This is trivial as P{mi}|ℓ({mi} = m) = p0(m) = pss(m), so by direct computation:

$$-\sum_{\{m_i\}|\ell=1} P_{\{m_i\}|\ell}(m) \log P_{\{m_i\}|\ell}(m) = -\sum_m p_{ss}(m) \log p_{ss}(m) = H(p_{ss}) \qquad (2.11)$$
Inductive step
Assuming we have shown that Eq. 2.10 is true for ℓ ≤ k, we prove it holds for ℓ = k + 1.

$$
\begin{aligned}
&-\sum_{\{m_i\}|\ell=k+1} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\}) \\
&= -\sum_{m_{k+1}} \sum_{\{m_{i\le k}\}} S(m_{k+1}|m_k)\, P_{\{m_i\}|k}(\{m_{i\le k}\}) \left[ \log S(m_{k+1}|m_k) + \log P_{\{m_i\}|k}(\{m_{i\le k}\}) \right] \\
&= H(P_{\{m_i\}|k}) - \sum_{m_{k+1}} \sum_{\{m_{i\le k}\}} S(m_{k+1}|m_k)\, P_{\{m_i\}|k}(\{m_{i\le k}\}) \log S(m_{k+1}|m_k) \\
&= H(P_{\{m_i\}|k}) - \sum_{m_{k+1}} \sum_{m_k} S(m_{k+1}|m_k) \log S(m_{k+1}|m_k) \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\, P(\{m_{i\le k-1}\}|k-1)
\end{aligned}
\qquad (2.12)
$$
Now, in order to do the summation in the second term, we make the observation that the conditional terms only depend on the last two nucleotides m_{k+1} and m_k, so we would like to get the marginal distribution

$$p_k(m_k) = \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\, P(\{m_{i\le k-1}\}|k-1) \qquad (2.13)$$
But, we recall our previous assumption that the Markov process
is in its steady state
to know that the marginal distribution is the same as the
steady-state distribution
(i.e. pk = pss)⁵. Plugging this back in shows

$$
-\sum_{m_{k+1}} \sum_{m_k} S(m_{k+1}|m_k) \log S(m_{k+1}|m_k) \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\, P(\{m_{i\le k-1}\}|k-1) = -\sum_m p_{ss}(m) \sum_n S(n|m) \log S(n|m)
\qquad (2.14)
$$
which shows the inductive step holds for k + 1 and completes the
proof.
Putting everything together, the entropy from a single insertion junction is

$$H(P_{\text{ins}}) + H(p_{ss}) - (\langle \ell \rangle - 1) \sum_m p_{ss}(m) \sum_n S(n|m) \log S(n|m) \qquad (2.15)$$
Note the dependence of this expression on the average number of
insertions 〈`〉.
We will return to this in chapter 4 when we see that the way a
repertoire scales its
diversity is by changing the insertion length distribution.
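Eq. 2.10 can be checked by brute-force enumeration for a short insertion length. The sketch below uses a hypothetical doubly stochastic transition matrix, so that the steady state p_ss is uniform:

```python
from itertools import product
from math import log2

NUC = "ACGT"
# Hypothetical dinucleotide transition matrix; doubly stochastic, so the
# steady-state distribution p_ss is uniform over the four nucleotides.
S = {"A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
     "C": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
     "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
     "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4}}
p_ss = {m: 0.25 for m in NUC}

def seq_prob(seq):
    """Eq. 2.9 with p0 = p_ss: steady-state Markov probability of a sequence."""
    p = p_ss[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= S[a][b]
    return p

l = 3
# Brute-force entropy over all 4**l insertion sequences of length l ...
H_enum = -sum(seq_prob("".join(s)) * log2(seq_prob("".join(s)))
              for s in product(NUC, repeat=l))
# ... versus the closed form H(p_ss) + (l-1) * average transition entropy (Eq. 2.10).
H_trans = -sum(p_ss[m] * S[m][n] * log2(S[m][n]) for m in NUC for n in NUC)
H_closed = -sum(p * log2(p) for p in p_ss.values()) + (l - 1) * H_trans
print(abs(H_enum - H_closed) < 1e-9)  # True
```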
Total entropy of Precomb
$$
\begin{aligned}
H(P_{\text{recomb}}) = {} & H(P_V) + H(P_{DJ}) + \langle H(P_{\text{delV}}) \rangle_V + \langle H(P_{\text{delD}}) \rangle_D + \langle H(P_{\text{delJ}}) \rangle_J \\
& + H(P_{\text{insVD}}) + H(p_{ss}) - (\langle \ell_{VD} \rangle - 1) \sum_m p_{ss}(m) \sum_n S_{VD}(n|m) \log S_{VD}(n|m) \\
& + H(P_{\text{insDJ}}) + H(q_{ss}) - (\langle \ell_{DJ} \rangle - 1) \sum_m q_{ss}(m) \sum_n S_{DJ}(n|m) \log S_{DJ}(n|m)
\end{aligned}
\qquad (2.16)
$$
5If we didn’t want to make the steady-state assumption, it is
easy to see how using this marginaldistribution would change Eq.
2.10 to:H(P{mi}|`) = H(p0)−
∑`k=2
∑m pk(m)
∑n S(n|m) log(S(n|m))
2.2.2 Entropy of Pgen
The probability distribution of Pgen no longer factorizes after
the summation. As a
result we cannot break down the entropy into independent pieces.
Instead, a different tack is taken: we estimate the entropy of Pgen.
We recall that the entropy of a distribution is just −⟨log P⟩. This means that we can estimate the entropy of Pgen by taking the expectation value over Monte Carlo simulated sequences:

$$S(P_{\text{gen}}) \approx -\langle \log P_{\text{gen}}(s) \rangle_{s \in \text{MC sample}} \qquad (2.17)$$
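A minimal sketch of this Monte Carlo estimate, using a hypothetical four-sequence Pgen for which the entropy is known exactly:

```python
import random
from math import log2

# Hypothetical Pgen over four "sequences" with a known exact entropy.
Pgen = {"s1": 0.5, "s2": 0.25, "s3": 0.125, "s4": 0.125}
exact = -sum(p * log2(p) for p in Pgen.values())      # 1.75 bits

rng = random.Random(42)
seqs, weights = zip(*Pgen.items())
sample = rng.choices(seqs, weights=weights, k=20000)  # Monte Carlo draw
# Estimate S(Pgen) = -<log2 Pgen(s)> over the sampled sequences (Eq. 2.17).
estimate = -sum(log2(Pgen[s]) for s in sample) / len(sample)
print(round(exact, 3), round(estimate, 3))
```

For a real repertoire model the sample would come from a sampler like the one in Section 2.1, and the standard error shrinks as 1/√N with the sample size N.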
2.2.3 The Pgen distribution
Another extremely effective way of visualizing the diversity of
a repertoire is to exam-
ine the probability density of the log Pgen of sequences. If a
large number of sequences
(or recombination events) are drawn from a model distribution
(i.e. Monte Carlo sam-
pling), they can be histogrammed by the log of their generation
probabilities. If we
define an energy as E ∼ − log Pgen, this distribution is the
probability density P (−E),
and is closely related to the density of states (a connection we
will return to in chapter
5). An example of one of these plots is shown for a human TRB
model in Fig. 2.1,
demonstrating the massive range of generation probabilities,
spanning ∼20 orders of
magnitude. Another very useful aspect of these plots is that the
mean of each dis-
tribution is the entropy of the distribution (up to a minus
sign), and is indicated as
the dotted lines in Fig. 2.1. We frequently use such plots as a
way of characterizing
the data visually. It is easy to see shifts to more or less
entropic distributions, and
to see any impact on the tails. Furthermore, these plots can be
made from the data
directly by histogramming their generation probabilities and the
entropy of such a
distribution will again be the mean⁶.
⁶Please note, when using data sequences we should technically say that the 'entropy' computed as the mean of the distribution is a cross entropy. For the non-productive sequences we largely focus on in this thesis this is a negligible distinction. However, for in-frame productive
Figure 2.1: Distribution functions: P(−E = log Pgen). Shows the distribution of generation probabilities over 3 different state spaces of the same human TRB model, highlighting the 'coarse graining' of the model from recombination events, to nucleotide sequences, and finally to amino acid sequences/receptors. The dotted lines indicate the mean of each distribution, which is mathematically equivalent to the negative of the entropy of each distribution. The entropy of the distributions decreases as they get more coarse grained.
2.3 Inference
The data which is used to infer these models comes from
high-throughput Illumina
sequencing [Robins et al., 2010a] and is organized as a
collection of DNA sequences
of around 60-200 base pairs. We will want to infer the
parameters of the generative
model that most accurately reflect the sequences observed in the
experiment. Without
a principled prior that significantly biases the distribution (note, the Jeffreys prior is remarkably flat for these generative models), the parameters are inferred by way of
sequences this is not an irrelevant concern, as the distributions are noticeably skewed towards higher generation probabilities due to somatic selection. See Elhanati et al. [2014] for a discussion of somatic selection and the statistical effects on the distribution. We are a little sloppy and always refer to this quantity as the entropy of the distribution, even if it is technically a cross entropy at times.
maximum likelihood estimation. Given a collection of observed
DNA sequences S and
a generative model determined by parameters θ ∈ Θ, we want to
infer the estimated
parameters θ̂:
$$\hat{\theta} = \arg\max_{\theta} L(\theta; S) = \arg\max_{\theta} p(S|\theta) = \arg\max_{\theta} \prod_{\text{seq} \in S} P_{\text{gen}}(\text{seq}|\theta) \qquad (2.18)$$
as the sequences in S are assumed to be independently
generated.
In order to properly infer the parameters of a V(D)J model we
must be careful to
only use sequences that are statistically representative of the
V(D)J recombination
machinery itself and are not skewed by any selective process or
somatic population
dynamics. This is a real worry as not only could clonal
expansion overrepresent
specific sequences, but functional receptors are systematically
biased away from the
underlying V(D)J generative distribution due to their
involvement in the immune
system function (this is explored in Elhanati et al. [2014]).
Fortunately, as discussed
in section 1.2, V(D)J recombination does not always produce
inframe, productive
sequences with each recombination event. As a result, the DNA
sequence datasets
we analyze contain a significant fraction of sequences we know
must be nonproduc-
tive/nonfunctional because they are frame shifted (out of frame)
or contain a stop
codon. These sequences can never be expressed and therefore
should experience no
selective pressures. Thus, to ensure a statistically unbiased
sample, we filter our sam-
ple for only unique, nonproductive sequences. Filtering for
unique sequences removes
the influence of clonal dynamics and expansion, whereas
filtering for nonproductive
sequences removes any selection effects.
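The filtering step can be sketched as a simple predicate on CDR3 nucleotide sequences: a read is flagged nonproductive if it is out of frame or contains a stop codon (the example sequences below are hypothetical):

```python
# Minimal sketch: flag CDR3 nucleotide sequences that cannot be productive,
# i.e. out-of-frame reads or in-frame reads containing a stop codon.
STOP = {"TAA", "TAG", "TGA"}

def is_nonproductive(cdr3_nt):
    if len(cdr3_nt) % 3 != 0:              # frame shift
        return True
    codons = [cdr3_nt[i:i + 3] for i in range(0, len(cdr3_nt), 3)]
    return any(c in STOP for c in codons)  # premature stop codon

print(is_nonproductive("TGTGCCAGC"))   # False: in frame, no stop codon
print(is_nonproductive("TGTGCCAG"))    # True: length not a multiple of 3
print(is_nonproductive("TGTTAGAGC"))   # True: contains stop codon TAG
```

In practice the productivity call also depends on the read's alignment to the germline templates, but the frame and stop-codon checks capture the two failure modes described above.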
The generative models described (Eq. 2.3, Eq. 2.4) are defined
over the space
of recombination events which are ‘hidden’ in the sense that
there are many, many recombination events that can lead to a particular DNA sequence
and there is no way
to determine which one actually occurred. In order to infer the
parameters of such a
model, a classic iterative learning algorithm, expectation
maximization (EM), is used
which ensures that a local maximum in likelihood is achieved
(proof in Appendix C).
2.3.1 Errors and Mismatches
Each recombination event e = (V,D, J, dV , dD, d′D, dJ , {mi},
{ni}) generates a specific
DNA sequence. However, it is possible that when this gene was sequenced the recorded nucleotides do not match up perfectly with the sequence generated by e.
This mismatch could indicate a sequencing error in the
experiment or, in the case of
B cells, could be the result of hypermutations (this will be
discussed in much greater
detail in Section 4.2). We will need to account for such
mismatches or errors in order
to properly infer the parameters of the generative model. To do
this we introduce
an error/mismatch model. Formally, given an observed/measured sequence seq_o, we define the observed probabilities as:

$$
P^o_{\text{recomb}}(e, \text{seq}_o) = P_{\text{recomb}}(e)\, P_{\text{mis}}(\text{seq}_o|e), \qquad
P^o_{\text{gen}}(\text{seq}_o) = \sum_{e \in E} P^o_{\text{recomb}}(e, \text{seq}_o) \qquad (2.19)
$$

where P_mis(seq_o|e) is the error/mismatch model whose parameters will be inferred during the EM inference. There are several P_mis(seq_o|e) models used over the course of this work.
No error model
It is useful to first consider a model where no errors or mismatches are allowed. To do this, define P_mis(seq_o|e) = I[e generates seq_o]. Then,

$$
P^o_{\text{recomb}}(e, \text{seq}_o) = \begin{cases} P_{\text{recomb}}(e), & \text{if } e \text{ generates } \text{seq}_o \\ 0, & \text{otherwise} \end{cases} \qquad (2.20)
$$

and

$$
P^o_{\text{gen}}(\text{seq}_o) = \sum_{e \in E} P^o_{\text{recomb}}(e, \text{seq}_o) = \sum_{e|\text{seq}_o} P_{\text{recomb}}(e) = P_{\text{gen}}(\text{seq}_o), \qquad (2.21)
$$

so we recover Pgen from P^o_gen.
Flat error rate
This model assumes that the probability of a mismatched nucleotide between the observed sequence seq_o = {s^o_i} and the sequence generated by recombination event e, seq_e = {s^e_i}, is a flat probability p_m:

$$P_{\text{mis}}(\text{seq}_o|e) = \prod_i \left( p_m\, I[s^o_i \neq s^e_i] + (1 - p_m)\, I[s^o_i = s^e_i] \right) \qquad (2.22)$$
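A minimal sketch of Eq. 2.22, treating each position as an independent Bernoulli mismatch with rate pm:

```python
def p_mis_flat(seq_o, seq_e, pm):
    """Flat error model of Eq. 2.22: independent mismatch probability pm per site."""
    assert len(seq_o) == len(seq_e)
    p = 1.0
    for so, se in zip(seq_o, seq_e):
        p *= pm if so != se else (1.0 - pm)
    return p

# One mismatch in four positions at pm = 0.01: pm * (1 - pm)**3.
print(p_mis_flat("ACGT", "ACGA", 0.01))
```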
Flat error rate, restricted to genomic templates
In practice, it doesn't make much sense to examine mismatches outside of the region of the sequence that is determined by a germline V, D, or J sequence. Define the set of positions Pos_gene where the nucleotides {s^e_i} come from a germline template, and its complement Pos_ins, where the nucleotides come from non-templated insertions. We define a new error model that applies the flat error model to positions Pos_gene and the no-error model to positions Pos_ins:

$$
P_{\text{mis}}(\text{seq}_o|e) = \begin{cases} 0, & \text{if } \exists\, i \in \text{Pos}_{\text{ins}} \text{ s.t. } s^o_i \neq s^e_i \\ \prod_{i \in \text{Pos}_{\text{gene}}} \left( p_m I[s^o_i \neq s^e_i] + (1 - p_m) I[s^o_i = s^e_i] \right), & \text{otherwise} \end{cases} \qquad (2.23)
$$
This is the model that is used most frequently and unless
otherwise stated is the
model that is used for inference purposes.
N-mer context dependent error model
In order to study hypermutations in Section 4.2 we use a
mismatch model where
the mismatch rate is modulated depending on the 7-mer nucleotide
sequence around
the mismatch site. Here we define a general N-mer context model where there are independent energies at each site (i.e. a one-point model):

$$
p_h(i|\text{seq}) = \frac{1}{Z}\, p_{bg}\!\left( s_{i - \lfloor N/2 \rfloor}, s_{i - \lfloor N/2 \rfloor + 1}, \ldots, s_{i + \lfloor N/2 \rfloor} \right) \exp\!\left[ \sum_{k = -\lfloor N/2 \rfloor}^{\lfloor N/2 \rfloor} -E_k(s_{i+k}) \right] \qquad (2.24)
$$

where p_bg(σ) is the background frequency of the N-mer nucleotide sequence σ and the proportionality constant Z is determined by matching the overall mismatch rate (i.e. ⟨p_h⟩ = p_m). As we have the freedom to define the zero of energy for each of the E_k, it is convenient to set Σ_{σ∈{A,C,G,T}} E_k(σ) = 0 to make it transparent whether the nucleotide identity at position k in the N-mer makes a hypermutation mismatch more or less likely.
One may also notice that we did not specify whether seq is seqo
or seqe. Ideally we
would want seq to be the sequence immediately before the
hypermutation occurred
(e.g. if we were constructing an evolutionary tree from
hypermutations we should use
the current node’s sequence as seq). However, for inference
purposes this ambiguity
is functionally irrelevant as choosing either seqo or seqe to be
seq will result in a
negligible difference.
Again, we will want to restrict to mismatches with the germline sequences (to ensure we have identified a hypermutation), so we define:

$$
P_{\text{mis}}(\text{seq}_o|e) = \begin{cases} 0, & \text{if } \exists\, i \in \text{Pos}_{\text{ins}} \text{ s.t. } s^o_i \neq s^e_i \\ \prod_{i \in \text{Pos}_{\text{gene}}} \left( p_h(i|\text{seq}) I[s^o_i \neq s^e_i] + (1 - p_h(i|\text{seq})) I[s^o_i = s^e_i] \right), & \text{otherwise} \end{cases} \qquad (2.25)
$$
2.3.2 Expectation Maximization algorithm
Expectation maximization is implemented by taking an initial
guess (generally ran-
domized) for the parameters and then iterating two different
steps. The first step,
expectation, defines a function which is the expected
log-likelihood over the distribu-
tion of data and hidden variables determined by the data and the
current guess of the
parameters. Explicitly, if θ′ is the current estimation of the
parameters, we define:
$$Q(\theta|\theta') = \langle \log L(\theta; X, Z) \rangle_{Z|X,\theta'} \qquad (2.26)$$
Note, Q(θ|θ′) is still a function of some undetermined parameters θ. This leads to the second step: maximization. To determine the next iteration's parameter estimation we maximize the estimation function:

$$\theta^{(i+1)} = \arg\max_{\theta} Q(\theta|\theta^{(i)}) \qquad (2.27)$$
Repeatedly iterating these steps will monotonically increase
both Q and the full
likelihood function (proof below). Let us be explicit in how
this translates into the
specific scenario of a VDJ generative model. Say we have
(nonproductive) sequences
S, the set of possible recombination events Ωe = {(V, D, J, dV, dD, d′D, dJ, {mi}, {ni})},
and the model structure from Eq. 2.3. Then θ is the collection
of parameters defining
PV, PDJ, PdelV, etc. The expectation step is defined as follows:

$$
Q(\theta|\theta') = \langle \log L(\theta; S, E) \rangle_{E|S,\theta'} = \sum_{\text{seq} \in S} \sum_{e \in E} P^o_{\text{recomb}}(e|\text{seq}, \theta') \log P^o_{\text{recomb}}(e, \text{seq}|\theta) \qquad (2.28)
$$
Now,

$$
P^o_{\text{recomb}}(e|\text{seq}, \theta') = \frac{P^o_{\text{recomb}}(e, \text{seq}|\theta')}{\sum_{e' \in E} P^o_{\text{recomb}}(e', \text{seq}|\theta')} = \frac{P^o_{\text{recomb}}(e, \text{seq}|\theta')}{P^o_{\text{gen}}(\text{seq}|\theta')} \qquad (2.29)
$$
is the fractional contribution of the particular event to the
total Pgen of that sequence.
Plugging P^o_recomb(e|seq, θ′) back in and expanding P^o_recomb(e, seq|θ) we get:

$$
\begin{aligned}
Q(\theta|\theta') = \sum_{\text{seq} \in S} \sum_{e \in E} & \frac{P^o_{\text{recomb}}(e, \text{seq}|\theta')}{P^o_{\text{gen}}(\text{seq}|\theta')} \times \Big[ \log P_V(V(e)) + \log P_{DJ}(D(e), J(e)) \\
& + \log P_{\text{delV}}(d_V(e)|V(e)) + \log P_{\text{delD}}(d_D(e), d'_D(e)|D(e)) + \log P_{\text{delJ}}(d_J(e)|J(e)) \\
& + \log P_{\text{insVD}}(\ell_{VD}(e)) + \log p_0(m_1(e)) + \sum_{i=2}^{\ell_{VD}} \log S_{VD}(m_i(e)|m_{i-1}(e)) \\
& + \log P_{\text{insDJ}}(\ell_{DJ}(e)) + \log q_0(n_{\ell_{DJ}}(e)) + \sum_{i=1}^{\ell_{DJ}-1} \log S_{DJ}(n_i(e)|n_{i+1}(e)) \\
& + \log P_{\text{mis}}(\text{seq}|e) \Big]
\end{aligned}
\qquad (2.30)
$$
We now need to evaluate arg max_θ Q(θ|θ′). As the expansion breaks up into independent pieces, we can deal with them one at a time. First examine the parameters in PV. We want to maximize f(PV) = Q(θ|θ′) conditioned on g(PV) = Σ_V PV(V) − 1 = 0. Naturally, this is done with Lagrange multipliers (∇f = λ∇g). ∇f is readily computed:

$$
\frac{\partial f}{\partial P_V(V_i)} = \frac{\partial Q(\theta|\theta')}{\partial P_V(V_i)} = \frac{\partial}{\partial P_V(V_i)} \sum_{\text{seq} \in S} \sum_{e \in E} \frac{P_{\text{recomb}}(e, \text{seq}|\theta')}{P_{\text{gen}}(\text{seq}|\theta')} \log P_V(V(e)) = \sum_{\text{seq} \in S} \sum_{e \in E} \frac{P_{\text{recomb}}(e, \text{seq}|\theta')}{P_{\text{gen}}(\text{seq}|\theta')} \frac{I[V_i = V(e)]}{P_V(V_i)} \qquad (2.31)
$$
λ∇g is even more straightforward:

$$\lambda \frac{\partial g}{\partial P_V(V_i)} = \lambda \frac{\partial}{\partial P_V(V_i)} \left[ \sum_V P_V(V) - 1 \right] = \lambda \qquad (2.32)$$
So,

$$P_V(V_i) = \frac{1}{\lambda} \sum_{\text{seq} \in S} \sum_{e \in E} \frac{P_{\text{recomb}}(e, \text{seq}|\theta')}{P_{\text{gen}}(\text{seq}|\theta')}\, I[V_i = V(e)] \qquad (2.33)$$
To solve for λ, plug back in to our normalization condition (g(PV) = Σ_V PV(V) − 1 = 0):

$$
\begin{aligned}
g(P_V) = 0 = \sum_V P_V(V) - 1 &= -1 + \frac{1}{\lambda} \sum_{\text{seq} \in S} \sum_{e \in E} \frac{P_{\text{recomb}}(e, \text{seq}|\theta')}{P_{\text{gen}}(\text{seq}|\theta')} \sum_{V_i} I[V_i = V(e)] \\
&= -1 + \frac{1}{\lambda} \sum_{\text{seq} \in S} \sum_{e \in E} \frac{P_{\text{recomb}}(e, \text{seq}|\theta')}{P_{\text{gen}}(\text{seq}|\theta')} \\
&= -1 + \frac{1}{\lambda} \sum_{\text{seq} \in S} \frac{P_{\text{gen}}(\text{seq}|\theta')}{P_{\text{gen}}(\text{seq}|\theta')} = -1 + \frac{1}{\lambda} \sum_{\text{seq} \in S} 1 = -1 + \frac{|S|}{\lambda} \\
&\Rightarrow \lambda = |S|
\end{aligned}
\qquad (2.34)
$$
Finally this gives us the expression for the parameters of PV for the next iteration:

$$P_V(V_i) = \frac{1}{|S|} \sum_{\text{seq} \in S} \sum_{e \in E} \frac{P_{\text{recomb}}(e, \text{seq}|\theta')}{P_{\text{gen}}(\text{seq}|\theta')}\, I[V_i = V(e)] \qquad (2.35)$$
which is just the expectation of that marginal, V gene usage in
this case, over the
data sequences and using the previous iteration’s parameters. It
is easy to show
that the remaining parameters are inferred in an analogous
fashion with the only
caveat being that in the derivation for conditional
distributions you need to use a
normalization condition (and thus another Lagrange multiplier)
for each variable
that the distribution is conditioned on (or do the inference as
a joint distribution).
For example:

$$g(P_{\text{delV}|V_i}) = 0 = \sum_{d'_V} P_{\text{delV}}(d'_V|V_i) - P_V(V_i) \qquad (2.36)$$
Also note that, as the insertion dinucleotide Markov models break up into a similar form, their parameters are inferred in an identical manner (the only caveat being that each recombination event e can contribute more than one term to the sum).
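The update of Eq. 2.35 can be exercised on a toy model small enough to check by hand: two V genes, four hidden events with fixed within-gene weights, and two observable 'sequences' (all numbers are hypothetical). Each EM iteration replaces PV with the posterior-weighted event counts, and the log-likelihood never decreases:

```python
from math import log

# Toy model in the spirit of Eq. 2.35 (all numbers hypothetical):
# event -> (V gene, fixed within-V weight, generated "sequence").
events = {
    "e1": ("V1", 0.7, "sA"), "e2": ("V1", 0.3, "sB"),
    "e3": ("V2", 0.4, "sA"), "e4": ("V2", 0.6, "sB"),
}
data = ["sA"] * 60 + ["sB"] * 40          # observed nonproductive sequences

def pgen(s, P_V):
    # Sum P_V(V) * w over all hidden events e | s (the coarse graining of Eq. 2.5).
    return sum(P_V[V] * w for V, w, seq in events.values() if seq == s)

def log_likelihood(P_V):
    return sum(log(pgen(s, P_V)) for s in data)

P_V = {"V1": 0.5, "V2": 0.5}              # initial guess
prev = log_likelihood(P_V)
for _ in range(100):
    counts = {"V1": 0.0, "V2": 0.0}
    for s in data:                         # E step: posterior over hidden events
        z = pgen(s, P_V)
        for V, w, seq in events.values():
            if seq == s:
                counts[V] += P_V[V] * w / z
    total = sum(counts.values())
    P_V = {V: c / total for V, c in counts.items()}   # M step (Eq. 2.35)
    assert log_likelihood(P_V) >= prev - 1e-9          # likelihood never decreases
    prev = log_likelihood(P_V)
print({V: round(p, 3) for V, p in P_V.items()})  # {'V1': 0.667, 'V2': 0.333}
```

The converged value matches the maximum-likelihood solution for this toy model, which can be derived analytically as PV(V1) = 2/3.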
2.3.3 Implementation
Implementation of the EM algorithm for these V(D)J generative
models is quite
tricky, and requires a large amount of computational power. As
model parameters
are learned from large datasets of∼ 104−105 sequences, there is
a premium on efficient
parallelized code. Sequence alignment, efficient enumeration of
recombination events,
and intelligent organization of data structures are only some of
the challenges. The
story of developing software to infer these parameters belongs
to others and so won’t
be a focus of this thesis. However, I do want to take a moment
to describe and
highlight the work done to make this difficult inference process
possible.
My predecessor, Anand Murugan, was the first to code up and
implement a VDJ
generative model of the form Eq. 2.3 and this was the basis of
the first paper de-
scribing these V(D)J generative models in Murugan et al. [2012].
His MATLAB code
was then later adapted by me to define and infer the models
discussed in Chapter 4.
Despite the success of this MATLAB code, it does require some
expertise to use and
any changes to the model structure must be hard coded.
Recently a collaborator, Quentin Marcou, developed a software
package called
IGoR (Inference and Generation Of Repertoires) in C++ [Marcou et
al., 2018]. IGoR
is constructed in a way that allows the user to easily define
the model structure (i.e.
the factorization) and runs smoothly and quickly. This software
was used to infer the
models discussed/used in chapters 3 and 5. IGoR is publicly
available on GitHub:
https://github.com/qmarcou/IGoR.
Chapter 3
V(D)J recombination to sequences:
Precomb→ Pgen
The previous chapter laid out how a generative V(D)J model can
be constructed and
inferred. However, the generative model is defined over a state
(sample) space of re-
combination events, Ωe, whereas the scientific interest is over
the state (sample) space
of sequences or receptors (both nucleotide and amino acid), and
biological/physical
effects can only take place on the level of the physical protein
structure of the re-
ceptor, i.e. the amino acid sequence (or possibly some coarse
grained version of the
amino acid sequence). As briefly discussed in Section 2.1.4, the V(D)J
model does define the
probability of generating a particular nucleotide or amino acid
sequence by summing
over all recombination events that generate the sequence. This
was summarized in
Eq. 2.5, which we repeat here:
$$P_{\text{gen}}(\text{seq}) = \sum_{e | \text{seq}} P_{\text{recomb}}(e) \qquad (3.1)$$
This summation over recombination events is, in some sense,
‘coarse graining’ the
state (sample) space as we are aggregating many states
(recombination events) into
a new state (a nucleotide or amino acid sequence).
3.0.1 Probability Spaces (mathematical aside)
Formally, this ‘coarse graining’ is just extending probability
spaces. First we define
the sample space of recombination events (Ωe, with σ-algebra
Be), the sample space of
nucleotide CDR3 sequences (Ωnt, with σ-algebra Bnt), and the
sample space of amino
acid CDR3 sequences (Ωaa, with σ-algebra Baa). Note that as each
recombination
event generates a specific nucleotide sequence through the
physical process of V(D)J
recombination, we have the surjective map πv(d)j : Ωe → Ωnt.
Furthermore, as each
(in-frame) nucleotide sequence translates to an amino acid
sequence, we can define the
translation mapping πnt2aa : Ωnt → Ωaa (if we wished to be
pedantic we could keep the
out of frame sequences in Ωaa to ensure that πnt2aa is a
function over the whole sample
space and to maintain the total measure of 1 over Ωaa). In this
notation it is easy to see
that the mapping πv(d)j extends the probability space of V(D)J
recombination events,
(Ωe, Be, Precomb) to the probability space of nucleotide sequences, (Ωnt, Bnt, Pgen,nt), while the mapping πnt2aa extends the probability space of nucleotide sequences to the probability space of amino acid sequences (Ωaa, Baa, Pgen,aa). Our sloppy notation of e|seq can now be understood as either $\pi_{v(d)j}^{-1}(\text{ntseq})$ or $\pi_{v(d)j}^{-1}\big(\pi_{nt2aa}^{-1}(\text{aaseq})\big)$.
3.1 Too many states! The free energy problem
Despite the seeming simplicity of Eq. 2.5, it can prove computationally very problematic because of the number of recombination events that could
generate a particular
sequence. This is the exact same problem that plagues much of
statistical physics –
summing over all states to determine the partition function or a
free energy can prove
to be computationally prohibitive if the only method of doing
the summation is by
enumerating the states. Indeed, $\log(P_{\rm gen})$, a quantity we will look at repeatedly, can even be thought of as a free energy. The reader may remember that this quantity, $P_{\rm gen}$, was required for the EM inference of the previous chapter (Section 2.3.2), so to do any sort
of inference or to construct any sort of probabilistic model of
V(D)J recombination
the problem of enumerating all possible recombination events
must be addressed.
In previous work, and in the inference procedures of Murugan et
al. [2012], and
Marcou et al. [2018], the number of states to be summed over is
controlled through regularization. By regularization we mean that some
procedure is used to
limit the number of recombination events that are considered to
a manageable num-
ber. Fortunately, this is quite possible for nucleotide
sequences. By only considering
gene templates V, (D), and J that have a sufficiently good
alignment (e.g. Smith-
Waterman alignment), capping the number of deletions/insertions,
and having cutoffs
for fractional probabilities and errors, it is feasible to
reduce the number of recom-
bination events that correspond to a nucleotide sequence (i.e.
the notation e|seq) to
the order of thousands or less. This makes it tractable, if
still very computationally
intensive, to compute Pgen for nucleotide sequences. It must be noted that for software attempting to infer V(D)J models of arbitrary structure,
this enumeration of
recombination events is very useful as there are no restrictions
on the correlations it
can consider.
However, this approach of exhaustive enumeration with some
regularization is
computationally intractable for amino acid CDR3 sequences, let
alone any kind of
coarse grained alphabet of amino acids that might be more
interesting functionally.
This can easily be seen from the fact that the number of
possible nucleotide sequences
that translate to a particular amino acid sequence will explode
exponentially with the
number of amino acids in the CDR3 region:
$$|\{\sigma\ {\rm s.t.}\ {\rm nt2aa}(\sigma) = a\}| = \prod_{a_i \in a} \#{\rm codons}|_{a_i} \qquad (3.2)$$
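Concretely, the product in Eq. 3.2 can be evaluated from the codon degeneracies of the standard genetic code. The helper below is a sketch for illustration (the function name is ours, not part of any published software):

```python
from math import prod

# Codon degeneracies of the standard genetic code (61 sense codons in total).
N_CODONS = {'A': 4, 'R': 6, 'N': 2, 'D': 2, 'C': 2, 'Q': 2, 'E': 2,
            'G': 4, 'H': 2, 'I': 3, 'L': 6, 'K': 2, 'M': 1, 'F': 2,
            'P': 4, 'S': 6, 'T': 4, 'W': 1, 'Y': 2, 'V': 4}

def n_coding_nt_seqs(aa_seq):
    """Eq. 3.2: number of nucleotide sequences translating to aa_seq."""
    return prod(N_CODONS[a] for a in aa_seq)

# Even the 4-residue motif CASS is already coded by 2 * 4 * 6 * 6 = 288
# distinct nucleotide sequences.
```

Since most amino acids contribute a factor of 2 to 6, this product grows roughly geometrically with CDR3 length, which is the exponential explosion described above.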
To put some perspective on these numbers, the average number of
nucleotide
sequences that code for a mouse TRB CDR3 amino acid sequence is
∼ 2 billion
— and mouse TRB CDR3 sequences are significantly shorter than
human TRB or
IGH. Even the heavily optimized and efficient IGoR software
developed to do V(D)J
generative model inference [Marcou et al., 2018], which can
compute the Pgen of
around 60 nucleotide sequences per CPU second, would take around
8500 CPU hrs to
compute the Pgen of a single mouse TRB amino acid sequence. This
is prohibitively
long if there is interest in analyzing repertoire datasets that
can easily be of the order
of 105 unique sequences or larger. For this reason, much of the
early work in this
field, and in this thesis, was restricted to the analysis of
nucleotide sequences.
While computing Pgen for amino acid sequences by way of
enumerating recombi-
nation events is computationally intractable, this is not to say
that the summation
is impossible. In this chapter we present a dynamic programming
algorithm and
software, OLGA (Optimized Likelihood estimate of immunoGlobulin
Amino-acid se-
quences, available at https://github.com/zsethna/OLGA), that
efficiently computes
Pgen not only for amino acid CDR3 sequences, but also in-frame nucleotide sequences as
well as sequences composed of coarse grained/ambiguous amino
acid alphabets and
motifs. Indeed, OLGA can sum over all possible recombination
events of a mouse
TRB model in seconds (and can compute Pgen for around 50 mouse TRB amino acid sequences per CPU second). This work is detailed in the paper
Sethna et al. [2018].
This algorithm, however, does require V(D)J generative models of the form of Eq. 2.3 or
2.4, and so loses the flexibility of being able to consider
arbitrary model correlations.
The ability to compute Pgen on an amino acid and functional
receptor level will
likely prove to be extremely useful, and we explore some example
applications.
3.2 Dynamic Programming
OLGA is an algorithm that leverages ‘dynamic programming’ to
avoid enumerating an
exponentially large number of states. Rather than give a formal
definition of dynamic
programming, we show an example. Fortunately, physicists are
already familiar with
one of the cleanest examples of dynamic programming, and one
that truly shows the
computational effectiveness of such a technique: the discretized
path integral. If we
have position x with N possible locations, discretized time t,
and a Markov transition
matrix Rt(xi → xj) (which may depend on time), we can ask what
is the probability
of starting at position x0 and ending at position xT at time T .
If we define the
function
$$P_t(x_0, x_i) = \sum_{\{x_0, x(1), x(2), \dots, x(t-1), x_i\}} \; \prod_{t'=0}^{t-1} R_{t'}\big(x(t') \to x(t'+1)\big) \qquad (3.3)$$
we want PT (x0, xT ). Now, one could list out all the paths that
start at x0 and end
at xT , compute their weights, and sum. However, the number of
paths increases
exponentially with $t$, so the computation time would explode as $O(T \times N^{T-1})$ ($T$ operations on each of the $N^{T-1}$ paths). Instead, it is
computationally much
more efficient to sum up all the path weights to each position,
at each time step and
then update. In other words, we notice this recursion
relation:
$$\begin{aligned}
P_{t+1}(x_0, x_i) &= \sum_{\{x_0, x(1), \dots, x(t-1), x(t), x_i\}} \; \prod_{t'=0}^{t} R_{t'}\big(x(t') \to x(t'+1)\big) \\
&= \sum_{x(t)} R_t\big(x(t) \to x_i\big) \sum_{\{x_0, x(1), \dots, x(t-1), x(t)\}} \; \prod_{t'=0}^{t-1} R_{t'}\big(x(t') \to x(t'+1)\big) \\
&= \sum_{x(t)} R_t\big(x(t) \to x_i\big)\, P_t(x_0, x(t))
\end{aligned} \qquad (3.4)$$
This can be written in a vectorized notation by writing Pt(x0,x)
as a column vector
with elements Pt(x0, xi):
$$P_{t+1}(x_0, \mathbf{x}) = R_t P_t(x_0, \mathbf{x}) \;\Rightarrow\; P_T(x_0, \mathbf{x}) = R_{T-1} R_{T-2} \cdots R_1 R_0\, P_0(x_0, \mathbf{x}) \qquad (3.5)$$
where $P_0(x_0, \mathbf{x}) = I(x_0)$ is the indicator vector of the starting position. Thus, solving for $P_T(x_0, x_T)$ by using dynamic programming would require $O(T \times N^2)$ operations, a massive speedup from the $O(T \times N^{T-1})$ operations of the exhaustive enumeration of the paths. We have
turned the summation
over all individual microstates (i.e. the paths) into a matrix
expression with steps in
time. The algorithm, OLGA, that we developed to compute Pgen of
nucleotide and
amino acid sequences from a generative model will analogously
reduce the exponen-
tial blowup of exhaustive enumeration of recombination events
down to polynomial
time by summing over matrix expressions based on positions in
the sequence read.
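This speedup is easy to verify numerically. The sketch below is a self-contained toy (randomly generated transition matrices, not part of OLGA) that computes $P_T(x_0, x_T)$ both by exhaustive path enumeration and by the matrix recursion of Eq. 3.5, and checks that they agree:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 6  # number of positions, number of time steps

# Random Markov transition matrices R_t[i, j] = R_t(x_i -> x_j); rows sum to 1.
Rs = [r / r.sum(axis=1, keepdims=True) for r in rng.random((T, N, N))]

x0, xT = 0, 2

# Exhaustive enumeration: O(T * N^(T-1)) over all intermediate paths.
brute = 0.0
for path in itertools.product(range(N), repeat=T - 1):
    full = (x0,) + path + (xT,)
    w = 1.0
    for t in range(T):
        w *= Rs[t][full[t], full[t + 1]]
    brute += w

# Dynamic programming (Eq. 3.5): O(T * N^2) matrix-vector products.
P = np.zeros(N)
P[x0] = 1.0            # P_0(x0, x) is the indicator of the start position
for t in range(T):
    P = P @ Rs[t]      # P_{t+1} = P_t R_t (row-vector convention)

assert abs(P[xT] - brute) < 1e-10
```

Even at this tiny size the enumeration visits $4^5 = 1024$ paths, while the recursion performs only six $4 \times 4$ matrix-vector products.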
3.3 OLGA
We now describe how OLGA computes Eq. 2.5 without summing over
exhaustively
enumerated recombination events, using dynamic programming. This
algorithm re-
quires specific tailoring to the model structure as the
correlations have to be built
in explicitly, so the algorithm is slightly different for
generative models of VDJ
(TCRβ/IGH, Eq. 2.3) and VJ (TCRα/IGL, Eq. 2.4) recombination. We
will first
present the VDJ algorithm, and give the simpler algorithm for
generative models of
VJ recombination afterwards.
Each recombination event implies an annotation of the amino acid
CDR3 sequence,
(a1, . . . , aL), assigning a different origin to each
nucleotide position (one of V, N1, D,
N2, or J, where N1 and N2 are the non-templated VD and DJ
insertion segments,
respectively) that parses the sequence into 5 contiguous
segments (see schematic in Fig. 3.1).
The core principle of the method is to sum over possible
nucleotide locations of
the 4 boundaries between the 5 segments, x1, x2, x3, and x4, but
in a recursive way
using matrix operations. This can be summarized in a compact
matrix expression:
$$P_{\rm gen}(a_1, \dots, a_L) = \sum_{x_1, x_2, x_3, x_4} V_{x_1} M^{x_1}_{\;x_2} \sum_D \left[ D^{(D)\,x_2}_{\;x_3}\, N^{x_3}_{\;x_4}\, J^{(D)\,x_4} \right]. \qquad (3.6)$$
Figure 3.1: CDR3 indexing cartoon. Boxes correspond to nucleotides and are indexed by integers. Each group of three boxes (identified by heavier boundary lines) corresponds to an amino acid. The nucleotide positions $x_1, \dots, x_4$ identify the boundaries between different elements of the partition. The $V$, $M$, $D^{(D)}$, $N$, and $J^{(D)}$ matrices define cumulated weights corresponding to each of the 5 elements. (In the cartoon, position $x_1 = 11$ lies in amino acid $a_4$, i.e. $i_1 = 4$, with $u_1 = 2$ and $u_1^* = 1$.)
However, to do this, we will need to define objects that
accumulate the probabil-
ities of events from the left of a position x (i.e. up to x) and
the right of x (i.e. from
$x+1$ on), which will require some notation.
3.3.1 Notation, 3′ and 5′ vectors
Suppose we have a CDR3 ‘amino acid’ sequence a = (a1, . . . ,
aL). By ‘amino acid’
sequence, we mean that each of the ‘amino acids’, $a_i$, corresponds to some collection of
nucleotide triplets, or codons. We allow this mapping between
‘amino acids’, a, and
codons to be arbitrary at this point, and use the notation σ ∼ a
if the codons in the
nucleotide sequence σ correspond to the codons allowed by the
amino acid sequence
a. This will allow us not only to recover the standard
nucleotide translation map-
ping, πnt2aa, when using the standard amino acid alphabet (e.g.
TGTGCCAGCAGT
∼ πnt2aa(TGTGCCAGCAGT) = CASS), but also provides a trivial
extension to in-
clude in-frame nucleotide sequences (define an ‘amino acid’
symbol for each individual
codon) as well as coarser grained collections of amino acids.
For example, all codons
that code for amino acids with a common chemical property, e.g.
hydrophobicity or
charge, could be grouped into a single ‘amino acid’. In that
formulation, (a1, . . . , aL)
would correspond to a sequence of symbols denoting that
property. This could prove
to be very useful in constructing and assessing future coarse
grained models of receptor-epitope affinities.
It will simplify the later expressions to be able to refer to a
position x not only
by its nucleotide index, but by the corresponding amino acid
index i as well as what
position x is in the codon reading from 5 ′ to 3 ′ (u) and what
position x + 1 is in a
codon reading from 3 ′ to 5 ′ (u∗). This is shown graphically in
Fig. 3.1. Explicitly, for
position xj:
$$i_j = \left\lceil \frac{x_j}{3} \right\rceil, \qquad u_j = x_j - 3(i_j - 1), \qquad u^*_j = 3 - {\rm mod}(u_j, 3) \qquad (3.7)$$
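Eq. 3.7 is easy to implement directly; the helper below (a hypothetical function, for illustration only) reproduces the indices annotated in Fig. 3.1 for nucleotide position $x_1 = 11$:

```python
import math

def codon_indices(x):
    """Eq. 3.7: amino acid index i, codon position u of x (reading 5' to 3'),
    and codon position u* of x+1 (reading 3' to 5')."""
    i = math.ceil(x / 3)
    u = x - 3 * (i - 1)
    u_star = 3 - (u % 3)
    return i, u, u_star

# Fig. 3.1: x1 = 11 sits in amino acid a4 at codon position u1 = 2, u1* = 1.
assert codon_indices(11) == (4, 2, 1)
# The only (u, u*) combinations that occur are (1, 2), (2, 1), and (3, 3).
```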
It is also crucial to introduce what we will call ‘5′ vectors’ and ‘3′ vectors’. A 5′ vector, denoted with a subscript (e.g. $X_x$), accumulates weights for the sequence to the 5′ (left) side of $x$ (including the nucleotide position $x$), whereas a 3′ vector, denoted with a superscript (e.g. $Y^y$), reflects the weights for the sequence to the 3′ (right) side of $x$ (excluding the nucleotide position $x$). Because we are dealing
with amino-acids,
which are encoded with codons made of 3 nucleotides, we need to
keep track of
weights by the identity of the nucleotides at the beginning or
the end of the codon.
This requires the definition of a 5 ′ vector (3 ′ vector) to
depend on the value of u (u∗).
For the first nucleotide position in a codon, $u = 1$ ($u^* = 1$), $X_x$ ($Y^x$) must be interpreted as a row (column) vector of 4 numbers indexed by $\sigma = A, T, G,$ or $C$, corresponding to the cumulated probability weight from the 5′/left (3′/right) side that the nucleotide at position $x$ ($x+1$) takes value $\sigma$. If $u = 2$ ($u^* = 2$), then $X_x$ ($Y^x$) is also a row (column) vector of 4 numbers indexed by nucleotide $\sigma = A, T, G,$ or $C$, but with a different interpretation: it corresponds to the cumulated probability up to position $x$ from the 5′/left side ($x+1$ from the 3′/right), with the additional constraint that the nucleotide at the last position in the codon, $x+1$ ($x$), can take value $\sigma$ (the value is 0 otherwise). Lastly, if $x$ ($x+1$) is the last position in a codon, i.e. $u = 3$ ($u^* = 3$), the cumulative sequence terminates at the end of a codon and we do not keep nucleotide information, so $X_x$ ($Y^x$) is a scalar.
If we have a 5′ vector $X_x$ that contains the accumulated weights up to position $x$, and a 3′ vector $Y^x$ that contains the weights from position $x+1$ onwards, we will want to ‘glue’ these sequence contributions together to get the total probability of the sequence. This is indicated by the expression$^1$ $X_x Y^x$, which has a very convenient structure. As the combinations of $u$ and $u^*$ are $(1, 2)$, $(2, 1)$, or $(3, 3)$, we see that the matrix multiplication $X_x Y^x$ is one of two situations. First, if $u = u^* = 3$, $X_x Y^x$ is just scalar multiplication of the aggregate weights for the 5′ and 3′ sides. If $(u, u^*) = (1, 2)$ or $(2, 1)$, then $X_x Y^x$ is the dot product between a vector of weights indexed by the nucleotides needed to complete the codon and the vector of weights indexed by the completing nucleotide on the other side. In either case, the result is the total aggregate weight of the sequence conditioned on the partition $x$, accurately reflecting the weight of ‘gluing’ the possible sequences from the 5′/left side to the 3′/right side.
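A minimal numerical sketch (with made-up weights) illustrates the two gluing cases: a dot product over the four completing nucleotides when $(u, u^*) = (1, 2)$ or $(2, 1)$, and plain scalar multiplication when $u = u^* = 3$:

```python
import numpy as np

# (u, u*) = (1, 2): X_x is a 1x4 row of 5' weights and Y^x is a 4x1 column of
# 3' weights, both indexed by the nucleotide identity that completes the
# shared codon (order A, T, G, C). Hypothetical weights:
X = np.array([0.1, 0.2, 0.3, 0.4])
Y = np.array([0.5, 0.0, 0.25, 0.25])
glued = X @ Y  # contraction over nucleotide identity, not over position x

# u = u* = 3: both sides end on a codon boundary, so X_x and Y^x are scalars
# and gluing is ordinary multiplication of the two aggregate weights.
X3, Y3 = 0.7, 0.6
total = X3 * Y3
```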
This notion of sequence gluing also allows for the definition and interpretation of matrices (e.g. $R^x_{\;y}$) with both 5′ and 3′ indices. A matrix $R^x_{\;y}$ can be thought of as ‘gluing’ a new sequence segment ($x$ to $y$) onto what an existing 5′ or 3′ vector describes.

$^1$Please note that the resemblance of the expression $X_x Y^x$ to a contraction over the position $x$ in Einstein notation should not be misinterpreted. The ‘contraction’ is over possible nucleotide identity indices, not over the position index $x$.

For example:
$$X_x R^x_{\;y} = H_y, \qquad R^x_{\;y} Y^y = G^x \qquad (3.8)$$
The matrix $R^x_{\;y}$ can map from any value of $u$ to any other (or any value of $u^*$ to any other), and so has 9 possible combinations/interpretations based on the $u$ mapping, and can be a $4\times4$, $4\times1$, $1\times4$, or $1\times1$ matrix as a result.
3.3.2 VDJ recombination: V, M, D, N, and J
Eq. 3.6 shows the summation over positions of a matrix expression, with the vectors/matrices corresponding to different VDJ contributions. The 5′ vector $V_{x_1}$ corresponds to a cumulated probability of the V segment finishing at position $x_1$; matrix $M^{x_1}_{\;x_2}$ is the probability of the VD insertion extending from $x_1 + 1$ to $x_2$; $N^{x_3}_{\;x_4}$ is the same for DJ insertions; matrix $D^{(D)\,x_2}_{\;x_3}$ corresponds to weights of the D segment extending from $x_2 + 1$ to $x_3$, conditioned on the D germline choice being $D$; the 3′ vector $J^{(D)\,x_4}$ gives the weight of J segments starting at position $x_4 + 1$, conditioned on the D germline being $D$. This D dependency is necessary to account for the dependence between the D and J germline segment choices [Murugan et al., 2012]. All the defined vectors and matrices depend on the amino acid sequence $(a_1, \dots, a_L)$, but we leave this dependency implicit to avoid making the notation too cumbersome.
The entries of the vectors/matrices corresponding to the germline segments, $V$, $D^{(D)}$, and $J^{(D)}$, can be calculated by simply summing over the probabilities of different germline segments compatible with the sequence $(a_1, \dots, a_L)$, with conditions on deletions to achieve the required segment length. The $\sim$ sign is generalized to incomplete codons so that it returns a true value if there exists a codon completion that agrees with the sequence $a$.
V contribution: $V_{x_1}$
The 5′ vector, $V_{x_1}$, aggregates the weights ($P_{\rm V}$ and $P_{\rm delV}$) from sequences originating from the templated V genes, from the start of the CDR3 region up to position $x_1$. As a 5′ vector, $V_{x_1}$ can be a $1\times1$ or $1\times4$ matrix depending on $u_1$. $s^V$ is the sequence of the V germline gene (read 5′ to 3′) from the conserved residue (generally the cysteine C) to the end of the gene, plus the maximum number of reverse complementary palindromic insertions appended to the 3′ end. $l_V$ is the length of $s^V$.
$$\begin{aligned}
V_{x_1}(\sigma) &= \sum_V P_{\rm V}(V)\, P_{\rm delV}(l_V - x_1 | V)\, I(s^V_{x_1} = \sigma)\, I(s^V_{1:x_1} \sim a_{1:i_1}) && {\rm if}\ u_1 = 1, \\
V_{x_1}(\sigma) &= \sum_V P_{\rm V}(V)\, P_{\rm delV}(l_V - x_1 | V)\, I\big((s^V_{1:x_1}, \sigma) \sim a_{1:i_1}\big) && {\rm if}\ u_1 = 2, \\
V_{x_1} &= \sum_V P_{\rm V}(V)\, P_{\rm delV}(l_V - x_1 | V)\, I(s^V_{1:x_1} \sim a_{1:i_1}) && {\rm if}\ u_1 = 3.
\end{aligned} \qquad (3.9)$$
N1 contribution: $M^{x_1}_{\;x_2}$
This matrix includes the weights ($P_{\rm insVD}$, $p_0$, and $\prod S_{\rm VD}(m_i|m_{i-1})$) from the glu