General comments, TODOs, etc.
- Think up a catchy title.
- Decide which references to anonymize to maintain double-blindness.
- Write the abstract.
- Fill in remaining [CITE], which are cases where it is unclear to me what paper should be cited. The NIPS style guide allows for only one page of citations; the font size has been reduced as much as permissible.
- Given the space constraints, I’m inclined to think that we should not include a future work section in the paper.
Identifying Tandem Mass Spectra using Dynamic Bayesian Networks

Anonymous Author(s)
Affiliation
Address
email
Abstract
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
1 Introduction
Tandem mass spectrometry, a.k.a. shotgun proteomics, is an increasingly accurate and efficient technology for identifying and quantifying proteins in a complex biological sample, such as a drop of blood. This technology has been used to identify biomarkers associated with disease [CITE], and to quantify changes in protein expression across different cell types [CITE]. Most applications of tandem mass spectrometry require the ability to accurately map a fragmentation spectrum generated by the device to a peptide, a protein subsequence, which generated the spectrum. The task of mapping spectra to peptides is known as spectrum identification, a pattern recognition task akin to speech recognition. In speech recognition, the input is an utterance, which must be mapped to a sentence in natural language, an enormous structured class of labels. A spectrum is akin to an acoustic utterance; a peptide is akin to a sentence, a sequence of amino acids instead of words. Unlike speech recognition, (i) accurate labelled data, ground truth peptide-spectrum matches, cannot be acquired; (ii) the scoring function for peptide-spectrum matches has traditionally been a non-probabilistic function, while probabilistic approaches have become dominant in speech; (iii) the optimization used to identify the best peptide match requires enumerating and scoring all candidate peptides against a spectrum.
In this work, we introduce a dynamic Bayesian network (DBN) that generalizes one of the most popular scoring functions for peptide identification (Section 4). Our probabilistic formulation, which we call Didea, provides new insight into a technique that has been used in computational biology for over 17 years. Didea provides a new function for scoring peptide-spectrum matches that significantly outperforms existing scoring functions, including those used in expensive commercial tools for peptide identification. We further show that additional qualitative knowledge about peptide fragmentation can be easily incorporated into the model, leading to further improvements in identification accuracy.
A fundamental computational constraint in current approaches to spectrum identification is the dependence on peptide database search. The best peptide match is found by exhaustively scoring a large list of candidate peptides against the spectrum. In speech recognition, database search would be analogous to decoding an utterance by scoring every common sentence in the English language against the utterance, picking the highest scoring match. In Section 5, we extend Didea with lattices, a compressed representation of sequences, common in speech and language processing [CITE]. Lattices find novel use here: allowing us to replace an exhaustive enumeration with dynamic programming over peptide sequences.
Figure 1: (A) Schematic of a typical shotgun proteomics experiment. The three steps—(1) cleaving proteins into peptides, (2) separation of peptides using liquid chromatography, and (3) tandem mass spectrometry analysis—are described in the text. (B) A sample fragmentation spectrum, along with the peptide (PTPVSHNDDLYG) responsible for generating the spectrum. Peaks corresponding to prefixes and suffixes of the peptide are colored red and blue, respectively. By convention, prefixes are referred to as b-ions and suffixes as y-ions.
Jeff: [Note, while lattices are common in the speech world, outside of speech they might be confusable with, say, Birkhoff lattices. We might want to add a bit of text in the above saying that lattices, in this context, are a linear-sized representation of an exponential number of sequences, and can be seen as a sequential analogue of, say, binary decision diagrams. BTW, also, one option for the extended version is to, say, define a macro that is a comment for the main version but includes the text for the extended version, so then we can have one .tex file for both submission and supplement.]
Jeff: [One other comment here. I think this reads well, but we then immediately go on to describe shotgun proteomics. Perhaps in the intro offer up a few more details of the model and what enables it to achieve such good performance. The reason is that, otherwise, people might be left wondering.]
2 Tandem Mass Spectrometry
Experimental framework A typical shotgun proteomics experiment proceeds in three steps, as illustrated in Figure 1. The input to the experiment is a collection of proteins, which have been isolated from a complex mixture. Each protein can be represented as a string of amino acids, where the alphabet is size 20 and the proteins range in length from 50–1500 amino acids. A typical complex mixture may contain a few thousand proteins, ranging in abundance from tens to hundreds of thousands of copies.
In the first experimental step, the proteins are digested into shorter sequences (peptides) using a molecular agent called trypsin. To a first approximation, trypsin cleaves each protein deterministically at all occurrences of “K” or “R” unless they are followed by a “P”. This digestion is necessary because whole proteins are too massive to be subject to direct mass spectrometry analysis without using very expensive equipment. Second, the peptides are subjected to a process called liquid chromatography, in which the peptides pass through a thin glass column that separates the peptides based on a particular chemical property (e.g., the hydrophobicity). This separation step reduces the complexity of the mixtures of peptides going into the mass spectrometer. The third step, which occurs inside the mass spectrometer, involves two rounds of mass spectrometry. Approximately every second, the device analyzes the population of approximately 20,000 intact peptides that most recently exited from the liquid chromatography column. Then, based on this initial analysis, the machine selects five distinct peptide species for fragmentation. Each of these fragmented species is subjected to a second round of mass spectrometry analysis. The resulting “fragmentation spectra” are the primary output of the experiment.
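The cleavage rule stated above is simple enough to sketch. The helper below is purely illustrative (not part of the paper's pipeline) and implements the first-approximation “cleave after K or R unless followed by P” rule:

```python
def trypsin_digest(protein):
    """In silico tryptic digest sketch: cleave after every 'K' or 'R'
    unless the next residue is 'P' (the approximation in the text)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR" and (i + 1 == len(protein) or protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # trailing peptide, if any
    return peptides
```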
A sample fragmentation spectrum is shown in Figure 1B. During the fragmentation process, each amino acid sequence is typically cleaved once, so cleavage of the population results in a variety of observed prefix and suffix sequences. Each of these subpeptides is characterized by its mass-to-charge ratio (m/z, shown on the horizontal axis) and a corresponding intensity (unitless, but roughly proportional to abundance, shown on the vertical axis). The input to the spectrum identification problem is one such fragmentation spectrum, along with the observed (approximate) mass of the intact peptide. The goal is to identify the peptide sequence that was responsible for generating the spectrum.
Solving the spectrum identification problem In practice, the spectrum identification problem can be solved in two different ways: either de novo, in which the universe of all possible peptides is considered as candidate solutions, or by restricting the search space to a given peptide database. Because high throughput DNA sequencing can provide a very good estimate of the set of possible peptide sequences for most commonly studied organisms, and because database search typically provides more accurate results than de novo approaches, we focus on the database search version of the problem in this paper.
The first computer program to use a database search procedure to identify fragmentation spectra was SEQUEST [7], and SEQUEST’s basic algorithm is still used by essentially all database search tools available today. John: [cite: Sadygov2004] Bill: [Do we really need a cite for the previous sentence? If so, then we have to use something more recent than 2004. I vote to delete this cite] The approach is as follows. We are given a spectrum S, a peptide database P, a precursor mass m (i.e., the measured mass of the intact peptide), and a precursor mass tolerance δ. The algorithm extracts from the database all peptides whose mass lies within the range [m − δ, m + δ]. These comprise the set of candidate peptides C(m, P, δ) = {p : p ∈ P ; |m(p) − m| < δ}, where m(p) is the calculated mass of peptide p. In practice, depending on the size of the peptide database and the precursor mass tolerance, the number of candidate peptides ranges from hundreds to hundreds of thousands. Each candidate peptide p is used to generate a theoretical spectrum s(p), and the theoretical spectrum is compared to the observed spectrum using a score function K(·, ·). The program reports the candidate peptide whose theoretical spectrum scores most highly with respect to the observed spectrum: arg max_{p∈C(m,P,δ)} K(S, s(p)).
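The database-search loop just described can be sketched as follows; `score` stands in for the generic K(·, ·), and all names here are illustrative rather than taken from any tool:

```python
def best_match(spectrum, peptides, mass_of, m, delta, score):
    """Database-search skeleton: filter candidates by precursor mass,
    then return the argmax-scoring peptide for this spectrum."""
    # C(m, P, delta) = {p in P : |m(p) - m| < delta}
    candidates = [p for p in peptides if abs(mass_of[p] - m) < delta]
    # arg max over candidates of K(S, s(p))
    return max(candidates, key=lambda p: score(spectrum, p))
```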
In this work, we compare the performance of Didea to two widely used search programs, SEQUEST and Mascot [11], as well as to a less commonly used but methodologically related method, PepHMM [15]. These three methods differ primarily in their choice of score function K(·, ·). Describing the details of SEQUEST’s score function, XCorr, is beyond the scope of this paper, but the basic idea is to compute a scalar product of the observed and theoretical spectrum and then subtract out an “average” scalar product term that is produced by shifting the observed and theoretical spectrum relative to one another: XCorr(S, s(p)) = ⟨S, s(p)⟩ − (1/150) ∑_{τ=−75}^{75} ∑_{i=1}^{N} S_i s(p)_{i−τ}. Mascot is a commercial product that uses a probabilistic scoring function to rank candidate peptides, the details of which have not been published. PepHMM first generates a theoretical spectrum, akin to SEQUEST’s. The probability that the peaks in the theoretical spectrum occurred in the observed spectrum is then calculated using a hidden Markov model (HMM), and the candidate peptide is assigned a score based on the confidence of this probability, which is measured using an estimated normal distribution over the peptide masses within ±δ of the precursor mass.

The spectrum identification problem is difficult to solve primarily because of noise in the observed spectrum. In general, the x-axis of the observed spectrum is known with relatively high precision. However, in any given spectrum, many expected fragment ions will fail to be observed, and the spectrum is also likely to contain a variety of additional, unexplained peaks. These unexplained peaks may result from unusual fragmentation events, in which small molecular groups are shed from the peptide during fragmentation, or from contaminating molecules (peptides or other small molecules) that are present in the mass spectrometer along with the target peptide species.
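XCorr’s “dot product minus average shifted dot product” idea can be sketched directly from the formula above. Normalizing by the 2M nonzero shifts is an assumption on our part; SEQUEST’s exact convention may differ:

```python
def xcorr(S, s_p, M=75):
    """Sketch of XCorr: <S, s(p)> minus the average dot product over
    shifted copies of the theoretical spectrum s(p). The normalization
    (mean over the 2*M nonzero shifts) is an assumption."""
    N = len(S)

    def shifted_dot(tau):
        # <S, s(p)> with s(p) shifted by tau bins; out-of-range bins
        # contribute nothing.
        return sum(S[i] * s_p[i - tau] for i in range(N) if 0 <= i - tau < N)

    background = sum(shifted_dot(t) for t in range(-M, M + 1) if t != 0)
    return shifted_dot(0) - background / (2 * M)
```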
3 Evaluation Metrics
Jeff: [Do we want this here? Most ML papers put the evaluation methodology just before the results section, and the first thing that is done is the intro, motivation, background literature and alternative approaches, and then the new approach (i.e., the new probabilistic method). Then the results (and methodology) go at the end. Putting the evaluation section here might surprise the reviewer.]
Labelled data for spectrum identification would consist of a set of ground truth peptide-spectrum matches: spectra where the mapping to a peptide is known. Unfortunately, accurate labelled data does not exist in this domain, which complicates evaluation. To estimate the probability that a spectrum identification is false, we therefore make use of the standard decoy-target approach [6, 9]. For each spectrum, two searches are performed: one to find the best peptide in the target database C(m, P, δ). Then, a second search is performed to find the best peptide in a decoy database C(m, P̃, δ): a set of plausible peptides where it is extremely improbable that the correct peptide is contained in it. In our experiments, the target P and decoy P̃ databases are the same size, with decoys being generated by randomly permuting peptides in the target database, under the requirement that P ∩ P̃ = ∅.

A single tandem mass spectrometry experiment generates m = O(10^5) spectra. We expect a certain fraction of the identifications to be spurious, and so only the top-k scoring identifications are retained as quality matches; the rest are ignored. False Discovery Rate (FDR) [14] (essentially one minus precision) provides a rule for determining what k should be, given a bound on the expected fraction of spurious identifications among the top-k. To make use of FDR, we first pose the question of whether or not to accept a single spectrum identification as a hypothesis test.
Consider a single spectrum s, searched against the target database C(m(s), P, δ). Denote the peptide scoring function θ : s × p → Θ ⊆ R. When only one spectrum is under consideration, the dependence of θ on s is not shown. Now, θ(p) is itself a random variable. To sample from the distribution of θ(p), we score each peptide in the target database: θ(C) = {θ(p) : p ∈ C(m(s), P, δ)}. Choosing the highest scoring peptide as the proposed match corresponds to the test statistic T(θ(C)) = max(θ(p) : p ∈ C(m(s), P, δ)). Colloquially, the hypothesis test can be expressed in terms of the test statistic. The null hypothesis, H0, is that a peptide matches the spectrum by chance; the alternate hypothesis, H1, is that the peptide generated the spectrum. Formally, the hypothesis test is

H0 : θ(p) ≤ θ0    H1 : θ(p) > θ0,

where θ0 is a user-determined threshold on the score which determines the stringency of the test. As a decision rule, the null hypothesis is rejected if the test statistic T(θ(C)) exceeds a critical value c. Equivalently, the highest scoring peptide match for a spectrum is deemed correct if its score is greater than c.
A single tandem MS experiment leads to m hypotheses. Let V(c) be the number of hypotheses where H0 is incorrectly rejected at critical value c; let R(c) be the number of hypotheses where H0 was rejected. For sufficiently large m, we estimate FDR using

F̂DR(c) = E[V(c)] / E[R(c)].

An estimate of E[V(c)] is the number of spectra where the best decoy match has a score higher than c; an estimate of E[R(c)] is the value of R(c) itself, the number of spectra where the best target match has a score higher than c. The decoy database is only used to estimate the error rate. The above estimate of FDR has an intuitive interpretation: it is one minus precision. Since F̂DR(c) is not necessarily strictly increasing with c, we instead report the estimated q-value [12]:

q̂(c) = min_{t≥c} F̂DR(t).

At a score threshold c, we have q-value q(c) ∈ [0, 1], which is the expected fraction of spurious identifications among those whose score is at least c.
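The target-decoy estimate and the q-value’s running minimum can be sketched as below. This is a simplification: counting conventions for ties and the ≥/> boundary vary across tools, and the ≥ convention here is an assumption:

```python
import bisect

def q_values(target_scores, decoy_scores):
    """Estimate q(c) = min over t >= c of FDR-hat(t), with FDR-hat(c)
    taken as (# decoy scores >= c) / (# target scores >= c)."""
    targets, decoys = sorted(target_scores), sorted(decoy_scores)
    qs, running_min = {}, float("inf")
    # Sweep thresholds from most to least stringent; the running minimum
    # then realizes the min over all t >= c.
    for c in sorted(set(target_scores), reverse=True):
        V = len(decoys) - bisect.bisect_left(decoys, c)    # decoys >= c
        R = len(targets) - bisect.bisect_left(targets, c)  # targets >= c
        running_min = min(running_min, V / R)
        qs[c] = running_min
    return qs
```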
Jeff: [I think most of the equations above are not long and should be inlined, to save space.]
The tradeoff between the number of identifications that are accepted and the stringency of the acceptance criterion is represented as an absolute ranking curve. Each point on the x-axis is a q-value in [0, 1]; the corresponding value on the y-axis is the number of top-scoring spectra whose identification is accepted at that q-value. At q = 1, all m identifications are accepted; at q = 0, no identifications are accepted. In real-world usage, the concern is with maximizing performance at small q-values, so we plot only q ∈ [0, 0.1]. One method dominates another if its absolute ranking curve is strictly above the absolute ranking curve for the other method.
Ajit: [Include a pointer to an absolute ranking plot.]
Ajit: [It’s natural for a machine learning audience to want to view hypothesis testing as a 0/1 classification problem: i.e., assign label 1 to the identifications we want to accept. If we switch from FDR to the positive False Discovery Rate [13], we can draw a connection to Bayes error rates on such a classification problem. However, using Bayes error rate would require the user to control the stringency of the test by setting a parameter that corresponds to the relative importance of a false negative to a false positive, and that is harder to understand than a q-value.]
4 Scoring identifications as inference in a Dynamic Bayesian
Network
In this section, we show that Equation ?? can be generalized Ajit: [this is not strictly what is going on, as we’re not generalizing it, rather we are inspired by it to create a proper probabilistic model] as inference in a DBN (Figure 2). The DBN is based on the mobile proton hypothesis of peptide fragmentation [4], which we describe mathematically below. We provide empirical evidence that our probabilistic scoring function is significantly better than the scoring functions used in commercially developed packages.
Peptide Fragmentation We start at the second phase in tandem mass spectrometry: the protein sequence has been digested, and a peptide has been isolated in the first mass spectrometry step. A peptide is represented as a string p = a_1 a_2 . . . a_n, since our only concern is in decoding a peptide’s sequence. Each letter a_t is drawn from an alphabet of 20 standard amino acids, whose masses are known. The mass function m(·) refers both to the mass of a residue, m(a_t), and to the mass of a sequence of residues, m(p) = ∑_{t=1}^{n} m(a_t).
Peptides are ionized in the second phase of mass spectrometry, so each peptide has a positive charge due to carrying one, two, or three extra protons: c(p) ∈ {1, 2, 3}. Peptides predominantly fragment into a prefix and suffix: b = a_1 . . . a_t, y = a_{t+1} . . . a_n. The extra protons are divided between the prefix and suffix: c(b) + c(y) = c(p). If either b or y has zero charge, it cannot be detected, and its corresponding peak will not show up in the spectrum. Charge distributions are not equally probable: e.g., when c(p) = 2, fragment ions of charge 2 are exceedingly rare. When the peptide fragments at position t, the prefix fragment ion is referred to as the b_t-ion; the suffix fragment ion, the y_t-ion. The set {b_t}_t is referred to as the b-ion series, with the y-ion series defined analogously.

Each peak in an idealized spectrum corresponds to a fragment ion in the b-ion or y-ion series: the position of the peak for a fragment ion b is a deterministic function of m(b) and c(b), and likewise for y. A fragment spectrum measures how often particular peaks with a specific mass-to-charge (m/z) ratio are detected, so there is no sequence information in a peak.
A spectrum s is a collection of peaks, intensities at given m/z positions: s = {(x_j, h_j)}, where x_j is a point on the m/z axis (x-axis), and h_j is the corresponding intensity (see Figure 1B). In practice, there is substantial discrepancy between an idealized spectrum and a real one due to measurement noise, secondary fragmentation of the b or y ions, non-protein contaminants, or other imperfections in the isolation of the peptide. Even barring noise in the spectra, there is substantial variation across spectra which must be controlled. There can be order-of-magnitude differences in both total intensity ∑_j h_j and maximum intensity max_j h_j across spectra. To control for intensity variation, we rank-normalize each spectrum: peaks are sorted in order of increasing intensity, and the i-th peak is assigned intensity i/|s|, so max_j h_j = 1.0.

From the settings used to collect the spectra, we know that x_j ∈ [0, 2000] m/z units. We quantize the m/z scanning range into B = 2000 uniformly sized bins. The bins correspond to a vector of random variables S = (S_i : i = 1 . . . B). A spectrum is an instantiation of S, s = (s_1, . . . , s_B), where the most intense rank-normalized peak is retained in each bin. If no peak is present in a bin, then S_i = 0.
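The rank-normalization and binning steps can be sketched as follows. This is an illustrative helper; tie-handling and the bin-edge convention are our assumptions, not the paper’s:

```python
def binned_spectrum(peaks, B=2000, mz_max=2000.0):
    """Rank-normalize a spectrum [(mz, intensity), ...] and quantize it
    into B uniform m/z bins, keeping the most intense normalized peak
    per bin (empty bins are 0)."""
    n = len(peaks)
    # Sort peak indices by increasing intensity; the i-th ranked peak
    # (1-indexed) receives normalized intensity i/n, so the max is 1.0.
    order = sorted(range(n), key=lambda j: peaks[j][1])
    norm = [0.0] * n
    for rank, j in enumerate(order, start=1):
        norm[j] = rank / n
    S = [0.0] * B
    for j, (mz, _) in enumerate(peaks):
        b = min(int(mz / mz_max * B), B - 1)  # uniform bin index
        S[b] = max(S[b], norm[j])
    return S
```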
A Generative Model of Peptide Fragmentation
DBNs are commonly used to model discrete-time phenomena, but can be applied to any sequential data. In Figure 2, each non-prologue frame t = 1 . . . n corresponds to the fragmentation of peptide p into the b_t and y_t ions. The peptide is represented as a vector of random variables A = (A_i : i = 1 . . . n). Since we are given the peptide-spectrum match to score, A is observed, with A_t = a_t. The spectrum variables S are fixed across all frames, and observed, since the spectrum is given.
The masses of the prefix and suffix are denoted n_t = m(a_1 . . . a_t) and c_t = m(a_{t+1} . . . a_n).¹ The masses can be defined recursively: n_0 = 0, n_t = n_{t−1} + m(a_t), and c_n = 0, c_t = c_{t+1} + m(a_{t+1}). The variables p = {n_t, c_t}_{t=1}^{n−1} identify the peptide.

The random variables b_t, y_t ∈ {1 . . . B} are indices that select which bins are expected to contain the b_t-ion and the y_t-ion, respectively. Recall that there is a deterministic relationship between the mass and charge of a fragment, and its location on the m/z axis: i.e., b_t = round((n_t + 1)/z_t).
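These recursions take a single pass each over the peptide. The sketch below uses a hypothetical residue-mass table and follows the definitions above:

```python
def fragment_masses(peptide, residue_mass):
    """Prefix masses n_t and suffix masses c_t via the recursions in the
    text: n_0 = 0, n_t = n_{t-1} + m(a_t); c_n = 0, c_t = c_{t+1} + m(a_{t+1})."""
    n_len = len(peptide)
    n = [0.0] * (n_len + 1)
    c = [0.0] * (n_len + 1)
    for t in range(1, n_len + 1):          # forward pass for prefixes
        n[t] = n[t - 1] + residue_mass[peptide[t - 1]]
    for t in range(n_len - 1, -1, -1):     # backward pass for suffixes
        c[t] = c[t + 1] + residue_mass[peptide[t]]
    return n, c

def b_ion_bin(n_t, z):
    """Bin index of a b-ion: b_t = round((n_t + 1) / z), as in the text."""
    return round((n_t + 1) / z)
```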
To generalize Equation ?? to a posterior probability, we need a background score which measures the average fit of the spectrum to a shifted version of the theoretical spectrum. The shift variable τ_0 allows us to shift the theoretical spectrum: τ_0 ∈ [−M . . . +M], for a choice of M ∈ {1 . . . B}. Instead of predicting the b_t-ion at bin b_t, we predict it at bin b_t + τ. If the shifted bin location is outside the range {1, . . . , B}, we map those positions to a special bin that contains no peak. To shift the entire theoretical spectrum, τ_t = τ_{t−1}, t = 1 . . . n. The distribution over τ_0 is uniform.
Most of the conditional probability distributions in Figure 2 are deterministic, which leads to a simple form for the joint distribution:

p(τ_0, s, p) = p(τ_0) ∏_{t=1}^{n−1} ∏_{i=1}^{B} [P(S_i | b_t, y_t, τ_t)]^{δ(i = b_t+τ_t ∨ i = y_t+τ_t)}.   (1)
The inference which connects this model to Equation ?? is the log-posterior of τ_n:

θ(s, p) ≜ log p(τ_n = 0 | p, s) = log p(τ_0 = 0, p, s) − log [ |τ_0|^{−1} ∑_{τ_0} p(p, s | τ_0) ].   (2)

The log p(τ_0 = 0, p, s) term is the probabilistic analogue of ⟨S, s(p)⟩ in Equation ??, a term which measures the similarity between the theoretical and observed spectra. The log [ |τ_0|^{−1} ∑_{τ_0} p(p, s | τ_0) ] term is a generalized version of the cross-correlation between the real and theoretical spectra: the average similarity between the spectrum and shifted versions of the theoretical spectra.
Computing the scoring function θ(s, p) is somewhat simpler than computing the evidence p(p, s). Algorithms for DBN inference are typically forward-backward schemes (c.f., [2]), with it being possible for θ(·) to be computed using only a forward pass.
Virtual Evidence
An advantage of our probabilistic approach to scoring is that we have substantial flexibility in representing the contribution of peaks towards the score, P(S_i | b_t, y_t, τ_t). Using virtual evidence [10], we are free to choose an arbitrary non-negative function f_i(S) to model each bin.

One way to mimic the observation S_i = s_i is to introduce a virtual binary variable C_i, whose sole parent is S_i. The virtual child is fixed to C_i = 1. If P(C_i = 1 | S_i) ≜ δ(S_i = s_i), then

P(S_i = s_i | b_t, y_t, τ_t) = ∑_{S_i} P(S_i | b_t, y_t, τ_t) P(C_i = 1 | S_i).
Virtual evidence changes the definition of the virtual child’s conditional probability distribution to P(C_i = 1 | S_i) = f_i(S_i), for a user-defined non-negative function f_i. One could define a separate f_i for each bin i, but for simplicity we choose a single function f for all bins.
Following Equation ?? we impose additional constraints on the form of f. The score of a peptide-spectrum match should depend only on the peaks: f(0) = 1. If a peak is found in an activated bin, its contribution to the score must be higher than that of an activated bin with no peak: ∀S > 0, f(S) > f(0). Finally, matching high intensity peaks should be worth more than matching low intensity peaks; the b- and y-series should be more prominent than noise in the spectrum: i.e., f is monotone increasing. Based on our experiments, a class of f that works particularly well is

f_λ(S) = (e^λ − λ + λe^{λS}) / (2e^λ − 1 − λ).   (3)
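Equation (3), under our reading of the term grouping (which is an assumption recovered from the draft), can be checked numerically against the monotonicity constraint:

```python
import math

def f_lambda(S, lam):
    """Virtual-evidence function of Equation (3), as reconstructed here:
    f_lambda(S) = (e^lam - lam + lam * e^(lam*S)) / (2 e^lam - 1 - lam).
    The grouping of terms is an assumption; lam > 0 sets the weight
    placed on peak intensity."""
    return (math.exp(lam) - lam + lam * math.exp(lam * S)) / (
        2.0 * math.exp(lam) - 1.0 - lam)
```

For any λ > 0 this form is strictly increasing in S (its derivative is λ²e^{λS} over a positive denominator), matching the constraint that higher-intensity matches score higher.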
Figure 2: Didea as a graphical model: the “prologue” occurs once at the beginning, the “epilogue” occurs once at the end, and the “chunk” is unrolled as necessary to any desired length. At each chunk, the bottom plate is expanded to have B copies.
The parameter λ > 0 dictates the relative value placed upon
peak intensity in the scoring function.
Experiments
We compare the performance of spectrum identification algorithms against three tandem MS experiments on proteins from two different organisms:

• 60cm: A tryptic digest of S. cerevisiae lysate containing 18,149 spectra, each with a precursor ion charge of 2. That is, c(p) = 2 for all candidate peptides.

• Yeast-01: A tryptic digest of S. cerevisiae lysate, containing 34,499 spectra, each with a precursor charge of 2 or 3. That is, c(p) ∈ {2, 3} for all candidate peptides. We compare all algorithms under the assumption that each candidate peptide has c(p) = 2.

• Worm-01: A tryptic digest of C. elegans proteins, containing 22,436 spectra, each with a precursor charge of 2 or 3. Again, we compare all algorithms under the assumption that c(p) = 2.
The peptide database P for the yeast data sets is generated by an in silico trypsin digest of the soluble yeast proteome [16].
We compare the performance of four spectrum identification algorithms on these three data sets²: Didea, Crux/XCorr, Mascot, and PepHMM. Jeff: [Say again that these other methods include at least one that is quite standard, and that Mascot is commercial]. The search parameters are controlled across the four methods. Candidate peptides are selected using a mass window of δ = 3.0 Da, save PepHMM, which uses a hard-coded window of δ = 2.0 Da. The entire b- and y-ion series are assumed to be present. A fixed modification to cysteine is included to account for carbamidomethylation of protein disulfide bonds. In all cases, the decoys are generated by randomly permuting target peptides.
Figure 3 presents the absolute ranking comparison of the four methods. In all cases, there is a significant improvement in the number of spectra that are confidently identified, with Didea strictly dominating over q ∈ [0, 1], save for the Worm-01 experiment. Ajit: [I’m betting that the poor performance in Worm-01 is due largely to failures on spectra with heavy precursor masses where the charge is probably +3. If we assume the charge is +2, a large chunk of the b/y-series would fall outside the scanning range of the device]
Jeff: [I think we should include some reasons for this performance increase, rather than just giving it. First, did you end up using the MLE for λ? If so, say so. This should also include the benefit of the f function, that this “distribution” was not pulled out of a hat, but is a family of distributions that

¹The prefix and suffix of a peptide are more commonly referred to as the N- and C-terminal fragments.
²Comparisons on additional data sets are included in D.
[Figure 3: eight panels, (a) 60cm, (b) Yeast-01, (c) Yeast-02, (d) Yeast-03, (e) Yeast-04, (f) Worm-01, (g) Worm-02, (h) Worm-03, each plotting q-value (0 to 0.10) on the x-axis against spectra identified (1000’s) on the y-axis for Didea, Crux, Mascot, and PepHMM (Mascot is absent from the Worm-02 and Worm-03 panels).]
Figure 3: Absolute ranking comparison
are particularly suited to this problem. Should also say that this function is novel, not having been used before for this (or any) problem, as far as we know. Now also, a key benefit of our approach is that it is probabilistic, and thus automatically normalized appropriately, unlike the Crux approach mentioned above where there can be unwanted miscalculations between foreground and background models (at least this should be our hypothesis).]
5 Lattice Decoding
Lattice representation for peptide database
The drawback of representing each peptide as an individual observation sequence is that the same computations need to be carried out multiple times for peptides with identical substrings. A more efficient way of representing a peptide database is in the form of a subpeptide lattice. Lattice representations are widely used for other sequence modeling problems outside computational biology, such as speech and language processing (e.g., [1, 3, 5]). They provide a way of representing a finite but possibly very large set of strings in a compact, compressed form by the sharing of common substrings. Given an alphabet A of amino acids, a peptide p can be defined as a string over A. A subpeptide s is a substring of p whose length, |s|, is typically less than the length of p. We denote the total inventory of subpeptides with S. A subpeptide lattice is a directed acyclic graph G = (V, E) with a set of vertices V and a set of edges E, each of which is labelled with a subpeptide s ∈ S
Figure 4: Compressed lattice for the three peptides AAAANWLR,
AAADEWDER, AAADLISR.
From Jeff: I don’t think we’ll have space for this figure, unfortunately, at least in the main version. We could use it in the extended version (which could be a strict superset of this paper).
Figure 5: Graphical model structure for a peptide lattice.
and, optionally, additional information such as frequencies or probabilities. The concatenation of subpeptides along a path through the lattice corresponds to a complete peptide in the database. Using a lattice representation, common subpeptides can be shared among peptides and the peptide database can be represented much more compactly. The computations needed to evaluate the observation model for specific amino acids are only performed once per edge; thus, depending on the degree of sharing inherent in the lattice (relative to the uncompressed database), significant speedups can be achieved.
The question is how to define S such that the resulting lattice is as compact as possible. To address this problem we exploit the fact that, formally, a lattice is a (weighted) finite-state automaton (FSA) Jeff: [Add cite]. Our initial starting point is a “naive” lattice representation where every peptide is represented as a separate path consisting of edges labelled with individual amino acids only. We then apply a series of well-known operations on finite-state machines that transform the lattice into the corresponding minimal lattice, which has the smallest possible number of states. The alphabet S results as a by-product of this procedure.

The first step is to convert the peptide database (i.e., a simple set of strings) to a finite-state automaton F. Next, F is determinized. Determinization converts F into an equivalent FSA, Fdet, such that for any given state q and alphabet symbol a ∈ A, there is at most one outgoing edge from q labeled with a. Third, Fdet is minimized. Minimization creates an FSA, Fmin, that is equivalent to Fdet but has the minimal number of states. Algorithms for determinization and minimization have been studied in depth (e.g., [8]); we use the implementations provided in the OpenFst toolkit3. Finally, deterministic subpaths in the lattice (sequences of states with only one outgoing edge) are collapsed into a single edge, further limiting the number of states and edges and thus reducing memory requirements. Figure 4 shows an example of a compressed lattice for three peptides. At the end of this procedure, S is defined by the list of unique edge labels in the final collapsed lattice.
One problem is that the lattice incorporates peptides of different lengths, which complicates scoring with the observation model described in Section 4. In order to be able to score all strings simultaneously, they need to be warped to a common length. We achieve this by appending a “dummy amino acid” symbol to peptides shorter than the longest peptide in the database, such that all strings have the same length.
Graphical model representation of lattices
In order to use a lattice representation within our graphical modeling framework, the lattice needs to be represented as a graphical model structure, visualized in Figure 5.
Valid paths through the lattice are specified by the NODE variable and its associated parameters: the probability of node j given node i is nonzero whenever an edge exists between them in the original lattice. The SP (subpeptide) variable, with cardinality |S|, encodes the identity of the edge label and is dependent on a start node i and end node j. SPPOS specifies the position in the subpeptide; whenever the final position is reached, the binary transition variable TRANS is switched to 1. The TRANS variable is in turn a switching parent for NODE and SP; if it is 1, NODE and SP take on new values (i.e., a transition in the lattice occurs); otherwise the values from the previous frame are copied. Finally, SPPOS and SP jointly determine the amino acid (AA) variable, which is connected to an observation (or a more complicated observation model, as described above). The validity of strings is ensured by dedicated end node and end transition variables, which ensure that the end of the observation sequence coincides with the end of a subpeptide.
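The deterministic skeleton of these variables can be sketched as follows. For illustration the walk greedily follows the first outgoing edge at each TRANS event, whereas inference of course sums or maximizes over all edges, and the end-node/end-transition machinery is omitted; the name `unroll` and the edge-map encoding are hypothetical, not part of our inference code.

```python
def unroll(succ, start):
    """Emit one amino acid per frame while walking a lattice.
    `succ` maps a node to its outgoing (next_node, subpeptide) edges.
    SPPOS advances inside the current subpeptide; at its last
    position TRANS fires and NODE/SP jump to the next edge."""
    node, sp = succ[start][0]
    sppos = 0
    while True:
        yield sp[sppos]                  # AA determined by SP and SPPOS
        if sppos == len(sp) - 1:         # TRANS = 1
            if not succ[node]:           # sink node: path complete
                return
            node, sp = succ[node][0]     # lattice transition
            sppos = 0
        else:                            # TRANS = 0: copy forward
            sppos += 1

succ = {0: [(1, "AAA")], 1: [(2, "NWLR")], 2: []}
# "".join(unroll(succ, 0)) spells out the peptide AAANWLR
```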
When using a peptide lattice to search an entire database, precursor filtering can be done as part of the search. To this end, a pruning variable is included that assigns zero probability to a path if the current accumulated mass exceeds an upper mass limit, or if the lower mass limit exceeds the current mass
3 www.openfst.org
database    # peptides    “naive” lattice    compressed lattice    |S|
worm        523,190       27M/27M            295k/804k             251k
yeast       162,916       8.5M               95k/256k              85k

Table 1: Sizes of naive and compressed lattices (given as number of nodes/number of edges), and the size of the subpeptide alphabet |S| for the worm and yeast databases.
Experiment    “naive”    compressed
A
B
C

Table 2: CPU time of inference for database search vs. search through the lattice.
plus the maximum possible mass value that can still be added before the end of the peptide is reached (the maximum remaining mass is equal to the remaining number of peptide positions multiplied by the largest mass value of any amino acid). The pruning variable is checked whenever a new edge in the lattice is entered.
Experiments
Table 1 compares the sizes of the original “naive” lattice representation, where each peptide is represented as an individual string, and the corresponding compressed lattice representation.
With respect to computational efficiency, speedups can be achieved by evaluating the observation model only once for each edge in the lattice.
Three different timing experiments were conducted to evaluate the lattice representation. In Experiment A, we use the lattice as a compact representation for sets of peptides that have been prefiltered according to their precursor mass values. In Experiment B, the entire peptide database is represented as a lattice and the search is conducted against the entire database; precursor filtering is performed as part of the search, through the pruning variable in the graphical model lattice structure, as described above. In Experiment C, we also conduct a search over the entire database but (additionally?) use pruning options provided by the graphical model inference code.
Timing experiments were conducted on a (MACHINE SPECS?). Each number is the average of 20 runs and reports the inference time only, excluding startup cost.
Acknowledgements
References
[1] X. Aubert, C. Dugast, H. Ney, and V. Steinbiss. Large vocabulary continuous speech recognition of Wall Street Journal data. In Proceedings of ICASSP, pages 129–132.
[2] J. Bilmes. Dynamic graphical models. IEEE Signal Processing Magazine, 27(6):29–42, Nov 2010.
[3] C. Chelba and A. Acero. Position-specific posterior lattices for indexing speech. In Proceedings of ACL, 2005.
[4] A. R. Dongre, J. L. Jones, A. Somogyi, and V. H. Wysocki. Influence of peptide composition, gas-phase basicity, and chemical modification on fragmentation efficiency: evidence for the mobile proton model. Journal of the American Chemical Society, 118:8365–8374, 1996.
[5] C. Dyer, S. Muresan, and P. Resnik. Generalizing word lattice translation. In Proceedings of ACL/HLT, pages 1012–1020, 2008.
[6] J. E. Elias and S. P. Gygi. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods, 4(3):207–214, 2007.
[7] J. K. Eng, A. L. McCormack, and J. R. Yates, III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5:976–989, 1994.
[8] J. E. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, Mass., 1979.
[9] L. Käll, J. D. Storey, M. J. MacCoss, and W. S. Noble. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. Journal of Proteome Research, 7(1):29–34, 2008.
[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1998.
[11] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.
[12] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, 64:479–498, 2002.
[13] J. D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics, 31(6):2013–2035, 2003.
[14] J. D. Storey and R. Tibshirani. Statistical significance for genome-wide studies. Proceedings of the National Academy of Sciences of the United States of America, 100:9440–9445, 2003.
[15] Y. Wan, A. Yang, and T. Chen. PepHMM: A hidden Markov model based scoring function for mass spectrometry database search. Analytical Chemistry, 78(2):432–437, 2006. PMID: 16408924.
[16] M. P. Washburn, D. Wolters, and J. R. Yates, III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnology, 19:242–247, 2001.
SUPPLEMENTARY MATERIAL
A Tandem Mass Spectrometry
B Description of the Testing Datasets
Elided. The data sets have been used in previously published
work.
C Evaluation Metrics
Add the form of the equation for the case where n_targets ≠ n_decoys. Include results using qvality; i.e., that we do better even under alternate estimators of the FDR.
Ajit: [Questions that we may want to put in a supplement.]
Q: Why are ground-truth peptide-spectrum matches not available in any significant quantity?
A: Theoretically, one could create a purified sample of a peptide, which could be used to generate a spectrum for which the peptide is known. However, the resolution of tandem mass spectrometry is so high that creating sufficiently pure samples is impractical.
One could attempt to label spectra by hand, but such labellings are known not to be especially accurate [CITE].
D Scoring Identifications as Inference in a Dynamic Bayesian
Network
Explain where the virtual evidence function comes from, the MLE,
and why it does not work well.
Additional Experiments
Scatter plots relating our scoring function to Crux. Break down the comparison of methods by filtering returned PSMs by peptide length, by spectrum length, and by precursor mass. Sum-product vs. max-product. Ablative: replace the VECPT function with intensity.
E Lattice Decoding
Additional Experiments