Tandem Mass Spectrometry:
Peptide identification via database search, de novo sequencing
Some slides adapted by Sangtae Kim and from Jones & Pevzner 2004
Peptide and protein sequences
Peptide ≈ string over a weighted alphabet:
…AFSRLEMILGF…
Peptides = substrings, e.g. AFSRL, SRLEMILGF, EMILG
Amino acid Mass
A 71.0
F 147.1
S 87.0
D 115.0
L 113.1
E 129.1
M 131.1
I 113.1
G 57.0
Protein sequence:
H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH
R_{i-1}   R_i   R_{i+1}
AA residue i-1, AA residue i, AA residue i+1
N-terminus (left), C-terminus (right)
Mass spectrometry (MS and MS/MS)
[Figure: additional fragmentation of peptide LARGE; prefix masses for peptide LARGE plotted as relative intensity vs. mass]
Adapted from slides by Vineet Bafna, UCSD
Quantification
Peptide Fragmentation
• Peptides tend to fragment along the backbone.
• A mass spectrometer is a sophisticated scale that measures the masses of these fragments.
H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH
R_{i-1}   R_i   R_{i+1}
H+
Prefix Fragment Suffix Fragment
Collision Induced Dissociation
N- and C-terminal Fragments
Masses of Terminal Fragments
Peptide mass (Da): 57 + 97 + 147 + 114 = 415
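The arithmetic above can be sketched in a few lines, assuming integer residue masses and ignoring charge and terminal-group chemistry. The helper names and the `offset` parameter are illustrative and not from the slides. The C-terminal values listed on the slides for this peptide (71, 185, 332, 429, 486) equal the suffix sums shifted by 71; that shift is read off the listed numbers and is not derived here.

```python
from itertools import accumulate

RESIDUES = [57, 97, 147, 114]  # residue masses (Da) from the slide


def prefix_masses(residues):
    """Running sums from the N-terminus."""
    return list(accumulate(residues))


def suffix_masses(residues, offset=0):
    """Running sums from the C-terminus (starting with the empty suffix),
    each shifted by a constant offset (hypothetical parameter)."""
    sums = [0] + list(accumulate(reversed(residues)))
    return [s + offset for s in sums]


print(prefix_masses(RESIDUES))
print(suffix_masses(RESIDUES, offset=71))
```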
N- and C-terminal Fragments
N-terminal fragment masses: 57, 154, 301, 415
C-terminal fragment masses: 71, 185, 332, 429, 486
Theoretical Spectrum
57 71 154 185 301 332 415 429 486
Reconstructing Peptides
Reconstruct the peptide from the set of masses of its fragment ions (the mass spectrum):
57 71 154 185 301 332 415 429 486
The same task when the spectrum also contains additional peaks:
57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486
Peptide Fragmentation
[Figure: backbone of a four-residue peptide (side chains R1 to R4, N-terminal NH3+, C-terminal COOH) with cleavage sites labeled a2, a3, b2, b3, y1, y2, y3 and neutral-loss ions b2-H2O, b3-NH3, y2-NH3, y3-H2O]
Mass Spectra
[Figure: peptide G V D L K on a mass axis starting at 0; 57 Da = ‘G’, 99 Da = ‘V’; suffix fragments read K L D V G; a D - H2O peak is marked]
The peaks in the mass spectrum:
• Prefix and suffix fragments.
• Fragments with neutral losses (-H2O, -NH3).
• Noise and missing peaks.
Peptide Identification Problem
[Figure: peptide G V D L K matched to an MS/MS spectrum (intensity vs. mass); MS/MS peptide identification]
Example of a real MS/MS spectrum
[Figure: annotated spectrum with symmetric peaks b10 and y12]
MS/MS Fundamental Challenge
Given a spectrum, find the best peptide annotation
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• ∆: set of possible ion types
• m: parent mass
Output:
• Peptide P with mass m whose theoretical spectrum best matches the experimental spectrum S
Sequence from spectrum
Proteins → enzymatic digestion → peptides → tandem mass spectrometry → large set of MS/MS spectra
Database of known peptides:
MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, SEQUENCE, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..
Peptide: SEQUENCE
Two routes to the peptide: database search or de novo sequencing
DB search: Genomics → Proteomics
DNA → (transcription) → mRNA → (translation) → protein
Ribosomal protein translation
Match between Spectra and the Shared Peak Count
• The match between a spectrum and a peptide can be defined as the number of masses (peaks) shared by the two: the Shared Peak Count (SPC).
• In practice, several tools use a weighted SPC that reflects the intensities of the peaks.
• We will later see better ways to score peptide-spectrum matches.
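A minimal sketch of the unweighted SPC, assuming peaks are given as plain mass lists and that two masses "match" when they differ by at most a tolerance `tol` (the tolerance and the function name are assumptions, not from the slides). It counts theoretical masses that have at least one experimental peak nearby.

```python
def shared_peak_count(experimental, theoretical, tol=0.5):
    """Number of theoretical masses matched by at least one experimental peak."""
    return sum(
        any(abs(t - e) <= tol for e in experimental)
        for t in theoretical
    )


# Theoretical spectrum and noisy spectrum taken from the earlier slides.
theoretical = [57, 71, 154, 185, 301, 332, 415, 429, 486]
experimental = [57, 71, 81, 100, 112, 131, 154, 160, 172, 177, 185, 201,
                221, 235, 301, 312, 325, 332, 370, 387, 409, 415, 423,
                429, 460, 472, 486]

print(shared_peak_count(experimental, theoretical))
```

A weighted variant would sum the intensities of the matched peaks instead of counting them.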
Peptide Identification Problem
Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• D: database of peptides
• ∆: set of possible ion types
• m: parent mass
Output:
• A peptide of mass m from the database whose theoretical spectrum best matches the experimental spectrum S
MS/MS spectrum identification
[Figure: each spectrum in a set of MS/MS spectra is compared with the theoretical spectrum s(P) of peptides P from the DB]
Database search (Yates et al.’94 – SEQUEST):
⇒ Works very well when the peptide sequence is in the database
⇒ Approach of choice for everyday protein and peptide identification
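The database search idea can be sketched as a filter on parent mass followed by scoring each surviving candidate. This is a toy illustration, not SEQUEST: it assumes integer residue masses, a theoretical spectrum made only of prefix masses, and the shared peak count as the score. All function names and the four-peptide database are hypothetical.

```python
from itertools import accumulate

# Nominal (integer) residue masses in Da; I/L and K/Q coincide at this resolution.
AA_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
           "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
           "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}


def prefix_spectrum(peptide):
    """Simplified theoretical spectrum: prefix masses only."""
    return list(accumulate(AA_MASS[a] for a in peptide))


def spc(experimental, theoretical, tol=0.5):
    return sum(any(abs(t - e) <= tol for e in experimental) for t in theoretical)


def search(spectrum, database, parent_mass, tol=0.5):
    """Return (score, peptide) for the best candidate of matching mass, or None."""
    candidates = [p for p in database
                  if abs(sum(AA_MASS[a] for a in p) - parent_mass) <= tol]
    scored = [(spc(spectrum, prefix_spectrum(p), tol), p) for p in candidates]
    return max(scored) if scored else None


spectrum = [57, 100, 154, 200, 301, 415]   # prefix masses of GPFN plus two noise peaks
print(search(spectrum, ["GPFN", "NFPG", "GPNF", "AAAA"], parent_mass=415))
```

Note that every spectrum receives some best-scoring candidate this way, which is why a reliability estimate is needed afterwards.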
Determining reliability of identifications
EVERY spectrum has some best match to the database – how can we tell whether it’s a significant match?
From Elias’07
Decoy databases
Elias’07
Decoy databases are the most common approach to determine the reliability of identifications – estimate the False Discovery Rate.
Null hypothesis of the target-decoy strategy:
• Each spectrum is generated by a random (peptide-like) amino acid sequence
[Figure: relative frequency of match scores (y-axis, 0 to 0.2) vs. match score (x-axis, 0 to 25), shown separately for matches to Decoy and matches to Target]
False Discovery Rate
• Standard definitions:
  - True if a sample peptide generated the spectrum, False otherwise
  - Positive if we accept the identification, Negative otherwise
• False Discovery Rate: FDR = FP/(TP+FP), estimated using the percentage of Decoy matches above the selected score threshold.
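A small sketch of the decoy-based estimate, under the assumption that the target and decoy databases have equal size and are searched separately, so that the number of decoy matches above the threshold approximates the number of false positives among accepted target matches. This is one common convention; the function name and the score lists are made up for illustration.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimated FDR = (# decoy matches >= threshold) / (# target matches >= threshold)."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    return n_decoy / n_target if n_target else 0.0


target_scores = [5, 8, 12, 15, 20, 22]   # hypothetical best-match scores vs. target DB
decoy_scores = [3, 4, 6, 9, 13]          # hypothetical best-match scores vs. decoy DB

print(estimate_fdr(target_scores, decoy_scores, threshold=10))
```

In practice the threshold is raised until the estimated FDR falls below a chosen level.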
Approaches to interpretation
De-novo interpretation of an MS/MS spectrum attempts to generate the most likely peptide by direct interpretation of the spectrum.
• Searches over the space of all possible peptides
  - As opposed to only those present in some specific database
• Does not require any previous knowledge of the appropriate protein sequence
• Paradoxically, a de-novo interpretation can be computed much faster than a regular database search!
Issues on de-novo interpretation
Computational issues: How to efficiently search the space of possible solutions?
• Without eliminating any eligible candidates
• Without double-counting peaks in the spectrum
Scoring issues: How to best score the matches between peptides and MS/MS spectra?
• Most intensity explained
• Most ion types explained
• Maximum likelihood models
De-novo sequencing: how?
How can we find the amino acid sequence that best explains the spectrum?
a) Explains the largest number of peaks
b) Explains the most intensity
• Exhaustive enumeration?
  1. Generate every possible sequence with the same peptide mass
  2. Match each sequence to the spectrum
  3. Choose the sequence that explains the most intensity in the spectrum
[Figure: candidate sequences F S N A M S D I / V SGQ L I D]
Takes too long!
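To get a feel for why enumeration takes too long, the number of sequences with a given total mass can be counted without listing them. The sketch below assumes integer residue masses and treats residues of equal mass (I/L, K/Q) as one letter; `count_sequences` is a hypothetical helper.

```python
# Distinct nominal residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
                  128, 129, 131, 137, 147, 156, 163, 186]


def count_sequences(total_mass, residue_masses=RESIDUE_MASSES):
    """Count ordered sequences of residue masses that sum to total_mass."""
    ways = [0] * (total_mass + 1)
    ways[0] = 1  # the empty sequence
    for m in range(1, total_mass + 1):
        ways[m] = sum(ways[m - r] for r in residue_masses if r <= m)
    return ways[total_mass]


for mass in (415, 800, 1200):
    print(mass, count_sequences(mass))
```

The count grows roughly exponentially with the parent mass, so scoring every candidate against the spectrum quickly becomes impractical.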
Initial attempts
Computer programs for de-novo interpretation of MS/MS spectra date as far back as 1966, when Biemann et al. proposed a prefix extension algorithm:
• Tries every possible prefix extension
• Eliminates solutions with missing peaks
• Outputs every peptide with a matching parent mass
• ALL prefix peaks must be present in the spectrum
In an attempt to include interpretations with missing peaks, Sakurai et al. (1984) proposed:
• Exhaustive search over the space of all possible permutations of amino acid multisets whose total mass equals the parent mass of the MS/MS spectrum
• Score peptide/spectrum matches by counting prefix and suffix mass matches
• Naturally more sensitive but also very slow
Prefix extension revisited
The second half of the eighties saw a few better designed approaches to this problem, based on the same type of algorithm:
• Prefix extension, one amino acid at a time.
• Tolerate missing peaks.
• Include both prefix and suffix peaks in the score.
• User-specified maximum number of candidates in memory at any point of the execution (sub-optimal).
Ishikawa and Niwa’86, Siegel and Baumann’88, Johnson and Biemann’89, Zidarov et al.’90
Spectrum graphs
In 1990 Bartels introduced a graph representation of an MS/MS spectrum:
• Every peak in the spectrum defines a vertex
• Vertices are connected by an edge if the peak mass difference is an amino acid mass
• The best peptide is defined as the best path between the two endpoint vertices, v0 to vM(S)
• No detailed algorithm was given for finding the best peptide; an interactive exploration tool was made available.
[Figure: spectrum graph with endpoint vertices v0 and vM(S)]
DP de-novo sequencing
What is the DP recursion?
Score(i) = intensity(i) + max_j Score(j), for all j with mass(i) - mass(j) ∈ amino acid masses
DP de-novo sequencing: EDTES
Recovered de-novo sequence? ESESE
Why was the correct peptide missed?
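The recursion can be sketched as a longest-path computation over peaks sorted by mass. The example below is constructed for illustration, with integer residue masses and ion offsets ignored: the spectrum holds the prefix masses of EDTES (129, 244, 345, 474) and its suffix masses (87, 216, 317, 432), with made-up intensities. The function name and intensity values are assumptions.

```python
from math import inf

AA_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
           "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
           "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}
MASS_TO_AA = {m: a for a, m in AA_MASS.items()}


def best_path(peaks, parent_mass):
    """peaks: {mass: intensity}. Returns (score, sequence) of the best 0 -> parent_mass path."""
    inten = dict(peaks)
    inten.setdefault(0, 0.0)            # initial vertex
    inten.setdefault(parent_mass, 0.0)  # terminal vertex
    masses = sorted(inten)
    score = {m: -inf for m in masses}
    score[0] = inten[0]
    back = {}
    for mi in masses:
        for mj in masses:
            if mj >= mi:
                break
            # Edge j -> i exists when the mass difference is one amino acid.
            if (mi - mj) in MASS_TO_AA and score[mj] > -inf:
                cand = inten[mi] + score[mj]
                if cand > score[mi]:
                    score[mi], back[mi] = cand, mj
    if score[parent_mass] == -inf:
        return -inf, ""
    seq, m = [], parent_mass
    while m != 0:
        seq.append(MASS_TO_AA[m - back[m]])
        m = back[m]
    return score[parent_mass], "".join(reversed(seq))


peaks = {129: 3, 244: 1, 345: 3, 474: 1,   # prefix masses of EDTES
         87: 1, 216: 3, 317: 1, 432: 3}    # suffix masses of EDTES
print(best_path(peaks, parent_mass=561))
```

With these intensities the best path spells ESESE: it alternates between prefix peaks (129, 345) and suffix peaks (216, 432) of the same peptide, so each cleavage is effectively counted twice.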
Forbidden pairs
The exclusion of symmetric peaks in a maximal scoring path through a spectrum graph was first proposed by Dančik et al. in 1999:
• Peaks in the spectrum are called forbidden pairs if their masses add up to the parent mass; either vertex can be used in the output path, but not both.
• A path is anti-symmetric if it uses at most one vertex from every forbidden pair.
• The objective becomes: find a maximal scoring anti-symmetric path.
• NP-hard in the general case
Solution:
1. Extend the sequence from either the prefix or the suffix
2. Avoid reusing the pairing mass (highlighted in red)
Note that this ordering is always possible (Chen’01, Bafna and Edwards’03)
What is the recursion?
Maximal scoring anti-symmetric path
Chen et al.’01 provided a dynamic programming recursion to find a maximal scoring anti-symmetric path:
• A peak si precedes a peak sj if it is closer to one of the ends of the spectrum: min(si, m(S)-si) < min(sj, m(S)-sj)
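Two pieces of this idea are easy to sketch: checking whether a path is anti-symmetric, and ordering peaks by their distance to the nearer end of the spectrum, min(s, m(S)-s). This is not the full dynamic program; the function names, the tolerance, and the integer-mass example (EDTES, parent mass 561) are assumptions for illustration.

```python
from itertools import combinations


def is_antisymmetric(path_masses, parent_mass, tol=0.5):
    """True if no two inner vertices of the path form a forbidden pair."""
    inner = [m for m in path_masses if 0 < m < parent_mass]
    return all(abs(a + b - parent_mass) > tol for a, b in combinations(inner, 2))


def outside_in_order(peaks, parent_mass):
    """Sort peaks by distance to the nearer end of the spectrum (ties keep input order)."""
    return sorted(peaks, key=lambda s: min(s, parent_mass - s))


parent = 561
print(is_antisymmetric([0, 129, 216, 345, 432, 561], parent))  # ESESE path
print(is_antisymmetric([0, 129, 244, 345, 474, 561], parent))  # EDTES path
print(outside_in_order([87, 129, 216, 244, 317, 345, 432, 474], parent))
```

Processing peaks in this outside-in order means the two members of a forbidden pair are considered at the same step, which is what allows the recursion to use at most one of them.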
Optimality and extensions
The anti-symmetric dynamic programming solutions are
• Correct: no symmetric peak can be reused and every anti-symmetric path is considered
• Optimal: a maximal scoring anti-symmetric path is constructed
• “Efficient”: runtime is O(n^2), n = number of peaks
Other algorithms have also been proposed:
• Ma’03, Frank’05: same algorithm, different scoring
• Bafna and Edwards’03: same principle, extended the concept of forbidden pairs to avoid peak reuse between any pair of ion types
How does de-novo perform?
• More involved scoring schemes have been proposed
  - Sherenga (Likelihood model)
  - NovoHMM (Hidden Markov Models)
  - Pepnovo (Bayesian network)
  - Peaks (commercial, scoring model unknown)
• Best algorithms predict 1 incorrect amino acid out of every 4 predictions (NovoHMM, Pepnovo)
• Main problem: MS/MS spectra are noisy!
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• ∆: set of possible ion types
• m: parent mass
Output:
• Peptide P with mass m whose theoretical spectrum best matches the experimental spectrum S
Complications:
• Missing peaks due to chemical conditions
  - e.g. Proline (P) ‘grabs’ the amino acid to its right
• Additional representative peaks (ion types)
• Noise (or unexplainable peaks)
Scoring MS/MS spectrum masses
Ion type      ∆m     p(ion)
b             +1     0.5
b (iso)       +2     0.15
b-NH3         -16    0.3
b-H2O         -17    0.3
a (-CO)       -27    0.17
b-NH3-H2O     -34    0.16
…
[Figure: candidate sequences F S N A M S D I / V SGQ L I D]
Ion Types
• Some masses correspond to fragment ions, others are just random noise
• Knowing ion types ∆={δ1, δ2,…, δk} lets us distinguish fragment ions from noise
• We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.
Example of Ion Type
• ∆={δ1, δ2,…, δk}
• Ion types {b, b-NH3, b-H2O} correspond to ∆={0, 17, 18}
*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity
Vertices of Spectrum Graph
• Masses of potential N-terminal peptides
• Vertices are generated by reverse shifts corresponding to ion types ∆={δ1, δ2,…, δk}
• Every N-terminal fragment of mass m can generate up to k ions: m-δ1, m-δ2, …, m-δk
• Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} corresponding to potential N-terminal peptides
• Vertices of the spectrum graph: {initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
Reverse Shifts
[Figure: spectrum peaks shifted back by H2O and by H2O+NH3]
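A short sketch of vertex generation by reverse shifts, using the simplified ∆={0, 17, 18} from the example slide and integer masses. It assumes the initial vertex is mass 0 and the terminal vertex is the parent mass; the function name and the three-peak spectrum are invented for illustration. It also records how many peaks support each vertex.

```python
from collections import Counter


def spectrum_graph_vertices(spectrum, deltas, parent_mass):
    """Return (sorted vertex masses, support counts) for V(s) = {s + delta}."""
    support = Counter()
    for s in spectrum:
        for d in deltas:
            v = s + d
            if 0 < v < parent_mass:
                support[v] += 1
    vertices = sorted({0, parent_mass} | set(support))
    return vertices, support


# A b ion at 100 with its b-NH3 (83) and b-H2O (82) companions.
vertices, support = spectrum_graph_vertices([100, 83, 82], deltas=(0, 17, 18), parent_mass=300)
print(vertices)
print(support[100])
```

All three peaks vote for the same vertex at mass 100, which is the kind of agreement a scoring function can reward.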
Edges of Spectrum Graph
• Two vertices with mass difference corresponding to an amino acid A:
  - Connect with an edge labeled by A
• Gap edges for di- and tri-peptides
• Paths in the labeled graph spell out amino acid sequences. How do we find the correct one?
• We need scoring to evaluate paths
Path Score
• p(P,S) = probability that peptide P produces spectrum S = {s1, s2, …, sq}
• p(P,s) = the probability that peptide P generates a peak s
• Scoring = computing probabilities
• p(P,S) = ∏_{s∈S} p(P,s)
Peak Score
• For a position t that represents ion type δj:
\[ p(P, s_t) = \begin{cases} q_j & \text{if a peak is generated at } t \\ 1 - q_j & \text{otherwise} \end{cases} \]
Peak Score (cont’d)
• For a position t that is not associated with an ion type:
\[ p_R(P, s_t) = \begin{cases} q_R & \text{if a peak is generated at } t \\ 1 - q_R & \text{otherwise} \end{cases} \]
• qR = the probability of a noisy peak that does not correspond to any ion type
Finding Optimal Paths in the Spectrum Graph
• For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:
\[ p(P', S) = \max_{P} p(P, S) \]
• Peptides = paths in the spectrum graph
• P’ = the optimal path in the spectrum graph
Ions and Probabilities
• Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}
• δi-ions of a partial peptide are produced independently with probabilities qi
• A fragment has all k peaks with probability
\[ \prod_{i=1}^{k} q_i \]
• and no peaks with probability
\[ \prod_{i=1}^{k} (1 - q_i) \]
• A peptide also produces “random noise” with uniform probability qR at any position.
Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions and penalties for missing ions.
• Example: for k=4, assume that for a partial peptide P’ we only see ions δ1, δ2, δ4. The score is calculated as:
\[ \frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{1 - q_3}{1 - q_R} \cdot \frac{q_4}{q_R} \]
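The ratio test for one partial peptide multiplies q_j/q_R for every observed ion type and (1-q_j)/(1-q_R) for every missing one. A minimal sketch follows, with made-up probabilities (the values of `q` and `q_noise` are assumptions, not learned from data) and the k=4 case in which ions δ1, δ2 and δ4 are observed and δ3 is missing.

```python
def ratio_score(q, observed, q_noise):
    """Product of premiums (observed ions) and penalties (missing ions)."""
    score = 1.0
    for q_j, seen in zip(q, observed):
        if seen:
            score *= q_j / q_noise              # premium for an observed ion
        else:
            score *= (1 - q_j) / (1 - q_noise)  # penalty for a missing ion
    return score


q = [0.5, 0.3, 0.3, 0.2]               # hypothetical ion-type probabilities q1..q4
observed = [True, True, False, True]   # delta_1, delta_2, delta_4 seen; delta_3 missing
print(ratio_score(q, observed, q_noise=0.1))
```

In practice the logarithm of this product is used, so that scores of partial peptides along a path add up.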
Scoring Peptides
• T: set of all positions.
• Ti = {t_δ1, t_δ2, ..., t_δk}: set of positions that represent ions of partial peptide Pi.
• A peak at position t_δj is generated with probability qj.
• R = T - ∪ Ti: set of positions that are not associated with any partial peptide (noise).
Probabilistic Model
• For a position t_δj ∈ Ti, the probability p(t, P, S) that peptide P produces a peak at position t is:
\[ p(t_{\delta_j}, P, S) = \begin{cases} q_j & \text{if a peak is generated at position } t_{\delta_j} \\ 1 - q_j & \text{otherwise} \end{cases} \]
• Similarly, for t ∈ R, the probability that P produces a random noise peak at t is:
\[ p_R(t) = \begin{cases} q_R & \text{if a peak is generated at position } t \\ 1 - q_R & \text{otherwise} \end{cases} \]
Probabilistic Score
• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:
\[ \frac{p(P, S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})} \]
Resulting sequencing accuracy
Algorithm   Average Accuracy   Sequence Length   Tag 3   Tag 4   Tag 5   Tag 6
Sherenga    0.690              8.65              0.821   0.711   0.564   0.364
Peaks       0.673              10.32             0.889   0.814   0.689   0.575
Lutefisk    0.566              8.79              0.661   0.521   0.425   0.339
Benchmarking reported for 280 spectra.
Frank and Pevzner’05
Enhancing Sherenga (Dancik et al.’99)
Pepnovo’s scoring model:
• Determines different intensity values.
• Considers dependencies between fragment ions.
• Incorporates additional chemical knowledge (e.g., preferred cleavage sites).
• Uses positional influence of the cleavage site.
• Improves the Random Model.
Pepnovo slides by Ari Frank
H_CID - Fragmentation Network
[Figure: Bayesian network over fragment-ion nodes y, b, y2, a, b2, a-NH3, a-H2O, b-NH3, b-H2O, y-NH3, y-H2O, b-H2O-NH3, b-H2O-H2O, y-H2O-NH3, y-H2O-H2O, together with the nodes pos(m) (region in peptide), N-aa (N-terminal amino acid) and C-aa (C-terminal amino acid). Edge groups: amino acid influence, ion combinations, positional influence.]
pos y P(y|pos)
0 0 0.10 1 0.22
2 3 0.52
4 3 0.08
Discrete Intensity Values
• Peak intensity is normalized according to the grass level (average of the weakest 33% of peaks in the spectrum).
• Normalized intensities are discretized into 4 intensity levels:
  - zero : I < 0.05
  - low : 0.05 ≤ I < 2 (62% of peaks)
  - medium : 2 ≤ I < 10 (26% of peaks)
  - high : I ≥ 10 (12% of peaks)
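The normalization and discretization can be sketched as follows. The code assumes "weakest 33%" means the lowest third of the peaks by intensity (at least one peak), that all intensities are positive, and that an absent peak is represented by intensity 0; the function names and sample intensities are made up.

```python
def normalize(intensities):
    """Divide each intensity by the grass level (mean of the weakest third)."""
    ranked = sorted(intensities)
    n_weak = max(1, len(ranked) // 3)
    grass = sum(ranked[:n_weak]) / n_weak
    return [x / grass for x in intensities]


def level(normalized):
    """Map a normalized intensity to one of the four discrete levels."""
    if normalized < 0.05:
        return "zero"
    if normalized < 2:
        return "low"
    if normalized < 10:
        return "medium"
    return "high"


raw = [1.0, 1.0, 1.5, 4.0, 9.0, 25.0]   # hypothetical raw peak intensities
print([level(x) for x in normalize(raw)])
print(level(0.0))                        # an expected ion with no peak
```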
Combinations of Fragments
• The topology takes into account dependencies between fragments.
• The values of the probability tables are learned from the training data, so they reflect the true “fragmentation rules”.
• “Logical” combinations get higher probabilities: P(b=high | y=high) = 0.36, vs. P(b=high | y=low) = 0.03.
[Figure: the fragmentation network restricted to the fragment-ion nodes y, b, y2, a, b2 and their neutral-loss variants]
Additional Chemical Knowledge
• The identity of the flanking amino acids influences the peak intensities:
  - Increased intensities N-terminal to Proline and Glycine
  - Increased intensities C-terminal to Aspartic Acid.
• 400 amino acid combinations reduced to 15 equivalence sets (X-P, X-G, etc.).
[Figure: network nodes N-aa (N-terminal amino acid) and C-aa (C-terminal amino acid) connected to y and b]
Positional Influence
• Creates separate models for different locations of the cleavage site in the peptide.
• Models phenomena such as:
  - weak b/y ions near terminal ends.
  - prevalence of a-ions in the first half of the peptides.
  - prevalence of b2 towards the peptide’s C-terminus and y2 near the N-terminus.
[Figure: network node pos(m) (region in peptide) connected to y, b, y2, a, b2]
H_Random – Regional Density
[Figure: a window of width w on the m/z axis containing peaks labeled with intensity levels 1, 2, 2, 2, 2, 3, 3; a bin of width 2ε; bin labels 0 to 3]
Computing the Random Probability
• α = 1-(2ε)/w is the probability of a single peak missing the bin.
• Let n_i, 1≤i≤d, be the counts of peaks with intensity i in window w:
1. (prob. of peak with intensity t)
\[ P_{Random}(I = t \mid n_1, \ldots, n_d) = \alpha^{\sum_{i=t+1}^{d} n_i} \left(1 - \alpha^{n_t}\right) \]
2. (prob. of no peak)
\[ P_{Random}(I = 0 \mid n_1, \ldots, n_d) = \alpha^{\sum_{i=1}^{d} n_i} \]
3. (normalization term)
\[ \sum_{i=0}^{d} P_{Random}(I = i \mid n_1, \ldots, n_d) = 1 \]
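A sketch of this random model, assuming the reading P(I=t) = α^(n_{t+1}+…+n_d) · (1 - α^(n_t)) for t ≥ 1 and P(I=0) = α^(n_1+…+n_d), i.e. no peak of higher intensity falls in the bin and at least one peak of intensity t does. The function name and the example counts are assumptions. The check at the end confirms that the probabilities sum to 1, as the normalization term requires.

```python
def random_intensity_probs(counts, eps, window):
    """counts[i-1] = n_i, the number of peaks with intensity level i in the window.
    Returns [P(I=0), P(I=1), ..., P(I=d)]."""
    alpha = 1 - (2 * eps) / window      # a single peak misses the bin
    d = len(counts)
    probs = [alpha ** sum(counts)]      # no peak at all lands in the bin
    for t in range(1, d + 1):
        higher = sum(counts[t:])        # n_{t+1} + ... + n_d
        probs.append(alpha ** higher * (1 - alpha ** counts[t - 1]))
    return probs


probs = random_intensity_probs(counts=[4, 2, 1], eps=0.5, window=100)
print(probs)
print(sum(probs))
```

Because α shrinks as the window fills with peaks, a random match becomes more likely in dense regions, which lowers the weight of peaks found there.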
Random Model cont.
• Employing this random model increases the contribution of peaks in sparse regions of the spectrum.
• Decreases score for spurious matches in dense regions.
• Increases contribution of high intensity peaks compared to low intensity.
Probability under H_CID
From the decomposition properties of probabilistic networks, each node is independent of the rest of the nodes given the values of its parents, so:
\[ P_{H_{CID}}(\vec{I}, m) = \prod_{f \in \{y, b, \ldots\}} P_{H_{CID}}(I_f \mid \pi(f)) \]
where \vec{I} are the ion intensities for mass m and π(f) are the intensities of the parents of node f.
Probability under H_Random
• Peak occurrences are treated as random independent events:
\[ P_{H_{Random}}(\vec{I}, m) = \prod_{f \in \{y, b, y\text{-}H_2O, \ldots\}} P_{H_{Random}}(I_f \mid n_1, \ldots, n_d) \]
• The probability of observing a peak at random is estimated from the local density of peaks in the spectrum.
The Likelihood Ratio Score
• A putative cleavage site is scored according to the log ratio test:
\[ Score(\vec{I}, m) = \log \frac{P_{H_{CID}}(\vec{I}, m)}{P_{H_{Random}}(\vec{I}, m)} = \log \frac{\prod_{f \in \{y, b, \ldots\}} P_{H_{CID}}(I_f \mid \pi(f))}{\prod_{f \in \{y, b, \ldots\}} P_{H_{Random}}(I_f \mid n_1, \ldots, n_d)} \]
• Can be used to score a peptide by summing the scores for the prefix masses:
\[ Score(P = p_1 p_2 \ldots p_n) = \sum_{i=1}^{n} Score(\vec{I}_{m_i}, m_i) \]
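Numerically, the likelihood ratio score is a sum of per-fragment log ratios at each cleavage site, and the peptide score is the sum over its prefix masses. The sketch below takes the per-fragment probabilities under the two hypotheses as given inputs; the dictionaries and their values are invented for illustration and are not PepNovo's learned tables.

```python
import math


def cleavage_score(p_cid, p_random):
    """log of prod_f P_CID(I_f | parents) / prod_f P_Random(I_f | n_1..n_d)."""
    return sum(math.log(p_cid[f] / p_random[f]) for f in p_cid)


def peptide_score(sites):
    """Sum of cleavage-site scores over the prefix masses of a peptide."""
    return sum(cleavage_score(p_cid, p_random) for p_cid, p_random in sites)


# Two hypothetical cleavage sites, each with probabilities for fragments b and y.
sites = [
    ({"b": 0.36, "y": 0.50}, {"b": 0.03, "y": 0.10}),
    ({"b": 0.20, "y": 0.40}, {"b": 0.20, "y": 0.10}),
]
print(peptide_score(sites))
```

A site whose observed intensities are better explained by fragmentation than by chance contributes a positive term; a site that looks random contributes about zero or less.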
PepNovo’s De Novo Sequencing
• A spectrum graph is created from the experimental MS/MS spectrum.
• The nodes are scored using the Bayesian network.
• The highest scoring anti-symmetric path is found using dynamic programming.
Data and Software
• 1252 spectra of doubly charged tryptic peptides (from ISB and OPD), measured on an ion trap mass spectrometer:
  - 972 spectra in the training set.
  - 280 spectra in the test set (peptides up to 1400 Da, assignments independently verified).
• Compared PepNovo with 3 de novo programs: Sherenga (Spectrum Mill 3.0), Lutefisk XP, and Peaks v2.3.
Results
Algorithm   Average Accuracy   Sequence Length   Tag 3   Tag 4   Tag 5   Tag 6
PepNovo     0.727              10.30             0.946   0.871   0.800   0.654
Sherenga    0.690              8.65              0.821   0.711   0.564   0.364
Peaks       0.673              10.32             0.889   0.814   0.689   0.575
Lutefisk    0.566              8.79              0.661   0.521   0.425   0.339
Benchmarking reported for 280 spectra.
Frank and Pevzner’05