MassWiz: A Novel Scoring Algorithm with Target-Decoy Based Analysis Pipeline for Tandem Mass Spectrometry
Amit Kumar Yadav, Dhirendra Kumar, and Debasis Dash*
Institute of Genomics and Integrative Biology (CSIR), Mall Road, Delhi, India
J. Proteome Res. 2011, 10, 2154–2160 | dx.doi.org/10.1021/pr200031z | pubs.acs.org/jpr
Received: July 19, 2010. Published: March 18, 2011. © 2011 American Chemical Society
ABSTRACT: Mass spectrometry has made rapid advances in the recent past and has become the preferred method for proteomics. Although many open source algorithms for peptide identification exist, such as X!Tandem and OMSSA, the field has largely been a domain of proprietary software. There is a need for better, freely available, and configurable algorithms that can help in identifying the correct peptides while keeping the false positives to a minimum. We have developed MassWiz, a novel empirical scoring function that gives appropriate weights to major ions, continuity of b-y ions, intensities, and the supporting neutral losses based on the instrument type. We tested MassWiz accuracy on 486,882 spectra from a standard mixture of 18 proteins generated on 6 different instruments, downloaded from the Seattle Proteome Center public repository. We compared the MassWiz algorithm with Mascot, Sequest, OMSSA, and X!Tandem at 1% FDR. MassWiz outperformed all in the largest data set (AGILENT XCT) and was second only to Mascot in the other data sets. MassWiz showed good performance in the analysis of high-confidence peptides, i.e., those identified by at least three algorithms. We also analyzed a yeast data set containing 106,133 spectra downloaded from the NCBI Peptidome repository and obtained similar results. The results demonstrate that MassWiz is an effective algorithm for high-confidence peptide identification without compromising on the number of assignments. MassWiz is open-source, versatile, and easily configurable.
KEYWORDS: Tandem mass spectrometry, proteomics, peptide identification, bioinformatics, open source, algorithm, FDR, MS/MS
INTRODUCTION
With the advent of soft ionization techniques like MALDI [1] and ESI [2], it became possible to ionize highly polar and nonvolatile molecules such as peptides without destroying them. They could now be introduced into a mass spectrometer, making analysis of peptides much easier. Sequence database searching emerged as a valuable alternative to de novo sequencing. Due to the rapid advances made in MS instrumentation (LTQ, QTOF, FTICR, Orbitrap, etc.), availability of complete genome sequences, increased computational power for data analyses, and development of algorithms, mass spectrometry has become the method of choice for proteomics studies [3,4]. Washburn et al. [5] demonstrated the high-throughput capability of the LC-MS approach on the yeast proteome, establishing shotgun proteomics as a valuable methodology.
There have been improvements in bioinformatics tools and algorithms for signal processing and peak detection [6–8], charge state deconvolution [9,10], noise removal [8,11] and spectra filtering [11,12], database searches, and assigning statistical confidence [13–16]. Due to the various steps involved in data analysis and their complexity, no single method can be a complete solution [17]. There is considerable scope for newer bioinformatics methods and algorithms, especially those freely available in the public domain, for rapid advancement of the field. Tools such as the k-score plugin [18] for X!Tandem, the Trans-Proteomics Pipeline (TPP) [19], and InsPecT [20] are some excellent examples.
A robust scoring function is the heart of any peptide identification algorithm. Scoring functions can be broadly divided into probabilistic and empirical schemes. Mascot [21] is one of the most widely used probability-based algorithms, whereas SEQUEST [22] is based on cross-correlation between theoretical and experimental spectra. X!Tandem [23] uses a hypergeometric model, and OMSSA [24] relies on a Poisson distribution to assess the significance of matches. While all algorithms have their inherent pros and cons, no single method can capture all of the information content from an MS experiment [25]. It is generally agreed that using multiple algorithms increases the number of assignments [17,26].
We present a novel empirical scoring algorithm that aims to maximize identifications while keeping false positives (incorrect identifications) to a minimum. Our scoring function assigns different weights to key ions, their consecutive occurrence, their intensities, and their supporting ions. The significance of intensity as a parameter has been shown previously [27,28]; it helps discriminate between a correct and a random match. For developing and testing the scoring function, we needed an easily modifiable framework, so we developed the required framework in Perl, which was easy to implement and modify. Although it may not be comparable with the existing algorithms in time performance, it remains useful because the Perl code can easily be modified to tweak the algorithm. We benchmarked MassWiz accuracy against Mascot, Sequest, X!Tandem, and OMSSA by comparing the number of identified high-confidence peptides from a standard mixture of 18 proteins.
Decoy methods [29–31] have become popular for estimation of false discovery rates (FDR). Although Moore et al. [32] first used the method by simply reversing the target database, many alternatives have been suggested [33,34]. It is now used for assigning significance to peptide identifications at a fixed FDR value [31]. We have integrated the reverse-database decoy strategy for significance assessment, which is free from distribution assumptions and does not require curve fitting. The MassWiz executable is available on SourceForge (http://sourceforge.net/projects/masswiz), and the source code is freely available for academic use on request.
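The reverse-database decoy strategy can be illustrated with a minimal Python sketch (the MassWiz framework itself is written in Perl, and this is not the authors' code). The function names and the "REV_" header prefix are hypothetical choices made for illustration; the toy sequences are invented.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def make_reversed_decoys(records, prefix="REV_"):
    """Build decoy records by reversing each target sequence.

    The header prefix is an illustrative convention, not taken from the paper.
    """
    return [(prefix + header, seq[::-1]) for header, seq in records]


# Toy two-protein database (invented sequences).
targets = read_fasta(">P1 example protein\nMKTAYIAK\nQR\n>P2\nGASPK\n")
decoys = make_reversed_decoys(targets)
```

Searching the target and decoy records separately, as the paper describes, keeps the two score lists apart for the FDR calculation.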
MATERIALS AND METHODS
Data Set
A data set from a standard mixture of 18 proteins, the “ISB standard protein mix” described by Klimek et al. [35], was used for validating MassWiz and comparing its accuracy against other algorithms. The Mix 3 data set for all six instruments was downloaded from http://regis-web.systemsbiology.net/PublicDatasets in mzXML format. The FASTA database (18-protein mix, contaminants, and Haemophilus influenzae sequences) was also downloaded and updated with recent sequences for all standard proteins and their homologues.
For testing MassWiz on a biological data set, we downloaded yeast mid-log phase data from the NCBI Peptidome repository (http://www.ncbi.nlm.nih.gov/peptidome/psm1001). The FASTA database was downloaded from Swiss-Prot using the taxonomy filter Saccharomyces cerevisiae (Baker’s yeast; taxonomy ID 4932), complete proteome, containing 6616 sequences.
Input Data Preparation
All mzXML files were converted to Mascot generic format (mgf) using the MzXML2search executable from TPP. For each instrument, all mgf files thus obtained were concatenated and used as a common search input to all algorithms.
Algorithm Implementation
The scoring algorithm was tested by developing a framework in Perl (version 5.10.1). Mass calculations and theoretical spectrum generation were accomplished using the InSilicoSpectro package [36]. The MassWiz framework includes a complete pipeline from handling the input spectra to generating FDR-corrected peptide spectrum matches (PSMs), i.e., the top-ranked peptide for each spectrum.
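To make "theoretical spectrum generation" concrete, here is a minimal Python sketch that computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses. It does not use or reproduce the InSilicoSpectro API; the function name by_ions and the fixed_mods argument are hypothetical, and only the +57.03 Da cysteine modification comes from the paper.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # monoisotopic mass of H2O (Da)


def by_ions(peptide, fixed_mods=None):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    fixed_mods = fixed_mods or {}
    masses = [RESIDUE_MASS[aa] + fixed_mods.get(aa, 0.0) for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1): N-terminal prefixes
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1): C-terminal suffixes
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


# Carbamidomethylated cysteine (+57.03 Da, as used in the searches).
b, y = by_ions("GACK", fixed_mods={"C": 57.03})
```

The a-ion and neutral-loss series used by the scoring matrix could be derived from these values by subtracting CO, NH3, or H2O masses; that extension is left out to keep the sketch short.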
Spectral Processing
Any peptide identification algorithm is only as good as the quality of data it receives. Spectral quality is of great importance for any algorithm to perform at its optimal level. Several studies have been dedicated to spectral quality assessment [12,37,38] to obtain better results from search algorithms. Most algorithms have a built-in filtering mechanism to remove noise peaks and bad spectra from the input raw data. We have employed a simple yet effective filter to perform this task. A spectrum is dynamically divided into mass bins based on its precursor mass, and at most the five most intense peaks are picked from every bin to give better peak coverage from all parts of the spectrum. A minimum intensity threshold can be set for a peak to be considered as signal. Peaks below this threshold are considered noise and deleted before the search. Similarly, a minimum number of peaks can be defined for a spectrum to be considered for search. This reduces random matches and saves time, thus improving the sensitivity and efficiency of the algorithm. Not much is known about the peak filtering step of Mascot. Sequest’s cross-correlation takes care of the spectrum quality. OMSSA applies an intensity threshold cutoff, and X!Tandem uses a maximum of 50 peaks for search by default. In this study, the peak intensity filters were not used, so as to compare all algorithms on complete data irrespective of spectral quality.
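The bin-wise peak filter described above might look like the following Python sketch. The paper does not state how the bin width is derived from the precursor mass, so the equal-width rule and the n_bins parameter are assumptions made here for illustration, as are the function and argument names.

```python
def filter_peaks(peaks, precursor_mass, n_bins=10, top_n=5,
                 min_intensity=0.0, min_peaks=5):
    """Keep the top_n most intense peaks in each m/z bin.

    peaks is a list of (mz, intensity) tuples. The m/z range up to the
    precursor mass is split into n_bins equal-width bins (an assumed rule).
    Returns the surviving peaks sorted by m/z, or None when fewer than
    min_peaks remain and the spectrum should be skipped.
    """
    if precursor_mass <= 0:
        raise ValueError("precursor_mass must be positive")
    bin_width = precursor_mass / n_bins
    bins = {}
    for mz, intensity in peaks:
        if intensity < min_intensity:
            continue  # below the signal threshold: treated as noise
        index = min(int(mz // bin_width), n_bins - 1)
        bins.setdefault(index, []).append((mz, intensity))
    kept = []
    for members in bins.values():
        members.sort(key=lambda p: p[1], reverse=True)
        kept.extend(members[:top_n])
    kept.sort()
    return kept if len(kept) >= min_peaks else None


# Seven peaks crowd one bin; only the five most intense survive.
example = [(110.0 + i, float(i)) for i in range(1, 8)] + [(550.0, 2.0)]
filtered = filter_peaks(example, precursor_mass=1000.0)
```

With min_intensity left at zero, as in the searches reported here, only the per-bin cap and the minimum peak count have any effect.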
MS/MS Database Search
The mgf files were searched against the updated database and its reversed counterpart for target-decoy based FDR calculation. The search parameters were matched as closely as possible to those described in the original paper [35], and defaults were used where this was not possible. Searches were performed with a precursor ion tolerance of 3 Da, a product ion tolerance of 1 Da, trypsin digestion with 1 missed cleavage, a fixed modification of +57.03 Da (carbamidomethylation) at cysteine residues, maximum charge +7, a minimum of 5 peaks, and the peak intensity threshold set to zero.
For the yeast data set from ESI-TRAP, a 3 Da error window was allowed for precursors, while fragment masses were matched within 0.6 Da. Tryptic digestion with 1 missed cleavage was considered, with carbamidomethylation as the fixed modification and oxidation of methionine residues as a variable modification. The other parameters were the same as above.
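As an illustration of "trypsin digestion with 1 missed cleavage", the Python sketch below enumerates candidate peptides from a protein sequence. The rule of not cleaving before proline is a common convention assumed here; the paper does not say which rule MassWiz applies, and the function name is hypothetical.

```python
def tryptic_peptides(protein, missed_cleavages=1):
    """Return peptides from an in-silico tryptic digest.

    Cleaves C-terminal to K or R unless the next residue is P (assumed
    rule), and joins up to missed_cleavages adjacent fragments.
    """
    fragments, current = [], ""
    for i, aa in enumerate(protein):
        current += aa
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)  # C-terminal fragment without K/R

    peptides = []
    for start in range(len(fragments)):
        for extra in range(missed_cleavages + 1):
            end = start + extra + 1
            if end > len(fragments):
                break
            peptides.append("".join(fragments[start:end]))
    return peptides


# Invented toy sequence: the R-P bond is not cleaved.
digest = tryptic_peptides("AKRPGKC", missed_cleavages=1)
```

Each candidate peptide would then be kept only if its mass falls within the precursor tolerance (3 Da in these searches) of the observed precursor mass.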
Target-decoy searches and FDR calculation are integrated into the MassWiz framework. Once a search is complete, the target, decoy, and FDR-corrected files are produced as output. Mascot searches were run on a locally installed Mascot server, version 2.2.04. The target and decoy results were exported as csv without any p-value filters for all PSMs, and FDR was calculated. Sequest searches and result extraction were conducted using Thermo’s Proteome Discoverer 1.1 interface. All rank 1 PSMs were exported to Excel sheets for FDR calculation. X!Tandem (TORNADO) results were parsed from the XML files using a Perl program. From these files, FDR was calculated and FDR-corrected PSMs were written to an output file using another Perl program. OMSSA (2.1.9) results were obtained as csv files, from which FDR was calculated and output files were written using a Perl program.
False Discovery Rate Calculation
The false discovery rate was calculated using Käll’s method [31]. Decoy peptides that had identical corresponding peptides in the target database were excluded from the decoy results during FDR calculation. Leu/Ile were considered indistinguishable and treated as identical. FDR was calculated from database search scores wherever possible.
$$\mathrm{FDR} = \frac{\text{no. of decoy PSMs above threshold}}{\text{no. of target PSMs above threshold}}$$
The target and decoy scores were sorted in descending order, and the FDR was calculated with each decoy score taken in turn as the threshold. The score at which the FDR was 1%, or immediately below 1%, was taken as the score threshold for 1% FDR. For X!Tandem and OMSSA, the e-values were sorted in ascending order and the FDR was calculated as
$$\mathrm{FDR} = \frac{\text{no. of decoy PSMs below e-value threshold}}{\text{no. of target PSMs below e-value threshold}}$$
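The score-based threshold search can be sketched in Python as follows. Two details are assumptions made here, not statements from the paper: "above threshold" is read as "greater than or equal to", and when several decoy scores satisfy the FDR limit the lowest one is returned. Function names are hypothetical.

```python
def drop_shared_decoys(decoy_peptides, target_peptides):
    """Remove decoy peptides that also occur among targets (I treated as L)."""
    def normalize(peptide):
        return peptide.replace("I", "L")

    target_keys = {normalize(p) for p in target_peptides}
    return [p for p in decoy_peptides if normalize(p) not in target_keys]


def score_threshold_at_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest decoy score usable as a threshold with FDR <= max_fdr.

    FDR at threshold t = (# decoy PSMs with score >= t) /
                         (# target PSMs with score >= t).
    Returns None when no decoy score satisfies the limit.
    """
    best = None
    for threshold in sorted(set(decoy_scores), reverse=True):
        n_decoy = sum(1 for s in decoy_scores if s >= threshold)
        n_target = sum(1 for s in target_scores if s >= threshold)
        if n_target == 0:
            continue
        if n_decoy / n_target <= max_fdr:
            best = threshold  # keep scanning: lower thresholds accept more PSMs
    return best


# Toy scores: 250 target PSMs and 3 decoy PSMs (invented values).
toy_targets = [float(s) for s in range(101, 351)]
toy_decoys = [102.0, 50.0, 40.0]
threshold = score_threshold_at_fdr(toy_targets, toy_decoys)
```

For e-value based engines the same logic applies with the comparisons reversed, since lower e-values indicate better matches.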
Comparison of Algorithms
All algorithms were compared after FDR calculation. A Perl program was written to compare the peptides assigned by the five algorithms.
RESULTS AND DISCUSSION
Scoring Function
The most important aspect of a mass spectrometry based peptide identification algorithm is developing a robust scoring function. Variability in fragmentation patterns [39–41], in the extent of fragmentation, and in peak intensities [42,43] across runs, instruments, and methodologies makes the task challenging. We have developed a novel empirical scoring scheme based on the knowledge of ion abundances and their intensities. CID fragmentation patterns have been studied in extensive detail [42,44–47]. On the basis of knowledge gained from the literature, we experimented with several combinations of scores for the ions based on their known abundances and supportive ions. We arrived at empirical weights for different ion types depending on their presence in a particular instrument type (Table 1). For matching a spectrum against a candidate peptide P, the score of the peptide is calculated as
$$\mathrm{score}(P) = S(P) \cdot \sqrt{\frac{\sum_{i=1}^{k} I_i}{\sum_{i=1}^{n} I_i}} \qquad \text{(eq 1)}$$
where
P = candidate peptide
score(P) = final score for the candidate peptide against the experimental spectrum
S(P) = primary score for peptide P (described in detail in eq 2)
k = number of peaks matched
I_i = intensity of the ith peak
n = number of peaks in the experimental spectrum (after processing)
The term under the square root signifies the matched ion current. It was square-root transformed to decrease the effect of intensity irregularities caused by variable fragmentation patterns, and this was found to perform better than a log transformation. Including the intensity factor in our scoring function increases the resolution of correct assignments over random matches. Fragment mass errors can be very helpful in discerning good matches and have been incorporated using an exponential function. In simpler words, the lower the mass error, the better the score for a fragment ion match.
Table 1. Scoring Matrix
                      MALDI     MALDI     MALDI     MALDI     ESI       ESI    ESI    ESI    ESI
ion type    default   TOF/TOF   TOF PSD   QIT-TOF   QUAD-TOF  QUAD-TOF  TRAP   QUAD   FTICR  4-SECTOR
y (a)       100       100       100       100       100       100       100    100    100    100
b (b)       100       100       100       100       100       100       100    100    100    100
a (c)       50        50        50        50        -         -         -      -      -      50
z           -         -         -         -         -         -         -      -      -      50
immonium    -         100       100       100       100       -         -      -      -      100
y-NH3       25        25        -         25        25        25        25     25     25     -
b-NH3       25        25        25        25        25        25        25     25     25     25
a-NH3       25        25        25        25        -         -         -      -      -      -
y-H2O       -         25        -         25        25        25        25     25     25     -
b-H2O       -         25        25        25        25        25        25     25     25     25
a-H2O       -         25        25        25        -         -         -      -      -      -
(a) A bonus score of 50 is awarded for y-ion continuity, and a score of 50 is deducted for discontinuous y-ions.
(b) A bonus score of 20 is awarded for b-ion continuity, and a score of 20 is deducted for discontinuous b-ions.
(c) No score for a-ion continuity/discontinuity, so the value of C_ij for a-ions in eq 2 will be zero.
Table 2. Spectra and Peptides Assigned by the Five Algorithms in a Standard Mixture of 18 Proteins and in a Complex Mid-log Phase Yeast Data Set
The primary score S(P) is calculated as
$$S(P) = \sum_{i \in \{y,b,a\}} \sum_{j=1}^{n} \left( \frac{X_{ij} + C_{ij}}{e^{|\Delta m_{ij}|}} + \frac{N_{ij}}{e^{|\Delta m_{ij}|}} + \frac{W_{ij}}{e^{|\Delta m_{ij}|}} \right) + \sum_{j=1}^{k} \frac{Q_j}{e^{|\Delta m_j|}} \qquad \text{(eq 2)}$$
where
n = total peaks in the theoretical spectrum for a given ion series (y/b/a type ion)
for a given i ∈ y/b/a ion series:
X_ij = score for the jth peak matched
C_ij = bonus score for continuity when the j and j − 1 peaks matched, and a negative score for a discontinuous ion series, i.e., when the j − 1 peak matches but j does not
N_ij = score for the jth matched peak for neutral loss of ammonia (NH3) when X_ij ≠ 0
W_ij = score for the jth matched peak for neutral loss of water (H2O) when X_ij ≠ 0
Δm = mass difference for the matched fragment peak, i.e., M_experimental − M_theoretical
k = total peaks in the theoretical spectrum for the immonium ion series
Q_j = score for the jth matched peak for an immonium ion
These empirical scores are taken from the scoring matrix given in Table 1. The scoring function adapts to the irregularities of instrument types, as it makes extensive use of the information content present in the spectrum in addition to y- and b-ions. Complementary ions such as neutral losses and immonium ions (depending on the instrument type) can help differentiate between a correct and an incorrect hit when the b-y counts are very close. Also, the continuity of a series (b/y/a) greatly increases the confidence in the matched ions even when fragmentation is incomplete, as in peptides with partially mobile or nonmobile protons.
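The following Python sketch shows one way eq 2 could be evaluated, using the "default" column of Table 1 and the continuity values from its footnotes. It is an interpretation, not the authors' Perl implementation. The input layout (one dict of absolute mass errors per theoretical position) is hypothetical. Two points are assumptions: each neutral-loss peak is damped by its own mass error, and the discontinuity penalty is applied undamped because no mass error exists for an unmatched peak.

```python
import math

# "default" instrument column of Table 1: X = ion weight, C = continuity
# bonus/penalty (footnotes), N = NH3-loss weight, W = H2O-loss weight.
DEFAULT_WEIGHTS = {
    "y": {"X": 100, "C": 50, "N": 25, "W": 0},
    "b": {"X": 100, "C": 20, "N": 25, "W": 0},
    "a": {"X": 50, "C": 0, "N": 25, "W": 0},
}


def primary_score(matches, immonium_errors=(), weights=DEFAULT_WEIGHTS,
                  immonium_weight=0):
    """Evaluate S(P) from per-position match records.

    matches maps a series name ("y", "b", "a") to a list with one dict per
    theoretical position; keys "ion", "nh3", "h2o" hold the mass error (Da)
    of a matched peak and are absent when nothing matched.
    """
    total = 0.0
    for series, positions in matches.items():
        w = weights[series]
        previous_matched = False
        for pos in positions:
            dm = pos.get("ion")
            matched = dm is not None
            if matched:
                bonus = w["C"] if previous_matched else 0
                total += (w["X"] + bonus) / math.exp(abs(dm))
                # Neutral losses count only when the parent ion matched.
                for key, label in (("N", "nh3"), ("W", "h2o")):
                    loss_dm = pos.get(label)
                    if loss_dm is not None:
                        total += w[key] / math.exp(abs(loss_dm))
            elif previous_matched:
                total -= w["C"]  # series broken after a match
            previous_matched = matched
    for dm in immonium_errors:
        total += immonium_weight / math.exp(abs(dm))
    return total


toy_matches = {
    "y": [{"ion": 0.0}, {"ion": 0.0, "nh3": 0.0}, {}],
    "b": [{"ion": 0.0}, {}],
}
toy_primary = primary_score(toy_matches)
```

In the toy input the y series contributes 100 + 150 + 25 − 50 and the b series 100 − 20, so perfect mass accuracy gives 305; any nonzero mass error shrinks each matched term by a factor of e^(−|Δm|).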
We compared MassWiz with four widely used algorithms: Mascot, Sequest, X!Tandem, and OMSSA. Six different data sets from the ISB standard mixture of 18 proteins and known contaminants were searched using all five algorithms with parameters matched as closely as possible. For parameters over which we had no control, the defaults were used. Broadly, all data sets were searched at 3 Da precursor tolerance, 1 Da fragment mass tolerance, tryptic digestion with 1 missed cleavage, and a static modification of carbamidomethylation at cysteine residues. The significance tests used by the algorithms differ in their statistical models and assumptions, so they are not directly comparable. Multiple hypothesis testing correction is accomplished by controlling the FDR at a fixed value. FDR can be easily estimated using a target-decoy based strategy. The algorithms were compared after 1% FDR correction was applied to their results.
Figure 1. Comparison of number of spectra identified by MassWiz, Mascot, Sequest, and OMSSA for data sets from different instruments at 1% FDR.
Figure 2. Comparison of number of peptides identified by MassWiz, Mascot, Sequest, and OMSSA for data sets from different instruments at 1% FDR.
Figure 3. Comparison of number of (A) spectra and (B) peptides identified by MassWiz, Mascot, Sequest, and OMSSA for mid-log phase yeast data set at 1% FDR.
The number of assigned spectra and unique peptides are shown in Table 2, which depicts the performance of the algorithms on data sets from different instruments. In terms of spectral assignments, MassWiz performs better than Sequest, X!Tandem, and OMSSA for all instrument types except ABI-4700, as shown in Figure 1. Between Mascot and MassWiz, the former performs slightly better in a few data sets, while the latter was better in the AGILENT-XCT data set. When we compare the number of uniquely identified peptides by the algorithms, similar trends are observed (Figure 2). Although MassWiz identified 0.65% (297) fewer spectra than Mascot, it identified 7.2% (3284) more than OMSSA, 17.1% (7820) more than Sequest, and 43.5% (19875) more than X!Tandem in the standard mixture (Table 2A). Similarly, it assigned 2.8% (63) fewer peptides than Mascot while assigning 4.6% (103) more than OMSSA, 15.9% (358) more than Sequest, and 21.4% (481) more than X!Tandem. We observed that, apart from identifying new peptides, MassWiz also identified a high number of peptides that were observed by other methods. Mascot shows the highest number of uniquely identified peptides, which explains its high number of assignments. Similar analyses were carried out on the yeast data set (Table 2B), where MassWiz assigned nearly as many as Mascot, but OMSSA assigned a significantly larger number of spectra and peptides than all the other algorithms (Figure 3A and B).
While the number of spectra and peptides assigned by an algorithm has traditionally been used as a metric for comparing algorithms, the quality of assignments is generally not checked. The main reason is the subjective nature of manual validation, which also depends on the expertise of the person performing it. We used an objective method in which the agreement between algorithms serves as a measure of peptide quality. It has been shown that multiple-algorithm consensus enhances the accuracy of peptide identification [48]. To compare the algorithms for their quality of matches, a set of high-confidence peptides is required. So, we mapped the overlaps between the algorithms for all identified peptides. For each data set, we segregated peptides identified by at least three algorithms and termed these “high-confidence peptides”. The number of identified and missed high-confidence peptides for each algorithm is shown in Figure 4. The figure shows that MassWiz identifies the highest proportion of such peptides in four data sets, and in two data sets OMSSA performs slightly better. In the yeast data set, most of the unique OMSSA assignments were either single spectra or nonconsensus assignments. MassWiz lags a little behind Mascot, OMSSA, and Sequest in the ABI-4700 data set. The data are also tabulated in Supplementary Tables 1A and 1B. Similar trends were observed for peptides identified by at least 2 algorithms and by at least 4 algorithms, which strengthens the confidence in these observations (Supplementary Figures 1 and 2). Overall, MassWiz identifies the largest number of high-confidence peptides considering all standard mixture data sets together and misses the fewest. This makes MassWiz a versatile and useful algorithm for various instrument platforms and well suited to high mass accuracy data, which are gaining popularity owing to fast technological improvements.
It has been previously shown that the consensus of three search algorithms can yield higher sensitivity and specificity than a single search engine [17]. MassWiz agrees closely with the consensus of three algorithms, which makes it useful when used singly or in combination with other algorithms.
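The consensus analysis can be sketched in Python as below (the authors used a Perl program for this comparison; this is only an illustration). Function names are hypothetical, and the peptide sets are invented toy data rather than results from the paper.

```python
from collections import Counter


def high_confidence_peptides(results, min_algorithms=3):
    """Return peptides reported by at least min_algorithms search engines.

    results maps an algorithm name to the set of peptides it identified
    at the chosen FDR.
    """
    counts = Counter()
    for peptides in results.values():
        counts.update(set(peptides))
    return {pep for pep, n in counts.items() if n >= min_algorithms}


def identified_and_missed(results, min_algorithms=3):
    """Per algorithm: (high-confidence peptides found, high-confidence missed)."""
    pool = high_confidence_peptides(results, min_algorithms)
    return {name: (len(pool & set(peps)), len(pool - set(peps)))
            for name, peps in results.items()}


# Invented toy identifications for five engines.
toy_results = {
    "MassWiz": {"AAK", "CCR", "DDK"},
    "Mascot": {"AAK", "CCR"},
    "Sequest": {"AAK", "DDK"},
    "OMSSA": {"AAK", "CCR"},
    "X!Tandem": {"EEK"},
}
toy_pool = high_confidence_peptides(toy_results)
toy_summary = identified_and_missed(toy_results)
```

Changing min_algorithms to 2 or 4 reproduces the alternative consensus levels mentioned for the supplementary figures.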
Figure 4. Comparison of number of identified and missed “high-confidence peptides” by MassWiz, Mascot, Sequest, OMSSA, and X!Tandem for standard mixture on different instruments (first six data series) and mid-log phase yeast data (last series) at 1% FDR. Peptides identified by any three out of five algorithms are considered as high-confidence peptides. 100% corresponds to a pool of high-confidence peptides from the five algorithms.
CONCLUSIONS
Our results show that MassWiz is an efficient, accurate, and versatile algorithm. Being open-source and configurable, modifications to the scoring function or development of supplementary plug-ins can be easily achieved through community participation. The results show that MassWiz is an effective algorithm for high-confidence peptide identification without compromising on the number of assignments.
As ETD is being explored in greater detail, we intend to extend the scoring algorithm to incorporate ETD data analysis in future work.
ASSOCIATED CONTENT
Supporting Information
Supplementary Table 1 shows the comparison of high-confidence peptides for the five algorithms in (A) six standard mix data sets and (B) yeast mid-log phase data set. Supplementary Figure 1 shows comparison of peptides identified by two or more algorithms. Supplementary Figure 2 shows comparison of peptides identified by four or more algorithms. This material is available free of charge via the Internet at http://pubs.acs.org.
ACKNOWLEDGMENT
The authors thank Dr. Rajesh Gokhale, Dr. Anurag Agrawal, Dr. Shantanu Sengupta, and Dr. Akhilesh Pandey for their valuable suggestions. We also thank Dr. G. P. Singh for his insightful comments while proofreading the manuscript. We thank Dhanashree S. Kelkar for providing input to the manuscript. The work was supported by CSIR SRF grant and CSIR network project on Plasma Proteomics – Health, Environment and Disease (NWP-04).
’REFERENCES
(1) Karas, M.; Hillenkamp, F. Laser desorption ionization of pro-teins with molecular masses exceeding 10,000 Da. Anal. Chem. 1988, 60(20), 2299–2301.(2) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse,
C. M. Electrospray ionization for mass spectrometry of large biomole-cules. Science 1989, 246 (4926), 64–71.(3) Steen, H.;Mann,M. The abc’s (and xyz’s) of peptide sequencing.
Nat. Rev. Mol. Cell Biol. 2004, 5 (9), 699–711.(4) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics.
Nature 2003, 422 (6928), 198–207.(5) Washburn,M. P.;Wolters, D.; Yates, J. R., III Large-scale analysis
of the yeast proteome by multidimensional protein identificationtechnology. Nat. Biotechnol. 2001, 19 (3), 242–247.(6) Matthiesen, R. Extracting monoisotopic single-charge peaks
from liquid chromatography-electrospray ionization-mass spectrometry.Methods Mol. Biol. 2007, 367, 37–48.(7) Nguyen, N.; Huang, H.; Oraintara, S.; Vo, A. Peak detection in
mass spectrometry by Gabor filters and envelope analysis. J. Bioinform.Comput. Biol. 2009, 7 (3), 547–569.(8) Zhang, S.; DeGraba, T. J.; Wang, H.; Hoehn, G. T.; Gonzales,
D. A.; Suffredini, A. F.; Ching, W. K.; Ng, M. K.; Zhou, X.; Wong, S. T. Anovel peak detection approach with chemical noise removal using short-time FFT for prOTOF MS data. Proteomics 2009, 9 (15), 3833–3842.(9) Tabb, D. L.; Shah, M. B.; Strader, M. B.; Connelly, H. M.;
Hettich, R. L.; Hurst, G. B. Determination of peptide and protein ioncharge states by Fourier transformation of isotope-resolved massspectra. J. Am. Soc. Mass Spectrom. 2006, 17 (7), 903–915.(10) Sadygov, R. G.; Hao, Z.; Huhmer, A. F. Charger: combination
of signal processing and statistical learning algorithms for precursorcharge-state determination from electron-transfer dissociation spectra.Anal. Chem. 2008, 80 (2), 376–386.
(11) Flikka, K.;Martens, L.; Vandekerckhove, J.; Gevaert, K.; EidhammerI. Improving the reliability and throughput of mass spectrometry-basedproteomics by spectrum quality filtering. Proteomics 2006, 6 (7), 2086–2094.
(12) Salmi, J.; Nyman, T. A.; Nevalainen, O. S.; Aittokallio, T.Filtering strategies for improving protein identification in high-through-put MS/MS studies. Proteomics 2009, 9 (4), 848–860.
(13) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empiricalstatistical model to estimate the accuracy of peptide identifications madebyMS/MS and database search. Anal. Chem. 2002, 74 (20), 5383–5392.
(14) Eriksson, J.; Fenyo, D. The statistical significance of proteinidentification results as a function of the number of protein sequencessearched. J. Proteome Res. 2004, 3 (5), 979–982.
(15) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and valida-tion of proteomic data generated by tandem mass spectrometry. Nat.Methods 2007, 4 (10), 787–797.
(16) Nesvizhskii, A. I.; Aebersold, R. Analysis, statistical validationand dissemination of large-scale proteomics datasets generated bytandem MS. Drug Discovery Today 2004, 9 (4), 173–181.
(17) Sultana, T.; Jordan, R.; Lyons-Weiler, J. Optimization of the useof consensus methods for the detection and putative identification ofpeptides via mass spectrometry using protein standard mixtures. J.Proteomics Bioinform. 2009, 2 (6), 262–273.
(18) MacLean, B.; Eng, J. K.; Beavis, R. C.; McIntosh, M. Generalframework for developing and evaluating database scoring algorithmsusing the TANDEM search engine. Bioinformatics 2006, 22 (22),2830–2832.
(19) Keller, A.; Eng, J.; Zhang, N.; Li, X. J.; Aebersold, R. A uniformproteomics MS/MS analysis platform utilizing open XML file formats.Mol. Syst. Biol. 2005, 1, 2005.
(20) Tanner, S.; Shu, H.; Frank, A.; Wang, L. C.; Zandi, E.; Mumby,M.; Pevzner, P. A.; Bafna, V. InsPecT: identification of posttranslation-ally modified peptides from tandem mass spectra. Anal. Chem. 2005, 77(14), 4626–4639.
(21) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S.Probability-based protein identification by searching sequence databasesusingmass spectrometry data. Electrophoresis 1999, 20 (18), 3551–3567.
(22) Eng, J. K.; McCormack, A. L.; Yates, J. R., III An approach tocorrelate tandem mass spectral data of peptides with amino acidsequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5(11), 976–989.
(23) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–1467.
(24) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3 (5), 958–964.
(25) Kapp, E. A.; Schutz, F.; Connolly, L. M.; Chakel, J. A.; Meza, J. E.; Miller, C. A.; Fenyo, D.; Eng, J. K.; Adkins, J. N.; Omenn, G. S.; Simpson, R. J. An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics 2005, 5 (13), 3475–3490.
(26) Dagda, R. K.; Sultana, T.; Lyons-Weiler, J. Evaluation of the consensus of four peptide identification algorithms for tandem mass spectrometry based proteomics. J. Proteomics Bioinform. 2010, 3, 39–47.
(27) Havilio, M.; Haddad, Y.; Smilansky, Z. Intensity-based statistical scorer for tandem mass spectrometry. Anal. Chem. 2003, 75 (3), 435–444.
(28) Narasimhan, C.; Tabb, D. L.; VerBerkmoes, N. C.; Thompson, M. R.; Hettich, R. L.; Uberbacher, E. C. MASPIC: intensity-based tandem mass spectrometry scoring scheme that improves peptide identification at high confidence. Anal. Chem. 2005, 77 (23), 7581–7593.
(29) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol. Biol. 2010, 604, 55–71.
(30) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207–214.
(31) Kall, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008, 7 (1), 29–34.
(32) Moore, R. E.; Young, M. K.; Lee, T. D. Qscore: an algorithm for evaluating SEQUEST database search results. J. Am. Soc. Mass Spectrom. 2002, 13 (4), 378–386.
(33) Wang, G.; Wu, W. W.; Zhang, Z.; Masilamani, S.; Shen, R. F. Decoy methods for assessing false positives and false discovery rates in shotgun proteomics. Anal. Chem. 2009, 81 (1), 146–159.
(34) Blanco, L.; Mead, J. A.; Bessant, C. Comparison of novel decoy database designs for optimizing protein identification searches using ABRF sPRG2006 standard MS/MS data sets. J. Proteome Res. 2009, 8 (4), 1782–1791.
(35) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H.; Schmidt, A.; Ossola, R.; Eng, J. K.; Aebersold, R.; Martin, D. B. The standard protein mix database: a diverse data set to assist in the production of improved Peptide and protein identification software tools. J. Proteome Res. 2008, 7 (1), 96–103.
(36) Colinge, J.; Masselot, A.; Carbonell, P.; Appel, R. D. InSilicoSpectro: An open-source proteomics library. J. Proteome Res. 2006, 5 (3), 619–624.
(37) Hoopmann, M. R.; Finney, G. L.; MacCoss, M. J. High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. Anal. Chem. 2007, 79 (15), 5620–5632.
(38) Kast, J.; Gentzel, M.; Wilm, M.; Richardson, K. Noise filtering techniques for electrospray quadrupole time of flight mass spectra. J. Am. Soc. Mass Spectrom. 2003, 14 (7), 766–776.
(39) Wysocki, V. H.; Tsaprailis, G.; Smith, L. L.; Breci, L. A. Mobile and localized protons: a framework for understanding peptide dissociation. J. Mass Spectrom. 2000, 35 (12), 1399–1406.
(40) Tabb, D. L.; Huang, Y.; Wysocki, V. H.; Yates, J. R., III Influence of basic residue content on fragment ion peak intensities in low-energy collision-induced dissociation spectra of peptides. Anal. Chem. 2004, 76 (5), 1243–1248.
(41) Breci, L. A.; Tabb, D. L.; Yates, J. R., III; Wysocki, V. H. Cleavage N-terminal to proline: analysis of a database of peptide tandem mass spectra. Anal. Chem. 2003, 75 (9), 1963–1971.
(42) Khatun, J.; Ramkissoon, K.; Giddings, M. C. Fragmentation characteristics of collision-induced dissociation in MALDI TOF/TOF mass spectrometry. Anal. Chem. 2007, 79 (8), 3032–3040.
(43) Kapp, E. A.; Schutz, F.; Reid, G. E.; Eddes, J. S.; Moritz, R. L.; O’Hair, R. A.; Speed, T. P.; Simpson, R. J. Mining a tandem mass spectrometry database to determine the trends and global factors influencing peptide fragmentation. Anal. Chem. 2003, 75 (22), 6251–6264.
(44) Frank, A. M. Predicting intensity ranks of peptide fragment ions. J. Proteome Res. 2009, 8 (5), 2226–2240.
(45) Bythell, B. J.; Suhai, S.; Somogyi, A.; Paizs, B. Proton-driven amide bond-cleavage pathways of gas-phase peptide ions lacking mobile protons. J. Am. Chem. Soc. 2009, 131 (39), 14057–14065.
(46) Paizs, B.; Suhai, S. Fragmentation pathways of protonated peptides. Mass Spectrom. Rev. 2005, 24 (4), 508–548.
(47) Cramer, R.; Corless, S. The nature of collision-induced dissociation processes of doubly protonated peptides: comparative study for the future use of matrix-assisted laser desorption/ionization on a hybrid quadrupole time-of-flight mass spectrometer in proteomics. Rapid Commun. Mass Spectrom. 2001, 15 (22), 2058–2066.
(48) Yu, W.; Taylor, J. A.; Davis, M. T.; Bonilla, L. E.; Lee, K. A.; Auger, P. L.; Farnsworth, C. C.; Welcher, A. A.; Patterson, S. D. Maximizing the sensitivity and reliability of peptide identification in large-scale proteomic experiments by harnessing multiple search engines. Proteomics 2010, 10 (6), 1172–1189.