

MassWiz: A Novel Scoring Algorithm with Target-Decoy Based Analysis Pipeline for Tandem Mass Spectrometry

Amit Kumar Yadav, Dhirendra Kumar, and Debasis Dash*

Institute of Genomics and Integrative Biology (CSIR), Mall Road, Delhi, India

ARTICLE | J. Proteome Res. 2011, 10, 2154–2160 | dx.doi.org/10.1021/pr200031z | pubs.acs.org/jpr

Published: March 18, 2011

© 2011 American Chemical Society

Supporting Information

Received: July 19, 2010

ABSTRACT: Mass spectrometry has made rapid advances in the recent past and has become the preferred method for proteomics. Although many open source algorithms for peptide identification exist, such as X!Tandem and OMSSA, the field has largely been a domain of proprietary software. There is a need for better, freely available, and configurable algorithms that can help in identifying the correct peptides while keeping the false positives to a minimum. We have developed MassWiz, a novel empirical scoring function that gives appropriate weights to major ions, continuity of b-y ions, intensities, and the supporting neutral losses based on the instrument type. We tested MassWiz accuracy on 486,882 spectra from a standard mixture of 18 proteins generated on 6 different instruments downloaded from the Seattle Proteome Center public repository. We compared the MassWiz algorithm with Mascot, Sequest, OMSSA, and X!Tandem at 1% FDR. MassWiz outperformed all in the largest data set (AGILENT XCT) and was second only to Mascot in the other data sets. MassWiz showed good performance in the analysis of high confidence peptides, i.e., those identified by at least three algorithms. We also analyzed a yeast data set containing 106,133 spectra downloaded from the NCBI Peptidome repository and got similar results. The results demonstrate that MassWiz is an effective algorithm for high-confidence peptide identification without compromising on the number of assignments. MassWiz is open-source, versatile, and easily configurable.

KEYWORDS: Tandem mass spectrometry, proteomics, peptide identification, bioinformatics, open source, algorithm, FDR, MS/MS

INTRODUCTION

With the advent of soft ionization techniques like MALDI1 and ESI,2 it became possible to ionize highly polar and nonvolatile molecules such as peptides without destroying them. They could now be introduced into a mass spectrometer, making analysis of peptides a lot easier. Sequence database searching emerged as a valuable alternative to de novo sequencing. Due to the rapid advances made in MS instrumentation (LTQ, QTOF, FTICR, Orbitrap, etc.), availability of complete genome sequences, increased computational power for data analyses, and development of algorithms, mass spectrometry has become the method of choice for proteomics studies.3,4 Washburn et al.5 showed the applicability of the high-throughput LC-MS approach to the yeast proteome, establishing shotgun proteomics as a valuable methodology.

There have been improvements in bioinformatics tools and algorithms for signal processing and peak detection,6–8 charge state deconvolution,9,10 noise removal8,11 and spectra filtering,11,12 database searches, and assigning statistical confidence.13–16 Due to the various steps involved in data analysis and their complexity, no single method can be a complete solution.17 There is a lot of scope for newer bioinformatics methods and algorithms, especially those available freely in the public domain for rapid advancement of the field. Tools such as the k-score plugin18 for X!Tandem, the Trans-Proteomics Pipeline (TPP),19 and InsPecT20 are some of the excellent examples.

A robust scoring function is the heart of any peptide identification algorithm. The scoring functions can be broadly divided into probabilistic and empirical scoring schemes. Mascot21 is one of the most widely used probability based algorithms, whereas SEQUEST22 is based on cross-correlation between the theoretical and experimental spectra. X!Tandem23 uses a hypergeometric model, and OMSSA24 relies on a Poisson distribution to assess the significance of matches. While all algorithms have their inherent pros and cons, no single method can capture all of the information content from an MS experiment.25 It has been generally agreed that using multiple algorithms increases the number of assignments.17,26

We present a novel empirical scoring algorithm that aims to maximize the identifications while keeping the false positives (incorrect identifications) to a minimum. Our scoring function assigns different weights to key ions, their consecutive occurrence, their intensities, and their supporting ions. Significance of intensity as a parameter has been previously shown;27,28 it helps discriminate between a correct and a random match. For developing and testing the scoring function, we needed an easily modifiable framework. So, we developed the required framework in Perl, which was easy to implement and modify. Although it may not be comparable with the existing algorithms in time-performance, it can still be very useful as Perl code can easily be modified to tweak the algorithm. We benchmarked MassWiz accuracy against Mascot, Sequest, X!Tandem, and OMSSA by comparing the number of identified high-confidence peptides from a standard mixture of 18 proteins.

Decoy methods29–31 have become popular for estimation of false discovery rates (FDR). Although Moore et al.32 first used the method by simply reversing the target database, many alternatives have been suggested.33,34 It is now used for assigning significance to peptide identifications at a fixed FDR value.31 We have integrated the reverse database decoy strategy for significance assessment that is free from distribution assumptions and does not require curve-fitting. The MassWiz executable is available on SourceForge (http://sourceforge.net/projects/masswiz), and the source code is available freely for academic use on request.
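The reverse database decoy strategy mentioned above can be illustrated with a minimal Python sketch. This is not the MassWiz Perl code; the function name `reverse_decoy`, the `REV_` header prefix, and the example sequences are hypothetical and chosen only for illustration.

```python
def reverse_decoy(fasta_records, prefix="REV_"):
    """Build decoy records by reversing every target protein sequence.

    fasta_records: list of (header, sequence) tuples.
    prefix: hypothetical tag that marks decoy headers (an assumption).
    """
    return [(prefix + header, sequence[::-1]) for header, sequence in fasta_records]


# Toy target database (made-up sequences, not from the ISB mix).
targets = [("P1", "MKWVTFISLLK"), ("P2", "ACDEFGHIK")]
decoys = reverse_decoy(targets)
```

Reversal keeps the amino acid composition and length distribution of the target database, which is why decoy hits are taken as a proxy for random matches.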

MATERIALS AND METHODS

Data Set

A data set of a standard mixture of 18 proteins, the “ISB standard protein mix” described by Klimek et al.,35 was used for validating MassWiz and comparing its accuracy against other algorithms. The Mix 3 data set for all six instruments was downloaded from http://regis-web.systemsbiology.net/PublicDatasets in mzXML format. The FASTA database (database of 18 proteins mix, contaminants, and Haemophilus influenzae sequences) was also downloaded and updated with recent sequences for all standard proteins and their homologues.

For testing MassWiz on a biological data set, we downloaded yeast mid-log phase data from the NCBI Peptidome repository (http://www.ncbi.nlm.nih.gov/peptidome/psm1001). The FASTA database was downloaded from Swissprot using the taxonomy filter Saccharomyces cerevisiae (Baker’s yeast) [4932] complete proteome, containing 6616 sequences.

Input Data Preparation

All mzXML files were converted to mascot generic format (mgf) using the MzXML2search executable from TPP. For each instrument, all mgf files thus obtained were concatenated and used as a common search input to all algorithms.

Algorithm Implementation

The scoring algorithm was tested by developing a framework in Perl (version 5.10.1). The mass calculations and theoretical spectrum generation were accomplished using the InSilicoSpectro package.36 The MassWiz framework includes a complete pipeline from handling the input spectra to generating FDR corrected peptide spectrum matches (PSMs), i.e., the top ranked peptide for each spectrum.
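To make the notion of a theoretical spectrum concrete, the following Python sketch computes singly charged b- and y-ion m/z values from monoisotopic residue masses. It is an independent illustration and does not use or reproduce the InSilicoSpectro package; the function name `by_ions` and the `fixed_mods` argument are hypothetical.

```python
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)

# Monoisotopic residue masses of the 20 standard amino acids (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def by_ions(peptide, fixed_mods=None):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    fixed_mods: optional dict of residue -> mass shift in Da (hypothetical
    argument, e.g. a carbamidomethyl shift on "C").
    """
    fixed_mods = fixed_mods or {}
    masses = [RESIDUE_MASS[aa] + fixed_mods.get(aa, 0.0) for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for mass in masses[:-1]:            # b1 .. b(n-1), N-terminal fragments
        running += mass
        b_ions.append(running + PROTON)
    running = 0.0
    for mass in reversed(masses[1:]):   # y1 .. y(n-1), C-terminal fragments
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


b_series, y_series = by_ions("PEPTIDE")
```

Neutral-loss peaks such as b-NH3 or y-H2O follow by subtracting the mass of ammonia or water from the corresponding b- or y-ion value.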

Spectral Processing

Any peptide identification algorithm is only as good as the quality of data it receives. Spectral quality is of great importance for any algorithm to perform at its optimal level. Several studies have been dedicated to spectral quality assessment12,37,38 to obtain better results from search algorithms. Most algorithms have an inbuilt filtering mechanism to remove noise peaks and bad spectra from the input raw data. We have employed a simple yet effective filter to perform this task. A spectrum is dynamically divided into mass-bins based on its precursor mass, and a maximum of five most intense peaks are picked from every bin to have better peak coverage from all parts of the spectrum. A minimum intensity threshold can be set for a peak to be considered as signal. Peaks below this are considered noise and deleted before search. Similarly, the minimum number of peaks can be defined for a spectrum to be considered for search. This reduces random matches and saves time, thus improving sensitivity and efficiency of the algorithm. Not much is known about the peak filtering step of Mascot. Sequest’s cross correlation takes care of the spectrum quality. OMSSA applies an intensity threshold cutoff, and X!Tandem uses a maximum of 50 peaks for search by default. The peak intensity filters were not used so as to compare all algorithms on complete data, irrespective of the spectra quality.
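The binning filter described above can be sketched in Python as follows. The text does not state how the bin boundaries are derived from the precursor mass, so the fixed `bin_width` of 100 Da used here is purely an assumption, as are the function and parameter names.

```python
import math
from collections import defaultdict


def filter_peaks(peaks, precursor_mass, bin_width=100.0, top_n=5, min_intensity=0.0):
    """Keep at most top_n most intense peaks per mass bin.

    peaks: list of (mz, intensity) tuples.
    bin_width: assumed bin size in Da; the number of bins then scales
    with the precursor mass (the actual MassWiz rule is not given).
    """
    n_bins = max(1, math.ceil(precursor_mass / bin_width))
    bins = defaultdict(list)
    for mz, intensity in peaks:
        if intensity < min_intensity:
            continue                      # treated as noise and dropped
        index = min(int(mz // bin_width), n_bins - 1)
        bins[index].append((mz, intensity))
    kept = []
    for members in bins.values():
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:top_n])
    return sorted(kept)                   # back in m/z order


# Seven peaks fall in the first bin, one in the second.
example = [(10.0 * i, float(i)) for i in range(1, 8)] + [(150.0, 2.0)]
filtered = filter_peaks(example, precursor_mass=200.0)
```

Picking peaks per bin rather than globally keeps weak but informative fragments from sparsely populated regions of the spectrum.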

MS/MS Database Search

The mgf files were searched using the updated database and its reversed database for target-decoy based FDR calculation. The search parameters were matched as closely as possible to those described in the original paper,35 and defaults were taken where this was not possible. Searches were performed with a precursor ion tolerance of 3 Da, product ion tolerance of 1 Da, trypsin digestion with 1 missed cleavage, a fixed modification of +57.03 Da (carbamidomethylation) at cysteine residues, maximum charge +7, minimum 5 peaks, and peak intensity threshold set to zero.

For the yeast data set from ESI-TRAP, a 3 Da error window was allowed for precursors while fragment masses were allowed to be matched at 0.6 Da. Tryptic digestion with 1 missed cleavage was considered, with carbamidomethylation as the fixed modification and oxidation of methionine residues as a variable modification for the search. The other parameters were the same as above.

Target-decoy searches and FDR calculation are integrated into the MassWiz framework. Once a search is complete, we get the target, decoy, and FDR corrected files as output. Mascot was searched using a locally installed Mascot server, version 2.2.04. The target and decoy results were exported as csv without any p-value filters for all PSMs, and FDR was calculated. Sequest searches and result extraction were conducted using Thermo’s Proteome Discoverer 1.1 interface. All rank 1 PSMs were exported to Excel sheets for FDR calculation. X!Tandem (TORNADO) results were parsed from the XML files using a Perl program. From these files, FDR was calculated and FDR corrected PSMs were written to an output file using another Perl program. OMSSA (2.1.9) results were obtained as csv files from which FDR was calculated and output files were written using a Perl program.

False Discovery Rate Calculation

The false discovery rate was calculated using Käll’s method.31 The decoy peptides that had identical corresponding peptides in the target database were excluded from the decoy results during FDR calculation. Leu/Ile were considered indistinguishable and treated as identical. FDR was calculated from database search scores wherever possible.

$$\mathrm{FDR} = \frac{\text{no. of decoy PSMs above threshold}}{\text{no. of target PSMs above threshold}}$$

The target and decoy scores were sorted in descending order and the FDR was calculated at each decoy score taken as the threshold. The score at which the FDR was calculated to be 1% or immediately below 1% was taken as the score threshold for 1% FDR. For X!Tandem and OMSSA, the e-values were sorted in ascending order and FDR was calculated as

$$\mathrm{FDR} = \frac{\text{no. of decoy PSMs below e-value threshold}}{\text{no. of target PSMs below e-value threshold}}$$
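The threshold search described in this section can be sketched in Python as below. The sketch makes two assumptions that the text leaves open: PSMs exactly at the threshold are counted as passing, and the most permissive decoy score that still satisfies the FDR limit is returned. All function names are hypothetical, and the scores in the example are synthetic.

```python
from bisect import bisect_left


def drop_shared_decoys(decoy_peptides, target_peptides):
    """Discard decoy peptides that also occur in the target set (I treated as L)."""
    normalized_targets = {p.replace("I", "L") for p in target_peptides}
    return [p for p in decoy_peptides if p.replace("I", "L") not in normalized_targets]


def threshold_at_fdr(target_scores, decoy_scores, max_fdr=0.01, higher_is_better=True):
    """Return the most permissive decoy score whose FDR is <= max_fdr, or None.

    For e-values set higher_is_better=False; scores are negated internally
    so that one code path serves both formulas.
    """
    sign = 1 if higher_is_better else -1
    targets = sorted(sign * s for s in target_scores)
    decoys = sorted(sign * s for s in decoy_scores)
    for candidate in decoys:              # ascending: most permissive first
        n_target = len(targets) - bisect_left(targets, candidate)
        n_decoy = len(decoys) - bisect_left(decoys, candidate)
        if n_target > 0 and n_decoy / n_target <= max_fdr:
            return sign * candidate
    return None


# Synthetic scores: 1000 targets, 20 low-scoring decoys and one high-scoring decoy.
target_scores = list(range(1, 1001))
decoy_scores = [100] * 20 + [600]
cutoff = threshold_at_fdr(target_scores, decoy_scores)
```

With these synthetic values, the cutoff at 100 gives 21/901 decoy-to-target PSMs (above 1%), whereas the cutoff at 600 gives 1/401 (below 1%), so 600 is returned.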

Comparison of Algorithms

All algorithms were compared after FDR calculation. A Perl program was written to compare the peptides assigned by the five algorithms.

RESULTS AND DISCUSSION

Scoring Function

The most important aspect of a mass spectrometry based peptide identification algorithm is developing a robust scoring function. Due to variability in the fragmentation patterns,39–41 extent of fragmentation, and intensities of the peaks42,43 across runs, instruments, and methodologies, the task becomes challenging. We have developed a novel empirical scoring scheme based on the knowledge of ion abundances and their intensities. CID fragmentation patterns have been studied in extensive detail in several studies.42,44–47 On the basis of knowledge gained from the literature, we experimented with several combinations of scores for the ions based on their known abundances and supportive ions. We arrived at the empirical weights for different ion types depending on their presence in a particular instrument type (Table 1). For matching a spectrum against a candidate peptide P, the score of the peptide is calculated as

$$\mathrm{score}(P) = S(P) \cdot \sqrt{\frac{\sum_{i=1}^{k} I_i}{\sum_{i=1}^{n} I_i}} \qquad (\text{eq 1})$$

where
P = candidate peptide
score(P) = final score for the candidate peptide against the experimental spectrum
S(P) = primary score for peptide P (described in detail in eq 2)

Table 1. Scoring Matrix

                      |------------- MALDI --------------|  |--------------- ESI ---------------|
ion type    default   TOF/TOF  TOF PSD  QIT-TOF  QUAD-TOF   QUAD-TOF  TRAP  QUAD  FTICR  4-SECTOR
y [a]       100       100      100      100      100        100       100   100   100    100
b [b]       100       100      100      100      100        100       100   100   100    100
a [c]       50        50       50       50       -          -         -     -     -      50
z           -         -        -        -        -          -         -     -     -      50
immonium    -         100      100      100      100        -         -     -     -      100
y-NH3       25        25       -        25       25         25        25    25    25     -
b-NH3       25        25       25       25       25         25        25    25    25     25
a-NH3       25        25       25       25       -          -         -     -     -      -
y-H2O       -         25       -        25       25         25        25    25    25     -
b-H2O       -         25       25       25       25         25        25    25    25     25
a-H2O       -         25       25       25       -          -         -     -     -      -

[a] A bonus score of 50 is awarded for y-ion continuity, and a score of 50 is deducted for discontinuous y-ions.
[b] A bonus score of 20 is awarded for b-ion continuity, and a score of 20 is deducted for discontinuous b-ions.
[c] No score for a-ion continuity/discontinuity. So the value of C_ij for a-ions in eq 2 will be zero.

Table 2. Spectra and Peptides Assigned by the Five Algorithms in a Standard Mixture of 18 Proteins and in a Complex Mid-log Phase Yeast Data Set

               spectra    MassWiz            Mascot             Sequest            OMSSA              X!Tandem
instrument     searched   spectra  peptides  spectra  peptides  spectra  peptides  spectra  peptides  spectra  peptides

(A) Protein Mixture
AGILENT XCT    244,174    12074    386       10511    343       11429    357       10516    344       5218     303
LCQ_Deca       50,986     3522     372       3661     382       3164     327       3439     357       2142     283
LTQ            79,762     6114     500       6240     504       4243     347       5977     469       3598     323
LTQ-FT         79,372     19616    503       20778    539       15052    396       18212    472       11282    374
QTOF           26,019     3134     237       3500     283       2709     207       2976     244       2390     250
ABI-4700       6,569      1193     253       1260     263       1236     259       1249     262       1148     237
TOTAL          486,882    45,653   2251      45,950   2314      37,833   1893      42,369   2148      25,778   1770

(B) Yeast
-              106,133    7782     877       7917     917       6004     727       9019     988       5031     646


k = number of peaks matched
I_i = intensity of the ith peak
n = number of peaks in the experimental spectrum (after processing).

The term under the square root signifies the matched ion current. It was square root transformed to decrease the effect of intensity irregularities caused by a variable fragmentation pattern and was found to perform better than log transformation. Inclusion of the intensity factor in our scoring function increases the resolution of correct assignments over random matches. The fragment mass errors can be very helpful in discerning good matches and have been implemented using an exponential function. In simpler words, the lower the mass error, the better the score for a fragment ion match.

$$S(P) = \sum_{i \in \{y, b, a\}} \sum_{j=1}^{n} \left( \frac{X_{ij} + C_{ij}}{e^{|\Delta m_{ij}|}} + \frac{N_{ij}}{e^{|\Delta m_{ij}|}} + \frac{W_{ij}}{e^{|\Delta m_{ij}|}} \right) + \sum_{j=1}^{k} \frac{Q_j}{e^{|\Delta m_j|}} \qquad (\text{eq 2})$$

where
n = total peaks in the theoretical spectrum for a given ion series (y/b/a type ion)

for a given i ∈ y/b/a ion series:
X_ij = score for the jth peak matched
C_ij = bonus score for the continuity factor when the j and j − 1 peaks matched, and negative score for a discontinuous ion series, i.e., when the j − 1 peak matches but j does not
N_ij = score for the jth matched peak for neutral loss of ammonia (NH3) when X_ij ≠ 0
W_ij = score for the jth matched peak for neutral loss of water (H2O) when X_ij ≠ 0
Δm = mass difference for the matched fragment peak, i.e., M_experimental − M_theoretical
k = total peaks in the theoretical spectrum for the immonium ion series
Q_j = score for the jth matched peak for the immonium ion

These empirical scores are taken from the scoring matrix given in Table 1. The scoring function is adapted to the irregularities of instrument types as it makes extensive use of the information content present in the spectrum along with y- and b-ions. The complementary ions such as neutral losses and immonium ions (depending on the instrument types) can help differentiate between a correct and an incorrect hit when the b-y counts are very close together. Also, the continuity of a series (b/y/a) greatly increases the confidence in the matched ions even when fragmentation is not complete due to partially mobile or nonmobile proton containing peptides.
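As an illustration of eq 1 and eq 2, the Python sketch below scores pre-matched y- and b-ion series with the ESI-TRAP weights of Table 1. It is a simplified reading of the equations, not the MassWiz implementation. The input layout and function names are hypothetical, and three points are assumptions: each matched peak is divided by the exponential of its own mass error, the discontinuity penalty is applied with Δm = 0 because an unmatched peak has no mass error, and the immonium term is omitted because the ESI-TRAP column assigns it no score.

```python
import math

# ESI-TRAP column of Table 1; continuity values come from its footnotes.
WEIGHTS = {
    "y": {"ion": 100, "continuity": 50, "nh3": 25, "h2o": 25},
    "b": {"ion": 100, "continuity": 20, "nh3": 25, "h2o": 25},
}


def primary_score(series_matches, weights=WEIGHTS):
    """Sketch of S(P) in eq 2.

    series_matches: dict ion type -> list over theoretical positions j of
    (dm_ion, dm_nh3, dm_h2o); each entry is a mass error in Da, or None
    when that peak was not matched (hypothetical input layout).
    """
    total = 0.0
    for ion, positions in series_matches.items():
        w = weights[ion]
        previous_matched = False
        for dm_ion, dm_nh3, dm_h2o in positions:
            matched = dm_ion is not None
            if matched:
                bonus = w["continuity"] if previous_matched else 0
                total += (w["ion"] + bonus) / math.exp(abs(dm_ion))
                # Neutral losses count only when the main ion matched (X_ij != 0).
                if dm_nh3 is not None:
                    total += w["nh3"] / math.exp(abs(dm_nh3))
                if dm_h2o is not None:
                    total += w["h2o"] / math.exp(abs(dm_h2o))
            elif previous_matched:
                total -= w["continuity"]   # series broken (assumed dm = 0)
            previous_matched = matched
    return total


def final_score(primary, matched_intensities, all_intensities):
    """Sketch of eq 1: primary score scaled by sqrt of matched ion current fraction."""
    total_current = sum(all_intensities)
    if total_current <= 0:
        return 0.0
    return primary * math.sqrt(sum(matched_intensities) / total_current)


matches = {
    "y": [(0.0, None, None), (0.0, 0.0, None), (None, None, None)],
    "b": [(0.0, None, None)],
}
s_p = primary_score(matches)
score_p = final_score(s_p, [10, 15], [10, 15, 25, 50])
```

In this toy case the y series contributes 100 + (150 + 25) − 50 = 225 and the b series 100, and a matched ion current of 25% halves the primary score.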

We compared MassWiz with four widely used algorithms: Mascot, Sequest, X!Tandem, and OMSSA. Six different data sets from the ISB standard mixture of 18 proteins and known contaminants were searched using all five algorithms with parameters matching as closely as possible. In parameters where we had no control, the defaults were taken. Broadly, all data sets were searched at 3 Da precursor tolerance, 1 Da fragment mass tolerance, tryptic digestion with 1 missed cleavage, and a static modification of carbamidomethylation at cysteine residues. The significance tests used by the algorithms differ in terms of the statistical model and assumptions used, so they are not directly comparable. Multiple hypothesis testing correction is accomplished through controlling the FDR at a fixed value. FDR can be easily estimated using a target-decoy based strategy. The algorithms were compared after 1% FDR correction was applied to their results.

Figure 1. Comparison of number of spectra identified by MassWiz, Mascot, Sequest, and OMSSA for data sets from different instruments at 1% FDR.

Figure 2. Comparison of number of peptides identified by MassWiz, Mascot, Sequest, and OMSSA for data sets from different instruments at 1% FDR.

Figure 3. Comparison of number of (A) spectra and (B) peptides identified by MassWiz, Mascot, Sequest, and OMSSA for mid-log phase yeast data set at 1% FDR.

The number of assigned spectra and unique peptides are shown in Table 2, which depicts the performance of the various algorithms on data sets from different instruments. In terms of spectral assignments, MassWiz performs better than Sequest, X!Tandem, and OMSSA for all instrument types except ABI-4700, as shown in Figure 1. Between Mascot and MassWiz, the former performs slightly better in five of the six data sets, while the latter was better in the AGILENT-XCT data set. When we compare the number of uniquely identified peptides by the algorithms, similar trends are observed (Figure 2). Although MassWiz identified 0.65% (297) fewer spectra than Mascot, it identified 7.2% (3284) more than OMSSA, 17.1% (7820) more than Sequest, and 43.5% (19875) more than X!Tandem in the standard mixture (Table 2A); the percentages are expressed relative to the MassWiz total. Similarly, it assigned 2.8% (63) fewer peptides than Mascot while assigning 4.6% (103) more than OMSSA, 15.9% (358) more than Sequest, and 21.4% (481) more than X!Tandem. We observed that, apart from identifying new peptides, MassWiz also identified a high number of peptides that were observed by other methods. Mascot shows the highest number of uniquely identified peptides, which explains the high number of assignments. Similar analyses were carried out on the yeast data set (Table 2B), where MassWiz assignments were close to those of Mascot, but OMSSA assigned a significantly larger number of spectra and peptides than all the other algorithms (Figure 3A and B).
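The differences and percentages quoted in this paragraph can be reproduced from the TOTAL row of Table 2A. The short Python check below is only a worked illustration; the helper name `relative_to_masswiz` is hypothetical, and the percentages are taken relative to the MassWiz total, which is the convention that matches the quoted figures.

```python
# TOTAL row of Table 2A (standard mixture, 1% FDR).
TOTAL_SPECTRA = {"MassWiz": 45653, "Mascot": 45950, "Sequest": 37833,
                 "OMSSA": 42369, "X!Tandem": 25778}
TOTAL_PEPTIDES = {"MassWiz": 2251, "Mascot": 2314, "Sequest": 1893,
                  "OMSSA": 2148, "X!Tandem": 1770}


def relative_to_masswiz(totals):
    """Return algorithm -> (difference, percent of the MassWiz total).

    Positive values mean MassWiz assigned more than the other algorithm.
    """
    reference = totals["MassWiz"]
    return {
        name: (reference - count, 100.0 * (reference - count) / reference)
        for name, count in totals.items()
        if name != "MassWiz"
    }


spectra_diff = relative_to_masswiz(TOTAL_SPECTRA)
peptide_diff = relative_to_masswiz(TOTAL_PEPTIDES)
```

A negative difference, as for Mascot, corresponds to the "fewer than" statements in the text.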

While the number of spectra and peptides assigned by an algorithm has been traditionally used as a metric for comparing algorithms, the quality of assignments is generally not checked. The main reason is the subjective nature of manual validation, which also depends on the expertise of a person. We used an objective method where we compared the agreement between algorithms as a measure of peptide quality. It has been shown that multiple algorithm consensus enhances the accuracy of the peptide identification.48 To compare the algorithms for their quality of matches, a set of high-confidence peptides is required. So, we mapped the overlaps between the algorithms for all identified peptides. For each data set, we segregated peptides identified by at least three algorithms and termed these “high-confidence peptides”. The number of identified and missed high-confidence peptides for each algorithm for the data sets is shown in Figure 4. The figure shows that MassWiz identifies the highest proportion of such peptides in four data sets, and in two data sets OMSSA performs slightly better. In the yeast data set, most of the unique OMSSA assignments were either single spectra or nonconsensus assignments. MassWiz lags a little behind Mascot, OMSSA, and Sequest in the ABI-4700 data set. The data are also tabulated in Supplementary Tables 1A and 1B. Similar trends were observed for other high-confidence peptides identified by at least 2 algorithms and at least 4 algorithms, which strengthens the confidence in these observations (Supplementary Figures 1 and 2). Overall, MassWiz identifies the largest number of high-confidence peptides considering all standard mixture data sets together and misses the smallest number of such peptides. This makes MassWiz a versatile and useful algorithm for various instrument platforms and well suited to high mass accuracy data, which are gaining popularity owing to fast technological improvements.
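The consensus analysis described above reduces to simple set operations. The following Python sketch pools peptides reported by at least three algorithms and counts, per algorithm, how many pooled peptides were identified or missed. The function names and the toy peptide labels are hypothetical; the actual comparison was done with a Perl program that is not reproduced here.

```python
from collections import Counter


def high_confidence(peptides_by_algorithm, min_algorithms=3):
    """Return peptides reported by at least min_algorithms algorithms."""
    counts = Counter()
    for peptides in peptides_by_algorithm.values():
        counts.update(set(peptides))      # count each algorithm once per peptide
    return {peptide for peptide, n in counts.items() if n >= min_algorithms}


def identified_and_missed(peptides_by_algorithm, min_algorithms=3):
    """Return algorithm -> (identified, missed) counts within the consensus pool."""
    pool = high_confidence(peptides_by_algorithm, min_algorithms)
    return {
        name: (len(pool & set(peptides)), len(pool - set(peptides)))
        for name, peptides in peptides_by_algorithm.items()
    }


# Toy input with placeholder peptide labels, not real identifications.
toy = {
    "MassWiz": {"p1", "p2", "p3"},
    "Mascot": {"p1", "p2"},
    "Sequest": {"p1", "p3"},
    "OMSSA": {"p1", "p2", "p4"},
    "X!Tandem": {"p5"},
}
pool = high_confidence(toy)
summary = identified_and_missed(toy)
```

Changing `min_algorithms` to 2 or 4 gives the alternative pools used for the supplementary comparisons.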

It has been previously shown that consensus of three search algorithms can yield higher sensitivity and specificity than a single search engine.17 MassWiz agrees highly with the consensus of three algorithms, which makes it highly useful when used singly or in combination with other algorithms.

Figure 4. Comparison of number of identified and missed “high-confidence peptides” by MassWiz, Mascot, Sequest, OMSSA, and X!Tandem for standard mixture on different instruments (first six data series) and mid-log phase yeast data (last series) at 1% FDR. Peptides identified by any three out of five algorithms are considered as high-confidence peptides. 100% corresponds to a pool of high-confidence peptides from the five algorithms.

CONCLUSIONS

Our results show that MassWiz is an efficient, accurate, and versatile algorithm. Being open-source and configurable, modifications to the scoring function or development of supplementary plug-ins can be easily achieved through community participation. The results show that MassWiz is an effective algorithm for high-confidence peptide identification without compromising on the number of assignments.

As ETD is being explored in greater detail, we intend to extend the scoring algorithm to incorporate ETD data analysis in future work.

ASSOCIATED CONTENT

Supporting Information

Supplementary Table 1 shows the comparison of high-confidence peptides for the five algorithms in (A) six standard mix data sets and (B) yeast mid-log phase data set. Supplementary Figure 1 shows comparison of peptides identified by two or more algorithms. Supplementary Figure 2 shows comparison of peptides identified by four or more algorithms. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

*Fax: +91 011 27667471. E-mail: [email protected].

ACKNOWLEDGMENT

The authors thank Dr. Rajesh Gokhale, Dr. Anurag Agrawal, Dr. Shantanu Sengupta, and Dr. Akhilesh Pandey for their valuable suggestions. We also thank Dr. G. P. Singh for his insightful comments while proof-reading the manuscript. We thank Dhanashree S. Kelkar for providing input to the manuscript. The work was supported by a CSIR SRF grant and the CSIR network project on Plasma Proteomics - Health, Environment and Disease (NWP-04).

’REFERENCES

(1) Karas, M.; Hillenkamp, F. Laser desorption ionization of pro-teins with molecular masses exceeding 10,000 Da. Anal. Chem. 1988, 60(20), 2299–2301.(2) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse,

C. M. Electrospray ionization for mass spectrometry of large biomole-cules. Science 1989, 246 (4926), 64–71.(3) Steen, H.;Mann,M. The abc’s (and xyz’s) of peptide sequencing.

Nat. Rev. Mol. Cell Biol. 2004, 5 (9), 699–711.(4) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics.

Nature 2003, 422 (6928), 198–207.(5) Washburn,M. P.;Wolters, D.; Yates, J. R., III Large-scale analysis

of the yeast proteome by multidimensional protein identificationtechnology. Nat. Biotechnol. 2001, 19 (3), 242–247.(6) Matthiesen, R. Extracting monoisotopic single-charge peaks

from liquid chromatography-electrospray ionization-mass spectrometry.Methods Mol. Biol. 2007, 367, 37–48.(7) Nguyen, N.; Huang, H.; Oraintara, S.; Vo, A. Peak detection in

mass spectrometry by Gabor filters and envelope analysis. J. Bioinform.Comput. Biol. 2009, 7 (3), 547–569.(8) Zhang, S.; DeGraba, T. J.; Wang, H.; Hoehn, G. T.; Gonzales,

D. A.; Suffredini, A. F.; Ching, W. K.; Ng, M. K.; Zhou, X.; Wong, S. T. Anovel peak detection approach with chemical noise removal using short-time FFT for prOTOF MS data. Proteomics 2009, 9 (15), 3833–3842.(9) Tabb, D. L.; Shah, M. B.; Strader, M. B.; Connelly, H. M.;

Hettich, R. L.; Hurst, G. B. Determination of peptide and protein ioncharge states by Fourier transformation of isotope-resolved massspectra. J. Am. Soc. Mass Spectrom. 2006, 17 (7), 903–915.(10) Sadygov, R. G.; Hao, Z.; Huhmer, A. F. Charger: combination

of signal processing and statistical learning algorithms for precursorcharge-state determination from electron-transfer dissociation spectra.Anal. Chem. 2008, 80 (2), 376–386.

(11) Flikka, K.;Martens, L.; Vandekerckhove, J.; Gevaert, K.; EidhammerI. Improving the reliability and throughput of mass spectrometry-basedproteomics by spectrum quality filtering. Proteomics 2006, 6 (7), 2086–2094.

(12) Salmi, J.; Nyman, T. A.; Nevalainen, O. S.; Aittokallio, T. Filtering strategies for improving protein identification in high-throughput MS/MS studies. Proteomics 2009, 9 (4), 848–860.

(13) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74 (20), 5383–5392.

(14) Eriksson, J.; Fenyo, D. The statistical significance of protein identification results as a function of the number of protein sequences searched. J. Proteome Res. 2004, 3 (5), 979–982.

(15) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 2007, 4 (10), 787–797.

(16) Nesvizhskii, A. I.; Aebersold, R. Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discovery Today 2004, 9 (4), 173–181.

(17) Sultana, T.; Jordan, R.; Lyons-Weiler, J. Optimization of the use of consensus methods for the detection and putative identification of peptides via mass spectrometry using protein standard mixtures. J. Proteomics Bioinform. 2009, 2 (6), 262–273.

(18) MacLean, B.; Eng, J. K.; Beavis, R. C.; McIntosh, M. General framework for developing and evaluating database scoring algorithms using the TANDEM search engine. Bioinformatics 2006, 22 (22), 2830–2832.

(19) Keller, A.; Eng, J.; Zhang, N.; Li, X. J.; Aebersold, R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005, 1, 2005.

(20) Tanner, S.; Shu, H.; Frank, A.; Wang, L. C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 2005, 77 (14), 4626–4639.

(21) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551–3567.

(22) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976–989.

(23) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–1467.

(24) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3 (5), 958–964.

(25) Kapp, E. A.; Schutz, F.; Connolly, L. M.; Chakel, J. A.; Meza, J. E.; Miller, C. A.; Fenyo, D.; Eng, J. K.; Adkins, J. N.; Omenn, G. S.; Simpson, R. J. An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics 2005, 5 (13), 3475–3490.

(26) Dagda, R. K.; Sultana, T.; Lyons-Weiler, J. Evaluation of the consensus of four peptide identification algorithms for tandem mass spectrometry based proteomics. J. Proteomics Bioinform. 2010, 3, 39–47.

(27) Havilio, M.; Haddad, Y.; Smilansky, Z. Intensity-based statistical scorer for tandem mass spectrometry. Anal. Chem. 2003, 75 (3), 435–444.

(28) Narasimhan, C.; Tabb, D. L.; VerBerkmoes, N. C.; Thompson, M. R.; Hettich, R. L.; Uberbacher, E. C. MASPIC: intensity-based tandem mass spectrometry scoring scheme that improves peptide identification at high confidence. Anal. Chem. 2005, 77 (23), 7581–7593.

(29) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol. Biol. 2010, 604, 55–71.

(30) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207–214.

(31) Kall, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008, 7 (1), 29–34.

(32) Moore, R. E.; Young, M. K.; Lee, T. D. Qscore: an algorithm for evaluating SEQUEST database search results. J. Am. Soc. Mass Spectrom. 2002, 13 (4), 378–386.

(33) Wang, G.; Wu, W. W.; Zhang, Z.; Masilamani, S.; Shen, R. F. Decoy methods for assessing false positives and false discovery rates in shotgun proteomics. Anal. Chem. 2009, 81 (1), 146–159.

(34) Blanco, L.; Mead, J. A.; Bessant, C. Comparison of novel decoy database designs for optimizing protein identification searches using ABRF sPRG2006 standard MS/MS data sets. J. Proteome Res. 2009, 8 (4), 1782–1791.

(35) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H.; Schmidt, A.; Ossola, R.; Eng, J. K.; Aebersold, R.; Martin, D. B. The standard protein mix database: a diverse data set to assist in the production of improved Peptide and protein identification software tools. J. Proteome Res. 2008, 7 (1), 96–103.

(36) Colinge, J.; Masselot, A.; Carbonell, P.; Appel, R. D. InSilicoSpectro: An open-source proteomics library. J. Proteome Res. 2006, 5 (3), 619–624.

(37) Hoopmann, M. R.; Finney, G. L.; MacCoss, M. J. High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. Anal. Chem. 2007, 79 (15), 5620–5632.

(38) Kast, J.; Gentzel, M.; Wilm, M.; Richardson, K. Noise filtering techniques for electrospray quadrupole time of flight mass spectra. J. Am. Soc. Mass Spectrom. 2003, 14 (7), 766–776.

(39) Wysocki, V. H.; Tsaprailis, G.; Smith, L. L.; Breci, L. A. Mobile and localized protons: a framework for understanding peptide dissociation. J. Mass Spectrom. 2000, 35 (12), 1399–1406.

(40) Tabb, D. L.; Huang, Y.; Wysocki, V. H.; Yates, J. R., III. Influence of basic residue content on fragment ion peak intensities in low-energy collision-induced dissociation spectra of peptides. Anal. Chem. 2004, 76 (5), 1243–1248.

(41) Breci, L. A.; Tabb, D. L.; Yates, J. R., III; Wysocki, V. H. Cleavage N-terminal to proline: analysis of a database of peptide tandem mass spectra. Anal. Chem. 2003, 75 (9), 1963–1971.

(42) Khatun, J.; Ramkissoon, K.; Giddings, M. C. Fragmentation characteristics of collision-induced dissociation in MALDI TOF/TOF mass spectrometry. Anal. Chem. 2007, 79 (8), 3032–3040.

(43) Kapp, E. A.; Schutz, F.; Reid, G. E.; Eddes, J. S.; Moritz, R. L.; O’Hair, R. A.; Speed, T. P.; Simpson, R. J. Mining a tandem mass spectrometry database to determine the trends and global factors influencing peptide fragmentation. Anal. Chem. 2003, 75 (22), 6251–6264.

(44) Frank, A. M. Predicting intensity ranks of peptide fragment ions. J. Proteome Res. 2009, 8 (5), 2226–2240.

(45) Bythell, B. J.; Suhai, S.; Somogyi, A.; Paizs, B. Proton-driven amide bond-cleavage pathways of gas-phase peptide ions lacking mobile protons. J. Am. Chem. Soc. 2009, 131 (39), 14057–14065.

(46) Paizs, B.; Suhai, S. Fragmentation pathways of protonated peptides. Mass Spectrom. Rev. 2005, 24 (4), 508–548.

(47) Cramer, R.; Corless, S. The nature of collision-induced dissociation processes of doubly protonated peptides: comparative study for the future use of matrix-assisted laser desorption/ionization on a hybrid quadrupole time-of-flight mass spectrometer in proteomics. Rapid Commun. Mass Spectrom. 2001, 15 (22), 2058–2066.

(48) Yu, W.; Taylor, J. A.; Davis, M. T.; Bonilla, L. E.; Lee, K. A.; Auger, P. L.; Farnsworth, C. C.; Welcher, A. A.; Patterson, S. D. Maximizing the sensitivity and reliability of peptide identification in large-scale proteomic experiments by harnessing multiple search engines. Proteomics 2010, 10 (6), 1172–1189.