In vitro and in silico processes to identify differentially expressed proteins

Nadia Allet*, Nicolas Barrillat, Thierry Baussant, Celia Boiteau, Paolo Botti, Lydie Bougueleret, Nicolas Budin, Denis Canet, Stéphanie Carraud, Diego Chiappe, Nicolas Christmann, Jacques Colinge, Isabelle Cusin, Nicolas Dafflon, Benoît Depresle, Irène Fasso, Pascal Frauchiger, Hubert Gaertner, Anne Gleizes, Eduardo Gonzalez-Couto, Catherine Jeandenans, Abderrahim Karmime, Thomas Kowall, Sophie Lagache, Eve Mahé, Alexandre Masselot, Hassan Mattou, Marc Moniatte, Anne Niknejad, Marianne Paolini, Frédéric Perret, Nicolas Pinaud, Frédéric Ranno, Sylvain Raimondi, Samia Reffas, Pierre-Olivier Regamey, Pierre-Antoine Rey, Patricia Rodriguez-Tomé, Keith Rose, Gérald Rossellat, Cédric Saudrais, Camille Schmidt, Matteo Villain and Catherine Zwahlen

GeneProt Inc., Meyrin, Switzerland

We present an integrated proteomics platform designed for performing differential analyses. Since reproducible results are essential for comparative studies, we explain how we improved reproducibility at every step of our laboratory processes, e.g. by taking advantage of the powerful laboratory information management system we developed. The differential capacity of our platform is validated by detecting known markers in a real sample and by a spiking experiment. We introduce an innovative two-dimensional (2-D) plot for displaying identification results combined with chromatographic data. This 2-D plot is very convenient for detecting differential proteins. We also adapt standard multivariate statistical techniques to show that peptide identification scores can be used for reliable and sensitive differential studies. The value of the protein separation approach we generally apply is supported by numerous statistics, complemented by a comparison with a simple shotgun analysis performed on a small volume sample. By introducing an automatic integration step after mass spectrometry data identification, we are able to search numerous databases systematically, including the human genome and expressed sequence tags. Finally, we explain how rigorous data processing can be combined with the work of human experts to set high quality standards, and hence obtain reliable (false positive < 0.35%) and nonredundant protein identifications.

Keywords: Bioinformatics / Chromatography / Differential / Identification / Tandem mass spectrometry

Received 29/12/03
Revised 15/3/04
Accepted 3/4/04

Proteomics 2004, 4, 2333–2351

1 Introduction

There is growing interest in analyzing body fluids by proteomics, as they constitute a most useful source of proteins associated with both health and disease. The recent initiative from the Human Proteome Organization (HUPO) [1] and the number of research papers published bear witness to this tendency [2, 3]. Among the many questions that can be answered by proteomics analysis, the differential study of diseased/control samples remains central.

To address this question we developed several modular laboratory (in vitro) and bioinformatics (in silico) methods. These methods can be applied to a variety of samples: individual samples or pools of carefully matched individual samples, large or small volumes.

The analysis of body fluids is a challenge due to the large number of peptides and proteins present and the very wide range of concentrations [4–7]. In order to identify as many proteins as possible for subsequent differential studies, we developed an industrial-scale (2500 mL; MicroProt) process involving sample pooling for the analysis of smaller proteins [6, 8]. From this process, we recently derived improved processes for large (500 mL) and medium (10 mL) volume samples. These two new processes, which we describe in this paper, aim at highlighting and identifying biomarkers and potentially valuable targets by the comparative analysis of complex body fluids such as plasma, serum or cerebrospinal fluid [9]. They focus on the extensive analysis of peptides and low Mr proteins (mass < ca. 25 kDa) and they are based on multidimensional chromatography. After depletion of abundant and ubiquitous proteins, the biological samples are enriched in small proteins by gel filtration. Sample complexity is then further reduced by multidimensional liquid chromatography before digestion and analysis by mass spectrometry (HPLC-MS/MS). Protein integrity is thus maintained until the very end of the sample analysis.

Correspondence: Dr. Jacques Colinge, GeneProt Inc., Rue Pré de la Fontaine 2, Case postale 125, CH-1217 Meyrin, Switzerland

Abbreviations: AIMS, analysis information management system; CEX, cation exchange; DB, database; LIMS, laboratory information management system

* Authors in alphabetical order.

© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.de

DOI 10.1002/pmic.200300840

To be able to detect the differential expression of proteins, it is essential to maintain a high level of reproducibility and robustness at every stage of both laboratory and bioinformatics analyses. This is achieved by carefully optimizing laboratory procedures, by managing them through a proprietary and flexible laboratory information management system (LIMS), and by processing mass spectral data with high selectivity and sensitivity. We show that we have reached a level of reproducibility that is suitable for reliable differential expression detection.

Depending on the project size, i.e. 500 mL or 10 mL, differential expression is determined by different techniques. For a pair of pools of carefully matched individual samples analyzed in the 500 mL format, we rely on a novel type of 2-D plot that easily reveals differential proteins. For individual samples analyzed in the 10 mL format, we apply multivariate statistical methods adapted from the DNA chip literature. Recently, other authors [10] also showed, by using peak intensities, that it is not necessary to use stable isotope labeling techniques [11] to detect differential expression. Differential expression detection in the 500 mL process is demonstrated with two well-known markers, whereas in the 10 mL process it is demonstrated by a spiking experiment at different concentrations.

The management of the huge amount of data generated in large-scale projects, as well as the rapid processing of repeated medium-scale projects, requires a sophisticated bioinformatics platform. We already mentioned the role of the LIMS in tracking samples and in helping to maintain high levels of reproducibility. MS data identification has to be done with great sensitivity and selectivity to ensure extensive and reliable use of the mass spectra, i.e. in-depth analysis of the samples. Hence we developed proprietary identification programs [12–15], which we use for searching large and numerous databases ranging from Swiss-Prot [16] to the human genome. These database searches generate too many results for a human expert to analyze thoroughly. Therefore, we implemented several bioinformatics tools to reduce redundancy in these search results, extract useful information, complement it with available information, store it and make it accessible via ad hoc graphic/web interfaces. Human experts (the annotators) check the correctness and consistency of the results, and they finally select potential biomarkers/targets. All these bioinformatics tasks are critical to the success of large-scale proteomics projects, which otherwise could not analyze all the data generated.

The most promising proteins are chemically synthesized, which gives us the opportunity to validate differential expression by spiking at the individual sample level. In this paper, we mainly report 500 and 10 mL process results. For comparison purposes, we give 2500 mL statistics as well as statistics of a new (shotgun) process we have developed for small volume samples (1 mL and below) and mainly 18O quantitation [17] (the 18O platform will be described elsewhere).

2 Materials and methods

The processes we developed are generally applicable to body fluids (Fig. 1). We introduce them by taking examples from human plasma and serum projects. These projects are based on samples of different sizes, thereby illustrating the modular nature of our methodology. They are briefly described in Table 1.

2.1 Protein separation

We detail here the 10 mL process, which is derived from the 500 and 2500 mL processes [6, 8]. We then briefly describe what differs in the larger volume processes. Human serum or plasma is obtained according to standard procedures. Additionally, plasma samples are supplemented with a protease inhibitor cocktail (Complete; Roche, Mannheim, Germany). Individual samples or pools of matched individual samples are aliquoted in 10 mL portions and frozen at −80°C.

Figure 1. Schematic view of a complete process. DB, database.

Table 1. Brief description of the samples/projects used for illustrating the processes

Project name | Sample | Sample size | Process
M3c | Human normal plasma, pooled individuals | 2.5 L | 2500 mL
M4e, M4l | Human serum, pregnant women (3–6 m and 6–9 m), pooled individuals | 500 mL | 500 mL
PX | Human normal plasma, pooled individuals, repeated four times | 10 mL | 10 mL
PXleptin | Same as PX but spiked with four amounts of leptin (1, 10, 50, 100 nM) | 10 mL | 10 mL
PI | Human normal plasma, individual samples, repeated twice | 0.25 mL | 1 mL

2.1.1 Depletion and gel filtration

Portions of frozen plasma (10 mL) or serum are thawed and applied to a tandem column combination consisting of 30 mL albumin ligand affinity resin (column 1.6 cm id, 15 cm length, prototype product based on an agarose matrix; Amersham Biosciences, Uppsala, Sweden) and 10 mL Protein G Sepharose Fast Flow (1.6 cm id, 5 cm length; Amersham Biosciences). Columns are equilibrated and washed with 50 mM PO4 buffer, pH 7.1, 0.15 M NaCl. Nonretained (flow-through) fractions (35 mL) are frozen until the second step. Protein content in the flow-through fraction is determined by analytical size exclusion HPLC using BSA as a standard. The analytical size exclusion HPLC column is calibrated weekly, with an automatic chromatogram treatment performed by the LIMS. The flow-through fraction is frozen at −20°C before being applied to gel filtration chromatography. Each fraction is thawed and filtered through a 0.45 µm sterile filter under a sterile hood. The filtrate is injected on three in-line gel filtration columns: 3 × 0.6 L Superdex 75 (each 4.4 cm id, 40 cm length; Amersham Biosciences). The columns are equilibrated and then eluted with 50 mM PO4 buffer pH 7.4, 0.1 M NaCl, 8 M urea. Hydrophobic impurities in the buffer are retained on a reverse phase precolumn upstream of the injector (15 mL PLRP-S, Polymer Laboratories reverse phase styrene divinyl benzene; Marseille, France). During the elution of low Mr proteins (nominally < 25 kDa based on SEC HPLC) the effluent is switched to an in-line RP capture column (5 mL PLRP-S, 100 Å). The three-way valve controlling effluent switching to the PLRP-S column is activated when the absorbance at 280 nm falls below 33 mAU after elution of the large proteins. The cut-off value is established during preliminary experiments using SEC HPLC to monitor the eluate. After washing the PLRP-S capture column, low Mr proteins and peptides are eluted with one column volume of 0.2% TFA, 80% CH3CN in water. The eluted low Mr protein fraction is frozen at −20°C until further use.
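The valve-switching rule of Section 2.1.1 (switch to the capture column once the 280 nm absorbance falls below 33 mAU after the large proteins have eluted) can be illustrated with a minimal sketch. It is not the instrument control logic used on the platform: the function name, the trace format (parallel lists of times and absorbances) and the assumption that the large-protein peak is the global maximum of the trace are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence

CUTOFF_MAU = 33.0  # switching threshold quoted in the text


def valve_switch_time(times: Sequence[float],
                      a280_mau: Sequence[float],
                      cutoff: float = CUTOFF_MAU) -> Optional[float]:
    """Return the first time, after the main peak, at which A280 < cutoff.

    Assumption (not from the paper): the large-protein peak is the
    global maximum of the absorbance trace.
    """
    if len(times) != len(a280_mau) or len(times) == 0:
        raise ValueError("times and a280_mau must be non-empty and equally long")
    # Index of the apex of the large-protein peak.
    apex = max(range(len(a280_mau)), key=lambda i: a280_mau[i])
    # Low readings before the apex are ignored; only the falling edge counts.
    for i in range(apex + 1, len(a280_mau)):
        if a280_mau[i] < cutoff:
            return times[i]
    return None  # absorbance never dropped below the cutoff


if __name__ == "__main__":
    minutes = [0, 10, 20, 30, 40, 50, 60]
    trace = [5, 20, 400, 250, 60, 30, 12]  # synthetic trace in mAU
    print(valve_switch_time(minutes, trace))
```

Searching only after the apex prevents the low baseline readings at the start of the run from triggering the switch.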

2.1.2 Ion exchange chromatography

The PLRP-S capture column eluate is thawed in turn and mixed with an equal volume of cation exchange (CEX) buffer A (50 mM glycine/HCl, pH 2.7, 8 M urea). The sample is injected onto a 10 mL source 15S column (1 mm id, 100 mm length; Amersham Biosciences) equilibrated and washed with buffer A. Proteins and peptides are eluted with step gradients from 100% buffer A to 100% buffer B (i.e. buffer A containing 1 M NaCl). Fifteen elution fractions are collected based on time and immediately submitted to reduction/alkylation. Protein content in each fraction is automatically calculated by the LIMS, based on 280 nm UV absorption. Calibration is established during preliminary experiments using SEC HPLC protein concentration measurement. This chromatographic step is performed with an AKTA Explorer (Amersham Biosciences).

2.1.3 Reduction/alkylation and first RP-HPLC fractionation

After adjusting the pH to 8.5 with concentrated Tris-HCl, the 15 CEX fractions are reduced with dithioerythritol (DTE, 30 mM, 2 h at 37°C) and alkylated with iodoacetamide (120 mM, 30 min, 37°C in the dark). The latter reaction is stopped by the addition of DTE (30 mM) followed by acidification (TFA, 0.1%). A volume of each fraction equivalent to 100 µg of protein (automatically calculated by the LIMS) is then injected at 0.8 mL/min on a Vydac C4, 3 µm, 300 Å column, 4.6 mm id, 100 mm length (Basel, Switzerland). The C4 column is equilibrated and washed with 0.05% TFA in water (solution A). Proteins and peptides are eluted with a biphasic gradient from 100% A to 100% B (0.05% TFA, 80% CH3CN in water) over 15 min. Fifteen RP fractions of 0.6 mL are collected. The protein content of each RP fraction is automatically calculated by the LIMS, based on 215 nm UV absorption. Calibration of the peptide content is established during preliminary experiments using a set of protein standards. The reverse phase fractionation is performed with an Alliance HPLC system (Waters, Milford, MA, USA) or a Varian (Palo Alto, CA, USA) preparative HPLC system (500/2500 mL processes).

All tasks, including buffer preparation, are supported and tracked by the LIMS in order to ensure reproducibility. In the protein separation and LC-MS steps (see below), constant optimal amounts of protein are injected to improve reproducibility and avoid column saturation. The chromatogram of each separation step is used to extrapolate the protein quantity present in each fraction, allowing the calculation of the appropriate injection volumes for the subsequent step. Since protein quantity calculations rely on absorbance data, the chromatograms are preprocessed to correct for baseline drift, and a graphical tool allows the operator to visually inspect and validate the processing.
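The chain described above (baseline correction, protein quantity per fraction, injection volume for the next step) can be sketched as follows. This is only an illustration under stated assumptions: the straight-line baseline between the first and last points, the calibration factor `ug_per_area`, and all function names are hypothetical and are not taken from the LIMS.

```python
from typing import List, Sequence


def subtract_linear_baseline(times: Sequence[float],
                             signal: Sequence[float]) -> List[float]:
    """Remove a straight-line baseline drawn between the first and last points.

    Assumption: drift is linear over the run; negative values are clipped to 0.
    """
    t0, t1 = times[0], times[-1]
    s0, s1 = signal[0], signal[-1]
    slope = (s1 - s0) / (t1 - t0)
    return [max(0.0, s - (s0 + slope * (t - t0))) for t, s in zip(times, signal)]


def area(times: Sequence[float], signal: Sequence[float],
         start: float, end: float) -> float:
    """Trapezoidal area of the signal over the sampling intervals inside [start, end]."""
    total = 0.0
    for i in range(1, len(times)):
        if times[i - 1] >= start and times[i] <= end:
            total += 0.5 * (signal[i - 1] + signal[i]) * (times[i] - times[i - 1])
    return total


def injection_volume_ul(fraction_area: float, ug_per_area: float,
                        fraction_volume_ul: float, target_ug: float) -> float:
    """Volume to inject so that target_ug of protein is loaded.

    ug_per_area is a hypothetical calibration factor (micrograms per unit area).
    The result is capped at the volume actually available in the fraction.
    """
    mass_ug = fraction_area * ug_per_area
    if mass_ug <= 0:
        return fraction_volume_ul
    concentration = mass_ug / fraction_volume_ul  # micrograms per microliter
    return min(fraction_volume_ul, target_ug / concentration)


if __name__ == "__main__":
    t = [0.0, 1.0, 2.0, 3.0, 4.0]
    raw = [10.0, 30.0, 50.0, 40.0, 30.0]  # synthetic absorbance with drift
    corrected = subtract_linear_baseline(t, raw)
    a = area(t, corrected, 0.0, 4.0)
    print(corrected, a, injection_volume_ul(a, 5.0, 600.0, 100.0))
```

Capping the volume at what the fraction contains mirrors the practical constraint that a dilute fraction cannot deliver the full target load.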

2.1.4 Leptin spiking experiment

A predefined amount of leptin (Sigma, Buchs, Switzerland) is added to a 10 mL human plasma aliquot just before chromatography processing. Four stock solutions of leptin in plasma are used for this experiment. The final concentrations of spiked leptin are 1 nM, 10 nM, 50 nM and 100 nM, respectively. Spiked samples are treated under strictly identical conditions.
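To give a sense of scale for these spikes, the amount of leptin required for each final concentration in a 10 mL aliquot can be worked out directly. The molar mass used below (about 16 kDa) is an assumption that is not stated in the paper, and the volume added with the stock solution is neglected; the function name is hypothetical.

```python
LEPTIN_MW_G_PER_MOL = 16_000.0  # assumption: approximate molar mass of leptin
ALIQUOT_VOLUME_L = 0.010        # 10 mL plasma aliquot, as stated in the text


def spike_amounts(conc_nm: float,
                  volume_l: float = ALIQUOT_VOLUME_L,
                  mw: float = LEPTIN_MW_G_PER_MOL):
    """Return (picomoles, micrograms) needed for a final concentration in nM."""
    moles = conc_nm * 1e-9 * volume_l
    return moles * 1e12, moles * mw * 1e6


if __name__ == "__main__":
    for c in (1, 10, 50, 100):  # final concentrations used in the experiment
        pmol, ug = spike_amounts(c)
        print(f"{c:>3} nM -> {pmol:.0f} pmol, {ug:.2f} ug")
```

Under these assumptions the lowest spike corresponds to roughly 10 pmol (about 0.16 µg) of leptin in the whole aliquot.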

2.1.5 500 and 2500 mL processes

The same process is designed for higher sample volumes, with an adapted column format and a higher number of injections for the first separation steps. An additional reverse phase fractionation step, named RP2, is performed. The first reverse phase step is renamed RP1 for clarity.

2.1.6 1 mL process

This process is adapted to low volume individual samples (sub-milliliter). Briefly, albumin and immunoglobulin depletion and low Mr protein preparation are carried out using a down-scaled column format. Proteins are reduced and alkylated and then hydrolyzed by modified porcine trypsin (Promega; Catalys, Wallisellen, Switzerland). Tryptic peptides are first separated according to their charge by strong cation exchange and then analyzed by RP LC-MS/MS. Preliminary results are presented in Table 2 for comparison with the higher volume processes, which we focus on in this paper. Moreover, the 1 mL process is a shotgun peptide sequencing process, as opposed to the 10, 500 and 2500 mL processes, which rely extensively on protein separation.

2.2 Robotics and analysis by MS

All protein/peptide standards were from Bachem (Bubendorf, Switzerland), Sigma (Buchs, Switzerland) or Neosystem (Strasbourg, France). All chemicals, from Fluka/Riedel de Haen/Sigma/Aldrich (Buchs, Switzerland) or Merck (Dietikon, Switzerland), were of the highest quality available and were used without any further purification. Water was purified in-house and desalted by reverse osmosis and Milli-Q systems (Millipore, Switzerland).

2.2.1 Automated digestion

After the last reverse phase separation step, RP fractions undergo a multistep concentration and resolubilization treatment, optimized for high recovery, to prepare them for digestion. The LIMS generates command scripts for the Tecan Genesis (Männedorf, Switzerland) liquid handling robots that add the appropriate amount of digestion buffer. RP fractions are digested overnight with modified porcine trypsin (Promega). The appropriate amount of enzyme to be added to each RP fraction is calculated based on the absorbance of the RP fractions (see Section 2.1.2). Digestion efficiency is ensured both by checking the pH of the solution before and after digestion and by digesting a cytochrome C standard that is checked by MALDI-TOF MS.

2.2.2 Analysis by MS

After digestion, the tryptic peptide mixtures are further separated and analyzed on an HPLC-MS/MS system designed for high throughput. Each HPLC-MS/MS system consists of two Alliance HT2795 separation modules (Waters) that can work in parallel, fitted with an in-house built flow splitter, and one Esquire3000plus (Bruker Daltonics, Bremen, Germany). The two separation units and the ion trap instrument are controlled by the Hystar and EsquireControl programs (Bruker), respectively, in their latest versions. HPLC is performed on nanobore GROM-SIL C8, 5 µm (GROM, Rottenburg-Hailfingen, Germany), 0.1 × 100 mm columns, at a flow rate of 3 µL/min using a biphasic gradient from 100% A (2% ACN, 0.1% formic acid (FA) in water) to 100% B (95% ACN, 0.1% FA in water) over either 24 or 60 min (for 500/2500 mL and 10 mL, respectively). The Esquire3000plus ion trap (ESI-IT) is usually operated in data-dependent MS to MS/MS switching mode using two precursors detected in the 350–1600 m/z window and excluding singly charged ions. Precursors are excluded for one minute after one MS/MS acquisition and the scan range is kept between 100 and 1600 m/z. The mass windows (MS and MS/MS) have been optimized for our sample/separation process.

Table 2. Essential statistics of selected proteomes

Project | M3c | M4e, M4l (avg/2) a) | PX (avg/4) | PI (avg/2)
Process | 2500 mL | 500 mL | 10 mL | 1 mL
CEX fractions | 18 | 12 | 15 | 24
RP/RP1 fractions | 30 | 15 | 15 |
RP2 fractions | 24 | 24 | |
Total fractions | 12 960 | 4 320 | 225 | 24
Spectra b) | 703 016 | 1 368 494 | 158 090 | 9 556
Raw peptide identifications c) | 505 473 | 205 786 | 46 784 | 3 890
Validated peptides d) | 102 800 | 135 244 | 27 941 | 1 593
Tryptic peptides e) | 102 800 | 117 274 | 22 389 | 1 335
Modified peptides f) | 1 930 | 21 866 | 3 465 | 387
Distinct peptides g) | 5 057 | 6 289 | 2 250 | 644
DB entries h) | 481 313 | 343 513 | 55 098 | 984
Distinct DB entries i) | 77 200 | 17 514 | 5 916 | 984
Distinct DB entries auto nonredundant j) | * | 1 353 | 400 | 127
Validated distinct proteins k) | 773 | 880 | 409 | 114
Average coverage (peptides) l) | 3.2 ± 3.6 | 3.5 ± 3.8 | 4.5 ± 5.5 | 5.5 ± 6.8
Proteins with 1 peptide (%) m) | 59.2 | 60.2 | 65.3 | 66.6
Average sequence length n) | 368 ± 499 | 454 ± 607 | 384 ± 685 | 324 ± 523
Average per fraction o) | 2.5 ± 2.2 | 5.9 ± 4.9 | 12.7 ± 7.9 | **
Median per fraction p) | 2.0 | 4.0 | 12.0 | **
Most complex fraction | 21 | 27 | 38 | **
Sequence analysis tasks q) | 94 017 | 23 982 | 8 366 | 1 500

a) avg/n means that the results are averaged on n proteomes of the same type.
b) Number of acquired experimental MS/MS spectra.
c) Number of peptide matches returned by the MS search engine with production parameters.
d) Number of peptide matches finally validated.
e) Number of validated tryptic peptides.
f) Number of validated peptides with at least one modification that is neither Cys_CAM nor oxidation.
g) Number of validated distinct peptides.
h) Number of database entries returned by the MS search engine with production parameters.
i) Number of distinct database entries identified.
j) Number of distinct nonredundant DB entries as computed by the integration program (see Section 2.5; entries corresponding to one single protein in various databases are grouped automatically, as are splice variants that are indistinguishable because specific peptides are not detected).
k) Number of finally validated proteins.
l) Average number of peptides per validated protein.
m) The proportion of validated proteins identified based on one peptide only.
n) Average length of the validated proteins.
o) Number of manually validated proteins per fraction containing at least one protein.
p) The same with average replaced by median.
q) Number of sequence analysis tasks of the automatic annotation. These tasks are typically sequence homology searches such as BLAST [46], ClustalW [47], PFAM [48], NetPhos [49], SignalP [50], etc.
* Software not available when these data were processed.
** Not applicable as there is one “big” fraction only.
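The fraction counts reported in Table 2 can be cross-checked with a few lines of arithmetic: the total number of fractions of each process should equal the product of the fractions collected at each separation step. The sketch below uses only values from Table 2; the dictionary layout and function names are illustrative choices, and the validated-to-acquired ratio is a derived quantity that the paper does not report.

```python
from math import prod

# Fraction counts and totals as given in Table 2.
FRACTIONS = {
    "M3c":      {"CEX": 18, "RP1": 30, "RP2": 24, "total": 12_960},
    "M4e, M4l": {"CEX": 12, "RP1": 15, "RP2": 24, "total": 4_320},
    "PX":       {"CEX": 15, "RP1": 15, "total": 225},
    "PI":       {"CEX": 24, "total": 24},
}

# Acquired spectra and validated peptide matches, also from Table 2.
SPECTRA = {"M3c": 703_016, "M4e, M4l": 1_368_494, "PX": 158_090, "PI": 9_556}
VALIDATED = {"M3c": 102_800, "M4e, M4l": 135_244, "PX": 27_941, "PI": 1_593}


def expected_total(row: dict) -> int:
    """Product of the fraction counts of every separation step in the row."""
    return prod(v for k, v in row.items() if k != "total")


def validated_ratio(project: str) -> float:
    """Share of acquired spectra that end up as validated peptide matches."""
    return VALIDATED[project] / SPECTRA[project]


if __name__ == "__main__":
    for name, row in FRACTIONS.items():
        status = "ok" if expected_total(row) == row["total"] else "mismatch"
        print(f"{name}: {expected_total(row)} fractions ({status}), "
              f"validated/acquired = {validated_ratio(name):.1%}")
```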


Digested RP fractions are delivered in 96-well plates that contain only 15 RP fractions (24 RP fractions for 500 mL) and are run overnight. Some wells remain unused, and the fraction layout is designed to ensure uniform evaporation over the plate. A Hystar “sample table” is automatically created by the LIMS; depending on the pipeline format chosen, it includes: (1) the plate format; (2) the appropriate LC and MS method command files adapted to Hystar; (3) the injection order, which follows increasing RP fraction concentration to minimize carryover; and (4) a variable injection volume tuned to load a maximum of 800 ng on the column, thus avoiding column saturation or clogging. A quality control standard is added to each plate and is directly processed by the LIMS, enabling quality control of chromatographic separation, MS sensitivity and accuracy. Data files generated by the mass spectrometers are automatically post-processed by the DataAnalysis program (Bruker). The output data are finally transferred via the LIMS to a file server, where they are converted to the format used by the MS identification program.
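Items (3) and (4) of the sample table, i.e. the injection order and the load-capped injection volume, can be sketched as below. This is not the Hystar file format or the LIMS code: the `Injection` record, the well labels, the per-fraction concentrations and the 20 µL maximum injection volume are hypothetical; only the 800 ng cap and the increasing-concentration ordering come from the text.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

MAX_LOAD_NG = 800.0  # maximum column load quoted in the text


@dataclass
class Injection:
    well: str
    volume_ul: float
    load_ng: float


def build_sample_table(fractions: Sequence[Tuple[str, float]],
                       max_volume_ul: float = 20.0,
                       max_load_ng: float = MAX_LOAD_NG) -> List[Injection]:
    """Order injections by increasing concentration and cap the column load.

    fractions: (well, concentration in ng/uL) pairs.
    max_volume_ul is a hypothetical autosampler limit.
    """
    table = []
    # Least concentrated fractions first, to limit carryover between runs.
    for well, conc in sorted(fractions, key=lambda f: f[1]):
        if conc <= 0:
            volume = max_volume_ul
        else:
            volume = min(max_volume_ul, max_load_ng / conc)
        table.append(Injection(well, volume, volume * conc))
    return table


if __name__ == "__main__":
    plate = [("A1", 100.0), ("A2", 10.0), ("A3", 400.0)]
    for inj in build_sample_table(plate):
        print(inj)
```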

2.2.3 Comparative proteomics set-up

In order to ensure better fraction-to-fraction comparability of the results, we have created a pool of Alliance/Esquire LC-MS/MS systems exhibiting comparable performance, selected through a qualification procedure especially designed for this purpose. Namely, on each system we analyze a peptide mixture obtained by tryptic digestion of eight proteins (P02787 TRFE_HUMAN, P02768 ALBU_HUMAN, P00698 LYC_CHICK, P02754 LACB_BOVIN, P00330 ADH1_YEAST, P10537 AMYB_IPOBA, P00004 CYC_HORSE, P01317 INS_BOVIN). The analysis is repeated six times at three concentrations (50, 100 and 200 ng total mass injected), with each protein contributing one-eighth. The LIMS tracking capability is used to ensure that the same CEX fractions of each individual 10 mL sample are analyzed on comparable systems.

2.3 LIMS

In order to track samples and manage all the laboratory analyses, we developed a LIMS. The LIMS implements a workflow model that breaks down a complex laboratory process into a succession of interconnected and interdependent steps named tasks. A typical task represents an experimental step with input data, output data, and parameters. Input data represent samples generated by the previous task(s), while output data are samples and/or files produced by the task. Note that in this LIMS section the term “sample” is taken in a general sense, although in the rest of the paper it is reserved for the initial biological sample. Relevant experimental conditions are stored in parameters associated with the task. A task is ready for execution as soon as all its input data are available, implying that all its preceding tasks are completed. This rule enforces a correct and safe execution of the workflow [18].
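The readiness rule stated above (a task may run only once all of its preceding tasks are completed) can be illustrated with a small dependency sketch. The task names and the dependency map are hypothetical simplifications loosely following the separation steps of Section 2.1; the actual LIMS workflow comprises many more task specifications and is not reproduced here.

```python
from typing import Dict, List, Set

# Hypothetical, simplified workflow: task -> set of tasks it depends on.
WORKFLOW: Dict[str, Set[str]] = {
    "depletion": set(),
    "gel_filtration": {"depletion"},
    "cex": {"gel_filtration"},
    "reduction_alkylation": {"cex"},
    "rp_hplc": {"reduction_alkylation"},
    "digestion": {"rp_hplc"},
    "qc_standard": set(),
    "lc_msms": {"digestion", "qc_standard"},
}


def ready_tasks(dependencies: Dict[str, Set[str]], completed: Set[str]) -> List[str]:
    """Tasks not yet completed whose predecessors are all completed."""
    return sorted(task for task, deps in dependencies.items()
                  if task not in completed and deps <= completed)


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Execute the workflow by repeatedly running every ready task.

    Raises ValueError if some task can never become ready (cyclic or
    unknown dependency).
    """
    completed: Set[str] = set()
    order: List[str] = []
    while len(completed) < len(dependencies):
        batch = ready_tasks(dependencies, completed)
        if not batch:
            raise ValueError("cyclic or unsatisfiable dependencies")
        order.extend(batch)
        completed.update(batch)
    return order


if __name__ == "__main__":
    print(ready_tasks(WORKFLOW, set()))
    print(execution_order(WORKFLOW))
```

Because a task is offered only when its inputs exist, an operator cannot, in this model, start the LC-MS/MS run before both the digestion and the quality control standard are available.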

The LIMS platform is composed of a relational database [19], a file server, and a local agent, which includes a graphical user interface (client), deployed on each laboratory workstation. Any information related to the specification and execution of laboratory procedures is stored in the relational database. Such information comprises: (1) the workflow specifications, i.e. the tasks with their interdependencies, input/output data types, standard operating procedures (SOPs) and associated default conditions; (2) the data generated by the workflow execution, i.e. all running conditions and generated samples and/or experimental data, where each sample, solution, reagent, and piece of equipment used or generated is identified by a barcode, which allows tracking and provides the history information necessary for quality control; and (3) the progress status of the laboratory procedures, so that the LIMS is able to provide an operator with the appropriate task to perform on a given sample at any one time.

The file server is a central repository for all the files generated by the laboratory procedures. Laboratory operators use the client program to interact with the LIMS. For a given sample, identified by its barcode, the user interface retrieves the SOP of the laboratory task to perform, and lets the operator specify the equipment, reagents, and solutions used. At the end of the task, all files and experimental parameters are saved, and the system generates barcodes for the new samples.

Moreover, the client program integrates pre- and post-processing modules used mainly for the management of the laboratory instruments. Preprocessing modules generate command scripts for the instruments, e.g. liquid handling robots and LC-MS/MS workstations. Post-processing consists of the quality control of the output data, their automatic transfer to the file server, and the extraction and processing of the data needed by the following steps of the process.

The LIMS representation of the 10, 500 and 2500 mL process workflows comprises 62, 130 and 184 task specifications, respectively. In a typical 10 mL project, 600 tasks are executed while 350 samples (including intermediate samples) and 10 gigabytes of compressed data are generated. A typical 500 mL project consists of 4000 task executions, 5000 generated samples, and 75 gigabytes of compressed data. A typical 2500 mL project consists of 20 000 tasks, 15 000 samples and 250 gigabytes of compressed data.

2.4 MS data identification

2.4.1 Databases

We identify proteins by searching peptide MS/MS spectra against several databases: Swiss-Prot+TrEMBL [16] human sequences, proprietary protein databases, gene predictions, EST clusters built from human dbEST sequences [20], and the human genome (NCBI, build number 34). The databases are prepared according to a procedure aimed at moderately reducing redundancy and exploiting annotations found in Swiss-Prot, e.g. splice variants, PTMs, chains, signal peptides. The redundancy reduction involves the comparison of all sequences of all databases against themselves; this huge similarity search is performed on a 120-node computer cluster. The similarity criteria we impose are very stringent to avoid eliminating polymorphisms and other variations such as splice variants. Hence, redundancy is not completely removed. EST clusters are generated by a proprietary system based on the CAT algorithm [21–24]. We produce gene predictions from the human genome by running Genescan [25] and HMMGene [26].

2.4.2 Search engine

To search an experimental peptide spectrum against a database, we digest database proteins (DNA is translated beforehand) into theoretical peptides from which we compute theoretical spectra. Theoretical and experimental spectra are compared by applying a peptide scoring function to measure their correlation. In addition to computing a score for each comparison, we assign it a confidence level. The theoretical peptide that obtains the best score, provided the confidence level is high enough, is considered to correctly match the experimental spectrum. This procedure is repeated for each experimental spectrum. Peptide identifications are combined to obtain protein identifications, namely database sequences.
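The digestion step can be illustrated with a minimal sketch. It assumes trypsin cleaves after K or R except when the next residue is P (a common convention, not stated in the text) and allows one missed cleavage, as in the first search pass described below. The function names are hypothetical and do not correspond to OLAV.

```python
def cleavage_pieces(protein: str) -> list[str]:
    """Split a protein after every K or R that is not followed by P (assumed rule)."""
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal piece without a cleavage site
    return pieces


def tryptic_peptides(protein: str, missed_cleavages: int = 1) -> list[str]:
    """Theoretical peptides obtained by joining up to `missed_cleavages` + 1 adjacent pieces."""
    pieces = cleavage_pieces(protein)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides
```

Each theoretical peptide would then be turned into a theoretical spectrum and scored against the experimental spectrum; that part depends on the proprietary scoring function and is not sketched here.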

The search engine we use is a proprietary program named OLAV [12–15, 27]. OLAV peptide scoring functions are based on signal detection theory and classical machine learning techniques. OLAV has several important functionalities to both support high-throughput production and maximize MS data usage, which are generally not found in available software: (i) Multiprocessor mode with dynamic load balancing; (ii) Extensive list of PTMs that can be set either fixed or variable. In particular, it is possible to set fixed/variable modifications at precise sequence positions to exploit database annotations or validate user hypotheses; (iii) Two-pass search: it is possible to use a first set of search parameters to search the entire database and, subsequently, to use a second set of parameters to search against the database entries that were matched at the first pass. Typically, we use less selective parameters and impose tryptic peptides (one missed cleavage) at the first pass and then allow nontryptic peptides but require more significant matches at the second pass. This strategy is very effective in extensively exploiting MS/MS data and controlling the false identification rate [13]. A similar method has been recently proposed to reduce computation time [28]; (iv) Search against given sequences, i.e. not only databases. This enables us to test hypothetical protein sequences generated either manually or automatically; and (v) Prediction of ion trap peptide charge states to avoid testing all possible charges [29].
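The control flow of the two-pass search in item (iii) can be sketched as follows. This is only an outline under stated assumptions: the peptide generators, the scoring function and the thresholds are passed in as parameters because the actual OLAV scoring functions and parameter values are not given here, and all names are hypothetical.

```python
from typing import Callable, Hashable, Iterable

Database = dict[str, str]                 # entry identifier -> protein sequence
Hit = tuple[str, str, float]              # (entry identifier, peptide, score)


def search_pass(
    spectra: Iterable[Hashable],
    database: Database,
    peptides_of: Callable[[str], Iterable[str]],
    score: Callable[[Hashable, str], float],
    min_score: float,
) -> dict[Hashable, Hit]:
    """Keep, for each spectrum, the best-scoring peptide whose score reaches min_score."""
    hits: dict[Hashable, Hit] = {}
    for spectrum in spectra:
        best = None
        for entry_id, sequence in database.items():
            for peptide in peptides_of(sequence):
                value = score(spectrum, peptide)
                if value >= min_score and (best is None or value > best[2]):
                    best = (entry_id, peptide, value)
        if best is not None:
            hits[spectrum] = best
    return hits


def two_pass_search(spectra, database, tryptic, nontryptic, score, lenient, strict):
    """Pass 1: whole database, tryptic peptides, lenient threshold.
    Pass 2: only entries matched in pass 1, nontryptic peptides, stricter threshold."""
    spectra = list(spectra)
    first = search_pass(spectra, database, tryptic, score, lenient)
    matched = {entry_id for entry_id, _, _ in first.values()}
    reduced = {entry_id: database[entry_id] for entry_id in matched}
    second = search_pass(spectra, reduced, nontryptic, score, strict)
    return first, second
```

Restricting the second pass to entries already matched keeps the much larger nontryptic search space small, which is one plausible reason the strategy helps control the false identification rate; how the results of the two passes are merged is left open here.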

2.5 Integration

MS data identification yields large amounts of sequences along with the corresponding identified peptides and associated scores (see Table 2, “raw peptide identifications”, “DB entries”). In particular, this is due to the many databases we search against, which, besides pure redundancy, also contain incomplete sequences, splice variants, precursors, polymorphism, etc.

Integration is the first step in our post-identification automatic bioinformatics analysis. It is a three-step algorithm selecting candidate sequences among all those that were matched at the MS identification stage. This selection serves to eliminate some remaining false positives and, mainly, to reduce redundancy. The integration algorithm starts with an expert system that rejects or accepts each identified peptide based on a set of rules depending on several properties such as the peptide length, its sequence, its score, its identification confidence level, its PTMs, and the database in which it has been identified. This system has been developed in-house on the basis of the experience acquired by the annotators.
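A rule-based filter of this kind might look like the sketch below. The rules and every threshold are invented for illustration; the in-house rule set is not described in the text, so nothing here should be read as the actual expert system.

```python
from dataclasses import dataclass


@dataclass
class PeptideHit:
    sequence: str
    score: float
    confidence: float        # identification confidence level in [0, 1]
    database: str
    ptms: tuple = ()


# Hypothetical minimum confidence per database: less reliable sources
# are assumed to require more confident matches.
MIN_CONFIDENCE = {"Swiss-Prot": 0.90, "EST cluster": 0.99}


def accept(hit: PeptideHit, min_length: int = 6, default_confidence: float = 0.95) -> bool:
    """Return True if the peptide passes all (illustrative) rules."""
    if len(hit.sequence) < min_length:
        return False                       # very short peptides are ambiguous
    required = MIN_CONFIDENCE.get(hit.database, default_confidence)
    if hit.confidence < required:
        return False                       # not confident enough for this database
    if len(hit.ptms) > 2 and hit.confidence < 0.99:
        return False                       # many PTMs need strong evidence (assumed rule)
    return True
```

Keeping each rule as a separate, readable test makes it easy for annotators to review and adjust the rule set as experience accumulates.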

Although the MS identification program searches against databases of reduced redundancy (see above), homology between sequences still remains. The purpose of the second part of the algorithm is to reorganize the identified sequences using these homologies. A hierarchical structure is built that groups sequences sharing identical peptides. For example, two entries identified with the same peptides are considered equivalent, whereas two entries sharing no common peptide lie on two different trunks of the hierarchical structure. Two entries identified on almost the same set of peptides lie on near branches. This procedure is hence capable of distinguishing between variants (splice or polymorphism).
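The three situations described above (equivalent entries, near branches, different trunks) can be expressed with peptide sets. The sketch below groups entries with identical peptide sets and uses the Jaccard index as a stand-in similarity measure; the actual construction of the hierarchy is not specified in the text, so this is an assumption for illustration only.

```python
from collections import defaultdict


def equivalence_classes(entry_peptides: dict[str, set[str]]) -> list[list[str]]:
    """Group database entries that were identified with exactly the same peptides."""
    classes = defaultdict(list)
    for entry, peptides in entry_peptides.items():
        classes[frozenset(peptides)].append(entry)
    return [sorted(entries) for entries in classes.values()]


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared peptides over all peptides: 1.0 = equivalent, 0.0 = nothing in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With this measure, a value of 1.0 corresponds to equivalent entries, a value close to 1.0 to near branches (for instance two splice variants differing by one peptide), and 0.0 to entries on different trunks.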

Finally, for each homology group, one or several sequences are selected, based on several factors, e.g. overall score, number of peptides identified, coverage, and reliability of the database (Swiss-Prot is more reliable than EST clusters). The results of the MS data identification and the integration, as well as the initial mass lists, are all exported in the XML format, and subsequently stored in the proteomics database (ProtDB), which is detailed hereafter.
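One simple way to combine the selection factors listed above is a lexicographic comparison, sketched below. The ordering of the criteria and the reliability ranks are assumptions (the text only states that Swiss-Prot is more reliable than EST clusters), and the sketch picks a single representative although the procedure may keep several.

```python
from dataclasses import dataclass

# Hypothetical reliability ranks; only "Swiss-Prot > EST cluster" comes from the text.
DB_RELIABILITY = {"Swiss-Prot": 3, "TrEMBL": 2, "EST cluster": 1}


@dataclass(frozen=True)
class Candidate:
    entry_id: str
    database: str
    score: float
    n_peptides: int
    coverage: float          # fraction of the sequence covered by identified peptides


def select_representative(group: list[Candidate]) -> Candidate:
    """Pick the best candidate of a homology group (illustrative ordering of criteria)."""
    return max(
        group,
        key=lambda c: (c.score, c.n_peptides, c.coverage, DB_RELIABILITY.get(c.database, 0)),
    )
```

Because equivalent entries are identified with the same peptides, they usually tie on score and peptide count, so coverage and database reliability act as the tie-breakers in this ordering.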

2.6 Automatic annotation

The next step of the bioinformatics automatic analysis consists of the characterization of all sequences selected by the integration algorithm, following their insertion in the proteomics database. Sequence analyses are performed via an Analysis Information Management System (AIMS). AIMS is a proprietary distributed application, designed to handle heavy loads of parallel analyses. It provides a generic environment that allows the quick and easy integration of third party analysis software, by handling all nonspecific parts of biological sequence analysis. Results are stored in ProtDB, whereas all information relating to the analysis conditions is saved in another dedicated relational database.

Bioinformatics analyses often consist of a series of program executions, where data generated by a given program serve as input for the next one. In order to automate this type of analysis, AIMS uses and extends the LIMS workflow implementation [18] to design and manage complex bioinformatics analyses, where stand-alone programs are combined together. Dedicated decision-making tasks are present at key positions in the workflow to analyze intermediate results and further drive the analysis.

Currently, sequences selected by the integration undergo fifteen automated analysis workflows (in the LIMS/AIMS sense), including functional and structural domain detection, pattern matching, homology searching, chromosomal and cellular localization, and classification into protein families. Annotations from public biological databases are also retrieved and inserted into ProtDB. The number of automatic annotation tasks with respect to the process type is listed in Table 2.

2.7 The proteomics database (ProtDB)

2.7.1 Data management framework

ProtDB is a central database that stores results of MS identification/integration and automatic bioinformatics analyses. It also stores manual annotations. We designed ProtDB to meet several requirements: (1) generic data storage for bioinformatics results; (2) flexible and fast interactive access to the data; (3) efficient data access/insertion operations for programs participating in the automatic annotation; (4) in-depth consistency checks during data insertion; and (5) capacity to adapt to changes in project specifications and to the general evolution of the entire bioinformatics platform.

We developed and implemented ProtDB in Oracle 9iR2 operated on a Linux platform, taking full advantage of the most recent Oracle functionalities. Each database has the internal knowledge of its own proteome(s) specifics and of its own users and security management.

The very different requirements of the database users, no matter whether they are programmers or interactive users, are served by the same database schema. By using standard practices, such as the concept of view [19] implemented in many relational database management systems such as Oracle, we developed an abstract layer on top of a generic database schema. This access layer, complemented by stored procedures and triggers, has the ability to adapt rapidly in order to follow the evolving needs of the users. This layer also ensures short access times and a stable interface with the database for client programs in case the underlying schema is modified.
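The idea of a view layer on top of a generic schema can be illustrated with a small, self-contained sketch. It uses SQLite from the Python standard library purely as a stand-in for Oracle, and the table, view and column names are hypothetical; they are not the ProtDB schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a generic result table and a typed view on top of it (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE result (
            id     INTEGER PRIMARY KEY,
            seq_id TEXT NOT NULL,
            kind   TEXT NOT NULL,   -- e.g. 'peptide', 'annotation'
            name   TEXT NOT NULL,
            value  TEXT
        );
        -- Client programs query the view, not the generic table.
        CREATE VIEW peptide_match AS
            SELECT seq_id, name AS peptide, CAST(value AS REAL) AS score
            FROM result
            WHERE kind = 'peptide';
        """
    )
    conn.executemany(
        "INSERT INTO result (seq_id, kind, name, value) VALUES (?, ?, ?, ?)",
        [
            ("SEQ1", "peptide", "PEPA", "17.5"),
            ("SEQ1", "peptide", "PEPB", "42.0"),
            ("SEQ1", "annotation", "family", "hormone"),
            ("SEQ2", "peptide", "PEPC", "8.0"),
        ],
    )
    return conn


def peptide_matches(conn: sqlite3.Connection, seq_id: str) -> list[tuple[str, float]]:
    """Read peptide matches through the view, best score first."""
    query = "SELECT peptide, score FROM peptide_match WHERE seq_id = ? ORDER BY score DESC"
    return conn.execute(query, (seq_id,)).fetchall()
```

If the generic table is later reorganized, only the view definition has to change; `peptide_matches` and any other client code keep working against the same interface.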

ProtDB’s flexible design lets us integrate multiple result sets (proteomes), from 2500 mL down to 1 mL, in a single database of about 100 Gb or manage them separately, depending on the projects. We currently have 13 databases available on-line, all using the same schema definition, accessed by the same set of programs.

2.7.2 Data insertion procedure

The XML files obtained after completion of the MS data identification and integration programs are parsed by the database insertion program. External information on identified sequences is automatically retrieved from the relevant biological databases such as Swiss-Prot, TrEMBL, etc. Once a batch of data has been entered, a second program computing the views and launching the automatic annotation machinery is started. At this point, interactive access to the newly inserted data is allowed.


2.8 Expert annotation and database interface

Cutting-edge laboratory and data analysis methods would be incomplete without expert curation by human annotators. The quality of the identification of each sequence is carefully checked, taking into account the accuracy of the peptide matches (borderline matches are manually checked by looking at the peptide fragment masses), the number of peptides that match on the sequence, and the sequence frequency in the individual chromatographic fractions. Additionally, conflicting matches (a spectrum matching more than one peptide), which are not automatically eliminated [13], are manually resolved. This procedure eliminates a few remaining false positive identifications. Related validated sequences are then analyzed, by using bioinformatics tools, in order to identify expressed proteins and eliminate a few redundant sequences left by the integration process. This step and the usage of databases such as EST clusters, individual ESTs (dbEST) and TrEMBL enable us to identify splice variants and polymorphism when supporting specific peptides are detected. Furthermore, data mining is usually performed on sequences of interest. These observations are correlated with relevant literature searches to assess the “biological” value and the consistency of our results regarding a specific sample comparison.

Giving the annotators access to large collections of data and letting them operate on these data while tracking their actions and storing their observations/decisions is not a trivial task. As said before, it is supported by the ProtDB design. The capability to track additional bioinformatics analyses performed by the annotators is provided by AIMS in combination with an in-house adapted version of the Pise [30] web interface, the results being stored in ProtDB.

In addition to properly designed database systems, access to vast amounts of data calls for advanced interfaces. The interface system we implemented is made of two components: a classical web-based interface (Perl/DBI, CGI, JavaScript and SVG tools) and a limited number of more sophisticated Java applications. A first set of tools is used for the manual validation of protein identifications from proteome-wide views down to individual peptide match graphical representations. The annotators can also initiate additional bioinformatics analyses via the interface. Specific tools enable the analysis of protein differential expression among several samples; these include statistical reports and graphics, e.g. protein scores reported for each sample, plots of the identified peptides in each fraction, etc. For instance, the IRIS interface (Integrated Results Interface System) graphically displays all the peptides matching a selected sequence, the results of the programs invoked during the automatic sequence annotation, and the manual annotations. These objects are represented in a tree in conjunction with their associated properties such as peptide score, position, sample type (disease vs. control), etc. The tree representation can be dynamically reorganized based on object properties, depending on user needs (Fig. 2).

Figure 2. Data mining using the IRIS interface. A selected protein is shown with its tree of analysis program results, annotations and matching peptides. The annotations and the matching peptides branches are both expanded, the intensity reflecting the number of peptides identified in the entire sample (1). With a single mouse click, the peptides can be grouped into CEX chromatography fractions (2). This representation may indicate the presence of a fragment in fraction 12, since a distinct set of peptides is detected (3). Moreover, the complete absence of peptides in the region between positions 96–117 could be explained by the presence of a carbohydrate at position 109 (4). Visual reorganizations like grouping, filtering, and sorting data can be applied to any kind of program result, annotation or peptide set properties. Such flexibility is made possible by a generic data model that is implemented in the underlying proteomics database and on which Java applications such as IRIS rely.


Another graphical view is provided by a new type of 2-D plot used for mapping the identification of a selected protein onto the chromatographic fractions. Depending on the process, there are two or three chromatographic dimensions after depletion and gel filtration: CEX and RP in the case of the 10 mL process; CEX, RP1, and RP2 in the case of the 500 mL (and 2500 mL) process. Potentially, a given protein can be identified in each RP fraction of the 10 mL process or in each RP2 fraction of the 500 mL process. To have a global view of where a protein has actually been identified, with additional information such as identification score, coverage, and number of peptides identified, we use a synthetic 2-D representation of the chromatographic space. This space is divided into cells having two coordinates: (CEX, RP) or (CEX, RP1). In the case of the 500 mL process, a cell is further divided into subcells to represent the RP2 fractions. Finally, when two samples are compared, it is convenient to divide the cells into two halves: the upper half being associated with one sample and the lower half with the other. It is also possible to use color codes. Mouse clicks initiate the display of extra information. An example is shown in Fig. 3; other examples are given in Section 3.
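The cell and subcell layout of this 2-D plot maps naturally onto a nested dictionary. The sketch below is a hypothetical data structure, not the plotting code of the platform: it indexes identifications by (CEX, RP1) cell, then by sample (one half of the cell each), then by RP2 fraction, and scales scores by the maximum score as is done for the radii in Fig. 3.

```python
from collections import defaultdict


def build_cells(identifications):
    """identifications: iterable of (sample, cex, rp1, rp2, score) tuples.

    Returns {(cex, rp1): {sample: {rp2: score}}}: one cell per (CEX, RP1) pair,
    one half per sample, one subcell per RP2 fraction.
    """
    cells = defaultdict(lambda: defaultdict(dict))
    for sample, cex, rp1, rp2, score in identifications:
        cells[(cex, rp1)][sample][rp2] = score
    return cells


def scaled_radii(cells):
    """Scale every score by the maximum score so that radii lie in (0, 1].

    Assumes at least one identification with a positive score.
    """
    max_score = max(
        score
        for cell in cells.values()
        for subcells in cell.values()
        for score in subcells.values()
    )
    return {
        coordinates: {
            sample: {rp2: score / max_score for rp2, score in subcells.items()}
            for sample, subcells in cell.items()
        }
        for coordinates, cell in cells.items()
    }
```

For the 10 mL process, which has no RP2 dimension, the same structure can be used with a single dummy RP2 index per cell.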

2.9 Differential statistical analysis

Although 10 mL and 1 mL processes can be applied to pools of individual samples, they are mainly designed for the analysis of individual samples. Projects that include multiple sample analyses generate data that we process by multivariate statistical methods adapted from DNA chips [31, 32]. In particular, we use the method of [33] in case of small numbers of repetitions. For instance, let us assume that we compare two categories of samples: control and disease. Let $n$ be the number of samples in each category, and let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ be quantities representing the abundance of a given protein in the $n$ control and the $n$ disease samples, respectively. We compute the statistic

$$Z = \frac{\operatorname{med}\{X_i\} - \operatorname{med}\{Y_i\}}{s}, \qquad (1)$$

where $s = \sqrt{\frac{2}{\pi}\left[\frac{s_1^2}{n} + \frac{s_2^2}{n}\right]}$, $s_1^2$ and $s_2^2$ are approximations of the variances of $X$ and $Y$, respectively, and med is the median. The latter variance approximations require some care when $n$ is small; see [33] for more details. The higher the absolute value of $Z$, the more “differential” the protein is. Namely, it is possible to compare $Z$ to a theoretical or an empirical distribution, and estimate the probability that the given protein is differentially expressed (see Section 3, and Fig. 4 for an example).
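A minimal sketch of Eq. (1) and of the comparison with an empirical distribution follows. Two assumptions are made and flagged in the comments: the variance approximations are replaced by plain sample variances (the text refers to [33] for the small-n treatment), and the prefactor of s is transcribed as it appears in Eq. (1). The confidence rule (F_Z(z) for z > 0, 1 – F_Z(z) for z < 0) is the one given in the caption of Fig. 4.

```python
import math
from statistics import median, variance


def z_statistic(x: list[float], y: list[float]) -> float:
    """Z of Eq. (1) for n control values x and n disease values y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal size n >= 2")
    n = len(x)
    s1_sq = variance(x)   # assumption: plain sample variance instead of the method of [33]
    s2_sq = variance(y)
    # Prefactor 2/pi taken as printed in Eq. (1).
    s = math.sqrt((2 / math.pi) * (s1_sq / n + s2_sq / n))
    if s == 0:
        raise ValueError("zero variance in both samples")
    return (median(x) - median(y)) / s


def empirical_confidence(z: float, null_z: list[float]) -> float:
    """Confidence that a protein is differential, from an empirical distribution of Z.

    null_z holds Z values obtained by comparing identical samples.
    """
    f_z = sum(1 for value in null_z if value <= z) / len(null_z)  # empirical CDF at z
    return f_z if z > 0 else 1 - f_z
```

In practice `null_z` would be learnt from repeated analyses of identical samples, as done with the four PX samples in Fig. 4.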

Besides obvious choices for $X$, $Y$ when stable isotopic labels such as 18O are added to the samples [11, 17], we observed that OLAV peptide scores can be used directly as a rough indicator of protein abundance when no internal standard is available. We typically use for $X_i$ (or $Y_i$) the sum of the scores of every peptide of a protein over every fraction of sample $i$:

$$\mathrm{Expression}(P) = \sum_{f} \sum_{p \in M_f(P)} s(p) \qquad (2)$$

where $P$ is a given protein, Expression($P$) is a quantity aimed at representing its “abundance”, $f$ is a chromatographic fraction, $M_f(P)$ is the set of identified peptides for $P$ in $f$ (with repetition), and $s(p)$ is the score of peptide $p$.
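Eq. (2) amounts to a double sum, as the following sketch shows. The input layout (a dictionary from fraction label to the list of peptide scores, with repeated identifications listed repeatedly) is an assumption made for the example; the logarithmic variant corresponds to the quantity used in Fig. 4.

```python
import math


def expression(scores_by_fraction: dict[str, list[float]]) -> float:
    """Expression(P) of Eq. (2): sum of peptide scores over all fractions of one sample.

    scores_by_fraction maps a fraction label to the scores of the peptides of
    protein P identified in that fraction (repeated peptides appear repeatedly).
    """
    return sum(sum(scores) for scores in scores_by_fraction.values())


def log_expression(scores_by_fraction: dict[str, list[float]]) -> float:
    """Natural logarithm of Expression(P); requires a positive total score."""
    return math.log(expression(scores_by_fraction))
```

For example, a protein with scores 10.0 and 12.5 in one fraction and 7.5 in another has Expression(P) = 30.0.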

2.10 Protein synthesis

The most promising proteins are synthesized by solid phase peptide chemistry [34, 35]. Synthesis of proteins or peptides of less than 50 residues is conducted by the Boc in situ neutralization technique [36] using custom modified peptide synthesizers. Proteins ranging from 50 to 160 residues are synthesized applying the native chemical ligation technique [37]. Purity of the refolded and purified material is higher than 95%. Typical yields exceed 5 mg for each protein. These highly pure and completely characterized proteins can be used to validate differential expression and identification of promising candidates by spiking at the individual sample level.

3 Results and discussion

3.1 Protein separation

The 10 mL process separation scheme (Table 3) starts with the depletion of serum albumin and immunoglobulins. It is crucial to remove abundant larger proteins as they interfere with the subsequent chromatographic columns. Small proteins are isolated by gel filtration in the presence of urea to disrupt noncovalent complexes. In our hands, gel filtration ensures better protein recovery compared to ultrafiltration (data not shown), which has already been used in isolation of small proteins in plasma/serum proteomics studies [3]. Furthermore, gel filtration is easy to scale. The fractions generated by gel filtration have a large volume. The column eluent corresponding to low Mr proteins is thus submitted to in-line reverse phase capture in order to concentrate the proteins. The low Mr protein fraction obtained by reverse phase capture is submitted to strong cation exchange, and 15 fractions are collected. The CEX step notably decreases the complexity of the protein mixture. 100 mg protein of each CEX fraction is submitted to RP fractionation. The volume corresponding to this protein amount is automatically calculated by the LIMS, based on the CEX chromatogram. We obtain similar results for the 500 mL process (data not shown), where a supplementary reverse phase step (RP2) is performed.

Figure 3. Two examples of differentially expressed proteins in the sample comparison M4e/M4l. 500 mL process RP1 fractions are localized by their (CEX, RP1) coordinates and represented by circles. The circles are divided into two halves: upper half for late pregnancy (M4l, in black) and lower half for early pregnancy (M4e, in red). Finally, each half is divided into 24 sectors to represent RP2 fractions. The radius lengths indicate protein identification scores scaled according to the maximum score. The proteins HCG (Swiss-Prot P01233) and HPL (P01243) are two well-known markers of early and late pregnancy, respectively [45]. HCG is identified with high scores in several RP2 fractions of M4e and almost none in M4l, whereas HPL is identified with high scores in several RP2 fractions of M4l and not in M4e.

During the development of the original 2500 mL process, for which the albumin/immunoglobulin depletion was performed identically, we analyzed the retained fractions [38]. We found no evidence of low abundant proteins; only abundant proteins such as prealbumin and apolipoprotein AI were reliably identified. We are currently in the process of doing a new investigation with improved analysis techniques. These results will be published elsewhere.

Figure 4. Empirical random distribution of the statistic Z learnt on four identical 10 mL samples (PX) analyzed separately. Here we use the natural logarithm of the sums of the peptide OLAV scores as the quantity representing protein abundance. The absolute value of Z serves to measure how differential a protein is [33]. By analyzing four samples (PXleptin) identical to PX but with leptin spiked at known concentrations (1 nM, 10 nM, 50 nM, 100 nM), we can establish the potential of the 10 mL process to identify differentially expressed proteins. The probability distribution function of Z (FZ) is used to estimate confidence levels: use FZ(z) if z > 0 and 1 – FZ(z) if z < 0. When comparing PXleptin samples with PX we always find leptin as differential with confidence > 99% (z > 4.1). This is still the case if we replace PX by PXleptin and do the comparisons 1:10, 1:50, 1:100, 10:50, 10:100 (z > 5.3). The confidence level for 50:100 is 81% (z = 1.1).

Table 3. Separation steps for the low molecular weight (LMr) proteins of 10 mL of plasma or serum

| Separation steps | No. of fractions | Protein (mg) | Recovery of LMr proteins |
|---|---|---|---|
| Depletion of albumin and immunoglobulins | 1 | 200 | – |
| Gel filtration/reverse phase capture | 1 | 5 | 91% |
| Cation exchange | 15 | 0.06–1 | 95% |
| Reduction-alkylation and reverse phase fractionation | 15 | – | – |

Protein recovery (< 25 kDa) is given when available, expressed as a percentage of eluted proteins over proteins injected; both values as determined by size exclusion HPLC.

The average number of proteins found by MS per final fraction is reported in Table 2 (“average per fraction”). It illustrates the power of the protein separation methodology in reducing the initial sample complexity: compare the total number of validated proteins with the complexity of the most complex final RP/RP2 fractions in Table 2 (“validated distinct proteins”, “most complex fraction”). Sample complexity reduction combined with high protein recovery rates, as reported in Table 3, gives access to low abundant proteins such as GP_1547 (5 pM, Fig. 5).

As we said, comparative proteomics requires high reproducibility at every process step. Protein separation reproducibility is demonstrated in the case of the 10 mL process by Fig. 6 (comparison of six RP chromatograms obtained for one CEX fraction in two physiological conditions) and Table 4 (leptin separation in the spiking experiments). In the case of the 500 mL process, reproducibility is demonstrated by Fig. 5. Therefore we conclude that protein separation reproducibility is satisfactory.

Table 4. Leptin localization in the four leptin spiking experiments

CEX 9 CEX 10 CEX 11 CEX 12 CEX 13 CEX 14 CEX 15

RP8 1 2 3

RP 1 2 4 37 1 5 6 8 9 12 6 9 2 4 1 1 1

RP 16 1 2 5 7 11 6 3 3

Each RP fraction where we identified leptin is represented in the table: CEX 9–15, RP 6–8. We report the number of distinct peptides identified for each concentration. The cells corresponding to individual RP fractions are divided into four subcells: upper left, number of distinct peptides at 1 nM; upper right, 10 nM; lower left, 50 nM; and lower right, 100 nM. The reproducibility is satisfactory, with more RP fractions containing leptin when its concentration is higher. We also note a moderate carry-over on the right (CEX 13–15).

In addition to reducing initial sample complexity, we conjecture that the protein separation methodology we implemented has the potential to reveal certain protein isoforms. As a matter of fact, for some proteins we observe separation patterns such as in Fig. 7. Considering the demonstrated separation power and reproducibility obtained in the 500 mL process, and the high selectivity of the MS data identification and integration algorithms (see below), it is possible that such patterns are not generated by random chance. Nonetheless, it is still not clear if these different “regions” of the 2-D plot (Fig. 7) are due to different active fragments, chains or PTMs. Admittedly, there is a possibility that the observed pattern is an artifact due, for instance, to multiple protein domains randomly interacting with the various separation columns (mainly CEX). It is in fact probable that the origins of such a chromatographic pattern differ depending on the protein. In this respect, the example of Fig. 7 is interesting since the protein is identified by a different set of peptides in the upper group of fractions compared to the other two groups (middle and lower).

Figure 5. 500 mL process separation power. The fractions where a protein is detected in a sample comparison are represented as in Fig. 3. The example of GP_1547 shows a low abundant protein (5 pM) sharply separated, with high reproducibility: CEX and RP1 fractions are the same and RP2 fractions differ by 1 only. In both samples GP_1547 is identified by one peptide (coverage 10%). The example of GP_84540 shows a more typical pattern: the more abundant protein (7.2 nM) is detected in a limited number of fractions with a strong signal in a few fractions only, the other ones corresponding to carry-over. Note the similar RP2 fractions and identification scores. GP_84540 is identified by seven distinct peptides that overlap (some are nontryptic, coverage 31%).

Figure 6. Reverse phase chromatograms of the CEX fraction 5 of 6 distinct 10 mL samples. Three samples are control (A) and three samples are disease (B). Chromatogram reproducibility is satisfactory.


Figure 7. Example of possible isoform detection in a 500 mL process. A. The (CEX, RP1) separation space contains three groups of fractions that contain the protein GP_5297. These three groups possibly correspond to three isoforms. B. We observe that distinct parts of the protein sequence are detected in the two upper groups. This can be seen by using a modified plot with the protein score replaced by the protein coverage. Here we represent CEX fractions 7–12 and RP1 fractions 2–8. C. Details of a cell representing peptide matches in a RP1 fraction. The RP2 fractions are mapped onto horizontal pairs of lines (one line per sample) that correspond to the entire protein sequence; only the portions of the lines corresponding to identified peptides are drawn. The status of the lower group is less clear as the detected peptides are the same as in the middle group.

3.2 Mass spectrometry

The noticeable gain in performance between the first 2500 mL process and the new 500, 10 (and 1) mL processes (Table 2 “validated distinct proteins”) can be attributed in part to the upgrade of the MS instruments: Esquire3000 to Esquire3000plus. Nonetheless, performance improvement has been a two-step process that started with the necessity to gain extra control over the entire platform variability. To that end, we made many modifications at all stages of the process from separation down to MS acquisition.

The introduction of an optimized chemistry in the reduction/alkylation procedure, combined with a more robust digestion including improvements in the concentration/resolubilization of the tryptic peptides prior to the MS analysis, are some of the new steps of the 10 and 500 mL processes compared to the initial 2500 mL process.

The gain in performance could be further improved by optimizing the LC-MS platform. Results obtained on standard mixtures of proteins clearly showed that many peaks in the MS spectra were saturated. This issue is common to nearly all quadrupole ion traps. In our case, it was partly due to the fact that the amount of peptides injected on the HPLC column was not monitored accurately enough, and it was often too large for the capacity of the trap. We also noticed that many good spectra were not identified when searching a database. By introducing a dynamic calculation (LIMS) of the volume to be injected, based on the absorbance during the previous separation step (RP for 10 mL or RP2 for 500 mL), we could reduce the loss in resolution for intense peaks. Nevertheless, the impact on the number of unidentified good spectra was not spectacular.

Since the Esquire3000plus ion trap exhibits much higher performance than the Esquire3000 in terms of resolution and signal-to-noise ratio (S/N), we decided to conduct an extensive study to better understand the factors governing successful identifications when using ion traps. Several acquisition parameters that were known from the previous platform to have an influence on the spectra, such as the detector multiplier gain voltage, the ion path voltages, and other ones have been investigated and systematically varied on every spectrometer (30 instruments). The outcome of this study was that our Esquire3000plus instruments were all different with respect to their resolution, sensitivity to space charge effects, and S/Ns. Such a variation is not surprising, but it is a real problem for any comparative study based on MS data when using several instruments. We therefore had to develop a procedure that first normalizes the inter-instrument variation at the detection level, and second optimizes each individual instrument for optimal transmission, trapping efficiency, resolution and saturation. The gain in positive identifications of the overall procedure is close to 30%. However, even with optimized ion traps and reduced differences between them, important variations still exist. Hence we used a complex mixture of digested standard proteins for finally characterizing all the instruments and grouping them into pools of similar spectrometers. The results for a set of 16 instruments (Esquire 3000plus) are reported in Table 5. This normalization procedure is repeated before each large comparative study. The LIMS manages the individual chromatographic fractions such that corresponding fractions of different samples are analyzed by spectrometers in the same pool.
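Grouping instruments into pools of similar performance can be sketched with a simple one-dimensional rule. The criterion below (sort instruments by their average number of identified peptides and open a new pool whenever an instrument exceeds the first member of the current pool by more than a tolerance) is an assumption chosen for illustration; the actual pooling procedure is not described in the text.

```python
def pool_instruments(mean_identifications: dict[int, float], tolerance: float) -> list[list[int]]:
    """Group instruments whose average identification counts lie within `tolerance`
    of the lowest-performing member of their pool (illustrative criterion)."""
    pools: list[list[int]] = []
    pool_start = None
    ranked = sorted(mean_identifications.items(), key=lambda item: item[1])
    for instrument, value in ranked:
        if pool_start is None or value - pool_start > tolerance:
            pools.append([instrument])   # open a new pool
            pool_start = value
        else:
            pools[-1].append(instrument)
    return pools
```

Applied to the 100 ng values of instruments 1, 4, 5, 13, 14 and 16 in Table 5 (90.8, 74.5, 95.7, 150.5, 56.2 and 60.3 peptides) with a tolerance of 15 peptides, this rule yields the pools [14, 16], [4], [1, 5] and [13]. The LIMS would then schedule corresponding fractions of different samples on instruments from the same pool.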

3.3 Data analysis

Inspection of Table 2 reveals essential facts concerningthe processes we developed. First of all, Table 2 makesobvious the progress accomplished from the first2500 mL process (M3c) to its medium-sized counterpart,i.e., the 500 mL process (M4e/l). The number of validateddistinct proteins is superior despite the smaller sample

Table 5. Average number of peptides reliably identifiedfrom a test peptide mixture

Instrument 50 ng 100 ng 200 ng

1 82.3 90.8 103.32 111.3 116.2 136.23 103.5 112.7 135.24 71.0 74.5 83.05 79.7 95.7 105.76 90.3 104.5 118.27 92.2 99.5 114.08 94.0 103.7 128.29 84.5 100.0 115.5

10 104.5 119.2 129.811 91.0 111.3 133.312 91.0 108.3 129.313 121.7 150.5 168.314 53.0 56.2 64.815 85.2 97.0 113.716 48.0 60.3 77.3

The test peptide mixture is the tryptic digest of eightproteins (details in Section 2.2.3). Different amounts(50, 100 and 200 ng total mass, each protein contri-butes to one eighth) of the test peptide mixture areinjected for testing each spectrometer. The experienceis repeated six times. Although the spectrometers are ofthe same model (Bruker Esquire 3000plus), and we tunethem to homogenize their performance, the observedperformance still contains variability. Hence the neces-sitiy to group similar spectrometers to enhance analysisreproducibility.

volume, which is remarkable considering the high number of nonredundant low Mr proteins identified in M3c. Besides the technological improvement of the MS platform (Esquire 3000 vs. 3000plus), the superior performance of the 500 mL process is also a consequence of improved methods. We have already explained what has changed in the laboratory procedures, and it is important to realize that the bioinformatics platform has been continuously improved as well, as can be seen from Table 2. For instance, the ratio between the number of peptide identifications and validated peptides has been reduced by a factor of 3, meaning far fewer manual checks. M3c MS data were analyzed by the search engine of MASCOT 1.7 [39] with results rescored by an embryonic OLAV scoring function (L1 scoring function in [13]); M4, PX, and PI data were searched by the OLAV search engine and improved OLAV scoring functions (see [13] for a description of current scoring functions and a comparison of OLAV vs. MASCOT 1.7).

OLAV provides high sensitivity and selectivity at the same time. For instance, if we impose a true positive peptide identification rate of 95% when searching a database of 100 000 human proteins with regular production data, i.e., spectra obtained from real samples and manually checked by the annotators, then the false positive peptide identification rate is as low as 4% (2+ peptides) or 1.5% (3+), and it is much smaller (< 0.5%) with Swiss-Prot human; see [13] for an extensive discussion of the current performance of OLAV.

The integration algorithm was not in place at the time of M3c, which caused a lot of manual work to reduce redundancy. Now, as already mentioned in Section 2.5, in addition to redundancy reduction, we also use the integration algorithm to automatically validate peptides/sequences identified with high confidence. Only “medium” confident identifications have to be validated manually, although the automatic treatment results can be changed by the annotators if necessary. This hybrid manual/automatic approach is another source of manual work reduction without compromising the final quality of the annotation.

The overall procedure, made of the MS identification and the integration, is complex and applied to several databases, whose total size exceeds 10⁹ amino acids, not including the unannotated human genome (genome-specific matches are processed separately). It is hence natural to ask what the protein false identification rate of this procedure is. Evaluating such a rate when analyzing real samples is not straightforward because no reference exists of what should be found. Other authors preferred to perform this estimation based on mixtures of purified proteins [40], but this does not reflect the actual difficulty of real data, or by searching reversed databases [41] that are not supposed to contain the correct answer. Searching spectra against a database that does not contain the correct sequences is certainly appropriate for estimating false positives, but we do not favor the technique of reversing databases. As a matter of fact, reversed peptide sequences can match the spectra of the original peptides surprisingly well, depending on their sequence. Therefore, we preferred to generate random databases by training an order 3 Markov model [12, 42] on each database. By reprocessing the MS data of one PX project (Table 1) with these random databases, we found a false positive rate of 1.2%. Moreover, none of the few retained sequences was validated automatically; they all fell in a category of matches that had to be checked manually by the annotators: the average p-value exponent of the validated peptides is −17.7 ± 8.8, i.e., a p-value of 10⁻¹⁷, whereas the average found by searching against the random databases is −7.3 ± 2.0, i.e., a p-value of 10⁻⁷. In the case of a search against real databases, the proportion of manual validation of such matches is at most 30%. Consequently, we estimate the false positive protein identification rate at 0.35%.
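The random database construction can be sketched as follows. This is a minimal illustration, not the procedure of [12, 42]: the function names, the choice to keep the number and lengths of the entries, and the restart from a start context when an unseen context is reached are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

ORDER = 3  # order of the Markov model, as stated in the text


def train_markov(sequences, order=ORDER):
    """Count start contexts and, for each context of `order` residues,
    the residues observed right after it."""
    starts = Counter()
    transitions = defaultdict(Counter)
    for seq in sequences:
        if len(seq) <= order:
            continue  # too short to contribute a transition
        starts[seq[:order]] += 1
        for i in range(len(seq) - order):
            transitions[seq[i:i + order]][seq[i + order]] += 1
    return starts, transitions


def _draw(counter, rng):
    """Draw one key of `counter` with probability proportional to its count."""
    keys = list(counter)
    return rng.choices(keys, weights=[counter[k] for k in keys])[0]


def random_sequence(starts, transitions, length, rng, order=ORDER):
    """Sample one sequence of the requested length from the trained model."""
    seq = _draw(starts, rng)
    while len(seq) < length:
        followers = transitions.get(seq[-order:])
        if followers:
            seq += _draw(followers, rng)
        else:
            seq += _draw(starts, rng)  # assumption: restart on unseen context
    return seq[:length]


def random_database(sequences, seed=0, order=ORDER):
    """Random database with as many entries, of the same lengths, as the input
    (entries not longer than `order` are skipped)."""
    rng = random.Random(seed)
    starts, transitions = train_markov(sequences, order)
    return [random_sequence(starts, transitions, len(s), rng, order)
            for s in sequences if len(s) > order]


if __name__ == "__main__":
    # Toy sequences invented for the example, not real database entries.
    toy = ["MKWVTFISLLLLFSSAYSRGVFRR", "MKAAVLTLAVLFLTGSQARHFWQQ"]
    for entry in random_database(toy):
        print(entry)
```

Because the model reproduces local residue statistics without copying whole sequences, spectra searched against such a database should only produce chance matches, which is what the false positive estimate relies on.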

As we mentioned, the way the integration algorithm groups sequences and “elects” one representative sequence does not discard splice variants or polymorphic sequences if one of their specific peptides is identified. For instance, in M4e/M4l we identified kininogen (P01042) based on 15 peptides (coverage 34%), one of which, IGEIKEETTSHLRSCEYK, is specific to splice isoform low Mr as reported in the Swiss-Prot entry (splice variant coverage 54%). Another example in the same sample is provided by the H-factor like protein (Q03591) that we identify on the basis of 15 peptides (coverage 85%). In the EST cluster database we also identified a cluster corresponding to H-factor like protein but with a specific peptide STDTSCVNPPTVQNAYIVSR (see for instance dbEST 8281077). This extra peptide corresponds to two variants documented in the Swiss-Prot entry (FTId VAR_001980 and VAR_001981). As the current version of OLAV is not able to deal with this type of variant, it was only possible to identify this sequence using the EST databases, thereby illustrating that our redundancy reduction algorithm does not suppress such information. In the two previous examples, the presence of the variants is established by the identification of specific peptides. The presence of the wild-type form cannot be confirmed since none of its specific peptides was identified.

As explained in Section 2.4.2, we search databases by applying a two-pass procedure, which we use for identifying nontryptic peptides in the second pass. Table 2 gives the number of validated tryptic peptides and we deduce that additionally we identify large numbers of reliable nontryptic peptides. Modified peptides are also searched in both the first and second pass (Table 2).

We implemented an additional sample-wide third pass aimed at discovering more modified peptides. The principle of such a pass is familiar to MS specialists: given a set of potential modifications, take the reliably identified peptides, add the appropriate mass deltas and search the unexplained spectra against these modified peptides (including nontryptic peptides). Because modified proteins can be found in different final RP fractions, we re-identify every spectrum against the validated peptides by allowing the appropriate variable modifications. We found numerous nontryptic peptides in fractions where the tryptic peptide required to start pass 2 above was missing. Nonetheless, we found only a few extra modified peptides (data not shown). We also plan to add an extra de novo sequencing pass to explain as many fragmentation spectra as possible.
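The principle of the third pass can be sketched in a few lines. The example below is a simplification under stated assumptions: it only compares neutral precursor masses, whereas an actual search would score the fragmentation spectra; the modification set, the 0.5 Da tolerance, the single-modification limit and all function names are hypothetical choices made for illustration.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

# Illustrative modification set: name -> (mass delta in Da, target residues).
MODIFICATIONS = {
    "oxidation": (15.99491, "M"),
    "phosphorylation": (79.96633, "STY"),
}


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def modified_candidates(validated_peptides, modifications=MODIFICATIONS):
    """Yield (peptide, modification, mass) for each single modification
    that a validated peptide can carry."""
    for peptide in validated_peptides:
        base = peptide_mass(peptide)
        for name, (delta, targets) in modifications.items():
            if any(aa in targets for aa in peptide):
                yield peptide, name, base + delta


def explain_spectra(unexplained_masses, validated_peptides, tol=0.5):
    """Map each unexplained precursor mass to the singly modified validated
    peptides whose mass lies within `tol` Da of it."""
    candidates = list(modified_candidates(validated_peptides))
    hits = {}
    for mass in unexplained_masses:
        matches = [(pep, mod) for pep, mod, cand in candidates
                   if abs(cand - mass) <= tol]
        if matches:
            hits[mass] = matches
    return hits


if __name__ == "__main__":
    # Toy peptide and masses invented for the example.
    print(explain_spectra([364.18, 500.0], ["AMK"]))
```

Restricting the candidates to peptides already validated in the sample keeps the search space small, which is why such a pass can afford a larger set of variable modifications than the database-wide passes.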

The number of peptide identifications is larger than the number of spectra in Table 2 (“spectra”, “peptide identifications”) because of multiple interpretations of certain spectra. We name these multiple interpretations conflicts. In [13] we reported a very low conflict rate (1.8%). From Table 2 we can deduce a higher rate. This is due to the fact that we apply generous thresholds in production, because false positives can be eliminated advantageously at later stages (integration), where results found in several databases can be compared and the overall sensitivity is better. The conflict rates also depend on the database size. For instance, in recent projects such as M4e/M4l and PX, the conflict rate on Swiss-Prot/TrEMBL human (17×10⁶ amino acids) is approximately 3%, human EST clusters 6% (10⁹ amino acids), and human genome 18% (7.4×10⁹ amino acids); more stringent thresholds are used for the EST clusters and the human genome. The proportion of matched spectra is given in Fig. 8; it is typically 30% for spectra with more than 50 masses. Comparing M3c and M4e/l curves in Fig. 8 does not reveal a clear advantage for M4e/l. This is due to the low thresholds we had to use for M3c in order not to limit the sensitivity of the less advanced platform, thus generating a huge amount of manual work. Similar performance is currently achieved automatically.
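For readers who wish to compute a conflict rate from their own identification lists, a minimal sketch follows. The definition used here (the fraction of identified spectra that received more than one distinct peptide interpretation) is an assumption, as the text does not spell out the denominator; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def conflict_rate(identifications):
    """Fraction of identified spectra with more than one distinct peptide.

    `identifications` is an iterable of (spectrum_id, peptide) pairs that
    passed the identification threshold.
    """
    by_spectrum = defaultdict(set)
    for spectrum_id, peptide in identifications:
        by_spectrum[spectrum_id].add(peptide)
    if not by_spectrum:
        return 0.0
    conflicts = sum(1 for peptides in by_spectrum.values() if len(peptides) > 1)
    return conflicts / len(by_spectrum)


if __name__ == "__main__":
    # Toy identifications invented for the example: spectrum s1 is a conflict.
    toy = [("s1", "AAK"), ("s1", "GGR"), ("s2", "LLK"),
           ("s3", "VVR"), ("s4", "TTK")]
    print(f"conflict rate: {conflict_rate(toy):.0%}")
```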

Besides the progress between the new 500 mL and the previous 2500 mL processes, Table 2 shows how several statistics naturally scale with sample volume and protein separation complexity. For instance, the numbers of spectra, validated peptides, and validated proteins are very illustrative. Now considering protein coverage, we can also read from Table 2 that the number of validated proteins that are identified based on a single peptide increases when the initial sample volume is reduced. On the contrary, the average number of peptides per validated protein increases. This is certainly due to abundant proteins that are largely covered and which represent an increasing proportion of identified proteins when the initial sample volume decreases. The length of the validated sequences is also reported in Table 2. It is interesting to note that the typical length of sequences identified on the basis of one peptide only does not differ (data not shown). We explain this observation by postulating that protein abundance, rather than protein length, is the dominant factor conditioning the number of peptides detected. Thus, there should be a correlation between abundance and the probability of identifying a protein based on a single peptide. We have not checked that conjecture thoroughly. The proteins shown in Fig. 5 provide a nice supporting example.

3.4 Differential analysis

We first report and discuss results obtained with the 10 mL process in a spiking experiment. In Fig. 4 we present the empirical distribution of the statistic Z, Eq. (1), as we learnt it from the four analyses of four identical PX samples. By analyzing the four spiked samples PXleptin and using the natural logarithm of the sum of every peptide OLAV score, we could compute Z for several scenarios:

Figure 8. Relative frequency of experimental MS/MS spectra with respect to the size of their mass list, which is used as a rough indicator of spectrum quality. The relative frequency of matched spectra is also given (dashed lines). The total number of spectra for each proteome is reported in Table 2. The mass list sizes of M3c are much smaller compared to the other proteomes; this is due to different peak-picking program versions and instruments. We observe that, provided the mass lists are large enough, say 50 masses, typically one-third of the spectra are matched. This proportion is slightly smaller in more recent proteomes because the identification software has better performance: higher selectivity for a given sensitivity. It is then possible to reduce the number of automatically identified peptides without losing sensitivity, thus reducing the number of false positives, which are rejected in subsequent automatic/manual steps (check that more peptides are finally validated in Table 2 “validated peptides”).

PX vs. PXleptin, PXleptin (lower concentration) vs. PXleptin (higher concentration), i.e., 1:10, 1:50, 1:100, 10:50, 10:100 and 50:100. The natural logarithms of the summed peptide OLAV scores, Eq. (2), are 2.4 (PX, avg/4), 4.3 (PXleptin 1 nM), 6.8 (PXleptin 10 nM), 9.0 (PXleptin 50 nM), and 9.4 (PXleptin 100 nM). Except for the comparison 50:100, where the confidence level is 81% (z = 1.1), all other comparisons give the leptin differential with a confidence level better than 99% (z > 4.1). The latter results suggest that we can reliably detect 1:5 ratios by solely using the identification score for proteins in the low nM range. More details on semiquantitative differential proteomics based on peptide scores will be published elsewhere.
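The score-based comparison can be illustrated with the values quoted above. Since Eq. (1) and Eq. (2) are not reproduced in this section, the sketch below does not implement the statistic Z: it only computes the log-sum score described in the text and, as a stand-in assumption, ranks an observed score difference against an empirical set of replicate differences. All function names and the idea of using raw differences are hypothetical.

```python
import math
from itertools import combinations


def log_sum_score(peptide_scores):
    """Protein-level score: natural logarithm of the summed peptide scores."""
    return math.log(sum(peptide_scores))


def empirical_confidence(null_differences, observed):
    """Fraction of replicate (null) differences smaller in magnitude than
    the observed difference. Stand-in for the Z-based confidence level."""
    below = sum(1 for d in null_differences if abs(d) < abs(observed))
    return below / len(null_differences)


# Log-sum scores reported in the text for leptin.
REPORTED = {"PX": 2.4, "1 nM": 4.3, "10 nM": 6.8, "50 nM": 9.0, "100 nM": 9.4}


def pairwise_differences(scores):
    """Score difference (second minus first) for every pair of conditions."""
    return {(a, b): round(scores[b] - scores[a], 1)
            for a, b in combinations(scores, 2)}


if __name__ == "__main__":
    for (low, high), diff in pairwise_differences(REPORTED).items():
        print(f"{low} vs. {high}: {diff:+.1f}")
```

Among the reported scores, the 50 nM vs. 100 nM pair shows by far the smallest difference (0.4), which is consistent with it being the only comparison whose confidence level falls below 99%.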

A better sensitivity/precision in detecting differential ratios should certainly be accessible by using classical stable isotopic labeling techniques such as ¹⁸O [11, 17] or possibly peptide signal intensity [10]. Nevertheless, ¹⁸O requires early digestion as we do in the 1 mL process, and the method of [10] provides no peptide identification; hence there is a risk of erroneously grouping masses, and additional analyses are required to identify what is hidden behind interesting masses [43, 44]. Consequently, we believe that the differential approach purely based on the scores, although limited, has definite practical advantages that make it an attractive alternative.

Two nice examples of differential expression as we can detect it in the 500 mL process, e.g., the M4e and M4l samples, are shown in Fig. 3. These two proteins (HCG and HPL) are markers of early and late pregnancy and their differential expression perfectly fits what we know of their dynamics [45].

3.5 Manual annotation

Our continuous effort to eliminate redundant database entries is finalized by this last manual step, which relies on the powerful bioinformatics infrastructure made of the proteomics database, AIMS, the automatic annotation and the graphic interfaces. This infrastructure provides the annotators with a lot of relevant information to take decisions rapidly. When necessary, extra bioinformatics analyses can be launched. Their results are stored in the proteomics database along with the results generated by the automated processes. For instance, specific studies are performed manually to validate/detect splice variants and polymorphic proteins, which requires checking EST cluster correctness and computing new multiple alignments. Proteins identified in databases not providing reliable annotations are analyzed de novo by performing function/domain predictions. The annotators also determine which proteins are differentially expressed and worth further investigation. Certain proteins are synthesized to serve as drug candidates directly or to test individual samples (spiking).

4 Conclusion

We have presented a combination of laboratory and bioinformatics methods aimed at performing differential analyses of samples (body fluids) by proteomics. Besides purely differential capabilities, we also introduced rigorous data analysis techniques and emphasized the role of an integrated and database-centric bioinformatics platform.

Protein fractionation before trypsin hydrolysis ensures a dramatic decrease of final fraction complexity, compared to classical strategies where proteins are hydrolyzed by trypsin early in the process (shotgun). We are thus able to concentrate, in the small volume of the final RP/RP2 fractions, proteins diluted at very low concentration in the original sample. Proteins in the low pM range can be identified in the 500 mL process. Protein fractionation may also have the potential to separate certain protein isoforms. We focus on the low Mr protein fraction, thus discarding more abundant proteins whose molecular mass is more than 25 kDa. The interest of working on low Mr proteins has been recently highlighted [38].

By enforcing reproducibility at every step of the process, we achieved an overall reproducibility level that allows us to do differential analyses. These analyses can be performed based on different initial sample volumes. In the 10 mL process, concentration ratios such as 1:5 in the low nM range are detected with confidence better than 99%. In the 500 mL process we validated differential detection by retrieving two known markers of early/late pregnancy.

Our ability to deal with samples of very different sizes is a consequence of the modular design of the laboratory procedures we use. This systematic effort to develop modular and flexible methods is currently pursued towards the development of complementary processes for dealing with very small volumes (0.25–1 mL) and large-scale parallel ¹⁸O spiking for validation at the individual sample level.

The final proteome annotation by human experts ensures consistent and reliable results. By combining OLAV's demonstrated performance, the additional contribution of the integration algorithm, and the final manual check done on each unsure identification, we believe that the validated proteins can be considered as correct according to the highest standards currently in use (< 0.35% false positive, total database size > 10⁹ amino acids). This high selectivity does not prevent us from detecting variants (polymorphism and splice) by taking advantage of ESTs.

The authors would like to acknowledge the reviewers who made numerous useful and important remarks. The authors would also like to acknowledge John Corthésy from Bruker for his continuous and excellent support.

5 References

[1] Hanash, S., Celis, J. E., Mol. Cell Proteomics 2002, 1, 413–414.

[2] Adkins, J. N., Varnum, S. M., Auberry, K. J., Moore, R. J. et al., Mol. Cell Proteomics 2002, 1, 947–955.

[3] Tirumalai, R. S., Chan, K. C., Prieto, D. A., Issaq, H. J. et al., Mol. Cell Proteomics 2003, 2, 1096–1103.

[4] Anderson, N. L., Polanski, M., Pieper, R., Gatlin, T. et al., Mol. Cell Proteomics 2004, in press.

[5] Anderson, N. L., Anderson, N. G., Mol. Cell Proteomics 2002, 1, 845–867.

[6] Rose, K., in: Cooper, D. (Ed.), Nature Encyclopedia of the Human Genome, Macmillan, London 2003, pp. 435–439.

[7] Bergquist, J., Palmblad, M., Wetterhall, M., Hakansson, P., Markides, K. E., Mass Spectrom. Rev. 2002, 21, 2–15.

[8] Rose, K., Bougueleret, L., Baussant, T., Böhm, G. et al., Proteomics 2004, DOI 10.1002/pmic.200300718.

[9] Heine, G., Zucht, H. D., Schuhmann, M. U., Burger, K. et al., J. Chromatogr. B Analyt. Technol. Biomed. Life Sci. 2002, 782, 353–361.

[10] Wang, W., Zhou, H., Lin, H., Roy, S. et al., Anal. Chem. 2003, 75, 4818–4826.

[11] Julka, S., Regnier, F., J. Proteome Res. 2004, 3, 350–363.

[12] Colinge, J., Masselot, A., Giron, M., Dessingy, T., Magnin, J., Proteomics 2003, 3, 1454–1463.

[13] Colinge, J., Masselot, A., Cusin, I., Mahé, E. et al., Proteomics 2004, DOI 10.1002/pmic.200300708.

[14] Magnin, J., Masselot, A., Menzel, C., Colinge, J. et al., J. Proteome Res. 2004, 3, 55–60.

[15] Colinge, J., Masselot, A., Magnin, J., in: 3rd Workshop on Algorithms in Bioinformatics (WABI) Proceedings, Springer, LNBI 2812, Budapest 2003, pp. 25–38.

[16] Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. C. et al., Nucleic Acids Res. 2003, 31, 365–370.

[17] Heller, M., Mattou, H., Menzel, C., Yao, X., J. Am. Soc. Mass Spectrom. 2003, 14, 704–718.

[18] Ranno, F., Shrivastava, S. K., Wheater, S. M., in: König, H. et al. (Eds.), Distributed Applications and Interoperable Systems, Chapman & Hall, New York 1997, pp. 280–295.

[19] Date, C. J., An Introduction to Database Systems, Pearson Addison Wesley, Boston 2003.

[20] Boguski, M. S., Lowe, T. M., Tolstoshev, C. M., Nat. Genet. 1993, 4, 332–333.

[21] Burke, J., Davison, D., Hide, W., Genome Res. 1999, 9, 1135–1142.

[22] Burke, J., Wang, H., Hide, W., Davison, D. B., Genome Res. 1998, 8, 276–290.

[23] Chou, A., Burke, J., Bioinformatics 1999, 15, 376–381.

[24] Hide, W., Burke, J., Davison, D. B., J. Comput. Biol. 1994, 1, 199–215.

[25] Burge, C., Karlin, S., J. Mol. Biol. 1997, 268, 78–94.

[26] Krogh, A., Proc. Int. Conf. Intell. Syst. Mol. Biol. 1997, 5, 179–186.

[27] Masselot, A., Magnin, J., Giron, V., Dessingy, T. et al., in: Proc. 51st ASMS Conf. Mass Spectrom. Allied Topics, ASMS, Montreal 2003, MPB 020.

[28] Craig, R., Beavis, R. C., Rapid Commun. Mass Spectrom. 2003, 17, 2310–2316.

[29] Colinge, J., Magnin, J., Dessingy, T., Giron, M., Masselot, A., Proteomics 2003, 3, 1434–1440.

[30] Letondal, C., Bioinformatics 2001, 17, 73–82.

[31] Brown, P. O., Botstein, D., Nat. Genet. 1999, 21, 33–37.

[32] He, Y. D., Dai, H., Schadt, E. E., Cavet, G. et al., Bioinformatics 2003, 19, 956–965.

[33] Jain, N., Thatte, J., Braciale, T., Ley, K. et al., Bioinformatics 2003, 19, 1945–1951.

[34] Cardona, V. M., Hartley, O., Botti, P., J. Pept. Res. 2003, 61, 152–157.

[35] Villain, M., Gaertner, H., Botti, P., Eur. J. Org. Chem. 2003, 17, 3267–3272.

[36] Schnolzer, M., Alewood, P., Jones, A., Alewood, D., Kent, S. B., Int. J. Pept. Protein Res. 1992, 40, 180–193.

[37] Dawson, P. E., Muir, T. W., Clark-Lewis, I., Kent, S. B., Science 1994, 266, 776–779.

[38] Liotta, L. A., Ferrari, M., Petricoin, E., Nature 2003, 425, 905–910.

[39] Perkins, D. N., Pappin, D. J., Creasy, D. M., Cottrell, J. S., Electrophoresis 1999, 20, 3551–3567.

[40] Nesvizhskii, A. I., Keller, A., Kolker, E., Aebersold, R., Anal. Chem. 2003, 75, 4646–4658.

[41] Peng, J., Elias, J. E., Thoreen, C. C., Licklider, L. J., Gygi, S. P., J. Proteome Res. 2003, 2, 43–50.

[42] Ewens, W., Grant, G., Statistical Methods in Bioinformatics, Springer, New York 2001.

[43] Baggerly, K. A., Morris, J. S., Coombes, K. R., Bioinformatics 2004, in press.

[44] Coombes, K. R., Fritsche, H. A., Jr., Clarke, C., Chen, J. N. et al., Clin. Chem. 2003, 49, 1615–1623.

[45] Sarandakou, A., Kassanos, D., Phocas, I., Kontoravdis, A. et al., Clin. Exp. Obstet. Gynecol. 1992, 19, 180–188.

[46] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. 1990, 215, 403–410.

[47] Thompson, J. D., Higgins, D. G., Gibson, T. J., Nucleic Acids Res. 1994, 22, 4673–4680.

[48] Sonnhammer, E. L., Eddy, S. R., Durbin, R., Proteins 1997, 28, 405–420.

[49] Blom, N., Gammeltoft, S., Brunak, S., J. Mol. Biol. 1999, 294, 1351–1362.

[50] Nielsen, H., Brunak, S., Von Heijne, G., Protein Eng. 1999, 12, 3–9.
