University of Groningen Mastering data pre-processing for accurate quantitative molecular profiling with liquid chromatography coupled to mass spectrometry Mitra, Vikram IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below. Document Version Publisher's PDF, also known as Version of record Publication date: 2017 Link to publication in University of Groningen/UMCG research database Citation for published version (APA): Mitra, V. (2017). Mastering data pre-processing for accurate quantitative molecular profiling with liquid chromatography coupled to mass spectrometry. University of Groningen. Copyright Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons). The publication may also be distributed here under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license. More information can be found on the University of Groningen website: https://www.rug.nl/library/open-access/self-archiving-pure/taverne- amendment. Take-down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum. Download date: 12-04-2022
Uppsala, Sweden) was prepared from dried powder per the manufacturer's instructions.
Briefly, the dried powder was rehydrated in water for 15 min and washed with 50 bed volumes
of water, followed by 50 bed volumes of coupling buffer in a Handee Mini-Spin column
(Pierce, Rockford, IL, USA). The reduced peptide sample was then incubated with the resin
for 1 h at room temperature with gentle mixing, and the unbound portion (non-cysteinyl
peptides) was removed by spinning the column at low speed. The resin was washed in the
spin column sequentially with 0.5 mL of each of the following solutions: 1) 50 mM Tris buffer
(pH 8.0), 1 mM EDTA (washing buffer); 2) 2 M NaCl; 3) 80% ACN/0.1% TFA solution; and
4) washing buffer. To release the captured cysteinyl peptides, 100 μL of 20 mM DTT freshly
prepared in washing buffer was added to the resin and incubated for 30 min at room
temperature. The resin was further washed with 100 μL of 80% ACN, which was pooled with
the previous DTT eluate. The sample was alkylated with 80 mM iodoacetamide for 30 min
at room temperature in the dark. The eluted cysteinyl peptides were desalted using an SPE
C18 column and lyophilized. Cysteinyl peptide samples were reconstituted in 25 mM
NH4HCO3 and stored at -80 °C until LC-MS analysis (MARS+Cys sample).
Plasma Protein Digestion. The MARS flow-through proteins were denatured and reduced
in 50 mM NH4HCO3 (pH 8.2), 8 M urea, 10 mM dithiothreitol (DTT) for 1 h at 37 °C. The
resulting protein mixture was diluted 10-fold with 50 mM NH4HCO3, and then sequencing
grade modified porcine trypsin (Promega, Madison, WI, USA) was added at a trypsin:protein
ratio of 1:50. The sample was incubated overnight at 37 °C. The following day, the digested
sample was loaded on a 1 mL SPE C18 column (Supelco, Bellefonte, PA, USA)
and washed with 4 mL of 0.1% trifluoroacetic acid (TFA)/5% acetonitrile (ACN). Peptides
were eluted from the SPE column with 1 mL of 0.1% TFA/80% ACN and lyophilized
afterwards. Peptide samples were reconstituted in 25 mM NH4HCO3 and stored at -80 °C
until LC-MS analysis.
Reversed-Phase Capillary LC-MS Analyses. A custom-built high-pressure capillary LC
system(4) coupled on-line to an Agilent LC/MSD TOF (G1969A, laboratory 2) via an
in-house-manufactured electrospray ionization interface was used to analyze the peptide
samples. In the other laboratory, an LC-MS system with a time-of-flight detector was used
(Waters LCT Premier, laboratory 1). The reversed-phase capillary column was prepared by
slurry packing 3-μm Jupiter C18 bonded particles (Phenomenex, Torrance, CA) into a
65-cm long, 75-μm i.d. fused silica capillary (Polymicro Technologies, Phoenix, AZ) that
incorporated a retaining stainless steel screen in an HPLC union (Valco Instruments Co.,
Houston, TX). The mobile phases consisted of 0.2% acetic acid and 0.05% TFA in water (A)
and 0.1% TFA in 90% ACN/10% water (B) and were degassed on-line using a vacuum
degasser (Jones Chromatography Inc., Lakewood, CO). After loading 5 μL of sample
solution onto the column, an exponential gradient elution was achieved by increasing the
mobile-phase composition in a stainless steel mixing chamber from 0 to 70% B over 120
min. The TOF mass spectrometer was scanned in the m/z range of 400-2000 at 1
scan/second.
Monte-Carlo simulated dataset
The Monte-Carlo simulation imitating the outcome of the peak-matching procedure was
performed with the following criteria:
a) There are two types of peak pairs. The first type consists of accurately matched peak pairs,
whose retention time coordinates follow a non-linear, monotonic trend. In case of
peak order inversion, the retention times of the accurately matched peak pairs fluctuate
around this trend by at most the maximal retention time difference of the peaks that
change elution order. The second type of peak pairs is obtained by randomly
matching peaks between the two chromatograms and simulates the error in the peak
matching procedure. These peak pairs are distributed randomly throughout the retention
time space while taking the initial peak density distribution in the two chromatograms
into account.
b) The non-linear monotonic trend is simulated using a cubic spline function, and peak
elution order inversion is represented as random fluctuation (orthogonal residuals) around
this trend. The distribution of peak pairs along the non-linear monotonic retention time trend
is sampled directly from the peak distribution of a real LC-MS chromatogram.
c) The parameters of the simulation that can be set by the user are the following: (1)
number of accurately matched peak pairs, (2) number of randomly matched peak pairs,
(3) fluctuation in minutes of the accurately matched peak pairs, simulating the
maximal retention time differences related to changes in peak elution order, (4-6) three
LC-MS peak distributions expressed as histograms along the retention time axis (one is
used to sample the peak distribution of the accurately matched peak pairs along the
main monotonic retention time correspondence trend and the other two are used to
sample randomly matched peak pairs from two LC-MS/MS chromatograms).
Parameters for the Monte Carlo simulations were the following:
Total number of matched peak pairs (MPPs): 100, 250, 500, 750 and 1000
Fluctuation of accurately matched peak pairs (AMPPs) around the monotonic retention time trend: 0.05, 0.1, 1, 5 and 15
minutes
Ratio of AMPPs relative to the MPPs: 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 0.90 and 1.00
Number of repetitions: 3
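The simulation criteria above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the function name `simulate_peak_pairs` and its arguments are hypothetical, peaks are drawn from a uniform density instead of a real chromatogram histogram, the trend is any monotonic callable rather than a cubic spline, and the fluctuation is applied along the y axis rather than orthogonally.

```python
import random

def simulate_peak_pairs(n_total, ampp_ratio, fluctuation_min, trend,
                        rt_max=100.0, seed=0):
    """Return (pairs, labels); a True label marks an accurately matched pair.

    Simplifying assumptions (not from the source): uniform peak density and
    fluctuation added to y only, bounded by +/- fluctuation_min minutes.
    """
    rng = random.Random(seed)
    n_accurate = round(n_total * ampp_ratio)
    pairs, labels = [], []
    # Accurately matched pairs scatter around the monotonic trend.
    for _ in range(n_accurate):
        x = rng.uniform(0.0, rt_max)
        y = trend(x) + rng.uniform(-fluctuation_min, fluctuation_min)
        pairs.append((x, y))
        labels.append(True)
    # Randomly matched pairs fill the whole retention time space.
    for _ in range(n_total - n_accurate):
        pairs.append((rng.uniform(0.0, rt_max), rng.uniform(0.0, rt_max)))
        labels.append(False)
    return pairs, labels

# Example: 100 matched pairs, 30% accurate, 0.5 min fluctuation.
pairs, labels = simulate_peak_pairs(100, 0.30, 0.5, lambda x: x + 0.002 * x * x)
```

Replacing the uniform draws with sampling from retention time histograms would bring the sketch closer to criteria (4-6).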
Detailed description of the time alignment algorithm
Pre-processing of single stage LC-MS data
Figure 1 shows the main parts of the quality assessment procedure and indicates in red
the modules where the procedure can stop due to improper conditions for time alignment, such
as a low number of matched peak pairs with respect to random peak pairing, a low number of
accurately matched peak pairs or a high probability of peak order inversion. The quality control
procedure is a pairwise method and expects that the subsequent time alignment method
changes only the retention times of one chromatogram (referred to here as the sample
chromatogram, and shown as peak list 2 in Figure 1). To process the raw LC-MS/MS data,
data in vendor-specific format were converted to mzXML format using the msconvert tool of the
ProteoWizard library(5). The single stage part of the LC-MS/MS datasets in mzXML format was
submitted to data pre-processing, which included peak detection and quantification,
deconvolution of isotopic peak clusters, charge state determination of isotopologue peak
clusters and summing of the most abundant isotopologues of each charge state per
peptide. The initial noise filtering and peak quantification were carried out using the
PeakPicker module of the OpenMS pipeline(6). The signal-to-noise ratio parameter of the
PeakPicker algorithm was set to 10. Detected isotopologues (chemical species of the same
compound with the same atomic, but different isotopic, constitution) of one particular charge
state are then clustered, and clusters that are not in accordance with the isotope wavelet
model following the “averagine” peptide constitution(6) are filtered out during the feature
finding step. The charge state of each detected isotope cluster is then determined, and the
decharged mass of the most abundant isotopologue is calculated and assigned as the mass
of a single peptide. This is followed by summing the most abundant decharged
isotopologues with the same mass (mass tolerance within ±0.01 Da) within ±30 seconds of
retention time. The final quantitative value for each compound is characterized by the mass
value of the decharged most abundant isotopologue and the average retention time of all
charge states. This information, along with the ion counts, is exported to a tab-delimited text file,
which is referred to as the “peak list” in the article.
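The final summing step can be sketched in Python as follows. This is an illustrative greedy grouping, not the OpenMS-based workflow itself: `merge_charge_states` is a hypothetical helper, and the input tuple layout (decharged mass in Da, retention time in seconds, ion count) is an assumption. Only the ±0.01 Da and ±30 s tolerances come from the text.

```python
def merge_charge_states(features, mass_tol=0.01, rt_tol=30.0):
    """Sum features of the same decharged mass that elute close together.

    features: list of (decharged_mass_da, rt_seconds, ion_count).
    Returns a list of (mass, mean_rt, summed_ion_count), where mass is that of
    the most abundant member and mean_rt averages all merged charge states.
    """
    merged = []  # each entry: [mass, list_of_rts, summed_intensity]
    # Visit features by decreasing intensity so the most abundant one
    # defines the reported mass of its group.
    for mass, rt, intensity in sorted(features, key=lambda f: -f[2]):
        for entry in merged:
            mean_rt = sum(entry[1]) / len(entry[1])
            if abs(entry[0] - mass) <= mass_tol and abs(mean_rt - rt) <= rt_tol:
                entry[1].append(rt)
                entry[2] += intensity
                break
        else:
            merged.append([mass, [rt], intensity])
    return [(m, sum(rts) / len(rts), total) for m, rts, total in merged]

peak_list = merge_charge_states([
    (1000.000, 600.0, 5e5),   # most abundant charge state
    (1000.005, 610.0, 2e5),   # same peptide, other charge state
    (1000.004, 900.0, 1e5),   # same mass but elutes 5 min later
    (1500.000, 605.0, 3e5),
])
```

In this example the first two features are merged into one entry with the average retention time of 605 s, while the late-eluting feature of similar mass stays separate.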
Intensity-rank-based peak matching of LC-MS data (left part of step 1. in Figure 1)
Prior to peak matching, all peak lists were sorted and ranked according to decreasing intensity.
Correspondences between a pair of peak lists are determined by finding peak pairs that are
close in mass and intensity rank. The peak correspondences between a pair of
intensity-sorted peak lists are identified using a sliding window technique with the following
parameters: (1) peak pairs should be close in m/z, therefore a threshold for the maximal m/z
difference between peak pairs is applied. This threshold should be set according to the
maximal mass calibration differences between the two LC-MS chromatograms. To
improve mass calibration, it is advised to recalibrate the mass axis using either known
masses of background contaminants(7) or the accurate masses of identified peptides from
MS/MS data, if available(8); (2) the number of the most abundant peaks used to identify peak
pairs. The end of the intensity-sorted peak lists contains noise and other data processing
artifacts, therefore this parameter should be set to include only genuine peaks from the
intensity-sorted peak lists and to exclude non-informative items such as noise; (3) the length of the sliding
window used in the peak matching procedure. This window defines the largest difference
between the intensity ranks of paired peaks that is considered by the matching algorithm.
In case of multiple hits for the same mass within the sliding window, the algorithm
selects only the peak with the lowest difference of intensity ranks. Figure S2 in the supporting
information provides a visual summary of the peak matching procedure used to define peak
pairs between two intensity-sorted LC-MS peak lists using the sliding window approach.
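The three parameters can be made concrete with a short Python sketch. It is a simplified reading of the procedure, not the published code: `rank_match` is a hypothetical name, the window is interpreted as the maximal allowed rank difference, the "number of most abundant peaks" is expressed as a fraction of each list, and each peak of the second list is used at most once (an added assumption).

```python
def rank_match(peaks_a, peaks_b, mz_tol=0.1, window=500, top_fraction=0.9):
    """Pair peaks that are close in mass and in intensity rank.

    peaks_*: lists of (mass, retention_time, intensity).
    Returns a list of (rt_a, rt_b) retention time pairs.
    """
    def top_ranked(peaks):
        ranked = sorted(peaks, key=lambda p: -p[2])  # decreasing intensity
        return ranked[:max(1, int(len(ranked) * top_fraction))]

    a, b = top_ranked(peaks_a), top_ranked(peaks_b)
    pairs, used_b = [], set()
    for rank_a, (mass_a, rt_a, _) in enumerate(a):
        lo = max(0, rank_a - window)
        hi = min(len(b), rank_a + window + 1)
        best = None
        for rank_b in range(lo, hi):
            if rank_b in used_b or abs(b[rank_b][0] - mass_a) > mz_tol:
                continue
            # Among multiple mass hits keep the smallest rank difference.
            if best is None or abs(rank_b - rank_a) < abs(best - rank_a):
                best = rank_b
        if best is not None:
            used_b.add(best)
            pairs.append((rt_a, b[best][1]))
    return pairs

list_a = [(500.00, 10.0, 1000), (600.00, 20.0, 900), (700.00, 30.0, 800)]
list_b = [(600.02, 21.0, 950), (500.01, 11.0, 940), (900.00, 50.0, 100)]
matched = rank_match(list_a, list_b, mz_tol=0.05, window=2, top_fraction=1.0)
```

The returned retention time pairs are the input to the density estimation step that follows.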
Optimizing peak matching parameters of single stage LC-MS peak lists (parameter optimisation in step 1. in Figure 1)
All peak matching algorithms provide a certain ratio of accurately and inaccurately matched
peak pairs. The accurately matched peak pairs are common peaks between the two
chromatograms and contain the information for the correction of the retention time
differences between the two LC-MS chromatograms. When the ratio of accurately matched
peaks is high within the dataset, the retention time coordinates of the accurately matched
peaks accumulate along the retention time correspondence trend. Bivariate kernel density
estimation (2D-KDE) is applied over the retention time vectors of the matched peak pairs to
identify the regions where peaks accumulate in higher density compared to what is expected
from random pairing of peaks from two LC-MS/MS chromatograms. In 2D-KDE, for n peak
pairs with retention time coordinates $(x_i, y_i)$, the estimated probability density function $\hat{f}_H$ is
given by:
$$\hat{f}_H(x, y) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i,\, y - y_i) \qquad (1)$$
where $K_H$ is a bivariate, elliptically symmetric Gaussian kernel that integrates to 1 from $-\infty$ to
$+\infty$ for x and y values. H is the bandwidth, described by the sigma of the two-dimensional
Gaussian kernel in the x ($\sigma_x$) and y ($\sigma_y$) directions, and is greater than zero. $K_H$ determines
the smoothing extent of the 2-dimensional density histogram, and is expressed by the
following equation:
$$K_H(x - x_i,\, y - y_i) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-x_i)^2}{\sigma_x^2} + \frac{(y-y_i)^2}{\sigma_y^2} - \frac{2\rho(x-x_i)(y-y_i)}{\sigma_x\sigma_y}\right]\right) \qquad (2)$$
ρ is the correlation between the two 1-dimensional Gaussian kernel functions and defines
the rotation of the Gaussian kernel. The bandwidth parameter was optimally set using a
plug-in bandwidth matrix approach developed by Botev et al.(9). An important feature of the
full bandwidth matrix is that it does not use any normal reference rules and is data centric.
For n peak pairs, the algorithm estimates a square density matrix of size $2^i$, where
$i = \arg\min_i \{2^i \geq n\}$, and the data matrix covers the entire retention time domains of the matched
peak pairs. The exponent i is capped at 7 to avoid long calculation times for the
2-dimensional Kolmogorov-Smirnov test (see two paragraphs below).
Peaks paired from the two peak lists contain randomly matched peak pairs and accurately
matched peak pairs. The ratio of correctly and incorrectly matched pairs using decharged
and isotope-deconvoluted peak lists depends on the molecular composition of the two
samples and on the parameters of the peak matching procedure. In order to assess
statistically the ratio of correctly and incorrectly matched peak pairs, a p-value is calculated
using the 2-dimensional Kolmogorov-Smirnov (2D-KS) test between the 2D-KDE matrix
obtained from the matched peak pair distribution and the density matrix calculated for
random peak pairing. The density matrix for random peak pairing is obtained as the cross
product of the 1-dimensional KDEs of the peak distribution of each LC-MS chromatogram, using
x and y for the corresponding chromatograms. The 2D-KS test therefore measures the statistical
probability that the peak pair distribution originates from the distribution of random pairing of
peaks from the two chromatograms.
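The two density matrices compared by the test can be sketched in Python. The sketch makes strong simplifications that are not in the source: a fixed, axis-aligned Gaussian bandwidth `sigma` (so ρ = 0) replaces the plug-in bandwidth matrix of Botev et al., and the function names are hypothetical.

```python
import math

def kde_1d(values, grid, sigma):
    """Gaussian 1D-KDE evaluated on `grid`, normalised to sum to 1."""
    dens = [sum(math.exp(-0.5 * ((g - v) / sigma) ** 2) for v in values)
            for g in grid]
    total = sum(dens)
    return [d / total for d in dens]

def kde_2d(pairs, grid, sigma):
    """Axis-aligned Gaussian 2D-KDE on grid x grid, normalised to sum to 1."""
    kx = [[math.exp(-0.5 * ((g - x) / sigma) ** 2) for g in grid] for x, _ in pairs]
    ky = [[math.exp(-0.5 * ((g - y) / sigma) ** 2) for g in grid] for _, y in pairs]
    m = len(grid)
    dens = [[sum(kx[k][i] * ky[k][j] for k in range(len(pairs)))
             for j in range(m)] for i in range(m)]
    total = sum(map(sum, dens))
    return [[d / total for d in row] for row in dens]

def random_pairing_density(x_values, y_values, grid, sigma):
    """Density expected for random pairing: product of the two 1D-KDEs."""
    fx, fy = kde_1d(x_values, grid, sigma), kde_1d(y_values, grid, sigma)
    return [[a * b for b in fy] for a in fx]

grid = [float(g) for g in range(21)]
pairs = [(t, t) for t in (2.0, 6.0, 10.0, 14.0, 18.0)]  # perfectly matched pairs
observed = kde_2d(pairs, grid, sigma=1.0)
expected = random_pairing_density([x for x, _ in pairs], [y for _, y in pairs],
                                  grid, sigma=1.0)
```

For pairs lying on a trend, `observed` concentrates its mass along the diagonal, whereas `expected` spreads the same marginal distributions over the whole retention time space.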
The 1-dimensional Kolmogorov-Smirnov (1D-KS) test provides the non-parametric
probability that two distributions are equivalent and that the observed differences between the
two distributions are due to random sampling. The 1D-KS test uses the maximum absolute
difference between the two cumulative distributions to calculate the probability of the equality
of two empirical distributions. Extending the KS statistic to multi-dimensional space is
challenging because there are $2^d-1$ independent cumulative distributions in d
dimensions. We have slightly modified the algorithm developed by Peacock et al.(10), which
estimates the largest difference between the two cumulative distributions for any possible
ordering in two dimensions. Given n points in a two-dimensional space defined by the
retention time domains of the two chromatograms, this amounts to calculating the
cumulative distribution functions in $4n^2-1$ quadrants. Our modification is that the
cumulative functions are not calculated for each peak pair, but are obtained directly from the
two 2D-KDE matrices, one obtained with the cross product of the two 1D-KDEs calculated from
the peak distribution in the two LC-MS chromatograms and the other obtained with the peak
pairing algorithm described in the previous section 3.2. The $D_{KS}$ test statistic is then obtained by
calculating the largest difference between the cumulative distributions considering all possible
$4n^2-1$ quadrant divisions, where n in this case corresponds to the dimension of the 2D-KDE
square matrices. The null hypothesis, that the distribution of peak pairs obtained
with random peak pairing and the distribution obtained with matching of intensity-sorted peak
lists are the same, is rejected at a significance level of α if
$$\sqrt{\frac{n}{2}}\, D_{KS} > Z_{\alpha} \qquad (3)$$
where $Z_{\alpha}$ is the cumulative standard normal deviate for the corresponding α probability. The
exact p-value for the 2D-KS estimate is obtained from the left part of the inequality (3). The
2D-KS test is then used to optimise the three parameters of the intensity-rank-based peak
pairing algorithm (the length of the intensity rank window, the threshold for the m/z
differences and the number of the most abundant peaks taken into consideration by the
peak matching procedure) with a predefined set of parameters (the exact values of the
parameters are presented in section 7 in supporting information). The 2D-KDE and the
2D-KS calculations are performed only for peak lists matching a minimum of 100 peak pairs.
If the minimum number of matched peak pairs is not reached during the whole optimisation
procedure, the two-step time alignment procedure stops. The alignment procedure also stops
if the 2D-KS probability that the intensity-rank-based peak pairing
distribution is the same as that obtained with random peak pairing is higher than a
p-value of 0.001. Peak matching parameters have a large effect on the peak matching accuracy
(ratio of accurately and inaccurately matched peak pairs and number of obtained peak
pairs), and for that reason this step is crucial for finding the optimal peak matching parameters,
which provide the peak pair distribution in the retention time space of the two
chromatograms that differs most from the distribution obtained with random peak pairing. A few examples of
the effect of the parameters on peak matching results are shown in this supporting information
in Figure S3. Figure S4 in this supporting information shows plots presenting the main
steps of the selection of accurately matched peaks.
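The idea of computing the 2D-KS statistic directly from two density matrices can be sketched in Python. This is an illustration of the quadrant-based comparison only, under the assumption that both matrices are square, of equal size and normalised to sum to 1; `ks_2d_statistic` and its helpers are hypothetical names, and the conversion of the statistic into a p-value is left out.

```python
def prefix_sums(dens):
    """2D cumulative sums with a leading row and column of zeros."""
    m = len(dens)
    p = [[0.0] * (m + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(m):
            p[i + 1][j + 1] = dens[i][j] + p[i][j + 1] + p[i + 1][j] - p[i][j]
    return p

def quadrant_masses(p, i, j):
    """Probability mass in the four quadrants around grid split (i, j)."""
    m = len(p) - 1
    q1 = p[i][j]                                  # rows < i, cols < j
    q2 = p[i][m] - p[i][j]                        # rows < i, cols >= j
    q3 = p[m][j] - p[i][j]                        # rows >= i, cols < j
    q4 = p[m][m] - p[i][m] - p[m][j] + p[i][j]    # rows >= i, cols >= j
    return q1, q2, q3, q4

def ks_2d_statistic(dens_a, dens_b):
    """Largest quadrant-wise difference between two cumulative distributions."""
    pa, pb = prefix_sums(dens_a), prefix_sums(dens_b)
    m = len(dens_a)
    d_max = 0.0
    for i in range(m + 1):
        for j in range(m + 1):
            for qa, qb in zip(quadrant_masses(pa, i, j), quadrant_masses(pb, i, j)):
                d_max = max(d_max, abs(qa - qb))
    return d_max

same = ks_2d_statistic([[0.25, 0.25], [0.25, 0.25]], [[0.25, 0.25], [0.25, 0.25]])
far = ks_2d_statistic([[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]])
```

Working on the density grid keeps the cost tied to the matrix size rather than to the number of peak pairs, which is the motivation given in the text for capping the matrix dimension.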
Selection of accurately matched peak pairs (step 2. in Figure 1)
An optimal threshold selection method is required to select the dense region of the 2D-KDE
obtained with any type of peak pairing method that yields a mixture of accurately and
inaccurately matched peak pairs. This threshold is calculated by constructing a 1-dimensional
histogram from all the density values of the 2D-KDE matrix (the histogram is made
with the number of bins equal to the size of the square 2D-KDE matrix). The threshold (d) is set
to a density value where the positive part of the first derivative of the histogram's frequency is
closest to the median of the derivative:
$$d = \arg\min_{d} \left| \left(\frac{\partial h}{\partial d}\right)^{+} - \widetilde{\frac{\partial h}{\partial d}} \right|$$
where h corresponds to the abundance of the histogram, d is the density estimate, the + sign
refers to the positive values of $\partial h/\partial d$, and the ~ sign refers to the median value of
$\partial h/\partial d$. Peak pairs that are located in density areas higher than this
density threshold are selected as accurately matched peak pairs, while other peaks are
considered as randomly grouped, mismatched peak pairs. It should be noted that this
threshold selection is sensitive to the peak distribution, and slight manual readjustment of the
threshold value may improve the accuracy of the selection of accurately matched peak pairs.
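One possible reading of this threshold rule is sketched below in Python. Several choices are assumptions rather than details from the source: the derivative is a simple finite difference between neighbouring histogram bins, the median is taken over the positive derivative values only, the threshold is reported as a bin centre, and `density_threshold` is a hypothetical name.

```python
import statistics

def density_threshold(dens):
    """Pick a density cut-off from the histogram of 2D-KDE values."""
    m = len(dens)                              # number of bins = matrix size
    values = [v for row in dens for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:                               # flat density: nothing to separate
        return lo
    width = (hi - lo) / m
    counts = [0] * m
    for v in values:
        counts[min(int((v - lo) / width), m - 1)] += 1
    centers = [lo + (k + 0.5) * width for k in range(m)]
    # Finite-difference derivative of the histogram frequency.
    deriv = [counts[k + 1] - counts[k] for k in range(m - 1)]
    positive = [(k, d) for k, d in enumerate(deriv) if d > 0]
    if not positive:                           # fallback, an assumption
        return centers[0]
    median_pos = statistics.median(d for _, d in positive)
    k_best = min(positive, key=lambda kd: abs(kd[1] - median_pos))[0]
    return centers[k_best]

density = [
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.2, 0.9],
    [0.0, 0.2, 1.0, 0.2],
    [0.1, 0.9, 0.2, 0.0],
]
threshold = density_threshold(density)
```

In this toy matrix the threshold falls between the low background values and the dense ridge, so only the ridge cells (0.9 and 1.0) exceed it.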
Monotonic non-linear alignment function (step 3. in Figure 1)
The retention time coordinates of the selected accurately matched peak pairs are used to
calculate a monotonic non-linear global alignment function using Locally Weighted
Scatterplot Smoothing (LOWESS)(11) regression in combination with the bagging
resampling(12) technique. A robust version of the LOWESS regression, which assigns a lower
weight to outliers, was used for calculating the alignment function. The method assigns
zero weight to peak pairs outside six median absolute deviations of the residuals from the
tested position. Four times the root mean square of the 2D-KDE bandwidth is used to
set the span, and a third-order polynomial function is used for the LOWESS regression. The
final smoothed regression points are calculated as the average of 100 bootstrap resamplings.
The bootstrap resampling is performed uniformly with replacement using all extracted
peak pairs. This procedure reduces the variance of the LOWESS predictor and helps to
avoid overfitting. When the peak elution order of common peaks is the same in two
chromatograms, one-to-one peak correspondence is expressed by a monotonic function
between the retention times of accurately matched peaks. For that reason the main retention
time correspondence trend (the alignment function) should be monotonic. To make the main
time alignment function monotonic, least-squares linear optimisation with a monotonic
constraint is applied to the average LOWESS regression points of the 100 bootstraps. A
piecewise cubic Hermite interpolating polynomial (PCHIP) function with cubic spline(13) is
used to perform monotonic interpolation for transforming the retention times of peaks
between the retention time spaces of the two chromatograms. Partitioning of the data for PCHIP
was performed on the basis of the span used in LOWESS (root mean square of the 2D-KDE
bandwidth). Before performing PCHIP, linear interpolation was performed between
experimental data points for partitions containing no data points, to avoid large jumps in the
main monotonic alignment function.
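The monotonic constraint can be illustrated with the pool-adjacent-violators algorithm, a standard way to obtain a least-squares non-decreasing fit. Whether the original MATLAB code uses this particular solver is not stated, so the Python sketch below is an assumed stand-in; `isotonic_fit` is a hypothetical name and its input is taken to be the averaged smoothed values ordered by increasing retention time in the other chromatogram.

```python
def isotonic_fit(y):
    """Least-squares non-decreasing fit to the sequence y (pool adjacent violators)."""
    blocks = []  # each block holds [sum_of_values, number_of_values]
    for value in y:
        blocks.append([value, 1])
        # Merge neighbouring blocks while their means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit

smoothed = [1.0, 3.0, 2.0, 4.0]       # averaged regression points with one inversion
monotone = isotonic_fit(smoothed)     # the inversion is replaced by its block mean
```

The resulting knots could then be passed to a shape-preserving interpolator such as `scipy.interpolate.PchipInterpolator`; flat segments produced by the pooling may need to be thinned first if a strictly increasing function is required.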
Probability of peak elution order similarity between two chromatograms (step 3. in Figure 1)
When the peak elution order of common peaks is the same in two chromatograms, the accurately
matched peak pairs follow a non-linear monotonic trend between the chromatograms
without any fluctuation of the retention time coordinates of accurately matched peak pairs
around this trend. However, small scattering may be observed due to improper determination
of the peak maxima. In this case it is possible to determine the one-to-one correspondence
of peaks unambiguously in the two chromatograms with the monotonic alignment function.
This means that it is possible to unambiguously find the same peaks or to determine whether a
peak has no correspondence in the other chromatogram. When the peak elution order of
common peaks is different in the two chromatograms, the fluctuation of the correctly matched
peak pairs around the non-linear monotonic retention time trend becomes larger. In this case
it is not possible to match peaks between the two chromatograms unambiguously. The
corresponding peak could be anywhere within the fluctuation domain of the accurately
matched peak pairs, and the non-linear monotonic function just represents the average
retention time correspondence function.
The probability of peak order inversion can be calculated by comparing the orthogonal
residual variance of the accurately matched peak pairs between two chromatograms that
have the same elution order of common peaks (e.g. two chromatograms of two samples
with similar molecular composition acquired in the same batch) with the orthogonal residual
variance obtained for the two chromatograms of interest. It is advantageous to have at
least one chromatogram in common between the two chromatogram pairs to minimise differences due
to different samples and/or different LC-MS acquisitions. By
conducting an F-test on the orthogonal residual variances obtained for the two conditions,
the probability of peak elution order similarity, which is the null hypothesis
of the F-test, can be estimated. When comparing two chromatograms obtained under different conditions (e.g.
acquired in two different laboratories), it is possible to perform two separate F-tests, in which
the orthogonal residual variance with no peak order inversion is determined for both
chromatograms separately. For the final decision on peak elution order similarity, the F-test
providing the smaller p-value should be taken into consideration. If the probability of
similarity of peak elution order is lower than 0.01, the algorithm stops, because the
chance of similar peak elution order is low and it is therefore not possible to establish an
unambiguous one-to-one correspondence between peaks or chromatographic locations of
the two chromatograms.
The orthogonal residuals are calculated differently from the residuals of a usual regression
analysis. In regression analysis the dependent and independent variable axes are fixed,
whereas in time alignment the two axes should be interchangeable (e.g. the same results
should be obtained by aligning chromatogram A to B and B to A). For this reason we
calculated the orthogonal residual distance from the main monotonic function after
transforming the retention times of the peaks of one chromatogram with the main retention time
correspondence function. In this case the main monotonic retention time correspondence
function becomes a line at 45° to the two retention time axes of the scatter plot.
The procedure for calculating orthogonal residuals and performing the F-test to assess the
probability of peak elution order similarity is demonstrated in Figure S5.
$D_{max}$ is calculated for the orthogonal variance. The components of $D_{max}$ along the two
chromatograms provide the retention time error for determining retention time locations in the
other chromatogram after alignment.
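The orthogonal residual calculation and the variance ratio entering the F-test can be sketched in Python. The function names are hypothetical, and `align` stands for any callable that maps a sample retention time onto the reference axis. After this transformation the trend is the 45° line, so the orthogonal distance of a point is its vertical offset divided by √2.

```python
import math
import statistics

def orthogonal_residuals(pairs, align):
    """Signed orthogonal distances of (rt_sample, rt_reference) pairs from the trend."""
    return [(rt_ref - align(rt_sample)) / math.sqrt(2.0)
            for rt_sample, rt_ref in pairs]

def variance_ratio(test_pairs, control_pairs, align_test, align_control):
    """F statistic: residual variance of the pair of interest over the control pair."""
    var_test = statistics.variance(orthogonal_residuals(test_pairs, align_test))
    var_control = statistics.variance(orthogonal_residuals(control_pairs, align_control))
    return var_test / var_control

identity = lambda rt: rt  # toy alignment function for the example
control = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]   # same elution order
test = [(0.0, 0.2), (1.0, 0.8), (2.0, 2.2), (3.0, 2.8)]      # twice the scatter
f_value = variance_ratio(test, control, identity, identity)  # close to 4
```

A p-value for the ratio would come from the F distribution with the matching degrees of freedom, for example via `scipy.stats.f.sf`; that step is omitted here to keep the sketch dependency-free.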
Retention time correction (step 4. in Figure 1)
In the case of a high probability of peak elution order similarity of common peaks in two
chromatograms (the null hypothesis of the F-test is not rejected), the alignment function is used
to correct the retention times of the peaks in the sample chromatogram with respect to a
reference chromatogram. The method does not depend on which chromatogram is selected
as reference or sample, as the monotonic nature of the retention time trend between the two
chromatograms allows the same one-to-one correspondence of common
peaks to be determined. Figure S12 in supporting information shows that the non-linear main retention time
correspondence trends obtained with the two different orders of the LC-MS chromatograms are highly
similar. The retention times of the peaks in the sample chromatogram are calculated by
interpolation using the monotonic alignment function. The algorithm finally yields a sample
peak list aligned to the reference peak list. It should be noted that any other type of time
alignment method devising a monotonic non-linear retention time correspondence function
can be used instead of the proposed monotonic constrained LOWESS/PCHIP approach.
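The correction step itself amounts to evaluating the alignment function at every sample retention time. The Python sketch below uses piecewise linear interpolation between knots as a simple substitute for PCHIP; the name `make_alignment`, the knot lists and the unit-slope extrapolation outside the knot range are all assumptions made for illustration.

```python
from bisect import bisect_right

def make_alignment(knots_sample, knots_reference):
    """Build a monotonic mapping from sample to reference retention time.

    Both knot lists must be strictly increasing and of equal length.
    """
    def align(rt):
        # Outside the knot range, shift by the offset of the nearest knot.
        if rt <= knots_sample[0]:
            return knots_reference[0] + (rt - knots_sample[0])
        if rt >= knots_sample[-1]:
            return knots_reference[-1] + (rt - knots_sample[-1])
        k = bisect_right(knots_sample, rt) - 1
        frac = (rt - knots_sample[k]) / (knots_sample[k + 1] - knots_sample[k])
        return knots_reference[k] + frac * (knots_reference[k + 1] - knots_reference[k])
    return align

align = make_alignment([0.0, 10.0, 20.0], [1.0, 12.0, 25.0])
sample_peaks = [(1000.0, 5.0), (1500.0, 15.0)]            # (mass, retention time)
aligned_peaks = [(mass, align(rt)) for mass, rt in sample_peaks]
```

Because the mapping is monotonic, swapping the two knot lists gives the inverse function, in line with the statement that the choice of reference and sample chromatogram does not matter.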
Hardware and software environment
The Monte-Carlo simulation, the intensity-rank-based peak matching, the 2D-KDE and the 2D-KS
algorithm were written in the MATLAB scripting language using MathWorks MATLAB R2010b
(version 7.11.0.584, 64-bit Linux version) and were run on a desktop computer equipped with
an Intel Quad Q9300 CPU at 2.5 GHz, 8 GB RAM and the 64-bit Ubuntu Linux 10.04 operating
system. The source code is available at https://trac.nbic.nl/pre-alignment.
Peptide identification parameters
The peptide and protein identification was performed using the Phenyx database search
program (Geneva Bioinformatics, version 2.6, Geneva, Switzerland) using raw data in
mzData format. Datasets were searched against the UniProt database (version 57.4) and
against the reversed sequences of this database with the following parameters: taxonomy: Rattus
norvegicus; instrument types were selected according to the mass spectrometer used; FDR
rate: <1; scoring model: ESI-QTOF (QTOF) for QTOF data and CID_LTQ_scan_LTQ for
Orbitrap data; parent ion charge states: +2, +3, +4 (with trusted medium charge). The search
was performed in two subsequent cycles. The following search parameters were common
to both cycles: peptide AC score: ≥5; peptide length: ≥5; p-value: <0.0001; cleaving
enzyme: trypsin (KR); number of allowed missed cleavages: ≤ 1. The following search
parameters differed between the first and second search cycles: for cycle 1 amino
acid modifications: Cys_CAM (carbamidomethylation, fixed), Oxidation_M (oxidation of
Figure S7. Scatter plot of matched peak pairs obtained with the intensity-rank-based peak matching method from deisotoped
LC-MS peak lists (left), and after deisotoping and decharging the same two LC-MS peak lists (right). Decharging the peak
lists results in a lower number of matched peak pairs, but the peak pairs are richer in accurately matched peak pairs
indicating the retention time trend. Peak matching parameters for Lab1_GLY_LungEGFR_normal vs
Lab2_GLY_LungEGFR_normal were 500 as the window length, 0.1 Da as the maximal m/z difference and 0.9 as the rank
fraction. The two analysed LC-MS peak lists had the following pre-analytical parameters: LC-MS 1: laboratory
1, GLY depletion, Lung EGFR cancer type, without tumor; LC-MS 2: laboratory 2, GLY depletion, Lung EGFR cancer
type, without tumor.
[Figure S7 plots: left and right panels, each with x axis "Retention time LC-MS 2 (min)" and y axis "Retention time LC-MS 1 (min)".]
Figure S8. Peak with long tailing in the dataset of rat CSF analysed in laboratory 2 (a) and histogram of peak width at half
peak height (35 bins, taking the 1000 most abundant peaks) of the 4 samples in two datasets acquired in different
laboratories (b) (the chromatograms used are the same as those in the middle column of Figures 3 and S9). Plot (a) was
prepared with the help of the OpenDX (http://www.opendx.org/) visualization software tool.
[Figure S8 plots: panel (a), peak with long tailing; panel (b), histogram with x axis "peak width at half peak height (min)" and y axis "counts"; legend: rat CSF Lab1, rat CSF Lab2, rat serum Lab1, rat serum Lab2.]
Figure S9. Extracted ion chromatograms (EIC) of three peptides from the same sample (sample 6 from the
rat CSF dataset) in two laboratories using the original retention time values. Peptide LTLPQLEIR (green
arrows) is located on the monotonic retention time correspondence function, while the peptides DIAPTLTLYVGK
(red arrows) and VHQFFNVGLIQPGSVK (blue arrows) are located far from this function. Figure 5 shows
the location of these peaks after alignment of one chromatogram to the other. The locations of the three peaks are
shown in the scatter plot of Figure 4 with corresponding red, green and blue circles. The extracted ion
chromatograms are normalized to the highest peaks; for that reason the Y axis represents ion counts relative
to the intensity of the most abundant signal.
[Figure S9 plot. X axis: Time (min). Y axis: Ion count (cts), x 10^7. Legend: 645.87 +-0.025 Da, 541.83 +-0.025 Da and 590.66 +-0.025 Da, each with original retention time, for Lab1 and Lab2.]
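The extraction and normalization mentioned in the Figure S9 caption can be illustrated with a short sketch. It assumes a hypothetical in-memory layout in which each scan is a `(rt_minutes, [(mz, intensity), ...])` tuple; the function names are invented for this example, and the default tolerance of 0.025 Da is taken from the figure legend. It is not the procedure used to produce the figure.

```python
def extract_eic(scans, target_mz, tol=0.025):
    """Return [(rt, summed_intensity)] for signals within target_mz +/- tol.

    scans: list of (rt_minutes, [(mz, intensity), ...]) tuples (hypothetical layout).
    """
    eic = []
    for rt, signals in scans:
        # Sum all signals of this scan that fall inside the m/z window.
        total = sum(i for mz, i in signals if abs(mz - target_mz) <= tol)
        eic.append((rt, total))
    return eic


def normalize_to_max(eic):
    """Scale intensities so that the highest point of the trace equals 1.0."""
    peak = max((i for _, i in eic), default=0.0)
    if peak <= 0:
        return [(rt, 0.0) for rt, _ in eic]
    return [(rt, i / peak) for rt, i in eic]
```

Normalizing each trace separately makes peak positions easy to compare between laboratories, at the cost of hiding absolute intensity differences.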
Figure S10. Scatter plots of matched peaks between two LC-MS chromatograms with time alignment
functions. All chromatograms were obtained from the National Cancer Institute’s
Mouse Proteomic Technology Initiative and originate from an experimental design study of mouse serum
analysis. The scatter plot in the middle column was obtained from two LC-MS chromatograms of the same
sample prepared in two laboratories, while the right and left columns were obtained with two LC-MS
chromatograms of the same laboratory, one of which was also used in the middle scatter plot. Matched peak
pairs were obtained using peak lists obtained from single-stage LC-MS data with an OpenMS workflow and the
intensity-rank-based peak matching procedure. Peak pairs not selected as accurately matched peak pairs are
blue. The peak pairs selected as accurately matched are contoured with dashed red lines and highlighted
with green circles. The main monotonic retention time correspondence function is shown as a solid red line.
Samples have the following factors in the experimental design: GLY depletion method, Lung EGFR cancer
type, tumor (middle plot) and tumor and healthy (side plots).
[Figure S10 plots: panels (a), (b) and (c), under the headings "Single-stage MS peak list" and "Intensity-rank-based peak matching". Panel titles: Within laboratory (2 samples), Interlaboratory (same sample), Within laboratory (2 samples). Axes: retention time in Lab 1 or Lab 2, in minutes.]
Figure S11. Overlaid plot of multiple main monotonic retention time correspondence functions in 7 chromatogram pairs
of the same rat CSF samples (a, b and c) and 24 chromatogram pairs of mouse serum samples (d) measured in two
laboratories. In a) pairs of LC-MS precursor ion peak lists were matched based on agreement of identified peptide
sequence and PTMs, in b) two LC-MS precursor ion peak lists were matched using the intensity-rank-based peak matching
approach and in c) pairs of LC-MS single stage ion peak lists were matched using the intensity-rank-based peak matching
algorithm. The overlaid plot in d) was obtained with pairs of LC-MS single stage ion peak lists matched using the
intensity-rank-based peak matching algorithm, and the main monotonic function is colored according to the applied
depletion method (in red GLY, in green MARS, in blue M+CYS and in black NF). The high similarity of the main monotonic
retention time correspondence functions shows that the method using sequence information to match precursor ion peak
lists and intensity-rank-based matching of single stage LC-MS peak lists are robust with respect to biological
variability and that the two methods provide highly similar correction of retention time. Single stage LC-MS peak
lists in combination with intensity-rank-based peak matching are slightly less accurate, which is reflected by the
larger variability of the main monotonic retention time correspondence functions obtained with this method compared
with those obtained with precursor ion LC-MS peak lists matched using agreement of identified peptide sequence and
PTMs.
[Figure S11 plots: panels a), b), c) and d), with group labels "Rat CSF" and "Rat serum". Axes: Retention time laboratory 1 (in minutes) and Retention time laboratory 2 (in minutes).]
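The "larger variability" of overlaid correspondence functions mentioned in the Figure S11 caption is assessed visually in the figure. One possible way to put a number on it, shown here only as an assumption-laden sketch, is the pointwise standard deviation of the curves on a common retention time grid. Each curve is represented as a sorted list of `(x, y)` knots joined by straight lines; this piecewise-linear representation and the names `interp` and `pointwise_spread` are hypothetical and are not taken from the thesis.

```python
from statistics import pstdev


def interp(x, knots):
    """Piecewise-linear interpolation through knots sorted by x; clamps outside the range."""
    if x <= knots[0][0]:
        return knots[0][1]
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return knots[-1][1]


def pointwise_spread(curves, grid):
    """Population standard deviation across curves at each grid point."""
    return [pstdev([interp(x, c) for c in curves]) for x in grid]
```

A set of functions with a smaller spread along the whole grid would correspond to the tighter bundle of lines seen in the overlaid panels.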
Figure S12. Monotonic nonlinear time alignment functions (solid red and green lines) determined with the two
LC-MS/MS chromatograms used in either order as sample and reference chromatogram. Peak matching was performed using identified
peptide sequence and post-translational modification data, and blue dots show the matched peak pairs. The two
chromatograms were from sample 6 in laboratory 1 and laboratory 2. The two monotonic retention time correspondence
functions are highly similar, which shows that the time alignment procedure does not depend on the order of the
chromatograms.
[Figure S12 plot. X axis: Retention time in minutes (laboratory 1). Y axis: Retention time in minutes (laboratory 2).]
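The order independence stated in the Figure S12 caption implies that the function mapping laboratory 1 retention times to laboratory 2, followed by the function obtained with the roles swapped, should approximately return the original time. The sketch below checks this round trip. It is only an illustration under stated assumptions: both functions are approximated by piecewise-linear curves through sorted `(x, y)` knots, the names `interp` and `roundtrip_error` are hypothetical, and the thesis's own alignment functions are not reproduced.

```python
def interp(x, knots):
    """Piecewise-linear interpolation through knots sorted by x; clamps outside the range."""
    if x <= knots[0][0]:
        return knots[0][1]
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return knots[-1][1]


def roundtrip_error(f_knots, g_knots, grid):
    """Largest |g(f(t)) - t| over the grid.

    f_knots: function from chromatogram A to B; g_knots: function from B to A.
    A value near zero means the two alignment directions agree.
    """
    return max(abs(interp(interp(t, f_knots), g_knots) - t) for t in grid)
```

If the two directions were fitted independently, a small round-trip error relative to the chromatographic peak width would support the conclusion drawn in the caption.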
References
(1) Kendall MG, Buckland WR, Institute IS. A dictionary of statistical terms: Hafner Pub. Co.; 1971.
(2) Liu T, Qian WJ, Chen WN, Jacobs JM, Moore RJ, Anderson DJ, et al. Improved proteome coverage by
using high efficiency cysteinyl peptide enrichment: the human mammary epithelial cell proteome. Proteomics.
2005;5:1263-73.
(3) Liu T, Qian WJ, Strittmatter EF, Camp DG, 2nd, Anderson GA, Thrall BD, et al. High-throughput
comparative proteome analysis using a quantitative cysteinyl-peptide enrichment technology. Anal Chem.
2004;76:5345-53.
(4) Livesay EA, Tang K, Taylor BK, Buschbach MA, Hopkins DF, LaMarche BL, et al. Fully automated four-
column capillary LC-MS system for maximizing throughput in proteomic analyses. Anal Chem. 2008;80:294-
302.
(5) Kessner D, Chambers M, Burke R, Agus D, Mallick P. ProteoWizard: open source software for rapid