
Sep 25, 2020


Method 2) Towards blind signal restoration for LC/MS

The ProteomeGRID web service for sparse signal restoration and its extension to raw LC/MS data with blind peak shape estimation

Andrew W. Dowsey and Guang-Zhong Yang, Hamlyn Centre, Institute of Global Health Innovation, Imperial College London, UK

Overview

In our other poster, we presented the novel seaMS signal restoration framework for integrated baseline estimation, deisotoping and decharging of raw mass spectra, and showed that it leads to significantly improved feature detection. Particular strengths of the framework are:

» Robust performance over a wide range of its single tuning parameter.

» Its scalability to large-scale datasets, e.g. LC/MS.

» The range of modelling extensions that can be incorporated.

Only the instrument resolution and peak shape must be known in advance. In this poster we extend the framework to automatically estimate the varying peak shape across the m/z range during a preprocessing step. We also discuss our progress towards integrated peak shape estimation and restoration on full LC/MS datasets, and our web-based ProteomeGRID visualisation and cluster computing platform.


Method 1) Peak shape estimation for seaMS

Conclusion

The seaMS framework has been extended with two methods for peak shape estimation. The first can be used immediately. The second, based on a full blind restoration paradigm, is a ‘proof of concept’ but has a number of advantages:

» Robustness through direct spatial constraints.

» Does not need a set of spectra to train on.

» Scales naturally to a second LC/MS dimension.

Future work will aim to derive confidence intervals for each detected peak through a direct Variational Bayes implementation of seaMS or a downstream Gibbs sampling stage. Another major goal is to integrate differential analysis through sparse coding or functional mixed modelling and therefore provide a full framework for statistical quantification of differential expression directly in the signal domain.

The ProteomeGRID web service

http://www.proteomegrid.org/

Acknowledgements

Project funded by EPSRC UK grant EP/E03988X/1 awarded to AWD.

References

[1] Dolui et al., Proc. 2011 IEEE International Symposium on Biomedical Imaging (ISBI)

[2] Babacan et al., IEEE Transactions on Image Processing 18, pp. 12-26, 2009

[3] Zubarev et al., Rapid Communications in Mass Spectrometry 10, pp. 1386-1392, 1996

[4] Dowsey et al., Proteomics 12, pp. 3800-3812, 2004

Figure 4. The ProteomeGRID portal and spectrum visualisation.

Introduction

seaMS applies the Sparse Richardson-Lucy technique [1] to a formation model in which the peak shape H acts on the sum of a baseline term and the isotope distribution terms. The model equations appear after the Discussion.

In our original formulation, H is a convolution matrix and assumed known. Here we propose two extended models where H is unknown:

Method 1

We assume H at each m/z can be approximated by a sum of n dyadic multiscale basis functions h_i, e.g. Lorentzians, Gaussians or Exponentially-Modified Gaussians (EMGs). Each h_i has its own set of unknown coefficients c_i.

The disadvantages of this approach are:

» The number of unknowns is multiplied by n.

» Cannot constrain peak shape at neighbouring m/z to be consistent, so:

• The algorithm is less robust and should only be used to estimate peak shape.

• Some post-processing is essential.

Method 2

Alternatively, a separate set of coefficients c_H can be employed for H.

This is the established ‘blind signal restoration’ approach, which is currently a very active field of basic research. The strategy is to alternate between fixing c_H while estimating {c_A, c_B}, and fixing {c_A, c_B} while estimating c_H. However:

» The resulting optimisation is highly non-convex and so prone to local minima.

» A recent breakthrough [2] suggests incorporating uncertainty is essential. If each alternating optimisation is throttled by the residual uncertainty, most local minima can be avoided.
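The alternating strategy can be illustrated with a deliberately naive sketch: blind Richardson-Lucy on a 1-D signal with circular convolution. This is not the Variational Bayes method of [2] and not the seaMS implementation; the function names, the toy data and the Gaussian starting kernel `h0` are all assumptions made for the example. Nothing in this scheme rules out degenerate factorisations (for instance a spike-like kernel with the blurred data as the "signal"), which is one way to see why priors and uncertainty are needed.

```python
import numpy as np

def circulant(v):
    """Matrix C such that C @ x is the circular convolution of v and x."""
    n = v.size
    i, j = np.indices((n, n))
    return v[(i - j) % n]

def poisson_nll(g, model, eps=1e-12):
    """Poisson negative log-likelihood, up to a term that depends only on g."""
    return float(np.sum(model - g * np.log(model + eps)))

def blind_richardson_lucy(g, h0, n_outer=30, n_inner=5, eps=1e-12):
    """Alternate Richardson-Lucy updates of the signal f and the peak shape h."""
    f = np.full(g.size, g.sum() / g.size)   # flat, strictly positive start
    h = h0 / h0.sum()                       # peak shape constrained to unit area
    for _ in range(n_outer):
        Ch = circulant(h)                   # fix h, update f
        for _ in range(n_inner):
            f *= Ch.T @ (g / (Ch @ f + eps))    # columns of Ch sum to 1
        Cf = circulant(f)                   # fix f, update h
        for _ in range(n_inner):
            h *= (Cf.T @ (g / (Cf @ h + eps))) / (f.sum() + eps)
        s = h.sum()                         # move the scale ambiguity into f
        h, f = h / s, f * s
    return f, h

# Toy data (hypothetical): two spikes blurred by a narrow Gaussian peak shape.
n = 64
d = np.minimum(np.arange(n), n - np.arange(n))   # circular distance from index 0
h_true = np.exp(-0.5 * (d / 2.0) ** 2)
h_true /= h_true.sum()
f_true = np.zeros(n)
f_true[[20, 40]] = [200.0, 100.0]
g = circulant(h_true) @ f_true

h0 = np.exp(-0.5 * (d / 5.0) ** 2)               # deliberately too wide
f_est, h_est = blind_richardson_lucy(g, h0)
```

A flat starting kernel would leave both estimates flat forever, which is why the sketch starts from a broad Gaussian; the result still depends on that choice.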

Our approach

We have implemented Method 1 so that the seaMS framework can be used right away. Initial development of Method 2 has been focused on 2-D gel electrophoresis, as a convenient Gaussian noise assumption can be employed. We are now working towards an integrated blind approach for LC/MS with a Poisson noise model.

seaMS can be trivially extended to LC/MS by augmenting each isotope and chemical noise template with a set of multiscale B-spline templates in the LC dimension:

» This causes the restoration for each spectrum to be regularised by its neighbours.

» But does not attempt to separate coincident features in the LC dimension.
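One way to picture such a template is as a separable outer product of an LC profile and a spectral template. The sketch below assumes cubic B-splines and a dyadic scale parameter; the poster does not state the spline order, and `lc_ms_template` and the toy isotope pattern are hypothetical names and values.

```python
import numpy as np

def cubic_bspline(t):
    """Centred cubic B-spline with support (-2, 2)."""
    a = np.abs(np.asarray(t, dtype=float))
    out = np.zeros_like(a)
    inner = a < 1.0
    outer = (a >= 1.0) & (a < 2.0)
    out[inner] = 2.0 / 3.0 - a[inner] ** 2 + 0.5 * a[inner] ** 3
    out[outer] = (2.0 - a[outer]) ** 3 / 6.0
    return out

def lc_ms_template(ms_template, n_scans, centre, scale):
    """Separable 2-D template: B-spline over scans (rows) times a spectral template (columns)."""
    scans = np.arange(n_scans)
    lc_profile = cubic_bspline((scans - centre) / 2.0 ** scale)
    return np.outer(lc_profile, ms_template)

# Toy isotope pattern (hypothetical values) spread over 16 scans at scale 1.
ms_template = np.array([0.0, 0.6, 0.3, 0.1, 0.0])
T = lc_ms_template(ms_template, n_scans=16, centre=8, scale=1)
```

Because each 2-D template is rank one, neighbouring spectra share the same spectral pattern and differ only in the LC weight, which is how the restoration of each spectrum comes to be regularised by its neighbours.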

Method

We employ the Variational Bayes approach to blind restoration presented in [2]:

» The generated peak shape is regularised purely by a smoothness penalty.

» A shrinkage prior is used as in seaMS. In [2] an edge-preserving prior is used.

Variational Bayes is an extension of Expectation-Maximisation that provides an analytical approximation to the posterior probability. It:

» Gives a sufficiently accurate measure of uncertainty for blind restoration.

» Avoids the heavy computational cost of sampling methods.

» Automatically estimates the optimal level of regularisation to avoid overfitting.

Results on 2-D gel and LC/MS data are given in Figure 3. The technique is very promising for gel analysis as it reveals the underlying spot shape dynamics for the first time, enabling robust extraction of co-migrating spots.

Discussion

Blind restoration for LC/MS will require a number of additional modelling components. For example, since LC/MS spot shape is a separable convolution, the LC peak profile should be consistent across the m/z range for a given elution time and, equally, the MS profile should be constant at the same m/z in each spectrum.

Formation model (original formulation):

$$ g = P\left( H \left[ B c_B + \sum_{q \in Q} A_q c_{A_q} \right] \right) $$

g is the raw observed spectrum.

H is the varying instrument peak shape.

A are the isotope distribution dictionaries over charge/adduct states Q.

B is the multiscale baseline dictionary.

c = {c_A, c_B} are the unknowns to estimate.
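For a known peak shape, the restoration step can be sketched as a multiplicative Richardson-Lucy update with an added shrinkage term. This is one common form of Poisson-likelihood update with an L1 penalty on non-negative coefficients; it is an assumption that it resembles the update in [1], and it is not the seaMS code. Here `M` stands for the peak-shape matrix applied to the stacked baseline and isotope dictionaries, and the toy example uses an identity dictionary with no baseline.

```python
import numpy as np

def sparse_richardson_lucy(g, M, shrinkage=0.0, n_iter=200, eps=1e-12):
    """Multiplicative updates for min_c sum(M c - g log(M c)) + shrinkage * sum(c), c >= 0.

    g : observed non-negative spectrum, shape (m,)
    M : forward operator (peak shape applied to the dictionaries), shape (m, k)
    """
    col_sums = M.sum(axis=0)            # M^T 1, the usual Richardson-Lucy normaliser
    c = np.ones(M.shape[1])             # strictly positive start
    for _ in range(n_iter):
        ratio = g / (M @ c + eps)       # data divided by current model
        c *= (M.T @ ratio) / (col_sums + shrinkage + eps)
    return c

# Toy example (hypothetical): two spikes blurred by a Gaussian peak shape.
x = np.arange(60)
H = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 2.0) ** 2)
H /= H.sum(axis=0)                      # every peak has unit area
c_true = np.zeros(60)
c_true[20], c_true[35] = 100.0, 50.0
g = H @ c_true                          # noise-free observation

c_hat = sparse_richardson_lucy(g, H, shrinkage=0.05)
```

With unit-area columns, a zero shrinkage preserves the total signal at every iteration, while a positive shrinkage scales it down; this is the downward bias that a final pass with zero shrinkage is meant to remove.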

Method 1 model:

$$ g = P\left( \sum_{i=1}^{n} h_i \left[ B c_{B,i} + \sum_{q \in Q} A_q c_{A_q,i} \right] \right) $$

Method 2 model:

$$ g = P\left( H c_H \left[ B c_B + \sum_{q \in Q} A_q c_{A_q} \right] \right) $$

EMG peak model

With ToF instruments there are at least two mechanisms affecting peak shape:

» The instrument exhibits a Gaussian aperture that widens with slight right skew due to expansion of the ion cloud in transit as ToF increases.

» However, the right tails can often appear much stronger in practice.

• [3] suggests an ion-dependent difference between ions formed at the surface and those formed in gas-phase, plus the time delay between them.

• They postulate this forms an exponential distribution and propose convolution with the Gaussian aperture, leading to an EMG model.

We therefore represent H as n = mk EMGs, with m and k dyadic scalings in σ and λ (skew) respectively. Each EMG is centered on its mode:

» So that the set of EMGs for a single feature will coincide on the m/z axis.
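A bank of this kind might be built as follows. The sketch uses the standard EMG density (a Gaussian convolved with an exponential) and treats `lam` as the exponential rate, which may not match the poster's parameterisation of skew; the mode is located numerically on a grid. `emg_mode_offset`, `dyadic_emg_bank` and the starting values are hypothetical.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def emg_pdf(x, mu, sigma, lam):
    """EMG density: Gaussian(mu, sigma) convolved with an exponential of rate lam."""
    arg = (mu + lam * sigma ** 2 - x) / (math.sqrt(2.0) * sigma)
    expo = np.exp(0.5 * lam * (2.0 * mu + lam * sigma ** 2 - 2.0 * x))
    return 0.5 * lam * expo * _erfc(arg)

def emg_mode_offset(sigma, lam, n_grid=20001):
    """Numerically locate the mode of an EMG with mu = 0."""
    t = np.linspace(-6.0 * sigma, 6.0 * sigma + 10.0 / lam, n_grid)
    return t[np.argmax(emg_pdf(t, 0.0, sigma, lam))]

def dyadic_emg_bank(x, centre, sigma0, lam0, m, k):
    """Return an (m*k, len(x)) array of EMGs whose modes all sit at `centre`."""
    bank = []
    for a in range(m):
        for b in range(k):
            sigma, lam = sigma0 * 2.0 ** a, lam0 * 2.0 ** b   # dyadic scalings
            mu = centre - emg_mode_offset(sigma, lam)         # shift the mode onto `centre`
            bank.append(emg_pdf(x, mu, sigma, lam))
    return np.array(bank)

# Nine basis functions (m = k = 3) for a feature at x = 5 on an arbitrary axis.
x = np.linspace(-5.0, 15.0, 4001)
bank = dyadic_emg_bank(x, centre=5.0, sigma0=0.25, lam0=0.5, m=3, k=3)
```

Centring on the mode rather than on the Gaussian mean keeps the apex of every basis function at the same position, so a mixture of them still describes a single peak location.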

Method

1) Perform seaMS with the extended model on a set of representative spectra. The more spectra provided, the more reliable the peak shape estimate will be.

• On the output, repeat with zero shrinkage. This removes the downward biases on feature quantification and m/z resolution estimation.

• Extract each feature simply as a contiguous group of non-zero coefficients.

2) Since a sum of dyadic EMGs is an approximation of a true EMG, we now fit a mean-centered EMG to the reconstruction of each detected monoisotopic peak.

3) Using robust regression weighted by the intensity of each feature, a smooth polynomial is fitted to the set of σ and λ of the derived EMGs.

• Standard seaMS can now be used with this estimated peak shape function.
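Step 3 can be sketched with iteratively reweighted least squares. The poster does not say which robust estimator is used, so the Huber weighting, the MAD scale estimate and the simulated data below are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def robust_weighted_polyfit(x, y, intensity, degree=2, n_iter=20, huber_k=1.345):
    """IRLS polynomial fit: intensity weights multiplied by Huber weights."""
    base_w = intensity / intensity.max()
    w = base_w.copy()
    for _ in range(n_iter):
        # np.polyfit applies w to the unsquared residuals, hence the square root.
        coeffs = np.polyfit(x, y, degree, w=np.sqrt(w))
        r = y - np.polyval(coeffs, x)
        # Robust residual scale from the median absolute deviation.
        scale = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
        u = np.abs(r) / (huber_k * scale)
        w = base_w / np.maximum(u, 1.0)     # Huber: weight 1 inside, 1/u outside
    return coeffs

# Simulated per-feature estimates of sigma with regular outliers (hypothetical data).
rng = np.random.default_rng(0)
mz = np.linspace(500.0, 2500.0, 200)
sigma_true = 0.01 + 2e-5 * mz
sigma_obs = sigma_true + rng.normal(0.0, 0.001, mz.size)
sigma_obs[::10] += 0.05                     # every tenth feature is an outlier
intensity = rng.uniform(0.1, 1.0, mz.size)

coeffs = robust_weighted_polyfit(mz, sigma_obs, intensity, degree=1)
```

The same call would be repeated for λ. The fitted polynomials then give σ and λ at any m/z for a subsequent standard seaMS run.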

Results

Figure 1 illustrates the workflow on a MALDI-ToF 7-Mix spectrum. Figure 2 shows derived functions for simulated ESI-ToF spectra with a known resolution of 5000:

» The robust regression handles the significant number of outliers well, but...

» ...the method overestimates the instrument resolution as m/z rises.

Figure 3. Blind image deconvolution of 2-D gel (left) and LC/MS (right) images, performed on 9 image blocks separately. The computed feature maps are shown using the colour scale on the far left and far right (yellow is highest intensity). The middle images show the corresponding estimated dispersion function for each block. Note that the LC/MS results purely illustrate a ‘proof of concept’: chromatogram peak shape is far more variable than can be represented with 9 blocks.


Figure 5. (left) Irregularly-sampled mzML is stored in POINTS. A multiscale representation of the dataset is simultaneously built up in SEGMENTS, with a one point look-ahead. Each top-level segment (yellow) has zero or one point inside it, and the lower levels (purple) are built up by pushing/popping a stack (red arrows). Min/mean/max are stored in the purple segments whilst yellow segments are discarded. (right) A database query first fetches segments corresponding to the desired image resolution and m/z range. If some segments do not exist (white), a second query retrieves the remaining points.
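The two-stage lookup described in the caption can be sketched in memory as follows. This is a simplification under stated assumptions: it uses one open accumulator per level instead of the stack with one-point look-ahead, a dyadic grid of segment widths, and plain Python containers in place of the PostgreSQL tables. `Segment`, `build_segments` and `query` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    level: int
    index: int      # position on the dyadic m/z grid at this level
    lo: float       # minimum intensity
    hi: float       # maximum intensity
    total: float
    count: int

    @property
    def mean(self):
        return self.total / self.count

def build_segments(points, base_width, n_levels):
    """Stream (mz, intensity) pairs sorted by mz into min/mean/max segments."""
    stored = {}
    open_seg = [None] * n_levels

    def flush(level):
        seg = open_seg[level]
        if seg is not None and seg.count > 1:   # single-point segments are discarded
            stored[(level, seg.index)] = seg
        open_seg[level] = None

    for mz, inten in points:
        for level in range(n_levels):
            idx = int(mz // (base_width * 2 ** level))
            if open_seg[level] is not None and open_seg[level].index != idx:
                flush(level)                    # the point has left this segment
            seg = open_seg[level]
            if seg is None:
                open_seg[level] = Segment(level, idx, inten, inten, inten, 1)
            else:
                seg.lo = min(seg.lo, inten)
                seg.hi = max(seg.hi, inten)
                seg.total += inten
                seg.count += 1
    for level in range(n_levels):
        flush(level)
    return stored

def query(points, stored, base_width, level, mz_min, mz_max):
    """First fetch stored segments, then raw points where no segment exists."""
    width = base_width * 2 ** level
    first, last = int(mz_min // width), int(mz_max // width)
    segs = [stored[(level, i)] for i in range(first, last + 1) if (level, i) in stored]
    covered = {s.index for s in segs}
    raw = [(mz, y) for mz, y in points
           if mz_min <= mz <= mz_max and int(mz // width) not in covered]
    return segs, raw

# Toy POINTS table (hypothetical values): dense near 100 and 108, sparse at 103.7.
points = [(100.0, 1.0), (100.1, 5.0), (100.2, 3.0),
          (103.7, 2.0), (108.05, 4.0), (108.06, 8.0)]
stored = build_segments(points, base_width=0.5, n_levels=3)
segs, raw = query(points, stored, base_width=0.5, level=0, mz_min=100.0, mz_max=109.0)
```

In the toy data the dense regions come back as two summarised segments and the isolated point at m/z 103.7 comes back from the raw table, mirroring the first and second queries in the figure.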

seaMS feature detection will be provided to the community through the ProteomeGRID portal. The web service, currently in beta test, has these features:

» A secure collaborative environment based on Python/Django/Pinax:

• Project and study hierarchy with private wikis and upload/archival/download of mzML and Matlab data.

• Private messaging and automatic notification framework.

» A cluster computing backend through a bespoke interface [4] to Condor.

• seaMS processing on our 64-node cluster.

• Interface to downstream protein identification.

» Online visualisation of raw spectra and overlaid seaMS results with HTML5 and JavaScript, as shown in Figure 4.

• A hybrid multiscale data representation has been developed for efficient streaming and storage into a PostgreSQL database, as explained in Figure 5.

A version of this portal is also being used in our research group for storage and analysis of body sensor network and biosensor data.

Figure 1. Workflow demonstrated on a single MALDI-ToF spectrum described in our other poster. 1) seaMS performed with the extended EMG model: (green) original spectrum, (red) reconstructed baseline, (blue) reconstructed isotope distributions, (cyan) reconstructed monoisotopic peaks showing the estimated peak shape. 2) Each monoisotopic peak is fitted to an EMG. 3) Robust regression estimates EMG spread (σ) and skew (λ) across the spectrum.

Figure 2. Peak shape estimation on a simulated ESI spectrum (see other poster) with (left) one and (right) 16 tryptically digested proteins: (green) known resolution, (red) estimated, (blue) detected features weighted by intensity.

[Figure 1 plots: intensity and % intensity against m/z; σ and λ against m/z. Figure 2 plots: resolution against m/z in both panels.]