OpenMS: Quantitative proteomics at large scale
Yasset Perez-Riverol, Ph.D.
github: github.com/ypriverol
twitter: @ypriverol
Proteomics Bioinformatics, EMBL-EBI, December 2016
Outline
• Introduction to OpenMS
  Modularity & Workflows
  Visualization
  Integration with other tools
• Two example workflows
  Protein identification
  Label-free quantification
Modularity is the degree to which a system's components may be separated and recombined.
Modularity
Tools for identification
DecoyDatabase, MascotAdapter, XTandemAdapter, MSGFPlusAdapter, PeptideIndexer, FalseDiscoveryRate, IDPosteriorErrorProbability, ConsensusID, LuciphorAdapter, HighResPrecursorMassCorrector, FidoAdapter
Tools for quantification
PeakPickerHiRes, FeatureFinderMultiplex, FeatureFinderCentroided, SpectraMerger, NoiseFilterSGolay, ITRAQAnalyzer, IDMapper, IDConflictResolver, MapAlignerPoseClustering, MapRTTransformer, FeatureLinkerUnlabeledQT, ProteinQuantifier
Tools for file handling
FileConverter, FileMerger, FileFilter, IDFileConverter, IDMerger, IDFilter, MzTabExporter, FileInfo
OpenMS ⇨ collection of 180 software tools
≈ 30 tools sufficient for standard workflows
OpenMS
OpenMS – an open-source framework for computational mass spectrometry
Portable: available on Windows, OSX, Linux
OpenMS TOPP tools – The OpenMS Proteomics Pipeline tools
• > 180 Building blocks: One application for each analysis step
• Vendor independent: Uses PSI standard formats
Can be integrated in various workflow systems
• TOPPAS – TOPP Pipeline Assistant
• Galaxy
• KNIME
KNIME and TOPPView
KNIME – KoNstanz Information MinEr
• Enables building customized workflows from OpenMS components.
TOPPView: an OpenMS data viewer.
• Based on standard file formats.
• Shows MS/MS information, peptides/proteins, and quantitative information.
KNIME – Workflow System
KNIME – KoNstanz Information MinEr
• Industrial-strength general-purpose workflow system
• Convenient and easy-to-use graphical user interface
• Available for Windows, OSX, Linux at http://KNIME.org
[Screenshot of the KNIME workbench with labeled areas: Workflows, Nodes, Plots, Tables, Console. Image: KNIME (CC BY-SA 4.0)]
Workflow Builder: Data Flow
• KNIME-OpenMS workflows consist of distinct nodes that are assembled into workflows
• Either tables or files are exchanged between nodes along the edges of the workflow
• Configuration dialogs are used to set node parameters
• Loops allow iterating sequentially over lists of data
• Switches allow executing nodes or subworkflows depending on a condition
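The data-flow ideas above (nodes connected by edges, loops over lists, switches on a condition) can be illustrated with a small Python sketch. This is not KNIME code: `make_node`, `run_chain`, `run_loop`, and `run_switch` are hypothetical names, and each "node" only derives an output file name instead of processing data.

```python
from typing import Callable, List

# A node maps an input file name to an output file name (hypothetical model).
Node = Callable[[str], str]

def make_node(suffix: str) -> Node:
    """Hypothetical node: 'processes' a file by deriving a new file name."""
    def node(path: str) -> str:
        stem = path.rsplit(".", 1)[0]
        return f"{stem}.{suffix}"
    return node

def run_chain(nodes: List[Node], path: str) -> str:
    """Edges: the output of each node is the input of the next node."""
    for node in nodes:
        path = node(path)
    return path

def run_loop(nodes: List[Node], paths: List[str]) -> List[str]:
    """Loop: iterate sequentially over a list of inputs."""
    return [run_chain(nodes, p) for p in paths]

def run_switch(condition: bool, if_true: List[Node],
               if_false: List[Node], path: str) -> str:
    """Switch: run one of two sub-workflows depending on a condition."""
    return run_chain(if_true if condition else if_false, path)

pick_peaks = make_node("picked.mzML")
search = make_node("idXML")
results = run_loop([pick_peaks, search], ["sample1.mzML", "sample2.mzML"])
```

In a real workflow the parameters of each node would be set in its configuration dialog rather than in code.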
Scripting
• KNIME permits the embedding of R code for advanced statistics
• Embedding of R scripts using the R Snippet node
• All plotting capabilities of R can be used as well
Peptide/Protein Identification
Task: Identify peptides in multiple samples
• Mass spectra enter workflow on the left
• Loop nodes permit execution of parts of the workflow
• Identified proteins end up in result files (right side)
TOPPView: Visualization of the results
File formats shown: mzML, idXML
Workflow – Plug-In System
Task: Identify peptides in multiple samples
• Combination of X!Tandem + OMSSA
• Definition of QC parameters such as FDR, q-values, and p-values
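To make the FDR and q-value parameters concrete, here is a minimal Python sketch of a target-decoy estimate. It assumes that higher scores are better and uses the simple estimator decoys / targets; the function name `q_values` and the example scores are made up for illustration, and this is not necessarily the formula an OpenMS node applies.

```python
from typing import List, Tuple

def q_values(psms: List[Tuple[float, bool]]) -> List[float]:
    """psms: (score, is_decoy) pairs, higher score = better match (assumption).
    Returns one q-value per PSM, in order of descending score."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # FDR at each score threshold, estimated as decoys / targets accepted so far.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: lowest FDR at this threshold or any more permissive threshold.
    qvals = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvals.append(running_min)
    return list(reversed(qvals))

# Made-up example: four target hits and two decoy hits.
example = [(9.1, False), (8.7, False), (7.9, True),
           (7.5, False), (6.2, False), (5.0, True)]
example_q = q_values(example)
```

Filtering at a 1% FDR then amounts to keeping the PSMs whose q-value is at most 0.01.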
Complex and customized Workflows

                X!Tandem          Mascot            MS-GF+            Merged
PIA             1214  64 (5.3%)   1442  74 (5.1%)   1631  93 (5.7%)   1615 101 (6.2%)
Fido             996  67 (6.7%)   1439  80 (5.6%)   1679  96 (5.7%)   1619 105 (6.5%)
ProteinLP        989  64 (6.5%)   1229  77 (2.3%)   1651  93 (5.6%)   1295 104 (8.0%)
MSBayesPro       749  24 (3.2%)    958  26 (2.7%)   1303  31 (2.4%)    963  36 (3.7%)
ProteinProphet  1027  64 (6.2%)   1282  73 (5.7%)   1629  91 (5.6%)   1629  99 (6.7%)

Audain E. & Uszkoreit J. et al, Journal of Proteomics, 2017

Best protein inference algorithm:
• 3 datasets
• 4 search engines
• 5 protein inference algorithms
• > 140 combinations
Some of the Identification nodes
IDPosteriorErrorProbability
• Computes the posterior error probability for each PSM.
• Generates a new file with the corresponding values.
ConsensusID
• Combines PSM identifications from multiple search engines.
• Generates a combined posterior error probability for each PSM.
• For each peptide ID, uses the best score of any search engine as the consensus score.
FalseDiscoveryRate
• Estimates the false discovery rate on peptide and protein level using decoy searches.
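The "best score of any search engine" rule described for ConsensusID can be sketched in a few lines of Python. The sketch assumes the scores are posterior error probabilities, so lower is better; `consensus_best`, the peptide sequences, and the values are hypothetical and do not reproduce the OpenMS implementation.

```python
from typing import Dict

def consensus_best(per_engine: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """per_engine: engine name -> {peptide sequence -> posterior error probability}.
    Returns, for each peptide, the lowest (best) value reported by any engine."""
    consensus: Dict[str, float] = {}
    for scores in per_engine.values():
        for peptide, pep in scores.items():
            if peptide not in consensus or pep < consensus[peptide]:
                consensus[peptide] = pep
    return consensus

# Made-up results from two search engines.
engine_results = {
    "XTandem": {"PEPTIDEK": 0.02, "SAMPLER": 0.30},
    "OMSSA": {"PEPTIDEK": 0.05, "SAMPLER": 0.10, "ANOTHERK": 0.01},
}
consensus = consensus_best(engine_results)
```

A peptide found by only one engine simply keeps that engine's value.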
Adapters and Complementary Nodes
FileMerger
This node takes two files (or file lists) as input and outputs a merged list of both inputs. The order corresponds to the order of the input lists and ports.
IDMerger
Merges several protein/peptide identification files into one file.
PeptideIndexer
Refreshes the protein references for all peptide hits.
IDFilter
Filters results from protein or peptide identification engines based on different criteria.
Quantitative Proteomics
Relative Quantification
  Labeled
    In vivo: 14N/15N, SILAC
    In vitro: iTRAQ, TMT, 16O/18O
  Label-Free: Spectral Counting, MRM, Feature-Based
Absolute Quantification: AQUA, SISCAPA
And many more…
Label-Free Quantification (LFQ)
Label-free quantification is probably the most natural way of quantifying
• No labeling required, removing further sources of error, cheap
• Different samples acquired in different measurements – higher reproducibility needed
• Manual analysis difficult
• Scales very well with the number of samples, basically no limit, no difference in the analysis between 2 or 100 samples
Feature-based LFQ - LC-MS Maps
• Spectra are acquired with rates up to dozens per second
• Stacking the spectra yields peak maps
• Resolution:
  Up to millions of points per spectrum
  Tens of thousands of spectra per LC run
• Huge 2D datasets of up to hundreds of GB per sample
[Figure: LC-MS peak map. Labels: Feature (eluting peptide); Quantification (3x over-expressed, …)]
Feature-based LFQ
1. Find features in all maps
2. Align maps
3. Link corresponding features
4. Identify features
5. Quantify features
6. Quantify proteins based on their peptides

Protein        Sample 1 : Sample 2 : Sample 3
NPC2_HUMAN     1.0 : 5.2 : 0.3
CD177_HUMAN    1.0 : 0.2 : 0.4
⋮
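The last two steps (quantify features, then quantify proteins from their peptides) can be sketched as follows. The sketch assumes that peptide feature intensities are summed per protein and reported relative to sample 1, which is only one of several possible aggregation strategies; the function name `protein_ratios` and the intensities are made up so that they reproduce the ratios shown above.

```python
from typing import Dict, List, Tuple

def protein_ratios(features: List[Tuple[str, List[float]]]) -> Dict[str, List[float]]:
    """features: (protein accession, [intensity in sample 1, 2, 3, ...])
    for each linked and identified peptide feature.
    Sums intensities per protein and reports every sample relative to sample 1.
    Assumes the summed intensity in sample 1 is non-zero."""
    totals: Dict[str, List[float]] = {}
    for protein, intensities in features:
        if protein not in totals:
            totals[protein] = [0.0] * len(intensities)
        totals[protein] = [t + i for t, i in zip(totals[protein], intensities)]
    return {
        protein: [round(value / sums[0], 1) for value in sums]
        for protein, sums in totals.items()
    }

# Made-up peptide feature intensities across three samples.
linked_features = [
    ("NPC2_HUMAN", [100.0, 520.0, 30.0]),
    ("NPC2_HUMAN", [200.0, 1040.0, 60.0]),
    ("CD177_HUMAN", [500.0, 100.0, 200.0]),
]
ratios = protein_ratios(linked_features)
```

Because the same function handles any number of intensity columns, the analysis does not change between 2 and 100 samples.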
Label-Free Workflow
Different algorithms have been proposed by the OpenMS community for label-free quantification:
• Weisser H., Journal of Proteome Research (2013).
• Bo Zhang, Molecular & Cellular Proteomics (2016).
• Veit J., Journal of Proteome Research (2016).
• Ranninger C., Analytica Chimica Acta (2016).
DeMix-Q Algorithm and Workflow
Bo Zhang, Lukas Käll & Roman A. Zubarev, MCP (2016)
Reliable and reproducible Quantitation
LFQ Relevant nodes
FeatureFinderCentroided
Detects two-dimensional features in LC-MS data.
MapAlignerPoseClustering
Corrects retention time distortions between maps using a pose clustering approach.
FeatureLinkerUnlabeledQT
Groups corresponding features from multiple maps.
ConsensusMapNormalizer
Normalizes the maps of one consensusXML file.
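As an illustration of what normalizing maps can mean, the following Python sketch scales every map so that its median feature intensity matches that of the first map. The slides do not say which method the node uses, so median scaling is an assumption here, and `median_normalize` and the intensities are hypothetical.

```python
from statistics import median
from typing import List

def median_normalize(maps: List[List[float]]) -> List[List[float]]:
    """maps: one list of feature intensities per LC-MS map.
    Scales each map so that its median intensity equals the median of the
    first map (assumed reference). Assumes non-empty maps with non-zero medians."""
    reference = median(maps[0])
    normalized = []
    for intensities in maps:
        factor = reference / median(intensities)
        normalized.append([value * factor for value in intensities])
    return normalized

# Made-up example: the second map was measured at twice the overall intensity.
raw_maps = [[10.0, 20.0, 30.0], [20.0, 40.0, 60.0]]
normalized_maps = median_normalize(raw_maps)
```

After scaling, remaining intensity differences between maps are more likely to reflect abundance changes than differences in overall signal.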
OpenMS at Large Scale
Workflow systems: Galaxy, WS-PGRADE/gUSE, KNIME
Each individual tool can be run from the command line, which makes it possible to distribute the tools across large HPC environments.
$> FileFilter -in myinfile.mzML -levels 2 -rt 100:1500 -out myoutfile.mzML
$> OpenSwathDecoyGenerator.exe -in OpenSWATH_SGS_AssayLibrary.TraML -out OpenSWATH_SGS_AssayLibrary_with_Decoys.TraML -method shuffle -append -exclude_similar -remove_unannotated
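Because every tool is a plain command-line call, one job per input file can be generated by a script and handed to a scheduler. The Python sketch below only builds the command lines and does not run them; it reuses the FileFilter options from the example above, while the function name `file_filter_commands`, the input file names, and the output directory are hypothetical.

```python
import shlex
from pathlib import Path
from typing import List

def file_filter_commands(inputs: List[str], out_dir: str) -> List[List[str]]:
    """Builds one FileFilter call per input file, using the options shown
    in the example above (-levels 2, -rt 100:1500)."""
    commands = []
    for path in inputs:
        out_path = Path(out_dir) / (Path(path).stem + "_filtered.mzML")
        commands.append([
            "FileFilter", "-in", path,
            "-levels", "2", "-rt", "100:1500",
            "-out", out_path.as_posix(),
        ])
    return commands

# Hypothetical input files; each printed line could become one cluster job.
commands = file_filter_commands(["run1.mzML", "run2.mzML"], "filtered")
for command in commands:
    print(shlex.join(command))
```

How the resulting lines are submitted depends on the HPC scheduler in use, which the slides do not specify.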
Conclusions
• OpenMS: modular workflow system
• Standard workflows: SILAC, iTRAQ/TMT, label-free, SWATH, Quality Control
• Strong collaboration with other projects: ProteoWizard, Thermo PD, KNIME, Fido, Percolator, search engines, HUPO-PSI formats
How to run OpenMS workflows
• OpenMS, local installation (Windows, OS X, Linux)
  http://bit.ly/1J6lz6h
  http://openms.de/workflows
• OpenMS in Proteome Discoverer (LFQProfiler and RNPxl for PD 2.1)
  http://openms.de/PD
• OpenMS in Galaxy
  http://galaxy.uni-freiburg.de
• OpenMS in KNIME
  https://tech.knime.org/community/bioinf/openms