Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

Jan 03, 2016

Sheila Francis
Page 1:

Search Engine Result Combining

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Page 2:

Peptide Identification Results

• Search engines provide an answer for every spectrum...
  • Can we figure out which ones to believe?
• Why is this hard?
  • Hard to determine “good” scores
  • Significance estimates are unreliable
  • Need more ids from weak spectra
  • Each search engine has its strengths ... and weaknesses
  • Search engines give different answers

Page 3:

Mascot Search Results

Page 4:

Translation start-site correction

• Halobacterium sp. NRC-1
  • Extremely halophilic archaeon, insoluble membrane and soluble cytoplasmic proteins
  • Goo, et al. MCP 2003.
• GdhA1 gene:
  • Glutamate dehydrogenase A1
  • Multiple significant peptide identifications
  • Observed start is consistent with Glimmer 3.0 prediction(s)

Page 5:

Halobacterium sp. NRC-1 ORF: GdhA1

• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651

[Figure: plot with horizontal axis ticks from 0 to 440]

Page 6:

Translation start-site correction

Page 7:

Search engine scores are inconsistent!

[Figure: plot with axes labeled Mascot and Tandem]

Page 8:

Common Algorithmic Framework – Different Results

• Pre-process experimental spectra
  • Charge state, cleaning, binning
• Filter peptide candidates
  • Decide which PSMs to evaluate
• Score peptide-spectrum match
  • Fragmentation modeling, dot product
• Rank peptides per spectrum
  • Retain statistics per spectrum
• Estimate E-values
  • Apply empirical or theoretical model
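The binning, dot-product scoring, and ranking steps of this framework can be sketched as below. This is a minimal illustration, not any particular engine's method: the function names, the 1.0 m/z bin width, and the use of a cosine-normalized dot product are all assumptions made for the example, and the theoretical fragment peaks are supplied by the caller rather than modeled.

```python
from collections import defaultdict
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin width is an assumption)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def normalized_dot_product(observed, theoretical):
    """Cosine similarity between two binned spectra (0.0 if either is empty)."""
    shared = set(observed) & set(theoretical)
    dot = sum(observed[b] * theoretical[b] for b in shared)
    norm_obs = sqrt(sum(v * v for v in observed.values()))
    norm_theo = sqrt(sum(v * v for v in theoretical.values()))
    if norm_obs == 0 or norm_theo == 0:
        return 0.0
    return dot / (norm_obs * norm_theo)


def rank_candidates(peaks, candidates, bin_width=1.0):
    """Score every candidate peptide against one spectrum, best score first.

    candidates maps a peptide string to its (hypothetical) theoretical peaks.
    """
    observed = bin_spectrum(peaks, bin_width)
    scored = [
        (normalized_dot_product(observed, bin_spectrum(frags, bin_width)), peptide)
        for peptide, frags in candidates.items()
    ]
    return sorted(scored, reverse=True)
```

Two engines that follow these same steps but choose a different bin width, fragment model, or normalization will rank the same candidates differently, which is the point of the slide.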

Page 9:

Comparison of search engines

• No single score is comprehensive

• Search engines disagree

• Many spectra lack confident peptide assignment

[Figure: diagram comparing OMSSA, X!Tandem, and Mascot, with region labels 4%, 10%, 2%, 5%, 9%, 69%, 2%]

Page 10:

Lots of techniques out there

• Treat search engines as black-boxes
  • Generate PSMs + scores, features
• Apply supervised machine learning to results
  • Use multiple match metrics
• Combine/refine using multiple search engines
  • Agreement suggests correctness
• Use empirical significance estimates
  • “Decoy” databases (FDR)
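One common way to turn decoy hits into an empirical FDR estimate is sketched below. It assumes separate target and decoy searches of comparable size, that a higher score is better, and that FDR at a threshold is estimated as decoy hits divided by target hits at or above it; the function names are illustrative, and other estimators exist.

```python
def decoy_fdr_curve(psms):
    """Walk down the score-ranked PSM list and estimate FDR at each position.

    psms: list of (score, is_decoy) pairs, higher score = better match.
    Returns a list of (score, estimated_fdr) in descending score order.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    curve = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Assumed estimator: decoy count / target count, capped at 1.0.
        fdr = decoys / targets if targets else 1.0
        curve.append((score, min(fdr, 1.0)))
    return curve


def score_threshold(psms, max_fdr):
    """Lowest score whose estimated FDR is still <= max_fdr (None if none qualifies)."""
    best = None
    for score, fdr in decoy_fdr_curve(psms):
        if fdr <= max_fdr:
            best = score
    return best
```

This sketch ignores tied scores and does not enforce a monotone FDR curve; both are among the subtleties that make empirical estimates easy to get wrong.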

Page 11:

Machine Learning

• Use of multiple metrics of PSM quality:
  • Precursor delta, trypsin digest features, etc.
• Requires "training" with examples
  • Different examples will change the result
  • Generalization is always the question
• Scores can be hard to "understand"
  • Difficult to establish statistical significance
• Peptide Prophet's discriminant function
  • Weighted linear combination of features
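A weighted linear combination of features can be sketched as follows. The feature names, weights, and bias here are hypothetical placeholders chosen for illustration; they are not Peptide Prophet's actual coefficients or feature set.

```python
def discriminant_score(features, weights, bias=0.0):
    """Weighted linear combination: bias + sum of weight * feature value.

    features and weights are dicts keyed by feature name; every name in
    weights must be present in features.
    """
    return bias + sum(weights[name] * features[name] for name in weights)


# Hypothetical weights, as if learned from training examples.
example_weights = {
    "engine_score": 0.5,
    "precursor_delta": -2.0,
    "missed_cleavages": -0.25,
}

# Hypothetical feature values for a single PSM.
example_psm = {
    "engine_score": 4.0,
    "precursor_delta": 0.1,
    "missed_cleavages": 1.0,
}

example_score = discriminant_score(example_psm, example_weights, bias=1.0)
```

Retraining on different examples changes the weights and therefore the score, which is why generalization is the open question.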

Page 12:

Combine / Merge Results

• Threshold peptide-spectrum matches from each of two search engines
  • PSMs agree → boost specificity
  • PSMs from one → boost sensitivity
  • PSMs disagree → ?????
• Sometimes agreement is "lost" due to threshold...
  • How much should agreement increase our confidence?
• Scores easy to "understand"
  • Difficult to establish statistical significance
• How to generalize to more engines?
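The threshold-then-merge idea can be sketched for two engines as below. The data layout (one best peptide and score per spectrum, higher score is better) and the category names are assumptions made for the example.

```python
def combine_two_engines(results_a, results_b, threshold_a, threshold_b):
    """Threshold each engine's PSMs, then classify every surviving spectrum.

    results_x: dict mapping spectrum id -> (peptide, score), higher = better.
    Returns dict mapping spectrum id -> (category, peptide_a, peptide_b),
    where category is "agree", "disagree", or "single".
    """
    passed_a = {s: pep for s, (pep, score) in results_a.items() if score >= threshold_a}
    passed_b = {s: pep for s, (pep, score) in results_b.items() if score >= threshold_b}

    combined = {}
    for spectrum in passed_a.keys() | passed_b.keys():
        pep_a = passed_a.get(spectrum)
        pep_b = passed_b.get(spectrum)
        if pep_a is not None and pep_b is not None:
            category = "agree" if pep_a == pep_b else "disagree"
        else:
            # Only one engine passed its threshold for this spectrum.
            category = "single"
        combined[spectrum] = (category, pep_a, pep_b)
    return combined
```

Note that a spectrum where both engines report the same peptide, but one score falls just under its threshold, ends up as "single": that is the "lost" agreement the slide mentions.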

Page 13:

Consensus and Meta-Search

• Multiple witnesses increase confidence
  • As long as they are independent
  • Example: Getting the story straight
• Independent "random" hits unlikely to agree
  • Agreement is indication of biased sampling
  • Example: loaded dice
• Meta-search is relatively easy
  • Merging and re-ranking is hard
  • Example: Booking a flight to Denver!
• Scores and E-values are not comparable
  • How to choose the best answer?
  • Example: Best E-value favors Tandem!
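Why independent "random" hits rarely agree can be made concrete with an idealized calculation. Assume, purely for illustration, that each of $k$ engines returning an incorrect answer picks uniformly and independently among $N$ candidate peptides for a spectrum; the symbols $k$ and $N$ are introduced here and are not from the slides.

```latex
P(\text{all } k \text{ engines agree by chance})
  = \sum_{i=1}^{N} \left(\frac{1}{N}\right)^{k}
  = N^{\,1-k},
\qquad
\text{e.g. } k = 2,\ N = 1000 \ \Rightarrow\ P = \frac{1}{1000}.
```

Under these assumptions chance agreement is rare, so observed agreement points to a non-random cause. Real search engines share scoring ideas and candidate filters, so they are not independent, and this figure is best read as an optimistic bound.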

Page 14:

Searching for Consensus

Search engine quirks can destroy consensus

• Initial methionine loss as tryptic peptide

• Charge state enumeration or guessing

• X!Tandem's refinement mode

• Pyro-Gln, Pyro-Glu modifications

• Difficulty tracking spectrum identifiers

• Precursor mass tolerance (Da vs ppm)

Decoy searches must be identical!
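The Da versus ppm mismatch is easy to see numerically: a ppm tolerance scales with precursor mass, while a Da tolerance does not. The helper names below are illustrative, and the sketch takes the ppm window relative to the theoretical mass, which is one common convention.

```python
def ppm_to_da(precursor_mass, tolerance_ppm):
    """Convert a ppm tolerance to an absolute tolerance in Da at a given mass."""
    return precursor_mass * tolerance_ppm / 1e6


def within_tolerance(observed, theoretical, tolerance, unit="Da"):
    """True if the observed mass is within the tolerance of the theoretical mass.

    unit is "Da" (absolute) or "ppm" (relative to the theoretical mass).
    """
    if unit == "ppm":
        tolerance = ppm_to_da(theoretical, tolerance)
    elif unit != "Da":
        raise ValueError("unit must be 'Da' or 'ppm'")
    return abs(observed - theoretical) <= tolerance
```

For example, 10 ppm corresponds to 0.01 Da at 1000 Da but 0.02 Da at 2000 Da, so two engines configured with "equivalent" tolerances in different units will consider different candidate peptides.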

Page 15:

Configuring for Consensus

Search engine configuration can be difficult:

• Correct spectral format

• Search parameter files and command-line

• Pre-processed sequence databases

• Tracking spectrum identifiers

• Extracting peptide identifications, especially modifications and protein identifiers

Page 16:

Peptide Identification Meta-Search

• Simple unified search interface for:
  • Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT
• Automatic decoy searches
• Automatic spectrum file "chunking"
• Automatic scheduling
  • Serial, Multi-Processor, Cluster, Grid
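Chunking and multi-worker scheduling can be sketched with the Python standard library. This is only an illustration of the idea, not the meta-search system's implementation: `search_chunk` is a hypothetical callable standing in for one search engine run on one chunk, and a thread pool stands in for the real scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_spectra(spectra, chunk_size):
    """Split a list of spectra into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [spectra[i:i + chunk_size] for i in range(0, len(spectra), chunk_size)]


def search_in_chunks(spectra, chunk_size, search_chunk, max_workers=4):
    """Run search_chunk on every chunk in parallel and concatenate the results.

    search_chunk: hypothetical callable taking a list of spectra and
    returning a list of results, one per spectrum.
    """
    chunks = chunk_spectra(spectra, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the input chunks.
        per_chunk = list(pool.map(search_chunk, chunks))
    return [result for chunk_result in per_chunk for result in chunk_result]
```

Because results come back in chunk order, spectrum identifiers can be tracked by position, which sidesteps one of the bookkeeping problems listed earlier.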

Page 17:

Peptide Identification Grid-Enabled Meta-Search

[Diagram labels]

• NSF TeraGrid: 1000+ CPUs
• UMIACS: 250+ CPUs
• Edwards Lab: Scheduler & 80+ CPUs
• Secure communication
• Heterogeneous compute resources
• Single, simple search request
• Scales easily to 250+ simultaneous searches
• Engine lists shown in the diagram:
  • X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core)
  • X!Tandem, KScore, OMSSA
  • X!Tandem, KScore, OMSSA

Page 18:

PepArML

• Peptide identification arbiter by machine learning

• Unifies these ideas within a model-free, combining machine learning framework

• Unsupervised training procedure

Page 19:

PepArML Overview

[Diagram labels: X!Tandem, Mascot, OMSSA, Other; Feature extraction; PepArML]

Page 20:

Dataset Construction

[Table: one row per spectrum-peptide pair $(S_i, P_j)$, with columns for X!Tandem, Mascot, and OMSSA and a T/F label]

$(S_1, P_1)$: T
$(S_1, P_2)$: F
$(S_2, P_1)$: T
……
$(S_n, P_m)$: T
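A dataset of this shape can be assembled by joining per-engine results on the (spectrum, peptide) pair. The sketch below is an assumption about the layout, not PepArML's actual feature extraction: each engine contributes a single score, a missing value is recorded as `None`, and the T/F label comes from a caller-supplied set of known-correct pairs.

```python
def build_dataset(engine_results, known_true):
    """Join per-engine PSM scores into one row per (spectrum, peptide) pair.

    engine_results: dict mapping engine name -> {(spectrum, peptide): score}.
    known_true: set of (spectrum, peptide) pairs treated as correct.
    Returns (engines, rows), where each row is
    ((spectrum, peptide), [score per engine or None], label).
    """
    engines = sorted(engine_results)
    all_psms = sorted({psm for results in engine_results.values() for psm in results})
    rows = []
    for psm in all_psms:
        # None marks an engine that did not report this PSM.
        features = [engine_results[engine].get(psm) for engine in engines]
        rows.append((psm, features, psm in known_true))
    return engines, rows
```

A real feature table would carry several columns per engine (score, E-value, rank, and so on); one column per engine keeps the join logic visible.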

Page 21:

Voting Heuristic Combiner

• Choose PSM with most votes
• Break ties using FDR
  • Select PSM with min. FDR of tied votes
• How to apply this to a decoy database?
• Lots of possibilities – all imperfect
  • Now using: 100*#votes - min. decoy hits
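The voting rule and the combined score can be sketched as follows. The input layout is an assumption, and "min. decoy hits" is read here as the smallest decoy-hit count among the engines voting for the winning PSM; the slide does not spell this out, so treat that reading as a guess.

```python
from collections import defaultdict


def vote(engine_calls):
    """Pick the peptide with the most votes for one spectrum.

    engine_calls: list of (engine, peptide, fdr), one entry per engine.
    Ties in vote count are broken by the smallest FDR among the tied peptides.
    Returns (peptide, votes, min_fdr).
    """
    votes = defaultdict(int)
    min_fdr = {}
    for _engine, peptide, fdr in engine_calls:
        votes[peptide] += 1
        min_fdr[peptide] = min(fdr, min_fdr.get(peptide, fdr))
    # Most votes first, then lowest FDR.
    best = min(votes, key=lambda pep: (-votes[pep], min_fdr[pep]))
    return best, votes[best], min_fdr[best]


def combined_score(num_votes, min_decoy_hits):
    """Score from the slide: 100 * #votes - min. decoy hits."""
    return 100 * num_votes - min_decoy_hits
```

With this score the vote count dominates as long as decoy-hit counts stay below 100, and decoy hits only separate PSMs with the same number of votes.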

Page 22:

Supervised Learning

Page 23:

Feature Evaluation

Page 24:

Application to Real Data

• How well do these models generalize?
• Different instruments
  • Spectral characteristics change scores
• Search parameters
  • Different parameters change score values
• Supervised learning requires
  • (Synthetic) experimental data from every instrument
  • Search results from available search engines
  • Training/models for all parameters × search engine sets × instruments
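How quickly "parameters × search engine sets × instruments" grows can be shown with a short calculation. Counting every non-empty subset of the available engines as a possible "search engine set" is an assumption made for this example, and the numbers of parameter sets and instruments used below are hypothetical.

```python
def models_required(num_parameter_sets, num_engines, num_instruments):
    """Count models as parameter sets x non-empty engine subsets x instruments."""
    engine_sets = 2 ** num_engines - 1  # every non-empty subset of engines
    return num_parameter_sets * engine_sets * num_instruments


# Seven engines (as in the meta-search list), with a hypothetical
# 3 parameter sets and 4 instruments.
example_count = models_required(3, 7, 4)
```

Under these assumptions the example needs 3 × 127 × 4 = 1524 separately trained models, each requiring its own labeled data.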

Page 25:

Model Generalization

Page 26:

Unsupervised Learning

Page 27:

Unsupervised Learning Performance

Page 28:

Unsupervised Learning Convergence

Page 29:

Peptide Atlas A8_IP – LTQ

Page 30:

OMICS 17 Protein Mix – LCQ

Page 31:

Feature Selection (InfoGain)

Page 32:

Conclusions

• Combining search results from multiple engines can be very powerful
  • Boost both sensitivity and specificity
  • Running multiple search engines is hard
• Statistical significance is hard
  • Use empirical FDR estimates...but be careful...lots of subtleties
• Consensus is powerful, but fragile
  • Search engine quirks can destroy it
  • "Witnesses" are not independent