www.bioalgorithms.info An Introduction to Bioinformatics Algorithms Protein Sequencing and Identification by Mass Spectrometry

May 31, 2020


dariahiddleston
Page 1:

Protein Sequencing and Identification by Mass Spectrometry

Page 2:

• Tandem Mass Spectrometry

• De Novo Peptide Sequencing

• Spectrum Graph

• Protein Identification via Database Search

• Identifying Post Translationally Modified Peptides

• Spectral Convolution

• Spectral Alignment

Page 3:

Page 4:

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

R(i-1) R(i) R(i+1)

AA residue i-1, AA residue i, AA residue i+1

N-terminus C-terminus

Page 5:

• Peptides tend to fragment along the backbone.

• Fragments can also lose neutral chemical groups like NH3 and H2O.

H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

H+

Prefix Fragment Suffix Fragment

Collision Induced Dissociation

Page 6:

• Proteases, e.g. trypsin, break protein into peptides.

• A Tandem Mass Spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.

• The mass spectrometer accelerates the fragment ions; heavier ions accelerate more slowly than lighter ones.

• The mass spectrometer measures the mass/charge ratio of an ion.

Page 7:

Page 8:

Peptide

Mass (D) 57 + 97 + 147 + 114 = 415

Peptide

Mass (D) 57 + 97 + 147 + 114 – 18 = 397

without

Pages 9-12:

[Figure, repeated on pages 9 to 12: fragment ion masses 415, 486, 301, 154, 57, 71, 185, 332, 429]

Reconstruct the peptide from the set of masses of fragment ions (mass spectrum).
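The relationship between a peptide and its fragment masses can be sketched in a few lines. The sketch below assumes integer residue masses and an illustrative peptide GPFNA; that peptide is not named in the slides, but its prefix and suffix masses happen to reproduce the nine masses listed above, and its first four residues give the sum 57 + 97 + 147 + 114 = 415. The function names are hypothetical.

```python
# Integer residue masses (Da) for the residues used in this illustration.
RESIDUE_MASS = {"G": 57, "P": 97, "F": 147, "N": 114, "A": 71}


def peptide_mass(peptide):
    """Sum of the residue masses (terminal groups are ignored here)."""
    return sum(RESIDUE_MASS[aa] for aa in peptide)


def prefix_masses(peptide):
    """Masses of all N-terminal (prefix) fragments, shortest first."""
    masses, total = [], 0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def suffix_masses(peptide):
    """Masses of all C-terminal (suffix) fragments, shortest first."""
    return prefix_masses(peptide[::-1])


def fragment_masses(peptide):
    """Sorted union of prefix and suffix fragment masses."""
    return sorted(set(prefix_masses(peptide)) | set(suffix_masses(peptide)))


print(peptide_mass("GPFN"))      # 415
print(fragment_masses("GPFNA"))  # [57, 71, 154, 185, 301, 332, 415, 429, 486]
```

De novo sequencing is the inverse problem: given only the list of masses, recover the peptide.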

Page 13:

[Figure: backbone of a peptide with four residues R1 to R4 (N-terminal NH3+, C-terminal COOH), showing the cleavage sites that produce the a2, a3, b2, b3, y1, y2 and y3 ions, together with the neutral-loss ions b2-H2O, y3-H2O, b3-NH3 and y2-NH3]

Page 14:

[Figure: spectrum of peptide G V D L K on a mass axis starting at 0; peak spacings are labeled 57 Da = ‘G’ and 99 Da = ‘V’; the prefix ladder reads G V D L K and the suffix ladder reads K L D V G]

• The peaks in the mass spectrum:
  • Prefix and suffix fragments.
  • Fragments with neutral losses (-H2O, -NH3).
  • Noise and missing peaks.

Page 15:

MS/MS Peptide Identification:

[Figure: peptide G V D L K and its spectrum; axes: mass (starting at 0) vs. intensity]

Page 16:

Page 17:

[Figure: protein MPSERGTDIMRPAKID...... → peptides MPSER……, GTDIMRPAKID, …… → HPLC → to MS/MS]

Page 18:

Matrix-Assisted Laser Desorption/Ionization (MALDI)

From lectures by Vineet Bafna (UCSD)

Page 19:

[Figure, three panels and an instrument schematic:
LC: base peak chromatogram, RT 0.01 - 80.02, Full ms [300.00 - 2000.00]; axes: Time (min) vs. Relative Abundance.
MS: scan 1707, RT 54.44, Full ms [300.00 - 2000.00]; axes: m/z vs. Relative Abundance; labeled peaks include 638.0, 638.9 and 801.0.
MS/MS: scan 1708, RT 54.47, Full ms2 of 638.00 [165.00 - 1925.00]; axes: m/z vs. Relative Abundance; labeled peaks include 588.1, 687.3, 850.3 and 949.4.
Schematic: Ion Source → MS-1 → collision cell → MS-2]

Page 20:

[Figure: the MS/MS instrument produces the spectrum of scan 1708 (Full ms2 of 638.00; axes: m/z vs. Relative Abundance), which is interpreted to give a sequence]

Database search
• Sequest

De novo interpretation
• Sherenga

Page 21:

• Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal peptides

• Spectrum consists of different ion types because peptides can be broken in several places.

• Chemical noise often complicates the spectrum.

• Represented in 2-D: mass/charge axis vs. intensity axis

Page 22:

[Figure: the MS/MS spectrum of scan 1708 (Full ms2 of 638.00; axes: m/z vs. Relative Abundance), with peaks annotated by amino acid labels (W, R, A, C, V, G, E, K, D, L, P, T), interpreted in two ways]

De Novo

AVGELTK

Database of all peptides = 20^n

AAAAAAAA, AAAAAAAC, AAAAAAAD, AAAAAAAE, AAAAAAAG, AAAAAAAF, AAAAAAAH, AAAAAAI,

AVGELTI, AVGELTK, AVGELTL, AVGELTM,

YYYYYYYS, YYYYYYYT, YYYYYYYV, YYYYYYYY

Database Search

Database of known peptides

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE,

ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC,

GVFGSVLRA, EKLNKAATYIN..

Mass, Score

Page 23:

De Novo vs. Database Search: A Paradox

• The database of all peptides is huge: ≈ O(20^n).

• The database of all known peptides is much smaller: ≈ O(10^8).

• However, de novo algorithms can be much faster, even though their search space is much larger!

• A database search scans every peptide in the database of known peptides to find the best one.

• De novo sequencing avoids scanning the database of all peptides by modeling the problem as a graph search.

Page 24:

[Figure: MS/MS spectrum of scan 1708 (Full ms2 of 638.00; axes: m/z vs. Relative Abundance) mapped to a sequence]

Page 25:

Page 26:

(cont’d)

Page 27:

(cont’d)

Page 28:

• How to create vertices (from masses)

• How to create edges (from mass differences)

• How to score paths

• How to find best path

Page 29:

[Figure: b ions; axis: Mass/Charge (M/Z)]

Page 30:

[Figure: a ions; axis: Mass/Charge (M/Z)]

Page 31:

[Figure; axis: Mass/Charge (M/Z)]

a is an ion type shift in b

Page 32:

[Figure: y ions; axis: Mass/Charge (M/Z)]

Page 33:

[Figure; axis: Mass/Charge (M/Z)]

Page 34:

[Figure; axis: Mass/Charge (M/Z)]

Page 35:

[Figure: noise; axis: Mass/Charge (M/Z)]

Page 36:

[Figure; axis: Mass/Charge (M/Z)]

Page 37:

Some Mass Differences between Peaks Correspond to Amino Acids

[Figure: spectrum in which gaps between peaks are labeled with the letters s, e, q, u, e, n, c, e]

Page 38:

• Some masses correspond to fragment ions, others are just random noise.

• Knowing ion types ∆={δ1, δ2,…, δk} lets us distinguish fragment ions from noise.

• We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.

Page 39:

• ∆={δ1, δ2,…, δk}

• Ion types {b, b-NH3, b-H2O} correspond to ∆={0, 17, 18}

*Note: In reality the δ value of ion type b is -1, but we will “hide” it for the sake of simplicity.

Page 40:

• The match between two spectra is the number of masses (peaks) they share (Shared Peak Count or SPC).

• In practice mass-spectrometrists use the weighted SPC that reflects intensities of the peaks.

• Match between experimental and theoretical spectra is defined similarly.
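A minimal sketch of the shared peak count, assuming integer masses so that two peaks match only when their masses are equal. The slides do not say how the weighted SPC uses intensities; summing the intensities of the matched experimental peaks is an assumption made here for illustration, and both function names and the sample data are hypothetical.

```python
def shared_peak_count(spectrum_a, spectrum_b):
    """Number of masses that occur in both spectra (integer masses assumed)."""
    return len(set(spectrum_a) & set(spectrum_b))


def weighted_shared_peak_count(experimental, theoretical):
    """Assumed weighting: total intensity of experimental peaks that match.

    experimental maps mass -> intensity; theoretical is a list of masses.
    """
    theoretical_masses = set(theoretical)
    return sum(
        intensity
        for mass, intensity in experimental.items()
        if mass in theoretical_masses
    )


# Theoretical masses of an illustrative peptide and a made-up experimental
# spectrum containing one noise peak (200) that matches nothing.
theoretical = [57, 71, 154, 185, 301, 332, 415, 429, 486]
experimental = {57: 10.0, 154: 4.0, 200: 7.0, 332: 2.5}

print(shared_peak_count(experimental, theoretical))           # 3
print(weighted_shared_peak_count(experimental, theoretical))  # 16.5
```

With real-valued masses the equality test would have to be replaced by a tolerance window.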

Page 41:

Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

Input:
• S: experimental spectrum
• ∆: set of possible ion types
• m: parent mass

Output:
• P: peptide with mass m whose theoretical spectrum matches the experimental spectrum S the best

Page 42:

• Masses of potential N-terminal peptides

• Vertices are generated by reverse shifts corresponding to ion types ∆={δ1, δ2,…, δk}

• Every N-terminal peptide can generate up to k ions: m-δ1, m-δ2, …, m-δk

• Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} corresponding to potential N-terminal peptides

• Vertices of the spectrum graph: {initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
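The vertex construction can be sketched as follows, using the ion-type offsets ∆={0, 17, 18} given earlier for {b, b-NH3, b-H2O}. Placing the initial vertex at mass 0 and the terminal vertex at the parent mass, and discarding shifted masses outside that range, are assumptions of this sketch; the function names are hypothetical.

```python
DELTAS = [0, 17, 18]  # offsets for ion types b, b-NH3, b-H2O


def ions_of_prefix(prefix_mass, deltas=DELTAS):
    """Masses of the ions an N-terminal peptide of mass m can produce: m - delta."""
    return [prefix_mass - d for d in deltas]


def spectrum_graph_vertices(spectrum, parent_mass, deltas=DELTAS):
    """Union of V(s) = {s + delta} over all peaks, plus the two end vertices."""
    vertices = {0, parent_mass}  # assumed initial and terminal vertices
    for s in spectrum:
        for d in deltas:
            shifted = s + d
            if 0 < shifted < parent_mass:  # assumed range filter
                vertices.add(shifted)
    return sorted(vertices)


# A prefix of mass 154 yields peaks at 154, 137 and 136. Reverse-shifting
# each peak produces several candidates, and all three peaks agree on 154.
print(ions_of_prefix(154))                               # [154, 137, 136]
print(spectrum_graph_vertices([136, 137, 154], 415))
# [0, 136, 137, 153, 154, 155, 171, 172, 415]
```

The spurious vertices (153, 155, 171, 172) are the price of not knowing which ion type produced each peak; scoring is what separates them from the true prefix mass.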

Page 43:

Shift in H2O+NH3

Shift in H2O

Page 44:

• Two vertices with mass difference corresponding to an amino acid A:

  • Connect with an edge labeled by A

• Gap edges for di- and tri-peptides
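A minimal sketch of edge creation, assuming integer residue masses (at this resolution I/L and K/Q share a mass and are given a combined label). Only single-residue edges are built; gap edges for di- and tri-peptides are left out to keep the example short. The function name is hypothetical.

```python
# Integer residue masses (Da) mapped to one-letter codes.
MASS_TO_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "I/L", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def spectrum_graph_edges(vertices):
    """Return (from_mass, to_mass, label) for every pair of vertices whose
    mass difference equals the mass of a single amino acid."""
    present = set(vertices)
    edges = []
    for u in sorted(present):
        for mass, label in MASS_TO_RESIDUE.items():
            if u + mass in present:
                edges.append((u, u + mass, label))
    return edges


for edge in spectrum_graph_edges([0, 57, 154, 301, 415]):
    print(edge)
# (0, 57, 'G')
# (57, 154, 'P')
# (154, 301, 'F')
# (301, 415, 'N')
```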

Page 45:

• Paths in the labeled graph spell out amino acid sequences.

• There are many paths; how do we find the correct one?

• We need scoring to evaluate paths.

Page 46:

• p(P,S) = probability that peptide P produces spectrum S = {s1, s2, …, sq}

• p(P,s) = the probability that peptide P generates a peak s

• Scoring = computing probabilities

• \[ p(P,S) = \prod_{s \in S} p(P,s) \]

Page 47:

• For a position t that represents ion type δj:

\[ p(P, s_t) = \begin{cases} q_j, & \text{if a peak is generated at } t \\ 1 - q_j, & \text{otherwise} \end{cases} \]

Page 48:

(cont’d)

• For a position t that is not associated with an ion type:

\[ p_R(P, s_t) = \begin{cases} q_R, & \text{if a peak is generated at } t \\ 1 - q_R, & \text{otherwise} \end{cases} \]

• qR = the probability of a noisy peak that does not correspond to any ion type

Page 49:

Finding Optimal Paths in the Spectrum Graph

• For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

\[ p(P', S) = \max_{P} \, p(P, S) \]

• Peptides = paths in the spectrum graph

• P’ = the optimal path in the spectrum graph

Page 50:

• Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}

• δi-ions of a partial peptide are produced independently with probabilities qi

Page 51:

• A peptide has all k peaks with probability

\[ \prod_{i=1}^{k} q_i \]

• and no peaks with probability

\[ \prod_{i=1}^{k} (1 - q_i) \]

• A peptide also produces “random noise” with uniform probability qR in any position.

Page 52:

Ratio Test Scoring for Partial Peptides

• Incorporates premiums for observed ions and penalties for missing ions.

• Example: for k=4, assume that for a partial peptide P’ we only see ions δ1, δ2, δ4.

The score is calculated as:

\[ \frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{(1 - q_3)}{(1 - q_R)} \cdot \frac{q_4}{q_R} \]
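The k=4 example can be evaluated numerically with a short function. The probabilities below are made-up values chosen only to illustrate the arithmetic; in practice the qi and qR would be learned from annotated spectra. The function name is hypothetical.

```python
def ratio_test_score(q, observed, q_noise):
    """Ratio-test score of one partial peptide.

    q        -- probabilities q_1..q_k of the k ion types
    observed -- booleans: True if the ion of that type was seen
    q_noise  -- probability q_R of a random noise peak
    """
    score = 1.0
    for q_i, seen in zip(q, observed):
        if seen:
            score *= q_i / q_noise               # premium for an observed ion
        else:
            score *= (1 - q_i) / (1 - q_noise)   # penalty for a missing ion
    return score


# Hypothetical probabilities; ions 1, 2 and 4 observed, ion 3 missing.
q = [0.8, 0.5, 0.4, 0.2]
observed = [True, True, False, True]
print(ratio_test_score(q, observed, q_noise=0.1))  # about 53.3
```

A score above 1 means the observed pattern is better explained by the partial peptide than by noise alone.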

Page 53:

• T: set of all positions.

• \( T_i = \{ t_{\delta_1}, t_{\delta_2}, \ldots, t_{\delta_k} \} \): set of positions that represent ions of partial peptide Pi.

• A peak at position \( t_{\delta_j} \) is generated with probability qj.

• \( R = T - \bigcup_i T_i \): set of positions that are not associated with any partial peptides (noise).

Page 54:

• For a position \( t_{\delta_j} \in T_i \), the probability p(t, P, S) that peptide P produces a peak at position t is:

\[ p(t_{\delta_j}, P, S) = \begin{cases} q_j, & \text{if a peak is generated at position } t \\ 1 - q_j, & \text{otherwise} \end{cases} \]

• Similarly, for t ∈ R, the probability that P produces a random noise peak at t is:

\[ p_R(t) = \begin{cases} q_R, & \text{if a peak is generated at position } t \\ 1 - q_R, & \text{otherwise} \end{cases} \]

Page 55:

• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

\[ \frac{p(P,S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})} \]

Page 56:

How can we find the amino acid sequence that best explains the spectrum?

a) Explains the largest number of peaks
b) Explains the most intensity

• Exhaustive enumeration?
1. Generate every possible sequence with the same peptide mass
2. Match each sequence to the spectrum
3. Choose the sequence that explains the most intensity in the spectrum

[Figure: candidate sequences F S N A M S D I and V SGQ L I D]

Takes too long!

Page 57:

Computer programs for de novo interpretation of MS/MS spectra date as far back as 1966, when Biemann et al. proposed a prefix extension algorithm:
• Tries every possible prefix extension
• Eliminates solutions with missing peaks
• Outputs every peptide with a matching parent mass
• ALL prefix peaks must be present in the spectrum

In an attempt to include interpretations with missing peaks, Sakurai et al. (1984) proposed:
• Exhaustive search over the space of all possible permutations of amino acid multisets where the total mass equals the parent mass of the MS/MS spectrum
• Score peptide/spectrum matches by counting prefix and suffix mass matches
• Naturally more sensitive but also very slow

Page 58:

The second half of the eighties saw a few better designed approaches to this problem, based on the same type of algorithm:
• Prefix extension, one amino acid at a time.
• Tolerate missing peaks.
• Include both prefix and suffix peaks in the score.
• User-specified maximum number of candidates in memory at any point of the execution (sub-optimal).

Ishikawa and Niwa’86, Siegel and Baumann’88, Johnson and Biemann’89, Zidarov et al.’90

Page 59:

In 1990 Bartels introduced a graph representation of an MS/MS spectrum:
• Every peak in the spectrum defines a vertex.
• Vertices are connected by an edge if the peak mass difference is an amino acid mass.
• Best peptide is defined as the best path between the two endpoint vertices: v0 to vM(S).
• No detailed algorithm was given for finding the best peptide; an interactive exploration tool was made available.

[Figure: spectrum graph with endpoint vertices v0 and vM(S)]

Page 60:

What is the DP recursion?

Score(i) = intensity(i) + max( Score(j) ), for all j with mass(i)-mass(j) ∈ Amino acid masses

Recovered de novo sequence? ESESE
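The recursion can be sketched as a dynamic program over peaks sorted by mass. Everything in the example data is an assumption made for illustration: integer masses, a parent mass of 561, and invented intensities. The function name `best_path` is hypothetical.

```python
# Integer residue masses (Da); I/L and K/Q share a mass at this resolution.
MASS_TO_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "I/L", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def best_path(peaks, parent_mass):
    """Score(i) = intensity(i) + max Score(j) over j one residue mass below i.

    peaks maps mass -> intensity. Returns (score, sequence) or None if the
    parent mass cannot be reached from mass 0.
    """
    intensity = dict(peaks)
    intensity.setdefault(0, 0)            # start vertex
    intensity.setdefault(parent_mass, 0)  # end vertex
    score = {0: 0}
    back = {}
    for m in sorted(intensity):
        if m == 0:
            continue
        candidates = [
            (score[m - aa], m - aa, label)
            for aa, label in MASS_TO_RESIDUE.items()
            if (m - aa) in score
        ]
        if candidates:
            best_score, prev, label = max(candidates)
            score[m] = intensity[m] + best_score
            back[m] = (prev, label)
    if parent_mass not in score:
        return None
    labels = []
    m = parent_mass
    while m != 0:
        m, label = back[m]
        labels.append(label)
    return score[parent_mass], "".join(reversed(labels))


# Hypothetical spectrum with invented intensities.
peaks = {87: 1, 129: 3, 216: 4, 244: 2, 317: 1, 345: 3, 432: 4, 474: 2}
print(best_path(peaks, 561))  # (14, 'ESESE')
```

In this made-up spectrum the winning path visits 129, 216, 345 and 432. The pairs 129/432 and 216/345 each sum to the parent mass, so the path may be counting the same cleavage twice, once as a prefix peak and once as a suffix peak.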

Page 61:

• Note that spectral graph approaches use local peak mass tolerances, resulting in cumulative peptide mass errors. Could be off by as much as 6 Da in a 12-aa peptide (0.5 Da per aa)!

• Alternative approach to sequencing
  • Represent the spectrum as an array of 0.1 Da bins.
  • The intensity in a bin B is the sum of the peak intensities for all peaks with rounded masses equal to B.
  • What is the recursion?
  • How accurate is the parent mass of the recovered peptide?
  • How can we generate all suboptimal solutions with a score higher than a chosen threshold?

Page 62:

DP recursion:

Score(i) = intensity(i) + max( Score(j) ), for all j with mass(i)-mass(j) ∈ Amino acid masses

Recovered de novo sequence? ESESE

Why was the correct peptide missed?

Page 63:

The exclusion of symmetric peaks in a maximal scoring path through a spectrum graph was first proposed by Dančik et al. in 1999:

• Peaks in the spectrum are called forbidden pairs if their masses add up to the parent mass; either vertex can be used in the output path but not both.
• A path is anti-symmetric if it uses at most one vertex from every forbidden pair.
• Objective function becomes: find maximal scoring anti-symmetric path.
• NP-hard in the general case

[Figure: spectrum with peaks at 129, 244, 345, 474 and 87, 216, 317, 432]

Solution:
1. Jump from the mass closest to the start/end of the spectrum
2. Avoid reusing the pairing mass (highlighted in red)

Note that this ordering is always possible (Chen’01, Bafna and Edwards’03)

What is the recursion?
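Forbidden pairs and the anti-symmetry condition are easy to check directly. The sketch below uses the eight peak masses from the figure and assumes a parent mass of 561, a value not stated on the slide but consistent with the peaks (each prefix/suffix pair sums to it). Function names are hypothetical.

```python
def forbidden_pairs(peaks, parent_mass):
    """Pairs of peak masses that add up to the parent mass."""
    present = set(peaks)
    return sorted(
        (s, parent_mass - s)
        for s in present
        if s < parent_mass - s and (parent_mass - s) in present
    )


def is_antisymmetric(path, parent_mass):
    """True if the path uses at most one vertex from every forbidden pair."""
    used = set(path)
    return not any(
        (parent_mass - s) in used and (parent_mass - s) != s for s in used
    )


peaks = [129, 244, 345, 474, 87, 216, 317, 432]
PARENT_MASS = 561  # assumed: 129 + 432 = 244 + 317 = 345 + 216 = 474 + 87

print(forbidden_pairs(peaks, PARENT_MASS))
# [(87, 474), (129, 432), (216, 345), (244, 317)]

# A path that mixes both members of two pairs is rejected ...
print(is_antisymmetric([129, 216, 345, 432], PARENT_MASS))  # False
# ... while a path that stays on one side of every pair is accepted.
print(is_antisymmetric([129, 244, 345, 474], PARENT_MASS))  # True
```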

Page 64: Protein Sequencing and Identification by Mass SpectrometryAn Introduction to Bioinformatics Algorithms • Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Chen et al.’01 provided a dynamic programming recursion to find a maximal scoring anti-symmetric path:

• A peak $s_i$ precedes a peak $s_j$ if it is closer to one of the ends of the spectrum: $\min(s_i, m(S) - s_i) < \min(s_j, m(S) - s_j)$.

• Let $Sc[i,j]$ be the score of the maximal scoring anti-symmetric path from $v_0$ to $v_i$ and from $v_j$ to $v_{m(S)}$, including $v_i$ and $v_j$ (all entries initialized to $-\infty$).

• Then, from each $Sc[i,j]$:

  • If $s_j$ precedes $s_k$, $s_k - s_i$ is an amino acid mass, and $v_k$ and $v_j$ are not a forbidden pair: $Sc[k,j] = \max(Sc[k,j],\ score(k) + Sc[i,j])$ (prefix extension).

  • If $s_i$ precedes $s_k$, $s_j - s_k$ is an amino acid mass, and $v_i$ and $v_k$ are not a forbidden pair: $Sc[i,k] = \max(Sc[i,k],\ score(k) + Sc[i,j])$ (suffix extension).

  • If $s_j - s_i$ is an amino acid mass and $v_i$ and $v_j$ are not a forbidden pair: mark as a possible solution (prefix/suffix connection).
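A minimal sketch of this outside-in recursion follows. It is not the authors' implementation: it uses a slightly stricter rule (a new peak must lie deeper than both current endpoints, which makes the forbidden-pair check implicit), integer toy masses, and a simple state dictionary, so it runs in roughly cubic rather than quadratic time. All names and values are assumptions for illustration.

```python
def best_antisymmetric_score(peaks, parent_mass, aa_masses, score=None):
    """Best score of an anti-symmetric path from mass 0 to parent_mass.

    peaks: interior peak masses (integers, assumed distinct).
    aa_masses: set of allowed edge lengths (amino acid masses).
    score: optional dict mass -> vertex score (default 1 per vertex).
    """
    masses = [0] + sorted(peaks) + [parent_mass]
    last = len(masses) - 1
    score = score or {}
    # Distance of each vertex to the nearest end of the spectrum.
    depth = [min(s, parent_mass - s) for s in masses]
    neg_inf = float("-inf")
    # Sc[(i, j)]: best score of a prefix path 0..i plus a suffix path j..end.
    Sc = {(0, last): 0}
    for k in sorted(range(1, last), key=lambda v: depth[v]):
        gain = score.get(masses[k], 1)
        for (i, j), current in list(Sc.items()):
            # Only extend with a peak strictly deeper than both endpoints;
            # its symmetric partner has the same depth, so it is unused.
            if max(depth[i], depth[j]) >= depth[k]:
                continue
            if masses[k] - masses[i] in aa_masses and masses[k] < masses[j]:
                Sc[(k, j)] = max(Sc.get((k, j), neg_inf), current + gain)  # prefix extension
            if masses[j] - masses[k] in aa_masses and masses[k] > masses[i]:
                Sc[(i, k)] = max(Sc.get((i, k), neg_inf), current + gain)  # suffix extension
    # Prefix/suffix connection: the two partial paths must be joined by an edge.
    closing = [s for (i, j), s in Sc.items() if masses[j] - masses[i] in aa_masses]
    return max(closing, default=neg_inf)


# Toy example: 57 and 171 form a forbidden pair for parent mass 228.
toy = best_antisymmetric_score([57, 114, 171], 228, {57, 71, 87, 114})
```

In the toy example the unrestricted path 0, 57, 114, 171, 228 visits three peaks but uses both 57 and 171, so the best anti-symmetric path visits only two.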


The anti-symmetric dynamic programming solutions are:

• Correct: no symmetric peak can be reused and every anti-symmetric path is considered.

• Optimal: a maximal scoring anti-symmetric path is constructed.

• “Efficient”: the running time is O(n^2), where n is the number of peaks.

Other algorithms have also been proposed:

• Ma’03 and Frank’05: same algorithm, different scoring.

• Bafna and Edwards’03: same principle, extending the concept of forbidden pairs to avoid peak reuse between any pair of ion types.


[Figure: MS/MS spectrum (relative abundance vs. m/z) with its spectrum graph, interpreted in two ways. De novo: the peptide AVGELTK is read from the spectrum graph. Database search: the spectrum is matched against a database of known peptides, where AVGELTK is found.]

Database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..


Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• ∆: set of possible ion types

• m: parent mass

Output:

• A peptide of mass m whose theoretical spectrum best matches the experimental spectrum S


Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• database of peptides

• ∆: set of possible ion types

• m: parent mass

Output:

• A peptide of mass m from the database whose theoretical spectrum best matches the experimental spectrum S


MS/MS Database Search

Database search in mass-spectrometry has been very successful in identification of already known proteins.

An experimental spectrum can be compared with the theoretical spectra of database peptides to find the best fit.

SEQUEST (Yates et al., 1995)

But reliable identification of modified peptides is a much more difficult problem.


• The proteome of the cell is changing

• Various extracellular and other signals activate pathways of proteins.

• A key mechanism of protein activation is post-translational modification (PTM)

• These pathways may lead to other genes being switched on or off

• Mass spectrometry is key to probing the proteome and detecting PTMs


Proteins are involved in cellular signaling and metabolic regulation.

They are subject to a large number of biological modifications.

Almost all protein sequences are post-translationally modified and 200 types of modifications of amino acid residues are known.


Post-translational modifications increase the number of “letters” in the amino acid alphabet and lead to a combinatorial explosion in both database search and de novo approaches.


Yates et al.,1995: an exhaustive search in a virtual database of all modified peptides.

Exhaustive search leads to a large combinatorial problem, even for a small set of modification types.

Problem (Yates et al.,1995). Extend the virtual database approach to a large set of modifications.


• YFDSTDYNMAK

• 2^5 = 32 possibilities, with 2 types of modifications!

Phosphorylation?

Oxidation?

• For each peptide, generate all modifications.

• Score each modification.
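The count of 32 can be reproduced by enumerating the virtual database for this peptide. The sketch below assumes that phosphorylation may occur on S, T and Y and oxidation on M, with the usual monoisotopic mass shifts; the function and table names are made up for illustration.

```python
from itertools import product

# Assumed modification rules: residue -> (name, mass shift in Da).
MODS = {
    "S": ("phosphorylation", 79.966),
    "T": ("phosphorylation", 79.966),
    "Y": ("phosphorylation", 79.966),
    "M": ("oxidation", 15.995),
}


def modified_variants(peptide):
    """Yield (modified positions, total mass shift) for every on/off choice."""
    sites = [i for i, aa in enumerate(peptide) if aa in MODS]
    for choice in product((False, True), repeat=len(sites)):
        chosen = tuple(i for i, on in zip(sites, choice) if on)
        shift = sum(MODS[peptide[i]][1] for i in chosen)
        yield chosen, round(shift, 3)


variants = list(modified_variants("YFDSTDYNMAK"))
```

YFDSTDYNMAK has five modifiable residues under these rules (Y, S, T, Y, M), so the enumeration doubles with every additional site, which is the combinatorial growth the slide warns about.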


Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• database of peptides

• ∆: set of possible ion types

• m: parent mass

Output:

• A peptide of mass m from the database whose theoretical spectrum best matches the experimental spectrum S


Goal: Find a modified peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• database of peptides

• ∆: set of possible ion types

• m: parent mass

• Parameter k (# of mutations/modifications)

Output:

• A peptide of mass m that is at most k mutations/modifications apart from a database peptide and whose theoretical spectrum best matches the experimental spectrum S


Database Search: Sequence Analysis vs. MS/MS Analysis

Sequence analysis:

similar peptides (that are a few mutations apart) have similar sequences

MS/MS analysis:

similar peptides (that are a few mutations apart) have dissimilar spectra


Peptide Identification Problem: Challenge

Very similar peptides may have very different spectra!

Goal : Define a notion of spectral similarity that correlates well with the sequence similarity.

If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.


Shared peaks count (SPC): intuitive measure of spectral similarity.

Problem : SPC diminishes very quickly as the number of mutations increases.

Only a small portion of correlations between the spectra of mutated peptides is captured by SPC.


S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632} (no mutations, SPC = 10)

S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682} (1 mutation, SPC = 5)

S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583} (2 mutations, SPC = 2)
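The three spectra and their SPC values can be reproduced with a short sketch. It assumes integer residue masses and a simple ion model in which each prefix mass is offset by +1 and each suffix mass by +19; these offsets are inferred from the listed numbers and are not stated on the slide.

```python
# Integer residue masses for the amino acids used in the example.
MASS = {"P": 97, "R": 156, "T": 101, "E": 129, "I": 113, "N": 114, "Y": 163, "G": 57}


def theoretical_spectrum(peptide, b_offset=1, y_offset=19):
    """Set of N-terminal (prefix) and C-terminal (suffix) fragment masses."""
    residues = [MASS[aa] for aa in peptide]
    total = sum(residues)
    peaks = set()
    prefix = 0
    for r in residues[:-1]:
        prefix += r
        peaks.add(prefix + b_offset)          # N-terminal fragment
        peaks.add(total - prefix + y_offset)  # complementary C-terminal fragment
    return peaks


def shared_peaks_count(s1, s2):
    """SPC: number of masses present in both spectra."""
    return len(s1 & s2)


reference = theoretical_spectrum("PRTEIN")
spc = [shared_peaks_count(reference, theoretical_spectrum(p))
       for p in ("PRTEIN", "PRTEYN", "PGTEYN")]
```

A single substitution shifts every fragment that contains the changed residue, which is why half of the shared peaks disappear after one mutation.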


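Spectral convolution:

$$S_2 \ominus S_1 = \{\, s_2 - s_1 : s_1 \in S_1,\ s_2 \in S_2 \,\}$$

$(S_2 \ominus S_1)(x)$: number of pairs $s_1 \in S_1$, $s_2 \in S_2$ with $s_2 - s_1 = x$.

The shared peaks count (SPC peak): $(S_2 \ominus S_1)(0)$.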


Elements of S2 ⊖ S1 represented as elements of a difference matrix. The elements with multiplicity >2 are colored; the elements with multiplicity =2 are circled. The SPC takes into account only the red entries.


[Figure: spectral convolution plotted against the offset x, for x from −150 to 150.]


S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}

Which of the spectra S’ = {10, 20, 30, 40, 50, 55, 65, 75, 85, 95}

or S” = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}

fits the spectrum S the best?

SPC: both S’ and S” have 5 peaks in common with S.

Spectral Convolution: reveals the peaks at 0 and 5.
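As a sketch, the convolution of two spectra can be computed by counting all pairwise mass differences. The function name is an assumption; the three spectra are the ones given above.

```python
from collections import Counter


def spectral_convolution(s1, s2):
    """Multiplicity of every difference s2 - s1 over all pairs of peaks."""
    return Counter(b - a for a in s1 for b in s2)


S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S1 = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]   # S'
S2 = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]   # S''

conv1 = spectral_convolution(S, S1)
conv2 = spectral_convolution(S, S2)
```

Both convolutions have multiplicity 5 at offset 0 (the SPC) and multiplicity 5 at offsets of magnitude 5, so on these counts alone S’ and S” look equally close to S.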


[Figure: spectral convolutions S ⊖ S’ and S ⊖ S”.]


Limitations of the Spectral Convolution

Spectral convolution does not reveal that spectra S and S’ are similar, while spectra S and S” are not.

Clumps of shared peaks: the matching positions in S’ come in clumps while the matching positions in S” don't.

This important property was not captured by spectral convolution.


$A = \{a_1 < \dots < a_n\}$: an ordered set of natural numbers.

A shift $(i, \Delta)$ is characterized by two parameters, the position ($i$) and the length ($\Delta$).

The shift $(i, \Delta)$ transforms $\{a_1, \dots, a_n\}$ into $\{a_1, \dots, a_{i-1}, a_i + \Delta, \dots, a_n + \Delta\}$.


The shift $(i, \Delta)$ transforms $\{a_1, \dots, a_n\}$ into $\{a_1, \dots, a_{i-1}, a_i + \Delta, \dots, a_n + \Delta\}$, e.g.

10 20 30 40 50 60 70 80 90

shift (4, −5):

10 20 30 35 45 55 65 75 85

shift (7, −3):

10 20 30 35 45 55 62 72 82
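The shift operation is easy to express in code. A minimal sketch, with a hypothetical function name and 1-based positions as in the definition:

```python
def apply_shift(a, i, delta):
    """Apply the shift (i, delta): add delta to a_i, ..., a_n (i is 1-based)."""
    return a[:i - 1] + [x + delta for x in a[i - 1:]]


start = [10, 20, 30, 40, 50, 60, 70, 80, 90]
after_first = apply_shift(start, 4, -5)
after_second = apply_shift(after_first, 7, -3)
```

Applying the two shifts in sequence reproduces the second and third rows of the example.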


• Find a series of k shifts that make the sets A={a1, …., an} and B={b1,….,bn}

as similar as possible.

• k-similarity between sets

• D(k) - the maximum number of elements in common between sets after k shifts.


• Convert spectrum to a 0-1 string with 1s corresponding to the positions of the peaks.


Comparing Spectra = Comparing 0-1 Strings

• A modification with positive offset corresponds to inserting a block of 0s.

• A modification with negative offset corresponds to deleting a block of 0s.

• Comparison of theoretical and experimental spectra (represented as 0-1 strings) corresponds to a (somewhat unusual) edit distance/alignment problem where elementary edit operations are insertions/deletions of blocks of 0s.

• Use sequence alignment algorithms!


Spectral Alignment vs. Sequence Alignment

• Manhattan-like graph with different alphabet and scoring.

• Movement can be diagonal (matching masses) or horizontal/vertical (insertions/deletions corresponding to PTMs).

• At most k horizontal/vertical moves.


A = {a1, …, an} and B = {b1, …, bm}

Spectral product A⊗B: two-dimensional matrix with nm 1s corresponding to all pairs of indices (ai,bj) and remaining elements being 0s.

[Figure: spectral product matrix with columns labeled 10, 20, 30, 40, 50, 55, 65, 75, 85, 95; δ marks the offset between diagonals.]

SPC: the number of 1s at the main diagonal.

δ-shifted SPC: the number of 1s on the diagonal (i,i+ δ)


k-similarity between spectra: the maximum number of 1s on a path through this graph that uses at most k+1 diagonals.

k-optimal spectral alignment = a path.

The spectral alignment allows one to detect more and more subtle similarities between spectra by increasing k.


SPC reveals only D(0)=3 matching peaks.

Spectral Alignment reveals more hidden similarities between spectra: D(1)=5 and D(2)=8, and detects the corresponding mutations.


The black line represents the path for k=0. The red lines represent the path for k=1. The blue lines (right) represent the path for k=2.


The spectral convolution considers diagonals separately without combining them into feasible mutation scenarios.

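[Figure: two spectral products with rows 10, 20, …, 100 (spectrum S). Left: columns 10, 20, 30, 40, 50, 55, 65, 75, 85, 95 (spectrum S’), D(1) = 10. Right: columns 10, 15, 30, 35, 50, 55, 70, 75, 90, 95 (spectrum S”), D(1) = 6. Shift function score = 10. δ marks the offset between diagonals.]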


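$D_{ij}(k)$: the maximum number of 1s on a path to $(a_i, b_j)$ that uses at most $k+1$ diagonals.

$$D_{ij}(k) = \max_{(i',j') < (i,j)} \begin{cases} D_{i'j'}(k) + 1, & \text{if } (i',j') \sim (i,j) \\ D_{i'j'}(k-1) + 1, & \text{otherwise} \end{cases}$$

$$D(k) = \max_{ij} D_{ij}(k)$$

Here $(i',j') \sim (i,j)$ means that the two cells lie on the same diagonal.

Running time: $O(n^4 k)$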
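The naive recurrence can be sketched directly. This is an illustrative implementation, not the authors' code: it adds a virtual origin cell (0, 0) to both spectra so that every path starts on the main diagonal (an assumption that makes D(0) coincide with the SPC), and the function name is hypothetical.

```python
def spectral_alignment(a, b, k_max):
    """Return [D(0), ..., D(k_max)] using the naive O(n^4 k) recurrence."""
    a = [0] + sorted(a)  # virtual origin, assumed for illustration
    b = [0] + sorted(b)
    n, m = len(a), len(b)
    neg_inf = float("-inf")
    # D[k][i][j]: max number of peaks on a path to (a_i, b_j) with <= k shifts.
    D = [[[neg_inf] * m for _ in range(n)] for _ in range(k_max + 1)]
    for k in range(k_max + 1):
        D[k][0][0] = 0  # the origin itself is not counted as a peak
    for i in range(1, n):
        for j in range(1, m):
            for k in range(k_max + 1):
                best = neg_inf
                for ip in range(i):
                    for jp in range(j):
                        if a[i] - a[ip] == b[j] - b[jp]:   # same diagonal
                            cand = D[k][ip][jp] + 1
                        elif k > 0:                        # diagonal change costs one shift
                            cand = D[k - 1][ip][jp] + 1
                        else:
                            continue
                        best = max(best, cand)
                D[k][i][j] = best
    return [max(max(row) for row in D[k]) for k in range(k_max + 1)]


S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S1 = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]   # S'
S2 = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]   # S''
d_s1 = spectral_alignment(S, S1, 1)
d_s2 = spectral_alignment(S, S2, 1)
```

On the toy spectra used earlier, one shift aligns all ten peaks of S’ but only six peaks of S”, a distinction that neither the SPC nor the spectral convolution makes.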


Edit Graph for Fast Spectral Alignment

diag(i,j) – the position of previous 1 on the same diagonal as (i,j)


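$$D_{ij}(k) = \max \begin{cases} D_{\mathrm{diag}(i,j)}(k) + 1 \\ M_{i-1,j-1}(k-1) + 1 \end{cases}$$

where

$$M_{ij}(k) = \max_{(i',j') \le (i,j)} D_{i'j'}(k)$$

is computed as

$$M_{ij}(k) = \max \begin{cases} D_{ij}(k) \\ M_{i-1,j}(k) \\ M_{i,j-1}(k) \end{cases}$$

Running time: $O(n^2 k)$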


Spectra are combinations of an increasing (N-terminal ions) and a decreasing (C-terminal ions) number series.

These series form two diagonals in the spectral product, the main diagonal and a complementary diagonal.

The described algorithm deals with the main diagonal only.


• Simultaneous analysis of N- and C-terminal ions

• Taking into account the intensities and charges

• Analysis of minor ions


• So far de novo and database search were presented as two separate techniques

• Database search is rather slow: many labs generate more than 100,000 spectra per day. SEQUEST takes approximately 1 minute to compare a single spectrum against SWISS-PROT (54Mb) on a desktop.

• It will take SEQUEST more than 2 months to analyze the MS/MS data produced in a single day.

• Can slow database search be combined with fast de novo analysis?
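A rough check of the estimate above, assuming exactly 100,000 spectra per day and 1 minute per spectrum on a single desktop:

\[
100{,}000 \ \text{spectra} \times 1 \ \tfrac{\text{min}}{\text{spectrum}} = 100{,}000 \ \text{min} \approx 1{,}667 \ \text{h} \approx 69 \ \text{days} > 2 \ \text{months}
\]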


[Figure: flowchart of sequence database search with filtration. Labels: Protein Query, Database, Filtration (Sequence Alignment – BLAST), Filtered Sequences, Scoring (Sequence Alignment – Smith-Waterman Algorithm), Sequence matches.]

• BLAST filters out very few correct matches and is almost as accurate as the Smith-Waterman algorithm.


[Figure: flowchart of MS/MS database search with filtration. Labels: MS/MS spectrum, Database, Filtration, Peptide Sequences, Scoring (Peptide Sequencing – SEQUEST / Mascot), Sequence matches.]


• Filtration in MS/MS is more difficult than in BLAST.

• Early approaches using Peptide Sequence Tags were not able to substitute the complete database search.

• Current filtration approaches are mostly used to generate additional identifications rather than replace the database search.

• Can we design a filtration-based search that can replace the database search and is orders of magnitude faster?


• De novo sequencing is still not very accurate!

Algorithm                              Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)    0.566                 0.189
SHERENGA (Dančík et al., 1999)         0.690                 0.289
Peaks (Ma et al., 2003)                0.673                 0.246
PepNovo (Frank and Pevzner, 2005)      0.727                 0.296


A Covering Set of Tags

• Given an MS/MS spectrum:

  • Can de novo predict the entire peptide sequence? No! (accuracy is less than 30%)

  • Can de novo predict partial sequences? No! (accuracy is 50% for GutenTag and 80% for PepNovo)

  • Can de novo predict a set of partial sequences that, with high probability, contains at least one correct tag? Yes!


• A Peptide Sequence Tag is a short substring of a peptide.

Example: G V D L K

Tags: G V D, V D L, D L K


• Peptide sequence tags can be used as filters in database searches.

• The Filtration: Consider only database peptides that contain the tag (in its correct relative mass location).

• First suggested by Mann and Wilm (1994).

• Similar concepts are also used by:

  • GutenTag - Tabb et al. 2003.

  • MultiTag - Sunyaev et al. 2003.

  • OpenSea - Searle et al. 2004.


• Filtration makes genomic database searches practical (BLAST).

• Effective filtration can greatly speed-up the process, enabling expensive searches involving post-translational modifications.

• Goal: generate a small set of covering tags and use them to filter the database peptides.


• Parse tags from de novo reconstruction.

• Only a small number of tags can be generated.

• If the de novo sequence is completely incorrect, none of the tags will be correct.

[Figure: spectrum graph with the de novo path AVGELTK highlighted.]

TAG   Prefix Mass
AVG   0.0
VGE   71.0
GEL   170.1
ELT   227.1
LTK   356.2
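The tag table can be regenerated from the de novo sequence with a short sketch. It assumes standard monoisotopic residue masses and a tag length of 3; the function name is hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues in AVGELTK.
MONO = {"A": 71.03711, "V": 99.06841, "G": 57.02146, "E": 129.04259,
        "L": 113.08406, "T": 101.04768, "K": 128.09496}


def tags_with_prefix_mass(sequence, length=3):
    """All substrings of the given length, each with the mass preceding it."""
    tags = []
    prefix = 0.0
    for start in range(len(sequence) - length + 1):
        tags.append((sequence[start:start + length], round(prefix, 1)))
        prefix += MONO[sequence[start]]
    return tags


tags = tags_with_prefix_mass("AVGELTK")
```

A peptide of length n yields only n − 2 tags of length 3, which illustrates why parsing a single de novo reconstruction produces so few candidates.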


• Extract the highest scoring subpaths from the spectrum graph.

• Sometimes gets misled by locally promising-looking “garden paths”.

[Figure: spectrum graph with high scoring local subpaths highlighted.]

TAG   Prefix Mass
AVG   0.0
WTD   120.2
PET   211.4


• Each additional tag used to filter increases the number of database hits and slows down the database search.

• Tags can be ranked according to their scores, however this ranking is not very accurate.

• It is better to determine for each tag the “probability” that it is correct, and choose most probable tags.


• For each amino acid in a tag we want to assign a probability that it is correct.

• Each amino acid, which corresponds to an edge in the spectrum graph, is mapped to a feature space that consists of the features that correlate with reliability of amino acid prediction, e.g. score reduction due to edge removal


• The removal of an edge corresponding to a genuine amino acid usually leads to a reduction in the score of the de novo path.

• However, the removal of an edge that does not correspond to a genuine amino acid tends to leave the score unchanged.

[Figure: spectrum graphs before and after removing an edge, showing the effect on the de novo path.]


• How do we determine the probability of a predicted tag ?

• We use the predicted probabilities of its amino acids and follow the concept:

a chain is only as strong as its weakest link


• Results are for 280 spectra of doubly charged tryptic peptides from the ISB and OPD datasets.

Algorithm \ #tags   Length 3        Length 4        Length 5
                    1      10       1      10       1      10
GlobalTag           0.80   0.94     0.73   0.87     0.66   0.80
GutenTag            0.49   0.89     0.41   0.78     0.31   0.64
LocalTag+           0.75   0.96     0.70   0.90     0.57   0.80


[Figure: search pipeline with the stages De novo, Tag filter, Tag extension, Score, Significance; Db: 55M peptides; Candidate Peptides (700).]


• Matching of a sequence tag against a database is fast

• Even matching many tags against a database is fast

• k tags can be matched against a database in time proportional to database size, but independent of the number of tags.

  • keyword trees (Aho-Corasick algorithm)

• Scan time can be amortized by combining scans for many spectra all at once.

  • build one keyword tree from multiple spectra
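A compact sketch of the keyword-tree idea follows. It is a textbook-style Aho-Corasick automaton written for illustration, not code from any of the tools named in these slides; the tags and the database fragment are toy values.

```python
from collections import deque


def build_keyword_tree(tags):
    """Build goto, failure and output tables for the given tags."""
    goto, fail, out = [{}], [0], [[]]
    for tag in tags:
        node = 0
        for ch in tag:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(tag)
    # Breadth-first pass to fill in failure links.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_tags(text, tags):
    """Scan the text once and report (start position, tag) for every hit."""
    goto, fail, out = build_keyword_tree(tags)
    node, hits = 0, []
    for pos, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for tag in out[node]:
            hits.append((pos - len(tag) + 1, tag))
    return hits


hits = find_tags("YFRAYFNTA", ["YFAK", "YFNS", "FNTA"])
```

The database text is read once regardless of how many tags are loaded into the tree, so tags from many spectra can share a single scan.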


[Figure: keyword tree built from the tags YFAK, YFNS and FNTA, scanned against the database text …YFRAYFNTA…]


[Figure: search pipeline with the stages De novo, Filter, Extension, Score, Significance; Db: 55M peptides; Candidate Peptides (700).]


• Given:
  • tag with prefix and suffix masses: <mP> xyz <mS>
  • match in the database

• Compute whether a suffix and prefix match with allowable modifications.

• Compute a candidate peptide with most likely positions of modifications (attachment points).

[Figure: tag <mP> xyz <mS> aligned to its database match xyz]
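A rough sketch of the extension step follows. It rests on several assumptions that are not in the slides: masses are integer (nominal) residue masses with no water or ion offsets, every modification adds mass, and the modification masses 16 (Met oxidation) and 80 (phosphorylation) are illustrative defaults. The function names are hypothetical, and the sketch only reports the unexplained mass on each side; it does not place the modifications on specific residues.

```python
from itertools import combinations_with_replacement

# Nominal (integer) residue masses in Da.
NOMINAL_MASS = {
    'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99, 'T': 101, 'C': 103,
    'L': 113, 'I': 113, 'N': 114, 'D': 115, 'Q': 128, 'K': 128, 'E': 129,
    'M': 131, 'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186,
}

def mod_offsets(mod_masses, max_mods):
    """Map each reachable total modification mass to the fewest mods needed."""
    offsets = {0: 0}
    for k in range(1, max_mods + 1):
        for combo in combinations_with_replacement(mod_masses, k):
            offsets.setdefault(sum(combo), k)
    return offsets

def flank_matches(residues, target, offsets):
    """Walk away from the tag; report (n_residues, mod_mass) explaining target."""
    matches = [(0, 0)] if target == 0 else []
    total = 0
    for n, aa in enumerate(residues, start=1):
        total += NOMINAL_MASS[aa]
        if total > target:      # assumes modifications only add mass
            break
        if target - total in offsets:
            matches.append((n, target - total))
    return matches

def extend_tag_hit(db, hit_start, tag_len, prefix_mass, suffix_mass,
                   mod_masses=(16, 80), max_mods=2):
    """Return (peptide, prefix_mod_mass, suffix_mod_mass) candidates."""
    offsets = mod_offsets(mod_masses, max_mods)
    left = flank_matches(reversed(db[:hit_start]), prefix_mass, offsets)
    right = flank_matches(db[hit_start + tag_len:], suffix_mass, offsets)
    candidates = []
    for n_left, d_left in left:
        for n_right, d_right in right:
            if offsets[d_left] + offsets[d_right] <= max_mods:
                peptide = db[hit_start - n_left:hit_start + tag_len + n_right]
                candidates.append((peptide, d_left, d_right))
    return candidates

# Tag CFA found at index 5; prefix mass 305 = A + M + S + 16 (one oxidation).
print(extend_tag_hit("GGAMSCFAVKGG", 5, 3, prefix_mass=305, suffix_mass=227))
```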


[Figure: search pipeline with the stages De novo, Filter, Extension, Score, Significance; database (Db) of 55M peptides]


• Input:
  • Candidate peptide with attached modifications
  • Spectrum

• Output:
  • Score function that normalizes for length, as variable modifications can change peptide length.


[Figure: search pipeline with the stages De novo, Filter, Extension, Score, Significance; database (Db) of 55M peptides]


• Features:
  • Score S: as computed
  • Explained Intensity I: fraction of total intensity explained by annotated peaks.
  • b-y score B: fraction of b+y ions annotated
  • Explained peaks P: fraction of top 25 peaks annotated.

• Each of the I, S, B, P features is normalized (subtract mean and divide by s.d.)

• Problem: separate correct and incorrect identifications using I, S, B, P
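The normalization step can be sketched as a per-feature z-score. The slides do not say whether the population or sample standard deviation is used; the population form (`pstdev`) is assumed here, and the function name and input layout (one dict per identification) are hypothetical.

```python
from statistics import mean, pstdev

FEATURES = ("I", "S", "B", "P")

def normalize_features(rows):
    """rows: list of dicts with raw I, S, B, P values; returns z-scored copies."""
    normalized = [dict(row) for row in rows]
    for name in FEATURES:
        column = [row[name] for row in rows]
        mu, sd = mean(column), pstdev(column)
        for out, value in zip(normalized, column):
            # A constant feature carries no information; map it to 0.
            out[name] = (value - mu) / sd if sd > 0 else 0.0
    return normalized

rows = [
    {"I": 0.2, "S": 10, "B": 0.1, "P": 0.3},
    {"I": 0.4, "S": 20, "B": 0.3, "P": 0.5},
    {"I": 0.6, "S": 30, "B": 0.5, "P": 0.7},
]
for row in normalize_features(rows):
    print(row)
```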


Quality score: $Q = w_I I + w_S S + w_B B + w_P P$

The weights are chosen to minimize the misclassification error.
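As a small illustration of the weighted sum, the sketch below evaluates Q for already-normalized features and counts misclassifications for a given weight vector. The slides do not say how the weights are optimized, so picking the best of a few candidate vectors stands in for that step; the threshold of 0, the example data, and the candidate weights are all made up for illustration.

```python
def quality(features, weights):
    """Q = w_I*I + w_S*S + w_B*B + w_P*P for one identification."""
    return sum(weights[name] * features[name] for name in "ISBP")

def misclassification_rate(examples, weights, threshold=0.0):
    """examples: list of (normalized_features, is_correct) pairs."""
    errors = sum((quality(f, weights) > threshold) != label
                 for f, label in examples)
    return errors / len(examples)

# Hypothetical labelled examples (features already normalized).
examples = [
    ({"I": 1.0, "S": 1.2, "B": 0.8, "P": 0.9}, True),
    ({"I": -1.0, "S": -0.7, "B": -1.1, "P": -0.5}, False),
    ({"I": 0.2, "S": -0.9, "B": 0.1, "P": 0.3}, True),
]

# Hypothetical candidate weight vectors; keep the one with the fewest errors.
candidates = [
    {"I": 0.25, "S": 0.25, "B": 0.25, "P": 0.25},
    {"I": 0.0, "S": 1.0, "B": 0.0, "P": 0.0},
    {"I": 1.0, "S": 0.0, "B": 0.0, "P": 0.0},
]
best = min(candidates, key=lambda w: misclassification_rate(examples, w))
print(best, misclassification_rate(examples, best))
```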


• All ISB spectra were searched.
• The top match is valid for 2978 spectra (2765 for Sequest)
• InsPecT-Sequest: 644 spectra (I-S dataset)
• Sequest-InsPecT: 422 spectra (S-I dataset)
• Average explained intensity of I-S = 52%
• Average explained intensity of S-I = 28%
• Average explained intensity of I∩S = 58%
• ~70 Met. Oxidations
• Run time is 0.7 secs. per spectrum (2.7 secs. for Sequest)


• The Alliance for Cellular Signaling is looking at proteins phosphorylated in specific signal transduction pathways.

• 6500 spectra are searched with up to 4 modifications (up to 3 Met. oxidations and up to 2 phosphorylations)

• 281 phosphopeptides with P-value < 0.05


The search was done against SWISS-PROT (54 Mb).

• With 10 tags of length 3:
  • The filtration is 1500 times more efficient.
  • Less than 4% of spectra are filtered out.
  • The search time per spectrum is reduced by two orders of magnitude as compared to SEQUEST.

| PTMs            | Tag Length | # Tags | Filtration | InsPecT Runtime | SEQUEST Runtime |
| None            | 3          | 1      | 3.4×10^-7  | 0.17 sec        | > 1 minute      |
|                 | 3          | 10     | 1.6×10^-6  | 0.27 sec        |                 |
| Phosphorylation | 3          | 1      | 5.8×10^-7  | 0.21 sec        | > 2 minutes     |
|                 | 3          | 10     | 2.7×10^-6  | 0.38 sec        |                 |


• With 10 tags of length 3:
  • The filtration is 1500 times more efficient than using the parent mass alone.
  • Less than 4% of the positive peptides are filtered out.
  • The search time per spectrum is reduced from over a minute (SEQUEST) to 0.4 seconds.


SPIDER: Yet Another Application of de novo Sequencing

• Suppose you have a good MS/MS spectrum of an elephant peptide

• Suppose you even have a good de novo reconstruction of this spectrum

• However, until the elephant genome is sequenced, it is hard to verify this de novo reconstruction

• Can you search a de novo reconstruction of a peptide from elephant against a human protein database?

• SPIDER (Han, Ma, Zhang) addresses this comparative proteomics problem

Slides from Bin Ma, University of Western Ontario



N and GG have the same mass
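This kind of ambiguity is easy to check with a table of residue masses. The sketch below uses nominal (integer) residue masses, which is an assumption about the resolution being modeled: N and GG have the same elemental composition and agree exactly, while a pair such as LS and EA agrees only at nominal resolution (the exact masses differ by roughly 0.04 Da). The helper name `nominal_mass` is hypothetical.

```python
# Nominal (integer) residue masses in Da.
NOMINAL_MASS = {
    'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99, 'T': 101, 'C': 103,
    'L': 113, 'I': 113, 'N': 114, 'D': 115, 'Q': 128, 'K': 128, 'E': 129,
    'M': 131, 'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186,
}

def nominal_mass(peptide):
    """Sum of nominal residue masses of a peptide segment."""
    return sum(NOMINAL_MASS[aa] for aa in peptide)

# Segments a de novo algorithm cannot tell apart by mass alone.
for a, b in [("N", "GG"), ("LS", "EA"), ("W", "AD"), ("NN", "VE"), ("AF", "FA")]:
    print(a, b, nominal_mass(a), nominal_mass(b))
```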


From Reconstruction to Database Candidate through Real Sequence

• Given a sequence with errors, search for the similar sequences in a DB.

(Seq)   X: LSCFAV
(Real)  Y: SLCFAV
(Match) Z: SLCF-V

(Seq)   X: LSCF-AV
(Real)  Y: EACF-AV     mass(LS) = mass(EA)
(Match) Z: DACFKAV

[Figure labels: X to Y, sequencing error; Y to Z, homology mutations]


Alignment between de novo Candidate and Database Candidate

• If real sequence Y is known then: $d(X,Z) = \mathrm{seqError}(X,Y) + \mathrm{editDist}(Y,Z)$

• If real sequence Y is unknown then the distance between de novo candidate X and database candidate Z is:
  • $d(X,Z) = \min_Y \big( \mathrm{seqError}(X,Y) + \mathrm{editDist}(Y,Z) \big)$

• Problem: search a database for Z that minimizes d(X,Z)
  • The core problem is to compute d(X,Z) for given X and Z.

(Seq)   X: LSCF-AV
(Real)  Y: EACF-AV
(Match) Z: DACFKAV


• Align X and Y (according to mass).

• A segment of X can be aligned to a segment of Y only if their mass is the same!

• For each erroneous mass block (Xi,Yi), the cost is f(Xi,Yi) = f(mass(Xi)).
  • f(m) depends on how often de novo sequencing makes errors on a segment with mass m.
  • seqError(X,Y) is the sum of all f(mass(Xi)).

[Figure: X and Y are related by seqError; Y and Z are related by editDist]

(Seq)  X: LSCFAV
(Real) Y: EACFAV
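One way to make the block decomposition concrete is sketched below. It assumes nominal integer masses and splits X and Y greedily into the smallest blocks whose masses agree, which is only one possible reading of "mass block". The error-cost function `f` is a placeholder (constant 1.0), since the slides only say that it depends on how often de novo sequencing errs on a segment of mass m.

```python
# Nominal (integer) residue masses in Da.
NOMINAL_MASS = {
    'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99, 'T': 101, 'C': 103,
    'L': 113, 'I': 113, 'N': 114, 'D': 115, 'Q': 128, 'K': 128, 'E': 129,
    'M': 131, 'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186,
}

def mass_blocks(x, y):
    """Split x and y into the smallest aligned blocks of equal mass."""
    blocks = []
    i = j = start_i = start_j = 0
    mx = my = 0
    while i < len(x) or j < len(y):
        if mx <= my and i < len(x):
            mx += NOMINAL_MASS[x[i]]
            i += 1
        elif j < len(y):
            my += NOMINAL_MASS[y[j]]
            j += 1
        else:
            raise ValueError("sequences cannot be aligned by mass")
        if mx == my:
            blocks.append((x[start_i:i], y[start_j:j]))
            start_i, start_j = i, j
            mx = my = 0
    if mx != my:
        raise ValueError("sequences cannot be aligned by mass")
    return blocks

def seq_error(x, y, f=lambda m: 1.0):
    """Sum f(mass(Xi)) over blocks where the de novo segment differs from Y."""
    return sum(f(sum(NOMINAL_MASS[aa] for aa in bx))
               for bx, by in mass_blocks(x, y) if bx != by)

print(mass_blocks("LSCFAV", "EACFAV"))
print(seq_error("LSCFAV", "EACFAV"))
```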


• Dynamic Programming:

• Let D[i,j]=d(X[1..i], Z[1..j])

• We examine the last block of the alignment of X[1..i] and Z[1..j].

(Seq)   X: LSCF-AV
(Real)  Y: EACF-AV
(Match) Z: DACFKAV


• Cases A, B, C: no de novo sequencing errors
• Case D: de novo sequencing error

D[i,j] = D[i,j-1] + indel
D[i,j] = D[i-1,j] + indel
D[i,j] = D[i-1,j-1] + dist(X[i],Z[j])
D[i,j] = D[i'-1,j'-1] + alpha(X[i'..i],Z[j'..j])   (case D)

• D[i,j] is the minimum of the four cases.
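The four-case recurrence can be sketched as follows. To keep the example self-contained, the block cost `toy_alpha` is a deliberate simplification: it only allows a block of X to be replaced by an equal-mass block of Z at a constant cost of 0.5, ignoring mutations inside the block. The name `spider_distance`, the unit indel and mismatch costs, and the `max_block` bound on block length are likewise assumptions for illustration, not values taken from the slides.

```python
# Nominal (integer) residue masses in Da.
NOMINAL_MASS = {
    'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99, 'T': 101, 'C': 103,
    'L': 113, 'I': 113, 'N': 114, 'D': 115, 'Q': 128, 'K': 128, 'E': 129,
    'M': 131, 'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186,
}

def mass(segment):
    return sum(NOMINAL_MASS[aa] for aa in segment)

def toy_alpha(x_block, z_block, f=lambda m: 0.5):
    """Simplified case-D cost: equal-mass blocks only, no mutations inside."""
    if mass(x_block) != mass(z_block):
        return float("inf")
    return f(mass(x_block))

def spider_distance(x, z, alpha=toy_alpha, indel=1.0, max_block=4):
    """D[i][j] = d(X[1..i], Z[1..j]) as the minimum over the four cases."""
    inf = float("inf")
    D = [[inf] * (len(z) + 1) for _ in range(len(x) + 1)]
    D[0][0] = 0.0
    for i in range(len(x) + 1):
        for j in range(len(z) + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if j > 0:                       # gap in X
                best = min(best, D[i][j - 1] + indel)
            if i > 0:                       # gap in Z
                best = min(best, D[i - 1][j] + indel)
            if i > 0 and j > 0:             # match or substitution
                sub = 0.0 if x[i - 1] == z[j - 1] else 1.0
                best = min(best, D[i - 1][j - 1] + sub)
            # Case D: the last block X[i'..i], Z[j'..j] is a sequencing error.
            for bi in range(1, min(i, max_block) + 1):
                for bj in range(1, min(j, max_block) + 1):
                    cost = alpha(x[i - bi:i], z[j - bj:j])
                    best = min(best, D[i - bi][j - bj] + cost)
            D[i][j] = best
    return D[len(x)][len(z)]

print(spider_distance("LSCFAV", "EACFAV"))
```

With these toy costs, replacing the block LS by the equal-mass block EA is cheaper than two substitutions, which is the behavior case D is meant to capture.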

Page 145: Protein Sequencing and Identification by Mass SpectrometryAn Introduction to Bioinformatics Algorithms • Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

$$
\begin{aligned}
\alpha(X[i'..i], Z[j'..j]) &= \min_{y:\, m(y)=m(X[i'..i])} \big[\mathrm{seqError}(X[i'..i], y) + \mathrm{editDist}(y, Z[j'..j])\big] \\
&= \min_{y:\, m(y)=m[i'..i]} \big[f(m[i'..i]) + \mathrm{editDist}(y, Z[j'..j])\big] \\
&= f(m[i'..i]) + \min_{y:\, m(y)=m[i'..i]} \mathrm{editDist}(y, Z[j'..j])
\end{aligned}
$$

Here m[i'..i] abbreviates m(X[i'..i]).

• This is like aligning a mass with a string.
• Mass-alignment Problem: Given a mass m and a peptide P, find a peptide of mass m that is most similar to P (among all possible peptides)

Page 146: Protein Sequencing and Identification by Mass SpectrometryAn Introduction to Bioinformatics Algorithms • Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

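$$
\alpha(m, Z[i..j]) = \min
\begin{cases}
\min_y \alpha(m - m(y), Z[i..j]) + \mathrm{indel} \\
\alpha(m, Z[i..(j-1)]) + \mathrm{indel} \\
\min_y \big( \alpha(m - m(y), Z[i..(j-1)]) + \mathrm{dist}(y, Z[j]) \big)
\end{cases}
$$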
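The mass-alignment recurrence can be tried out with a short dynamic program. The sketch assumes nominal integer masses, unit indel and mismatch costs, and reads α(m, Z) as the smallest edit distance between Z and any peptide of mass m; the function name `mass_alignment` is hypothetical.

```python
# Nominal (integer) residue masses in Da.
NOMINAL_MASS = {
    'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99, 'T': 101, 'C': 103,
    'L': 113, 'I': 113, 'N': 114, 'D': 115, 'Q': 128, 'K': 128, 'E': 129,
    'M': 131, 'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186,
}

def mass_alignment(m, z, indel=1, mismatch=1):
    """Smallest edit distance between z and any peptide of nominal mass m."""
    inf = float("inf")
    # A[mass][j]: best cost of aligning a peptide of that mass with z[:j].
    A = [[inf] * (len(z) + 1) for _ in range(m + 1)]
    A[0][0] = 0
    for cur in range(m + 1):
        for j in range(len(z) + 1):
            if cur == 0 and j == 0:
                continue
            best = inf
            if j > 0:
                # Z[j] is aligned to a gap.
                best = A[cur][j - 1] + indel
            for y, my in NOMINAL_MASS.items():
                if my > cur:
                    continue
                # Last residue y of the peptide is aligned to a gap.
                best = min(best, A[cur - my][j] + indel)
                if j > 0:
                    # Last residue y is aligned to Z[j].
                    d = 0 if y == z[j - 1] else mismatch
                    best = min(best, A[cur - my][j - 1] + d)
            A[cur][j] = best
    return A[m][len(z)]

print(mass_alignment(200, "EA"))   # EA itself has nominal mass 200
print(mass_alignment(200, "DA"))   # DA has mass 186; one substitution suffices
```

The table has (m + 1) × (|Z| + 1) entries and each entry tries all residues, so the running time of this sketch grows linearly with the mass m.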

Page 147: Protein Sequencing and Identification by Mass SpectrometryAn Introduction to Bioinformatics Algorithms • Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

• Homology Match mode:
  • Assumes tagging (only peptides that share a tag of length 3 with the de novo reconstruction are considered) and extension of found hits by dynamic programming around the hits.

• Non-gapped homology match mode:
  • Sequencing errors and homology mutations do not overlap.

• Segment Match mode:
  • No homology mutations.

• Exact Match mode:
  • No sequencing errors and no homology mutations.

Page 148: Protein Sequencing and Identification by Mass SpectrometryAn Introduction to Bioinformatics Algorithms • Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

• The correct peptide sequence for each spectrum is known.
• The proteins are all in Swissprot but not in the human database.
• SPIDER searches 144 spectra against both Swissprot and human databases

Page 149: Protein Sequencing and Identification by Mass SpectrometryAn Introduction to Bioinformatics Algorithms • Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

• Using the de novo reconstruction X=CCQWDAEACAFNNPGK, the homolog Z was found in the human database. At the same time, the correct sequence Y was found in the SwissProt database.

Seq(X):      CCQ [W ] DAEAC [AF] <NN> <PG> K
Real(Y):     CCK  AD  DAEAC  FA   VE   GP  K
Database(Z): CCK [AD] DKETC [FA] <EE> <GK> K

[Figure annotations: sequencing errors; homology mutations]