Intro Protein structure Motifs Motif databases End Last time Probability based methods How find a good root? Reliability Reconciliation analysis

Intro Protein structure Motifs Motif databases End … Protein structure Motifs Motif databases End Last time • Probability based methods • How find a good root? • Reliability

May 08, 2018


nguyencong
Transcript

Intro Protein structure Motifs Motif databases End

Last time

• Probability based methods
• How to find a good root?
• Reliability
• Reconciliation analysis


Today

• Intro to protein structure
• Motifs and domains


"First dogma of Bioinformatics"

Sequence → structure → function

• Want to avoid determining structure
  • Expensive
  • Difficult
  • Sometimes impossible?
• Bioinfo dream: Structure from sequence!
  • "How does the protein fold?"


Ab initio folding?

• Folding from sequence seems out of reach

• But...:


What to do in silico?

1. Compromise and use what you’ve got. "Recycle" structures.
2. Find and understand protein building blocks: motifs and domains.
3. Identify certain protein types: transmembrane proteins.
4. "Why bother? Sequences are informative!"


Example 1: Motifs and domains

(Bjarnadottir et al, 2004)

Some typical G-protein coupled receptors
Small circles: glycosylation sites
Other symbols: domains


Example 2: Domains and structure


Our goals

Motifs: Representation and use
Domains: Definitions, hidden Markov models (HMM), applications, databases
PSI-Blast: Sensitive search tool
Secondary structure: In general and the TM special case


Motifs

• Short subsequences, DNA or amino acid (AA), 5 to 20 positions long.
• Foremost application: binding sites
• Motifs are grouped in families. The terminology is confusing.
• Fingerprints: Combinations of motifs


Motif representation

MTWDNRLAAFAQNYANQRA
MTWDNRLAAYAQNYANQRI
MTWDNRLAAYAQNYANQRI
MTWDDGLAAYAQNYANQRA
VSWSTKLQAYAQSYANQRI
LTWDDQVAAYAQNYASQLA
LTWDDQVAAYAQNYASQLA
LTWDDQVAAYAQNYASQLA
VSWSTKLQGFAQSYANQRI
MSWDANLASRAQNYANSRA
VSWSTKLQAFAQNYANQRI
LRWDEKVAAYARNYANQRK
LRWDEKVAAYARNYANQRK
VSWSTKLQAFAQNYANQRI
LVWNDELAQIAQVWANQCN
LVWNDELAQIAQVWANQCN
LTWDDEVAAYAQNYVSQLA
LTWDDQVAAYAQNYASQLA
VSWSTKLQAFAQNYANQRI
LVWSDELAYIAQVWANQCQ
LVWNDELAYVAQVWANQCQ
...
(shortened)

Motif "V5TPXLIKE": 95 seqs, width 19.

• Multialignment
• Pattern notation, e.g.: [LMV]-[RSTV]-W-[DSN]-...
• Profiles and PSSM, PWM
• Visualize with a sequence logo
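The pattern notation maps almost directly onto a regular expression. The following is a minimal sketch, assuming PROSITE-style syntax in which `-` separates positions and `x` stands for any residue. The helper name `pattern_to_regex` is hypothetical, and only the four positions shown on the slide are used, because the rest of the pattern is elided there. The third test sequence is made up as a counterexample.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a regular expression.

    Handles only bracketed residue sets, single residues and 'x' (any
    residue); repeat counts and anchors are left out of this sketch.
    """
    parts = []
    for element in pattern.split("-"):
        parts.append("." if element == "x" else element)
    return "".join(parts)

# Only the first four positions of the pattern are given on the slide.
regex = re.compile(pattern_to_regex("[LMV]-[RSTV]-W-[DSN]"))

sequences = [
    "MTWDNRLAAFAQNYANQRA",  # from the alignment
    "VSWSTKLQAYAQSYANQRI",  # from the alignment
    "AAWDNRLAAFAQNYANQRA",  # made-up sequence that should not match
]
for seq in sequences:
    print(seq, bool(regex.match(seq)))
```

A pattern only says yes or no, which is why the slides move on to profiles and scoring matrices.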


Sequence logo

PSSM of PR00837A (V5TPXLIKE), 95 sequences.

[Figure: sequence logo of the motif. x-axis: motif positions 1 to 19; y-axis: bits, 0 to 4.]

• Height indicates conservation
  (Too many details: height is the Kullback-Leibler divergence from the uniform distribution)
• Symbol height is proportional to frequency
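The two bullets above can be sketched in a few lines. This is an illustration only, assuming a 20-letter amino acid alphabet and no small-sample correction (some logo tools apply one). The function names `column_information` and `symbol_heights` are hypothetical.

```python
import math
from collections import Counter

def column_information(column: str, alphabet_size: int = 20) -> float:
    """Stack height in bits for one alignment column.

    Computed as the Kullback-Leibler divergence between the observed
    residue frequencies and the uniform distribution over the alphabet.
    """
    total = len(column)
    info = 0.0
    for count in Counter(column).values():
        p = count / total
        info += p * math.log2(p * alphabet_size)
    return info

def symbol_heights(column: str, alphabet_size: int = 20) -> dict:
    """Split the stack height among residues in proportion to frequency."""
    height = column_information(column, alphabet_size)
    return {r: height * c / len(column) for r, c in Counter(column).items()}

# A fully conserved column reaches the maximum, log2(20) bits.
print(column_information("WWWWWW"))
# A two-residue DNA column, evenly split.
print(symbol_heights("AACC", alphabet_size=4))
```

A fully conserved protein column gives log2(20), a little over 4 bits, which is consistent with the 0 to 4 axis of the logo.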


Start of translation

http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html


Phosphorylation site, PKA

(Blom et al, 1998)


Profiles

• Multialignments convenient
• Patterns sparse with information
• Logos are pretty pictures!
• Profile: Matrix F with frequency information

$F_{r,c}$ is the fraction of residue $r$ in position $c$.

Pos:  1     2     3     4     5     6
A     0.6   0.15  0.0   0.2   0.0   0.0
C     0.0   0.25  1.0   0.4   0.0   0.55
G     0.3   0.25  0.0   0.4   0.5   0.0
T     0.1   0.35  0.0   0.0   0.5   0.45

[Figure: sequence logo of this profile, made with WebLogo 3.0b14. y-axis: bits, 0.0 to 2.0.]


• $F_{r,c} = n_{r,c}/n$, where $n_{r,c}$ is the number of occurrences of $r$ in position $c$, and $n$ is the sequence count.
• For A in position 1: $n_{A,1} = 12$ and $n = 20$, so $F_{A,1} = 12/20 = 0.6$.
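The counting step can be sketched as follows. The 20 sequences behind the slide's profile are not given, so the alignment below is a made-up toy example, and `build_profile` is a hypothetical helper name.

```python
from collections import Counter

def build_profile(alignment, alphabet="ACGT"):
    """Return F[r][c] = n_{r,c} / n for an ungapped alignment."""
    n = len(alignment)
    width = len(alignment[0])
    profile = {r: [0.0] * width for r in alphabet}
    for c in range(width):
        counts = Counter(seq[c] for seq in alignment)
        for r in alphabet:
            profile[r][c] = counts[r] / n
    return profile

# Toy alignment (hypothetical, not the data from the slides).
toy_alignment = ["ACG", "ACT", "GCT", "ATG", "TCG"]
profile = build_profile(toy_alignment)
for residue, row in profile.items():
    print(residue, row)
```

Each column of the resulting matrix sums to 1, as in the slide's table.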


• Probability of AACATT being "produced" by the profile:

  $0.6 \times 0.15 \times 1.0 \times 0.2 \times 0.5 \times 0.45 = 0.00405$

• Is that good? Interpretation?
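The product can be checked directly. The sketch below also compares it with a uniform background of 0.25 per nucleotide. That background is an assumption made here to give the raw probability something to be measured against, and `sequence_probability` is a hypothetical helper name.

```python
import math

# Profile F from the slide, indexed as PROFILE[residue][position].
PROFILE = {
    "A": [0.6, 0.15, 0.0, 0.2, 0.0, 0.0],
    "C": [0.0, 0.25, 1.0, 0.4, 0.0, 0.55],
    "G": [0.3, 0.25, 0.0, 0.4, 0.5, 0.0],
    "T": [0.1, 0.35, 0.0, 0.0, 0.5, 0.45],
}

def sequence_probability(seq, profile):
    """Multiply the per-position frequencies of the residues in seq."""
    return math.prod(profile[r][c] for c, r in enumerate(seq))

p = sequence_probability("AACATT", PROFILE)
background = 0.25 ** len("AACATT")  # assumed uniform background model
print(p)               # probability under the profile
print(p / background)  # how much more likely than under the background
```

On its own 0.00405 looks small, but any specific 6-mer has probability 0.25^6 ≈ 0.00024 under the uniform background, so the profile makes AACATT roughly 16.6 times more likely.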


PSSM: Better than profile

• Want a log-odds score!
• PSSM = Position Specific Scoring Matrix
• $M_{r,c} = 10 \log_2(F_{r,c}/\pi_r)$, where $\pi_r$ is the frequency of $r$ in our data.
• Let $\pi_A = \pi_C = \pi_G = \pi_T = 0.25$.
• $M_{A,1} = 10 \log_2(F_{A,1}/0.25) = 10 \log_2(0.6/0.25) = 12.6$
• $M_{C,2} = 10 \log_2(F_{C,2}/0.25) = 10 \log_2(0.25/0.25) = 0$
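The two worked entries can be reproduced with a short function. This is a sketch under the slide's assumptions (scale factor 10, base-2 logarithm, uniform background). The name `pssm_entry` is hypothetical, and returning negative infinity for a zero frequency is a choice made here so the function is defined everywhere.

```python
import math

def pssm_entry(f: float, pi: float, scale: float = 10.0) -> float:
    """Log-odds score scale * log2(f / pi); -inf when f is zero."""
    if f == 0.0:
        return -math.inf
    return scale * math.log2(f / pi)

print(round(pssm_entry(0.6, 0.25), 1))   # M_{A,1}
print(round(pssm_entry(0.25, 0.25), 1))  # M_{C,2}
```

A positive entry means the residue is more common at that position than in the background, zero means equally common, and negative means rarer.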


PSSM M from our profile F

Profile:

Pos:  1     2     3     4     5     6
A     0.6   0.15  0.0   0.2   0.0   0.0
C     0.0   0.25  1.0   0.4   0.0   0.55
G     0.3   0.25  0.0   0.4   0.5   0.0
T     0.1   0.35  0.0   0.0   0.5   0.45

PSSM:

Pos:  1      2      3      4      5      6
A     12.6   -7.4   -∞     -3.2   -∞     -∞
C     -∞     0.0    20.0   6.8    -∞     11.4
G     2.63   0.0    -∞     6.8    10.0   -∞
T     -13.2  4.9    -∞     -∞     10.0   8.5

Score for AACATT:

$12.6 - 7.4 + 20.0 - 3.2 + 10.0 + 8.5 = 40.5$
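The whole matrix and the score can be recomputed from the profile. This is a minimal sketch assuming the uniform background of 0.25 and the scale factor 10 used in the slides. The names `build_pssm` and `score` are hypothetical.

```python
import math

# Profile F from the slide, indexed as PROFILE[residue][position].
PROFILE = {
    "A": [0.6, 0.15, 0.0, 0.2, 0.0, 0.0],
    "C": [0.0, 0.25, 1.0, 0.4, 0.0, 0.55],
    "G": [0.3, 0.25, 0.0, 0.4, 0.5, 0.0],
    "T": [0.1, 0.35, 0.0, 0.0, 0.5, 0.45],
}

def build_pssm(profile, background=0.25, scale=10.0):
    """Turn frequencies into log-odds scores; zero frequency gives -inf."""
    return {
        r: [scale * math.log2(f / background) if f > 0 else -math.inf
            for f in row]
        for r, row in profile.items()
    }

def score(seq, pssm):
    """Sum the position-specific scores of the residues in seq."""
    return sum(pssm[r][c] for c, r in enumerate(seq))

pssm = build_pssm(PROFILE)
for residue, row in pssm.items():
    print(residue, [round(x, 1) for x in row])
print(round(score("AACATT", pssm), 1))
```

Because the entries are logarithms, adding scores corresponds to multiplying the per-position odds ratios.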


Generalizing with PSSM?

How to handle a new variant of a motif?

Pos:  1      2      3      4      5      6
A     12.6   -7.4   -∞     -3.2   -∞     -∞
C     -∞     0.0    20.0   6.8    -∞     11.4
G     2.63   0.0    -∞     6.8    10.0   -∞
T     -13.2  4.9    -∞     -∞     10.0   8.5

Score for ATCTTT?

$12.6 + 4.9 + 20.0 - \infty + 10.0 + 8.5 = 56 - \infty = -\infty$


"Pseudo counts" for profiles

• Idea: Pretend you have seen all possible motifs
• Pseudo counts: $\alpha_r$ is the number of "pseudo observations" of $r$.
• Include in profile calculations:

  $F_{r,c} = \frac{n_{r,c} + \alpha_r}{n + \sum_r \alpha_r}$

• Example 1: Let $\alpha_A = \alpha_C = \alpha_G = \alpha_T = 1$. Then $F_{A,1} = \frac{12+1}{20+4} = 0.54$.
• Example 2: We had $n_{C,1} = 0$. Then $F_{C,1} = \frac{0+1}{20+4} = 0.042$.
• Result: Can use PSSM to find novel motifs
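Both examples, and the effect on the log-odds score, can be checked with a small sketch. It assumes the same pseudo count for every residue, so the sum in the denominator becomes the alphabet size times that value. The name `smoothed_frequency` is hypothetical.

```python
import math

def smoothed_frequency(count, n, alpha=1.0, alphabet_size=4):
    """Frequency with pseudo counts: (n_rc + alpha) / (n + sum of alphas).

    Assumes the same pseudo count alpha for every residue.
    """
    return (count + alpha) / (n + alphabet_size * alpha)

f_a1 = smoothed_frequency(12, 20)  # Example 1: A in position 1
f_c1 = smoothed_frequency(0, 20)   # Example 2: C in position 1, never observed
print(round(f_a1, 2), round(f_c1, 3))

# With a uniform background of 0.25 the score of the unseen residue
# is now a finite penalty instead of minus infinity.
print(round(10 * math.log2(f_c1 / 0.25), 1))
```

An unseen residue is therefore penalized heavily but no longer vetoes the whole match, which is what lets the matrix score new variants of the motif.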


Fast motif searches

• Motifs are small, therefore easy to search with. Fast.
• Blast variants exist for motifs.
• E-value theory is the same thanks to the log-odds score!
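To illustrate why searching with a short motif is cheap, here is a brute-force sliding-window scan. It is not what Blast does (Blast uses heuristics and reports E-values); it is only a sketch. The counts are the slide's profile multiplied by n = 20, smoothed with a pseudo count of 1, and the target sequence is made up.

```python
import math

# Counts n_{r,c}, obtained as profile frequency times n = 20.
COUNTS = {
    "A": [12, 3, 0, 4, 0, 0],
    "C": [0, 5, 20, 8, 0, 11],
    "G": [6, 5, 0, 8, 10, 0],
    "T": [2, 7, 0, 0, 10, 9],
}
N, ALPHA, BACKGROUND, WIDTH = 20, 1.0, 0.25, 6

# Log-odds matrix with pseudo counts, so every entry is finite.
PSSM = {
    r: [10 * math.log2((k + ALPHA) / (N + 4 * ALPHA) / BACKGROUND) for k in row]
    for r, row in COUNTS.items()
}

def scan(sequence, pssm=PSSM, width=WIDTH):
    """Yield (start, score) for every window of the motif width."""
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        yield start, sum(pssm[r][c] for c, r in enumerate(window))

target = "GGTTAACATTGGCA"  # made-up sequence with AACATT starting at index 4
best_start, best_score = max(scan(target), key=lambda hit: hit[1])
print(best_start, target[best_start:best_start + WIDTH], round(best_score, 1))
```

The cost is proportional to sequence length times motif width, so a width of 5 to 20 keeps even this naive scan fast.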


Motif databases

PROSITE: Important binding sites. "What motifs does my protein have?"
  • Profiles
  • Pattern notation
  • Careful documentation

BLOCKS: Origin of BLOSUM. Presents multialignments! Assembled from the most conserved parts of domains.

PRINTS: "What motif combinations does my protein have?"


Next time

• PSI-Blast
• Protein domains
• Domain databases
• Hidden Markov models?