Bioinformatics Algorithms RNDr. David Hoksza, Ph.D. http://siret.cz/hoksza Patterns, Profiles and Motifs

Bioinformatics Algorithms - Univerzita Karlovasiret.ms.mff.cuni.cz/hoksza/teaching/vscht/bioinfoalgo/...Outline • Motivation • Consensus sequences • Position specific scoring

May 21, 2020


Bioinformatics Algorithms

RNDr. David Hoksza, Ph.D. http://siret.cz/hoksza

Patterns, Profiles and Motifs


Outline

• Motivation

• Consensus sequences

• Position specific scoring matrices

• Hidden Markov Models

• Protein families databases

Credits: Based on EMBnet course “An introduction to Patterns, Profiles, HMMs and PSI-BLAST”


Motivation

• An MSA contains conserved regions corresponding to
  o signals (promoters, …)
  o common structural motifs
  o chemical reactivity (active sites, …)

• When encountering a new sequence, one is interested in assigning it to a set of known sequences
  o description of a set of sequences
  o assigning a new sequence to a set of sequences
  o scoring of the assignment

• Models of conserved regions
  o consensus sequence
  o patterns
  o position specific scoring matrix (PSSM)
  o Hidden Markov Models (HMM)


Consensus Sequence

• The simplest method to build a model from a multiple sequence alignment

• Principle
  o majority wins
  o skip too much variation

• Algorithm
  1. Count the symbol distribution in each column independently.
  2. For each column with a clear majority of one symbol, put that symbol at the respective position in the consensus sequence.
  3. Fill the remaining positions with the * symbol.


G H E G V G K V V K L G A G A

G H E K K G Y F E D R G P S A

G H E G Y G G R S R G G G Y S

G H E F E G P K G C G A L Y I

G H E L R G T T F M P A L E C

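Symbols observed in each column, most frequent first (. = no further symbol):

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 G  H  E  G  V  G  K  V  V  K  G  G  L  Y  A
 .  .  .  K  K  .  Y  F  E  D  L  A  A  G  S
 .  .  .  F  Y  .  G  R  S  R  R  .  P  S  I
 .  .  .  L  E  .  P  K  G  C  P  .  G  E  C
 .  .  .  .  R  .  T  T  F  M  .  .  .  .  .

GHE**G*****G***

Consensus sequence to be used to scan a sequence database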
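The three algorithm steps can be sketched in a few lines of Python. This is only an illustration: the slides do not say what counts as a "clear majority", so the threshold `min_fraction=0.5` (strictly more than half of the sequences) is an assumption, as are the names `ALIGNMENT` and `consensus`.

```python
from collections import Counter

# The five aligned sequences from the example above.
ALIGNMENT = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]


def consensus(alignment, min_fraction=0.5, wildcard="*"):
    """Return the consensus of equally long, gap-free aligned sequences.

    A column contributes its most common symbol only if that symbol occurs
    in more than `min_fraction` of the sequences (assumed threshold);
    otherwise the wildcard is emitted.
    """
    n_seqs = len(alignment)
    result = []
    for column in zip(*alignment):  # iterate over alignment columns
        symbol, count = Counter(column).most_common(1)[0]
        result.append(symbol if count / n_seqs > min_fraction else wildcard)
    return "".join(result)


print(consensus(ALIGNMENT))
```

With this threshold only columns 1, 2, 3, 6 and 12 keep a symbol; a stricter threshold would also turn column 12 (G in 3 of 5 sequences) into a wildcard.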


Consensus Sequence – Pros & Cons

Pros
• Simple
• Easy to implement

Cons
• Symbol distribution is not present in the resulting sequence
• Highly dependent on the training set
• Binary
  o only tells whether a query sequence matches the consensus sequence (CS), not how well


Pattern

• Regular expressions for biological sequences
  o describe a set of sequences within one expression

• Prosite syntax
  o IUPAC one-letter codes
  o neighboring residues are delimited by a ‘-’
  o ‘x’ is treated as a wildcard character
  o any of the symbols between [] can be used at that position
    • [AG] … alanine or glycine
  o none of the symbols between {} can be used at that position
    • {AG} … anything except alanine or glycine
  o () … repetitions
    • [AG](2) … 2 repetitions of alanine or glycine
    • x(3,5) … 3 to 5 repetitions of any letter
  o a range is allowed only with ‘x’, i.e., A(2,4) is not a valid pattern element
  o a pattern restricted to the N- or C-terminal of a sequence starts with a ‘<’ symbol or, respectively, ends with a ‘>’ symbol


Pattern - Example

<A-x-[ST](2)-x(0,1)-{V}

• an alanine at the N-terminus
• followed by any amino acid
• followed by a serine or threonine twice
• followed or not by any residue
• followed by any amino acid except valine
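Because a Prosite pattern is essentially a regular expression, it can be translated element by element. The sketch below is a simplified illustration covering only the syntax elements listed on the previous slide; the function name `prosite_to_regex` is hypothetical, and rarer constructs (for example anchors placed inside brackets) are deliberately not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a (simplified) Prosite pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    anchored_start = pattern.startswith("<")  # N-terminal restriction
    anchored_end = pattern.endswith(">")      # C-terminal restriction
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        # Separate the residue part from an optional (n) or (n,m) suffix.
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        body, low, high = match.groups()
        if body in ("x", "X"):
            regex = "."                        # wildcard
        elif body.startswith("[") and body.endswith("]"):
            regex = body                       # allowed residues
        elif body.startswith("{") and body.endswith("}"):
            regex = "[^" + body[1:-1] + "]"    # forbidden residues
        else:
            regex = re.escape(body)            # a single literal residue
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")


regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
print(regex)
print(bool(re.search(regex, "ASSTL")))  # alanine at the N-terminus, ends in a non-valine
print(bool(re.search(regex, "AKSTV")))  # valine in the last position is excluded
```

Scanning a sequence database then reduces to running `re.search` (or `re.finditer`) with the translated expression over each sequence.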


Pattern – Example (cont.)

• http://www.ibiblio.org/pub/academic/biology/molbio/data/prosite/prosite.lis

• Post-translational signatures
  o cAMP- and cGMP-dependent protein kinase phosphorylation site
    • [RK](2)-x-[ST]
  o Tyrosine kinase phosphorylation site
    • [RK]-x(2)-[DE]-x(3)-Y or [RK]-x(3)-[DE]-x(2)-Y

• Enzymes
  o Peroxidases proximal heme-ligand signature
    • [DET]-[LIVMTA]-{NSYL}-{RPFC}-[LIVM]-[LIVMSTAG]-[SAG]-[LIVMSTAG]-H-[STA]-[LIVMFY]

• Receptors
  o G-protein coupled receptors family 1 signature
    • [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x-{PQ}-[LIVMNQGA]-{RK}-{RK}-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-{PE}-x-[LIVM]


G H E G V G K V V K L G A G A

G H E K K G Y F E D R G P S A

G H E G Y G G R S R G G G Y S

G H E F E G P K G C G A L Y I

G H E L R G T T F M P A L E C

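Symbols observed in each column, most frequent first (. = no further symbol):

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 G  H  E  G  V  G  K  V  V  K  G  G  L  Y  A
 .  .  .  K  K  .  Y  F  E  D  L  A  A  G  S
 .  .  .  F  Y  .  G  R  S  R  R  .  P  S  I
 .  .  .  L  E  .  P  K  G  C  P  .  G  E  C
 .  .  .  .  R  .  T  T  F  M  .  .  .  .  .

G-H-E-x(2)-G-x(5)-[GA]-x(3)

Pattern to be used to scan a sequence database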


Patterns – Pros & Cons

Pros
• Easy to implement
• Easy to understand for anyone
• Able to express the motif better than a consensus sequence

Cons
• Symbol distribution is not present in the resulting pattern
• Highly dependent on the training set
• Small patterns generate many hits
  o possible false positives
• Binary
  o only tells whether a query sequence matches the pattern, not how well


Patterns - Exercise

• Build a pattern for

WFFKGIADKDAERHLLA
WFFKNLEQKDAEARLLA
WFFKR---KDAERQLLA
WFFGTI---DAERQLLA
WFFKDIPTKDAERQLLA
WYFG----RESERLLLA
WYFGKIPLKDAERQLLA
WYFGKLRAKDTERLLLL


Position Specific Scoring Matrix

• A Position Specific Scoring Matrix (PSSM) expresses the likelihood of a letter appearing at a given position
  o symbols × positions matrix

• Based on counts of letters at the positions

G H E G V G K V V K L G A G A

G H E K K G Y F E D R G P S A

G H E G Y G G R S R G G G Y S

G H E F E G P K G C G A L Y I

G H E L R G T T F M P A L E C

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

A 0 0 0 0 0 0 0 0 0 0 0 2 1 0 2

C 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1

D 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

E 0 0 5 0 1 0 0 0 1 0 0 0 0 1 0

F 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0

G 5 0 0 2 0 5 1 0 1 0 2 3 1 1 0

H 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0

I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

K 0 0 0 1 1 0 1 1 0 1 0 0 0 0 0

L 0 0 0 1 0 0 0 0 0 0 1 0 2 0 0

M 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

N 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

P 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0

Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

R 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0

S 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1

T 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0

V 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0

W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Y 0 0 0 0 1 0 1 0 0 0 0 0 0 2 0

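col 1: $f_{A,1} = \frac{0}{5}, \dots, f_{G,1} = \frac{5}{5}, \dots$

col 2: $f_{A,2} = \frac{0}{5}, \dots, f_{H,2} = \frac{5}{5}, \dots$

…

col 4: $f_{A,4} = \frac{0}{5}, \dots, f_{F,4} = \frac{1}{5}, f_{G,4} = \frac{2}{5}, \dots$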


PSSM – Pseudo-Counts

• A small training set implies some zero values in the counts matrix

• The probability of occurrence of any symbol is not zero

• Pseudo-counts
  o adding small values to all frequencies (both observed and non-observed) so that non-observed symbols do not get a zero frequency
  o pseudo-count of 1:

col 1: $f_{A,1} = \frac{0+1}{5+20} = 0.04, \dots, f_{G,1} = \frac{5+1}{5+20} = 0.24, \dots$

col 2: $f_{A,2} = \frac{0+1}{5+20} = 0.04, \dots, f_{H,2} = \frac{5+1}{5+20} = 0.24, \dots$

…

col 4: $f_{A,4} = \frac{0+1}{5+20} = 0.04, \dots, f_{F,4} = \frac{1+1}{5+20} = 0.08, f_{G,4} = \frac{2+1}{5+20} = 0.12, \dots$
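The pseudo-count arithmetic above can be checked with a short script. It is a minimal sketch that assumes a pseudo-count of 1 for each of the 20 amino acids, as on the slide; `column_frequencies` and the constant names are illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The five aligned sequences used throughout the PSSM example.
ALIGNMENT = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]


def column_frequencies(alignment, pseudocount=1):
    """Return one {residue: frequency} dict per column, with pseudo-counts."""
    # Denominator: number of sequences plus one pseudo-count per residue type.
    denominator = len(alignment) + pseudocount * len(AMINO_ACIDS)
    frequencies = []
    for column in zip(*alignment):
        counts = Counter(column)  # missing residues count as 0
        frequencies.append(
            {aa: (counts[aa] + pseudocount) / denominator for aa in AMINO_ACIDS}
        )
    return frequencies


freqs = column_frequencies(ALIGNMENT)
print(freqs[0]["A"], freqs[0]["G"])  # column 1: (0+1)/25 and (5+1)/25
print(freqs[3]["F"], freqs[3]["G"])  # column 4: (1+1)/25 and (2+1)/25
```

Each column still sums to 1, because the 20 added pseudo-counts are matched by the 20 added to the denominator.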


PSSM - Computation

• The resulting score for position $i, j$ is computed as a log-likelihood ratio with respect to the null model (each amino acid is observed with an identical frequency in a random sequence)

$$s_{ij} = \log\left(\frac{f'_{ij}}{q_i}\right)$$

• $f'_{ij}$ … pseudo-count modified observed frequency of residue $i$ at position $j$
• $q_i$ … expected frequency of residue $i$ in a random sequence
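The score formula can be tried out on the example alignment. The sketch below makes two assumptions that the slide does not state explicitly: the logarithm is taken to base 2, and the null model is uniform with $q_i = 1/20$. Under these assumptions a residue seen in all five sequences scores about 2.3 and one seen once about 0.7, while an unobserved residue scores about -0.32. The name `build_pssm` is illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

ALIGNMENT = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]


def build_pssm(alignment, pseudocount=1):
    """Return {residue: [score per column]} using s = log2(f' / q)."""
    background = 1 / len(AMINO_ACIDS)  # assumed uniform null model
    denominator = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = {aa: [] for aa in AMINO_ACIDS}
    for column in zip(*alignment):
        counts = Counter(column)
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / denominator
            pssm[aa].append(math.log2(frequency / background))
    return pssm


pssm = build_pssm(ALIGNMENT)
for aa in "AGH":
    print(aa, " ".join(f"{score:5.1f}" for score in pssm[aa]))
```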


PSSM - Result

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

A -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 1.3 0.7 -0.2 1.3

C -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 0.7

D -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2

E -0.2 -0.2 2.3 -0.2 0.7 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 0.7 -0.2

F -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 0.7 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2

G 2.3 -0.2 -0.2 1.3 -0.2 2.3 0.7 -0.2 0.7 -0.2 1.3 1.7 0.7 0.7 -0.2

H -0.2 2.3 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2

I -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7

K -0.2 -0.2 -0.2 0.7 0.7 -0.2 0.7 0.7 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2

L -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 1.3 -0.2 -0.2

M -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2

N -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2

P -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 0.7 -0.2 0.7 -0.2 -0.2

Q -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2

R -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 0.7 -0.2 0.7 0.7 -0.2 -0.2 -0.2 -0.2

S -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 0.7 0.7

T -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2

V -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 0.7 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2

W -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2

Y -0.2 -0.2 -0.2 -0.2 0.7 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 1.3 -0.2


PSSM - Querying

• The matrix is used as a sliding window which slides across the query sequence

• The PSSM score of a window is the sum of the scores of its residues in the corresponding columns

• The position with the highest PSSM score is reported
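The sliding-window query can be sketched as follows. The PSSM is rebuilt here so that the block is self-contained, again assuming base-2 log-odds against a uniform background; the query sequence and the names `build_pssm` and `best_window` are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

ALIGNMENT = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]


def build_pssm(alignment, pseudocount=1):
    """Return one {residue: log2-odds score} dict per alignment column."""
    denominator = len(alignment) + pseudocount * len(AMINO_ACIDS)
    background = 1 / len(AMINO_ACIDS)  # assumed uniform null model
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / denominator / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_window(pssm, query):
    """Slide the PSSM over the query; return (start index, score) of the best window."""
    width = len(pssm)
    best_start, best_score = None, -math.inf
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Window score = sum of the per-column scores of its residues.
        score = sum(column[residue] for column, residue in zip(pssm, window))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical query: the first training sequence embedded in flanking residues.
query = "MKT" + "GHEGVGKVVKLGAGA" + "LLPQ"
start, score = best_window(build_pssm(ALIGNMENT), query)
print(start, round(score, 2))
```

The reported start index is 0-based; since the motif has constant length, only `len(query) - width + 1` windows need to be scored.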


PSSM - Weighting

• Highly populated families can contain big subfamilies, which can negatively influence the results

• Sequence weighting compensates for the sampling bias


PSSM – Pros & Cons

Pros
• Relatively fast
• Querying is simple to implement
• Match scores are statistically interpretable

Cons
• No insertions or deletions
  o constant-length regions


PSI-BLAST

• Position specific iterated BLAST
  o establishment of profiles
  o using profiles to search a sequence database

• Algorithm
  1. Search the database using BLASTP
  2. Collect high-scoring results and build an MSA
  3. Get a PSSM from the MSA
  4. Use the profile from the PSSM to search against the database using BLASTP
  5. If new hits are identified, add them to the MSA and update the profile
  6. Repeat steps 4 and 5 until stabilization

[Figure: PSI-BLAST iteration diagram. Labels: Query sequence, BLAST, Homologs, MSA, Profile, BLAST, Additional homologs, New profile, Extended profile]


PSI-BLAST – Pros & Cons

Pros
• Can identify up to three times more homologues at the 30% identity level than BLAST
• Fast because it uses BLAST heuristics
• Allows PSSM searches on large databases

Cons
• Profile drift
  o high sensitivity → false positives → biased profile → incorporation in subsequent cycles


Operating Instructions

• Consensus sequences
  o to find highly conserved signatures, for example restriction enzyme sites in DNA

• Patterns
  o to search for small signatures or active sites
  o to communicate with other biologists

• PSSM
  o to model small regions with high variability but constant length


Markov Chains

• A Markov chain is a succession of states $S_i$ ($i = 0, 1, \dots$) connected by transitions. A transition from $S_i$ to $S_j$ has a probability of $P_{ij}$
  o Markov property
    • the next state of a Markov chain depends only on the current state $S$ and not on the sequence of states leading to $S$
      o $P(S_{i_k} \mid S_{i_1}, S_{i_2}, \dots, S_{i_{k-1}}) = P(S_{i_k} \mid S_{i_{k-1}})$
  o A Markov model contains
    • transition probabilities
      o $a_{ij} = P(S_j \mid S_i)$
    • initial probabilities
      o $\pi_i = P(S_i)$

• Traffic lights
  o states
    • red, orange, green
  o transition probabilities
    • P(green→orange) = 1, P(orange→red) = 1, P(red→green) = 1

• Weather
  o states
    • sun, cloud, rain
  o transition probabilities (rows: weather yesterday, columns: weather today)

           sun     cloud   rain
    sun    0.5     0.25    0.25
    cloud  0.375   0.125   0.375
    rain   0.135   0.625   0.375

• Exercise – draw diagram


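Markov Chain Sequence Probability

$$
\begin{aligned}
P(S_{i_1}, S_{i_2}, \dots, S_{i_k}) &= P(S_{i_k} \mid S_{i_1}, S_{i_2}, \dots, S_{i_{k-1}}) \, P(S_{i_1}, S_{i_2}, \dots, S_{i_{k-1}}) \\
&= P(S_{i_k} \mid S_{i_{k-1}}) \, P(S_{i_1}, S_{i_2}, \dots, S_{i_{k-1}}) = \dots \\
&= P(S_{i_k} \mid S_{i_{k-1}}) \, P(S_{i_{k-1}} \mid S_{i_{k-2}}) \cdots P(S_{i_2} \mid S_{i_1}) \, P(S_{i_1})
\end{aligned}
$$

• Probability of a sequence {‘sun’, ‘sun’, ‘rain’, ‘cloud’}
  o initial probabilities: P(‘sun’)=0.5, P(‘cloud’)=0.4, P(‘rain’)=0.1
  o $P(\text{sun}, \text{sun}, \text{rain}, \text{cloud}) = P(\text{cloud} \mid \text{rain}) \, P(\text{rain} \mid \text{sun}) \, P(\text{sun} \mid \text{sun}) \, P(\text{sun}) = 0.625 \cdot 0.25 \cdot 0.5 \cdot 0.5 \approx 0.039$

Transition probabilities (rows: weather yesterday, columns: weather today)

         sun     cloud   rain
  sun    0.5     0.25    0.25
  cloud  0.375   0.125   0.375
  rain   0.135   0.625   0.375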
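The chain-rule product can be computed directly. This sketch reads the table the way the worked example does (row = yesterday's weather, column = today's) and uses the values as printed; the cloud and rain rows do not sum to 1 as printed, so the code only looks values up and does not normalise or sample. The function name `sequence_probability` is illustrative.

```python
INITIAL = {"sun": 0.5, "cloud": 0.4, "rain": 0.1}

# TRANSITION[yesterday][today], values copied from the slide's table.
TRANSITION = {
    "sun":   {"sun": 0.5,   "cloud": 0.25,  "rain": 0.25},
    "cloud": {"sun": 0.375, "cloud": 0.125, "rain": 0.375},
    "rain":  {"sun": 0.135, "cloud": 0.625, "rain": 0.375},
}


def sequence_probability(states, initial, transition):
    """P(s1, ..., sk) = P(s1) * product of P(s_t | s_{t-1})."""
    probability = initial[states[0]]
    for previous, current in zip(states, states[1:]):
        probability *= transition[previous][current]
    return probability


p = sequence_probability(["sun", "sun", "rain", "cloud"], INITIAL, TRANSITION)
print(p)
```

Only the previous state is needed at each step, which is exactly what the Markov property buys: the cost is linear in the sequence length.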


Hidden Markov Models

• A Hidden Markov Model (HMM) is a generalization of Markov models where the system is a Markov process passing through hidden states

• States are not visible, but each state generates (emits) one of $M$ observations ($O_1, \dots, O_M$) with a given probability

• An HMM is defined as $M(S, O, \pi)$ where
  o $S$ = matrix of transition probabilities $a_{ij} = P(S_j \mid S_i)$
  o $O$ = matrix of observation probabilities $b_{im} = P(O_m \mid S_i)$
  o $\pi$ = vector of initial probabilities $\pi_i = P(S_i)$


HMM - Example

• States: ‘Low’ pressure, ‘High’ pressure
• Observations: ‘Rain’, ‘Dry’
• Transition probabilities: P(‘Low’|‘Low’)=0.3, P(‘High’|‘Low’)=0.7, P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8
• Observation/emission probabilities: P(‘Rain’|‘Low’)=0.6, P(‘Dry’|‘Low’)=0.4, P(‘Rain’|‘High’)=0.4, P(‘Dry’|‘High’)=0.6
• Initial probabilities: P(‘Low’)=0.4, P(‘High’)=0.6
  o often two special states are added to represent start and end, where start is connected to the rest of the graph using the initial probabilities

[Figure: state diagram with hidden states Low and High and observations Rain and Dry; edges are labeled with the transition and emission probabilities listed above]


Observation Sequence Probability

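• A sequence of observations can be obtained (explained) in multiple ways with different probabilities

• If we want to calculate the probability of the observation sequence {‘Dry’, ‘Rain’}, we can explain it, e.g., by the state sequence {‘Low’, ‘Low’}:

$$
\begin{aligned}
P(\{\text{Dry}, \text{Rain}\}, \{\text{Low}, \text{Low}\}) &= P(\{\text{Dry}, \text{Rain}\} \mid \{\text{Low}, \text{Low}\}) \, P(\{\text{Low}, \text{Low}\}) \\
&= P(\text{Dry} \mid \text{Low}) \, P(\text{Rain} \mid \text{Low}) \, P(\text{Low} \mid \text{Low}) \, P(\text{Low}) \\
&= 0.4 \cdot 0.6 \cdot 0.3 \cdot 0.4
\end{aligned}
$$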
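The Low/High example is small enough to enumerate every hidden path. The sketch below computes the joint probability of an observation sequence with one given path, and the total probability as the sum over all paths. It assumes that P(‘High’|‘Low’) denotes the transition from Low to High; the function names are illustrative.

```python
from itertools import product

STATES = ("Low", "High")
INITIAL = {"Low": 0.4, "High": 0.6}
# TRANSITION[current][next], read as P(next | current).
TRANSITION = {
    "Low":  {"Low": 0.3, "High": 0.7},
    "High": {"Low": 0.2, "High": 0.8},
}
# EMISSION[state][observation], read as P(observation | state).
EMISSION = {
    "Low":  {"Rain": 0.6, "Dry": 0.4},
    "High": {"Rain": 0.4, "Dry": 0.6},
}


def joint_probability(observations, path):
    """P(observations, path) for one specific sequence of hidden states."""
    p = INITIAL[path[0]] * EMISSION[path[0]][observations[0]]
    for previous, current, obs in zip(path, path[1:], observations[1:]):
        p *= TRANSITION[previous][current] * EMISSION[current][obs]
    return p


def total_probability(observations):
    """P(observations): brute-force sum over every possible hidden path."""
    return sum(
        joint_probability(observations, path)
        for path in product(STATES, repeat=len(observations))
    )


observations = ["Dry", "Rain"]
print(joint_probability(observations, ("Low", "Low")))
print(total_probability(observations))
```

The enumeration grows exponentially with the sequence length, which is why dynamic-programming algorithms are used in practice.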


Profile HMMs

• An HMM can hold the information contained in an MSA → an alternative to PSSM → profile HMM

• If we have a profile $P$ and align a sequence $s$ to it, at each step $i$ we can either
  o match the $i$-th letter of $s$ to $P$ – $M_i$
  o add a gap to $s$ (the corresponding letter in $s$ will be matched with some later position in $P$) – $D_i$
  o add a gap to the profile and align the given position in $s$ with a gap in $P$ – $I_i$

• $M_i$, $D_i$, $I_i$ correspond to the states of the HMM which emit letters of the query sequence with given probabilities (learned from an MSA)

• A path in the HMM shows how a sequence could be aligned to the profile and, moreover, gives a score reflecting the probability of such an alignment


Training an HMM from an MSA


Matching


Matching (cont.)


Viterbi Algorithm

• Any path through a model emits a sequence with an associated probability (product of all the transition and emission probabilities)

• Many paths through the HMM can lead to the same emitted sequence → different alignments to the profile → searching for the most probable path (analogous to the best scoring alignment) → Viterbi algorithm
  o $t(M_u, M_{u+1})$ … transition probability from $M_u$ to $M_{u+1}$
  o $x = x_1, x_2, \dots, x_L$ … emitted sequence
  o $e_{I_u}(x_i)$ … the emission probability for residue $x_i$ from insert state $I_u$

$$v_{M_u}(x_i) = e_{M_u}(x_i) \, \max \begin{cases} v_{M_{u-1}}(x_{i-1}) \, t(M_{u-1}, M_u) \\ v_{I_{u-1}}(x_{i-1}) \, t(I_{u-1}, M_u) \\ v_{D_{u-1}}(x_{i-1}) \, t(D_{u-1}, M_u) \end{cases}$$

$$v_{I_u}(x_i) = p_{x_i} \, \max \begin{cases} v_{M_u}(x_{i-1}) \, t(M_u, I_u) \\ v_{I_u}(x_{i-1}) \, t(I_u, I_u) \end{cases}$$

$$v_{D_u}(x_i) = \max \begin{cases} v_{M_{u-1}}(x_i) \, t(M_{u-1}, D_u) \\ v_{D_{u-1}}(x_i) \, t(D_{u-1}, D_u) \end{cases}$$

$$v_{start}(0) = 1, \quad v_u(0) = 0$$

  o usually log-odds scores are used since probabilities lead to very small values
  o $v_{end}(x_L)$ … log-odds score of the best path

Notes:
• Transitions between I and D are usually not considered.
• The emission probability is based on how often a training sequence matches with the profile.
• The model for the insert state is based on the random model → probability $p_{x_i}$ from the overall AA composition.
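The same max-over-predecessors idea can be illustrated on the small Low/High pressure HMM from the earlier example. Note that this is a plain HMM sketch, not the profile-HMM recurrences with match, insert and delete states; it works in log space, as the slide suggests, and the name `viterbi` and the dictionary layout are illustrative.

```python
import math

STATES = ("Low", "High")
INITIAL = {"Low": 0.4, "High": 0.6}
TRANSITION = {  # TRANSITION[current][next] = P(next | current)
    "Low":  {"Low": 0.3, "High": 0.7},
    "High": {"Low": 0.2, "High": 0.8},
}
EMISSION = {  # EMISSION[state][observation] = P(observation | state)
    "Low":  {"Rain": 0.6, "Dry": 0.4},
    "High": {"Rain": 0.4, "Dry": 0.6},
}


def viterbi(observations):
    """Return (most probable hidden path, its joint probability)."""
    # v[t][s]: log-probability of the best path ending in state s at step t.
    v = [{s: math.log(INITIAL[s]) + math.log(EMISSION[s][observations[0]])
          for s in STATES}]
    backpointers = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for current in STATES:
            # Best predecessor for `current`: the max over incoming transitions.
            best_prev = max(
                STATES,
                key=lambda prev: v[-1][prev] + math.log(TRANSITION[prev][current]),
            )
            scores[current] = (v[-1][best_prev]
                               + math.log(TRANSITION[best_prev][current])
                               + math.log(EMISSION[current][obs]))
            pointers[current] = best_prev
        v.append(scores)
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: v[-1][s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, math.exp(v[-1][last])


path, probability = viterbi(["Dry", "Rain"])
print(path, probability)
```

Working with sums of logarithms avoids the underflow that products of many small probabilities would cause on long sequences.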


Forward Algorithm

• One emitted sequence can be obtained by many paths. Summing the probabilities of all these paths gives the probability of the given sequence being emitted by the HMM
  o $f_{M_u}(x_i)$ … the total probability at the state $M_u$ when the sequence up to and including residue $x_i$ has been emitted

$$f_{M_u}(x_i) = e_{M_u}(x_i) \left[ f_{M_{u-1}}(x_{i-1}) \, t(M_{u-1}, M_u) + f_{I_{u-1}}(x_{i-1}) \, t(I_{u-1}, M_u) + f_{D_{u-1}}(x_{i-1}) \, t(D_{u-1}, M_u) \right]$$

$$f_{I_u}(x_i) = p_{x_i} \left[ f_{M_u}(x_{i-1}) \, t(M_u, I_u) + f_{I_u}(x_{i-1}) \, t(I_u, I_u) \right]$$

$$f_{D_u}(x_i) = f_{M_{u-1}}(x_i) \, t(M_{u-1}, D_u) + f_{D_{u-1}}(x_i) \, t(D_{u-1}, D_u)$$
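Replacing the maximum by a sum gives the forward recursion. As a hedged illustration, the sketch below applies it to the small Low/High pressure HMM from the earlier example rather than to a full profile HMM; the name `forward` and the dictionary layout are illustrative.

```python
STATES = ("Low", "High")
INITIAL = {"Low": 0.4, "High": 0.6}
TRANSITION = {  # TRANSITION[current][next] = P(next | current)
    "Low":  {"Low": 0.3, "High": 0.7},
    "High": {"Low": 0.2, "High": 0.8},
}
EMISSION = {  # EMISSION[state][observation] = P(observation | state)
    "Low":  {"Rain": 0.6, "Dry": 0.4},
    "High": {"Rain": 0.4, "Dry": 0.6},
}


def forward(observations):
    """Return P(observations), summed over all hidden paths."""
    # f[s]: probability of emitting the prefix seen so far and ending in state s.
    f = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        f = {
            current: EMISSION[current][obs]
            * sum(f[prev] * TRANSITION[prev][current] for prev in STATES)
            for current in STATES
        }
    return sum(f.values())


print(forward(["Dry", "Rain"]))
```

For {‘Dry’, ‘Rain’} this sums the four path probabilities 0.0288, 0.0448, 0.0432 and 0.1152, but in time linear in the sequence length instead of enumerating every path.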


Protein Family Databases

• There exist many databases of MSAs and related
  o consensus sequences
  o patterns
  o HMMs
  o …

• Some databases contain multiple representations of families


Prosite

• http://www.expasy.ch/prosite

• Collection of motifs, protein domains, families and functional sites

• Uses generalized profiles (Pftools) and patterns
  o patterns usually have 10-20 AA

• Patterns contain
  o a quality estimation by counting true positives, false negatives and false positives in SWISS-PROT
  o taxonomic range (archaea, eukaryota, …)
  o a SWISS-PROT match list

• Contains the ScanProsite tool
  o allows searching by profile and filtering by taxonomy, length, ID, …


CDD

• http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml

• Conserved Domains Database

• Contains MSAs available as PSSMs
  o NCBI-curated domains based on 3D structure
  o imported domain models (Pfam, TIGRFAM, SMART, COG, KOG, …)

• CD-search
  o search interface for scanning a submitted protein or nucleotide query against CDD
  o uses RPS-BLAST (variant of PSI-BLAST)

• CDART
  o Conserved Domain Architecture Retrieval Tool
  o used to analyze the domain architecture and retrieve proteins with similar architecture


PRINTS

• http://bioinf.man.ac.uk/dbbrowser/PRINTS

• Collection of conserved motifs used to characterize a protein using fingerprints (conserved motifs used to characterize a protein family)

• Fingerprints should encode protein folds and functionalities more flexibly than single motifs can

• Similar to BLOCKS


Pfam

• http://www.sanger.ac.uk/Pfam

• Collection of protein domains and families and respective MSAs

• Uses HMMs (HMMER3 package)

• Versions
  o Pfam-A
    • manually curated
    • over 12,000,000 sequences in over 13,500 families
  o Pfam-B
    • automatically clustered and aligned sequences not covered by Pfam-A


InterPro

• http://www.ebi.ac.uk/interpro/

• Combines protein signatures from a number of member databases into a single searchable resource
  o CATH/Gene3D, PANTHER, Pfam, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, TIGRFAM, …

• INTERPROSCAN
  o allows scanning of sequences against InterPro’s signatures
  o also accessible using web services