Bioinformatics Algorithms
RNDr. David Hoksza, Ph.D. http://siret.cz/hoksza
Patterns, Profiles and Motifs
Outline
• Motivation
• Consensus sequences
• Position specific scoring matrices
• Hidden Markov Models
• Protein families databases
Credits: Based on the EMBnet course “An introduction to Patterns, Profiles, HMMs and PSI-BLAST”
Motivation
• An MSA contains conserved regions corresponding to
  o signals (promoters, …)
  o common structural motifs
  o chemical reactivity (active sites, …)
• When encountering a new sequence, one is interested in assigning it to a set of known sequences, which requires
  o a description of a set of sequences
  o assignment of the new sequence to a set of sequences
  o scoring of the assignment
• Models of conserved regions
  o consensus sequence
  o patterns
  o position specific scoring matrix (PSSM)
  o Hidden Markov Models (HMM)
Consensus Sequence
• The simplest method to build a model from a multiple sequence alignment
• Principle
  o majority wins
  o skip too much variation
• Algorithm
  1. Count the symbol distribution in each column independently.
  2. For each column with a clear majority of one symbol, put that symbol at the respective position of the consensus sequence.
  3. Fill the remaining positions with the * symbol.
Alignment:
 G  H  E  G  V  G  K  V  V  K  L  G  A  G  A
 G  H  E  K  K  G  Y  F  E  D  R  G  P  S  A
 G  H  E  G  Y  G  G  R  S  R  G  G  G  Y  S
 G  H  E  F  E  G  P  K  G  C  G  A  L  Y  I
 G  H  E  L  R  G  T  T  F  M  P  A  L  E  C
Residues observed in each column (most frequent first):
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 G  H  E  G  V  G  K  V  V  K  G  G  L  Y  A
          K  K     Y  F  E  D  L  A  A  G  S
          F  Y     G  R  S  R  R     P  S  I
          L  E     P  K  G  C  P     G  E  C
             R     T  T  F  M
Consensus sequence to be used to scan a sequence database:
GHE**G*****G***
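The three algorithm steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `consensus` and the `min_fraction` threshold are assumptions, and "clear majority" is interpreted here as strictly more than half of the sequences.

```python
from collections import Counter

ALIGNMENT = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]

def consensus(alignment, min_fraction=0.5):
    """Return the consensus string; '*' marks columns without a clear majority."""
    result = []
    for column in zip(*alignment):          # iterate over alignment columns
        symbol, count = Counter(column).most_common(1)[0]
        # Assumption: "clear majority" = strictly more than min_fraction of the rows.
        result.append(symbol if count / len(column) > min_fraction else "*")
    return "".join(result)

print(consensus(ALIGNMENT))  # GHE**G*****G***
```

With this threshold, column 12 (G in three of five sequences) is kept, while columns where the top residue occurs only twice become `*`.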
Consensus Sequence – Pros & Cons
Pros
• Simple
• Easy to implement
Cons
• Symbol distribution is not present in the resulting sequence
• Highly dependent on the training set
• Binary
  o tells only whether a query sequence matches the consensus sequence, not how well
Pattern
• Regular expressions for biological sequences
  o a pattern describes a set of sequences within one expression
• Prosite syntax
  o IUPAC one-letter codes
  o neighboring residues are delimited by a ‘-’
  o ‘x’ is treated as a wildcard character
  o any of the symbols between [] can be used at that position
    • [AG] … alanine or glycine
  o none of the symbols between {} can be used at that position
    • {AG} … anything except alanine or glycine
  o () … repetitions
    • [AG](2) … 2 repetitions of alanine or glycine
    • x(3,5) … 3 to 5 repetitions of any letter
  o a range is allowed only with ‘x’, i.e., A(2,4) is not a valid pattern element
  o a pattern restricted to the N- or C-terminus of a sequence starts with a ‘<’ symbol or ends with a ‘>’ symbol, respectively
Pattern - Example
<A-x-[ST](2)-x(0,1)-{V}
• an alanine at the N-terminus
• followed by any amino acid
• followed by a serine or threonine, twice
• optionally followed by any residue
• followed by any amino acid except valine
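Because Prosite patterns are essentially regular expressions, the syntax elements listed earlier can be translated mechanically. The following sketch shows one way to do it in Python; the helper name `prosite_to_regex` is hypothetical and only the elements described on these slides are handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    prefix = suffix = ""
    if pattern.startswith("<"):             # anchored at the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):               # anchored at the C-terminus
        suffix, pattern = "$", pattern[:-1]
    parts = []
    for element in pattern.split("-"):
        # Split an element into its core and an optional (n) or (n,m) repetition.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core in ("x", "X"):
            core = "."                      # wildcard
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix

regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
print(regex)                                # ^A.[ST]{2}.{0,1}[^V]
print(bool(re.search(regex, "AGSTKL")))     # True
print(bool(re.search(regex, "AGSTV")))      # False
```

The second query fails because the residue after the S-T pair is valine and no further residue is available to satisfy `{V}`.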
Pattern – Example (cont.)
• http://www.ibiblio.org/pub/academic/biology/molbio/data/prosite/prosite.lis
• Post-translational signatures
  o cAMP- and cGMP-dependent protein kinase phosphorylation site
    • [RK](2)-x-[ST]
  o Tyrosine kinase phosphorylation site
    • [RK]-x(2)-[DE]-x(3)-Y or [RK]-x(3)-[DE]-x(2)-Y
• Enzymes
  o Peroxidases proximal heme-ligand signature
    • [DET]-[LIVMTA]-{NSYL}-{RPFC}-[LIVM]-[LIVMSTAG]-[SAG]-[LIVMSTAG]-H-[STA]-[LIVMFY]
• Receptors
  o G-protein coupled receptors family 1 signature
    • [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x-{PQ}-[LIVMNQGA]-{RK}-{RK}-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-{PE}-x-[LIVM]
Alignment:
 G  H  E  G  V  G  K  V  V  K  L  G  A  G  A
 G  H  E  K  K  G  Y  F  E  D  R  G  P  S  A
 G  H  E  G  Y  G  G  R  S  R  G  G  G  Y  S
 G  H  E  F  E  G  P  K  G  C  G  A  L  Y  I
 G  H  E  L  R  G  T  T  F  M  P  A  L  E  C
Residues observed in each column (most frequent first):
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 G  H  E  G  V  G  K  V  V  K  G  G  L  Y  A
          K  K     Y  F  E  D  L  A  A  G  S
          F  Y     G  R  S  R  R     P  S  I
          L  E     P  K  G  C  P     G  E  C
             R     T  T  F  M
Pattern to be used to scan a sequence database:
G-H-E-x(2)-G-x(5)-[GA]-x(3)
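One simple way to derive such a pattern from the alignment is sketched below. The rule used here is an assumption chosen only to illustrate the idea: a fully conserved column becomes a literal, a column with at most `max_choices` different residues becomes a `[...]` set, and everything else becomes a wildcard. The function name `build_pattern` is hypothetical.

```python
from collections import Counter

ALIGNMENT = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]

def build_pattern(alignment, max_choices=2):
    """Build a Prosite-style pattern from an ungapped alignment (heuristic)."""
    elements = []
    for column in zip(*alignment):
        ranked = [aa for aa, _ in Counter(column).most_common()]
        if len(ranked) == 1:
            elements.append(ranked[0])                    # conserved residue
        elif len(ranked) <= max_choices:
            elements.append("[" + "".join(ranked) + "]")  # small residue set
        else:
            elements.append("x")                          # too variable
    # Collapse runs of wildcards: x, x, x -> x(3)
    pattern, run = [], 0
    for element in elements + [None]:                     # None flushes the last run
        if element == "x":
            run += 1
            continue
        if run:
            pattern.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            pattern.append(element)
    return "-".join(pattern)

print(build_pattern(ALIGNMENT))  # G-H-E-x(2)-G-x(5)-[GA]-x(3)
```

Raising `max_choices` makes the pattern more specific to the training set, which illustrates how strongly the result depends on the sequences used to build it.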
Patterns – Pros & Cons
Pros
• Easy to implement
• Easy to understand for anyone
• Can express the motif better than a consensus sequence
Cons
• Symbol distribution is not present in the resulting pattern
• Highly dependent on the training set
• Small patterns generate a lot of hits
  o possible false positives
• Binary
  o tells only whether a query sequence matches the pattern, not how well
Patterns - Exercise
• Build a pattern for
  WFFKGIADKDAERHLLA
  WFFKNLEQKDAEARLLA
  WFFKR---KDAERQLLA
  WFFGTI---DAERQLLA
  WFFKDIPTKDAERQLLA
  WYFG----RESERLLLA
  WYFGKIPLKDAERQLLA
  WYFGKLRAKDTERLLLL
Position Specific Scoring Matrix
• A Position Specific Scoring Matrix (PSSM) expresses the likelihood of a letter appearing at a given position
  o symbols × positions matrix
• Based on counts of letters at the positions
G H E G V G K V V K L G A G A
G H E K K G Y F E D R G P S A
G H E G Y G G R S R G G G Y S
G H E F E G P K G C G A L Y I
G H E L R G T T F M P A L E C
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A 0 0 0 0 0 0 0 0 0 0 0 2 1 0 2
C 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
D 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
E 0 0 5 0 1 0 0 0 1 0 0 0 0 1 0
F 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
G 5 0 0 2 0 5 1 0 1 0 2 3 1 1 0
H 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
K 0 0 0 1 1 0 1 1 0 1 0 0 0 0 0
L 0 0 0 1 0 0 0 0 0 0 1 0 2 0 0
M 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
N 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0
Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
R 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0
S 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1
T 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
V 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y 0 0 0 0 1 0 1 0 0 0 0 0 0 2 0
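Relative frequencies (count divided by the number of sequences):
col 1: $f_{A,1} = 0/5$, …, $f_{G,1} = 5/5$, …
col 2: $f_{A,2} = 0/5$, …, $f_{H,2} = 5/5$, …
…
col 4: $f_{A,4} = 0/5$, …, $f_{F,4} = 1/5$, $f_{G,4} = 2/5$, …
…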
PSSM – Pseudo-Counts
• A small training set leads to some zero values in the counts matrix
• However, the probability of occurrence of any symbol is not zero
• Pseudo-counts
  o a small value is added to all counts (both observed and non-observed) so that no frequency is zero
  o with a pseudo-count of 1 (5 sequences, 20 amino acids):
    col 1: $f'_{A,1} = \frac{0+1}{5+20} = 0.04$, …, $f'_{G,1} = \frac{5+1}{5+20} = 0.24$, …
    col 2: $f'_{A,2} = \frac{0+1}{5+20} = 0.04$, …, $f'_{H,2} = \frac{5+1}{5+20} = 0.24$, …
    …
    col 4: $f'_{A,4} = \frac{0+1}{5+20} = 0.04$, …, $f'_{F,4} = \frac{1+1}{5+20} = 0.08$, $f'_{G,4} = \frac{2+1}{5+20} = 0.12$, …
    …
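The pseudo-count frequencies can be reproduced with a short script. This is a sketch under the assumptions of the slide (additive pseudo-count of 1, 20 amino acids); the function name `column_frequencies` is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALIGNMENT = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]

def column_frequencies(alignment, pseudocount=1.0):
    """Per-column residue frequencies with an additive pseudo-count."""
    # Denominator: number of sequences + pseudo-count for each of the 20 symbols.
    denominator = len(alignment) + pseudocount * len(AMINO_ACIDS)
    table = []
    for column in zip(*alignment):
        counts = Counter(column)            # missing residues count as 0
        table.append({aa: (counts[aa] + pseudocount) / denominator
                      for aa in AMINO_ACIDS})
    return table

freqs = column_frequencies(ALIGNMENT)
print(freqs[0]["A"], freqs[0]["G"])         # column 1: A and G
print(freqs[3]["F"], freqs[3]["G"])         # column 4: F and G
```

Setting `pseudocount=0` gives the raw relative frequencies, including the zeros that the pseudo-counts are meant to avoid.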
PSSM - Computation
• The resulting score for residue $i$ at position $j$ is computed as the log-likelihood ratio with respect to the null model (each amino acid is observed with an identical frequency in a random sequence)
$$s_{ij} = \log \frac{f'_{ij}}{q_i}$$
• $f'_{ij}$ … pseudo-count modified observed frequency of residue $i$ at position $j$
• $q_i$ … expected frequency of residue $i$ in a random sequence
PSSM - Result
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 1.3 0.7 -0.2 1.3
C -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 0.7
D -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2
E -0.2 -0.2 2.3 -0.2 0.7 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 0.7 -0.2
F -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 0.7 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
G 2.3 -0.2 -0.2 1.3 -0.2 2.3 0.7 -0.2 0.7 -0.2 1.3 1.7 0.7 0.7 -0.2
H -0.2 2.3 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
I -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7
K -0.2 -0.2 -0.2 0.7 0.7 -0.2 0.7 0.7 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2
L -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 1.3 -0.2 -0.2
M -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2
N -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
P -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 0.7 -0.2 0.7 -0.2 -0.2
Q -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
R -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 0.7 -0.2 0.7 0.7 -0.2 -0.2 -0.2 -0.2
S -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 0.7 0.7
T -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 0.7 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
V -0.2 -0.2 -0.2 -0.2 0.7 -0.2 -0.2 0.7 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
W -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
Y -0.2 -0.2 -0.2 -0.2 0.7 -0.2 0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 1.3 -0.2
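The score matrix can be recomputed from the alignment as sketched below. Two assumptions are made that the slides do not state explicitly: the logarithm is taken to base 2 and the null model is uniform ($q_i = 1/20$). Under these assumptions the positive entries of the table (0.7, 1.3, 1.7, 2.3) are reproduced, while non-observed residues come out at about -0.32 rather than the -0.2 listed in the table. The function name `build_pssm` is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALIGNMENT = [
    "GHEGVGKVVKLGAGA",
    "GHEKKGYFEDRGPSA",
    "GHEGYGGRSRGGGYS",
    "GHEFEGPKGCGALYI",
    "GHELRGTTFMPALEC",
]

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per column mapping residue -> log2-odds score."""
    background = 1.0 / len(AMINO_ACIDS)     # assumed uniform null model q_i
    denominator = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / denominator / background)
            for aa in AMINO_ACIDS
        })
    return pssm

pssm = build_pssm(ALIGNMENT)
for residue, column in [("G", 1), ("G", 12), ("G", 4), ("F", 4), ("A", 1)]:
    print(residue, column, round(pssm[column - 1][residue], 2))
```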
PSSM - Querying
• The matrix is used as a sliding window which slides across the query sequence
• The PSSM score of a window is the sum of the scores of its residues in the respective columns
• The position with the highest PSSM score is reported
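The sliding-window query can be sketched as follows. The function name `scan` and the three-column toy matrix over a reduced alphabet are purely illustrative assumptions; a real PSSM would have one score per amino acid in every column.

```python
def scan(pssm, query):
    """Slide the PSSM over the query; return (best_offset, best_score)."""
    width = len(pssm)
    best_offset, best_score = None, float("-inf")
    for offset in range(len(query) - width + 1):
        window = query[offset:offset + width]
        # Window score = sum of the per-column scores of the aligned residues.
        score = sum(column[residue] for column, residue in zip(pssm, window))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score

# Hypothetical toy matrix favouring the motif G-H-E (illustration only).
TOY_PSSM = [
    {"G": 2.0, "H": -1.0, "E": -1.0, "A": -1.0},
    {"G": -1.0, "H": 2.0, "E": -1.0, "A": -1.0},
    {"G": -1.0, "H": -1.0, "E": 2.0, "A": -1.0},
]

print(scan(TOY_PSSM, "AAGHEAG"))  # (2, 6.0)
```

The returned offset is 0-based. Because the window has a fixed width, insertions or deletions in the query cannot be modelled.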
PSSM - Weighting
• Highly populated families can contain big subfamilies which can negatively influence the results
• Sequence weighting compensates for the sampling bias
PSSM – Pros & Cons
Pros
• Relatively fast
• Querying is simple to implement
• Match scores are statistically interpretable
Cons
• No insertions or deletions
  o constant-length regions
PSI-BLAST
• Position specific iterated BLAST
  o establishment of profiles
  o using profiles to search a sequence database
• Algorithm
  1. Search the database using BLASTP
  2. Collect high scoring results and build an MSA
  3. Get a PSSM from the MSA
  4. Use the profile from the PSSM to search against the database using BLASTP
  5. If new hits are identified, add them to the MSA and update the profile
  6. Repeat steps 4 and 5 until stabilization
[Figure: flowchart with the boxes Query sequence, Homologs, Profile, Additional homologs and New profile / Extended profile, connected by the steps BLAST and MSA]
PSI-BLAST – Pros & Cons
Pros
• Can identify up to three times more homologues at about 30% sequence identity than BLAST
• Fast because it uses the BLAST heuristics
• Allows PSSM searches on large databases
Cons
• Profile drift
  o high sensitivity → false positives → biased profile → incorporation in subsequent cycles
Operating Instructions
• Consensus sequences
  o to find highly conserved signatures, for example restriction enzyme sites in DNA
• Patterns
  o to search for small signatures or active sites
  o to communicate with other biologists
• PSSM
  o to model small regions with high variability but constant length
Markov Chains
• A Markov chain is a succession of states $S_i$ ($i = 0, 1, \dots$) connected by transitions. A transition from $S_i$ to $S_j$ has a probability $P_{ij}$
  o Markov property
    • the next state of a Markov chain depends only on the current state $S$ and not on the sequence of states leading to $S$
    • $P(S_{i_k} \mid S_{i_1}, S_{i_2}, \dots, S_{i_{k-1}}) = P(S_{i_k} \mid S_{i_{k-1}})$
  o a Markov model contains
    • transition probabilities $a_{ij} = P(S_j \mid S_i)$
    • initial probabilities $\pi_i = P(S_i)$
• Traffic lights
  o states
    • red, orange, green
  o transition probabilities
    • P(green→orange)=1, P(orange→red)=1, P(red→green)=1
• Weather
  o states
    • sun, cloud, rain
  o transition probabilities (rows: weather yesterday, columns: weather today)
             sun     cloud   rain
    sun      0.5     0.25    0.25
    cloud    0.375   0.125   0.375
    rain     0.135   0.625   0.375
• Exercise – draw diagram
Markov Chain Sequence Probability
$$P(S_{i_1}, S_{i_2}, \dots, S_{i_k}) = P(S_{i_k} \mid S_{i_1}, S_{i_2}, \dots, S_{i_{k-1}})\, P(S_{i_1}, S_{i_2}, \dots, S_{i_{k-1}})$$
$$= P(S_{i_k} \mid S_{i_{k-1}})\, P(S_{i_1}, S_{i_2}, \dots, S_{i_{k-1}}) = \dots$$
$$= P(S_{i_k} \mid S_{i_{k-1}})\, P(S_{i_{k-1}} \mid S_{i_{k-2}}) \cdots P(S_{i_2} \mid S_{i_1})\, P(S_{i_1})$$
• Probability of the sequence {‘sun’, ‘sun’, ‘rain’, ‘cloud’}
  o initial probabilities: P(‘sun’)=0.5, P(‘cloud’)=0.4, P(‘rain’)=0.1
  o P(‘sun’, ‘sun’, ‘rain’, ‘cloud’) = P(‘cloud’|‘rain’) * P(‘rain’|‘sun’) * P(‘sun’|‘sun’) * P(‘sun’) = 0.625 * 0.25 * 0.5 * 0.5
Transition probabilities (rows: weather yesterday, columns: weather today)
         sun     cloud   rain
sun      0.5     0.25    0.25
cloud    0.375   0.125   0.375
rain     0.135   0.625   0.375
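The chain rule for a Markov chain translates directly into a loop over consecutive state pairs. The sketch below copies the table values exactly as printed, reading rows as yesterday's weather, which is the reading used in the worked example. As printed, not every row sums to 1, so the code only multiplies the listed entries and does not attempt to sample from the chain. The function name `sequence_probability` is hypothetical.

```python
INITIAL = {"sun": 0.5, "cloud": 0.4, "rain": 0.1}

# TRANSITION[yesterday][today], copied from the table as printed.
TRANSITION = {
    "sun":   {"sun": 0.5,   "cloud": 0.25,  "rain": 0.25},
    "cloud": {"sun": 0.375, "cloud": 0.125, "rain": 0.375},
    "rain":  {"sun": 0.135, "cloud": 0.625, "rain": 0.375},
}

def sequence_probability(states):
    """P(s1, ..., sk) = P(s1) * product of P(s_t | s_{t-1})."""
    probability = INITIAL[states[0]]
    for previous, current in zip(states, states[1:]):
        probability *= TRANSITION[previous][current]
    return probability

print(sequence_probability(["sun", "sun", "rain", "cloud"]))  # 0.0390625
```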
Hidden Markov Models
• A Hidden Markov Model (HMM) is a generalization of Markov models where the system is a Markov process passing through hidden states
• States are not visible, but each state generates (emits) one of $M$ observations ($O_1, \dots, O_M$) with a given probability
• An HMM is defined as $M(S, O, \pi)$ where
  o $S$ = matrix of transition probabilities $a_{ij} = P(S_j \mid S_i)$
  o $O$ = matrix of observation probabilities $b_{im} = P(O_m \mid S_i)$
  o $\pi$ = vector of initial probabilities $\pi_i = P(S_i)$
HMM - Example
• States: ‘Low’ pressure, ‘High’ pressure
• Observations: ‘Rain’, ‘Dry’
• Transition probabilities: P(‘Low’|‘Low’)=0.3, P(‘High’|‘Low’)=0.7, P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8
• Observation/emission probabilities: P(‘Rain’|‘Low’)=0.6, P(‘Dry’|‘Low’)=0.4, P(‘Rain’|‘High’)=0.4, P(‘Dry’|‘High’)=0.6
• Initial probabilities: P(‘Low’)=0.4, P(‘High’)=0.6
  o often two special states are added to represent start and end, where start is connected to the rest of the graph using the initial probabilities
[Figure: state diagram with hidden states Low and High, observations Rain and Dry, and edges labelled with the transition and emission probabilities listed above]
Observation Sequence Probability
• A sequence of observations can be obtained (explained) in multiple ways with different probabilities
• If we want to calculate the probability of the sequence of observations {‘Dry’, ‘Rain’}, we can explain it, e.g., by the hidden state sequence {‘Low’, ‘Low’}:
  P({‘Dry’, ‘Rain’}, {‘Low’, ‘Low’})
  = P({‘Dry’, ‘Rain’} | {‘Low’, ‘Low’}) * P({‘Low’, ‘Low’})
  = P(‘Dry’|‘Low’) * P(‘Rain’|‘Low’) * P(‘Low’|‘Low’) * P(‘Low’)
  = 0.4 * 0.6 * 0.3 * 0.4
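For a model this small, every hidden state sequence can simply be enumerated. The sketch below computes the joint probability of one explanation and, by summing over all explanations, the total probability of the observations. The dictionary layout and function names are assumptions made for illustration; the numbers are those of the Low/High example.

```python
from itertools import product

STATES = ("Low", "High")
INITIAL = {"Low": 0.4, "High": 0.6}
# TRANSITION[current][next] and EMISSION[state][observation]
TRANSITION = {"Low": {"Low": 0.3, "High": 0.7},
              "High": {"Low": 0.2, "High": 0.8}}
EMISSION = {"Low": {"Rain": 0.6, "Dry": 0.4},
            "High": {"Rain": 0.4, "Dry": 0.6}}

def joint_probability(observations, path):
    """P(observations, path) for one particular hidden state sequence."""
    probability = INITIAL[path[0]] * EMISSION[path[0]][observations[0]]
    for previous, state, observation in zip(path, path[1:], observations[1:]):
        probability *= TRANSITION[previous][state] * EMISSION[state][observation]
    return probability

def total_probability(observations):
    """Sum of the joint probability over all possible hidden paths."""
    return sum(joint_probability(observations, path)
               for path in product(STATES, repeat=len(observations)))

print(joint_probability(["Dry", "Rain"], ["Low", "Low"]))   # about 0.0288
print(total_probability(["Dry", "Rain"]))                   # about 0.232
```

Enumeration grows exponentially with the sequence length, which is why dynamic programming is used in practice.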
Profile HMMs
• An HMM can capture the information contained in an MSA → an alternative to PSSM → profile HMM
• If we have a profile $P$ and align a sequence $s$ to it, at each step $i$ we can either
  o match the $i$-th letter of $s$ to $P$ – $M_i$
  o add a gap to $s$ (the corresponding letter in $s$ will be matched with some later position in $P$) – $D_i$
  o add a gap to the profile and align the given position in $s$ with a gap in $P$ – $I_i$
• $M_i$, $D_i$, $I_i$ correspond to the states of the HMM, which emit letters of the query sequence with given probabilities (learned from an MSA)
• A path in the HMM shows how a sequence could be aligned to the profile and, moreover, gives a score reflecting the probability with which such an alignment could happen
Training an HMM from an MSA
Matching
Matching (cont.)
Viterbi Algorithm
• Any path through a model emits a sequence with an associated probability (the product of all transition and emission probabilities along the path)
• Many paths through the HMM can lead to the same emitted sequence → different alignments to the profile → search for the most probable path (analogous to the best scoring alignment) → Viterbi algorithm
  o $t(M_u, M_{u+1})$ … transition probability from $M_u$ to $M_{u+1}$
  o $x = x_1, x_2, \dots, x_L$ … emitted sequence
  o $e_{I_u}(x_i)$ … the emission probability for residue $x_i$ from insert state $I_u$ (analogously $e_{M_u}(x_i)$ for match state $M_u$)
$$v_{M_u}(x_i) = e_{M_u}(x_i) \max \begin{cases} v_{M_{u-1}}(x_{i-1})\, t(M_{u-1}, M_u) \\ v_{I_{u-1}}(x_{i-1})\, t(I_{u-1}, M_u) \\ v_{D_{u-1}}(x_{i-1})\, t(D_{u-1}, M_u) \end{cases}$$
$$v_{I_u}(x_i) = p_{x_i} \max \begin{cases} v_{M_u}(x_{i-1})\, t(M_u, I_u) \\ v_{I_u}(x_{i-1})\, t(I_u, I_u) \end{cases}$$
$$v_{D_u}(x_i) = \max \begin{cases} v_{M_{u-1}}(x_i)\, t(M_{u-1}, D_u) \\ v_{D_{u-1}}(x_i)\, t(D_{u-1}, D_u) \end{cases}$$
$$v_{start}(0) = 1, \quad v_u(0) = 0$$
  o usually log-odds scores are used since probabilities lead to very small values
  o $v_{end}(x_L)$ … log-odds score of the best path
• Notes
  o Transitions between I and D are usually not considered.
  o The emission probability is based on how often the training sequences match the profile.
  o The model for the insert state is based on the random model → probability from the overall AA composition (the $p_{x_i}$ in the insert recurrence).
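The dynamic programming idea behind the recurrences is easier to see on a plain HMM than on a full profile HMM. The following sketch therefore runs Viterbi on the two-state Low/High weather example instead of on match, insert and delete states; this substitution, the dictionary layout and the function name `viterbi` are assumptions made for illustration. It works with raw probabilities, whereas log scores would be preferable for long sequences.

```python
STATES = ("Low", "High")
INITIAL = {"Low": 0.4, "High": 0.6}
# TRANSITION[current][next] and EMISSION[state][observation]
TRANSITION = {"Low": {"Low": 0.3, "High": 0.7},
              "High": {"Low": 0.2, "High": 0.8}}
EMISSION = {"Low": {"Rain": 0.6, "Dry": 0.4},
            "High": {"Rain": 0.4, "Dry": 0.6}}

def viterbi(observations):
    """Return (probability, path) of the most probable hidden state path."""
    # best[state] = (probability of the best path ending in state, that path)
    best = {s: (INITIAL[s] * EMISSION[s][observations[0]], [s]) for s in STATES}
    for observation in observations[1:]:
        updated = {}
        for state in STATES:
            # Keep only the best predecessor, as in the max of the recurrences.
            probability, path = max(
                ((best[prev][0] * TRANSITION[prev][state], best[prev][1])
                 for prev in STATES),
                key=lambda item: item[0],
            )
            updated[state] = (probability * EMISSION[state][observation],
                              path + [state])
        best = updated
    return max(best.values(), key=lambda item: item[0])

probability, path = viterbi(["Dry", "Rain"])
print(path, probability)   # ['High', 'High'] with probability about 0.1152
```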
Forward Algorithm
• One emitted sequence can be obtained by many paths. Summing the probabilities of all these paths gives the probability that the given sequence is emitted by the HMM
  o $f_{M_u}(x_i)$ … the total probability at the state $M_u$ when the sequence up to and including residue $x_i$ has been emitted
$$f_{M_u}(x_i) = e_{M_u}(x_i) \left[ f_{M_{u-1}}(x_{i-1})\, t(M_{u-1}, M_u) + f_{I_{u-1}}(x_{i-1})\, t(I_{u-1}, M_u) + f_{D_{u-1}}(x_{i-1})\, t(D_{u-1}, M_u) \right]$$
$$f_{I_u}(x_i) = p_{x_i} \left[ f_{M_u}(x_{i-1})\, t(M_u, I_u) + f_{I_u}(x_{i-1})\, t(I_u, I_u) \right]$$
$$f_{D_u}(x_i) = f_{M_{u-1}}(x_i)\, t(M_{u-1}, D_u) + f_{D_{u-1}}(x_i)\, t(D_{u-1}, D_u)$$
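Replacing the maximum of Viterbi by a sum gives the forward algorithm. As a minimal sketch, the code below applies it to the two-state Low/High weather example rather than to a full profile HMM; this substitution, the dictionary layout and the function name `forward` are assumptions made for illustration.

```python
STATES = ("Low", "High")
INITIAL = {"Low": 0.4, "High": 0.6}
# TRANSITION[current][next] and EMISSION[state][observation]
TRANSITION = {"Low": {"Low": 0.3, "High": 0.7},
              "High": {"Low": 0.2, "High": 0.8}}
EMISSION = {"Low": {"Rain": 0.6, "Dry": 0.4},
            "High": {"Rain": 0.4, "Dry": 0.6}}

def forward(observations):
    """Total probability that the HMM emits the observation sequence."""
    # f[state] = probability of emitting the prefix seen so far and ending in state
    f = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for observation in observations[1:]:
        f = {
            state: EMISSION[state][observation]
            * sum(f[prev] * TRANSITION[prev][state] for prev in STATES)
            for state in STATES
        }
    return sum(f.values())

print(forward(["Dry", "Rain"]))   # about 0.232
```

The result equals the sum over all four hidden paths of length two, but the work grows only linearly with the sequence length.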
Protein Family Databases
• Many databases of MSAs and of related representations exist
  o consensus sequences
  o patterns
  o HMMs
  o …
• Some databases contain multiple representations of families
Prosite
• http://www.expasy.ch/prosite
• Collection of motifs, protein domains, families and functional sites
• Uses generalized profiles (Pftools) and patterns
  o patterns usually have 10-20 AA
• Patterns contain
  o a quality estimation obtained by counting true positives, false negatives and false positives in SWISS-PROT
  o taxonomic range (archaea, eukaryota, …)
  o a SWISS-PROT match list
• Contains the ScanProsite tool
  o allows searching by profile and filtering by taxonomy, length, ID, …
CDD
• http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
• Conserved Domains Database
• Contains MSAs available as PSSMs
  o NCBI-curated domains based on 3D structure
  o imported domain models (Pfam, TIGRFAM, SMART, COG, KOG, …)
• CD-search
  o search interface for scanning a submitted protein or nucleotide query against CDD
  o uses RPS-BLAST (variant of PSI-BLAST)
• CDART
  o Conserved Domain Architecture Retrieval Tool
  o used to analyze the domain architecture and retrieve proteins with similar architecture
PRINTS
• http://bioinf.man.ac.uk/dbbrowser/PRINTS
• Collection of fingerprints (groups of conserved motifs used to characterize a protein family)
• Fingerprints should encode protein folds and functionalities more flexibly than single motifs can
• Similar to BLOCKS
Pfam
• http://www.sanger.ac.uk/Pfam
• Collection of protein domains and families and the respective MSAs
• Uses HMMs (HMMER3 package)
• Versions
  o Pfam-A
    • manually curated
    • over 12,000,000 sequences in over 13,500 families
  o Pfam-B
    • automatically clustered and aligned sequences not covered by Pfam-A
InterPro
• http://www.ebi.ac.uk/interpro/
• Combines protein signatures from a number of member databases into a single searchable resource
  o CATH/Gene3D, PANTHER, Pfam, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, TIGRFAM, …
• INTERPROSCAN
  o allows scanning of sequences against InterPro’s signatures
  o also accessible using web services