Centre for Integrative Bioinformatics VU
1-month Practical Course: Genome Analysis
Sequence comparison by ‘Sequence Harmony’ identifies subtype-specific functional sites
Dec 14, 2015
Significance of Alignment Positions
• Observed occurrence of amino acids at some position in an alignment that deviates from expected may indicate some (functional) significance
• What does ‘deviates from expected’ mean?
• unlikely occurrences
• What is unlikely?
• only (relatively) few possibilities to obtain observed result
Pfam Ig Family Alignment
Aquaporin: Motifs
• NPA: stabilizes loops B and E
• G(a)xxxG(a)xxG(a):
• Crossing of right-hand helical bundles
Andreas Engel and Henning Stahlberg, in: Current Topics in Membranes (2001), Hohmann, Agre & Nielsen (Eds.) Academic Press
Counting…
• Number of possibilities for finding some combination of amino acids:
• which types?
• how many of each?
• Examples:
• WWW: 3 W, only 1 way
• RHH: 1 R, 2 H, three ways
• SHQ: 1 S, 1 H, 1 Q, six ways
Counting… (2)
• ‘Real’ examples:
• WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
• 33 W: only 1 way
• RRRRRRRRRRRRRRRRHHHHHHHHHHHHHHHHH
• 16 R, 17 H: ? ways (~ 233 109 )
• SSSSSHSSCCCCCCCCEEQQEEEEEEEEEQEEE
• 7 S, 1 H, 8 C, 14 E, 3 Q: ??? ways (~ 532 1023 )
• ‘many’ ways
but, we can calculate that!
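The counting can be sketched in a few lines of Python. This is a minimal illustration, assuming that "ways" means the number of distinct orderings of the residues in a column, i.e. the multinomial coefficient N! / (n1! · n2! · …); the function name `count_arrangements` is made up for this example.

```python
from collections import Counter
from math import factorial


def count_arrangements(column: str) -> int:
    """Number of distinct orderings of the residues in `column`.

    This is the multinomial coefficient N! / (n_1! * n_2! * ...),
    where n_k is how often residue type k occurs.
    """
    ways = factorial(len(column))
    for n in Counter(column).values():
        ways //= factorial(n)  # each partial quotient is still an integer
    return ways


examples = [
    "WWW",
    "RHH",
    "SHQ",
    "W" * 33,
    "R" * 16 + "H" * 17,
    "S" * 7 + "H" + "C" * 8 + "E" * 14 + "Q" * 3,
]
for col in examples:
    print(f"{col}: {count_arrangements(col):.3g} ways")
```

The three short examples give 1, 3 and 6 ways, as on the previous slide; a column of a single residue type always gives 1, however long it is.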
Shannon’s ‘Information Entropy’:
• ‘A Mathematical Theory of Communication’, The Bell System Technical Journal, Vol. 27, 1948.
“Can we define a quantity which will measure, in some sense, how much information is ‘produced’ by such a process, or better, at what rate information is produced?”
• He was thinking about the Transmission of Information, i.e., from a Source through some Channel to a Destination.
Solution: Entropy
• the entropy of a set of probabilities pi
• measures information, choice and uncertainty
• zero only if only one pi is not zero
• there is only one choice
• maximal if all pi are equal
• most ‘uncertain’ situation: all options are possible
$$H = -\sum_{i=1}^{n} p_i \log p_i$$
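The two limiting cases listed above can be checked numerically. The sketch below assumes base-2 logarithms (entropy in bits) and the usual convention that terms with p = 0 contribute nothing; the function name `entropy` is chosen for this example only.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits of a list of probabilities.

    Terms with p = 0 are skipped (the limit of p * log p for p -> 0 is 0).
    """
    return sum(-p * log2(p) for p in probs if p > 0)


print(entropy([1.0, 0.0]))   # only one choice
print(entropy([0.9, 0.1]))   # mostly one choice
print(entropy([0.5, 0.5]))   # two equally likely choices
print(entropy([0.25] * 4))   # four equally likely choices
```

With a single possible outcome the entropy is zero; for n equally likely outcomes it reaches its maximum, log2(n).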
Information Content
• Shannon was thinking about the Transmission of Information, i.e., from a Source through some Channel to a Destination.
• …but it applies equally well to any type of ‘message’
• We can use it to measure the level of conservation in columns in an alignment
Simple Example: Sequence Entropy
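Example columns, from all ‘L’ to all ‘A’: LLLLLL, ALLLLL, AALLLL, AAALLL, AAAALL, AAAAAL, AAAAAA
[Figure: entropy H (axis 0 to 1.0) plotted against p (axis 0 to 1.0), with p1 = f(‘L’) and p2 = f(‘A’); annotated points: p1 = 0, p2 = 0, and p1 = p2 = ½]
$$H = -\sum_{i=1}^{n} p_i \log p_i$$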
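The curve can be reproduced for the seven L/A example columns. This is a sketch under two assumptions: the example string is read as seven columns of six residues each, and logarithms are base 2, so that the maximum for two residue types is 1. The helper name `column_entropy` is hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residue frequencies in one alignment column."""
    n = len(column)
    return sum(-(c / n) * log2(c / n) for c in Counter(column).values())


columns = ["LLLLLL", "ALLLLL", "AALLLL", "AAALLL", "AAAALL", "AAAAAL", "AAAAAA"]
for col in columns:
    frac_a = col.count("A") / len(col)
    print(f"{col}  f('A') = {frac_a:.2f}  H = {column_entropy(col):.3f}")
```

The fully conserved columns at both ends give H = 0, the half-and-half column gives the maximum, and the values are symmetric around it.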
Sequence Analysis: Comparing Groups
• Many biological problems relate to questions like:
“Why do these proteins do this, and those proteins not?”
• or
“Why do these patients get sick, and those not?”
The answer can be related to similarities and differences between sequences
• Similarities (conservation) relate to functionally critical positions
• Differences can explain functional differences
Identification of Functional Sites
• Functional differences between Protein (sub-)families
• Current practice:
• use Multiple Sequence Alignment
• look for Conserved Sites within (sub-)families
• (ignore sites that are overall conserved)
Conservation and (functional) Differences:
• Sequence Entropy measures Conservation
• But Sites that are Different are not always Conserved:

                      Conservation in
Test-set    Known     Both    One    Not
SMAD           29       10     14      5
MIP            23        0      7     16
Rab5/6         28       10     14      4
Ras/Ral        12        1     11      0
TOTAL:         92       21     46     25
Identification of Functional Sites (2)
• Functional differences between Protein (sub-)families
• Example Binders vs. Non-Binders:
• sites crucial for binding: conserved
• sites determining ‘non-binding’: not conserved
Take into account Non-Conserved Sites as well!
• comparing Amino-acid Compositions
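TGF-β signalling pathway
[Figure: TGF-β and BMP signalling pathways side by side. TGF-β branch: TGF-β, receptors TβR-II and TβR-I, AR-Smads (phosphorylated), Smad association, nucleus, activation/repression of TGF-β target genes. BMP branch: BMP, receptors BMPR-I and BMPR-II, BR-Smads (phosphorylated), Smad association, nucleus, activation/repression of BMP target genes. Cellular outcomes: division, differentiation, motility, adhesion, programmed cell death. Label: specificity.]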
Smad-MH2 Alignment & Functionally Specific Sites
• 27 known sites of functional specificity
• based mostly on site-specific mutants and characterized on BMPR-I vs. TBR-I binding affinity
Smad2 H.sapiens D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 D.melanogaster D A A P V M Y H E P A F W C S I S Y Y E L N T R V G E T F H A S Q P S I T V D G F T D P S N S E - R F C L G L
Smad2 D.rerio D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 C.auratus D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 R.norvegicus D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 M.musculus D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 D.rerio D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L C L
Smad3 S.scrofa D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 X.laevis D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 H.sapiens D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 M.musculus D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R L C L G L
Smad3 C.auratus D L Q P V T Y C E S A F W C S I S Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N A E - R F C L G L
Smad3 G.gallus D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 S.scrofa D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 R.norvegicus D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad1 S.mansoni T M H P V N Y Q E P K Y W C S I V Y Y E L N N R V G E A F N A S Q L S I I I D G F T D P S N N S D R F C L G L
Smad1 M.musculus D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 H.sapiens D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 S.scrofa D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 R.norvegicus D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 X.tropicalis D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N R N R F C L G L
Smad1 G.gallus D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S I L V D G F T D P S N N K N R F C L G L
Smad1 D.rerio D V H P V A Y Q E P K H W C S I V Y Y E L N N R V G E A F L A S S T S V L V D G F T D P S N N R N R F C L G L
Smad1 C.coturnix D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S I L V D G F T D P S N N K N R F C L G L
Smad5 H.sapiens D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K S R F C L G L
Smad5 M.musculus D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K S R F C L G L
Smad5 R.norvegicus D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P A N N K S R F C L G L
Smad5 G.gallus D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad5 D.rerio D V Q P V E Y Q E P S H W C S I V Y Y E L N N R V G E A Y H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad8 M.musculus D F R P V C Y E E P Q H W C S V A Y Y E L N N R V G E T F Q A S S R S V L I D G F T D P S N N R N R F C L G L
Smad8 R.norvegicus D F R P V C Y E E P L H W C S V A Y Y E L N N R V G E T F Q A S S R S V L I D G F T D P S N N R N R F C L G L
Smad8 G.gallus N F R P V C Y E E P Q H W C S V A Y Y E L N N R V G E T F Q A S S R S I L I D G F T D P S N N K N R F C L G L
(position ruler: 10, 20, 30, 40, 50)
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N E V V E Q T R R H I G K G V R L Y Y I G G E V F A E C L S D S S I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y D W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D N A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H N F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N F H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N F H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N F H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N F H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D T S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C V S D S S I F V Q S R N C N Y Q H G F H P A T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C V S D S S I F V Q S R N C N Y Q H G F H P A T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C V S D S S I F V Q S R N C N Y Q H G F H P A T V C K
(position ruler: 60, 70, 80, 90, 100, 110)
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L S Q S V S Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y R L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C S L K I F S N Q E F A H - - - - L L S R T V H H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T L R M S F V K G W G A E Y H R Q D V
I P S R C S L K I F N N Q E F A E - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A K Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T L R M S F V K G W G A E Y H R Q D V
I P S S C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S S C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S S C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K V F N N Q L F A Q - - - - L L A Q S V H H G F E V V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K V F N N Q L F A Q L L A Q L L A Q S V H H G F E V V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q L F A Q - - - - P L A Q S V N H G F E V V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
(position ruler: 120, 130, 140, 150, 160, 170)
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D R V L T Q M G S P R L P C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P N L R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W V E I H L N G P L Q W L D R V L T Q M G T P R N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E V H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
(position ruler: 180, 190, 200, 210)
Method     Predict    %TP    %FN    %FP
AMAS             6    21%    76%     3%
TreeDet         21    52%    48%    21%
SDPpred         12    31%    59%    10%
Comparing Groups of Sequences: Entropy
• Relative ‘Entropy’ $rE^{A/B}$ of group A vs. group B:
$$rE_i^{A/B} = \sum_x p_{i,x}^{A} \log \frac{p_{i,x}^{A}}{p_{i,x}^{B}}$$
• using probabilities p of amino acid type x at position i
• Degenerate for $p^{B} = 0$, i.e. when A and B are fully different!
• Introduce Relative ‘Entropy’ $rE^{A/AB}$ of A vs. all (‘AB’):
$$rE_i^{A/AB} = \sum_x p_{i,x}^{A} \log \frac{p_{i,x}^{A}}{p_{i,x}^{AB}}$$
• Not degenerate, but still without a fixed upper bound:
• the upper bound depends on the relative size of the groups
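Both variants of the relative entropy can be tried out on a single alignment column. The sketch below assumes base-2 logarithms and simple residue frequencies as the probabilities; the helper names `freqs` and `relative_entropy` and the example columns are made up for illustration.

```python
from collections import Counter
from math import inf, log2


def freqs(column: str) -> dict:
    """Residue frequencies in one alignment column."""
    n = len(column)
    return {aa: c / n for aa, c in Counter(column).items()}


def relative_entropy(p: dict, q: dict) -> float:
    """sum_x p[x] * log2(p[x] / q[x]); infinite if some q[x] = 0 while p[x] > 0."""
    total = 0.0
    for aa, px in p.items():
        qx = q.get(aa, 0.0)
        if qx == 0.0:
            return inf  # the degenerate case
        total += px * log2(px / qx)
    return total


# Hypothetical column split into two groups with no residue type in common.
col_a, col_b = "SSSSSSSC", "EEEEQQEE"
p_a, p_b, p_ab = freqs(col_a), freqs(col_b), freqs(col_a + col_b)
print("rE A/B :", relative_entropy(p_a, p_b))
print("rE A/AB:", relative_entropy(p_a, p_ab))

# Same situation, but group A is only a quarter of all sequences.
small_a, large_b = "SSSS", "E" * 12
print("rE A/AB:", relative_entropy(freqs(small_a), freqs(small_a + large_b)))
```

For fully different groups, rE A/B diverges, while rE A/AB stays finite: it equals log2 of (all sequences / sequences in A), i.e. 1 for two equally sized groups and 2 when A is a quarter of the alignment. That is why its upper bound depends on the relative group sizes.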
Comparing Groups: Sequence Harmony
• Weigh groups A and B equally:
• Take $p^{A} + p^{B}$ instead of $p^{AB}$
$$SH_i^{A/B} = -\sum_x p_{i,x}^{A} \log \frac{p_{i,x}^{A}}{p_{i,x}^{A} + p_{i,x}^{B}}$$
• Defined on the fixed interval [0, 1] (with base-2 logarithms)
• one is complete overlap in composition: Harmony
• zero is no overlap in composition: No Harmony
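A small numerical check of the [0, 1] range may help. This is a sketch, not the published implementation: it assumes base-2 logarithms, a leading minus sign (so that identical compositions give one and disjoint compositions give zero), and plain residue frequencies per group. The function names and example columns are hypothetical.

```python
from collections import Counter
from math import log2


def freqs(column: str) -> dict:
    """Residue frequencies in one alignment column."""
    n = len(column)
    return {aa: c / n for aa, c in Counter(column).items()}


def sequence_harmony(col_a: str, col_b: str) -> float:
    """SH(A/B) = -sum_x pA[x] * log2(pA[x] / (pA[x] + pB[x])) for one position."""
    p_a, p_b = freqs(col_a), freqs(col_b)
    return sum(
        -pa * log2(pa / (pa + p_b.get(aa, 0.0)))
        for aa, pa in p_a.items()
    )


print(sequence_harmony("AAAS", "AAAS"))  # same composition in both groups
print(sequence_harmony("AAAA", "SSSS"))  # no residue type in common
print(sequence_harmony("AAAS", "SSSS"))  # partial overlap
```

Because each group is described by its own frequencies, the two groups are weighted equally regardless of how many sequences each contains. Low values flag positions whose composition differs between the groups.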
Smad-MH2 Alignment & Sequence Harmony
• Walter Pirovano*, K. Anton Feenstra* and Jaap Heringa. “Sequence Comparison by Sequence Harmony Identifies Subtype Specific Functional Sites”, Nucleic Acids Res., in press (2006).
• K. Anton Feenstra, Walter Pirovano and Jaap Heringa. “Sub-type Specific Sites for SMAD Receptor Binding Identified by Sequence Comparison using ‘Sequence Harmony’”, in: From Computational Biophysics to Systems Biology, pp. 73-78. Eds. U.H.E. Hansmann, J. Meinke, S. Mohanty and O. Zimmermann, Jülich, NIC Series, Vol. 34, 2006.
• Elena Marchiori*, Walter Pirovano, Jaap Heringa and K. Anton Feenstra*. “A Feature Selection Algorithm for Detecting Subtype Specific Sites for Smad Receptor Binding”, Bio-ICMLA06, accepted (2006).
Smad2 H.sapiens D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 D.melanogaster D A A P V M Y H E P A F W C S I S Y Y E L N T R V G E T F H A S Q P S I T V D G F T D P S N S E - R F C L G L
Smad2 D.rerio D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 C.auratus D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 R.norvegicus D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 M.musculus D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 D.rerio D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L C L
Smad3 S.scrofa D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 X.laevis D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 H.sapiens D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 M.musculus D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R L C L G L
Smad3 C.auratus D L Q P V T Y C E S A F W C S I S Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N A E - R F C L G L
Smad3 G.gallus D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 S.scrofa D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 R.norvegicus D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad1 S.mansoni T M H P V N Y Q E P K Y W C S I V Y Y E L N N R V G E A F N A S Q L S I I I D G F T D P S N N S D R F C L G L
Smad1 M.musculus D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 H.sapiens D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 S.scrofa D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 R.norvegicus D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 X.tropicalis D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N R N R F C L G L
Smad1 G.gallus D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S I L V D G F T D P S N N K N R F C L G L
Smad1 D.rerio D V H P V A Y Q E P K H W C S I V Y Y E L N N R V G E A F L A S S T S V L V D G F T D P S N N R N R F C L G L
Smad1 C.coturnix D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S I L V D G F T D P S N N K N R F C L G L
Smad5 H.sapiens D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K S R F C L G L
Smad5 M.musculus D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K S R F C L G L
Smad5 R.norvegicus D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P A N N K S R F C L G L
Smad5 G.gallus D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad5 D.rerio D V Q P V E Y Q E P S H W C S I V Y Y E L N N R V G E A Y H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad8 M.musculus D F R P V C Y E E P Q H W C S V A Y Y E L N N R V G E T F Q A S S R S V L I D G F T D P S N N R N R F C L G L
Smad8 R.norvegicus D F R P V C Y E E P L H W C S V A Y Y E L N N R V G E T F Q A S S R S V L I D G F T D P S N N R N R F C L G L
Smad8 G.gallus N F R P V C Y E E P Q H W C S V A Y Y E L N N R V G E T F Q A S S R S I L I D G F T D P S N N K N R F C L G L
(position ruler: 10, 20, 30, 40, 50)
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N E V V E Q T R R H I G K G V R L Y Y I G G E V F A E C L S D S S I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y D W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A T V E M T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D N A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N A A V E L T R R H I G R G V R L Y Y I G G E V F A E C L S D S A I F V Q S P N C N Q R Y G W H P A T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H N F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N F H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N F H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N F H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N F H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D S S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C L S D T S I F V Q S R N C N Y H H G F H P T T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C V S D S S I F V Q S R N C N Y Q H G F H P A T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C V S D S S I F V Q S R N C N Y Q H G F H P A T V C K
L S N V N R N S T I E N T R R H I G K G V H L Y Y V G G E V Y A E C V S D S S I F V Q S R N C N Y Q H G F H P A T V C K
(position ruler: 60, 70, 80, 90, 100, 110)
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L S Q S V S Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y R L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C N L K I F N N Q E F A A - - - - L L A Q S V N Q G F E A V Y Q L T R M C T I R M S F V K G W G A E Y R R Q T V
I P P G C S L K I F S N Q E F A H - - - - L L S R T V H H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T L R M S F V K G W G A E Y H R Q D V
I P S R C S L K I F N N Q E F A E - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A K Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E T V Y E L T K M C T L R M S F V K G W G A E Y H R Q D V
I P S S C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S S C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S S C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q E F A Q - - - - L L A Q S V N H G F E A V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K V F N N Q L F A Q - - - - L L A Q S V H H G F E V V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K V F N N Q L F A Q L L A Q L L A Q S V H H G F E V V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
I P S G C S L K I F N N Q L F A Q - - - - P L A Q S V N H G F E V V Y E L T K M C T I R M S F V K G W G A E Y H R Q D V
[Alignment position ruler: 120-170]
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D R V L T Q M G S P R L P C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S V R C S S M S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P N L R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W I E L H L N G P L Q W L D K V L T Q M G S P S I R C S S V S
T S T P C W V E I H L N G P L Q W L D R V L T Q M G T P R N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E V H L H G P L Q W L D K V L T Q M G S P L N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
T S T P C W I E I H L H G P L Q W L D K V L T Q M G S P H N P I S S V S
[Alignment position ruler: 180-210]
$$SH_i^{A/B} = -\sum_x p_{i,x}^{A} \log \frac{p_{i,x}^{A}}{p_{i,x}^{A} + p_{i,x}^{B}}$$
[Plot: Sequence Harmony (SH, 0 to 1) per alignment position, 260-460]
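A minimal Python sketch of how such a per-position score could be computed from two groups of aligned residues. The function names, the base-2 logarithm and the symmetric average of the A/B and B/A sums are assumptions made for this illustration; with them, identical compositions score 1 and non-overlapping compositions score 0.

```python
from collections import Counter
from math import log2


def frequencies(column):
    """Relative frequency of each residue type in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()}


def directed_harmony(p_a, p_b):
    """Sum over x of -pA * log2(pA / (pA + pB)), written as pA * log2((pA + pB) / pA)."""
    return sum(pa * log2((pa + p_b.get(res, 0.0)) / pa) for res, pa in p_a.items())


def sequence_harmony(col_a, col_b):
    """Symmetric score: average of the A/B and B/A sums (an assumption of this sketch)."""
    p_a, p_b = frequencies(col_a), frequencies(col_b)
    return 0.5 * (directed_harmony(p_a, p_b) + directed_harmony(p_b, p_a))


# Residue counts read from the Smad alignment column listed as Q294:
# group AR has 15 Q, group BR has 1 Q and 16 S.
sh_q294 = sequence_harmony("Q" * 15, "Q" + "S" * 16)
# Column listed as L297: AR has 7 L, 7 M, 1 I; BR has 13 V, 4 I.
sh_l297 = sequence_harmony("L" * 7 + "M" * 7 + "I", "V" * 13 + "I" * 4)
```

Under these assumptions the two example columns come out at about 0.16 and 0.11, in line with the SH values tabulated for those positions.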
[20]
Smad2 H.sapiens D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 D.melanogaster D A A P V M Y H E P A F W C S I S Y Y E L N T R V G E T F H A S Q P S I T V D G F T D P S N S E - R F C L G L
Smad2 D.rerio D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 C.auratus D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 R.norvegicus D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 M.musculus D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L G L
Smad2 D.rerio D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N S E - R F C L C L
Smad3 S.scrofa D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 X.laevis D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 H.sapiens D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 M.musculus D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R L C L G L
Smad3 C.auratus D L Q P V T Y C E S A F W C S I S Y Y E L N Q R V G E T F H A S Q P S L T V D G F T D P S N A E - R F C L G L
Smad3 G.gallus D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 S.scrofa D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad3 R.norvegicus D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G F T D P S N S E - R F C L G L
Smad1 S.mansoni T M H P V N Y Q E P K Y W C S I V Y Y E L N N R V G E A F N A S Q L S I I I D G F T D P S N N S D R F C L G L
Smad1 M.musculus D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 H.sapiens D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 S.scrofa D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 R.norvegicus D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad1 X.tropicalis D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N R N R F C L G L
Smad1 G.gallus D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S I L V D G F T D P S N N K N R F C L G L
Smad1 D.rerio D V H P V A Y Q E P K H W C S I V Y Y E L N N R V G E A F L A S S T S V L V D G F T D P S N N R N R F C L G L
Smad1 C.coturnix D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S I L V D G F T D P S N N K N R F C L G L
Smad5 H.sapiens D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K S R F C L G L
Smad5 M.musculus D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K S R F C L G L
Smad5 R.norvegicus D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P A N N K S R F C L G L
Smad5 G.gallus D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad5 D.rerio D V Q P V E Y Q E P S H W C S I V Y Y E L N N R V G E A Y H A S S T S V L V D G F T D P S N N K N R F C L G L
Smad8 M.musculus D F R P V C Y E E P Q H W C S V A Y Y E L N N R V G E T F Q A S S R S V L I D G F T D P S N N R N R F C L G L
Smad8 R.norvegicus D F R P V C Y E E P L H W C S V A Y Y E L N N R V G E T F Q A S S R S V L I D G F T D P S N N R N R F C L G L
Smad8 G.gallus N F R P V C Y E E P Q H W C S V A Y Y E L N N R V G E T F Q A S S R S I L I D G F T D P S N N K N R F C L G L
Smads: Comparing two Groups (alignment positions 260-310; groups AR and BR)
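The grouping step can be sketched in Python as follows. The rows are the first eight columns of six sequences from the alignment above; the assignment of Smad2/3 to the AR group and Smad1/5/8 to the BR group, and all function names, are assumptions of this sketch.

```python
from collections import Counter

# Assumed group membership: Smad2/3 as AR, Smad1/5/8 as BR.
AR_FAMILIES = {"Smad2", "Smad3"}
BR_FAMILIES = {"Smad1", "Smad5", "Smad8"}

# First eight alignment columns of six rows, copied from the slide.
ROWS = """
Smad2 H.sapiens D L Q P V T Y S
Smad2 D.melanogaster D A A P V M Y H
Smad3 H.sapiens D L Q P V T Y C
Smad1 H.sapiens D V Q A V A Y E
Smad5 H.sapiens D V Q P V A Y E
Smad8 M.musculus D F R P V C Y E
"""


def split_groups(text):
    """Parse 'family species residues...' lines into AR and BR residue rows."""
    groups = {"AR": [], "BR": []}
    for line in text.strip().splitlines():
        family, _species, *residues = line.split()
        if family in AR_FAMILIES:
            groups["AR"].append(residues)
        elif family in BR_FAMILIES:
            groups["BR"].append(residues)
    return groups


def column_counts(rows, col):
    """Count residue types in one column (0-based index) of a group."""
    return Counter(row[col] for row in rows)


groups = split_groups(ROWS)
ar_last = column_counts(groups["AR"], 7)  # last column shown, AR group
br_last = column_counts(groups["BR"], 7)  # last column shown, BR group
```

In this small excerpt the last column holds S, H and C in the AR rows and only E in the BR rows, so the two groups share no residue type at that position.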
[21]
Finding Low-harmony sites in Smad-MH2
Pos.   Sec.str.  SH    AR    BR    Interaction
L263   B1’       0     La    Vfm   SARA
T267   B1’       0     Tm    Acen  SARA
S269   loop      0     CSh   Eq    ?? (putative)
A272   loop      0     A     Kqls  ?? (putative)
F273   loop      0     F     Hy    ?? (putative)
Q284   B2        0     Qt    N     TβR-I
Q294   loop      0.16  Q     Sq    c-Ski/SnoN
P295   B3        0     P     Trl   c-Ski/SnoN
L297   B3        0.11  LMi   Vi    c-Ski/SnoN
T298   B3        0     T     Li    c-Ski/SnoN
S308   L1        0     Sa    N     c-Ski/SnoN
E309   L1        0     E     Krs   c-Ski/SnoN
–      L1        0     –     Nsd   c-Ski/SnoN
A323   H1        0     Ae    S     ALK1/2
V325   H1        0     V     I     ?? (putative)
M327   H1        0     LMq   N     ALK1/2
R334   loop      0.18  Rk    K     ?? (putative)
R337   B5        0     R     H     ?? (putative)
[Smad-MH2 alignment excerpt; position ruler: 270-300]
D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G
D A A P V M Y H E P A F W C S I S Y Y E L N T R V G E T F H A S Q P S I T V D G
D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G
D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G
D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G
D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G
D L Q P V T Y S E P A F W C S I A Y Y E L N Q R V G E T F H A S Q P S L T V D G
D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G
D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G
D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G
D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G
D L Q P V T Y C E S A F W C S I S Y Y E L N Q R V G E T F H A S Q P S L T V D G
D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G
D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G
D L Q P V T Y C E P A F W C S I S Y Y E L N Q R V G E T F H A S Q P S M T V D G
T M H P V N Y Q E P K Y W C S I V Y Y E L N N R V G E A F N A S Q L S I I I D G
D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G
D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G
D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G
D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G
D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G
D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S I L V D G
D V H P V A Y Q E P K H W C S I V Y Y E L N N R V G E A F L A S S T S V L V D G
D V Q A V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S I L V D G
D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G
D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G
D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G
D V Q P V A Y E E P K H W C S I V Y Y E L N N R V G E A F H A S S T S V L V D G
D V Q P V E Y Q E P S H W C S I V Y Y E L N N R V G E A Y H A S S T S V L V D G
D F R P V C Y E E P Q H W C S V A Y Y E L N N R V G E T F Q A S S R S V L I D G
D F R P V C Y E E P L H W C S V A Y Y E L N N R V G E T F Q A S S R S V L I D G
N F R P V C Y E E P Q H W C S V A Y Y E L N N R V G E T F Q A S S R S I L I D G
[22]
Finding Low-harmony sites in Smad-MH2
Method                     Predict  %TP  %FN  %FP
AMAS                       6        21%  76%  3%
TreeDet                    21       52%  48%  21%
SDPpred                    12       31%  59%  10%
Sequence Harmony (SH=0)    32       79%  21%  28%
Sequence Harmony (SH<0.2)  40       93%  7%   33%
[23]
Smad-MH2: Low Harmony Patches
[24]
Smad-MH2: Functional Clusters
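[Figure: Smad-MH2 structure with labelled residues, grouped by interaction partner]
Labelled residues: R462, C463, Q400, R410, W368, Y366, A392, S269, F273, N443, Q294, Q309, L297, L440, N381, A354, V461, S460, Q407, Q364, P360, R365, T267, A272, I341, P295, S308, T298, R337, F346, P378, Q284, V325, A323, R427, M327, T430, R334
Interaction partners: FAST1, Mixer, SARA; c-Ski/SnoN; SARA; TβR-I/ALK1/2; TβR-I/BMPR-I; ? SARA/Mixer; TβR-I/BMPR-I/ALK1/2; ?
Functional classes: receptor-binding; retention & transcription factors; co-repressors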
[25]
Comparison to Other Prediction Methods
[Plot: sensitivity (0-100%) against specificity (80-100%) for AMAS, SDPpred, TreeDet and Sequence Harmony; annotated points: 23 sites, 8 sites]
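How one point of such a sensitivity/specificity comparison could be obtained is sketched below in Python. The scores and the set of known sites are hypothetical placeholders, not data from the slides, and the function name is an assumption. Sites are predicted when their SH is at or below the cutoff, and every known site is assumed to have a score.

```python
def sensitivity_specificity(scores, known_sites, cutoff):
    """scores: {position: SH}; known_sites: positions with known function."""
    predicted = {pos for pos, sh in scores.items() if sh <= cutoff}
    tp = len(predicted & known_sites)      # known sites that were selected
    fn = len(known_sites - predicted)      # known sites that were missed
    fp = len(predicted - known_sites)      # selected sites without known function
    tn = len(scores) - tp - fn - fp        # everything else
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Hypothetical toy input: six positions, three of them known functional sites.
toy_scores = {1: 0.0, 2: 0.1, 3: 0.5, 4: 0.9, 5: 1.0, 6: 0.0}
toy_known = {1, 2, 3}

# One (sensitivity, specificity) point per cutoff traces out a curve.
curve = [sensitivity_specificity(toy_scores, toy_known, c) for c in (0.0, 0.2, 0.5)]
```

Raising the cutoff selects more sites, so sensitivity can only stay equal or rise while specificity can only stay equal or fall.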
[26]
Comparison to Other Prediction Methods (2)
[Plot: sensitivity (0-100%) against specificity (80-100%) for AMAS cumulative, AMAS, SDPpred, TreeDet, SH + Entropy (inc), SH + Entropy (dec) and Sequence Harmony]
[27]
Comparison to Other Prediction Methods (3)
[Plot: sensitivity (0-100%) against specificity (80-100%) for AMAS cumulative, AMAS, SDPpred, TreeDet, SH + Ranges + E(inc), SH + Ranges + E(dec), SH + Entropy (inc), SH + Entropy (dec) and Sequence Harmony; annotations: 18 sites, 2]
[28]
Ras family: Rab5 vs. Rab6
[Plot: sensitivity (0-100%) against specificity (40-100%) for SDPpred, TreeDet, SH + E(dec), SH + E(inc), SH + Ranges + E(dec) and Sequence Harmony]
[29]
Ras Family: Ras vs. Ral
[Plot: sensitivity (0-100%) against specificity (75-100%) for SDPpred, TreeDet, SH + E(dec), SH + E(inc), SH + Ranges + E(dec) and Sequence Harmony]
[30]
MIP family: AQP vs. GLP
[Plot: sensitivity (0-100%) against specificity (0-100%) for SDPpred (5Å), TreeDet (5Å) and Sequence Harmony (5Å)]
[31]
Conclusions Smad-MH2
Sequence Harmony
• 40 Sites of Low Sequence Harmony in Smad-MH2
• different between the AR (TGF-β) and BR (BMP) sub-type Smads
• Low Harmony sites in Smad-MH2 are functionally relevant
• Other methods do not select all known (functional) sites!
Sequence information maps to structure:
• 14 Low Harmony Sites in Smad-MH2 of unknown function
• 11 putative functions from structural considerations
• promising candidates that determine TGF-β/BMP specificity
Next: Analyze Protein-Protein Interactions
• confirm (or refute) putative functions?
[32]
General Conclusions
• Lack of experimental data
• Adequate quality and quantity hard to attain
• Discriminating power of test-sets varies
• Conservation not best identifier for functional differences
• Selections too conservative and not very specific
• Differences, as measured by Sequence Harmony, are a good alternative
• Selections include most known sites, but with somewhat lower specificity
[33]
Brajenovic, M. et al. J. Biol. Chem. 2004;279:12804-12811
Connectivity map of human Par complexes based on TAP purifications and co-immunoprecipitation experiments. The TAP-tagged proteins used as baits are represented as rhomboids. Lines connecting proteins indicate presence in a TAP complex or co-immunoprecipitation (dotted lines). The width of each line represents the degree of sequence coverage of the identification, which depends on the robustness of the interaction but also on the expression level and a number of other factors. Green boxes/lines represent previously known interactors/interactions; red boxes/lines represent novel interactors/interactions. Proteins that are found specifically with only one TAP-protein are grouped in boxes (S1–S6), whereas proteins that are consistently found together with more than one TAP-protein are grouped in modules (M1 and M2).
[36]
Charting protein complexes, signaling pathways, and networks in the immune system
Bauch A, Superti-Furga G. Immunological Reviews 210: 187-207 (Apr 2006)
[37]
Copyright ©2006 by the National Academy of Sciences
Yang, Xiaowen et al. (2006) Proc. Natl. Acad. Sci. USA 103, 17237-17242
Fig. 3. The selective nature of the primary interaction site
Canonical interaction motifs:
Mode I: R/K-X-X-S/T-X-P
Mode II: R/K-X-X-X-S/T-X-P
Mode III: S-W-T-Y (C-term.)
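The three motifs, exactly as written on the slide, translate directly into regular expressions. The Python sketch below scans a one-letter protein sequence for them; the pattern table, the function name and the example sequence are illustrative assumptions, and `finditer` reports non-overlapping matches only.

```python
import re

# X on the slide means "any residue", written as "." in a regular expression.
MOTIFS = {
    "Mode I": re.compile(r"[RK]..[ST].P"),
    "Mode II": re.compile(r"[RK]...[ST].P"),
    "Mode III": re.compile(r"SWTY$"),  # anchored at the C-terminus
}


def find_motifs(sequence):
    """Return (mode, 1-based start, matched text) for every motif hit."""
    hits = []
    for mode, pattern in MOTIFS.items():
        for match in pattern.finditer(sequence):
            hits.append((mode, match.start() + 1, match.group()))
    return hits


# Hypothetical sequence constructed to contain all three motifs.
example_hits = find_motifs("MKRAASEPSWTY")
```

For the constructed example this reports a Mode I hit starting at residue 3, a Mode II hit starting at residue 2 and the C-terminal Mode III hit starting at residue 9.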
[39]
Yang, Xiaowen et al. (2006) Proc. Natl. Acad. Sci. USA 103, 17237-17242
Fig. 5. Dynamic nature of the 14-3-3 dimers. (A) Crystal structure of the apo-isoform looking down the peptide binding grooves, which are labeled open and closed for the individual monomers. (B) Superimposition of all seven closed state 14-3-3 isoforms using only one monomer as the reference, with shown in blue and in green. The other 14-3-3 monomers, which have intermediate positions, are colored transparent gray.
[40]
Yang, Xiaowen et al. (2006) Proc. Natl. Acad. Sci. USA 103, 17237-17242
Fig. 1. Overview of the dimeric 14-3-3 structure. Helices and loops involved in target domain interactions are labeled. Each monomer is colored blue to red from the N to C terminus. An aperture exists at the central dimeric interface, which is marked with a circle.
Yang et al. 2006 Structural basis for protein–protein interactions in the 14-3-3 protein family PNAS 103, 17237
[41]
Yang, Xiaowen et al. (2006) Proc. Natl. Acad. Sci. USA 103, 17237-17242
Fig. 2. Schematic representation of the heterodimerization process involving the 14-3-3 epsilon (green) and zeta (yellow) isoforms. The lines between identified residues indicate specific interactions.
[42]
HIV Differential Progression/Replication
• Differences in disease progression in HIV-infected patients based on:
• Immunotype (e.g., B57 vs. non-B57)
• Occurrence of specific 'escape' mutations
• Aim: apply Sequence Harmony to find (additional) key sites that determine disease progression or viral replication rates
[43]
Input: multiple sequence alignment of capsid protein
[44]
Comparison of multiple groups:
• B57 vs. non-B57
• 'Progressors' (P) vs. 'Long-term non-progressors' (L)
• Early stage vs. Late stage
• Late stage: progressors (P) vs. non-progressors (L)
• this last comparison is especially interesting: what defines 'non-progression'?
[45]
HIV Capsid Specificity: B57 vs. non-B57
• 36 residues selected from the 422-residue alignment
• all below the Sequence Harmony cutoff of 0.9
• 26 sites (excluding gaps), including all 7 known B57 escape mutations
Position (Ali)  Position (Ref)  Sequence Harmony  Rank
251             T242            0.05              1
156             I147            0.44              2
123             -               0.49              5
15              R15             0.50              1
182             S173            0.68              1
122             -               0.70              5
136             -               0.75              7
121             -               0.76              5
257             G248            0.78              1
12              E12             0.80              1
168             V159            0.80              1
62              G62             0.82              1
401             T389            0.83              1
55              E55             0.83              2
127             N124            0.83              5
130             Q127            0.83              7
277             L268            0.84              1
390             -               0.85              1
104             I104            0.87              1
53              T53             0.87              2
273             R264            0.87              1
125             T122            0.87              5
132             -               0.87              7
133             -               0.87              7
134             -               0.87              7
131             -               0.87              7
135             -               0.87              7
111             S111            0.88              1
409             K397            0.88              1
289             T280            0.88              1
224             L215            0.88              1
91              R91             0.89              1
155             A146            0.90              2
379             V370            0.90              1
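Selecting sites by a harmony cutoff and setting aside gap positions can be sketched in Python as below. The six records are the lowest-scoring entries listed above, with a gap in the reference written as `None`; the strict `<` comparison and the function name are assumptions of this sketch.

```python
# (alignment position, reference residue or None for a gap, Sequence Harmony)
SITES = [
    (251, "T242", 0.05),
    (156, "I147", 0.44),
    (123, None, 0.49),
    (15, "R15", 0.50),
    (182, "S173", 0.68),
    (122, None, 0.70),
]


def select_sites(sites, cutoff=0.9, include_gaps=False):
    """Keep sites with SH below the cutoff, lowest harmony first."""
    chosen = [
        site for site in sites
        if site[2] < cutoff and (include_gaps or site[1] is not None)
    ]
    return sorted(chosen, key=lambda site: site[2])


without_gaps = select_sites(SITES)                  # default cutoff, gaps dropped
with_gaps = select_sites(SITES, include_gaps=True)  # gaps kept
strict = select_sites(SITES, cutoff=0.5)            # tighter, illustrative cutoff
```

Keeping the gap handling as a separate switch mirrors the distinction between the full selection and the gap-free count of sites.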
[46]
Output nB57/B57: Structure
[47]
Output: 'Stereotypes'
[48]
Output: Distinct Specificity Regions
n-B57 vs. LP
L-early vs L-late
L vs. P
L vs. P-late
L-late vs. P-late
P-early vs. P-late
[49]
Output: Detail in the sequence(s)
Sequence comparison by ‘Sequence Harmony’ identifies subtype-specific functional sites
… end …