doi:10.1006/jmbi.1999.3488 available online at
http://www.idealibrary.com on J. Mol. Biol. (2000) 296, 919-934
Statistical Analysis of Amino Acid Patterns in Transmembrane
Helices: The GxxxG Motif Occurs Frequently and in Association with
β-branched Residues at Neighboring Positions
Alessandro Senes, Mark Gerstein and Donald M. Engelman*
Department of Molecular Biophysics & Biochemistry, Yale
University, P.O. Box 208114, New Haven, CT 06520-8114, USA
E-mail address of the corresponding author: [email protected]
Abbreviations used: TM, transmembrane; TMD, transmembrane domain;
GpA, glycophorin A; CD, circular dichroism.
0022-2836/00/030919-16 $35.00/0
To find motifs that mediate helix-helix interactions in membrane
proteins, we have analyzed frequently occurring combinations of
residues in a database of transmembrane domains. Our analysis was
performed with a novel formalism, which we call TMSTAT, for exactly
calculating the expectancies of all pairs and triplets of residues
in individual sequences, taking into account differential sequence
composition and the substantial effect of finite length in short
segments. We found that the number of significantly over- and
under-represented pairs and triplets was much greater than the
random expectation. Isoleucine, glycine and valine were the most
common residues in these extreme cases. The main theme observed is
patterns of small residues (Gly, Ala and Ser) at i and i + 4 found
in association with large aliphatic residues (Ile, Val and Leu) at
neighboring positions (i.e. i ± 1 and i ± 2). The most
over-represented pair is formed by two glycine residues at i and
i + 4 (GxxxG, 31.6 % above expectation, p < 1 × 10^-33) and it is
strongly associated with the neighboring β-branched residues Ile
and Val. In fact, the GxxxG pair has been described as part of the
strong interaction motif in the glycophorin A transmembrane dimer,
in which the pair is associated with two Val residues (GVxxGV).
GxxxG is also the major motif identified using TOXCAT, an in vivo
selection system for transmembrane oligomerization motifs. In
conjunction with these experimental observations, our results
highlight the importance of the GxxxG + β-branched motif in
transmembrane helix-helix interactions. In addition, the special
role for the β-branched residues Ile and Val suggested here is
consistent with the hypothesis that residues with constrained
rotameric freedom in helical conformation might reduce the entropic
cost of folding in transmembrane proteins. Additional material is
available at http://engelman.csb.yale.edu/tmstat and
http://bioinfo.mbb.yale.edu/tmstat.
© 2000 Academic Press
Keywords: membrane proteins; protein folding; glycine;
β-branched; sequence analysis
*Corresponding author
Introduction
The two dozen high-resolution structures of integral membrane
proteins available so far have revealed only two simple folds, the
helical bundle and the closed beta barrel. These folds are the
simplest solutions to satisfying the hydrogen bonding potential of
the polypeptide backbone amide groups in the lipid bilayer. In the
helical family, the membrane-spanning domains are generally composed
of very hydrophobic stretches of 20-30 amino acid residues.
Algorithms based on hydrophobicity scales (Boyd et al., 1998;
Engelman et al., 1986; Kyte & Doolittle, 1982; von Heijne, 1992)
reliably identify these domains from primary sequences. As a
consequence, a large database of predicted helical transmembrane
(TM) domains (TMD) exists.
920 Analysis of Residue Patterns in TM Helices
Thus, structural information in a membrane protein sequence can
be statistically interpreted. Elements of the structural simplicity
of these proteins suggest the existence of commonly used patterns
in transmembrane helix-helix interactions. First, the space that
natural selection can sample in search of favorable combinations
seems to be limited by the low complexity of the sequences, since
two-thirds of transmembrane residues comprise, on average, only
six amino acids (Leu, Ile, Val, Phe, Ala, and Gly), as schematized
in Figure 1(a). Second, the helices tend to adopt a perpendicular
orientation in order to span the bilayer (Bowie, 1997), and helix
packing theories suggest that only a subset of the relative
inter-helical orientations are optimal for interaction (Bowie, 1997;
Chothia et al., 1981; Richmond & Richards, 1978; Walther et al.,
1996). Moreover, the need for a detailed fit to maximize weak van der
Waals interactions and the preference for preformed interfaces to
minimize entropy lost upon packing, as postulated by MacKenzie &
Engelman (1998), could also limit the number of conformations
suitable for interaction.
The existence of correlations between residues has been suggested
in previous statistical studies on predicted TM sequence databases
(Arkin & Brunger, 1998; Landolt-Marticorena et al., 1993; Samatey
et al., 1995). Here, we present a rigorous analysis of the frequency
of occurrence of all pairs and triplets of amino acids in a large
non-homologous set of sequences, compared with their theoretical
expectancies.
Expectancy can be calculated trivially by the product of the
frequencies of the amino acids in the database. However, this method
requires the assumption that, in terms of composition, the sequences
belong to a homogeneous population. We have formulated a procedure
for calculating analytically expectancy distributions of occurrence
based on the composition of the individual sequences. Our method
takes into account the finite-length effect, which becomes very
important for short sequences such as transmembrane domains. The
ratios between the observed occurrences and their relative mean
expectancy value (odds ratio) allowed for the identification of the
over-represented and under-represented pairs and triplets. The
exact expectancy distribution permitted the calculation of a precise
statistical significance for the observed differences from
expectation. Our results show that a large number of significant
cases exist, suggesting structural themes in helix-helix interaction
and a special role for glycine and the β-branched Ile and Val in
transmembrane domains.
Results
Characteristics of the database
The collection of sequences used in this analysis was obtained
from the 46,946 transmembrane domains annotated in the Swiss-Prot
database. In 94 % of the cases, the annotations were marked as
potential, possible or probable, indicating that they were
identified by hydrophobicity algorithms (Table 1). In order to remove
homology, TM sequences with high similarity scores to others were
excluded, as described in Methods. The procedure yielded a
database with 13,606 TMDs, an adequate size for the proposed
analysis.
Table 1. Transmembrane annotations in the Swiss-Prot database (rel. 37 and updates until March 17, 1999)
                            All proteins                 Bitopic proteins(a)          Polytopic proteins(b)
                            Protein              Poten-  Protein              Poten-  Protein              Poten-
                            records(c)  TMs(d)   tial    records(c)  TMs(d)   tial    records(c)  TMs(d)   tial
                                                 (%)(e)                       (%)(e)                       (%)(e)
Complete database(f)        10,769      46,946   94.3    3863        3863     78.5    6906        43,083   95.8
  Eukaryota                 6587        27,288   92.5    2630        2630     74.4    3957        24,658   94.5
  Bacteria                  3156        16,881   97.2    706         706      93.6    2450        16,175   97.4
  Archaea                   297         1341     98.2    71          71       94.4    226         1270     98.4
  Viruses                   729         1436     90.6    456         456      76.3    273         980      97.2
Non-homologous database(g)  5309        13,606   96.4    1174        1174     84.2    4135        12,432   97.6
  Eukaryota                 2510        5619     95.2    749         749      79.0    1761        4870     97.6
  Bacteria                  2333        6963     97.4    294         294      95.2    2039        6669     97.5
  Archaea                   244         666      98.6    43          43       97.7    201         623      98.7
  Viruses                   222         358      93.6    88          88       84.1    134         270      96.7
a Proteins containing a single transmembrane domain (single-span).
b Proteins containing multiple transmembrane domains (multi-span).
c Number of proteins containing transmembrane annotations.
d Annotated transmembrane domains (TRANSMEM entries in the FT field).
e Transmembrane annotations marked as POTENTIAL, POSSIBLE or PROBABLE.
f Database containing all Swiss-Prot annotated transmembrane domains.
g Database used for the statistical analysis, obtained from the complete database after homology removal.
Helical transmembrane domains generally vary in length from 20 to
30 residues. The sequences annotated in Swiss-Prot are mainly in
that range, as shown in the histogram in Figure 1(b). Instead of
using the complete annotations, however, we performed the analysis
on fixed-length windows of 18 residues selected for maximum
hydrophobicity.
Figure 1. Average composition and length of the transmembrane
annotations in Swiss-Prot. (a) Composition; the shaded areas
emphasize the fact that, on average, one half of transmembrane
residues comprise only four amino acids and two-thirds of the total
only six amino acids. (b) Length distribution of all transmembrane
annotations. Inset: distribution of the sequences not labeled as
POTENTIAL, PROBABLE or POSSIBLE.
Figure 2. Variation in the amino acid composition at different
positions in the 18 amino acid residue transmembrane sequences
used for the pair and triplet correlation analysis. Each 18
residue sequence corresponds to the most hydrophobic window of the
30 residue span centered on each Swiss-Prot annotation.
The composition of transmembrane sequences varies along
sections exposed to different environments (water interface, lipid
head-group and hydrocarbon regions). To avoid highlighting amino
acid correlations that are due to these variations, it is important
to limit the analysis to the portion of the sequence likely to be
exposed to the hydrocarbon region. Moreover, the exact definition
of the boundaries of a transmembrane domain is a non-trivial problem,
even among solved structures. The majority of the sequences are
putative, and incorrect assignment of the boundaries would result in
contamination from flanking regions; using shorter sequences selected
for hydrophobicity should minimize this risk. The exact boundaries
of the transmembrane regions are generally not well established in
Swiss-Prot, as demonstrated by the peak at 21 residues in the length
distribution of the annotations (Figure 1(b)). Such a sharp
demarcation is probably an artifact of the algorithms commonly used
to identify transmembrane domains. The distribution is much more
widespread when the putative annotations are excluded (inset). The
amino acid composition of each of the 18 positions of the analyzed
sequences is shown in Figure 2. The major differences in composition
are limited to the extremities, and the variations are significantly
reduced compared to those of the original annotations (data not
shown).
Analytical procedures
The one-letter code of the two amino acids followed by their
separation k will be used to indicate pairs of residues at distance
i, i + k (k is also referred to as the register). For example, the
pair in which Ala and Leu are at i, i + 3 (AxxL) is indicated by AL3.
The occurrences in the TM sequence database of all 4000 pairs formed
by all combinations of the 20 amino acids at registers 1 to 10 were
counted. Raw counts are not very informative, since the main
factor determining the gross number of occurrences of a pair in a
set of sequences is the relative frequency of its residues. The
Leu-Leu pairs at all separations (LLk), for instance, are the most
numerous among the pairs in the database, since Leu is by far the
most frequent residue. Therefore, to identify specific relationships
that might be clues to helix interactions, it is necessary to
refer to an expectation of the occurrence of each amino acid pair
and triplet that permits distinguishing over-represented and
under-represented pairs while accounting for the relative
frequency of the amino acids. We calculated the expectation with a
novel formalism named TMSTAT that incorporates both composition and
length of every individual sequence. It thus accounts for
finite-length effects and does not require the assumption that all
sequences belong to a homogeneous composition distribution. Based
on formally derived probability distributions, a statistical
significance (p) was assigned to any observed difference from
expectation, i.e. the probability that a difference equal to or
larger than that observed could occur by chance if the residues were
actually randomly distributed. The TMSTAT method is presented in
detail in the Appendix and Figure 9.
Figure 3. Occurrences of Gly-Gly pairs and their expectation. (a)
Probability function P(N_GG4) associated with any possible number of
occurrences (N_GG4) of the pair GG4 (GxxxG) in the database of TM
sequences. The arrow marks the actual occurrences observed (observed
1641; expected 1246.8; statistical significance p = 6.4 × 10^-34).
(b) Observed occurrences N_GGk of all GGk pairs (*) as a function of
distance i, i + k. The straight line represents the expectation. The
line is sloped due to end effects: in a sequence with finite length
more pairs are possible at short register than at longer register.
The error bars define the 99 % confidence interval around the
expectations (the range outside which a value has significance
p < 0.01).
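The counting scheme and the role of composition and finite length can be sketched as follows. This is a minimal illustration, not the TMSTAT formalism itself (which derives the full probability distribution and is presented in the Appendix): it assumes a null model in which the residues of each individual sequence are randomly permuted, and it computes only the mean of the resulting distribution. The function names are hypothetical.

```python
from collections import Counter


def count_pair(seqs, a, b, k):
    """Observed occurrences of residue `a` at i and `b` at i + k (e.g. GG4 = G, G, 4)."""
    return sum(
        1
        for s in seqs
        for i in range(len(s) - k)
        if s[i] == a and s[i + k] == b
    )


def expected_pair(seq, a, b, k):
    """Mean count of the pair in one sequence under random permutation of its residues."""
    n = len(seq)
    if n <= k:
        return 0.0  # the sequence is too short to host the pair at this register
    comp = Counter(seq)
    n_a = comp[a]
    # For identical residues the second one is drawn from the remaining copies.
    n_b = comp[b] - 1 if a == b else comp[b]
    # (n - k) position pairs exist at register k; each holds (a, b) with
    # probability n_a * n_b / (n * (n - 1)) when drawing without replacement.
    return (n - k) * n_a * n_b / (n * (n - 1))


def expected_total(seqs, a, b, k):
    """Sum of per-sequence expectations, so heterogeneous compositions are respected."""
    return sum(expected_pair(s, a, b, k) for s in seqs)
```

The factor (n - k) is the end effect noted in the legend to Figure 3: for an 18 residue window there are 17 possible positions for a pair at register 1 but only eight at register 10, so the expectation decreases linearly with k.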
GxxxG (GG4) is the most significant pair
In analyzing 4000 randomly distributed variables, one would
expect to observe by chance one instance of a difference from
expectation with significance p < 0.00025 (1/4000); in the present
analysis, 117 pairs deviate from the expected occurrences with at
least that significance. The data for the most significant
outliers found are shown in Table 2A (over-represented pairs) and
Table 2B (under-represented pairs). At least one of the ten most
significant over-represented pairs is found in 76 % of the sequences
in the database. (A table containing observed and expected
occurrences of all 4000 pairs is available at
http://engelman.csb.yale.edu/tmstat and
http://bioinfo.mbb.yale.edu/tmstat).
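The arithmetic behind this threshold can be written out explicitly. The sketch below only restates the numbers given in the text; the helper name is hypothetical.

```python
def expected_chance_hits(n_tests: int, alpha: float) -> float:
    """Expected number of tests passing threshold `alpha` if all nulls were true."""
    return n_tests * alpha


n_pairs = 20 * 20 * 10      # 20 x 20 residue combinations at registers 1 to 10
threshold = 1 / n_pairs     # the p < 0.00025 cut-off used in the text
chance_hits = expected_chance_hits(n_pairs, threshold)  # about one pair by chance
observed_hits = 117         # pairs reported at or below the threshold
```

With roughly one pair expected by chance and 117 observed, nearly all of the outliers at this threshold must reflect genuine departures from a random distribution of residues.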
GG4 (GxxxG) is the pair with the strongest positive
correlation. In the database, GG4 occurs 1641 times, 32 % more than
its expectation of 1247 (odds ratio observed/expected 1.32). The
occurrence of the GG4 pair is compared with its expectation curve in
Figure 3(a). The probability of observing an equal or greater
difference from expectation by chance is vanishingly small
(p = 6.4 × 10^-34). The high frequency of GG4 in a database of
predicted TM sequences was reported by Arkin & Brunger (1998);
however, the expectation and significance for the occurrences of the
pair were not calculated.
The observed occurrences of all Gly-Gly pairs as a function of
separation and their expectation values are plotted in Figure 3(b).
In the profile, the pair GG4 peaks between two positions that are
relatively "unbiased". A significant negative bias is observed
for the interaction of two adjacent Gly residues (GG1, 9 % below
expectation, p = 1.0 × 10^-4). A second positive peak, corresponding
to GG7, is observed 16 % more often than its expected occurrences
(p = 2.9 × 10^-8). Clearly, the observed correlations must derive
from specific position-dependent selection of residue properties.
Pairs containing isoleucine, glycine and valine are the most
biased
The 30 most significant over-represented pairs, shown in Table 2A,
frequently contain isoleucine, glycine and valine residues. Ile is
present in 14 pairs (five of the top eight cases). Gly exists in 11
pairs (five of the top six cases). Val is found in eight pairs
(four of the top ten cases). Leu, the most common residue in
transmembrane domains, is found in only five pairs and never in the
top ten cases.
Pair correlation results reflect helical periodicity
The pie diagram shown in Figure 4(a) depicts the proportion of
the most significant pairs of Table 2A grouped by register. Pairs
with registers 1 to 4 comprise 90 % of the cases. Together,
registers 1 and 4 are found in almost two-thirds of the cases. A
stronger tendency of pairs at i, i + 1 and i, i + 4 to deviate from
expectation is found in the entire set of 4000 pairs. The tendency
is evidenced by the χ² scores of pairs as a function of register,
shown in Figure 4(b), that clearly peak at these two positions.
These results are strongly consistent with helical geometry. Four
residues comprise about one helical turn in regular α-helix
conformation (3.6 residues per turn); at i, i + 1 and i, i + 4,
both residues of the pair are presented on the same face of a
regular α-helix, as schematized in the wheel diagram of Figure
4(c).
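The per-register χ² score can be sketched as below. The formula, the sum over pairs of (observed - expected)² / expected, is the one given in the legend to Figure 4; the input layout and function name are assumptions, and the example rows are a handful of entries from Table 2 rather than the full set of 400 pairs per register.

```python
from collections import defaultdict


def chi2_by_register(pairs):
    """Sum (observed - expected)^2 / expected over pairs, grouped by register.

    `pairs` is an iterable of (register, observed, expected) tuples.
    """
    scores = defaultdict(float)
    for register, observed, expected in pairs:
        scores[register] += (observed - expected) ** 2 / expected
    return dict(scores)


# A few (register, observed, expected) rows taken from Table 2, for illustration only.
example_rows = [
    (4, 1641, 1246.8),   # GG4
    (4, 3782, 3289.2),   # II4
    (1, 2721, 2318.4),   # IG1
    (2, 3223, 3759.1),   # II2 (under-represented)
]
example_scores = chi2_by_register(example_rows)
```

Because the deviation is squared, over- and under-represented pairs both raise the score, so the statistic measures the overall tendency of a register to depart from expectation in either direction.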
Table 2. The 30 most significant over-represented and under-represented pairs sorted by significance
Pair   Occurrences(a)  Expectation(b)  Standard deviation(c)  Significance (p)(d)  Odds ratio(e)
A. Over-represented pairs
GG4    1641            1246.8          31.3                   6.4 × 10^-34         1.32
II4    3782            3289.2          48.2                   8.4 × 10^-24         1.15
GA4    2057            1698.4          37.0                   3.6 × 10^-21         1.21
IG1    2721            2318.4          41.9                   4.8 × 10^-21         1.17
IG2    2528            2182.1          41.1                   1.3 × 10^-16         1.16
VG2    2268            1945.2          39.1                   5.7 × 10^-16         1.17
IV4    3003            2636.3          45.5                   2.1 × 10^-15         1.14
IP1    992             788.8           25.2                   4.5 × 10^-15         1.26
VV4    2770            2443.2          42.5                   3.8 × 10^-14         1.13
VI4    2965            2636.3          45.5                   1.1 × 10^-12         1.12
AV1    3149            2823.2          45.8                   2.2 × 10^-12         1.12
GL3    3392            3062.7          47.7                   9.7 × 10^-12         1.11
AG4    1929            1698.4          37.0                   9.1 × 10^-10         1.14
WQ1    88              45.8            6.5                    3.9 × 10^-9          1.92
IL4    4784            4446.3          57.3                   4.9 × 10^-9          1.08
AA3    2719            2477.0          42.2                   1.3 × 10^-8          1.10
VG1    2295            2066.7          40.0                   1.8 × 10^-8          1.11
GG7    1138            979.6           28.0                   2.9 × 10^-8          1.16
VL4    4362            4064.3          54.9                   7.7 × 10^-8          1.07
IS2    1916            1717.0          36.7                   9.0 × 10^-8          1.12
SI2    1912            1717.0          36.7                   1.5 × 10^-7          1.11
GI1    2536            2318.4          41.9                   2.9 × 10^-7          1.09
IY10   496             397.0           19.0                   4.5 × 10^-7          1.25
YY3    245             180.9           12.4                   6.3 × 10^-7          1.35
IF10   1617            1443.1          35.7                   1.6 × 10^-6          1.12
GI2    2375            2182.1          41.1                   3.3 × 10^-6          1.09
PI3    809             696.0           24.0                   4.0 × 10^-6          1.16
PV1    777             667.8           23.4                   5.0 × 10^-6          1.16
PL1    1342            1203.8          30.0                   5.4 × 10^-6          1.11
LP1    1342            1203.8          30.0                   5.4 × 10^-6          1.11
B. Under-represented pairs
II2    3223            3759.1          50.6                   5.1 × 10^-27         0.86
GI4    1564            1909.3          39.1                   1.4 × 10^-19         0.82
IL1    4906            5399.1          60.7                   2.5 × 10^-16         0.91
FL1    3954            4394.8          55.4                   9.4 × 10^-16         0.90
FI4    2182            2525.4          44.5                   4.1 × 10^-15         0.86
IG4    1620            1909.3          39.1                   4.8 × 10^-14         0.85
LW4    611             786.7           25.0                   5.2 × 10^-13         0.78
IV2    2683            3013.0          47.6                   2.3 × 10^-12         0.89
YL4    788             974.5           27.9                   7.3 × 10^-12         0.81
PG1    311             434.2           19.3                   2.8 × 10^-11         0.72
CP1    56              113.1           10.1                   9.0 × 10^-10         0.50
FV3    1991            2244.3          42.1                   1.1 × 10^-9          0.89
AP1    508             642.2           22.9                   1.8 × 10^-9          0.79
IW4    376             493.2           20.4                   2.9 × 10^-9          0.76
IM1    922             1091.4          29.5                   4.7 × 10^-9          0.84
FL3    3575            3877.7          53.3                   1.1 × 10^-8          0.92
FV4    1869            2094.6          41.0                   2.5 × 10^-8          0.89
FI3    2462            2705.7          45.6                   6.7 × 10^-8          0.91
LW3    707             842.8           25.7                   7.5 × 10^-8          0.84
VI2    2759            3013.0          47.6                   7.7 × 10^-8          0.92
GP1    335             434.2           19.3                   1.2 × 10^-7          0.77
YI4    575             694.8           24.0                   3.7 × 10^-7          0.83
FL2    3862            4136.3          54.4                   3.9 × 10^-7          0.93
VG4    1517            1702.0          37.2                   5.0 × 10^-7          0.89
FF2    2244            2450.3          41.9                   7.1 × 10^-7          0.92
FM1    743             872.3           26.7                   8.4 × 10^-7          0.85
FL6    2861            3102.2          49.4                   8.8 × 10^-7          0.92
II1    3744            3994.0          51.6                   1.1 × 10^-6          0.94
WV1    454             549.1           21.3                   5.1 × 10^-6          0.83
LL2    7509            7821.3          69.4                   6.5 × 10^-6          0.96
a Number of observed occurrences of the pair in the database.
b Average expected number of occurrences.
c Standard deviation of the expectation distribution.
d Calculated as two-tailed integral of the expectation distribution.
e Occurrences/Expectation ratio.
Figure 4. (a) Relative frequency of the 30 most significant
pairs of Table 2A grouped by register. (b) Overall deviation from
expectation at different registers, calculated as χ² score on the
entire set of pairs. Pairs grouped by register (group size n = 400):
$$\chi^2 = \sum_{\mathrm{pairs}} \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$$
(c) Relative angular position along the helical axis of residues
in pairs at different registers. The filled circle at the top of the
helical wheel diagram (3.6 residues per turn) indicates the residue
at i (*). The position of the residue at i + k is indicated by the
respective number k. The arrows mark the registers with the highest
overall tendency to diverge from expectation, as observed in (a) and
(b).
Figure 5. Normalized occurrences of pairs formed by combinations
of the β-branched residues Ile and Val, and of Gly, at all registers.
Odds ratio = observed occurrences/expected. The bars represent the
99 % confidence interval around the expectation. (a) Pairs formed by
Ile and Val. (b) Pairs formed by Gly with Ile or Val.
Similar biases are found with pairs of residues with similar
structure
A remarkable feature of the results shown in Table 2 is that most
of the pairs can be grouped by register and side-chain chemistry
into a few categories. For example, GG4, GA4 and AG4 are all
observed among them. Similarly, all combinations of the β-branched
aliphatic residues at i, i + 4 (II4, IV4, VI4 and VV4) are extremely
significant. The pairs IL4 and VL4 are also among the most
significant pairs (Leu is isomeric to Ile but γ-branched). There are
many pairs formed by one small residue (Ala, Gly and Ser) and a
β-branched aliphatic residue at i, i + 1 (IG1 and VG1; GI1 and AV1)
and i, i + 2 (IG2, VG2 and IS2; SI2 and GI2). Finally, a number of
pairs are formed by Pro and large aliphatic residues (Ile, Val and
Leu) at register 1 (IP1 and LP1; PV1 and PL1). In the list of the
significant under-represented pairs, combinations of β-branched
residues and glycine are very disfavored at i, i + 4 (GI4, IG4 and
VG4) and neighboring Pro and Gly are also disfavored (PG1 and GP1).
The correspondence between the observed biases and side-chain
chemistry also appears in the comparison of pair profiles at all
registers. In Figure 5, the occurrences of pairs normalized to their
expectancy (odds ratios, observed/expected) are plotted as a
function of register. In Figure 5(a), all pairs formed by
combinations of Ile and Val have very similar profiles with a strong
positive correlation at i, i + 4 and a negative peak at i + 2.
Striking similarity is also evident in Figure 5(b), where the
profiles of pairs formed by Gly and Ile or Val are shown.
These results suggest a general tendency for two large aliphatic
residues (in particular the β-branched ones) to correlate at
i, i + 4 when they are on the same face of the helix and to
anti-correlate when they are on opposite faces. Pairs of smaller
residues (in particular Gly) on the same face of the helix are also
favored. Lastly, pairs formed by one small and one large residue
correlate positively on adjacent (i, i + 1) or opposite faces
(i + 2) and, conversely, are strongly disfavored on the same face
(i + 4).
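The "same face" and "opposite face" language can be made concrete with the helical wheel geometry. The sketch below assumes an ideal α-helix with 3.6 residues per turn (100° per residue), as in Figure 4(c); the function name is hypothetical.

```python
def wheel_angle(k: int, residues_per_turn: float = 3.6) -> float:
    """Angular offset in degrees, folded to (-180, 180], between residues i and i + k."""
    angle = (k * 360.0 / residues_per_turn) % 360.0
    return angle - 360.0 if angle > 180.0 else angle


# Offsets for the registers analyzed in the text (1 to 10).
offsets = {k: wheel_angle(k) for k in range(1, 11)}
```

On this wheel, i + 4 and i + 7 fall within about 40° of residue i (same face), i + 1 is 100° away (adjacent face), and i + 2 is about 160° away (opposite face), which matches the grouping used in the preceding paragraph.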
Figure 6. Odds ratios of pairs formed by similar residues.
[Small], small residues, Gly, Ala and Ser; [Large], large aliphatic
residues, Ile, Val and Leu. The error bars mark the 99 % confidence
interval around the expectation.
These general themes can be appreciated in Figure 6, where sets
of pairs at the same register are grouped by side-chain size and
compared. All pairs formed by two small residues (Gly, Ala and Ser)
at register 4 are positively biased with a significance of at least
p < 0.01 (except the case of AS4). The β-branched residues Ile and
Val correlate very strongly at register 4 (II4, IV4, VI4, VV4).
Interestingly, Leu seems to be part of the trend of positively
correlating [Large][Large] pairs at i, i + 4 only when it is
occupying the C-terminal position (IL4, VL4); all pairs in which
Leu precedes a second large residue (LI4, LV4, LL4) are unbiased.
The majority of the combinations of large and small residues at
registers 1 and 2 have a positive bias with a significance of at
least p < 0.05. Not all deviations from expectation are large or
very significant. However, the observed trends can be taken with
more confidence than the individual deviations, as it is less
probable for a series of random deviations to occur all in the same
direction.
Analysis of triplets shows that residue correlations extend
beyond the pair level
The relationships between pairs of larger and smaller residues
suggested that the correlations were not limited to the pairs, since
positively correlating pairs can be consistently combined to form
higher-order patterns. This was confirmed by extending the analysis
to triplets. The occurrences of 200,000 amino acid triplets were
counted and compared to an expectation computed with the same
method used for the pairs. The reference was therefore calculated on
the frequencies of the single residues in the sequences and not
relative to the pairs. The 30 triplets with the strongest positive
correlation are listed in Table 3. The most significant triplets
were indeed composed of combinations of strongly biased pairs. For
example, the most significant case IG1L3 (IGxxL) is composed of IG1,
IL4 and GL3, all observed in the 15 most biased pairs. The
significance of IGxxL (p = 1.8 × 10^-20) is slightly lower than that
of IG1 but higher than those of IL4 and GL3. However, it is incorrect
to compare the p values, since, on average, the triplets have a
smaller number of occurrences than the pairs, and p values strongly
depend on "sample size" (for example, when a coin is tossed once,
100 % "heads" is not a significant result, but in one million tosses
51 % heads undoubtedly indicates a defective coin). A more
appropriate value for comparison is the odds ratio
(observed/expected occurrences). In the most significant triplets,
the observed odds ratios always exceed those of the corresponding
pairs.
Triplets containing the pair GG4 are present many times in Table
3, mostly in conjunction with Ile, Val or Leu at registers ±1 and ±2
with respect to the Gly residues. The interaction of the GG4 pair
with Ile and Val at these distances is evident in Figure 7, which
illustrates the effect of a third residue at positions relative to
the GG4 pair. In addition, many strongly correlating triplets in
Table 3 contain two large aliphatic residues interacting with one
Gly or another small residue at positions ±1 and ±2 (IG1L3, IG2I2,
VG2I2, IG1I3, IS2I2, IA3V1, etc.). Together, these correlations
define the main theme of the analysis, i.e. patterns of larger and
smaller residues that are strongly favored to coexist at
neighboring helical faces.
Table 3. The 30 most significant over-represented triplets sorted by significance
Triplet  Occurrences(a)  Expectation(b)  Standard deviation(c)  Significance (p)(d)  Odds ratio(e)
IG1L3    535             353.4           18.3                   1.8 × 10^-20         1.51
IG2I2    399             258.1           15.6                   3.8 × 10^-18         1.55
IG2G4    244             137.6           11.5                   6.1 × 10^-18         1.77
IG1A4    309             191.7           13.6                   1.6 × 10^-15         1.61
GV2G2    244             143.3           11.7                   3.6 × 10^-15         1.70
VG2I2    331             211.0           14.3                   6.7 × 10^-15         1.57
IG1I3    382             258.1           15.6                   7.4 × 10^-14         1.48
GG4G4    146             75.9            8.8                    1.4 × 10^-13         1.92
IV4L4    488             348.1           18.3                   5.1 × 10^-13         1.40
IP1I3    162             88.7            9.2                    5.9 × 10^-13         1.83
IS2I2    319             211.4           14.1                   1.1 × 10^-12         1.51
GI2G2    255             160.6           12.4                   1.6 × 10^-12         1.59
IG1G4    236             149.1           11.9                   4.7 × 10^-12         1.58
IA3V1    388             274.0           16.2                   7.7 × 10^-12         1.42
IG2L2    485             353.4           18.4                   1.1 × 10^-11         1.37
VG2G4    201             122.9           10.9                   2.6 × 10^-11         1.64
II4L4    555             419.6           19.7                   2.7 × 10^-11         1.32
VV4G2    257             169.5           12.7                   1.2 × 10^-10         1.52
PI3G2    90              43.1            6.5                    2.2 × 10^-10         2.09
VG5L3    334             234.8           15.1                   4.4 × 10^-10         1.42
IA2I2    428             316.7           17.2                   7.1 × 10^-10         1.35
GG4I2    213             137.6           11.5                   9.6 × 10^-10         1.55
AC3A4    71              32.3            5.6                    1.6 × 10^-9          2.20
VG2L3    413             305.3           17.1                   1.7 × 10^-9          1.35
VG1G4    206             133.1           11.3                   1.9 × 10^-9          1.55
IG2L3    439             328.2           17.7                   2.2 × 10^-9          1.34
GG4L3    274             189.6           13.5                   3.3 × 10^-9          1.45
VG2L2    438             328.8           17.7                   4.0 × 10^-9          1.33
IV4V4    298             210.7           14.1                   4.5 × 10^-9          1.41
GL3G1    334             241.3           15.1                   4.9 × 10^-9          1.38
a Number of observed occurrences of the triplet in the database.
b Average expected number of occurrences.
c Standard deviation of the expectation distribution curve.
d Calculated as two-tailed integral of the expectation distribution.
e Occurrences/Expectation ratio.
Discussion
Many of the amino acid correlations that were found in the
present analysis are readily interpretable in terms of helix-helix
interaction patterns. Most of the positively correlating pairs
occur at separations i, i + 1 and i, i + 4, i.e. on the same face in
α-helical conformation. At register i, i + 4 there is a marked
preference for pairs of residues with similar size, while
combinations of a small and a large residue are strongly disfavored.
Furthermore, the GG4 pair and its relationship with β-branched
residues at i ± 1 relative to the glycine residues have been
observed in two important membrane oligomerizing systems: in the
interface of the glycophorin A (GpA) transmembrane dimer (Lemmon et
al., 1994; MacKenzie et al., 1997), and by an in vivo selection
system for transmembrane helix-helix association (Russ & Engelman,
2000). The other strong correlation of GG4 with β-branched residues
at i ± 2 is more difficult to explain in terms of helix-helix
interaction, because these patterns in an α-helical conformation
would place the residues on opposite sides of the helix. We propose
a possible explanation for this pattern in terms of helix
flexibility modulation.
Comparison with GpA transmembrane dimer
GG4 is the key feature of the dimerization interface of
glycophorin A, the best characterized transmembrane helix-helix
interaction. The single TMD of GpA forms a symmetric right-handed
homodimer based on the seven residue motif LIxxGVxxGVxxT (Lemmon et
al., 1992, 1994). The glycine residues allow the backbones to reach
close proximity and the larger side-chains pack in a "ridges into
grooves" fashion (MacKenzie et al., 1997).
Many other features of the GpA interaction motif are found among
the most significant results of the present analysis: IV4 and VV4,
for instance, are two of the most strongly correlating amino acid
pairs. In addition, the majority of the amino acid triplets of the
motif correlate positively in this analysis, as shown in Table 4.
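The set of triplets contained in the GpA motif can be enumerated from the motif string. The sketch below is a hypothetical helper, not the authors' code; the limit of five positions between consecutive residues of a triplet is an assumption inferred from the triplets listed in Table 4 and is not stated in this section.

```python
from itertools import combinations


def motif_triplets(motif: str, max_gap: int = 5) -> list:
    """List the triplet codes (e.g. 'GG4T4') formed by the specified residues of `motif`.

    Positions written as 'x' are unspecified. `max_gap` bounds the distance between
    consecutive residues of a triplet (an assumption, see the accompanying text).
    """
    fixed = [(i, aa) for i, aa in enumerate(motif) if aa != "x"]
    triplets = []
    for (i, a), (j, b), (k, c) in combinations(fixed, 3):
        if j - i <= max_gap and k - j <= max_gap:
            triplets.append(f"{a}{b}{j - i}{c}{k - j}")
    return triplets


gpa_triplets = motif_triplets("LIxxGVxxGVxxT")
```

Under this assumption the seven specified residues of LIxxGVxxGVxxT yield 21 triplets, including GG4T4, IG3G4 and IV4V4, the same number of triplets as there are rows in Table 4.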
Comparison with the TOXCAT in vivo selection system for
helix-helix interaction
The GG4 pair is almost invariably present in the transmembrane
oligomerization motifs identified from randomized sequences by the
TOXCAT in vivo selection system, presented in the accompanying
paper (Russ & Engelman, 2000). Seven positions with the
periodicity of the GpA motif were randomized to a set of
possibilities at each motif position in the context of a
poly-leucine (Leu library) or poly-alanine (Ala library) background.
The results (refer to Figure 4 in the accompanying paper) often
contained the theme of large residues (Ile, Val or Leu) associated
with the GG4 pair at positions ±1, in excellent agreement with the
present statistical analysis.
In the TOXCAT library with a Leu context, the larger residues
occurred at positions i + 1 relative to the two glycine residues
(G[IVL]xxG[IVL]). The β-branched residues were prevalent, especially
in the first position. In addition, Thr was often found in the
selection system at position i + 4 from the second glycine residue.
In the present statistical analysis, we find that the GG4T4 triplet,
which is observed also in the GpA motif, is strongly
over-represented (58 %, p = 3.4 × 10^-4). In the Ala library, the two
large residues occurred at position i − 1, on the N-terminal side of
the GG4 pair ([IVL]Gxx[IVL]G). β-Branched residues were again
prevalent. A schematic comparison of our results with the TOXCAT
selection can also be found in Table 2 of Russ & Engelman (2000).
Figure 7. Triplet analysis: interaction of a third residue with
the GG4 pair. The Figure represents the odds ratios of triplets
containing the pair GG4 in conjunction with either a small residue
(Gly, Ala and Ser, left panels) or a large aliphatic residue (Ile,
Val and Leu, right panels). The position of the bars along the
x-axis reflects the actual position of the residue relative to the
pair GG4. The baseline is set at 1.316, the odds ratio observed for
the pair GG4.
The convergence of the results obtained with such dissimilar
approaches is remarkable, especially if one considers that the
TOXCAT system reports the oligomerization events of bitopic
(single-span) transmembrane domains, while the correlation analysis
is based mostly on polytopic (multi-span) proteins (Table 1). The
frequent finding of GG4 with large flanking residues by TOXCAT,
which selects for strong transmembrane interactions, probably
reflects the excellent opportunity provided by the deep groove and
ridge of the motif for bringing two helices in extensive contact, as
observed in the GpA structure. If strong interactions are important
in polytopic proteins, they are essential in oligomerizing helices,
as more energy is required to compensate for the larger entropy cost
of association of helices that are not covalently joined by
extra-membranous loops.
Following this line of reasoning, one could expect the GG4 pair
to be more frequent in the TMDs of single-span transmembrane
proteins. To address this question, we analyzed bitopic and
polytopic sequences separately (data not shown). In a raw count, the
pair is indeed found more frequently in bitopic sequences (on
average, in 12.5 % of bitopic domains and in 12.1 % of polytopic
transmembrane domains). The GG4 pair is the most significant outlier
in both databases, but it is more over-represented relative to its
expected occurrences in the bitopic (37.8 %) than in the polytopic
set (30.8 %). However, caution should be exercised when inferring
the relative importance of the motif in the two different topologies
from these results. In polytopic proteins, weak helix-helix
interactions embedded in a bundle might be tolerable and
extra-membranous loops might sometimes direct the folds. On the
other hand, the fraction of transmembrane anchors in the single-span
database that are not engaged in interactions is also unknown.
"Passive" sequences with nearly randomly distributed residues would
provide only an increase in the background noise and a decrease in
the significance of the results. Thus, the only conclusion supported
by the data is that the GG4 pair is very important in both bitopic
and polytopic membrane proteins.
Table 4. Results of triplet analysis for all triplets present in the dimerization motif of glycophorin A (LIxxGVxxGVxxT), sorted by decreasing odds ratio
Triplet  Significance (p)  Odds ratio
GG4T4    3.4 × 10^-4       1.58
IG3G4    1.8 × 10^-6       1.43
IV4V4    4.5 × 10^-9       1.41
GG4V1    1.6 × 10^-3       1.28
GV1G3    3.1 × 10^-3       1.25
LG4G4    9.6 × 10^-3       1.20
LV5V4    2.8 × 10^-3       1.17
IG3V1    1.9 × 10^-2       1.16
IV4G3    4.9 × 10^-2       1.15
IG3V5    5.7 × 10^-2       1.15
VG3V1    3.4 × 10^-2       1.15
LG4V1    6.5 × 10^-2       1.10
GV1V4    2.1 × 10^-1       1.09
LI1G3    2.6 × 10^-1       1.06
LI1V4    2.8 × 10^-1       1.05
VV4T3    8.8 × 10^-1       1.01
VG3T4    1.0 × 10^0        1.00
LG4V5    7.0 × 10^-1       0.97
GV5T3    8.5 × 10^-1       0.96
LV5G3    6.0 × 10^-1       0.96
GV1T3    4.1 × 10^-1       0.90
β-Branched residues could minimize entropy loss upon packing
Upon solution of the NMR structure of the GpA transmembrane dimer,
MacKenzie et al. (1997) proposed that the association of the monomers
might occur between two largely preformed interfaces. The idea was
based on a fundamental implication of the two-stage model for membrane
protein folding (Popot & Engelman, 1990). The first stage of the model
involves the partitioning of largely hydrophobic TM segments in the
lipid bilayer, which is strongly favored by the hydrophobic effect.
The backbone adopts a helical conformation to satisfy its strong
hydrogen bonding potential in the low-dielectric environment. Sequence
specificity comes into play only in stage 2, when the equilibrium of
associations of the preformed helices is established. Given the
two-stage model, it is possible to have a notion of the structure of
the unassociated state (helical) that is generally not available with
the unfolded state of soluble proteins. This information is crucial to
relating observed structural features of the native state to the
energetics of folding, since stability depends on the differential
between the energies of folded and unfolded states.
In the GpA dimer, many interfacial side-chains (Ile, Val, and Thr)
have only one populated rotamer as a consequence of being in a helix
(Dunbrack & Karplus, 1993; Schrauber et al., 1993). Under the
assumption that the GpA TM is helical in the monomeric state (recently
confirmed experimentally by Fisher et al., 1999), MacKenzie and
colleagues pointed out that minimal loss of rotameric freedom upon
dimerization was therefore expected. Later, a theoretical model based
on a large number of GpA mutants indicated loss of side-chain entropy
as one of the major factors destabilizing dimerization (MacKenzie &
Engelman, 1998), further supporting the hypothesis that rotamerically
constrained interfaces could provide a significant contribution to the
stability of association.
In our results, there is a significant dichotomy in the role of the
three larger aliphatic residues Ile, Val and Leu. The β-branched Ile
and Val are, with Gly, the residues involved in the strongest
correlations. Conversely, Leu, the most frequent residue in
transmembrane domains, has only a secondary role. As a γ-branched
side-chain, Leu can sample more conformations in helical secondary
structure. Our results are therefore consistent with the hypothesized
importance of a "preformed interface" and the possibility that the use
of residues with constrained side-chains in helical conformation might
have general significance in limiting the entropic cost of association
in a large set of membrane proteins.
Interaction of β-branched residues at i, i + 4 might modulate helix
flexibility in TMs
A combination of theoretical arguments and experimental evidence
suggests the hypothesis that pairs of Ile and Val at i, i + 4, which
we find all strongly over-represented in this analysis, might
influence flexibility in TM helices. Helical conformation prevents the
χ1 dihedral from positioning a heavy γ-substituent in the gauche−
orientation, due to steric clashes with the backbone carbonyl oxygen
atom at i − 3 (McGregor et al., 1987). In an analysis of intrahelical
side-chain/side-chain interactions in soluble proteins, Walther &
Argos (1996) reported that the majority of the contacts occurred
between pairs of residues with spacing i, i + 4. As they pointed out,
interactions can occur at this separation, since they are promoted by
χ1 rotamers that involve a combination of a trans (at position i) and
a gauche+ (at i + 4) dihedral. Conversely, i, i + 1 and i, i + 3
interactions require the unfavorable g− conformation (g−/g+ and t/g−,
respectively). The two Cγ atoms of β-branched residues are forced to
occupy the g+ and t positions simultaneously to avoid the g− dihedral
(Schrauber et al., 1993). For this reason, β-branched residues are
good candidates for intrahelical interactions at i, i + 4. This is
consistent with the high scores of Ile and Val in the i, i + 4 contact
propensity calculated by Walther & Argos (1996), a scale in which Leu
scored only slightly above average.
Padmanabhan & Baldwin (1994) used circular dichroism (CD) to measure
the interactions of L[IVL] and [IVL]L pairs at i, i + 3 and i, i + 4
in soluble peptides, and observed stronger helix stabilization in
i + 4 pairs. The energy of interaction of pairs of hydrophobic
residues at different registers in an α-helix has been calculated by
Creamer & Rose (1995) using an exhaustive Boltzmann-weighted
conformational search. The interactions of pairs formed by Ile, Val
and Leu at i, i + 4 were more stabilizing than those of i, i + 3
pairs. The energy ranking observed for these pairs at i, i + 4 agrees
with our data (summarized in the [Large][Large]4 panel in Figure 6) in
the fact that the smallest effects are observed when there is a Leu
residue on the N-terminal side in the pair (LL4, LI4, LV4). The
calculations made by Creamer & Rose (1995) were in only partial
agreement with the experimental results reported by Padmanabhan &
Baldwin (1994), who, conversely, observed higher helix content in
L[IVL]4 than in [IVL]L4 pairs. However, Creamer & Rose (1995)
calculated the interaction energies relative to the same pair at
i, i + 2 (on opposite faces in helical conformation) while the CD data
reflect the position of a helix-coil/strand equilibrium.
Figure 8. Example of a pair of β-branched aliphatic residues at
i, i + 4 in the fourth transmembrane segment of bacteriorhodopsin
(RCSB PDB code 1c3w). Both I108 and V112 are in their standard helical
rotamer, in which the γ-carbon atoms are positioned away from the
disfavored gauche− orientation. According to the IUPAC nomenclature
rules, the rotamers are designated respectively as trans and gauche+.
The van der Waals sphere of the carbon atoms of closest approach is
represented by dots (1.9 Å). The center-to-center distance between
I108-Cγ2 and V112-Cγ2 is 4.1 Å.
These three studies relate intrahelical side-chain interactions to
helical stability in aqueous solution and they concur on the
importance of i, i + 4 contacts. In the membrane, the helix is already
stabilized by the environment, but side-chain interactions might
additionally affect the flexibility of the helix. This might be
especially true for pairs of β-branched residues at i, i + 4, as their
only favorable χ1 rotamer conformation locks them in close proximity.
In bacteriorhodopsin, the only helical membrane protein structure
available at better than 2 Å resolution (Luecke et al., 1999), the
four [IV][IV]4 pairs found in regular α-helical conformation have an
average minimal distance (center to center of the closest Cγ or Cδ
atoms) of only 4.2 (±0.3) Å (±SD). An example (residues I108 and V112
on the fourth transmembrane segment) is shown in Figure 8. Whether the
strongly correlating pairs of β-branched residues at i, i + 4 are
important in diminishing transmembrane helix flexibility is an
interesting question. If validated experimentally, it could provide
further support to the hypothesis that a reduction of entropy in the
helical unassociated state (in turn a destabilization of the unfolded
state, if independent helices are stable in the bilayer) could be a
significant factor in the transmembrane association equilibrium.
On the other hand, glycine is frequently observed in membrane helices
and induces flexibility. Glycine is compatible with helical
conformation in membrane proteins, as evident in GpA, which is largely
helical in both the monomeric and dimeric states despite three glycine
residues in its TM sequence (Fisher et al., 1999). However, extensive
studies in host peptides by Deber and colleagues have shown that,
while Gly has a considerable tendency to form α-helices in membrane
mimetic environments, it is somewhat destabilizing compared to the
more hydrophobic side chains (Li & Deber, 1992a,b; Liu & Deber, 1998).
This is consistent with the observation by Ri et al. (1999) using a
Monte Carlo simulation of a single TM. The ranking observed for
increased flexibility (Gly > Ala > Val) correlated well with the
severity of voltage-dependent gating phenotypes when these three
residues were substituted for the wild-type Pro residue in connexin32.
Thus, a pair of β-branched residues at i, i + 4 and a pair of glycine
residues at i, i + 4 might lie at opposite sides of a hypothetical
flexibility scale in TM helices. The favorable role of Gly in helix
interactions might require the presence of additional stability from
the β-branched residues. This argument provides a speculative but
plausible explanation for the strong correlations between the GG4 pair
and [IV][IV]4 pairs observed in opposite faces of the helix at i + 2,
which could perhaps have a compensatory role in modulating helix
flexibility.
Final remarks
Many instances of the "GG4 β-branched" motif and its variations can be
found in the available X-ray structures of helical transmembrane
proteins. An in-depth comparison of the results of our analysis with
the structural models has not been completed at this stage. This
comparison could offer further insights into the physical role of this
motif and of other observed correlations. For example, it would be
interesting to put the strong association of Ile, Val and Leu with
neighboring Pro residues in relation to the geometry of the kink.
We have shown that the inherent simplicity of helical membrane protein
structure results in correlations between residues that are detectable
with simple statistical methods and that suggest interpretations in
terms of protein chemistry. In turn, our results also support the
validity of TMD prediction techniques. With the growth of primary data
provided by the genome projects, these results are an indication of
the important role that sequence analysis will assume in the near
future in the membrane protein field as a complement to the
interpretation of experimental and structural data.
Methods
Database
The source of transmembrane sequences for this work was the annotated
database Swiss-Prot, release 37 and updates to March 17, 1999 (Bairoch
& Apweiler, 1999). All sequence fragments corresponding to a TRANSMEM
annotation in the FT field were extracted and a database of 46,946
transmembrane domains was compiled (Table 1).
Figure 9. Calculation of probability distributions of pair occurrence
with the TMSTAT method. The Figure is explained fully in the Appendix.
Homology cleanup
Homology removal was performed at the level of the TM sequences by
eliminating each sequence that was exceedingly similar to another
sequence. Given the large number of proteins in the database, homology
elimination at the level of the TMDs was a practical and effective
alternative to more complex and intensive clustering procedures at the
protein level (Boberg et al., 1992; Brenner et al., 1998; Gerstein,
1998; Hobohm & Sander, 1994; Hobohm et al., 1992). In addition, the
TMD-level procedure takes care of the "internal homology" sometimes
present within a given protein while preserving any non-homologous
TMDs of otherwise homologous proteins. The annotated sequences were
extended (or occasionally shortened) to a length of 30 residues using
the flanking regions. Two sequences were compared in all possible
frame shifts using a 100 PAM matrix derived from the Mutation
Probability Matrix of Jones et al. (1994) and the maximum score was
recorded as the similarity score of the pair.
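The frame-shift comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each frame shift is scored as the ungapped sum of substitution scores over the overlapping positions, and `toy_score` is a hypothetical stand-in for the 100 PAM matrix, whose values are not given in the text.

```python
def shift_similarity(seq_a, seq_b, score):
    """Return the best ungapped score over all relative frame shifts.

    Assumption: only the overlapping positions of a given shift contribute.
    """
    best = None
    for shift in range(-(len(seq_b) - 1), len(seq_a)):
        total = 0
        for i, res_a in enumerate(seq_a):
            j = i - shift  # position in seq_b aligned with position i of seq_a
            if 0 <= j < len(seq_b):
                total += score(res_a, seq_b[j])
        if best is None or total > best:
            best = total
    return best


def toy_score(x, y):
    # Hypothetical scoring function (NOT the 100 PAM matrix): +1 match, -1 mismatch.
    return 1 if x == y else -1


if __name__ == "__main__":
    print(shift_similarity("ABCD", "CDEF", toy_score))
```

With a real substitution matrix, `score` would simply look up the matrix entry for the two residues.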
Sequences were eliminated according to the following process. First,
all pairs with similarity scores of 50 or higher were ranked by score,
from highest to lowest. Then, beginning with the pair with the highest
score, one member of each pair was marked for removal. The particular
sequence in a pair chosen for removal was determined by its priority
number. Priorities, assigned according to the description of the
annotation in the Swiss-Prot database, gave preference to
non-potential transmembrane domains:
0, transmembrane sequences of potential proteins (ORFs identified in
Swiss-Prot with IDs starting with the letter Y);
1, transmembrane domains marked as POTENTIAL, PROBABLE or POSSIBLE;
2, annotations that included the words BY SIMILARITY;
3, remaining annotations.
Sequences with larger priority numbers were kept in the database, and
when members of the pair shared the same priority number, one was
randomly chosen for removal. The cleanup proceeded down the list of
pairs so that when a pair in which neither sequence had been marked
for removal was encountered, priority numbers were assigned and only
one sequence was subsequently kept.
Pair and triplet definition
The analysis of the pair correlations was performed on all
combinations of amino acids separated by one to ten residues
(20 × 20 × 10 = 4000 pair correlations analyzed). Pairs at i, i + k
are indicated using the one-letter code of the two residues followed
by the separation k (register): for example, AL1 corresponds to the
sequence AL and AL3 to AxxL.
The triplets analyzed were formed by all combinations of residues at
separations ranging from 1 to 5 (20 × 20 × 20 × 5 × 5 = 200,000
triplet correlations). Triplets are represented with the same
notation, each separation being counted from the preceding residue:
for example, VI2P3 corresponds to VxIxxP.
Input sequences
The analysis was performed on sequences of fixed length instead of the
entire annotation, in order to limit the analysis to the hydrophobic
core of the sequences. The most hydrophobic window of 18 amino acid
residues in a span of 30 residues centered on each annotation was
selected using the GES scale (Engelman et al., 1986). Occasionally,
the selected window included residues outside the original
annotations.
Exceedingly hydrophilic sequences with a hydrophobicity score below 15
were excluded from the analysis (4.9 % of all sequences).
Low-complexity sequences (when a single residue represented more than
half of the composition of the sequence or two residues more than
two-thirds of the composition of the sequence) were also excluded
(0.5 %).
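The window selection and the low-complexity filter can be sketched as below. The hydrophobicity scale is passed in as a dictionary; `toy_scale` is a hypothetical two-residue placeholder and does not contain GES values. The sketch also assumes that larger scores mean more hydrophobic, which matches the exclusion of scores below 15 described above.

```python
from collections import Counter


def most_hydrophobic_window(span, scale, width=18):
    """Return (window, score) for the highest-scoring window of given width."""
    best_start, best_score = 0, None
    for start in range(len(span) - width + 1):
        score = sum(scale[res] for res in span[start:start + width])
        if best_score is None or score > best_score:
            best_start, best_score = start, score
    return span[best_start:best_start + width], best_score


def is_low_complexity(seq):
    """True if one residue exceeds 1/2, or two residues exceed 2/3, of seq."""
    counts = [count for _res, count in Counter(seq).most_common(2)]
    return 2 * counts[0] > len(seq) or 3 * sum(counts) > 2 * len(seq)


if __name__ == "__main__":
    toy_scale = {"L": 1.0, "K": -1.0}  # hypothetical values, NOT the GES scale
    print(most_hydrophobic_window("KKKLLLLKKK", toy_scale, width=4))
    print(is_low_complexity("LLLLLLLLLLAAAAAAAA"))
```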
Pair and triplet correlation analysis with TMSTAT
The occurrences in the database of all pairs and triplets of residues
were counted. The probability distributions associated with any
possible number of occurrences of each pair and triplet were
calculated from the composition of the individual sequences, as
explained in the Appendix and in the scheme in Figure 9. The
statistical significance of the observed deviations of each occurrence
from its respective average expected value was calculated by the
two-tailed integral of their probability distributions.
Acknowledgments
We thank Mark Bowen, Zimei Bu, Lilian Fisher, Karen Ho, Yuval Kluger,
Albert Lee, Huiming Li, Maura Mezzetti, Gigi Riva, William Russ, Koji
Sonoda, Iban Ubarretxena, Fang Zhou and other members of the Engelman
group for helpful discussion and critical reading of the manuscript.
This work was supported by grants from the NIH and NSF.
References
Arkin, I. T. & Brunger, A. T. (1998). Statistical analysis of
predicted transmembrane alpha-helices. Biochim. Biophys. Acta, 1429,
113-128.
Bairoch, A. & Apweiler, R. (1999). The SWISS-PROT protein sequence
data bank and its supplement TrEMBL in 1999. Nucl. Acids Res. 27,
49-54.
Boberg, J., Salakoski, T. & Vihinen, M. (1992). Selection of a
representative set of structures from Brookhaven Protein Data Bank.
Proteins: Struct. Funct. Genet. 14, 265-276.
Bowie, J. U. (1997). Helix packing in membrane proteins. J. Mol. Biol.
272, 780-789.
Boyd, D., Schierle, C. & Beckwith, J. (1998). How many membrane
proteins are there? Protein Sci. 7, 201-205.
Brenner, S. E., Chothia, C. & Hubbard, T. J. (1998). Assessing
sequence comparison methods with reliable structurally identified
distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95,
6073-6078.
Chothia, C., Levitt, M. & Richardson, D. (1981). Helix to helix
packing in proteins. J. Mol. Biol. 145, 215-250.
Creamer, T. P. & Rose, G. D. (1995). Interactions between hydrophobic
side chains within alpha-helices. Protein Sci. 4, 1305-1314.
Dunbrack, R. L., Jr & Karplus, M. (1993). Backbone-dependent rotamer
library for proteins. Application to side-chain prediction. J. Mol.
Biol. 230, 543-574.
Engelman, D. M., Steitz, T. A. & Goldman, A. (1986). Identifying
nonpolar transbilayer helices in amino acid sequences of membrane
proteins. Annu. Rev. Biophys. Biophys. Chem. 15, 321-353.
Fisher, L. E., Engelman, D. M. & Sturgis, J. N. (1999). Detergents
modulate dimerization, but not helicity, of the glycophorin A
transmembrane domain. J. Mol. Biol. 293, 639-651.
Gerstein, M. (1998). Patterns of protein-fold usage in eight microbial
genomes: a comprehensive structural census. Proteins: Struct. Funct.
Genet. 33, 518-534.
Hobohm, U. & Sander, C. (1994). Enlarged representative set of protein
structures. Protein Sci. 3, 522-524.
Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection
of representative protein data sets. Protein Sci. 1, 409-417.
Jones, D. T., Taylor, W. R. & Thornton, J. M. (1994). A mutation data
matrix for transmembrane proteins. FEBS Letters, 339, 269-275.
Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the
hydropathic character of a protein. J. Mol. Biol. 157, 105-132.
Landolt-Marticorena, C., Williams, K. A., Deber, C. M. & Reithmeier,
R. A. (1993). Non-random distribution of amino acids in the
transmembrane segments of human type I single span membrane proteins.
J. Mol. Biol. 229, 602-608.
Lemmon, M. A., Flanagan, J. M., Treutlein, H. R., Zhang, J. &
Engelman, D. M. (1992). Sequence specificity in the dimerization of
transmembrane alpha-helices. Biochemistry, 31, 12719-12725.
Lemmon, M. A., Treutlein, H. R., Adams, P. D., Brunger, A. T. &
Engelman, D. M. (1994). A dimerization motif for transmembrane
alpha-helices. Nature Struct. Biol. 1, 157-163.
Li, S. C. & Deber, C. M. (1992a). Glycine and beta-branched residues
support and modulate peptide helicity in membrane environments. FEBS
Letters, 311, 217-220.
Li, S. C. & Deber, C. M. (1992b). Influence of glycine residues on
peptide conformation in membrane environments. Int. J. Pept. Protein
Res. 40, 243-248.
Liu, L. P. & Deber, C. M. (1998). Uncoupling hydrophobicity and
helicity in transmembrane segments. Alpha-helical propensities of the
amino acids in non-polar environments. J. Biol. Chem. 273,
23645-23648.
Luecke, H., Schobert, B., Richter, H. T., Cartailler, J. P. & Lanyi,
J. K. (1999). Structure of bacteriorhodopsin at 1.55 Å resolution.
J. Mol. Biol. 291, 899-911.
MacKenzie, K. R. & Engelman, D. M. (1998). Structure-based prediction
of the stability of transmembrane helix-helix interactions: the
sequence dependence of glycophorin A dimerization. Proc. Natl Acad.
Sci. USA, 95, 3583-3590.
MacKenzie, K. R., Prestegard, J. H. & Engelman, D. M. (1997). A
transmembrane helix dimer: structure and implications. Science, 276,
131-133.
McGregor, M. J., Islam, S. A. & Sternberg, M. J. (1987). Analysis of
the relationship between side-chain conformation and secondary
structure in globular proteins. J. Mol. Biol. 198, 295-310.
Padmanabhan, S. & Baldwin, R. L. (1994). Tests for helix-stabilizing
interactions between various nonpolar side chains in alanine-based
peptides. Protein Sci. 3, 1992-1997.
Popot, J. L. & Engelman, D. M. (1990). Membrane protein folding and
oligomerization: the two-stage model. Biochemistry, 29, 4031-4037.
Ri, Y., Ballesteros, J. A., Abrams, C. K., Oh, S., Verselis, V. K.,
Weinstein, H. & Bargiello, T. A. (1999). The role of a conserved
proline residue in mediating conformational changes associated with
voltage gating of Cx32 gap junctions. Biophys. J. 76, 2887-2898.
Richmond, T. J. & Richards, F. M. (1978). Packing of alpha-helices:
geometrical constraints and contact areas. J. Mol. Biol. 119, 537-555.
Russ, W. P. & Engelman, D. M. (2000). The GxxxG motif: a framework for
transmembrane helix-helix association. J. Mol. Biol. 296, 911-919.
Samatey, F. A., Xu, C. & Popot, J. L. (1995). On the distribution of
amino acid residues in transmembrane alpha-helix bundles. Proc. Natl
Acad. Sci. USA, 92, 4577-4581.
Schrauber, H., Eisenhaber, F. & Argos, P. (1993). Rotamers: to be or
not to be? An analysis of amino acid side-chain conformations in
globular proteins. J. Mol. Biol. 230, 592-612.
von Heijne, G. (1992). Membrane protein structure prediction.
Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol.
225, 487-494.
Walther, D. & Argos, P. (1996). Intrahelical side chain-side chain
contacts: the consequences of restricted rotameric states and
implications for helix engineering and design. Protein Eng. 9,
471-478.
Walther, D., Eisenhaber, F. & Argos, P. (1996). Principles of
helix-helix packing in proteins: the helical lattice superposition
model. J. Mol. Biol. 255, 536-553.
Appendix I: Calculation of Expectation Distributions for the
Occurrence of Pairs and Triplets of Amino Acids in a Database of Short
Sequences with the TMSTAT Method
The aim of the present analysis is to survey frequently occurring
patterns of residues (pairs and triplets) in transmembrane sequences.
For this, we need some measure of the expectation of occurrence of the
patterns. The simplest way to calculate this is from the average
composition (i.e. the probability of finding a particular residue is
constant at all positions in all sequences and corresponds to its
frequency in the database). However, this approach requires the
assumption that, in terms of composition, all sequences derive from a
homogeneous population and that residues do not co-segregate or
anti-segregate in different sequences.
This assumption is not required if the expectation is based on the
composition of each individual sequence instead of the overall
composition of amino acids in the database (i.e. the probability of
finding a particular residue is constant at all positions within a
sequence and corresponds to its frequency in the sequence). However,
finite sequence length effects also need to be accounted for, since
they are quite important for short sequences (18 residues in our
case). A solution is to base the calculation on all theoretically
possible internal permutations of the sequences, that is, to take into
account the length and composition of each sequence once internal
positional information has been removed. A way to conceptualize this
is to ask: What would be the probability of finding a certain number
of occurrences of a pair in the database after all sequences have been
randomly permuted? Considering the entire theoretical set of different
databases that can be obtained from the original when the sequences
are allowed to independently assume any possible internal permutation,
the probability corresponds to the fraction of all permuted databases
that contain that exact number of occurrences of the pair.
The expectancy distribution of a pair based on all theoretical
permutations of all sequences could be approximated by cycles of
random shuffling of the sequences and sampling of the occurrences.
However, a sampling algorithm would produce estimates with errors that
are higher at the tails of the distribution, i.e. where greater
precision would be desirable. To completely avoid errors, we have
calculated analytically the exact theoretical distributions of
expectancy of any pair. The TMSTAT method is schematized in Figure 9
of the main text. The calculation is divided into two phases: in
phase 1, the probability distributions for occurrences of pairs in
single sequences were calculated and stored in a matrix table for
later use. Consider the pair ALk, A and L as examples of any two
non-identical residues at positions i, i + k: the probability that
pair ALk will occur $N_{ALk}$ times in a particular sequence is:

\[ P(N_{ALk} \mid l, k, N_A, N_L) \]

which depends on four parameters: the length of the sequence $l$, the
register $k$, and how many Ala ($N_A$) and Leu ($N_L$) residues are in
the sequence. It is defined as the fraction of all possible
permutations of the sequence containing exactly $N_{ALk}$ occurrences
of the ALk pair. An example of the calculation is shown explicitly in
the scheme for a short five-residue sequence with two Ala and two Leu
residues and at register 3. The box shows all 30 possible permutations
of the short sequence (the non-A and non-L residue is symbolized by a
dash): of the 30 possible permutations, 19 (63.3 %) have no
occurrences of the AL3 pair. The pair occurs once in ten (33.3 %) and
twice in one (3.3 %) of the permutations. All sequence probability
distributions for all relevant combinations of the four parameters
($l = 18$; $k = 1$ to 10; $N_A = 1$ to 9; $N_L = 1$ to 9) were
calculated and tabulated for later use. Pairs formed by two identical
residues, as for example LLk, obey different distributions,
$P(N_{LLk} \mid l, k, N_L)$, that were analogously calculated and
tabulated.
The specific database is considered only in phase 2, when actual
occurrences of the pairs are counted and the database probabilities
are computed. The overall probability distribution of occurrence of
the pair ALk in the database, $P_{DB}$, was calculated by iteratively
convoluting the specific single-sequence $P_j(N_{ALk})$ distributions
tabulated in phase 1 relative to the $[l_j, k, N_{A,j}, N_{L,j}]$
parameters provided by each sequence $j$ of the database considered.
The probability of observing $N_{ALk}$ occurrences of the pair ALk in
a database of $n$ sequences can be calculated according to:

\[ P_{DB}^{(n)}(N_{ALk}) = \sum_{i=0}^{N_{ALk}} P_{DB}^{(n-1)}(i)\,
   P_n(N_{ALk} - i \mid l, k, N_{A,n}, N_{L,n}) \]

defined recursively, with initial $P_{DB}^{(0)}(0) = 1$. $N_{A,n}$ and
$N_{L,n}$ are the number of Ala and Leu residues in sequence $n$.
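The recursion of phase 2 is a repeated discrete convolution, which can be sketched as follows. The function name and the list representation (index = number of occurrences) are assumptions for illustration; the example reuses the five-residue AL3 distribution from the text for two identical sequences.

```python
from fractions import Fraction


def convolve_database(per_sequence_distributions):
    """Combine single-sequence distributions into the database distribution.

    Each distribution is a list whose index is the number of occurrences.
    """
    p_db = [1]  # P_DB^(0)(0) = 1: no sequences, zero occurrences
    for p_n in per_sequence_distributions:
        updated = [0] * (len(p_db) + len(p_n) - 1)
        for i, p_prev in enumerate(p_db):
            for j, p_seq in enumerate(p_n):
                # i occurrences so far plus j in the new sequence
                updated[i + j] += p_prev * p_seq
        p_db = updated
    return p_db


if __name__ == "__main__":
    al3 = [Fraction(19, 30), Fraction(1, 3), Fraction(1, 30)]
    for n, prob in enumerate(convolve_database([al3, al3])):
        print(n, prob)
```

Exact fractions are used here only to keep the small example free of rounding; floating-point lists work with the same code.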
An example of the process is shown in the scheme, where the first
three steps and the final result are illustrated for the analysis of
the occurrences of the pair AL3. All sequences in the database
analyzed have fixed length $l$ of 18 residues (this restriction is not
necessary in general and the method applies to mixed-length sequence
databases). The first sequence of the database contains two Ala and
three Leu residues. No occurrence of AL3 is observed in this sequence
(black arrow at zero occurrence in chart). In the first step of the
procedure only one sequence has been considered and the probability
distribution of the database, $P_{DB}^{(1)}(N_{AL3})$ (bar chart),
corresponds to the probability distribution of sequence 1,
$P_1 = P(N_{AL3} \mid l = 18, k = 3, N_A = 2, N_L = 3)$.
The second sequence of the database contains five Ala and five Leu
residues, and in this case one occurrence of AL3 is observed. $P_2$ is
thus $P(N_{AL3} \mid l = 18, k = 3, N_A = 5, N_L = 5)$ and the
cumulative $P_{DB}^{(2)}$ distribution is then obtained from $P_2$ and
$P_{DB}^{(1)}$, as shown in the example. Two occurrences of AL3 are
found in the third sequence, bringing the total to three for the
database at this stage, and $P_{DB}^{(3)}$ is then obtained from $P_3$
and $P_{DB}^{(2)}$. The calculation becomes more complex as more
combinations are available and the curve assumes a more bell-shaped
character.
Once all 13,606 sequences had been analyzed, the $P_{DB}$ distribution
had converged to a bell curve. Average expected values and standard
deviations were calculated from the probability distribution curves
according to:

\[ \langle N_{ALk} \rangle = \sum_{N_{ALk}} N_{ALk}\, P_{DB}(N_{ALk}) \]

\[ SD_{ALk} = \sqrt{ \sum_{N_{ALk}} N_{ALk}^{2}\, P_{DB}(N_{ALk})
   - \Big( \sum_{N_{ALk}} N_{ALk}\, P_{DB}(N_{ALk}) \Big)^{2} } \]

The observed 4140 occurrences of AL3 in the database are slightly
above the average expectation value of 4043.1. The two-tailed integral
of the $P_{DB}(N_{AL3})$ function provided a significance for the
observed occurrences of a pair. The integration was computed on
formally derived curves; therefore, no assumption regarding the nature
of the distributions was necessary. Two-tailed integrals were used,
since both above and below-expectation values were considered
significant. The significance of the occurrences of the AL3 pair is
low (p = 0.075), that is, if the residues were actually randomly
distributed there would be a realistic possibility of observing an
equal or greater number of occurrences by random chance.
The analysis of the triplets was performed with an analogous method.
The single-sequence probability distributions were calculated for the
triplet ALk1Vk2 as:

\[ P(N_{ALk_1Vk_2} \mid l, k_1, k_2, N_A, N_L, N_V) \]

based on all possible sequence permutations and tabulated for the
relevant ranges of $l$, $k_1$, $k_2$, $N_A$, $N_L$ and $N_V$ (Ala, Leu
and Val representing any three non-identical residues at relative
spacing $k_1$ and $k_2$). Probability distributions were also
calculated for triplets in which residues are repeated (AAk1Lk2,
ALk1Ak2, ALk1Lk2, AAk1Ak2). The cumulative probability distribution,
$P_{DB}$, for the occurrence of each triplet in the database was
calculated with the same recursive formula used for the pairs. The
TMSTAT method is, in principle, applicable to quadruplets and
higher-order multiplets, although the increased number of combinations
can limit the feasibility.
Edited by G. von Heijne
(Received 4 November 1999; received in revised form 29 December
1999; accepted 29 December 1999)