
doi:10.1006/jmbi.1999.3488 available online at http://www.idealibrary.com on J. Mol. Biol. (2000) 296, 919-934

Statistical Analysis of Amino Acid Patterns in Transmembrane Helices: The GxxxG Motif Occurs Frequently and in Association with β-branched Residues at Neighboring Positions

Alessandro Senes, Mark Gerstein and Donald M. Engelman*

Department of Molecular Biophysics & Biochemistry, Yale University, P.O. Box 208114, New Haven CT 06520-8114, USA

E-mail address of the corresponding author: [email protected]

Abbreviations used: TM, transmembrane; TMD, transmembrane domain; GpA, glycophorin A; CD, circular dichroism.

0022-2836/00/030919-16 $35.00/0

To find motifs that mediate helix-helix interactions in membrane proteins, we have analyzed frequently occurring combinations of residues in a database of transmembrane domains. Our analysis was performed with a novel formalism, which we call TMSTAT, for exactly calculating the expectancies of all pairs and triplets of residues in individual sequences, taking into account differential sequence composition and the substantial effect of finite length in short segments. We found that the number of significantly over-represented and under-represented pairs and triplets was much greater than the random expectation. Isoleucine, glycine and valine were the most common residues in these extreme cases. The main theme observed is patterns of small residues (Gly, Ala and Ser) at i and i + 4 found in association with large aliphatic residues (Ile, Val and Leu) at neighboring positions (i.e. i ± 1 and i ± 2). The most over-represented pair is formed by two glycine residues at i and i + 4 (GxxxG, 31.6 % above expectation, p < 1 × 10^-33) and it is strongly associated with the neighboring β-branched residues Ile and Val. In fact, the GxxxG pair has been described as part of the strong interaction motif in the glycophorin A transmembrane dimer, in which the pair is associated with two Val residues (GVxxGV). GxxxG is also the major motif identified using TOXCAT, an in vivo selection system for transmembrane oligomerization motifs. In conjunction with these experimental observations, our results highlight the importance of the GxxxG β-branched motif in transmembrane helix-helix interactions. In addition, the special role for the β-branched residues Ile and Val suggested here is consistent with the hypothesis that residues with constrained rotameric freedom in helical conformation might reduce the entropic cost of folding in transmembrane proteins. Additional material is available at http://engelman.csb.yale.edu/tmstat and http://bioinfo.mbb.yale.edu/tmstat.

© 2000 Academic Press

Keywords: membrane proteins; protein folding; glycine; β-branched; sequence analysis
*Corresponding author
Introduction

The two dozen high-resolution structures of integral membrane proteins available so far have revealed only two simple folds, the helical bundle and the closed beta barrel. These folds are the simplest solutions to satisfying the hydrogen bonding potential of the polypeptide backbone amide groups in the lipid bilayer. In the helical family, the membrane-spanning domains are generally composed of very hydrophobic stretches of 20-30 amino acid residues. Algorithms based on hydrophobicity scales (Boyd et al., 1998; Engelman et al., 1986; Kyte & Doolittle, 1982; von Heijne, 1992) reliably identify these domains from primary sequences. As a consequence, a large database of predicted helical transmembrane (TM) domains (TMD) exists.


920 Analysis of Residue Patterns in TM Helices

Thus, structural information in a membrane protein sequence can be statistically interpreted. Elements of the structural simplicity of these proteins suggest the existence of commonly used patterns in transmembrane helix-helix interactions. First, the space that natural selection can sample in search of favorable combinations seems to be limited by the low complexity of the sequences, since two-thirds of transmembrane residues comprise, on average, only six amino acids (Leu, Ile, Val, Phe, Ala, and Gly), as schematized in Figure 1(a). The helices tend to adopt a perpendicular orientation in order to span the bilayer (Bowie, 1997), and helix packing theories suggest that only a subset of the relative inter-helical orientations are optimal for interaction (Bowie, 1997; Chothia et al., 1981; Richmond & Richards, 1978; Walther et al., 1996). Moreover, the need for a detailed fit to maximize weak van der Waals interactions and the preference for preformed interfaces to minimize entropy lost upon packing, as postulated by MacKenzie & Engelman (1998), could also limit the number of conformations suitable for interaction.

The existence of correlations between residues has been suggested in previous statistical studies on predicted TM sequence databases (Arkin & Brunger, 1998; Landolt-Marticorena et al., 1993; Samatey et al., 1995). Here, we present a rigorous analysis of the frequency of occurrence of all pairs and triplets of amino acids in a large non-homologous set of sequences, compared with their theoretical expectancies.

Expectancy can be calculated trivially from the product of the frequencies of the amino acids in the database. However, this method requires the assumption that, in terms of composition, the sequences belong to a homogeneous population. We have formulated a procedure for calculating analytically the expectancy distributions of occurrence based on the composition of the individual sequences. Our method takes into account the finite-length effect, which becomes very important for short sequences such as transmembrane domains. The ratios between the observed occurrences and their mean expectancy values (odds ratios) allowed for the identification of the over-represented and under-represented pairs and triplets. The exact expectancy distribution permitted the calculation of a precise statistical significance for the observed differences from expectation. Our results show that a large number of significant cases exist, suggesting structural themes in helix-helix interaction and a special role for glycine and the β-branched Ile and Val in transmembrane domains.

Results

Characteristics of the database

The collection of sequences used in this analysis was obtained from the 46,946 transmembrane domains annotated in the Swiss-Prot database. In 94 % of the cases, the annotations were marked as potential, possible or probable, indicating that they were identified by hydrophobicity algorithms (Table 1). In order to remove homology, TM sequences with high similarity scores to others were excluded, as described in Methods. The procedure yielded a database with 13,606 TMDs, an adequate size for the proposed analysis.

Helical transmembrane domains generally vary in length from 20 to 30 residues. The sequences annotated in Swiss-Prot are mainly in that range, as shown in the histogram in Figure 1(b). Instead of using the complete annotations, however, we performed the analysis on fixed-length windows of 18 residues selected for maximum hydrophobicity.

Table 1. Transmembrane annotations in the Swiss-Prot database (rel. 37 and updates until March 17, 1999)

                            All proteins                    Bitopic proteins(a)             Polytopic proteins(b)
                            Protein     TMs(d)   Potential  Protein     TMs(d)   Potential  Protein     TMs(d)   Potential
                            records(c)           (%)(e)     records(c)           (%)(e)     records(c)           (%)(e)
Complete database(f)        10,769      46,946   94.3       3863        3863     78.5       6906        43,053   95.8
  Eukaryota                 6587        27,288   92.5       2630        2630     74.4       3957        24,658   94.5
  Bacteria                  3156        16,881   97.2       706         706      93.6       2450        16,175   97.4
  Archaea                   297         1341     98.2       71          71       94.4       226         1270     98.4
  Viruses                   729         1436     90.6       456         456      76.3       273         980      97.2
Non-homologous database(g)  5309        13,606   96.4       1174        1174     84.2       4135        12,432   97.6
  Eukaryota                 2510        5619     95.2       749         749      79.0       1761        4870     97.6
  Bacteria                  2333        6963     97.4       294         294      95.2       2039        6669     97.5
  Archaea                   244         666      98.6       43          43       97.7       201         623      98.7
  Viruses                   222         358      93.6       88          88       84.1       134         270      96.7

a Proteins containing a single transmembrane domain (single-span).
b Proteins containing multiple transmembrane domains (multi-span).
c Number of proteins containing transmembrane annotations.
d Annotated transmembrane domains (TRANSMEM entries in the FT field).
e Transmembrane annotations marked as POTENTIAL or POSSIBLE or PROBABLE.
f Database containing all Swiss-Prot annotated transmembrane domains.
g Database used for the statistical analysis, obtained from the complete database after homology removal.

Figure 1. Average composition and length of the transmembrane annotations in Swiss-Prot. (a) Composition; the shaded areas emphasize the fact that, on average, one half of transmembrane residues comprise only four amino acids and two-thirds of the total only six amino acids. (b) Length distribution of all transmembrane annotations. Inset: distribution of the sequences not labeled as POTENTIAL, PROBABLE or POSSIBLE.

Figure 2. Variation in the amino acid composition at different positions in the 18 amino acid residue transmembrane sequences used for the pair and triplet correlation analysis. Each 18 residue sequence corresponds to the most hydrophobic window of the 30 residue span centered on each Swiss-Prot annotation.


The composition of transmembrane sequences varies along sections exposed to different environments (water interface, lipid head-group and hydrocarbon regions). To avoid highlighting amino acid correlations that are due to these variations, it is important to limit the analysis to the portion of the sequence likely to be exposed to the hydrocarbon region. Moreover, the exact definition of the boundaries of a transmembrane domain is a non-trivial problem, even among solved structures. The majority of the sequences are putative, and incorrect assignment of the boundaries would result in contamination from flanking regions; using shorter sequences selected for hydrophobicity should minimize this risk. The exact boundaries of the transmembrane regions are generally not well established in Swiss-Prot, as demonstrated by the peak at 21 residues in the length distribution of the annotations (Figure 1(b)). Such a sharp demarcation is probably an artifact of the algorithms commonly used to identify transmembrane domains. The distribution is much more widespread when the putative annotations are excluded (inset). The amino acid composition of each of the 18 positions of the analyzed sequences is shown in Figure 2. The major differences in composition are limited to the extremities, and the variations are significantly reduced compared to those of the original annotations (data not shown).
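The window selection described above (the most hydrophobic 18 residue window within a 30 residue span) can be illustrated with a minimal Python sketch. The choice of the Kyte & Doolittle (1982) hydropathy scale, the function name `most_hydrophobic_window` and the example span are assumptions made here for illustration; this section does not state which scale was used for the analysis.

```python
# Kyte & Doolittle (1982) hydropathy values; using this scale is an assumption.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def most_hydrophobic_window(span, width=18):
    """Return (start, window) for the window with the highest summed hydropathy."""
    if len(span) < width:
        raise ValueError("span is shorter than the window width")
    starts = range(len(span) - width + 1)
    # max() keeps the first window in case of a tie
    best = max(starts, key=lambda s: sum(KD[a] for a in span[s:s + width]))
    return best, span[best:best + width]


# Hypothetical 30 residue span: polar flanks around an 18 residue apolar core.
span = "KRDESS" + "LIVALLGAVIGLLVFAIL" + "NQRKDE"
start, window = most_hydrophobic_window(span)
print(start, window)
```

In this toy case the selected window coincides with the apolar core, so the polar flanking residues are excluded from any subsequent pair counting.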

Analytical procedures

The one-letter code of the two amino acids followed by their separation k will be used to indicate pairs of residues at distance i, i + k (k is also referred to as the register). For example, the pair in which Ala and Leu are at i, i + 3 (AxxL) is indicated by AL3. The occurrences in the TM sequence database of all 4000 pairs formed by all combinations of the 20 amino acids at registers 1 to 10 were counted. Raw counts are not very informative, since the main factor determining the gross number of occurrences of a pair in a set of sequences is the relative frequency of its residues. The Leu-Leu pairs at all separations (LLk), for instance, are the most numerous among the pairs in the database, since Leu is by far the most frequent residue. Therefore, to identify specific relationships that might be clues to helix interactions, it is necessary to refer to an expectation of the occurrence of each amino acid pair and triplet that permits distinguishing over-represented and under-represented pairs while accounting for the relative frequency of the amino acids. We calculated the expectation with a novel formalism named TMSTAT that incorporates both the composition and the length of every individual sequence. It therefore does not require the assumption that all sequences belong to a homogeneous composition distribution, and it accounts for finite-length effects. Based on formally derived probability distributions, a statistical significance (p) was assigned to any observed difference from expectation, i.e. the probability that a difference equal to or larger than that observed could occur by chance if the residues were actually randomly distributed. The TMSTAT method is presented in detail in the Appendix and Figure 9.

Figure 3. Occurrences of Gly-Gly pairs and their expectation. (a) Probability function P(N_GG4) associated with any possible number of occurrences (N_GG4) of the pair GG4 (GxxxG) in the database of TM sequences. The arrow marks the actual occurrences observed (observed = 1641; expected = 1246.8; statistical significance p = 6.4 × 10^-34). (b) Observed occurrences N_GGk of all GGk pairs (*) as a function of distance i, i + k. The straight line represents the expectation. The line is sloped due to end effects: in a sequence with finite length more pairs are possible at short register than at longer register. The error bars define the 99 % interval of confidence around the expectations (the range outside which a value has significance p < 0.01).
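The pair notation and the idea of a per-sequence expectation can be made concrete with a short Python sketch. It is not the TMSTAT formalism itself, which is given in the Appendix; it only computes the mean count of a pair under random shuffling of each individual sequence, which depends on that sequence's own composition and length. The function names and the two toy sequences are hypothetical.

```python
from collections import Counter


def count_pair(seq, x, y, k):
    """Count positions i with seq[i] == x and seq[i + k] == y (pair 'XYk')."""
    return sum(1 for i in range(len(seq) - k) if seq[i] == x and seq[i + k] == y)


def shuffled_expectation(seq, x, y, k):
    """Mean count of pair 'XYk' over all permutations of this one sequence.

    Only the composition and length of `seq` are used; a sequence of length n
    offers n - k slots at register k, which is the finite-length effect.
    """
    n = len(seq)
    if k >= n:
        return 0.0
    comp = Counter(seq)
    ordered = comp[x] * (comp[x] - 1) if x == y else comp[x] * comp[y]
    return (n - k) * ordered / (n * (n - 1))


# Two hypothetical 18 residue windows standing in for the database.
toy_db = ["LIGAVLGALLIVALLGVL", "AVLLIGLLAVLLIALLVL"]

observed = sum(count_pair(s, "G", "G", 4) for s in toy_db)
expected = sum(shuffled_expectation(s, "G", "G", 4) for s in toy_db)
print(f"GG4: observed {observed}, expected {expected:.3f}")
```

Summing the per-sequence expectations, rather than multiplying database-wide residue frequencies, is what removes the assumption of a homogeneous composition across sequences.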

GxxxG (GG4) is the most significant pair

In analyzing 4000 randomly distributed variables, one would expect to observe by chance one instance of a difference from expectation with significance p < 0.00025 (1/4000); in the present analysis, 117 pairs deviate from the expected occurrences with at least that significance. The data for the most significant outliers found are shown in Table 2A (over-represented pairs) and Table 2B (under-represented pairs). At least one of the ten most significant over-represented pairs is found in 76 % of the sequences in the database. (A table containing observed and expected occurrences of all 4000 pairs is available at http://engelman.csb.yale.edu/tmstat and http://bioinfo.mbb.yale.edu/tmstat).

GG4 (GxxxG) is the pair with the strongest positive correlation. In the database, GG4 occurs 1641 times, 32 % more than its expectation of 1247 (odds ratio, observed/expected = 1.32). The occurrence of the GG4 pair is compared with its expectation curve in Figure 3(a). The probability of observing an equal or greater difference from expectation by chance is negligible (p = 6.4 × 10^-34). The high frequency of GG4 in a database of predicted TM sequences was reported by Arkin & Brunger (1998); however, the expectation and significance for the occurrences of the pair were not calculated.

The observed occurrences of all Gly-Gly pairs as a function of separation and their expectation values are plotted in Figure 3(b). In the profile, the pair GG4 peaks between two positions that are relatively "unbiased". A significant negative bias is observed for the interaction of two adjacent Gly residues (GG1, 9 % below expectation, p = 1.0 × 10^-4). A second positive peak, corresponding to GG7, is observed 16 % more often than its expected occurrences (p = 2.9 × 10^-8). Clearly, the observed correlations must derive from specific position-dependent selection of residue properties.
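The arithmetic behind the GG4 figures can be checked with a few lines of Python. The observed count, expectation and standard deviation are the GG4 values listed in Table 2A; the z-score is added here only as a rough indication of how far the observation lies from the mean and is not the exact significance computed by TMSTAT.

```python
# GG4 values as listed in Table 2A
observed, expected, sd = 1641, 1246.8, 31.3

odds_ratio = observed / expected            # observed/expected
excess_percent = 100 * (odds_ratio - 1)     # percent above expectation
z_score = (observed - expected) / sd        # distance from the mean in SD units

# 20 x 20 residue combinations at registers 1 to 10
n_pairs = 20 * 20 * 10
threshold = 1 / n_pairs                     # p-value expected once by chance
expected_by_chance = n_pairs * threshold    # about one pair under the null

print(f"odds ratio {odds_ratio:.2f}, excess {excess_percent:.1f} %, z {z_score:.1f}")
print(f"{n_pairs} pairs, threshold {threshold}, expected by chance {expected_by_chance:.0f}")
```

Against an expectation of about one pair at p < 0.00025, the 117 pairs reported at that level represent a large excess of significant deviations.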

Pairs containing isoleucine, glycine and valine are the most biased

The 30 most significant over-represented pairs, shown in Table 2A, frequently contain isoleucine, glycine and valine residues. Ile is present in 14 pairs (five of the top eight cases). Gly exists in 11 pairs (five of the top six cases). Val is found in eight pairs (four of the top ten cases). Leu, the most common residue in transmembrane domains, is found in only five pairs and never in the top ten cases.

Pair correlation results reflect helical periodicity

The pie diagram shown in Figure 4(a) depicts the proportion of the most significant pairs of Table 2A grouped by register. Pairs with registers 1 to 4 comprise 90 % of the cases. Together, registers 1 and 4 are found in almost two-thirds of the cases. A stronger tendency of pairs at i, i + 1 and i, i + 4 to deviate from expectation is found in the entire set of 4000 pairs. The tendency is evidenced by the χ² scores of pairs as a function of register, shown in Figure 4(b), that clearly peak at these two positions. These results are strongly consistent with helical geometry. Four residues comprise about one helical turn in regular α-helix conformation (3.6 residues per turn); at i, i + 1 and i, i + 4, both residues of the pair are presented on the same face of a regular α-helix, as schematized in the wheel diagram of Figure 4(c).
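The helical wheel geometry can be reproduced numerically. The sketch below assumes an ideal α-helix with 3.6 residues per turn, i.e. a rotation of 100° per residue, and reports the angular offset between residue i and residue i + k; the function name `wheel_angle` is introduced here for illustration only.

```python
DEGREES_PER_RESIDUE = 100  # 360 degrees / 3.6 residues per turn


def wheel_angle(k):
    """Angular offset in degrees, in (-180, 180], between residues i and i + k."""
    angle = (k * DEGREES_PER_RESIDUE) % 360
    return angle - 360 if angle > 180 else angle


for k in range(1, 11):
    print(f"i + {k:2d}: {wheel_angle(k):5d} degrees")
```

With these assumptions, i + 4 lies 40° from residue i and i + 7 lies 20° from it on the other side, whereas i + 2 is rotated by 160° and therefore points to the nearly opposite face.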

Table 2. The 30 most significant over-represented and under-represented pairs sorted by significance

Pair    Occurrences(a)  Expectation(b)  Standard deviation(c)  Significance (p)(d)  Odds ratio(e)

A. Over-represented pairs
GG4     1641            1246.8          31.3                   6.4 × 10^-34         1.32
II4     3782            3289.2          48.2                   8.4 × 10^-24         1.15
GA4     2057            1698.4          37.0                   3.6 × 10^-21         1.21
IG1     2721            2318.4          41.9                   4.8 × 10^-21         1.17
IG2     2528            2182.1          41.1                   1.3 × 10^-16         1.16
VG2     2268            1945.2          39.1                   5.7 × 10^-16         1.17
IV4     3003            2636.3          45.5                   2.1 × 10^-15         1.14
IP1     992             788.8           25.2                   4.5 × 10^-15         1.26
VV4     2770            2443.2          42.5                   3.8 × 10^-14         1.13
VI4     2965            2636.3          45.5                   1.1 × 10^-12         1.12
AV1     3149            2823.2          45.8                   2.2 × 10^-12         1.12
GL3     3392            3062.7          47.7                   9.7 × 10^-12         1.11
AG4     1929            1698.4          37.0                   9.1 × 10^-10         1.14
WQ1     88              45.8            6.5                    3.9 × 10^-9          1.92
IL4     4784            4446.3          57.3                   4.9 × 10^-9          1.08
AA3     2719            2477.0          42.2                   1.3 × 10^-8          1.10
VG1     2295            2066.7          40.0                   1.8 × 10^-8          1.11
GG7     1138            979.6           28.0                   2.9 × 10^-8          1.16
VL4     4362            4064.3          54.9                   7.7 × 10^-8          1.07
IS2     1916            1717.0          36.7                   9.0 × 10^-8          1.12
SI2     1912            1717.0          36.7                   1.5 × 10^-7          1.11
GI1     2536            2318.4          41.9                   2.9 × 10^-7          1.09
IY10    496             397.0           19.0                   4.5 × 10^-7          1.25
YY3     245             180.9           12.4                   6.3 × 10^-7          1.35
IF10    1617            1443.1          35.7                   1.6 × 10^-6          1.12
GI2     2375            2182.1          41.1                   3.3 × 10^-6          1.09
PI3     809             696.0           24.0                   4.0 × 10^-6          1.16
PV1     777             667.8           23.4                   5.0 × 10^-6          1.16
PL1     1342            1203.8          30.0                   5.4 × 10^-6          1.11
LP1     1342            1203.8          30.0                   5.4 × 10^-6          1.11

B. Under-represented pairs
II2     3223            3759.1          50.6                   5.1 × 10^-27         0.86
GI4     1564            1909.3          39.1                   1.4 × 10^-19         0.82
IL1     4906            5399.1          60.7                   2.5 × 10^-16         0.91
FL1     3954            4394.8          55.4                   9.4 × 10^-16         0.90
FI4     2182            2525.4          44.5                   4.1 × 10^-15         0.86
IG4     1620            1909.3          39.1                   4.8 × 10^-14         0.85
LW4     611             786.7           25.0                   5.2 × 10^-13         0.78
IV2     2683            3013.0          47.6                   2.3 × 10^-12         0.89
YL4     788             974.5           27.9                   7.3 × 10^-12         0.81
PG1     311             434.2           19.3                   2.8 × 10^-11         0.72
CP1     56              113.1           10.1                   9.0 × 10^-10         0.50
FV3     1991            2244.3          42.1                   1.1 × 10^-9          0.89
AP1     508             642.2           22.9                   1.8 × 10^-9          0.79
IW4     376             493.2           20.4                   2.9 × 10^-9          0.76
IM1     922             1091.4          29.5                   4.7 × 10^-9          0.84
FL3     3575            3877.7          53.3                   1.1 × 10^-8          0.92
FV4     1869            2094.6          41.0                   2.5 × 10^-8          0.89
FI3     2462            2705.7          45.6                   6.7 × 10^-8          0.91
LW3     707             842.8           25.7                   7.5 × 10^-8          0.84
VI2     2759            3013.0          47.6                   7.7 × 10^-8          0.92
GP1     335             434.2           19.3                   1.2 × 10^-7          0.77
YI4     575             694.8           24.0                   3.7 × 10^-7          0.83
FL2     3862            4136.3          54.4                   3.9 × 10^-7          0.93
VG4     1517            1702.0          37.2                   5.0 × 10^-7          0.89
FF2     2244            2450.3          41.9                   7.1 × 10^-7          0.92
FM1     743             872.3           26.7                   8.4 × 10^-7          0.85
FL6     2861            3102.2          49.4                   8.8 × 10^-7          0.92
II1     3744            3994.0          51.6                   1.1 × 10^-6          0.94
WV1     454             549.1           21.3                   5.1 × 10^-6          0.83
LL2     7509            7821.3          69.4                   6.5 × 10^-6          0.96

a Number of observed occurrences of the pair in the database.
b Average expected number of occurrences.
c Standard deviation of the expectation distribution.
d Calculated as two-tailed integral of the expectation distribution.
e Occurrences/Expectation ratio.


Figure 4. (a) Relative frequency of the 30 most significant pairs of Table 2A grouped by register. (b) Overall deviation from expectation at different registers, calculated as the χ² score on the entire set of pairs, with pairs grouped by register (group size n = 400):

$$\chi^2 = \sum_{\text{pairs}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

(c) Relative angular position along the helical axis of residues in pairs at different registers. The filled circle at the top of the helical wheel diagram (3.6 residues per turn) indicates the residue at i (*). The position of the residue at i + k is indicated by the respective number k. The arrows mark the registers with the highest overall tendency to diverge from expectation, as observed in (a) and (b).
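The χ² score of Figure 4(b) is a plain sum over the pairs of one register. The following Python sketch implements the formula above and applies it, purely as an illustration, to three register 4 entries taken from Table 2 (GG4, II4 and GI4); the score in the Figure is computed over all 400 pairs of each register, so the number printed here is not comparable with it.

```python
def chi_square(pairs):
    """Sum of (observed - expected)^2 / expected over (observed, expected) tuples."""
    return sum((obs - exp) ** 2 / exp for obs, exp in pairs)


# (observed, expected) for GG4, II4 and GI4, as listed in Table 2
register_4_subset = [(1641, 1246.8), (3782, 3289.2), (1564, 1909.3)]

score = chi_square(register_4_subset)
print(f"chi-square over {len(register_4_subset)} pairs: {score:.1f}")
```

Because the deviation is squared, over-represented pairs (GG4, II4) and under-represented pairs (GI4) both increase the score, which is why it measures the overall tendency of a register to diverge from expectation in either direction.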

Figure 5. Normalized occurrences of pairs formed by combinations of the β-branched residues Ile and Val, and Gly, at all registers. Odds ratio (observed occurrences/expected). The bars represent the 99 % confidence interval around the expectation. (a) Pairs formed by Ile and Val. (b) Pairs formed by Gly with Ile or Val.


Similar biases are found with pairs of residues with similar structure

A remarkable feature of the results shown in Table 2 is that most of the pairs can be grouped by register and side-chain chemistry into a few categories. For example, GG4, GA4 and AG4 are all observed among them. Similarly, all combinations of the β-branched aliphatic residues at i, i + 4 (II4, IV4, VI4 and VV4) are extremely significant. The pairs IL4 and VL4 are also among the most significant pairs (Leu is isomeric to Ile but γ-branched). There are many pairs formed by one small residue (Ala, Gly and Ser) and a β-branched aliphatic residue at i, i + 1 (IG1 and VG1; GI1 and AV1) and i, i + 2 (IG2, VG2 and IS2; SI2 and GI2). Finally, a number of pairs are formed by Pro and large aliphatic residues (Ile, Val and Leu) at register 1 (IP1 and LP1; PV1 and PL1). In the list of the significant under-represented pairs, combinations of β-branched residues and glycine are very disfavored at i, i + 4 (GI4, IG4 and VG4) and neighboring Pro and Gly are also disfavored (PG1 and GP1).

The correspondence between the observed biases and side-chain chemistry also appears in the comparison of pair profiles at all registers. In Figure 5, the occurrences of pairs normalized to their expectancy (odds ratios, observed/expected) are plotted as a function of register. In Figure 5(a), all pairs formed by combinations of Ile and Val have very similar profiles with a strong positive correlation at i, i + 4 and a negative peak at i + 2. Striking similarity is also evident in Figure 5(b), where the profiles of pairs formed by Gly and Ile or Val are shown.

These results suggest a general tendency for two large aliphatic residues (in particular the β-branched ones) to correlate at i, i + 4 when they are on the same face of the helix and to anti-correlate when they are on opposite faces. Pairs of smaller residues (in particular Gly) on the same face of the helix are also favored. Lastly, pairs formed by one small and one large residue correlate positively on adjacent (i, i + 1) or opposite faces (i + 2) and, conversely, are strongly disfavored on the same face (i + 4).

Figure 6. Odds ratios of pairs formed by similar residues. [Small], small residues, Gly, Ala and Ser; [Large], large aliphatic residues, Ile, Val and Leu. The error bars mark the 99 % confidence interval around the expectation.


These general themes can be appreciated in Figure 6, where sets of pairs at the same register are grouped by side-chain size and compared. All pairs formed by two small residues (Gly, Ala and Ser) at register 4 are positively biased with a significance of at least p < 0.01 (except the case of AS4). The β-branched residues Ile and Val correlate very strongly at register 4 (II4, IV4, VI4, VV4). Interestingly, Leu seems to be part of the trend of positively correlating [Large][Large] pairs at i, i + 4 only when it is occupying the C-terminal position (IL4, VL4); all pairs in which Leu precedes a second large residue (LI4, LV4, LL4) are unbiased. The majority of the combinations of large and small residues at registers 1 and 2 have a positive bias with a significance of at least p < 0.05. Not all deviations from expectation are large or very significant. However, the observed trends can be taken with more confidence than the individual deviations, as it is less probable for a series of random deviations to occur all in the same direction.

Analysis of triplets shows that residue correlations extend beyond the pair level

The relationships between pairs of larger and smaller residues suggested that the correlations were not limited to the pairs, since positively correlating pairs can be consistently combined to form higher-order patterns. This was confirmed by extending the analysis to triplets. The occurrences of 200,000 amino acid triplets were counted and compared to an expectation computed with the same method used for the pairs. The reference was therefore calculated on the frequencies of the single residues in the sequences and not relative to the pairs. The 30 triplets with the strongest positive correlation are listed in Table 3. The most significant triplets were indeed composed of combinations of strongly biased pairs. For example, the most significant case IG1L3 (IGxxL) is composed of IG1, IL4 and GL3, all observed in the 15 most biased pairs. The significance of IGxxL (p = 1.8 × 10^-20) is slightly lower than that of IG1 but higher than those of IL4 and GL3. However, it is incorrect to compare the p values, since, on average, the triplets have a smaller number of occurrences than the pairs, and p values strongly depend on "sample size" (for example, when a coin is tossed once, 100 % "heads" is not a significant result, but in one million tosses 51 % heads undoubtedly indicates a defective coin). A more appropriate value for comparison is the odds ratio (observed/expected occurrences). In the most significant triplets, the observed odds ratios always exceed those of the corresponding pairs.
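The coin example can be verified numerically. The sketch below computes a two-sided p-value against a fair coin from the exact binomial distribution, using log-gamma to avoid overflow with one million tosses; the function names are illustrative and the calculation is unrelated to the TMSTAT distributions.

```python
import math


def log_pmf(x, n):
    """Log probability of exactly x heads in n tosses of a fair coin."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            - n * math.log(2.0))


def two_sided_p(heads, tosses):
    """Two-sided p-value against a fair coin, using the symmetry of Binomial(n, 0.5)."""
    extreme = max(heads, tosses - heads)   # fold the observation onto the upper tail
    tail = 0.0
    for x in range(extreme, tosses + 1):
        term = math.exp(log_pmf(x, tosses))
        if term == 0.0:                    # all remaining terms underflow as well
            break
        tail += term
    return min(1.0, 2.0 * tail)


p_single = two_sided_p(1, 1)                 # one toss, 100 % heads
p_million = two_sided_p(510_000, 1_000_000)  # one million tosses, 51 % heads
print(p_single, p_million)
```

The single toss gives a p-value of 1, while the 51 % result over one million tosses gives a vanishingly small one, although the effect size (the odds ratio relative to 50 %) is far larger in the first case. This is the reason odds ratios, not p values, are compared between pairs and triplets.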

Triplets containing the pair GG4 are present many times in Table 3, mostly in conjunction with Ile, Val or Leu at registers ±1 and ±2 with respect to the Gly residues. The interaction of the GG4 pair with Ile and Val at these distances is evident in Figure 7, which illustrates the effect of a third residue at positions relative to the GG4 pair. In addition, many strongly correlating triplets in Table 3 contain two large aliphatic residues interacting with one Gly or another small residue at position ±1 and ±2 (IG1L3, IG2I2, VG2I2, IG1I3, IS2I2, IA3V1, etc.). Together, these correlations define the main theme of the analysis, i.e. patterns of larger and smaller residues that are strongly favored to coexist at neighboring helical faces.
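The compact pair and triplet codes used in the text and tables (for example GG4 for GxxxG and IG1L3 for IGxxL) can be expanded mechanically. The following Python sketch, with hypothetical helper names, converts a code into its spaced pattern and lists the pairs contained in a triplet.

```python
import re
from itertools import combinations


def placed_residues(code):
    """Turn 'IG1L3' into [('I', 0), ('G', 1), ('L', 4)] (residue, position)."""
    placed = [(code[0], 0)]
    for residue, gap in re.findall(r"([A-Z])(\d+)", code[1:]):
        placed.append((residue, placed[-1][1] + int(gap)))
    return placed


def expand(code):
    """Spaced pattern for a code, with 'x' at unconstrained positions."""
    placed = placed_residues(code)
    pattern = ["x"] * (placed[-1][1] + 1)
    for residue, position in placed:
        pattern[position] = residue
    return "".join(pattern)


def component_pairs(code):
    """All pair codes contained in a pair or triplet code."""
    return [f"{a}{b}{pb - pa}"
            for (a, pa), (b, pb) in combinations(placed_residues(code), 2)]


print(expand("GG4"), expand("IG1L3"), expand("GG4T4"))
print(component_pairs("IG1L3"))
```

For IG1L3 this yields the pairs IG1, IL4 and GL3, the same three pairs named in the text as the components of the most significant triplet.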

Discussion

Many of the amino acid correlations that were found in the present analysis are readily interpretable in terms of helix-helix interaction patterns. Most of the positively correlating pairs occur at separations i, i + 1 and i, i + 4, i.e. on the same face in α-helical conformation. At register i, i + 4 there is a marked preference for pairs of residues with similar size, while combinations of a small and a large residue are strongly disfavored. Furthermore, the GG4 pair and its relationship with β-branched residues at i ± 1 relative to the glycine residues has been observed in two important membrane oligomerizing systems: in the interface of the glycophorin A (GpA) transmembrane dimer (Lemmon et al., 1994; MacKenzie et al., 1997), and by an in vivo selection system for transmembrane helix-helix association (Russ & Engelman, 2000). The other strong correlation of GG4 with β-branched residues at i ± 2 is more difficult to explain in terms of helix-helix interaction, because these patterns in an α-helical conformation would place the residues on opposite sides of the helix. We propose a possible explanation for this pattern in terms of helix flexibility modulation.

Table 3. The 30 most significant over-represented triplets sorted by significance

Triplet  Occurrences(a)  Expectation(b)  Standard deviation(c)  Significance (p)(d)  Odds ratio(e)
IG1L3    535             353.4           18.3                   1.8 × 10^-20         1.51
IG2I2    399             258.1           15.6                   3.8 × 10^-18         1.55
IG2G4    244             137.6           11.5                   6.1 × 10^-18         1.77
IG1A4    309             191.7           13.6                   1.6 × 10^-15         1.61
GV2G2    244             143.3           11.7                   3.6 × 10^-15         1.70
VG2I2    331             211.0           14.3                   6.7 × 10^-15         1.57
IG1I3    382             258.1           15.6                   7.4 × 10^-14         1.48
GG4G4    146             75.9            8.8                    1.4 × 10^-13         1.92
IV4L4    488             348.1           18.3                   5.1 × 10^-13         1.40
IP1I3    162             88.7            9.2                    5.9 × 10^-13         1.83
IS2I2    319             211.4           14.1                   1.1 × 10^-12         1.51
GI2G2    255             160.6           12.4                   1.6 × 10^-12         1.59
IG1G4    236             149.1           11.9                   4.7 × 10^-12         1.58
IA3V1    388             274.0           16.2                   7.7 × 10^-12         1.42
IG2L2    485             353.4           18.4                   1.1 × 10^-11         1.37
VG2G4    201             122.9           10.9                   2.6 × 10^-11         1.64
II4L4    555             419.6           19.7                   2.7 × 10^-11         1.32
VV4G2    257             169.5           12.7                   1.2 × 10^-10         1.52
PI3G2    90              43.1            6.5                    2.2 × 10^-10         2.09
VG5L3    334             234.8           15.1                   4.4 × 10^-10         1.42
IA2I2    428             316.7           17.2                   7.1 × 10^-10         1.35
GG4I2    213             137.6           11.5                   9.6 × 10^-10         1.55
AC3A4    71              32.3            5.6                    1.6 × 10^-9          2.20
VG2L3    413             305.3           17.1                   1.7 × 10^-9          1.35
VG1G4    206             133.1           11.3                   1.9 × 10^-9          1.55
IG2L3    439             328.2           17.7                   2.2 × 10^-9          1.34
GG4L3    274             189.6           13.5                   3.3 × 10^-9          1.45
VG2L2    438             328.8           17.7                   4.0 × 10^-9          1.33
IV4V4    298             210.7           14.1                   4.5 × 10^-9          1.41
GL3G1    334             241.3           15.1                   4.9 × 10^-9          1.38

a Number of observed occurrences of the triplet in the database.
b Average expected number of occurrences.
c Standard deviation of the expectation distribution curve.
d Calculated as two-tailed integral of the expectation distribution.
e Occurrences/Expectation ratio.

Comparison with GpA transmembrane dimer

GG4 is the key feature of the dimerization interface of glycophorin A, the best characterized transmembrane helix-helix interaction. The single TMD of GpA forms a symmetric right-handed homodimer based on the seven residue motif LIxxGVxxGVxxT (Lemmon et al., 1992, 1994). The glycine residues allow the backbones to reach close proximity and the larger side-chains pack in a "ridges into grooves" fashion (MacKenzie et al., 1997).

Many other features of the GpA interaction motif are found among the most significant results of the present analysis: IV4 and VV4, for instance, are two of the most strongly correlating amino acid pairs. In addition, the majority of the amino acid triplets of the motif correlate positively in this analysis, as shown in Table 4.

Comparison with the TOXCAT in vivo selection system for helix-helix interaction

The GG4 pair is almost invariably present in the transmembrane oligomerization motifs identified from randomized sequences by the TOXCAT in vivo selection system, presented in the accompanying paper (Russ & Engelman, 2000). Seven positions with the periodicity of the GpA motif were randomized to a set of possibilities at each motif position in the context of a poly-leucine (Leu library) or poly-alanine (Ala library) background. The results (refer to Figure 4 in the accompanying paper) often contained the theme of large residues (Ile, Val or Leu) associated with the GG4 pair at positions ±1, in excellent agreement with the present statistical analysis.

Figure 7. Triplet analysis: interaction of a third residue with the GG4 pair. The Figure represents the odds ratios of triplets containing the pair GG4 in conjunction with either a small residue (Gly, Ala and Ser, left panels) or a large aliphatic residue (Ile, Val and Leu, right panels). The position of the bars along the x-axis reflects the actual position of the residue relative to the pair GG4. The baseline is set at 1.316, the odds ratio observed for the pair GG4.

In the TOXCAT library with a Leu context, the larger residues occurred at positions i + 1 relative to the two glycine residues (G[IVL]xxG[IVL]). The β-branched residues were prevalent, especially in the first position. In addition, Thr was often found in the selection system at position i + 4 from the second glycine residue. In the present statistical analysis, we find that the GG4T4 triplet, which is observed also in the GpA motif, is strongly over-represented (58 %, p = 3.4 × 10^-4). In the Ala library, the two large residues occurred at position i - 1, on the N-terminal side of the GG4 pair ([IVL]Gxx[IVL]G). β-Branched residues were again prevalent. A schematic comparison of our results with the TOXCAT selection can also be found in Table 2 of Russ & Engelman (2000).
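The two TOXCAT patterns quoted above translate directly into regular expressions, with each x becoming a wildcard. The Python sketch below scans a sequence for both patterns; the sequence is a made-up example, not a TOXCAT clone or a database entry, and the constant and function names are illustrative.

```python
import re

# Lookahead groups so that overlapping occurrences are also reported.
LEU_LIBRARY_MOTIF = re.compile(r"(?=(G[IVL]..G[IVL]))")   # G[IVL]xxG[IVL]
ALA_LIBRARY_MOTIF = re.compile(r"(?=([IVL]G..[IVL]G))")   # [IVL]Gxx[IVL]G


def find_motifs(seq, pattern):
    """Return (start, matched segment) for every occurrence of the pattern."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]


# Hypothetical sequence containing one instance of each pattern.
toy_sequence = "LLAGVLLGILLLIGAAVGLL"

leu_hits = find_motifs(toy_sequence, LEU_LIBRARY_MOTIF)
ala_hits = find_motifs(toy_sequence, ALA_LIBRARY_MOTIF)
print(leu_hits, ala_hits)
```

Both patterns contain the same GG4 pair; they differ only in whether the large aliphatic residue follows (i + 1) or precedes (i - 1) each glycine.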

The convergence of the results obtained with such dissimilar approaches is remarkable, especially if one considers that the TOXCAT system reports the oligomerization events of bitopic (single-span) transmembrane domains, while the correlation analysis is based mostly on polytopic (multi-span) proteins (Table 1). The frequent finding of GG4 with large flanking residues by TOXCAT, which selects for strong transmembrane interactions, probably reflects the excellent opportunity provided by the deep groove and ridge of the motif for bringing two helices in extensive contact, as observed in the GpA structure. If strong interactions are important in polytopic proteins, they are essential in oligomerizing helices, as more energy is required to compensate for the larger entropy cost of association of helices that are not covalently joined by extra-membranous loops.

Following this line of reasoning, one could expect the GG4 pair to be more frequent in the TMDs of single-span transmembrane proteins. To address this question, we analyzed bitopic and polytopic sequences separately (data not shown). In a raw count, the pair is indeed found more frequently in bitopic sequences (on average, in 12.5 % of bitopic domains and in 12.1 % of polytopic transmembrane domains). The GG4 pair is the most significant outlier in both databases, but it is more over-represented relative to its expected occurrences in the bitopic (37.8 %) than in the polytopic set (30.8 %). However, caution should be exercised when inferring the relative importance of the motif in the two different topologies from these results. In polytopic proteins, weak helix-helix interactions embedded in a bundle might be tolerable and extra-membranous loops might sometimes direct the folds. On the other hand, the fraction of transmembrane anchors in the single-span database that are not engaged in interactions is also unknown. "Passive" sequences with nearly randomly distributed residues would provide only an increase in the background noise and a decrease in the significance of the results. Thus, the only conclusion supported by the data is that the GG4 pair is very important in both bitopic and polytopic membrane proteins.

Table 4. Results of triplet analysis for all triplets present in the dimerization motif of glycophorin A (LIxxGVxxGVxxT), sorted by decreasing odds ratio

Triplet  Significance (p)  Odds ratio
GG4T4    3.4 × 10^-4       1.58
IG3G4    1.8 × 10^-6       1.43
IV4V4    4.5 × 10^-9       1.41
GG4V1    1.6 × 10^-3       1.28
GV1G3    3.1 × 10^-3       1.25
LG4G4    9.6 × 10^-3       1.20
LV5V4    2.8 × 10^-3       1.17
IG3V1    1.9 × 10^-2       1.16
IV4G3    4.9 × 10^-2       1.15
IG3V5    5.7 × 10^-2       1.15
VG3V1    3.4 × 10^-2       1.15
LG4V1    6.5 × 10^-2       1.10
GV1V4    2.1 × 10^-1       1.09
LI1G3    2.6 × 10^-1       1.06
LI1V4    2.8 × 10^-1       1.05
VV4T3    8.8 × 10^-1       1.01
VG3T4    1.0 × 10^0        1.00
LG4V5    7.0 × 10^-1       0.97
GV5T3    8.5 × 10^-1       0.96
LV5G3    6.0 × 10^-1       0.96
GV1T3    4.1 × 10^-1       0.90

β-Branched residues could minimize entropy loss upon packing

Upon solution of the NMR structure of the GpA transmembrane dimer, MacKenzie et al. (1997) proposed that the association of the monomers might occur between two largely preformed interfaces. The idea was based on a fundamental implication of the two-stage model for membrane protein folding (Popot & Engelman, 1990). The first stage of the model involves the partitioning of largely hydrophobic TM segments in the lipid bilayer, which is strongly favored by the hydrophobic effect. The backbone adopts a helical conformation to satisfy its strong hydrogen bonding potential in the low-dielectric environment. Sequence specificity comes into play only in stage 2, when the equilibrium of associations of the preformed helices is established. Given the two-stage model, it is possible to have a notion of the structure of the unassociated state (helical) that is generally not available with the unfolded state of soluble proteins. This information is crucial to relating observed structural features of the native state to the energetics of folding, since stability depends on the differential between the energies of folded and unfolded states.

In the GpA dimer, many interfacial side-chains (Ile, Val, and Thr) have only one populated rotamer as a consequence of being in a helix (Dunbrack & Karplus, 1993; Schrauber et al., 1993). Under the assumption that the GpA TM is helical in the monomeric state (recently confirmed experimentally by Fisher et al., 1999), MacKenzie and colleagues pointed out that minimal loss of rotameric freedom upon dimerization was therefore expected. Later, a theoretical model based on a large number of GpA mutants indicated loss of side-chain entropy as one of the major factors destabilizing dimerization (MacKenzie & Engelman, 1998), supporting further the hypothesis that rotamerically constrained interfaces could provide a significant contribution to the stability of association.

In our results, there is a significant dichotomy in the role of the three larger aliphatic residues Ile, Val and Leu. The β-branched Ile and Val are, with Gly, the residues involved in the strongest correlations. Conversely, Leu, the most frequent residue in transmembrane domains, has only a secondary role. As a γ-branched side-chain, Leu can sample more conformations in helical secondary structure. Our results are therefore consistent with the hypothesized importance of a "preformed interface" and the possibility that the use of residues with constrained side-chains in helical conformation might have general significance in limiting the entropic cost of association in a large set of membrane proteins.

Interaction of β-branched residues at i, i + 4 might modulate helix flexibility in TMs

A combination of theoretical arguments and experimental evidence suggests the hypothesis that pairs of Ile and Val at i, i + 4, which we find all strongly over-represented in this analysis, might influence flexibility in TM helices. Helical conformation prevents the χ1 dihedral from positioning a heavy γ-substituent in gauche− orientation due to the steric clashes with the backbone carbonyl oxygen atom at i − 3 (McGregor et al., 1987). In an analysis of intrahelical side-chain/side-chain interactions in soluble proteins, Walther & Argos (1996) reported that the majority of the contacts occurred between pairs of residues with spacing i, i + 4. As they pointed out, interactions can occur at this separation, since they are promoted by χ1 rotamers that involve a combination of a trans (at i position) and a gauche+ (at i + 4) dihedral. Conversely, i, i + 1 and i, i + 3 interactions require the unfavorable g− conformation (g−/g+ and t/g−, respectively). The two Cγ atoms of β-branched residues are forced to occupy simultaneously g+ and t positions to avoid the g− dihedral (Schrauber et al., 1993). For this reason, β-branched residues are good candidates for intrahelical interactions at i, i + 4. This is consistent with the high scores of Ile and Val in the i, i + 4 contact propensity calculated by Walther & Argos (1996), a scale in which Leu scored only slightly above average.

Padmanabhan & Baldwin (1994) used circular dichroism (CD) to measure the interactions of L[IVL] and [IVL]L pairs at i, i + 3 and i, i + 4 in soluble peptides, and observed stronger helix stabilization in i + 4 pairs. The energy of interaction of pairs of hydrophobic residues at different registers in an α-helix has been calculated by Creamer & Rose (1995) using an exhaustive Boltzmann-weighted conformational search. The interactions of pairs formed by Ile, Val and Leu at i, i + 4 were more stabilizing than those of i, i + 3 pairs. The energy ranking observed for these pairs at i, i + 4 agrees with our data (summarized in the [Large][Large]4 panel in Figure 6) in the fact that the smallest effects are observed when there is a Leu residue on the N-terminal side in the pair (LL4, LI4, LV4). The calculations made by Creamer & Rose (1995) were in only partial agreement with the experimental results reported by Padmanabhan & Baldwin (1994), who, conversely, observed higher helix content in L[IVL]4 than in [IVL]L4 pairs. However, Creamer & Rose (1995) calculated the interaction energies relative to the same pair at i, i + 2 (on opposite faces in helical conformation) while the CD data reflects the position of a helix-coil/strand equilibrium.

Figure 8. Example of a pair of β-branched aliphatic residues at i, i + 4 in the fourth transmembrane segment of bacteriorhodopsin (RCSB PDB code 1c3w). Both I108 and V112 are in their standard helical rotamer in which the γ-carbon atoms are positioned away from the disfavored gauche− orientation. According to the IUPAC nomenclature rules, the rotamers are designated respectively as trans and gauche+. The van der Waals sphere of the carbon atoms of closest approach is represented by dots (1.9 Å). The center-to-center distance between I108-Cγ2 and V112-Cγ2 is 4.1 Å.


These three studies relate intrahelical side-chain interactions to helical stability in aqueous solution and they concur on the importance of i, i + 4 contacts. In the membrane, the helix is already stabilized by the environment, but side-chain interactions might additionally affect the flexibility of the helix. This might be especially true for pairs of β-branched residues at i, i + 4, as their only favorable χ1 rotamer conformation locks them in close proximity. In bacteriorhodopsin, the only helical membrane protein structure available at better than 2 Å resolution (Luecke et al., 1999), the four [IV][IV]4 pairs found in regular α-helical conformation have an average minimal distance (center to center of the closest Cγ or Cδ atoms) of only 4.2 (±0.3) Å (±SD). An example (residues I108 and V112 on the fourth transmembrane segment) is shown in Figure 8. Whether the strongly correlating pairs of β-branched residues at i, i + 4 are important to diminishing transmembrane helix flexibility is an interesting question. If validated experimentally, it could provide further support to the hypothesis that a reduction of entropy in the helical unassociated state (in turn a destabilization of the unfolded state, if independent helices are stable in the bilayer) could be a significant factor in the transmembrane association equilibrium.

On the other hand, glycine is frequently observed in membrane helices and induces flexibility. Glycine is compatible with helical conformation in membrane proteins, as evident in GpA, which is largely helical in both the monomeric and dimeric states despite three glycine residues in its TM sequence (Fisher et al., 1999). However, extensive studies in host peptides by Deber and colleagues have shown that, while Gly has a considerable tendency to form α-helices in membrane mimetic environments, it is somewhat destabilizing compared to the more hydrophobic side chains (Li & Deber, 1992a,b; Liu & Deber, 1998). This is consistent with the observation by Ri et al. (1999) using a Monte Carlo simulation of a single TM. The ranking observed for increased flexibility (Gly > Ala > Val) correlated well with the severity of voltage-dependent gating phenotypes when these three residues were substituted for the wild-type Pro residue in connexin32.

Thus, a pair of β-branched residues at i, i + 4 and a pair of glycine residues at i, i + 4 might lie at opposite sides of a hypothetical flexibility scale in TM helices. The favorable role of Gly in helix interactions might require the presence of additional stability from the β-branched residues. This argument provides a speculative but plausible explanation for the strong correlations between the GG4 pair and [IV][IV]4 pairs observed in opposite faces of the helix at i + 2, which could perhaps have a compensatory role in modulating helix flexibility.

Final remarks

Many instances of the "GG4 β-branched" motif and its variations can be found in the available X-ray structures of helical transmembrane proteins. An in-depth comparison of the results of our analysis with the structural models has not been completed at this stage. This comparison could offer further insights into the physical role of this motif and of other observed correlations. For example, it would be interesting to put the strong association of Ile, Val and Leu with neighboring Pro residues in relation to the geometry of the kink.

We have shown that the inherent simplicity of helical membrane protein structure results in correlations between residues that are detectable with simple statistical methods and that suggest interpretations in terms of protein chemistry. In turn, our results also support the validity of TMD prediction techniques. With the growth of primary data provided by the genome projects, these results are an indication of the important role that sequence analysis will assume in the near future in the membrane protein field as a complement to the interpretation of experimental and structural data.

Methods

Database

The source of transmembrane sequences for this work was the annotated database Swiss-Prot, release 37 and updates to March 17, 1999 (Bairoch & Apweiler, 1999). All sequence fragments corresponding to a TRANSMEM annotation in the FT field were extracted and a database of 46,946 transmembrane domains was compiled (Table 1).

Figure 9. Calculation of probability distributions of pair occurrence with the TMSTAT method. The Figure is explained fully in the Appendix.



Homology cleanup

Homology removal was performed at the level of the TM sequences by eliminating each sequence that was exceedingly similar to another sequence. Given the large number of proteins in the database, homology elimination at the level of the TMDs was a practical and effective alternative to more complex and intensive clustering procedures at the protein level (Boberg et al., 1992; Brenner et al., 1998; Gerstein, 1998; Hobohm & Sander, 1994; Hobohm et al., 1992). In addition, the TMD-level procedure takes care of the "internal homology" sometimes present within a given protein while preserving any non-homologous TMDs of otherwise homologous proteins. The annotated sequences were extended (or occasionally shortened) to a length of 30 residues using the flanking regions. Two sequences were compared in all possible frame shifts using a 100 PAM matrix derived from the Mutation Probability Matrix of Jones et al. (1994) and the maximum score was recorded as the similarity score of the pair.
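The frame-shift comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the function names are hypothetical, the alignment is assumed to be ungapped, and a toy identity score stands in for the 100 PAM matrix, whose values are not reproduced here.

```python
from typing import Callable


def max_shift_score(seq_a: str, seq_b: str,
                    score: Callable[[str, str], float]) -> float:
    """Best ungapped alignment score of two sequences over all frame shifts."""
    best = float("-inf")
    # shift > 0 places seq_b to the right of seq_a; shift < 0 to the left.
    for shift in range(-(len(seq_b) - 1), len(seq_a)):
        total = 0.0
        for i, res_a in enumerate(seq_a):
            j = i - shift  # position in seq_b aligned with position i of seq_a
            if 0 <= j < len(seq_b):
                total += score(res_a, seq_b[j])
        best = max(best, total)
    return best


def identity_score(a: str, b: str) -> float:
    """Toy stand-in for a substitution-matrix lookup (assumption)."""
    return 1.0 if a == b else 0.0
```

With a real substitution matrix, `score` would be a lookup into that matrix, and the returned maximum would play the role of the similarity score of the pair.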

Sequences were eliminated according to the following process. First, all pairs with similarity scores of 50 or higher were ranked by score, from highest to lowest. Then, beginning with the pair with the highest score, one member of each pair was marked for removal. The particular sequence in a pair chosen for removal was determined by its priority number. Priorities, assigned according to the description of the annotation in the Swiss-Prot database, gave preference to non-potential transmembrane domains:

0, transmembrane sequences of potential proteins (ORFs identified in Swiss-Prot with IDs starting with the letter Y);
1, transmembrane domains marked as POTENTIAL, PROBABLE or POSSIBLE;
2, annotations that included the words BY SIMILARITY;
3, remaining annotations.

Sequences with larger priority numbers were kept in the database, and when members of the pair shared the same priority number, one was randomly chosen for removal. The cleanup proceeded down the list of pairs so that when a pair in which neither sequence had been marked for removal was encountered, priority numbers were assigned and only one sequence was subsequently kept.
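The elimination procedure can be sketched as a greedy pass over the ranked pairs. The sketch below is one reading of the description above, not the original implementation; the function name, the sequence identifiers and the seeded random generator are assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def select_removals(pairs: List[Tuple[str, str, float]],
                    priority: Dict[str, int],
                    threshold: float = 50.0,
                    rng: Optional[random.Random] = None) -> Set[str]:
    """Return the identifiers of the sequences marked for removal."""
    if rng is None:
        rng = random.Random(0)  # fixed seed (assumption) for reproducibility
    removed: Set[str] = set()
    # Keep only pairs at or above the similarity threshold, highest first.
    similar = [p for p in pairs if p[2] >= threshold]
    for a, b, _ in sorted(similar, key=lambda p: p[2], reverse=True):
        if a in removed or b in removed:
            continue  # one member is already marked, nothing to do
        if priority[a] != priority[b]:
            loser = a if priority[a] < priority[b] else b  # lower number goes
        else:
            loser = rng.choice([a, b])  # tie: random choice
        removed.add(loser)
    return removed
```

For example, with pairs (s1, s2, 80) and (s2, s3, 60) and priorities 3, 1 and 3, only s2 is marked: the first pair removes it, and the second pair is then skipped because one of its members is already gone.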

Pair and triplet definitions

The pair correlation analysis was performed on all combinations of amino acids separated by one to ten residues (20 × 20 × 10 = 4000 pair correlations analyzed). Pairs at i, i + k are indicated using the one-letter code of the two residues followed by the separation k (register): for example, AL1 corresponds to the sequence AL and AL3 to AxxL.

The triplets analyzed were formed by all combinations of residues at separations ranging from 1 to 5 (20 × 20 × 20 × 5 × 5 = 200,000 triplet correlations). Triplets are represented analogously, for example VI2P3 (corresponding to VxIxxP).
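The notation can be made concrete with a short sketch that expands a code such as AL3 or VI2P3 into its pattern and counts its occurrences in a sequence. The function names are hypothetical and the parser is only an illustration of the convention defined above.

```python
import re
from typing import List, Tuple


def parse_pattern(code: str) -> List[Tuple[str, int]]:
    """Turn 'AL3' or 'VI2P3' into [(residue, offset from first residue), ...]."""
    rest = re.findall(r"([A-Z])(\d+)", code[1:])
    if not code or not rest or code[0] + "".join(r + k for r, k in rest) != code:
        raise ValueError(f"not a valid pair or triplet code: {code!r}")
    placed = [(code[0], 0)]
    position = 0
    for residue, separation in rest:
        position += int(separation)  # separations are cumulative
        placed.append((residue, position))
    return placed


def expand(code: str) -> str:
    """Write the pattern with 'x' at unspecified positions, e.g. AL3 -> AxxL."""
    placed = parse_pattern(code)
    chars = ["x"] * (placed[-1][1] + 1)
    for residue, position in placed:
        chars[position] = residue
    return "".join(chars)


def count_occurrences(code: str, sequence: str) -> int:
    """Count (possibly overlapping) occurrences of the pattern in a sequence."""
    placed = parse_pattern(code)
    span = placed[-1][1]
    return sum(
        all(sequence[start + pos] == res for res, pos in placed)
        for start in range(len(sequence) - span)
    )
```

Here `expand("AL3")` gives `AxxL` and `expand("VI2P3")` gives `VxIxxP`, matching the examples in the text.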

Input sequences

The analysis was performed on sequences of fixed length instead of the entire annotation, in order to limit the analysis to the hydrophobic core of the sequences. The most hydrophobic window of 18 amino acid residues in a span of 30 residues centered on each annotation was selected using the GES scale (Engelman et al., 1986). Occasionally, the selected window included residues outside the original annotations.
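The window selection can be sketched as a simple sliding-window scan. This is an illustration only: the function name is hypothetical, the hydrophobicity scale is supplied by the caller (the GES values are not reproduced here), and the scale is assumed to be signed so that larger values mean more hydrophobic. When several windows tie, the sketch keeps the first one, which is also an assumption.

```python
from typing import Dict, Tuple


def most_hydrophobic_window(span: str, scale: Dict[str, float],
                            width: int = 18) -> Tuple[int, str, float]:
    """Return (start, window, score) of the highest-scoring window in span."""
    if len(span) < width:
        raise ValueError("span is shorter than the window width")
    best_start, best_score = 0, float("-inf")
    for start in range(len(span) - width + 1):
        score = sum(scale[res] for res in span[start:start + width])
        if score > best_score:  # strict comparison keeps the first of any ties
            best_start, best_score = start, score
    return best_start, span[best_start:best_start + width], best_score
```

For a 30-residue span and the default width of 18, thirteen windows are scored.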

Exceedingly hydrophilic sequences with a hydrophobicity score below 15 were excluded from the analysis (4.9 % of all sequences). Low-complexity sequences (when a single residue represented more than half of the composition of the sequence or two residues more than two-thirds of the composition of the sequence) were also excluded (0.5 %).
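The two exclusion rules can be written as a short filter. The sketch below is an illustration with hypothetical function names; the hydrophobicity score is assumed to have been computed beforehand on the same scale as the cutoff of 15.

```python
from collections import Counter


def is_low_complexity(seq: str) -> bool:
    """True if one residue exceeds 1/2, or two residues exceed 2/3, of seq."""
    if not seq:
        raise ValueError("empty sequence")
    top_counts = [n for _, n in Counter(seq).most_common(2)]
    top_one = top_counts[0]
    top_two = sum(top_counts)
    # Integer arithmetic avoids rounding issues at the 1/2 and 2/3 boundaries.
    return top_one * 2 > len(seq) or top_two * 3 > 2 * len(seq)


def passes_filters(seq: str, hydrophobicity: float,
                   min_score: float = 15.0) -> bool:
    """Keep sequences that are hydrophobic enough and not low-complexity."""
    return hydrophobicity >= min_score and not is_low_complexity(seq)
```

With 18-residue windows, the first rule caps any single residue at nine occurrences, which appears consistent with the range of one to nine residues tabulated in the Appendix.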

Pair and triplet correlation analysis with TMSTAT

The occurrences in the database of all pairs and triplets of residues were counted. The probability distributions associated with any possible number of occurrences of each pair and triplet were calculated from the composition of the individual sequences, as explained in the Appendix and in the scheme in Figure 9. The statistical significance of the observed deviations of each occurrence from its respective average expected value was calculated by the two-tailed integral of their probability distributions.

Acknowledgments

We thank Mark Bowen, Zimei Bu, Lilian Fisher, Karen Ho, Yuval Kluger, Albert Lee, Huiming Li, Maura Mezzetti, Gigi Riva, William Russ, Koji Sonoda, Iban Ubarretxena, Fang Zhou and other members of the Engelman group for helpful discussion and critical reading of the manuscript. This work was supported by grants from the NIH and NSF.

References

Arkin, I. T. & Brunger, A. T. (1998). Statistical analysis of predicted transmembrane alpha-helices. Biochim. Biophys. Acta, 1429, 113-128.

Bairoch, A. & Apweiler, R. (1999). The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucl. Acids Res. 27, 49-54.

Boberg, J., Salakoski, T. & Vihinen, M. (1992). Selection of a representative set of structures from Brookhaven Protein Data Bank. Proteins: Struct. Funct. Genet. 14, 265-276.

Bowie, J. U. (1997). Helix packing in membrane proteins. J. Mol. Biol. 272, 780-789.

Boyd, D., Schierle, C. & Beckwith, J. (1998). How many membrane proteins are there? Protein Sci. 7, 201-205.

Brenner, S. E., Chothia, C. & Hubbard, T. J. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073-6078.

Chothia, C., Levitt, M. & Richardson, D. (1981). Helix to helix packing in proteins. J. Mol. Biol. 145, 215-250.

Creamer, T. P. & Rose, G. D. (1995). Interactions between hydrophobic side chains within alpha-helices. Protein Sci. 4, 1305-1314.

Dunbrack, R. L., Jr & Karplus, M. (1993). Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J. Mol. Biol. 230, 543-574.

Engelman, D. M., Steitz, T. A. & Goldman, A. (1986). Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu. Rev. Biophys. Biophys. Chem. 15, 321-353.

Fisher, L. E., Engelman, D. M. & Sturgis, J. N. (1999). Detergents modulate dimerization, but not helicity, of the glycophorin A transmembrane domain. J. Mol. Biol. 293, 639-651.

Gerstein, M. (1998). Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins: Struct. Funct. Genet. 33, 518-534.

Hobohm, U. & Sander, C. (1994). Enlarged representative set of protein structures. Protein Sci. 3, 522-524.

Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection of representative protein data sets. Protein Sci. 1, 409-417.

Jones, D. T., Taylor, W. R. & Thornton, J. M. (1994). A mutation data matrix for transmembrane proteins. FEBS Letters, 339, 269-275.

Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105-132.

Landolt-Marticorena, C., Williams, K. A., Deber, C. M. & Reithmeier, R. A. (1993). Non-random distribution of amino acids in the transmembrane segments of human type I single span membrane proteins. J. Mol. Biol. 229, 602-608.

Lemmon, M. A., Flanagan, J. M., Treutlein, H. R., Zhang, J. & Engelman, D. M. (1992). Sequence specificity in the dimerization of transmembrane alpha-helices. Biochemistry, 31, 12719-12725.

Lemmon, M. A., Treutlein, H. R., Adams, P. D., Brunger, A. T. & Engelman, D. M. (1994). A dimerization motif for transmembrane alpha-helices. Nature Struct. Biol. 1, 157-163.

Li, S. C. & Deber, C. M. (1992a). Glycine and beta-branched residues support and modulate peptide helicity in membrane environments. FEBS Letters, 311, 217-220.

Li, S. C. & Deber, C. M. (1992b). Influence of glycine residues on peptide conformation in membrane environments. Int. J. Pept. Protein Res. 40, 243-248.

Liu, L. P. & Deber, C. M. (1998). Uncoupling hydrophobicity and helicity in transmembrane segments. Alpha-helical propensities of the amino acids in non-polar environments. J. Biol. Chem. 273, 23645-23648.

Luecke, H., Schobert, B., Richter, H. T., Cartailler, J. P. & Lanyi, J. K. (1999). Structure of bacteriorhodopsin at 1.55 Å resolution. J. Mol. Biol. 291, 899-911.

MacKenzie, K. R. & Engelman, D. M. (1998). Structure-based prediction of the stability of transmembrane helix-helix interactions: the sequence dependence of glycophorin A dimerization. Proc. Natl Acad. Sci. USA, 95, 3583-3590.

MacKenzie, K. R., Prestegard, J. H. & Engelman, D. M. (1997). A transmembrane helix dimer: structure and implications. Science, 276, 131-133.

McGregor, M. J., Islam, S. A. & Sternberg, M. J. (1987). Analysis of the relationship between side-chain conformation and secondary structure in globular proteins. J. Mol. Biol. 198, 295-310.

Padmanabhan, S. & Baldwin, R. L. (1994). Tests for helix-stabilizing interactions between various nonpolar side chains in alanine-based peptides. Protein Sci. 3, 1992-1997.

Popot, J. L. & Engelman, D. M. (1990). Membrane protein folding and oligomerization: the two-stage model. Biochemistry, 29, 4031-4037.

Ri, Y., Ballesteros, J. A., Abrams, C. K., Oh, S., Verselis, V. K., Weinstein, H. & Bargiello, T. A. (1999). The role of a conserved proline residue in mediating conformational changes associated with voltage gating of Cx32 gap junctions. Biophys. J. 76, 2887-2898.

Richmond, T. J. & Richards, F. M. (1978). Packing of alpha-helices: geometrical constraints and contact areas. J. Mol. Biol. 119, 537-555.

Russ, W. P. & Engelman, D. M. (2000). The GxxxG motif: a framework for transmembrane helix-helix association. J. Mol. Biol. 296, 911-919.

Samatey, F. A., Xu, C. & Popot, J. L. (1995). On the distribution of amino acid residues in transmembrane alpha-helix bundles. Proc. Natl Acad. Sci. USA, 92, 4577-4581.

Schrauber, H., Eisenhaber, F. & Argos, P. (1993). Rotamers: to be or not to be? An analysis of amino acid side-chain conformations in globular proteins. J. Mol. Biol. 230, 592-612.

von Heijne, G. (1992). Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol. 225, 487-494.

Walther, D. & Argos, P. (1996). Intrahelical side chain-side chain contacts: the consequences of restricted rotameric states and implications for helix engineering and design. Protein Eng. 9, 471-478.

Walther, D., Eisenhaber, F. & Argos, P. (1996). Principles of helix-helix packing in proteins: the helical lattice superposition model. J. Mol. Biol. 255, 536-553.

Appendix I: Calculation of Expectation Distributions for the Occurrence of Pairs and Triplets of Amino Acids in a Database of Short Sequences with the TMSTAT Method

The aim of the present analysis is to survey frequently occurring patterns of residues (pairs and triplets) in transmembrane sequences. For this, we need some measure of the expectation of occurrence of the patterns. The simplest way to calculate this is from the average composition (i.e. the probability of finding a particular residue is constant at all positions in all sequences and corresponds to its frequency in the database). However, this approach requires the assumption that, in terms of composition, all sequences derive from a homogeneous population and that residues do not co-segregate or anti-segregate in different sequences.

This assumption is not required if the expectation is based on the composition of each individual sequence instead of the overall composition of amino acids in the database (i.e. the probability of finding a particular residue is constant at all positions within a sequence and corresponds to its frequency in the sequence). However, finite sequence length effects also need to be accounted for, since they are quite important for short sequences (18 residues in our case). A solution is to base the calculation on all theoretically possible internal permutations of the sequences, that is, to take into account the length and composition of each sequence once internal positional information has been removed. A way to conceptualize this is to ask: What would be the probability of finding a certain number of occurrences of a pair in the database after all sequences have been randomly permuted? Considering the entire theoretical set of different databases that can be obtained from the original when the sequences are allowed to independently assume any possible internal permutation, the probability corresponds to the fraction of all permuted databases that contain that exact number of occurrences of the pair.

The expectancy distribution of a pair based on all theoretical permutations of all sequences could be approximated by cycles of random shuffling of the sequences and sampling of the occurrences. However, a sampling algorithm would produce estimates with errors that are higher at the tails of the distribution, i.e. where greater precision would be desirable. To completely avoid errors, we have calculated analytically the exact theoretical distributions of expectancy of any pair. The TMSTAT method is schematized in Figure 9 of the main text. The calculation is divided into two phases: in phase 1, the probability distributions for occurrences of pairs in single sequences were calculated and stored in a matrix table for later use. Consider the pair ALk, A and L as examples of any two non-identical residues at positions i, i + k: the probability that pair ALk will occur $N_{ALk}$ times in a particular sequence is:

\[ P(N_{ALk} \mid l, k, N_A, N_L) \]

which depends on four parameters: the length of the sequence l, the register k and how many Ala ($N_A$) and Leu residues ($N_L$) are in the sequence. It is defined as the fraction of all possible permutations of the sequence containing exactly $N_{ALk}$ occurrences of the ALk pair. An example of the calculation is shown explicitly in the scheme for a short five-residue sequence with two Ala and two Leu residues and at register 3. The box shows all 30 possible permutations of the short sequence (the non-A and non-L residue is symbolized by a dash): of the 30 possible permutations, 19 (63.3 %) have no occurrences of the AL3 pair. The pair occurs once in ten (33.3 %) and twice in one (3.3 %) of the permutations. All sequence probability distributions for all relevant combinations of the four parameters (l = 18; k = 1 to 10; $N_A$ = 1 to 9; $N_L$ = 1 to 9) were calculated and tabulated for later use. Pairs formed by two identical residues, as for example LLk, obey different distributions, $P(N_{LLk} \mid l, k, N_L)$, that were analogously calculated and tabulated.
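The five-residue example can be checked by brute-force enumeration, as in the sketch below. This is only an illustration with a hypothetical function name: explicit enumeration is practical for very short sequences, whereas the distributions tabulated for 18-residue sequences were derived analytically, and that derivation is not reproduced here.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from typing import Dict


def single_sequence_distribution(sequence: str, first: str, second: str,
                                 k: int) -> Dict[int, Fraction]:
    """Exact distribution of the pair count over all distinct permutations."""
    distinct = set(permutations(sequence))  # duplicates collapse in the set
    tally = Counter(
        sum(1 for i in range(len(p) - k) if p[i] == first and p[i + k] == second)
        for p in distinct
    )
    total = len(distinct)
    return {n: Fraction(count, total) for n, count in sorted(tally.items())}
```

Calling `single_sequence_distribution("AALL-", "A", "L", 3)` enumerates the 30 distinct permutations and returns probabilities of 19/30, 1/3 and 1/30 for zero, one and two occurrences of AL3, in agreement with the values quoted above.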

The specific database is considered only in phase 2, when actual occurrences of the pairs are counted and the database probabilities are computed. The overall probability distribution of occurrence of the pair ALk in the database, $P_{DB}$, was calculated by iteratively convoluting the specific single-sequence $P_j(N_{ALk})$ distributions tabulated in phase 1 relative to the $[l_j, k, N_{A,j}, N_{L,j}]$ parameters provided by each sequence j of the database considered. The probability of observing $N_{ALk}$ occurrences of the pair ALk in a database of n sequences can be calculated according to:

\[ P_{DB(n)}(N_{ALk}) = \sum_{i=0}^{N_{ALk}} P_{DB(n-1)}(i) \, P_n(N_{ALk} - i \mid l, k, N_{A,n}, N_{L,n}) \]

defined recursively, with initial $P_{DB(0)}(0) = 1$. $N_{A,n}$ and $N_{L,n}$ are the number of Ala and Leu residues in sequence n.
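The recursion is a discrete convolution applied once per sequence, which can be sketched as follows. The function names are hypothetical, and exact fractions are used here only to keep the illustration free of rounding; each distribution is a list whose index is the number of occurrences.

```python
from fractions import Fraction
from typing import Iterable, List


def convolve(db: List[Fraction], seq: List[Fraction]) -> List[Fraction]:
    """Combine the running database distribution with one sequence's."""
    out = [Fraction(0)] * (len(db) + len(seq) - 1)
    for i, p_db in enumerate(db):          # i occurrences so far
        for j, p_seq in enumerate(seq):    # j occurrences in the new sequence
            out[i + j] += p_db * p_seq
    return out


def database_distribution(per_sequence: Iterable[List[Fraction]]) -> List[Fraction]:
    """Apply the recursion starting from P_DB(0)(0) = 1."""
    db = [Fraction(1)]
    for dist in per_sequence:
        db = convolve(db, dist)
    return db
```

As a small check, combining two copies of the five-residue AL3 distribution (19/30, 10/30, 1/30) gives probabilities 361/900, 380/900, 138/900, 20/900 and 1/900 for zero to four occurrences, which sum to one.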

An example of the process is shown in the scheme where the first three steps and the final result are illustrated for the analysis of the occurrences of the pair AL3. All sequences in the database analyzed have fixed length l of 18 residues (this restriction is not necessary in general and the method applies to mixed-length sequence databases). The first sequence of the database contains two Ala and three Leu residues. No occurrence of AL3 is observed in this sequence (black arrow at zero occurrence in chart). In the first step of the procedure only one sequence has been considered and the probability distribution of the database, $P_{DB(1)}(N_{AL3})$ (bar chart), corresponds to the probability distribution of sequence 1, $P_1 = P(N_{AL3} \mid l = 18, k = 3, N_A = 2, N_L = 3)$.

The second sequence of the database contains five Ala and five Leu residues, and in this case one occurrence of AL3 is observed. $P_2$ is thus $P(N_{AL3} \mid l = 18, k = 3, N_A = 5, N_L = 5)$ and the cumulative $P_{DB(2)}$ distribution is then obtained from $P_2$ and $P_{DB(1)}$, as shown in the example. Two occurrences of AL3 are found in the third sequence, bringing the total to three for the database at this stage, and $P_{DB(3)}$ is then obtained from $P_3$ and $P_{DB(2)}$. The calculation becomes more complex as more combinations are available and the curve assumes a more bell-shaped character.

Once all 13,606 sequences had been analyzed, the $P_{DB}$ distribution had converged to a bell curve. Average expected values and standard deviations were calculated from the probability distribution curves according to:

\[ \langle N_{ALk} \rangle = \sum_{N_{ALk}} N_{ALk} \, P_{DB}(N_{ALk}) \]

\[ SD_{ALk} = \sqrt{ \sum_{N_{ALk}} N_{ALk}^2 \, P_{DB}(N_{ALk}) - \Big( \sum_{N_{ALk}} N_{ALk} \, P_{DB}(N_{ALk}) \Big)^2 } \]

The observed 4140 occurrences of AL3 in the database are slightly above the average expectation value of 4043.1. The two-tailed integral of the $P_{DB}(N_{AL3})$ function provided a significance for the observed occurrences of a pair. The integration was computed on formally derived curves; therefore, no assumption regarding the nature of the distributions was necessary. Two-tailed integrals were used, since both above and below-expectation values were considered significant. The significance of the occurrences of the AL3 pair is low (p = 0.075), that is, if the residues were actually randomly distributed there would be a realistic possibility of observing an equal or greater number of occurrences by random chance.
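The summary statistics can be sketched as below. Two points are assumptions rather than statements of the method: the two-tailed integral is read here as the total probability of all counts at least as far from the mean as the observed count, and the odds ratio is taken as the observed count divided by the expected count, a reading consistent with the value of 1.316 and the 31.6 % excess quoted for the GG4 pair. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def mean_and_sd(dist: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of a distribution indexed by count."""
    mean = sum(n * p for n, p in enumerate(dist))
    second_moment = sum(n * n * p for n, p in enumerate(dist))
    return mean, math.sqrt(max(second_moment - mean * mean, 0.0))


def two_tailed_p(dist: Sequence[float], observed: int) -> float:
    """Probability of a deviation from the mean at least as large as observed."""
    mean, _ = mean_and_sd(dist)
    cutoff = abs(observed - mean) - 1e-12  # small tolerance for rounding
    return sum(p for n, p in enumerate(dist) if abs(n - mean) >= cutoff)


def odds_ratio(observed: float, expected: float) -> float:
    """Observed occurrences relative to the average expectation (assumption)."""
    return observed / expected
```

Under this reading, the AL3 counts quoted above give `odds_ratio(4140, 4043.1)` of about 1.024, that is, roughly 2.4 % above expectation.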

The analysis of the triplets was performed with an analogous method. The single-sequence probability distributions were calculated for the triplet ALk1Vk2 as:

\[ P(N_{ALk_1Vk_2} \mid l, k_1, k_2, N_A, N_L, N_V) \]

based on all possible sequence permutations and tabulated for the relevant ranges of l, $k_1$, $k_2$, $N_A$, $N_L$ and $N_V$ (Ala, Leu and Val representing any three non-identical residues at relative spacing $k_1$ and $k_2$). Probability distributions were also calculated for triplets in which residues are repeated (AAk1Lk2, ALk1Ak2, ALk1Lk2, AAk1Ak2). The cumulative probability distribution, $P_{DB}$, for the occurrence of each triplet in the database was calculated with the same recursive formula used for the pairs. The TMSTAT method is, in principle, applicable to quadruplets and higher-order multiplets, although the increased number of combinations can limit the feasibility.

Edited by G. von Heijne

(Received 4 November 1999; received in revised form 29 December 1999; accepted 29 December 1999)
