J Mol Evol (1986) 24:130-142

Journal of Molecular Evolution
© Springer-Verlag New York Inc. 1986

Protein Export in Prokaryotes and Eukaryotes: Indications of a Difference in the Mechanism of Exportation

Olivier Gascuel (1) and Antoine Danchin (2)

(1) Unité 194 de l'INSERM, 91 Boulevard de l'Hôpital, 75013 Paris, France
(2) Département de Biochimie et Génétique Moléculaire, Institut Pasteur, 28 Rue du Docteur Roux, 75724 Paris Cedex, France

Summary. Investigation of possible variations between prokaryotic and eukaryotic signal sequences of exported proteins has revealed unexpected differences. Apart from the known similarities (presence of a core hydrophobic sequence preceded by a positively charged amino terminus and followed by a flexible structure), we have found that the core is much more rigid in eukaryotic signals than in their prokaryotic counterparts, and that at both ends the constraints are much more stringent in bacteria than in human cells. The differences have been summarized as a set of 17 criteria describing noteworthy features discriminating between the two classes of signal peptides. The program we used permitted each class of sequences to be learned; Escherichia coli sequences were well learned (i.e., they could be recognized by the programs as having common features), whereas human sequences were found to exhibit a much wider variation. Thus it was possible to propose a consensus in the case of the bacterial peptides, but none (or a much looser one) in the case of the human sequences. Two sequences were exceptional among the E. coli signal peptides, those of lipoprotein and plasmid-borne beta-lactamase, suggesting that they have special origins or destinations. Finally, the differences found strongly suggest that the mode of secretion is rather different in the two types of organisms, in spite of the common features of the signal sequences.

Key words: Artificial intelligence - Pattern recognition - Signal sequence - Protein secretion - Eukaryote-prokaryote discrimination

Offprint requests to: A. Danchin

Introduction

Proteins and nucleic acids are generally considered to have played the central role in the evolution of living cells. But the requirement for an enveloping membrane separating the environment and the "milieu intérieur," as emphasized by Claude Bernard, was also a major factor in evolution. Prokaryotes as well as eukaryotes are able to cope with their environment using a variety of means, among them export or secretion of proteins through the surrounding membrane(s). Biochemical studies of exportation in eukaryotic cells have led to a detailed hypothesis, the "signal" hypothesis proposed first in 1972 and 1975 (Milstein et al. 1972; Harrison et al. 1974; Blobel and Dobberstein 1975a) and subsequently in various revised forms. The essential feature of the signal concept is the presence of a polypeptide of 20-40 amino acids, predominantly hydrophobic, at the amino terminus of the secreted protein. In the current model the signal peptide is cleaved off and is not present in the mature protein (Blobel and Dobberstein 1975b). The hydrophobic portion of the peptide is thought to drive insertion into the core of the membrane and to allow the subsequent amino acid sequence to pass through. Compilation of published signal sequences has permitted identification of some common features that are related to models of secretion (Bedouelle and Hofnung 1982; Perlman and Halvorson 1983; von Heijne 1983; Watson 1984). Precursors of secreted proteins normally have short half-lives, and are apparently structurally as well as functionally (Austen 1979; Watson 1984; Pugsley and Schwartz 1985) similar in eukaryotes and prokaryotes. However, despite their structural relationship, signal sequences do not show a high degree of sequence homology, even in otherwise closely related proteins (Perlman and Halvorson 1983; von Heijne 1983; Watson 1984). Essential features of signal sequences are (i) one or more positively charged residues (preferably lysine) at the NH2 terminus, (ii) a continuous stretch of at least eight hydrophobic or neutral amino acids tending to form an alpha-helix or an extended thread (Austen 1979; Bedouelle and Hofnung 1982) and (iii) a "signal peptidase" cleavage site, located in a beta-turn (Oxender et al. 1984).

In spite of the many analyses of signal peptides using different approaches (including analysis of mutant proteins that fail to be exported) the actual mechanism of secretion is still a matter of speculation (Kelly 1985; Pugsley and Schwartz 1985). Secretion has been shown, at least in eukaryotes, to involve a cotranslational process (Walter and Blobel 1981; Meyer et al. 1982). However, recent evidence from the analysis of secretion in prokaryotes has suggested that differences in secretory mechanism might exist between eukaryotes and prokaryotes. In addition, expression of eukaryotic proteins in prokaryotes, and vice versa, results in a rather low efficiency of secretion, although good yield may be achieved by appropriate construction and selection.

This last point suggests that in spite of the similarities between signal sequences in eukaryotes and prokaryotes, significant, yet previously unsuspected, differences might also exist at the level of either the cognate gene or the peptide sequence itself. Apart from experimental studies, we wondered whether techniques devised for artificial intelligence (computer-aided discovery) might help solve the problem. The result of our analysis is clear: Eukaryotic sequences can easily be told from prokaryotic ones, according to 17 different criteria, among which the charge and the aromaticity of the amino acids as well as the presence or absence of a cysteine residue are important. Also we find that Escherichia coli signal sequences are well learned, yielding a clear-cut consensus of descriptors, whereas human signal sequences seem to behave as a much less homogeneous sample. This suggests that the secretion mechanism differs in the two classes of organisms, and that the electrical potential, specific interactions with the secretory apparatus, and possible posttranslational modifications might participate to differing extents in export in prokaryotes and in eukaryotes. Our program is meant to find out which descriptors allow discrimination among the examples we feed in. These descriptors have in themselves a certain "explicative" power, and permit some statistical analysis of the data. As an internal control, we validated the method at the "experimental" level by testing the descriptors on the sequences of membrane proteins lacking a processed signal sequence as well as on newly discovered signal sequences, and at the mathematical level using the "jackknife" test.

Materials and Methods

Data

The sequences of signal peptides have been taken from the compilation of Watson (1984); from the 277 sequences present in this compilation we chose 18 bacterial and 22 human sequences, to avoid redundancy (highly related sequences are present a number of times in the compilation) as well as ambiguous experimental characterization of the cleavage site; in addition, to get a self-consistent set of data we used sequences from the same organism; this was necessary because we sought to prevent creation of discriminating patterns due to speciation. Sequences used in this study are summarized in Table 1. Other sequences are described in the text.

Hardware

The treatment of data and generation of descriptors were performed on a VAX 785 computer. A typical run, corresponding to generation and evaluation of 2500 descriptors, took 600 min of central-processing-unit time.

Software

a) The Learning Program (PLAGE). PLAGE is a learning program using artificial-intelligence techniques. [Cohen and Feigenbaum (1982) provide an excellent introduction to these methods.] It is a derivative of such well-known programs as META-DENDRAL, which determines cleavage rules for molecules by investigating their mass spectra (Buchanan and Feigenbaum 1978); AM, which discovers simple mathematical concepts (integral and prime numbers, various basic theorems) using only the concepts of a set, not formal set theory (Lenat 1977); and LEX, which designs efficient formal mathematical integration strategies (Mitchell et al. 1983). These programs carry out what is known in cognitive psychology as "abstractive learning," that is, creating abstractions from what is known initially using induction to solve the problem at hand. A learning program can be broken down into three parts: (1) combination of possible abstractions given to the program, called the "descriptor space" or "rule space"; (2) methodology used to search the descriptor space; and (3) criteria that permit assessing whether a descriptor or abstraction is pertinent.

PLAGE is thus broken down into three parts:

1. The descriptor space is constructed using two languages, LISP and LG, an extension of LISP (Gascuel 1986), as in the following example: Suppose the user wants to know if the number of times a certain type of amino acid occurs in the sequence is involved in discrimination. The following steps are necessary to define the descriptor space for solving the problem: (i) Create a LISP function called NUMBER, such as: (NUMBER S T) yields the number of amino acids of type T in sequence S. (ii) Create an LG hierarchy linking the different types of amino acids to one


Table 1. Signal sequences used in the learning program (data from Watson 1984)

Molecule                          Sequence

Bacterial signal sequences

Lipoprotein                       MKATKLVLGAVILGSTLLAG
Lambda receptor                   MMITLRKLPLAVAVAAGVMSAQAMA
ompA                              MKKTAIAIAVALAGFATVAQA
ompC                              MKVKVLSLLVPALLVAGAANA
ompE                              MKKSTLALVVMGIVASASVQA
tolC                              MKKLLPILIGLSLSGFSSLSQA
pap pili subunit                  MIKSVIAGAVAMAVVSFGVNNA
Alkaline phosphatase              MKQSTIALALLPLLFTPVTKA
Phosphate binding protein         MKVMRTTVATVVAATLSMSAFSVFA
Maltose binding protein           MKIKTGARILALSALTTMMFSASALA
Leu-specific binding protein      MKANAKTIIAGMIALAISHTAMA
His binding protein               MKKLALSLSLVLAFSSATAAFA
Phage M13 major coat protein      MKKSLVLKASVAVATLVPMLSFA
Beta-lactamase                    MSIQHFRVALIPFFAAFCLPVFA
Lys-Arg-Orn binding protein       MKKTVLALSLLIGLGATAASYA
ompF                              MMKRNILAVIVPALLVAGTANA
Phage M13 minor coat protein      MKKLLFAIPLVVPFYSHS
Leu-Ile-Val binding protein       MNIKGKALLAGCIALAFSNMALA

Human signal sequences

Growth hormone                    MATGSRTSLLLAFGLLCLPWLQEGSA
Alpha-gonadotropin                MDYYRKYAAIFLVTLSVFLHVLHS
Placental lactogen                MPGSRTSLLLAFALLCLPWLQEAGA
Relaxin                           MPRLFLFHLLEFCLLLNQFSRAVAA
Proinsulin                        MALWMRLLPLLALLALWGPDPAAA
Pancreatic peptide                MAAARLCLSLLLLSTCVALLLQPLLGAQG
Alpha-1-antitrypsin               MPSSVSWGILLLAGLCCLVPVSLA
Alpha-interferon A                MALTFALLVALLVLSCKSSGSVG
Beta-interferon                   MTNKCLLQIALLLCFSTTLAS
Gamma-interferon                  MKYTSYILAFQLCIVLGSLG
Ig heavy chain                    MEFGLSWLFLVAILKGVQC
Ig kappa-chain 101                MDMRVLAQLLGLLLLCFPGARC
HLA-Ds alpha-chain                MILDKALMLALGALTTVMSPCGG
HLA-Dr beta-chain                 MVCLKLPGGSSLAALTVTLMVLSSRLAFA
Apolipoprotein A1                 MKAAVLTLAVLFLTGSQA
Antithrombin III                  MYSNVIGTVTSSKRKVYLLSLLLIGFWDCVTC
Alpha-fibrinogen                  MFSMRIVCLVLSVVGTAWT
Beta-fibrinogen                   MKRMVSWSFHKLKTMKHLLLLLLCVFLVKS
Pro ACTH-beta-lipotropin          MPRSCCSRSGALLLALLLQASMEVRG
Alpha-fetoprotein                 MKWVESIFLIFLLNFTESR
Retinol binding protein           MKWVWALLLLAAWAAA
AchR alpha-subunit                MEPWPLLLLFSLCSAGLVLG

another. For example, a small portion of the hierarchy involved in the problem might look like:

AMINO ACID
    POLAR
    DESTABILIZING
    AROMATIC
        Tyr
        Phe

(iii) Define how NUMBER will be used in LG, i.e., the syntax for calling the function NUMBER and the way a descriptor including NUMBER may be refined or specialized. The formalism employed here is similar to ATN (augmented transition network). In our example, the user will write:

-> ( -> NUMBER -> S -> AMINO ACID -> ) ->

Given the graph and the hierarchy, the program knows that C1 = (NUMBER S AMINO ACID) and C2 = (NUMBER S AROMATIC) are possible descriptors and that C2 is a specialization of C1.

2. The descriptor space is searched using a top-down strategy. "Top down" implies that the program starts from the most general descriptor. The first descriptor is then used to generate specialized descriptors, which are tested for relevance to the problem. If pertinent, the descriptors are used to solve the problem. Further descriptors are generated from the specialized descriptors that are not pertinent, tested for problem relevance, and so on. In our example, the program would search the following graph:


(NUMBER S AMINO ACID)            [the most general descriptor]
    (NUMBER S POLAR)
    (NUMBER S DESTABILIZING)
    (NUMBER S AROMATIC)
        (NUMBER S Tyr)
        (NUMBER S Phe)

3. Each descriptor generated during the search is evaluated and scored using elements of the original sets: The higher the descriptor's score, the more likely it is that the descriptor is involved in discrimination. Only descriptors scoring above a certain threshold are used to solve the problem. When the descriptor being scored is Boolean [i.e., true (1) or false (0)], the score (S) is the measure of chi square applied to the descriptor-class contingency table shown below.

                          Class
S:x               H. sapiens   E. coli
Descriptor   1        a           c
             0        b           d

A chi-square test on the hypothesis "class and descriptor have dependent responses" can be done based on the table. If the probability is high that the hypothesis is true, then class and descriptor can be assumed to have dependent responses. Then it is possible (with a degree of confidence fixed by the chi-square value) to deduce the class of a given example once its response to the descriptor function is known. The optimal score occurs when a = d = 0 or c = b = 0. In this case, considering the size of the classes (22 human signals and 18 E. coli signals), the score is 40.

All descriptors presented in this paper but one have scores higher than 7.879. This value corresponds to a dependence probability of 99.5%. In other words, the chances are 995 out of 1000 that the responses of the descriptors created as the problem's solution are dependent on the classes. A score of 12.116 (the highest number represented in the chi-square table we possess) corresponds to a dependence probability of 99.95%.

To score descriptors yielding quantitative responses (such as the barycenter of a class of amino acids), the program transforms the descriptor into the Boolean descriptor [Constant ≤ M]. The value of the constant is chosen so as to maximize the descriptor's score, which is then assigned to the descriptor. From then on the Boolean form of the descriptor is used in place of its initial quantitative value.

b) The Benchmark. Let C be a binary descriptor, and C(i) be the value of the descriptor for a signal sequence i. The question we ask is, if C(i) equals 1 (or 0), is it more human or more bacterial? In general, let us consider a C having the following contingency table:

        H. sapiens   E. coli
1           a           c
0           b           d

From a probabilistic standpoint, when (ad - bc) > 0, C(i) = 1 implies that i is more human and C(i) = 0 implies that i is more bacterial. The converse is true when (ad - bc) < 0. One can therefore define C'(i), the "C viewpoint on i," to equal 1 when i is more human, 0 when i is more bacterial, and 0.5 when no response can be given for descriptor C [i.e., when C(i) = s].

We have imposed on the set of descriptors D a simple benchmark (i.e., a mapping between a given sequence of amino acids and a binary representation of the descriptors) as follows: $B(i) = \sum_{C \in D} C'(i)$. With 17 descriptors B(i) = 0 when i is typically an E. coli signal sequence, whereas B(i) = 17 for a typically human signal sequence.

c) The Jackknife Test. The "jackknife" is a test that evaluates a learning program's performance on given examples. The test's principle is as follows: Let E be the set of examples, and let e be a given element of the set. The program, knowing only E - {e}, assigns a class to example e; this class is then compared with the actual class of e. The process is then repeated for each e of E; if the number of examples showing a correct class assignment is high, the program has performed well in the discrimination task. The test therefore verifies whether the information learned is general. Two comments about the test are worth making: (i) If the program uses pure memorization, the results will be null. (ii) This test is costly in computation time, since it requires that the program do as much learning as there are examples.

Results

Discovery of Discriminating Patterns

The aim of our program was to identify from the sequence data a set of patterns that would permit discrimination between eukaryotic and prokaryotic sequences, at least when they are considered as groups. In order to have well-matched examples we used data from two sets of sequences in which a significant number of signal sequences are known: an E. coli set (including two sequences from the closely related organism Salmonella typhimurium) and a human set (Table 1). The amino acids were classified according to the scheme of Schwartz and Dayhoff (1978).

Descriptors were discovered using a software package, PLAGE, that behaves in the following way: Primitive descriptors (amino acid number, barycenter, distribution, etc.), as well as a grammar for combining them, are tested on the signal-sequence sets to find out whether they give identical or different answers for both sets. For instance, from the three primitive descriptors "number," "aromaticity," and "sequence," the program builds up the descriptor "number of aromatic amino acids in the sequence." The validity of this descriptor is estimated, and the descriptor is then retained or dropped for use in obtaining the final score. In the present work more than 2500 different descriptors were assayed. The set of possible descriptors is termed the descriptor space; it is defined by the interaction between the primitive descriptors and the grammar.
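The chi-square score described in Materials and Methods can be illustrated with a short Python sketch. It is not the PLAGE code (which was written in LISP and LG); it assumes the score is the plain Pearson chi-square of the 2 x 2 table without continuity correction, an assumption that is at least consistent with the stated optimum of 40 for 22 human and 18 E. coli sequences. The function names `chi_square_score` and `best_threshold` are hypothetical, and the threshold search is only one possible reading of "the constant is chosen so as to maximize the descriptor's score."

```python
def chi_square_score(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
           H. sapiens   E. coli
    1          a           c
    0          b           d
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or a column is empty: the descriptor cannot discriminate.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def best_threshold(human_values, coli_values):
    """Turn a quantitative descriptor M into the Boolean [k <= M],
    trying every observed value as the constant k and keeping the best score."""
    best_score, best_k = 0.0, None
    for k in sorted(set(human_values) | set(coli_values)):
        a = sum(v >= k for v in human_values)   # human, descriptor true
        b = len(human_values) - a               # human, descriptor false
        c = sum(v >= k for v in coli_values)    # E. coli, descriptor true
        d = len(coli_values) - c                # E. coli, descriptor false
        score = chi_square_score(a, b, c, d)
        if score > best_score:
            best_score, best_k = score, k
    return best_score, best_k
```

With perfect separation of the 40 sequences, `chi_square_score(22, 0, 0, 18)` evaluates to 40, matching the optimum quoted in the text.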

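The benchmark B(i) can be sketched in the same spirit. The code below is an illustration under stated assumptions: each descriptor is represented only by its contingency counts (a, b, c, d), a missing answer ("s") is encoded as `None`, and the case ad - bc = 0, which the text does not cover, is arbitrarily given the neutral value 0.5. The names `viewpoint` and `benchmark` are hypothetical.

```python
def viewpoint(value, a, b, c, d):
    """C'(i): 1.0 if the answer points to 'human', 0.0 if it points to
    'bacterial', 0.5 if the descriptor gives no answer (value is None)."""
    if value is None:
        return 0.5
    sign = a * d - b * c
    if sign == 0:
        # Assumption: an uninformative descriptor contributes a neutral 0.5.
        return 0.5
    true_means_human = sign > 0
    if value == 1:
        return 1.0 if true_means_human else 0.0
    return 0.0 if true_means_human else 1.0


def benchmark(values, tables):
    """B(i): sum of the viewpoints of all descriptors on one sequence.
    values[k] is the answer of descriptor k (1, 0 or None);
    tables[k] is its contingency table as a tuple (a, b, c, d)."""
    return sum(viewpoint(v, *t) for v, t in zip(values, tables))
```

For example, with one descriptor whose table is (17, 5, 2, 16) and another whose table is (0, 22, 9, 9), a sequence answering 1 to the first and 0 to the second receives B = 2, the most "human" value possible with two descriptors.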

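The jackknife test is a leave-one-out procedure and can be sketched generically. In the sketch below, `learn` and `classify` stand for any learning program; the two toy learners and the six short sequences are invented for illustration only and are not taken from Table 1. The memorizing learner illustrates comment (i) of the text: pure memorization yields a null result.

```python
from collections import Counter


def jackknife(examples, learn, classify):
    """Fraction of examples correctly classified when each one is held out
    in turn and the program learns from the remaining examples only."""
    correct = 0
    for i, (item, label) in enumerate(examples):
        training = examples[:i] + examples[i + 1:]   # E - {e}
        model = learn(training)
        if classify(model, item) == label:
            correct += 1
    return correct / len(examples)


# Toy learner 1 (hypothetical): majority class among sequences with / without a C.
def learn_rule(training):
    votes = {True: Counter(), False: Counter()}
    for seq, label in training:
        votes["C" in seq][label] += 1
    return {k: (v.most_common(1)[0][0] if v else None) for k, v in votes.items()}


def classify_rule(model, seq):
    return model["C" in seq]


# Toy learner 2 (hypothetical): pure memorization of the training examples.
def learn_memo(training):
    return dict(training)


def classify_memo(model, seq):
    return model.get(seq)


# Invented sequences, for illustration only.
TOY = [("MKKC", "human"), ("MALC", "human"), ("MCLL", "human"),
       ("MKKA", "coli"), ("MKKL", "coli"), ("MKAL", "coli")]
```

The loop calls `learn` once per example, which is the computational cost noted in comment (ii).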

Descriptors that are kept by the program are not very redundant, because a selection process eliminates strongly redundant ones, as in the following case: If two discriminating descriptors were "presence of the pattern LL" and "presence of the pattern LLL," the program would retain only the first descriptor, because it is the most general of the two. Some redundancy may, however, remain; for instance, we may have both "the number of L residues is higher than 6" and "the minimal distance between two L residues is 5." Given that in the sequences under study the total number of amino acids is at most 30, the second descriptor is a consequence of the first one; hence there is redundancy. Elimination of such weakly redundant descriptors has to be performed by the program's user. Here, of the 29 descriptors proposed by the program, we retained 17, with as little redundancy as possible. A score derived from a chi-square test performed on the set of all sequences is obtained for each descriptor assayed: The higher the score, the more significant the descriptor.

To understand the figures that will be presented, one should understand the following:

(i) The primitive descriptor "number" calculates the number of amino acids of a given class (identified by the conventional one-letter code) present in a sequence; for instance, the number of As in the sequence AFFVTSWAAALA is 5, and the number of pairs [aromatic/(ASTPG)], in that order, in the sequence AFPPWTAAWG is 3.

(ii) The "barycenter" of a given amino acid (or a group of amino acids) is an integral value from 0 to 10 giving the relative position of the center of gravity of a given class of amino acids in a sequence; for instance, the barycenter for A is 0 in ABBBBBBBBBBBBBBBBBBB, 5 in BBBBBAAABBBBB, and 10 in BBBBBBBBBBBBBBAAA.

(iii) The descriptor "distribution" calculates the regularity with which amino acids appear in the sequence; it is obtained after calculation of the variance of the distance between amino acids of a given class. For instance, the distribution of A is 0 in ABABABABABAB, 13 in AAABBBBBBBBBBBBBBBBB, and 8 in AABABABBBBBBBBBBABBA; when the total number of amino acids considered is lower than 3, the answer to the descriptor is "s" (no answer).

(iv) The descriptor "position from the start" (or "from the end") may be illustrated as follows: "An aromatic amino acid is present at position 3 from the start" in MAFAAA is true, while "a leucine is present at position 4 from the start" in the same sequence is false.

(v) The "minimum distance" between amino acids (or classes of amino acids) may be illustrated as follows: The minimum distance between an aromatic amino acid and an A in the sequence AFFVTAWLLA is 3; it is a vectorial distance, and measures not the actual number of amino acids between those considered, but the number of intervals in between; when one or both of the two amino acids considered is not present in the sequence the answer to the descriptor is "s."

(vi) The descriptor "presence of a pattern" evaluates whether a given pattern is present in the sequence.

The number of elements in each set of signal peptides varies according to the nature of the descriptor, because of the possibility of no answer (s). The maximum number is given in Table 1: 22 for human signal sequences and 18 for E. coli signal sequences. In the case of the second descriptor ("the barycenter of aromatic amino acids is higher or equal to 7"), for instance, the number is 19 in the human set and 12 in the E. coli set.

Descriptors Discovered by the Program

We now give the descriptors yielding scores greater than or equal to 7, in descending order of score.

Descriptor 1. "The signal sequence contains at least one C."

S:17    H. sapiens   E. coli
1           17          2
0            5         16

Cs are rare in E. coli signal sequences. The only exceptions are plasmid-borne beta-lactamase and Leu-Ile-Val binding protein. Conversely, Cs are very frequent in human signal sequences; in fact this character appears to be one of the most pertinent to specifying human signal sequences.

Descriptor 2. "The barycenter of aromatic amino acids is greater than or equal to 7."

S:14    H. sapiens   E. coli
1            6         12
0           13          0

This descriptor appears to be, once again, highly discriminating. In E. coli, aromatic amino acids are mainly phenylalanine residues located at the COOH end of the sequence (see also descriptor 8 below).

Descriptor 3. "At least one negatively charged amino acid is present in the sequence."

S:14    H. sapiens   E. coli
1           12          0
0           10         18

The negation of the descriptor identifies E. coli signal sequences very efficiently: No exceptions to this


descriptor are found in the set. One should note that in signal-sequence mutants that fail to export, one often finds negatively charged residues (see Bedouelle and Hofnung 1981).

Descriptor 4. "The sequence contains the pattern (MLIV)(MLIV)L at least once."

S:14    H. sapiens   E. coli
1           17          3
0            5         15

This descriptor is highly specific for human signal sequences; the pattern corresponds to a cluster of hydrophobic amino acids having bulky side groups. The core of signal sequences is different in E. coli (see descriptor 15).

Descriptor 5. "A K is present at position +3 from the start of the sequence."

S:14    H. sapiens   E. coli
1            0          9
0           22          9

Signal sequences from E. coli have a "standard" start: MKK (see descriptor 16). A K in the third position seems to be avoided in human sequences, although a positively charged residue (R) may be present.

Descriptor 6. "An amino acid from the set (A, S, T, P, G) is present at position -10."

S:14    H. sapiens   E. coli
1            3         13
0           19          5

Five exceptions to this rule are found in E. coli. However, it appears that the rule is infringed only very slightly in all cases except plasmid-borne beta-lactamase. Indeed, in one case the amino acid present at position -10 is a valine, i.e., a small amino acid (see descriptor 11), and in the three other cases it is a leucine residue bracketed by one of the amino acids from the set: ALA, SLS, or TLS.

Descriptor 7. "At least six leucine residues are present in the sequence."

S:12    H. sapiens   E. coli
1           13          1
0            9         17

This descriptor clearly discriminates against E. coli signal sequences (with one exception, the TolC signal sequence), and it is a good marker of human signal sequences.

Descriptor 8. "There is at least one tryptophan in the sequence."

S:12    H. sapiens   E. coli
1           11          0
0           11         18

This descriptor, related to descriptor 2, discriminates strongly against E. coli sequences. Taken together with descriptor 2, it suggests a role for aromatic amino acids in the secretion process.

Descriptor 9. "An amino acid from the set (M, L, I, V) is present at position -6 in the sequence."

S:12    H. sapiens   E. coli
1           11          0
0           11         18

This is specific for human signal sequences, in spite of the existence of several exceptions. Note how this compares with descriptor 13. Human signal sequences require a hydrophobic, bulky residue at position -6, whereas a chain-breaking one is favored in E. coli.

Descriptor 10. "Positively charged amino acids are present at least twice in the sequence."

S:11    H. sapiens   E. coli
1            8         16
0           14          2

In E. coli, apart from one exception (the protein forming the pili of bacteria) and a nonstandard case (beta-lactamase, which contains a histidine near an arginine residue at the beginning of the sequence), there are two or three positively charged amino acids (mainly lysine residues) in signal sequences; in human signal sequences this frequency varies from zero to eight.

Descriptor 11. "A leucine residue is present at position -11."

S:11    H. sapiens   E. coli
1           14          2
0            8         16

This refinement of descriptor 6 has some specificity for the human signal sequences. Exceptions disappear when one considers that an L residue must be present either at position -10 or -12, and an F residue can also be placed at position -11. This character seems to be selected against in E. coli signal sequences.

Descriptor 12. "An alanine residue is present at position -1."


S: 11    H. sapiens    E. coli
1             8           16
0            14            2

The consensus AXA, discovered by Perlman and Halvorson (1983) in eukaryotic as well as prokaryotic signal sequences, corresponds to this descriptor. In fact, this descriptor discriminates the bacterial signal sequences easily, but is much less significant for human signal sequences. The bacterial exceptions are the signal sequence for lipoprotein (see below) and the phage M13 minor coat protein. The upstream A residue can be replaced by S, T, V, or N in E. coli signal sequences. These amino acids are not related in their chemical properties; however, they all fit the stereochemical scheme

[Scheme: (1)(2)CH-CH(NH2)-COOH]

where 1 and 2 are, respectively, H, H (Ala); OH, H (Ser); OH, CH3 (Thr); CH3, CH3 (Val); and H, CONH2 (Asn).

Descriptor 13. "An amino acid from the set (A, S, T, P, G) at position -6."

S: 11    H. sapiens    E. coli
1             8           16
0            14            2

Two exceptions are found in E. coli: beta-lactamase (where a C is present at this position) and the pap protein of pili [where an F is present at this position, but bracketed by (A, S, T, P, G) amino acids: SFG]. The amino acids present are in fact a subset of Schwartz and Dayhoff's (1978) class 2, and correspond to highly related amino acids:

[Diagram relating the residues S, T, and A; legible labels: +CH3, -OH]

In fact, in E. coli one often finds the pattern FS or a similar one around position -6.

Descriptor 14. "The number of A residues is greater than 4."

S: 10    H. sapiens    E. coli
1             6           14
0            16            4

This descriptor is related to several other descriptors suggesting that in general human sequences are richer in amino acids from the set (M, L, I, V) than are E. coli sequences, where (A, S, T, P, G) residues are found much more often. Alanine residues seem to play a particular role in this latter class, as they are preferentially incorporated into E. coli signal peptides.

Descriptor 15. "At position -10 an amino acid from the set (M, L, I, V) is present."

S: 10    H. sapiens    E. coli
1            16            4
0             6           14

This descriptor discriminates against E. coli signal sequences, and shows significant specificity for human signal sequences, refining what was found for descriptors 6 and 11 (see also descriptor 4).

Descriptor 16. "A positively charged residue is present at position +2."

S: 9     H. sapiens    E. coli
1             5           13
0            17            5

This descriptor associated with descriptor 5 defines the consensus MKK that is present in seven E. coli sequences.

Descriptor 17. "The number of (MLIV)(ASTPG) dipeptides is higher than 3."

S: 7     H. sapiens    E. coli
1            12           17
0            10            1

This descriptor can be refined. In E. coli signal sequences, one finds in a majority of cases the alternating pattern (ASTPG)(MLIV)(ASTPG)(MLIV), which is rarely present in human signal sequences. This correlates with a descriptor that was ruled out because it is slightly redundant with respect to all the retained descriptors, and that showed that (A, S, T, P, G) amino acids are distributed evenly in E. coli sequences, whereas they are rarer and more clustered in human sequences.

Determination of Consensus Signal Sequences

In all, we retained 17 descriptors. Each gives a good score (higher than 7) in the chi-square test. To be able to use descriptors collectively (for instance, to identify a new signal sequence; see below), we established a benchmark, then used the benchmark for discrimination. Finally, we used the "jackknife" test on the overall methodology (discovery of descriptors, establishment of the benchmark, and use of the benchmark for discrimination) to justify (or falsify) our method.
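The S value printed with each descriptor can be checked against its 2 x 2 table of counts. The following is only a sketch under stated assumptions: it assumes that S is the ordinary Pearson chi-square statistic without continuity correction, and that the printed value is truncated to an integer. Neither point is stated in this section; both are inferred from the numbers. The function name and the dictionary are illustrative.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square, no continuity correction, for the table

                 H. sapiens   E. coli
        1            a           b       (descriptor true)
        0            c           d       (descriptor false)
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denominator


# Counts (a, b, c, d) copied from the descriptor tables above.
DESCRIPTOR_COUNTS = {
    7: (13, 1, 9, 17),
    12: (8, 16, 14, 2),
    14: (6, 14, 16, 4),
    15: (16, 4, 6, 14),
    16: (5, 13, 17, 5),
    17: (12, 17, 10, 1),
}

if __name__ == "__main__":
    for number, counts in sorted(DESCRIPTOR_COUNTS.items()):
        print(f"descriptor {number}: chi-square = {chi_square_2x2(*counts):.2f}")
```

Under these assumptions the statistic is about 12.5 for descriptor 7 and about 7.9 for descriptor 17, which agrees with the printed values 12 and 7 after truncation.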

Table 2. The benchmark

                                Discriminator
Molecule                        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17  Score

Bacterial signal sequences
Lipoprotein                     0  s  0  1  0  0  0  0  0  1  0  0  1  0  1  1  1  6.5
Lambda receptor                 0  s  0  0  0  1  0  0  0  1  0  1  1  1  0  0  1  2.5
ompA                            0  1  0  0  1  0  0  0  0  1  0  1  1  1  1  1  1  2
ompC                            0  s  0  0  0  1  0  0  0  1  0  1  1  1  0  1  1  1.5
ompE                            0  s  0  0  1  1  0  0  0  1  0  1  1  1  0  1  1  0.5
tolC                            0  1  0  0  1  0  1  0  0  1  0  1  1  0  1  1  1  4
pap pili subunit                0  1  0  0  1  1  0  0  0  0  0  1  0  1  0  0  1  3
Alkaline phosphatase            0  1  0  0  0  1  0  0  0  1  1  1  1  0  0  1  1  3
Phosphate binding protein       0  1  0  0  0  0  0  0  0  1  0  1  1  1  1  1  1  3
Maltose binding protein         0  1  0  0  0  1  0  0  0  1  0  1  1  1  0  1  1  1
Leu-specific binding protein    0  s  0  0  0  1  0  0  0  1  0  1  1  1  0  1  1  1.5
His binding protein             0  1  0  1  1  1  0  0  0  1  1  1  1  1  0  1  1  2
Phage M13 major coat protein    0  1  0  1  1  1  0  0  0  1  0  1  1  1  0  1  1  1
Beta-lactamase                  1  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0  1  6
Lys-Arg-Orn binding protein     0  1  0  0  1  1  0  0  0  1  0  1  1  1  0  1  1  0
ompF                            0  s  0  0  1  1  0  0  0  1  0  1  1  1  0  0  1  1.5
Phage M13 minor coat protein    0  1  0  0  1  1  0  0  0  1  0  0  1  0  0  1  0  3
Leu-Ile-Val binding protein     1  1  0  0  0  1  0  0  0  1  0  1  1  1  0  0  1  3
Consensus                       0  1  0  0  -  1  0  0  0  1  0  1  1  1  0  1  1  0

Human signal sequences
Growth hormone                  1  1  1  1  0  0  1  1  1  0  1  1  0  0  0  0  1  13
Alpha-gonadotropin              0  0  1  0  0  0  0  0  1  1  0  0  0  0  1  0  0  11
Placental lactogen              1  1  1  1  0  0  1  1  1  0  1  1  0  1  0  0  1  12
Relaxin                         1  0  1  1  0  0  1  0  0  1  1  1  1  0  1  0  0  12
Proinsulin                      0  0  1  0  0  1  1  1  0  0  1  1  1  1  0  0  1  8
Pancreatic peptide              1  s  0  1  0  0  1  0  1  0  1  0  0  1  1  0  1  12.5
Alpha-1-antitrypsin             1  0  0  1  0  0  1  1  1  0  0  1  0  0  1  0  1  13
Alpha-interferon A              1  0  0  1  0  0  1  0  0  0  0  0  1  0  1  0  1  11
Beta-interferon                 1  1  0  1  0  0  1  0  0  0  1  0  1  0  1  0  1  11
Gamma-interferon                1  0  0  1  0  0  0  0  1  0  0  0  0  0  0  1  1  10
Ig heavy chain                  1  0  1  0  0  0  0  1  1  0  0  0  0  0  1  0  0  14
Ig kappa-chain                  1  1  1  1  0  0  1  0  0  1  1  0  0  0  1  0  0  13
HLA-Ds alpha-chain              1  s  1  1  0  0  0  0  1  0  0  0  0  0  1  0  1  12.5
HLA-Dr beta-chain               1  1  0  1  0  0  1  0  0  1  1  1  1  1  1  0  1  8
Apolipoprotein AI               0  1  0  0  0  1  0  0  1  0  1  1  0  1  0  1  1  5
Antithrombin III                1  0  1  1  0  0  0  1  0  1  1  0  0  0  1  0  1  13
Alpha-fibrinogen                1  0  0  1  0  0  0  1  1  0  1  0  0  0  1  0  0  15
Beta-fibrinogen                 1  0  0  1  0  0  1  1  1  1  1  0  0  0  1  1  0  14
Pro ACTH-beta-lipotropin        1  s  1  1  0  0  1  0  0  1  1  0  1  0  1  0  0  12.5
Alpha-fetoprotein               0  0  1  0  0  0  0  1  0  1  1  0  0  0  1  1  0  11
Retinol binding protein         0  0  0  1  0  0  0  1  0  0  0  1  1  1  1  1  0  8
AchR alpha-subunit              1  0  1  1  0  1  1  1  0  0  0  0  1  0  0  0  0  12
Consensus                       1  0  -  1  0  0  -  -  -  -  -  -  -  0  1  0  -  17

See text for explanation

a) The Benchmark. Table 2 summarizes the validity of each descriptor with respect to each sequence in the sets of E. coli and human sequences. In each case a score was computed (see Materials and Methods) that permits evaluating the degree of relationship of a given sequence with respect to all the sequences in each set. In the E. coli set the score varies from 0 to 6.5. A consensus sequence having typically E. coli features has the score 0; the signal peptide of the lysine-arginine-ornithine transport protein exhibits typical features of the consensus. Two sequences immediately demand attention: the lipoprotein and beta-lactamase signal peptides. Among the features that separate these sequences from the bulk is the presence of a cysteine residue (see below).

Among the human sequences the score of the benchmark varies from 5 to 15, with the majority of sequences having scores higher than 11. The main features are presence of cysteine and tryptophan residues, presence of a core sequence containing the pattern (MLIV)(MLIV)L, and, more generally, richness in L and poorness in A.
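The scoring rule itself is defined in Materials and Methods and is not restated here. As a rough illustration only, the sketch below assumes that the score counts the descriptors whose value differs from an E. coli reference profile, that an "s" entry counts as 0.5, and that the reference value for descriptor 5 is 1 (the consensus row of Table 2 shows "-" there). These assumptions are inferred from the printed scores, not taken from the text. The cutoff of 6.5 is the border given in the Discrimination section; treating a score equal to the cutoff as E. coli-like is a further assumption. All names are hypothetical.

```python
# Assumed E. coli reference profile for descriptors 1..17 (descriptor 5 set to 1
# by assumption; Table 2 prints "-" for it in the consensus row).
E_COLI_REFERENCE = (0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1)
CUTOFF = 6.5  # border quoted in the Discrimination section


def benchmark_score(row, reference=E_COLI_REFERENCE):
    """Count deviations from the reference; an 's' entry counts as 0.5."""
    values = row.split()
    if len(values) != len(reference):
        raise ValueError("expected one value per descriptor")
    score = 0.0
    for value, expected in zip(values, reference):
        if value == "s":
            score += 0.5
        elif int(value) != expected:
            score += 1.0
    return score


def classify(score, cutoff=CUTOFF):
    """Scores at or below the cutoff are treated as E. coli-like (assumption)."""
    return "E. coli-like" if score <= cutoff else "human-like"


# Rows copied from Table 2.
ROWS = {
    "Lipoprotein": "0 s 0 1 0 0 0 0 0 1 0 0 1 0 1 1 1",
    "Lys-Arg-Orn binding protein": "0 1 0 0 1 1 0 0 0 1 0 1 1 1 0 1 1",
    "Growth hormone": "1 1 1 1 0 0 1 1 1 0 1 1 0 0 0 0 1",
    "Apolipoprotein AI": "0 1 0 0 0 1 0 0 1 0 1 1 0 1 0 1 1",
}

if __name__ == "__main__":
    for name, row in ROWS.items():
        score = benchmark_score(row)
        print(f"{name}: score {score}, {classify(score)}")
```

With these assumptions the four rows give 6.5, 0, 13, and 5, the scores printed in Table 2, and apolipoprotein AI falls on the E. coli side of the border, as the text reports.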

Table 3. Benchmark test on a sample of sequences from various origins

                                                                Discriminator
Molecule                        Sequence                        1  2  3  4  5  6  7

Lactose permease (a)            MYYLKNTNFWMFGLFFFFYFFIGMA       0  0  0  0  0  0  0
NADH dehydrogenase (a)          MTTPLKKIVIVGGGAGGLEMAT          0  s  1  0  0  1  0
Leader peptidase (a)            MANMFALILVIATLVTG               0  0  0  1  0  0  0
Lambda receptor (mutant) (b)    MMITLRKLPVAAGVMSAQAMA           0  s  0  0  0  1  0
dacA (c)                        MNTIFSARIMKRLALTTALCTAFISAAHA   1  0  0  0  0  0  0
dacA (2nd start site)           MKRLALTTALCTAFISAAHA            1  1  0  0  0  0  0
rbsP (d)                        MNMKKLATLVSAVALSATVSANAMA       0  s  0  0  0  1  0
rbsP (2nd start site)           MKKLATLVSAVALSATVSANAMA         0  s  0  0  1  1  0
sacB (e)                        MNIKKSAKQATVLTFTTALLAGGATQAFA   0  1  0  0  0  0  0
blaZ (f)                        MKKLIFLIVIALVLSA                0  0  0  1  1  0  0
penP (f)                        MKLWFSTLKLKKAAAVLLFSCVALAG      1  0  0  1  0  0  1
penPc (f)                       MKNKRMLKIGICVGILGLSITSLEA       1  s  1  0  0  0  0
Chromosomal beta-lactamase (a)  MFKTTLCALLITASCSTFA             1  0  0  0  1  0  0
Tox (diphtheria toxin) (g)      MSRKLSASILIGALLGIGAPPSAHA       0  s  0  0  0  1  0

References: (a) Watson (1984); (b) Bedouelle and Hofnung (1981); (c) Jackson et al. (1985); (d) Groarke et al. (1983); (e) Steinmetz et al. (1985); (f) Mezes and Lampen (1985); (g) Kaczorek et al. (1983). For further explanation, see text
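Several of the position-based descriptors are simple enough to evaluate directly on the sequences listed in Table 3. The sketch below is illustrative and rests on assumptions that this section does not spell out: position -1 is taken to be the last residue of the listed presequence, position +2 the second residue counting the initiator methionine as +1, and "positively charged" is taken to mean K or R. The function and constant names are hypothetical.

```python
SMALL = set("ASTPG")        # the (A, S, T, P, G) set
HYDROPHOBIC = set("MLIV")   # the (M, L, I, V) set
POSITIVE = set("KR")        # assumed meaning of "positively charged"


def descriptor_values(sequence):
    """Evaluate descriptors 12 to 16 on a presequence given in one-letter code.

    Negative positions are counted back from the end of the listed sequence,
    which matches Python's negative indexing (sequence[-1] is position -1).
    """
    if len(sequence) < 10:
        raise ValueError("sequence too short for position -10")
    return {
        12: int(sequence[-1] == "A"),           # alanine at position -1
        13: int(sequence[-6] in SMALL),         # (A,S,T,P,G) at position -6
        14: int(sequence.count("A") > 4),       # more than 4 alanines
        15: int(sequence[-10] in HYDROPHOBIC),  # (M,L,I,V) at position -10
        16: int(sequence[1] in POSITIVE),       # positive residue at +2
    }


# Two sequences copied from Table 3.
RBSP = "MNMKKLATLVSAVALSATVSANAMA"
LACTOSE_PERMEASE = "MYYLKNTNFWMFGLFFFFYFFIGMA"

if __name__ == "__main__":
    print("rbsP:", descriptor_values(RBSP))
    print("lactose permease:", descriptor_values(LACTOSE_PERMEASE))
```

For these two sequences the values obtained for descriptors 12 to 16 agree with the corresponding entries of Table 3, which gives some support to the position conventions assumed here.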

b) Discrimination. As seen in Table 2, the benchmark permits us to discriminate strongly between E. coli and human signal sequences. However, whereas several signal peptides are typical E. coli signal sequences according to the benchmark, none are typically human. Discrimination is obtained by performing a chi-square test on the scores derived from the benchmark. This method builds classes having the maximum number of elements from each sequence set; in our case the border is placed at a score of 6.5. This permits us to separate exactly all the sequences but one human one, apolipoprotein AI, which displays a score of 5. It was also of interest to investigate whether it was possible using the benchmark to discriminate between E. coli sequences and amino-terminal sequences of membrane-bound polypeptides, as well as to compare signal sequences from other types of bacteria (for instance, Gram-positive organisms) and test whether they looked like E. coli sequences. Finally, sequences recently discovered could be checked for their fit with the set we used when establishing the benchmark.

Table 3 displays the pattern obtained with sequences falling into several classes. First, lactose permease, NADH dehydrogenase, and leader peptidase are integral membrane proteins; their scores are 9, 5.5, and 10, respectively. Two of these sequences therefore fall outside the consensus pattern for E. coli signal peptides. The score for NADH dehydrogenase would place it near the border, and since it is an unprocessed protein we may take it as another indication of the upper limit for a "standard" E. coli signal sequence. This matches well the data given in Table 2, where only two sequences fall outside this limit: lipoprotein and beta-lactamase.

The second class assayed was a set of signal peptides from Gram-positive organisms: levane saccharase of Bacillus subtilis (Steinmetz et al. 1985); three beta-lactamases, from Staphylococcus aureus, Bacillus licheniformis, and Bacillus cereus (Mezes and Lampen 1985); and the diphtheria toxin presequence (Kaczorek et al. 1983). In four cases, including levane saccharase, the score obtained is equal to or greater than 6, near the upper limit fixed for typical E. coli signal peptides; the diphtheria toxin, with a score of 3.5, would fit the E. coli consensus. We then tested E. coli sequences not included in Watson's (1984) compilation: ribose binding protein (Groarke et al. 1983) and D-alanine racemase (Jackson et al. 1985). In the latter case it was known that the fit with the consensus pattern was not good (which was actually what motivated Jackson et al. to use that signal for a study of compartmentation of E. coli membrane proteins), and indeed the score for the published D-alanine racemase presequence was 8. An internal methionine residue was present in the sequence, and since it is difficult to assign unambiguously a correct start to proteins when the actual protein sequence has not been determined, we investigated whether placing the start at the second methionine would improve the score. We then found a score of 6, under the upper limit of the benchmark. We therefore propose that the actual start of the dacA gene preprotein is at the second methionine rather than at the position given by Jackson et al. (1985). We performed a similar test in the case of ribose binding protein: In this case, however, the initial sequence already had a good score (3.5). But since a start at the second methionine would improve the score (1.5) we consider it worthwhile to question the assignment of the methionine start site in this sequence.

Since it has been found that a C is involved in a

covalent interaction with a component of the membrane in the case of lipoprotein (Dev and Ray 1984), it appeared interesting to ask whether the beta-lactamase signal peptide, which possesses an internal cysteine residue, was not involved in a special compartmentation. A chromosomal E. coli beta-lactamase exists, and its signal sequence does not fit many of the descriptors for the E. coli consensus (score, 9); for example, it contains two cysteine residues.

Table 3. Extended

                              Discriminator
Molecule                      8  9 10 11 12 13 14 15 16 17  Score

Lactose permease              1  0  0  0  1  0  0  0  0  0  9
NADH dehydrogenase            0  0  1  0  0  1  0  0  0  1  5.5
Leader peptidase              0  0  0  1  0  1  0  1  0  1  10
Lambda receptor (mutant)      0  0  1  0  1  1  1  0  0  1  2.5
dacA                          0  1  1  1  1  0  1  0  0  1  8
dacA (2nd start site)         0  1  1  1  1  0  1  0  1  1  6
rbsP                          0  0  1  1  1  1  1  0  0  1  3.5
rbsP (2nd start site)         0  0  1  1  1  1  1  0  1  1  1.5
sacB                          0  0  1  1  1  1  1  1  0  0  6
blaZ                          0  0  1  0  1  1  0  1  1  0  6
penP                          1  0  1  0  0  0  1  1  1  0  11
penPc                         0  1  1  0  1  0  0  1  1  1  8.5
Chromosomal beta-lactamase    0  0  0  1  1  1  0  1  0  0  9
Tox (diphtheria toxin)        0  0  1  1  1  1  1  0  0  1  3.5

The benchmark we use is meant to discriminate between E. coli and human sequences, not to give specific common features of signal peptides. It is therefore of interest to test the benchmark on the sequences of mutants that fail to be exported. We chose one such mutant, a mutant of the bacteriophage lambda receptor, whose failure to be exported has been difficult to explain (see Bedouelle and Hofnung 1981). As shown in Table 3, its score, not unexpectedly, does not differ from the score of the wild-type sequence.

c) The Jackknife Test. To "mathematically" evaluate the adequacy of the program we performed the jackknife test (see Materials and Methods) on the examples used in the learning step of the program. The test determines whether the program learns simply by "memorizing" all the data or is able to make generalizations from the learning sets. We used the test to validate the program's discriminatory power. Discrimination is the ability to classify a sequence that is not one of the sequences the program used in learning. It is obtained when the program sorts out significant differences between the input sets; this is reflected by the construction of a benchmark used for the evaluation of new sequences. In the present study a subset of the descriptor space was used to avoid weak redundancies (discussed above) and also to limit computation time. This means that out of all the descriptors given in the descriptor space, the descriptors "number," "distribution," "barycenter," "position from the beginning," and "position from the end" have been used. Moreover, the primitive descriptor "number" has been tested as such, while its derivative "pairs of amino acids" has not been assayed.

In the absence of the jackknife test the program PLAGE was able as shown above to classify 39 out of 40 sequences (the only exception being human apolipoprotein AI). With the jackknife test 37 sequences were classified, the exceptions being beta-lactamase, apolipoprotein AI, and retinol binding protein. The value 37 out of 40 is very high, and demonstrates that what was learned is really quite general.

Discussion

Looking for differences between human and bacterial signal sequences means not seeking the common features of such signals. Indeed, we assume that the overall structures of the signal sequences are similar in the two types of organisms. We shall therefore discuss only how each class departs from the other, starting from the features they have in common. The 17 descriptors found by learning the human and E. coli sets of signal peptides allow discrimination between the two types of organisms: An E. coli sequence can be told from a human sequence. More importantly, as shown in Table 2, it is possible to define a precise pattern fit by all E. coli sequences except perhaps the plasmid-borne beta-lactamase and lipoprotein signal sequences.

We shall now review the three main features of signal sequences (charged amino acids at the NH2 terminus, followed by a hydrophobic core of at least eight amino acids, ending with a flexible COOH terminus probably folded into a beta-turn structure) and compare how these features are manifested in bacteria and in human cells.

(i) The presence of positively charged amino acids is reflected in bacteria by a consensus, MKK, that is present in seven signal peptides of our E. coli sample. We do not find a single example of such a sequence in the human sample, the nearest case (present only once) being MKR; moreover, negatively charged amino acids are often present at the start of human sequences (ME or MD is found in four cases).

(ii) The hydrophobic core of the sequence seems to have distinct features in E. coli and human signal sequences: In the former it contains the pattern (ASTPG)(MLIV)(ASTPG)(MLIV) (12 cases), whereas in the latter the corresponding pattern is a

variant of a long, purely hydrophobic stretch containing the pattern (MLIV)(MLIV)L (17 cases).

(iii) The formation of a specific cleavage site in a beta-turn of the protein follows a rule that seems to be different in the two classes: In the E. coli signal sequences there is a stringent requirement for an A residue at position -1. [This is quite important especially if one sets aside the case of the lipoprotein. The exception to the rule in that case corresponds to the COOH end of the signal sequence. This can be easily understood if one remembers that the lipoprotein signal peptide is cleaved off only after covalent binding to a membrane glyceride fatty acid at the cysteine residue immediately adjacent to the proteolytic cleavage site (see Pugsley and Schwartz 1985 for discussion), and that this requires a special signal peptidase for reasons discussed above.] There seems to be a much looser requirement at position -1 in the human signals (S, C, and G residues are equally tolerated). Similarly, at position -6 an (A, S, T, P, G) amino acid must be present in the bacterial peptides (and preferably an F(ASTPG) pair, as in eight cases), whereas an (M, L, I, V) residue (generally L) is preferably present at that position in the human peptides.

In summary, the NH2 terminus of bacterial signal sequences is positively charged, and is followed by a flexible, mildly hydrophobic core sequence that turns around a phenylalanine residue to end in a typical "small" (i.e., including valine and asparagine) amino acid-X-alanine triplet at the cleavage site. While E. coli signal sequences are well defined (which allowed us to predict actual translation start sites in recently identified peptides; Table 3), it is much harder to infer a consensus sequence for human peptides (Table 2). Human signal sequences start with a hydrophilic sequence of variable length, possibly negatively charged, followed by a rigid, leucine-rich, very hydrophobic core sequence that turns around a leucine residue to end in a sequence preferably rich in (A, S, T, P, G) amino acids, possibly containing cysteine residues. It is noteworthy that whereas in an E. coli sequence containing a leucine-rich core, as in alkaline phosphatase, the core is broken by a proline residue, in a human sequence a leucine residue rigidifies the core at the same position.

In many instances it is difficult to know whether a signal sequence is processed. The case of the E. coli chromosomal beta-lactamase is revealing in this context: Its score would clearly separate it from standard E. coli sequences, and, by analogy with other unprocessed termini of membrane proteins, this might suggest that the leader peptide is not cleaved (or cleaved more slowly, as many human signal peptides are when cloned into E. coli). More generally, a significant feature of beta-lactamase signal sequences seems to be the presence of cysteine residues: In plasmid-coded beta-lactamase one is found at a position (-6) that has been recognized as important for recognition by signal peptidase, and two are present in the chromosomal counterpart. A cysteine residue is also present in the Leu-Ile-Val binding protein, but at a position (-12) that has not been found to be discriminating. The presence of appropriately placed cysteine residues, as well as other features permitting discrimination with respect to the consensus features of usual signal peptides, might permit a specialized localization of the protein, as the amino-terminal cysteine residue of mature lipoprotein does (Inouye et al. 1983). In any case it suggests that experimental studies meant to account for the general properties of secretion in bacteria should not be performed with beta-lactamase signal peptides, contrary to what is often performed.

There is at least one physical parameter that differs between eukaryotes and prokaryotes: the electric potential of the membrane. A potential has been shown to be required in the prokaryotic export process and for exportation in the absence of active translation (Vlasuk et al. 1983; Oxender et al. 1984). Our results substantiate that the export mechanisms must be different in the two types of organisms. Indeed, we find significant differences in signal sequence between prokaryotes and eukaryotes. The membrane-potential effect can be accounted for by the distribution of positive charges in prokaryotes (Vlasuk et al. 1983): The starts of the signal sequences could certainly bind to the interior of the cytoplasmic membrane; then the peptide could enter the membrane and probably fold like a hairpin until it passed through the membrane and exposed its cleavage region to a specific signal peptidase. The protein could then pass through the membrane (or possibly remain stuck in it) as translation proceeded. In this model, because E. coli signal sequences are so similar to one another, there is no specific interaction between the signal sequences and the nascent exported protein; much of the information required for localization of the protein must therefore be included in the protein itself. This is consistent with recent evidence (Jackson et al. 1985) suggesting that exchange of signal sequences between proteins normally having different localizations (outer membrane and cytoplasmic membrane) does not change the final localizations of the proteins.

The situation for human signal sequences is more difficult to visualize. The apparent rigidity of the hydrophobic core of the signal peptide might mean that its structure is formed entirely on the ribosome, before interaction with the membrane, so that it can go through the membrane as a prefolded rigid hairpin [such a model has been proposed by Engelman

and Steitz (1981)]. Perhaps this would be important in the absence of an active membrane potential. In addition, because the signal sequences are so different from one another, one can propose that in this system each signal sequence interacts in a specific way with each transported protein. One should also keep in mind that there are at least two mechanisms of secretion in eukaryotes, constitutive secretion and regulated secretion (Kelly 1985); the apparent heterogeneity in the pattern of amino acid distribution in human signal sequences might well reflect this fact. The abundance of tryptophan and cysteine residues suggests a specific interaction with components of the transport machinery. Aromatic-aromatic interactions are known to participate in the folding of proteins as well as in quaternary interactions (Burley and Petsko 1985). Cysteine residues might participate as well in transient formation of disulfide bridges with components of the secretion machinery, and both types of amino acids are well suited to permit a kind of ratchet mechanism that would help the protein cross the membrane; the presence of cysteine in such a mechanism would be selected against in bacteria because of the possibility, known to exist in the case of lipoprotein, of covalent interactions between cysteine and membrane lipids. Alternatively, such a covalent interaction between the signal peptide and specific lipid components of the membrane could be part of the secretion process in eukaryotes.

Finally, one could argue that some of the differences between E. coli and human signal sequences reflect differences at the level of the mRNA rather than at the level of the protein. Indeed, translation is somewhat different in prokaryotes and in eukaryotes, and a translation effect is visible in the codon usage (Burns and Beacham 1985). The bias in codon usage, however, does not imply a difference at the level of the amino acids in the protein; moreover, while one could argue that a significant difference might be found near the translation initiation site, this is much less likely near the signal-peptide cleavage site, whose position with respect to the start varies. The conclusions that we have drawn about the core and the COOH end of the sequences are therefore very likely to hold even if some constraint arising from the mRNA structure has to be taken into account.

Pattern recognition is a notoriously difficult task. As pointed out by Simon (1979) the human brain is compelled to abandon one track when it tries to follow more than seven tracks at the same time; therefore very good heuristics are required to discover new patterns in a given set of complex data. A computer cannot as such create new information, but software for discovering patterns using artificial-intelligence techniques is designed to overcome the limitations of the human brain, and therefore can help identify previously unrecognized patterns. Using such software we have found that E. coli signal peptides can be told not only from bulk protein amino acid sequences, as was already known [consider, for instance, the significantly higher proportion of leucine residues in eukaryotic proteins (von Heijne 1981)], but also from signal peptides from such remote organisms as humans. Finally, our work shows that the techniques of artificial intelligence may be of some help in biology. Indeed, of the 17 descriptors we retained, few had already been noted as important by other workers (von Heijne 1981, 1983; Perlman and Halvorson 1983), mainly because extensive exploration of primitive descriptors requires an enormous amount of time, as indeed was reflected in the time consumed in the central processing unit of the computer to generate the 2500 descriptors explored in this study.

Acknowledgments. We are grateful to J. Sallantin for his stimulating interest and to C. Gottschalk for linguistic editing. Financial support came from the Centre Mondial Informatique, CNRS (ATP DVAR n. 899), INSERM (CRE 83 1007), and the Institut Pasteur (Unite BRq).

References

Austen BM (1979) Predicted secondary structures of amino-terminal extension sequences of secreted proteins. FEBS Lett 103:308-313

Bedouelle H, Hofnung M (1981) On the role of the signal peptide in the initiation of protein exportation. In: Pullman B (ed) Intermolecular forces. Reidel, Dordrecht Boston, pp 361-372

Blobel G, Dobberstein B (1975a) Transfer of proteins across membranes. I. Presence of proteolytically processed and unprocessed nascent immunoglobulin light chains on membrane-bound ribosomes of murine myeloma. J Cell Biol 67:835-851

Blobel G, Dobberstein B (1975b) Transfer of proteins across membranes. II. Reconstitution of functional rough microsomes from heterologous components. J Cell Biol 67:852-862

Buchanan BG, Feigenbaum EA (1978) Dendral and meta-dendral: their applications dimensions. Artificial Intelligence 11:5-24

Burley SK, Petsko GA (1985) Aromatic-aromatic interaction: a mechanism of protein structure stabilization. Science 229:23-28

Burns DM, Beacham IR (1985) Rare codons in E. coli and S. typhimurium signal sequences. FEBS Lett 189:318-322

Cohen PR, Feigenbaum EA (1982) The handbook of artificial intelligence, vol 3. Pitman, London, chapter 14

Dev IK, Ray PH (1984) Rapid assay and purification of a unique signal peptidase that processes the prolipoprotein from Escherichia coli. J Biol Chem 259:11114-11120

Engelman DM, Steitz TA (1981) The spontaneous insertion of proteins into and across membranes: the helical hairpin hypothesis. Cell 23:411-422

Gascuel O (1986) PLAGE: A way to give and use knowledge in a learning program. In: Proceedings of the European Working Session on Learning, L.R.I., Orsay (available upon request)

Groarke JM, Mahoney WC, Hope JN, Furlong CE, Robb FT, Zalkin H, Hermodson MA (1983) The amino acid sequence of D-ribose-binding protein from Escherichia coli K-12. J Biol Chem 258:12952-12956

Harrison TM, Brownlee GG, Milstein C (1974) Studies on polysome-membrane interactions in mouse myeloma cells. Eur J Biochem 47:613-620

Inouye S, Franceschini T, Sato M, Itakura K, Inouye M (1983) Prolipoprotein signal peptidase of Escherichia coli requires a cysteine residue at the cleavage site. EMBO J 2:87-91

Jackson ME, Pratt JM, Stoker NG, Holland IB (1985) An inner membrane protein N-terminal signal is able to promote efficient localisation of an outer membrane protein in Escherichia coli. EMBO J 4:2377-2383

Kaczorek M, Delpeyroux F, Chenciner N, Streeck RE, Murphy JR, Boquet P, Tiollais P (1983) Nucleotide sequence and expression of the diphtheria tox228 gene in Escherichia coli. Science 221:855-858

Kelly RB (1985) Pathways of protein secretion in eukaryotes. Science 230:25-32

Lenat DB (1977) The ubiquity of discovery. Artificial Intelligence 9:257-285

Meyer DI, Krause E, Dobberstein B (1982) Secretory protein translocation across membranes - the role of the "docking protein." Nature 297:647-650

Mezes PF, Lampen JO (1985) Secretion of protein by bacilli. In: Dubnau DA (ed) The molecular biology of the bacilli, vol II. Academic Press, New York, pp 159-180

Milstein C, Brownlee GG, Harrison TM, Mathews MB (1972) A possible precursor of immunoglobulin light chains. Nature (New Biol) 239:117-120

Mitchell TM, Utgoff PE, Banerji RB (1983) Learning by experimentation: acquiring and refining problem-solving heuristics. In: Michalski RS, Carbonell JG, Mitchell TM (eds) Machine learning. Tioga, Palo Alto, California, pp 163-190

Oxender DL, Landick R, Nazos P, Copeland BR (1984) Role of membrane potential in protein folding and secretion in Escherichia coli. Microbiology 1984:4-7

Perlman D, Halvorson HO (1983) A putative signal peptidase recognition site and sequence in eukaryotic and prokaryotic signal peptides. J Mol Biol 167:391-409

Pugsley AP, Schwartz M (1985) Export and secretion of proteins by bacteria. FEMS Microbiol Rev 32:3-38

Schwartz RM, Dayhoff MO (1978) In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5. National Biomedical Research Foundation, Washington, DC, pp 353-358

Simon H (1979) Models of Discovery. Reidel, Dordrecht Boston

Steinmetz M, Le Coq D, Aymerich S, Gonzy-Treboul G, Gay P (1985) The DNA sequence of the gene for the secreted Bacillus subtilis enzyme levansucrase and its genetic control sites. Mol Gen Genet 200:220-228

Vlasuk GP, Inouye S, Ito H, Itakura K, Inouye M (1983) Effects of the complete removal of basic amino acid residues from the signal peptide on secretion of lipoprotein in Escherichia coli. J Biol Chem 258:7141-7148

von Heijne G (1981) Membrane proteins: the amino acid composition of membrane-penetrating segments. Eur J Biochem 116:419-422

von Heijne G (1983) Patterns of amino acids near signal-sequence cleavage sites. Eur J Biochem 133:17-21

Walter P, Blobel G (1981) Translocation of proteins across the endoplasmic reticulum. III. Signal-recognition protein (SRP) causes signal sequence-dependent and site-specific arrest of chain elongation that is released by microsomal membranes. J Cell Biol 91:557-562

Watson MEE (1984) Compilation of published signal sequences. Nucleic Acids Res 12:5145-5164

Received February 3, 1986/Accepted June 26, 1986