V6 SS 2006

Membrane Bioinformatics – Part II

V6 – Secondary Structure of TM proteins

Suggested reading for this lecture: Appl. Bioinf. 1, 21 (2002)


Introduction

Prediction of secondary structure elements

Performance on test sets



Introduction

Membrane proteins are crucial for survival:

- they are key components for cell-cell signaling

- they mediate the transport of ions and solutes across the membrane

- they are crucial for recognition of self.

The pharmaceutical industry preferably targets membrane-bound receptors.

Particularly important: large super-family of G protein-coupled receptors (GPCRs)

- receptors for hormones, neurotransmitters, growth factors, light and

odor-related ligands.

More than 50% of the prescription drugs act on GPCRs.



Topology of Membrane Proteins

Inside the lipid bilayer, the protein backbone cannot form hydrogen bonds with the aliphatic chains of the phospholipid molecules.

The backbone atoms therefore need to form H-bonds among each other, so membrane-spanning segments adopt either α-helical or β-sheet conformations.



Topology of Membrane Proteins

http://www.biologie.uni-konstanz.de/folding/Structure%20gallery%201.html



History of membrane protein structure determination

1984 bacterial reaction center, Nobel Prize to Michel, Deisenhofer, Huber 1988

1990 EM map of bacteriorhodopsin Henderson

1997 high-resolution structure by Lücke

now several intermediates of the photocycle

1992 porin (complete β-barrel)

1998 halorhodopsin

1995 Cytochrome c Oxidase

1998 F1-ATPase, Nobel Prize to John Walker 1997

1998 KcsA ion channel, Nobel Prize to Roderick MacKinnon 2003

2000 aquaporin

2000 rhodopsin (Palczewski)

2002 SERCA Ca2+ ATPase (Toyoshima)

2003 voltage-gated ion channel

2005 Na+/H+ antiporter (Hunte)



Lipid bilayer simplifies the prediction problem

TM proteins are forced into two classes: α-helical or β-sheet.

α-helices are typically tilted by 10 – 45° with respect to the membrane normal.

The hydrophobic lipid bilayer reduces the three-dimensional structure formation

almost to a 2D problem.



Predicting TM helix location

Hydrophobicity scales provide simple criteria to predict membrane helices.

TMH can be predicted based on the distinctive patterns of hydrophobic (TM) and

polar (non-membrane or water-soluble) regions within the sequence.

Observed patterns:

(1) TM helices are predominantly apolar and 12-35 residues long.

(2) Globular regions between TMH are typically shorter than 60 residues.

(3) Most TMH proteins have a specific distribution of the positively charged amino acids arginine and lysine, the "positive-inside rule" (Gunnar von Heijne). Connecting "loop" regions on the inside of the membrane have more positive charges than "loop" regions on the outside.

(4) Long globular regions (> 60 residues) differ in their composition from those globular regions subject to the "inside-out rule".



Kyte-Doolittle hydrophobicity scale (1982)

Assign a hydropathy value to each amino acid.

Use a sliding window to identify membrane regions.

Sum the hydropathy values over all w residues in the window of length w.

Use a threshold T to assign a segment as a predicted membrane helix.

A window of w = 19 residues could best discriminate between membrane and globular proteins.

A threshold T > 1.6 was suggested for the average over 19 residues.
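
The sliding-window procedure can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names are made up for this example, the defaults are the w = 19 and T = 1.6 quoted above, and the table holds the published Kyte-Doolittle hydropathy values.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_averages(seq, w=19):
    """Mean hydropathy of every window of length w (one value per window start)."""
    values = [KD[aa] for aa in seq]
    return [sum(values[i:i + w]) / w for i in range(len(values) - w + 1)]


def candidate_windows(seq, w=19, threshold=1.6):
    """0-based start positions of windows whose mean hydropathy exceeds the threshold."""
    return [i for i, avg in enumerate(window_averages(seq, w)) if avg > threshold]


# Toy sequence: 19 leucines flanked by aspartates (hypothetical, for illustration only)
toy = "D" * 10 + "L" * 19 + "D" * 10
print(candidate_windows(toy))
```

Overlapping windows above the threshold would still have to be merged into one predicted helix; that step is left out here.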



More refined indices

One drawback of pure hydropathy-based methods is that they fail to discriminate

accurately between membrane regions and highly hydrophobic globular segments.

PRED-TMR algorithm: combine with propensities of finding certain amino acid

residues at the termini of TM helices.

Other hydrophobicity scales:

- Wimley & White: based on partition experiments of peptides between water/lipid bilayer and water/octanol

- TMFinder (Liu & Deber scale): based on HPLC retention times of peptides, combined with non-polar phase helicity.

http://blanco.biomol.uci.edu/hydrophobicity_scales.html



Folding of helical membrane proteins

White, FEBS Lett. 555, 116 (2003)



Hydrophobicity Scales

White, FEBS Lett. 555, 116 (2003)



Translocon-assisted folding of TM proteins?

White, FEBS Lett. 555, 116 (2003)

Upper picture (model!): the newly synthesized polypeptide chain of a membrane protein is inserted from the ribosome into the membrane via interaction with a TM complex, the “translocon” (EM map shown).

Lower picture: experiment largely supports the concerted view.

What determines insertion into the membrane?



Integration of H-segments into the microsomal membrane

Hessa et al., Nature 433, 377 (2005)

Ingenious experiment! Introduce a marker that shows whether helix segment H is inserted into the membrane or not.

a, Wild-type Lep has two N-terminal TM segments (TM1 and TM2) and a large luminal domain (P2). H-segments were inserted between residues 226 and 253 in the P2-domain. Glycosylation acceptor sites (G1 and G2) were placed in positions 96–98 and 258–260, flanking the H-segment. For H-segments that integrate into the membrane, only the G1 site is glycosylated (left), whereas both the G1 and G2 sites are glycosylated for H-segments that do not integrate in the membrane (right).

b, Membrane integration of H-segments with the Leu/Ala composition 2L/17A, 3L/16A and 4L/15A. Bands of unglycosylated protein are indicated by a white dot; singly and doubly glycosylated proteins are indicated by one and two black dots, respectively.



Insertion determined by simple physical chemistry

Measure the fraction of singly glycosylated (f_1g) vs. doubly glycosylated (f_2g) Lep molecules:

$$p = \frac{f_{1g}}{f_{1g} + f_{2g}}, \qquad K_{app} = \frac{f_{1g}}{f_{2g}}, \qquad \Delta G_{app} = -RT \ln K_{app}$$

c, ΔG_app values for H-segments with 2–4 Leu residues. Individual points for a given n show ΔG_app values obtained when the position of Leu is changed.

d, Mean probability of insertion (p) for H-segments with n = 0–7 Leu residues.
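
As a small worked example, the three quantities on this slide (insertion probability p = f_1g / (f_1g + f_2g), apparent equilibrium constant K_app = f_1g / f_2g, and ΔG_app = -RT ln K_app) can be computed from the two measured fractions. The function name and the default temperature of 298.15 K are assumptions made for this sketch; the temperature used in the experiment is not stated on the slide.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def insertion_stats(f1g, f2g, temperature_k=298.15):
    """Return (p, K_app, dG_app in kcal/mol) from the fractions of singly (f1g)
    and doubly (f2g) glycosylated molecules. The temperature is an assumed default."""
    p = f1g / (f1g + f2g)        # probability of membrane insertion
    k_app = f1g / f2g            # apparent equilibrium constant
    dg_app = -R_KCAL * temperature_k * math.log(k_app)
    return p, k_app, dg_app


# Equal amounts of both species correspond to p = 0.5 and dG_app = 0
print(insertion_stats(0.5, 0.5))
```

A segment that is inserted in 80% of the molecules (f1g = 0.8, f2g = 0.2) gives K_app = 4 and a negative ΔG_app, i.e. insertion is favorable.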



Biological and biophysical ΔG^aa scales

a, ΔG_app^aa scale derived from H-segments with the indicated amino acid placed in the middle of the 19-residue hydrophobic stretch.

Only Ile, Leu, Phe and Val clearly favor membrane insertion. All polar and charged residues are strongly unfavorable.

b, Correlation between ΔG_app^aa values measured in vivo and in vitro.

c, Correlation between ΔG_app^aa and the Wimley–White water/octanol free energy scale for partitioning of peptides.



Positional dependencies in ΔG_app

a, Symmetrical H-segment scans with pairs of Leu (red), Phe (green), Trp (pink) or Tyr (light blue) residues. The Leu scan is based on symmetrical 3L/16A H-segments with a Leu-Leu separation of one residue (sequence shown at the top; the two red Leu residues are moved symmetrically outwards) up to a separation of 17 residues. For the Phe scan, the composition of the central 19 residues of the H-segments is 2F/1L/16A, for the Trp scan it is 2W/2L/15A, and for the Tyr scan it is 2Y/3L/14A. The ΔG_app value for the 4L/15A H-segment GGPGAAALAALAAAAALAALAAAGPGG is also shown (dark blue).

b, Red lines show ΔG_app values for symmetrical scans of 2L/17A (triangles), 3L/16A (circles), and 4L/15A (squares) H-segments.

c, Same as b but for a symmetrical scan with pairs of Ser residues in H-segments with the composition 2S/4L/13A.

Tyr and Trp are favorable in the interface region.



Using observed amino acid propensities

With availability of more and more 3D structures, it became possible to train

statistical approaches based on the observed frequencies of amino acids in

membrane proteins vs. non-membrane proteins.

The concept is similar to that of secondary structure prediction for globular proteins.

TMpred: uses statistical amino acid preferences for scoring.

SPLIT (Juretic et al.):

- uses amino acid preferences derived for the "state" membrane helix from a data set of integral membrane proteins with partially known secondary structure

- combines them with preferences for β-strand, turn and non-regular secondary structure based on sets of soluble proteins with known structure.

This method can identify shorter, unstable or movable membrane helices.
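
One common way to turn observed frequencies into preferences is a log-odds score: the frequency of an amino acid in TM segments divided by its frequency in a reference set. The sketch below illustrates that general idea only; it is not the scoring table of TMpred or SPLIT, and the function name, the pseudocount and the toy sequences are assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_propensities(tm_segments, reference_seqs, pseudocount=1.0):
    """Log-odds preference of each amino acid for TM segments vs. a reference set.
    A pseudocount avoids division by zero for residues that were never observed."""
    tm = Counter("".join(tm_segments))
    ref = Counter("".join(reference_seqs))
    tm_total = sum(tm[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    ref_total = sum(ref[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {
        a: math.log(((tm[a] + pseudocount) / tm_total)
                    / ((ref[a] + pseudocount) / ref_total))
        for a in AMINO_ACIDS
    }


# Toy data (hypothetical): a Leu-rich "TM" segment and an acidic "soluble" segment
scores = log_odds_propensities(["LLLLLLAVL"], ["DDDEEEKKL"])
print(round(scores["L"], 2), round(scores["D"], 2))
```

Positive values mark residues enriched in the TM set, negative values mark residues that are depleted. Summing such values over a window gives a score analogous to the hydropathy sum.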



Incorporating more information: TopPred

TopPred (von Heijne 1992)

predicts the complete topology of membrane proteins by using

- hydrophobicity analysis

- automatic generation of possible topologies

- ranking these topologies by the positive-inside rule.

TopPred uses a sliding trapezoid window to detect segments of outstanding hydrophobicity. The two bases of the trapezoid are 11 and 21 residues long.

TopPred decides which candidate segments count as TM helices by choosing the combination that yields the largest difference between the number of positively charged residues on the inside and on the outside.
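
Two ingredients of this scheme can be sketched compactly: a trapezoid weight profile with bases of 11 and 21 residues, and the orientation of a given set of loops by the positive-inside rule. The exact weights, the normalization and both function names are assumptions for illustration; the published TopPred procedure also generates and ranks alternative topologies, which is omitted here.

```python
def trapezoid_weights(core=11, full=21):
    """Weights that are flat over the central `core` residues and fall off linearly
    over the flanks, normalized to sum to 1 (the ramp shape is an assumption)."""
    flank = (full - core) // 2
    ramp = [(i + 1) / (flank + 1) for i in range(flank)]
    weights = ramp + [1.0] * core + ramp[::-1]
    total = sum(weights)
    return [w / total for w in weights]


def orient_by_positive_inside(loops):
    """Given the loop sequences in N-to-C order (consecutive loops lie on opposite
    sides of the membrane), return which side the first loop is assigned to and the
    Lys+Arg charge difference supporting that choice."""
    def positives(segment):
        return segment.count("K") + segment.count("R")

    first_side = sum(positives(loop) for loop in loops[0::2])
    other_side = sum(positives(loop) for loop in loops[1::2])
    side = "in" if first_side >= other_side else "out"
    return side, abs(first_side - other_side)


# Hypothetical loops of a two-helix protein: N-terminal loop, connecting loop, C-terminal loop
print(orient_by_positive_inside(["MKKR", "DSG", "RRK"]))
```

The side carrying more Lys and Arg residues is assigned to the inside, and the size of the charge difference can serve as a measure of confidence in the orientation.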



Improvements from dynamic programming: MEMSAT

MEMSAT (1994) implemented statistical tables (log likelihoods) compiled from well-characterized TM proteins and a dynamic programming algorithm to recognize membrane topology models by expectation maximisation.

Residues are classified as being one of 5 structural states:

Li inside loop

Lo outside loop

Hi inside helix end

Hm helix middle

Ho outside helix end.

Helix end caps are defined to span 4 adjacent residues (one helical turn).

Compile the propensities of the amino acids for the 5 states.

Calculate a score that relates a given sequence to a candidate topology.

Dynamic programming guarantees that the optimal score is found.
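
The role of dynamic programming can be illustrated with a generic best-path search: given a log-likelihood score for every residue in every state and a table of allowed state changes, it returns the highest-scoring state assignment. This is a simplified sketch, not MEMSAT's actual recursion (which also constrains helix lengths). The direction tags ">" and "<" on the helix states are an assumption introduced here so that a helix entered from the inside must exit to the outside.

```python
# Allowed successors of each state; ">" runs inside-to-outside, "<" outside-to-inside
TOPOLOGY_MOVES = {
    "Li": {"Li", "Hi>"},
    "Hi>": {"Hi>", "Hm>"},
    "Hm>": {"Hm>", "Ho>"},
    "Ho>": {"Ho>", "Lo"},
    "Lo": {"Lo", "Ho<"},
    "Ho<": {"Ho<", "Hm<"},
    "Hm<": {"Hm<", "Hi<"},
    "Hi<": {"Hi<", "Li"},
}


def best_path(scores, allowed, start_states):
    """scores[t][s] is the log-likelihood of residue t in state s.
    Returns (best state path, total score) over all paths obeying `allowed`."""
    states = list(allowed)
    neg_inf = float("-inf")
    best = {s: (scores[0][s] if s in start_states else neg_inf) for s in states}
    pointers = []
    for t in range(1, len(scores)):
        new_best, ptr = {}, {}
        for s in states:
            # best predecessor among the states that are allowed to move into s
            candidates = [(best[p], p) for p in states if s in allowed[p]]
            prev_score, prev = max(candidates) if candidates else (neg_inf, None)
            new_best[s] = prev_score + scores[t][s]
            ptr[s] = prev
        best = new_best
        pointers.append(ptr)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(pointers):   # trace the stored predecessors backwards
        path.append(ptr[path[-1]])
    return path[::-1], best[last]
```

Because every residue is visited once and every state keeps only its best predecessor, the search is linear in the sequence length and still exact.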



Using evolutionary information

It is known from secondary structure prediction for globular proteins that using multiple sequence alignment information improves prediction accuracy significantly.

PHDhtm: predicts the location and topology of TM helices with a system of neural networks. It was later combined with dynamic programming.



Using evolutionary information

TMAP (1996):

uses propensity values determined for segments of 21 consecutive residues in transmembrane segments (Pm), and for the flanking 4-residue caps of TM helices (Pe).

Residues with high Pm tend to be hydrophobic; residues with high Pe tend to be polar and basic.

Compute compositional difference in the protein segments exposed to the

two surfaces of a membrane for 12 important residues:

mostly at the outside of membranes: Asn, Asp, Gly, Phe, Pro, Trp, Tyr, Val

mostly inside: Ala, Arg, Cys, Lys.

Use consensus over these 12 residues to predict topology.



Using grammatical rules

The lipid bilayer constrains the structure of the membrane-passing regions of

proteins in many ways.

TMHMM (Sonnhammer et al. 1998, Krogh et al. 2001) and HMMTOP (Tusnady &

Simon 1998, 2001) implement Hidden Markov Models.

TMHMM: uses cyclic model with 7 states for

- TM helix core

- TM helix caps on the N- and C-terminal side

- non-membrane region on the cytoplasmic side

- 2 non-membrane regions on the non-cytoplasmic side (for short and long loops, to account for different membrane insertion mechanisms)

- a globular domain state in the middle of each non-membrane region



Using grammatical rules

HMMTOP: uses hidden Markov model distinguishing 5 structural states

- inside non-membrane regions

- inside TMH-cap

- membrane helix

- outside TMH-cap

- outside non-membrane region

This model is similar to MEMSAT.
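
The "grammar" imposed by such a five-state model can be made explicit: loops touch only their own cap, caps connect a loop to the helix, and a helix must connect the inside cap to the outside cap. The checker below is a sketch of that rule set, using hypothetical one-letter labels (I inside loop, i inside cap, H membrane helix, o outside cap, O outside loop); it validates a state path and does not perform any prediction.

```python
from itertools import groupby

# Which states may sit next to each other along the chain
ADJACENT = {
    "I": {"i"},
    "i": {"I", "H"},
    "H": {"i", "o"},
    "o": {"H", "O"},
    "O": {"o"},
}


def follows_grammar(path):
    """True if a per-residue state string obeys the cyclic loop-cap-helix-cap-loop order."""
    runs = [state for state, _ in groupby(path)]   # collapse repeats: "IIiH" -> I, i, H
    for a, b in zip(runs, runs[1:]):
        if b not in ADJACENT[a]:
            return False
    for prev, cur, nxt in zip(runs, runs[1:], runs[2:]):
        # a helix must cross the membrane and a cap must lead on to the other
        # neighbour, so neither may return to the state it came from
        if cur in ("H", "i", "o") and prev == nxt:
            return False
    return True


print(follows_grammar("IIIiiHHHHooOOOooHHHHiiII"))  # two helices, N- and C-terminus inside
```

A hidden Markov model encodes the same constraints through its transition probabilities: forbidden state changes simply have probability zero.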



Availability of prediction methods.

Many of these servers are also available through a Meta-Server META-PP at the

site of Burkhard Rost.



Prediction accuracy

Often, authors claimed that their methods are > 90% accurate.

However, Chen and Rost claim that most authors have significantly overestimated

the accuracy of their methods.

(1) There are not enough high-resolution structures to allow a statistically significant analysis.

Training and test sets may share members or contain homologous proteins.

Using low-resolution experiments, e.g. gene fusion, is no workaround: low-resolution experiments differ from high-resolution structures almost as much as prediction methods do.

(2) All methods optimise some parameters.

Methods perform much better on proteins for which they were developed than on

new proteins.



Prediction accuracy

(3) Methods using evolutionary information failed due to the surprising fact that membrane helices are not entirely conserved across species.

This is surprising since it implies either that those proteins do not perform similar cellular functions, e.g. GPCRs, or that the same function can in some cases be realized with a different number of membrane regions.

(4) Levels of prediction accuracy between methods can often not be compared

appropriately to one another since they are frequently based on different

measures for prediction accuracy and on different data sets.



Most methods get number of helices right

All methods based on advanced algorithms tend to underestimate the number of TM helices (Q%obs < Q%prd).

Notes to the table:

a Data set: sequence-unique subset of 36 high-resolution TM helical proteins from PDB. This is the largest subset of all 105 high-resolution membrane chains that fulfils the condition that no pair in the set has significant sequence similarity as defined in Rost (1999).

b Methods

c Per-segment accuracy:
- Qok: percentage of proteins for which all TM helices are predicted correctly (allowed deviation of up to 3 residues)
- Q_htm^%obs: percentage of all observed helices that are correctly predicted
- Q_htm^%prd: percentage of all predicted helices that are correctly predicted
- TOPO: percentage of proteins for which the topology (orientation of helices) is correctly predicted (empty for methods that do not predict topology)

d Per-residue accuracy:
- Q2: percentage of correctly predicted residues in two states: membrane helix / non-membrane helix
- Q_2T^%obs: percentage of all observed TMH helix residues that are correctly predicted
- Q_2T^%prd: percentage of all predicted TMH helix residues that are correctly predicted
- Q_2N^%obs: percentage of all observed non-TMH helix residues that are correctly predicted
- Q_2N^%prd: percentage of all predicted non-TMH helix residues that are correctly predicted

e Error: the estimates for per-segment accuracy resulted from a bootstrap experiment with M = 100 and K = 18; the estimates for per-residue accuracy were obtained by standard deviations over Gaussian distributions for the respective score.

f Numbers in italics: two standard deviations below the numerically highest value in each column (set in bold letters).

NOTE: all methods are tested on the same set of proteins. However, the numbers are NOT from a cross-validation experiment, i.e. some methods may have used some of the proteins for training. Generally, newer methods are more likely to be overestimated than older ones. In particular, HMMTOP2, TMHMM1, and WW have been developed using ALL the proteins listed here.
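
The per-residue measures from the table notes are simple counts over aligned observed and predicted labels. The sketch below computes three of them for two-state strings, with T for a TM helix residue and N for any other residue. The function name, the label letters and the dictionary keys are assumptions made for this example.

```python
def per_residue_accuracy(observed, predicted):
    """Q2, Q2T_obs and Q2T_prd (in percent) for two-state label strings of equal length."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    pairs = list(zip(observed, predicted))
    correct = sum(o == p for o, p in pairs)
    both_tm = sum(o == "T" and p == "T" for o, p in pairs)
    n_obs = observed.count("T")
    n_prd = predicted.count("T")
    return {
        "Q2": 100.0 * correct / len(pairs),
        # share of observed TM residues that were predicted as TM
        "Q2T_obs": 100.0 * both_tm / n_obs if n_obs else float("nan"),
        # share of predicted TM residues that really are TM
        "Q2T_prd": 100.0 * both_tm / n_prd if n_prd else float("nan"),
    }


# A prediction that is one residue too short at the N-terminal end of the helix
print(per_residue_accuracy("NNTTTTNN", "NNNTTTNN"))
```

In this toy case every predicted TM residue is correct, but one observed TM residue is missed, so the value relative to the prediction is higher than the value relative to the observation.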



Prediction accuracy

About 86% of the TMH residues predicted by the best methods are correctly

predicted.

If we consider a predicted membrane helix correct when the predicted and the observed helical regions differ by less than 3 residues, the best current methods correctly predict all membrane helices for 70 – 75% of all proteins.

However, the topology is predicted correctly for only about half of all proteins.

The best method, HMMTOP2, had all proteins listed in its training set.

Simple hydrophobicity scales are less accurate than advanced methods.



All methods confuse TM helices with signal peptides

Signal peptides that are cleaved off secreted proteins usually contain stretches of

hydrophobic residues resembling membrane helices.

The most accurate specialists for membrane prediction (TMHMM and PHDhtm)

falsely predict about 30 – 40% of all signal peptides as TM helices.

Simple hydrophobicity scales predict more than 90% of the signal peptides as TM

helices.



Many methods predict TM helices in globular proteins

Simple hydrophobicity scales reach levels close to 100% false positives.

Advanced methods (SOSUI, TMHMM1, PHDhtm) predict TM helices in less than 2% of all globular proteins.

Different methods predict similar numbers of TM proteins in genomes:

about 10 – 30%.

The overall content of TM proteins in genomes of different complexity is similar.

However, eukaryotes have significantly more proteins with > 10 TM helices than

all other species.

Also, the distribution is different: eukaryotes have more 7TM proteins (receptors), whereas prokaryotes have more 6TM and 12TM proteins (ABC transporters).



Future directions

Meta servers yield improved predictions.

> 90% correct topologies can be obtained by a simple majority vote between the

results of various methods.
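
A majority vote over topologies takes only a few lines. In this sketch a topology is summarized as a pair (number of TM helices, side of the N-terminus), and a consensus is reported only if more than half of the methods agree. The representation, the function name and the placeholder method names are assumptions for illustration, not the scheme of any particular meta server.

```python
from collections import Counter


def majority_topology(predictions):
    """predictions maps a method name to (number of TM helices, N-terminus side).
    Returns the topology supported by more than half of the methods, else None."""
    if not predictions:
        return None
    counts = Counter(predictions.values())
    topology, votes = counts.most_common(1)[0]
    return topology if 2 * votes > len(predictions) else None


# Hypothetical output of three prediction methods for one protein
votes = {"method_a": (7, "out"), "method_b": (7, "out"), "method_c": (6, "in")}
print(majority_topology(votes))
```

Returning no consensus when the vote is split keeps the cases that need closer inspection apart from the confident ones.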

TM helix prediction and signal peptide prediction should be combined.

Useful: databases for particular families of TM proteins and sequence motifs

e.g. GPCR database

Membrane-specific substitution matrices improve database searches

e.g. PHAT by Henikoff & Henikoff improved alignments of TM proteins



Summary

TM helices are typically continuous stretches of mostly hydrophobic residues.

Simple methods based on summing up hydrophobicities work reasonably, but not really well.

Advanced methods include additional features such as the "positive-inside rule".

The currently most successful methods are based on Hidden Markov Models or

Neural Networks.

Evaluating performance accuracy should be done using carefully separated

training and test sets.

It is possible to discriminate signal peptides and TM helices.

Only SPLIT 4.0 may detect short helices that do not span the membrane.
