J. Mol. Biol. (1987) 198, 327-337
A Strategy for the Rapid Multiple Alignment of Protein Sequences
Confidence Levels from Tertiary Structure Comparisons
Geoffrey J. Barton† and Michael J. E. Sternberg
Laboratory of Molecular Biology, Department of Crystallography
Birkbeck College, Malet Street, London WC1E 7HX, U.K.
(Received 28 April 1987, and in revised form 18 July 1987)
An algorithm is presented for the multiple alignment of protein
sequences that is both accurate and rapid computationally. The
approach is based on the conventional dynamic-programming method of
pairwise alignment. Initially, two sequences are aligned, then the
third sequence is aligned against the alignment of both sequences one
and two. Similarly, the fourth sequence is aligned against one, two
and three. This is repeated until all sequences have been aligned.
Iteration is then performed to yield a final alignment.
The accuracy of sequence alignment is evaluated from alignment of the
secondary structures in a family of proteins. For the globins, the
multiple alignment was on average 99% accurate compared to 90% for
pairwise comparison of sequences. For the alignment of immunoglobulin
constant and variable domains, the use of many sequences yielded an
alignment of 63% average accuracy compared to 41% average for
individual variable/constant alignments. The multiple alignment
algorithm yields an assignment of disulphide connectivity in
mammalian serotransferrin that is consistent with crystallographic
data, whereas pairwise alignments give an alternative assignment.
1. Introduction
The advent of fast techniques for DNA sequencing has led to a rapid
expansion in the number of known protein sequences (currently ~4000
in the PIR databank: George et al., 1986). Access to these primary
structures leads to the alignment of two or more protein sequences
that can identify conserved regions of functional and/or structural
importance. Furthermore, if homology can be shown with a
biochemically or crystallographically well-characterized protein,
many properties or aspects of three-dimensional structure may be
predicted (e.g. see Browne et al., 1969).
Since the early work of Fitch (1966) and Needleman & Wunsch (1970),
techniques for the comparison and alignment of two protein or DNA
sequences have been developed for speed (e.g. see Gotoh, 1982;
Taylor, 1984; Fickett, 1984; Wilbur & Lipman, 1983), the
identification of local similarities (e.g. see Sellers, 1979; Goad &
Kanehisa, 1982; Boswell & McLachlan, 1984) and increased sensitivity
(Argos, 1987). Although the alignment of three sequences has been
used to confirm weak homology between two sequences (e.g. see
Doolittle, 1981), the practical limitations of computer memory and
central processing unit (CPU) time restrict the rigorous extension of
two-sequence methods (e.g. see Needleman & Wunsch, 1970) to short
sequences (Murata et al., 1985).
† Author to whom reprint requests should be addressed.
0022-2836/87/220327-11 $03.00/0
The multiple alignment of four or more sequences cannot in practice
be solved by a rigorous method, since the number of segment
comparisons that must be carried out is of the order of the product
of the sequence lengths (many more if gaps are explicitly
considered). Thus, multiple alignment algorithms, in common with fast
pairwise methods (e.g. see Wilbur & Lipman, 1983), seek to identify
an optimum alignment by considering only a small number of the total
possible residue or segment comparisons.
Sankoff and co-workers (Sankoff & Cedergren, 1976) provided a
workable multiple alignment algorithm applicable to nucleic acid
sequences, which requires the sequences to be linked by a
predetermined evolutionary tree. Sobel & Martinez (1986) described an
algorithm that bases the alignment of DNA sequences on the
identification of common subsequences of a specified minimum length.
Waterman (1986) elaborated a similar technique but also allowed for
mismatches, whilst Bains (1986) described a related algorithm that
works well for some families of closely similar nucleic acid
sequences. None of these methods has been applied directly to protein
sequences, although Waterman (1986) described how his algorithm could
be so applied.
© 1987 Academic Press Limited
The algorithm of Taylor (1986) allows large numbers of protein
sequences to be aligned but for maximum effect requires that
three-dimensional structures are known for some of the sequences in
order to provide a "seed" alignment. Johnson & Doolittle (1986)
described a more general multiple alignment algorithm for protein
sequences whereby a small subset of all possible segment comparisons
is considered. Although their algorithm can cope with the three-way
alignment of long sequences, alignment of more than four sequences is
restricted to short proteins, due to excessive CPU demands. Bacon &
Anderson (1986) reduced the number of segment comparisons performed
by considering the sequences in an arbitrary order and maintaining
only the best-scoring segments as each new sequence is added. Their
algorithm does not explicitly cater for gaps, nor does it produce a
complete alignment of the sequences; however, it presents a sensitive
technique for the identification of significant short homologies.
This paper reports an algorithm that can generate a multiple
alignment including the consideration of gaps for a large number of
protein sequences without the need to introduce additional
non-sequence information. Performing all pairwise comparisons for the
sequences suggests confidence levels for the multiple alignment of
particular sequence groups.
2. Procedures
(a) Needleman & Wunsch algorithm for two sequences
(1) A matrix of amino acid pair scores, D, is chosen. Throughout this
study the MDM$_{78}$ matrix was used (Dayhoff, 1978) with a constant
of 8 added to remove all negative terms.
(2) The protein sequences are defined as $A1_m$, $A2_n$, where m and
n are the number of residues in sequence A1, A2, respectively.
(3) A matrix $R_{m,n}$ is generated with reference to D, where each
element $R_{i,j}$ represents the score for $A1_i$ versus $A2_j$.
(4) $R_{m,n}$ is acted on to generate $S_{m,n}$, where each element
$S_{i,j}$ holds the maximum score for a comparison of $A1_{i,m}$ with
$A2_{j,n}$.
(5) Either suitable pointers are recorded in (4), or a traceback
procedure through $S_{m,n}$ is performed to enable an alignment with
the maximum score for $A1_m$ versus $A2_n$ to be generated.
In order to limit the total number of residues aligned with blanks, a
gap-penalty, G, is subtracted during the process of generating
$S_{m,n}$ whenever a gap is introduced. In our earlier work (Barton &
Sternberg, 1987) we studied the effect upon the accuracy of pairwise
alignment of varying both length-dependent and length-independent gap
penalties. The results indicated that a length-dependent penalty is
unnecessary. Further unpublished results suggest that for the given D
matrix, a length-independent penalty in the range 6 to 10 often
yields a reasonable alignment. Ideally a range of penalties should be
investigated. However, on the basis of these findings and to provide
a consistent benchmark we chose the penalty of 8 (which is not
necessarily optimal) for use throughout the current study.
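
Steps (3) to (5) and the length-independent gap penalty can be sketched in Python as follows. This is a minimal illustration, not the authors' program: the pair scorer `identity_score` is a hypothetical stand-in for the MDM78 matrix plus 8, the names `score_matrix` and `best_score` are invented here, end gaps are assumed to be free, and both sequences are assumed to be non-empty.

```python
def identity_score(a, b):
    """Hypothetical toy scorer (stand-in for MDM78 + 8): 10 if identical."""
    return 10 if a == b else 0


def score_matrix(seq1, seq2, pair_score=identity_score, gap_penalty=8):
    """Build S, where S[i][j] is the best score for seq1[i:] versus seq2[j:].

    A single penalty is subtracted whenever a gap is opened, whatever its
    length (a length-independent penalty).
    """
    m, n = len(seq1), len(seq2)
    S = [[0] * n for _ in range(m)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            r = pair_score(seq1[i], seq2[j])   # element R[i][j]
            if i == m - 1 or j == n - 1:
                S[i][j] = r                    # last row or column: nothing follows
                continue
            best = S[i + 1][j + 1]             # continue without a gap
            for k in range(j + 2, n):          # skip residues of seq2
                best = max(best, S[i + 1][k] - gap_penalty)
            for l in range(i + 2, m):          # skip residues of seq1
                best = max(best, S[l][j + 1] - gap_penalty)
            S[i][j] = r + best
    return S


def best_score(seq1, seq2, pair_score=identity_score, gap_penalty=8):
    """Maximum alignment score, assuming unpenalized end gaps."""
    S = score_matrix(seq1, seq2, pair_score, gap_penalty)
    return max(max(S[0]), max(row[0] for row in S))
```

With the toy scorer, `best_score("ACDE", "ADE")` opens one gap opposite the unmatched C: three identities score 30 and the single penalty of 8 is subtracted. A traceback through `S`, as in step (5), would recover the alignment itself.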
(b) Multiple alignment
Let the sequences to be aligned be A1 ... AN, then:
(1) Align A2 with A1 using the Needleman & Wunsch algorithm. Let the
length of the aligned sequences be denoted $L_{1,2}$.
(2) Align A3 with the alignment of A2 and A1 obtained in step (1).
(3) Align sequence A4 with the alignment of A1, A2 and A3, length
$L_{1,3}$.
(4) Similarly align the sequences A5 to AN.
In general, for the kth sequence align with the previously obtained
alignment for A1 through A(k-1). When generating the matrix R in
order to align the kth sequence a scoring scheme is adopted that
includes a contribution from all previously aligned sequences, thus
highlighting conserved regions in the alignment.
Let i be the position of an aligned residue in sequences
A1 ... A(k-1) such that $1 \le i \le L_{1,(k-1)}$. Let j be the
position of a residue in sequence Ak, then:
$$R_{i,j} = \frac{1}{k-1} \sum_{p=1}^{k-1} D_{Ap_i,\,Ak_j} \qquad (1)$$
For example, if 3 sequences have been aligned so far and we are
considering the comparison of the ith position in the alignment
(Ala-Val-Leu) with the jth amino acid in the 4th sequence (Ala), then
the score ($R_{i,j}$) is given by [the score for (Ala versus Ala) +
(Ala versus Val) + (Ala versus Leu)] × 1/3 = (10 + 8 + 6)/3 = 8. The
value of $D_{Ap_i,\,Ak_j}$ when $Ap_i$ is a gap is set to the minimum
value for any residue to residue score (0 in this work).
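
The averaged score of equation (1) can be written as a short Python function. This is a sketch only: `TOY_SCORES` holds just the three pair scores quoted in the worked example (Ala/Ala 10, Ala/Val 8, Ala/Leu 6), and the function name, the one-letter codes and the frozenset lookup are assumptions made for illustration.

```python
GAP = "-"

# Only the three pair scores quoted in the worked example (one-letter codes).
TOY_SCORES = {
    frozenset("A"): 10,    # Ala versus Ala
    frozenset("AV"): 8,    # Ala versus Val
    frozenset("AL"): 6,    # Ala versus Leu
}


def profile_score(column, residue, pair_scores=TOY_SCORES, gap_score=0):
    """Average score of `residue` against one column of an existing alignment.

    `column` holds the residues (or gaps) of the k-1 sequences already
    aligned at position i; a gap contributes `gap_score`, the minimum
    residue-to-residue score (0 in the paper).
    """
    total = 0
    for aligned in column:
        if aligned == GAP:
            total += gap_score
        else:
            total += pair_scores[frozenset((aligned, residue))]
    return total / len(column)
```

For the column Ala-Val-Leu against Ala, `profile_score("AVL", "A")` gives (10 + 8 + 6)/3 = 8.0, matching the example in the text.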
The multiple alignment obtained in (4) may be refined by realigning
each sequence with the completed alignment less that sequence.
Accordingly, sequence A1 is aligned with the alignment of sequences
A2 ... AN (having first removed any gaps that are common to
A2 ... AN). A2 is then realigned with the alignment of A1, A3 ... AN.
This process is repeated until AN has been realigned with
A1 ... A(N-1). The complete cycle may then be repeated.
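
The bookkeeping of one refinement cycle can be sketched as below. The alignment is assumed to be a list of equal-length strings with "-" for gaps, and `realign` is a hypothetical callable standing in for the sequence-against-alignment step described above; neither name comes from the paper.

```python
def remove_and_strip(alignment, index, gap="-"):
    """Take one sequence out and delete columns that are now all gaps.

    Returns the removed sequence with its gaps stripped, and the remaining
    rows with any gap-only columns removed.
    """
    rest = [row for k, row in enumerate(alignment) if k != index]
    keep = [c for c in range(len(rest[0]))
            if any(row[c] != gap for row in rest)]
    stripped = ["".join(row[c] for c in keep) for row in rest]
    return alignment[index].replace(gap, ""), stripped


def refine(alignment, realign, cycles=1):
    """Realign every sequence in turn against the alignment of the others.

    `realign(sequence, rows)` is assumed to return the newly gapped
    sequence together with the (possibly re-gapped) remaining rows.
    """
    rows = list(alignment)
    for _ in range(cycles):
        for k in range(len(rows)):
            sequence, rest = remove_and_strip(rows, k)
            new_row, new_rest = realign(sequence, rest)
            rows = new_rest[:k] + [new_row] + new_rest[k:]
    return rows
```

Passing `cycles=2` would correspond to repeating the complete cycle once more.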
(c) Criteria for assessing the quality of alignment
The unique conformation that a globular protein adopts and the
resultant disposition of key catalytic or binding residues ultimately
determines its biological activity. Therefore, when 2 or more protein
sequences are aligned it is of crucial importance that residues
defining a common tertiary fold are correctly equivalenced. Although
the general fold may be conserved, there can be considerable
variation in 3-dimensional structure within a protein family. In
particular, the presence of insertions or deletions makes it
impossible to assign structurally equivalent residues over the full
length and common to all members of a family. Even when insertions
and deletions are absent, it can be difficult to justify a particular
structural alignment, especially in the loop regions that connect
elements of secondary structure. In the light of these observations
we use those regions that are common and also found in the core
β-strands or α-helices of the proteins as test zones. An automatic
alignment method should at least be able to align these zones
correctly, although sequence alignments based on structure may also
be justified outside these regions.
We define the accuracy of an alignment of 2 sequences as the
percentage of residues within these zones that are aligned in the
same way as in the reference alignment.
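
This definition of accuracy can be sketched in Python. The representation is an assumption made for illustration: each alignment is a pair of gapped strings, and the test zones are given as a set of residue indices (counted from 0) in the first sequence.

```python
def residue_pairs(row_a, row_b, gap="-"):
    """Map each residue index of sequence A to the index it pairs with in B."""
    pairs = {}
    ia = ib = 0
    for ca, cb in zip(row_a, row_b):
        if ca != gap and cb != gap:
            pairs[ia] = ib
        if ca != gap:
            ia += 1
        if cb != gap:
            ib += 1
    return pairs


def accuracy(test, reference, zone):
    """Percentage of zone residues paired as in the reference alignment.

    `test` and `reference` are (row_a, row_b) tuples for the same two
    sequences; `zone` holds indices into sequence A.
    """
    ref = residue_pairs(*reference)
    got = residue_pairs(*test)
    scored = [i for i in zone if i in ref]
    correct = sum(1 for i in scored if got.get(i) == ref[i])
    return 100.0 * correct / len(scored)
```

For example, with reference `("ACDEF", "AC-EF")` and test `("ACDEF", "A-CEF")`, three of the four zone residues `{0, 1, 3, 4}` keep their reference partner, giving 75.0.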
(d) Order of alignment
The order in which the sequences are added may be expected to have an
effect on the final multiple alignment. However, for N sequences
there are N! alternative orders, so it is important to have a
systematic approach to selecting the single order. Our previous
findings suggested that a pairwise sequence comparison that gives a
significance score of >6.0 s.d. may be aligned to >75% accuracy
within regions of secondary structure (Barton & Sternberg, 1987).
Further results presented in this paper for 49 pairwise comparisons
of immunoglobulin and globin sequences (Fig. 1) support this
observation and indicate that the alignment accuracy is correlated
with the significance score. Accordingly, our strategy for
determining the alignment order first identifies the pair of
sequences that have the highest pairwise significance score. Having
established A1 and A2, A3 is identified as the sequence having the
highest significance score with either A1 or A2. Similarly, A4 is the
sequence that exhibits the highest significance score with A1, A2 or
A3. This procedure is continued until all sequences in the group have
been entered in the order.
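
The ordering strategy amounts to a greedy selection over the matrix of pairwise scores, sketched here in Python. The symmetric list-of-lists `scores` and the function name are assumptions for illustration; which member of the top-scoring pair is called A1 is not stated in the text, so the lower index is taken first.

```python
def alignment_order(scores):
    """Greedy alignment order from a symmetric matrix of pairwise scores.

    Start with the highest-scoring pair, then repeatedly add the sequence
    with the highest score against any sequence already in the order.
    """
    n = len(scores)
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: scores[pair[0]][pair[1]],
    )
    order = [first, second]
    remaining = set(range(n)) - {first, second}
    while remaining:
        # sorted() makes tie-breaking deterministic (lowest index wins)
        nxt = max(sorted(remaining),
                  key=lambda k: max(scores[k][m] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

The same function could be driven by significance scores or by a cheaper normalized alignment score, since only the ranking of the pairwise values is used.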
(e) Reducing calculation time
Before applying the ordering algorithm to a group of sequences it is
necessary to perform all pairwise comparisons for the group. The
calculation of significance scores for all pairwise alignments of N
sequences is an expensive procedure since if M randomizations per
pair are performed then N × (N - 1) × M/2 comparisons must be carried
out. If no pairwise comparisons have previously been made then it is
necessary to determine the cheapest (in CPU time) approach to
determining an order.
Feng et al. (1985) considered how many randomizations need to be
performed on a pair of sequences before consistent results are
obtained and suggested on the basis of 4 pairs of sequences that as
few as 25 could produce a genuinely reflective score. We have
repeated this analysis with a larger data set by carrying out all
pairwise comparisons for 8 members of the immunoglobulin superfamily
and 6 serine proteinase sequences (47 pairs in all), using from 10 to
100 randomizations in steps of 10. The results indicate that
instabilities in significance score do not damp out until at least 60
randomizations have been performed. It would seem impractical
therefore to use this approach routinely for establishing an
alignment order when more than a small number of sequences are
involved.
Figure 1. The relationship between alignment accuracy as measured by
reference to alignments obtained from 3-dimensional structure
superposition and the significance score for 21 pairwise alignments
of 7 globin sequences and 28 alignments of 8 immunoglobulin domains
(Variable refers to immunoglobulin variable domains, Constant to
immunoglobulin constant domains). Key: variable versus variable;
variable versus constant; constant versus constant; globins.
Horizontal axis: significance score (s.d.).
Figure 2. Relationship between the normalized alignment score (NASs)
calculated from the match score divided by the length of the shortest
sequence and the significance score for each of 47 pairwise
comparisons within the serine proteinases and immunoglobulin
superfamily. The data are correlated at 0.957. NASa (match score
divided by the number of residues not aligned with a gap) for the
same sequences gave a correlation value of 0.958. Key: serine
proteinases; immunoglobulin superfamily. Horizontal axis:
significance score (s.d.).
Doolittle (1981) has demonstrated the usefulness of scoring systems
based upon a single alignment score and not involving randomization
of the sequences. Fig. 2 illustrates the relationship between one
such scheme, the match score divided by the length of the shortest
sequence (NASs), and the significance score for 2 groups of sequences
that contain some very closely related members and many with tenuous
similarities. Although these data are not sufficiently representative
of proteins in general to allow a conclusion to be drawn on the
quantitative relationship between NASs and significance, the
qualitative relationship is clear. For groups of proteins where
randomization procedures would be too expensive, NASs values or the
slightly more expensive NASa (match score divided by the number of
residues not aligned with gaps), may be used to generate
systematically a rational order for multiple alignment.
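
Both normalized scores can be computed directly from a finished pairwise alignment, as in this Python sketch. The function names and the gapped-string representation are assumptions; "residues not aligned with a gap" is read here as the number of columns in which both sequences carry a residue.

```python
def nas_s(match_score, row_a, row_b, gap="-"):
    """NASs: match score divided by the length of the shorter sequence."""
    shortest = min(len(row_a.replace(gap, "")), len(row_b.replace(gap, "")))
    return match_score / shortest


def nas_a(match_score, row_a, row_b, gap="-"):
    """NASa: match score divided by the number of residue-residue columns."""
    aligned = sum(1 for a, b in zip(row_a, row_b) if a != gap and b != gap)
    return match_score / aligned
```

For the alignment `"ACDEF-"` over `"-C-EFG"` with a match score of 30, the shorter sequence has 4 residues and 3 columns pair two residues, so NASs is 7.5 and NASa is 10.0. NASa needs the alignment itself, which is why it is the slightly more expensive of the two.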
(f) Test sequences and reference alignments
(i) Globins
Globins belong to the α/α class of proteins and have a common fold
that is highly conserved in proteins from organisms distantly related
in evolution. The sequences, however, show considerable variation and
provide an interesting test for the alignment method. Seven globin
sequences and their reference alignment were taken from Lesk &
Chothia (1980) without modification (human haemoglobin α-chain
(HAHU), human haemoglobin β-chain (HBHU), horse haemoglobin α-chain
(HAHO), horse haemoglobin β-chain (HBHO), sperm whale myoglobin
(MYWHP), sea lamprey cyanohaemoglobin (PILHB) and root nodule
leghaemoglobin (LGHB)). Seven zones totalling 95 residues and
corresponding to the A, B, C, E, F, G and H α-helices were defined
for each sequence as illustrated in Fig. 3(a).
(ii) Immunoglobulin domains
The immunoglobulin domains belong to the β/β class of proteins and
have been studied in detail in terms of both sequence and tertiary
structure (e.g. see Amzel & Poljak, 1976; Beale & Feinstein, 1976;
Lesk & Chothia, 1982). Although the overall fold of the domains is
conserved, there is considerable sequence variation, particularly
between the variable and constant domains. These sequences thus
provide a particularly stringent test for an alignment method.
Eight domains were selected (Brookhaven data bank codes). Four from
3FAB: (1) light chain constant region Cλ (FABCL); (2) light chain
variable region Vλ (FABVL); (3) heavy chain constant region 1 Cγ1
(FABCH1); (4) heavy chain variable region Vγ (FABVH). Two from 1FC1:
(1) heavy chain constant region 2 Cγ2 (FCCH2); (2) heavy chain
constant region 3 Cγ3 (FCCH3). Two from 1FB4: (1) light chain
variable region Vλ (FB4VL); (2) heavy chain variable region Vγ
(FB4VH). The reference alignment for the 8 domains was taken
principally from Cohen et al. (1981) and Lesk & Chothia (1982).
However, the most recent version of the co-ordinates deposited in the
Brookhaven data bank (Bernstein et al., 1977) for 3FAB shows a
modified sequence in the B-C loop and C strand of FABVL. From a
consideration of hydrogen bonding, and least-squares fitting of the
domains, the alignment was revised in this area to take account of
these changes. In addition it should be noted that the sequence of
FB4VL used here (taken from the Brookhaven data bank structure 1FB4)
differs slightly from the earlier version used by Lesk & Chothia:
threonine (T) has been substituted for serine at residue numbers 23
and 33, whilst alanine (A) substitutes for glycine (G) at 75. Seven
test zones comprising 38 residue positions in total and corresponding
to the 7 homologous β-strands A to G were defined as illustrated in
Fig. 3(b) to (d).
Figure 3. Test multiple alignments. Regions of the sequences written
in capitals and boxed correspond to test zones selected from
homologous secondary structures. (a) Alignment (1), 7 globin
sequences: A, B, C, E, F, G, H are α-helices; (b) alignment (2), 4
immunoglobulin constant domains; (c) alignment (3), 4 immunoglobulin
variable domains; (d) alignment (4), 8 immunoglobulin domains
including variable and constant. In (b) to (d), A, B, C, D, E, F and
G are β-strands.
3. Results
(a) Pairwise comparisons
For each of the 28 unique pairwise comparisons for the
immunoglobulins and 21 comparisons for the globins, the percentage
agreement with the reference alignment was calculated. In addition, a
conventional test for significance was carried out by randomizing
each pair of sequences 100 times and calculating the mean (m) and
standard deviation (s.d.) of the distribution. The significance score
is quoted as (V - m)/s.d., where V is the alignment score for the two
native sequences.
Figure 4. Comparison of accuracy for multiple alignments (1) to (4)
with the same sequences aligned pairwise. Points above the diagonal
line indicate an improvement in accuracy on multiple alignment.
Horizontal axis: pairwise accuracy (%).
Figure 1 illustrates the result of these comparisons. Alignments that
score >15.0 s.d. (7 examples) give at or near 100% agreement with the
reference alignment. Those scoring between 5.0 and 15.0 s.d. (25
examples) give better than 70% agreement with the reference
alignment, whilst scores below 5.0 s.d. (17 examples) show a sharp
rise in alignment accuracy correlated with significance score and
ranging from 0% (0.57 s.d.; FABCH1 versus FB4VH) to 84% (2.4 s.d.;
FABVL versus FABCL). Above 5.0 s.d. there are no poor alignments;
however, in the lower s.d. range small changes in observed
significance score can indicate a considerable difference in
alignment accuracy.
When aligning two sequences it is useful to bear in mind these
findings since they can suggest the likely quality of the alignment
obtained. As an approximate guide we consider an s.d. score above 5.0
to indicate a "good" alignment, with the confidence in alignment
increasing with alignment score. An alignment giving a score below
5.0 s.d. we regard with a caution that becomes more stringent as the
score decreases.
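
The significance score used throughout this section can be sketched in Python with the standard library. Several details are assumptions made for illustration: both sequences are shuffled at each randomization, the sample standard deviation is used, `align_score` is any caller-supplied function returning an alignment score, and a fixed seed is used by default so that results are repeatable.

```python
import random
import statistics


def significance_score(seq_a, seq_b, align_score, n_random=100, rng=None):
    """(V - m) / s.d. for the native score V against shuffled sequences.

    m and s.d. are the mean and standard deviation of the scores obtained
    from `n_random` pairs of shuffled sequences of the same composition.
    """
    rng = rng or random.Random(0)          # fixed seed: an assumption
    native = align_score(seq_a, seq_b)
    shuffled_scores = []
    for _ in range(n_random):
        a, b = list(seq_a), list(seq_b)
        rng.shuffle(a)
        rng.shuffle(b)
        shuffled_scores.append(align_score("".join(a), "".join(b)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)  # raises if n_random < 2
    return (native - mean) / sd
```

Under the approximate guide given above, a returned value above 5.0 would be read as a "good" alignment. If every shuffled score happens to be identical the standard deviation is zero and the division fails, so a production version would need to handle that case.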
(b) Test multiple alignments: comparison with pairwise
In each of four test alignments performed, the sequences were ordered
by similarity on the basis of 100 randomizations as shown.
(1) The seven globin sequences HBHU, HBHO, HAHU, HAHO, MYWHP, PILHB,
LGHB.
(2) The four constant domains FABCL, FABCH1, FCCH3, FCCH2.
(3) The four variable immunoglobulin domains FABVL, FB4VL, FB4VH,
FABVH.
(4) The eight immunoglobulin domains FABVL, FB4VL, FB4VH, FABVH,
FCCH2, FABCL, FABCH1, FCCH3.
Figure 4 key. Constant domains (alone) (2); variable domains (alone)
(3); constant domains (with variable) (4); variable domains (with
constant) (4); variable versus constant (4).
Figure 4 shows the accuracy of alignment obtained for pairs of
sequences within multiple alignments (1) to (4) (Fig. 3, (a) to (d))
compared with the same sequences aligned pairwise. Points above the
diagonal line represent an improvement in alignment when the multiple
alignment algorithm is applied. Multiple alignment of the globins (1)
results in an improvement from 90% to 99% overall, with the most
dramatic improvements for the more distantly related sequences PILHB,
MYWHP and LGHB and the largest change occurring for the comparison of
LGHB with HBHO (77 to 88%). The final alignment shown in Figure 3(a)
has 94 out of 95 defined residues correctly aligned for all seven
sequences. Furthermore, the D α-helix, which is only present in HBHU,
HBHO, MYWHP and PILHB, is also correctly aligned. The single error
occurs at the beginning of the F helix and may in part be caused by
the choice of score used for a residue versus a pre-introduced gap
(discussed further below).
Analysis of the globin significance scores by the technique of single
linkage cluster analysis (Dayhoff et al., 1972; Sokal & Sneath, 1973)
in the light of these results suggests that sequences that cluster
above 5.0 s.d. align at least as well by the multiple algorithm as
they do pairwise. The immunoglobulin constant domains (2), which
cluster at 8.5 s.d. and align to 86% accuracy pairwise and 90% by the
multiple algorithm with 30/38 positions correctly equivalenced across
the complete four-sequence alignment (Fig. 3(b)), and the variable
domains, which also cluster at 8.5 s.d. with mean accuracies of 83%
(pairwise), 84% (multiple) and 29/38 positions correctly aligned
across all four sequences (Fig. 3(c)), lend further support to this
hypothesis.
When all eight immunoglobulins are aligned (4), the accuracies of
variable versus variable and constant versus constant domain
comparisons marginally deteriorate (84 ± 10% to 81 ± 7%); however,
there is a striking improvement in the alignment accuracy for
variable versus constant domains from a mean value of 41 ± 28% to
63 ± 3%. This is most noticeable for the alignment of FB4VH versus
FABCH1 and FB4VH versus FCCH3, which were completely misaligned when
compared pairwise. FABVH versus FABCL, which could only be aligned
pairwise to
Table 1
Effect of iteration on the mean alignment accuracy of pairs of
sequences within the multiple alignments

              Iteration (mean accuracy ± 1 s.d.)
Alignment†    0           1           2           3           4
(1)           98.9 ± 1    99.5 ± 0.5  99.5 ± 0.5  99.5 ± 0.5  99.5 ± 0.5
(2)           88.2 ± 6    88.2 ± 6    89.5 ± 7    89.5 ± 7    89.5 ± 7
(3)           83.3 ± 8    83.3 ± 8    84.2 ± 9    85.5 ± 8    87.3 ± 7
(4)           62.6 ± 14   70.5 ± 12   70.8 ± 10   70.5 ± 10   71.6 ± 10

† (1) 7 globins; (2) 4 immunoglobulin constant domains; (3) 4
immunoglobulin variable domains; (4) 8 immunoglobulin domains.
during iteration can help to alleviate this problem but may also
introduce errors at other points in the alignment (results not
shown). Averaging also has an effect on the non-gap regions of the
alignment, so that the effect of iteration becomes less apparent as
the number of sequences is increased. This property can be an
advantage for very large alignments, since a single alignment pass
with no iterations may be sufficient to yield a final alignment.
The alignment of sequences in one specific order is the main route by
which this algorithm reduces the number of comparisons to manageable
proportions. To investigate the importance of order on the final
result, ten unique alternative orders were generated for alignments
(1) and (4). The mean alignment accuracy for the ten globin orders
was little different, at 98.7%, from that obtained for alignments
ordered by s.d. score (99.5%) or NASa (98.9%). However, the ten
immunoglobulin orders gave a mean value of 57.6% compared to 70.8%
for the alignment ordered by s.d. score. This result is hardly
surprising, since many of the orders start with the poor alignment of
a variable and constant domain that is not subsequently corrected.
Indeed, the order that performed least well of the ten was one in
which variable and constant domains alternated (FABVH, FABCH1, FB4VH,
FCCH3, FABVL, FABCL, FB4VL, FCCH2). This gave only 38% accuracy
before iteration with no correctly aligned positions across all eight
sequences. After two iterations, the accuracy had improved to 46% and
4/38 residues (the first four of the F-strand) were in complete
alignment. The order defined by NASa scores (FABVL, FB4VL, FB4VH,
FABVH, FABCH1, FABCL, FCCH3, FCCH2) is very similar to the s.d. score
order and gave an alignment accuracy of 67.2%.
Although it might appear from the variable and constant domain
example that a bad initial alignment will always lead to a generally
poor overall alignment, this is not necessarily true. For example, it
is possible for a multiple alignment of 20 sequences to be poor for
the first ten, yet good for the second ten provided that the second
ten are closely related. This feature is another consequence of the
scoring scheme shown by equation (1), since one good comparison can
be identified against a background of average scores.
4. Applications and Conclusions
One advantage of the algorithm described here is its speed. For
example, the complete seven-sequence globin alignment (Fig. 3(a))
(2 iterations) required 65 seconds, whilst the same operation for the
eight-sequence immunoglobulin alignment (Fig. 3(d)) took 50 seconds
CPU time on a VAX 11/750. Pairwise comparisons to establish an order
without randomization required 44 seconds for the globins and 34
seconds for the immunoglobulins. For comparison, the Johnson &
Doolittle (1986) algorithm requires 60 minutes CPU time to align five
sequences of less than 50 residues in length.
The determination of an alignment order via pairwise comparisons is
the most time-consuming part of the procedure, particularly if
randomizations are performed. However, such an analysis is often
carried out as part of the characterization of a newly determined
sequence and would not need to be repeated to permit multiple
alignment. If the time required to perform all pairwise comparisons
is prohibitive then the seven-globin alignment suggests that an
arbitrary order may perform almost as well for an alignment of
similar sequences. Each iterative pass of our algorithm requires time
approximately proportional to $NM^2$, where N is the number of
sequences and M is the length of the sequences when aligned. Although
it is expensive to align long sequences, the task is not impossible:
however, the longest alignment that may be produced is limited by the
need to store one array of dimensions M × M.
Aligning large numbers of medium-length proteinsequences (150 to
300 residues) is therefore a matterof routine. For example, the
alignment of 128globin sequences, including haemoglobin-a and,
B,myoglobin and leghaemoglobin from a wide rangeof species,
required 25 minutes of CPU timeincluding two iterations (Fig. 5).
The sequencesused in the previous section to test the method
areindicated on the Figure together with the test

Figure 5. Multiple alignment of 128 globin sequences taken from
the PIR databank (databank codes shown). The 7 sequences of known
3-dimensional structure illustrated in Fig. 3(a) are indicated (4
MYWHP, sperm whale myoglobin; 40 HAHU, human α-haemoglobin; 43 HAHO,
horse α-haemoglobin; 79 HBHU, human β-haemoglobin; 82 HBHO,
horse β-haemoglobin; 119 PILHB, sea lamprey cyanohaemoglobin; 120
LGHB, root nodule leghaemoglobin).

zones, whilst the alignment order was determined by similarity,
on the basis of NAS scores for all pairwise comparisons. The seven
test sequences are aligned correctly except for the beginning of
the F-strand (as before), and the last two residues of the G-strand
in PILHB. Furthermore, the remaining 121 sequences also appear to
be aligned in a consistent manner. For example, the proximal and
distal histidine residues are aligned in all but two sequences. The
exceptions are the distal histidine of MYELI (sequence 16, Indian
elephant) where Gln is substituted, and HACCI (sequence 63, desert
sucker α-chain) in which the His is displaced by one residue. In a
recent detailed study of residue

Figure 6. Connectivity of disulphide bridges in the transferrins.
Bridges 1 to 4, 3 to 5 and 2 to 6 correspond to bridges 4, 5 and 11
in the nomenclature of Williams (1982). TFHUL, human
lactotransferrin; TFHUP, human serotransferrin; TFCHE, chicken
ovotransferrin; P97, human melanoma antigen. (a) Connectivity
for TFCHE; (b) connectivity suggested by Metz-Boutigue et al. (1984)
and Rose et al. (1986); (c) connectivity suggested by the rabbit
serotransferrin crystal structure; and (d) multiple alignment that
supports the connectivity in (c).

conservation in the globins based upon pairwise alignment with
manual corrections (Bashford, Chothia & Lesk, personal
communication), the second example was identified as a sequencing
error, since the order of residues as shown (H G K K)
would lead to a shift in the E helix by a quarter turn; the sequence
should be (K H G K).
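
A consistency check of this kind, asking which sequences depart from the residue shared by the majority at a given alignment column, can be sketched as follows. The function name, sequence names and residues are made up for illustration and are not real globin data; the sketch assumes the alignment is held as equal-length strings with '-' for gaps.

```python
from collections import Counter


def column_outliers(alignment: dict[str, str], column: int) -> tuple[str, dict[str, str]]:
    """Return the consensus residue at `column` and the sequences that differ.

    `alignment` maps a sequence name to its aligned string ('-' marks a gap);
    all strings must have the same length. `column` is 0-based.
    """
    residues = {name: seq[column] for name, seq in alignment.items()}
    consensus, _ = Counter(residues.values()).most_common(1)[0]
    outliers = {name: res for name, res in residues.items() if res != consensus}
    return consensus, outliers


if __name__ == "__main__":
    # Toy alignment with invented names and residues.
    toy = {
        "seqA": "KVHGKKV",
        "seqB": "KAHGKKV",
        "seqC": "KVQGKKV",  # substitution at the conserved column
        "seqD": "KVGHKKV",  # conserved His displaced by one position
    }
    print(column_outliers(toy, 2))
```

In the toy data the third column has His as consensus, and the two outliers mimic the two cases described above: a genuine substitution and a residue displaced by one position.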
We stress that the alignment shown in Figure 5 was produced
entirely automatically, without any manual intervention or
pre-alignment of key regions. To our knowledge there is no
other algorithm that will permit an objective global alignment of so
many protein sequences and to such a high level of accuracy.
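
The alignment order for Figure 5 was derived from the scores of all pairwise comparisons. One plausible way of turning such scores into an order is the greedy scheme sketched below; it is an assumption made for illustration and is not necessarily the exact rule used in our program. The function name and the example scores are hypothetical.

```python
def alignment_order(names: list[str], score: dict[frozenset, float]) -> list[str]:
    """Greedy order: start with the best-scoring pair, then repeatedly add the
    unplaced sequence with the highest score to any sequence already placed."""
    best_pair = max(score, key=score.get)
    order = sorted(best_pair, key=names.index)
    remaining = [n for n in names if n not in order]
    while remaining:
        nxt = max(
            remaining,
            key=lambda n: max(score[frozenset((n, placed))] for placed in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    # Hypothetical pairwise scores for four sequences.
    names = ["A", "B", "C", "D"]
    raw = {("A", "B"): 4, ("A", "C"): 5, ("A", "D"): 3,
           ("B", "C"): 8, ("B", "D"): 2, ("C", "D"): 9}
    score = {frozenset(pair): s for pair, s in raw.items()}
    print(alignment_order(names, score))
```

With these scores the most similar pair (C, D) is aligned first, B joins next because of its high score with C, and A is added last.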
Application of the multiple alignment algorithm has proved
valuable during the crystallographic determination of a mammalian
serotransferrin (rabbit) in this laboratory (Gorinsky et al.,
1979; P. Lindley et al., personal communication). Human transferrin
(TFHUP) shows strong sequence homology with chicken transferrin
(TFCHE), human lactotransferrin (TFHUL) and human
melanoma antigen (P97). TFCHE has been studied biochemically and
the connectivity of the disulphide links determined (Williams et
al., 1982). Figure 6(a) illustrates the topology of disulphides 1 to
4 and 3 to 5, which are unambiguous in the alignment of TFCHE, TFHUL
and P97. However, TFHUP in common with rabbit serotransferrin has an
additional disulphide in this region, and the
topology as indicated by the published alignments for TFHUP with
TFCHE and TFHUL (Metz-Boutigue et al., 1984) as well as TFHUP with
P97 (Rose et al., 1986) is shown in Figure 6(b). However, inspection
of the electron density map for serotransferrin at 3.3 Å (1 Å = 0.1
nm) resolution suggested that this topology could not be
accommodated, but that the arrangement shown in Figure 6(c) was more
likely. To provide independent evidence, the multiple alignment
algorithm was applied to the four sequences. Pairwise comparisons
showed that the sequences clustered at 41 S.D., suggesting a high
level of confidence in the alignment, whilst multiple alignment
of the complete sequences (~800 residues) performed with two
iterations led to the alignment partly shown in Figure 6(d).
This alignment supports the crystallographic interpretation of
topology shown in Figure 6(c).
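
The reasoning used here, carrying a known disulphide connectivity from one sequence to another through shared alignment columns, can be sketched as follows. This is an illustration under stated assumptions: the function names, the toy sequences and the bridge numbering are invented and do not represent the transferrin data.

```python
def residue_to_column(aligned: str) -> dict[int, int]:
    """Map 1-based residue numbers of a sequence to 0-based alignment columns."""
    mapping, residue = {}, 0
    for column, char in enumerate(aligned):
        if char != "-":
            residue += 1
            mapping[residue] = column
    return mapping


def transfer_bridges(ref_aligned, target_aligned, ref_bridges):
    """Carry disulphide pairs from a reference to a target through shared columns.

    A bridge is transferred only if both reference cysteines are aligned with
    cysteines in the target; otherwise it is reported as None.
    """
    ref_cols = residue_to_column(ref_aligned)
    col_to_target = {col: res for res, col in residue_to_column(target_aligned).items()}
    result = {}
    for a, b in ref_bridges:
        pair = []
        for res in (a, b):
            col = ref_cols[res]
            partner = col_to_target.get(col)
            ok = partner is not None and target_aligned[col] == "C"
            pair.append(partner if ok else None)
        result[(a, b)] = tuple(pair) if None not in pair else None
    return result


if __name__ == "__main__":
    # Toy aligned pair; bridges are given as reference residue numbers.
    ref = "-AC-GKCLCC"
    target = "MACDG-CLAC"
    print(transfer_bridges(ref, target, [(2, 5), (7, 8)]))
```

In the toy case the first bridge maps onto target cysteines 3 and 6, whereas the second cannot be transferred because one of its cysteines is aligned with a non-cysteine residue. A different alignment of the same sequences would imply a different connectivity, which is the sense in which pairwise and multiple alignments can disagree.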
The algorithm presented in this paper represents a practical
solution to the problem of automatically aligning more than two
protein sequences when only sequence information is available. It
appears most valuable when there are weaker similarities (e.g.
immunoglobulin constant versus variable domains). However, the great
sensitivity of alignment accuracy below 5.0 S.D. (Fig. 1) to
changes in significance score and the sensitivity to alternative
alignment orders make the level of success in an
alignment that includes weakly similar sequences difficult to
predict.
The test systems presented here suggest that when a group of
sequences cluster at >5.0 S.D., multiple alignment by our algorithm
will provide a convenient representation, which is likely to
be >70% correct within secondary structures and more
accurate than individual pairwise alignments.
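
The significance scores quoted in S.D. units can be illustrated with a short sketch. It assumes the conventional definition, in which the score for the real pair is compared with the mean and standard deviation of scores obtained after randomizing one sequence. The scoring function here is a deliberately simple stand-in (identities in an ungapped comparison), not the dynamic-programming score with the Dayhoff matrix; all names and parameters are hypothetical.

```python
import random
import statistics


def sd_score(real_score: float, random_scores: list[float]) -> float:
    """Number of standard deviations by which real_score exceeds the random mean."""
    mean = statistics.mean(random_scores)
    spread = statistics.stdev(random_scores)
    return (real_score - mean) / spread


def identity_score(a: str, b: str) -> int:
    """Stand-in score: identical positions in an ungapped comparison."""
    return sum(x == y for x, y in zip(a, b))


def significance(a: str, b: str, n_random: int = 100, seed: int = 0) -> float:
    """Score a against b, then against shuffled copies of b, and report S.D. units."""
    rng = random.Random(seed)
    shuffled_scores = []
    for _ in range(n_random):
        letters = list(b)
        rng.shuffle(letters)
        shuffled_scores.append(identity_score(a, "".join(letters)))
    return sd_score(identity_score(a, b), shuffled_scores)


if __name__ == "__main__":
    seq = "ACDEFGHIKLMNPQRSTVWY"
    print("S.D. score of a sequence against itself:", round(significance(seq, seq), 1))
```

Shuffling preserves composition and length, so the resulting score reflects the order of residues only. The sketch fails with a division by zero if every randomized score is identical, which a production implementation would need to guard against.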
Our results suggest the overall accuracy of alignment that
might be expected for a particular significance score; however, the
problem still remains of identifying which regions of the alignment
are correct. Argos (1987) has described a sensitive procedure for
identifying significant local homologies between two sequences or
pre-aligned families of sequences and shown that these
often correspond to regions of similarity in three-dimensional
structure. Thus, when there are two or more distinct clusters of
sequences to be aligned, the Argos method may be applied to
alignments obtained automatically by our algorithm and provide an
indication of which regions are correctly equivalenced.
Although the incorporation of properties in addition to the
Dayhoff matrix can improve the sensitivity of sequence comparison
methods by reducing background noise (Argos, 1987), without
a more complete understanding of the relationship between
sequence and three-dimensional structure it is difficult to envisage
a scoring scheme that would, for example, lead to the correct
alignment of the A β-strands of immunoglobulin variable and constant
domains (Fig. 3(d)). When sequence similarity is weak, sequence
alignment becomes an exercise in structure prediction and correct
alignment is constrained by the fact that the code relating
amino acid sequence to three-dimensional structure is
degenerate.
Alignments of clearly similar sequences generated by our
algorithm have been used as the basis of an improved secondary
structure and active site prediction algorithm (Zvelebil et al.,
1987) and also to align four different strains of human
immunodeficiency virus (HIV) env (800 residues), gag (500 residues)
and pol (1000 residues) polyproteins with the aim of predicting
potential T and B-lymphocyte-defined epitopes (Coates et al.,
1987; Sternberg et al., 1987).
It has been suggested that a few key residues can be sufficient
to define a tertiary fold (e.g. see Wierenga et al., 1986). The
multiple alignment algorithm provides a useful tool for identifying
such patterns from closely related sequences. We are
currently developing and calibrating techniques for identifying these
patterns, and rapidly scanning the protein sequence databank to
identify proteins of potentially similar tertiary folds.
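
To make the idea of extracting a pattern from an alignment and scanning with it concrete, a deliberately simple sketch follows. It is not the technique under development: it keeps only columns that are identical in every sequence, treats all other positions as wildcards of fixed spacing, and uses an ordinary regular expression for the scan. Function names and sequences are invented.

```python
import re


def conserved_pattern(aligned: list[str]) -> str:
    """Regex with a literal residue at fully conserved columns and '.' elsewhere.

    Columns containing a gap in any sequence are treated as not conserved.
    Leading and trailing wildcards are removed.
    """
    parts = []
    for column in zip(*aligned):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            parts.append(re.escape(column[0]))
        else:
            parts.append(".")
    return "".join(parts).strip(".")


def scan(pattern: str, sequence: str) -> list[int]:
    """Return 0-based start positions of non-overlapping matches."""
    return [match.start() for match in re.finditer(pattern, sequence)]


if __name__ == "__main__":
    # Toy alignment of three short, invented sequences.
    toy = ["GAGKST", "GSGKTT", "GLGKSS"]
    pattern = conserved_pattern(toy)
    print(pattern)
    print(scan(pattern, "MMGTGKAAGQGK"))
```

A realistic scheme would need to tolerate conservative substitutions and variable spacing between key residues; the fixed-spacing wildcard used here is the main simplification. An alignment with no fully conserved column yields an empty pattern, which matches everywhere and should be rejected by the caller.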
We thank Professor T. Blundell for his continued support, M.
Zvelebil and I. Haneef for helpful discussions, and R. Garrett, B.
Gorinsky and P. Lindley for presenting the transferrin problem. This
work was funded by the Science and Engineering Research Council.
References
Amzel, L. M. & Poljak, R. (1979). Annu. Rev. Biochem. 48, 961-997.
Argos, P. (1987). J. Mol. Biol. 193, 385-396.
Bacon, D. J. & Anderson, W. F. (1986). J. Mol. Biol. 191, 153-161.
Bains, W. (1986). Nucl. Acids Res. 14, 159-177.
Barton, G. J. & Sternberg, M. J. E. (1987). Protein Eng. 1, 89-94.
Beale, D. & Feinstein, A. (1976). Quart. Rev. Biophys. 9, 135-180.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.
Boswell, D. R. & McLachlan, A. D. (1984). Nucl. Acids Res. 12, 457-464.
Browne, W. J., North, A. C. T., Phillips, D. C., Brew, K., Vanaman, T. C. & Hill, R. C. (1969). J. Mol. Biol. 120, 97-120.
Coates, A. R. M., Cookson, J., Barton, G. J., Zvelebil, M. J. & Sternberg, M. J. E. (1987). Nature (London), 326, 549-550.
Cohen, F. E., Novotny, J., Sternberg, M. J. E., Campbell, D. G. & Williams, A. F. (1981). Biochem. J. 195, 31-40.
Dayhoff, M. O. (1972). In Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.), vol. 5, pp. 89-110, National Biomedical Research Foundation, Washington, DC.
Dayhoff, M. O. (1978). In Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.), vol. 5, suppl. 3, pp. 345-358, National Biomedical Research Foundation, Washington, DC.
Dayhoff, M. O., Park, C. M. & McLaughlin, P. J. (1972). In Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.), p. 12, National Biomedical Research Foundation, Washington, DC.
Doolittle, R. F. (1981). Science, 214, 149-159.
Edelman, G. M. (1970). Biochemistry, 9, 3197-3205.
Feng, D. F., Johnson, M. S. & Doolittle, R. F. (1985). J. Mol. Evol. 21, 112-125.
Fickett, J. W. (1984). Nucl. Acids Res. 12, 175-179.
Fitch, W. M. (1966). J. Mol. Biol. 16, 9-16.
George, D. G., Barker, W. C. & Hunt, L. T. (1986). Nucl. Acids Res. 14, 11-15.
Goad, W. B. & Kanehisa, M. I. (1982). Nucl. Acids Res. 10, 247-263.
Gorinsky, B., Horsbaugh, C., Lindley, P. F., Moss, D. S., Parkar, M. & Watson, J. L. (1979). Nature (London), 281, 157-158.
Gotoh, O. (1982). J. Mol. Biol. 162, 705-708.
Johnson, M. S. & Doolittle, R. F. (1986). J. Mol. Evol. 23, 267-278.
Lesk, A. M. & Chothia, C. (1980). J. Mol. Biol. 136, 225-270.
Lesk, A. M. & Chothia, C. (1982). J. Mol. Biol. 160, 325-342.
Metz-Boutigue, M. H., Jollès, J., Mazurier, J., Schoentgen, F., Legrand, D., Spik, G., Montreuil, J. & Jollès, P. (1984). Eur. J. Biochem. 145, 659-676.
Moore, G. W. & Goodman, M. (1977). J. Mol. Evol. 9, 121-130.
Murata, M., Richardson, J. S. & Sussman, J. L. (1985). Proc. Nat. Acad. Sci., U.S.A. 82, 3073-3077.
Needleman, S. B. & Wunsch, C. D. (1970). J. Mol. Biol. 48, 443-453.
Rose, T. M., Plowman, G. D., Teplow, D. B., Dreyer, W. J., Hellstrom, K. E. & Brown, J. P. (1986). Proc. Nat. Acad. Sci., U.S.A. 83, 1261-1265.
Sankoff, R. J. & Cedergren, G. L. (1976). J. Mol. Evol. 7, 133-149.
Sellers, P. (1979). Proc. Nat. Acad. Sci., U.S.A. 76, 3041.
Sobel, E. & Martinez, H. M. (1986). Nucl. Acids Res. 14, 363-374.
Sokol, R. R. & Sneath, P. A. H. (1973). In Numerical Taxonomy, p. 201, Freeman, San Francisco.
Sternberg, M. J. E., Barton, G. J., Zvelebil, M. J. J., Cookson, J. & Coates, A. R. M. (1987). FEBS Letters, 218, 231-237.
Taylor, P. (1984). Nucl. Acids Res. 12, 447-455.
Taylor, W. R. (1986). J. Mol. Biol. 188, 233-258.
Waterman, M. S. (1986). Nucl. Acids Res. 14, 9095-9102.
Wilbur, W. J. & Lipman, D. J. (1983). Proc. Nat. Acad. Sci., U.S.A. 80, 726-730.
Williams, J. (1982). Trends Biochem. Sci. 7, 394-397.
Williams, J., Elleman, T. C., Kingston, I. B., Wilkins, A. G. & Kuhn, K. A. (1982). Eur. J. Biochem. 122, 297-303.
Wierenga, R. K., Terpstra, P. & Hol, W. G. J. (1986). J. Mol. Biol. 187, 101-107.
Zvelebil, M. J., Barton, G. J., Taylor, W. R. & Sternberg, M. J. E. (1987). J. Mol. Biol. 195, 957-961.
Edited by R. Huber