Reviews and Overviews
Am J Psychiatry 161:12, December 2004 (http://ajp.psychiatryonline.org)

The Hamilton Depression Rating Scale: Has the Gold Standard Become a Lead Weight?

R. Michael Bagby, Ph.D.
Andrew G. Ryder, M.A.
Deborah R. Schuller, M.D.
Margarita B. Marshall, B.Sc.

Objective: The Hamilton Depression Rating Scale has been the gold standard for the assessment of depression for more than 40 years. Criticism of the instrument has been increasing. The authors review studies published since the last major review of this instrument in 1979 that explicitly examine the psychometric properties of the Hamilton depression scale. The authors’ goal is to determine whether continued use of the Hamilton depression scale as a measure of treatment outcome is justified.

Method: MEDLINE was searched for studies published since 1979 that examine psychometric properties of the Hamilton depression scale. Seventy studies were identified and selected, and then grouped into three categories on the basis of the major psychometric properties examined: reliability, item-response characteristics, and validity.

Results: The Hamilton depression scale’s internal reliability is adequate, but many scale items are poor contributors to the measurement of depression severity; others have poor interrater and retest reliability. For many items, the format for response options is not optimal. Content validity is poor; convergent validity and discriminant validity are adequate. The factor structure of the Hamilton depression scale is multidimensional but with poor replication across samples.

Conclusions: Evidence suggests that the Hamilton depression scale is psychometrically and conceptually flawed. The breadth and severity of the problems militate against efforts to revise the current instrument. After more than 40 years, it is time to embrace a new gold standard for assessment of depression.

(Am J Psychiatry 2004; 161:2163–2177)

The Hamilton Depression Rating Scale (1) was developed in the late 1950s to assess the effectiveness of the first generation of antidepressants and was originally published in 1960. Although Hamilton (1) recognized that the scale had “room for improvement” (p. 56) and that further revision was necessary, the scale quickly became the standard measure of depression severity for clinical trials of antidepressants (2, 3). The Hamilton depression scale has retained this function and is now the most commonly used measure of depression (3). Our objective in this article is to provide a review of the Hamilton depression scale literature published since the last major evaluation of its psychometric properties, more than 20 years ago (4). More recent reviews have appeared (3, 5–7), but they have not systematically examined the literature with regard to a broad range of measurement issues. Significant developments in psychometric theory and practice have been made since the 1950s and need to be applied to instruments currently in use. We evaluate the Hamilton depression scale in light of these current standards and conclude by presenting arguments for and against retaining, revising, or rejecting the Hamilton depression scale as the gold standard for assessment of depression.

Method

Studies for the review were identified by means of MEDLINE searches for both “depression” and “Hamilton.” All studies published during the period since the last major review (January 1980 to May 2003) were considered. Studies selected for review had to be explicitly designed to evaluate empirically the psychometric properties of the instrument or to review conceptual issues related to the instrument’s development, continued use, and/or shortcomings. At least 20 published versions of the Hamilton depression scale exist, including both longer and shortened versions. This review was limited to studies that examined the original 17-item version, as the majority of the studies that evaluated the scale’s psychometrics used the 17-item version. Only a small number of studies evaluated other versions, and most of these versions contain the original 17 items. Seventy articles met the selection criteria and were categorized into three groups on the basis of the major psychometric property examined: reliability, item response, and validity. Table 1 lists the articles included in the review.

Results

Reliability

Clinician-rated instruments should demonstrate three types of reliability: 1) internal reliability, 2) retest reliability, and 3) interrater reliability. Cronbach’s alpha statistic (78) is used to evaluate internal reliability, and estimates ≥0.70
Rehm and O’Hara (61) | 1985 | English | 158 | 100 | Community (symptomatic) subjects | × ×
Reynolds and Kobak (62) | 1995 | English | 357 | 59 | Psychiatric outpatient/nonreferred community subjects | ×
Riskind et al. (63) | 1987 | English | 191 | 54 | Psychiatric outpatients | × ×
Santor and Coyne (64), Sample 1 | 2001 | English | 316 | — b | Primary care outpatients | ×
Santor and Coyne (64), Sample 2 | 2001 | English | 318 | 70 | Depressed outpatients | ×
Santor and Coyne (65) | 2001 | English | 732 | — b | Depressed patients | ×
Sayer et al. (66) | 1993 | English | 114 | 61 | Psychiatric inpatients | × ×
Senra Rivera et al. (67) | 2000 | Castilian | 52 | 65 | Depressed patients | × ×
Shain et al. (68) | 1990 | English | 45 | 64 | Depressed adolescent inpatients | ×
Smouse et al. (69) | 1981 | English | — b | — b | Depressed patients | ×
Steinmeyer and Möller (70) | 1992 | German | 223 e | 68 | Psychiatric inpatients | ×
Steinmeyer and Möller (70) | 1992 | German | 174 f | 68 | Psychiatric inpatients | ×
Strik et al. (71), Sample 1 | 2001 | Dutch | 156 | 0 | Medical patients | × ×
Strik et al. (71), Sample 2 | 2001 | Dutch | 50 | 100 | Medical patients | × ×
Teri and Wagner (72) | 1991 | English | 75 | 68 | Alzheimer’s patients | ×
Thase et al. (73) | 1983 | English | 147 | 100 | Depressed outpatients | × ×
Thompson et al. (74) | 1998 | English | 242 | 100 | Psychiatric referrals | ×
Whisman et al. (75) | 1989 | English | 70 | 100 | Depressed outpatients | × ×
Williams (76) | 1988 | English | 23 | 65 | Psychiatric inpatients | ×
Zheng et al. (77) | 1988 | Chinese | 329 | 47 | Psychiatric inpatients/outpatients | × ×
a Studies were published between January 1980 and May 2003 and identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Not reported.
c Number of subjects providing data at time 1.
d Number of subjects providing follow-up data 3 months after admission.
e Number of subjects providing baseline (i.e., pretreatment) data.
f Number of subjects providing endpoint (week 6) data after treatment with either paroxetine or amitriptyline.
0.98, and the intraclass r ranged from 0.46 to 0.99. Some
investigators provided evidence that the skill level or ex-
pertise of the interviewer and the provision of structured
queries and scoring guidelines affect reliability (19, 23, 35,
54). Across studies, the best estimate mean of interrater re-
liability for studies reporting higher levels of interviewer
skill and use of expert raters, structured queries, and scor-
ing guidelines did not statistically differ from that for other
studies (z=0.81, n.s.).
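The difference between the Pearson and intraclass coefficients reported for interrater reliability can be illustrated with a short sketch. It assumes a two-way random-effects, absolute-agreement, single-rater intraclass correlation, often written ICC(2,1); the reviewed studies do not all state which form they used, so that choice, the function names, and the ratings below are illustrative assumptions only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def icc_2_1(ratings):
    """ICC(2,1) for `ratings`: one row per subject, one column per rater.

    Assumed form: two-way random effects, absolute agreement, single rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Mean squares for subjects (rows), raters (columns), and residual error.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical total scores: rater B scores every patient 4 points above rater A.
rater_a = [10, 14, 18, 22, 26]
rater_b = [score + 4 for score in rater_a]
r = pearson_r(rater_a, rater_b)
icc = icc_2_1([list(pair) for pair in zip(rater_a, rater_b)])
print(f"Pearson r = {r:.2f}, ICC(2,1) = {icc:.2f}")
```

With these made-up ratings the Pearson coefficient is 1.00 while the intraclass coefficient is about 0.83, because the absolute-agreement form penalizes the constant 4-point offset between raters that the Pearson coefficient ignores.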
At the individual item level, interrater reliability is poor
for many items. Cicchetti and Prusoff (19) assessed reli-
ability before treatment initiation and 16 weeks later at
trial end. Only early insomnia was adequately reliable be-
fore treatment, and only depressed mood was adequately
reliable after treatment. Thirteen items had coefficients
<0.50 before treatment, and 11 items had coefficients
<0.50 after treatment. Rehm and O’Hara (61) performed a
similar analysis with data from two samples. Six items
showed adequate reliability in the first sample (early in-
somnia, middle insomnia, late insomnia, somatic anxiety,
gastrointestinal, loss of libido), as did 10 in the second
sample (depressed mood, guilt, suicide, early insomnia,
middle insomnia, late insomnia, work/interests, psychic
anxiety, somatic anxiety, gastrointestinal). Loss of insight
showed the lowest interrater agreement in both samples.
Craig et al. (20) found that only one item, work/interests,
had adequate interrater reliability. Moberg et al. (50) re-
ported that nine items demonstrated adequate reliability
when the standard Hamilton depression scale was admin-
istered (depressed mood, guilt, suicide, early insomnia,
late insomnia, agitation, psychic anxiety, hypochondria-
sis, loss of insight), but all items showed adequate reliabil-
ity when the scale was administered with interview guide-
lines. Potts et al. (59) demonstrated that a single omnibus
coefficient can mask specific problems. Using a structured
interview version of the Hamilton depression scale, they
TABLE 2. Studies Reporting Reliability Estimates for the Total 17-Item Hamilton Depression Rating Scale a
Columns: Study; Year; Internal Reliability (Cronbach’s alpha); Interrater Reliability (Pearson’s r); Interrater Reliability (Intraclass r); Retest Reliability (Pearson’s r)
Addington et al. (9) | 1990 | 0.82
Addington et al. (10) | 1996 | 0.93
Akdemir et al. (11) | 2001 | 0.75 0.87–0.98 b 0.85
Baca-García et al. (12) | 2001 | 0.97
Cicchetti and Prusoff (19), Time 1 | 1983 | 0.46
Cicchetti and Prusoff (19), Time 2 | 1983 | 0.82
Craig et al. (20) | 1985 | 0.95
Deluty et al. (22) | 1986 | 0.96
Demitrack et al. (23) | 1998 | 0.65–0.79 b
Fuglum et al. (28) | 1996 | 0.86 0.81
Gastpar and Gilsdorf (29) | 1990 | 0.48
Gilley et al. sample 1 (31) | 1995 | 0.92
Gottlieb et al. (32) | 1988 | 0.99
Hammond (34) | 1998 | 0.46
Kobak et al. (37) | 1999 | 0.91 0.98
Koenig et al. (38) | 1995 | 0.97
Leung et al. (42) | 1999 | 0.94
Maier et al. (45), Sample 1 | 1988 | 0.70
Maier et al. (45), Sample 2, Time 1 | 1988 | 0.72
Maier et al. (45), Sample 2, Time 2 | 1988 | 0.70
McAdams et al. (43) | 1996 | 0.77
Meyer et al. (48) | 2001 | 0.57–0.80 b
Middelboe et al. (49) | 1994 | 0.75
O’Hara and Rehm (54), Expert raters | 1983 | 0.91
O’Hara and Rehm (54), Novice raters | 1983 | 0.76
Pancheri et al. (57) | 2002 | 0.90
Potts et al. (59) | 1990 | 0.82 0.92
Ramos-Brieva and Cordero-Villafafila (60) | 1988 | 0.72
Rehm and O’Hara (61), Study 1 | 1985 | 0.76 0.78–0.91 b
Rehm and O’Hara (61), Study 2 | 1985 | 0.91–0.96 b
Reynolds and Kobak (62) | 1995 | 0.92 0.96
Riskind et al. (63) | 1987 | 0.73
Shain et al. (68) | 1990 | 0.97
Teri and Wagner (72) | 1991 | 0.65–0.97 b
Whisman et al. (75) | 1989 | 0.85
Williams (76) | 1988 | 0.81
Zheng et al. (77) | 1988 | 0.71 0.92
a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Range over multiple pairs of raters.
a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Correlation of item scores with total scores. An uncorrected Pearson’s r>0.20 was considered significant. Significant correlations are shown in boldface type.
c Interrater Pearson’s r≥0.70 was considered significant; intraclass r≥0.60 was considered significant. Significant correlations are shown in boldface type.
d The study included both standard and interview guideline methods; interrater reliability was calculated by using the intraclass r.
e The subjects were assigned to two groups by means of a median split according to Hamilton depression scale total scores; interrater Pearson’s r values were calculated for both groups.
f Test-retest Pearson’s r >0.70 was considered acceptable. Acceptable correlations are shown in boldface type.
Validity of psychiatric rating scales such as the Hamilton
depression scale comprises 1) content, 2) convergent,
3) discriminant, 4) factorial, and 5) predictive validity.
Content validity is assessed by examining scale items to
determine correspondence with known features of a syndrome.
Convergent validity is adequate when a scale shows
Pearson’s r values of at least 0.50 in correlations
with other measures of the same syndrome. Discriminant
validity is established by showing that groups differing in
their diagnostic status can be separated by using the scale.
Predictive validity for symptom severity measures such as
the Hamilton depression scale is determined by a statistically
significant (p<0.05) capacity to predict change with
treatment. Factorial validity is established by using factor
analysis or related techniques (e.g., principal-component
analysis) to demonstrate that a meaningful structure can
be found in multiple samples. An a priori criterion of 0.40
has been used to identify which items are part of which
factors (88).
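Two of the numeric criteria just described, the 0.50 minimum for convergent validity and the 0.40 loading criterion for factor membership, can be sketched as simple threshold checks. The function names and example values are hypothetical, and applying the loading cutoff to the absolute value of a loading is an assumption; the source states only the 0.40 criterion.

```python
# Roman-numeral factor labels, enough for solutions with up to eight factors.
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII"]


def meets_convergent_criterion(r, minimum=0.50):
    """True when a correlation with another depression measure is at least 0.50."""
    return r >= minimum


def factors_for_item(loadings, cutoff=0.40):
    """Return labels of the factors on which one item loads at or above the cutoff.

    `loadings` holds the item's loading on factor I, II, ... in order.
    Assumption: the cutoff is applied to the absolute value of each loading.
    """
    return [
        ROMAN[index]
        for index, loading in enumerate(loadings)
        if abs(loading) >= cutoff
    ]


# Hypothetical correlations with another depression measure.
print(meets_convergent_criterion(0.48))  # below the 0.50 criterion
print(meets_convergent_criterion(0.56))  # meets the criterion

# Hypothetical loadings of one item on a three-factor solution.
print(factors_for_item([0.62, 0.45, 0.10]))
print(factors_for_item([0.20, 0.10, 0.30]))
```

An item can meet the loading criterion on more than one factor or on none; both outcomes are handled by returning a list, which may hold several labels or be empty.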
Content validity. Because of its wide use and long clinical
tradition, the Hamilton depression scale seems to both
define and measure depression. One could criticize
DSM-IV for not adequately capturing Hamilton depres-
sion scale depression as much as one could criticize the
Hamilton depression scale for not providing full coverage
of DSM-IV depression. Nonetheless, the operational crite-
ria provided in DSM-IV are used as the official nosology
for much of psychiatry worldwide. The criteria for major
depression have been revised three times in response to
developments in field trial research and clinical consensus
based on expert opinion, most recently in 1994. Researchers
have developed a number of longer versions of the
Hamilton depression scale that include additional symptoms
such as the reverse vegetative features of atypical depression.
However, the core items of the Hamilton depression
scale have remained unchanged for more than 40
years. It is reasonable to ask whether this instrument cap-
tures depression as it is currently conceptualized. Several
symptoms contained within the Hamilton depression
scale are not official DSM diagnostic criteria, although
they are recognized as features associated with depression
(e.g., psychic anxiety). For other symptoms included in the
Hamilton depression scale (e.g., loss of insight, hypochon-
driasis), the link with depression is more tenuous. More
critically, important features of DSM-IV depression are of-
ten buried within more complex items and sometimes are
not captured at all. The work/interests item includes an-
hedonic features along with listlessness, indecisiveness,
social avoidance, and lowered productivity. It is impossi-
ble to determine the extent to which anhedonia per se in-
fluences severity. Guilt is captured in both Hamilton de-
pression scale depression and DSM-IV depression, but the
Hamilton depression scale contains no explicit assess-
ment of feelings of worthlessness. Decision-making diffi-
culties are buried within the work/interests item of the
Hamilton depression scale, but concentration difficulties
are not included. The reverse vegetative symptoms—
TABLE 4. Studies Reporting Estimates of Convergent Validity of the 17-Item Hamilton Depression Rating Scale, Compared With Other Depression Measures a
Columns (r): Study; Year; Beck Depression Inventory; Brief Psychiatric Rating Scale; Center for Epidemiologic Studies Depression Scale; Clinical Global Impression Scale; Carroll Rating Scale for Depression; Global Assessment Scale
Akdemir et al. (11) | 2001 | 0.48 0.56
Berard and Ahmed (15) | 1995 | 0.48
Brown et al. (17) | 1995 | 0.70–0.85 b
Carroll et al. (18) | 1981 | 0.60 0.71
Craig et al. (20) | 1985 | 0.56 0.65
Feinberg et al. (26) | 1981 | 0.77 0.75
Gottlieb et al. (32), Low-severity group | 1988 | 0.89
Gottlieb et al. (32), High-severity group | 1988 | 0.57
Hotopf et al. (36) | 1998 | 0.77
Kobak et al. (37) | 1999 | 0.89
Leung et al. (42) | 1999 |
Maier et al. total sample (46) | 1988 |
Olsen et al. (55) | 2003 |
Rehm and O’Hara (61) | 1985 | 0.73–0.86
Senra Rivera et al. (67), Time 1 | 2000 | 0.70
Senra Rivera et al. (67), Time 2 | 2000 | 0.92
Whisman et al. (75), Time 1 | 1989 | 0.27 0.41
Whisman et al. (75), Time 2 | 1989 | 0.67 0.68
Zheng et al. (77) | 1988 | –0.47
a Estimates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b Multiple assessments over an 8-month period.
Strik et al. (71) | 2001 | 11/12 | 0.76 | 0.86 | 0.41 | 0.99
Thompson et al. (74) | 1998 | — c | 0.69–0.87 d | 0.99–1.00 d | — c | — c
Mean | | 12.6/13.5 | 0.76 | 0.91 | 0.77 | 0.92
a Rates are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.”
b The minimum score above which sensitivity and specificity are maximized in the detection of depression with the Hamilton depression scale for a given study. Where two scores are given, the lower score represents the threshold below which cases are classified as nondepressed, and the higher score represents the threshold above which cases are classified as depressed.
c Not reported.
d Range of scores across multiple assessments.
searchers have examined the capacity of the Hamilton de-
pression scale to distinguish different groups of clinical
patients (e.g., patients with endogenous versus those with
nonendogenous depression, patients with anxiety versus
those with depression) using statistical techniques to de-
tect mean group differences. Classification rates resulting
from receiver operating characteristic (ROC) curve analysis have not been
widely reported in the Hamilton depression scale literature.
Our search identified only seven studies (Table 5), and
some of these investigations sought to detect depression in
samples of patients with medical conditions other than
psychiatric disorders (Table 1). Sensitivity, specificity, and
negative predictive power were generally consistent and
large, but positive predictive power was more variable, and
two studies reported very low positive predictive power.
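The four classification rates named in this paragraph can be computed from scale scores, diagnoses, and a cutoff, as in the sketch below. The data, the function names, and the rule that a score at or above the cutoff counts as a positive screen are assumptions for illustration. The second function shows a general property, not a finding of the review: positive predictive power depends on how common depression is in the sample, which is one possible reason it can vary more across studies than sensitivity and specificity do.

```python
def classification_rates(scores, depressed, cutoff):
    """Sensitivity, specificity, and predictive power at a given cutoff.

    Assumption: a score at or above `cutoff` is treated as a positive screen.
    Raises ZeroDivisionError if any of the four denominators is empty.
    """
    tp = fp = tn = fn = 0
    for score, is_case in zip(scores, depressed):
        flagged = score >= cutoff
        if flagged and is_case:
            tp += 1
        elif flagged and not is_case:
            fp += 1
        elif not flagged and is_case:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_power": tp / (tp + fp),
        "negative_predictive_power": tn / (tn + fn),
    }


def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive power implied by Bayes' rule for a given prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical total scores and diagnoses for eight patients.
scores = [5, 8, 12, 15, 20, 25, 9, 14]
depressed = [False, False, False, True, True, True, True, False]
rates = classification_rates(scores, depressed, cutoff=13)
print(rates)

# Same sensitivity and specificity, different prevalence of depression.
print(round(ppv_from_prevalence(0.80, 0.90, prevalence=0.50), 2))
print(round(ppv_from_prevalence(0.80, 0.90, prevalence=0.10), 2))
```

In the second pair of calls the sensitivity and specificity are held fixed, yet the implied positive predictive power falls from roughly 0.89 to roughly 0.47 as prevalence drops from 50% to 10%.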
The second type of discriminant validity study attempts
to distinguish different clinical groups. In a comparison of
healthy, depressed, and bipolar depressed individuals,
Rehm and O’Hara (61) found that the total Hamilton de-
pression scale score clearly differentiated these three cate-
gories, with the depressed patients scoring higher than the
healthy participants and with the bipolar depressed pa-
tients scoring higher than both of the other groups. At the
item level, four items—psychomotor agitation, gastro-
intestinal symptoms, loss of insight, and weight loss—failed to differentiate depressed from healthy subjects.
Only psychic anxiety and hypochondriasis significantly
differentiated the subjects with unipolar and bipolar de-
pression. Kobak et al. (37) showed significant total scale
score differences between individuals with major depres-
sion, individuals with minor depression, and healthy com-
parison subjects. Zheng et al. (77) reported that the Hamil-
ton depression scale was able to discriminate psychiatric
patients classified as mildly, moderately, and severely dys-
functional on the basis of Global Severity Scale scores.
Thase et al. (73) found that the Hamilton depression scale
could distinguish patients with endogenous depression
from patients with nonendogenous depression, with pa-
tients in the former category having higher scores. Gott-
lieb et al. (32) reported no significant differences between
the Hamilton depression scale scores of patients classified
as having low-severity versus high-severity Alzheimer’s
disease. Several researchers have investigated the capacity
of the Hamilton depression scale to differentiate between
patients with anxiety and those with depression. Prusoff
and Klerman (89) suggested the Hamilton depression
scale could indeed separate these constructs, and Maier et
al. (45) demonstrated that the Hamilton depression scale
had a higher correlation with an external measure of depression
than with an external measure of anxiety, but the
saturation of the Hamilton depression scale with anxiety-related
concepts was nonetheless considerable.
Predictive validity. Edwards et al. (90) performed a meta-
analysis of 19 studies with a total of 1,150 patients that
compared the predictive validity of the Hamilton depres-
sion scale and the Beck Depression Inventory. Treatments
included pharmacotherapy, behavior therapy, cognitive
restructuring, dynamic psychotherapy, and various com-
binations. The Hamilton depression scale was found to be
TABLE 6. Studies Reporting Factor Analyses and Principal-Component Analyses of the 17-Item Hamilton Depression Rating Scale a
Study | Year | Number of Factors | Depressed Mood | Guilt | Suicide | Early Insomnia | Middle Insomnia | Late Insomnia | Work/Interests
Addington et al. (10), Time 1 | 1996 | 7 | I | I, V | V | — | II | II, V, VI | I, IV
Addington et al. (10), Time 2 | 1996 | 7 | I, II, VII | II, III, VII | VII | III | III | III, V | II
Akdemir et al. (11) | 2001 | 6 | — | II | II | III | III | III | —
Berrios and Bulbena-Villarasa (16), Sample 1 | 1990 | 4 | I | I, II | I | I | I | I | II
Berrios and Bulbena-Villarasa (16), Sample 2 | 1990 | 4 | I | I, II | I | IV | I, IV | I | I, II, IV
Brown et al. (17) | 1995 | 6 | III | III | III | I | V | V | VI
Daradkeh et al. (21) | 1997 | 5 | II  II, IV  I  I  III [only five of the seven item cells survive; column positions unclear]
Fleck et al. (27) | 1995 | 3 | I | I | — | III | III | III | I
Gibbons et al. (30) | 1993 | 5 | I, IV | I | I | — | II | II | I, IV
Marcos and Salamero (47) | 1990 | 3 | II | — | II | III | III | III | II
O’Brien and Glaudin (53), Sample 1 | 1988 | 6 | I | I, VI | I | — | IV | IV | I, II
O’Brien and Glaudin (53), Sample 2 | 1988 | 8 | III | VII | VI | II | II | II | III
Onega and Abraham (56) | 1997 | 4 | I | I | I | II | II | II | I
Pancheri et al. (57) | 2002 | 4 | III | II | — | I | I | I | III
Ramos-Brieva and Cordero-Villafafila (60) | 1988 | 5 | III | II, III | I, III | I | I | I | III
Smouse et al. (69) | 1981 | 3 | I | I | I, II | I, II | I, II | I, II | I
Steinmeyer and Möller (70), Time 1 | 1992 | 6 | II | V | V | III | III | III | II
Steinmeyer and Möller (70), Time 2 | 1992 | 2 | I, II | II | I, II | II | I | I | I
Zheng et al. (77) | 1988 | 5 | III | IV | III | V | V | V | IV
a Results are from studies published between January 1980 and May 2003 that measured psychometric properties of the Hamilton depression scale. Studies were identified by means of a MEDLINE search for both “depression” and “Hamilton.” Roman numerals indicate the number of the factor on which the item loaded significantly. A factor loading of ≥0.40 was considered statistically significant.