Reliability of the knee examination in osteoarthritis: Effect of standardization
ARTHRITIS & RHEUMATISM, Vol. 50, No. 2, February 2004, pp 458–468. DOI 10.1002/art.20025. © 2004, American College of Rheumatology
Reliability of the Knee Examination in Osteoarthritis
Effect of Standardization
Jolanda Cibere,1 Nicholas Bellamy,2 Anona Thorne,3 John M. Esdaile,4 Kelly J. McGorm,5
Andrew Chalmers,3 Simon Huang,3 Paul Peloso,6 Kam Shojania,4 Joel Singer,7
Hubert Wong,3 and Jacek Kopec4
Objective. To assess the reliability of physical examination of the osteoarthritic (OA) knee by rheumatologists, and to evaluate the benefits of standardization.
Methods. Forty-two physical signs and techniques were evaluated using a 6 × 6 Latin square design. Patients with mild to severe knee OA, based on physical and radiographic signs, were examined in random order prior to and following standardization of techniques. For those signs with dichotomous scales, agreement among the rheumatologists was calculated as the prevalence-adjusted bias-adjusted kappa (PABAK), while for the signs with continuous and ordinal scales, a reliability coefficient (Rc) was calculated using analysis of variance. A PABAK of >0.60 and an Rc of >0.80 were considered to indicate adequate reliability.
Results. Adequate poststandardization reliability was achieved for 30 of 42 physical signs/techniques (71%). The most highly reliable signs identified by physical examination of the OA knee included alignment by goniometer (Rc = 0.99), bony swelling (Rc = 0.97), general passive crepitus (Rc = 0.96), gait by inspection (PABAK = 0.78), effusion bulge sign (Rc = 0.97), quadriceps atrophy (Rc = 0.97), medial tibiofemoral tenderness (Rc = 0.94), lateral tibiofemoral tenderness (Rc = 0.85), patellofemoral tenderness by grind test (Rc = 0.94), and flexion contracture (Rc = 0.95). The standardization process resulted in substantial improvements in reliability for evaluation of a number of physical signs, although for some signs, minimal or no effect of standardization was noted. After standardization, warmth (PABAK = 0.14), medial instability at 30° flexion (PABAK = 0.02), and lateral instability at 30° flexion (PABAK = 0.34) were the only 3 signs that were highly unreliable.
Conclusion. With the exception of physical examinations for instability, a comprehensive knee examination can be performed with adequate reliability. Standardization further improves the reliability for some physical signs and techniques. The application of these findings to future OA studies will contribute to improved outcome assessments in OA.
Supported by grants from the Canadian Institutes of Health Research and the Canadian Arthritis Network. Dr. Cibere’s work was supported by a Canadian Institutes of Health Research Clinician Scientist Award and a Michael Smith Foundation for Health Research Postdoctoral Fellowship Award.
1Jolanda Cibere, MD: Arthritis Research Centre of Canada, Vancouver, British Columbia, Canada; 2Nicholas Bellamy, MD, MSc, MBA: Centre of National Research on Disability and Rehabilitation Medicine, and University of Queensland, Brisbane, Australia; 3Anona Thorne, MSc, Andrew Chalmers, MD, Simon Huang, MD, Hubert Wong, PhD: University of British Columbia, Vancouver, British Columbia, Canada; 4John M. Esdaile, MD, MPH, Kam Shojania, MD, Jacek Kopec, MD, MSc, PhD: University of British Columbia, and Arthritis Research Centre of Canada, Vancouver, British Columbia, Canada; 5Kelly J. McGorm, BN, MPH: Arthritis Research Centre of Canada, Vancouver, British Columbia, Canada (current address: University of Edinburgh, Edinburgh, UK); 6Paul Peloso, MD, MSc: University of Iowa, Iowa City; 7Joel Singer, PhD: University of British Columbia and Centre for Health Evaluation and Outcome Sciences, Vancouver, British Columbia, Canada.
Address correspondence and reprint requests to Jolanda Cibere, MD, Arthritis Research Centre of Canada, 895 West 10th Avenue, Vancouver, British Columbia V5Z 1L7, Canada. E-mail: jcibere@arthritisresearch.ca.
Submitted for publication December 31, 2002; accepted in revised form October 1, 2003.
In clinical practice and in research, the knee examination is a key component in the assessment of patients with osteoarthritis (OA). Recommendations for the physical examination of the OA knee include many different signs and techniques (1–9). In the American College of Rheumatology (ACR) clinical diagnostic criteria for knee OA, assessments for crepitus, bony swelling, bony tenderness, and warmth are required (10). However, the reliability of these and other signs by physical examination has not been evaluated comprehensively. A search of the literature revealed very few studies on the intra- and interrater reliability of the knee examination in OA (11–15) (Table 1). Although the data in Table 1 suggest apparently large between-study differences in reliability, a comparison of these values is not appropriate because they are not all based on the same measure of reliability and because the kappa statistic is very sensitive to differences in bias and prevalence. As a result, it is not clear from these studies which, if any, physical signs can be examined reliably in knee OA. Additional limitations of prior studies are that some OA knee examination techniques have not been evaluated, and the effect of standardization has only been assessed in 1 study (15).
Because kappa is known to be a measure that is sensitive to prevalence and bias, a prevalence-adjusted bias-adjusted kappa (PABAK), described by Byrt et al (16), was used in this study. PABAK measures agreement beyond chance, while taking into account both the prevalence of a positive finding and the bias of each observer for reporting a positive finding. PABAK is thought to be a better estimate for agreement than the standard kappa (16). In addition, PABAK has the advantage that the results can be directly compared between different variables and even between studies, when study populations are similar.
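The sensitivity of kappa to prevalence can be illustrated with a small sketch. The 2 × 2 counts below are hypothetical and are not taken from this study; the function name `two_rater_agreement` is likewise an illustrative choice. The sketch compares Cohen's kappa with PABAK for two raters who agree on 90% of patients, once when positive findings are common and once when they are rare.

```python
def two_rater_agreement(a, b, c, d):
    """Return (po, kappa, pabak) for a 2 x 2 table of two raters.

    a = both raters positive, b = only rater 1 positive,
    c = only rater 2 positive, d = both raters negative.
    """
    n = a + b + c + d
    po = (a + d) / n                      # observed proportion of agreement
    p1 = (a + b) / n                      # proportion rated positive by rater 1
    p2 = (a + c) / n                      # proportion rated positive by rater 2
    pe = p1 * p2 + (1 - p1) * (1 - p2)    # agreement expected by chance
    kappa = (po - pe) / (1 - pe)          # Cohen's kappa (undefined if pe == 1)
    pabak = 2 * po - 1                    # prevalence-adjusted bias-adjusted kappa
    return po, kappa, pabak


# Hypothetical counts: identical observed agreement, different prevalence.
balanced = two_rater_agreement(45, 5, 5, 45)   # positive finding in about half
rare = two_rater_agreement(2, 5, 5, 88)        # positive finding is uncommon

for label, (po, kappa, pabak) in (("balanced", balanced), ("rare", rare)):
    print(f"{label}: po={po:.2f} kappa={kappa:.2f} PABAK={pabak:.2f}")
```

With these made-up counts, kappa falls from about 0.80 to about 0.23 although the raters agree equally often, whereas PABAK stays at 0.80 in both cases.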
In this study, we assessed a wide range of knee physical examination signs and techniques in patients with mild to severe radiographic knee OA. The purpose of this study was 2-fold, in that we sought to determine 1) which signs can be assessed reliably by rheumatologists, and 2) whether standardization can reduce the interobserver variability.
PATIENTS AND METHODS
Patients. The study was approved by the local institutional review board and all participating subjects provided their written informed consent. Six subjects were selected from a database of patients with knee OA. Subjects were eligible if they met predefined criteria. Inclusion criteria were 1) age 40–79 years, 2) knee pain on most days of the month at any time in the past, 3) any knee pain during the previous 12 months, and 4) osteophytes on plain radiography. Exclusion criteria were 1) prior total knee arthroplasty, 2) knee surgery within the previous 4 months, 3) fibromyalgia or inflammatory arthritis, 4) knee pain derived from the hips or back, and 5) history of acute injury to the knee within the previous 6 months.

Table 1. Summary of published findings on the interobserver reliability of osteoarthritis knee examination signs

| Knee examination sign | Cushnaghan et al (11)* | Hart et al (12)* | Jones et al (13)* | Hauzeur et al (14)† | Bellamy et al (15)‡, prestandardization | Bellamy et al (15)‡, poststandardization |
|---|---|---|---|---|---|---|
| Alignment | – | – | – | – | – | – |
| Bony swelling | 0.55 | 0.10 | – | – | 0.65 | 0.69 |
| Crepitus | | | | | | |
| General crepitus | – | 0.14 | 0.23 | – | – | – |
| Tibiofemoral crepitus | 0.64 | – | 0.09 | – | – | – |
| Patellofemoral crepitus | 0.24 | – | 0.10 | – | – | – |
| Gait | – | – | – | – | – | – |
| Inflammation | | | | | | |
| Nonbony swelling | 0.28 | 0.25 | 0.13 | – | – | – |
| Synovial fluid | – | – | 0.22 | 0.45 | – | – |
| Popliteal cyst | – | – | 0.21 | – | – | – |
| Warmth | – | – | 0.23 | – | – | – |
| Instability | | | | | | |
| Mediolateral instability | 0.23 | – | – | – | – | – |
| Anteroposterior instability | 0.00 | – | – | – | – | – |
| Muscle strength | – | – | – | – | – | – |
| Tenderness/pain | | | | | | |
| General tibiofemoral | – | 0.74 | – | – | – | – |
| Medial tibiofemoral | 0.40 | – | 0.48 | – | – | – |
| Lateral tibiofemoral | 0.43 | – | 0.44 | – | – | – |
| Patellofemoral | 0.35 | – | 0.33 | – | – | – |
| Periarticular | – | – | 0.35 | – | – | – |
| Pain on movement | – | 0.85 | – | – | – | – |
| Range of motion | – | – | – | – | 0.58–0.94 | 0.89–0.95 |

* Reliability assessed by kappa.
† Reliability assessed by weighted kappa.
‡ Reliability assessed by reliability coefficient.
Patients were selected by an independent rheumatologist who was not involved in the standardization process (KS). The primary criterion for patient selection was the presence of a full range of physical examination signs, including a range of severity of different physical signs. A secondary criterion was the selection of patients on the basis of radiographic severity of OA as assessed by the Kellgren/Lawrence scale (17). Patients completed the Western Ontario and McMaster Universities OA (WOMAC) Index, version VA3.1 (18), but they were not selected on the basis of symptom severity as reported on the WOMAC. Rheumatologists who participated in the standardization process were blinded to these patient selection criteria.
Design. The study involved 6 rheumatologists who were experienced in the conduct of knee OA research studies, and a biostatistician. The selection of physical examination signs was based on the frequency with which they are recorded in rheumatology clinical practice or research and their potential to be useful in the evaluation of knee pain (1–6,9,10). If several techniques were available for the evaluation of a sign, these were included and specified. The framework used to evaluate all physical signs was based on 9 domains, comprising alignment, bony swelling, crepitus, gait, inflammation, instability, muscle strength, tenderness/pain, and range of motion. The final selection of physical signs/techniques was determined by group consensus (Figure 1). A pilot knee examination was conducted by 1 of the investigators (JC) to estimate the time required for the evaluation and to remedy any difficulties with the data collection forms or their scoring.
The standardization study was completed over 2 days. On the first day, a prestudy briefing was conducted. The rheumatologists were familiarized with the instructions, equipment, and scoring sheets for the 2-day program. The study manual contained a list of physical examinations with a brief description of patient position, examination method, and scoring. Basic misconceptions were clarified, but no detailed discussion of technique was entertained and no attempt to standardize was undertaken. Rheumatologists were encouraged to perform the examinations in accordance with their usual practice. Equipment for examinations included a 30.5-cm goniometer and a 150-cm tape measure, each identical in brand and specification.
Physical examinations. Prestandardization examinations. Following the briefing session, the rheumatologists examined each patient according to a 6 × 6 Latin square design (19). Examinations were conducted independently, in separate rooms, at a single site. The patient order and the order of physical examinations were randomized for each rheumatologist. For physician convenience and patient comfort, physical examinations were randomized by position of test (standing, sitting, lying) and then type of test. Patients wore shorts, such that there was adequate bilateral exposure of knees and thighs. Following each examination, all data forms were checked for completeness by the study coordinator, and the results were entered immediately for analysis by the biostatistician. The schedule included a 5-minute break for patients and rheumatologists between each examination and a 20-minute break after the first 3 examinations. All participants (patients and rheumatologists) were instructed not to discuss any examinations or techniques during the breaks. On completion of the prestandardization examinations, patients were asked to complete a 1-page questionnaire to document any perceived differences in examination technique between rheumatologists. Although analgesic medications were allowed in the study, none of the patients had taken any on the day of the examinations.
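A 6 × 6 Latin square schedule of this kind can be sketched as follows. This is only an illustration of the design, not the randomization procedure actually used in the study, which is not described in detail; the function name `random_latin_square` and the seed are arbitrary. The sketch starts from a cyclic square and shuffles its rows, columns, and symbols, so that every rheumatologist sees every patient once and every patient occupies each examination slot once.

```python
import random


def random_latin_square(n, rng):
    """Return an n x n Latin square as a list of rows.

    Built by shuffling the rows, columns, and symbols of the cyclic square.
    Note: this does not sample uniformly from all possible Latin squares.
    """
    rows = list(range(n))
    cols = list(range(n))
    symbols = list(range(n))
    rng.shuffle(rows)
    rng.shuffle(cols)
    rng.shuffle(symbols)
    return [[symbols[(r + c) % n] for c in cols] for r in rows]


rng = random.Random(2004)  # arbitrary seed, for a reproducible illustration
square = random_latin_square(6, rng)

# square[i][j] = index of the patient examined by rheumatologist i in slot j
for i, row in enumerate(square):
    print(f"rheumatologist {i + 1}: patients {[p + 1 for p in row]}")
```

A second call with a different seed would give a different schedule, analogous to the separate randomization used for the second day of examinations.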
Standardization. Following the prestandardization examinations, rheumatologists convened for the standardization meeting, chaired by one of us (NB). The standardization process was based on 4 elements: 1) patient responses to the questionnaires on perceived differences in examinations; 2) graphic display of data and identification of outliers by the biostatistician; 3) discussion of examination techniques to elucidate reasons for variability; and 4) demonstration of physical signs on a healthy volunteer. Although the patient questionnaires provided some insight into differences in rheumatologists’ examinations, these were ultimately not helpful to identify outliers or explain variability. For each physical examination item, a graphic chart of the variability of findings was displayed by the biostatistician, followed by a discussion of the examination technique to elucidate the reasons for the variability. Areas of disagreement were resolved by consensus. The main discussion points were the physical examination techniques, measurement landmarks, and the scoring scale. The standardized technique was recorded in the study manual by each rheumatologist. Immediately following the standardization meeting, the poststandardization data collection forms were updated to incorporate all changes.
Poststandardization examinations. The following day, patients and rheumatologists returned for the poststandardization examinations. These were performed using a 6 × 6 Latin square design (19) with a different randomization schedule from that at prestandardization. The procedures for breaks, data checking, and patient feedback at the end of the examinations were the same.
Statistical analysis. The statistical analysis was conducted by the biostatistician using the S-Plus statistical program (20). Interobserver agreement with regard to signs with dichotomous scales was calculated as the prevalence-adjusted bias-adjusted kappa (PABAK), which is calculated as $2p_o - 1$, where $p_o$ is the observed proportion of agreement (16); this observed proportion of agreement ($p_o$) was obtained by calculating the mean of observed proportions of agreement for all possible pairs of rheumatologists. Although the PABAK is adjusted for prevalence and bias, it still needs to be examined in conjunction with prevalence and bias. A high level of bias would need to be investigated to determine its cause. In a situation of low prevalence, insufficient information is available for a precise assessment of agreement, and PABAK values may be relatively uninformative. For the interpretation of PABAKs, we adopted the standard kappa descriptive scale by Landis and Koch (21), which, although somewhat arbitrary, is widely used: <0.00 = poor agreement, 0.00–0.20 = slight agreement, 0.21–0.40 = fair agreement, 0.41–0.60 = moderate agreement, 0.61–0.80 = substantial agreement, and 0.81–1.00 = almost perfect agreement. A PABAK of >0.60 was chosen a priori to indicate adequate reliability. In view of the fact that the PABAK represents an improvement on Cohen’s kappa, the consensus among the study rheumatologists and the biostatistician was that it was reasonable to use the same descriptive scale. Similarly, the choice of the cutoff value was based on a decision by consensus of all study rheumatologists and the biostatistician.
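The PABAK calculation described above can be sketched in a few lines. The ratings below are hypothetical (6 raters scoring 6 patients as 0 or 1) and the function names are illustrative; the original analysis was carried out in S-Plus, and this is not that code. The sketch averages the observed agreement over all pairs of raters, applies 2po − 1, and attaches the Landis and Koch label.

```python
from itertools import combinations


def mean_pairwise_agreement(ratings):
    """ratings[r][p] is rater r's dichotomous score (0 or 1) for patient p."""
    n_patients = len(ratings[0])
    pairs = list(combinations(range(len(ratings)), 2))
    total = 0.0
    for i, j in pairs:
        agree = sum(1 for p in range(n_patients) if ratings[i][p] == ratings[j][p])
        total += agree / n_patients       # observed agreement for this pair
    return total / len(pairs)             # mean over all pairs of raters


def pabak(ratings):
    """Prevalence-adjusted bias-adjusted kappa: 2 * po - 1."""
    return 2 * mean_pairwise_agreement(ratings) - 1


def landis_koch(value):
    """Descriptive label on the Landis and Koch scale quoted in the text."""
    if value < 0.0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")):
        if value <= upper:
            return label
    return "almost perfect"


# Hypothetical data: five raters agree completely, the sixth differs on one patient.
ratings = [[1, 1, 0, 0, 0, 1] for _ in range(5)] + [[1, 1, 0, 0, 1, 1]]
value = pabak(ratings)
adequate = value > 0.60                    # a priori cutoff used in the study
print(f"PABAK={value:.2f} ({landis_koch(value)}), adequate={adequate}")
```

Because each pair contributes equally, a single discordant rater lowers the agreement of 5 of the 15 pairs, which is why one disagreement already moves PABAK away from 1.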
Interobserver agreement with regard to signs with continuous and ordinal scales was assessed using analysis of variance (ANOVA). A number of different forms of the intraclass correlation coefficient may be used, depending on the study question and structure (22). Since the purpose of this study was to examine only the variation due to the differences between doctors, and because this study attempted to reduce this variation, it was thought that the most appropriate reliability coefficient (Rc) would be calculated as $1 - \mathrm{variance}_{\mathrm{doctor}}$, where $\mathrm{variance}_{\mathrm{doctor}}$ is the proportion of total variance attributed to doctors (15). By consensus, a reliability coefficient of >0.80 was chosen a priori to indicate adequate reliability.
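One way to obtain such a coefficient from a Latin square is sketched below. The text does not state how the proportions of variance were estimated, so this sketch makes an explicit assumption: each source's share is taken as its fraction of the total sum of squares in the standard Latin square decomposition (doctor, order, patient, error). The scores, the 3 × 3 layout, and all function names are hypothetical and chosen only to keep the example short.

```python
def latin_square_variance_shares(scores, patients):
    """Split the total sum of squares of a Latin square into four sources.

    scores[d][o]   = score recorded by doctor d in examination slot o
    patients[d][o] = index of the patient seen by doctor d in slot o
    Assumption: "proportion of variance" is approximated by the share of the
    total sum of squares; the study may have used a different estimator.
    """
    n = len(scores)
    grand = sum(sum(row) for row in scores) / (n * n)
    doctor_means = [sum(row) / n for row in scores]
    order_means = [sum(scores[d][o] for d in range(n)) / n for o in range(n)]
    patient_sums = [0.0] * n
    for d in range(n):
        for o in range(n):
            patient_sums[patients[d][o]] += scores[d][o]
    patient_means = [s / n for s in patient_sums]

    ss_total = sum((scores[d][o] - grand) ** 2 for d in range(n) for o in range(n))
    ss = {
        "doctor": n * sum((m - grand) ** 2 for m in doctor_means),
        "order": n * sum((m - grand) ** 2 for m in order_means),
        "patient": n * sum((m - grand) ** 2 for m in patient_means),
    }
    ss["error"] = ss_total - sum(ss.values())
    return {source: value / ss_total for source, value in ss.items()}


def reliability_coefficient(shares):
    """Rc = 1 - proportion of total variance attributed to doctors."""
    return 1 - shares["doctor"]


# Hypothetical 3 x 3 example: patients differ widely, one doctor reads 3 units high.
patients = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
true_value = [10, 20, 30]
doctor_offset = [0, 0, 3]
scores = [[true_value[patients[d][o]] + doctor_offset[d] for o in range(3)]
          for d in range(3)]

shares = latin_square_variance_shares(scores, patients)
rc = reliability_coefficient(shares)
print({k: round(v, 3) for k, v in shares.items()}, "Rc =", round(rc, 3))
```

In this made-up example almost all variation comes from the patients, so the doctor share is small and Rc exceeds the 0.80 cutoff; a larger systematic offset between doctors would lower it.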
For the 7 signs in which it was necessary to change from an ordinal to a dichotomous scale after standardization, reliability coefficients were calculated for both the pre- and poststandardization data. Although the change in scale means that differences between pre- and poststandardization values must be interpreted with caution, it was believed that it would be more appropriate to compare 2 reliability coefficients than a reliability coefficient and a PABAK.
Figure 1. Assessment of poststandardization reliability for 42 physical examination signs evaluated in patients with knee osteoarthritis. Thick vertical bars indicate the cutoff values for an acceptable prevalence-adjusted bias-adjusted kappa (PABAK) (cutoff >0.60) and reliability coefficient (Rc) (cutoff >0.80). For a description of the grading system of reliability (grades A–D), see Patients and Methods. The effusion–balloon test was evaluated poststandardization only.
For all physical signs/techniques, an A–D grading system was applied as follows: grade A = reliability adequate on both pre- and poststandardization examinations (i.e., PABAK >0.60 or Rc >0.80); grade B = reliability adequate only on poststandardization examination; grade C = reliability adequate on pre- but not poststandardization examination; and grade D = reliability inadequate on both pre- and poststandardization examinations. Since the poststandardization reliability is of primary interest, a higher grade (A or B) indicates greater reliability and usefulness of the physical sign, with grade A being the most desirable since adequate reliability is present without standardization.
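The A–D grading rule described in Patients and Methods can be written as a short function. The function name and constants are illustrative; the example values are the pre- and poststandardization figures reported in Tables 2 and 3 for four of the signs.

```python
PABAK_CUTOFF = 0.60   # adequate if PABAK > 0.60
RC_CUTOFF = 0.80      # adequate if Rc > 0.80


def reliability_grade(pre, post, cutoff):
    """Grade a sign from its pre- and poststandardization reliability."""
    pre_ok = pre > cutoff
    post_ok = post > cutoff
    if pre_ok and post_ok:
        return "A"    # adequate before and after standardization
    if post_ok:
        return "B"    # adequate only after standardization
    if pre_ok:
        return "C"    # adequate before but not after standardization
    return "D"        # inadequate on both occasions


examples = {
    "gait by inspection (PABAK)": (0.52, 0.78, PABAK_CUTOFF),
    "anterior drawer test (PABAK)": (0.60, 0.54, PABAK_CUTOFF),
    "alignment by goniometer (Rc)": (0.84, 0.99, RC_CUTOFF),
    "medial tibiofemoral crepitus, active (Rc)": (0.83, 0.78, RC_CUTOFF),
}
grades = {sign: reliability_grade(*values) for sign, values in examples.items()}
for sign, grade in grades.items():
    print(f"{sign}: grade {grade}")
```

Note that the cutoffs are strict inequalities, so a prestandardization PABAK of exactly 0.60, as for the anterior drawer test, does not count as adequate.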
RESULTS
Patient characteristics. Three male and 3 female patients participated in the study. Their median age was 62 years (range 44–74 years), median duration of knee pain was 8 years (range 3–20 years), median WOMAC score for pain on walking was 4 mm (range 0–25 mm), and median body mass index was 23.9 kg/m² (range 22.4–26.7). Three of the patients underwent examination of the right knee, and 3 underwent examination of their left knee. The radiographic severity of OA was a Kellgren/Lawrence grade 2, grade 3, and grade 4 in 2 patients each.
Pre- and poststandardization agreement. The standardization meeting resulted in changes to the scoring in some of the examination domain items. All 6 items evaluating pain or tenderness were changed from a 4-point scale (none, mild, winces, withdraws) to a 2-point scale (absent, present), mainly because the 2 extreme descriptions of “winces” and “withdraws” were rarely elicited. Similarly, for the effusion bulge sign, the 4-point scale (none, mild, moderate, severe) was replaced by a 2-point scale (absent, present) in the poststandardization examinations. Scoring of bony swelling was changed from a 3-point scale (none, mild, severe) to a 4-point scale (none, mild, moderate, severe). The 3-point scale for the 11 crepitus signs was not changed, although the scale description was modified from “none, mild, severe” to “none, fine, coarse,” which was considered to more accurately reflect the nature of the abnormalities being detected. The scoring and description of all other signs/techniques remained the same. In addition, it was decided to score any marginal findings as normal or absent.
Figure 2. Effect of standardization of examination for 41 physical signs/techniques. Improvement and worsening of reliability as a result of standardization are indicated by symbols positioned either above or below the diagonal line, respectively. The degree of improvement or worsening is reflected by the vertical distance from the diagonal. Those signs/techniques assessed for reliability by prevalence-adjusted bias-adjusted kappas (PABAK) are denoted by lower case letters; the PABAK was considered acceptable if >0.60. Those signs/techniques assessed with reliability coefficients are denoted by upper case letters and numbers; the reliability coefficient was considered acceptable if >0.80. The effusion–balloon test was evaluated poststandardization only. A = active; P = passive; TF = tibiofemoral; PS = passive with stress; PF = patellofemoral.
The results for all physical examinations/signs within each domain are summarized in Figure 1, which shows the poststandardization reliability, and Figure 2, which shows the effect of standardization. Overall, adequate reliability was achieved for 30 (71%) of 42 physical signs/techniques, with a grade A or B following standardization. Although most of the other signs/techniques were close to the cutoff value for reliability, a few, including warmth, medial instability at 30° flexion, and lateral instability at 30° flexion, were well below the cutoff and thus were highly unreliable even after standardization (Figure 1). The process of standardization resulted in substantial improvements in reliability for many of the physical examination signs, although for some, standardization had minimal or no effect (Figure 2).
Detailed results on the 12 signs with dichotomous scales are shown in Table 2, which lists the PABAK values, prevalence, and bias for both pre- and poststandardization as well as the letter grade (A–D). One dichotomous item (effusion by balloon test) was added during the standardization meeting, and therefore only poststandardization data are reported for this item. Table 3 shows detailed results on the 29 signs with continuous or ordinal scales, listing the percentage variance due to patient, doctor, order, and error, and the reliability coefficients for both pre- and poststandardization as well as the letter grade (A–D).
Results by domain. Alignment. With regard to assessment of alignment, 3 of the 4 physical signs/techniques (inspection, goniometer, and intercondylar distance) were of adequate reliability (all grade A), whereas intermalleolar distance was not reliable even following standardization (grade D) (Table 3 and Figures 1 and 2). Alignment measured by goniometer was most reliable (Rc = 0.99). However, simple inspection for varus, valgus, or normal alignment also achieved a reliability coefficient of 0.94 poststandardization; because of its simplicity, this technique could also be useful in future research.
Bony swelling. The assessment for bony swelling, which is a key component of the ACR clinical diagnostic criteria for OA (10), was found to be highly reliable (grade A) in contrast to that observed in previous studies (11,12,15). Even with a change from a 3-point to a 4-point scale, reliability further improved after standardization, achieving a reliability coefficient of 0.97 (Table 3 and Figures 1 and 2).
Crepitus. Crepitus was assessed as general and compartment-specific crepitus, and was assessed using active, passive, and passive with stress movement. The latter evaluation has been suggested to correlate with early arthroscopic findings of cartilage damage (9). The findings for reliability were inconsistent. Although general crepitus was reliably assessed using passive movement (Rc = 0.96), assessment of general crepitus with active movement did not achieve adequate reliability (Rc = 0.67). For compartment-specific crepitus, adequate reliability was present for the lateral compartment only on passive movement (Rc = 0.91), and for the medial and patellofemoral compartments, adequate reliability was achieved only on passive with stress movement (Rc = 0.94 and Rc = 0.87, respectively). All other crepitus evaluations achieved a grade C or D, although it should be noted that most of the poststandardization reliability coefficients were close to the cutoff value of 0.80 (Table 3 and Figures 1 and 2).

Table 2. Pre- and poststandardization prevalence-adjusted bias-adjusted kappa (PABAK) for dichotomous physical examination signs

| Physical sign/technique | Scale | Prestandardization PABAK | Prestandardization prevalence | Prestandardization bias | Poststandardization PABAK | Poststandardization prevalence | Poststandardization bias | Grade* |
|---|---|---|---|---|---|---|---|---|
| Gait | | | | | | | | |
| Inspection | Normal/abnormal | 0.52 | 0.33 | 0.24 | 0.78 | 0.50 | 0.11 | B |
| Inflammation | | | | | | | | |
| Effusion–balloon test† | Present/absent | – | – | – | 0.88 | 0.19 | 0.06 | A/B |
| Effusion–patellar tap | Present/absent | 0.88 | 0.19 | 0.06 | 0.78 | 0.17 | 0.11 | A |
| Popliteal cyst | Present/absent | 0.78 | 0.06 | 0.09 | 0.66 | 0.08 | 0.14 | A |
| Warmth | Present/absent | 0.24 | 0.28 | 0.24 | 0.14 | 0.31 | 0.30 | D |
| Instability | | | | | | | | |
| Lateral, 0° flexion | Normal/abnormal | 0.56 | 0.11 | 0.20 | 0.88 | 0.03 | 0.06 | B |
| Lateral, 30° flexion | Normal/abnormal | 0.08 | 0.42 | 0.21 | 0.34 | 0.28 | 0.27 | D |
| Medial, 0° flexion | Normal/abnormal | 0.48 | 0.14 | 0.14 | 0.66 | 0.08 | 0.14 | B |
| Medial, 30° flexion | Normal/abnormal | 0.02 | 0.39 | 0.31 | 0.02 | 0.50 | 0.18 | D |
| Anterior, drawer test | Normal/abnormal | 0.60 | 0.28 | 0.16 | 0.54 | 0.19 | 0.17 | D |
| Posterior, drawer test | Normal/abnormal | 0.82 | 0.11 | 0.09 | 0.82 | 0.11 | 0.09 | A |
| Posterior, sag | Normal/abnormal | 0.82 | 0.06 | 0.09 | 0.78 | 0.06 | 0.11 | A |
| Range of motion | | | | | | | | |
| Hyperextension | Normal/abnormal | −0.02 | 0.33 | 0.44 | 0.88 | 0.03 | 0.06 | B |

* For a description of the grading system of reliability (grades A–D), see Patients and Methods.
† Variable only evaluated poststandardization.

Table 3. Pre- and poststandardization components of variance and reliability coefficients for continuous and ordinal physical examination signs

| Physical sign/technique | Scale (poststandardization) | % of variance, prestandardization: patient | % of variance, prestandardization: doctor | % of variance, prestandardization: order | % of variance, prestandardization: error | % of variance, poststandardization: patient | % of variance, poststandardization: doctor | % of variance, poststandardization: order | % of variance, poststandardization: error | Reliability coefficient, prestandardization | Reliability coefficient, poststandardization | Grade* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alignment | | | | | | | | | | | | |
| Inspection | Normal, varus, valgus | 0.79 | 0.04 | 0.02 | 0.15 | 0.52 | 0.06 | 0.04 | 0.38 | 0.96 | 0.94 | A |
| Goniometer | Degrees | 0.21 | 0.16 | 0.06 | 0.57 | 0.42 | 0.01 | 0.08 | 0.49 | 0.84 | 0.99 | A |
| Intercondylar distance | Centimeters | 0.86 | 0.06 | 0.01 | 0.06 | 0.87 | 0.06 | 0.02 | 0.05 | 0.94 | 0.94 | A |
| Intermalleolar distance | Centimeters | 0.45 | 0.22 | 0.03 | 0.30 | 0.59 | 0.25 | 0.03 | 0.13 | 0.78 | 0.75 | D |
| Bony swelling | | | | | | | | | | | | |
| Palpation | None, mild, moderate, severe† | 0.54 | 0.09 | 0.10 | 0.27 | 0.61 | 0.03 | 0.09 | 0.27 | 0.91 | 0.97 | A |
| Crepitus | | | | | | | | | | | | |
| General, active | None, fine, coarse | 0.54 | 0.22 | 0.05 | 0.18 | 0.37 | 0.33 | 0.10 | 0.19 | 0.78 | 0.67 | D |
| General, passive | None, fine, coarse | 0.56 | 0.14 | 0.04 | 0.26 | 0.38 | 0.04 | 0.15 | 0.43 | 0.86 | 0.96 | A |
| Lateral tibiofemoral, active | None, fine, coarse | 0.15 | 0.24 | 0.15 | 0.45 | 0.37 | 0.24 | 0.08 | 0.31 | 0.76 | 0.76 | D |
| Lateral tibiofemoral, passive | None, fine, coarse | 0.18 | 0.18 | 0.08 | 0.57 | 0.28 | 0.09 | 0.14 | 0.49 | 0.82 | 0.91 | A |
| Lateral tibiofemoral, passive with stress | None, fine, coarse | 0.15 | 0.24 | 0.19 | 0.42 | 0.28 | 0.33 | 0.04 | 0.35 | 0.76 | 0.67 | D |
| Medial tibiofemoral, active | None, fine, coarse | 0.41 | 0.17 | 0.07 | 0.34 | 0.49 | 0.22 | 0.06 | 0.22 | 0.83 | 0.78 | C |
| Medial tibiofemoral, passive | None, fine, coarse | 0.35 | 0.25 | 0.07 | 0.33 | 0.34 | 0.22 | 0.06 | 0.39 | 0.75 | 0.78 | D |
| Medial tibiofemoral, passive with stress | None, fine, coarse | 0.20 | 0.24 | 0.13 | 0.42 | 0.32 | 0.06 | 0.15 | 0.47 | 0.76 | 0.94 | B |
| Patellofemoral, active | None, fine, coarse | 0.57 | 0.23 | 0.01 | 0.19 | 0.18 | 0.27 | 0.21 | 0.35 | 0.77 | 0.73 | D |
| Patellofemoral, passive | None, fine, coarse | 0.69 | 0.08 | 0.08 | 0.14 | 0.18 | 0.23 | 0.13 | 0.46 | 0.92 | 0.77 | C |
| Patellofemoral, passive with stress | None, fine, coarse | 0.18 | 0.10 | 0.07 | 0.65 | 0.21 | 0.13 | 0.13 | 0.54 | 0.90 | 0.87 | A |
| Inflammation | | | | | | | | | | | | |
| Effusion–bulge sign | Present, absent‡ | 0.32 | 0.18 | 0.03 | 0.47 | 0.26 | 0.03 | 0.03 | 0.67 | 0.82 | 0.97 | A |
| Muscle strength | | | | | | | | | | | | |
| Hamstring strength | Poor, moderate, full | 0.12 | 0.12 | 0.12 | 0.65 | 0.14 | 0.14 | 0.14 | 0.57 | 0.88 | 0.86 | A |
| Quadriceps strength | Poor, moderate, full | 0.14 | 0.14 | 0.14 | 0.57 | 0.14 | 0.14 | 0.14 | 0.57 | 0.86 | 0.86 | A |
| Extension lag | None, mild, moderate, severe | 0.53 | 0.24 | 0.05 | 0.18 | 0.12 | 0.12 | 0.12 | 0.63 | 0.76 | 0.88 | B |
| Quadriceps atrophy | None, mild, severe | 0.62 | 0.14 | 0.02 | 0.21 | 0.68 | 0.03 | 0.03 | 0.27 | 0.86 | 0.97 | A |
| Tenderness/pain | | | | | | | | | | | | |
| Lateral tibiofemoral tenderness | Present, absent‡ | 0.50 | 0.15 | 0.04 | 0.31 | 0.50 | 0.15 | 0.02 | 0.33 | 0.85 | 0.85 | A |
| Medial tibiofemoral tenderness | Present, absent‡ | 0.33 | 0.17 | 0.06 | 0.44 | 0.44 | 0.06 | 0.14 | 0.36 | 0.83 | 0.94 | A |
| Patellofemoral tenderness by grind test | Present, absent‡ | 0.70 | 0.07 | 0.07 | 0.17 | 0.67 | 0.06 | 0.02 | 0.25 | 0.93 | 0.94 | A |
| Anserine bursa tenderness | Present, absent‡ | 0.31 | 0.26 | 0.10 | 0.33 | 0.45 | 0.10 | 0.03 | 0.42 | 0.74 | 0.90 | B |
| Patellar tendon tenderness | Present, absent‡ | 0.34 | 0.16 | 0.16 | 0.34 | 0.67 | 0.07 | 0.07 | 0.20 | 0.84 | 0.93 | A |
| End-of-range stress pain | Present, absent‡ | 0.34 | 0.11 | 0.06 | 0.48 | 0.40 | 0.13 | 0.07 | 0.40 | 0.89 | 0.87 | A |
| Range of motion | | | | | | | | | | | | |
| Flexion range of motion | Degrees | 0.55 | 0.04 | 0.13 | 0.28 | 0.63 | 0.15 | 0.01 | 0.21 | 0.96 | 0.85 | A |
| Flexion contracture | Degrees | 0.58 | 0.19 | 0.07 | 0.16 | 0.82 | 0.05 | 0.05 | 0.08 | 0.81 | 0.95 | A |

* For a description of the grading system of reliability (grades A–D), see Patients and Methods.
† Prestandardization scale = none, mild, severe.
‡ Prestandardization scale = none, mild, moderate, severe.
Gait. Gait was assessed by simple inspection and was found to be reliable, with a poststandardization PABAK of 0.78. However, standardization was required to achieve adequate reliability (grade B) (Table 2 and Figures 1 and 2).
Inflammation. Signs of inflammation included joint effusion, popliteal cyst, and warmth. Effusion was assessed by bulge sign, balloon test, and patellar tap. Of these, the bulge sign was most reliable (Rc = 0.97) (Table 3). However, assessment of effusion by balloon test also achieved a poststandardization PABAK of 0.88, although it is uncertain whether standardization was required for such an achievement, since this sign was only assessed poststandardization (Table 2). The examination for popliteal cyst was also reliable (grade A). However, the PABAK for assessment of popliteal cyst unexpectedly decreased from 0.78 on prestandardization to 0.66 on poststandardization (Table 2). Agreement on assessment of warmth was low and remained low despite standardization, with a poststandardization PABAK of 0.14 (grade D) (Table 2 and Figures 1 and 2).
Instability. Lateral and medial instability were assessed reliably at 0° of flexion after standardization (both grade B), although the prevalence of a positive finding for these 2 items was very low, and therefore these results must be interpreted with caution. In contrast, at 30° of flexion, agreement on assessment of both lateral and medial instability was poor (both grade D), particularly for medial instability, which achieved a poststandardization PABAK of only 0.02. It is also of interest that the bias for both of these signs was high on both pre- and poststandardization examinations, suggesting that, overall, the rheumatologists’ bias for finding instability was not altered by the standardization process (Table 2). Assessment of posterior instability by posterior drawer test and posterior sag was reliable (grade A), but also had a low prevalence, whereas anterior instability assessment by anterior drawer test was found to be unreliable (grade D) (Table 2).
Muscle strength. Good agreement was achieved for all assessments of muscle strength, with reliability coefficients of 0.86, 0.86, 0.88, and 0.97 for quadriceps strength, hamstring strength, extension lag, and quadriceps atrophy, respectively. All signs achieved grade A, except for extension lag, which achieved grade B (Table 3 and Figure 1).
Tenderness/pain. Good reliability was achieved for assessment of all signs of articular and periarticular tenderness/pain (grade A or B). Although the change in scoring from a 4-point to a 2-point scale may have improved the agreement among rheumatologists, good reliability was already present prior to standardization for most tenderness/pain signs (Table 3 and Figures 1 and 2).
Range of motion. Assessment of range of motion of the knee was subdivided into flexion, flexion contracture, and hyperextension. All 3 assessments were of adequate reliability (grade A, A, and B, respectively) (Tables 2 and 3). The low PABAK and high bias for the examination of hyperextension before standardization was related to a difficulty in the interpretation of the scale, which was coded as absent (0) or present (1). Some rheumatologists interpreted the absence of hyperextension as abnormal, and therefore coded their finding as 1 instead of 0. Poststandardization, this coding error was eliminated by changing the scale to normal/abnormal. This resulted in an improved PABAK of 0.88 and a much lower bias of 0.06. However, it should also be noted that the poststandardization prevalence was very low, such that this PABAK needs to be interpreted with caution (Table 2 and Figures 1 and 2).
DISCUSSION
The principal elements of the knee examination include evaluations for alignment, bony swelling, crepitus, gait, inflammation, instability, muscle strength, tenderness/pain, and range of motion. The availability of reliable physical examination signs from within each of these domains is crucial for the ability to assess the knee joint comprehensively in future outcome studies of OA.
With regard to examinations for crepitus, our study findings were variable. Compartment-specific crepitus was not assessed with consistent reliability by using active or passive movement. Passive movement with stress was reliable only for the medial tibiofemoral and the patellofemoral compartments and may be more difficult to implement in clinical research, since it is not usual practice, is not easily performed, and is more time consuming. Therefore, given that passive crepitus is generally assessed in clinical practice, and since the reliability coefficients were either acceptable or close to the cutoff of 0.80, this technique would seem most feasible for use in future studies, if compartment-specific crepitus is of importance. For general (non–compartment-specific) crepitus, the assessment was most reliable using passive movement.
With regard to alignment, inflammation, and muscle strength, several interchangeable signs were evaluated in this study, with at least 1, and frequently more than 1, physical sign achieving good reliability. This allows for selection of appropriate physical examination signs in future studies on the basis of not only reliability, but also suitability and preference. On the other hand, for some domains, such as instability, tenderness/pain, and range of motion, the individual physical signs evaluated represent different dimensions of these domains and are therefore not interchangeable. With regard to instability, we were only able to reliably assess posterior instability. The reliability of medial and lateral instability, which depends on adequate assessments at both 0° and 30° of flexion, was not established. As a result, the assessment of instability as a whole was found to be unreliable and may need to be investigated in further studies. In contrast, the assessments for tenderness/pain and range of motion were found to be reliable, since all physical signs within those groups were deemed to be reliable.
Overall, this study showed that the majority of physical signs can be assessed reliably. Furthermore, the effect of standardization was to improve reliability for most of the signs/techniques. However, for some physical examinations, there was a decrease in reliability following standardization. For most of these, a decrease of less than 0.05 in the reliability coefficient or PABAK was seen. Such small changes, in either direction, may not be clinically important and are likely due to random error resulting from the dynamic interactions which occur within and between subjects as well as within and between assessors. For a few signs, a greater decrease in the reliability coefficient or PABAK was observed. This is likely due to the fact that not all physical measures are equally responsive to a simple standardization procedure, but require more intense or repeated training. In particular, a conflict between what assessors normally do compared with what they are obligated to do on the basis of imposed study requirements can likely influence the reliability. In this study, the standardization meeting included demonstration of physical technique for the purpose of reaching an agreement on standardization, but no extensive assessor training was undertaken, and therefore this may have adversely affected the reliability of some physical examination findings. This possibility needs to be considered in future studies.
Only 3 signs were clearly unreliable for the physical examination of knee OA and were not remedied by standardization. These were warmth, lateral instability at 30° flexion, and medial instability at 30° flexion. It is not surprising that the latter 2 were unreliable, since some mediolateral movement is invariably present at 30° of flexion and the decision of what degree of movement constitutes instability is subjective and difficult to standardize. It is interesting to note that standardization did achieve substantial improvement in agreement for lateral instability at 30° flexion, but not for medial instability at 30° flexion. This discrepancy is difficult to explain. It is possible that mediolateral instability represents a single inseparable sign, which may achieve adequate reliability, if examined as such. This will need to be explored in future studies.
Similar to instability, we found poor reliability for the assessment of warmth and were unable to standardize for it. Because a finding of warmth could be affected by repeated joint examinations, the order of examinations and whether later examinations were associated with findings of warmth was evaluated. However, no effect due to order was found, and therefore it is likely that the poor reliability of warmth was due to the highly subjective nature of its assessment.
Table 4. Summary of poststandardization values for the most reliable physical examination techniques in each domain

Domain            Physical examination sign                  Reliability
Alignment         Alignment by goniometer                    0.99*
Bony swelling     Palpation                                  0.97*
Crepitus          General passive crepitus                   0.96*
Gait              Inspection                                 0.78†
Inflammation      Effusion bulge sign                        0.97*
Instability       –                                          Unreliable
Muscle strength   Quadriceps atrophy                         0.97*
Tenderness/pain   Medial tibiofemoral tenderness             0.94*
Tenderness/pain   Lateral tibiofemoral tenderness            0.85*
Tenderness/pain   Patellofemoral tenderness by grind test    0.94*
Range of motion   Flexion contracture                        0.95*

* By reliability coefficient.
† By prevalence-adjusted bias-adjusted kappa.

The interpretation of our results also requires an understanding of the inherent limitation of dichotomizing a continuum. The cutoff values of PABAK and reliability coefficients, although sensible, were arbitrary. The adequacy or inadequacy of reliability is not clearly discriminated by values that fall immediately above or below the cutoff points. Thus, our findings of adequate reliability may need to be interpreted more or less strictly, depending on the application of these results. Particularly with regard to the PABAK, which has been utilized in few studies, the appropriateness of a cutoff value of 0.60 is uncertain. However, given that a conventional kappa greater than 0.60 is interpreted as at least substantial agreement (21), and given that the PABAK provides more reliable values for an index of agreement than does the conventional kappa, we think that such a cutoff is indeed appropriate for the purpose of our study, which attempted to identify physical signs that are highly reliable for use in future OA studies.
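The difference between the conventional kappa and the PABAK is most visible when a finding is rare. The following minimal sketch, which assumes 2 raters and a hypothetical 2 × 2 table of 40 examinations (not study data), computes both indices together with the prevalence index described by Byrt et al (16). The function name is illustrative.

```python
def agreement_stats(a, b, c, d):
    """Agreement indices for a 2x2 table of two raters.

    a: both positive; b: rater 1 positive only;
    c: rater 2 positive only; d: both negative.
    Assumes expected agreement is below 1 (otherwise kappa is undefined).
    """
    n = a + b + c + d
    p_obs = (a + d) / n                       # observed agreement
    p_pos1 = (a + b) / n                      # rater 1 proportion positive
    p_pos2 = (a + c) / n                      # rater 2 proportion positive
    p_exp = p_pos1 * p_pos2 + (1 - p_pos1) * (1 - p_pos2)  # chance agreement
    return {
        "kappa": (p_obs - p_exp) / (1 - p_exp),
        "pabak": 2 * p_obs - 1,
        "prevalence_index": abs(a - d) / n,
    }


# Hypothetical low-prevalence sign: raters agree on 38 of 40 examinations,
# but only 1 examination is rated positive by both.
stats = agreement_stats(a=1, b=1, c=1, d=37)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative counts the observed agreement is 95%, yet the conventional kappa falls below 0.60 while the PABAK is well above it. This is the behavior that motivates the use of the PABAK, and it also shows why a PABAK obtained at very low prevalence should be read with caution.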
In addition, the magnitude of these statistical values depends very much on the statistical methods being applied. The ANOVA-generated coefficients are characterized by an interplay between the error due to doctors and the error due to patients and residuals, such that these latter 2 sources of variation can have a profound influence on the magnitude of the error due to doctors and thus the reliability coefficient.
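How the doctor and residual error terms enter an ANOVA-based coefficient can be illustrated with a minimal sketch. It assumes a simple patients-by-doctors table and the two-way random-effects intraclass correlation for a single rater described by Shrout and Fleiss (22). It omits the examination-order term of the Latin square and is not necessarily the exact Rc used in this study. The function name and the scores are hypothetical.

```python
def reliability_coefficient(scores):
    """Single-rater, two-way random-effects intraclass correlation.

    scores[i][j] is the score given to patient i by doctor j.
    Requires at least 2 patients, 2 doctors, and some between-patient variation.
    """
    n = len(scores)        # number of patients
    k = len(scores[0])     # number of doctors
    grand = sum(sum(row) for row in scores) / (n * k)
    patient_means = [sum(row) / k for row in scores]
    doctor_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Partition the total sum of squares into patient, doctor, and residual parts.
    ss_patients = k * sum((m - grand) ** 2 for m in patient_means)
    ss_doctors = n * sum((m - grand) ** 2 for m in doctor_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_patients - ss_doctors

    ms_patients = ss_patients / (n - 1)
    ms_doctors = ss_doctors / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_patients - ms_error) / (
        ms_patients + (k - 1) * ms_error + k * (ms_doctors - ms_error) / n
    )


# Hypothetical scores: 4 patients, 3 doctors.
identical = [[p, p, p] for p in (1, 2, 3, 4)]         # doctors agree exactly
offset = [[p, p, p + 1] for p in (1, 2, 3, 4)]        # third doctor always scores 1 higher

print(round(reliability_coefficient(identical), 3))
print(round(reliability_coefficient(offset), 3))
```

In the second data set the doctors rank the patients identically, but the systematic offset of one doctor adds variance attributable to doctors and lowers the coefficient. A narrower spread of patient scores would lower it further, since the same doctor and residual variance would then be set against less between-patient variance.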
The small sample of both patients and assessors may also be seen as a limitation. However, because of potential patient and assessor fatigue with repeated examinations, this type of work, by necessity, involves small samples. Furthermore, the selection of patients was carried out in such a way that they were representative of patients with mild to severe radiographic OA with a range of physical examination findings, and thus they were the kind of patients typically seen in clinical practice and in research. More importantly, the prevalence of a positive finding was adequate for the majority of physical signs, thereby allowing for an appropriate assessment of reliability. For those signs in which the prevalence was low, increasing the number of subjects may improve the assessment of reliability.
Assessors were selected based on their expertise in OA and in clinical assessments, and therefore the study results may not be generalizable to other rheumatologists. However, the application of the standardized techniques developed in this study will likely prove useful to further evaluate the reliability of the knee examination in other OA studies.
Finally, the intraobserver reliability was not assessed in this study, because doing so would have required more repetitions. Since our primary aim was to evaluate interobserver reliability, we designed the study to minimize the repetitions of knee examinations in order to avoid patient and examiner fatigue and, in particular, to avoid reinforcement of memory, which could potentially bias the findings for interobserver reliability. In addition, the long-term stability of interobserver agreement was not assessed. As a result, it is uncertain whether the improvement in reliability achieved during the standardization study is maintained over time. This is an important consideration for clinical trials and other OA studies, since long-term followup is often required. Further studies are necessary to evaluate long-term reliability of the physical examination and the frequency at which assessor training needs to be carried out in order to reliably perform the knee examination in OA.
Despite these potential limitations, the following key findings can be summarized. The majority of physical examinations can be performed reliably even without standardization (grade A signs). Even with highly reliable signs/techniques, standardization can further improve the reliability. Some physical examinations require assessor training and should not be used otherwise. The examination techniques with the highest reliability coefficients or PABAKs will likely be of most value in clinical research and possibly in the evaluation of early OA, in which more subtle findings are expected.
The most reliable physical examinations of knee OA are listed in Table 4 and include alignment by goniometer, bony swelling, general passive crepitus, gait, effusion bulge sign, quadriceps atrophy, medial and lateral tibiofemoral tenderness, patellofemoral tenderness by grind test, and flexion contracture. If knee examination techniques are to be included in future studies of OA, the inclusion of these key signs will be important and will allow for reliable and therefore improved outcome assessments.
ACKNOWLEDGMENT
We would like to thank our patient volunteers for their participation in this research study.
REFERENCES
1. Graham GP, Fairclough JA. The knee. In: Klippel JH, Dieppe PA, editors. Rheumatology. 2nd ed. London: Mosby International; 1998. p. 4.11.1–14.
2. Dillingham MF, Barry NN, Lannin JV. Hip and knee pain. In: Ruddy S, Harris ED Jr, Sledge CB, editors. Kelley’s textbook of rheumatology. 6th ed. Philadelphia: W.B. Saunders; 2001. p. 525–45.
3. Cyriax J. The knee. In: Textbook of orthopaedic medicine. 8th ed. London: Bailliere Tindall; 1982. p. 392–415.
4. Katz WA. Knees and legs. In: Katz WA, editor. Diagnosis and management of rheumatic diseases. 2nd ed. Philadelphia: J.B. Lippincott Company; 1988. p. 134–55.
5. Hoppenfeld S. Physical examination of the knee joint by complaint. Orthopedic Clinics of North America 1979;10:3–20.
6. Bookman A. The knee. In: Little H, editor. The rheumatological physical examination. Orlando: Grune & Stratton; 1986. p. 111–9.
7. Post WR. Clinical evaluation of patients with patellofemoral disorders. Arthroscopy 1999;15:841–51.
8. Taunton JE, Wilkinson M. Rheumatology: 14. Diagnosis and management of anterior knee pain. Can Med Assoc J 2001;164:1595–601.
9. Ike RW, O’Rourke KS. Compartment-directed physical examination of the knee can predict articular cartilage abnormalities disclosed by needle arthroscopy. Arthritis Rheum 1995;38:917–25.
10. Altman RD, Asch E, Bloch D, Bole G, Borenstein D, Brandt K, et al, for the Diagnostic and Therapeutic Criteria Committee of the American Rheumatism Association. Development of criteria for the classification and reporting of osteoarthritis: classification of osteoarthritis of the knee. Arthritis Rheum 1986;29:1039–49.
11. Cushnaghan J, Cooper C, Dieppe P, Kirwan J, McAlindon T, McCrae F. Clinical assessment of osteoarthritis of the knee. Ann Rheum Dis 1990;49:768–70.
12. Hart DJ, Spector TD, Brown P, Wilson P, Doyle DV, Silman AJ. Clinical signs of early osteoarthritis: reproducibility and relation to x-ray changes in 541 women in the general population. Ann Rheum Dis 1991;50:467–70.
13. Jones A, Hopkinson N, Pattrick M, Berman P, Doherty M. Evaluation of a method for clinically assessing osteoarthritis of the knee. Ann Rheum Dis 1992;51:243–5.
14. Hauzeur JP, Mathy L, De Maertelaer V. Comparison between clinical evaluation and ultrasonography in detecting hydrarthrosis of the knee. J Rheumatol 1999;26:2681–3.
15. Bellamy N, Carette S, Ford PM, Kean WF, LeRiche NGH, Lussier A, et al. Osteoarthritis antirheumatic drug trials. I. Effects of standardization procedures on observer dependent outcome measures. J Rheumatol 1992;19:436–43.
16. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol 1993;46:423–9.
17. Kellgren JH, Lawrence JS. Radiological assessment of osteoarthrosis. Ann Rheum Dis 1957;16:494–502.
18. Bellamy N, Buchanan WW, Goldsmith CH, Campbell J, Stitt LW. Validation study of WOMAC: a health status instrument for measuring clinically important patient relevant outcomes to antirheumatic drug therapy in patients with osteoarthritis of the hip or knee. J Rheumatol 1988;15:1833–40.
19. Box GEP, Hunter WG, Hunter JS. Designs with more than one blocking variable. In: Statistics for experimenters: an introduction to design, data analysis, and model building. New York: John Wiley & Sons; 1978. p. 245–80.
20. S-Plus 2000 Professional 1988-2000. Seattle, Washington: MathSoft Inc., Insightful Corporation. Available at: http://www.insightful.com.
21. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.
22. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bulletin 1979;86:420–8.