
ARTHRITIS & RHEUMATISM
Vol. 50, No. 2, February 2004, pp 458–468
DOI 10.1002/art.20025
© 2004, American College of Rheumatology

Reliability of the Knee Examination in Osteoarthritis

Effect of Standardization

Jolanda Cibere,1 Nicholas Bellamy,2 Anona Thorne,3 John M. Esdaile,4 Kelly J. McGorm,5 Andrew Chalmers,3 Simon Huang,3 Paul Peloso,6 Kam Shojania,4 Joel Singer,7 Hubert Wong,3 and Jacek Kopec4

Objective. To assess the reliability of physical examination of the osteoarthritic (OA) knee by rheumatologists, and to evaluate the benefits of standardization.

Methods. Forty-two physical signs and techniques were evaluated using a 6 × 6 Latin square design. Patients with mild to severe knee OA, based on physical and radiographic signs, were examined in random order prior to and following standardization of techniques. For those signs with dichotomous scales, agreement among the rheumatologists was calculated as the prevalence-adjusted bias-adjusted kappa (PABAK), while for the signs with continuous and ordinal scales, a reliability coefficient (Rc) was calculated using analysis of variance. A PABAK of >0.60 and an Rc of >0.80 were considered to indicate adequate reliability.

Results. Adequate poststandardization reliability was achieved for 30 of 42 physical signs/techniques (71%). The most highly reliable signs identified by physical examination of the OA knee included alignment by goniometer (Rc = 0.99), bony swelling (Rc = 0.97), general passive crepitus (Rc = 0.96), gait by inspection (PABAK = 0.78), effusion bulge sign (Rc = 0.97), quadriceps atrophy (Rc = 0.97), medial tibiofemoral tenderness (Rc = 0.94), lateral tibiofemoral tenderness (Rc = 0.85), patellofemoral tenderness by grind test (Rc = 0.94), and flexion contracture (Rc = 0.95). The standardization process resulted in substantial improvements in reliability for evaluation of a number of physical signs, although for some signs, minimal or no effect of standardization was noted. After standardization, warmth (PABAK = 0.14), medial instability at 30° flexion (PABAK = 0.02), and lateral instability at 30° flexion (PABAK = 0.34) were the only 3 signs that were highly unreliable.

Conclusion. With the exception of physical examinations for instability, a comprehensive knee examination can be performed with adequate reliability. Standardization further improves the reliability for some physical signs and techniques. The application of these findings to future OA studies will contribute to improved outcome assessments in OA.

Supported by grants from the Canadian Institutes of Health Research and the Canadian Arthritis Network. Dr. Cibere’s work was supported by a Canadian Institutes of Health Research Clinician Scientist Award and a Michael Smith Foundation for Health Research Postdoctoral Fellowship Award.

1Jolanda Cibere, MD: Arthritis Research Centre of Canada, Vancouver, British Columbia, Canada; 2Nicholas Bellamy, MD, MSc, MBA: Centre of National Research on Disability and Rehabilitation Medicine, and University of Queensland, Brisbane, Australia; 3Anona Thorne, MSc, Andrew Chalmers, MD, Simon Huang, MD, Hubert Wong, PhD: University of British Columbia, Vancouver, British Columbia, Canada; 4John M. Esdaile, MD, MPH, Kam Shojania, MD, Jacek Kopec, MD, MSc, PhD: University of British Columbia, and Arthritis Research Centre of Canada, Vancouver, British Columbia, Canada; 5Kelly J. McGorm, BN, MPH: Arthritis Research Centre of Canada, Vancouver, British Columbia, Canada (current address: University of Edinburgh, Edinburgh, UK); 6Paul Peloso, MD, MSc: University of Iowa, Iowa City; 7Joel Singer, PhD: University of British Columbia and Centre for Health Evaluation and Outcome Sciences, Vancouver, British Columbia, Canada.

Address correspondence and reprint requests to Jolanda Cibere, MD, Arthritis Research Centre of Canada, 895 West 10th Avenue, Vancouver, British Columbia V5Z 1L7, Canada. E-mail: [email protected].

Submitted for publication December 31, 2002; accepted in revised form October 1, 2003.

In clinical practice and in research, the knee examination is a key component in the assessment of patients with osteoarthritis (OA). Recommendations for the physical examination of the OA knee include many different signs and techniques (1–9). In the American College of Rheumatology (ACR) clinical diagnostic criteria for knee OA, assessments for crepitus, bony swelling, bony tenderness, and warmth are required (10). However, the reliability of these and other signs by physical examination has not been evaluated comprehensively. A search of the literature revealed very few studies on the intra- and interrater reliability of the knee examination in OA (11–15) (Table 1). Although the data in Table 1 suggest apparently large between-study differences in reliability, a comparison of these values is not appropriate because they are not all based on the same measure of reliability and because the kappa statistic is very sensitive to differences in bias and prevalence. As a result, it is not clear from these studies which, if any, physical signs can be examined reliably in knee OA. Additional limitations of prior studies are that some OA knee examination techniques have not been evaluated, and the effect of standardization has only been assessed in 1 study (15).

Because kappa is known to be a measure that is sensitive to prevalence and bias, a prevalence-adjusted bias-adjusted kappa (PABAK), described by Byrt et al (16), was used in this study. PABAK measures agreement beyond chance, while taking into account both the prevalence of a positive finding and the bias of each observer for reporting a positive finding. PABAK is thought to be a better estimate for agreement than the standard kappa (16). In addition, PABAK has the advantage that the results can be directly compared between different variables and even between studies, when study populations are similar.
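The sensitivity of kappa to prevalence can be illustrated with a small sketch. The two 2 × 2 tables below are hypothetical counts for two observers, not data from this study, and the function names are illustrative. Both tables have the same observed agreement (91%), yet Cohen's kappa drops sharply when the finding is rare, whereas PABAK, computed as 2p_o − 1, does not change.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table of two observers.

    a = both positive, d = both negative,
    b = observer 1 positive only, c = observer 2 positive only.
    """
    n = a + b + c + d
    p_o = (a + d) / n  # observed agreement
    # chance agreement from each observer's marginal totals
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_o - p_e) / (1 - p_e)


def pabak_2x2(a, b, c, d):
    """PABAK = 2 * p_o - 1; it depends only on the observed agreement."""
    n = a + b + c + d
    return 2 * (a + d) / n - 1


# Hypothetical tables with identical observed agreement (91 of 100).
rare_finding = (1, 4, 5, 90)       # positive finding is uncommon
balanced_finding = (45, 4, 5, 46)  # positive finding in about half

for name, table in [("rare", rare_finding), ("balanced", balanced_finding)]:
    print(name, round(cohen_kappa(*table), 2), round(pabak_2x2(*table), 2))
```

In this illustration kappa is about 0.13 for the rare finding and 0.82 for the balanced one, while PABAK is 0.82 in both cases.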

In this study, we assessed a wide range of knee physical examination signs and techniques in patients with mild to severe radiographic knee OA. The purpose of this study was 2-fold, in that we sought to determine 1) which signs can be assessed reliably by rheumatologists, and 2) whether standardization can reduce the interobserver variability.

Table 1. Summary of published findings on the interobserver reliability of osteoarthritis knee examination signs

| Knee examination sign | Cushnaghan et al (11)* | Hart et al (12)* | Jones et al (13)* | Hauzeur et al (14)† | Bellamy et al (15)‡, prestandardization | Bellamy et al (15)‡, poststandardization |
|---|---|---|---|---|---|---|
| Alignment | – | – | – | – | – | – |
| Bony swelling | 0.55 | 0.10 | – | – | 0.65 | 0.69 |
| Crepitus | | | | | | |
| General crepitus | – | 0.14 | 0.23 | – | – | – |
| Tibiofemoral crepitus | 0.64 | – | 0.09 | – | – | – |
| Patellofemoral crepitus | 0.24 | – | 0.10 | – | – | – |
| Gait | – | – | – | – | – | – |
| Inflammation | | | | | | |
| Nonbony swelling | 0.28 | 0.25 | 0.13 | – | – | – |
| Synovial fluid | – | – | 0.22 | 0.45 | – | – |
| Popliteal cyst | – | – | 0.21 | – | – | – |
| Warmth | – | – | 0.23 | – | – | – |
| Instability | | | | | | |
| Mediolateral instability | 0.23 | – | – | – | – | – |
| Anteroposterior instability | 0.00 | – | – | – | – | – |
| Muscle strength | – | – | – | – | – | – |
| Tenderness/pain | | | | | | |
| General tibiofemoral | – | 0.74 | – | – | – | – |
| Medial tibiofemoral | 0.40 | – | 0.48 | – | – | – |
| Lateral tibiofemoral | 0.43 | – | 0.44 | – | – | – |
| Patellofemoral | 0.35 | – | 0.33 | – | – | – |
| Periarticular | – | – | 0.35 | – | – | – |
| Pain on movement | – | 0.85 | – | – | – | – |
| Range of motion | – | – | – | – | 0.58–0.94 | 0.89–0.95 |

* Reliability assessed by kappa.
† Reliability assessed by weighted kappa.
‡ Reliability assessed by reliability coefficient.

PATIENTS AND METHODS

Patients. The study was approved by the local institutional review board and all participating subjects provided their written informed consent. Six subjects were selected from a database of patients with knee OA. Subjects were eligible if they met predefined criteria. Inclusion criteria were 1) age 40–79 years, 2) knee pain on most days of the month at any time in the past, 3) any knee pain during the previous 12 months, and 4) osteophytes on plain radiography. Exclusion criteria were 1) prior total knee arthroplasty, 2) knee surgery within the previous 4 months, 3) fibromyalgia or inflammatory arthritis, 4) knee pain derived from the hips or back, and 5) history of acute injury to the knee within the previous 6 months.

Patients were selected by an independent rheumatologist who was not involved in the standardization process (KS). The primary criterion for patient selection was the presence of a full range of physical examination signs, including a range of severity of different physical signs. A secondary criterion was the selection of patients on the basis of radiographic severity of OA as assessed by the Kellgren/Lawrence scale (17). Patients completed the Western Ontario and McMaster Universities OA (WOMAC) Index, version VA3.1 (18), but they were not selected on the basis of symptom severity as reported on the WOMAC. Rheumatologists who participated in the standardization process were blinded to these patient selection criteria.

Design. The study involved 6 rheumatologists who were experienced in the conduct of knee OA research studies, and a biostatistician. The selection of physical examination signs was based on the frequency with which they are recorded in rheumatology clinical practice or research and their potential to be useful in the evaluation of knee pain (1–6,9,10). If several techniques were available for the evaluation of a sign, these were included and specified. The framework used to evaluate all physical signs was based on 9 domains, comprising alignment, bony swelling, crepitus, gait, inflammation, instability, muscle strength, tenderness/pain, and range of motion. The final selection of physical signs/techniques was determined by group consensus (Figure 1). A pilot knee examination was conducted by 1 of the investigators (JC) to estimate the time required for the evaluation and to remedy any difficulties with the data collection forms or their scoring.

The standardization study was completed over 2 days. On the first day, a prestudy briefing was conducted. The rheumatologists were familiarized with the instructions, equipment, and scoring sheets for the 2-day program. The study manual contained a list of physical examinations with a brief description of patient position, examination method, and scoring. Basic misconceptions were clarified, but no detailed discussion of technique was entertained and no attempt to standardize was undertaken. Rheumatologists were encouraged to perform the examinations in accordance with their usual practice. Equipment for examinations included a 30.5-cm goniometer and a 150-cm tape measure, each identical in brand and specification.

Physical examinations. Prestandardization examinations. Following the briefing session, the rheumatologists examined each patient according to a 6 × 6 Latin square design (19). Examinations were conducted independently, in separate rooms, at a single site. The patient order and the order of physical examinations were randomized for each rheumatologist. For physician convenience and patient comfort, physical examinations were randomized by position of test (standing, sitting, lying) and then type of test. Patients wore shorts, such that there was adequate bilateral exposure of knees and thighs. Following each examination, all data forms were checked for completeness by the study coordinator, and the results were entered immediately for analysis by the biostatistician. The schedule included a 5-minute break for patients and rheumatologists between each examination and a 20-minute break after the first 3 examinations. All participants (patients and rheumatologists) were instructed not to discuss any examinations or techniques during the breaks. On completion of the prestandardization examinations, patients were asked to complete a 1-page questionnaire to document any perceived differences in examination technique between rheumatologists. Although analgesic medications were allowed in the study, none of the patients had taken any on the day of the examinations.

Standardization. Following the prestandardization examinations, rheumatologists convened for the standardization meeting, chaired by one of us (NB). The standardization process was based on 4 elements: 1) patient responses to the questionnaires on perceived differences in examinations; 2) graphic display of data and identification of outliers by the biostatistician; 3) discussion of examination techniques to elucidate reasons for variability; and 4) demonstration of physical signs on a healthy volunteer. Although the patient questionnaires provided some insight into differences in rheumatologists’ examinations, these were ultimately not helpful to identify outliers or explain variability. For each physical examination item, a graphic chart of the variability of findings was displayed by the biostatistician, followed by a discussion of the examination technique to elucidate the reasons for the variability. Areas of disagreement were resolved by consensus. The main discussion points were the physical examination techniques, measurement landmarks, and the scoring scale. The standardized technique was recorded in the study manual by each rheumatologist. Immediately following the standardization meeting, the poststandardization data collection forms were updated to incorporate all changes.

Poststandardization examinations. The following day, patients and rheumatologists returned for the poststandardization examinations. These were performed using a 6 × 6 Latin square design (19) with a different randomization schedule from that at prestandardization. The procedures for breaks, data checking, and patient feedback at the end of the examinations were the same.
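As an illustration of the scheduling idea, a randomized 6 × 6 Latin square can be generated as sketched below. This is a minimal sketch and not the procedure used in the study: the function name, the seeds, and the reading of rows as rheumatologists, columns as examination slots, and entries as patient numbers are all assumptions. The construction permutes a cyclic square, which yields a valid Latin square but does not sample uniformly from all possible squares.

```python
import random


def random_latin_square(n, seed=None):
    """Return an n x n Latin square as a list of rows.

    Built by shuffling the rows, columns, and symbols of the cyclic
    square (r + c) mod n, so each symbol appears exactly once in
    every row and every column.
    """
    rng = random.Random(seed)
    rows = list(range(n))
    cols = list(range(n))
    symbols = list(range(1, n + 1))
    rng.shuffle(rows)
    rng.shuffle(cols)
    rng.shuffle(symbols)
    return [[symbols[(r + c) % n] for c in cols] for r in rows]


# Hypothetical schedules: row = rheumatologist, column = examination slot,
# entry = patient number. Different seeds give different schedules for the
# pre- and poststandardization days.
pre_schedule = random_latin_square(6, seed=1)
post_schedule = random_latin_square(6, seed=2)
for row in pre_schedule:
    print(row)
```

With such a layout, every rheumatologist examines every patient once, and no patient is scheduled with two rheumatologists in the same slot.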

Statistical analysis. The statistical analysis was conducted by the biostatistician using the S-Plus statistical program (20). Interobserver agreement with regard to signs with dichotomous scales was calculated as the prevalence-adjusted bias-adjusted kappa (PABAK), which is calculated as 2p_o − 1, where p_o is the observed proportion of agreement (16); this observed proportion of agreement (p_o) was obtained by calculating the mean of observed proportions of agreement for all possible pairs of rheumatologists. Although the PABAK is adjusted for prevalence and bias, it still needs to be examined in conjunction with prevalence and bias. A high level of bias would need to be investigated to determine its cause. In a situation of low prevalence, insufficient information is available for a precise assessment of agreement, and PABAK values may be relatively uninformative. For the interpretation of PABAKs, we adopted the standard kappa descriptive scale by Landis and Koch (21), which, although somewhat arbitrary, is widely used: <0.00 = poor agreement, 0.00–0.20 = slight agreement, 0.21–0.40 = fair agreement, 0.41–0.60 = moderate agreement, 0.61–0.80 = substantial agreement, and 0.81–1.00 = almost perfect agreement. A PABAK of >0.60 was chosen a priori to indicate adequate reliability. In view of the fact that the PABAK represents an improvement on Cohen’s kappa, the consensus among the study rheumatologists and the biostatistician was that it was reasonable to use the same descriptive scale. Similarly, the choice of the cutoff value was based on a decision by consensus of all study rheumatologists and the biostatistician.
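The PABAK calculation described above (mean pairwise observed agreement across all rheumatologist pairs, then 2p_o − 1) can be sketched as follows. The ratings are hypothetical and the function names are illustrative; the study itself used S-Plus, so this is only an assumed re-expression of the stated formula, together with the Landis and Koch labels quoted in the text.

```python
from itertools import combinations


def mean_pairwise_agreement(ratings):
    """Mean observed agreement over all pairs of raters.

    ratings: one list per rater, each holding a 0/1 finding per patient.
    """
    agreements = [
        sum(x == y for x, y in zip(r1, r2)) / len(r1)
        for r1, r2 in combinations(ratings, 2)
    ]
    return sum(agreements) / len(agreements)


def pabak_from_ratings(ratings):
    """PABAK = 2 * p_o - 1, with p_o the mean pairwise agreement."""
    return 2 * mean_pairwise_agreement(ratings) - 1


def landis_koch(value):
    """Descriptive label on the Landis and Koch scale."""
    if value < 0:
        return "poor"
    if value <= 0.20:
        return "slight"
    if value <= 0.40:
        return "fair"
    if value <= 0.60:
        return "moderate"
    if value <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical data: 6 raters, 6 patients; one rater disagrees on one patient.
ratings = [[1, 1, 1, 0, 0, 0]] * 5 + [[1, 1, 1, 1, 0, 0]]
value = pabak_from_ratings(ratings)
print(round(value, 3), landis_koch(value), value > 0.60)
```

In this example 10 of the 15 rater pairs agree on all 6 patients and 5 pairs agree on 5 of 6, so p_o = 17/18 and PABAK = 8/9, which exceeds the 0.60 cutoff.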

Interobserver agreement with regard to signs with continuous and ordinal scales was assessed using analysis of variance (ANOVA). A number of different forms of the intraclass correlation coefficient may be used, depending on the study question and structure (22). Since the purpose of this study was to examine only the variation due to the differences between doctors, and because this study attempted to reduce this variation, it was thought that the most appropriate reliability coefficient (Rc) would be calculated as 1 − variance_doctor, where variance_doctor is the proportion of total variance attributed to doctors (15). By consensus, a reliability coefficient of >0.80 was chosen a priori to indicate adequate reliability.
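The final step of this calculation can be sketched as below. The sketch covers only the step from variance components to Rc, not the ANOVA that estimates the components, and the function and key names are illustrative. The example values are the poststandardization components reported in Table 3 for alignment by goniometer and for intermalleolar distance.

```python
def reliability_coefficient(components):
    """Rc = 1 - (doctor variance / total variance).

    components: mapping of variance source -> variance component,
    e.g. patient, doctor, order, error.
    """
    total = sum(components.values())
    return 1 - components["doctor"] / total


# Poststandardization components as reported in Table 3.
goniometer = {"patient": 0.42, "doctor": 0.01, "order": 0.08, "error": 0.49}
intermalleolar = {"patient": 0.59, "doctor": 0.25, "order": 0.03, "error": 0.13}

for name, comp in [("goniometer", goniometer), ("intermalleolar", intermalleolar)]:
    rc = reliability_coefficient(comp)
    print(name, round(rc, 2), "adequate" if rc > 0.80 else "inadequate")
```

Because only the doctor component enters the formula, a sign can have a large error component and still reach a high Rc, as with the goniometer measurement here.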

For the 7 signs in which it was necessary to change from an ordinal to a dichotomous scale after standardization, reliability coefficients were calculated for both the pre- and poststandardization data. Although the change in scale means that differences between pre- and poststandardization values must be interpreted with caution, it was believed that it would be more appropriate to compare 2 reliability coefficients than a reliability coefficient and a PABAK.

Figure 1. Assessment of poststandardization reliability for 42 physical examination signs evaluated in patients with knee osteoarthritis. Thick vertical bars indicate the cutoff values for an acceptable prevalence-adjusted bias-adjusted kappa (PABAK) (cutoff >0.60) and reliability coefficient (Rc) (cutoff >0.80). For a description of the grading system of reliability (grades A–D), see Patients and Methods. The effusion–balloon test was evaluated poststandardization only.

For all physical signs/techniques, an A–D grading system was applied as follows: grade A = reliability adequate on both pre- and poststandardization examinations (i.e., PABAK >0.60 or Rc >0.80); grade B = reliability adequate only on poststandardization examination; grade C = reliability adequate on pre- but not poststandardization examination; and grade D = reliability inadequate on both pre- and poststandardization examinations. Since the poststandardization reliability is of primary interest, a higher grade (A or B) indicates greater reliability and usefulness of the physical sign, with grade A being the most desirable since adequate reliability is present without standardization.
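The grading rule can be written compactly as below. This is a sketch of the rule as stated, with an illustrative function name; it assumes both a pre- and a poststandardization value are available, so the effusion–balloon test (poststandardization only) is not handled. The example values are taken from Tables 2 and 3.

```python
PABAK_CUTOFF = 0.60
RC_CUTOFF = 0.80


def reliability_grade(pre, post, cutoff):
    """Return grade A-D from pre/post reliability and the relevant cutoff.

    Reliability is adequate only if it is strictly greater than the cutoff.
    """
    pre_ok = pre > cutoff
    post_ok = post > cutoff
    if pre_ok and post_ok:
        return "A"
    if post_ok:
        return "B"
    if pre_ok:
        return "C"
    return "D"


# Values from Table 2 (PABAK) and Table 3 (Rc).
print(reliability_grade(0.52, 0.78, PABAK_CUTOFF))  # gait by inspection
print(reliability_grade(0.60, 0.54, PABAK_CUTOFF))  # anterior drawer test
print(reliability_grade(0.83, 0.78, RC_CUTOFF))     # medial TF crepitus, active
print(reliability_grade(0.84, 0.99, RC_CUTOFF))     # alignment by goniometer
```

The strict inequality matters at the boundary: the anterior drawer test has a prestandardization PABAK of exactly 0.60 and is graded D in Table 2.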

RESULTS

Patient characteristics. Three male and 3 female patients participated in the study. Their median age was 62 years (range 44–74 years), median duration of knee pain was 8 years (range 3–20 years), median WOMAC score for pain on walking was 4 mm (range 0–25 mm), and median body mass index was 23.9 kg/m² (range 22.4–26.7). Three of the patients underwent examination of the right knee, and 3 underwent examination of their left knee. The radiographic severity of OA was a Kellgren/Lawrence grade 2, grade 3, and grade 4 in 2 patients each.

Pre- and poststandardization agreement. The standardization meeting resulted in changes to the scoring in some of the examination domain items. All 6 items evaluating pain or tenderness were changed from a 4-point scale (none, mild, winces, withdraws) to a 2-point scale (absent, present), mainly because the 2 extreme descriptions of “winces” and “withdraws” were rarely elicited. Similarly, for the effusion bulge sign, the 4-point scale (none, mild, moderate, severe) was replaced by a 2-point scale (absent, present) in the poststandardization examinations. Scoring of bony swelling was changed from a 3-point scale (none, mild, severe) to a 4-point scale (none, mild, moderate, severe). The 3-point scale for the 11 crepitus signs was not changed, although the scale description was modified from “none, mild, severe” to “none, fine, coarse,” which was considered to more accurately reflect the nature of the abnormalities being detected. The scoring and description of all other signs/techniques remained the same. In addition, it was decided to score any marginal findings as normal or absent.

Figure 2. Effect of standardization of examination for 41 physical signs/techniques. Improvement and worsening of reliability as a result of standardization are indicated by symbols positioned either above or below the diagonal line, respectively. The degree of improvement or worsening is reflected by the vertical distance from the diagonal. Those signs/techniques assessed for reliability by prevalence-adjusted bias-adjusted kappas (PABAK) are denoted by lower case letters; the PABAK was considered acceptable if >0.60. Those signs/techniques assessed with reliability coefficients are denoted by upper case letters and numbers; the reliability coefficient was considered acceptable if >0.80. The effusion–balloon test was evaluated poststandardization only. A = active; P = passive; TF = tibiofemoral; PS = passive with stress; PF = patellofemoral.


The results for all physical examinations/signs within each domain are summarized in Figure 1, which shows the poststandardization reliability, and Figure 2, which shows the effect of standardization. Overall, adequate reliability was achieved for 30 (71%) of 42 physical signs/techniques, with a grade A or B following standardization. Although most of the other signs/techniques were close to the cutoff value for reliability, a few, including warmth, medial instability at 30° flexion, and lateral instability at 30° flexion, were well below the cutoff and thus were highly unreliable even after standardization (Figure 1). The process of standardization resulted in substantial improvements in reliability for many of the physical examination signs, although for some, standardization had minimal or no effect (Figure 2).

Detailed results on the 12 signs with dichotomous scales are shown in Table 2, which lists the PABAK values, prevalence, and bias for both pre- and poststandardization as well as the letter grade (A–D). One dichotomous item (effusion by balloon test) was added during the standardization meeting, and therefore only poststandardization data are reported for this item. Table 3 shows detailed results on the 29 signs with continuous or ordinal scales, listing the percentage variance due to patient, doctor, order, and error, and the reliability coefficients for both pre- and poststandardization as well as the letter grade (A–D).

Results by domain. Alignment. With regard to assessment of alignment, 3 of the 4 physical signs/techniques (inspection, goniometer, and intercondylar distance) were of adequate reliability (all grade A), whereas intermalleolar distance was not reliable even following standardization (grade D) (Table 3 and Figures 1 and 2). Alignment measured by goniometer was most reliable (Rc = 0.99). However, simple inspection for varus, valgus, or normal alignment also achieved a reliability coefficient of 0.94 poststandardization; because of its simplicity, this technique could also be useful in future research.

Bony swelling. The assessment for bony swelling, which is a key component of the ACR clinical diagnostic criteria for OA (10), was found to be highly reliable (grade A) in contrast to that observed in previous studies (11,12,15). Even with a change from a 3-point to a 4-point scale, reliability further improved after standardization, achieving a reliability coefficient of 0.97 (Table 3 and Figures 1 and 2).

Crepitus. Crepitus was assessed as general andcompartment-specific crepitus, and was assessed usingactive, passive, and passive with stress movement. Thelatter evaluation has been suggested to correlate withearly arthroscopic findings of cartilage damage (9). Thefindings for reliability were inconsistent. Although gen-eral crepitus was reliably assessed using passive move-ment (Rc � 0.96), assessment of general crepitus with

Table 2. Pre- and poststandardization prevalence-adjusted bias-adjusted kappa (PABAK) for dichotomous physical examination signs

Physical sign/technique Scale

Prestandardization Poststandardization

Grade*PABAK Prevalence Bias PABAK Prevalence Bias

GaitInspection Normal/abnormal 0.52 0.33 0.24 0.78 0.50 0.11 B

InflammationEffusion–balloon test† Present/absent – – – 0.88 0.19 0.06 A/BEffusion–patellar tap Present/absent 0.88 0.19 0.06 0.78 0.17 0.11 APopliteal cyst Present/absent 0.78 0.06 0.09 0.66 0.08 0.14 AWarmth Present/absent 0.24 0.28 0.24 0.14 0.31 0.30 D

InstabilityLateral

0° flexion Normal/abnormal 0.56 0.11 0.20 0.88 0.03 0.06 B30° flexion Normal/abnormal 0.08 0.42 0.21 0.34 0.28 0.27 D

Medial0° flexion Normal/abnormal 0.48 0.14 0.14 0.66 0.08 0.14 B30° flexion Normal/abnormal 0.02 0.39 0.31 0.02 0.50 0.18 D

AnteriorDrawer test Normal/abnormal 0.60 0.28 0.16 0.54 0.19 0.17 D

PosteriorDrawer test Normal/abnormal 0.82 0.11 0.09 0.82 0.11 0.09 ASag Normal/abnormal 0.82 0.06 0.09 0.78 0.06 0.11 A

Range of motionHyperextension Normal/abnormal �0.02 0.33 0.44 0.88 0.03 0.06 B

* For a description of the grading system of reliability (grades A–D), see Patients and Methods.† Variable only evaluated poststandardization.

RELIABILITY OF EXAMINATION OF KNEE OA 463

Page 7: Reliability of the knee examination in osteoarthritis: Effect of standardization

Tab

le3.

Pre-

and

post

stan

dard

izat

ion

com

pone

nts

ofva

rian

cean

dre

liabi

lity

coef

ficie

nts

for

cont

inuo

usan

dor

dina

lphy

sica

lexa

min

atio

nsi

gns

Phys

ical

sign

/tech

niqu

eSc

ale

(pos

tsta

ndar

diza

tion)

%of

vari

ance

,pre

stan

dard

izat

ion

%of

vari

ance

,po

stst

anda

rdiz

atio

nR

elia

| Domain | Sign | Scale | Pre: Patient | Pre: Doctor | Pre: Order | Pre: Error | Post: Patient | Post: Doctor | Post: Order | Post: Error | Reliability coefficient, prestandardization | Reliability coefficient, poststandardization | Grade* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alignment | Inspection | Normal, varus, valgus | 0.79 | 0.04 | 0.02 | 0.15 | 0.52 | 0.06 | 0.04 | 0.38 | 0.96 | 0.94 | A |
| Alignment | Goniometer | Degrees | 0.21 | 0.16 | 0.06 | 0.57 | 0.42 | 0.01 | 0.08 | 0.49 | 0.84 | 0.99 | A |
| Alignment | Intercondylar distance | Centimeters | 0.86 | 0.06 | 0.01 | 0.06 | 0.87 | 0.06 | 0.02 | 0.05 | 0.94 | 0.94 | A |
| Alignment | Intermalleolar distance | Centimeters | 0.45 | 0.22 | 0.03 | 0.30 | 0.59 | 0.25 | 0.03 | 0.13 | 0.78 | 0.75 | D |
| Bony swelling | Palpation | None, mild, moderate, severe† | 0.54 | 0.09 | 0.10 | 0.27 | 0.61 | 0.03 | 0.09 | 0.27 | 0.91 | 0.97 | A |
| Crepitus | General, active | None, fine, coarse | 0.54 | 0.22 | 0.05 | 0.18 | 0.37 | 0.33 | 0.10 | 0.19 | 0.78 | 0.67 | D |
| Crepitus | General, passive | None, fine, coarse | 0.56 | 0.14 | 0.04 | 0.26 | 0.38 | 0.04 | 0.15 | 0.43 | 0.86 | 0.96 | A |
| Crepitus | Lateral tibiofemoral, active | None, fine, coarse | 0.15 | 0.24 | 0.15 | 0.45 | 0.37 | 0.24 | 0.08 | 0.31 | 0.76 | 0.76 | D |
| Crepitus | Lateral tibiofemoral, passive | None, fine, coarse | 0.18 | 0.18 | 0.08 | 0.57 | 0.28 | 0.09 | 0.14 | 0.49 | 0.82 | 0.91 | A |
| Crepitus | Lateral tibiofemoral, passive with stress | None, fine, coarse | 0.15 | 0.24 | 0.19 | 0.42 | 0.28 | 0.33 | 0.04 | 0.35 | 0.76 | 0.67 | D |
| Crepitus | Medial tibiofemoral, active | None, fine, coarse | 0.41 | 0.17 | 0.07 | 0.34 | 0.49 | 0.22 | 0.06 | 0.22 | 0.83 | 0.78 | C |
| Crepitus | Medial tibiofemoral, passive | None, fine, coarse | 0.35 | 0.25 | 0.07 | 0.33 | 0.34 | 0.22 | 0.06 | 0.39 | 0.75 | 0.78 | D |
| Crepitus | Medial tibiofemoral, passive with stress | None, fine, coarse | 0.20 | 0.24 | 0.13 | 0.42 | 0.32 | 0.06 | 0.15 | 0.47 | 0.76 | 0.94 | B |
| Crepitus | Patellofemoral, active | None, fine, coarse | 0.57 | 0.23 | 0.01 | 0.19 | 0.18 | 0.27 | 0.21 | 0.35 | 0.77 | 0.73 | D |
| Crepitus | Patellofemoral, passive | None, fine, coarse | 0.69 | 0.08 | 0.08 | 0.14 | 0.18 | 0.23 | 0.13 | 0.46 | 0.92 | 0.77 | C |
| Crepitus | Patellofemoral, passive with stress | None, fine, coarse | 0.18 | 0.10 | 0.07 | 0.65 | 0.21 | 0.13 | 0.13 | 0.54 | 0.90 | 0.87 | A |
| Inflammation | Effusion–bulge sign | Present, absent‡ | 0.32 | 0.18 | 0.03 | 0.47 | 0.26 | 0.03 | 0.03 | 0.67 | 0.82 | 0.97 | A |
| Muscle strength | Hamstring strength | Poor, moderate, full | 0.12 | 0.12 | 0.12 | 0.65 | 0.14 | 0.14 | 0.14 | 0.57 | 0.88 | 0.86 | A |
| Muscle strength | Quadriceps strength | Poor, moderate, full | 0.14 | 0.14 | 0.14 | 0.57 | 0.14 | 0.14 | 0.14 | 0.57 | 0.86 | 0.86 | A |
| Muscle strength | Extension lag | None, mild, moderate, severe | 0.53 | 0.24 | 0.05 | 0.18 | 0.12 | 0.12 | 0.12 | 0.63 | 0.76 | 0.88 | B |
| Muscle strength | Quadriceps atrophy | None, mild, severe | 0.62 | 0.14 | 0.02 | 0.21 | 0.68 | 0.03 | 0.03 | 0.27 | 0.86 | 0.97 | A |
| Tenderness/pain | Lateral tibiofemoral tenderness | Present, absent‡ | 0.50 | 0.15 | 0.04 | 0.31 | 0.50 | 0.15 | 0.02 | 0.33 | 0.85 | 0.85 | A |
| Tenderness/pain | Medial tibiofemoral tenderness | Present, absent‡ | 0.33 | 0.17 | 0.06 | 0.44 | 0.44 | 0.06 | 0.14 | 0.36 | 0.83 | 0.94 | A |
| Tenderness/pain | Patellofemoral tenderness by grind test | Present, absent‡ | 0.70 | 0.07 | 0.07 | 0.17 | 0.67 | 0.06 | 0.02 | 0.25 | 0.93 | 0.94 | A |
| Tenderness/pain | Anserine bursa tenderness | Present, absent‡ | 0.31 | 0.26 | 0.10 | 0.33 | 0.45 | 0.10 | 0.03 | 0.42 | 0.74 | 0.90 | B |
| Tenderness/pain | Patellar tendon tenderness | Present, absent‡ | 0.34 | 0.16 | 0.16 | 0.34 | 0.67 | 0.07 | 0.07 | 0.20 | 0.84 | 0.93 | A |
| Tenderness/pain | End-of-range stress pain | Present, absent‡ | 0.34 | 0.11 | 0.06 | 0.48 | 0.40 | 0.13 | 0.07 | 0.40 | 0.89 | 0.87 | A |
| Range of motion | Flexion range of motion | Degrees | 0.55 | 0.04 | 0.13 | 0.28 | 0.63 | 0.15 | 0.01 | 0.21 | 0.96 | 0.85 | A |
| Range of motion | Flexion contracture | Degrees | 0.58 | 0.19 | 0.07 | 0.16 | 0.82 | 0.05 | 0.05 | 0.08 | 0.81 | 0.95 | A |

Pre = prestandardization; Post = poststandardization.
* For a description of the grading system of reliability (grades A–D), see Patients and Methods.
† Prestandardization scale = none, mild, severe.
‡ Prestandardization scale = none, mild, moderate, severe.


active movement did not achieve adequate reliability (Rc = 0.67). For compartment-specific crepitus, adequate reliability was present for the lateral compartment only on passive movement (Rc = 0.91), and for the medial and patellofemoral compartments, adequate reliability was achieved only on passive with stress movement (Rc = 0.94 and Rc = 0.87, respectively). All other crepitus evaluations achieved a grade C or D, although it should be noted that most of the poststandardization reliability coefficients were close to the cutoff value of 0.80 (Table 3 and Figures 1 and 2).

Gait. Gait was assessed by simple inspection and was found to be reliable, with a poststandardization PABAK of 0.78. However, standardization was required to achieve adequate reliability (grade B) (Table 2 and Figures 1 and 2).

Inflammation. Signs of inflammation included joint effusion, popliteal cyst, and warmth. Effusion was assessed by bulge sign, balloon test, and patellar tap. Of these, the bulge sign was most reliable (Rc = 0.97) (Table 3). However, assessment of effusion by balloon test also achieved a poststandardization PABAK of 0.88, although it is uncertain whether standardization was required for such an achievement, since this sign was only assessed poststandardization (Table 2). The examination for popliteal cyst was also reliable (grade A). However, the PABAK for assessment of popliteal cyst unexpectedly decreased from 0.78 on prestandardization to 0.66 on poststandardization (Table 2). Agreement on assessment of warmth was low and remained low despite standardization, with a poststandardization PABAK of 0.14 (grade D) (Table 2 and Figures 1 and 2).

Instability. Lateral and medial instability were assessed reliably at 0° of flexion after standardization (both grade B), although the prevalence of a positive finding for these 2 items was very low, and therefore these results must be interpreted with caution. In contrast, at 30° of flexion, agreement on assessment of both lateral and medial instability was poor (both grade D), particularly for medial instability, which achieved a poststandardization PABAK of only 0.02. It is also of interest that the bias for both of these signs was high on both pre- and poststandardization examinations, suggesting that, overall, the rheumatologists’ bias for finding instability was not altered by the standardization process (Table 2). Assessment of posterior instability by posterior drawer test and posterior sag was reliable (grade A), but also had a low prevalence, whereas anterior instability assessment by anterior drawer test was found to be unreliable (grade D) (Table 2).

Muscle strength. Good agreement was achieved for all assessments of muscle strength, with reliability coefficients of 0.86, 0.86, 0.88, and 0.97 for quadriceps strength, hamstring strength, extension lag, and quadriceps atrophy, respectively. All signs achieved grade A, except for extension lag, which achieved grade B (Table 3 and Figure 1).
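The grades quoted above can be reproduced from the pre- and poststandardization coefficients in Table 3. The sketch below is a minimal illustration: the grade definitions themselves are given in Patients and Methods, not in this section, so the rule coded here (A = adequate before and after, B = adequate only after, C = adequate only before, D = inadequate both times) is an assumption inferred from the pattern of the tabulated values. The function name `reliability_grade` is hypothetical; the cutoff of 0.80 is the one stated for the reliability coefficient.

```python
# Assumed grading rule, inferred from the Table 3 pattern (not quoted from the paper):
#   A: adequate pre and post, B: adequate post only,
#   C: adequate pre only,     D: adequate neither.
RC_CUTOFF = 0.80  # adequacy cutoff for the reliability coefficient (Rc > 0.80)


def reliability_grade(pre, post, cutoff=RC_CUTOFF):
    """Return the assumed A-D grade for one physical sign."""
    pre_ok = pre > cutoff
    post_ok = post > cutoff
    if pre_ok and post_ok:
        return "A"
    if post_ok:
        return "B"
    if pre_ok:
        return "C"
    return "D"


# (prestandardization Rc, poststandardization Rc) taken from Table 3
muscle_strength = {
    "Quadriceps strength": (0.86, 0.86),
    "Hamstring strength": (0.88, 0.86),
    "Extension lag": (0.76, 0.88),
    "Quadriceps atrophy": (0.86, 0.97),
}

grades = {sign: reliability_grade(pre, post) for sign, (pre, post) in muscle_strength.items()}
for sign, grade in grades.items():
    print(f"{sign}: grade {grade}")
```

Under this assumed rule, extension lag is the only muscle strength sign that needed standardization to cross the cutoff, which matches the grade B reported in the text.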

Tenderness/pain. Good reliability was achieved for assessment of all signs of articular and periarticular tenderness/pain (grade A or B). Although the change in scoring from a 4-point to a 2-point scale may have improved the agreement among rheumatologists, good reliability was already present prior to standardization for most tenderness/pain signs (Table 3 and Figures 1 and 2).

Range of motion. Assessment of range of motion of the knee was subdivided into flexion, flexion contracture, and hyperextension. All 3 assessments were of adequate reliability (grade A, A, and B, respectively) (Tables 2 and 3). The low PABAK and high bias for the examination of hyperextension before standardization were related to a difficulty in the interpretation of the scale, which was coded as absent (0) or present (1). Some rheumatologists interpreted the absence of hyperextension as abnormal, and therefore coded their finding as 1 instead of 0. Poststandardization, this coding error was eliminated by changing the scale to normal/abnormal. This resulted in an improved PABAK of 0.88 and a much lower bias of 0.06. However, it should also be noted that the poststandardization prevalence was very low, such that this PABAK needs to be interpreted with caution (Table 2 and Figures 1 and 2).
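The effect of such a coding reversal on agreement statistics can be illustrated with a two-rater sketch. It uses the standard two-rater definitions from Byrt et al. (16): PABAK = 2·Po − 1, with the bias and prevalence indices taken here as absolute values. The study involved 6 rheumatologists, and how the indices were pooled across raters is not described in this section, so the two-rater setting, the function name `two_rater_agreement`, and the example codes are all illustrative assumptions rather than study data.

```python
def two_rater_agreement(rater_a, rater_b):
    """Two-rater PABAK, bias index, and prevalence index for 0/1 codes."""
    pairs = list(zip(rater_a, rater_b))
    n = len(pairs)
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # both code "present"
    b = sum(1 for x, y in pairs if x == 1 and y == 0)  # only rater A codes "present"
    c = sum(1 for x, y in pairs if x == 0 and y == 1)  # only rater B codes "present"
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # both code "absent"
    observed_agreement = (a + d) / n
    return {
        "pabak": 2 * observed_agreement - 1,
        "bias_index": abs(b - c) / n,
        "prevalence_index": abs(a - d) / n,
    }


# Hypothetical codes for 6 patients, one of whom has hyperextension.
rater_a = [0, 0, 0, 0, 0, 1]
rater_b_reversed = [1, 1, 1, 1, 1, 0]  # rater B reads "absent" as abnormal and codes 1
rater_b_corrected = [0, 0, 0, 0, 0, 1]  # same findings once the scale is normal/abnormal

before = two_rater_agreement(rater_a, rater_b_reversed)
after = two_rater_agreement(rater_a, rater_b_corrected)
print("before:", before)
print("after: ", after)
```

In this toy case the reversed coding gives complete disagreement and a high bias index, while the corrected coding gives complete agreement with zero bias. The prevalence index stays high after correction because only 1 of 6 patients is positive, which is the situation in which the text advises interpreting the PABAK with caution.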

DISCUSSION

The principal elements of the knee examination include evaluations for alignment, bony swelling, crepitus, gait, inflammation, instability, muscle strength, tenderness/pain, and range of motion. The availability of reliable physical examination signs from within each of these domains is crucial for the ability to assess the knee joint comprehensively in future outcome studies of OA.

With regard to examinations for crepitus, our study findings were variable. Compartment-specific crepitus was not assessed with consistent reliability by using active or passive movement. Passive movement with stress was reliable only for the medial tibiofemoral and the patellofemoral compartments and may be more difficult to implement in clinical research, since it is not usual practice, is not easily performed, and is more time consuming. Therefore, given that passive crepitus is generally assessed in clinical practice, and since the reliability coefficients were either acceptable or close to the cutoff of 0.80, this technique would seem most feasible for use in future studies, if compartment-specific crepitus is of importance. For general (non–compartment-specific) crepitus, the assessment was most reliable using passive movement.

With regard to alignment, inflammation, and muscle strength, several interchangeable signs were evaluated in this study, with at least 1, and frequently more than 1, physical sign achieving good reliability. This allows for selection of appropriate physical examination signs in future studies on the basis of not only reliability, but also suitability and preference. On the other hand, for some domains, such as instability, tenderness/pain, and range of motion, the individual physical signs evaluated represent different dimensions of these domains and are therefore not interchangeable. With regard to instability, we were only able to reliably assess posterior instability. The reliability of medial and lateral instability, which depends on adequate assessments at both 0° and 30° of flexion, was not established. As a result, the assessment of instability as a whole was found to be unreliable and may need to be investigated in further studies. In contrast, the assessments for tenderness/pain and range of motion were found to be reliable, since all physical signs within those groups were deemed to be reliable.

Overall, this study showed that the majority of physical signs can be assessed reliably. Furthermore, the effect of standardization was to improve reliability for most of the signs/techniques. However, for some physical examinations, there was a decrease in reliability following standardization. For most of these, a decrease of less than 0.05 in the reliability coefficient or PABAK was seen. Such small changes, in either direction, may not be clinically important and are likely due to random error resulting from the dynamic interactions which occur within and between subjects as well as within and between assessors. For a few signs, a greater decrease in the reliability coefficient or PABAK was observed. This is likely due to the fact that not all physical measures are equally responsive to a simple standardization procedure, but require more intense or repeated training. In particular, a conflict between what assessors normally do compared with what they are obligated to do on the basis of imposed study requirements can likely influence the reliability. In this study, the standardization meeting included demonstration of physical technique for the purpose of reaching an agreement on standardization, but no extensive assessor training was undertaken, and therefore this may have adversely affected the reliability of some physical examination findings. This possibility needs to be considered in future studies.

Only 3 signs were clearly unreliable for the physical examination of knee OA and were not remedied by standardization. These were warmth, lateral instability at 30° flexion, and medial instability at 30° flexion. It is not surprising that the latter 2 were unreliable, since some mediolateral movement is invariably present at 30° of flexion and the decision of what degree of movement constitutes instability is subjective and difficult to standardize. It is interesting to note that standardization did achieve substantial improvement in agreement for lateral instability at 30° flexion, but not for medial instability at 30° flexion. This discrepancy is difficult to explain. It is possible that mediolateral instability represents a single inseparable sign, which may achieve adequate reliability, if examined as such. This will need to be explored in future studies.

Similar to instability, we found poor reliability for the assessment of warmth and were unable to standardize for it. Because a finding of warmth could be affected by repeated joint examinations, the order of examinations and whether later examinations were associated with findings of warmth was evaluated. However, no effect due to order was found, and therefore it is likely that the poor reliability of warmth was due to the highly subjective nature of its assessment.

The interpretation of our results also requires an understanding of the inherent limitation of dichotomizing a continuum. The cutoff values of PABAK and reliability coefficients, although sensible, were arbitrary. The adequacy or inadequacy of reliability is not clearly discriminated by values that fall immediately above or below the cutoff points. Thus, our findings of adequate reliability may need to be interpreted more or less strictly, depending on the application of these results. Particularly with regard to the PABAK, which has been utilized in few studies, the appropriateness of a cutoff value of 0.60 is uncertain. However, given that a conventional kappa greater than 0.60 is interpreted as at least substantial agreement (21), and given that the PABAK provides more reliable values for an index of agreement than does the conventional kappa, we think that such a cutoff is indeed appropriate for the purpose of our study, which attempted to identify physical signs that are highly reliable for use in future OA studies.

Table 4. Summary of poststandardization values for the most reliable physical examination techniques in each domain

| Domain | Physical examination sign | Reliability |
|---|---|---|
| Alignment | Alignment by goniometer | 0.99* |
| Bony swelling | Palpation | 0.97* |
| Crepitus | General passive crepitus | 0.96* |
| Gait | Inspection | 0.78† |
| Inflammation | Effusion bulge sign | 0.97* |
| Instability | – | Unreliable |
| Muscle strength | Quadriceps atrophy | 0.97* |
| Tenderness/pain | Medial tibiofemoral tenderness | 0.94* |
| Tenderness/pain | Lateral tibiofemoral tenderness | 0.85* |
| Tenderness/pain | Patellofemoral tenderness by grind test | 0.94* |
| Range of motion | Flexion contracture | 0.95* |

* By reliability coefficient.
† By prevalence-adjusted bias-adjusted kappa.
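How much the adequate/inadequate classification depends on the chosen cutoff can be checked directly against the tabulated values. The sketch below counts how many of the 11 crepitus assessments in Table 3 would be classed as adequate at three cutoffs. The alternative cutoffs of 0.75 and 0.85 are illustrative choices, not values proposed in the study, and the function name `adequate_signs` is hypothetical.

```python
# Poststandardization reliability coefficients for crepitus, from Table 3.
CREPITUS_POST_RC = {
    "General, active": 0.67,
    "General, passive": 0.96,
    "Lateral tibiofemoral, active": 0.76,
    "Lateral tibiofemoral, passive": 0.91,
    "Lateral tibiofemoral, passive with stress": 0.67,
    "Medial tibiofemoral, active": 0.78,
    "Medial tibiofemoral, passive": 0.78,
    "Medial tibiofemoral, passive with stress": 0.94,
    "Patellofemoral, active": 0.73,
    "Patellofemoral, passive": 0.77,
    "Patellofemoral, passive with stress": 0.87,
}


def adequate_signs(values, cutoff):
    """Return the signs whose coefficient is strictly above the cutoff."""
    return sorted(sign for sign, rc in values.items() if rc > cutoff)


# 0.80 is the study cutoff; 0.75 and 0.85 are illustrative alternatives.
counts = {cutoff: len(adequate_signs(CREPITUS_POST_RC, cutoff)) for cutoff in (0.75, 0.80, 0.85)}
for cutoff, count in counts.items():
    print(f"cutoff > {cutoff:.2f}: {count} of {len(CREPITUS_POST_RC)} crepitus signs adequate")
```

Four crepitus signs pass at the study cutoff of 0.80, and four more lie between 0.75 and 0.80, which is why values just below a cutoff may deserve a less strict reading in some applications.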

In addition, the magnitude of these statistical values depends very much on the statistical methods being applied. The ANOVA-generated coefficients are characterized by an interplay between the error due to doctors and the error due to patients and residuals, such that these latter 2 sources of variation can have a profound influence on the magnitude of the error due to doctors and thus the reliability coefficient.
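This interplay can be made concrete with a small sketch. In the Table 3 values, the reported reliability coefficient coincides with 1 minus the share attributed to doctors (for example, a doctor share of 0.01 and Rc = 0.99 for alignment by goniometer). That relation is an observation from the tabulated numbers, not a formula stated in this section, and it is used below only as an assumption. The absolute variance components are invented for illustration, and the function names are hypothetical.

```python
def variance_shares(patient, doctor, order, error):
    """Convert absolute variance components into shares of the total."""
    total = patient + doctor + order + error
    return {
        "patient": patient / total,
        "doctor": doctor / total,
        "order": order / total,
        "error": error / total,
    }


def reliability_coefficient(shares):
    """Assumed relation read off Table 3: Rc = 1 - doctor share."""
    return 1.0 - shares["doctor"]


# Same absolute doctor variance (1.0) in two invented scenarios.
wide_patient_spread = variance_shares(patient=8.0, doctor=1.0, order=0.5, error=0.5)
narrow_patient_spread = variance_shares(patient=1.0, doctor=1.0, order=0.5, error=2.5)

rc_wide = reliability_coefficient(wide_patient_spread)
rc_narrow = reliability_coefficient(narrow_patient_spread)
print(f"wide patient spread:   Rc = {rc_wide:.2f}")
print(f"narrow patient spread: Rc = {rc_narrow:.2f}")

# Consistency check of the assumed relation against a few Table 3 rows:
# (sign, poststandardization doctor share, poststandardization Rc)
table3_rows = [
    ("Alignment by goniometer", 0.01, 0.99),
    ("Bony swelling", 0.03, 0.97),
    ("General active crepitus", 0.33, 0.67),
    ("Flexion contracture", 0.05, 0.95),
]
consistent = all(abs((1.0 - doctor) - rc) < 0.005 for _, doctor, rc in table3_rows)
print("Table 3 rows consistent with Rc = 1 - doctor share:", consistent)
```

With the doctor component held fixed, the coefficient falls from 0.90 to 0.80 purely because the patient and residual components change, which is the dependence the paragraph above describes.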

The small sample of both patients and assessors may also be seen as a limitation. However, because of potential patient and assessor fatigue with repeated examinations, this type of work, by necessity, involves small samples. Furthermore, the selection of patients was carried out in such a way that they were representative of patients with mild to severe radiographic OA with a range of physical examination findings, and thus they were the kind of patients typically seen in clinical practice and in research. More importantly, the prevalence of a positive finding was adequate for the majority of physical signs, thereby allowing for an appropriate assessment of reliability. For those signs in which the prevalence was low, increasing the number of subjects may improve the assessment of reliability.

Assessors were selected based on their expertise in OA and in clinical assessments, and therefore the study results may not be generalizable to other rheumatologists. However, the application of the standardized techniques developed in this study will likely prove useful to further evaluate the reliability of the knee examination in other OA studies.

Finally, the intraobserver reliability was not assessed in this study, because doing so would have required more repetitions. Since our primary aim was to evaluate interobserver reliability, we designed the study to minimize the repetitions of knee examinations in order to avoid patient and examiner fatigue and, in particular, to avoid reinforcement of memory, which could potentially bias the findings for interobserver reliability. In addition, the long-term stability of interobserver agreement was not assessed. As a result, it is uncertain whether the improvement in reliability achieved during the standardization study is maintained over time. This is an important consideration for clinical trials and other OA studies, since long-term followup is often required. Further studies are necessary to evaluate long-term reliability of the physical examination and the frequency at which assessor training needs to be carried out in order to reliably perform the knee examination in OA.

Despite these potential limitations, the following key findings can be summarized. The majority of physical examinations can be performed reliably even without standardization (grade A signs). Even with highly reliable signs/techniques, standardization can further improve the reliability. Some physical examinations require assessor training and should not be used otherwise. The examination techniques with the highest reliability coefficients or PABAKs will likely be of most value in clinical research and possibly in the evaluation of early OA, in which more subtle findings are expected.

The most reliable physical examinations of knee OA are listed in Table 4 and include alignment by goniometer, bony swelling, general passive crepitus, gait, effusion bulge sign, quadriceps atrophy, medial and lateral tibiofemoral tenderness, patellofemoral tenderness by grind test, and flexion contracture. If knee examination techniques are to be included in future studies of OA, the inclusion of these key signs will be important and will allow for reliable and therefore improved outcome assessments.

ACKNOWLEDGMENT

We would like to thank our patient volunteers for their participation in this research study.

REFERENCES

1. Graham GP, Fairclough JA. The knee. In: Klippel JH, Dieppe PA, editors. Rheumatology. 2nd ed. London: Mosby International; 1998. p. 4.11.1–14.

2. Dillingham MF, Barry NN, Lannin JV. Hip and knee pain. In: Ruddy S, Harris ED Jr, Sledge CB, editors. Kelley’s textbook of rheumatology. 6th ed. Philadelphia: W.B. Saunders; 2001. p. 525–45.

3. Cyriax J. The knee. In: Textbook of orthopaedic medicine. 8th ed. London: Bailliere Tindall; 1982. p. 392–415.

4. Katz WA. Knees and legs. In: Katz WA, editor. Diagnosis and management of rheumatic diseases. 2nd ed. Philadelphia: J.B. Lippincott Company; 1988. p. 134–55.

5. Hoppenfeld S. Physical examination of the knee joint by complaint. Orthopedic Clinics of North America 1979;10:3–20.

6. Bookman A. The knee. In: Little H, editor. The rheumatological physical examination. Orlando: Grune & Stratton; 1986. p. 111–9.

7. Post WR. Clinical evaluation of patients with patellofemoral disorders. Arthroscopy 1999;15:841–51.

8. Taunton JE, Wilkinson M. Rheumatology: 14. Diagnosis and management of anterior knee pain. Can Med Assoc J 2001;164:1595–601.

9. Ike RW, O’Rourke KS. Compartment-directed physical examination of the knee can predict articular cartilage abnormalities disclosed by needle arthroscopy. Arthritis Rheum 1995;38:917–25.

10. Altman RD, Asch E, Bloch D, Bole G, Borenstein D, Brandt K, et al, for the Diagnostic and Therapeutic Criteria Committee of the American Rheumatism Association. Development of criteria for the classification and reporting of osteoarthritis: classification of osteoarthritis of the knee. Arthritis Rheum 1986;29:1039–49.

11. Cushnaghan J, Cooper C, Dieppe P, Kirwan J, McAlindon T, McCrae F. Clinical assessment of osteoarthritis of the knee. Ann Rheum Dis 1990;49:768–70.

12. Hart DJ, Spector TD, Brown P, Wilson P, Doyle DV, Silman AJ. Clinical signs of early osteoarthritis: reproducibility and relation to x-ray changes in 541 women in the general population. Ann Rheum Dis 1991;50:467–70.

13. Jones A, Hopkinson N, Pattrick M, Berman P, Doherty M. Evaluation of a method for clinically assessing osteoarthritis of the knee. Ann Rheum Dis 1992;51:243–5.

14. Hauzeur JP, Mathy L, De Maertelaer V. Comparison between clinical evaluation and ultrasonography in detecting hydrarthrosis of the knee. J Rheumatol 1999;26:2681–3.

15. Bellamy N, Carette S, Ford PM, Kean WF, LeRiche NGH, Lussier A, et al. Osteoarthritis antirheumatic drug trials. I. Effects of standardization procedures on observer dependent outcome measures. J Rheumatol 1992;19:436–43.

16. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol 1993;46:423–9.

17. Kellgren JH, Lawrence JS. Radiological assessment of osteoarthrosis. Ann Rheum Dis 1957;16:494–502.

18. Bellamy N, Buchanan WW, Goldsmith CH, Campbell J, Stitt LW. Validation study of WOMAC: a health status instrument for measuring clinically important patient relevant outcomes to antirheumatic drug therapy in patients with osteoarthritis of the hip or knee. J Rheumatol 1988;15:1833–40.

19. Box GEP, Hunter WG, Hunter JS. Designs with more than one blocking variable. In: Statistics for experimenters: an introduction to design, data analysis, and model building. New York: John Wiley & Sons; 1978. p. 245–80.

20. S-Plus 2000 Professional 1988-2000. Seattle, Washington: MathSoft Inc., Insightful Corporation. Available at: http://www.insightful.com.

21. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.

22. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bulletin 1979;86:420–8.
