A Coefficient of Agreement as a Measure of Thematic Classification Accuracy

George H. Rosenfield and Katherine Fitzpatrick-Lins
U.S. Geological Survey, Reston, VA 22092

ABSTRACT: The classification error matrix typically contains the tabulated results of accuracy evaluation for a thematic classification, such as a land-use and land-cover map. Diagonal elements of the matrix represent counts correct. The usual designation of classification accuracy has been total percent correct. Nondiagonal elements of the matrix have usually been neglected. A coefficient of agreement is determined for the interpreted map as a whole, and individually for each interpreted category. These coefficients utilize all cell values in the matrix. A conditional coefficient of agreement for individual categories is compared to other methods for expressing category accuracy which have already been published in the remote sensing literature.
PHOTOGRAMMETRIC ENGINEERING AND REMOTE SENSING, Vol. 52, No. 2, February 1986, pp. 223-227.

INTRODUCTION

THE CLASSIFICATION ERROR MATRIX typically contains the tabulated results of accuracy evaluation for a thematic classification, such as a land-use and land-cover map. Table 1 shows two classification error matrices for two photointerpreters (Congalton and Mead, 1983). In the matrix, for each sample point (or count), interpretation (classification) is given as rows and verification (ground truth) is given as columns.

BACKGROUND

Investigators in remote sensing have been searching for a single value (coefficient) to adequately represent the accuracy of a thematic classification. They have also been searching for an accuracy value for each individual category within the map. Because a classification error matrix represents the results of accuracy evaluation, the usual procedure has been to use total percent correct, the ratio of the sum of diagonal values to the total number of cell counts in the matrix, as map accuracy. Proportions of diagonal values to row sums are considered as category accuracy relative to errors of commission, and proportions of diagonal values to column sums as category accuracy relative to errors of omission.

Development of percent correct, both for the entire map and for individual categories, has been lost to antiquity. Several researchers (Turk, 1979; Hellden, 1980; Short, 1982) have attempted to find an accuracy value for individual categories which considers both errors of commission and errors of omission. Another way of considering overall accuracy and individual category accuracy is to address the problem as a measure of agreement between classification and verification. One measure of agreement that recently received attention in remote sensing applications (Congalton, 1980; Congalton et al., 1981; and Congalton and Mead, 1983) is Cohen's Kappa coefficient of agreement (Cohen, 1960). This statistic has been given attention by Bishop et al. (1975, pp. 395-400). Landis and Koch (1977) define this same situation as "the measure of observer agreement for categorical data."

This paper introduces (1) background development of the Kappa coefficient of agreement as a measure of total map accuracy and (2) the coefficient of conditional agreement (conditional Kappa) as a measure of individual category accuracy.

Development of the Kappa coefficient, its variance, and tests for significant differences are briefly reviewed. Kappa coefficient values and variances are computed for the two photointerpreter error matrices (Congalton and Mead, 1983), compared with percent correct, and tested for significant difference. Values of conditional Kappa, its variance, and percent correct are computed for individual categories of the first photointerpreter matrix of Table 1. Conditional Kappa and percent correct values are compared.

Accuracy indices (coefficients) for individual categories published in the remote sensing literature are computed and compared with each other and with conditional Kappa. Values for percent correct are also computed for each category. Equations for these indices are modified into terms similar to conditional Kappa so they can be compared. The matrix of Turk (1979) is transposed from the original (Table 2) to be compatible with conditional Kappa and used for this comparison. Table 3 summarizes the results.

TABLE 3. COMPARISON OF METHODS FOR INDIVIDUAL CATEGORY ACCURACY*

Category        Short     Turk      Hellden   K̂_i × 100   % correct
other           91.36%    92.80%    95.48%    96.08%      98.01%
slight & mild   66.67     70.54     80.00     84.43       87.72
moderate        46.99     15.05     63.93     48.76       57.35
severe          54.35     71.01     70.42     61.88       65.79
very severe     60.00     74.41     75.00     74.36       75.00

*Data from corn leaf blight experiment (Bauer et al., 1971).
A COEFFICIENT OF AGREEMENT AS A MEASURE OF ACCURACY
Cohen (1960) developed a coefficient of agreement (called Kappa) for nominal scales which measures the relationship of beyond-chance agreement to expected disagreement. This measure of agreement uses all cells in the matrix, not just the diagonal elements. The estimate of Kappa (K̂) is the proportion of agreement after chance agreement is removed from consideration; that is,
\hat{K} = (p_o - p_c) / (1 - p_c)

in which

p_o = \sum_i p_{ii} = proportion of units which agree,
p_c = \sum_i (p_{i+} p_{+i}) = proportion of units expected to agree by chance, and
p_{ij} = x_{ij} / N,

where N = total number of counts in the matrix, x_{ij} = number of counts in the ijth cell, and a "+" subscript represents summation over that index.
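As a sketch, the Kappa estimate above can be computed directly from any square error matrix of counts. The function below is a minimal pure-Python illustration; the 3-category matrix is hypothetical (not the paper's Table 1 data).

```python
def kappa(matrix):
    """Estimate Cohen's Kappa from a square classification error matrix.
    Rows = interpretation (classification), columns = reference (verification)."""
    k = len(matrix)
    n = float(sum(sum(row) for row in matrix))               # total counts N
    p_o = sum(matrix[i][i] for i in range(k)) / n            # observed agreement, sum of p_ii
    row = [sum(matrix[i][j] for j in range(k)) / n for i in range(k)]  # p_i+
    col = [sum(matrix[i][j] for i in range(k)) / n for j in range(k)]  # p_+j
    p_c = sum(row[i] * col[i] for i in range(k))             # chance agreement, sum of p_i+ * p_+i
    return (p_o - p_c) / (1.0 - p_c)

# hypothetical 3-category matrix of counts (illustrative only)
m = [[30, 5, 5],
     [4, 25, 1],
     [6, 0, 24]]
print(kappa(m))   # about 0.682
```

Here p_o = 0.79 and p_c = 0.34, so K̂ = 0.45/0.66 ≈ 0.682, illustrating how chance agreement is removed from both the numerator and the denominator.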
Cohen indicates that K̂ = 0 when the obtained agreement equals chance agreement. Positive values of Kappa occur from greater than chance agreement; negative values of Kappa are from less than chance agreement. The upper limit of Kappa (+1.00) occurs only when there is perfect agreement. The lower limit of Kappa depends on the marginal distributions and is likely to have no practical interest.
Cohen (1960, p. 43) shows how Kappa is similar to the product-moment correlation for the dichotomous case. Krippendorf (1970) shows that Kappa belongs to a family of bivariate agreement coefficients, in the form

coefficient = 1 - (observed disagreement) / (expected disagreement)

Thus, the coefficient is zero for chance agreement, unity for perfect agreement, and negative for less than chance agreement. Krippendorf shows that several coefficients, including Cohen's Kappa and Pearson's intraclass correlation coefficient (Snedecor and Cochran, 1967, pp. 294-296), are related. Furthermore, Fleiss (1975) presents and criticizes a number of proposed indexes of agreement, and stresses the importance of incorporating the value of the index expected by chance alone. He also states (pp. 658-659) that Cohen's Kappa coefficient is one of two chance-corrected measures of agreement defensible as intraclass correlation coefficients. The other is Maxwell and Pilliner's coefficient (Maxwell and Pilliner, 1968).
Computation for Cohen's Kappa coefficient treats all disagreements equally. Accordingly, Cohen (1968) developed weighted Kappa to accommodate the sense of the investigator that some disagreements (non-diagonal cells) are more serious than others. Thus, weighted Kappa describes the proportion of weighted agreement corrected for chance. Equivalence of weighted Kappa with the intraclass correlation coefficient under general conditions is demonstrated by Fleiss and Cohen (1973).

The correct formulation for the estimated large sample variances of Kappa and weighted Kappa is given by Fleiss et al. (1969). Earlier versions of the variances given by Cohen (1960), Spitzer et al. (1967), Cohen (1968), and Everitt (1968) are incorrect. The approximate large sample variance of Kappa is given by Bishop et al. (1975, p. 396).

Cohen (1960) describes tests of significance between two independent Kappas by

Z = (\hat{K}_1 - \hat{K}_2) / [V(\hat{K}_1) + V(\hat{K}_2)]^{1/2}

where Z is the standard normal deviate. If Z exceeds 1.96, then the difference is significant at the 95 percent probability level.

A PRACTICAL APPLICATION

Congalton and Mead (1983) utilize Cohen's Kappa in a practical application to compare the results of several photointerpreter classifications. They show classification error matrices for photointerpreters (PI) #1 and #2 (Table 1). Their accuracy analysis was based on "an appropriate sampling scheme" and "an adequate number of samples" (p. 70). The results of their analysis give values for Kappa and its variance for each photointerpreter, and for the Z value which measures significance between the Kappa values. Values for total percent correct are given for comparison.

        K̂ × 100    V(K̂)       Total % correct
PI #1   31.991%    0.002881   52.8%
PI #2   29.420     0.002628   49.7

Z = 0.3465

Note that the Kappa coefficient values give significantly less accuracy than do the total percent correct values. The difference between these two independent Kappa values is not significant at the 95 percent probability level because the Z statistic does not exceed 1.96. This means these two photointerpretations are not significantly different.

CONDITIONAL AGREEMENT FOR THE iTH CATEGORY

Coefficients of agreement, such as Kappa and weighted Kappa, have concerned the entire matrix. Conditional agreement between two observers has been developed "for only those items which an observer has placed into the ith specific category" (Light, 1971, p. 367; Bishop et al., 1975, p. 397). Equations for the conditional coefficient of agreement (K_i) and its estimate (K̂_i) are (Bishop et al., 1975, p. 397, eqs. 11.4-8 and 11.4-9)

K_i = (p_{ii} - p_{i+} p_{+i}) / (p_{i+} - p_{i+} p_{+i})    or    \hat{K}_i = (N x_{ii} - x_{i+} x_{+i}) / (N x_{i+} - x_{i+} x_{+i})

Bishop et al. (1975, p. 397) indicate that the numerator and denominator of Kappa are obtained by summing the respective numerators and denominators of K̂_i separately over all categories. Accordingly, conditional agreement, K̂_i, serves as a measure of accuracy for a given category.

A PRACTICAL APPLICATION

From the first matrix of Table 1 (Congalton and Mead, 1983), conditional agreement coefficients are computed for each category between photointerpreter #1 and the reference data, giving values of conditional Kappa, its variance, and percent correct for each category. Again note that in each case the values for conditional Kappa are significantly less than the values for percent correct.

OTHER ATTEMPTS TO DESCRIBE CATEGORY ACCURACY OF REMOTELY SENSED DATA

METHOD OF TURK (1979)

The ground truth (GT) index developed by Turk (1979) is "the proportion of agreement corrected for chance agreement," and is given in modified form as

\theta_i = (X_{ii} - R_{ii}) / (1 - R_{ii})

where X_{ii} = the actual correct classification and R_{ii} = the lucky guesses (chance agreement). R_{ii} is a derived value.

It is evident that the GT index is also a member of the family of agreement coefficients described by Krippendorf (1970), members of which are obtained by variation in developing the observed disagreement and the expected frequencies. Turk, as does Cohen (1960), uses the actual correct classification value, X_{ii}. Turk obtains his value for chance agreement, R_{ii}, from a variation of Goodman (1968) based on "quasi-independence" of the nondiagonals. Data are fit to the probability model using the method of iterative proportional fitting. The joint occurrence probability is computed as the product of a row parameter and a column parameter, both of which are normalized to unity. Cohen (1960) computes chance agreement directly as the product of the marginal probabilities.

Table 2 gives the classification error matrix used by Turk (1979, p. 70, Table 2). Turk's matrix has been transposed to have the remote sensing result "on the rows." Turk based this study on the corn leaf blight experiment of Bauer et al. (1971). Note that the "very severe" category has only eight sample points, too few to have been considered as a separate category.
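For comparison with the category indices discussed here, the conditional Kappa estimate of Bishop et al. (1975), K̂_i = (N x_ii - x_i+ x_+i) / (N x_i+ - x_i+ x_+i), can be computed directly from matrix counts. A minimal pure-Python sketch follows; the 3-category matrix is hypothetical, not the data of Table 1 or Table 2.

```python
def conditional_kappa(matrix, i):
    """Conditional Kappa estimate for category i (Bishop et al., 1975):
    K_i = (N*x_ii - x_i+*x_+i) / (N*x_i+ - x_i+*x_+i),
    where x_i+ is the row sum and x_+i the column sum for category i."""
    n = float(sum(sum(row) for row in matrix))
    x_ii = matrix[i][i]
    x_i_plus = sum(matrix[i])                  # row i: counts interpreted as i
    x_plus_i = sum(row[i] for row in matrix)   # column i: reference counts of i
    return (n * x_ii - x_i_plus * x_plus_i) / (n * x_i_plus - x_i_plus * x_plus_i)

# hypothetical 3-category error matrix (illustrative only)
m = [[30, 5, 5],
     [4, 25, 1],
     [6, 0, 24]]
print([round(conditional_kappa(m, i), 3) for i in range(3)])   # [0.583, 0.762, 0.714]
```

For this matrix the per-category numerators sum to 4500 and the denominators to 6600, and 4500/6600 ≈ 0.682 reproduces the overall Kappa estimate, which is the summation property Bishop et al. describe.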
METHOD OF HELLDEN (1980)
The mean accuracy (MA) index developed by Hellden (1980, p. 18) "denotes the probability that a randomly chosen point of a specific class on the map has a correspondence of the same class in the same position in the field and that a randomly chosen point in the field of the same class has a correspondence of the same class in the same position on the map," and is given in modified form as
MA = 2 X_{ii} / (X_{+i} + X_{i+})
where
X_{ii} = number of correctly mapped sample units for the specific class,
X_{+i} = number of sampled field units for the specific class, and
X_{i+} = number of sampled map units for the specific class.
The MA index is a logical (heuristic) development of Hellden and cannot be derived on either a probability basis or a mathematical basis (correspondence between Rosenfield and Hellden).
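Hellden's MA index is straightforward to compute from the error matrix; the sketch below uses a hypothetical matrix for illustration.

```python
def hellden_ma(matrix, i):
    """Hellden's mean accuracy index: MA = 2*X_ii / (X_i+ + X_+i)."""
    x_ii = matrix[i][i]
    x_i_plus = sum(matrix[i])                  # map units assigned to class i
    x_plus_i = sum(row[i] for row in matrix)   # field units of class i
    return 2.0 * x_ii / (x_i_plus + x_plus_i)

# hypothetical counts: 30 correct, row sum 40, column sum 40 for class 0
m = [[30, 5, 5],
     [4, 25, 1],
     [6, 0, 24]]
print(hellden_ma(m, 0))   # 0.75
```

Algebraically, 2 X_ii / (X_i+ + X_+i) is the harmonic mean of the two conditional percent-correct ratios X_ii/X_i+ and X_ii/X_+i, which is one way to see how the index balances errors of commission and of omission.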
METHOD OF SHORT (1982)
The mapping accuracy (MA) index for any class i developed by Short (1982, p. 259, Table 6-2) is given in modified form as
MA = X_{ii} / (X_{i+} + X_{+i} - X_{ii})
This equation of Short is similar to the Jaccard coefficient, which Piper (1983) credits to Jaccard (1908). Piper shows that the Jaccard coefficient has the following properties: (1) it equals zero if there are no positive matches, (2) it equals one if there is perfect agreement, and (3) it is not affected by sample size. In addition, Piper shows that the Jaccard coefficient has the hypergeometric distribution, of which the binomial distribution is a good approximation for large sample sizes.
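Short's index can likewise be computed from the error matrix counts; the matrix below is the same hypothetical example, not data from the paper.

```python
def short_ma(matrix, i):
    """Short's mapping accuracy for class i (a Jaccard-type ratio):
    MA = X_ii / (X_i+ + X_+i - X_ii)."""
    x_ii = matrix[i][i]
    x_i_plus = sum(matrix[i])                  # row sum for class i
    x_plus_i = sum(row[i] for row in matrix)   # column sum for class i
    return float(x_ii) / (x_i_plus + x_plus_i - x_ii)

m = [[30, 5, 5],
     [4, 25, 1],
     [6, 0, 24]]
print(short_ma(m, 0))   # 0.6
```

The denominator counts every sample unit involved with class i in either the map or the field (the union), so the index penalizes both commission and omission errors but, unlike conditional Kappa, makes no correction for chance agreement.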
COMPARISON OF METHODS
To bring the various methods into perspective, Hellden's and Short's methods, along with percent correct and conditional Kappa, were applied to Turk's data matrix. The results of this comparison are given in Table 3.
Note in Table 3 that the order of percentages for each category in general has an upward trend. This observation indicates that percent correct tends to overestimate classification accuracy, while the coefficients given by Short, Turk, and Hellden tend to underestimate classification accuracy when compared to conditional Kappa. This observation does not always hold, because the coefficients depend upon the relative value and location of the cell frequencies in the classification error matrix.
SUMMARY AND CONCLUSIONS
The Kappa coefficient of agreement is shown to be a measure of accuracy for the thematic classification as a whole, and the coefficient of conditional Kappa is presented as an accuracy measure for the individual category. A family of such coefficients exists which correct for chance agreement, but the Kappa coefficient is one of the few which are also defensible as intraclass correlation coefficients. These coefficients use the information in the classification error matrix resulting from errors of commission and of omission. In the past, the usual designation for accuracy has been percent correct, which uses only the diagonal elements of the matrix and which appears to give inflated accuracy. Recently, the remote sensing literature has given several coefficients as accuracy indices. Only one of these (Turk, 1979) corrects for chance agreement, and none has the statistical basis of being an intraclass correlation coefficient.
It is, therefore, recommended that the coefficients of Kappa and conditional Kappa be adopted by the remote sensing community as measures of accuracy for the thematic classification as a whole and for the individual categories.
REFERENCES
Bauer, M. E., P. H. Swain, R. P. Mroczynski, P. E. Anuta, and R. B. MacDonald, 1971. Detection of southern corn leaf blight by remote sensing techniques: Proceedings, 7th International Symposium on Remote Sensing of the Environment, Ann Arbor, Mich., May 17-21, 1971, v. 4, pp. 693-704.

Bishop, Y. M. M., S. E. Fienberg, and P. W. Holland, 1975. Discrete Multivariate Analysis - Theory and Practice: Cambridge, Mass., The MIT Press.

Cohen, J., 1960. A coefficient of agreement for nominal scales: Educational and Psychological Measurement, v. 20, no. 1, pp. 37-46.

---, 1968. Weighted Kappa: Nominal scale agreement with provision for scaled disagreement or partial credit: Psychological Bulletin, v. 70, no. 4, pp. 213-220.

Congalton, R. G., 1980. Statistical techniques for analysis of Landsat classification accuracy: Paper presented at meeting of the American Society of Photogrammetry, St. Louis, Mo., March 11.

Congalton, R. G., R. A. Mead, R. G. Oderwald, and J. Heinen, 1981. Analysis of forest classification accuracy: Blacksburg, Virginia Polytechnic Institute and State University [NTIS PB82-1645000].
Congalton, R. G., and R. A. Mead, 1983. A quantitative method to test for consistency and correctness in photointerpretation: Photogrammetric Engineering and Remote Sensing, v. 49, no. 1, pp. 69-74.

Everitt, B. S., 1968. Moments of the statistics Kappa and weighted Kappa: British Journal of Mathematical and Statistical Psychology, v. 21, pp. 97-103.

Fleiss, J. L., 1975. Measuring agreement between two judges on the presence or absence of a trait: Biometrics, v. 31, pp. 651-659.

Fleiss, J. L., J. Cohen, and B. S. Everitt, 1969. Large sample standard errors of Kappa and weighted Kappa: Psychological Bulletin, v. 72, no. 5, pp. 322-327.

Fleiss, J. L., and J. Cohen, 1973. The equivalence of weighted Kappa and the intraclass correlation coefficient as measures of reliability: Educational and Psychological Measurement, v. 33, pp. 613-619.

Goodman, L. A., 1968. The analysis of cross-classified data: Independence, quasi-independence, and interaction in contingency tables with and without missing entries: Journal of the American Statistical Association, v. 63, pp. 1091-1131.

Hellden, U., 1980. A Test of Landsat-2 Imagery and Digital Data for Thematic Mapping, Illustrated by an Environmental Study in Northern Kenya: Sweden, Lund University Natural Geography Institute Report No. 47.

Jaccard, P., 1908. Nouvelles recherches sur la distribution florale: Bull. Soc. Vaud. Sci. Nat., v. 44, pp. 223-270.

Krippendorf, K., 1970. Bivariate agreement coefficients for reliability of data, in E. F. Borgatta and G. W. Bohrnstedt (eds.), Sociological Methodology: San Francisco, Jossey-Bass.

Landis, J. R., and G. G. Koch, 1977. The measurement of observer agreement for categorical data: Biometrics, v. 33, pp. 159-174.

Light, R. J., 1971. Measures of response agreement for qualitative data: Some generalizations and alternatives: Psychological Bulletin, v. 76, no. 5, pp. 365-377.

Maxwell, A. E., and A. E. G. Pilliner, 1968. Deriving coefficients of reliability and agreement for ratings: British Journal of Mathematical and Statistical Psychology, v. 21, pp. 105-116.

Piper, S. E., 1983. The evaluation of the spatial accuracy of computer classification: Proceedings, Machine Processing of Remotely Sensed Data, Purdue University, Laboratory for Applications of Remote Sensing, West Lafayette, Ind., pp. 303-309.

Short, N. M., 1982. The Landsat Tutorial Workbook - Basics of Satellite Remote Sensing: Greenbelt, Md., Goddard Space Flight Center, NASA Reference Publication 1078.

Snedecor, G. W., and W. G. Cochran, 1967. Statistical Methods: Ames, Iowa State University Press.

Spitzer, R. L., J. Cohen, J. L. Fleiss, and J. Endicott, 1967. Quantification of agreement in psychiatric diagnosis: Archives of General Psychiatry, v. 17, July, pp. 83-87.

Turk, G., 1979. GT Index: A measure of the success of prediction: Remote Sensing of Environment, v. 8, pp. 65-75.

(Received 26 May 1984; revised and accepted 26 September 1985)