Measurement of Interobserver Disagreement: Correction of Cohen’s Kappa for Negative Values
As measures of interobserver agreement for both nominal and ordinal categories, Cohen’s kappa coefficients appear to be the most widely used with simple and meaningful interpretations. However, for negative coefficient values when (the probability of) observed disagreement exceeds chance-expected disagreement, no fixed lower bounds exist for the kappa coefficients and their interpretations are no longer meaningful and may be entirely misleading. In this paper, alternative measures of disagreement (or negative agreement) are proposed as simple corrections or modifications of Cohen’s kappa coefficients. The new coefficients have a fixed lower bound of −1 that can be attained irrespective of the marginal distributions. A coefficient is formulated for the case when the classification categories are nominal and a weighted coefficient is proposed for ordinal categories. Besides coefficients for the overall disagreement across categories, disagreement coefficients for individual categories are presented. Statistical inference procedures are developed and numerical examples are provided.
1. Introduction
When two (or more) observers are independently classifying observations or items (objects) into the same set of mutually exclusive and exhaustive categories, it may be of interest to have a summary description of the extent to which the observers agreed in their classifications. The total probability (proportion) of agreement is one such obvious summary measure. However, since some agreement is to be expected purely by chance, Cohen [1] introduced the kappa coefficient of agreement as one that corrects for the chance-expected agreement. Cohen’s kappa has since become widely used in a variety of situations and discussed extensively in various textbooks (e.g., [2–5]) and a wide variety of journal publications (e.g., [6–10]).
In order to define the kappa coefficient in terms of probabilities (proportions), let $p_{ij}$ be the probability that a random observation is assigned to category $i$ by Observer 1 and to category $j$ by Observer 2 for $i = 1, \ldots, I$ and $j = 1, \ldots, I$. Furthermore, let $p_{i+}$ denote the probability that a randomly chosen observation is assigned to category $i$ by Observer 1 and $p_{+j}$ the probability that a randomly chosen observation is assigned to category $j$ by Observer 2 ($i, j = 1, \ldots, I$). If these probabilities are represented in terms of a two-way contingency table with $I$ rows and $I$ columns, then $p_{ij}$ becomes the probability in cell ($i, j$), $\{p_{i+}\}$ becomes the marginal row distribution, and $\{p_{+j}\}$ becomes the marginal column distribution. With the row categories and the column categories being the same, $\sum_{i=1}^{I} p_{ii}$ is the total probability of agreement between the two observers. Cohen [1] used the overall statistical independence as the condition for chance agreement and defined $\kappa$ as
$$\kappa = \frac{P_o - P_e}{1 - P_e}, \tag{1}$$
with $P_o = \sum_{i=1}^{I} p_{ii}$ and $P_e = \sum_{i=1}^{I} p_{i+}p_{+i}$ being the observed agreement probability and the chance-expected agreement probability, respectively. In terms of the observed and chance-expected disagreement probabilities $D_o = 1 - P_o$ and $D_e = 1 - P_e$, $\kappa$ can alternatively be expressed as
$$\kappa = 1 - \frac{D_o}{D_e}. \tag{2}$$
It is clear from (1)-(2) that $\kappa = 1$ if the interobserver agreement is perfect, that is, if $P_o = 1$, $\kappa = 0$ if $P_o = P_e$, and $\kappa < 0$ if $P_o < P_e$. The case of negative $\kappa$-values will be discussed further in the next section.
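As a small illustration of the definition above, the following sketch computes Cohen's kappa from a square table of counts, using both the agreement form and the equivalent disagreement form. The function names and the 2 × 2 counts are hypothetical and chosen only for illustration.

```python
def agreement_probabilities(table):
    """Return (observed, chance-expected) agreement for a square table."""
    k = len(table)
    total = sum(sum(row) for row in table)
    p = [[cell / total for cell in row] for row in table]
    row_marg = [sum(p[i]) for i in range(k)]
    col_marg = [sum(p[i][j] for i in range(k)) for j in range(k)]
    p_obs = sum(p[i][i] for i in range(k))
    p_exp = sum(row_marg[i] * col_marg[i] for i in range(k))
    return p_obs, p_exp


def cohen_kappa(table):
    """Agreement form: (P_o - P_e) / (1 - P_e)."""
    p_obs, p_exp = agreement_probabilities(table)
    return (p_obs - p_exp) / (1 - p_exp)


def cohen_kappa_disagreement(table):
    """Disagreement form: 1 - D_o / D_e, with D = 1 - P."""
    p_obs, p_exp = agreement_probabilities(table)
    return 1 - (1 - p_obs) / (1 - p_exp)


# Hypothetical counts: rows are Observer 1, columns are Observer 2.
ratings = [[20, 5], [10, 15]]
print(round(cohen_kappa(ratings), 4))               # 0.4
print(round(cohen_kappa_disagreement(ratings), 4))  # 0.4
```

Both forms agree because the numerator and denominator differ only by the substitution of disagreement for agreement probabilities.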
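In addition to measuring the overall agreement between two observers, it may be of interest to assess their level of agreement for specific categories. Spitzer et al. [11] first proposed such a measure by collapsing the original $I \times I$ table into a 2 × 2 table, one such 2 × 2 table for each category $i$, and then computing $\kappa$ in (1)-(2) for each such 2 × 2 table (see also [2, Chapter 18]). As a simpler procedure yielding the same numerical results, Kvålseth [12] proposed the following form of kappa for the specific category $i$ ($i = 1, \ldots, I$):
$$\kappa_i = \frac{p_{ii} - p_{i+}p_{+i}}{(p_{i+} + p_{+i})/2 - p_{i+}p_{+i}} \tag{3}$$
$$\phantom{\kappa_i} = 1 - \frac{\sum_{D_i} p_{jk}}{\sum_{D_i} p_{j+}p_{+k}}, \tag{4}$$
where $\sum_{D_i}$ denotes the summation over all disagreement cells for category $i$. With, say, $I = 3$ and $i = 1$, $D_1$ consists of cells (1, 2), (1, 3), (2, 1), and (3, 1). For complete agreement with respect to category $i$, $\kappa_i = 1$ when $p_{ii} = p_{i+} = p_{+i}$, $\kappa_i = 0$ for the independence $p_{ii} = p_{i+}p_{+i}$, and $\kappa_i < 0$ when observed disagreement exceeds chance disagreement.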
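To account for the potential fact that some disagreements may be more serious than others, as when the categories have a natural order, Cohen [13] and Cicchetti and Allison [14] independently introduced the weighted kappa $\kappa_w$, which can be expressed as
$$\kappa_w = \frac{\sum_i\sum_j w_{ij}p_{ij} - \sum_i\sum_j w_{ij}p_{i+}p_{+j}}{1 - \sum_i\sum_j w_{ij}p_{i+}p_{+j}} \tag{5}$$
$$\phantom{\kappa_w} = 1 - \frac{\sum_i\sum_j v_{ij}p_{ij}}{\sum_i\sum_j v_{ij}p_{i+}p_{+j}}, \tag{6}$$
where each weight $w_{ij} \in [0, 1]$, with $w_{ii} = 1$ and $v_{ij} = 1 - w_{ij}$ for all $i$ and $j$ and with the following logical weight choices (e.g., [2, page 609]):
$$w_{ij} = 1 - \frac{|i - j|}{I - 1} \quad \text{or} \quad w_{ij} = 1 - \frac{(i - j)^2}{(I - 1)^2}. \tag{7}$$
For a specific category $i$, Kvålseth [15] proposed the following measure as an extension of (4):
$$\kappa_{wi} = 1 - \frac{\sum_{D_i} v_{jk}p_{jk}}{\sum_{D_i} v_{jk}p_{j+}p_{+k}}, \tag{8}$$
with $D_i$ denoting the set of all disagreement cells for category $i$. The values of these weighted measures equal 1 for perfect agreement and 0 if observed agreement equals chance agreement, with negative values if observed agreement is less than chance agreement.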
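The weighted coefficient can be sketched in code as follows. This is a minimal illustration that assumes the two commonly used weight systems, linear and quadratic in the category distance; the function name, the `power` parameter, and the example counts are hypothetical.

```python
def weighted_kappa(table, power=1):
    """Weighted kappa in its disagreement form, 1 - D_obs / D_exp.

    Assumed disagreement weights: v[i][j] = (|i - j| / (k - 1)) ** power,
    so power=1 gives linear weights and power=2 quadratic weights.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    p = [[cell / total for cell in row] for row in table]
    row_marg = [sum(p[i]) for i in range(k)]
    col_marg = [sum(p[i][j] for i in range(k)) for j in range(k)]
    v = [[(abs(i - j) / (k - 1)) ** power for j in range(k)] for i in range(k)]
    d_obs = sum(v[i][j] * p[i][j] for i in range(k) for j in range(k))
    d_exp = sum(v[i][j] * row_marg[i] * col_marg[j]
                for i in range(k) for j in range(k))
    return 1 - d_obs / d_exp


# Hypothetical ordinal ratings with three categories.
ratings = [[10, 2, 3], [4, 12, 1], [2, 6, 10]]
print(weighted_kappa(ratings, power=1))  # linear weights
print(weighted_kappa(ratings, power=2))  # quadratic weights
```

With only two categories the single off-diagonal distance receives full disagreement weight, so the weighted and unweighted coefficients coincide.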
The kappa coefficients in (1)–(8) may be appropriate measures of agreement when their values are nonnegative, but not when their values are negative as discussed in the next section. From a theoretical point of view at least, it is certainly troublesome that their negative values lack appropriate meaning and validity. This paper presents simple corrections or modifications of the kappa coefficients in (1)–(8) such that the negative values of the corrected coefficients provide appropriate representation of the extent to which the observers disagree. Statistical inference procedures for the new coefficients or measures are developed. Numerical examples are also given.
2. Comments on Kappa
One of the most appealing properties of kappa, and undoubtedly a reason for its popularity, is its simplicity and transparency. All the kappa coefficients in (1)–(8) have intuitively appealing and meaningful interpretations. In the case of in (1)-(2), for example, it seems most meaningful to interpret any -value in terms of (2) as the proportional difference between and , that is, the relative extent to which the observed disagreement probability is less than the disagreement probability attributable to chance. By comparison, the norming used in (1) is not unique, with any number of different potential denominators such that .
Complete statistical independence, that is, for , is a sufficient, but not a necessary, condition for the kappa coefficients in (1)–(8) to take on value 0. In fact, for in (1) and , it is not necessary that for . It is possible that even if for all and when . As a simple example, considerwhere all marginal probabilities equal 1/3. And for all and , but . In this case, from (3), , , and, from (6) and (7), for .
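Note that the two expressions for $\kappa$ in (1)-(2) are weighted arithmetic means of the expressions for $\kappa_i$ in (3)-(4). Thus, from (1) and (3), for instance, it is seen that
$$\kappa = \frac{\sum_{i=1}^{I} a_i \kappa_i}{\sum_{i=1}^{I} a_i}, \quad a_i = \frac{p_{i+} + p_{+i}}{2} - p_{i+}p_{+i}. \tag{10}$$
Similarly, for the weighted measures in (6) and (8),
$$\kappa_w = \frac{\sum_{i=1}^{I} b_i \kappa_{wi}}{\sum_{i=1}^{I} b_i}, \quad b_i = \sum_{D_i} v_{jk}p_{j+}p_{+k}. \tag{11}$$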
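In order to show that the interobserver agreement for a specific category can be determined directly from (3)-(4), without the need to collapse the original table as suggested by Spitzer et al. [11], consider that the original table with probability components $p_{ii}$, $p_{i+}$, and $p_{+i}$ for category $i$ is collapsed into the following 2 × 2 table:
$$\begin{array}{cc} p_{ii} & p_{i+} - p_{ii} \\ p_{+i} - p_{ii} & 1 - p_{i+} - p_{+i} + p_{ii} \end{array} \tag{12}$$
When (12) is substituted into (1), $\kappa_i$ in (3) results immediately. However, no such corresponding procedure applies to $\kappa_w$ in (6) and $\kappa_{wi}$ in (8). Note that, for $I = 2$, $\kappa_1 = \kappa_2 = \kappa$ and $\kappa_{w1} = \kappa_{w2} = \kappa_w$.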
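The two statements above, that the direct category-specific formula matches kappa on the collapsed 2 × 2 table and that the overall kappa is a weighted mean of the category-specific values, can be checked numerically. The sketch below assumes the category-specific form (observed minus chance agreement for the category, divided by the mean of the two marginals minus chance agreement); all function names and the example probabilities are hypothetical.

```python
def margins(p):
    k = len(p)
    rows = [sum(p[i]) for i in range(k)]
    cols = [sum(p[i][j] for i in range(k)) for j in range(k)]
    return rows, cols


def kappa(p):
    rows, cols = margins(p)
    p_obs = sum(p[i][i] for i in range(len(p)))
    p_exp = sum(r * c for r, c in zip(rows, cols))
    return (p_obs - p_exp) / (1 - p_exp)


def category_weight(p, i):
    """Denominator of the category-specific kappa for category i."""
    rows, cols = margins(p)
    return (rows[i] + cols[i]) / 2 - rows[i] * cols[i]


def kappa_category(p, i):
    rows, cols = margins(p)
    return (p[i][i] - rows[i] * cols[i]) / category_weight(p, i)


def collapse(p, i):
    """Collapse to category i versus all other categories."""
    rows, cols = margins(p)
    a = p[i][i]
    return [[a, rows[i] - a], [cols[i] - a, 1 - rows[i] - cols[i] + a]]


counts = [[10, 2, 3], [4, 12, 1], [2, 6, 10]]
n = sum(map(sum, counts))
p = [[c / n for c in row] for row in counts]

for i in range(3):
    # Direct formula and collapsed-table kappa should coincide.
    print(i, round(kappa_category(p, i), 6), round(kappa(collapse(p, i)), 6))

weights = [category_weight(p, i) for i in range(3)]
mean = sum(w * kappa_category(p, i) for i, w in enumerate(weights)) / sum(weights)
print(round(mean, 6), round(kappa(p), 6))  # weighted mean equals overall kappa
```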
In spite of its wide appeal, kappa is not without some criticism or controversy, especially related to its dependence on the marginal distributions $\{p_{i+}\}$ and $\{p_{+j}\}$ (see, e.g., [4, pages 168–173]). The chance agreement (disagreement) for all the kappa coefficients in (1)–(8) is based on the marginal distributions. If those distributions are highly uneven (nonuniform) and nearly symmetric, the values of the kappa coefficients may become unreasonably small due to the relatively large chance agreements.
A clear limitation of the kappa coefficients relates to situations when the values of those coefficients become negative and lack meaningful interpretations. This limitation has generally been ignored in published studies, partly perhaps because such studies using kappa have typically involved positive kappa values. Negative kappa values could, however, lead to incorrect interpretations, results, and conclusions. Also, if, for instance, in (1)-(2), it is possible that some in (3)-(4).
For the overall kappa in (1)-(2), when $P_o < P_e$ so that $\kappa < 0$, $\kappa$ has no reasonable meaning in terms of (1), but does in terms of (2); that is, $|\kappa| = (D_o - D_e)/D_e$ is the relative extent to which $D_o$ exceeds $D_e$. The same argument applies to $\kappa_i$ in (3)-(4). However, two serious limitations of all the kappa coefficients are that, for negative values, (a) the coefficients have no fixed lower bounds, making it impossible to appropriately assess the size or magnitude of coefficient values, and (b) the coefficients take on negative values that do not appear reasonable as discussed below.
The minimum values of in (1) and of in (3) depend exclusively on the marginal distributions and . Values such as or are uninformative since they cannot be related to any fixed lower bounds on or such as , irrespective of the marginal distributions. There is no basis for making any interpretation or statement such as indicating a “moderate,” “low,” or “high” level of disagreement between the two observers.
There is also some confusion in the literature about the minimum value of , with some stating that the minimum value is or [5, page 4] and others stating that it is when for all and [17, page 113]. Such statements are clearly incorrect. In fact, the minimum value equals if, and only if, . Similarly, the minimum value of in (3) equals only when the harmonic mean of and equals 0.5.
What is needed are chance-corrected measures of disagreement, both weighted and unweighted, which have fixed lower bounds of $-1$ and which are attainable irrespective of the marginal distributions. This requirement has also been clearly emphasized by others [18]. Such measures will be introduced in the next section as simple corrections or modifications of the existing kappa coefficients.
3. Proposed Kappa Coefficients of Disagreement
3.1. Overall Coefficients
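When $P_o < P_e$ and hence $D_o > D_e$, it seems most logical and intuitive to define negative overall kappa as
$$\kappa^{-} = -\frac{D_o - D_e}{1 - D_e} = \frac{P_o}{P_e} - 1, \tag{13}$$
where $P_o$, $P_e$, $D_o$, and $D_e$ are the probabilities defined in (1)-(2). Consequently,
$$\kappa^{*} = \begin{cases} \kappa, & P_o \ge P_e, \\ \kappa^{-}, & P_o < P_e, \end{cases} \tag{14}$$
where, of course, $\kappa = \kappa^{-} = 0$ for $P_o = P_e$. Except for the minus sign, $\kappa^{-}$ in (13) follows from $\kappa$ in (1)-(2) by simply substituting disagreement probabilities for the corresponding agreement probabilities.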
The properties of $\kappa^{-}$ can be summarized as follows:
(P1) $\kappa^{-}$ is well defined if at least two cells of the contingency table contain nonzero probabilities.
(P2) $-1 \le \kappa^{-} \le 0$, with $\kappa^{-} = 0$ if observed agreement (disagreement) equals chance agreement (disagreement) (i.e., $P_o = P_e$ or $D_o = D_e$) and $\kappa^{-} = -1$ if, and only if, $P_o = 0$.
(P3) $\kappa^{-}$ can take on value $-1$ for any marginal distributions $\{p_{i+}\}$ and $\{p_{+j}\}$.
(P4) $\kappa^{-}$ has a meaningful interpretation as the relative extent to which the observed agreement probability is less than that expected by chance alone.
(P5) $\kappa^{-}$ takes on values that appear reasonable throughout its 0 to $-1$ range.
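The proposed correction can be sketched as a piecewise function: the usual kappa when observed agreement is at least the chance-expected agreement, and otherwise the ratio of observed to chance-expected agreement minus one. This reading is an assumption based on the description above (substituting disagreement for agreement probabilities and changing the sign); the function names and example tables are hypothetical.

```python
def agreement(table):
    k = len(table)
    total = sum(sum(row) for row in table)
    p = [[cell / total for cell in row] for row in table]
    rows = [sum(p[i]) for i in range(k)]
    cols = [sum(p[i][j] for i in range(k)) for j in range(k)]
    p_obs = sum(p[i][i] for i in range(k))
    p_exp = sum(r * c for r, c in zip(rows, cols))
    return p_obs, p_exp


def cohen_kappa(table):
    p_obs, p_exp = agreement(table)
    return (p_obs - p_exp) / (1 - p_exp)


def corrected_kappa(table):
    """Cohen's kappa for P_o >= P_e, otherwise P_o / P_e - 1 (assumed form)."""
    p_obs, p_exp = agreement(table)
    if p_obs >= p_exp:
        # Undefined (division by zero) if all mass sits in one diagonal cell.
        return (p_obs - p_exp) / (1 - p_exp)
    return p_obs / p_exp - 1


# Two hypothetical tables with complete disagreement but different margins.
balanced = [[0, 30], [20, 0]]
skewed = [[0, 45], [5, 0]]
for t in (balanced, skewed):
    print(round(cohen_kappa(t), 4), corrected_kappa(t))
# Cohen's kappa gives -0.9231 and -0.2195; the corrected value is -1.0 in both.
```

The example shows the point made in property (P3): with no observed agreement, the corrected coefficient reaches its lower bound regardless of how uneven the margins are, whereas Cohen's kappa moves toward 0 as the margins become more skewed.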
While Properties (P1)–(P4) are immediately apparent from the definition in (13), Property (P5) needs an explanation. This can most simply be done for the category case and without undue loss of generality since, for any data set with , there exists an equivalent 2 × 2 table with the same -value. Therefore, one may consider a 2 × 2 table such as the one in Table 1 with the marginal probabilities and (). The first two entries in each cell correspond to the cases when and 0, respectively, while the third entry equals the weighted arithmetic mean of the other two entries with weights and .
In order for the values of to be considered reasonable throughout the -interval, the only logical condition would clearly seem to be that the value of for the weighted mean cell probabilities should equal the weighted mean value of for the other cell probabilities with the same weights and ; that is, By substituting the expressions for the mean cell probabilities from Table 1 into in (13), it is seen that does meet the condition in (15), irrespective of the marginal probabilities and . This assumes, of course, as with Cohen’s , that chance agreement (disagreement) based on the marginal probabilities is reasonable.
By contrast, substituting the mean probabilities from Table 1 into in (1)-(2) givesshowing the strong dependence of on the marginal probabilities. The parenthetical term in (16) equals 1 if and approaches 0 as the marginal distributions become highly uneven or nonuniform (i.e., as approaches 0 or 1).
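When $\kappa_w < 0$ in (5)-(6) and hence $\sum_i\sum_j w_{ij}p_{ij} < \sum_i\sum_j w_{ij}p_{i+}p_{+j}$ and $\sum_i\sum_j v_{ij}p_{ij} > \sum_i\sum_j v_{ij}p_{i+}p_{+j}$ and with the sets of weights $\{w_{ij}\}$ and $\{v_{ij}\}$ as defined in (7), the following weighted negative kappa is proposed:
$$\kappa_w^{-} = \frac{\sum_i\sum_j w_{ij}p_{ij}}{\sum_i\sum_j w_{ij}p_{i+}p_{+j}} - 1, \tag{17}$$
and hence
$$\kappa_w^{*} = \begin{cases} \kappa_w, & \kappa_w \ge 0, \\ \kappa_w^{-}, & \kappa_w < 0. \end{cases} \tag{18}$$
Except for the minus sign, $\kappa_w^{-}$ in (17) follows from (5)-(6) by simply substituting $v_{ij}$ for $w_{ij}$ in (5) and $w_{ij}$ for $v_{ij}$ in (6).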
$\kappa_w^{-}$ is well defined if at least two cells of the table contain nonzero probabilities. It is also apparent from (17) that $\kappa_w^{-}$ takes on values between 0 and $-1$, inclusive, with $\kappa_w^{-} = 0$ if $p_{ij} = p_{i+}p_{+j}$ for all $i$ and $j$ (as a sufficient but not necessary condition). Also $\kappa_w^{-} = -1$ if, and only if, $p_{ij} = 0$ for all $i$ and $j$ except for $i = 1$ and $j = I$ and $i = I$ and $j = 1$, that is, when the only nonzero probabilities occur in the corner cells ($1, I$) and ($I, 1$) and the weights are of the form in (7). These properties of $\kappa_w^{-}$ all appear to be reasonable.
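By contrast, if $p_{1I} > 0$, $p_{I1} > 0$, and all other $p_{ij}$ = 0, $\kappa_w$ in (5)-(6) becomes $-2p_{1I}p_{I1}/(p_{1I}^2 + p_{I1}^2)$, which equals $-1$ only if $p_{1I} = p_{I1} = 0.5$. Otherwise, the value of $\kappa_w$ increases as $p_{1I}$ and $p_{I1}$ become increasingly different, approaching 0 as either probability approaches 1. Such behavior of $\kappa_w$ makes any reasonable interpretation of negative $\kappa_w$-values impossible and meaningless.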
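The contrast described above can be illustrated with a table whose only nonzero probabilities are in the two corner cells. The sketch assumes linear or quadratic agreement weights and takes the corrected weighted coefficient to be the ratio of weighted observed to weighted chance-expected agreement minus one; the function names and the parameter `a` are hypothetical.

```python
def weighted_kappas(p, power=1):
    """Return (Cohen's weighted kappa, corrected weighted kappa).

    Assumed agreement weights: w[i][j] = 1 - (|i - j| / (k - 1)) ** power.
    """
    k = len(p)
    rows = [sum(p[i]) for i in range(k)]
    cols = [sum(p[i][j] for i in range(k)) for j in range(k)]
    w = [[1 - (abs(i - j) / (k - 1)) ** power for j in range(k)]
         for i in range(k)]
    agree_obs = sum(w[i][j] * p[i][j] for i in range(k) for j in range(k))
    agree_exp = sum(w[i][j] * rows[i] * cols[j]
                    for i in range(k) for j in range(k))
    cohen_w = (agree_obs - agree_exp) / (1 - agree_exp)
    corrected_w = agree_obs / agree_exp - 1
    return cohen_w, corrected_w


def corner_table(a, k=3):
    """All probability in the two corner cells: a and 1 - a."""
    p = [[0.0] * k for _ in range(k)]
    p[0][k - 1] = a
    p[k - 1][0] = 1 - a
    return p


for a in (0.5, 0.7, 0.9):
    cohen_w, corrected_w = weighted_kappas(corner_table(a))
    print(a, round(cohen_w, 4), corrected_w)
# Cohen's weighted kappa: -1.0, -0.7241, -0.2195; corrected value: -1.0 each time.
```

In this setting the disagreement is as extreme as the ordinal scale allows for every value of `a`, yet only the corrected coefficient reports the same value throughout.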
3.2. Specific Category Coefficients
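Just as the $\kappa$ and $\kappa_w$ coefficients are inappropriate for negative values, so is the category-specific coefficient $\kappa_i$ in (3)-(4) as pointed out in Section 2. Therefore, for $\kappa_i < 0$, another coefficient is needed that satisfies the reasonable requirements that its value equals 0 when $p_{ii} = p_{i+}p_{+i}$ and equals $-1$ when $p_{ii} = 0$. The following proposition seems most reasonable:
$$\kappa_i^{-} = \frac{p_{ii}}{p_{i+}p_{+i}} - 1. \tag{19}$$
$\kappa_i^{-}$ is well defined unless either $p_{i+} = 0$ or $p_{+i} = 0$ (and hence $p_{ii} = 0$). $\kappa_i^{-} = -1$ if, and only if, $p_{ii} = 0$ (and $p_{i+} > 0$ and $p_{+i} > 0$). For the weighted mean cell probabilities in Table 1, . Also, analogous to (10),
$$\kappa^{-} = \frac{\sum_{i=1}^{I} p_{i+}p_{+i}\,\kappa_i^{-}}{\sum_{i=1}^{I} p_{i+}p_{+i}}, \tag{20}$$
with $\kappa^{-}$ defined in (13).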
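A minimal numerical sketch of the category-specific disagreement coefficient follows. It assumes the form implied by the two stated requirements (value 0 under independence for the category, value −1 when the diagonal cell is empty), namely the diagonal probability divided by the product of the two marginals, minus one. The function names and the 3 × 3 counts are hypothetical.

```python
def margins(p):
    k = len(p)
    rows = [sum(p[i]) for i in range(k)]
    cols = [sum(p[i][j] for i in range(k)) for j in range(k)]
    return rows, cols


def negative_kappa(p):
    """Overall disagreement coefficient, P_o / P_e - 1 (assumed form)."""
    rows, cols = margins(p)
    p_obs = sum(p[i][i] for i in range(len(p)))
    p_exp = sum(r * c for r, c in zip(rows, cols))
    return p_obs / p_exp - 1


def negative_kappa_category(p, i):
    """Category-specific version, p_ii / (p_i+ * p_+i) - 1 (assumed form)."""
    rows, cols = margins(p)
    return p[i][i] / (rows[i] * cols[i]) - 1


counts = [[2, 10, 8], [9, 3, 8], [10, 6, 4]]
n = sum(map(sum, counts))
p = [[c / n for c in row] for row in counts]

rows, cols = margins(p)
per_category = [negative_kappa_category(p, i) for i in range(3)]
chance = [rows[i] * cols[i] for i in range(3)]
weighted_mean = sum(c * k for c, k in zip(chance, per_category)) / sum(chance)

print([round(k, 4) for k in per_category])  # [-0.7143, -0.5263, -0.4]
print(round(weighted_mean, 4), round(negative_kappa(p), 4))  # -0.55 -0.55
```

The last line illustrates the weighted-mean relation: weighting each category-specific value by its chance-expected agreement recovers the overall coefficient.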
In terms of weights , with the types of as in (7), the proposed specific-category weighted kappa coefficient may be defined as where, as always, the first subscript refers to the table row and the second subscript to the table column. Note that the component for cell appears twice in . Note also that, analogous to (20), in (17) is the weighted arithmetic mean of in (21) with weights based on the denominator in (21) for . It is apparent from (21) that, for the weights in (7) and with , if for all , but also if and are the only nonzero probabilities in the table.
4. Statistical Inferences
Consider now that the coefficients (measures) discussed above are all sample estimates (and estimators) based on the sample probabilities () with frequencies (counts) and sample size . ’s are maximum likelihood estimates (and estimators) of the unknown population probabilities on which the corresponding population coefficients are based such as the population coefficient corresponding to the sample coefficient in (6). It may then be of interest to make statistical inferences about the population coefficients corresponding to the sample coefficients discussed above.
Such statistical inferences would probably be most meaningful in terms of the construction of confidence intervals for the overall kappa coefficients in (14) and (18). The inference procedure necessarily relies on large-sample approximations, based either on the delta method (e.g., [19, Chapter 14]) or on resampling methods such as the bootstrap and the jackknife (e.g., [20, 21]). The delta method is chosen in this paper. By developing the procedure based on the expression for $\kappa_w$ in (6), the procedures for $\kappa_w^{-}$ in (17), $\kappa^{-}$ in (13), and $\kappa$ in (1) follow as special cases by the appropriate selection of the set of weights $\{w_{ij}\}$. Fleiss et al. [22] gave the estimated large sample variance of $\kappa_w$ based on the expression in (5) without presenting any intermediate steps. Instead, the expression in (6) will be used here as being more convenient and some of the important intermediate steps will be presented.
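The paper develops the delta method; as a point of comparison only, the resampling alternative mentioned above can be sketched with a percentile bootstrap under multinomial sampling. This is not the procedure derived in the paper. The statistic is the assumed piecewise coefficient (Cohen's kappa when observed agreement is at least chance agreement, otherwise observed over chance agreement minus one), and the function names, replication count, seed, and counts are hypothetical.

```python
import random


def corrected_kappa(counts):
    k = len(counts)
    n = sum(map(sum, counts))
    rows = [sum(counts[i]) / n for i in range(k)]
    cols = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]
    p_obs = sum(counts[i][i] for i in range(k)) / n
    p_exp = sum(r * c for r, c in zip(rows, cols))
    if p_obs >= p_exp:
        return (p_obs - p_exp) / (1 - p_exp)
    return p_obs / p_exp - 1


def bootstrap_interval(counts, reps=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling the n items with replacement."""
    k = len(counts)
    n = sum(map(sum, counts))
    cells = [(i, j) for i in range(k) for j in range(k)]
    weights = [counts[i][j] for i, j in cells]
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        resample = [[0] * k for _ in range(k)]
        for i, j in rng.choices(cells, weights=weights, k=n):
            resample[i][j] += 1
        stats.append(corrected_kappa(resample))
    stats.sort()
    lower = stats[int((1 - level) / 2 * reps)]
    upper = stats[int((1 + level) / 2 * reps) - 1]
    return lower, upper


counts = [[2, 10, 8], [9, 3, 8], [10, 6, 4]]
estimate = corrected_kappa(counts)
lower, upper = bootstrap_interval(counts)
print(round(estimate, 4), round(lower, 4), round(upper, 4))
```

Because the statistic is bounded below by −1, the percentile interval cannot extend past that bound, which a symmetric normal-approximation interval does not guarantee.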
Then, letting in (6) denote both the sample estimate and estimator of the corresponding population coefficient (based on population probabilities , and for and ), it follows from the delta method that, under multinomial sampling (when the categories and the sample size are a priori fixed), the estimator is approximately normally distributed with mean and estimated variance if is reasonably large.
In order to derive the estimated variance of , express as and let , and denote the partial derivatives of these quantities with respect to , with then being replaced with the estimated probabilities for all and . Then, wherefor all and . It is found thatso that, from (23)-(24),from which one getsWhen (25) and (26) are substituted into (22), one obtainswhere and and are defined in (24). This variance formula, which gives the same numerical results as the formula given by Fleiss et al. , can then be used for interval estimation.
By comparing in (17) with the expression for in (6), it follows from (27) that the estimated variance of is given by where is the denominator in (17) and
By setting for all and for all , in (6) reduces to in (2) and, furthermore, and in (24) so that, from (27), it is found that where and are defined in (2). Similarly, by setting for all and for all , in (17) reduces to in (13) and, furthermore, and in (29) so that, from (28), the following result is obtained:where and are defined in (1). The expression in (30) is somewhat different from that given by Fleiss et al. , but they are found to give exactly the same numerical results.
If it should be of interest to test the null hypothesis that the population equivalent to one of the new coefficients is equal to zero, then the same procedure as proposed by Fleiss et al. [22] for the case of Cohen’s $\kappa$ and $\kappa_w$ would involve replacing $p_{ij}$ with $p_{i+}p_{+j}$ for all $i$ and $j$ in the variance expressions in (28) or (30). However, a simpler method would be to use the chi-square goodness-of-fit statistics $X^2$ or $G^2$ to test for independence (noting again that independence is a sufficient but not necessary condition for the coefficients to equal zero).
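The simpler test mentioned above can be sketched by computing the Pearson and likelihood-ratio chi-square statistics for independence directly from the counts. The function name and example counts are hypothetical, and the resulting statistics would be referred to a chi-square distribution with (I − 1)² degrees of freedom.

```python
import math


def independence_statistics(counts):
    """Return (Pearson X^2, likelihood-ratio G^2, degrees of freedom).

    Assumes every row and column total is positive.
    """
    k = len(counts)
    n = sum(sum(row) for row in counts)
    row_tot = [sum(counts[i]) for i in range(k)]
    col_tot = [sum(counts[i][j] for i in range(k)) for j in range(k)]
    x2 = 0.0
    g2 = 0.0
    for i in range(k):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / n
            observed = counts[i][j]
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # empty cells contribute nothing to G^2
                g2 += 2 * observed * math.log(observed / expected)
    return x2, g2, (k - 1) ** 2


x2, g2, df = independence_statistics([[20, 5], [10, 15]])
print(round(x2, 4), round(g2, 4), df)  # 8.3333 8.6305 1
```

Both statistics exceed 3.841, the 5% critical value of the chi-square distribution with one degree of freedom, so independence would be rejected for these hypothetical counts.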
5. Numerical Examples
5.1. Example 1:
Instead of one pair of observers assigning each of items (observations) to one of categories, consider the statistically equivalent situation in which each of pairs of observers assigns one item to one of categories. For example, among randomly selected couples, each spouse answers a question with choice categories , and . The (fictitious) data are given in Table 2.
With and in Table 2, it follows from (13) that , indicating a substantial disagreement between husbands and wives. By contrast, the corresponding value of Cohen’s in (1) is found to be , which could have been interpreted as indicating a much lower level of disagreement. However, since has no fixed lower bound as discussed above, any interpretation or conclusion based on would be invalid and misleading.
The next question may then be, how do the disagreements on the individual categories contribute to the overall disagreement of ? The answer from Table 2 and (19) is found to be , , and . Therefore, the substantial overall disagreement is attributable to the high disagreement on each of the categories and , whereas category involves a very low level of agreement. By comparison, the negative values from (3) or (4) would have been substantially different, with and .
In order to construct a confidence interval for the population equivalent of based on the data in Table 2, it is found from (31) that . Then, since the estimator , with the sample estimate of , is approximately normally distributed with mean and estimated variance of 0.0115, an approximate 95% confidence interval for becomes or .
5.2. Example 2:
Consider now that the categories in Table 2 are ordinal so that the weighted kappa coefficients would be appropriate. Then, with the weights in (7) and for all and , it is found from Table 2 that and so that, from (17)-(18), . This weighted disagreement value differs considerably from the above value when the three categories are considered to be nominal.
In terms of the disagreements for the individual categories, it is found from (21) and Table 2 that , , and . Again, these results differ considerably from those in the nominal case considered in Example . Note that the arithmetic mean of ’s does not differ greatly from the overall .
An interval estimate for the population measure can be derived from (28) by first computing and for each of and from (29) and Table 2, giving , and for , and 3, respectively, and , and for , and 3, respectively. Then, from (28), with and , it is found that . Consequently, a 95% confidence interval for is given by or .
5.3. Logistic Transformation
Instead of making statistical inferences about the kappa coefficients directly, as done above, it is likely advantageous to do so indirectly via the logistic transformation. Therefore, in the case of in (13), consider the following logistic transformation of and its inverse:Since the derivative , the estimated variance of becomeswhere is given in (31). An approximate confidence interval for the population equivalent of can then be constructed based on (33), with the corresponding confidence interval for resulting from the inverse transform in (32).
In the case of in (17), in (32)-(33) is simply replaced with . For and in (1) and (5)-(6), the transformation becomes and . With such transformations, the lower end of a confidence interval for or cannot be less than and the upper end of a confidence interval or cannot exceed 1. Most importantly, the normal distribution approximation is likely to be improved with the above logistic transforms. Unless the sample size is very large, the distributions of the kappa coefficients are likely to be skewed, especially when a coefficient is near or . For instance, when, say, the population coefficient , the estimator cannot be much smaller than , but it could be much larger with nonnegligible probability. The logistic transformation to the -interval tends to correct for such skewness and provide for a more rapid convergence to normality.
In Example , with and , it follows from (32)-(33) that and . An approximate 95% confidence interval for the population equivalent of is then given by or . Then, from the inverse transform in (32), it follows that an approximate 95% confidence interval for has the limits and , that is, the interval . This confidence interval is slightly shorter than the interval determined above. Similarly, with substituted for in (32)-(33) and with and from Example , it is found that an approximate 95% confidence interval for the population coefficient is . This interval differs little from the interval determined above when applying the inference procedure directly to .
If Cohen’s kappa is accepted as an appropriate measure of interobserver agreement, as many do judging by its widespread use, then the corrections proposed here for negative kappa values should be equally acceptable. Of course, since the chance-expected disagreement (or agreement) terms in the new coefficients also depend exclusively on the marginal distributions, the criticism by some that Cohen’s coefficients depend too much on the marginal distributions would similarly apply to the new coefficients. Such concern is particularly important in cases of highly uneven (nonuniform or “skewed”) marginal distributions. If, however, those distributions are fairly even (uniform), Cohen’s kappa and hence the measures proposed in this paper for interobserver disagreement (negative agreement) would seem to be reasonably acceptable agreement-disagreement measures.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The author gratefully acknowledges the helpful recommendations from a reviewer.
References
[1] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[2] J. L. Fleiss, B. Levin, and M. C. Paik, Statistical Methods for Rates and Proportions, Wiley, Hoboken, NJ, USA, 3rd edition, 2003.
[3] K. L. Gwet, Handbook of Inter-Rater Reliability, Advanced Analytics, Gaithersburg, Md, USA, 4th edition, 2014.
[4] M. M. Shoukri, Measures of Interobserver Agreement and Reliability, CRC Press, Boca Raton, Fla, USA, 2nd edition, 2011.
[5] A. Von Eye and E. Y. Mun, Analyzing Rater Agreement: Manifest Variable Methods, Lawrence Erlbaum Associates, Mahwah, NJ, USA, 2005.
[6] M. J. Warrens, “New interpretations of Cohen's kappa,” Journal of Mathematics, vol. 2014, Article ID 203907, 9 pages, 2014.
[7] M. J. Warrens, “Five ways to look at Cohen's kappa,” Journal of Psychology & Psychotherapy, vol. 5, no. 4, pp. 1–4, 2015.
[8] Z. Yang and M. Zhou, “Kappa statistic for clustered matched-pair data,” Statistics in Medicine, vol. 33, no. 15, pp. 2612–2633, 2014.
[9] M. L. McHugh, “Interrater reliability: the kappa statistic,” Biochemia Medica, vol. 22, no. 3, pp. 276–282, 2012.
[10] H. L. Kundel and M. Polansky, “Measurement of observer agreement,” Radiology, vol. 228, no. 2, pp. 303–308, 2003.
[11] R. L. Spitzer, J. Cohen, J. L. Fleiss, and J. Endicott, “Quantification of agreement in psychiatric diagnosis,” Archives of General Psychiatry, vol. 17, no. 1, pp. 83–87, 1967.
[12] T. O. Kvålseth, “Note on Cohen's kappa,” Psychological Reports, vol. 65, no. 1, pp. 223–226, 1989.
[13] J. Cohen, “Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit,” Psychological Bulletin, vol. 70, no. 4, pp. 213–220, 1968.
[14] D. V. Cicchetti and T. Allison, “A new procedure for assessing reliability of scoring EEG sleep recordings,” American Journal of EEG Technology, vol. 11, no. 3, pp. 101–109, 1971.
[15] T. O. Kvålseth, “Weighted specific-category Kappa measure of interobserver agreement,” Psychological Reports, vol. 93, no. 3, pp. 1283–1290, 2003.
[16] T. O. Kvålseth, “Kappa coefficients of agreement,” in International Encyclopedia of Statistical Science, M. Lovric, Ed., part 11, pp. 710–713, Springer, Berlin, Germany, 2011.
[17] H. E. A. Tinsley and D. J. Weiss, “Interrater reliability and agreement,” in Handbook of Applied Multivariate Statistics and Mathematical Modeling, H. E. A. Tinsley and S. D. Brown, Eds., pp. 95–124, Academic Press, San Diego, Calif, USA, 2000.
[18] K. F. Hirji and M. H. Rosove, “A note on interrater agreement,” Statistics in Medicine, vol. 9, no. 7, pp. 835–839, 1990.
[19] A. Agresti, Categorical Data Analysis, John Wiley & Sons, Hoboken, NJ, USA, 2nd edition, 2002.
[20] A. C. Davison and D. V. Hinkley, Bootstrap Methods and Their Applications, Cambridge University Press, Cambridge, UK, 1997.
[21] W. C. Parr and H. D. Tolley, “Jackknifing in categorical data analysis,” The Australian Journal of Statistics, vol. 24, no. 1, pp. 67–79, 1982.
[22] J. L. Fleiss, J. Cohen, and B. S. Everitt, “Large sample standard errors of kappa and weighted kappa,” Psychological Bulletin, vol. 72, no. 5, pp. 323–327, 1969.