Journal of Probability and Statistics

Volume 2015, Article ID 751803, 8 pages

http://dx.doi.org/10.1155/2015/751803

## Measurement of Interobserver Disagreement: Correction of Cohen’s Kappa for Negative Values

Departments of Mechanical Engineering and Industrial & Systems Engineering, University of Minnesota, Minneapolis, MN 55455, USA

Received 24 June 2015; Accepted 6 September 2015

Academic Editor: Z. D. Bai

Copyright © 2015 Tarald O. Kvålseth. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

As measures of interobserver agreement for both nominal and ordinal categories, Cohen’s kappa coefficients appear to be the most widely used with simple and meaningful interpretations. However, for negative coefficient values when (the probability of) observed disagreement exceeds chance-expected disagreement, no fixed lower bounds exist for the kappa coefficients and their interpretations are no longer meaningful and may be entirely misleading. In this paper, alternative measures of disagreement (or negative agreement) are proposed as simple corrections or modifications of Cohen’s kappa coefficients. The new coefficients have a fixed lower bound of −1 that can be attained irrespective of the marginal distributions. A coefficient is formulated for the case when the classification categories are nominal and a weighted coefficient is proposed for ordinal categories. Besides coefficients for the overall disagreement across categories, disagreement coefficients for individual categories are presented. Statistical inference procedures are developed and numerical examples are provided.

#### 1. Introduction

When two (or more) observers are independently classifying observations or items (objects) into the same set of mutually exclusive and exhaustive categories, it may be of interest to have a summary description of the extent to which the observers agreed in their classifications. The total probability (proportion) of agreement is one such obvious summary measure. However, since some agreement is to be expected purely by chance, Cohen [1] introduced the *kappa coefficient of agreement* as one that corrects for the chance-expected agreement. Cohen’s kappa has since become widely used in a variety of situations and discussed extensively in various textbooks (e.g., [2–5]) and a wide variety of journal publications (e.g., [6–10]).

In order to define the kappa coefficient in terms of probabilities (proportions), let $p_{ij}$ be the probability that a random observation is assigned to category $i$ by Observer 1 and to category $j$ by Observer 2 for $i = 1, \ldots, I$ and $j = 1, \ldots, I$. Furthermore, let $p_{i+}$ denote the probability that a randomly chosen observation is assigned to category $i$ by Observer 1 and $p_{+j}$ the probability that a randomly chosen observation is assigned to category $j$ by Observer 2 ($i, j = 1, \ldots, I$). If these probabilities are represented in terms of a two-way contingency table with rows $i$ and columns $j$, then $p_{ij}$ becomes the probability in cell ($i, j$), $\{p_{i+}\}$ becomes the marginal row distribution, and $\{p_{+j}\}$ becomes the marginal column distribution. With the row categories and the column categories being the same, $\sum_{i} p_{ii}$ is the total probability of agreement between the two observers. Cohen [1] used the overall statistical independence $p_{ij} = p_{i+}p_{+j}$ as the condition for chance agreement and defined $\kappa$ as

$$\kappa = \frac{P_o - P_e}{1 - P_e} \tag{1}$$

with $P_o = \sum_{i} p_{ii}$ and $P_e = \sum_{i} p_{i+}p_{+i}$ being the observed agreement probability and the chance-expected agreement probability, respectively. In terms of the observed and chance-expected disagreement probabilities $D_o = 1 - P_o$ and $D_e = 1 - P_e$, $\kappa$ can alternatively be expressed as

$$\kappa = 1 - \frac{D_o}{D_e} \tag{2}$$

It is clear from (1)-(2) that $\kappa = 1$ if the interobserver agreement is perfect, that is, if $P_o = 1$, $\kappa = 0$ if $P_o = P_e$, and $\kappa < 0$ if $P_o < P_e$. The case of negative $\kappa$-values will be discussed further in the next section.
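As a rough illustration of the definition above, the following sketch computes Cohen's kappa from a square table of joint probabilities, once from the agreement probabilities and once from the disagreement probabilities. The function names and the 2 × 2 table are hypothetical and serve only as an example.

```python
def agreement_probabilities(p):
    """Return (observed, chance-expected) agreement for a square table p[i][j]."""
    n = len(p)
    rows = [sum(p[i]) for i in range(n)]                       # Observer 1 marginals
    cols = [sum(p[i][j] for i in range(n)) for j in range(n)]  # Observer 2 marginals
    observed = sum(p[i][i] for i in range(n))
    expected = sum(rows[i] * cols[i] for i in range(n))
    return observed, expected


def kappa_from_agreement(p):
    """Kappa as (observed - expected agreement) / (1 - expected agreement)."""
    observed, expected = agreement_probabilities(p)
    return (observed - expected) / (1.0 - expected)


def kappa_from_disagreement(p):
    """Kappa as 1 - observed disagreement / expected disagreement."""
    observed, expected = agreement_probabilities(p)
    return 1.0 - (1.0 - observed) / (1.0 - expected)


# Hypothetical joint probabilities (rows: Observer 1, columns: Observer 2).
table = [[0.40, 0.10],
         [0.20, 0.30]]

print(kappa_from_agreement(table), kappa_from_disagreement(table))
```

For this table the observed agreement is 0.7 and the chance-expected agreement is 0.5, so both forms evaluate to 0.4 up to floating-point rounding.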

In addition to measuring the overall agreement between two observers, it may be of interest to assess their level of agreement for specific categories. Spitzer et al. [11] first proposed such a measure by collapsing the original $I \times I$ table into a 2 × 2 table, one such 2 × 2 table for each category $i$, and then computing $\kappa$ in (1)-(2) for each such 2 × 2 table (see also [2, Chapter 18]). As a simpler procedure yielding the same numerical results, Kvålseth [12] proposed the following form of kappa for the specific category $i$ ($i = 1, \ldots, I$):

$$\kappa_i = \frac{p_{ii} - p_{i+}p_{+i}}{(p_{i+} + p_{+i})/2 - p_{i+}p_{+i}} \tag{3}$$

$$\kappa_i = 1 - \frac{\sum_{D_i} p_{jk}}{\sum_{D_i} p_{j+}p_{+k}} \tag{4}$$

where $\sum_{D_i}$ denotes the summation over all disagreement cells for category $i$. With, say, $I = 3$ and $i = 1$, $D_1$ consists of cells (1, 2), (1, 3), (2, 1), and (3, 1). For complete agreement with respect to category $i$, $\kappa_i = 1$ when $p_{ii} = p_{i+} = p_{+i}$, $\kappa_i = 0$ for the independence $p_{ii} = p_{i+}p_{+i}$, and $\kappa_i < 0$ when observed disagreement exceeds chance disagreement.
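The category-specific coefficient can be sketched in two ways: from the agreement cell of the category and from its disagreement cells. The code below assumes the algebraic form that results from applying Cohen's kappa to the collapsed 2 × 2 table; the function names and the 3 × 3 table are hypothetical.

```python
def category_kappa(p, i):
    """Category-specific kappa written with the agreement cell of category i."""
    n = len(p)
    row_i = sum(p[i])
    col_i = sum(p[k][i] for k in range(n))
    chance = row_i * col_i
    return (p[i][i] - chance) / ((row_i + col_i) / 2.0 - chance)


def category_kappa_from_disagreement(p, i):
    """Same coefficient computed from the disagreement cells of category i."""
    n = len(p)
    rows = [sum(p[j]) for j in range(n)]
    cols = [sum(p[j][k] for j in range(n)) for k in range(n)]
    # Disagreement cells: exactly one of the two observers chose category i.
    cells = [(j, k) for j in range(n) for k in range(n) if (j == i) != (k == i)]
    observed = sum(p[j][k] for j, k in cells)
    expected = sum(rows[j] * cols[k] for j, k in cells)
    return 1.0 - observed / expected


# Hypothetical joint probabilities for three categories.
table = [[0.30, 0.05, 0.05],
         [0.05, 0.20, 0.05],
         [0.05, 0.05, 0.20]]

for i in range(3):
    print(i, category_kappa(table, i), category_kappa_from_disagreement(table, i))
```

The two functions should agree for every category, since the observed and expected disagreement sums equal twice the corresponding terms of the agreement form.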

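To account for the potential fact that some disagreements may be more serious than others, as when the categories have a natural order, Cohen [13] and Cicchetti and Allison [14] independently introduced the *weighted kappa* $\kappa_w$, which can be expressed as

$$\kappa_w = \frac{\sum_{i}\sum_{j} w_{ij}p_{ij} - \sum_{i}\sum_{j} w_{ij}p_{i+}p_{+j}}{1 - \sum_{i}\sum_{j} w_{ij}p_{i+}p_{+j}} \tag{5}$$

$$\kappa_w = 1 - \frac{\sum_{i}\sum_{j} (1 - w_{ij})p_{ij}}{\sum_{i}\sum_{j} (1 - w_{ij})p_{i+}p_{+j}} \tag{6}$$

where each weight $w_{ij} \in [0, 1]$, with $w_{ii} = 1$ and $w_{ij} = w_{ji}$ for all $i$ and $j$ and with the following logical weight choices (e.g., [2, page 609]):

$$w_{ij} = 1 - \frac{|i - j|}{I - 1} \quad \text{or} \quad w_{ij} = 1 - \frac{(i - j)^2}{(I - 1)^2} \tag{7}$$

For a specific category $i$, Kvålseth [15] proposed the following measure as an extension of (4):

$$\kappa_{wi} = 1 - \frac{\sum_{D_i} (1 - w_{jk})p_{jk}}{\sum_{D_i} (1 - w_{jk})p_{j+}p_{+k}} \tag{8}$$

with $D_i$ denoting the set of all disagreement cells for category $i$. The values of these weighted measures equal 1 for perfect agreement and 0 if observed agreement equals chance agreement, with negative values if observed agreement is less than chance agreement.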
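A minimal sketch of weighted kappa follows, assuming the commonly used linear and quadratic agreement weights for ordered categories. The function names and the table are hypothetical; with identity weights (1 on the diagonal, 0 elsewhere) the computation reduces to the unweighted kappa.

```python
def linear_weights(n):
    """Agreement weights that fall off linearly with the category distance."""
    return [[1.0 - abs(i - j) / (n - 1) for j in range(n)] for i in range(n)]


def quadratic_weights(n):
    """Agreement weights that fall off with the squared category distance."""
    return [[1.0 - (i - j) ** 2 / (n - 1) ** 2 for j in range(n)] for i in range(n)]


def weighted_kappa(p, w):
    """Weighted kappa for joint probabilities p[i][j] and agreement weights w[i][j]."""
    n = len(p)
    rows = [sum(p[i]) for i in range(n)]
    cols = [sum(p[i][j] for i in range(n)) for j in range(n)]
    observed = sum(w[i][j] * p[i][j] for i in range(n) for j in range(n))
    expected = sum(w[i][j] * rows[i] * cols[j] for i in range(n) for j in range(n))
    return (observed - expected) / (1.0 - expected)


# Hypothetical joint probabilities for three ordered categories.
table = [[0.25, 0.10, 0.05],
         [0.05, 0.20, 0.05],
         [0.05, 0.10, 0.15]]

identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(weighted_kappa(table, identity))              # unweighted kappa
print(weighted_kappa(table, linear_weights(3)))
print(weighted_kappa(table, quadratic_weights(3)))
```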

The kappa coefficients in (1)–(8) may be appropriate measures of agreement when their values are nonnegative, but not when their values are negative as discussed in the next section. From a theoretical point of view at least, it is certainly troublesome that their negative values lack appropriate meaning and validity. This paper presents simple corrections or modifications of the kappa coefficients in (1)–(8) such that the negative values of the corrected coefficients provide appropriate representation of the extent to which the observers disagree. Statistical inference procedures for the new coefficients or measures are developed. Numerical examples are also given.

#### 2. Comments on Kappa

One of the most appealing properties of kappa, and undoubtedly a reason for its popularity, is its simplicity and transparency. All the kappa coefficients in (1)–(8) have intuitively appealing and meaningful interpretations. In the case of $\kappa$ in (1)-(2), for example, it seems most meaningful to interpret any $\kappa$-value in terms of (2) as the proportional difference between $D_e$ and $D_o$, that is, the relative extent to which the observed disagreement probability is less than the disagreement probability attributable to chance. By comparison, the norming used in (1) is not unique, with any number of different potential denominators such that $\kappa \le 1$ [16].

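Complete statistical independence, that is, $p_{ij} = p_{i+}p_{+j}$ for $i, j = 1, \ldots, I$, is a sufficient, but not a necessary, condition for the kappa coefficients in (1)–(8) to take on value 0. In fact, for $\kappa$ in (1) to equal 0, it is not necessary that $p_{ii} = p_{i+}p_{+i}$ for $i = 1, \ldots, I$. It is possible that $\kappa = 0$ even if $p_{ij} \neq p_{i+}p_{+j}$ for all $i$ and $j$ when $I > 2$. As a simple example, consider a 3 × 3 table in which all marginal probabilities equal 1/3 and $p_{ij} \neq p_{i+}p_{+j}$ for all $i$ and $j$, but $\kappa = 0$. In this case, from (3), , , and, from (6) and (7), for .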
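To make the point concrete, the sketch below builds one hypothetical 3 × 3 table (not necessarily the one used in the original example) with all marginal probabilities equal to 1/3, no cell equal to its independence value of 1/9, and an overall kappa of exactly 0. Exact rational arithmetic is used so that the zero is not a rounding artifact; the function names are illustrative.

```python
from fractions import Fraction


def marginals(p):
    n = len(p)
    rows = [sum(p[i]) for i in range(n)]
    cols = [sum(p[i][j] for i in range(n)) for j in range(n)]
    return rows, cols


def kappa(p):
    rows, cols = marginals(p)
    observed = sum(p[i][i] for i in range(len(p)))
    expected = sum(r * c for r, c in zip(rows, cols))
    return (observed - expected) / (1 - expected)


def category_kappa(p, i):
    rows, cols = marginals(p)
    chance = rows[i] * cols[i]
    return (p[i][i] - chance) / ((rows[i] + cols[i]) / 2 - chance)


s = Fraction(1, 6)
# Hypothetical table: every row and column sums to 1/3, the diagonal sums to 1/3.
table = [[s, 0, s],
         [0, s, s],
         [s, s, 0]]

rows, cols = marginals(table)
print(rows, cols)                                   # all marginals are 1/3
print(kappa(table))                                 # overall kappa is 0
print([category_kappa(table, i) for i in range(3)])
```

In this table the category-specific values are 1/4, 1/4, and −1/2, which average to zero with equal weights, so an overall kappa of 0 can mask substantial category-level agreement and disagreement.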

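Note that the two expressions for $\kappa$ in (1)-(2) are weighted arithmetic means of the expressions for $\kappa_i$ in (3)-(4). Thus, from (1) and (3), for instance, it is seen that

$$\kappa = \frac{\sum_{i=1}^{I} \left[(p_{i+} + p_{+i})/2 - p_{i+}p_{+i}\right]\kappa_i}{\sum_{i=1}^{I} \left[(p_{i+} + p_{+i})/2 - p_{i+}p_{+i}\right]} \tag{10}$$

Similarly, for the weighted measures in (6) and (8),

$$\kappa_w = \frac{\sum_{i=1}^{I} \left[\sum_{D_i} (1 - w_{jk})p_{j+}p_{+k}\right]\kappa_{wi}}{\sum_{i=1}^{I} \sum_{D_i} (1 - w_{jk})p_{j+}p_{+k}} \tag{11}$$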
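The unweighted case of this weighted-mean relationship can be checked in two short steps. The derivation below assumes that the category-specific coefficient has the numerator $p_{ii} - p_{i+}p_{+i}$ and the denominator $(p_{i+} + p_{+i})/2 - p_{i+}p_{+i}$, uses the denominators as weights, and writes $P_o$ and $P_e$ for the observed and chance-expected agreement probabilities; these symbols are assumptions of the sketch.

```latex
\begin{aligned}
\sum_{i=1}^{I}\Bigl[\tfrac{1}{2}(p_{i+}+p_{+i}) - p_{i+}p_{+i}\Bigr]\kappa_i
  &= \sum_{i=1}^{I}\bigl(p_{ii} - p_{i+}p_{+i}\bigr) = P_o - P_e, \\
\sum_{i=1}^{I}\Bigl[\tfrac{1}{2}(p_{i+}+p_{+i}) - p_{i+}p_{+i}\Bigr]
  &= \tfrac{1}{2}(1 + 1) - P_e = 1 - P_e .
\end{aligned}
```

The ratio of the two sums is $(P_o - P_e)/(1 - P_e)$, which is the overall kappa.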

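In order to show that the interobserver agreement for a specific category can be determined directly from (3)-(4), without the need to collapse the original $I \times I$ table as suggested by Spitzer et al. [11], consider that the original table with probability components $p_{ii}$, $p_{i+}$, and $p_{+i}$ for category $i$ is collapsed into the following 2 × 2 table:

$$\begin{pmatrix} p_{ii} & p_{i+} - p_{ii} \\ p_{+i} - p_{ii} & 1 - p_{i+} - p_{+i} + p_{ii} \end{pmatrix} \tag{12}$$

When (12) is substituted into (1), $\kappa_i$ in (3) results immediately. However, no such corresponding procedure applies to $\kappa_w$ in (6) and $\kappa_{wi}$ in (8). Note that, for $I = 2$, $\kappa_1 = \kappa_2 = \kappa$ and $\kappa_{w1} = \kappa_{w2} = \kappa_w$.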
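The equivalence between the collapsing procedure and the direct category-specific formula can be checked numerically. The sketch below collapses a hypothetical 3 × 3 table into a 2 × 2 table ("category i" versus "all other categories") and compares Cohen's kappa of the collapsed table with the direct computation; all names and values are illustrative.

```python
def cohen_kappa(p):
    n = len(p)
    rows = [sum(p[i]) for i in range(n)]
    cols = [sum(p[i][j] for i in range(n)) for j in range(n)]
    observed = sum(p[i][i] for i in range(n))
    expected = sum(rows[i] * cols[i] for i in range(n))
    return (observed - expected) / (1.0 - expected)


def collapse(p, i):
    """Collapse a square table into 'category i' versus 'any other category'."""
    n = len(p)
    row_i = sum(p[i])
    col_i = sum(p[k][i] for k in range(n))
    both = p[i][i]
    return [[both, row_i - both],
            [col_i - both, 1.0 - row_i - col_i + both]]


def category_kappa(p, i):
    """Direct category-specific kappa, without collapsing the table."""
    n = len(p)
    row_i = sum(p[i])
    col_i = sum(p[k][i] for k in range(n))
    chance = row_i * col_i
    return (p[i][i] - chance) / ((row_i + col_i) / 2.0 - chance)


# Hypothetical joint probabilities for three categories.
table = [[0.25, 0.10, 0.05],
         [0.05, 0.20, 0.05],
         [0.05, 0.10, 0.15]]

for i in range(3):
    print(i, cohen_kappa(collapse(table, i)), category_kappa(table, i))
```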

In spite of its wide appeal, kappa is not without some criticism or controversy, especially related to its dependence on the marginal distributions $\{p_{i+}\}$ and $\{p_{+j}\}$ (see, e.g., [4, pages 168–173]). The chance agreement (disagreement) for all the kappa coefficients in (1)–(8) is based on the marginal distributions. If those distributions are highly uneven (nonuniform) and nearly symmetric, the values of the kappa coefficients may become unreasonably small due to the relatively large chance agreements.

A clear limitation of the kappa coefficients relates to situations when the values of those coefficients become negative and lack meaningful interpretations. This limitation has generally been ignored in published studies, partly perhaps because such studies using kappa have typically involved positive kappa values. Negative kappa values could, however, lead to incorrect interpretations, results, and conclusions. Also, if, for instance, $\kappa > 0$ in (1)-(2), it is possible that some $\kappa_i < 0$ in (3)-(4).

For the overall kappa in (1)-(2), when $P_o < P_e$ so that $\kappa < 0$, $\kappa$ has no reasonable meaning in terms of (1), but does in terms of (2); that is, $|\kappa|$ is the relative extent to which $D_o$ exceeds $D_e$. The same argument applies to $\kappa_i$ in (3)-(4). However, two serious limitations of all the kappa coefficients are that, for negative values, (a) the coefficients have no fixed lower bounds, making it impossible to appropriately assess the size or magnitude of coefficient values, and (b) the coefficients take on negative values that do not appear reasonable as discussed below.

The minimum values of $\kappa$ in (1) and of $\kappa_i$ in (3) depend exclusively on the marginal distributions $\{p_{i+}\}$ and $\{p_{+j}\}$. Particular negative values of $\kappa$ or $\kappa_i$ are uninformative since they cannot be related to any fixed lower bounds on $\kappa$ or $\kappa_i$ such as $-1$, irrespective of the marginal distributions. There is no basis for making any interpretation or statement such as a given negative value indicating a “moderate,” “low,” or “high” level of disagreement between the two observers.

There is also some confusion in the literature about the minimum value of $\kappa$, with some stating that the minimum value is or [5, page 4] and others stating that it is when for all and [17, page 113]. Such statements are clearly incorrect. In fact, the minimum value equals $-1$ if, and only if, $P_e = 0.5$. Similarly, the minimum value of $\kappa_i$ in (3) equals $-1$ only when the harmonic mean of $p_{i+}$ and $p_{+i}$ equals 0.5.
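The dependence of the lower bound on the marginal distributions can be seen in a small numerical sketch. It uses hypothetical 2 × 2 tables in which the observers never agree, so the observed agreement is 0 and Cohen's kappa reduces to minus the chance-expected agreement divided by the chance-expected disagreement. The function names and the parameter `a` are illustrative.

```python
def cohen_kappa(p):
    n = len(p)
    rows = [sum(p[i]) for i in range(n)]
    cols = [sum(p[i][j] for i in range(n)) for j in range(n)]
    observed = sum(p[i][i] for i in range(n))
    expected = sum(rows[i] * cols[i] for i in range(n))
    return (observed - expected) / (1.0 - expected)


def zero_agreement_table(a):
    """2 x 2 table with an empty diagonal; Observer 1 picks category 1 with probability a."""
    return [[0.0, a],
            [1.0 - a, 0.0]]


# Total disagreement in every case, yet kappa differs with the marginals.
for a in (0.5, 0.3, 0.1):
    print(a, cohen_kappa(zero_agreement_table(a)))
```

Only the table with uniform marginals (chance-expected agreement of 0.5) reaches −1; the more uneven the marginals, the closer kappa stays to 0 even though the observers disagree on every item.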

What is needed are chance-corrected measures of disagreement, both weighted and unweighted, which have fixed lower bounds of $-1$ and which are attainable irrespective of the marginal distributions. This requirement has also been clearly emphasized by others [18]. Such measures will be introduced in the next section as simple corrections or modifications of the existing kappa coefficients.

#### 3. Proposed Kappa Coefficients of Disagreement

##### 3.1. Overall Coefficients

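When $P_o < P_e$ and hence $D_o > D_e$, it seems most logical and intuitive to define negative overall kappa as

$$\kappa_D = -\frac{D_o - D_e}{1 - D_e} = -\left(1 - \frac{P_o}{P_e}\right) \tag{13}$$

where $P_o$, $P_e$, $D_o$, and $D_e$ are the probabilities defined in (1)-(2). Consequently,

$$\kappa_D = \left(\frac{1 - P_e}{P_e}\right)\kappa \tag{14}$$

where, of course, $\kappa_D = \kappa$ for $P_e = 0.5$. Except for the minus sign, $\kappa_D$ in (13) follows from $\kappa$ in (1)-(2) by simply substituting disagreement probabilities for the corresponding agreement probabilities.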
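One way to read the substitution described above is sketched below: take Cohen's agreement form, replace each agreement probability by the corresponding disagreement probability, and change the sign. Under that reading the coefficient simplifies to the ratio of observed to chance-expected agreement minus 1. The function names, the guard against tables with above-chance agreement, and the example table are assumptions of this sketch.

```python
def agreement_probabilities(p):
    n = len(p)
    rows = [sum(p[i]) for i in range(n)]
    cols = [sum(p[i][j] for i in range(n)) for j in range(n)]
    observed = sum(p[i][i] for i in range(n))
    expected = sum(rows[i] * cols[i] for i in range(n))
    return observed, expected


def disagreement_kappa(p):
    """Negative-range coefficient: disagreement substituted for agreement, sign flipped."""
    p_obs, p_exp = agreement_probabilities(p)
    if p_obs > p_exp:
        raise ValueError("intended only when observed agreement does not exceed chance")
    d_obs, d_exp = 1.0 - p_obs, 1.0 - p_exp
    return -(d_obs - d_exp) / (1.0 - d_exp)   # equals p_obs / p_exp - 1


# Hypothetical table with less agreement than expected by chance.
table = [[0.10, 0.50],
         [0.30, 0.10]]

print(disagreement_kappa(table))
print(disagreement_kappa([[0.0, 0.1], [0.9, 0.0]]))   # no agreement at all
```

For the first table the observed agreement is 0.2 against a chance-expected 0.48, so the value is about −0.58, whereas Cohen's kappa for the same table is about −0.54. The second table has no agreement and yields −1 even though its marginals are far from uniform.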

The properties of $\kappa_D$ can be summarized as follows:

(P1) $\kappa_D$ is well defined if at least two cells of the contingency table contain nonzero probabilities.

(P2) $-1 \le \kappa_D \le 0$, with $\kappa_D = 0$ if observed agreement (disagreement) equals chance agreement (disagreement) (i.e., $P_o = P_e$ or $D_o = D_e$) and $\kappa_D = -1$ if, and only if, $P_o = 0$.

(P3) $\kappa_D$ can take on value $-1$ for any marginal distributions $\{p_{i+}\}$ and $\{p_{+j}\}$.

(P4) $\kappa_D$ has a meaningful interpretation as the relative extent to which the observed agreement probability is less than that expected by chance alone.

(P5) $\kappa_D$ takes on values that appear reasonable throughout its 0 to $-1$ range.

While Properties (P1)–(P4) are immediately apparent from the definition in (13), Property (P5) needs an explanation. This can most simply be done for the $I = 2$ category case and without undue loss of generality since, for any data set with $I > 2$, there exists an equivalent 2 × 2 table with the same $\kappa_D$-value. Therefore, one may consider a 2 × 2 table such as the one in Table 1 with the marginal probabilities given there. The first two entries in each cell correspond to the cases when $\kappa_D = -1$ and 0, respectively, while the third entry equals the weighted arithmetic mean of the other two entries with weights $-\kappa_D$ and $1 + \kappa_D$.
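The weighted-mean construction can be illustrated numerically. The sketch assumes that the coefficient equals observed agreement divided by chance-expected agreement, minus 1, and mixes two hypothetical 2 × 2 tables with the same marginals: one with no agreement (coefficient −1) and the independence table (coefficient 0). The parameters `a` and `lam` and the function names are illustrative and not taken from Table 1.

```python
def disagreement_kappa(p):
    n = len(p)
    rows = [sum(p[i]) for i in range(n)]
    cols = [sum(p[i][j] for i in range(n)) for j in range(n)]
    observed = sum(p[i][i] for i in range(n))
    expected = sum(rows[i] * cols[i] for i in range(n))
    return observed / expected - 1.0


def mixture_table(a, lam):
    """Mix the zero-agreement table (weight lam) with the independence table (weight 1 - lam)."""
    extreme = [[0.0, a],
               [1.0 - a, 0.0]]
    rows = [a, 1.0 - a]          # marginals shared by both component tables
    cols = [1.0 - a, a]
    independent = [[rows[i] * cols[j] for j in range(2)] for i in range(2)]
    return [[lam * extreme[i][j] + (1.0 - lam) * independent[i][j]
             for j in range(2)] for i in range(2)]


for lam in (0.0, 0.25, 0.5, 1.0):
    print(lam, disagreement_kappa(mixture_table(0.3, lam)))
```

Because both component tables share the same marginals, the chance-expected agreement is unchanged by the mixing, and the coefficient of the mixture is minus the weight placed on the zero-agreement table. The coefficient therefore moves linearly from 0 to −1 as the table moves from independence to complete disagreement.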