New Interpretations of Cohen’s Kappa
Cohen’s kappa is a widely used association coefficient for summarizing interrater agreement on a nominal scale. Kappa reduces the ratings of the two observers to a single number. With three or more categories it is more informative to summarize the ratings by category coefficients that describe the information for each category separately. Examples of category coefficients are the sensitivity or specificity of a category or the Bloch-Kraemer weighted kappa. However, in many research studies one is often only interested in a single overall number that roughly summarizes the agreement. It is shown that both the overall observed agreement and Cohen’s kappa are weighted averages of various category coefficients and thus can be used to summarize these category coefficients.
In various fields of science it is frequently required that an observer classifies a set of subjects into three or more nominal categories that are defined in advance. The observer may be a clinician who classifies children on the severity of a disease, a pathologist that rates the severity of lesions from scans, or a coder that transcribes interviews. If the observer did not fully understand what he or she was asked to interpret, or if the definition of the categories is ambiguous, the reliability of the rating system is at stake. To assess the reliability of the system researchers typically ask two or more observers to rate the same set of subjects independently. An analysis of the agreement between the observers can then be used as an indicator of the quality of the category definitions and the raters’ ability to apply them. High agreement between the ratings would indicate consensus in the diagnosis and interchangeability of the ratings.
There are several association coefficients that can be used for summarizing agreement between two observers [1–3]. In biomedical and behavioral science research the most widely used coefficient for summarizing agreement on a scale with two or more nominal categories is Cohen’s kappa [4–8]. The coefficient has been applied in thousands of research studies and is also frequently used for summarizing agreement if we have observers of one type paired with observers of a second type, and each of the observers assigns a subject to one of the categories. A closely related coefficient is Scott’s pi. The latter coefficient is commonly used in the field of content analysis [2, 10]. The two coefficients have similar formulas and differ in how agreement under chance is defined [3, 11].
Cohen’s kappa reduces the ratings of the two observers to a single real number. To provide a proper interpretation of the coefficient one must first understand its meaning. There are two descriptions of kappa in the literature. The first starts from the observed or raw agreement, which is the proportion of subjects that is classified into the same nominal categories by both observers. Several authors have argued that the overall observed agreement is artificially high and should be corrected for agreement due to chance [4, 6, 12]. Kappa can be described as a chance-corrected version of the observed agreement. The second interpretation of kappa involves the $2 \times 2$ tables that are obtained by combining all the categories of the agreement table other than the one of current interest into a single category. If we have $k$ categories, there are $k$ associated $2 \times 2$ tables, one for each category. For each $2 \times 2$ table we may calculate the kappa value. The value of a category kappa is a measure of the agreement between the observers on the particular category [13, 14]. The overall kappa is a weighted average of the category kappas [15–17].
The interpretation of the overall kappa as an average of the category kappas has two consequences. On the one hand, if the category kappas are quite different, for example, high agreement on one category but low agreement on another category, the overall kappa cannot fully reflect the complexity of the agreement between the observers. If a researcher is interested in understanding the patterns of agreement and disagreement, it would be good practice to report (various) category coefficients for the individual categories, since this provides substantially more information than reporting only a single number. Alternatively, one can use log-linear or latent class models for modeling agreement. On the other hand, since the overall kappa is a weighted average, its value lies somewhere between the minimum and maximum of the category kappas. The overall kappa thus in a sense summarizes the agreement on the categories. If one is interested in a single number that roughly summarizes the agreement between the observers, which appears to be the case in many applications of Cohen’s kappa, then kappa can be used.
In this paper we present several new interpretations of the overall observed agreement, Cohen’s kappa, and Scott’s pi. The results presented here can be seen as support for the use of these coefficients as summary coefficients of the information on the categories. The paper is organized as follows. In Section 2 we present definitions of various category coefficients and three overall coefficients. The new interpretations are based on the correction for chance function and weighted averaging function of category coefficients. The domains and codomains of these functions are coefficient spaces. These spaces are also defined in Section 2. In Section 3 we define the correction for chance function, study some of its properties, and present an application. In Section 4 we define the weighted averaging function and study some of its properties. As an application of this function it is shown that Cohen’s kappa is an average of Bloch-Kraemer weighted kappas. A numerical illustration of this result is presented in Section 6. Finally, in Section 5 the composition of the correction for chance function and the averaging function is studied. It is shown that the functions commute under composition. It then follows that Cohen’s kappa and Scott’s pi are both averages of chance-corrected category coefficients, as well as chance-corrected versions of a weighted average of the category coefficients. The category coefficients include the sensitivity, specificity, and the positive and negative predictive values of the categories. Section 7 contains a conclusion.
2. Association Coefficients
2.1. Coefficient Spaces
For a population of subjects, let $p_{ij}$ denote the proportion classified into category $i$ by the first observer and into category $j$ by the second observer, where $i, j \in \{1, 2, \ldots, k\}$. The categories are nominal. Define
$$p_{i+} = \sum_{j=1}^{k} p_{ij}, \qquad p_{+i} = \sum_{j=1}^{k} p_{ji}.$$
The quantities $p_{i+}$ and $p_{+i}$ are the marginal totals of the table $\{p_{ij}\}$. They satisfy
$$\sum_{i=1}^{k} p_{i+} = \sum_{i=1}^{k} p_{+i} = 1.$$
For a fixed number of categories $k$, association coefficients are here defined as functions from the set of all $k \times k$ tables with proportions into the real numbers. The domain of the functions is defined as
$$D = \Bigl\{ \{p_{ij}\} : p_{ij} \geq 0, \ \sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij} = 1 \Bigr\}.$$
An association coefficient is then a function $D \to \mathbb{R}$ that assigns a real number to a contingency table. For many association coefficients the codomain is either the closed interval $[0, 1]$ or the interval $[-1, 1]$. For notational convenience we will assume in this paper that all association coefficients have maximum value unity.
The set of all association coefficients is the set of all functions from $D$ into $\mathbb{R}$. For most theoretical studies this set is too big. It turns out that the association coefficients that are used in data-analytic applications in real life belong to specific subsets of this set. For example, some association coefficients only describe the information for a particular category $i$. For category $i$ all information is summarized in the element $p_{ii}$ and the totals $p_{i+}$ and $p_{+i}$. The diagonal element $p_{ii}$ denotes the proportion of subjects classified into category $i$ by both raters. It indicates how often the raters agreed on category $i$. The marginal totals $p_{i+}$ and $p_{+i}$ indicate how often category $i$ was used by the raters. Let $a_i$ and $b_i$ be functions of the marginal totals $p_{i+}$ and $p_{+i}$. For category $i$ we define the set
$$L_i = \Bigl\{ \frac{p_{ii} + a_i}{b_i} \Bigr\}.$$
Given fixed marginal totals $p_{i+}$ and $p_{+i}$, the coefficient space $L_i$ consists of all linear transformations of $p_{ii}$. In the context of a validity study, examples of coefficients in $L_i$ are the sensitivity $p_{ii}/p_{i+}$, the positive predictive value $p_{ii}/p_{+i}$, and the specificity $(1 - p_{i+} - p_{+i} + p_{ii})/(1 - p_{i+})$ and the negative predictive value $(1 - p_{i+} - p_{+i} + p_{ii})/(1 - p_{+i})$ of category $i$. Additional examples of elements in $L_i$ are presented in the next section.
2.2. Examples of Category Coefficients
Since we are only interested in the quantities $p_{ii}$, $p_{i+}$, and $p_{+i}$ associated with category $i$, we can collapse the $k \times k$ contingency table into a $2 \times 2$ table by combining all categories except category $i$. Table 1 presents the collapsed $2 \times 2$ table for category $i$. A $2 \times 2$ table can be the result of a reliability study involving two observers but also of a validity study. In the latter case a new test is usually compared to a “more-or-less gold standard.” For example, in a medical test evaluation one has a gold standard evaluation of the presence/absence or type of a disease against which a new test can be assessed. In this paper the rows of the contingency tables are associated with the gold standard, while the columns are associated with the new test.
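Example 1. Let $w \in [0, 1]$ be a weight and consider for $i \in \{1, 2, \ldots, k\}$ the functions
$$S_i^{(w)} = \frac{p_{ii}}{w p_{i+} + (1 - w) p_{+i}}, \tag{5}$$
where $p_{ii}$ is the proportion of subjects classified into category $i$ by both observers and $p_{i+}$ and $p_{+i}$ are the marginal totals of category $i$. Coefficient $S_i^{(1)} = p_{ii}/p_{i+}$ is the sensitivity of category $i$, while $S_i^{(0)} = p_{ii}/p_{+i}$ is the positive predictive value. The coefficient $S_i^{(1/2)} = 2p_{ii}/(p_{i+} + p_{+i})$ is the coefficient proposed in Dice, a widely used coefficient in ecological biology.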
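The category coefficients of Example 1 can be computed directly from a table of counts. The following is a minimal sketch, assuming the family has the form $p_{ii}/(w p_{i+} + (1 - w) p_{+i})$ with rows for the gold standard and columns for the new test, so that $w = 1$, $w = 0$, and $w = 1/2$ give the sensitivity, the positive predictive value, and the Dice coefficient. The function name `category_coefficient` and the table of counts are hypothetical and serve only as an illustration.

```python
def category_coefficient(table, i, w):
    """Category coefficient p_ii / (w * p_i+ + (1 - w) * p_+i) for category i.

    `table` is a square list of lists of counts (rows: gold standard,
    columns: new test). The functional form is an assumption of this sketch.
    """
    n = sum(sum(row) for row in table)
    p_ii = table[i][i] / n                        # agreement on category i
    row_total = sum(table[i]) / n                 # p_i+, gold-standard marginal
    col_total = sum(row[i] for row in table) / n  # p_+i, new-test marginal
    return p_ii / (w * row_total + (1 - w) * col_total)


# Hypothetical 3 x 3 table of counts, invented for illustration only.
TABLE = [
    [25, 10, 5],
    [5, 20, 5],
    [0, 10, 20],
]

sensitivity = category_coefficient(TABLE, 0, 1.0)  # p_ii / p_i+
ppv = category_coefficient(TABLE, 0, 0.0)          # p_ii / p_+i
dice = category_coefficient(TABLE, 0, 0.5)         # 2 p_ii / (p_i+ + p_+i)
print(f"sensitivity={sensitivity:.3f}, PPV={ppv:.3f}, Dice={dice:.3f}")
```

Because the denominator is a convex combination of the two marginal totals, the Dice coefficient always lies between the sensitivity and the positive predictive value.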
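Lemma 2 shows that for all $w \in [0, 1]$ the function $S_i^{(w)}$ belongs to $L_i$, the coefficient space associated with category $i$.
Lemma 2. One has $S_i^{(w)} \in L_i$ for all $w \in [0, 1]$.
Proof. We first show that $S_i^{(w)} \in [0, 1]$ for all $w \in [0, 1]$. We have $0 \leq p_{ii} \leq \min\{p_{i+}, p_{+i}\}$, since the value of $p_{ii}$ cannot exceed the marginal totals $p_{i+}$ and $p_{+i}$. Furthermore, note that for fixed $p_{i+}$ and $p_{+i}$ the set $\{w p_{i+} + (1 - w) p_{+i} : w \in [0, 1]\}$ is convex. It consists of all values between $p_{i+}$ and $p_{+i}$. Since $p_{i+}$ and $p_{+i}$ are nonnegative, all elements in the convex set are larger than or equal to $\min\{p_{i+}, p_{+i}\}$. Hence, $w p_{i+} + (1 - w) p_{+i} \geq p_{ii}$ for all $w \in [0, 1]$ and it follows that $S_i^{(w)} \in [0, 1]$ for all $w \in [0, 1]$.
Next, we can write $S_i^{(w)}$ as $(p_{ii} + a_i)/b_i$, where
$$a_i = 0, \tag{6a}$$
$$b_i = w p_{i+} + (1 - w) p_{+i}. \tag{6b}$$
Hence, $S_i^{(w)} \in L_i$ for all $w \in [0, 1]$.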
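Example 3. Let $u, w \in [0, 1]$ be weights and consider the function
$$S_i^{(u,w)} = \frac{p_{ii} + u(1 - p_{i+} - p_{+i})}{w p_{i+} + (1 - w) p_{+i} + u(1 - p_{i+} - p_{+i})}. \tag{7}$$
This two-parameter family was first studied in Warrens. Note that $S_i^{(0,w)} = S_i^{(w)}$; that is, if $u = 0$ we obtain the functions from Example 1. Since $S_i^{(w)} \in [0, 1]$ for all $w$ (Lemma 2), we also have $S_i^{(u,w)} \in [0, 1]$ for all $u$ and $w$. Furthermore, we can write $S_i^{(u,w)}$ as $(p_{ii} + a_i)/b_i$, where
$$a_i = u(1 - p_{i+} - p_{+i}), \tag{8}$$
$$b_i = w p_{i+} + (1 - w) p_{+i} + u(1 - p_{i+} - p_{+i}). \tag{9}$$
Hence, $S_i^{(u,w)} \in L_i$ for all $u$ and $w$. Several additional coefficients from the literature are special cases of $S_i^{(u,w)}$. Coefficient $S_i^{(1/2,1/2)}$ is the observed agreement of the collapsed $2 \times 2$ table associated with category $i$, while coefficients $S_i^{(1,0)}$ and $S_i^{(1,1)}$ are, respectively, the specificity and negative predictive value of category $i$.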
Example 4. For measuring validity in a $2 \times 2$ study, Bloch and Kraemer proposed the weighted kappa coefficient. The coefficient is based on an acknowledgment that the clinical consequences of a false negative may be quite different from the clinical consequences of a false positive. A false negative may delay treatment of a patient, while a false positive may result in unnecessary treatment. The Bloch-Kraemer weighted kappa is unique in that it requires that a real number $w \in [0, 1]$ must be specified a priori indicating the relative importance of the false negatives to the false positives. For category $i$ the weighted kappa is defined as [26, page 273]:
$$\kappa_i(w) = \frac{p_{ii}(1 - p_{i+} - p_{+i} + p_{ii}) - (p_{i+} - p_{ii})(p_{+i} - p_{ii})}{w p_{i+}(1 - p_{+i}) + (1 - w) p_{+i}(1 - p_{i+})}. \tag{10}$$
For all $w \in [0, 1]$, coefficient $\kappa_i(w)$ can be used in the context of the utility of association. Coefficient (10) is an asymmetric special case of the weighted kappa proposed in Cohen. The latter weighted kappa is widely used with agreement tables with three or more ordinal categories [28–30].
Coefficient $\kappa_i(1/2)$ is the ordinary Cohen’s kappa for the $2 \times 2$ table associated with category $i$. It is a standard tool in a reliability study. It is sometimes called the reliability of category $i$ [13, 14]. Coefficient $\kappa_i(1)$ is the coefficient of conditional agreement proposed in Coleman (see [32, page 367], and [33, page 397]). This coefficient can be used if one is interested in the agreement between the observers for those subjects which the first observer assigned to category $i$.
Since
$$p_{ii}(1 - p_{i+} - p_{+i} + p_{ii}) - (p_{i+} - p_{ii})(p_{+i} - p_{ii}) = p_{ii} - p_{i+} p_{+i},$$
we can write (10) as
$$\kappa_i(w) = \frac{p_{ii} - p_{i+} p_{+i}}{w p_{i+}(1 - p_{+i}) + (1 - w) p_{+i}(1 - p_{i+})}. \tag{12}$$
We can write (12) as $(p_{ii} + a_i)/b_i$, where
$$a_i = -p_{i+} p_{+i}, \tag{13a}$$
$$b_i = w p_{i+}(1 - p_{+i}) + (1 - w) p_{+i}(1 - p_{i+}). \tag{13b}$$
Hence, $\kappa_i(w) \in L_i$ for all $w \in [0, 1]$.
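Example 5. For the $2 \times 2$ table associated with category $i$, the intraclass kappa [26, page 276] can be defined as
$$\pi_i = \frac{p_{ii} - \bigl(\frac{p_{i+} + p_{+i}}{2}\bigr)^2}{\frac{p_{i+} + p_{+i}}{2} - \bigl(\frac{p_{i+} + p_{+i}}{2}\bigr)^2}. \tag{14}$$
The letter $\pi$ was originally used by Scott. Bloch and Kraemer showed that this coefficient can be used in the context of agreement. The intraclass kappa satisfies the classical definition of reliability [15, 18]. We can write (14) as $(p_{ii} + a_i)/b_i$, where
$$a_i = -\Bigl(\frac{p_{i+} + p_{+i}}{2}\Bigr)^2, \tag{15a}$$
$$b_i = \frac{p_{i+} + p_{+i}}{2} - \Bigl(\frac{p_{i+} + p_{+i}}{2}\Bigr)^2. \tag{15b}$$
Hence, $\pi_i \in L_i$.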
2.3. Examples of Overall Coefficients
Coefficients in the sets $L_i$ for $i \in \{1, 2, \ldots, k\}$ only describe the information of one category at a time. Other association coefficients summarize the information in all categories at once. Let $a$ and $b$ be functions of the marginal totals and define the set
$$L = \Bigl\{ \frac{P_o + a}{b} \Bigr\}, \qquad P_o = \sum_{i=1}^{k} p_{ii}.$$
Given fixed marginal totals the coefficient space $L$ consists of all linear transformations of the overall observed agreement $P_o$. Clearly, $P_o$ is an element of $L$. Other examples are Cohen’s kappa and Scott’s pi. The population value of Cohen’s kappa is defined as
$$\kappa = \frac{P_o - \sum_{i=1}^{k} p_{i+} p_{+i}}{1 - \sum_{i=1}^{k} p_{i+} p_{+i}}. \tag{18}$$
The numerator of kappa is the difference between the actual probability of agreement and the probability of agreement in the case of statistical independence of the ratings. The denominator of kappa is the maximum possible value of the numerator. Kappa has value 1 when there is perfect agreement between the observers, 0 when agreement is equal to that expected by chance, and a negative value when agreement is less than that expected by chance. We can write kappa as $(P_o + a)/b$, where
$$a = -\sum_{i=1}^{k} p_{i+} p_{+i}, \qquad b = 1 - \sum_{i=1}^{k} p_{i+} p_{+i}.$$
The population value of Scott’s pi is defined as [2, 9, 11]
$$\pi = \frac{P_o - \sum_{i=1}^{k} \bigl(\frac{p_{i+} + p_{+i}}{2}\bigr)^2}{1 - \sum_{i=1}^{k} \bigl(\frac{p_{i+} + p_{+i}}{2}\bigr)^2}.$$
The differences in the definitions of agreement under chance are discussed in Examples 9 and 10 in the next section. We always have the inequality $\kappa \geq \pi$.
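The three overall coefficients can be computed side by side from one agreement table. The sketch below uses the standard definitions of the observed agreement, Cohen’s kappa (chance agreement as the sum of products of the two marginals), and Scott’s pi (chance agreement as the sum of squared averaged marginals). The function name `overall_coefficients` and the counts are hypothetical.

```python
def overall_coefficients(table):
    """Return (observed agreement, Cohen's kappa, Scott's pi).

    `table` is a square list of lists of counts or proportions.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p = [[cell / n for cell in row] for row in table]
    rows = [sum(p[i]) for i in range(k)]                       # p_i+
    cols = [sum(p[i][j] for i in range(k)) for j in range(k)]  # p_+j
    p_o = sum(p[i][i] for i in range(k))
    # Chance agreement under two different marginal distributions (Cohen).
    e_cohen = sum(rows[i] * cols[i] for i in range(k))
    # Chance agreement under a single averaged distribution (Scott).
    e_scott = sum(((rows[i] + cols[i]) / 2) ** 2 for i in range(k))
    kappa = (p_o - e_cohen) / (1 - e_cohen)
    scott_pi = (p_o - e_scott) / (1 - e_scott)
    return p_o, kappa, scott_pi


# Hypothetical 3 x 3 table of counts, invented for illustration only.
TABLE = [
    [25, 10, 5],
    [5, 20, 5],
    [0, 10, 20],
]

p_o, kappa, scott_pi = overall_coefficients(TABLE)
print(f"P_o={p_o:.3f}, kappa={kappa:.3f}, pi={scott_pi:.3f}")
```

For this table the two marginal distributions differ, so kappa is strictly larger than pi; with identical marginals the two coefficients coincide.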
3. Correction for Chance
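In this section we define the correction for chance function. The expectation $E$ of a coefficient is conditional upon fixed marginal totals. The correction for chance function is denoted by $C$. For a category coefficient $S_i \in L_i$ it is defined as
$$C(S_i) = \frac{S_i - E(S_i)}{1 - E(S_i)}.$$
For an overall association coefficient $S \in L$ the correction for chance function is defined as
$$C(S) = \frac{S - E(S)}{1 - E(S)}.$$
The short formula is in both cases given by [3, 22, 35]
$$C(S) = \frac{S - E(S)}{1 - E(S)}. \tag{23}$$
We assume in (23) that $E(S) < 1$ to avoid indeterminacy. Lemma 6 presents an alternative expression for $C(S_i)$ if $S_i \in L_i$.
Lemma 6. Let $S_i \in L_i$ with $S_i = (p_{ii} + a_i)/b_i$. One has
$$C(S_i) = \frac{p_{ii} - E(p_{ii})}{b_i - a_i - E(p_{ii})}. \tag{24}$$
Lemma 7. Let $S \in L$ with $S = (P_o + a)/b$. One has
$$C(S) = \frac{P_o - E(P_o)}{b - a - E(P_o)}. \tag{26}$$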
The function $C$ is a map from $L_i$ to $L_i$ if $L_i$ is closed under $C$. Lemma 8 shows that this is the case.
Lemma 8. The spaces $L_i$ and $L$ are closed under $C$.
Proof. We present the proof for $L_i$ only. The proof for $L$ follows from using similar arguments.
Let $S_i \in L_i$ with $S_i = (p_{ii} + a_i)/b_i$. The formula for $C(S_i)$ is presented in (24). Since $E(p_{ii})$ is a function of the marginal totals $p_{i+}$ and $p_{+i}$ we can write $C(S_i)$ as $(p_{ii} + a_i^*)/b_i^*$, where
$$a_i^* = -E(p_{ii}), \qquad b_i^* = b_i - a_i - E(p_{ii}).$$
Hence, $C(S_i) \in L_i$, and the result follows.
Formula (24) shows that elements of $L_i$ coincide after correction for chance if they have the same difference $b_i - a_i$, regardless of the choice of $a_i$. This suggests the following definition. Two coefficients in $L_i$ are said to be equivalent with respect to (24), denoted by $\sim$, if they have the same difference $b_i - a_i$. It can be shown that $\sim$ is an equivalence relation on $L_i$. The equivalence relation divides the elements of $L_i$ into equivalence classes, one class for each value of the difference $b_i - a_i$.
Example 9. The expected value of $p_{ii}$ under statistical independence is given by
$$E(p_{ii}) = p_{i+} p_{+i}. \tag{28}$$
In this case we assume that the data are a product of chance concerning two different frequency distributions.
Example 10. Alternatively, we may assume that the data are a product of chance concerning a single frequency distribution [9, 11]. The common parameter is usually estimated by the arithmetic mean of the marginal totals $p_{i+}$ and $p_{+i}$. Hence, in this case we have
$$E(p_{ii}) = \Bigl(\frac{p_{i+} + p_{+i}}{2}\Bigr)^2. \tag{29}$$
Lemma 11 presents an application of the correction for chance function. In Lemma 11 the function is combined with Example 9. The result shows how the functions in Examples 1 and 3 are related to the function in Example 4.
Lemma 11. Assume (28) holds. Then $C(S_i^{(u,w)}) = \kappa_i(w)$ for all $u$ and $w$.
4. Averaging over Categories
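In this section we define a function that connects the association coefficients in the coefficient spaces $L_i$ to the coefficients in the space $L$. For $i \in \{1, 2, \ldots, k\}$ let $S_i \in L_i$ with $S_i = (p_{ii} + a_i)/b_i$. For these coefficients we define the function
$$A(S_1, \ldots, S_k) = \frac{\sum_{i=1}^{k} b_i S_i}{\sum_{i=1}^{k} b_i}$$
or
$$A(S_1, \ldots, S_k) = \frac{\sum_{i=1}^{k} (p_{ii} + a_i)}{\sum_{i=1}^{k} b_i}. \tag{32}$$
Thus, $A$ is the weighted average of the $S_i$ using the denominators $b_i$ of the $S_i$ as weights. This weighted average is similar to the arithmetic mean of the category coefficients. In the calculation of the arithmetic mean each category coefficient contributes equally to the final average. In the calculation of $A$ some category coefficients contribute more than others. We check whether function (32) is well-defined.
Lemma 12. Function (32) is well-defined.
Proof. It must be shown that $A(S_1, \ldots, S_k)$ is an element of $L$. Since $a_i$ and $b_i$ each are functions of the marginal totals $p_{i+}$ and $p_{+i}$, the sums $\sum_i a_i$ and $\sum_i b_i$ are also functions of the marginal totals. Hence, we can write $A(S_1, \ldots, S_k)$ as $(P_o + a)/b$, where
$$a = \sum_{i=1}^{k} a_i, \tag{34a}$$
$$b = \sum_{i=1}^{k} b_i, \tag{34b}$$
from which the result follows.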
In the remainder of this section we consider some results associated with the weighted average function in (32). If we fix $w$, then (5) provides $k$ association coefficients for describing the agreement between the observers, one for each category. Lemma 13 shows that a weighted average of these coefficients is equivalent to the overall observed agreement $P_o$, regardless of the value of $w$.
Lemma 13. Let $w \in [0, 1]$ be fixed. One has
$$A\bigl(S_1^{(w)}, \ldots, S_k^{(w)}\bigr) = P_o.$$
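The identity can be checked in two lines. The derivation below is a sketch that assumes the category coefficients have the form $p_{ii}/(w p_{i+} + (1 - w) p_{+i})$ and are averaged with their denominators as weights, writing $P_o = \sum_i p_{ii}$ for the overall observed agreement and using that both sets of marginal totals sum to one:

```latex
\begin{align*}
\frac{\sum_{i=1}^{k} \bigl(w p_{i+} + (1-w) p_{+i}\bigr)\, S_i^{(w)}}
     {\sum_{i=1}^{k} \bigl(w p_{i+} + (1-w) p_{+i}\bigr)}
  &= \frac{\sum_{i=1}^{k} p_{ii}}
          {w \sum_{i=1}^{k} p_{i+} + (1-w) \sum_{i=1}^{k} p_{+i}} \\
  &= \frac{P_o}{w + (1-w)} = P_o .
\end{align*}
```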
If we fix $w$, then (12) provides us with $k$ Bloch-Kraemer weighted kappas for describing the agreement between the observers, one for each category. Lemma 14 shows that a weighted average of these coefficients is equivalent to Cohen’s kappa in (18), regardless of our choice of $w$.
Lemma 14. Let $w \in [0, 1]$ be fixed. One has
$$A\bigl(\kappa_1(w), \ldots, \kappa_k(w)\bigr) = \kappa.$$
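Lemma 14 can also be checked numerically. The sketch below assumes the Bloch-Kraemer weighted kappa of category $i$ has numerator $p_{ii} - p_{i+} p_{+i}$ and denominator $w p_{i+}(1 - p_{+i}) + (1 - w) p_{+i}(1 - p_{i+})$, and that the category kappas are averaged with these denominators as weights. The function names, the values of $w$, and the table of counts are hypothetical.

```python
def bloch_kraemer_terms(table, i, w):
    """Numerator and denominator of the weighted kappa for category i."""
    n = sum(sum(row) for row in table)
    p_ii = table[i][i] / n
    row_total = sum(table[i]) / n                 # p_i+
    col_total = sum(row[i] for row in table) / n  # p_+i
    numerator = p_ii - row_total * col_total
    denominator = (w * row_total * (1 - col_total)
                   + (1 - w) * col_total * (1 - row_total))
    return numerator, denominator


def cohen_kappa(table):
    """Overall Cohen's kappa of a square table of counts."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n
    p_e = sum((sum(table[i]) / n) * (sum(row[i] for row in table) / n)
              for i in range(k))
    return (p_o - p_e) / (1 - p_e)


def averaged_weighted_kappa(table, w):
    """Average the category kappas, using their denominators as weights."""
    terms = [bloch_kraemer_terms(table, i, w) for i in range(len(table))]
    kappas = [num / den for num, den in terms]
    weights = [den for _, den in terms]
    return sum(wt * kp for wt, kp in zip(weights, kappas)) / sum(weights)


# Hypothetical 3 x 3 table of counts, invented for illustration only.
TABLE = [
    [25, 10, 5],
    [5, 20, 5],
    [0, 10, 20],
]

for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(w, round(averaged_weighted_kappa(TABLE, w), 6),
          round(cohen_kappa(TABLE), 6))
```

The individual category kappas change with $w$, but their weighted average stays equal to the overall kappa because the weights change with them.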
Lemma 15. One has
$$A(\pi_1, \ldots, \pi_k) = \pi.$$
5. Composition of Functions
In Sections 3 and 4 we studied the correction for chance function and the weighted average function separately. In this section we study the composition of the two functions. Lemma 16 shows that the two functions commute. Hence, changing the order of the functions does not change the result.
Lemma 16. For $i \in \{1, 2, \ldots, k\}$ let $S_i \in L_i$ with $S_i = (p_{ii} + a_i)/b_i$. One has
$$A\bigl(C(S_1), \ldots, C(S_k)\bigr) = C\bigl(A(S_1, \ldots, S_k)\bigr).$$
Proof. We will show that both compositions are equivalent to
$$\frac{P_o - E(P_o)}{\sum_{i=1}^{k} (b_i - a_i) - E(P_o)}, \tag{45}$$
where $E(P_o) = \sum_{i=1}^{k} E(p_{ii})$.
The formula for the $C(S_i)$ is presented in (24). Adding the numerators of (24) we obtain the numerator of (45) and adding the denominators of (24) we obtain the denominator of (45). Hence, $A(C(S_1), \ldots, C(S_k))$ is equivalent to (45).
The formula for $A(S_1, \ldots, S_k)$ is presented in (32). The coefficient can be written as $(P_o + a)/b$, where $a$ and $b$ are presented in (34a) and (34b). Using this $a$ and $b$ in (26) we also obtain (45).
Lemma 16 shows that we can either take the average of the chance-corrected versions of coefficients or take a weighted average of coefficients and then correct the overall coefficient for agreement due to chance. The result will be the same. Coefficient (45) contains two quantities that must be specified, namely, the expectation $E(P_o)$ and the sum of the differences $\sum_i (b_i - a_i)$. Using, for fixed $w$, $a_i$ and $b_i$ in (6a) and (6b), (8) and (9), (13a) and (13b), or (15a) and (15b) we obtain
$$b_i - a_i = w p_{i+} + (1 - w) p_{+i}. \tag{46}$$
Identity (46) shows that all coefficients discussed in Section 2 belong to a specific family of linear transformations. An example of a coefficient that does not belong to this family is the phi coefficient in (50). For other examples, see .
Using identity (46) in (45) we obtain the overall coefficient
$$\frac{P_o - E(P_o)}{1 - E(P_o)}. \tag{47}$$
If we use $E(p_{ii})$ in (28) in (47) we obtain Cohen’s kappa, whereas if we use $E(p_{ii})$ in (29) in (47) we obtain Scott’s pi. The overall kappa is not a weighted average of phi coefficients.
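The commutation of the two functions can be illustrated numerically. The sketch below assumes category coefficients of the form $p_{ii}/b_i$ with $b_i = w p_{i+} + (1 - w) p_{+i}$, the chance correction $(S - E(S))/(1 - E(S))$, and averaging with the denominators as weights. It applies the two operations in both orders, once with the expectation under statistical independence and once with the expectation under a single averaged distribution. All function names and the table of counts are hypothetical.

```python
def marginals(table):
    """Return diagonal proportions, row totals, and column totals."""
    n = sum(sum(row) for row in table)
    k = len(table)
    diag = [table[i][i] / n for i in range(k)]
    rows = [sum(table[i]) / n for i in range(k)]
    cols = [sum(row[i] for row in table) / n for i in range(k)]
    return diag, rows, cols


def correct_for_chance(value, expected):
    """Chance correction (S - E(S)) / (1 - E(S))."""
    return (value - expected) / (1 - expected)


def both_orders(table, w, expected_cell):
    """Apply correction and averaging in both orders to p_ii / b_i."""
    diag, rows, cols = marginals(table)
    k = len(diag)
    b = [w * rows[i] + (1 - w) * cols[i] for i in range(k)]
    e = [expected_cell(rows[i], cols[i]) for i in range(k)]

    # Order 1: correct each category coefficient, then average using the
    # denominators b_i - E(p_ii) of the corrected coefficients as weights.
    corrected = [correct_for_chance(diag[i] / b[i], e[i] / b[i])
                 for i in range(k)]
    weights = [b[i] - e[i] for i in range(k)]
    average_of_corrected = (sum(wt * c for wt, c in zip(weights, corrected))
                            / sum(weights))

    # Order 2: average with the denominators b_i as weights, then correct.
    average = sum(b[i] * (diag[i] / b[i]) for i in range(k)) / sum(b)
    expected_average = sum(e) / sum(b)
    corrected_average = correct_for_chance(average, expected_average)
    return average_of_corrected, corrected_average


def independence(row_total, col_total):
    """Expected p_ii under statistical independence."""
    return row_total * col_total


def single_distribution(row_total, col_total):
    """Expected p_ii under one averaged marginal distribution."""
    return ((row_total + col_total) / 2) ** 2


# Hypothetical 3 x 3 table of counts, invented for illustration only.
TABLE = [
    [25, 10, 5],
    [5, 20, 5],
    [0, 10, 20],
]

print(both_orders(TABLE, 0.7, independence))         # both equal Cohen's kappa
print(both_orders(TABLE, 0.7, single_distribution))  # both equal Scott's pi
```

In each call the two returned values agree up to floating-point error, whatever weight $w$ is chosen.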
6. A Numerical Illustration
In this section we present a numerical illustration of Lemma 14, which shows that for fixed $w$ Cohen’s kappa is a weighted average of the Bloch-Kraemer weighted kappas associated with each category. Let $n_{ij}$ denote the observed number of subjects that are classified into category $i$ by the first observer and into category $j$ by the second observer. Assuming a multinomial sampling model with the total number of subjects $n$ fixed, the maximum likelihood estimate of the cell probability $p_{ij}$ is given by $\hat{p}_{ij} = n_{ij}/n$. We obtain the maximum likelihood estimates $\hat{\kappa}_i(w)$ and $\hat{\kappa}$ by replacing the cell probabilities $p_{ij}$ by the $\hat{p}_{ij}$ in the Bloch-Kraemer weighted kappas in (12) and Cohen’s kappa in (18) [33, page 396]. Let The approximate large sample variance of $\hat{\kappa}$ [33, 34, 36] is given by The product-moment correlation coefficient or phi coefficient for the $2 \times 2$ table associated with category $i$ is given by
$$\phi_i = \frac{p_{ii} - p_{i+} p_{+i}}{\sqrt{p_{i+}(1 - p_{i+}) p_{+i}(1 - p_{+i})}}. \tag{50}$$
The asymptotic variance [26, page 279] of $\hat{\kappa}_i(w)$ is given by where
To illustrate Lemma 14 we consider the data in Table 2 taken from Fennig et al. These authors investigated the accuracy of clinical diagnosis in psychotic patients. As a gold standard they used the ratings of two project psychiatrists, called the research diagnosis. Table 2 presents the cross-classification of the research and clinical diagnoses. The estimate of the overall kappa for these data has 95% confidence interval (0.341–0.522), indicating a moderate overall level of agreement. Table 3 presents the estimates of the Bloch-Kraemer weighted kappas for the four categories, labeled S, B, D, and O, for five distinct values of $w$. The table also presents the associated 95% confidence intervals between parentheses.
The statistics for category Schizophrenia in Table 3 are equivalent for all values of $w$ because the row and column totals of this category are equal, so that the denominator in (12) does not depend on $w$. The estimate has 95% confidence interval (0.330–0.585), indicating a moderate level of agreement on Schizophrenia. The level of agreement on the other categories depends on the value of $w$. The agreement on categories Bipolar disorder and Depression is higher than that of Schizophrenia for all values of $w$, while the agreement on category Other is lowest for all values of $w$. Finally, recall that, for fixed $w$, the overall kappa is a weighted average of the Bloch-Kraemer weighted kappas. For example, for we have and for we have
The data in Tables 2 and 3 show that if we use the same category coefficients for all categories, then the coefficients in general produce different values. This observation holds for almost all real life data. Table 4 presents a hypothetical data set with three nominal categories. Table 5 presents the corresponding estimates of the Bloch-Kraemer weighted kappas for the three categories, labeled A, B, and C, for five distinct values of $w$ and the associated 95% confidence intervals. The statistics for category B in Table 5 are equivalent for all values of $w$ because the row and column totals of this category are equal. The estimate of the overall kappa for these data is 0.356 with 95% confidence interval (0.229–0.482). Furthermore, all the estimates of the category kappas have the same value 0.356. Thus, in this hypothetical case the overall kappa is a perfect summary coefficient of the three category kappas. Due to Lemma 14, we know that the overall kappa also roughly summarizes the other Bloch-Kraemer weighted kappas. However, these weighted kappas have quite distinct values. These data illustrate that while the overall kappa is always a summary coefficient of all types of Bloch-Kraemer category kappas, it can be a perfect summary coefficient for a particular type of weighted kappas. Conversely, while the overall kappa may summarize one type of category coefficients perfectly, it can still be a poor summary coefficient for other types of category coefficients.
7. Conclusion
Cohen’s kappa is a commonly used association measure for summarizing agreement between two observers on a nominal scale. The coefficient reduces the ratings of the two observers to a single real number. In general, this leads to a substantial loss of information. A more complete picture of the interobserver agreement is obtained by assessing the degree of agreement on the individual categories. There are various association coefficients that can be used to describe the information for each category separately. Examples are the sensitivity and specificity of a category, the positive predictive value, negative predictive value, and the Bloch-Kraemer weighted kappa. Once we have selected a category coefficient we have multiple coefficients describing the agreement between the observers, one for each category. If one is interested in a single number that roughly summarizes the agreement between the observers, what overall coefficient should be used? The results derived in this paper show that the overall observed agreement, Cohen’s kappa, and Scott’s pi are proper overall coefficients. Each coefficient is a weighted average of certain category coefficients and therefore its value lies somewhere between the minimum and maximum of the category coefficients. We enumerate some of the new interpretations that were found.
(1) Suppose each category coefficient is the same special case of the function in (5). Examples are the sensitivity, positive predictive value, and the Dice coefficient. The observed agreement is a weighted average of the category coefficients (Lemma 13).
(2) Suppose that each category coefficient is the same Bloch-Kraemer weighted kappa in (12). Then Cohen’s kappa is a weighted average of the weighted kappas (Lemma 14).
(3) Suppose that each category coefficient is the intraclass kappa in (14). Then Scott’s pi is a weighted average of the intraclass kappas (Lemma 15).
(4) Suppose that the value of a coefficient under chance is the value under statistical independence. Furthermore, suppose that each category coefficient is the same special case of the general function in (7). Examples are the sensitivity, specificity, positive predictive value, negative predictive value, the observed agreement, and the Dice coefficient. Then Cohen’s kappa is both a weighted average of the chance-corrected category coefficients and a chance-corrected version of a weighted average of the category coefficients (Lemma 16).
An illustration of Lemma 14 was presented in Section 6. The lemmas presented in this paper show that there is an abundance of category coefficients of which the observed agreement and Cohen’s kappa are summary coefficients. The results provide a basis for using these overall coefficients if one is only interested in a single number that roughly summarizes the agreement between the observers. If, on the other hand, one is interested in understanding the patterns of agreement and disagreement, one can report various category coefficients for the individual categories or consider log-linear or latent class models that can be used to model the agreement.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
This research is part of Veni Project 451-11-026 funded by The Netherlands Organisation for Scientific Research.
L. M. Hsu and R. Field, “Interrater agreement measures: comments on Kappa$_n$, Cohen's Kappa, Scott's $\pi$, and Aickin's $\alpha$,” Understanding Statistics, vol. 2, pp. 205–219, 2003.
M. Maclure and W. C. Willett, “Misinterpretation and misuse of the Kappa statistic,” American Journal of Epidemiology, vol. 126, no. 2, pp. 161–169, 1987.
K. Krippendorff, Content Analysis: An Introduction to Its Methodology, Sage, Thousand Oaks, Calif, USA, 2nd edition, 2004.
J. L. Fleiss, Statistical Methods for Rates and Proportions, Wiley, New York, NY, USA, 1981.
S. Fennig, T. J. Craig, M. Tanenberg-Karant, and E. J. Bromet, “Comparison of facility and research diagnoses in first-admission psychotic patients,” The American Journal of Psychiatry, vol. 151, no. 10, pp. 1423–1429, 1994.
J. S. Coleman, “Measures of concordance or consensus between members of social groups,” Johns Hopkins University, 1966.
Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, Mass, USA, 1975.