Abstract

Weighted kappa is a widely used statistic for summarizing inter-rater agreement on a categorical scale. For rating scales with three categories, there are seven versions of weighted kappa. It is shown analytically how these weighted kappas are related. Several conditional equalities and inequalities between the weighted kappas are derived. The analysis indicates that the weighted kappas are measuring the same thing, but to a different extent. One cannot, therefore, use the same magnitude guidelines for all weighted kappas.

1. Introduction

In biomedical, behavioral, and engineering research, it is frequently required that a group of objects is rated on a categorical scale by two observers. Examples are the following: clinicians that classify the extent of disease in patients; pathologists that rate the severity of lesions from scans; and experts that classify production faults. Analysis of the agreement between the two observers can be used to assess the reliability of the rating system. High agreement would indicate consensus in the diagnosis and interchangeability of the observers. Various authors have proposed statistical methodology for analyzing agreement. For example, for modeling patterns of agreement, the loglinear models proposed in Tanner and Young [1] and Agresti [2, 3] can be used. However, in practice researchers are frequently only interested in a single number that quantifies the degree of agreement between the raters [4, 5]. Various statistics have been proposed in the literature [6, 7], but the most popular statistic for summarizing rater agreement is the weighted kappa introduced by Cohen [8].

Weighted kappa allows the use of weighting schemes to describe the closeness of agreement between categories. Each weighting scheme defines a different version or special case of weighted kappa. Different weighting schemes have been proposed for the various scale types. In this paper, we only consider scales of three categories. This is the smallest number of categories for which we can distinguish three types of categorical scales, namely, nominal scales, continuous-ordinal scales, and dichotomous-ordinal scales [9]. A dichotomous-ordinal scale contains a point of “absence” and two points of “presence”, for example, no disability, moderate disability, or severe disability. A continuous-ordinal scale does not have a point of “absence”. The scale can be described by three categories of “presence”, for example, low, moderate, or high. Identity weights are used when the categories are nominal [10]. In this case, weighted kappa becomes the unweighted kappa introduced by Cohen [11], also known as Cohen’s kappa. Linear weights [12, 13] or quadratic weights [14, 15] can be used when the categories are continuous ordinal. The modified linear weights introduced in Cicchetti [9] are suitable if the categories are dichotomous ordinal.

Although weighted kappa has been used in thousands of research applications [16], it has also been criticized by various authors [17–19]. Most of the criticism has focused on a particular version of weighted kappa, namely, Cohen’s kappa for nominal categories. Weighted kappa and unweighted kappa correct for rater agreement due to chance alone using the marginal distributions. For example, in the context of latent class models, de Mast [18] and de Mast and van Wieringen [6] argued that the premise that chance measurements have the distribution defined by the marginal distributions cannot be defended. It is, therefore, difficult to interpret the value of Cohen’s kappa, and the question of how large or how small the value should be becomes arbitrary. Using signal detection theory, Uebersax [19] showed that different agreement studies with different marginal distributions can produce the same value of Cohen’s kappa. Again, this makes the value difficult to interpret. Alternative statistics for summarizing inter-rater agreement are discussed in, for example, de Mast [18] and Perreault and Leigh [20].

Although the choice for a specific version of weighted kappa usually depends on the type of categorical scale at hand, it frequently occurs that weighted kappas corresponding to different weighting schemes are applied to the same data. For example, Cohen’s kappa for nominal scales [11] is also frequently applied when the categories are continuous ordinal. When different weighted kappas are applied to the same data, they usually produce different values [5, 21]. For understanding the behavior of weighted kappa and its dependence on the weighting scheme, it is useful to compare the different versions of weighted kappa analytically [21]. For example, if the agreement table is tridiagonal, then the value of the quadratically weighted kappa exceeds the value of the linearly weighted kappa, which, in turn, is higher than the value of unweighted kappa [22, 23]. An agreement table is tridiagonal if it has nonzero elements only on the main diagonal and on the two diagonals directly adjacent to the main diagonal. These analytic results explain orderings of the weighted kappas that are observed in practice.

In this paper, we consider scales that consist of three categories and compare the values of seven special cases of weighted kappa. There are several reasons why the case of three categories is an interesting topic of investigation. First of all, various scales that are used in practice consist of three categories only. Examples can be found in Anderson et al. [24] and Martin et al. [25]. Furthermore, the case of three categories is the smallest case where symmetrically weighted kappas in general have different values, since with two categories all weighted kappas with symmetric weighting schemes coincide. Finally, as it turns out, with three categories we may derive several strong analytic results, which do not generalize to the case of four or more categories. The seven weighted kappas belong to two parameter families. For each parameter family, it is shown that there are only two possible orderings of its members. Hence, despite the fact that the paper is limited to weighted kappas for three categories, we present various interesting and useful results that deepen our understanding of the application of weighted kappa.

The paper is organized as follows. In Section 2 we introduce notation and define four versions of weighted kappa. In Section 3, we introduce the three category reliabilities of an agreement table as special cases of weighted kappa. The two parameter families are defined in Section 4. In Section 5, we present several results on inequalities between the seven weighted kappas. In Section 6, we consider the case that all special cases of weighted kappa coincide. Section 7 contains a discussion.

2. Weighted Kappas

Suppose that two raters each independently classify the same set of objects (individuals, observations) into the same set of three categories that are defined in advance. For a population of objects, let $p_{ij}$ for $i, j \in \{1, 2, 3\}$ denote the proportion classified into category $i$ by the first observer and into category $j$ by the second observer. Table 1 presents an abstract version of a population agreement table of proportions. The marginal totals $p_{i+} = \sum_j p_{ij}$ and $p_{+j} = \sum_i p_{ij}$ indicate how often raters 1 and 2 used the categories 1, 2, and 3. Four examples of agreement tables from the literature with frequencies are presented in Table 2. The marginal totals of the tables are in bold. For each table, the last column of Table 2 contains the corresponding estimates of seven weighted kappas. Between brackets behind each point estimate is the associated 95% confidence interval. Definitions of the weighted kappas are presented below.

Recall that weighted kappa allows the use of weighting schemes to describe the closeness of agreement between categories. For each cell probability $p_{ij}$, we may specify a weight $w_{ij}$. A weighting scheme is called symmetric if for all $i$ and $j$ the cell probabilities $p_{ij}$ and $p_{ji}$ are assigned the same weight, that is, $w_{ij} = w_{ji}$. The weighting schemes can be formulated from either a similarity or a dissimilarity perspective. Definitions of weighted kappa in terms of similarity scaling can be found in Warrens [13, 22]. For notational convenience, we will define the weights in terms of dissimilarity scaling here. For the elements on the agreement diagonal, there is no disagreement. The diagonal elements are, therefore, assigned zero weight, $w_{ii} = 0$ [8, page 215]. The other six weights are nonnegative real numbers $w_{ij} \geq 0$ for $i \neq j$. The inequality $w_{ij} > 0$ indicates that there is some disagreement between the assignments by the raters. Categories that are more similar are assigned smaller weights. For example, ordinal scale categories that are one unit apart in the natural ordering are assigned smaller weights than categories that are more units apart.

Table 3 presents one general and seven specific weighting schemes from the literature. The identity weighting scheme for nominal categories was introduced in Cohen [11]. The top table in Table 2 is an example of a nominal scale. The quadratic weighting scheme for continuous-ordinal categories was introduced in Cohen [8]. The quadratically weighted kappa is the most popular version of weighted kappa [4, 5, 15]. The linear weighting scheme for continuous-ordinal categories was introduced in Cicchetti and Allison [29] and Cicchetti [30]. The second table in Table 2 is an example of a continuous-ordinal scale. The dichotomous-ordinal weighting scheme was introduced in Cicchetti [9]. The two bottom tables in Table 2 are examples of dichotomous-ordinal scales. All weighting schemes in Table 3, except the general symmetric and the quadratic, are special cases of the weighting scheme with additive weights introduced in Warrens [31].

In this paper, we only consider weighted kappas with symmetric weighting schemes. For notational convenience, we define the following six coefficients:
$$a_1 = p_{12} + p_{21}, \qquad b_1 = p_{1+}p_{+2} + p_{2+}p_{+1},$$
$$a_2 = p_{13} + p_{31}, \qquad b_2 = p_{1+}p_{+3} + p_{3+}p_{+1},$$
$$a_3 = p_{23} + p_{32}, \qquad b_3 = p_{2+}p_{+3} + p_{3+}p_{+2}. \quad (1)$$
To avoid pathological cases, we assume that $b_1, b_2, b_3 > 0$. The coefficients $a_1$, $a_2$, and $a_3$ reflect raw disagreement between the raters, whereas $b_1$, $b_2$, and $b_3$ reflect chance-expected disagreement. The general formula of weighted kappa for $3 \times 3$ tables with symmetric weights will be denoted by $\kappa(w_{12}, w_{13}, w_{23})$. In terms of the coefficients $a_i$ and $b_i$, this weighted kappa is defined as
$$\kappa(w_{12}, w_{13}, w_{23}) = 1 - \frac{w_{12}a_1 + w_{13}a_2 + w_{23}a_3}{w_{12}b_1 + w_{13}b_2 + w_{23}b_3}. \quad (2)$$
The value of $\kappa(w_{12}, w_{13}, w_{23})$ lies between $-1$ and $1$. The numerator of the fraction in (2) reflects raw weighted disagreement. It is a weighted sum of the cell probabilities that are not on the main diagonal of the table, and it quantifies the disagreement between the raters. The denominator of the fraction in (2) reflects weighted disagreement under chance. It is a weighted sum of the products $p_{i+}p_{+j}$ for $i \neq j$. High values of the fraction in (2) correspond to high disagreement. If $a_1 = a_2 = a_3 = 0$, then we have $\kappa(w_{12}, w_{13}, w_{23}) = 1$, and there is perfect agreement between the observers. Furthermore, we have $\kappa(w_{12}, w_{13}, w_{23}) = 0$ if the raw weighted disagreement is equal to the weighted disagreement under chance.

Special cases of $\kappa(w_{12}, w_{13}, w_{23})$ are obtained by using the specific weighting schemes in Table 3 in the general formula (2). Unweighted kappa, linearly weighted kappa, quadratically weighted kappa, and Cicchetti’s weighted kappa are, respectively, defined as
$$\kappa_I = \kappa(1,1,1) = 1 - \frac{a_1 + a_2 + a_3}{b_1 + b_2 + b_3}, \qquad \kappa_L = \kappa(1,2,1) = 1 - \frac{a_1 + 2a_2 + a_3}{b_1 + 2b_2 + b_3},$$
$$\kappa_Q = \kappa(1,4,1) = 1 - \frac{a_1 + 4a_2 + a_3}{b_1 + 4b_2 + b_3}, \qquad \kappa_C = \kappa(2,3,1) = 1 - \frac{2a_1 + 3a_2 + a_3}{2b_1 + 3b_2 + b_3}. \quad (3)$$
Assuming a multinomial sampling model with the total number of objects $n$ fixed, the maximum likelihood estimate of the cell probability $p_{ij}$ for $i, j \in \{1, 2, 3\}$ is given by $\hat{p}_{ij} = f_{ij}/n$, where $f_{ij}$ is the observed frequency. Note that the $a_i$ and $b_i$ are functions of the cell probabilities $p_{ij}$. The maximum likelihood estimate of $\kappa(w_{12}, w_{13}, w_{23})$ in (2) is obtained by replacing the cell probabilities $p_{ij}$ by $\hat{p}_{ij}$ [32]. The last column of Table 2 contains the estimates of the weighted kappas for each of the four tables; for example, for the top table of Table 2, these are the estimates $\hat{\kappa}_I$, $\hat{\kappa}_L$, $\hat{\kappa}_Q$, and $\hat{\kappa}_C$. Between brackets behind the kappa estimates are the 95% confidence intervals. These were obtained using the asymptotic variance of weighted kappa derived in Fleiss et al. [33].
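To make the definitions concrete, here is a minimal Python sketch (our own illustration, not part of the original paper; the function name and the frequency table are hypothetical) that computes the weighted kappa in (2) and the four special cases in (3) from an observed frequency table:

```python
import numpy as np

def kappa(p, w12, w13, w23):
    """Weighted kappa (2) for a 3x3 table of proportions p,
    with symmetric dissimilarity weights (w12, w13, w23)."""
    r, c = p.sum(axis=1), p.sum(axis=0)          # marginals p_{i+} and p_{+j}
    # raw disagreement a_1, a_2, a_3 and chance-expected disagreement b_1, b_2, b_3
    a = np.array([p[0, 1] + p[1, 0], p[0, 2] + p[2, 0], p[1, 2] + p[2, 1]])
    b = np.array([r[0]*c[1] + r[1]*c[0], r[0]*c[2] + r[2]*c[0], r[1]*c[2] + r[2]*c[1]])
    w = np.array([w12, w13, w23])
    return 1.0 - (w @ a) / (w @ b)

# Hypothetical frequency table (not one of the tables in the paper).
F = np.array([[40.0,  8.0,  2.0],
              [ 6.0, 30.0,  4.0],
              [ 2.0,  5.0, 20.0]])
p = F / F.sum()   # maximum likelihood estimates of the cell probabilities

for name, w in [("kappa_I", (1, 1, 1)), ("kappa_L", (1, 2, 1)),
                ("kappa_Q", (1, 4, 1)), ("kappa_C", (2, 3, 1))]:
    print(name, round(kappa(p, *w), 4))
```

Note that the quadratic scheme penalizes the extreme disagreement cells (1, 3) and (3, 1) four times as heavily as adjacent-category disagreements, which is why $\kappa_Q$ is the most sensitive to the coefficients $a_2$ and $b_2$.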

3. Category Reliabilities

With a categorical scale, it is sometimes desirable to combine some of the categories [34], for example, when two categories are easily confused, and then calculate weighted kappa for the collapsed table. If we combine two of the three categories, the $3 \times 3$ table collapses into a $2 \times 2$ table. For a $2 \times 2$ table, all weighted kappas with symmetric weighting schemes coincide. Since we have three categories, there are three possible ways to combine two categories. The three $\kappa$-values of the collapsed $2 \times 2$ tables are given by
$$\kappa_1 = \kappa(1,1,0) = 1 - \frac{a_1 + a_2}{b_1 + b_2}, \qquad \kappa_2 = \kappa(1,0,1) = 1 - \frac{a_1 + a_3}{b_1 + b_3}, \qquad \kappa_3 = \kappa(0,1,1) = 1 - \frac{a_2 + a_3}{b_2 + b_3}. \quad (4)$$
These three kappas are obtained by using the three bottom weighting schemes in Table 3 in the general formula (2). The last column of Table 2 contains the estimates of these weighted kappas for each of the four tables.
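The collapsing operation is easy to check numerically. The sketch below (our own code; names and data are hypothetical) builds the three collapsed $2 \times 2$ tables and confirms that their kappa values equal $\kappa(1,1,0)$, $\kappa(1,0,1)$, and $\kappa(0,1,1)$ as in (4):

```python
import numpy as np

def kappa(p, w12, w13, w23):
    # weighted kappa (2) for a 3x3 table of proportions p
    r, c = p.sum(axis=1), p.sum(axis=0)
    a = np.array([p[0, 1] + p[1, 0], p[0, 2] + p[2, 0], p[1, 2] + p[2, 1]])
    b = np.array([r[0]*c[1] + r[1]*c[0], r[0]*c[2] + r[2]*c[0], r[1]*c[2] + r[2]*c[1]])
    w = np.array([w12, w13, w23])
    return 1.0 - (w @ a) / (w @ b)

def kappa_2x2(q):
    # for a 2x2 table all symmetric weighted kappas coincide with Cohen's kappa
    r, c = q.sum(axis=1), q.sum(axis=0)
    return 1.0 - (q[0, 1] + q[1, 0]) / (r[0]*c[1] + r[1]*c[0])

def collapse(p, i):
    # 2x2 table for category i versus the two remaining categories combined
    m = np.ones(3, dtype=bool); m[i] = False
    return np.array([[p[i, i],        p[i, m].sum()],
                     [p[m, i].sum(),  p[np.ix_(m, m)].sum()]])

F = np.array([[40., 8., 2.], [6., 30., 4.], [2., 5., 20.]])  # hypothetical data
p = F / F.sum()
for i, w in zip(range(3), [(1, 1, 0), (1, 0, 1), (0, 1, 1)]):
    assert np.isclose(kappa_2x2(collapse(p, i)), kappa(p, *w))
```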

Weighted kappa $\kappa_i$ for $i \in \{1, 2, 3\}$ corresponds to the $2 \times 2$ table that is obtained by combining the two categories other than category $i$. The $2 \times 2$ table reflects how often the two raters agreed on the category $i$ and on the category “all others”. Weighted kappa $\kappa_i$, hence, summarizes the agreement or reliability between the raters on the single category $i$, and it is, therefore, also called the category reliability of $i$ [10]. It quantifies how well category $i$ can be distinguished from the other two categories. For example, for the second table of Table 2, the estimate $\hat{\kappa}_3$ is substantially lower than $\hat{\kappa}_1$ and $\hat{\kappa}_2$. The substantially lower value of $\hat{\kappa}_3$ indicates that the third category is not well distinguished from the other two categories.

Unweighted kappa and linearly weighted kappa are weighted averages of the category reliabilities. Unweighted kappa is a weighted average of $\kappa_1$, $\kappa_2$, and $\kappa_3$, where the weights are the denominators of the category reliabilities [10]:
$$\kappa_I = \frac{(b_1+b_2)\kappa_1 + (b_1+b_3)\kappa_2 + (b_2+b_3)\kappa_3}{(b_1+b_2) + (b_1+b_3) + (b_2+b_3)}. \quad (5)$$
Since $\kappa_I$ is a weighted average of the category reliabilities, the $\kappa_I$-value always lies between the smallest and the largest of $\kappa_1$, $\kappa_2$, and $\kappa_3$. This property can be verified for all four tables of Table 2. Therefore, when combining two categories, the $\kappa$-value can go either up or down, depending on which two categories are combined [34]. The value of $\kappa_I$ is a good summary statistic of the category reliabilities if the values of $\kappa_1$, $\kappa_2$, and $\kappa_3$ are (approximately) identical. Table 2 shows that this is not the case in general. With an ordinal scale, it only makes sense to combine categories that are adjacent in the ordering. We should, therefore, ignore $\kappa_2$ with ordered categories, since this statistic corresponds to the $2 \times 2$ table that is obtained by merging the two categories that are furthest apart. Furthermore, note that for the two bottom tables of Table 2 the first category is the “absence” category. If the scale is dichotomous ordinal and category 1 is the “absence” category, then $\kappa_1$ is the $\kappa$-value of the $2 \times 2$ table that corresponds to “absence” versus “presence” of the characteristic.

The statistic $\kappa_L$ is a weighted average of $\kappa_1$ and $\kappa_3$, where the weights are the denominators of the category reliabilities [13, 35]:
$$\kappa_L = \frac{(b_1+b_2)\kappa_1 + (b_2+b_3)\kappa_3}{(b_1+b_2) + (b_2+b_3)}. \quad (6)$$
Since $\kappa_L$ is a weighted average of the category reliabilities $\kappa_1$ and $\kappa_3$, the $\kappa_L$-value always lies between the values of $\kappa_1$ and $\kappa_3$. This property can be verified for all four tables of Table 2. Unlike $\kappa_Q$, the statistic $\kappa_L$ can be considered an extension of $\kappa_I$ to ordinal scales that preserves the “weighted average” property [13, 35]. The value of $\kappa_L$ is a good summary statistic of $\kappa_1$ and $\kappa_3$ if the two weighted kappas are (approximately) identical. This is the case for the two bottom tables of Table 2.

The statistic $\kappa_C$ is also a weighted average of $\kappa_1$ and $\kappa_3$, where the weights are $2(b_1+b_2)$ and $b_2+b_3$:
$$\kappa_C = \frac{2(b_1+b_2)\kappa_1 + (b_2+b_3)\kappa_3}{2(b_1+b_2) + (b_2+b_3)}. \quad (7)$$
A proof can be found in Warrens [31].
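The three weighted-average identities (5), (6), and (7) can be verified numerically with a few lines of Python (a sketch with hypothetical data; the variable names are ours):

```python
import numpy as np

F = np.array([[40., 8., 2.], [6., 30., 4.], [2., 5., 20.]])  # hypothetical data
p = F / F.sum()
r, c = p.sum(axis=1), p.sum(axis=0)
a = np.array([p[0, 1] + p[1, 0], p[0, 2] + p[2, 0], p[1, 2] + p[2, 1]])
b = np.array([r[0]*c[1] + r[1]*c[0], r[0]*c[2] + r[2]*c[0], r[1]*c[2] + r[2]*c[1]])

k1 = 1 - (a[0] + a[1]) / (b[0] + b[1])      # category reliabilities (4)
k2 = 1 - (a[0] + a[2]) / (b[0] + b[2])
k3 = 1 - (a[1] + a[2]) / (b[1] + b[2])
kI = 1 - a.sum() / b.sum()
kL = 1 - (a[0] + 2*a[1] + a[2]) / (b[0] + 2*b[1] + b[2])
kC = 1 - (2*a[0] + 3*a[1] + a[2]) / (2*b[0] + 3*b[1] + b[2])

u, v, s = b[0] + b[1], b[0] + b[2], b[1] + b[2]   # denominators of (4)
assert np.isclose(kI, (u*k1 + v*k2 + s*k3) / (u + v + s))   # identity (5)
assert np.isclose(kL, (u*k1 + s*k3) / (u + s))              # identity (6)
assert np.isclose(kC, (2*u*k1 + s*k3) / (2*u + s))          # identity (7)
```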

4. Families of Weighted Kappas

In this section, we show that the seven weighted kappas introduced in Sections 2 and 3 are special cases of two families. Let $c \geq 0$ be a real number. Inspection of the formulas of $\kappa_2$, $\kappa_I$, $\kappa_L$, and $\kappa_Q$ shows that they only differ in how the coefficients $a_2$ and $b_2$ are weighted. The first family is, therefore, given by
$$f(c) = \kappa(1, c, 1) = 1 - \frac{a_1 + ca_2 + a_3}{b_1 + cb_2 + b_3}. \quad (8)$$
For $c = 0, 1, 2, 4$, we have, respectively, the special cases $\kappa_2$, $\kappa_I$, $\kappa_L$, and $\kappa_Q$.

Recall that $\kappa_L$ and $\kappa_C$ are weighted averages of the category reliabilities $\kappa_1$ and $\kappa_3$. This motivates the following definition. Let $\lambda \in [0, 1]$. Then the second family is defined as
$$g(\lambda) = \frac{\lambda(b_1+b_2)\kappa_1 + (1-\lambda)(b_2+b_3)\kappa_3}{\lambda(b_1+b_2) + (1-\lambda)(b_2+b_3)}. \quad (9)$$
The family $g(\lambda)$ consists of the weighted averages of $\kappa_1$ and $\kappa_3$ where the weights are multiples of $b_1+b_2$ and $b_2+b_3$. For $\lambda = 0, 1/2, 2/3, 1$, we have, respectively, the special cases $\kappa_3$, $\kappa_L$, $\kappa_C$, and $\kappa_1$. Note that $\kappa_L$ belongs to both $f(c)$ and $g(\lambda)$.

The following proposition presents a formula for the family $g(\lambda)$ in (9) that will be used in Theorem 6 below.

Proposition 1. The family $g(\lambda)$ in (9) is equivalent to
$$g(\lambda) = 1 - \frac{\lambda a_1 + a_2 + (1-\lambda)a_3}{\lambda b_1 + b_2 + (1-\lambda)b_3}. \quad (10)$$

Proof. Since $\kappa_1$ and $\kappa_3$ are equal to, respectively,
$$\kappa_1 = 1 - \frac{a_1 + a_2}{b_1 + b_2}, \qquad \kappa_3 = 1 - \frac{a_2 + a_3}{b_2 + b_3}, \quad (11)$$
we can write (9) as
$$g(\lambda) = 1 - \frac{\lambda(a_1 + a_2) + (1-\lambda)(a_2 + a_3)}{\lambda(b_1 + b_2) + (1-\lambda)(b_2 + b_3)}, \quad (12)$$
which is identical to the expression in (10).
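The two families and the equivalence in Proposition 1 can be checked directly (again a hypothetical sketch in our own notation; `f` and `g` below implement (8) and (10)):

```python
import numpy as np

F = np.array([[40., 8., 2.], [6., 30., 4.], [2., 5., 20.]])  # hypothetical data
p = F / F.sum()
r, c = p.sum(axis=1), p.sum(axis=0)
a = np.array([p[0, 1] + p[1, 0], p[0, 2] + p[2, 0], p[1, 2] + p[2, 1]])
b = np.array([r[0]*c[1] + r[1]*c[0], r[0]*c[2] + r[2]*c[0], r[1]*c[2] + r[2]*c[1]])

def f(cc):    # family (8): kappa(1, c, 1)
    return 1 - (a[0] + cc*a[1] + a[2]) / (b[0] + cc*b[1] + b[2])

def g(lam):   # family (9), written in the equivalent form (10)
    return 1 - (lam*a[0] + a[1] + (1 - lam)*a[2]) / (lam*b[0] + b[1] + (1 - lam)*b[2])

k1 = 1 - (a[0] + a[1]) / (b[0] + b[1])
k3 = 1 - (a[1] + a[2]) / (b[1] + b[2])
assert np.isclose(f(0), 1 - (a[0] + a[2]) / (b[0] + b[2]))   # f(0) = kappa_2
assert np.isclose(f(2), g(0.5))                              # kappa_L is in both families
assert np.isclose(g(0), k3) and np.isclose(g(1), k1)         # end points of (9)

lam = 2/3                                                    # kappa_C
w1, w3 = lam*(b[0] + b[1]), (1 - lam)*(b[1] + b[2])
assert np.isclose(g(lam), (w1*k1 + w3*k3) / (w1 + w3))       # Proposition 1
```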

5. Inequalities

In this section, we present inequalities between the seven weighted kappas. We will use the following lemma repeatedly. The fraction $(a+c)/(b+d)$ in (13) below is known as the mediant of the fractions $a/b$ and $c/d$.

Lemma 2. Let $a$ and $c$ be nonnegative and let $b$ and $d$ be positive real numbers. Then one has the following:
$$\frac{a}{b} < \frac{c}{d} \Longleftrightarrow \frac{a}{b} < \frac{a+c}{b+d} \Longleftrightarrow \frac{a+c}{b+d} < \frac{c}{d}. \quad (13)$$
The analogous statement in which the inequalities in (13) are replaced by equalities also holds.

Proof. Since $b$ and $d$ are positive numbers, we have $a/b < c/d$ if and only if $ad < bc$. Adding $ab$ to both sides of $ad < bc$, we obtain $a(b+d) < b(a+c)$, or $a/b < (a+c)/(b+d)$. Adding $cd$ to both sides of $ad < bc$ instead, we obtain $d(a+c) < c(b+d)$, or $(a+c)/(b+d) < c/d$. Since all steps are equivalences, this proves (13); replacing the inequalities by equalities gives the equality version. For example, $1/3 < 2/4$ and, indeed, $1/3 < 3/7 < 2/4$.

Theorem 3 classifies the orderings of the special cases of the family $f(c)$ in (8).

Theorem 3. For $0 \leq c_1 < c_2$, one has the following:
(i) $f(c_1) < f(c_2)$ if and only if $a_2/b_2 < (a_1+a_3)/(b_1+b_3)$;
(ii) $f(c_1) = f(c_2)$ if and only if $a_2/b_2 = (a_1+a_3)/(b_1+b_3)$;
(iii) $f(c_1) > f(c_2)$ if and only if $a_2/b_2 > (a_1+a_3)/(b_1+b_3)$. (14)

Proof. The inequality $f(c_1) < f(c_2)$ is equivalent to
$$\frac{a_1 + c_2a_2 + a_3}{b_1 + c_2b_2 + b_3} < \frac{a_1 + c_1a_2 + a_3}{b_1 + c_1b_2 + b_3}. \quad (15)$$
Since $c_1 < c_2$, the left-hand side of (15) is the mediant of the right-hand side and $(c_2-c_1)a_2/((c_2-c_1)b_2)$, and it follows from Lemma 2 that inequality (15) is equivalent to
$$\frac{a_2}{b_2} < \frac{a_1 + c_1a_2 + a_3}{b_1 + c_1b_2 + b_3}. \quad (16)$$
Applying Lemma 2 for a second time, we find that inequality (16) is equivalent to
$$\frac{a_2}{b_2} < \frac{a_1 + a_3}{b_1 + b_3} \quad (17)$$
(for $c_1 = 0$, (16) and (17) are identical). This proves (i); the cases (ii) and (iii) follow from the same argument with the inequalities replaced by equalities and reversed inequalities, respectively. This completes the proof.

Theorem 3 shows that, in practice, we only observe one of two orderings of $\kappa_2$, $\kappa_I$, $\kappa_L$, and $\kappa_Q$. In most cases, we have $\kappa_2 < \kappa_I < \kappa_L < \kappa_Q$. For example, in Table 2 all tables exhibit this ordering. For all these tables, it holds that $a_2/b_2 < (a_1+a_3)/(b_1+b_3)$. Furthermore, if the table would be tridiagonal [22, 23], we would have $a_2 = 0$, and the inequality in (i) of Theorem 3 would also hold. An agreement table is tridiagonal if it has nonzero elements only on the main diagonal and on the two diagonals directly adjacent to the main diagonal. The other possibility is that we have $\kappa_2 > \kappa_I > \kappa_L > \kappa_Q$. The only example from the literature where we found this ordering is the table presented in Cohen [11]. The table in Cohen satisfies the condition in (iii) of Theorem 3. We conclude that, with ordinal scales, we almost always have the ordering $\kappa_I < \kappa_L < \kappa_Q$. The equality condition in (ii) of Theorem 3 is discussed in Section 6.
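The tridiagonal case mentioned above is easy to reproduce (our own hypothetical table; any table with empty (1, 3) and (3, 1) cells gives $a_2 = 0$ and hence condition (i) of Theorem 3):

```python
import numpy as np

def kappa(p, w12, w13, w23):
    r, c = p.sum(axis=1), p.sum(axis=0)
    a = np.array([p[0, 1] + p[1, 0], p[0, 2] + p[2, 0], p[1, 2] + p[2, 1]])
    b = np.array([r[0]*c[1] + r[1]*c[0], r[0]*c[2] + r[2]*c[0], r[1]*c[2] + r[2]*c[1]])
    w = np.array([w12, w13, w23])
    return 1.0 - (w @ a) / (w @ b)

# Tridiagonal table: the cells (1, 3) and (3, 1) are zero, so a_2 = 0.
F = np.array([[30.,  5.,  0.],
              [ 4., 25.,  6.],
              [ 0.,  3., 27.]])
p = F / F.sum()
kI, kL, kQ = (kappa(p, *w) for w in [(1, 1, 1), (1, 2, 1), (1, 4, 1)])
assert kI < kL < kQ   # the ordering implied by condition (i) of Theorem 3
```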

Theorem 4 classifies the orderings of the special cases of the family $g(\lambda)$ in (9).

Theorem 4. For $0 \leq \lambda_1 < \lambda_2 \leq 1$, one has the following:
(i) $g(\lambda_1) < g(\lambda_2)$ if and only if $\kappa_3 < \kappa_1$;
(ii) $g(\lambda_1) = g(\lambda_2)$ if and only if $\kappa_3 = \kappa_1$;
(iii) $g(\lambda_1) > g(\lambda_2)$ if and only if $\kappa_3 > \kappa_1$. (18)

Proof. The special cases of $g(\lambda)$ are weighted averages of $\kappa_1$ and $\kappa_3$. For $\lambda_1 < \lambda_2$, we have $g(\lambda_1) < g(\lambda_2)$ if and only if $\kappa_3 < \kappa_1$; that is, a statistic that gives more weight to $\kappa_1$ will be higher if the $\kappa_1$-value exceeds the $\kappa_3$-value. Furthermore, we have
$$\frac{\partial g(\lambda)}{\partial \lambda} = \frac{(b_1+b_2)(b_2+b_3)(\kappa_1 - \kappa_3)}{[\lambda(b_1+b_2) + (1-\lambda)(b_2+b_3)]^2}, \quad (19)$$
which is positive, zero, or negative precisely when $\kappa_1 - \kappa_3$ is positive, zero, or negative. This completes the proof.

Theorem 4 shows that, in practice, we only observe one of two orderings of $\kappa_3$, $\kappa_L$, $\kappa_C$, and $\kappa_1$. We either have the ordering $\kappa_3 < \kappa_L < \kappa_C < \kappa_1$, which is the case in the first, second, and fourth tables of Table 2, or we have $\kappa_1 < \kappa_C < \kappa_L < \kappa_3$, which is the case in the third table in Table 2.
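Theorem 4 predicts that the chain $\kappa_3$, $\kappa_L$, $\kappa_C$, $\kappa_1$ (that is, $g(\lambda)$ at $\lambda = 0, 1/2, 2/3, 1$) is always monotone. The following sketch (our own check, not from the paper) verifies this on randomly generated agreement tables:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(10_000):
    p = rng.dirichlet(np.ones(9)).reshape(3, 3)          # random 3x3 table
    r, c = p.sum(axis=1), p.sum(axis=0)
    a = np.array([p[0, 1] + p[1, 0], p[0, 2] + p[2, 0], p[1, 2] + p[2, 1]])
    b = np.array([r[0]*c[1] + r[1]*c[0], r[0]*c[2] + r[2]*c[0], r[1]*c[2] + r[2]*c[1]])
    k1 = 1 - (a[0] + a[1]) / (b[0] + b[1])
    k3 = 1 - (a[1] + a[2]) / (b[1] + b[2])
    kL = 1 - (a[0] + 2*a[1] + a[2]) / (b[0] + 2*b[1] + b[2])
    kC = 1 - (2*a[0] + 3*a[1] + a[2]) / (2*b[0] + 3*b[1] + b[2])
    d = np.diff([k3, kL, kC, k1])                        # g at lambda = 0, 1/2, 2/3, 1
    assert (d >= -1e-12).all() or (d <= 1e-12).all()     # one of the two orderings
```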

Proposition 5 follows from Theorems 3 and 4 and the fact that $\kappa_I$ is a weighted average of $\kappa_1$, $\kappa_2$, and $\kappa_3$ [10].

Proposition 5. Consider the following:
(i) if $\kappa_2 < \kappa_3 < \kappa_1$, then $\kappa_2 < \kappa_I < \kappa_L < \kappa_Q$ and $\kappa_3 < \kappa_L < \kappa_C < \kappa_1$; (20)
(ii) if $\kappa_2 < \kappa_1 < \kappa_3$, then $\kappa_2 < \kappa_I < \kappa_L < \kappa_Q$ and $\kappa_1 < \kappa_C < \kappa_L < \kappa_3$. (21)

Proposition 5 shows that we have an almost complete picture of how the seven weighted kappas are ordered just by comparing the values of $\kappa_1$, $\kappa_2$, and $\kappa_3$. The double inequality $\kappa_2 < \kappa_3 < \kappa_1$ holds for the fourth table of Table 2, whereas the double inequality $\kappa_2 < \kappa_1 < \kappa_3$ holds for the third table of Table 2. Both tables have a dichotomous-ordinal scale. Recall that $\kappa_C$ corresponds to a weighting scheme specifically formulated for dichotomous-ordinal scales. It turns out that the $\kappa_C$-value can be both lower and higher than the $\kappa_Q$-value with dichotomous-ordinal scales. Which statistic is higher depends on the data. Furthermore, $\kappa_2$ tends to be smaller than $\kappa_1$ and $\kappa_3$. The condition $\kappa_2 < \min(\kappa_1, \kappa_3)$ can be interpreted as an increase in the $\kappa$-value if we combine the middle category of the 3-category scale with one of the outer categories instead of combining the two outer categories. This way of merging categories makes sense if the categories are ordered.

6. Equalities

Apart from the equality conditions in (ii) of Theorems 3 and 4, we only considered inequalities between the weighted kappas in the previous section. Unless there is perfect agreement, the values of the weighted kappas are usually different. Table 4 contains three hypothetical agreement tables that we have constructed to illustrate that the three equality conditions in Theorems 3, 4, and 6 (below) are not identical. For the top table in Table 4, we have $a_2/b_2 = (a_1+a_3)/(b_1+b_3)$, which is equivalent to the equality $\kappa_2 = \kappa_I = \kappa_L = \kappa_Q$ (Theorem 3). Although all weighted kappas of the family $f(c)$ coincide, the kappas not belonging to this family produce different values. For the middle table in Table 4 we have $\kappa_1 = \kappa_3$, which is equivalent to the equality $\kappa_3 = \kappa_L = \kappa_C = \kappa_1$ (Theorem 4). Although all weighted kappas of the family $g(\lambda)$ coincide, the kappas that do not belong to this family produce different values.

For the bottom table in Table 4, we have the stronger condition $a_1/b_1 = a_2/b_2 = a_3/b_3$. Theorem 6 (below) shows that this condition is equivalent to the case that all weighted kappas, that is, all special cases of (2), coincide.

Theorem 6. The following conditions are equivalent:
(i) $a_1/b_1 = a_2/b_2 = a_3/b_3$;
(ii) all special cases of weighted kappa in (2) coincide;
(iii) two distinct special cases of $f(c)$ in (8) are identical and equal to a special case of $g(\lambda)$ in (9) other than $\kappa_L$;
(iv) two distinct special cases of $g(\lambda)$ in (9) are identical and equal to a special case of $f(c)$ in (8) other than $\kappa_L$. (22)

Proof. In words, (ii) means that all special cases of (2) are identical. Therefore, (ii) implies both (iii) and (iv). We first show that (i) implies (ii). It then suffices to show that (iii) implies (i) and that (iv) implies (i).
If (i) holds, we have $a_i = tb_i$ for $i \in \{1, 2, 3\}$ for a certain $t \geq 0$. Hence,
$$\kappa(w_{12}, w_{13}, w_{23}) = 1 - \frac{t(w_{12}b_1 + w_{13}b_2 + w_{23}b_3)}{w_{12}b_1 + w_{13}b_2 + w_{23}b_3} = 1 - t. \quad (23)$$
Thus, all special cases of weighted kappa in (2) coincide if (i) is valid.
Next, we show that (iii) and (iv) imply (i). Consider condition (iii) first. If two distinct special cases of $f(c)$ are identical, it follows from Theorem 3 that all of them are identical and that $a_2/b_2 = (a_1+a_3)/(b_1+b_3)$. Hence, we have $f(1) = g(\lambda)$ for a certain $\lambda$ with $\lambda \neq 1/2$. Using formula (10), we have
$$\frac{a_1 + a_2 + a_3}{b_1 + b_2 + b_3} = \frac{\lambda a_1 + a_2 + (1-\lambda)a_3}{\lambda b_1 + b_2 + (1-\lambda)b_3}. \quad (24)$$
Combining (24) with $a_2/b_2 = (a_1+a_3)/(b_1+b_3)$ (Theorem 3) and the equality version of Lemma 2, we obtain
$$\frac{a_2}{b_2} = \frac{a_1 + a_2 + a_3}{b_1 + b_2 + b_3} = \frac{\lambda a_1 + a_2 + (1-\lambda)a_3}{\lambda b_1 + b_2 + (1-\lambda)b_3} = \frac{a_1 + a_3}{b_1 + b_3}. \quad (25)$$
The third ratio of (25) is the mediant of $a_2/b_2$ and $(\lambda a_1 + (1-\lambda)a_3)/(\lambda b_1 + (1-\lambda)b_3)$. Applying Lemma 2 to the first and third ratios of (25), we obtain
$$\frac{a_2}{b_2} = \frac{\lambda a_1 + (1-\lambda)a_3}{\lambda b_1 + (1-\lambda)b_3} = \frac{a_1 + a_3}{b_1 + b_3}. \quad (26)$$
First, suppose that $\lambda > 1/2$. The middle ratio of (26) is the mediant of $(2\lambda - 1)a_1/((2\lambda - 1)b_1)$ and $(1-\lambda)(a_1+a_3)/((1-\lambda)(b_1+b_3))$ (for $\lambda = 1$ it equals $a_1/b_1$ directly). Applying Lemma 2 to the right-hand side equality of (26), we obtain
$$\frac{a_2}{b_2} = \frac{a_1}{b_1} = \frac{\lambda a_1 + (1-\lambda)a_3}{\lambda b_1 + (1-\lambda)b_3} = \frac{a_1 + a_3}{b_1 + b_3}, \quad (27)$$
or, in particular, $a_1/b_1 = (a_1+a_3)/(b_1+b_3)$. Applying Lemma 2 to the second and fourth terms of the triple equality (27), we obtain $a_3/b_3 = (a_1+a_3)/(b_1+b_3)$. Thus, we have $a_1/b_1 = a_2/b_2 = a_3/b_3$, which completes the proof for $\lambda > 1/2$. Next, suppose that $\lambda < 1/2$. The middle ratio of (26) is now the mediant of $(1 - 2\lambda)a_3/((1 - 2\lambda)b_3)$ and $\lambda(a_1+a_3)/(\lambda(b_1+b_3))$ (for $\lambda = 0$ it equals $a_3/b_3$ directly). Applying Lemma 2 to the right-hand side equality of (26), we obtain
$$\frac{a_2}{b_2} = \frac{a_3}{b_3} = \frac{\lambda a_1 + (1-\lambda)a_3}{\lambda b_1 + (1-\lambda)b_3} = \frac{a_1 + a_3}{b_1 + b_3}, \quad (28)$$
or, in particular, $a_3/b_3 = (a_1+a_3)/(b_1+b_3)$. Applying Lemma 2 to the second and fourth terms of the triple equality (28), we obtain $a_1/b_1 = (a_1+a_3)/(b_1+b_3)$. Thus, we also have $a_1/b_1 = a_2/b_2 = a_3/b_3$ for $\lambda < 1/2$, which completes the proof for condition (iii).
Next, consider condition (iv). If two distinct special cases of $g(\lambda)$ are identical, it follows from Theorem 4 that all of them are identical. Hence, we have $\kappa_1 = \kappa_3 = f(c)$ for a certain $c \geq 0$ with $c \neq 2$, that is,
$$\frac{a_1 + a_2}{b_1 + b_2} = \frac{a_2 + a_3}{b_2 + b_3} = \frac{a_1 + ca_2 + a_3}{b_1 + cb_2 + b_3}. \quad (29)$$
By the equality version of Lemma 2, the left-hand side equality of (29) implies that both ratios are also equal to their mediant:
$$\frac{a_1 + a_2}{b_1 + b_2} = \frac{a_1 + 2a_2 + a_3}{b_1 + 2b_2 + b_3} = \frac{a_2 + a_3}{b_2 + b_3}. \quad (30)$$
First, suppose that $c > 2$. The third ratio of (29) is the mediant of the middle ratio of (30) and $(c-2)a_2/((c-2)b_2)$, and by (29) and (30) it equals the middle ratio of (30). Applying Lemma 2, we obtain
$$\frac{a_2}{b_2} = \frac{a_1 + a_2}{b_1 + b_2} = \frac{a_2 + a_3}{b_2 + b_3}. \quad (31)$$
Applying Lemma 2 to the left-hand side equality of (31), we obtain $a_1/b_1 = a_2/b_2$, while applying Lemma 2 to the outer ratios of (31), we obtain $a_3/b_3 = a_2/b_2$. Thus, we have $a_1/b_1 = a_2/b_2 = a_3/b_3$.
Finally, if $c < 2$, then consider the equality
$$\frac{a_1 + ca_2 + a_3}{b_1 + cb_2 + b_3} = \frac{a_1 + 2a_2 + a_3}{b_1 + 2b_2 + b_3}, \quad (32)$$
which follows from (29) and (30). Since $c < 2$, the right-hand ratio of (32) is the mediant of the left-hand ratio and $(2-c)a_2/((2-c)b_2)$. Applying Lemma 2 to (32) gives
$$\frac{a_2}{b_2} = \frac{a_1 + 2a_2 + a_3}{b_1 + 2b_2 + b_3}. \quad (33)$$
However, by a further application of Lemma 2, (33) is equivalent to $a_2/b_2 = (a_1+a_3)/(b_1+b_3)$, which by Theorem 3 is equivalent to $\kappa_2 = \kappa_I = \kappa_L = \kappa_Q$. Since $\kappa_I$ is a weighted average of $\kappa_1$, $\kappa_2$, and $\kappa_3$, and since $\kappa_1 = \kappa_3$, we must have $\kappa_1 = \kappa_2 = \kappa_3 = \kappa_I$. But then condition (iii) holds, and we have already shown that (iii) implies (i). This completes the proof for condition (iv).

Theorem 6 shows that all weighted kappas for $3 \times 3$ tables are identical if we have the double equality $a_1/b_1 = a_2/b_2 = a_3/b_3$. If this condition holds, the equalities $\kappa_2 = \kappa_I = \kappa_L = \kappa_Q$ and $\kappa_3 = \kappa_L = \kappa_C = \kappa_1$ also hold. Theorem 6 also shows that if any two special cases of the family $f(c)$ are equal to a member of the family $g(\lambda)$ other than $\kappa_L$, then all weighted kappas coincide. Furthermore, if any two special cases of the family $g(\lambda)$ are identical to a member of the family $f(c)$ other than $\kappa_L$, then all weighted kappas must be identical.
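A quick numerical illustration of Theorem 6 (our own construction, not taken from the paper): mixing a perfect-agreement diagonal table with an independence table that has the same marginals produces off-diagonal cells $p_{ij} = (1-s)\,p_{i+}p_{+j}$, so that $a_i = (1-s)b_i$ for $i \in \{1, 2, 3\}$, and every symmetric weighted kappa equals the mixing weight $s$:

```python
import numpy as np

def kappa(p, w12, w13, w23):
    r, c = p.sum(axis=1), p.sum(axis=0)
    a = np.array([p[0, 1] + p[1, 0], p[0, 2] + p[2, 0], p[1, 2] + p[2, 1]])
    b = np.array([r[0]*c[1] + r[1]*c[0], r[0]*c[2] + r[2]*c[0], r[1]*c[2] + r[2]*c[1]])
    w = np.array([w12, w13, w23])
    return 1.0 - (w @ a) / (w @ b)

u = np.array([0.5, 0.3, 0.2])                  # common marginal distribution
s = 0.6                                        # mixing weight
p = s*np.diag(u) + (1 - s)*np.outer(u, u)      # satisfies a_i = (1 - s) b_i

# All seven special cases of (2) coincide, as Theorem 6 asserts.
for w in [(1, 1, 1), (1, 2, 1), (1, 4, 1), (2, 3, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]:
    assert np.isclose(kappa(p, *w), s)
```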

7. Discussion

Since it frequently happens that different versions of weighted kappa are applied to the same contingency data, regardless of the scale type of the categories, it is useful to compare the various versions analytically. For rating scales with three categories, we may define seven special cases of weighted kappa. The seven weighted kappas belong to two different parameter families. Only the weighted kappa with linear weights belongs to both families. For both families, it was shown that there are only two possible orderings of their members (Theorems 3 and 4). We conclude that with ordinal scales consisting of three categories, quadratically weighted kappa usually produces higher values than linearly weighted kappa, which in turn has higher values than unweighted kappa.

Since there are only a few possible orderings of the weighted kappas, it appears that the kappas are measuring the same thing, but to a different extent. Various authors have presented magnitude guidelines for evaluating the values of kappa statistics [36–38]. For example, an estimated value of 0.80 generally indicates good or excellent agreement. There is general consensus in the literature that uncritical use of these guidelines leads to questionable decisions in practice. If the weighted kappas are measuring the same thing, but some kappas produce substantially higher values than others, then the same guidelines cannot be applied to all weighted kappas. However, using the same guidelines for different kappas appears to be common practice. If one wants to work with magnitude guidelines, then it seems reasonable to use stricter criteria for the quadratically weighted kappa than for unweighted kappa, since the former statistic generally produces higher values.

The quadratically and linearly weighted kappas were formulated for continuous-ordinal scale data. However, in practice, many scales are dichotomous ordinal (see, e.g., Anderson et al. [24] and Martin et al. [25]). In this case, the application of the weighted kappa proposed by Cicchetti [9] or the additively weighted kappa introduced in Warrens [31] is perhaps more appropriate. Unfortunately, Cicchetti’s weighted kappa has been largely ignored in the application of kappa statistics. In most applications, the quadratically weighted kappa is used [4, 5]. The observation that the quadratically weighted kappa tends to produce the highest value for many data may partly explain this popularity. As pointed out by one of the reviewers, to determine whether Cicchetti’s weighted kappa has real advantages, the various weighted kappas need to be compared on the quality and efficiency of prediction. This is a possible topic for future work.

Acknowledgments

The author thanks four anonymous reviewers for their helpful comments and valuable suggestions on an earlier version of this paper. This research is part of Project 451-11-026 funded by the Netherlands Organisation for Scientific Research.