#### Abstract

Cohen’s kappa is a standard tool for the analysis of agreement in a 2 × 2 reliability study. Researchers are frequently only interested in the kappa-value of a sample. Various authors have observed that if two pairs of raters have the same amount of observed agreement, the pair whose marginal distributions are more similar to each other may have a lower kappa-value than the pair with more divergent marginal distributions. Here we present exact formulations of some of these properties. The results provide a better understanding of the 2 × 2 kappa for situations where it is used as a sample statistic.

#### 1. Introduction

Results from experimental studies and research studies can often be summarized in a 2 × 2 table [1]. An example is a reliability study in which two observers rate the same sample of subjects on the presence/absence of a trait or an ability [2, 3]. In this example the four cells of the table are the proportion of times the observers agreed on the presence of the trait, the proportion of times a trait was present according to the first observer but absent according to the second observer, the proportion of times a trait was absent according to the first observer but present according to the second observer, and the proportion of times the observers agreed on the absence of the trait.

To assess the quality of the ratings, the agreement between the ratings is taken as an indicator of the quality of the category definitions and the observers’ ability to apply them. A standard tool for estimating agreement in a reliability study is Cohen’s kappa [4–8]. Its value is 1 when there is perfect agreement, 0 when agreement is equal to that expected under independence, and negative when agreement is less than expected by chance.

Several authors have presented population models for Cohen’s kappa [2, 7]. Under these models kappa can be interpreted as an association coefficient. However, kappa is also frequently used as a sample statistic [4, 8–11], for example, when calculating kappa for a sample of subjects is one step in a series of research steps. In this case, researchers are merely interested in the agreement in the sample, not in that of a population.

As a sample statistic, kappa is known to be marginal or prevalence dependent, since it takes into account the marginal totals, that is, how often the raters use each rating category [12–14]. The value of kappa depends on the prevalence of the condition being diagnosed. Values of kappa can be quite low if a condition is very common or very rare. Various authors have shown that if two pairs of observers have the same amount of observed agreement, the pair whose marginal distributions are more similar to each other may have a lower kappa-value than the pair with more divergent marginal distributions [4, 9, 15, 16]. Since observers with similar marginal distributions usually have a higher amount of agreement expected to occur by chance, a fixed amount of observed agreement will lead to a lower kappa-value due to the definition of the statistic [14].

The marginal dependence of Cohen’s kappa has almost exclusively been demonstrated and described by means of examples of 2 × 2 tables [4, 8–11, 16]. However, for a better understanding of the behavior of kappa as a sample statistic, it is desirable to have exact formulations of its marginal dependence. Such formulations are presented in this paper. The paper is organized as follows. Section 2 is used to introduce notation and define Cohen’s kappa. In Section 3 several concepts for a 2 × 2 table are presented. The main results are presented in Section 4. The results show that kappa may exhibit several different forms of marginal dependence. The results do not necessarily suggest that Cohen’s kappa should be discarded as an agreement measure. Instead, the exact formulations provide a better understanding of the statistic. Section 5 contains several conclusions.

#### 2. Notation and Kappa

In this section we introduce notation and define the kappa coefficient. Suppose two fixed observers independently rate the same set of subjects using the same two categories 1 and 0. For example, 1 = presence and 0 = absence of a trait. For a sample of subjects, let $a$, $b$, $c$, and $d$ denote, respectively, the proportion classified in category 1 by both observers, the proportion classified by the first observer in category 1 and by the second observer in category 0, the proportion classified by the first observer in category 0 and by the second observer in category 1, and the proportion classified in category 0 by both observers.

A general 2 × 2 table with observed proportions is presented in Table 1. The row and column totals are the marginal totals that result from summing the relative frequencies. We denote these by $p_1$ and $q_1$ for the first observer and by $p_2$ and $q_2$ for the second observer, so that $p_1 = a + b$, $q_1 = c + d$, $p_2 = a + c$, and $q_2 = b + d$. They reflect how often the observers used the categories.

Using the notation presented in Table 1, the proportion of observed agreement is given by $P_o = a + d$, while the proportion of expected agreement is given by $P_e = p_1 p_2 + q_1 q_2$. Furthermore, Cohen’s kappa can be defined as

$$\kappa = \frac{P_o - P_e}{1 - P_e}. \tag{1}$$

Cohen’s kappa is a chance-corrected coefficient [17]. The value of kappa is 1 when perfect agreement between the two observers occurs, 0 when agreement is equal to that expected under independence ($P_o = P_e$), and negative when agreement is less than expected by chance.
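The definitions of this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `kappa_2x2` and the example counts are assumptions introduced here and do not come from the paper. The four arguments follow the cell order used above (both observers in category 1, first observer only, second observer only, both in category 0).

```python
def kappa_2x2(a, b, c, d):
    """Return (p_o, p_e, kappa) for a 2 x 2 agreement table.

    a, b, c, d may be counts or proportions; they are normalised to sum to 1.
    The function name and interface are hypothetical, chosen for illustration.
    """
    total = a + b + c + d
    if total <= 0:
        raise ValueError("the table must contain at least one observation")
    a, b, c, d = (x / total for x in (a, b, c, d))
    p1, q1 = a + b, c + d          # marginal totals of the first observer
    p2, q2 = a + c, b + d          # marginal totals of the second observer
    p_o = a + d                    # proportion of observed agreement
    p_e = p1 * p2 + q1 * q2        # agreement expected under independence
    if p_e == 1:
        raise ValueError("kappa is undefined when p_e equals 1")
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Hypothetical counts for 100 subjects.
p_o, p_e, kappa = kappa_2x2(40, 10, 10, 40)
print(f"p_o={p_o:.3f}  p_e={p_e:.3f}  kappa={kappa:.3f}")
```

For these hypothetical counts the observed agreement is 0.8, the expected agreement is 0.5, and kappa is 0.6.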

#### 3. Concepts for a Table

In this section we introduce several concepts for a 2 × 2 table. The 2-tuples $(p_1, q_1)$ and $(p_2, q_2)$ contain the marginal distributions.

*Definition 1.* Two tuples $(p_1, q_1)$ and $(p_2, q_2)$ are said to be

(i) *similarly arranged* if both increase (i.e., $p_1 < q_1$ and $p_2 < q_2$) or both decrease (i.e., $p_1 > q_1$ and $p_2 > q_2$);

(ii) *oppositely arranged* if one increases and the other decreases.

Furthermore, a tuple $(p_1, q_1)$ is said to be *balanced* if $p_1 = q_1$.

In the following definition we use the concepts from Definition 1 to define terminology that will be used to formalize the marginal dependencies of kappa.

*Definition 2.* A 2 × 2 table is said to be

(i) *strongly marginal symmetric* if $p_1 = p_2$ (and hence $q_1 = q_2$);

(ii) *weakly marginal symmetric* if $(p_1, q_1)$ and $(p_2, q_2)$ are similarly arranged;

(iii) *balanced* if $(p_1, q_1)$ or $(p_2, q_2)$ is balanced;

(iv) *marginal asymmetric* if $(p_1, q_1)$ and $(p_2, q_2)$ are oppositely arranged.

Note that strong marginal symmetry implies weak marginal symmetry, unless the marginal distributions are balanced. Furthermore, since $p_1 = p_2$ is equivalent to $b = c$, strong marginal symmetry coincides with the usual definition of a symmetric matrix.

The following result relates some of the concepts in Definition 2 to the value of the chance-expected agreement $P_e$. Lemma 3 is used in the proof of Theorem 5.

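Lemma 3. *For a 2 × 2 table with marginal distributions $(p_1, q_1)$ and $(p_2, q_2)$ the following equivalences hold.*

(1) *The table is weakly marginal symmetric $\Leftrightarrow P_e > \tfrac{1}{2}$;*

(2) *the table is balanced $\Leftrightarrow P_e = \tfrac{1}{2}$;*

(3) *the table is marginal asymmetric $\Leftrightarrow P_e < \tfrac{1}{2}$.*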

*Proof. *We prove equivalence 1. The other equivalences follow from using similar arguments.

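($\Rightarrow$) If the table is weakly marginal symmetric, then $(p_1, q_1)$ and $(p_2, q_2)$ are similarly arranged and we have $(p_1 - q_1)(p_2 - q_2) > 0$, or equivalently

$$p_1 p_2 + q_1 q_2 > p_1 q_2 + p_2 q_1. \tag{2}$$

Adding $p_1 p_2 + q_1 q_2$ to both sides of inequality (2) and dividing the result by 2, we obtain

$$P_e = p_1 p_2 + q_1 q_2 > \frac{(p_1 + q_1)(p_2 + q_2)}{2} = \frac{1}{2}. \tag{3}$$

($\Leftarrow$) If $P_e > \tfrac{1}{2}$ then $(p_1 - q_1)(p_2 - q_2) > 0$ must hold, using the same steps as in ($\Rightarrow$) backwards. Hence, $(p_1, q_1)$ and $(p_2, q_2)$ are similarly arranged, and it follows that the table is weakly marginal symmetric.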
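The three equivalences of Lemma 3 can also be checked numerically. The sketch below is an illustration rather than part of the proof: the helper names are assumptions introduced here, and the check uses exact rational arithmetic on a finite grid of marginals. It relies on the identity that the expected agreement minus 1/2 equals twice the product of the deviations of the two category-1 marginals from 1/2, which follows from writing each category-0 marginal as one minus the category-1 marginal.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)


def expected_agreement(p1, p2):
    """Expected agreement p1*p2 + q1*q2 with q1 = 1 - p1 and q2 = 1 - p2."""
    return p1 * p2 + (1 - p1) * (1 - p2)


def marginal_type(p1, p2):
    """Classify a table by its marginals, following Definition 2."""
    if p1 == HALF or p2 == HALF:
        return "balanced"
    if (p1 > HALF) == (p2 > HALF):
        return "weakly marginal symmetric"
    return "marginal asymmetric"


def check_lemma3(steps=20):
    """Check Lemma 3 exactly on a grid of marginals k/steps, 0 < k < steps."""
    grid = [Fraction(k, steps) for k in range(1, steps)]
    for p1, p2 in product(grid, repeat=2):
        p_e = expected_agreement(p1, p2)
        # Identity behind the lemma: p_e - 1/2 = 2 (p1 - 1/2)(p2 - 1/2).
        assert p_e - HALF == 2 * (p1 - HALF) * (p2 - HALF)
        kind = marginal_type(p1, p2)
        if kind == "weakly marginal symmetric":
            assert p_e > HALF
        elif kind == "balanced":
            assert p_e == HALF
        else:
            assert p_e < HALF
    return len(grid) ** 2  # number of marginal pairs that were checked


print(check_lemma3(), "marginal pairs checked")
```

Using `Fraction` instead of floating-point numbers keeps the equality test for the balanced case exact.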

#### 4. Main Results

In this section we present several marginal dependencies of Cohen’s kappa (Theorems 5, 7, and 10). The following lemma will be used repeatedly.

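Lemma 4. *Coefficient (1) is strictly decreasing in $P_e$.*

*Proof.* The first order partial derivative

$$\frac{\partial \kappa}{\partial P_e} = \frac{-(1 - P_e) + (P_o - P_e)}{(1 - P_e)^2} = \frac{P_o - 1}{(1 - P_e)^2} \tag{4}$$

is negative for $P_o < 1$.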

The following result is a slightly stronger version of a theorem in Warrens [16] for a rating scale with two categories. Theorem 5 shows that, for a fixed value of the proportion of observed agreement $P_o$, tables that possess weak marginal symmetry produce lower values of kappa than tables that are marginal asymmetric.

Theorem 5. *Consider three 2 × 2 tables with the same value of $P_o < 1$ that are, respectively, weakly marginal symmetric, balanced, and marginal asymmetric. Furthermore, let $\kappa_1$, $\kappa_2$, and $\kappa_3$ denote the associated values of kappa. Then $\kappa_1 < \kappa_2 < \kappa_3$.*

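*Proof.* Lemma 4 shows that $\kappa$ is strictly decreasing in $P_e$. By Lemma 3, $P_e$ is greater than, equal to, and less than $\tfrac{1}{2}$ for the weakly marginal symmetric, the balanced, and the marginal asymmetric table, respectively. The result then follows.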

Example 6 illustrates Theorem 5.

*Example 6. *Consider the three hypothetical tables in Table 2.

Each table has the same proportion of observed agreement . Table 2(a) is strongly marginal symmetric, Table 2(b) is balanced, and Table 2(c) is marginal asymmetric. We have the double inequality , which illustrates Theorem 5.
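The ordering in Theorem 5 can be reproduced with a short computation. The three tables below are hypothetical counts chosen for this sketch (they are not the tables of Table 2), each with an observed agreement of 0.8, and the function name `kappa` is likewise an assumption introduced here.

```python
def kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 table of counts or proportions."""
    n = a + b + c + d
    a, b, c, d = (x / n for x in (a, b, c, d))
    # Expected agreement from the row marginals (a+b, c+d)
    # and the column marginals (a+c, b+d).
    p_e = (a + b) * (a + c) + (c + d) * (b + d)
    return (a + d - p_e) / (1 - p_e)


# Hypothetical counts (a, b, c, d) out of 100; NOT the tables from Table 2.
# Every table has observed agreement (a + d) / 100 = 0.8.
tables = {
    "weakly marginal symmetric": (60, 15, 5, 20),   # marginals 0.75 and 0.65
    "balanced":                  (40, 10, 10, 40),  # marginals 0.50 and 0.50
    "marginal asymmetric":       (40, 20, 0, 40),   # marginals 0.60 and 0.40
}

kappas = {name: kappa(*cells) for name, cells in tables.items()}
for name, value in kappas.items():
    print(f"{name:26s} kappa = {value:.3f}")
```

With the observed agreement held fixed, only the expected agreement differs between the three tables, so the kappa-values increase from the weakly marginal symmetric table to the marginal asymmetric one.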

Theorem 5 also considers tables that have asymmetric marginals. However, asymmetric tables may not be relevant in practice. If the classifications are hard to make, researchers will often make use of expert observers. Furthermore, novice observers usually receive some training before the actual classifications have to be made. Asymmetric tables are therefore rarely encountered in practice.

Theorem 7 shows that tables that are strongly marginal symmetric may have lower kappa-values than weakly marginal symmetric tables with unequal marginal distributions.

Theorem 7. *Let be a weakly marginal symmetric table with marginals and and value and strongly marginal symmetric with marginals and value . Furthermore, suppose that and have the same proportion of observed agreement . If
**
then .*

*Proof. *Due to the symmetries of the proportion of expected agreement , we may assume, without loss of generality, that , , and . It then follows from inequality (5) that . Furthermore, since , we have , or , and since , we also have .

Since , multiplying both sides of inequality by yields
Adding to both sides of inequality (6) we obtain
or equivalently,
Next, since , multiplying both sides of the inequality by we obtain
Adding to both sides of inequality (9) we obtain
or equivalently,
Combining (8) and (11) we obtain the inequality
The inequality then follows from inequality (12) and application of Lemma 4.

Example 8 illustrates Theorem 7.

*Example 8. *Consider the two hypothetical tables in Table 3.

Both tables have the same proportion of observed agreement . Table 3(a) is weakly marginal symmetric, whereas Table 3(b) is strongly marginal symmetric. Since
the inequalities in (5) hold. Furthermore, we have , which illustrates Theorem 7.

Using similar arguments as in the proof of Theorem 7 we may obtain the following result.

Theorem 9. *Let be a weakly marginal symmetric table with marginals and and -value and strongly marginal symmetric with marginals and -value . Furthermore, suppose that and have the same proportion of observed agreement . If
**
then .*

If we have , , and then Theorems 7 and 9 cover the cases that and . The cases and turn out to be more complicated.

Another marginal dependence of kappa is presented in Theorem 10. The theorem shows that, for a constant value of the proportion of observed agreement $P_o$, tables that exhibit weak marginal symmetry may produce higher kappa-values than tables with strong marginal symmetry. Theorem 10 is similar to Theorem 7, but conditions (5) and (15) are different, unrelated requirements.

Theorem 10. *Let be a weakly marginal symmetric table with marginals and and value and strongly marginal symmetric with marginals and value . Furthermore, suppose that and have the same proportion of observed agreement . If
**
then .*

*Proof. *Due to the symmetries of the proportion of expected agreement , we may assume, without loss of generality, that and . It then follows from inequality (15) that . Furthermore, since is weakly marginal symmetric we must have and . It follows that , and thus .

For the function is concave with a maximum at . Since , we have
or equivalently,
Next, since a square is nonnegative we have , or
Adding to both sides of inequality (18) we obtain
Combining inequalities (17) and (19) we obtain
Adding 1 to both sides of inequality (20) we obtain
or equivalently,
The inequality then follows from inequality (22) and application of Lemma 4.

Example 11 illustrates Theorem 10 and a corollary of Theorem 10.

*Example 11. *Consider the four hypothetical tables in Table 4.

Each table has the same proportion of observed agreement . Tables 4(a) and 4(d) are strongly marginal symmetric, whereas Tables 4(b) and 4(c) are weakly marginal symmetric. For the largest marginals of Tables 4(a) and 4(b), we have and , which illustrates Theorem 10.

For many tables from the literature the converse of Theorem 10 also holds. However, Tables 4(a) and 4(c) provide us with a counterexample that illustrates that the converse does not hold in general. We have , but not .

Finally, Tables 4(a) and 4(d) illustrate a special application of Theorem 10. If two tables are strongly marginal symmetric and have the same proportion of observed agreement , then the table with the most skewed (unbalanced) marginals (Table 4(a)) has the lowest value of kappa. This is illustrated by the fact that .
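The special case for strongly marginal symmetric tables can be sketched as follows. The values used here are hypothetical and are not taken from Table 4, and the function name is an assumption introduced for this illustration. For a strongly marginal symmetric table the two disagreement cells are equal, so the table is fully determined by the observed agreement and the common category-1 marginal.

```python
from fractions import Fraction


def kappa_strongly_symmetric(p_o, p):
    """Kappa for a strongly marginal symmetric table.

    p_o is the observed agreement and p the category-1 marginal shared by
    both observers. The name and interface are hypothetical.
    """
    b = (1 - p_o) / 2              # the two disagreement cells are equal
    a, d = p - b, 1 - p - b        # agreement cells implied by the marginals
    if a < 0 or d < 0:
        raise ValueError("no table with these values of p_o and p exists")
    p_e = p * p + (1 - p) * (1 - p)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical setting: observed agreement 0.8, increasingly skewed marginals.
p_o = Fraction(4, 5)
marginals = [Fraction(5, 10), Fraction(6, 10), Fraction(7, 10), Fraction(8, 10)]
values = [kappa_strongly_symmetric(p_o, p) for p in marginals]
for p, value in zip(marginals, values):
    print(f"p = {float(p):.1f}  kappa = {float(value):.3f}")
```

In this sketch the expected agreement grows as the common marginal moves away from 1/2, so kappa falls from 3/5 for balanced marginals to 3/8 for a marginal of 0.8.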

#### 5. Conclusions

Cohen’s kappa is presently a standard tool for the analysis of agreement in a reliability study. The statistic is frequently used as a sample statistic. Various authors have observed in this context that if two pairs of raters have the same amount of observed agreement, the pair whose marginal distributions are more similar to each other may have a lower kappa-value than the pair with more divergent marginal distributions. These properties of Cohen’s kappa have almost exclusively been demonstrated and described by means of examples of tables [4, 8–11, 16]. In this paper we presented exact formulations and proved several marginal dependencies of this type (Theorems 5, 7, and 10). In general, they show that, for tables with the same value of observed agreement, tables with marginal distributions that are more similar have lower associated kappa-values than tables with marginal distributions that are less similar. Each result was illustrated by an example with hypothetical tables. The results provide a better understanding of the kappa when it is used as a sample statistic.

Theorem 5 considers tables that have asymmetric marginals. Although several authors have provided examples with asymmetric marginals, asymmetric tables may not be relevant in practice. If the classifications are hard to make, researchers will often make use of expert observers. Furthermore, novice observers usually receive some training before the actual classifications have to be made. Asymmetric tables are therefore rarely encountered in practice.

Vach [14] emphasizes that kappa should not simply be interpreted as a measure of agreement but that Cohen’s kappa expresses the degree to which observed agreement exceeds the agreement that was expected by chance. The marginal dependencies are a direct consequence of the definition of kappa and its aim to adjust the observed agreement with respect to the expected amount of agreement under chance conditions [14, p. 659]. They are not a reason for discarding Cohen’s kappa.

#### Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The author thanks two anonymous reviewers for their helpful comments and valuable suggestions on an earlier version of this paper. This research is part of Veni project 451-11-026 funded by the Netherlands Organisation for Scientific Research.