Abstract

A weighted version of Bennett, Alpert, and Goldstein’s $S$, denoted by $S_r$, is studied. It is shown that the special cases of $S_r$ are often ordered in the same way. It is also shown that many special cases of $S_r$ tend to produce values close to unity, especially when the number of categories of the rating scale is large. It is argued that the application of $S_r$ as an agreement coefficient is not without difficulties.

1. Introduction

In behavioral and biomedical science it is frequently required to measure the intensity of a behavior or a disease. Examples are the degree of arousal of a speech-anxious participant while giving a presentation, the severity of lesions from scans, or the severity of sedation during opioid administration for pain management. The intensity of these phenomena is usually classified by a single observer using a rating scale with ordered categories, for example, mild, moderate, or severe. To ensure that the observer fully understands what he or she is asked to interpret, the categories must be clearly defined. To measure the reliability of the rating scale, researchers typically ask two observers to rate the same set of subjects independently. Analysis of the agreement between the observers can then be used to assess the reliability of the scale. High agreement between the ratings of the observers usually indicates consensus in the diagnosis and interchangeability of the classifications of the observers.

For assessing agreement on an ordinal scale various statistical methodologies have been developed. For example, the loglinear models presented in Tanner and Young [1] and Agresti [2, 3] can be used for analyzing the patterns of agreement and potential sources of disagreement. Applications of these models can be found in Becker [4] and Graham and Jackson [5]. However, it turns out that researchers are usually only interested in a coefficient that (roughly) summarizes the agreement in a single number. The most commonly used coefficient for summarizing agreement on an ordinal scale is weighted kappa proposed in Cohen [6] ([5, 7]). Cohen [8] proposed coefficient kappa as an index of agreement when the rating scale has nominal (unordered) categories [9]. The coefficient corrects for agreement due to chance. Weighted kappa extends Cohen’s original kappa to rating scales with ordered categories. In the latter case there is usually more disagreement between the observers on adjacent categories than on categories that are further apart. With weighted kappa it is possible to describe the closeness between categories using weights. Both kappa and weighted kappa are standard tools in assessing agreement and have been used in thousands of applications [10, 11]. The most commonly used version of weighted kappa is quadratic kappa [5, 7].

Various authors have identified difficulties with the interpretation of kappa for nominal categories [7, 12–17]. Cohen’s kappa is a function of the marginal totals, the base rates of the categories, which indicate how often the categories were used by the observers [18–20]. Cohen’s kappa tends to produce much lower values for skewed marginal distributions. Furthermore, kappas from samples with different base rates are not comparable [13, 16]. de Mast and van Wieringen [16] and de Mast [17] studied kappa and kappa-type coefficients in the context of a latent class model. These authors argued that the problematic behavior of kappa is explained by the fact that it is a coefficient of predictive association, instead of a pure coefficient of agreement. Other authors have identified difficulties with the interpretation of quadratic kappa for ordered categories as well. Quadratic kappa behaves as a measure of association, instead of an agreement coefficient [5]. The value of quadratic kappa also tends to increase as the number of categories increases [21]. Furthermore, quadratic kappa cannot discriminate between tables with very different levels of exact agreement [22].

A commonly proposed alternative for Cohen’s kappa for nominal categories is coefficient $S$, originally proposed in Bennett et al. [23] ([24–26]). Since coefficient $S$ is a linear transformation of the raw agreement and not a function of the marginal totals, it does not exhibit the interpretation difficulties of the kappa coefficients [10, 27]. Furthermore, under the latent class model discussed in de Mast and van Wieringen [16] and de Mast [17], coefficient $S$ is the only agreement coefficient that can be given some justification. Coefficient $S$ is equivalent to coefficient $C$ in Janson and Vegelius [28], coefficient RE in Janes [29], and $\kappa_n$ in Brennan and Prediger [12]. In the case of two categories coefficient $S$ is equivalent to coefficients discussed in, among others, Holley and Guilford [30], Maxwell [31], and Krippendorff [32].

Recently, Gwet [33] proposed a weighted version of coefficient $S$ for rating scales with ordinal categories. In this paper this coefficient will be denoted by $S_w$. The generalization proposed in [33] is analogous to the generalization of kappa [8] to weighted kappa [6]. The weighting schemes that can be used with $S_w$ are identical to the weighting schemes of weighted kappa. The most commonly used weighting schemes for weighted kappa are the linear weights [34–36] and the quadratic weights [22, 37, 38]. In this paper we study how $S_w$ behaves as an agreement coefficient for rating scales with ordinal categories. More precisely, we study a special case of $S_w$ which will be denoted by $S_r$. Special cases of $S_r$ are coefficient $S$ and the coefficients that are obtained if we use the linear and quadratic weighting schemes. We present several properties of $S_r$ that indicate that the application of $S_r$ as an agreement coefficient is not without problems.

The paper is organized as follows. In Section 2 we introduce notation and define coefficients $S_w$ and $S_r$. In Section 3 it is shown that there is a simple ordering of the special cases of $S_r$ if a certain mild condition holds. Since this requirement is often met in real life, the special cases of $S_r$ are usually ordered in the same way. In Section 4 we present properties of $S_r$ for tridiagonal agreement tables. It is shown that many special cases of $S_r$ tend to produce values close to unity, especially when the number of categories of the rating scale is large. Section 5 contains a discussion.

2. Weighted Coefficients

In this section we introduce notation and define the coefficients $S_w$ and $S_r$. Gwet [33, page 56] defines $S_w$ in terms of similarity scaling. However, for notational convenience, we will define $S_w$ in terms of dissimilarity scaling here. If the weights are dissimilarities, pairs of categories that are further apart are usually assigned higher weights.

Suppose two fixed observers independently rate the same set of subjects using the same set of $c \geq 2$ ordered categories that are defined in advance. For a population of subjects, let $p_{ij}$ denote the proportion classified in category $i$ by the first observer and in category $j$ by the second observer for $i, j \in \{1, 2, \ldots, c\}$, and let $\mathbf{P} = \{p_{ij}\}$ denote the table of these proportions. Furthermore, let $\mathbf{F} = \{f_{ij}\}$ denote the contingency table of observed frequencies. The probabilities $p_{ij}$ can be estimated from the $f_{ij}$. Assuming a multinomial sampling model with the total number of subjects $n$ fixed, the maximum likelihood estimate of $p_{ij}$ is given by $\hat{p}_{ij} = f_{ij}/n$ [39, 40].

Since the labels of the rows and columns of $\mathbf{F}$ are identical, the contingency table is usually called an agreement table. Table 1 is an example of an agreement table. The data of Table 1 come from Holmquist et al. [41]. Seven pathologists, labeled A to G, classified each of 118 slides in terms of carcinoma in situ of the uterine cervix, based on the most involved lesion, using five ordered categories: (1) negative, (2) atypical squamous hyperplasia, (3) carcinoma in situ, (4) squamous carcinoma with early stromal invasion, and (5) invasive carcinoma. The data can also be found in Landis and Koch [42]. Table 1 is the cross classification of the ratings of pathologists A and D.

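Let $w_{ij}$ for $i, j \in \{1, 2, \ldots, c\}$ be nonnegative real numbers with $w_{ii} = 0$. The numbers $w_{ij}$ are used as weights, one for each cell of the table $\mathbf{F}$. If we formulate Gwet’s approach in terms of dissimilarity scaling, then Gwet [33] presented the coefficient

$$S_w = 1 - \frac{\sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}\, p_{ij}}{\dfrac{1}{c^2}\sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}}. \tag{1}$$

Coefficient $S_w$ is well defined if we require that at least one $w_{ij}$ is nonzero. With the weights $w_{ij}$ fixed, the maximum likelihood estimate of (1) under a multinomial sampling model is given by

$$\hat{S}_w = 1 - \frac{\sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}\, \hat{p}_{ij}}{\dfrac{1}{c^2}\sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}}. \tag{2}$$

In this paper we are interested in a particular weighting scheme. Let $r$ be a nonnegative real number and consider the weight function

$$w_{ij} = \begin{cases} |i-j|^r & \text{if } i \neq j, \\ 0 & \text{if } i = j. \end{cases} \tag{3}$$

Using weight function (3) in (1) we obtain the weighted coefficient

$$S_r = 1 - \frac{c^2 \sum_{i \neq j} |i-j|^r\, p_{ij}}{\sum_{i \neq j} |i-j|^r}, \tag{4}$$

where $\sum_{i \neq j}$ denotes summation over all ordered pairs $(i, j)$ with $i \neq j$. Various well-known weighting schemes are special cases of weighting scheme (3). For $r = 0$ we have the identity weights

$$w_{ij} = \begin{cases} 1 & \text{if } i \neq j, \\ 0 & \text{if } i = j. \end{cases} \tag{5}$$

For $c$ categories weighting scheme (5) is given by the $c \times c$ matrix

$$\begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 0 \end{pmatrix}. \tag{6}$$

If we use $r = 0$ in (4) we obtain

$$S_0 = \frac{c \sum_{i=1}^{c} p_{ii} - 1}{c - 1}. \tag{7}$$

Coefficient (7) is Bennett et al.’s [23] $S$, an agreement coefficient proposed for rating scales with nominal categories [12, 28, 29]. Coefficient $S$ is thus a special case of coefficient (4). The value of $S$ is 1 if there is perfect agreement between the observers and 0 when $\sum_{i=1}^{c} p_{ii} = 1/c$. The estimate of $S$ for Table 1 is denoted by $\hat{S}_0$.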
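As an illustration of definitions (1) to (4), the following minimal Python sketch estimates the weighted coefficient from a table of counts. The function names `weighted_s` and `power_weight` and the 3 x 3 table are hypothetical and chosen for this example only; the table is not taken from Table 1 or from any of the cited data sets.

```python
def weighted_s(freq, weight):
    """Estimate the weighted coefficient from a square table of counts.

    `weight(i, j)` must return a dissimilarity weight with weight(i, i) == 0.
    """
    c = len(freq)
    n = sum(sum(row) for row in freq)
    pairs = [(i, j) for i in range(c) for j in range(c)]
    # Numerator: weighted observed disagreement, with p_ij estimated by f_ij / n.
    observed = sum(weight(i, j) * freq[i][j] for i, j in pairs) / n
    # Denominator: the average weight over all c^2 cells.
    expected = sum(weight(i, j) for i, j in pairs) / c**2
    return 1 - observed / expected


def power_weight(r):
    """Weight function |i - j|^r off the diagonal and 0 on the diagonal."""
    return lambda i, j: 0.0 if i == j else abs(i - j) ** r


# Hypothetical 3 x 3 table of counts (rows: observer 1, columns: observer 2).
table = [[20, 5, 0], [4, 15, 3], [1, 2, 10]]

for r in (0, 0.5, 1, 2):
    print(r, round(weighted_s(table, power_weight(r)), 4))
```

For this made-up table the raw agreement is 45/60 = 0.75, so the value for $r = 0$ can be checked by hand against (7): $(3 \cdot 0.75 - 1)/2 = 0.625$.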

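If we use $r = 1$ in (3) we obtain the linear weights [34–36]. For $c$ categories the linear weighting scheme is given by

$$\begin{pmatrix} 0 & 1 & 2 & \cdots & c-1 \\ 1 & 0 & 1 & \cdots & c-2 \\ 2 & 1 & 0 & \cdots & c-3 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c-1 & c-2 & c-3 & \cdots & 0 \end{pmatrix}. \tag{8}$$

The linear coefficient is denoted by $S_1$. Its estimate for Table 1 is denoted by $\hat{S}_1$.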

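If we use $r = 2$ in (3) we obtain the quadratic weights [22, 37, 38]. For $c$ categories the quadratic weighting scheme is given by

$$\begin{pmatrix} 0 & 1 & 4 & \cdots & (c-1)^2 \\ 1 & 0 & 1 & \cdots & (c-2)^2 \\ 4 & 1 & 0 & \cdots & (c-3)^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ (c-1)^2 & (c-2)^2 & (c-3)^2 & \cdots & 0 \end{pmatrix}. \tag{9}$$

The quadratic coefficient is denoted by $S_2$. Its estimate for Table 1 is denoted by $\hat{S}_2$.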

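Finally, if we use $r = 1/2$ in (3) we obtain the radical weights [33, page 63, 64]. For $c$ categories the radical weighting scheme is given by

$$\begin{pmatrix} 0 & 1 & \sqrt{2} & \cdots & \sqrt{c-1} \\ 1 & 0 & 1 & \cdots & \sqrt{c-2} \\ \sqrt{2} & 1 & 0 & \cdots & \sqrt{c-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sqrt{c-1} & \sqrt{c-2} & \sqrt{c-3} & \cdots & 0 \end{pmatrix}. \tag{10}$$

The radical coefficient is denoted by $S_{1/2}$, and its estimate for Table 1 by $\hat{S}_{1/2}$. For Table 1 we have the ordering $\hat{S}_0 < \hat{S}_{1/2} < \hat{S}_1 < \hat{S}_2$.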
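The four weighting schemes differ only in the exponent of the weight function. The sketch below builds the weight matrix for a given number of categories and exponent. The helper name `weight_matrix` and the choice of four categories in the loop are assumptions made purely for illustration.

```python
def weight_matrix(c, r):
    """Return the c x c matrix with entries |i - j|^r off the diagonal and 0 on it."""
    return [[0.0 if i == j else abs(i - j) ** r for j in range(c)] for i in range(c)]


schemes = [("identity", 0), ("radical", 0.5), ("linear", 1), ("quadratic", 2)]

# Print each scheme for an illustrative rating scale with four categories.
for name, r in schemes:
    print(name)
    for row in weight_matrix(4, r):
        print("  ", [round(w, 3) for w in row])
```

The diagonal is set to zero explicitly, so the identity weights do not depend on how a language evaluates $0^0$.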

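For $c = 2$ categories coefficient $S_r$ becomes

$$S_r = 1 - 2\,(p_{12} + p_{21}), \tag{11}$$

which does not depend on $r$. Since all special cases of $S_r$ coincide for $c = 2$ categories, there are no examples of $2 \times 2$ tables in this paper.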

3. Conditional Inequalities

If we apply coefficients $S_0$, $S_{1/2}$, $S_1$, and $S_2$ to the same rating data we consistently find the triple inequality $S_0 < S_{1/2} < S_1 < S_2$. For example, consider the data entries in Table 2. Table 2 presents various statistics of 20 agreement tables from the literature. The first column of Table 2 specifies the source of the agreement table and the second column shows whether the table has size $3 \times 3$, $4 \times 4$, or $5 \times 5$. Columns 3 to 6 of Table 2 contain the values of the estimates $\hat{S}_0$, $\hat{S}_{1/2}$, $\hat{S}_1$, and $\hat{S}_2$. For all entries except the first we have the triple inequality $\hat{S}_0 < \hat{S}_{1/2} < \hat{S}_1 < \hat{S}_2$.

As a second example we consider the data on diagnosis of carcinoma from Holmquist et al. [41]. Seven pathologists labeled A to G classified each of 118 slides in terms of carcinoma in situ of the uterine cervix, based on the most involved lesion, using five ordered categories. Table 1 is the cross classification of the ratings of pathologists A and D. Table 3 presents various statistics of the 21 pairwise agreement tables for the seven pathologists. Columns 2 to 5 of Table 3 contain the values of the estimates $\hat{S}_0$, $\hat{S}_{1/2}$, $\hat{S}_1$, and $\hat{S}_2$. For all 21 tables we have the triple inequality $\hat{S}_0 < \hat{S}_{1/2} < \hat{S}_1 < \hat{S}_2$. The quantities in the last four columns of Tables 2 and 3 are defined and discussed below.

Tables 2 and 3 illustrate that the ordering $S_0 < S_{1/2} < S_1 < S_2$ is often found with real life data. This suggests that $S_r$ is usually increasing in $r$. The triple inequality does not hold in general, but it holds if a certain condition is valid. This sufficient condition is defined below. Recall that $\mathbf{P} = \{p_{ij}\}$ is the agreement table with proportions. Define the quantities

$$q_k = \frac{1}{2(c-k)} \sum_{i=1}^{c-k} \left(p_{i,i+k} + p_{i+k,i}\right) \quad \text{for } k \in \{1, 2, \ldots, c-1\}. \tag{12}$$

For fixed $k$, the quantity $q_k$ in (12) is the sum of all elements of $\mathbf{P}$ that are $k$ steps removed from the main diagonal, divided by $2(c-k)$. Since there are precisely $2(c-k)$ elements that are $k$ steps removed from the main diagonal, $q_k$ is the average disagreement of the elements that are $k$ steps removed from the main diagonal. Since the elements of $\mathbf{P}$ that are $k$ steps removed from the main diagonal correspond to pairs of categories that are $k$ steps apart, $q_1$ is the average disagreement between the observers on adjacent categories, $q_2$ is the average disagreement on all categories that are two steps apart, and so on.

With ordered categories it is natural to assume that

$$q_1 \geq q_2 \geq \cdots \geq q_{c-1}. \tag{13}$$

Condition (13) states that the average disagreement between the observers on categories that are closer in the ordering is higher than on categories that are further apart in the ordering. Since condition (13) depends on the unobserved probabilities $p_{ij}$ it cannot be verified directly. If we replace the probabilities $p_{ij}$ by the $\hat{p}_{ij}$ we obtain the estimates

$$\hat{q}_k = \frac{1}{2(c-k)} \sum_{i=1}^{c-k} \left(\hat{p}_{i,i+k} + \hat{p}_{i+k,i}\right). \tag{14}$$

To check whether inequality (13) is reasonable for real life data we may check if the inequality

$$\hat{q}_1 \geq \hat{q}_2 \geq \cdots \geq \hat{q}_{c-1} \tag{15}$$

holds. It turns out that condition (15) holds for many real life agreement tables with ordered categories. This is to be expected if the rating scale has been thoughtfully constructed, since in this case one expects that the disagreement between the observers on categories that are closer in the ordering is higher than on categories that are further apart in the ordering. For example, for the data in Table 1 the estimates $\hat{q}_1$, $\hat{q}_2$, $\hat{q}_3$, and $\hat{q}_4$ are nonincreasing. Hence, condition (15) holds for Table 1. Furthermore, the last four columns of Tables 2 and 3 contain the estimates $\hat{q}_1$, $\hat{q}_2$, $\hat{q}_3$, and $\hat{q}_4$ for the various agreement tables. For all entries of Table 2 condition (15) holds. Moreover, for all entries except the first, the inequalities are strict. For the first entry of Table 2 [8] the estimates $\hat{q}_k$ are all equal. If the $\hat{q}_k$ are all equal, all special cases of $\hat{S}_r$ coincide. It should be noted that the data in Cohen [8] are artificial. Condition (15) also holds for most entries of Table 3. The three exceptions are the entries corresponding to the pairs (A,C), (B,C), and (C,E).
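The check of the ordering condition on the estimated average disagreements can be sketched as follows. The function names and the table of counts are hypothetical; the table is made up for illustration and is not one of the tables summarized in Tables 2 and 3.

```python
def average_disagreements(freq):
    """Return the estimated average disagreement for k = 1, ..., c - 1 steps apart."""
    c = len(freq)
    n = sum(sum(row) for row in freq)
    q_hat = []
    for k in range(1, c):
        # Sum the 2 * (c - k) cells that lie k steps away from the main diagonal.
        off_diagonal = sum(freq[i][i + k] + freq[i + k][i] for i in range(c - k))
        q_hat.append(off_diagonal / n / (2 * (c - k)))
    return q_hat


def ordering_condition_holds(q_hat):
    """True if the estimates are nonincreasing in the number of steps k."""
    return all(a >= b for a, b in zip(q_hat, q_hat[1:]))


# Hypothetical 3 x 3 table of counts (rows: observer 1, columns: observer 2).
table = [[20, 5, 0], [4, 15, 3], [1, 2, 10]]

q_hat = average_disagreements(table)
print(q_hat, ordering_condition_holds(q_hat))
```

In this made-up table 14 of the 60 subjects fall in the four cells adjacent to the diagonal and 1 subject falls in the two corner cells, so the two estimates are 14/240 and 1/120 and the condition holds.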

Theorem 2 below shows that $S_r$ is increasing in $r$ if condition (13) holds. Thus, if (13) holds there is a simple relationship between the special cases of coefficient (4). In particular, if (13) holds we have the triple inequality

$$S_0 \leq S_{1/2} \leq S_1 \leq S_2. \tag{18}$$

Lemma 1 is used in the proof of Theorem 2.

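Lemma 1. Let $a_k$ for $k \in \{1, 2, \ldots, m\}$ be nonnegative real numbers and let $b_k$ and $d_k$ for $k \in \{1, 2, \ldots, m\}$ be positive real numbers. If

$$a_1 \geq a_2 \geq \cdots \geq a_m, \tag{19}$$

$$\frac{b_1}{d_1} > \frac{b_2}{d_2} > \cdots > \frac{b_m}{d_m}, \tag{20}$$

then

$$\frac{\sum_{k=1}^{m} a_k b_k}{\sum_{k=1}^{m} b_k} \geq \frac{\sum_{k=1}^{m} a_k d_k}{\sum_{k=1}^{m} d_k}. \tag{21}$$

Furthermore, inequality (21) is strict if two $a_k$ are distinct.

Proof. We start with the first part of the assertion. From (20) it follows that $b_k d_l - b_l d_k > 0$ for $k < l$. Since $a_k - a_l \geq 0$ for $k < l$ by (19), it follows that $(a_k - a_l)(b_k d_l - b_l d_k) \geq 0$, that is,

$$a_k b_k d_l + a_l b_l d_k \geq a_k d_k b_l + a_l d_l b_k. \tag{22}$$

Summing (22) over all $k$ and $l$ with $k < l$ we obtain

$$\sum_{k \neq l} a_k b_k d_l \geq \sum_{k \neq l} a_k d_k b_l. \tag{23}$$

Adding $\sum_{k=1}^{m} a_k b_k d_k$ to both sides of (23) we obtain

$$\sum_{k=1}^{m} a_k b_k \sum_{l=1}^{m} d_l \geq \sum_{k=1}^{m} a_k d_k \sum_{l=1}^{m} b_l. \tag{24}$$

Since $\sum_{k} b_k$ and $\sum_{k} d_k$ are positive, inequality (24) is equivalent to (21). Finally, note that if two $a_k$ are distinct, then (22) is strict for at least one pair $k < l$, and hence (24) is strict.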

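Theorem 2. Let $r$ and $s$ be real numbers with $0 \leq r < s$. If condition (13) holds, then $S_r \leq S_s$, with equality if and only if all $q_k$ in (13) are equal.

Proof. Using (4) we have $S_r \leq S_s$ if and only if

$$\frac{\sum_{i \neq j} |i-j|^r\, p_{ij}}{\sum_{i \neq j} |i-j|^r} \geq \frac{\sum_{i \neq j} |i-j|^s\, p_{ij}}{\sum_{i \neq j} |i-j|^s}. \tag{25}$$

Since $|i-j|^r = k^r$ for $|i-j| = k$ we have the identities

$$\sum_{i \neq j} |i-j|^r\, p_{ij} = 2 \sum_{k=1}^{c-1} (c-k)\, k^r q_k, \tag{26}$$

$$\sum_{i \neq j} |i-j|^r = 2 \sum_{k=1}^{c-1} (c-k)\, k^r. \tag{27}$$

Using (26) and (27), inequality (25) can be written as

$$\frac{\sum_{k=1}^{c-1} (c-k)\, k^r q_k}{\sum_{k=1}^{c-1} (c-k)\, k^r} \geq \frac{\sum_{k=1}^{c-1} (c-k)\, k^s q_k}{\sum_{k=1}^{c-1} (c-k)\, k^s}. \tag{28}$$

Let $a_k = q_k$, $b_k = (c-k)\, k^r$, and $d_k = (c-k)\, k^s$ for $k \in \{1, 2, \ldots, c-1\}$. Because $r < s$, the quantity $b_k/d_k = k^{r-s}$ is strictly decreasing in $k$. Since (19) and (20) hold, validity of inequality (28) and thus validity of the inequality $S_r \leq S_s$ follow from application of Lemma 1. If all $q_k$ are equal, both sides of (28) are equal to this common value; if two $q_k$ are distinct, inequality (28) is strict by the second part of Lemma 1.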
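A small numerical check can make the role of the ordering condition concrete. The sketch below evaluates the coefficient directly from a vector of average disagreements, using the rewriting in terms of sums over the number of steps $k$. The function name `s_r_from_q` and both vectors are hypothetical illustrations for five categories; the second vector deliberately violates the ordering condition.

```python
def s_r_from_q(q, r):
    """Coefficient S_r computed from average disagreements q[0], ..., q[c - 2]."""
    c = len(q) + 1
    # The common factor 2 in numerator and denominator cancels.
    numerator = sum((c - k) * k**r * q[k - 1] for k in range(1, c))
    denominator = sum((c - k) * k**r for k in range(1, c))
    return 1 - c**2 * numerator / denominator


exponents = (0, 0.5, 1, 2, 4)

# Hypothetical averages for c = 5 that satisfy the ordering condition.
decreasing_q = [0.05, 0.02, 0.01, 0.0]
# Hypothetical averages for c = 5 that violate it (all disagreement is extreme).
reversed_q = [0.0, 0.0, 0.0, 0.3]

print([round(s_r_from_q(decreasing_q, r), 4) for r in exponents])
print([round(s_r_from_q(reversed_q, r), 4) for r in exponents])
```

With the first vector the values increase with the exponent, as the theorem predicts. With the second vector they decrease, which illustrates that the ordering does not hold in general.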

4. Tridiagonal Agreement Tables

In practice it frequently occurs that an agreement table with ordered categories is (approximately) tridiagonal. A tridiagonal table is a square matrix that has nonzero elements only on the main diagonal, the first diagonal below this, and the first diagonal above the main diagonal. If the agreement table is tridiagonal there is only disagreement between the observers on adjacent categories. In this section we present results that hold if the agreement table is tridiagonal. In this case we have $q_2 = q_3 = \cdots = q_{c-1} = 0$ and it follows that condition (13) holds. It may be that the results also hold if condition (13) is valid. Note that Theorem 2 is always valid for tridiagonal agreement tables.

The tridiagonal tables in Table 2 have $\hat{q}_2 = 0$ if $c = 3$, $\hat{q}_2 = \hat{q}_3 = 0$ if $c = 4$, and $\hat{q}_2 = \hat{q}_3 = \hat{q}_4 = 0$ if $c = 5$. For 6 of the 20 entries in Table 2 the agreement table is tridiagonal. In Table 3 the tridiagonal tables have $\hat{q}_2 = \hat{q}_3 = \hat{q}_4 = 0$. The agreement tables of the pairs (B,E) and (D,G) are tridiagonal. Many other agreement tables corresponding to the entries of Tables 2 and 3 are approximately tridiagonal: only a few disagreements are not on the diagonals directly below and above the main diagonal.

4.1. Upper Bound Unity

Theorem 2 and Tables 2 and 3 show that $S_r$ is usually increasing in $r$. This implies that Bennett et al.’s $S$ is usually a lower bound of the other special cases of $S_r$. Furthermore, it suggests that $S_r$ goes to unity as $r$ increases, regardless of the data at hand. Theorem 3 formalizes this observation for tridiagonal agreement tables.

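Theorem 3. If $c \geq 3$ is fixed and $\mathbf{P}$ is tridiagonal, then $S_r \to 1$ as $r \to \infty$.

Proof. If $\mathbf{P}$ is tridiagonal, (4) becomes

$$S_r = 1 - \frac{c^2 \sum_{i=1}^{c-1} \left(p_{i,i+1} + p_{i+1,i}\right)}{2 \sum_{k=1}^{c-1} (c-k)\, k^r}. \tag{29}$$

Since the elements of $\mathbf{P}$ sum to unity, we have the inequality

$$1 - S_r \leq \frac{c^2}{2 \sum_{k=1}^{c-1} (c-k)\, k^r}. \tag{30}$$

The right-hand side of (30) does not depend on the data. Since the denominator is increasing in $r$ and, for $c \geq 3$, contains the unbounded term $(c-2)\, 2^r$, we can, for fixed $c$, make the right-hand side of (30) arbitrarily small. Hence, $S_r \to 1$ as $r \to \infty$. For $c = 2$ the denominator is constant in $r$, in agreement with the fact that all special cases of $S_r$ coincide for two categories.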
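The data-free bound used in the proof can be tabulated directly. The following sketch assumes the bound has the form of the squared number of categories divided by twice the sum of $(c-k)k^r$ over $k$; the function name `gap_bound` is hypothetical. It shows how quickly the gap between the coefficient and unity must close for tridiagonal tables when the exponent grows, and that nothing happens for two categories.

```python
def gap_bound(c, r):
    """Upper bound on 1 - S_r for a tridiagonal table with c categories."""
    return c**2 / (2 * sum((c - k) * k**r for k in range(1, c)))


# The bound shrinks towards 0 for five categories as the exponent grows.
for r in (0, 1, 2, 5, 10, 20):
    print(r, gap_bound(5, r))

# For two categories the bound does not depend on the exponent.
print(gap_bound(2, 0), gap_bound(2, 50))
```

For small exponents the bound exceeds 1 and is therefore uninformative; it only becomes restrictive once the exponent is moderately large.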

4.2. An Inequality for Differences

Since the parameter $r$ in (4) is a nonnegative real number there are uncountably many special cases of $S_r$. Theorems 2 and 3 together with Tables 2 and 3 show that all these special cases usually lie between $S_0$ and 1. Tables 2 and 3 also show that the positive differences $S_1 - S_0$ and $S_2 - S_1$ are quite substantial. This suggests that most elements of the sequence $S_0, S_1, S_2, \ldots$ will lie close to 1 and that consecutive differences become smaller and smaller.

In this section we present a particular result for the positive differences $S_1 - S_0$ and $S_2 - S_1$. Theorem 5 below shows that $S_2 - S_1$ never exceeds $S_1 - S_0$. We first derive explicit formulas of $\sum_{i \neq j} |i-j|^r$ for $r \in \{0, 1, 2\}$ in Lemma 4.

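Lemma 4. It holds that

$$\sum_{i \neq j} |i-j|^0 = c\,(c-1), \tag{31}$$

$$\sum_{i \neq j} |i-j| = \frac{c\,(c^2-1)}{3}, \tag{32}$$

$$\sum_{i \neq j} |i-j|^2 = \frac{c^2 (c^2-1)}{6}. \tag{33}$$

Proof. We only present proofs of identities (32) and (33). We will use the following identities of sums of powers of integers (see, e.g., [43]):

$$\sum_{k=1}^{c-1} k = \frac{c\,(c-1)}{2}, \tag{34}$$

$$\sum_{k=1}^{c-1} k^2 = \frac{c\,(c-1)(2c-1)}{6}, \tag{35}$$

$$\sum_{k=1}^{c-1} k^3 = \frac{c^2 (c-1)^2}{4}. \tag{36}$$

Using identities (27), (34), and (35) we have

$$\sum_{i \neq j} |i-j| = 2c \sum_{k=1}^{c-1} k - 2 \sum_{k=1}^{c-1} k^2 = c^2 (c-1) - \frac{c\,(c-1)(2c-1)}{3} = \frac{c\,(c^2-1)}{3}. \tag{37}$$

Furthermore, using identities (27), (35), and (36) we have

$$\sum_{i \neq j} |i-j|^2 = 2c \sum_{k=1}^{c-1} k^2 - 2 \sum_{k=1}^{c-1} k^3 = \frac{c^2 (c-1)(2c-1)}{3} - \frac{c^2 (c-1)^2}{2} = \frac{c^2 (c^2-1)}{6}. \tag{38}$$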
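The closed forms for the total weight under the identity, linear, and quadratic schemes are easy to confirm by brute force. The sketch below compares the direct double sum over all off-diagonal cells with the closed-form expressions $c(c-1)$, $c(c^2-1)/3$, and $c^2(c^2-1)/6$; the names `total_weight` and `closed_forms` are hypothetical.

```python
def total_weight(c, r):
    """Sum of |i - j|^r over all ordered pairs (i, j) with i != j."""
    return sum(abs(i - j) ** r for i in range(c) for j in range(c) if i != j)


# Closed forms for r = 0, 1, 2; integer division is exact for integer c.
closed_forms = {
    0: lambda c: c * (c - 1),
    1: lambda c: c * (c * c - 1) // 3,
    2: lambda c: c * c * (c * c - 1) // 6,
}

for c in (2, 3, 4, 5):
    print(c, [(total_weight(c, r), closed_forms[r](c)) for r in (0, 1, 2)])
```

For three categories, for example, the quadratic weights consist of four entries equal to 1 and two entries equal to 4, which gives a total of 12 either way.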

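Theorem 5. If $\mathbf{P}$ is tridiagonal, then $S_2 - S_1 \leq S_1 - S_0$, with equality if and only if $c \leq 3$.

Proof. The formula of $S_r$ for the case that $\mathbf{P}$ is tridiagonal is given in (29). If the observers agree perfectly, all special cases of $S_r$ are equal to 1, so suppose that there is some disagreement. Using (27), the inequality $S_2 - S_1 \leq S_1 - S_0$ or $S_0 + S_2 \leq 2 S_1$ is then equivalent to

$$\frac{1}{\sum_{i \neq j} |i-j|^0} + \frac{1}{\sum_{i \neq j} |i-j|^2} \geq \frac{2}{\sum_{i \neq j} |i-j|}. \tag{39}$$

Using identities (31), (32), and (33), inequality (39) becomes

$$\frac{1}{c\,(c-1)} + \frac{6}{c^2 (c^2-1)} \geq \frac{6}{c\,(c^2-1)}, \tag{40}$$

or, equivalently, $(c-2)(c-3) \geq 0$.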

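If the agreement table is tridiagonal, Theorem 5 shows that $S_2 - S_1 = S_1 - S_0$ if and only if $c = 2$ or $c = 3$. Recall that for $c = 2$ categories all special cases of coefficient $S_r$ coincide. Thus, for $c = 2$, we have $S_2 - S_1 = S_1 - S_0 = 0$. Furthermore, an illustration for $c = 3$ categories is the second entry of Table 2. For this entry the estimates satisfy $\hat{S}_2 - \hat{S}_1 = \hat{S}_1 - \hat{S}_0$.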
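The statement about the two differences can be verified exactly with rational arithmetic. In the sketch below the function names and both tridiagonal tables of counts are hypothetical and made up for illustration; they are not entries of Table 2 or Table 3.

```python
from fractions import Fraction


def s_r(freq, r):
    """Exact coefficient for an integer exponent r, using rational arithmetic."""
    c = len(freq)
    n = sum(sum(row) for row in freq)
    pairs = [(i, j) for i in range(c) for j in range(c) if i != j]
    observed = sum(Fraction(abs(i - j) ** r * freq[i][j], n) for i, j in pairs)
    total_weight = sum(abs(i - j) ** r for i, j in pairs)
    return 1 - c**2 * observed / total_weight


def differences(freq):
    """Return the pair (S_1 - S_0, S_2 - S_1) for a table of counts."""
    s0, s1, s2 = (s_r(freq, r) for r in (0, 1, 2))
    return s1 - s0, s2 - s1


# Hypothetical tridiagonal tables with three and four categories.
tri3 = [[20, 5, 0], [4, 15, 3], [0, 2, 10]]
tri4 = [[10, 2, 0, 0], [3, 10, 2, 0], [0, 1, 10, 3], [0, 0, 2, 10]]

print(differences(tri3))  # the two differences coincide for three categories
print(differences(tri4))  # the second difference is smaller for four categories
```

Using `Fraction` avoids rounding error, so the equality for three categories can be tested with `==` rather than with a tolerance.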

4.3. Dependence on Number of Categories

A criticism against the use of Bennett et al.’s $S$ is that the coefficient tends to produce higher values for agreement tables with more categories [26]. More precisely, if the raw agreement $\sum_{i=1}^{c} p_{ii}$ is constant in (7) we have $S \to \sum_{i=1}^{c} p_{ii}$ as $c \to \infty$. Thus, if the rating scale has many categories, we have $S \approx \sum_{i=1}^{c} p_{ii}$, and $S$ is not a chance-corrected coefficient.

While $S_0$ has the raw agreement as an upper bound, it appears that the values of many other special cases of $S_r$ tend to go to unity as the number of categories of the rating scale increases. For example, suppose that $\mathbf{P}$ is tridiagonal and that $r = 1$. Using (29) and (35) the formula of $S_1$ is given by

$$S_1 = 1 - \frac{3c \left(1 - \sum_{i=1}^{c} p_{ii}\right)}{(c-1)(c+1)}. \tag{41}$$

We have $S_1 \to 1$ as $c \to \infty$. Since $S_1$ is usually a lower bound for all special cases of $S_r$ with $r > 1$ (Theorem 2), it follows that all coefficients $S_r$ with $r \geq 1$ go to unity as the number of categories increases. Dependence on the number of categories is considered an undesirable property of $S_r$.
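The dependence on the number of categories can be illustrated with a constructed family of tridiagonal tables that all have the same raw agreement. The construction below (diagonal mass spread evenly, remaining mass spread evenly over the adjacent cells) and all function names are assumptions made for this sketch. The closed form for the linear coefficient, $1 - 3c(1 - \text{agreement})/(c^2 - 1)$, is checked against a direct evaluation.

```python
def tridiagonal_table(c, agreement):
    """Hypothetical c x c table of proportions with total diagonal mass `agreement`."""
    p = [[0.0] * c for _ in range(c)]
    off = (1 - agreement) / (2 * (c - 1))
    for i in range(c):
        p[i][i] = agreement / c
    for i in range(c - 1):
        p[i][i + 1] = off
        p[i + 1][i] = off
    return p


def s_r(p, r):
    """Coefficient S_r evaluated directly from a table of proportions."""
    c = len(p)
    pairs = [(i, j) for i in range(c) for j in range(c) if i != j]
    observed = sum(abs(i - j) ** r * p[i][j] for i, j in pairs)
    return 1 - c**2 * observed / sum(abs(i - j) ** r for i, j in pairs)


def s1_closed_form(c, agreement):
    """Linear coefficient for a tridiagonal table with the given raw agreement."""
    return 1 - 3 * c * (1 - agreement) / (c**2 - 1)


# Raw agreement is held fixed at 0.5 while the number of categories grows.
for c in (3, 5, 10, 20):
    table = tridiagonal_table(c, 0.5)
    print(c, round(s_r(table, 0), 4), round(s_r(table, 1), 4))
```

With the raw agreement held at 0.5, the value for $r = 0$ stays below 0.5 while the linear coefficient climbs towards 1 as categories are added.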

5. Discussion

Bennett et al.’s [23] $S$ is an agreement coefficient for rating scales with nominal categories that has been discovered and rediscovered by many authors [12, 28, 29]. Recently, a weighted version of $S$ was proposed by Gwet [33] for rating scales with ordinal categories. In this paper we presented various properties of a special case of this weighted version, denoted by $S_r$, where $r$ is a nonnegative real number. Bennett et al.’s [23] $S$ corresponds to $S_0$, while $S_1$ and $S_2$ are the versions of $S_r$ that are obtained by using, respectively, the linear and quadratic weighting schemes.

It was first studied how the different versions of $S_r$ are related. Theorem 2 shows that $S_r$ is increasing in $r$ if the average disagreement between the observers on adjacent categories is greater than the average disagreement on categories that are 2 steps apart and if the latter is greater than the average disagreement on categories that are 3 steps apart and so on. Hence, in this case, there is a simple relationship between the values of the special cases of $S_r$. It turns out that Theorem 2 is quite a strong result. First of all, the result involves all special cases of $S_r$, and there are uncountably many versions of $S_r$. Secondly, the sufficient condition holds for many data tables reported in this paper (see Tables 2 and 3). Since $S_r$ is usually increasing in $r$, its special cases are essentially measuring the same thing.

For the application of Cohen’s kappa and weighted kappa, various authors have presented target values for evaluating the values of the kappa coefficients [44–47]. There is general consensus in the literature that uncritical application of such magnitude guidelines leads to practically questionable decisions. Warrens [48] argued that, since quadratic kappa produces values that are substantially higher than the values produced by Cohen’s kappa, the same guidelines cannot be used for both coefficients. A similar argument applies here. Tables 2 and 3 show that coefficients $S_0$, $S_1$, and $S_2$ produce quite different values. Thus, despite the fact that the coefficients measure the same thing, they do this to a different extent. If one accepts the use of magnitude guidelines, different criteria need to be developed for the different special cases of $S_r$.

Finally, a number of results were presented that illustrate that many special cases of $S_r$ tend to produce values close to unity, regardless of the data at hand. This is especially the case when $r$ is high and the number of categories of the rating scale is large (5 or more). These results were only proved for agreement tables that are tridiagonal, but the high estimates in Tables 2 and 3 suggest that the results hold under more general conditions. The dependence of $S_r$ on the number of categories implies that different criteria need to be formulated depending on the number of categories. Developing different criteria for different coefficients and different numbers of categories seems an impossible task. Hence, coefficient $S_r$ is useless as a general agreement coefficient. It is advised to limit the application of weighted versions of Bennett et al.’s $S$ to one or two coefficients, for example, $S_0$ for rating scales with nominal categories and one fixed weighted version of $S_r$ for rating scales with ordinal categories.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was done while the author was funded by the Netherlands Organisation for Scientific Research, Veni Project 451-11-026.