Journal of Mathematics
Volume 2014 (2014), Article ID 231909, 9 pages
http://dx.doi.org/10.1155/2014/231909
Research Article

Power Weighted Versions of Bennett, Alpert, and Goldstein's S

Matthijs J. Warrens

Institute of Psychology, Unit Methodology and Statistics, Leiden University, P.O. Box 9555, 2300 RB Leiden, The Netherlands

Received 30 May 2014; Accepted 7 September 2014; Published 3 December 2014

Academic Editor: Yonghui Sun

Copyright © 2014 Matthijs J. Warrens. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

A weighted version of Bennett, Alpert, and Goldstein's $S$, denoted by $S_r$, is studied. It is shown that the special cases of $S_r$ are often ordered in the same way. It is also shown that many special cases of $S_r$ tend to produce values close to unity, especially when the number of categories of the rating scale is large. It is argued that the application of $S_r$ as an agreement coefficient is not without difficulties.

1. Introduction

In behavioral and biomedical science it is frequently required to measure the intensity of a behavior or a disease. Examples are the degree of arousal of a speech-anxious participant while giving a presentation, the severity of lesions from scans, or the severity of sedation during opioid administration for pain management. The intensity of these phenomena is usually classified by a single observer using a rating scale with ordered categories, for example, mild, moderate, or severe. To ensure that the observer fully understands what he or she is asked to interpret, the categories must be clearly defined. To measure the reliability of the rating scale researchers typically ask two observers to rate the same set of subjects independently. Analysis of the agreement between the observers can then be used to assess the reliability of the scale. High agreement between the ratings of the observers usually indicates consensus in the diagnosis and interchangeability of the classifications of the observers.

For assessing agreement on an ordinal scale various statistical methodologies have been developed. For example, the loglinear models presented in Tanner and Young [1] and Agresti [2, 3] can be used for analyzing the patterns of agreement and potential sources of disagreement. Applications of these models can be found in Becker [4] and Graham and Jackson [5]. However, it turns out that researchers are usually only interested in a coefficient that (roughly) summarizes the agreement in a single number. The most commonly used coefficient for summarizing agreement on an ordinal scale is weighted kappa proposed in Cohen [6] ([5, 7]). Cohen [8] proposed coefficient kappa as an index of agreement when the rating scale has nominal (unordered) categories [9]. The coefficient corrects for agreement due to chance. Weighted kappa extends Cohen’s original kappa to rating scales with ordered categories. In the latter case there is usually more disagreement between the observers on adjacent categories than on categories that are further apart. With weighted kappa it is possible to describe the closeness between categories using weights. Both kappa and weighted kappa are standard tools in assessing agreement and have been used in thousands of applications [10, 11]. The most commonly used version of weighted kappa is quadratic kappa [5, 7].

Various authors have identified difficulties with the interpretation of kappa for nominal categories [7, 12–17]. Cohen's kappa is a function of the marginal totals, the base rates of the categories, which indicate how often the categories were used by the observers [18–20]. Cohen's kappa tends to produce much lower values for skewed marginal distributions. Furthermore, kappas from samples with different base rates are not comparable [13, 16]. de Mast and van Wieringen [16] and de Mast [17] studied kappa and kappa-type coefficients in the context of a latent class model. These authors argued that the problematic behavior of kappa is explained by the fact that it is a coefficient of predictive association instead of a pure coefficient of agreement. Other authors have identified difficulties with the interpretation of quadratic kappa for ordered categories as well. Quadratic kappa behaves as a measure of association instead of an agreement coefficient [5]. The value of quadratic kappa also tends to increase as the number of categories increases [21]. Furthermore, quadratic kappa cannot discriminate between tables with very different levels of exact agreement [22].

A commonly proposed alternative to Cohen's kappa for nominal categories is coefficient $S$, originally proposed in Bennett et al. [23] ([24–26]). Since coefficient $S$ is a linear transformation of the raw agreement and not a function of the marginal totals, it does not exhibit the interpretation difficulties of the kappa coefficients [10, 27]. Furthermore, under the latent class model discussed in de Mast and van Wieringen [16] and de Mast [17], coefficient $S$ is the only agreement coefficient that can be given some justification. Coefficient $S$ is equivalent to a coefficient in Janson and Vegelius [28], to coefficient RE in Janes [29], and to $\kappa_n$ in Brennan and Prediger [12]. In the case of two categories coefficient $S$ is equivalent to coefficients discussed in, among others, Holley and Guilford [30], Maxwell [31], and Krippendorff [32].

Recently, Gwet [33] proposed a weighted version of coefficient $S$ for rating scales with ordinal categories. In this paper this coefficient will be denoted by $S(w)$. The generalization proposed in [33] is analogous to the generalization of kappa [8] to weighted kappa [6]. The weighting schemes that can be used with $S(w)$ are identical to the weighting schemes of weighted kappa. The most commonly used weighting schemes for weighted kappa are the linear weights [34–36] and the quadratic weights [22, 37, 38]. In this paper we study how $S(w)$ behaves as an agreement coefficient for rating scales with ordinal categories. More precisely, we study a special case of $S(w)$, which will be denoted by $S_r$. Special cases of $S_r$ are coefficient $S$ and the coefficients that are obtained if we use the linear and quadratic weighting schemes. We present several properties of $S_r$ that indicate that the application of $S_r$ as an agreement coefficient is not without problems.

The paper is organized as follows. In Section 2 we introduce notation and define coefficients $S(w)$ and $S_r$. In Section 3 it is shown that there is a simple ordering of the special cases of $S_r$ if a certain mild condition holds. Since this requirement is often met in real life, the special cases of $S_r$ are usually ordered in the same way. In Section 4 we present properties of $S_r$ for tridiagonal agreement tables. It is shown that many special cases of $S_r$ tend to produce values close to unity, especially when the number of categories of the rating scale is large. Section 5 contains a discussion.

2. Weighted Coefficients

In this section we introduce notation and define the coefficients $S(w)$ and $S_r$. Gwet [33, page 56] defines his coefficient in terms of similarity scaling. However, for notational convenience, we will define $S(w)$ in terms of dissimilarity scaling here. If the weights are dissimilarities, pairs of categories that are further apart are usually assigned higher weights.

Suppose two fixed observers independently rate the same set of subjects using the same set of $n$ ordered categories that are defined in advance. For a population of subjects, let $p_{ij}$ denote the proportion classified in category $i$ by the first observer and in category $j$ by the second observer, for $i, j \in \{1, 2, \ldots, n\}$, and let $P = \{p_{ij}\}$ denote the corresponding table of proportions. Furthermore, let $F = \{f_{ij}\}$ denote the contingency table of observed frequencies. The probabilities $p_{ij}$ can be estimated from the $f_{ij}$. Assuming a multinomial sampling model with the total number of subjects $N$ fixed, the maximum likelihood estimate of $p_{ij}$ is given by $\hat{p}_{ij} = f_{ij}/N$ [39, 40].

Since the labels of the rows and columns of $F$ are identical, the contingency table is usually called an agreement table. Table 1 is an example of an agreement table. The data of Table 1 come from Holmquist et al. [41]. Seven pathologists, labeled A to G, classified each of 118 slides in terms of carcinoma in situ of the uterine cervix, based on the most involved lesion, using $n = 5$ ordered categories: (1) negative, (2) atypical squamous hyperplasia, (3) carcinoma in situ, (4) squamous carcinoma with early stromal invasion, and (5) invasive carcinoma. The data can also be found in Landis and Koch [42]. Table 1 is the cross classification of the ratings of pathologists A and D.

Table 1: Frequencies of pairwise classifications of carcinoma by pathologists A and D [41].

Let $w_{ij}$ for $i, j \in \{1, 2, \ldots, n\}$ be nonnegative real numbers with $w_{ii} = 0$. The numbers $w_{ij}$ are used as weights, one for each cell of the table $P$. If we formulate Gwet's approach in terms of dissimilarity scaling, then Gwet [33] presented the coefficient
$$S(w) = 1 - \frac{n^2 \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} p_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}. \tag{1}$$
Coefficient $S(w)$ is well defined if we require that at least one $w_{ij}$ is nonzero. With the weights $w_{ij}$ fixed, the maximum likelihood estimate of (1) under a multinomial sampling model is given by
$$\hat{S}(w) = 1 - \frac{n^2 \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \hat{p}_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}. \tag{2}$$
In this paper we are interested in a particular weighting scheme. Let $r$ be a nonnegative real number and consider the weight function
$$w_{ij} = |i - j|^{r} \quad \text{for } i \neq j, \qquad w_{ii} = 0. \tag{3}$$
Using weight function (3) in (1) we obtain the weighted coefficient
$$S_r = 1 - \frac{n^2 \sum_{i \neq j} |i - j|^{r} p_{ij}}{\sum_{i \neq j} |i - j|^{r}}. \tag{4}$$
Various well-known weighting schemes are special cases of weighting scheme (3). For $r = 0$ we have the identity weights
$$w_{ij} = \begin{cases} 0 & \text{if } i = j, \\ 1 & \text{if } i \neq j. \end{cases} \tag{5}$$
For $n = 5$ categories weighting scheme (5) is given by
$$\begin{pmatrix} 0 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 \end{pmatrix}. \tag{6}$$
If we use $r = 0$ in (4) we obtain
$$S_0 = \frac{n P_o - 1}{n - 1}, \qquad \text{where } P_o = \sum_{i=1}^{n} p_{ii}. \tag{7}$$
Coefficient (7) is Bennett et al.'s [23] $S$, an agreement coefficient proposed for rating scales with nominal categories [12, 28, 29]. Coefficient $S$ is thus a special case of coefficient (4). The value of $S_0$ is 1 if there is perfect agreement between the observers and 0 when $P_o = 1/n$. For Table 1 the estimates of $S_0$ and of the other special cases introduced below can be found in Table 3, in the row for the pair (A, D).
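As an illustrative aside (not part of the original derivation), coefficient (4) can be evaluated directly from a table of proportions. The short NumPy sketch below defines a hypothetical helper power_weighted_s; it assumes a square array p whose entries sum to one and a nonnegative power r.

```python
import numpy as np

def power_weighted_s(p, r):
    """Power-weighted coefficient S_r of (4): p is an n x n array of
    proportions summing to one, r >= 0 is the power used in the weights."""
    p = np.asarray(p, dtype=float)
    n = p.shape[0]
    # dissimilarity weights w_ij = |i - j|^r with w_ii = 0, as in (3)
    w = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float) ** r
    np.fill_diagonal(w, 0.0)
    observed = np.sum(w * p)          # observed weighted disagreement
    expected = np.sum(w) / n ** 2     # disagreement expected under uniform chance
    return 1.0 - observed / expected
```

For $r = 0$ the sketch returns Bennett et al.'s $S = (n P_o - 1)/(n - 1)$ of (7); for $r = 1$, $r = 2$, and $r = 1/2$ it returns the linear, quadratic, and radical versions discussed below.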

If we use $r = 1$ in (3) we obtain the linear weights [34–36]. For $n = 5$ categories the linear weighting scheme is given by
$$\begin{pmatrix} 0 & 1 & 2 & 3 & 4 \\ 1 & 0 & 1 & 2 & 3 \\ 2 & 1 & 0 & 1 & 2 \\ 3 & 2 & 1 & 0 & 1 \\ 4 & 3 & 2 & 1 & 0 \end{pmatrix}. \tag{8}$$
The linear coefficient is denoted by $S_1$.

If we use $r = 2$ in (3) we obtain the quadratic weights [22, 37, 38]. For $n = 5$ categories the quadratic weighting scheme is given by
$$\begin{pmatrix} 0 & 1 & 4 & 9 & 16 \\ 1 & 0 & 1 & 4 & 9 \\ 4 & 1 & 0 & 1 & 4 \\ 9 & 4 & 1 & 0 & 1 \\ 16 & 9 & 4 & 1 & 0 \end{pmatrix}. \tag{9}$$
The quadratic coefficient is denoted by $S_2$.

Furthermore, if we use $r = 1/2$ in (3) we obtain the radical weights [33, pages 63 and 64]. For $n = 5$ categories the radical weighting scheme is given by
$$\begin{pmatrix} 0 & 1 & \sqrt{2} & \sqrt{3} & 2 \\ 1 & 0 & 1 & \sqrt{2} & \sqrt{3} \\ \sqrt{2} & 1 & 0 & 1 & \sqrt{2} \\ \sqrt{3} & \sqrt{2} & 1 & 0 & 1 \\ 2 & \sqrt{3} & \sqrt{2} & 1 & 0 \end{pmatrix}. \tag{10}$$
The radical coefficient is denoted by $S_{1/2}$. As Table 3 shows, for Table 1 we have the ordering $\hat{S}_0 < \hat{S}_{1/2} < \hat{S}_1 < \hat{S}_2$.

Finally, for $n = 2$ categories coefficient (4) becomes
$$S_r = 1 - 2(p_{12} + p_{21}) = 2 P_o - 1. \tag{11}$$
Since all special cases of $S_r$ coincide for $n = 2$ categories, there are no examples of $2 \times 2$ tables in this paper.
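The coincidence of all special cases for $n = 2$ categories is easy to verify numerically. The check below uses a hypothetical 2 × 2 table of proportions (not one of the data sets in this paper) and assumes NumPy.

```python
import numpy as np

# Hypothetical 2x2 table of proportions; every power r yields 2*P_o - 1.
p = np.array([[0.45, 0.10],
              [0.15, 0.30]])
p_o = np.trace(p)                                              # raw agreement P_o = 0.75
for r in (0.0, 0.5, 1.0, 2.0):
    w = np.abs(np.subtract.outer(np.arange(2), np.arange(2))).astype(float) ** r
    np.fill_diagonal(w, 0.0)
    s_r = 1.0 - 2 ** 2 * np.sum(w * p) / np.sum(w)
    print(r, round(s_r, 3))                                    # 0.5 for every r
print(round(2 * p_o - 1, 3))                                   # also 0.5
```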

3. Conditional Inequalities

If we apply coefficients $S_0$, $S_{1/2}$, $S_1$, and $S_2$ to the same rating data we consistently find the triple inequality $\hat{S}_0 \leq \hat{S}_{1/2} \leq \hat{S}_1 \leq \hat{S}_2$. For example, consider the data entries in Table 2. Table 2 presents various statistics of 20 agreement tables from the literature. The first column of Table 2 specifies the source of the agreement table and the second column shows whether the table has size $3 \times 3$, $4 \times 4$, or $5 \times 5$. Columns 3 to 6 of Table 2 contain the values of the estimates $\hat{S}_0$, $\hat{S}_{1/2}$, $\hat{S}_1$, and $\hat{S}_2$. For all entries except the first we have the strict triple inequality $\hat{S}_0 < \hat{S}_{1/2} < \hat{S}_1 < \hat{S}_2$.

Table 2: Various statistics for 20 agreement tables from the literature.

As a second example we consider the data on diagnosis of carcinoma from Holmquist et al. [41]. Seven pathologists, labeled A to G, classified each of 118 slides in terms of carcinoma in situ of the uterine cervix, based on the most involved lesion, using five ordered categories. Table 1 is the cross classification of the ratings of pathologists A and D. Table 3 presents various statistics of the 21 pairwise agreement tables for the seven pathologists. Columns 2 to 5 of Table 3 contain the values of the estimates $\hat{S}_0$, $\hat{S}_{1/2}$, $\hat{S}_1$, and $\hat{S}_2$. For all 21 tables we have the triple inequality $\hat{S}_0 \leq \hat{S}_{1/2} \leq \hat{S}_1 \leq \hat{S}_2$. The quantities in the last four columns of Tables 2 and 3 are defined and discussed below.

Table 3: Various statistics for the 21 pairwise agreement tables between seven pathologists [41].

Tables 2 and 3 illustrate that the ordering $\hat{S}_0 \leq \hat{S}_{1/2} \leq \hat{S}_1 \leq \hat{S}_2$ is often found with real life data. This suggests that $S_r$ is usually increasing in $r$. The triple inequality does not hold in general, but it holds if a certain condition is valid. This sufficient condition is defined below. Recall that $P = \{p_{ij}\}$ is the agreement table with proportions. Define the quantities
$$d_u = \frac{1}{2(n - u)} \sum_{|i - j| = u} p_{ij}, \qquad u \in \{1, 2, \ldots, n - 1\}. \tag{12}$$
For fixed $u$, the quantity $d_u$ in (12) is the sum of all elements of $P$ that are $u$ steps removed from the main diagonal, divided by $2(n - u)$. Since there are precisely $2(n - u)$ elements that are $u$ steps removed from the main diagonal, $d_u$ is the average disagreement of the elements that are $u$ steps removed from the main diagonal. Since the elements of $P$ that are $u$ steps removed from the main diagonal correspond to pairs of categories that are $u$ steps apart, $d_1$ is the average disagreement between the observers on adjacent categories, $d_2$ is the average disagreement on all categories that are two steps apart, and so on.

With ordered categories it is natural to assume that
$$d_1 \geq d_2 \geq \cdots \geq d_{n-1}. \tag{13}$$
Condition (13) states that the average disagreement between the observers on categories that are closer in the ordering is higher than on categories that are further apart in the ordering. Since condition (13) depends on the unobserved probabilities $p_{ij}$ it cannot be verified directly. If we replace the probabilities $p_{ij}$ by the $\hat{p}_{ij}$ we obtain the estimates
$$\hat{d}_u = \frac{1}{2(n - u)} \sum_{|i - j| = u} \hat{p}_{ij}, \qquad u \in \{1, 2, \ldots, n - 1\}. \tag{14}$$
To check whether inequality (13) is reasonable for real life data we may check whether the inequality
$$\hat{d}_1 \geq \hat{d}_2 \geq \cdots \geq \hat{d}_{n-1} \tag{15}$$
holds. It turns out that condition (15) holds for many real life agreement tables with ordered categories. This is to be expected if the rating scale has been thoughtfully constructed, since in this case one expects that the disagreement between the observers on categories that are closer in the ordering is higher than on categories that are further apart in the ordering. For example, consider the data in Table 1. Computing the quantities in (14) we find
$$\hat{d}_1 \geq \hat{d}_2 \geq \hat{d}_3 \geq \hat{d}_4. \tag{16}$$
Hence, condition (15) holds for Table 1. Furthermore, the last four columns of Tables 2 and 3 contain the estimates $\hat{d}_1$, $\hat{d}_2$, $\hat{d}_3$, and $\hat{d}_4$ for the various agreement tables. For all entries of Table 2 condition (15) holds. Moreover, for all entries except the first, the inequalities are strict. For the first entry of Table 2 [8] we have
$$\hat{d}_1 = \hat{d}_2. \tag{17}$$
If the $\hat{d}_u$ are all equal, all special cases of $\hat{S}_r$ coincide. It should be noted that the data in Cohen [8] are artificial. Condition (15) also holds for most entries of Table 3. The three exceptions are the entries corresponding to the pairs (A,C), (B,C), and (C,E).
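The quantities in (14) and condition (15) are straightforward to inspect numerically. The sketch below (illustrative only; the 4 × 4 table is hypothetical and assumes NumPy) computes the average disagreements $\hat{d}_u$ and checks whether they are nonincreasing.

```python
import numpy as np

def average_disagreements(p):
    """Estimates d_1, ..., d_{n-1} of (14): the average proportion at each
    distance u from the main diagonal."""
    p = np.asarray(p, dtype=float)
    n = p.shape[0]
    dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return np.array([p[dist == u].sum() / (2 * (n - u)) for u in range(1, n)])

# Hypothetical 4x4 table with most probability mass close to the main diagonal.
p = np.array([[0.20, 0.05, 0.01, 0.00],
              [0.04, 0.18, 0.05, 0.01],
              [0.01, 0.04, 0.17, 0.03],
              [0.00, 0.01, 0.03, 0.17]])
d_hat = average_disagreements(p)
print(np.round(d_hat, 4))                   # [0.04, 0.01, 0.0]
print(bool(np.all(np.diff(d_hat) <= 0)))    # True: condition (15) holds for this table
```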

Theorem 2 below shows that $S_r$ is increasing in $r$ if condition (13) holds. Thus, if (13) holds there is a simple relationship between the special cases of coefficient (4). In particular, if (13) holds we have the triple inequality
$$S_0 \leq S_{1/2} \leq S_1 \leq S_2. \tag{18}$$
Lemma 1 is used in the proof of Theorem 2.

Lemma 1. Let $c_u$ for $u \in \{1, 2, \ldots, m\}$ be nonnegative real numbers and let $a_u$ and $b_u$ for $u \in \{1, 2, \ldots, m\}$ be positive real numbers. If
$$c_1 \geq c_2 \geq \cdots \geq c_m, \tag{19}$$
$$\frac{a_1}{b_1} > \frac{a_2}{b_2} > \cdots > \frac{a_m}{b_m}, \tag{20}$$
then
$$\frac{\sum_{u=1}^{m} a_u c_u}{\sum_{u=1}^{m} a_u} \geq \frac{\sum_{u=1}^{m} b_u c_u}{\sum_{u=1}^{m} b_u}. \tag{21}$$
Furthermore, inequality (21) is strict if two $c_u$ are distinct.

Proof. We start with the first part of the assertion. From (20) it follows that $a_u b_v - a_v b_u > 0$ for $u < v$. Since $c_u \geq c_v$ for $u < v$ it follows from (19) that
$$(a_u b_v - a_v b_u)(c_u - c_v) \geq 0, \quad \text{that is,} \quad a_u c_u b_v + a_v c_v b_u \geq a_u c_v b_v + a_v c_u b_u. \tag{22}$$
Summing (22) over all $u$ and $v$ with $u < v$ we obtain
$$\sum_{u < v} \left( a_u c_u b_v + a_v c_v b_u \right) \geq \sum_{u < v} \left( a_u c_v b_v + a_v c_u b_u \right). \tag{23}$$
Adding $\sum_{u=1}^{m} a_u b_u c_u$ to both sides of (23) we obtain
$$\left( \sum_{u=1}^{m} a_u c_u \right) \left( \sum_{u=1}^{m} b_u \right) \geq \left( \sum_{u=1}^{m} b_u c_u \right) \left( \sum_{u=1}^{m} a_u \right). \tag{24}$$
Since $\sum_{u=1}^{m} a_u$ and $\sum_{u=1}^{m} b_u$ are positive, inequality (24) is equivalent to (21). Finally, note that if two $c_u$ are distinct, then (22) is strict for at least one pair $(u, v)$, and hence (23) and (24) are strict.

Theorem 2. Let $r_1$ and $r_2$ be real numbers with $0 \leq r_1 < r_2$. If condition (13) holds, then $S_{r_1} \leq S_{r_2}$, with equality if and only if all $d_u$ in (13) are equal.

Proof. Using (4) we have $S_{r_1} \leq S_{r_2}$ if and only if
$$\frac{\sum_{i \neq j} |i - j|^{r_1} p_{ij}}{\sum_{i \neq j} |i - j|^{r_1}} \geq \frac{\sum_{i \neq j} |i - j|^{r_2} p_{ij}}{\sum_{i \neq j} |i - j|^{r_2}}. \tag{25}$$
For $u \in \{1, 2, \ldots, n - 1\}$ we have the identities
$$\sum_{|i - j| = u} p_{ij} = 2(n - u) d_u, \tag{26}$$
$$\sum_{|i - j| = u} 1 = 2(n - u). \tag{27}$$
Using (26) and (27), inequality (25) can be written as
$$\frac{\sum_{u=1}^{n-1} u^{r_1} (n - u) d_u}{\sum_{u=1}^{n-1} u^{r_1} (n - u)} \geq \frac{\sum_{u=1}^{n-1} u^{r_2} (n - u) d_u}{\sum_{u=1}^{n-1} u^{r_2} (n - u)}. \tag{28}$$
Let $a_u = u^{r_1} (n - u)$ and $b_u = u^{r_2} (n - u)$ for $u \in \{1, 2, \ldots, n - 1\}$. Because $r_1 < r_2$, the quantity $a_u / b_u = u^{r_1 - r_2}$ is strictly decreasing in $u$. Since (19) and (20) hold with $c_u = d_u$, validity of inequality (28), and thus validity of the inequality $S_{r_1} \leq S_{r_2}$, follows from application of Lemma 1. Moreover, by the final statement of Lemma 1, inequality (28) is strict if two $d_u$ are distinct, whereas both sides of (28) coincide if all $d_u$ are equal. This proves the equality part.
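As a numerical illustration of Theorem 2 (reusing the hypothetical 4 × 4 table from the previous sketch, for which condition (15) holds), the estimates $\hat{S}_r$ are indeed nondecreasing in $r$:

```python
import numpy as np

def power_weighted_s(p, r):
    # same helper as in the Section 2 sketch
    n = p.shape[0]
    w = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float) ** r
    np.fill_diagonal(w, 0.0)
    return 1.0 - n ** 2 * np.sum(w * p) / np.sum(w)

p = np.array([[0.20, 0.05, 0.01, 0.00],
              [0.04, 0.18, 0.05, 0.01],
              [0.01, 0.04, 0.17, 0.03],
              [0.00, 0.01, 0.03, 0.17]])
values = [power_weighted_s(p, r) for r in (0.0, 0.5, 1.0, 2.0, 4.0)]
print(np.round(values, 3))       # approximately [0.627, 0.686, 0.744, 0.840, 0.939]
print(values == sorted(values))  # True: the values increase with r
```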

4. Tridiagonal Agreement Tables

In practice it frequently occurs that an agreement table with ordered categories is (approximately) tridiagonal. A tridiagonal table is a square matrix that has nonzero elements only on the main diagonal and on the first diagonals directly below and above the main diagonal. If the agreement table is tridiagonal there is only disagreement between the observers on adjacent categories. In this section we present results that hold if the agreement table is tridiagonal. In this case we have $d_u = 0$ for $u \in \{2, 3, \ldots, n - 1\}$ and it follows that condition (13) holds. It may be that the results below also hold under the weaker requirement that condition (13) is valid. Note that Theorem 2 is always valid for tridiagonal agreement tables.

The tridiagonal tables in Table 2 have $\hat{d}_2 = 0$ if they are of size $3 \times 3$, $\hat{d}_2 = \hat{d}_3 = 0$ if they are of size $4 \times 4$, and $\hat{d}_2 = \hat{d}_3 = \hat{d}_4 = 0$ if they are of size $5 \times 5$. For 6 of the 20 entries in Table 2 the agreement table is tridiagonal. In Table 3 the tridiagonal tables have $\hat{d}_2 = \hat{d}_3 = \hat{d}_4 = 0$. The agreement tables of the pairs (B,E) and (D,G) are tridiagonal. Many other agreement tables corresponding to the entries of Tables 2 and 3 are approximately tridiagonal: only a few disagreements are not on the diagonals directly below and above the main diagonal.

4.1. Upper Bound Unity

Theorem 2 and Tables 2 and 3 show that $\hat{S}_r$ is usually increasing in $r$. This implies that Bennett et al.'s $S$ is usually a lower bound of the other special cases of $S_r$. Furthermore, it suggests that $S_r$ goes to unity as $r$ increases, regardless of the data at hand. Theorem 3 formalizes this observation for tridiagonal agreement tables.

Theorem 3. If $n \geq 3$ is fixed and $P$ is tridiagonal, then $S_r \to 1$ as $r \to \infty$.

Proof. If $P$ is tridiagonal, (4) becomes
$$S_r = 1 - \frac{n^2 (1 - P_o)}{2 \sum_{u=1}^{n-1} u^{r} (n - u)}. \tag{29}$$
Since the elements of $P$ sum to unity, we have the inequality
$$1 - S_r \leq \frac{n^2}{2 \sum_{u=1}^{n-1} u^{r} (n - u)}. \tag{30}$$
The right-hand side of (30) does not depend on the data. Since the denominator is increasing in $r$ for $n \geq 3$, we can, for fixed $n$, make the right-hand side of (30) arbitrarily small by choosing $r$ sufficiently large. Hence, $S_r \to 1$ as $r \to \infty$.
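The convergence in Theorem 3 can be observed numerically. In the sketch below the 5 × 5 tridiagonal table is hypothetical (30% of the probability mass lies on the first off-diagonals); as $r$ grows the coefficient approaches unity even though this disagreement remains fixed.

```python
import numpy as np

def power_weighted_s(p, r):
    # same helper as in the Section 2 sketch
    n = p.shape[0]
    w = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float) ** r
    np.fill_diagonal(w, 0.0)
    return 1.0 - n ** 2 * np.sum(w * p) / np.sum(w)

n = 5
# tridiagonal table with P_o = 0.70 and all disagreement on adjacent categories
p = np.diag([0.14] * n) + 0.0375 * (np.eye(n, k=1) + np.eye(n, k=-1))
for r in (0, 1, 2, 5, 10):
    print(r, round(power_weighted_s(p, r), 4))   # 0.625, 0.8125, 0.925, ..., approaching 1
```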

4.2. An Inequality for Differences

Since the parameter $r$ in (4) is a nonnegative real number there are uncountably many special cases of $S_r$. Theorems 2 and 3 together with Tables 2 and 3 show that all these special cases usually lie between $\hat{S}_0$ and 1. Tables 2 and 3 also show that the positive differences $\hat{S}_1 - \hat{S}_0$ and $\hat{S}_2 - \hat{S}_1$ are quite substantial. This suggests that most members of the family $\{S_r : r \geq 0\}$ will lie close to 1 and that the consecutive differences become smaller and smaller as $r$ increases.

In this section we present a particular result for the positive differences $S_1 - S_0$ and $S_2 - S_1$. Theorem 5 below shows that $S_2 - S_1$ never exceeds $S_1 - S_0$. We first derive explicit formulas for the sums $\sum_{i \neq j} |i - j|^{r}$ with $r \in \{0, 1, 2\}$ in Lemma 4.

Lemma 4. It holds that
$$\sum_{i \neq j} 1 = n(n - 1), \tag{31}$$
$$\sum_{i \neq j} |i - j| = \frac{n(n^2 - 1)}{3}, \tag{32}$$
$$\sum_{i \neq j} (i - j)^2 = \frac{n^2(n^2 - 1)}{6}. \tag{33}$$

Proof. We only present proofs of identities (32) and (33). We will use the following identities for sums of powers of integers (see, e.g., [43]):
$$\sum_{u=1}^{n-1} u = \frac{n(n - 1)}{2}, \tag{34}$$
$$\sum_{u=1}^{n-1} u^2 = \frac{n(n - 1)(2n - 1)}{6}, \tag{35}$$
$$\sum_{u=1}^{n-1} u^3 = \frac{n^2(n - 1)^2}{4}. \tag{36}$$
Using identities (27), (34), and (35) we have
$$\sum_{i \neq j} |i - j| = \sum_{u=1}^{n-1} 2(n - u) u = 2n \sum_{u=1}^{n-1} u - 2 \sum_{u=1}^{n-1} u^2 = n^2(n - 1) - \frac{n(n - 1)(2n - 1)}{3} = \frac{n(n^2 - 1)}{3}. \tag{37}$$
Furthermore, using identities (27), (35), and (36) we have
$$\sum_{i \neq j} (i - j)^2 = \sum_{u=1}^{n-1} 2(n - u) u^2 = 2n \sum_{u=1}^{n-1} u^2 - 2 \sum_{u=1}^{n-1} u^3 = \frac{n^2(n - 1)(2n - 1)}{3} - \frac{n^2(n - 1)^2}{2} = \frac{n^2(n^2 - 1)}{6}. \tag{38}$$
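The closed-form expressions in Lemma 4 can also be checked by brute force for small table sizes (an illustrative check, not part of the original proof; assumes NumPy):

```python
import numpy as np

# Brute-force check of identities (31)-(33) for n = 2, ..., 8.
for n in range(2, 9):
    i, j = np.indices((n, n))
    dist = np.abs(i - j)
    assert np.count_nonzero(dist) == n * (n - 1)                       # identity (31)
    assert np.isclose(dist.sum(), n * (n ** 2 - 1) / 3)                # identity (32)
    assert np.isclose((dist ** 2).sum(), n ** 2 * (n ** 2 - 1) / 6)    # identity (33)
print("identities (31)-(33) verified for n = 2, ..., 8")
```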

Theorem 5. If $P$ is tridiagonal and $P_o < 1$, then $S_1 - S_0 \geq S_2 - S_1$, with equality if and only if $n \in \{2, 3\}$.

Proof. The formula of $S_r$ for the case that $P$ is tridiagonal is given in (29). Since $P_o < 1$, the inequality $S_1 - S_0 \geq S_2 - S_1$, or $2 S_1 \geq S_0 + S_2$, is equivalent to
$$\frac{2}{\sum_{i \neq j} |i - j|} \leq \frac{1}{\sum_{i \neq j} 1} + \frac{1}{\sum_{i \neq j} (i - j)^2}. \tag{39}$$
Using identities (31), (32), and (33), inequality (39) becomes
$$\frac{6}{n(n^2 - 1)} \leq \frac{1}{n(n - 1)} + \frac{6}{n^2(n^2 - 1)}, \tag{40}$$
or, equivalently, $(n - 2)(n - 3) \geq 0$, which holds for all $n \geq 2$ with equality if and only if $n \in \{2, 3\}$.

If the agreement table is tridiagonal, Theorem 5 thus shows that $S_1 - S_0 = S_2 - S_1$ if and only if $n = 2$ or $n = 3$. Recall that for $n = 2$ categories all special cases of coefficient (4) coincide. Thus, for $n = 2$, we have $S_1 - S_0 = S_2 - S_1 = 0$. Furthermore, an illustration for $n = 3$ categories is the second entry of Table 2. For this entry the estimates in Table 2 satisfy $\hat{S}_1 - \hat{S}_0 = \hat{S}_2 - \hat{S}_1$.
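A small simulation (illustrative; the tridiagonal tables are randomly generated and hypothetical) shows the behavior described by Theorem 5: the gap $(S_1 - S_0) - (S_2 - S_1)$ is zero for $n = 3$ and positive for larger tridiagonal tables.

```python
import numpy as np

def power_weighted_s(p, r):
    # same helper as in the Section 2 sketch
    n = p.shape[0]
    w = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float) ** r
    np.fill_diagonal(w, 0.0)
    return 1.0 - n ** 2 * np.sum(w * p) / np.sum(w)

rng = np.random.default_rng(2014)
for n in (3, 4, 5, 6):
    # random tridiagonal table of proportions
    p = np.diag(rng.random(n)) + np.diag(rng.random(n - 1), 1) + np.diag(rng.random(n - 1), -1)
    p /= p.sum()
    gap = (power_weighted_s(p, 1) - power_weighted_s(p, 0)) - (power_weighted_s(p, 2) - power_weighted_s(p, 1))
    print(n, round(gap, 6))   # ~0 for n = 3, positive for n = 4, 5, 6
```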

4.3. Dependence on Number of Categories

A criticism against the use of Bennett et al.'s $S$ is that the coefficient tends to produce higher values for agreement tables with more categories [26]. More precisely, if the raw agreement $P_o$ is held constant, then in (7) we have $S_0 \to P_o$ as $n \to \infty$. Thus, if the rating scale has many categories, we have $S_0 \approx P_o$, and $S_0$ is then effectively not a chance-corrected coefficient.

While $S_0$ has the raw agreement $P_o$ as an upper bound, it appears that the values of many other special cases of $S_r$ tend to go to unity as the number of categories of the rating scale increases. For example, suppose that $P$ is tridiagonal and that $r = 1$. Using (29) and (32) the formula of $S_1$ is given by
$$S_1 = 1 - \frac{3 n (1 - P_o)}{n^2 - 1}. \tag{41}$$
For fixed $P_o$ we have $S_1 \to 1$ as $n \to \infty$. Since $S_1$ is usually a lower bound for all special cases of $S_r$ with $r \geq 1$ (Theorem 2), it follows that all coefficients $S_r$ with $r \geq 1$ tend to go to unity as the number of categories increases. Dependence on the number of categories is considered an undesirable property of $S_r$.
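The dependence on the number of categories can be illustrated with a small calculation (a sketch under the tridiagonal assumption, with the raw agreement held fixed at a hypothetical value of 0.80):

```python
# Tridiagonal table with fixed raw agreement P_o = 0.80:
# S_0 = (n*P_o - 1)/(n - 1) of (7) tends to P_o, while
# S_1 = 1 - 3n(1 - P_o)/(n^2 - 1) of (41) tends to 1 as n grows.
p_o = 0.80
for n in (3, 5, 10, 20, 50):
    s_0 = (n * p_o - 1) / (n - 1)
    s_1 = 1 - 3 * n * (1 - p_o) / (n ** 2 - 1)
    print(n, round(s_0, 3), round(s_1, 3))
```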

5. Discussion

Bennett et al.'s [23] $S$ is an agreement coefficient for rating scales with nominal categories that has been discovered and rediscovered by many authors [12, 28, 29]. Recently, a weighted version of $S$ was proposed by Gwet [33] for rating scales with ordinal categories. In this paper we presented various properties of a special case of this weighted version, denoted by $S_r$, where $r$ is a nonnegative real number. Bennett et al.'s [23] $S$ corresponds to $S_0$, while $S_1$ and $S_2$ are the versions of $S_r$ that are obtained by using, respectively, the linear and quadratic weighting schemes.

It was first studied how the different versions of $S_r$ are related. Theorem 2 shows that $S_r$ is increasing in $r$ if the average disagreement between the observers on adjacent categories is greater than the average disagreement on categories that are 2 steps apart, and if the latter is greater than the average disagreement on categories that are 3 steps apart, and so on. Hence, in this case, there is a simple relationship between the values of the special cases of $S_r$. It turns out that Theorem 2 is quite a strong result. First of all, the result involves all special cases of $S_r$, and there are uncountably many versions of $S_r$. Secondly, the sufficient condition holds for many data tables reported in this paper (see Tables 2 and 3). Since $S_r$ is usually increasing in $r$, its special cases are essentially measuring the same thing.

For the application of Cohen's kappa and weighted kappa, various authors have presented target values for evaluating the values of the kappa coefficients [44–47]. There is general consensus in the literature that uncritical application of such magnitude guidelines leads to practically questionable decisions. Warrens [48] argued that, since quadratic kappa produces values that are substantially higher than the values produced by Cohen's kappa, the same guidelines cannot be used for both coefficients. A similar argument applies here. Tables 2 and 3 show that coefficients $S_0$, $S_1$, and $S_2$ produce quite different values. Thus, despite the fact that the coefficients measure the same thing, they do this to a different extent. If one accepts the use of magnitude guidelines, different criteria need to be developed for the different special cases of $S_r$.

Finally, a number of results were presented that illustrate that many special cases of $S_r$ tend to produce values close to unity, regardless of the data at hand. This is especially the case when the power $r$ is high and the number of categories of the rating scale is large (5 or more). These results were only proved for agreement tables that are tridiagonal, but the high estimates in Tables 2 and 3 suggest that the results hold under more general conditions. The dependence of $S_r$ on the number of categories implies that different criteria need to be formulated depending on the number of categories. Developing different criteria for different coefficients and different numbers of categories seems an impossible task. Hence, coefficient $S_r$ is useless as a general agreement coefficient. It is advised to limit the application of weighted versions of Bennett et al.'s $S$ to one or two coefficients, for example, $S_0$ for rating scales with nominal categories and the linear version $S_1$ for rating scales with ordinal categories.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was done while the author was funded by the Netherlands Organisation for Scientific Research, Veni Project 451-11-026.

References

  1. M. A. Tanner and M. A. Young, "Modeling ordinal scale disagreement," Psychological Bulletin, vol. 98, no. 2, pp. 408–415, 1985.
  2. A. Agresti, "A model for agreement between ratings on an ordinal scale," Biometrics, vol. 44, no. 2, pp. 539–548, 1988.
  3. A. Agresti, Analysis of Ordinal Categorical Data, Wiley, Hoboken, NJ, USA, 2nd edition, 2010.
  4. M. P. Becker, "Using association models to analyse agreement data: two examples," Statistics in Medicine, vol. 8, no. 10, pp. 1199–1207, 1989.
  5. P. Graham and R. Jackson, "The analysis of ordinal agreement data: beyond weighted kappa," Journal of Clinical Epidemiology, vol. 46, no. 9, pp. 1055–1062, 1993.
  6. J. Cohen, "Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit," Psychological Bulletin, vol. 70, no. 4, pp. 213–220, 1968.
  7. M. Maclure and W. C. Willett, "Misinterpretation and misuse of the kappa statistic," The American Journal of Epidemiology, vol. 126, no. 2, pp. 161–169, 1987.
  8. J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, pp. 37–46, 1960.
  9. M. J. Warrens, "Cohen's kappa can always be increased and decreased by combining categories," Statistical Methodology, vol. 7, no. 6, pp. 673–677, 2010.
  10. L. M. Hsu and R. Field, "Interrater agreement measures: comments on kappan, Cohen's kappa, Scott's π and Aickin's α," Understanding Statistics, vol. 2, pp. 205–219, 2003.
  11. J. Sim and C. C. Wright, "The kappa statistic in reliability studies: use, interpretation, and sample size requirements," Physical Therapy, vol. 85, no. 3, pp. 257–268, 2005.
  12. R. L. Brennan and D. J. Prediger, "Coefficient kappa: some uses, misuses, and alternatives," Educational and Psychological Measurement, vol. 41, pp. 687–699, 1981.
  13. J. S. Uebersax, "Diversity of decision-making models and the measurement of interrater agreement," Psychological Bulletin, vol. 101, no. 1, pp. 140–146, 1987.
  14. A. R. Feinstein and D. V. Cicchetti, "High agreement but low kappa: I. The problems of two paradoxes," Journal of Clinical Epidemiology, vol. 43, no. 6, pp. 543–549, 1990.
  15. C. A. Lantz and E. Nebenzahl, "Behavior and interpretation of the κ statistic: resolution of the two paradoxes," Journal of Clinical Epidemiology, vol. 49, no. 4, pp. 431–434, 1996.
  16. J. de Mast and W. N. van Wieringen, "Measurement system analysis for categorical measurements: agreement and kappa-type indices," Journal of Quality Technology, vol. 39, pp. 191–202, 2007.
  17. J. de Mast, "Agreement and kappa-type indices," The American Statistician, vol. 61, no. 2, pp. 148–153, 2007.
  18. W. D. Thompson and S. D. Walter, "A reappraisal of the kappa coefficient," Journal of Clinical Epidemiology, vol. 41, no. 10, pp. 949–958, 1988.
  19. W. Vach, "The dependence of Cohen's kappa on the prevalence does not matter," Journal of Clinical Epidemiology, vol. 58, no. 7, pp. 655–661, 2005.
  20. A. von Eye and M. von Eye, "On the marginal dependency of Cohen's κ," European Psychologist, vol. 13, no. 4, pp. 305–315, 2008.
  21. H. Brenner and U. Kliebsch, "Dependence of weighted kappa coefficients on the number of categories," Epidemiology, vol. 7, no. 2, pp. 199–202, 1996.
  22. M. J. Warrens, "Some paradoxical results for the quadratically weighted kappa," Psychometrika, vol. 77, no. 2, pp. 315–323, 2012.
  23. E. M. Bennett, R. Alpert, and A. C. Goldstein, "Communications through limited-response questioning," Public Opinion Quarterly, vol. 18, no. 3, pp. 303–308, 1954.
  24. U. N. Umesh, R. A. Peterson, and M. T. Sauber, "Interjudge agreement and the maximum value of kappa," Educational and Psychological Measurement, vol. 49, pp. 835–850, 1989.
  25. G. J. Meyer, "Assessing reliability: critical corrections for a critical examination of the Rorschach comprehensive system," Psychological Assessment, vol. 9, no. 4, pp. 480–489, 1997.
  26. M. J. Warrens, "The effect of combining categories on Bennett, Alpert and Goldstein's S," Statistical Methodology, vol. 9, no. 3, pp. 341–352, 2012.
  27. J. J. Randolph, "Free-marginal multirater kappa (multirater κ free): an alternative to Fleiss' fixed-marginal multirater kappa," in Proceedings of the Joensuu Learning and Instruction Symposium, Joensuu, Finland, 2005.
  28. S. Janson and J. Vegelius, "On generalizations of the G index and the Phi coefficient to nominal scales," Multivariate Behavioral Research, vol. 14, no. 2, pp. 255–269, 1979.
  29. C. L. Janes, "An extension of the random error coefficient of agreement to n×n tables," The British Journal of Psychiatry, vol. 134, no. 6, pp. 617–619, 1979.
  30. J. W. Holley and J. P. Guilford, "A note on the G index of agreement," Educational and Psychological Measurement, vol. 24, no. 4, pp. 749–753, 1964.
  31. A. E. Maxwell, "Coefficients of agreement between observers and their interpretation," British Journal of Psychiatry, vol. 116, pp. 651–655, 1977.
  32. K. Krippendorff, "Association, agreement, and equity," Quality and Quantity, vol. 21, no. 2, pp. 109–123, 1987.
  33. K. L. Gwet, Handbook of Inter-Rater Reliability, Advanced Analytics, Gaithersburg, Md, USA, 2012.
  34. D. Cicchetti and T. Allison, "A new procedure for assessing reliability of scoring EEG sleep recordings," The American Journal of EEG Technology, vol. 11, pp. 101–110, 1971.
  35. S. Vanbelle and A. Albert, "A note on the linearly weighted kappa coefficient for ordinal scales," Statistical Methodology, vol. 6, no. 2, pp. 157–163, 2009.
  36. M. J. Warrens, "Cohen's linearly weighted kappa is a weighted average," Advances in Data Analysis and Classification, vol. 6, no. 1, pp. 67–79, 2012.
  37. J. L. Fleiss and J. Cohen, "The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability," Educational and Psychological Measurement, vol. 33, pp. 613–619, 1973.
  38. C. Schuster, "A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales," Educational and Psychological Measurement, vol. 64, no. 2, pp. 243–253, 2004.
  39. A. Agresti, Categorical Data Analysis, John Wiley & Sons, 1990.
  40. Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, Mass, USA, 1975.
  41. N. D. Holmquist, C. A. McMahan, and E. O. Williams, "Variability in classification of carcinoma in situ of the uterine cervix," Obstetrical & Gynecological Survey, vol. 23, pp. 580–585, 1967.
  42. J. R. Landis and G. G. Koch, "An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers," Biometrics, vol. 33, pp. 363–374, 1977.
  43. A. F. Beardon, "Sums of powers of integers," The American Mathematical Monthly, vol. 103, no. 3, pp. 201–213, 1996.
  44. J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, vol. 33, pp. 159–174, 1977.
  45. D. V. Cicchetti and S. S. Sparrow, "Developing criteria for establishing interrater reliability of specific items: applications to assessment of adaptive behavior," American Journal of Mental Deficiency, vol. 86, no. 2, pp. 127–137, 1981.
  46. P. E. Crewson, "Reader agreement studies," American Journal of Roentgenology, vol. 184, no. 5, pp. 1391–1397, 2005.
  47. J. L. Fleiss, B. Levin, and M. C. Paik, Statistical Methods for Rates and Proportions, Wiley-Interscience, New York, NY, USA, 3rd edition, 2003.
  48. M. J. Warrens, "Conditional inequalities between Cohen's kappa and weighted kappas," Statistical Methodology, vol. 10, pp. 14–22, 2013.
  49. C. S. Martin, N. K. Pollock, O. G. Bukstein, and K. G. Lynch, "Inter-rater reliability of the SCID alcohol and substance use disorders section among adolescents," Drug and Alcohol Dependence, vol. 59, no. 2, pp. 173–176, 2000.
  50. J. S. Simonoff, Analyzing Categorical Data, Springer, New York, NY, USA, 2003.
  51. S. I. Anderson, A. M. Housley, P. A. Jones, J. Slattery, and J. D. Miller, "Glasgow outcome scale: an inter-rater reliability study," Brain Injury, vol. 7, no. 4, pp. 309–317, 1993.
  52. D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski, A Handbook of Small Data Sets, Chapman & Hall, London, UK, 1994.
  53. M. Némethy, L. Paroli, P. G. Williams-Russo, and T. J. J. Blanck, "Assessing sedation with regional anesthesia: inter-rater agreement on a modified Wilson sedation scale," Anesthesia and Analgesia, vol. 94, no. 3, pp. 723–728, 2002.
  54. J. M. Seddon, C. R. Sahagian, R. J. Glynn et al., "Evaluation of an iris color classification system," Investigative Ophthalmology and Visual Science, vol. 31, no. 8, pp. 1592–1598, 1990.
  55. R. W. Bohannon and M. B. Smith, "Interrater reliability of a modified Ashworth scale of muscle spasticity," Physical Therapy, vol. 67, no. 2, pp. 206–207, 1987.
  56. V. A. J. Maria and R. M. M. Victorino, "Development and validation of a clinical scale for the diagnosis of drug-induced hepatitis," Hepatology, vol. 26, no. 3, pp. 664–669, 1997.