Research Article | Open Access
Ordering Properties of the First Eigenvector of Certain Similarity Matrices
It is shown for coefficient matrices of Russell-Rao coefficients and two asymmetric Dice coefficients that ordinal information on a latent variable model can be obtained from the eigenvector corresponding to the largest eigenvalue.
1. Introduction
Similarity coefficients play an important role in statistics and data analysis. A similarity coefficient is a measure of resemblance or association of two data vectors, such as score patterns, variables, or items. For example, in ecological biology similarity coefficients are used for measuring the degree of coexistence between two species types over different locations. In many research studies the data consist of binary vectors: presence or absence of disease; presence or absence of species characteristics; yes or no answers in questionnaires; pass or fail in high-stakes testing. For expressing the degree of resemblance of two binary vectors in a single number, a variety of similarity coefficients have been proposed [1–3]. Examples are the Jaccard coefficient [4], the Russell-Rao coefficient [5], the Dice coefficient [6], and the simple matching coefficient [7, 8]. The choice of a coefficient has to be considered in the context of the data-analytic study of which it is a part [9]. Because there are so many similarity coefficients for binary data to choose from, it is important that the different coefficients and their properties are better understood.
Instead of studying properties of individual coefficients [10–13] one may also study properties of coefficient matrices [14]. Coefficient matrices are used as input in various techniques of multivariate data analysis, including factor or component analysis [15, 16], hierarchical cluster analysis, and techniques in classification and dissimilarity analysis [17]. Moreover, exploratory data-analytic methods such as principal coordinates analysis and (multiple) correspondence analysis can be defined as eigendecompositions of certain coefficient matrices [15, 16, 18]. It would be interesting to know what information, if any, is reflected in the eigenvectors of a coefficient matrix that is based on a similarity coefficient for binary vectors.
In this paper we show for several coefficient matrices that ordinal information on latent variable models can be obtained from the eigenvector corresponding to the largest eigenvalue. It is thus possible to uncover meaningful orderings of various models by using eigenvectors. The results are first of all of theoretical interest. They show that some coefficient matrices have more interesting eigenvectors than others. Coefficient matrices based on some coefficients may thus lead to more interesting data-analytic solutions than matrices corresponding to other coefficients. Furthermore, potentially, the results can enhance the interpretation of a data analysis that uses these coefficient matrices as input.
The paper is organized as follows. Notation and two latent variable models are introduced in the next section. In Section 3 several ordering properties of eigenvectors corresponding to a largest eigenvalue are presented. An illustration of the results is presented in Section 4. Section 5 contains a conclusion.
2. Latent Variable Models
Suppose the data consist of $m$ binary vectors of length $n$. It may be assumed that the scores in the binary vectors are realizations of a latent variable model. In this section we introduce two models in the context of nonparametric item response theory [19, 20]. In item response theory the vectors are often viewed as items that, for instance, contain the responses (pass, fail) of $n$ subjects to a high-stakes test. The items will be indexed by $j$ and $k$.
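Let $\theta$ denote a one-dimensional latent variable and let $f(\theta)$ be its probability density function. Let $p_j(\theta)$ denote the response function corresponding to the response 1 on item $j$. The unconditional probability of a response 1 on item $j$ is then given by
$$p_j = \int_{\mathbb{R}} p_j(\theta)\, f(\theta)\, d\theta. \tag{1}$$
Next, assume local independence; that is, conditionally on $\theta$ the responses of a subject on the $m$ items are stochastically independent. The joint probability of a response 1 on items $j$ and $k$ for a value of $\theta$ is then given by $p_j(\theta)\, p_k(\theta)$. The corresponding unconditional probability is
$$p_{jk} = \int_{\mathbb{R}} p_j(\theta)\, p_k(\theta)\, f(\theta)\, d\theta. \tag{2}$$
Throughout the paper we assume that $p_{jk} > 0$ for all $j$ and $k$.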
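Next, we define the latent variable models. Both models have monotone response functions and are frequently applied in the context of measuring ability. The first model is characterized by requirements (3) and (4). The first requirement is that the response functions $p_j(\theta)$ are monotonically increasing on $\mathbb{R}$; that is,
$$p_j(\theta_1) \le p_j(\theta_2) \quad \text{for } \theta_1 < \theta_2. \tag{3}$$
The second requirement is that the items can be ordered such that the response functions $p_j(\theta)$ are nonintersecting; that is,
$$p_j(\theta) \ge p_k(\theta) \quad \text{for } j < k. \tag{4}$$
The model that assumes (3) and (4), together with the assumptions of local independence and a single latent variable, is called the double monotonicity model in nonparametric item response theory [19, 20]. A well-known result is that if the double monotonicity model holds, then the items can be ordered such that the unconditional probabilities $p_j$ of a response 1 satisfy
$$p_j \ge p_k \quad \text{for } j < k, \tag{5}$$
and the joint probabilities $p_{jk}$ satisfy
$$p_{ij} \ge p_{ik} \quad \text{for } j < k \text{ and } i \ne j, k \tag{6}$$
[19, 20].

The second model is characterized by requirements (3) and (7). The response functions may satisfy various orders of total positivity [21]. If the functions $p_j(\theta)$ are totally positive of order 2, the items can be ordered such that
$$p_j(\theta_1)\, p_k(\theta_2) - p_j(\theta_2)\, p_k(\theta_1) \ge 0 \tag{7}$$
holds for $\theta_1 < \theta_2$ and $j < k$. Schriever [22] proved the following result for a set of response functions that are both monotonically increasing and satisfy total positivity of order 2. If the vectors are ordered such that (3) and (7) hold, then
$$\frac{p_{ij}}{p_j} \le \frac{p_{ik}}{p_k} \tag{8}$$
holds for $j < k$ and $i \ne j, k$.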
We conclude this section with a parametric example that satisfies requirements (3), (4), and (7). A well-known model from the field of item response theory is the Rasch [23] model. A response function of this one-parameter logistic model is given by
$$p_j(\theta) = \frac{\exp(\theta - b_j)}{1 + \exp(\theta - b_j)}, \tag{9}$$
where $b_j$ is a location parameter. In the context of item response theory the parameter $b_j$ is usually called a difficulty parameter [19, 20]. The functions $p_j(\theta)$ in (9) form a location family.
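To make the model concrete, the following Python sketch approximates the unconditional and joint probabilities of a response 1 by numerical integration for a small Rasch example, and then checks the orderings that the models in this section are said to imply. The item difficulties, the standard normal latent density, the integration grid, and all variable names are assumptions chosen for illustration; none of them are taken from the paper. The direction of each check (proportions decreasing from the easiest to the hardest item, joint probabilities decreasing and conditional probabilities increasing down each column) follows the description of the matrices in Section 4.

```python
import numpy as np

def rasch(theta, b):
    """One-parameter logistic response function with location (difficulty) b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Hypothetical difficulties, ordered from the easiest to the hardest item.
difficulties = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])

# Assumed latent density: standard normal, approximated on a fine grid.
theta = np.linspace(-8.0, 8.0, 4001)
weights = np.exp(-0.5 * theta**2)
weights /= weights.sum()              # normalised quadrature weights

# Response functions on the grid: shape (items, grid points).
P = rasch(theta[None, :], difficulties[:, None])

# Unconditional probabilities and joint probabilities under local independence.
p = P @ weights
p_joint = (P * weights) @ P.T
np.fill_diagonal(p_joint, p)          # diagonal holds the item proportions

# Conditional probabilities: entry (j, i) is p_joint[j, i] / p[j].
cond = p_joint / p[:, None]

def offdiag_columns(M):
    """Return each column of M with its diagonal element removed."""
    return [np.delete(M[:, i], i) for i in range(M.shape[0])]

holds_5 = bool(np.all(np.diff(p) <= 0))
holds_6 = all(np.all(np.diff(col) <= 0) for col in offdiag_columns(p_joint))
holds_8 = all(np.all(np.diff(col) >= 0) for col in offdiag_columns(cond))
print(holds_5, holds_6, holds_8)
```

Replacing the grid weights with another positive weight vector is a quick way to see that the orderings do not depend on the particular latent density assumed here.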
3. Ordering Properties
In this section we present ordering properties for three coefficient matrices. The coefficient matrices of size are An element of the matrix is a Russell-Rao coefficient for two binary vectors and [5, 10]. Some data-analytic properties of the matrix are discussed in Warrens [14]. The elements of the matrices and are conditional probabilities discussed and applied in Dice [6]. The harmonic mean of the two conditional probabilities is equal to the Dice coefficient [6]. Matrix is also called the conditional adjacency matrix in Post and Snijders [24].
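As a rough illustration of how such matrices could be computed from data, the sketch below builds the Russell-Rao matrix and two matrices of conditional proportions from a small binary data matrix, and confirms that the harmonic mean of the two conditional proportions equals the Dice coefficient. The toy data and the names `russell_rao`, `cond_row`, and `cond_col` are hypothetical; no claim is made here about which of the two conditional matrices corresponds to which matrix in the text.

```python
import numpy as np

# Hypothetical toy data: rows are subjects, columns are binary items.
X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
])
n = X.shape[0]

# Russell-Rao matrix: proportion of subjects scoring 1 on both items.
# Its diagonal holds the proportion of 1s on each item.
russell_rao = (X.T @ X) / n
p = np.diag(russell_rao)

# Two asymmetric matrices of conditional proportions
# (all joint proportions are positive in this toy example).
cond_row = russell_rao / p[:, None]   # entry (j, k): joint proportion / proportion of item j
cond_col = russell_rao / p[None, :]   # entry (j, k): joint proportion / proportion of item k

# Dice coefficient, directly and as the harmonic mean of the two conditionals.
dice = 2 * russell_rao / (p[:, None] + p[None, :])
harmonic = 2 * cond_row * cond_col / (cond_row + cond_col)

print(np.allclose(dice, harmonic))
```

The two conditional matrices are transposes of each other, which is why a result for one of them can be carried over to the other by transposition.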
A specific result that will be used in the proofs of Theorems 2, 3, and 4 below is the Perron-Frobenius theorem [25, 26]. More precisely, only the following weaker version of the Perron-Frobenius theorem will be used.
Lemma 1. If a square matrix has strictly positive elements, then the eigenvector corresponding to its largest eigenvalue has strictly positive elements.
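Lemma 1 can be checked numerically on an example. The sketch below draws a hypothetical, non-symmetric matrix with strictly positive entries and inspects the eigenvector belonging to its largest eigenvalue; the matrix, the random seed, and the variable names are assumptions made for illustration only. Because an eigenvector is determined only up to a scalar multiple, it is rescaled so that its largest element equals 1 before the sign check.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(5, 5))   # hypothetical strictly positive matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
lead = np.argmax(eigenvalues.real)        # index of the largest eigenvalue
largest = eigenvalues[lead].real

# Leading eigenvector, rescaled so that its largest element is 1.
v = eigenvectors[:, lead].real
v = v / v[np.argmax(np.abs(v))]

print(np.all(v > 0))                      # all elements strictly positive
print(np.allclose(A @ v, largest * v))    # v is an eigenvector for the largest eigenvalue
```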
In the proofs of Theorems 2, 3, and 4 we use certain special matrices. Let denote the upper triangular matrix of size () with unit elements on and above the diagonal and all other elements zero. Its inverse is the matrix with unit elements on the diagonal and with elements $-1$ adjacent and above the diagonal. Examples of and of size are Furthermore, let be the identity matrix of size , and let denote the diagonal block matrix of size with diagonal elements and . Examples of and of size are

We first consider the matrix . Let be the eigenvector corresponding to the largest eigenvalue of the matrix . Theorem 2 shows that if the binary vectors can be ordered such that (3) and (4) hold, then this ordering is reflected in the corresponding elements of .
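The special matrices described above are easy to construct. The following sketch builds an upper triangular matrix of ones, verifies that its inverse has unit elements on the diagonal and elements -1 directly above the diagonal, and assembles a block-diagonal matrix from a triangular block and an identity block. The function names `upper_ones` and `block_transform` and the sizes 3 and 5 are hypothetical choices, not the paper's notation or examples.

```python
import numpy as np

def upper_ones(k):
    """Upper triangular k x k matrix with ones on and above the diagonal."""
    return np.triu(np.ones((k, k)))

def block_transform(k, m):
    """Block-diagonal m x m matrix: upper_ones(k) top-left, identity bottom-right."""
    T = np.eye(m)
    T[:k, :k] = upper_ones(k)
    return T

S = upper_ones(3)
S_inv = np.linalg.inv(S)

# Expected inverse: ones on the diagonal, -1 on the first superdiagonal.
expected_inv = np.eye(3) - np.eye(3, k=1)

T = block_transform(3, 5)
print(S_inv)
print(T)
```

Multiplying a vector by the triangular block replaces each of its first elements by a tail sum, so a vector with strictly positive elements is mapped to one whose first elements are strictly decreasing.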
Theorem 2. Suppose that of the vectors, which without loss of generality can be taken as the first , can be ordered such that (3) and (4) hold. Then the elements of of corresponding to these vectors satisfy .
Proof. Since is nonsingular, is an eigenvector of corresponding to if and only if is an eigenvector of corresponding to . Under the conditions of the theorem, the elements of are positive and the elements of are strictly positive. Application of Lemma 1 then yields that the eigenvector of (or ) has strictly positive elements. The assertion then follows from the identity .
In the remainder of the proof we show that has positive elements and has strictly positive elements. The matrix has elements for and and for and . Under the conditions of the theorem properties (5) and (6) hold for the first items. By (6), we have , and the matrix has positive elements except for for . However by (5), we have and it follows that for . Hence, the matrix has positive elements. Moreover, because the elements in the last row and last column of are strictly positive, it follows that the elements of are strictly positive.
An analogous result holds for the matrix . Let be the eigenvector corresponding to the largest eigenvalue of the matrix . Theorem 3 shows that if the binary vectors can be ordered such that (3) and (4) hold, then this ordering is reflected in the corresponding elements of of .
Theorem 3. Suppose that of the vectors, which without loss of generality can be taken as the first , can be ordered such that (3) and (4) hold. Then the elements of of corresponding to these vectors satisfy .
Proof. The proof is similar to the proof of Theorem 2. The matrix has elements for and and for and . Under the conditions of the theorem properties (5) and (6) hold for the first items. By (6), we have , and the matrix has positive elements except for for . But by (5), we have , and it follows that for
Finally, Theorem 4 below presents an ordering property of the matrix . The ordering holds for a slightly stronger model than the one considered in Theorems 2 and 3. Theorem 4 shows that if the binary vectors can be ordered such that (3), (4), and (7) hold, then this ordering is reflected in the corresponding elements of of .
Theorem 4. Suppose that of the vectors, which without loss of generality can be taken as the first , can be ordered such that (3), (4), and (7) hold. Then the elements of of corresponding to these vectors satisfy .
Proof. The proof is similar to the proof of Theorems 2 and 3. Let denote the transpose of . The matrix has elements for and and for and . Under the conditions of the theorem properties (5) and (8) hold. By (8), we have , and the matrix has positive elements except for for . However, by (5), we have , and it follows that for .
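The similarity-transformation argument used in these proofs can be traced numerically. The sketch below is one reading of that argument for a Russell-Rao matrix, under assumptions that are not taken from the paper: a Rasch model with hypothetical difficulties ordered from easiest to hardest, a standard normal latent density approximated on a grid, and all items included in the ordered set. It checks that the transformed matrix has strictly positive elements, that the transformed eigenvector is strictly positive, and that the original leading eigenvector is therefore strictly decreasing.

```python
import numpy as np

# Hypothetical Rasch difficulties, easiest item first.
b = np.array([-1.5, -0.5, 0.0, 0.8, 1.6])

# Assumed standard normal latent density on a grid, as normalised weights.
theta = np.linspace(-8.0, 8.0, 4001)
w = np.exp(-0.5 * theta**2)
w /= w.sum()
P = 1.0 / (1.0 + np.exp(-(theta[None, :] - b[:, None])))

# Model-implied Russell-Rao matrix: joint probabilities, item proportions on the diagonal.
A = (P * w) @ P.T
np.fill_diagonal(A, P @ w)

m = len(b)
T = np.triu(np.ones((m, m)))          # ones on and above the diagonal
B = np.linalg.inv(T) @ A @ T          # similar to A, hence the same eigenvalues

# Leading eigenvector of the symmetric matrix A, with the sign fixed.
eigenvalues, eigenvectors = np.linalg.eigh(A)
u = eigenvectors[:, -1]
u = u if u.sum() > 0 else -u

# v solves T v = u, so v is the matching eigenvector of B.
v = np.linalg.solve(T, u)

print(np.all(B > 0), np.all(v > 0), np.all(np.diff(u) < 0))
```

Since each element of `u` is a tail sum of the strictly positive elements of `v`, the ordering of the items shows up as a strictly decreasing leading eigenvector in this example.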
4. An Illustration
In this section we consider an example from educational testing to illustrate some of the results from Section 3. The data consist of responses of 1000 individuals to five items of the LSAT (Law School Admission Test). The test was designed to measure a one-dimensional latent variable. The example is part of a data set given by Bock and Lieberman [27]. The data set is distributed with the R package “ltm” written by Rizopoulos [28].
Requirements (3), (4), and (7) cannot be checked directly for real life data. However, it can be shown that the Rasch model in (9) fits these data quite well. Using subroutines from the “ltm” package we fitted the Rasch model and the so-called two-parameter logistic model [19, 20]. In the Rasch model the items are allowed to differ in location. In the more general two-parameter model the items are also allowed to differ in slope. For these data the two-parameter model has four additional parameters. The log likelihoods of the models are and , respectively, and the corresponding likelihood ratio test has a value of . Thus, the extra slope parameters are statistically not warranted.
Requirements (3), (4), and (7) can also be studied by verifying whether conditions (5), (6), and (8) hold. The proportions of correct responses are 0.924, 0.709, 0.553, 0.763, and 0.870 for items 1 to 5, respectively. For verifying conditions (6) and (8), we suppose that the items are ordered on the proportions of correct responses, from the easiest to the hardest item (1, 5, 4, 2, and 3). In other words, in the following we assume that the items are ordered such that condition (5) holds.
To study condition (6) we may inspect the matrix of Russell-Rao coefficients. For the LSAT data this matrix is given by The elements on the main diagonal are the proportions of correct responses. If we ignore the elements on the main diagonal it can be verified that the other four elements in each column of are strictly decreasing. Hence, condition (6) holds.
Since conditions (5) and (6) hold for all five LSAT items it follows from Theorem 3 that the ordering of the items is reflected in the eigenvector corresponding to the largest eigenvalue of . The largest eigenvalue is and the associated eigenvector is given by . The item ordering is thus reflected in the elements of the eigenvector.
To verify whether condition (8) holds we may inspect the matrix of Dice coefficients. For the LSAT data this matrix is given by If we ignore the elements on the main diagonal it can be verified that the remaining four elements in the first, third, and fourth columns of are strictly increasing. Furthermore, the elements in the second and fifth columns are roughly increasing. In both columns there is one anomaly. We may conclude that condition (8) holds approximately.
If the five LSAT items satisfy conditions (5) and (8) it follows from Theorem 4 that the ordering of the items is reflected in the eigenvector corresponding to the largest eigenvalue of . The largest eigenvalue is and the associated eigenvector is given by . The item ordering is thus reflected in the elements of the eigenvector.
5. Conclusion
Similarity coefficients for binary vectors are frequently used in statistics for analyzing the structure between objects. Examples that are commonly used are the Russell-Rao coefficient [5] and the Dice coefficient [6]. Since the choice of a coefficient depends on the context of the data-analytic study, it is important that the different coefficients and their properties are well understood.
In this paper we showed that ordinal information on latent variable models is reflected in the eigenvector corresponding to the largest eigenvalue of the coefficient matrices with Russell-Rao coefficients (Theorem 3) and two asymmetric coefficients used in Dice [6] (Theorems 2 and 4). For other well-known coefficients, such as the Jaccard coefficient [4] and the simple matching coefficient, similar ordering properties could not be found. The results may indicate that the Russell-Rao and Dice coefficients lead to more clearly interpretable output if used as input in clustering methods or principal coordinates analysis. However, more research on this topic is needed.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] A. N. Albatineh, M. Niewiadomska-Bugaj, and D. Mihalko, “On similarity indices and correction for chance agreement,” Journal of Classification, vol. 23, no. 2, pp. 301–313, 2006.
[2] F. B. Baulieu, “A classification of presence/absence based dissimilarity coefficients,” Journal of Classification, vol. 6, no. 2, pp. 233–246, 1989.
[3] M. J. Warrens, “On association coefficients for 2 × 2 tables and properties that do not depend on the marginal distributions,” Psychometrika, vol. 73, no. 4, pp. 777–789, 2008.
[4] P. Jaccard, “The distribution of the flora in the Alpine zone,” The New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.
[5] P. F. Russell and T. R. Rao, “On habitat and association of species of Anopheline larvae in South-Eastern Madras,” Journal of Malaria Institute India, vol. 3, pp. 153–178, 1940.
[6] L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945.
[7] R. R. Sokal and C. D. Michener, “A statistical method for evaluating systematic relationships,” University of Kansas Science Bulletin, vol. 38, pp. 1409–1438, 1958.
[8] M. J. Warrens, “On similarity coefficients for 2 × 2 tables and correction for chance,” Psychometrika, vol. 73, no. 3, pp. 487–502, 2008.
[9] J. C. Gower and P. Legendre, “Metric and Euclidean properties of dissimilarity coefficients,” Journal of Classification, vol. 3, no. 1, pp. 5–48, 1986.
[10] M. J. Warrens, “Bounds of resemblance measures for binary (presence/absence) variables,” Journal of Classification, vol. 25, no. 2, pp. 195–208, 2008.
[11] M. J. Warrens, “On the indeterminacy of resemblance measures for binary (presence/absence) data,” Journal of Classification, vol. 25, no. 1, pp. 125–136, 2008.
[12] M. J. Warrens, “Corrected Zegers-ten Berge coefficients are special cases of Cohen's weighted kappa,” Journal of Classification, vol. 31, no. 2, pp. 179–193, 2014.
[13] M. J. Warrens, “Properties of the quantity disagreement and the allocation disagreement,” International Journal of Remote Sensing, vol. 36, pp. 1439–1446, 2015.
[14] M. J. Warrens, “On Robinsonian dissimilarities, the consecutive ones property and latent variable models,” Advances in Data Analysis and Classification, vol. 3, no. 2, pp. 169–184, 2009.
[15] J. C. Gower, “Some distance properties of latent root and vector methods used in multivariate analysis,” Biometrika, vol. 53, pp. 325–338, 1966.
[16] M. J. Greenacre, Theory and Applications of Correspondence Analysis, Academic Press, New York, NY, USA, 1984.
[17] B. Mirkin, Mathematical Classification and Clustering, Kluwer, Dordrecht, The Netherlands, 1984.
[18] A. Gifi, Nonlinear Multivariate Analysis, Wiley, Chichester, UK, 1990.
[19] W. J. Van der Linden and R. K. Hambleton, Handbook of Modern Item Response Theory, Springer, Berlin, Germany, 1997.
[20] K. Sijtsma and I. W. Molenaar, Introduction to Nonparametric Item Response Theory, SAGE Publications, Thousand Oaks, Calif, USA, 2002.
[21] S. Karlin, Total Positivity, Stanford University Press, Stanford, Calif, USA, 1968.
[22] B. F. Schriever, “Multiple correspondence analysis and ordered latent structure models,” Kwantitatieve Methoden, vol. 21, pp. 117–131, 1986.
[23] G. Rasch, Probabilistic Models for Some Intelligence and Attainment Tests, Studies in Mathematical Psychology, Danish Institute for Educational Research, Copenhagen, Denmark, 1984.
[24] W. J. Post and T. A. B. Snijders, “Nonparametric unfolding models for dichotomous data,” Methodika, vol. 7, pp. 130–156, 1993.
[25] F. R. Gantmacher, Matrix Theory, Chelsea, New York, NY, USA, 1977.
[26] C. R. Rao, Linear Statistical Inference and Its Applications, Wiley, New York, NY, USA, 1973.
[27] R. D. Bock and M. Lieberman, “Fitting a response model for n dichotomously scored items,” Psychometrika, vol. 35, no. 2, pp. 179–197, 1970.
[28] D. Rizopoulos, “ltm: an R package for latent variable modeling and item response theory analyses,” Journal of Statistical Software, vol. 17, no. 5, pp. 1–25, 2006.
Copyright © 2015 Matthijs J. Warrens and Alexandra de Raadt. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.