Abstract

Ordinal data are the most frequently encountered type of data in the social sciences. Many statistical methods can be used to process such data. One common method is to assign scores to the data, convert them into interval data, and then perform further statistical analysis. Several authors have recently developed methods for assigning scores to ordered categorical data. This paper proposes a scoring system for an ordinal categorical variable based on an underlying continuous latent distribution and interprets it using three case studies. The results show that the proposed scoring system performs well for skewed ordinal categorical data.

1. Introduction

Ordinal data often arise in sample surveys and experimental designs, where interval data are difficult to obtain. The obtained data are usually “categorical data” or “ordinal categorical data,” collected, for example, on a scale of “strongly agree,” “agree,” “have no opinion,” “disagree,” and “strongly disagree.” Because most traditional statistical methods assume interval data, researchers often first assign scores to these ordinal categorical data, convert them into interval data, and then conduct further statistical analyses, such as factor analysis, principal component analysis, and discriminant analysis.

One method is to assign scores to the ordinal categories subjectively (e.g., 5 for strongly agree, 4 for agree, 3 for no opinion, 2 for disagree, and 1 for strongly disagree). However, the original scale is ordinal and carries no concept of distance. After scores from 5 to 1 are assigned, the scale becomes an interval scale and thus acquires a concept of distance: the distance between strongly agree (5) and no opinion (3) is the same as that between agree (4) and disagree (2), which overstates the information provided by the data. Other score-assignment methods derive the scores objectively from the data. These methods include the Ridit score relative to an identified distribution [1], the Conditional Median under a given cumulative distribution function [2], Conditional Mean scoring functions based on the underlying distribution [3], and the normal scores [4]. In many applications, latent variable models for ordinal categorical data require a Bayesian model to estimate the parameters [5]. Two further score-assignment methods arise in testing for ordered tables. To address the sensitivity of the linear rank test to the choice of scores, Kimeldorf et al. suggested min-max scoring [6] and Gautam et al. suggested the iso-chi-square approach for ordered tables [7]. However, this approach can be intricate and involves complex computations under its primary assumption.

This paper aims to provide an alternative scoring system based on an underlying continuous latent variable to determine the scores of ordinal categorical data and explain the results by using three examples. The remainder of this paper is organized as follows: Section 2 introduces the scoring system and relevant theories; Section 3 describes how scores are assigned to ordinal categorical data, the main theorem, and the relevant corollary; Section 4 gives three examples to explain the effects of scoring results with the formula of Theorem 1; and lastly, Section 5 offers a conclusion and provides suggestions on score assignment for ordinal categorical data. Some property details are provided in the Appendix.

2. The Scoring System

For an ordinal categorical random variable $Y$ with the probabilities $p_1, \ldots, p_k$, let $k$ denote the number of categories. A scoring system is a systematic method for assigning numerical values to ordinal categories [8].

The scores are computed from $p_1, \ldots, p_k$. Let $s_j$ be the score assigned to the $j$th category, and let $S = (s_1, \ldots, s_k)$ denote the scoring system determined by the scoring functions $s_j = s_j(p_1, \ldots, p_k)$.

For ordinal categorical data, Bross introduced a scoring system, which he called Ridit scores [1]. Bross defined the Ridit score for category $j$ by $r_j = \sum_{i<j} p_i + \frac{1}{2} p_j$, where $p_i$ is the probability of category $i$. Brockett defined a Conditional Median Score under $F$ [2], where $F$ denotes some given cumulative distribution function selected either in accordance with some theoretical latent distribution of the categorical variable under study or in accordance with the desirable properties for the planned method of analysis. For example, if the categorical variable represents income levels, $F$ may represent a Pareto family distribution function. Let represent the scores assigned and let be the cumulative distribution function corresponding to this scoring system (i.e., ). Brockett found a scoring system , , , that satisfies the distance and minimizes , where is the Ridit score for the category (Figure 1).
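As a minimal sketch of the Ridit score, the snippet below computes it from category counts, assuming the usual definition: the proportion of observations in lower categories plus half the proportion in the category itself. The function name `ridit_scores` and the counts are illustrative and do not come from the paper.

```python
from itertools import accumulate


def ridit_scores(counts):
    """Ridit score per category: P(lower categories) + 0.5 * P(this category)."""
    total = sum(counts)
    probs = [c / total for c in counts]
    # Cumulative probability strictly below each category: 0, p1, p1 + p2, ...
    below = [0.0] + list(accumulate(probs))[:-1]
    return [b + 0.5 * p for b, p in zip(below, probs)]


# Hypothetical counts for three ordered categories (illustrative only).
counts = [1, 1, 2]
scores = ridit_scores(counts)
print(scores)  # [0.125, 0.375, 0.75]
```

Under this definition the probability-weighted mean of the Ridit scores is always 0.5, which is a quick sanity check on any implementation.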

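Fielding suggested a scoring function based on the conditional mean of a category, assuming that the data are generated by an assumed distributional form [3]. For a latent variable $Z$ with density $f$, and with category $j$ corresponding to the interval $(c_{j-1}, c_j]$, the conditional mean score is

$$s_j = E(Z \mid c_{j-1} < Z \le c_j) = \frac{\int_{c_{j-1}}^{c_j} z f(z)\,dz}{\int_{c_{j-1}}^{c_j} f(z)\,dz}.$$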

The next section introduces a scoring system based on a given cumulative distribution function that satisfies certain conditions.

3. Scoring Procedure for Ordinal Categorical Data

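For an ordinal categorical random variable $Y$ with the probabilities $p_1, \ldots, p_k$, let $k$ denote the number of categories. Let an unobserved continuous variable underlie $Y$ [9], and let $Z$ denote the underlying latent variable. Suppose that $-\infty = c_0 < c_1 < \cdots < c_k = \infty$ are cut points of the continuous scale such that the observed response satisfies $Y = s_j$ if and only if $c_{j-1} < Z \le c_j$, for $j = 1, \ldots, k$.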

In other words, falls in assigned score when the latent variable falls in the th interval of values (Figure 2). This section introduces a scoring system for based on the underlying latent variable of satisfying .

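Theorem 1. Let $Y$ be an ordinal categorical response variable with the probabilities $p_1, \ldots, p_k$, where $k$ denotes the number of categories. Assume that $Z$ is a continuous underlying distribution of $Y$ with distribution function $F$ and probability density function $f$, and assume that $E(Z)$ exists. Suppose that $-\infty = c_0 < c_1 < \cdots < c_k = \infty$ are cut points of $Z$. For each $j$, let $s_j$ denote the score of $Y$, with $Y = s_j$ if and only if $c_{j-1} < Z \le c_j$. If one takes $s_j = E(Z \mid c_{j-1} < Z \le c_j)$, then one has

$$\text{(a)}\;\; c_j = F^{-1}\Big(\sum_{i=1}^{j} p_i\Big), \qquad \text{(b)}\;\; s_j = \frac{1}{p_j}\int_{c_{j-1}}^{c_j} z f(z)\,dz. \tag{3}$$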

Corollary 2. If the underlying distribution is $U(0, 1)$, then $s_j = r_j$, where $r_j = \sum_{i<j} p_i + \frac{1}{2} p_j$ is the Ridit score.
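The following sketch illustrates how a latent-variable score could be computed numerically and checked against the Ridit score. It assumes that the score of a category is the conditional mean of the latent variable over that category's interval, written in probability space as the average of the quantile function over the category's range of cumulative probability. The function name, the probabilities, and the choice of a midpoint rule are illustrative assumptions, not the paper's implementation.

```python
import math


def conditional_mean_scores(probs, quantile, steps=2000):
    """Score each category by the mean of the latent variable inside its
    interval: (1 / p_j) * integral of quantile(u) over (F_{j-1}, F_j)."""
    scores, lower = [], 0.0
    for p in probs:
        width = p / steps
        # Midpoint rule: never evaluates quantile at u = 0 or u = 1.
        area = sum(quantile(lower + (i + 0.5) * width) for i in range(steps)) * width
        scores.append(area / p)
        lower += p
    return scores


probs = [0.2, 0.5, 0.3]  # hypothetical category probabilities

# Uniform(0, 1) latent variable: quantile(u) = u, so the scores are Ridit scores.
uniform_scores = conditional_mean_scores(probs, lambda u: u)

# Exponential(1) latent variable: quantile(u) = -log(1 - u).
exponential_scores = conditional_mean_scores(probs, lambda u: -math.log(1.0 - u))

print(uniform_scores)
print(exponential_scores)
```

With a uniform latent variable the midpoint rule is exact up to rounding, so the first result should agree with the Ridit scores 0.1, 0.45, and 0.85 for these probabilities. Every category probability must be positive, since the function divides by it.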

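Corollary 3. Let $r_j$ be the Ridit score of category $j$; then one has

$$s_j = \frac{1}{p_j}\int_{r_j - p_j/2}^{r_j + p_j/2} F^{-1}(u)\,du,$$

where $F^{-1}$ is the quantile function of the underlying distribution.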

The Appendix gives the proofs of these results.

Remark 4. Assume that $Z$ is a continuous underlying distribution of $Y$ whose distribution function $F$ is known. Then the cut points do not need to be given in advance, because they are determined by $F$ and the category probabilities.
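To illustrate why the cut points need not be supplied, the sketch below recovers them from the category probabilities alone, assuming each cut point is the latent quantile at the cumulative probability of the categories below it. The standard normal latent distribution, the function name `cut_points`, and the probabilities are assumptions chosen for illustration.

```python
from itertools import accumulate
from statistics import NormalDist


def cut_points(probs, inv_cdf):
    """Interior cut points c_1..c_{k-1} with F(c_j) = p_1 + ... + p_j."""
    cumulative = list(accumulate(probs))[:-1]  # drop the final value, which is 1
    return [inv_cdf(u) for u in cumulative]


probs = [0.1, 0.4, 0.4, 0.1]  # hypothetical category probabilities
# Assumed standard normal latent variable; any distribution with an
# inverse CDF could be passed instead.
cuts = cut_points(probs, NormalDist().inv_cdf)
print(cuts)
```

For these symmetric probabilities the middle cut point is 0 and the outer two are mirror images, roughly -1.28 and 1.28.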

Remark 5. The score defined in this study fulfills Brockett’s Postulate 2 (Branching Property) [2]: suppose there are more than two categories, and for statistical or computational reasons we wish to combine two adjacent categories. In this case, the scores of the unaffected categories remain unchanged. That is, if the $j$th and $(j+1)$st categories are combined, then the scores of categories $1, \ldots, j-1$ and $j+2, \ldots, k$ are the same before and after the merge.

This postulate states that the scoring system remains consistent as the number of categories changes.

Remark 6. Agresti introduced a score , and let , where is a cumulative distribution function for standard normal distribution and is the Ridit score in category [4]. Then, by Corollary 3, when , we have .
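The comparison in this remark can be sketched as follows, assuming the Agresti normal score is the standard normal quantile of the Ridit score and the latent-variable score is the mean of a standard normal variable truncated to the category's interval. The function name `normal_scores` and the probabilities are hypothetical.

```python
import math
from itertools import accumulate
from statistics import NormalDist

PHI = NormalDist()  # standard normal


def normal_scores(probs):
    """Return (ridit_based, conditional_mean) normal scores per category."""
    upper = list(accumulate(probs))
    upper[-1] = 1.0  # guard against rounding in the running sum
    lower = [0.0] + upper[:-1]

    def cut(u):
        # inv_cdf is undefined at 0 and 1; use infinite cut points there.
        if u <= 0.0:
            return -math.inf
        if u >= 1.0:
            return math.inf
        return PHI.inv_cdf(u)

    def pdf(z):
        return 0.0 if math.isinf(z) else PHI.pdf(z)

    ridit_based, cond_mean = [], []
    for lo, hi, p in zip(lower, upper, probs):
        ridit_based.append(PHI.inv_cdf(lo + 0.5 * p))  # Phi^{-1}(Ridit score)
        a, b = cut(lo), cut(hi)
        # Truncated-normal mean: (phi(a) - phi(b)) / (Phi(b) - Phi(a)).
        cond_mean.append((pdf(a) - pdf(b)) / p)
    return ridit_based, cond_mean


probs = [0.25, 0.5, 0.25]  # hypothetical category probabilities
ridit_based, cond_mean = normal_scores(probs)
print(ridit_based)
print(cond_mean)
```

The two sets of scores share the same ordering and sign pattern, and the gap between them should shrink as the categories become narrower, because the quantile at the midpoint of a short probability interval approximates the average quantile over it.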

4. Examples

Example 1. This example is a prospective study of maternal drinking and congenital malformations [10]. Table 1 presents a summary of the questionnaire results for alcohol consumption as completed by women who have passed their first trimester. Results show whether the newborns suffered from congenital malformations after birth. The average number of drinks per day was used to measure alcohol consumption, which was an explanatory variable of an ordinal categorical nature.

This study examines the correlation between the mothers’ level of alcohol consumption and congenital malformation in newborns. The traditional approach is to use a contingency table. However, this study assigns scores to the level of alcohol consumption and uses the statistic $M^2 = (n-1)r^2$ to test the correlation, where $n$ is the sample size and $r$ is the coefficient of correlation. The square root of $M^2$ has an approximately standard normal distribution under the null hypothesis. The $P$ value is the right-tail probability above the observed value [11]. Different assigned scores are used to calculate $M^2$ and the $P$ value. As Table 2 shows, the midpoint scores give a $P$ value of 0.0104 and the proposed method with exponential scores gives a $P$ value of 0.018572, so the two are close to each other, whereas the results for the midpoints and the midranks (Ridit score) differ greatly. The proposed method with lognormal scores gives the smallest $P$ value, 0.002318, which indicates that it fits these skewed data well.
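A minimal sketch of this score-based trend test follows, assuming the statistic is $M^2 = (n-1)r^2$ with $r$ the correlation between the row scores and a 0/1 response, referred to a chi-squared distribution with 1 degree of freedom. The counts are the ones commonly tabulated for this study in Agresti's textbook; they are not reproduced in the text here, so treat them, the midpoint scores, and the function name `trend_test` as assumptions.

```python
import math


def trend_test(scores, counts_no, counts_yes):
    """Return (M2, p_value) with M2 = (n - 1) * r**2 for a k x 2 table."""
    rows = [a + b for a, b in zip(counts_no, counts_yes)]
    n = sum(rows)
    n_yes = sum(counts_yes)
    mean_x = sum(x * m for x, m in zip(scores, rows)) / n
    mean_y = n_yes / n  # response coded 0 (absent) / 1 (present)
    sxx = sum(m * (x - mean_x) ** 2 for x, m in zip(scores, rows))
    syy = n_yes * (1 - mean_y) ** 2 + (n - n_yes) * mean_y ** 2
    # Centered cross-product; the 0-coded rows cancel because sum(x - mean_x) = 0.
    sxy = sum(b * (x - mean_x) for x, b in zip(scores, counts_yes))
    m2 = (n - 1) * sxy ** 2 / (sxx * syy)
    # Upper tail of chi-squared with 1 df: P(|Z| > sqrt(M2)).
    p_value = math.erfc(math.sqrt(m2 / 2.0))
    return m2, p_value


# Assumed counts per consumption level (0, <1, 1-2, 3-5, >=6 drinks per day).
absent = [17066, 14464, 788, 126, 37]
present = [48, 38, 5, 1, 1]

m2_mid, p_mid = trend_test([0.0, 0.5, 1.5, 4.0, 7.0], absent, present)
m2_equal, p_equal = trend_test([1, 2, 3, 4, 5], absent, present)
print(round(m2_mid, 2), round(p_mid, 4))
print(round(m2_equal, 2), round(p_equal, 4))
```

With these assumed counts, the midpoint scores give a statistic of about 6.57 with a $P$ value of about 0.0104, and equally spaced scores give a $P$ value of about 0.176, in line with the values quoted for this example.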

In this case, Graubard and Korn noted that the results of the trend test applied to this data set are sensitive to the choice of scores: the $P$ value for equally spaced scores is 0.1764, and the Ridit score gave a $P$ value of 0.5533 [10]. Using the midpoint scores, we found a $P$ value close to that of the exponential scores. Therefore, we suggest that the proposed method with exponential or lognormal scores works well in this example.

Example 2. This example is from Agresti, who used several data sets from the General Social Survey (GSS) [4]. Table 3 shows the results of 2,387 responses from the GSS to a question on whether heaven exists; the data are skewed.

Table 4 compares the normal scores based on Ridits (the method of Remark 6) with the scores from the formula of Theorem 1. As Table 4 shows, the Agresti normal score and the proposed normal score are close. The table also shows the proposed scores under the exponential, logistic, and lognormal distributions. The scores are computed as follows. Let $F_j$ be the cumulative relative frequency of the first $j$ categories, obtained from Table 3. Then, we apply function (b) in (3) of Theorem 1 to compute the score value with the distribution $F$ taken to be standard normal, exponential, logistic, and lognormal, respectively. The relatively larger gaps between the lognormal scores also indicate a good fit for these skewed data.
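The computation described above can be sketched from cumulative relative frequencies using closed-form partial expectations, assuming the score is the conditional mean of the latent variable and taking the exponential distribution with rate 1 and the lognormal distribution with parameters 0 and 1. The counts below are hypothetical and are not the GSS data; the function names are illustrative.

```python
import math
from itertools import accumulate
from statistics import NormalDist

PHI = NormalDist()  # standard normal


def exp_partial(u):
    """E[Z; Z <= Q(u)] for Z ~ Exponential(1), with Q(u) = -log(1 - u)."""
    if u >= 1.0:
        return 1.0
    return 1.0 - (1.0 - u) * (1.0 - math.log(1.0 - u))


def lognormal_partial(u):
    """E[Z; Z <= Q(u)] for Z = exp(N(0, 1)), with Q(u) = exp(Phi^{-1}(u))."""
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return math.exp(0.5)
    return math.exp(0.5) * PHI.cdf(PHI.inv_cdf(u) - 1.0)


def scores_from_counts(counts, partial):
    """Conditional-mean score per category from category counts."""
    total = sum(counts)
    probs = [c / total for c in counts]
    upper = list(accumulate(probs))  # cumulative relative frequencies
    upper[-1] = 1.0  # guard against rounding in the running sum
    lower = [0.0] + upper[:-1]
    return [(partial(hi) - partial(lo)) / p for lo, hi, p in zip(lower, upper, probs)]


counts = [600, 250, 100, 50]  # hypothetical skewed counts, not the GSS table
exp_scores = scores_from_counts(counts, exp_partial)
lognormal_scores = scores_from_counts(counts, lognormal_partial)
print(exp_scores)
print(lognormal_scores)
```

A useful check is that the frequency-weighted mean of the scores equals the mean of the assumed latent distribution: 1 for the exponential and $e^{1/2}$ for the lognormal.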

Example 3. This example is from Snedecor and Cochran [12]. In this example, patients with leprosy were divided into those with little infiltration and those with much infiltration, based on a measure of a certain type of skin damage. Their health status was also classified into five levels after the 48-week treatment (Table 5). This study uses the formula of Theorem 1 and that proposed by Fielding to assign scores and compares the results [3]. As Table 6 shows, the values are close to each other. In addition, Figures 3(a)–3(d) show the scores obtained under different distributions with the formula of Theorem 1. These figures show that the pattern of the scores differs with the underlying distribution.

5. Conclusion

In this paper, we provide alternative methods of assigning scores to ordinal categorical variables based on the underlying continuous distribution. These procedures are simple and easy ways to assign scores. We use three real case studies to explain the process and results of the calculations and conclude that the proposed score systems for ordinal variables are easy to perform, effective, and operationally useful, similar to the Ridit score or Agresti scores.

Equally spaced or rank-based scores (i.e., midranks or Ridit scores in Example 1) are generally used for processing ordinal categorical data. However, if the data are right-skewed or left-skewed, or if some categories have many more observations than others, the results are poor. This paper uses several underlying distributions as alternative sources of scores (e.g., the $P$ value obtained from the exponential score is closest to that of the midpoint scores in Example 1). We propose that, if an underlying distribution exists, these methods are also helpful for improving the development of traditional statistical techniques and software applications. Based on the three illustrations, this study suggests that the lognormal score can be applied well when the ordinal categorical data are skewed, and the normal score may be used when the data are relatively balanced among categories.

There are many methods for processing ordinal categorical data. However, not all of them require scores to convert ordinal categorical data into interval data for analysis (e.g., Cumulative Logit Models and Proportional Odds Models do not).

When independent variables are ordinal categorical variables, traditional (general) statistical methods treat them as categorical data and process them as dummy variables. In addition, if there are many ordinal categorical response variables, it is advisable to assign scores to convert them into an interval scale for further statistical analysis. The benefits of scoring independent variables are as follows: (1) the variable uses 1 degree of freedom (instead of the original $k-1$ for $k$ categories coded as dummy variables) and (2) the ordinal nature of the variable is used, so the related computational analysis may be less complicated.

Appendix

Proof of Proposition in Section 3

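Proof of Theorem 1. Since $Y = s_j$ if and only if $c_{j-1} < Z \le c_j$ for each $j$, we have $\sum_{i=1}^{j} p_i = P(Z \le c_j) = F(c_j)$, that is, $c_j = F^{-1}\big(\sum_{i=1}^{j} p_i\big)$.
Since $s_j = E(Z \mid c_{j-1} < Z \le c_j)$, we have $s_j = \frac{1}{p_j}\int_{c_{j-1}}^{c_j} z f(z)\,dz$, where $p_j = F(c_j) - F(c_{j-1})$.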

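Proof of Corollary 2. Since $Z \sim U(0, 1)$, we have $F(z) = z$ and $f(z) = 1$ on $(0, 1)$, so $c_j = \sum_{i=1}^{j} p_i$.
By Theorem 1, we have $s_j = \frac{1}{p_j}\int_{c_{j-1}}^{c_j} z\,dz = \frac{c_{j-1} + c_j}{2} = \sum_{i<j} p_i + \frac{1}{2} p_j = r_j$ if $Z \sim U(0, 1)$.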

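Proof of Corollary 3. Let $u = F(z)$; we have $du = f(z)\,dz$; then $z = F^{-1}(u)$.
By Theorem 1, since $s_j = \frac{1}{p_j}\int_{c_{j-1}}^{c_j} z f(z)\,dz$, where $c_j = F^{-1}\big(\sum_{i=1}^{j} p_i\big)$, the limits become $F(c_{j-1}) = r_j - p_j/2$ and $F(c_j) = r_j + p_j/2$; thus, we have $s_j = \frac{1}{p_j}\int_{r_j - p_j/2}^{r_j + p_j/2} F^{-1}(u)\,du$.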

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.