A Simulation Study to Assess the Effect of the Number of Response Categories on the Power of Ordinal Logistic Regression for Differential Item Functioning Analysis in Rating Scales
Objective. The present study uses simulated data to find what the optimal number of response categories is to achieve adequate power in ordinal logistic regression (OLR) model for differential item functioning (DIF) analysis in psychometric research. Methods. A hypothetical ten-item quality of life scale with three, four, and five response categories was simulated. The power and type I error rates of OLR model for detecting uniform DIF were investigated under different combinations of ability distribution (), sample size, sample size ratio, and the magnitude of uniform DIF across reference and focal groups. Results. When was distributed identically in the reference and focal groups, increasing the number of response categories from 3 to 5 resulted in an increase of approximately 8% in power of OLR model for detecting uniform DIF. The power of OLR was less than 0.36 when ability distribution in the reference and focal groups was highly skewed to the left and right, respectively. Conclusions. The clearest conclusion from this research is that the minimum number of response categories for DIF analysis using OLR is five. However, the impact of the number of response categories in detecting DIF was lower than might be expected.
In studies related to quality of life, measurement equivalence is an essential assumption for meaningful comparison of health-related quality of life scores across different populations. Violation of this statistical property at the item level, also known as differential item functioning (DIF), is an important part of the process of validating health-related quality of life (HRQoL) instruments [1, 2]. DIF analysis originated in educational testing is now increasingly being used in psychometric studies to assess whether the probability of responding to a specific item within a HRQoL scale differs between the compared groups after controlling the construct being measured. There are different types of DIF detection methods for Likert-type items including multiple-group categorical confirmatory factor analysis (MGCFA), item response theory (IRT), and ordinal logistic regression model (OLR) [3–8]. These methods use different assumptions and procedures to test measurement equivalence; however, they share conceptual similarities such as having ability to examine both uniform and nonuniform DIF. In this perspective, some researchers have tried to compare various DIF detection methods by focusing on real data. They found that applying various methods for examining DIF may lead to different results . Beyond that, researchers designed simulation studies to get more insights into similarities and differences of DIF detection methods. Despite the existence of these stimulation studies, further attention is required to clarify some statistical properties of DIF detection methods under different conditions.
In the present study, we have just focused on OLR model as a well-known method in DIF analysis. Unlike IRT model, OLR is able to control additional covariates, both categorical and continuous, which may confound the results of DIF analysis. Moreover, it does not assume normality of the group ability and provides a number of criteria to quantify the magnitude of DIF which may not be practically important [7, 10].
Previous simulation studies have shown how various factors including sample size, sample size ratio, scale length, magnitude of DIF, and the ability distribution can affect the results of DIF analysis by OLR model [5, 10, 11]. Although the quality of an instrument changes according to the number of response options [12, 13], the same number of response categories was used in all of these simulation studies. A number of simulation studies have attempted to answer the question of what the optimal number of response options is for psychological instruments. In these researches, authors found that the optimum number of response options is between four and seven, and reliability and validity decrease with fewer than four categories [14, 15]. In addition, when the number of items is small or if the items are low discriminating, using more response categories can increase the precision of instruments. Moreover, in addition to the psychometric properties, the number of response categories may influence the level of response bias . Acquiescence and extreme response styles are two types of response bias that are highly dependent on the number of response categories in a measure. The first one is defined as the tendency to agree to propositions in general, while the second one is described as the tendency for people to consistently use the extreme ends of response scales .
Although these studies have investigated the effect of the number of response categories on psychometric properties of psychological instruments, to our best of knowledge, this issue has not been previously evaluated on the statistical properties of OLR for detecting DIF. To fill this gap, the present study uses simulated data to find what the optimal number of response categories is to achieve an acceptable level of power and type I error rates in OLR model for DIF analysis. Hence, this study aims to investigate if the effect of the number of response categories on the power of OLR for detecting DIF can be influenced by the skewed ability distributions, sample size, sample size ratio, and the magnitude of DIF across reference and focal groups.
2.1. Ordinal Logistic Regression Model for Detecting DIF
Testing for the presence of uniform and nonuniform DIF under ordinal logistic regression (OLR) is based on comparing three different models as follows: Model 1: ; Model 2: ; Model 3: .
In this model, the term is the ability or observed trait level of an individual usually denoted by total test score, and is the grouping variable with two levels including reference and focal groups. According to the above models, uniform and nonuniform DIF could be detected by comparing the log likelihood values for Models 1 and 2 and Models 2 and 3, respectively. For both uniform and nonuniform DIF twice the difference in log likelihoods is compared to a chi-square distribution with one degree of freedom.
For items with nonuniform DIF, the direction of DIF differs along , and consequently the effect of nonuniform DIF can be cancelled out at the scale level  which is the main reason for focusing on uniform DIF in this study.
2.2. Data Generation
In this study, the graded response model (GRM) was used to generate response data for a measure with 10 items. The mathematical form of the GRM iswhere is the probability of scoring in or above category of item , is the item discrimination parameter, is the threshold for category of item , and represents the latent trait. In this study, parameters were simulated from the standard normal distribution, and parameters were sampled from the uniform distribution over the interval . An item with categories ( to ) will be characterized by a vector of threshold parameters as follows: and parameters were fixed (for the entire simulation) within a particular simulated dataset.
In this simulation study, five factors were varied: response categories, sample size, sample size ratio, magnitude of uniform DIF, and ability distribution. The number of response categories of the items ranged from 3 to 5 (). Two sample sizes (, ) and three levels of sample size ratio () were investigated. Sample size ratio between the reference and focal groups was set to 1 : 1 for the equal sample size conditions and 2 : 1 and 3 : 1 for the unequal sample size conditions. More specifically, we created conditions with , 400/200, and 450/150 for the medium sample size () and , 667/333, 750/250 for the large sample size (). Moderate and severe uniform DIF were also simulated by adding 0.5 and 1 to parameters in the focal group, respectively. In this study, the length of the scale was held constant at 10 and just one item with uniform DIF was simulated.
2.3. Simulated Distributions of the Latent Trait
The GRM, used in the present study to generate item responses, assumes the normality assumption for the latent trait (). To assess whether the impact of the number of response categories on the power of OLR can be influenced by the ability distribution, we simulated nine different distribution conditions (Table 1). In the first condition, the distribution for the reference and focal groups was a standard normal. For the other eight conditions, beta distribution—Beta ()—was used to generate moderately and highly skewed ability distributions. The beta distribution is a member of continuous probability distributions defined on the interval with two positive parameters, including and .
As shown in Figure 1, the beta distribution has different shapes depending on the value of these parameters. If was set to 1 (skewed) or 0.5 (highly skewed) and was greater than 1, we obtained a L shaped distribution. Similarly, a J shaped distribution will be obtained when was greater than 1 and was set to 1 (skewed) or 0.5 (highly skewed). The L and J shaped distributions correspond to situations in which participants respond mostly negatively and positively, respectively. In total, we generated 324 () simulation scenarios; each simulated scenario corresponding to a combination of parameters was replicated 1000 times.
Table 2 shows the power of OLR model under different combinations of sample size ratio, number of response categories, distributions of ability, and magnitudes of DIF when total sample size was 600. Our findings show that the power of OLR model improved as the number of response categories increased. For instance, for the moderate magnitude of DIF (), in conditions 1, 2, 3 in which the same ability distribution assumed in both reference and focal groups, increasing the number of response categories from to increased the OLR power approximately 10%, 8%, and 6%, when , and 3, respectively. However, in conditions 4, 5, 6, 7, 8, and 9, that ability distribution differed in the reference and focal groups; this amount of increment in power was slightly lower and reached approximately 0%, 9%, and 5%, when was equal to 1, 2, and 3, respectively.
The second major finding was that, under various combinations of ability distributions, OLR model had different performances for , and 5. The power of OLR model for was more affected by different distributions of ability than the power of OLR model for and 5. For example, when , there were decreases of approximately 15%, 23%, and 22% in power for condition 8 as compared with conditions 1, 4, and 6. This amount of reduction was lower for and 5 which was approximately 9%, 18%, and 20% for and 12%, 20%, and 19% for . Another example illustrated that when , the power decreased approximately by 11%, 18%, and 18% when the ability distribution of and Beta for the reference and focal groups was compared to conditions 1, 4, and 6; the comparison of these conditions indicated that the rate of reduction was approximately 9%, 17%, and 16% for and approximately 2%, 11%, and 14% for .
Regardless of sample size ratio and ability distribution, general comparison of the OLR power among various numbers of response categories indicated that, for moderate magnitude of DIF, power for was approximately 5.9% and 5.7% higher than and 3, respectively. In addition, under equal sample size ratio, power was roughly 4% lower for compared with , while OLR model performed almost similarly for and under unequal sample size ratio 3 : 1.
Our findings also indicated that as skewness of ability distribution increased, the power of OLR model decreased. When ability distribution in the reference and focal groups was the same and moderately skewed (condition 2), power is the same or slightly different from the condition in which ability distribution was normal in both groups. In contrast, highly skewed distribution of ability for both reference and focal groups (i.e., conditions 3 and 9) led to inadequate levels of power (less than 0.61), irrespective of sample size ratio and number of response categories. Substantial reduction of approximately 37.7% and 57.7% in power occurred when ability distribution of conditions 3 and 9 was compared to condition 1.
Regarding the effect of unequal sample size ratios, the power of OLR model decreased as the ratio of the inequality increased in all combinations of different numbers of response categories and distributions of ability. Compared with , there were decreases of approximately 7.1% and 17.4% in power for and 3, respectively. Moreover, when the sample size ratio changed from 2 to 3, power reduced about 11.1%.
When the magnitude of DIF is large (i.e., DIF = 1), power was always close to 1, irrespective of the ratio of sample size, number of response categories, and distribution of ability; even under highly skew distribution of ability in both reference and focal groups, power was greater than or equal to 0.89.
Table 3 presents empirical type I error rate of OLR model at the nominal significance level of 0.05 under various combinations of sample size ratios, distribution of ability, number of response categories, and magnitude of DIF when total sample size was 600. For moderate magnitude of DIF (), type I error rate was below or close to the nominal level in all conditions. However, when , type I error rate exceeded the nominal level for some distributions of ability including conditions 5, 7, 8, and 9 which ranged from 0.06 to 0.10.
Table 4 displays the power of OLR model under various combinations of sample size ratio, numbers of response categories, different distributions of ability, and magnitudes of DIF, when total sample size was 1000. When , power exceeded 0.79 criterion in all conditions except for highly skewed distribution of ability, namely, conditions 3 and 9. In these cases, power ranged from 0.61 to 0.77 for condition 3 and from 0.39 to 0.61 for condition 9. We also found that unequal sample size ratio led to power reduction irrespective of the distribution of ability and number of response categories; compared with power decreased approximately by 4% and 10.5% for and 3, respectively. When the magnitude of DIF is large (), power was 1 or close to 1 in all cases.
Table 5 indicates type I error rate of OLR at the nominal significance level of 0.05 under different conditions when total sample size was 1000. When the magnitude of DIF is moderate, type I error rate was close to 0.05 in all conditions except for the conditions 7, 8 and 9 in which it ranged from 0.05 to 0.08. However, when the magnitude of DIF was large (), type I error rate was higher than the nominal level for almost all conditions and even exceeded 0.1 for conditions 5, 7, 8, and 9.
It should be noted that the results discussed here are based on adding 0.5 to the threshold parameters in focal group to produce uniform DIF. For a limited number of scenarios we subtracted 0.5 from threshold parameters to create uniform DIF and we found out that the power of OLR model substantially changed in some conditions. For instance, under equal sample size ratio, for condition 4 subtracting 0.5 led to reduction in power from 0.89, 0.87, and 0.90 to 0.75, 0.76, and 0.79 when , and 5, respectively. While in condition 9 the power increased from 0.35, 0.32, and 0.34 to 0.71, 0.68, and 0.66 by subtracting 0.5 instead of adding 0.5 to the focal group. Due to space limitation explaining about the results of all simulation scenarios regarding the effect of subtracting 0.5 from threshold parameters is beyond the scope of this research.
To the best of our knowledge, this is the first study that has evaluated the effect of the number of response categories on the power of OLR model for detecting DIF. Other factors evaluated in this study were the magnitude of DIF, the ability distribution of the reference and focal groups, the sample size, and the sample size ratios. Regardless of the number of response categories, sample size ratio, and ability distribution, this study showed that, for the large sample size () or the large magnitude of DIF (), the power of OLR is 1 or close to 1. Moreover, the effect of the number of response categories on the power of OLR for detecting DIF is slightly influenced by the ratio of sample size in the focal and reference groups.
One of the most obvious findings to emerge from this study was that gains in power from increasing the number of response categories could be affected by the difference between the ability distributions of the focal and the reference groups. Accordingly, when ability level was distributed identically (normal or skewed) in the reference and focal groups, increasing the number of response categories from 3 to 5 resulted in an increase of approximately 8% in the power of OLR model for detecting DIF. Moreover, when the number of response options is kept constant, especially when , the power of OLR can be substantially affected by the ability distribution in the focal and reference groups. We found that, in the case of , the power of OLR model can be reduced by approximately 16% when the ability distribution changes from condition 4 to condition 5.
Another important feature which has been considered in this study is to evaluate whether high level of skewness in ability can affect the power of OLR for detecting DIF regardless of the number of response categories, sample size, and sample size ratio. In most simulation studies, item responses were generated using the GRM which generally assumed normality of the latent ability. However, this assumption may not be encountered in practice. For example, in HRQoL studies, violation of normality assumption can frequently occur when we intend to evaluate measurement equivalence of the instrument across the two diverse groups such as healthy people and people with chronic conditions [18, 19]. In the present research, the simulated distributions were partly extreme. In order to provide evidence as to why these distributions are realistic and relevant to study, readers can be referred to a number of applied and methodological articles in the field of HRQoL. For example, Guilleux et al. evaluated the impact of a deviation from the normality assumption on the performance of the IRT methods used in clinical trials to compare highly skewed latent HRQoL scores across two treatment groups . In addition, Hunger et al. examined the use of beta regression models for analyzing longitudinal HRQoL data in clinical trials and epidemiologic studies when HRQoL scores were highly skewed . Moreover, a large number of papers have dealt with the case that distribution of HRQoL scores was heavily skewed regarding patients with chronic conditions such as asthma, cerebral palsy, and cancer [22–25]. Thus, having considered the volume of research referenced above, it seems reasonable to assume that extreme distributions might happen in practice.
However, in previous simulation studies assessing DIF, limited conditions were considered with respect to ability distributions. In two studies, the ability distribution for the reference group was a standard normal, and the focal group had a distribution that was moderately or highly skewed to the left or right [26, 27]. In another study, a moderate negative skew in ability distribution for both the focal and reference groups was evaluated . All of these studies showed that moderate skewness had very little impact on the power of OLR for DIF detection. However, the most surprising aspect of our study was that compared to when ability level was distributed normally in both groups the power of OLR was reduced approximately by 60% when ability distributions in the reference and focal groups were highly skewed to the left and right, respectively. In this case, increasing the sample size from 300 to 500 per group could not compensate for the reduction in power. Even if ability distribution is highly skewed to the left or right in the focal group, when ability level is normally distributed in the reference group, for and , OLR model can detect moderate DIF () with a power close to 80%.
In this study, no clear differences in the nominal type I error rate were found under different conditions except when ability distributions were highly skewed in both groups, and also the sample size and the effect of DIF were large. Accordingly, if we intend to draw a general conclusion linking the findings of the current simulation study and the previous ones [8, 11, 26–28], it would be that for the moderate effect of DIF, in terms of type I error rate, OLR is robust to change the number of items in the measure and the number of response categories as well as to moderate skewness of ability distribution.
It should be noted that we have just presented simulations with positive changes in threshold parameters, but corresponding simulations were carried out for negative changes, and the results were totally different when ability distribution in one or both groups was skewed. These findings are different from those in a previous research, which reported that when the ability level in the focal and reference groups were normally distributed, adding or subtracting 0.5 to threshold parameters did not change the results principally .
Moreover, in the present simulation study, the number of response categories of the items varied from 3 to 5. However, there is no consensus on the number of response options that would optimize the psychometric properties of the scales. Although some authors argue that reliability is maximized with seven response categories [29, 30], others prefer five-point scale [31, 32]. In practice, the number of response alternatives most frequently used in Likert-type scales is five or even less than five, especially in the field of health and social sciences. For example, KINDL, PedsQL™ 4.0, and KIDSCREEN are the most frequently used questionnaires in pediatric HRQoL studies with five response options [9, 12, 13]. On the other hand, GHQ-12 and DASS are two measures among a pool of psychological instruments which include four response categories [33, 34]. Moreover, physical functioning subscale in the SF-36 instrument uses 3 response options . To explain why the number of response categories was set to 3, 4, and 5, we also simulated items with 6 and 7 response categories, not reported here in Results. The findings revealed that increasing the number of response categories from 5 to 6 or 7 resulted in an increase of less than 2% in the power of the OLR for DIF detection. With all things considered, it has been preferred to keep the number of response categories limited to 3, 4, and 5.
Similar to any Monte Carlo simulation study, this research had some limitations. One limitation was that just one item with uniform DIF was simulated to avoid the contamination of multiple DIF items in a test. Another limitation was that, we just simulated a hypothetical instrument with 10 items. The result would be different if we simulated more than one item with DIF and increased the number of items in a test.
The clearest conclusion from this research is that the minimum number of response categories for DIF analysis using OLR should be at least five. However, the impact of the number of response categories in detecting DIF was lower than might be expected. This research provides a guideline to applied researchers in choosing the number of response categories for rating scales in DIF analysis. Moreover, this study revealed that high skewness of ability distributions substantially reduced the power of OLR model to detect uniform DIF, and increasing the sample size could not compensate for the reduction in power. This finding is important because in HRQoL studies it is unrealistic to assume that the ability level is normally distributed across healthy people and people with chronic conditions. In future research, it would be useful to evaluate whether the effect of the number of response categories on OLR power can be influenced by increasing the number of items with DIF. Since increasing the number of items with DIF may contaminate the scale score [8, 36], future research in purification methods is strongly recommended.
This paper was extracted from Elahe Allahyari’s Ph.D. thesis.
The authors declare that there is no conflict of interests regarding the publication of this paper.
The authors would like to thank Dr. Zahra Shahsavar and Mr. Hossein Argasi in the Research Consulting Center (RCC) of Shiraz University of Medical Sciences for their valuable assistance in editing this paper. This work was supported by the Grant no. 93-7152 from Shiraz University of Medical Sciences Research Council.
P. Jafari, Z. Sharafi, Z. Bagheri, and S. Shalileh, “Measurement equivalence of the KINDL questionnaire across child self-reports and parent proxy-reports: a comparison between item response theory and ordinal logistic regression,” Child Psychiatry and Human Development, vol. 45, no. 3, pp. 369–376, 2014.View at: Publisher Site | Google Scholar
P. Jafari, Z. Bagheri, S. M. T. Ayatollahi, and Z. Soltani, “Using Rasch rating scale model to reassess the psychometric properties of the Persian version of the PedsQL TM 4.0 Generic Core Scales in school children,” Health and Quality of Life Outcomes, vol. 10, article 27, 2012.View at: Publisher Site | Google Scholar
A. Maydeu-Olivares, U. Kramp, C. García-Forero, D. Gallardo-Pujol, and D. Coffman, “The effect of varying the number of response alternatives in rating scales: experimental evidence from intra-individual effects,” Behavior Research Methods, vol. 41, no. 2, pp. 295–308, 2009.View at: Publisher Site | Google Scholar
C. A. Limbers, D. A. Newman, and J. W. Varni, “Factorial invariance of child self-report across healthy and chronic health condition groups: a confirmatory factor analysis utilizing the PedsQL™ 4.0 generic core scales,” Journal of Pediatric Psychology, vol. 33, no. 6, pp. 630–639, 2008.View at: Publisher Site | Google Scholar
C. J. Gaskin and T. Morris, “Physical activity, health-related quality of life, and psychosocial functioning of adults with cerebral palsy,” Journal of Physical Activity and Health, vol. 5, no. 1, pp. 146–157, 2008.View at: Google Scholar
Y. Kaya, W. L. Leite, and M. D. Miller, “A comparison of logistic regression models for DIF detection in polytomous items: the effect of small sample sizes and non-normality of ability distributions,” International Journal of Assessment Tools in Education, vol. 2, no. 1, pp. 22–39, 2015.View at: Google Scholar
A. Montazeri, A. M. Harirchi, M. Shariati, G. Garmaroudi, M. Ebadi, and A. Fateh, “The 12-item General Health Questionnaire (GHQ-12): translation and validation study of the Iranian version,” Health and Quality of Life Outcomes, vol. 1, no. 1, article 66, 2003.View at: Publisher Site | Google Scholar