Journal of Probability and Statistics

Volume 2014, Article ID 240263, 8 pages

http://dx.doi.org/10.1155/2014/240263

## New Indices for Refining Multiple Choice Questions

^{1}Department of Mathematics, Institute of Applied Mathematics in Science and Engineering, University of Castilla-La Mancha, 13071 Ciudad Real, Spain

^{2}Department of Medical Sciences, University of Castilla-La Mancha, 13071 Ciudad Real, Spain

^{3}Medical Education Unit, University of Castilla-La Mancha, 13071 Ciudad Real, Spain

Received 8 September 2014; Accepted 7 December 2014; Published 23 December 2014

Academic Editor: Chin-Shang Li

Copyright © 2014 Mariano Amo-Salas et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Multiple choice questions (MCQs) are one of the most popular tools to evaluate learning and knowledge in higher education. Nowadays, there are a few indices to measure reliability and validity of these questions, for instance, to check the difficulty of a particular question (item) or the ability to discriminate from less to more knowledge. In this work two new indices have been constructed: (i) the no answer index measures the relationship between the number of errors and the number of no answers; (ii) the homogeneity index measures homogeneity of the wrong responses (distractors). The indices are based on the lack-of-fit statistic, whose distribution is approximated by a chi-square distribution for a large number of errors. An algorithm combining several traditional and new indices has been developed to refine continuously a database of MCQs. The final objective of this work is the classification of MCQs from a large database of items in order to produce an automated-supervised system of generating tests with specific characteristics, such as more or less difficulty or capacity of discriminating knowledge of the topic.

#### 1. Introduction

Tests based on multiple choice questions (MCQs) are widely used for evaluation. These tests are basically designed to assess learning and knowledge. Nevertheless, tests may be carefully built to assess other capacities, such as clinical reasoning. There are some recommendations for taking all this into account [1–3]. It is widely accepted that constructing good MCQs is time consuming and difficult [4, 5], which justifies a careful review of each of the items. The main advantage of this methodology is that it provides feedback to both the students and the professors.

MCQs are items with a *stem* (the starting part of the item, e.g., a question or a statement to be completed) and a set of possible *responses*, generally between 3 and 5. The only correct response is usually called the *key* and the incorrect responses are called *distractors*. The students have to select just one response or none. The mark is 1 if the answer is correct and 0 if none of the responses has been chosen, and there is a penalty of $1/(k-1)$ for each failure, where $k$ is the number of responses. Thus, in this work we are considering a correction for guessing. This penalty is an unbiased estimate of what a student can get when answering randomly if there is no penalty. A negative number for the final mark is theoretically possible, but this rarely happens in practice with a sufficient number of items, which is crucial for this type of test.
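The correction for guessing can be checked with a one-line expectation. As a sketch, write $k$ for the number of responses and assume the penalty per failure is $1/(k-1)$, which gives the $1/4$ used for five-option items; $X$ is notation introduced here for the mark that a student obtains on one item when choosing uniformly at random among the $k$ responses:

```latex
\[
  E[X] \;=\; \frac{1}{k}\cdot 1 \;+\; \frac{k-1}{k}\cdot\left(-\frac{1}{k-1}\right)
       \;=\; \frac{1}{k} - \frac{1}{k} \;=\; 0 .
\]
% Without any penalty the same student would expect 1/k marks per item,
% so the penalty removes exactly the expected gain from blind guessing.
% For k = 5: 1/5 - (4/5)(1/4) = 0.
```

The cancellation relies on every response being equally likely under guessing, which is the homogeneity assumption discussed later in the paper.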

We focus this work on this type of MCQ for practical reasons. In Spain, after obtaining the B.M. degree (six years), all the graduates have to pass a national competitive examination based on MCQs to access a specialty in medicine. After passing the examination, all the graduates are ranked and they can choose a specialty from the available offers to spend a training period of 3–5 years in a medical center. The national competitive examination to access a specialty in medicine consists of 225 multiple choice questions with five options, of which only one is correct, plus 10 questions held in reserve in case formulation problems or errors are detected (235 in total). The mark is 1 if the answer is correct and 0 if none of the responses has been chosen, and there is a penalty of 1/4 for each failure. As a matter of fact, this kind of test is used in almost all the faculties of medicine in Spain to get the students used to it. Moreover, this is the type of MCQ generally used in higher education in Spain.

One of the main characteristics of these items is the existence of indices to analyze their reliability and validity, for instance, the difficulty or discrimination index. These indices allow the categorization of the items based on the answers obtained. Another use of these indices is to detect mistakes in the items, providing a tool to improve an item for future use. They can also be used to investigate why more failures than usual are observed in a particular item. The difficulty of a particular item may be caused by reasons intrinsic to the item (e.g., a complex concept) or because the key or the distractors lead the student to fail. Most poorly designed items are characterized by the following: (i) the item does not succeed in assessing the main objective, (ii) there are clues pointing to the right answer, and (iii) the text of the stem or of the responses is ambiguous. The aim of the distractors is to look like plausible solutions to the problem for those students who do not achieve the objective assessed in the item. At the same time, the distractors must not be plausible for the students who reach the objective evaluated by the item. For these students only the correct answer has to be plausible.

There are some indicators to identify weak and strong groups or to measure the difficulty and discrimination capacity of items and tests. As far as the authors know, the literature on this topic does not consider any measure either of the homogeneity of the responses or of the rate of "no answers" [6]. There exists a group of techniques based on a fuzzy approach, which use a more complicated ordering of results and enable the student to explicitly describe his/her degree of confidence in each possible answer [7].

The aim of this paper is to provide two new indices to measure the relationship between the number of errors and the number of no answers as well as the homogeneity of the responses of an item. As a matter of fact the justification of the penalty described above is strongly based on the homogeneity of the distractors and any violation of this hypothesis makes the use of the penalty inadequate. The indices provided here will help in checking this intrinsically in order to get a suitable test.

Finally, in this paper, a joint analysis of different indices is developed in order to obtain a procedure of classification of MCQs to detect the items that should be revised. In this sense the algorithm works as a security system.

#### 2. Materials and Methods

##### 2.1. Difficulty and Discrimination Indices

Difficulty and discrimination indices are classic in the analysis of MCQs and they have been widely treated in the literature [8, 9].

The *difficulty index* $I_{\mathrm{dif}}$ is defined as the proportion of correct answers among the students who took the test:

$$I_{\mathrm{dif}} = \frac{n_c}{n},$$

where $n$ is the number of students who took the test and $n_c$ is the number of students who answered the item correctly. Thus, it is within the interval $[0, 1]$.

This index may be used to compare the difficulty of a particular item with the global difficulty of the test. Thus, this index may be used to check the homogeneity of the test in the sense of difficulty.

The *discrimination index* $I_{\mathrm{dis}}$ measures the capacity of an item to distinguish between different levels of knowledge of the students. In order to compute this index, the students' tests have to be sorted from lower to higher scores. Then a group with the lowest scores (lower group) and another group with the highest scores (upper group) are built. The size of these groups varies according to the literature, but it is usually around 30% of the total number of students. The most frequent size in the literature is 27%, for example, [10]. Other sizes may be found, for instance, in Tristrán [11]. The definition of the index is then

$$I_{\mathrm{dis}} = p_U - p_L,$$

where $p_U$ is the proportion of the students in the upper group who answered the item correctly and $p_L$ is the proportion of the students in the lower group who answered the item correctly. The values of the index are in the interval $[-1, 1]$, where 1 means maximum discrimination and 0 means minimum discrimination. Negative values of the index mean that the students of the upper group failed this item more often than the students of the lower group, which contradicts what is expected.
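The two classical indices can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the input layout (one total score and one 0/1 correctness flag per student), the default group fraction of 27%, and the rounding of the group size are choices made here, not prescriptions of the paper. Ties in the total score are broken by input order.

```python
def difficulty_index(item_correct):
    """Proportion of students who answered the item correctly (in [0, 1])."""
    if not item_correct:
        raise ValueError("at least one student is required")
    return sum(item_correct) / len(item_correct)


def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Difference between the proportions of correct answers in the
    upper and lower score groups (in [-1, 1])."""
    n = len(total_scores)
    if n == 0 or n != len(item_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumed convention: group size is the rounded fraction, at least 1.
    size = max(1, round(n * fraction))
    # Student indices sorted from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:size], order[-size:]
    p_upper = sum(item_correct[i] for i in upper) / size
    p_lower = sum(item_correct[i] for i in lower) / size
    return p_upper - p_lower


# Hypothetical data: ten students, total test scores and one item's results.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
correct = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
print(difficulty_index(correct))              # 0.5
print(discrimination_index(scores, correct))  # 1.0
```

In this made-up example the three weakest students all fail the item and the three strongest all succeed, so the item discriminates perfectly.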

Although it is expected that difficult items will discriminate better than easy items, this is not always the case and the combination of both indices provides an interesting tool to check possible incoherencies. Both indices are based just on the correct answer, but the rest of the responses play an important role as well. The homogeneity index given in this paper considers all the responses.

##### 2.2. Homogeneity Index of the Distractors

A new index is defined to measure the homogeneity of the distractors in an MCQ. Thus, this index measures whether the wrong answers are equally distributed among all the distractors, which is what justifies the use of the traditional penalty. If some response has a very low frequency, this means that it is too obvious to the students that this response is wrong, and yet the students who chose this distractor are penalized by the same amount as those who chose a more feasible distractor. On the contrary, if some distractor has a very high frequency, this means that this response may be ambiguous and may lead the students to a wrong interpretation.

The importance of this index comes from the penalty the student receives for a wrong answer. This penalty is based on the hypothesis that all the responses have the same difficulty and therefore the same chance of being chosen at random. If this hypothesis fails, a person choosing an answer randomly may be more likely to succeed than a person who has studied the topic and is confused by an unclear wording of one of the responses. A higher frequency may be considered more unfair to the student than a lower one.

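The index given here is based on the lack-of-fit test. Let $n$ be again the number of students and $k$ the number of responses for each item. Let $n = n_0 + n_c + n_e$, where $n_0$ is the number of people marking none of the responses, $n_c$ is the number of successes, and $n_e$ is the number of failures (errors). Moreover, $n_e = n_1 + \cdots + n_{k-1}$, where $n_i$, $i = 1, \ldots, k-1$, are the numbers of students choosing each of the distractors.

The latter numbers follow a multinomial distribution of size $n_e$:

$$(n_1, \ldots, n_{k-1}) \sim \mathcal{M}(n_e;\, p_1, \ldots, p_{k-1}),$$

where $p_i$ is the proportion of subjects selecting response $i$ and $\sum_{i=1}^{k-1} p_i = 1$.

To apply the traditional penalty, the optimal situation is that all the responses have the same level of difficulty, and therefore the frequencies should be similar. The following is a typical lack-of-fit hypothesis test:

$$H_0: p_i = \frac{1}{k-1},\ i = 1, \ldots, k-1 \quad \text{versus} \quad H_1: p_i \neq \frac{1}{k-1} \text{ for some } i.$$

The explicit formula for the index is the test statistic:

$$I_H = \sum_{i=1}^{k-1} \frac{\left(n_i - \dfrac{n_e}{k-1}\right)^2}{\dfrac{n_e}{k-1}}.$$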
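As a sketch of how this statistic can be computed, the function below takes the number of students who chose each distractor and returns the usual lack-of-fit sum of (observed − expected)² / expected, with every expected count equal to the number of errors divided by the number of distractors. The function name and the choice to raise an error when there are no errors are assumptions made for this illustration.

```python
def homogeneity_index(distractor_counts):
    """Lack-of-fit statistic for equal use of the distractors.

    distractor_counts: number of students choosing each distractor
    (one entry per distractor, i.e. k - 1 entries for k responses).
    """
    n_errors = sum(distractor_counts)
    if n_errors == 0:
        # With no errors the index is not meaningful.
        raise ValueError("the index is not applicable when there are no errors")
    expected = n_errors / len(distractor_counts)
    return sum((c - expected) ** 2 for c in distractor_counts) / expected


# Hypothetical counts for an item with five responses (four distractors).
print(homogeneity_index([5, 5, 5, 5]))  # 0.0  (perfect homogeneity)
print(homogeneity_index([1, 0, 0, 0]))  # 3.0  (a single error)
print(homogeneity_index([1, 1, 0, 0]))  # 2.0  (two errors, two distractors)
print(homogeneity_index([2, 0, 0, 0]))  # 6.0  (two errors, same distractor)
```

Large values indicate that the errors concentrate on some distractors and avoid others.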

The probability distribution of this statistic is approximated by a chi-squared distribution with $k-2$ degrees of freedom, where $k$ is the number of responses of the item. The values of this distribution can be found in any textbook of basic statistics or in any statistical software, including Excel (=CHIINV(probability; degrees of freedom)). This approximation is good enough if most of the expected frequencies are greater than or equal to 5 and none of them is less than 1.5 [12]. The index $I_H$ may vanish in two very different cases. On the one hand, if there is perfect homogeneity, then $n_i = n_e/(k-1)$ for every $i$, where $n_i$ is the number of students choosing distractor $i$ and $n_e$ is the total number of errors. On the other hand, it vanishes if there are no errors, $n_e = 0$. In the latter case the index should not be applied, while the former case means that there is no clear objection against homogeneity. Table 1 gives critical values, for a significance level of 2.5%, for low numbers of errors, each computed with 200,000 simulations. For example, one of the tabulated critical values is 9.348 for a significance level of 2.5%. Notice that if the number of errors is too small, the index is still coherent. For instance, if $k = 5$ and $n_e = 1$, then the observed index is 3 and the critical value is 3, and therefore there is no evidence of nonhomogeneity. If $n_e = 2$, the errors may be distributed over two distractors ($I_H = 2$) or concentrated at the same distractor with $I_H = 6$, which is the critical value, and therefore there is no evidence yet to assert lack of homogeneity. Later on, we will explain why we use 2.5% here as the significance level instead of the traditional 5%.
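For very small numbers of errors, the null distribution of the statistic can also be enumerated exactly rather than simulated. The sketch below does this as an alternative to the simulations used for Table 1; it is not the authors' procedure. It assumes that the critical value is the smallest attainable value $c$ of the statistic such that the probability of exceeding $c$ under homogeneity is at most the significance level, and that homogeneity is rejected only when the observed index is strictly greater than $c$. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def exact_critical_value(n_errors, n_responses, alpha=0.025):
    """Smallest attainable statistic c with P(statistic > c) <= alpha
    when the errors are spread uniformly over the distractors."""
    if n_errors < 1 or n_responses < 3:
        raise ValueError("need at least one error and two distractors")
    m = n_responses - 1                      # number of distractors
    expected = Fraction(n_errors, m)
    dist = {}                                # statistic value -> probability
    for counts in compositions(n_errors, m):
        ways = factorial(n_errors)           # multinomial coefficient
        for c in counts:
            ways //= factorial(c)
        stat = sum((c - expected) ** 2 for c in counts) / expected
        dist[stat] = dist.get(stat, 0) + Fraction(ways, m ** n_errors)
    tail = Fraction(0)                       # P(statistic > current value)
    critical = None
    for stat in sorted(dist, reverse=True):
        if tail > alpha:
            break
        critical = stat
        tail += dist[stat]
    return critical


# Five responses per item, as in the examination described in the Introduction.
print(exact_critical_value(1, 5))  # 3
print(exact_critical_value(2, 5))  # 6
```

The enumeration grows quickly with the number of errors, so it is only practical for the small cases in which the chi-squared approximation is doubtful.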