Abstract

For test-sheet composition systems, it is important to adaptively compose test sheets with diverse conceptual scopes, discrimination degrees, and difficulty degrees to meet the various assessment requirements of real learning situations. Computation time and item exposure rate also influence system performance and item bank security. Therefore, this study proposes an Adaptive Test Sheet Generation (ATSG) mechanism, in which a Candidate Item Selection Strategy adaptively determines the candidate test items and conceptual granularities according to the desired conceptual scopes, and an Aggregate Objective Function applies a Genetic Algorithm (GA) to find an approximate solution of the mixed integer programming problem for test-sheet composition. Experimental results show that the ATSG mechanism generates test sheets that meet the various assessment requirements more efficiently and precisely than existing approaches. Furthermore, according to the experimental findings, the Fractal Time Series approach may be applied in the near future to analyze the self-similarity characteristics of the GA’s fitness scores and thereby improve the quality of test-sheet composition.

1. Introduction

With the rapid development of information and assessment technology, computerized testing is widely used to assess, predict, and diagnose learners’ learning statuses because it can effectively analyze examinees’ abilities and learning barriers. The test quality offered by a computerized testing system depends not only on the quality of the test items but also on whether the test sheets satisfy the various requirements on assessment parameters, such as the difficulty degree, the discrimination degree, the associated concepts, and the expected testing time. Thus, how to efficiently assist teachers in composing an appropriate test sheet that meets diverse assessment requirements has become an important research issue.

Hwang [1] applied the dynamic programming technique to solve this issue, but the solution is inefficient for a large item bank because of the exponential growth of time and space complexity. Su and Wang [2] developed an assistance system that provides teachers with statistical information for manually composing the desired test sheets, but manually selecting appropriate test items from a large item bank is still inefficient, and it is difficult to ensure the quality of the resulting test sheets. The problem of automatic test item allocation has therefore become pressing; it can be regarded as a combinatorial optimization problem, which has been proven to be NP-hard [3]. Accordingly, Hwang et al. [4] formulated this problem as a mixed integer programming model and proposed approximate solutions using the Genetic Algorithm (GA) approach [5]. Their experimental results show that the approach can efficiently and automatically compose a good enough test sheet for a large-scale test.

However, the aforementioned studies mainly aim to automatically generate a test sheet with the highest discrimination degree while meeting constraints on expected testing time and concept relevance. These mechanisms are suitable for large-scale tests only, and by nature they have difficulty satisfying the various purposes of assessment in real learning situations. To understand students’ learning problems efficiently, it is important to compose test sheets with diverse conceptual scopes, discrimination degrees, and difficulty degrees, for example, for displacement and summative assessments (with normally distributed parameter values) and for formative and diagnostic assessments (with various or specific parameter values) [6–9]. Moreover, the computation time of the test-sheet composition process and the item exposure rate are also concerns. A long computation time degrades the performance of the test-sheet composition system, and a high item exposure rate degrades the quality of the test items and the item bank security [10, 11]. Accordingly, to consider not only the various assessment requirements but also the computation time and item exposure rate, this study defines a new problem of automatic test item allocation, called the Adaptive Test Sheet Generation problem. To solve it, this research proposes an Adaptive Test Sheet Generation (ATSG) mechanism, consisting of a Candidate Item Selection Strategy (CISS) and an Aggregate Objective Function (AOF). The CISS process adaptively determines the candidate test item set and the conceptual granularities according to the desired concept scope, and the AOF applies a GA to solve the mixed integer programming problem. The evaluation results show that the proposed approach can generate test sheets that meet the various assessment requirements.

2. Related Work

The test sheet generation problem was originally identified for large-scale tests, where the test items that cover all required concepts and have the highest degree of discrimination are selected from a test item bank. Hwang [1] proposed an algorithm based on the dynamic programming technique to find optimal test sheets, but its exponential time complexity makes it inefficient for a large number of candidate test items. Therefore, the researchers formulated this problem as a mixed integer programming model and applied a genetic algorithm [4] to find an approximate solution. In this paper, assume that a set of test items, which are related to $m$ concepts, should be selected from the $n$ items in the item bank. Each test item is defined as

$$Q_i = \left(t_i, d_i, r_{i1}, r_{i2}, \ldots, r_{im}\right), \quad 1 \le i \le n,$$

where $Q_i$ is a test item in the item bank (IB) and has a set of parameters including the expected time $t_i$ needed for answering, the degree of discrimination $d_i$, and the degree of association $r_{ij}$ between $Q_i$ and a concept $C_j$.

The assessment requirement of a Test Sheet (TS) includes the lower bound $l$ and upper bound $u$ of the total expected answering time, and the lower bound $h_j$ of the total relevance of each concept $C_j$. To formulate the problem, a decision variable $x_i$ is defined as a Kronecker delta, that is,

$$x_i = \begin{cases} 1, & \text{if } Q_i \text{ is selected for the test sheet}, \\ 0, & \text{otherwise}. \end{cases}$$

The goal of this problem is to maximize $\sum_{i=1}^{n} d_i x_i$.

Subject to the concept range $\sum_{i=1}^{n} r_{ij} x_i \ge h_j$ for $j = 1$ to $m$ and the testing time limitation $l \le \sum_{i=1}^{n} t_i x_i \le u$.

A Genetic Algorithm (GA) approach [5] is used to solve this problem. A chromosome is represented as an n-bit binary string in which each bit indicates whether the corresponding test item is selected, and its fitness rank is the sum of the selected items’ discrimination degrees minus the penalty scores. The penalty scores measure the degree to which the expected time and concept range constraints are violated. The genetic algorithm iteratively produces new generations of chromosomes through the randomized Crossover and Mutation operations and keeps the best chromosomes according to their fitness ranks. In Crossover, chromosomes of the next iteration are generated by combining halves of two chromosomes that are randomly selected from the current iteration; a chromosome with a higher fitness rank is more likely to be selected. Mutation changes a chromosome by randomly altering an arbitrary bit. This kind of evolutionary algorithm iteratively approaches the optimal solution and uses random operations, such as Crossover and Mutation, to avoid becoming trapped in local optima. According to the evaluation, the GA-based test sheet generation approach can provide good solutions among more than ten thousand test items within an acceptable response time. Furthermore, the greedy algorithm approach [12], the tabu search algorithm [13], and the discrete particle swarm optimization algorithm [14] were subsequently applied to enhance the computation efficiency of test sheet generation based on the aforementioned problem formulation.
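The fitness rank described in this section can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the names `Item` and `fitness`, the linear form of the penalty, and the `penalty_weight` parameter are hypothetical and are not taken from [4] or [5].

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Item:
    time: float                 # expected answering time
    discrimination: float       # degree of discrimination
    relevance: Sequence[float]  # relevance to each concept


def fitness(bits, items, time_low, time_high, concept_low, penalty_weight=1.0):
    """Sum of selected discrimination degrees minus constraint penalties.

    Assumption: each violation is penalized linearly by its size.
    """
    chosen = [item for bit, item in zip(bits, items) if bit]
    score = sum(item.discrimination for item in chosen)
    total_time = sum(item.time for item in chosen)
    # Penalty for leaving the allowed answering-time window.
    violation = max(0.0, time_low - total_time) + max(0.0, total_time - time_high)
    # Penalty for each concept whose total relevance misses its lower bound.
    for j, bound in enumerate(concept_low):
        covered = sum(item.relevance[j] for item in chosen)
        violation += max(0.0, bound - covered)
    return score - penalty_weight * violation
```

A chromosome that satisfies every constraint is scored by its total discrimination alone, while an infeasible one is pushed down in proportion to how far it misses the time window or the concept bounds.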

Besides, the test sheet composition problem was extended to a parallel test sheets composition problem, where multiple test sheets are generated at one time. These sheets must have similar concept relevance, discrimination, and difficulty degrees but contain no common test items. The problem was solved by extending the existing tabu search algorithm [15] and the particle swarm optimization algorithm [16].

3. Adaptive Test Sheet Generation Problem

In order to efficiently understand the students’ learning problems, the parameters of a test sheet, including its conceptual scope, discrimination degree, and difficulty degree, should be adaptively composed according to the assessment purpose, such as displacement and summative assessments (with normally distributed parameter values) and formative and diagnostic assessments (with various or specific parameter values). As illustrated in Figure 1, for the formative assessment, like a small-scale test, a test sheet with specific and detailed concepts, that is, a low-level conceptual scope with fine-grained granularity, is required to evaluate the students’ specific conceptual capabilities during learning; for the diagnostic assessment, like a specific-scale test, a test sheet with diverse conceptual scopes and granularities is used to diagnose the students’ learning problems; for the displacement and summative assessments, like a large-scale test, a test sheet with high-level conceptual granularities is required to evaluate the students’ learning performance before and after learning, respectively.

However, as seen in Figure 2, the existing approaches do not take the adaptive requirements, that is, the conceptual scope, discrimination degree, and difficulty degree, into account and focus only on the highest discrimination degree. Consequently, their composed test sheets may miss required concept nodes or include erroneous ones and cannot meet the adaptive requirements. Moreover, they need much more computation time to select candidate test items from the item bank because they have no item selection strategy to filter out the irrelevant ones in advance. In addition, the item exposure rate, which denotes the number of times a test item has been used in test sheets, must also be considered to enhance the Item Bank Security.

Therefore, three issues are required to be solved for satisfying the adaptive requirements of a test sheet: (i) how to generate a test sheet to precisely meet the adaptive requirements in terms of conceptual granularities, discrimination, difficulty, and expected test time parameters; (ii) how to speed up the test sheet generation process for reducing the computation time; (iii) how to consider the item exposure rate issue to enhance the Item Bank Security.

An Adaptive Test Sheet Generation Problem Is Defined as Follows
Assume that a set of test items should be selected from the items in the item bank. All items are related to the concepts in a concept hierarchy, a tree of concepts as shown in Figure 1, whose tree nodes form the concept set C. δ is a descendant function that maps a concept to the set of its descendant nodes, and the descendant leaf function maps a concept to the set of those descendants that are leaf concepts of the hierarchy.
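The descendant function and the descendant leaf function can be sketched in Python as follows. The representation of the concept hierarchy as a dictionary from each concept to its list of children, and the function names `descendants` and `descendant_leaves`, are assumptions made for illustration only.

```python
def descendants(tree, concept):
    """Return every concept strictly below `concept` in the hierarchy."""
    found = set()
    stack = list(tree.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(tree.get(node, ()))
    return found


def descendant_leaves(tree, concept):
    """Return the descendants of `concept` that have no children."""
    return {c for c in descendants(tree, concept) if not tree.get(c)}


# Hypothetical hierarchy: c1 has children c2 and c3; c2 has children c4 and c5.
tree = {"c1": ["c2", "c3"], "c2": ["c4", "c5"]}
print(sorted(descendants(tree, "c1")))
print(sorted(descendant_leaves(tree, "c1")))
```

Under this reading a leaf concept has no descendants, so both functions return an empty set for it.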

Based on the definition in Section 2, the item exposure times and the degree of difficulty are also taken into account in this study. Thus, each test item is defined as follows.

An example is provided in Figure 3, which shows a concept hierarchy as a tree of concepts together with a set of test items. Each weight in the figure denotes the relevance degree between a concept and a test item, and δ applied to a concept denotes the subtree of that concept.

Therefore, in this study, a test sheet (TS) can be defined as follows:

where TS includes the expected test time of the test sheet, the target difficulty degree, the target concepts, and the lower bound of the average concept relevance. Based on the definitions of the existing studies mentioned in Section 2, a decision variable $X = [x_1, x_2, \ldots, x_n]$ is defined, where $n$ is the number of items in the item bank and $x_i$ is 1 if the $i$-th test item is selected for the test sheet and 0 otherwise.

The goal of the adaptive test sheet generation problem is to generate a test sheet that (i) approaches the target expected test time and the target difficulty degree, (ii) has the highest average discrimination degree, (iii) has a balanced sum of concept relevance weights over each required conceptual granularity and its descendants within the required concept range, with the average relevance higher than the lower bound, and (iv) has the lowest average item exposure rate.

This is a multiobjective optimization problem, and the objective functions are defined as follows.

The objective function of the discrimination degree is inversed to the average discrimination degree of the test sheet:

The objective function of the expected test time is the distance between the sum of expected test time and the target expected time:

The objective function of the difficulty degree is the distance between the average difficulty degree and the target difficulty degree:

Let be the average sum of relevance degree of each concept in the test sheet:

Let the generalized concept relevance denote the maximum concept relevance of a test item toward the concept or its descendent concepts:

The objective function of concept relevance is the distance between the sum of generalized concept relevance degrees and the average sum . This objective function shows the imbalance degree of the concept relevance:

The objective function of the item exposure rate is the average exposure times:

The multiobjective optimization problem is to find a test sheet that minimizes the values of all the objective functions, subject to the lower bound of the average concept relevance.
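The generalized concept relevance and the imbalance objective described in this section can be illustrated with a minimal Python sketch. The data layout (a dictionary from concept to relevance for each item, and a dictionary from concept to children for the hierarchy) and the names `subtree`, `generalized_relevance`, and `imbalance` are hypothetical; the imbalance is assumed here to be the sum of absolute deviations from the mean.

```python
def subtree(tree, concept):
    """Return `concept` together with all of its descendants."""
    nodes, stack = {concept}, [concept]
    while stack:
        for child in tree.get(stack.pop(), ()):
            if child not in nodes:
                nodes.add(child)
                stack.append(child)
    return nodes


def generalized_relevance(relevance, tree, concept):
    """Maximum relevance of one item toward `concept` or any concept below it."""
    return max(relevance.get(c, 0.0) for c in subtree(tree, concept))


def imbalance(selected, tree, targets):
    """Total distance between each concept's relevance sum and the mean sum."""
    sums = [
        sum(generalized_relevance(rel, tree, c) for rel in selected)
        for c in targets
    ]
    mean = sum(sums) / len(sums)
    return sum(abs(s - mean) for s in sums)


# Hypothetical data: concept A has children A1 and A2; B is a leaf.
tree = {"A": ["A1", "A2"]}
selected = [{"A1": 0.6, "A2": 0.9}, {"B": 0.5}]
print(imbalance(selected, tree, ["A", "B"]))
```

A value of zero means every target concept receives the same total relevance from the selected items; larger values indicate that some concepts dominate the test sheet.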

4. Methodology

To solve the Adaptive Test Sheet Generation Problem, an Adaptive Test Sheet Generation (ATSG) mechanism is proposed. The ATSG mechanism consists of a Candidate Item Selection Strategy (CISS), which adaptively determines the candidate test item set and the conceptual granularities according to the desired concept scope, and an Aggregate Objective Function (AOF), which applies a Genetic Algorithm (GA) to find an approximate solution of the mixed integer programming problem for test-sheet composition. The CISS process is illustrated in Figure 4.

4.1. Candidate Item Selection Strategy (CISS)

CISS process includes two phases: (1) specifying Concept Granularity and (2) selecting Candidate Test Item Set.

Phase 1: Specifying Concept Granularity
Concepts associated with a test sheet might be at various granularities for specific educational situations, so the conceptual granularities should be determined before generating a test sheet. Because the required concepts might be at various granularities, the most specific required concepts are selected as the target concept set to express the requirements precisely. The target concept set contains no concept that is an ancestor of another concept in the set, and the goal of the first phase is to determine the concepts in this set:

Phase 2: Selecting Candidate Test Item Set
Let θ be the candidate test item set, whose test items should be related to the target concept set. In Phase 2, test items whose related concepts lie outside the target concept set are filtered out:

In addition, the generalized concept relevance degrees of all test items toward all concepts in the target concept set are calculated.

After this phase, the search space is reduced from the whole item bank to θ.

An example of the CISS process is provided in Figure 5. In Phase 1, the most specific required concepts are selected into the target concept set. In Phase 2, only the test items that are associated with the subtrees of the concepts in the target concept set can be selected into the candidate item set θ, so the out-of-scope test items are filtered out before the optimization problem is solved.
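The two CISS phases can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the hierarchy is assumed to be a dictionary from concept to children, each item is assumed to carry a list of related concepts, and an item is assumed to stay in θ when at least one of its related concepts falls inside a target subtree. The function names are hypothetical.

```python
def subtree(tree, concept):
    """Return `concept` together with all of its descendants."""
    nodes, stack = {concept}, [concept]
    while stack:
        for child in tree.get(stack.pop(), ()):
            if child not in nodes:
                nodes.add(child)
                stack.append(child)
    return nodes


def target_concepts(tree, required):
    """Phase 1: keep required concepts that are not ancestors of other ones."""
    required = set(required)
    return {
        c for c in required
        if not ((subtree(tree, c) - {c}) & required)
    }


def candidate_items(tree, targets, item_concepts):
    """Phase 2: keep items related to a concept inside a target subtree."""
    scope = set()
    for concept in targets:
        scope |= subtree(tree, concept)
    return [item for item, concepts in item_concepts.items()
            if scope & set(concepts)]


# Hypothetical hierarchy and item bank.
tree = {"c1": ["c2", "c3"], "c2": ["c4", "c5"], "c3": ["c6"]}
item_concepts = {"q1": ["c4"], "q2": ["c5"], "q3": ["c6"], "q4": ["c1"]}
targets = target_concepts(tree, ["c2", "c3", "c4"])
print(sorted(targets), candidate_items(tree, targets, item_concepts))
```

In this example c2 is dropped from the target set because its descendant c4 is also required, and the items related only to c5 or c1 are filtered out before the GA runs.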

4.2. Aggregate Objective Function

An aggregate objective function is defined to solve the multiobjective optimization problem:

The aggregate objective function includes the discrimination score and the penalty scores of the expected time, the difficulty degree, the concept relevance, the concept relevance lower bound, and the exposure times. All scores and penalty scores are normalized to the range from 0 to 1.

The discrimination score is inversed to the objective function D(X):

The penalty score of the expected time is the distance between the sum of the expected test times and the target expected time, expressed as a percentage of the target expected time. If the penalty score is greater than 1, it is set to 1:

The penalty score of the difficulty degree is the value generated by the objective function of the difficulty degree:

The penalty score of the concept relevance balance degree is the average distance between the sum of relevance degrees and the average sum of a concept:

The penalty score of the concept relevance lower bound is greater than 0 only if the average concept relevance is lower than the concept relevance lower bound, in which case its value is the shortfall expressed as a percentage of the lower bound. If the penalty score is greater than 1, it is set to 1:

The penalty score of the exposure times is the average of the exposure times expressed as a percentage of the exposure times parameter, which denotes the maximum exposure times to be considered. If the average of the exposure times is greater than this parameter, the penalty score is set to 1:

Thus, a single aggregate objective function F(X) can be defined to integrate the discrimination score and all the penalty scores into a single objective score, as in (5.1).
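The normalized penalty scores described in this section can be sketched in Python as follows. The three penalty functions follow the verbal definitions given here; the way they are combined in `aggregate` (the score minus a weighted mean of the penalties) is purely an assumption for illustration and should not be read as the exact form of (5.1). All function and parameter names are hypothetical.

```python
def clip01(value):
    """Clamp a value to the range [0, 1]."""
    return max(0.0, min(1.0, value))


def time_penalty(times, target_time):
    """Relative distance between total expected time and the target time."""
    return clip01(abs(sum(times) - target_time) / target_time)


def lower_bound_penalty(avg_relevance, bound):
    """Relative shortfall of the average relevance below its lower bound."""
    return clip01((bound - avg_relevance) / bound)


def exposure_penalty(exposures, max_exposure):
    """Average exposure times relative to the maximum considered."""
    return clip01(sum(exposures) / len(exposures) / max_exposure)


def aggregate(score, penalties, weights=None):
    """Assumed combination: score minus the weighted mean of the penalties."""
    weights = weights or [1.0] * len(penalties)
    weighted = sum(w * p for w, p in zip(weights, penalties))
    return score - weighted / sum(weights)


# Hypothetical test sheet: three items, 50-minute target, bound 0.6, cap 10.
penalties = [
    time_penalty([10, 20, 30], 50),
    lower_bound_penalty(0.3, 0.6),
    exposure_penalty([2, 4], 10),
]
print(aggregate(0.9, penalties))
```

Clamping each penalty to [0, 1] keeps any single badly violated requirement from dominating the aggregate score, which matches the normalization stated above.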

The genetic algorithm (GA) can be applied to solve the Adaptive Test Sheet Generation Problem by maximizing the aggregate objective function F(X). The overall process of the GA is shown in Figure 4. The CISS process adaptively determines the desired concept scopes and granularities, and the out-of-scope test items, that is, the error-included concept nodes in Figure 2, are filtered out to reduce the problem space of test sheet generation. The candidate test items are encoded into chromosomes, each of which is an N-bit binary string, where N is the number of candidate test items and a bit value of 1 denotes that the corresponding test item is selected for the test sheet. In the beginning, a set of chromosomes whose bit values are randomly set is generated as the initial selection states. Then, each chromosome is evaluated by the aggregate objective function F(X). The higher the score of a chromosome, the more likely it is to be retained to produce the next generation. In the Crossover step, the chromosomes with higher F(X) scores are selected to generate new chromosomes: two chromosomes are each broken into two segments at a randomly selected position, and new chromosomes are generated by exchanging a segment between them. Then, in the Mutation step, a random bit of a random chromosome in the new generation is inverted in order to avoid becoming trapped in local optima. The process then returns to the Crossover step to produce the next generation until the iteration limit is reached. Finally, the chromosome with the highest F(X) score over the whole process is taken as the approximate solution.
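The GA loop described above can be sketched in Python as follows. It is a minimal illustration, not the experimental system: the function name `run_ga`, the default parameter values, the use of fitness-proportional selection with shifted weights, and the interpretation of the mutation rate as a per-generation probability are all assumptions.

```python
import random


def run_ga(fitness, n_bits, pop_size=30, iterations=1000,
           mutation_rate=0.1, seed=0):
    """Maximize `fitness` over bit lists of length n_bits (n_bits >= 2)."""
    rng = random.Random(seed)
    # Initial selection states: every bit is set at random.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = list(max(population, key=fitness))
    for _ in range(iterations):
        scores = [fitness(ch) for ch in population]
        low = min(scores)
        # Shift the scores so that every selection weight is positive.
        weights = [s - low + 1e-9 for s in scores]
        children = []
        while len(children) < pop_size:
            a, b = rng.choices(population, weights=weights, k=2)
            cut = rng.randint(1, n_bits - 1)  # random segment boundary
            children.append(a[:cut] + b[cut:])
            children.append(b[:cut] + a[cut:])
        population = children[:pop_size]
        # Mutation: invert one random bit of one random chromosome.
        if rng.random() < mutation_rate:
            victim = rng.choice(population)
            victim[rng.randrange(n_bits)] ^= 1
        # Remember the best chromosome seen over the whole process.
        best = list(max(population + [best], key=fitness))
    return best


# Hypothetical fitness: number of selected items (a toy stand-in for F(X)).
print(run_ga(sum, n_bits=12, iterations=200))
```

Keeping a copy of the best chromosome across generations ensures that the returned solution is never worse than the best member of the initial population, even though crossover and mutation may discard it from the current generation.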

5. Experiment and Evaluation

In order to evaluate the effectiveness of the proposed methodology in supporting the various purposes of assessment in real learning situations, three experiments have been conducted. Firstly, item banks of various sizes are used to evaluate the efficiency and fitness scores of the proposed ATSG mechanism. Secondly, various levels of target concepts are used to evaluate the performance of the ATSG mechanism and the degree to which it satisfies the required concepts. Thirdly, the exposure times of the selected test items are measured over 50 rounds of test sheet generation. The exposure times of the test items are accumulated, so the experiment can evaluate whether the ATSG mechanism avoids generating test sheets with high exposure times. For the three experiments, a control system has also been developed based on Hwang’s methodology [4], where the objective function shown in (5.1) was modified to meet the experimental requirements:

Some differences in the control system are listed as follows. (1) It does not run the CISS; all test items are considered in the GA. (2) It does not consider the exposure times of test items. (3) It does not calculate the generalized concept relevance, so the required concepts for the control group are expanded to all their descendant concepts.

The parameters of the GA algorithms used by the experimental and control systems were determined to balance the effectiveness and efficiency. In the three experiments, the GA algorithms were limited to 1,000 iterations and the mutation rate was 0.1. The population size was 30 and all initial bits of chromosomes were assigned to 0 because the amount of all test items was much larger than the amount of the selected test items.

5.1. Various Sizes of the Item Bank

Item banks containing 1,000 to 20,000 test items are used to evaluate the systems’ efficiency and effectiveness. For each item bank, 10 test sheets with randomly chosen parameters are generated by the control and experimental systems. Effectiveness is measured by the fitness score of the aggregate objective function F(X). The results are shown in Figure 6, where the experimental system has more stable and generally higher fitness scores than the control system.

The experimental result for efficiency is shown in Figure 7, where the response time of the GA increases as the item bank grows. The reason is that more candidate test items require longer chromosomes, and the time needed to process all the bits in the chromosomes grows accordingly. Of the two systems, the experimental system, which applies the CISS process to dramatically reduce the number of candidate test items, has a much shorter response time.

5.2. Various Levels of Target Concepts

This experiment demonstrates the systems’ effectiveness in generating a test sheet for a specific level of target concepts. Target concepts from the most coarse-grained level, level 1, to the most fine-grained level, level 6, are randomly chosen for the two systems. As shown in Figure 8, the concept relevance scores of the control system are much lower than those of the experimental system, especially when the concept level is fine grained. The reason is that, without filtering out-of-scope test items, it is difficult for the GA of the control system to precisely choose test items with the right concepts. Figure 9 also shows that the test sheet generated by the control system contains many out-of-scope test items, which seriously affects the test quality.

The response times in Figure 10 also reveal that the control system needs more computation time to generate a test sheet because many out-of-scope test items are also processed.

5.3. Exposure Times Measurement of Test Items

In the last experiment, 50 test sheets with similar target concept ranges are generated from an item bank containing 2,000 test items, and the test items used are recorded to calculate the exposure times of each test item. The average exposure times of the test items are shown in Figure 11, where the control system and the experimental system show no noticeable difference. According to the analysis of each test sheet, although the experimental system avoids test items with high exposure times, the average exposure times still accumulate because of the small range of target concepts. In contrast, the test sheets generated by the control system often include out-of-scope test items, so the exposure times of any single test item accumulate slowly. As a result, the exposure times of the experimental system are no better than those of the control system.

6. Discussion

The proposed ATSG mechanism is able to solve Adaptive Test Sheet Generation Problem in terms of the following aspects.

6.1. The Control of the Concept Granularity of the Test Sheets and the Prevention of the Irrelevant Problem Space

To simplify the discussion of this problem, assume that the concept tree is an $L$-level balanced tree and the number of branches in each level is $B$. Let an adaptive requirement of the test sheet contain $n$ target concepts in level $X$. By applying the CISS mechanism, the problem space of the test sheet generation problem can be reduced to $n/B^{X-1}$ of the original problem space.

Proof. Assume that $m$ items are related to a concept. The number of candidate test items in the previous research is $mB^{L-1}$. By using the candidate item selection strategy, the number of candidate test items is $mnB^{L-X}$. Thus, the ratio of the new problem space to the previous problem space is $mnB^{L-X}/(mB^{L-1}) = n/B^{X-1}$.
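The counting argument in the proof can be checked numerically with a short Python sketch. Here m is the number of items per concept, B the number of branches per level, L the number of tree levels, n the number of target concepts, and X their level; the function name and the sample values are hypothetical.

```python
from fractions import Fraction


def candidate_counts(m, B, L, n, X):
    """Return (items without CISS, items with CISS, ratio of the two)."""
    before = m * B ** (L - 1)     # m items for each of the B^(L-1) leaves
    after = m * n * B ** (L - X)  # leaves below n target concepts at level X
    return before, after, Fraction(after, before)


# Hypothetical setting: 5 items per concept, 3 branches, 4 levels,
# and 2 target concepts at level 3.
before, after, ratio = candidate_counts(m=5, B=3, L=4, n=2, X=3)
print(before, after, ratio)
```

The ratio depends only on n, B, and X, so the deeper the target concepts sit in the hierarchy, the larger the share of the item bank that the CISS process removes.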

6.2. The Generation of a Test Sheet to Precisely Fit the Target Concept Range, Difficulty, and Expected Test Time

In the new objective functions, the distances to the target thresholds are used instead of the lower and upper bounds of the previous studies. Thus, the difficulty and the expected test time can be precisely fitted. Moreover, the candidate item selection strategy and the penalty score of the concept relevance balance degree help ensure that the test sheet contains balanced target concepts. As shown in Section 5.2, the concept relevance scores of the test sheets generated by the experimental system are also much higher than those of the control system.

6.3. The Consideration of the Item Exposure Rate

The penalty score of the exposure times discourages the selection of high-exposure-rate items for the test sheet.

6.4. The Extensibility of the ATSG Mechanism

Most approaches mentioned in the related work section applied more efficient search algorithms, for example, the greedy algorithm approach [12], the tabu search algorithm [13], and the discrete particle swarm optimization algorithm [14], to enhance the computation efficiency of test sheet generation. However, these approaches have not yet taken the conceptual granularity, exposure rates, and test item filtering into account. These enhanced approaches can therefore be expected to replace Hwang’s methodology [4] to improve the efficiency of the Selecting Candidate Test Item Set phase (Figure 4) in the CISS process of the ATSG mechanism.

6.5. The Future Work of the ATSG Mechanism

According to our observations of the experimental results, the fitness score varies with the item bank size and the computation time (see Figure 6). Because the fitness scores directly affect the quality of the generated test sheet, an important new issue is how to analyze the characteristics and predict the trends of fitness scores over time and across item bank sizes in order to improve the quality of test sheet composition. However, this kind of time series problem may not be well modeled by conventional distribution models because the quality of the GA selection strategy seems to exhibit self-similarity. According to the study of Li [17], Fractal Time Series, which have the feature of Long-Range Dependence (LRD) and obey the Power Law, are a suitable mathematical approach for modeling and analyzing the features and phenomena of self-similar series [18], for example, the data series in cyber-physical networking systems [19], the time series of sea level [20] and of molecular motion on the cell membrane [21], DNA series [22], and fractal lattice geometry using the Iterated Function System (IFS) on simplexes [23]. Accordingly, in the near future, we intend to apply the fractal time series approach to analyze and model the series of fitness scores in order to characterize their self-similarity.

7. Conclusion

In this paper, an Adaptive Test Sheet Generation (ATSG) mechanism is proposed, in which the Candidate Item Selection Strategy (CISS) is introduced to reduce the problem space of test sheet composition and an Aggregate Objective Function (AOF) based on the Genetic Algorithm (GA) is modeled to find an approximate solution. In this approach, the adaptive conceptual scope and granularity and the item exposure rates are considered in order to meet the various purposes of assessment in real learning situations. Experimental results show that the ATSG mechanism generates test sheets more efficiently, precisely, and adaptively than the existing approaches in terms of conceptual scopes and computation time, while its item exposure rates are comparable to those of the control system. Furthermore, according to the experimental findings, the fractal time series approach may be applied in the near future to analyze and model the series of the GA’s fitness scores in order to characterize their self-similarity and improve the quality of test sheet composition.

Acknowledgment

This paper was partially supported by the National Science Council of the Republic of China under Grants NSC 101-2511-S-024-004-MY3 and NSC 100-2511-S-468-002.