Genetic Algorithm-Based Test Data Generation for Multiple Paths via Individual Sharing

Yao, Xiangjuan; Gong, Dunwei

doi:https://doi.org/10.1155/2014/591294

Computational Intelligence and Neuroscience

Research Article | Open Access

Volume 2014 | Article ID 591294 | https://doi.org/10.1155/2014/591294

Xiangjuan Yao¹ and Dunwei Gong²

Academic Editor: Jianwei Shuai

Received 28 Apr 2014

Revised 16 Sept 2014

Accepted 19 Sept 2014

Published 16 Oct 2014

Abstract

The application of genetic algorithms to automatic test data generation has attracted broad attention and achieved encouraging results in recent years. However, the efficiency of genetic algorithm-based test data generation for path testing needs to be further improved. In this paper, we establish a mathematical model of generating test data for multiple-path coverage. Then, a multipopulation genetic algorithm with individual sharing is presented to solve the established model. We not only analyze the performance of the proposed method theoretically but also apply it to various programs under test. The experimental results show that the proposed method can significantly improve the efficiency of generating test data for the coverage of many paths.

1. Introduction

One approach to improving software quality is to run a large number of tests before delivery and use in order to detect bugs or faults. Software testing is an expensive, tedious, and labor-intensive task that requires significant human effort [1]. If the testing process can be automated, it will shorten the period of software development and improve the quality of software, thereby enhancing market competitiveness. One of the most important issues in automated software testing is the generation of effective test data satisfying the selected test adequacy criteria.

It has been shown that many software testing problems can be reduced to that of generating test data for path coverage [2, 3], which can be described as follows: for a given path of a program under test, search for a test datum in the input domain of the program, such that the path traversed by the test datum is exactly the desired one.

In recent years, generating test data for complex software with genetic algorithms (GAs) has become a promising research direction and has produced many results [4]. However, most GA-based test data generation methods for path coverage cover the target paths one by one, which makes the process of test data generation inefficient.

In this study, we establish a mathematical model of generating test data for multiple-path coverage, which takes each optimization problem corresponding to one target path as a subproblem; a number of subproblems form an overall optimization problem. This model differs from existing multiobjective formulations because of the specific nature of test data generation.

On this basis, we propose a multipopulation genetic algorithm to solve the resulting optimization problem. In our algorithm, each subpopulation optimizes one subproblem, so the fitness functions of different subpopulations differ from each other. All subpopulations evolve in parallel. The key step of our algorithm is individual sharing among subpopulations; specifically, every time the evolutionary operations of a generation finish, the algorithm determines not only whether an individual is an optimal solution for the subpopulation it belongs to but also whether it is an optimal solution for the other subpopulations. In this way, the efficiency of finding optimal solutions for each subproblem improves without a substantial increase in the complexity of the algorithm.

We not only analyze the performance of the proposed method theoretically but also apply it to different programs under test for evaluation. The experimental results show that the proposed method can significantly improve the efficiency of generating test data for the coverage of many paths.

The remainder of this paper is organized as follows: Section 2 briefly reviews the related work; Section 3 gives a model of generating test data for multiple-path coverage; a multipopulation genetic algorithm is proposed to solve the model in Section 4; Section 5 analyzes the performance of the proposed algorithm theoretically; the experiments are presented in Section 6; Section 7 discusses possible threats to the validity of the proposed method; finally, the conclusion is presented in Section 8.

2. Related Work

This section provides a survey of GA-based software testing. First, some basic methods of automatic software testing are introduced. Then, we review the main work on GA-based test data generation. Finally, we discuss the challenges of path coverage testing.

2.1. Automatic Software Testing

Since software testing is highly time- and resource-consuming, many automatic approaches have been developed to facilitate the process and reduce its cost. They can be divided into four categories: random, static, dynamic, and heuristic methods.

The random method generates test data by randomly sampling the input space of the program under test [5]. This approach is simple but searches blindly, without guidance toward the test goal. Some improved methods have been proposed to increase the diversity of test data [6, 7].
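As a minimal sketch of the random method, assuming integer inputs whose domains are given as inclusive (low, high) ranges, test data can be sampled as follows. The function name `random_test_data` and the example domains are hypothetical and serve only as an illustration.

```python
import random


def random_test_data(domains, count, seed=None):
    """Sample `count` input vectors uniformly from integer ranges.

    `domains` is a list of inclusive (low, high) pairs, one per input
    variable (an assumed representation of the input domain).
    """
    rng = random.Random(seed)
    return [tuple(rng.randint(lo, hi) for lo, hi in domains)
            for _ in range(count)]


# Example: 5 random inputs for a program with two integer parameters.
samples = random_test_data([(0, 63), (-10, 10)], count=5, seed=1)
```

Because no feedback from program execution is used, the chance of hitting a path guarded by a narrow condition (for example, an equality) depends only on how large the corresponding input region is.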

Static methods require only static analysis and transformation, without actually executing the program under test; examples include symbolic execution [2, 8] and domain reduction [9]. However, these methods usually require a significant amount of algebraic manipulation and/or interval arithmetic [10].

The dynamic structural method of generating test data, first proposed by Miller and Spooner [11], actually executes the program under test in order to obtain useful information [3].

Unlike the dynamic method, the process of generating test data by a heuristic method is not completely deterministic. Heuristic methods usually resort to a heuristic algorithm, such as the genetic algorithm, simulated annealing, tabu search, or scatter search, of which the GA is the most widely used [12].

2.2. GA-Based Test Data Generation

As an efficient search-based optimization algorithm, the GA is particularly effective on highly complex problems, such as those with large search spaces, multiple peaks, or nonlinearity. Automatically generating test data with GAs has therefore become a research hotspot and has produced encouraging results [13].

Gong and Yao [14] used a GA to generate test data for statement coverage based on testability transformation. Yao et al. [15] proposed an approach to reduce target statements according to their dominant relations and the test suite covering the reduced set of target statements was generated by a GA.

Miller et al. [16] used GAs to generate test data satisfying branch coverage criterion. The experimental results show that the test suite obtained by GAs can achieve or be very close to branch coverage. Baars et al. [17] presented an algorithm for constructing fitness functions that improve the efficiency of search-based testing when trying to generate branch adequate test data. Alshraideh et al. [18] proposed a multiple-population algorithm to improve the efficiency of branch coverage testing. The experimental results showed that the proposed method outperforms the single-population algorithm significantly.

Michael et al. [19] used a GA to generate test data satisfying condition coverage criterion. In their work, the problem of test data generation is reduced to a function minimization, and the function is minimized using one of two genetic algorithms in place of the local minimization techniques.

GA-based software testing for the path coverage criterion is reviewed separately in Section 2.3.

Besides traditional structural software testing, Bühler and Wegener [20] applied an evolutionary algorithm to functional testing. Watkins and Hufnagel [21] used two GAs to generate a set of test data and then trained a decision tree on them, in order to obtain a surrogate model that distinguishes the merit of test data. Ferrer et al. [22] presented a method of automatically generating test data by considering multiple objectives: maximizing the coverage and minimizing the oracle cost.

2.3. GA-Based Path Testing

Path coverage is the strongest adequacy criterion in white-box testing. Automatically generating test data for path coverage remains a challenging problem [23].

Bueno and Jino [24] and Watkins and Hufnagel [25] each used a GA to obtain test data fulfilling path coverage. Mei and Wang [26] proposed a method that can automatically generate test cases for selected paths using a special genetic algorithm. In their algorithm, the best chromosome, called the queen, is crossed with selected drones, which enhances the exploitation of global optimal solutions.

Hermadi and Ahmed [27] observed that existing GA-based test data generators can generate only one test datum for one test goal at a time. When there are many target paths to be covered, the generator has to be run many times. In fact, individuals generated while searching for test data covering one path may happen to cover other target paths. Existing test data generators are therefore inefficient when generating test data for multiple paths.

Wegener et al. [28] developed a fully automatic GA-based test data generator for structural software testing. In their approach, all generated individuals are evaluated with regard to all unachieved partial aims. Partial aims reached by chance are identified, and the individuals with good fitness values for one or more partial aims are noted and stored for seeding the subsequent testing of uncovered targets. However, they optimized only one partial aim at a time, which means that they solved the test data generation problems one by one. Furthermore, they did not discuss whether multiple targets can be covered in one run. They also reported that full coverage was achieved for some programs but not for all.

Bueno and Jino [24] investigated methods to improve the performance of test data generation by using past input data to compose the initial population for the search. Although these methods can improve the quality of the initial population by reusing test data, they still cannot make full use of the test data generated during the evolutionary process.

Ahmed and Hermadi [29] proposed a GA-based test data generator for multiple paths. In their work, the problem of generating test data for multiple paths is regarded as a multiobjective optimization problem and solved by a multiobjective evolutionary algorithm. In fact, the problem of generating test data for multiple paths is strictly different from traditional multiobjective optimization problems. Therefore, it is necessary to establish an appropriate mathematical model for the problem of generating test data for multiple paths coverage according to its specificity and give a corresponding evolutionary solution.

Gong and Zhang [30] also proposed a test data generation method for multipath coverage. They represented a target path using Huffman encoding and designed the fitness function according to the Huffman codes of the target paths. Their method is simple and performs better than Ahmed’s method, but the fitness function cannot distinguish individuals well.

In order to stop searching as soon as all feasible paths have been covered, Hermadi et al. [31] proposed a method for determining when it is no longer worthwhile to continue searching for test data to cover uncovered target paths. Compared to searching for a standard number of generations, an average of 30–75% of total computation was avoided in test programs with infeasible paths, and no feasible paths were missed due to early termination. The extra computation in programs with no infeasible paths was negligible.

3. Mathematical Model of Test Data Generation for Multiple Paths

For convenience of exposition, we first introduce several concepts. Then, an objective function is constructed in order to transform the problem of generating test data into an optimization problem. On this basis, the optimization model of generating test data for multiple-path coverage is established.

3.1. Basic Concepts

Control Flow Graph (CFG) [1]. The CFG of a program is a directed graph $G = (N, E, s, e)$, where $N$ is the set of nodes, $E$ is the set of edges, and $s$ and $e$ are the unique entry and exit nodes of the graph, respectively. Each node $n \in N$ is a statement in the program; each edge $(n_i, n_j) \in E$ represents a transfer of control from node $n_i$ to node $n_j$.

Path [1]. A path of a CFG is a sequence $p = n_1, n_2, \ldots, n_k$, such that there exists an edge from node $n_i$ to $n_{i+1}$, $i = 1, 2, \ldots, k-1$.

For large-scale programs, the sequence of a path may be very long. For simplicity, we represent a path using a (0, 1)-string. Suppose that there are $l$ conditional statements in path $p$, denoted as $c_1, c_2, \ldots, c_l$. Define

$$b_i = \begin{cases} 1, & \text{if } p \text{ traverses the true branch of } c_i, \\ 0, & \text{otherwise,} \end{cases} \qquad i = 1, 2, \ldots, l.$$

Thus we obtain a (0, 1)-string $b_1 b_2 \cdots b_l$ of length $l$. In a given program, the mapping between a path and such a (0, 1)-string is one to one. Unless otherwise stated, a path is represented by such a (0, 1)-string in this study.

Let the input vector of the program be $x = (x_1, x_2, \ldots, x_d)$, and let the domain of $x_i$ be $D_i$; then the input domain of the program is $D = D_1 \times D_2 \times \cdots \times D_d$. When the program takes $x$ as an input, the traversed path is denoted by $p(x)$. We call the first dissimilar character of two paths their bifurcation.
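The path encoding and the notion of bifurcation can be sketched in a few lines of Python. This is only an illustration under the assumption that the outcome of each conditional statement along a path is available as a Boolean; the helper names `encode_path` and `bifurcation` are hypothetical.

```python
def encode_path(branch_outcomes):
    """Encode branch outcomes (True = true branch) as a (0, 1)-string."""
    return "".join("1" if taken else "0" for taken in branch_outcomes)


def bifurcation(path_a, path_b):
    """Return the 0-based index of the first dissimilar character.

    Returns None when the two strings are identical. If one string is a
    proper prefix of the other, the paths split where the shorter ends.
    """
    for i, (a, b) in enumerate(zip(path_a, path_b)):
        if a != b:
            return i
    if len(path_a) != len(path_b):
        return min(len(path_a), len(path_b))
    return None


# A path taking the true, false, and true branches is written "101".
target = encode_path([True, False, True])
```

For instance, the strings "1101" and "1011" agree on their first character and differ on the second, so their bifurcation is at index 1.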

3.2. Structure of Objective Function

The key problem in applying GAs to test data generation is the construction of a suitable objective function. The goodness of a candidate test datum is often expressed in terms of how close the test datum comes to fulfilling the test goal. An objective function is typically formed from two parts: approach level and branch distance [3, 24, 25].

The approach level assesses how close an execution comes to reaching the predicate that controls the test object. If $p(x) \neq p$, where $p(x)$ is the path traversed by input $x$, we define the approach level of $x$ to a target path $p$ as the number of characters from the bifurcation of $p(x)$ and $p$ to the last character of $p$, denoted by $AL(x, p)$; otherwise, we define $AL(x, p) = 0$. Thus $x$ covers path $p$ if and only if $AL(x, p) = 0$.

For example, suppose that is a target path, , and ; then and .
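A minimal sketch of the approach level on (0, 1)-strings is given below. It assumes one reading of the definition, namely that the character at the bifurcation itself is counted, so that the approach level is zero only when the target path is covered. The function name `approach_level` and the sample strings are hypothetical.

```python
def approach_level(traversed, target):
    """Approach level of a traversed path to a target path.

    Assumption: the count runs from the bifurcation (inclusive) to the
    last character of the target, so the value is 0 only on coverage.
    """
    if traversed == target:
        return 0
    for i, (a, b) in enumerate(zip(traversed, target)):
        if a != b:
            return len(target) - i
    # One string is a proper prefix of the other: the paths split where
    # the shorter one ends; keep the value positive since they differ.
    return max(len(target) - len(traversed), 1)


# Deviating at the first conditional is worse than deviating at the last.
early = approach_level("011", "111")
late = approach_level("110", "111")
```

Under this reading, `early` is 3 and `late` is 1, so an input that follows the target path longer receives a smaller (better) value.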

The branch distance assesses how close the predicate comes to evaluating either true or false branch. For example, suppose that a conditional statement is “if ,” and the aim is to execute the true branch. Suppose that the value of is after the execution of this statement with input ; then the branch distance of for branch condition is defined as follows:

Branch distances of different kinds of simple branch conditions are listed in Table 1. For a complex branch condition, branch distance is the composite of those of all simple conditions included in it, which is listed in Table 2.
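For illustration, the sketch below implements branch distances in the form commonly used in search-based testing for relational predicates and their logical combinations. It is an assumption that these agree with Tables 1 and 2; the constant `K` and all function names are hypothetical.

```python
K = 1  # assumed small positive constant added when a strict relation fails


def bd_eq(a, b):
    """Distance to making `a == b` true."""
    return abs(a - b)


def bd_ne(a, b):
    """Distance to making `a != b` true."""
    return 0 if a != b else K


def bd_lt(a, b):
    """Distance to making `a < b` true."""
    return 0 if a < b else (a - b) + K


def bd_le(a, b):
    """Distance to making `a <= b` true."""
    return 0 if a <= b else a - b


def bd_and(d1, d2):
    """Both conditions must hold, so the distances add up."""
    return d1 + d2


def bd_or(d1, d2):
    """One condition suffices, so the smaller distance counts."""
    return min(d1, d2)


# Distance to satisfying (a == 5 and b < 5) with a = 3, b = 5.
combined = bd_and(bd_eq(3, 5), bd_lt(5, 5))
```

In every case the distance is zero exactly when the condition already holds, and it shrinks as the operands move toward satisfying the condition, which is what gives the search a gradient to follow.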

We define the general objective function of input to target path as follows: where refers to the branch distance of to the conditional statement corresponding to the bifurcation of and , and function Function maps the value of to interval .

A necessary and sufficient condition for the objective value $f(x, p)$ of input $x$ to target path $p$ to be zero is that the path traversed by $x$ is $p$; that is, $p(x) = p$. Furthermore, the smaller the value of $f(x, p)$, the nearer $x$ is to the data covering $p$. So the problem of generating test data for path $p$ can be transformed into that of minimizing $f(x, p)$.

For example, see the program in Figure 1. Suppose that the target path is . There are three conditional statements in , that is, statements 1, 2, and 4, respectively. traverses the true branches of all these statements. So we also write . Suppose that , , and . We obtain , , and . Thus , , and .

In addition, deviates from the first conditional statement, so the branch distance of to is ; similarly, we get and . Thus

Although the traversed paths of and are the same, the branch distance of is smaller than that of . Thus obtains a better objective value than .
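The way the two parts interact can be sketched as follows. Two assumptions are made here and are not taken from the text: the objective is the sum of the approach level and the normalized branch distance, and the normalization is $d / (d + 1)$, which maps nonnegative distances into $[0, 1)$. The names `normalize` and `objective` are hypothetical.

```python
def normalize(distance):
    """Map a nonnegative branch distance into [0, 1) (assumed form)."""
    return distance / (distance + 1.0)


def objective(traversed, target, distance_at_bifurcation):
    """Assumed objective: approach level plus normalized branch distance.

    `distance_at_bifurcation` is the branch distance measured at the
    conditional statement where the two paths first differ.
    """
    if traversed == target:
        return 0.0
    i = next(k for k in range(len(target)) if traversed[k] != target[k])
    level = len(target) - i  # bifurcation counted inclusively (assumption)
    return level + normalize(distance_at_bifurcation)


# Two inputs traverse the same wrong path but differ in branch distance.
far = objective("011", "111", distance_at_bifurcation=3)
near = objective("011", "111", distance_at_bifurcation=1)
```

Because the normalized distance is always below 1, the approach level dominates the comparison, and the branch distance only breaks ties between inputs that deviate at the same conditional statement. Here `near` (3.5) is better than `far` (3.75).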

3.3. Mathematical Model of Generating Test Data for Multiple Paths Coverage

Let the set of target paths be $P = \{p_1, p_2, \ldots, p_n\}$; then the problem of generating test data for $P$ can be described as follows: find a test suite $X = \{x_1, x_2, \ldots, x_n\}$, such that $x_i$ traverses $p_i$, $i = 1, 2, \ldots, n$. Let the objective function for path $p_i$ constructed by the method of Section 3.2 be $f_i(x)$; then the problem of generating test data for $P$ can be transformed into the following optimization problems, where $D$ is the input domain of the program:

$$\min f_i(x), \quad \text{s.t. } x \in D, \qquad i = 1, 2, \ldots, n. \tag{6}$$

Most existing GA-based test data generation methods treat the above problems as independent optimization problems and solve them one by one. Specifically, for each optimization problem $\min f_i(x)$, a GA is run in order to find an optimal solution, which is just a test datum traversing target path $p_i$. This process is repeated until all optimization problems have been solved. If the number of target paths is $n$, the GA has to be run $n$ times.

This approach, however, does not take advantage of the fact that some of the required test data can be readily available as by-products when trying to find other test data, because different target paths have similarities. Therefore the efficiency of these methods is low when $n$ is large.

Ahmed and Hermadi [29] gave an algorithm for generating test data for multiple-path coverage, but they regarded this problem as a multiobjective optimization problem. Thus, their model is

$$\min \left( f_1(x), f_2(x), \ldots, f_n(x) \right), \quad \text{s.t. } x \in D. \tag{7}$$

In fact, the problem of generating test data for multiple paths is strictly different from traditional multiobjective optimization problems. In traditional multiobjective optimization problems, the aim is to find one solution which satisfies all objectives well. In a multiobjective environment, we often encounter conflicting objectives with some trade-off among them. But for the problem of generating test data for paths $p_1, p_2, \ldots, p_n$, what we need is a test suite $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is an optimal solution of $\min f_i(x)$, $i = 1, 2, \ldots, n$.

In addition, the number of objective functions in traditional multiobjective optimization problems remains unchanged, while that in the proposed model gradually decreases as target paths are covered. Therefore, treating the problem of generating test data as a multiobjective optimization problem has considerable limitations.

Different from existing methods, we consider the problem of generating test data for multiple-path coverage as a unified problem, in which each optimization problem corresponding to one target path is a subproblem. We solve all subproblems at the same time. Thus the problem corresponding to test data generation for multiple-path coverage can be described as follows:

$$\begin{cases} \min f_1(x) \\ \min f_2(x) \\ \quad \vdots \\ \min f_n(x) \end{cases} \quad \text{s.t. } x \in D. \tag{8}$$

This model includes $n$ subproblems, each of which is a minimization problem, and all objective functions have the same domain. We seek an algorithm that solves these problems simultaneously rather than independently. So problem (8) strictly differs from (6) and (7), and a suitable method is needed to solve it.

4. Multipopulation GA for Test Data Generation of Multiple Paths

In this section we will give a multipopulation GA to solve problem (8), which is different from traditional multipopulation GAs. The main purpose of our strategy is to expand the search range of each population by individual sharing, so as to improve the efficiency of the algorithm.

4.1. Initialization of Populations

For the $i$th optimization problem $\min f_i(x)$, randomly generate a subpopulation of size $m$, that is, $X_i(1) = \{x_{i1}(1), x_{i2}(1), \ldots, x_{im}(1)\}$, $i = 1, 2, \ldots, n$, where $x_{ij}(1)$ refers to the $j$th individual in the $i$th population of the first generation. An individual corresponds to a string by proper encoding. The population size may influence the performance of the algorithm, but this is not the focus of this study, so we simply choose an appropriate value for it.

4.2. Genetic Operations

As a typical GA, our method mainly includes three kinds of operations: selection, crossover, and mutation.

Individuals are selected according to their fitness, so that good genes have more chances to be copied to the next generation. We adopt the objective function $f_i(x)$ as the fitness of individual $x$ in the $i$th subpopulation. Because we are solving minimization problems, the smaller the fitness of an individual is, the better we consider it.

The crossover operation exchanges parts of two gene strings with a certain probability to produce two new chromosomes, while the mutation operation modifies some of the genes in a string, resulting in a new chromosome. The crossover and mutation rates are denoted by $p_c$ and $p_m$, respectively. Because parameter setting is not the focus of this work, we set the values of these parameters based on experience.

Each subpopulation implements these operations independently. In this way, individuals of the $t$th generation evolve into the $(t+1)$th generation, as shown in Figure 2.
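The three operations can be sketched for binary-string chromosomes as follows. The concrete operators (binary tournament selection, one-point crossover, bit-flip mutation) and the default rates are assumptions chosen for illustration; the text does not fix them at this point.

```python
import random


def tournament_select(population, fitness, rng, k=2):
    """Pick the contestant with the smallest fitness (minimization)."""
    return min(rng.sample(population, k), key=fitness)


def one_point_crossover(a, b, rng, pc=0.9):
    """With probability pc, swap the tails of two equal-length strings."""
    if len(a) > 1 and rng.random() < pc:
        cut = rng.randint(1, len(a) - 1)
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a, b


def bit_flip_mutation(chrom, rng, pm=0.3):
    """Flip each gene independently with probability pm."""
    return "".join(("1" if g == "0" else "0") if rng.random() < pm else g
                   for g in chrom)


def next_generation(population, fitness, rng, pc=0.9, pm=0.3):
    """Evolve one subpopulation by one generation, keeping its size."""
    children = []
    while len(children) < len(population):
        a = tournament_select(population, fitness, rng)
        b = tournament_select(population, fitness, rng)
        c1, c2 = one_point_crossover(a, b, rng, pc)
        children.append(bit_flip_mutation(c1, rng, pm))
        children.append(bit_flip_mutation(c2, rng, pm))
    return children[:len(population)]


# Toy usage: minimize the number of 1-genes in an 8-bit chromosome.
rng = random.Random(0)
pop = ["".join(rng.choice("01") for _ in range(8)) for _ in range(6)]
pop = next_generation(pop, fitness=lambda c: c.count("1"), rng=rng)
```

Each subpopulation would call `next_generation` with its own fitness function, which is what makes the subpopulations evolve independently between sharing steps.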

4.3. Individual Sharing among Different Subpopulations

The biggest difference between traditional multipopulation GAs and the proposed one is the following: in traditional multipopulation GAs, subpopulations communicate by means of individual migration, while in our method, subpopulations communicate through individual sharing. Specifically, every time the evolutionary operations of a generation finish, the algorithm determines not only whether an individual is an optimal solution for the subpopulation it belongs to but also whether it is an optimal solution for the other subpopulations. In this way, the individuals of one subpopulation are shared by all other subpopulations, and the probability of finding optimal solutions increases significantly. The implementation of individual sharing is shown in Figure 3.

Because the approach level $AL(x, p) = 0$ if and only if the path traversed by $x$ is exactly $p$, we determine whether $x$ is a desired test datum covering $p$ according to the value of $AL(x, p)$. Suppose that there are $n$ target paths $p_1, p_2, \ldots, p_n$. In our algorithm, the values of $AL(x, p_1), AL(x, p_2), \ldots, AL(x, p_n)$ can all be obtained in one run of the instrumented program with input $x$. Thus individual sharing can be realized without increasing the computational cost much.
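A minimal sketch of the sharing check is shown below. It assumes that one run of the instrumented program returns the traversed path as a (0, 1)-string, so a single run per individual is enough to test it against every remaining target. The function `share_individuals` and the toy `traversed_path` are hypothetical.

```python
def share_individuals(subpops, targets, traversed_path):
    """Check every individual against all uncovered target paths.

    subpops: dict mapping each target path to its list of individuals.
    targets: set of target paths that are still uncovered.
    traversed_path: callable running the instrumented program once and
        returning the traversed path as a (0, 1)-string.
    Returns a dict mapping each newly covered target to a covering input.
    """
    found = {}
    for individuals in subpops.values():
        for x in individuals:
            path = traversed_path(x)  # one program run per individual
            if path in targets and path not in found:
                found[path] = x
    return found


def traversed_path(x):
    """Toy instrumented program with two conditionals (illustrative)."""
    return ("1" if x > 10 else "0") + ("1" if x % 2 == 0 else "0")


subpops = {"11": [3, 4], "00": [12, 7]}
found = share_individuals(subpops, {"11", "00"}, traversed_path)
```

In this toy run the subpopulation evolving toward "11" contributes the input 3, which covers "00", while the subpopulation evolving toward "00" contributes 12, which covers "11". Without sharing, neither subpopulation would have reported a solution in this generation.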

4.4. Steps of the Algorithm

Based on the above discussion, the main steps of the proposed algorithm are shown as follows.

Step 1. Set the number of subpopulations $n$, the maximum number of generations $T$, the crossover probability $p_c$, and the mutation probability $p_m$, where $n$ is equal to the number of target paths.

Step 2. Suppose that the set of target paths is $P = \{p_1, p_2, \ldots, p_n\}$. For $i = 1, 2, \ldots, n$, randomly generate a subpopulation $X_i(1) = \{x_{i1}(1), x_{i2}(1), \ldots, x_{im}(1)\}$. Set the generation counter $t = 1$.

Step 3. For subpopulation $X_i(t)$ in the $t$th generation, calculate the approach levels $AL(x_{ij}(t), p_k)$ of individual $x_{ij}(t)$ to all target paths $p_k$ and the objective value $f_i(x_{ij}(t))$ of individual $x_{ij}(t)$ to path $p_i$, $i, k = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$.

Step 4. $f_i(x_{ij}(t))$ is used as the fitness of individual $x_{ij}(t)$ for subpopulation $X_i(t)$ to guide the evolution.

Step 5. If there is an $AL(x_{ij}(t), p_k) = 0$, which means that $x_{ij}(t)$ covers $p_k$, then $x_{ij}(t)$ is an optimal solution of the $k$th optimization subproblem. In this case, save $x_{ij}(t)$, delete $p_k$ from the target path set, and terminate the evolution of the $k$th subpopulation.

Step 6. If the number of subpopulations becomes 0, or the number of generations is larger than $T$, then stop the evolution and output the test data; otherwise, go to Step 7.

Step 7. Perform genetic operations on $X_i(t)$ to generate the offspring population $X_i(t+1)$. Let $t = t + 1$ and go to Step 3.
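The steps above can be put together in a compact sketch. Everything concrete here is an assumption made for illustration: the toy program `run_program` with three conditionals, the integer encoding of individuals, the operators inside `evolve`, the normalization $d/(d+1)$, and the parameter values.

```python
import random


def run_program(x):
    """Toy instrumented program: returns the traversed path and, for each
    conditional, the distance to taking the opposite branch."""
    a, b = x
    outcomes = [
        (a > b, (a - b) if a > b else (b - a + 1)),
        (a + b == 20, abs(a + b - 20) or 1),
        (a % 2 == 0, 1),
    ]
    path = "".join("1" if taken else "0" for taken, _ in outcomes)
    return path, [dist for _, dist in outcomes]


def fitness(x, target):
    """Approach level plus normalized branch distance (assumed form)."""
    path, dists = run_program(x)
    if path == target:
        return 0.0
    i = next(k for k in range(len(target)) if path[k] != target[k])
    return (len(target) - i) + dists[i] / (dists[i] + 1.0)


def evolve(pop, fit, rng, pc, pm):
    """One generation: tournament selection, crossover, mutation."""
    def pick():
        a, b = rng.sample(pop, 2)
        return a if fit(a) <= fit(b) else b

    children = []
    while len(children) < len(pop):
        c1, c2 = pick(), pick()
        if rng.random() < pc:  # crossover: exchange second components
            c1, c2 = (c1[0], c2[1]), (c2[0], c1[1])
        for c in (c1, c2):
            if rng.random() < pm:  # mutation: perturb one component
                i = rng.randrange(2)
                v = min(63, max(0, c[i] + rng.randint(-4, 4)))
                c = (v, c[1]) if i == 0 else (c[0], v)
            children.append(c)
    return children[:len(pop)]


def generate(targets, size=20, max_gen=200, pc=0.9, pm=0.3, seed=0):
    """Multipopulation GA with individual sharing (Steps 1 to 7)."""
    rng = random.Random(seed)
    pops = {p: [(rng.randint(0, 63), rng.randint(0, 63))
                for _ in range(size)] for p in targets}
    found = {}
    for _ in range(max_gen):
        # Sharing: every individual is checked against all open targets.
        for pop in pops.values():
            for x in pop:
                path = run_program(x)[0]
                if path in pops and path not in found:
                    found[path] = x
        for p in found:  # covered targets stop evolving
            pops.pop(p, None)
        if not pops:
            break
        for p in list(pops):
            pops[p] = evolve(pops[p], lambda x: fitness(x, p), rng, pc, pm)
    return found


targets = [format(i, "03b") for i in range(8)]
test_suite = generate(targets)
```

The returned dictionary maps each covered target path to one input that traverses it. A subpopulation is dropped as soon as any individual, from any subpopulation, covers its target, which mirrors the shrinking set of subproblems in the model.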

5. Performance Analysis

We will illustrate the performance of the proposed algorithm by analyzing its efficiency and computational complexity.

5.1. Efficiency of Algorithm

Suppose that the set of target paths is and is the subpopulation used to optimize the th subproblem, which is related to the problem of generating test data for . Let be the number of generations in which the th subpopulation finds the test datum covering path ; thus is a random variable. From experiences, we can suppose that . Let , , be the number of generations in which the th subpopulation finds the test datum covering ; then is also a random variable. Suppose that the probability of finding the test datum that covers is ; then , . For convenience to illustration, we also denote by .

If we use traditional single-objective GAs to solve (3), in the circumstance of using the same population size, the probability of finding an optimal solution within generations is , where is the distribution function of standard normal distribution. Thus the probability of all subpopulations finding their optimal solutions within generations is

If we adopt the proposed method to solve (6), then the probability of finding the test datum covering path within generations is

Thus the probability of all subpopulations finding all optimal solutions within generations is Since , we obtain

That is to say, the probability of finding all optimal solutions using the proposed algorithm is larger than that of traditional single-objective GAs. In addition, the more the number of target paths is, the more obvious the advantage of the proposed method is, which can also be easily understood by the following example.
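The comparison can be illustrated numerically. The first function follows the text: with normally distributed hitting times, the probability that all independent runs succeed within $t$ generations is a product of normal distribution functions. The second function is only an assumed simplification of sharing, in which each of the other subpopulations independently covers a given path within $t$ generations with a fixed probability `q`; it is not the formula of this section. All parameter values are hypothetical.

```python
from math import prod
from statistics import NormalDist


def p_independent(t, mus, sigmas):
    """Probability that every subproblem is solved within t generations
    when each is solved on its own, with T_i ~ N(mu_i, sigma_i^2)."""
    return prod(NormalDist(mu, s).cdf(t) for mu, s in zip(mus, sigmas))


def p_shared(t, mus, sigmas, q):
    """Assumed sharing model: path i is also covered by each of the other
    n - 1 subpopulations with independent probability q within t."""
    n = len(mus)
    return prod(
        1 - (1 - NormalDist(mu, s).cdf(t)) * (1 - q) ** (n - 1)
        for mu, s in zip(mus, sigmas)
    )


mus, sigmas = [500] * 5, [50] * 5
without_sharing = p_independent(500, mus, sigmas)
with_sharing = p_shared(500, mus, sigmas, q=0.1)
```

With these illustrative values, `without_sharing` equals $0.5^5 \approx 0.031$, while `with_sharing` is noticeably larger, because a path is missed only if its own subpopulation and all others fail. Setting `q = 0` recovers the independent case.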

Suppose that the set of target paths is and , , ; then the probabilities of finding all optimal solutions within 500 and 600 generations using traditional single-objective GAs are respectively, whereas the probabilities of finding all optimal solutions within 500 and 600 generations using the proposed algorithm are respectively. If the number of target paths increases to 10, and , , , then the probabilities of finding all optimal solutions within 500 and 600 generations using traditional single-objective GAs are respectively, whereas the probabilities of finding all optimal solutions within 500 and 600 generations using the proposed algorithm are respectively. As can be seen from these results, in circumstance with 5 target paths, the probabilities of finding all optimal solutions within 500 and 600 generations using the proposed algorithm are 0.0719 and 0.4540, respectively, which are and times those of traditional single-objective GAs; in circumstance with 10 target paths, the probabilities of finding all optimal solutions within 500 and 600 generations using the proposed algorithm are 0.0215 and 0.3182, respectively, which are and times those of traditional single-objective GAs. The above results forcefully illuminate that the proposed algorithm is more efficient than traditional single-objective GAs; moreover, with the increase of the number of target paths, the advantages become more obvious.

5.2. Computational Complexity

We will compare the computational complexity of our multipopulation genetic algorithm and those of traditional ones. Suppose that the program under test has executable statements and there are target paths . The population size is . Because can be set manually, we consider as a constant. We take the number of executed statements for the calculation of individual fitness and individual sharing in a generation as a measure of the computational complexity of an algorithm.

If we use traditional multipopulation GAs to solve the problem, which means that there is no individual sharing among subpopulations, then the program under test will be run times, which is equal to the number of all individuals. Since each run of the program probably executes statements, the number of executed statements for the run of the program under test will be . Taking the computation of the fitness as one statement, then all these individuals need to execute statements. So the number of executed statements in a generation using traditional multipopulation GAs is .

If we use the proposed method to solve the problem, which means that subpopulations share all individuals, in addition to the run of the program under test and the computation of the fitness function, we consider the computation due to individual sharing among subpopulations. Taking the computation of the approach level as a statement, individual sharing needs to execute sentences. So the number of executed sentences in a generation using the proposed method is . Under normal circumstances, is much smaller than , so . Thus

On the other hand, each subpopulation has individuals as possible solutions for each generation in traditional multipopulation GAs. But in our method, the possible solutions become for each generation via individual sharing, which is times that of traditional methods. In other words, the population size is magnified to times via individual sharing with the computation quantity almost doubling.

6. Experiments

A group of experiments are conducted so as to investigate the performance of the proposed method. In the following section, subject programs are first introduced. Afterwards, experimental design is characterized. Finally, empirical results are presented and discussed.

6.1. Subject Programs

In order to evaluate the proposed method, we selected eighteen programs for the experiments. Table 3 shows basic information on each program, including its name, size, and description, and is sorted by program size. The test subjects include not only laboratory programs but also nontrivial industrial ones, and they differ from each other in length and function. These programs have been widely used by other researchers in the literature on software testing and analysis [19, 32–34]. The number of target paths for each program is also listed in Table 3.

For each program under test, we just randomly choose a part of feasible paths to cover. If there are too many paths to be covered, we can divide them into several groups, so that the scale of paths is reasonable. In addition, if we choose infeasible paths as target ones, the performances of different methods will not be distinguished, because it is impossible for any method to generate test data covering infeasible paths. The prediction of the infeasibility of a program path is an undecidable problem, and heuristic techniques that automatically select likely feasible paths can be employed [32].

6.2. Experimental Design

When designing the experiments, we are specifically concerned with the following two issues.

Proposition 1. Can individual sharing improve the efficiency of the algorithm?

In order to verify the first proposition, we conduct two groups of experiments. In the first group of experiments, we use the proposed multipopulation GA with individual sharing to generate test data, while in the second one, different populations do not implement individual sharing but evolve independently.

Proposition 2. How is the overall performance of the proposed method?

In order to validate the overall performance of the method proposed in this study (for short, our method), we compare it with three other methods, namely, Gong’s method [30], Ahmed’s method [29], and the random method. We compare against Gong’s and Ahmed’s methods because they also address the problem of generating test data for multiple paths. In addition, the random method is a basic test technique that has been widely used, so we also adopt it as a baseline for comparison.

All methods (except random one) apply the same values of parameters, which are listed in Table 4. There are two termination criteria: one is that the test data for all target paths have been found; the other is that the number of generations has reached the maximum.

6.3. Experimental Results

In each group of experiments, we performed 30 runs for each program under test and recorded the time consumption of each run for each method, where the time consumption refers to the time needed to generate test data covering all target paths.

6.3.1. Experimental Results for Testing the Performance of Individual Sharing

The experimental results on the performance of individual sharing are listed in Table 5, in which Ave. and S.D. are the sample average and standard deviation of time consumption for each program and method, respectively. Sh.R. is the ratio of the number of test data obtained by individual sharing to the number of all test data.

It can be seen from Table 5 that, for all subject programs, the average time consumption with individual sharing is less than that without it. The smallest time consumption of the method applying individual sharing is 6.38 seconds (Bubble Sort), while that of the method without individual sharing for the same program is 11.03 seconds. The largest time consumption of the method applying individual sharing is 183.53 seconds (Barcode), while that of the second method is 265.72 seconds. The sharing rates of all programs exceed 30% except schedule (29.8%). The average sharing rate of the eight programs is 36.87%, which means that approximately one in every three test data is obtained by individual sharing. In this way, we make fuller use of the individuals generated in the evolutionary process and thereby improve the efficiency of generating test data.

We use hypothesis testing to analyze the above experimental results more rigorously. Let $X_1$ and $X_2$ denote the time consumption using and not using individual sharing, respectively (where no confusion arises, we use the same symbols for all programs under test). It can be verified that $X_1$ and $X_2$ are normally distributed random variables. Suppose that $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$. Because the sample variance is an unbiased estimate of the population variance, we take the sample standard deviations as the values of $\sigma_1$ and $\sigma_2$. Let the significance level be $\alpha$. We compare the performance of the two methods by comparing $\mu_1$ and $\mu_2$.

Step 1. Establishing hypotheses: $H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$.

Step 2. Constructing the statistic: $U = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$.

Step 3. Giving the rejection region: $U < -u_\alpha$, where $n_1 = n_2 = 30$ are the sample sizes and $u_\alpha$ is the upper $\alpha$ quantile of the standard normal distribution.

Step 4. Calculating the value of the statistic.
The values of $U$ for the different programs are listed in Table 6.

Step 5. Drawing conclusions.
From Table 6 we see that the values of $U$ are all less than $-u_\alpha$. We therefore reject the null hypothesis for all subject programs, which means that the time consumption using individual sharing is significantly less than that not using it.
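The five steps amount to a one-sided two-sample test with the standard deviations treated as known. The sketch below shows one way to compute such a statistic from summary values (sample average, standard deviation, number of runs). The function names are hypothetical, and in the example only the two averages (6.38 and 11.03 seconds for Bubble Sort) come from the text; the standard deviations and the significance level of 0.05 are placeholder assumptions, not values reported in the paper.

```python
import math
from statistics import NormalDist


def u_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample statistic with the standard deviations treated as known."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def rejects_null(u, alpha):
    """One-sided test: reject 'mean1 >= mean2' when u falls below -u_alpha."""
    critical = -NormalDist().inv_cdf(1 - alpha)
    return u < critical


# Averages taken from the text (Bubble Sort); the standard deviations 1.5 and
# 2.0 and alpha = 0.05 are ASSUMED placeholders for illustration only.
u = u_statistic(6.38, 1.5, 30, 11.03, 2.0, 30)
print(u, rejects_null(u, alpha=0.05))
```

A strongly negative value of the statistic indicates that the first method needs less time on average; values near zero do not lead to rejection of the null hypothesis.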

6.3.2. Experimental Results for Testing the Proposed Method

The experimental results of comparing the different methods are listed in Table 7. The symbols have the same meanings as in Table 5. We again use hypothesis testing to analyze the experimental results. Three test statistics are reported, obtained by comparing our method with Gong’s method, with Ahmed’s method, and with the random method, respectively.

It can be seen from Table 7 that, for all subject programs, the average time consumption of our method is less than that of Gong’s, Ahmed’s, and the random methods. The smallest time consumption of our method is 5.85 seconds (Bubble Sort), while that of Gong’s, Ahmed’s, and the random methods for the same program is 8.79, 9.85, and 12.32 seconds, respectively. The largest time consumption of our method is 192.82 seconds (Barcode), while that of Gong’s, Ahmed’s, and the random methods is 224.87, 289.67, and 316.42 seconds, respectively. Gong’s and Ahmed’s methods perform better than the random method, but both perform worse than ours. The test statistics for the comparisons with Ahmed’s method and the random method are below the critical value for all programs. Those for the comparison with Gong’s method are below the critical value for all programs except three, namely, Comn, Splinge, and Printtok. For these three programs, the average time consumption of our method is still less than that of Gong’s method, although the difference is not statistically significant. We therefore conclude that the time consumption of our method is significantly less than that of Ahmed’s and the random methods for all programs, and significantly less than that of Gong’s method for all programs except these three.

7. Threats to Validity

The present study focuses on generating test data for multiple paths coverage. One possible threat to the validity of the results relates to parameter settings. The parameter settings of a GA influence its performance in generating test data, and appropriate choices can improve the efficiency of the algorithm. However, parameter tuning is not the focus of this study; thus we simply set the parameter values based on our experience. The second threat relates to the software systems used. Possible bugs or errors, different program conversions, and different test objectives may also influence the obtained results. Additionally, the selection of target paths may have influenced the obtained results.

8. Conclusion

We establish a mathematical model that reasonably reflects the problem of generating test data for multiple paths coverage. On this basis, a multipopulation GA is presented to solve the problem in the model. The main idea of this algorithm, which differs markedly from traditional multipopulation GAs, is to improve the search efficiency by means of individual sharing among different subpopulations. In addition, we not only prove the efficiency of our method theoretically but also apply it to various programs under test. The experimental results show that our method has significant advantages over Ahmed’s multiobjective method and the random method. The algorithm proposed in this study enriches the theory and technique of GA-based test data generation and provides a new way to improve the efficiency of software testing.

Possible directions for future research are as follows: one is a method for generating test data when the number of target paths is very large; the other is the establishment of a test platform based on our method.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by Natural Science Foundation of China (nos. 61203304, 61375067), Natural Science Foundation of Jiangsu Province (no. BK2012566), and the Fundamental Research Funds for the Central Universities (no. 2012QNA41).

References

B. Beizer, Software Testing Techniques, International Thomson Computer Press, 1990.
L. A. Clarke, “A system to generate test data and symbolically execute programs,” IEEE Transactions on Software Engineering, vol. 2, no. 3, pp. 215–222, 1976.
B. Korel, “Automated software test data generation,” IEEE Transactions on Software Engineering, vol. 16, no. 8, pp. 870–879, 1990.
R. Malhotra and M. Khari, “Heuristic search-based approach for automated test data generation: a survey,” International Journal of Bio-Inspired Computation, vol. 5, no. 1, pp. 1–18, 2013.
R. Hamlet, “Random testing,” in Encyclopedia of Software Engineering, pp. 970–978, Wiley, 1994.
P. M. S. Bueno, M. Jino, and W. E. Wong, “Diversity oriented test data generation using metaheuristic search techniques,” Information Sciences, vol. 259, pp. 490–509, 2014.
A. Arcuri and L. Briand, “A Hitchhiker's guide to statistical tests for assessing randomized algorithms in software engineering,” Software Testing Verification and Reliability, vol. 24, no. 3, pp. 219–250, 2014.
K. Ma, K. Y. Phang, J. S. Foster et al., Static Analysis, Springer, Berlin, Germany, 2011.
A. J. Offutt, Z. Jin, and J. Pan, “The dynamic domain reduction procedure for test data generation,” Software: Practice and Experience, vol. 29, no. 2, pp. 167–193, 1999.
D. Zhang, C. Nie, and B. Xu, “Optimal allocation of test case considering testing-resource in partition testing,” Journal of Nanjing University (Natural Sciences), vol. 41, no. 5, pp. 553–561, 2005.
W. Miller and D. L. Spooner, “Automatic generation of floating-point test data,” IEEE Transactions on Software Engineering, vol. 2, no. 3, pp. 223–226, 1976.
C. Blum and A. Roli, “Metaheuristics in combinatorial optimization: overview and conceptual comparison,” ACM Computing Surveys, vol. 35, no. 3, pp. 268–308, 2003.
P. McMinn, “Search-based software test data generation: a survey,” Software Testing Verification and Reliability, vol. 14, no. 2, pp. 105–156, 2004.
D. W. Gong and X. J. Yao, “Testability transformation based on equivalence of target statements,” Neural Computing and Applications, vol. 21, no. 8, pp. 1871–1882, 2012.
X. J. Yao, D. W. Gong, Y. J. Luo, and M. Li, “Test data reduction based on dominance relations of target statements,” in Proceedings of IEEE World Congress on Computational Intelligence, pp. 2191–2198, Springer, 2012.
J. Miller, M. Reformat, and H. Zhang, “Automatic test data generation using genetic algorithm and program dependence graphs,” Information and Software Technology, vol. 48, no. 7, pp. 586–605, 2006.
A. Baars, M. Harman, Y. Hassoun et al., “Symbolic search-based testing,” in Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE '11), pp. 53–62, IEEE, Lawrence, Kan, USA, November 2011.
M. Alshraideh, B. A. Mahafzah, and S. Al-Sharaeh, “A multiple-population genetic algorithm for branch coverage test data generation,” Software Quality Journal, vol. 19, no. 3, pp. 489–513, 2011.
C. C. Michael, G. McGraw, and M. A. Schatz, “Generating software test data by evolution,” IEEE Transactions on Software Engineering, vol. 27, no. 12, pp. 1085–1110, 2001.
O. Bühler and J. Wegener, “Evolutionary functional testing,” Computers and Operations Research, vol. 35, no. 10, pp. 3144–3160, 2008.
A. Watkins, E. M. Hufnagel, D. Berndt, and L. Johnson, “Using genetic algorithms and decision tree induction to classify software failures,” International Journal of Software Engineering and Knowledge Engineering, vol. 16, no. 2, pp. 269–291, 2006.
J. Ferrer, F. Chicano, and E. Alba, “Evolutionary algorithms for the multi-objective test data generation problem,” Software: Practice and Experience, vol. 42, no. 11, pp. 1331–1362, 2012.
I. Hermadi, C. Lokan, and R. Sarker, “Genetic algorithm based path testing: challenges and key parameters,” in Proceedings of the 2nd WRI World Congress on Software Engineering (WCSE ’10), pp. 241–244, Wuhan, China, December 2010.
P. M. S. Bueno and M. Jino, “Automatic test data generation for program paths using genetic algorithms,” International Journal of Software Engineering and Knowledge Engineering, vol. 12, no. 6, pp. 691–709, 2002.
A. Watkins and E. M. Hufnagel, “Evolutionary test data generation: a comparison of fitness functions,” Software: Practice and Experience, vol. 36, no. 1, pp. 95–116, 2006.
J. Mei and S. Y. Wang, “An improved genetic algorithm for test cases generation oriented paths,” Chinese Journal of Electronics, vol. 23, no. 3, pp. 494–498, 2014.
I. Hermadi and M. A. Ahmed, “Genetic algorithm based test data generator,” in Proceedings of the Congress on Evolutionary Computation (CEC '03), pp. 85–91, Canberra, Australia, December 2003.
J. Wegener, A. Baresel, and H. Sthamer, “Evolutionary test environment for automatic structural testing,” Information and Software Technology, vol. 43, no. 14, pp. 841–854, 2001.
M. A. Ahmed and I. Hermadi, “GA-based multiple paths test data generator,” Computers and Operations Research, vol. 35, no. 10, pp. 3107–3124, 2008.
D. W. Gong and Y. Zhang, “Novel evolutionary generation approach to test data for multiple paths coverage,” Acta Electronica Sinica, vol. 38, no. 6, pp. 1299–1304, 2010.
I. Hermadi, C. Lokan, and R. Sarker, “Dynamic stopping criteria for search-based test data generation for path testing,” Information and Software Technology, vol. 56, no. 4, pp. 395–407, 2014.
P. M. S. Bueno and M. Jino, “Identification of potentially infeasible program paths by monitoring the search for test data,” in Proceedings of IEEE International Conference on Automated Software Engineering (ASE '00), pp. 209–218, Grenoble, France, 2000.
R. L. Becerra, R. Sagarna, and X. Yao, “An evaluation of differential evolution in software test data generation,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC ’09), pp. 2850–2857, Trondheim, Norway, May 2009.
A. Pachauri and G. Srivastava, “Automated test data generation for branch testing using genetic algorithm: An improved approach using branch ordering, memory and elitism,” Journal of Systems and Software, vol. 86, no. 5, pp. 1191–1208, 2013.

Copyright

Copyright © 2014 Xiangjuan Yao and Dunwei Gong. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
