Research Article | Open Access
Adaptive Random Testing with Combinatorial Input Domain
Random testing (RT) is a fundamental testing technique to assess software reliability, by simply selecting test cases in a random manner from the whole input domain. As an enhancement of RT, adaptive random testing (ART) has better failure‐detection capability and has been widely applied in different scenarios, such as numerical programs, some object‐oriented programs, and mobile applications. However, not much work has been done on the effectiveness of ART for the programs with combinatorial input domain (i.e., the set of categorical data). To extend the ideas to the testing for combinatorial input domain, we have adopted different similarity measures that are widely used for categorical data in data mining and have proposed two similarity measures based on interaction coverage. Then, we propose a new version named ART‐CID as an extension of ART in combinatorial input domain, which selects an element from categorical data as the next test case such that it has the lowest similarity against already generated test cases. Experimental results show that ART‐CID generally performs better than RT, with respect to different evaluation metrics.
Software testing, a major software engineering activity, is widely considered to assure the quality of software under test. Many testing methods have been developed to effectively identify software failures by actively selecting inputs (namely, test cases). Random testing (RT), a basic software testing method, simply chooses test cases at random from the set of all possible program inputs (namely, the input domain) [2, 3]. There are many advantages of using RT in software testing. For example, in addition to its simplicity and the efficiency of generating random test cases, RT allows statistical quantitative estimation of the software’s reliability. Due to these advantages, RT has been widely used to detect software failures in different scenarios, such as the testing of UNIX utilities [5, 6], SQL database systems [7, 8], Java JIT compilers, and embedded software systems. In spite of its popularity, RT is still criticized by many researchers because it uses little or no information to guide its test case generation.
Given a faulty program, two basic features are determined by the program inputs that cause the software to exhibit failure behaviors (namely, failure-causing inputs), that is, failure rate and failure pattern. Failure rate refers to the ratio between the number of failure-causing inputs and the number of all possible program inputs, while failure pattern refers to the geometry and distribution of failure regions (i.e., the regions where failure-causing inputs reside). It has been observed that failure-causing inputs tend to cluster together [11–13]. Given that failure regions are contiguous, nonfailure regions should also be contiguous. More specifically, if a test case (tc) is not a failure-causing input, test cases that are close to tc (or tc’s neighbors) may fail to reveal a failure as well. Therefore, it is intuitively appealing that test cases that spread away from tc may have a higher chance of being failure-causing than tc’s neighbors.
Briefly speaking, it is very likely that a more even spread of random test cases can improve the failure-detection effectiveness of RT. Based on this intuition, Chen et al. have proposed a novel approach, namely, adaptive random testing (ART). Similar to RT, ART also randomly generates test cases from the whole input domain. However, ART uses additional criteria to guide the test case selection for the purpose of evenly spreading test cases over the input domain. Various ART algorithms have been developed based on different test case selection criteria, such as ART by distance, ART by exclusion, ART based on evolutionary search algorithms, and ART by perturbation. Essentially, ART achieves test case diversity within the subset of test cases executed at any one time.
As an alternative to RT, ART has been successfully applied to different programs, such as numerical programs [15–18], object-oriented programs [20, 21], and mobile applications. However, not much work has been done on the effectiveness of ART for programs with combinatorial input domain (or categorical data, i.e., a Cartesian product of finite value domains for each of a finite set of parameter variables). With the popularity of the category-partition method and many guidelines to help construct categories and partitions [24–27], combinatorial input domain has been widely applied to different testing scenarios, such as configuration-aware systems [28, 29], event-driven software, and GUI-based applications. In this paper, we propose a new testing strategy called ART-CID as an extension of ART in combinatorial input domain. In order to successfully extend the ART principle into combinatorial input domain, we propose two similarity measures based on interaction coverage and also adopt different well-studied similarity measures that are popularly used for categorical data in data mining. To analyze the effectiveness of ART-CID (mainly FSCS-CID, one version of ART-CID), we compare FSCS-CID with RT through simulations and an empirical study. Experimental results show that, compared with RT, FSCS-CID not only uses fewer test cases to cover all possible combinations of parameter values at a given strength, but also requires fewer test cases to identify the first failure in real-life programs.
This paper is organized as follows. Section 2 introduces some preliminaries, including combinatorial input domain, ART, similarity measures used for combinatorial input domain, and the effectiveness measures adopted in our study. Section 3 proposes two similarity measures for combinatorial test cases based on interaction coverage. Section 4 proposes a new algorithm called ART-CID to select test cases from combinatorial input domain. Section 5 reports some experimental studies, which examine the rate of covering value combinations at a given strength and failure-detection effectiveness of our new method. Finally, Section 6 summarizes some discussions and conclusions.
In the following section, some preliminaries of combinatorial input domain, failure patterns, adaptive random testing, similarity and dissimilarity measures for combinatorial input domain, and effectiveness measure are described.
2.1. Combinatorial Input Domain
Suppose that a system under test (SUT) has a set of $k$ parameters (or categories) $P = \{p_1, p_2, \ldots, p_k\}$, which may represent user inputs, configuration parameters, internal events, and so forth. Let $V_i$ be the finite set of discrete valid values (or choices) for $p_i$ ($i = 1, 2, \ldots, k$), and let $C$ be the set of constraints on parameter value combinations. Without loss of generality, we assume that the order of parameters is fixed; that is, $p_1, p_2, \ldots, p_k$. In the remainder of this paper, we will refer to a combination of parameters as a parameter interaction, and a combination of parameter values or a parameter value combination as a value combination.
Definition 1. A test profile, denoted as $TP(k, |V_1||V_2|\cdots|V_k|, C)$, is the information on a combinatorial input domain of the SUT, including $k$ parameters, $|V_i|$ ($i = 1, 2, \ldots, k$) values for parameter $p_i$, and constraints $C$ on value combinations.
In this paper, we assume that all the parameters are independent; that is, no constraint among value combinations is considered ($C = \emptyset$), unless otherwise specified. Therefore, the test profile can be abbreviated as $TP(k, |V_1||V_2|\cdots|V_k|)$.
To clearly describe some notions and definitions, we present an example of part of the suboptions of the option “View” of the tool PDF, shown in Table 1. In this system, there are four configuration parameters, each of which has three values. Therefore, its test profile can be written as $TP(4, 3^4)$.
Definition 2. Given a $TP(k, |V_1||V_2|\cdots|V_k|, C)$, a test case or a test configuration is a $k$-tuple $(v_1, v_2, \ldots, v_k)$ where $v_i \in V_i$ ($i = 1, 2, \ldots, k$).
Intuitively speaking, a combinatorial input domain is a Cartesian product of $V_i$ for each of $p_i$; that is, $V_1 \times V_2 \times \cdots \times V_k$. Therefore, the number of all possible test cases is $\prod_{i=1}^{k} |V_i|$. For example, a 4-tuple , , , is a test case for the SUT shown in Table 1.
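As a minimal sketch of this definition, the following Python snippet enumerates a combinatorial input domain as a Cartesian product and checks its size. The parameter values are hypothetical placeholders (they are not the entries of Table 1); only the shape, four parameters with three values each, follows the example above.

```python
from itertools import product
from math import prod

# Hypothetical test profile: 4 parameters with 3 values each.
# These value names are placeholders, not the contents of Table 1.
value_sets = [
    ["a1", "a2", "a3"],
    ["b1", "b2", "b3"],
    ["c1", "c2", "c3"],
    ["d1", "d2", "d3"],
]


def input_domain(value_sets):
    """Enumerate every test case as a tuple with one value per parameter."""
    return list(product(*value_sets))


def domain_size(value_sets):
    """Size of the domain: the product of the value-set sizes."""
    return prod(len(values) for values in value_sets)


domain = input_domain(value_sets)
print(len(domain), domain_size(value_sets))
```

Enumerating the whole domain is only practical for small profiles; for larger ones, a test case would normally be sampled by picking one value per parameter at random.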
Definition 3. Given a $TP(k, |V_1||V_2|\cdots|V_k|, C)$, a $\tau$-wise value combination is a $k$-tuple involving $\tau$ parameters with fixed values (named fixed parameters) and $(k - \tau)$ parameters with arbitrary allowable values (named free parameters), where $0 \le \tau \le k$.
Generally, a $\tau$-wise value combination is also called a $\tau$-value schema, and $\tau$ is called the strength. When $\tau = k$, a $\tau$-wise value combination becomes a test case for the SUT, as it takes on a specific value for each of its parameters. For ease of description, we define the term $CombSet_\tau(tc)$ as the set of $\tau$-wise value combinations covered by the test case (tc). Intuitively speaking, a test case (tc) with $k$ parameters contains $\binom{k}{\tau}$ $\tau$-wise value combinations, that is, $|CombSet_\tau(tc)| = \binom{k}{\tau}$.
For example, considering a test case (tc) , , , , we can obtain that , , , , while , , , , , , , .
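The set of value combinations covered by a single test case can be sketched as follows. This is an illustrative encoding rather than the paper's own: each value combination is represented as a frozenset of (parameter index, value) pairs, so that parameters not listed are the free ones. The function name `comb_set` and the test case values are hypothetical.

```python
from itertools import combinations
from math import comb


def comb_set(tc, strength):
    """Return the set of strength-wise value combinations covered by tc.

    Each combination is a frozenset of (parameter index, value) pairs;
    parameters that do not appear in it are the free parameters.
    """
    indexed = list(enumerate(tc))
    return {frozenset(c) for c in combinations(indexed, strength)}


tc = ("a1", "b2", "c3", "d1")  # hypothetical test case with 4 parameters
sizes = {t: len(comb_set(tc, t)) for t in range(1, len(tc) + 1)}
print(sizes)
```

For a test case with four parameters, the sizes agree with the binomial coefficients: 4 combinations at strength 1, 6 at strength 2, 4 at strength 3, and 1 at strength 4.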
Definition 4. The number of parameters required to trigger a failure is referred to as the failure-triggering fault interaction (FTFI) number.
As we know, the faulty model in the combinatorial input domain assumes that failures are caused by parameter interactions. For instance, if the SUT shown in Table 1 fails when is set to “Single”, is set to “None,” and is not equal to “None,” this failure is caused by the parameter interaction (). Therefore, the FTFI number of this fault is 3.
2.2. Failure Patterns
Given a faulty program, two basic features can be obtained from it. One feature is failure rate, denoted by $\theta$, which refers to the ratio of the number of failure-causing inputs to the number of all possible inputs. The other feature is failure pattern, which refers to the geometric shapes and the distributions of the failure-causing regions. Both features are fixed but unknown to testers before testing.
In , the patterns of failure-causing inputs have been classified into three categories: point pattern, stripe pattern, and block pattern. An illustrative example about three types of failure patterns in a two-dimensional input domain is shown in Figure 1. In this example, suppose the input domain is consisting of parameters and where , . Point pattern means the tested program will fail when and are assigned to particular integers, that is, some specific points in the input domain, while strip pattern may be of the form , , and block pattern may be of the form , .
Figure 1: Three types of failure patterns in a two-dimensional input domain: (a) point pattern; (b) strip pattern; (c) block pattern.
In the combinatorial input domain, failure patterns of any failures belong to the point pattern as all test inputs are discrete. However, from the perspective of functionality and computation of each test input, three failure patterns shown in Figure 1 also exist in the combinatorial input domain. For example, if a failure in the SUT shown in the Table 1 is caused by “ or ” and “ or ”, we believe that the failure pattern of is a strip pattern and its failure rate is ; if a failure in the SUT is caused by “”, “”, “”, and “”, we believe that the failure pattern of is a block pattern and its failure rate is ; and if a failure is caused by a single test case , we believe that the failure region of is a point pattern and its failure rate is . According to Kuhn’s investigations [28, 34], however, the FTFI numbers are always very low (i.e., the FTFI numbers are smaller than the number of parameters), which means that the strip pattern is the most frequent failure pattern in the combinatorial input domain.
2.3. Adaptive Random Testing (ART)
The methodology of adaptive random testing (ART) [14, 15] has been proposed to enhance the failure-detection effectiveness of random testing (RT) by even-spreading test cases across the whole input domain. In ART, test cases are not only randomly generated, but also evenly spread. According to previous ART studies [15–22], ART was shown to reduce the number of test cases required to identify the first fault by as much as 50% over RT.
There are many implementations of ART based on different notions. A simple algorithm is the fixed-size-candidate-set ART (FSCS-ART). FSCS-ART implements the notion of distance as follows. FSCS-ART uses two sets of test cases, namely, the executed set $E$ and the candidate set $C$. $E$ is a set of test cases that have been executed without revealing any failure, while $C$ is a set of test cases that are randomly selected from the input domain according to the uniform distribution. $E$ is initially empty, and the first test case is randomly chosen from the input domain; $E$ is then incrementally updated with the elements selected from $C$ until a failure is revealed. From $C$, the element that is farthest away from all test cases in $E$ is chosen as the next test case; that is, the criterion is to choose the element $c_h$ from $C$ as the next test case such that
$$\min_{e \in E} \operatorname{dist}(c_h, e) \ge \min_{e \in E} \operatorname{dist}(c_j, e) \quad \text{for all } c_j \in C,$$
where dist is defined as the Euclidean distance; that is, in an $m$-dimensional input domain, for two test inputs $a = (a_1, a_2, \ldots, a_m)$ and $b = (b_1, b_2, \ldots, b_m)$,
$$\operatorname{dist}(a, b) = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2}. \tag{3}$$
The process is repeated until the desired stopping criterion is satisfied.
Figure 2 gives the illustration of FSCS-ART in a two-dimensional input domain. In Figure 2(a), there are 3 previously executed test cases , , and , and 2 randomly generated candidates and . To choose among the candidates, the distance of each candidate against each previously executed test case is calculated. Figure 2(b) describes that the closest previously executed test case is determined for each candidate. In Figure 2(c), the candidate is selected as the next test case (i.e., ), as the distance of against its nearest previously executed test case is larger than that of the candidate .
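The selection step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original implementation: the input domain is taken to be the unit hypercube, and the function names, the candidate set size of 10, and the `max_tests` cap are all hypothetical choices.

```python
import math
import random


def fscs_art_next(executed, candidates):
    """Pick the candidate whose nearest executed test case is farthest away."""
    return max(candidates, key=lambda c: min(math.dist(c, e) for e in executed))


def fscs_art(is_failure, dim=2, cand_size=10, max_tests=10000, rng=None):
    """Run FSCS-ART on the unit hypercube (an assumed input domain).

    Returns the number of test cases used to reveal the first failure,
    or None if max_tests test cases were executed without a failure.
    """
    rng = rng or random.Random()

    def sample():
        return tuple(rng.random() for _ in range(dim))

    tc = sample()  # the first test case is purely random
    executed = []
    for n in range(1, max_tests + 1):
        if is_failure(tc):
            return n
        executed.append(tc)
        candidates = [sample() for _ in range(cand_size)]
        tc = fscs_art_next(executed, candidates)
    return None
```

A call such as `fscs_art(lambda p: p[0] < 0.1 and p[1] < 0.1)` would simulate a block failure region with failure rate 0.01 in a two-dimensional domain.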
In this paper, we emphasize the extension of FSCS-ART as that of ART in combinatorial input domain, unless otherwise specified.
2.4. Similarity and Dissimilarity Measures for Combinatorial Input Domain
Measuring similarity or dissimilarity (distance) between two test inputs is a core requirement for test case selection, evaluation, and generation. Generally speaking, in numerical input domains, Euclidean distance (see (3)) is the most commonly used distance measure for continuous data. However, for a combinatorial input domain, since its parameters and corresponding values are finite and discrete, Euclidean distance may be neither applicable nor reasonable. Nevertheless, various distance measures (or dissimilarity measures) are popularly used in data mining for evaluating categorical data, for example in clustering ($k$-means), classification (KNN, SVM), and distance-based outlier detection. In this subsection, we briefly describe the measures that will be adopted later in our paper.
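To illustrate our work clearly, let us define a few terms. Consider a categorical dataset $D$ containing $N$ objects, derived from a $TP(k, |V_1||V_2|\cdots|V_k|)$ for parameters $p_1, p_2, \ldots, p_k$. We also use the following notation.
(i) $f_i(x)$ is the number of times parameter $p_i$ takes the value $x$ in $D$. Note that if $x \notin V_i$, $f_i(x) = 0$.
(ii) $\hat{p}_i(x)$ is the sample probability of parameter $p_i$ to take the value $x$ in $D$. The sample probability is given by $\hat{p}_i(x) = f_i(x)/N$.
(iii) $p_i^2(x)$ is another probability estimate of parameter $p_i$ to take the value $x$ in $D$ and is given by $p_i^2(x) = \dfrac{f_i(x)\,(f_i(x) - 1)}{N(N - 1)}$.
(iv) $S(X, Y)$ is a generalized similarity measure between two data instances denoted as $X = (x_1, x_2, \ldots, x_k)$ and $Y = (y_1, y_2, \ldots, y_k)$ where $X, Y \in D$, and $x_i, y_i \in V_i$ ($i = 1, 2, \ldots, k$). Its definition is given as follows:
$$S(X, Y) = \sum_{i=1}^{k} w_i \, S_i(x_i, y_i),$$
where $S_i(x_i, y_i)$ ($i = 1, 2, \ldots, k$) is the per-parameter similarity between two values for parameter $p_i$ and $w_i$ denotes the weight assigned to the parameter $p_i$. Therefore, we only need to present the definitions of $S_i(x_i, y_i)$ and $w_i$ for each similarity measure, unless otherwise specified.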
To directly refer to $S(X, Y)$, the measures discussed henceforth will all be in the context of similarity, with dissimilarity or distance measures being converted using the following formula:
$$S(X, Y) = \frac{1}{1 + \operatorname{dist}(X, Y)},$$
where $\operatorname{dist}(X, Y)$ is the dissimilarity measure between $X$ and $Y$.
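The frequency-based notation and the generalized similarity measure can be sketched as follows. The function names are hypothetical, and the Overlap measure (1 for a match, 0 for a mismatch, with equal weights) is used here only as the simplest example of a per-parameter similarity; other measures would plug in their own per-parameter function and weights.

```python
from collections import Counter


def frequencies(dataset, i):
    """Count how often parameter i takes each value in the dataset.

    A Counter returns 0 for values that never occur.
    """
    return Counter(row[i] for row in dataset)


def sample_probability(dataset, i, x):
    """Sample probability: frequency of x for parameter i divided by N."""
    return frequencies(dataset, i)[x] / len(dataset)


def probability_estimate(dataset, i, x):
    """Second estimate: f (f - 1) / (N (N - 1)), with f the frequency of x."""
    f, n = frequencies(dataset, i)[x], len(dataset)
    return f * (f - 1) / (n * (n - 1))


def similarity(x, y, per_param, weights):
    """Generalized similarity: weighted sum of per-parameter similarities."""
    return sum(w * per_param(i, a, b)
               for i, (a, b, w) in enumerate(zip(x, y, weights)))


def overlap(i, a, b):
    """Overlap measure: 1 for matching values, 0 for a mismatch."""
    return 1.0 if a == b else 0.0


dataset = [("a", "x"), ("a", "y"), ("b", "x")]  # hypothetical data
k = 2
print(similarity(dataset[0], dataset[1], overlap, [1 / k] * k))
```

Passing the parameter index `i` to the per-parameter function is a design choice that lets frequency-based measures look up statistics for the right parameter.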
Table 2 presents nine similarity measures for categorical parameter values, which are widely used in data mining for categorical data. In Table 2, the last column “Range” represents the range of for mismatches or matches of parameter values in each measure.
|Note. For measure Goodall1, . |
For measure Goodall2, .
For measure Lin1, .
2.5. Effectiveness Measurement
In this paper, we adopt the $F$-measure (i.e., the number of test cases required to detect the first failure) as the measurement of failure-detection effectiveness of testing methods, since previous studies have demonstrated that the $F$-measure is particularly suitable for adaptive testing strategies such as ART. Intuitively speaking, a smaller $F$-measure of ART over RT means that fewer test cases are required by ART to detect the first failure and hence implies a better failure-detection effectiveness of ART than that of RT. For the purpose of clear description, we will use the ART $F$-ratio (i.e., the ratio of ART’s $F$-measure ($F_{ART}$) relative to RT’s $F$-measure ($F_{RT}$)) to indicate the failure-detection effectiveness improvement of ART over RT.
However, it is extremely difficult to theoretically obtain ART’s $F$-measure ($F_{ART}$). Similar to all other ART studies, $F_{ART}$ is collected via simulations and empirical studies, whose procedure is described as follows. On the one hand, in simulation studies, the failure pattern (including its size and shape) and the failure rate are predefined for simulating a faulty program. The failure regions are then randomly placed inside the whole input domain. If a point inside one of the failure regions is picked by a testing strategy, a failure is said to be detected. On the other hand, for empirical studies, some faults are seeded into a subject program. Once the subject program behaves differently from its fault-seeded version, it is said that a failure is identified. The number of test cases needed to find the first failure is regarded as the $F_{ART}$ of that run. Such a process is repeated a number of times until a statistically reliable estimate of $F_{ART}$ (at the accuracy range and confidence level adopted in our paper) has been obtained. The number of runs can be determined dynamically using the same method as shown in . With respect to RT’s $F$-measure ($F_{RT}$), since test cases are chosen with replacement according to the uniform distribution, $F_{RT}$ is theoretically equal to $1/\theta$, where $\theta$ is the failure rate.
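The theoretical value for RT can be checked with a small simulation. The sketch below assumes a program whose failure rate is a given number between 0 and 1 and models each random test as an independent trial that fails with that probability; the function names, the number of runs, and the seed are hypothetical choices for illustration.

```python
import random


def rt_f_measure(theta, rng):
    """Number of uniformly random tests (with replacement) needed to hit
    the first failure, for a program whose failure rate is theta."""
    n = 1
    while rng.random() >= theta:
        n += 1
    return n


def mean_f_measure(theta, runs, seed=0):
    """Average the F-measure of RT over a number of independent runs."""
    rng = random.Random(seed)
    return sum(rt_f_measure(theta, rng) for _ in range(runs)) / runs


theta = 0.05
estimate = mean_f_measure(theta, runs=20000)
print(f"simulated mean = {estimate:.2f}, 1/theta = {1 / theta:.2f}")
```

Because the number of trials until the first success follows a geometric distribution, the simulated mean should settle close to the reciprocal of the failure rate as the number of runs grows.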
Apart from the -measure used as the measurement, another measurement is also used in our paper, that is, the number of test cases required to first cover all possible value combinations of a given strength (denoted -measure). This measurement is widely used in the combinatorial input domain. Unlike the -measure, the testing stop condition of -measure is not that the first failure is detected, but that all possible -wise value combinations are first covered. For the purpose of clear description, we use to represent this measurement for RT while for ART.
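As an illustration of this coverage-based measurement for RT, the sketch below counts how many random test cases are generated before every value combination of a given strength has been covered. The helper names and the example profile (four parameters with three values each) are assumptions made for this example.

```python
import random
from itertools import combinations
from math import prod


def total_combinations(value_sets, strength):
    """Number of distinct value combinations at the given strength."""
    k = len(value_sets)
    return sum(prod(len(value_sets[i]) for i in idx)
               for idx in combinations(range(k), strength))


def rt_tests_to_cover(value_sets, strength, rng):
    """Generate random test cases until all value combinations at the
    given strength are covered; return how many test cases were needed."""
    target = total_combinations(value_sets, strength)
    covered, n = set(), 0
    while len(covered) < target:
        tc = tuple(rng.choice(values) for values in value_sets)
        n += 1
        covered |= {frozenset(c)
                    for c in combinations(enumerate(tc), strength)}
    return n


value_sets = [[0, 1, 2]] * 4  # hypothetical profile: 4 parameters, 3 values each
needed = rt_tests_to_cover(value_sets, strength=2, rng=random.Random(0))
print(total_combinations(value_sets, 2), needed)
```

For this profile there are 54 pairwise value combinations, and each test case covers 6 of them, so at least 9 test cases are needed regardless of the method used.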
3. Two Similarity Measures Based on Interaction Coverage
Apart from various similarity measures described in Section 2.4, in this section, we propose another two similarity measures by using interaction coverage: incremental interaction coverage similarity (IICS) and multiple interaction coverage similarity (MICS), in order to apply the characteristics of combinatorial input domain to the selection of test cases. All similarity measures illustrated in Section 2.4 are used to evaluate how similar two test cases are; however, two similarity measures presented in this section are used to evaluate the resemblance of the combinatorial test case against the combinatorial test suite. We will discuss them next.
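Before introducing them, we first describe a simple similarity measure of a test case against a test suite based on interaction coverage, named normalized covered $\tau$-wise value combinations similarity (or $NCVCS_\tau$), which is widely used in combinatorial input domain.
Definition 5. Given a combinatorial test suite $T$ on $TP(k, |V_1||V_2|\cdots|V_k|)$, a combinatorial test case (tc), and the strength $\tau$, the normalized covered $\tau$-wise value combinations similarity ($NCVCS_\tau$) of tc against $T$ is defined as the ratio of the number of $\tau$-wise value combinations covered by tc that have already been covered by $T$ to $\binom{k}{\tau}$; that is,
$$NCVCS_\tau(tc, T) = \frac{|CombSet_\tau(tc) \cap CombSet_\tau(T)|}{\binom{k}{\tau}},$$
where $CombSet_\tau(tc)$ is the set of $\tau$-wise value combinations covered by tc, and $CombSet_\tau(T)$ can be written as follows:
$$CombSet_\tau(T) = \bigcup_{tc' \in T} CombSet_\tau(tc').$$
Obviously, $NCVCS_\tau$ is a function that requires the strength value $\tau$ to be set in advance, and its range is $[0, 1.0]$. Two properties of $NCVCS_\tau$ are discussed as follows.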
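A minimal sketch of this coverage-based similarity is given below. It assumes that a value combination can be encoded as a frozenset of (parameter index, value) pairs; the function names and the small binary example are hypothetical.

```python
from itertools import combinations
from math import comb


def comb_set(tc, strength):
    """Value combinations at the given strength covered by one test case."""
    return {frozenset(c) for c in combinations(enumerate(tc), strength)}


def suite_comb_set(suite, strength):
    """Union of the value combinations covered by every test in the suite."""
    covered = set()
    for t in suite:
        covered |= comb_set(t, strength)
    return covered


def ncvcs(tc, suite, strength):
    """Fraction of tc's value combinations that the suite already covers."""
    already = comb_set(tc, strength) & suite_comb_set(suite, strength)
    return len(already) / comb(len(tc), strength)


suite = [(0, 0, 0), (1, 1, 1)]  # hypothetical executed test cases
tc = (0, 0, 1)                  # hypothetical candidate
print([ncvcs(tc, suite, t) for t in (1, 2, 3)])
```

In this example every single value of the candidate already appears in the suite, only one of its three value pairs is covered, and the full triple is new, so the similarity falls from 1 at strength 1 to one third at strength 2 and 0 at strength 3.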
Theorem 6. If $NCVCS_\tau(tc, T) = 1.0$, then $NCVCS_\lambda(tc, T) = 1.0$, where $1 \le \lambda \le \tau$.
Proof. When , it can be noted that covers all possible -wise value combinations covered by , that is, Since also covers all possible value combinations at strengths lower than that are covered by tc. As a consequence, where .
Theorem 7. If $NCVCS_\tau(tc, T) = 0$, then $NCVCS_\lambda(tc, T) = 0$, where $\tau \le \lambda \le k$.
Proof. When , it can be noted that each -wise value combination covered by tc is not covered by , indicating that, for , : that is,
Therefore, the problem converts to demonstrating that , .
We suppose that such that and , that is, Due to , (13) is equivalent to the equation shown as follows: Obviously, (14) is contradictory to (12). Therefore, , , which means that where .
As we know, given a and the strength , the number of all possible -wise value combinations is fixed; that is, . In other words, there exists a test case generation method using as the criterion, which can generate a certain number of combinatorial test cases denoted as () to cover all possible -wise value combinations. However, if testing with fails to reveal any failures due to no failure-causing inputs in , the next test case generated by this method is, in fact, obtained in a random manner. The main reason is that the of each element in is equal to . Therefore, the is not particularly suitable for adaptive testing strategies such as ART. To solve this problem, we propose two similarity measures based on interaction coverage in the following subsections.
3.1. Incremental Interaction Coverage Similarity
As discussed in Theorem 6, if all possible -wise value combinations are covered by a combinatorial test suite , all possible value combinations at strengths lower than are also covered by . According to this fact, we present a new similarity measure based on interaction coverage, named incremental interaction coverage similarity (IICS).
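Given a combinatorial test suite $T$ on $TP(k, |V_1||V_2|\cdots|V_k|)$ and a combinatorial test case (tc), the incremental interaction coverage similarity of tc against $T$ is defined as follows:
$$IICS(tc, T) = NCVCS_\tau(tc, T), \tag{15}$$
where $\tau$ satisfies the following properties: $NCVCS_{\tau-1}(tc, T) = 1.0$ and $NCVCS_\tau(tc, T) < 1.0$, where $1 \le \tau \le k$ (assume $NCVCS_0(tc, T) = 1.0$).
It can be noted that if $tc \in T$, the IICS is equal to 1.0, as tc is the same as one of the elements in $T$; if $tc \notin T$, the IICS of tc against $T$ is equal to the $NCVCS_\tau$ of tc against $T$, where $\tau$ is gradually incremented. More specifically, if $T$ covers all the $(\tau - 1)$-wise value combinations and only part of the $\tau$-wise value combinations occurring in tc, $IICS(tc, T) = NCVCS_\tau(tc, T)$. Similar to $NCVCS_\tau$, the range of IICS is also $[0, 1.0]$.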
Here, we present an example to illustrate IICS. Suppose on , , and , as 1-wise value combinations are not completely covered by , and hence . Let , as covers all 1-wise value combinations and partial 2-wise value combinations occurred in , and hence .
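One possible reading of IICS can be sketched as follows: scan the strengths upward from 1 and return the coverage-based similarity at the first strength where the suite does not cover all of the candidate's value combinations. The encoding of value combinations, the function names, and the binary example are assumptions of this sketch, not the paper's own code.

```python
from itertools import combinations
from math import comb


def comb_set(tc, strength):
    """Value combinations at the given strength covered by one test case."""
    return {frozenset(c) for c in combinations(enumerate(tc), strength)}


def ncvcs(tc, suite, strength):
    """Fraction of tc's value combinations that the suite already covers."""
    covered = set()
    for t in suite:
        covered |= comb_set(t, strength)
    return len(comb_set(tc, strength) & covered) / comb(len(tc), strength)


def iics(tc, suite):
    """Similarity at the lowest strength that tc is not fully covered at."""
    for strength in range(1, len(tc) + 1):
        value = ncvcs(tc, suite, strength)
        if value < 1.0:
            return value
    return 1.0  # every combination of tc is covered, e.g. tc is in the suite


suite = [(0, 0, 0), (1, 1, 1)]  # hypothetical executed test cases
print(iics((2, 2, 2), suite), iics((0, 0, 1), suite), iics((0, 0, 0), suite))
```

The three calls mirror the cases discussed in the text: a candidate whose single values are not all covered is scored at strength 1, a candidate with full coverage at strength 1 but partial coverage at strength 2 is scored at strength 2, and a candidate already in the suite scores 1.0.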
3.2. Multiple Interaction Coverage Similarity
As shown in Section 3.1, the IICS measure begins at strength $\tau = 1$ and then updates the value of $\tau$ by $\tau = \tau + 1$. In other words, it considers different strength values when evaluating the combinatorial test case against the combinatorial test suite. However, the IICS considers only one strength value at a time rather than considering all strength values simultaneously. As a consequence, we present another similarity measure based on interaction coverage, named multiple interaction coverage similarity (MICS).
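Given a combinatorial test suite $T$ on $TP(k, |V_1||V_2|\cdots|V_k|)$ and a combinatorial test case (tc), the multiple interaction coverage similarity of tc against $T$ is defined as follows:
$$MICS(tc, T) = \sum_{\tau=1}^{k} w_\tau \, NCVCS_\tau(tc, T), \tag{16}$$
where $0 \le w_\tau \le 1.0$ and $\sum_{\tau=1}^{k} w_\tau = 1.0$.
Intuitively speaking, if $tc \in T$, $MICS(tc, T) = 1.0$. Similar to IICS, the MICS ranges from 0 to 1.0.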
Here, we present an example to explain the definition of MICS. Let on , , , and , , while .
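MICS can be sketched as a weighted sum of the coverage-based similarity over all strengths. The code below is an illustration under assumptions: value combinations are encoded as frozensets of (parameter index, value) pairs, the weights default to an equal distribution, and the function names and example values are hypothetical.

```python
from itertools import combinations
from math import comb


def comb_set(tc, strength):
    """Value combinations at the given strength covered by one test case."""
    return {frozenset(c) for c in combinations(enumerate(tc), strength)}


def ncvcs(tc, suite, strength):
    """Fraction of tc's value combinations that the suite already covers."""
    covered = set()
    for t in suite:
        covered |= comb_set(t, strength)
    return len(comb_set(tc, strength) & covered) / comb(len(tc), strength)


def mics(tc, suite, weights=None):
    """Weighted sum of the coverage similarity over strengths 1..k.

    weights[t - 1] is the weight of strength t; the weights are expected
    to sum to 1. Equal weights are used when none are given.
    """
    k = len(tc)
    if weights is None:
        weights = [1 / k] * k
    return sum(w * ncvcs(tc, suite, t)
               for t, w in zip(range(1, k + 1), weights))


suite = [(0, 0, 0), (1, 1, 1)]  # hypothetical executed test cases
print(mics((0, 0, 1), suite), mics((0, 0, 1), suite, [0.5, 0.3, 0.2]))
```

Unlike the incremental measure, every strength contributes here, which is why the computation touches all value combinations of the candidate for each evaluation.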
3.3. Properties of Two New Similarity Measures
Some properties of the proposed two similarity measures are discussed in the following subsection.
Theorem 8. If , for , and remain unchanged.
Proof. On the one hand, if (i.e., covers all possible -wise value combinations), for ,
On the other hand, if (i.e., covers all possible -wise value combinations), , . According to Theorem 6, it can be concluded that , where ; that is, covers all possible -wise value combinations. In other words, (). Therefore, for , In summary, if , for , and .
According to Theorem 8, a test case generation method using IICS or MICS as the similarity measure becomes a random generation method, when its generated test suite covers all possible -wise value combinations. The main reason is that, for any candidates, no matter whether they are included in or not, the IICS (or MICS) values of all candidates are identical.
Theorem 9. If , .
Proof. If , where because of and ; that is, all possible -wise value combinations covered by tc are not covered by . According to (15), therefore, .
As discussed before, both IICS and MICS consider different interaction coverage when evaluating combinatorial test cases. However, they have some differences. Given a combinatorial test case (tc), its IICS measure is actually calculated by the NCVCS at an appropriate strength value, which means that the IICS measure of tc only considers a single interaction coverage, while its MICS measure considers different interaction coverage at the same time. In other words, the calculation time of tc’s IICS measure is less than that of its MICS measure.
In summary, two new similarity measures based on interaction coverage (IICS and MICS) fundamentally differ from NCVCS due to the following reasons: they do not require setting the strength value in advance, and they are more suitable for adaptive strategies than NCVCS.
4. Adaptive Random Testing for Combinatorial Test Inputs
In this section, we propose a new family of methods adopting ART in combinatorial input domain, namely, ART-CID. Similar to previous ART studies, ART-CID can also be implemented according to different notions. In this paper, we present one version of ART-CID by similarity (denoted as FSCS-CID), which uses the strategy of FSCS-ART . Since the similarity measure is used in this paper, the procedure of FSCS-CID may differ from that of FSCS-ART. Detailed information will be given as follows.
4.1. Similarity-Based Test Case Selection in FSCS-CID
FSCS-CID uses two test sets, that is, the candidate set of fixed size and the executed set , each of which has the same definition as FSCS-ART. However, test cases in either or are obtained from the combinatorial input domain. For ease of description, let while . In order to select the next test case from , the criterion is described as follows: where is the similarity measure between two combinatorial test inputs. The detailed algorithm of implementing (19) is illustrated as follows (see Algorithm 1).
4.2. Algorithm of FSCS-CID
As discussed before, Algorithm 1 is used to guide the selection of the best test case. In FSCS-CID, the process of Algorithm 1 runs until the stop condition is satisfied. In this paper, we consider two stop conditions: the first software failure is detected (denoted StopCon1); and all possible value combinations at strength are covered (denoted StopCon2). Detailed algorithm of FSCS-CID is shown in Algorithm 2.
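The overall procedure can be sketched in Python as follows. This is not a reproduction of Algorithms 1 and 2; it is a minimal reading of the selection rule, in which the next test case is the candidate whose highest similarity to any executed test case is lowest. The Overlap-style pairwise similarity, the function names, the candidate set size, and the `max_tests` cap are assumptions made for illustration, and the stop condition is passed in as a callable so that either stopping rule can be plugged in.

```python
import random


def overlap_similarity(x, y):
    """Fraction of parameters on which two test cases take the same value."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def select_next(executed, candidates, sim=overlap_similarity):
    """Candidate whose highest similarity to any executed test is lowest."""
    return min(candidates, key=lambda c: max(sim(c, e) for e in executed))


def fscs_cid(value_sets, stop, cand_size=10, max_tests=1000, rng=None):
    """Generate test cases until stop(tests) holds or max_tests is reached.

    stop receives the list of test cases generated so far; the last
    element is the most recently selected test case.
    """
    rng = rng or random.Random()

    def sample():
        return tuple(rng.choice(values) for values in value_sets)

    tests = [sample()]  # the first test case is purely random
    while not stop(tests) and len(tests) < max_tests:
        candidates = [sample() for _ in range(cand_size)]
        tests.append(select_next(tests, candidates))
    return tests
```

For the first stop condition, `stop` could be `lambda tests: is_failure(tests[-1])` for some hypothetical failure predicate `is_failure`; for the second, it could check whether the generated test cases cover all value combinations at the chosen strength.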
Since the frequencies of parameter values are used in some similarity measures such as Lin, OF, and Goodall2, a fixed-size set of test cases is required in order to count the frequencies. However, the executed set is incrementally updated with the element selected from the candidate set until StopCon1 (or StopCon2) is satisfied. In this paper, we take the following strategy to construct the fixed-size set of test cases when calculating the similarity between test inputs. During the process of choosing the th () test input from as the next test case (i.e., ), each candidate requires to be measured against all elements in according to the similarity measure, and the fixed-size set of test case for is constructed by .
In this section, some experimental results, including simulations and experiments against real programs, were presented to analyze the effectiveness of FSCS-CID. We mainly compared our method to RT in terms of failure-detection effectiveness (-measure) and the rate of value combinations coverage at a given strength (-measure). For ease of describing our work clearly, we used the terms Goodall1, Goodall2, Goodall3, Goodall4, Lin, Lin1, Overlap, Eskin, OF, IICS, and MICS to, respectively, represent the similarity measure Goodall1, Goodall2, Goodall3, Goodall4, Lin, Lin1, Overlap, Eskin, OF, IICS, and MICS adopted in the FSCS-CID. Additionally, we used the term RT to represent RT.
As shown in (16), a weight is required to be assigned for interaction coverage at each strength value. There are many techniques which conduct on assigning weights; however, in this paper we focus on two distribution styles: equal distribution where each interaction coverage has the same weight, that is, ; and FTFI percentage distribution where according to previous studies [28, 34], for example, in , Kuhn et al. investigated several software projects and concluded that the interaction faults are summarized to have 29% to 82% faults as 1-wise faults (i.e., the FTFI number is 1), 6% to 47% of faults as 2-wise faults, 2% to 19% as 3-wise faults, 1% to 7% of faults as 4-wise faults, and even fewer failures beyond 4-wise interactions. As a consequence, we arrange weights as follows: , , where . For example, if , and ; if , , , and . In this paper, therefore, we use the terms MICS1 and MICS2 to stand for the MICS techniques with the above two weight distribution styles, respectively.
In the following subsection, two simulations were presented to analyze the effectiveness of FSCS-CID according to the rate of covering -wise value combinations (i.e., -measure). We used two usual test profiles and that are commonly used in previous studies .
Since the was known before testing, in this simulation, we considered the FSCS-CID using  as the similarity measure (denoted NCVCS). Except the MICS, all other methods do not require to be set. As for the MICS, different strength values from 1 to are considered to calculate the MICS measure according to (16). However, due to the known , we mainly focused on the strength values from 1 to for calculating the MICS measure. As a consequence, (16) becomes as follows: where only are considered. Each method runs until the StopCon2 is satisfied. Additionally, we consider as the metric to evaluate each method in terms of the rate of covering value combinations at strength for each method, where .
Figure 3 summarizes the number of test cases required to cover all possible $\tau$-wise value combinations (i.e., ) generated by each method for the above two designed test profiles. Based on the experimental data, we have the following observations.
(1) For each test profile, the () metric values of all FSCS-CID methods using different similarity measures are smaller than those of RT. In other words, the FSCS-CID methods require fewer test cases than RT to cover all $\tau$-wise value combinations, which means that the FSCS-CID methods have higher rates of covering value combinations than RT.
(2) Among all the FSCS-CID methods, the NCVCS is the most effective technique. The results show that the values of the NCVCS are about 30%~50% of those of the RT. The IICS has the second best metric values, followed by the OF. For , the Goodall3 is least effective, while for , the Lin is least effective.
(3) From the perspective of the similarity category, the FSCS-CID methods using the interaction-coverage-based similarity measures (including IICS, MICS, and NCVCS) perform best, while the FSCS-CID methods using the information-theoretic similarity measures (including Lin and Lin1) perform worst.
Here, we briefly analyze the above observations. Observation (1) is explained as follows. The FSCS-CID methods using different similarity measures select the next test case that has the smallest similarity value against already generated test cases, while RT simply generates test cases at random from the combinatorial input domain. As a consequence, the FSCS-CID methods spread test cases more diversely than RT over the combinatorial input domain.
Observations (2) and (3) are also easy to explain. On the one hand, since the metric is related to $\tau$-wise value combinations, the NCVCS performs best because it selects the next test case that covers as many uncovered $\tau$-wise value combinations as possible. In other words, it may have the fastest rate of covering all $\tau$-wise value combinations. On the other hand, the other two interaction-coverage-based methods, IICS and MICS, consider different strength values for generating test cases; however, both of them take the strength $\tau$ as an indispensable part. In detail, the IICS evaluates the test candidate from strength 1 to $\tau$, while the MICS considers the different strengths from 1 to $\tau$ at the same time. Hence, it is reasonable that, compared to other categories, the FSCS-CID methods using interaction-coverage-based similarity measures perform best according to the metric.
5.2. An Empirical Study
In this section, an empirical study was conducted to compare the performance of FSCS-CID and RT in practical situations, using the $F$-measure as the effectiveness metric. To describe the data clearly, we used the ART $F$-ratio, which is defined as the $F$-measure ratio between FSCS-CID and RT, that is, $F_{ART}/F_{RT}$. Intuitively speaking, a smaller ART $F$-ratio value implies a higher improvement of FSCS-CID over RT, and $(1 - \text{ART } F\text{-ratio})$ is the $F$-measure improvement of FSCS-CID over RT.
In this empirical study, we use a set of six fault-seeded C programs with 9 versions. Five subject programs, namely, count, series, tokens, ntree, and nametbl, are downloaded from Chris Lott’s website (http://www.maultech.com/chrislott/work/exp/) and have been widely used in research on combinatorial space, such as the comparison of defect revealing mechanisms, the evaluation of different combination strategies for test case selection, and fault diagnosis [40, 41]. The remaining subject programs are a series of flex programs downloaded from the Software Infrastructure Repository (SIR), which are popularly used in combinatorial test suite construction and combinatorial interaction regression testing. The model of flex used in this paper is unconstrained, which has some limitations: “We note that in a real test environment an unconstrained TSL would most likely be prohibitive in size and would not be used.”
Table 3 presents detailed information about these subject programs. The third column “LOC” represents the number of lines of executable code in these programs, and “#S.” is the number of seeded faults in each subject program, while “#D.” is the number of faults that can be detected by some test cases derived from the accompanying test profiles, which are not guaranteed to be able to detect all faults. However, in our study, we only use a portion of the detectable faults, the size of which is shown as “#U.”. The main reason is that the detectable faults that are not used have high failure rates that exceed 0.5. As we know, if the failure rate of a fault is larger than 0.5, the $F$-measure of random testing is theoretically less than 2. As a consequence, the $F$-measure of FSCS-CID depends on the first randomly selected test case. In other words, if the first test case cannot detect a failure, the $F$-measure is larger than or equal to 2. Therefore, the