Abstract
Random testing (RT) is a fundamental testing technique to assess software reliability, by simply selecting test cases in a random manner from the whole input domain. As an enhancement of RT, adaptive random testing (ART) has better failure‐detection capability and has been widely applied in different scenarios, such as numerical programs, some object‐oriented programs, and mobile applications. However, not much work has been done on the effectiveness of ART for the programs with combinatorial input domain (i.e., the set of categorical data). To extend the ideas to the testing for combinatorial input domain, we have adopted different similarity measures that are widely used for categorical data in data mining and have proposed two similarity measures based on interaction coverage. Then, we propose a new version named ART‐CID as an extension of ART in combinatorial input domain, which selects an element from categorical data as the next test case such that it has the lowest similarity against already generated test cases. Experimental results show that ART‐CID generally performs better than RT, with respect to different evaluation metrics.
1. Introduction
Software testing, a major software engineering activity, is widely considered to assure the quality of software under test [1]. Many testing methods have been developed to effectively identify software failures by actively selecting inputs (namely, test cases). Random testing (RT), a basic software testing method, simply chooses test cases at random from the set of all possible program inputs (namely, the input domain) [2, 3]. There are many advantages of using RT in software testing. For example, in addition to simplicity and the efficiency of generating random test cases [2], RT allows statistical quantitative estimation of software’s reliability [4]. Due to these advantages, RT has been widely used to detect software failures in different scenarios, such as the testing of UNIX utilities [5, 6], SQL database systems [7, 8], Java JIT compilers [9], and embedded software systems [10]. In spite of the popularity, RT is still criticized by many researchers due to little or no information to guide its test case generation.
Given a faulty program, two basic features are determined by the program inputs that cause the software to exhibit failure behaviors (namely, failure-causing inputs), that is, failure rate and failure pattern. Failure rate refers to the ratio between the number of failure-causing inputs and the number of all possible program inputs, while failure pattern refers to the geometry and distribution of failure regions (i.e., the regions where failure-causing inputs reside). It has been observed that failure-causing inputs tend to cluster together [11–13]. Given that failure regions are contiguous, nonfailure regions should also be contiguous. More specifically, if a test case (tc) is not a failure-causing input, test cases that are close to tc (or tc’s neighbors) may fail to reveal a failure as well. Therefore, it is intuitively appealing that test cases that spread away from tc may have a higher chance of being failure-causing than tc’s neighbors.
Briefly speaking, it is very likely that a more even spread of random test cases can improve the failure-detection effectiveness of RT. Based on this intuition, Chen et al. [14] have proposed a novel approach, namely, adaptive random testing (ART). Similar to RT, ART also randomly generates test cases from the whole input domain. But ART uses additional criteria to guide the test case selection for the purpose of evenly spreading test cases over the input domain. Various ART algorithms have been developed based on different test case selection criteria, such as ART by distance [15], ART by exclusion [16], ART based on evolutionary search algorithms [17], and ART by perturbation [18]. Essentially, ART achieves test case diversity with the subset of test cases executed at any one time [19].
As an alternative to RT, ART has been successfully applied to different programs, such as numerical programs [15–18], object-oriented programs [20, 21], and mobile applications [22]. However, not much work has been done on the effectiveness of ART for programs with a combinatorial input domain (or categorical data, i.e., a Cartesian product of finite value domains for each of a finite set of parameter variables). With the popularity of the category-partition method [23] and many guidelines to help construct categories and partitions [24–27], combinatorial input domains have been widely applied to different testing scenarios, such as configuration-aware systems [28, 29], event-driven software [30], and GUI-based applications [31]. In this paper, we propose a new testing strategy called ART-CID as an extension of ART in the combinatorial input domain. In order to successfully extend the ART principle into the combinatorial input domain, we propose two similarity measures based on interaction coverage and also adopt different well-studied similarity measures that are popularly used for categorical data in data mining [32]. To analyze the effectiveness of ART-CID (mainly FSCS-CID, one version of ART-CID), we compare FSCS-CID with RT through simulations and an empirical study. Experimental results show that, compared with RT, FSCS-CID not only uses fewer test cases to cover all possible combinations of parameter values at a given strength, but also requires fewer test cases to identify the first failure in real-life programs.
This paper is organized as follows. Section 2 introduces some preliminaries, including combinatorial input domain, ART, similarity measures used for combinatorial input domain, and the effectiveness measures adopted in our study. Section 3 proposes two similarity measures for combinatorial test cases based on interaction coverage. Section 4 proposes a new algorithm called ART-CID to select test cases from a combinatorial input domain. Section 5 reports some experimental studies, which examine the rate of covering value combinations at a given strength and the failure-detection effectiveness of our new method. Finally, Section 6 summarizes some discussions and conclusions.
2. Preliminaries
In the following section, some preliminaries of combinatorial input domain, failure patterns, adaptive random testing, similarity and dissimilarity measures for combinatorial input domain, and effectiveness measure are described.
2.1. Combinatorial Input Domain
Suppose that a system under test (SUT) has a set of $k$ parameters (or categories) $P = \{p_1, p_2, \ldots, p_k\}$, which may represent user inputs, configuration parameters, internal events, and so forth. Let $V_i$ be the finite set of discrete valid values (or choices) for $p_i$ ($i = 1, 2, \ldots, k$), and let $\mathcal{C}$ be the set of constraints on parameter value combinations. Without loss of generality, we assume that the order of parameters is fixed; that is, $P = \langle p_1, p_2, \ldots, p_k \rangle$. In the remainder of this paper, we will refer to a combination of parameters as a parameter interaction, and a combination of parameter values or a parameter value combination as a value combination.
Definition 1. A test profile, denoted as $TP(k, |V_1||V_2|\cdots|V_k|, \mathcal{C})$, is about the information on a combinatorial input domain of the SUT, including $k$ parameters, $|V_i|$ ($i = 1, 2, \ldots, k$) values for parameter $p_i$, and constraints $\mathcal{C}$ on value combinations.
In this paper, we assume that all the parameters are independent; that is, no constraint among value combinations is considered ($\mathcal{C} = \emptyset$), unless otherwise specified. Therefore, the test profile can be abbreviated as $TP(k, |V_1||V_2|\cdots|V_k|)$.
To clearly describe some notions and definitions, we present an example based on part of the suboptions of the option “View” of a PDF tool, shown in Table 1. In this system, there are four configuration parameters, each of which has three values. Therefore, its test profile can be written as $TP(4, 3^4)$.
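Definition 2. Given a $TP(k, |V_1||V_2|\cdots|V_k|, \mathcal{C})$, a test case or a test configuration is a $k$-tuple $(v_1, v_2, \ldots, v_k)$ where $v_i \in V_i$ ($i = 1, 2, \ldots, k$).
Intuitively speaking, a combinatorial input domain is a Cartesian product of $V_i$ for each of $p_i$; that is, $V_1 \times V_2 \times \cdots \times V_k$. Therefore, the number of all possible test cases is $\prod_{i=1}^{k} |V_i|$. For example, a 4-tuple that takes one value for each of the four parameters is a test case for the SUT shown in Table 1.
Definition 3. Given a $TP(k, |V_1||V_2|\cdots|V_k|, \mathcal{C})$, a $\tau$-wise value combination is a $k$-tuple involving $\tau$ parameters with fixed values (named fixed parameters) and $(k - \tau)$ parameters with arbitrary allowable values (named free parameters), where $0 \le \tau \le k$ and each fixed parameter $p_i$ takes a value from $V_i$.
Generally, a $\tau$-wise value combination is also called a $\tau$-value schema [33], and $\tau$ is called the strength. When $\tau = k$, a $\tau$-wise value combination becomes a test case for the SUT as it takes on a specific value for each of its parameters. For ease of description, we define a term $CombSet_\tau(tc)$ as the set of $\tau$-wise value combinations covered by the test case (tc). Intuitively speaking, a test case (tc) with $k$ parameters contains $\binom{k}{\tau}$ $\tau$-wise value combinations, that is, $|CombSet_\tau(tc)| = \binom{k}{\tau}$.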
For example, considering a test case (tc) , , , , we can obtain that , , , , while , , , , , , , .
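The counting argument above can be checked with a short sketch. It is only an illustration: the function name `comb_set` and the parameter values in the sample test case are hypothetical (they are not taken from Table 1), and each value combination is represented as a tuple of (parameter index, value) pairs, with free parameters simply left out.

```python
from itertools import combinations
from math import comb


def comb_set(tc, tau):
    """Return the set of tau-wise value combinations covered by test case tc.

    Each combination is a tuple of (parameter_index, value) pairs for the
    fixed parameters; free parameters are omitted from the tuple.
    """
    return {
        tuple((i, tc[i]) for i in idx)
        for idx in combinations(range(len(tc)), tau)
    }


# Hypothetical 4-parameter test case (values chosen only for illustration).
tc = ("Single", "None", "Fit", "Light")
for tau in range(1, len(tc) + 1):
    # A test case with k parameters covers C(k, tau) tau-wise combinations.
    print(tau, len(comb_set(tc, tau)), comb(len(tc), tau))
```

For a 4-parameter test case the two printed counts agree at every strength from 1 to 4.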
Definition 4. The number of parameters required to trigger a failure is referred to as the failure-triggering fault interaction (FTFI) number.
As we know, the fault model in the combinatorial input domain assumes that failures are caused by parameter interactions. For instance, if the SUT shown in Table 1 fails when one parameter is set to “Single”, a second parameter is set to “None”, and a third parameter is not equal to “None”, this failure is caused by the interaction of these three parameters. Therefore, the FTFI number of this fault is 3.
In [28, 34], Kuhn et al. investigated interaction failures by analyzing the fault reports of several software projects and concluded that failures are always caused by low FTFI numbers.
2.2. Failure Patterns
Given a faulty program, two basic features can be obtained from it. One feature is failure rate, denoted by $\theta$, which refers to the ratio of the number of failure-causing inputs to the number of all possible inputs. The other feature is failure pattern, which refers to the geometric shapes and the distributions of the failure-causing regions. Both features are fixed but unknown to testers before testing.
In [14], the patterns of failure-causing inputs have been classified into three categories: point pattern, strip pattern, and block pattern. An illustrative example of the three types of failure patterns in a two-dimensional input domain is shown in Figure 1. In this example, the input domain consists of two bounded numeric parameters. Point pattern means the tested program will fail when both parameters are assigned particular integers, that is, at some specific points in the input domain, while the strip pattern and the block pattern are described by range conditions on the parameters.
Figure 1: Three types of failure patterns in a two-dimensional input domain: (a) point pattern; (b) strip pattern; (c) block pattern.
In the combinatorial input domain, failure patterns of any failures belong to the point pattern as all test inputs are discrete. However, from the perspective of functionality and computation of each test input, three failure patterns shown in Figure 1 also exist in the combinatorial input domain. For example, if a failure in the SUT shown in the Table 1 is caused by “ or ” and “ or ”, we believe that the failure pattern of is a strip pattern and its failure rate is ; if a failure in the SUT is caused by “”, “”, “”, and “”, we believe that the failure pattern of is a block pattern and its failure rate is ; and if a failure is caused by a single test case , we believe that the failure region of is a point pattern and its failure rate is . According to Kuhn’s investigations [28, 34], however, the FTFI numbers are always very low (i.e., the FTFI numbers are smaller than the number of parameters), which means that the strip pattern is the most frequent failure pattern in the combinatorial input domain.
2.3. Adaptive Random Testing (ART)
The methodology of adaptive random testing (ART) [14, 15] has been proposed to enhance the failure-detection effectiveness of random testing (RT) by evenly spreading test cases across the whole input domain. In ART, test cases are not only randomly generated, but also evenly spread. According to previous ART studies [15–22], ART was shown to reduce the number of test cases required to identify the first failure by as much as 50% over RT.
There are many implementations of ART based on different notions. A simple algorithm is the fixed-size-candidate-set ART (FSCS-ART) [15], which implements the notion of distance as follows. FSCS-ART uses two sets of test cases, namely, the executed set $E$ and the candidate set $C$. $E$ is a set of test cases that have been executed without revealing any failure, while $C$ is a set of test cases that are randomly selected from the input domain according to the uniform distribution. $E$ is initially empty; the first test case is randomly chosen from the input domain, and $E$ is then incrementally updated with the elements selected from $C$ until a failure is exhibited. From $C$, the element that is farthest away from all test cases in $E$ is chosen as the next test case; that is, the criterion is to choose the element $c_h$ from $C$ as the next test case such that
$$\forall c \in C:\ \min_{e \in E} dist(c_h, e) \ge \min_{e \in E} dist(c, e), \quad (2)$$
where dist is defined as the Euclidean distance; that is, in an $m$-dimensional input domain, for two test inputs $a = (a_1, a_2, \ldots, a_m)$ and $b = (b_1, b_2, \ldots, b_m)$,
$$dist(a, b) = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2}. \quad (3)$$
The process is repeated until the desired stopping criterion is satisfied.
Figure 2 illustrates FSCS-ART in a two-dimensional input domain. In Figure 2(a), there are 3 previously executed test cases and 2 randomly generated candidates. To choose among the candidates, the distance of each candidate against each previously executed test case is calculated. Figure 2(b) shows that the closest previously executed test case is determined for each candidate. In Figure 2(c), the candidate whose distance against its nearest previously executed test case is larger than that of the other candidate is selected as the next test case.
Figure 2: Illustration of FSCS-ART in a two-dimensional input domain (panels (a) to (c)).
In this paper, unless otherwise specified, we focus on extending FSCS-ART as the representative of ART in the combinatorial input domain.
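The FSCS-ART procedure described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of [15]: the input domain is taken to be the unit hypercube, and the names `fscs_art_next`, `fscs_art`, `is_failure`, `num_candidates`, and `max_tests` are hypothetical.

```python
import math
import random


def fscs_art_next(executed, candidates):
    """Pick the candidate whose nearest executed test case is farthest away."""
    return max(candidates,
               key=lambda c: min(math.dist(c, e) for e in executed))


def fscs_art(is_failure, dim=2, num_candidates=10, max_tests=1000, rng=None):
    """Run FSCS-ART on the unit hypercube (an assumed input domain).

    Returns the number of test cases executed up to and including the first
    failure, or None if no failure is found within max_tests test cases.
    """
    rng = rng or random.Random()
    executed = []
    # The first test case is chosen purely at random.
    tc = tuple(rng.random() for _ in range(dim))
    while True:
        if is_failure(tc):
            return len(executed) + 1
        executed.append(tc)
        if len(executed) >= max_tests:
            return None
        candidates = [tuple(rng.random() for _ in range(dim))
                      for _ in range(num_candidates)]
        tc = fscs_art_next(executed, candidates)


# Example: a strip-like failure region covering 10% of the unit square.
print(fscs_art(lambda t: t[0] < 0.1, rng=random.Random(1)))
```

The selection step costs one distance computation per (candidate, executed test case) pair, which is why the candidate set is kept at a fixed size.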
2.4. Similarity and Dissimilarity Measures for Combinatorial Input Domain
Measuring similarity or dissimilarity (distance) between two test inputs is a core requirement for test case selection, evaluation, and generation. Generally speaking, in numerical input domains, Euclidean distance (see (3)) is the most commonly used distance measure for continuous data. However, for a combinatorial input domain, since its parameters and corresponding values are finite and discrete, Euclidean distance may not be applicable or reasonable. Nevertheless, various distance measures (or dissimilarity measures) are popularly used in data mining tasks on categorical data [32], such as clustering (k-means), classification (KNN, SVM), and distance-based outlier detection. In this subsection, we briefly describe the measures that will be adopted later in our paper.
To illustrate our work clearly, let us define a few terms. Consider a categorical dataset $D$ containing $N$ objects, derived from a test profile with parameters $p_1, p_2, \ldots, p_k$. We also use the following notation.
(i) $f_i(x)$ is the number of times parameter $p_i$ takes the value $x$ in $D$. Note that if $x \notin V_i$, $f_i(x) = 0$.
(ii) $\hat{p}_i(x)$ is the sample probability of parameter $p_i$ to take the value $x$ in $D$. The sample probability is given by $\hat{p}_i(x) = f_i(x)/N$.
(iii) $p_i^2(x)$ is another probability estimate of parameter $p_i$ to take the value $x$ in $D$ and is given by $p_i^2(x) = \frac{f_i(x)\,(f_i(x) - 1)}{N(N - 1)}$.
(iv) $S(X, Y)$ is a generalized similarity measure between two data instances denoted as $X = (x_1, x_2, \ldots, x_k)$ and $Y = (y_1, y_2, \ldots, y_k)$, where $x_i, y_i \in V_i$ ($i = 1, 2, \ldots, k$). Its definition is given as follows: $S(X, Y) = \sum_{i=1}^{k} w_i\, S_i(x_i, y_i)$, where $S_i(x_i, y_i)$ is the per-parameter similarity between two values for parameter $p_i$ and $w_i$ denotes the weight assigned to the parameter $p_i$. Therefore, we only need to present the definitions of $S_i$ and $w_i$ for each similarity measure, unless otherwise specified.
Following [32], the measures discussed henceforth will all be in the context of similarity, with dissimilarity or distance measures being converted using the following formula: $sim = \frac{1}{1 + dist}$, where $dist$ is the dissimilarity measure between $X$ and $Y$.
Table 2 presents nine similarity measures for categorical parameter values, which are widely used in data mining for categorical data. In Table 2, the last column “Range” represents the range of the per-parameter similarity for mismatches or matches of parameter values in each measure.
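As a concrete instance of the weighted per-parameter scheme described above, the following sketch implements the Overlap measure in the form commonly given in the categorical-data literature (per-parameter similarity 1 for a match and 0 for a mismatch, with equal weights 1/k). Since Table 2 is not reproduced here, that definition is an assumption, and the function names `overlap` and `similarity` are hypothetical.

```python
def overlap(x, y):
    """Per-parameter Overlap similarity: 1 for a match, 0 for a mismatch."""
    return 1.0 if x == y else 0.0


def similarity(a, b, per_param=overlap, weights=None):
    """Weighted sum of per-parameter similarities between two test inputs.

    With no weights given, every parameter gets the equal weight 1/k
    (an assumption matching the usual Overlap definition).
    """
    k = len(a)
    if weights is None:
        weights = [1.0 / k] * k
    return sum(w * per_param(x, y) for w, x, y in zip(weights, a, b))


# Two 4-parameter test inputs that agree on the first and third parameters.
print(similarity(("a", "b", "c", "d"), ("a", "x", "c", "y")))
```

Other measures fit the same interface by supplying a different `per_param` function and weight vector; frequency-based measures would additionally need access to the dataset used to count value occurrences.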
2.5. Effectiveness Measurement
In this paper, we adopt the F-measure (i.e., the number of test cases required to detect the first failure) as the measurement of failure-detection effectiveness of testing methods, since previous studies [35] have demonstrated that the F-measure is particularly suitable for adaptive testing strategies such as ART. Intuitively speaking, a smaller F-measure of ART compared with RT means that fewer test cases are required by ART to detect the first failure and hence implies a better failure-detection effectiveness of ART than that of RT. For the purpose of clear description, we will use the ART F-ratio (i.e., the ratio of ART’s F-measure ($F_{ART}$) relative to RT’s F-measure ($F_{RT}$)) to indicate the failure-detection effectiveness improvement of ART over RT.
However, it is extremely difficult to obtain ART’s F-measure ($F_{ART}$) theoretically. Similar to all other ART studies, $F_{ART}$ is collected via simulations and empirical studies, whose procedure is described as follows. On the one hand, in simulation studies, the failure pattern (including its size and shape) and the failure rate are predefined for simulating a faulty program. The failure regions are then randomly placed inside the whole input domain. If a point inside one of the failure regions is picked by a testing strategy, a failure is said to be detected. On the other hand, for empirical studies, some faults are seeded into a subject program. Once the subject program behaves differently from its fault-seeded version, it is said that a failure is identified. The number of test cases needed to find the first failure is regarded as the F-measure of that run. Such a process is repeated until a statistically reliable estimate of the F-measure has been obtained at the accuracy range and confidence level adopted in our paper. The number of runs can be determined dynamically using the same method as shown in [15]. With respect to RT’s F-measure ($F_{RT}$), since test cases are chosen with replacement according to the uniform distribution, $F_{RT}$ is theoretically equal to $1/\theta$, where $\theta$ is the failure rate.
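The theoretical value of RT's F-measure can be checked numerically. The sketch below is an illustration only: it simulates RT with replacement by treating each test case as an independent draw that fails with probability equal to the failure rate, uses a fixed number of runs instead of the dynamic stopping rule of [15], and the function names are hypothetical.

```python
import random


def rt_f_measure(theta, rng):
    """Number of random test cases drawn until the first failure occurs.

    Each test case independently hits a failure region with probability
    theta, which models uniform sampling with replacement.
    """
    n = 1
    while rng.random() >= theta:
        n += 1
    return n


def estimate_f_rt(theta, runs, seed=0):
    """Average the F-measure over a fixed number of independent runs."""
    rng = random.Random(seed)
    return sum(rt_f_measure(theta, rng) for _ in range(runs)) / runs


# With failure rate 0.1 the estimate should be close to 1 / 0.1 = 10.
print(estimate_f_rt(0.1, runs=20000))
```

Because the number of draws follows a geometric distribution, its mean is the reciprocal of the failure rate, which is what the simulation approximates.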
Apart from the measure used as the measurement, another measurement is also used in our paper, that is, the number of test cases required to first cover all possible value combinations of a given strength (denoted measure). This measurement is widely used in the combinatorial input domain. Unlike the measure, the testing stop condition of measure is not that the first failure is detected, but that all possible wise value combinations are first covered. For the purpose of clear description, we use to represent this measurement for RT while for ART.
3. Two Similarity Measures Based on Interaction Coverage
Apart from various similarity measures described in Section 2.4, in this section, we propose another two similarity measures by using interaction coverage: incremental interaction coverage similarity (IICS) and multiple interaction coverage similarity (MICS), in order to apply the characteristics of combinatorial input domain to the selection of test cases. All similarity measures illustrated in Section 2.4 are used to evaluate how similar two test cases are; however, two similarity measures presented in this section are used to evaluate the resemblance of the combinatorial test case against the combinatorial test suite. We will discuss them next.
Before introducing them, we first describe a simple similarity measure of a test case against a test suite based on interaction coverage, named normalized covered $\tau$-wise value combinations similarity (or NCVCS) [36], which is widely used in the combinatorial input domain.
Definition 5. Given a combinatorial test suite $T$ on a test profile with $k$ parameters, a combinatorial test case (tc), and the strength $\tau$, the normalized covered $\tau$-wise value combinations similarity ($NCVCS_\tau$) of tc against $T$ is defined as the ratio of the number of $\tau$-wise value combinations covered by tc that have already been covered by $T$ to $\binom{k}{\tau}$; that is, $NCVCS_\tau(tc, T) = \frac{|CombSet_\tau(tc) \cap CombSet_\tau(T)|}{\binom{k}{\tau}}$, where $CombSet_\tau(T)$ can be written as follows: $CombSet_\tau(T) = \bigcup_{tc_i \in T} CombSet_\tau(tc_i)$.
Obviously, the NCVCS is a function that requires the strength value $\tau$ to be set in advance, and its range is $[0, 1.0]$. Two properties of the NCVCS are discussed as follows.
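A small sketch of the NCVCS computation, following the verbal definition above (the fraction of the value combinations of tc at the given strength that the test suite already covers). The function names and the integer-coded test cases are hypothetical and serve only as an illustration.

```python
from itertools import combinations
from math import comb


def comb_set(tc, tau):
    """Set of tau-wise value combinations covered by test case tc."""
    return {
        tuple((i, tc[i]) for i in idx)
        for idx in combinations(range(len(tc)), tau)
    }


def ncvcs(tc, suite, tau):
    """Fraction of tc's tau-wise value combinations already covered by suite."""
    covered = set()
    for t in suite:
        covered |= comb_set(t, tau)
    return len(comb_set(tc, tau) & covered) / comb(len(tc), tau)


suite = [(0, 0, 0)]
tc = (0, 0, 1)
for tau in (1, 2, 3):
    print(tau, ncvcs(tc, suite, tau))
```

In this example tc shares two of its three 1-wise combinations and one of its three 2-wise combinations with the suite, and its single 3-wise combination (the test case itself) is not covered.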
Theorem 6. If $NCVCS_\tau(tc, T) = 1.0$, then $NCVCS_\lambda(tc, T) = 1.0$, where $1 \le \lambda \le \tau$.
Proof. When , it can be noted that covers all possible wise value combinations covered by , that is, Since also covers all possible value combinations at strengths lower than that are covered by tc. As a consequence, where .
Theorem 7. If $NCVCS_\tau(tc, T) = 0$, then $NCVCS_\lambda(tc, T) = 0$, where $\tau \le \lambda \le k$.
Proof. When , it can be noted that each wise value combination covered by tc is not covered by , indicating that, for , : that is,
Therefore, the problem converts to demonstrating that , .
We suppose that such that and , that is,
Due to , (13) is equivalent to the equation shown as follows:
Obviously, (14) is contradictory to (12). Therefore, , , which means that where .
As we know, given a test profile and the strength $\tau$, the number of all possible $\tau$-wise value combinations is fixed. In other words, there exists a test case generation method using NCVCS as the criterion, which can generate a certain number of combinatorial test cases (denoted as $T$) to cover all possible $\tau$-wise value combinations. However, if testing with $T$ fails to reveal any failures because $T$ contains no failure-causing inputs, the next test case generated by this method is, in fact, obtained in a random manner. The main reason is that the NCVCS of every possible test case against $T$ is equal to 1.0. Therefore, the NCVCS is not particularly suitable for adaptive testing strategies such as ART. To solve this problem, we propose two similarity measures based on interaction coverage in the following subsections.
3.1. Incremental Interaction Coverage Similarity
As discussed in Theorem 6, if all possible $\tau$-wise value combinations are covered by a combinatorial test suite $T$, all possible value combinations at strengths lower than $\tau$ are also covered by $T$. According to this fact, we present a new similarity measure based on interaction coverage, named incremental interaction coverage similarity (IICS).
Given a combinatorial test suite on and a combinatorial test case (tc), the incremental interaction coverage similarity of tc against is defined as follows: where satisfies the following properties: and , where (assume .
It can be noted that if , the IICS is equal to 1.0 as tc is the same as one of elements in ; if , the IICS of tc against is actually equal to the of tc against where is gradually incremented. More specifically, if covers all possible wise value combinations and partial wise value combinations occurred in tc, . Similar to , the range of IICS is also .
Here, we present an example to illustrate IICS. Suppose on , , and , as 1wise value combinations are not completely covered by , and hence . Let , as covers all 1wise value combinations and partial 2wise value combinations occurred in , and hence .
3.2. Multiple Interaction Coverage Similarity
As shown in Section 3.1, the IICS measure begins at strength $\tau = 1$ and then increments $\tau$ step by step. In other words, it considers different strength values when evaluating the combinatorial test case against the combinatorial test suite. However, the IICS accounts for one strength value at a time rather than considering all strength values simultaneously. As a consequence, we present another similarity measure based on interaction coverage, named multiple interaction coverage similarity (MICS).
Given a combinatorial test suite $T$ on a test profile with $k$ parameters and a combinatorial test case (tc), the multiple interaction coverage similarity of tc against $T$ is defined as follows:
$$MICS(tc, T) = \sum_{\tau=1}^{k} w_\tau \cdot NCVCS_\tau(tc, T), \quad (16)$$
where $0 \le w_\tau \le 1.0$ and $\sum_{\tau=1}^{k} w_\tau = 1.0$.
Intuitively speaking, if $tc \in T$, $MICS(tc, T) = 1.0$. Similar to IICS, the MICS ranges from 0 to 1.0.
Here, we present an example to explain the definition of MICS. Let on , , , and , , while .
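The weighted combination of per-strength coverage similarities can be sketched as below. This is an illustrative reading of the MICS definition, not the authors' code: it assumes that MICS is the weighted sum of the NCVCS values over all strengths from 1 to the number of parameters, it defaults to equal weights, and all function names and the integer-coded test cases are hypothetical.

```python
from itertools import combinations
from math import comb


def comb_set(tc, tau):
    """Set of tau-wise value combinations covered by test case tc."""
    return {
        tuple((i, tc[i]) for i in idx)
        for idx in combinations(range(len(tc)), tau)
    }


def ncvcs(tc, suite, tau):
    """Fraction of tc's tau-wise value combinations already covered by suite."""
    covered = set()
    for t in suite:
        covered |= comb_set(t, tau)
    return len(comb_set(tc, tau) & covered) / comb(len(tc), tau)


def mics(tc, suite, weights=None):
    """Weighted sum of NCVCS over strengths 1..k (equal weights by default)."""
    k = len(tc)
    if weights is None:
        weights = [1.0 / k] * k
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * ncvcs(tc, suite, tau)
               for tau, w in enumerate(weights, start=1))


suite = [(0, 0, 0)]
print(mics((0, 0, 1), suite))   # partly similar to the suite
print(mics((0, 0, 0), suite))   # already in the suite
```

With equal weights, the first call averages the per-strength values 2/3, 1/3, and 0, while a test case that is already in the suite scores the maximum similarity.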
3.3. Properties of Two New Similarity Measures
Some properties of the proposed two similarity measures are discussed in the following subsection.
Theorem 8. If , for , and remain unchanged.
Proof. On the one hand, if (i.e., covers all possible wise value combinations), for ,
On the other hand, if (i.e., covers all possible wise value combinations), , . According to Theorem 6, it can be concluded that , where ; that is, covers all possible wise value combinations. In other words, (). Therefore, for ,
In summary, if , for , and .
According to Theorem 8, a test case generation method using IICS or MICS as the similarity measure becomes a random generation method, when its generated test suite covers all possible wise value combinations. The main reason is that, for any candidates, no matter whether they are included in or not, the IICS (or MICS) values of all candidates are identical.
Theorem 9. If , .
Proof. If , where because of and ; that is, all possible wise value combinations covered by tc are not covered by . According to (15), therefore, .
As discussed before, both IICS and MICS consider different interaction coverage when evaluating combinatorial test cases. However, they have some differences. Given a combinatorial test case (tc), its IICS measure is actually calculated by the NCVCS at an appropriate strength value, which means that the IICS measure of tc only considers a single interaction coverage, while its MICS measure considers different interaction coverages at the same time. Consequently, calculating tc’s IICS measure takes less time than calculating its MICS measure.
In summary, two new similarity measures based on interaction coverage (IICS and MICS) fundamentally differ from NCVCS due to the following reasons: they do not require setting the strength value in advance, and they are more suitable for adaptive strategies than NCVCS.
4. Adaptive Random Testing for Combinatorial Test Inputs
In this section, we propose a new family of methods adopting ART in the combinatorial input domain, namely, ART-CID. Similar to previous ART studies, ART-CID can also be implemented according to different notions. In this paper, we present one version of ART-CID by similarity (denoted as FSCS-CID), which uses the strategy of FSCS-ART [15]. Since a similarity measure rather than a distance measure is used in this paper, the procedure of FSCS-CID may differ from that of FSCS-ART. Detailed information is given as follows.
4.1. Similarity-Based Test Case Selection in FSCS-CID
FSCS-CID uses two test sets, that is, the candidate set $C$ of fixed size and the executed set $E$, each of which has the same definition as in FSCS-ART. However, test cases in either $C$ or $E$ are obtained from the combinatorial input domain. In order to select the next test case from $C$, the criterion (19) is to choose the candidate whose similarity against the already executed test cases in $E$ is the lowest among all candidates, where the similarity is measured by a similarity measure between two combinatorial test inputs. The detailed algorithm implementing (19) is given in Algorithm 1.
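The selection step can be sketched as follows. Because the exact form of the selection criterion and of Algorithm 1 is not reproduced in this text, the sketch makes an explicit assumption: a candidate's similarity against the executed set is taken to be its maximum pairwise similarity to any executed test case (mirroring the nearest-neighbour distance in FSCS-ART), and the candidate with the smallest such value is chosen. The function names and the Overlap-style similarity are hypothetical.

```python
def overlap_sim(a, b):
    """Fraction of parameters on which two test inputs take the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def select_next(candidates, executed, sim=overlap_sim):
    """Return the candidate least similar to the executed set.

    Assumption: similarity to the set is the maximum pairwise similarity,
    so the chosen candidate minimises its closest resemblance to any
    previously executed test case.
    """
    return min(candidates,
               key=lambda c: max(sim(c, e) for e in executed))


executed = [("a", "b", "c")]
candidates = [("a", "b", "x"), ("x", "y", "z")]
print(select_next(candidates, executed))
```

Measures such as IICS and MICS compare a candidate against the whole executed set at once, so for those the inner maximum would be replaced by a single call on the pair (candidate, executed set).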

4.2. Algorithm of FSCS-CID
As discussed before, Algorithm 1 is used to guide the selection of the best test case. In FSCS-CID, the process of Algorithm 1 runs until the stop condition is satisfied. In this paper, we consider two stop conditions: the first software failure is detected (denoted StopCon1); and all possible value combinations at strength $\tau$ are covered (denoted StopCon2). The detailed algorithm of FSCS-CID is shown in Algorithm 2.

Since the frequencies of parameter values are used in some similarity measures such as Lin, OF, and Goodall2, a fixed-size set of test cases is required in order to count the frequencies. However, the executed set is incrementally updated with the element selected from the candidate set until StopCon1 (or StopCon2) is satisfied. In this paper, we take the following strategy to construct the fixed-size set of test cases when calculating the similarity between test inputs. During the process of choosing the next test case from the candidate set, each candidate is measured against all elements in the executed set according to the similarity measure, and the fixed-size set of test cases used for counting frequencies is constructed at that point for the candidate being measured.
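Putting the pieces together, the overall loop with StopCon2 can be sketched as below. This is a simplified illustration and not Algorithm 2 itself: it uses an Overlap-style similarity (so no frequency counting is needed), aggregates similarity over the executed set by the maximum, and all names, the candidate-set size, and the example test profile are assumptions.

```python
import random
from itertools import combinations
from math import prod


def comb_set(tc, tau):
    """Set of tau-wise value combinations covered by test case tc."""
    return {tuple((i, tc[i]) for i in idx)
            for idx in combinations(range(len(tc)), tau)}


def total_combinations(values, tau):
    """Number of all possible tau-wise value combinations (no constraints)."""
    return sum(prod(len(values[i]) for i in idx)
               for idx in combinations(range(len(values)), tau))


def overlap_sim(a, b):
    """Fraction of parameters on which two test inputs take the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def fscs_cid(values, tau, num_candidates=10, seed=0):
    """Generate test cases until all tau-wise combinations are covered."""
    rng = random.Random(seed)

    def draw():
        return tuple(rng.choice(v) for v in values)

    executed = [draw()]                      # first test case is random
    covered = comb_set(executed[0], tau)
    target = total_combinations(values, tau)
    while len(covered) < target:             # StopCon2
        candidates = [draw() for _ in range(num_candidates)]
        nxt = min(candidates,
                  key=lambda c: max(overlap_sim(c, e) for e in executed))
        executed.append(nxt)
        covered |= comb_set(nxt, tau)
    return executed


# Example profile: 4 parameters with 3 values each, strength 2.
suite = fscs_cid([["a", "b", "c"]] * 4, tau=2)
print(len(suite))
```

Replacing the loop condition with a failure check on each executed test case would give the StopCon1 variant.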
5. Experiment
In this section, some experimental results, including simulations and experiments against real programs, are presented to analyze the effectiveness of FSCS-CID. We mainly compared our method to RT in terms of failure-detection effectiveness (F-measure) and the rate of value combinations coverage at a given strength. For ease of description, we used the terms Goodall1, Goodall2, Goodall3, Goodall4, Lin, Lin1, Overlap, Eskin, OF, IICS, and MICS to, respectively, represent FSCS-CID adopting the similarity measure of the same name. Additionally, we used the term RT to represent random testing.
As shown in (16), a weight is required to be assigned for interaction coverage at each strength value. There are many techniques which conduct on assigning weights; however, in this paper we focus on two distribution styles: equal distribution where each interaction coverage has the same weight, that is, ; and FTFI percentage distribution where according to previous studies [28, 34], for example, in [28], Kuhn et al. investigated several software projects and concluded that the interaction faults are summarized to have 29% to 82% faults as 1wise faults (i.e., the FTFI number is 1), 6% to 47% of faults as 2wise faults, 2% to 19% as 3wise faults, 1% to 7% of faults as 4wise faults, and even fewer failures beyond 4wise interactions. As a consequence, we arrange weights as follows: , , where . For example, if , and ; if , , , and . In this paper, therefore, we use the terms MICS1 and MICS2 to stand for the MICS techniques with the above two weight distribution styles, respectively.
5.1. Simulation
In the following subsection, two simulations were presented to analyze the effectiveness of FSCSCID according to the rate of covering wise value combinations (i.e., measure). We used two usual test profiles and that are commonly used in previous studies [37].
5.1.1. Setup
Since the was known before testing, in this simulation, we considered the FSCSCID using [36] as the similarity measure (denoted NCVCS). Except the MICS, all other methods do not require to be set. As for the MICS, different strength values from 1 to are considered to calculate the MICS measure according to (16). However, due to the known , we mainly focused on the strength values from 1 to for calculating the MICS measure. As a consequence, (16) becomes as follows: where only are considered. Each method runs until the StopCon2 is satisfied. Additionally, we consider as the metric to evaluate each method in terms of the rate of covering value combinations at strength for each method, where .
5.1.2. Results
Figure 3 summarizes the number of test cases required to cover all possible $\tau$-wise value combinations generated by each method for the above two designed test profiles. Based on the experimental data, we have the following observations.
(1) For each test profile, the metric values of all FSCS-CID methods using different similarity measures are smaller than those of RT. In other words, the FSCS-CID methods require fewer test cases than RT to cover all $\tau$-wise value combinations, which means that the FSCS-CID methods have higher rates of covering value combinations than RT.
(2) Among all the FSCS-CID methods, the NCVCS is the most effective technique. The results show that the metric values of the NCVCS are about 30% to 50% of those of RT. The IICS has the second best metric values, followed by the OF. For one test profile, Goodall3 is the least effective, while for the other, Lin performs worst.
(3) From the perspective of the similarity category, the FSCS-CID methods using the interaction-coverage-based similarity measures (including IICS, MICS, and NCVCS) perform best, while the FSCS-CID methods using the information-theoretic similarity measures (including Lin and Lin1) perform worst.
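Figure 3: Number of test cases required by each method to cover all possible $\tau$-wise value combinations for the two test profiles (panels (a) to (f)).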
5.1.3. Analysis
Here, we briefly analyze the above observations. Observation (1) is explained as follows. The FSCS-CID methods using different similarity measures select the next test case that has the smallest similarity value against already generated test cases, while RT simply generates test cases at random from the combinatorial input domain. As a consequence, the FSCS-CID methods spread test cases more diversely than RT over the combinatorial input domain.
Observations (2) and (3) are also easy to explain. On the one hand, since the metric is related to $\tau$-wise value combinations, the NCVCS performs best because it selects the next test case that covers as many uncovered $\tau$-wise value combinations as possible. In other words, it may have the fastest rate of covering all $\tau$-wise value combinations. On the other hand, the other two interaction-coverage-based methods, IICS and MICS, consider different strength values for generating test cases; however, both of them take the strength $\tau$ as an indispensable part. In detail, the IICS evaluates the test candidate from strength 1 up to $\tau$, while the MICS considers the different strengths from 1 to $\tau$ at the same time. Hence, it is reasonable that, compared to other categories, the FSCS-CID methods using interaction-coverage-based similarity measures perform best according to the metric.
5.2. An Empirical Study
In this section, we report an empirical study that compares the performance of FSCS-CID and RT in practical situations, using the F-measure (the number of test cases required to detect the first failure) as the effectiveness metric. To present the data clearly, we use the ART ratio, which is defined as the ratio of the F-measure of FSCS-CID to that of RT, that is, $\text{ART ratio} = F_{\text{FSCS-CID}} / F_{\text{RT}}$. Intuitively, a smaller ART ratio implies a larger improvement of FSCS-CID over RT, and $1 - \text{ART ratio}$ is the F-measure improvement of FSCS-CID over RT.
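As a small worked example of the ART ratio, the sketch below assumes that the measure in question is the F-measure (the number of test cases to the first failure) and that it is estimated by averaging over repeated runs. The run counts are invented solely to show the arithmetic.

```python
from statistics import mean

def f_measure(first_failure_counts):
    """Average number of test cases needed to detect the first failure."""
    return mean(first_failure_counts)

def art_ratio(f_fscs_cid, f_rt):
    """Ratio of the F-measure of FSCS-CID to that of RT (smaller is better)."""
    return f_fscs_cid / f_rt

# Hypothetical first-failure counts from three runs of each method.
rt_runs = [120, 80, 100]
fscs_cid_runs = [70, 50, 60]

ratio = art_ratio(f_measure(fscs_cid_runs), f_measure(rt_runs))
improvement = 1 - ratio
print(f"ART ratio = {ratio:.2f}, improvement = {improvement:.0%}")
# prints "ART ratio = 0.60, improvement = 40%"
```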
In this empirical study, we use a set of six fault-seeded C programs with 9 versions. Five of the subject programs (count, series, tokens, ntree, and nametbl) were downloaded from Chris Lott’s website (http://www.maultech.com/chrislott/work/exp/); they have been widely used in research on combinatorial spaces, such as the comparison of defect-revealing mechanisms [38], the evaluation of different combination strategies for test case selection [39], and fault diagnosis [40, 41]. The remaining subject programs are a series of flex programs, downloaded from the Software-artifact Infrastructure Repository (SIR) [43], which are popularly used in combinatorial test suite construction [44] and combinatorial interaction regression testing [42]. The model of flex used in this paper is unconstrained, which has some limitations: “We note that in a real test environment an unconstrained TSL would most likely be prohibitive in size and would not be used” [42].
Table 3 presents detailed information about these subject programs. The third column, “LOC”, gives the number of lines of executable code in each program; “#S.” is the number of seeded faults in each subject program; and “#D.” is the number of faults that can be detected by some test cases derived from the accompanying test profiles, which are not guaranteed to detect all faults. In our study, however, we use only a portion of the detectable faults, the size of which is shown as “#U.”. The reason is that the detectable faults we exclude have failure rates exceeding 0.5. If the failure rate of a fault is larger than 0.5, the expected F-measure of random testing, which is the reciprocal of the failure rate, is theoretically less than 2. For FSCS-CID, the first test case is selected at random, and if it does not detect a failure, the F-measure is larger than or equal to 2. For such faults, therefore, the F-measure of FSCS-CID depends mainly on the first randomly selected test case, that is, on random testing.
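The threshold of 0.5 can be checked with a short derivation. It assumes that RT draws test cases independently and uniformly with replacement, and it writes $\theta$ for the failure rate and $F_{\mathrm{RT}}$ for the number of test cases to the first failure; both symbols are notation chosen here for illustration.

```latex
\begin{align*}
\Pr[F_{\mathrm{RT}} = n] &= (1-\theta)^{n-1}\,\theta, \qquad n = 1, 2, \dots \\
\mathbb{E}[F_{\mathrm{RT}}] &= \sum_{n \ge 1} n\,(1-\theta)^{n-1}\,\theta = \frac{1}{\theta} \\
\theta > 0.5 &\;\Longrightarrow\; \mathbb{E}[F_{\mathrm{RT}}] < 2
\end{align*}
```

Under these assumptions, more than half of all runs stop at the first, randomly chosen test case, so an adaptive strategy has little room to differ from RT for such faults.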
For clarity, we order the used faults in each subject program in descending order of failure rate and label them accordingly. The range of failure rates in each program is shown in Table 3.
We used all twelve FSCS-CID versions, each with a different similarity measure, to test these fault-seeded programs. The results of the empirical study are given in Figure 4, where the x-axis represents the seeded faults in the subject program and the y-axis represents the ART ratio. Figures 4(a)–4(i) each correspond to a particular subject program, while Figure 4(j) shows the average ART ratio of all FSCS-CID versions for each subject program.
Figure 4: ART ratio of the FSCS-CID versions for each seeded fault: (a) Count, (b) Series, (c) Tokens, (d) Ntree, (e) Nametbl, (f) Flexv1, (g) Flexv2, (h) Flexv3, (i) Flexv4, (j) average ART ratio per program.
From Figures 4(a)–4(j), we can make the following observations.
(1) According to the ART ratio, all twelve FSCS-CID versions (Goodall1, Goodall2, Goodall3, Goodall4, Lin, Lin1, Overlap, Eskin, OF, IICS, MICS1, and MICS2) perform better than RT. In the best case, the improvement of FSCS-CID over RT is about 40% (i.e., the ART ratio is 60%).
(2) As the failure rate increases, the ART ratio of each FSCS-CID version also increases in most programs. In other words, the larger the failure rate, the smaller the improvement of each FSCS-CID version over RT.
(3) The failure-detection capability of FSCS-CID depends on several factors, such as the program (or test profile) and the failure type (including failure rate and failure pattern). For example, in program count, two of the faults have the same failure rate; however, the ART ratio of each FSCS-CID version differs considerably between the two.
(4) Figure 4(j) shows the average ART ratio of all FSCS-CID versions when detecting each fault of each subject program. The ART ratio of the FSCS-CID algorithm generally fluctuates between 0.75 and 0.90 over all faults of each program, which means that FSCS-CID improves the F-measure by about 10%~25% over RT on average.
(5) Among all FSCS-CID versions, no method performs best for all programs, and none performs worst for all programs. To compare the failure-detection capabilities of the different FSCS-CID versions, Table 4 shows the average ART ratio of each version for each subject program. According to Table 4, OF generally performs best, followed by IICS, while Lin and Lin1 generally perform worst. In addition, Eskin performs best for the program tokens, and Goodall1 performs best for the program flexv4.
In summary, our simulation results (Section 5.1) have shown that the FSCS-CID algorithm, irrespective of the similarity measure used, covers value combinations at different strength values at a higher rate than random testing. In addition, the empirical study has shown that the FSCS-CID algorithm performs better than RT in terms of the number of test cases required to detect the first failure (i.e., the F-measure).
5.3. Threats to Validity
The experimental results are subject to some threats to validity; in this section, we outline the major ones. In the simulation study, two widely used, but limited, test profiles were employed. In the empirical study, many real-life programs were used, which have been widely investigated by different researchers. However, the faults seeded in each subject program have high failure rates. To address these potential threats, we will conduct additional studies in the future, using a larger number of test profiles and a larger number of subject programs with low failure rates.
In addition, although two metrics (the coverage-based measure of Section 5.1 and the F-measure) were employed in our experiments, we recognize that there may be other metrics that are more pertinent to the study.
6. Discussion and Conclusion
Adaptive random testing (ART) [15] has been proposed to enhance the failure-detection capability of random testing (RT) by evenly spreading test cases over the input domain and has been widely applied in various applications such as numerical programs, Java programs, and object-oriented programs. In this paper, we extend the principle of ART to a type of input domain that has not yet been investigated, namely, the combinatorial input domain. Owing to the special characteristics of the combinatorial input domain, the test case similarity (or dissimilarity) measures previously used in ART may not be suitable for it. By adopting some well-known similarity measures used in data mining and proposing two new similarity measures based on interaction coverage, we proposed a new approach, named ART-CID, that applies ART to the combinatorial input domain. We conducted experiments, including simulations and an empirical study, to analyze the effectiveness of one version of ART-CID (FSCS-CID, which is based on fixed-size-candidate-set ART). Compared with RT, FSCS-CID not only covers all possible combinations at any given strength at a higher rate but also requires fewer combinatorial test cases to detect the first failure in the seeded programs.
Combinatorial interaction testing (CIT) [33] is a black-box testing method that has been widely used for combinatorial input domains. It aims at constructing an effective test suite to identify interaction faults caused by parameter interactions. Some greedy CIT algorithms, such as AETG [45], TCG [46], and DDA [47], may have a mechanism similar to that of FSCS-CID. Take AETG as an example: like AETG, FSCS-CID first constructs some candidates and then chooses the “best” one as the next test case according to some criterion. However, there are some fundamental differences between AETG and FSCS-CID, which are summarized as follows.
(1) Different construction strategies of candidates: FSCS-CID constructs candidates in a random manner, while AETG first orders all parameters and then assigns a value to each parameter, such that the assigned parameter values cover the largest number of value combinations at a given strength.
(2) Different test case selection criteria: AETG selects as the next test case the candidate that covers the largest number of value combinations at a given strength, while FSCS-CID chooses the next test case according to the similarity measure it uses.
(3) Different goals: AETG aims at covering all possible value combinations of a given strength with as few test cases as possible, so its only stopping condition is that all value combinations of the given strength are covered by the generated test cases. FSCS-CID is an adaptive strategy, so its stopping condition is not limited to covering all value combinations of a given strength; for example, it may stop when the first failure in the SUT is detected.
In this paper, constraints among value combinations have not been considered; however, they often exist in real-life programs. For example, as shown in Table 1, there may exist a constraint between the values “Full Size” and “Single” of two different parameters, that is, “Full Size” and “Single” cannot occur together in one combinatorial test case. In this case, the FSCS-CID method proposed in this paper can still be executed successfully, simply by checking whether each selected test case violates any constraint among value combinations. Generally speaking, this check can be performed at two points: when constructing the candidate set and when adding the latest test case to the executed set. However, how to deal with constraints among value combinations should be studied further.
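A constraint check at candidate-construction time could look like the following minimal sketch. The way constraints are encoded (as forbidden sets of position and value pairs), the parameter positions, and the values other than “Full Size” and “Single” are all assumptions made for illustration; Table 1 defines the actual parameters.

```python
import random

def violates(test_case, forbidden):
    """True if the test case contains all pairs of some forbidden combination."""
    return any(
        all(test_case[position] == value for position, value in combo)
        for combo in forbidden
    )

def valid_candidates(domain, forbidden, rng, num_candidates=10, max_draws=10_000):
    """Draw random candidates, discarding those that violate a constraint."""
    candidates = []
    for _ in range(max_draws):
        candidate = tuple(rng.choice(values) for values in domain)
        if not violates(candidate, forbidden):
            candidates.append(candidate)
            if len(candidates) == num_candidates:
                break
    return candidates

# Hypothetical encoding: parameter 0 may be "Full Size", parameter 1 may be
# "Single", and the two values must not appear in the same test case.
domain = [("Full Size", "other_a"), ("Single", "other_b")]
forbidden = [((0, "Full Size"), (1, "Single"))]

candidates = valid_candidates(domain, forbidden, random.Random(0))
print(candidates)
```

The same `violates` check could be repeated just before a selected test case is added to the executed set, which corresponds to the second of the two phases mentioned above.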
In the future, we plan to further investigate how to improve the effectiveness of the approach by adopting other similarity measures that may be applicable to the combinatorial input domain or by considering additional factors to guide test case generation. In addition, we intend to study how to extend other ART algorithms to the combinatorial input domain.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank C. M. Lott for sending them the failure reports of five subject programs (count, series, tokens, ntree, and nametbl) and the Software-artifact Infrastructure Repository (SIR) [43], which provided the source code and fault data for the program flex. They would also like to thank T. Y. Chen for the many helpful discussions and comments. This work is supported in part by the National Natural Science Foundation of China (Grant no. 61202110) and the Natural Science Foundation of Jiangsu Province (Grant no. BK2012284).