Abstract

Random testing (RT) is a fundamental testing technique to assess software reliability, by simply selecting test cases in a random manner from the whole input domain. As an enhancement of RT, adaptive random testing (ART) has better failuredetection capability and has been widely applied in different scenarios, such as numerical programs, some objectoriented programs, and mobile applications. However, not much work has been done on the effectiveness of ART for the programs with combinatorial input domain (i.e., the set of categorical data). To extend the ideas to the testing for combinatorial input domain, we have adopted different similarity measures that are widely used for categorical data in data mining and have proposed two similarity measures based on interaction coverage. Then, we propose a new version named ARTCID as an extension of ART in combinatorial input domain, which selects an element from categorical data as the next test case such that it has the lowest similarity against already generated test cases. Experimental results show that ARTCID generally performs better than RT, with respect to different evaluation metrics.

1. Introduction

Software testing, a major software engineering activity, is widely considered to assure the quality of software under test [1]. Many testing methods have been developed to effectively identify software failures by actively selecting inputs (namely, test cases). Random testing (RT), a basic software testing method, simply chooses test cases at random from the set of all possible program inputs (namely, the input domain) [2, 3]. There are many advantages of using RT in software testing. For example, in addition to simplicity and the efficiency of generating random test cases [2], RT allows statistical quantitative estimation of software’s reliability [4]. Due to these advantages, RT has been widely used to detect software failures in different scenarios, such as the testing of UNIX utilities [5, 6], SQL database systems [7, 8], Java JIT compilers [9], and embedded software systems [10]. In spite of the popularity, RT is still criticized by many researchers due to little or no information to guide its test case generation.

Given a faulty program, two basic features are determined by program inputs causing software to exhibit failure behaviors (namely, failure-causing inputs), that is, failure rate and failure pattern. Failure rate refers to the ratio between the number of failure-causing inputs and the number of all possible program inputs, while failure pattern refers to the geometry and distribution of failure regions (i.e., the region where failure-causing inputs reside). It has been observed, however, that failure-causing inputs tend to cluster together [1113]. Given that failure regions are continuous, nonfailure regions should also be contiguous. More specifically, suppose a test case (tc) is not a failure-causing input, test cases that are close to tc (or tc’s neighbors) may fail to reveal a failure as well. Therefore, it is intuitively appealing that test cases that spread away from tc may have a higher chance to be failure-causing than tc’s neighbors.

Briefly speaking, it is very likely that a more even-spread of random test cases can improve the failure-detection effectiveness of RT. Based on this intuition, Chen et al. [14] have proposed a novel approach, namely, adaptive random testing (ART). Similar to RT, ART also randomly generates test case from the whole input domain. But ART uses additional criteria to guide the test case selection for the purpose of evenly spreading test cases over the input domain. Various ART algorithms have been developed based on different test case selection criteria, such as ART by distance [15], ART by exclusion [16], ART based on evolutionary search algorithms [17], and ART by perturbation [18]. Essentially, ART achieves test case diversity with the subset of test cases executed at any one time [19].

As an alternative of RT, ART has been successfully applied to different programs, such as numerical programs [1518], object-oriented programs [20, 21], and mobile application [22]. However, not much work has been done on the effectiveness of ART for programs with combinatorial input domain (or categorical data, i.e., a Cartesian product of finite value domains for each of a finite set of parameter variables). With the popularity of category-partition method [23] and many guidelines to help construct categories and partitions [2427], combinatorial input domain has been widely applied to different testing scenarios, such as configurable-aware system [28, 29], event-driven software [30], and GUI-based application [31]. In this paper, we propose a new testing strategy called ART-CID as an extension of ART in combinatorial input domain. In order to successfully extend the ART principle into combinatorial input domain, we propose two similarity measures based on interaction coverage and also adopt different well-studied similarity measures that are popularly used for categorical data in data mining [32]. To analyze the effectiveness of ART-CID (mainly FSCS-CID, one version of ART-CID), we compare the effectiveness of FSCS-CID with RT by designing some simulations and the empirical study. Experimental results show that, compared with RT, FSCS-CID can not only use smaller test cases in order to cover all possible combinations of parameter values at a given strength, but also require to generate fewer test cases to identify the first failure in the real-life program.

This paper is organized as follows. Section 2 introduces some preliminaries, including combinatorial input domain, ART, similarity measures used for combinatorial input domain, and the effectiveness measures adopted in our study. Section 3 proposes two similarity measures for combinatorial test cases based on interaction coverage. Section 4 proposes a new algorithm called ART-CID to select test cases from combinatorial input domain. Section 5 reports some experimental studies, which examine the rate of covering value combinations at a given strength and failure-detection effectiveness of our new method. Finally, Section 6 summarizes some discussions and conclusions.

2. Preliminaries

In the following section, some preliminaries of combinatorial input domain, failure patterns, adaptive random testing, similarity and dissimilarity measures for combinatorial input domain, and effectiveness measure are described.

2.1. Combinatorial Input Domain

Suppose that a system under test (SUT) has a set of parameters (or categories) , which may represent user inputs, configuration parameters, internal events, and so forth. Let be the finite set of discrete valid values (or choices) for (), and let be the set of constraints on parameter value combinations. Without loss of generality, we assume that the order of parameters is fixed; that is, . In the remainder of this paper, we will refer to a combination of parameters as a parameter interaction, and a combination of parameter values or a parameter value combination as a value combination.

Definition 1. A test profile, denoted as , is about the information on a combinatorial input domain of the SUT, including parameters, () values for parameter , and constraints on value combinations.

In this paper, we assume that all the parameters are independent; that is, no constraint among value combinations is considered (), unless otherwise specified. Therefore, the test profile can be abbreviated as .

To clearly describe some notions and definitions, we present an example of the part of suboptions in an option “View” of the tool PDF shown in Table 1. In this system, there are four configuration parameters, each of which has three values. Therefore, its test profile can be written as .

Definition 2. Given a , a test case or a test configuration is a -tuple where ().

Intuitively speaking, a combinatorial input domain is a Cartesian product of for each of ; that is, . Therefore, the size of all possible test cases is . For example, a 4-tuple , , , is a test case for the SUT shown in Table 1.

Definition 3. Given a , a -wise value combination is a -tuple involving parameters with fixed values (named fixed parameters) and parameters with arbitrary allowable values (named free parameters), where and

Generally, -wise value combination is also called -value schema [33], and is called strength. When , a -wise value combination becomes a test case for the SUT as it takes on a specific value for each of its parameters. For ease of description, we define a term as the set of -wise value combinations covered by the test case (tc). Intuitively speaking, a test case (tc) with parameters contains -wise value combinations, that is, .

For example, considering a test case (tc) , , , , we can obtain that , , , , while , , , , , , , .

Definition 4. The number of parameters required to trigger a failure is referred to as the failure-triggering fault interaction (FTFI) number.

As we know, the faulty model in the combinatorial input domain assumes that failures are caused by parameter interactions. For instance, if the SUT shown in Table 1 fails when is set to “Single”, is set to “None,” and is not equal to “None,” this failure is caused by the parameter interaction (). Therefore, the FTFI number of this fault is 3.

In [28, 34], Kuhn et al. investigated interaction failures by analyzing the faults reports of several software projects and concluded that failures are always caused by low FTFI numbers.

2.2. Failure Patterns

Given a faulty program, two basic features can be obtained from it. One feature is failure rate, denoted by , which refers to the ratio of the number of failure-causing inputs to the number of all possible inputs. The other feature is failure pattern, which refers to the geometric shapes and the distributions of the failure-causing regions. Both features are fixed but unknown to testers before testing.

In [14], the patterns of failure-causing inputs have been classified into three categories: point pattern, stripe pattern, and block pattern. An illustrative example about three types of failure patterns in a two-dimensional input domain is shown in Figure 1. In this example, suppose the input domain is consisting of parameters and where , . Point pattern means the tested program will fail when and are assigned to particular integers, that is, some specific points in the input domain, while strip pattern may be of the form , , and block pattern may be of the form , .

In the combinatorial input domain, failure patterns of any failures belong to the point pattern as all test inputs are discrete. However, from the perspective of functionality and computation of each test input, three failure patterns shown in Figure 1 also exist in the combinatorial input domain. For example, if a failure in the SUT shown in the Table 1 is caused by “ or ” and “ or ”, we believe that the failure pattern of is a strip pattern and its failure rate is ; if a failure in the SUT is caused by “”, “”, “”, and “”, we believe that the failure pattern of is a block pattern and its failure rate is ; and if a failure is caused by a single test case , we believe that the failure region of is a point pattern and its failure rate is . According to Kuhn’s investigations [28, 34], however, the FTFI numbers are always very low (i.e., the FTFI numbers are smaller than the number of parameters), which means that the strip pattern is the most frequent failure pattern in the combinatorial input domain.

2.3. Adaptive Random Testing (ART)

The methodology of adaptive random testing (ART) [14, 15] has been proposed to enhance the failure-detection effectiveness of random testing (RT) by even-spreading test cases across the whole input domain. In ART, test cases are not only randomly generated, but also evenly spread. According to previous ART studies [1522], ART was shown to reduce the number of test cases required to identify the first fault by as much as 50% over RT.

There are many implementations of ART by different notions. A simple algorithm is the fixed-size-candidate-set ART (FSCS-ART) [15]. FSCS-ART implements the notion of distance as follows. FSCS-ART uses two sets of test cases, namely, the executed set and the candidate set . is a set of test cases that have been executed but without revealing any failure, while is a set of tests that are randomly selected from the input domain according to the uniform distribution. is initially empty and the first element is randomly chosen from the input domain and then incrementally updates with the selected elements from until a failure is exhibited. From , the element that is farthest away from all test cases in is chosen as the next test case; that is, the criterion is to choose the element from as the next test case such that where dist is defined as the Euclidean distance, that is, in a -dimensional input domain, for two test inputs, and , The process is repeated until the desired stopping criterion is satisfied.

Figure 2 gives the illustration of FSCS-ART in a two-dimensional input domain. In Figure 2(a), there are 3 previously executed test cases , , and , and 2 randomly generated candidates and . To choose among the candidates, the distance of each candidate against each previously executed test case is calculated. Figure 2(b) describes that the closest previously executed test case is determined for each candidate. In Figure 2(c), the candidate is selected as the next test case (i.e., ), as the distance of against its nearest previously executed test case is larger than that of the candidate .

In this paper, we emphasize the extension of FSCS-ART as that of ART in combinatorial input domain, unless otherwise specified.

2.4. Similarity and Dissimilarity Measures for Combinatorial Input Domain

Measuring similarity or dissimilarity (distance) between two test inputs is a core requirement for test case selection, evaluation, and generation. Generally speaking, in numerical input domains, Euclidean distance (see (3)) is a mostly used distance measure for continuous data. However, for a combinatorial input domain, since its parameters and corresponding values are finite and discrete, Euclidean distance may not be available and reasonable. Nevertheless, various distance measures (or dissimilarity measures) are popularly used in data mining for evaluating categorical data [32], such as clustering (-means), classification (KNN, SVM), and distance-based outlier detection. In this subsection, we simply describe the following measures that will be adopted in our paper later.

To illustrate our work clearly, let us define a few terms. Consider a categorical dataset containing objects, derived from a for parameters . We also use the following notation.(i) is the number of times parameter takes the value in . Note that if , .(ii) is the sample probability of parameter to take the value in . The sample probability is given by (iii) is another probability estimate of parameter to take the value in and is given by (iv) is a generalized similarity measure between two data instances denoted as and where , and (). Its definition is given as follows: where () is the per-parameter similarity between two values for parameter and denotes the weight assigned to the parameter . Therefore, we only require to present the definitions of and for each similarity measure, unless otherwise specified.

To directly refer to [32], the measures discussed henceforth will all be in the context of similarity, with dissimilarity or distance measures being converted using the following formula: where is the dissimilarity measure between and .

Table 2 presents nine similarity measures for categorical parameter values, which are widely used in data mining for categorical data. In Table 2, the last column “Range” represents the range of for mismatches or matches of parameter values in each measure.

2.5. Effectiveness Measurement

In this paper, we adopt the -measure (i.e., the number of test cases required to detect the first failure) as the measurement of failure-detection effectiveness of testing methods, since previous studies [35] have demonstrated that the -measure is particularly suitable for adaptive testing strategies such as ART. Intuitively speaking, a smaller -measure of ART over RT means fewer test cases required by ART to detect the first failure and hence implies a better failure-detection effectiveness of ART than that of RT. For the purpose of clear description, we will use ART -ratio (i.e., the ratio of ART’s -measure () relative to RT’s -measure ()) to indicate the failure-detection effectiveness improvement of ART over RT.

However, it is extremely difficult to theoretically obtain ART’s -measure (). Similar to all other ART studies, is collected via simulations and empirical studies, whose procedure is described as follows. On the one hand, in simulation studies, failure pattern (including its size and sharp) and failure rate are predefined for simulating a faulty program. The failure regions are then randomly placed inside the whole input domain. If a point inside one of the failure regions is picked by a testing strategy, a failure is said to be detected. On the other hand, for empirical studies, some faults are seeded into a subject program. Once the subject program behaves differently from its fault-seeded version, it is said that a failure is identified. The number of test cases to find the first failure is regarded as the of that run. Such a process runs times repeatedly until a statistically reliable estimate of the ( accuracy rate and confidence level adopted in our paper) has been obtained. Refer to the value of ; it can be determined dynamically using the same method as shown in [15]. With respect to RT’s -measure (), since test cases are chosen with replacement according to the uniform distribution, is equal to theoretically.

Apart from the -measure used as the measurement, another measurement is also used in our paper, that is, the number of test cases required to first cover all possible value combinations of a given strength (denoted -measure). This measurement is widely used in the combinatorial input domain. Unlike the -measure, the testing stop condition of -measure is not that the first failure is detected, but that all possible -wise value combinations are first covered. For the purpose of clear description, we use to represent this measurement for RT while for ART.

3. Two Similarity Measures Based on Interaction Coverage

Apart from various similarity measures described in Section 2.4, in this section, we propose another two similarity measures by using interaction coverage: incremental interaction coverage similarity (IICS) and multiple interaction coverage similarity (MICS), in order to apply the characteristics of combinatorial input domain to the selection of test cases. All similarity measures illustrated in Section 2.4 are used to evaluate how similar two test cases are; however, two similarity measures presented in this section are used to evaluate the resemblance of the combinatorial test case against the combinatorial test suite. We will discuss them next.

Before introducing them, we firstly describe a simple similarity measure of the test case against a test suite based on interaction coverage, named normalized covered -wise value combinations similarity (or ) [36], which is widely used in combinatorial input domain.

Definition 5. Given a combinatorial test suite on , a combinatorial test case (tc), and the strength , normalized covered -wise value combinations similarity () of tc against is defined as the ratio of the number of -wise value combinations covered by tc that have already been covered by to ; that is, where can be written as follows:

Obviously, the is a function that requires to set the strength value in advance, and its range is . Two properties of the are discussed as follows.

Theorem 6. If , , where .

Proof. When , it can be noted that covers all possible -wise value combinations covered by , that is, Since also covers all possible value combinations at strengths lower than that are covered by tc. As a consequence, where .

Theorem 7. If , where .

Proof. When , it can be noted that each -wise value combination covered by tc is not covered by , indicating that, for , : that is, Therefore, the problem converts to demonstrating that , .
We suppose that such that and , that is, Due to , (13) is equivalent to the equation shown as follows: Obviously, (14) is contradictory to (12). Therefore, , , which means that where .

As we know, given a and the strength , the number of all possible -wise value combinations is fixed; that is, . In other words, there exists a test case generation method using as the criterion, which can generate a certain number of combinatorial test cases denoted as () to cover all possible -wise value combinations. However, if testing with fails to reveal any failures due to no failure-causing inputs in , the next test case generated by this method is, in fact, obtained in a random manner. The main reason is that the of each element in is equal to . Therefore, the is not particularly suitable for adaptive testing strategies such as ART. To solve this problem, we propose two similarity measures based on interaction coverage in the following subsections.

3.1. Incremental Interaction Coverage Similarity

As discussed in Theorem 6, if all possible -wise value combinations are covered by a combinatorial test suite , all possible value combinations at strengths lower than are also covered by . According to this fact, we present a new similarity measure based on interaction coverage, named incremental interaction coverage similarity (IICS).

Given a combinatorial test suite on and a combinatorial test case (tc), the incremental interaction coverage similarity of tc against is defined as follows: where satisfies the following properties:    and   , where (assume .

It can be noted that if , the IICS is equal to 1.0 as tc is the same as one of elements in ; if , the IICS of tc against is actually equal to the of tc against where is gradually incremented. More specifically, if covers all possible -wise value combinations and partial -wise value combinations occurred in tc, . Similar to , the range of IICS is also .

Here, we present an example to illustrate IICS. Suppose on , , and , as 1-wise value combinations are not completely covered by , and hence . Let , as covers all 1-wise value combinations and partial 2-wise value combinations occurred in , and hence .

3.2. Multiple Interaction Coverage Similarity

As shown in Section 3.1, the IICS measure begins at strength , and then update the value of by . In other words, it considers different strength values when evaluating the combinatorial test case against the combinatorial test suite. However, the IICS accounts for each strength value at each time rather than simultaneously considering all strength values. As a consequence, we present another similarity measure based on interaction coverage, named multiple interaction coverage similarity (MICS).

Given a combinatorial test suite on and a combinatorial test case (tc), the weighted interaction coverage similarity of tc against is defined as follows: where and .

Intuitively speaking, if , . Similar to IICS, the MICS ranges from 0 to 1.0.

Here, we present an example to explain the definition of MICS. Let on , , , and , , while .

3.3. Properties of Two New Similarity Measures

Some properties of the proposed two similarity measures are discussed in the following subsection.

Theorem 8. If , for , and remain unchanged.

Proof. On the one hand, if (i.e., covers all possible -wise value combinations), for ,
On the other hand, if (i.e., covers all possible -wise value combinations), , . According to Theorem 6, it can be concluded that , where ; that is, covers all possible -wise value combinations. In other words, (). Therefore, for , In summary, if , for , and .

According to Theorem 8, a test case generation method using IICS or MICS as the similarity measure becomes a random generation method, when its generated test suite covers all possible -wise value combinations. The main reason is that, for any candidates, no matter whether they are included in or not, the IICS (or MICS) values of all candidates are identical.

Theorem 9. If , .

Proof. If , where because of and ; that is, all possible -wise value combinations covered by tc are not covered by . According to (15), therefore, .

As discussed before, both IICS and MICS consider different interaction coverage when evaluating combinatorial test cases. However, they have some differences. Given a combinatorial test case (tc), its IICS measure is actually calculated by the at an appropriate value, which means that the IICS measure of tc only considers single interaction coverage, while its MICS measure considers different coverage at the same meanwhile. In other words, tc’s calculation time of the IICS measure is less than that of the MICS.

In summary, two new similarity measures based on interaction coverage (IICS and MICS) fundamentally differ from NCVCS due to the following reasons: they do not require setting the strength value in advance, and they are more suitable for adaptive strategies than NCVCS.

4. Adaptive Random Testing for Combinatorial Test Inputs

In this section, we propose a new family of methods adopting ART in combinatorial input domain, namely, ART-CID. Similar to previous ART studies, ART-CID can also be implemented according to different notions. In this paper, we present one version of ART-CID by similarity (denoted as FSCS-CID), which uses the strategy of FSCS-ART [15]. Since the similarity measure is used in this paper, the procedure of FSCS-CID may differ from that of FSCS-ART. Detailed information will be given as follows.

4.1. Similarity-Based Test Case Selection in FSCS-CID

FSCS-CID uses two test sets, that is, the candidate set of fixed size and the executed set , each of which has the same definition as FSCS-ART. However, test cases in either or are obtained from the combinatorial input domain. For ease of description, let while . In order to select the next test case from , the criterion is described as follows: where is the similarity measure between two combinatorial test inputs. The detailed algorithm of implementing (19) is illustrated as follows (see Algorithm 1).

Input: The candidate set , and the executed set
.
Output: Best test element .
( )   Set ;
( )   for ( to )
( )Set ;
( )for ( to )
( )Calculate the similarity between and , that is, ;
( )if ( )
( )   ;
( )end  if
( )  end  for
( )  if ( )
( )   ;
( )   ;
( )  end  if
( )  end  for
( )  return   .

4.2. Algorithm of FSCS-CID

As discussed before, Algorithm 1 is used to guide the selection of the best test case. In FSCS-CID, the process of Algorithm 1 runs until the stop condition is satisfied. In this paper, we consider two stop conditions: the first software failure is detected (denoted  StopCon1); and all possible value combinations at strength are covered (denoted  StopCon2). Detailed algorithm of FSCS-CID is shown in Algorithm 2.

Input: A test profile, .
Output: The size of the executed set and the first failure-causing input when meeting the StopCon1 (or the size of when meeting the StopCon2).
( ) Set , and ;
( ) Randomly generate a test case , according to by the uniform distribution;
( ) Run the program with , and ;
( ) while (The StopCon1 (or StopCon2) is not satisfied)
( ) Randomly generate candidates from , according to uniform distribution to form the candidate set ;
( ) Set ;
( ) Run the program with , and ;
( ) end  while
( ) return   and for the StopCon1 (or for the StopCon2).

Since the frequencies of parameter values are used in some similarity measures such as Lin, OF, and Goodall2, there requires a fixed-size set of test cases in order to count the frequencies. However, the executed set is incrementally updated with the selected element from the candidate set until the  StopCon1 (or  StopCon2) is satisfied. In this paper, we take the following strategy to construct the fixed-size set of test cases when calculating the similarity between test inputs. During the process of choosing the th () test input from as the next test case (i.e., ), each candidate requires to be measured against all elements in according to the similarity measure, and the fixed-size set of test case for is constructed by .

5. Experiment

In this section, some experimental results, including simulations and experiments against real programs, were presented to analyze the effectiveness of FSCS-CID. We mainly compared our method to RT in terms of failure-detection effectiveness (-measure) and the rate of value combinations coverage at a given strength (-measure). For ease of describing our work clearly, we used the terms  Goodall1,  Goodall2,  Goodall3,  Goodall4,  Lin,   Lin1,   Overlap,  Eskin,  OF,  IICS,  and  MICS to, respectively, represent the similarity measure Goodall1, Goodall2, Goodall3, Goodall4, Lin, Lin1, Overlap, Eskin, OF, IICS, and MICS adopted in the FSCS-CID. Additionally, we used the term RT to represent RT.

As shown in (16), a weight is required to be assigned for interaction coverage at each strength value. There are many techniques which conduct on assigning weights; however, in this paper we focus on two distribution styles: equal distribution where each interaction coverage has the same weight, that is, ; and FTFI percentage distribution where according to previous studies [28, 34], for example, in [28], Kuhn et al. investigated several software projects and concluded that the interaction faults are summarized to have 29% to 82% faults as 1-wise faults (i.e., the FTFI number is 1), 6% to 47% of faults as 2-wise faults, 2% to 19% as 3-wise faults, 1% to 7% of faults as 4-wise faults, and even fewer failures beyond 4-wise interactions. As a consequence, we arrange weights as follows: , , where . For example, if , and ; if , , , and . In this paper, therefore, we use the terms  MICS1 and  MICS2 to stand for the  MICS techniques with the above two weight distribution styles, respectively.

5.1. Simulation

In the following subsection, two simulations were presented to analyze the effectiveness of FSCS-CID according to the rate of covering -wise value combinations (i.e., -measure). We used two usual test profiles and that are commonly used in previous studies [37].

5.1.1. Setup

Since the was known before testing, in this simulation, we considered the FSCS-CID using [36] as the similarity measure (denoted  NCVCS). Except the  MICS, all other methods do not require to be set. As for the  MICS, different strength values from 1 to are considered to calculate the MICS measure according to (16). However, due to the known , we mainly focused on the strength values from 1 to for calculating the MICS measure. As a consequence, (16) becomes as follows: where only are considered. Each method runs until the  StopCon2 is satisfied. Additionally, we consider as the metric to evaluate each method in terms of the rate of covering value combinations at strength for each method, where .

5.1.2. Results

Figure 3 summarizes the number of test cases required to cover all possible -wise value combinations (i.e., ) generated by each method for the above two designed test profiles. Based on the experimental data, we have the following observations.(1)For each test profile, the () metric values of all FSCS-CID methods using different similarity measures are smaller than those of RT. In other words, the FSCS-CID methods require the smaller number of test cases for covering all -wise value combinations than RT, which means that the FSCS-CID methods have the higher rates of covering value combinations than RT.(2)Among all the FSCS-CID methods, the  NCVCS is the most effective technique. The results show that the values of the  NCVCS are about 30%~50% of those of the  RT. The  IICS has the second best metric values, followed by the  OF. For , the  Goodall3 is least effective, while for , the Lin performs least.(3)From the perspective of the similarity category, the FSCS-CID methods using the interaction-coverage-based similarity measures (including  IICS,  MICS, and  NCVCS) perform best, while the FSCS-CID methods using the information-theoretic similarity measures (including  Lin and  Lin1) perform worst.

5.1.3. Analysis

Here, we briefly analyze the above observations. The observation (1) is explained as follows. The FSCS-CID methods using different similarity measures select the next test case that has the smallest similarity value against already generated test cases, while RT simply generates teat cases at random from combinatorial input domain. As a consequence, the FSCS-CID methods achieve test cases more diversely than RT over the combinatorial input domain.

As for the observations (1) and (2), they are easy to be explained. On the one hand, since the metric is related to -wise value combinations, the  NCVCS performs best because it selects the next test case that covers of uncovered -wise value combinations as much as possible. In other words, it may have the fastest rate of covering all -wise value combinations. On the other hand, another two interaction-coverage-based methods, such as IICS and MICS, consider different strength values for generating test cases; however, both of them take the strength as an indispensable part. In detail, the  IICS calculates the test candidate from the strength 1 to , while the  MICS considers different strengths from 1 to at the same time. Hence, it is reasonable that, compared to other categories, the FSCS-CID methods using interaction-coverage-based similarity measures perform best according to the metric.

5.2. An Empirical Study

In this section, an empirical study was conducted to compare the performance between FSCS-CID and RT in practical situations, using the -measure as the effectiveness metric. To describe data clearly, we used ART -ratio, which is defined as the -measure ratio between FSCS-CID and RT, that is, . Intuitively speaking, the smaller ART -ratio value implies the higher improvement of FSCS-CID over RT, and is the -measure improvement of FSCS-CID over RT.

In this empirical study, we use a set of six fault-seeded C programs with 9 versions. The five subject programs, including  count,  series,  tokens,  ntree, and  nametbl, are downloaded from Chris Lott’s website (http://www.maultech.com/chrislott/work/exp/), which have been widely used in the research of combinatorial space such as comparison of defect revealing mechanisms [38], evaluation of different combination strategies for test case selection [39], and fault diagnosis [40, 41]. The remainder subject programs are a series of   flex programs (the model used in this paper is unconstrained, which has some limitations: “We note that in a real test environment an unconstrained TSL would most likely be prohibitive in size and would not be used” [42].), downloaded from Software Infrastructure Repository (SIR) [43], which are popularly used in combinatorial test suite construction [44] and combinatorial interaction regression testing [42].

Table 3 presents detailed information about these subject programs, from which the third column “LOC” represents the number of lines of executable code in these programs, and “#S.” is the number of seeded faults in each subject program, while “#D.” is the number of faults that can be detected by some test cases derived from the accompanying test profiles, which are not guaranteed to be able to detect all faults. However, in our study, we only use a portion of detectable faults, of which the size is shown as “#U.”. The main reason is due to the fact that faults in the set of detectable faults but not in the set of used faults have high failure rates that exceed 0.5. As we know, if the failure rate of a fault is larger than 0.5, the -measure of random testing is theoretically less than . As a consequence, the -measure of FSCS-CID depends on the first randomly selected test case. In other words, if the first test case cannot detect a failure, the is larger than or equal to 2. Therefore, the -measure of FSCS-CID is dependent on random testing.

For the purpose of clear description, we order used faults in each subject program in a descend order according to failure rate and abbreviate them as . The range of failure rates in each program, as shown in Table 3, is from to .

We used all twelve FSCS-CID versions using different similarity measures to test these fault-seeded programs. The results of the empirical study are given in Figure 4, where -axis represents each seeded fault in the subject program, while -axis represents the ART -ratio. As shown in Figures 4(a)4(i), each figure corresponds to a particular subject program, while Figure 4(j) represents the average ART -ratio of all FSCS-CID versions for each subject program.

From Figures 4(a)4(j), we can observe the following conclusions.(1)According to ART -ratio, all twelve FSCS-CID versions, including  Goodall1,  Goodall2,  Goodall3,  Goodall4,  Lin,  Lin1,  Overlap,  Eskin,  OF,  IICS,  MICS1, and  MICS2, perform better than  RT. In the best case, the improvement of FSCS-CID over RT is about 40% (i.e., ART -ratio is 60%).(2)With the increase of failure rate , the ART -ratio of each FSCS-CID version increases as well in most programs. In other words, when is larger, the improvement of each FSCS-CID version over RT is smaller.(3)The failure-detection capability of FSCS-CID depends on some factors, such as the program (or test profile) and failure type (including failure rate and failure pattern). For example, in program count faults and have the same failure rate; however, the ART -ratio of each FSCS-CID version when detecting is very different from that when detecting .(4)Figure 4(j) describes the average ART -ratio of all FSCS-CID versions when detecting each fault for each subject program. It can be clearly seen that the ART -ratio of the FSCS-CID algorithm generally fluctuates from 0.75 to 0.90 among all faults for each program, which means FSCS-CID can improve about 10%~25% of -measure over RT in the average.(5)Among all FSCS-CID versions, no method performs best for all programs, and no method performs worst. In order to compare the failure-detection capabilities of different FSCS-CID versions, Table 4 shows the average ART -ratio of each FSCS-CID version for each subject program. According to data shown in Table 4, it is obvious that in general one of FSCS-CID version  OF performs best, followed by  IICS, while  Lin and  Lin1 generally perform worst. In addition,   Eskin performs best for the program  tokens and  Goodall1 has the best performance for the program  flex-v4.

In summary, our simulation results (Section 5.1) have shown that our FSCS-CID algorithm (irrespective of used similarity measure) has higher rates of covering value combinations at different strength values than those of random testing. Besides, the empirical study has shown that the FSCS-CID algorithm performs better than RT in terms of the number of test cases required to detect the first failure (i.e., -measure).

5.3. Threats to Validity

The experimental results suffer from some threats to validity; in this section, we outline the major threats. In the simulation study, two widely used, but limited, test profiles were employed. In the empirical study, many real-life programs were used, which have been popularly investigated by different researches. However, the faults seeded in each subject program have high failure rates. To address these potential threats, additional studies using a great number of test profiles and a great number of subject programs with low failure rates will be investigated in the future.

In addition, although two metrics (-measure and -measure) were employed in our experiment, we recognize that there may be other metrics which are more pertinent to the study.

6. Discussion and Conclusion

Adaptive random testing (ART) [15] has been proposed to enhance the failure-detection capability of random testing (RT) by evenly spreading test cases all over the input domain and has been widely applied in various applications such as numerical programs, Java programs, and object-oriented programs. In this paper, we broaden the principle of ART in a new type of input domain that has not yet been investigated, that is, combinatorial input domain. Due to special characteristics of combinatorial input domain, the test case similarity (or dissimilarity) measures previously used in ART may not be suitable for combinatorial input domain. By adopting some well-known similarity measures used in data mining and proposing two new similarity measures based on interaction coverage, we proposed a new approach to apply original ART into combinatorial input domain, named ART-CID. We conducted some experiments including simulations and the empirical study to analyze the effectiveness of one version of ART-CID (FSCS-CID, which is based on fixed-size-candidate-set ART). Compared with RT, FSCS-CID not only brings higher rates in covering all possible combinations at any given strengths, but also requires fewer combinatorial test cases to detect the first failure in the seeded program.

Combinatorial interaction testing (CIT) [33] is a black-box testing method and has been widely used in combinatorial input domain. It aims at constructing an effective test suite to identify interaction faults caused by parameter interactions. Some greedy CIT algorithms, such as AETG [45], TCG [46], and DDA [47], may have similar mechanism as FSCS-CID. Taking AETG for example, similar to AETG, FSCS-CID also first constructs some candidates, and then from which the “best” element would be chosen as the next test case according to some criteria. However, there are some fundamental differences between AETG and FSCS-CID, which are mainly summarized as follows.(1)Different construction strategies of candidates: FSCS-CID constructs candidates in a random manner, while AETG first orders all parameters and then assigns a value to each parameter, such that all assigned parameter values can cover the largest number of value combinations at a given strength.(2)Different test case selection criteria: AETG selects an element from candidates as the next test case such that it covers the largest number of value combinations at a given strength, while FSCS-CID chooses the next test case according to its used similarity measure.(3)Different goals achieved: AETG aims at covering all possible value combinations of a given strength with fewer test cases, which means that the unique stopping condition of AETG is that all value combinations of a given strength are covered by generated test cases, while FSCS-CID is an adaptive strategy, which means that the stopping condition of FSCS-CID is not limited to covering all value combinations of a give strength, for example, detecting a first failure in the SUT.

In this paper, constraints among value combinations have not been considered; however, they often exist in real-life programs. For example, as shown in Table 1, there may exist a constraint among “Full Size” of and “Single” of , that is, when , (i.e., “Full Size” and “Single” cannot occur in a combinatorial test cases). In this case, the method FSCS-CID proposed in this paper can still be successfully executed only by judging that each selected test case violates constraints among value combinations or not. Generally speaking, this judgment process can be implemented in the following phases: when constructing the candidate set and when adding the latest test case into the executed set. However, how to deal with constraints among value combinations should be further studied.

In the future, we plan to further investigate how to improve the effectiveness of the approach by adopting other similarity measures that may be available in combinatorial input domain or by considering additional factors to guide test case generation. In addition, how to extend other original ART algorithms into combinatorial input domain is also expected.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank C. M. Lott for sending them the failure reports of five subject programs (count,  series,  tokens,  ntree,  and  nametbl) and the Software-artifact Infrastructure Repository (SIR) [43] which provided the source code and fault data for the program  flex. They also would like to thank T. Y. Chen for the many helpful discussions and comments. This work is in part supported by the National Natural Science Foundation of China (Grant no. 61202110) and Natural Science Foundation of Jiangsu Province (Grant no. BK2012284).