Research Article  Open Access
A New Software Reliability Growth Model: Multigeneration Faults and a Power-Law Testing-Effort Function
Abstract
Software reliability growth models (SRGMs) based on a nonhomogeneous Poisson process (NHPP) are widely used to describe the stochastic failure behavior and assess the reliability of software systems. For these models, the testing-effort effect and the fault interdependency play significant roles. Considering a power-law function of testing effort and the interdependency of multigeneration faults, we propose a modified SRGM to reassess the reliability of open source software (OSS) systems and validate the model's performance on several real-world data sets. Our empirical experiments show that the model fits the failure data well and provides a high level of prediction capability. We also formally examine the optimal software release policy, considering both the testing cost and the reliability requirement. Through sensitivity analysis, we find that if the testing-effort effect or the fault interdependency is ignored, the release of the software would be seriously delayed and more resources would be misallocated in testing.
1. Introduction
Software systems are considered to run successfully when they perform without failure over a certain period of time. In practice, however, failures are frequently observed, and a perfect operational process cannot always be guaranteed for a variety of reasons. Software reliability, defined as the probability of failure-free operation of a software program over a specified time interval, is an essential consideration in software development and a critical factor in measuring software quality. Accurately modeling the software reliability growth process and predicting its trend are therefore of high value to software developers. For these reasons, a large number of software reliability growth models (SRGMs), most of which are based on a nonhomogeneous Poisson process (NHPP), have been constructed [1–5]. Unlike closed-source software, open source software (OSS) has a knowledge base that is not bound to a particular organization. As a result, reliability measurement is both important and complicated for OSS such as the Linux operating system, the Mozilla browser, and the Apache web server. Owing to these unique properties, a parallel line of SRGMs has been developed specifically to quantify the stochastic failure behavior of OSS [6, 7].
One key factor in software reliability growth modeling is the testing-effort effect [8–14]. Huang and Kuo [15] incorporated a logistic testing-effort function into software reliability modeling and showed that it provides powerful prediction capability on real failure data. Huang [16] considered a generalized logistic testing-effort function and change-point parameters in software reliability modeling, fitting the new models to practical data. Ahmad et al. [17] incorporated the exponentiated Weibull testing-effort function into inflection S-shaped software reliability growth models. Peng et al. [18] established a flexible and general software reliability model considering both a testing-effort function and imperfect debugging.
In these models, assuming independence among successive software runs may not be appropriate [19–21]. For instance, Goševa-Popstojanova and Trivedi [22] introduced a software reliability model based on Markov renewal processes that covers interdependence among successive software runs. The compound-Poisson software reliability model presented by Sahinoglu [23] considered multiple failures that occur simultaneously. Singh et al. [24] proposed new SRGMs that assume the presence of two types of faults in the software: leading and dependent. Leading faults can be removed when a failure is observed, but dependent faults are masked by leading faults and can be removed only after the corresponding leading faults are removed.
We also note that a body of prior work has examined the problem of optimal release time under different criteria. One stream studies this issue by minimizing the total cost subject to a constraint on software reliability. For example, Huang and Lin [25] provided a series of theoretical findings on when a software version should be released when fault dependency and debugging time lag are considered. In addition, Ahmad et al. [26] analyzed the optimal release timing in an SRGM with Burr-type X testing-effort functions. Another stream adopts multiattribute utility theory (MAUT) to decide the release time. For instance, Singh et al. [27] built an SRGM based on the Yamada two-stage model and focused on two different attributes to investigate the release time. Pachauri et al. [28] studied an SRGM incorporating a generalized modified Weibull testing-effort function in an imperfect debugging environment and examined the optimal release policy using both a genetic algorithm (GA) and MAUT.
In this work, we propose a new model that incorporates a power-law testing-effort function and three generations of interdependent errors. Specifically, the detection process for the first-generation errors is independent, and the detection rate is proportional to the instantaneous testing-effort expenditure and the mean number of remaining first-generation errors in the system. By contrast, the detection process for the second-generation errors depends on the first generation: the detection rate is proportional not only to the instantaneous testing-effort expenditure and the mean number of remaining second-generation errors but also to the ratio of removed first-generation errors. Similarly, the detection process for the third-generation errors relies on the second generation, with a detection rate proportional to the instantaneous testing-effort expenditure, the mean number of remaining third-generation errors, and the ratio of removed second-generation errors.
Model parameters are estimated by the nonlinear least squares estimation method, realized by minimizing the sum of squared deviations between estimated values and true observations. Numerical experiments are carried out on three versions of Apache. The estimation results show that the proposed model fits the real observations well and performs better than the traditional Goel-Okumoto model and the Yamada delayed S-shaped model. Comprehensive comparisons of prediction capability among the models reveal that ours can predict the error occurrence process more accurately than the benchmarks. We also analyze the optimal release policy and the minimal cost for software development managers, based on which insightful suggestions are provided. In addition, sensitivity analysis shows that resources would be misallocated if either the testing-effort effect or the error interdependency were overlooked by the software testing team. Unlike the benchmarks, our results suggest that an earlier release time can substantially reduce the cost.
2. Software Reliability Growth Modeling
2.1. Basic Assumptions
To describe the stochastic failure behavior of software systems, we define N(t) as a counting process that represents the cumulative number of failures by time t. We use the mean value function (MVF) m(t) to denote the expected cumulative number of detected failures in the time period (0, t]. That is,

m(t) = E[N(t)] = ∫_0^t λ(s) ds,

where E[·] denotes the expectation operator and λ(t) is called the failure intensity function, which indicates the instantaneous fault-detection rate at time t. Specifically, for an NHPP-based SRGM, it is assumed that N(t) follows a Poisson distribution with mean value function m(t). That is, the model can be mathematically characterized as

Pr{N(t) = n} = [m(t)]^n e^(−m(t)) / n!,  n = 0, 1, 2, ...,

which is the Poisson probability mass function (pmf) with mean m(t) [29, 30]. Generally, one can obtain different NHPP models by taking different mean value functions.
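As a quick sketch of the NHPP characterization above, the Poisson pmf with a given mean value m(t) can be evaluated directly; the mean value 3.5 below is an arbitrary illustration, not taken from the paper's data:

```python
import math

def nhpp_pmf(n, m_t):
    # P{N(t) = n} for a Poisson count with mean value m(t) = m_t.
    return m_t ** n * math.exp(-m_t) / math.factorial(n)

# The pmf sums to 1 over n, and the expected count recovers m(t).
mass = sum(nhpp_pmf(n, 3.5) for n in range(60))
mean = sum(n * nhpp_pmf(n, 3.5) for n in range(60))
```

Different choices of m(t) plug into the same pmf, which is what distinguishes the NHPP models compared later in the paper.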
In this paper, to characterize the stochastic behavior of the software failure process and obtain the expression for the mean value function, we make the following assumptions:

(i) The software system is subject to failures at random times caused by errors remaining in the system.
(ii) The error detection phenomenon in software testing is modeled by a nonhomogeneous Poisson process.
(iii) Each time a failure occurs, the error causing it is immediately removed and no new errors are introduced.
(iv) The faults existing in the software are of three generations (g1, g2, and g3), and each generation is modeled by a different growth curve.
(v) The mean number of g1 errors detected in the time interval (t, t + Δt] is proportional to the instantaneous testing-effort expenditure and the mean number of remaining g1 errors in the system.
(vi) The mean number of g2 errors detected in the time interval (t, t + Δt] is proportional to the instantaneous testing-effort expenditure, the mean number of remaining g2 errors in the system, and the ratio of removed g1 errors.
(vii) The mean number of g3 errors detected in the time interval (t, t + Δt] is proportional to the instantaneous testing-effort expenditure, the mean number of remaining g3 errors in the system, and the ratio of removed g2 errors.
(viii) The fault-detection/removal rate is a power-law function of testing time for all three generations of errors.
The first four assumptions in the above list are the building blocks in most prior software reliability growth models and the others are specific in our model. To facilitate our later analysis, we summarize the notations used throughout this paper in Notations section.
2.2. Proposed Software Reliability Growth Model
In general, an implemented software system is tested to detect and correct software errors during the development process. In the testing phase, software errors remaining in the system are discovered and corrected by test personnel. Based on the aforementioned assumptions, the detection rate must be proportional to the instantaneous testing-effort expenditure. In particular, unlike traditional closed-source software, in open source software instructions are executed by various concurrent users, and each release of open source software can attract an increasing number of volunteers to the software development and/or code modification procedure. Hence, it is reasonable to assume the detection/correction rate to be a power-law function of testing time [19, 24]. That is, the instantaneous testing-effort expenditure at time t, w(t), takes the following form:

w(t) = α t^β,

where α and β are parameters to be estimated from real-world data sets. As such, we can obtain the cumulative amount of testing-effort expenditure by time t; that is,

W(t) = ∫_0^t w(s) ds = α t^(β+1) / (β + 1).
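The instantaneous and cumulative testing-effort functions can be sketched directly; the power-law form w(t) = α·t^β and the parameter values below are illustrative assumptions, not estimates from the paper's data:

```python
def w(t, alpha, beta):
    # Instantaneous testing-effort expenditure, assumed power-law form w(t) = alpha * t**beta.
    return alpha * t ** beta

def W(t, alpha, beta):
    # Cumulative testing effort: W(t) = integral of w over (0, t] = alpha * t**(beta+1) / (beta+1).
    return alpha * t ** (beta + 1) / (beta + 1)

# Midpoint-rule check that W really integrates w on [0, 4] for alpha=2, beta=0.5.
steps, T = 20000, 4.0
approx = sum(w((k + 0.5) * T / steps, 2.0, 0.5) * T / steps for k in range(steps))
```

Note that β = 0 recovers a constant effort rate w(t) = α, the special case used by the benchmark models later in the paper.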
As stated above, it is essential to consider fault interdependency when modeling software reliability growth processes. Specifically, in this work we propose three generations of interdependent errors: the detection of the second-generation errors relies on the first generation, and the detection of the third-generation errors depends on the second generation. Moreover, the second-generation errors are detectable only if the first-generation errors have been removed, and the third-generation errors are detectable only if the second-generation errors have been removed. A flow chart illustrating the error detection process in a software system is shown in Figure 1.
Next, we sequentially examine the mean value function for each generation of errors:
(i) For the first-generation (g1) errors in the software system, the number of detected errors is proportional to the number of remaining errors and the instantaneous testing-effort expenditure. Thus, the mean value function m_g1(t) satisfies the following differential equation:

d m_g1(t)/dt = b · w(t) · [a_g1 − m_g1(t)],

where a_g1 is the initial content of g1 errors in the system, b is the detection rate per unit of testing effort, and m_g1(t) represents the expected number of g1 errors detected in time (0, t]. Specifically, m_g1(t) is assumed to be a bounded nondecreasing function of t with the initial condition m_g1(0) = 0. Solving the above differential equation yields

m_g1(t) = a_g1 [1 − e^(−b W(t))].

(ii) The mean value function m_g2(t) for g2 errors satisfies the following differential equation:

d m_g2(t)/dt = b · w(t) · [a_g2 − m_g2(t)] · [m_g1(t)/a_g1],

where a_g2 is the initial content of g2 errors in the system and m_g2(t) stands for the expected number of g2 errors detected in time (0, t]. Specifically, m_g2(t) is also assumed to be a bounded nondecreasing function of t with the initial condition m_g2(0) = 0. Solving the above differential equation yields

m_g2(t) = a_g2 [1 − exp(−b ∫_0^t w(s) · (m_g1(s)/a_g1) ds)].

(iii) Similarly, the mean value function m_g3(t) for g3 errors satisfies the following differential equation:

d m_g3(t)/dt = b · w(t) · [a_g3 − m_g3(t)] · [m_g2(t)/a_g2],

where a_g3 is the initial content of g3 errors in the system and m_g3(t) denotes the expected number of g3 errors detected in time (0, t]. Specifically, m_g3(t) is assumed to be a bounded nondecreasing function of t with the initial condition m_g3(0) = 0. Solving the above differential equation leads to

m_g3(t) = a_g3 [1 − exp(−b ∫_0^t w(s) · (m_g2(s)/a_g2) ds)].
After this, we formulate the overall mean value function, which consists of the mean value functions for the three generations of errors. Assume that a is the expected total number of faults that will eventually be detected and that p_g1, p_g2, and p_g3 are the corresponding ratios for each generation of errors at the beginning of testing. By noting that a = a_g1 + a_g2 + a_g3, p_g1 = a_g1/a, p_g2 = a_g2/a, and p_g3 = a_g3/a, we have

m(t) = m_g1(t) + m_g2(t) + m_g3(t).

As such, m(t) is a bounded nondecreasing function of t, with m(0) = 0 and m(∞) = a.
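As a sketch of how the three generations interact, the coupled detection-rate equations described above can be integrated numerically; the detection-rate constant b, the power-law effort parameters, and the fault contents used below are illustrative assumptions, not estimates from the Apache data:

```python
import math

def w(t, alpha=2.0, beta=0.5):
    # Illustrative power-law testing-effort function (assumed form and values).
    return alpha * t ** beta

def W(t, alpha=2.0, beta=0.5):
    # Cumulative testing effort corresponding to w.
    return alpha * t ** (beta + 1) / (beta + 1)

def three_generation_mvf(T, b=0.1, a=(40.0, 20.0, 14.0), steps=40000):
    # Forward-Euler integration of the coupled mean-value-function equations:
    #   m1' = b*w(t)*(a1 - m1)               (independent g1 errors)
    #   m2' = b*w(t)*(a2 - m2)*(m1/a1)       (g2 detection gated by g1 removal ratio)
    #   m3' = b*w(t)*(a3 - m3)*(m2/a2)       (g3 detection gated by g2 removal ratio)
    a1, a2, a3 = a
    m1 = m2 = m3 = 0.0
    dt = T / steps
    for k in range(steps):
        wt = w(k * dt)
        dm1 = b * wt * (a1 - m1)
        dm2 = b * wt * (a2 - m2) * (m1 / a1)
        dm3 = b * wt * (a3 - m3) * (m2 / a2)
        m1 += dm1 * dt
        m2 += dm2 * dt
        m3 += dm3 * dt
    return m1, m2, m3
```

The g1 component should track its closed form a1·(1 − e^(−b·W(t))), while the later generations lag behind it because their detection is gated by the removal ratios of the preceding generation.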
Then, based on the aforesaid mean value function, the failure intensity function for the proposed model with testing effort can be explicitly given as

λ(t) = dm(t)/dt = b · w(t) · {[a_g1 − m_g1(t)] + [a_g2 − m_g2(t)] · m_g1(t)/a_g1 + [a_g3 − m_g3(t)] · m_g2(t)/a_g2}.
We further note that when a_g2 = 0, a_g3 = 0, and β = 0, the original Goel-Okumoto model can be realized [31]; that is,

m(t) = a [1 − e^(−bαt)],

which is one of our benchmarks for the comparative analysis. In this case, errors are assumed to be independent and the testing-effort function remains constant; that is, w(t) = α. Moreover, this mean value function is a concave function with respect to t.
Moreover, when β = 0, we have w(t) = α; namely, the instantaneous testing-effort expenditure at time t reduces to a constant. In this case, the mean value function follows from the expressions above with W(t) = αt, and we call the resulting model the no-testing-effort model for brevity.
In addition, when a_g2 = 0 and a_g3 = 0, the error interdependency is ignored and only one generation of errors exists in the software system. As a result, we have

m(t) = a [1 − e^(−b W(t))].

We call this the single-generation model for exposition.
As another benchmark, we introduce an S-shaped model, which incorporates an increasing detection-rate function and well explains the "learning effect" of the testing team. Essentially, its mean value function is

m(t) = a [1 − (1 + bt) e^(−bt)].

It is also called the Yamada delayed S-shaped model [32, 33] in the literature. For this model, the detection rate first increases and then decreases with t.
3. Estimation and Analysis
3.1. Data Set
To compare the proposed model against traditional models for reliability assessment, numerical examples are provided based on real-world data. Specifically, the well-known Apache project, the most widely used web server software, developed and maintained by an open community of developers, is selected in this paper for illustration. This project has large and well-organized communities: a great number of developers have the right to update and change files freely. In particular, testing data for three versions of Apache, 2.0.35/2.0.36/2.0.39, are used. The failure data are presented in Table 1 [8], including the time and the number of errors detected for each version. Specifically, testing of Apache 2.0.35 ranges from day 1 to day 43 and in total 74 errors are found; failure data for Apache 2.0.36 are collected from day 1 to day 103 and 50 errors are detected; Apache 2.0.39 is tested over a 164-day period and in the end 58 errors are discovered by the testing team.

3.2. Estimation Method
The nonlinear least squares estimation (NLSE) method is a widely used estimation technique, especially for small- and medium-sized data sets [34]. We use this technique to estimate the parameters of the models raised in this paper. More specifically, estimation is done by minimizing the sum of squared residuals, that is, the squares of the differences between estimated values and true observations. The objective function for the minimization problem is stated as follows:

S = Σ_{j=1}^{n} [m(t_j) − y_j]²,

where y_j is the j-th observation, n is the total number of observations, t_j is the time at which the j-th error occurs, and m(t_j) is the corresponding estimated/theoretical value. Specifically, in this paper, m(t) can be taken from (11), (13), (14), (15), or (16), corresponding to our proposed model, the Goel-Okumoto model, the no-testing-effort model, the single-generation model, and the Yamada delayed S-shaped model, respectively.
Differentiating S with respect to each unknown parameter and setting the partial derivatives to zero, we can obtain the estimated values [34, 35]. Specifically, for our proposed model, we have

∂S/∂a_g1 = ∂S/∂a_g2 = ∂S/∂a_g3 = ∂S/∂b = ∂S/∂α = ∂S/∂β = 0.

Using numerical methods to solve this set of simultaneous equations yields the estimates of the unknown parameters in our proposed model.
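The NLSE procedure can be sketched end to end on synthetic data; for brevity the sketch fits only the two-parameter Goel-Okumoto benchmark with a coarse grid search over the sum-of-squares objective, whereas in practice a gradient-based nonlinear solver would be applied to the full model. The data below are simulated, not the Apache observations:

```python
import math
import random

def goel_okumoto(t, a, b):
    # Goel-Okumoto mean value function m(t) = a * (1 - exp(-b * t)).
    return a * (1.0 - math.exp(-b * t))

# Synthetic cumulative-failure observations (illustrative only).
random.seed(1)
t_obs = list(range(1, 31))
y_obs = [goel_okumoto(t, 74.0, 0.08) + random.gauss(0.0, 0.5) for t in t_obs]

def sse(a, b):
    # NLSE objective: sum of squared deviations between model and observations.
    return sum((goel_okumoto(t, a, b) - y) ** 2 for t, y in zip(t_obs, y_obs))

# Coarse grid search over (a, b) in place of solving the first-order conditions.
a_hat, b_hat = min(
    ((a, b / 1000.0) for a in range(40, 121) for b in range(10, 201)),
    key=lambda p: sse(*p),
)
```

Since the true generating values (a = 74, b = 0.08) lie on the grid, the minimizer should land close to them despite the added noise.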
3.3. Metrics for Model Comparison
With estimated parameters, we can conduct comparative analysis among models based on the following metrics:
(1) Goodness of Fit (R²): the value of R² determines how well the curve fits the real data [19, 36]. It is defined as follows:

R² = 1 − Σ_{j=1}^{n} [y_j − m(t_j)]² / Σ_{j=1}^{n} (y_j − ȳ)²,

where ȳ is the mean of the observations. It measures the percentage of the total variance about the mean accounted for by the fitted curve and ranges from 0 to 1. The larger R² is, the better the model fits the data. In this work, the model is considered to provide a good fit if R² is above 0.95.
(2) Mean of Square Error (MSE): the value of MSE measures the average of the squares of the differences between the predicted values and the actual values [9]. It is defined as follows:

MSE = (1/n) Σ_{j=1}^{n} [m(t_j) − y_j]².

Generally, the closer the MSE value is to zero, the better the model fits the data.
(3) Theil Statistic (TS): the value of TS indicates the average percentage deviation with respect to the actual failure data during the selected time period [37]. It is defined as follows:

TS = sqrt( Σ_{j=1}^{n} [m(t_j) − y_j]² / Σ_{j=1}^{n} y_j² ).

A smaller TS value means a smaller deviation and a better prediction capability. In this work, the model is considered to provide a high-level prediction if the TS value is below 10%.
(4) Relative Error (RE): the value of RE measures the relative difference between the predicted value and the real observations [25, 38]. It can be stated as follows:

RE = [m̂(t_q) − y_q] / y_q.

Specifically, assuming that we have observed y_q failures by the end of the test at time t_q, the failure data up to time t_e (t_e ≤ t_q) are used to estimate the parameters of the models. Substituting the estimated parameters into the mean value function yields the estimate m̂(t_q) of the number of failures by t_q. The estimate is then compared with the actual number y_q. This procedure is repeated for various values of t_e. Positive values of the error indicate overestimation and negative values indicate underestimation.
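The four comparison metrics can be computed directly from the fitted and observed cumulative failure counts; this is a minimal sketch following the definitions above:

```python
def fit_metrics(y_obs, y_pred):
    # R^2 (goodness of fit), MSE, and Theil statistic for fitted vs. observed counts.
    n = len(y_obs)
    mean_y = sum(y_obs) / n
    sse = sum((y - p) ** 2 for y, p in zip(y_obs, y_pred))
    sst = sum((y - mean_y) ** 2 for y in y_obs)
    r2 = 1.0 - sse / sst                              # closer to 1 is better
    mse = sse / n                                     # closer to 0 is better
    ts = (sse / sum(y ** 2 for y in y_obs)) ** 0.5    # average relative deviation
    return r2, mse, ts

def relative_error(m_hat_q, y_q):
    # RE = (predicted - actual) / actual at the end-of-test time t_q.
    return (m_hat_q - y_q) / y_q
```

A perfect fit gives R² = 1 and MSE = TS = 0, which is the degenerate check used below.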
3.4. Estimation Results and Model Comparison
The parameter estimation can help characterize properties of theoretical models, determine the number of errors that have already been detected/corrected in the testing process, and predict the number of errors that will be eventually encountered by users. Specifically, by minimizing the objective function in (17), we can obtain the estimated results for each model.
(i) For Apache 2.0.35, the estimated results for our proposed model and the benchmarks are presented in Table 2. For our proposed model, the ratio of g1 errors to all errors is approximately 59%, the ratio of g2 errors is around 28%, and the remaining 13% are g3 errors. The errors thus do not consist only of independent (g1) errors, although g2 and g3 account for much less than g1. By incorporating such interdependency among generations of errors, we are able to strengthen the understanding of the composition of the software system. Besides, we find that the estimated value of β is clearly larger than 0. Hence, ignoring the effect of testing effort in the analysis of the reliability growth process may lead to significantly biased results.
Illustrations of the theoretical values versus their observations are shown in Figure 2. We find that the theoretical values are highly consistent with the real-world observations. We can further examine the failure intensity function λ(t). Figure 3 shows that the failure intensity function is a single-peaked curve; that is, the error detection rate first increases and then decreases with time. This in turn implies that our proposed model falls into the S-shaped category of SRGMs [34].
We observe that our proposed model fits the real-world software failure data well. This can be seen from the high value of R², 0.9954, which is very close to 1. Compared with the Goel-Okumoto model, the Yamada delayed S-shaped model, the no-testing-effort model, and the single-generation model, the R² value is highest for our proposed model. The same conclusion can be drawn by comparing MSE values among the models: our proposed model provides the lowest MSE value (2.0979). From this aspect, the model in our paper best fits the observed data for Apache 2.0.35.
Furthermore, the prediction capability of our model is very high. The TS value for Apache 2.0.35, 0.0252, is much lower than 10% and confirms a high prediction capability for our model. Compared with the benchmarks, our model provides significantly lower TS values and a higher prediction capability for failures of the Apache system. Lastly, the relative error in prediction using this data set is computed and the results are plotted in Figure 4. We observe that the relative error approaches 0 as t_e approaches t_q and that the relative error curve remains within a narrow band throughout.
(ii) For Apache 2.0.36, the estimated results are presented in Table 3. For our proposed model, the three generations of errors account for 40%, 29%, and 31%, respectively. Evidently, g2 and g3 errors contribute significantly to the error composition, accounting for 60% in total. We also find that the estimated value of β is much larger than 0, implying that the testing-effort effect is necessary.
Then, we can obtain the theoretical values of our proposed model. Illustrations of the theoretical values versus their observations are presented in Figure 5, which shows that the theoretical values are consistent with the real-world data. In Figure 6, we present the failure intensity function based on the estimated values. Similar to Apache 2.0.35, the hump-shaped curve shows that this function possesses an increasing-then-decreasing trend.
We find that our proposed model also fits the real-world software failure data for Apache 2.0.36 well. This can be seen from the high value of R², 0.9806, which is also much higher than those of the Goel-Okumoto model, the Yamada delayed S-shaped model, the no-testing-effort model, and the single-generation model. Comparing MSE values among the models, our proposed model provides the lowest MSE value (3.6163). From this aspect, the proposed model best fits the observed data for Apache 2.0.36.
Furthermore, the TS value for Apache 2.0.36 is as low as 0.0452, much lower than 10%, which confirms a high prediction capability of our model. Compared with the benchmarks, our model shows a lower TS value and a higher level of prediction capability. Lastly, the relative error in prediction for this data set is computed, as shown in Figure 7. We observe that the relative error approaches 0 as t_e approaches t_q and that the curve remains within a narrow band.
(iii) For Apache 2.0.39, the estimated results are presented in Table 4. For our proposed model, the ratios of the three generations of errors are about 53%, 38%, and 9%, respectively. We also find that the estimated value of β is significantly greater than 0. Thus, ignoring the effect of testing effort in the analysis of the software reliability growth curve would lead to significant bias.
A comparison between the fitted curve and the observed failure data is illustrated in Figure 8, which indicates that the theoretical values and the real-world observations are highly consistent. In Figure 9, we present the failure intensity function based on the estimated values. It is also a single-peaked curve: the error detection rate first increases and then decreases with time.
The high value of R², 0.9937, suggests that our proposed model fits the real-world software failure data for this version well. Compared with the Goel-Okumoto model, the Yamada delayed S-shaped model, the no-testing-effort model, and the single-generation model, the R² value is highest for our proposed model. The same conclusion can be drawn by comparing MSE values among the models: our proposed model provides the lowest MSE value (0.5314). Therefore, our proposed model best fits the observed data for Apache 2.0.39.
Furthermore, the TS value for Apache 2.0.39 is 0.0144, much lower than 10%, which confirms a high prediction capability for our model. Compared with the benchmarks, the TS values of our model are significantly lower, and its capability to predict failure occurrence is much greater. Lastly, the relative error in prediction for this data set is computed and the results are plotted in Figure 10. We observe that the relative error approaches 0 as t_e approaches t_q and that the error curve remains within a narrow band.
In summary, all three tests using different versions of the Apache failure data sets suggest that our proposed model fits the real data well and outperforms the benchmarks in error occurrence prediction. Moreover, both the testing-effort and the error-interdependency effects play an important role in the parameter estimation.
4. Software Release Policy
Theoretical modeling can also determine whether the software is ready to release to users or how much more effort should be spent on further testing. In this section, we formally examine the software developer's best release time and conduct sensitivity analysis regarding how the testing-effort effect and error interdependency affect the optimal time and the cost.
4.1. Optimal Release Time
When the failure process is modeled by a nonhomogeneous Poisson process with mean value function m(t), a popular cost structure [15, 25, 27, 39] is

C(T) = c1 · m(T) + c2 · [m(∞) − m(T)] + c3 · T,

where T is the release time of the software to be determined. In addition, c1 is the expected cost of removing a fault during the testing phase, c2 is the expected cost of removing a fault during the operational phase, and c3 is the expected cost per unit time of testing. The expected cost of removing a fault during the operational phase is higher than that during the testing phase; that is, c2 > c1. Overall, the above cost function is the sum of the testing cost, comprising the failure cost during testing and the actual cost of testing, and the cost of failures in the field.
Proposition 1. If only the cost is considered, the optimal release time T* satisfies λ(T*) = c3/(c2 − c1).
Proof of Proposition 1. Note that the objective function is

C(T) = c1 · m(T) + c2 · [m(∞) − m(T)] + c3 · T.

The first-order condition yields C′(T) = (c1 − c2) · λ(T) + c3 = 0, which indicates λ(T*) = c3/(c2 − c1).
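The first-order condition of Proposition 1 can be solved in closed form when the intensity is monotone decreasing; the sketch below uses the Goel-Okumoto intensity λ(t) = a·b·e^(−bt) with illustrative cost parameters (for the full three-generation model, the same condition would be solved numerically):

```python
import math

def release_time_cost_only(a, b, c1, c2, c3):
    # Solve lambda(T) = c3 / (c2 - c1) for the Goel-Okumoto intensity
    # lambda(t) = a * b * exp(-b * t); all parameter values are illustrative.
    assert c2 > c1, "operational-phase fixes are assumed costlier than testing-phase fixes"
    return math.log(a * b * (c2 - c1) / c3) / b

T_star = release_time_cost_only(a=50.0, b=0.1, c1=1.0, c2=5.0, c3=0.5)
```

At T_star, the marginal saving from removing one more fault before release exactly offsets the marginal cost of continued testing.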
It is also reasonable to quit testing as soon as the number of remaining faults in the system is less than a prescribed portion of the total faults. Hence, a simple criterion for software release is given by [8, 25]

R1(T) = m(T)/m(∞),

which is the type-I reliability definition.
Alternatively, given that testing or operation has been conducted up to time T, the probability that a software failure will not occur in the time interval (T, T + x], with x ≥ 0, is given by the conditional probability [3, 40]

R2(x | T) = exp{−[m(T + x) − m(T)]},

which is the type-II reliability definition.
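A minimal sketch of the type-II reliability computation, using an illustrative Goel-Okumoto mean value function in place of the paper's full model:

```python
import math

def mvf(t, a=50.0, b=0.1):
    # Illustrative Goel-Okumoto mean value function; any NHPP m(t) works here.
    return a * (1.0 - math.exp(-b * t))

def reliability_type2(x, T):
    # P{no failure in (T, T + x] | testing up to T} = exp(-(m(T + x) - m(T))).
    return math.exp(-(mvf(T + x) - mvf(T)))
```

As expected, the reliability equals 1 for a zero-length mission, decreases as the mission length x grows, and increases with more testing time T.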
In most situations, project managers need to strike a balance between the cost and the reliability level. On one hand, the reliability level should be at least as high as a desired level set beforehand; on the other hand, given the target level of software reliability R0, the cost should be minimized over the release time. Considering this tradeoff yields

minimize C(T) subject to R1(T) ≥ R0

for the type-I reliability definition, or

minimize C(T) subject to R2(x | T) ≥ R0

for the type-II reliability definition.
As an illustrative example, Figure 11 presents how the software reliability and the total cost change with time. In this example, we use the Apache 2.0.36 data set with fixed values of the mission time x and the cost parameters c1, c2, and c3. Suppose that the reliability must be higher than 0.95; that is, R0 = 0.95. Then, the reliability constraint is satisfied only beyond a threshold release time under the type-I reliability definition (and beyond a later threshold under the type-II definition). When only the total cost is considered, the cost is minimized at an interior time. When both criteria (reliability and cost) are taken into account, the optimal release time is the later of the cost-minimizing time and the applicable reliability threshold.
Analogous results hold for Apache 2.0.35 and Apache 2.0.39: in each case, the requirement that the reliability level exceed 0.95 determines a minimum release time under each reliability definition, the cost alone is minimized at an interior time, and considering both criteria yields the corresponding optimal release time for each version.
4.2. Sensitivity Analysis
As mentioned above, we have incorporated both the testing-effort and error-interdependency effects into our proposed model. In this part, we examine what happens if the testing effort, the error interdependency, or both are ignored in the software release decision. The first case corresponds to the no-testing-effort model (14), the second case corresponds to the single-generation model (15), and the third case is just the Goel-Okumoto model (13).
For each model, we examine the optimal release time and the corresponding minimal cost when a desired reliability level R0 is set beforehand. The Apache 2.0.36 data set is used to illustrate our findings. Under the type-I reliability definition, the optimal release time and the minimal cost as functions of R0 are presented in Figures 12 and 13, respectively. Similarly, under the type-II reliability definition, the optimal release time and the minimal cost as functions of R0 are shown in Figures 14 and 15, respectively. The other parameters take the same values as in the analysis of Section 4.1.
Under either the type-I or the type-II reliability definition, both the optimal release time and the total cost would be misestimated if either testing effort or fault interdependency were ignored. Unlike these two cases, our proposed model always suggests an earlier release time and a lower testing cost.
As an example, we compare the Goel-Okumoto model and our proposed model when the type-I reliability definition is adopted and R0 is set at 95%. The Goel-Okumoto model indicates a best release time of 76.56, whereas that of our proposed model is 54.3. Hence, testing lasts much longer if the Goel-Okumoto model is used to decide the stopping time. Similarly, the Goel-Okumoto model yields a minimal cost of $2529, whereas that of our proposed model is just $1896, implying that a decision based on our model can save software developers as much as 25% in cost.
Alternatively, when the type-II reliability definition is adopted, the Goel-Okumoto model predicts a best release time of 132.2, whereas that of our proposed model is 66.75; that is, the testing duration under the traditional Goel-Okumoto model is roughly double that under our model. Similarly, according to the Goel-Okumoto model, the minimal cost is $3083, whereas that of our proposed model is merely $2000. Thus, a decision based on our model can save software developers as much as 35% of the cost.
5. Concluding Remarks
In the development process, a variety of tests need to be conducted to ensure the reliability of the software system. In this work, we propose a framework for software reliability modeling that simultaneously captures the effects of testing effort and of interdependence between error generations. To evaluate the proposed model, we compare its performance with other existing models through experiments on actual software failure data from three versions of Apache. The nonlinear least squares estimation technique is applied for parameter estimation, and the values of R², mean of square error, Theil statistic, and relative error are used for model comparison in terms of fitting quality and prediction capability. The numerical analysis for all three Apache versions demonstrates that the proposed model fits the real-world data very well and exhibits a high level of prediction capability.
Our theoretical results can help software project managers assess the most cost-effective time to stop software testing and release the product. Specifically, we have formally examined the release timing and provided useful managerial insights based on the Apache data sets. Moreover, our proposed model recommends an earlier release time and a reduced cost compared with the Goel-Okumoto model and the Yamada delayed S-shaped model. Therefore, our model substantially improves the understanding of the software reliability growth process and of the best time to release a software version.
Notations
The th generation of fault  
Testing time  
Mean value function of the th generation faults  
Mean value function for all three generations of faults  
The th observation  
Failure intensity function  
Cumulative number of software errors detected by testing time  
Initial content of all three generations of faults  
Proportion of generation fault  
Initial content of generation fault  
Coefficient for the testingeffort function  
Power of the testingeffort function  
Testingeffort expenditure at time  
Cumulative testingeffort expenditure by time  
Metric for goodness of fit  
Mean of square error  
Theil Statistic  
Relative error  
Total testing cost  
TypeI reliability definition of software systems  
TypeII reliability definition of software systems  
Target level of reliability  
Testing cost coefficient. 
Competing Interests
The authors declare that they have no competing interests.
References
P. K. Kapur, S. Anand, S. Yamada, and V. S. S. Yadavalli, “Stochastic differential equation-based flexible software reliability growth model,” Mathematical Problems in Engineering, vol. 2009, Article ID 581383, 15 pages, 2009.
P. K. Kapur, A. Gupta, and V. S. S. Yadavalli, “Software reliability growth modeling using power function of testing time,” International Journal of Operations and Quantitative Management, vol. 12, no. 2, pp. 127–140, 2006.
C.-Y. Huang, M. R. Lyu, and S.-Y. Kuo, “A unified scheme of some nonhomogeneous Poisson process models for software reliability estimation,” IEEE Transactions on Software Engineering, vol. 29, no. 3, pp. 261–269, 2003.
C.-J. Hsu and C.-Y. Huang, “Optimal weighted combinational models for software reliability estimation and analysis,” IEEE Transactions on Reliability, vol. 63, no. 3, pp. 731–749, 2014.
H. Okamura, T. Dohi, and S. Osaki, “Software reliability growth models with normal failure time distributions,” Reliability Engineering and System Safety, vol. 116, pp. 135–141, 2013.
Y. Tamura and S. Yamada, “A component-oriented reliability assessment method for open source software,” International Journal of Reliability, Quality and Safety Engineering, vol. 15, no. 1, pp. 33–53, 2008.
Y. Tamura and S. Yamada, “Comparison of software reliability assessment methods for open source software,” in Proceedings of the 11th International Conference on Parallel and Distributed Systems, vol. 2, pp. 488–492, IEEE, Fukuoka, Japan, July 2005.
X. Li, Y. F. Li, M. Xie, and S. H. Ng, “Reliability analysis and optimal version-updating for open source software,” Information and Software Technology, vol. 53, no. 9, pp. 929–936, 2011.
X. Li, M. Xie, and S. H. Ng, “Sensitivity analysis of release time of software reliability models incorporating testing effort with multiple change-points,” Applied Mathematical Modelling, vol. 34, no. 11, pp. 3560–3570, 2010.
S. M. Rafi, K. N. Rao, and S. Akthar, “Incorporating generalized modified Weibull TEF into software reliability growth model and analysis of optimal release policy,” Computer and Information Science, vol. 3, no. 2, article 145, 2010.
C. Jin and S.-W. Jin, “Parameter optimization of software reliability growth model with S-shaped testing-effort function using improved swarm intelligent optimization,” Applied Soft Computing, vol. 40, pp. 283–291, 2016.
J. Wang, Z. Wu, Y. Shu, and Z. Zhang, “An imperfect software debugging model considering log-logistic distribution fault content function,” Journal of Systems and Software, vol. 100, pp. 167–181, 2015.
S. Yamada, H. Ohtera, and H. Narihisa, “A testing-effort dependent software reliability model and its application,” Microelectronics Reliability, vol. 27, no. 3, pp. 507–522, 1987.
S. Yamada and H. Ohtera, “Software reliability growth models for testing-effort control,” European Journal of Operational Research, vol. 46, no. 3, pp. 343–349, 1990.
C.-Y. Huang and S.-Y. Kuo, “Analysis of incorporating logistic testing-effort function into software reliability modeling,” IEEE Transactions on Reliability, vol. 51, no. 3, pp. 261–270, 2002.
C.-Y. Huang, “Performance analysis of software reliability growth models with testing-effort and change-point,” Journal of Systems and Software, vol. 76, no. 2, pp. 181–194, 2005.
N. Ahmad, M. G. M. Khan, and L. S. Rafi, “A study of testing-effort dependent inflection S-shaped software reliability growth models with imperfect debugging,” International Journal of Quality and Reliability Management, vol. 27, no. 1, pp. 89–110, 2010.
R. Peng, Y. F. Li, W. J. Zhang, and Q. P. Hu, “Testing effort dependent software reliability model for imperfect debugging process considering both detection and correction,” Reliability Engineering and System Safety, vol. 126, pp. 37–43, 2014.
V. B. Singh, G. P. Singh, R. Kumar, and P. K. Kapur, “A generalized reliability growth model for open source software,” in Proceedings of the 2nd International Conference on Reliability, Safety and Hazard (ICRESH '10), pp. 523–528, IEEE, Mumbai, India, December 2010.
A. Pievatolo, F. Ruggeri, and R. Soyer, “A Bayesian hidden Markov model for imperfect debugging,” Reliability Engineering and System Safety, vol. 103, pp. 11–21, 2012.
T. Aktekin and T. Caglar, “Imperfect debugging in software reliability: a Bayesian approach,” European Journal of Operational Research, vol. 227, no. 1, pp. 112–121, 2013.
K. Goševa-Popstojanova and K. Trivedi, “Failure correlation in software reliability models,” in Proceedings of the 10th International Symposium on Software Reliability Engineering (ISSRE '99), pp. 232–241, November 1999.
M. Sahinoglu, “Compound-Poisson software reliability model,” IEEE Transactions on Software Engineering, vol. 18, no. 7, pp. 624–630, 1992.
V. B. Singh, K. Yadav, R. Kapur, and V. S. S. Yadavalli, “Considering the fault dependency concept with debugging time lag in software reliability growth modeling using a power function of testing time,” International Journal of Automation and Computing, vol. 4, no. 4, pp. 359–368, 2007.
C.-Y. Huang and C.-T. Lin, “Software reliability analysis by considering fault dependency and debugging time lag,” IEEE Transactions on Reliability, vol. 55, no. 3, pp. 436–450, 2006.
N. Ahmad, M. G. M. Khan, S. M. K. Quadri, and M. Kumar, “Modelling and analysis of software reliability with Burr type X testing-effort and release-time determination,” Journal of Modelling in Management, vol. 4, no. 1, pp. 28–54, 2009.
O. Singh, P. K. Kapur, and A. Anand, “A multi-attribute approach for release time and reliability trend analysis of a software,” International Journal of Systems Assurance Engineering and Management, vol. 3, no. 3, pp. 246–254, 2012.
B. Pachauri, A. Kumar, and J. Dhar, “Software reliability growth modeling with dynamic faults and release time optimization using GA and MAUT,” Applied Mathematics and Computation, vol. 242, pp. 500–509, 2014.
M. Xie, Software Reliability Modelling, vol. 1, World Scientific, River Edge, NJ, USA, 1991.
M. R. Lyu, Handbook of Software Reliability Engineering, vol. 222, IEEE Computer Society Press, Los Alamitos, Calif, USA, 1996.
A. L. Goel and K. Okumoto, “Time-dependent error-detection rate model for software reliability and other performance measures,” IEEE Transactions on Reliability, vol. 28, no. 3, pp. 206–211, 1979.
S. Yamada, M. Ohba, and S. Osaki, “S-shaped reliability growth modeling for software error detection,” IEEE Transactions on Reliability, vol. 32, no. 5, pp. 475–484, 1983.
S. Yamada, M. Ohba, and S. Osaki, “S-shaped software reliability growth models and their applications,” IEEE Transactions on Reliability, vol. 33, no. 4, pp. 289–292, 1984.
A. Wood, “Predicting software reliability,” Computer, vol. 29, no. 11, pp. 69–77, 1996.
T. F. Coleman and Y. Li, “On the convergence of interior-reflective Newton methods for nonlinear minimization subject to bounds,” Mathematical Programming, vol. 67, no. 2, pp. 189–224, 1994.
K. Pillai and V. S. S. Nair, “A model for software development effort and cost estimation,” IEEE Transactions on Software Engineering, vol. 23, no. 8, pp. 485–497, 1997.
K.-C. Chiu, Y.-S. Huang, and T.-Z. Lee, “A study of software reliability growth from the perspective of learning effects,” Reliability Engineering and System Safety, vol. 93, no. 10, pp. 1410–1421, 2008.
Y. P. Wu, Q. P. Hu, M. Xie, and S. H. Ng, “Modeling and analysis of software fault detection and correction process by considering time dependency,” IEEE Transactions on Reliability, vol. 56, no. 4, pp. 629–642, 2007.
M. Xie and B. Yang, “A study of the effect of imperfect debugging on software development cost,” IEEE Transactions on Software Engineering, vol. 29, no. 5, pp. 471–473, 2003.
S. Yamada, Software Reliability Modeling: Fundamentals and Applications, Springer Briefs in Statistics, Springer, Tokyo, Japan, 2014.
Copyright
Copyright © 2016 Fan Li and ZeLong Yi. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.