Abstract

Latent Dirichlet Allocation (LDA) is a statistical topic model that has been widely used to extract semantic information from software source code. A failure is an observable error in program behavior. This work investigates whether the semantic information and failures recorded in a project's history can be used to predict component failures. We use LDA to extract topics from source code and propose a new metric, topic failure density, by mapping failures onto these topics. By exploring the topics of neighboring versions of a system, we obtain a similarity matrix; multiplying the Topic Failure Density (TFD) of the current version by this similarity matrix yields the TFD of the next version. The prediction results achieve an average 77.8% agreement with the real failures when considering the top 3 and bottom 3 components ordered by the number of failures. We use the Spearman coefficient to measure the statistical correlation between the actual and estimated failure rates. The validation results range from 0.5342 to 0.8337, which outperforms a comparable method. This suggests that our predictor based on topic similarity performs well on component failure prediction.

1. Introduction

A component is a subsection of a product: a simple encapsulation of data and methods. Component-based software development has emerged as an important new set of methods and technologies [1, 2]. Effectively finding the failure-prone components of a software system is important for component-based software quality. The main goal of this paper is to provide a novel method for predicting component failures.

Recent prediction studies build predictors mainly from two sources of information: past failure data [3–5] and the source code repository [6–8]. Nagappan et al. [9] found that if an entity failed often in the past, it is likely to do so in the future. They combined the two sources to build a component failure predictor. However, Gill and Grover [10] analyzed the characteristics of component-based software systems and concluded that some traditional metrics are inappropriate for analyzing component-based software. They stated that semantic complexity should be considered when characterizing a component.

Metrics based on semantic concerns [11–13] have provided initial evidence that topics in software systems are related to the defect-proneness of source code. These studies approximate concerns using statistical topic models, such as the LDA model [14]. Nguyen et al. [12] were among the first researchers who concentrated on the technical concerns/functionality of a system to predict failures. They suggested that topic-based metrics have a high correlation with the number of bugs. They also found that topic-based defect prediction has better predictive performance than other existing approaches. Chen et al. [15] used defect topics to explain the defect-proneness of source code and concluded that the prior defect-proneness of a topic can be used to explain the future behavior of topics and their associated entities. However, Chen's work mainly focused on single-file processing and did not draw a clear conclusion on whether topics can be used to describe the behavior of components.

In this paper, our research is motivated by the recent success of applying statistical topic models to predict the defect-proneness of source code entities. The prediction model in this work is built from failures and semantic concerns, as Figure 1 shows. A new metric (topic failure density) is defined by mapping failures back to topics. We study the performance of our new predictor on component failure prediction by analyzing three open source projects. In summary, the main contributions of this research are as follows.
(i) We utilize past failure data and semantic information in component failure prediction and propose a new metric based on semantic and failure data. As a result, it connects the semantic information of a component with its failures.
(ii) We explore the word-topic distribution and find the relationship between topics: the more similar the high-frequency words of two topics are, the more similar the topics are.
(iii) We investigate the capability of our proposed metric in component failure prediction. We compare its prediction performance against the actual data from Bugzilla on three open source projects.

The remainder of this paper is organized as follows. In Section 2, we present the work related to our research. In Section 3, we describe our research preparation, models, and techniques. In Sections 4 and 5, we present our experiments and validate our results. Finally, in Section 6, we conclude.

2. Related Work

2.1. Software Defect Prediction

Several techniques have been studied for the detection and correction of defects in software. In general, software defect prediction is divided into two subcategories: predicting the number of expected defects and predicting the defect-prone entities of a system [16].

El Emam et al. [6] constructed a prediction model based on object-oriented design metrics and used the model to predict the classes that contained failures in a future version. The model was then validated on a subsequent release of the same application. Thwin and Quah [17] presented an application of neural networks to software quality estimation. They built a neural network model to predict the number of defects per class and the number of lines changed per class and then used it to estimate software quality. Gyimóthy et al. [18] employed statistical and machine learning methods to assess object-oriented metrics and then built a classification model to predict the number of bugs in each class. They concluded that there was a strong linear association between the bugs in different versions. Malhotra [19] performed a systematic review of the studies that used machine learning techniques for software fault prediction. He concluded that machine learning techniques are capable of predicting software fault proneness and that more studies should be carried out to obtain well-formed and generalizable results.

Some other defect prediction researchers paid attention to finding defect-prone parts [16, 20, 21]. Ostrand et al. [7] made predictions based on the source code in the current release and the fault and modification history of each file from previous releases. The predictions were quite accurate when the model was applied to two large industrial systems: one with 17 releases over 4 years and the other with 9 releases over 4 years. However, a long failure history may not exist for some projects. Turhan and Bener [22] proposed a prediction model by combining multivariate approaches with a Bayesian method. This model was used to predict the number of failing modules. Their major contribution was to incorporate multivariate approaches rather than using a univariate one. K. O. Elish and M. O. Elish [23] investigated the capability of SVM in finding defect-prone modules. They used the SVM to classify modules as defective or not defective. Krishnan et al. [24] investigated the relationship between classification-based prediction of failure-prone files and the product line. Jing et al. [25] introduced the dictionary learning technique into the field of software defect prediction. They used a cost-sensitive discriminative dictionary learning (CDDL) approach to enhance the classification ability for software defect classification and prediction. Caglayan et al. [26] investigated the relationships between defects and test phases to build defect prediction models that predict defect-prone modules. Yang et al. [27] introduced a learning-to-rank approach to construct software defect prediction models by directly optimizing the ranking performance. Ullah [28] proposed a method to select the model that best predicts the residual defects of OSS (applications or components). Nagappan et al. [9] found that complexity metrics correlate with component failures, but no single set of metrics could act as a universal defect predictor. They used principal component analysis on the metrics and built a regression model; with this model, they predicted postrelease defects of components. Abaei et al. [29] proposed a software fault detection model using a semisupervised hybrid self-organizing map. Khoshgoftaar et al. [30] applied discriminant analysis to identify fault-prone modules in a sample from a very large telecommunications system. Ohlsson and Alberg [31] investigated the relationship between design metrics and the number of function test failure reports associated with software modules. Graves et al. [8] preferred process measures for predicting faults. Several novel process measures derived from the change history, such as deltas, were used in their work, and they found that process measures are more appropriate for failure prediction than product metrics. Graves's work is the most similar to ours: both are based on process measures, and the difference is that we additionally take the semantic concerns between versions into consideration. Neuhaus et al. [32] provided a tool for mining a vulnerability database, mapped the vulnerabilities to individual components, and then built a predictor, which they used to explore failure-prone components. Schröter et al. [33] used historical data to find which design decisions correlated with failures and used combinations of usage relationships between components to build a failure predictor.

2.2. LDA Model in Defect Research

LDA is an unsupervised machine learning technique that has been widely used to recognize latent topic information in documents or corpora. It is of great importance in latent semantic analysis, text sentiment analysis, and topic clustering. Software source code is a kind of text dataset; hence, researchers have applied LDA to diverse software activities such as software evolution [34, 35], defect prediction [12, 15], and defect localization [36].

Nguyen et al. [12] stated that a software system can be viewed as a collection of software artifacts. They used the topic model to measure the concerns in the source code and used these concerns as the input for a machine learning-based defect prediction model. They validated their model on an open source system (Eclipse JDT). The results showed that topic-based metrics have a high correlation with the number of bugs and that topic-based defect prediction has better predictive performance than existing state-of-the-art approaches. Liu et al. [11] proposed a new metric, called Maximal Weighted Entropy (MWE), for the cohesion of classes in object-oriented software systems. They compared the new metric with an extensive set of existing metrics and used them to construct models that predict software faults. Chen et al. [15] used a topic model to study the effect of conceptual concerns on code quality. They combined traditional metrics (such as LOC) and the word-topic distributions (from the topic model) to propose a new topic-based metric. They used the topic-based metrics to explain why some entities are more defect-prone. They also found that defect topics are associated with defect entities in the source code. Lukins et al. [36] used an LDA-based technique for automatic bug localization and evaluated its effectiveness. They concluded that an effective static technique for automatic bug localization can be built around LDA and that there is no significant relationship between the accuracy of the LDA-based technique and the size of the subject software system or the stability of its source code base. Our work is inspired by the recent success of topic modeling in mining source code.

3. Research Methodology

The proposed method is divided into three steps: the source code extraction and preprocessing step, the topic modeling step, and the prediction step.

3.1. Data Extracting and Preprocessing

Modern software usually has a sound bug tracking system, such as Bugzilla or JIRA. The Eclipse project, for instance, maintains a bug database in Bugzilla that records status, versions, and components. Our experiment data is obtained in three steps.
(1) Identify postrelease failures. From the bug tracking system, we get the failures that were observed after a release.
(2) Collect source code from version management systems, such as SVN and Git.
(3) Prune the source code of a component. From the source code, we find that there are many kinds of files in a component, but not all of them are necessary, such as project execution files and XML files.

After the extraction step, we perform the following preprocessing steps: separating the comments and identifiers, removing Java keywords and syntax structure, stemming, and removing extremely high or extremely low frequency words. Words with an occurrence rate of more than 90% or less than 5% are removed [37].
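As a rough illustration, the following Python sketch shows one way this preprocessing could be implemented; the abbreviated keyword list, the regular-expression tokenizer, the `preprocess` helper name, and the interpretation of the 5%/90% thresholds as document frequencies are assumptions made for illustration, not the exact pipeline used in the study (stemming is omitted for brevity).

```python
import re
from collections import Counter

# Abbreviated Java keyword list (illustrative, not exhaustive).
JAVA_KEYWORDS = {"public", "private", "protected", "class", "void", "static",
                 "return", "int", "new", "import", "package", "if", "else",
                 "for", "while", "try", "catch", "final"}

def preprocess(documents, low=0.05, high=0.90):
    """Tokenize source files, drop Java keywords, and remove words that occur
    in more than `high` or fewer than `low` of the documents.
    (Stemming is omitted here for brevity.)"""
    tokenized = []
    for text in documents:
        # Split comments and identifiers into lowercase word tokens.
        words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
        tokenized.append([w for w in words if w not in JAVA_KEYWORDS])

    # Document frequency of each remaining word.
    df = Counter()
    for words in tokenized:
        df.update(set(words))

    n_docs = len(tokenized)
    keep = {w for w, c in df.items() if low <= c / n_docs <= high}
    return [[w for w in words if w in keep] for words in tokenized]
```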

3.2. Topic Modeling

There are two steps in topic modeling. First, we connect the component size and the failures by defining the failure density. Second, we map failures to topics and build a topic failure metric.

3.2.1. Defining Failure Density

During the design phase of new software, designers make a detailed component list, and each component contains many files. In Bugzilla, bug reports are associated with component units. Knab et al. [38] stated that there is a relation between failures and component size. In a software system, researchers have used defect density (DD) to assess the defects of a file, which indicates how many defects there are per line of code [39]. In our study, we found that components with more files usually have more failures. The number of files differs between components, and so do the failures. Here, we define the failure density of a component ($\mathrm{FD}_i$) as
$$\mathrm{FD}_i = \frac{F_i}{N_i},$$
where $i$ represents component $i$, $F_i$ is the total number of failures in component $i$, and $N_i$ is the number of files within component $i$. $\mathrm{FD}_i$ is used to depict how failure-prone a component is.
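As a small worked example of the definition above (with made-up numbers), the failure density of a component is simply its failure count divided by its file count:

```python
def failure_density(num_failures, num_files):
    """FD_i = F_i / N_i: failures per file within a component."""
    return num_failures / num_files if num_files else 0.0

# A hypothetical component with 12 post-release failures and 48 files.
fd = failure_density(12, 48)  # 0.25 failures per file
```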

3.2.2. Mapping Failures to Topics

LDA [14] is a generative probabilistic model of a corpus. It makes the simplifying assumption that all documents share $K$ topics. The $K$-dimensional vector $\alpha$ is the parameter of the topic prior distribution, while $\beta$, a $K \times V$ matrix, parameterizes the word probabilities. The joint distribution of a topic mixture $\theta$, a set of topics $\mathbf{z}$, and a set of words $\mathbf{w}$ is expressed as
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta),$$
where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$.

Integrating over $\theta$ and summing over $z$, we obtain the marginal distribution of a document:
$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta .$$

A corpus with $M$ documents is denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$, so
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d .$$

From (4), we obtain the parameters $\alpha$ and $\beta$ by training, maximizing the likelihood of the corpus, so that we can compute the posterior distribution of the hidden variables given a document:
$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)} .$$

By (2), the Failure Density (FD) is determined by the number of failures and the number of files within a component; it is defined as the ratio of the number of failures in the component to its size, which reflects the failure information in a component. Using this ratio as motivation, we define the failure density of a topic (TFD) as
$$\mathrm{TFD}_k = \sum_{i=1}^{C} \theta_{ik} \cdot \mathrm{FD}_i,$$
where $\theta_{ik}$ is the proportion of topic $z_k$ in component $i$ and $C$ is the number of components. Using (6), failures are mapped to topics. $\mathrm{TFD}_k$ describes the failure-proneness of topic $z_k$.
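The following is a minimal sketch of how the TFD could be computed, assuming, as suggested above, that a topic's failure density is the sum of the component failure densities weighted by that topic's proportion in each component; the matrix layout and names are illustrative only.

```python
import numpy as np

def topic_failure_density(theta, fd):
    """theta: (num_components, num_topics) component-topic proportions from LDA.
    fd:    (num_components,) failure density of each component.
    Returns a (num_topics,) vector: each topic's failure density, obtained by
    weighting every component's FD by how strongly it expresses that topic."""
    theta = np.asarray(theta, dtype=float)
    fd = np.asarray(fd, dtype=float)
    return theta.T @ fd  # TFD_k = sum_i theta[i, k] * FD_i

# Toy example: 3 components, 2 topics.
theta = [[0.7, 0.3],
         [0.2, 0.8],
         [0.5, 0.5]]
fd = [0.25, 0.10, 0.40]
tfd = topic_failure_density(theta, fd)  # array([0.395, 0.355])
```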

3.3. Prediction

In the work of Hindle et al. [40], two topics were compared by taking the top 10 words with the highest probability; if 8 of those words were the same, the two topics were considered to be the same. In this paper, we increase the number of highest-probability words and define the similarity as the ratio of the number of words shared by two topics to the total number of high-probability words (see (7)):
$$\mathrm{Sim}\left(z_a, z_b\right) = \frac{\left|W_a \cap W_b\right|}{n},$$
where $W_a$ and $W_b$ are the sets of the $n$ highest-probability words of topics $z_a$ and $z_b$. By (7), we calculate the similarity of the topics in two neighboring versions and obtain a similarity matrix. Then we define the TFD relation:
$$\mathrm{TFD}_k^{v+1} = \sum_{j=1}^{K_v} \mathrm{Sim}\left(z_j^{v}, z_k^{v+1}\right) \cdot \mathrm{TFD}_j^{v},$$
where $v$ is version $v$, $K_v$ is the total number of topics in version $v$, and $\mathrm{Sim}(z_j^{v}, z_k^{v+1})$ is the similarity degree of topics $z_j^{v}$ and $z_k^{v+1}$, taken from the similarity matrix. After obtaining $\mathrm{TFD}^{v+1}$ through (8), we get $\mathrm{FD}^{v+1}$ in version $v+1$ using (6), and then we obtain the number of failures in each component.
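The sketch below illustrates one possible implementation of the similarity matrix in (7) and the TFD propagation in (8); representing each topic by its set of top-n words, and the optional `threshold` argument (motivated by the later observation that weakly related topics contribute little or even add noise), are assumptions for this sketch rather than the exact implementation used in the study.

```python
import numpy as np

def topic_similarity(top_words_prev, top_words_next):
    """Sim(a, b) = |shared top-n words of topics a and b| / n, as in (7).
    Each argument is a list of word lists, one per topic, all of length n."""
    n = len(top_words_prev[0])
    sim = np.zeros((len(top_words_prev), len(top_words_next)))
    for a, words_a in enumerate(top_words_prev):
        for b, words_b in enumerate(top_words_next):
            sim[a, b] = len(set(words_a) & set(words_b)) / n
    return sim

def predict_next_tfd(tfd_prev, sim, threshold=0.0):
    """Propagate TFD to the next version: each new topic's TFD is the
    similarity-weighted sum of the previous version's TFDs, as in (8).
    Similarities below `threshold` are dropped, reflecting the observation
    that weakly related topics contribute little or add noise."""
    sim = np.where(sim >= threshold, sim, 0.0)
    return sim.T @ np.asarray(tfd_prev, dtype=float)
```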

4. Experiment

4.1. Dataset

The experimental data comes from three open source projects, that is, Platform (a subproject of Eclipse), Ant, and Mylyn. We select three versions of bug reports for each project and the source code of the corresponding versions (Platform3.2, Platform3.3, and Platform3.4; Ant1.6.0, Ant1.7.0, and Ant1.8.0; Mylyn3.5, Mylyn3.6, and Mylyn3.7). The basic information of the three projects is shown in Table 1.

4.2. Results and Analysis on Three Open Source Projects

For a given corpus, there is no unified standard for choosing the number of topics ($K$), and different corpora differ greatly in their appropriate number of topics [41–43]. Setting $K$ to an extremely small value causes the topics to contain multiple concepts (imagine only a single topic, which would contain all of the concepts in the corpus), while setting $K$ to an extremely large value makes the topics too fine-grained to be meaningful and only reveals the idiosyncrasies of the data [34]. Considering the number and scale of components within each project, as well as our experience, we vary $K$ from 10 to 100. The experiment results are shown in Figure 2.

We choose different numbers of topics and compare the similarity between the predicted data and the actual data. From Figure 2, the result is best when the number of topics is 20. This visualization allows us to quickly and compactly compare and contrast the trends exhibited by the various topics. We therefore set $K$ to 20 for all three projects.
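As an illustration, such a sweep over $K$ could be carried out with gensim as sketched below; `token_lists` is assumed to be the output of the preprocessing step, and `score_against_actual` is a hypothetical placeholder for the comparison of predicted and actual failures described above, not a library function.

```python
from gensim import corpora, models

def fit_lda(token_lists, num_topics, passes=10):
    """Fit an LDA model on preprocessed token lists."""
    dictionary = corpora.Dictionary(token_lists)
    bow_corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = models.LdaModel(bow_corpus, num_topics=num_topics,
                          id2word=dictionary, passes=passes)
    return lda, dictionary, bow_corpus

# Try K = 10, 20, ..., 100 and keep the value whose predictions best match
# the actual failure data.
best_k, best_score = None, float("-inf")
for k in range(10, 101, 10):
    lda, dictionary, bow_corpus = fit_lda(token_lists, num_topics=k)
    score = score_against_actual(lda, bow_corpus)  # hypothetical evaluation step
    if score > best_score:
        best_k, best_score = k, score
```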

We run LDA on the three projects and obtain the topic distribution of the components. The full listing of the discovered topic distributions is given in Table 4. We compare the topic distributions of the components between neighboring versions of the three projects. It can be seen that the topics in the current version relate to the topics in the previous version. For example, topic 8 in version 3.5 and topic 1 in version 3.6 have almost the same correlation with 11 components (Figure 3). We also find that another pair of topics shows a large difference.

Why do these two topics have only a small variance in COM1 (Bugzilla), with relation values of 0.8988 and 0.8545 (see Table 4), respectively? And what makes the third topic differ from them within components? We compare the high-probability word information of the three topics. Table 2 shows the high-probability words (top 10 words) of these three topics.

In the experiments we found that not all TFDs in the previous version have an impact on the later version. When the similarity between topics is below a threshold, the influence between their TFDs is very small or even negative for the results.

Among the high-probability words of the two similar topics, only the ninth word, "message", differs. In addition, the third topic has nothing in common with them in terms of high-probability words. We conclude that this is the most direct reason why the topics have a high similarity (or great difference) in their relation values across the components. Hence, this is why we use the similarity of high-probability words to describe the similarity between two topics. At the same time, a similarity matrix is built to show the membership of topics in two neighboring versions. Table 3 shows a similarity matrix for the three projects. In our study, some topics had one or more similar topics in the next version, while others had none. This is consistent with topic evolution [34].

We use (6) to calculate the TFD of each topic in the three projects (Platform, Ant, and Mylyn). In order to better describe the relation between TFD and versions, we use box plots. The TFDs of these three projects are shown in Figure 4.

From Table 1 and Figure 4, it can be seen that the fewer the failures, the shorter the box. For example, the length of the box corresponding to the TFD of Ant1.8 is almost 0, and the number of failures in Ant1.8 is only 5 (Table 1); the number of failures in Ant1.7 is 104, and its box is much longer. According to (6), TFD is determined by the topic distribution matrix and the FD of the components. If the FD and the topic distribution matrix change, the value of the TFD will also change. However, in our research, we find that the number of files and the topic mixture are constant within a version of a project. We conclude that the distribution of the TFDs in the same project is related to the number of failures. In other words, the distribution of TFD reflects the failure distribution in each version of a project.

The work above shows that the similarity of high-probability words describes the connection between topics in two neighboring versions (see Table 3). Furthermore, the TFD is connected to the failures of components (see Figure 4). Next, we use the TFD and the relation between topics to predict the failures of components.

Figure 5 shows the prediction results of the failures in each component and the actual number of failures in each component of three projects.

From Figure 5, it can be seen that the number of failures from our predictor (the prediction data) is related to the number of failures collected from Bugzilla (the actual data). When the actual data of a component is larger, our prediction data is usually larger as well, for example, the numbers of failures of the SWT and Debug components in Figure 5(c).

What is the significance of our prediction? As in [9], we sort components by the number of failures. We find that the ranking of many predicted components is consistent with the actual component ranking (Figure 6). We compare the first three and last three components with the actual ranking, and the average correct rate is 77.8%. In other words, our proposed prediction method quickly finds which components will have the most failures and which will have the fewest in the next version. This gives an idea of the testing priorities and allows software organizations to better focus their testing activities and improve cost estimation.

5. Validity

5.1. Validation and Comparison

To evaluate the correlation between our prediction data and the actual data, we use the Spearman correlation coefficient [44], which measures the dependence between two variables [45]. If there are no duplicate values in the data and the two variables have a completely monotonic correlation, the Spearman coefficient is $+1$ or $-1$: $+1$ represents a complete positive correlation, $-1$ represents a perfect negative correlation, and 0 means no relationship between the two variables. The correlation is computed as
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\left(n^{2}-1\right)},$$
where $d_i$ is the difference between the ranks of the $i$th pair of observations and $n$ is the number of observations.

In this paper, the two variables are the actual data and the prediction data. The better our predictor is, the stronger the correlation will be; a correlation of 1.0 means that the sensitivity of the predictor is high. The results of the Spearman correlation coefficients are shown in Figure 7.
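For reference, the Spearman correlation between the actual and predicted failure counts can be computed with scipy as in the short sketch below; the numbers are purely illustrative.

```python
from scipy.stats import spearmanr

actual    = [34, 21, 18, 12, 9, 5, 3]   # failures per component (illustrative)
predicted = [30, 25, 15, 14, 8, 6, 2]   # failures estimated by the model

rho, p_value = spearmanr(actual, predicted)
print(f"Spearman rho = {rho:.4f} (p = {p_value:.4f})")
```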

From Figure 7, we find that the predicted failures are positively correlated with the actual values under our approach. For instance, in project Mylyn3.7, the higher the number of failures in a component, the larger the number of postrelease failures (correlation 0.6838). For comparison, we implemented the lightweight method provided by Graves et al. [8]. In their work, the number of changes to the code in the past and a weighted average of the dates of the changes to a component are used to predict failures. As they described, we collected change management data (deltas) and the average age of components for the three projects from GitHub. A general linear model was used to build the prediction model. Equation (10) shows their most successful prediction model, which, as a general linear model for the log of the expected number of faults, takes the form
$$\log(\text{faults}) = \beta_0 + \beta_1 \cdot \text{deltas} + \beta_2 \cdot \text{age},$$
where deltas is the number of changes to the code in the past and age is calculated as a weighted average of the dates of the changes to the module, that is, the average age of the code.
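A rough sketch of fitting such a log-linear baseline is given below, assuming the generic form above; the coefficients are estimated from the data by least squares rather than taken from Graves et al., and the helper names are illustrative.

```python
import numpy as np

def fit_log_linear(deltas, age, faults):
    """Fit log(faults) = b0 + b1*deltas + b2*age by ordinary least squares."""
    X = np.column_stack([np.ones(len(deltas)), deltas, age])
    y = np.log(np.asarray(faults, dtype=float) + 1.0)  # +1 guards against log(0)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_faults(coef, deltas, age):
    """Predict the expected number of faults from fitted coefficients."""
    X = np.column_stack([np.ones(len(deltas)), deltas, age])
    return np.exp(X @ coef) - 1.0
```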

In the evaluation, we use Graves’s model to obtain component failures and get the Spearman correlation with the actual failure data (Figure 7). It is seen that our approach gets a higher correlation with actual failures.

5.2. Threats to Validity

Threats to validity of the study are as follows.

Data Extraction. In Bugzilla, each failure is assigned by a tester to a component, so the failures in a component are easy to collect. However, it is difficult to extract the source code for each component. When we obtain source code from the version management systems, we must classify it by component ourselves, which may introduce mistakes. For example, a file that belonged to component 1 in one version may be moved into component 2 in the next version.

Parameter Values. Our evaluation of component similarity is based on the topics. Since LDA is a probabilistic model, mining different versions of the source code may also lead to different topics. Besides, our work involves choosing several parameters for the LDA computation; perhaps the most important is the number of topics. Also required for LDA are the number of sampling iterations as well as the prior distributions for the topic and document smoothing parameters, $\alpha$ and $\beta$. There is currently no theoretically guaranteed method for choosing optimal values for these parameters, even though the resulting topics are clearly affected by these choices.

6. Conclusion

This paper studies whether and how historical semantic and failure information can be used to facilitate component failure prediction. In our work, the LDA topic model is used to mine topics from software source code. We map source code failure information to topics and obtain the TFD. Our result is that the TFD is quite useful for describing the distribution of failures across components. After exploring the word-topic information and the high-frequency words, we find a regularity among topics: the experiments show that the similarity of topics is determined by the similarity of their high-frequency words. These two results motivated us to build a prediction model. The TFD is used as the basic information, and the similarity matrix is used as a bridge to connect topics from neighboring versions. Our prediction results show that our predictor has high precision in predicting component failures. To validate the prediction results further, the Spearman rank correlation is used. The Spearman correlation ranges from 0.5342 to 0.8337, which outperforms a comparable method. This suggests that our prediction model is well suited to predicting component failures.

Appendix

See Table 4.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The work described in this paper was partially supported by the National Natural Science Key Foundation (Grant no. 91118005), the National Natural Science Foundation of China (Grant no. 61173131), the Natural Science Foundation of Chongqing (Grant no. CSTS2010BB2061), and the Fundamental Research Funds for the Central Universities (Grant nos. CDJZR12098801 and CDJZR11095501).