Abstract

With the continuous development of software, it is inevitable that computer software and programs will suffer from various unpredictable problems that damage their normal operation. In this paper, static analysis software is taken as the research object, the errors or failures caused by potential defects in software modules are analyzed, and a software analysis method based on big data tendency prediction is proposed that uses a stacked denoising sparse autoencoder to predict software defects. The method can learn features from the original defect data: by setting different numbers of hidden layers, sparse regularization parameters, and noise ratios, it directly and efficiently extracts the required features at all levels from the software defect data and then classifies and predicts the extracted features in combination with big data. Experimental tests show that the presented method outperforms the comparison methods in correct rate, accuracy rate, recall rate, F1-measure, AUC value, and running time, which proves that the research results in this paper give a more accurate failure prediction effect and can eliminate software failures in a timely manner.

1. Introduction

The application of software permeates almost every aspect of people's lives. With the rapid development of the software industry, increasing demand, the integration of functions, and the application of plug-ins have made software ever larger in scale and ever more complex. Software failures may cause serious economic losses to enterprises and even threaten people's lives [1]. In this paper, starting from the encoder model used in propensity prediction, and aiming at the slow convergence of the loss cost function and the complexity and difficulty of adjusting the sparsity regularization parameters in the traditional encoder, the loss cost function and the sparsity constraint method of the encoder are improved. At the same time, in order to reduce the influence of noise on the data, the method can learn features from the original defect data. By setting different numbers of hidden layers, sparse regularization parameters, and noise ratios, the required hierarchical features are directly and efficiently extracted from the software defect data, and the extracted features are then classified and predicted with a logistic regression classifier [2–4]. Since its emergence, software defect prediction technology has been one of the research hotspots in the field of software engineering. Together with software defect detection technology, it is regarded as one of the two key technologies for improving software quality and reliability: software defect detection technology mainly analyzes program modules that have already failed in order to determine the specific location of the defect, whereas software defect prediction technology measures program modules that have not failed and predicts whether a module contains undiscovered defects through the constructed software defect prediction model.

In recent years, many research scholars at home and abroad have devoted themselves to the study of software defect prediction technology and have achieved excellent research results. According to the technology used, existing software defect prediction techniques can be broadly divided into dynamic defect prediction and static defect prediction. Dynamic defect prediction studies the life cycle of the entire software system and predicts the distribution of software defects over time based on when software failures or system failures occur; static defect prediction uses measurement data related to software defects, such as the size and cyclomatic complexity of the software system, to predict the defect tendency, defect density, or number of defects in software program modules.

The proposed software defect prediction method is used to predict the defects of the big data modeling platform system. Firstly, based on the documents and program codes in the development process of the big data modeling platform system, the program modules of the system are extracted, and the metric elements are designed and measured to obtain the characteristic data. Secondly, some program modules are randomly selected from all the program modules for manual testing and labeling. Then, the defect data set of labeled program modules is used as the training set to build a software defect prediction model to predict the defect tendency of unlabeled program modules [5, 6].

2. An Encoder Based on Propensity Prediction Mechanism

2.1. Loss Cost Function Based on Cross Entropy

In order to overcome the problem of low parameter update efficiency when using the cost function of square deviation in traditional encoders, it is hoped that the partial derivative of the loss cost function is independent of the derivative of the activation function [7], namely:

In formula (1), ρ is the sparsity parameter, W is the weight matrix, b is the bias vector, f is the activation function, and xi is the training sample set. For each sample i of the input layer, the feature yi of the hidden layer can be obtained through the coding operation, and zi is the reconstructed data of the sample [8, 9].

Taking the sigmoid activation function as an example,

Applying the cross-entropy function to the loss cost function of the encoder, the cross-entropy cost function can be expressed as follows:

J(W, b) = -(1/n) Σ_(i=1)^n [xi ln zi + (1 - xi) ln(1 - zi)]  (3)

Its partial derivative with respect to the bias vector b is expressed as follows:

∂J/∂b = (1/n) Σ_(i=1)^n (zi - xi)  (4)

In formula (4), b is the bias vector.

Compared with the squared-difference cost function, the cross-entropy cost function has obvious advantages: its partial derivative is independent of the derivative of the activation function, so it is not affected by the saturation of the sigmoid function. When the loss is large, the weights are updated quickly; when the error is small, the weights are updated slowly. Like the squared-difference cost function, the cross-entropy cost function also has two properties:

(1)Nonnegativity: within its definition domain, its value is nonnegative.
(2)The smaller the difference between the reconstructed data zi and the input data xi, the closer its cost function approaches 0 [10–12].
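The advantage at saturation can be checked with a quick numerical sketch (a toy NumPy example with a single sigmoid output unit; the probe values are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = 0.0          # target value
a = 8.0          # large pre-activation, deep in the sigmoid's saturated region
z = sigmoid(a)   # reconstruction close to 1, far from the target

# Squared-difference cost: the gradient w.r.t. a carries the factor
# sigmoid'(a) = z(1 - z), which is nearly 0 here, so updates stall.
grad_sq = (z - x) * z * (1.0 - z)

# Cross-entropy cost: the gradient w.r.t. a reduces to (z - x), with no
# activation-derivative factor, so the update stays large when the loss is large.
grad_ce = z - x

print(grad_sq, grad_ce)
```

Even though the reconstruction error is near its maximum, the squared-difference gradient is almost zero, while the cross-entropy gradient remains close to 1.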

2.2. Sparse Constraint Method Based on L1 Rule

By penalizing the nonzero activation of hidden neurons in the encoder, this paper proposes a sparse autoencoder (L1-SAE) that uses the L1 rule as the sparsity constraint [13].

The sparse encoder based on L1 rule does not force all hidden neurons to share the same degree of sparsity but directly applies the average activation of hidden neurons to the sparsity constraint [14]. The expression of L1 rule is as follows:

In formula (6), ρ is a sparsity parameter, a small value close to 0; ρ̂j represents the activation degree of hidden neuron j under the given input x; and s is the number of hidden-layer neurons in the encoder [15, 16].

The overall cost function of the sparse encoder based on L1 rule is as follows:

In formula (7), J(W, b) is the loss cost of the encoder, the second term is the sparsity penalty term, and the sparsity regularization parameter β is the weight of the sparsity penalty term, controlling its relative importance.

In the sparse autoencoder based on KL divergence, two hyperparameters need to be set in advance: the sparse regularization parameter β and the sparsity parameter ρ. In the sparse autoencoder based on the L1 rule, only one hyperparameter needs to be set in advance: the sparse regularization parameter β. Using fewer hyperparameters can significantly reduce the time needed to adjust model parameters [17–19].

The sparse autoencoder based on the L1 rule has the following advantages:

(1)The L1 rule leads to a convex optimization problem, which can be solved effectively.
(2)The degree of sparsity of the hidden neurons in the sparse autoencoder can be learned.
(3)Using fewer hyperparameters significantly reduces the difficulty of parameter adjustment during model training [20–22].
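The L1-regularized overall cost of formula (7) can be sketched on hypothetical hidden activations; the array sizes, the value of β, and the placeholder reconstruction loss are illustrative assumptions:

```python
import numpy as np

np.random.seed(0)
H = np.random.rand(32, 10)   # hypothetical hidden activations: 32 samples x 10 units

rho_hat = H.mean(axis=0)              # average activation of each hidden neuron
l1_penalty = np.sum(np.abs(rho_hat))  # L1 sparsity penalty over the rho_hat values

beta = 0.1         # sparsity regularization parameter: the only sparsity
                   # hyperparameter needed, unlike the KL-based encoder
recon_loss = 0.25  # placeholder for the loss cost J(W, b)
total_cost = recon_loss + beta * l1_penalty
print(total_cost)
```

Note that only β must be chosen here; the KL-divergence variant would additionally require the sparsity parameter ρ.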

2.3. Improved Denoising Sparse Autoencoder (DSAE)

An improved denoising sparse autoencoder (DSAE) is presented in this paper. First, noise processing is performed on the original data, and then sparse coding training is performed on the noisy data. This forces the encoder to learn to remove the noise and recover the original, noise-free data, compelling it to learn a more robust representation of the original data and improving its generalization ability [23–25].

The improved DSAE structure is shown in Figure 1.

In Figure 1, xi is the original data, x̃i is the noisy data after noise processing, yi is the feature of the hidden layer, and zi is the reconstructed data. There are usually two ways to add noise:

(1)Gaussian noise (GN): noise obeying a Gaussian distribution is added to sample xi.
(2)Masking noise (MN): some values in sample xi are randomly selected and set to 0.

The loss cost between the reconstructed data and the input data of the improved DSAE model is as follows:
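The two noise-adding schemes can be sketched as follows (a NumPy illustration on toy data; the noise level and masking rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random((5, 20))          # toy original data xi

# Gaussian noise (GN): add noise drawn from a Gaussian distribution.
x_gn = x + rng.normal(0.0, 0.1, size=x.shape)

# Masking noise (MN): randomly select a fraction (noiseRatio) of the
# values and set them to 0.
noise_ratio = 0.3
x_mn = x * (rng.random(x.shape) >= noise_ratio)

print(x_gn.shape, float((x_mn == 0).mean()))
```

The fraction of zeroed entries in the masked copy is close to the chosen noiseRatio.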

Then, the loss cost function of the improved DSAE in the whole training sample set is as follows:

When the DSAE is trained, the back propagation algorithm and gradient descent method are used to iterate and update the network parameters so as to learn the encoder with optimal network parameters [26]. The specific algorithm of training encoder is shown in Algorithm 1.

 Input: training data x, the number of input layer nodes inputSize, the number of hidden layer nodes hiddenSize, the weight attenuation coefficient λ, the sparse regularization parameter β, the maximum number of iterations maxIter for optimizing the loss cost function
 Output: optimal network parameters W, b
(1)Initialize the encoder's network parameters: weight matrix W, bias vector b;
(2)Initialize the iteration counter and the overall cost;
(3)Perform noise processing on the training data x to obtain the noisy data x̃;
(4)For iter = 1 to maxIter, optimize the network parameters of the encoder iteratively:
(5)Encode the noisy data x̃ to obtain the feature y of the hidden layer;
(6)Decode the feature y of the hidden layer to obtain the reconstructed data z;
(7)Calculate the overall cost of the encoder;
(8)Calculate the partial derivatives of the loss cost function;
(9)Calculate the gradients of the network parameters W and b;
(10)Update the network parameters W and b;
(11)End the for loop;
(12)Output the optimal network parameters W and b.
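The steps above can be sketched as a minimal NumPy implementation on toy data. It assumes masking noise, the cross-entropy loss, an L1 sparsity term, and plain gradient descent in place of the L-BFGS optimizer used in the experiments; the data, sizes, and learning rate are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dsae(x, hidden_size, noise_ratio=0.3, beta=1e-3, lr=0.5,
               max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = x.shape
    # Step 1: random (non-zero) initialization breaks the symmetry.
    W1 = rng.normal(0, 0.1, (d, hidden_size)); b1 = np.zeros(hidden_size)
    W2 = rng.normal(0, 0.1, (hidden_size, d)); b2 = np.zeros(d)
    for _ in range(max_iter):                               # Steps 2, 4, 11
        x_noisy = x * (rng.random(x.shape) >= noise_ratio)  # Step 3: masking noise
        y = sigmoid(x_noisy @ W1 + b1)                      # Step 5: encode
        z = sigmoid(y @ W2 + b2)                            # Step 6: decode
        # Steps 7-8: with the cross-entropy cost, the output-layer partial
        # derivative is simply (z - x), with no activation-derivative factor.
        delta2 = (z - x) / n
        delta1 = (delta2 @ W2.T + beta * np.sign(y) / n) * y * (1 - y)
        # Steps 9-10: gradients of W, b and the parameter update
        W2 -= lr * (y.T @ delta2); b2 -= lr * delta2.sum(0)
        W1 -= lr * (x_noisy.T @ delta1); b1 -= lr * delta1.sum(0)
    return W1, b1, W2, b2                                   # Step 12

# Usage: reconstruction error on toy two-prototype data drops after training.
rng = np.random.default_rng(1)
proto = (rng.random((2, 16)) > 0.5).astype(float)
x = proto[rng.integers(0, 2, 64)] * 0.8 + 0.1   # values in (0, 1)
W1, b1, W2, b2 = train_dsae(x, hidden_size=8)
z = sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2)
print(float(np.mean((z - x) ** 2)))
```

Because the loss is measured against the clean data x while the encoder only ever sees the corrupted copy, the network is forced to learn a denoising representation.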

The algorithm flow of training encoder is shown in Figure 2.

The sigmoid function is both left-saturated and right-saturated; therefore, the sigmoid function is saturated. Its derivative approaches 0 only as its input approaches infinity, so the sigmoid function is soft-saturated.

The advantages of the sigmoid function are as follows:

(1)The input data are compressed to (0, 1), and the sigmoid function is monotonic, continuous, and stable, which makes it easy to optimize.
(2)Its derivative is easy to compute.

However, the sigmoid function also has disadvantages that cannot be ignored:

(1)The amount of calculation is large: when using the backpropagation algorithm to calculate the gradient, the derivation involves division.
(2)When the input data are very large or very small, its soft saturation makes the gradient disappear, and the training of a deep network cannot be completed.
(3)Its output does not have 0 as its mean value, so the neurons in the next layer receive the nonzero-mean output of the previous layer as input, and the trained network parameters tend to become all positive or all negative. As a result, a zigzag (Z-shaped) descent path occurs when the gradient descent method is used to optimize the network parameters.
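The saturation and non-zero-mean behaviors can be checked numerically (a small NumPy sketch; the probe values are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)

# Soft saturation: the derivative is maximal at 0 and only approaches 0
# as the input goes to +/- infinity.
print(sigmoid_grad(0.0))     # 0.25, the maximum
print(sigmoid_grad(20.0))    # ~2e-9: the gradient has effectively vanished

# Non-zero mean: outputs lie in (0, 1), so the next layer always receives
# all-positive inputs.
a = np.linspace(-5.0, 5.0, 101)
print(float(sigmoid(a).mean()))   # 0.5, not 0
```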

3. Software Defect Prediction Based on Stacked Denoising Sparse Autoencoder (sDSAE)

3.1. sDSAE

Multiple improved denoising sparse autoencoders are stacked layer by layer to build a deep neural network model, yielding an improved stacked denoising sparse autoencoder (sDSAE), which can capture deeper feature information from the input data. The deeper the level at which feature information is acquired, the stronger its feature expression ability [27].

Figure 3 shows a three-layer sDSAE structure based on the stacked improved denoising sparse encoder.

When training the sDSAE, the features of the first layer of the encoder are taken as the input of the second layer, and the layer-by-layer greedy training method is adopted to train each layer of the deep neural network in turn, so that the entire deep neural network is trained and its network parameters are obtained. In other words, the original training data xi are first used as the input of the first autoencoder to train the first-order features of the original data; the first-order features are then used as the input of the second autoencoder to train the second-order features, and so on. Taking the (n − 1)-order features as the input of the nth autoencoder, the n-order features of the original data can be trained [28].
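The layer-by-layer greedy procedure can be sketched as follows. Here train_layer is a hypothetical stand-in that simply draws random weights, so that only the stacking logic itself is shown:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_layer(x, hidden_size, seed):
    """Hypothetical stand-in for training one denoising sparse autoencoder;
    it just draws fixed random weights so only the stacking logic is shown."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.1, (x.shape[1], hidden_size))
    b = np.zeros(hidden_size)
    return W, b

def greedy_pretrain(x, layer_sizes):
    """Layer-by-layer greedy training: the features of layer k become the
    training input of layer k + 1."""
    params, features = [], x
    for k, size in enumerate(layer_sizes):
        W, b = train_layer(features, size, seed=k)  # train the k-th autoencoder
        params.append((W, b))
        features = sigmoid(features @ W + b)        # k-th order features
    return params, features

x = np.random.default_rng(0).random((10, 200))      # toy defect data
params, top = greedy_pretrain(x, [100, 50, 25])
print([W.shape for W, _ in params], top.shape)
```

Each stage only ever sees the output of the stage below it, which is what makes the pretraining unsupervised and greedy.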

Assuming that a deep neural network model is composed of sDSAE and logistic regression classifier, the training of the whole model can be divided into two processes: unsupervised pretraining process and supervised fine-tuning process. Among them, the specific steps to realize the supervised fine-tuning process are as follows:

Step 1. A feedforward pass is performed, and the layer-by-layer greedy algorithm is used to train the L2 and L3 layers of the sDSAE up to the output layer Lnl, obtaining the network parameters of all layers.

Step 2. The loss cost is calculated between the classification result of the output layer logistic regression classifier and the corresponding label of the input data.

Step 3. The residual is calculated for the output layer, and the calculation formula is as follows: the characteristic representation of the last layer of the sDSAE is input to the output layer; that is, the output layer is the classifier, so its derivation needs to be handled separately. In the logistic regression classifier, I is used to represent the label corresponding to the input data, and Q is used to represent the conditional probability vector in the formula [29].

Step 4. The residuals of all hidden layers in the sDSAE are calculated as follows:

Step 5. Partial derivatives are calculated by using residuals as follows:

Step 6. Network parameters are updated as follows:

The above constitutes one iteration of the fine-tuning process. Through multiple iterations and updates, fine-tuning can obtain better network parameters and improve network performance.
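One fine-tuning iteration (Steps 1 to 6) can be sketched for a single hidden layer feeding a logistic output unit; the data, layer sizes, and learning rate are illustrative assumptions, and the pretrained parameters are replaced by random placeholders:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, d, h = 32, 20, 8
x = rng.random((n, d))                        # toy input features
labels = rng.integers(0, 2, n).astype(float)  # toy class labels I

# "Pretrained" parameters (random placeholders): one hidden layer plus
# a logistic regression output unit.
W1, b1 = rng.normal(0, 0.1, (d, h)), np.zeros(h)
w2, b2 = rng.normal(0, 0.1, h), 0.0
lr = 0.1

def forward(x):
    y = sigmoid(x @ W1 + b1)   # Step 1: feedforward pass
    q = sigmoid(y @ w2 + b2)   # conditional probability vector Q
    return y, q

y, q = forward(x)
# Step 2: loss cost between the classifier output and the labels
loss_before = -np.mean(labels * np.log(q) + (1 - labels) * np.log(1 - q))

delta_out = (q - labels) / n                       # Step 3: output-layer residual
delta_hid = np.outer(delta_out, w2) * y * (1 - y)  # Step 4: hidden-layer residual

w2 = w2 - lr * (y.T @ delta_out)   # Steps 5-6: partial derivatives and update
b2 = b2 - lr * delta_out.sum()
W1 = W1 - lr * (x.T @ delta_hid)
b1 = b1 - lr * delta_hid.sum(0)

_, q_new = forward(x)
loss_after = -np.mean(labels * np.log(q_new) + (1 - labels) * np.log(1 - q_new))
print(loss_before, loss_after)
```

A single gradient step with a small learning rate should already reduce the loss slightly; repeated iterations drive the fine-tuning described above.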
The algorithm flow of training the sDSAE is shown in Figure 4.
A trained stacked denoising sparse autoencoder has significant effects in models that are correlated in the feature space (such as image matching and speech recognition), while models that are uncorrelated in the feature space (such as software defect prediction and text classification) may lose important information. The reason is that, for models that are uncorrelated in the feature space, multigranularity scanning reduces the importance of the features at both ends of the feature space to a certain extent. In multigranularity scanning, the first feature and the last feature are each scanned only once; that is, each of these features is used only once. If the importance of the first or last feature is very high, multigranularity scanning cannot make effective use of it.

3.2. Software Defect Prediction Method Based on sDSAE

In this paper, an sDSAE with four hidden layers is designed for feature extraction from the software defect data, and a software defect prediction model is built that classifies and predicts the extracted defect features with a logistic regression classifier. The model structure is shown in Figure 5 [30].

The entire deep network, including the logistic regression classifier, is fine-tuned to obtain the optimal network parameters. The overall algorithm flow of software defect prediction based on sDSAE is as follows (Algorithm 2).

 Input: training data defectData, label defectLabel corresponding to the training data, test data testData, number of input layer nodes inputSize of the sDSAE, numbers of hidden layer nodes hiddenSizeL1, hiddenSizeL2, hiddenSizeL3, hiddenSizeL4, weight attenuation coefficient lambda, sparse regularization parameter beta, masking-noise rate noiseRatio, maximum number of iterations maxIter to minimize the loss cost function, weight attenuation coefficient LogisticLambda of the logistic regression classifier, and maximum number of iterations LogisticMaxIter.
 Output: predicted defect tendency label predLabel, predicted defect tendency probability value predScore.
(1)The software defect data defectData are preprocessed to obtain the processed training data trainData. The preprocessing process mainly includes removing invalid data and data standardization. The data standardization process refers to the process of making the training data conform to the standard normal distribution;
(2)Take the processed training data trainData as the input of the first layer of sDSAE and train to obtain the first-order feature sae1Features of the defect data;
(3)Take the first-order feature sae1Features of the defect data as the input of the second layer of sDSAE and train to obtain the second-order feature sae2Features of the defect data;
(4)Similar to Step 3, the third-order features sae3Features and fourth-order features sae4Features of the defect data can be obtained, respectively;
(5)The features of each order obtained in Steps 2–4, together with the labels of the software defect data, are used as the input of the logistic regression classifier to construct software defect prediction models;
(6)“Fine-tuning” the constructed prediction model through the back propagation algorithm and gradient descent method to optimize the network parameters of each prediction model;
(7)The test data testData are preprocessed in the same way as the training data and then input to each trained prediction model to obtain the probability value predScore of the predicted defect tendency;
(8)If predScore ≥ 0.5, then predLabel = 1; otherwise, predLabel = 0.
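Steps 1 and 8 of the algorithm above can be sketched as follows (a NumPy illustration; the data are random placeholders standing in for defectData and the model's output scores):

```python
import numpy as np

def standardize(data):
    """Step 1's data standardization: rescale each metric so the training
    data conform to a standard normal distribution (zero mean, unit variance)."""
    mu, sigma = data.mean(axis=0), data.std(axis=0)
    return (data - mu) / np.where(sigma == 0, 1.0, sigma)

def predict_labels(pred_score, threshold=0.5):
    """Step 8: convert predicted defect-tendency probabilities predScore
    into defect-tendency labels predLabel."""
    return (pred_score >= threshold).astype(int)

rng = np.random.default_rng(0)
defect_data = rng.random((100, 5)) * 10 + 3   # toy raw metric data
train_data = standardize(defect_data)
pred_score = rng.random(10)                   # stand-in for model output
pred_label = predict_labels(pred_score)
print(train_data.mean(axis=0).round(6), pred_label)
```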

4. Experimental Setup and Result Analysis

4.1. Experimental Data Set

In this paper, the performance of the software defect prediction model is evaluated on the Eclipse defect data set, which is one of the most widely studied public data sets in the field of software defect prediction and is available from EclipseBugData.

There are six ARFF files in the Eclipse data set, corresponding to the defect records of the three versions (Eclipse 2.0, Eclipse 2.1, and Eclipse 3.0) of the Eclipse defect data set at two granularities (files, packages). The defect data records are divided into prerelease defects and postrelease defects; prerelease defects are defects found during the development process, and postrelease defects are defects found during the user's use phase. This experiment uses the defect data records of the three versions at file granularity and takes the defect tendency of the program modules after the software release as the prediction target. The defect count is converted into the class label hasDefects, indicating whether a software module has defects, as follows:
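Assuming the post-release defect count of a module is stored in a column conventionally named post (an assumption about the data layout), the conversion can be sketched as:

```python
def has_defects(post_release_defect_count):
    # hasDefects is 1 when the module has at least one defect, else 0.
    return 1 if post_release_defect_count > 0 else 0

print([has_defects(n) for n in [0, 1, 5]])  # [0, 1, 1]
```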

The statistical information of the Eclipse defect data set at file granularity is shown in Table 1.

4.2. Experimental Environment and Methods

The experimental environment is shown in Table 2.

In this experiment, because the number of features in the defect data set is 200, the number of input layer nodes is set to 200. Since there is no uniform rule for choosing the depth of the model, which is usually determined by the experimental data and task requirements, the number of hidden layers and the number of hidden-layer nodes are set according to the specific situation in the experiment; the weight attenuation coefficient lambda is set to 1e-3. In order to break the symmetry of the autoencoder and obtain a better training effect, the weight matrix of the autoencoder is usually initialized randomly rather than with all zeros.

In the experiment, the network parameter optimization method is designated as L-BFGS, which uses the quasi-Newton method and limited memory BFGS algorithm to update the weights and limits the maximum number of iterations maxIter to 400. The weight attenuation coefficient LogisticLambda of the logistic regression classifier is set to 1e-4, and the loss cost function is also optimized by minFunc, the optimization method is L-BFGS, and the maximum number of iterations LogisticMaxIter is set to 100.

The hyperparameters of the deep stacked forest algorithm are set according to the settings of the deep forest algorithm, as shown in Table 3.

The difference is that in the deep stacked forest, when sampling randomly, three scales and three times are used to sample the original features. The sampling scales are (d/18, d/10, d/5). The corresponding sampling times are 200, 100, and 50, respectively.

Since the file-level data of the Eclipse defect data set have the same structure, one version of the defect data is taken as the training data set for the model. First, the Eclipse file-level defect data sets are preprocessed; that is, the module defect count post is converted into the class label hasDefects. Then, the gcForest algorithm is used to construct a software defect prediction model and to classify and predict the training set 9 times. Finally, the DSF algorithm is used to build a software defect prediction model and to classify and predict. When the training data set and the test data set are from the same version, ten-fold cross-validation is used.
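The ten-fold cross-validation split can be sketched as follows (a NumPy illustration; the sample count is an arbitrary toy value):

```python
import numpy as np

def ten_fold_indices(n_samples, seed=0):
    """Split sample indices into 10 disjoint folds; each fold serves once
    as the test set while the other nine folds form the training set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, 10)
    for k in range(10):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train_idx, test_idx

n = 997   # toy module count; folds need not be exactly equal in size
splits = list(ten_fold_indices(n))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Metrics such as running time are then averaged over the ten train/test rounds, as done in the experiments.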

4.3. Experimental Results

An encoder with only one hidden layer is used to construct a software defect prediction model, with the number of hidden-layer nodes set to 100. The loss cost function uses the squared-difference cost function and the cross-entropy cost function, respectively, and the prediction results are compared in terms of correct rate, accuracy rate, recall rate, F1-measure, AUC value, and running time (the average running time is taken over ten-fold cross-validation). The experimental data are shown in Table 4, and the comparative test results are shown in Figure 6.

The encoder is used to extract the features of software defect data. There is no need to define the features in advance but only input the defect data into the network. The encoder will learn to obtain the feature representation of the defect data, and the obtained feature representation will be classified and predicted by logistic regression classifier, which can achieve good prediction effect.

The results show that the prediction models using the squared-difference cost function and the cross-entropy cost function are basically the same in prediction correct rate, which remains above 0.8, indicating that both models have high predictive ability. However, the prediction model using the cross-entropy cost function outperforms the one using the squared-difference cost function in prediction accuracy rate, recall rate, F1-measure, and AUC value, and its running time is only about one third of that of the model using the squared-difference cost function.

The experiments on the deep stacked forest algorithm mainly study the influence of the number of trees and the depth of the model in the stacked forest structure on the performance of the prediction model, as well as the effects of random sampling and stacked forests on the performance of the prediction models in the deep stacked forest.

4.3.1. The Effect of the Depth of the Stacked Forest and the Number of Trees on the Performance of the Prediction Model in the Deep Stacked Forest Algorithm

In order to verify the influence of the depth of the stacked forest and the number of trees on the performance of the prediction model, software defect prediction models based on stacked forests are constructed, where the depth of the stacked forest is one, two, three, or four layers and the number of trees is 500. The prediction results are compared in terms of correct rate, accuracy rate, recall rate, and running time (the average running time is taken over ten-fold cross-validation). The experimental results are shown in Figure 7.

In Figure 7, for stacked forests of the same depth, when the number of trees in a forest is less than 200, the prediction correct rate and accuracy rate basically increase and the prediction recall rate basically decreases; when the number of trees in a forest is greater than 200, the prediction correct rate, accuracy rate, and recall rate are all stable. When the number of trees in a forest is less than 200, the learning ability of the random forest and the completely random tree forest in the stacked forest increases as the number of trees increases, so the stacked forest can learn more detailed data information; therefore, the performance of the prediction model increases with the number of trees. When the number of trees in a forest is greater than 200, the learning ability of the random forest and the completely random tree forest has reached saturation, and increasing the number of trees no longer improves learning performance; therefore, the performance of the prediction model is stable. However, the running-time comparison shows that the running time of the prediction model grows essentially linearly with the number of trees in the forest. It is concluded that the performance of the stacked forest is best when the number of trees in the forest is 200. For the same number of trees in the stacked forest, as the stacked forest gets deeper, the prediction correct rate remains stable, the prediction accuracy rate shows a downward trend, and the prediction recall rate shows an increasing trend. It is concluded that a stacked forest with a depth of three layers is the best choice.

4.3.2. The Effect of Cascaded Forest and Stacked Forest on the Performance of the Prediction Model

A software defect prediction model based on deep stacked forests, which does not perform feature transformation through random sampling and only uses stacked forests for layer-by-layer learning, is compared with a software defect prediction model based on deep forest, which does not use multigranularity scanning and only uses the cascade forest. The experimental results are compared in terms of correct rate, accuracy rate, recall rate, F1-measure, AUC value, and running time (the average running time is taken over ten-fold cross-validation). The experimental results are shown in Figure 8.

In Figure 8, the result of using cascading forest is that the prediction correct rate is between 0.84 and 0.89 and the average is 0.86; the accuracy rate is between 0.30 and 0.67, and the average is 0.48; the recall rate is between 0.15 and 0.40, and the average is 0.25; the running time is between 402s and 688s, and the average value is 533s. The result of using stacked forest is that the prediction correct rate is between 0.84 and 0.89 and the average is 0.86; the accuracy rate is between 0.31 and 0.68, and the average is 0.49; the recall rate is between 0.14 and 0.41, and the average is 0.25; the running time is between 382s and 662s, and the average value is 510s.

The above data show that stacked forests and cascaded forests give the same performance in prediction correct rate, accuracy rate, and recall rate. However, using stacked forests has an advantage in running time over using cascaded forests, which indicates that prediction models using stacked forests are better than those using cascaded forests.

Comparing the two prediction models based on random sampling with the two prediction models based on multigranularity scanning, it can be found that the prediction correct rate is equivalent, the prediction accuracy rate is slightly reduced, and the prediction recall rate and runtime efficiency are improved to different degrees, with the improvement in running time being especially large. This shows that the prediction model based on random sampling is superior to the prediction model based on multigranularity scanning. Comparing the two prediction models based on stacked forests with the two prediction models based on cascaded forests, it can be seen that the prediction correct rate is equivalent, the prediction accuracy rate and recall rate are improved to different degrees, and the running time decreases only slightly. This shows that the prediction model based on stacked forests performs better than the prediction model based on cascaded forests.

5. Conclusion

By improving the loss cost function and sparsity constraint method of the traditional encoder, the improved encoder uses the cross-entropy loss cost function and the L1 sparsity constraint rule, and its robustness is improved in order to eliminate the influence of noise on the data. In the experiments, the prediction model using the squared-difference cost function is used for comparative testing. The experimental data show that the prediction model based on sDSAE feature extraction is better than the prediction model without feature extraction or with PCA-based feature extraction. Although the prediction correct rate and accuracy rate decrease slightly, the prediction recall rate is greatly improved; more importantly, there is a significant improvement in the comprehensive evaluation index F1-measure, and the prediction model based on sDSAE feature extraction also achieves a significantly better AUC value than the prediction model based on PCA feature extraction. This proves that the feature extraction method proposed in this paper has a positive effect on improving the performance of the software defect prediction model.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the project Online Identification and Diagnosis of Software Performance Problems Based on SaaS Software Running Log Analysis and in part by grants from the National Natural Science Foundation of China (nos. 62072342 and 61672392).