Abstract
The imputation of missing values is an important research topic in incomplete data analysis. Based on the auto associative neural network (AANN), this paper conducts regression modeling for incomplete data and imputes missing values. Since the AANN can estimate missing values under multiple missingness patterns efficiently, we introduce incomplete records into the modeling process and propose an attribute cross fitting model (ACFM) based on AANN. ACFM reconstructs the path of data transmission between output and input neurons and optimizes the model parameters using the training errors of existing data, thereby improving its ability to fit the relations between attributes of incomplete data. In addition, to address the problem of incomplete model input, this paper proposes a model training scheme that treats missing values as variables and updates them iteratively together with the model parameters. This combination of local learning and global approximation increases the precision of model fitting and the imputation accuracy of missing values. Finally, experiments on several datasets verify the effectiveness of the proposed method.
1. Introduction
Various factors in the process of data collection, transmission, and storage may cause data loss to different degrees. Incomplete data prevents most computational intelligence techniques from being applied directly [1]. In cases where incomplete records cannot simply be deleted, an effective method is needed to impute missing values.
Researchers have proposed a variety of imputation methods. The mean imputation method imputes missing values with the mean of the existing values of the corresponding attribute [2]. The hot deck method finds the record in the database most similar to the incomplete record and imputes the missing values with the values of that record [3]. The k-nearest neighbors (KNN) imputation method imputes missing values with the weighted average of the records closest to the incomplete record [4]. Model-based methods are usually an effective way to improve the accuracy of imputation. For example, the expectation-maximization (EM) method alternates between the expectation step and the maximization step, iteratively updating model parameters and missing values until convergence [5]. The multiple imputation method obtains candidate values through one or more models and combines the results to impute missing values [6]. The imputation method based on the linear model imputes missing values by modeling the linear relation between attributes [7]. It assumes that the data are linearly correlated, but the relations in real data are complex, unknown, and often nonlinear.
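As a point of reference for the simple baselines above, the following minimal sketch shows mean imputation and a KNN-style imputation in NumPy. The function names, the inverse-distance weighting, and the choice to search only among complete records are assumptions made for illustration; the cited methods [2, 4] may differ in these details.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def knn_impute(X, k=2):
    """Fill each NaN with an inverse-distance weighted average of the k nearest
    complete records, measuring distance on the observed attributes only."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        dist = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        weights = 1.0 / (dist[nearest] + 1e-8)   # small constant avoids division by zero
        out[i, miss] = weights @ complete[nearest][:, miss] / weights.sum()
    return out

# Toy data: the second attribute of the middle record is missing.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
filled_mean = mean_impute(X)
filled_knn = knn_impute(X, k=2)
```

Both baselines ignore nonlinear relations between attributes, which is the gap the neural network methods discussed next try to close.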
The neural network is flexible in construction. In theory, a neural network with a nonlinear activation function can approximate complex nonlinear relations [8]. An imputation model based on neural networks can therefore mine complex associations among the attributes of incomplete data. Such a method usually trains the network on complete records, then feeds prefilled incomplete records into the network and uses the network output to impute missing values [9]. Sharpe and Solly [10] constructed a multilayer perceptron (MLP) for each missingness pattern to fit the regression relation between missing attributes and existing attributes. However, when there are many missingness patterns, the number of constructed models is large and training is time-consuming. Ankaiah and Ravi [11] proposed an improved MLP imputation method, which takes each missing attribute as output and the remaining attributes as input to construct a single-output prediction network. The number of models constructed by this method equals the number of missing attributes. Although the MLP imputation model can fit the regression relation between data attributes, it does so at the expense of model training time.
The auto associative neural network (AANN) is a type of network with the same number of nodes in the output layer and the input layer. Only one model needs to be built to impute incomplete data under all missingness patterns [12]. Marwala et al. [13] proposed an imputation method combining AANN and the genetic algorithm (GA) and then applied it to two real datasets [14, 15]. This method takes the cost function of AANN as the fitness function of the genetic algorithm and uses the genetic algorithm to search for the missing values. Based on the framework proposed by Marwala, Nelwamondo et al. adopted principal component analysis to select a reasonable number of nodes in the hidden layer [16] and to reduce the dimension of data [17]. Ssali and Marwala [18] used the intervals of continuous attributes produced by a decision tree as data boundaries, which further improved the imputation accuracy. In addition to the combination of AANN and GA, Ravi and Krishna [19] proposed four improved imputation models based on AANN: the general regression auto associative neural network (GRAANN), the particle swarm optimization based auto associative neural network (PSOAANN), the auto associative wavelet neural network (AAWNN), and the radial basis function auto associative neural network (RBFAANN). Among these models, GRAANN performs better than MLP and the other three models on most datasets and needs only one iteration to impute missing values. Gautam and Ravi proposed two imputation models based on AANN, the auto associative extreme learning machine [20] and the counter propagation auto associative neural network [21]. Their experimental results show that the combination of local learning and global approximation yields better imputation results.
The above methods use only complete records to train the model, which avoids the problem of missing values during training. However, missing values in incomplete records lead to incomplete model input in the imputation stage. Since the MLP imputation method constructs a specific model for each missingness pattern, taking incomplete attributes as output and complete attributes as input, it can feed each incomplete record directly into the subnet of the corresponding missingness pattern. The AANN imputation method, in contrast, usually needs a prefilling method to deal with missing values during imputation. For instance, Ravi and Krishna [19] used averages to prefill missing values. Nishanth and Ravi [22] adopted means and medoids methods to prefill missing values. Gautam and Ravi [21] used the nearest neighbor method based on grey distance to prefill missing values.
When the missing rate of a dataset is high, the number of complete records is small. If only complete records are used to train the network, a large amount of information in incomplete records is lost, and sometimes too few records remain to build the model at all. Therefore, Silva-Ramírez et al. [23] prefilled missing values with a fixed value and then trained the network on all records. García-Laencina et al. [24–28] proposed a multitask network that uses zero to initialize missing values and allows incomplete records to participate in model training. Although prefilling incomplete records with fixed values allows them to participate in model training, the prefilled values carry an estimation error. If the model is trained directly on prefilled data, the accuracy of the final model is affected by this error. In addition, Yoon et al. [29] proposed an imputation method based on Generative Adversarial Nets, in which the generator produces the imputed data. The inception architecture [30], which has been applied in edge computing [31], is another candidate network architecture.
As mentioned above, the imputation method based on AANN handles multiple missingness patterns with better training efficiency than MLP. Consequently, this paper conducts regression modeling for the attributes of incomplete data based on the AANN architecture. By redesigning the data transmission structure of AANN, the representation of regression relations between data attributes is enhanced. Moreover, to address the problem of incomplete model input, this paper proposes a model training scheme that treats missing values as variables and updates them iteratively along with the model parameters during training. The improved model and training scheme make full use of the existing data in the incomplete dataset, gradually reduce the estimation error of the missing value variables during training, and increase the accuracy of imputation through local learning and global approximation.
The rest of this paper is organized as follows. Section 2 introduces the MLP and AANN imputation models. Section 3 proposes ACFM based on AANN and a model training scheme named UMVDT. Section 4 analyses the imputation performance of ACFM and UMVDT. Section 5 concludes the paper.
2. MLP and AANN Imputation Models
MLP is a feedforward artificial neural network composed of an input layer, an output layer, and several hidden layers. When the MLP method is applied to impute missing values, one MLP imputation network must be constructed for each missingness pattern. Figure 1 shows an incomplete dataset with several missingness patterns, which differ in the positions and the number of missing values in a record. For an incomplete dataset that is missing at random, the higher the missing rate, the more missingness patterns it contains. The corresponding imputation networks are shown in Figure 2, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{is}]^T$ represents the $i$th record, $y_i = [y_{i1}, y_{i2}, \ldots, y_{is}]^T$ represents the network output for the $i$th record, and $s$ represents the dimension of attribute. If $P_k$ represents the indices of the missing attributes in the $k$th missingness pattern, the cost function of the model is
$$E_k = \frac{1}{2} \sum_{x_i \in X_{co}} \sum_{j \in P_k} \left( y_{ij} - x_{ij} \right)^2, \qquad y_{ij} = f_j\left( \{ x_{ih} \}_{h \notin P_k}; W \right), \tag{1}$$
where $X_{co}$ represents the complete records, $f$ represents the nonlinear mapping of the model, and $W$ represents the weight of the model.
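To make the cost of the per-pattern approach concrete, the following sketch counts the distinct missingness patterns in a small array, since each pattern would require its own MLP network. The helper name `missingness_patterns` and the toy data are illustrative assumptions and do not come from the original experiments.

```python
import numpy as np

def missingness_patterns(X):
    """Return the distinct missingness patterns among incomplete records
    (1 = missing, 0 = observed) and how many records share each pattern."""
    mask = np.isnan(np.asarray(X, dtype=float))
    incomplete = mask[mask.any(axis=1)].astype(int)
    patterns, counts = np.unique(incomplete, axis=0, return_counts=True)
    return patterns, counts

# Four records, three attributes; NaN marks a missing value.
X = np.array([
    [1.0, 2.0, 3.0],        # complete
    [np.nan, 2.0, 3.0],     # pattern: first attribute missing
    [np.nan, 5.0, 6.0],     # same pattern
    [1.0, np.nan, np.nan],  # pattern: second and third attributes missing
])
patterns, counts = missingness_patterns(X)
n_networks_needed = len(patterns)   # one MLP per pattern in the per-pattern scheme
```

With $s$ attributes there can be up to $2^s - 2$ patterns that contain both missing and observed values, which is why the number of per-pattern networks grows quickly as the missing rate rises.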
AANN requires that the number of nodes in the output layer equals that in the input layer. To prevent overfitting, the number of nodes in the hidden layer is usually set smaller than that in the input layer. As shown in Figure 3, the imputation method based on AANN can fill incomplete data under all missingness patterns through one structure. Generally, the model is trained on the subset of complete records, and each incomplete record is prefilled and then reconstructed by the model to impute the corresponding missing values. The cost function can be expressed as
$$E = \frac{1}{2} \sum_{x_i \in X_{co}} \sum_{j=1}^{s} \left( y_{ij} - x_{ij} \right)^2, \qquad y_{ij} = f_j\left( x_i; W \right), \tag{2}$$
where $x_{ij}$ and $y_{ij}$ are the $j$th input and output values for the $i$th record, $s$ is the dimension of attribute, $X_{co}$ represents the complete records, $f$ is the nonlinear mapping of the model, and $W$ is the weight of the model.
It can be seen that each output value of the AANN model is calculated from all input values. During training, each output value tends to learn the input value at the same position, so in the imputation stage the quality of the imputed values depends to some degree on the quality of the prefilled values. In contrast, each output value of the MLP model is calculated by a dedicated regression network. Compared with the MLP model, AANN therefore lacks clear regression relations to guide model training and the imputation of missing values.
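The AANN training objective described above can be sketched as follows, assuming a single tanh hidden layer that is narrower than the input and a linear output layer. The layer sizes, the activation, and the function names are assumptions for illustration only.

```python
import numpy as np

def aann_forward(X, W1, b1, W2, b2):
    """One-hidden-layer AANN: every output is computed from all inputs."""
    H = np.tanh(X @ W1 + b1)
    return H @ W2 + b2

def aann_cost(X_complete, params):
    """Half the summed squared reconstruction error over complete records."""
    Y = aann_forward(X_complete, *params)
    return 0.5 * np.sum((Y - X_complete) ** 2)

rng = np.random.default_rng(0)
s, hidden = 4, 2                       # hidden layer narrower than the input layer
params = (rng.normal(0, 0.1, (s, hidden)), np.zeros(hidden),
          rng.normal(0, 0.1, (hidden, s)), np.zeros(s))

X = rng.random((6, s))
X[1, 2] = np.nan                       # one incomplete record
complete = X[~np.isnan(X).any(axis=1)] # only complete records enter the cost
cost = aann_cost(complete, params)
```

Because output $j$ receives input $j$ directly, the network can partly copy a prefilled value to its output, which is the dependence on prefilling quality noted above.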
3. Proposed Architecture
3.1. Attribute Cross-Fitting Model
The AANN imputation model handles multiple missingness patterns through one architecture, but it does not establish a clear regression relation between data attributes. In this paper, the regression relations between each attribute and the remaining attributes of an incomplete dataset are expressed on one architecture by redesigning the cost function of the model:
$$E = \frac{1}{2} \sum_{x_i \in X} \sum_{j=1}^{s} \left( y_{ij} - x_{ij} \right)^2, \qquad y_{ij} = f_j\left( x_{i1}, \ldots, x_{i,j-1}, x_{i,j+1}, \ldots, x_{is}; W \right), \tag{3}$$
where $X$ represents an incomplete dataset, $x_{ij}$ and $y_{ij}$ are the $j$th input and output values for the $i$th record, $s$ is the dimension of attribute, $f$ is the nonlinear mapping of the model, and $W$ is the weight of the model. It can be seen from equation (3) that the $j$th output value of the model is calculated from all input values except the $j$th input value, which helps to establish a regression relation between each output value and the remaining input values. Moreover, the output of the model no longer depends on the corresponding input value; thus, the effect of prefilled values is weakened during the imputation stage. To minimize the cost function, the network needs to fully learn the correlation between each output neuron and the noncorresponding input neurons. Therefore, the cost function effectively enhances the ability to mine the internal associations of attributes.
If the neural network is trained on incomplete records, the missing values need to be prefilled. However, prefilled values carry an estimation error relative to the original data. The model should therefore prevent the training error between prefilled values and their predicted values from being used to optimize model parameters. This paper defines this error as the missing value error. Hence, when training the network with an incomplete dataset, the cost function to be optimized should be
$$E = \frac{1}{2} \sum_{x_i \in X} \sum_{j=1,\, j \notin M_i}^{s} \left( y_{ij} - x_{ij} \right)^2, \tag{4}$$
where $M_i$ is the set of indexes of the missing values in record $x_i$, and $j \notin M_i$ indicates that the missing value error is no longer used to optimize model parameters. The model constructed on this cost function can fit the regression relations between data attributes with one architecture and is called the attribute cross fitting model (ACFM) in this paper.
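The cross-fitting idea and the exclusion of the missing value error can be illustrated with a small NumPy sketch. It assumes one tanh hidden layer, weights shared by all outputs, and that an input is excluded from a path by zeroing it; these choices and the names `acfm_outputs` and `acfm_cost` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def acfm_outputs(X, W1, b1, W2, b2):
    """Y[:, j] is computed from every input attribute except attribute j."""
    n, s = X.shape
    Y = np.empty((n, s))
    for j in range(s):
        Xj = X.copy()
        Xj[:, j] = 0.0                    # remove attribute j from the path to output j
        H = np.tanh(Xj @ W1 + b1)
        Y[:, j] = H @ W2[:, j] + b2[j]
    return Y

def acfm_cost(X_filled, observed, params):
    """Squared error summed over observed entries only (missing value error excluded)."""
    Y = acfm_outputs(X_filled, *params)
    return 0.5 * np.sum(observed * (Y - X_filled) ** 2)

rng = np.random.default_rng(0)
s, hidden = 4, 3
params = (rng.normal(0, 0.5, (s, hidden)), np.zeros(hidden),
          rng.normal(0, 0.5, (hidden, s)), np.zeros(s))
X = rng.random((5, s))
observed = np.ones_like(X)
observed[0, 1] = 0.0                      # treat X[0, 1] as a prefilled missing value

Y_before = acfm_outputs(X, *params)
cost = acfm_cost(X, observed, params)

X_changed = X.copy()
X_changed[0, 1] += 10.0                   # perturb the prefilled value
Y_after = acfm_outputs(X_changed, *params)
```

In this sketch, changing the prefilled value at position (0, 1) leaves the prediction for that same position unchanged, although it still affects the other outputs of that record through their own paths.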
The data transmission process of the output neurons of ACFM is shown in Figure 4 [32], in which an incomplete record with two missing values is input into ACFM. Because ACFM does not use the missing value error to optimize model parameters, the two output values at the missing positions are not calculated. Each remaining output value is calculated from all input values except the input value at its own position, and all of these output values are computed by the same process. It can be seen that the calculation amount of ACFM is the same as that of AANN. In our implementation, each input record is replicated as many times as the data has attributes, a fully connected forward calculation is performed on the expanded input, and the output is then sliced to extract the required values. Therefore, the number of parameters of ACFM in the experiment is that of AANN multiplied by the number of attributes of the data.
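One way to realize the "expand, forward, slice" procedure is sketched below. Unlike the implementation described above, this sketch shares a single set of weights across all copies of a record rather than multiplying the parameter count, and it excludes an input by zeroing it; both are simplifying assumptions, and the function name is hypothetical.

```python
import numpy as np

def acfm_outputs_expanded(X, W1, b1, W2, b2):
    """Vectorized cross fitting: expand each record into s copies, run one
    fully connected forward pass, then slice out output j from copy j."""
    n, s = X.shape
    # Copy j of each record has attribute j zeroed out: shape (n, s, s).
    expanded = X[:, None, :] * (1.0 - np.eye(s))
    H = np.tanh(expanded @ W1 + b1)        # (n, s, hidden)
    full = H @ W2 + b2                     # (n, s, s): all outputs for every copy
    return np.diagonal(full, axis1=1, axis2=2)   # (n, s): output j taken from copy j

rng = np.random.default_rng(0)
s, hidden = 4, 3
W1, b1 = rng.normal(0, 0.5, (s, hidden)), rng.normal(0, 0.1, hidden)
W2, b2 = rng.normal(0, 0.5, (hidden, s)), rng.normal(0, 0.1, s)
X = rng.random((5, s))                     # prefilled data (no NaN allowed here)

Y = acfm_outputs_expanded(X, W1, b1, W2, b2)

# Reference computation for a single output column, done the slow way.
j = 2
Xj = X.copy()
Xj[:, j] = 0.0
reference = np.tanh(Xj @ W1 + b1) @ W2[:, j] + b2[j]
```

The expanded pass computes $s \times s$ outputs per record and discards the off-diagonal ones, trading extra computation for a simple batched implementation.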
3.2. Updating Missing Values during Training
Prefilling missing values solves the problem of incomplete model input, but when prefilled incomplete records are used to train the model directly, the quality of the prefilled values has an important impact on the quality of the trained model. The prefilled values have an initial estimation error, which reduces the accuracy of the model. Therefore, this paper proposes a model training scheme that treats missing values as variables and iteratively updates the missing values during training (UMVDT). UMVDT dynamically adjusts the missing values and gradually reduces their estimation error, so that the missing values come to satisfy the fitting relationship determined by the existing data. As shown in Figure 5, the UMVDT training scheme initializes the missing value variables in incomplete records and inputs the incomplete records into ACFM to calculate the error between output and input values; it then updates the missing value variables and the network parameters through the back propagation algorithm. This process is repeated for all records until the model converges. In a model trained with UMVDT, the missing value variables are optimized by the regression structure within the incomplete data. As training proceeds, the accuracy of the model improves, and the missing values predicted by the model also become more accurate.
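Let the input layer of ACFM be layer 1 and the output layer be layer $n$, and let $w_{kh}^{(l)}$ and $b_k^{(l)}$ represent the weights and thresholds from layer $l$ to layer $l+1$ ($l = 1, 2, \ldots, n-1$), where $w_{kh}^{(l)}$ connects the $h$th neuron in layer $l$ to the $k$th neuron in layer $l+1$. Each output neuron of the model outputs its linear summation directly, so the $j$th output for record $x_i$ can be expressed as
$$y_{ij} = z_{jj}^{(n)} = \sum_{k=1}^{n_{n-1}} w_{jk}^{(n-1)} a_{jk}^{(n-1)} + b_j^{(n-1)}, \tag{5}$$
where $z_{jk}^{(l)}$ represents the linear summation of the $k$th neuron in layer $l$ on the path to the $j$th output, $n_l$ represents the number of neurons in layer $l$, and $a_{jk}^{(l)}$ represents the $k$th output in layer $l$ on that path. Corresponding to each neuron $j$ in the output layer, the output of the $k$th neuron in each hidden layer $l$ ($2 \le l \le n-1$) can be expressed as
$$a_{jk}^{(l)} = \varphi\left( z_{jk}^{(l)} \right), \qquad z_{jk}^{(l)} = \sum_{h=1}^{n_{l-1}} w_{kh}^{(l-1)} a_{jh}^{(l-1)} + b_k^{(l-1)}, \tag{6}$$
where $\varphi$ is the activation function, and the input layer supplies $a_{jh}^{(1)} = x_{ih}$ for $h \neq j$ and $a_{jj}^{(1)} = 0$, so that the $j$th input value does not contribute to the $j$th output. According to equation (4), the error between the $i$th record and the output of the network is
$$E_i = \frac{1}{2} \sum_{j=1,\, j \notin M_i}^{s} \left( y_{ij} - x_{ij} \right)^2. \tag{7}$$
Define the intermediate variables as $\delta_{jk}^{(l)} = \partial E_i / \partial z_{jk}^{(l)}$. For the output layer,
$$\delta_{jj}^{(n)} = I_{ij} \left( y_{ij} - x_{ij} \right), \qquad \delta_{jk}^{(n)} = 0 \ (k \neq j), \qquad I_{ij} = \begin{cases} 1, & j \notin M_i \\ 0, & j \in M_i \end{cases} \tag{8}$$
where $I_{ij} = 1$ represents that the input value corresponding to the $j$th predicted value is available, and $I_{ij} = 0$ represents that the input value corresponding to the $j$th predicted value is missing; in the latter case the partial derivative is set to zero, and the corresponding model parameters are not optimized. When $2 \le l \le n-1$, $\delta_{jk}^{(l)}$ is
$$\delta_{jk}^{(l)} = \varphi'\left( z_{jk}^{(l)} \right) \sum_{m=1}^{n_{l+1}} w_{mk}^{(l)} \delta_{jm}^{(l+1)}, \tag{9}$$
and it can be concluded that the partial derivative of error for the network parameter $w_{kh}^{(l)}$ is
$$\frac{\partial E_i}{\partial w_{kh}^{(l)}} = \sum_{j=1}^{s} \delta_{jk}^{(l+1)} a_{jh}^{(l)}. \tag{10}$$
Similarly, the partial derivative of error for the network parameter $b_k^{(l)}$ is
$$\frac{\partial E_i}{\partial b_k^{(l)}} = \sum_{j=1}^{s} \delta_{jk}^{(l+1)}. \tag{11}$$
Assuming that the learning rate is $\eta$ and that the gradient descent method is used to optimize the model, the updating rule of the model parameters is
$$w_{kh}^{(l)} \leftarrow w_{kh}^{(l)} - \eta \frac{\partial E_i}{\partial w_{kh}^{(l)}}, \qquad b_k^{(l)} \leftarrow b_k^{(l)} - \eta \frac{\partial E_i}{\partial b_k^{(l)}}. \tag{12}$$
Missing value variables are updated together with the model parameters during model training. It can be deduced from equation (9) that the partial derivative of error for the missing value variable $x_{ip}$ ($p \in M_i$) is
$$\frac{\partial E_i}{\partial x_{ip}} = \sum_{j=1,\, j \neq p}^{s} \sum_{k=1}^{n_2} w_{kp}^{(1)} \delta_{jk}^{(2)}, \tag{13}$$
and the updating rule of the missing value variable is
$$x_{ip} \leftarrow x_{ip} - \eta \frac{\partial E_i}{\partial x_{ip}}. \tag{14}$$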
In summary, the imputation algorithm based on the ACFM model and UMVDT training scheme is described as follows:
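The following is a minimal sketch of how ACFM training with UMVDT could be organized; it is not the authors' algorithm listing. It assumes one tanh hidden layer, weights shared across output paths, exclusion of an input by zeroing it, mean initialization of the missing value variables, and plain per-record gradient descent without momentum. The function name `train_acfm_umvdt`, the hyperparameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def train_acfm_umvdt(X_nan, hidden=3, lr=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n, s = X_nan.shape
    observed = ~np.isnan(X_nan)
    O = observed.astype(float)
    # Initialize the missing value variables with column means.
    X = np.where(observed, X_nan, np.nanmean(X_nan, axis=0))
    W1, b1 = rng.normal(0, 0.1, (s, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(0, 0.1, (hidden, s)), np.zeros(s)
    D = 1.0 - np.eye(s)                    # D[j, p] = 0 when p == j
    losses = []
    for _ in range(epochs):
        total = 0.0
        for i in rng.permutation(n):
            x = X[i].copy()
            E = x[None, :] * D             # (s, s): row j is x without attribute j
            H = np.tanh(E @ W1 + b1)       # (s, hidden): hidden outputs per path
            y = np.sum(H * W2.T, axis=1) + b2          # output j read from path j
            delta = O[i] * (y - x)         # error on observed attributes only
            total += 0.5 * np.sum(delta ** 2)
            # Back propagation through the s paths (old weights used throughout).
            dZ = delta[:, None] * W2.T * (1.0 - H ** 2)    # (s, hidden)
            g_x = ((dZ @ W1.T) * D).sum(axis=0)            # gradient w.r.t. inputs
            W2 -= lr * (H * delta[:, None]).T
            b2 -= lr * delta
            W1 -= lr * (E.T @ dZ)
            b1 -= lr * dZ.sum(axis=0)
            # Update only the missing value variables of this record.
            X[i] = x - lr * g_x * (1.0 - O[i])
        losses.append(total / O.sum())
    return X, losses

# Synthetic, strongly correlated data (an assumption made for this demonstration).
rng = np.random.default_rng(1)
t = rng.random(60)
X_true = np.column_stack([t, 0.5 * t + 0.2, 1.0 - t, t ** 2])
X_true += rng.normal(0, 0.01, X_true.shape)
mask = rng.random(X_true.shape) < 0.15     # True marks a deleted value
mask[mask.all(axis=1), 0] = False          # keep at least one value per record
X_nan = np.where(mask, np.nan, X_true)

X_imputed, losses = train_acfm_umvdt(X_nan)
```

After training, the entries of `X_imputed` at the deleted positions hold the final missing value variables; in this sketch they could equally be replaced by a last forward pass through the trained network.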

4. Experiment
4.1. Datasets
To verify the imputation performance of the proposed method, ten complete datasets obtained from the UCI database are used in our experiment; they are described in Table 1. Among them, Stock is often used for clustering tasks, Concrete is often used for regression tasks, and the remaining datasets can be used for classification tasks. Most of the data are numeric; some datasets contain a nonnumeric ID column, which was deleted in the experiment. Additional information is available on the official UCI website. To form incomplete datasets, part of the data is deleted at random according to specified deletion rates of 5%, 10%, 15%, 20%, 25%, and 30%, while ensuring that each incomplete record retains at least one attribute value so that it can be used for training.
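A deletion procedure with these properties can be sketched as follows. The exact procedure used in the experiments is not specified in the text, so the cell-by-cell sampling below and the function name `delete_at_random` are assumptions.

```python
import numpy as np

def delete_at_random(X, rate, seed=0):
    """Delete about rate * n * s values at random while keeping at least one
    observed value in every record. Returns the data with NaN and the mask."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, s = X.shape
    n_missing = int(round(rate * n * s))
    remaining = np.full(n, s)              # observed values left in each record
    mask = np.zeros((n, s), dtype=bool)    # True marks a deleted value
    deleted = 0
    for cell in rng.permutation(n * s):
        if deleted == n_missing:
            break
        i, j = divmod(int(cell), s)
        if remaining[i] > 1:               # never delete the last observed value
            mask[i, j] = True
            remaining[i] -= 1
            deleted += 1
    return np.where(mask, np.nan, X), mask

X_full = np.ones((50, 4))
X_missing, mask = delete_at_random(X_full, rate=0.3)
```

Keeping the mask alongside the data makes it straightforward to compare imputed values with the original ones later.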
4.2. Experimental Design
Six imputation methods based on MLP, AANN, and ACFM are implemented. The methods based on AANN and ACFM are trained with both the traditional training scheme and the UMVDT training scheme. The traditional training scheme only uses the mean value to prefill missing values and does not update them. To verify the effect of the missing value error on the imputation accuracy of the model, this paper uses equation (3), which includes the missing value error, and equation (4), which excludes it, as the cost function, respectively. The specific methods are described as follows:
(1) The imputation method based on the MLP model and the traditional training scheme (MLPI): taking missing attributes as output and other attributes as input, multiple single-output prediction networks are established based on MLP. These models are trained on complete records during the training stage. In the imputation stage, the incomplete records are prefilled with the mean method, and missing values are imputed with the reconstructed model output.
(2) The imputation method based on the AANN model and the traditional training scheme (AANNI): the imputation process is the same as MLPI, but the architecture is AANN.
(3) The imputation method based on the ACFM model in which the missing value error is used to optimize model parameters (ACFMMEI): equation (3) is used as the cost function of ACFM. The incomplete records are prefilled with the mean method, and then all records are used to train the model. Finally, the reconstructed model output is used to impute missing values.
(4) The imputation method based on the ACFM model in which the missing value error is not used to optimize model parameters (ACFMI): equation (4) is used as the cost function of ACFM, and the process is the same as ACFMMEI.
(5) The imputation method based on the AANN model and the UMVDT training scheme (AANNUMVDT): the mean value is used to initialize the missing value variables. After that, the method uses all data to train the model, dynamically updates the missing value variables during model training, and reconstructs the model output to impute missing values.
(6) The imputation method based on the ACFM model and the UMVDT training scheme (ACFMUMVDT): the process is the same as AANNUMVDT, but the architecture is ACFM.
All models are optimized with the gradient descent method with momentum. The learning rate is set to 0.2, and the momentum is set to 0.9. All methods were repeated ten times at each missing rate, and the average of the ten imputation errors was taken as the experimental result. Imputation error is evaluated by the mean absolute percentage error (MAPE):
$$\mathrm{MAPE} = \frac{1}{N} \sum_{x_i \in X_{in}} \sum_{j \in M_i} \left| \frac{\hat{x}_{ij} - x_{ij}}{x_{ij}} \right|,$$
where $X_{in}$ represents the subset of incomplete records, $M_i$ is the set of indexes of the missing values in record $x_i$, $\hat{x}_{ij}$ and $x_{ij}$ are the imputed value and the original value, and $N$ represents the number of missing values.
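The evaluation metric can be sketched as a short function, assuming that MAPE is reported as a ratio rather than multiplied by 100 and that the original values at the deleted positions are nonzero. How zero-valued entries are handled in the experiments is not stated, so this sketch does not guard against them.

```python
import numpy as np

def mape(X_true, X_imputed, missing_mask):
    """Mean absolute percentage error over the deleted positions only."""
    true = X_true[missing_mask]
    imputed = X_imputed[missing_mask]
    return np.mean(np.abs((imputed - true) / true))

X_true = np.array([[2.0, 10.0], [4.0, 20.0]])
X_imputed = np.array([[1.0, 10.0], [5.0, 20.0]])
missing_mask = np.array([[True, False], [True, False]])

# Relative errors are |1 - 2| / 2 = 0.5 and |5 - 4| / 4 = 0.25.
error = mape(X_true, X_imputed, missing_mask)
```

Because each error is divided by the original value, attributes with values close to zero can dominate the metric.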
4.3. Experimental Results
The experimental results are shown in Tables 2 and 3.
4.4. Experimental Discussion
The impact of architecture on imputation results: by observing the MAPE values of ACFMI, AANNI, and MLPI in Tables 2 and 3, we can see that the results of ACFMI are slightly worse than those of MLPI in four cases, namely Ecoli at the missing rate of 15%, Glass at the missing rates of 5% and 15%, and Stock at the missing rate of 5%. In all other cases, the results of ACFMI are better than those of MLPI and AANNI. In addition, MLPI is superior to AANNI in forty-three of the sixty imputation results. This shows that MLP characterizes the regression relations within a dataset more accurately than AANN and thereby obtains higher imputation accuracy. Compared with AANN, ACFM increases the ability to fit regression relations by modifying the cost function. Compared with MLP, ACFM fits multiple regression relations through one architecture, which increases its generalization ability while improving the imputation accuracy.
The impact of missing value error on imputation results: it can be observed from Tables 2 and 3 that ACFMI performs slightly worse than ACFMMEI at the missing rates of 10% and 20% on the Glass dataset, at 5% on the Wine dataset, and at four missing rates on the Concrete dataset. In all other cases, ACFMI outperforms ACFMMEI. This shows that optimizing model parameters with the missing value error degrades the accuracy of modeling and thus leads to poorer imputation results.
Taking the Iris dataset as an example, the imputation results of ACFMMEI and ACFMI at missing rates of 5% to 30% are shown in Figure 6. As the missing rate increases, the gap between the imputation results of ACFMMEI and ACFMI becomes larger. If the missing value error continues to be used to optimize the model parameters as the number of missing values in the dataset grows, the deviation of the model also increases. Therefore, equation (4) is used as the cost function of ACFM in this paper; that is, only the errors of existing data are used to optimize the model parameters, a choice that these results support.
Comparison between UMVDT and the traditional training scheme: except for the results on the Glass and Concrete datasets at missing rates of 5% and 10%, the results of AANNUMVDT are better than those of AANNI. Among the 60 imputation results, ACFMUMVDT is better than ACFMI in 66.7% of cases. In the Concrete dataset, the data values vary greatly: there are many zero values and large values, and many samples share the same value in some attributes. UMVDT changes the missing values during the training process, which may make the imputation results unstable for the many samples with identical values. Overall, the results show that the UMVDT training scheme achieves higher imputation performance than the traditional one. The UMVDT training scheme makes full use of all existing data in incomplete records and treats missing values as variables so that they gradually match the fitting relationship. Because the missing value variables and model parameters are updated alternately, the imputation accuracy is improved.
For the Iris dataset at a missing rate of 15%, the imputation errors of ACFMI and ACFMUMVDT and the variation of the missing value variables (MVV) of ACFMUMVDT in each round are shown in Figure 7. The missing values stabilize after a short period of fluctuation, and the imputation results of ACFMUMVDT also stabilize as the number of iteration rounds increases. Because the missing values are updated iteratively, the MAPE values calculated from the missing value variables are more accurate than those of the original model, and the imputation accuracy is further improved by the model trained on the iteratively updated data.
The convergence of the proposed method: we take the Iris dataset as an example to verify the convergence of the proposed method. Figure 8 shows the fitting error of ACFMUMVDT at various missing rates. All fitting error curves decrease to different degrees at the beginning and then gradually stabilize. This is because the UMVDT training scheme constantly updates the missing value variables and thus changes the missing values in incomplete records. The missing value variables and model parameters converge during the alternate updating process. The curves in Figure 8 show that the imputation method proposed in this paper converges well.
5. Conclusions
To solve the problem of imputing missing values, this paper conducts attribute association modeling for incomplete data based on AANN. By modifying the cost function of AANN, this paper represents the regression relations between each attribute and the remaining attributes of incomplete data on one architecture and designs ACFM to better fit the associations between data attributes. Only the training errors of existing data are used to optimize the model, which keeps the inaccurate error between missing values and their predicted values from affecting the model parameters. In addition, to address the problem of incomplete model input, this paper proposes the UMVDT training scheme, which treats missing values as variables and updates the model parameters and missing value variables alternately. UMVDT gradually optimizes the missing value variables through the regression structure of the model and further reduces the negative impact that uncertain missing values in the model input have on the model. Experimental results show that the ACFM model obtains more accurate imputation results than the MLP and AANN models, and that UMVDT, by iteratively updating the missing value variables, further improves the imputation accuracy of the AANN and ACFM models compared with the traditional training scheme.
Data Availability
All datasets in this study can be downloaded from http://archive.ics.uci.edu/ml/datasets.php. All experimental results are included in this published article.
Conflicts of Interest
We declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the Natural Science Foundation of China (62076050, 62073056) and National Key R&D Program of China (2018YFB1700200).