Abstract

The accuracy of a knowledge extraction algorithm applied to a large database depends on the quality of the data preprocessing and on the methods used. The massive amounts of data collected every day are putting storage capacity at a premium. In practice, many databases contain attributes with outliers, redundant values, and, above all, missing values. Because missing data and outliers are ubiquitous in our databases, imputation techniques are needed to mitigate their influence. To address this problem, as well as the problem of data size, this paper proposes a data preprocessing approach that combines k-nearest neighbor (KNN) completion for the imputation of missing data with principal component analysis (PCA) for handling redundant data, thereby reducing the data size by generating a high-quality sample after imputation of missing and outlier data. A rigorous comparison is made between our approach and two others. The dissolved gas data from Rio Tinto Alcan's transformer T0001 were imputed by KNN with k equal to 5. For the 6 imputed gases, the average percentage error is about 2%, compared with 17.5% after mean imputation and 23.65% after multiple imputation. For data compression, 2 axes were retained based on the elbow rule and the Kaiser threshold.

1. Introduction

The large size of today's databases poses the problem of archiving and preprocessing raw data. The data may be missing, aberrant, or redundant. Applying data analysis algorithms to such data complicates the learning process and affects the performance and reliability of the model [1]. Data preprocessing is therefore crucial in the process of knowledge discovery from these voluminous databases, since it improves the quality of the data subsequently submitted to data mining algorithms. As far as outliers are concerned, it is best to keep them intact in the raw data when possible; in other words, the reason for removing an outlier should come from outside the dataset, and only when the original values are known [2]. Ignoring missing data can lead to a loss of precision and strong biases in the analysis models. Missing data are represented by a so-called missing-values matrix, whose form depends on the type of missingness. Three mechanisms are generally distinguished: MCAR, MAR, and MNAR. Data are missing completely at random (MCAR) if the probability of absence is the same for all observations; this probability depends only on external parameters independent of the variable. Data are missing at random (MAR) if the probability of absence is related to one or more other observed variables. Data are missing not at random (MNAR) if the probability of absence depends on the variable in question. These three categories are extensively detailed in [3], and a presentation and review of the different assumptions and techniques for processing such data are given in [4, 5]. In 2009, the authors of [4] developed a modern method called Multiple Imputation by Chained Equations (MICE), based on a Markov chain Monte Carlo algorithm, for imputing missing values. To eliminate missing values, noise, and redundant attributes, and also to reduce data size by generating representative, high-quality samples, the work in [6] presents a method based on the calculation of the empirical copula of the original sample; its performance was compared only with imputation by the mean. To classify the dissolved gas data discussed below [7], a preprocessing based on the k-nearest neighbors (KNN) was performed in 2014, showing that classification without prior preprocessing is less accurate. Grzymala-Busse processed the unknown values of attributes by setting up new decision tables with known attributes instead of the original table, which contained unknown and inconsistent attributes [8]; rough set theory is used there to learn from inconsistent rules.

This paper proposes a preprocessing method that combines the k-nearest neighbors and principal component analysis (PCA). The k-nearest neighbor classifier is widely used in supervised classification because of its simplicity and robustness. To make it even more robust, i.e., less sensitive to data variations, references [9, 10] respectively developed k-nearest neighbor classifiers based on generalized mean distances. The main objective is to overcome the sensitivity to the size of the k-neighborhood and improve the performance of KNN-based classification. In the first approach, a nested generalized mean distance, computed from the multiple generalized mean distances in each class, is designed for KNN-based classification decisions, and the method is shown to be suitable for pattern detection. In the second, a representation method based on the k-nearest neighbor centroid coefficient is developed, which also aims to further improve classification performance and reduce the sensitivity of the method to the size of the k-neighborhood, especially for small sample sizes. In the present work, the k-nearest neighbor classifier is used in the framework of data completion, as several recent works have done for data imputation [11, 12].

The k-nearest neighbor method used for the imputation of missing data is based on optimizing the choice of the parameter k. With k equal to 5 and a Euclidean distance between two observations, 6 missing values are imputed. To better observe the variables H2 (hydrogen) and CH4 (methane), which are low-molecular-weight gases, a weighting of N2 (nitrogen), CO2 (carbon dioxide), and CO (carbon monoxide) is applied.

For a PCA based on a weighted strategy, robustness is improved by mitigating the statistical impact of outliers through reduced weights [13]. PCA also addresses the normalization problem and the growing difficulty of archiving, once the number of axes to retain has been judiciously chosen using the elbow rule and the Kaiser threshold. From the works [7, 14], it appears that PCA combined with an artificial neural network (ANN) classifier is more accurate than a support vector machine (SVM) combined with the KNN classifier. It is therefore judicious to couple a classifier (KNN) with an exploratory data analysis technique (PCA). After a review of the KNN and PCA algorithms in Section 2, the proposed preprocessing approach is presented in Section 3. Finally, in Section 4, the results of our method applied to a dissolved gas database are compared with those of other methods, followed by a discussion.

2. Materials and Methods

The approach proposed here first presents the completion-based imputation techniques to which the data will be submitted, and then considers the treatment of outliers.

2.1. Imputation Methods

The most common imputation techniques are presented here in a nonexhaustive way, namely stationary completion, completion by a linear combination, the k-nearest neighbors, NIPALS, and multiple imputation. A dataset consists of $p$ quantitative or qualitative variables $(Y_1, \ldots, Y_p)$ observed on a sample of $n$ individuals; $M = (m_{ij})$ denotes the indicator matrix of missing values, with $m_{ij} = 1$ if the value $y_{ij}$ is missing and $m_{ij} = 0$ otherwise.

2.1.1. Stationary and Linear Combination Completion

There are several possible stationary imputations. Among the most common are fitting by the most common attribute value (CMCF) [15] and simply carrying forward the last known observation (LOCF), the latter being given as follows:

$$\hat{y}_{ij} = y_{i'j}, \qquad i' = \max\{\,l < i \,:\, m_{lj} = 0\,\},$$

where $\hat{y}_{ij}$ represents the missing data.

This method may seem too naive, but it is often used to lay the foundation for a comparison between imputation methods. Another common technique is to replace all missing values with a linear combination of the observed values. The case of imputation by the mean is given as follows:

$$\hat{y}_{ij} = \bar{y}_j = \frac{1}{n_j}\sum_{\{i \,:\, m_{ij} = 0\}} y_{ij},$$

where $n_j$ is the number of observed values of the variable $Y_j$.

Or by the median, as follows:

$$\hat{y}_{ij} = \operatorname{median}\{\, y_{ij} \,:\, m_{ij} = 0 \,\}.$$

But this case is generalized to any weighted linear combination of observations. Instead of using all the available values, it is possible to restrict oneself to methods that select the most influential values by local aggregation or regression or even by combining different aspects.
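As an illustration of these simple completions, the following Python sketch (using pandas, with invented values and hypothetical column names, not the actual transformer data schema) applies the LOCF, mean, and median rules to a small table.

```python
import pandas as pd

# Hypothetical dissolved-gas table with missing values (NaN);
# the column names and values are illustrative only.
df = pd.DataFrame({
    "H2":  [75.0, None, 82.0, 90.0],
    "CH4": [12.0, 14.0, None, 15.0],
})

# LOCF: carry the last known observation forward along the time axis.
df_locf = df.ffill()

# Mean imputation: replace each missing value by the column mean.
df_mean = df.fillna(df.mean())

# Median imputation: replace each missing value by the column median.
df_median = df.fillna(df.median())
```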

2.1.2. Completion by the Nearest Neighbor Method

Imputation by the k-nearest neighbors (KNN) consists of running the following algorithm, which models and predicts the missing data: first, choose the parameter $k$ ($1 \le k \le n$); then compute the distances between the incomplete observation $y_i$ and the other observations, and retain the $k$ observations $y_{(1)}, \ldots, y_{(k)}$ for which these distances are smallest; finally, assign to each missing value the arithmetic mean of the values of the $k$ neighbors:

$$\hat{y}_{ij} = \frac{1}{k}\sum_{l=1}^{k} y_{(l)j},$$

where $y_i$ is an observation with $q$ missing values.
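As an illustration, scikit-learn's KNNImputer implements this scheme (mean of the k nearest neighbors under a Euclidean-type distance that ignores missing coordinates); the sketch below uses k = 5 as in the paper, with invented values rather than the actual measurements.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical dissolved-gas matrix (rows = dates, columns = gases);
# np.nan marks the simulated missing values.
X = np.array([
    [75.0,   12.0, 310.0],
    [np.nan, 14.0, 305.0],
    [82.0, np.nan, 298.0],
    [90.0,   15.0, np.nan],
    [78.0,   13.0, 301.0],
    [81.0,   12.5, 300.0],
])

# k = 5 neighbours and a Euclidean-type distance on observed coordinates;
# each missing entry is replaced by the mean of that attribute over the
# k nearest rows for which it is observed.
imputer = KNNImputer(n_neighbors=5, metric="nan_euclidean")
X_imputed = imputer.fit_transform(X)
```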

2.1.3. NIPALS Completion Algorithm

The NIPALS (Nonlinear Iterative Partial Least Squares) algorithm is an iterative method for estimating PLS (Partial Least Squares) regression, and it can be adapted to impute missing data. Let $Y = (Y_1, \ldots, Y_p)$ be such that each column of the matrix is centered, i.e., $E(Y_i) = 0$. The expansion of $Y$ in terms of principal components and principal vectors is given by the following equation:

$$Y = \sum_{h=1}^{p} \xi_h\, p_h^{\top},$$

where the $\xi_h$ are the principal components and the $p_h$ the principal vectors of the PCA of $Y$. For each variable $Y_i$, we have

$$Y_i = \sum_{h=1}^{p} p_{hi}\, \xi_h,$$

where $p_{hi}$ represents the slope of the linear regression of $Y_i$ on the component $\xi_h$. The development of this algorithm is detailed in [16].
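The following Python sketch illustrates the general idea of a NIPALS-style decomposition computed on the observed entries only, followed by a low-rank reconstruction of the missing cells. It is a simplified illustration under the assumption that Y is already centered, not the exact algorithm of [16].

```python
import numpy as np

def nipals_impute(Y, n_components=2, n_iter=200, tol=1e-8):
    """Estimate components/vectors of a centered matrix Y containing NaNs,
    using only observed entries, then fill the missing cells (sketch)."""
    Y = np.asarray(Y, dtype=float)
    mask = ~np.isnan(Y)                       # observed entries
    R = np.where(mask, Y, 0.0)                # residuals, zero where missing
    n, p = Y.shape
    T = np.zeros((n, n_components))           # components xi_h
    P = np.zeros((p, n_components))           # principal vectors p_h
    for h in range(n_components):
        t = R[:, np.argmax((R ** 2).sum(axis=0))].copy()  # highest-energy column
        for _ in range(n_iter):
            # slopes p_h: regress each column on t, observed rows only
            ph = (R * t[:, None]).sum(axis=0) / np.maximum(
                (mask * t[:, None] ** 2).sum(axis=0), 1e-12)
            ph /= max(np.linalg.norm(ph), 1e-12)
            # scores xi_h: regress each row on p_h, observed columns only
            t_new = (R * ph[None, :]).sum(axis=1) / np.maximum(
                (mask * ph[None, :] ** 2).sum(axis=1), 1e-12)
            if np.linalg.norm(t_new - t) < tol * max(np.linalg.norm(t_new), 1e-12):
                t = t_new
                break
            t = t_new
        T[:, h], P[:, h] = t, ph
        R = np.where(mask, R - np.outer(t, ph), 0.0)  # deflate observed cells
    Y_hat = T @ P.T                            # low-rank reconstruction
    return np.where(mask, Y, Y_hat)            # fill only the missing entries
```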

2.1.4. Multiple Imputation

Multiple imputation retains the virtues of single imputation and corrects its main shortcomings. As its name suggests, it consists of imputing the missing values several times and combining the results to reduce the error due to imputation [17]. The multiple imputation procedure consists of two phases that use two different models: the imputation phase (imputation model) and the statistical analysis phase (analysis model). Once the imputations have been performed, the statistician can perform any type of analysis, according to the standard procedures for the analysis of complete datasets [4]. In 2011, the authors of [18] developed a multiple imputation program called Amelia II. The model is based on a normality assumption:

$Y \sim N_k(\mu, \Sigma)$, i.e., $Y$ follows a multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. For multiple imputation, we are concerned with the parameters of the complete data, $\theta = (\mu, \Sigma)$. Under the assumption that the data are MAR, we write $Y_{\mathrm{obs}}$ for the observed data and $Y_{\mathrm{mis}}$ for the missing data, so that $Y = \{Y_{\mathrm{obs}}, Y_{\mathrm{mis}}\}$. The missing-data mechanism is characterized by the conditional distribution of $M$ given $Y$, $p(M \mid Y)$.

The likelihood is then written as

$$p(Y_{\mathrm{obs}}, M \mid \theta) = \int p(M \mid Y)\, p(Y \mid \theta)\, dY_{\mathrm{mis}}.$$

Since we are only interested in inferring the parameters of the complete data, and since under the MAR assumption $p(M \mid Y) = p(M \mid Y_{\mathrm{obs}})$, the likelihood can be written in the following form:

$$L(\theta \mid Y_{\mathrm{obs}}) \propto p(Y_{\mathrm{obs}} \mid \theta) = \int p(Y \mid \theta)\, dY_{\mathrm{mis}}.$$

Now, using the iterated expectation property, the a posteriori law obtained is as follows:

$$p(\theta \mid Y_{\mathrm{obs}}) \propto p(\theta)\, p(Y_{\mathrm{obs}} \mid \theta) = p(\theta) \int p(Y \mid \theta)\, dY_{\mathrm{mis}}.$$
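Amelia II is an R program; as a rough Python analogue (a sketch in the spirit of multiple imputation, not the authors' procedure or the Amelia algorithm itself), scikit-learn's IterativeImputer can generate several stochastic completions that are then combined:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical incomplete matrix; rows = observations, columns = gases.
X = np.array([
    [75.0,   12.0, 310.0],
    [np.nan, 14.0, 305.0],
    [82.0, np.nan, 298.0],
    [90.0,   15.0, np.nan],
])

# Draw m = 5 imputed datasets by sampling from the posterior predictive
# distribution with different seeds, then combine them (here by averaging).
m = 5
completed = [
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
    for s in range(m)
]
X_combined = np.mean(completed, axis=0)
```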

2.2. Outlier Management

Outliers can arise in two ways: either they result from errors, or there is a story behind them. In principle, outliers should be very rare; otherwise, the experiment or survey that generated the dataset is inherently flawed. Defining an outlier is tricky, because outliers may be legitimate members of the long tail of the population. For example, if a team working on predicting financial crises finds that a crisis occurs in one out of every 1000 simulations, that result is not an outlier to be discarded. The case for removing outliers only arises when the origin of the anomalous values is known. For example, if heart-rate data are strangely fast and the medical equipment is known to be faulty, the bad data can be removed. Rejecting mysterious outliers is risky for downstream tasks; some regression tasks, for instance, are sensitive to extreme values. More experiments are needed to decide whether the outliers exist for a reason, and in such cases outliers should not be removed or corrected during the data preprocessing steps [2].

3. A Proposed Approach to Data Preprocessing

The algorithm in Figure 1 is a maintenance-data preprocessing technique that combines the management of missing and redundant data with data size reduction (compression). The interest of such an approach lies in combining their advantages in terms of speed, robustness, and archiving.

The first operation consists of obtaining a table of maintenance data for the power transformers. For this purpose, we have a database from Rio Tinto Alcan of Canada, from which we exploit the dissolved gas (DG) data of equipment T0001. Because this database is complete, we simulate missing data and submit the resulting incomplete table to our algorithm. For the imputation of missing data, we chose the k-nearest neighbor algorithm (KNN). This imputation technique requires choosing the parameter k by optimizing a criterion. The missing values are imputed while taking into account the class of data to which they belong. Moreover, the notion of distance between observations must be chosen with care: we rely essentially on Euclidean, Mahalanobis, or Minkowski distance metrics to evaluate the similarity between classes. Metric learning is a recent field of machine learning; the work presented in 2002 in article [19] is considered pioneering. The goal of metric learning is to answer a recurrent need for comparison functions. Indeed, in many learning algorithms the notion of metric plays a fundamental role; this is, for example, the case in the k-nearest-neighbor algorithm [20], which assigns to a point of unknown class the majority class of its k closest points, where proximity is defined by a metric. The imputed value is thus the mean of the k nearest values belonging to the same class.
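A compact sketch of this preprocessing chain, using scikit-learn, is given below; the class-aware treatment and metric learning discussed above are not reproduced, and the data are random placeholders rather than the Rio Tinto Alcan measurements. The choice of 2 retained components anticipates the axis-selection result of Section 4.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA

# Sketch of the proposed chain: KNN completion (k = 5, Euclidean-type
# distance), centering-reduction, then PCA compression to 2 axes.
preprocess = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler()),
    ("compress", PCA(n_components=2)),
])

X = np.random.default_rng(0).normal(size=(31, 9))  # 31 observations, 9 variables
X[3, 2] = np.nan                                   # one simulated missing value
X_reduced = preprocess.fit_transform(X)            # shape (31, 2)
```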

3.1. Distances of Similarity or Dissimilarity

Euclidean distance. The Euclidean distance is probably the most widely used. For two vectors x and x′, it is written as follows:

$$d(x, x') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2} = \lVert x - x' \rVert_2.$$

Minkowski distance. It is defined, for any p ≥ 1 and for two vectors x and x′, by the following equation:

$$d_p(x, x') = \left(\sum_{i=1}^{d} \lvert x_i - x'_i \rvert^{p}\right)^{1/p}.$$

Thus, for p = 1, we get the Manhattan distance, for p = +∞, we have the Chebyshev distance, and for p = 2, the Euclidean distance is found.

Mahalanobis distance. It is defined as follows:

$$d_M(x, x') = \sqrt{(x - x')^{\top} M\, (x - x')},$$

where M is a symmetric positive semidefinite matrix.

This distance is parameterized. Indeed, depending on the matrix M chosen, the result obtained changes. Thus, if we put M = I, where I is the identity matrix, we find the Euclidean distance.
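These distances can be computed directly with SciPy, as in the following short sketch (the vectors are chosen arbitrarily for illustration):

```python
import numpy as np
from scipy.spatial.distance import euclidean, minkowski, chebyshev, mahalanobis

x  = np.array([1.0, 2.0, 3.0])
xp = np.array([2.0, 0.0, 4.0])

d_euc  = euclidean(x, xp)        # Minkowski with p = 2
d_man  = minkowski(x, xp, p=1)   # Manhattan distance
d_cheb = chebyshev(x, xp)        # limit p -> infinity
M      = np.eye(3)               # with M = I, Mahalanobis reduces to Euclidean
d_mah  = mahalanobis(x, xp, M)
```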

3.2. Compression Based on PCA

Principal component analysis, beyond being a descriptive data analysis technique, is an extremely powerful tool for compressing and synthesizing information, very useful in the presence of a large quantity of quantitative data to be processed and interpreted. PCA synthesizes the observed variables; in other words, it attempts to summarize the information contained in the data table into a reduced set of linear combinations of the initial variables, taking care to minimize the loss of information due to this reduction [21, 22]. These new synthetic variables, called "principal components, factors, or macrocharacteristics," have the following properties:

The principal components, denoted F1, F2, …, Fp, are linear combinations of the initial variables.

They are uncorrelated (the linear correlation coefficients of the components taken two by two are zero), which avoids redundancy in the already summarized information. The first component carries or summarizes more information than the second, which carries more than the third, and so on, so that by restricting ourselves to the first 2 or 3 components, we obtain a good summary of the information contained in the data [23]. The mathematical tools used are those of linear algebra and matrix calculus: the correlation matrix is diagonalized, and its eigenvectors define the new variables sought, the principal components. It can be shown that the principal components thus defined indeed verify the desired properties: they are uncorrelated, of decreasing variance, and linear combinations of the starting variables. This last property makes it possible to construct graphs representing the individuals as well as the variables in the space defined by the components [23]. In the article [14], PCA is used to improve data preprocessing.
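The following Python sketch reproduces this mechanism on an arbitrary complete table: the correlation matrix of the centered-reduced data is diagonalized, the eigenvalues give the inertia carried by each axis, and the data are projected onto the leading components.

```python
import numpy as np

# PCA via diagonalization of the correlation matrix, as described above;
# X stands for any complete numeric table (e.g., the imputed gas data).
X = np.random.default_rng(1).normal(size=(31, 9))
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # centered-reduced data

R = np.corrcoef(Z, rowvar=False)              # correlation matrix
eigval, eigvec = np.linalg.eigh(R)            # diagonalization
order = np.argsort(eigval)[::-1]              # sort by decreasing variance
eigval, eigvec = eigval[order], eigvec[:, order]

components = Z @ eigvec                       # principal components F1, F2, ...
explained = eigval / eigval.sum()             # share of total inertia per axis
print(np.cumsum(explained)[:2])               # inertia kept with 2 axes
```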

4. Results and Discussion

This part is presented in two steps: the results and discussions after imputation and after compression.

4.1. Imputation of Data

The data table submitted to the test contains the dissolved gas measurements taken from transformer T0001 and is presented in Table 1. Missing values are then simulated (Table 2), and the resulting table is submitted to the data preprocessing algorithms. Table 3 presents the statistical data before any imputation.

The data table containing missing values (Table 2) is submitted in turn to several data completion methods. Table 4 shows the results of the different completion techniques.

Table 4 presents the statistical data after imputation by KNN (with k = 5 neighbors), by the mean, and by multiple imputation. The results in Table 5 show that KNN imputation recovers values close to the exact ones in the majority of cases.

The decomposition of mineral oil at low temperatures produces relatively large quantities of low-molecular-weight gases such as hydrogen (H2) and methane (CH4). To better observe the evolution of these two quantities after the different imputations presented in Table 5, a weighting of nitrogen (N2), carbon dioxide (CO2), and carbon monoxide (CO) is applied.

Figure 2 shows the evolution of the exact values of the variables before any simulation of missing data (in blue) and the variations of these data after imputation of the simulated gaps. The figure shows that the evolution of the data imputed by KNN (in orange) is the closest to that of the exact values. KNN imputation is robust because its standard deviation and mean are less sensitive to data variations. For example, before imputation, H2 and N2 have means of 80.034 and 71206.679 and standard deviations of 19.320 and 10819.652, respectively; after KNN imputation, the means are 80.323 and 71286.032 and the standard deviations are 18.734 and 10399.787. The completed data table after KNN imputation is presented in Table 6.

The absolute percentage error after the different imputations is given by

$$\varepsilon = \left\lvert \frac{y_i - y}{y} \right\rvert \times 100,$$

where $y_i$ represents the imputed value and $y$ the exact value.
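A minimal helper for this error measure might look as follows (a sketch; the per-gas averaging used to produce Figure 3 is not reproduced):

```python
import numpy as np

def pct_error(imputed, exact):
    """Absolute percentage error between imputed and exact values."""
    imputed = np.asarray(imputed, dtype=float)
    exact = np.asarray(exact, dtype=float)
    return np.abs(imputed - exact) / np.abs(exact) * 100.0

# Example: pct_error([81.2], [80.0]) -> array([1.5])
```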

Figure 3 shows that the error committed when imputing the 6 gases by the k-nearest neighbors (KNN) corresponds to an average percentage error of 2%, versus 17.5% for mean imputation and 23.65% for multiple imputation.

4.2. Data Compression Based on PCA

One of the applications of principal component analysis is compression. PCA consists of synthesizing the number of observed variables, i.e., summarizing the information contained in the data table, into a reduced set of linear combinations of the initial variables, taking care to minimize the loss of information. The table of data imputed by the k-nearest neighbors is subjected to compression by PCA, and the results are obtained from the Anaconda (Python) and XLStat software.

Table 7 and Figure 4 show that the first eigenvalue λ1 equals 5.054 and represents 56.153% of the variability (inertia); the second axis adds a further 15.146%, so that representing the data on two axes preserves 71.299% of the total variability. Each eigenvalue corresponds to a factor, and each factor is a linear combination of the starting variables. In principal component analysis, the problem is to determine the dimension of the optimal representation space: the aim is to preserve all the stable and important characteristics of the data studied while ignoring the unstable and meaningless axes [24].

4.2.1. Number of Axes to Retain

In practice, the only criteria applicable to the choice of the number of axes are empirical. The best known is Kaiser's: for reduced centered data, we retain the principal components corresponding to eigenvalues greater than 1, which means that we keep only those components that «contribute» more than the initial variables [25]. We also use the broken-stick test and the elbow (kink) rule, which consists of detecting a bend in the eigenvalue diagram. Figure 5 shows the elbow thus formed, and Table 8 shows the results of the different tests.

For the Kaiser threshold: a principal component is retained if its eigenvalue exceeds the threshold, which here equals 1.188.

For the broken-stick test: the k-th component is validated if its share of the total inertia exceeds the broken-stick value $b_k = \frac{1}{p}\sum_{j=k}^{p}\frac{1}{j}$.

The calculation of the threshold of the eigenvalues is given by the expression (17) where λ is the eigenvalue, p is the number of variables, and n is the number of observations.

The threshold (Kaiser-Saporta) and elbow (Cattell) rules limit the number of axes to two, while the broken-stick test (Frontier, 1976) limits the number of axes to be retained to one. All of these approaches agree that one or two axes are sufficient in this study. For the sake of future interpretation and rotation of the axes, we opted to retain two axes, as recommended by the elbow and Kaiser methods.
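The sketch below illustrates the Kaiser and broken-stick counts on a vector of eigenvalues; the elbow is read visually from the scree plot (Figure 5), and the paper's exact threshold formula, expression (17), is not reproduced (the classical eigenvalue > 1 rule is used instead).

```python
import numpy as np

def axis_selection(eigval):
    """Count the axes retained by the Kaiser rule and by the broken-stick
    test for a correlation-matrix PCA (illustrative sketch)."""
    eigval = np.sort(np.asarray(eigval, dtype=float))[::-1]
    p = len(eigval)

    kaiser = int(np.sum(eigval > 1.0))        # Kaiser: eigenvalue > 1

    # Broken-stick model: expected share of inertia of the k-th axis.
    stick = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p
                      for k in range(1, p + 1)])
    share = eigval / eigval.sum()
    broken = 0
    for s, b in zip(share, stick):            # count leading axes above the stick
        if s <= b:
            break
        broken += 1

    return kaiser, broken
```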

4.2.2. Observations

The study of the observations consists of examining their coordinates and, especially, their graphical representations. Figures 6 and 7 show the evolution of the two retained components according to the type of preprocessing applied.

The reference variable here is SI (without imputation), which represents the component retained at the end of the principal component analysis without having first carried out an imputation.

To take just one example, on 15/05/2001, the components F1 without imputation (F1 SI), with imputation by the k-nearest neighbors + principal component analysis (F1 Iknnacp), with imputation by the mean + principal component analysis (F1 Imoyacp), and with multiple imputation + principal component analysis (F1 Imacp) take the values 4.142, 4.213, 4.167, and 3.917, respectively. These results show that, for component 1, the preprocessed data remain faithful to the reference to within small deviations.

For the same example, on 15/05/2001, the F2 components without imputation (F2 SI), with KNN imputation + principal component analysis (F2 Iknnacp), with mean imputation + principal component analysis (F2 Imoyacp), and with multiple imputation + principal component analysis (F2 IMacp) take the values −1.723, −1.822, −1.257, and 1.406, respectively. These results show that, for component 2, the data preprocessed with KNN + PCA imputation are the most faithful to the reference data (F2 SI), to within small deviations. The combination of these two elements (KNN + PCA) produces the best accuracy for the T0001 transformer DGA dataset across different sizes and different percentages of missing values.

5. Conclusion

In this paper, an approach for preprocessing power transformer maintenance data is proposed. This approach uses KNN completion with the Euclidean metric to impute quantitative data, together with a PCA whose role here is the management of redundant values and, above all, the compression of a large amount of data. This preprocessing approach (KNN completion followed by PCA) was rigorously compared with two other approaches, mean imputation + PCA and multiple imputation + PCA. For 6 missing values and k = 5, the error committed by the k-nearest neighbor imputation is around 2%, while multiple imputation and mean imputation yield errors of 23.65% and 17.5%, respectively. Similarly, to observe the low-molecular-weight gases produced at low temperatures, such as hydrogen (H2) and methane (CH4), a weighting of nitrogen (N2), carbon dioxide (CO2), and carbon monoxide (CO) is applied (Figure 2). The KNN + PCA preprocessing is robust because the standard deviation and mean obtained after KNN completion are less sensitive to data variations and yield results close to reality; at the same time, the amount of data is considerably reduced while preserving the original information as much as possible. For 31 observations and 9 variables, the Kaiser threshold is 1.188, which allowed us, in this case, to retain 2 components based on the elbow rule and the eigenvalue threshold. Experiments conducted with the proposed combination show significant performance, especially when the number of variables and the percentage of missing values in the dataset are high.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors thank American Journal Experts for their technical support in formatting this work. They also thank Rio Tinto Alcan (Canada) for providing the data through Prof. Issouf Fofana.