Abstract

Analysis of the recorded operation data of a nuclear power plant (NPP) can support fault detection and operation experience feedback. However, the recorded operation data contain missing values, which lower the data quality and may affect the accuracy of the analysis results. To improve the data quality, two problems are studied. First, to locate the missing data accurately, a detection algorithm for missing data of NPP operation parameters based on wavelet analysis is designed, and different criteria are proposed for discrete and continuous missing data. Second, a filling method based on the hot deck algorithm is studied. Because the dynamic properties of the parameters are closely related to the operating state of the NPP, the similarity between operation parameter vectors is used to express the similarity between operating states, which meets the requirements of the hot deck algorithm. To improve the accuracy of the similarity measurement, the differences between analog parameters and switch parameters are taken into account: the Mahalanobis distance is used for the analog parameter vectors, and the matching measure is used for the switch parameter vectors. Finally, operation data are used to build an experimental data set for algorithm verification. The results show that the designed algorithm achieves lower error rates than the mean interpolation method and the LSTM-based method.

1. Introduction

With the development and application of digital instrumentation and control (I&C) systems in nuclear power plants (NPPs), the capacity for data storage and analysis has improved. Many big data analysis algorithms, such as machine learning and deep learning, can be used to analyze the operation data [1]. Most of the collection, transmission and storage of operation data in an NPP is carried out by automatic instruments. Owing to environmental interference, the inherent characteristics of the instruments and other causes, NPP operation data may go missing at random, which directly affects the analysis of the data. Therefore, research on methods to improve data quality, known as data cleaning, has received widespread attention [2]. Data cleaning aims to identify and correct the noise in the data and to minimize its impact on the analysis results. Noise in data mainly includes missing data, redundant data, conflicting data and wrong data, which are collectively called dirty data [3, 4]. Before big data analysis algorithms are applied in an NPP, it is therefore necessary to study algorithms for detecting and filling missing data in the operation parameters. Currently, the widely studied data cleaning methods include methods based on removal, direct manipulation, models, and imputation [5].

Removal-based methods reduce the number of records available for training or adjusting the prediction model when mining NPP data. As a result, the ability to find consistent patterns is weakened [6]. Methods based on direct manipulation are commonly nonparametric. When applied to NPP operating data, they lower the correlation between attributes, which negatively influences the performance of the algorithms [7]. Model-based methods use statistical, probabilistic and learning techniques to obtain a model. Their iterative algorithms can take a long time to converge as the amount of data grows [8], so processing the large volume of NPP operating data with them would be inefficient. Compared with the above methods, imputation-based methods have lower computational complexity, higher computational efficiency and higher accuracy, which makes them well suited to the large volume and high correlation of NPP operating data.

Scholars have conducted a large number of studies on imputation-based filling algorithms for missing data. Ragel and Crémilleux proposed an imputation method using the RAR (Robust Association Rules) algorithm in [9], and its application to a real database is demonstrated in [10]. In [11], a method based on Formal Concept Analysis is proposed. It uses an implication basis, which represents the dependencies between attributes, to impute values for missing data. In [12], a composite imputation is proposed that applies other tasks, including data clustering and attribute selection, before the imputation of missing data to improve the quality of the imputed data. In [13], the Radial Basis Function Network classifier is applied to improve the data quality of data sets containing missing data. Wu et al. presented a practical use of association rules to complete missing data in [14]. In [15], in order to extract the characteristics of motor signals affected by noise, a new advancing coupled multi-stable stochastic resonance method, namely CMSR, is proposed to deal with dirty data.

However, most studies on imputation methods tend to be theoretical. L. O. Silva points out in [5] that few studies in the data mining literature take missing data into consideration. For data mining in NPPs, the problem of missing data also remains to be solved, and only a few studies have been done. The most representative one is [16], in which a missing data imputation algorithm based on the least squares support vector machine (LSSVM) is proposed to reconstruct the missing data in the environmental radiation monitoring sensor network of an NPP. This work has high reference value and lays a good foundation for subsequent research on missing data in NPPs. The K-means clustering algorithm based on a noise algorithm in [17] provides a new way of thinking and a basis for comparison, and the mode of data analysis in [18] is a good reference for time series data analysis in NPPs.

Therefore, to deal with missing operation data in NPPs, a detection algorithm based on wavelet analysis and a filling method based on the hot deck algorithm and the Mahalanobis distance are studied, with reference to research achievements in other fields. In Section 2, the characteristics of NPP operation data are analyzed, and the filling strategy based on the hot deck algorithm is put forward. In Section 3, the missing data detection algorithm based on wavelet decomposition is designed, which provides the basis for identifying missing data. In Section 4, the construction method for the NPP operation state vector and the corresponding similarity measurements based on the Mahalanobis distance and the matching measure are designed. The detailed algorithm for filling missing data of an NPP is designed in Section 5. Taking the operation data of an NPP as a sample, the effect of the designed algorithm is verified in Section 6, and the results are analyzed in Section 7.

2. The Characteristics of NPP Operation Data and the Data Filling Strategy

2.1. Characteristics Analysis of NPP Operation Parameters

The operation parameters of an NPP are commonly divided into two types: analog parameters and switch parameters. These parameters are coupled by complex physical relationships, so they are not independent of each other. The operation state of the NPP at a given time can be represented by the two types of parameters. In other words, a particular operation state of an NPP corresponds to a particular set of operation parameter values. The operation parameters representing the operation state are assembled into a vector, called the operation state vector of the NPP, or state vector for short. Owing to various random errors in data collection and transmission, the parameter values recorded for the same operation state may not be exactly the same, but their similarity is still extremely high. Therefore, the missing data of an NPP can be filled by searching for a similar system operation state.

Data missing in an NPP refers to the abnormal return to zero of operation data collected by the instrumentation and control (I&C) system, caused by interference from the external environment or by a random fault of the data acquisition and recording device. Data missing takes two forms.
(1) Discrete missing: the abnormal return to zero of isolated points in the time history of a parameter.
(2) Continuous missing: the abnormal return to zero of a number of consecutive points in the time history of a parameter.

The two kinds of data missing differ in their mathematical characteristics. If there is a discrete missing point in a parameter data record, the missing point appears as a discontinuity of the first kind in the time function of the parameter. If there is a continuous missing segment in a parameter data record, the first and last points of the missing segment appear as discontinuities of the first kind in the time function of the parameter.

2.2. The Missing Data Filling Algorithm

The requirements of the filling algorithm for missing data are further clarified as follows:
(1) The algorithm should avoid methods that require the independence of the parameter data to be verified.
(2) The algorithm should make full use of the correspondence between the operation data of the NPP and its operation state.

According to the requirements above, the candidate algorithms include the hot deck (HD) algorithm, the KNN algorithm, and the regression replacement algorithm [19].

For an event with missing data, the HD algorithm tries to find another event with complete data, which is the most similar to it. And then, the missing data of the previous event is filled with the corresponding data of the found event. Different events may use different similarity measurements [20].

KNN is the most typical representative of clustering methods. First, the k samples nearest to the sample with missing data are determined according to a distance (commonly the Euclidean distance). Then the weighted average of the corresponding values of these k samples is used to estimate the missing data of this sample.
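
As a rough illustration of the KNN filling idea described above, the following sketch estimates one missing value as a weighted average over the k nearest complete samples. The function name `knn_impute`, the inverse-distance weighting and the small `eps` term are assumptions made for this example; the text does not specify the weighting scheme.

```python
import math


def knn_impute(target, donors, missing_idx, k=3, eps=1e-9):
    """Estimate target[missing_idx] from the k nearest complete donor samples.

    Distances are Euclidean and use only the observed components of `target`.
    Weights are inverse distances (an assumption of this sketch).
    """
    observed = [i for i in range(len(target)) if i != missing_idx]

    def dist(donor):
        return math.sqrt(sum((target[i] - donor[i]) ** 2 for i in observed))

    nearest = sorted(donors, key=dist)[:k]
    weights = [1.0 / (dist(d) + eps) for d in nearest]
    weighted_sum = sum(w * d[missing_idx] for w, d in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Hypothetical samples: the third component of `target` is missing.
target = [1.0, 2.0, None]
donors = [[1.0, 2.0, 10.0], [1.0, 2.0, 20.0], [9.0, 9.0, 99.0]]
estimate = knn_impute(target, donors, missing_idx=2, k=2)
```

With k = 1 the estimate reduces to copying the value of the single most similar sample, which is the hot deck behaviour discussed in this section.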

The regression filling algorithm first selects a number of independent parameters and then establishes a regression equation with them to estimate the missing value. That is, the missing value is replaced by its conditional expected value [21]. The artificial neural network method is its most representative example [22].

KNN is a special case of the hot deck algorithm. The weighted average of k groups of data blunts the influence of the non-measured values associated with the parameters, which degrades the filling effect. The filling effect of the regression filling algorithm depends on the accuracy of the regression equation. Because the operation data of an NPP are complex and changeable, establishing the regression equation places high demands on data quantity and computing power, which makes it unsuitable for the current application. Magnani points out in [23] some advantages of the hot deck: reduction of the standard error without imposing a rigid model, production of a data set without missing data, and preservation of the population distribution. Moreover, this method allows distinct imputation techniques to be used for each group generated.

The HD algorithm is simple in concept and uses the relationship between the operation parameters and the operation state of the NPP to estimate the missing data. Therefore, the HD algorithm is selected to deal with the missing data of NPPs.

The HD algorithm for the missing data of the operation parameters of an NPP mainly consists of the following steps:
(1) Detect the missing points in the operation data.
(2) Generate the operation state vector.
(3) Calculate the operation state similarity.
(4) Select the time point with complete data that is most similar to the time point with missing data.
(5) Fill the missing data with the value of the corresponding parameter at the time point with complete data.
(6) Check the switch parameters affected by the missing data and correct the wrong ones.

The brief flow of the filling algorithm is shown in Figure 1.

3. Design of Missing Data Detection Algorithm

Data missing in an NPP usually manifests as a sudden change of the value to 0, but cases in which the data naturally go to 0 must be ruled out. That is, a data value of 0 is a necessary but not sufficient condition for a point to be a missing point. Therefore, an algorithm is needed to decide whether a zero value is a missing point.

The analog parameters of an NPP are continuous and differentiable in the time domain. According to the analysis of the data characteristics, deciding whether a zero is a missing point requires deciding whether it has the characteristics of a discontinuity point. Here, the wavelet decomposition algorithm [24] is adopted to detect the discontinuity points.

Assume that the scale function and the wavelet function of a certain wavelet are $\phi$ and $\psi$. For any non-negative integer $j$, $V_j$ is defined as the space of step functions of order $j$, spanned over the real number field by the set of functions in formula (1):

$$\{\ldots,\ \phi(2^j t + 1),\ \phi(2^j t),\ \phi(2^j t - 1),\ \phi(2^j t - 2),\ \ldots\} \tag{1}$$

where $t$ stands for the time label.

Define:

$$W_j = \Big\{ \sum_{k} a_k\, \psi(2^j t - k),\ a_k \in \mathbb{R} \Big\} \tag{2}$$

where $a_k$ is the coefficient of the function $\psi(2^j t - k)$.

According to the wavelet decomposition theory, any function $f$ with finitely many discontinuous points can be approximated arbitrarily closely by a step function $f_j \in V_j$. If $j$ is large enough, $f_j(t) \approx f(t)$. $f_j$ can then be further decomposed into the sum of its projections onto the subspaces, that is:

$$f_j = w_{j-1} + w_{j-2} + \cdots + w_0 + f_0, \qquad w_i \in W_i,\ f_0 \in V_0 \tag{3}$$

Assume that the function of the analog parameter $x$ and time $t$ is $x = f(t)$. To detect the discontinuous points in $f(t)$ with the wavelet, the above algorithm is used: $2^j$ uniformly spaced segmentation points are inserted into the detection interval to discretize it. In this case, $f_j(t)$ can be represented as follows:

$$f_j(t) = \sum_{k} a_k^{j}\, \phi(2^j t - k) \tag{4}$$

where $a_k^{j} = f(k/2^j)$ and $0 \le k \le 2^j - 1$. It is further decomposed into the sum of $f_{j-1}(t)$ and $w_{j-1}(t)$, where $f_{j-1}(t) = \sum_k a_k^{j-1}\, \phi(2^{j-1} t - k)$ and $w_{j-1}(t) = \sum_k b_k^{j-1}\, \psi(2^{j-1} t - k)$.

The projection coefficient of $f_j(t)$ on the lower-order wavelet space $W_{j-1}$, namely the wavelet coefficient $b_k^{j-1}$, is half of the difference between the adjacent coefficients of the higher-order step function, i.e. $b_k^{j-1} = (a_{2k}^{j} - a_{2k+1}^{j})/2$. Thus, if $f(t)$ has a discontinuous point $t_0$ between the sample points $2k/2^j$ and $(2k+1)/2^j$, the absolute value of $b_k^{j-1}$ will be relatively large, due to the great difference between the values of $a_{2k}^{j}$ and $a_{2k+1}^{j}$ on the two sides of the discontinuity point $t_0$. The difference between $a_{2k}^{j}$ and $a_{2k+1}^{j}$ at a continuous point is very small, so $b_k^{j-1}$ will also be very small, close to zero. Thus, whether $f(t)$ is continuous at $t_0$ can be judged from the value of $b_k^{j-1}$. The analog parameter curve of the NPP is decomposed by wavelet, and the set $\{b_k^{j-1}\}$ is reconstructed. If there is a discontinuity point of the first kind in the original waveform, the corresponding point of the reconstructed waveform will be a distortion peak.

In essence, the above derivation extracts the first-derivative characteristics of the data by wavelet analysis in order to judge its continuity. Therefore, it is not necessary to prescribe a particular wavelet basis, and any common wavelet basis function can meet the requirements of the algorithm designed in this paper. Likewise, a first-order (single-level) decomposition is sufficient to obtain the first-derivative characteristics of the function.

As shown in Figure 2, a discrete missing point appears at time point 4000 of a certain parameter, and an obvious characteristic peak appears at the corresponding time point of its reconstructed waveform in Figure 3.

As shown in Figure 4, a continuous missing segment appears from time 4000 to time 5000. In its reconstructed waveform in Figure 5, obvious characteristic peaks appear at the beginning and the end of the segment, that is, at times 4000 and 5000.

To sum up, the detection algorithm for missing data of NPP is as follows:
(1) Scan the data values of the operation parameters and mark all zeros.
(2) Aggregate consecutive zeros into segments.
(3) Decompose the data by wavelet and reconstruct its high-frequency terms to obtain the derivative characteristic waveform.
(4) Check the discrete and continuous zeros:
(A) If it is a discrete zero, check whether there is a distortion peak at the corresponding time point of the characteristic waveform. If so, this point is a discrete missing point.
(B) If it is a continuous zero segment, check whether there are distortion peaks at the beginning and at the end of the corresponding segment of the characteristic waveform. If both characteristic peaks appear, the segment is a continuous missing segment, and all points in the segment are missing points.
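
The detection steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it uses the Haar relation (half the difference of adjacent samples) in an undecimated form so that jumps at both even and odd indices are visible, and it decides what counts as a "distortion peak" with a fixed `threshold`. The function names and the threshold are hypothetical.

```python
def haar_detail(x):
    """Undecimated level-1 Haar detail: half the difference of adjacent samples."""
    return [(x[i] - x[i + 1]) / 2.0 for i in range(len(x) - 1)]


def zero_segments(x):
    """Return inclusive (start, end) index pairs of maximal runs of zeros."""
    segments, start = [], None
    for i, value in enumerate(x):
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(x) - 1))
    return segments


def detect_missing(x, threshold):
    """Return indices of zeros that are bounded by a peak on both sides.

    A single zero is a segment of length one, so discrete and continuous
    missing are handled by the same rule. Zeros touching either end of the
    record are not flagged, because only one side can be checked there.
    """
    detail = haar_detail(x)

    def jump_before(i):  # jump between x[i - 1] and x[i]
        return i > 0 and abs(detail[i - 1]) > threshold

    def jump_after(i):  # jump between x[i] and x[i + 1]
        return i < len(x) - 1 and abs(detail[i]) > threshold

    missing = []
    for start, end in zero_segments(x):
        if jump_before(start) and jump_after(end):
            missing.extend(range(start, end + 1))
    return missing


# Hypothetical record: a discrete dropout, a continuous dropout,
# and a value that reaches zero smoothly (a natural zero).
signal = [5.0, 5.1, 0.0, 5.2, 5.1, 0.0, 0.0, 0.0, 5.0, 0.2, 0.1, 0.0, 0.1, 0.2]
flagged = detect_missing(signal, threshold=1.0)
```

In this example the natural zero at index 11 is not flagged, because the detail coefficients on both sides of it stay small.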

4. Design of Operation State Vector and the Similarity Measurement

4.1. Design of Operation State Vector

To deal with the missing operation data of NPP using HD algorithm, the system operation state vector should be constructed first. Therefore, the data similarity measurements of NPP operation data are studied.

There are obvious differences between the analog parameters and the switch parameters of the NPP. If the two kinds of parameters are grouped into the same state vector and its similarity is measured by a single method, the computational complexity of the algorithm will increase. Therefore, separate state vectors are constructed for the two kinds of parameters.

To represent the operation state of the NPP completely, all parameters of the NPP are used to form the state vectors. To retain the operating state information of the NPP accurately and completely, the elements of the designed operating state vectors are assigned the originally collected data without any processing.

4.1.1. Switch Parameter State Vector Xb(t)

The switch parameter state vector, $X_b(t)$, is a time-varying function vector, where $t$ stands for the time point. It is composed of all switch parameters, and each element in the vector represents a switch or alarm parameter. The element value represents the measured value of the corresponding switch parameter at time $t$. $X_b(t)$ has the following form:

$$X_b(t) = [x_{b1}(t),\ x_{b2}(t),\ \ldots,\ x_{bl}(t)] \tag{5}$$

where $x_{bi}(t) \in \{0, 1\}$ $(1 \le i \le l)$, $l$ is equal to the sum of the number of switch parameters and the number of alarm parameters, and $t$ stands for the time label.

4.1.2. Analog Parameter State Vector Xr(t)

The analog parameter state vector, $X_r(t)$, is a time-varying function vector, where $t$ stands for the time point. It is composed of all analog parameters, and each element in the vector represents an analog parameter. The element value represents the measured value of the corresponding analog parameter at time $t$. $X_r(t)$ has the following form:

$$X_r(t) = [x_{r1}(t),\ x_{r2}(t),\ \ldots,\ x_{rs}(t)] \tag{6}$$

where $x_{rj}(t) \in \mathbb{R}$ $(1 \le j \le s)$, $s$ is the number of analog parameters, and $t$ stands for the time label.

4.2. Design of the Similarity Measurement for Switch Parameters State Vector

A feature is called binary when it has only two states (0 and 1), where 0 means false and 1 means true. All Boolean-type parameters are typical binary features. The match degree is used here to characterize the similarity of the Boolean parameter vector $X_b(t)$.

4.2.1. Definition of Match

For two corresponding components, $x_i$ and $y_i$, of two given vectors $x$ and $y$:
(1) if $x_i = 1$ and $y_i = 1$, it is called a 1–1 match;
(2) if $x_i = 0$ and $y_i = 1$, it is called a 0–1 match;
(3) if $x_i = 1$ and $y_i = 0$, it is called a 1–0 match;
(4) if $x_i = 0$ and $y_i = 0$, it is called a 0–0 match.

4.2.2. Selection of Matching Measure

Among the common matching measures, the simple matching measure fits the requirement of the similarity measurement for Boolean-type parameter state vectors. Therefore, the simple matching measure is used to measure the similarity of the Boolean-type parameter state vectors of the NPP.

Assume that there are two Boolean parameter vectors $x = (x_1, \ldots, x_l)$ and $y = (y_1, \ldots, y_l)$. Let:

$$a = \sum_{i=1}^{l} x_i\, y_i \tag{7}$$

which represents the number of 1–1 matches between $x$ and $y$, and:

$$b = \sum_{i=1}^{l} (1 - x_i)(1 - y_i) \tag{8}$$

which represents the number of 0–0 matches between $x$ and $y$. Then the simple match degree is calculated as formula (9):

$$m = \frac{a + b}{l} \tag{9}$$

According to its definition, $m$ $(0 \le m \le 1)$ is a real number, and the greater the value of $m$, the higher the similarity.
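
The simple match degree can be illustrated with a short sketch. The function name `simple_match` and the sample vectors are hypothetical; the computation is the standard simple matching coefficient, that is, the share of positions where both vectors are 1 or both are 0.

```python
def simple_match(x, y):
    """Simple match degree m = (a + b) / l for two equal-length 0/1 vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # 1-1 matches
    b = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)  # 0-0 matches
    return (a + b) / len(x)


# Two hypothetical switch state vectors that differ in one position.
m = simple_match([1, 0, 1, 1], [1, 0, 0, 1])
```

Here a = 2 and b = 1, so m = 3/4.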

4.3. Design of the Data Similarity Measurement for Analog Parameters State Vector

In pattern recognition, the similarity of analog parameter state vectors is measured by either a distance measure or an angle measure. The distance measure uses a generalized distance to represent the similarity between vectors: the smaller the distance, the higher the similarity. The angle measure commonly uses the cosine of the angle between vectors: according to the monotonicity of the cosine function, the higher the cosine value, the higher the similarity.

For the analog parameter state vectors of an NPP, if the angle cosine is used to represent the similarity between them, the length information is lost, resulting in data distortion. For example, the cosine of the angle between the vectors $x = (2, 2, 4)$ and $y = (1, 1, 2)$ is 1, which would indicate very high similarity. However, if such a situation occurs in the operation of an NPP, the two vectors clearly correspond to two different operation states with low similarity. Therefore, the distance measure is chosen as the similarity measurement for analog parameter state vectors.
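
The example above can be checked numerically. The sketch below computes the angle cosine and, for contrast, the Euclidean distance of the same two vectors; the helper names are hypothetical, and the Euclidean distance is used only to show that a distance measure keeps the length information.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def euclidean(x, y):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x, y = (2, 2, 4), (1, 1, 2)
cos_xy = cosine(x, y)       # 1 up to rounding: the vectors are parallel
dist_xy = euclidean(x, y)   # sqrt(6): the vectors are clearly not equal
```

The cosine cannot tell the two states apart, while the distance is nonzero.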

At present, the commonly used distance measures include the Euclidean distance, the Mahalanobis distance, the Minkowski distance, the Chebyshev distance, etc. Different operation parameters of an NPP have different physical dimensions (units). Therefore, the distance measure is required to weaken the influence of the different dimensions. The Mahalanobis distance takes the relationship between parameters into account and is independent of the dimensions of the parameters.

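Let the data set $D_T$ be the set of vectors $X_r(t)$ corresponding to $T$ consecutive time points of the NPP, $D_T = \{X_r(t_k)\}$, $1 \le k \le T$. Let $x$ and $y$ be two vectors of different time points in $D_T$. The Mahalanobis distance between $x$ and $y$ is calculated by formula (10):

$$d(x, y) = \sqrt{(x - y)^{T}\, \Sigma^{-1}\, (x - y)} \tag{10}$$

where $\Sigma$ represents the covariance matrix of $D_T$, and $\Sigma^{-1}$ represents the inverse matrix of $\Sigma$. $\Sigma$ is calculated through formula (11):

$$\Sigma = \begin{bmatrix} f(p_1, p_1) & \cdots & f(p_1, p_s) \\ \vdots & \ddots & \vdots \\ f(p_s, p_1) & \cdots & f(p_s, p_s) \end{bmatrix} \tag{11}$$

where $p_i$ denotes the series of the $i$-th analog parameter over $D_T$, and $f(p_i, p_j) = \mathrm{Cov}(p_i, p_j)$ stands for the covariance between $p_i$ and $p_j$.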
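
A minimal NumPy sketch of the Mahalanobis distance follows. It assumes the data set is stored as an array with one row per time point and one column per analog parameter; the function name `mahalanobis` and the use of a pseudo-inverse (to tolerate a singular covariance matrix) are choices made for this illustration and are not taken from the paper.

```python
import numpy as np


def mahalanobis(x, y, data):
    """Mahalanobis distance between x and y under the covariance of `data`.

    data: array-like of shape (T, s), one row per time point.
    """
    samples = np.asarray(data, dtype=float)
    cov = np.atleast_2d(np.cov(samples, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse: an assumption of this sketch
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ cov_inv @ diff))


# Hypothetical data: two uncorrelated parameters on very different scales.
data = [[0.0, 0.0], [2.0, 0.0], [0.0, 20.0], [2.0, 20.0]]
d_small_scale = mahalanobis([0.0, 0.0], [2.0, 0.0], data)
d_large_scale = mahalanobis([0.0, 0.0], [0.0, 20.0], data)
```

The two Euclidean distances are 2 and 20, but both Mahalanobis distances are equal, because each difference is scaled by the spread of its own parameter.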

4.4. Design of Synthetic Similarity Calculation Algorithm

Similarity measurements have been proposed above for the operation parameters of NPPs with different data types. The two methods are now combined into a unified similarity measurement for the system operation state, which provides the calculation basis of the HD algorithm.

There are three types of operation parameters in an NPP: analog parameters, switch parameters and alarm parameters. The relations among them are as follows:
(1) The states of the alarm parameters are determined by comparing analog parameters with their threshold values, so the values of the analog parameters reflect the states of the alarm parameters to a large extent.
(2) The switch parameters indicate the operation states of the equipment, such as pumps and valves in an NPP, and changes of the equipment state will cause changes of the analog parameters.

It can be seen that the analog parameters contain some information about the alarm parameters and the switch parameters. Therefore, the similarity measurement of the NPP is designed at two levels:
(1) First-level description. The Mahalanobis distance is calculated using the analog parameters. The data group with the smallest Mahalanobis distance is the one most similar to the data to be filled.
(2) Second-level description. If several groups share the same minimum Mahalanobis distance in the first-level results, the simple matching measure of the Boolean-type parameter state vectors is calculated for these groups and the results are sorted. The group with the largest match degree is the one most similar to the data to be filled.
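
The two-level rule can be sketched as a small selection function. The tuple layout, the function name `select_donor` and the tolerance `tol` used to decide that two distances are "the same" are assumptions made for this example.

```python
def select_donor(candidates, tol=1e-12):
    """Pick the most similar time point by the two-level rule.

    candidates: list of (time_index, mahalanobis_distance, match_degree).
    Level 1 keeps the candidates with the smallest distance; level 2 breaks
    ties by the largest simple match degree.
    """
    if not candidates:
        raise ValueError("no complete time points to choose from")
    d_min = min(distance for _, distance, _ in candidates)
    tied = [c for c in candidates if abs(c[1] - d_min) <= tol]
    return max(tied, key=lambda c: c[2])[0]


# Hypothetical candidates: time points 7 and 9 tie on distance,
# and 9 has the higher match degree.
best = select_donor([(3, 0.5, 0.6), (7, 0.2, 0.7), (9, 0.2, 0.9), (11, 0.4, 1.0)])
```

Time point 11 has the highest match degree overall but is never considered, because the match degree is consulted only among the first-level ties.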

5. Design of Filling Algorithm for Missing Data

5.1. Design of Algorithmic Flow

The specific process of constructing the missing data filling algorithm for the operation parameter data of NPP is as follows:

Input: The operation data set
Output: The replacement data
Sum = The total number of parameters;
T = The time length of dataset;
for i = 1 to Sum do
    if (there is value 0 in Parameter[i]) then
        Integrate the continuous zero points of Parameter[i] into zero intervals;
        Decompose the time curve of Parameter[i] by wavelet;
        Reconstruct the high frequency terms;
        Detect discrete missing and continuous missing;
        Mark the missing points;
    end if
end for
Num = The total number of parameters with data missing points;
Establish the switch parameters state vector;
Establish the analog parameters state vector;
for i = 1 to Num do
    Select the ith parameter with missing data;
    Tmp = The total number of missing data points of the selected parameter;
    for j = 1 to Tmp do
        Select the jth missing point as the time point to be filled;
        Minimum = +infinity;
        for k = 1 to T do
            if (time point k has complete data) then
                Calculate the Mahalanobis distance between the point to be filled and time point k;
                if (Mahalanobis distance < Minimum) then
                    Refresh Minimum;
                    Record the corresponding data;
                else if (Mahalanobis distance == Minimum) then
                    Calculate the matching measure between the switch parameter state vectors;
                    Select the more similar one;
                    Record the corresponding data;
                end if
            end if
        end for
        Fill the missing point with the recorded data;
    end for
end for
Check the related switch parameters and correct the wrong ones;
Return Dataset after filling
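
The filling part of the flow above can be sketched with NumPy as follows. This is an illustration under several assumptions that the paper does not state: the missing points are already marked in a boolean mask, the Mahalanobis distance is computed on the observed analog components of the time point to be filled, the covariance is estimated from the complete time points only, and ties are detected with `np.isclose`. The final check of the related switch parameters is omitted. The function name `hot_deck_fill` is hypothetical.

```python
import numpy as np


def hot_deck_fill(analog, switch, missing):
    """Fill marked analog values from the most similar complete time point.

    analog:  (T, s) analog parameter values, one row per time point
    switch:  (T, l) switch parameter values (0/1), one row per time point
    missing: (T, s) boolean mask, True where an analog value is missing
    """
    analog = np.asarray(analog, dtype=float)
    switch = np.asarray(switch)
    missing = np.asarray(missing, dtype=bool)
    complete = np.flatnonzero(~missing.any(axis=1))  # donor time points
    if complete.size < 2:
        raise ValueError("need at least two complete time points")
    filled = analog.copy()
    for t in np.flatnonzero(missing.any(axis=1)):
        obs = ~missing[t]
        if not obs.any():
            continue  # no observed analog value to compare on
        donors = analog[complete][:, obs]
        cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(donors, rowvar=False)))
        diff = donors - analog[t, obs]
        sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        dist = np.sqrt(np.maximum(sq, 0.0))
        # Level 1: smallest distance. Level 2: largest switch match degree.
        tied = complete[np.isclose(dist, dist.min())]
        match = (switch[tied] == switch[t]).mean(axis=1)
        donor = tied[np.argmax(match)]
        filled[t, ~obs] = analog[donor, ~obs]
    return filled


# Hypothetical data: the first analog value at time index 3 is missing.
analog = [[1.0, 10.0], [2.0, 20.0], [5.0, 20.0], [0.0, 20.0], [4.0, 40.0]]
switch = [[0, 0], [0, 1], [1, 1], [1, 1], [0, 0]]
missing = [[False, False]] * 3 + [[True, False]] + [[False, False]]
result = hot_deck_fill(analog, switch, missing)
```

In this example time indices 1 and 2 are equally close in the analog parameters, and index 2 is chosen because its switch vector matches that of the point to be filled.
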
5.2. Analysis of the Computation Complexity for the Algorithm

Suppose there are $M$ parameters in the data set, and each parameter has $N$ data points, among which there are $K$ missing points in total. The number of missing points is much smaller than the size of the data set. At the same time, the number of operating parameters is also much smaller than the size of the data set. Therefore, formula (12) holds:

$$K \ll N, \qquad M \ll N \tag{12}$$

Both M and K can be regarded as constants. The computational complexity of the entire algorithm is shown in equation (13).

The computational complexity of missing value detection mainly depends on the computational complexity of the wavelet decomposition. In general, the computational time complexity of the wavelet decomposition algorithm is $O(N \log N)$. Therefore, the computational complexity of missing value detection is $K \cdot O(N \log N)$, that is, $O(N \log N)$.

The computational complexity of missing value filling is shown in equation (14), where Tmp represents the computational complexity required to calculate the Mahalanobis distance, and formula (15) calculates Tmp.

In summary, the computational complexity of missing value filling is $O(N^2)$. Therefore, the final computational complexity of the algorithm designed in this paper is $O(N^2)$. Compared with common algorithms, the designed algorithm is of high efficiency.

6. Experimental Verifications of the Proposed Algorithm

To verify the correctness and advantages of the designed algorithm, normalized operation data from the CAP 1400 simulator developed by SJTU are used as the sample for the experiments.

To make the experimental results easier to analyze, all the data are shown in normalized form. A data set of 1200 time points, into which missing data have been inserted, is taken as the sample. The error rate is used to evaluate the effect of the algorithms. The error rate is calculated by formula (16):

$$E = \frac{|F - T|}{|T|} \times 100\% \tag{16}$$

where $E$ is the error rate, $F$ is the result calculated by the algorithm, and $T$ is the true value that the missing point should actually have.

The filling method based on mean interpolation (MI) and the method based on LSTM are taken as the comparison algorithms to verify the advantage of the designed algorithm. For the MI algorithm, data near the missing point are selected as the calculation basis, and their average value is used as the filling value of the missing point. For the LSTM algorithm, several data points before the missing point are selected as the basis, the LSTM method is used for prediction, and the predicted value is used as the filling value. The error rates of the three algorithms for the same missing point are calculated by formula (16) and compared to show the superiority of the designed algorithm.
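
For reference, the MI baseline and the error rate can be sketched as below. The sketch assumes that the error rate is the relative error in percent, that missing values are represented by `None`, and that the MI window is symmetric; the window size and the function names are hypothetical and are not taken from the experiments.

```python
def error_rate(filled, true):
    """Relative error in percent; undefined when the true value is zero."""
    if true == 0:
        raise ValueError("error rate is undefined for a true value of zero")
    return abs(filled - true) / abs(true) * 100.0


def mean_interpolation(series, idx, window=2):
    """Fill series[idx] with the mean of valid neighbours within `window` steps."""
    lo = max(0, idx - window)
    hi = min(len(series), idx + window + 1)
    neighbours = [series[i] for i in range(lo, hi)
                  if i != idx and series[i] is not None]
    if not neighbours:
        raise ValueError("no valid neighbours inside the window")
    return sum(neighbours) / len(neighbours)


# Hypothetical normalized record with one missing point whose true value is 3.0.
series = [1.0, 2.0, None, 4.0, 5.0]
mi_value = mean_interpolation(series, idx=2)
mi_error = error_rate(mi_value, true=3.0)
```

On a linear ramp such as this one MI is exact; its error grows when several neighbours inside the window are missing as well.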

6.1. Experiment on Data with Discrete Missing Points

Sixty discrete missing points are randomly inserted into two parameters: the outlet temperature of the first loop and the pressure of the pressurizer (PZR). The curves of the parameters after insertion are shown in Figure 6.

The detection results of the missing data are shown in Figure 7.

The curves of data after filling using the designed algorithm are shown in Figure 8.

Taking the filled data of the first loop outlet temperature as an example, the error rates of part of the calculation results are listed and compared with those of the other two algorithms in Table 1. The complete data are listed in Attached Table 2 and Attached Table 3.

6.2. Experiment on Data with Continuous Missing Points

Three continuous missing segments are inserted into the pressure of the 1# steam generator (SG), with durations of 10 seconds, 50 seconds and 100 seconds. The curve after insertion is shown in Figure 9. The missing data are detected, and the results are shown in Figure 10.

The missing data are filled with the designed algorithm, and the results are shown in Figure 11. Part of the error rates of the calculation results are listed and compared with those of the other two algorithms in Table 4. The complete data are listed in Attached Table 5.

7. Result Analysis and Conclusion

The average error rates are listed in Table 6 to evaluate the stability of the algorithms from another perspective.

By analyzing the experimental results in Section 6, the following conclusions can be drawn.
(1) In the experiment on the data with discrete missing points, the error rates of the HD algorithm are commonly lower than those of the other methods.
(2) In the experiment on the pressure of the PZR, the average error rate of the MI method rises sharply, while those of the other two methods rise only slightly. Analysis of the original data shows that some of the discrete missing points are located next to each other. It can therefore be concluded that the HD algorithm is much more stable. The experiment on the data with continuous missing points confirms this conclusion even more clearly.
(3) Generally, the designed HD algorithm based on the Mahalanobis distance performs better than the other algorithms in both accuracy and stability.

In the process of experimental verification, 50 data points were randomly selected to record their calculating time through the program, and the obtained list is shown in Table 7.

The average of the times in the table is 0.04246 seconds. This indicates that the designed algorithm is highly efficient, considering that the experiment was carried out on an ordinary PC. Since the average calculating time is far shorter than the data acquisition period, the designed algorithm is also suitable for online applications.

In summary, a missing data filling algorithm for NPPs is studied on the basis of an analysis of the characteristics of NPP operation data. A missing data detection method based on wavelet decomposition is studied to distinguish normal zero values from missing data, which solves the problem of unclear criteria for missing data. The construction method for the operation state vector of the NPP is studied. On this basis, the similarity measurement of the analog parameter vector based on the Mahalanobis distance, the similarity measurement of the switch parameter vector based on the matching measure, and their joint similarity measure are studied. Then the entire flow of the missing data filling algorithm for NPPs is designed. Finally, the designed algorithm is verified by experiments, which demonstrate its correctness and feasibility and show that it performs better than some commonly used algorithms.

The designed algorithm may find application in the following aspects.
(1) For offline analysis of NPP operation data, it can serve as a data cleaning step before big data analysis is applied to NPP abnormal operation state detection and operation experience feedback, improving data quality and the data analysis results.
(2) For online analysis of NPP operation data, it can be used to correct the measurement error when a sensor or a measuring channel fails, improving fault-tolerant control.

Data Availability

The data used to support the findings of this study were supplied by Chen Yusheng under license and so cannot be made freely available. Requests for access to these data should be made to Chen Yusheng.

Conflicts of Interest

The authors declare that they have no conflicts of interest.