Special Issue: Data Mining and Knowledge Discovery in Industrial Engineering
An Improved Generalized-Trend-Diffusion-Based Data Imputation for Steel Industry
The integrity and validity of industrial data are fundamental to data-driven modeling. To address the problem of missing gas flow data in the steel industry, this study proposes an improved Generalized-Trend-Diffusion (iGTD) algorithm that specifically targets consecutively missing data and small samples. The proposed Gaussian membership-based GTD expands the useful knowledge contained in the data samples and thereby greatly increases the imputation accuracy. In addition, the imputation order is discussed in order to enhance the accuracy of sequential gas flow forecasting. To verify the effectiveness of the proposed method, a series of experiments covering three categories of data features in the gas system is presented, and the results indicate that the method performs better overall for the imputation of periodical-like and time-series-like data.
1. Introduction
Missing data are one of the major obstacles to obtaining valid data samples, and they are common or even inevitable in some data-driven research fields, such as sample surveys, industrial production, medical research, software engineering, and wireless broadcast environments [2, 3]. Missing data can destroy the integrity of the samples, since the cells of a database are not necessarily independent, and a single missing value might force an entire observation, together with its useful information, to be dropped [4, 5]. As a result, useful information or knowledge can be lost from the data set. Moreover, missing data also lead to nonresponse bias in the samples, which can be a serious concern for data-driven studies [6–10]. In the literature, most existing methods for this problem are based on statistical techniques. For instance, multiple imputation (MI), a popular technique, has been used to handle missing data in gross domestic product (GDP) figures, cancer databases, and sample surveys, and similar response pattern imputation (SRPI) has also been applied. Other studies used the classic expectation-maximization (EM) algorithm, principal component analysis (PCA), and singular value decomposition, while [16, 17] utilized the maximum likelihood technique for missing data imputation. However, all of these techniques have difficulty reflecting the relationships among regression variables, since the imputed values are mere approximations of the unknown values. Besides statistical techniques, machine learning has received increasing attention, as presented in [18, 19].
In industrial manufacturing processes, data are often missing because of events such as data collector failures, transmission errors, or information storage errors, which directly hinder the establishment of data-driven models such as scheduling models, regression prediction models, and stochastic optimization models [20–22]. The literature offers industrial practitioners several types of approaches to these missing data problems. In [23, 24], the authors proposed list-wise deletion, which is easy to implement but reduces the sample size. Since much of the missing data in industry takes the form of time series sampled at equal intervals, deleting the missing points breaks the integrity of the sample data. Besides wasting a great deal of costly collected data, this method also leads to invalid results if the excluded group is a selective subsample of the entire sample. Mean imputation is another widely employed method. However, for time series whose amplitude changes dramatically, the mean value eliminates the diversity of the samples, and the resulting distortion is usually unacceptable to industrial practitioners. As for other statistical or machine learning techniques, maximum likelihood estimation and the linear interpolator were proposed in [27, 28], respectively, and experiments were used to validate them for time series imputation. Yet all of these experiments placed high demands on the samples, and as a result their application in real industrial processes is rather limited. Few of the above methods deliver satisfactory imputation accuracy once consecutive points are missing, or when the missing rate is high and the sample size is too small.
Generalized-Trend-Diffusion (GTD) is a sample construction method aimed at small data sets. Like virtual examples and the functional virtual population, GTD employs so-called shadow data and membership functions to increase the knowledge available from small data sets. The expanded samples are provided to back-propagation (BP) neural networks for forecasting, which yields higher prediction accuracy than training without expansion. Thus, the most significant advantage of GTD is that it achieves satisfactory forecasting accuracy with relatively small data sets. On the other hand, the original GTD describes the membership degree of each observed sample with respect to the mean value via a triangular membership function. As such, the deviation of each observed data point from the mean value is proportional to the difference between the membership function values; that is, the observed data points deviate linearly from the mean. However, such a description of the deviation cannot deliver high accuracy in imputation tasks for real industrial manufacturing processes.
This paper addresses the imputation of missing blast furnace gas flow data in the energy system of a steel plant. An improved GTD modeling algorithm based on a Gaussian membership function is proposed, taking into account the diversity of the gas flow data and the complex missing situations. The Gaussian membership function expresses that the observed data deviate from the mean value nonuniformly, which makes close-to-mean values more likely to appear in the imputation. The samples expanded by the membership function make the values predicted by the BP network lean toward the mean, yet these predicted values do not collapse the samples to a single value as mean imputation does. In addition, the imputation order is essential to accuracy in time series problems. A both-side-toward-middle (BSTM) order is proposed in this paper and is shown to be more appropriate than the chronological order. Tests are carried out to verify the effectiveness of the proposed method, with sample data drawn from the practice of Shanghai Baosteel Co. Ltd. The results demonstrate that the improved GTD method performs much better than the original version and other methods in several cases.
This paper is organized as follows. In Section 2, the practical conditions of blast furnace gas at Baosteel are described. The original GTD and its improved version are then established in Section 3, where the use of the improved GTD for industrial missing data imputation is discussed in detail. In Section 4, the validity of the improved GTD is verified by a series of comparative experiments. Finally, the study is summarized in Section 5.
2. Problem Description
Blast furnace gas (BFG) is a byproduct gas generated in the iron-making process. As an important secondary energy source for blast furnaces, coke ovens, power stations, and other units, its proper utilization not only reduces the energy consumption of steel enterprises but also improves their economic profits. Figure 1 shows the structure of the BFG system of a steel plant, where four blast furnaces supply gas to the consumers. However, BFG may have to be diffused if the flow prediction and the scheduling are carried out inappropriately, which seriously pollutes the environment. The supervision of BFG generation and consumption is therefore a crucial task for steel enterprises.
Currently, the on-site technicians perform balance scheduling by estimating the BFG generation amount from the observed data. However, observed data are often missing because of collector failures, transmission errors, information storage errors, and so forth. Furthermore, the BFG generating process is rather complex and its output fluctuates irregularly, so missing data make it hard for the workers to perceive the dynamics of the gas flow with a generic model. In practice, the gas engineers at Shanghai Baosteel widely rely on estimation based on personal experience when a single point is missing. However, runs of consecutive missing points are more common in the real manufacturing process, and this approach handles them poorly. In addition, if the missing rate is high, the whole time series can be treated as a combination of several short series. In this case, existing methodologies such as the recurrent neural networks presented in [33, 34] cannot be used, because they need a large amount of sample data to train the regression model.
Considering the varied features of the large number of gas units in the BFG system, we group the flow tendencies of the generation and consumption units into three categories: (1) periodicity-like flow data (the gas consumption of the hot blast stove; see Figure 2), (2) concussive flow data (the gas consumption of the coke oven; see Figure 3), and (3) ordinary time series flow data (the generation amount of the blast furnace; see Figure 4).
3. Improved Generalized-Trend-Diffusion
3.1. Generalized-Trend-Diffusion
GTD is a sample construction method aimed at small data sets; it generates shadow data from the real data and the order in which the observed data occur. The degree of importance of the shadow data and the observed data is quantified by membership function values based on fuzzy theory. Both the membership values and the shadow data can be treated as additional hidden information about the data, which helps to improve the imputation accuracy. These features make GTD well suited to the imputation of missing data in time series, which carry no information other than time. One can start from an empty set to which one point is added with each observation (Figure 5). As the data accumulate, the central location of the data, labeled “C” in Figure 5, moves from one location to another with each observation. If the deviation of each point from the central location can be obtained, the detailed distribution of the whole sample becomes clear, and GTD with a membership function can be used to describe this deviation. Let the membership function value at “C” be 1. The closer the membership value of a missing data point is to 1, the closer that point is to “C”, and vice versa.
In the original GTD model, let $MF_t$ be the membership function for the data collected at Step $t$. For example, $MF_1$ at Step 1 refers to $x_1$ only, $MF_2$ at Step 2 refers to $x_1$ and $x_2$, $MF_3$ at Step 3 refers to $x_1$, $x_2$, and $x_3$, and so forth. Data such as $x_1$ at Step 2 and $x_1$ and $x_2$ at Step 3 are called shadow data, because each of them is used repeatedly in forming the membership functions of successive steps although it actually occurred only once.
The imputation can then be carried out with the shadow data. Suppose that a sequence of observed data has been obtained and that the next point is missing. The shadow data are built by unevenly repeating the data, with more recent points repeated more often, since they carry more important contemporary information about the system variation than earlier data. As shown in Figure 6, the most recent point is repeated the most times, the point before it fewer times, and so forth. This repetition is called the backward tracking process, since it proceeds backward from the most recent point. The repeated data (shadow data), together with their membership function values, help to enlarge the sample knowledge.
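As a rough illustration of how such an expanded sample might be built, the following minimal sketch repeats recent points more often than earlier ones. The repetition counts are an assumption made here for illustration (the k-th observed point is repeated k times); the function name `build_shadow_sample` is hypothetical and is not taken from the GTD literature.

```python
def build_shadow_sample(observed):
    """Expand a short series by repeating newer points more often.

    Assumption (illustrative weighting only): the k-th observed point,
    counted from 1, is repeated k times, so the most recent point is
    repeated most often.
    """
    expanded = []
    for k, value in enumerate(observed, start=1):
        expanded.extend([value] * k)
    return expanded


sample = [3.0, 5.0, 4.0]
shadow = build_shadow_sample(sample)
print(shadow)  # [3.0, 5.0, 5.0, 4.0, 4.0, 4.0]
```

Under this assumed weighting, a series of n observed points grows to n(n + 1)/2 points, which are then paired with their membership values before training.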
3.2. Improved Generalized-Trend-Diffusion
The original GTD uses a triangular membership function (Figure 7) to describe each point’s deviation from the mean value. However, this description is somewhat unreasonable, since it restricts the deviation to a linear form; that is, the deviation from the central location is proportional to the difference between the memberships. Under this condition, every data value is equally likely to appear in the imputation. In industrial manufacturing processes, however, data close to the mean actually appear with higher probability. In this study, we refer to such data as the high frequency cloud. If a membership function assigns the high frequency cloud membership values closer to that of the mean, then the data in the cloud (the mean-like data) will reappear in the imputation with higher probability. With this motivation, the Gaussian membership function (Figure 8) is better suited to the task.
The information diffusion principle is another reason for choosing the Gaussian membership function. Like molecular diffusion, information diffusion fills in blanks; it arises because some data acquire little information from the sample knowledge, while molecular diffusion is caused by heterogeneity in the spatial distribution. For molecular diffusion, it has been proved that the molecular current density is proportional to the concentration gradient. If this principle is combined with the law of conservation of mass, molecular diffusion can be described in the same form as the probability density of the Gaussian distribution. As a kind of incremental learning method, GTD is a representation of information diffusion. Since the causes of information diffusion and molecular diffusion are similar, we are inspired to employ the Gaussian membership function in the improved GTD. The general form of the Gaussian function is
$$f(x) = a \exp\left(-\frac{(x - b)^2}{2c^2}\right), \quad (1)$$
where $a$, $b$, and $c$ are real constants and $c > 0$. To adapt the function to the sample construction in this study, we use (2) instead of (1):
$$MF(x) = \exp\left(-\frac{(x - \bar{x})^2}{2\sigma^2}\right), \quad (2)$$
where $\bar{x}$ is the mean value of the sample and $\sigma$ is the standard deviation. Here $a$ is set to 1, since the membership value at the mean should be 1. With its form fixed in this way, the Gaussian membership function replaces the triangular one in enlarging the sample knowledge.
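To make the contrast between the two membership functions concrete, the following minimal sketch evaluates a Gaussian membership (peak value 1 at the sample mean) next to a triangular one. The flow values, the use of the population standard deviation, and the triangular half-width of two standard deviations are assumptions chosen only for illustration.

```python
import math
from statistics import mean, pstdev


def gaussian_membership(x, center, sigma):
    """Gaussian membership with value 1 at the center."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def triangular_membership(x, center, half_width):
    """Triangular membership: falls linearly from 1 at the center to 0."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


# Hypothetical gas flow readings, not taken from the Baosteel data.
flows = [410.0, 425.0, 430.0, 440.0, 470.0]
center = mean(flows)
sigma = pstdev(flows)      # assumption: population standard deviation
half_width = 2.0 * sigma   # assumption: illustrative triangle width

for value in flows:
    g = gaussian_membership(value, center, sigma)
    t = triangular_membership(value, center, half_width)
    print(f"{value:6.1f}  gaussian={g:.3f}  triangular={t:.3f}")
```

Near the mean the Gaussian curve stays close to 1 while the triangular one already drops linearly, which is the behavior the high frequency cloud argument relies on.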
3.3. Data Imputation
The BP algorithm is a supervised learning method for neural networks that adjusts the weights to minimize the difference between the actual output and the desired output. The enlarged knowledge can then be used by BP neural networks to carry out the prediction.
Missing data points have to be imputed one by one, so the order of imputation is another concern in this study. If the imputation is performed in real time, the chronological order has to be used, because future data points are not yet available. However, the study in this paper is a data mining task that does not require real-time imputation, and in that case the BSTM order is superior to the chronological one. For instance, suppose there are five consecutive missing data points, as Figure 9 shows. If the imputation order is chronological, the forecast error of point number 1 affects the forecast of point number 2, and the error of point number 2 in turn propagates to point number 3. In this way, the errors accumulate.
Besides, data points in a time series are always fluctuating, as Figure 9 shows. If the missing points fall on a hillside, the chronological imputation is very likely to continue the peak. With the BSTM order, however, points number 1 and number 2 remain on the peak, while points number 4 and number 5 lie on the plain, so both sides continue the trends of their neighbors. Point number 3 is imputed as the mean of points number 2 and number 4. Obviously, the chronological result deviates from the real values more than the BSTM result, which shows that the BSTM order is superior to the chronological one. The same conclusion holds when the missing points lie on a peak or on a plain.
Given a time series, the index of the first missing point, the number of consecutive missing points, and the embedding dimension, the missing points in the former half of the gap are imputed forward from the points that precede them, while those in the latter half are imputed backward from the points that follow them. Together, these two parts give all the imputations.
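The BSTM order can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the alternation between the two ends is assumed from the A-C-B example, the gap is assumed to have at least `dim` observed points on each side, and the `persistence` predictor is a hypothetical stand-in for the trained BP network. All function names are illustrative.

```python
def bstm_order(first_missing, n_missing):
    """Return the gap indices in both-side-toward-middle order."""
    left, right = first_missing, first_missing + n_missing - 1
    order = []
    while left < right:
        order += [left, right]          # one point from each end
        left, right = left + 1, right - 1
    if left == right:
        order.append(left)              # odd gap: the middle point comes last
    return order


def impute_bstm(series, first_missing, n_missing, predict, dim):
    """Fill one gap from both ends toward the middle.

    `predict` maps a window of `dim` values (nearest neighbor last) to the
    next value. Assumes at least `dim` known points on each side of the gap.
    """
    filled = list(series)
    left, right = first_missing, first_missing + n_missing - 1
    while left < right:
        # Forward step: use the dim points before the left end of the gap.
        filled[left] = predict(filled[left - dim:left])
        # Backward step: use the dim points after the right end, reversed.
        filled[right] = predict(filled[right + 1:right + 1 + dim][::-1])
        left, right = left + 1, right - 1
    if left == right:
        # The middle point of an odd gap is the mean of its two neighbors.
        filled[left] = (filled[left - 1] + filled[left + 1]) / 2.0
    return filled


def persistence(window):
    """Placeholder predictor (assumption): repeat the nearest known value."""
    return window[-1]


series = [1.0, 2.0, None, None, None, 8.0, 9.0]
print(bstm_order(2, 3))                               # [2, 4, 3]
print(impute_bstm(series, 2, 3, persistence, dim=2))  # [1.0, 2.0, 2.0, 5.0, 8.0, 8.0, 9.0]
```

For a gap of three points A, B, and C, `bstm_order` yields the order A-C-B, so the middle point is imputed last, once both of its neighbors are available.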
4. Experimental Results and Analysis
The imputation tests on missing BFG flow data are carried out with the proposed Gaussian membership function-based method, called iGTD here. First, the superiority of the BSTM order over the chronological one is tested. A series of 800 consecutive data points is taken from the number 1 blast furnace at Baosteel, covering 14:34:00/13/8/2010 to 3:54:00/14/8/2010. Considering that it is difficult to guarantee large quantities of consecutive valid data in real industrial databases, and that small sample sets are the concern of this study, the embedding dimension is empirically chosen as 15, and the number of hidden neurons is chosen as 10 in the same manner. We divide the sample data into 4 groups. For each group, we randomly remove 3 consecutive points (A, B, and C) in 3 places. The tests are implemented in the chronological order (A-B-C) and in the BSTM order (A-C-B), respectively. We use three indexes as the evaluation criteria of imputation accuracy: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute percentage error (MAPE). Each is computed from the total number of imputations, the imputed values, and the corresponding real values. The imputation accuracies for the two orders on the 4 groups of data are shown in Table 1. It is apparent that the BSTM order is more effective than the chronological order.
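For reference, the three indexes can be computed as in the sketch below. RMSE and MAPE follow their standard definitions. NRMSE has several normalizations in common use; dividing the RMSE by the range of the real values is an assumption made here and may differ from the normalization behind the reported tables. The numerical values are hypothetical.

```python
import math


def rmse(real, imputed):
    """Root mean square error between real and imputed values."""
    n = len(real)
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(real, imputed)) / n)


def nrmse(real, imputed):
    """RMSE normalized by the range of the real values (assumed convention)."""
    return rmse(real, imputed) / (max(real) - min(real))


def mape(real, imputed):
    """Mean absolute percentage error, in percent; real values must be nonzero."""
    n = len(real)
    return 100.0 / n * sum(abs((p - y) / y) for y, p in zip(real, imputed))


# Hypothetical values for three imputed points.
real = [100.0, 200.0, 300.0]
imputed = [110.0, 190.0, 300.0]
print(rmse(real, imputed), nrmse(real, imputed), mape(real, imputed))
```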
To further verify the effectiveness of the proposed Gaussian membership function, we consider all three categories of gas flow data mentioned in Section 2: the periodicity-like flow data, such as the BFG consumption of the hot blast stove; the concussive flow data, such as the gas consumption of the coke oven; and the ordinary time series flow data, such as the generation amount of the blast furnace. The comparative experiments use the EM method, regression, spline, and the original GTD. The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori estimates of parameters in statistical models; it is widely used for missing data, since the maximum likelihood estimate of the unknown parameters can be determined from the incomplete data set. The regression method employs multiple linear regression to estimate the missing values.
We again use the real industrial data from Baosteel for the comparative experiments. The collected data are divided into several groups, and runs of 3, 4, and 5 consecutive points are removed from the time series. To cover all possible situations, the removed data include regions of the time series on peaks, troughs, and plains, as Figure 10 shows, in which the removed points are marked in red.
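A test set of this kind can be prepared along the lines of the sketch below, which blanks out a few non-overlapping runs of consecutive points in a complete series. It is only an illustration: the function name `remove_consecutive`, the random placement, the `margin` parameter, and the requirement of at least one observed point between gaps are assumptions, and the sketch does not steer gaps toward peaks, troughs, or plains.

```python
import random


def remove_consecutive(series, gap_length, n_gaps, seed=0, margin=1):
    """Blank out `n_gaps` non-overlapping runs of `gap_length` points.

    Returns the masked copy (missing points set to None) and the sorted
    start indices. `margin` points are always kept at both ends.
    """
    rng = random.Random(seed)
    masked = list(series)
    starts = []
    attempts = 0
    while len(starts) < n_gaps:
        attempts += 1
        if attempts > 10_000:
            raise ValueError("could not place the requested gaps")
        start = rng.randrange(margin, len(series) - gap_length - margin + 1)
        # Keep at least one observed point between neighboring gaps.
        if all(abs(start - s) > gap_length for s in starts):
            starts.append(start)
    for start in starts:
        for i in range(start, start + gap_length):
            masked[i] = None
    return masked, sorted(starts)


# Hypothetical subgroup of 200 points with three gaps of 3 points each.
subgroup = [float(i) for i in range(200)]
masked, starts = remove_consecutive(subgroup, gap_length=3, n_gaps=3, seed=1)
print(starts, masked.count(None))
```

The removed values are kept in the original series, so the imputed points can be compared against them afterwards.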
(1) BFG Consumption by Hot Blast Stove (Periodicity Like)
The experimental data come from the number 2 hot blast stove at Shanghai Baosteel and were randomly selected from the period 14:28:00/13/08/2010 to 21:55:00/14/08/2010. The data are divided into 3 groups, each of which is further divided into 3 subgroups of 200 points. The accuracies of the imputation results are presented in Table 2.
The results of both the original GTD and the proposed iGTD are considerably more accurate than those of the other methods when consecutive data are missing. Furthermore, iGTD is generally more effective than GTD. It can therefore be concluded that iGTD, which employs the Gaussian membership function, obtains better imputation results than GTD with its triangular membership function.
(2) BFG Consumption by Coke Oven (Concussive)
The experimental data come from the number 1 coke oven at Shanghai Baosteel and were randomly selected from the period 07:14:00/14/08/2010 to 14:35:00/15/08/2010. The data are grouped in the same way as in the validation for periodicity-like data. The corresponding imputation accuracies are listed in Table 3. The experimental results show that all five methods achieve almost the same imputation accuracy, with EM being the best of the five. It is worth noting, however, that the proposed iGTD still performs better than GTD in this test.
(3) BFG Generation Amount (Normal Time Series)
The experimental data come from the number 1 blast furnace at Shanghai Baosteel and were randomly selected from the period 02:28:00/27/03/2010 to 18:33:00/01/04/2010. The comparative accuracies are listed in Table 4.
From Table 4, we find that the regression method performs worst, while iGTD performs best. For data with the properties of a normal time series, iGTD is better than GTD, while GTD outperforms the other three methods.
It can be concluded from Tables 2–4 that the proposed iGTD and GTD are superior to regression, EM, and spline for the periodicity-like data and the normal time-series-like data. For the data with concussive amplitude, neither iGTD nor GTD has an advantage, yet iGTD still outperforms GTD, which means that the proposed Gaussian membership function is superior to the triangular one in the real industrial manufacturing process. To visualize the imputation results for BFG generation and consumption, randomly chosen comparative imputation curves are shown in Figures 11, 12, and 13, where the advantage of the proposed method can be clearly seen.
5. Conclusions
This study addresses the imputation of missing gas flow data in the steel industry. To improve the imputation accuracy, the proposed iGTD replaces the triangular membership function with a Gaussian one. The order of imputation is also discussed. The verification experiments show that the BSTM order produces less error than the chronological order, since more observed data are utilized. Compared with the original GTD, EM, regression, and spline, the proposed iGTD has advantages for problems characterized by consecutively missing data and small samples. The resulting imputation accuracy provides strong support for the subsequent scheduling of gas resources. On the other hand, although the approach developed in this study can handle some types of missing data in real industry, further theoretical analysis and extended applications, for example to concussive flow data, require further consideration in the future.
This work is supported by the National Natural Science Foundation of China (no. 61034003, no. 61104157, and no. 61273037) and the Fundamental Research Funds for the Central Universities of China (no. DUT11RC(3)07). The cooperation of the energy center of Shanghai Baosteel Co. Ltd, China, in this work is greatly appreciated.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2006.
A. F. de Winter, A. J. Oldehinkel, R. Veenstra, J. A. Brunnekreef, F. C. Verhulst, and J. Ormel, “Evaluation of non-response bias in mental health determinants and outcomes in a large sample of pre-adolescents,” European Journal of Epidemiology, vol. 20, no. 2, pp. 173–181, 2005.
J. M. Lepkowski, W. D. Mosher, K. E. Davis, R. M. Groves, and J. Van Hoewyk, “The 2006–2010 National Survey of Family Growth: sample design and analysis of a continuous survey,” Vital and Health Statistics, no. 150, pp. 1–36, 2010.
A. L. Bello, “Imputation techniques in regression analysis: looking closely at their implementation,” Computational Statistics and Data Analysis, vol. 20, no. 1, pp. 45–57, 1995.
D. A. Newman, “Longitudinal modeling with randomly and systematically missing data: a simulation of Ad Hoc, maximum likelihood, and multiple imputation techniques,” Organizational Research Methods, vol. 6, no. 3, pp. 328–362, 2003.
S. Van Buuren, H. C. Boshuizen, and D. L. Knook, “Multiple imputation of missing blood pressure covariates in survival analysis,” Statistics in Medicine, vol. 18, pp. 681–694, 1999.
D. Li, L. Chen, and Y. Lin, “Using Functional Virtual Population as assistance to learn scheduling knowledge in dynamic manufacturing environments,” International Journal of Production Research, vol. 41, no. 17, pp. 4011–4024, 2003.
K. Goto, H. Okabe, F. A. Chowdhury, S. Shimizu, Y. Fujioka, and M. Onoda, “Development of novel absorbents for CO2 capture from blast furnace gas,” International Journal of Greenhouse Gas Control, vol. 5, no. 5, pp. 1214–1219, 2011.
H. Jaeger, “A tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the “Echo State Network” approach,” GMD Report 159, German National Research Center for Information Technology, Berlin, Germany, 2002.
H. Jaeger, “Adaptive nonlinear system identification with echo state networks,” Advances in Neural Information Processing Systems, vol. 15, pp. 593–600, 2003.
E. P. Zhou and D. K. Harrison, “Improving error compensation via a fuzzy-neural hybrid model,” Journal of Manufacturing Systems, vol. 18, no. 5, pp. 335–344, 1999.