Dynamic Prediction Research of Silicon Content in Hot Metal Driven by Big Data in Blast Furnace Smelting Process under Hadoop Cloud Platform

Han, Yang; Li, Jie; Yang, Xiao-Lei; Liu, Wei-Xing; Zhang, Yu-Zhu

doi:https://doi.org/10.1155/2018/8079697

Complexity

Special Issue: Complexity Problems Handled by Big Data Technology

Research Article | Open Access

Volume 2018 | Article ID 8079697 | https://doi.org/10.1155/2018/8079697

Yang Han^1,2, Jie Li^2,3, Xiao-Lei Yang^1,2, Wei-Xing Liu^2,3, and Yu-Zhu Zhang^2,3

Academic Editor: Kaoru Ota

Received 02 May 2018

Accepted 25 Jul 2018

Published 15 Oct 2018

Abstract

To obtain a dynamic prediction model of the silicon content [Si] in molten iron with good generalization performance, an improved SVM algorithm is proposed that remains practical on the big data sample sets of the smelting process. First, a parallelization scheme based on the MapReduce model is designed for the SVM solution algorithm on a Hadoop platform, to speed up SVM training on big data sample sets. Second, exploiting the property of stochastic subgradient projection that the execution time of the SVM solver does not depend on the size of the sample set, a structured SVM algorithm based on the affinity propagation clustering algorithm is proposed; on this basis, a parallel algorithm for computing the covariance matrix of the training set and a parallel algorithm for the stochastic subgradient projection iteration are designed. Finally, the historical production big data of the No. 1 blast furnace at Tangshan Iron Works II from December 1, 2015, to November 30, 2016, are analyzed using the reaction mechanism, the control mechanism, and a gray correlation model of the blast furnace iron-making process to construct a sample set of inputs and outputs. A dynamic prediction model of the [Si] content in molten iron and a dynamic prediction model of the [Si] fluctuation are then obtained on the Hadoop platform with the structured and parallelized SVM solving algorithm. The results show that the structured and parallelized SVM algorithm achieves hit rates of 91.2% for the dynamic prediction of the [Si] content value and 92.2% for the dynamic prediction of its rise and fall. The two dynamic prediction algorithms based on structuring and parallelization are 54 times and 5 times faster than traditional serial solving algorithms.

1. Introduction

Smooth blast furnace operation and hot metal quality are the primary goals of iron-making process control. The silicon content of hot metal is an important indicator of slag quality, tapping temperature, and hot metal quality, and the degree to which it fluctuates is also an important parameter for blast furnace operation. It is therefore necessary to construct a precise dynamic prediction model of the silicon content in molten iron. The support vector machine (SVM) is a widely used machine learning algorithm whose core ideas are the maximum margin theory [1] and the structural risk minimization principle [2]. The algorithm solves a convex quadratic programming problem on the training samples to obtain the optimal hyperplane. The traditional Newton iteration method [3] and the interior point method [4] can be used to solve the SVM; both are theoretically well established and widely used. However, they must access the entire Hessian matrix, which greatly reduces the solution speed. Researchers have therefore studied SVM solving algorithms in depth and proposed many classical algorithms, such as the chunking algorithm, the decomposition algorithm, and the sequential minimal optimization algorithm [5–7]. Although these algorithms improve the solving efficiency of SVM to some extent, the amount of data generated in the era of big data is extremely large and keeps growing, which makes classic serial SVM solving algorithms impractical for big data sample sets. In this setting, the bottleneck of SVM is set by memory and computation time.

To overcome this bottleneck, researchers have developed parallelization strategies based on the iterative principle and convergence speed of each algorithm. Among the many parallel computing environments, cloud computing is a mainstream research platform that grew out of distributed computing [8], parallel computing [9], grid computing [10], and virtualization [11]. Hadoop is an open-source project of the Apache Software Foundation that provides a highly reliable and scalable environment for distributed computing; one of its most convenient features is the ability to build large-scale cluster systems from ordinary PCs. Ensemble learning [12] trains different base classifiers with a specific learning method and then combines their outputs according to an integration strategy, which can give a better classification result than a single classifier. MapReduce, one of the cores of the Hadoop distributed system, is mainly responsible for decomposing and integrating data and computing modules. It is therefore natural to develop ensemble learning algorithms on the Hadoop cloud computing platform.

SVM, as a machine learning algorithm with a solid foundation of mathematical theory and excellent generalization performance, should use the Hadoop cloud computing platform and parallel computing to break through its bottleneck in the processing of big data sets and promote the application scope of the algorithm.

Xiaole et al. [13] designed a partitioned parallel version of the classical serial interior point algorithm for SVM; Feng et al. [14] applied a self-adaptive parallel subgradient projection algorithm to reconstruct compressed sensing signals and accelerated convergence by adjusting the expansion coefficient; Dong et al. [15] improved the efficiency of the SVM algorithm with a parallel $LDL^T$ matrix decomposition; and Zhao et al. [16] used the MapReduce programming model to implement the SMO algorithm for min-max modular SVM. The stochastic gradient projection algorithm solves the objective function by stochastic gradient descent: in each iteration, a training sample is selected at random to compute the gradient of the objective function, and a step of preset size is taken in the opposite direction. The running time of the algorithm is $\tilde{O}(d/(\lambda\varepsilon))$, where $d$ bounds the number of nonzero features in each sample, $\lambda$ is the regularization parameter, and $\varepsilon$ is the precision of the solution. Stochastic gradient descent has the lowest optimization efficiency among many serial SVM solving algorithms, but it can produce the best generalization performance. Shalev-Shwartz [17] pointed out that the number of iterations required by stochastic gradient projection as an SVM solver is $\tilde{O}(1/\varepsilon)$, where $\varepsilon$ is the precision of the solution; the execution time does not depend on the size of the sample set, so a large-scale cluster system can accelerate the algorithm.

The stochastic subgradient projection algorithm is a typical algorithm for convex quadratic programming problems and is highly representative of SVM solvers. Based on an in-depth analysis of the SVM solution process and the stochastic subgradient projection algorithm, this paper designs a parallel SVM algorithm that uses stochastic subgradient projection and takes the structure of the sample data into account, implemented with the MapReduce model on the Hadoop cloud computing platform. The algorithm is applied to the big historical data produced during blast furnace operation to obtain an efficient dynamic prediction model of [Si] in molten iron.

2. Foundation

2.1. Hadoop Framework

Hadoop is a distributed system infrastructure framework with the advantages of high reliability, high efficiency, and free scalability. It is a software platform for the distributed processing of big data [18–20]. Its key components are MapReduce and the Hadoop Distributed File System (HDFS): the former is a distributed computing technology that is primarily responsible for computation, and the latter is a file system that is primarily responsible for distributed data storage.

The core idea of MapReduce [21, 22] is to decompose a large-scale dataset into several small datasets; multiple child nodes managed by one master node then operate on the small datasets, and the results of the child nodes are aggregated to obtain the result for the whole dataset. In a MapReduce cluster, an application submitted for execution is called a “job” and is managed by a JobTracker. The work units decomposed from a job are called “tasks”; they run on the corresponding compute nodes under several TaskTrackers. The JobTracker coordinates MapReduce work and is the link between the application program and Hadoop: it determines the files to be processed, assigns tasks to nodes, and monitors them. Each TaskTracker independently executes its specific tasks and reports to the JobTracker. If the JobTracker does not receive the information from a TaskTracker in time, that TaskTracker is deemed to have crashed, and its tasks are deemed failed and reassigned to other nodes.

The MapReduce programming model takes key-value pairs as its input, and its execution can be seen as the conversion of one batch of key-value pairs into another batch of key-value pairs [23]. The MapReduce framework splits the input data into segments according to certain rules and parses them into a batch of key-value pairs. In the Map phase, the user-supplied Map function processes each key-value pair and generates a series of intermediate key-value pairs. In the Reduce phase, the intermediate key-value pairs with the same key are aggregated by the Reduce function to obtain the final result. The specific process is shown in Figure 1. As Figure 1 shows, users who write parallel programs with the MapReduce programming model only need to implement the Map and Reduce functions. They do not need to consider the details of parallel execution, which reduces the complexity of parallel programming and makes it easier.
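The key-value flow described above can be sketched in a few lines of single-process Python. This is only an illustration of the Map, shuffle, and Reduce stages, not Hadoop code; the names `map_fn`, `reduce_fn`, and `run_mapreduce` and the word-count task are hypothetical choices made for the example.

```python
from collections import defaultdict

def map_fn(key, value):
    # Hypothetical Map function: emit one (word, 1) pair per word in a line.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Hypothetical Reduce function: aggregate all values that share a key.
    return key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: turn input key-value pairs into intermediate pairs,
    # grouping them by key as the shuffle step would.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: one call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

records = [(0, "hot metal silicon"), (1, "hot metal")]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

In a real cluster the grouping step is performed by the framework across nodes; here it is a local dictionary so that the data flow stays visible.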

HDFS is the foundation of data storage management in distributed computing and has the advantages of high reliability, strong scalability, and high throughput. The premises and goals of the system design are as follows [24–27]:
(1) HDFS runs on commodity hardware in which any component may fail, so error detection and quick recovery are core goals of HDFS.
(2) Data on HDFS are read as massive data streams, so high throughput is required when accessing data.
(3) Large-scale datasets require HDFS to support large file storage and to provide high data transmission bandwidth.
(4) HDFS uses a write-once, read-many access model for files, which simplifies data consistency and enables high-throughput data access.
(5) Moving computation is cheaper than moving data. Running a computing task close to the data it operates on reduces network congestion and improves system throughput.
(6) HDFS should be portable across heterogeneous hardware and software platforms.

HDFS uses a master-slave architecture (see Figure 2) with one NameNode and multiple DataNodes. The NameNode manages the metadata of the file system and handles operations such as creating, opening, deleting, and renaming files and directories. Each file is broken down into data blocks that are stored on DataNodes; every block is replicated, and the replicas are stored on different DataNodes to achieve fault tolerance and disaster tolerance. By maintaining a set of data structures, the NameNode records the blocks into which every file is divided, the status of each DataNode, and other important information.

HDFS uses a file access model in which data are written once and read many times, which gives HDFS high-throughput data access. In addition, HDFS has a variety of reliability measures: it stores multiple replicas of the data to ensure data stability, and it uses DataNode heartbeat detection, safe mode, block reports, and a space reclamation mechanism to ensure reliability.

2.2. Support Vector Machine

The support vector machine is a machine learning algorithm based on the VC dimension theory and the structural risk minimization principle of statistical learning [28–30]. It is mainly used for classification. For a two-class problem, its main goal is to construct an optimal decision function (“optimal” in the sense of the maximum margin theory) that classifies the test data as accurately as possible. For the two-class problem with training set $T = \{(x_i, y_i)\}_{i=1}^{n}$ (where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$), three cases are discussed: SVM under linearly separable conditions, SVM under not completely linearly separable conditions, and SVM under linearly inseparable conditions.

If there is a hyperplane that correctly separates the samples labeled +1 from those labeled −1, the dataset is called linearly separable. The purpose of applying SVM is to find the two parallel boundary hyperplanes with the largest margin between the two classes. The hyperplane can be described by $w^T x + b = 0$, where $w$ is the normal vector of the classification hyperplane and $b$ is the offset. In Figure 3, the blue empty circle indicates category 1, the black empty circle represents category 2, the blue solid circle represents the boundary of category 1, and the black solid box represents the boundary of category 2. The samples on the two boundaries are called support vectors, and the midline between the boundaries (black solid line) is the desired optimal hyperplane.

Based on the maximum margin theory, the SVM model can be described by Formula (1). This formula is the primal problem of SVM, a typical convex quadratic programming problem with linear constraints. It uniquely determines the maximum margin classification hyperplane, and its Lagrangian function can be described by Formula (2).

In Formula (2), $\alpha_i$ is the Lagrange multiplier corresponding to each sample. Minimizing the Lagrangian $L(w, b, \alpha)$ with respect to $w$ and $b$, the stationarity conditions $\partial L / \partial w = 0$ and $\partial L / \partial b = 0$ give $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$. Substituting these into Formula (2) yields the dual of the primal SVM problem, which can be described by Formula (3).

The optimal decision function of SVM can be described by Formula (4), in which $(w^*, b^*)$ is the optimal solution of (1) and $\alpha^*$ is the optimal solution of (3).
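For reference, the standard hard-margin forms that Formulas (1) to (4) correspond to can be written as below. This is a sketch in the usual textbook notation with $n$ training samples $(x_i, y_i)$; the exact symbols are assumptions where the text does not fix them.

```latex
\begin{align}
\min_{w,b}\ & \tfrac{1}{2}\|w\|^{2}
  \quad \text{s.t.}\quad y_i\,(w^{T}x_i+b)\ \ge\ 1,\quad i=1,\dots,n \tag{1}\\
L(w,b,\alpha)\ &=\ \tfrac{1}{2}\|w\|^{2}
  -\sum_{i=1}^{n}\alpha_i\bigl[\,y_i\,(w^{T}x_i+b)-1\,\bigr] \tag{2}\\
\max_{\alpha}\ & \sum_{i=1}^{n}\alpha_i
  -\tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,x_i^{T}x_j
  \quad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i=0,\ \ \alpha_i\ge 0 \tag{3}\\
f(x)\ &=\ \operatorname{sgn}\Bigl(\sum_{i=1}^{n}\alpha_i^{*}\,y_i\,x_i^{T}x+b^{*}\Bigr) \tag{4}
\end{align}
```

Formula (3) follows from (2) by substituting $w=\sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$; only samples with $\alpha_i^{*} > 0$ (the support vectors) contribute to (4).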

Because of noise or other factors in the acquisition process, the classes of a sample set may not be completely separable by a hyperplane [31, 32], even though the vast majority of the sample points can be divided correctly. For this case, slack variables $\xi_i \ge 0$ are introduced so that an optimal hyperplane can be constructed with only a few samples misclassified. The penalty term is the product of the penalty factor $C$ and the sum of the slack variables $\xi_i$, so the primal problem (1) becomes Formula (5).

The purpose of introducing the penalty term is to enhance the fault tolerance of the SVM classifier. The larger the penalty factor $C$ is, the heavier the penalty on misclassified samples. With the hinge loss (or another loss function, such as the 0–1 loss or the quadratic loss), problem (5) for the not completely linearly separable case can be written in the equivalent form (6). The dual problem of (5) is also easy to obtain.
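The relation between the slack-variable form and the hinge-loss form can be made explicit. The following is a sketch of the standard soft-margin formulation, assumed here to be what Formulas (5) and (6) denote, with penalty factor $C$ and slack variables $\xi_i$.

```latex
\begin{align}
\min_{w,b,\xi}\ & \tfrac{1}{2}\|w\|^{2}+C\sum_{i=1}^{n}\xi_i
  \quad \text{s.t.}\quad y_i\,(w^{T}x_i+b)\ \ge\ 1-\xi_i,\ \ \xi_i\ge 0 \tag{5}\\
\min_{w,b}\ & \tfrac{1}{2}\|w\|^{2}
  +C\sum_{i=1}^{n}\max\bigl\{0,\ 1-y_i\,(w^{T}x_i+b)\bigr\} \tag{6}
\end{align}
```

For fixed $(w, b)$ the smallest feasible slack in (5) is $\xi_i = \max\{0,\ 1 - y_i(w^{T}x_i+b)\}$, which is exactly the hinge loss, so (5) and (6) have the same minimizers.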

If the sample set is not linearly separable, the original sample vectors can be mapped from the original space to a higher-dimensional feature space with a nonlinear function $\phi(x)$, as shown in Figure 4, so that the samples become linearly separable in the feature space. The optimal hyperplane is then solved in the feature space according to the maximum margin theory. However, solving for the optimal hyperplane in a high-dimensional space increases the computational complexity and can run into the curse of dimensionality. According to the Hilbert-Schmidt principle from statistical learning theory, it is unnecessary to know the specific form of $\phi(x)$ if a kernel function $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is used. The kernel function of $\phi$ is easier to obtain than $\phi$ itself, and $\phi$ never appears alone in the concrete solution but always in the form $\phi(x_i)^T \phi(x_j)$. In this case, the primal problem of SVM can be described by (7), and the corresponding dual problem can be described by (8).
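A small numerical check makes the kernel idea concrete. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs, chosen here only as an illustration and not taken from the paper, and verifies that evaluating the kernel gives the same value as an inner product in the explicit feature space. The names `poly_kernel` and `phi` are hypothetical.

```python
import math

def poly_kernel(x, z):
    # Kernel evaluated in the original space: K(x, z) = (x . z)^2.
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    # Explicit feature map for 2-D inputs that corresponds to poly_kernel:
    # phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2).
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly_kernel(x, z)                               # no feature map needed
k_explicit = sum(a * b for a, b in zip(phi(x), phi(z)))      # inner product in feature space
```

Both quantities agree up to rounding, which is why the dual problem only ever needs kernel values and never the feature map itself.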

3. SVM Solution Algorithm Based on Pegasos

Solving an SVM is essentially a quadratic programming problem, on which a great deal of research has been done. The two most typical approaches are the interior point method and the decomposition method.

The interior point method [33] replaces the inequality constraints with a penalty function and then optimizes the solution with Newton iterations. It works on the quadratic dual problem of SVM, which involves many optimization variables, and the Newton method updates all of these variables simultaneously, step by step. However, the time complexity of the interior point method for SVM problems is $O(n^3)$ and the space complexity is $O(n^2)$, where $n$ is the number of training samples. With a large number of training samples, solving an SVM problem with the interior point method is therefore very difficult. Moreover, the high optimization accuracy of the interior point method does not mean that the generalization accuracy is also improved.

The decomposition method [34, 35]: sequential minimal optimization (SMO) is an effective method for solving the SVM problem, proposed by Platt in 1998. Keerthi improved the SMO algorithm and gave a detailed theoretical proof of its convergence. The improved SMO algorithm decomposes the quadratic programming problem of SVM into a series of subproblems. Under the constraint $\sum_i \alpha_i y_i = 0$, the size of the working set is reduced to the minimum of two: in each step, two Lagrange multipliers that violate the KKT conditions are chosen heuristically, the other parameters are fixed, the optimal values of the two multipliers are obtained analytically, and the solution is updated. This process is repeated until all Lagrange multipliers satisfy the conditions. Although the decomposition method effectively overcomes the excessive memory requirement of the interior point method, it has to solve the dual problem, which makes its convergence significantly slower than solving the primal problem directly.

Pegasos is a gradient-based algorithm that solves the primal SVM problem directly. Stochastic gradient descent steps and projection steps alternate during the iteration, and in each round a number of samples are drawn from the whole training set to compute the subgradient.

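The SVM problem is described as

$$\min_{w}\ \frac{\lambda}{2}\|w\|^{2} + \frac{1}{m}\sum_{(x,y)\in S} \ell(w; (x,y)) \quad (9)$$

In Formula (9),

$$\ell(w; (x,y)) = \max\{0,\ 1 - y\langle w, x\rangle\} \quad (10)$$

Denote the objective of Formula (9) by $f(w)$. If

$$f(\hat{w}) \le \min_{w} f(w) + \varepsilon \quad (11)$$

then $\hat{w}$ is a solution of accuracy $\varepsilon$.

Pegasos requires $\tilde{O}(1/\varepsilon)$ iterations, where $\varepsilon$ is the accuracy of the solution obtained. The execution time of the algorithm therefore does not depend directly on the size of the sample set, which makes it suitable for large-scale classification problems.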

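The Pegasos algorithm (Algorithm 1) is described as follows:

Input: Training data set $S = \{(x_i, y_i)\}_{i=1}^{m}$, number of iterations $T$, number of samples $k$ selected per iteration;
Output: The normal vector $w_{T+1}$ of the hyperplane.
Main procedure:
1. Initialization: arbitrarily select a vector $w_1$ satisfying $\|w_1\| \le 1/\sqrt{\lambda}$;
2. For $t = 1, 2, \ldots, T$:
2.1 Select $k$ samples from the training set $S$ to form the subset $A_t \subseteq S$, and replace the objective function with:

$$f(w; A_t) = \frac{\lambda}{2}\|w\|^{2} + \frac{1}{k}\sum_{(x,y)\in A_t} \ell(w; (x,y))$$

2.2 Determine the learning rate of the gradient descent method, $\eta_t = 1/(\lambda t)$;
2.3 Collect the samples in $A_t$ whose current loss under $w_t$ is nonzero into a new subset:

$$A_t^{+} = \{(x,y)\in A_t : y\langle w_t, x\rangle < 1\}$$

The subgradient direction of the objective function can be expressed as:

$$\nabla_t = \lambda w_t - \frac{1}{k}\sum_{(x,y)\in A_t^{+}} y\,x$$

2.4 Update:

$$w_{t+1/2} = w_t - \eta_t \nabla_t$$

2.5 Projection step:

$$w_{t+1} = \min\left\{1,\ \frac{1/\sqrt{\lambda}}{\|w_{t+1/2}\|}\right\} w_{t+1/2}$$

3. Get the final result $w_{T+1}$.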
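The steps of the Pegasos procedure can be sketched in plain Python as follows. This is a minimal, serial illustration of the standard mini-batch Pegasos update for a linear SVM without a bias term, not the authors' implementation; the function names, the toy data, and the parameter values (`lam=0.1`, `T=200`, `k=10`) are assumptions made for the example.

```python
import math
import random

def pegasos(samples, lam, T, k, seed=0):
    """Mini-batch Pegasos; samples is a list of (x, y) with y in {+1, -1}."""
    rng = random.Random(seed)
    d = len(samples[0][0])
    w = [0.0] * d                        # w_1 = 0 lies inside the ball of radius 1/sqrt(lam)
    radius = 1.0 / math.sqrt(lam)
    for t in range(1, T + 1):
        batch = rng.sample(samples, k)   # step 2.1: random subset A_t
        eta = 1.0 / (lam * t)            # step 2.2: learning rate
        # step 2.3: samples in A_t with nonzero hinge loss under the current w
        violators = [(x, y) for x, y in batch
                     if y * sum(wi * xi for wi, xi in zip(w, x)) < 1.0]
        # subgradient: lam * w - (1/k) * sum of y * x over the violators
        grad = [lam * wi for wi in w]
        for x, y in violators:
            for j in range(d):
                grad[j] -= y * x[j] / k
        w = [wi - eta * gi for wi, gi in zip(w, grad)]    # step 2.4: update
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > radius:                                 # step 2.5: projection
            w = [wi * radius / norm for wi in w]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Toy data (assumed): two well-separated clouds around (2, 2) and (-2, -2).
data_rng = random.Random(1)
toy = [((2 + data_rng.uniform(-1, 1), 2 + data_rng.uniform(-1, 1)), 1) for _ in range(50)]
toy += [((-2 + data_rng.uniform(-1, 1), -2 + data_rng.uniform(-1, 1)), -1) for _ in range(50)]
w = pegasos(toy, lam=0.1, T=200, k=10)
accuracy = sum(predict(w, x) == y for x, y in toy) / len(toy)
```

Each iteration touches only `k` samples, so the cost per iteration is independent of the size of the training set, which is the property the parallel design relies on.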

4. Structure and Parallelization of the Pegasos Algorithm Based on AP Clustering

The convergence of the Pegasos algorithm as an SVM solver does not depend on the number of samples $k$ selected in each iteration but is constrained by the number of iterations $T$. When the product $kT$ is fixed at an appropriate value, each of the two parameters can vary within a certain interval without affecting the convergence of the algorithm. In this study, the number of samples $k$ selected per iteration and the maximum number of iterations $T$ are chosen so that $kT$ is between 10⁵ and 10⁶. When dealing with large-scale samples, selecting as many samples as possible in each round therefore reduces the number of iterations the algorithm requires. Accordingly, the key steps in parallelizing the Pegasos algorithm are as follows [36–38]: the random selection of samples and the computation of the subgradient direction of the objective function are broken down by MapReduce and run on multiple machines in parallel. Each step of the gradient projection iteration is the same as steps 2.1–2.5 in Algorithm 1. However, this parallel approach does not take the structural information of the samples into account. To use the sample structure information effectively and obtain a more reasonable optimal SVM classification hyperplane, this paper proposes a parallel structured SVM algorithm based on Pegasos.

The covariance matrix of the data captures, in a statistical sense, the trend with which the data are generated and can effectively reflect the structural information of the data. The clustering information of the data is also important for reflecting the data structure. This paper uses the affinity propagation (AP) clustering algorithm [39] to provide a criterion for the structural information of the sample data, and a parallel version of the AP algorithm is designed to speed up the processing of big data samples.

4.1. AP Clustering Algorithm and Its Parallelization

The clustering mechanism of the AP clustering algorithm is “message passing”: by passing messages between data points, the final cluster centers are identified. Two quantities are involved, the degree of attractiveness (responsibility) and the degree of attribution (availability). The larger the attractiveness, the more likely the candidate cluster center becomes an actual cluster center; the larger the attribution, the more likely the sample point belongs to that cluster center. The message passing mechanism is implemented by iteratively updating the attractiveness matrix $R = [r(i,k)]$ and the attribution matrix $A = [a(i,k)]$. The specific update rules are as follows [40–42]:
(1) The attractiveness matrix $R$ is updated according to the attribution matrix $A$ and the similarity matrix $S$:

$$r(i,k) \leftarrow s(i,k) - \max_{k' \ne k}\{a(i,k') + s(i,k')\} \quad (12)$$

(2) The attribution matrix $A$ is updated according to the attractiveness matrix $R$:

$$a(i,k) \leftarrow \min\Big\{0,\ r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\Big\},\quad i \ne k \quad (13)$$

$$a(k,k) \leftarrow \sum_{i' \ne k} \max\{0, r(i',k)\} \quad (14)$$

In Formulas (12), (13), and (14), $s(i,k)$ represents the similarity between point $i$ and point $k$. The diagonal value $s(k,k)$ is the preference of point $k$: the larger it is, the more likely point $k$ is to be chosen as an actual cluster center. At initialization, the same value is assigned to all points, denoted as parameter $p$. $r(i,k)$ represents the degree of attraction of point $k$ for point $i$, and $a(i,k)$ represents the degree of attribution of point $i$ to point $k$. For point $i$, the point $k$ that maximizes $a(i,k) + r(i,k)$ is its cluster center; if $k = i$, then point $i$ is itself a cluster center.

Following the basic idea of the AP clustering algorithm, this paper designs a distributed AP algorithm based on MapReduce to deal effectively with large-scale clustering problems. The implementation (see Figure 5) is divided into three stages: the first stage consists of Mapper1 and Reducer1, the second stage consists of Mapper2, and the third stage consists of Mapper3 and Reducer3. Based on the MapReduce model, the algorithm first applies AP clustering to sparsify the data and then applies AP clustering again to the resulting cluster representatives. This design preserves the clustering quality of the original AP algorithm while improving its efficiency and its adaptability to data of different sizes.
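The message-passing updates can be illustrated with a short serial sketch. It implements the standard affinity propagation updates with damping, which is a common stabilizing addition and not something stated in the text; the 1-D toy points, the preference value `-10.0`, and the function name are assumptions for the example.

```python
def affinity_propagation(S, damping=0.5, iters=100):
    """Return, for each point i, the index k maximizing a(i,k) + r(i,k)."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]   # attractiveness (responsibility)
    A = [[0.0] * n for _ in range(n)]   # attribution (availability)
    for _ in range(iters):
        # Update r(i,k) from A and S.
        for i in range(n):
            for k in range(n):
                best_other = max(A[i][kk] + S[i][kk] for kk in range(n) if kk != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # Update a(i,k) and a(k,k) from R.
        for i in range(n):
            for k in range(n):
                support = sum(max(0.0, R[ii][k]) for ii in range(n) if ii != i and ii != k)
                new = support if i == k else min(0.0, R[k][k] + support)
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]

# Toy data (assumed): two groups of three points on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
preference = -10.0   # shared value p placed on the diagonal s(k,k)
S = [[preference if i == k else -(points[i] - points[k]) ** 2
      for k in range(len(points))] for i in range(len(points))]
labels = affinity_propagation(S)
```

On this toy input the middle point of each group ends up as the cluster center for its group. A lower preference `p` generally yields fewer cluster centers, and a higher one yields more.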

4.2. Pegasos Algorithm of Structured SVM

The model of structured SVM can be described by Formula (15).

In Formula (15), the loss function is balanced against the remaining terms by two regularization factors.

In the Pegasos algorithm, $w$ is projected onto the set $\{w : \|w\| \le 1/\sqrt{\lambda}\}$ after each iteration. Once structural information is embedded in the Pegasos algorithm, the projection range changes, so the original projection range is no longer applicable. The constraint satisfied by the optimal solution of the structured SVM changes accordingly and is expressed in Formula (16), in which $\Sigma$ is the covariance matrix of the samples, which carries the structural information, and $I$ is the identity matrix.

Let $w^*$ be the optimal solution of (15). The left side of the constraint in (16) can then be transformed into (17). The dual of the optimization problem (15) can be expressed by (18), and its optimal solution is $\alpha^*$. By the strong duality theorem, the objective value of the primal problem equals the objective value of the dual problem. In addition, (19) can be obtained from (17) and then simplified to obtain (20).

The above analysis shows that, once the structural information of the data is embedded in the Pegasos algorithm, the optimal solution of the structured Pegasos algorithm lies in the set $B$ (see Formula (21)). Therefore, after each iteration of the gradient descent procedure, $w$ is projected onto $B$; that is, the normal vector is multiplied by the factor in Formula (22). This step brings $w$ closer to the optimal solution.

Algorithm 2 embeds the structural information, characterized by the covariance matrix of the data samples, into the initial objective function, and thereby improves the original SVM model. The improved SVM algorithm reduces the running time while still meeting the required classification accuracy. It is worth noting that, after the structural information is embedded, the range of the descending subgradient direction and the range of the optimal solution of the Pegasos algorithm both change, but fluctuation within these ranges does not affect the performance of the SVM.

The Pegasos algorithm for structured SVM (Algorithm 2) is described as follows:

Input: Training data set $S$, the number of iterations $T$, the number of samples $k$ for each iteration;
Output: The normal vector $w$ of the classification hyperplane.
Main procedure:
1. Calculate the covariance matrix $\Sigma$ of the samples;
2. Initialization: arbitrarily select an initial vector $w_1$ that lies in the set $B$ of Formula (21);
3. For $t = 1, 2, \ldots, T$:
3.1 Select a subset $A_t$ of $k$ samples from the training set and replace the original objective function with its approximation on $A_t$;
3.2 Determine the learning rate $\eta_t$ of the gradient descent method;
3.3 Collect the samples in $A_t$ whose current loss under $w_t$ is nonzero into a new subset, and compute the subgradient direction of the objective function on it;
3.4 Update: move $w_t$ by a step of size $\eta_t$ against the subgradient direction;
3.5 Projection step: project the updated vector onto $B$ by multiplying it by the factor in Formula (22);
4. Get the final result $w_{T+1}$.

4.3. Structure and Parallelization of the Pegasos Algorithm

Based on the Pegasos algorithm of structured SVM, the parallel processing of the structured Pegasos algorithm is implemented on the Hadoop platform by means of a MapReduce parallel framework model. The algorithm is divided into two stages which are the covariance parallel calculation phase of the data samples and the subgradient projection iterative parallel phase of the data samples. There is a separate MapReduce task for each iteration in the two stages of the calculation process.

When the structural information of the sample data is computed under the MapReduce framework, the training samples are distributed over the data nodes, and the covariance matrix must be solved from the samples scattered on those nodes. The training set is divided into $M$ subsets, denoted by (23), and the variance sum and the mean of each data subset are denoted by (24).

The covariance matrix on the training set can be expressed as Formula (25), which simplifies to Formula (26).

With the aid of Formula (26) and the MapReduce model, the covariance matrix of the whole training set is computed. The specific algorithm (Algorithm 3) is described as follows:

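//Map
Input: The samples on the current node;
Output: The sample count, the mean, and the variance sum of the samples on the current node.
Main procedure:
1. Scan the samples on the current node and accumulate the number of samples of the current node;
2. Calculate the mean of the samples on the current node;
3. Calculate the variance sum of the samples on the current node;
//Reduce
Input: The sample count, the mean, and the variance sum output by each node;
Output: The covariance matrix $\Sigma$ of the whole training set.
Main procedure:
1. Summarize the outputs of the Map nodes;
2. Find the covariance matrix of the whole sample according to Formula (26).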
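The Map and Reduce steps for the covariance matrix can be sketched as below. The per-node statistics (count, mean, and sum of outer products of deviations from the node mean) and the merge rule are the standard way to combine partial covariances; whether they match Formulas (24) and (26) term by term is an assumption, as is the $1/n$ normalization. The function names are hypothetical.

```python
def map_stats(chunk):
    # Per-node statistics: sample count, mean vector, and scatter matrix
    # (sum of outer products of deviations from the node mean).
    n, d = len(chunk), len(chunk[0])
    mean = [sum(x[j] for x in chunk) / n for j in range(d)]
    scatter = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in chunk)
                for b in range(d)] for a in range(d)]
    return n, mean, scatter

def reduce_covariance(stats):
    # Merge node statistics into the covariance matrix of the whole set.
    n_total = sum(n for n, _, _ in stats)
    d = len(stats[0][1])
    mean = [sum(n * m[j] for n, m, _ in stats) / n_total for j in range(d)]
    total = [[0.0] * d for _ in range(d)]
    for n, m, scatter in stats:
        for a in range(d):
            for b in range(d):
                # within-node scatter plus the shift of the node mean
                total[a][b] += scatter[a][b] + n * (m[a] - mean[a]) * (m[b] - mean[b])
    return [[v / n_total for v in row] for row in total]   # assumed 1/n normalization

data = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]
chunks = [data[:2], data[2:]]                 # two hypothetical data nodes
cov_parallel = reduce_covariance([map_stats(c) for c in chunks])
cov_serial = reduce_covariance([map_stats(data)])
```

Only the small per-node summaries travel to the Reduce step, so the communication cost depends on the feature dimension and the number of nodes rather than on the number of samples.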

In conjunction with Algorithm 3, the subgradient projection iteration is designed in parallel, and each iteration is run as a single MapReduce task. The concrete steps of the Map and Reduce algorithms for the $t$-th iteration are as follows:

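//Map
Input: The current normal vector $w_t$, the number of nodes $M$, the number of random samples taken per iteration $k$;
Output: The partial sum $v_j$ obtained on the current node $j$, where $j = 1, \ldots, M$.
Main procedure:
1. Randomly select $k/M$ samples;
2. Define the zero vector $v_j$;
3. For $i = 1$ to $k/M$
If sample $i$ has nonzero loss under $w_t$
Then add its contribution to $v_j$
4. Output the resulting $v_j$
//Reduce
Input: The current number of iterations $t$, the current normal vector $w_t$ and the partial sums $v_j$, the number of nodes $M$, the number of random samples taken per iteration $k$;
Output: $w_{t+1}$.
Main procedure:
1. Calculate the learning rate $\eta_t$;
2. Calculate the subgradient direction from the sum of the $v_j$;
3. Calculate the updated vector (step 3.4 of Algorithm 2);
4. Calculate the projection (step 3.5 of Algorithm 2) to obtain $w_{t+1}$.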
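One iteration of this Map and Reduce split can be sketched as follows. For simplicity the sketch uses the plain Pegasos subgradient and the projection onto the ball of radius $1/\sqrt{\lambda}$, not the structured projection onto the set $B$, whose exact form is not reproduced here. The node data and the names `map_partial_sum` and `reduce_update` are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def map_partial_sum(w, node_samples, per_node, rng):
    # Map: draw k/M samples on this node and sum y * x over those
    # whose hinge loss under the current w is nonzero.
    batch = rng.sample(node_samples, per_node)
    v = [0.0] * len(w)
    for x, y in batch:
        if y * dot(w, x) < 1.0:
            for j, xj in enumerate(x):
                v[j] += y * xj
    return v

def reduce_update(w, partial_sums, t, lam, k):
    # Reduce: combine the partial sums, take the subgradient step, and project.
    eta = 1.0 / (lam * t)
    total = [sum(col) for col in zip(*partial_sums)]
    grad = [lam * wj - sj / k for wj, sj in zip(w, total)]
    w_half = [wj - eta * gj for wj, gj in zip(w, grad)]
    norm = math.sqrt(dot(w_half, w_half))
    radius = 1.0 / math.sqrt(lam)
    scale = min(1.0, radius / norm) if norm > 0 else 1.0
    return [scale * wj for wj in w_half]

rng = random.Random(0)
nodes = [                                    # M = 2 hypothetical data nodes
    [((1.0, 0.0), 1), ((2.0, 0.0), 1)],
    [((-1.0, 0.0), -1), ((-2.0, 0.0), -1)],
]
w_t = [0.0, 0.0]
partials = [map_partial_sum(w_t, node, 2, rng) for node in nodes]   # k/M = 2 per node
w_next = reduce_update(w_t, partials, t=1, lam=1.0, k=4)
```

Because the subgradient is a sum over samples, the partial sums from the nodes can simply be added in the Reduce step, and only one vector per node has to be communicated in each iteration.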

5. Algorithm of the Model and Example Analysis

5.1. Case Analysis and Data Collection

In this study, the large-scale historical production data of the No. 1 blast furnace (3200 m³) of Tangshan Iron and Steel Company from December 1, 2015, to November 30, 2016, were mined in depth to construct a prediction model for the [Si] content of molten iron during the smelting process.
(1) Sample output variable data collection: a total of 26 thousand hot metal samples were collected over the tapping cycles. To build a dynamic prediction model for the [Si] content of molten iron in the smelting process, the [Si] contents of the 26 thousand hot metal samples were arranged in time order.
(2) Sample input variable data collection: the factors affecting the [Si] content of molten iron at the tap hole are the sample input variables. Based on the reaction mechanism and control mechanism of the blast furnace iron-making process, the factors affecting the [Si] content of molten iron are shown in Figure 6. Twenty-four influencing factors X1~X24 that can be obtained in real time were extracted as sample input indices [43–45]. With data collected every 0.5 s, this gives 62.4 million sets of sample input data.

5.2. Sample Set Construction

Throughout this section, [Si] content means the mass fraction of silicon in the molten iron.

The sample set is composed of sample inputs and sample outputs. Based on the reaction mechanism and control mechanism of the blast furnace iron-making process, the 24 indicator factors shown in Figure 6 are extracted as candidate sample inputs. Using the gray relational model in (27) and (28), the correlation degree between each of the 24 input indices and the output [Si]% is calculated, and the input indices are selected according to the size of this correlation degree. Combining the calculated correlation degrees with the control principles of the blast furnace smelting process, the correlation degree threshold is set to 0.870. The gray correlation degrees between the 24 input parameters and [Si]% are shown in Table 1. Based on Table 1, candidate indicators whose gray correlation degree is below the threshold are excluded, and those above the threshold are retained.

In formula (27), is the correlation coefficient, is the differential coefficient, satisfying , , and .

In Formula (28), the gray relation degree is denoted by .
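
The selection of input indices by gray correlation degree can be illustrated with a short Python sketch. It assumes the commonly used Deng form of the gray relational coefficient and a distinguishing coefficient of 0.5, neither of which is confirmed by the text here, and it assumes the series have already been normalized to comparable scales. The function names are hypothetical.

```python
def gray_relational_degrees(reference, candidates, rho=0.5):
    """Return one gray relational degree per candidate series (assumed Deng form)."""
    diffs = [[abs(r - c) for r, c in zip(reference, series)] for series in candidates]
    d_min = min(min(row) for row in diffs)  # global minimum absolute difference
    d_max = max(max(row) for row in diffs)  # global maximum absolute difference
    degrees = []
    for row in diffs:
        if d_max == 0.0:
            degrees.append(1.0)  # every candidate coincides with the reference
            continue
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        degrees.append(sum(coeffs) / len(coeffs))  # mean coefficient = degree
    return degrees


def select_inputs(degrees, threshold=0.870):
    """Indices of candidate inputs whose degree reaches the threshold."""
    return [i for i, g in enumerate(degrees) if g >= threshold]
```

With the [Si]% series as `reference` and the 24 candidate input series as `candidates`, `select_inputs` returns the indices that would be kept at the 0.870 threshold.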

Blast furnace operation is a process with large time delays, so there is a considerable lag between the input indices and the [Si]% data recorded at the same time. To improve the accuracy of the prediction model, the input and output data must be realigned according to this delay so that the most relevant sample set is constructed. In this study, correlation coefficient analysis is used to calculate the degree of influence of each input variable on [Si]% at different time delays. The delays considered are 0T, 1T, 2T, 3T, 4T, and 5T, where T is the hot metal cycle. The smelting period T of the No. 1 (3200 m³) blast furnace in Tangshan Iron and Steel II Factory is 6~8 hours. The correlation coefficient calculation results are shown in Table 2.
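
The delay analysis can be sketched as follows, assuming the Pearson correlation coefficient is the coefficient in question (the text does not name it) and that both series are already sampled once per cycle T. The helper names `pearson` and `best_lag` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # correlation is undefined for a constant series; report 0
    return cov / math.sqrt(var_a * var_b)


def best_lag(inputs, si, max_lag=5):
    """Correlate inputs[i] with si[i + lag] for lag = 0..max_lag cycles."""
    scores = {}
    for lag in range(max_lag + 1):
        x = inputs[: len(inputs) - lag]  # drop the last `lag` inputs
        y = si[lag:]                     # drop the first `lag` outputs
        scores[lag] = pearson(x, y)
    best = max(scores, key=lambda lag: abs(scores[lag]))
    return best, scores
```

Running `best_lag` once per input variable gives, for each variable, the delay (in units of T) at which it is most strongly correlated with [Si]%, which is the information a table such as Table 2 summarizes.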

The 62.4 million sample input records were aggregated by smelting period and matched one-to-one with the [Si] content of the molten iron, giving a sample set of size 26,000. Then x1~x19 are extracted as sample inputs, and the essence sample set (sample size 25,996), in which each sample input is matched to its output, is designed according to the correlation analysis results in Table 2. The correspondence of the essence sample set is as follows:

5.3. Experimental Design

5.3.1. Sample Set

In this paper, a parallel support vector machine algorithm is designed, and the essence sample set is used to build the dynamic prediction model of [Si] content. In chronological order, the first 24,996 samples are selected as the training set, and the following 1000 samples are used as the test set.
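
A chronological split of this kind, with no shuffling, can be sketched in a few lines of Python. The function name is hypothetical, and the sample list is assumed to be sorted by time already.

```python
def chronological_split(samples, n_test=1000):
    """Keep time order: the last n_test samples become the test set."""
    if not 0 < n_test < len(samples):
        raise ValueError("n_test must be between 1 and len(samples) - 1")
    return samples[:-n_test], samples[-n_test:]
```

Keeping the test set strictly after the training set in time avoids evaluating the model on samples that precede its training data.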

5.3.2. Experimental Platform

Experiments were carried out on a personal computer and on a cloud computing platform. The personal computer has a 3.20 GHz processor and 8 GB of memory. The cloud computing platform consists of 1 master node server and 20 slave node servers; each node has an Intel® Xeon® E5620 CPU at 2.40 GHz and runs 64-bit Debian Linux, and the Hadoop version is hadoop-0.20.2.

5.3.3. Parameter Setting

The number of iterations in the Pegasos algorithm is ; the number of samples selected for each iteration is , ; the number of iterations in the parallel Pegasos algorithm is ; the number of samples selected for each iteration is , , and ; the structure of the Pegasos algorithm is applied to AP clustering; and the training sample set on AP clustering parameters is , then we can get four clusters.

5.3.4. Algorithm Evaluation

Algorithm performance indicators fall into two categories. In the simulation of dynamic prediction of the [Si] content value, the predicted hit rate (HR) and the learning time on the training sample set are used. In the simulation of rise-and-fall prediction of the [Si] content, the predicted hit rate HR, the learning time on the training sample set, the true positive rate (TPR), the true negative rate (TNR), and the geometric mean (GM) are used. The predicted hit rate is calculated as in (30), the true positive rate TPR as in (31), the true negative rate TNR as in (32), and the geometric mean GM as in (33).

Formulas (30), (31), (32), and (33): is the number of predicted times less than the error threshold in the [Si] content prediction process, is the total number of samples, is the number of times that the [Si] content actually rises and is predicted to rise, is the number of times that the [Si] content actually rises but is predicted to decrease, is the number of times that the [Si] content actually drops but is forecast to rise, and is the number of times that the [Si] content is actually decreased and the prediction is decreased.
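
The four indicators can be computed as in the sketch below. It assumes the usual definitions (hit rate as the share of predictions with absolute error below the threshold, TPR and TNR from the rise/fall confusion counts, GM as the square root of TPR × TNR); the function names and the handling of empty classes are illustrative choices rather than details taken from this paper.

```python
import math


def hit_rate(predicted, actual, tolerance):
    """Fraction of predictions whose absolute error is below the tolerance."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) < tolerance)
    return hits / len(actual)


def trend_metrics(predicted_up, actual_up):
    """TPR, TNR and their geometric mean for rise (True) / fall (False) labels."""
    pairs = list(zip(predicted_up, actual_up))
    tp = sum(1 for p, a in pairs if a and p)          # actual rise, predicted rise
    fn = sum(1 for p, a in pairs if a and not p)      # actual rise, predicted fall
    fp = sum(1 for p, a in pairs if not a and p)      # actual fall, predicted rise
    tn = sum(1 for p, a in pairs if not a and not p)  # actual fall, predicted fall
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr, math.sqrt(tpr * tnr)
```

The geometric mean is low whenever either class is predicted poorly, so it complements the overall hit rate when rises and falls are unevenly represented.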

5.4. Dynamic Prediction of [Si] Content and Analysis of the Results

Based on the above experimental design, the [Si] content of the blast furnace iron-making process is predicted, and the predicted hit rates of the three algorithms are shown in Table 3. Table 3 shows that the SVM solved by serial Pegasos achieves the best hit rate for [Si] content, but its training time is significantly longer than that of the other two algorithms. The SVM solved by structured parallel Pegasos has a lower hit rate than the serial algorithm, but its training time is significantly shorter. The SVM solved by parallel Pegasos alone, which does not consider the structure of the sample data, has the worst accuracy; its training time is significantly shorter than that of the serial algorithm and roughly the same as that of the structured parallel Pegasos SVM. Therefore, the structured parallel Pegasos algorithm designed in this study is the most practical of the three.

The predictive hit rate, true positive rate, true negative rate, and geometric mean of the three algorithms for predicting the rise and fall of [Si] content in the blast furnace iron-making process are shown in Table 4. The results in Table 4 show that the SVM algorithm designed in this paper is also the more practical choice for predicting the rise and fall of [Si] content.

The distribution of the predicted results around the actual results for blast furnace [Si] content is shown in Figure 7. Figure 7 shows that most predictions fall within ±0.06 of the actual value, an accuracy that meets practical needs. Figure 8 compares the predicted and actual results over time; the model can deliver real-time dynamic [Si] forecasts that provide guidance for actual blast furnace production.

In terms of the evaluation indicators, the structured and parallelized SVM algorithm not only preserves the strong performance of the SVM (a predicted hit rate of 91.2% under the absolute error criterion) but also greatly improves the solution speed (54 times faster than the serial algorithm). In addition, a hit rate of 92.2% was obtained in predicting the dynamic fluctuation of molten iron [Si], with the solution speed improved by 5 times.

It is noteworthy that silicon content prediction is a continuous-value prediction with high requirements on numerical precision. The traditional serial algorithm takes a long time on this task, so the parallel structured SVM algorithm improves the learning speed considerably. Predicting the rise and fall of silicon content is a binary classification problem, which has lower requirements on numerical precision and focuses on pattern recognition; the traditional serial algorithm takes less time here, so the speedup from the parallel structured SVM algorithm is less pronounced. The experimental results show that the reduction in training time is significantly larger for the silicon content prediction algorithm than for the rise-and-fall prediction algorithm.

6. Conclusions

The support vector machine is a machine learning algorithm based on maximum margin theory. Its biggest advantages are that it avoids the curse of dimensionality by means of kernel functions and achieves good generalization performance through the structural risk minimization principle; it is mainly suited to small-sample data. On big data sample sets, however, its performance is less satisfactory, and the solution of the SVM in particular becomes a serious bottleneck. It is therefore necessary to design a parallel SVM algorithm on the Hadoop platform to improve the solution speed. This study examines the stochastic subgradient projection algorithm for SVM and, based on the characteristics of its execution time, designs a stochastic subgradient algorithm based on AP clustering for solving the SVM.

In the algorithmic stage, a data set of size 6240 × 10⁴ × 37 is processed, and the dynamic prediction of blast furnace [Si] content is obtained by applying the structured and parallelized stochastic subgradient algorithm to the essence sample set.

Given its advantages in forecasting accuracy and solution speed, the algorithm is worth promoting for dynamic [Si] forecasting in blast furnace iron-making. The promotion route consists of the following three steps:

Step 1. The SVM parallel algorithm is run on the Hadoop platform to train on the sample data, and the best [Si] content dynamic forecasting model is then applied in practice.

Step 2. Based on the practical results at target customers, new data from the on-site production process are added to the training samples to optimize the dynamic prediction model of the [Si] content.

Step 3. The SVM parallel algorithm designed in this paper is deployed on a Hadoop platform configured for the blast furnace production site. Data collection and the training sample set are updated in real time, so that the [Si] dynamic forecasting model is also provided in real time.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was financially supported in part by the National Natural Science Foundation of China (51504080), in part by the Science and Technology Project of Hebei Education Department (BJ2017021), and in part by the Outstanding Youth Fund Project of North China University of Science and Technology (JQ201705).

References

[1] S. Rosset, J. Zhu, and T. Hastie, “Boosting as a regularized path to a maximum margin classifier,” Journal of Machine Learning Research, vol. 5, no. 4, pp. 941–973, 2004.
[2] V. Vapnik, “Principles of risk minimization for learning theory,” in Advances in Neural Information Processing Systems, pp. 831–838, DBLP, 1992.
[3] B. Kaltenbacher, “A posteriori parameter choice strategies for some Newton type methods for the regularization of nonlinear ill-posed problems,” Numerische Mathematik, vol. 79, no. 4, pp. 501–528, 1998.
[4] K. Koh, S. J. Kim, and S. Boyd, “An interior-point method for large-scale l1-regularized logistic regression,” 2007, http://JMLR.org.
[5] L. W. Sun and J. Li, “BISVM: block-based incremental training algorithm of SVM for very big dataset,” Application Research of Computers, vol. 25, no. 1, pp. 98–100, 2008.
[6] S. Lucidi, L. Palagi, A. Risi, and M. Sciandrone, “A convergent hybrid decomposition algorithm model for SVM training,” IEEE Transactions on Neural Networks, vol. 20, no. 6, pp. 1055–1060, 2009.
[7] S. S. Keerthi and E. G. Gilbert, “Convergence of a generalized SMO algorithm for SVM classifier design,” Machine Learning, vol. 46, no. 1/3, pp. 351–360, 2002.
[8] M. D. Dikaiakos, D. Katsaros, P. Mehra, G. Pallis, and A. Vakali, “Cloud computing: distributed internet computing for IT and scientific research,” IEEE Internet Computing, vol. 13, no. 5, pp. 10–13, 2009.
[9] Z. Fu, X. Sun, Q. Liu, L. Zhou, and J. Shu, “Achieving efficient cloud search services: multi-keyword ranked search over encrypted cloud data supporting parallel computing,” IEICE Transactions on Communications, vol. E98.B, no. 1, pp. 190–200, 2015.
[10] N. Mustafee, “Exploiting grid computing, desktop grids and cloud computing for e‐science: future directions,” Transforming Government: People, Process and Policy, vol. 4, no. 4, pp. 288–298, 2010.
[11] K. Dhanya and S. Preethi, “A virtual cloud computing provider for mobile devices,” International Journal of Advance Research, Ideas and Innovations in Technology, vol. 3, no. 3, pp. 411–414, 2017.
[12] B. Liu, J. Chen, and S. Wang, “Protein remote homology detection by combining pseudo dimer composition with an ensemble learning method,” Current Proteomics, vol. 13, no. 2, pp. 86–91, 2016.
[13] S. Xiaole, L. Jianhua, L. Rui, and L. Xia, “Paralleled optimal power flow algorithm based on auxiliary problem principle and interior point algorithm,” Journal of Xian Jiaotong University, vol. 40, no. 4, p. 468, 2006.
[14] X. Feng, W. Hao, and T. Jianeng, “Compressed sensing reconstruction based on subgradient projection method,” Journal of Convergence Information Technology, vol. 7, no. 6, pp. 9–15, 2012.
[15] J.-x. Dong, A. Krzyzak, and C. Y. Suen, “Fast SVM training algorithm with decomposition on very large data sets,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 4, pp. 603–618, 2005.
[16] J. Zhao, Z. Liang, and Y. Yang, “Parallelized incremental support vector machines based on MapReduce and Bagging technique,” in 2012 IEEE International Conference on Information Science and Technology, pp. 297–301, Hubei, China, 2012.
[17] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2011.
[18] A. Rabkin and R. H. Katz, “How Hadoop clusters break,” IEEE Software, vol. 30, no. 4, pp. 88–94, 2013.
[19] L. Wang, J. Tao, R. Ranjan et al., “G-Hadoop: MapReduce across distributed data centers for data-intensive computing,” Future Generation Computer Systems, vol. 29, no. 3, pp. 739–750, 2013.
[20] M. Niemenmaa, A. Kallio, A. Schumacher, P. Klemelä, E. Korpelainen, and K. Heljanko, “Hadoop-BAM: directly manipulating next generation sequencing data in the cloud,” Bioinformatics, vol. 28, no. 6, pp. 876-877, 2012.
[21] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, p. 107, 2008.
[22] R. Lämmel, “Google’s MapReduce programming model - revisited,” Science of Computer Programming, vol. 70, no. 1, pp. 1–30, 2008.
[23] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, “HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 922–933, 2009.
[24] X. Liu, J. Han, Y. Zhong, C. Han, and X. He, “Implementing WebGIS on Hadoop: a case study of improving small file I/O performance on HDFS,” in 2009 IEEE International Conference on Cluster Computing and Workshops, pp. 1–8, New Orleans, LA, USA, 2009.
[25] A. K. Karun and K. Chitharanjan, “A review on hadoop - HDFS infrastructure extensions,” in 2013 IEEE Conference on Information and Communication Technologies, pp. 132–137, Thuckalay, Tamil Nadu, India, 2013.
[26] A. Higai, A. Takefusa, H. Nakada, and M. Oguchi, “A study of effective replica reconstruction schemes for the Hadoop distributed file system,” IEICE Transactions on Information and Systems, vol. E98.D, no. 4, pp. 872–882, 2015.
[27] S. Park and Y. Lee, “Secure Hadoop with encrypted HDFS,” in Grid and Pervasive Computing, pp. 134–141, Springer, Berlin Heidelberg, 2013.
[28] M. M. Adankon and M. Cheriet, “Support vector machine,” in Third International Conference on Intelligent Networks and Intelligent Systems IEEE Computer Society, pp. 418–421, Chennai, India, 2010.
[29] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, “Support vector machine classification and validation of cancer tissue samples using microarray expression data,” Bioinformatics, vol. 16, no. 10, pp. 906–914, 2000.
[30] S. Amari and S. Wu, “Improving support vector machine classifiers by modifying kernel functions,” Neural Networks, vol. 12, no. 6, pp. 783–789, 1999.
[31] O. Chapelle, “Training a support vector machine in the primal,” Neural Computation, vol. 19, no. 5, pp. 1155–1178, 2007.
[32] A. Widodo and B. S. Yang, “Support vector machine in machine condition monitoring and fault diagnosis,” Mechanical Systems and Signal Processing, vol. 21, no. 6, pp. 2560–2574, 2007.
[33] T. Li, H. Li, X. Liu, S. Zhang, K. Wang, and Y. Yang, “GPU acceleration of interior point methods in large scale SVM training,” in 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, pp. 863–870, Melbourne, VIC, Australia, 2013.
[34] B. H. Yang, D. Chen, D. X. Zheng, and D. F. Zha, “Fault diagnosis method for bearing based on wavelet packet decomposition and EMD-SVM,” Computer Measurement and Control, vol. 23, no. 4, pp. 1118–1120, 2015.
[35] A. P. Dobrowolski, M. Wierzbowski, and K. Tomczykiewicz, “Multiresolution MUAPs decomposition and SVM-based analysis in the classification of neuromuscular disorders,” Computer Methods and Programs in Biomedicine, vol. 107, no. 3, pp. 393–403, 2012.
[36] J. F. de Oliveira and M. S. Alencar, “Online learning early skip decision method for the HEVC inter process using the SVM-based Pegasos algorithm,” Electronics Letters, vol. 52, no. 14, pp. 1227–1229, 2016.
[37] V. Jumutc, X. Huang, and J. A. K. Suykens, “Fixed-size Pegasos for hinge and pinball loss SVM,” in The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–7, Dallas, TX, USA, 2013.
[38] Y. W. Chang, C. J. Hsieh, K. W. Chang, M. Ringgaard, and C. J. Lin, “Training and testing low-degree polynomial data mappings via linear SVM,” Journal of Machine Learning Research, vol. 11, pp. 1471–1490, 2010.
[39] K. Wang, J. Zhang, D. Li, X. Zhang, and T. Guo, “Adaptive affinity propagation clustering,” http://arxiv.org/abs/0805.1096.
[40] C. Shea, B. Hassanabadi, and S. Valaee, “Mobility-based clustering in VANETs using affinity propagation,” in GLOBECOM 2009 - 2009 IEEE Global Telecommunications Conference, pp. 1–6, Honolulu, HI, USA, 2009.
[41] J. Vlasblom and S. J. Wodak, “Markov clustering versus affinity propagation for the partitioning of protein interaction graphs,” BMC Bioinformatics, vol. 10, no. 1, p. 99, 2009.
[42] R. Guan, X. Shi, M. Marchese, C. Yang, and Y. Liang, “Text clustering with seeds affinity propagation,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 4, pp. 627–637, 2011.
[43] N. Tsuchiya, M. Tokuda, and M. Ohtani, “The transfer of silicon from the gas phase to molten iron in the blast furnace,” Metallurgical Transactions B, vol. 7, no. 3, pp. 315–320, 1976.
[44] T. Bhattacharya, “Prediction of silicon content in blast furnace hot metal using partial least squares (PLS),” ISIJ International, vol. 45, no. 12, pp. 1943–1945, 2005.
[45] L. Jian, C. Gao, L. Li, and J. Zeng, “Application of least squares support vector machines to predict the silicon content in blast furnace hot metal,” ISIJ International, vol. 48, no. 11, pp. 1659–1661, 2008.
[46] H. Jiang and W. He, “Grey relational grade in local support vector regression for financial time series prediction,” Expert Systems with Applications, vol. 39, no. 3, pp. 2256–2262, 2012.

Copyright

Copyright © 2018 Yang Han et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
