Abstract

Subflow peak prediction is required for the active elastic scaling of resources, but existing single-flow prediction methods cannot accurately predict the peak variation of subflows in a hybrid data flow because they do not consider the correlation between subflows. The difficulty is that the correlation between different data flows in a hybrid data flow is hard to calculate. To solve this problem, this paper proposes a new method, DCCSPP (subflow peak prediction of hybrid data flow based on delay correlation coefficients), to predict the peak value of a hybrid data flow. Firstly, we establish a delay correlation coefficient model based on a sliding time window to determine the delay time and the delay correlation coefficient. Next, based on this model, a subflow peak prediction model and algorithm for hybrid data flows are established to achieve accurate peak prediction of subflows. Experiments show that our prediction model achieves better results: compared with LSTM, our method reduces MAE by about 18.36% and RMSE by about 13.50%; compared with linear regression, MAE and RMSE are reduced by 27.12% and 25.58%, respectively.

1. Introduction

Hybrid data flows are widely used in practical applications. For example, Alibaba's e-commerce platform uses a large-scale hybrid technology that mixes online services with offline tasks. A hybrid data flow therefore consists of online services and offline tasks; they enter the cluster at the same time, which saves cost without affecting service quality.

Flow peak prediction is important in the active elastic expansion of a system [1]. Lombardi et al. [2] propose a novel elastic scaling approach, named ELYSIUM, which contains a “predictionInputLoad” method to predict the maximum load. Bauer et al. [3] describe a new hybrid autoscaling mechanism, called Chameleon, which employs on-demand, automated time-series-based forecasting methods in combination to predict the arriving load intensity. Hirashima et al. [4] give a new autoscaling mechanism that changes the scale of the target system based on the predicted workload.

In the active elastic scaling of flow processing systems, there are some studies on peak flow prediction. Existing prediction methods regard the network flow as a whole. Traditional methods for network flow prediction include the ARIMA linear model and a wireless network flow prediction model based on combinatorial optimization theory. Meanwhile, with the development of neural networks, prediction models based on machine learning algorithms such as the support vector machine (SVM) have appeared. Some authors use neural network models such as the RNN [5], the NARX recursive neural network, LSTM [6], and the GRU to predict network peak flow. These prediction models can well capture the randomness and periodicity of flow.

However, the above methods are based on single-flow prediction and do not consider the possible correlation between individual flows in a hybrid data flow. Therefore, to take the influence of data correlation on peak flow prediction into account, this paper proposes a flow prediction method named DCCSPP (subflow peak prediction of hybrid data flow based on delay correlation coefficients). We establish a delay correlation coefficient model to resolve the correlation uncertainty between different subflows and incorporate the correlation between subflows into the single-flow prediction results. The more accurately the flow peaks are predicted, the more reliable the obtained system flow information is, which provides better index parameters for the system's elastic scaling.

2. Related Work

In recent years, flow prediction based on time series has always been an attractive research area. Developing predictive models plays an important role in interpreting complex real-world elements [7].

Many traditional learning methods are used for time series prediction. Zhang et al. [1] propose an agile perception method to predict abnormal behavior. Yu et al. [8] describe an ARIMA linear model to predict network flow sequences. To address the problem that a single model cannot fully describe change characteristics, a wireless network flow prediction model based on combinatorial optimization theory is proposed by Chen and Liu [9]. Liu et al. [10] give online learning algorithms for estimating ARIMA models under relaxed assumptions on the noise terms. Adebiyi et al. [11] examine the forecasting performance of ARIMA and artificial neural network models. Wu and Wang [12] investigate time series prediction algorithms that combine nonlinear filtering approaches with the feedforward neural network (FNN). Joo and Kim [13] propose a forecasting method based on wavelet filtering. Han et al. [14] introduce a multioutput least squares support vector regressor. Chandra and Al-Deek [15] discuss a vector autoregressive model for short-term flow prediction on freeways. Conventional techniques for time series prediction are limited in their ability to process big data with high dimensionality and to efficiently represent complex functions. If the amount of linear data is not too large, statistical methods are reliable enough for prediction; for nonlinear data types, however, the generated models are complex and hard to implement, so the prediction results are not very accurate when the data are massive.

Deep learning-based models have been successfully applied to time series prediction in many fields, and many prediction models based on machine learning have been proposed. Haviluddin and Alfred [16] introduce a NARX recursive neural network model to predict network flow. Nie et al. [17] propose a novel network flow prediction method based on a deep belief network (DBN) and a logistic regression model. In [18], neural network models such as the RNN [5], LSTM [6], and the GRU are used for network flow prediction. Hoermann et al. [19] report a deep CNN model for dynamic occupancy grid prediction with data from multiple sensors. The advantage of Gaussian processes lies in their ability to model the uncertainty hidden in data by predicting distributions [20]. Deep learning-based models are good at discovering intricate structure in large data sets [7]. These prediction models can well capture the randomness and periodicity of flow.

However, all of the above methods are designed for single-flow prediction and do not consider the possible correlation between data flows in a hybrid flow; research on flow prediction for hybrid data flows is still lacking. Therefore, this paper mainly studies the correlation between different subflows in a hybrid flow and the peak prediction of each subflow.

3. Delay Correlation Coefficient Model Based on Sliding Time Window

In hybrid data flows, there are different degrees of correlation between different subflows. Considering the correlation between subflows and the pseudocorrelation caused by time differences, this paper proposes a delay correlation coefficient model that adds a sliding time window to Pearson correlation coefficient and time difference analysis [21]. The model calculates the delay correlation coefficient and the delay time between different subflows, and the data flow that influences the prediction of the target subflow is selected based on the delay correlation coefficient.

Correlation analysis [21] measures the closeness of the relationship between two or more related variables. Correlated elements need to have a certain connection or probability of association before correlation analysis can be conducted.

The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, represents the linear correlation between two sets of variables X and Y. The covariance of X and Y is given by formula (1):

$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right). \tag{1}$$

The covariance is divided by the standard deviations of the two related variables to obtain the Pearson correlation coefficient, described in formula (2); this compensates for the weak ability of the covariance value to represent the degree of correlation between random variables:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}. \tag{2}$$

The Pearson correlation coefficient always lies between −1 and 1. The closer the coefficient is to either extreme, the stronger the linear relationship between the two random variables. If the coefficient is close to 0, the two variables are not linearly related. If the coefficient approaches 1, the relationship between X and Y can be well described by a straight-line equation, all data points fall close to a straight line, and Y increases as X increases. A coefficient approaching −1 means that all data points fall close to a straight line and Y decreases as X increases.
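As a minimal illustration of formulas (1) and (2), the following Python sketch computes the covariance and the Pearson coefficient of two equally long observation sequences (the function and variable names are ours, not from the paper):

```python
import numpy as np

def pearson(x, y):
    """Covariance (formula (1)) and Pearson coefficient (formula (2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # covariance of the two sequences
    rho = cov / (x.std() * y.std())                 # normalize by the standard deviations
    return cov, rho

# A coefficient near +1 indicates that y grows almost linearly with x.
print(pearson([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]))
```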

In a flow processing system, the input data are generally composed of multiple subflows, which we call a hybrid data flow. This article defines the hybrid data flow as follows.

Definition 1. A hybrid data flow over a given period consists of n kinds of data flows; an element of the flow indicates that a piece of data belonging to the ith data flow arrives at the system at time t.

Definition 2. The data set constituting a business consists of several kinds of data flows; thus, service correlation exists among these data. For example, consider a hybrid data flow consisting of device login information and user behavior information: the flow of user behavior information is affected by the flow of device login information, and the two have a partial-order relationship. Since different service data flows require different processing operations and computing resources, it is necessary to split the hybrid data flow, as shown in Figure 1.
Through the statistics of the discrete hybrid data, the observation sequence of each subflow is obtained, and a set of hybrid data flow observation sequences composed of the subflow observation sequences is defined.

Definition 3. The hybrid data flow observation sequence set is M, where M contains the observation sequences of n data flows. $X = (x_1, x_2, \ldots, x_l)$ represents the observation sequence of the ith data flow in M, where $x_t$ represents the observed value of that data flow at time t and l represents the number of observations. $Y = (y_1, y_2, \ldots, y_l)$ represents the observation sequence of the jth data flow in M, where $y_t$ represents the observed value of that data flow at time t and l represents the number of observations. And $i \neq j$.
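To make Definition 3 concrete, the observation sequence of a subflow can be obtained by counting arrivals per time interval; a small sketch (the binning choice and function name are our assumptions, not taken from the paper):

```python
import numpy as np

def observation_sequence(arrival_times, t_start, t_end, l):
    """Turn a subflow's arrival timestamps into an observation sequence of
    l per-interval counts, one possible reading of Definition 3."""
    counts, _ = np.histogram(arrival_times, bins=l, range=(t_start, t_end))
    return counts

# Example: 4 observation intervals over [0, 40)
print(observation_sequence([1, 2, 12, 13, 14, 35], 0, 40, 4))  # -> [2 3 0 1]
```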

Definition 4. The ith subflow in the hybrid data flow is the subflow whose observation sequence is X.

Definition 5. The delay time e is illustrated in Figure 2. It means that a change of the jth subflow at time t − e has an effect on the ith subflow at time t.

Definition 6. The size of the sliding time window is h, as shown in Figure 3.
Let $X_h^t = (x_{t-h+1}, \ldots, x_t)$, so $|X_h^t| = h$. Let $Y_h^{t-e} = (y_{t-e-h+1}, \ldots, y_{t-e})$, so $|Y_h^{t-e}| = h$.

Definition 7. $\rho_e(X, Y)$ is the correlation coefficient of X and Y when the delay time is e. The calculation of $\rho_e(X, Y)$ is described in the following formula:

$$\rho_e(X, Y) = \frac{\sum_{k=0}^{h-1}\left(x_{t-k} - \bar{x}_h\right)\left(y_{t-e-k} - \bar{y}_h\right)}{\sqrt{\sum_{k=0}^{h-1}\left(x_{t-k} - \bar{x}_h\right)^{2}}\sqrt{\sum_{k=0}^{h-1}\left(y_{t-e-k} - \bar{y}_h\right)^{2}}}, \tag{3}$$

where $\bar{x}_h$ and $\bar{y}_h$ are the means of $X_h^t$ and $Y_h^{t-e}$, respectively. $X_h^t$ and $Y_h^{t-e}$ are shown in Figure 4.
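Under our reading of Definitions 6 and 7 (the original notation is not fully reproduced in this copy), the delay correlation coefficient at delay e compares the h observations of X ending at time t with the h observations of Y ending at time t − e. A hedged sketch:

```python
import numpy as np

def delay_corr(x, y, t, e, h):
    """Pearson correlation between the window of X ending at t and the window
    of Y ending at t - e, both of size h (our reading of Definition 7)."""
    if t - e - h + 1 < 0:
        raise ValueError("not enough history for this delay and window size")
    xw = np.asarray(x[t - h + 1 : t + 1], dtype=float)          # X_h^t
    yw = np.asarray(y[t - e - h + 1 : t - e + 1], dtype=float)  # Y_h^{t-e}
    return float(np.corrcoef(xw, yw)[0, 1])
```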

Definition 8. The maximum delay correlation coefficient between X and Y is $\rho_{\max}(X, Y)$. Its calculation formula is as follows:

$$\rho_{\max}(X, Y) = \max_{e} \rho_e(X, Y). \tag{4}$$

When predicting X, it is necessary to select the data flow with the highest delay correlation as the auxiliary flow. The selection formula is as follows:

$$j^{*} = \arg\max_{j} \rho_{\max}\left(X, Y_j\right), \tag{5}$$

where $Y_j$ is the observation sequence of the jth candidate data flow. Algorithm 1 gives the pseudocode of the auxiliary data flow selection algorithm.

Input: the list of streams; the size of the window; the index of the predicted stream
Output: the index of the auxiliary stream; the delay time
(1)procedure chooseStream ()
(2)for each stream in the list of streams do
(3)for each candidate delay time do
(4)for each position of the sliding window do
(5)Calculate the correlation coefficient for the current delay time and window
(6)Sum up the correlation coefficients
(7)Calculate the mean of the correlation coefficients over all windows
(8)Update the maximum delay correlation coefficient and the delay time for the current stream
(9)Update the overall maximum delay correlation coefficient and the delay time
(10)Update the index of the auxiliary data flow
(11)return the index of the auxiliary data flow and the delay time
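A hedged Python sketch of Algorithm 1 follows: for every candidate auxiliary stream and every candidate delay, the delay correlation coefficients over all admissible window positions are averaged, and the stream and delay with the largest mean coefficient are returned. The function name, the delay range, and the use of `numpy.corrcoef` are our choices, not prescriptions of the paper.

```python
import numpy as np

def choose_stream(streams, target, h, max_delay):
    """Select the auxiliary stream and the delay time with the largest mean
    delay correlation to the target stream (a sketch of Algorithm 1)."""
    best_stream, best_delay, best_coeff = None, None, -np.inf
    for j, y in enumerate(streams):
        for e in range(1, max_delay + 1):
            coeffs = []
            # slide the window over every end position t that leaves enough history
            for t in range(e + h - 1, len(target)):
                xw = target[t - h + 1 : t + 1]
                yw = y[t - e - h + 1 : t - e + 1]
                coeffs.append(np.corrcoef(xw, yw)[0, 1])
            mean_coeff = float(np.mean(coeffs))
            if mean_coeff > best_coeff:  # keep the best (stream, delay) pair
                best_stream, best_delay, best_coeff = j, e, mean_coeff
    return best_stream, best_delay, best_coeff
```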

4. Hybrid Data Flow Subflow Peak Prediction Model

The selected data flow (i.e., X) is first predicted separately by a single-flow prediction method, and an initial prediction result set $\hat{X} = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_l)$ of X is obtained, where $\hat{x}_t$ represents the initial prediction result for the value of X at time t.

Definition 9. The variation of the prediction of x at time t is $\Delta \hat{x}_t$, which represents the difference between the single-flow prediction results at time t and time t − 1. The calculation formula is as follows:

$$\Delta \hat{x}_t = \hat{x}_t - \hat{x}_{t-1}. \tag{6}$$

Definition 10. The amount of change in y at time t is $\Delta y_t$, which represents the difference between the observed values of Y at time t − e and time t − e − 1. The calculation formula is as follows:

$$\Delta y_t = y_{t-e} - y_{t-e-1}. \tag{7}$$

Definition 11. To scale the range of the variation of y to the same level as that of x, we define the scaled variation $\Delta y'_t$ of $\Delta y_t$; its calculation is given by formula (8).

Definition 12. At time t, the final prediction result of x is $\tilde{x}_t$; it is obtained by combining the initial prediction $\hat{x}_t$ with the scaled auxiliary variation $\Delta y'_t$ according to formula (9), where w represents the weight of the correlation coefficient and is computed by formula (10). Algorithm 2 gives the pseudocode of the hybrid data flow correlation prediction algorithm.
The evaluation indexes in this paper are the root mean square error (RMSE) and the mean absolute error (MAE). The calculation formulas are as follows:

$$\mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|\tilde{x}_t - x_t\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(\tilde{x}_t - x_t\right)^{2}},$$

where $x_t$ is the observed value, $\tilde{x}_t$ is the predicted value at time t, and N is the number of predicted points. The smaller the mean absolute error is, the more accurate the prediction result is. The smaller the root mean square error is, the fewer abnormal discrete points there are and the higher the prediction accuracy is.
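The two error indexes are straightforward to compute; a minimal sketch:

```python
import numpy as np

def mae(observed, predicted):
    return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(observed))))

def rmse(observed, predicted):
    return float(np.sqrt(np.mean((np.asarray(predicted) - np.asarray(observed)) ** 2)))
```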

Input: the observation sequence of the predicted stream; the initial prediction sequence; the observation sequence of the auxiliary stream; the delay time; the size of the window; the prediction time t
Output: the final predicted value at time t
(1)procedure prediction ()
(4)Calculate $\Delta \hat{x}_t$ based on formula (6)
(5)Calculate $\Delta y_t$ based on formula (7)
(6)Calculate $\Delta y'_t$ based on formula (8)
(7)Calculate the weight w based on formula (10)
(8)Calculate the final prediction result based on formula (9)
(9)return the final prediction result
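The following Python sketch mirrors the structure of Algorithm 2. Because formulas (8)–(10) are not reproduced in this copy of the paper, the range scaling and the weight w = ρmax below are our assumptions; only the overall flow (Definitions 9–12 in order) is taken from the text.

```python
import numpy as np

def corrected_prediction(x_hat, y_obs, e, rho_max, t):
    """Correct the initial single-flow prediction x_hat[t] using the delayed
    auxiliary flow y_obs (a sketch of Algorithm 2; formulas (8)-(10) assumed)."""
    dx = x_hat[t] - x_hat[t - 1]                      # Definition 9, formula (6)
    dy = y_obs[t - e] - y_obs[t - e - 1]              # Definition 10, formula (7)
    scale = np.ptp(x_hat) / max(np.ptp(y_obs), 1e-9)  # assumed range scaling for formula (8)
    w = rho_max                                       # assumed weight for formula (10)
    # assumed combination for formula (9): weighted mix of the two variations
    return x_hat[t - 1] + (1 - w) * dx + w * dy * scale
```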

5. Experimental Verification

5.1. Data Set

In order to analyze the prediction performance of the proposed method, device login data and behavior acquisition data provided by the mobile phone APP of a credit company in three periods across three months are selected. We collect 13,567 device login records and 282,685 behavior records in a certain period of June as data set 1, as shown in Figures 5 and 6. There are 27,381 device login records and 344,109 behavior records selected in a certain period of July as data set 2, as shown in Figures 7 and 8. Data set 3 consists of 17,550 device login records and 755,693 behavior records in a certain period of November, as shown in Figures 9 and 10. Each subset contains 4,465 observations. From Figures 5–10, we can see that the change trends of the device login statistics and the behavior collection statistics are close, and there is a correlation between them. In the experiments, the results predicted by the LSTM and unary linear regression models serve as the control groups, the results of our model serve as the experimental group, and their prediction indicators and peak prediction error indicators are compared.

5.2. Compared with LSTM Prediction Method

In this paper, the first 90% of the observed values of each data set are selected as the training set to train the LSTM model, and the last 10% are used as the test set to analyze the predictive ability of the model. The overall prediction results on the test sets of data set 1, data set 2, and data set 3 are shown in Figures 11–13, and the prediction results for a period with high observed values in data set 1, data set 2, and data set 3 are shown in Figures 14–16. In DCCSPP, it is necessary to collect observations covering the size of the time window before a prediction can be made, so for the first 90 points the method cannot give a prediction result and the value is reported as 0.
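For reference, the 90/10 split and the window-induced warm-up can be written as a short sketch (the file name and array handling are hypothetical):

```python
import numpy as np

observations = np.loadtxt("subflow_counts.csv")  # hypothetical file with one subflow's observations
split = int(len(observations) * 0.9)             # first 90% for training
train, test = observations[:split], observations[split:]

h = 90  # sliding time window size used for data sets 1 and 2
# DCCSPP needs h past observations before it can predict, so the first h
# positions of the test period are reported as 0 in the figures.
warm_up, evaluable = test[:h], test[h:]
```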

We also need to discuss the influence of the time window size; the results of the experiment on data set 2 are shown in Figure 17.

Compared with the LSTM model, it can be seen from Figures 14–16 that the predicted changes of DCCSPP are closer to the real observed values.

It can be seen from Figure 17 that the selection of the time window has a certain influence on the prediction results: a time window that is too small or too large degrades them. Therefore, except for data set 3, this article selects 90 as the size of the time window; on data set 3, the prediction method obtains better results when the time window size is 240.

The prediction errors for data set 1, data set 2, and data set 3 are shown in Table 1. The prediction method achieves the most obvious improvement on data set 1, where MAE and RMSE decrease by 13.46% and 17.80%, respectively. We also find that the smaller values in the test set of data set 2 make its MAE and RMSE smaller than those of the other data sets. Overall, the results show that applying the correlation coefficient algorithm to the prediction results of the LSTM model improves the accuracy of the predictions.

This paper compares the evaluation indexes of the prediction results at multiple maximum peak points in data set 1, data set 2, and data set 3; the results are shown in Table 2. They illustrate that the peak prediction on the test set of data set 1 is not accurate because of unfavorable data in its training set. The method proposed in this paper can significantly improve the peak prediction indexes, with MAE and RMSE improving by 41.46% and 33.79%, respectively. On data set 2 and data set 3, MAE decreases by about 12.83% on average, but the improvement in RMSE is limited, with an average of 3.3%. In conclusion, the method proposed in this paper can improve the final peak prediction results.

5.3. Compared with Simple Linear Regression

In this paper, the unary linear regression model is used to predict the test sets of data set 1, data set 2, and data set 3. The overall prediction results for data set 1, data set 2, and data set 3 are shown in Figures 18–20, respectively, and the prediction results for a period with high observed values are shown in Figures 21–23, respectively.
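As a sketch of a unary (single-variable) linear regression baseline, one can regress each observation on its predecessor; whether the paper regresses on time or on the lagged value is not stated, so the setup below is only one plausible reading:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def linreg_forecast(train, test):
    """One-step-ahead forecast with a single explanatory variable (the lagged
    observation); the choice of regressor is an assumption."""
    X = np.asarray(train[:-1], dtype=float).reshape(-1, 1)
    y = np.asarray(train[1:], dtype=float)
    model = LinearRegression().fit(X, y)
    # each test point is predicted from the previous observed value
    prev = np.concatenate(([train[-1]], test[:-1])).reshape(-1, 1)
    return model.predict(prev)
```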

It can be concluded from Figures 21–23 that, compared with the unary linear regression model, the results of the prediction model in this paper are closer to the actual changes in the observations.

The error comparison of the prediction results for data set 1, data set 2, and data set 3 is shown in Table 3. It can be seen that the prediction results of the unary linear regression model are worse than those of the LSTM model in terms of MAE and RMSE. With the method proposed in this paper, the prediction indexes on data set 1 become better than those of LSTM. When the proposed method is applied to the unary linear regression prediction model, the experimental results show that the MAE and RMSE values decrease by 15% to 26%. In conclusion, applying the proposed method to the unary regression model can greatly improve the accuracy of the prediction results.

This paper also compares the prediction results at multiple maximum peak points in data set 1, data set 2, and data set 3; the results are shown in Table 4. As can be seen from the table, the 13 peak points with the highest observed values in data set 1 are selected to calculate the improvement of MAE and RMSE, which improve by 33.45% and 28.73%, respectively. The 8 peak points with the highest observed values in data set 2 are selected, and MAE and RMSE improve by 32.40% and 29.49%, respectively. In data set 3, the 11 peak points with the highest observed values are selected, and MAE and RMSE improve by 15.50% and 18.52%, respectively. In conclusion, the method proposed in this paper improves the peak prediction results of the single-variable linear regression model.

From the results of experiment 1 and experiment 2, the method proposed in this paper improves the prediction results in both overall prediction and peak prediction. Compared with the LSTM method, MAE and RMSE decrease by 18.36% and 13.50%, respectively; compared with the unary linear regression method, MAE and RMSE decrease by 27.12% and 25.58%, respectively. In the overall prediction, MAE and RMSE improve by about 14.85% and 15.66%, respectively, and in the peak prediction, MAE and RMSE decrease by about 24.75% and 19.54%, respectively. Therefore, the subflow peak prediction method for hybrid data flows proposed in this paper can effectively improve on the original prediction results.

6. Conclusions

For a hybrid data flow, the correlation between subflows at different times is uncertain. This paper establishes a delay correlation coefficient model, through which the delay correlation coefficient and the delay time are calculated. The prediction results of the individual subflows are then corrected by the peak prediction method for hybrid data flows. Experiments show that the DCCSPP model achieves good prediction results when there is uncertain correlation between the subflows of a hybrid flow.

In future work, we will introduce the correlation between subflows into machine learning models, using machine learning methods to improve the accuracy of the delay correlation coefficient calculation and the prediction results. At the same time, the model can also be applied to dynamic hybrid data flows: we will design a dynamic allocation scheme based on the predicted peaks of each subflow, dynamically allocating resources to systems that require elastic scaling.

Data Availability

The data used in this paper came from an insurance company in China. Subject to a confidentiality agreement, the experimental data set cannot be disclosed to the public, and the name of the company cannot be mentioned in the paper. However, we guarantee that the data set used is authentic and was provided by the company.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of Shanghai (no. 19ZR1401900), Shanghai Science and Technology Innovation Action Plan Project (no. 19511101802), and National Natural Science Foundation of China (nos. 61472004 and 61602109).