Abstract

Applying traditional statistical methods and machine learning to massive datasets is challenging owing to the limitations of computer primary memory. The composite quantile regression neural network (CQRNN) is an efficient and robust estimation method, but most existing computational algorithms cannot fit CQRNN on massive datasets reliably and efficiently. To this end, we propose a divide-and-conquer CQRNN (DC-CQRNN) method that extends CQRNN to massive datasets. The main idea is to divide the overall dataset into subsets, apply CQRNN to the data within each subset, and obtain the final result by combining the subset-level training results through a weighted average. This approach significantly reduces both the required amount of primary memory and the computational time. Monte Carlo simulation studies and an application to an environmental dataset with over one million observations verify and illustrate that the proposed approach performs well for CQRNN on massive datasets. The proposed DC-CQRNN method has been implemented in Python on the Spark system, where it takes 8 minutes to complete the model training, whereas CQRNN on the full dataset takes 5.27 hours to produce a result.

1. Introduction

With the development of information technology, the mobile Internet, social networks, and e-commerce have greatly expanded the boundaries and applications of the Internet, and terabyte-scale datasets are becoming increasingly common. For example, the National Aeronautics and Space Administration Earth Observing System Terra and Aqua satellites monitor the Earth's atmosphere, oceans, and land, producing approximately 1.5 TB of environmental data per day. According to Intel's forecast, by 2020 a networked self-driving car would generate 4 TB of data for every 8 hours of operation. Massive datasets offer researchers both unprecedented challenges and opportunities. The key challenge is that directly applying machine learning and statistical methods to such datasets with conventional computing is prohibitive: first, the computation time is too long to obtain results quickly; second, the data can be too large to fit in a computer's primary memory. To overcome these challenges, researchers have proposed the divide-and-conquer method [1–3], which can be an effective way to analyze massive datasets.

In this paper, we consider a divide-and-conquer method for massive datasets. Fan et al. [4] analyzed least squares regression for the linear model on massive datasets using a divide-and-conquer method. Lin [1] considered a divide-and-conquer method for estimating equations in massive datasets. Chen [2] analyzed generalized linear models for extraordinarily large data using the divide-and-conquer method. Zhang [5] proposed a divide-and-conquer kernel ridge regression. Schifano [3] extended the divide-and-conquer approach to online updating for streaming data. A block average quantile regression (QR) approach for massive datasets was proposed in [6] by combining the divide-and-conquer method with QR. Jiang [7] extended the work of [6] to composite quantile regression (CQR) for massive datasets. Recently, Chen et al. [8] studied QR under memory constraints for massive datasets. Chen [9] considered a divide-and-conquer approach for quantile regression in big data.

It is well known that QR is more robust than ordinary least squares (OLS) regression when the error distribution is heavily skewed. However, the relative efficiency of QR with respect to OLS can be arbitrarily small. To address this, CQR, proposed in [10], remains effective regardless of the error distribution and can be considerably more efficient than OLS. Since then, CQR has been extensively studied for nonlinear models. Reference [11] combined local polynomials with CQR to estimate nonparametric models and proposed local polynomial CQR. Kai [12] studied CQR for the varying coefficient partially linear model. Guo [13] proposed CQR for the partially linear additive model. Jiang [14] studied two-step CQR for the single-index model.

Artificial intelligence (AI) methods do not require any a priori assumptions about the model when dealing with nonlinear problems, which is a significant advantage over statistical models such as the nonparametric model and the varying coefficient partially linear model. There has been considerable research on combining the attractive properties of AI with QR or CQR. For instance, Taylor and Cannon [15, 16] proposed the quantile regression neural network (QRNN) by combining the artificial neural network (ANN) with QR. A support vector quantile regression (SVQR) method was proposed in [17] by combining the support vector machine (SVM) with QR. A composite quantile regression neural network (CQRNN) method was studied in [18], which adds an ANN structure to CQR. However, when the amount of data is large, directly computing CQR or training an ANN with conventional methods is already slow, and computing CQRNN is slower still. Computation is therefore a bottleneck for applying CQRNN to massive datasets.

In this paper, our focus is on CQRNN for massive datasets whose size exceeds the primary memory of a single computer. Fortunately, we are not limited to a single computer. To this end, we consider CQRNN for massive datasets using the divide-and-conquer method on a distributed system. A distributed system is composed of multiple computers (called nodes) that run independently, and each node transfers information over wire protocols (such as RPC and HTTP) to achieve a common goal or task. In a distributed system, communication between nodes is usually expensive. We consider a "master and worker" type of distributed system in which workers do not communicate directly with each other, as shown in Figure 1. The concrete steps are as follows.
Step 1: randomize the initial parameters.
Step 2: distribute the initial parameters to each worker.
Step 3: each worker trains on its subset of the data and sends the training result to the master.
Step 4: the master takes the weighted average of the workers' training results as the approximate global training result.
Step 5: when there are more data to be processed, return to Step 2.

Applying traditional statistical methods and machine learning to massive datasets is challenging owing to the limitations of computer primary memory. CQRNN is an efficient and robust estimation method, but most existing computational algorithms cannot fit CQRNN on massive datasets reliably and efficiently. In this paper, we propose the DC-CQRNN method to extend CQRNN to massive datasets. The main idea is to divide the overall dataset into subsets, apply CQRNN to the data within each subset, and obtain the final result by combining the subset-level training results through a weighted average. The proposed DC-CQRNN method can significantly reduce the computational time and the required amount of primary memory, while the training results remain nearly as effective as analyzing the full data at once. For illustration, we use Monte Carlo simulations to compare the performance of the DC-CQRNN method with CQRNN [18], QRNN [15, 16], the artificial neural network (ANN) [19], SVM [20], and random forest (RF) [21]. In addition, an application to an environmental dataset with over one million observations verifies and illustrates that the proposed approach performs well for CQRNN on massive datasets.

The remainder of this paper is organized as follows. In Section 2, we present the DC-CQRNN method for massive datasets in detail. In Section 3, we use Monte Carlo simulations to illustrate the finite-sample performance of the proposed DC-CQRNN method. A detailed analysis of the environmental dataset is presented in Section 4. Section 5 concludes the paper.

2. Methodology

2.1. Our Motivation

In recent years, China's GDP has grown rapidly, which has intensified the tension between national economic development and the environment. At the same time, smog pollution has occurred in parts of North China, Northeast China, and Central China, with serious consequences. It is therefore extremely important to report air quality to the public accurately and to take smog-prevention measures in advance. At present, a large number of air quality monitoring stations have been established in many places in China, such as Beijing, Chengde, Tangshan, and Tianjin. However, for areas without monitoring stations, accurately predicting air quality and reporting it to the public in a timely manner remains a problem.

To this end, we collected environmental data from air quality monitoring stations in different places, with a total of 1,018,562 observations. When applying the CQRNN method to this environmental dataset, we found that the data are too large for an ordinary computer's primary memory and that the computation takes too long to produce results quickly; the CQRNN method is essentially impractical for massive data. Likewise, SVM, ANN, RF, and similar methods also take a long time to process massive datasets. In this section, we propose a divide-and-conquer CQRNN (DC-CQRNN) method to extend CQRNN to massive datasets. The same idea can be applied to SVM, ANN, and RF.

2.2. Composite Quantile Regression Neural Network

In the real world, there is usually a nonlinear relationship between the predicted variable $Y$ and the predictor $\mathbf{X}$, which can be described by the stochastic model

$Y = f(\mathbf{X}, \boldsymbol{\theta}) + \varepsilon$, (1)

where $\varepsilon$ is the model error and $\boldsymbol{\theta}$ is a vector of unknown parameters. There are many regression techniques for estimating the unknown parameters of such a model, such as OLS, QR, CQR, and their derived methods. Xu et al. [18] proposed CQRNN by combining the nice properties of ANN and CQR. Given the predictor $\mathbf{X}$ and the predicted variable $Y$, the CQRNN objective is to minimize the empirical loss function

$L(\boldsymbol{\theta}) = \frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{n}\sum_{i=1}^{n}\rho_{\tau_q}\bigl(Y_i - \hat{f}(\mathbf{X}_i, \boldsymbol{\theta})\bigr)$, (2)

where $\rho_{\tau_q}(u) = u\,(\tau_q - I(u < 0))$ is the check function with $\tau_q = q/(Q+1)$ for $q = 1, \dots, Q$ (see [10]), and $\hat{f}(\mathbf{X}_i, \boldsymbol{\theta})$ approximates the conditional quantile of $Y_i$ at $\mathbf{X}_i$ and is computed in the following two steps. First, the output of the $j$th hidden-layer node is obtained by applying an activation function to the inner product between $\mathbf{X}_i$ and the hidden-layer weights plus the hidden-layer bias:

$g_j(\mathbf{X}_i) = h\bigl(\sum_{k=1}^{p} X_{ik}\, w^{(h)}_{kj} + b^{(h)}_j\bigr)$, (3)

where $h(\cdot)$ denotes an activation function, generally the hyperbolic tangent, $\mathbf{w}^{(h)}$ is the hidden-layer weight matrix, and $\mathbf{b}^{(h)}$ is the hidden-layer bias vector. Second, an estimate of the predicted variable is then given by

$\hat{f}(\mathbf{X}_i, \boldsymbol{\theta}) = h_o\bigl(\sum_{j=1}^{J} g_j(\mathbf{X}_i)\, w^{(o)}_j + b^{(o)}\bigr)$, (4)

where $h_o(\cdot)$ is the output-layer activation function, $\mathbf{w}^{(o)}$ is the output-layer weight vector, and $b^{(o)}$ is the output-layer bias. Let $\boldsymbol{\theta}$ collect all coefficients, including weights and biases, to be trained or estimated. The main purpose of CQRNN is to estimate $\boldsymbol{\theta}$ using

$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{n}\sum_{i=1}^{n}\rho_{\tau_q}\bigl(Y_i - \hat{f}(\mathbf{X}_i, \boldsymbol{\theta})\bigr)$. (5)
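To make equations (2)–(4) concrete, the following minimal NumPy sketch implements the single-hidden-layer forward pass and the composite check loss. The variable names, the identity output activation, and the array shapes are our illustrative choices, not necessarily the exact implementation of [18].

```python
import numpy as np

def cqrnn_forward(X, W_h, b_h, w_o, b_o):
    """Forward pass of equations (3)-(4): hidden-layer outputs
    g = tanh(X W_h + b_h), then f_hat = g . w_o + b_o (identity output)."""
    g = np.tanh(X @ W_h + b_h)      # (n, J) hidden-layer node outputs
    return g @ w_o + b_o            # (n,) fitted values

def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - I(u < 0)) from equation (2)."""
    return u * (tau - (u < 0).astype(float))

def composite_loss(y, f_hat, Q):
    """Empirical composite loss of equation (2): average the check loss
    over the quantile levels tau_q = q / (Q + 1), q = 1, ..., Q."""
    taus = np.arange(1, Q + 1) / (Q + 1)
    return np.mean([np.mean(check_loss(y - f_hat, tau)) for tau in taus])
```

In practice, the parameters W_h, b_h, w_o, and b_o are collected into the vector θ and optimized jointly by a gradient-based routine.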

Remark 1. (1) Following the suggestion of [16], we use the Huber norm to overcome the problem that the check function $\rho_{\tau_q}(\cdot)$ is not differentiable everywhere. (2) To avoid overfitting the model, we add a penalty term to equation (5) to obtain equation (6):

$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{n}\sum_{i=1}^{n}\rho_{\tau_q}\bigl(Y_i - \hat{f}(\mathbf{X}_i, \boldsymbol{\theta})\bigr) + \lambda \lVert \boldsymbol{\theta} \rVert$, (6)

where $\lVert \cdot \rVert$ is the norm used for the penalty and $\lambda$ is a regularization parameter. Following the suggestion of [18], we choose the tuning parameters through the EBIC-like criterion in equation (7), in which the model complexity enters through the number of selected variables corresponding to $\lambda$.
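As a sketch of the smoothing and penalization mentioned in Remark 1, the snippet below replaces the absolute value inside the check function with a Huber function (differentiable everywhere) and adds a simple squared-norm weight penalty. The threshold eps, the quadratic form of the penalty, and the function names are our assumptions, not necessarily the exact choices of [16, 18].

```python
import numpy as np

def huber(u, eps=1e-3):
    """Huber norm: quadratic near zero, linear in the tails, so the loss
    becomes differentiable everywhere (Remark 1, following [16])."""
    au = np.abs(u)
    return np.where(au <= eps, u ** 2 / (2 * eps), au - eps / 2)

def smoothed_check_loss(u, tau, eps=1e-3):
    """Huber-smoothed check function: tau * huber(u) for u >= 0 and
    (1 - tau) * huber(u) for u < 0."""
    return np.where(u >= 0, tau * huber(u, eps), (1 - tau) * huber(u, eps))

def penalized_objective(y, f_hat, Q, weights, lam):
    """Smoothed composite loss plus a quadratic weight penalty, a simple
    stand-in for the regularized objective of equation (6)."""
    taus = np.arange(1, Q + 1) / (Q + 1)
    fit = np.mean([np.mean(smoothed_check_loss(y - f_hat, tau)) for tau in taus])
    return fit + lam * np.sum(np.square(weights))
```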

2.3. Divide-and-Conquer Composite Quantile Regression Neural Network

When the sample size $n$ is too large, directly solving the optimization problem in (5) with conventional computing methods is infeasible. Based on the ideas of [1, 2], our method divides the overall dataset into subsets, each small enough to fit in the computer's primary memory. We then implement CQRNN on the data within each subset. Finally, we obtain the overall result by combining the subset-level training results through a weighted average.

In detail, the proposed DC-CQRNN estimator is obtained through the following concrete steps:
Step 1: randomize the initial parameters and divide the full dataset into $K$ subsets, so that the $k$th subset contains $n_k$ observations with $\sum_{k=1}^{K} n_k = n$.
Step 2: distribute the initial parameters to each worker.
Step 3: each worker trains on one subset of the full dataset, obtains the estimator $\hat{\boldsymbol{\theta}}_k$, $k = 1, \dots, K$, using the methodology for solving equation (5), and sends $\hat{\boldsymbol{\theta}}_k$ to the master.
Step 4: the master computes the weighted average of $\hat{\boldsymbol{\theta}}_1, \dots, \hat{\boldsymbol{\theta}}_K$ to obtain the resulting estimator of $\boldsymbol{\theta}$:

$\hat{\boldsymbol{\theta}} = \sum_{k=1}^{K} \frac{n_k}{n}\, \hat{\boldsymbol{\theta}}_k$. (8)

Step 5: when there are more data to be processed, return to Step 2. The detailed process is shown in Figure 2.
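The following sketch mirrors Steps 1–5 with Python's multiprocessing. Here fit_cqrnn_subset is a hypothetical placeholder for the subset-level optimization of equation (5), and the weights n_k / n implement the weighted average of equation (8).

```python
from multiprocessing import Pool
import numpy as np

def fit_cqrnn_subset(args):
    """Worker (Step 3): train CQRNN on one subset starting from the initial
    parameters sent by the master and return the local estimate theta_k.
    Placeholder: the actual optimization of equation (5) goes here."""
    X_k, y_k, theta_init = args
    theta_k = np.asarray(theta_init, dtype=float).copy()
    # ... minimize the (penalized) composite loss on (X_k, y_k) ...
    return theta_k

def dc_cqrnn(X, y, theta_init, K=10):
    """Master: Steps 1, 2, and 4 of the DC-CQRNN algorithm."""
    idx = np.array_split(np.arange(len(y)), K)              # Step 1: K subsets
    tasks = [(X[i], y[i], theta_init) for i in idx]          # Step 2: dispatch
    with Pool() as pool:                                     # Step 3 in parallel
        thetas = pool.map(fit_cqrnn_subset, tasks)
    n_k = np.array([len(i) for i in idx], dtype=float)
    return sum((nk / n_k.sum()) * t for nk, t in zip(n_k, thetas))  # Step 4, eq. (8)
```

Because the workers never exchange data with one another, the only communication per round is one parameter vector from each worker to the master.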

Remark 2. Obviously, if the number of subsets $K$ is too large, the size of each subset becomes very small, and the correlation among the values of the dataset is destroyed. To this end, we impose regularity conditions (a) and (b), which restrict the subset sizes $n_k$ and the number of subsets $K$ relative to the full sample size $n$ so that each subset retains a sufficiently large number of observations.

3. Numerical Simulations

In this section, we use Monte Carlo simulations to compare the finite-sample performance of the DC-CQRNN method with CQRNN [18], QRNN [15, 16], ANN [19], SVM [20], and RF [21]. In the simulations of [22], the performance of the CQRNN method is very similar across different values of $Q$. Thus, we only consider $Q = 4$ and $Q = 9$ as a compromise between the estimation efficiency and the computational efficiency of the CQRNN method.

3.1. Simulation Data

In order to investigate the performance of the DC-CQRNN method with different structures, we choose various values of the parameters $K$ and $Q$. Simultaneously, to illustrate the effectiveness and robustness of our method, we consider three different error distributions for $\varepsilon$: the standard normal distribution $N(0, 1)$, a Student's $t$ distribution with three degrees of freedom $t(3)$, and a chi-square distribution with three degrees of freedom $\chi^2(3)$, together with two cases for the covariates.
Case 1 (i.i.d.): the data are generated from a nonlinear model whose covariates are independent and identically distributed.
Case 2 (non-i.i.d.): the data are generated from a nonlinear model whose covariates are constructed so that the sequence is not i.i.d.

We generated samples with the same number of observations and variables under Case 1 and Case 2 for each of the three error distributions, and randomly assigned 50,000 samples to the training dataset (in sample) and the remaining samples to the testing dataset (out of sample). To assess the performance of the competing models, all simulations were run for 100 replicates.

3.2. Prediction Performance

We use the mean absolute error (MAE) and root mean square error (RMSE) to compare the prediction performance of DC-CQRNN, CQRNN, DC-QRNN, QRNN, ANN, RF, and SVM both in sample and out of sample, where

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \bigl|Y_i - \hat{Y}_i\bigr|$, $\qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \bigl(Y_i - \hat{Y}_i\bigr)^2}$,

and $\hat{Y}_i$ is an estimator of $Y_i$. We employ the EBIC-like criterion in (7) to select the tuning parameters, and the results are shown in Table 1. To reduce the computational load, we consider all the combinations of $K$ and $Q$. The predicted performance is listed in Tables 2 and 3.
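For completeness, the two criteria can be computed as follows (a direct transcription of the formulas above):

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error: (1/n) * sum_i |Y_i - Y_i_hat|."""
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    """Root mean square error: sqrt((1/n) * sum_i (Y_i - Y_i_hat)^2)."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))
```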

In Tables 1 and 3, the two DC-CQRNN columns correspond to DC-CQRNN with $Q = 4$ and $Q = 9$, respectively, and the two CQRNN columns correspond to CQRNN with $Q = 4$ and $Q = 9$, respectively. For QRNN and DC-QRNN, we only report the results at the median quantile level ($\tau = 0.5$).

From Tables 2 and 3, we can see the following:
(1) In all cases, CQRNN has higher prediction accuracy than QRNN, ANN, RF, and SVM, especially when Q = 9.
(2) The results of DC-CQRNN and CQRNN are very close, indicating that the DC-CQRNN method can serve as an effective approximation of the CQRNN method for massive datasets.

3.3. Running Time Performance

To assess the computational efficiency of the DC-CQRNN method, we record the running times of all methods, implemented in the Python programming language and carried out on an Intel(R) Core(TM) i9-8950HK CPU (2.90 GHz) processor with 16 GB RAM. For a fair comparison, we report the average CPU time over 100 repetitions of each method. The average CPU times of all methods are presented in Table 4.

It can be seen from Table 4 that the CPU time of CQRNN is longer than that of QRNN. At the same time, DC-CQRNN runs faster than both QRNN and CQRNN, and the CPU time of DC-CQRNN decreases as $K$ increases. The main advantage of DC-CQRNN is that each worker in the master-worker distributed system is independent, so the CQRNN fits on the workers can be executed in parallel. As a result, the computing time of DC-CQRNN on massive datasets is greatly reduced compared with CQRNN.

In addition, to compare the computational efficiency of the DC-CQRNN method in sequential and parallel settings, Table 5 reports the computing times of DC-CQRNN and DC-QRNN under a sequential distributed environment (SDE) and a parallel distributed environment (PDE). It can be seen from Table 5 that PDE has a clear computational advantage over SDE.

4. Real-World Data Applications

4.1. Environmental Dataset

The research area of this paper covers Baoding, Beijing, Chengde, Shijiazhuang, Tangshan, Tianjin, Xingtai, and Zhangjiakou in China, with data from January 1, 2015, to July 1, 2019, totaling 1,018,561 observations. We collected hourly historical PM2.5 monitoring data from the website of the Ministry of Environmental Protection of the People's Republic of China (http://datacenter.mep.gov.cn) and meteorological data from the website of the National Oceanic and Atmospheric Administration (https://www.noaa.gov/). Each sample includes the PM2.5 concentration, temperature, pressure, humidity, wind direction, wind speed, and visibility measured at the air quality monitoring stations. Details of the dataset are shown in Table 6.

4.2. Normalization

Data normalization is an important step for many machine-learning estimators, particularly when dealing with neural networks. Features with widely differing ranges can cause instability when training models. We standardize the features by subtracting the mean and scaling the data, with the standardized value calculated as

$x^{*} = \dfrac{x - \bar{x}}{s}$,

where $x$, $\bar{x}$, and $s$ are the sample value, sample mean, and sample standard deviation, respectively.
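A minimal sketch of this standardization step, assuming the features are stored column-wise in a NumPy array:

```python
import numpy as np

def standardize(X):
    """Subtract each column's mean and divide by its standard deviation,
    returning the standardized data together with the statistics so the
    same transformation can be reused on new samples."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std
```

In practice, one would typically compute the mean and standard deviation on the training split and reuse them to transform the testing split.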

4.3. Empirical Results

The aim of our experiments is to examine the effectiveness of the proposed DC-CQRNN for the spatial prediction of PM2.5 concentration. We also consider the ANN, QRNN, CQRNN, RF, and SVM models. The PM2.5 concentration monitoring dataset is divided into a training set and a testing set at a ratio of 7 : 3. To obtain the results as objectively as possible, we repeated the training and testing experiments 100 times with randomly chosen compositions of the training and testing data; the final training and testing results are the averages over all trials. For the parameter settings of the neural network models, we use the EBIC approach, and Table 7 presents the results of the parameter selection; for the other machine learning models, we use cross-validation [23] for parameter setting.
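The repeated random-split protocol can be sketched as follows; fit_model and predict_model are hypothetical callables standing in for any of the compared methods.

```python
import numpy as np

def repeated_holdout(X, y, fit_model, predict_model,
                     n_rep=100, train_frac=0.7, seed=0):
    """Repeat the 7:3 random split, train on the training part, evaluate on
    the testing part, and return the average MAE and RMSE over all trials."""
    rng = np.random.default_rng(seed)
    maes, rmses = [], []
    for _ in range(n_rep):
        perm = rng.permutation(len(y))
        cut = int(train_frac * len(y))
        tr, te = perm[:cut], perm[cut:]
        model = fit_model(X[tr], y[tr])
        pred = predict_model(model, X[te])
        maes.append(np.mean(np.abs(y[te] - pred)))
        rmses.append(np.sqrt(np.mean((y[te] - pred) ** 2)))
    return float(np.mean(maes)), float(np.mean(rmses))
```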

We set $K = 1, 10, 50$ and $Q = 4, 9$. For DC-QRNN, we only report the results at the median quantile level ($\tau = 0.5$). The results of our experiments, measured by RMSE, MAE, and CPU time, are reported in Table 8.

In terms of prediction accuracy, the training results of DC-QRNN and DC-CQRNN are close to those of QRNN and CQRNN based on the full sample, respectively. CQRNN is significantly better than ANN, RF, and SVM. In particular, CQRNN performs best when Q = 9, both in sample and out of sample.

Considering the CPU time of the models, the CPU time of ANN based on full-sample training is higher than that of RF and SVM. In DC-QRNN and DC-CQRNN, the training on each piece of data is independent, so the pieces can be processed in parallel, which significantly reduces the computational time. The experimental results show that the CPU times of DC-QRNN and DC-CQRNN are much lower than those of QRNN and CQRNN based on the full sample, respectively.

The environmental dataset includes 1,018,562 observations. The entire methodology has been implemented in Python on the Spark system. Using the proposed DC-CQRNN method on the Spark system, it takes 8 minutes to complete the model training, whereas the full-sample CQRNN method takes 5.27 hours to produce a result.
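A hedged sketch of how such a Spark run can be organized: each RDD partition plays the role of one subset, train_partition is a hypothetical wrapper around the subset-level CQRNN fit, and the driver combines the partition-level estimates by the weighted average of equation (8). Data loading and the actual optimizer are omitted.

```python
from pyspark.sql import SparkSession
import numpy as np

def train_partition(rows):
    """Fit CQRNN on the rows of one partition (placeholder) and yield the
    local parameter estimate together with the partition size."""
    data = list(rows)
    if not data:
        return iter([])
    theta_k = np.zeros(10)  # placeholder for the locally trained parameters
    # ... run the CQRNN optimization of equation (5) on `data` ...
    return iter([(theta_k, len(data))])

spark = SparkSession.builder.appName("DC-CQRNN").getOrCreate()
K = 50                                   # number of subsets / partitions
# `samples` is assumed to be a pre-loaded list of (features, response) records.
rdd = spark.sparkContext.parallelize(samples, numSlices=K)
results = rdd.mapPartitions(train_partition).collect()
n_total = sum(n_k for _, n_k in results)
theta_hat = sum((n_k / n_total) * theta_k for theta_k, n_k in results)
```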

5. Results and Discussion

Massive datasets offer researchers both unprecedented challenges and opportunities. The key challenge is that directly applying machine learning and statistical methods to such datasets with conventional computing is prohibitive. In this paper, Monte Carlo simulation studies and an environmental dataset application verify and illustrate that the proposed approach performs well for CQRNN on massive datasets; the DC-CQRNN method is therefore effective and practically important. Clearly, the larger the value of $K$, the more computationally efficient the DC-CQRNN method becomes. However, if $K$ is too large, each subset becomes very small, and the correlation among the values of the dataset is destroyed. Therefore, $K$ should be chosen large enough to gain efficiency but not excessively large.

6. Conclusion

Using composite quantile regression neural networks to deal with massive datasets faces two main challenges: first, the computation time is too long to obtain results quickly; second, the data can be too large to fit in a computer's primary memory. To address these difficulties, we propose DC-CQRNN, a divide-and-conquer method implemented on a "master and worker" type of distributed system. The proposed DC-CQRNN can significantly reduce the computational time and the required amount of computer primary memory, while the training results are nearly as effective as analyzing the full data at once. The divide-and-conquer idea also extends to QRNN, ANN, and SVM. In the future, we will explore subsampling methods to reduce the time required for training neural networks on massive data and to lower the computational cost.

Data Availability

The research area of this paper covers Baoding, Beijing, Chengde, Shijiazhuang, Tangshan, Tianjin, Xingtai, and Zhangjiakou in China, with data from January 1, 2015, to July 1, 2019, totaling 1,018,561 observations. The hourly historical PM2.5 monitoring data were collected from the website of the Ministry of Environmental Protection of the People's Republic of China (http://datacenter.mep.gov.cn), and the meteorological data were collected from the website of the National Oceanic and Atmospheric Administration (https://www.noaa.gov/).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (no. 11471264).