To address the slow convergence and low prediction accuracy of XGBoost in COVID-19 transmission prediction, a novel algorithm based on bootstrap aggregating (Bagging) is presented to optimize the XGBoost prediction model. In this study, we collect early COVID-19 propagation data using web crawling techniques and use the Lasso algorithm to select the important attributes and simplify the attribute set. Moreover, to improve the global exploration and local exploitation capability of the grey wolf optimization (GWO) algorithm, an opposition-based learning strategy is introduced and a chaotic search operator is designed, yielding the chaos opposition learning grey wolf optimizer (COLGWO). Finally, the hyperparameters of XGBoost are iteratively optimized using COLGWO, and Bagging is employed to combine the predictions of the resulting COLGWO-XGBoost models. The experiments first compare the means and standard deviations of the search results of four optimization algorithms on eight standard test functions and then compare and analyze the prediction performance of fourteen models on COVID-19 web search data collected in China. The results show that the improved grey wolf algorithm performs well and that the combined model with ensemble learning has good prediction ability. They also demonstrate that web search data from the early spread of COVID-19 can complement historical information and that the combined model can be extended to other early warning tasks for the prevention and control of public emergencies.

1. Introduction

According to the World Health Organization, the novel coronavirus pneumonia outbreak has constituted a global public health emergency as of January 31, 2020 [1]. China suffered great economic and social harm following the outbreak of coronavirus disease 2019 (COVID-19) in late December 2019 in Wuhan, China [2]. COVID-19 is highly infectious [3]; its origin, transmission routes, and pathogenesis were unclear at the beginning of the outbreak, and the government's prevention and control strategy was adapted gradually and dynamically as the epidemic developed. Scientific forecasting can help prevent major outbreaks by establishing a proactive and anticipatory safeguard system against the continuous spread of epidemics [4]. Early identification of the epidemiological trend of COVID-19 helps control the spread and progression of the disease, thereby reducing its social burden. Therefore, it is important to develop a fast and efficient prediction model for major public emergencies so that reasonable prevention and control measures can be formulated.

In COVID-19 epidemic prediction, scholars around the world have used public epidemic data to conduct a variety of analyses and to test a variety of prediction methods and models. A review of the categories and mechanisms of these models shows that existing prediction methods can be broadly classified into three categories: curve-fitting methods, kinetic methods, and machine learning methods. Zhao et al. [5] were the first to make such a prediction. They used curve fitting based on exponential growth trends to predict the number of cases at the beginning of the outbreak and determined that the early transmission capacity of the novel coronavirus was similar to or slightly greater than that of SARS. Because traditional curve-fitting methods do not consider the characteristics of infectious diseases, their predictions can deviate substantially from reality. A kinetic model describes the speed, spatial extent, transmission pathways, and kinetic mechanisms of infectious diseases [6]. The majority of kinetic models for the current epidemic are based on SEIR models. Geng et al. [7] created an SEIR model based on the characteristics of COVID-19 to evaluate the efficacy of current prevention and control strategies. To predict COVID-19 with reference to the spread of SARS in the absence of data, Wu et al. [8] added crowd flow rate to the traditional SEIR model. Nevertheless, although kinetic methods are good predictors of early epidemic trends, they cannot adequately estimate the spread of disease in open, mobile settings, and their assumed constants for transmission capacity and cure probability rarely match reality. Hence, an analysis of epidemic trends based on these methods will be inaccurate over the long term [9].

Since the emergence of COVID-19, with the continual increase in data samples and the advent of big data, the advantages of technologies such as artificial intelligence have become fully apparent. Peng et al. [10] employed four machine learning models for the prediction of clinical outcomes: artificial neural networks (ANN), naive Bayesian models (NBM), logistic regression (LR), and random forest (RF) models. Liu et al. [11] constructed seven COVID-19 diagnostic models, namely LR, k-nearest neighbor (KNN), decision tree (DT), multilayer perceptron (MLP), RF, support vector machine (SVM), and explainable boosting machine (EBM), and evaluated the accuracy of each. Ji et al. [12] developed a new particle swarm algorithm (ADVPSO) to optimize the parameters of an ANN and applied it to predicting the spread trend of COVID-19. Using RF and ANN, Balerjee et al. [13] developed a predictive model for COVID-19 based on the results of complete blood counts in admitted patients. The model was able to identify 85% of community COVID-19-positive patients without considering patient symptoms or medical history. Using daily data from the Iranian Ministry of Health between February 19 and March 30, 2020, Moftakhar et al. [14] built ARIMA and ANN models to forecast the number of patients to be confirmed in Iran within the next month. Almazroi et al. [15] demonstrated that popular machine learning algorithms, such as ANN and DT, have high variance and bias in prediction and used four ensemble models, namely gradient boosting decision tree (GBDT), RF, extreme gradient boosting (XGBoost), and voting regressors (VR), to predict the daily number of COVID-19 cases in Saudi Arabia. The XGBoost model was found to be the strongest predictor.

The advantage of machine learning methods is that they can learn from historical data to construct an intelligent prediction model for the development of the epidemic. The prediction effect depends not only on the quality of the prediction model but also on the data used for the prediction. Online public opinion [16] is an important vehicle for the dissemination of emergency information in the context of big data. Following the outbreak of an emergency, internet users often learn about the event through search engines or portals. Ginsberg et al. [17] were the first to propose such a study. They utilized Google’s extensive user search data to accurately predict the trend of the proportion of influenza-like cases in the United States one week in advance using the “Google Flu Trends” software developed by Google in 2008. Several follow-up studies built on this work. Signorini et al. [18] built a real-time tracking prediction model for the proportion of influenza-like illnesses across the U.S. and in specific regions, using the percentage of U.S. Twitter volume containing influenza-related keywords as a predictor variable. Li et al. [19] noted that web searches of historical data can better predict influenza trends, while web searches of current data can ensure the accuracy of predictions of new changes. A few scholars have also utilized internet search data in the analysis of the current epidemic. In some parts of the United States, Kurian et al. [20] found a strong correlation between searches on Google Trends and outbreaks of COVID-19 infections.

Overall, different attempts have been made to address the prediction problem of COVID-19 in the existing literature; however, some limitations remain. Firstly, traditional curve-fitting and kinetic methods suffer from low prediction accuracy and poor generalization. Secondly, most of the data used in the existing literature are traditional statistical data, which take a long time to acquire and are not timely, which in turn limits the timeliness of the prediction models. Thirdly, although machine learning techniques are widely employed in COVID-19 research, most studies apply them to predicting COVID-19 diagnosis, whereas little literature specifically addresses the prediction of the epidemic trend.

This paper attempts to address the shortcomings listed above by focusing on the following three aspects of COVID-19 epidemic transmission prediction problem.

First, a COVID-19 network opinion keyword database is developed from relevant references, the Baidu search index time series of these keywords is crawled, and the keywords that are highly correlated with the epidemic are screened out using the LASSO algorithm as input features of the machine learning model.

Second, the final keywords are input into the XGBoost model. To increase the accuracy of the prediction, a chaotic search factor is designed and an opposition-based learning strategy is introduced to create an improved grey wolf optimizer (COLGWO). The algorithm is used to determine the hyperparameters of the XGBoost model, and the resulting COLGWO-XGBoost model is used to predict the change in the cumulative number of confirmed COVID-19 cases in China.

Third, to further improve the generalization performance of COLGWO-XGBoost, a combined Bagging-COLGWO-XGBoost early warning model is proposed based on the Bagging idea of ensemble learning. The proposed model is compared against multiple benchmark models using several performance evaluation metrics to provide a more scientific and comprehensive assessment.

2. Web Search Data

2.1. Data Collection

We used the nationwide cumulative number of confirmed diagnoses (excluding Hong Kong, Macau, and Taiwan) as the prediction label, obtained through the interface of the DXY website (https://www.dxy.cn). The series begins with the first national notification on the official website of the National Health Commission of the People’s Republic of China and runs from January 20, 2020, to May 7, 2020, by which time national prevention and control work had largely entered the normalization stage [21], giving a total of 110 valid observations.

With the development of information technology and the penetration of networks, the internet generates more data with greater timeliness. During an outbreak, people turn to search engines, such as Baidu and Google, to seek information about the outbreak’s cause and the corresponding preventive measures. Using open-source web search data to monitor the COVID-19 epidemic in China is a powerful addition to traditional surveillance tools and can serve as an early warning system, guide medical treatment, and improve prevention and control strategies during the incipient stages of the epidemic. Baidu, China’s most frequently used search engine, provides the Baidu search index, which counts the daily search volume of various keywords. It effectively reflects the media and user attention a keyword receives at a specific point in time and the change in network public opinion in an emergency [22]. The Baidu index publishes each day’s statistics on the following day, and the main variable of an early warning model must lead the target by at least two days to be of practical use [23]. Accordingly, in the present study, the Baidu index search volumes of epidemic-related keywords recorded two days before each observation of the label are used as input features. These data were also gathered with a web crawler from the official Baidu index web interface (https://index.baidu.com). The time period is from January 18, 2020, to May 5, 2020.

2.2. Data Preprocessing

A keyword is a term that most concisely summarizes what the user is seeking. The effectiveness of the prediction depends on selecting web search terms that are relevant to the COVID-19 outbreak. There are three primary types of initial keyword selection methods: the direct word selection method, the technical word selection method, and the range word selection method [24]. After analyzing the advantages and disadvantages of these three methods and reviewing the relevant literature, this study adopts range word selection in combination with direct word selection to build an initial library of 42 keywords. High-dimensional data often cause problems such as high computational complexity and long model running times, which is why this study employs the Lasso method to reduce the dimensionality of the initial index system and eliminate features that are not significant. The Lasso method adds an L1 penalty term to the sum of squared residuals, which shrinks the estimated coefficients and sets some of them exactly to zero, thereby performing feature selection: the larger the penalty parameter λ, the fewer features are selected. We used 10-fold cross-validation to determine the best value of λ. The seven keywords with zero coefficients were removed, leaving a total of 35 keywords as predictive factors across four categories (prevention, symptoms, treatment, and common COVID-19 terms), as shown in Table 1.
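The shrinkage behaviour of the Lasso penalty can be illustrated with a small coordinate-descent sketch. This is not the procedure used in this study: the toy data, the function names `soft_threshold` and `lasso_cd`, the objective scaling (1/2n) and the two penalty values are assumptions chosen only to show that a larger λ leaves fewer nonzero coefficients.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything inside [-lam, lam] becomes 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, sweeps=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for i in range(n):
                pred = sum(X[i][k] * w[k] for k in range(p))
                # Partial residual: feature j is left out of the prediction.
                rho += X[i][j] * (y[i] - pred + w[j] * X[i][j])
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Synthetic, centred toy data: the label depends on feature 0 only.
X = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, -1.0]]
y = [-6.0, -3.0, 0.0, 3.0, 6.0]

w_small = lasso_cd(X, y, lam=0.5)  # the irrelevant feature 1 is dropped
w_large = lasso_cd(X, y, lam=7.0)  # a larger penalty removes feature 0 too
```

In practice λ would be chosen by cross-validation, as described above, rather than fixed by hand.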

The web search keywords in the table were used as input features, and the national cumulative number of confirmed diagnoses was used as the prediction label. All data were normalized to eliminate prediction errors caused by differing data magnitudes. Min-max normalization is defined as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

Here, $x'$ is the normalized data value, $x$ is the original data value before normalization, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the original data. The normalized values fall within the range [0, 1]. The normalized data are then randomly divided into a training set and a test set (80% and 20% of the data samples, respectively), and once the model has been trained, the prediction results are back-normalized to obtain the predicted values.
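A minimal sketch of this preprocessing step is given below. The helper names, the fixed random seed and the synthetic values are illustrative assumptions; only the min-max formula, its inverse and the 80/20 random split come from the text.

```python
import random


def min_max_fit(values):
    """Return the (min, max) pair used for scaling."""
    return min(values), max(values)


def normalize(values, lo, hi):
    """Map values into [0, 1] with x' = (x - min) / (max - min)."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Invert the scaling to recover predictions in the original units."""
    return [s * (hi - lo) + lo for s in scaled]


def train_test_split(rows, train_ratio=0.8, seed=0):
    """Shuffle the rows and cut them into training and test subsets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Synthetic values, not data from the study.
series = [10.0, 20.0, 40.0, 80.0, 160.0]
lo, hi = min_max_fit(series)
scaled = normalize(series, lo, hi)
restored = denormalize(scaled, lo, hi)

# 110 observations split 80/20, matching the sample size reported above.
train, test = train_test_split(list(range(110)))
```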

2.3. Correlation Analysis of Sample Data

Spearman’s correlation coefficients were calculated between the final keyword data in Table 1 and the predicted variable (cumulative number of confirmed diagnoses). The results are shown in Table 2. It can be seen that the Baidu index of 33 of the 35 keywords selected in the sample data showed a statistically significant correlation with the cumulative number of confirmed cases of COVID-19 (). Among them, the correlations of “fever,” “malaise,” “respiratory infection,” and “cold” were high in the classification of symptom words, with absolute values above 0.8. Although these four symptoms are not specific to COVID-19, they are easy to identify and are of great concern to the general population and are the best indicators of possible COVID-19 infection. It also indicates that the general population is aware of the basic knowledge of the symptoms of COVID-19. In addition, the absolute values of the correlation coefficients of “N95 mask,” “medical surgical mask,” and “84 disinfectant” also reached above 0.8. It indicates that with the further spread of the epidemic and the escalation of prevention and control measures, residents’ demand for personal protection knowledge tends to be stronger.
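For readers unfamiliar with the statistic, Spearman's coefficient is the Pearson correlation computed on ranks (with tied values sharing their average rank). The sketch below uses hypothetical helper names and synthetic inputs; it does not reproduce Table 2 and omits the significance test.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic series: any monotone relationship gives a coefficient of +1 or -1.
rho_up = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
rho_down = spearman([1, 2, 3, 4, 5], [25, 16, 9, 4, 1])
```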

3. Algorithm Analysis

3.1. Grey Wolf Optimizer

In 2014, Mirjalili et al. [25] developed a swarm intelligence optimization algorithm that emulates the hunting behavior of grey wolf packs. During the optimization phase of GWO, the α, β, and δ wolves, which hold the highest social ranks in each generation, lead the lower-ranked ω wolves in encircling and searching for the prey. GWO has a simple structure and few parameters to adjust and is therefore easy to implement, which has led to its wide use for optimization in several fields.

To begin, we can describe mathematically the process by which a wolf pack searches for and slowly surrounds its prey.

$$D = \left| C \cdot X_p(t) - X(t) \right| \tag{2}$$

$$X(t+1) = X_p(t) - A \cdot D \tag{3}$$

$$A = 2a \cdot r_1 - a \tag{4}$$

$$C = 2 r_2 \tag{5}$$

$$a = 2 - \frac{2t}{T} \tag{6}$$

Here, $X_p(t)$ is the position of the prey after the $t$th iteration, $X(t)$ is the position of the grey wolf at the $t$th iteration, $D$ denotes the distance between the grey wolf and the prey, $X(t+1)$ denotes the updated position of the grey wolf, $A$ and $C$ are the coefficient vectors, $a$ is the convergence factor whose value decreases linearly from 2 to 0 with the number of iterations, $t$ is the current number of iterations, and $T$ is the maximum number of iterations. $r_1$ and $r_2$ are random numbers in [0, 1].

Secondly, the position of the three optimal wolves α, β, and δ is constantly updated to determine the prey. The following is a mathematical description of the hunting process of a wolf pack:

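$$D_\alpha = \left| C_1 \cdot X_\alpha(t) - X(t) \right| \tag{7}$$

$$D_\beta = \left| C_2 \cdot X_\beta(t) - X(t) \right| \tag{8}$$

$$D_\delta = \left| C_3 \cdot X_\delta(t) - X(t) \right| \tag{9}$$

$$X_1 = X_\alpha(t) - A_1 \cdot D_\alpha \tag{10}$$

$$X_2 = X_\beta(t) - A_2 \cdot D_\beta \tag{11}$$

$$X_3 = X_\delta(t) - A_3 \cdot D_\delta \tag{12}$$

$$X(t+1) = \frac{X_1 + X_2 + X_3}{3} \tag{13}$$

Here, $X_\alpha(t)$, $X_\beta(t)$, and $X_\delta(t)$ are the positions of the α, β, and δ wolves when the population is iterated to the $t$th generation. $X(t)$ is the position of an individual grey wolf in the $t$th generation. $A_1$ and $C_1$, $A_2$ and $C_2$, and $A_3$ and $C_3$ are the coefficient vectors of the α, β, and δ wolves, respectively. $X_1$, $X_2$, and $X_3$ are the candidate positions obtained under the guidance of the α, β, and δ wolves, respectively. $X(t+1)$ is the position of the grey wolf in the next generation.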
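The search and hunting process can be sketched in a few lines. The code below assumes the standard GWO update of Mirjalili et al. (A = 2a·r1 − a, C = 2·r2, a decreasing linearly from 2 to 0, new position equal to the mean of the three leader-guided candidates); the function names, the bound clipping, the population size and the seed are illustrative choices rather than the settings used in the experiments.

```python
import random


def sphere(x):
    """Simple unimodal test function with minimum 0 at the origin."""
    return sum(v * v for v in x)


def gwo(fitness, dim, lb, ub, n_wolves=20, max_iter=200, seed=0):
    """Minimise `fitness` with a basic grey wolf optimizer."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(max_iter):
        # The three best wolves act as alpha, beta and delta for this generation.
        leaders = [w[:] for w in sorted(wolves, key=fitness)[:3]]
        a = 2.0 - 2.0 * t / max_iter  # convergence factor, 2 -> 0
        for w in wolves:
            for d in range(dim):
                candidates = []
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - w[d])       # distance to the leader
                    candidates.append(leader[d] - A * D)
                # Average the three candidates and keep the wolf inside the bounds.
                w[d] = min(max(sum(candidates) / 3.0, lb), ub)
    best = min(wolves, key=fitness)
    return best, fitness(best)


best_pos, best_val = gwo(sphere, dim=3, lb=-100.0, ub=100.0)
```

Because the convergence factor shrinks toward zero, the late iterations move every wolf close to the mean of the three leaders, which is the loss of diversity that the improvements in the next section are meant to counter.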

3.2. Improving Grey Wolf Optimizer

Although the standard GWO performs better than most intelligent optimization methods, it is not well suited to complex functions. Hence, the balance between global search and local convergence needs to be improved to raise the performance of the grey wolf algorithm [26]. In the standard GWO, the direction of and distance to the prey are continuously adjusted using equations (7) through (13) until the prey is caught after many iterations. Analysis of this process shows that the grey wolves move toward the same region as the number of iterations increases, so the whole pack loses diversity and easily falls into a local optimum. The chaos opposition learning grey wolf optimizer (COLGWO) is proposed in this study to remedy these defects of the standard GWO. The details are provided below.

3.2.1. Opposition-Based Learning

If the grey wolves of the initialized population are near the optimal solution, the convergence is quicker, and if the grey wolves of the initialized population deviate from the optimal solution, the algorithm is slower or fails to converge. Opposition-based learning (OBL) was implemented by Tizhoosh et al. [27] to help expand the individual search space, enhance the global exploration capability, and overcome the algorithm falling into local optimum. Below is its mathematical definition.

First, opposite points are used in the population initialization of the grey wolf optimizer. For a point $X = (x_1, x_2, \ldots, x_D)$ in the D-dimensional space, its opposite point $X^* = (x_1^*, x_2^*, \ldots, x_D^*)$ is calculated as follows:

$$x_j^* = a_j + b_j - x_j \tag{14}$$

Here, $j = 1, 2, \ldots, D$, and $a_j$ and $b_j$ are the value boundaries of the point in the $j$th dimension.

Secondly, the grey wolf algorithm is iteratively optimized using opposition-based learning with a jump probability to better achieve its global optimization objective. Let $J_r$ be the jump probability; if $\mathrm{rand}(0, 1) < J_r$, then the opposite population is generated. Let $x_{ij}$ be the value of grey wolf individual $i$ in the $j$th dimension, with $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, D$; the corresponding opposite solution for that individual is calculated as follows:

$$x_{ij}^* = k \left( a_j(t) + b_j(t) \right) - x_{ij} \tag{15}$$

Here, $k$ is a random number in [0, 1], and $[a_j(t), b_j(t)]$ is the dynamic boundary of the population in the $j$th dimension, which is calculated as follows:

$$a_j(t) = \min_i x_{ij}(t) \tag{16}$$

$$b_j(t) = \max_i x_{ij}(t) \tag{17}$$

If the opposite solution falls outside the dynamic boundary, it is replaced by a point generated at random within the boundary.
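The two uses of opposition-based learning can be sketched as follows. The sketch assumes the usual definitions (opposite point a + b − x for fixed bounds, and k·(a(t) + b(t)) − x with the current population minimum and maximum as dynamic bounds); the function names and the choice to draw one random k per population are assumptions made for illustration.

```python
import random


def opposite_point(x, lower, upper):
    """Opposite of x with respect to fixed per-dimension bounds."""
    return [lo + hi - v for v, lo, hi in zip(x, lower, upper)]


def dynamic_opposite_population(pop, rng):
    """Opposite population using the current min/max of each dimension."""
    dim = len(pop[0])
    a = [min(ind[j] for ind in pop) for j in range(dim)]
    b = [max(ind[j] for ind in pop) for j in range(dim)]
    k = rng.random()  # assumption: one random factor shared by the population
    result = []
    for ind in pop:
        new = []
        for j in range(dim):
            v = k * (a[j] + b[j]) - ind[j]
            # Resample inside the dynamic boundary if the point escapes it.
            if v < a[j] or v > b[j]:
                v = rng.uniform(a[j], b[j])
            new.append(v)
        result.append(new)
    return result


opp = opposite_point([10.0, -5.0], [-100.0, -100.0], [100.0, 100.0])

rng = random.Random(1)
population = [[rng.uniform(-100.0, 100.0) for _ in range(2)] for _ in range(6)]
mirrored = dynamic_opposite_population(population, rng)
```

A caller would then evaluate both populations and keep the fitter individuals, as described for the initialization step.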

3.2.2. Chaotic Search Factor

By utilizing the randomness, ergodicity, and regularity of the logistic chaotic sequence, a chaotic search input is added to the grey wolf optimizer, and a local search is performed in the vicinity of the optimal α-wolf in each iteration. In addition, a shrinkage strategy is introduced to achieve a large search range at the beginning of the iteration and a smaller search range at the end, which enhances the local mining capability of the standard GWO algorithm, allowing the algorithm to perform both global and local searches equally well while improving its accuracy. The following is a mathematical description.

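The first step is to generate the chaotic variables $z_k$ with the logistic map.

$$z_{k+1} = 4 z_k (1 - z_k) \tag{18}$$

Here, the range of the values of $k$ is [1, K], the range of the values of $z_k$ is (0, 1), and the initial value of the sequence must not equal 0.25, 0.5, or 0.75. K is the length of the chaotic sequence.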

The second step is to map the chaotic variable $z_k$ generated by equation (18) to a chaotic vector $y_k$ in the definition domain [LB, UB].

$$y_k = LB + z_k (UB - LB) \tag{19}$$

Here, $k$ takes values within the range [1, K]. LB and UB represent the lower and upper limits, respectively.

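In addition, the chaotic vector $y_k$ and the optimal α-wolf $X_\alpha$ are linearly combined to generate the chaotic wolf $X_c^k$.

$$X_c^k = (1 - \lambda) X_\alpha + \lambda y_k \tag{20}$$

$$\lambda = \frac{T - t + 1}{T} \tag{21}$$

where $k$ belongs to the range [1, K], $\lambda$ is the contraction factor, $T$ is the maximum number of iterations in the algorithm, and $t$ is the current number of iterations.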

To be able to better understand the chaos search factor, the following is a brief example of a one-dimensional function sphere, from which higher dimensions can be extended. Suppose the function is defined in the domain [−100, 100], the current optimal wolf is 10, the current iteration number is 2, and the maximum iteration number is 100. If the random number , then according to equation (18), we have and . In turn, according to equation (19), we have , , and . From equation (21), we can get at this time. Then, the chaos vector is linearly combined with the head wolf at this time by equation (20) to get the chaos wolf , , and . However, if the generated fitness value is better than that of the wolf, then the new wolf is replaced as . It should be noted that if the current iteration number is 1, then and , i.e., , , and . The chaotic wolves at this point are randomly generated from the chaotic vectors, independent of the head wolf. It can be observed from the above example that gradually becomes smaller with the increase of evolutionary generations, and the generated chaotic wolf is smaller at this time, which indicates that the scope of local search becomes smaller with the continuous iteration of the grey wolf optimizer.
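The same one-dimensional setting (domain [−100, 100], current optimal wolf 10, current iteration 2, maximum iteration 100) can be explored with the sketch below. It rests on assumed forms that are consistent with the description but should be checked against the original equations: a logistic map with parameter 4, a linear mapping onto [LB, UB], a contraction factor λ = (T − t + 1)/T, and a chaotic wolf (1 − λ)·α + λ·y. The initial chaotic value 0.3, the sequence length 3 and all function names are hypothetical.

```python
def logistic_sequence(z0, length):
    """Chaotic sequence z_{k+1} = 4 * z_k * (1 - z_k), starting from z0."""
    seq = [z0]
    while len(seq) < length:
        seq.append(4.0 * seq[-1] * (1.0 - seq[-1]))
    return seq


def contraction_factor(t, max_iter):
    """Assumed shrinkage schedule: 1 at the first iteration, then decreasing."""
    return (max_iter - t + 1) / max_iter


def chaotic_wolves(alpha, z0, length, lb, ub, t, max_iter):
    """Candidates obtained by blending the alpha wolf with chaotic points."""
    lam = contraction_factor(t, max_iter)
    wolves = []
    for z in logistic_sequence(z0, length):
        y = lb + z * (ub - lb)                  # map chaos value into [lb, ub]
        wolves.append((1.0 - lam) * alpha + lam * y)
    return wolves


def local_search(alpha, fitness, **kwargs):
    """Return the first chaotic wolf that beats alpha, else alpha itself."""
    for candidate in chaotic_wolves(alpha, **kwargs):
        if fitness(candidate) < fitness(alpha):
            return candidate
    return alpha


settings = dict(z0=0.3, length=3, lb=-100.0, ub=100.0, t=2, max_iter=100)
candidates = chaotic_wolves(10.0, **settings)
new_alpha = local_search(10.0, lambda x: x * x, **settings)
```

Starting the map at 0.25 shows why that value is excluded: the sequence immediately locks onto the fixed point 0.75 and stops being chaotic.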

3.3. Bagging-Integrated Learning Strategy

Ensemble learning methods improve model generalization and prediction accuracy by combining the prediction results of multiple homogeneous or heterogeneous models [28]. Bootstrap aggregating (Bagging) combines the predictions of multiple homogeneous base learners; this strategy can significantly improve the generalization capability of a prediction model and reduce overfitting.

In the Bagging ensemble learning strategy, data are drawn at random from the data set by sampling with replacement (Bootstrap) to produce N subtraining sets of equal size. A base learner is then trained on each subtraining set to obtain N base learners, and finally, the predictions of the base learners are arithmetically averaged. This process is illustrated in Figure 1.
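The procedure can be sketched with a deliberately simple base learner. Here a one-dimensional least-squares line stands in for the much richer model used in this paper; the function names, the synthetic data and the seed are assumptions for illustration only.

```python
import random


def fit_line(points):
    """Least-squares line through (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        return 0.0, mean_y  # degenerate resample: fall back to a constant
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    return slope, mean_y - slope * mean_x


def bagging_fit(points, n_learners, seed=0):
    """Train one base learner per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_learners):
        sample = [rng.choice(points) for _ in points]  # same size as the data
        models.append(fit_line(sample))
    return models


def bagging_predict(models, x):
    """Arithmetic mean of the base learners' predictions."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)


# Synthetic, noise-free data on the line y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
ensemble = bagging_fit(data, n_learners=10)
prediction = bagging_predict(ensemble, 4.0)
```

With noisy data the individual lines would differ from one resample to the next, and averaging them is what reduces the variance of the final prediction.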

3.4. Bagging-COLGWO-XGBoost Model

XGBoost is a boosting model developed by Chen et al. [29] in 2016 that combines a linear solver with classification and regression trees (CART). The basic idea is to combine multiple tree models that individually have low prediction accuracy into a combined model with higher prediction accuracy. The combined model is built iteratively through gradient boosting, with each iteration producing a new tree that fits the residuals left by the previous trees until the optimal model is obtained. The XGBoost method is based on a second-order Taylor expansion of the loss function. Furthermore, a regularization term is added to the objective function to balance the reduction of the objective against the complexity of the model, thereby mitigating overfitting. A number of studies in recent years have demonstrated good performance of the XGBoost model for predictions in biology, medicine, and economics. The mathematical principles of the model are as follows:

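The tree ensemble model is defined as follows:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{22}$$

In this equation, $\hat{y}_i$ is the prediction value, $K$ is the number of trees, $F$ is the range of tree selections (the space of candidate trees), and $x_i$ is the $i$th input feature.

The loss function for the XGBoost model is shown below.

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{23}$$

Here, the first part of the function is the error between the predicted and the actual training values of the XGBoost model, while the second represents the complexity of the trees, which serves as the regularization that controls the complexity of the model.

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2 \tag{24}$$

Here, $\gamma$ and $\lambda$ represent penalty factors, $T$ is the number of leaf nodes of the tree, and $\omega$ is the vector of leaf weights.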

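The loss function is minimized by adding the incremental function $f_t(x_i)$ to equation (23). Thus, the objective function at the $t$th iteration becomes as follows:

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{25}$$

A second-order Taylor expansion of equation (25) is used to approximate the objective function at this point. Define the set of samples in leaf $j$ of the $t$th tree as $I_j$. At this point, we can approximate $L^{(t)}$ as follows:

$$L^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) = \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) \omega_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) \omega_j^2 \right] + \gamma T$$

Here, $g_i$ is the first-order derivative of the loss function with respect to $\hat{y}_i^{(t-1)}$, and $h_i$ is the second-order derivative. The subsequent equation is obtained by defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$.

$$L^{(t)} = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} (H_j + \lambda) \omega_j^2 \right] + \gamma T$$

The following equation is obtained by taking the partial derivative with respect to $\omega_j$ and setting it to zero.

$$\omega_j^* = -\frac{G_j}{H_j + \lambda}$$

The following equation can be obtained by substituting the weights into the objective function.

$$L^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$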
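As a small numerical check of the closed-form leaf weight and objective value, the sketch below assumes the squared-error loss l = ½(y − ŷ)², for which the first derivative is ŷ − y and the second derivative is 1. The function names and the two-sample leaf are illustrative assumptions.

```python
def leaf_weight(grads, hessians, lam):
    """Optimal leaf weight: -G / (H + lambda)."""
    G, H = sum(grads), sum(hessians)
    return -G / (H + lam)


def structure_score(leaves, lam, gamma):
    """Objective value of a tree: -1/2 * sum(G^2 / (H + lambda)) + gamma * T."""
    total = 0.0
    for grads, hessians in leaves:
        G, H = sum(grads), sum(hessians)
        total += G * G / (H + lam)
    return -0.5 * total + gamma * len(leaves)


# One leaf holding two samples with targets 3 and 5 and current prediction 0.
targets = [3.0, 5.0]
current = [0.0, 0.0]
grads = [p - y for p, y in zip(current, targets)]   # squared-error gradient
hessians = [1.0 for _ in targets]                   # squared-error hessian

w_unregularised = leaf_weight(grads, hessians, lam=0.0)  # mean residual
w_regularised = leaf_weight(grads, hessians, lam=2.0)    # shrunk toward 0
score = structure_score([(grads, hessians)], lam=0.0, gamma=0.0)
```

Without regularization the leaf weight equals the mean residual of the samples in the leaf, which matches the description of each new tree fitting the residuals of the previous ones; a positive λ shrinks that weight.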

Training the XGBoost model depends heavily on the choice of parameters, as different parameter selections have a significant effect on the prediction results. Within the XGBoost algorithm, there are 23 hyperparameters, which can be classified into three types: general parameters that control macro-level functions, booster parameters that control the details of each booster, and learning target parameters that control the training objective. COLGWO-XGBoost takes the three hyperparameters that have the greatest impact on the performance of XGBoost, namely learning_rate, n_estimators, and max_depth, and uses them as the components of the position vector of the head wolf α in COLGWO. The COLGWO algorithm is iterated, and the position vector corresponding to the global optimal position is then returned to the XGBoost model. Furthermore, to enhance the generalization of the prediction model, the Bagging strategy is employed to ensemble the COLGWO-XGBoost model and construct the final early warning model in this paper. The combined use of this strategy can reduce the variance of the COLGWO-XGBoost model and prevent overfitting. The following are the specific steps for implementing the model.

Step 1. Initialize the parameters of COLGWO by setting the maximum number of iterations M, the population size N, and the upper bound UB and lower bound LB of the parameters to be searched.

Step 2. Randomly initialize the population and generate an opposite population of equal size based on equation (14). Using equation (29), select from the two populations the individual positions with the best fitness values to form the final initial population, and then select the three individual positions with the best fitness values as the α, β, and δ wolves.

Step 3. Call the chaotic search factor to generate chaotic wolves according to equations (18) through (21). If the fitness value of a chaotic wolf is better than that of the α wolf, the α wolf is replaced with the chaotic wolf and the local search ends. Otherwise, the local search ends once the chaotic sequence of length K has been exhausted.

Step 4. Update the convergence coefficient and the coefficient vectors A and C for the α, β, and δ wolves according to equations (4) to (6).

Step 5. Calculate the distances between the α, β, and δ wolves and the prey, and update the positions of these wolves according to equations (7) to (12), respectively.

Step 6. Update the wolf positions based on equation (13) to produce a new generation of the grey wolf population.

Step 7. Repeat Steps 3 to 6 until the maximum number of COLGWO iterations M is reached. Take the position of the α wolf at this point as the hyperparameters of the XGBoost model, and construct the COLGWO-XGBoost prediction model.

Step 8. Using bootstrap sampling, draw N subtraining sets of equal sample size from the input data, and fit the COLGWO-XGBoost model to each subtraining set separately to generate N base learners.

Step 9. Use the N base learners to predict the prediction data separately, and arithmetically average their predicted results to obtain the final prediction result of the Bagging-COLGWO-XGBoost model.
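One practical detail in these steps is that a wolf position is a real-valued vector, whereas n_estimators and max_depth must be integers. A possible way to translate a position into a hyperparameter dictionary is sketched below. The search bounds, the rounding rule and the names `BOUNDS` and `decode_position` are hypothetical and are not taken from Table 5.

```python
# Hypothetical search ranges: (lower bound, upper bound) per hyperparameter.
BOUNDS = {
    "learning_rate": (0.01, 0.3),
    "n_estimators": (50, 500),
    "max_depth": (2, 10),
}
INTEGER_PARAMS = {"n_estimators", "max_depth"}


def decode_position(position):
    """Turn a 3-dimensional wolf position into XGBoost hyperparameters."""
    params = {}
    for value, (name, (lo, hi)) in zip(position, BOUNDS.items()):
        clipped = min(max(value, lo), hi)        # keep the wolf inside the bounds
        if name in INTEGER_PARAMS:
            params[name] = int(round(clipped))   # trees and depth are integers
        else:
            params[name] = float(clipped)
    return params


inside = decode_position([0.1, 120.4, 5.6])
outside = decode_position([0.5, 1234.6, 0.2])
```

A fitness function for the optimizer would then train a model with the decoded dictionary (for instance by passing it as keyword arguments to a gradient boosting regressor that accepts these names) and return its validation error, so that a lower value means a better wolf.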

4. Experimental Results and Discussion

4.1. Improved Algorithm Performance Testing
4.1.1. Standard Test Functions

We selected eight typical standard test functions from the global optimization benchmark library to test the optimization capability of the COLGWO algorithm; their names, expressions, search intervals, and theoretical minimum values are shown in Table 3. The continuous single-peaked (unimodal) functions in this set are typically used to test the local exploitation ability of an algorithm. The remaining functions are complex nonlinear multipeaked (multimodal) functions with a large number of local extremes, which are generally used to test the global exploration ability of an algorithm.

4.1.2. Experimental Analysis

The optimization performance of DE (differential evolution), PSO (particle swarm optimization), GWO, and COLGWO on the eight benchmark test functions was compared in the same runtime environment, using Python 3.7 for the simulation. The primary parameters of the four algorithms were the same, namely a population size of 30, a maximum of 500 iterations, and a dimension of 30. Each algorithm was run independently 30 times, and the mean and standard deviation of the search results were calculated. Table 4 shows the experimental results. Table 4 shows that GWO outperforms the two classical evolutionary algorithms, DE and PSO, on both the continuous single-peaked functions and the complex nonlinear multipeaked functions. Although the improved COLGWO algorithm does not reach the theoretical optimum in every case, it improves significantly on the standard GWO and has the best solution accuracy among the four algorithms. This indicates that the modification of the standard GWO algorithm in this study is effective, balances the global and local search abilities well, and improves the stability of the algorithm. For the eight standard test functions, Table 4 only displays the optimal values determined by the four algorithms. To visualize convergence, Figure 2 compares the convergence curves of the four algorithms on the test functions. Figure 2 shows that COLGWO converges markedly faster. The improved algorithm presented in this paper therefore accounts for convergence speed, global exploration ability, and local fine search ability at the same time.

4.2. Comparison of Prediction Effects
4.2.1. Evaluation Functions

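As evaluation functions for the forecasting models, we select root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Smaller RMSE, MAE, and MAPE values indicate better prediction performance. The formulas are as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|.$$

Here, $n$ is the number of samples, $i$ is the sample index, $y_i$ stands for the true value, and $\hat{y}_i$ stands for the predicted value of the model.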
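The three error measures named above can be computed with a few lines of Python. This is a sketch using the standard definitions of RMSE, MAE, and MAPE; the function names and the sample values are illustrative assumptions and are not taken from the experimental data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; undefined if a true value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```

RMSE penalizes large individual errors more heavily than MAE, while MAPE is scale-free, which matters for cumulative case counts that grow by orders of magnitude.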

4.2.2. Experimental Analysis

The present study employs multiple algorithms as comparison models to evaluate the predictive ability of the Bagging-COLGWO-XGBoost model on the cumulative number of confirmed COVID-19 cases. For single-model prediction, in addition to the four traditional machine learning models SVR, LR, MLP, and GBDT, three classical deep learning models, namely convolutional neural networks (CNN) [30], recurrent neural networks (RNN) [31], and long short-term memory (LSTM) [32], were selected to compare and analyze the prediction effect of XGBoost. For combined-algorithm prediction, the GWO-optimized XGBoost model (GWO-XGBoost), GWO-optimized GBDT model (GWO-GBDT), COLGWO-optimized GBDT (COLGWO-GBDT), COLGWO-optimized XGBoost (COLGWO-XGBoost), and Bagging-COLGWO-GBDT were selected to compare and analyze the prediction effect of the Bagging-COLGWO-XGBoost model proposed in this study. Table 5 contains the parameters of each model.

The proposed model and each comparison model are fitted on the training dataset, and predictions are made on the test dataset. The prediction accuracy of the single models on the three metrics is given in Table 6, and Figure 3 shows their prediction effect. Among the single models, XGBoost and GBDT perform significantly better than SVR, MLP, LR, CNN, RNN, and LSTM. XGBoost ranked first in MAPE, and GBDT ranked first in RMSE and MAE. Both models outperformed the other comparison models on all three metrics, RMSE, MAE, and MAPE. This result indicates that, given the same data, selecting a suitable regression model plays a key role in improving prediction performance.

To further observe the difference in prediction performance between GBDT and XGBoost, in the combined algorithm, the standard GWO algorithm is first used to optimize the hyperparameters of the two models separately, and the COLGWO algorithm proposed in this paper is then applied to compare the optimization results. Finally, the Bagging strategy is used to integrate the learning of each of the two models. Table 7 and Figure 4 show the experimental results of the combined algorithms. After GWO optimization, the prediction accuracy of GBDT improves slightly; however, it still lags significantly behind that of XGBoost. Using COLGWO for the parameter optimization of both models achieves better results than GWO. The Bagging integration strategy further reduces the generalization error of the combined algorithm, and it has the best prediction effect when combined with XGBoost optimized by COLGWO. This shows that the improvements XGBoost makes over GBDT are effective. Because XGBoost adds a regularization term to the loss function to control model complexity, it reduces the variance of the model in the bias-variance trade-off and makes the trained model simpler, which helps prevent overfitting. In addition, although the prediction performance of XGBoost is excellent, the model is very sensitive to its parameters, so an efficient parameter selection method is crucial to the prediction effect. The experimental results confirm the superiority of the improved algorithm (COLGWO) proposed in this study for hyperparameter selection.
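The Bagging step discussed above can be illustrated with a short sketch: each base model is trained on a bootstrap resample of the training data, and the ensemble prediction is the average of the individual predictions. The base learner here, `fit_line`, is a hypothetical one-feature least-squares stand-in and not the COLGWO-tuned XGBoost model of this paper. The function names, the number of estimators, and the toy data are likewise assumptions.

```python
import random


def fit_line(xs, ys):
    """Hypothetical stand-in base learner: least-squares line on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def bagging_fit(xs, ys, fit, n_estimators=10, seed=0):
    """Train n_estimators base models, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Average the predictions of all base models."""
    return sum(m(x) for m in models) / len(models)


# Toy data on an exact line y = 2x + 1, for illustration only.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
models = bagging_fit(xs, ys, fit_line)
print(bagging_predict(models, 10.0))
```

Averaging over resampled models leaves the bias of the base learner roughly unchanged but reduces the variance of the prediction, which is the effect the text attributes to the Bagging strategy.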

In summary, the overall prediction accuracy of the models optimized by the combined algorithms is better than that of the single models. The Bagging-COLGWO-XGBoost model with the integration strategy ranked first among all algorithms on all three evaluation metrics, improving by 77.19%, 63.28%, and 71.8%, respectively, compared to XGBoost, and by 60.21%, 34.14%, and 3.23%, respectively, compared to Bagging-COLGWO-GBDT. This indicates that the COLGWO algorithm proposed in this paper is very effective for hyperparameter selection of the XGBoost model in the early warning problem of COVID-19, and that combining it with the Bagging integration strategy further enhances the prediction effect.
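The percentages quoted above are most naturally read as relative reductions in each error metric. The source does not state the formula, so the sketch below is an assumption about how such figures are usually computed, and the numbers in the example are made up rather than taken from Tables 6 and 7.

```python
def relative_improvement(baseline_error, model_error):
    """Percentage reduction of an error metric relative to a baseline model."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical errors: a baseline error of 200 reduced to 50 by the improved model.
print(relative_improvement(200.0, 50.0))
```

Under this reading, a larger percentage means a larger share of the baseline error has been removed.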

5. Conclusions

This work analyses the initial web search data associated with COVID-19 at the time of its outbreak and then uses the LASSO method to select features as input to the XGBoost model for prediction. A new grey wolf algorithm is proposed to optimize the hyperparameters of XGBoost and improve prediction accuracy. We built a Bagging-COLGWO-XGBoost early warning model, which uses the Bagging integration strategy to enhance the generalizability of the combined algorithm. After comparing the early warning models, the following conclusions were drawn.

Firstly, the data source affects the accuracy and timeliness of the prediction model. Web search data is convenient, timely, and sensitive to user needs, so it can effectively supplement traditional statistics.

Secondly, the standard GWO algorithm suffers from low solution accuracy, slow convergence, and a tendency to fall into local optima from which it cannot escape. To address these shortcomings, the COLGWO algorithm proposed in this paper initializes the population with a mechanism based on backward learning and adds jump probabilities so that the algorithm can leave a local optimal solution. Additionally, to enhance the mining ability of the standard GWO algorithm, a chaotic search operator was developed, which uses chaotic sequences to carry out local searches around the current optimal solution. Experimental results demonstrate that the improved grey wolf algorithm balances global and local search, and that its optimization results on all eight standard test functions are clearly superior to those of the standard grey wolf algorithm. The results also demonstrate that using COLGWO to optimize the hyperparameters of the XGBoost model significantly improves prediction accuracy.

Thirdly, in terms of prediction performance, Bagging-integrated learning reduces the variance of the predictive model and enhances its generalization capability. The Bagging-COLGWO-XGBoost model presented in this paper performs significantly better than the thirteen other models in predicting the cumulative number of confirmed COVID-19 cases in China, indicating that the model is reliable for this task.
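Two of the mechanisms summarized in the second conclusion, initialization by backward (opposition-based) learning and chaotic local search around the current best solution, can be sketched as follows. The exact operators of COLGWO are not restated in this section, so the sketch assumes the common textbook forms: the opposite of a point x in [lower, upper] is lower + upper - x, and the chaotic sequence is generated by the logistic map z <- 4z(1 - z). All function names and parameter values are hypothetical, and the jump-probability mechanism is not covered.

```python
import random


def sphere(x):
    """Example objective for the demo only."""
    return sum(v * v for v in x)


def opposition_init(objective, pop_size, dim, lower, upper, rng):
    """Draw a random population, add each point's opposite, keep the best pop_size."""
    pop = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(pop_size)]
    opposite = [[lower + upper - v for v in x] for x in pop]
    return sorted(pop + opposite, key=objective)[:pop_size]


def chaotic_local_search(objective, best, lower, upper, radius, steps=20, z=0.7):
    """Probe around `best` with offsets driven by the logistic map; keep improvements."""
    best = list(best)
    best_fit = objective(best)
    for _ in range(steps):
        cand = []
        for v in best:
            z = 4.0 * z * (1.0 - z)  # next chaotic value in the unit interval
            cand.append(min(max(v + radius * (2.0 * z - 1.0), lower), upper))
        fit = objective(cand)
        if fit < best_fit:  # greedy acceptance: never worsens the current best
            best, best_fit = cand, fit
    return best, best_fit


rng = random.Random(1)
population = opposition_init(sphere, pop_size=10, dim=3, lower=-5.0, upper=5.0, rng=rng)
refined, refined_fit = chaotic_local_search(sphere, population[0], -5.0, 5.0, radius=0.5)
print(sphere(population[0]), refined_fit)
```

Evaluating opposite points doubles the coverage of the initial sample at little cost, while the chaotic probe refines only the neighbourhood of the best wolf, which matches the division between global exploration and local mining described in the text.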

Overall, the combined predictive model Bagging-COLGWO-XGBoost developed in this study, together with web search data, can predict the cumulative number of confirmed COVID-19 cases in China under this emergency situation. In addition, the model may be continuously updated and optimized with the latest data to provide real-time prediction, which can serve as a reliable source of data for the Chinese government in formulating reasonable and effective health policy. Furthermore, the model constructed in this study can also be used to predict similar major public emergencies and can be extended to other early warning tasks for public emergencies.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.


Acknowledgments

This research was funded by the National Natural Science Foundation of China (Grant no. 81973791).