Abstract

Atmospheric refraction is a meteorological phenomenon caused mainly by gas molecules and aerosol particles in the atmosphere, which can change the propagation direction of electromagnetic waves. The atmospheric refractive index, which measures atmospheric refraction, is an important parameter for electromagnetic wave propagation. Because the atmospheric refractive index between 100 m and 3000 m over the ocean is difficult to obtain, this paper proposes an improved extreme gradient boosting (XGBoost) algorithm based on the comprehensive learning particle swarm optimization (CLPSO) operator to obtain it. The mean absolute percentage error (MAPE) and root mean-squared error (RMSE) are used as evaluation criteria to compare the predictions of the improved XGBoost algorithm with those of a backpropagation (BP) neural network and the traditional XGBoost algorithm. The results show that the MAPE and RMSE of the improved XGBoost algorithm are 39% less than those of the BP neural network and 32% less than those of the traditional XGBoost. In addition, the improved XGBoost algorithm has the strongest learning and generalization capability among the three algorithms for calculating missing values of the atmospheric refractive index. These results provide a new method for obtaining the atmospheric refractive index and a useful reference for further study of atmospheric refraction.

1. Introduction

Atmospheric refraction, which is caused mainly by gas molecules and aerosol particles, is a special phenomenon of the atmospheric environment; it can change the propagation route of electromagnetic waves and thereby cause their abnormal propagation. Under trapping refraction in particular, the electromagnetic wave is confined within a layer of atmosphere of a certain thickness and propagates back and forth between the upper and lower boundaries of that layer; this phenomenon is called atmospheric duct propagation, and the layer that causes it is called an atmospheric duct. Because an atmospheric duct can significantly change the route of electromagnetic waves, it has a great impact on radar, navigation, and communication systems [1]. If the duct effect is fully exploited, beyond-the-horizon transmission and target detection can be realized. The atmospheric refractive index is an important measure of the degree of atmospheric refraction, and its variation is one of the main bases for judging whether atmospheric ducts are present. Atmospheric ducts form when the atmospheric temperature and humidity decrease sharply with height, which makes the atmospheric refractive index decrease and leads to the formation of atmospheric ducts [2].

So far, many scholars have studied the atmospheric refractive index [3–13]. The traditional way to obtain it is to measure temperature and humidity profiles by radiosonde and then estimate the atmospheric refractive index [14, 15]; later, Mathai and Harrison [16] used the measured volume scattering coefficient and particle size distribution to estimate the refractive index of atmospheric aerosol. Radar phase information can also be used to retrieve the atmospheric refractive index [17, 18], and some methods for estimating the atmospheric structure constant provide a further reference [19–21]. In recent years, new measurement methods have appeared. For instance, Dinar et al. [22] used a cavity ring-down aerosol spectrometer to obtain the complex refractive index of the atmosphere, and Xie et al. [23] used a one-dimensional variational assimilation algorithm to obtain the atmospheric refractive index from ground-based Global Positioning System (GPS) phase delay. In addition, with the development of computing, machine learning is increasingly applied in meteorology [24–33]. Machine learning uses computers to summarize existing data, extract general rules, and establish the corresponding mapping relationship [34]. It is well suited to data analysis, especially when the amount of data is large [35]. Extreme gradient boosting (XGBoost) is an ensemble model based on decision trees [36, 37]. It is a typical representative of machine learning and has shown strong generalization ability in big data prediction in recent years.
Meteorological data recorded by sea surface buoys are generally limited to heights below 100 m, and occultation data and meteorological satellite remote sensing data are not sufficiently accurate below 3000 m over the ocean, so meteorological data between 100 m and 3000 m are often missing. As a result, the atmospheric refractive index at these heights cannot be calculated, and determining atmospheric ducts is difficult and inefficient. Calculating the atmospheric refractive index between 100 m and 3000 m over the ocean with machine learning, in order to determine the occurrence of atmospheric ducts, is therefore a focus of research in meteorological data processing.

In this paper, we propose an improved XGBoost based on the CLPSO operator, which converges quickly and has a strong ability to escape local optima when searching for the optimal parameter combination, to obtain the atmospheric refractive index. To test its effectiveness, it is applied to filling missing values in the modified atmospheric refractive index over the ocean. The modified atmospheric refractive indices in the lower (0–100 m) and upper (3000 m–10000 m) layers are used as input, and those in the middle layer (100 m–3000 m) are used as output. The improved XGBoost algorithm is then trained to fill in the missing values of the modified atmospheric refractive index in the middle layer. Finally, to verify its feasibility, the mean absolute percentage error (MAPE) and root mean-squared error (RMSE) are used as evaluation criteria to compare the predictions of the proposed method with those of a backpropagation (BP) neural network and the traditional XGBoost algorithm. The results show that the improved XGBoost algorithm significantly improves accuracy, reduces operating time, and has a stronger learning and generalization capability for calculating missing values of the modified atmospheric refractive index than the other algorithms. These results provide a new method for obtaining the atmospheric refractive index, which will be a useful reference for avoiding losses in electromagnetic wave propagation in the future.

2. Materials and Methods

The test data used in this paper are high-resolution sounding balloon data [38] collected from 1998 to 2008 at an observation station in Hawaii, USA (155.1°W, 19.7°N). The sounding balloon is released at 12:00 UTC every day, so the vertical meteorological data are recorded once a day. The balloon records meteorological parameters such as temperature, pressure, and humidity at different heights with a vertical resolution of 20 m. Because the average thickness of atmospheric ducts is between tens and hundreds of meters [1], these sounding data can be used to study the atmospheric refractive index and atmospheric ducts over the ocean.

The atmospheric duct is an abnormal phenomenon in the tropospheric atmospheric environment. It arises mainly because atmospheric temperature and humidity are unevenly distributed in the vertical direction, which produces significant vertical differences in the atmospheric refractive index and leads to the occurrence of atmospheric ducts [1]. So that the earth's surface can be treated as a plane, and so that the gradient of the atmospheric refractive index and its influence on electromagnetic wave propagation can be evaluated more easily, the modified atmospheric refractive index M is defined. It is calculated as follows [1]:

$$M = N + 0.157Z = \frac{77.6}{T}\left(P + \frac{4810\,e}{T}\right) + 0.157Z \tag{1}$$

In equation (1), N is the atmospheric refractive index, P is the air pressure in hPa, T is the temperature in K, e is the water vapor pressure in hPa, and Z is the altitude in m. According to the variation of M with height, atmospheric ducts are mainly divided into four categories [1]: surface ducts, surface-based ducts, elevated ducts, and evaporation ducts, as shown in Figure 1. Surface ducts mainly appear over land, and their height is usually less than 300 m. Elevated ducts mainly appear at high altitude over land and ocean, usually between 300 m and 3000 m. Evaporation ducts mainly appear over the ocean, usually below 40 m in height.
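As a minimal sketch, the modified refractive index could be evaluated from one sounding level as follows. The code assumes the commonly used form N = 77.6/T (P + 4810 e/T) and M = N + 0.157 Z, with P and e in hPa, T in K, and Z in m. The function names and the sample values are hypothetical, and the trapping test (M decreasing with height) is the standard criterion rather than something stated in the text above.

```python
def refractivity(p_hpa, t_kelvin, e_hpa):
    """Refractivity N (N-units) from pressure, temperature and vapor pressure."""
    return 77.6 / t_kelvin * (p_hpa + 4810.0 * e_hpa / t_kelvin)


def modified_refractivity(p_hpa, t_kelvin, e_hpa, z_m):
    """Modified refractivity M (M-units); 0.157 per metre is the earth-curvature term."""
    return refractivity(p_hpa, t_kelvin, e_hpa) + 0.157 * z_m


def trapping_layers(heights_m, m_values):
    """Return (z_low, z_high) pairs of adjacent levels where M decreases with height."""
    layers = []
    for k in range(len(heights_m) - 1):
        if m_values[k + 1] < m_values[k]:
            layers.append((heights_m[k], heights_m[k + 1]))
    return layers


# Hypothetical near-surface values, for illustration only (not from the data set).
m_surface = modified_refractivity(p_hpa=1013.25, t_kelvin=288.15, e_hpa=10.0, z_m=0.0)
```

Applied to every 20 m level of a sounding, `modified_refractivity` would give the M profile, and `trapping_layers` would flag the height intervals in which a duct may be present.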

3. BP Neural Network

3.1. Introduction to BP Neural Network

Artificial neural networks, which are widely used in data prediction, use mathematical methods to simulate the structure, function, and processing mode of biological neural systems in order to build a complete information processing system. In this paper, the BP neural network model [39, 40], which is relatively mature and widely used, is used to fill in the modified atmospheric refractive index of the middle layer (100 m–3000 m) from the modified atmospheric refractive index of the lower (0–100 m) and upper (3000 m–10000 m) layers.
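The division of a profile into model input and model output can be sketched as follows. This is only an illustration: the function name is hypothetical, and the handling of the boundary levels (100 m assigned to the lower layer, 3000 m to the upper layer) is an assumption, since the text gives the layer ranges but not how the shared boundaries are treated.

```python
def split_profile(heights_m, m_values, low_top=100.0, mid_top=3000.0, top=10000.0):
    """Split one M profile into inputs (lower + upper layers) and targets (middle layer).

    Assumption: levels at exactly low_top belong to the lower layer and levels
    at exactly mid_top belong to the upper layer.
    """
    inputs, targets = [], []
    for z, m in zip(heights_m, m_values):
        if z < 0.0 or z > top:
            continue  # outside the 0-10000 m range used by the model
        if low_top < z < mid_top:
            targets.append(m)  # middle layer: values to be filled in
        else:
            inputs.append(m)   # lower and upper layers: known values
    return inputs, targets


# A synthetic profile on a 20 m grid, used only to count the levels per layer.
heights = [20.0 * k for k in range(501)]   # 0, 20, ..., 10000 m
profile = [330.0 + 0.12 * z for z in heights]
x, y = split_profile(heights, profile)
```

With a 20 m vertical resolution and this boundary convention, each profile gives 357 input values and 144 target values.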

3.2. Model Building
3.2.1. Sample Selection

Because the number of samples is very large, the traditional sample selection method would make the dimensionality of the input vector too large and the generalization ability poor, so the input vector of the model needs to be reconstructed. In this paper, the input vector is constructed by calculating the correlation coefficients between samples recorded at different times, and sequences with large correlation coefficients are included in the input samples. If S and U represent two sequences, their correlation coefficient is calculated as follows [34]:

$$r_{SU} = \frac{\sum_{i=1}^{l}\left(S_i - \bar{S}\right)\left(U_i - \bar{U}\right)}{\sqrt{\sum_{i=1}^{l}\left(S_i - \bar{S}\right)^2}\,\sqrt{\sum_{i=1}^{l}\left(U_i - \bar{U}\right)^2}} \tag{2}$$

where l is the sample size and $\bar{S}$ and $\bar{U}$ are the sample means. The larger the correlation coefficient is, the stronger the correlation is. In this way, only the sequences with a large correlation coefficient are included in the input sample, which reduces its dimensionality and makes the best use of historical information.
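The selection of strongly correlated sequences can be sketched as follows, assuming the correlation coefficient is the usual Pearson coefficient. The function names and the threshold of 0.9 are hypothetical; the text does not state which threshold was used.

```python
import math


def pearson(s, u):
    """Pearson correlation coefficient of two equally long sequences."""
    l = len(s)
    s_mean = sum(s) / l
    u_mean = sum(u) / l
    cov = sum((a - s_mean) * (b - u_mean) for a, b in zip(s, u))
    var_s = sum((a - s_mean) ** 2 for a in s)
    var_u = sum((b - u_mean) ** 2 for b in u)
    return cov / math.sqrt(var_s * var_u)


def select_correlated(reference, candidates, threshold=0.9):
    """Indices of candidate sequences whose correlation with reference reaches threshold."""
    return [k for k, seq in enumerate(candidates) if pearson(reference, seq) >= threshold]


reference = [1.0, 2.0, 3.0, 4.0]
candidates = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0], [1.0, 2.0, 3.0, 5.0]]
kept = select_correlated(reference, candidates)
```

Here the first and third candidates are kept, and the anticorrelated second candidate is discarded.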

3.2.2. Data Normalization

To prevent output saturation caused by an excessively large absolute value of the net input, the training samples are normalized so that the data vector falls within the range [0, 1] or [−1, 1] before they are fed to the input layer of the neural network, and the inverse normalization is applied to the data that the network outputs. Scaling also spreads the input and output vectors over the target range so that the data are not too densely clustered. The transformation formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{3}$$

In equation (3), $x$ is the input or output data, $x_{\min}$ is the minimum value in the data, $x_{\max}$ is the maximum value, and $x'$ is the normalized value.
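A minimal sketch of the normalization and its inverse is given below, assuming min-max scaling to [0, 1]. The function names and the sample values are hypothetical.

```python
def fit_minmax(values):
    """Return the (minimum, maximum) pair that defines the scaling."""
    return min(values), max(values)


def normalize(values, lo, hi):
    """Map values linearly so that lo becomes 0 and hi becomes 1."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Inverse transform: map network outputs back to physical units."""
    return [s * (hi - lo) + lo for s in scaled]


m_values = [300.0, 350.0, 400.0]       # illustrative M values only
lo, hi = fit_minmax(m_values)
scaled = normalize(m_values, lo, hi)
restored = denormalize(scaled, lo, hi)
```

The minimum and maximum would be taken from the training set and reused for the test set, so that both are scaled in the same way.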

After comprehensive consideration, we build a four-layer BP neural network consisting of an input layer, two hidden layers, and an output layer. The input layer is set to one neuron, the first hidden layer to 30 neurons, the second hidden layer to 20 neurons, and the output layer to one neuron. The network structure is shown in Figure 2. The network is trained with the standard BP algorithm, and the activation function of the hidden-layer nodes is the rectified linear unit (ReLU), a piecewise linear function that sets all negative values to zero and leaves positive values unchanged.
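The forward pass of a network with the stated layer sizes (1, 30, 20, 1) can be sketched as follows. The random initialization, the linear output layer, and all function names are assumptions made for illustration; training by backpropagation is omitted.

```python
import random


def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return x if x > 0.0 else 0.0


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, activation=None):
    """Weighted sum plus bias for every neuron, followed by an optional activation."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [activation(v) for v in outputs] if activation else outputs


def forward(x, layers):
    """Propagate one scalar input through ReLU hidden layers and a linear output."""
    hidden = [x]
    for layer in layers[:-1]:
        hidden = dense(hidden, layer, relu)
    return dense(hidden, layers[-1])[0]


rng = random.Random(0)
sizes = [1, 30, 20, 1]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
prediction = forward(0.5, layers)
```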

4. Improved XGBoost Algorithm Using CLPSO Operator

4.1. Introduction to XGBoost Algorithm

XGBoost [31, 32] is an ensemble model based on decision trees. Its basic principles are as follows [31].

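For any given training set of n samples with m features,

$$D = \{(x_i, y_i)\}, \quad |D| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R} \tag{4}$$

the tree ensemble model predicts the output with a function built by summing K tree models. The formula is as follows:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{5}$$

where F represents the space of all regression trees.

To obtain the optimal model through learning while accounting for the influence of overfitting on predictive accuracy, the following objective function is constructed and minimized:

$$L = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k), \quad \Omega(f) = \gamma N + \frac{1}{2}\lambda \lVert w \rVert^2 \tag{6}$$

where l represents a differentiable convex function that measures the error between the predicted value and the actual value, N represents the number of leaf nodes in the regression tree, w represents the weights of the N leaf nodes, γ and λ are the regularization coefficients, and Ω is the penalty function that penalizes the complexity of the model, that is, the number of leaf nodes of the generated regression tree and the weight of each leaf node. The model represented by equation (6) is a tree ensemble model that cannot be solved in Euclidean space by traditional optimization methods, so we solve it by iterative approximation and use $\hat{y}_i^{(t)}$ to represent the prediction formed after t iterations. In the (t + 1)-th iteration, we optimize the following equation:

$$L^{(t+1)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)} + f_{t+1}(x_i)\right) + \Omega(f_{t+1}) \tag{7}$$

By optimizing the above equation, the predictions of the model are gradually improved through iteration. To obtain the optimal solution, a second-order Taylor expansion of equation (7) is carried out at the point $\hat{y}_i^{(t)}$ to form the following expression:

$$L^{(t+1)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t)}\right) + g_i f_{t+1}(x_i) + \frac{1}{2} h_i f_{t+1}^2(x_i) \right] + \Omega(f_{t+1}) \tag{8}$$

where $g_i = \partial l(y_i, \hat{y}_i^{(t)}) / \partial \hat{y}_i^{(t)}$ and $h_i = \partial^2 l(y_i, \hat{y}_i^{(t)}) / \partial (\hat{y}_i^{(t)})^2$ are the first and second derivatives of the loss.

By calculating $g_i$ and $h_i$, substituting them into the expansion of equation (7), dropping the constant term, and inserting the optimal leaf weights $w_j^{*} = -\sum_{i \in I_j} g_i / \left(\sum_{i \in I_j} h_i + \lambda\right)$, we can conclude, for any decision tree function $f_{t+1}$, the following:

$$\tilde{L}^{(t+1)}(q) = -\frac{1}{2} \sum_{j=1}^{N} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma N \tag{9}$$

where $I_j$ represents the set of all samples in the training set that belong to leaf node j and q represents the structure of the decision tree corresponding to the decision tree function $f_{t+1}$. Because different decision tree structures have different values of $\tilde{L}^{(t+1)}(q)$, it can be used to select the new decision tree in the (t + 1)-th iteration.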
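The quantities in this derivation can be illustrated with a small numerical sketch. It assumes the squared-error loss l = (y − ŷ)²/2, for which g = ŷ − y and h = 1, and it uses the standard closed-form leaf weight and structure score of XGBoost; the function names and the sample values are hypothetical.

```python
def grad_hess_squared_error(y_true, y_pred):
    """First and second derivatives of 0.5 * (y - y_hat)^2 with respect to y_hat."""
    g = [p - t for t, p in zip(y_true, y_pred)]
    h = [1.0] * len(y_true)
    return g, h


def leaf_weight(g, h, lam):
    """Optimal weight of a leaf that contains the samples with derivatives g and h."""
    return -sum(g) / (sum(h) + lam)


def structure_score(leaves, lam, gamma):
    """Score of a tree structure; leaves is a list of (g, h) pairs, lower is better."""
    score = 0.0
    for g, h in leaves:
        score -= 0.5 * sum(g) ** 2 / (sum(h) + lam)
    return score + gamma * len(leaves)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [0.0, 0.0, 0.0, 0.0]          # predictions before the new tree is added
g, h = grad_hess_squared_error(y_true, y_pred)
w_single = leaf_weight(g, h, lam=1.0)
score_single = structure_score([(g, h)], lam=1.0, gamma=0.0)
```

With λ = 0 the weight of a single leaf reduces to the mean residual (2.5 here); λ = 1 shrinks it to 2.0, which shows how the penalty term damps the leaf weights.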

To better control overfitting, XGBoost adds to the cost function a regularization term that contains the number of leaf nodes of the tree and the sum of the squared weights w of the leaf nodes. In addition, XGBoost uses column subsampling to reduce computation and avoid overfitting. After each iteration, XGBoost multiplies the leaf weights by a shrinkage coefficient to weaken the influence of each tree, which leaves more room for learning in later stages. XGBoost also supports parallelism. One of the most time-consuming steps in learning a decision tree is sorting the feature values; XGBoost sorts the data once before training and saves them in a block structure that is reused in later iterations, which significantly reduces the amount of calculation. This block structure also makes parallelism possible: when a node is split, the gain of each feature must be calculated, and these gains can be computed in parallel. Finally, the feature with the largest gain is selected for splitting.
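The split search on one presorted feature can be sketched as follows. The gain expression used here, 0.5 [G_L²/(H_L + λ) + G_R²/(H_R + λ) − G²/(H + λ)] − γ, is the standard XGBoost split gain and is not written out in the text above; the function name and the sample values are hypothetical, and the block structure and parallel evaluation are not reproduced.

```python
def best_split(feature, g, h, lam=1.0, gamma=0.0):
    """Scan one feature in sorted order and return (best_gain, threshold).

    g and h are the per-sample first and second derivatives of the loss.
    The threshold is None when no split has a positive gain.
    """
    order = sorted(range(len(feature)), key=lambda i: feature[i])  # sort once
    g_total, h_total = sum(g), sum(h)
    parent = g_total ** 2 / (h_total + lam)
    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += g[i]
        h_left += h[i]
        if feature[i] == feature[nxt]:
            continue  # cannot split between two equal feature values
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (g_left ** 2 / (h_left + lam)
                      + g_right ** 2 / (h_right + lam) - parent) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (feature[i] + feature[nxt])
    return best_gain, best_threshold


gain, threshold = best_split([1.0, 2.0, 3.0, 4.0], [-1.0, -1.0, 1.0, 1.0],
                             [1.0, 1.0, 1.0, 1.0], lam=0.0)
```

Running the same scan for every feature and keeping the largest gain gives the split that is finally selected; because the scans are independent, they can be evaluated in parallel.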

4.2. Model Building
4.2.1. Selecting Hyperparameters

The choice of XGBoost hyperparameters directly affects the results of the algorithm. XGBoost has nine hyperparameters to adjust, and the traditional way to adjust them is grid search, which divides the hyperparameter space to be searched into a grid and looks for the optimal hyperparameters by traversing all points of the grid. This method can find the global optimal solution when the search interval is large enough and the step size is small enough. However, most hyperparameter groups in the grid give very low accuracy, and only the groups within a relatively small interval give very high accuracy, so traversing every group in the grid wastes a great deal of time, and a grid coarse enough to be affordable easily falls into a local optimum. To overcome the tendency of the traditional XGBoost method to fall into a local optimum and to improve the calculation efficiency, we use the CLPSO operator to optimize the selection of hyperparameters.
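The cost of an exhaustive grid search can be made concrete with a short sketch. The candidate values below are hypothetical (five per hyperparameter, chosen only for illustration), and the helper names are not part of any library.

```python
from itertools import islice, product

# Hypothetical candidate values: five per hyperparameter, for illustration only.
GRID = {
    "n_estimators": [100, 200, 400, 800, 1600],
    "learning_rate": [0.01, 0.03, 0.1, 0.2, 0.3],
    "max_depth": [3, 4, 6, 8, 10],
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1.0],
    "min_child_weight": [1, 2, 4, 6, 8],
    "gamma": [0.0, 0.1, 0.5, 1.0, 2.0],
    "reg_alpha": [0.0, 0.01, 0.1, 0.5, 1.0],
    "reg_lambda": [0.1, 0.5, 1.0, 2.0, 5.0],
}


def grid_size(grid):
    """Number of hyperparameter groups that a full traversal must evaluate."""
    size = 1
    for candidates in grid.values():
        size *= len(candidates)
    return size


def iter_grid(grid):
    """Yield every hyperparameter group of the grid as a dictionary."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


total = grid_size(GRID)
first_three = list(islice(iter_grid(GRID), 3))
```

Even this coarse grid contains 5^9 = 1,953,125 groups, each of which would require a full training run, which is why a guided search is attractive.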

4.2.2. Introduction to CLPSO Operator

CLPSO [41, 42] addresses a D-dimensional minimization problem:

$$\min f(X), \quad X = \left[x^1, x^2, \ldots, x^D\right] \tag{10}$$

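CLPSO stands for “comprehensive learning particle swarm optimization,” which is an improved version of the traditional particle swarm optimization (PSO) algorithm. It enhances learning between particles and accelerates the convergence of the population, and it mitigates the tendency of the original PSO algorithm to fall into a local optimum. The formulae that update the velocity and position in the traditional PSO algorithm are as follows:

$$v_i^d \leftarrow w\,v_i^d + c_1 r_1^d \left(pbest_i^d - x_i^d\right) + c_2 r_2^d \left(gbest^d - x_i^d\right) \tag{11}$$

$$x_i^d \leftarrow x_i^d + v_i^d \tag{12}$$

where $v_i^d$ and $x_i^d$ are the velocity and position of the i-th particle in dimension d, $w$ is the inertia weight, $c_1$ and $c_2$ are acceleration coefficients, $r_1^d$ and $r_2^d$ are random numbers in [0, 1], $pbest_i$ represents the historical optimal value of the i-th particle, and $gbest$ is the global optimal value of all particles. Equations (11) and (12) show that each particle learns from its own historical optimal value and from the global optimal value in each learning step to continuously improve its fitness.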
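One update step of the traditional PSO can be sketched as follows. The function name and the coefficient values (w = 0.7, c1 = c2 = 1.5) are assumptions chosen for illustration; the text does not give the values that were used.

```python
import random


def pso_step(position, velocity, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=random):
    """Update the velocity and position of one particle, dimension by dimension."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        # Inertia term plus attraction toward the personal and the global best.
        v_next = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


# A particle that already sits on both optima and is at rest does not move.
resting = pso_step([1.0, 2.0], [0.0, 0.0], [1.0, 2.0], [1.0, 2.0])
```

Because every particle is pulled toward the same global best, the swarm can collapse early onto a local optimum, which is the weakness that CLPSO is designed to reduce.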

The CLPSO algorithm changes velocity equation (11) of the PSO so that a particle can learn from different particles in different dimensions and at different times. The updated formula is as follows:

$$v_i^d \leftarrow w\,v_i^d + c\,r_i^d \left(pbest_{f_i(d)}^d - x_i^d\right) \tag{13}$$

where $f_i = [f_i(1), f_i(2), \ldots, f_i(D)]$ and $pbest_{f_i(d)}^d$ indicate that particle i learns from the historical optimal value of particle number $f_i(d)$ in dimension d.
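The comprehensive learning update can be sketched as follows. The exemplar selection shown here (with probability pc, a tournament between two other randomly chosen particles; otherwise the particle's own best) follows the usual description of CLPSO, but the function names and the values of w, c, and pc are assumptions. The swarm must contain at least three particles for the tournament to be possible.

```python
import random


def choose_exemplars(i, pbest_fitness, dims, pc, rng):
    """For each dimension, pick the particle whose personal best particle i learns from."""
    others = [k for k in range(len(pbest_fitness)) if k != i]
    exemplars = []
    for _ in range(dims):
        if rng.random() < pc:
            a, b = rng.sample(others, 2)  # tournament between two other particles
            exemplars.append(a if pbest_fitness[a] < pbest_fitness[b] else b)
        else:
            exemplars.append(i)           # keep learning from its own best
    return exemplars


def clpso_velocity(x, v, pbests, exemplars, w=0.7, c=1.5, rng=random):
    """Velocity update in which each dimension follows its own exemplar."""
    return [w * v[d] + c * rng.random() * (pbests[exemplars[d]][d] - x[d])
            for d in range(len(x))]


rng = random.Random(1)
fitness = [3.0, 1.0, 2.0]                 # lower is better
always = choose_exemplars(0, fitness, dims=4, pc=1.0, rng=rng)
never = choose_exemplars(0, fitness, dims=4, pc=0.0, rng=rng)
```

In this example particle 0 always learns from particle 1, the better of the two other particles, when pc = 1, and only from itself when pc = 0.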

4.2.3. Optimizing XGBoost Algorithm by CLPSO Operator

The process of optimizing XGBoost with CLPSO consists of three steps [42]:
Step 1: set the parameters of the CLPSO algorithm and initialize the particle swarm.
Step 2: after initialization, update the velocity and position of each particle, take the current position as the XGBoost hyperparameters, run the experiment, take the experimental result as the fitness value of the particle, and update its historical and global optimal values according to its fitness value.
Step 3: repeat step 2 N times.
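The three steps can be sketched as one loop. To keep the example self-contained, a simple quadratic function stands in for the fitness; in the actual procedure the fitness would be the cross-validated error of XGBoost trained with the hyperparameters encoded by the particle. The function names, the parameter values, and the redrawing of exemplars at every update are simplifying assumptions.

```python
import random


def clpso_minimize(fitness, bounds, n_particles=10, n_iter=50,
                   w=0.7, c=1.5, pc=0.3, seed=0):
    """Minimal CLPSO loop; needs at least three particles for the tournament."""
    rng = random.Random(seed)
    dims = len(bounds)
    # Step 1: initialize positions, velocities and personal bests.
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vs = [[0.0] * dims for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_fit = [fitness(x) for x in xs]
    for _ in range(n_iter):                      # Step 3: repeat step 2
        for i in range(n_particles):             # Step 2: move and evaluate
            others = [k for k in range(n_particles) if k != i]
            for d in range(dims):
                if rng.random() < pc:
                    a, b = rng.sample(others, 2)
                    k = a if pbest_fit[a] < pbest_fit[b] else b
                else:
                    k = i
                vs[i][d] = w * vs[i][d] + c * rng.random() * (pbest[k][d] - xs[i][d])
                lo, hi = bounds[d]
                xs[i][d] = min(max(xs[i][d] + vs[i][d], lo), hi)  # stay in bounds
            fit = fitness(xs[i])
            if fit < pbest_fit[i]:
                pbest_fit[i], pbest[i] = fit, xs[i][:]
    best = min(range(n_particles), key=pbest_fit.__getitem__)
    return pbest[best], pbest_fit[best]


def sphere(x):
    """Stand-in fitness with its minimum at (1, 1, ..., 1); not the XGBoost error."""
    return sum((v - 1.0) ** 2 for v in x)


bounds = [(-5.0, 5.0)] * 3
best_x, best_fit = clpso_minimize(sphere, bounds, n_particles=10, n_iter=30, seed=0)
```

Replacing `sphere` with a function that trains XGBoost and returns its validation error would turn this loop into a hyperparameter search of the kind described in the steps above.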

The flowchart of this process is shown in Figure 3. Nine XGBoost hyperparameters are tuned in the experiment: the number of iterations (n_estimators), the learning rate (learning_rate), the maximum tree depth (max_depth), the row and column sampling ratios (subsample and colsample_bytree), the minimum sum of sample weights required in a leaf node (min_child_weight), the minimum reduction of the loss function required for a split (gamma), the L1 regularization coefficient (reg_alpha), and the L2 regularization coefficient (reg_lambda). Finally, we use the CLPSO algorithm and conduct three cross-validations to find the optimal hyperparameters. The final results are shown in Table 1.
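A particle position can be mapped to the nine hyperparameters as sketched below. The search ranges are hypothetical (the text does not state the ranges that were used), as is the convention that every coordinate of the particle lies in [0, 1].

```python
# Hypothetical search ranges (lower bound, upper bound, type), for illustration only.
SEARCH_SPACE = {
    "n_estimators": (50, 1000, int),
    "learning_rate": (0.01, 0.3, float),
    "max_depth": (3, 10, int),
    "subsample": (0.5, 1.0, float),
    "colsample_bytree": (0.5, 1.0, float),
    "min_child_weight": (1.0, 10.0, float),
    "gamma": (0.0, 5.0, float),
    "reg_alpha": (0.0, 1.0, float),
    "reg_lambda": (0.0, 5.0, float),
}


def decode(position):
    """Map nine coordinates in [0, 1] to a dictionary of hyperparameter values."""
    params = {}
    for u, (name, (lo, hi, kind)) in zip(position, SEARCH_SPACE.items()):
        u = min(max(u, 0.0), 1.0)             # clip coordinates that left the unit cube
        value = lo + u * (hi - lo)
        params[name] = int(round(value)) if kind is int else value
    return params


lowest = decode([0.0] * 9)
middle = decode([0.5] * 9)
highest = decode([1.0] * 9)
```

The resulting dictionary uses the parameter names listed above, so it could be passed as keyword arguments to an XGBoost regressor when the fitness of a particle is evaluated.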

5. Results and Discussion

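After the models were built, 160 profiles were randomly selected for testing to measure how well the missing values were filled in, and the remaining profiles were used for training. The RMSE and MAPE were used as evaluation indices. They are calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \tag{14}$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right| \tag{15}$$

where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and n is the number of values compared.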
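The two evaluation indices can be computed as in the following sketch, which assumes their standard definitions. The function names and the sample values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean-squared error, in the units of the data (here M-units)."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; true values must be nonzero."""
    n = len(y_true)
    return 100.0 / n * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred))


# Illustrative values only; they are not taken from the experiments.
error_rmse = rmse([300.0, 400.0], [303.0, 396.0])
error_mape = mape([100.0, 200.0], [110.0, 190.0])
```

Because the modified refractive index is always well above zero, the division in the MAPE is safe for this application.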

Table 2 shows the overall error of the 160 randomly selected profiles and the total convergence time on the training set, and Figures 4 and 5 show the average error of the 160 profiles at each height.

Table 2 shows that the MAPE and RMSE of the improved XGBoost algorithm are about 39% less than those of the BP neural network and about 32% less than those of the traditional XGBoost. The operating time of the improved XGBoost algorithm is about 99% shorter than that of the BP neural network and about 18% shorter than that of the traditional XGBoost. The proposed algorithm therefore improves accuracy and also significantly reduces the calculation time. Figures 4 and 5 show that the MAPE of the BP neural network exceeds 1.5% at most heights and that its RMSE exceeds 8 M-units at most heights. The MAPE of the traditional XGBoost exceeds 1% at most heights and exceeds 1.5% at some heights near both ends of the height range; its RMSE is about 8 M-units at most heights and exceeds 10 M-units at some heights near both ends. In contrast, the MAPE of the improved XGBoost algorithm is less than 1% above 1000 m and less than 1.5% below 1000 m, and its RMSE is less than 8 M-units at all heights. The MAPE and RMSE of the proposed method are smaller than those of the BP neural network and the traditional XGBoost at the same height, which further demonstrates its advantage in filling in missing values. Figure 6 shows the fit for one profile randomly selected from the 160 test profiles.

Figure 6 shows that, although all three prediction algorithms follow the same trend as the true value, only the improved XGBoost algorithm agrees well with the true value where the curve fluctuates. To further measure the performance on a single profile, Figures 7 and 8 show the MAPE and RMSE of the selected profile at different heights.

Figures 7 and 8 show that, at 0 m–1000 m, 1500 m–2000 m, and 2500 m–3000 m, the MAPE and RMSE of the improved XGBoost algorithm are much smaller than those of the BP neural network and the traditional XGBoost. Figure 6 shows that 1500 m–2000 m is precisely the range in which the profile of the modified atmospheric refractive index fluctuates. According to the definition of atmospheric ducts, this is the region where they occur.

From the above analysis, we can conclude that the improved XGBoost algorithm is considerably superior to the BP neural network and the traditional XGBoost in filling in missing values of the modified atmospheric refractive index, including the parts of the profile where the index fluctuates. According to the definition of atmospheric ducts, the region of fluctuation is where atmospheric ducts occur, so an accurate estimate of the modified atmospheric refractive index supports the subsequent judgment and analysis of the occurrence of atmospheric ducts.

6. Conclusions

To address the low efficiency of obtaining the atmospheric refractive index in the atmospheric environment, this paper proposes an improved XGBoost based on the CLPSO operator, which converges quickly and has a strong ability to escape local optima when searching for the optimal parameter combination; the CLPSO both enhances learning between particles and accelerates the convergence of the population. To test its effectiveness, it is applied to filling missing values in the modified atmospheric refractive index over the ocean. High-resolution sounding balloon data were chosen as experimental data, and the modified atmospheric refractive index was calculated at different altitudes. The modified atmospheric refractive indices in the lower (0–100 m) and upper (3000 m–10000 m) layers were used as input, and that in the middle layer (100 m–3000 m) was used as the output. The improved XGBoost algorithm was trained to fill in the missing values of the modified atmospheric refractive index in the middle layer (100 m–3000 m).

The results of the proposed method were compared experimentally with those of the BP neural network and the traditional XGBoost. The MAPE and RMSE of the improved XGBoost algorithm are 39% less than those of the BP neural network and 32% less than those of the traditional XGBoost, and its operating time is 99% shorter than that of the BP neural network and 18% shorter than that of the traditional XGBoost. Compared with the BP neural network and the traditional XGBoost, the improved XGBoost algorithm therefore shows significant improvements in both accuracy and operating time and has a significant advantage in providing an accurate modified atmospheric refractive index, especially where the profile fluctuates. According to the definition of atmospheric ducts, the region of fluctuation is precisely where atmospheric ducts occur, so an accurate estimate of the modified atmospheric refractive index supports the subsequent judgment of atmospheric ducts. The results of this paper provide a new method for obtaining the atmospheric refractive index over the ocean, which will be of great significance to the subsequent judgment of atmospheric ducts over the ocean. In addition, the results can be used to realize beyond-the-horizon transmission and target detection with electromagnetic waves over the ocean.

However, the proposed algorithm fills in values of the modified atmospheric refractive index only from the perspective of machine learning and does not consider physical processes such as weather. In future work, we intend to take such processes into account to further optimize the proposed algorithm so that it fills in missing values of the modified atmospheric refractive index more accurately.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Contract no. 41875045 and Hunan Provincial Innovation Foundation for Postgraduate under Grant no. CX20200093.