Research Article  Open Access
Prediction of Tropical Cyclones’ Characteristic Factors on Hainan Island Using Data Mining Technology
Abstract
A new methodology combining data mining technology with statistical methods is proposed for the prediction of tropical cyclones’ characteristic factors, namely, latitude, longitude, the lowest center pressure, and wind speed. In the proposed method, the best track datasets for the years 1949~2012 are used for prediction. Using the method, effective criteria are formed to judge whether tropical cyclones (TCs) land on Hainan Island or not. The highest probability of accurate judgment exceeds 79%. For TCs which are judged to land on Hainan Island, prediction equations are established to predict their characteristic factors. Results show that the average distance error is improved compared with the National Meteorological Centre of China.
1. Introduction
A typhoon is a kind of tropical cyclone (TC) whose sustained wind speed near the center reaches level 12 to level 13 (typhoons are not distinguished from TCs in this paper unless specially emphasized). Hainan Island (108°37′E~111°05′E, 18°10′N~20°10′N) in China is well known as a “typhoon corridor.” Based on the historical data of TCs landing on Hainan Island, the yearly and monthly statistics are shown in Figures 1 and 2, respectively. (Note: in this paper a typhoon is considered to land if the minimum distance between the typhoon center and Hainan Island is no more than the preset influencing radius, which is 300 km herein.) These statistics show that TCs land on Hainan Island very frequently. Moreover, typhoons rank first among all kinds of disasters on Hainan Island. For example, typhoon “Damrey” in 2005 caused destruction in 18 cities of Hainan and affected up to 6.305 million people, of whom 21 were killed. The direct economic loss reached 12.1 billion RMB [1]. Therefore, timely and accurate forecasting of TCs is very important for disaster prevention on Hainan Island and can effectively reduce the damage and loss that TCs cause.
Traditional TC forecast methods comprise statistical methods and dynamic methods, most of which involve complicated processes or have low precision. Statistical methods use historical TC positions, intensities, and so on to predict TCs’ characteristic factors; examples include a fuzzy multicriteria decision support model [2], conditional nonlinear optimal perturbation, first singular vector, and ensemble transform Kalman filter approaches [3], a back-propagation neural network [4], an adaptive neural network classifier using a two-layer feature selector [5], and a support vector machine using data reduction methods [6]. Dynamic methods are mainly based on numerical forecasting, such as a simplified dynamical system based on a logistic growth equation (LGE) [7], a regional coupled atmosphere-ocean model [8], the PSU-NCAR Mesoscale Model version 5 [9], and the GFDL 25-km-resolution global atmospheric model [10]. Taking the three main prediction centers as examples, the average distance error of 24/48-hour forecasts is 106/187 km for the USA National Hurricane Center (NHC), 125/243 km for the Japan Meteorological Agency (JMA), and 120/215 km for the National Meteorological Centre of China (NMCC) [11]. Zhang et al. compared monitoring data from the HY-2 and QuikSCAT satellite scatterometers with actual typhoon data from ground observation. The result shows that the deviations of typhoon path and intensity are large, as are their standard deviations [12]. Therefore, although many typhoon forecast methods are in use at present, their precision still cannot meet the need for real-time typhoon warning.
By using data mining technology in combination with statistical methods, a new TC forecast method based on historical data is proposed in this paper. Firstly, the region where typhoon centers were located 48 (or 72, as a comparative experiment) hours before landing on Hainan Island is divided into five (or one, three, or seven, as a comparative experiment) areas using the k-means clustering algorithm. Then the TC landing criterion of each area is formed by classification and regression trees (CART). Further, the prediction sum of squares (PRESS) algorithm and its progressive optimal algorithm are applied to optimize the forecast factor sets. Finally, part of the historical data is used to establish prediction equations by a multiple linear regression model (MLRM), and the accuracy of these equations is examined with the remaining historical data. The results show that this methodology is more accurate than several existing forecast methods.
2. Data and Methodology
2.1. Data
The data used in this research are the TC best track datasets for the years 1949~2012 in the northwestern Pacific (including the South China Sea, north of the equator and west of 180°E) [13], which are obtained from the TC information center of the China Meteorological Administration (CMA) (http://tcdata.typhoon.gov.cn/). The CMA best track datasets contain 2172 TCs with 62663 observation points in total. Every observation point may provide the following information: the observation time, strength grade, latitude, longitude, the lowest center pressure (hereinafter referred to as air pressure), the 2-minute average near-center maximum wind speed (hereinafter referred to as wind speed), and the average wind speed in 2 minutes. Because the average wind speed in 2 minutes is unavailable for most observation points, strength grade (SG), latitude (LAT), longitude (LON), air pressure (AP), wind speed (WS), latitude migration velocity (LATMV), and longitude migration velocity (LONMV) are selected as the seven predictors (hereinafter referred to as observation point information). The current LATMV and LONMV can be calculated using the following method.
Consider the current observation point and the previous two observation points, each given by its observation time and its position (longitude and latitude). The LONMV and LATMV at the current observation time are calculated from these positions by formulas (1) and (2), which involve the mean radius of the Earth, taken as 6370.856 km, and the sign function; both velocities are expressed in km/h.
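As a rough illustration of how migration velocities can be derived from consecutive best-track positions, the following Python sketch uses a simple backward difference between two positions. It is not the paper's formulas (1) and (2): the 6-hour interval, the function name `migration_velocity`, the use of only one previous point, and the cos(latitude) scaling of longitude are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6370.856  # mean Earth radius quoted in the text


def migration_velocity(prev, curr, hours=6.0):
    """Return (LONMV, LATMV) in km/h from two (lon, lat) positions in degrees.

    Assumption: backward difference over `hours`; east and north are positive,
    so the sign of each component gives the direction of motion.
    """
    lon0, lat0 = prev
    lon1, lat1 = curr
    mean_lat = math.radians((lat0 + lat1) / 2.0)
    # One radian of latitude spans R km; one radian of longitude spans R*cos(lat) km.
    dlat_km = EARTH_RADIUS_KM * math.radians(lat1 - lat0)
    dlon_km = EARTH_RADIUS_KM * math.cos(mean_lat) * math.radians(lon1 - lon0)
    return dlon_km / hours, dlat_km / hours


# Hypothetical positions: a TC moving due north by one degree in six hours.
lonmv, latmv = migration_velocity((115.0, 15.0), (115.0, 16.0))
print(round(lonmv, 3), round(latmv, 3))
```

A westward-moving TC yields a negative LONMV under this convention, which is one way the sign function mentioned in the text could enter.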
2.2. Methodology
As mentioned in the Introduction, a landing TC is defined as a TC for which the minimum distance between the typhoon center and Hainan Island is no more than the preset influencing radius. Hence, in order to distinguish landing from not landing TCs, the minimum distance between each TC’s track and the outer boundary of Hainan Island needs to be calculated from the CMA best track datasets. Because TC tracks vary widely, applying general curve fitting directly does not give a good result. Therefore, polynomial fitting [14] is applied in this paper, in which an intermediate variable is introduced and the latitude and the longitude are each fitted as a polynomial of that variable. Taking an arbitrary TC as an example, the fitting results are shown in Figures 3 and 4. Using the fitted polynomials of each TC’s track and of the outer boundary of Hainan Island, the distance between any point on the track and any point on the boundary can be calculated, and the minimum of these distances is then selected. The great circle distance (GCD), which is the shortest distance between two points on the surface of the Earth, is calculated using formula (3). Let any two points on the Earth be $(\lambda_1, \varphi_1)$ and $(\lambda_2, \varphi_2)$, where $\lambda$ denotes longitude and $\varphi$ denotes latitude; the GCD is
$$d = R \arccos\left(\sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos(\lambda_1 - \lambda_2)\right), \quad (3)$$
where $R$ is the mean radius of the Earth.
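The landing test described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the haversine form of the great circle distance (mathematically equivalent to the arccosine form but better behaved for short distances), it works on discrete sample points instead of fitted polynomials, and the boundary points are hypothetical corners of the bounding box quoted in the Introduction.

```python
import math

EARTH_RADIUS_KM = 6370.856  # mean Earth radius quoted in the text


def great_circle_km(lon1, lat1, lon2, lat2):
    """Great circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def min_distance_km(track, boundary):
    """Smallest distance between any track point and any boundary point."""
    return min(
        great_circle_km(x1, y1, x2, y2)
        for x1, y1 in track
        for x2, y2 in boundary
    )


def lands(track, boundary, radius_km=300.0):
    """Landing rule from the text: minimum distance within the influencing radius."""
    return min_distance_km(track, boundary) <= radius_km


# Hypothetical boundary: corners of the island's bounding box (lon, lat).
boundary = [(108.62, 18.17), (111.08, 18.17), (111.08, 20.17), (108.62, 20.17)]
print(lands([(112.0, 19.0)], boundary), lands([(130.0, 15.0)], boundary))
```

In practice the track and boundary would be sampled densely from the fitted polynomials so that the discrete minimum approximates the true minimum distance.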
The three main prediction centers (NHC, JMA, and NMCC) forecast TCs at lead times of 24, 48, and 72 hours. In order to forecast TCs in a timely manner and to compare forecast accuracy among different methods, the region where TCs’ centers were located 48 hours before they landed on Hainan Island (shown in Figure 5) is selected as the research object. To narrow the research scope, the k-means clustering algorithm [15, 16] is applied to divide this region into five areas. For convenience, this section takes this case (the 48-hour region divided into five areas) as the example. Other cases are examined as comparative experiments in Section 3.3.
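To make the area division concrete, the following Python sketch runs plain Lloyd-style k-means on (longitude, latitude) pairs. It is only an illustration: the squared Euclidean distance in degree space, the seeded random initialization, and the sample coordinates are assumptions, and the paper's cited variants [15, 16] may initialize or constrain clusters differently.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Cluster 2D points into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random distinct starting centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, labels


# Hypothetical TC center positions 48 hours before landing (lon, lat).
positions = [(110.0, 15.0), (110.2, 15.1), (109.9, 14.9),
             (125.0, 20.0), (125.1, 20.2), (124.9, 19.8)]
centers, labels = kmeans(positions, k=2)
print(centers, labels)
```

With five clusters instead of two, the returned centers would play the role of the geometric centers of the five areas.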
For each of the five areas, all the observation points of both landing and not landing TCs which entered the area are selected. With strength grade, latitude, longitude, air pressure, wind speed, latitude migration velocity, and longitude migration velocity as classification attributes, the TC landing criterion of each area is formed using the CART algorithm [17, 18]. The flow diagram of forming the landing criteria is shown in Figure 6.
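The core step of CART classification is choosing, among all predictors and thresholds, the binary split that minimizes the weighted Gini impurity; a full tree applies this step recursively to each resulting subset. The Python sketch below shows that single step only. The function names and the two-feature sample rows (latitude, wind speed) are hypothetical, and pruning and stopping rules are omitted.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = landing, 0 = not landing)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) of the best split.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the unsplit impurity, the first two entries are None.
    """
    n = len(rows)
    best = (None, None, gini(labels))
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, thr, score)
    return best


# Hypothetical observation points: (LAT in degrees, WS in m/s) and landing labels.
rows = [(18.0, 10.0), (19.0, 30.0), (20.0, 12.0),
        (25.0, 28.0), (26.0, 11.0), (27.0, 29.0)]
labels = [1, 1, 1, 0, 0, 0]
print(best_split(rows, labels))
```

Here the split "LAT <= 22.5" separates the two classes perfectly, which is the kind of rule that appears at a node of a landing criterion tree.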
For TCs which are judged to be landing on Hainan Island, PRESS with its progressive optimal algorithm and MLRM are used to forecast the TCs’ characteristic factors (latitude, longitude, the lowest center pressure, and wind speed). Forecasts in this paper comprise a landing prediction pattern and a dynamic prediction pattern. The landing prediction pattern uses the observation point information of the point at which a TC first enters any area to predict the characteristic factors when the TC lands. The dynamic prediction pattern makes 24-hour and 48-hour predictions from an observation point that lies in any area. The flow diagrams of the landing prediction pattern and the dynamic prediction pattern are shown in Figures 7 and 8, respectively. Here PRESS [19] and its progressive optimal algorithm [20, 21] are used to select, from the seven predictors, the best forecast factor set for each characteristic factor. MLRM [22] is used to establish the corresponding forecast equations. MLRM is expressed as [23]
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon, \quad (4)$$
where $y$ is the estimated value, $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients, $\varepsilon$ is the random error, and $x_1, x_2, \ldots, x_p$ are the forecast factors of the observation point.
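The interplay of MLRM and PRESS can be sketched in Python as follows. The sketch fits an ordinary least squares model with an intercept by solving the normal equations and computes PRESS by explicit leave-one-out refitting. It is an illustration under assumptions, not the authors' code: the helper names are hypothetical, the data are synthetic, and the progressive (stepwise) search over factor sets in [20, 21] is not reproduced.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(xs, ys):
    """Least squares fit with intercept; returns [b0, b1, ..., bp]."""
    rows = [[1.0] + list(x) for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


def predict(beta, x):
    """Evaluate the fitted linear equation at one observation point."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def press(xs, ys):
    """Prediction sum of squares: refit without each point, then predict it."""
    total = 0.0
    for i in range(len(xs)):
        beta = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - predict(beta, xs[i])) ** 2
    return total


# Synthetic example: the target depends exactly on both factors.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
ys = [2.0 + 3.0 * a - b for a, b in xs]
print(fit(xs, ys), press(xs, ys), press([(b,) for a, b in xs], ys))
```

A candidate factor set with a smaller PRESS predicts held-out points better, so comparing PRESS across candidate sets is one way to pick the forecast factors for each characteristic factor.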
3. Results and Discussions
In this section, the case in which the research object is the region where TCs’ centers were located 48 hours before they landed, divided into five areas, is studied first. Comparative experiments, in which the research object may instead be the region where TCs’ centers were located 72 hours before they landed and the number of areas may be one, three, or seven, are discussed in Section 3.3.
3.1. Dividing the Research Region into Five Areas
The k-means clustering algorithm is used to divide the research region into five areas as described in Section 2.2; their geometric centers and scopes are shown in Table 1. For each area, all the observation points of both landing and not landing TCs which entered the area are selected. The positions of these observation points are shown in Figure 9 and are used to form the TC landing criteria. The numbers of these observation points (OPs) for both landing and not landing TCs are shown in Table 2. Dividing the region into five areas further narrows the research scope and makes the selection of observation points more pertinent, so that effective landing criteria can be formed; this is illustrated further in Section 3.3.


3.2. The Formation of Landing Criteria in Five Areas
According to the CART algorithm, the landing criteria in the five areas are shown in Figure 10 (refer to Section 2.1 for the meaning of the seven predictors). The corresponding probability of accurate judgment, probability of false alarm, and probability of false dismissal are shown in Table 3. For any area, these three probabilities are calculated according to formula (5) from four counts: the number of OPs of landing TCs, the number of OPs of not landing TCs, the number of OPs of landing TCs that the landing criteria correctly judge as landing, and the number of OPs of not landing TCs that the landing criteria correctly judge as not landing.

3.3. Other Situations as a Comparative Experiment
In Sections 3.1 and 3.2, the region where TCs’ centers were located 48 hours before they landed is selected as the research object and is divided into five areas. In this section, other situations are studied as comparative experiments and compared with each other. Finally, the situation which produces the best result is selected.
In order to distinguish the different situations, their labels are listed in Table 4, where the parameter Ti indicates that the research object is the region where TCs’ centers were located Ti hours before they landed on Hainan Island, and Nu denotes the number of areas into which the research object is divided.

It can be seen from Table 4 that FE5 is the situation studied in Sections 3.1 and 3.2. The remaining seven situations are studied using the same methods as for FE5.
The probability of accurate judgment, probability of false alarm, and probability of false dismissal of each area for each of the remaining seven situations are calculated according to formula (5), and the results are shown in Table 5.
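(a) Probabilities of accurate judgment, false alarm, and false dismissal of each area for FE1
(b) Probabilities of accurate judgment, false alarm, and false dismissal of each area for FE3
(c) Probabilities of accurate judgment, false alarm, and false dismissal of each area for FE7
(d) Probabilities of accurate judgment, false alarm, and false dismissal of each area for ST1
(e) Probabilities of accurate judgment, false alarm, and false dismissal of each area for ST3
(f) Probabilities of accurate judgment, false alarm, and false dismissal of each area for ST5
(g) Probabilities of accurate judgment, false alarm, and false dismissal of each area for ST7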

In order to select the best of the eight situations in Table 4, an evaluation method is introduced in which an Index is calculated for each situation according to formula (6). For any situation, the Index is computed from the number of areas into which the research object is divided and the probabilities of accurate judgment, false alarm, and false dismissal of each area.
The higher the Index, the better the overall result of the landing criteria. The Index for each of the eight situations is shown in Table 6. Situation FE5 gives the best result, which shows that the research scheme in Sections 3.1 and 3.2 outperforms the other situations. Therefore, situation FE5 is selected to form the landing criteria.

3.4. Forecast of TCs’ Characteristic Factors
3.4.1. Landing Prediction Pattern
The landing prediction pattern is defined as follows: the observation point information (seven predictors) is obtained when a landing TC’s center first enters any area and is used to forecast the characteristic factors (LAT, LON, AP, and WS) when the TC lands on Hainan Island. The flow diagram is shown in Figure 7. Taking area 1 as an example, the OPs of historical landing TCs’ centers when they first entered area 1 are shown in Figure 11, and the tracks of historical landing TCs which passed through area 1 are shown in Figure 12.
The historical landing TCs passing through each area are divided into two groups of equal size; one group is used to establish the prediction equations and the other is used to test their accuracy. The test results of the prediction equations for TCs which passed through each area are shown in Table 7. Using the actual and predicted longitude and latitude of TCs’ centers together with formula (3), the mean and standard deviation of GCD in the landing prediction pattern are calculated and shown in Figures 13 and 14, respectively. Averaged over the five areas, the mean/standard deviation (SD) of GCD is 144.6382/97.8740 km. In [24], Yu et al. report an average GCD error of 222.6 km for 48-hour forecasts in the South China Sea. Therefore, the landing prediction pattern proposed in this paper shows good prediction accuracy. For TCs which are judged to be landing on Hainan Island, as soon as the observation point information at the time their centers first enter any area is obtained, the corresponding forecast equations can be used to predict the characteristic factors at landing.
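The evaluation procedure can be sketched in Python as below: one helper forms the two equal groups and another summarizes the distance errors of the test group. The alternating assignment of items to groups and the use of the sample (n - 1) standard deviation are assumptions, since the text does not say how the groups were formed or which SD estimator was used; the error values are made up for the example.

```python
import statistics


def split_half(items):
    """Alternate items between a fitting group and a testing group."""
    return items[0::2], items[1::2]


def error_summary(errors_km):
    """Mean and sample standard deviation of distance errors in km."""
    return statistics.mean(errors_km), statistics.stdev(errors_km)


# Hypothetical TC identifiers and hypothetical GCD errors of the test group.
fit_group, test_group = split_half(["tc1", "tc2", "tc3", "tc4", "tc5", "tc6"])
mean_km, sd_km = error_summary([100.0, 150.0, 200.0])
print(fit_group, test_group, mean_km, sd_km)
```

The per-area means and SDs obtained this way would then be averaged over the five areas to give the summary figures quoted in the text.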

3.4.2. Dynamic Prediction Pattern
The dynamic prediction pattern uses the current observation point information to make 24-hour and 48-hour forecasts, as illustrated in Figure 8. It contains two forecast models, described as follows.
Forecast Model 1. The current observation point information (seven predictors) is obtained when a landing TC’s center enters any area for the first time and is used to make 24-hour and 48-hour forecasts.
Forecast Model 2. The current observation point information (seven predictors) is obtained whenever a landing TC’s center is in any area (not necessarily on first entry) and is used to make 24-hour and 48-hour forecasts.
In actual prediction, for TCs which are judged to be landing on Hainan Island, the observation point information when their centers enter any area for the first time and the equations established in forecast model 1 are used to conduct dynamic prediction. Furthermore, the observation point information when TCs’ centers are in any area (not necessarily on first entry) and the equations established in forecast model 2 can also be used to conduct dynamic prediction.
Similar to Section 3.4.1, the historical observation points that meet the requirements of the corresponding forecast model (1 or 2) are divided into two groups with the same number of observation points. One group is used to establish the prediction equations and the other is used to test their accuracy. The test results for forecast model 1 and forecast model 2 are shown in Table 8. Using formula (3), the mean and standard deviation of GCD in the two forecast models are calculated and shown in Figures 15 and 16, respectively. Averaged over the five areas, the mean/standard deviation of GCD for 24-hour forecasts is 150.5192/84.6156 km under forecast model 1 and 141.5464/81.2509 km under forecast model 2. For 48-hour forecasts, the corresponding values are 261.7517/145.6345 km and 256.7109/145.2903 km, respectively. Although the mean GCD in the dynamic prediction pattern is no smaller than that of the three main prediction centers (NHC, JMA, and NMCC), it is smaller than that of the numerical prediction model in [25], whose means are 186.3/319.5 km based on system T106 and 161.8/295.8 km based on T213 for 24/48-hour forecasts. Besides, forecast models 1 and 2 are both more accurate than the forecast using satellite scatterometer monitoring data in terms of the standard deviation of GCD and the mean wind speed error, which are 149.6002 km and 11.9618 m/s in [12]. It can be seen from Table 8 and Figures 15 and 16 that the accuracies of forecast model 1 and forecast model 2 vary across areas and characteristic factors. The more accurate of the two forecast models can be selected according to actual conditions.
The results of the statistical significance tests for each equation used to forecast the corresponding characteristic factor in forecast model 1 and forecast model 2 are shown in Table 9. The p value is much less than 0.05 in almost every case, which indicates that the corresponding prediction equations are significant.


(a) The mean of GCD for 24 hours’ forecast of two forecast models
(b) The standard deviation of GCD for 24 hours’ forecast of two forecast models
(a) The mean of GCD for 48 hours’ forecast of two forecast models
(b) The standard deviation of GCD for 48 hours’ forecast of two forecast models
4. Summary
In this paper, the CMA best track datasets from 1949 to 2012 are used, in combination with data mining technology and statistical methods, to put forward a new methodology for forecasting TCs’ characteristic factors. This methodology judges whether TCs land on Hainan Island or not and forecasts their characteristic factors (longitude, latitude, the lowest center pressure, and wind speed). The average probability of accurate judgment for the landing criteria is 74.70%, and the highest reaches 79.76%. For the forecast of landing TCs’ characteristic factors, a landing prediction pattern and a dynamic prediction pattern are proposed, which forecast the characteristic factors when TCs land and also provide dynamic 24-hour and 48-hour forecasts. The landing prediction pattern performs better: its mean GCD is 144.6382 km, compared with 222.6 km for the current 48-hour forecast in the South China Sea. Although the mean GCD in the dynamic prediction pattern is no smaller than that of the three main prediction centers (NHC, JMA, and NMCC), it is smaller than that of the numerical prediction model in [25] and of the method using satellite scatterometer monitoring data in [12]. The forecast methodology proposed in this paper provides a new method for typhoon warning on Hainan Island that requires little specialized meteorological knowledge, which simplifies the implementation of the prediction process while maintaining prediction accuracy.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported in part by the National Science-Technology Support Plan in the domain of advanced energy technology. Hainan Power Grid Corporation provided support for this research through the “Integrated Demonstration Project of Regional Smart Grid” no. 2013BAA01B03. Dr. Xinhong Huang, a Research Engineer in the Department of Electrical & Computer Engineering at the University of Western Ontario, also provided valuable revision comments on this paper.
References
[1] National Climate Center of China, Typhoon “Damrey” Has Caused Serious Damages to Hainan Province, 2005, http://ncc.cma.gov.cn/Website/index.php?NewsID=1454.
[2] J. S. Pedro, F. Burstein, and A. Sharp, “A case-based fuzzy multicriteria decision support model for tropical cyclone forecasting,” European Journal of Operational Research, vol. 160, no. 2, pp. 308–324, 2005.
[3] X. Qin and M. Mu, “A Study on the reduction of forecast error variance by three adaptive observation approaches for tropical cyclone prediction,” Monthly Weather Review, vol. 139, no. 7, pp. 2218–2232, 2011.
[4] Y. Wang, W. Zhang, and W. Fu, “Back Propogation(BP)-neural network for tropical cyclone track forecast,” in Proceedings of the 19th International Conference on Geoinformatics, pp. 1–4, IEEE, Shanghai, China, June 2011.
[5] B. Feng and J. N. K. Liu, “An adaptive neural network classifier for tropical cyclone prediction using a two-layer feature selector,” in Advances in Neural Networks – ISNN 2005, vol. 3497 of Lecture Notes in Computer Science, pp. 399–404, Springer, Berlin, Germany, 2005.
[6] H.-J. Song, S.-H. Huh, J.-H. Kim, C.-H. Ho, and S.-K. Park, “Typhoon track prediction by a support vector machine using data reduction methods,” in Computational Intelligence and Security, vol. 3801 of Lecture Notes in Computer Science, pp. 503–511, Springer, Berlin, Germany, 2005.
[7] M. DeMaria, “A simplified dynamical system for tropical cyclone intensity prediction,” Monthly Weather Review, vol. 137, no. 1, pp. 68–82, 2009.
[8] H. R. Winterbottom, E. W. Uhlhorn, and E. P. Chassignet, “A design and an application of a regional coupled atmosphere-ocean model for tropical cyclone prediction,” Journal of Advances in Modeling Earth Systems, vol. 4, no. 10, Article ID M10002, 2012.
[9] L. M. Ma and Z. M. Tan, “Improving the behavior of the cumulus parameterization for tropical cyclone prediction: convection trigger,” Atmospheric Research, vol. 92, no. 2, pp. 190–211, 2009.
[10] J. S. Gall, I. Ginis, S.-J. Lin, T. P. Marchok, and J.-H. Chen, “Experimental tropical cyclone prediction using the GFDL 25-km-resolution global atmospheric model,” Weather and Forecasting, vol. 26, no. 6, pp. 1008–1019, 2011.
[11] G. Lv, “The Northwestern Pacific typhoon track forecast in 2005,” in Proceedings of the Meteorological Science and Technology Symposium on Taiwan Strait, p. 7, Chinese Meteorological Society, 2006 (Chinese).
[12] D. Zhang, Y. Zhang, T. Hu, B. Xie, and J. Xu, “A comparison of HY-2 and QuikSCAT vector wind products for tropical cyclone track and intensity development monitoring,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 8, pp. 1365–1369, 2014.
[13] M. Ying, W. Zhang, H. Yu et al., “An overview of the China meteorological administration tropical cyclone database,” Journal of Atmospheric and Oceanic Technology, vol. 31, no. 2, pp. 287–301, 2014.
[14] G. E. Forsythe, “Generation and use of orthogonal polynomials for data-fitting with a digital computer,” Journal of the Society for Industrial & Applied Mathematics, vol. 5, no. 2, pp. 74–88, 1957.
[15] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, “Constrained k-means clustering with background knowledge,” in Proceedings of the 18th International Conference on Machine Learning, vol. 1, pp. 577–584, 2001.
[16] Y. Ye, “Neighborhood density method for selecting initial cluster centers in K-mean clustering,” in Proceedings of the Workshop on Data Mining for Biomedical Applications (PAKDD '06), pp. 189–198, 2006.
[17] P. Xiong, Data Mining Algorithm and Clementine Practice, Tsinghua University Press, Beijing, China, 2011 (Chinese).
[18] S. L. Crawford, “Extensions to the CART algorithm,” International Journal of Man-Machine Studies, vol. 31, no. 2, pp. 197–217, 1989.
[19] D. M. Allen, “Mean square error of prediction as a criterion for selecting variables,” Technometrics, vol. 13, pp. 469–475, 1971.
[20] S. Yu and J. Shen, “Forward and backward algorithms for selecting predictors on the basis of the criterion from prediction sum of squares and their application,” Acta Meteorologica Sinica, vol. 1, pp. 83–90, 1988.
[21] D. Yao and S. Yu, “The stepwise algorithm of selecting forecast factors based on PRESS rule,” Journal of Atmospheric Sciences, vol. 2, pp. 129–135, 1992 (Chinese).
[22] Y. Lu, Mathematical Statistics Methods, East China University of Science and Technology Press, Shanghai, China, 2005 (Chinese).
[23] K. Xie and B. Liu, “An ENSO-forecast independent statistical model for the prediction of annual Atlantic tropical cyclone frequency in April,” Advances in Meteorology, vol. 2014, Article ID 248148, 11 pages, 2014.
[24] J. Yu, J. Tang, Y. Dai, and B. Yu, “The error and cause analysis of China's typhoon path prediction,” Journal of Weather, vol. 6, pp. 695–700, 2012.
[25] S. Ma, A. Qu, and Z. Yu, “The parallelization of typhoon numerical prediction model of and track forecast error analysis,” Journal of Applied Meteorology, vol. 3, pp. 322–328, 2004 (Chinese).
Copyright
Copyright © 2014 Ruixu Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.