Advances in Meteorology
Volume 2014 (2014), Article ID 203545, 15 pages
http://dx.doi.org/10.1155/2014/203545
Research Article

Feature Selection for Very Short-Term Heavy Rainfall Prediction Using Evolutionary Computation

1Department of Computer Science and Engineering, Kwangwoon University, 20 Kwangwoon-Ro, Nowon-Gu, Seoul 139-701, Republic of Korea
2Forecast Research Laboratory, National Institute of Meteorological Research, Korea Meteorological Administration, 45 Gisangcheong-gil, Dongjak-gu, Seoul 156-720, Republic of Korea

Received 16 August 2013; Revised 23 October 2013; Accepted 1 November 2013; Published 6 January 2014

Academic Editor: Sven-Erik Gryning

Copyright © 2014 Jae-Hyun Seo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We developed a method to predict heavy rainfall in South Korea with a lead time of one to six hours. To enable efficient prediction, we modified the AWS data for the recent four years by normalizing them to numeric values between 0 and 1 and by undersampling them so that the number of no-heavy-rain cases equals the number of heavy-rain cases. Evolutionary algorithms were used to select important features. Discriminant functions, such as the support vector machine (SVM), the k-nearest neighbors algorithm (k-NN), and a variant k-NN (k-VNN), were adopted for discriminant analysis. We divided the modified AWS data into three parts: the training set (2007–2008), the validation set (2009), and the test set (2010). The validation set was used to select an important subset of the input features. The main features selected were precipitation sensing and accumulated precipitation for 24 hours. In comparative SVM tests using evolutionary algorithms, the genetic algorithm was considerably superior to differential evolution. The equitable threat score of SVM with a polynomial kernel was the highest among our experiments on average. k-VNN outperformed k-NN, but it was dominated by SVM with a polynomial kernel.

1. Introduction

South Korea lies in the temperate zone and has four clearly distinguished seasons, with spring and fall relatively short compared to summer and winter. It is located between the meridians 125°04′E and 131°52′E and the parallels 33°06′N and 38°27′N in the Northern Hemisphere, on the east coast of the Eurasian continent and adjacent to the western Pacific, as shown in Figure 1. It therefore has complex climate characteristics showing both continental and oceanic features: a wide interseasonal temperature difference and much more precipitation than the continental interior. In addition, it has pronounced monsoon winds, a rainy period associated with the East Asian monsoon, locally called Changma [1], typhoons, and frequent heavy snowfalls in winter. The country belongs to a wet region, with precipitation above the world average.

Figure 1: The location of South Korea in East Asia and the dispersion of automatic weather stations in South Korea.

The annual mean precipitation of South Korea, as shown in Figure 2, is around 1,500 mm, and around 1,300 mm in the central part. Geoje-si in Gyeongsangnam-do has the largest amount of precipitation, 2,007.3 mm, and Baegryeong Island in Incheon has the lowest, 825.6 mm.

Figure 2: Annual (a) and summer (b) mean precipitation in South Korea (mm) [4].

When a stationary front lingers across the Korean Peninsula for about a month in summer, more than half of the annual precipitation falls during the Changma season, while winter precipitation is less than 10% of the annual total. Changma is a part of the summer Asian monsoon system. It brings frequent heavy rainfall and flash floods for 30 days on average, and serious natural disasters often occur.

Heavy rainfall is one of the major severe weather phenomena in South Korea. It can lead to serious damage and losses of both life and infrastructure, so forecasting heavy rainfall is very important. However, it is considered a difficult task because heavy rainfall takes place within very short time intervals [2].

We need to predict such torrential downpours to prevent losses of life and property [1, 3]; heavy rainfall forecasting is essential to avoid or minimize natural disasters before the events occur. We used real weather data collected from 408 automatic weather stations [4] in South Korea for the period from 2007 to 2010, and we studied the prediction of whether or not heavy rainfall will occur in South Korea with a lead time of one to six hours. To the best of the authors' knowledge, this problem has not been handled by other researchers.

There have been many studies on heavy rainfall using various machine learning techniques. In particular, several studies focused on weather forecasting using artificial neural networks (ANN) [5–11]. In the studies of Ingsrisawang et al. [11] and Hong [12], support vector machines were applied to develop classification and prediction models for rainfall forecasts. Our research differs from previous work in how the weather datasets are processed.

Kishtawal et al. [13] studied the prediction of summer rainfall over India using a genetic algorithm (GA). In their study, the genetic algorithm found the equations that best describe the temporal variations of the seasonal rainfall over India. The geographical region of India was divided into five homogeneous zones (excluding the North-West Himalayan zone), and they used the monthly mean rainfall during the months of June, July, and August. The dataset consists of a training set, ranging from 1871 to 1992, and a validation set, ranging from 1993 to 2003. The first and second evolution processes were conducted using the training set and the validation set, respectively. The performance of the algorithm for each case was evaluated using the statistical criteria of standard error and fitness strength. Each chromosome was made up of the five homogeneous zones, annual precipitation, and the four elementary arithmetic operators. The strongest individuals (the equations with the best fitness) were then selected to exchange parts of their character strings through reproduction and crossover, while individuals less fitted to the data were discarded. A small percentage of the equation strings' most basic elements, single operators and variables, were mutated at random. The process was repeated a large number of times (about 1,000–10,000) to improve the fitness of the evolving population of equations. The major advantage of using a genetic algorithm over other nonlinear forecasting techniques, such as neural networks, is that an explicit analytical expression for the dynamic evolution of the rainfall time series is obtained. However, they used quite simple, typical parameters for their genetic algorithm; had they tuned the various parameters, they might have reported better performance.

Liu et al. [14] proposed a filter method for feature selection. A genetic algorithm was used to select major features in their study, and the features were used for data mining based on machine learning. They proposed an improved Naive Bayes classifier (INBC) technique and explored the use of genetic algorithms (GAs) for the selection of a subset of input features in classification problems. They then carried out a comparison among several techniques on real meteorological data in Hong Kong: genetic algorithm with average or general classification (GA-AC, GA-C), C4.5 with pruning, and INBC with relative frequency or initial probability density (INBC-RF, INBC-IPD). In their experiments, daily observations of meteorological data were collected from the Observatory Headquarters and King's Park for training and test purposes, for the period from 1984 to 1992 (Hong Kong Observatory); they extracted only the data from May to October (the rainy season) of each year. INBC achieved about a 90% accuracy rate on the rain/no-rain (Rain) classification problems and attained reasonable performance, around 65%–70%, on rainfall prediction with three-level depth (Depth 3) and five-level depth (Depth 5). They used a filter method for feature selection; in general, a wrapper method is known to perform better than a filter method, so in this study we apply a wrapper method to feature selection.

Nandargi and Mulye [15] analyzed the period 1961–2005 to understand the relationships between rainy days, mean daily intensity, and seasonal rainfall over the Koyna catchment in India, on a monthly as well as a seasonal scale. They compared a linear relationship with a logarithmic one in the case of seasonal rainfall versus mean daily intensity.

Routray et al. [16] carried out a performance-based comparison of simulations using the nudging (NUD) technique and the three-dimensional variational (3DVAR) data assimilation system for a heavy rainfall event that occurred during 25–28 June 2005 along the west coast of India. In the experiment, after assimilating observations with the 3DVAR technique, the model was able to simulate the structure of the convective organization, as well as prominent synoptic features associated with the mid-tropospheric cyclones (MTC), better than the NUD experiment, and the results correlated well with the observations.

Kouadio et al. [17] investigated relationships between simultaneous occurrences of distinctive atmospheric easterly wave (EW) signatures that cross the south equatorial Atlantic, intense mesoscale convective systems (lifespan > 2 hours) that propagate westward over the western south equatorial Atlantic, and subsequent strong rainfall episodes (anomaly > 10 mm·day⁻¹) that occur in eastern Northeast Brazil (ENEB). They forecasted rainfall events through real-time monitoring and the simulation of this ocean-atmosphere relationship.

Afandi et al. [2] investigated heavy rainfall events that occurred over the Sinai Peninsula and caused flash floods, using the Weather Research and Forecasting (WRF) model. The test results showed that the WRF model was able to capture the heavy rainfall events over different regions of Sinai and to predict rainfall in good agreement with real measurements.

Wang and Huang [18] searched for evidence of self-organized criticality (SOC) in rain datasets in China by employing the theory and methods of SOC. To that end, they analyzed the long-term rain records of five meteorological stations in Henan, a central province of China, and found that the long-term rain processes in central China exhibit the feature of self-organized criticality.

Hou et al. [19] studied the impact of three-dimensional variational data assimilation (3DVAR) on the prediction of two heavy rainfall events over southern China in June and July: one affected several provinces in southern China with heavy rain and severe flooding; the other was characterized by nonuniformity and extremely high rainfall rates in localized areas. Their results suggested that assimilating all radar, surface, and radiosonde data had a more positive impact on the forecast skill than assimilating any single type of data, for the two rainfall events.

As a similar approach to ours, Lee et al. [20] studied feature selection using a genetic algorithm for heavy-rain prediction in South Korea. They used ECMWF (European Centre for Medium-Range Weather Forecasts) weather data collected from 1989 to 2009 and selected five features among 254 weather elements to examine the performance of their model: height, humidity, temperature, U-wind, and V-wind. In their study, a heavy-rain event is declared only when precipitation during six hours is higher than 70 mm. They used a wrapper-based feature selection method with a simple genetic algorithm and SVM with an RBF kernel as the fitness function, but they did not address errors and inconsistencies in their weather data. In this paper, we use the weather data collected from 408 automatic weather stations during the recent four years from 2007 to 2010. Our heavy-rain criterion is exactly that of the Korea Meteorological Administration, as shown in Section 3. We validate our algorithms with various machine learning techniques, including SVM with different kernels, and we explain and fix errors and inconsistencies in our weather data in Section 2.

The remainder of this paper is organized as follows. In Section 2, we propose data processing and methodology for very short-term heavy rainfall prediction. Section 3 describes the environments of our experiments and analyzes the results. The paper ends with conclusions in Section 4.

2. Data and Methodology

2.1. Dataset

The weather data, collected from 408 automatic weather stations during the recent four years from 2007 to 2010, had a considerable number of missing values, erroneous values, and unrelated features. We analyzed the data and corrected the errors, preprocessing the original data given by KMA in accordance with Table 1. Some weather elements of the original data had incorrect values, and we replaced each such value with a very small one (−10⁷). We created several elements, such as month (1–12) and accumulated precipitation for 3, 6, and 9 hours (0.1 mm), from the original data [21]. We removed or interpolated a day's data in the original data when important weather elements of that day had the very small value, and we likewise removed or interpolated new elements, such as accumulated precipitation for 3, 6, and 9 hours, that had incorrect values. We undersampled the weather data so that the proportion of heavy-rain to no-heavy-rain cases in the training set is one, as shown in Section 2.3.

Table 1: Modified weather elements [4, 21].

The new data were generated in two forms: with and without normalization. The training set, ranging from 2007 to 2008, was generated by undersampling. The validation set, the data for 2009, was used to select an important subset of the input features, and the selected features were used for experiments with the test set, the data for 2010. The representation used by our GA and DE was composed of 72 features accumulated over the most recent six hours, as shown in Figure 3. The symbols in Figure 3 denote the modified weather elements, in order of the index numbers shown in Table 1. The symbol "—" in Table 1 means not applicable (NA).

Figure 3: Representation with 72 features (accumulated weather factors for six hours).
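For illustration, the derived elements can be computed from hourly records as in the following Python sketch; the column names, such as precip_1h, are hypothetical stand-ins, not the element names used in the KMA data.

```python
import pandas as pd

def add_derived_elements(df):
    """Derive the month element and 3/6/9-hour accumulated precipitation
    (0.1 mm units) from hourly records indexed by a DatetimeIndex."""
    df = df.sort_index()
    df["month"] = df.index.month
    for h in (3, 6, 9):
        # Rolling sum over the last h hourly records; NaN until h records
        # exist, which mirrors removing/interpolating incomplete windows.
        df[f"precip_{h}h"] = df["precip_1h"].rolling(window=h, min_periods=h).sum()
    return df
```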
2.2. Normalization

The range of each weather element differed significantly (see Table 2), so the test results might be dominated by the values of a few weather elements. For that reason, we preprocessed the weather data using a normalization method. We calculated the upper bound and lower bound of each weather factor from the original training set; the upper bound was mapped to 1 and the lower bound to 0. Equation (1) shows the normalization used, where x is the value of a weather element and x_min and x_max are its lower and upper bounds in the original training set:

x' = (x − x_min) / (x_max − x_min). (1)

The validation set and the test set were normalized in accordance with the ranges of the original training set. Precipitation sensing in Table 2 indicates whether or not it rains.

Table 2: The upper and lower bound ranges of weather data.
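A minimal sketch of this min-max normalization, with bounds fitted on the training set only (variable names are ours):

```python
import numpy as np

def fit_bounds(train):
    """Compute per-element lower/upper bounds from the training set only."""
    return train.min(axis=0), train.max(axis=0)

def normalize(data, lower, upper):
    """Map each weather element linearly so that lower -> 0 and upper -> 1.

    Validation/test values outside the training range fall outside [0, 1];
    clip afterwards if the classifier requires strictly bounded inputs.
    """
    span = np.where(upper > lower, upper - lower, 1.0)  # guard constant columns
    return (data - lower) / span

# Bounds come from the 2007-2008 training set and are reused unchanged
# for the 2009 validation set and the 2010 test set.
train = np.array([[0.0, 990.0], [20.0, 1030.0]])
lower, upper = fit_bounds(train)
print(normalize(np.array([[10.0, 1010.0]]), lower, upper))  # [[0.5 0.5]]
```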
2.3. Sampling

Let n be the number of heavy rainfall occurrences in the training set. We randomly choose n cases among the no-heavy-rain cases in the training set. Table 3 shows the proportion of heavy-rain to no-heavy-rain for each year. On account of the results in Table 3, we preprocessed our data using this method, called undersampling, adjusting the proportion of heavy rainfall to the rest to be one, as shown in Figure 4 and Pseudocode 1.

Table 3: Heavy rainfall rate.

Pseudocode 1: A pseudocode of our undersampling process.

Figure 4: Example of our undersampling process.
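The undersampling step can be sketched in Python as follows (a simplified rendering of Pseudocode 1, assuming a binary heavy-rain label; names are ours):

```python
import numpy as np

def undersample(features, labels, rng=np.random.default_rng(0)):
    """Randomly keep as many no-heavy-rain records as heavy-rain records,
    so the class proportion in the training set becomes 1:1."""
    heavy = np.flatnonzero(labels == 1)
    none = np.flatnonzero(labels == 0)
    keep = rng.choice(none, size=len(heavy), replace=False)
    idx = np.sort(np.concatenate([heavy, keep]))
    return features[idx], labels[idx]
```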

Table 4 shows the ETS for prediction after 3 hours and the effect of undersampling [22] and normalization for 3 randomly chosen stations. The tests without undersampling showed a low equitable threat score (ETS) and required too long a computation time: the computation took 3,721 minutes with k-NN and 3,940 minutes with k-VNN (see Appendix B), the "reached max number of iterations" error was raised in SVM with the polynomial kernel (see Appendix C), and the hits and false alarms entering the ETS were zero. With undersampling, the computation took around 329 seconds with k-NN, 349 seconds with k-VNN, and 506 seconds with SVM with the polynomial kernel. The test results with normalization were about 10 times higher than those without normalization.

Table 4: Effect of undersampling (sampled 3 stations, prediction after 3 hours).
2.4. Genetic-Algorithm-Based Feature Selection

Pseudocode 2 shows the pseudocode of a typical genetic algorithm [23]. If we let n be the number of solutions in the population set, we create n new solutions in a random way. The evolution starts from a population of completely random individuals, and the fitness of the whole population is determined. Each generation consists of several operations, such as selection, crossover, mutation, and replacement; some individuals in the current population are replaced with new individuals to form a new population. This generational process is repeated until a termination condition has been reached. In a typical GA, the whole number of individuals in a population and the number of reproduced individuals are fixed at n and k, respectively. The percentage of individuals copied to the new generation is defined as the ratio of the number of new individuals to the size of the parent population, k/n, which is called the "generation gap" [24]. If the gap is close to zero, the GA is called a steady-state GA.

Pseudocode 2: The pseudocode of a genetic algorithm.

We selected important features using a wrapper method, which uses the induction algorithm itself to estimate the value of a given feature subset. The selected feature subset is the best individual among the results of the experiment with the validation set. The experimental results on the test set with the selected features showed better performance than those using all features.

The steps of the GA used are described in Box 1. All steps are iterated until the stop condition (the maximum number of generations) is satisfied. Figure 5 shows the flow diagram of our steady-state GA.

Box 1: Steps of the used GA.

Figure 5: Flow diagram of the proposed steady-state GA.
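The following Python sketch outlines a steady-state GA of this kind for wrapper-based feature selection. The population size, operators, and rates are placeholders rather than the settings of Table 5, and fitness stands for training a classifier on the masked features and returning its validation score (e.g., ETS):

```python
import random

def steady_state_ga(num_features, fitness, pop_size=50, generations=100,
                    mutation_rate=0.02, seed=0):
    """Individuals are 0/1 masks over the features; fitness(mask) evaluates
    a classifier trained on the masked features (wrapper evaluation)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(num_features)]
           for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):
        # Tournament selection of two parents.
        p1, p2 = (max(rng.sample(range(pop_size), 3), key=lambda i: scores[i])
                  for _ in range(2))
        # Uniform crossover followed by bit-flip mutation.
        child = [pop[p1][j] if rng.random() < 0.5 else pop[p2][j]
                 for j in range(num_features)]
        child = [1 - b if rng.random() < mutation_rate else b for b in child]
        # Steady state: only the current worst individual is replaced,
        # so the generation gap stays close to zero.
        worst = min(range(pop_size), key=lambda i: scores[i])
        pop[worst], scores[worst] = child, fitness(child)
    best = max(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best]
```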
2.5. Differential-Evolution-Based Feature Selection

Khushaba et al. [25, 26] proposed a differential-evolution-based feature selection (DEFS) technique, shown schematically in Figure 6. The first step in the algorithm is to generate new population vectors from the original population. A new mutant vector is formed by first selecting two random vectors, performing a weighted difference, and adding the result to a third random (base) vector. The mutant vector is then crossed with the original vector that occupies that position in the original matrix; the result of this operation is called a trial vector. The corresponding position in the new population will contain either the trial vector (or its corrected version) or the original target vector, depending on which of the two achieves a higher fitness (classification accuracy). Because a real-number optimizer is being used, nothing prevents two dimensions from settling at the same feature coordinates. To overcome this problem, they proposed to employ feature distribution factors to replace duplicated features. A roulette wheel weighting scheme is utilized: a cost weighting is implemented in which the probabilities of individual features are calculated from the distribution factors associated with each feature. The distribution factor of feature i combines two terms with constant weights and includes a small factor to avoid division by zero (see the equation in [25, 26]). The positive distribution factor PD_i is computed from the subsets that achieved an accuracy higher than the average accuracy of all subsets, and the negative distribution factor ND_i is computed from the subsets that achieved an accuracy lower than the average. This is shown schematically in Figure 7, with the light gray region containing the elements achieving less error than the average error values and the dark gray region containing the elements achieving higher error rates than the average. The rationale behind the distribution factor is to replace the replicated parts of the trial vectors according to two factors: PD_i indicates the degree to which feature i contributes to forming good subsets, while the second term favors exploration, being close to 1 if the overall usage of a specific feature is very low.

Figure 6: The DEFS algorithm [25, 26].
Figure 7: The feature distribution factors [25, 26].
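The mutation, crossover, and duplicate-repair steps can be sketched as follows. This is a simplified illustration: the repair redraws duplicated features uniformly, whereas DEFS [25, 26] weights the redraw by the distribution factors described above.

```python
import numpy as np

def de_trial(pop, i, F=0.5, CR=0.9, rng=np.random.default_rng(0)):
    """Build one DE trial vector: mutant = base + F*(a - b), followed by
    binomial crossover with the target vector pop[i]."""
    P, S = pop.shape
    a, b, base = pop[rng.choice([j for j in range(P) if j != i], 3,
                                replace=False)]
    mutant = base + F * (a - b)            # weighted difference vector
    cross = rng.random(S) < CR
    cross[rng.integers(S)] = True          # ensure at least one gene crosses
    return np.where(cross, mutant, pop[i])

def repair_duplicates(trial, num_features, rng=np.random.default_rng(0)):
    """Round the real-valued trial vector to feature indices and redraw
    duplicates uniformly (DEFS instead uses roulette-wheel weighting)."""
    idx = np.clip(np.rint(trial), 0, num_features - 1).astype(int)
    seen = set()
    for k, v in enumerate(idx):
        while v in seen:
            v = rng.integers(num_features)
        seen.add(v)
        idx[k] = v
    return idx
```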

3. Experimental Results

We preprocessed the original weather data: several weather elements were added or removed, as shown in Table 1, and the modified weather data were undersampled and normalized. Each hourly record of the data consists of twelve weather elements, and the representation was made up of the latest six hourly records, that is, 72 features, as shown in Figure 3. We extracted a feature subset using the validation set and used that subset in the experiments with the test set.

The observation area covers 408 automatic weather stations in the southern part of the Korean Peninsula, and the prediction time ranges from one hour to six hours. We adopted GA and DE among the evolutionary algorithms, and SVM, k-VNN, and k-NN were used as discriminant functions. Table 5 shows the parameters of the steady-state GA and DE. LibSVM [27] was adopted as the SVM library; we set the SVM type to C_SVC, which regularizes support vector classification, and the kernel functions used were polynomial, linear, and precomputed. We set k to be 3 in our experiments (see Appendix B).

Table 5: Parameters in GA/DE.

In South Korea, a heavy-rain advisory is issued when precipitation during 6 hours is higher than 70 mm or precipitation during 12 hours is higher than 110 mm; a heavy-rain warning is issued when precipitation during 6 hours is higher than 110 mm or precipitation during 12 hours is higher than 180 mm. We preprocessed the weather data using this criterion. To select the main features, we adopted a wrapper method, which, unlike a filter method, uses the classifier itself in feature evaluation.
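Expressed as code, the criterion is a simple rule over accumulated precipitation (a sketch with our function names):

```python
def heavy_rain_advisory(precip_6h_mm, precip_12h_mm):
    """KMA heavy-rain advisory: > 70 mm in 6 h or > 110 mm in 12 h."""
    return precip_6h_mm > 70.0 or precip_12h_mm > 110.0

def heavy_rain_warning(precip_6h_mm, precip_12h_mm):
    """KMA heavy-rain warning: > 110 mm in 6 h or > 180 mm in 12 h."""
    return precip_6h_mm > 110.0 or precip_12h_mm > 180.0
```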

An automatic weather station (AWS) [28] is an automated version of the traditional weather station, deployed either to save human labor or to enable measurements from remote areas. An automatic weather station typically consists of a weather-proof enclosure containing the data logger, rechargeable battery, telemetry (optional), and meteorological sensors, with an attached solar panel or wind turbine, mounted upon a mast; the specific configuration may vary with the purpose of the system. In Table 6, Fc and Obs are abbreviations for forecast and observed, respectively. The equitable threat score (ETS) used to evaluate precipitation forecast skill is computed from the contingency table as

ETS = (hits − hits_random) / (hits + misses + false alarms − hits_random),

where hits_random = (hits + misses)(hits + false alarms) / total is the number of hits expected from a random forecast with the same marginal totals.

Table 6: Contingency table.
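From the contingency table, the ETS can be computed as follows (a standard formulation; the variable names are ours):

```python
def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """ETS = (H - He) / (H + M + F - He), where He is the number of hits
    expected by chance given the marginal totals of the contingency table."""
    total = hits + misses + false_alarms + correct_negatives
    hits_random = (hits + misses) * (hits + false_alarms) / total
    denom = hits + misses + false_alarms - hits_random
    return (hits - hits_random) / denom if denom else 0.0
```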

These experiments were conducted using LibSVM [27] on an Intel Core2 duo quad core 3.0 GHz PC. Each run of GA took about 201 seconds in the SVM test with normalization and about 202 seconds without; about 126 seconds in the k-NN test with normalization and about 171 seconds without; and about 135 seconds in the k-VNN test with normalization and about 185 seconds without.

Each run of DE took about 6 seconds in the SVM test with normalization and about 5 seconds without; about 5 seconds in the k-NN test with normalization and about 4 seconds without; and about 5 seconds in the k-VNN test with normalization and about 4 seconds without.

A heavy-rain event, which meets the criterion of heavy rainfall, consists of a consecutive time interval with a beginning time and an end time. Forecasting the coming event means discerning whether or not heavy rain occurs at the beginning time; forecasting the whole process means discerning, for each hour from the beginning time to the end time, whether or not heavy rain occurs. We defined CE and WP to be forecasting the coming event and the whole process of heavy rainfall, respectively.

Table 7 shows the experimental results for GA and DE. Overall, GA was about 1.42 and 1.49 times better than DE in CE and WP predictions, respectively. In the DE experiments, SVM and k-VNN were about 2.11 and 1.10 times better than k-NN in CE prediction, and about 2.48 and 1.08 times better in WP prediction, respectively. In the GA experiments, SVM with the polynomial kernel showed better performance on average than with the linear or precomputed kernel; SVM with the polynomial kernel and k-VNN were about 2.62 and 2.39 times better than k-NN in CE prediction, and about 2.01 and 1.49 times better in WP prediction, respectively. As the prediction time grows longer, ETS declines steadily. SVM with the polynomial kernel shows the best ETS among the GA test results. Figure 8 visually compares the CE and WP results in the GA experiments.

Table 7: Experimental results (1–6 hours) by ETS.
Figure 8: Experimental results for GA from 1 to 6 hours.

Consequently, SVM showed the highest performance among our experiments, and the k-VNN results showed that weighting the neighbors' votes by the features' correlation with the class had a significant effect on the test results, in comparison with k-NN. Tables 8, 9, 10, and 11 show detailed SVM (polynomial kernel) test results for GA and DE.

Table 8: Results of DE with SVM from 1 to 6 hours (CE).
Table 9: Results of DE with SVM from 1 to 6 hours (WP).
Table 10: Results of GA with SVM from 1 to 6 hours (CE).
Table 11: Results of GA with SVM from 1 to 6 hours (WP).

We selected the important features using the wrapper method, which uses the induction algorithm to estimate the value of a given feature set. All features consist of accumulated weather factors for six hours, as shown in Figure 3. The selected feature subset is the best individual among the experimental results on the validation set. Figure 9 shows the frequency of the selected features for lead times from one to six hours; the test results using the selected features were higher than those using all features. The 20 features derived from the statistical analysis, which has a 95 percent confidence interval, are shown in Figure 9. The main seven features selected were used evenly by each prediction hour; they correspond to precipitation sensing and accumulated precipitation for 24 hours.

Figure 9: Frequency of selected features for prediction lead times from 1 to 6 hours.

We compared the heavy rainfall prediction test results of GA and DE, as shown in Table 7; the results showed that GA was significantly better than DE. Figure 10 shows precipitation maps of the GA SVM test results with normalization and undersampling, from one to six hours; higher ETS values are depicted in darker blue. The numbers of automatic weather stations by prediction hour are 105, 205, 231, 245, 223, and 182, in order from one to six hours. The numbers differ by prediction hour for the following reasons. First, we undersampled the weather data by adjusting the number of no-heavy-rain samples to equal the number of heavy-rain samples in the training set, as shown in Section 2.3. Second, we excluded any AWS for which the number of records in the training set was lower than three. Third, we excluded any AWS for which hits and false alarms were 0 in the validation results. Finally, we excluded any AWS for which hits, false alarms, and misses were 0 in the test results.

Figure 10: Individual maps, with AWS as blue dots, for GA heavy rainfall prediction with lead times from 1 to 6 hours (ETS).

The weather data collected from the automatic weather stations during the recent four years had a lot of missing and erroneous data, and our test required more than three valid records in the training set. For that reason, the number of usable automatic weather stations was lowest for the prediction after one hour and increased as the prediction time became longer.

4. Conclusion

In this paper, we addressed the difficulty, necessity, and significance of very short-term heavy rainfall forecasting. We used various machine learning techniques, such as SVM, k-NN, and k-VNN based on GA and DE, to forecast heavy rainfall from one hour to six hours ahead. The results of GA were significantly better than those of DE, and SVM with the polynomial kernel showed the best results on average among the classifiers in our GA experiments. A validation set was used to select the important features, and the selected features were used to predict very short-term heavy rainfall. We derived 20 features from the statistical analysis, which has a 95 percent confidence interval; the main features selected were precipitation sensing and accumulated precipitation for 24 hours.

In future work, we will preprocess the weather data with various methods, such as representation learning and cyclic loess, contrast, and quantile normalization algorithms. We will also apply other machine learning techniques, such as statistical relational learning, multilinear subspace learning, and association rule learning. As more appropriate parameters are applied to the evolutionary algorithms or machine learning techniques, we expect to get better results. We have validated our algorithms with AWS data; it would be interesting to examine the performance with, for example, satellite data as another direction for future work.

Appendices

A. Spatial and Temporal Distribution of Heavy Rainfall over South Korea

We calculated the rainfall duration meeting the criterion of heavy rainfall at each automatic weather station for the period from 2007 to 2010, divided the rainfall duration by 100, and depicted the result in the map. Figure 11 shows the distribution of heavy rainfall for all seasons, and Figure 12 shows the distribution by season. Most heavy rainfall is concentrated in summer and has a wide regional precipitation range, and its frequency differs considerably from region to region.

Figure 11: The distribution of heavy rainfall for the whole seasons (2007–2010).
Figure 12: The distribution of heavy rainfall by seasons (2007–2010).

B. k-Nearest Neighbors Classifier

In pattern recognition, the k-nearest neighbors algorithm (k-NN) [29] is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). The k-NN classifier is commonly based on the Euclidean distance between a test sample and the specified training samples.

Golub et al. [30] developed a procedure that uses a fixed subset of informative genes and makes a prediction based on the expression level of these genes in a new sample. Each informative gene casts a weighted vote for one of the classes, with the magnitude of each vote dependent on the expression level in the new sample and the degree of that gene's correlation with the class distinction in their class predictor. We made a variant k-nearest neighbors algorithm (k-VNN) in which the degree P(f, c) of a feature's correlation with the class is applied to the majority vote of the neighbors. Box 2 shows the equation calculating the correlation between feature and class; in Box 2, f means a feature (i.e., a weather element) and c means a class (i.e., heavy-rain or no-heavy-rain). The test results of k-VNN were better than those of k-NN. We set k to be 3 in our experiments, because the classifier is expected to show low performance when k is just 1 and to take a long computing time when k is 5 or more.

Box 2: Correlation between feature and class (0 or 1) [30, 31].
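A minimal sketch of one plausible reading of k-VNN, assuming the Golub-style signal-to-noise correlation P(f, c) = (μ1 − μ0) / (σ1 + σ0) for each feature; this is our reconstruction for illustration, not the authors' exact weighting scheme:

```python
import numpy as np

def golub_correlation(X, y):
    """Signal-to-noise correlation between each feature and the class
    (cf. Box 2, [30, 31]): P(f, c) = (mu1 - mu0) / (sigma1 + sigma0)."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    s0, s1 = X[y == 0].std(axis=0), X[y == 1].std(axis=0)
    return (mu1 - mu0) / (s0 + s1 + 1e-12)

def kvnn_predict(X_train, y_train, x, k=3):
    """k-VNN sketch: |P(f, c)| weights each feature in the distance, and
    each of the k nearest neighbors casts an inverse-distance-weighted vote."""
    w = np.abs(golub_correlation(X_train, y_train))
    d = np.sqrt((((X_train - x) ** 2) * w).sum(axis=1))  # weighted Euclidean
    votes = np.zeros(2)
    for i in np.argsort(d)[:k]:
        votes[y_train[i]] += 1.0 / (d[i] + 1e-12)        # weighted vote
    return int(votes.argmax())
```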

C. Support Vector Machine

Support vector machines (SVM) [32] are a set of related supervised learning methods that analyze data, recognize patterns, and are used for classification and regression analysis. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, which makes the SVM a nonprobabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible; new examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall on.

D. Evolutionary Computation

A genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution, and this heuristic is routinely used to generate useful solutions to optimization and search problems [33]. In the process of a typical genetic algorithm, the evolution starts from the population of completely random individuals, and the fitness of the whole population is determined. Each generation consists of several operations, such as selection, crossover, mutation, and replacement. Some individuals in the current population are replaced with new individuals to form a new population. Finally, this generational process is repeated, until a termination condition has been reached.

Differential evolution (DE) is an evolutionary (direct-search) algorithm that has mainly been used to solve optimization problems. DE shares similarities with traditional evolutionary algorithms; however, it does not use binary encoding as a simple genetic algorithm does, and it does not use a probability density function to self-adapt its parameters as an evolution strategy does. Instead, DE performs mutation based on the distribution of the solutions in the current population, so that search directions and possible step sizes depend on the location of the individuals selected to calculate the mutation values [34].

E. Differences between Adopted Methods

In applied mathematics and theoretical computer science, combinatorial optimization consists of finding an optimal object from a finite set of objects; in many such problems, exhaustive search is not feasible. Combinatorial optimization operates on the domain of optimization problems in which the set of feasible solutions is discrete or can be reduced to a discrete set, and in which the goal is to find the best solution [33].

Feature selection is the problem of finding a subset among all features, and it is a kind of combinatorial optimization. Genetic algorithms (GAs) and differential evolution (DE) use a random element within an algorithm for optimization or combinatorial optimization, and they are typically used to solve combinatorial optimization problems such as feature selection, as in this paper.

Machine learning techniques include a number of statistical methods for handling classification and regression, and machine learning mainly focuses on prediction based on known properties learned from the training data [33]. It is not easy to use general machine learning techniques for feature selection; in this paper, machine learning techniques were used for classification. GA and DE could be used for regression, but they are weak there because they take longer computing time than dedicated regression algorithms.

F. Detailed Statistics of Experimental Results

Tables 8–11 show the SVM (polynomial kernel) test results for GA and DE. Following the contingency table (Table 6), the test results report ETS and other scores. We defined CE and WP as forecasting the coming event and the whole process of heavy rainfall, respectively. The test results include the number of automatic weather stations used for each prediction hour, and that number is set equally across experiments for the same prediction hour. As a result, GA was considerably superior to DE.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

A preliminary version of this paper appeared in the Proceedings of the International Conference on Convergence and Hybrid Information Technology, pp. 312–322, 2012. The authors would like to thank Mr. Seung-Hyun Moon for his valuable suggestions in improving this paper. The present research has been conducted by the Research Grant of Kwangwoon University in 2014. This work was supported by the Advanced Research on Meteorological Sciences, through the National Institute of Meteorological Research of Korea, in 2013 (NIMR-2012-B-1).

References

1. J. Bushey, "The Changma," http://www.theweatherprediction.com/weatherpapers/007.
2. G. E. Afandi, M. Mostafa, and F. E. Hussieny, "Heavy rainfall simulation over Sinai Peninsula using the weather research and forecasting model," International Journal of Atmospheric Sciences, vol. 2013, Article ID 241050, 11 pages, 2013.
3. J. H. Seo and Y. H. Kim, "A survey on rainfall forecast algorithms based on machine learning technique," in Proceedings of the KIIS Fall Conference, vol. 21, no. 2, pp. 218–221, 2011 (in Korean).
4. Korea Meteorological Administration, http://www.kma.go.kr.
5. M. N. French, W. F. Krajewski, and R. R. Cuykendall, "Rainfall forecasting in space and time using a neural network," Journal of Hydrology, vol. 137, no. 1–4, pp. 1–31, 1992.
6. E. Toth, A. Brath, and A. Montanari, "Comparison of short-term rainfall prediction models for real-time flood forecasting," Journal of Hydrology, vol. 239, no. 1–4, pp. 132–147, 2000.
7. S. J. Burian, S. R. Durrans, S. J. Nix, and R. E. Pitt, "Training artificial neural networks to perform rainfall disaggregation," Journal of Hydrologic Engineering, vol. 6, no. 1, pp. 43–51, 2001.
8. M. C. Valverde Ramírez, H. F. de Campos Velho, and N. J. Ferreira, "Artificial neural network technique for rainfall forecasting applied to the São Paulo region," Journal of Hydrology, vol. 301, no. 1–4, pp. 146–162, 2005.
9. N. Q. Hung, M. S. Babel, S. Weesakul, and N. K. Tripathi, "An artificial neural network model for rainfall forecasting in Bangkok, Thailand," Hydrology and Earth System Sciences, vol. 13, no. 8, pp. 1413–1425, 2009.
10. V. M. Krasnopolsky and Y. Lin, "A neural network nonlinear multimodel ensemble to improve precipitation forecasts over continental US," Advances in Meteorology, vol. 2012, Article ID 649450, 11 pages, 2012.
11. L. Ingsrisawang, S. Ingsriswang, S. Somchit, P. Aungsuratana, and W. Khantiyanan, "Machine learning techniques for short-term rain forecasting system in the northeastern part of Thailand," in Proceedings of the World Academy of Science, Engineering and Technology, vol. 31, pp. 248–253, 2008.
12. W.-C. Hong, "Rainfall forecasting by technological machine learning models," Applied Mathematics and Computation, vol. 200, no. 1, pp. 41–57, 2008.
13. C. M. Kishtawal, S. Basu, F. Patadia, and P. K. Thapliyal, "Forecasting summer rainfall over India using genetic algorithm," Geophysical Research Letters, vol. 30, no. 23, pp. 1–9, 2003.
14. J. N. K. Liu, B. N. L. Li, and T. S. Dillon, "An improved Naïve Bayesian classifier technique coupled with a novel input solution method," IEEE Transactions on Systems, Man and Cybernetics C, vol. 31, no. 2, pp. 249–256, 2001.
15. S. Nandargi and S. S. Mulye, "Relationships between rainy days, mean daily intensity, and seasonal rainfall over the Koyna catchment during 1961–2005," The Scientific World Journal, vol. 2012, Article ID 894313, 10 pages, 2012.
16. A. Routray, K. K. Osuri, and M. A. Kulkarni, "A comparative study on performance of analysis nudging and 3DVAR in simulation of a heavy rainfall event using WRF modeling system," ISRN Meteorology, vol. 2012, Article ID 523942, 21 pages, 2012.
17. Y. K. Kouadio, J. Servain, L. A. T. Machado, and C. A. D. Lentini, "Heavy rainfall episodes in the eastern northeast Brazil linked to large-scale ocean-atmosphere conditions in the tropical Atlantic," Advances in Meteorology, vol. 2012, Article ID 369567, 16 pages, 2012.
18. Z. Wang and C. Huang, "Self-organized criticality of rainfall in central China," Advances in Meteorology, vol. 2012, Article ID 203682, 8 pages, 2012.
19. T. Hou, F. Kong, X. Chen, and H. Lei, "Impact of 3DVAR data assimilation on the prediction of heavy rainfall over southern China," Advances in Meteorology, vol. 2013, Article ID 129642, 17 pages, 2013.
20. H. D. Lee, S. W. Lee, J. K. Kim, and J. H. Lee, "Feature selection for heavy rain prediction using genetic algorithms," in Proceedings of the Joint 6th International Conference on Soft Computing and Intelligent Systems and 13th International Symposium on Advanced Intelligent Systems (SCIS-ISIS '12), pp. 830–833, 2012.
21. J. H. Seo and Y. H. Kim, "Genetic feature selection for very short-term heavy rainfall prediction," in Proceedings of the International Conference on Convergence and Hybrid Information Technology, vol. 7425 of Lecture Notes in Computer Science, pp. 312–322, 2012.
22. N. V. Chawla, "Data mining for imbalanced datasets: an overview," Data Mining and Knowledge Discovery Handbook, vol. 5, pp. 853–867, 2006.
23. Y.-S. Choi and B.-R. Moon, "Feature selection in genetic fuzzy discretization for the pattern classification problems," IEICE Transactions on Information and Systems, vol. 90, no. 7, pp. 1047–1054, 2007.
24. K. A. de Jong, An Analysis of the Behavior of a Class of Genetic Adaptive Systems [Ph.D. thesis], University of Michigan, Ann Arbor, Mich, USA, 1975.
25. R. N. Khushaba, A. Al-Ani, and A. Al-Jumaily, "Differential evolution based feature subset selection," in Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), pp. 1–4, December 2008.
26. R. N. Khushaba, A. Al-Ani, and A. Al-Jumaily, "Feature subset selection using differential evolution and a statistical repair mechanism," Expert Systems with Applications, vol. 38, no. 9, pp. 11515–11526, 2011.
27. C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.
28. Automatic Weather Stations, http://www.automaticweatherstation.com.
29. R. Chang, Z. Pei, and C. Zhang, "A modified editing k-nearest neighbor rule," Journal of Computers, vol. 6, no. 7, pp. 1493–1500, 2011.
30. T. R. Golub, D. K. Slonim, P. Tamayo et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, no. 5439, pp. 531–537, 1999.
31. Y. H. Kim, S. Y. Lee, and B. R. Moon, "A genetic approach for gene selection on microarray expression data," in Genetic and Evolutionary Computation—GECCO 2004, K. Deb, Ed., vol. 3102 of Lecture Notes in Computer Science, pp. 346–355, 2004.
32. Y. Yin, D. Han, and Z. Cai, "Explore data classification algorithm based on SVM and PSO for education decision," Journal of Convergence Information Technology, vol. 6, no. 10, pp. 122–128, 2011.
33. Wikipedia, http://en.wikipedia.org.
34. E. Mezura-Montes, J. Velázquez-Reyes, and C. A. Coello Coello, "A comparative study of differential evolution variants for global optimization," in Proceedings of the 8th Annual Genetic and Evolutionary Computation Conference, pp. 485–492, July 2006.