Advances in Meteorology

Volume 2017 (2017), Article ID 5019646, 17 pages

https://doi.org/10.1155/2017/5019646

## Applications of Cluster Analysis and Pattern Recognition for Typhoon Hourly Rainfall Forecast

^{1}Department of Civil Engineering, National Taiwan University, Taipei 10617, Taiwan

^{2}Department of Civil and Water Resources Engineering, National Chiayi University, Chiayi 60004, Taiwan

Correspondence should be addressed to Nan-Jing Wu

Received 30 June 2016; Revised 8 January 2017; Accepted 7 February 2017; Published 21 March 2017

Academic Editor: Soni M. Pradhanang

Copyright © 2017 Fu-Ru Lin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Based on meteorological and topographical factors, it is assumed that certain patterns exist in the spatial and temporal rainfall distribution of a watershed. A typhoon rainfall forecasting model is developed under this assumption. If rainfall patterns can be analyzed and recognized for the topography of an individual watershed, only the spatial rainfall distribution prior to a specific moment is needed to forecast the rainfall in the coming hours; no other meteorological or climatological conditions are required. In addition, techniques for supplementing missing rain gage data are considered so as to build an all-purpose forecast model. The proposed model, which integrates cluster analysis and pattern recognition, is tested using historical data from the Tamsui River Basin in Northern Taiwan. Its good performance is validated by the coefficient of correlation and the coefficient of efficiency.

#### 1. Introduction

Typhoon rainfall forecasting is extremely important because it is a basic requirement for flood routing simulation with a hydrologic model and extends the lead time of river flow forecasts. It is particularly needed in small- and medium-sized mountainous basins [1]. In Taiwan, owing to the high mountains and steep river slopes, heavy rainfall, especially during typhoon events, has frequently led to serious disasters such as flooding, landslides, and debris flows. In order to reduce loss of life and major economic impacts, the government has invested a great deal of manpower and budget in building disaster warning systems in which rainfall forecasting plays a key role. The forecast provides the rainfall input needed to predict the surface runoff from a watershed. This outflow, or the gaged water depth at the outlet of the watershed, is in turn needed as the upstream boundary condition of unsteady river flow computations [2–4]. During typhoons, however, gaged rainfall data often fail to reach the database system and are therefore unavailable for further computation. The lack of immediate rainfall data may then degrade the accuracy of real-time flood forecasting and other systems. To deal with this situation, the authorities should not only ensure the stability of the observation system and its transmission instruments but also build an all-purpose rainfall forecast model that can cope with lost data at any moment and still provide reasonably accurate and efficient forecasts.

Traditionally, rainfall forecasting has been based mainly on numerical fluid dynamic models [5]. This classical approach models the fluid and thermal dynamic systems to predict grid-point time series from boundary meteorological data. The simulation often requires intensive computation involving complex differential equations and computational algorithms. Moreover, its accuracy is limited by constraints such as incomplete boundary conditions, model assumptions, grid resolution, and numerical instabilities. Furthermore, because of its high variability in space and time, typhoon rainfall is one of the most difficult elements of the hydrologic cycle to forecast. The highly nonlinear and extremely complex physical process of typhoon rainfall also makes it very difficult to construct a physically based mathematical model [6].

Radar data and satellite images have also been used to forecast rainfall [7, 8]. Unfortunately, the relationship between rainfall and the outputs from satellite and radar images is not clear, and these outputs do not allow a satisfactory assessment of rain intensities [1]. In addition, owing to ground occultation and altitude effects, radar detection is particularly difficult in mountainous regions [9, 10].

In recent decades, research using artificial intelligence has gained scientific attention. The Artificial Neural Network (ANN) is one of the most representative of these achievements, and studies using ANNs have been reported in succession. Luk et al. [11] assumed that the spatial rainfall distribution at a specific moment is related to the records of the relevant rain gages over the preceding time interval. By using a backpropagation neural network (BP-ANN), they successfully built a model for forecasting the rainfall pattern in the next 15 minutes. The same concept was used to build other rainfall forecasting models with other kinds of neural networks, such as the feedforward neural network, the partial recurrent network, and the time delayed neural network [12]. Toth et al. [1] compared the accuracy of short-term rainfall forecasts obtained with three time series analysis techniques, namely, linear stochastic autoregressive moving average (ARMA) models, artificial neural networks (ANNs), and the nonparametric nearest-neighbors method. Chang et al. [13] compared and discussed three types of multistep-ahead (MSA) methods using previous rainfall and river stage for flood forecasting. Lin et al. [14] used a novel kind of neural network called support vector machines (SVMs) to construct typhoon rainfall forecasting models, and they ran these models with and without typhoon characteristics to forecast the rainfall. Because all the rainfall and flood forecasting models mentioned above take the gaged rainfall records of the preceding period as input data, they might not work properly when data gaps occur. Such a model cannot carry on further computations unless the lost data can be estimated correctly.

When a storm or frontal surface is approaching, the rainfall patterns in the windward area may be quite different from those in the leeward area because of topographical effects. As the storm or frontal surface moves during the typhoon period, the rainfall pattern at a specific gage location may alter drastically. This implies that the spatial and temporal distribution of rainfall is influenced by meteorology and topography. Because the topography does not change with time and storms or frontal surfaces usually move along certain paths, the trends of the spatial-temporal rainfall distribution can be expected to fall within some specific patterns. Based on these meteorological and climatological considerations, it is assumed in this paper that certain patterns exist in the spatial and temporal rainfall distribution of a particular river basin. An unsupervised pattern recognition method with strong fault tolerance is applied. The clustered construction can still match the remaining data to the corresponding pattern even if the input data are incomplete or have data gaps. The recognized patterns yield the model outputs, which are then used as the input for forecasting river runoff or water elevation at the outlet of the basin. This paper applies pattern recognition and statistical cluster analysis to classify the rainfall distribution in space and time using historical data from similar meteorological and climatological conditions. The study intends to build an all-purpose model with good accuracy and reliability for typhoon hourly rainfall forecasting, one that continues to perform its design function even when the rainfall data contain gaps.

#### 2. Methodology

##### 2.1. Cluster Analysis

Cluster analysis is the general logic, formulated as a procedure, by which we objectively group entities together on the basis of their similarities and differences [15]. The objective of data clustering is to employ a clustering algorithm to identify clusters of similar data within a dataset. The original dataset is thus decomposed into disjoint clusters, each with a center that represents the cluster. The cluster centers can then stand in for the original dataset, which achieves two goals: data compression and computation reduction. In general, clustering algorithms can be divided into two types: hierarchical clustering and nonhierarchical clustering (also called partition clustering). Hierarchical clustering is in turn either agglomerative or divisive. In agglomerative hierarchical clustering, every record starts as its own cluster, and the number of clusters is decreased from the size of the dataset by successive merging until the desired number of clusters is reached. In divisive hierarchical clustering, on the other hand, all records start in a single cluster, and the number of clusters is increased from one by successive splitting until the desired number of clusters is reached. In nonhierarchical clustering approaches, the number of clusters is fixed in advance, and a number of iterations are then performed to identify the best clusters and their cluster centers [16].
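To make the nonhierarchical (partition) case concrete, the following is a minimal sketch of a k-means-style iteration in plain Python. It is only an illustration under stated assumptions: k-means is one common partition algorithm and is not named in the paper, and the function names `euclid` and `kmeans`, the toy points, and the initial centers are all hypothetical.

```python
def euclid(a, b):
    """Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, centers, n_iter=20):
    """Partition clustering: the number of clusters, len(centers), is fixed in advance."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)), key=lambda k: euclid(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
labels, centers = kmeans(points, centers=[points[0], points[2]])
print(labels)  # [0, 0, 1, 1]
```

The result depends on the initial centers, which is the local-optimization risk that motivates hybrid schemes.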

Many empirical results indicate that nonhierarchical clustering with nonrandomly selected starting points performs better than hierarchical clustering [17]. However, nonhierarchical clustering requires the number of clusters to be predetermined, and starting from a random initial partition may lead to a merely local optimum. Therefore, algorithms such as two-stage or two-step clustering were developed, which combine the two approaches above to exploit their advantages and reduce their shortcomings. The Statistical Product and Service Solutions (SPSS) two-step cluster procedure is used in this paper; for completeness, the description below is drawn mainly from the SPSS support documentation and the IBM Knowledge Center [18, 19]. The SPSS two-step clustering component is a scalable cluster analysis algorithm designed to handle very large datasets and has become well known in recent years. The procedure is divided into two steps. In the first step, the records are preclustered into many small subclusters by a sequential clustering approach: the records are scanned one by one, and a distance criterion decides whether the current record should merge with a previously formed cluster or start a new one. This step is implemented with a modified cluster feature (CF) tree, which consists of levels of nodes. In the second step, the subclusters resulting from the first step are taken as input and grouped into the desired number of clusters by an agglomerative hierarchical clustering method.
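The two steps can be sketched in a few lines of plain Python. This is a simplified illustration and not the SPSS implementation: it keeps a flat list of subclusters instead of a CF tree, uses a plain Euclidean distance threshold as the merge criterion, and merges by centroid distance in the second step. The names `precluster` and `agglomerate`, the threshold value, and the toy records are assumptions made for the example.

```python
def euclid(a, b):
    """Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def precluster(records, threshold):
    """Step 1: scan records once; merge each into the nearest subcluster or start a new one."""
    subs = []  # each subcluster is [center, count]
    for r in records:
        best = min(subs, key=lambda s: euclid(r, s[0]), default=None)
        if best is not None and euclid(r, best[0]) <= threshold:
            c, n = best
            best[0] = [(ci * n + ri) / (n + 1) for ci, ri in zip(c, r)]  # running mean
            best[1] = n + 1
        else:
            subs.append([list(r), 1])
    return subs

def agglomerate(subs, n_clusters):
    """Step 2: repeatedly merge the two closest subclusters until n_clusters remain."""
    subs = [[list(c), n] for c, n in subs]
    while len(subs) > n_clusters:
        pairs = ((a, b) for a in range(len(subs)) for b in range(a + 1, len(subs)))
        i, j = min(pairs, key=lambda ab: euclid(subs[ab[0]][0], subs[ab[1]][0]))
        (ci, ni), (cj, nj) = subs[i], subs[j]
        merged = [[(x * ni + y * nj) / (ni + nj) for x, y in zip(ci, cj)], ni + nj]
        subs = [s for k, s in enumerate(subs) if k not in (i, j)] + [merged]
    return subs

# Hypothetical records and threshold, chosen only to show the mechanics.
records = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (10.0, 0.0), (10.2, 0.0)]
subclusters = precluster(records, threshold=0.5)
clusters = agglomerate(subclusters, n_clusters=2)
print(len(subclusters), sorted(n for _, n in clusters))  # 3 [2, 4]
```

Because the first pass touches each record only once, the expensive pairwise comparison in the second step runs on a handful of subclusters rather than on the full dataset.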

##### 2.2. Pattern Recognition

The concept of “recognition” comes from the main theory of artificial neural networks. When new input data arrive, one can immediately determine their category and the output corresponding to that category. The network structure must be strongly fault tolerant: even if the input data are incomplete, a clustered construction can still match the remaining data to the corresponding pattern and show which category the input belongs to. The key component of pattern recognition in this study is the winner-take-all (WTA) network, in which a group of artificial neurons compete with each other. A weight of 1 is given to the winner neuron, that is, the one closest to the input data, and a weight of 0 is given to all the others. This process is known as winner-take-all.
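A minimal sketch of the winner-take-all competition is given below. How missing components are handled is an assumption made for illustration: here the distance is computed over the observed components only, with `None` marking a missing gage reading. The function names and the cluster centers are hypothetical.

```python
def masked_distance(x, center):
    """Euclidean distance using only the components of x that are observed (not None)."""
    pairs = [(xi, ci) for xi, ci in zip(x, center) if xi is not None]
    if not pairs:
        raise ValueError("input has no observed components")
    return sum((xi - ci) ** 2 for xi, ci in pairs) ** 0.5

def winner_take_all(x, centers):
    """Return one weight per cluster: 1 for the nearest center, 0 for all others."""
    dists = [masked_distance(x, c) for c in centers]
    win = dists.index(min(dists))
    return [1 if k == win else 0 for k in range(len(centers))]

# Hypothetical cluster centers for three gages (values in mm/h).
centers = [[0.0, 0.0, 0.0], [5.0, 8.0, 3.0], [20.0, 15.0, 12.0]]
print(winner_take_all([4.0, 9.0, 2.5], centers))   # complete input -> [0, 1, 0]
print(winner_take_all([None, 9.0, 2.5], centers))  # first gage missing -> [0, 1, 0]
```

In this toy case the same neuron wins with or without the first reading, which is the kind of fault tolerance the text describes.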

In this paper, a “pattern” is a multivariable time-space series. The rainfall records of several gages over the latest time interval preceding a specific moment are combined into an input vector, and the dataset collected from numerous storm events is divided into specific groups. In this way, one captures not only the spatial characteristics of rainfall, which reflect topography (windward, leeward, altitude, etc.), but also the way these characteristics change over time. With these procedures, a typhoon rainfall forecasting model can be established. The so-called “pattern” refers to the rainfall distribution in time and space associated with a certain typhoon category, and “recognition” refers to assigning the available information to the corresponding category.

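Assume that there is a group of statistical samples. Each sample is composed of $n$ values and expressed as a mathematical vector of $n$ components:

$$\mathbf{X}^{p} = \left(x_{1}^{p}, x_{2}^{p}, \ldots, x_{n}^{p}\right),$$

where $p$ is the serial number of a specific sample.

First, cluster analysis is performed. In order to divide these samples into several patterns, the winner-take-all (WTA) neural network structure is employed to describe the distribution of samples. The pattern to which any specific sample belongs can be expressed as

$$c^{p} = \sum_{k=1}^{K} k\,\phi_{k}\left(\mathbf{X}^{p}\right),$$

where $c^{p}$ is a natural number that expresses the pattern to which the $p$th sample belongs, $K$ denotes the number of classifications, and $\phi_{k}$ is a binary function, which is the radial basis function (RBF) used in the WTA neural network:

$$\phi_{k}(\mathbf{X}) = \begin{cases} 1, & d_{k} = \min\limits_{1 \le l \le K} d_{l}, \\ 0, & \text{otherwise}, \end{cases}$$

where

$$d_{k} = \sqrt{\sum_{i=1}^{n} \left(x_{i} - w_{ki}\right)^{2}},$$

in which $x_{i}$ denotes the $i$th component of $\mathbf{X}$ and $\mathbf{W}_{k}$ represents the center of the $k$th cluster, which resulted from the approach of two-step clustering (see Section 2.1):

$$\mathbf{W}_{k} = \left(w_{k1}, w_{k2}, \ldots, w_{kn}\right).$$

After the statistical samples have been classified, for any new input, $\mathbf{X}$, one can find which pattern it belongs to by checking with this formula:

$$c = \sum_{k=1}^{K} k\,\phi_{k}(\mathbf{X}).$$

Furthermore, a relation between the input data and the output data needs to be constructed. Consider output data composed of $m$ values, each set corresponding to a specific $\mathbf{X}^{p}$. The output vector is expressed as

$$\mathbf{Y}^{p} = \left(y_{1}^{p}, y_{2}^{p}, \ldots, y_{m}^{p}\right).$$

Here $p$ is the serial number of a specific sample as previously defined. After all the samples have been clustered, one can find the $j$th component of the output vector corresponding to an input vector $\mathbf{X}$ by the following formula:

$$y_{j} = \sum_{k=1}^{K} \phi_{k}(\mathbf{X})\,\bar{y}_{kj},$$

where $\bar{y}_{kj}$ represents the $j$th component of the $k$th output pattern. If each sample belongs to a certain cluster and the distances among the samples in that cluster are very small, one can determine $\bar{y}_{kj}$ by using the average value to represent all the output data of the cluster:

$$\bar{y}_{kj} = \frac{\sum_{p=1}^{N} \phi_{k}\left(\mathbf{X}^{p}\right) y_{j}^{p}}{\sum_{p=1}^{N} \phi_{k}\left(\mathbf{X}^{p}\right)},$$

where $N$ is the total number of samples.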

When new data are added, one can find the cluster centers as described previously, identify to which pattern this sample belongs, and may use the relationship between input and output to predict the corresponding output.
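The recognition and input-output steps described in this section can be sketched as follows: each historical sample is assigned to its nearest cluster center, the outputs of the samples in each cluster are averaged into an output pattern, and a new input is mapped to the output pattern of the cluster it falls in. This is a sketch under assumptions, not the authors' code; the function names, the toy samples, and the given centers (which would come from the two-step clustering in practice) are hypothetical.

```python
def nearest(x, centers):
    """Index of the cluster center closest to x (the winner in winner-take-all)."""
    d = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    return d.index(min(d))

def output_patterns(inputs, outputs, centers):
    """Average the output vectors of the samples assigned to each cluster."""
    n_clusters, m = len(centers), len(outputs[0])
    sums = [[0.0] * m for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for x, y in zip(inputs, outputs):
        k = nearest(x, centers)
        counts[k] += 1
        for j in range(m):
            sums[k][j] += y[j]
    # A cluster with no members has no output pattern (None).
    return [[s / counts[k] for s in sums[k]] if counts[k] else None
            for k in range(n_clusters)]

def predict(x, centers, patterns):
    """Output pattern of the cluster that the new input x is recognized as."""
    return patterns[nearest(x, centers)]

# Hypothetical samples: two input components, one output component.
inputs = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]]
outputs = [[1.0], [3.0], [20.0], [22.0]]
centers = [[0.5, 0.0], [10.5, 10.0]]
patterns = output_patterns(inputs, outputs, centers)
print(patterns)                             # [[2.0], [21.0]]
print(predict([9.0, 9.0], centers, patterns))  # [21.0]
```

Since the forecast is a cluster average, its resolution is limited by the number of clusters and by how tightly the samples within each cluster agree.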

##### 2.3. Model Setup

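In practice, the input data, $\mathbf{X}^{p}$, are composed of spatial and temporal information and can be expressed as follows:

$$\mathbf{X}^{p} = \left(R_{1}(t), R_{1}(t-1), \ldots, R_{1}(t-T), R_{2}(t), R_{2}(t-1), \ldots, R_{2}(t-T), \ldots, R_{G}(t), R_{G}(t-1), \ldots, R_{G}(t-T)\right),$$

where $p$ is the serial number of a specific sample, $T$ is the number of previous time steps considered in the input pattern, and $G$ is the number of rain gages. The subscript of $R$ (i.e., $1, 2, \ldots, G$) is the serial number of the rain gage, while $R_{1}(t)$ denotes the rainfall data at time $t$ in rain gage number 1 and $R_{1}(t-1), \ldots, R_{1}(t-T)$ denote the rainfall data at the previous time steps in the same rain gage. The values of $R_{2}, \ldots, R_{G}$ are all defined in a similar way. The output data, $\mathbf{Y}^{p}$, can be expressed as follows:

$$\mathbf{Y}^{p} = \left(R_{1}(t+1), R_{2}(t+1), \ldots, R_{G}(t+1)\right).$$

In this paper, the rainfall records of the last $T$ hours and the present hour are used as the input data to forecast the gaged rainfall in the next hour. So, $\mathbf{X}^{p}$ contains $G(T+1)$ components and $\mathbf{Y}^{p}$ contains $G$ components.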
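As an illustration of this setup, the sketch below assembles input and output vectors from a table of hourly gage readings. The function name `build_samples`, the parameter `n_prev` (the number of previous hours used), and the two-gage toy data are assumptions made for the example; the number of previous hours actually used in the paper is not assumed here.

```python
def build_samples(rain, n_prev):
    """Build (inputs, outputs) from rain[t][g], the rainfall at hour t and gage g.

    Each input holds, gage by gage, the present hour followed by the n_prev
    previous hours; each output holds the next-hour rainfall at every gage.
    """
    n_hours, n_gages = len(rain), len(rain[0])
    inputs, outputs = [], []
    for t in range(n_prev, n_hours - 1):
        x = [rain[t - lag][g] for g in range(n_gages) for lag in range(n_prev + 1)]
        y = [rain[t + 1][g] for g in range(n_gages)]
        inputs.append(x)
        outputs.append(y)
    return inputs, outputs

# Hypothetical record: 5 hours at 2 gages (mm/h).
rain = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
inputs, outputs = build_samples(rain, n_prev=2)
print(inputs[0], outputs[0])  # [3, 2, 1, 30, 20, 10] [4, 40]
```

With two gages and two previous hours, each input has 2 × (2 + 1) = 6 components and each output has 2, one per gage.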

#### 3. Applications

##### 3.1. Study Area

In this paper, the feasibility of the method is tested by applying it to rainfall forecasting in the Tamsui River Basin in Northern Taiwan. The Tamsui River runs through Taipei, the capital city of Taiwan, and has a total drainage area of approximately 2726 km^{2}. Owing to the peculiar topography, its three main tributaries, Keelung River, Dahan Stream, and Sintain Stream, converge in the Taipei Basin, where severe damage usually occurs during storms and typhoons. Because of the concentration of population and the developed urban and suburban areas, the government has invested a great deal of manpower and budget in building a flood warning system, so abundant historical rainfall observations are available. However, when a typhoon occurs, data transmission deteriorates, resulting in missing rainfall data, and the lack of immediate rainfall data may affect the accuracy of flood forecasting. In order to deal with this situation, one should not only ensure the stability of the observation system and transmission instruments but also build an all-purpose forecast model that can cope with lost data at any moment.

There are many rain gages in the Tamsui River Basin. Those belonging to the Water Resources Agency are operationally stable and lose data less often. Therefore, in this paper the hourly rainfall data of these rain gages are used to forecast the gaged rainfall in the next hour. In total, 16 rain gages in the Tamsui River Basin belong to the Water Resources Agency. Three of them were set up after 2001; the other 13 gages have more than 20 years of historical data. The locations of the Tamsui River Basin and these 13 rain gages are shown in Figure 1. Frequency diagrams and information on the hourly rainfall of the rain gages in the Tamsui River Basin during typhoon events are shown in Figure 2.