Abstract

Floods are among the most hazardous natural disasters, and flood disaster management relies heavily on precise forecasts. These forecasts are provided by physical models based on differential equations. However, such models depend on unreliable inputs such as measurements or parameter estimations, which causes undesirable inaccuracies. Thus, an appropriate data-mining analysis of the physical model and of its precision, based on features that determine distinct situations, seems helpful for adjusting the physical model. An application of the fuzzy GUHA method to flood peak prediction is presented. Measured water flow rate data from a system for flood predictions were used in order to mine fuzzy association rules expressed in natural language. The provided data was first extended by the generation of artificial variables (features). The resulting variables were then translated into fuzzy GUHA tables with the help of Evaluative Linguistic Expressions in order to mine associations. The found associations were interpreted as fuzzy IF-THEN rules and used jointly with the Perception-based Logical Deduction inference method to predict the expected time shift of flow rate peaks forecasted by the given physical model. Results obtained from this adjusted model were statistically evaluated and the improvement in forecasting accuracy was confirmed.

1. Introduction

Disaster management is becoming a more and more important task. Among natural disasters, floods are among the most hazardous and, moreover, among the most frequently occurring in Central Europe. Researchers invest enormous effort into the investigation of distinct flood models that would help to forecast floods and thus provide disaster management with reliable decision support that could be used to prevent further casualties and material damage.

One such long-term research effort, focusing on disaster management and especially on modeling and forecasting floods, gave rise to FLOREON, a system for emergent flood prediction [1]. No matter how sophisticated the system is, due to the natural imprecision in data sources (e.g., measuring stations), due to the natural imprecision in parameter settings (crisp values determined by an expert decision), and given how complicated the whole problem is, it necessarily provides forecasts that are not always precise.

Therefore, it seems appropriate to focus on an analysis of the performance of the system that could give at least a vague idea under which conditions the system works well, under which conditions it exhibits a certain imprecision, and under which conditions we are able to correct the forecast. Given the sources of the imprecision, an appropriate data-mining technique that involves fuzziness might provide promising results and is worth attempting. In this investigation, we address the problem outlined above with the help of the fuzzy GUHA method, that is, a specific variant of the association-mining technique that allows using the concepts of fuzzy logic in a broad sense [2].

1.1. Brief Problem Description

The data being analyzed come from measurements of the water flow rate of the Odra River in Ostrava, Czech Republic. Measuring stations provide the flow rate [m³/s] on an hourly basis. The goal is to forecast the future flow rate. This is done by the so-called Math-1D model [3] developed for the FLOREON disaster management IT system [1].

The Math-1D model is a differential-equation-based model of the flow rate. In order to provide flow rate forecasts, it uses information about precipitations (past and forecasted), the soil type, the river bank shape, and other parameters. Although it is a well-established physical model that has been empirically examined, it is not sufficiently reliable. The reason does not lie in the model itself but in the fact that most of the parameters and input data are highly imprecise. For example, the soil type is provided by a hydrologist expert but, due to natural limitations, without a deeper geological analysis; moreover, the provided soil type is the same for the whole river flow.

With these limitations in mind, the Math-1D model forecasts depend mainly on the measured past precipitations and flow rates and on the forecasted precipitations. Thus, the provided forecast, though often reliable, may at times be highly imprecise. The imprecision may be viewed from two perspectives: the vertical one and the horizontal one. Vertical imprecision means either the overestimation or, even worse, the underestimation of the flow rate at the culminating peak. For our investigation, the second, that is, the horizontal imprecision, is crucial. That is, we focus on the precision in terms of time: whether and under which conditions the model forecasts the peak discharge earlier or later, and how big the time shift of the peak is.

The vertical as well as the horizontal imprecision may be significant. As one can see from an exemplary forecast in Figure 1, the real culminating peak can appear a few hours sooner than forecasted. Let us note that the Math-1D model does not use the knowledge of the water flow rate in the past and depends mainly on the precipitations. This explains why it may happen that the model does not fit the past data well (from the −119th hour to the 0th; see Figure 1). On the other hand, precipitations are rather precise compared to the data from the measuring stations, which may not be well calibrated or, even worse, may be partially damaged or fully destroyed (occasionally, it happens that even during a massive flood, measuring stations provide zero flow rate measurements).

Our task is to analyze and forecast the peak shift on the horizontal (time) axis. In other words, the task is to build a model that would (based on the flow rate measurements and the past performance of the Math-1D model) provide disaster management with valuable information about the possible horizontal imprecision of the Math-1D model and, moreover, with an estimation of the peak shift. This peak shift estimation could be used to correct the forecasts.

2. Theoretical Background

In this section, we introduce the fundamental theoretical background used in our investigation. As there is no space to introduce all the theoretical concepts in detail, we provide readers only with a brief introduction and refer to further sources [4–7].

2.1. Evaluative Linguistic Expressions

One of the main constituents of systems of fuzzy/linguistic IF-THEN rules are evaluative linguistic expressions [4], in short, evaluative expressions, for example, very large, more or less hot, and so forth. They are special expressions of natural language that are used whenever it is important to evaluate a decision situation, to specify the course of development of some process, and in many other situations. Note that their importance and the potential to model their meaning mathematically were pointed out by Zadeh (e.g., in [8, 9]).

A simple form of evaluative expressions has the following structure:
$$\langle\text{linguistic hedge}\rangle\langle\text{atomic evaluative expression}\rangle. \quad (1)$$
Atomic evaluative expressions comprise any of the canonical adjectives small, medium, and big, abbreviated in the following as Sm, Me, and Bi, respectively.

Linguistic hedges are specific adverbs that make the meaning of the atomic expression more or less precise. We may distinguish hedges with a narrowing effect (e.g., very, extremely) and with a widening effect (e.g., roughly, more or less). In the following text, we, without loss of generality, use the hedges introduced in Table 1, which were successfully used in real applications [10] and which are implemented in the LFLC software package [11]. As a special case, the hedge can be empty. Note that our hedges are of the so-called inclusive type [12], which means that extensions of more specific evaluative expressions are included in those of less specific ones; see Figure 2.

Evaluative expressions of the form (1) will generally be denoted by script letters $\mathcal{A}$, $\mathcal{B}$, and so forth. They are used to evaluate values of some variable $X$. The resulting expressions are called evaluative linguistic predications and have the form
$$X \text{ is } \mathcal{A}. \quad (2)$$

Examples of evaluative predications are "temperature is very high," "price is low," and so forth. The model of the meaning of evaluative expressions and predications makes a distinction between intensions and extensions in various contexts. The context characterizes a range of possible values. This range can be characterized by a triple of numbers $w = \langle v_L, v_S, v_R \rangle$, where $v_L < v_S$ and $v_S < v_R$. These numbers characterize the minimal, middle, and maximal values, respectively, of the evaluated characteristics in the specified context of use. Therefore, we will identify the notion of context with the triple $w$. By $x \in w$ we mean $x \in [v_L, v_R]$. In the sequel, we will work with a set $W$ of contexts that are given in advance.

The intension of an evaluative predication "$X$ is $\mathcal{A}$" is a certain formula whose interpretation is a function
$$\mathrm{Int}(X \text{ is } \mathcal{A}) : W \to \mathcal{F}(\mathbb{R}), \quad (3)$$
that is, a function that assigns a fuzzy set to any context $w$ from the set $W$.

Given an intension (3) and a context $w \in W$, we can define the extension of "$X$ is $\mathcal{A}$" in the context $w$ as a fuzzy set
$$\mathrm{Ext}_w(X \text{ is } \mathcal{A}) = \mathrm{Int}(X \text{ is } \mathcal{A})(w) \mathrel{\widetilde{\subseteq}} w, \quad (4)$$
where $\widetilde{\subseteq}$ denotes the relation of fuzzy subsethood.
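To make the notions of context and extension concrete, the following sketch builds extensions of the three atomic expressions on a context $\langle v_L, v_S, v_R \rangle$. It is only an illustration under our own assumptions: the membership shapes are simple piecewise-linear functions and the hedges are mimicked by powering, whereas the theory in [4] and the LFLC package [11] use more refined horizon-based constructions.

```python
# Illustrative sketch only: piecewise-linear extensions of Sm/Me/Bi on a
# context w = (vL, vS, vR); hedges approximated by powers (our assumption,
# not the horizon-based hedges of the theory in [4]).

def sm(x, w):                       # "small": 1 at vL, decreasing to 0 at vS
    vL, vS, vR = w
    return max(0.0, min(1.0, (vS - x) / (vS - vL)))

def bi(x, w):                       # "big": 0 at vS, increasing to 1 at vR
    vL, vS, vR = w
    return max(0.0, min(1.0, (x - vS) / (vR - vS)))

def me(x, w):                       # "medium": triangular with peak at vS
    vL, vS, vR = w
    return max(0.0, min((x - vL) / (vS - vL), (vR - x) / (vR - vS)))

def very(mu):                       # narrowing hedge: extension shrinks
    return lambda x, w: mu(x, w) ** 2

def more_or_less(mu):               # widening hedge: extension grows
    return lambda x, w: mu(x, w) ** 0.5

w = (0.0, 50.0, 100.0)              # a context for, e.g., flow rate in m3/s
print(bi(80, w), very(bi)(80, w), more_or_less(bi)(80, w))  # 0.6 0.36 0.77
```

Note that the power-style hedges respect the inclusive type mentioned above: $\mu^2 \leq \mu \leq \sqrt{\mu}$, so the extension of "very big" is included in that of "big," which is included in that of "more or less big."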

Convention 1. For the sake of brevity and simplicity, and having in mind that an extension is a fuzzy set on a given context, we will omit the notion of extension from our considerations when appropriate and write only the abbreviated form
$$\mathcal{A} := \mathrm{Ext}_w(X \text{ is } \mathcal{A}), \quad (5)$$
if there is no danger of confusion caused by the fact that the left-hand side does not explicitly mention the chosen context $w$ and variable $X$.

2.2. Linguistic Descriptions

Evaluative predications occur in conditional clauses of natural language of the form
$$\text{IF } X \text{ is } \mathcal{A} \text{ THEN } Y \text{ is } \mathcal{B}, \quad (6)$$
where $\mathcal{A}$, $\mathcal{B}$ are evaluative expressions. The linguistic predication "$X$ is $\mathcal{A}$" is called the antecedent and "$Y$ is $\mathcal{B}$" is called the consequent of rule (6). Of course, the antecedent may consist of more evaluative predications, joined by the connective "AND." The clauses (6) will be called fuzzy/linguistic IF-THEN rules in the sequel.

Fuzzy/linguistic IF-THEN rules are gathered in a linguistic description, which is a set $LD = \{R_1, \ldots, R_m\}$, where
$$R_i := \text{IF } X \text{ is } \mathcal{A}_i \text{ THEN } Y \text{ is } \mathcal{B}_i, \quad i = 1, \ldots, m. \quad (7)$$
Because each rule in (7) is taken as a specific conditional sentence of natural language, a linguistic description can be understood as a specific kind of (structured) text. This text can be viewed as a model of a specific behavior of the system in concern.

The intension of a fuzzy/linguistic IF-THEN rule (6) is a function
$$\mathrm{Int}(\text{IF } X \text{ is } \mathcal{A} \text{ THEN } Y \text{ is } \mathcal{B}) : W \times W' \to \mathcal{F}(\mathbb{R} \times \mathbb{R}). \quad (8)$$
This function assigns to each context $w \in W$ and each context $w' \in W'$ a fuzzy relation in $w \times w'$. The latter is an extension of (8).

We also need to consider the linguistic phenomenon of topic-focus articulation (cf. [13]), which in the case of linguistic descriptions requires us to distinguish the following two sets:
$$\mathrm{Topic}(LD) = \{\mathrm{Int}(X \text{ is } \mathcal{A}_i) \mid i = 1, \ldots, m\}, \qquad \mathrm{Focus}(LD) = \{\mathrm{Int}(Y \text{ is } \mathcal{B}_i) \mid i = 1, \ldots, m\}. \quad (9)$$
The phenomenon of topic-focus articulation plays an important role in the inference method called perception-based logical deduction described below.

Convention 2. Besides the above introduced notions of topic and focus, it is sometimes advantageous to introduce the following notation:
$$\mathrm{Topic}_w(LD) = \{\mathrm{Ext}_w(X \text{ is } \mathcal{A}_i) \mid i = 1, \ldots, m\}, \quad (10)$$
which will denote the set of extensions of evaluative predications that are contained in $\mathrm{Topic}(LD)$, knowing the particular context $w$. This notation will be used later on when defining the function of local perception. In view of Convention 1, one can also easily introduce the $\mathrm{Topic}_w(LD)$ as follows:
$$\mathrm{Topic}_w(LD) = \{\mathcal{A}_i \mid i = 1, \ldots, m\}. \quad (11)$$

2.3. Ordering of Linguistic Predications

To be able to state relationships among evaluative expressions, for example, when one expression "covers" another, we need an ordering relation defined on the set of them. Let us start with the ordering on the set of linguistic hedges. We may define the ordering $\preceq$ of the hedges of Table 1, from narrowing to widening, as follows:
$$\text{Ex} \preceq \text{Si} \preceq \text{Ve} \preceq \langle\text{empty}\rangle \preceq \text{ML} \preceq \text{Ro} \preceq \text{QR} \preceq \text{VR}. \quad (12)$$

We extend the theory of evaluative linguistic expressions by the following inclusion axiom. Let $\mathrm{Ker}(A)$ denote the kernel of a fuzzy set $A$. For any hedges $\nu_1, \nu_2$ such that $\nu_1 \preceq \nu_2$,
$$\mathrm{Ext}_w(\nu_1\,\text{atom}) \subseteq \mathrm{Ext}_w(\nu_2\,\text{atom}), \qquad \mathrm{Ker}(\mathrm{Ext}_w(\nu_1\,\text{atom})) \subseteq \mathrm{Ker}(\mathrm{Ext}_w(\nu_2\,\text{atom})) \quad (13)$$
hold for any atomic expression $\text{atom} \in \{\text{Sm}, \text{Me}, \text{Bi}\}$ and any context $w \in W$.

Based on $\preceq$ we may define an ordering $\sqsubseteq$ of evaluative expressions. Let $\mathcal{A}$, $\mathcal{B}$ be two evaluative expressions such that $\mathcal{A} = \nu_1\,\text{atom}$ and $\mathcal{B} = \nu_2\,\text{atom}$ with the same atomic expression. Then we write
$$\mathcal{A} \sqsubseteq \mathcal{B} \quad \text{if } \nu_1 \preceq \nu_2 \text{ and } \mathrm{Ext}_w(\mathcal{A}) \subseteq \mathrm{Ext}_w(\mathcal{B}). \quad (14)$$

In other words, evaluative expressions of the same type are ordered according to their specificity, which is given by the hedges appearing in the expressions. If we are given two evaluative predications with atomic expressions of different types, we cannot order them by $\sqsubseteq$.

Finally, we define the ordering of evaluative predications with respect to a given observation. Let us be given a context $w$, an observation $x_0 \in w$, and two extensions $\mathrm{Ext}_w(X \text{ is } \mathcal{A})$ and $\mathrm{Ext}_w(X \text{ is } \mathcal{B})$ from the $\mathrm{Topic}_w(LD)$. We write
$$\mathcal{A} \leq_{x_0} \mathcal{B} \quad \text{either if } \mathrm{Ext}_w(\mathcal{A})(x_0) > \mathrm{Ext}_w(\mathcal{B})(x_0) \text{ or if } \mathrm{Ext}_w(\mathcal{A})(x_0) = \mathrm{Ext}_w(\mathcal{B})(x_0) \text{ and } \mathcal{A} \sqsubseteq \mathcal{B}. \quad (15)$$

It should be noted that usually the $\mathrm{Topic}(LD)$ contains intensions of evaluative predications which are composed of a conjunction of more than one evaluative predication. In other words, we usually meet the following situation:
$$X_1 \text{ is } \mathcal{A}_{i1} \text{ AND } \cdots \text{ AND } X_n \text{ is } \mathcal{A}_{in}, \quad i = 1, \ldots, m. \quad (16)$$

In this case, the ordering is preserved with respect to the components:
$$(\mathcal{A}_{i1}, \ldots, \mathcal{A}_{in}) \leq_{x_0} (\mathcal{A}_{j1}, \ldots, \mathcal{A}_{jn}) \quad \text{if } \mathcal{A}_{ik} \leq_{x_{0k}} \mathcal{A}_{jk} \text{ for all } k = 1, \ldots, n, \quad (17)$$
and the extension of the compound linguistic predication is given as follows:
$$\mathrm{Ext}_{w_1 \times \cdots \times w_n}(X_1 \text{ is } \mathcal{A}_{i1} \text{ AND } \cdots \text{ AND } X_n \text{ is } \mathcal{A}_{in})(x_1, \ldots, x_n) = \min_{k = 1, \ldots, n} \mathrm{Ext}_{w_k}(X_k \text{ is } \mathcal{A}_{ik})(x_k). \quad (18)$$
Then, the final ordering is analogous to the one-dimensional one.
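The following sketch shows one possible encoding of the specificity ordering. The hedge list and the tie-breaking rule reflect our reading of (12)–(15) and of [5]; they are illustrative assumptions, not the authoritative definitions.

```python
# Sketch of the specificity ordering (12) and of the comparison of fired
# predications at an observation x0 (our reading of the ordering <=_{x0}).

HEDGES = ["Ex", "Si", "Ve", "", "ML", "Ro", "QR", "VR"]  # narrowing -> widening

def more_specific(h1, h2):
    """True if hedge h1 is at least as specific (narrowing) as h2."""
    return HEDGES.index(h1) <= HEDGES.index(h2)

def better_at(p1, p2, x0, w):
    """Compare predications p = (hedge, atom, membership_fn) at x0:
    prefer the higher membership degree; on ties, prefer the more
    specific hedge (comparable only for the same atomic expression)."""
    h1, atom1, mu1 = p1
    h2, atom2, mu2 = p2
    d1, d2 = mu1(x0, w), mu2(x0, w)
    if d1 != d2:
        return p1 if d1 > d2 else p2
    if atom1 == atom2 and more_specific(h1, h2):
        return p1
    return p2
```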

2.4. Perception-Based Logical Deduction

Perception-based Logical Deduction (abb. PbLD) is a special inference method aimed at the derivation of results based on fuzzy/linguistic IF-THEN rules. A perception is understood as an evaluative expression assigned to the given input value in the given context. The choice of perception depends on the topic of the specified linguistic description. In other words, perception is always chosen among evaluative expressions which occur in antecedents of IF-THEN rules; see [5, 10, 14].

Based on the ordering $\leq_{x_0}$ of linguistic predications, a special function of local perception assigns to each value $x_0 \in w$ the subset of intensions from the topic whose extensions are minimal with respect to the ordering $\leq_{x_0}$:
$$\mathrm{LPerc}(x_0, w) = \{\mathrm{Int}(X \text{ is } \mathcal{A}_i) \mid \mathrm{Ext}_w(X \text{ is } \mathcal{A}_i) \text{ is minimal w.r.t. } \leq_{x_0} \text{ and } \mathrm{Ext}_w(X \text{ is } \mathcal{A}_i)(x_0) > 0\}. \quad (19)$$

Let $LD$ be a linguistic description (7). Let us consider a context $w \in W$ for the variable $X$ and a context $w' \in W'$ for $Y$. Let an observation $X := x_0$ in the context $w$ be given, where $x_0 \in w$. Then, the following rule of perception-based logical deduction ($r^{\mathrm{PbLD}}$) can be introduced:
$$r^{\mathrm{PbLD}} : \frac{\mathrm{LPerc}(x_0, w), \; LD}{C}, \quad (20)$$
where $C$ is the conclusion which corresponds to the observation $x_0$ in a way described below. Inputs to this inference rule are the linguistic description $LD$ and the local perception from (19). This local perception is formed by a set of evaluative expressions from antecedents of IF-THEN rules (i.e., from the topic) of the given linguistic description. Formula (19) chooses those antecedents which best fit the given numerical input $x_0$; in other words, they are the most specific according to the ordering $\leq_{x_0}$.

Once one or more antecedents $\mathcal{A}_i$, $i \in I \subseteq \{1, \ldots, m\}$, are chosen according to (19), we compute for any of them the conclusion $C_i$:
$$C_i(y) = \mathrm{Ext}_w(X \text{ is } \mathcal{A}_i)(x_0) \rightarrow_{\text{Ł}} \mathrm{Ext}_{w'}(Y \text{ is } \mathcal{B}_i)(y), \quad y \in w', \quad (21)$$
where $\rightarrow_{\text{Ł}}$ is the Łukasiewicz implication [2] given by $a \rightarrow_{\text{Ł}} b = \min(1, 1 - a + b)$.

Suppose that the local perception is nonempty, that is, $\mathrm{LPerc}(x_0, w) \neq \emptyset$. Then the final conclusion $C$ is given as the Gödel intersection of the set of all conclusions that correspond to its members; that is,
$$C(y) = \min_{i \in I} C_i(y), \quad y \in w'. \quad (22)$$

In many applications, the inferred output fuzzy set needs to be defuzzified to a crisp value in $w'$. For this task, a special defuzzification technique called Defuzzification of Evaluative Expressions (abb. DEE) has been proposed. In principle, this defuzzification is a combination of First-of-Maxima (FOM), Mean-of-Maxima (MOM), and Last-of-Maxima (LOM), applied based on the classification of the output fuzzy set. Particularly, if the inferred fuzzy set is of the type Small (nonincreasing), LOM is applied; if the inferred output is of the type Big (nondecreasing), FOM is applied; and finally, if the inferred output is of the type Medium, MOM is applied; see Figure 2.
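A compact sketch of the whole inference step follows. For simplicity it selects the rules whose antecedents fire with maximal degree (a simplification of the local perception (19)), combines the Łukasiewicz-implication conclusions (21) by minimum (22), and defuzzifies with DEE. The rule encoding and helper names are ours, not part of the original method.

```python
import numpy as np

# Simplified PbLD + DEE sketch. A rule is a triple
# (antecedent_mu, consequent_mu, consequent_type): antecedent_mu maps a
# number to [0, 1], consequent_mu is vectorized over the sampled output
# universe ys, and consequent_type is "Sm", "Me", or "Bi". Selecting the
# maximally fired antecedents stands in for the local perception (19).

def pbld_dee(rules, x0, ys):
    degrees = [ant(x0) for ant, _, _ in rules]
    best = max(degrees)
    if best == 0.0:
        raise ValueError("no rule fires for this observation")
    conclusion = np.ones_like(ys, dtype=float)
    types = []
    for (ant, cons, ctype), d in zip(rules, degrees):
        if d == best:
            # Lukasiewicz implication (21): d ->L c = min(1, 1 - d + c)
            conclusion = np.minimum(conclusion,
                                    np.minimum(1.0, 1.0 - d + cons(ys)))
            types.append(ctype)
    maxima = ys[conclusion == conclusion.max()]   # the set of maxima
    if types[0] == "Sm":              # nonincreasing output: Last of Maxima
        return float(maxima[-1])
    if types[0] == "Bi":              # nondecreasing output: First of Maxima
        return float(maxima[0])
    return float(maxima.mean())       # Medium output: Mean of Maxima
```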

3. Fuzzy GUHA: Linguistic Associations Mining

In this paper, we employ the so-called linguistic associations mining [15] for the fuzzy rule base identification. This approach, mostly known as mining of association rules [16], was first introduced as the GUHA method [17, 18]. It finds distinct, statistically approved associations between attributes of given objects. Particularly, the GUHA method deals with Table 2, where $o_1, \ldots, o_m$ denote objects, $A_1, \ldots, A_n$ denote independent boolean attributes, $B$ denotes the dependent (explained) boolean attribute, and finally, the symbols 1 and 0 denote whether an object carries an attribute or not.

The original GUHA allowed only boolean attributes to be involved; see [19]. Since most features of objects are measured on a real interval, the standard approach assumed a categorization of quantitative variables and, subsequently, the definition of a boolean variable for every category.

The goal of the GUHA method is to search for linguistic associations of the form
$$\mathcal{A} \Rightarrow \mathcal{B}, \quad (23)$$
where $\mathcal{A}$, $\mathcal{B}$ are (compound) evaluative predications [20] containing only the connective AND and built over the variables occurring in the data. The $\mathcal{A}$, $\mathcal{B}$ are called the antecedent and the consequent, respectively. Generally, for the GUHA method, the well-known fourfold table is constructed; see Table 3.

Symbol $a$, in Table 3, denotes the number of positive occurrences of $\mathcal{A}$ as well as $\mathcal{B}$; $b$ is the number of positive occurrences of $\mathcal{A}$ and of negated $\mathcal{B}$, that is, of "not $\mathcal{B}$." The numbers $c$ and $d$ have analogous meanings. For our purposes, only the numbers $a$ and $b$ are important.

The relationship between the antecedent and the consequent is described by a so-called quantifier. There are many quantifiers that characterize the validity of association (23) in the data [18]. For our task, we use the so-called binary multitudinal quantifier. This quantifier is taken as true if
$$\frac{a}{a + b} \geq \gamma \quad \text{and} \quad \frac{a}{m} \geq \sigma, \quad (24)$$
where $\gamma$ is a confidence degree, $\sigma$ is a support degree, and $m$ is the number of objects.
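In code, checking the quantifier amounts to two threshold tests on the fourfold table. The normalization of the support by the number of objects $m$ is our assumption; the precise definition of the binary multitudinal quantifier is given in [18], and the thresholds below are placeholders.

```python
def association_holds(a, b, m, confidence=0.9, support=0.05):
    """Check the quantifier condition (24) for one association.
    a, b: fourfold-table frequencies; m: number of objects.
    Threshold values are placeholders, not those used in the paper."""
    return a / (a + b) >= confidence and a / m >= support
```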

Example 1. Let us consider Table 4.
Depending on the chosen confidence and support degrees, the GUHA method could generate, for example, a linguistic association between the boolean attributes of Table 4.

According to [21], there are two approaches to treating quantitative variables in association rules mining. The first one is to categorize the variables using predefined concept hierarchies. The second one is to search for clusters in a variable and discretize it according to the found clusters (the distribution of the data). Nevertheless, both approaches divide numerical variables into crisp intervals.

In many situations, including ours, it is better to define fuzzy sets on the numerical variables and use the fuzzy variant of the GUHA method [15, 22]. In this case, we again have two possibilities for treating quantitative variables: either we apply fuzzy clustering or we use some predefined concepts. Because of the well-developed theory of Evaluative Linguistic Expressions (Section 2.1), we chose the latter approach.

In the fuzzy variant of the method, the attributes are not boolean but rather vague. The minimum (resp., maximum) of a particular attribute becomes $v_L$ (resp., $v_R$) and thus we obtain the context $\langle v_L, v_S, v_R \rangle$ for the given attribute ($v_S$ might be the median, the mean, or another value between $v_L$ and $v_R$). With the canonical adjectives Sm, Me, and Bi and seven different linguistic hedges we may define more than 20 fuzzy sets for every quantitative variable (attribute). The table values (formerly 1 or 0) are now elements of the interval $[0, 1]$ that express membership degrees.

For example, instead of defining a boolean variable for high BMI (see Table 4), we take the quantitative variable BMI, generate all the possible evaluative linguistic predications, and define the corresponding fuzzy sets, so that the first column in Table 4 is replaced with Table 5. In this way we are able to separate a group of malnourished people (e.g., those with extremely small BMI). Analogously, to capture the cases of people who have an almost ideal BMI index, we define predications based on the adjective Me. Finally, instead of a single boolean attribute for high BMI, we define Bi BMI, Ve Bi BMI, and so forth. Thus, we add information that was lost by the transition from the quantitative variable BMI into two boolean variables. More importantly, we also capture gradual transitions between different groups of people. An object (in our example a patient) might have a certain membership degree to the fuzzy set Sm BMI and simultaneously belong to the fuzzy set Me BMI in a complementary degree. In this way we capture the information about a patient who is on the transition from being underweight to having an ideal BMI index. This kind of information cannot be captured by crisp intervals.

In this way we treat every quantitative variable, so that the final fuzzy GUHA table will look similar to Table 6.
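A sketch of this fuzzification step follows: one quantitative column is turned into a dictionary of fuzzy attributes, one per evaluative expression. The membership shapes and power-style hedges repeat the simplifications of the earlier sketch (Section 2.1); the paper itself generates 21 expressions per variable via the hedges of Table 1.

```python
import numpy as np

# Sketch: turning one quantitative variable into fuzzy attributes, one per
# evaluative expression. Shapes and power-style hedges are illustrative
# simplifications (see the sketch in Section 2.1).

def fuzzify(values):
    x = np.asarray(values, dtype=float)
    vL, vR = x.min(), x.max()
    vS = float(np.median(x))                  # vS: here the median
    sm = np.clip((vS - x) / (vS - vL), 0, 1)
    me = np.clip(np.minimum((x - vL) / (vS - vL), (vR - x) / (vR - vS)), 0, 1)
    bi = np.clip((x - vS) / (vR - vS), 0, 1)
    attrs = {}
    for atom_name, atom in (("Sm", sm), ("Me", me), ("Bi", bi)):
        for hedge, p in (("Ve ", 2.0), ("", 1.0), ("ML ", 0.5)):
            attrs[hedge + atom_name] = atom ** p
    return attrs                              # column name -> degree vector

bmi = [16.5, 21.0, 24.3, 29.8, 35.1]
table = fuzzify(bmi)   # e.g., table["Ve Sm"][0] is close to 1 (underweight)
```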

The fourfold table analogous to Table 3 is constructed also for the fuzzy variant of the method. The difference is that the numbers $a$, $b$, $c$, and $d$ are not summations of 1s and 0s but summations of membership degrees of the data in the fuzzy sets representing the antecedent and the consequent or their complements, respectively. Naturally, the requirement that the antecedent as well as the consequent hold simultaneously leads to the natural use of a t-norm [23]. In our case, we use the Gödel t-norm, that is, the minimum operation. For example, if an object belongs to a given antecedent in the degree 0.7 and to a given consequent in the degree 0.6, the value that enters the summation equals $\min(0.7, 0.6) = 0.6$. The summation of such values over all the objects equals the value $a$ in Table 3; the other values from the table are determined analogously. The rest of the ideas of the method remain the same.
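The fuzzy fourfold table is then a few lines of array arithmetic. The sketch below uses the Gödel t-norm (minimum) exactly as described above, together with the standard complement $1 - \mu$ (an assumption for $c$ and $d$, which the text does not detail).

```python
import numpy as np

# Fuzzy fourfold table: a, b, c, d as sums of membership degrees combined
# by the Goedel t-norm (minimum); 1 - mu is used as the complement.

def fuzzy_fourfold(antecedent, consequent):
    A, B = np.asarray(antecedent), np.asarray(consequent)
    a = np.minimum(A, B).sum()            # antecedent AND consequent
    b = np.minimum(A, 1.0 - B).sum()      # antecedent AND NOT consequent
    c = np.minimum(1.0 - A, B).sum()
    d = np.minimum(1.0 - A, 1.0 - B).sum()
    return a, b, c, d

# The object from the text: degrees 0.7 and 0.6 contribute min(0.7, 0.6) = 0.6 to a.
print(fuzzy_fourfold([0.7], [0.6]))       # (0.6, 0.4, 0.3, 0.3)
```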

By using fuzzy sets, we generally get more precise results and, more importantly, we avoid undesirable threshold effects [24]. A further advantage is that the method searches for implicative associations that may be directly interpreted as fuzzy rules for the PbLD inference system.

Example 2. A confirmed association such as
$$\text{Ex Bi BMI AND Ve Bi Cholesterol} \Rightarrow \text{ML Bi BloodPressure}$$
may be directly interpreted as the following fuzzy rule:

“IF Body-Mass-Index is Extremely Big AND Cholesterol level is Very Big THEN Blood Pressure is More or Less Big.”

This approach has been found very efficient and reasonable, for example, for the identification of the so-called Fuzzy Rule Base Ensemble [25], which is a special ensemble technique for time series forecasting [26] that uses fuzzy rules to determine the weights of individual forecasting methods. Naturally, the overlapping of extensions of linguistic expressions causes a massive generation of redundant associations. However, there exist efficient methods that detect and remove these redundancies automatically; see [6, 7].

In Section 4.3, we apply this method to artificial variables computed from the measurements of water flow in order to obtain interesting descriptions of the time shift of the water flow rate peak.

4. Data Analysis

4.1. Data Description

As mentioned in the Introduction, we are provided only with the data from the measuring stations and from the Math-1D model implemented in the FLOREON system. Unlike the Math-1D model, we have at our disposal neither the measured precipitations, nor their forecasts, nor other physical attributes or their estimations. The reason is that this is the domain of the physical model Math-1D, and our task is not to build another competing physical model but to concentrate on the analysis of the existing one. However, in order to use the (fuzzy) GUHA method, we need to generate several features (artificial variables) and investigate which of those variables have some influence on the performance of the model.

For the purpose of this investigation, we were provided with a data set collected from different events (floods) at the measuring station Svinov placed on the Odra River (Svinov is a part of the Ostrava city through which the Odra River flows; naturally, the measuring station carries the same name). The whole data set is divided into 57 simulations. Each simulation captures the state of the system (provided real values and model values) at some time point that is, for each simulation, denoted by zero ($t = 0$). Each simulation can be further divided into past and future data measured or simulated on an hourly basis.

So, we can introduce the following two sets of time points:
$$T^- = \{-119, -118, \ldots, 0\}, \qquad T^+ = \{1, 2, \ldots, 48\},$$
and the two time-dependent variables, namely, the real water flow rate at time $t$ and the originally modelled flow rate at time $t$, denoted by $r_t$ and $m_t$, respectively. Thus, we can also introduce the following sets:
$$R = \{r_t \mid t \in T^- \cup T^+\} \quad (30)$$
and analogously
$$M = \{m_t \mid t \in T^- \cup T^+\}. \quad (31)$$

Indeed, the values $r_t$ for $t \in T^+$ had been unknown at the time point $t = 0$, and they were added to the data later on, only for comparison and efficiency-evaluation purposes. The values $m_t$ for $t \in T^+$ are forecasts made by the original Math-1D model that were at disposal at the time point $t = 0$.

The aim is to analyze associations between the input variables that were at disposal at the time $t = 0$ ($r_t$ for $t \in T^-$, $m_t$ for $t \in T^-$, and $m_t$ for $t \in T^+$) and the dependent variable, which was (for this stage of the investigation) chosen to be the peak-time $t_{\text{peak}}$, that is, the time of the maximum water flow rate:
$$t_{\text{peak}} = \arg\max_{t \in T^+} r_t. \quad (32)$$
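Computationally, the peak time is a simple argmax over the 48 future hours; the sketch below uses illustrative variable names of our own.

```python
import numpy as np

# Peak times (32) and (33) as argmax over the future hours T+ = {1, ..., 48}.
# r_future and m_future are illustrative names for the vectors of r_t and
# m_t, t in T+.

def peak_time(flow_future):
    return int(np.argmax(flow_future)) + 1   # +1 because T+ starts at hour 1

# t_peak     = peak_time(r_future)   # real peak time (32)
# t_hat_peak = peak_time(m_future)   # forecasted peak time (33), see below
```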

For the sake of result quality evaluation, the data was split into a training set and a testing set in the ratio of 2 : 1, that is, 38 simulations for the training and 19 simulations for the testing.

4.2. Features Generation and Reduction

For each simulation, a set of features was extracted by applying several statistical characteristics to different vectors of data derived from the measured and modelled flow rates. Namely, the following statistics were utilized: mean $\bar{x}$, standard deviation $s$, median $\tilde{x}$, minimum, maximum, range (maximum minus minimum), interquartile range, difference of the last value and the mean, coefficient of variation $s/\bar{x}$, difference of the mean and the median, absolute difference of the mean and the median, skewness and its absolute value, kurtosis and its absolute value, slope $b$ computed from the linear regression $x_t = a + b t + \varepsilon_t$ (where $a$ is the intercept and $\varepsilon_t$ is the residual error), and trend strength computed as a $p$-value of the hypothesis $b = 0$.

All the statistics listed above were computed for the past measured flow rates $(r_t \mid t \in T^-)$. Additionally, the same statistics were determined for further newly created data vectors derived from them, where again $t \in T^-$.

Analogously, the same statistics were utilized also for the modelled flow rates, with the only difference stemming from the different time values; that is, they were applied to $(m_t \mid t \in T^- \cup T^+)$.
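The statistics above translate directly into code. The following sketch computes them for one data vector (scipy is assumed for skewness, kurtosis, and the regression); since the construction of the individual vectors is only partially specified in the text, the function takes an arbitrary vector.

```python
import numpy as np
from scipy import stats

# Feature extraction sketch: the statistics of Section 4.2 for one vector x.

def features(x):
    x = np.asarray(x, dtype=float)
    fit = stats.linregress(np.arange(len(x)), x)   # x_t = a + b*t + eps_t
    q75, q25 = np.percentile(x, [75, 25])
    mean, med = x.mean(), float(np.median(x))
    return {
        "mean": mean, "sd": x.std(ddof=1), "median": med,
        "min": x.min(), "max": x.max(), "range": x.max() - x.min(),
        "iqr": q75 - q25, "last_minus_mean": x[-1] - mean,
        "coef_var": x.std(ddof=1) / mean,
        "mean_minus_median": mean - med,
        "abs_mean_minus_median": abs(mean - med),
        "skew": stats.skew(x), "abs_skew": abs(stats.skew(x)),
        "kurtosis": stats.kurtosis(x), "abs_kurtosis": abs(stats.kurtosis(x)),
        "slope": fit.slope,
        "trend_strength": fit.pvalue,   # p-value of the b = 0 hypothesis
    }
```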

Finally, the time point of the forecasted peak,
$$\hat{t}_{\text{peak}} = \arg\max_{t \in T^+} m_t, \quad (33)$$

was also added as an additional feature. This means that a total of 205 new features were generated.

From the pool of features, a regression method [27] was utilized to select those which had the highest significance for a regression model. Particularly, the dependent variable
$$\Delta = \hat{t}_{\text{peak}} - t_{\text{peak}},$$

denoting the peak shift, was modelled with a linear regression on all the generated features. After that, the statistical significance of all the regression coefficients was tested and only features with a $p$-value below 0.05 were selected.
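A schematic version of this selection step is shown below (statsmodels assumed). Note that with 205 features and only 38 training simulations a single ordinary least-squares fit is ill-posed, so the cited method [27] necessarily proceeds more carefully (e.g., stepwise); the sketch only illustrates the $p$-value criterion.

```python
import numpy as np
import statsmodels.api as sm

# Schematic feature selection: regress the peak shift on the features and
# keep coefficients with p-value < 0.05. X: (n_simulations, n_features),
# y: peak shifts; both are assumed given.

def select_features(X, y, alpha=0.05):
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    pvalues = np.asarray(fit.pvalues)[1:]     # drop the intercept's p-value
    return np.where(pvalues < alpha)[0]       # indices of selected features
```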

In this way, we ended the feature selection with the following three features: the standard deviation of the measured flow rates $r_t$, $t \in T^-$; the coefficient of variation of the measured flow rates; and, finally, the time point of the forecasted peak given by (33).

4.3. Fuzzy GUHA Application

All computed features that were found statistically significant, as described in the previous subsection, are viewed as quantitative variables. In order to use them in mining linguistic associations, we had to convert them into fuzzy attributes. More specifically, we generated all the possible linguistic expressions (see Section 2), determined appropriate contexts for each of the variables, and finally, for each simulation, determined the degrees of membership of the given simulation in the extensions of the linguistic expressions for each variable. This process turned the three antecedent variables into 63 fuzzy attributes, each related to a certain evaluative linguistic expression (21 expressions for each variable; see Section 2.1).

The above introduced peak shift $\Delta$ is the dependent variable whose dependence on the generated attributes appearing in antecedents is being "explained" with the help of the fuzzy GUHA method and the generated linguistic associations; see Section 3.

Part of the resulting fuzzy GUHA table, which contained 84 columns (63 for antecedent attributes and 21 for consequent attributes), is shown in Table 7.

Upon the choice of the multitudinal implicational quantifier with appropriate degrees of confidence and support, the fuzzy GUHA method generated a considerable number of linguistic associations. After the application of the redundancy detection and removal algorithm [7], we obtained 69 fuzzy rules (Table 8) that have a twofold importance:
(i) they describe the situations under which disaster management may expect some time shift of the water flow rate peak, which is essential for precise warnings and the evacuation of people or other preparatory works that may reduce the material costs of the approaching disaster;
(ii) connected to the PbLD inference mechanism, they may be directly used to forecast the time shift of the peak originally forecasted by the Math-1D model and, thus, to directly correct and refine the forecast of the physical model.

5. Prediction, Results, and Evaluation

5.1. Results and Evaluation

The prediction model was evaluated on a testing dataset, that is, on data previously hidden during the whole data-mining procedure. The testing dataset consists of 19 simulations, each simulation containing hourly flow rates for five days in the past and two days of predictions for the future.

On the testing simulations, the prediction accuracy of the culminating-peak time was compared between the original Math-1D model and the Math-1D model newly adjusted with the GUHA association rules.

For each testing simulation $i$ and for each model (either the original Math-1D or the adjusted one with the help of fuzzy rules), a prediction error was evaluated as follows:
$$E_i = \hat{t}_{\text{peak},i} - t_{\text{peak},i},$$
where $\hat{t}_{\text{peak},i}$ is the peak time forecasted for simulation $i$ by the given model and $t_{\text{peak},i}$ is the time of the real occurrence of the peak in simulation $i$; see also formulas (30)–(33). A summary of the comparison can be found in Table 9.

Briefly, it can be stated that, on the testing dataset, the original model expects the flood peaks approximately half a day later than they occur in reality. After the adjustments made by our GUHA model, the estimates become more accurate. More precisely, the error of the original (Math-1D) model is on average 0.603 days (with standard deviation 0.521). The error of the adjusted model is −0.205 days (with standard deviation 0.65).

The bias of the original model towards positive values was also confirmed by the one-sample Wilcoxon signed rank test [28]: the null hypothesis of zero shift was rejected with $p$-value = 0.000487 for the original model. On the other hand, the same hypothesis cannot be rejected for the adjusted model ($p$-value = 0.1776). Similar results were also obtained with the one-sample $t$-test (see Table 10).
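The whole evaluation can be reproduced schematically as follows (scipy assumed). The error array is a placeholder for the 19 per-simulation errors of one model, which are not listed in the paper.

```python
import numpy as np
from scipy import stats

# Evaluation sketch: mean/sd of the per-simulation errors E_i and the
# one-sample tests of zero shift used above. `errors` stands for the 19
# testing-set errors of one model (original or adjusted).

def evaluate(errors):
    e = np.asarray(errors, dtype=float)
    _, p_wilcoxon = stats.wilcoxon(e)          # one-sample signed rank test
    _, p_ttest = stats.ttest_1samp(e, 0.0)     # one-sample t-test
    return {"mean": e.mean(), "sd": e.std(ddof=1),
            "p_wilcoxon": float(p_wilcoxon), "p_ttest": float(p_ttest)}
```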

6. Conclusion

In this paper, we dealt with the adjustment of a physical model of the water flow rate during floods with the help of linguistic associations mining. As any physical model based on differential equations (the Math-1D model, in our case) is highly dependent on many unreliable parameters, it seems reasonable to perform a real-data analysis that informs us when and under which conditions the model is (in terms of the culminating water flow rate peak) time-lagged or, vice versa, too far ahead.

We approached the task with the help of the fuzzy GUHA method, which automatically generates linguistic associations. The provided data was first extended by the creation of artificial variables describing various features of the data. The resulting variables were then translated into a fuzzy GUHA table using the so-called Evaluative Linguistic Expressions. This table was used to mine associations that may be directly interpreted as fuzzy IF-THEN rules. Such an interpretation is beneficial not only because of its readability; the rules can also be used jointly with the Perception-based Logical Deduction inference method in order to predict the expected time shift of the flood peaks originally forecasted by the physical model. Results obtained from this adjusted model were statistically evaluated in order to confirm the improvement in forecasting accuracy.

Let us note that the data-mining analysis as well as the experimental evaluation was performed only on the single measuring station Svinov placed on the Odra River. Indeed, as the physical model depends on many imprecise and estimated parameters that may differ along the river flow, each station would require its own analysis. However, as the number of stations in the whole region is rather low (9 stations placed on four main rivers), such an approach is clearly feasible. Thus, the promising results open the way for further and deeper analysis that could provide disaster management with more accurate physical models whose forecasts are adjusted by fuzzy IF-THEN rules. On the other hand, there is a serious complication in the lack of past data that could be analyzed. The high number of previous floods is unfortunately not accompanied by a sufficiently high amount of precise data. As we have mentioned, there was, for example, the problem of measured zero water flow rates even during massive floods, due to uncalibrated measuring stations or other unspecified reasons. This lack of reliable data may significantly complicate the situation.

As the first step for future research, we plan to extend our investigation by using the measured past precipitations and possibly also the forecasted future precipitations, which are already at the disposal of the Math-1D model but were not at the disposal of the data analysis presented in this paper.

Acknowledgment

This work was supported by the European Regional Development Fund in the IT4Innovations Centre of Excellence Project (CZ.1.05/1.1.00/02.0070).