Advances in Meteorology

Volume 2018, Article ID 5024930, 11 pages

https://doi.org/10.1155/2018/5024930

## Development of Heavy Rain Damage Prediction Model Using Machine Learning Based on Big Data

^{1}Department of Civil Engineering, Inha University, Incheon 22212, Republic of Korea

^{2}Institute of Water Resources System, Inha University, Incheon 22212, Republic of Korea

Correspondence should be addressed to Hung Soo Kim; sookim@inha.ac.kr

Received 13 February 2018; Accepted 15 May 2018; Published 13 June 2018

Academic Editor: Alastair Williams

Copyright © 2018 Changhyun Choi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Prediction models of heavy rain damage using machine learning based on big data were developed for the Seoul Capital Area in the Republic of Korea. We used data on the occurrence of heavy rain damage from 1994 to 2015 as the dependent variable and weather big data as explanatory variables. The models were developed by applying machine learning techniques such as decision trees, bagging, random forests, and boosting. When the prediction performance of each model was evaluated, the boosting model using meteorological data from the past 1 to 4 days had the highest AUC value, 95.87%, and was selected as the final model. By using the prediction model developed in this study to predict the occurrence of heavy rain damage for each administrative region, the damage can be greatly reduced through proactive disaster management.

#### 1. Introduction

Natural disasters such as floods, tsunamis, and earthquakes are occurring more often because of climate change, and the resulting damage keeps growing with rapid urbanization around the world. In South Korea, about 65% of all damage is due to heavy rain, and thus there is a pressing need for countermeasures [1]. If the scale and impact of such damage can be estimated quickly in advance, disaster management becomes more feasible at the preventive and preparatory stages, which would help to avoid large-scale damage due to heavy rain such as that which occurred in Hongcheon and Cheongju in the summer of 2017. In particular, rapid predisaster forecasting of expected damage by administrative division for the regions that will be affected can be of great help to policymakers in setting up and implementing disaster prevention measures. Moreover, it will be possible to establish a voluntary disaster management system in which citizens themselves can prepare for disasters and expected damage by receiving forecasts about them.

Previous studies on predicting and preparing for natural disaster damage in advance mostly performed linear regression analysis using weather factors that cause damage through floods, rainstorms, and hurricanes, such as precipitation, rainfall intensity, maximum wind speed, and hurricane central pressure [2–11]. These studies analyzed the relationship between weather factors and damage extent through regression analysis, and they used the constructed regression models to attempt to predict the extent of damage from weather factors alone. However, most of these models could not predict the actual extent of damage adequately. To overcome this shortcoming, other studies have taken into account socioeconomic factors such as per capita income, population density, and imperviousness of an area in addition to the weather factors that directly give rise to natural disasters [12–16]. Although including socioeconomic factors led to some improvement in the prediction performance of these linear regression models, the nonlinear character of disasters and their damage scale presents problems that linear regression cannot solve. More recently, rapid advances in computing technology and data processing speed have led to the emergence of studies that apply big data and machine learning to disaster management [17, 18]. The predominant approach in all these studies is to use just a handful of explanatory variables in a regression model to estimate the damage scale of disasters. In disaster management research in Korea in particular, there is a dearth of studies that use machine learning, which is known to be able to maximize the prediction performance of models, and big data, which produce valuable information from various data that could not previously be taken into account.

Accordingly, the present study relies on the meteorological big data provided by the Korea Meteorological Administration to arrive at a list of explanatory variables that account for the occurrence of heavy rain damage, and it uses machine learning, which is known to have higher prediction performance than regression models, to develop functions that can predict heavy rain damage in advance. For this purpose, we constructed a response variable and explanatory variables for the study area and used machine learning models such as decision trees, bagging, random forests, and boosting to develop prediction models for heavy rain damage based on big data. We used two algorithms in developing the prediction functions, namely, Algorithm 1, which uses same-day weather observation data to make predictions, and Algorithm 2, which uses past weather observation data to do so. Models were constructed on this basis, and we thereby developed a prediction model for heavy rain damage that can be used immediately in actual practice.

#### 2. Theoretical Background

##### 2.1. Machine Learning

Machine learning is a field concerned with deriving new knowledge by feeding the requisite data to a computer and making it learn from them, much like a human being studying a new subject area. For example, suppose that there is a set of pairs (*x*, *y*) with the data (1, 7), (2, 14), (3, 21), and (5, 35) already given as members of the set. Even if a computer does not know the function for *y*, machine learning can be used to make it provide, say, the *y* values for (7, ?) or (10, ?) after the data are entered and the computer learns from them. That is, the computer will give the answers even without being directly programmed with the function *y* = 7*x*. In machine learning, there are two main types of learning method. One is supervised learning, which is used to infer the function for *y*, and the other is unsupervised learning, which is used to determine how the data for *x* values are distributed. The present study uses decision tree learning, which is a representative technique in machine learning, along with ensemble methods based on decision tree models such as bagging, random forests, and boosting, in order to develop a prediction model for heavy rain damage. All the methods used here are supervised learning techniques, which use their own algorithms to generate rules that best explain the response variable.
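The (*x*, *y*) example above can be sketched in a few lines of Python. This is only an illustration, assuming an ordinary least squares line as the learner; the study itself uses tree-based methods, and the names `fit_line` and `predict` are hypothetical helpers introduced here.

```python
# Training pairs (x, y) taken from the text; the rule y = 7x is never given to the program.
train = [(1, 7), (2, 14), (3, 21), (5, 35)]


def fit_line(pairs):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned line to a new x value."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(train)

# Ask for the unknown y values at x = 7 and x = 10.
for x_new in (7, 10):
    print(x_new, predict(model, x_new))
```

The fitted slope comes out close to 7 and the intercept close to 0, so the predictions for *x* = 7 and *x* = 10 are close to 49 and 70, even though the function *y* = 7*x* was inferred from the data rather than written into the program.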

##### 2.2. Decision Tree Models

Decision tree models can be used in both classification and regression, and they express results in the form of tree-shaped graphs. A decision tree finds rules that best explain values of a response variable by recursively partitioning the space of each explanatory variable. If the entire domain of explanatory variables is partitioned into $M$ domains on the basis of the criterion minimizing the classification error rate, then the Gini index $G_m$ and cross-entropy $D_m$ are mainly used as related criteria to determine this, as shown in (1):

$$G_m = \sum_{k} \hat{p}_{mk}\left(1 - \hat{p}_{mk}\right), \qquad D_m = -\sum_{k} \hat{p}_{mk} \log \hat{p}_{mk}. \tag{1}$$

In (1), $\hat{p}_{mk}$ indicates the proportion of the data in the $m$th partition that belongs to class $k$ of the response variable. The response variable in this study has two classes, 1 and 0, and thus $k$ has the values 1 and 0. A decision tree grows through top-down partitioning. After the first split of the domain of explanatory variables into partitions that minimize the indices given in (1), the resulting partitions are again split into further partitions that minimize the same indices. This goes on until the degree of minimization becomes very minute, or when a prespecified stopping condition is met. For a decision tree that has stopped growing, pruning is automatically performed to prevent overfitting.
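The two impurity criteria and a single top-down split can be sketched as follows. This is a minimal illustration for a binary response, not the implementation used in the study: the helper names (`class_proportions`, `gini`, `cross_entropy`, `best_split`) are hypothetical, the rainfall and damage values are made-up toy data, and weighting each partition's impurity by its share of the data is an assumed, commonly used way of scoring a split.

```python
import math


def class_proportions(labels):
    """Proportion of observations in each class (0 and 1) within one partition."""
    n = len(labels)
    return [labels.count(k) / n for k in (0, 1)]


def gini(labels):
    """Gini index of one partition: sum over classes of p * (1 - p)."""
    return sum(p * (1 - p) for p in class_proportions(labels))


def cross_entropy(labels):
    """Cross-entropy of one partition: -sum over classes of p * log(p)."""
    return -sum(p * math.log(p) for p in class_proportions(labels) if p > 0)


def best_split(xs, ys, impurity=gini):
    """Find the threshold on one explanatory variable that minimizes
    the size-weighted impurity of the two resulting partitions."""
    n = len(xs)
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: daily rainfall (mm) and damage occurrence (1 = damage, 0 = none).
rain = [5, 12, 30, 48, 75, 90, 120, 150]
damage = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, score = best_split(rain, damage)
print(threshold, score)
```

A full tree would apply `best_split` recursively to each resulting partition, over every explanatory variable, until a stopping condition is met. Passing `impurity=cross_entropy` switches the criterion without changing the search.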

In general, a single decision tree can have lower prediction performance than other prediction models, but it has the advantage of being relatively easy to interpret. Moreover, when decision trees are combined in the ensemble techniques described below, the ensemble not only compensates for the weaker prediction performance of a single tree but can even match or exceed the prediction performance of other, more complex models. Figure 1 shows a schematic of the decision tree concept.