Estimation of Missing Rainfall Data Using GEP: Case Study of Raja River, Alor Setar, Kedah
Water resources and urban flood management require hydrologic and hydraulic modelling. However, incomplete precipitation data are often a problem in hydrological modelling exercises. In this study, gene expression programming (GEP) was utilised to correlate monthly precipitation data from a principal station with those of its neighbouring stations located in Alor Setar, Kedah, Malaysia. GEP is an extension of genetic programming (GP) and can provide simple and efficient solutions. The study illustrates the application of GEP to determine the most suitable rainfall station to replace the principal rainfall station (station 6103047). This ensures that a reliable substitute rainfall station is available if the principal station malfunctions. This was done by comparing the principal station data with those of each individual neighbouring station. The analysis reveals that station 38 is the most compatible with the principal station, with a coefficient of determination (R²) of 0.886.
1. Introduction
Precipitation data are important for identifying precipitation characteristics, occurrence, and temporal and spatial variability; for statistical modelling and forecasting of precipitation; and for resolving problems such as floods, droughts, and landslides, as stated by Silva et al. However, in some cases, a large number of stations can be down at the same time, producing many inaccurate readings or missing data [2, 3]. In Malaysia, rain gauge stations with complete records over a long duration are very scarce. Rainfall records often contain missing values due to equipment malfunction and severe environmental conditions. Thus, rainfall must be estimated when data are missing at the principal rainfall station. This study investigates the possibility of correlating the monthly rainfall of the principal rainfall station with that of its six neighbouring stations. This was done to ensure that a reliable substitute rainfall station can be identified before proceeding with water resources management and flood management modelling.
1.1. Description of the Study Area
The study was carried out in Alor Setar city, the capital of Kedah state in Malaysia. The city is located within the Raja River catchment and is prone to flooding due to its flat, low-lying terrain. In 1992, the Department of Irrigation and Drainage (DID) carried out the Flood Mitigation Project to solve the flooding problems of Alor Setar city, under which the whole Raja River system was converted to a concrete-lined channel. It was separated from the Kedah River by a gated structure and a pumping station.
A study is being conducted to investigate how the Raja River system responds to land use change by carrying out hydrologic and hydraulic modelling. One of the main inputs of the modelling is precipitation data, but missing precipitation data have always been an issue for hydrologic modelling, as stated earlier. There are seven rainfall stations in the study area; station 6103047, the principal station, is surrounded by six Muda Agricultural Development Authority (MADA) rainfall stations, as shown in Figure 1 and Table 1. The minimum densities of precipitation stations recommended by the WMO are 1 station per 250 km² for mountainous areas, 1 per 900 km² for coastal areas, and 1 per 10 km² for urban areas. The study area has seven stations within 200 km² (approximately 1 station per 30 km²).
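The density figure quoted above follows from simple arithmetic, which the short sketch below reproduces. The area, station count, and WMO thresholds are taken from the text; the function names are illustrative only, and the text does not state which WMO category applies to the catchment, so the comparison is shown for all three.

```python
# Station density arithmetic using the figures quoted in the text.
STUDY_AREA_KM2 = 200
N_STATIONS = 7

# WMO minimum densities, expressed as the maximum area served by one station.
WMO_MAX_AREA_PER_STATION_KM2 = {"mountainous": 250, "coastal": 900, "urban": 10}


def area_per_station(area_km2, n_stations):
    """Return the average area (km2) served by one rainfall station."""
    return area_km2 / n_stations


def meets_wmo_minimum(area_km2, n_stations, region):
    """True if each station serves no more area than the WMO limit for `region`."""
    return area_per_station(area_km2, n_stations) <= WMO_MAX_AREA_PER_STATION_KM2[region]


if __name__ == "__main__":
    density = area_per_station(STUDY_AREA_KM2, N_STATIONS)
    print(f"About {density:.1f} km2 per station")
    for region in WMO_MAX_AREA_PER_STATION_KM2:
        ok = meets_wmo_minimum(STUDY_AREA_KM2, N_STATIONS, region)
        print(f"{region}: meets WMO minimum density = {ok}")
```

The exact quotient is about 28.6 km² per station, which the text rounds to 30 km².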
2. Data and Methodology
In order for hydrologic modelling to be conducted smoothly, the data consistency of the principal station was compared with that of its neighbouring rainfall stations by applying the gene expression programming (GEP) technique. Monthly rainfall series were obtained from DID and MADA for the 9-year period from 2001 to 2009. For this study, MADA stations were selected based on their proximity to station 6103047, as shown in Table 1.
Genetic programming (GP), a branch of genetic algorithms (GA), is a method for determining the most “fit” computer program by artificial evolution. GP initializes a population which consists of chromosomes, and the fitness of each chromosome is evaluated against a target value. The individuals in the new generation are, in their turn, subjected to a few developmental processes, such as expression of the genomes, confrontation of the selection environment, and reproduction with modification. Reproduction includes not only replication but also the action of genetic operators capable of creating genetic diversity. During replication, the genome is copied and transmitted to the next generation. So, in GEP, a chromosome might be modified by one or several operators at a time or might not be modified at all [6–10].
Hashmi et al. show a simple example of a GEP model having two genes (terms) linked by an addition function, which is presented here to clarify the working of the GEP system. This GEP chromosome is given by Equation (1), where “a,” “b,” “c,” and “d” are predictor variables and +, *, and / represent addition, multiplication, and division, respectively. Equation (1) can also be expressed as an expression tree (ET), which is usually produced by GEP software packages.
In this study, the training set in GEP comprises the data from 2001 to 2006, and the remaining data (2007 to 2009) are used as the testing set. The functional set and operational parameters used in the present GEP modelling are listed in Tables 2 and 3, respectively.
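To make the idea of a two-gene chromosome linked by addition more concrete, the sketch below decodes genes written in Karva notation (the level-order encoding of an expression tree used in GEP) and sums their values. The gene strings and variable values are hypothetical, chosen only for illustration; they are not the chromosome of Hashmi et al. or the model fitted in this study, and the function names are assumptions of this sketch.

```python
import operator

# Function set used in the illustration: symbol -> (callable, arity).
FUNCTIONS = {
    "+": (operator.add, 2),
    "*": (operator.mul, 2),
    "/": (operator.truediv, 2),
}


def evaluate_gene(kexpr, variables):
    """Evaluate one gene given as a Karva (level-order) string.

    Symbols are read left to right and fill the expression tree level by
    level; trailing symbols that are not needed by the tree are ignored.
    """
    symbols = list(kexpr)
    children = []      # children[i] holds the indices of node i's arguments
    next_child = 1     # index of the next unassigned symbol
    i = 0
    while i < next_child:
        if i >= len(symbols):
            raise ValueError("gene is too short to form a complete tree")
        arity = FUNCTIONS[symbols[i]][1] if symbols[i] in FUNCTIONS else 0
        children.append(list(range(next_child, next_child + arity)))
        next_child += arity
        i += 1

    # Children always have larger indices than their parent, so evaluating
    # from the last node back to the root resolves arguments before use.
    values = [None] * len(children)
    for idx in reversed(range(len(children))):
        sym = symbols[idx]
        if sym in FUNCTIONS:
            func = FUNCTIONS[sym][0]
            values[idx] = func(*(values[c] for c in children[idx]))
        else:
            values[idx] = variables[sym]
    return values[0]


def evaluate_chromosome(genes, variables):
    """Link the sub-expressions of all genes by addition."""
    return sum(evaluate_gene(g, variables) for g in genes)


if __name__ == "__main__":
    # Hypothetical two-gene chromosome: (a * b) + (c / d).
    genes = ["*ab", "/cd"]
    values = {"a": 2.0, "b": 3.0, "c": 8.0, "d": 4.0}
    print(evaluate_chromosome(genes, values))
```

Reading each gene in level order is what allows GEP to keep chromosomes as fixed-length strings while still expressing trees of different shapes and sizes.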
3. Results and Discussion
GEP was used to predict precipitation at station 6103047 from 9 years of monthly rainfall data in order to select the most suitable rainfall station. Using GEP, Equation (2) was generated, in which the predictor variable refers to station 38. The GEP equation can also be expressed as an expression tree (ET), shown in Figure 2, which illustrates the relationship between station 38 and station 6103047. Only station 38 is discussed here since it is the best-performing rainfall station.
GEP was able to determine the most suitable rainfall station to replace the principal station. The coefficient of determination (R²) and the root mean square error (RMSE) are used in the current study. R² represents the degree of association between the predicted and the measured values.
The R² of the GEP technique for station 38 (0.886) is the highest value in Table 4. An R² close to 1 indicates that almost all the variability has been accounted for by the variables specified in the model.
The correlation coefficient (R) of the GEP technique for station 38 is 0.941. A value of 1 represents a perfect relation, and 0 indicates no relationship between the variables. The degree to which two or more predictors are related to the dependent variable is expressed by R. The determination coefficient serves as a measure of the goodness of fit of the model and represents the proportion of the variation of the dependent variable (station 6103047 rainfall depth) that is explained by the model.
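The goodness-of-fit statistics discussed above can be computed with a few lines of code. The sketch below uses the standard textbook definitions of the Pearson correlation coefficient and RMSE and takes R² as the square of R; this is an assumption, since the exact formulas used by the authors are not shown in the text. The rainfall values are hypothetical and serve only to make the example runnable.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(ss_o * ss_p)


def r_squared(observed, predicted):
    """Coefficient of determination, taken here as the square of Pearson's R."""
    return pearson_r(observed, predicted) ** 2


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


if __name__ == "__main__":
    # Hypothetical monthly rainfall depths (mm), not data from the study.
    observed = [120.0, 85.5, 210.0, 40.0, 160.5]
    predicted = [110.0, 95.0, 190.0, 55.0, 170.0]
    print(f"R    = {pearson_r(observed, predicted):.3f}")
    print(f"R^2  = {r_squared(observed, predicted):.3f}")
    print(f"RMSE = {rmse(observed, predicted):.2f} mm")
```

Under this definition, the reported R of 0.941 and R² of 0.886 for station 38 agree to within rounding, since 0.941 squared is approximately 0.885.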
Figure 3 plots the predicted rainfall against the observed rainfall at station 6103047 and shows an acceptable fit. A value of 0.879 is reported for the GEP technique. The GEP prediction based on station 38 is therefore reasonably close to the observed data at station 6103047, as the R value is 0.941. The larger the R value, the stronger the association between the two variables and the more accurate the prediction of the values at station 6103047.
4. Conclusions
This study used the GEP technique to determine the rainfall station that best fits the principal rainfall station. The GEP model gives satisfactory results. Because the GEP technique provides an efficient result, it will be used to estimate missing rainfall and to correlate monthly precipitation data from the principal station with station 38. The analysis shows that station 38 is the best-fitting station for the principal station, having the highest R² (0.886), which is close to 1 and suggests little discrepancy between observed and predicted precipitation. GEP can therefore be used as an effective tool for estimating precipitation.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
The authors are grateful to Universiti Sains Malaysia for providing the research grant (1001/REDAC/814085) to study integrated river basin management for the Raja River, Kedah, Malaysia.
References
R. P. de Silva, N. D. K. Dayawansa, and M. D. Ratnasiri, “A comparison of methods used in estimating missing rainfall data,” Journal of Agricultural Science, vol. 3, pp. 101–108, 2007.
J. Kajornrit, K. W. Wong, and C. C. Fung, “Estimation of missing rainfall data in Northeast region of Thailand using Kringing methods: a comparison study,” in Proceedings of the International Workshop on Bio-Inspired Computing for Intelligent, Environments and Logistic Systems, pp. 1–8, 2011.
M. S. Ramli, Z. Abu Hasan, and K. Hock Lye, “Application of one-dimensional water quality modelling for in stream dissolved oxygen,” in Sustainable Solutions for Global Crisis of Flooding, Pollution and Water Scarcity, pp. 1–149, 2011.
WMO (World Meteorological Organisation), “Hydrology: from measurement to hydrological information,” in Guide to Hydrological Practices, vol. 168, WMO, Geneva, Switzerland, 6th edition, 2008.
C. Ferreira, Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence, Springer, 2nd edition, 2006.
J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, University of Michigan Press, 1975.