Genetic Programming and Standardization in Water Temperature Modelling
Genetic Programming (an evolutionary computational tool) is applied, with and without data standardization, to model the water temperature of a river in terms of easily measured meteorological variables. The aims are to explore the explanatory power of these variables and to emphasize the utility of standardizing variables in order to reduce the influence of those with large variance. Recorded water temperature data from the Ebro River, Spain, are used as a case study, and the developed model performs better when the data are standardized. This improvement is reflected in a reduction of the mean square error. Finally, the models obtained were applied to estimate the water temperature in 2004, in order to provide evidence of their applicability for forecasting purposes.
Evolutionary computing has been widely used in hydraulics and hydrology. Examples include the studies of Savic et al., Madsen et al., and Dorado et al. on rainfall-runoff processes; the modeling of an urban aquifer discussed by Hong and Rosen; and the modifications of genetic programming algorithms aimed at dimensional consistency in problems involving natural and compound channels, as applied by Keijzer and Babovic, Harris et al., and Keijzer et al. On the other hand, water temperature is an important parameter to consider because of the changes it can experience due to human activities. Over the last three decades, diverse studies of climatic changes have addressed the increase of extreme events such as floods and droughts (e.g., Lehner et al.), increasing air and water temperatures (e.g., Seguí; Webb and Nobilis), ice melting, and the greenhouse effect (e.g., Greve), along with their consequences for the surrounding ecosystems (e.g., Schindler; Álvarez Cobelas et al.).
The motivation for working with models that represent water temperature behavior year after year is that any abnormal increase in this parameter has numerous consequences for the physical and chemical properties of water, with corresponding effects on aquatic life. Some models have been applied to maximum water temperatures by means of nonlinear relationships between air temperature and water temperature (Caissie et al.), but other important variables are involved in water temperature variation during a given period of time. In order to preserve the ecological balance, continuous monitoring of water quality in the river reach concerned is very important. Freshwater organisms are mostly ectotherms and are therefore largely influenced by water temperature. Some of the expected consequences of a water temperature increase are life-cycle changes (Hellawell; Winfield and Nelson), shifts in the distribution of species with the arrival of allochthonous species (Walther et al.), and, possibly, the expansion of epidemic diseases (Harvell et al.). Also, aquatic flora and fauna depend on dissolved oxygen to survive, and this water quality parameter is a function of water temperature as well.
2. Study Site
The field data used in this study were taken from the lower Ebro River, Spain. This river has a basin of 85 000 km2 and an average annual inflow of 17 000 hm3 under natural regime. Three dams located along the river (Figure 1) change the water temperature regime (Val): Mequinenza (1534 hm3), Ribarroja (210 hm3), and Flix (11 hm3). Five kilometers downstream of the Flix dam, water is taken from the river for cooling purposes at the Ascó nuclear power plant. The water is returned to the river at a higher temperature and flows downstream to Miravet. In this zone several meteorological gauging stations were installed, including some that measure water temperature (Figure 1). These data were used in the studies by Val and Prats et al. In addition, a significant effort has recently been made to obtain equations that predict water temperature from easily measured meteorological variables, centered at the Ribarroja station (Arganis et al. [21, 22]).
3.1. Evolutionary Algorithms
Evolutionary algorithms (EAs), also known as Evolutionary Computation (EC), are the optimization tool used in this work; they use computational models of evolutionary processes in the design and implementation of computer-based problem solving. A general definition and classification of these evolutionary techniques is given by Bäck, who defines an EA as a search and optimization algorithm, inspired by the process of natural evolution, which maintains a population of structures that evolve according to the rules of selection and other operators such as recombination and mutation. The structure common to all evolution-based algorithms is shown in Algorithm 1.
In a similar way to natural evolution and heredity, these algorithms work on a population of N individuals, representing search points in the space of potential solutions of a given problem. How well each individual is adapted, at each generation t, to the problem under investigation is given by a quality measure called the “fitness”. The population evolves, generation by generation, towards better regions of the search space by means of genetic processes such as selection, recombination, and mutation. The selection process uses the fitness measure to choose individuals of the previous generation to be reproduced, favoring those of higher quality. The recombination operation promotes the exchange of genetic information between parent individuals, thereby producing descendants. The mutation operation alters the genetic information by introducing some changes into the population. The evaluation process is repeated until a predefined termination criterion is met or, alternatively, until a maximum number of generations (iterations) is reached. This artificial evolution process is the foundation of the evolution-based algorithm used in this work, genetic programming.
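The generic loop described above can be sketched as follows. This is a minimal illustration rather than Algorithm 1 itself: the real-valued encoding, binary tournament selection, one-point crossover, Gaussian mutation, elitism, and the `sphere` test function are all assumptions chosen for brevity.

```python
import random


def evolve(cost, n_genes, pop_size=30, generations=40, p_mut=0.2, seed=0):
    """Generic evolutionary loop: initialise, evaluate, select, recombine, mutate."""
    rng = random.Random(seed)
    # Generation t = 0: random population of search points
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(n_genes)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=cost)               # evaluation: lower cost means fitter
        history.append(cost(pop[0]))
        children = [pop[0][:]]           # elitism (assumption): keep the best
        while len(children) < pop_size:
            # Selection: binary tournaments favour the fitter parents
            p1 = min(rng.sample(pop, 2), key=cost)
            p2 = min(rng.sample(pop, 2), key=cost)
            # Recombination: one-point crossover between the two parents
            cut = rng.randint(1, n_genes - 1) if n_genes > 1 else 0
            child = p1[:cut] + p2[cut:]
            # Mutation: small Gaussian perturbation of some genes
            child = [g + rng.gauss(0.0, 0.3) if rng.random() < p_mut else g
                     for g in child]
            children.append(child)
        pop = children
    best = min(pop, key=cost)
    return best, history + [cost(best)]


def sphere(x):
    """Toy cost function (hypothetical): sum of squares, minimum at the origin."""
    return sum(g * g for g in x)


best, history = evolve(sphere, n_genes=3)
print(history[0], history[-1])
```

Because the best individual is copied unchanged into each new generation, the recorded best cost can never increase from one generation to the next.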
3.2. Genetic Programming Algorithm
A typical genetic programming algorithm consists of a function set, which can include arithmetic operators, transcendental functions, and even relational or conditional operators (IF), and a terminal set with variables and constants. An initial population is randomly created with a number of parse-tree individuals composed of nodes (operators, variables, and constants) previously defined according to the problem domain (an example of a GP individual is given in Figure 2). An objective function must be defined to evaluate the fitness of each individual (in this case each individual is a model or program resulting from a random combination of nodes). Selection, crossover, and mutation operators are then applied to the best individuals, and a new population is created. The whole process is repeated until the given number of generations is reached (Cramer; Koza).
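A parse-tree individual of the kind described above can be sketched with nested tuples. This is an illustrative assumption, not the representation used in the original implementation: the variable names `Ta` and `Rs`, the protected division, and the growth probabilities are all hypothetical.

```python
import operator
import random


def protected_div(a, b):
    # Assumption: division by (near) zero returns 1.0, a common GP convention
    return a / b if abs(b) > 1e-9 else 1.0


# Function set: the four arithmetic operators
FUNCTIONS = {"+": operator.add, "-": operator.sub,
             "*": operator.mul, "/": protected_div}


def random_tree(terminals, depth, rng):
    """Grow a random parse tree: operators at internal nodes, terminals at leaves."""
    if depth == 0 or rng.random() < 0.3:
        # Leaf: either a variable name or a random real constant
        if rng.random() < 0.7:
            return rng.choice(terminals)
        return round(rng.uniform(-1.0, 1.0), 3)
    op = rng.choice(list(FUNCTIONS))
    return (op,
            random_tree(terminals, depth - 1, rng),
            random_tree(terminals, depth - 1, rng))


def evaluate(tree, env):
    """Recursively evaluate a tree for one record `env` (variable name -> value)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):
        return env[tree]
    return tree  # numeric constant


# Hand-built individual encoding 0.5 * Ta + Rs / 100 (hypothetical model)
individual = ("+", ("*", 0.5, "Ta"), ("/", "Rs", 100.0))
value = evaluate(individual, {"Ta": 20.0, "Rs": 300.0})
print(value)
```

Crossover and mutation then amount to swapping or regrowing subtrees of such structures, which is why the tree form is convenient.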
3.3. Brief Description of the Physical Phenomena and Their Related Variables
The water in a river is in constant heat exchange with its surroundings: the atmosphere and the river bed. This process may reach equilibrium, so that the heat lost by the water equals the heat absorbed. Normally, in a natural state, the water temperature increases along the river as the altitude decreases. A double temporal variation is superimposed on this spatial variation: in a river reach, temperature follows both a daily and an annual cycle.
In the study by Val, five kilometers of the Ebro River were analyzed, in a section downstream of the Flix hydroelectric power plant; in this reach, hourly temperature measurements are available for different sections. It was observed that during the summer a difference of 9 °C may exist between the Flix site and the temperature upstream of the dams. Additionally, downstream of Flix, the water temperature recovers, tending toward thermal equilibrium with its surroundings. To estimate the heat absorbed by the river water as it progresses naturally through a certain reach, and the corresponding temperature variation, an energy balance is established between the heat received and the heat emitted by the water along that reach. This can be done based on the thermal balance presented by Edinger et al.
In this balance, the total caloric power absorbed by the water as it moves along a river reach, per square meter of free surface (W/m2), is the result of the following heat inputs and outputs, each expressed per square meter of free surface in W/m2: (i) the net (incident minus reflected) total shortwave solar radiation (direct plus diffuse) absorbed by the water, which is a function of the incident solar radiation rs and of the reflected radiation, the latter being proportional to the incident radiation through a constant known as the albedo; (ii) the net (incident minus reflected) longwave atmospheric radiation absorbed by the water; (iii) the longwave radiation emitted by the water, determined as a function of the average surface water temperature; (iv) the heat lost through evaporation, determined as a function of the wind velocity, the saturation vapor pressure, and the vapor pressure of the air (relative humidity is also a variable that affects the water-atmosphere heat exchange); (v) the sensible heat exchanged by conduction between the atmosphere and the water, which depends on the air and water temperatures; and (vi) the heat exchanged with the substrate (river bed), per square meter of river reach.
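With symbols chosen here purely for illustration (they are assumptions, not necessarily the original notation), the balance described above can be sketched as follows, where $H_T$ is the total caloric power absorbed, $H_{ns}$ the net shortwave solar radiation, $H_{na}$ the net longwave atmospheric radiation, $H_w$ the longwave radiation emitted by the water, $H_e$ the evaporative heat loss, $H_c$ the sensible heat exchanged by conduction, $H_b$ the heat exchanged with the river bed, $r_s$ the incident solar radiation, and $a$ the albedo:

```latex
\begin{align}
  H_T    &= H_{ns} + H_{na} - H_w - H_e + H_c + H_b, \\
  H_{ns} &= r_s - a\, r_s = (1 - a)\, r_s .
\end{align}
```

In this sketch $H_c$ and $H_b$ are taken as positive when heat flows into the water; either term can be negative, since conduction and bed exchange may act in both directions.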
The heat stored by a water mass as it moves along a river stretch of length L is estimated from the caloric power absorbed by the water (W/m2), the water temperature increment (°C), the circulating flow (m3/s), the specific heat of water (kcal/(kg °C)), the density of water (kg/m3), the length of the studied reach (m), and the effective width of the river (m).
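One relation that is consistent with the quantities and units listed above equates the power absorbed over the free surface of the reach with the rate at which the flowing water stores heat. The sketch below assumes that form; the function name, the unit conversion constant, and the numerical values are illustrative assumptions, not data from the study.

```python
KCAL_TO_J = 4186.8  # joules per kilocalorie (International Table calorie)


def temperature_increment(h, length, width, flow, c_kcal=1.0, rho=1000.0):
    """Temperature rise (degC) of water crossing a reach.

    Assumed balance:  h * length * width = rho * c * flow * dT

    h       absorbed caloric power per unit free surface, W/m2
    length  reach length, m
    width   effective river width, m
    flow    circulating flow, m3/s
    c_kcal  specific heat of water, kcal/(kg degC)
    rho     density of water, kg/m3
    """
    if flow <= 0:
        raise ValueError("flow must be positive")
    power = h * length * width                            # W over the reach
    heat_capacity_rate = rho * c_kcal * KCAL_TO_J * flow  # W per degC
    return power / heat_capacity_rate


# Hypothetical reach: 200 W/m2 absorbed over 5 km x 150 m, with 300 m3/s
dT = temperature_increment(h=200.0, length=5000.0, width=150.0, flow=300.0)
print(round(dT, 3))
```

For a fixed absorbed power, doubling the flow halves the temperature increment, which is why discharge matters when comparing reaches.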
On the other hand, an analysis of the historical variation of water temperature over consecutive years showed similar behavior from year to year, both in the cyclical variation and in the tendency to increase or decrease. This leads to the expectation of a correlation between the temperature variation in a given year and the temperature of previous years.
The background described above led to the choice of the measured variables used in the prediction model.
Additionally, when physical variables are fitted by means of genetic programming, questions may arise about the dimensional consistency of the resulting expressions. This problem can be resolved by allowing the constants of the fitted model to carry dimensions. New physical interpretations of the related variables can then be made by analyzing the model terms.
In this document, for simplicity, only the four arithmetic operators were considered: addition, subtraction, multiplication, and division.
Twelve independent variables, one dependent variable, and a vector of real constants were selected. Thus, in the nonstandardized case the terminal set comprises the hourly average relative humidity values recorded in 1998, 1999, and 2000, as decimal fractions; the average air temperatures of 1998, 1999, and 2000, in °C; the average wind speeds of 1998, 1999, and 2000, in m/s; the average solar radiation values of 1998, 1999, and 2000, in W/m2; the hourly average water temperature measured in 2000, in °C; and a vector of real constants.
Tests were made with hourly, daily, and weekly averaged water temperatures.
In the standardized case all of the above variables are dimensionless.
3.4. Objective Function
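The objective function considered in this problem was defined as the minimization of the mean square error between calculated and measured data:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2,$$
where $x_i$ are the measured data, $\hat{x}_i$ are the calculated data, and $i$ is a counter from 1 to the number of data $n$.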
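As a small worked example of this objective function, the sketch below scores two candidate models against the same measurements and keeps the one with the lower mean square error. The data values and model names are hypothetical, and the code is an illustration in Python, not the MATLAB implementation used in the study.

```python
def mse(measured, calculated):
    """Mean square error between measured and calculated series."""
    if len(measured) != len(calculated) or not measured:
        raise ValueError("series must be non-empty and of equal length")
    return sum((m - c) ** 2 for m, c in zip(measured, calculated)) / len(measured)


# Hypothetical water temperatures (degC) and the output of two candidate models
measured = [10.0, 12.0, 15.0, 11.0]
model_a = [11.0, 12.0, 14.0, 11.0]   # squared errors: 1, 0, 1, 0
model_b = [12.0, 10.0, 15.0, 13.0]   # squared errors: 4, 4, 0, 4

candidates = {"a": model_a, "b": model_b}
best = min(candidates, key=lambda name: mse(measured, candidates[name]))
print(best, mse(measured, model_a), mse(measured, model_b))
```

Since the error is squared, a few large deviations dominate the score, so a model that is usually close but occasionally far off can rank below a uniformly mediocre one.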
The genetic programming algorithm was implemented in MATLAB (The MathWorks).
The variables were standardized by subtracting the mean and dividing by the standard deviation:
$$z = \frac{x - \bar{x}}{s_x},$$
where $z$ is the standardized variable, dimensionless; $x$ is the variable before standardization; $\bar{x}$ is the mean value of $x$, with the same units as $x$ (the arithmetic average was used); and $s_x$ is the standard deviation of $x$, with the same units as $x$.
Variables with large variances tend to have a larger effect on the resulting model than those with small variances, even though the latter may be equally relevant. Standardized variables are therefore advantageous in that their means are zero and their second moments (variances) are one.
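The effect of standardization can be sketched on two series with very different scales. The sample values are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return (z-scores, mean, standard deviation) for a series."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(v - mu) / sigma for v in values], mu, sigma


def destandardize(z_values, mu, sigma):
    """Inverse transform: recover values in the original units."""
    return [z * sigma + mu for z in z_values]


# Hypothetical samples: solar radiation (W/m2) and relative humidity (decimals)
radiation = [0.0, 150.0, 600.0, 850.0, 300.0]
humidity = [0.80, 0.65, 0.40, 0.35, 0.60]

z_rad, mu_rad, sd_rad = standardize(radiation)
z_hum, mu_hum, sd_hum = standardize(humidity)

# Before: spreads differ by about three orders of magnitude; after: both are 1
print(pstdev(radiation), pstdev(humidity))
print(pstdev(z_rad), pstdev(z_hum))
```

The inverse transform is what turns the output of a model fitted on standardized data back into a temperature in °C, given a mean and standard deviation for the target series.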
3.6. Input Data
Meteorological and water temperature data were recorded at gauging stations installed along the Ebro River. The data consist of 10-minute averages of measurements taken every minute. Water temperatures were measured just downstream of the Flix hydroelectric power plant. The meteorological variables were measured at the station located on the Ribarroja Dam. The hourly average was calculated for all variables and taken as input data: relative humidity, air temperature, wind speed, and solar radiation as independent variables, and water temperature as the dependent variable.
The first experiment was carried out with the original data, and the second one with the standardized data. GP parameter settings for both experiments are shown in Table 1.
3.7. Model Linearity
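In order to validate the applicability of the method, the correlation coefficient between measured and calculated data was obtained:
$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y},$$
where $\operatorname{cov}(x, y)$ is the covariance between the variables $x$ and $y$, and $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$, respectively.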
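The correlation coefficient defined above can be computed directly from its definition. The sketch below is a plain Python illustration with hypothetical measured and calculated series; population (divide by n) moments are assumed, which does not affect the ratio.

```python
from math import sqrt


def pearson_r(x, y):
    """Correlation coefficient: covariance divided by the product of std devs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (std_x * std_y)


# Hypothetical measured and calculated water temperatures (degC)
measured = [10.0, 12.0, 15.0, 11.0]
calculated = [11.0, 12.0, 14.0, 11.0]
r = pearson_r(measured, calculated)
print(round(r, 3))
```

A value of r close to one indicates that calculated and measured data fall near a straight line, but not necessarily the identity line, so it is best read together with the mean square error.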
4. Results and Discussion
4.1. One-Hour Average Data
The genetic programming algorithm tends to produce relatively simple models. The equations produced in both experiments involve the following quantities, all in °C: the estimator of the mean water temperature in 2000; the water temperature in 1998; the water temperature in 1999; the estimator of the standard deviation of the water temperature in 2000; the standard deviation of the water temperature in 1998; and the standard deviation of the water temperature in 1999. Table 2 shows the mean square error (MSE) obtained with both models.
The mean and the standard deviation of the residuals, calculated as the difference between measured and calculated water temperatures, appear in Table 3. The measured and calculated data, including their differences for both experiments, are shown in Figures 3 and 4.
In Figures 5 and 6, the measured data and the data calculated with (6) and (8) are plotted against the identity function to obtain the correlation coefficient and check the linearity of the fit. The results in Figures 3–6 show an improvement in the calculated data when standardization is applied: residuals are slightly reduced and fluctuations become smoother, which is confirmed by the correlation coefficient. The mean square error is reduced by about 30%, and there is less data dispersion (the standard deviation of residuals decreases by 17%).
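The residual statistics and percentage reductions quoted above follow from simple arithmetic, sketched below. The input values are hypothetical and do not reproduce the tables of the study; the population standard deviation is again an assumption.

```python
from statistics import mean, pstdev


def residual_stats(measured, calculated):
    """Mean and standard deviation of residuals (measured minus calculated)."""
    residuals = [m - c for m, c in zip(measured, calculated)]
    return mean(residuals), pstdev(residuals)


def percent_reduction(before, after):
    """Relative reduction of an error measure, as a percentage of `before`."""
    if before == 0:
        raise ValueError("reference value must be nonzero")
    return 100.0 * (before - after) / before


# Hypothetical series: residuals are 1, 0, -1, so their mean is zero
res_mean, res_std = residual_stats([10.0, 12.0, 14.0], [9.0, 12.0, 15.0])

# Hypothetical MSE values without and with standardization
reduction = percent_reduction(2.0, 1.4)
print(res_mean, round(res_std, 4), round(reduction, 1))
```

A residual mean near zero indicates an unbiased model, while the standard deviation of residuals measures the dispersion referred to in the text.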
4.2. Daily Average Data
In this case, the equations obtained without and with standardization, (11) and (12), involve the following quantities: the daily average water temperature estimated for 2000, in °C; the daily average relative humidity recorded in 1998, as a decimal fraction; the daily average air temperatures of 1998 and 1999, in °C; and the daily average solar radiation of 1998, in W/m2. A prefix on a variable name indicates a standardized variable.
Applying an inverse standardization process to the standardized equation yields (13). In (13), the mean and standard deviation for 2000 are estimated according to (9) and (10), but considering daily measurements. The mean square errors (MSEs) obtained by using (11) and (13) are detailed in Table 4. The mean and the standard deviation of the residuals of this experiment appear in Table 5.
Water temperature variations against time and the resulting differences are plotted in Figures 7 and 8. Figures 9 and 10 compare measured and calculated daily average water temperatures with respect to the identity function. The daily analyses show a reduction of nearly 40% in mean square error with the equation obtained using standardized data. In this case the standard deviation of residuals is also smaller (12% lower than without standardization). Figure 11 shows an example of the performance of the best individual in each generation when the genetic programming algorithm was applied.
4.3. Weekly Average Data
In this last experiment, the equations obtained without and with standardization involve the following quantities: the weekly average water temperature estimated for 2000, in °C; the weekly average relative humidity values recorded in 1998 and 2000, as decimal fractions; the weekly average air temperatures of 1998, 1999, and 2000, in °C; the weekly average wind speeds of 1998, 1999, and 2000, in m/s; and the weekly average solar radiation values of 1998 and 1999, in W/m2. A prefix on a variable name indicates a standardized variable.
Equation (15) must be destandardized to obtain the estimate of the weekly average temperature.
The results obtained for the weekly analysis show a reduction of 52% in the mean square error when the data are previously standardized, and a reduction of about 31% in the standard deviation of residuals. The correlation coefficient is also close to one.
5. Getting the Daily Water Temperature for the Year 2004
The daily climatic data measured in 2002 and 2003 at the Flix and Miravet stations were used to estimate the water temperature in 2004, in order to check the accuracy of the models given by (11), (12), and (13). The climatic data for 2004 required by the models were taken as the average of the years 2002 and 2003. Assumed values were also adopted for the average water temperature in 2004 and for its standard deviation.
A mean square error of 49.549 and a correlation coefficient of 0.6744 were obtained by applying (11), as shown in Figures 16 and 17, with considerable variation in the residuals. By contrast, (13), which requires standardized data, gave a mean square error of 13.027 and a correlation coefficient of 0.7445; the corresponding residuals are shown in Figures 18 and 19. Therefore, the latter model performed better on daily data for this year.
With both equations, very large residuals in water temperature were obtained for some days of the estimated year.
Different models which allow the estimation of water temperature in the Ebro River in a given year were obtained, taking into account climatic variables measured in the same year, but also considering their variability in two previous years, in an attempt to explain the possible evolution of the water temperature behavior.
The GP algorithm took as input hourly, daily, and weekly averages of the measured data, without and with standardization, in order to analyze how the resulting equations change with the temporal aggregation of the input data.
Measured water temperature and climatic data intrinsically oscillate more in hourly averages than in daily or weekly averages. In particular, in the experiment using hourly data, the GP algorithm amplifies the water temperature oscillations, probably because in the actual physical process the oscillations of the climatic variables are filtered. Nevertheless, with standardized data the mean square errors were lower than those without standardization, and a lower dispersion in the data was obtained. Similar situations occurred in the case of daily data.
According to the mean square errors, the standard deviation of residuals, and the correlation coefficient, the GP algorithm produced models better able to follow the behavior of water temperature when weekly data were considered. This was particularly true for the models obtained with standardized data.
Therefore equations such as those obtained herein can be used as a first approximation to predict changes in water temperature when changes occur in climatic variables such as air temperature, wind speed, relative humidity, and solar radiation, all of which affect the water temperature as well as the physical and chemical water conditions, including the flora and fauna of a river.
When the models for daily data were applied in another year, lower correlations between measured and predicted data were obtained, particularly with the model that does not take into account standardized variables.
According to these results, it is feasible to obtain some improvements in generating water temperature models by means of genetic programming, when the standardization process is incorporated.
The results also show the limits of the models developed herein: the models produced oscillations in the water temperatures that do not correspond to the measured data, and the forecasting results for 2004 are only fair. This is probably because some variables involved in the physical phenomena are omitted and the filtering that occurs in nature is not reproduced; nevertheless, these results are considered useful as a first-order explanation of a complex process. Future work is suggested to compare the proposed method with physically based ones.
H. Madsen, M. B. Butts, S. T. Khu, and S. Y. Lyong, “Data assimilation in rainfall-runoff forecasting,” in Proceedings of the 4th Conference of Hydroinformatics, pp. 1–8, Cedar Rapids, Iowa, USA, July 2000.
J. Dorado, J. R. Rabuñal, J. Puertas, A. Santos, and D. Rivero, “Prediction and modelling of the flow of a typical urban basin through genetic programming,” in Proceedings of the European WorkShops on Applications of Evolutionary Computation (EvoWorkshops '02), vol. 2279 of Lecture Notes in Computer Science, pp. 190–201, 2002.
M. Keijzer and V. Babovic, “Declarative and preferential bias in GP-based scientific discovery,” Genetic Programming and Evolvable Machines, vol. 3, pp. 41–79, 2002.
E. L. Harris, V. Babovic, and R. A. Falconer, “Velocity predictions in compound channels with vegetated floodplains using genetic programming,” International Journal of River Basin Management, vol. 1, no. 2, pp. 117–123, 2003.
M. Keijzer, M. Baptist, V. Babovic, and J. R. Uthurburu, “Determining equations for vegetation induced using genetic programming,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '05), pp. 1999–2006, Washington, DC, USA, June 2005.
J. Seguí, Análisis de la Serie de Temperatura del Observatorio del Ebro 1894–2002, Observatori de l'Ebre, Roquetes, Spain, 2003.
B. W. Webb and F. Nobilis, “Water temperature behaviour in the River Danube during the Twentieth Century,” Hydrobiologia, vol. 291, no. 2, pp. 105–113, 1994.
D. W. Schindler, “Widespread effects of climatic warming on freshwater ecosystems in North America,” Hydrological Processes, vol. 11, no. 8, pp. 1043–1067, 1997.
M. Álvarez Cobelas, J. Catalán, and D. García De Jalón, “Impactos sobre los ecosistemas acuáticos continentales,” in Evaluación Preliminar de Los Impactos en España por Efecto del Cambio Climático, J. M. Moreno, Ed., pp. 113–146, Ministerio de Medio Ambiente, Madrid, Spain, 2005.
J. M. Hellawell, Biological Indicators of Freshwater Pollution and Environment Management, Elsevier, London, UK, 1986.
I. J. Winfield and J. S. Nelson, Cyprinid Fishes. Systematics, Biology and Exploitation, Chapman & Hall, London, UK, 1991.
S. R. Val, Incidencia de los embalses en el comportamiento térmico del río. Caso del sistema de embalses Mequinenza-Ribarroja-Flix en el río Ebro, Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona, Spain, 2003.
J. Prats, R. Val, J. Armengol, and J. Dolz, “A methodological approach to the reconstruction of the 1949–2000 water temperature series in the Ebro River at Escatrón,” Limnetica, vol. 26, no. 2, pp. 293–306, 2007.
M. L. Arganis, S. R. Val, V. K. Rodríguez, M. R. Domínguez, and R. J. Dolz, “Comparación de curvas de ajuste a la Temperatura del Agua de un río usando programación genética,” in Congreso Mexicano de Computación Evolutiva (COMEV '05), pp. 1–8, Universidad Nacional Autónoma de Aguascalientes, May 2005.
M. L. Arganis, S. R. Val, M. R. Domínguez, V. K. Rodríguez, R. J. Dolz, and J. Eaton, “Comparison between equations obtained by means of multiple linear regression and genetic programming to approach measured climatic data in a river,” in IWA Watermex, pp. 1–8, Washington, DC, USA, May 2007.
T. Bäck, Evolutionary Algorithms in Theory and Practice, Oxford University Press, Oxford, UK, 1996.
N. L. Cramer, “A representation for the adaptive generation of simple sequential programs,” in Proceedings of the International Conference on Genetic Algorithms and the Applications, J. J. Grefenstette, Ed., pp. 183–187, 1985.
J. R. Koza, “Hierarchical genetic algorithms operating on populations of computer programs,” in Proceedings of the 11th International Joint Conference on Artificial Intelligence, vol. 1, pp. 768–774, Morgan Kaufmann, 1989.
J. E. Edinger, D. K. Brady, and J. C. Geyer, “Heat exchange and transport in the environment,” Tech. Rep. 14, Electric Power Research Institute, Palo Alto, Calif, USA, 1974.
The MathWorks, “MATLAB Reference Guide,” The MathWorks, Inc., 1992.