Water Quality Modeling in Reservoirs Using Multivariate Linear Regression and Two Neural Network Models
In this study, two artificial neural network models (a radial basis function neural network, RBFN, and an adaptive neurofuzzy inference system, ANFIS) and a multilinear regression (MLR) model were developed to simulate the dissolved oxygen (DO), total phosphorus (TP), chlorophyll a (Chl a), and Secchi disk depth (SD) in the Mingder Reservoir of central Taiwan. The input variables of the neural network and MLR models were determined using linear regression. The performances of the RBFN, ANFIS, and MLR models were evaluated based on statistical errors, including the mean absolute error, the root mean square error, and the correlation coefficient, computed from the measured and model-simulated DO, TP, Chl a, and SD values. The results indicate that the ANFIS model outperforms the MLR and RBFN models and simulates the water quality variables with reasonable accuracy, suggesting that it can be used as a valuable tool for reservoir management in Taiwan.
1. Introduction
Water is an important resource for the survival and health of humans and ecosystems, and its quality is equally crucial. The term water quality describes the condition of the water column in terms of its physical, chemical, and biological characteristics. Assessing water quality variables is necessary for sound planning and management of water resources [1, 2].
Eutrophication is nutrient enrichment, that is, excessive phosphorus and nitrogen loads in lakes and reservoirs, and it is a serious problem. The natural or artificial enrichment of inland water bodies can cause algal blooms that degrade the water quality for human use and decrease dissolved oxygen levels, with adverse effects on fisheries. Modeling techniques that predict the behavior of enriched water bodies help to combat these adverse effects [3–5].
Owing to advances in computing capability, two- and three-dimensional reservoir hydrodynamic and water quality models have been developed and widely applied to resolve water quality problems [6–9]. Although deterministic models have been adopted for modeling water quality, these models require extensive input data and model parameters to obtain results. Because a large number of factors affecting the water quality have complicated nonlinear relationships with the variables, such systems are not easy to handle with traditional deterministic models.
Because of the difficulties in simulating water quality conditions with hydrodynamic and water quality models, artificial neural networks (ANNs), a relatively novel computational approach that has found wide acceptance in many disciplines, provide an alternative method for understanding and managing the water quality in reservoirs. ANNs are well suited to this application because of their information processing characteristics, such as nonlinearity, parallelism, noise tolerance, and learning and generalization capabilities [11, 12].
Artificial intelligence techniques, including knowledge-based systems, genetic algorithms, artificial neural networks, and fuzzy inference systems, have been successfully applied to the simulation of complex nonlinear systems. In particular, artificial neural networks have been successfully used as tools to simulate and predict water quality in water bodies [14–18]. The ANN model is a black-box model and does not require knowledge of any of the parameters. ANNs have several advantages, including learning ability, the capacity to deal with complex nonlinear data, and parallel processing ability. Prediction with an ANN can be performed by training the network on experimentally generated data or on the output of validated models.
Recently, Ranković et al. developed a feed-forward neural network model to simulate the dissolved oxygen in the Gruza Reservoir in Serbia. Soltani et al. combined a water quality simulation model and a hybrid genetic algorithm to determine the optimal operating policies for different reservoir outlets. Their water quality simulation model is based on an adaptive neural fuzzy inference system that was trained using the results of a numerical water quality simulation model. Ranković et al. also used an adaptive network-based fuzzy inference system model to simulate the dissolved oxygen in the Gruza Reservoir. However, an evaluation of the efficiency of multivariate linear regression and artificial neural network models for predicting water quality parameters in reservoirs has not been reported.
The main objective of this study is to establish a multivariate linear regression (MLR) model and two artificial neural network models, namely, a radial basis function neural network (RBFN) and an adaptive neurofuzzy inference system (ANFIS), to simulate the dissolved oxygen (DO), the total phosphorus (TP), the chlorophyll a (Chl a) content, and the Secchi disk depth (SD), which are commonly used as indicators of the eutrophication of reservoirs. A further aim is to demonstrate the application of these models to identify complex nonlinear relationships between the inputs and outputs in the Mingder Reservoir, Taiwan. The DO, TP, Chl a, and SD values simulated by the MLR, RBFN, and ANFIS models were compared with the measured values.
2. Materials and Methods
2.1. Description of Study Area and Data Collection
The Mingder Reservoir (Figure 1) is located in central Taiwan. The reservoir, completed in 1970, has a total watershed area of 61 km2 and an effective storage volume of m3 and was built as a water supply source to Miaoli County for industrial supply, irrigation, and flood control. The dam site of the Mingder Reservoir is located at the tributary of the Houlong River and is approximately 7 km from Miaoli City.
The Mingder Reservoir is an in-channel reservoir that was formed by the construction of an earth dam and has a surface area of 1.62 km2. Its water quality has been routinely monitored over the past two decades, once per season since 1993. There are three water quality sampling stations in the reservoir.
Carlson proposed a trophic state index (CTSI) related to the diverse aspects of the trophic state found in multiparameter indices. The CTSI has the simplicity of a single parameter index for the water quality assessment of impounded water bodies. The CTSI has been adopted by the Taiwan Environmental Protection Administration (TEPA) to determine eutrophication status. It is calculated with three variables: the Secchi disk depth (SD), the total phosphorus (TP), and the chlorophyll a concentration (Chl a). For this reason, these three water quality variables were predicted using the three approaches presented in this study.
Figure 2 shows the CTSI values of the Mingder Reservoir from 1993 to 2013. A CTSI value above 50 indicates that the water body is eutrophic. The figure shows that the reservoir lies between the mesotrophic and eutrophic states. Agricultural activities are the major sources of nutrients and provide excessive nutrient loads to the reservoir.
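To make the index concrete, the following Python sketch computes a CTSI value from the three variables named above. It assumes the commonly cited Carlson sub-index formulas and an unweighted mean; the 40-point boundary between oligotrophic and mesotrophic, the function names, and the sample values are illustrative assumptions and are not taken from this study.

```python
import math


def tsi_sd(sd_m):
    """Carlson sub-index from Secchi disk depth (m)."""
    return 60.0 - 14.41 * math.log(sd_m)


def tsi_chl(chl_ug_l):
    """Carlson sub-index from chlorophyll a (ug/L)."""
    return 9.81 * math.log(chl_ug_l) + 30.6


def tsi_tp(tp_ug_l):
    """Carlson sub-index from total phosphorus (ug/L)."""
    return 14.42 * math.log(tp_ug_l) + 4.15


def ctsi(sd_m, chl_ug_l, tp_ug_l):
    """Unweighted mean of the three sub-indices."""
    return (tsi_sd(sd_m) + tsi_chl(chl_ug_l) + tsi_tp(tp_ug_l)) / 3.0


def trophic_state(value):
    """Classify a CTSI value.

    'Above 50 is eutrophic' follows the text; the lower bound of 40
    for the mesotrophic class is an assumption made for this sketch.
    """
    if value > 50.0:
        return "eutrophic"
    if value >= 40.0:
        return "mesotrophic"
    return "oligotrophic"


# Hypothetical sample, not a measurement from the Mingder Reservoir.
example_ctsi = ctsi(sd_m=1.0, chl_ug_l=10.0, tp_ug_l=30.0)
example_state = trophic_state(example_ctsi)
```

Because each sub-index is logarithmic, doubling the chlorophyll a or total phosphorus concentration shifts the corresponding sub-index by a fixed amount of roughly 7 to 10 points.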
The historical water quality data, including the water temperature, pH, electrical conductivity, turbidity, suspended solids, total hardness, total alkalinity, Secchi disk depth, dissolved oxygen, chlorophyll a concentration, total phosphorus, nitrate nitrogen, and biochemical oxygen demand from 1993 to 2013, were collected from the TEPA and analyzed. A statistical summary of the water quality variables is shown in Table 1.
2.2. Adaptive Neurofuzzy Inference System (ANFIS)
The ANFIS is a Sugeno fuzzy model implemented as a multilayer feed-forward network, which potentially captures the benefits of artificial neural networks and fuzzy logic approaches in a single framework. As a framework of adaptive systems, ANFIS provides easy learning and adaptation. The Sugeno system is the fuzzy model most commonly adopted for the ANFIS framework to deal with complicated nonlinear problems because it requires less computational time.
To easily understand the architecture of ANFIS, we assume that the fuzzy inference system has two inputs, $x$ and $y$, and only one output, $f$. Two fuzzy if-then rules based on a first-order Sugeno fuzzy model are then given as follows:
Rule 1: if $x$ is $A_1$ and $y$ is $B_1$, then $f_1 = p_1 x + q_1 y + r_1$,
Rule 2: if $x$ is $A_2$ and $y$ is $B_2$, then $f_2 = p_2 x + q_2 y + r_2$,
where $A_i$ and $B_i$ are the fuzzy sets and $p_i$, $q_i$, and $r_i$ are the parameters that will be determined during the training and testing processes. The architecture of ANFIS with two rules is shown in Figure 3, in which circles represent fixed nodes and squares indicate adaptive nodes. A brief introduction of the ANFIS model follows.
Layer 1. Each node in this fuzzy layer is an adaptive node with a node function
$$O_{1,i} = \mu_{A_i}(x), \quad i = 1, 2; \qquad O_{1,i} = \mu_{B_{i-2}}(y), \quad i = 3, 4,$$
where $x$ and $y$ are the inputs to node $i$; $A_i$ and $B_{i-2}$ are the linguistic labels; and $\mu_{A_i}$ and $\mu_{B_{i-2}}$ are the membership functions for the $A_i$ and $B_{i-2}$ linguistic labels, respectively. Many membership functions can be used, such as the trapezoidal membership function, the Gaussian membership function, the Gaussian combination membership function, the spline-based membership function, and the sigmoidal membership function. Chau investigated the effect of different membership functions on the performance of the ANFIS model and found that the differences are insignificant. In this study, we adopted the generalized bell-shaped membership function, which is commonly used:
$$\mu_{A_i}(x) = \frac{1}{1 + \left| \dfrac{x - c_i}{a_i} \right|^{2 b_i}},$$
where $a_i$, $b_i$, and $c_i$ are the parameter sets. The parameters in this layer are the premise parameters.
Layer 2. Each node in this layer is a fixed node, drawn as a labeled circle in Figure 3. The outputs of this layer, which are called the firing strengths ($w_i$), are the products of the corresponding membership degrees obtained from Layer 1 (the input layer):
$$w_i = \mu_{A_i}(x)\, \mu_{B_i}(y), \quad i = 1, 2.$$
Layer 3. Every node in this layer is a fixed node, also drawn as a labeled circle in Figure 3, that calculates the ratio of the firing strength of the $i$th rule to the sum of all rules' firing strengths:
$$\bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2.$$
Layer 4. The nodes in this layer are adaptive and adjustable. The output of each node is the product of the normalized firing strength, $\bar{w}_i$, from the third layer and the first-order polynomial of the corresponding rule:
$$O_{4,i} = \bar{w}_i f_i = \bar{w}_i \left( p_i x + q_i y + r_i \right), \quad i = 1, 2,$$
where $p_i$, $q_i$, and $r_i$ in this layer are the consequent parameters.
Layer 5. The single node computes the overall output by summing all of the incoming signals:
$$O_5 = \sum_{i} \bar{w}_i f_i = \frac{\sum_{i} w_i f_i}{\sum_{i} w_i}.$$
The details and mathematical background for these algorithms can be found in Jang.
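The five layers can be traced in a few lines of Python. The sketch below is only a forward pass for the two-input, two-rule first-order Sugeno system described above, with generalized bell membership functions; it does not include the training of the premise and consequent parameters, and all function names and parameter values are hypothetical choices made for illustration.

```python
def gbell(x, a, b, c):
    """Generalized bell membership: 1 / (1 + |(x - c) / a|^(2b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2.0 * b))


def anfis_forward(x, y, premise_x, premise_y, consequent):
    """Forward pass of a first-order Sugeno ANFIS.

    premise_x, premise_y: one (a, b, c) tuple per rule.
    consequent: one (p, q, r) tuple per rule.
    Returns the overall output and the normalized firing strengths.
    """
    # Layer 1: membership degrees of each input in each fuzzy set.
    mu_a = [gbell(x, a, b, c) for a, b, c in premise_x]
    mu_b = [gbell(y, a, b, c) for a, b, c in premise_y]
    # Layer 2: firing strength of each rule (product of degrees).
    w = [ma * mb for ma, mb in zip(mu_a, mu_b)]
    # Layer 3: normalize the firing strengths.
    total = sum(w)
    w_bar = [wi / total for wi in w]
    # Layer 4: weight each rule's linear consequent.
    rule_out = [wb * (p * x + q * y + r)
                for wb, (p, q, r) in zip(w_bar, consequent)]
    # Layer 5: sum the rule outputs.
    return sum(rule_out), w_bar


# Hypothetical parameters, chosen only to exercise the code.
premise_x = [(2.0, 2.0, 0.0), (2.0, 2.0, 10.0)]
premise_y = [(2.0, 2.0, 0.0), (2.0, 2.0, 10.0)]
consequent = [(1.0, 1.0, 0.0), (0.5, 0.5, 2.0)]
output, strengths = anfis_forward(5.0, 5.0, premise_x, premise_y, consequent)
```

At the midpoint between the two membership centers both rules fire equally, so the output is the plain average of the two rule consequents; away from the midpoint the nearer rule dominates.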
2.3. Radial Basis Function Neural Network (RBFN)
A radial basis function neural network (RBFN) has a feed-forward structure consisting of one input layer, a single hidden layer, and one output layer, as shown in Figure 4. The concept of the RBFN is to establish a radial basis function ($\phi$) and to determine the relationship between the input and the output using curve-fitting approaches [27, 28]. A single-output RBFN with $M$ hidden layer neurons can be described as
$$y(\mathbf{x}) = \sum_{j=1}^{M} w_j\, \phi_j(\mathbf{x}),$$
where $\mathbf{x}$ is the input vector; $w_j$ are the connecting weights between the hidden layer and the output layer; and $\phi_j(\mathbf{x})$ is the output value of the $j$th neuron in the hidden layer after transfer by the radial basis function. Many radial basis functions can be used in the RBFN model. The most common radial function is the Gaussian function, which is adopted in the present study:
$$\phi_j(\mathbf{x}) = \exp\left( -\frac{\left\| \mathbf{x} - \mathbf{c}_j \right\|^2}{2 \sigma^2} \right),$$
where $\mathbf{c}_j$ is the center vector of the $j$th neuron in the hidden layer; $\sigma$ is defined as $d_{\max} / \sqrt{2M}$, where $d_{\max}$ is the maximum distance among the $\mathbf{c}_j$ and $M$ is the number of the center vectors; and $\left\| \mathbf{x} - \mathbf{c}_j \right\|$ denotes the Euclidean distance between $\mathbf{x}$ and $\mathbf{c}_j$. The approach used to determine $\mathbf{c}_j$ is the orthogonal least squares (OLS) approach developed originally by Chen et al. and improved by Ham and Kostanic and by Kecman. The simplest way to select RBFN centers is the random method. However, the OLS algorithm provides an optimum number of centers ($M$) in the RBFN model from the training patterns.
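As a rough illustration of the network output, the Python sketch below evaluates a Gaussian RBFN for given centers and weights. It assumes the commonly cited spread heuristic, sigma equal to the maximum inter-center distance divided by the square root of twice the number of centers, and it takes the centers and weights as given; the OLS center selection used in this study is not implemented. All names and numbers are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def spread_from_centers(centers):
    """Assumed heuristic: sigma = d_max / sqrt(2 * M)."""
    m = len(centers)
    d_max = max(euclidean(ci, cj) for ci in centers for cj in centers)
    return d_max / math.sqrt(2.0 * m)


def rbfn_output(x, centers, weights, sigma):
    """Weighted sum of Gaussian hidden-neuron activations."""
    phi = [math.exp(-euclidean(x, c) ** 2 / (2.0 * sigma ** 2))
           for c in centers]
    return sum(w * p for w, p in zip(weights, phi))


# Hypothetical two-neuron network with two inputs.
centers = [(0.0, 0.0), (3.0, 4.0)]
weights = [1.0, 2.0]
sigma = spread_from_centers(centers)
y_at_first_center = rbfn_output((0.0, 0.0), centers, weights, sigma)
```

Tying the spread to the spacing of the centers keeps the Gaussians from being either too peaked or too flat relative to how far apart the hidden neurons sit in input space.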
2.4. Multilinear Regression
Multilinear regression (MLR) is used to model the linear relationship between a dependent variable and one or more independent variables. MLR is based on least squares: the model is fit such that the sum of the squares of the differences between the observed and simulated values is minimized. The model expresses the value of a predicted variable as a linear function of one or more predictor variables:
$$\hat{y} = b_0 + \sum_{i=1}^{k} b_i x_i,$$
where $\hat{y}$ is the predicted variable, $x_i$ is the value of the $i$th predictor, $b_0$ is the regression constant, $b_i$ is the coefficient of the $i$th predictor, and $k$ is the number of predictors.
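The least-squares fit can be sketched in plain Python by solving the normal equations. This is a minimal illustration under the assumption of a small, well-conditioned problem; it is not the software used in this study, and the function names and toy data are hypothetical.

```python
def fit_mlr(rows, targets):
    """Return [b0, b1, ..., bk] minimizing the sum of squared errors.

    Solves the normal equations (X^T X) b = X^T y by Gauss-Jordan
    elimination with partial pivoting. A singular system (for example,
    perfectly collinear predictors) raises ZeroDivisionError.
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the constant
    n = len(design[0])
    aug = [[sum(d[i] * d[j] for d in design) for j in range(n)]
           + [sum(d[i] * t for d, t in zip(design, targets))]
           for i in range(n)]
    for col in range(n):
        # Move the largest remaining pivot into place for stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict_mlr(coeffs, row):
    """Evaluate b0 + sum(b_i * x_i) for one observation."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


# Toy data generated from y = 1 + 2*x1 - 3*x2 (not reservoir data).
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
targets = [1.0, 3.0, -2.0, 0.0, 2.0]
coeffs = fit_mlr(rows, targets)
```

For larger or ill-conditioned predictor sets, a QR- or SVD-based solver from a numerical library would be a safer choice than the normal equations.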
2.5. Indices of Simulation Performance
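Many statistical indexes can be used to evaluate the performance of models. The most common indexes for evaluating the performance of ANN models are the root mean square error (RMSE) and the correlation coefficient ($R$). To strictly evaluate the performance of the MLR, RBFN, and ANFIS models, three criteria, the mean absolute error (MAE), the RMSE, and $R$, are employed in this study. These criteria are defined by the following equations:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| S_i - O_i \right|,$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( S_i - O_i \right)^2},$$
$$R = \frac{\sum_{i=1}^{N} \left( S_i - \bar{S} \right) \left( O_i - \bar{O} \right)}{\sqrt{\sum_{i=1}^{N} \left( S_i - \bar{S} \right)^2 \sum_{i=1}^{N} \left( O_i - \bar{O} \right)^2}},$$
where $N$ is the total number of data points, $S_i$ is a simulated water quality variable, $O_i$ is a measured water quality variable, and $\bar{S}$ and $\bar{O}$ denote the average simulated and measured water quality variables, respectively.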
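The three criteria translate directly into code. The following Python sketch assumes paired lists of simulated and measured values of equal length; the function names and sample numbers are illustrative only.

```python
import math


def mae(sim, obs):
    """Mean absolute error between simulated and measured values."""
    return sum(abs(s - o) for s, o in zip(sim, obs)) / len(sim)


def rmse(sim, obs):
    """Root mean square error between simulated and measured values."""
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(sim))


def corr(sim, obs):
    """Pearson correlation coefficient between the two series."""
    s_mean = sum(sim) / len(sim)
    o_mean = sum(obs) / len(obs)
    cov = sum((s - s_mean) * (o - o_mean) for s, o in zip(sim, obs))
    ss_s = sum((s - s_mean) ** 2 for s in sim)
    ss_o = sum((o - o_mean) ** 2 for o in obs)
    return cov / math.sqrt(ss_s * ss_o)


# Hypothetical values, not data from the reservoir.
simulated = [1.0, 2.0, 3.0, 4.0]
measured = [1.5, 2.0, 2.5, 5.0]
scores = (mae(simulated, measured),
          rmse(simulated, measured),
          corr(simulated, measured))
```

Because the RMSE squares each error before averaging, it can never be smaller than the MAE for the same pair of series, which makes the pair a quick consistency check on reported values.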
3. Results and Discussion
3.1. Selecting the Input Variables
The selection of an appropriate set of input variables for the MLR, RBFN, and ANFIS models is important for predicting the water quality variables in reservoirs. It is difficult to determine how many and which input variables should be adopted in the MLR, RBFN, and ANFIS models. A straightforward and useful way is to compute the correlation coefficient between each candidate input variable and the target water quality variables (DO, TP, Chl a, and SD). A total of 389 data sets collected from 1993 to 2013 were used for the analyses. Table 2 shows the correlation coefficient ($R$) between each candidate input variable and DO, TP, Chl a, and SD, respectively. A correlation coefficient greater than 0.1 was set as the threshold for selecting the input variables. The table indicates that the most effective inputs for DO are the pH value, chlorophyll a concentration, water temperature, total phosphorus, nitrate nitrogen, and biochemical oxygen demand. The most important water quality variables that affect the TP are the turbidity, suspended solids, pH value, dissolved oxygen, electrical conductivity, and nitrate nitrogen. The most important input variables for Chl a are the pH value, dissolved oxygen, water temperature, biochemical oxygen demand, nitrate nitrogen, and total hardness, and those for SD are the suspended solids, electrical conductivity, turbidity, total hardness, total alkalinity, and total phosphorus.
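The screening step can be sketched as follows in Python. The sketch assumes that the 0.1 threshold is applied to the absolute value of the correlation coefficient, so that strongly negative correlations also qualify; that reading, the function names, and the toy series are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    ss_x = sum((x - x_mean) ** 2 for x in xs)
    ss_y = sum((y - y_mean) ** 2 for y in ys)
    if ss_x == 0.0 or ss_y == 0.0:
        return 0.0
    return cov / math.sqrt(ss_x * ss_y)


def select_inputs(candidates, target, threshold=0.1):
    """Keep candidates whose |R| with the target exceeds the threshold.

    candidates: dict mapping a variable name to its list of values.
    Returns the selected names (in input order) and all R values.
    """
    r_values = {name: pearson_r(vals, target)
                for name, vals in candidates.items()}
    selected = [name for name, r in r_values.items() if abs(r) > threshold]
    return selected, r_values


# Toy series only; the names echo the text but the numbers are invented.
target_do = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "pH": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly positive
    "TH": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with the target
    "SS": [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly negative
}
selected, r_values = select_inputs(candidates, target_do)
```

A pairwise linear screen of this kind is cheap but can miss inputs that matter only through nonlinear or joint effects, which is one reason the selected sets are then passed to nonlinear models.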
3.2. Water Quality Prediction Using the Multilinear Regression Model
The multilinear regression models were used to predict DO, TP, Chl a, and SD. The multilinear regression models in (11) to (14) were obtained from the input water quality variables, where BOD is the biochemical oxygen demand (mg/L); EC is the electrical conductivity (μmho/cm at 25°C); NO3 is the nitrate nitrogen (mg/L); pH is the pH value; SS is the suspended solids (mg/L); TA is the total alkalinity (mg/L); TB is the turbidity (NTU); TH is the total hardness (mg/L); and WT is the water temperature (°C).
Equation (11) was used to predict the DO concentration for the training and testing phases. Of the 389 data sets, 272 were used for training and 117 for testing the MLR model. Table 3 shows the performance evaluation using the MLR model for DO, TP, Chl a, and SD. The MAE, RMSE, and $R$ values for the DO training phase were 1.81 mg/L, 2.33 mg/L, and 0.67, and these values for the DO testing phase were 1.81 mg/L, 2.41 mg/L, and 0.64, respectively. The MAE, RMSE, and $R$ values for the TP training phase were 9.82 μg/L, 15.29 μg/L, and 0.55, and these values for the TP testing phase were 9.99 μg/L, 14.70 μg/L, and 0.31, respectively. The MAE, RMSE, and $R$ values for the Chl a training phase were 3.96 μg/L, 5.85 μg/L, and 0.50, and these values for the Chl a testing phase were 4.18 μg/L, 5.92 μg/L, and 0.55, respectively. The MAE, RMSE, and $R$ values for the SD training phase were 0.34 m, 0.45 m, and 0.43, and these values for the SD testing phase were 0.37 m, 0.53 m, and 0.35, respectively.
3.3. Water Quality Prediction Using the RBFN Model
Because of the poor performance of the multilinear regression model, the RBFN model was applied to predict the DO, TP, Chl a, and SD in the reservoir. The same data sets adopted in the multilinear regression model were used for training and testing the RBFN model.
Table 4 shows the performance evaluation for the DO, TP, Chl a, and SD during the training and testing phases. The MAE, RMSE, and $R$ values for the DO during the training phase were 1.49 mg/L, 1.97 mg/L, and 0.77, and these values during the testing phase were 1.45 mg/L, 1.96 mg/L, and 0.79, respectively. The MAE, RMSE, and $R$ values for the TP during the training phase were 8.89 μg/L, 11.77 μg/L, and 0.75, and these values during the testing phase were 7.95 μg/L, 9.91 μg/L, and 0.73, respectively. The MAE, RMSE, and $R$ values for Chl a during the training phase were 3.09 μg/L, 4.39 μg/L, and 0.79, and these values during the testing phase were 2.86 μg/L, 4.08 μg/L, and 0.75, respectively. The MAE, RMSE, and $R$ values for SD during the training phase were 0.29 m, 0.39 m, and 0.68, and these values during the testing phase were 0.23 m, 0.31 m, and 0.76, respectively. The RBFN model performed better than the MLR model for the DO, TP, Chl a, and SD during both the training and testing phases.
Some specific parameters may cause significant changes in the ANN model. To avoid overfitting or underfitting, we reran the ANN model until the errors reached a minimum. Through these repeated runs, the sum of the error reduction ratio in the RBFN model was set to 0.85 to provide the best performance. However, the $R$ values were still lower than 0.8 using the RBFN model, which means that the RBFN model is not good enough for predicting the water quality variables.
3.4. Water Quality Prediction Using the ANFIS Model
An alternative approach, the ANFIS model, was used to predict the water quality variables. The network was trained using the training data set (i.e., 272 data sets), and then it was tested with the testing data set (i.e., 117 data sets). Figure 5 compares the model-predicted and measured DO for the training and testing phases, and Figure 6 shows the corresponding scatter plots. The error values, representing the predicted DO minus the measured DO, are also shown in Figure 5. The MAE, RMSE, and $R$ values for the training phase were 1.30 mg/L, 1.74 mg/L, and 0.85, while these values for the testing phase were 0.92 mg/L, 1.32 mg/L, and 0.88, respectively, as shown in Table 5. Figure 7 compares the model-predicted and measured TP concentrations for the training and testing phases, and Figure 8 shows the corresponding scatter plots. The MAE, RMSE, and $R$ values for the training phase were 6.45 μg/L, 9.22 μg/L, and 0.86, while these values for the testing phase were 5.44 μg/L, 7.42 μg/L, and 0.86, respectively. The number of membership functions in the ANFIS model was set to 2 to obtain the best performance.
Figure 9 compares the model-predicted and measured chlorophyll a concentrations for the training and testing phases, and Figure 10 shows the corresponding scatter plots. The MAE, RMSE, and $R$ values for the training phase were 3.19 μg/L, 4.67 μg/L, and 0.77, and these values for the testing phase were 3.25 μg/L, 3.05 μg/L, and 0.83, respectively. For predicting the Chl a concentration, the ANFIS model is slightly inferior to the RBFN model during the training phase but superior to it during the testing phase (Table 4). This may be because the ANFIS model underestimated the high Chl a concentrations in the observational data during the training phase.
Figure 11 compares the model-predicted and measured Secchi disk depth for the training and testing phases, and Figure 12 shows the corresponding scatter plots. The MAE, RMSE, and $R$ values for the training phase were 0.21 m, 0.31 m, and 0.80, while these values for the testing phase were 0.15 m, 0.24 m, and 0.89, respectively. Overall, the ANFIS model performs better for the DO, TP, Chl a, and SD than the multilinear regression model and the RBFN model. The performance of ANFIS during the testing phase is superior to that during the training phase. A similar result is also found in Ranković et al.
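The testing-phase correlation coefficients reported in Sections 3.2 to 3.4 can be placed side by side with a short Python snippet. The numbers are the values stated in the text for each model and variable; the dictionary layout and variable names are only an illustrative arrangement.

```python
# Testing-phase correlation coefficients as reported in the text.
reported_test_r = {
    "DO":    {"MLR": 0.64, "RBFN": 0.79, "ANFIS": 0.88},
    "TP":    {"MLR": 0.31, "RBFN": 0.73, "ANFIS": 0.86},
    "Chl a": {"MLR": 0.55, "RBFN": 0.75, "ANFIS": 0.83},
    "SD":    {"MLR": 0.35, "RBFN": 0.76, "ANFIS": 0.89},
}

# Model with the highest testing-phase R for each variable.
best_model = {var: max(scores, key=scores.get)
              for var, scores in reported_test_r.items()}

# Gain in R of ANFIS over RBFN for each variable.
anfis_gain = {var: round(scores["ANFIS"] - scores["RBFN"], 2)
              for var, scores in reported_test_r.items()}
```

Laid out this way, the ranking MLR, RBFN, ANFIS is the same for all four variables in the testing phase, and only ANFIS exceeds 0.8 for every variable.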
Akkoyunlu et al. applied two ANN models and an MLR model to estimate the DO concentration in Lake Iznik, Turkey. They found that the MLR model predicted the DO less accurately. In the current study, the MLR model not only is less accurate for predicting the DO concentration but also does not simulate the TP, Chl a, or SD well.
In general, the neural network models provide better performance than the MLR model in this study. The MLR model is still a useful tool for simple and fast prediction of water quality, although its predictions are of insufficient accuracy here. For instance, Chenini and Khemiri adopted the MLR approach to evaluate the groundwater quality in the Maknassy Basin, central Tunisia. Thoe and Lee used MLR to forecast the daily water quality of Hong Kong Beach.
Chau performed a benchmarking comparison for predicting river flow discharge using ANN and ANFIS models. His study indicated that the ANFIS model exhibited better performance than the ANN model, but the performances of the two models were very close. Our study showed that the predictions of water quality using the RBFN and ANFIS models had significant differences. This may be because the characteristics of the input data are different: Chau used antecedent flow data as the input vector, while input data consisting of different independent variables are adopted in the present study.
4. Conclusions
ANN models (i.e., RBFN and ANFIS models) and an MLR model were developed to predict the DO, TP, Chl a, and SD in the reservoir. The performances of the RBFN, ANFIS, and MLR models were evaluated using the mean absolute error, the root mean square error, and the correlation coefficient. The linear regression between the water quality variables (DO, TP, Chl a, and SD) and each candidate input variable was used to select the major input variables for the RBFN, ANFIS, and MLR models.
These models were then constructed to predict the DO, TP, Chl a, and SD in the Mingder Reservoir of central Taiwan. The water quality variables predicted using the MLR model did not yield good results. The correlation coefficient between the predicted water quality variables and the measured data was less than 0.8 using the RBFN model. In general, the ANFIS model predicts the water quality variables better than the RBFN model does. The ANN models, including the RBFN and ANFIS models, can preserve the nonlinear characteristics between the input and output variables, which makes them superior to conventional statistical approaches (i.e., the MLR model). In the real world, temporal and spatial distributions in observational data do not exhibit simple regularities; therefore, they are difficult to predict accurately. It is necessary to use nonlinear models, such as the ANN model, which are suitable for complex nonlinear systems. The proposed approach using the ANFIS model has yielded valuable information that can be used by decision-makers to aid reservoir water quality management.
In the present study, we focus on the prediction of water quality rather than on forecasting. In a future study, forecasts of water quality at different lead times can be developed to assist the local authorities in water quality management. Soft computing techniques, such as combining a fuzzy optimal model with genetic programming [40, 41], support vector machines [42, 43], and a particle swarm optimization training algorithm for a neural network, can also be developed to improve the prediction of water quality in reservoirs.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The project under which this study was conducted is supported by the National Science Council (NSC), Taiwan, under Grant no. NSC100-2625-M-239-001. The authors would like to express their appreciation to the Environmental Protection Administration, Taiwan, for providing the measured data.
References
F. Recknagel, A. Talib, and D. van der Molen, “Phytoplankton community dynamics of two adjacent Dutch lakes in response to seasons and eutrophication control unravelled by non-supervised artificial neural networks,” Ecological Informatics, vol. 1, no. 3, pp. 277–285, 2006.
P. Kudu, A. Debsarkar, and S. Mukherjee, “Artificial neural network modeling for biological removal of organic carbon and nitrogen from slaughterhouse wastewater in a sequencing batch reactor,” Advances in Artificial Neural Systems, vol. 2013, Article ID 268064, 15 pages, 2013.
C. T. Cheng, K. W. Chau, Y. G. Sun, and J. Y. Lin, “Long-term prediction of discharges in Manwan Reservoir using artificial neural network models,” in Advances in Neural Networks – ISNN 2005, vol. 3498 of Lecture Notes in Computer Science, pp. 1040–1045, Springer, Berlin, Germany, 2005.
H. B. Chu, W. X. Lu, and L. Zhang, “Application of artificial neural network in environmental water quality assessment,” Journal of Agricultural Science and Technology, vol. 15, no. 2, pp. 343–356, 2013.
T. Takagi and M. Sugeno, “Fuzzy identification of systems and its applications to modeling and control,” IEEE Transactions on Systems, Man and Cybernetics, vol. 15, no. 1, pp. 116–132, 1985.
K. W. Chau, Modelling for Coastal Hydraulics and Engineering, Spon Press, 2010.
M. Verleysen and K. Hlavackova, “Learning in RBF networks,” in Proceedings of the International Conference on Neural Network (ICNN '96), pp. 199–204, Washington, DC, USA, June 1996.
F. M. Ham and I. Kostanic, Principles of Neurocomputing for Science and Engineering, McGraw-Hill, New York, NY, USA, 2001.
V. Kecman, Learning and Soft Computing: Support Vector Machine, Neural Networks, and Fuzzy Logic Models, MIT Press, Cambridge, Mass, USA, 2001.
S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Upper Saddle River, NJ, USA, 2nd edition, 1999.
S. Weisberg, Applied Linear Regression, John Wiley & Sons, New York, NY, USA, 2nd edition, 1985.
I. V. Tetko, D. J. Livingstone, and A. I. Luik, “Neural network Studies. 1. Comparison of overfitting and overtraining,” Journal of Chemical Information and Computer Sciences, vol. 35, no. 5, pp. 826–833, 1995.
H. Tabari, O. Kisi, A. Ezani, and P. Hosseinzadeh Talaee, “SVM, ANFIS, regression and climate based models for reference evapotranspiration modeling using limited climatic data in a semi-arid highland environment,” Journal of Hydrology, vol. 444-445, pp. 78–89, 2012.