The Adaptive-Clustering and Error-Correction Method for Forecasting Cyanobacteria Blooms in Lakes and Reservoirs
Cyanobacteria blooms occur frequently worldwide, and effective prediction of blooms in lakes and reservoirs is an essential proactive strategy for water-resource protection. However, bloom dynamics are difficult to model because of the internal stochastic nature of the system evolution and the external uncertainty of the observation data. In this study, an adaptive-clustering algorithm is introduced to obtain typical operating intervals. The number of nearest neighbors used for modeling is then optimized by particle swarm optimization. Finally, a fuzzy linear regression method based on error-correction is used to revise the model dynamically near the operating point. We found that the combined method can characterize the evolutionary track of cyanobacteria blooms in lakes and reservoirs. The model is compared with other cyanobacteria-bloom forecasting methods (e.g., phase space reconstruction and traditional-clustering linear regression) using the average relative error and the average absolute error. The results suggest that the proposed model is more accurate than these alternatives, and its predictions can support efforts to prevent further deterioration of the water environment.
1. Introduction
Cyanobacteria blooms are a consequence of eutrophication and can no longer be ignored. They clearly interfere with lake ecosystem functioning (severe loss of aquatic biodiversity) and with services such as the provisioning of drinking water and recreation. The formation of cyanobacteria blooms is a multifactor-coupled, multidimensionally coordinated complex dynamical system with an intrinsically strong nonlinear dissipative structure. Precise prediction of cyanobacteria blooms can support effective mitigation of eutrophication and protection of the lake ecosystem. However, blooms exhibit high variability and instability, which makes them very difficult to monitor and predict.
Cyanobacteria-bloom data are a typical and complex type of chaotic time-series data based on long-term field observation and simulation experiments. Traditionally, global and local prediction algorithms have been used for the classical modeling and prediction of chaotic time series [5, 6]. The global method attempts to approximate the entire time series on all attractors and seeks a function that is valid at every point. Then, we can predict the future values from knowledge of all previous elements of the time series. However, the parameters may be changed when new information is added to the model. The local method only uses the nearest information to the local attractors, and the dynamics of the system are described locally, step by step, in the phase space.
Recently, researchers have found that prediction precision can be improved by combining techniques. Examples include artificial neural networks combined with fuzzy theory [9, 10], the unbiased composite kernel LSSVR, and k-nearest neighbors with phase space reconstruction. Generally, these methods obtain better results than traditional individual models, but they are complex, and their parameter selection depends on personal experience.
The variation of cyanobacteria blooms is dynamic and sensitive, as ecological environments show different characteristics at different intervals. Local linearization is a useful method for modeling and prediction in this situation, and the choice of the dimension of the regression model is also an important issue. In this paper, we propose a validity index to cluster the data into typical operating intervals for dynamic state identification. The typical operating intervals are subsets of samples of unequal sizes. Because the subsets are divided by similarity, they also have considerable influence on the operating-point range of the modeling process. The optimal number of neighbors participating in the modeling is set by the particle swarm optimization (PSO) algorithm. By using an error-correction method to adjust the function coefficients, the model can approximate the local attractor and improve the prediction accuracy.
In this paper, the following contributions are made:
(i) The clustering algorithm based on the validity index proposed in this paper is used to split the time-series data into unequal intervals and to narrow the field of prediction by finding partial neighbor information of current points in each subset. Using this approach, the prediction speed can be improved. In addition, this validity index is easy to understand and convenient for popularization and application.
(ii) Prediction deviation is used to adjust model parameters online, which means the idea of error-correction is introduced, and the proposed prediction model has the ability to resist external disturbances.
The rest of the paper is organized as follows. In Section 2, the prediction model is introduced. In Section 3, experiments and comparisons of different models are presented. Conclusions are given in Section 4.
2. Model-Based Prediction of Cyanobacteria Blooms in the Short Term
The objective is to predict the general trend of cyanobacteria blooms over a short period. The proposed prediction model involves the following three steps. First, using the proposed concise validity index, the time-series dataset is split into typical subsets. The number of clusters is determined by data similarity characteristics, and the maximum value of the validity index corresponds to the optimal clustering number. Second, PSO is used to select the number of nearest neighbors within the subset; the values of these nearest neighbors express the evolutionary trend in the time series. Third, the fuzzy regression algorithm with error-correction builds the data-driven model that predicts the cyanobacteria blooms at the next time step: the error between the predicted value and the actual value is used as feedback to correct the model coefficients dynamically based on fuzzy theory. In this way, the prediction of cyanobacteria blooms can be performed precisely without generating cumulative errors.
2.1. Optimal Number of Clusters Based on the Validity Index
The objective of the current work is to predict the fluctuation of cyanobacteria blooms over consecutive future time steps. The most basic step of the proposed prediction method is splitting the time-series modeling data into a set of typical subsets with nearest-neighbor similarity via the adaptive-clustering principle, according to which the number of clusters is determined by the similarity characteristics of the data. Clustering is an unsupervised learning process. The optimal number of clusters and the partition are determined according to a validity index that rewards higher similarity within each cluster and lower similarity between different clusters. Research results show that no validity index is applicable to every case, and many validity indices have been used in various studies. Commonly used clustering validity indices include Calinski-Harabasz (CH), Hartigan (Ht), Homogeneity-Separation (HS), and Krzanowski-Lai (KL).
Based on the preliminary research on chlorophyll a concentration, this paper proposes an improved cluster validity index according to its characteristics. The natural properties of the clustering results are used to evaluate intraclass tightness and interclass separability. We define the intraclass distance and interclass distance to compute the cluster validity index, which is named the clustering comprehensive quality. The formula for the intraclass distance is as follows:Here, is the number of clusters; is the data size of each cluster, where ; and denote the th and th data of the subset, which has samples, where .
The formula for the interclass distance is as follows:Here, denote the th and th cluster centers, where .
The formula for the proposed cluster validity index is as follows:Here, , , and are the weights for balancing the intraclass distance and interclass distance. For the general case, they are all positive real numbers, and , . In application, when the intraclass distance is too small or the interclass distance is too large, the three parameters can be adjusted to avoid weakening the performance indicator of clustering comprehensive quality. Obviously, the greater the clustering comprehensive quality is, the better the effect of clustering is.
In this paper, , , and are set based on the characteristics of the distribution of the chlorophyll a concentration data.
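The selection of the cluster number by a validity index can be illustrated with a minimal sketch. The exact formulas for the intraclass distance, interclass distance, and clustering comprehensive quality are not reproduced in the sketch; it assumes a mean pairwise distance within clusters, a mean distance between cluster centers, and a simple weighted difference of the two. The function names, the weights `w_intra` and `w_inter`, and the toy readings are hypothetical.

```python
from itertools import combinations
from statistics import mean


def intraclass_distance(clusters):
    """Mean pairwise distance inside each cluster, averaged over clusters (assumed form)."""
    per_cluster = []
    for points in clusters:
        pairs = [abs(a - b) for a, b in combinations(points, 2)]
        per_cluster.append(mean(pairs) if pairs else 0.0)
    return mean(per_cluster)


def interclass_distance(clusters):
    """Mean distance between cluster centers (assumed form, needs >= 2 clusters)."""
    centers = [mean(points) for points in clusters]
    return mean(abs(a - b) for a, b in combinations(centers, 2))


def comprehensive_quality(clusters, w_intra=1.0, w_inter=1.0):
    """Hypothetical combination: reward separation, penalize spread."""
    return w_inter * interclass_distance(clusters) - w_intra * intraclass_distance(clusters)


def best_partition(candidates):
    """Return the candidate partition with the largest quality score."""
    return max(candidates, key=comprehensive_quality)


# Toy one-dimensional readings grouped three different ways (illustrative values only).
candidates = [
    [[1.0, 1.2, 5.0, 5.3], [9.0, 9.4]],        # 2 clusters, one of them loose
    [[1.0, 1.2], [5.0, 5.3], [9.0, 9.4]],      # 3 clusters, tight and separated
    [[1.0], [1.2], [5.0, 5.3], [9.0, 9.4]],    # 4 clusters, two nearly identical
]
best = best_partition(candidates)
print(len(best), round(comprehensive_quality(best), 3))
```

In this toy case the three-cluster partition scores highest, which mirrors the rule that the maximum of the index marks the optimal cluster number. In practice the candidate partitions would come from running a clustering algorithm for each cluster count.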
2.2. k-Nearest Neighbors Based on the Particle Swarm Optimization Algorithm
Splitting the data into different clusters is equivalent to a local division of the motion law. In the k-nearest neighbors method, each sample can be represented by its k closest neighbors. After determining the best cluster partition, it is critical to construct a model that predicts the value of the evolutionary track using local similarity information. Selecting the appropriate number of similar sample points is a difficult task. The particle swarm optimization algorithm is a global-optimization evolutionary algorithm based on the foraging behavior of birds, which was proposed by Eberhart and Kennedy. In this paper, the particle swarm optimization algorithm is used to determine the optimal number of most similar neighbors within a cluster for a sample point.
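As an illustration of this step, the following sketch uses a basic particle swarm to choose the number of neighbors. It is not the authors' implementation: the cost function (one-step validation error of a nearest-neighbor mean predictor on delay vectors), the inertia and acceleration constants, the synthetic series, and all function names are assumptions made for the example.

```python
import math
import random
from functools import lru_cache


def knn_forecast(history, query, k, dim=3):
    """Predict the value following `query` as the mean successor of its k nearest delay vectors."""
    scored = []
    for t in range(dim, len(history)):
        scored.append((math.dist(history[t - dim:t], query), history[t]))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    return sum(value for _, value in nearest) / len(nearest)


def validation_error(series, k, dim=3, n_valid=20):
    """Mean absolute one-step error over the last n_valid points of the series."""
    errors = []
    for t in range(len(series) - n_valid, len(series)):
        prediction = knn_forecast(series[:t], series[t - dim:t], k, dim)
        errors.append(abs(prediction - series[t]))
    return sum(errors) / len(errors)


def pso_select_k(cost, k_min, k_max, n_particles=8, n_iter=25, seed=0):
    """Minimize cost(k) over integers in [k_min, k_max] with a basic particle swarm."""
    rng = random.Random(seed)

    def to_k(x):
        # Continuous positions are rounded and clamped to a valid neighbor count.
        return min(k_max, max(k_min, round(x)))

    pos = [rng.uniform(k_min, k_max) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = list(pos)
    pbest_cost = [cost(to_k(x)) for x in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g], pbest_cost[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (gbest - pos[i]))
            pos[i] = min(k_max, max(k_min, pos[i] + vel[i]))
            c = cost(to_k(pos[i]))
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i], c
    return to_k(gbest), gbest_cost


# Synthetic stand-in for a chlorophyll a series (illustrative only).
SERIES = [math.sin(0.3 * t) + 0.3 * math.sin(1.7 * t) for t in range(200)]


@lru_cache(maxsize=None)
def cost(k):
    return validation_error(SERIES, k)


best_k, best_cost = pso_select_k(cost, 1, 20)
print(best_k, best_cost)
```

Because the search space here is a small range of integers, an exhaustive scan would also work; the swarm is shown only to mirror the procedure described above. Caching the cost avoids recomputing the validation error when several particles land on the same k.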
2.3. Fuzzy Linear Regression Model
Consider the classical linear regression equation
$$y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n.$$
Here, $a_i$ is the regression coefficient, where $i = 0, 1, \ldots, n$, and $x_i$ is the independent variable of the model.
In a real application, the data often have fuzzy uncertainties due to various factors. In this paper, fuzzy linear regression is introduced.
The fuzzy linear regression equation is represented as follows:Here, is the fuzzy regression coefficient, where ; is the fuzzy independent variable, where ; and is the fuzzy dependent variable.
To simplify the use of the model, the triangular fuzzy representation was adopted as and , where and are central values of triangular fuzzy numbers and and are the fuzzy amplitudes.
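A triangular fuzzy number described by a central value and a fuzzy amplitude can be sketched as below. The sketch assumes symmetric triangles and crisp (non-fuzzy) inputs, in which case the output center is the weighted sum of the coefficient centers and the output amplitude is the sum of the coefficient amplitudes weighted by the absolute input values. The class and function names are hypothetical, and the fuzzy-input case is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriangularFuzzy:
    center: float     # central value of the triangular fuzzy number
    amplitude: float  # half-width of the triangle, assumed >= 0

    def membership(self, value):
        """Degree (0..1) to which `value` belongs to this fuzzy number."""
        if self.amplitude == 0:
            return 1.0 if value == self.center else 0.0
        return max(0.0, 1.0 - abs(value - self.center) / self.amplitude)


def fuzzy_linear_output(coefficients, inputs):
    """Evaluate a fuzzy linear model; coefficients[0] is the intercept, inputs are crisp."""
    xs = [1.0] + list(inputs)
    center = sum(c.center * x for c, x in zip(coefficients, xs))
    amplitude = sum(c.amplitude * abs(x) for c, x in zip(coefficients, xs))
    return TriangularFuzzy(center, amplitude)


# Illustrative coefficients only; they are not taken from Table 1.
coeffs = [
    TriangularFuzzy(2.0, 0.5),
    TriangularFuzzy(1.5, 0.25),
    TriangularFuzzy(-0.5, 0.0),
]
y = fuzzy_linear_output(coeffs, [4.0, 2.0])
print(y.center, y.amplitude, y.membership(7.75))
```

The predicted value is therefore an interval-like quantity: the center gives the point forecast, and the amplitude indicates how wide the plausible range around it is.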
2.4. Model Improvement and Solving
Optimization of the regression coefficients should be subject to two constraint conditions: one ensures that the model fits the data well, and the other ensures the minimal fuzziness of the regression function. By scaling the degree of the fit using the closeness degree, the problem is transformed into a problem that minimizes the fuzzy amplitude under the condition . The transformation is given by Here, is the given closeness-degree standard, and is the fuzzy amplitude of .
The formulation of the closeness degree is represented as follows:
Given a set of weights , the fuzzy amplitude of the system under the weights is represented as follows:
Obviously, the fuzzy linear regression is essentially an optimization problem, which is solved by minimizing the objective function subject to the following constraints:Here, is the weight of the fuzzy amplitude in the objective function, and, generally, its value is 1.
In a linear programming problem, the weight of each fuzzy amplitude in the objective function should be different because the influence of each regression variable on the dependent variable is different. Since the correlation coefficient between the regression variable and dependent variable can roughly reflect the degree of influence of each regression variable on the dependent variable, the weights are corrected according to correlation coefficients. The formulation is represented as follows:Here, is the correlation coefficient between the th regression variable and the dependent variable.
2.5. Model Evaluation
After the model is solved, the following criteria can be used to evaluate its fitting performance.
(1) Closeness degree. The fitting accuracy is generally considered to be high when the closeness degree is greater than 0.5.
(2) Error ratios. Two ratios are computed: the ratio of the error between the predicted value and the actual value to the actual value, and the ratio of the fuzzy amplitude to the actual value.
When both ratios are less than 20%, the fitting accuracy is considered good.
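The two ratio checks can be sketched as follows, assuming the first ratio is the absolute deviation of the predicted center from the actual value divided by the actual value, and the second is the fuzzy amplitude divided by the actual value. The function names and sample numbers are hypothetical; only the 20% threshold comes from the text.

```python
def fit_ratios(actual, predicted_center, amplitude):
    """Return (error ratio, amplitude ratio), both relative to the actual value."""
    error_ratio = abs(predicted_center - actual) / abs(actual)
    amplitude_ratio = amplitude / abs(actual)
    return error_ratio, amplitude_ratio


def fit_is_acceptable(actual, predicted_center, amplitude, threshold=0.20):
    """True when both ratios fall below the threshold (20% in the text)."""
    error_ratio, amplitude_ratio = fit_ratios(actual, predicted_center, amplitude)
    return error_ratio < threshold and amplitude_ratio < threshold


# Illustrative rows: (actual, predicted center, fuzzy amplitude).
rows = [(10.0, 9.0, 1.5), (10.0, 7.0, 1.0)]
for actual, center, amplitude in rows:
    print(fit_ratios(actual, center, amplitude), fit_is_acceptable(actual, center, amplitude))
```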
2.6. Prediction Model
In view of the characteristics of the random and nonperiodic chaotic behavior in the evolution process of cyanobacteria blooms, a fuzzy linear regression prediction model is constructed. According to the theory of the local linearization of nonlinear systems, a simple and novel validity index is used to split the sample data into several subsets with the adaptive-clustering algorithm. Then, the particle swarm optimization algorithm is used to select the optimal number of nearest neighbors in clusters of modeling data based on the Euclidean distance. Finally, with the fuzzy characteristics of the monitoring data, the fuzzy linear regression model is constructed.
First, the adaptive-clustering algorithm is used to split the time-series data into unequal intervals and narrow the field of prediction. The most important aspect is the choice of the number of clusters. To obtain the best clustering partition, a simple and effective validity index is proposed to evaluate the quality of clustering. Second, the subset of current monitoring points is selected, and the particle swarm optimization algorithm is used to determine the optimal number of most similar neighbors from among the subset for a sample point. Third, with the selected nearest-neighbor data, a fuzzy linear regression method based on error-correction is adopted for the local small-range prediction. This model effectively utilizes the historical information of the nearest neighbors of the subset to make the predictive value closer to the actual value and revise the prediction model dynamically based on the deviation.
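The local prediction and feedback idea can be sketched in simplified form. This is not the fuzzy regression with coefficient revision described above: the sketch swaps in an ordinary least-squares line fitted to the k nearest neighbors and a running bias term updated from the last observed deviation. The `gain` parameter, the function names, and the test series are assumptions.

```python
from statistics import mean


def local_linear_predict(pairs, x_now, k):
    """Fit y = a + b*x by least squares on the k pairs whose x is closest to x_now."""
    nearest = sorted(pairs, key=lambda p: abs(p[0] - x_now))[:k]
    xs = [x for x, _ in nearest]
    ys = [y for _, y in nearest]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return my + slope * (x_now - mx)


def forecast_with_correction(series, start, k, gain=0.5):
    """One-step forecasts for series[start:], corrected by feedback of past errors."""
    bias = 0.0
    predictions = []
    for t in range(start, len(series)):
        # Pairs (value, next value) that are fully known before time t.
        pairs = list(zip(series[:t - 1], series[1:t]))
        raw = local_linear_predict(pairs, series[t - 1], k)
        prediction = raw + bias
        predictions.append(prediction)
        # Feed the observed deviation back into the next prediction.
        bias += gain * (series[t] - prediction)
    return predictions


# A series with a constant increment is reproduced exactly, so the bias stays zero.
series = [1.0 + 2.0 * t for t in range(30)]
preds = forecast_with_correction(series, start=10, k=5)
print(preds[:3])
```

Each forecast uses only values observed before the forecast time, so the deviation fed back at every step comes from a genuine one-step-ahead prediction.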
3. Experiments and Comparison
This section presents experiments on the application of the prediction model to cyanobacteria-bloom forecasting. Chlorophyll a concentration is used as the index that characterizes cyanobacterial bloom formation. The proposed adaptive-clustering and error-correction method was able to capture the dynamic characteristics of high nonlinearity and uncertainty.
It has been confirmed that the evolution of cyanobacteria blooms has a chaotic property, and the sensitivity of a chaotic system to slight disturbances can cause great differences in evolutionary trajectories. Therefore, such a system cannot be predicted over the long term. A set of 2000 chlorophyll a concentration values, collected every 4 hours from January 1, 2011, to December 31, 2011, at the Jin Shu monitoring site of Taihu, was used to evaluate the prediction of chlorophyll a concentration from January 1, 2012, to January 7, 2012.
3.1. Evaluation Indicator
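To evaluate the accuracy of these predictions, the average absolute error and the average relative error between predicted values and actual values can be calculated as follows:
$$\bar{e} = \frac{1}{N}\sum_{i=1}^{N} e_i, \qquad e_i = \left| y_i - \hat{y}_i \right|,$$
$$\bar{r} = \frac{1}{N}\sum_{i=1}^{N} r_i, \qquad r_i = \frac{e_i}{y_i}.$$
Here, $\bar{e}$ is the average absolute error, $e_i$ is the absolute error of the chlorophyll a concentration, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $N$ is the prediction length, $r_i$ is the relative error, and $\bar{r}$ is the average relative error.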
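A minimal sketch of the two indicators is given below. It assumes that the absolute error is the magnitude of the difference between actual and predicted values, that the relative error divides it by the actual value, and that actual concentrations are positive. The function name and the sample numbers are hypothetical.

```python
def average_errors(actual, predicted):
    """Return (average absolute error, average relative error) over the prediction length."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    # Relative error divides by the actual value, assumed to be positive.
    rel_errors = [e / a for e, a in zip(abs_errors, actual)]
    n = len(actual)
    return sum(abs_errors) / n, sum(rel_errors) / n


# Illustrative values only; they are not taken from Table 3.
mae, mre = average_errors([10.0, 20.0, 40.0], [9.0, 22.0, 40.0])
print(mae, mre)
```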
3.2. Optimal Cluster Number
The training data were optimized based on the validity index proposed in this paper, and the curve of clustering comprehensive quality with different numbers of clusters is shown in Figure 1. The optimal number of clusters was computed according to formula (3); obviously, .
The same training data are split according to the abovementioned four kinds of clustering validity indices: the CH index, Ht index, HS index, and KL index. The curves of the indices are shown in Figure 2.
Figure 2: Curves of the four clustering validity indices: (a) CH index, (b) KL index, (c) HS index, and (d) Ht index.
For the CH index, KL index, and HS index functions, the number of clusters at which the function attains its maximum is taken as the optimal number of clusters. For the Ht index function, the minimum number of clusters with index-function value less than or equal to 10 is taken as the optimal number of clusters. As shown in Figure 2, the optimal numbers of clusters obtained by the CH index, KL index, HS index, and Ht index are , , , and , respectively.
3.3. Model Construction and Evaluation
The optimal number of neighbors was determined for each subset using the particle swarm optimization algorithm and was found to be 10.
Therefore, 10 nearest-neighbor samples were included in the fuzzy linear regression prediction model in each subset. The closeness-degree standard was set to 0.8, and Lingo software was used to compute the fuzzy coefficients. The fuzzy coefficients are listed in Table 1.
To evaluate the fitting accuracy of the model, relevant data were substituted into the model. The predicted values and actual values are listed in Table 2.
As shown in Table 2, most of the closeness degrees exceed 0.5, and the two ratios defined in Section 2.5 are less than 20%. This means that the fitting accuracy of the model is high.
3.4. Experimental Results
The values obtained using the fuzzy linear regression model for 20-step prediction and the actual values are shown in Figure 3.
The values predicted by the fuzzy linear regression model based on adaptive-clustering and error-correction are consistent with the actual values of chlorophyll a concentration, and the model has good prediction precision.
To assess the effectiveness of the proposed method, traditional-clustering linear regression and adaptive-clustering linear regression based on the four kinds of clustering validity indices described above were used to validate the experimental results. In addition, phase space reconstruction was used. The fuzzy linear regression model based on adaptive-clustering and error-correction was calculated with the center value of the predicted value. The results of other prediction models are shown in Figure 4, and the errors are listed in Table 3.
According to Figure 4 and Table 3, the prediction precision of the model constructed in this paper is relatively high, which supports the effectiveness of the proposed method and the feasibility of the model. Although the optimal clustering number obtained by the HS index is the same as that obtained by the cluster validity index used in this paper, the calculation method for the HS index is difficult to understand, and its generality is limited. By comparison, the validity index proposed in this paper is convenient, easy to understand, and practical.
4. Conclusion and Future Work
In this paper, the adaptive-clustering and error-correction methods were introduced to predict the chlorophyll a concentration used to characterize cyanobacterial bloom formation. Considering the chaotic and uncertain characteristics of chlorophyll a concentration, the adaptive-clustering algorithm was introduced here to obtain some typical subsets. In addition, the optimal number of nearest neighbors for the model was optimized by particle swarm optimization. Finally, the fuzzy linear regression method based on error-correction was used to revise the model dynamically near the operating point. We found that the combined method can characterize the evolutionary track of cyanobacteria blooms in lakes and reservoirs. It can be concluded that the model constructed in this paper can capture complex dynamics, such as the trends of chlorophyll a concentration in cyanobacteria blooms.
This model is intended for short-term forecasting. The complexities of data-driven modeling can greatly increase the difficulty of prediction. Future work should explore additional learning methods and even combine different learning methods to predict cyanobacteria blooms.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was financially supported by University Innovation Ability to Enhance the Project of Beijing Municipal no. PXM2014_014213_000033, Science and Technology Key Project of Beijing Municipal Education Commission no. KZ201510011011, and the National Natural Science Foundation of China no. 51179002.
References
L. S. Chen, Nonlinear Biodynamic System, Science Press, Beijing, China, 1993.
B. Q. Qin, G. J. Yang, J. R. Ma et al., “Dynamics of variability and mechanism of harmful cyanobacteria bloom in Lake Taihu, China,” Kexue Tongbao/Chinese, vol. 61, no. 7, pp. 759–770, 2016.
P. R. L. Alves, L. G. S. Duarte, and L. A. C. P. D. Mota, “Improvement in global forecast for chaotic time series,” Computer Physics Communications, vol. 207, pp. 325–340, 2016.
P. L. Gentili, H. Gotoda, M. Dolnik, and I. R. Epstein, “Analysis and prediction of aperiodic hydrodynamic oscillatory time series by feed-forward neural networks, fuzzy logic, and a local nonlinear predictor,” Chaos. An Interdisciplinary Journal of Nonlinear Science, vol. 25, no. 1, Article ID 013104, 2015.
H. Tongal and R. Berndtsson, “Phase-space reconstruction and self-exciting threshold modeling approach to forecast lake water levels,” Stochastic Environmental Research & Risk Assessment, vol. 28, no. 4, pp. 955–971, 2014.