The Big Data Processing Algorithm for Water Environment Monitoring of the Three Gorges Reservoir Area
Owing to the growing volume and complexity of data produced in an uncertain environment, the water environment monitoring system in the Three Gorges Reservoir Area faces heavy pressure in data handling. To identify water quality quickly and effectively, this paper presents a new big data processing algorithm for water quality analysis. The algorithm adopts a fast fuzzy C-means clustering algorithm to analyze water environment monitoring data. The fast clustering algorithm combines the fuzzy C-means clustering algorithm with the hard C-means clustering algorithm: the result of hard clustering is used to set the initial value of fuzzy clustering, which speeds up convergence. From the result of the fast clustering, the quality of water samples can be identified. Both the theoretical and simulated results show that the algorithm can quickly and efficiently analyze the water quality in the Three Gorges Reservoir Area, which significantly improves the efficiency of big data processing. Moreover, the proposed algorithm provides a reliable scientific basis for water pollution control in the Three Gorges Reservoir Area.
1. Introduction
The Yangtze River Three Gorges project is one of the world-renowned large water conservancy projects. Since the completion of the project, the storage capacity of the reservoir has grown tremendously. Because the self-purification capacity of the static water body has deteriorated, the reservoir water environment is gradually worsening, which has attracted widespread attention. The safety of the water environment is closely tied to the Three Gorges Reservoir Area itself, to water security in the Yangtze River region, to the smooth implementation of the water diversion project, and even to the overall sustainable development of China [1, 2]. In accordance with the requirements of the “observational study of water quality during storage of the Three Gorges Project 135 Rules,” the “Upstream of the Yangtze River Water Environment Monitoring Center” has been established. It mainly monitors turbidity, sediment volume, conductivity, temperature, hazardous materials (such as copper, lead, phenols, and cyanide), and another 27 items in the eight sections of Qingxichang, Fengdu, Zhongxian, Wanxian (Tuokou), Wanxian (Shaiwangba), Yunyang, Fengjie, and Wushan and in the three tributary sections (Daning River, Xiao River, and Long River). Meanwhile, environmental monitoring stations of various levels have been established in the Three Gorges Reservoir Area, because water environment monitoring there is an extremely important task. Since the monitoring methods and means of the existing equipment cannot meet the needs of this task, improving the monitoring capacity is the most pressing need.
To overcome the deficiencies of the prior techniques, the project team funded by the Natural Science Foundation of Chongqing has carried out a series of studies. It has applied wireless sensor network (WSN) technology to water environment monitoring in the Three Gorges Reservoir Area, studied techniques for expanding the WSN in this setting, and explored new ways of using WSNs for wireless, remote, real-time monitoring of the water environment. Building on the research results obtained [3, 4], this paper mainly discusses big data processing for water environment monitoring in the Three Gorges Reservoir Area.
Data-based and big data solutions have become more efficient and are receiving growing attention [5, 6], and big data processing has become a hot issue [7–9]. Because the reservoir area is large and widely distributed, the WSN applied to its water environment monitoring, with thousands of nodes, is a typical large-scale network. The network yields massive data about the water quality of the Three Gorges Reservoir Area, and handling these data is a typical big data processing problem. Therefore, how to process these data quickly and efficiently is important for water quality analysis, and it is the key issue addressed in this paper.
In water quality analysis, water quality is assessed with a variety of parameters according to the surface water quality standards [10]. The dissolved oxygen (DO), permanganate index (CODMn), and ammonia nitrogen (NH3–N), which all affect water quality in the Three Gorges Reservoir Area, should be analyzed together to determine water quality. The data collected from each node of the ultra-large-scale WSN in the Three Gorges Reservoir Area contain these three parameters (DO, CODMn, and NH3–N). The analysis and processing of such multidimensional data from the WSN monitoring network are therefore a typical big data processing problem.
Currently, the key technologies for big data analysis are clustering, fuzzy logic, evolutionary algorithms, and so forth [11]. There have been many scientific studies on big data analysis methods: the literature [12] reveals the connection between system of systems (SoS) research and big data analysis and establishes a stable system model to predict photovoltaic energy; big data analysis is adopted in [13] for the study of a complex global environment system, playing an important role in environmental data analysis; the authors of [14] study big data analysis in the management and prediction of natural disasters. However, the system model proposed in [12] is so sophisticated that it is not suitable for a water environment monitoring system with a single data type, and the algorithms adopted in [13, 14] place high demands on the processing ability of a computer, which is also not appropriate for a water environment monitoring system.
In this paper, the existing big data analysis and processing method has been studied in depth based on the achievement in our project. According to massive data collected from “ultra-large-scale WSN for water environment monitoring of Three Gorges Reservoir Area,” our emphasis is put on the research of the big data analysis and processing method.
The rest of this paper is organized as follows. In Section 2, we collect data from these network nodes. According to the fuzzy C-means clustering algorithm and hard C-means clustering algorithm, we present the fast fuzzy C-means clustering algorithm in Section 3. Then, numerical results are provided in Section 4. Finally, Section 5 gives the conclusion.
2. The Network Model of Water Environment Monitoring
Because the Three Gorges Reservoir Area is widely distributed, some places are still not accessible. The area is therefore special, and the reservoir shows a winding tree structure [15]. This special structure should be considered in the distribution of nodes and the planning of the water monitoring network. The distribution of the Three Gorges Reservoir Area and the structure model of the ultra-large-scale WSN water environment monitoring system are shown in Figure 1.
From Figure 1, we observe that a huge number of sensor nodes are distributed in the ultra-large-scale WSN. The water quality indicator data gathered in the process of data collection have the obvious characteristics of big data. In addition, the data collected from each node include several different water quality parameters, such as DO, CODMn, and NH3–N. Traditional data analysis methods struggle to analyze such large data in a short time and to determine the current water quality in a particular region. Therefore, we need to explore a fast and effective method for big data analysis and processing.
3. Research of Big Data Processing Algorithm
3.1. Fuzzy C-Means Clustering Algorithm Analysis
Clustering, fuzzy logic, and evolutionary algorithms are all effective ways to solve big data problems. The concept of fuzzy clustering was first proposed by Ruspini. Fuzzy logic is adopted in clustering, and the result of fuzzy clustering represents the extent to which a sample belongs to each cluster, rather than simply indicating, as hard clustering does, which single cluster the sample belongs to [16, 17]. Fuzzy clustering is therefore more scientific and reasonable. Among the many fuzzy clustering methods, the most widely used is the fuzzy C-means (FCM) clustering algorithm [18–20]. Compared with the traditional hard C-means (HCM) algorithm, FCM clustering yields richer information, which reflects the actual distribution of the samples more accurately. Its convergence has also been demonstrated [21]. The FCM clustering algorithm is a development of HCM proposed by Bezdek in 1981, and it is used to divide the data points in a multidimensional data space into a number of specific classes. The FCM algorithm iteratively optimizes an objective function that measures the similarity between the samples and the C cluster centers [22, 23], so that it reaches a local minimum and yields the optimal clustering. However, the FCM clustering algorithm is extremely time consuming. In the FCM clustering algorithm, the membership is used to indicate the degree to which each data point belongs to a cluster [24, 25].
Let $X = \{x_1, x_2, \ldots, x_n\}$ denote the set of data points to be clustered, let $c$ be the number of clusters, and let the fuzzy matrix $U = [u_{ij}]_{c \times n}$ indicate the fuzzy partition of the data points. The element $u_{ij}$ of the matrix $U$ is the membership of the data point $x_j$ in the $i$th cluster. The $u_{ij}$ must meet the following conditions [26]:

$$u_{ij} \in [0, 1], \qquad \sum_{i=1}^{c} u_{ij} = 1 \;\; (j = 1, \ldots, n), \qquad 0 < \sum_{j=1}^{n} u_{ij} < n \;\; (i = 1, \ldots, c). \tag{1}$$

Bezdek summed up the fuzzy partition problem as an extreme value problem: minimize the following objective function under the constraint conditions (1):

$$J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2}. \tag{2}$$

In formula (2), $V = [v_1, v_2, \ldots, v_c]$ is the matrix of cluster centers and $v_i$ is the $i$th cluster center. The quantity $d_{ij} = \lVert x_j - v_i \rVert$ is the Euclidean distance between the $i$th cluster center and the $j$th data point, and $m > 1$ is the fuzzy index, which controls the degree of fuzziness of the classification matrix. The greater $m$ is, the fuzzier the classification. When $m \to 1$, the FCM clustering algorithm degenerates into the HCM clustering algorithm. The objective function is therefore the weighted sum of the squared distances from each data point to the cluster centers, and the FCM algorithm is an iterative procedure that minimizes it.

Applying the Lagrange multiplier method to the optimization problem of formula (2) under the restriction of formula (1), we obtain the formulas for $u_{ij}$ and $v_i$:

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{2/(m-1)}}, \tag{3}$$

$$v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}. \tag{4}$$
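For readers who want the intermediate steps, the following is a brief sketch of the standard Lagrange multiplier argument behind the two update formulas. The multipliers $\lambda_j$ and the symbol $\mathcal{L}$ are notation introduced here for illustration; the remaining symbols follow the usual FCM notation ($u_{ij}$ for memberships, $v_i$ for cluster centers, $d_{ij} = \lVert x_j - v_i \rVert$, and fuzzy index $m > 1$).

```latex
\begin{align*}
\mathcal{L}(U, V, \lambda)
  &= \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2}
   + \sum_{j=1}^{n} \lambda_j \Bigl( 1 - \sum_{i=1}^{c} u_{ij} \Bigr), \\
\frac{\partial \mathcal{L}}{\partial u_{ij}}
  &= m\, u_{ij}^{m-1} d_{ij}^{2} - \lambda_j = 0
  \;\Longrightarrow\;
  u_{ij} = \Bigl( \frac{\lambda_j}{m\, d_{ij}^{2}} \Bigr)^{1/(m-1)}, \\
\sum_{k=1}^{c} u_{kj} = 1
  &\;\Longrightarrow\;
  \Bigl( \frac{\lambda_j}{m} \Bigr)^{1/(m-1)}
  = \frac{1}{\sum_{k=1}^{c} d_{kj}^{-2/(m-1)}}
  \;\Longrightarrow\;
  u_{ij} = \frac{1}{\sum_{k=1}^{c} \bigl( d_{ij} / d_{kj} \bigr)^{2/(m-1)}}, \\
\frac{\partial \mathcal{L}}{\partial v_i}
  &= -2 \sum_{j=1}^{n} u_{ij}^{m} (x_j - v_i) = 0
  \;\Longrightarrow\;
  v_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}.
\end{align*}
```

The derivation assumes $d_{ij} > 0$ for every pair; a data point that coincides with a cluster center has to be handled separately.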
In the FCM clustering algorithm, an initial cluster center is set randomly. The classification and the cluster center are then adjusted according to formula (3) and formula (4) in each iteration. When the change of the cluster center between two adjacent iterations is less than the preset threshold $\varepsilon$, the algorithm is considered to have converged. However, the iterative workload is heavy and the clustering analysis is inefficient, so the algorithm is not suitable for multidimensional clustering analysis with a large amount of data. In particular, for big data processing in Three Gorges Reservoir Area water quality monitoring, it cannot meet the monitoring requirements quickly and efficiently.
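The iteration described above can be sketched in plain Python. This is a minimal illustration of the standard FCM updates, not the authors' implementation: the function names, the defaults `m=2.0`, `eps=1e-5`, and `max_iter=300`, the starting centers, and the toy data are all assumptions made for the example.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def update_memberships(points, centers, m):
    """Formula (3): u[i][j] is the membership of point j in cluster i."""
    u = [[0.0] * len(points) for _ in centers]
    for j, x in enumerate(points):
        d2 = [sq_dist(x, v) for v in centers]
        hits = [i for i, d in enumerate(d2) if d == 0.0]
        if hits:
            # Point sits exactly on a center: share full membership among those centers.
            for i in hits:
                u[i][j] = 1.0 / len(hits)
            continue
        # d^(-2/(m-1)) written in terms of the squared distance.
        inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
        total = sum(inv)
        for i, w in enumerate(inv):
            u[i][j] = w / total
    return u


def update_centers(points, u, m):
    """Formula (4): each center is the mean of all points weighted by u^m."""
    dims = range(len(points[0]))
    centers = []
    for row in u:
        w = [uij ** m for uij in row]
        total = sum(w)
        centers.append([sum(wj * x[k] for wj, x in zip(w, points)) / total for k in dims])
    return centers


def fcm(points, centers, m=2.0, eps=1e-5, max_iter=300):
    """Alternate formulas (3) and (4) until the centers move by less than eps."""
    for iteration in range(1, max_iter + 1):
        u = update_memberships(points, centers, m)
        new_centers = update_centers(points, u, m)
        shift = math.sqrt(sum(sq_dist(a, b) for a, b in zip(centers, new_centers)))
        centers = new_centers
        if shift < eps:
            break
    return centers, u, iteration


# Toy 2-D data with two obvious groups (hypothetical values, not monitoring data).
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
centers, u, iterations = fcm(points, centers=[[1.0, 1.0], [4.0, 4.0]])
print(centers, iterations)
```

Every iteration touches all $c \times n$ center-point pairs, which is why the number of iterations dominates the running time on large data sets.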
3.2. Fast Fuzzy C-Means Clustering Algorithm
Further studies have shown the following. In the standard FCM clustering algorithm, the fuzzy cluster center stays very close to the hard cluster center over the usual range of the fuzzy index $m$. The deviation between them increases only slightly as $m$ increases: the similarity between them is 0.98 at the maximum deviation and is close to 1 in the vicinity of typical values of $m$. Pal and Bezdek suggest that the reasonable range is $m \in [1.5, 2.5]$, and we choose $m$ from this range in this paper. With this choice, the fuzzy cluster center is very close to the hard cluster center. Therefore, the hard cluster center can be used as the initial value of the fuzzy cluster center to accelerate convergence and reduce the number of iterations. Accordingly, an improved FCM clustering algorithm, called the fast fuzzy C-means clustering algorithm (FFCM), is obtained.
First, the HCM clustering algorithm is adopted to quickly determine the hard cluster center of the input data. Then the hard cluster center is used as the initial cluster center of the FCM clustering algorithm for the fuzzy clustering iteration. In this way the convergence of the algorithm is sped up, which gives the fast FCM clustering algorithm. The iterative steps of the algorithm are as follows.
(1) Determine the number of clusters $c$, select the constant $\varepsilon > 0$, and initialize the membership matrix $U^{(0)}$ by formula (1); set the number of iterations as $k = 0$.
(2) Calculate the cluster center as follows: $$v_i^{(k)} = \frac{\sum_{j=1}^{n} u_{ij}^{(k)} x_j}{\sum_{j=1}^{n} u_{ij}^{(k)}}.$$
(3) For the $k$th iteration, modify the membership matrix as follows: $$u_{ij}^{(k+1)} = \begin{cases} 1, & d_{ij}^{(k)} = \min_{1 \le r \le c} d_{rj}^{(k)}, \\ 0, & \text{otherwise}. \end{cases}$$
(4) If the change between two adjacent iterations is less than $\varepsilon$, then enter the next step. Otherwise, set $k = k + 1$ and go to step (2).
(5) Set the iteration number $k = 0$. Set $V^{(0)}$, the initial cluster center of the FCM clustering algorithm, to the hard cluster center obtained in steps (1) to (4).
(6) Calculate the membership matrix $U^{(k)}$ according to $V^{(k)}$ by formula (3).
(7) Adjust the cluster center further to obtain $V^{(k+1)}$ according to formula (4).
(8) If $\lVert V^{(k+1)} - V^{(k)} \rVert < \varepsilon$, the algorithm ends; output the cluster center $V$ and the membership matrix $U$. Otherwise, set $k = k + 1$ and return to step (6).
The calculated membership matrix gives the membership of each data point in each cluster, and for a given data point the degree of membership indicates the cluster that the point belongs to. The hard cluster center is determined by the HCM clustering algorithm in steps (1) to (4). Because the hard cluster center is very close to the fuzzy cluster center, the hard cluster center obtained from these steps is taken as the initial cluster center of the FCM clustering algorithm. The data samples are then processed by FCM clustering in the remaining steps, so that the fuzzy clustering algorithm converges faster. The number of iterations in the fuzzy clustering algorithm is reduced, and the time of the clustering analysis is reduced greatly. For water environment monitoring of the Three Gorges Reservoir Area, the water quality data can be clustered rapidly by this fast fuzzy C-means clustering algorithm, which can be used to judge the water status quickly.
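The two-stage procedure above can be sketched as follows. This is an illustrative reading of the steps rather than the authors' code: the round-robin initial partition, the "partition unchanged" stopping test for the hard stage, the defaults `m=2.0` and `eps=1e-5`, the function names, and the toy data are all assumptions. The sketch also assumes there are at least as many points as clusters.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def hcm(points, c, max_iter=100):
    """Hard stage: hard C-means from an arbitrary initial hard partition."""
    labels = [j % c for j in range(len(points))]  # assumption: round-robin start
    centers = [None] * c
    for _ in range(max_iter):
        # Each center is the mean of the points currently assigned to it.
        for i in range(c):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:  # an emptied cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
        # Assign every point to its nearest center.
        new_labels = [min(range(c), key=lambda i: sq_dist(x, centers[i])) for x in points]
        # Stop when the hard partition no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return centers


def fcm(points, centers, m=2.0, eps=1e-5, max_iter=300):
    """Fuzzy stage: fuzzy C-means started from the given centers."""
    dims = range(len(points[0]))
    for k in range(1, max_iter + 1):
        # Memberships; cols[j][i] is the membership of point j in cluster i.
        cols = []
        for x in points:
            d2 = [max(sq_dist(x, v), 1e-12) for v in centers]  # avoid division by zero
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            cols.append([w / total for w in inv])
        # Centers as u^m-weighted means of all points.
        new_centers = []
        for i in range(len(centers)):
            w = [col[i] ** m for col in cols]
            total = sum(w)
            new_centers.append([sum(wj * x[d] for wj, x in zip(w, points)) / total for d in dims])
        # Stop when the centers move by less than eps.
        shift = math.sqrt(sum(sq_dist(a, b) for a, b in zip(centers, new_centers)))
        centers = new_centers
        if shift < eps:
            break
    return centers, cols, k


def ffcm(points, c, m=2.0, eps=1e-5):
    """Fast FCM: the hard cluster centers seed the fuzzy iteration."""
    return fcm(points, hcm(points, c), m, eps)


# Toy 2-D data with two obvious groups (hypothetical values, not monitoring data).
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
centers, memberships, iterations = ffcm(points, c=2)
print(centers, iterations)
```

The hard stage is cheap because it needs only comparisons and plain means, so spending a few hard iterations to start the fuzzy stage near its fixed point is the trade the algorithm relies on.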
4. Simulation and Analysis
To verify the feasibility of the big data processing algorithm and its application to water environment monitoring in the Three Gorges Reservoir Area, we use the massive data collected by the “ultra-large-scale WSN for water environment monitoring of Three Gorges Reservoir Area,” refer to the monitoring data of the environment inspection department of Chongqing [27], and select 1024 groups of sample data for a simulation test. Each group of data contains three characteristic parameters (DO, CODMn, and NH3–N), all in mg/L.
The standard of surface water environment quality divides water quality into five classes named I, II, III, IV, and V, in which class I is the best, class V is the worst, and classes II, III, and IV are intermediate. Accordingly, we select $c = 5$ clusters, and we assume that the collected water samples obey the uniform distribution. Next, we analyze the samples with the traditional clustering algorithm and with the fast fuzzy C-means clustering algorithm proposed in this paper and compare the performance of the algorithms.
First, the cluster centers of the five kinds of water quality are computed with the HCM clustering algorithm and the FCM clustering algorithm, respectively. The cluster centers obtained by the two algorithms in simulation are shown in Table 1. Table 1 shows that the HCM cluster center is very close to the FCM cluster center, so the HCM cluster center can be used as the initial cluster center of the FCM clustering algorithm.
We cluster the selected sample data by the FCM clustering algorithm and the FFCM clustering algorithm, respectively, and then analyze the convergence and the iterative process of the two algorithms, which are shown in Figure 2. As Figure 2 shows, the new FFCM clustering algorithm converges much faster than the FCM clustering algorithm. The HCM stage itself converges very quickly and needs very few iterations. Further analysis reveals that the FCM algorithm requires 44 iterations, whereas the FFCM algorithm needs only 24, roughly 45% fewer. The new algorithm can therefore reduce the required clustering time effectively and improve the efficiency of the data processing.
The cluster centers of five kinds of water quality conditions are shown in Table 2, which are obtained by the cluster analysis of the FFCM clustering algorithm. The cluster analysis results and the distribution of water quality under different factors are shown in Figures 3, 4, 5, and 6.
From Figure 3 to Figure 6, the different colors represent different water quality; it is shown that the water quality is affected by 3 characteristic parameters: DO, CODMn, and NH3–N. With comprehensive analysis of these three sets of parameters of water quality indicators, we can judge the quality correctly. The water quality situation can be considered comprehensively by the fast fuzzy C-means clustering algorithm from the three main indicators. It divides the 1024 selected sample points into five categories and determines the water quality level.
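Dividing the samples into categories from a fuzzy result is commonly done by assigning each sample to the cluster in which its membership is largest. The sketch below shows that step only; the function name and the small membership matrix are hypothetical, and the paper does not state that this exact rule was used.

```python
def hard_labels(u):
    """u[i][j] is the membership of sample j in cluster i.

    Returns, for each sample, the index of the cluster with the largest membership.
    """
    n_clusters = len(u)
    n_samples = len(u[0])
    return [max(range(n_clusters), key=lambda i: u[i][j]) for j in range(n_samples)]


# Hypothetical memberships for three samples and three clusters (columns sum to 1).
u = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
]
labels = hard_labels(u)
print(labels)  # [0, 1, 2]
```

Mapping a cluster index to a water quality class still requires comparing the cluster centers with the class limits of the standard, which this sketch does not attempt.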
Figures 5 and 6 show that water quality is classified more accurately by DO and CODMn, whereas the effect of NH3–N is relatively less important. Mainly on the basis of DO and CODMn, we can make a preliminary determination of the water type. Water in which DO is higher than 11 mg/L can be determined as class I. If CODMn is less than 3 mg/L, the water can be determined as class II when DO is in the range of 9 mg/L to 10 mg/L and as class III when DO is in the range of 8 mg/L to 9 mg/L. When DO is less than 8 mg/L, the water is classified as class V. The water belongs to class IV when CODMn is more than 3 mg/L.
5. Conclusion
In this paper, a new fast fuzzy C-means clustering algorithm is proposed for the big data processing of Three Gorges Reservoir Area water environment monitoring. The result of hard clustering is used to set the initial value of fuzzy clustering, which speeds up convergence and improves the efficiency of big data processing. Simulation results show that, compared with HCM clustering and the standard FCM clustering algorithm, this algorithm not only realizes fuzzy clustering of the data effectively but also converges faster. The algorithm can quickly and efficiently discriminate water quality in the Three Gorges Reservoir Area, improve the efficiency of big data processing in Three Gorges Reservoir Area water quality testing, and provide a reliable scientific basis for water pollution control in the Three Gorges Reservoir Area. The algorithm can be applied to big data analysis and processing in complex ultra-large-scale WSNs, and it also offers some guidance for big data processing in other areas.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors acknowledge the following foundation items: the Scientific and Technological Project of Chongqing (no. cstc2012gg-yyjs40010) and the Natural Science Foundation of Chongqing (no. CSTC, 2008BB2340).
References
[1] X.-B. Li, X.-A. Liu, and D.-L. Fu, “Analysis on water quality variation of the mainstream and tributaries in Wushan section, Three Gorges Reservoir Region,” Environmental Science and Management, vol. 35, no. 5, pp. 122–128, 2010.
[2] Y. Xu, M. Zhang, L. Wang, L. Kong, and Q. Cai, “Changes in water types under the regulated mode of water level in Three Gorges Reservoir, China,” Quaternary International, vol. 244, no. 2, pp. 272–279, 2011.
[3] Y. Zhong and Y. Song, “Energy-saving adaptive routing algorithm for large-scale wireless sensor network,” Computer Engineering and Applications, vol. 49, no. 1, pp. 89–93, 2013.
[4] Y. Zhong, L. Cheng, L. Zhang, Y. Song, and H. R. Karimi, “Energy-efficient routing control algorithm in large-scale WSN for water environment monitoring with application to Three Gorges Reservoir Area,” The Scientific World Journal, vol. 2014, Article ID 802915, 9 pages, 2014.
[5] S. Yin, X. Li, H. Gao, and O. Kaynak, “Data-based techniques focused on modern industry: an overview,” IEEE Transactions on Industrial Electronics, 2014.
[6] S. Yin, G. Wang, and X. Yang, “Robust PLS approach for KPI-related prediction and diagnosis against outliers and missing data,” International Journal of Systems Science, vol. 45, no. 7, pp. 1375–1382, 2014.
[7] S. Yin, S. X. Ding, X. Xie, and H. Luo, “A review on basic data-driven approaches for industrial process monitoring,” IEEE Transactions on Industrial Electronics, no. 99, 10 pages, 2014.
[8] S. Yin, X. Gao, H. R. Karimi, and X. Zhu, “Study on support vector machine-based fault detection in Tennessee Eastman process,” Abstract and Applied Analysis, vol. 2014, Article ID 836895, 8 pages, 2014.
[9] S. Yin, X. Zhu, and H. R. Karimi, “Quality evaluation based on multivariate statistical methods,” Mathematical Problems in Engineering, vol. 2013, Article ID 639652, 10 pages, 2013.
[10] GB3838-2002, the surface water environment quality standard.
[11] K. Kambatla, G. Kollias, V. Kumar, and A. Grama, “Trends in big data analytics,” Journal of Parallel and Distributed Computing, vol. 74, no. 7, pp. 2561–2573, 2014.
[12] B. K. Tannahill and M. Jamshidi, “System of systems and big data analytics-bridging the gap,” Computers & Electrical Engineering, vol. 40, no. 1, pp. 2–15, 2014.
[13] C. A. Steed, D. M. Ricciuto, G. Shipman et al., “Big data visual analytics for exploratory earth system simulation analysis,” Computers and Geosciences, vol. 61, pp. 71–82, 2013.
[14] J.-P. Belaud, S. Negny, F. Dupros, D. Michéa, and B. Vautrin, “Collaborative simulation and scientific big data analysis: illustration for sustainability in natural hazards management and chemical process engineering,” Computers in Industry, vol. 65, no. 3, pp. 521–535, 2014.
[15] J. Fuzheng, “The general picture of the Three Gorges region,” Chongqing Architecture, vol. 9, no. 1, pp. 1–7, 2010.
[16] M. Samhouri, M. Abu-Ghoush, E. Yaseen, and T. Herald, “Fuzzy clustering-based modeling of surface interactions and emulsions of selected whey protein concentrate combined to l-carrageenan and gum Arabic solutions,” Journal of Food Engineering, vol. 91, no. 1, pp. 10–17, 2009.
[17] M. A. Ghoush, M. Samhouri, M. Al-Holy, and T. Herald, “Formulation and fuzzy modeling of emulsion stability and viscosity of a gum-protein emulsifier in a model mayonnaise system,” Journal of Food Engineering, vol. 84, no. 2, pp. 348–357, 2008.
[18] R. J. Hathaway and J. C. Bezdek, “Extending fuzzy and probabilistic clustering to very large data sets,” Computational Statistics and Data Analysis, vol. 51, no. 1, pp. 215–234, 2006.
[19] M. B. Al-Zoubi, A. Hudaib, and B. Al-Shboul, “A proposed fast Fuzzy C-Means algorithm,” WSEAS Transactions on Systems, vol. 6, no. 6, pp. 1191–1195, 2007.
[20] S. R. Kannan, R. Devi, S. Ramathilagam, and A. Sathya, “Some robust objectives of FCM for data analyzing,” Applied Mathematical Modelling, vol. 35, no. 5, pp. 2571–2583, 2011.
[21] X. Weixin and L. Jianzhuang, “The mergence of hard clustering and fuzzy clustering-a fast FCM algorithm with two layers,” Fuzzy Systems and Mathematics, vol. 6, no. 2, pp. 77–85, 1992.
[22] X. Gao and W. Xie, “Advances in theory and applications of fuzzy clustering,” Chinese Science Bulletin, vol. 45, no. 11, pp. 961–970, 2000.
[23] H. Tang, T. Fang, P. Du, and P. Shi, “Intra-dimensional feature diagnosticity in the Fuzzy Feature Contrast Model,” Image and Vision Computing, vol. 26, no. 6, pp. 751–760, 2008.
[24] M. Samhouri, M. Abughoush, and T. Herald, “Fuzzy identification and modeling of a gum-protein emulsifier in a model mayonnaise color development system,” International Journal of Food Engineering, vol. 3, no. 4, article 11, 2007.
[25] L. Wang, W. Wang, and Y.-X. Li, “Fuzzy clustering algorithm based on artificial immune cell mode,” Computer Engineering, vol. 37, no. 5, pp. 13–15, 2011.
[26] Z.-Y. Yang, X.-Y. Huang, C. H. Du, and M.-X. Tang, “Study of urban traffic congestion judgment based on FFCM clustering,” Application Research of Computers, vol. 25, no. 9, pp. 2768–2770, 2008.
[27] “Chongqing environmental monitoring center environment quality automatic monitoring of water quality weekly report[EB/OL],” 2014, http://www.cqemc.cn/.