#### Abstract

The water quality index is the most convenient way of communicating the water quality status of water bodies, but its evaluation involves subjectivity in terms of user involvement and the handling of uncertainty. Recently, artificial intelligence algorithms that are suited to nonlinear forecasting and to dealing with uncertainties have been applied to various domains of water quality forecasting. This paper focuses on the development of a data-driven adaptive neuro-fuzzy system for the water quality index using a real data set obtained from eight monitoring stations across River Satluj in northern India. The novelty of the paper lies in estimating the water quality index with ANFIS models based on two different clustering techniques, fuzzy C-means (FCM) and subtractive clustering (SC), and in assessing their predictive accuracy. Each model was trained, validated, and tested on the index obtained from seven water quality parameters: pH, conductivity, chlorides, total dissolved solids (TDS), nitrates, ammonia, and fecal coliforms. The models were evaluated on the basis of statistical performance criteria. Based on these evaluations, the SC-ANFIS method gave more accurate results than the FCM-ANFIS method. The SC-ANFIS model was further used to identify the sensitive parameters across the monitoring stations that were capable of changing the existing water quality index value.

#### 1. Introduction

Surface water quality evaluation is an issue that repeatedly draws the attention of regulatory agencies for the purpose of safeguarding various intended uses. In this regard, continuous water quality monitoring is undertaken so as to assess the water quality and thereby propose adequate measures for its management. One of the several ways in which the large amount of data generated from the monitoring stations is assimilated for easy communication to the stakeholders is the water quality index (WQI). The water quality index is a single number that expresses water quality by aggregating the measurements of several water quality parameters. Its output ranges from 0 to 100. A value of 100 represents excellent water quality, while zero indicates water not suitable for the intended use.

Several advancements have been seen in evaluating WQI ever since Horton [1] proposed the first index in 1965. These advancements have mainly come in the form of soft computing tools such as data mining techniques, artificial intelligence, and fuzzy modeling systems. These techniques can take care of the uncertainties that often accumulate in the traditional way of evaluating WQI [2, 3]. Babbar and Babbar [3] employed different data mining techniques such as support vector machines, *k* nearest neighbor, decision trees, and naïve Bayes to develop a predictive environment that classifies water quality into understandable terms based on the overall index of pollution. Various researchers have successfully employed AI techniques to evaluate the water quality status [4–13]. However, the choice of method depends on the quality and quantity of the data available. Fuzzy logic methods, on the other hand, are heuristic system descriptions that use if-then rules to establish quantitative relationships among the input and output variables [14]. The main problem with the fuzzy model is that there is no systematic procedure to define the membership function parameters, which must be predetermined by expert knowledge about the modeled system. At the same time, an artificial neural network (ANN) has the ability to learn from input-output pairs and adapt to them in an interactive way. Based on this understanding, a hybrid technique, the adaptive neuro-fuzzy inference system (ANFIS), is gaining popularity in dealing with ill-defined and uncertain domains such as water quality prediction. ANFIS is a technique that embeds the fuzzy inference system into the framework of adaptive networks. ANFIS thus draws the benefits of both ANN and fuzzy techniques in a single framework.
One of the major advantages of the ANFIS method over fuzzy systems is that it eliminates the basic problem of defining the membership function parameters and obtaining a set of fuzzy if-then rules. The learning capability of ANN is used for automatic fuzzy if-then rule generation and parameter optimization in ANFIS [15]. Jang [16] introduced the concept of ANFIS, and since then it has been successfully implemented in various water quality problems. Yan et al. [17] developed an ANFIS model for classifying the water quality status of a river and compared its performance with ANN. In that study, different types of membership functions (generalized bell, Gaussian, trapezoidal, and triangular) were compared on training and testing data, and the best-fit model was obtained using the Gaussian membership function. Sahu et al. [18] employed ANFIS to predict the WQI of groundwater.

Rahimzadeh et al. [19] suggested an ANFIS approach for the prediction of oily wastewater microfiltration permeate volume. Similarly, Talebizadeh and Moridnejad [20] compared ANN with ANFIS in forecasting lake level fluctuations, where ANFIS turned out to be superior to ANN in terms of efficiency. For water quality prediction, ANFIS has been applied to biochemical oxygen demand estimation with other water quality parameters as inputs [21]. Despite improvements in prediction models, increasing model accuracy while avoiding overfitting is still a challenge for most researchers. According to recent research, model performance can be significantly improved if an appropriate hybrid of multiple models is used for forecasting and prediction rather than a single model [22].

As is apparent from the above studies, ANFIS is popular in applications and well established in the literature, but the choice of algorithm for automatically generating membership functions has not been given due consideration. Therefore, this study presents the automatic generation of membership functions for WQI using fuzzy C-means (FCM) and subtractive clustering- (SC-) based ANFIS models, which yield homogeneous clusters (or classes) of WQI. Clustering is an important method for data analysis and decision making that allows useful information to be retrieved by grouping or categorizing multidimensional data into clusters. Since water quality data are multidimensional, performing two different clustering techniques allows the resulting model performance to be evaluated and compared.

Thus, the present study aims to develop ANFIS models based on two different clustering methods and to identify statistically the better of the two. The second objective is to employ the better model to find the most sensitive water quality parameters that can cause a change in the predicted water quality.

#### 2. Materials and Methods

##### 2.1. Study Area

River Satluj is one of the five rivers of the Indus River system and joins the Indus River on its eastern side. It originates from the *Manasarovar-Rakas* Lakes in western Tibet at a height of 4,570 m, within 80 km of the source of the Indus. Along its course, it flows through the Himalayan range in the Indian state of Himachal Pradesh and enters the plains of Punjab from the Shivalik hills near Nangal, India. The river carries historical importance because its water allocation is part of the famous Indus Waters Treaty between India and Pakistan and because the world’s highest gravity dam, Bhakra Nangal Dam, is built across the river at the point where it enters Punjab (India). Satluj River is extensively used for irrigation as well as drinking purposes. The hydrology of the river is controlled by snowmelt from the Himalayas and the South Asian monsoons.

This study focuses on a 238 km stretch of Satluj River flowing in the Punjab state of India. The watershed area of the river in this stretch is about 10,880 km^{2} and is geographically bounded between 31° 45′ N, 74° 57′ E and 30° 45′ N, 76° 50′ E. The river is regularly monitored at 8 monitoring stations by Thapar University, Patiala (India), under the river monitoring program initiated by the Government of India in 1996. These monitoring stations are strategically located for a comprehensive study of the water quality of the river in Punjab (India). The locations of the monitoring stations along the main river are shown in Figure 1.

Studies have shown that the water quality upstream (U/S) of Headwork Nangal is of “A” class, that is, of pristine quality. Further downstream, however, the river quality deteriorates within the study stretch. This deterioration is caused by the release of toxic effluents from industries and by untreated sewage discharges, which make the river water unfit for any beneficial use.

##### 2.2. Water Quality Parameters

The water quality parameters that are routinely monitored at the eight monitoring sites are pH, conductivity, chlorides, dissolved oxygen (DO), 5-day biochemical oxygen demand (BOD_{5}), total dissolved solids (TDS), suspended solids (SS), ammoniacal-N, nitrates, total phosphorus (TP), and fecal coliform (FC). All these parameters are monitored on a monthly basis and analyzed according to the APHA standard methods [23]. For the purpose of this study, the historic water quality data from 1996 to 2012 are taken into consideration. For each station and each parameter, the month-wise 16-year average concentration was computed as given in Table 1. Table 1 shows that water quality deteriorates in terms of DO (% sat.), BOD, and fecal coliform from station 6 onward. This is because two small rivulets, *Budha Nallah* and *Chitti Bein*, join the river after station 6. Both rivulets discharge a large quantity of industrial and domestic effluent into River Satluj.

##### 2.3. Water Quality Index

The formulation for the water quality index for River Satluj is given in the study of Sharma and Reddy [24]. In it, the calculations for water quality index were performed taking into consideration three different water uses: ecological, irrigation, and municipal and domestic use. The steps involved included (a) selection of parameters, (b) assignment of weights to the selected parameter, (c) transformation of the monitored parameter values into common environmental scale units through use of parameter rating curves/equations (PRC/E) as given in Table 2, and (d) aggregation of the parameter values into the final score.

On the basis of the final score, the water quality is classified into six classes, with a higher score indicating better quality: very poor (<45), poor (45–60), fair (61–69), good (70–79), very good (80–90), and excellent (91–100).

Mathematically, the WQI is expressed as

$$\mathrm{WQI} = \sum_{i=1}^{n} W_i S_i, \tag{1}$$

where $S_i$ is the subindex for the $i$th water quality parameter, $W_i$ is the weighting factor associated with each water quality parameter, and $n$ is the number of water quality parameters.

In this study, the WQI was computed for municipal and domestic water use. The parameters considered in this category were pH, conductivity, chlorides, TDS, ammonia-N, nitrates, and fecal coliform. These parameters were selected because they are considered useful for characterizing municipal and industrial waste and are routinely monitored for any polluted river; they therefore form a small set of parameters that sufficiently conveys the overall water quality for the designated uses. More parameters can be used for the estimation of WQI; however, the larger the number of parameters, the greater the logistic burden of monitoring and analyzing them. Since every parameter has its own importance with regard to water quality, the parameters need to be ranked by relative importance through a weighting factor. In formulating the index, these weights were obtained from the expert opinion analysis described in detail by Sharma and Reddy [24]. The transformation equations that were used to develop the WQI for the 16-year data in this study are given in Table 2. These transformation equations represent parameter rating curves that are meant to bring the different units of the different parameters onto a single scale for direct aggregation in (1). The equations were developed on the basis of water quality standards and Indian effluent standards. In the equations given in Table 2, *x* represents the concentration range and *y* represents the scaled value of the parameter under consideration.
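
As a rough illustration of steps (b) to (d) and of the class boundaries listed above, the following Python sketch aggregates already-scaled subindices into a score and maps the score to a class. It assumes a weighted arithmetic aggregation normalized by the total weight. The function names, weights, and subindex values are hypothetical placeholders, not the expert-derived values of Sharma and Reddy [24], and the rule that a fractional score belongs to a class once it reaches that class's published lower bound is likewise an assumption.

```python
def aggregate_wqi(subindices, weights):
    """Weighted arithmetic mean of subindices, each already on a 0-100 scale.

    Assumption: weights are normalized by their sum, so they need not add to 1.
    """
    total_weight = sum(weights.values())
    return sum(weights[name] * subindices[name] for name in weights) / total_weight


def classify_wqi(score):
    """Map a WQI score to the six classes named in the text.

    Assumption: a score falls in a class once it reaches that class's
    published lower bound (so 60.5 is still "poor").
    """
    lower_bounds = (
        (91, "excellent"),
        (80, "very good"),
        (70, "good"),
        (61, "fair"),
        (45, "poor"),
    )
    for lower_bound, label in lower_bounds:
        if score >= lower_bound:
            return label
    return "very poor"


# Illustrative values only; three of the seven parameters are shown.
subindices = {"pH": 90.0, "TDS": 70.0, "fecal_coliform": 50.0}
weights = {"pH": 1.0, "TDS": 1.0, "fecal_coliform": 2.0}

score = aggregate_wqi(subindices, weights)
label = classify_wqi(score)
print(score, label)
```

Normalizing by the total weight keeps the score on the 0 to 100 scale even when the illustrative weights do not sum to one.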

#### 3. Description of ANFIS

ANFIS is a multilayer feed-forward network that uses neural network learning algorithms and fuzzy logic to map an input space to an output space [25]. Jang [16] proposed the adaptive neuro-fuzzy inference system (ANFIS) to construct an input-output mapping from an initial fuzzy system and the available input-output data pairs by using learning procedures. This system can achieve a highly nonlinear mapping and is superior to common linear methods in reproducing nonlinear time series [26]. Two fuzzy inference systems (FIS) are commonly employed to map an input space to an output space: the Mamdani inference system and the Sugeno inference system, both of which are described in the literature [27, 28]. The consequents of the fuzzy rules differ between these two inference models, and thus their aggregation and defuzzification procedures differ accordingly. The Sugeno system is, however, considered more compact and computationally efficient than Mamdani’s system [27]. The consequent in a Sugeno FIS is either a linear equation, called a *first-order Sugeno FIS*, or a constant coefficient, called a *zero-order Sugeno FIS* [26].

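If it is assumed that the system includes two inputs, $x$ and $y$, and one output, $f$, and that the rule base contains two fuzzy if-then rules, then the rules for the first-order Sugeno FIS can be expressed as

$$\begin{aligned}
&\text{Rule 1: if } x \text{ is } A_1 \text{ and } y \text{ is } B_1, \text{ then } f_1 = p_1 x + q_1 y + r_1,\\
&\text{Rule 2: if } x \text{ is } A_2 \text{ and } y \text{ is } B_2, \text{ then } f_2 = p_2 x + q_2 y + r_2,
\end{aligned}$$

where $p_i$, $q_i$, and $r_i$ ($i = 1, 2$) are the linear parameters in the consequent part of the Sugeno fuzzy inference system.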

The architecture of ANFIS consists of five layers. Each layer contains several nodes described by the node function. Adaptive nodes, denoted by squares, represent the parameter sets that are adjustable in these nodes, whereas fixed nodes, denoted by circles, represent the parameter sets that are fixed in the system. The output data from the nodes in the previous layers are the input in the present layer. The description of each layer in the ANFIS architecture is given below.

Layer 1 is the fuzzification layer, in which each node represents the membership grade of a crisp input and each node’s output is computed by

$$O_{1,i} = \mu_{A_i}(x), \quad i = 1, 2; \qquad O_{1,i} = \mu_{B_{i-2}}(y), \quad i = 3, 4,$$

where $x$ and $y$ are the crisp inputs to node $i$, and $A_i$ and $B_{i-2}$ are the linguistic labels characterized by the membership functions $\mu_{A_i}$ and $\mu_{B_{i-2}}$, respectively. The Gaussian membership function is given by

$$\mu_{A_i}(x) = \exp\left(-\frac{(x - c_i)^2}{2\sigma_i^2}\right),$$

where $\{c_i, \sigma_i\}$ is the parameter set of the membership function in the premise part of the fuzzy if-then rules, which modifies the shape of the membership function. The parameters in this layer are called premise parameters.

In layer 2, each node provides the strength of a rule by means of the multiplication operator given in (5). The output of this layer is the firing strength, the product of the corresponding membership degrees obtained from layer 1:

$$O_{2,i} = w_i = \mu_{A_i}(x)\,\mu_{B_i}(y), \quad i = 1, 2. \tag{5}$$

The membership values $\mu_{A_i}(x)$ and $\mu_{B_i}(y)$ are multiplied to find the strength of the rule in which the variable $x$ has linguistic value $A_i$ and $y$ has linguistic value $B_i$.

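Layer 3 is the normalization layer, which normalizes the strength of each rule by the sum of the strengths of all rules:

$$O_{3,i} = \bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2,$$

where $w_i$ is the firing strength of rule $i$ obtained from layer 2.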

Layer 4 is a layer of adaptive nodes. Every node in this layer computes the contribution of its rule towards the overall output with the function

$$O_{4,i} = \bar{w}_i f_i = \bar{w}_i\,(p_i x + q_i y + r_i), \quad i = 1, 2,$$

where $\bar{w}_i$ is the output of layer 3 and $\{p_i, q_i, r_i\}$ is the parameter set. Parameters in this layer are referred to as consequent parameters.

Layer 5 is the output layer, in which a single node computes the overall output by summing the contributions of all rules from the previous layer. Accordingly, the defuzzification process transforms each rule’s fuzzy results into a crisp output in this layer. The output, $f$, is computed as in (7):

$$O_5 = f = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i}. \tag{7}$$
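
The five layers described above can be traced in a few lines of Python for the two-input, two-rule first-order Sugeno case. This is a minimal sketch only: the parameterization of each Gaussian membership function by a center `c` and width `sigma`, the function names, and all numerical values are illustrative assumptions and do not represent the fitted model of this study.

```python
import numpy as np


def gaussian_mf(value, c, sigma):
    """Gaussian membership grade with center c and width sigma (assumed form)."""
    return np.exp(-((value - c) ** 2) / (2.0 * sigma ** 2))


def anfis_forward(x, y, premise_a, premise_b, consequent):
    """Forward pass of a two-input first-order Sugeno ANFIS.

    premise_a, premise_b: lists of (c, sigma), one pair per rule.
    consequent: list of (p, q, r), one triple per rule.
    """
    # Layer 1: membership grades of the crisp inputs.
    mu_a = np.array([gaussian_mf(x, c, s) for c, s in premise_a])
    mu_b = np.array([gaussian_mf(y, c, s) for c, s in premise_b])
    # Layer 2: firing strength of each rule (product operator).
    w = mu_a * mu_b
    # Layer 3: normalized firing strengths.
    w_bar = w / w.sum()
    # Layer 4: each rule's weighted linear consequent.
    f_rule = np.array([p * x + q * y + r for p, q, r in consequent])
    contributions = w_bar * f_rule
    # Layer 5: overall crisp output.
    return contributions.sum()


# Illustrative parameters: two rules placed symmetrically about the origin.
premise_a = [(-1.0, 1.0), (1.0, 1.0)]
premise_b = [(-1.0, 1.0), (1.0, 1.0)]
consequent = [(1.0, 1.0, 2.0), (0.0, 0.0, 4.0)]

output = anfis_forward(0.0, 0.0, premise_a, premise_b, consequent)
print(output)
```

At the origin both rules fire equally in this example, so the output is the plain average of the two rule consequents.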

ANFIS applies the hybrid-learning algorithm, which consists of the combination of “gradient descent” and “least-squares” methods to update the model parameters. Each epoch of this hybrid learning procedure is composed of a forward pass and a backward pass. In the forward pass of the hybrid learning procedure, the node output goes forward until layer 4 and the consequent parameters are identified by the least squares method. In the backward pass, the error signal propagates backwards and the premise parameters are updated by gradient descent. The detailed description of this algorithm is given in Jang and Sun [25].
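
The forward-pass half of the hybrid procedure can be sketched as follows: with the premise parameters held fixed, the network output is linear in the consequent parameters, so they can be estimated by ordinary least squares. The sketch below is a simplified illustration under that assumption and not the MATLAB implementation used in this study. The function names and the synthetic one-input data are hypothetical, and the gradient descent update of the premise parameters in the backward pass is not shown.

```python
import numpy as np


def normalized_firing_strengths(X, centers, sigmas):
    """Layers 1 to 3 for Gaussian memberships: returns an (N, R) array."""
    diff = X[:, None, :] - centers[None, :, :]                # (N, R, d)
    mu = np.exp(-(diff ** 2) / (2.0 * sigmas[None, :, :] ** 2))
    w = mu.prod(axis=2)                                        # product over inputs
    return w / w.sum(axis=1, keepdims=True)


def fit_consequents(X, target, centers, sigmas):
    """Least-squares estimate of the consequent parameters, premise fixed."""
    n_samples = X.shape[0]
    w_bar = normalized_firing_strengths(X, centers, sigmas)
    X1 = np.hstack([X, np.ones((n_samples, 1))])               # inputs plus bias term
    # Column block r holds w_bar[:, r] times [inputs, 1].
    design = (w_bar[:, :, None] * X1[:, None, :]).reshape(n_samples, -1)
    theta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return theta.reshape(centers.shape[0], -1), design


# Synthetic one-input example with two rules (all values are illustrative).
X = np.linspace(0.0, 1.0, 30).reshape(-1, 1)
centers = np.array([[0.2], [0.8]])
sigmas = np.array([[0.3], [0.3]])
true_theta = np.array([[1.0, 0.0], [-2.0, 3.0]])

_, design = fit_consequents(X, np.zeros(30), centers, sigmas)
target = design @ true_theta.ravel()                           # noise-free targets
theta_hat, _ = fit_consequents(X, target, centers, sigmas)
prediction = design @ theta_hat.ravel()
```

Because the targets are generated without noise from the same structure, the fitted model reproduces them to numerical precision; with real data the residual would be nonzero.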

#### 4. Modeling with ANFIS

The proposed methodology for applying ANFIS to water quality evaluation is shown as a flow chart in Figure 2. In the current study, the input parameters of the ANFIS are pH, conductivity, TDS, chlorides, nitrates, ammonia, and fecal coliforms, and the output is the WQI. The data for the selected water quality parameters were first converted into a number of principal components and then loaded as input into the ANFIS, with Gaussian membership functions for the input parameters.

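The relationship between the water quality index and the input variables, as generalized by ANFIS, was of the form

$$\mathrm{WQI} = f(\text{pH}, \text{conductivity}, \text{TDS}, \text{chlorides}, \text{nitrates}, \text{ammonia}, \text{fecal coliforms}).$$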

For the purpose of modeling, the water quality data obtained from eight monitoring stations equaling a total of 204 observations were divided into three sets: training data, checking data, and testing data. The training data were used for the training of ANFIS, while the checking data were used for verifying the identified ANFIS. The testing data were used to evaluate the model performance. The first data set, containing 70% of the records, was used as the training data; the second data set containing 15% of the records was used as the checking data, while the remaining 15% data were applied as the testing data. The target values represented the WQI computed using (1).
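
The 70/15/15 partition described above can be illustrated with a short Python sketch. The source does not state how records were assigned to the three sets, so the random shuffle, the fixed seed, and the rounding rule used here are assumptions made only for illustration.

```python
import numpy as np


def split_indices(n_records, train_frac=0.70, check_frac=0.15, seed=0):
    """Return index arrays for training, checking, and testing sets.

    Assumption: records are shuffled at random; the testing set takes
    whatever remains after rounding the first two set sizes.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_records)
    n_train = int(round(train_frac * n_records))
    n_check = int(round(check_frac * n_records))
    train = order[:n_train]
    check = order[n_train:n_train + n_check]
    test = order[n_train + n_check:]
    return train, check, test


train_idx, check_idx, test_idx = split_indices(204)
print(len(train_idx), len(check_idx), len(test_idx))
```

With 204 records the fractions do not divide evenly, so the three set sizes depend on the rounding rule chosen.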

In this study, the MATLAB Fuzzy Logic Toolbox ANFIS GUI was used as the modeling tool. Two separate clustering algorithms were used to automatically generate Gaussian-shaped membership functions: fuzzy c-means (FCM) and subtractive clustering (SC). Both techniques generate fuzzy if-then rules but differ in their implementation. FCM is a data clustering technique wherein each data point belongs to a cluster to some degree that is specified by a membership grade. Originally introduced by Bezdek [29], it provides a method for grouping data points that populate a multidimensional space into a specified number of clusters; 12 clusters were used here.
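
To make the idea of graded cluster membership concrete, the following Python sketch implements the standard FCM iteration (alternating updates of cluster centers and membership grades). It is a generic illustration and not the MATLAB routine used in this study. The fuzzifier `m = 2`, the random initialization, the stopping tolerance, and the two-blob test data are all assumptions.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Basic fuzzy c-means; returns (centers, membership matrix U)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalized to sum to one.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centers are membership-weighted means of the data.
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances of every point to every center (guarded against zero).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)
        # Membership update: closer centers receive higher grades.
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Two small, well-separated groups of points (illustrative data).
grid = np.array([[i, j] for i in range(3) for j in range(3)]) * 0.01
X = np.vstack([grid + 0.1, grid + 0.8])

centers, U = fuzzy_c_means(X, n_clusters=2)
centers_sorted = centers[np.argsort(centers[:, 0])]
```

Each row of `U` holds the membership grades of one data point across the clusters, which is what distinguishes FCM from a hard partition.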

On the other hand, subtractive clustering is useful when there is no clear idea of how many clusters there should be for a given data set. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers in a set of data [30]. The cluster estimates obtained can be used to initialize iterative optimization-based clustering methods and model identification methods such as ANFIS. In this study, 15 cluster centers were determined for the 204 observations. The number of fuzzy rules equals the number of cluster centers, with each rule representing the characteristics of one cluster. The parameters of the two clustering methods and the values used for modeling are given in Table 3.
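
The way subtractive clustering arrives at the number of clusters can be sketched in Python as follows. Each data point is assigned a potential based on the density of its neighbors, the point with the highest potential becomes a center, and the potential near that center is then subtracted before the next center is chosen. This is a simplified sketch: the single-threshold stopping rule replaces the accept/reject criteria of the original algorithm, and the radius `ra`, the squash factor, the stopping ratio, and the test data are illustrative assumptions, not the settings of Table 3.

```python
import numpy as np


def subtractive_clustering(X, ra=0.5, squash=1.5, stop_ratio=0.15, max_centers=50):
    """Simplified subtractive clustering on data scaled to the unit hypercube.

    Assumption: selection stops once the highest remaining potential drops
    below stop_ratio times the first peak.
    """
    X = np.asarray(X, dtype=float)
    alpha = 4.0 / ra ** 2
    beta = 4.0 / (squash * ra) ** 2
    # Squared distances between all pairs of points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Potential of each point: high where many neighbors lie close by.
    potential = np.exp(-alpha * d2).sum(axis=1)
    first_peak = potential.max()
    centers = []
    while len(centers) < max_centers:
        k = int(np.argmax(potential))
        peak = potential[k]
        if peak < stop_ratio * first_peak:
            break
        centers.append(X[k].copy())
        # Reduce the potential of points near the newly chosen center.
        potential = potential - peak * np.exp(-beta * d2[:, k])
    return np.array(centers)


# Two small, well-separated groups of points (illustrative data).
grid = np.array([[i, j] for i in range(3) for j in range(3)]) * 0.01
X = np.vstack([grid + 0.1, grid + 0.8])

centers = subtractive_clustering(X)
centers_sorted = centers[np.argsort(centers[:, 0])]
```

Unlike FCM, the number of centers is an output of the procedure; it is governed by the neighborhood radius and the stopping rule rather than fixed in advance.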

For each algorithm defined above, ANFIS tunes the model by means of a hybrid technique that combines gradient descent backpropagation with least squares optimization. At each epoch, an error measure, the sum of the squared differences between the actual and desired outputs, is reduced. Once the values of the premise parameters were learned, the overall WQI was obtained as a linear combination of the consequent parameters.

The measured and predicted WQI values over a range of records are shown in Figure 3. It is evident from this figure that the measured and predicted WQI values are in closer agreement for the SC-ANFIS method than for the FCM-ANFIS method.

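In order to further verify the result, the performances of the two ANFIS models were evaluated according to different statistical criteria. The coefficient of determination (*R*^{2}), root mean square error (RMSE), and mean square error (MSE) were obtained using (10). The evaluated values are given in Table 4.

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (O_i - P_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (O_i - P_i)^2}, \qquad \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (O_i - P_i)^2, \tag{10}$$

where $O_i$, $P_i$, and $\bar{O}$ are the observed value, modeled value, and average value, respectively, and $N$ is the number of observations.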
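
For readers who wish to reproduce such criteria, a minimal Python sketch is given below. It assumes that *R*^{2} is computed as one minus the ratio of the residual sum of squares to the total sum of squares about the observed mean; the source does not spell out the exact form, and the function name and example values are hypothetical.

```python
import numpy as np


def regression_metrics(observed, predicted):
    """Return R2, RMSE, and MSE for paired observed and predicted values.

    Assumption: R2 = 1 - SS_residual / SS_total about the observed mean.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual = observed - predicted
    mse = float(np.mean(residual ** 2))
    rmse = float(np.sqrt(mse))
    ss_total = np.sum((observed - observed.mean()) ** 2)
    r2 = float(1.0 - np.sum(residual ** 2) / ss_total)
    return {"R2": r2, "RMSE": rmse, "MSE": mse}


# Illustrative values only, not WQI data from this study.
perfect = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A model that always predicts the observed mean scores an *R*^{2} of zero under this definition, which gives a useful baseline for reading Table 4.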

From Table 4, the *R*^{2} values for the training and testing data are 0.9919 and 0.98271, respectively. The RMSE and MSE values in Table 4 also indicate that the SC-ANFIS model predictions are closer to the experimental values than those of the FCM-ANFIS model.

Figure 4 shows the overall fit of the SC-ANFIS method, with the predicted WQI values plotted against the measured ones. The WQI values lie around a straight line passing through the origin, which implies very close agreement between the two.

#### 5. Sensitivity Analysis

Sensitivity analysis is a process that shows how model output values are affected by changes in model input values. The purpose of the sensitivity analysis was to determine which parameters can change the output value (WQI) to such an extent that the water quality class shifts from its existing class. For this, each input water quality parameter was perturbed by ±2 times its standard deviation, and the changes in WQI were noted using the SC-ANFIS predictive model. The perturbation of ±2 times the standard deviation was chosen to account for the variability of each parameter and its associated influence on WQI. Finally, the resulting WQI was compared to the reference values. Figures 5 and 6 show the effect on WQI of a −2d and a +2d variation of each input, respectively, where d denotes the standard deviation, for the eight monitoring stations.
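
The perturbation procedure can be outlined in Python as follows. The sketch takes any fitted predictor as a callable and reports the change in predicted WQI when one input at a time is shifted by ±2 standard deviations. The `predict` callable, the use of the sample standard deviation, and the linear stand-in model with its data are hypothetical and serve only to show the mechanics; they are not the SC-ANFIS model of this study.

```python
import numpy as np


def perturbation_sensitivity(predict, X, k=2.0):
    """Change in model output when each input is shifted by -k and +k SDs.

    predict: callable mapping an (N, d) array to N predicted values.
    Returns a dict keyed by (input index, sign) with arrays of changes.
    Assumption: the sample standard deviation (ddof=1) is used.
    """
    X = np.asarray(X, dtype=float)
    baseline = predict(X)
    sd = X.std(axis=0, ddof=1)
    changes = {}
    for j in range(X.shape[1]):
        for sign in (-1.0, 1.0):
            X_shifted = X.copy()
            X_shifted[:, j] += sign * k * sd[j]      # perturb one input only
            changes[(j, sign)] = predict(X_shifted) - baseline
    return changes


# Hypothetical stand-in for a fitted model: WQI falls linearly with both inputs.
coefficients = np.array([2.0, 0.5])


def linear_model(X):
    return 100.0 - X @ coefficients


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
changes = perturbation_sensitivity(linear_model, X)
```

The perturbed predictions would then be passed through the class boundaries of Section 2.3 to check whether the water quality class has shifted.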

The WQI changes its existing class when the perturbed parameter value shifts the WQI trend in a particular group of observations shown on the *x*-axis. In Figure 5, stations 6, 7, and 8 are visibly sensitive to different parameters. Station 6 is sensitive to ammonia, station 7 to TDS and chloride, and station 8 to ammonia and fecal coliforms. The change in WQI is large enough that the existing water quality class shifts from good to poor at stations 6 and 7 and from good to very poor at station 8.

Similarly, the pattern of change in Figure 6 shows that fecal coliform is the most sensitive parameter and that any perturbation in this parameter shifts the WQI at all stations. At station 8, the WQI changes from good to poor because of the +2d change in fecal coliform, whereas at the other stations the maximum change in water quality class is from good to fair. Chlorides and TDS are the most sensitive parameters at station 7, and ammonia is the most sensitive parameter at stations 3 and 5.

The water quality parameters such as ammonia, chlorides, TDS, and fecal coliforms are the indicator parameters that are strongly linked to municipal or domestic waste. The sensitivity of Satluj River to these four parameters across its monitoring stations indicates that the river is under the influence of strong municipal waste, and fluctuations in these parameters should be given due considerations when communicating the water quality class.

#### 6. Conclusions

In this study, two different clustering algorithms were used to develop the ANFIS model for water quality index prediction of River Satluj in northern India. The formulation for WQI of River Satluj is established in literature, and the index was computed by considering seven water quality parameters, crucial for municipal use. The data set for 16 years across eight monitoring stations on the river was used.

The two ANFIS models, based on subtractive clustering and fuzzy c-means, were trained, validated, and tested for modeling WQI. Based on the statistical evaluations, the SC-ANFIS model predictions at the training and testing stages were closer to the experimental values than those of the FCM-ANFIS model.

The SC-ANFIS model, because of its better predictive capability than the FCM-ANFIS model, was further used to perform the sensitivity analysis. The effect of a perturbation in each water quality parameter was modeled and analyzed. The analyses showed that ammonia, chlorides, and fecal coliform were the most sensitive parameters, capable of switching the existing water quality class to a poorer class, and hence warrant more attention in terms of their analysis for WQI computation.

The study reveals that ANFIS modeling with SC-ANFIS can be a useful approach to characterizing water quality in the form of a water quality index. Since the approach obviates the otherwise lengthy computations of WQI, the present study is important for developing a model and employing it for faster dissemination of information as well as for identifying the critical water quality parameters affecting WQI. The future scope of the work lies in using a combination of hybrid SC-FCM and ANN (a neurodynamic fuzzy expert system) to evaluate water quality.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.