Abstract

Cluster analysis, which partitions a dataset into groups so that similar elements are assigned to the same group and dissimilar elements to different ones, has been widely studied and applied in various fields. Two challenging tasks in clustering are determining a suitable number of clusters and generating clusters of arbitrary shapes. This paper proposes a new concept of “epsilon radius neighbors,” which plays an essential role in the cluster-forming process and thereby determines both the number and the shape of clusters automatically. Based on “epsilon radius neighbors,” a new clustering algorithm is proposed in which the epsilon radius value is adapted to the characteristics of each cluster in the current partition. Recently, clustering has been widely applied in environmental applications, including underground water quality monitoring. However, the existing studies have simply applied conventional clustering techniques, in which the two abovementioned challenges have not yet been solved. Therefore, in this paper, the proposed clustering algorithm is applied to assess the underground water quality in Phu My Town, Ba Ria-Vung Tau Province, Vietnam. The experimental results on benchmark datasets demonstrate the effectiveness of the proposed algorithm. For the quality of underground water, the new algorithm yields four clusters with different characteristics. Through this application, we found that the new algorithm might provide valuable reference information for underground water management.

1. Introduction

Cluster analysis aims to discover the underlying structure of a dataset by partitioning the data into groups so that similar elements are assigned to the same group and dissimilar elements are assigned to different ones [15]. Recently, along with the development of big data, cluster analysis has been extensively studied and widely applied in various fields, such as physics, biology, economics, engineering, sociology, and data mining [6]. To solve the clustering problem, several approaches have been proposed in the literature, including nonhierarchical clustering (k-means, k-means++, and other variants [7, 8]), hierarchical clustering [9], clustering for probability functions [1], and fuzzy clustering [10]. Among these approaches, k-means clustering is the most well known and the most widely applied. However, (i) the k-means algorithm and its extensions usually require a user-defined number of clusters, which is often unknown in practice. Furthermore, (ii) the k-means algorithm constructs spherical clusters, which is unsuitable for arbitrary-shaped clusters. These two problems have been the major drawbacks of clustering so far and lead to many difficulties and challenges in solving the clustering problem [6].

For (i), to determine the suitable number of clusters, the most commonly used approach is to run the clustering algorithm several times with a different number of clusters each time and to evaluate the results using internal validity measures, such as the S-index, F-index, Dunn index, and Xie-Beni index [11–14]. This approach can identify a suitable number of clusters, but it repeats the clustering process many times, thereby increasing the time and space required, according to [6]. Moreover, the abovementioned evaluation indices are distance-based measures; therefore, they can only evaluate the quality of spherical clusters and cannot be used for arbitrary-shaped clusters. In [15], Mavridis et al. proposed the algorithm PFClust (Parameter Free Clustering). The term “parameter free” means that the algorithm can automatically determine the number of clusters without requiring any user-defined parameters. For this purpose, PFClust performs an agglomerative algorithm on many subdatasets that are randomly sampled several times. Given an internal validity measure and a set of thresholds corresponding to the number of clusters, the suitable threshold is then chosen based on the distribution of the given internal measure over all possible clustering results. In comparison to other conventional clustering algorithms, PFClust can perform slightly better; however, it repeats the process of sampling and evaluating internal measures of the given thresholds several times. Consequently, PFClust tends to be more time-consuming and expensive than other clustering methods. References [16–18] found the optimal partition by combining a metaheuristic optimization method with clustering. These studies used the abovementioned internal validity measures as objective functions to be optimized in order to find the best clustering solution.
It is well known that metaheuristic optimization methods, e.g., the genetic algorithm, incur an extreme computational cost, which reduces the efficiency of the algorithm. Furthermore, although they output the number of clusters and the partition automatically, metaheuristic optimization methods require a few user-defined parameters of their own that affect the optimal solution. As a result, avoiding the challenge of specifying the number of clusters k leads to the challenge of specifying many other parameters. In [19], an automatic clustering algorithm was built on a force function that controls the movements of the objects: the farther apart two objects are, the weaker the force between them. In the end, each object converges to the center of the cluster it belongs to. Since computing the force also requires a user-defined parameter whose value affects the number of clusters, this algorithm does not significantly overcome the problem of [16–18].

For (ii), DBSCAN [20], a density-based algorithm, is the most well-known method for constructing arbitrary-shaped clusters. The algorithm utilizes two connectivity functions, termed density-reachable and density-connected, and each data instance is labeled as either a core point or a border point. The algorithm expands each core point to form a cluster around it. A drawback of DBSCAN is that when clusters of different densities exist, only particular kinds of noise points are captured [21]. Besides, two user-defined parameters, the minimum size of clusters and the radius, need to be carefully tuned. Other approaches, such as kernel k-means [22] and spectral clustering [23], can construct arbitrary-shaped clusters; these methods, however, also require a predefined number of clusters.

Because of the abovementioned drawbacks, a new clustering method that can automatically determine both the number of clusters and the clusters’ shapes is necessary. This paper proposes a new clustering method based on a new definition called the “$\varepsilon$-radius neighbors” of a given point. The $\varepsilon$-radius neighbors play a key role in constructing clusters with arbitrary shapes. When no new $\varepsilon$-radius neighbor is found, the algorithm stops processing the current cluster, and thereby the number of clusters is automatically determined. Furthermore, the radius $\varepsilon$ can be adapted to the specific cluster density, which is an advantage of the proposed method in comparison with DBSCAN.

The quality of underground water depends on various factors, such as climate, characteristics of aquifers, pH, alkalinity, redox potential of the geological environment, initial sources, contamination due to human activities, and biological processes. The conventional methods of assessing the quality of groundwater are usually based on comparing the parameters representing water quality, which are collected by sensors, with the permitted standards. Clustering can help explain a complex data matrix, analyze the similarities in water quality characteristics, and group the samples into clusters, thereby showing their general characteristics as well as the causes that affect water quality. Therefore, clustering has been widely applied in environmental applications, including underground water quality monitoring. Some studies, for example, [24–28], have applied clustering to classify the water quality across a whole region and to design an optimal future spatial sampling strategy, which can reduce the number of sampling stations and the associated costs. However, these studies simply applied conventional clustering methods, such as hierarchical clustering with Ward distance and k-means clustering, which suffer from the disadvantages discussed above. Therefore, in this paper, the proposed clustering algorithm is applied to assess the underground water quality in Phu My Town, Ba Ria-Vung Tau Province, Vietnam. This application is expected to produce more reliable and valuable information so that administrators can monitor underground water behavior.

The remainder of this paper is organized as follows. Section 2 presents the proposed method, the study area, and the data collection. The results and discussion are presented in Section 3, in which Section 3.1 illustrates the proposed algorithm on a numerical example, Section 3.2 validates it on benchmark datasets, and Section 3.3 applies it to assessing the underground water quality in Phu My Town, Ba Ria-Vung Tau Province, Vietnam. Finally, Section 4 is the conclusion.

2. Materials and Methods

2.1. The Proposed Clustering Method

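Definition 1. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ points, $x \in X$ be a given point, and $\varepsilon$ be an arbitrary positive number. A set $N_\varepsilon(x)$ is called the $\varepsilon$-radius neighbors of $x$ if
$$N_\varepsilon(x) = \{\, y \in X : d(x, y) \le \varepsilon \,\}, \quad (1)$$
where $d(x, y)$ is the Euclidean distance between $x$ and $y$.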
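To make the definition concrete, the following minimal Python sketch collects the neighbors of a point within a given radius. The function name `epsilon_radius_neighbors`, the closed ball (distance less than or equal to the radius), and the exclusion of the query point itself are assumptions made for illustration, not details confirmed by the text.

```python
import math


def epsilon_radius_neighbors(points, x, eps):
    """Return the points lying within Euclidean distance eps of x.

    Assumptions: the ball is closed (<= eps) and x itself is excluded.
    """
    return [y for y in points if y != x and math.dist(x, y) <= eps]


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0)]
neighbors = epsilon_radius_neighbors(points, (0.0, 0.0), eps=0.25)
print(neighbors)  # [(0.1, 0.0), (0.0, 0.2)]
```

The point (1.0, 1.0) lies about 1.41 away from the query point and is therefore left out.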

Obviously, the $\varepsilon$-radius neighbors of $x$ are located in a hypersphere of radius $\varepsilon$ around $x$. As a result, a cluster can be extended by searching the dataset and adding new objects that fall in any hypersphere of radius $\varepsilon$ around the current objects. This process still depends on the value of $\varepsilon$, which plays the same role as the radius parameter in the well-known DBSCAN algorithm, and the choice of this parameter affects the clustering result. A fixed value of $\varepsilon$ has low generalization ability because different datasets, and clusters with different densities within one dataset, could require different values of $\varepsilon$. A natural strategy is simply to adapt $\varepsilon$ using the current cluster density. For the sake of presentation, the set of pairwise distances in the current cluster is called the set of “historical extending.” Based on the set of “historical extending” in the current cluster (sample), we can estimate the maximum “extending” of the entire cluster (population). In this case, two basic principles are as follows:

(1) We know that if data has the normal distribution with mean $\mu$ and standard deviation $\sigma$, then 95% of the data values belong to the interval
$$[\mu - 1.96\sigma,\ \mu + 1.96\sigma].$$
If the two abovementioned parameters are unknown and the data is large enough, we can estimate them from the sample data. For example, the mean $\bar{d}$ and the adjusted standard deviation $s$ of the sample can be selected as alternatives for $\mu$ and $\sigma$, respectively. Therefore, to estimate the maximum extending of the cluster (population), we can use the following formula:
$$\varepsilon = \bar{d} + 1.96\, s, \quad (2)$$
where $\bar{d}$ and $s$ are the mean and adjusted standard deviation of the “historical extending” in the current-processing cluster (sample). Obviously, about 97.5% of the extending pertaining to the true cluster (population) must be less than the extending estimated by formula (2), and thus, this formula can be used to approximate the maximum extending of the true cluster (population).

(2) Let $n$ be the sample size, that is, the number of objects in the current-processing cluster, and let $\bar{d}$ and $s$ be the sample mean and adjusted standard deviation, respectively. Assume that $n$ is large enough or that $d$ has the normal distribution with mean $\mu$ and variance $\sigma^2$. Consequently, with a significance level of 0.05, the mean of $d$ belongs to the interval
$$\left[\bar{d} - 1.96\,\frac{s}{\sqrt{n}},\ \bar{d} + 1.96\,\frac{s}{\sqrt{n}}\right].$$
As a result, the maximum of the mean extending can be directly estimated using the following formula:
$$\varepsilon = \bar{d} + 1.96\,\frac{s}{\sqrt{n}}. \quad (3)$$

The maximum value of the confidence interval is then used as the representative extending of the cluster.

It can be observed from formulas (2) and (3) that, in the earlier processing stage, when the sample size is too small, the standard deviation and the adaptive extending must be large. Therefore, we can avoid unreasonable extending in the earlier processing stage when the current sample is not a good representation of the population. Meanwhile, in the later stage, the number of objects in the current-processing cluster or the sample size is large enough for maintaining a stable adaptive extending.
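The two radius rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: formula (2) is read as the sample mean plus 1.96 adjusted standard deviations of the pairwise distances, formula (3) as the upper end of the 95% confidence interval of their mean, and `n` as the number of objects in the cluster. The function names and the constant `Z` are hypothetical.

```python
import math
import statistics
from itertools import combinations

Z = 1.96  # assumed normal quantile for a two-sided 95% interval


def pairwise_distances(cluster):
    """All pairwise Euclidean distances ("historical extending")."""
    return [math.dist(a, b) for a, b in combinations(cluster, 2)]


def radius_formula_2(cluster):
    """Assumed reading of formula (2): mean + Z * adjusted std."""
    d = pairwise_distances(cluster)
    return statistics.mean(d) + Z * statistics.stdev(d)


def radius_formula_3(cluster):
    """Assumed reading of formula (3): mean + Z * adjusted std / sqrt(n),
    with n taken as the number of objects in the cluster."""
    d = pairwise_distances(cluster)
    return statistics.mean(d) + Z * statistics.stdev(d) / math.sqrt(len(cluster))


cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
r2 = radius_formula_2(cluster)
r3 = radius_formula_3(cluster)
print(round(r2, 3), round(r3, 3))
```

Both functions need at least three points so that the adjusted standard deviation is defined. Under these readings the radius from formula (3) is never larger than the one from formula (2), and the gap widens as the cluster grows.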

Based on formulas (2) and (3), we propose a new clustering method called adaptive radius clustering for automatically determining the number of clusters and the cluster shapes. Let $X = \{x_1, x_2, \ldots, x_N\}$ be the original dataset of $N$ objects. The new clustering algorithm is presented as the following pseudocode and in Figure 1.

Initialize the working sets, including $C_{\text{old}}$ and $C_{\text{new}}$, which denote the current-processing cluster obtained before and after an update, respectively.

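Step 1. Get the first three objects of the cluster using the formulas below:
$$v_1 = \arg\min_{x \in X} \sum_{y \in X} d(x, y), \quad (4)$$
$$v_2 = \arg\min_{x \in X \setminus \{v_1\}} d(x, v_1), \quad (5)$$
$$v_3 = \arg\min_{x \in X \setminus \{v_1, v_2\}} d(x, v_1), \quad (6)$$
subject to
$$d(v_1, v_j) \le \frac{2}{|X|\,(|X| - 1)} \sum_{\{x, y\} \subset X} d(x, y), \quad j = 2, 3. \quad (7)$$
Then update the current-processing cluster so that it consists of $v_1$, $v_2$, and $v_3$. In formulas (4), (5), and (6), the argument “arg” of a function is the value that must be provided to obtain the function’s result; hence $v_1$ is a point in $X$ such that the sum of distances between it and the other points is the minimum. In other words, $v_1$ acts as the center (more precisely, the medoid) of the current dataset. Similarly, $v_2$ is the nearest point to $v_1$, and $v_3$ is the nearest point to $v_1$ when excluding $v_2$. Formula (7) is defined to overcome the problem of bad initialization. For example, if $v_2$ and $v_3$ are the two nearest neighbors of $v_1$, but the corresponding distances are larger than the average of the pairwise distances between points in the current dataset, then $v_1$ will be considered as a single cluster and the current extending process will be stopped.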
In the abovementioned formulas, $d$ is the Euclidean distance between any two d-dimensional points. In some illustrations below, for the sake of visualization, $x$ will be chosen as a 2-dimensional point ($x_1$ and $x_2$) so that we can draw the scatter plot of the data. In fact, $x_1$ and $x_2$ need not be coordinates; they can also be other information, such as height, weight, Ca2+, Mg2+, and Na+. Furthermore, $x$ can be a d-dimensional vector, in general. The Euclidean distance between two d-dimensional points $x$ and $y$ is calculated using the following formula:
$$d(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}.$$
In addition, because variables measured at different scales do not contribute equally when calculating the distance, the data are normalized into the [0, 1] interval using the following formula:
$$x'_{ij} = \frac{x_{ij} - \min_j}{\max_j - \min_j},$$
where $x_{ij}$ is the value of variable $j$ ($j = 1, \ldots, d$) at point $i$ ($i = 1, \ldots, N$), $x'_{ij}$ is the normalized value of variable $j$ at point $i$, and $\min_j$ and $\max_j$ are the minimum and maximum values of variable $j$, respectively.
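A small Python sketch of the normalization and of the initialization step may help. It assumes distinct points, reads the constraint as "both seed distances must not exceed the average pairwise distance," and returns a one-point seed when the constraint fails. The function names are hypothetical, and mapping a constant column to 0.0 is an added safeguard that the text does not discuss.

```python
import math
from itertools import combinations


def min_max_normalize(rows):
    """Rescale every variable (column) to the [0, 1] interval."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def initial_points(points):
    """Pick v1 (smallest total distance) and its two nearest neighbors.

    Returns (v1,) alone when a seed distance exceeds the average
    pairwise distance (assumed reading of the constraint).
    Needs at least three distinct points.
    """
    v1 = min(points, key=lambda x: sum(math.dist(x, y) for y in points))
    rest = sorted((p for p in points if p != v1), key=lambda p: math.dist(p, v1))
    v2, v3 = rest[0], rest[1]
    pairs = list(combinations(points, 2))
    average = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    if math.dist(v1, v2) > average or math.dist(v1, v3) > average:
        return (v1,)
    return (v1, v2, v3)


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0)]
seed = initial_points(data)
print(seed)  # ((0.0, 0.1), (0.0, 0.0), (0.1, 0.0))
```

In this toy dataset the three seed points all come from the dense group near the origin, which is the behavior the constraint is meant to guarantee.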

Step 2. For each point $x$ in the current-processing cluster, compute the adaptive $\varepsilon$-radius and the corresponding $\varepsilon$-radius neighbors of $x$ using Definition 1 and either formula (2) or formula (3); then update the current-processing cluster by adding these neighbors to it, keeping the cluster obtained before the update for comparison. Note that formulas (2) and (3) are two options that need to be tested. In the numerical results, after applying both, the best option will be selected for the application.

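Step 3. If the update in Step 2 added new points to the current-processing cluster, repeat Step 2 and Step 3. Once the cluster is the same before and after an update, stop the current-processing cluster.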

Step 4. Repeat the three steps above until all objects are assigned to their clusters.
The main idea of the proposed algorithm is that, starting from a few points initialized using formulas (4), (5), and (6) subject to (7), the cluster can automatically expand based on formula (2) or (3). When the cluster can no longer be extended, the abovementioned process is repeated over the rest of the data until every point is assigned to a specific cluster. With formula (2) or (3), the $\varepsilon$-radius neighbors can adapt to different cluster densities; hence, the proposed algorithm can determine the number of clusters and find clusters of arbitrary shapes in cases of both balanced and imbalanced cluster densities. This is an advantage of the proposed algorithm in comparison to conventional methods, such as k-means, k-medoids, and DBSCAN.
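The four steps can be put together in a compact Python sketch. It is a reading of the description above under several assumptions, not the authors' code: points are distinct tuples, the radius is the mean pairwise distance plus 1.96 adjusted standard deviations divided by the square root of the cluster size, the radius is recomputed before each point is used, and a seed that fails the initialization constraint becomes a single-point cluster. All function names are hypothetical.

```python
import math
import statistics
from itertools import combinations


def adaptive_radius(cluster, z=1.96):
    """Assumed rule: mean + z * adjusted std / sqrt(cluster size)."""
    d = [math.dist(a, b) for a, b in combinations(cluster, 2)]
    return statistics.mean(d) + z * statistics.stdev(d) / math.sqrt(len(cluster))


def seed_cluster(points):
    """Step 1: v1 plus its two nearest neighbors, or v1 alone if they are too far."""
    v1 = min(points, key=lambda x: sum(math.dist(x, y) for y in points))
    if len(points) < 3:
        return [v1]
    rest = sorted((p for p in points if p != v1), key=lambda p: math.dist(p, v1))
    pairs = list(combinations(points, 2))
    average = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    if math.dist(v1, rest[1]) > average:  # rest[1] is the farther of the two
        return [v1]
    return [v1, rest[0], rest[1]]


def adaptive_radius_clustering(points):
    remaining = list(points)
    clusters = []
    while remaining:                      # Step 4: until every point is assigned
        cluster = seed_cluster(remaining)
        if len(cluster) == 3:
            queue = list(cluster)         # points not yet used for extending
            while queue:                  # Steps 2 and 3
                x = queue.pop(0)
                eps = adaptive_radius(cluster)
                new = [y for y in remaining
                       if y not in cluster and math.dist(x, y) <= eps]
                cluster.extend(new)
                queue.extend(new)
        remaining = [p for p in remaining if p not in cluster]
        clusters.append(cluster)
    return clusters


group_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
group_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
clusters = adaptive_radius_clustering(group_a + group_b)
print(len(clusters))  # 2
```

The list scans make this sketch quadratic or worse in the number of points, which is acceptable for illustration but would call for a spatial index on larger datasets.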

2.2. Study Area and Data Used

The clustering method proposed above will be applied in assessing the underground water quality in Phu My Town, Ba Ria-Vung Tau Province, Vietnam. The study area and data used are described as follows.

2.2.1. Study Area

Phu My town has a natural area of 33,825 hectares and a population of 137,334 people. To the east, it borders Chau Duc district, Ba Ria-Vung Tau province. To the west, it borders Can Gio district, Ho Chi Minh City, and Vung Tau City, Ba Ria-Vung Tau province. To the south, it borders Ba Ria City, Ba Ria-Vung Tau province, and to the north is the Long Thanh district, Dong Nai province. Phu My town is located in the climate region of the Southern Delta, Vietnam, with a tropical climate influenced mainly by the northeast and southwest monsoons. There are two distinct seasons in a year, the dry season and the rainy season. The first lasts from December to April with an average annual temperature of 26.3°C, and the second is between May and November with an average annual rainfall of 1356.5 mm.

Phu My town is the most concentrated industrial area and one of the most developed areas in Ba Ria-Vung Tau province, Vietnam. To serve economic development, the demand for water in this area is quite high, but the sources of surface water from rivers and lakes do not meet the demand. According to the 2012 survey data of the Department of Natural Resources and Environment of Ba Ria-Vung Tau province, the total volume of underground water exploitation in this town amounted to 18,608,430 m³/year (mainly from Phu My-My Xuan water station and Toc Tien Water Plant). Groundwater exploitation has been reported to be mainly in the Pleistocene aquifer, which is composed of coarse-grained soil of the Cu Chi Formation, Thu Duc Formation, and Trang Bom Formation, with the main minerals being fluorite-apatite, feldspar, gypsum, tourmaline, montmorillonite, ilmenite, and some other impurities.

2.2.2. Data Used

The dataset has been provided by the Department of Natural Resources and Environment of Ba Ria-Vung Tau Province. The groundwater samples in the Middle-Upper Pleistocene (qp2–3) aquifer and Upper Pleistocene (qp3) aquifer, which consist of 11 variables, have been collected from 17 monitoring wells. The locations of 17 monitoring wells are shown in Figure 2, and the detailed dataset is presented in Table 1.

In this study, all variables contribute equally when calculating the distance; that is, the proposed method gives equal importance to each chemical parameter. If some chemical parameters are more important than others, the proposed method can be performed using the weighted Euclidean distance instead of the standard Euclidean distance. Also, note that, in this application, a well’s location is not considered as a variable; that is, the wells are grouped only by their chemical parameters. The algorithm therefore focuses on chemical properties rather than on location. Naturally, if wells in the same region have the same chemical properties, they will be assigned to the same cluster. As a result, we have wells sorted by locations. In contrast, through the clustering results, we can still identify wells that are in the same region but have different chemical properties, or wells that are in different regions but have similar chemical properties. In such cases, the corresponding explanation will also be provided.
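For readers who want to try the weighted variant, a minimal Python sketch of a weighted Euclidean distance follows. The weights and the three-variable points are made-up illustrative values, not figures from the dataset, and the function name is hypothetical.

```python
import math


def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_j w_j * (x_j - y_j)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


# Two hypothetical wells described by three normalized chemical parameters.
well_a = (0.2, 0.8, 0.5)
well_b = (0.6, 0.5, 0.5)

equal = weighted_euclidean(well_a, well_b, (1.0, 1.0, 1.0))     # standard distance
emphasis = weighted_euclidean(well_a, well_b, (4.0, 1.0, 1.0))  # first parameter weighted up
print(round(equal, 3), round(emphasis, 3))
```

With all weights equal to 1 the function reduces to the standard Euclidean distance used in this study; raising the first weight makes differences in the first parameter count more.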

3. Results and Discussion

3.1. Numerical Example

In this section, a simple dataset is used in order to illustrate the proposed algorithm in detail. The dataset consists of 20 bivariate points presented in Table 2; the normalized data points are presented in Figure 3.

Using formulas (4), (5), and (6), we found the three initial points v1, v2, and v3 of the first cluster, which are represented by red in Figure 4. It can be seen from Figure 4 that the distance between these three points is really small in comparison with the distance between all points; therefore, condition (7) is satisfied and we can use these three points for extending the cluster.

Now, we use the points in the processing cluster to build up the cluster itself. For example, in Figure 5, starting from the green point, v2, and using formula (3), we calculate the adaptive radius and determine the three new $\varepsilon$-neighbors based on the circle formed. After that, the processing cluster is extended by adding these three new points, and the point v2 will no longer be used to extend the cluster in the next steps. Using another point in the processing cluster, for example, the green point in Figure 6, we again calculate the adaptive radius and determine the new $\varepsilon$-neighbors based on the circle formed.

Repeat the abovementioned process until the processing cluster cannot be extended more, that is, all points in the processing cluster have been used for the extending process and we cannot find any new points linked to them, as shown in Figure 7.

Figure 7 completely determines the first cluster; we can repeat the abovementioned process for the remainder of the dataset and obtain the final partition, as shown in Figure 8.

3.2. Experiments in Benchmark Datasets

Section 3.1 illustrated the proposed algorithm step by step. In this section, to test the partitioning performance of the proposed algorithm and compare it with other methods, the proposed algorithm is run on several datasets with different characteristics.

The tested datasets can be downloaded from (https://cs.joensuu.fi/sipu/datasets/) and include
(i) Spiral: a dataset with spiral-shaped clusters
(ii) Aggregation: a dataset with different cluster shapes
(iii) Compound: a compound dataset with different cluster shapes and densities
(iv) Gauss: a dataset simulated in [6] with three Gaussian clusters

The tested algorithms include
(i) ARC1: the proposed method with the adaptive radius defined according to formula (3).
(ii) ARC2: the proposed method with the adaptive radius defined according to formula (4).
(iii) k-means, DBSCAN: two popular clustering algorithms. The k-means algorithm requires an initial number of clusters and produces spherical clusters, while DBSCAN is a density-based clustering algorithm that is suitable for clusters of arbitrary shapes.
(iv) SU: an automatic clustering algorithm recently presented in [19] for determining the number of clusters automatically.

In this paper, the Adjusted Rand Index, ARI [29, 30], is employed to evaluate the performance of the five compared methods. ARI is an external measure that compares the partition produced by a clustering algorithm (P) with the actual partition (Q), where the “ground-truth” labeling is known. Particularly, given P and Q, ARI is defined as follows:
$$\mathrm{ARI} = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)}, \quad (12)$$
where a is the number of pairs of elements in the same cluster in both P and Q, b is the number of pairs of elements in the same cluster in P but in different clusters in Q, c is the number of pairs of elements in different clusters in P but in the same cluster in Q, and d is the number of pairs of elements in different clusters in both P and Q. The closer the ARI is to 1, the better the clustering result is (it can be seen from formula (12) that when P and Q are the same, b = c = 0 and ARI = 1).
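The pair counts a, b, c, and d can be computed directly from two label lists. The sketch below uses the pair-count form of the index, ARI = 2(ad - bc) / ((a + b)(b + d) + (a + c)(c + d)); the function names are hypothetical, and returning 1.0 when the denominator is zero (for example, when both partitions put everything in one cluster) is a convention assumed here.

```python
from itertools import combinations


def pair_counts(p, q):
    """Count pairs by agreement: a (same/same), b (same in P only),
    c (same in Q only), d (different/different)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(p)), 2):
        same_p = p[i] == p[j]
        same_q = q[i] == q[j]
        if same_p and same_q:
            a += 1
        elif same_p:
            b += 1
        elif same_q:
            c += 1
        else:
            d += 1
    return a, b, c, d


def adjusted_rand_index(p, q):
    a, b, c, d = pair_counts(p, q)
    denominator = (a + b) * (b + d) + (a + c) * (c + d)
    if denominator == 0:
        return 1.0  # assumed convention for degenerate partitions
    return 2.0 * (a * d - b * c) / denominator


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (same partition, relabeled)
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

The first call shows that the index ignores label names; the second shows that it can be negative when two partitions agree less than expected by chance.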

Table 3 intuitively presents the clustering results of the five tested algorithms on the four used datasets.

Remarks:
(i) For the nonspherical clusters, the performance of DBSCAN is better than that of the SU and k-means algorithms. This result is reasonable because DBSCAN can easily group the data points into arbitrary-shaped clusters based on density and connectivity rather than on the distance between points. The ARC2 algorithm, in general, is quite efficient in terms of ARI and outperforms DBSCAN on two of the three datasets. Meanwhile, ARC1 achieves the largest ARI values, which indicates the best performance in terms of clustering accuracy.
(ii) For the spherical or Gaussian clusters, most of the methods perform well, with ARC1, SU, and DBSCAN being the proper methods. The k-means algorithm also provides the best result for k = 3; however, when k is chosen differently, this method shows poor performance. Tables 3 and 4 also show that ARC2 performs better than k-means; however, it is not good enough for the Gaussian clusters.
(iii) In summary, it can be claimed that ARC1 is an effective algorithm. Specifically, ARC1 can automatically determine the number of clusters and achieves notably larger ARI values, that is, notably better clustering results, on the tested datasets.

3.3. Application for Underground Water Quality Assessment

In this section, we cluster the samples of groundwater quality parameters provided by the Department of Natural Resources and Environment of Ba Ria-Vung Tau Province. The study area and data used have been presented in Section 2. The clustering results in Figure 9 show that the 17 monitoring wells are classified into 4 groups based on the water quality characteristics:
(i) Cluster 1: NB3A, QT5B, NB4
(ii) Cluster 2: NB3B, NB1B, NB1A, QT11
(iii) Cluster 3: QT7B, NB2C, VT4B, VT6, QT5A, NB2A, VT4A, VT2B, VT2A
(iv) Cluster 4: QT7A

A comparison of some parameters among clusters is shown in Figure 10. We have the following remarks:(i)Cluster 4 consists of only 1 well, QT7A, with very high parameter values. This result demonstrates that the water quality in this well is really bad compared to the remaining clusters. In addition, it can be seen from Table 1 and Figure 10(a) that QT7A has more salt ions (Mg2+, Na+, K+, Ca2+, , , Cl, , and Nitrite) compared to the remaining clusters. According to National Technical Regulation on Groundwater Quality of Vietnam, the permitted standard for Cl is 250 mg/l and for is 400 mg/l. Therefore, the Cl and values of QT7A exceed the permitted standards 3.78 and 1.3 times, respectively. This demonstrates that QT7A may be overaffected by saline intrusion because this well is located near the saline boundary. Additionally, it can be seen in Figure 9 that two wells QT7A and QT7B are located in the same region, but they belong to different clusters. Actually, they are both contaminated wells, but they have different depths, representing separate aquifers. As a result, QT7A exhibits a higher level of contamination than QT7B.(ii)For the three remaining clusters, it can be seen from Figures 9 and 10(b) that Cluster 1 consists of three wells, with high HCO3 values. To our knowledge, the two wells, NB3A and QT5B, are located near My Xuan B1 industrial zone, and the well NB4 is located near Toc Tien landfill. As a result, those wells may be contaminated by the waste discharge process of the abovementioned industrial zone and landfill.(iii)Cluster 2 consists of four wells with relatively good quality. In this cluster, most of the parameter values are lower than those of other clusters and are within safe ranges. It can be concluded that the wells of Cluster 2 are not affected by agricultural activities as well as saline intrusion.(iv)Cluster 3 consists of eight wells with higher values of Mg2+, Na+, K+, Ca2+, Cl, and compared to those of Cluster 1 and Cluster 2. 
Especially, Cl value exceeds the permitted standard at 2/8 wells. This indicates a number of wells in Cluster 3, which are located near the coast as well as salinity boundaries, are capable of being affected by salinity intrusion. In addition, as shown in Figure 10(b), in Cluster 3, the average value of is higher than that of Cluster 1 and Cluster 2. This demonstrates that agricultural activities taking place around the monitoring area served as large contributors to the underground water quality of this cluster. In particular, well NB2C, VT2B, and VT2A are located near the industrial planting area. Meanwhile, well VT6, which is located near the aquaculture area, may be seriously affected by organic matter from the residual feed; therefore, the value reaches 7.77 times higher than the permitted standard.

4. Conclusion

Based on the definition of epsilon radius neighbors, this paper has proposed a new clustering algorithm that can automatically determine the number of clusters and can find clusters with different sizes, shapes, and densities. The radius or extending is adapted to the current-processing cluster and has good generalization ability. The proposed algorithm is tested on benchmark datasets and is then applied to underground water quality assessment in Phu My Town, Ba Ria-Vung Tau province, Vietnam. For the experiments with many datasets, the ARC1 algorithm exhibits a better performance than the other tested algorithms in terms of the Adjusted Rand index. The ARC2 algorithm performs better than the conventional clustering algorithms in the case of nonspherical clusters but worse in the case of spherical clusters. For the underground water quality assessment in Phu My Town, Ba Ria-Vung Tau province, Vietnam, the proposed algorithm indicated that there are four clusters of water quality that represent different source contributions.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Vietnam National University Ho Chi Minh City (VNU-HCM) under Grant no. C2018-24-01.