Abstract

The k-means problem is one of the most popular models of cluster analysis. The problem is NP-hard, and modern literature offers many competing heuristic approaches. Sometimes practical problems require obtaining a result (albeit not an exact one), within the framework of the k-means model, which would be difficult to improve by known methods without a significant increase in the computation time or computational resources. In such cases, genetic algorithms with a greedy agglomerative heuristic crossover operator might be a good choice. However, their computational complexity makes it difficult to use them for large-scale problems. The crossover operator, which includes the k-means procedure and takes the absolute majority of the computation time, is essential for such algorithms, and other genetic operators such as mutation are usually eliminated or simplified. The importance of maintaining the population diversity, in particular with the use of a mutation operator, grows with an increase in the data volume and in the available computing resources such as graphics processing units (GPUs). In this article, we propose a new greedy heuristic mutation operator for such algorithms and investigate the influence of new and well-known mutation operators on the objective function value achieved by the genetic algorithms for large-scale k-means problems. Our computational experiments demonstrate the ability of the new mutation operator, as well as the mechanism for organizing subpopulations, to improve the result of the algorithm.

1. Introduction

The k-means problem is a continuous unconstrained global optimization problem which has become a classic clustering model. This problem is proved to be NP-hard [1, 2], so it is necessary to find a compromise between the computation time and the solution preciseness. The aim of the problem is to find a set of k points X_1, …, X_k called centroids in a d-dimensional space that minimizes the sum of squared distances from known points (data vectors) A_1, …, A_N to the nearest centroid [3]:

\[ \arg\min F(X_1, \ldots, X_k) = \sum_{i=1}^{N} \min_{j \in \{1, \ldots, k\}} \| X_j - A_i \|^2, \quad (1) \]

where ‖·‖ is the distance between two points (usually Euclidean) and k is given.
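
For illustration, here is a minimal NumPy sketch of evaluating objective (1) for a given set of centroids. The function and variable names are ours, not taken from the original implementation, and the dense distance matrix is for clarity only; it would not scale to millions of data vectors.

```python
import numpy as np

def kmeans_objective(data, centroids):
    """Sum of squared Euclidean distances from each data vector
    to its nearest centroid, i.e., objective function (1)."""
    # Pairwise squared distances, shape (N, k)
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each data vector contributes the distance to its nearest centroid
    return d2.min(axis=1).sum()
```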

Data vector indexes for which the jth centroid is the nearest one form a set (cluster) C_j, j = 1, …, k. An equivalent problem setting is as follows:

\[ \arg\min F(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i \in C_j} \| \overline{X}_j - A_i \|^2, \quad (2) \]

where \( \overline{X}_j = \frac{1}{|C_j|} \sum_{i \in C_j} A_i \) is the centroid of the jth cluster.

The simplest and most popular local search algorithm is the k-means algorithm [4, 5], also called the Alternate Location and Allocation (ALA) procedure [6, 7] or Lloyd algorithm. A similar procedure called EM (Expectation Maximization) [8, 9] and its modifications [10–12] are the most popular algorithms for separating a mixture of probability distributions. The k-means algorithm improves an intermediate solution sequentially, which enables us to find a local minimum.

Technically, this is not a true local search algorithm in terms of continuous optimization, as it searches for a new solution not necessarily in an ε-neighborhood of the existing solution. Nevertheless, it provides a solution which is locally optimal in an ε-neighborhood.

If we use distances instead of squared distances in (1), we deal with the continuous p-median problem. The similarity of these NP-hard problems [13, 14] allows us to use similar approaches to solving them. However, unlike the p-median problem with Euclidean distances, finding the exact solution of a 1-means problem (the k-means problem with k = 1, or the centroid search problem) in accordance with Algorithm 1 is trivial, and finding a local minimum of the k-means problem takes fewer computational resources. This allows the local search to be integrated into various effective global search strategies.

Require: Set of initial centroids X_1, …, X_k.
(1) For each centroid X_j, j = 1, …, k, define its cluster C_j as the subset of data vectors having X_j as their closest centroid.
(2) For each cluster C_j, j = 1, …, k, recalculate its centroid as follows: X_j ← (1/|C_j|) Σ_{i∈C_j} A_i.
(3) Repeat from Step 1 if Steps 1 and 2 made any changes.
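
A compact Python sketch of Algorithm 1 (the k-means/ALA procedure), assuming the data and centroids are NumPy arrays; the empty-cluster handling (an empty cluster keeps its old centroid) is a simplification of ours.

```python
import numpy as np

def ala(data, centroids, max_iter=1000):
    """Algorithm 1 (k-means/ALA): alternate cluster assignment and centroid
    recalculation until no assignment changes."""
    centroids = centroids.copy()
    labels = None
    for _ in range(max_iter):
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)                      # Step 1: assign clusters
        if labels is not None and np.array_equal(new_labels, labels):
            break                                           # Step 3: no changes, stop
        labels = new_labels
        for j in range(len(centroids)):                     # Step 2: recalculate centroids
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```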

In early attempts to solve the p-median problem (in its discrete modifications) by exact methods, the authors used branch and bound algorithms [15–17] for solving very small problems. In [18–20], the authors reviewed various heuristic solution techniques for the k-means and p-median problems. In [21–23], the authors presented local search approaches including Variable Neighborhood Search (VNS) and concentric search. In [22], Drezner et al. proposed heuristic procedures, including a genetic algorithm (GA), for rather small datasets.

Many approaches based on data reduction [24] simplify the problem by selecting some part of the initial dataset and then using the obtained results as an initial solution for the k-means algorithm on the complete dataset [25–28]. Such aggregation, as well as reducing the number of data vectors [29], enables us to solve large-scale problems within a reasonable time. However, such approaches lead to a reduction in preciseness. In our research aimed at obtaining the most precise solutions, we consider only the methods which estimate the objective function (1) directly, without aggregation or approximation approaches.

Modern publications offer many heuristic procedures [19, 30] for setting the initial centroids for the k-means algorithm, most of which belong to various evolutionary and random search methods. Local search algorithms and their randomized versions are widely represented. For instance, Variable Neighborhood Search (VNS) algorithms [23, 31, 32] or agglomerative algorithms [33, 34] sometimes show good results. A large number of articles are devoted to the initialization procedures for local search algorithms, such as random seeding and estimating the distribution of the data vectors [30]. The challenge is that, in many cases, even multiple runs of simple local search algorithms from various randomly generated solutions do not lead to a solution that is close to the global optimum. More advanced algorithms enable us to obtain objective function (1) values many times better than those of the local search methods [32].

The use of genetic algorithms and other evolutionary approaches to improve the results of the local search is a widely used idea [35–38]. Such algorithms recombine the local minima obtained by the k-means algorithm. GAs operate with a certain set (population) of candidate solutions and include special genetic operators (algorithms) of initialization, selection, crossover, and mutation. The mutation operator randomly changes the resulting solutions and provides some diversity in the population.

However, in genetic algorithms, as the number of iterations increases, the population degenerates into a certain set of solutions close to each other. Larger populations as well as dynamically growing populations improve this situation. However, simpler algorithms based on the use of the same greedy agglomerative procedures [32, 39] often show better results within the same computation time.

In this research, we do not discuss the adequacy of the k-means clustering model, which is actually questionable. We only focus on the preciseness and stability of the obtained objective function value (1) within the framework of the k-means model.

There are situations when the cost of error is high [9]. In these cases, as well as when comparing the accuracy of an algorithm with a certain standard solution (not necessarily globally optimal), we need to get a result that would be difficult to enhance by other known methods without a meaningful increase in the computation time. The evolution of parallel processing systems such as graphics processing units (GPUs) makes multiple runs of local search algorithms very cheap. In this case, large-scale problems (up to several millions of data vectors) can be solved with the use of the most advanced algorithms providing the highest preciseness. As our study shows, for large-scale problems, further improvement in the results of the genetic algorithms with greedy heuristic crossover can be achieved by using a special mutation operator and partially isolated solution subpopulations.

The aim of this paper is to introduce a new k-Means Genetic Algorithm with the greedy agglomerative crossover operator, a special greedy agglomerative mutation operator, and subpopulations.

The rest of this article is organized as follows. In Section 2, we propose a brief overview of known approaches to the development of k-means genetic algorithms. In Section 3, we give an overview of known mutation genetic operators used in k-median genetic algorithms in accordance with various approaches to chromosome encoding as well as other instruments for increasing the population diversity. In Section 4, we propose new modifications to the genetic algorithms with greedy heuristic crossover operator. Such modifications include partially isolated subpopulations and the use of a new mutation operator based on the greedy heuristic procedure. In Section 5, we describe the results of our computational experiments which demonstrate the efficiency of our new modifications on large datasets.

2. K-Means Genetic Algorithms

The idea of various genetic algorithms is based on a recombination (interchange) of elements in a set (“population”) of candidate solutions (“individuals”) encoded by “chromosomes.” Such elements of the chromosomes are called “genes” or “alleles.” Each chromosome is a vector of genes (bits, integers, or real numbers) representing a solution. The goal of gene recombination is achieving the best value of an objective function called “fitness function.” The appearance of the first genetic algorithms for solving the discrete p-median problem [40] preceded the genetic algorithms for the k-means problem (k-Means Genetic Algorithms). Alp et al. [41] proposed a rather fast and precise algorithm with a special “greedy” (agglomerative) heuristic procedure used as the crossover genetic operator for the network p-median problem. Such algorithms solve discrete network problems and use a very simple binary chromosome encoding (1 for the network nodes selected as the centers of the clusters, and 0 for those not selected).

In the genetic algorithms for the k-means and similar problems with binary-encoded chromosomes, many mutation techniques can be used. For example, in [42], the authors represent the chromosome with binary strings composed of binary-encoded features (coordinates) of the centroids. The mutation operator arbitrarily alters one or more components (binary substrings) of a selected chromosome.

If the centers or centroids are searched for in a continuous space, some genetic algorithms still use binary encoding [38, 43, 44]. The initial solutions of the k-means algorithm are usually subsets of the dataset {A_1, …, A_N}. In such a chromosome code, 1 means that the corresponding data vector must be selected as an initial centroid, and 0 means that it is not selected. In this case, some local search algorithm (the k-means algorithm or similar) is used at each iteration of the GA to estimate the final value (local minimum) of the objective function (1).

In [45], the authors refer to their algorithm as “Evolutionary k-Means.” However, they actually solved an alternative problem which aimed to increase the clustering stability instead of minimizing (1). Their algorithm operates with binary consensus matrices and uses two types of mutation genetic operators: cluster split (dissociative) and cluster merge (agglomerative) mutation. In [46], the chromosomes are strings of integers representing the cluster number for each of the clustered objects, and the authors solve the k-means problem while simultaneously determining the number of clusters based on the silhouette [47] and Davies–Bouldin [48] criteria, which are used as the fitness functions. Thus, in [46], the authors also solve a problem with a mathematical statement other than (1). Similar encoding is used in [37], where the authors propose a mutation operator which changes the assignment of individual data vectors to the clusters.

In [49], the authors described the mutation operator as a procedure that guarantees population diversity (variability). Usually, for the k-means and p-median problems, the mutation randomly changes one or many chromosomes, replacing some centroids [36, 37]. Mutation and crossover are the most important genetic operators, playing different roles: the crossover seeks to preserve the features of parent solutions, while the mutation tries to cause small local changes in the solutions. Compared to a crossover, a mutation is usually regarded as a secondary operator with a low probability [50]. A high frequency of mutations makes a genetic algorithm search randomly and chaotically. Nevertheless, many studies have shown that evolutionary algorithms without a crossover can work better than a standard genetic algorithm if the mutation is combined with an effective selection operator [51–54]. Mutation is performed on a single parent solution.

In [36], the authors encode the solutions (chromosomes) in their GA as sets of centroids represented by their coordinates (real vectors or arrays). The genetic algorithms with the greedy heuristic crossover operator use the same principle [55].

Thus, various genetic algorithms for the k-means and similar problems can be classified into three categories in accordance with the chromosome encoding method:
(a) Integer encoding: each gene represents a data vector A_i, and its value is the cluster number. Such algorithms are declared to solve the k-means or p-median problem; however, they may use an objective function other than (1). A local search for the minimum of (1) is sometimes declared to be their mutation operator.
(b) Integer or binary encoding: each gene corresponds to a centroid (cluster) and stores the index of the data vector selected as the initial centroid for the local search method. Such algorithms may use a wide variety of crossover and mutation operators.
(c) Real (direct) encoding: each gene is a centroid encoded by its coordinates. Such algorithms are able to demonstrate the most precise results. However, the modern literature offers a very limited variety of mutation operators for such algorithms; usually, they do not use any mutation [38, 41, 43].

The greedy heuristic crossover operator can be described as a two-step algorithm. The first step combines two known (“parent”) solutions (chromosomes) into one intermediate invalid solution with an excessive number of centroids (clusters). At the second step (the greedy agglomerative procedure), the algorithm removes, at each iteration, the excessive centroids whose removal results in the least significant growth of the objective function (1) [41, 43], see Algorithm 2.

Require: Final number of centroids k, initial solution S = {X_1, …, X_K}, |S| = K > k.
(1) Improve S with the local search algorithm if possible: S ← ALA(S).
(2) while |S| > k
(3)  for all i ∈ {1, …, |S|}
(4)   Assign S′ ← S \ {X_i};
(5)   Calculate F_i ← F(S′) // where F is the objective function (1)
(6)  end for
(7)  Select a subset E ⊂ S of centroids with the minimal values of the corresponding variables F_i. Here, |E| ∈ {1, …, |S| − k} is the number of centroids eliminated at this iteration.
(8)  Obtain the new solution S ← S \ E, and improve this new solution with the local search algorithm: S ← ALA(S).
(9) end while
(10) return Solution S.
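
A possible Python sketch of the greedy agglomerative procedure (Algorithm 2), reusing the ala and kmeans_objective helpers sketched above. For simplicity, it eliminates a single centroid per iteration, whereas the procedure may remove a small subset of centroids at once.

```python
import numpy as np

def greedy_agglomerative(data, solution, k):
    """Algorithm 2 (sketch): start from an excessive set of centroids and
    repeatedly drop the centroid whose removal increases (1) the least,
    improving the remaining set with ALA after each elimination."""
    s, _ = ala(data, solution)
    while len(s) > k:
        # Objective value after tentatively removing each centroid
        losses = [kmeans_objective(data, np.delete(s, i, axis=0))
                  for i in range(len(s))]
        # Eliminate one centroid per iteration (a simplification of ours)
        s = np.delete(s, int(np.argmin(losses)), axis=0)
        s, _ = ala(data, s)
    return s
```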

Algorithms 3 and 4 are known heuristic procedures [32, 41, 43] which implement the first step of the greedy heuristic crossover operator and then run the greedy agglomerative procedure (Algorithm 2).

Require: Two solutions (sets of centers) S_1 and S_2.
return the result of Algorithm 2 applied to S_1 ∪ S_2 with the final number of centroids k.

Require: Two solutions (sets of centers) S_1 and S_2.
for all X′ ∈ S_2
  Merge S_1 and one item of S_2, and run Algorithm 2: S′ ← result of Algorithm 2 applied to S_1 ∪ {X′} with the final number of centroids k;
end for
return the best (by (1)) of the obtained solutions S′.
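
The two combination schemes can then be sketched as follows, again reusing the helpers above; crossover_full corresponds to the full merge and crossover_one to the one-by-one merge, and both names are ours.

```python
import numpy as np

def crossover_full(data, s1, s2, k):
    """Algorithm 3 (sketch): merge both parent solutions at once and
    reduce the union back to k centroids with Algorithm 2."""
    return greedy_agglomerative(data, np.vstack([s1, s2]), k)

def crossover_one(data, s1, s2, k):
    """Algorithm 4 (sketch): add the second parent's centroids one by one
    and keep the best of the resulting child solutions."""
    best, best_f = None, np.inf
    for x in s2:
        child = greedy_agglomerative(data, np.vstack([s1, x[None, :]]), k)
        f = kmeans_objective(data, child)
        if f < best_f:
            best, best_f = child, f
    return best
```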

These algorithms can be included in various global search strategies. Combining the items (centroids) of a solution S_1 with the items of another solution S_2 and running Algorithm 2, we get a set of “child” solutions. Such sets of solutions are used as neighborhoods in which a better solution is sought. Thus, the second solution is a parameter of the neighborhood [32].

The general framework of the GA for the k-means and similar location problems can be described as Algorithm 5.

Require: Initial population size N_POP (a small value in our experiments).
(1) Assign N_iter ← 0. Generate N_POP initial solutions S_1, …, S_{N_POP}, where |S_i| = k. For each initial solution, run the ALA algorithm (Algorithm 1): S_i ← ALA(S_i);
(2) loop
(3)  N_iter ← N_iter + 1;
(4) if the stop condition is satisfied
(5)  return the solution from the population with the minimal value of the objective function (1)
(6) end if
(7) Selection: randomly choose two indexes i_1, i_2 ∈ {1, …, N_POP}, i_1 ≠ i_2;
(8) Run the chosen crossover operator: S_C ← Crossover(S_{i_1}, S_{i_2});
(9) Run the chosen mutation operator: S_C ← Mutation(S_C);
(10) Run the chosen procedure to replace a solution in the population with S_C;
(11) end loop
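
A sketch of the overall GA framework (Algorithm 5) with a time-based stop condition and the tournament replacement described below (Algorithm 6); dynamic population growth and subpopulations are omitted here, and the helper names follow the earlier sketches.

```python
import time
import numpy as np

def k_ga(data, k, n_pop, time_limit, crossover, mutation=lambda s: s,
         rng=np.random.default_rng()):
    """Sketch of Algorithm 5: a small population of centroid sets, greedy
    crossover, optional mutation, and tournament replacement."""
    # Initialization: random subsets of the data improved by ALA
    pop = [ala(data, data[rng.choice(len(data), k, replace=False)])[0]
           for _ in range(n_pop)]
    fit = [kmeans_objective(data, s) for s in pop]
    start = time.time()
    while time.time() - start < time_limit:           # stop condition: time limit
        i1, i2 = rng.choice(n_pop, 2, replace=False)  # selection
        child = crossover(data, pop[i1], pop[i2], k)  # greedy crossover
        child = mutation(child)                       # mutation (may be empty)
        f_child = kmeans_objective(data, child)
        j1, j2 = rng.choice(n_pop, 2, replace=False)  # tournament replacement
        worse = j1 if fit[j1] > fit[j2] else j2
        pop[worse], fit[worse] = child, f_child
    best = int(np.argmin(fit))
    return pop[best], fit[best]
```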

The objective function is (1). We used the tournament selection (tournament replacement, see Algorithm 6) for Step 10 of Algorithm 5:

Randomly choose two indexes i_1, i_2 ∈ {1, …, N_POP}, i_1 ≠ i_2;
if F(S_{i_1}) > F(S_{i_2})
  S_{i_1} ← S_C;
else
  S_{i_2} ← S_C;
end if

Such algorithms usually operate with a very small population, and other selection procedures do not improve the results significantly [41, 43, 46].

In the GAs with greedy heuristic crossover [43, 44], Algorithms 3 and 4 are used as the crossover genetic operator. These operators are computationally expensive due to multiple runs of the ALA algorithm. In the case of large-scale problems and a very strict time limitation, GAs with the greedy heuristic crossover operator perform only a few iterations. The population size is usually small, 10–25 chromosomes. Dynamically growing populations [43, 44] are able to improve the results. In this case, Step 7 of Algorithm 5 is replaced by the following procedure (see Algorithm 7).

Adjust (increase) the population size N_POP in accordance with the chosen population growth rule;
if N_POP has changed
  initialize the new individual S_{N_POP}: generate it randomly, |S_{N_POP}| = k, and run S_{N_POP} ← ALA(S_{N_POP});
end if
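
A sketch of the dynamic-growth step (Algorithm 7); the actual growth rule is not reproduced here, so it is passed in as a should_grow callable (an assumption of ours), and a new individual is initialized exactly as in Step 1 of Algorithm 5.

```python
def grow_population(data, k, pop, fit, rng, should_grow):
    """Sketch of Algorithm 7: optionally grow the population by one randomly
    generated, ALA-improved individual, then select two parent indexes."""
    if should_grow(len(pop)):
        new = ala(data, data[rng.choice(len(data), k, replace=False)])[0]
        pop.append(new)
        fit.append(kmeans_objective(data, new))
    # Selection over the (possibly grown) population
    i1, i2 = rng.choice(len(pop), 2, replace=False)
    return int(i1), int(i2)
```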

Thus, in this paper, we intend to improve the GAs with the greedy heuristic crossover operator, which can be described as follows [43, 46]:
k-GA-ONE: the GA framework (Algorithm 5) with Algorithm 4 as the crossover operator, tournament replacement (Algorithm 6), dynamic population size adjustment (Algorithm 7), and an empty mutation operator.
k-GA-FULL: the same, but with Algorithm 3 as the crossover operator.
k-GA-RND: the same, but the crossover operator (Algorithm 3 or Algorithm 4) is selected randomly with equal probability.

The empty mutation operator can be replaced with a known or new procedure described in Section 3.

3. Known Methods of Increasing Population Diversity in the Genetic Algorithms

Despite the widespread use of various genetic algorithms for the k-means problem in the modern literature, there is practically no systematization of the approaches used [56–59]. For various methods of chromosome encoding, various mutation operators have been developed: bit inversion for binary encoding [50]; exchange, insert, inverse, and offset permutation for variable-length chromosomes [60]; Gaussian mutation [61]; and polynomial mutation for real coding [62, 63]. Some studies suggest a combination of mutation operators [64] or self-adaptive mutation operators [65–67]. The efficiency of various mutation operators depends on the GA parameters [53, 68, 69] and the problem type [70, 71]. However, the number of mutation operators with real encoding for continuous problems is very limited.

The GA for the network p-median problem described in [72] includes the hypermutation operator, which attempts to replace each gene in the chromosome with each gene from the set of genes that were not originally part of the processed chromosome. After each replacement, the algorithm checks for an improvement of the objective function value. The operator is computationally expensive due to numerous evaluations of the objective function and is actually similar to the local search principle embedded in the j-means algorithm [73]. In [74], the hypermutation algorithm was further developed into the nearest four neighbors’ algorithm. The idea is to reduce computational costs by restricting the set of genes used for the replacement to the nearest neighbors of the gene being replaced. In several works [37, 75, 76], the authors propose using the k-means algorithm itself as a mutation operator.

Each of these algorithms declares a local search as a mutation operator. The GA framework allows us to use a wide variety of genetic operator options. However, the local search is designed to improve an arbitrary solution by transforming it into a local optimum and thereby reducing, rather than increasing, the variety of chromosomes (solutions).

In [36, 42], the mutation operator is as follows (uniform random mutation). Randomly generate b ∈ [0, 1). If b < P_mut (where P_mut is the mutation probability), then the chromosome will mutate. Randomly generate δ ∈ [0, 1). If the current position (coordinate) of a centroid is v, the mutation operator modifies it as follows:

\[ v \leftarrow \begin{cases} v \pm 2\delta v, & v \neq 0, \\ v \pm 2\delta, & v = 0. \end{cases} \]

Signs “+” and “–” are used with the same probability [42]. This mutation operator shifts the centroid coordinates randomly. A similar technique with an “amplification factor” was used in [44, 77]. However, the distribution of local minima over the search space is not uniform [49]: new local minima of (1) can be found with higher probability in some neighborhood of a known local minimum than in a neighborhood of a randomly chosen point (here, by a neighborhood, we do not necessarily mean an ε-neighborhood but any subset of solutions which can be obtained by applying some defined procedure to the current solution). Combining local minima (subsets of centroids from two locally minimal solutions) usually outperforms a random shift of the centroid coordinates. The idea of combining local minima is the basic idea of the greedy heuristic crossover operator in genetic algorithms [38, 43] and other algorithms [21]. The greedy heuristic crossover operator for the discrete p-median problem proposed in [41] and adapted for continuous p-median and k-means problems in [38, 43] was used in GAs without any mutation operator. Such algorithms demonstrate more accurate results in comparison with many other algorithms for practically important middle-size problems.
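
A sketch of this uniform random mutation; drawing an independent δ for every coordinate is an implementation choice of ours rather than a detail taken from [36, 42].

```python
import numpy as np

def uniform_random_mutation(solution, p_mut, rng=np.random.default_rng()):
    """Sketch: with probability p_mut, shift every coordinate v of the
    chromosome by +/- 2*delta*v (or +/- 2*delta when v == 0),
    with delta drawn uniformly from [0, 1)."""
    s = solution.copy()
    if rng.random() < p_mut:
        delta = rng.random(s.shape)
        sign = rng.choice([-1.0, 1.0], size=s.shape)   # '+' and '-' equally probable
        shift = np.where(s != 0, 2 * delta * s, 2 * delta)
        s = s + sign * shift
    return s
```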

The other common approach to increasing the diversity in a population is to create subpopulations that develop more or less autonomously. Algorithms that produce subpopulations containing individuals gathered around optima form a wide class of such methods. The fitness sharing method [78] allows the evolutionary algorithm to search simultaneously in different areas (niches) corresponding to different local (or global) optima, i.e., this method allows one to identify and localize multiple optima in the search space. The group of crowding methods [79–81] also uses a niche approach. The general concept of crowding is that individuals compete for survival with similar offspring, with tournament selection applied to highly similar parent-child pairs. The main idea of genetic chromodynamics [82] is to force the formation and maintenance of stable subpopulations. The proposed scheme of local interaction provides stabilization of the subpopulations in the early stages of the search. Subpopulations co-develop and converge to several optimal solutions.

In [83], the authors present the roaming optimization method. By using subpopulations developing in isolation, multiple optima are found. This method uses the tendency of evolutionary algorithms to premature convergence, turning this disadvantage into an advantage in the process of detecting local optima.

4. New Modifications to the Genetic Algorithms

The essence of our new mutation operator (greedy heuristic mutation, GHM) is as follows. We apply the crossover operator to a single parent chromosome and a randomly generated chromosome improved by the ALA algorithm (Algorithm 1). In Step 9 of Algorithm 5, the mutation operator is replaced with Algorithm 8.

Require: Solution S.
 Randomly generate a new solution S′, |S′| = k; S′ ← ALA(S′); S″ ← Crossover(S, S′);
if F(S″) < F(S)
  S ← S″
end if
return S
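
A sketch of the proposed greedy heuristic mutation (Algorithm 8), where crossover is any of the greedy crossover variants sketched earlier. In the GA sketch above, it could be plugged in as mutation=lambda s: ghm(data, s, k, crossover_one).

```python
import numpy as np

def ghm(data, solution, k, crossover, rng=np.random.default_rng()):
    """Sketch of Algorithm 8: cross the single parent with a randomly
    generated, ALA-improved solution; keep the child only if it improves (1)."""
    random_sol = ala(data, data[rng.choice(len(data), k, replace=False)])[0]
    child = crossover(data, solution, random_sol, k)
    if kmeans_objective(data, child) < kmeans_objective(data, solution):
        return child
    return solution
```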

Despite the small populations in the genetic algorithms with the greedy agglomerative crossover, the application of a simple approach with two subpopulations allows us to improve the result of the algorithm. In our research, within the population, we organized two subpopulations of equal size. For the crossover and the tournament, both chromosomes are mainly selected within the same subpopulation. If one of the subpopulations does not provide an improvement in solutions during a certain number of iterations and its record (the best solution) is inferior to the record of the second subpopulation, its individuals are replaced by new ones (reinitialization of the subpopulation). We assumed that chromosomes in the same subpopulation tend to develop in a similar way under the influence of the crossover. Mutation of a separate chromosome increases the population diversity; however, under the influence of the crossover, the differences are gradually levelled out. Reinitialization of a subpopulation is a substitute for a complete restart of the algorithm while maintaining the record. Thus, Step 7 of Algorithm 5 (selection) is transformed into Algorithm 9.

Randomly choose r ∈ [0, 1);
if r falls into the range assigned to the first subpopulation
  Randomly choose two indexes i_1, i_2 from the first subpopulation (indexes 1, …, ⌊N_POP/2⌋), i_1 ≠ i_2;
else if r falls into the range assigned to the second subpopulation
  Randomly choose two indexes i_1, i_2 from the second subpopulation (indexes ⌊N_POP/2⌋ + 1, …, N_POP), i_1 ≠ i_2;
else
  Randomly choose two indexes i_1, i_2 from the whole population, i_1 ≠ i_2;
end if

Similarly, Step 10 of Algorithm 5 changes (see Algorithm 10).

Randomly choose r ∈ [0, 1);
if r falls into the range assigned to the first subpopulation
  Randomly choose two indexes i_1, i_2 from the first subpopulation, i_1 ≠ i_2;
else if r falls into the range assigned to the second subpopulation
  Randomly choose two indexes i_1, i_2 from the second subpopulation, i_1 ≠ i_2;
else
  Randomly choose two indexes i_1, i_2 from the whole population, i_1 ≠ i_2;
end if
if F(S_{i_1}) > F(S_{i_2})
  S_{i_1} ← S_C;
else
  S_{i_2} ← S_C;
end if

An additional Step is added to Algorithm 5 (see Algorithm 11).

if the algorithm gave no improvement during a predefined number of iterations
  Reinitialize all solutions in the inferior subpopulation, i.e., the subpopulation (with indexes 1, …, ⌊N_POP/2⌋ or ⌊N_POP/2⌋ + 1, …, N_POP) whose record solution is worse.
end if
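
A sketch of the subpopulation mechanisms (Algorithms 9 and 11); the share of within-subpopulation selections p_same and the stagnation threshold are tunable parameters whose exact values are assumptions of ours, not taken from the paper.

```python
import numpy as np

def select_within_subpopulations(n_pop, rng, p_same=0.9):
    """Sketch of Algorithm 9: with probability p_same both parents come from
    the same half of the population, otherwise from the whole population."""
    half = n_pop // 2
    r = rng.random()
    if r < p_same / 2:
        pool = np.arange(0, half)                 # first subpopulation
    elif r < p_same:
        pool = np.arange(half, n_pop)             # second subpopulation
    else:
        pool = np.arange(0, n_pop)                # whole population
    i1, i2 = rng.choice(pool, 2, replace=False)
    return int(i1), int(i2)

def reinitialize_if_stagnant(data, k, pop, fit, stagnant_iters, max_stagnant, rng):
    """Sketch of Algorithm 11: if there was no improvement for max_stagnant
    iterations, regenerate the half whose record (best) solution is inferior."""
    if stagnant_iters < max_stagnant:
        return
    half = len(pop) // 2
    worse_half = range(0, half) if min(fit[:half]) > min(fit[half:]) else range(half, len(pop))
    for i in worse_half:
        pop[i] = ala(data, data[rng.choice(len(data), k, replace=False)])[0]
        fit[i] = kmeans_objective(data, pop[i])
```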

The idea of the Variable Neighborhood Search with randomized neighborhoods (see [32]) is also based on applying the greedy heuristic procedures (Algorithms 2 and 3) to a current solution and a randomly generated one transformed into a local minimum by Algorithm 1. Our computational experiments (see Section 5) show that the new genetic algorithms with the greedy heuristic mutation (Algorithm 8) as the mutation operator outperform both the original genetic algorithms with the greedy agglomerative crossover operator (Algorithm 5 with an empty mutation operator) and the Variable Neighborhood Search with randomized neighborhoods.

As mentioned before, the greedy agglomerative crossover operator is a computationally expensive algorithm. In Algorithm 2, the objective function is calculated many times (once for each candidate centroid removal at each iteration of the elimination loop). Therefore, such algorithms are traditionally considered as methods for solving comparatively small problems (hundreds of thousands of data points and hundreds of centers). However, the rapid development of massively parallel processing systems (GPUs) allows us to solve large-scale problems with reasonable time expenses (minutes).

One of the most important issues of GAs is the convergence of the entire population into some narrow area (population degeneration) around some local minimum. In the first crossover iterations, the “child” solutions usually have significant advantages in the objective function value in comparison with their “parents” due to the ability of the greedy agglomerative crossover operator to produce much better solutions than the k-means procedure. On a single central processing unit, such GAs manage to perform only a few crossover operations due to the computationally expensive crossover procedure, and the population diversity problem is not important. Our computational experiments show that, with an increase in the computational capacities and an increase of the population size (which grows dynamically with the iteration number), the mutation operator plays a more important role.

5. Computational Experiments

Parallel (CUDA) implementations of the k-means (ALA) algorithm are known [84, 85], and we used this approach in our experiments. All other algorithms were implemented on the central processing unit.

For our experiments, we used the classic datasets from the UCI and Clustering basic benchmark repositories:
(a) Individual Household Electric Power Consumption (IHEPC): energy consumption data of households during several years (more than 2 million data vectors, 7 dimensions); 0–1 normalized data; “date” and “time” columns removed.
(b) SUSY (5,000,000 data vectors, 18 dimensions); 0–1 normalized data. Here, we do not take into account the true labelling provided by the database and use this dataset to search for internal structure in the data.
(c) Chess (King-Rook vs. King-Pawn, 3196 Boolean data vectors, 36 dimensions).
(d) BIRCH3 [10]: groups of points of random size on a plane (100,000 data vectors, 2 dimensions).
(e) Europe (map of Europe, 169,308 data vectors, 2 dimensions).
(f) Mopsi-Joensuu: locations of users (6014 data vectors, 2 dimensions).

The test system consisted of an Intel Core 2 Duo E8400 CPU, 16 GB RAM, and an NVIDIA GeForce GTX 1050 Ti GPU with 4096 MB RAM and a floating-point performance of 2138 GFLOPS. For all datasets, 30 attempts were made to run each of the 32 algorithms (Tables 1–6).

For comparison, we used the genetic algorithms with greedy heuristic crossover (k-GA-FULL, k-GA-ONE, and k-GA-RND described in Section 2), as well as the ALA procedure in the multistart mode and the j-means algorithm (centers are replaced with the data vectors) [73]. In addition, we ran various Variable Neighborhood Search (VNS) algorithms with randomized neighborhoods formed by the greedy heuristic procedure [32]; see algorithms k-GH-VNS1 and k-GH-VNS2. For the algorithms launched in the multistart mode (j-means and ALA), only the best results achieved in each attempt were recorded. The minimum, maximum, average, and median objective function values and the standard deviation over 30 runs are summarized. For all algorithms, we used the same implementation of the ALA procedure, which consumes the absolute majority of the computation time. The initial population size was the same for all genetic algorithms.

All algorithms were classified into three groups. The first group consists of known algorithms, including the genetic algorithms with greedy heuristic crossover. Algorithms of the second group are the genetic algorithms with greedy heuristic crossover and known mutation operators (the suffix -m1 denotes the uniform random mutation, and -m2 denotes the scramble mutation [86], where a gene (centroid) is replaced with a randomly chosen data point). We performed our experiments with various values of the mutation probability P_mut. Algorithms of the third group are genetic algorithms with greedy agglomerative crossover and the new instruments for maintaining the population diversity: the suffix -GHM denotes the algorithms with the new mutation operator, and -SUBPOP denotes the algorithms with the new mutation operator and two subpopulations.

In each group of algorithms, the best average and median values of the objective function (1) are underlined. We compared the best algorithms in the second and third groups with the best algorithm in the first group (the best of known algorithms) with the use of t-test and Mann–Whitney U test.

In the comparative analysis of algorithm efficiency, the choice of the unit of time plays an important role. The astronomical time spent by an algorithm strongly depends on the peculiarities of its implementation, the ability of the compiler to optimize the program code, and the fitness of the hardware for executing the code of a specific algorithm. Algorithms are often estimated by comparing the number of iterations performed (for example, the number of population generations for a GA) or the number of evaluations of the objective function. In our case, some of the algorithms are not evolutionary, and in the genetic algorithms, the execution time of the crossover operator with the embedded ALA algorithm can differ by hundreds of times. Therefore, comparing the number of generations is unacceptable. Comparison of the number of objective function calculations is also not quite correct. First, the ALA algorithm, which consumes almost all of the processor time, does not calculate (1) directly. Second, during the operation of the greedy agglomerative crossover operator, the number of centroids changes (it decreases from 2k down to k or from k + 1 down to k), and the time spent on computing the objective function also varies. Therefore, we nevertheless chose astronomical time as the scale for comparing the algorithms. Moreover, all the algorithms use the same implementation of the ALA algorithm launched under the same conditions.

In our computational experiments, the time limitation was used as the stop condition for all algorithms. As can be seen from Figures 1 and 2, the result of each algorithm depends on the elapsed time. Nevertheless, an advantage of the new algorithms remains regardless of the chosen time limit.

The range of values in all tables is small; nevertheless, the differences are statistically significant in several cases. In all cases, new algorithms with the greedy heuristic mutation outperform known ones or demonstrate approximately the same efficiency (difference in the results is statistically insignificant). Moreover, new algorithms demonstrate the stability of results (narrow range of objective function values). In most cases, the best results were achieved by the genetic algorithms with nonempty mutation operators.

6. Conclusions

When solving some large-scale clustering problems, traditional local search algorithms often give a result very far from the optimal solution. In this research, we aimed at developing not only a fast but also a highly accurate algorithm, based on genetic algorithms with the greedy heuristic crossover operator, for solving such optimization problems. Genetic algorithms with a greedy agglomerative crossover operator are among the methods for obtaining, within a fixed time, solutions that would be difficult to improve by known methods without a significant increase in computational costs. As the computational results presented in this article show, further improvement in the achieved result of such algorithms is possible by increasing the diversity in their populations.

Computational experiments show that population diversity maintaining mechanisms such as the mutation genetic operator and subpopulations improve the performance of genetic algorithms with greedy heuristic crossover for the large-scale k-means problem. Moreover, the best results are achieved by algorithms with a mutation operator based on the greedy heuristic crossover operator applied to a randomly generated chromosome (the new greedy heuristic mutation).

The similarity in the mathematical formulations of the k-means, k-medoids, and p-median problems, as well as the problem of separating a mixture of probability distributions, gives us a reasonable hope for the applicability of similar approaches to improving the results of solving those problems, which determines possible directions for further research.

Data Availability

In our work, we used only data from the UCI Machine Learning and Clustering Basic Benchmark repositories which are available at https://archive.ics.uci.edu/ml/index.php and http://cs.joensuu.fi/sipu/datasets.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Ministry of Science and Higher Education of the Russian Federation (State Contract no. FEFE-2020-0013).