Abstract

The k-means problem is one of the most popular models of cluster analysis. The problem is NP-hard, and modern literature offers many competing heuristic approaches. Sometimes practical problems require obtaining a result (albeit not an exact one), within the framework of the k-means model, which would be difficult to improve by known methods without a significant increase in the computation time or computational resources. In such cases, genetic algorithms with a greedy agglomerative heuristic crossover operator might be a good choice. However, their computational complexity makes it difficult to use them for large-scale problems. The crossover operator, which includes the k-means procedure and takes the absolute majority of the computation time, is essential for such algorithms, and other genetic operators such as mutation are usually eliminated or simplified. The importance of maintaining the population diversity, in particular with the use of a mutation operator, grows with an increase in the data volume and in the available computing resources such as graphics processing units (GPUs). In this article, we propose a new greedy heuristic mutation operator for such algorithms and investigate the influence of new and well-known mutation operators on the objective function value achieved by the genetic algorithms for large-scale k-means problems. Our computational experiments demonstrate the ability of the new mutation operator, as well as the mechanism for organizing subpopulations, to improve the result of the algorithm.

1. Introduction

The k-means problem is a continuous unconstrained global optimization problem which has become a classic clustering model. This problem is proved to be NP-hard [1, 2], so it is necessary to find a compromise between the computation time and the solution preciseness. The aim of the problem is to find a set of k points X_1, …, X_k called centroids in a d-dimensional space that minimizes the sum of squared distances from known points (data vectors) A_1, …, A_N to the nearest centroid [3]:

\[ \arg\min F(X_1, \ldots, X_k) = \sum_{i=1}^{N} \min_{j \in \{1, \ldots, k\}} \| X_j - A_i \|^2, \quad (1) \]

where ‖·‖ is the distance between two points (usually Euclidean) and k is given.
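
For illustration, here is a minimal NumPy sketch of evaluating objective (1) for a given set of centroids. The function and variable names are ours, not taken from the original implementation, and the dense distance matrix is for clarity only; it would not scale to millions of data vectors.

```python
import numpy as np

def kmeans_objective(data, centroids):
    """Sum of squared Euclidean distances from each data vector
    to its nearest centroid, i.e., objective function (1)."""
    # Pairwise squared distances, shape (N, k)
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each data vector contributes the distance to its nearest centroid
    return d2.min(axis=1).sum()
```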

Data vector indexes for which the jth centroid is the nearest one form a set (cluster) C_j, j = 1, …, k. An equivalent problem setting is as follows:

\[ \arg\min F(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i \in C_j} \| \overline{X}_j - A_i \|^2, \quad (2) \]

where \( \overline{X}_j = \frac{1}{|C_j|} \sum_{i \in C_j} A_i \) is the centroid of the jth cluster.

The simplest and most popular local search algorithm is the k-means algorithm [4, 5], also called the Alternate Location and Allocation (ALA) procedure [6, 7] or Lloyd algorithm. A similar procedure called EM (Expectation Maximization) [8, 9] and its modifications [10–12] are the most popular algorithms for separating a mixture of probability distributions. The k-means algorithm improves an intermediate solution sequentially, which enables us to find a local minimum.

Technically, this is not a true local search algorithm in terms of continuous optimization, as it searches for a new solution not necessarily in an ε-neighborhood of the existing solution. Nevertheless, it provides a solution which is locally optimal in an ε-neighborhood.

If we use distances instead of squared distances in (1), we deal with the continuous p-median problem. The similarity of these NP-hard problems [13, 14] allows us to use similar approaches to solving them. However, unlike the p-median problem with Euclidean distances, finding the exact solution of a 1-means problem (the k-means problem with k = 1, or the centroid search problem) in accordance with Algorithm 1 is trivial, and finding a local minimum of the k-means problem takes fewer computational resources. This allows the local search to be integrated into various effective global search strategies.

Require: Set of initial centroids X_1, …, X_k.
(1) For each centroid X_j, j = 1, …, k, define its cluster C_j as the subset of data vectors having X_j as their closest centroid.
(2) For each cluster C_j, j = 1, …, k, recalculate its centroid as follows: X_j ← (1/|C_j|) Σ_{i∈C_j} A_i.
(3) Repeat from Step 1 if Steps 1 and 2 made any changes.
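
A compact Python sketch of Algorithm 1 (the k-means/ALA procedure), assuming the data and centroids are NumPy arrays; the empty-cluster handling (an empty cluster keeps its old centroid) is a simplification of ours.

```python
import numpy as np

def ala(data, centroids, max_iter=1000):
    """Algorithm 1 (k-means/ALA): alternate cluster assignment and centroid
    recalculation until no assignment changes."""
    centroids = centroids.copy()
    labels = None
    for _ in range(max_iter):
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)                      # Step 1: assign clusters
        if labels is not None and np.array_equal(new_labels, labels):
            break                                           # Step 3: no changes, stop
        labels = new_labels
        for j in range(len(centroids)):                     # Step 2: recalculate centroids
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```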

In early attempts to solve the p-median problem (in its discrete modifications) by exact methods, the authors used branch and bound algorithms [15–17] for solving very small problems. In [18–20], the authors reviewed various heuristic solution techniques for the k-means and p-median problems. In [21–23], the authors presented local search approaches including Variable Neighborhood Search (VNS) and concentric search. In [22], Drezner et al. proposed heuristic procedures, including a genetic algorithm (GA), for rather small datasets.

Many approaches based on data reduction [24] simplify the problem by selecting some part of the initial dataset and then using the obtained results as an initial solution for the k-means algorithm on the complete dataset [25–28]. Such aggregation, as well as reducing the number of data vectors [29], enables us to solve large-scale problems within a reasonable time. However, such approaches lead to a reduction in preciseness. In our research aimed at obtaining the most precise solutions, we consider only the methods which estimate the objective function (1) directly, without aggregation or approximation approaches.

Modern publications offer many heuristic procedures [19, 30] for setting the initial centroids for the k-means algorithm, most of which belong to various evolutionary and random search methods. Local search algorithms and their randomized versions are widely represented. For instance, Variable Neighborhood Search (VNS) algorithms [23, 31, 32] or agglomerative algorithms [33, 34] sometimes show good results. A large number of articles are devoted to the initialization procedures for local search algorithms, such as random seeding and estimating the distribution of the data vectors [30]. The challenge is that, in many cases, even multiple runs of simple local search algorithms from various randomly generated solutions do not lead to a solution that is close to the global optimum. More advanced algorithms enable us to obtain objective function (1) values many times better than those of the local search methods [32].

The use of genetic algorithms and other evolutionary approaches to improve the results of the local search is a widely used idea [35–38]. Such algorithms recombine the local minima obtained by the k-means algorithm. GAs operate with a certain set (population) of candidate solutions and include special genetic operators (algorithms) of initialization, selection, crossover, and mutation. The mutation operator randomly changes the resulting solutions and provides some diversity in the population.

However, in genetic algorithms, as the number of iterations increases, the population degenerates into a certain set of solutions close to each other. Larger populations as well as dynamically growing populations improve this situation. However, simpler algorithms based on the use of the same greedy agglomerative procedures [32, 39] often show better results within the same computation time.

In this research, we do not discuss the adequacy of the k-means clustering model, which is actually questionable. We only focus on the preciseness and stability of the obtained objective function value (1) within the framework of the k-means model.

There are situations when the cost of error is high [9]. In these cases, as well as when comparing the accuracy of an algorithm with a certain standard solution (not necessarily globally optimal), we need to get a result that would be difficult to enhance by other known methods without a meaningful increase in the computation time. The evolution of parallel processing systems such as graphics processing units (GPUs) makes multiple runs of local search algorithms very cheap. In this case, large-scale problems (up to several millions of data vectors) can be solved with the use of the most advanced algorithms providing the highest preciseness. As our study shows, for large-scale problems, further improvement in the results of the genetic algorithms with greedy heuristic crossover can be achieved by using a special mutation operator and partially isolated solution subpopulations.

The aim of this paper is to introduce a new k-Means Genetic Algorithm with the greedy agglomerative crossover operator, a special greedy agglomerative mutation operator, and subpopulations.

The rest of this article is organized as follows. In Section 2, we propose a brief overview of known approaches to the development of k-means genetic algorithms. In Section 3, we give an overview of known mutation genetic operators used in k-median genetic algorithms in accordance with various approaches to chromosome encoding as well as other instruments for increasing the population diversity. In Section 4, we propose new modifications to the genetic algorithms with greedy heuristic crossover operator. Such modifications include partially isolated subpopulations and the use of a new mutation operator based on the greedy heuristic procedure. In Section 5, we describe the results of our computational experiments which demonstrate the efficiency of our new modifications on large datasets.

2. K-Means Genetic Algorithms

The idea of various genetic algorithms is based on a recombination (interchange) of elements in a set (“population”) of candidate solutions (“individuals”) encoded by “chromosomes.” Such elements of the chromosomes are called “genes” or “alleles.” Each chromosome is a vector of genes (bits, integers, or real numbers) representing a solution. The goal of gene recombination is achieving the best value of an objective function called “fitness function.” The appearance of the first genetic algorithms for solving the discrete p-median problem [40] preceded the genetic algorithms for the k-means problem (k-Means Genetic Algorithms). Alp et al. [41] proposed a rather fast and precise algorithm with a special “greedy” (agglomerative) heuristic procedure used as the crossover genetic operator for the network p-median problem. Such algorithms solve discrete network problems and use a very simple binary chromosome encoding (1 for the network nodes selected as the centers of the clusters, and 0 for those not selected).

In the genetic algorithms for the k-means and similar problems with binary-encoded chromosomes, many mutation techniques can be used. For example, in [42], the authors represent the chromosome with binary strings composed of binary-encoded features (coordinates) of the centroids. The mutation operator arbitrarily alters one or more components (binary substrings) of a selected chromosome.

If the centers or centroids are searched for in a continuous space, some genetic algorithms still use binary encoding [38, 43, 44]. The initial solutions of the k-means algorithm are usually subsets of the dataset {A_1, …, A_N}. In such a chromosome code, 1 means that the corresponding data vector must be selected as an initial centroid, and 0 means that it is not selected. In this case, some local search algorithm (the k-means algorithm or similar) is used at each iteration of the GA to estimate the final value (local minimum) of the objective function (1).

In [45], the authors refer to their algorithm as “Evolutionary k-Means.” However, they actually solved an alternative problem which aimed to increase the clustering stability instead of minimizing (1). Their algorithm operates with binary consensus matrices and uses two types of mutation genetic operators: cluster split (dissociative) and cluster merge (agglomerative) mutation. In [46], the chromosomes are strings of integers representing the cluster number for each of the clustered objects, and the authors solve the k-means problem while simultaneously determining the number of clusters based on the silhouette [47] and Davies–Bouldin [48] criteria, which are used as the fitness functions. Thus, in [46], the authors also solve a problem with a mathematical statement other than (1). Similar encoding is used in [37], where the authors propose a mutation operator which changes the assignment of individual data vectors to the clusters.

In [49], the authors described the mutation operator as a procedure that guarantees population diversity (variability). Usually, for the k-means and p-median problems, the mutation randomly changes one or many chromosomes, replacing some centroids [36, 37]. Mutation and crossover are the most important genetic operators, playing different roles: the crossover seeks to preserve the features of parent solutions, while the mutation tries to cause small local changes in the solutions. Compared to a crossover, a mutation is usually regarded as a secondary operator with a low probability [50]. A high frequency of mutations makes a genetic algorithm search randomly and chaotically. Nevertheless, many studies have shown that evolutionary algorithms without a crossover can work better than a standard genetic algorithm if the mutation is combined with an effective selection operator [51–54]. Mutation is performed on a single parent solution.

In [36], the authors encode the solutions (chromosomes) in their GA as sets of centroids represented by their coordinates (real vectors or arrays). The genetic algorithms with the greedy heuristic crossover operator use the same principle [55].

Thus, various genetic algorithms for the k-means and similar problems can be classified into three categories in accordance with the chromosome encoding method:
(a) Integer encoding: each gene represents a data vector A_i, and its value is the cluster number. Such algorithms are declared to solve the k-means or p-median problem; however, they may use an objective function other than (1). A local search for the minimum of (1) is sometimes declared to be their mutation operator.
(b) Integer or binary encoding: each gene corresponds to a centroid (cluster) and stores the index of the data vector selected as the initial centroid for the local search method. Such algorithms may use a wide variety of crossover and mutation operators.
(c) Real (direct) encoding: each gene is a centroid encoded by its coordinates. Such algorithms are able to demonstrate the most precise results. However, the modern literature offers a very limited variety of mutation operators for such algorithms; usually, they do not use any mutation [38, 41, 43].

The greedy heuristic crossover operator can be described as a two-step algorithm. The first step combines two known (“parent”) solutions (chromosomes) into one intermediate invalid solution with an excessive number of centroids (clusters). At the second step (the greedy agglomerative procedure), the algorithm removes, at each iteration, the excessive centroids whose removal results in the least significant growth of the objective function (1) [41, 43], see Algorithm 2.

Require: Final number of centroids k, initial solution S = {X_1, …, X_K}, |S| = K > k.
(1) Improve S with the local search algorithm if possible: S ← ALA(S).
(2) while |S| > k
(3)  for all i ∈ {1, …, |S|}
(4)   Assign S′ ← S \ {X_i};
(5)   Calculate F_i ← F(S′) // where F is the objective function (1)
(6)  end for
(7)  Select a subset E ⊂ S of centroids with the minimal values of the corresponding variables F_i. Here, |E| ∈ {1, …, |S| − k} is the number of centroids eliminated at this iteration.
(8)  Obtain the new solution S ← S \ E, and improve this new solution with the local search algorithm: S ← ALA(S).
(9) end while
(10) return Solution S.
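
A possible Python sketch of the greedy agglomerative procedure (Algorithm 2), reusing the ala and kmeans_objective helpers sketched above. For simplicity, it eliminates a single centroid per iteration, whereas the procedure may remove a small subset of centroids at once.

```python
import numpy as np

def greedy_agglomerative(data, solution, k):
    """Algorithm 2 (sketch): start from an excessive set of centroids and
    repeatedly drop the centroid whose removal increases (1) the least,
    improving the remaining set with ALA after each elimination."""
    s, _ = ala(data, solution)
    while len(s) > k:
        # Objective value after tentatively removing each centroid
        losses = [kmeans_objective(data, np.delete(s, i, axis=0))
                  for i in range(len(s))]
        # Eliminate one centroid per iteration (a simplification of ours)
        s = np.delete(s, int(np.argmin(losses)), axis=0)
        s, _ = ala(data, s)
    return s
```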

Algorithms 3 and 4 are known heuristic procedures [32, 41, 43] which implement the first step of the greedy heuristic crossover operator and then run the greedy agglomerative procedure (Algorithm 2).

Require: Two solutions (sets of centers) S_1 and S_2.
return the result of Algorithm 2 applied to S_1 ∪ S_2 with the final number of centroids k.

Require: Two solutions (sets of centers) S_1 and S_2.
for all X′ ∈ S_2
  Merge S_1 and one item of S_2, and run Algorithm 2: S′ ← result of Algorithm 2 applied to S_1 ∪ {X′} with the final number of centroids k;
end for
return the best (by (1)) of the obtained solutions S′.
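
The two combination schemes can then be sketched as follows, again reusing the helpers above; crossover_full corresponds to the full merge and crossover_one to the one-by-one merge, and both names are ours.

```python
import numpy as np

def crossover_full(data, s1, s2, k):
    """Algorithm 3 (sketch): merge both parent solutions at once and
    reduce the union back to k centroids with Algorithm 2."""
    return greedy_agglomerative(data, np.vstack([s1, s2]), k)

def crossover_one(data, s1, s2, k):
    """Algorithm 4 (sketch): add the second parent's centroids one by one
    and keep the best of the resulting child solutions."""
    best, best_f = None, np.inf
    for x in s2:
        child = greedy_agglomerative(data, np.vstack([s1, x[None, :]]), k)
        f = kmeans_objective(data, child)
        if f < best_f:
            best, best_f = child, f
    return best
```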

These algorithms can be included in various global search strategies. Combining the items (centroids) of a solution S_1 with the items of another solution S_2 and running Algorithm 2, we get a set of “child” solutions. Such sets of solutions are used as neighborhoods in which a better solution is sought. Thus, the second solution is a parameter of the neighborhood [32].

The general framework of the GA for the k-means and similar location problems can be described as Algorithm 5.

Require: Initial population size N_POP (a small value in our experiments).
(1) Assign N_iter ← 0. Generate N_POP initial solutions S_1, …, S_{N_POP}, where |S_i| = k. For each initial solution, run the ALA algorithm (Algorithm 1): S_i ← ALA(S_i);
(2) loop
(3)  N_iter ← N_iter + 1;
(4) if the stop condition is satisfied
(5)  return the solution from the population with the minimal value of the objective function (1)
(6) end if
(7) Selection: randomly choose two indexes i_1, i_2 ∈ {1, …, N_POP}, i_1 ≠ i_2;
(8) Run the chosen crossover operator: S_C ← Crossover(S_{i_1}, S_{i_2});
(9) Run the chosen mutation operator: S_C ← Mutation(S_C);
(10) Run the chosen procedure to replace a solution in the population with S_C;
(11) end loop
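
A sketch of the overall GA framework (Algorithm 5) with a time-based stop condition and the tournament replacement described below (Algorithm 6); dynamic population growth and subpopulations are omitted here, and the helper names follow the earlier sketches.

```python
import time
import numpy as np

def k_ga(data, k, n_pop, time_limit, crossover, mutation=lambda s: s,
         rng=np.random.default_rng()):
    """Sketch of Algorithm 5: a small population of centroid sets, greedy
    crossover, optional mutation, and tournament replacement."""
    # Initialization: random subsets of the data improved by ALA
    pop = [ala(data, data[rng.choice(len(data), k, replace=False)])[0]
           for _ in range(n_pop)]
    fit = [kmeans_objective(data, s) for s in pop]
    start = time.time()
    while time.time() - start < time_limit:           # stop condition: time limit
        i1, i2 = rng.choice(n_pop, 2, replace=False)  # selection
        child = crossover(data, pop[i1], pop[i2], k)  # greedy crossover
        child = mutation(child)                       # mutation (may be empty)
        f_child = kmeans_objective(data, child)
        j1, j2 = rng.choice(n_pop, 2, replace=False)  # tournament replacement
        worse = j1 if fit[j1] > fit[j2] else j2
        pop[worse], fit[worse] = child, f_child
    best = int(np.argmin(fit))
    return pop[best], fit[best]
```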

The objective function is (1). We used the tournament selection (tournament replacement, see Algorithm 6) for Step 10 of Algorithm 5:

Randomly choose two indexes i_1, i_2 ∈ {1, …, N_POP}, i_1 ≠ i_2;
if F(S_{i_1}) > F(S_{i_2})
  S_{i_1} ← S_C;
else
  S_{i_2} ← S_C;
end if

Such algorithms usually operate with a very small population, and other selection procedures do not improve the results significantly [41, 43, 46].

In the GAs with greedy heuristic crossover [43, 44], Algorithms 3 and 4 are used as the crossover genetic operator. These operators are computationally expensive due to multiple runs of the ALA algorithm. In the case of large-scale problems and a very strict time limitation, GAs with the greedy heuristic crossover operator perform only a few iterations. The population size is usually small, 10–25 chromosomes. Dynamically growing populations [43, 44] are able to improve the results. In this case, Step 7 of Algorithm 5 is replaced by the following procedure (see Algorithm 7).

Adjust (increase) the population size N_POP in accordance with the chosen population growth rule;
if N_POP has changed
  initialize the new individual S_{N_POP}: generate it randomly, |S_{N_POP}| = k, and run S_{N_POP} ← ALA(S_{N_POP});
end if
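
A sketch of the dynamic-growth step (Algorithm 7); the actual growth rule is not reproduced here, so it is passed in as a should_grow callable (an assumption of ours), and a new individual is initialized exactly as in Step 1 of Algorithm 5.

```python
def grow_population(data, k, pop, fit, rng, should_grow):
    """Sketch of Algorithm 7: optionally grow the population by one randomly
    generated, ALA-improved individual, then select two parent indexes."""
    if should_grow(len(pop)):
        new = ala(data, data[rng.choice(len(data), k, replace=False)])[0]
        pop.append(new)
        fit.append(kmeans_objective(data, new))
    # Selection over the (possibly grown) population
    i1, i2 = rng.choice(len(pop), 2, replace=False)
    return int(i1), int(i2)
```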

Thus, in this paper, we intend to improve the GAs with the greedy heuristic crossover operator, which can be described as follows [43, 46]:
k-GA-ONE: the GA framework (Algorithm 5) with Algorithm 4 as the crossover operator, tournament replacement (Algorithm 6), dynamic population size adjustment (Algorithm 7), and an empty mutation operator.
k-GA-FULL: the same, but with Algorithm 3 as the crossover operator.
k-GA-RND: the same, but the crossover operator (Algorithm 3 or Algorithm 4) is selected randomly with equal probability.

The empty mutation operator can be replaced with a known or new procedure described in Section 3.

3. Known Methods of Increasing Population Diversity in the Genetic Algorithms

Despite the widespread use of various genetic algorithms for the k-means problem in the modern literature, there is practically no systematization of the approaches used [56–59]. For various methods of chromosome encoding, various mutation operators have been developed: bit inversion for binary encoding [50]; exchange, insert, inverse, and offset permutation for variable-length chromosomes [60]; Gaussian mutation [61]; and polynomial mutation for real coding [62, 63]. Some studies suggest a combination of mutation operators [64] or self-adaptive mutation operators [65–67]. The efficiency of various mutation operators depends on the GA parameters [53, 68, 69] and the problem type [70, 71]. However, the number of mutation operators with real encoding for continuous problems is very limited.

The GA for the network p-median problem described in [72] includes the hypermutation operator, which attempts to replace each gene in the chromosome with each gene from the set of genes that were not originally part of the processed chromosome. After each replacement, the algorithm checks for an improvement of the objective function value. The operator is computationally expensive due to numerous evaluations of the objective function and is actually similar to the local search principle embedded in the j-means algorithm [73]. In [74], the hypermutation algorithm was further developed into the nearest four neighbors’ algorithm. The idea is to reduce computational costs by restricting the set of genes used for the replacement to the nearest neighbors of the gene being replaced. In several works [37, 75, 76], the authors propose using the k-means algorithm itself as a mutation operator.

Each of these algorithms declares a local search as a mutation operator. The GA framework allows us to use a wide variety of genetic operator options. However, the local search is designed to improve an arbitrary solution by transforming it into a local optimum and thereby reducing, rather than increasing, the variety of chromosomes (solutions).

In [36, 42], the mutation operator is as follows (uniform random mutation). Randomly generate b ∈ [0, 1). If b < P_mut (where P_mut is the mutation probability), then the chromosome will mutate. Randomly generate δ ∈ [0, 1). If the current position (coordinate) of a centroid is v, the mutation operator modifies it as follows:

\[ v \leftarrow \begin{cases} v \pm 2\delta v, & v \neq 0, \\ v \pm 2\delta, & v = 0. \end{cases} \]

Signs “+” and “–” are used with the same probability [42]. This mutation operator shifts the centroid coordinates randomly. A similar technique with an “amplification factor” was used in [44, 77]. However, the distribution of local minima over the search space is not uniform [49]: new local minima of (1) can be found with higher probability in some neighborhood of a known local minimum than in a neighborhood of a randomly chosen point (here, by a neighborhood, we do not necessarily mean an ε-neighborhood but any subset of solutions which can be obtained by applying some defined procedure to the current solution). Combining local minima (subsets of centroids from two locally minimal solutions) usually outperforms a random shift of the centroid coordinates. The idea of combining local minima is the basic idea of the greedy heuristic crossover operator in genetic algorithms [38, 43] and other algorithms [21]. The greedy heuristic crossover operator for the discrete p-median problem proposed in [41] and adapted for continuous p-median and k-means problems in [38, 43] was used in GAs without any mutation operator. Such algorithms demonstrate more accurate results in comparison with many other algorithms for practically important middle-size problems.
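
A sketch of this uniform random mutation; drawing an independent δ for every coordinate is an implementation choice of ours rather than a detail taken from [36, 42].

```python
import numpy as np

def uniform_random_mutation(solution, p_mut, rng=np.random.default_rng()):
    """Sketch: with probability p_mut, shift every coordinate v of the
    chromosome by +/- 2*delta*v (or +/- 2*delta when v == 0),
    with delta drawn uniformly from [0, 1)."""
    s = solution.copy()
    if rng.random() < p_mut:
        delta = rng.random(s.shape)
        sign = rng.choice([-1.0, 1.0], size=s.shape)   # '+' and '-' equally probable
        shift = np.where(s != 0, 2 * delta * s, 2 * delta)
        s = s + sign * shift
    return s
```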

The other common approach to increasing the diversity in a population is to create subpopulations that develop more or less autonomously. Algorithms that produce subpopulations containing individuals gathered around optima form a wide class of such methods. The fitness sharing method [78] allows the evolutionary algorithm to search simultaneously in different areas (niches) corresponding to different local (or global) optima, i.e., this method allows one to identify and localize multiple optima in the search space. The group of crowding methods [79–81] also uses a niche approach. The general concept of crowding is that individuals compete for survival with similar offspring, with tournament selection applied to highly similar parent-child pairs. The main idea of genetic chromodynamics [82] is to force the formation and maintenance of stable subpopulations. The proposed scheme of local interaction provides stabilization of the subpopulations in the early stages of the search. Subpopulations co-develop and converge to several optimal solutions.

In [83], the authors present the roaming optimization method. By using subpopulations developing in isolation, multiple optima are found. This method uses the tendency of evolutionary algorithms to premature convergence, turning this disadvantage into an advantage in the process of detecting local optima.

4. New Modifications to the Genetic Algorithms

The essence of our new mutation operator (greedy heuristic mutation, GHM) is as follows. We apply the crossover operator to a single parent chromosome and a randomly generated chromosome improved by the ALA algorithm (Algorithm 1). In Step 9 of Algorithm 5, the mutation operator is replaced with Algorithm 8.

Require: Solution S.
 Randomly generate a new solution S′, |S′| = k; S′ ← ALA(S′); S″ ← Crossover(S, S′);
if F(S″) < F(S)
  S ← S″
end if
return S
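
A sketch of the proposed greedy heuristic mutation (Algorithm 8), where crossover is any of the greedy crossover variants sketched earlier. In the GA sketch above, it could be plugged in as mutation=lambda s: ghm(data, s, k, crossover_one).

```python
import numpy as np

def ghm(data, solution, k, crossover, rng=np.random.default_rng()):
    """Sketch of Algorithm 8: cross the single parent with a randomly
    generated, ALA-improved solution; keep the child only if it improves (1)."""
    random_sol = ala(data, data[rng.choice(len(data), k, replace=False)])[0]
    child = crossover(data, solution, random_sol, k)
    if kmeans_objective(data, child) < kmeans_objective(data, solution):
        return child
    return solution
```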

Despite the small populations in the genetic algorithms with the greedy agglomerative crossover, the application of a simple approach with two subpopulations allows us to improve the result of the algorithm. In our research, within the population, we organized two subpopulations of equal size. For the crossover and the tournament, both chromosomes are mainly selected within the same subpopulation. If one of the subpopulations does not provide an improvement in solutions during a certain number of iterations and its record (the best solution) is inferior to the record of the second subpopulation, its individuals are replaced by new ones (reinitialization of the subpopulation). We assumed that chromosomes in the same subpopulation tend to develop in a similar way under the influence of the crossover. Mutation of a separate chromosome increases the population diversity; however, under the influence of the crossover, the differences are gradually levelled out. Reinitialization of a subpopulation is a substitute for a complete restart of the algorithm while maintaining the record. Thus, Step 7 of Algorithm 5 (selection) is transformed into Algorithm 9.

Randomly choose r ∈ [0, 1);
if r falls into the range assigned to the first subpopulation
  Randomly choose two indexes i_1, i_2 from the first subpopulation (indexes 1, …, ⌊N_POP/2⌋), i_1 ≠ i_2;
else if r falls into the range assigned to the second subpopulation
  Randomly choose two indexes i_1, i_2 from the second subpopulation (indexes ⌊N_POP/2⌋ + 1, …, N_POP), i_1 ≠ i_2;
else
  Randomly choose two indexes i_1, i_2 from the whole population, i_1 ≠ i_2;
end if

Similarly, Step 10 of Algorithm 5 changes (see Algorithm 10).

Randomly choose r ∈ [0, 1);
if r falls into the range assigned to the first subpopulation
  Randomly choose two indexes i_1, i_2 from the first subpopulation, i_1 ≠ i_2;
else if r falls into the range assigned to the second subpopulation
  Randomly choose two indexes i_1, i_2 from the second subpopulation, i_1 ≠ i_2;
else
  Randomly choose two indexes i_1, i_2 from the whole population, i_1 ≠ i_2;
end if
if F(S_{i_1}) > F(S_{i_2})
  S_{i_1} ← S_C;
else
  S_{i_2} ← S_C;
end if

An additional Step is added to Algorithm 5 (see Algorithm 11).

if the algorithm gave no improvement during a predefined number of iterations
  Reinitialize all solutions in the inferior subpopulation, i.e., the subpopulation (with indexes 1, …, ⌊N_POP/2⌋ or ⌊N_POP/2⌋ + 1, …, N_POP) whose record solution is worse.
end if
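
A sketch of the subpopulation mechanisms (Algorithms 9 and 11); the share of within-subpopulation selections p_same and the stagnation threshold are tunable parameters whose exact values are assumptions of ours, not taken from the paper.

```python
import numpy as np

def select_within_subpopulations(n_pop, rng, p_same=0.9):
    """Sketch of Algorithm 9: with probability p_same both parents come from
    the same half of the population, otherwise from the whole population."""
    half = n_pop // 2
    r = rng.random()
    if r < p_same / 2:
        pool = np.arange(0, half)                 # first subpopulation
    elif r < p_same:
        pool = np.arange(half, n_pop)             # second subpopulation
    else:
        pool = np.arange(0, n_pop)                # whole population
    i1, i2 = rng.choice(pool, 2, replace=False)
    return int(i1), int(i2)

def reinitialize_if_stagnant(data, k, pop, fit, stagnant_iters, max_stagnant, rng):
    """Sketch of Algorithm 11: if there was no improvement for max_stagnant
    iterations, regenerate the half whose record (best) solution is inferior."""
    if stagnant_iters < max_stagnant:
        return
    half = len(pop) // 2
    worse_half = range(0, half) if min(fit[:half]) > min(fit[half:]) else range(half, len(pop))
    for i in worse_half:
        pop[i] = ala(data, data[rng.choice(len(data), k, replace=False)])[0]
        fit[i] = kmeans_objective(data, pop[i])
```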

The idea of the Variable Neighborhood Search with randomized neighborhoods (see [32]) is also based on applying the greedy heuristic procedures (Algorithms 2 and 3) to a current solution and a randomly generated one transformed into a local minimum by Algorithm 1. Our computational experiments (see Section 5) show that the new genetic algorithms with the greedy heuristic mutation (Algorithm 8) as the mutation operator outperform both the original genetic algorithms with the greedy agglomerative crossover operator (Algorithm 5 with an empty mutation operator) and the Variable Neighborhood Search with randomized neighborhoods.

As mentioned before, the greedy agglomerative crossover operator is a computationally expensive algorithm. In Algorithm 2, the objective function is calculated many times (once for each candidate centroid removal at each iteration of the elimination loop). Therefore, such algorithms are traditionally considered as methods for solving comparatively small problems (hundreds of thousands of data points and hundreds of centers). However, the rapid development of massively parallel processing systems (GPUs) allows us to solve large-scale problems with reasonable time expenses (minutes).

One of the most important issues of GAs is the convergence of the entire population into some narrow area (population degeneration) around some local minimum. In the first crossover iterations, the “child” solutions usually have significant advantages in the objective function value in comparison with their “parents” due to the ability of the greedy agglomerative crossover operator to produce much better solutions than the k-means procedure. On a single central processing unit, such GAs manage to perform only a few crossover operations due to the computationally expensive crossover procedure, and the population diversity problem is not important. Our computational experiments show that, with an increase in the computational capacities and an increase of the population size (which grows dynamically with the iteration number), the mutation operator plays a more important role.

5. Computational Experiments

Parallel (CUDA) implementations of the k-means (ALA) algorithm are known [84, 85], and we used this approach in our experiments. All other algorithms were implemented on the central processing unit.

For our experiments, we used the classic datasets from the UCI and Clustering basic benchmark repositories:
(a) Individual Household Electric Power Consumption (IHEPC): energy consumption data of households during several years (more than 2 million data vectors, 7 dimensions); 0–1 normalized data; “date” and “time” columns removed.
(b) SUSY (5,000,000 data vectors, 18 dimensions); 0–1 normalized data. Here, we do not take into account the true labelling provided by the database and use this dataset to search for internal structure in the data.
(c) Chess (King-Rook vs. King-Pawn, 3196 Boolean data vectors, 36 dimensions).
(d) BIRCH3 [10]: groups of points of random size on a plane (100,000 data vectors, 2 dimensions).
(e) Europe (map of Europe, 169,308 data vectors, 2 dimensions).
(f) Mopsi-Joensuu: locations of users (6014 data vectors, 2 dimensions).

The test system consisted of an Intel Core 2 Duo E8400 CPU, 16 GB RAM, and an NVIDIA GeForce GTX 1050 Ti GPU with 4096 MB RAM and a floating-point performance of 2138 GFLOPS. For all datasets, 30 attempts were made to run each of the 32 algorithms (Tables 1–6).

For comparison, we used the genetic algorithms with greedy heuristic crossover (k-GA-FULL, k-GA-ONE, and k-GA-RND described in Section 2), as well as the ALA procedure in the multistart mode and the j-means algorithm (centers are replaced with the data vectors) [73]. In addition, we ran various Variable Neighborhood Search (VNS) algorithms with randomized neighborhoods formed by the greedy heuristic procedure [32]; see algorithms k-GH-VNS1 and k-GH-VNS2. For the algorithms launched in the multistart mode (j-means and ALA), only the best results achieved in each attempt were recorded. The minimum, maximum, average, and median objective function values and the standard deviation over 30 runs are summarized. For all algorithms, we used the same implementation of the ALA procedure, which consumes the absolute majority of the computation time. The initial population size was the same for all genetic algorithms.

All algorithms were classified into three groups. The first group consists of known algorithms, including the genetic algorithms with greedy heuristic crossover. Algorithms of the second group are the genetic algorithms with greedy heuristic crossover and known mutation operators (the suffix -m1 denotes the uniform random mutation, and -m2 denotes the scramble mutation [86], where a gene (centroid) is replaced with a randomly chosen data point). We performed our experiments with various values of the mutation probability P_mut. Algorithms of the third group are genetic algorithms with greedy agglomerative crossover and the new instruments for maintaining the population diversity: the suffix -GHM denotes the algorithms with the new mutation operator, and -SUBPOP denotes the algorithms with the new mutation operator and two subpopulations.

In each group of algorithms, the best average and median values of the objective function (1) are underlined. We compared the best algorithms in the second and third groups with the best algorithm in the first group (the best of known algorithms) with the use of t-test and Mann–Whitney U test.

In the comparative analysis of algorithm efficiency, the choice of the unit of time plays an important role. The astronomical time spent by an algorithm strongly depends on the peculiarities of its implementation, the ability of the compiler to optimize the program code, and the fitness of the hardware for executing the code of a specific algorithm. Algorithms are often estimated by comparing the number of iterations performed (for example, the number of population generations for a GA) or the number of evaluations of the objective function. In our case, some of the algorithms are not evolutionary, and in the genetic algorithms, the execution time of the crossover operator with the embedded ALA algorithm can differ by hundreds of times. Therefore, comparing the number of generations is unacceptable. Comparison of the number of objective function calculations is also not quite correct. First, the ALA algorithm, which consumes almost all of the processor time, does not calculate (1) directly. Second, during the operation of the greedy agglomerative crossover operator, the number of centroids changes (it decreases from 2k down to k or from k + 1 down to k), and the time spent on computing the objective function also varies. Therefore, we nevertheless chose astronomical time as the scale for comparing the algorithms. Moreover, all the algorithms use the same implementation of the ALA algorithm launched under the same conditions.

In our computational experiments, the time limitation was used as the stop condition for all algorithms. As can be seen from Figures 1 and 2, the result of each algorithm depends on the elapsed time. Nevertheless, an advantage of the new algorithms remains regardless of the chosen time limit.

The range of values in all tables is small; nevertheless, the differences are statistically significant in several cases. In all cases, new algorithms with the greedy heuristic mutation outperform known ones or demonstrate approximately the same efficiency (difference in the results is statistically insignificant). Moreover, new algorithms demonstrate the stability of results (narrow range of objective function values). In most cases, the best results were achieved by the genetic algorithms with nonempty mutation operators.

6. Conclusions

When solving some large-scale clustering problems, traditional local search algorithms often give a result very far from the optimal solution. In this research, we aimed at developing not only a fast but also a highly accurate algorithm, based on genetic algorithms with the greedy heuristic crossover operator, for solving such optimization problems. Genetic algorithms with a greedy agglomerative crossover operator are among the methods for obtaining, within a fixed time, solutions that would be difficult to improve by known methods without a significant increase in computational costs. As the computational results presented in this article show, further improvement in the achieved result of such algorithms is possible by increasing the diversity in their populations.

Computational experiments show that population diversity maintaining mechanisms such as the mutation genetic operator and subpopulations improve the performance of genetic algorithms with greedy heuristic crossover for the large-scale k-means problem. Moreover, the best results are achieved by algorithms with a mutation operator based on the greedy heuristic crossover operator applied to a randomly generated chromosome (the new greedy heuristic mutation).

The similarity in the mathematical formulations of the k-means, k-medoids, and p-median problems, as well as the problem of separating a mixture of probability distributions, gives us a reasonable hope for the applicability of similar approaches to improving the results of solving those problems, which determines possible directions for further research.

Data Availability

In our work, we used only data from the UCI Machine Learning and Clustering Basic Benchmark repositories which are available at https://archive.ics.uci.edu/ml/index.php and http://cs.joensuu.fi/sipu/datasets.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Ministry of Science and Higher Education of the Russian Federation (State Contract no. FEFE-2020-0013).