Abstract

K-means is one of the most popular cluster analysis algorithms and is widely used in various fields. Nevertheless, it still has some shortcomings; for example, it is extremely sensitive to the selection of the initial center points and to special points such as noise and outliers. Therefore, this paper proposes an initial center point selection optimization and a phased assignment optimization to improve the k-means algorithm. The experimental results on 15 real-world and 10 synthetic datasets show that the improved k-means outperforms its main competitor k-means++ and that, under the same setting conditions, namely using the default parameters, its clustering performance is better than Affinity Propagation, Mean Shift, and DBSCAN. The proposed algorithm was applied to analyze airline seat selection data in order to group air passengers. The clustering results, together with an absolute deviation rate analysis, realized customer grouping and identified suitable audience groups for the recommendation of seat selection services.

1. Introduction

Clustering divides a dataset into nonoverlapping subsets such that the objects within a cluster are as similar as possible and the objects in different clusters are as dissimilar as possible [1]. There are numerous kinds of clustering algorithms, such as AP [2] and DPC [3–6], which show excellent clustering performance. However, k-means, one of the most classic clustering algorithms, which aims to partition the given dataset into subsets so as to minimize the within-cluster sum of squared distances, continues to be one of the most popular clustering algorithms [7]. Its efficiency and simplicity of implementation have made it successful in various fields, such as image processing [8, 9], education [10], bioinformatics [11], medicine [12], partial multiview data [13], agricultural data [14], and fuzzy decision-making [15].

Optimizing the initial center points may be one of the most effective ways to improve the performance of the k-means algorithm. The study of Fränti and Sieranoja [16] reported that (a) the k-means clustering algorithm can be significantly improved by using a better initialization technique and by repeating (restarting) the algorithm; (b) when the data have overlapping clusters, k-means can improve the results of the initialization technique; (c) when the data have well-separated clusters, the performance of k-means depends completely on the goodness of the initialization; and (d) initialization using the simple furthest-point heuristic (Maxmin) reduces the clustering error of k-means from 15% to 6%, on average. With the popularity of deep learning in various fields, optimizing the data representation is also a means to improve clustering performance, especially for high-dimensional data. The robust deep k-means (RDKM) algorithm [17] exploits the hierarchical information of multiple-level attributes by using a deep structure to hierarchically perform k-means.

The k-means++ algorithm [18] provided a simple and effective initial center point optimization method. It adds new center points one by one and assigns a different selection probability to each potential center point. Since then, especially after being embedded in scikit-learn as the default k-means initialization, it has almost become the first choice among partition-based clustering algorithms. However, because k-means++ selects the first center point uniformly at random and adds subsequent center points randomly according to the probability, some special data distributions can still lead to poor, even unreasonable, clustering results. For example, a dataset with five clusters was synthesized, and some noise points surrounding them in a half-circle were added. The clustering result of k-means++ is shown in Figure 1, where each color represents a cluster. The desired result is that the points in the upper left corner are divided into five clusters, but the actual result is that the points in the lower part (green points) are wrongly grouped into a single cluster. In this paper, methods are proposed to solve this problem.
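To make the scenario of Figure 1 concrete, the following sketch shows one way such a dataset could be generated; the coordinates, cluster sizes, and noise radius are illustrative assumptions, not the exact values used for the K1 dataset.

```python
# A minimal sketch: five Gaussian clusters plus noise points arranged on a
# half-circle around them, in the spirit of the K1 dataset in Figure 1.
import numpy as np
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)

# Five compact clusters in a small region (illustrative centers).
X_clusters, y = make_blobs(n_samples=500, centers=5, cluster_std=0.5,
                           center_box=(-6.0, -2.0), random_state=0)

# Noise points on a half-circle surrounding the clusters.
theta = rng.uniform(-np.pi / 2, np.pi / 2, size=60)
radius = 10.0 + rng.normal(0.0, 0.3, size=60)
noise = np.column_stack([-4.0 + radius * np.cos(theta),
                         -4.0 + radius * np.sin(theta)])

X = np.vstack([X_clusters, noise])  # final dataset with clusters and noise
```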

Cluster analysis is one of the basic methods of data knowledge discovery. With the development of the airline business, ancillary services that satisfy passengers' personal requirements are becoming more and more important for airlines [19, 20]. However, owing to the impact of COVID-19, the airline market faced a dramatic regression (2019–2021), compelling airlines to seek revenue beyond flight tickets [21, 22]. Establishing ancillary services is therefore significantly important for airlines because of their ability to increase revenue. In this paper, the improved k-means algorithm is applied to the cluster analysis of an airline seat selection dataset, aiming to group airline passengers to support the establishment of ancillary services.

Based on the above analysis and application requirements, this paper proposes an improved k-means algorithm, called k-means2o, based on initial center point selection optimization and phased assignment optimization, and applies it to the cluster analysis of an airline seat selection dataset. The main contributions are summarized as follows:

(1) Two optimization methods are proposed for the k-means algorithm: initial center point selection and phased assignment. The initial center point selection optimization inherits the incremental center point strategy of k-means++ [18], K-MC² [23], and AFK-MC² [24] but redefines the first center point selection strategy and the subsequent center point incremental strategy. The phased assignment optimization adopts Tukey's rule to divide the dataset into core and noncore sets to realize a two-stage assignment; two assignment strategies are then proposed for the core and noncore sets, respectively.

(2) Four popular algorithms, k-means++ [18], affinity propagation [2], mean shift [25], and DBSCAN [26], are used to verify the effectiveness and performance improvement of k-means2o on 15 real-world and 10 synthetic datasets. Further, the impact of the core and noncore sets on the clustering result is analyzed.

(3) The improved k-means algorithm is applied to an airline seat selection dataset, and the passenger groups who are more willing to pay for seat selection are identified. The absolute deviation rate is defined to analyze the significance of the passenger grouping. This provides valuable information for ancillary services.

2. Related Work

There are many possible ways to optimize the initial center points. The k-means++ algorithm [18] provided a method that assigns a different selection probability to each potential center point. Bachem et al. [23] replaced the sampling procedure in k-means++ with MCMC sampling and obtained a nearly linear-time improved algorithm, K-MC². However, this algorithm relies on two data-dependent assumptions, which have an important impact on the clustering result and the algorithm complexity. Subsequently, Bachem et al. [24] resolved the assumption defect of the K-MC² algorithm by extending the sampling distribution of k-means++ with a regularization term. The new algorithm is called AFK-MC². Both K-MC² and AFK-MC² follow the first center point selection strategy of k-means++, namely sampling an initial center uniformly at random. They also share a similar center point selection principle: a point farther from the currently selected center points has a greater probability of being chosen as the next center point. For more information on initial center point optimization methods, please consult the literature [27].
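As an illustration of the ++-sampling scheme shared by these methods, the following minimal Python sketch draws the first center uniformly at random and each subsequent center with probability proportional to the squared distance to the nearest already-selected center; the function name and structure are ours, not taken from the cited papers.

```python
# A minimal sketch of ++-sampling (distance-squared weighted center selection).
import numpy as np

def pp_sampling(X, k, seed=None):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]          # first center: uniform at random
    for _ in range(k - 1):
        C = np.asarray(centers)
        # squared distance of every point to its nearest selected center
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()               # farther points are more likely
        centers.append(X[rng.choice(n, p=probs)])
    return np.asarray(centers)
```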

Phased assignment, generally speaking, divides the data into different stages to complete the cluster label assignment, or assigns cluster labels to only part of the data while the remaining part is removed as outliers or noise. Zhou et al. [28] proposed a three-stage k-means algorithm to cluster data and detect outliers. In the first stage, the fuzzy c-means algorithm is applied to cluster the data. In the second stage, local outliers are identified, and the cluster centers are recalculated. In the third stage, certain clusters are merged, and global outliers are identified. Im et al. [29] proposed the NK-means algorithm, a two-stage k-means algorithm that emphasizes the removal of noise/outliers. In the first stage, a greedy algorithm is utilized to remove abnormal points. In the second stage, the center points are optimized on the constructed core set, and a cluster label is assigned to each point. In terms of preprocessing techniques, k-means can be utilized as an additional filtering step to remove data points judged as outliers before applying the conventional k-means; the clustering process is then performed only on the remaining, outlier-free data, and the removed outliers are not assigned to any cluster. The KMOR algorithm proposed by Gan and Ng [30] assigns outliers to an additional cluster. This algorithm redefines the clustering objective function and takes into account the distances between outliers and center points; however, it introduces two new parameters to adjust the number of outliers. The k-means-sharp algorithm proposed by Olukanmi et al. [31] eliminates the outliers' influence on the cluster centroids. The detected outliers are excluded only from the mean computation and are still involved later in the clustering process. However, a data point is eliminated from the centroid computation together with all of its attributes, so the algorithm cannot recognize an outlier's presence in each attribute independently: a single distance value represents the entire vector, instead of removing only the offending attribute. Therefore, an empty cluster may occur if every data point contains at least one outlying attribute [32].

Phased assignment is not only used to optimize the k-means algorithm. For example, Yu et al. [33] adopted a two-stage assignment strategy based on boundary conditions to optimize the DPC clustering algorithm. For a dataset to be clustered, in many cases users do not care whether it contains outliers, because outliers themselves are difficult to define, but they definitely want every point to receive a cluster label. Wang et al. [34] proposed an improved integrated clustering learning strategy based on a three-stage affinity propagation algorithm with density peak optimization theory (DPKT-AP). In the first stage, the cluster center points are selected by density peak clustering. In the second stage, the k-means algorithm is used to cluster the data samples. In the third stage, DPKT-AP uses the AP algorithm to merge and cluster the spherical subgroups.

3. Proposed K-Means Algorithm

Suppose a given dataset $X = \{x_1, x_2, \ldots, x_n\}$, and divide it into $k$ mutually disjoint sets $S_1, S_2, \ldots, S_k$, so that $\bigcup_{i=1}^{k} S_i = X$ and $S_i \cap S_j = \emptyset$ ($i \neq j$).

3.1. Initial Center Points Optimization

Like the k-means++ algorithm, k-means2o adopts a strategy of adding center points one by one until the desired number $k$ is reached. The difference is that the new algorithm redefines the selection of the first center point and of the subsequent center points. For this purpose, first define the distance function between a point $x$ and a set $S$:

$$d(x, S) = \min_{s \in S} d(x, s), \quad (1)$$

where $d(x, s)$ represents the distance between two points $x$ and $s$. In this paper, the Euclidean distance is selected.

Let $c_i$ represent the center point of cluster $S_i$; then the first center point $c_1$ is selected as follows:

$$c_1 = \frac{1}{|D_{core}|} \sum_{x \in D_{core}} x, \quad (2)$$

where $|D_{core}|$ represents the number of elements in the core set $D_{core}$ (defined in Section 3.2). Equation (2) shows that $c_1$ is the mean value of the core set $D_{core}$.

Let $C_i = \{c_1, c_2, \ldots, c_i\}$ represent the set containing the first $i$ center points; then the $(i+1)$-th center point is selected as follows:

$$c_{i+1} = \arg\max_{x \in D_{core}} d(x, C_i), \quad (3)$$

and then $C_{i+1} = C_i \cup \{c_{i+1}\}$. Equation (3) shows that $c_{i+1}$ is the point in the core set $D_{core}$ farthest from the already selected center points. The whole process above is shown in Figure 2.
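A minimal sketch of this initialization, assuming the core set $D_{core}$ of Section 3.2 is already available as a NumPy array; the helper name init_centers is ours.

```python
# c_1 is the mean of the core set (equation (2)); each subsequent center is the
# core point farthest from the already-selected centers (equation (3)).
import numpy as np

def init_centers(D_core, k):
    centers = [D_core.mean(axis=0)]               # c_1: mean of the core set
    for _ in range(k - 1):
        C = np.asarray(centers)
        # d(x, C_i): distance from each core point to its nearest selected center
        d = np.min(np.linalg.norm(D_core[:, None, :] - C[None, :, :], axis=-1), axis=1)
        centers.append(D_core[np.argmax(d)])      # farthest point in D_core
    return np.asarray(centers)
```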

3.2. Phased Assignment

The k-means2o completes the clustering in two stages. The first stage assigns cluster labels to the core set $D_{core}$, and the second stage assigns cluster labels to the noncore set $D_{noncore}$. Tukey's rule is adopted to divide the dataset $X$ into the sets $D_{core}$ and $D_{noncore}$. Tukey's rule is one of the most robust and widely used techniques for anomaly detection in univariate data [35].

In the first stage, k-means2o applies Tukey's rule to each attribute of the data, and the judgment results over all dimensions are then integrated to determine whether a sample point belongs to the core set $D_{core}$.

First, calculate the first quartile $Q_1^{(j)}$ and the third quartile $Q_3^{(j)}$ on each attribute $j$:

$$Q_1^{(j)} = \operatorname{quantile}(X^{(j)}, 0.25), \quad Q_3^{(j)} = \operatorname{quantile}(X^{(j)}, 0.75). \quad (4)$$

Then, calculate the upper and lower bounds as follows:

$$U^{(j)} = Q_3^{(j)} + \alpha \cdot IQR^{(j)}, \quad L^{(j)} = Q_1^{(j)} - \alpha \cdot IQR^{(j)}, \quad (5)$$

where $IQR^{(j)} = Q_3^{(j)} - Q_1^{(j)}$ and $\alpha$ is a scale factor.

Finally, calculate the core set $D_{core}$ and the noncore set $D_{noncore}$ as follows:

$$D_{core} = \{x \in X : L^{(j)} \le x^{(j)} \le U^{(j)}, \ \forall j\}, \quad D_{noncore} = X \setminus D_{core}. \quad (6)$$

Equation (6) shows that each attribute of a data point is evaluated individually, and all attributes are then integrated to determine whether the point belongs to the core set $D_{core}$: as long as any attribute does not satisfy the inequality constraints, the point is judged as belonging to $D_{noncore}$. According to equations (3) and (6), it is clear that, if the selection were performed on the whole dataset $X$, the farthest point would almost always be a point in the noncore set $D_{noncore}$, and ++-sampling would likewise select points in $D_{noncore}$ with high probability.
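A minimal sketch of the division in equations (4)–(6); the helper name tukey_split is ours.

```python
# Tukey's rule applied per attribute; a point belongs to D_core only if every
# attribute lies within [L_j, U_j].
import numpy as np

def tukey_split(X, alpha=1.5):
    q1 = np.percentile(X, 25, axis=0)       # first quartile per attribute
    q3 = np.percentile(X, 75, axis=0)       # third quartile per attribute
    iqr = q3 - q1
    lower, upper = q1 - alpha * iqr, q3 + alpha * iqr
    in_core = np.all((X >= lower) & (X <= upper), axis=1)
    return X[in_core], X[~in_core]          # D_core, D_noncore
```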

The scale factor $\alpha$ in equation (5) is a predefined adjustable parameter. If sufficient prior knowledge of the dataset is available, it can be set based on experience; if not, it is recommended to set $\alpha = 1.5$. Although in the field of anomaly detection $\alpha = 1.5$ is often regarded as the boundary value for outliers, in cluster analysis the points in $D_{noncore}$ cannot be regarded as outliers and discarded: they still need to be assigned cluster labels. Whether a point lies in $D_{core}$ or $D_{noncore}$, assigning it a cluster label in the final result is one of the goals of cluster analysis. On the 15 real datasets in this paper, every sample has an exact class label, yet the noncore sets $D_{noncore}$ of almost all datasets are not empty. Constructing $D_{core}$ helps to obtain better initial center points: it not only effectively assists the selection of the initial center points but also has a positive effect on the update of the center points.

After obtaining $D_{core}$, use the initial center point selection method described in Section 3.1 to select the initial center point set $C$ from $D_{core}$, and then use the traditional k-means center point update method to complete the clustering on $D_{core}$, obtaining the optimal cluster center set $C^{*}$ and the clusters $S_1, S_2, \ldots, S_k$. The first stage of clustering then ends.

In the second stage, the points in $D_{noncore}$ are assigned cluster labels. With the help of the optimal clusters obtained in the first stage, the cluster label of $x \in D_{noncore}$ is determined by

$$l(x) = \arg\min_{1 \le i \le k} d(x, S_i), \quad (7)$$

where $d(\cdot, \cdot)$ is defined in (1) and $S_i$ is the $i$-th cluster.

The whole process above is shown in Figure 3.
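Putting the two stages together, the following sketch combines the tukey_split and init_centers helpers from the earlier sketches; the second stage follows our reading of equations (1) and (7), assigning each noncore point the label of its nearest core point. Empty clusters are not handled, as this is a sketch rather than a full implementation.

```python
# Stage 1: Lloyd-style k-means on the core set from the proposed initialization.
# Stage 2: each noncore point takes the label of its nearest core point.
import numpy as np

def kmeans2o(X, k, alpha=1.5, max_iter=300):
    core, noncore = tukey_split(X, alpha)
    centers = init_centers(core, k)
    for _ in range(max_iter):                   # stage 1: k-means on D_core
        dist = np.linalg.norm(core[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = np.array([core[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):   # centers no longer change
            break
        centers = new_centers
    # stage 2: nearest core point determines the cluster of each noncore point
    nearest = np.linalg.norm(noncore[:, None, :] - core[None, :, :], axis=-1).argmin(axis=1)
    return centers, (core, labels), (noncore, labels[nearest])
```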

3.3. Algorithm Flow and Complexity Analysis

The k-means2o algorithm performs both the initial center point selection optimization and the phased assignment. Algorithm 1 shows its detailed process. Steps 1–15 correspond to the first stage: Step 1 determines $D_{core}$ and $D_{noncore}$, and Steps 2–5 optimize the initial center points. Steps 16–19 correspond to the second stage.

Input: Dataset $X$, cluster number $k$, scale factor $\alpha$
Output: Clustering results $S = \{S_1, \ldots, S_k\}$, center points set $C^{*}$, sum of squared error $SSE$
(1)Using (6), divide dataset $X$ into $D_{core}$ and $D_{noncore}$
(2)Using (2), generate $c_1$
(3)For $i = 1$ to $k - 1$ do
(4)Using (3), generate $c_{i+1}$
(5)End for
(6)For $j = 1$ to $t_{\max}$ do
(7) for $x \in D_{core}$ do
(8)   According to the principle of the nearest distance between $x$ and $C$, classify $x$ into the corresponding cluster
(9)  end for
(10)  if $C$ does not change then
(11)    break
(12)  end if
(13)end for
(14)Update the center points set $C$ and compute $SSE$
(15)Compute the optimal center points $C^{*}$
(16)for $x \in D_{noncore}$ do
(17)  According to the principle of the nearest distance between $x$ and the clusters $S_i$, classify $x$ into the corresponding cluster
(18)end for
(19)Compute $SSE$
(20)Return clustering results $S$, center points set $C^{*}$, sum of squared error $SSE$

According to the detailed steps in Algorithm 1, the complexity of the k-means2o algorithm is analyzed with data size $n$, attribute number $d$, and cluster number $k$. The number of iterations is denoted as $t$, and its maximum value is $t_{\max}$. Step 1 generates $D_{core}$ and $D_{noncore}$ with $O(nd)$ (computing the quartiles of each attribute). Steps 2–5 select the $k$ initial center points with $O(nkd)$. Steps 6–13 are a traditional k-means clustering process; however, Step 8 is a new label assignment strategy, so the complexity of these steps becomes $O(tnkd)$. In summary, the complexity of the k-means2o algorithm is $O(tnkd)$.

4. Performance Analysis of the Proposed Algorithm

In this section, the improved k-means algorithm, k-means2o, is tested and its clustering performance is verified by comparison with the well-known k-means++ [18], the most commonly used partition-based algorithm, which uses different initializations of the centroids to reduce sensitivity. Then, the performance of k-means2o is compared with affinity propagation (AP) [2], mean shift (MS) [25], and DBSCAN [26]. Although the latter algorithms obtain excellent clustering performance on some special datasets, they require presetting one or more important parameters, which is a very difficult task. The k-means2o is implemented in Python, and k-means++, AP, MS, and DBSCAN are called from scikit-learn [36].
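A minimal sketch of how the comparison algorithms can be invoked from scikit-learn with default parameters, as described above; the wrapper function is ours.

```python
# Baselines called from scikit-learn; only n_clusters is supplied for KMeans,
# all other parameters keep their defaults.
from sklearn.cluster import KMeans, AffinityPropagation, MeanShift, DBSCAN

def run_baselines(X, k):
    return {
        "k-means++": KMeans(n_clusters=k, init="k-means++").fit_predict(X),
        "AP": AffinityPropagation().fit_predict(X),
        "MS": MeanShift().fit_predict(X),
        "DBSCAN": DBSCAN().fit_predict(X),
    }
```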

4.1. Datasets and Evaluation Metrics

A total of 15 real-world datasets used in the experiments were taken from the UCI repository [37]. Their data size $n$, attribute number $d$, and cluster number $k$ are summarized in Table 1. Table 2 shows 10 synthetic datasets from references [38, 39], where the K1 dataset was synthesized for this paper (see Figure 1). All datasets are publicly available.

An appropriate and uniform evaluation index is both required and meaningful for comparing different clustering algorithms. Therefore, the clustering quality was measured via the accuracy (ACC), the Adjusted Rand Index (ARI) [40], the Normalized Mutual Information (NMI) [41], and the Fowlkes–Mallows Index (FMI) [42] between the produced clusters and the true categories. Larger evaluation index values indicate better clustering performance, and all indices have an upper bound of 1, representing perfectly correct clustering. All four indices are computed from the predicted labels and the true labels.
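A minimal sketch of the evaluation: ARI, NMI, and FMI come directly from scikit-learn, while ACC is computed here with the usual optimal one-to-one label mapping via the Hungarian algorithm, which is an assumption about the paper's exact ACC implementation; labels are assumed to be encoded as nonnegative integers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             fowlkes_mallows_score)

def clustering_acc(y_true, y_pred):
    # Build the contingency matrix and find the best one-to-one label mapping.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    m = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((m, m), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    row, col = linear_sum_assignment(-cost)      # maximize matched counts
    return cost[row, col].sum() / y_true.size

def evaluate(y_true, y_pred):
    return {"ACC": clustering_acc(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred),
            "NMI": normalized_mutual_info_score(y_true, y_pred),
            "FMI": fowlkes_mallows_score(y_true, y_pred)}
```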

4.2. Experimental Results and Discussion

The experimental datasets were clustered using k-means++ and k-means2o. Their ACC, ARI, NMI, and FMI values are listed in Tables 3 and 4, where k-++ denotes k-means++ and k-2o denotes k-means2o. The best clustering performance evaluation values are shown in bold, and 1 means that the clustering result is completely correct. The value 0.0000 in the tables represents a real metric value that rounds to zero at four decimal places.

From Table 3, k-means++ and k-means2o simultaneously obtained the maximum FMI value on 8 of the 15 datasets. This suggests comparable performance on those datasets, so further comparison on the other evaluation indices is required. From the ARI values in Table 3, the most significant and direct conclusion is that k-means2o outperforms k-means++ on most datasets, while on the few datasets where it is inferior to k-means++, the performance of the two algorithms is very close. Specifically, k-means2o achieved the maximum ARI value on 10 of the 15 datasets, with the same result for NMI, whereas k-means++ achieved the best clustering performance on only 6 datasets in ARI, and likewise in NMI. On the banknote, iris, and wine datasets, k-means2o is inferior to k-means++ by only a small margin. The ACC evaluation leads to exactly the same conclusion as NMI and ARI, namely that the clustering performance of k-means2o is better than that of k-means++.

For the synthetic datasets in Table 2, the four evaluation metrics in Table 4 show that k-means++ and k-means2o have similar clustering performance. For datasets with spherical cluster distributions, such as D31, R15, S1, and S3, the clustering results of the two algorithms are close to the real cluster partition, while for datasets with nonspherical distributions, such as spiral, flame, and circlesA3, the clustering performance of both drops sharply. When the sizes of the distribution areas of the spherical clusters differ significantly, the performance difference between k-means++ and k-means2o is revealed. For example, on the aggregation dataset, the two algorithms' clustering results are shown in Figure 4. The ARI, NMI, and FMI values all indicate that k-means++ is better than k-means2o, but ACC gives the opposite conclusion. Figure 4(a) shows that k-means++ selects seven center points within only six real clusters, and two different clusters (green points in the figure) are wrongly classified into the same cluster. Figure 4(b) shows that k-means2o selects center points in all seven real clusters.

Further, the performance of k-means2o is compared with AP, MS, and DBSCAN. The ARI and NMI of these algorithms are listed in Table 5, and the ACC and FMI in Table 6. Values larger than those of k-means2o are marked in bold. The three comparison algorithms all use default parameters. For better performance, the data are normalized here. From the perspective of ARI, compared with AP, MS, and DBSCAN, k-means2o obtained better clustering performance on 12, 14, and 13 datasets, respectively. The NMI evaluation results are similar to ARI, except for the AP algorithm: the NMI and ARI measurements of AP differ greatly, which may be tied to the erroneous number of clusters produced by the AP algorithm. The ACC evaluation conclusion is consistent with ARI, but FMI and NMI reach opposite conclusions. For the MS algorithm, its FMI value is better than that of k-means2o on 9 of the 15 datasets, while for the AP algorithm, its FMI value on all datasets is smaller than that of k-means2o. Considering the four evaluation metrics together, the k-means2o algorithm is superior to the comparison methods in at least three of these metrics on most datasets. Therefore, k-means2o has better clustering performance.

As for the abnormal conclusions given by certain evaluation metrics for specific algorithms, for example, the NMI metric for the AP algorithm and the FMI metric for the MS algorithm, they may be caused by too many or too few clusters. Table 7 shows that the AP and MS algorithms give the wrong number of clusters on all datasets: the former far exceeds the true number of clusters, while the latter groups more than half of the datasets into a single cluster. Undeniably, the AP, MS, and DBSCAN algorithms provide a method to identify the number of clusters. If the parameters of the AP algorithm, the damping factor and the preference value, are carefully adjusted, it may achieve better clustering performance on these real-world datasets. However, in clustering algorithms that contain parameters, careful selection of the parameters is often time-consuming and requires prior knowledge. Therefore, these algorithms have poor universality.

The performance of all five algorithms can be directly compared in Figure 5. In this radar chart, each axis represents a dataset, and its value is the ARI of the clustering result. Consistent with the previous analysis, k-means2o has the best performance: its corresponding red line in the radar chart reaches the maximum value on more polar axes, that is, it lies farther away from the center point.

4.3. Comparative Analysis of Different Initialization Methods

In this subsection, the effects of three different initialization methods on the performance of the k-means clustering algorithm are compared. These three methods are random initialization, ++-sampling, and the initialization proposed in this paper; see the header of Table 8. Random initialization selects the center points uniformly at random. ++-sampling assigns a selection probability to each noncenter point and randomly selects the center points accordingly. The proposed method is the center point initialization optimization described in Section 3.1. In fact, the k-means algorithm based on ++-sampling is the famous k-means++ algorithm.

The initial center point optimization plays an important role in the performance improvement of k-means2o. However, Table 8 shows that using the initialization method proposed in this paper alone does not improve the clustering performance. From the ARI values, the best initialization method is ++-sampling, followed by random initialization, and the worst is the initialization method proposed in this paper. Except for tiny numerical differences on individual datasets, the NMI evaluation shows similar conclusions. Combined with the performance improvement of k-means2o reported above, it is the combination of the initial center point optimization and the phased assignment that improves k-means2o, not the center point optimization alone.
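A minimal sketch of such a comparison, assuming the init_centers helper from the Section 3.1 sketch; scikit-learn's KMeans accepts "random", "k-means++", or an explicit array of starting centers, and n_init=1 keeps the three settings comparable.

```python
# Run the same KMeans routine under three initializations and score with ARI.
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def compare_inits(X, y_true, k):
    results = {}
    for name, init in [("random", "random"),
                       ("++-sampling", "k-means++"),
                       ("proposed", init_centers(X, k))]:  # see Section 3.1 sketch
        km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
        results[name] = adjusted_rand_score(y_true, km.labels_)
    return results
```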

4.4. Impact Analysis of Core and Noncore Sets

This paper uses Tukey's rule to realize the division into $D_{core}$ and $D_{noncore}$; therefore, a scale factor $\alpha$ needs to be given. Tukey's rule comes from the field of anomaly detection, where the scale factor is generally set to 1.5, points that do not meet the corresponding bounds are called outliers, and in most cases these points are directly abandoned. When this idea is introduced into cluster analysis and used in the data preprocessing stage, the points detected as abnormal are discarded and never assigned a cluster label, which hides great trouble. Table 9 shows the number of elements in $D_{core}$ and $D_{noncore}$ for the 15 real-world datasets when $\alpha = 1.5$. Except for the compound dataset, whose $D_{noncore}$ is empty, the $D_{noncore}$ of the remaining 14 datasets are not empty. However, as is well known, all points in these datasets carry class labels; therefore, it is unreasonable to abandon these suspected outliers simply and crudely. For this reason, this paper proposes the two-stage assignment method, whose first stage assigns cluster labels to the points in $D_{core}$ and whose second stage assigns labels to the points in $D_{noncore}$. For the compound dataset, the empty $D_{noncore}$ indicates that Tukey's rule has no effect on this dataset, so the second-stage assignment is skipped.

The k-means2o algorithm relies on a predefined scale factor $\alpha$, so a sensitivity test of this parameter is necessary. We therefore took the iris, wine, breast_cancer, banknote, and bupa datasets as examples to investigate the effect of different $\alpha$ values on ARI and NMI, as shown in Figure 6. It shows that the ARI and NMI curves of the five datasets do not fluctuate drastically, so the clustering performance of the k-means2o algorithm is relatively robust to the scale factor. Nevertheless, the scale factor still has a slight impact on the clustering performance. For example, on the iris dataset, for an appropriate value of $\alpha$, the ARI and NMI values reach 0.8340 and 0.8191, respectively; this clustering result is better than that of k-means++ (see the corresponding ARI and NMI values in Table 3).
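A minimal sketch of this sensitivity sweep, assuming the kmeans2o sketch from Section 3.2; the wrapper reassembles the per-point labels in the original order so they can be scored against the ground truth, and the α grid is an illustrative assumption.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def kmeans2o_labels(X, k, alpha):
    # Mirror the tukey_split mask so core/noncore labels can be re-interleaved
    # into the original point order.
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    in_core = np.all((X >= q1 - alpha * iqr) & (X <= q3 + alpha * iqr), axis=1)
    _, (_, core_lbl), (_, noncore_lbl) = kmeans2o(X, k, alpha=alpha)
    labels = np.empty(len(X), dtype=int)
    labels[in_core], labels[~in_core] = core_lbl, noncore_lbl
    return labels

def alpha_sweep(X, y_true, k, alphas=np.arange(1.0, 3.1, 0.25)):
    curves = []
    for a in alphas:
        y_pred = kmeans2o_labels(X, k, a)
        curves.append((a,
                       adjusted_rand_score(y_true, y_pred),
                       normalized_mutual_info_score(y_true, y_pred)))
    return curves
```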

In the above analysis, k-means2o outperforms k-means++, AP, MS, and DBSCAN. Combined with the fact that the noncore sets of almost all datasets in Table 9 are nonempty, these results show that the combined optimization of the initial center points and the core subset works and improves the k-means clustering performance.

5. The Application of K-Means2o

In this section, k-means2o is applied to the cluster analysis of the airline seat selection dataset provided by Neusoft. According to the meaning of clustering, the samples in the same cluster are as similar as possible, and the samples in different clusters are as dissimilar as possible. If most samples in a cluster have a certain property, it can be inferred that the other samples in the cluster are also very likely to have the same property. Thus, if most passengers in a cluster are willing to accept a personalized recommendation service, such as paying for seat selection, the same service should be recommended to the other passengers in the cluster, and a clearer audience group will increase the success rate of the personalized recommendation service. For the airline seat selection dataset, the appropriate number of clusters must be determined first.

The silhouette coefficient is a simple and effective method to determine the appropriate number of clusters for the k-means algorithm. The corresponding curve of the k-means2o algorithm on this dataset is shown in Figure 7. The figure shows that the SSE change tends to be gentle from 16 clusters onward; therefore, the optimal number of clusters is selected as 16. The k-means2o is then applied and divides the data into 16 clusters. The number of passengers in each cluster is shown in the column named size in Table 10. The 3rd, 4th, and 5th columns of Table 10 (payment, no-payment, payment ratio) show, respectively, the number of paid passengers, the number of nonpaid passengers, and the proportion of paid ones in the airline seat selection. The absolute deviation rate in the last column is defined as follows:

$$\eta_i = \frac{|p_i - p|}{p},$$

where $p_i$ is the payment rate in cluster $C_i$ and $p$ is the payment rate of the whole dataset. The larger the value of $\eta_i$, the more significant the difference between the payment behavior of the passengers in the cluster and that of the whole dataset.
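A minimal sketch of these per-cluster statistics and the absolute deviation rate defined above; labels and paid are hypothetical array names for the cluster assignments and the 0/1 seat-selection payment flag.

```python
import numpy as np

def deviation_rates(labels, paid):
    paid = np.asarray(paid, dtype=float)
    p_all = paid.mean()                       # payment rate of the whole dataset
    rows = []
    for c in np.unique(labels):
        mask = labels == c
        p_c = paid[mask].mean()               # payment rate in cluster c
        rows.append((c, mask.sum(), p_c, abs(p_c - p_all) / p_all))
    return p_all, rows                        # overall rate and per-cluster stats
```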

The clustering results show that the numbers of passengers in the clusters are far from equal: the largest cluster contains 2,580 passengers, while the smallest contains 379.

Further, the significant differences between clusters are explored. Figure 8 shows the kernel density estimation curves of three attributes: pax_fcny, pax_tax, and recent_gap_day. On the whole, the curves of the clusters do not coincide, and there are significant differences, which shows that the data distribution of each cluster is different. This is consistent with the expectation of cluster analysis, namely that samples in different clusters should be as dissimilar as possible. From the single-attribute point of view, the discrimination of the pax_fcny attribute is the most significant, with different mean points, peak points, and data spans, followed by the pax_tax attribute. The third is the recent_gap_day attribute: its means and spans are very similar across clusters, but the peak points still differ. The difference in peak points indicates differences in the concentration of the data distribution within the clusters; the larger the peak value, the more points are distributed near the mean value.

Table 10 examines the k-means2o clustering results on the airline seat selection dataset from the perspective of within-cluster similarity and between-cluster dissimilarity. The clustering results provide a good reference for customer grouping: air passenger grouping enables decision-makers to more accurately find the audience of a personalized recommendation service, such as payment for airline seat selection, for which the dataset provides the label. The absolute deviation rate of every cluster is greater than 12%, indicating payment behavior significantly different from the payment rate of the entire dataset, 6.29%. The cluster with the largest absolute deviation rate is C13, reaching 66.45%, and the one with the smallest is C5, reaching 12.56%. These results show that passenger payment behavior within clusters is more agglomerated than in the entire dataset. Since the payment rate of C13 is only 2.11%, its deviation is in the reverse direction; in other words, the large absolute deviation rate indicates that the passengers in C13 are extremely unwilling to pay for seat selection, with a willingness significantly lower than the overall level. In 9 of the 16 clusters, the ratio of paying for airline seat selection exceeds 5%. Under a precise recommendation or personalized marketing strategy, enterprises should pay more attention to the passengers in these nine clusters, where marketing is more likely to succeed: compared with the passengers in other clusters, those in these clusters will be more willing to accept such recommendations, which enhances their stickiness. When designing a recommendation system, this clustering result becomes good auxiliary prior information.

6. Conclusion

In this paper, two optimization methods for k-means, initial center point selection and phased assignment, were proposed, yielding the improved k-means algorithm k-means2o. In contrast to the previously introduced algorithms k-means++, K-MC², and AFK-MC², the new initial center point selection optimization redefines the first center point selection strategy and the subsequent center point incremental strategy. The phased assignment optimization adopts Tukey's rule to divide the dataset into core and noncore sets, and two assignment strategies were proposed for the core and noncore sets, respectively. These two optimization methods complement each other to form a combined optimization. The experimental results on 15 real-world and 10 synthetic datasets show that k-means2o outperforms its main competitor k-means++ and that, under the same setting conditions, namely using the default parameters, the clustering performance of k-means2o is better than that of affinity propagation, mean shift, and DBSCAN.

The improved k-means algorithm, k-means2o, was applied to analyze the airline seat selection dataset. Combined with the data label of paying for seat selection, the clustering results realize customer grouping and identify suitable audience groups for the recommendation of seat selection services. Through the analysis of the newly defined absolute deviation rate index, the groups appropriate for service recommendation are found, and the groups unsuitable for recommendation are distinguished. Airline enterprises can therefore use limited resources to promote services to the groups with high willingness to pay, improving the success rate, while avoiding promoting seat selection services to the groups with low willingness to pay, which would not only waste resources but also cause passengers' disgust.

Extensive experimental tests show that the k-means2o algorithm, like other algorithms, cannot be adapted to all fields and situations, such as high-dimensional sparse data. If the data have a huge number of attributes or very high dimensionality, the core set $D_{core}$ easily ends up with few samples, and in extreme cases their number may be less than the number of clusters. The Olivetti Faces image data were tested, and it was found that $|D_{core}| < k$, that is, the number of samples in the core set is less than the number of clusters, so the clustering fails. Due to the division into core and noncore sets, the k-means2o algorithm is not suitable for data with a huge number of attributes or very high dimensionality. We will continue to study this problem and hope to solve it in the future.

Data Availability

The data are available at https://gitee.com/ydh-usx/k-means2o-data/tree/master/data.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62002227) and the School-level scientific research project of Shaoxing University (No. 2021LG004).