Abstract
The K-means algorithm has been extensively investigated in the field of text clustering because of its linear time complexity and adaptation to sparse matrix data. However, it has two main problems, namely, the determination of the number of clusters and the location of the initial cluster centres. In this study, we propose an improved K-means++ algorithm based on the Davies–Bouldin index (DBI) and the largest sum of distance called the SDKmeans++ algorithm. Firstly, we use the term frequency-inverse document frequency to represent the data set. Secondly, we measure the distance between objects by cosine similarity. Thirdly, the initial cluster centres are selected by comparing the distance to existing initial cluster centres and the maximum density. Fourthly, clustering results are obtained using the K-means++ method. Lastly, DBI is used to obtain optimal clustering results automatically. Experimental results on real bank transaction volume data sets show that the SDKmeans++ algorithm is more effective and efficient than two other algorithms in organising large financial text data sets. The F-measure value of the proposed algorithm is 0.97. The running time of the SDKmeans++ algorithm is reduced by 42.9% and 22.4% compared with that for K-means and K-means++ algorithms, respectively.
1. Introduction
Clustering is the process of dividing a data set into clusters (subsets) so that the objects in the same cluster are similar to each other and the objects in different clusters are dissimilar. Its main aim is to discover the natural grouping law in a set.
As an unsupervised data mining technique, text clustering does not require pre-training models or manual text pre-annotation [1]. Therefore, compared with other natural language processing algorithms, the clustering algorithm is more efficient and does not require human intervention [2]. Many clustering algorithms have been proposed. They are mainly divided into three categories, namely, overlapping/non-exclusive, partitional, and hierarchical [3]. Amongst them, partition-based algorithms are widely used in various fields because of their easy implementation [4]. The most typical partitional method is K-means [2]. The K-means algorithm can adapt to sparse matrix data sets, and it is efficient in organising large data sets. However, the number of clusters and the selection of initial centres have a huge impact on the clustering results of the K-means algorithm. Setting an inappropriate initial value can easily cause the algorithm to fall into a local optimum.
Several algorithms propose improved similarity measurement methods to adapt to different data types. Wang et al. [4] utilised knowledge graphs to optimise the calculation of the similarity of text data types, which effectively improves the accuracy of text clustering. Huan et al. [5] proposed using KL divergence to calculate the similarity between cluster centres and text data objects, thereby making the K-means algorithm more efficient and effective. To analyse the load profiles of smart meters, Xiang et al. [6] proposed measuring the shape characteristics of such load profiles by using the segmented slope of the load planes. Cheng et al. [7] presented a novel distance based on the common neighbourhood of dense cores and used geodesic distance to calculate the similarity between density cores. Experimental results showed that the algorithm performs well in clustering data sets that contain much noise. To organise clusters with complex structures, Cheng et al. [8] combined shared neighbours and local density peaks to define a new distance for describing the dissimilarity of manifold data.
To discover globally optimal clustering centres, several researchers proposed combining nature-inspired optimisation algorithms to optimise the given objective function [9]. Gao et al. [10] proposed using particle swarm optimisation based on the Gaussian estimation of distribution method to update population information. Experimental results showed that the algorithm has high effectiveness and robustness. Meena and Singh [11] used the genetic algorithm and discrete difference evolution to search for the location of the global optimum whilst reducing the number of algorithm iterations. Abualigah et al. [3] comprehensively reviewed the application of optimisation algorithms based on metaheuristics in the field of text clustering. However, the above-mentioned algorithms optimise the clustering centres successively through iteration, and algorithm efficiency is not ideal.
To reduce the number of iterations and avoid falling into the local optimum, many scholars proposed optimising the selection methods of initial clustering centres directly. Wang et al. [4] used concept distance to optimise the selection of initial clustering centres and thus enhance algorithm stability. Guo et al. [12] proposed a partition-based clustering algorithm to analyse biological sequences. The algorithm eliminates noise interference by deterministically initialising cluster centres. Cheng et al. [13] optimised the decision graph of the density peak (DP) algorithm through natural neighbourhood density and graph distance. The newly defined decision graph can help the DP algorithm avoid noise interference when selecting the initial cluster centres. Other improved methods, such as the semi-supervised clustering algorithm based on pairwise constraints, can enhance clustering performance. This approach advocates the use of prior knowledge as pairwise constraints to enable the clustering algorithm to obtain abundant heuristic information and reduce blindness in the search process [14, 15]. However, this type of algorithm has two main problems: it is uncertain whether a solution that satisfies all constraints exists, and it relies too much on prior knowledge.
A possible method to reduce the number of iterations and improve the clustering quality of an algorithm is to optimise the selection method of the initial clustering centre during the initialisation phase. For example, the K-means++ algorithm selects the initial cluster centres on the basis of the farthest distance criterion. Inspired by this feature, we propose a clustering method that selects initial clustering centres based on the largest sum of distances; the proposed method is called the SDKmeans++ algorithm. It can effectively describe the difference between initial cluster centres and generate the initial cluster centres in different clusters. Moreover, we use the Davies–Bouldin index (DBI) to evaluate the clustering results and obtain the optimal number of clusters. The method is efficient and improves clustering accuracy when organising large data sets. In the SDKmeans++ algorithm, we represent financial text data based on term frequency-inverse document frequency (TF-IDF). The similarity between data objects is calculated using cosine similarity. After that, the initial cluster centres are generated based on the maximum density and the newly proposed maximum distance sum criterion. Then, with the K-means method, the clustering result is obtained through the movement of the centre points and the change in the objects in the clusters. Finally, we automatically obtain the best results through DBI. The SDKmeans++ algorithm is compared with classic K-means and K-means++ algorithms to verify its effectiveness. The experimental results on real bank data sets show that the SDKmeans++ algorithm is more effective and efficient than the two other algorithms.
The main parts of this paper are organised as follows. Section 2 introduces related work on data preprocessing and classic clustering algorithms. Section 3 presents the specific steps and theoretical advantages of the proposed SDKmeans++ algorithm. Section 4 introduces several methods to evaluate the results of text clustering. Section 5 presents the experiments and discussions, and Section 6 provides the conclusions of the experiments and future development directions.
2. Related Work
2.1. Text Preprocessing
The purpose of the data preprocessing stage is to represent text data information quantitatively. This stage is crucial to text clustering [16]. Many scholars have proposed extracting the themes and features of text data by using optimisation algorithms or statistics and achieved good text clustering effects [3, 16]. Generally, the preprocessing steps of text data clustering are as follows: (i) tokenisation, (ii) stop-word removal, and (iii) construction of the feature vector space.
2.1.1. Tokenisation
Tokenisation is the process of converting continuous text into words [2, 16]. English text is composed of words separated by spaces and special characters. Chinese text, however, is based on Chinese characters: words or phrases are formed by a variable number of characters, and the words are written continuously without delimiters. These two factors make it difficult to use English text tokenisation methods to process Chinese text information. Therefore, various Chinese text tokenisation technologies have been developed independently in China. For example, the Institute of Computing Technology, Chinese Lexical Analysis System (ICTCLAS) is a widely used Chinese lexical analyser developed by the Institute of Computing Technology of the Chinese Academy of Sciences.
2.1.2. Stop Words
The main purpose of removing stop words is to save storage space and improve the effectiveness of clustering algorithms [3]. Stop words have two main types, namely, function and high-frequency words. Function words often appear in documents but have no practical meaning. Typical function words, such as prepositions (e.g., “to,” “for,” and “in”), are deleted directly in the program by default. High-frequency words appear in most documents. Because these words carry very little information, they contribute little to distinguishing one document from another. High-frequency words can be automatically removed through the calculation of inverse document frequency.
2.1.3. Feature Vector Space
Feature vector space was proposed based on the idea of partial matching [17]. It assigns a weight to each independent term in the text to characterise text data sets [3]. Many weighting techniques have been proposed in the past few decades. TF-IDF is one of the most commonly used methods [4].
TF-IDF is a statistical method that is often used for information retrieval and mining. It comprehensively considers the weighting of words from local and overall aspects. The normalised calculation formula of TF-IDF is as follows:

$$w_{ij} = \frac{tf_{ij} \times \log\left(N / n_i\right)}{\sqrt{\sum_{k}\left[tf_{kj} \times \log\left(N / n_k\right)\right]^{2}}}, \tag{1}$$

where $w_{ij}$ is the normalised weight of term $t_i$ in document $d_j$. $N$ is the total number of training texts. $n_i$ is the number of texts containing feature term $t_i$. $tf_{ij}$ is the ratio of the number of occurrences of term $t_i$ in document $d_j$ to the total terms in document $d_j$. Each TF-IDF value is stored in a two-dimensional array to form the feature vector space of the data set.
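The weighting step can be illustrated with a short Python sketch. It is not the authors' implementation (their experiments ran on Java); it assumes documents are already tokenised, uses the natural logarithm with no smoothing, and normalises each document vector to unit length. The function name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[j][i] is the normalised weight of term i in document j.

    Assumptions: docs is a list of token lists, natural log, no smoothing,
    and unit-length (L2) normalisation per document.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        raw = [(counts[t] / total) * math.log(n_docs / doc_freq[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm if norm > 0 else 0.0 for w in raw])
    return vocab, rows


if __name__ == "__main__":
    docs = [["gateway", "alarm", "alarm"], ["gateway", "transfer"]]
    vocab, rows = tfidf_matrix(docs)
    print(vocab)
    print(rows)
```

In this toy input the term that occurs in every document receives a weight of zero, which is the mechanism by which inverse document frequency suppresses high-frequency words.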
2.2. Similarity Measurement Method
Similarity or distance is used to determine the affiliation of data objects. To date, no unified method has been proposed to measure the similarity of all data types. In accordance with the characteristics of data types, researchers use different measurement methods. Generally, text clustering algorithms adopt cosine similarity to evaluate the similarity between texts [4]. The formula of cosine similarity is written as

$$\cos(x_a, x_b) = \frac{\sum_{i=1}^{m} x_{ai}\, x_{bi}}{\sqrt{\sum_{i=1}^{m} x_{ai}^{2}}\,\sqrt{\sum_{i=1}^{m} x_{bi}^{2}}}, \tag{2}$$

where $x_{ai}$ is the ith feature value of data object $x_a$ and $m$ is the number of features. According to formula (2), the value of cosine similarity is within [−1, 1]. However, because TF-IDF is used to characterise text data and its weights are non-negative, the actual range of cosine similarity is [0, 1].
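A minimal Python sketch of the cosine measure follows. The function names are illustrative, and the handling of all-zero vectors (similarity 0) is an assumption, since the text does not specify that case. The distance is taken as one minus the similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance derived from the similarity: 1 - cos(a, b)."""
    return 1.0 - cosine_similarity(a, b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
    print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
```

With non-negative inputs such as TF-IDF vectors the dot product cannot be negative, so the similarity stays in [0, 1]; the value −1 is only reachable when components can take opposite signs.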
2.3. Classic Clustering Algorithms Based on Partition
2.3.1. K-Means Text Clustering Algorithm
K-means is an efficient clustering algorithm based on the partition method [18]. Its first step is to set the number of clusters ($K$). The second step is to randomly generate $K$ initial cluster centres, that is, $c_1, c_2, \ldots, c_K$. The third step is to assign each data object to the cluster with the highest similarity. The fourth step is to search for reasonable cluster centres through iteration. Lastly, spherical clusters are formed around the cluster centres [19]. The K-means algorithm has the advantages of easy understanding, simple implementation, high convergence speed, and adaptability to sparse matrix data [20].
Random selection of initial cluster centres results in multiple initial cluster centres in the same class, especially when the data are complex. Moreover, finding the globally optimal cluster centre through limited iterations is difficult. Therefore, the K-means algorithm easily converges to the local optimum, leading to unsatisfactory clustering results.
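The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: initial centres are drawn by random sampling without replacement, distance is one minus cosine similarity, centres are updated as the arithmetic mean of their members, and an empty cluster keeps its previous centre. All names are hypothetical.

```python
import math
import random


def cos_dist(a, b):
    """1 - cosine similarity, clamped at 0; zero vectors get distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return max(0.0, 1.0 - dot / (na * nb))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(data, k, seed=0, max_iter=100):
    """Plain K-means with random initial centres; returns (labels, centres)."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object joins its closest (most similar) centre.
        new_labels = [min(range(k), key=lambda c: cos_dist(p, centres[c])) for p in data]
        if new_labels == labels:
            break  # assignments are stable, so the centres no longer move
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centres[c] = mean_vector(members)
    return labels, centres


if __name__ == "__main__":
    data = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(kmeans(data, 2, seed=1))
```

On harder data than this toy example, different seeds can lead to different final partitions, which is the sensitivity to initial centres discussed above.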
2.3.2. K-Means++ Text Clustering Algorithm
The K-means++ algorithm generates initial cluster centres based on the idea of making the distance between initial cluster centres as large as possible. Its optimisation strategy for initial cluster centres is simple. The first step is to select a data object randomly from the data set as the first cluster centre $c_1$. The second step is to select the remaining initial cluster centres according to the probability formula until $K$ initial cluster centres are obtained. The subsequent steps are similar to those of the K-means algorithm. Compared with the original algorithm, the K-means++ algorithm improves clustering accuracy and reduces running time. However, K-means++ selects the first initial cluster centre randomly. In addition, selecting the remaining initial cluster centres according to the probability formula also involves a certain degree of randomness. Therefore, the clustering result is still not ideal. The flow chart of the algorithm is shown as Algorithm 1.
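The seeding stage can be sketched in Python as below. The probability formula is not reproduced in this section, so the sketch assumes the standard K-means++ rule, in which a data object is drawn with probability proportional to the squared distance to its nearest already-chosen centre. The use of cosine distance and all function names are likewise assumptions.

```python
import math
import random


def cos_dist(a, b):
    """1 - cosine similarity, clamped at 0; zero vectors get distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return max(0.0, 1.0 - dot / (na * nb))


def kmeanspp_seeds(data, k, seed=0):
    """Pick k initial centres with D^2-weighted sampling (standard K-means++)."""
    rng = random.Random(seed)
    centres = [rng.choice(data)]  # the first centre is uniformly random
    while len(centres) < k:
        # Squared distance from each object to its nearest chosen centre.
        weights = [min(cos_dist(p, c) for c in centres) ** 2 for p in data]
        if sum(weights) == 0.0:
            # Every object coincides with a centre; fall back to a uniform draw.
            centres.append(rng.choice(data))
        else:
            centres.append(rng.choices(data, weights=weights, k=1)[0])
    return centres


if __name__ == "__main__":
    data = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(kmeanspp_seeds(data, 2, seed=3))
```

Both sources of randomness named above are visible here: the uniform first draw and the weighted later draws, so different seeds can give different centres.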

3. SDKMeans++ Algorithm
Because classic partition-based clustering algorithms use unreasonable methods to select the initial centres, they waste much time on iterative calculation and easily fall into the local optimum. This section proposes a novel K-means++ algorithm called SDKmeans++ based on the largest sum of the distance. Initially, the proposed algorithm generates a feature vector space based on TF-IDF and uses cosine similarity to calculate the vector distance. Then, the algorithm selects the initial cluster centres based on the largest sum of the distance to all existing initial cluster centres. Next, it iterates the cluster centres and obtains the clustering results with the K-means method. Afterwards, DBI is used to obtain optimal clustering results automatically. The main process of the SDKmeans++ algorithm is shown in Figure 1.
3.1. Selection of the Initial Cluster Centres
The selected initial clustering centres have a huge impact on the clustering results. However, the farthest distance criterion proposed by the K-means++ algorithm cannot clearly reflect the overall dissimilarity of the initial cluster centres. We propose a new method to describe the dissimilarity between the initial cluster centres. The first step is to select the first initial cluster centre based on the maximum density. The second step is to select the remaining initial cluster centres on the basis of the largest sum of the distance. The calculation of the density value draws on the concept of local density in the density peak algorithm. To avoid setting parameters, we set the cutoff distance to half the mean value of the distance between data objects. The data object with the largest number of data objects in its neighbourhood is set as the first initial cluster centre. Afterwards, the existing initial cluster centres are regarded as a whole, and the next initial cluster centre is selected in accordance with the distance from the data object to the whole. The calculation of the newly defined selection method is shown in Figure 2.
After that, the data object with the largest sum of distances to all existing initial cluster centres is selected as the next initial cluster centre.
The new selection method links each new initial cluster centre with all existing initial cluster centres. It reflects the overall difference between the initial cluster centres, and the initial centres are generated in different classes. The difference between the proposed selection method and the classic clustering algorithm is shown in Figure 3.
According to Figure 3, the newly proposed selection method has the best effect. The K-means algorithm has three selection situations. (1) All the initial cluster centres are generated in different classes (with a probability of 26.47%). (2) Two of the initial cluster centres are generated in the same class (with a probability of 66.18%). (3) All the initial cluster centres are generated in the same class (with a probability of 7.35%). As the number of data objects and the number of clusters increase, the probability of selecting the optimal initial cluster centres decreases.
Although the K-means++ algorithm avoids placing all the initial cluster centres in the same class, two initial cluster centres could still be in the same class. Two main factors cause the selected initial cluster centres to differ from run to run. The first is that the first initial cluster centre is randomly selected. The second is that the selection method is based on the probability formula.
The proposed selection method based on the largest sum of the distance ensures that the initial cluster centres are generated in different clusters. Because the first initial centre is determined by density rather than at random and the remaining centres follow a deterministic rule, the same clustering result is obtained every time.
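The selection procedure of this subsection can be sketched in Python as follows. It is an interpretation under stated assumptions rather than the authors' code: distance is one minus cosine similarity, density is the count of other objects closer than half the mean pairwise distance, ties are broken by the lowest index, and the data set contains at least two objects with k no larger than the number of objects. The function name `sd_initial_centres` is hypothetical.

```python
import math


def cos_dist(a, b):
    """1 - cosine similarity, clamped at 0; zero vectors get distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return max(0.0, 1.0 - dot / (na * nb))


def sd_initial_centres(data, k):
    """Return the indices of k initial centres chosen deterministically."""
    n = len(data)
    dist = [[cos_dist(data[i], data[j]) for j in range(n)] for i in range(n)]
    # Cutoff: half of the mean distance over all distinct pairs.
    pair_total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    cutoff = 0.5 * pair_total / (n * (n - 1) / 2)
    # Density: number of other objects inside the cutoff neighbourhood.
    density = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
               for i in range(n)]
    chosen = [max(range(n), key=lambda i: density[i])]  # densest object first
    while len(chosen) < k:
        candidates = [i for i in range(n) if i not in chosen]
        # Next centre: largest sum of distances to all centres chosen so far.
        chosen.append(max(candidates, key=lambda i: sum(dist[i][c] for c in chosen)))
    return chosen


if __name__ == "__main__":
    data = [
        [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.95, 0.05, 0.0],
        [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
        [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
    ]
    print(sd_initial_centres(data, 3))
```

No random number generator is involved, so repeated calls on the same data return the same centres, which matches the repeatability argument above.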
3.2. Obtaining the Optimal Number of Clusters
Generally, the optimal number of clusters is unknown. However, validity indexes can be used to evaluate the results of a clustering algorithm to obtain the optimal number of clusters [21]. Validity indexes can be divided into internal and external categories [22]. Although the use of external validity indexes to evaluate clustering results is accurate, these indexes need to be combined with prior knowledge (labels) [23, 24]. Internal validity indexes can be used to evaluate clustering effects on the basis of internal information only [25].
Researchers have proposed many internal cluster validity indexes, such as between-class distance, within-class distance, DBI [26], local cores-based cluster validity (LCCV) index [27], and silhouette index [28]. Between-class distance and within-class distance are the methods used in this study to evaluate the clustering results. They are introduced in Section 4. DBI comprehensively considers the independence and cohesion of clusters. This method does not depend on the number of clusters or data partitioning methods, and it can be widely used to guide clustering algorithms. The smaller the DBI value is, the better the clustering effect is.
To avoid the problem of fuzzy inflexion points in the traditional elbow method, we search for the minimum value of DBI to obtain the optimal number of clusters. The first step is to set the range of the number of clusters $K$ to $[2, \sqrt{n}]$, where $n$ is the number of sample points. The second step is to use DBI to evaluate the clustering results in this range. To improve algorithm efficiency, we set the following condition: if the three consecutive DBI values after DBI($K$) are all greater than DBI($K$), then $K$ is regarded as the optimal number of clusters. The calculation formula of DBI is written as

$$DBI = \frac{1}{K}\sum_{i=1}^{K}\max_{j \neq i}\frac{S_i + S_j}{d_{ij}},$$

where $S_i$ represents the average distance between data objects and the cluster centre in the ith cluster. $d_{ij}$ represents the distance between the cluster centre in the ith cluster and the cluster centre in the jth cluster. The SDKmeans++ algorithm is shown as Algorithm 2.
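A Python sketch of the index and of the early-stopping search is given below. It rests on several assumptions: distance is one minus cosine similarity, coinciding centres yield an infinite ratio, and the stopping rule is read as "stop once three consecutive values of K fail to improve on the best DBI seen so far". The names `davies_bouldin` and `select_k`, and the `dbi_for_k` callback (which would cluster the data with K clusters and return the DBI), are hypothetical.

```python
import math


def cos_dist(a, b):
    """1 - cosine similarity, clamped at 0; zero vectors get distance 1."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return max(0.0, 1.0 - dot / (na * nb))


def davies_bouldin(data, labels, centres):
    """Mean over clusters of the worst (S_i + S_j) / d_ij ratio; needs >= 2 centres."""
    k = len(centres)
    scatter = []
    for c in range(k):
        members = [p for p, lab in zip(data, labels) if lab == c]
        # S_c: average distance from the members to their centre.
        scatter.append(sum(cos_dist(p, centres[c]) for p in members) / len(members)
                       if members else 0.0)
    total = 0.0
    for i in range(k):
        ratios = []
        for j in range(k):
            if j != i:
                d = cos_dist(centres[i], centres[j])
                ratios.append(math.inf if d == 0.0 else (scatter[i] + scatter[j]) / d)
        total += max(ratios)
    return total / k


def select_k(dbi_for_k, k_max, patience=3):
    """Scan K = 2..k_max and stop after `patience` consecutive non-improvements."""
    best_k, best_value = 2, dbi_for_k(2)
    worse = 0
    for k in range(3, k_max + 1):
        value = dbi_for_k(k)
        if value < best_value:
            best_k, best_value, worse = k, value, 0
        else:
            worse += 1
            if worse == patience:
                break
    return best_k


if __name__ == "__main__":
    toy_scores = {2: 0.9, 3: 0.7, 4: 0.5, 5: 0.6, 6: 0.8, 7: 0.55, 8: 0.1}
    print(select_k(toy_scores.get, 8))
```

The early stop trades completeness for speed: in the toy scores, the lower value at K = 8 is never evaluated because the search halts after K = 5, 6 and 7 all fail to improve on K = 4.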

4. Validation Techniques
The clustering evaluation method is crucial for the analysis of clustering results. It is divided into internal and external validity indexes. The main difference is whether the data are labelled. This section introduces and analyses several classic validity indexes.
4.1. Internal Validity Indexes
Many internal cluster validity indexes have been proposed in the past few decades. They have different characteristics in accordance with the definition of varying clustering concepts. Amongst them, between-class distance and within-class distance are the most frequently used because of their simple and clear principles. Between-class distance describes the dissimilarity between clusters. Within-class distance measures the cohesion of data objects in the cluster. They are used to evaluate the clustering results of the experiments in this study. The calculation formulas of between-class and within-class distance are, respectively, written as follows:

$$D_b = \sum_{i=1}^{K}\left[1 - \cos(m_i, m)\right],$$

where $D_b$ is the between-class distance, $m_i$ is the mean of all data objects in the ith class, and $m$ is the mean value of all data objects.

$$D_w = \sum_{x}\left[1 - \cos(x, c_x)\right],$$

where $c_x$ is the cluster centre of the data object $x$, $\cos(x, c_x)$ is the cosine similarity between $x$ and $c_x$, and $1 - \cos(x, c_x)$ is the cosine distance.
4.2. External Validity Indexes
External validity indexes can accurately evaluate clustering results when the data are labelled. External validity indexes can be divided into different types, such as matchingbased approach, entropy, and paircounting measure (Rand and Jaccard indexes) [29, 30].
4.2.1. MatchingBased Measures
(i) Purity: it entails evaluating the purity of a cluster in accordance with the data objects that have an advantage in quantity [31]:

$$Purity = \frac{1}{N}\sum_{i=1}^{K} m_i,$$

where $N$ is the number of all data objects and $m_i$ is the number of data objects with the quantitative advantage in the ith cluster
(ii) Recall: it entails evaluating the clustering results from the perspective of the original data set
(iii) Precision: it entails evaluating the clustering results from the perspective of the clustering result
(iv) F-measure: it combines recall and precision to reflect the quality of clustering
Recall, precision, and F-measure are all based on the confusion matrix. The confusion matrix is shown in Table 1.
TP and TN on the diagonal of the confusion matrix represent the correct clustering results, and FP and FN on the off-diagonal are misjudged.
4.2.2. EntropyBased Measures
(i) Entropy: entropy is essentially a mathematical measure of uncertainty [32]:

$$Entropy = -\sum_{i}\frac{n_i}{n}\sum_{j}\frac{n_{ij}}{n_i}\log\frac{n_{ij}}{n_i},$$

where $n$ is the number of all data objects in all clusters, $n_i$ represents the number of all data objects in the ith cluster, and $n_{ij}$ represents the number of data objects belonging to the jth class in the ith cluster.
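Purity and entropy can both be computed from the counts of each class inside each cluster, as in the following Python sketch. The function names are illustrative, and the base-2 logarithm is an assumption, since the base is not stated in the text.

```python
import math
from collections import Counter


def _class_counts(clusters, classes):
    """Map each cluster id to a Counter of the true classes it contains."""
    table = {}
    for cluster, label in zip(clusters, classes):
        table.setdefault(cluster, Counter())[label] += 1
    return table


def purity(clusters, classes):
    """Share of objects that belong to the majority class of their cluster."""
    table = _class_counts(clusters, classes)
    return sum(max(counts.values()) for counts in table.values()) / len(clusters)


def entropy(clusters, classes):
    """Size-weighted average of the class entropy inside each cluster (log base 2)."""
    table = _class_counts(clusters, classes)
    n = len(clusters)
    total = 0.0
    for counts in table.values():
        n_i = sum(counts.values())
        h_i = -sum((m / n_i) * math.log2(m / n_i) for m in counts.values())
        total += (n_i / n) * h_i
    return total


if __name__ == "__main__":
    clusters = [0, 0, 1, 1]
    classes = ["a", "a", "b", "b"]
    print(purity(clusters, classes), entropy(clusters, classes))
```

Higher purity and lower entropy both indicate clusters that are dominated by a single class.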
4.2.3. Pairwise and Counting Measures
(i) Rand index: it represents the proportion of document pairs in the data set on which the clustering result and the labels agree.
(ii) Jaccard index: it handles asymmetric binary variables [33].
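The pair-counting measures, together with precision, recall, and F-measure, can be sketched in Python as below. The sketch assumes that TP, FP, FN, and TN are counted over pairs of data objects (same cluster or not versus same class or not); Table 1 is not reproduced here, so this reading and the function names are assumptions.

```python
from itertools import combinations


def pair_counts(clusters, classes):
    """Count object pairs by agreement between clustering and labels."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pair_scores(tp, fp, fn, tn):
    """Precision, recall, F-measure, Rand and Jaccard from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / denom if denom else 0.0,
        "rand": (tp + tn) / (tp + fp + fn + tn),
        "jaccard": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }


if __name__ == "__main__":
    counts = pair_counts([0, 0, 1, 1], ["a", "a", "a", "b"])
    print(counts, pair_scores(*counts))
```

Under these definitions the F-measure and the Jaccard index are both functions of precision and recall alone, so the three reported values for a data set can be checked against each other.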
Detailed information on internal and external validity indexes is shown in Table 2.
5. Experimental Results and Discussion
We conduct experiments on five bank data sets with different data volumes to evaluate the performance of the proposed SDKmeans++ algorithm. The experimental data belong to the transaction volume data sets in the Business Performance Centre (BPC). The data set is mainly divided into mobile banking, online banking, WeChat, and financial products. These divisions can continue to be classified; for example, financial products include stocks, bonds, funds, and other wealth management products. We compare SDKmeans++ with typical partition-based clustering algorithms, namely, K-means and K-means++.
In the experiment, two internal validity indicators and seven external validity indicators are used to evaluate the clustering results. We conduct experiments on a notebook computer with an Intel Core i7-9750H processor at 2.60 GHz, 16 GB of RAM, Windows 10 OS, and Java 1.8.0_231.
5.1. The BPC Data Set Preprocessing
The BPC data set needs to be preprocessed to characterise text data. Firstly, tokenisation is used to convert text information into terms. Secondly, the function and high-frequency words in the terms are removed. Lastly, the terms are given weights through TF-IDF calculation. The detailed information on the BPC data set is shown in Table 3.
As shown in Table 3, the amount of data gradually increases from S_1 to S_5. Classes denote the optimal number of clusters. Documents refer to the number of documents in the data set. Terms refer to the words in each data set after tokenisation. Unique terms are the keywords, that is, the dimensions of the vector space model. The feature vector space constructed from data set S_1 is shown in Table 4.
The number of elements with a value of 0 in the matrix is larger than the number of nonzero elements, and the distribution of nonzero elements is irregular. Therefore, the feature vector space we construct is a sparse matrix. The adaptability of the K-means algorithm to sparse matrix data is one of the main reasons we select it.
5.2. Analysis of Algorithm Performance
5.2.1. Obtaining the Optimal Number of Clusters
This experiment uses DBI to obtain the best clustering results and avoid the problem of fuzzy inflexion points in the elbow method. It indirectly obtains the optimal number of clusters through the evaluation of the clustering results. The experiment uses data set S_1, and the theoretical optimal number of clusters is 7.
In data set S_1, classes 4 and 6 are similar. Although most of the data formats in their text messages are the same, the crucial gateway and alarm messages are different. Because K-means and K-means++ cannot distinguish between classes 4 and 6, a more reasonable choice for them is 6 clusters, as reflected in Figure 4. The proposed SDKmeans++ can distinguish 7 clusters perfectly, so the best clustering result is obtained when the number of clusters is 7. The experiment shows that any partition-based clustering method can approximate the optimal number of clusters through DBI.
5.2.2. Verification and Analysis of Algorithm Effectiveness
To verify the effectiveness of the algorithm, we compare and analyse the experimental results of SDKmeans++, K-means, and K-means++. The experiment is divided into two parts, namely, evaluations of internal and external validity indexes. The first set of experiments uses internal validity indexes to evaluate the clustering results of data set S_1 with different clustering numbers. During the experiment, the clustering number of the three algorithms is varied over the maximum range [2, 11].
As shown in Figure 5, when the clustering number is close to the optimal clustering number 7, SDKmeans++ has a large between-class distance value and a small within-class distance value. The experimental results show that the proposed SDKmeans++ exhibits good performance in dissimilarity between classes and cohesion within classes.
The second set of experiments uses external validity indexes to evaluate the clustering results of all data sets in Table 3. The number of clusters of the three algorithms is set to the optimal number of clusters 7 to determine the best clustering performance.
The experimental results in Table 5 show that the proposed SDKmeans++ performs effectively in all the data sets. Data set S_5 has the largest amount of data, which represents a complex data situation. SDKmeans++ in data set S_5 has the best performance in the evaluation of seven external validity indexes. The purity value is 0.89, recall is 0.32, precision is 0.88, F-measure is 0.47, the Rand index is 0.82, the Jaccard index is 0.31, and entropy is 0.43. Figure 5 and Table 5 show that the clustering accuracy of SDKmeans++ is better than that of K-means++, and K-means++ is better than K-means. However, the stability of the proposed method is not good. The main reason is that partition-based clustering methods have difficulty handling classes with complex structures and high-dimensional data sets. In the future, we will continue to address this shortcoming.
When data sets S_3, S_4, and S_5 are used, the evaluation results of the external validity indexes are inferior to those when data sets S_1 and S_2 are used for three main reasons. Firstly, as the amount of data increases, the differences between clusters are reduced. Secondly, the K-means family of algorithms has difficulty distinguishing classes in complex data sets. Thirdly, the experimental data sets use gateway information as labels instead of real category information.
5.2.3. Verification and Analysis of Algorithm Efficiency
In this section, the efficiency of the proposed algorithm is verified on all data sets in Table 3. Clustering number is set as the optimal number of clusters in accordance with Table 3. Afterwards, the running time and number of iterations of the three clustering algorithms are tested.
When dealing with massive amounts of data, the total time spent by the SDKmeans++ algorithm on clustering is lower than that spent by the two other clustering algorithms, as shown in Figure 6. The results indicate that the SDKmeans++ algorithm significantly reduces running time. The iteration experiment shows that, in most data sets, the SDKmeans++ algorithm needs fewer iterations than the other two algorithms. When dealing with data set S_5, the number of iterations of SDKmeans++ is reduced by 47% and 26% compared with those for K-means and K-means++, respectively. Figure 6 demonstrates the efficiency of SDKmeans++ in organising massive data.
6. Conclusion
Classic K-means and K-means++ algorithms select the initial clustering centres with some randomness, resulting in unstable clustering results, a tendency to become trapped in local optima, and a large number of iterations. Moreover, the number of clusters needs to be set manually. This study proposes a new K-means++ algorithm called SDKmeans++ based on the largest sum of the distance and DBI to solve these shortcomings. The algorithm selects the first initial cluster centre based on the maximum density value and selects the remaining initial cluster centres based on the largest sum of the distance. This selection method makes the result of each initialisation the same. Then, the K-means method is used to obtain the clustering results. Afterwards, the best clustering result is automatically obtained through DBI. The experimental results show that the proposed SDKmeans++ algorithm outperforms the classic partition-based methods in terms of effectiveness and efficiency.
SDKmeans++ has two limitations. Firstly, many invalid features are extracted, which increases the amount of calculation and affects clustering accuracy. Secondly, partition-based clustering methods have difficulty identifying classes in data sets with complex structures. Therefore, our future research will mainly include mining representative feature words in text data sets through heuristic optimisation algorithms [34–36]. In addition, density peaks will be combined to find cluster centres quickly in complex high-dimensional data sets [13].
Data Availability
The original data set involved in this study cannot be shared because the bank information is confidential.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the enterprise entrusted project under Grant no. K2020004 and in part by the Foundation of Beijing Information Science and Technology University under Grant no. 2025020.