Abstract

Medical data analysis is an important part of intelligent medicine, and clustering analysis is a commonly used method for analyzing Traditional Chinese Medicine (TCM) data. However, the classical K-Means algorithm is strongly affected by the selection of the initial clustering centers and therefore easily falls into a local optimal solution. To avoid this problem, this paper proposes an improved differential evolution clustering algorithm. The proposed algorithm selects the initial clustering centers randomly, optimizes and locates the clustering centers during the evolutionary iterations, and improves the mutation mode of differential evolution to enhance the overall optimization ability, so that the clustering result approaches the global optimum as closely as possible. Three University of California, Irvine (UCI) datasets are selected to compare the clustering effect of the classical K-Means algorithm, the standard DE-K-Means algorithm, the K-Means++ algorithm, and the proposed algorithm. The experimental results show that, in terms of global optimization, the proposed algorithm outperforms the other three algorithms, and in terms of convergence speed, it is faster than the DE-K-Means algorithm. Finally, the proposed algorithm is applied to analyze drug data of Traditional Chinese Medicine in the treatment of pulmonary diseases, and the analysis results are consistent with the theory of Traditional Chinese Medicine.

1. Introduction

Clustering is a form of unsupervised learning, so it can improve the objectivity of the results when applied to medical research. Clustering technology was first applied to assist medical diagnosis in the 1970s [1]. With the rapid development of intelligent medicine in the 5G era, some scholars have studied medical auxiliary diagnosis and made some achievements [2–5]. For example, Xu et al. simulated the process of TCM diagnosis and created an online analysis platform for TCM based on Latent Tree to assist TCM diagnosis. When clustering is used to study TCM syndrome differentiation, the results show obvious objectification and quantification characteristics [6, 7]. Therefore, clustering analysis has become a common data analysis method in TCM diagnosis and treatment and provides an objective method for TCM clinical syndrome differentiation and treatment. However, at present, most studies apply clustering to TCM symptoms and syndromes, while few studies apply clustering to drug analysis [7].

K-Means is a classical clustering algorithm, which has the advantages of simple implementation, fast convergence, and high efficiency. However, the K-Means clustering algorithm requires the number of clusters K to be determined in advance based on experience and the initial clustering centers to be selected randomly. Therefore, the results of cluster analysis are greatly affected by the selection of the initial clustering centers, outliers, and noisy data, which leads to unstable results and to convergence to a local optimal solution. A feasible idea is to determine the initial clustering centers and optimize their locations with an optimization algorithm. Differential Evolution (DE) is a relatively new stochastic optimization algorithm, which has strong robustness and global optimization capability [8]. At present, although some scholars have introduced global optimization algorithms such as the genetic algorithm and the ant colony algorithm into the K-Means clustering algorithm [9, 10], the DE algorithm is more efficient and easier to implement than these optimization algorithms [11–20].

This paper proposes an improved mutation strategy for DE and uses it to determine the K-Means clustering centers, so that DE, rather than the traditional K-Means update, continuously updates the clustering centers. In this way, the K-Means algorithm can effectively avoid falling into a local optimum. Accordingly, high-quality initial clustering centers can be obtained, and the convergence speed of DE can also be improved. To verify the effectiveness of the proposed algorithm, three UCI datasets are used to compare K-Means, K-Means++, DE-K-Means, and the proposed algorithm. The experimental results show that the proposed algorithm has a better clustering effect.

Finally, the proposed algorithm is used to conduct cluster analysis on the data of TCM drugs in the treatment of diffuse interstitial pulmonary disease, and the TCM treatment methods for the disease and the compatibility rules of the drugs are obtained. The contributions of this paper are as follows:
(1) An improved DE clustering algorithm is proposed for analyzing the data of Traditional Chinese Medicine
(2) Experimental studies on UCI standard datasets are used to verify the performance of the proposed algorithm

The rest of this paper is organized as follows: Section 2 introduces the relevant theories. Section 3 presents an improved differential evolution-based K-Means clustering algorithm. Section 4 describes the experiment and evaluation. Section 5 surveys related works and Section 6 concludes the study.

2. Relevant Theories

The clustering algorithm divides similar data objects into the same class when analyzing data, and its definition can be described as follows: the known set D = {O1, O2, …, On}, Oi represents the ith object, i = {1, 2, …, n}, Ct = {Ot1, Ot2, …, Otn}, Ct ⊆ D, t = {1, 2, …, k}, in the set Ct, the first subscript t represents the category in the set, and the second subscript represents a data object in the category t. If proximity (Oi, Oj) represents the similarity between objects Oi and Oj, then each Ct satisfies the following formula:

For all the Cx, Cy ∈ D and Cx ≠ Cy, if Cx ∩ Cy = ϕ (only for rigid clustering), then

The result of clustering is that data in the same category differ little from each other and have high similarity, while data in different categories differ greatly and have low similarity. The similarity between data is estimated from the attribute values of the data objects and can be measured by density, distance, connectivity, etc. Here, the distance between data objects is taken as the measurement indicator: the smaller the distance, the greater the similarity, and the larger the distance, the smaller the similarity. At present, a variety of distance calculation formulas are available; the most commonly used are as follows [1].

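Manhattan distance:

$$d(O_i, O_j) = \sum_{m=1}^{n} \left| O_{im} - O_{jm} \right| \tag{3}$$

Euclidean distance:

$$d(O_i, O_j) = \sqrt{\sum_{m=1}^{n} \left( O_{im} - O_{jm} \right)^2} \tag{4}$$

Cosine distance:

$$d(O_i, O_j) = 1 - \frac{\sum_{m=1}^{n} O_{im} O_{jm}}{\sqrt{\sum_{m=1}^{n} O_{im}^2}\,\sqrt{\sum_{m=1}^{n} O_{jm}^2}} \tag{5}$$

Here, the data object Oi = {Oi1, Oi2, …, Oin}, where n is the number of attributes of each data object.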
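The three distance measures listed above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the cosine distance is assumed to follow the common convention of one minus the cosine similarity.

```python
import math

def manhattan(a, b):
    """Sum of absolute attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = [0.0, 3.0], [4.0, 0.0]
print(manhattan(p, q), euclidean(p, q), cosine_distance(p, q))
```

For the two orthogonal example vectors, the Manhattan distance is 7, the Euclidean distance is 5, and the cosine distance is 1.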

2.1. K-Means Algorithm

The K-Means algorithm is a hard clustering algorithm and a prototype-based objective function clustering method. It takes the distance from the data points to the prototypes (the clustering centers) as the objective function to be optimized and derives the adjustment rules of the iterative operation by finding the extreme value of that function.
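As an illustration of this iterative scheme, the following minimal Python sketch alternates an assignment step and a center update step. It assumes Euclidean distance and random initial centers drawn from the data; the function name `kmeans` and its parameters are hypothetical and do not come from the paper.

```python
import math
import random

def kmeans(data, k, iters=100, seed=0):
    """Plain K-Means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in data]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = kmeans(points, k=2)
```

On data with less clearly separated groups, the result of such a loop depends on which initial centers are drawn, which is the sensitivity discussed in the Introduction.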

2.2. Differential Evolution Algorithm

Differential Evolution (DE) is a population-based heuristic algorithm, which has the characteristics of strong robustness, high speed, and simple structure. The basic operations of Differential Evolution algorithm include mutation, crossover, selection, and iteration. Its process is briefly introduced below.

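First of all, the DE algorithm needs to initialize the parameters and generate the initial population randomly. Then, the mutation operation is performed on the population. The common mutation strategies are as follows:

DE/rand/1:

$$v_i^g = x_{r1}^g + F \left( x_{r2}^g - x_{r3}^g \right) \tag{6}$$

DE/best/1:

$$v_i^g = x_{best}^g + F \left( x_{r1}^g - x_{r2}^g \right) \tag{7}$$

DE/current-to-best/1:

$$v_i^g = x_i^g + F \left( x_{best}^g - x_i^g \right) + F \left( x_{r1}^g - x_{r2}^g \right) \tag{8}$$

Here, $v_i^g$ is the mutation vector of the ith individual in generation g, $x_{best}^g$ is the best individual of generation g, r1, r2, and r3 are mutually different random indices that also differ from i, and F is the mutation factor.

After that, the crossover operation is performed to improve the diversity of the population, and binomial crossover is generally selected as follows:

$$u_{i,j}^g = \begin{cases} v_{i,j}^g, & \text{if } rand_j(0, 1) \le CR \text{ or } j = j_{rand}, \\ x_{i,j}^g, & \text{otherwise.} \end{cases} \tag{9}$$

Binomial crossover intersects the generated mutation vector $v_i^g$ with the parent individual vector $x_i^g$ to obtain the experimental vector $u_i^g$, in which the symbol $x_{i,j}^g$ represents the jth gene of the ith individual in the gth generation population, j = 1, 2, …, D, and D denotes the dimension of the problem. The symbol jrand denotes a random integer with uniform distribution in [1, D], which ensures that at least one dimension of the experimental vector comes from the mutation vector. The crossover probability CR controls the convergence speed of the algorithm, and CR ∈ [0, 1].

Finally, the selection operation is performed, in which the better individuals, judged by their objective function values, are preserved and passed on to the next generation. Taking a minimization problem as an example, the selection is as shown in the following equation, where f is the objective function:

$$x_i^{g+1} = \begin{cases} u_i^g, & \text{if } f(u_i^g) \le f(x_i^g), \\ x_i^g, & \text{otherwise.} \end{cases} \tag{10}$$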
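The mutation, crossover, and selection operations described above can be sketched as follows. This is a minimal illustration of standard DE/rand/1/bin with a DE/best/1 helper, not the authors' MATLAB implementation; all function names, the sphere test function, and the default values of NP and the generation count are assumptions made for the example.

```python
import random

def mutate_rand1(pop, i, F, rng):
    """DE/rand/1: base vector and difference pair are random individuals."""
    r1, r2, r3 = rng.sample([j for j in range(len(pop)) if j != i], 3)
    return [a + F * (b - c) for a, b, c in zip(pop[r1], pop[r2], pop[r3])]

def mutate_best1(pop, i, best, F, rng):
    """DE/best/1: the current best individual is the base vector."""
    r1, r2 = rng.sample([j for j in range(len(pop)) if j != i], 2)
    return [a + F * (b - c) for a, b, c in zip(best, pop[r1], pop[r2])]

def binomial_crossover(parent, mutant, CR, rng):
    """Take each gene from the mutant with probability CR; gene j_rand always."""
    j_rand = rng.randrange(len(parent))
    return [m if (rng.random() <= CR or j == j_rand) else p
            for j, (p, m) in enumerate(zip(parent, mutant))]

def select(parent, trial, f):
    """Greedy selection for minimization: keep the trial if it is no worse."""
    return trial if f(trial) <= f(parent) else parent

def de_minimize(f, dim, bounds, NP=20, F=0.6, CR=0.5, generations=200, seed=0):
    """DE/rand/1/bin loop; every trial is built from the previous generation."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(NP)]
    for _ in range(generations):
        pop = [select(x, binomial_crossover(x, mutate_rand1(pop, i, F, rng), CR, rng), f)
               for i, x in enumerate(pop)]
    return min(pop, key=f)

sphere = lambda x: sum(v * v for v in x)  # simple test function (assumption)
best = de_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
```

The sketch does not clip trial vectors to the search bounds, which a practical implementation would normally add.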

3. Improved Differential Evolution-Based K-Means Clustering Algorithm

3.1. Population Initialization

The clustering algorithm based on DE randomly generates the initial population POP = [x1, x2, x3, …, xNP], xi = [xi,1, xi,2, xi,3,…, xi,D]; the symbols NP and D denote the population size and the data dimension, respectively. Compared with the traditional K-Means algorithm, it can provide a larger search space for finding the optimal clustering center.
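One possible way to realize this initialization is sketched below. It assumes, as is common in DE-based clustering but not stated explicitly in the text, that each individual encodes k candidate clustering centers as one flat vector of length k × dim and that every coordinate is drawn uniformly within the observed range of its attribute. The helper names `init_population` and `decode` are hypothetical.

```python
import random

def init_population(data, k, NP, seed=0):
    """Return NP individuals, each a flat list of k*dim center coordinates."""
    rng = random.Random(seed)
    dim = len(data[0])
    lows = [min(p[j] for p in data) for j in range(dim)]
    highs = [max(p[j] for p in data) for j in range(dim)]
    # Gene j belongs to attribute j % dim, so it is drawn from that attribute's range.
    return [[rng.uniform(lows[j % dim], highs[j % dim]) for j in range(k * dim)]
            for _ in range(NP)]

def decode(individual, dim):
    """Split a flat individual back into its list of k centers."""
    return [individual[s:s + dim] for s in range(0, len(individual), dim)]

data = [[0.0, 10.0], [2.0, 20.0]]
pop = init_population(data, k=3, NP=5)
first_centers = decode(pop[0], dim=2)
```

Because NP different sets of candidate centers are kept at once, the search starts from many points of the solution space instead of the single set of centers used by K-Means.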

3.2. Population Diversity-Based Double-Mutation Operation
3.2.1. Population Diversity Calculation

The ability of the algorithm to search the optimal solution depends on the current population diversity. Tang et al. [21] defined the population similarity coefficient to judge the population diversity, and Wang et al. [22] defined the variance of the population fitness value to reflect the aggregation degree of all individuals in the population. Referring to their studies, this paper proposes a new indicator to evaluate population diversity, and the indicator can be calculated by the following formulas:

Here, the symbols in these formulas represent the population size NP, the ith individual of the current generation, the central individual in the population, and the average distance from all individuals in the population to the central individual. As shown in Figure 1, it is assumed that there are three individuals x1, x2, and x3 in the population, together with their central individual. The larger the average distance, the greater the distance between individuals, that is, the better the diversity. The smaller the average distance, the worse the population diversity, and the more clustered the individuals in the population.
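A minimal sketch of such a diversity indicator is given below. It assumes that the central individual is the arithmetic mean of the population and that distances are Euclidean and not normalized; the exact formulas (11)–(13) may differ, and the function name is hypothetical.

```python
import math

def population_diversity(pop):
    """Average Euclidean distance from each individual to the population centroid."""
    NP = len(pop)
    center = [sum(col) / NP for col in zip(*pop)]  # assumed central individual
    return sum(math.dist(x, center) for x in pop) / NP

spread = population_diversity([[0.0, 0.0], [2.0, 0.0]])
collapsed = population_diversity([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
```

In this example the spread-out population yields an indicator of 1.0, while the collapsed population, in which all individuals coincide, yields 0.0.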

3.2.2. Double-Mutation Strategy

In the evolution process, in order to balance the exploitation ability and the convergence speed of the algorithm, Zhang and Sanderson [23] and Islam et al. [24] adopted new mutation strategies, and Qin et al. [25] and Yi et al. [26] proposed multimutation strategies. Based on these studies, this paper combines two mutation strategies to carry out the mutation operation on the individuals of the population, which is referred to as the double-mutation operation. That is, the appropriate mutation strategy is selected according to the current population diversity.

As shown in formula (14), in the early stage of evolution, the population diversity is good, and the value of the diversity indicator is greater than the set threshold. At this time, the mutation strategy DE/best/1 is selected to guide the search direction of the population with the optimal individual, which enhances the exploitation ability of the algorithm and accelerates its convergence. As the number of generations increases, the population diversity declines rapidly. When the population diversity indicator is less than the set threshold, the mutation strategy DE/rand/1 is selected, which chooses individuals randomly to guide the search direction and improves the population diversity to avoid falling into a local optimal solution.
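The switching rule described above can be sketched as a single function. The sketch assumes that the diversity value has already been computed and treats the case where it equals the threshold like the low-diversity case, which the text does not specify; the function name and argument names are hypothetical.

```python
import random

def double_mutation(pop, i, best, diversity, threshold, F, rng):
    """Choose DE/best/1 while diversity is high, DE/rand/1 once it is low."""
    others = [j for j in range(len(pop)) if j != i]
    if diversity > threshold:
        # High diversity: exploit by searching around the best individual.
        r1, r2 = rng.sample(others, 2)
        return [b + F * (p - q) for b, p, q in zip(best, pop[r1], pop[r2])]
    # Low diversity (or equal to the threshold, by assumption): explore randomly.
    r1, r2, r3 = rng.sample(others, 3)
    return [p + F * (q - r) for p, q, r in zip(pop[r1], pop[r2], pop[r3])]

rng = random.Random(0)
pop = [[1.0, 1.0] for _ in range(5)]
best = [0.0, 0.0]
v_high = double_mutation(pop, 0, best, diversity=0.5, threshold=0.005, F=0.6, rng=rng)
v_low = double_mutation(pop, 0, best, diversity=0.001, threshold=0.005, F=0.6, rng=rng)
```

With identical individuals the difference vectors vanish, so the example makes the base vector visible: `v_high` equals the best individual and `v_low` equals a randomly chosen member of the population.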

In evolutionary algorithms, population diversity is generally approximated by the variance of the individual variable values: the larger the variance, the higher the diversity. The average indicator proposed in this paper is based on the distance from all individuals to the central individual; it is a variant of the variance measure and can reflect the change of population diversity.

3.3. Fitness Function

Clustering is an unsupervised learning method. When an evolutionary algorithm is used to solve a clustering problem, the problem should first be transformed into an optimization problem, and the objective function to be optimized (i.e., the fitness function) should be established. In this paper, the sum of within-class distances (WCD) is taken as the fitness function:

$$WCD = \sum_{j=1}^{k} \sum_{i=1}^{m_j} d\left( x_i^{(j)}, c_j \right) \tag{15}$$

As shown in formula (15), the symbols k, $m_j$, $x_i^{(j)}$, and $c_j$ represent the number of clusters, the total number of data points in class j, the ith data point in class j, and the clustering center of class j, respectively. In this paper, formula (4) is used to calculate the distance d from each data point to each clustering center. The smaller the value of WCD, the more concentrated the data points in each class, and the better the clustering effect; that is, the problem to be solved is the minimization of WCD.
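A short sketch of this fitness function follows. It assumes that each data point belongs to the class of its nearest clustering center and that the distance is Euclidean; the function name `wcd` is hypothetical.

```python
import math

def wcd(data, centers):
    """Sum over all points of the Euclidean distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in data)

data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
two_centers = wcd(data, [[0.0, 1.0], [10.0, 0.0]])
one_center = wcd(data, [[0.0, 0.0]])
```

Here two well-placed centers give a WCD of 2.0, whereas a single center at the origin gives 12.0, so the smaller value corresponds to the tighter clustering.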

3.4. Improved Differential Evolution Clustering Algorithm

The improved DE is combined with the K-Means clustering algorithm to obtain the optimized clustering algorithm, that is, the clustering algorithm based on the improved differential evolution. The initial clustering centers of the algorithm are randomly selected, and the clustering centers are moved toward their optimal locations during the evolution process, so that the final clustering result can approach the global optimum. The overall flow of the algorithm is given in Algorithm 1.

Input: Data set D = {d1, d2, …, dn}
Output: The optimal clustering
Begin
(1) Initialize the population and parameters;
(2) Evaluate the fitness of the population and keep the optimal value;
(3) Do
(4) Calculate the indicator of population diversity;
(5) Guide all individuals to perform the mutation operation;
(6) Perform the crossover operation;
(7) Perform the selection operation;
(8) Update the population;
(9) Keep the current optimal value;
(10) While (the optimal solution has not been obtained and the maximum number of iterations has not been reached)
End

In Algorithm 1, the population POP and each parameter value are initialized first. Then, according to formula (15), the objective function value of each individual is calculated, and the current optimal value is obtained. After that, the indicator of population diversity is calculated by formulas (11)–(13), and the mutated individuals are obtained by directing all individuals to perform the mutation operation selected according to the current population diversity. Then, the experimental individuals are obtained by performing the crossover operation on the mutated individuals. Formula (15) is used to evaluate the fitness of the experimental individuals and the current individuals, and the better individuals are selected to enter the next generation; accordingly, the objective function value of the optimal individual is retained. Finally, the algorithm returns to statement 3 and repeats until the optimal solution is obtained or the maximum number of iterations is reached.
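The overall flow of Algorithm 1 can be sketched end to end in Python as follows. This is a simplified illustration rather than the authors' MATLAB code. It assumes a flat encoding of k centers per individual, a centroid-based diversity indicator, nearest-center assignment in the fitness, and a fixed number of generations as the only stopping rule. All names and default values other than F, CR, and the threshold are hypothetical.

```python
import math
import random

def wcd(data, individual, dim):
    """Fitness: sum of distances from each point to its nearest encoded center."""
    centers = [individual[s:s + dim] for s in range(0, len(individual), dim)]
    return sum(min(math.dist(p, c) for c in centers) for p in data)

def diversity(pop):
    """Assumed indicator: mean distance of individuals to the population centroid."""
    center = [sum(col) / len(pop) for col in zip(*pop)]
    return sum(math.dist(x, center) for x in pop) / len(pop)

def improved_de_cluster(data, k, NP=20, F=0.6, CR=0.5, threshold=0.005,
                        max_gen=200, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    D = k * dim
    lows = [min(p[j] for p in data) for j in range(dim)]
    highs = [max(p[j] for p in data) for j in range(dim)]
    # Steps 1-2: random population and initial fitness values.
    pop = [[rng.uniform(lows[j % dim], highs[j % dim]) for j in range(D)]
           for _ in range(NP)]
    fit = [wcd(data, x, dim) for x in pop]
    for _ in range(max_gen):
        best = pop[fit.index(min(fit))]
        div = diversity(pop)  # step 4
        new_pop, new_fit = [], []
        for i, x in enumerate(pop):
            others = [j for j in range(NP) if j != i]
            if div > threshold:  # step 5: DE/best/1 while diversity is high
                r1, r2 = rng.sample(others, 2)
                v = [b + F * (p - q) for b, p, q in zip(best, pop[r1], pop[r2])]
            else:  # step 5: DE/rand/1 once diversity is low
                r1, r2, r3 = rng.sample(others, 3)
                v = [p + F * (q - r) for p, q, r in zip(pop[r1], pop[r2], pop[r3])]
            # Step 6: binomial crossover.
            j_rand = rng.randrange(D)
            u = [v[j] if (rng.random() <= CR or j == j_rand) else x[j] for j in range(D)]
            # Step 7: greedy selection between trial and parent.
            fu = wcd(data, u, dim)
            if fu <= fit[i]:
                new_pop.append(u)
                new_fit.append(fu)
            else:
                new_pop.append(x)
                new_fit.append(fit[i])
        pop, fit = new_pop, new_fit  # step 8
    b = fit.index(min(fit))  # step 9
    return pop[b], fit[b]

data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
best, fitness = improved_de_cluster(data, k=2)
```

Because selection is greedy, the best fitness value never gets worse from one generation to the next, which is why keeping the current optimal value in step 9 reduces to reading the minimum of the fitness list.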

4. Simulation Experiment and Analysis

4.1. UCI Standard Test Set

In order to verify the performance of the algorithm, this paper compares K-Means, K-Means++, and DE-K-Means clustering algorithm with the proposed algorithm. Three data sets were selected from the UCI as test datasets, and the properties are described in Table 1.

In the DE-K-Means algorithm and the proposed algorithm, the mutation factor F is set to 0.6, the crossover probability CR is set to 0.5, the population size is set in terms of dim, where dim represents the number of individual attributes, and the threshold value λ in the proposed algorithm is set to 0.005. If the algorithm converges to the same optimal solution more than 400 times, the algorithm is terminated. The maximum number of evaluations is 1500, and each algorithm is run 40 times independently on each test set. The simulation software used in the experiment is MATLAB R2016b.
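The termination rule can be sketched as a small helper. The sketch reads the rule as "stop when the best value has stayed unchanged for more than 400 consecutive checks, or when 1500 checks have been made", where one check corresponds to one call per generation; this reading, the class name, and the parameter names are assumptions.

```python
class StoppingRule:
    """Stop after too many unchanged best values or too many checks."""

    def __init__(self, max_stall=400, max_checks=1500):
        self.max_stall = max_stall
        self.max_checks = max_checks
        self.best = None
        self.stall = 0
        self.checks = 0

    def update(self, best_value):
        """Record one check; return True when the run should stop."""
        self.checks += 1
        if self.best is not None and best_value == self.best:
            self.stall += 1
        else:
            self.best = best_value
            self.stall = 0
        return self.stall > self.max_stall or self.checks >= self.max_checks

rule = StoppingRule(max_stall=2, max_checks=100)
decisions = [rule.update(5.0) for _ in range(4)]
```

With `max_stall=2`, the fourth identical value is the third repetition, so only the last decision in the example is a stop.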

The clustering results are shown in Tables 2–4, which list the maximum, minimum, and average values of the inner-class distance obtained through 40 independent experiments on the three UCI datasets. These experimental results show that the K-Means and K-Means++ algorithms converge quickly, with the fewest iterations. However, there is a large gap between their maximum and minimum values of the inner-class distance, and the results fluctuate greatly. Moreover, the tightness between data in the same class is poor, and the stability of the clustering results needs to be improved. Compared with the K-Means and K-Means++ algorithms, the objective function values obtained by the DE-K-Means algorithm and the proposed algorithm are better, and the stability and accuracy of the clustering results are improved; of the two, the proposed algorithm obtains the better clustering results. In short, the proposed algorithm performs better than the other algorithms on the three datasets, especially on the Zoo dataset.

The convergence curves of the DE-K-Means algorithm and the proposed algorithm on the UCI data are compared in Figures 2–4. Compared with the DE-K-Means algorithm, the objective function value of the proposed algorithm approaches the optimum earlier; that is, the proposed algorithm converges faster than the DE-K-Means algorithm. To sum up, the proposed algorithm performs well in stability, accuracy, and convergence speed.

4.2. Data Comparison of Lung Diseases in Traditional Chinese Medicine

Diffuse pulmonary interstitial disease is characterized by alveolar damage and interstitial fibrosis [27]. Because of its high morbidity and mortality, and with the deterioration of air quality, the prevention of the disease and the use of drugs to treat it have become topics of wide concern. In this paper, the clustering algorithm based on differential evolution is used to analyze the usage rules of Traditional Chinese Medicine prescriptions in the treatment of diffuse interstitial lung disease.

The data in this section come from the “Database of Literature Research on the Diagnosis and Treatment of Diffuse Pulmonary Interstitial Disease by modern famous veteran doctors of TCM,” which contains 39 TCM works and 16 articles, with a total of 270 records [28].

Based on the experimental results on the UCI datasets, in this section, the DE-K-Means algorithm and the proposed algorithm are used to cluster the drug data of diffuse interstitial pulmonary disease (hereinafter referred to as TCM data). In these two algorithms, the mutation factor F and the crossover probability CR are set to 0.6 and 0.5, respectively, and the population size NP equals 10∗dim. The threshold value in the proposed algorithm is set to 0.001. If the algorithm converges to the same optimal solution more than 400 times, the algorithm is terminated. The maximum number of evaluations is 2500. Each algorithm is run independently on the data 40 times. The simulation software used in the experiments is MATLAB R2016b.

A reasonable empirical value K = 7 is obtained by analyzing and comparing the experimental results for different numbers of categories. The experimental clustering results are shown in Table 5. The convergence graphs of DE-K-Means and the proposed algorithm on the given data are shown in Figures 5 and 6, respectively.

From Table 5 and Figures 5 and 6, it can be seen that the clustering effect of the proposed algorithm is better than the DE-K-Means algorithm for TCM data. Combined with the theory of TCM, the seven clustering results are described as follows.

The main drugs of class 1 include Angelica, Astragalus membranaceus, honeysuckle, and raw Astragalus. Among them, Astragalus membranaceus can nourish the middle and Qi. Angelica can replenish blood and activate blood. Honeysuckle can clear away heat and detoxify. Raw Astragalus can nourish the surface and stop sweating and invigorate the Qi and Yang. These drugs are matched to replenish Qi and blood, replenish diarrhea, and clear away heat and toxins. It is applicable to those who have the syndrome of deficiency of Qi and Yin, deficiency of Qi and blood, and stagnation of heat and toxin.

The main drugs of class 2 include Salvia miltiorrhiza, Angelica sinensis, red peony root, and Ligusticum wallichii. Among them, Salvia miltiorrhiza can activate blood circulation and regulate menstruation and can cool blood to eliminate carbuncle. Red peony root can clear heat and cool blood and can activate blood circulation to remove blood stasis. Ligusticum wallichii can open depression and can activate blood and relieve pain. These drugs are matched to promote blood circulation and remove blood stasis and are suitable for the symptoms caused by blood stasis.

The main drugs of class 3 include Fritillaria sichuanensis, Fritillaria thunbergii, Scutellaria baicalensis Georgi, and Schisandra chinensis. Among them, Fritillaria sichuanensis can clear away heat and moisten the lung, dissipate phlegm and stop cough, and can disperse the knot and eliminate carbuncle. Fritillaria thunbergii can clear away heat and phlegm and stop cough, detoxify the knot, and eliminate carbuncle. Scutellaria baicalensis can clear away heat and dry dampness and can relieve fire and detoxify. Schisandra chinensis can collect lung and stop cough and can nourish astringent essence. The combination of these drugs can clear the heat and reduce phlegm, which is suitable for the syndrome of phlegm-heat accumulated in lung.

The main drugs of class 4 include Ophiopogon japonicus, Adenophora verticillata, Schisandra chinensis, Fritillaria sichuanensis, almond, coix seed, Flos Farfarae, cortex mori, and aster. Among them, Ophiopogon japonicus can promote the secretion of saliva to quench thirst and can moisten lung to stop coughing. Adenophora verticillata can nourish yin and clear heat, moisten lung and dissipate phlegm, benefit stomach, and generate body fluid. Almond can relieve cough and asthma, moisten intestines, and relieve constipation. Coix seed can invigorate the spleen to arrest diarrhea, clear damp, and promote diuresis. Flos Farfarae can relieve cough. Aster can dissipate phlegm. Cortex Mori can purge the lung to calm panting, and induce diuresis to alleviate edema. The combination of these drugs can dissolve phlegm and arrest cough, moistening lung and promoting fluid production, which are suitable for the syndrome cough and asthma with deficiency of Qi and Yin and stagnation of phlegm heat.

The main drugs of class 5 include Codonopsis pilosula and licorice. Among them, Codonopsis pilosula can tonify middle-Jiao and Qi, strengthen spleen, and tonify lung. Licorice can tonify spleen and Qi, expel phlegm to arrest coughing, and relieve spasm and pain. The combination of these two drugs can invigorate the spleen and lung, which are suitable for the syndrome of deficiency of lung and spleen.

The main drugs of class 6 include honeysuckle, Trichosanthes, loquat leaf, and licorice. Among them, Trichosanthes can clear heat and remove phlegm and moisturize and smooth the intestines; loquat leaf can clear the lungs and relieve cough. These drugs are matched to clear heat and reduce phlegm, which is suitable for cough and asthma with counterflow caused by wind-heat invading the lung.

The main drugs of class 7 include tuckahoe and atractylodes. Among them, tuckahoe can clear damp and promote diuresis and tonify spleen and heart. Atractylodes can tonify the spleen and strengthen the stomach. These two drugs are matched to strengthen the spleen and eliminate dampness, which is suitable for the syndrome of deficiency of spleen.

The analysis of the above seven clustering results is consistent with the basic knowledge of Traditional Chinese Medicine. In the treatment of diffuse pulmonary interstitial disease, there are both methods such as clearing heat, resolving phlegm, relieving cough, relieving asthma, promoting blood circulation, removing blood stasis, and eliminating dampness, and methods such as tonifying Qi, nourishing Yin, enriching blood, restoring vitality, benefiting the lung, and tonifying the spleen and kidney. This supports the view that the main etiology and pathogenesis of diffuse pulmonary interstitial disease is the combination of deficiency and excess.

5. Related Works

Differential evolution has emerged as one of the fast, robust, and efficient global search heuristics of current interest. Das et al. [11] described an application of DE to the automatic clustering of large unlabeled data sets. In contrast to most of the existing clustering techniques, their algorithm requires no prior knowledge of the data to be classified. To study whether the performance of DE can be improved by combining several effective trial vector generation strategies with some suitable control parameter settings, Wang et al. [12] proposed a novel method, called composite DE (CoDE), which uses three trial vector generation strategies and three control parameter settings and randomly combines them to generate trial vectors. For unconstrained global optimization problems, Liu et al. [13] proposed a hybrid DE based on one-step k-means clustering and 2 multiparent crossovers, called clustering-based differential evolution with 2 multiparent crossovers (2-MPCs-CDE). In their method, the k cluster centers and several new individuals generate two search spaces. Xu et al. [14] proposed a superior-inferior (SI) crossover scheme based on DE. In their scheme, when the population diversity degree is small, the SI crossover is performed to improve the search space of the population. Otherwise, the superior-superior crossover is used to enhance its exploitation ability. Mohamed et al. [15, 16] proposed an adaptive guided differential evolution algorithm (AGDE) for solving global numerical optimization problems over continuous space, and they also proposed a novel differential evolution algorithm, called NDE, for solving constrained engineering optimization problems. The key idea of NDE is the use of a new triangular mutation rule, which searches for a better balance between the global exploration ability and the local exploitation tendency and enhances the convergence rate of the algorithm through the optimization process. Meng et al. [18, 19] proposed the parameter adaptive DE (PaDE) to tackle the weaknesses of DE, such as the improper control parameter adaptation schemes and the defect in a given mutation strategy. They also proposed a novel DE variant, named Depth information-based Differential Evolution with adaptive parameter control for numerical optimization (Di-DE), in which a novel mutation strategy, grouping strategy, and cooperative strategy are adopted to tackle the weaknesses of DE, such as the premature convergence of a mutation strategy to local optima and the misleading interaction among control parameters. Wang et al. proposed a self-adaptive mutation differential evolution algorithm based on particle swarm optimization (DEPSO) to improve the optimization performance of DE, in which the population diversity is maintained well in the early stage of the evolution, and a faster convergence speed is obtained in the later stage of the evolution.

6. Conclusions

This paper proposes an improved differential evolution algorithm, which uses a new indicator to evaluate population diversity and adopts the double-mutation strategy according to the current population diversity. The improved DE is applied to K-Means clustering to optimize and locate the clustering centers, which improves the performance and stability of the clustering algorithm. The simulation results show that the improved clustering algorithm improves both the global optimization ability and the convergence speed. Finally, the improved clustering algorithm is used to analyze the medication data of TCM in the treatment of pulmonary diseases. The clustering results accord with the theory of Traditional Chinese Medicine and support the view that the main etiology and pathogenesis of pulmonary diseases are intermingled deficiency and excess, with deficiency in the root and excess in the manifestation. As a whole, the study not only provides a reference for clinical treatment but also verifies the practicability of the proposed method.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant nos. 81703946 and 61902113); the subproject of the National Key Research and Development Program (Grant no. 2017YFC1703506); the Science and Technology Research Project of Henan Province (Grant no. 212102310362); the Young Teacher Program of Higher Education Institutions of Henan Province (Grant no. 2020GGJS104); and the Scientific Research Nursery Project of Henan University of Chinese Medicine (Grant no. MP2020-07).