Abstract
Railway signal equipment fault data (RSEFD) are an important part of the in-depth analysis of traffic big data across the whole life cycle of intelligent transportation. In the course of daily operation and maintenance, the railway electrical maintenance department records equipment malfunction information in natural language. These data are highly specialized, short-text, and category-imbalanced, and manual analysis and processing of them is inefficient. Effectively mining the information contained in these fault texts can provide valuable support for on-site operation and maintenance. We therefore propose a railway fault text clustering method based on an improved Dirichlet multinomial mixture model, called ICHGSDMM. In this method, first, a railway signal terminology thesaurus is established to overcome inaccurate segmentation of RSEFD. Second, the traditional chi-square statistic is improved to overcome the learning difficulties caused by the imbalance of RSEFD. Finally, the Gibbs sampling algorithm for the Dirichlet multinomial mixture model (GSDMM) is modified using the improved chi-square statistical method (ICH) to overcome the symmetry assumption on the word Dirichlet prior parameters in the traditional GSDMM. Compared with the traditional GSDMM model and the GSDMM model based on chi-square statistics (CHGSDMM), the quantitative experimental results show that the internal clustering evaluation indices of the GSDMM model based on improved chi-square statistics (ICHGSDMM) improve greatly, and its external evaluation indices are also the best, with the exception of the external index NMI on data set DS1. At the same time, the diagnostic accuracy of several minority categories in RSEFD improves considerably, demonstrating the method's efficacy.
1. Introduction
The intelligent transportation system (ITS) is the development trend of future transportation systems and has received increasing attention. In-depth analysis of traffic big data across the whole life cycle is becoming one of the key scientific and technical problems in China's intelligent transportation. At present, the field is at a primary stage: data coverage is not wide enough, applications are not deep enough, and data integration and intelligence need further improvement. How to fully exploit the value of the massive data covering the entire life cycle of the transportation field has become fundamental research that promotes the construction of a new generation of intelligent transportation systems [1]. Railway signal equipment fault data (RSEFD) are one part of this massive whole-life-cycle data and have received increasing attention.
Railway signal equipment generally refers to track circuits, signals, turnouts, and other equipment related to train operation; such equipment is the basis for ensuring the safe operation of trains. In daily operation, maintenance personnel record the fault phenomenon, the handling process of the equipment failure, and the fault diagnosis results in natural language, and store the fault data as text in paper or electronic files. With the growth of railway mileage and operations, a large amount of RSEFD has accumulated. These data have long been stored in unstructured textual form, which is not conducive to computer processing and understanding [2]. During routine maintenance of railway signal equipment, maintenance workers must frequently draw on the processing experience recorded in a large body of existing fault data and query and analyze these data manually, which results in low data processing efficiency and a low degree of intelligence in using the information [3, 4]. Mining the large amount of valuable fault identification and diagnosis information contained in the fault text can effectively reduce the search space and improve discovery efficiency [5, 6], and establishing associations between fault feature words and fault classes makes fault identification effective and eases the handling of similar situations in the future [7]. Railway personnel manually classify the severity and domain reflected by the textual semantics of railway faults based on professional knowledge [8]. Due to the unstructured nature of railway text data and the irregularity and randomness of personnel records [3], extracting accurate fault information from unstructured natural language remains a challenge.
The topic model is a classic text clustering method that can mine the semantic information of text well and is widely used. As the most popular topic model [9], LDA is often used for text clustering and has been applied successfully to long texts. Short texts, however, tend to have few words and sparse data. Because short texts rarely repeat words, screening relevant feature words is a challenging task for the traditional LDA topic model. Moreover, the context in short texts is very limited, so semantics-based feature word extraction is difficult. When traditional topic modeling techniques are applied to RSEFD, the characteristics of short texts must be considered, and feature extraction algorithms copied from long texts are often ineffective.
GSDMM [10] can automatically infer the number of clusters, strikes a good balance between the completeness and homogeneity of clustering results, and converges quickly, making it more effective than LDA at extracting hidden topics from short texts [11]. The GSDMM model assumes that the parameter of the word Dirichlet prior distribution is symmetric; that is, all words are given the same Dirichlet prior and are treated equally when the model is generated. In practice, different words contribute differently to clustering topics: GSDMM should consider the influence of a global weighted metric for each word [12], and the Dirichlet prior parameter of each word should differ accordingly.
To address the challenges posed by the symmetric assumption on the word Dirichlet prior of the GSDMM model, the chi-square statistic is introduced. The chi-square statistic tests the significance of the relationship between the value of a variable and a class [13], so the importance of different words to different classes can be well distinguished by the chi-square statistic (CHI). The larger the chi-square value of a feature item for a specific class, the more representative the word is of that class. Chi-square statistics greatly alleviate the sparseness of feature words in short text data sets. However, they also have shortcomings: the traditional chi-square algorithm does not take the uniform distribution of feature words within a class into account and ignores some features that rarely appear in the specified category but represent it well [14–16]. The imbalance of fault data categories degrades the performance of feature extraction algorithms and also causes serious difficulties for most clustering models and classifier learning algorithms, which assume a relatively balanced data distribution [7].
To solve the above problems and further improve both the mining quality of the hidden information in fault texts and the clustering effect on railway signal fault texts, this paper proposes an ICHGSDMM model for railway fault text clustering. The main contributions are summarized as follows:
(1) A professional word segmentation dictionary for the railway signal field is constructed. The natural language of signal fault text is highly specialized, and general text segmentation tools are ineffective for some professional vocabulary. The dictionary effectively improves the word segmentation accuracy of signal fault text and provides a good basis for feature words to represent text semantics and improve the text clustering effect.
(2) A feature word extraction method based on prior knowledge and improved chi-square statistics is proposed. The method filters the feature words of each category based on the relationship between feature words and categories, which effectively alleviates the problem of loose topics in short texts and greatly reduces the inaccurate feature word extraction caused by imbalanced data categories.
(3) An ICHGSDMM model based on prior knowledge of the railway domain is constructed. By changing the weight of the Dirichlet prior distribution of each word in the GSDMM model, the model improves the semantic balance of the fault text representation vectors generated by GSDMM and improves the text clustering effect.
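As a minimal illustration of how a domain dictionary improves segmentation (contribution (1)), the sketch below runs forward maximum matching over a hypothetical railway term list; a production pipeline would instead load such a thesaurus into a full Chinese segmenter (for example, as a user dictionary) rather than segment this way.

```python
def mm_segment(text, vocab, max_len=6):
    """Forward maximum-matching segmentation against a domain vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        # try the longest dictionary match first, fall back to single characters
        for length in range(min(max_len, len(text) - i), 0, -1):
            if length == 1 or text[i:i + length] in vocab:
                tokens.append(text[i:i + length])
                i += length
                break
    return tokens

railway_vocab = {"轨道电路", "机车信号", "道岔", "红光带"}   # hypothetical term list
tokens = mm_segment("机车信号显示异常检查轨道电路", railway_vocab)
```

Without the domain terms, a character-level fallback would split "轨道电路" (track circuit) into four meaningless single characters; with the dictionary it survives as one feature word.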
The remainder of this paper is organized as follows. Section 2 reviews the literature on topic models and chi-square statistics. Section 3 explains the feature word extraction algorithm with improved chi-square statistics. Section 4 elaborates the GSDMM and ICHGSDMM models. Section 5 presents the experimental data and analysis. Section 6 summarizes the paper and proposes future work.
2. Related Work
Early work in the field of railway signal fault text focused on extracting hidden fault information from fault texts for clustering and equipment fault type identification. For example, the authors of [4] used the TF-IDF algorithm for feature word extraction and then integrated multiple classifiers based on voting to achieve fault text classification. The authors of [17] applied Word2vec to generate word vectors and the SMOTE algorithm to balance the amount of data, and finally used convolutional neural networks to classify fault texts automatically. The authors of [18] put forward a fault text classification method based on Word2vec and parallel convolutional neural networks. Based on high-speed rail signal equipment fault texts, the authors of [3, 19] adopted the PLSA model and the labeled-LDA topic model for feature extraction and fault text clustering, respectively, to diagnose faults of onboard equipment in high-speed rail signaling systems. In [7], to classify fault texts, the authors presented a syntactic feature extraction approach using enhanced chi-square statistics and a semantic feature extraction method using an LDA topic model based on prior knowledge. The above methods usually represent the text as a vector by computing the word frequency or semantic information of the feature words in the fault text, and then compute similarities to realize clustering or classification.
Topic modeling approaches make it possible to cluster enormous amounts of unlabeled data efficiently. A topic model is an unsupervised machine learning model that belongs to the soft clustering methods and can effectively extract the semantic information in text to mine the topics of clustered texts. In the LDA model [20], each text is assumed to be a mixture of topics, with each topic consisting of a set of connected words that usually convey some semantic information [9, 21]. Railway signal fault text belongs to the short text domain: short texts contain few repeated words and the data are sparse, which leads to unsatisfactory LDA estimates of the topic distributions of texts and of words. The GSDMM proposed by the authors of [10] is more suitable for short text clustering. Compared with other topic clustering methods, the short text topic vectors generated by GSDMM are of better quality, the clustering results have good completeness and homogeneity, convergence is fast, and the model can handle the sparse, high-dimensional nature of short texts. The GSDMM model is the Dirichlet multinomial mixture (DMM) model solved with the collapsed Gibbs sampling algorithm, which assumes that each document is represented by exactly one topic. The authors of [22] adopted the GSDMM method for short text clustering in the field of web services, and their performance study showed that GSDMM is more effective than other traditional topic modeling methods. The authors of [23] first used the GSDMM topic model to generate the topic vector of each text and then applied the AGNES algorithm to analyze the clustering effect of the topic vectors; their results showed that the GSDMM approach yields better clustering quality for service texts.
The authors of [24] proposed the FGSDMM+ algorithm, which uses multiple runs of the collapsed Gibbs sampling algorithm to perform online text clustering; compared with the GSDMM and FGSDMM algorithms, FGSDMM+ shows better clustering performance. The authors of [25] put forward an adaptive Dirichlet multinomial mixture clustering model (eGSDMM), which uses a hyperparameter tuning algorithm to automatically capture the temporal dynamics of topic and word distributions for short texts; their results show that eGSDMM outperforms existing GSDMM methods on short text streams. In summary, there are currently few studies that improve on the GSDMM model's assumption that the word Dirichlet prior distribution is symmetric.
The larger the chi-square value of a feature item for a specific class, the more representative the word is of that class, and chi-square statistics are therefore often used for feature selection [26, 27]. Because the basic chi-square statistic is insufficient, several researchers have improved it. The authors of [15] proposed a modified chi-square feature selection approach based on the word frequency of feature items and their distribution within and between classes, and confirmed its efficacy. Aiming at the problem of missing attributes in some classes, the authors of [28] balanced the number of feature words screened for each class by improving the chi-square algorithm and combined it with an SVM classifier to improve the performance of an Arabic text classification model. This research on chi-square statistics in text classification models also illustrates their effectiveness in the field of text classification. Based on the above considerations, this paper studies railway signal fault text clustering based on ICHGSDMM.
3. Feature Extraction Based on Improved Chi-Square Statistics
The purpose of introducing chi-square statistics is to effectively extract the fault feature words of each category and reduce the impact of fault category imbalance on text clustering.
3.1. Chi-Square Statistics
The chi-square statistic (CHI) measures the degree of correlation between a word t and a class c_{i}, under the assumption that t and c_{i} conform to a χ^{2} distribution with one degree of freedom. The higher the χ^{2} value of an entry for a certain category c_{i}, the greater the correlation between them and the smaller their independence. The chi-square statistic is defined as [7]

χ^{2}(t, c_{i}) = N(AD − CB)^{2} / [(A + B)(C + D)(A + C)(B + D)],

where N is the number of documents in the corpus, c̄_{i} denotes the categories other than class c_{i}, and χ^{2}(t, c_{i}) shows the relevance between the word t and class c_{i}. A is the number of texts of class c_{i} that contain t, B is the number of texts of c̄_{i} that contain t, C is the number of texts of class c_{i} that do not contain t, and D is the number of texts of c̄_{i} that do not contain t; accordingly, A + B texts in the corpus contain the word t, C + D texts do not contain it, A + C texts belong to class c_{i}, and B + D texts do not belong to class c_{i}.
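The contingency-table form of the statistic vectorizes directly over a whole vocabulary and class set; a small sketch on binary document-word and document-class indicator matrices (names are illustrative):

```python
import numpy as np

def chi_square(doc_word, doc_class):
    """Chi-square statistic between every word and every class.

    doc_word:  (D, V) binary matrix, doc d contains word t
    doc_class: (D, K) binary matrix, doc d belongs to class c_i
    Returns a (V, K) matrix of chi^2(t, c_i)."""
    N = doc_word.shape[0]
    A = doc_word.T @ doc_class              # t present, doc in c_i
    B = doc_word.T @ (1 - doc_class)        # t present, doc in other classes
    C = (1 - doc_word).T @ doc_class        # t absent, doc in c_i
    D = (1 - doc_word).T @ (1 - doc_class)  # t absent, doc in other classes
    num = N * (A * D - C * B) ** 2
    den = (A + B) * (C + D) * (A + C) * (B + D)
    return num / np.maximum(den, 1e-12)     # guard against empty marginals

# toy corpus: one word that appears exactly in the class-0 documents
doc_word = np.array([[1], [1], [0], [0]])
doc_class = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
chi = chi_square(doc_word, doc_class)
```

With perfect association, chi^2(t, c_i) reaches its maximum value N (here 4) for both classes, since the word's presence and absence are each fully predictive.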
3.2. Improved Chi-Square Statistics
For clarity, we refer to a class with a small number of texts as a minority class and a class with more texts as a majority class. Traditional chi-square statistics consider only the frequency of documents containing a feature word, not the frequency of the word within those documents, which is a disadvantage for corpora with uneven data distributions. The notion of frequency is therefore introduced to overcome the unreliable feature word extraction caused by the tiny amount of text in minority classes, and the notions of inter-class concentration and intra-class dispersion are introduced to overcome the tendency of standard chi-square statistics to increase the weight of feature words that appear rarely in the target class but commonly in other classes [16].
To facilitate understanding, let K be the number of categories in a corpus, and let category C_{i} (1 ≤ i ≤ K) contain documents d_{i1}, …, d_{ij}, …, d_{iM} (1 ≤ j ≤ M). The document frequency d_{i}^{t} of the feature word t in category C_{i} is taken as the intra-class dispersion, df_{ij}^{t} is the frequency of t in document d_{ij}, and cf_{i}^{t} is the frequency of t in category C_{i}, calculated as

cf_{i}^{t} = Σ_{j=1}^{M} df_{ij}^{t},

where cf_{t} is the mean value of cf_{i}^{t} over all categories:

cf_{t} = (1/K) Σ_{i=1}^{K} cf_{i}^{t},

and tf_{i}^{t} is the inter-class concentration of the feature word t in category C_{i}, i.e., the share of the total frequency of t that falls in C_{i}:

tf_{i}^{t} = cf_{i}^{t} / (K · cf_{t}).
The improved chi-square statistic (ICH) is obtained by weighting the traditional statistic χ^{2}(t, c_{i}) with the frequency, intra-class dispersion, and inter-class concentration factors defined above.
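The effect of the correction factors can be sketched as follows. The multiplicative combination and the logarithmic frequency damping below are assumptions for illustration, not the paper's exact formula; what matters is the direction of the effect: a word equally chi-significant for two classes is pushed toward the class where it is actually frequent.

```python
import numpy as np

def ich_scores(chi2, cf):
    """Sketch of the ICH correction (multiplicative combination is assumed).

    chi2: (V, K) traditional chi-square values chi^2(t, c_i)
    cf:   (V, K) raw frequency cf_i^t of word t inside class c_i"""
    total = np.maximum(cf.sum(axis=1, keepdims=True), 1e-12)
    concentration = cf / total       # tf_i^t: share of t's occurrences held by c_i
    freq_factor = np.log1p(cf)       # tempers words supported by few occurrences
    return chi2 * concentration * freq_factor

chi2 = np.array([[4.0, 4.0]])        # equally chi-significant for both classes
cf = np.array([[9.0, 1.0]])          # but far more frequent in class 0
scores = ich_scores(chi2, cf)        # class 0 is now clearly preferred
```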
3.3. Feature Word Extraction
This paper first selects a fixed number of words as the important feature words of each category according to the ICH. This filtering effectively improves the quality of feature word extraction for minority classes and reduces the clustering problems caused by class imbalance in the corpus. We refer to the improved chi-square statistic value of a feature word as its ICH value and to the traditional chi-square statistic value as its CHI value.
The feature word extraction method based on ICH feature selection is given in Algorithm 1. Its inputs are the RSEFD set S, the fault term dictionary Ω, the fault category set C, and the threshold γ, which is the number of important words retained for each category.

Algorithm 1 first initializes five empty sets: FS is the corpus set, which holds the word sets after data preprocessing; FI is the ICH value set; FI′ is the normalized FI set; Fw_c is the prior ICH value set; and FS′ is the important feature word set (lines 1–2). According to the fault term dictionary Ω, the corpus set FS is obtained by preprocessing the RSEFD set S, including word segmentation and stop-word removal (lines 3–4). The ICH values of all words for each category in the corpus set FS are then calculated according to formula (5) and stored in the ICH value set FI (lines 6–9). To facilitate comparison of the relationships between different fault feature words and different categories, the ICH value of each word in FI is normalized (line 10) as

ICH′(w, c_{j}) = ICH(w, c_{j}) / Σ_{j=1}^{K} ICH(w, c_{j}),

where K is the number of categories in the RSEFD set S, w is a feature word in the corpus set FS, and c_{j} (1 ≤ j ≤ K) is a category in the maintenance data set S. Next, FI′ is filtered according to the threshold γ to obtain the prior ICH value set Fw_c (lines 11–13). Finally, the important feature word set FS′ is obtained from the prior ICH value set Fw_c (lines 14–15).
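The normalize-and-filter steps of Algorithm 1 (lines 10–15) can be sketched compactly; the per-word sum normalization over classes is an assumption consistent with the description above, and the variable names mirror the algorithm's sets:

```python
import numpy as np

def select_feature_words(ich, vocab, gamma):
    """Normalize each word's ICH values over the classes, then keep the
    top-gamma words of every class as its important feature words.

    ich:   (V, K) ICH value of each word for each class (the set FI)
    vocab: list of V words
    gamma: number of important words kept per category
    Returns fw_c: dict class index -> {word: normalized ICH value}."""
    ich_norm = ich / np.maximum(ich.sum(axis=1, keepdims=True), 1e-12)  # FI'
    fw_c = {}
    for c in range(ich.shape[1]):
        top = np.argsort(ich_norm[:, c])[::-1][:gamma]  # highest values first
        fw_c[c] = {vocab[v]: float(ich_norm[v, c]) for v in top}
    return fw_c

ich = np.array([[3.0, 1.0],          # word "relay" leans toward class 0
                [0.0, 2.0]])         # word "switch" occurs only in class 1
fw_c = select_feature_words(ich, ["relay", "switch"], gamma=1)
```

The fixed per-class quota γ is what protects minority classes: each class keeps its γ best words regardless of how many documents it has.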
4. Clustering Algorithms
This section first introduces the traditional GSDMM model and its implementation algorithm and then explains the ICHGSDMM model proposed in this paper.
4.1. GSDMM Model
GSDMM is the DMM model solved with the collapsed Gibbs sampling algorithm; it is a probabilistic generative unsupervised model. Under the assumption of a one-to-one correspondence between topics and documents, GSDMM uses an iterative Gibbs sampling algorithm to approximate the model and finally generates the topic distribution of the documents. Figure 1 shows a graphical representation of the simulated process by which DMM generates documents.
In the DMM model, α is the parameter of the topic Dirichlet prior distribution, β is the parameter of the word Dirichlet prior distribution, θ is the document-topic distribution matrix, and φ is the topic-word distribution matrix, which satisfy

θ ∼ Dirichlet(α), φ_{k} ∼ Dirichlet(β),

where θ_{k,d} is the probability of document d on topic k, and all topic probabilities of the same document d satisfy

Σ_{k=1}^{K} θ_{k,d} = 1,

where φ_{k,w} is the probability of word w on topic k, and the word distribution of the same topic satisfies

Σ_{w=1}^{V} φ_{k,w} = 1.
The topic of each document d obeys z_{d} ∼ Multinomial(θ).
The document generation process of the DMM model can be described as follows: it first selects a mixture cluster k according to formula (8); then, different algorithms are used to solve the model, finally giving the probability that topic k generates document d as

p(d | z_{d} = k) = Π_{w∈d} φ_{k,w}.
GSDMM is the approximate solution of the DMM model by collapsed Gibbs sampling. The algorithm obtains θ and φ by repeatedly resampling the topic of each document according to the conditional distribution (formula (12))

p(z_{d} = k | z_{¬d}, D) ∝ (m_{k,¬d} + α) / (D − 1 + Kα) × [Π_{w∈d} Π_{j=1}^{N_{d}^{w}} (n_{k,¬d}^{w} + β + j − 1)] / [Π_{i=1}^{N_{d}} (n_{k,¬d} + Vβ + i − 1)],

where m_{k,¬d} is the number of documents in cluster k excluding d, n_{k,¬d}^{w} is the count of word w in cluster k excluding d, n_{k,¬d} is the total number of words in cluster k excluding d, N_{d}^{w} is the number of occurrences of w in d, N_{d} is the length of d, and V is the vocabulary size; finally, we deduce the topic of each document.
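The sampler above can be sketched in a few dozen lines; this follows the standard GSDMM formulation with toy document and vocabulary sizes, and computes the conditional in log space for numerical stability:

```python
import numpy as np

def gsdmm(docs, V, K=20, alpha=0.1, beta=0.08, iters=20, seed=0):
    """Collapsed Gibbs sampling for DMM: one topic (cluster) per document.

    docs: list of documents, each a list of word ids in [0, V)
    Returns the cluster label z_d of every document."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = rng.integers(K, size=D)            # initial cluster of each document
    m = np.zeros(K)                        # m_k: documents per cluster
    n = np.zeros(K)                        # n_k: words per cluster
    nw = np.zeros((K, V))                  # n_k^w: per-word counts per cluster
    for d, doc in enumerate(docs):
        m[z[d]] += 1; n[z[d]] += len(doc)
        for w in doc: nw[z[d], w] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            k = z[d]                       # remove document d from its cluster
            m[k] -= 1; n[k] -= len(doc)
            for w in doc: nw[k, w] -= 1
            logp = np.log(m + alpha)       # cluster-popularity term
            for k2 in range(K):
                seen = {}
                for w in doc:              # numerator product over the doc's words
                    j = seen.get(w, 0)
                    logp[k2] += np.log(nw[k2, w] + beta + j)
                    seen[w] = j + 1
                for i in range(len(doc)):  # denominator product
                    logp[k2] -= np.log(n[k2] + V * beta + i)
            p = np.exp(logp - logp.max()); p /= p.sum()
            k = rng.choice(K, p=p)         # resample the cluster of document d
            z[d] = k
            m[k] += 1; n[k] += len(doc)
            for w in doc: nw[k, w] += 1
    return z

# toy corpus with two disjoint vocabularies
docs = [[0, 1, 1], [0, 0, 1], [2, 3, 3], [2, 2, 3]]
labels = gsdmm(docs, V=4, K=4, iters=10)
```

Because each document carries a single topic, empty clusters can die out during sampling, which is how GSDMM discovers a number of clusters smaller than the initial K.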
4.2. ICHGSDMM Model
In this section, we explain the ICHGSDMM model proposed in this paper, which introduces frequency, intra-class dispersion, and inter-class concentration into the traditional chi-square statistic. First, the important feature words W_{imp} of each category are screened out according to the threshold γ; then, the ICH values of the important feature words of each category are mapped to [λ_{1}, λ_{2}] and used as the Dirichlet prior parameters of these important words, namely β_{1}′, while the Dirichlet prior parameter β_{2}′ of all remaining feature words is set to λ_{1}.
In the ICHGSDMM model, the probability of document d selecting cluster k follows formula (12), with the symmetric prior β replaced by the word-specific prior (formula (13))

β_{w}′ = λ_{1} + (λ_{2} − λ_{1})(χ_{w}′ − χ′_{min}) / (χ′_{max} − χ′_{min}) for w ∈ W_{imp}, and β_{w}′ = λ_{1} otherwise,

where χ_{w}′ is the prior ICH value of word w, χ′_{max} is the maximum value in the Fw_c set, and χ′_{min} is the minimum value in the Fw_c set.
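The min-max mapping of prior ICH values into word-specific priors can be sketched directly; with λ_{1} = 0.08 and λ_{2} = 0.2 (the settings used later in the experiments), the least important of the important words keeps the base prior and the most important gets the largest:

```python
import numpy as np

def beta_prime(chi_vals, lam1=0.08, lam2=0.2):
    """Map the prior ICH values of the important words into [lam1, lam2]
    to obtain their word-specific Dirichlet priors beta'_w; non-important
    words keep the base prior lam1."""
    x = np.asarray(chi_vals, dtype=float)
    xmin, xmax = x.min(), x.max()
    if xmax == xmin:                     # degenerate case: all values equal
        return np.full_like(x, lam1)
    return lam1 + (lam2 - lam1) * (x - xmin) / (xmax - xmin)

b = beta_prime([0.1, 0.5, 0.9])          # priors spread across [0.08, 0.2]
```

A larger β_{w}′ makes the sampler more willing to place documents containing w into the cluster where w is a prior-known marker, which is the mechanism by which the ICH knowledge enters GSDMM.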
Table 1 displays the symbols in the ICHGSDMM model, and Algorithm 2 describes the main steps of the ICHGSDMM model.

First, the model's count variables are initialized in line 1. Then, the important feature word set FS′ and the prior chi-square set Fw_c are obtained by calling Algorithm 1 (line 2). Next, the corrected parameter β′ of the Dirichlet prior distribution in the GSDMM model is obtained according to formula (13) in line 3. The topic of each document in the corpus is then initialized (lines 4–8). Lines 9–18 are the iterative calculation of GSDMM with the collapsed Gibbs sampling algorithm according to formula (13). Finally, the document-topic distribution matrix of the corpus is obtained from the ICHGSDMM model.
5. Experiments and Analysis
5.1. Evaluation Metrics
The indicators used to evaluate the performance of clustering algorithms fall into two categories: internal and external evaluation. Internal evaluation does not require ground-truth labels; it evaluates the clustering effect with similarity measures of intra-class and inter-class relationships. External evaluation requires ground-truth labels; whether the clustering is reasonable is judged by analyzing the relationship between the cluster labels and the ground-truth labels.
In our study, the internal evaluation indices are the silhouette coefficient (SC) [29] and the Davies–Bouldin index (DBI) [30]. The external evaluation indices are normalized mutual information (NMI) [29], adjusted mutual information (AMI) [29], homogeneity (H) [11], and completeness (C) [11].
5.1.1. Internal Evaluation
(1) Silhouette Coefficient. The silhouette coefficient (SC) measures the separation between clusters. For a single cluster k,

SC_{k} = (1/N) Σ_{i=1}^{N} (b_{i} − a_{i}) / max(a_{i}, b_{i}),

where a_{i} is the average distance from element i to the other elements of the same category, b_{i} is the average distance from element i to the elements of the nearest other category, and N is the total number of elements in cluster k.
The final silhouette coefficient score over all clusters is the mean of SC_{k}:

SC = (1/K) Σ_{k=1}^{K} SC_{k}.
The value of SC represents the quality of clustering performance, the higher the value, the better the clustering performance.
(2) Davies–Bouldin Index (DBI). The DBI combines between-cluster and within-cluster distances and is defined as

DBI = (1/N) Σ_{i=1}^{N} max_{j≠i} (σ_{i} + σ_{j}) / ‖x_{i} − x_{j}‖,

where N is the number of clusters, x_{i} and x_{j} are the centers of the ith and jth clusters, respectively, and σ_{i} and σ_{j} are the average distances from all points of the ith and jth clusters to their center points, respectively. DBI values reflect how similar texts are within the same cluster and across different clusters; the lower the DBI value, the better the clustering algorithm.
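Both internal indices can be computed directly from their definitions; the small-data sketch below uses Euclidean distances (library implementations should be preferred for real corpora):

```python
import numpy as np

def silhouette_dbi(X, labels):
    """Mean silhouette coefficient and Davies-Bouldin index from the
    definitions above.

    X: (N, F) document vectors; labels: (N,) integer cluster assignments."""
    clusters = np.unique(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise
    s = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself from a_i
        a = dist[i, same].mean() if same.any() else 0.0
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s.append((b - a) / max(a, b))
    centers = np.array([X[labels == c].mean(axis=0) for c in clusters])
    sigma = np.array([np.linalg.norm(X[labels == c] - centers[k], axis=1).mean()
                      for k, c in enumerate(clusters)])
    dbi = np.mean([max((sigma[i] + sigma[j]) / np.linalg.norm(centers[i] - centers[j])
                       for j in range(len(clusters)) if j != i)
                   for i in range(len(clusters))])
    return float(np.mean(s)), float(dbi)

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
sc, dbi = silhouette_dbi(X, labels)          # well separated: high SC, low DBI
```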
5.1.2. External Evaluation
The external evaluation indices NMI, AMI, homogeneity, and completeness all require the ground-truth labels and the cluster labels.
(1) Normalized Mutual Information. Normalized mutual information (NMI) is defined as

NMI(X, Y) = MI(X, Y) / sqrt(H(X) H(Y)),

where X = {x_{1}, x_{2}, …, x_{N}} is the cluster division after clustering and Y = {y_{1}, y_{2}, …, y_{N}} is the real category division. H(X) and H(Y) denote the entropy of X and Y, respectively, and MI(X, Y) is the mutual information between X and Y.
(2) Adjusted Mutual Information. Adjusted mutual information (AMI) is calculated as

AMI(X, Y) = (MI(X, Y) − E[MI(X, Y)]) / (max(H(X), H(Y)) − E[MI(X, Y)]),

where E[·] is the expectation of MI(X, Y) under random labelings.
(3) Homogeneity (H). Homogeneity is defined as

H = 1 − H(C|K) / H(C), with H(C) = −Σ_{c} (n_{c}/n) log(n_{c}/n) and H(C|K) = −Σ_{c} Σ_{k} (n_{c,k}/n) log(n_{c,k}/n_{k}),

where H(C) is the entropy of the category division, H(C|K) is the conditional entropy of the category division given the clustering, n is the total number of texts in the corpus, n_{c} is the number of texts in category c, n_{k} is the number of texts in cluster k, and n_{c,k} is the number of texts of class c assigned to cluster k.
Homogeneity expresses the goal that each cluster contains elements of only one true group. A cluster is perfectly homogeneous if all elements in a cluster have the same external label.
(4) Completeness (C). The variable definitions of completeness mirror those of homogeneity, and completeness is defined via the conditional entropy of the cluster distribution given the external class labels:

C = 1 − H(K|C) / H(K).

Completeness expresses the goal that all members with the same ground-truth label are assigned to one cluster.
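The three entropy-based external indices can be computed together from the contingency counts; a compact sketch with natural logarithms (using the identities H = MI/H(C) and C = MI/H(K), which follow from H(C|K) = H(C) − MI):

```python
import numpy as np

def external_scores(y_true, y_pred):
    """Homogeneity, completeness, and NMI from the definitions above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    def entropy(y):
        _, cnt = np.unique(y, return_counts=True)
        p = cnt / n
        return float(-np.sum(p * np.log(p)))
    hc, hk = entropy(y_true), entropy(y_pred)
    mi = 0.0                                   # mutual information MI(X, Y)
    for c in np.unique(y_true):
        for k in np.unique(y_pred):
            nck = np.sum((y_true == c) & (y_pred == k))
            if nck:
                mi += (nck / n) * np.log(n * nck /
                        (np.sum(y_true == c) * np.sum(y_pred == k)))
    h = 1.0 if hc == 0 else mi / hc            # homogeneity = 1 - H(C|K)/H(C)
    comp = 1.0 if hk == 0 else mi / hk         # completeness = 1 - H(K|C)/H(K)
    nmi = mi / np.sqrt(hc * hk) if hc > 0 and hk > 0 else 1.0
    return h, comp, nmi

# a perfect clustering up to label permutation scores 1.0 on all three
h, comp, nmi = external_scores([0, 0, 1, 1], [1, 1, 0, 0])
```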
5.1.3. Classification Correct Rate
To compare classification accuracy, we introduce the classification correct rate (CCR) [31], defined as

CCR = (1/n) Σ_{d=1}^{n} δ(y_{d}, y′_{d}),

where n is the total number of texts in the cluster, and y′_{d} and y_{d} represent the predicted class label of document d and the highest-ranked label among the predicted class labels, respectively. δ(·) is an indicator variable: when classifying a multi-label data set, δ(y_{d}, y′_{d}) = 1 if y′_{d} is in y_{d}, and 0 otherwise. The larger the CCR value, the better the clustering performance. Introducing classification accuracy provides a good assessment of the performance of clustering models.
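For the single-label case the formula reduces to a simple match rate; in a clustering setting, cluster ids must first be mapped to class labels (e.g., by the majority class of each cluster), and that mapping is assumed already done in this sketch:

```python
def ccr(true_labels, predicted_labels):
    """Classification correct rate for the single-label case: the fraction
    of documents whose predicted label matches the ground truth."""
    assert len(true_labels) == len(predicted_labels)
    hits = sum(1 for y, yp in zip(true_labels, predicted_labels) if yp == y)
    return hits / len(true_labels)

score = ccr([0, 1, 1, 2], [0, 1, 2, 2])  # 3 of 4 documents correct
```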
5.2. Experimental Data Set
The experimental data set DS1 is a Chinese data set: an RSEFD set collected by a railway company in China from 2016 to 2020, with 1527 samples in total. To better test the clustering performance of the proposed ICHGSDMM model, the English data set DS2 is also introduced. DS2 is provided at https://github.com/pokarats/gsdmm and contains 20000 records. Table 2 shows examples from data sets DS1 and DS2, and Table 3 describes each fault category of DS1 and its proportion of the whole data set. Table 3 shows that the RSEFD set DS1 is a typical imbalanced data set: track circuit faults (C2) and switch faults (C5) are the majority classes, while LKJ faults (C3) and cab signal faults (C6) are minority classes. The classification accuracy of every fault category plays a key role in ensuring the safety and efficiency of the railway system. Data set DS2 contains 20 categories, each with 1000 samples.
5.3. Experimental Setup and Results
The experimental machine is configured with an Intel Core i7-10510U CPU, 16.0 GB of RAM, and the Windows 10 operating system, and the program is written in Jupyter Notebook.
This section is divided into two parts. The first describes the parameter settings of each topic model; the second evaluates and analyzes the clustering performance of GSDMM, CHGSDMM, and ICHGSDMM.
5.3.1. Parameter Setting
The value of the word Dirichlet prior parameter β affects the performance of GSDMM. According to the literature [10], when β is in [0.08, 0.1], the GSDMM model has high homogeneity and completeness, so this paper selects β = 0.08.
(1) In the GSDMM model, α = 0.1 and β = 0.08.
(2) In the CHGSDMM and ICHGSDMM models, α = 0.1, λ_{1} = 0.08, λ_{2} = 0.2, and β_{2}′ = λ_{1}.
(3) K = 20 and γ = 50 for data set DS1; K = 40 and γ = 200 for data set DS2. The number of iterations is 20, 40, and 60, and all reported values are the means over the different iteration counts.
5.3.2. Analysis and Discussion
(1) Internal Evaluation. Tables 4 and 5 report the SC, CH, and DBI results of the three topic models for 20, 40, and 60 clusters, respectively, and Table 6 gives the mean internal evaluation results on DS1 and DS2. The mean SC, DBI, and CH values of the ICHGSDMM model are all better than those of the CHGSDMM and GSDMM models, and the internal evaluation scores improve considerably; for example, on data set DS1 the SC score of ICHGSDMM is 0.943, while that of the GSDMM model is 0.121. Compared with the traditional GSDMM model, both the CHGSDMM and ICHGSDMM models significantly improve the internal evaluation performance of the clustering model.
(2) External Evaluation. Tables 7 and 8 report the NMI, AMI, H, and C results of the three topic models for 20, 40, and 60 clusters, and Table 9 gives the mean external evaluation results on DS1 and DS2. The NMI and AMI scores of the ICHGSDMM model on data set DS1 are the same as those of the GSDMM model, and its scores on the remaining external indices H and C are the best among the three models. On data set DS2, the overall external evaluation results of ICHGSDMM are better than those of the CHGSDMM and GSDMM models.
(3) CCR Analysis. The reported CCR value is the average of the per-category CCR values on data sets DS1 and DS2. Table 10 shows the CCR scores on DS1 and DS2: the ICHGSDMM model achieves the highest score of 0.614, followed by CHGSDMM and GSDMM.
The per-class CCR scores on data sets DS1 and DS2 are shown in Figure 2.
In Figure 2(a), data set DS1 contains 7 ground-truth labels, C0–C6. Because C1 (ATP fault) has little correlation with the other classes, its CCR value reaches 1.0 under ICHGSDMM, better than under the CHGSDMM and GSDMM models. Except for class C2, the CCR scores of all classes of DS1 under the ICHGSDMM model are better than those under the GSDMM and CHGSDMM models. The lower CCR score of class C2 may arise because C2 (track circuit fault) concerns basic ground equipment of railway signaling, belongs to the majority classes of DS1, and is strongly correlated with classes C3, C4, and C5. Compared with the GSDMM model, the CCR scores of the minority classes C0, C3, and C6 under the ICHGSDMM model improve greatly, showing that the prediction of minority classes by the ICHGSDMM model has improved.
The per-class CCR analysis in Figure 2 shows that the overall performance of the ICHGSDMM model remains the best of the three models.
(4) Effect of the Number of Clusters. To study the effect of the number of iterations on the number of clusters discovered by the ICHGSDMM, CHGSDMM, and GSDMM models, we set the initial cluster number K to 20 for data set DS1 and to 40 for data set DS2. Figure 3 displays the number of clusters discovered by the three models at different iterations.
Figure 3(a) shows that the number of clusters discovered by the ICHGSDMM, CHGSDMM, and GSDMM models decreases rapidly and stabilizes after approximately 9, 15, and 7 iterations, respectively; in order of closeness to the actual number of clusters, the models rank ICHGSDMM, then CHGSDMM, then GSDMM.
Figure 3(b) shows that the number of clusters discovered by the ICHGSDMM and CHGSDMM models drops rapidly after about 6 iterations, while that of the GSDMM model drops rapidly only after about 17 iterations, and the number of clusters finally discovered by GSDMM differs most from the actual number of categories in DS2. Both ICHGSDMM and CHGSDMM discover the number of clusters faster, and ICHGSDMM comes closest to the actual number of clusters after about 28 iterations. Data set DS2 contains far more documents than DS1 (DS1 is 92.34% smaller), which may be why the number of clusters discovered by the ICHGSDMM model in Figure 3(b) takes longer to stabilize.
6. Conclusion
Compared with traditional topic modeling techniques, the GSDMM model is more suitable for short text clustering. However, the GSDMM model assumes that the Dirichlet prior distribution of words is symmetric, i.e., all words receive the same prior and are treated equally when the model is constructed, which is unrealistic. To solve this problem, we proposed the ICHGSDMM model. The improved chi-square (ICH) method introduces frequency, intra-class dispersion, and inter-class concentration into the traditional chi-square (CHI) method, and the ICHGSDMM model uses the ICH method to generate the Dirichlet priors of the important words of each category in the corpus, thereby modifying the traditional GSDMM model. Finally, we evaluated the internal and external clustering performance of the traditional GSDMM and CHGSDMM models and the proposed ICHGSDMM model. The results indicate that the internal evaluation indices of the ICHGSDMM model improve greatly, and the external evaluation indices also improve, except for NMI on data set DS1. For the imbalanced data set DS1, the classification accuracy of the minority classes improves significantly, which verifies the effectiveness of the model.
Future work will further optimize the calculation of the word Dirichlet priors in the GSDMM model and evaluate the impact of the number of important words per category on the clustering effect, so as to improve the ICHGSDMM model and its external evaluation performance.
Data Availability
All data, models, and code generated or used during the study appear in the submitted article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was funded by the National Natural Science Foundation of China under Project No. 51967010.