Abstract

This paper proposes a multilevel classification model for high-speed railway signal equipment faults, based on text mining of signal fault data. An improved TF-IDF feature representation is proposed to extract features from the fault text of signal equipment. Within the multilevel model, the single-layer classification model is designed on the idea of stacking ensemble learning: the recurrent neural networks BiGRU and BiLSTM serve as primary learners, a combined weight calculation method is designed for the secondary learner, and k-fold cross-validation is used to train the stacking model. A multitask cooperative voting decision tree is designed to correct the membership relationship between the classification results of each layer. Ten years of fault data for high-speed railway signal switch machines are used for experimental analysis. The experiments show that the multilevel classification model improves the evaluation indexes of the multilevel fault classification task for signal equipment and ensures that the subordinate relations among the classification results are correct.

1. Introduction

High-speed railway signal equipment is important infrastructure for ensuring the safety of high-speed railway operation [1]. As operating mileage accumulates, a large amount of signal equipment fault data is generated. Most of these data are stored as unstructured text and contain important information about high-speed railway safety. For a long time, staff have diagnosed and classified faulty equipment from these data on the basis of experience. Because high-speed railways involve many types of signal equipment and equipment faults with different underlying principles, and because the subordinate relationship between equipment and fault causes is strict, in-depth analysis of fault data requires multilevel classification, and manual multilevel classification is prone to inaccuracy. In the context of smart railways and railway big data, there is an urgent need to study machine learning algorithms based on text mining that realize the multilevel classification of high-speed railway signal equipment faults.

Multilevel classification methods include top-down classification, global classification, and shrinkage classification [2, 3]. The multilevel classification of high-speed railway signal equipment faults applies the “divide and conquer” strategy of the top-down method: it decomposes the large-scale multilevel classification problem into single-level classification problems, obtains the classification result of each level from a single-level classification model, and finally gathers the results of all levels and corrects their membership relationships through the voting strategy of a multitask collaborative decision tree, thereby realizing the multilevel classification of signal equipment faults.

The single-layer classification model is a typical text classification model. Text data are usually quantified with feature vector extraction methods such as BOW (Bag of Words) [4], TF-IDF (term frequency-inverse document frequency) [5], TM (Topic Model), and the deep-learning-based Word2Vec [6]. A machine learning model then learns from the text feature vectors and classifies the text. Machine learning models for text classification include single classifiers, ensemble learning classifiers, and deep learning models. Single classifiers include DT (Decision Tree), SVM (Support Vector Machine), and NBC (Naive Bayes Classifier). Ensemble classifiers improve classification performance by combining multiple single classifiers, mainly through bagging and boosting. Ensemble classifiers based on the idea of stacking [7] improve classification and generalization performance by stacking different types of single classifiers. The most common deep learning classification models are RNN (Recurrent Neural Network) [8] and CNN (Convolutional Neural Network) [9], together with BiGRU (Bidirectional Gated Recurrent Unit) [10] and BiLSTM (Bidirectional Long Short-Term Memory) [11].

In this paper, the multilevel fault classification model of high-speed railway signal equipment is designed on the basis of existing research on feature extraction and single-layer classification of railway safety text. First, according to the characteristics of the fault data of high-speed railway signal equipment, an improved feature extraction method based on TF-IDF is proposed. To avoid the overfitting caused by a single classifier, the single-layer classification model is implemented as a stacking classification model trained with k-fold cross-validation. In the stacking model, a weight allocation mechanism that combines the overall weight and the category weight is proposed as the secondary learner to improve single-layer classification performance. Using the stacking model, the tasks of the different levels are classified, and a multitask cooperative voting decision tree is designed to correct the membership relationship between the classification results of different levels and to improve the performance of the whole multilevel classification model. Finally, experiments on the signal equipment fault data of high-speed railways from 2009 to 2018 verify the validity and correctness of the multilevel classification model.

2. High-Speed Railway Signal Fault Text Feature Extraction

The fault data of high-speed railway signal equipment come from the railway traction power supply management information system (EMIS). The fault data record the detailed information of each fault in a structured form, while the cause of the fault is stored as natural language text. Based on an analysis of the fault cause text, this paper classifies the fault causes and the faulty parts. The longest fault record in the dataset is 100 characters. A total of 2596 fault cases are selected, as shown in Table 1.

The fault cause text of high-speed railway signal equipment contains characteristic keywords such as “switch,” “red light band,” and “paster.” TF-IDF is used to extract the features of the fault text data [12]. The principle of TF-IDF is that if a word appears frequently in one sample but few samples in the whole collection contain it, the word contributes strongly to identifying that sample and discriminates well. The number of fault documents for high-speed railway signal equipment is large, but each document is a short text. Applying TF-IDF directly therefore tends to produce redundant and sparse feature vectors that lack specificity, so this paper improves the TF-IDF-based feature extraction method for the fault data of high-speed railway signal equipment.

The improved TF-IDF feature extraction method for the fault text of high-speed railway signal equipment is shown in Figure 1. First, the Chinese text is segmented. In this paper, the Jieba word segmentation tool, backed by both a professional corpus and a common corpus, is used to segment the signal fault text [13], and auxiliary words such as “de” and “le,” together with other words that cannot represent document features, are removed. TF-IDF is then used to calculate the weight of each word in the vocabulary set, forming a vocabulary weight matrix, and each word is numbered to form a vocabulary dictionary. The dimensions of the TF-IDF weight matrix are the number of documents by the size of the vocabulary over all documents, so the matrix is large and severely sparse. To reduce the dimension of the feature vector, the words in each sample are sorted by TF-IDF value, with repeated words allowed, and the first 100 words that best characterize the sample are selected. The word frequency entries are then replaced by the corresponding word IDs to form the feature dictionary matrix. Finally, the text feature vectors and the one-hot encoded labels of each level are input into the text classification model.
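
The selection step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the documents are already segmented into word lists (for example by Jieba), uses a smoothed IDF variant chosen here for convenience, and the function name `tfidf_id_features` and its parameters are hypothetical.

```python
import math
from collections import Counter


def tfidf_id_features(docs, stop_words=frozenset(), top_n=100, pad_id=0):
    """Turn tokenised documents into fixed-length word-ID vectors.

    Tokens (repeats kept) are ranked by TF-IDF, the first `top_n` are kept,
    and each is replaced by its ID in the vocabulary dictionary.
    """
    # Drop words that carry no document features (e.g. auxiliary words).
    cleaned = [[w for w in doc if w not in stop_words] for doc in docs]

    # Vocabulary dictionary: word -> ID starting at 1 (0 is left for padding).
    vocab = {}
    for doc in cleaned:
        for w in doc:
            vocab.setdefault(w, len(vocab) + 1)

    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in cleaned for w in set(doc))
    n_docs = len(cleaned)

    features = []
    for doc in cleaned:
        tf = Counter(doc)
        # Smoothed IDF (an assumption; the paper does not state its variant).
        score = {
            w: (tf[w] / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w in tf
        }
        # Stable sort: highest TF-IDF first, original order kept among ties.
        ranked = sorted(doc, key=lambda w: -score[w])
        ids = [vocab[w] for w in ranked[:top_n]]
        features.append(ids + [pad_id] * (top_n - len(ids)))
    return features, vocab
```

Each returned vector has exactly `top_n` entries, so short fault records are padded and long ones truncated, which is what keeps the input to the embedding layer dense.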

3. Single-Layer Text Classification Model of Signal Equipment Fault Based on Stacking

The fault text feature dataset of high-speed railway signal equipment is divided into a training set, a validation set, and a test set, which are input into the stacking single-layer classification model. In this stacking-based single-layer text classification of signal equipment faults, the recurrent neural networks BiGRU and BiLSTM serve as the primary learners, and their prediction results are used as features to train the combined weighted secondary learner, which integrates the predictions of the primary learners. To avoid overfitting to the training set, and to train multiple single-layer classification models that produce multiple prediction results for the same test set, k-fold cross-validation is adopted, as shown in Figure 2.
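
The way k-fold training yields several predictions for the same test set can be illustrated as follows. This is a sketch under stated assumptions: the folds are contiguous and unshuffled, and `fit` is a hypothetical callable standing in for training one stacking model; neither detail is taken from the paper.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_predict(train_x, train_y, test_x, k, fit):
    """Train one model per fold and return k prediction lists for test_x.

    `fit(xs, ys)` is a hypothetical callable that trains a model on the
    given samples and returns a function mapping one input to a label.
    """
    all_preds = []
    for held_out in k_fold_indices(len(train_x), k):
        held = set(held_out)
        # Train on every sample outside the held-out fold.
        xs = [x for i, x in enumerate(train_x) if i not in held]
        ys = [y for i, y in enumerate(train_y) if i not in held]
        predict = fit(xs, ys)
        # Every fold's model predicts the same, untouched test set.
        all_preds.append([predict(x) for x in test_x])
    return all_preds
```

With k = 5 this returns five prediction lists for the test set, which is the input that a later voting stage would need.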

3.1. Principles of BiLSTM and BiGRU Primary Learners

As neural networks become deeper, the vanishing gradient problem becomes more serious. Deep neural networks (DNNs) were developed to overcome vanishing gradients and make deep learning feasible, but they introduce a new problem: the number of parameters grows rapidly. Convolutional neural networks (CNNs) address this by linking the neurons of adjacent layers through convolution kernels instead of connecting all of them directly. However, a CNN cannot model changes over a time series, and recurrent neural networks (RNNs) were introduced to meet this need. An RNN is a neural network for processing sequence information. Because each step depends on the steps before it, RNNs have been widely used in natural language applications. The distinguishing feature of an RNN is that its output at time $t$ depends not only on the input $x_t$ but also on the output of the previous node. Its learning process is one of predicting the next word. For example, suppose the inputs $x_{t-1}$, $x_t$, and $x_{t+1}$ are the words of “switch positioning no.” Then $o_{t-1}$ and $o_t$ correspond to “positioning” and “no,” and the network predicts the most likely next word. After training on the signal fault corpus, $o_{t+1}$ is most likely to be “representation.” Here $h_t$ denotes the state of the hidden layer at time $t$, $x_t$ the input at time $t$, $o_t$ the output at time $t$, $c_t$ the memory unit at time $t$, and $U$, $V$, and $W$ the linear parameter matrices of the model. The structure is shown in Figure 3.
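
The dependence of each output on both the current input and the previous state is usually written as the recurrence below. This is the textbook form of a simple RNN, given for orientation only; the symbols ($x_t$ for the input, $h_t$ for the hidden state, $o_t$ for the output, $U$, $W$, $V$ for weight matrices, $b_h$, $b_o$ for biases) are notation assumed here.

```latex
\begin{aligned}
h_t &= \tanh\left(U x_t + W h_{t-1} + b_h\right), \\
o_t &= \operatorname{softmax}\left(V h_t + b_o\right).
\end{aligned}
```

Because $h_{t-1}$ itself depends on $h_{t-2}$, every earlier input influences $o_t$, and repeated multiplication by $W$ during backpropagation is what causes gradients to vanish on long sequences.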

If the sequence is too long, an RNN suffers from vanishing gradients. LSTM, a variant of the RNN, addresses this problem by learning long-term dependencies. Both LSTM and GRU use gate structures so that information selectively affects the state at each time step of the RNN. A gate combines a sigmoid operation with an element-wise multiplication. Because the output of the sigmoid lies between 0 and 1, it determines how much information is kept or forgotten: 0 discards all the information, and 1 retains all of it. In general, the sigmoid function is used as the gate activation function and the tanh(z) function as the output function.
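
A gate in this sense is small enough to write out directly. The sketch below uses plain Python lists, and the function names `sigmoid` and `gate` are illustrative; it shows only the sigmoid-then-multiply mechanism, not a full LSTM or GRU cell.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gate(pre_activations, values):
    """Scale each value by sigmoid(pre-activation), element by element.

    A gate output near 0 discards the value; near 1 it passes it through.
    """
    return [sigmoid(a) * v for a, v in zip(pre_activations, values)]
```

A strongly negative pre-activation therefore suppresses the corresponding value almost entirely, while a strongly positive one leaves it almost unchanged.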

An LSTM has three gates: the input gate, the forget gate, and the output gate, as shown in Figure 3. These gates allow information to selectively influence the state of the RNN at each time step. The forget gate decides which information to discard. The input gate consists of two layers: first, a sigmoid layer determines which values are to be updated; then, a tanh(z) layer generates a new candidate vector that is added to the cell state, so that the new input information replaces the information being forgotten. Finally, a sigmoid layer in the output gate determines which parts of the cell state are output.
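
For reference, the three gates are commonly formalized as follows. This is the standard LSTM formulation and not necessarily the exact notation used by the authors; $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and the weight matrices $W_*$ and biases $b_*$ are symbols assumed here.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```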

GRU replaces the forget gate, input gate, and output gate of LSTM with an update gate and a reset gate and merges the cell state and the output into a single state, as shown in Figure 3.
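
A common way to write the two GRU gates is given below. As with any textbook form, the notation is assumed here rather than taken from the paper: $z_t$ is the update gate, $r_t$ the reset gate, $\tilde{h}_t$ the candidate state, $\sigma$ the sigmoid function, and $\odot$ element-wise multiplication. Some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z [h_{t-1}, x_t] + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r [h_{t-1}, x_t] + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h [r_t \odot h_{t-1}, x_t] + b_h\right) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(single merged state)}
\end{aligned}
```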

In a classical recurrent neural network, the state is passed in one direction only, from front to back. In some problems, however, the output at the current time is related not only to the previous states but also to the subsequent ones. A BiRNN (Bidirectional Recurrent Neural Network) is needed for this kind of problem. A bidirectional RNN considers both the preceding and the following context of the word being predicted, retaining the important information about the word in the front-to-back and back-to-front directions, and can therefore predict more effectively. A BiRNN consists of two RNNs stacked together, and its output is determined by the states of both. Replacing the RNNs in a BiRNN with LSTM or GRU units yields BiLSTM and BiGRU.
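
The bidirectional idea is independent of the cell type, which the following sketch makes explicit. It assumes scalar hidden states and takes the recurrent cell as a hypothetical `step(h_prev, x)` callable; in BiLSTM or BiGRU that callable would be an LSTM or GRU cell.

```python
def run_rnn(inputs, step, h0=0.0):
    """Apply step(h_prev, x) along the sequence and return all hidden states."""
    states, h = [], h0
    for x in inputs:
        h = step(h, x)
        states.append(h)
    return states


def run_birnn(inputs, fwd_step, bwd_step, h0=0.0):
    """Pair the forward and backward hidden state at every position."""
    forward = run_rnn(inputs, fwd_step, h0)
    # Run the second RNN over the reversed sequence, then realign positions.
    backward = run_rnn(inputs[::-1], bwd_step, h0)[::-1]
    return list(zip(forward, backward))
```

At each position the forward state summarizes everything up to that word and the backward state summarizes everything after it, so the pair carries context from both sides.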

3.2. Principle of Combined Weighted Secondary Learner

The combined weighted secondary learner considers not only the overall learning ability of each neural network but also its performance on the different categories. The overall weight of a single neural network is assigned according to its learning results on the same input: the higher its accuracy, the greater its weight. This method can effectively suppress the influence of isolated values and extreme values in the learning process of the neural network. The weight of a neural network in each category is calculated according to formulas (4) and (5), by taking the logarithm of the error proportion of that network in the category. The smaller the error proportion, the better the performance and the greater the weight. When the error proportion exceeds 0.5, the weight is set to 0. Finally, according to formula (6), the overall weight of the neural network and its category weight are added, and the prediction value of the model for each label is recalculated.

In formulas (4)-(6), $e_{ij}$ denotes the prediction error ratio of the $i$th neural network on the $j$th label, $w_{ij}$ denotes the weight of the $i$th neural network on the $j$th label, and $W_i$ denotes the overall weight of the $i$th neural network.
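
One way to realize a weighting of this kind is sketched below. The exact formulas (4)-(6) are not reproduced here, so the logarithmic form is an assumption: an AdaBoost-style weight of half the log-odds of being correct, which is zero at an error ratio of 0.5 and is clamped to zero beyond it. The function names and the small `eps` guard are likewise illustrative.

```python
import math


def category_weight(error_ratio, eps=1e-12):
    """Log-based category weight; zero once the error ratio reaches 0.5."""
    if error_ratio >= 0.5:
        return 0.0
    e = max(error_ratio, eps)  # avoid log of zero for an error-free category
    return 0.5 * math.log((1.0 - e) / e)


def combine(probs, overall_weights, error_ratios):
    """Fuse per-network class probabilities into one predicted label index.

    probs[i][j]         probability from network i for label j
    overall_weights[i]  overall weight of network i
    error_ratios[i][j]  error ratio of network i on label j
    """
    n_labels = len(probs[0])
    scores = []
    for j in range(n_labels):
        s = 0.0
        for i, p in enumerate(probs):
            # Overall weight and category weight are added, then applied.
            s += (overall_weights[i] + category_weight(error_ratios[i][j])) * p[j]
        scores.append(s)
    best = max(range(n_labels), key=scores.__getitem__)
    return best, scores
```

Under this sketch a network that is weak overall can still dominate the decision for a category on which its error ratio is low, which is the behaviour the combined weighting is meant to provide.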

3.3. Stacking Single-Layer Classification Model

In the stacking model, two neural networks are used as primary learners. In the data preprocessing layer, the text feature vectors are reduced in dimension and converted into embeddings, which are input into the two bidirectional neural networks, BiGRU and BiLSTM. After learning, each network outputs the prediction probability of every classification label from its Softmax layer and passes it to the combined weight classifier. The prediction results of the two primary learners are then integrated, and the stacking model outputs the final classification of the input data, as shown in Figure 4.

4. The Principle of Multitask Cooperative Voting Based on the Decision Tree

After k-fold cross-validation, the stacking model yields k single-layer classification models, so each test sample receives k prediction results. Among the prediction results for the same sample, those at the second and subsequent levels must be subordinate to the result at the preceding level. A multitask collaborative voting decision tree is designed to further improve the accuracy of the stacking model and the reliability of the final prediction.

Decision trees are mainly used for classification and regression tasks. A decision tree trained with an algorithm such as ID3, C4.5, or C5.0 and then appropriately pruned can classify unknown data effectively. Based on the subordination relationship between the different levels of signal faults, this paper introduces the idea of multitask cooperative voting: after k-fold cross-validation, the k models produce multiple prediction results for the same data, these results are put to a vote, and a different voting strategy is adopted at each level. Expressed as a decision tree, these strategies form the multitask cooperative voting decision tree. As shown in Figure 5, the multitask data are reclassified according to the voting decision tree.

This paper designs a three-level voting decision tree, which can be extended to further levels by following the voting principle of the third level. Let D denote the three-layer hierarchy dictionary, which records the membership relationship between the labels of adjacent levels. After k-fold cross-validation, each prediction task has k prediction results at each layer. The decision tree is built from the following operations. A vote is taken over the prediction results of a task, and its outcome takes one of three values: {0: valid, 1: abandoned, 2: all members abandoned}. In a residual vote, the task removes the members selected in the previous vote and votes again on the remaining results. A membership test checks the voting result against the set D; because this relationship is unambiguous, it takes only two values, 0 for membership and 1 for nonmembership. Finally, a random selection draws a subset from the set D as the prediction result. The resulting three-level multitask voting decision tree is shown in Figure 5.
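
The interplay of voting, membership testing, residual voting, and random fallback can be illustrated for two levels. This is a simplified sketch and not the tree of Figure 5: the hierarchy is assumed to be a mapping from each first-level label to the set of its second-level labels, ties are broken by first occurrence, and the function names are hypothetical.

```python
import random
from collections import Counter


def majority(votes):
    """Return the most common vote; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def hierarchical_vote(level1_votes, level2_votes, hierarchy, rng=random):
    """Vote on two levels and keep the level-2 label under the level-1 label.

    hierarchy maps each level-1 label to the set of its level-2 labels.
    Returns (level1_label, level2_label, status), where status is
    0 = valid, 1 = abandoned and revoted, 2 = all members abandoned.
    """
    parent = majority(level1_votes)  # level 1 has no membership constraint
    allowed = hierarchy[parent]
    remaining = list(level2_votes)
    status = 0
    while remaining:
        winner = majority(remaining)
        if winner in allowed:  # membership test against the hierarchy
            return parent, winner, status
        # Residual vote: drop the rejected winner and vote again.
        remaining = [v for v in remaining if v != winner]
        status = 1
    # Every ballot was rejected: pick a random admissible label.
    return parent, rng.choice(sorted(allowed)), 2
```

Whatever the ballots are, the returned second-level label always belongs to the returned first-level label, which is the property the voting stage exists to guarantee.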

5. Experimental Verification and Result Analysis

In this paper, ten years of data (2009 to 2018) for switch equipment, part of the basic signal equipment of high-speed railways, are used for verification; 70% of the samples form the training set, 20% the validation set, and 10% the test set. The data include 7 categories of first-level labels and 64 categories of second-level labels. The F1 value, constructed from precision and recall, is used as the comprehensive evaluation index.

The precision calculation formula is

$$P = \frac{TP}{TP + FP}.$$

The recall calculation formula is

$$R = \frac{TP}{TP + FN}.$$

The F-score calculation formula is

$$F_1 = \frac{2PR}{P + R},$$

where $N$ is the total number of all samples, $n$ is the total number of all categories, $TP$ is the number of samples correctly classified into this category, $TN$ is the number of samples correctly identified as not in this category, $FP$ is the number of samples mistakenly classified into this category, and $FN$ is the number of samples belonging to this category but mistakenly classified into other categories.
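
These per-category quantities can be computed directly from the true and predicted labels. The sketch below is a minimal illustration with a hypothetical function name; it returns 0.0 when a denominator is zero, which is a convention assumed here, and it does not prescribe how the per-category values are aggregated over categories.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Note that the true negatives do not enter precision, recall, or F1, which is why these indexes remain informative when a category is rare.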

5.1. Experimental Analysis of Fault Text Single-Layer Classification of the Signal Switch Machine Based on the Stacking Model
5.1.1. Overall Weight Distribution of BiGRU and BiLSTM

The overall weights of BiGRU and BiLSTM are based on the learning results of each single neural network on the same feature vectors. In this paper, BiGRU and BiLSTM are given the same network parameters: k-fold cross-validation uses k = 5, the number of training epochs is 50, the batch size is 256, the embedding dimension is 100, and the hidden layer dimension is 512. The loss values of BiGRU and BiLSTM during training for the first-level and second-level classifications are shown in Figure 6. Between epochs 30 and 50, the loss is close to its minimum and largely stable. Compared with BiLSTM, BiGRU reaches a smaller loss and better classification performance, and the loss of the second-level classification fluctuates less than that of the first-level classification. As k increases, the loss in each iteration gradually decreases and tends to stabilize. Commonly used values of k are 3, 5, 6, and 10. As shown in Figure 6, the loss at k = 5 is close to that at k = 4, so k = 5 is chosen to avoid both overfitting and underfitting.

After the k = 5 training runs, the results of the runs are averaged to obtain the training results of the two neural networks, as shown in Table 2.

Table 2 shows that, with identical parameters for the two neural networks, every evaluation index of BiGRU is higher than that of BiLSTM, so BiGRU is assigned the higher overall weight.

5.1.2. Weight Calculation of BiGRU and BiLSTM Categories

The number of samples and the weight of each category in the first-level classification of high-speed railway signal equipment faults are shown in Table 3. Because the second-level categories are numerous, only the category weights of the first-level categories are listed, for reasons of space. Table 3 shows that there are few fault records for the paste checker, the external locking and installation device, and the switch machine. When a category has few samples, even a slightly larger number of errors yields a small category weight. Conversely, the supporting equipment and unexplained failure categories have many samples, so the networks learn them well and their category weights are large.

The experiments above give the weight of each neural network in each category. The evaluation indexes of the deep learning ensemble model for first-level and second-level fault classification under different overall weight allocations are shown in Figure 7. The figure shows that the evaluation indexes of the ensemble model are highest when the weight of BiGRU is 0.7 and the weight of BiLSTM is 0.3.
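
A search over overall weight allocations of this kind can be sketched as a simple sweep. The code below is illustrative only: it assumes the two overall weights sum to 1, uses accuracy as a stand-in for the paper's evaluation indexes, omits the category weights, and the function name is hypothetical.

```python
def sweep_overall_weights(probs_a, probs_b, y_true, steps=10):
    """Try w_a = 0, 1/steps, ..., 1 with w_b = 1 - w_a.

    probs_a and probs_b hold per-sample class probabilities from two
    networks; y_true holds the true label indices.
    Returns the first weight w_a reaching the best accuracy, and that accuracy.
    """
    best_w, best_acc = 0.0, -1.0
    for s in range(steps + 1):
        w_a = s / steps
        correct = 0
        for pa, pb, y in zip(probs_a, probs_b, y_true):
            # Weighted fusion of the two probability vectors.
            fused = [w_a * a + (1.0 - w_a) * b for a, b in zip(pa, pb)]
            pred = max(range(len(fused)), key=fused.__getitem__)
            correct += pred == y
        acc = correct / len(y_true)
        if acc > best_acc:
            best_w, best_acc = w_a, acc
    return best_w, best_acc
```

In practice the sweep would be run on the validation set so that the test set stays untouched when the allocation is chosen.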

5.1.3. Stacking Model Classification

Recomputing the outputs of the two networks with the combined weighting gives the joint classification prediction. The final classification results are shown in Table 4. Every classification evaluation index is improved, which shows that the stacking single-layer classification model effectively improves the fault text classification indexes of high-speed railway signal equipment.

5.2. Experimental Analysis of the Voting Mechanism Based on the Decision Tree

After cross-validation, the stacking model has 5 classification models at each level, so 5 classification results are generated for the same test dataset. The voting results at each level, obtained with the multitask collaborative voting decision tree designed in this paper, are shown in Table 5. The first level has no affiliation relationship, so all of its votes are valid. At the second level, valid votes form the majority, but the number of invalid votes is also large, which is the main reason for the lower accuracy of the second-level classification.

The classification indexes recalculated from the voting results are shown in Table 6. Table 6 shows that the indexes after multitask collaborative voting are higher than those of the stacking model alone, and the improvement is most pronounced for the second-level classification.

To evaluate the neural network model more comprehensively across the categories, the fault samples of signal switch equipment, represented by TF-IDF feature vectors, are also input into RF (Random Forest) and GBDT (Gradient Boosting Decision Tree) for comparison. RF is a representative bagging algorithm, and GBDT is a representative boosting algorithm. In this experiment, 30% of the samples are used for evaluation, and the number of base classifiers is set to 50. The final experimental results are shown in Table 7. The table shows that the evaluation indexes of the deep learning ensemble model designed in this paper are clearly higher than those of the mature ensemble learning algorithms RF and GBDT.

5.3. Implementation Summary

Based on the experimental analysis above, the classification index of each model is calculated as the average of the corresponding evaluation indexes over all levels. The models compared are the BiGRU model, the BiLSTM model, the stacking model, and BiGRU in parallel with BiLSTM. The classification performance of each model is shown in Figure 8.

Figure 8 shows that the stacking model designed in this paper improves the classification indexes at each level, that the multitask cooperative voting mechanism resolves the subordination of the classification results, and that the overall classification performance of the stacking model improves as a result. The experiments show that the proposed stacking model, based on text mining technology, has advantages in solving the multilevel classification problem of high-speed railway signal equipment.

6. Conclusion

The fault text data of high-speed railway equipment are an important source for mining the safety status and safety laws of high-speed railway operation, and multilevel classification of these data based on text mining technology is necessary. In this paper, a multilevel classification model is designed for the fault text data of high-speed railway signal equipment; it resolves the membership relationship between the classification levels and improves the classification evaluation indexes. The single-layer classification model, based on the stacking idea and trained with k-fold cross-validation, ensures the algorithmic difference and diversity of the primary learners, reduces the risk of overfitting, and improves the classification indexes compared with a single neural network classifier, while the multitask cooperative voting mechanism ensures the membership relationship of the classification results. The stacking single-level classification model and the multilevel classification model in this paper have reference value for railway text classification.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (no. 2020YFF0304100), Key Project of China Railway Research Institute Group Limited (2019YJ115), Scientific Research Project of China Academy of Railway Sciences Corporation Limited (2052DZ1201), and National Science Foundation for Young Scientists of China (51707128).