Abstract

In railway operation, unsafe events such as faults occur, and recording and reporting them generates a large number of unsafe event records. These records are written in unstructured natural language, with inconsistent structure and complex sources that span multiple railway specialties; the data are therefore multisource, heterogeneous, and unstructured. In practice, processing efficiency is extremely low, so the records are poorly utilized in safety management. Based on unsafe event data, this paper uses big data processing technology to perform word segmentation, obtain the feature vectors of unsafe fault event data, and conduct association rule mining and association degree analysis. An unsafe event data analysis model is also constructed by combining regular expressions with pattern matching. The paper establishes a matching model between the external environment risk factors of high-speed railway derailment and the occurrence of unsafe events. The model can be used to analyze and mine the link between external environment risk factors and unsafe events and to extract automatically characteristic information such as risk possibility and consequence severity; hence, it has the potential to identify, with enhanced accuracy, high-risk factors that may lead to high-speed railway derailment. The study introduces a viable approach to making full use of unsafe event data, forecasting risk trends, and discovering the patterns of high-speed railway derailment. It could also support the accurate quantification of the high-speed railway safety situation, refine the risk index, enable in-depth analysis in combination with the model, and effectively support the digital and intelligent development of high-speed railway operation safety.

1. Introduction

A high-speed railway derailment accident may result in severe financial and human losses. Such accidents have significant disaster characteristics and strong nonlinear characteristics, and they pose a challenge to high-speed railway safety management. Technological advances have helped to mitigate the internal factors behind railway derailment, but the external factors remain an underexplored area of research.

Advances in computer technology have benefited large-scale numerical calculation by enhancing operation speed, storage capacity, and operation scale. Big data analysis, furthermore, has been building on progress in accuracy, quality, and reliability and has become a major area of academic interest in the context of high-speed railway safety.

To minimize risks that may lead to unsafe events, the railway corporation has built a reporting system to keep records, which has generated a large amount of data. Figure 1 exemplifies a typical record of an unsafe event.

However, the records on unsafe events, which are described in natural language as shown above, could be incorporated into a digital database only if a consistent description standard were available; without one, such a database cannot be established. At the same time, the railway infrastructure, rolling stock, and other equipment are diverse and complicated, and the data sources are complex, involving many specialties, such as track maintenance, power supply equipment maintenance, signal and communication equipment maintenance, EMU maintenance, passenger transportation, and external environment. The database covers a wide range of events, including fixed equipment faults, mobile equipment faults, perimeter intrusion events, and others. A sizable amount of data has been gathered. In practice, the information needs to be retrieved, read, and updated manually, which costs a huge amount of human and material resources. The processing efficiency is considerably low, resulting in a low utilization rate of railway unsafe event data.

To meet the real-time and accuracy requirements of railway safety management and risk analysis, it is urgent to carry out association analysis of railway risk factors based on unstructured data, taking into account the different data structures, different sources, and scattered, independent records of railway unsafe events. Through this research, heterogeneous data sources can be integrated, data heterogeneity can be eliminated, and an accurate association between data and railway risk can be achieved. This paper therefore takes the external environment risk of high-speed railway derailment as its research focus and carries out a correlation analysis of the external environment risk factors of high-speed railway derailment based on unstructured data. The study seeks to identify the risk factors of high-speed railway derailment and to select appropriate models for processing unstructured data, including data collection, data cleaning, data dictionary construction, data extraction, data storage, and other steps. In this way, the data are associated with the possibility of risk occurrence and the severity of consequences. The paper then moves on to analyze the high-speed railway derailment risk factors and the effective extraction of unstructured data.

In recent years, with the breakthrough of big data and artificial intelligence technology, a number of investigations have been carried out regarding the railway safety index, unstructured data analysis, and multisource heterogeneous railway safety data identification and extraction. In terms of the railway safety index, since 2015, the International Union of Railways (UIC) has been building the global safety index (GSI) [1]. Based on safety data and accident information, it evaluates the safety level of railways in Europe and in some Asian and Middle Eastern countries and regions and analyzes the statistical data of safety accidents, the impact of accidents, and the safety level and development trend. Zhao et al. [2] established a railway accident index by measuring the occurrence frequency and consequence severity of railway accidents, which is used to evaluate the overall situation of China’s railway safety. In the aspect of unstructured data analysis and mining, Zhang [3] built an unstructured data analysis platform based on report documents with Chinese word segmentation technology, an unstructured data extraction method, pattern matching, and other methods. Zhu et al. [4] put forward a new HGD tree index technology and a new partition method, which uses the probability density function to partition data and improve the speed of data access, and gave a solution based on the optimization operation method. Wang et al. [5, 6] analyzed and studied the safety data of dangerous goods transportation based on the data mining method. In the aspect of multisource heterogeneous railway safety data extraction and data analysis, Wang et al. [7] conducted quantitative analysis on railway derailment and the change of accident rate based on American railway safety data. Lin et al. [8] analyzed the data of American trunk line passenger trains and quantitatively analyzed the causes of passenger train accidents. Liu et al. [9] analyzed the causes of major train derailment and their effect on accident rates. Turla et al. [10] analyzed the freight train collision risk in the United States. Li [11] recognized and extracted the fault features of high-speed railway equipment by establishing the +BiLSTM and +CRF method for character representation and the +Transformer method for word segmentation representation. Zhou and Li [12] established a method of fault data feature recognition and extraction for railway signal equipment based on MCNN.

To sum up, previous studies on high-speed railway risks mostly employ the expert evaluation method, which is arguably based on subjective deliberation and may undermine the research validity. Besides, past investigations mainly focus on feature recognition and extraction for safety data of a specific structure and on the processing of unstructured data. Research on the correlation analysis between safety data and risk is still lacking. This study proposes a data-driven risk judgment method that facilitates an accurate association between data and railway risk, and the proposed model can contribute to improving the feasibility and accuracy of risk judgment.

2. Analysis of External Environment Risk Associated Factors

To explore the derailment mechanism of high-speed railway, this paper establishes a dynamic derailment-related element model of high-speed railway. As shown in Figure 2, the derailment of high-speed railway is mainly related to EMU subsystem and line subsystem, and the coupling relationship between wheel and rail has an important impact on the derailment of high-speed railway. In addition, the external environmental factors such as natural geology and perimeter intrusion factors also play an important role in the derailment of high-speed railway. The external environmental safety for high-speed railway derailment has been widely investigated and is the focus of the current study [13].

Based on the above analysis, the paper puts forward the framework model of risk-associated external environmental factors for high-speed railway derailment, which fall into two categories: natural and geological factors (natural hazard factors and geological hazard factors) and perimeter intrusion factors (perimeter intrusion of animals, perimeter intrusion of objects and plants, and perimeter intrusion of people), as shown in Figure 3. The classification is detailed in Table 1 [14–17].

3. Unstructured Data Mining Method

The raw data were obtained from the database of a railway company that has kept records of as many as 15,000 past unsafe events.

3.1. General Analysis

Based on an analysis of the unsafe event data, the study combines regular expressions with pattern matching and establishes a matching model between the external environmental factors of high-speed railway derailment risk and the associated unsafe events. The model analyzes and mines the relationship between the external environmental factors of high-speed railway derailment and unsafe events and automatically, quickly, and accurately extracts key characteristic information, such as the possibility of risk occurrence and the severity of the consequences, so as to transform unstructured data into structured information. The main data analysis and mining process is shown in Figure 4. The process includes (1) collecting unstructured railway safety data, (2) splitting and matching keywords, and (3) association rule mining and association degree analysis. Through this process, an accurate association between data and railway risk can be obtained [18–20].
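As an illustration of how regular expressions and pattern matching can turn a free-text record into structured fields, the following is a minimal sketch rather than the model used in this paper. The record text, the factor keywords, the phrasing of the interruption time, and the names `FACTOR_PATTERNS`, `DELAY_PATTERN`, and `extract_record` are all assumptions made for the example; the actual records are written in Chinese and follow no fixed template.

```python
import re

# Hypothetical factor keywords; the paper's railway safety dictionary is not reproduced here.
FACTOR_PATTERNS = {
    "strong wind": re.compile(r"\b(strong wind|gale)\b", re.IGNORECASE),
    "object intrusion": re.compile(r"\b(plastic sheet|kite|foreign object)\b", re.IGNORECASE),
    "animal intrusion": re.compile(r"\b(bird|cattle|dog)s?\b", re.IGNORECASE),
}

# Assumed phrasing of the interruption time, e.g. "delayed 27 min".
DELAY_PATTERN = re.compile(r"delay(?:ed)?\s+(?:of\s+)?(\d+)\s*min", re.IGNORECASE)


def extract_record(text):
    """Return the matched risk factors and the interruption time (minutes) of one record."""
    factors = [name for name, pattern in FACTOR_PATTERNS.items() if pattern.search(text)]
    match = DELAY_PATTERN.search(text)
    interruption = int(match.group(1)) if match else None
    return {"factors": factors, "interruption_min": interruption}


# Hypothetical record, paraphrased in English for readability.
record = "Strong wind blew a plastic sheet onto the catenary; train delayed 27 min."
result = extract_record(record)
print(result)
```

Keeping one compiled pattern per factor makes it straightforward to extend the keyword set as the dictionary grows, at the cost of missing expressions that the patterns do not anticipate.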

3.2. Keyword Extraction and Matching

Extracting keywords from text is relatively simple for English and similar languages, where spaces between words serve as segmentation. In Chinese, however, no such delimiters are available, so coherent sentences must be broken into keywords. Expressions in Chinese may vary widely, leading to potential ambiguity in word segmentation. For keyword extraction and matching, a railway safety dictionary is designed, and algorithms and models such as the trie tree, DAG (directed acyclic graph), Viterbi, HMM (hidden Markov model), and keyword matching are used in combination. The main process of keyword extraction and matching is shown in Figure 5 [21–23].
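The dictionary-based part of this step can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: the dictionary entries and their frequencies are invented, a plain Python `dict` stands in for the trie, and the HMM and Viterbi handling of out-of-dictionary words is omitted. The sketch builds a DAG of all dictionary words found in the sentence and then picks the segmentation with the highest total log-probability by dynamic programming.

```python
import math

# Hypothetical mini-dictionary with made-up frequencies.
DICTIONARY = {
    "大风": 50,    # strong wind
    "导致": 80,    # to cause
    "异物": 40,    # foreign object
    "侵入": 30,    # to intrude
    "接触": 20,    # contact
    "接触网": 60,  # catenary
    "网": 10,      # net
}
TOTAL = sum(DICTIONARY.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in DICTIONARY]
        dag[i] = ends or [i]  # fall back to a single character
    return dag


def segment(sentence):
    """Choose the maximum log-probability path through the DAG."""
    dag = build_dag(sentence)
    n = len(sentence)
    best = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        # Unknown single characters are given a frequency of 1.
        best[i] = max(
            (math.log(DICTIONARY.get(sentence[i:j + 1], 1) / TOTAL) + best[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


# Hypothetical set of risk keywords to match after segmentation.
RISK_KEYWORDS = {"大风", "异物", "侵入", "接触网"}

words = segment("大风导致异物侵入接触网")
keywords = [w for w in words if w in RISK_KEYWORDS]
print(words, keywords)
```

In the example sentence, the ambiguity between "接触" followed by "网" and the single word "接触网" is resolved in favor of the latter because the single word has the higher path probability under the assumed frequencies.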

3.3. Association Analysis-Based Apriori Algorithm

This study uses association rules to explore the correlations between data generated by different modes and methods, so as to build rules that can inform decision-making.

The data mining of association rules mainly includes two processes. First, the frequent item sets, whose support is not less than the minimum support, are identified among all item sets. Second, strong association rules that satisfy the minimum confidence are mined from the frequent item sets obtained. The overall performance of association rule mining is determined by the first process [24–28].

Finding frequent item sets is not easy because the data explosion involved in the calculation process may lead to unacceptable computational complexity. However, once the frequent item sets are obtained, the association rules whose confidence is not less than the minimum confidence can be derived from them. The association rule mining algorithm used in this paper is the Apriori algorithm. The data in Table 2 serve as an example to illustrate the implementation process of the Apriori algorithm [29–32].

The Apriori algorithm can be used to mine association rules. The process of mining frequent item sets is shown in Figure 6, where L_k denotes the frequent item sets of length k, C_k denotes the candidate item sets of length k, and Support_count(k) denotes the support count of the k-item sets.

It is concluded that all item sets in L1, L2, ..., Ln are frequent item sets, and then the confidence of the rules derived from each frequent item set is calculated. When the support threshold is set to 40% and the confidence threshold is set to 50%, the results shown in Table 3 can be obtained.
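The two processes described above can be sketched in a few lines. The transactions below are hypothetical and do not reproduce Table 2; only the thresholds (minimum support 40%, minimum confidence 50%) are taken from the text. The function names `support`, `apriori`, and `rules` are chosen for this illustration.

```python
from itertools import combinations

# Hypothetical transactions: the set of factor codes tagged on each record.
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item set."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return a dict mapping each frequent item set to its support."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        k += 1
        # Join step: unions of frequent (k-1)-item sets that contain exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent, then check the support.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c, transactions) >= min_support]
    return frequent


def rules(frequent, min_confidence):
    """Derive rules lhs -> rhs with confidence = support(itemset) / support(lhs)."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = sup / frequent[lhs]
                if confidence >= min_confidence:
                    out.append((tuple(sorted(lhs)), tuple(sorted(itemset - lhs)), confidence))
    return out


frequent = apriori(TRANSACTIONS, min_support=0.4)
strong_rules = rules(frequent, min_confidence=0.5)
for lhs, rhs, confidence in strong_rules:
    print(lhs, "->", rhs, round(confidence, 2))
```

The prune step relies on the Apriori property that every subset of a frequent item set is itself frequent, which is what keeps the number of candidates manageable as k grows.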

3.4. Grey Relation Analysis

Grey relation analysis (GRA) is a multifactor statistical analysis method. The basic method of calculating the correlation degree is to initialize the original data sequence, then calculate the correlation coefficient, get the correlation degree and the correlation matrix through the combination of the correlation coefficient, and finally sort them according to the correlation degree calculation results of each correlation factor sequence [33].

The calculation method is as follows.

Let the characteristic behavior sequence of the system be

Because the units or initial values of each data sequence are different, in order to make them comparable, it is necessary to implement dimensionless processing on the original data so that the data of different dimensions (or magnitudes) could be compared, and the initial value method could be used for calculation:

Correlation coefficient refers to the degree of correlation based on the geometric shape and development trend of each factor sequence. The expression is as follows:

Here, ρ is called the resolution coefficient. The smaller its value is, the greater the resolution is.
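For reference, the steps described above correspond to the standard formulation of grey relation analysis, sketched below. The notation is an assumption made for this illustration: X_0 is the characteristic behavior (reference) sequence, X_i is the i-th factor sequence, n is the sequence length, m is the number of factors, and ρ is the resolution coefficient, commonly taken in (0, 1).

```latex
% Characteristic behavior sequence and factor sequences (assumed notation)
X_0 = \bigl(x_0(1), x_0(2), \dots, x_0(n)\bigr), \qquad
X_i = \bigl(x_i(1), x_i(2), \dots, x_i(n)\bigr), \quad i = 1, \dots, m

% Initial value method (dimensionless processing)
x_i'(k) = \frac{x_i(k)}{x_i(1)}, \qquad i = 0, 1, \dots, m, \quad k = 1, \dots, n

% Correlation coefficient with resolution coefficient \rho
\xi_i(k) = \frac{\min_i \min_k \Delta_i(k) + \rho \max_i \max_k \Delta_i(k)}
                {\Delta_i(k) + \rho \max_i \max_k \Delta_i(k)},
\qquad \Delta_i(k) = \bigl| x_0'(k) - x_i'(k) \bigr|
```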

The characteristic of association sequence is that it has a huge amount of data. When the information is processed in a centralized way, it is necessary to summarize the association coefficients at different positions of different times into a specific value and calculate their average value. The average value obtained is the correlation degree. The expression is as follows:
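The whole calculation, from dimensionless processing to the correlation degree, can be sketched as follows. This is a minimal illustration of the standard method under stated assumptions: the correlation degree of each factor is taken as the arithmetic mean of its correlation coefficients, the resolution coefficient defaults to 0.5, and the input sequences and the function name `grey_relational_degrees` are invented for the example.

```python
def grey_relational_degrees(reference, factors, rho=0.5):
    """Return the correlation degree of each factor sequence with the reference sequence."""

    def normalize(seq):
        # Initial value method: divide every element by the first element.
        return [value / seq[0] for value in seq]

    ref = normalize(reference)
    # Absolute differences between the normalized reference and each normalized factor.
    diffs = [[abs(r - v) for r, v in zip(ref, normalize(f))] for f in factors]
    d_min = min(min(row) for row in diffs)
    d_max = max(max(row) for row in diffs)
    if d_max == 0:
        # All sequences coincide after normalization.
        return [1.0 for _ in factors]

    degrees = []
    for row in diffs:
        coefficients = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        # Correlation degree: mean of the correlation coefficients.
        degrees.append(sum(coefficients) / len(coefficients))
    return degrees


# Hypothetical sequences: the first factor is proportional to the reference.
reference = [10, 12, 15, 20]
factors = [[5, 6, 7.5, 10], [4, 6, 4, 8]]
degrees = grey_relational_degrees(reference, factors)
ranking = sorted(range(len(factors)), key=lambda i: degrees[i], reverse=True)
print(degrees, ranking)
```

Sorting the factors by their correlation degree gives the ordering mentioned at the start of this subsection; a factor that moves in proportion to the reference sequence receives a degree of 1.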

4. Risk Analysis Based on Unstructured Data

This paper analyzes the external environment risk associated with high-speed railway derailment in terms of the occurrence possibility and the consequence severity, and it measures the risk by mining both quantities from the unsafe events related to the external environment risk factors. The occurrence possibility is determined by mining the unstructured safety data, accurately associating each unsafe event with the corresponding risk factor, and accumulating the occurrence frequency of the unsafe events associated with each external environment risk factor of high-speed railway derailment. Because the basic data in the study are mainly unsafe event data, the main consequence is the interruption time; the consequence severity is therefore determined mainly by mining the interruption time caused by the unsafe events associated with each risk factor [34–37].
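Once each unsafe event has been associated with a risk factor, the two measures can be accumulated per factor. The sketch below is illustrative only: the event list is invented, the factor codes are used merely as labels, and summarizing severity as the mean interruption time per event is an assumption, since the text does not state how interruption times are aggregated.

```python
from collections import defaultdict

# Hypothetical structured records: (factor code, interruption time in minutes).
EVENTS = [
    ("Y33", 45), ("Y33", 30), ("Y21", 20),
    ("Z13", 60), ("Y33", 15), ("Y21", 10),
]


def summarize(events):
    """Accumulate occurrence frequency and interruption time for each risk factor."""
    count = defaultdict(int)
    minutes = defaultdict(int)
    for factor, interruption in events:
        count[factor] += 1
        minutes[factor] += interruption
    total = len(events)
    return {
        factor: {
            "frequency": count[factor] / total,                        # occurrence possibility
            "mean_interruption_min": minutes[factor] / count[factor],  # assumed severity measure
        }
        for factor in count
    }


summary = summarize(EVENTS)
for factor, stats in summary.items():
    print(factor, stats)
```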

A small number of factors with a low probability of occurrence may not be associated with any events in the data. In this case, evaluation methods based on expert experience, such as the analytic hierarchy process, could be used as a supplement. Based on the above method, after analysis, mining, and unification of dimensions, the probability of the external environment factors associated with high-speed railway derailment is shown in Figure 7, the consequence severity of these factors is shown in Figure 8, the risk distribution scatter diagram of natural and geological factors is shown in Figure 9, and the risk distribution scatter diagram of perimeter intrusion factors is shown in Figure 10.

The distribution of the comprehensive risk index of the external environment factors related to high-speed railway derailment is shown in Figure 11. It can be seen that Z13, Y21, and Y33 are high-risk factors, especially Y33. Risk management procedures should be implemented, and targeted measures should be taken to control them. When implementing control measures, attention should be paid to their actual effect, and feedback should be provided so that the measures are fully implemented [38–41].

5. Conclusion

Utilizing data on railway fault unsafe events, this paper establishes a matching model that builds correlations between unsafe events and external environment factors in the context of high-speed railway derailment. Operating in an automatic fashion, the model may be employed to analyze and mine the relationship between external environment factors of high-speed railway derailment and unsafe events. The model may also be used, with an enhanced accuracy for identifying high-risk elements, to extract the key feature information such as risk possibility and consequence severity.

The current investigation contributes to the field of research by introducing a statistical method for analyzing unsafe event data. It seeks to identify high-risk elements of high-speed railway derailment, refine the external environment risk index for high-speed railway derailment, and analyze data in combination with the proposed model. As such, the study achieves a dynamic display of the results arising from external environment risk analysis in the context of high-speed railway derailment. The study is significant in that it seeks to rationalize the methods for analyzing external environment risk and to better visualize the safety patterns of high-speed railway derailment. Therefore, the study helps to advance the operation of high-speed railway and its safety arrangements towards a more digitalized and smarter system.

Previous studies on high-speed railway risks mostly employ the expert evaluation method, which is arguably based on subjective deliberation and may undermine the research validity. The current project proposes a data-driven method of risk judgment, which may help to advance the feasibility and accuracy of the analysis. This study contributes to improving the safety level of railway operation by putting forward a method for integrating heterogeneous data sources, minimizing data heterogeneity, and thus building, with enhanced accuracy, the association between the data and railway risks.

According to the needs of high-speed railway operation safety management, with the continuous accumulation of railway unsafe event data, the external environment risk model related to high-speed railway derailment could be continuously modified and improved, and the correlation matching between risk and unsafe event could be more accurate, which could ensure the continuous improvement of high-speed railway operation safety.

Data Availability

Some or all data, models, or code generated or used during the study are proprietary or confidential in nature and may only be provided with restrictions.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the Fundamental Research Funds for the Central Universities (2021QY008) and Science & Technology Development Plan of China Railway Beijing Bureau Group Co., Ltd. (2020AQ01).