Abstract
The paper discusses the problem of probability distribution category identification of train delay data by a genetic programming algorithm. This train delay frequency function and the probability distribution simply derived from it are significant to train traffic modelling and management. The genetic programming algorithm was used as an uninformed tool to prevent the influence of a priori information, which should be biased. The real traffic data were aggregated into predefined bins and then the frequencies of the individual delays were computed. The genetic programming algorithm was used in the next step as a symbolic regression tool to discover their frequency function in the form of an algebraic expression. The results concluded that although data has no known distribution, their distributions are similar to exponential ones.
1. Introduction
Train delays represent inconvenience of the rail transport. They are unwelcome not only in passenger rail transport but also in freight rail transport where delays disrupt logistic chains. The most sensitive is the combination of passenger and freight rail operations for a large difference in the passenger and freight train dynamics which results in complicated train operation scheduling and management.
1.1. Relevant Research
Rail transport delay modelling is significant for identification of delay causes which is essential for this delay reduction or even elimination. The delay modelling is also necessary for rail traffic modelling, control, scheduling, and future rail network state prediction. It is also necessary to mention that delays are typically computed for trains, not for passengers [1]. It is significant if we reason that the preservation of the train change possibility spreads a delay to other trains and can increase the magnitude of this delay, e.g., for passengers transferring between lines. A similar situation occurs in freight train transport if the wagons are switched between different trains.
The train delay models gain different forms. They can be based on a precise model of the rail network, train dynamics, and train timetable [2] on one side and a simplified probabilistic model on the other. These simplified probabilistic models are frequently applied for their easier evaluation and maintenance. For their application, it is required to identify probability distribution functions for all variables of the model. These distribution functions or frequency functions can be obtained from the model or from the data measured in real rail traffic. Analysis of the real traffic data represents a task of large or even Big data size problem. There are many variables influencing railway related processes and a huge number of records. To solve the traffic delay problem by probabilistic techniques, it is necessary to compute train delay distribution or frequency functions applicable in train delay management and train delay models.
It is impossible to reason about the development of the “absolutely” precise model for lack of related data and their uncertainty caused, e.g., by measurement errors. For herein presented research, only the data about delayed trains were available, but not the data about nondelayed trains, their load, weather, large sport or cultural activities, and so on, and especially no data about operational dependencies like prescribed waiting rules between passenger trains in each station. Only the common preferences between different train categories were defined.
The main reason of this research was to find delay frequency function (FF) for delayed trains of a given category. Such a function can be simply transformed into relative frequency and probability density functions. The infrastructure owner was using exponential distribution in its models, but there was suspicion of imprecision of this model and the possibility that different train categories should be described by different distribution models due to different dynamics and operation rules. The genetic programming algorithm was chosen to find FFs describing observed data as objective tools with little a priori information.
The use of genetic programming algorithm (GPA) was decided on the basis of the observation in [3] that no standard probability distribution category FF describes data perfectly and the closest is an exponential distribution.
1.2. Genetic Programming Algorithm Evolution
Genetic programming (GP) was developed by J. Koza [4] on the basis of genetic algorithm (GA). Koza continued research of John Holland on “Adaptation in Natural and Artificial Systems” with the idea that programming can be transformed into optimization of a randomly generated population of individuals representing instructions and parameters. The state space of GPA is discrete in view of the change of operators (called functions) like addition, subtraction, or substitution during mutation and crossover operations. Latter many alternatives to GPA were developed like grammar evolution (GE), gene expression programming (GEP), or Cartesian GP (CGP). GE tries to reach better efficiency by application of a userspecified grammar (usually a grammar in Backus–Naur form) [5]. Gene expression programming (GEP) represents a different approach suitable to solving the symbolic regression (SR) problems [6], applying not only replication and mutation operations but also the transposition and recombination ones. A graphbased Cartesian GP (CGP) [7] is a GP algorithm where the candidate solutions are represented as a string of integers of a fixed length that is mapped to a directed graph. CGP can efficiently represent such structures as mathematical equations, computer programs, or neural networks. Another modification of genetic programming is represented by a hybrid single node genetic programming (SNGP) [8] [9]. SNGP is a rather new graphbased GP system that evolves a population of individuals, each consisting of a single program node. Similarly to CGP, the evolution is provided by a hillclimbing algorithm using a single reversible mutation operator. The SNGP represents a very promising development of GPs. Also, the original GPAs were widely studied since its introduction, for example, in ([10–12]), and these works have allowed its expansion. A symbolic regression takes specific position between the GPA applications. The goal of SR is to find a function modelling training data. As a technique of uninformed computer learning, the GP has large application potential.
SR is also easily verifiable, and thus it serves as a test bed frequently. Unfortunately, till now, symbolic regression has had some weaknesses, especially problems with large training data sets, e.g., in Big data applications and low efficiency if human comparative quality results are required. Possible ways of their elimination will be discussed further in this study. According to the paper [13], genetic programming (GP) is a particularly interesting machine learning (ML) algorithm when dealing with symbolic regression.
The GPAs directly evolve mathematical expressions [14], typically represented by a tree structure. While GP is understood to be capable of generating whitebox models, i.e., humaninterpretable expressions in simple form, the evolved models are often overcomplicated and far from being interpretable [15]. These problems tend to improve, especially in the areas of accurate symbolic regression, hybrid evolutionary algorithms, and some other approaches. Depending on the purpose, any rapidly developed or precise model is searched. While linear regression means findings of the coefficients of the chosen function to best fit the data, symbolic regression searches for a suitable function. If the linear regression result is not good, it is possible to choose a different function. The quality of the regression result depends on the selection of this function. Symbolic regression is capable of going far. Its goal is to find such a function to fit the data, not only its parameters. Since the first application of genetic programming [4], many studies and approaches to solve this problem have been published.
Except the abovementioned problems, many authors published in their works the need to use large populations of thousands or more individuals. This is caused by the need to optimize many constants and the sensitivity of individual selection on them. If the modelled problem is complicated and described by training data containing tens or even hundreds of variables (e.g., Big data problems), it is necessary to reduce this amount to prevent loss of GPA efficiency [16] to keep reasonable processing time. In these cases, the efficiency of machine learning algorithms including GPA significantly decreases and it is better to use a separate algorithm to identify significant features, e.g., by the feature selection algorithm [17].
1.3. Hybrid Evolutionary Algorithms
Unpublished work [11] prefigured small but significant research in an area of hybrid evolutionary systems combining GPA with other techniques with the intent to eliminate some of their weaknesses. This work is continued in the paper [18]. Unfortunately, the main research areas of both authors of this work moved out of evolutionary programming and thus these papers were not extended by them. Luckily, they found a continuation in [19]. Raidl assigned multiplication parameters to each node of the parse tree and optimized their magnitudes by the method of least squares. The nonlinear optimization method was used in [18, 20]. The problem of these early works was in large computational requirements which have allow only a few steps of the timeconsuming nonlinear optimization to be applied to each new solution to keep the total running times acceptable. Particle swarm optimization (PSO) was used for arbitrary constant magnitudes optimization in [21]. PSO is a population using a computational optimization method developed by Eberhart and Kennedy in 1995 [22], inspired by the social behaviour of bird flocking. The system is initialized with a population of random solutions and searches space by the coordinated movement of a particle swarm. Accurate SR tries to eliminate wellknown problems of original early GPA applications. They were less efficient, overcomplicated, and imprecise. The accurate SR as it was described in [23] applies structural risk minimization based on the Vapnik–Chervonenkis dimension to estimate the difference between the generalization and the empirical error. The problem of these algorithms is that they move GPAs close to overfitting limitation known from the other machine learning algorithms. These works prove that in the hybrid evolutionary algorithm it is possible to apply in the hybrid algorithm any optimization tool in combination with GPA. Thus, there evolutionary strategy of genetic algorithms can be placed too, as in this work.
2. Materials and Methods
2.1. Analysed Train Delay Data Set
To analyse train delays on the Czech national railway network, the data describing train delays within three months of 2011 was used. Originally, the data were stored in Oracle data warehouse. These data were analysed in [3] with recommendation to use exponential distribution in models.
There were problems with trains free of delays because such trains were not presented in the filtered data provided for the research. Thus, the number of observations of the train passing without delay was smaller than in reality, and they were not used to identify delay FF.
Regardless of the abovedescribed limitations, the amount of processed data was not totally satisfying the attributes of Big data, but it was difficult for processing. Big data are characterised by so called 3 or 4 v’s. They are mean volume, variety, velocity, and veracity. The meaning of these terms is described as follows:
Volume—The quantity of generated and stored data. The Big data applications work with large amount of data that are hard to process by standard technology.
Variety—Variety means the type and nature of the data. The data processed by the Big data systems are semistructured or unstructured. It makes it difficult to process them by standard strongly typed relational database management system (RDBMS).
Velocity—Many Big data applications require high speed reactions and fast processing because they operate in real time. There is also the problem of the continuous incoming of a new data.
Veracity—Referring to data volume and their value, the data can vary too fast for accurate analysis.
Because only a small, timelimited sample of data (3 months) was analysed, the volume of analysed data was relatively small from Big data viewpoint, but its size was still many gigabytes. The data were also well structured. The velocity in this offline analysis is not significant. Nevertheless, with respect to the future possibility of full data collection analysis, the Big data technology was used. There it is possible to imagine online analytics (with high requirements on velocity) and processing of full data collection representing all train movements in the past 20 years.
For the herein presented analysis, the basic programming model of Big data processing called MapReduce was used (selection of data related to the analysed category and aggregation on the basis of delay magnitude into the relevant bin).
There it is justifiable to expect distinct results for different train categories (Table 1) due to different operational rules like maximal speed limit, typical weight and length of the train limiting maximal acceleration, waiting for connections (and concluding delay propagation) in the case of passenger trains, and priority in access to railway line. Especially, the last two influences strongly determine delays. Figure 1 demonstrates an example of input data for passenger train categories.
2.2. Data Preprocessing
The original data set was transformed from the Oracle database into commaseparated values (CSVs) dale file and imported into the Apache Spark application. This application was written in Python using Jupyter Notebook and served to select data related to the studied train type and reduce data volume by their transformation into counts of train passes across measurement points with a specified delay. The following step was application of symbolic regression to model FF describing the data.
2.3. Used GPAES Evolutionary Algorithm
In this research, a modification of the hybrid GPAES algorithm for algebraic dependency modelling is used (standard GPAES was designed for symbolic regression of ordinary, linear, or nonlinear differential equations—e.g., 10, 24, or 25). The algorithm evolves equations on the basis of the minimization of residual errors—fitness. This algorithm combines the standard GP algorithm for solution structure development and an evolutionary strategy (ES) algorithm for optimization of parameters of each individual in a GPA population. Such a design of the hybrid evolutionary algorithm prevents situations when a wellstructured solution (e.g., wellcomposed equation) but with wrongly estimated constants (coefficients) is eliminated from the population and replaced by an individual of worse structure but better fitted constants, which has worse evolutionary potential. A comprehensive introduction to GPA can be found in [26]. A complex review of the hybrid evolutionary algorithms is in [27], and an explanation of symmetric onepoint crossover is described in [28]. Two different variants of crossover increase the efficiency of the Algorithm 1.

The size of the GPA population and the size of the ES populations related to each individual related to the GPA population are the most significant parameters of the GPAES algorithm. The influence of the population size was studied in many publications, for example the study conducted in [2]. Small GPA populations forces evolutionary pressure, and in the case of specific conditions it might speed up evolution. On one hand, in small populations there is an increased risk of getting stuck in local optima instead of global ones. Very large populations have problems with low speed and low efficiency of evolution frequently. On the other hand, they bring less dispersion of needed evolutionary cycles and higher reliability of solution discovery. Because extremely small GPA populations bring big dispersion of needed evolutionary cycles, it is difficult to predict the needed computational time. In the case of ES populations, analogical reasons are valid only if there are eliminated random influences of task switching and other sources.
During the initial stage of processing, the records about the delays of a selected type of train were reduced into a vector representing the delay discrete distribution FF by eliminating temporal and spatial information. There are present multiplication, exponential, divide, and factorial functions in the set of GPA functions to allow “discovery” of exponential or Poisson distribution related frequency functions (and similar ones), the candidates to possible solution. The number of reasoned data categories was limited to 20. In the case of large training data vectors, the less significant (outlier) data representing large delays were not reasoned, e.g., in the case of 1minute delay resolution. Thus, the training data vector was a vector of 20 pairs (delay time and frequency).
Larger delay bins, like the extreme 100 minute one, contain less amount of accumulated noise, but such large bins also lose information. They lose information about delay distribution, and thus they might cause wrong results.
The use of grouping many delays, e.g., from interval <1, 30> minutes, can decrease noise caused especially by small number of samples in the given bin. On the other hand, such grouping represents a significant loss of information. It tends to be a simple function to find and in the extreme case, the linear FF instead of the exponential one can be found.
Because Figure 1 points to functions similar to exponential or Poisson distribution ones, the function set for GPA contains functions that allows to reconstruct them and build a class of similar ones. Thus, there has been a present factorial, but it has a limited scale of argument magnitudes due to the limitations of number representation in computer. Also, the number of distinguished the most significant (the smallest) delays from the interval <1, 20> minutes was reasoned. Exponential or power functions, divide or inverse functions (1/x), multiplication, addition, and subtraction are useful in FF regression, and thus they were incorporated GPA function set. Herein the presented experiments are based on ungrouped data measured with 1minute resolution containing the first 20 values (delays 1, 2, …, 20 minutes).
3. Results and Discussion
The GPAES algorithm worked with a maximal limit of 10000 GPA evolutionary cycles and 200 ES cycles in each GPA one, with 100 GPA individuals and 200 individuals in each ES population. This algorithm was written in the C++ language and it was executed as a singlethread task on a single core of the 24 HT cores Intel Xeon processor. Execution times were between 80210 seconds for Sp trains and 274081 seconds for Ex trains due to stochastic character of the algorithm and resulting function shape.
The experiment majority tend to the presence of power function ‘^’ in the resulting function. Add function ‘+’ is the more useful (the more frequently present) than the divide function ‘/‘. Also, the multiplication function ‘’ was frequently used. Factorial function ‘!’ was never present in discovered functions. Thus, in future experiments, it can be left out and large data vectors can be used for probability density function learning.
The differences between passenger and freight train behaviour were observed rather in the coefficient magnitudes than in the structure of the discovered functions. All functions described in Table 2 are similar to FF of exponential distribution, but not identical.
Figures 2 and 3 provide examples of results for some train categories and allows comparing of discovered FF with original data frequencies.
4. Conclusions
The presented study was devoted to the estimation of data distribution class by the application of the hybrid evolutionary algorithm GPAES and symbolic regression of frequency function equations.
The experiments described in this paper confirmed the exponential like distribution of train delays on the basis of FF regression from Czech railway network data. The evaluated amount of data was 6 GB and they were preprocessed by the Apache Spark application written in the Python language.
Founded FFs are similar to ones related to exponential distribution, but not identical.
The GPAES algorithm is capable to discover algebraic relations in applicable form for extremely small data vectors of 20 elements [24, 25].
Data Availability
The csv data used to support the findings of this study were supplied by the state company: Správa železnic, s.o. (Rail Infrastructure Administration, state organization) under license and are subject of commercial confidentiality, so they cannot be made freely available. Address of the data owner is Správa žželeznic, s.o., Dlážděná 1003/7, PSČ 110 00 Praha 1, The Czech Republic.
Conflicts of Interest
The author declares no conflicts of interest.
Acknowledgments
The work was supported from ERDF/ESF “Cooperation in Applied Research between the University of Pardubice and Companies, in the Field of Positioning, Detection and Simulation Technology for Transport Systems (PosiTrans)” (No. CZ.02.1.01/0.0/0.0/17_049/0008394).