Short-term forecasting of origin-destination (OD) passenger flow on high-speed rail (HSR) is one of the critical tasks in rail traffic management. This paper proposes a hybrid model to explore the impact of the train service frequency (TSF) of the HSR on passenger flow. The model is composed of two parts. One is the Holt-Winters model, which exploits the time series characteristics of passenger flow. The other is a regression model that accounts for changes in the OD’s TSF at different times of the day. The two models are integrated by the minimum absolute value method to generate the final hybrid model. Operational data of the Beijing-Shanghai high-speed railway from 2012 to 2016 are used to verify the effectiveness of the model. Beyond its forecasting ability, the proposed model has an explicit functional form and can therefore also be used to forecast the effects of changes in the TSF.

1. Introduction

As of September 10, 2016, the operating length of China’s HSR had exceeded 20,000 km. (HSR in China refers to train services with an average speed of 200 km/h or higher; the HSR network consists of upgraded conventional railways and newly built HSR lines.) The HSR provides a new choice of transport for people who used to travel by air or by highway. HSR revenue comes mainly from two groups of passengers: newly generated HSR traffic and passengers diverted from other modes of transport. Therefore, accurate short-term forecasting of OD passenger flow on HSR is important because (i) it provides the basis for the planning and enhancement of the railway network; (ii) it is the foundation for investment in and construction of HSR; and (iii) it affects revenue management, technical specifications, operation mode, and facility improvement on HSR lines [1].

In the past decades, a lot of attention has been paid to short-term forecasting. The proposed models can generally be divided into three categories: time series models, causal models, and hybrid models.

First, time series models are functions in which the traffic flow is modeled from its own observed values. Time series models mainly include the autoregressive integrated moving average (ARIMA) model [2–4] and the exponential smoothing method [5–9]. ARIMA is a linear combination of time-lagged variables and error terms. It has been widely used in forecasting since the 1970s because it performs well in modeling linear and stationary time series. The exponential smoothing method includes single exponential smoothing, double (quadratic) exponential smoothing, and triple exponential smoothing. The triple exponential smoothing method is also known as the Holt-Winters method; it was first suggested by Holt’s student, Winters, in 1960 after reading a signal processing book from the 1940s on exponential smoothing [10], and it takes into account seasonal changes as well as trends. There is no consensus as to whether the ARIMA model or the Holt-Winters method dominates the other; the most appropriate method may differ from case to case [11–14]. However, the Holt-Winters method may have some advantages in forecasting long-term series because it can incorporate recent data into the forecast to update the model parameters and improve accuracy [11].

Second, causal models are those in which the traffic flow is modeled as a function of exogenous or endogenous factors. A model that considers the effects of multiple influencing factors can greatly enhance forecasting flexibility. By describing the relationship between transportation capacity, passenger volume, and quantity demanded, Luo et al. [15] constructed a passenger volume forecasting model for an HSR line, but their work concentrates on the passenger volume of the whole line and thus cannot be used to forecast the passenger volume of OD pairs. For the operator, knowledge of the passenger volume of each OD pair is sometimes more important than the total volume of the railway line. Du et al. [16] considered the influencing factors of HSR passenger volume, but their model cannot be applied to forecasting the daily passenger flow (DPF) because the influence of some factors is limited and some factors are not recorded on a daily basis. Wardman [17] proposed a forecasting model for railway traffic flow using external factors including GDP, variations in times, and fuel costs. Other scholars have conducted related studies [18–20]. Although multifactor models are considered to respond better when the external environment changes, they describe the periodic variation in the data less well than time series models.

Third, hybrid models are a well-established and well-tested approach to improving forecasting accuracy [21]. For instance, Zeng et al. proposed a hybrid model that combines a multilayer artificial neural network and ARIMA [22], and Xu et al. combined genetic algorithms and a grey SVM in their forecasting model [23]. Wei and Chen [24] forecasted short-term metro passenger flow with a hybrid method in which empirical mode decomposition (EMD) is first used to extract the features of the flow pattern and a neural network is then used to forecast the passenger flow based on the extracted features. In other research domains, several studies [21, 25, 26] have found that hybrid forecasts generally outperform forecasts from a single prediction model.

In this paper, we construct a hybrid model to forecast the daily passenger flow (DPF) that combines the advantages of the time series model and of a causal model that considers the impact of the TSF. The remainder of this article is organized as follows. In Section 2, the features of the TSF on HSR are introduced, and the temporal features of OD passenger flow, including the trend feature, the day-of-week feature, and the month-of-year feature, are discussed. In Section 3, the Holt-Winters time series model and the linear regression model considering the TSF are constructed, and the hybrid model is built using the minimum absolute value method. The accuracy of the model is verified with historical operation data of the Beijing-Shanghai HSR in Section 4. Finally, Section 5 concludes the paper.

2. Daily Passenger Flow Analysis

To provide prior knowledge of the DPF on the HSR, this section introduces the features of the TSF and the temporal features of OD passenger flow. The DPF patterns of different OD markets on an HSR line may differ. However, the DPF of any particular OD market shows regularities over time. In this section, we focus on the OD market from Beijing South Station to Shanghai Hongqiao Station on the Beijing-Shanghai HSR line. This HSR line is studied because it is a representative line in China’s HSR network. In the first subsection, we analyze the time series characteristics of the DPF of this OD, including the periodic changes of the DPF over weeks, months, and years. The second subsection illustrates that the train service frequency (TSF) also has an obvious influence on the OD’s DPF.

2.1. Time Series Characteristics

Figure 1 shows the passenger flow from Beijing South Station to Shanghai Hongqiao Station from January 2013 to December 2016. To examine the long-term trend of the OD’s DPF, we fit a linear regression of the passenger flow on time. The result is shown by the blue dashed line; its upward trend is similar to that of HSRs in other countries after the start of commercial operation. It is believed that the DPF will keep growing in the following years.

Then, we use the moving average method to generate the red dashed line. The whole line can be divided into four similar parts, each corresponding to one year. Hence, the DPF follows a regular yearly pattern.
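The two smoothing steps described above can be sketched in a few lines of Python. This is only an illustration: the function names and the window length are assumptions, since the text does not state the window used for the moving average.

```python
def linear_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept).
    """
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def moving_average(values, window):
    """Moving average over consecutive windows.

    The output has len(values) - window + 1 points. The window length is a
    free choice here (hypothetical; not specified in the text).
    """
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

A positive slope from `linear_trend` corresponds to the upward long-term trend, while a window of roughly one month in `moving_average` would suppress the weekly cycle and leave the yearly pattern visible.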

Moreover, Figure 2 compares the DPF of three different weeks, which show a similar pattern of change. Within a week, the passenger flow on Monday, Tuesday, Wednesday, and Thursday is relatively low. On Friday, the flow reaches its highest point, as many business travelers go home and tourists set out before the weekend. Saturday is another low point, and Sunday is the return date for the business travelers and tourists who traveled on Friday.

The above analysis shows that the OD’s DPF has yearly, monthly, and weekly periodic characteristics. To describe these temporal characteristics of passenger flow, the time series data will be modeled using the Holt-Winters model in the next section.

2.2. Analysis of Train Service Frequency Impact

China Railway Corporation (CR) modifies the train timetable at least twice per year, typically in January and July. In each modification, three types of new timetable are generated, namely, the normal timetable, the weekend timetable, and the peak timetable, which are applied from Monday to Thursday, from Friday to Sunday, and during festivals (i.e., the National Day and the Spring Festival), respectively. Generally, the number of trains differs among the timetable types. The peak timetable is the most saturated one, followed by the weekend timetable, which is in turn more saturated than the normal one.

As an illustration, Figure 3 shows the DPF and TSF over time, with one day as the period. The figure shows that when the TSF changes, the passenger flow changes in the same direction (increase or decrease), which indicates that changes in passenger flow and changes in TSF are correlated. However, when the TSF remains unchanged, the passenger flow still exhibits a trend, which indicates that although the TSF affects passenger flow, the cyclical variation of the passenger flow remains the main factor.

The above shows that changes in the TSF may influence the daily passenger flow. As Figure 3 shows, although the OD’s daily TSF varies from day to day, its rate of change is small. To show how the OD’s TSF changes across time intervals within a day, Figure 4 shows the TSF of the OD (Beijing South Station to Shanghai Hongqiao Station) on every Wednesday from March 27, 2013, to January 14, 2015. The display period is from 7 a.m. to 2 p.m.; one axis is divided into 10-minute intervals, and the other axis represents the date. A yellow cell indicates two train services for this OD in that interval, a green cell indicates one, and a white cell indicates none. Figure 4 shows that the OD’s TSF changes significantly across time intervals within a day. Passengers choose not only which day to travel but also what time to travel, so a change in the OD’s TSF in a given time interval affects passengers’ choices. Passengers who cannot buy the ticket they originally wanted will either choose another mode of travel or take a train at a different time.

Figure 4 thus indicates that the effect of the TSF should be considered when forecasting the passenger flow. Moreover, the impact of the TSF on the DPF cannot be captured by the OD’s daily TSF alone; the TSF in each time interval of the day should be considered. Therefore, in the next section, we build a linear regression model of the DPF. Because of the substantial impact of the TSF on the DPF, the model takes as input variables not only the temporal characteristics but also the TSF in different time intervals of the day.

3. Hybrid Forecasting Model

In this section, we build the Holt-Winters time series model based on the historical DPF data and propose a linear regression model that considers the TSF. The minimum absolute value method is then used to combine the two models into a hybrid model in which the advantages of the two complement each other.

3.1. Holt-Winters Method

The Holt-Winters model consists of three smoothing formulas that reflect the long-term level of the data, the incremental trend, and the seasonal variation, respectively. A prediction formula then extrapolates from these components. The model applies to situations in which the demand data exhibit both trend and seasonal cycle characteristics.

3.1.1. Holt-Winters Method

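The trend feature is represented by a single variable $t$, which denotes the length of time. The multiplicative smoothing formulas are as follows:

$$S_t = \alpha \frac{y_t}{I_{t-L}} + (1 - \alpha)(S_{t-1} + b_{t-1}), \tag{1}$$

$$b_t = \beta (S_t - S_{t-1}) + (1 - \beta) b_{t-1}, \tag{2}$$

$$I_t = \gamma \frac{y_t}{S_t} + (1 - \gamma) I_{t-L}, \tag{3}$$

where $y_t$ are the historical DPF; $L$ is the length of the seasonal cycle; and $\alpha$, $\beta$, and $\gamma$ are the smoothing parameters. We denote the smoothing value, the trend value, and the period coefficient at time $t$ by $S_t$, $b_t$, and $I_t$, respectively.

The predicted value, made at time $t$, of the value $m$ periods ahead is given by

$$F_{t+m} = (S_t + m b_t) I_{t-L+m}. \tag{4}$$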
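The recursion can be sketched in Python as follows. This is a minimal illustration of the standard multiplicative Holt-Winters update under stated assumptions, not the authors' implementation: the function names, the zero-based indexing, and the convention that the caller supplies the initial state are all choices made here for the example.

```python
def holt_winters_multiplicative(x, L, alpha, beta, gamma,
                                level0, trend0, seasonal0):
    """Run the multiplicative Holt-Winters updates over the observations x.

    seasonal0 holds L initial period coefficients, aligned so that
    seasonal0[t % L] applies to x[t]. Returns the final
    (level, trend, seasonal) state.
    """
    level, trend = level0, trend0
    seasonal = list(seasonal0)
    for t, obs in enumerate(x):
        prev_level = level
        s_old = seasonal[t % L]  # period coefficient from one cycle earlier
        # Smoothing value: deseasonalized observation vs. previous projection.
        level = alpha * obs / s_old + (1 - alpha) * (prev_level + trend)
        # Trend value: smoothed change in the smoothing value.
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # Period coefficient: smoothed ratio of observation to smoothing value.
        seasonal[t % L] = gamma * obs / level + (1 - gamma) * s_old
    return level, trend, seasonal


def forecast(level, trend, seasonal, n_obs, L, m):
    """Forecast m periods ahead after n_obs observations have been processed."""
    idx = (n_obs + m - 1) % L  # position of the target period in the cycle
    return (level + m * trend) * seasonal[idx]
```

For a series with a weekly cycle one would call the first function with `L=7` and then call `forecast` for `m = 1, ..., 30` to obtain a 30-day forecast.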

3.1.2. Choosing Initial Value

To initiate the updating procedure, we must choose starting values for the smoothing value, the trend value, and the period coefficient. Many methods exist for selecting initial values; here, we use a general approach. The equation is as follows:

The number of cycles used for initialization is a parameter of this scheme, and “mod” denotes the remainder operation.
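As an illustration only, one widely used initialization scheme is sketched below in Python. It may differ from the equation the authors use: the function name `initial_values`, the parameter `n_cycles`, and the specific averaging rules are assumptions made for this example.

```python
def initial_values(x, L, n_cycles):
    """Initial smoothing value, trend value, and L period coefficients.

    Uses the first n_cycles complete cycles of x (n_cycles >= 2):
      - smoothing value: mean of the first cycle;
      - trend value: average per-period change between cycles 1 and 2;
      - period coefficients: ratio of each observation to its cycle mean,
        averaged over the n_cycles cycles.
    This is one common textbook scheme (an assumption, not the paper's).
    """
    cycle_means = [sum(x[j * L:(j + 1) * L]) / L for j in range(n_cycles)]
    level0 = cycle_means[0]
    trend0 = sum((x[L + i] - x[i]) / L for i in range(L)) / L
    seasonal0 = [
        sum(x[j * L + i] / cycle_means[j] for j in range(n_cycles)) / n_cycles
        for i in range(L)
    ]
    return level0, trend0, seasonal0
```

Averaging the period coefficients over several cycles makes the starting values less sensitive to an unusual first week or first year.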

3.1.3. Determine the Smoothing Parameters

To obtain accurate smoothing parameters $\alpha$, $\beta$, and $\gamma$, we choose the values that minimize the error between the predicted and actual values over the $n$ days to be forecast. However, as the time series grows, solving this minimization becomes more expensive; to solve it quickly, this paper adopts a genetic algorithm to obtain appropriate smoothing coefficients.
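The parameter search can be illustrated with a much simpler stand-in for the genetic algorithm: an exhaustive search over a coarse grid. The Python sketch below is not the authors' method; `grid_search_smoothing` and the callable `loss` are hypothetical names, and `loss` is assumed to return the forecast error for a given triple of smoothing parameters.

```python
from itertools import product


def grid_search_smoothing(loss, step=0.1):
    """Return the (alpha, beta, gamma) on a regular grid with the smallest loss.

    The grid covers step, 2*step, ..., 1 - step in each dimension. This is a
    deterministic stand-in for the genetic algorithm mentioned in the text.
    """
    n = round(1 / step)
    grid = [i / n for i in range(1, n)]
    return min(product(grid, repeat=3), key=lambda params: loss(*params))
```

In practice `loss` would run the Holt-Winters recursion with the given parameters and return, for example, the sum of absolute or squared forecast errors. A grid with step 0.1 needs 729 evaluations; a genetic algorithm becomes attractive when each evaluation is expensive or a finer resolution is needed.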

3.2. A Linear Regression Model considering Capability

The analysis in Section 2.2 shows that the TSF also changes over time and that the DPF and the TSF are related. Because the TSF in different periods of the day has different effects on the DPF, we divide the day into time periods and add variables for the temporal features to build the regression model.

3.2.1. Variables of Temporal Features

The temporal features considered in the regression model include the trend feature, the day-of-week feature, and the month-of-year feature. The trend feature is represented by a single variable $t$, as in the Holt-Winters model described above. Similar to Lee et al. [27], dummy variables are used to represent the day-of-week, month-of-year, and year-of-period features. For the day-of-week feature, there are six dummy variables, $D_1, \dots, D_6$: if the date is the $i$th day of the week, $D_i$ is 1; otherwise, it is 0. Similarly, there are eleven dummy variables, $M_1, \dots, M_{11}$, to reflect the month of the year. The number of year-of-period dummy variables is determined by the number of years used; these variables are denoted $Y_1, \dots, Y_p$. Note that, for each set of dummy variables, one alternative (the last day of the week and the last month of the year) is set as the null alternative, for which all dummies in the set are 0.
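The encoding can be sketched in Python as follows. The function name and argument layout are illustrative assumptions. The sketch takes Sunday as the last day of the week and, as a further assumption not stated in the text, uses the last year in the list as the reference year.

```python
from datetime import date


def temporal_features(d, start, years):
    """Trend index plus day-of-week, month-of-year, and year dummies for date d.

    Reference (all-zero) categories: Sunday, December, and the last entry of
    `years` (the choice of reference year is an assumption of this sketch).
    """
    trend = (d - start).days + 1                               # 1 on the first day
    dow = [1 if d.weekday() == i else 0 for i in range(6)]     # Monday..Saturday
    month = [1 if d.month == m else 0 for m in range(1, 12)]   # January..November
    year = [1 if d.year == y else 0 for y in years[:-1]]
    return [trend] + dow + month + year
```

For example, a Friday in November activates the fifth day-of-week dummy and the eleventh month dummy, while a Sunday in December has all day-of-week and month dummies equal to zero.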

3.2.2. Variables of TSF

As the size of the time interval changes, both the number of variables used and the final prediction accuracy of the model change. We therefore treat the time interval as a variable, $\Delta$. We divide the day into time segments of length $\Delta$; let $K$ be the number of time segments, and let $f_k$ ($k = 1, \dots, K$) denote the OD’s TSF in the $k$th time segment.
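A short Python sketch of this segmentation is given below. The function name and the representation of departure times as minutes after midnight are assumptions; the default window of 6:00 to 17:00 follows the experiment design in Section 4.1.

```python
def tsf_by_segment(departures, interval_min, start_min=6 * 60, end_min=17 * 60):
    """Count train departures in consecutive segments of interval_min minutes.

    departures: departure times in minutes after midnight.
    Departures outside [start_min, end_min) are ignored. The last segment
    may be shorter than interval_min if the window is not an exact multiple.
    """
    n_segments = -(-(end_min - start_min) // interval_min)  # ceiling division
    counts = [0] * n_segments
    for dep in departures:
        if start_min <= dep < end_min:
            counts[(dep - start_min) // interval_min] += 1
    return counts
```

With a 45-minute interval the 11-hour window yields 15 segments, so the regression model receives 15 TSF variables; with a 60-minute interval it receives 11.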

3.2.3. The Linear Regression Model Which Considers TSF

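Next, we construct a linear regression model to describe the relationship between the TSF and the DPF, so the DPF can be expressed as $y = a_0 + \sum_{i=1}^{q} a_i x_i + \varepsilon$. Here, $a_0$ is a constant, $a_i$ is the coefficient of the variable $x_i$, $i = 1, \dots, q$, and $q$ is the number of variables. $X$ is the collection of all designed variables, comprising the temporal features (the trend $t$ and the day-of-week, month-of-year, and year-of-period dummies $D$, $M$, and $Y$) and the OD’s TSF $f$ in each time segment, as in (7), and $\varepsilon$ is the random error:

$$X = \{t, D_1, \dots, D_6, M_1, \dots, M_{11}, Y_1, \dots, Y_p, f_1, \dots, f_K\}. \tag{7}$$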

The variables include both continuous and discrete variables, and their magnitudes differ. The size of a partial regression coefficient therefore cannot directly indicate the magnitude of that variable’s linear effect on the dependent variable. It is thus necessary to standardize the independent variables and the dependent variable in advance.
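The text does not state which standardization is used; a common choice is the z-score, sketched below in Python under that assumption. The handling of constant columns is also an assumption of this sketch.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score a column: subtract the mean and divide by the standard deviation.

    A constant column (zero standard deviation) is mapped to all zeros, since
    it carries no information for the regression.
    """
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0] * len(column)
    return [(v - mu) / sigma for v in column]
```

After every column (including the DPF) is standardized this way, the fitted coefficients are comparable across variables: a larger absolute coefficient indicates a stronger linear effect on the DPF.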

3.3. Hybrid Forecasting Model

Both of the above methods can be used to forecast the DPF, but each has shortcomings. The Holt-Winters model exploits the time series characteristics of passenger flow but ignores the impact of the departure frequency in different periods of the day. The regression model considers the change of the OD’s TSF at different times but captures the temporal characteristics less well. It is natural for a decision maker to consider time-varying forecast combination schemes to avoid the disadvantages of the two methods. This paper uses the minimum absolute value method [28] to combine the two approaches and take full advantage of both.

In hybrid forecasting, the essential step is to determine the weight coefficients of the hybrid model, so that the information from the different forecasting methods is synthesized to improve prediction accuracy. Traditional combination methods include the equal-weight method, the least squares method, and so on. These methods have several disadvantages, which the minimum absolute value method is able to overcome.

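From the above, $\hat{y}_{1t}$ and $\hat{y}_{2t}$ are the predictions of the actual DPF $y_t$ by the Holt-Winters model and the linear regression model, respectively, and $\hat{y}_t$ is the predicted value of the hybrid model. So $\hat{y}_t = w_1 \hat{y}_{1t} + w_2 \hat{y}_{2t}$, where $w_1$ and $w_2$ are the weights of the two methods and $w_1 + w_2 = 1$.

The minimum absolute value method takes the minimum of the sum of absolute errors as its objective function; the mathematical model is as follows:

$$\min J = \sum_{t=1}^{N} |e_t|. \tag{8}$$

The constraint satisfies the following formula:

$$w_1 + w_2 = 1, \quad w_1 \ge 0, \quad w_2 \ge 0, \tag{9}$$

where

$$e_t = y_t - \hat{y}_t = w_1 e_{1t} + w_2 e_{2t}, \quad e_{it} = y_t - \hat{y}_{it}, \quad i = 1, 2, \tag{10}$$

and $N$ is the number of observations used to fit the weights. We assume $u_t = (|e_t| + e_t)/2$ and $v_t = (|e_t| - e_t)/2$, so that $|e_t| = u_t + v_t$ and $e_t = u_t - v_t$ with $u_t, v_t \ge 0$. Thus, the above objective function can be converted into the following equation:

$$\min J = \sum_{t=1}^{N} (u_t + v_t) \quad \text{s.t.} \quad w_1 e_{1t} + w_2 e_{2t} - u_t + v_t = 0, \quad w_1 + w_2 = 1, \quad w_1, w_2, u_t, v_t \ge 0, \quad t = 1, \dots, N. \tag{11}$$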

Thus, (11) can be treated as a typical linear programming problem, and the simplex method can be used to calculate the weights.
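With only two models and weights that sum to one, the problem has a single free variable, so it can also be solved without a linear programming solver. The Python sketch below uses this shortcut instead of the simplex method; it is an equivalent illustration for the two-model case under the assumption of nonnegative weights, and the function name is hypothetical.

```python
def min_abs_weights(actual, pred1, pred2):
    """Weights (w1, w2) with w1 + w2 = 1 and w1, w2 >= 0 that minimize
    the sum over t of |actual_t - w1 * pred1_t - w2 * pred2_t|.
    """
    e1 = [y - p for y, p in zip(actual, pred1)]
    e2 = [y - p for y, p in zip(actual, pred2)]

    def total_abs_error(w):
        # Combined error is w * e1 + (1 - w) * e2 because the weights sum to 1.
        return sum(abs(w * a + (1 - w) * b) for a, b in zip(e1, e2))

    # The objective is convex and piecewise linear in w, so a minimum lies at
    # an endpoint or at a point where one combined error crosses zero.
    candidates = [0.0, 1.0]
    for a, b in zip(e1, e2):
        if a != b:
            w = b / (b - a)
            if 0.0 < w < 1.0:
                candidates.append(w)
    w1 = min(candidates, key=total_abs_error)
    return w1, 1.0 - w1
```

When one model systematically overestimates and the other underestimates, the zero crossings fall inside the unit interval and the optimal weights are strictly between 0 and 1. For more than two models, the general linear programming formulation is needed.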

4. Case Analysis

4.1. Experiment Design

For data preparation, daily passenger flow and TSF data for the OD from Beijing South Railway Station to Shanghai Hongqiao Railway Station from July 1, 2012, to the end of 2016 were collected for our experiments. The Beijing-Shanghai HSR officially began operation in July 2011. However, since the passenger flow was still in its cultivation stage and its characteristics were not yet visible, this article discards the passenger flow data from July 2011 to June 2012. Moreover, because the DPF during holidays follows a significantly different pattern from that on ordinary days, data points falling on China’s legal holidays (e.g., Spring Festival, Labor Day, and National Day) are removed from the regression. That is to say, only ordinary days are considered in the training and testing experiments. For example, when evaluating the forecasting ability of the period from October 16 to November 15, 2013, daily data from July 1, 2012, to October 15, 2016, are used for training.

In China, HSR uses the daily timetable on working days (except Friday), the weekend timetable on Friday and the weekend, and the peak timetable during holidays. The different types of timetable are designed to accommodate different passenger demand, so the TSF of the same OD in a given interval of the day differs among them. Typically, the daily TSF of the three timetables for the same OD satisfies the quantitative relationship $f_d \le f_w \le f_p$, where $f_d$, $f_w$, and $f_p$ are the OD’s TSF under the daily timetable, the weekend timetable, and the peak timetable, respectively. Thus, when calculating the weights of the different methods, this article performs the calculation separately for each type of timetable.

According to the data used, the departure times of trips from Beijing to Shanghai range from 6:00 to 19:00. This article uses 6:00 as the starting point and 17:00 as the endpoint, and divides this period into time segments according to a chosen time interval. To compare the prediction methods, we use each of the three methods to predict the passenger flow over the following 30 days and compare the predictions with the actual flow.

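The performance measures adopted in our research are the widely used mean absolute percent error (MAPE) in (12) and the variance of absolute percentage error (VAPE) in (13). The former measures the average forecasting accuracy, and the latter measures the stability of prediction:

$$\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{y_t} \times 100\%, \tag{12}$$

$$\mathrm{VAPE} = \frac{1}{n} \sum_{t=1}^{n} \left( \frac{|y_t - \hat{y}_t|}{y_t} - \mathrm{MAPE} \right)^2 \times 100\%, \tag{13}$$

where $y_t$ and $\hat{y}_t$ are the actual and predicted DPF on day $t$, and $n$ denotes the number of days predicted; in this article, we forecast 30 days, so $n = 30$.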
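Both measures are straightforward to compute. The Python sketch below returns them as fractions rather than percentages and assumes the population form of the variance (division by the number of days); the text does not state which denominator is used, so that choice is an assumption.

```python
def absolute_percentage_errors(actual, predicted):
    """Per-day absolute percentage errors |y - y_hat| / y, as fractions."""
    return [abs(a - p) / a for a, p in zip(actual, predicted)]


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    apes = absolute_percentage_errors(actual, predicted)
    return sum(apes) / len(apes)


def vape(actual, predicted):
    """Variance of the absolute percentage errors around their mean.

    Uses the population variance (divide by n), an assumption of this sketch.
    """
    apes = absolute_percentage_errors(actual, predicted)
    m = sum(apes) / len(apes)
    return sum((e - m) ** 2 for e in apes) / len(apes)
```

A model with a low MAPE but a high VAPE is accurate on average yet occasionally far off, which is why the two measures are reported together.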

4.2. Hybrid Forecasting Model Analysis

The MAPE results of the different models for the 25 testing experiments are shown in Figure 5 and Table 1; the time interval selected is 45 minutes. The hybrid model proposed in this work achieves the best accuracy (a MAPE of 14.4% on average) among the tested models, and the prediction accuracy of the other two methods is approximately the same. For most of the experiments, the MAPE of the hybrid model is lower than that of the other two approaches. Even in the experiments in which the hybrid model is less accurate than the Holt-Winters model, the difference is not significant.

The VAPE results are shown in Figure 6 and Table 2. The proposed hybrid model has the highest stability (a VAPE of 1.1% on average), followed by the Holt-Winters model and then the linear model.

The above comparisons show that the hybrid model proposed in this work performs better in both forecasting accuracy and stability. Besides, since our model establishes an explicit relationship between the temporal characteristics of passenger flow and the TSF, it also gives a way to forecast the real DPF. Taking experiment 25 (corresponding to the period from November 5, 2016, to December 4, 2016) as an example, the real DPF, together with the DPF forecasted by the three methods, is illustrated in Figure 7.

Figure 7 shows that the forecast of the Holt-Winters model is always higher than the actual value and that the forecast of the linear model is always lower, so most predictions of the hybrid model fall between the two and are closer to the actual value. The part framed by the red dotted rectangle shows two phenomena. First, the Holt-Winters model considers only the temporal characteristics of the DPF, so its deviation increases when the OD’s TSF becomes larger. Second, the linear model overreacts to frequency changes, which also increases the deviation of its predictions. The hybrid model exploits the fact that the two deviations have opposite signs, which brings its forecast close to the actual DPF.

4.3. The Impact of Different Time Intervals

The model shows that the choice of time interval affects the accuracy of the linear regression model, which in turn affects the accuracy of the hybrid model. We selected 17 different time intervals to explore their influence on the model. To enhance the reliability of the results, we also increased the number of samples for each time interval from 25 to 60.

The MAPE of different time intervals is plotted as a boxplot. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually.

Figure 8 shows that when the interval is 45 minutes, the 25th and 75th percentiles of the MAPE are lower than for the other intervals, and the box is also the shortest. Figure 9 shows the average MAPE over the samples for each time interval. The mean error is smallest at 45 minutes, followed by 120 minutes. Therefore, the hybrid model with the 45-minute interval is the best in both accuracy and stability of prediction. Although it has outliers, they are still smaller overall than the outliers of the other time intervals.

Figure 9 also shows that, from the 15-minute to the 180-minute time interval, the MAPE follows a roughly parabolic curve. For time intervals longer than 180 minutes, the MAPE remains unchanged.

5. Conclusions

This paper constructs a hybrid forecasting model that not only makes full use of the time series model for more accurate prediction of periodic data but also captures the effects of the TSF. The case study shows that the hybrid model is more accurate and more stable than either single model. Compared with other models, it requires fewer data sources, and these are easier to obtain. Moreover, the model can be employed to provide basic information for adjusting the train timetable.

However, this paper considers the TSF at different times of the day but does not take the rated transportation capacity into account. Other factors, such as the seat allocation strategy, are also omitted for simplicity. A refined representation of the transportation capacity of the HSR line is left for future work.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This research was supported by China Railway Corporation Technology Research and Development Plan Project (2016X005-E) and National Key R&D Program of China (2016YFB1200600).