Abstract

Stranding too many passengers at stations lowers the service level and, if no measures are taken, may lead to serious safety problems. Mining the temporal distribution mechanism of passenger flow in depth can guide operating enterprises in preparing operation plans, emergency evacuation plans, and similar schemes. First, big data theory is introduced to construct a mining model of the temporal aggregation mechanism with supplement and correction functions; a clustering algorithm is then used to mine the peak time intervals of passenger flow, and the temporal aggregation rule of passenger flow is studied from the perspective of traffic dispatching and command. Second, according to the mined aggregation rule, the ratio of express to slow trains on suburban lines can be determined period by period, so that train capacity matches the uneven temporal distribution of passenger flow. Finally, an example shows the advantage of the model over the empirical method in determining train ratios. The research can help reduce operating energy consumption and improve the service level of urban rail transit.

1. Introduction

The success of urbanization in China has also brought problems: traffic congestion in megacities is becoming more and more serious, the focus of urban life is shifting to the suburbs, and passenger flow between suburbs is increasing. Rail transit passenger flow fluctuates greatly and is difficult to predict, which makes passenger organization harder, and large numbers of stranded passengers can cause major safety problems; for example, the serious trampling accident in Chen Yi Square on the Shanghai Bund on New Year’s Day of 2015 was a warning. By means of big data analysis and advanced data mining technology, a study of the aggregation rule of rail passenger flow at stations can reveal the characteristics of passenger flow at different stations and in different time periods, help operators adjust the train schedule in time, and provide a decision-making basis for daily transportation dispatching.

Among the many passenger travel distribution models, the gravity model and the growth coefficient method are widely used. In 1940, Stouffer first proposed the intervening opportunities model. In 1955, Casey proposed a gravity model, which was subsequently interpreted in terms of maximum entropy and maximum likelihood. In 1965, Furness proposed the well-known growth coefficient method. Subsequently, the linear regression model was brought into use in traffic planning in the United States to predict trip generation and attraction. At the end of the 1960s, British scholars put forward a cross-classification method; on this basis, Gordon proposed a nonfamily travel model [1]. Yam et al. developed a special model to predict official travel [2]. Bowman and Ben-Akiva established a model that predicts trip generation with the traveler’s activities as the starting point [3]. These scholars studied passenger flow prediction and distribution models in depth and laid a theoretical basis.

In recent years, many studies have been conducted in China. Xiaoguang proposed a polynomial distributed lag model for short-term traffic flow forecasting, which takes into account the factors affecting the traffic state time series other than its own lagged term and gives more accurate results [4]. Enjian analyzed the theoretical flaws of the traditional disaggregated models and constructed an OD distribution forecasting model based on aggregated data [5]. For the joint model of OD distribution and path selection, Professor Peng used the extrapolation method for parameter calibration because survey results on destination selection were lacking, which separated the two phases and affected the prediction accuracy of the model [6]. A short-term passenger flow OD estimation model based on the state space method was proposed, and the accuracy of the prediction results was improved [7]. For OD matrix estimation, the generalized least squares model was used as the objective function, and the corresponding solution process using the Lagrangian algorithm was given; however, the accuracy of the back-calculated results was barely satisfactory [8, 9]. A multiobjective mathematical model was proposed for the situation in which several projects are candidates for complete or partial investment, and the computational results were compared with solutions obtained by the NSGA-II algorithm [10]. A multiobjective model for locating relief resource distribution facilities under uncertainty was investigated with two forms of demand satisfaction, considering relief resource accessibility and demand satisfaction in fuzzy logic. The NSGA-II algorithm was used to solve the problem and to perform the sensitivity analysis, and a constraint method was proposed to evaluate the performance of the algorithm [11].

The studies above do not go deep into the preprocessing of mass data, and they provide no supplementary mechanism for coping with missing field values and similar defects. Their use of passenger data focuses mainly on the calibration of model parameters. If the passenger flow data are finely preprocessed and their missing values are supplemented, and the temporal aggregation mechanism of passengers at stations is then explored on this basis, the results can provide data support for the optimization of operation schemes. Based on the AFC inbound and outbound passenger flow data provided by the operator, this paper tracks passengers by transportation card number and travel time and calculates the travel time of passengers by combining the relevant time parameters, so as to construct a model for mining the passenger aggregation mechanism. The model helps in making and temporarily adjusting train operating plans and can provide theoretical and methodological references for supplementing passenger flow forecasting and emergency relief models.

2. Urban Rail Transit Passenger Flow Data Cleaning and Processing

After many years of operation, rail transportation enterprises have accumulated a great deal of operation-related data, and their Oracle databases have a large capacity. Because the fields, stored procedures, data rules, and normal forms were not designed reasonably in the early stage of database construction, the stored operational data contain redundancy and errors, which not only occupy a lot of storage space and waste storage resources but also greatly reduce the effectiveness and stability of data analysis. Therefore, big data analysis and processing methods are of great significance for abnormal data identification, missing data completion, erroneous data correction, redundant data cleaning, and so on [12].

Data redundancy, missing data, and nonstandard data quality can have a fatal impact on big data applications, so big data with quality problems need to be cleaned. Parallel technology can be used to make big data cleaning highly scalable. However, because they lack an effective design, common data cleaning methods based on MapReduce perform redundant computation, which degrades performance. The cleaning process can be optimized, and the efficiency of cleaning operations improved, starting from the process shown in Figure 1.

In the data generated by rail transit operation, some fields are filled in by hand and some are generated automatically by machines, so some values are inevitably missing. For this reason, a missing value filling mechanism based on naive Bayesian classification is proposed. See Figure 2 for the basic process:

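$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \tag{1}$$

where $P(B \mid A)$ means the probability that incident $B$ occurs when incident $A$ occurs, $P(A \mid B)$ means the probability that incident $A$ occurs when incident $B$ occurs, $P(A)$ means the probability of incident $A$, and $P(B)$ means the probability of incident $B$.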

(1) Parameter Estimation. Each missing value is filled with the candidate value that has the maximum probability as calculated by Bayesian classification [13]. The task of the parameter estimation module is to calculate all the probabilities using (1), where $P(B)$ is constant for all candidate values, so only $P(B \mid A)\,P(A)$ needs to be calculated. If the prior probability of the missing values is unknown, the candidate values can be assumed to be equally probable, and $P(B \mid A)$ can be calculated using

$$P(B \mid A) = \prod_{k=1}^{n} P(b_k \mid A) = P(b_1 \mid A)\,P(b_2 \mid A)\cdots P(b_n \mid A), \tag{2}$$

where $A$ is a candidate value of the missing attribute, $B = (b_1, b_2, \ldots, b_n)$ are the values of the attributes it depends on, and $P(b_k \mid A)$ means the probability that incident $b_k$ occurs when incident $A$ occurs.

So the entire parameter estimation module is used to calculate the probability of each value of all attributes.

From the perspective of probability theory, when the sample space is large enough, a probability can be approximated by the corresponding frequency, so the system uses the frequency of occurrence of each attribute value in the tuples without missing values to calculate the conditional probabilities.

(2) Connection Module. According to (2), the probability of each candidate value for a tuple containing missing values, as determined by the values of its dependency attributes, can be calculated. However, since MapReduce can only process one record at a time in the Map phase and the Reduce phase, the system must associate each dependency attribute value with its conditional probability; this is why the connection module is needed and the problem it solves. The input of the connection module is the output of the parameter estimation module together with the original data to be filled. The output is a file that associates the dependency attribute values in the tuples containing missing data with the conditional probabilities of those values.

(3) Filling Module. The filling module is implemented with MapReduce. First, the output of the connection module and the original input data are joined with the offset as the key. The Map phase is similar to that of the connection module. The Reduce phase uses (2) to calculate the conditional probability corresponding to each candidate (a possible value of an attribute to be filled) and selects the candidate with the maximum probability to fill the missing value.
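The three modules above can be illustrated with a small in-memory sketch. It is not the MapReduce implementation described in the text: the connection step is replaced by a dictionary lookup, equal priors are assumed as in the parameter estimation step, and the function names, field names, and toy records are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_parameters(complete_rows, target, features):
    """Parameter estimation: frequency counts from rows without missing values."""
    target_counts = Counter(row[target] for row in complete_rows)
    # (feature name, target value) -> Counter of observed feature values
    cond_counts = defaultdict(Counter)
    for row in complete_rows:
        for feature in features:
            cond_counts[(feature, row[target])][row[feature]] += 1
    return target_counts, cond_counts


def fill_missing(row, target, features, target_counts, cond_counts):
    """Filling: pick the candidate with the largest product of conditional
    frequencies. Equal priors are assumed, so the prior term is omitted."""
    best_value, best_score = None, -1.0
    for candidate, n in target_counts.items():
        score = 1.0
        for feature in features:
            # Frequency of this feature value among rows with this candidate.
            score *= cond_counts[(feature, candidate)][row[feature]] / n
        if score > best_score:
            best_value, best_score = candidate, score
    filled = dict(row)
    filled[target] = best_value
    return filled


# Toy records for illustration only; they are not taken from the AFC data.
complete = [
    {"inbound": "Qibao", "period": "08-10", "outbound": "Guilin Road"},
    {"inbound": "Qibao", "period": "08-10", "outbound": "Guilin Road"},
    {"inbound": "Qibao", "period": "18-20", "outbound": "Sheshan"},
    {"inbound": "Jiuting", "period": "08-10", "outbound": "Guilin Road"},
]
incomplete = {"inbound": "Qibao", "period": "08-10", "outbound": None}

counts = estimate_parameters(complete, "outbound", ["inbound", "period"])
filled = fill_missing(incomplete, "outbound", ["inbound", "period"], *counts)
```

In this toy case the missing outbound station is filled with the value whose dependency attributes match most often in the complete records. A production version would also need smoothing for attribute values that never occur with a candidate.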

3. The Time_cluster Model of Aggregation Mechanism Mining

When the database changes, a fuzzy frequent attribute set may no longer be frequent in the updated database. In this paper, a negative boundary is defined for the fuzzy frequent attribute sets of the original database, and the fuzzy frequent attribute sets together with their negative boundary are used for mining the temporal passenger flow.

3.1. Mining Association Rules of Passenger Flow Time Series in Rail Transit

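In the fuzzy discretization of a time series, let the time series be $s = (x_1, x_2, \ldots, x_n)$. A time window of width $w$ acting on $s$ forms the subsequence $s_i = (x_i, x_{i+1}, \ldots, x_{i+w-1})$, and a single-step slide of the window forms a series of subsequences of width $w$:

$$W(s) = \{\, s_i \mid i = 1, 2, \ldots, n - w + 1 \,\}.$$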

The subsequences are regarded as points in Euclidean space and are randomly assigned to $K$ classes. The center $c_j$ of each class is calculated as the representative of that class, and the membership of a point $s_i$ in the class represented by $c_j$ is

$$\mu_j(s_i) = \frac{\left(1/d_{ij}^{2}\right)^{1/(b-1)}}{\sum_{l=1}^{K}\left(1/d_{il}^{2}\right)^{1/(b-1)}},$$

where $b > 1$ denotes a constant that controls the fuzzy degree of the clustering results and $d_{ij}^{2} = \lVert s_i - c_j \rVert^{2}$ represents the square of the distance from point $s_i$ to the representative point $c_j$ of class $j$.
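The windowing and membership steps described above could be sketched as follows. The sketch assumes the standard fuzzy c-means membership form with fuzziness constant `b`; the class centers are given directly instead of being iterated to convergence, and the flow counts and centers are made-up values.

```python
def sliding_windows(series, width):
    """Single-step sliding window: all contiguous subsequences of `width`."""
    return [tuple(series[i:i + width]) for i in range(len(series) - width + 1)]


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - c) ** 2 for a, c in zip(p, q))


def fuzzy_memberships(point, centers, b=2.0):
    """Membership of one point in each class (fuzzy c-means form, assumed)."""
    d2 = [squared_distance(point, center) for center in centers]
    # A point that coincides with a center belongs fully to that class.
    if any(d == 0 for d in d2):
        hits = [1.0 if d == 0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    weights = [(1.0 / d) ** (1.0 / (b - 1.0)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up passenger counts per period and hypothetical class centers.
flows = [120, 340, 560, 300, 150]
windows = sliding_windows(flows, 2)
centers = [(100, 300), (500, 400)]
memberships = fuzzy_memberships(windows[0], centers)
```

The memberships of a point always sum to 1, and a larger `b` spreads the membership more evenly across classes.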

In the calculation of the support, if occurs, occurs in time ; that is, we need to determine the frequency of occurrence of :where is the membership of point belonging to the th representative attribute.

The support of the correlation rule is , where denotes frequency of occurrence of in time after happens, with reference to the method of the rule where denotes probability of occurrence of the rule in , represents occurrence forms of in time after occurs, and the right side of (5) represents the information acquisition process from a priori probability to posteriori probability .

3.2. Mining of Temporal Passenger Flow OD Pairs Based on the Support of Correlation Rules

After preprocessing, the AFC passenger flow data of the rail transit stations are more accurate and meet the data format and redundancy requirements of the data mining tools. The main algorithm explores the temporal aggregation correlation rules of the passenger data in order to mine the temporal passenger flow and the passenger flow OD pairs, with transportation card numbers as the main key. The designed algorithm is as follows.

Step 1. Calculate first and then according to the support of passenger flow of the rail transit, and set .

Step 2. Initialize the AFC passenger flow database according to the mining condition, with the cleaned and preprocessed data as the initial data for correlation rule mining. Scan the transaction database and find all the item sets whose length is 1 to form the candidate 1-item set. Substitute them into (5) and (6), calculate the support of each item, and compare it with the minimum support one by one; the items whose support is greater than the minimum support form the frequent 1-item set.

Step 3. Generate the candidate $(k+1)$-item set from the frequent $k$-item set, starting from $k = 1$. If the candidate set is not empty, go to the next step; otherwise the loop ends.

Step 4. Scan the database to calculate and determine the support of each candidate item set.

Step 5. Delete the candidate items whose support is smaller than the minimum support, forming the frequent $(k+1)$-item set.

Step 6. Set $k = k + 1$ and go to Step 3.

Step 7. This results in the frequent item set correlation rules of temporal passenger flow aggregation.

Step 8. Output the temporal passenger flow aggregation at 2-hour granularity and the resulting OD pairs to guide the preparation and adjustment of rail transit operation plans.
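The loop in Steps 2 to 7 follows the general shape of Apriori-style frequent item set mining, which can be sketched as below. This is a simplified illustration: it uses crisp support (the share of transactions containing an item set) instead of the fuzzy support of Section 3.1, keeps item sets whose support is at least the threshold, and the item labels and transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set: support} for every item set meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Crisp support: share of transactions that contain the item set.
        return sum(1 for t in transactions if itemset <= t) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Step 2: candidate and frequent 1-item sets.
    items = {item for t in transactions for item in t}
    current = keep_frequent(frozenset([item]) for item in items)

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        # Step 3: join frequent k-item sets into candidate (k+1)-item sets.
        candidates = {a | c for a, c in combinations(current, 2)
                      if len(a | c) == k + 1}
        # Prune candidates that have an infrequent k-item subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k))}
        # Steps 4 and 5: score candidates and drop those below the threshold.
        current = keep_frequent(candidates)
        k += 1  # Step 6
    return frequent


# Hypothetical transactions: one trip per row, encoded as labeled items.
trips = [
    {"in:Qibao", "out:Guilin Road", "period:08-10"},
    {"in:Qibao", "out:Guilin Road", "period:08-10"},
    {"in:Qibao", "out:Sheshan", "period:18-20"},
    {"in:Jiuting", "out:Guilin Road", "period:08-10"},
]
frequent = apriori(trips, min_support=0.5)
```

With the threshold set to 0.5, the rare evening trip to Sheshan is filtered out, while the morning flow toward Guilin Road survives at every item set length.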

4. Case Study and Analysis

4.1. Cleaning and Missing Filling of Passenger Flow Data

The passenger flow data of this paper come from the AFC gates of Shanghai Rail Transit Line 9, and the passenger flow period covers 6:00–22:00. The main data fields include card number, arrival time, inbound station, departure time, outbound station, P + R (an indication of whether a P + R parking lot is used), and total time (travel time). Many records are redundant or invalid for reasons such as fare evasion and lost transportation cards. Some raw data are shown in Table 1.

The original data show that some fields are missing, including fields that must be filled. The data cleaning and filling method designed in Section 2 is therefore applied to the original data to facilitate data mining.
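A minimal sketch of such a cleaning pass is given below. The rules are assumptions chosen for illustration and are not the process of Figure 1: exact duplicates are dropped, records without a card number or entry information are discarded, records without exit information are set aside for filling, and records whose exit time is not later than the entry time are treated as invalid. The field names are hypothetical renderings of the fields listed above.

```python
from datetime import datetime

# Fields without which a record cannot be traced (assumed rule).
REQUIRED_FIELDS = ("card_number", "arrival_time", "inbound_station")


def clean_records(records):
    """Split records into (valid, needs_filling); drop duplicates and bad rows."""
    seen = set()
    valid, needs_filling = [], []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # redundant duplicate
        seen.add(key)
        if any(rec.get(field) is None for field in REQUIRED_FIELDS):
            continue  # cannot be traced to a passenger or an entry
        if rec.get("departure_time") is None or rec.get("outbound_station") is None:
            needs_filling.append(rec)  # candidate for missing value filling
            continue
        if rec["departure_time"] <= rec["arrival_time"]:
            continue  # impossible travel time
        valid.append(rec)
    return valid, needs_filling


def make(card, t_in, s_in, t_out, s_out):
    """Build one hypothetical AFC record."""
    return {"card_number": card, "arrival_time": t_in, "inbound_station": s_in,
            "departure_time": t_out, "outbound_station": s_out}


# Made-up records; the date and card numbers are illustrative only.
day = (2017, 1, 3)
good = make("A001", datetime(*day, 7, 10), "Qibao",
            datetime(*day, 7, 40), "Guilin Road")
raw = [
    good,
    dict(good),  # exact duplicate
    make("A002", datetime(*day, 8, 5), "Jiuting", None, None),
    make(None, datetime(*day, 8, 30), "Sijing", datetime(*day, 9, 0), "Qibao"),
    make("A003", datetime(*day, 9, 0), "Sheshan", datetime(*day, 8, 50), "Qibao"),
]
valid, needs_filling = clean_records(raw)
```

Only the records in `needs_filling` would be passed on to the filling mechanism; the discarded rows correspond to the redundant and invalid data mentioned above.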

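In order to fill the missing values of the original data, substituting (2) into (1) yields

$$P(A \mid B) = \frac{P(A)\,\prod_{k=1}^{n} P(b_k \mid A)}{P(B)}, \tag{7}$$

where $A$ is a candidate value of the missing field and $B = (b_1, b_2, \ldots, b_n)$ are the observed values of the fields it depends on.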

Based on the card number field of the AFC data, the travel times and the OD percentages of each passenger are retrieved with SQL statements and substituted into (7). The original data are then supplemented through the parameter estimation, connection, and filling modules, which yields the results in Table 2.

4.2. Mining of Passenger Flow Period Aggregation

According to the time series discretization designed in Section 3.1, the passenger flow period is divided at a 2-hour granularity, and the single-step slide forms a series of subsequences with a width of 2.

Evaluating the membership function gives the membership of the station passenger flow aggregation for every 2-hour period.

The passenger flow period covers 6:00 to 22:00 and is divided at a 2-hour granularity into 6:00–8:00, 8:00–10:00, 10:00–12:00, 12:00–14:00, 14:00–16:00, 16:00–18:00, 18:00–20:00, and 20:00–22:00. The membership, the calculated support, and the passenger flow of each period are shown in Table 3.
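Assigning gate records to the eight periods above is a simple bucketing step, sketched below. The constants mirror the 6:00–22:00 window and 2-hour granularity stated in the text; the function names and sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

PERIOD_START_HOUR = 6    # start of the passenger flow period
PERIOD_END_HOUR = 22     # end of the passenger flow period (exclusive)
GRANULARITY_HOURS = 2    # width of each period


def period_label(timestamp):
    """Map a timestamp to its 2-hour period label, or None outside the window."""
    hour = timestamp.hour
    if not PERIOD_START_HOUR <= hour < PERIOD_END_HOUR:
        return None
    offset = (hour - PERIOD_START_HOUR) // GRANULARITY_HOURS
    start = PERIOD_START_HOUR + offset * GRANULARITY_HOURS
    return f"{start}:00-{start + GRANULARITY_HOURS}:00"


def count_by_period(entry_times):
    """Count entries per period, ignoring timestamps outside the window."""
    labels = (period_label(t) for t in entry_times)
    return Counter(label for label in labels if label is not None)


# Made-up entry timestamps for illustration.
entries = [
    datetime(2017, 1, 3, 7, 15),
    datetime(2017, 1, 3, 7, 59),
    datetime(2017, 1, 3, 8, 0),
    datetime(2017, 1, 3, 21, 59),
    datetime(2017, 1, 3, 22, 0),   # outside the window
    datetime(2017, 1, 3, 5, 30),   # outside the window
]
period_counts = count_by_period(entries)
```

Each period is treated as closed on the left and open on the right, so an entry at exactly 8:00 falls into 8:00–10:00.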

The mining algorithm above is used to calculate the passenger flow in each 2-hour period for the following stations of Shanghai Metro Line 9: Guilin Road, Caohejing Development Zone, Hechuan Road, Xingzhong Road, Qibao, Zhongchun Road, Jiuting, Sijing, Sheshan, Dongjing, and Da Xue Cheng. The results are shown in Table 4.

Figure 3 shows the trend of the temporal distribution data. It presents the passenger flow distribution of the stations in each 2-hour period visually, and this distribution can guide operation directly.

The mining results in Table 4 provide data support for identifying the travel rush hours of passengers and for the traffic adjustment schemes of rail transit stations, so that train capacity can be matched with passenger flow and passengers can be dispersed quickly into the road network. They also provide data support for determining the ratio of express to slow trains on the subway lines. This ratio can be determined according to the temporal passenger flow, thus saving energy and passenger travel time.

4.3. Mining of Passenger Flow OD Pairs Based on Correlation Rules

According to Section 3.2, the support of the OD pairs of each transportation card number is calculated and compared with the minimum support; the pairs whose support is higher than the minimum support are determined as the OD pairs. The mining results are shown in Table 5. They provide accurate decision-making support for determining the stopping scheme of the express trains in the express/slow mode.
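The per-card comparison against the minimum support can be sketched as follows. The support used here is an assumption made for illustration: the share of a card's trips that use a given OD pair, which is a crisp stand-in for the fuzzy support of Section 3.1. The card numbers, trips, and threshold are hypothetical.

```python
from collections import Counter


def frequent_od_pairs(trips, min_support):
    """Return {card: {(origin, destination): support}} for pairs whose
    support is strictly higher than min_support."""
    per_card = {}
    for card, origin, destination in trips:
        per_card.setdefault(card, Counter())[(origin, destination)] += 1

    result = {}
    for card, counts in per_card.items():
        total = sum(counts.values())
        # Support (assumed): share of this card's trips using the OD pair.
        kept = {od: n / total for od, n in counts.items()
                if n / total > min_support}
        if kept:
            result[card] = kept
    return result


# Made-up trips as (card number, inbound station, outbound station).
trips = [
    ("A001", "Qibao", "Guilin Road"),
    ("A001", "Qibao", "Guilin Road"),
    ("A001", "Qibao", "Guilin Road"),
    ("A001", "Qibao", "Sheshan"),
    ("B002", "Jiuting", "Sijing"),
    ("B002", "Sijing", "Jiuting"),
]
od_pairs = frequent_od_pairs(trips, min_support=0.5)
```

Card A001 keeps its dominant commuting pair, while card B002 has no pair above the threshold and is omitted from the result.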

Table 5 shows that mining the OD pairs in combination with the travel time of passengers provides more detailed decision support for optimizing the express/slow train scheme. To a certain extent, the accessibility of passenger travel is increased. Accessibility is particularly important during peak commuting periods, when it reduces unnecessary transfers and relatively reduces the transfer passenger flow in the road network.

5. Conclusions

Passenger transport organization on suburban rail lines has troubled operation and management enterprises for years. It involves key indicators such as the temporal passenger flow, the passenger OD pairs, and the ratio of express to slow trains. Rail transit passenger flow data are complicated in structure, and redundancy and missing values coexist in them. In this paper, we use big data processing methods to purify and supplement the original data and then explore in depth the temporal distribution rule and the OD pairs of the passenger flow. The AFC passenger flow data of Shanghai Rail Transit Line 9 are taken as input and are revised, supplemented, and mined to obtain the temporal distribution characteristics of the passenger flow of Line 9, which verifies the feasibility of the method. In the future, the temporal distribution rule and the OD pairs of passenger flow under network conditions will be mined in order to guide the preparation of the operation plan of the whole road network.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The research described in this paper was supported by the National Natural Science Foundation of China (Grant no. 71601110), the National Key Research and Development Plan of China (Grant no. 2016YFC0802505), and the National Key Technologies R & D Program of China (Grant no. 2015BAG19B02-28). The authors also gratefully acknowledge those who provided data and suggestions.