#### Abstract

Data quality is essential for reliable analysis and applications. Large volumes of automatically collected data inevitably suffer from quality issues, including missing and invalid data. This paper deals with an invalid data problem in the automated fare collection (AFC) database caused by erroneous associations between fare machines and metro stations, e.g., a fare machine located at Station A is wrongly associated with Station B in the AFC database. Such errors can lead to inappropriate fare charges in a distance-based fare system and bias the analyses used in planning and operations practice. We propose a tensor decomposition and isolation forest-based approach to detect and correct the invalidly associated fare machines in the system. The tensor decomposition extracts features of the passenger flows and travel times passing through fare machines. The isolation forest, coupled with a neural network (NN), takes these features as inputs to detect the wrongly associated fare machines and infer their correct association stations. Case studies using data from a metro system show that the proposed detection approach achieves over 90% accuracy in detecting invalid associations when up to 35% of the associations are invalid. The inferred association has a 90% accuracy even when the invalid association ratio reaches 40%. The proposed data-driven detection method is useful for large-scale data management in terms of data quality checking and fixing.

#### 1. Introduction

Smart card data collected from the automatic fare collection (AFC) system (i.e., AFC data) enable many beneficial applications in public transportation systems, such as collective and individual mobility analysis, system state monitoring, and operation planning and control [1]. The usefulness of these applications is highly dependent on data quality. AFC data are collected online and at a large scale, so they inevitably encounter data quality issues such as missing and invalid data.

Data problems are prone to happen for the following reasons:

(i) Human factors: in the AFC system, transaction records may be missing if passengers fail to tap in/out properly.

(ii) Infrastructure failure: AFC records are triggered when a passenger taps in/out through an entrance/exit fare machine. Malfunctioning fare machines may lead to missing data (the machine fails to record or upload transactions) and invalid data (erroneous transactions).

(iii) Inadequate data management: daily data management for transportation systems is a complex practice. Missing and invalid data may arise in the process of database merging, maintenance, or system update.

Among these data problems, missing and invalid data are the most critical and common. Figure 1 illustrates the characteristics of these two problems and how they differ. Missing data are recognizable and clearly identifiable from the data structure. For example, some AFC transactions may have missing tap-out records (empty cells). Invalid data, however, cannot be recognised directly since their structure is exactly the same as that of valid data. Generally, the invalid data problem can be divided into two categories: data record errors and association errors. Data record errors originate from facility malfunctions in the AFC system, as mentioned above. Data association errors occur in the process of merging different sources of data (e.g., AFC fare machine records and station dictionaries) and may come from incomplete information inference and invalid information matching.

This paper deals with the invalid data problem of detecting hidden association errors in complete and seemingly valid data. Specifically, it aims to detect invalid associations between fare machines and stations in the AFC data. For example, fare machine 001# is located in Metro Station A but is wrongly associated with Station B in the AFC database (Figure 2). The problem is prone to happen because fare machines are frequently added, replaced, etc. in metro systems, while the fare machine-station dictionary may not be updated in time. The consequences of invalid associations could be significant, e.g., under/overcharging a large number of passengers. In addition, fixing this problem manually is costly: one would have to check all the machines in the metro stations to rebuild the correct association between fare machines and stations. In particular, it is impossible to manually detect such problems in a historical dataset since the fare machine distribution at that time may not be consistent with the current system.

We develop a data-driven approach, based on tensor decomposition and machine learning techniques, to automatically detect such invalid associations using AFC data and to infer the correct association station that a fare machine belongs to. The approach works in two steps. First, tensor decomposition is used to extract the flow volume and travel time patterns of each fare machine. Then, the isolation forest technique and NN models are used to detect the incorrectly linked fare machines and infer their correct association stations based on the features extracted by the tensor decomposition.

The remainder of the paper is organised as follows: Section 2 reviews the relevant studies on data quality issues, including an overview of data quality problems, feature extraction techniques, and anomaly detection; Section 3 presents the problem formulation and methodology; Section 4 reports the case study using AFC data from a large metro system; the final section concludes the paper and discusses potential further studies and applications.

#### 2. Related Work

##### 2.1. Data Quality Problems

Data quality is one of the most important issues in the big data area. Poor data quality is costly. For example, it is reported that poor data quality costs US businesses 600 billion dollars annually [2]. In metro systems, AFC systems collect massive transaction data of metro passengers, and the literature has reported plenty of data quality problems related to AFC data. Robinson et al. [3] reported that the causes of AFC data quality problems can be grouped into four categories: (1) software; (2) data; (3) hardware; (4) user. A recurrent problem of missing boarding station information in Beijing Metro has been reported by Ma et al. [4]. Liu et al. [5] reported a time synchronisation problem between the AFC and AVL systems, which causes the recorded boarding time information to be invalid at a large scale. Network, scheduling, fare table, and similar data are important data stored in the AFC database, and errors in these data lead to significant consequences. For example, the London Oyster smart card system crashed on Saturday 12th July 2008 due to erroneous data, resulting in over 40,000 Oyster cards having to be replaced [6].

Although many studies deal with missing data in transportation, to the best of our knowledge, there is no study on detecting or fixing the association errors in transportation or other related areas, particularly the fare machine-station invalid association problem.

##### 2.2. Feature Extraction Techniques

The key idea of a data-driven detection approach is to extract the passenger flow and/or travel time patterns between fare machines and stations. Feature extraction is one of the most important issues in the machine learning field. It reduces the resources required to characterize a large dataset and/or high-dimensional input information. Plenty of feature extraction methods have been proposed in the machine learning community. These methods can be roughly divided into two groups: conventional statistical learning methods and deep learning-based methods. Conventional methods such as principal component analysis (PCA) [7], Isomap [8], and partial least squares (PLS) regression [9] are mainly based on statistical learning algorithms. Their advantage is that they are robust to small datasets, i.e., they do not need a large number of samples to maintain performance. However, their disadvantages are also critical. For example, they are not robust to noisy samples, and the feature extraction quality is highly dependent on task-specific tricks, which makes them less generalizable. Deep learning-based feature extraction methods have become increasingly popular recently. Various forms of neural networks, e.g., the convolutional neural network (CNN) [10] and the long short-term memory (LSTM) [11] neural network, can be treated as feature extraction models. Unlike statistical learning-based algorithms, they extract features in a latent, end-to-end manner. The advantage is that the extracted features are more representative and comprehensive. However, these models typically require a large dataset for training; thus, they are not suitable for the few-shot scenario. In conclusion, there is no generalized feature extraction method for all tasks. Feature extraction methods should be designed based on the characteristics of the problem at hand.

In our problem, passenger flow and travel time patterns are related to multiple modes, e.g., time and location. A tensor is a natural choice to represent and capture these patterns. A tensor is a multidimensional extension of a matrix [12], and tensors have been widely used in the transportation area to deal with multidimensional data. Tan et al. [13] utilized a tensor decomposition approach to capture the multimode correlations in traffic data and recover missing traffic data by reconstructing the traffic flow tensor. Their results show that the algorithm performs well even when the missing ratio is high. Chen et al. [14] proposed a singular value decomposition (SVD)-combined tensor decomposition framework to complete traffic data using traffic speed information. Sun and Axhausen [15] utilized a probabilistic tensor decomposition method to mine urban mobility patterns and explored the mobility patterns of different passenger groups (e.g., students, adults, and elders). In our study, we also use tensor decomposition to extract the flow pattern related to each fare machine.

##### 2.3. Anomaly Detection

The invalid associations (between fare machines and stations) are treated as anomalies. Anomaly detection is an important topic in data mining. Anomaly detection methods can be roughly divided into three categories: statistical, machine learning, and deep learning models.

(1) Statistical methods: statistical methods are the early explorations of anomaly detection. The methods in this category first make assumptions about the distribution of the studied dataset. The samples with low probabilities are then treated as anomalies. Rousseeuw and Driessen [16] proposed an anomaly detection method based on a Gaussian assumption on the data. The performance of statistical anomaly detection methods depends highly on how well the assumption fits reality, which limits their performance.

(2) Machine learning-based methods: the most widely used anomaly detection methods are machine learning-based methods, which generally fall into two categories: supervised and unsupervised methods. Supervised methods [17, 18] apply to datasets in which the training data are labeled as “nominal” or “anomaly.” The models are trained with the labeled data and used to identify new instances. Unsupervised methods deal with datasets without labels and automatically detect anomalies based on certain criteria. Popular unsupervised methods include LOF [19], DBSCAN [20], *k*-means [21], and the isolation forest [22].

(3) Deep learning-based methods: the emerging deep learning models bring new opportunities to better solve the anomaly detection problem. Hundman et al. [23] proposed an LSTM network-based framework for anomaly detection; a generative adversarial network (GAN) is utilized in [24] to detect anomalies in time series data. Nguyen et al. [25] detect anomalies by constructing model snapshots and outputting ensembles of the NN models. Deep learning-based methods tend to have more promising performance than other techniques. However, these methods require a large amount of training data to produce reasonable results. Their performance is low in scenarios with a small set of training data, e.g., the fare machine-station association problem studied in this paper.

#### 3. Methodology

##### 3.1. Problem Formulation

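Let $m$ be a fare machine, and let $s_m \in \mathcal{S}$ and $\hat{s}_m \in \mathcal{S}$ be its actual station and its current association station in the AFC dataset, respectively, where $\mathcal{S}$ contains all the stations in the metro system. Note that different fare machines could share the same station, i.e., be located in the same station. If $\hat{s}_m = s_m$, fare machine $m$ is defined as a valid association fare machine; if $\hat{s}_m \neq s_m$, fare machine $m$ is defined as an invalid association fare machine. The fare machine-station association detection problem is defined as follows.

Given an AFC dataset $\mathcal{D}$ and a set of fare machines $\mathcal{M}$ recorded in $\mathcal{D}$, detect the invalid association fare machines in $\mathcal{M}$ and infer their actual associated stations.

Mathematically, the problem is defined as follows:

(i) *Invalid Association Detection*. Find $\mathcal{M}_I \subseteq \mathcal{M}$, s.t. $\hat{s}_m \neq s_m$ for all $m \in \mathcal{M}_I$, and $\hat{s}_m = s_m$ for all $m \in \mathcal{M} \setminus \mathcal{M}_I$.

(ii) *Station Inference*. For each fare machine $m$ in $\mathcal{M}_I$, find $s^{*} \in \mathcal{S}$ s.t. $s^{*} = s_m$.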

##### 3.2. Fare Machine Features

For convenience, we define the concept of fare machine-related passenger flow (MRF). For an entrance fare machine, MRF refers to the passenger flow tapping in an entrance fare machine of the origin station and tapping out at a destination station (using any machine) during a certain time slot. For an exit fare machine, MRF represents the passenger flow tapping in at an origin station (using any machine) and tapping out at an exit fare machine during a certain time slot. MRF can be characterized using different features, such as flow volume and travel time. Indicators extracted from the MRF features can be used to characterize fare machines. The hypothesis is that MRF features share more similar patterns if the fare machines are located at the same station than at different stations.

The flow volume and travel time are selected to characterize the MRFs of fare machines. These two features reflect system dynamics from both the demand (mobility patterns) and supply (network and operations) points of view as well as their interactions. They provide complementary knowledge and therefore give a more comprehensive view of the MRF patterns. They are defined for entrance and exit fare machines separately:

(i) For entrance fare machines, MRF flow volume measures the number of passengers passing through each fare machine at an origin station and going to a destination station. For exit fare machines, it represents the number of passengers entering the metro system at an origin station and tapping out through an exit fare machine. MRF flow volume reflects the mobility behavior of passengers.

(ii) MRF travel time indicates the average travel time from a fare machine to a destination station for entrance fare machines and from an origin station to a fare machine for exit fare machines. It reflects the supply characteristics of the metro system, e.g., the geographical relationship between stations and scheduling, but also the demand characteristics of certain stations, as it includes the time spent waiting to board a train under capacity constraints.

Figure 3 shows the overview of the proposed framework. It consists of three modules: MRF feature extraction, invalid association detection, and associated station inference:

(i) MRF feature extraction module: it constructs the MRF flow volume and travel time tensors to characterize fare machines and extracts latent MRF flow and travel time features using the tensor decomposition technique.

(ii) Invalid association detection module: it detects the invalid associations (between fare machines and stations) in two steps. The valid and invalid associations are initially detected using the isolation forest method. Then, the invalid associations are reinspected (the feedback arrow) using neural networks (trained with the valid association data).

(iii) Association station inference module: it infers the station that a fare machine (detected as an invalid association) belongs to using the trained neural networks.

##### 3.3. MRF Tensor Construction

For data representation, tensors are used to characterize the MRF flow volume and travel time. A tensor is a high-order generalization of a matrix. The multiway property of a tensor fits the nature of MRF features. For example, MRF flow volume can be characterized by “machine mode” (*M*), “time mode” (*T*), “day mode” (*D*), and “station mode” (*S*). For entrance fare machines, “machine mode” denotes the related fare machine ID, “time mode” represents the time interval of a day (e.g., 6:00 to 7:00 AM), “day mode” denotes the date, and “station mode” denotes a destination station ID. For exit fare machines, the definitions of the tensor modes are the same as for entrance fare machines, except that the “station mode” is the origin station ID. In this way, two 4-way tensors are used to represent the MRF flow volume of entrance and exit fare machines, respectively. For example, an entry of 50 at (A, 8:00 to 9:00 AM, January 1, B) in the entrance machine tensor means “the passenger flow volume passing through entrance machine A in the interval 8:00 to 9:00 AM on January 1 and exiting at Station B is 50 passengers.” The methodology for fare machine-station association is the same for entrance and exit fare machines, so entrance fare machines are used to illustrate the proposed framework. Unless stated otherwise, “fare machines” and “MRF tensors” refer to entrance fare machines and entrance MRF tensors, respectively.

To construct the MRF flow volume tensor, the mode variables above are transformed into numerical indices:

(i) Machine mode: the fare machines are labeled from 1 to $M$, so the machine IDs belong to the set $\{1, 2, \dots, M\}$, where $M$ represents the total number of fare machines.

(ii) Time mode: the hourly interval is used to represent the tap-in time $t \in \{1, 2, \dots, T\}$. Note that only the operating hours of the metro system are considered, where the element $t$ denotes the $t$-th operating hour of the day.

(iii) Day mode: the day mode represents the date, thus $d \in \{1, 2, \dots, D\}$, where 1 and $D$ represent the first and the last day of the studied time span, respectively.

(iv) Station mode: the stations are labeled from 1 to $S$, where $S$ denotes the number of stations in the metro system.

The MRF flow volume is represented by a tensor $\mathcal{X}$ of size $M \times T \times D \times S$. Figure 4 shows the structure of the MRF flow volume tensor. Each entry of $\mathcal{X}$, denoted as $x_{mtds}$, represents the MRF flow volume entering through fare machine $m$ and exiting at destination station $s$ during time interval $t$ of day $d$. For the exit fare machines, the tensor construction procedure is the same as for the entrance machines. Accordingly, the entry $x_{mtds}$ denotes the MRF volume entering through station $s$ and tapping out through fare machine $m$ during time interval $t$ of day $d$.

Similarly, the MRF travel time tensor is denoted as $\mathcal{Y}$, which has the same size as $\mathcal{X}$. An entry $y_{mtds}$ in $\mathcal{Y}$ represents the average travel time of all passengers entering through fare machine $m$ and traveling to destination station $s$ during time interval $t$ of day $d$.

The properties of the MRF flow volume and travel time tensors are different, though they share the same structure. The difference stems from the tensor cells that have no AFC observation. For the MRF flow volume tensor, the value of such cells is 0 since the MRF flow volume for the corresponding combination of machine, time interval, day, and station is 0 (no passenger flow). For the travel time tensor, however, cells without observations cannot simply be filled with zeros. A missing observation in the MRF travel time tensor only means that no passengers traveled in the specified case; the corresponding travel time is unknown, not 0. An initial idea is to fill these cells with the average travel time of the corresponding OD pairs in the historical data. Unfortunately, unobserved cells account for a large share of the MRF travel time tensor (e.g., 63.5% in the studied AFC dataset), so it is hard to estimate a reasonable average travel time for each cell based on limited information. Instead, “NaN” values are used to fill those cells to represent the unknown travel times.
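The construction of the two entrance tensors can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout `(machine, hour, day, destination, travel_time)` with zero-based indices, the function name `build_mrf_tensors`, and the toy records are all assumptions made for the example.

```python
import numpy as np

def build_mrf_tensors(records, n_machines, n_hours, n_days, n_stations):
    """Build the entrance MRF flow volume and travel time tensors.

    `records` is an iterable of (machine, hour, day, dest_station, travel_time)
    tuples with zero-based indices (a hypothetical record layout).
    """
    shape = (n_machines, n_hours, n_days, n_stations)
    flow = np.zeros(shape)        # cells without observations stay 0 (no flow)
    time_sum = np.zeros(shape)    # accumulated travel time per cell
    for m, t, d, s, travel_time in records:
        flow[m, t, d, s] += 1
        time_sum[m, t, d, s] += travel_time
    mean_time = np.full(shape, np.nan)   # cells without observations are NaN
    observed = flow > 0
    mean_time[observed] = time_sum[observed] / flow[observed]
    return flow, mean_time

# Toy data: two trips through machine 0 and one trip through machine 1.
records = [(0, 2, 0, 1, 20.0), (0, 2, 0, 1, 30.0), (1, 3, 0, 0, 15.0)]
flow, mean_time = build_mrf_tensors(records, 2, 4, 1, 2)
```

Keeping the zero fill for flow and the NaN fill for travel time in separate arrays makes the different treatment of unobserved cells explicit.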

##### 3.4. Tensor Decomposition

Tensor decomposition is used to extract fare machine features from the MRF flow volume and travel time tensors. Given the different properties of these two tensors, different tensor decomposition methods are developed to extract the MRF flow volume and travel time features, respectively.

###### 3.4.1. Tensor Decomposition of MRF Flow Volume

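For the MRF flow volume tensor $\mathcal{X} \in \mathbb{R}^{M \times T \times D \times S}$, the CANDECOMP/PARAFAC (CP) decomposition [12] is used to extract the fare machine features. CP decomposition factorizes a tensor into a sum of rank-1 tensors. A rank-1 tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ ($I_n$ is the dimension of mode $n$) is an outer product of $N$ vectors: $\mathcal{A} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \cdots \circ \mathbf{a}^{(N)}$, where $\mathbf{a}^{(n)} \in \mathbb{R}^{I_n}$ denotes a vector, $a^{(n)}_{i}$ denotes the $i$-th element of $\mathbf{a}^{(n)}$, and the symbol $\circ$ denotes the outer product of vectors.

The CP decomposition of $\mathcal{X}$ can be formulated as follows:

$$\mathcal{X} \approx \hat{\mathcal{X}} = \sum_{r=1}^{R} \mathbf{m}_r \circ \mathbf{t}_r \circ \mathbf{d}_r \circ \mathbf{s}_r,$$

where $R$ represents the total number of components, and $\mathbf{m}_r \in \mathbb{R}^{M}$, $\mathbf{t}_r \in \mathbb{R}^{T}$, $\mathbf{d}_r \in \mathbb{R}^{D}$, and $\mathbf{s}_r \in \mathbb{R}^{S}$ represent the $r$-th component vectors of the machine, time, day, and station modes, respectively. Figure 5 illustrates the process of CP decomposition of $\mathcal{X}$.

Computing the CP decomposition of $\mathcal{X}$ can be treated as an optimization problem. The goal is to find a CP decomposition with $R$ components that best approximates $\mathcal{X}$. The decomposition is the solution of the following optimization problem, i.e., find

$$\min_{\hat{\mathcal{X}}} \left\| \mathcal{X} - \hat{\mathcal{X}} \right\|_F^2 \quad \text{with} \quad \hat{\mathcal{X}} = \sum_{r=1}^{R} \mathbf{m}_r \circ \mathbf{t}_r \circ \mathbf{d}_r \circ \mathbf{s}_r,$$

where $\|\cdot\|_F$ denotes the Frobenius norm. This optimization problem can be solved using the alternating least squares (ALS) method [26]. Details of the solution procedure can be found in [12].

The feature matrix $\mathbf{U}_X = [\mathbf{m}_1, \mathbf{m}_2, \dots, \mathbf{m}_R] \in \mathbb{R}^{M \times R}$ is constructed from all the component vectors of the machine mode. Since each entry in $\hat{\mathcal{X}}$ is calculated from the outer products of all 4 component vectors, $\mathbf{U}_X$ can be treated as an indicator of the hidden information of the other 3 modes. The entries in $\hat{\mathcal{X}}$ that are related to fare machine $m$ are calculated using only the elements in row $m$ of $\mathbf{U}_X$. Therefore, each row of $\mathbf{U}_X$ can be used as a latent feature vector to represent the MRF flow volume pattern of the corresponding fare machine.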
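To make the ALS procedure concrete, the sketch below fits a CP model to a small synthetic 4-way tensor with NumPy and reads the machine-mode factor matrix as the per-machine feature matrix. It is a minimal textbook-style ALS under stated assumptions: the helper names (`khatri_rao`, `unfold`, `cp_als`, `reconstruct`), the random initialisation, the fixed iteration count, and the toy tensor are illustrative choices rather than the configuration used in the paper.

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Kronecker product of a list of (I_n x R) matrices."""
    out = mats[0]
    for m in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, m).reshape(-1, out.shape[1])
    return out

def unfold(x, mode):
    """Mode-n unfolding; the remaining modes are flattened in C order."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def reconstruct(factors):
    """Sum of rank-1 tensors defined by the columns of the factor matrices."""
    shape = tuple(f.shape[0] for f in factors)
    return khatri_rao(factors).sum(axis=1).reshape(shape)

def cp_als(x, rank, n_iter=300, seed=0):
    """Alternating least squares: update one factor matrix at a time."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in x.shape]
    for _ in range(n_iter):
        for n in range(x.ndim):
            others = [factors[k] for k in range(x.ndim) if k != n]
            gram = np.ones((rank, rank))
            for f in others:
                gram *= f.T @ f          # Hadamard product of Gram matrices
            factors[n] = unfold(x, n) @ khatri_rao(others) @ np.linalg.pinv(gram)
    return factors

# Synthetic rank-2 tensor with modes (machine, time, day, station).
rng = np.random.default_rng(1)
true_factors = [rng.standard_normal((dim, 2)) for dim in (5, 4, 3, 6)]
x = reconstruct(true_factors)

factors = cp_als(x, rank=2)
machine_features = factors[0]    # one row of latent features per fare machine
rel_error = np.linalg.norm(x - reconstruct(factors)) / np.linalg.norm(x)
```

Each row of `machine_features` plays the role of the latent flow volume feature vector of one fare machine.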

###### 3.4.2. Tensor Decomposition of MRF Travel Time

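CP decomposition cannot be applied directly to extract travel time features. This is because the travel time tensor $\mathcal{Y}$ has nonnumerical (i.e., NaN) entries, which makes the operation infeasible. A variation of CP decomposition, CP Weighted OPTimization (CP-WOPT) [27], is used for the MRF travel time tensor decomposition. CP-WOPT is widely used to recover tensors with missing entries. It utilizes a weight tensor $\mathcal{W}$ to indicate the location of the NaN entries. The formulation is as follows:

$$\min_{\hat{\mathcal{Y}}} \left\| \mathcal{W} * \left( \mathcal{Y} - \hat{\mathcal{Y}} \right) \right\|_F^2 \quad \text{with} \quad \hat{\mathcal{Y}} = \sum_{r=1}^{R} \tilde{\mathbf{m}}_r \circ \tilde{\mathbf{t}}_r \circ \tilde{\mathbf{d}}_r \circ \tilde{\mathbf{s}}_r,$$

where $*$ denotes the elementwise product and $\tilde{\mathbf{m}}_r$, $\tilde{\mathbf{t}}_r$, $\tilde{\mathbf{d}}_r$, and $\tilde{\mathbf{s}}_r$ are the $r$-th component vectors of the machine, time, day, and station modes of the travel time tensor.

The weight tensor $\mathcal{W}$ has the same shape as $\mathcal{Y}$ and is defined as

$$w_{mtds} = \begin{cases} 1, & \text{if } y_{mtds} \text{ is observed}, \\ 0, & \text{if } y_{mtds} \text{ is NaN}. \end{cases}$$

In the initialization phase, NaN cells are filled with random values. As these values are multiplied by 0 during the optimization, they do not influence the optimization objective (or the optimal solution). After optimization, $\hat{\mathcal{Y}}$ represents the features of the observed travel time data well. As there are strong relationships between the cells in $\mathcal{Y}$, the features of the entries without observations can also be represented in the reconstructed tensor $\hat{\mathcal{Y}}$. A feature matrix $\mathbf{U}_Y = [\tilde{\mathbf{m}}_1, \tilde{\mathbf{m}}_2, \dots, \tilde{\mathbf{m}}_R]$ is constructed using the machine mode component vectors of $\hat{\mathcal{Y}}$ to represent the multimode travel time features of fare machines. Details about the CP-WOPT method can be found in [27].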
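The masking idea can be illustrated with a small weighted CP fit. Note the substitution: the sketch below minimises the masked squared error with plain gradient descent and a backtracking step size, which is a simplification chosen for brevity and not the optimizer described in [27]. The function names, the toy tensor, and the 50% missing ratio are assumptions made for the example; the constant factor 1/2 in the loss does not change the minimizer.

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Kronecker product of a list of (I_n x R) matrices."""
    out = mats[0]
    for m in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, m).reshape(-1, out.shape[1])
    return out

def reconstruct(factors):
    shape = tuple(f.shape[0] for f in factors)
    return khatri_rao(factors).sum(axis=1).reshape(shape)

def weighted_loss(y_filled, w, factors):
    """0.5 * || W * (Y - Y_hat) ||_F^2, so masked cells contribute nothing."""
    return 0.5 * np.sum((w * (y_filled - reconstruct(factors))) ** 2)

def weighted_cp(y, rank, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    w = (~np.isnan(y)).astype(float)                   # 1 = observed, 0 = NaN
    y_filled = np.where(np.isnan(y), rng.random(y.shape), y)  # random fill, masked by w
    factors = [rng.random((dim, rank)) for dim in y.shape]
    loss = weighted_loss(y_filled, w, factors)
    history, step = [loss], 1.0
    for _ in range(n_iter):
        resid = w * (y_filled - reconstruct(factors))
        grads = []
        for n in range(y.ndim):
            others = [factors[k] for k in range(y.ndim) if k != n]
            unfolded = np.moveaxis(resid, n, 0).reshape(y.shape[n], -1)
            grads.append(-unfolded @ khatri_rao(others))
        while step > 1e-12:                            # backtracking line search
            trial = [f - step * g for f, g in zip(factors, grads)]
            trial_loss = weighted_loss(y_filled, w, trial)
            if trial_loss < loss:
                factors, loss, step = trial, trial_loss, step * 2.0
                break
            step *= 0.5
        history.append(loss)
    return factors, w, history

# Toy rank-1 travel time tensor with about half of the cells unobserved.
rng = np.random.default_rng(1)
true_factors = [rng.random((dim, 1)) + 0.5 for dim in (4, 3, 2, 5)]
y = reconstruct(true_factors)
y[rng.random(y.shape) < 0.5] = np.nan

factors, w, history = weighted_cp(y, rank=1)
machine_features = factors[0]    # travel time features, one row per machine
```

Because the residual is multiplied by `w` before it enters both the loss and the gradients, the random fill values never influence the fit.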

The MRF flow volume and travel time feature vectors of each fare machine are concatenated into one single vector to characterize the corresponding fare machine.

##### 3.5. Fare Machine-Station Association

As fare machines at the same station share similar surrounding points of interest (POIs), the MRF features of these fare machines tend to be similar. Therefore, we first extract the MRF feature of each station. Then, the MRF feature of each machine is compared to the station MRF features. If a fare machine has an MRF feature similar to that of a station, then this station is likely to be the association station of the fare machine. We divide the inference process into two successive problems, P1 and P2.

###### 3.5.1. P1: Invalid Association Fare Machine Detection

To solve P1, we first make two assumptions. (1) The MRF features of the invalid associations are anomalies with respect to their recorded stations. More formally, let $c(\cdot)$ be the count function and, for a fare machine $m$, let $s_m$ and $\hat{s}_m$ denote its actual and recorded stations; anomaly means $c(\{m : \hat{s}_m = s,\ s_m = s'\}) \ll c(\{m : \hat{s}_m = s,\ s_m = s\})$ for any $s' \neq s$. This indicates that the number of fare machines with association station $s'$ but recorded station $s$ should be far less than the number of valid association fare machines in $s$. Note that this assumption does not mean that the total number of invalid association fare machines recorded as $s$ is less than the number of valid fare machines. We only require that the fare machines recorded as $s$ but actually associated with one particular station $s'$ be a minority relative to the valid fare machines of $s$. This assumption is reasonable since the errors that lead to invalid fare machine-station associations tend to be random; for example, it is unlikely that many fare machines located in the same station are wrongly recorded as another station simultaneously. (2) The invalid associations happen randomly. This assumption indicates that a fare machine in station $s$ has an equal probability of being wrongly associated with any of the other stations in the system. This assumption is reasonable since the invalid associations are mainly caused by inadequate data management in the process of database merging, maintenance, or system update.

Based on these assumptions, the isolation forest method is adopted to solve P1. The isolation forest is an unsupervised model for anomaly detection, which can be applied directly to a contaminated dataset. The only requirement of this method is that the outliers be few and different from the normal instances, which exactly fits the aforementioned assumptions. The isolation forest detects outliers using a special measurement: partitions. It “isolates” observations by randomly selecting a dimension of the MRF feature vector and then randomly selecting a split value between the maximum and minimum values of the selected dimension. Since recursive partitioning can be represented by a tree structure, the number of splits required to isolate an MRF feature is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular fare machines, they are highly likely to be anomalies [22].
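The partitioning mechanism can be sketched with a small pure-Python isolation forest. This is an illustrative simplification in the spirit of [22], not the implementation used in the experiments: the tree representation, the default parameters (`n_trees=100`, `sample_size=64`), the 2-D toy features, and the choice to flag the three highest scores are assumptions for the example. The sketch assumes at least two points per subsample.

```python
import math
import random

def build_tree(points, depth, max_depth, rng):
    """Recursively isolate points with random axis-aligned splits."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def path_length(point, node, depth=0):
    if "dim" not in node:                       # leaf: add expected remainder
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    scores = []
    for p in points:
        mean_path = sum(path_length(p, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_path / c_factor(sample_size)))
    return scores

# Toy features: 60 machines of one station plus 3 wrongly associated machines.
data_rng = random.Random(42)
features = [[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(60)]
features += [[8.0, 0.0], [0.0, -9.0], [-10.0, 10.0]]
scores = anomaly_scores(features)
flagged = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
```

A higher score corresponds to a shorter average path, so the machines in `flagged` are the candidates for invalid associations.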

Based on the results from the isolation forest, we can divide the fare machine MRF feature vectors into two parts:

(i) $F_I$ contains all the MRF feature vectors that are inferred as invalid (i.e., abnormal) by the isolation forest.

(ii) $F_V$ contains all the MRF feature vectors that are inferred as valid (i.e., normal) by the isolation forest.

The fare machines with their MRF features in $F_V$ are detected as valid, while the fare machines in $F_I$ are reinspected in the process of solving P2.

###### 3.5.2. P2: Association Station Inference

In P2, a reinspection of the fare machines detected as invalid in P1 (the set $F_I$) is conducted to refine the detection results from P1. The reinspection identifies which associations in $F_I$ were wrongly detected as invalid. In practical applications, the inference gives operators a sense of the data quality in their AFC database. The model outputs the potential association stations of the detected invalid association fare machines, which facilitates effective field investigation and reduces manpower.

A neural network (NN) is used to model the station MRF feature using the MRF features detected as valid in P1 (the set $F_V$). As the number of samples (i.e., fare machines) is limited (e.g., 2000 fare machines in the studied network), the NN training may face underfitting issues. We therefore build one shallow neural network for each station, denoted as the station-NN. For the station-NN $f_s$ of station $s$, we label the fare machines in $F_V$ whose recorded station is $s$ as 1 and the other fare machines in $F_V$ as 0. It is inadequate to train the station-NN directly with the labeled features. Since a metro system has many stations (e.g., 90 stations in the studied metro system), for any given station, the number of positive samples (i.e., MRF features labeled as 1) is much smaller than the number of negative samples (MRF features labeled as 0), which leads to learning bias. We utilize the adaptive synthetic sampling (ADASYN) [28] approach to oversample the positive samples, ensuring that the number of oversampled positive samples is similar to the number of negative ones. $f_s$ is then trained with the oversampled MRF features and their corresponding labels. After $f_s$ is well trained, the output of the network is the probability that the input fare machine MRF feature belongs to this station.

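For an MRF feature in $F_I$ (the set detected as invalid in P1), we input it into all the well-trained station-NNs. Let $\mathbf{p} = (p_1, p_2, \dots, p_S)$ denote the output probabilities from the station-NNs, and let $(s_{(1)}, s_{(2)}, \dots, s_{(S)})$ be the permutation of the stations in descending order of $\mathbf{p}$, where $p_{s_{(1)}} \geq p_{s_{(2)}} \geq \cdots \geq p_{s_{(S)}}$. The top-$k$ stations $\mathcal{S}_k = \{s_{(1)}, \dots, s_{(k)}\}$ are the most probable association stations of the corresponding fare machine.

Using $\mathcal{S}_k$, the reinspection for P1 is conducted for the fare machines in $F_I$ with the following rule: given a fare machine $m$ with recorded station $\hat{s}_m$, if $\hat{s}_m \notin \mathcal{S}_k$, $m$ is inferred as invalid; otherwise, it is inferred as valid. For the fare machines inferred as invalid after the reinspection, the top-$k$ stations are treated as the set of potential association stations. In practice, one can first check the stations in this set to find out whether the fare machine is located there.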
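The reinspection step can be sketched as a small ranking function. The rule implemented here (a flagged machine stays invalid when its recorded station is not among the top-k stations) is an assumption about the exact decision rule; the function name `reinspect`, the dictionary input format, and the toy probabilities are hypothetical.

```python
def reinspect(recorded_station, station_probs, k=3):
    """Return (is_invalid, top_k_stations) for one flagged fare machine.

    `station_probs` maps each station ID to the probability output by that
    station's network (hypothetical input format).
    """
    ranked = sorted(station_probs, key=station_probs.get, reverse=True)
    top_k = ranked[:k]
    # Assumed rule: still invalid if the recorded station is not in the top-k.
    return recorded_station not in top_k, top_k

probs = {"A": 0.10, "B": 0.70, "C": 0.15, "D": 0.05}
invalid_a, candidates_a = reinspect("A", probs, k=2)
invalid_b, candidates_b = reinspect("B", probs, k=2)
```

For a machine that remains invalid, the returned candidate list is the set of stations to check first in a field investigation.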

#### 4. Case Study

We utilize AFC data from an urban metro system to evaluate the proposed detection and inference approach. The data cover 7 days, from January 15 to 21, 2018. The fare machine-station association information was carefully checked to ensure its validity as a benchmark. Figure 6 illustrates the statistics of the number of machines in the metro system during the studied time span.

##### 4.1. Experimental Setup

We randomly select 1000 entrance fare machines and 1000 exit fare machines and collect the corresponding AFC transaction records to construct the experimental dataset. We then randomly choose a set of fare machines and modify their associated stations to create invalid associations. The proposed approach is validated with the ratio of invalidly associated fare machines ranging from 5% to 40%. The approach is run 20 times per scenario to reduce the effect of randomness. Table 1 summarizes the model parameters used in the experiments.
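The corruption step of the experiment can be sketched as follows. This is an illustrative reading of the setup, assuming that each selected machine is reassigned to a uniformly chosen different station; the function name `corrupt_associations`, the machine and station IDs, and the seed are hypothetical.

```python
import random

def corrupt_associations(true_station, stations, ratio, seed=0):
    """Reassign a given ratio of fare machines to a wrong station at random."""
    rng = random.Random(seed)
    machines = sorted(true_station)
    n_invalid = round(ratio * len(machines))
    invalid = set(rng.sample(machines, n_invalid))
    recorded = dict(true_station)
    for m in invalid:
        # Pick uniformly among all stations except the actual one.
        recorded[m] = rng.choice([s for s in stations if s != true_station[m]])
    return recorded, invalid

stations = ["A", "B", "C", "D", "E"]
true_station = {f"M{i:02d}": stations[i % 5] for i in range(20)}
recorded, invalid = corrupt_associations(true_station, stations, ratio=0.25)
```

Keeping the set `invalid` alongside the corrupted dictionary provides the ground truth needed to score the detection afterwards.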

##### 4.2. Performance Evaluation

Table 2 shows the tabularised relations between truth/falseness of the detection and valid/invalid association.

A set of performance metrics is used to comprehensively evaluate the model performance, including accuracy (Accu), true positive rate (TPR), and false positive rate (FPR):

$$\text{Accu} = \frac{N_c}{N},$$

where $N$ is the total number of associations (or fare machines) and $N_c$ is the number of correctly detected associations (between fare machines and stations). The correctly detected fare machines include the cases that are truly positive (invalid associations inferred as invalid) and truly negative (valid associations inferred as valid).

$$\text{TPR} = \frac{TP}{N_I},$$

where $TP$ is the number of truthfully detected invalid associations (an invalid association correctly inferred as invalid) and $N_I$ is the number of invalid associations. TPR measures the model’s sensitivity towards invalid associations.

$$\text{FPR} = \frac{FP}{N_V},$$

where $FP$ is the number of falsely detected valid associations (a valid association falsely inferred as invalid) and $N_V$ is the total number of valid associations. FPR measures the misjudgement rate of the valid associations.
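As a worked illustration of these three metrics, the sketch below computes them from boolean ground-truth and predicted labels, where `True` means "invalid association". The function name and the five-machine toy example are assumptions made for the illustration.

```python
def detection_metrics(is_invalid_true, is_invalid_pred):
    """Compute accuracy, TPR, and FPR; True marks an invalid association."""
    pairs = list(zip(is_invalid_true, is_invalid_pred))
    tp = sum(1 for t, p in pairs if t and p)            # invalid, flagged
    tn = sum(1 for t, p in pairs if not t and not p)    # valid, not flagged
    fp = sum(1 for t, p in pairs if not t and p)        # valid, flagged
    n_invalid = sum(1 for t, _ in pairs if t)
    n_valid = len(pairs) - n_invalid
    return {
        "accu": (tp + tn) / len(pairs),
        "tpr": tp / n_invalid if n_invalid else float("nan"),
        "fpr": fp / n_valid if n_valid else float("nan"),
    }

truth = [True, True, False, False, False]
pred = [True, False, True, False, False]
metrics = detection_metrics(truth, pred)
```

In this toy case one of the two invalid associations is found and one of the three valid associations is wrongly flagged, giving an accuracy of 3/5, a TPR of 1/2, and an FPR of 1/3.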

###### 4.2.1. Evaluation of Invalid Association Detection (P1)

Figure 7 shows the detection results with the invalid association ratio ranging from 5% to 40%. The results indicate that the isolation forest model is robust to invalid associations when the invalid association ratio is less than 20% (the detection accuracy is over 96%). It still achieves a detection accuracy of 87% even when 40% of the fare machines are wrongly associated with stations in the data. The TPR is an essential characteristic of the detection of invalid associations in P1, since the associations detected as valid are not reinspected in the following procedures of the approach. That is, wrongly associated fare machines that are detected as valid will remain undetected, which may eventually impact practical applications. It is also favorable to detect more invalid associations to ensure a clean MRF feature set for each station, which benefits the correction of invalid associations in P2. The TPR is over 90% when the invalid association ratio is less than 20%, which indicates the promising performance of the proposed approach in detecting invalid associations. The rate of falsely detected valid associations (FPR) is very low (less than 5%), and it decreases as the invalid association ratio increases, as expected.

[Figure 7, panels (a), (b), and (c): detection results of associations for invalid association ratios from 5% to 40%.]

###### 4.2.2. Evaluation of Association Inference (P2)

For the P2 evaluation (rematching wrongly associated fare machines to stations), we quantify the model’s capability to assign large probabilities to the correctly matched stations using the top-*k* accuracy measure. For a given *k*, it measures the probability that the inferred set of the top-*k* stations (ordered by probability) includes the station with which the fare machine is actually associated:

$$\text{Accu}_k = \frac{N_k}{N_D},$$

where $N_k$ is the number of fare machines in $D$ whose matched station is contained in the inferred top-*k* station set, $N_D$ is the number of fare machines in $D$, and $D$ is the set of fare machines whose associations are inferred in P2.
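The top-*k* accuracy can be sketched as follows. This is only an illustration under stated assumptions: the function name `top_k_accuracy`, the dictionary representation of the per-station probabilities, and the station labels and probability values are all hypothetical and are not drawn from the study.

```python
def top_k_accuracy(station_probs, true_stations, k):
    """Fraction of fare machines whose true station is among the k stations
    with the highest inferred probability.

    station_probs: list of dicts mapping station -> probability,
                   one dict per fare machine.
    true_stations: list of the actual station of each fare machine.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(station_probs) != len(true_stations) or not true_stations:
        raise ValueError("inputs must be non-empty and of equal length")

    hits = 0
    for probs, true_station in zip(station_probs, true_stations):
        # Rank stations by decreasing probability and keep the first k.
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        if true_station in top_k:
            hits += 1
    return hits / len(true_stations)


# Toy example (made-up values): 4 fare machines, 3 candidate stations.
probs = [
    {"A": 0.6, "B": 0.3, "C": 0.1},  # true station A, ranked 1st
    {"A": 0.2, "B": 0.5, "C": 0.3},  # true station C, ranked 2nd
    {"A": 0.5, "B": 0.4, "C": 0.1},  # true station C, ranked 3rd
    {"A": 0.1, "B": 0.2, "C": 0.7},  # true station C, ranked 1st
]
truth = ["A", "C", "C", "C"]
acc_by_k = {k: top_k_accuracy(probs, truth, k) for k in (1, 2, 3)}
print(acc_by_k)
```

By construction the measure is non-decreasing in *k*, which is why a larger *k* trades a longer candidate list for a higher chance of containing the true station.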

Table 3 summarizes the model performance of P2 with varied levels of invalid association ratios in the dataset.

The results show that the top-*k* accuracy exceeds 90% when *k* is greater than 3, regardless of the invalid association ratio. This indicates that the top 3 inferred stations from the model are highly likely to include the correctly associated station of the studied fare machine. This has an important implication for field investigations in practice: the most probable stations to which an invalidly associated fare machine may belong can be checked first.

##### 4.3. Latent Feature Analysis

The effectiveness of the detection and inference models rests on the quality of the MRF features. That is, fare machines at different stations should preferably have significantly different MRF features. To explore the feature quality, we use principal component analysis (PCA) [7] to reduce the dimension of the MRF feature vector to two. We randomly choose 5 stations in the studied metro system, select one station as the reference station, and compare its MRF feature vectors with those of each of the other 4 stations.
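The two-dimensional projection can be illustrated with a minimal PCA sketch based on the singular value decomposition. The helper name `pca_2d`, the feature dimension of 8, and the synthetic feature vectors for two stations are assumptions made for illustration only; they do not reproduce the MRF features of the study.

```python
import numpy as np


def pca_2d(features):
    """Project feature vectors (one row per fare machine) onto the first
    two principal components.

    Returns the projected points and the explained variance ratio of the
    two retained components.
    """
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:2].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:2]


# Synthetic stand-in for feature vectors of fare machines at two stations:
# 6 machines per station, 8 features each, with different station means.
rng = np.random.default_rng(0)
station_ref = rng.normal(loc=0.0, scale=0.1, size=(6, 8))
station_other = rng.normal(loc=1.0, scale=0.1, size=(6, 8))
points, explained = pca_2d(np.vstack([station_ref, station_other]))
print(points.shape, explained)
```

When the features of two stations differ clearly, as in this synthetic case, the two groups fall on opposite sides of the first principal component in the projected plane.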

Figure 8 shows the MRF feature visualization results. The MRF features differ significantly between stations, which indicates high feature quality. This allows the model to form a relatively distinct MRF feature for each station, which in turn makes it effective in detecting the invalid associations and inferring the associated station of the fare machines. The MRF features of fare machines also exhibit different patterns at different stations. For example, the MRF features of machines in Station E (Figure 8(d)) are very similar to each other, while the MRF features of Station B (Figure 8(a)) are more dispersed. The reason lies partly in the different layouts of the stations. For some large stations (e.g., transfer stations in the commercial center), there are many gates for entering and exiting the station, which may lead to variance in travel time between the same OD pairs. This is likely the main reason for the missed and wrong detections of the proposed model.

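[Figure 8, panels (a), (b), (c), and (d): two-dimensional PCA visualization of the MRF features of the reference station compared with each of the other four stations.]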

#### 5. Conclusion

Ensuring data quality is essential for the effective use of data in practice. This paper proposes a model to detect invalid data in the AFC dataset caused by erroneous associations between fare machines and stations (e.g., due to delayed dictionary updates or incorrect data merging). It combines tensor decomposition, isolation forest, and NN methods to detect the invalid associations in the recorded dataset and infer the correct station to which a fare machine belongs.

The model is validated using the AFC data of a busy metro system. The experiment results show that invalid associations can be detected with more than 90% accuracy when the invalid association ratio is low. Also, the model is robust to invalid associations: it can still achieve 69.62% accuracy in the extreme case when the invalid association ratio is 55%. The association station inference results indicate that the top 3 inferred stations from the model are highly likely (around 90%) to include the correctly associated station of the studied fare machine. This has an important implication for field investigations of these probable stations in practice.

The proposed model provides useful knowledge for AFC data management in terms of checking data quality and fixing invalid data. Though the study focuses on the invalid data detection problem, the model is general and can be extended to inference applications, e.g., inferring the alighting stations in a bus system that has only boarding records. As the extracted MRF features are meaningful, further studies could build on them, for example, by analysing how fare machines at different gates of the same station are utilized in order to improve infrastructure efficiency.

#### Data Availability

The AFC data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.