Abstract

Public transportation users increase as the population grows. In Taipei, Taiwan, this tendency is observed by analyzing historical data from the Mass Rapid Transit (MRT) and economy-shared bicycle (known as YouBike) riders. While this trend exists, the Taipei City government promotes green transportation by providing discounts to users who transfer from MRT or bus to YouBike within a particular period. Therefore, this study focuses on analyzing the patterns of users in order to identify possible clusters. Clusters of customers can be considered fundamental and competitive factors for the Ministry of Transportation to encourage the use of green transportation and promote a sustainable environment. Based on big data smart card information, this paper proposes using the RFM model and the K-means clustering algorithm to analyze and construct mode-switching traveller profiles of MRT and YouBike riders. As a result, three distinct clusters of MRT-YouBike riders have been identified: potential, vulnerable, and loyal. There are also suggestions regarding the most profitable groups, which customers to focus on, and to whom to give special offers or promotions to foster loyalty among transit travellers.

1. Introduction

Public transport, defined as high-capacity vehicle sharing with fixed routes and schedules, will remain an essential engine of economic activity, social connections, and the standard of living. Due to traffic density and the demands on road infrastructure, land, material, energy, and workforce have been invested in providing transport services and developing its infrastructure. As transport demand continues growing, particularly in fast-developing nations, many cities expand their transportation networks and support infrastructure, indicating how vital the transport system is to economies and social welfare. A 2018 McKinsey report[1] concluded that wealthier cities have greater opportunities to build advanced transportation systems, but such prosperity does not guarantee the successful development of such systems. According to the 2020 “Foresight Research Survey,” as many as 81.1% of Taiwanese people have access to private transportation and only 44.4% rely on public transport for their daily commute. In addition, over 80% of those aged 18 or older rely on private transportation, primarily gasoline-powered motorcycles. Evidence has shown that people prefer to travel using their own vehicles, which imposes considerable challenges to reducing private transportation dependence and encouraging the use of public transportation. However, since people naturally avoid transit they perceive as incompatible with their demands, such transformations are fraught with difficulty.

Taipei’s Mass Rapid Transit (MRT), the first subway system built in Taiwan, has already become a hallmark of Taipei City. Residents in Taipei welcomed its arrival and viewed it as an example of the city’s bright future. The Taipei Metro, once known as the Taipei Rapid Transit Corporation (TRTC), is a city government public transit operator. The MRT has made commuting more accessible for people in Taipei; however, its annual ridership dropped from 770 million visits in 2018 to 690 million in 2020, partially due to the outbreak of COVID-19. In addition, the Taipei Metro launched many promotions to encourage people to take public transportation, for instance, a public transportation monthly pass, a trip discount on mode-switching within an hour, and the first 30 minutes of free YouBike rental. In order to retain existing travellers, discounts and incentives seem appropriate. However, the impact on encouraging new riders is questionable, especially when substantial decreases in operating costs resulting in significant profits come through various government subsidies. As a result, the Taipei Metro must allocate or invest resources in the desired services and travellers. The same action taken by travellers may have resulted in a value destroyer rather than a value creator for the system. Hence, it needs a strong foundation of customer-oriented strategic development.

In this regard, it is necessary to grasp the dynamics and heterogeneity of public transportation users. Li and Schmöcker [2] and Lin et al. [3] used questionnaires to have travellers indicate their reasons for using public transport for descriptive analysis of behavioural changes. Tang et al. [4] collected data on users’ socioeconomic characteristics, vehicle ownership, public bicycle use, and user satisfaction using online questionnaires. The Unified Theory of Acceptance and Use of Technology (UTAUT) was utilized by Jahanshahi et al. [5] to investigate travellers’ opinions and identify factors that influence the adoption of bike-share systems. In the context of the movement of travellers and public transit stations, some studies focused on regional analysis [6–8], route analysis [9–11], site analysis [5, 10, 12, 13], ticketing channel [14–16], mode choice [17–20], and traveller characterization [21, 22]. In particular, Kim et al. [12] used ridership counts of selected intervals to classify the subway stations regarding their diurnal ridership patterns associated with land use. Gan et al. [7] analyzed the daily mobility patterns concerning land use from a station ridership perspective. As interest in people movement and urban mobility has surged in recent years, studies have attempted to predict individual travel-mode choices using traditional random utility models and machine learning approaches [2, 14, 22–32].

In segmentation, the K-means algorithm, an unsupervised learning cluster approach, is still very popular because of its speed, efficiency, and simplicity. The algorithm finds optimal groups (clusters) of customers, transactions, or other behaviours and things with high similarities and characteristics within the clusters. Cluster analysis has been applied in various industries, such as banking [33–35], energy supply [36], agriculture, food [37–39], health and insurance [40–43], telecom [44–48], postal service [49, 50], transportation [14, 17, 26, 29, 51–56], and retail [38, 57–62]. Furthermore, other researchers focus on customer relationship management (CRM) models and adopt them in K-means clustering. For RFM-based traveller segmentation, Reades et al. [51] used every 15-minute interval of boarding and alighting information, rather than the day of the week. Qian et al. [29] proposed customer segmentation rules of Electronic Toll Collection based on vehicle behavioural characteristics. In Chiang [54], the concept of RFM was applied to discover valuable airline travellers, and the association rules led to identifying the optimal target markets. Also, based on the insights mentioned above, determining traveller values in each type of transportation study has its characteristics that are not fully met by the same model [56].

To our knowledge, there has been less research into different transfer and transit portfolio modes. Therefore, this study aims to profile the travel patterns of visitors to Taipei solely based on the MRT and the bike-share network (“YouBike”). From longitudinal observations of ridership patterns, travellers can be grouped into specific categories based on the RFM scoring model [63]. Furthermore, such a transit ridership portfolio reveals insights capturing travellers’ travel behaviours, assisting public transit operators such as Taipei Metro in developing effective customer relationships and market strategies, and efficiently allocating resources.

2. Research Methodology

This section discusses how to combine the RFM model and clustering for constructing a transit ridership portfolio. Assuming that the transaction data are largely unlabelled, we begin by considering a K-means clustering algorithm. We next examine transfer behaviour by considering attributes that reflect travel spending and preferences, namely, the RFM indicators. Details are provided in the following subsections.

2.1. Data Description and Preprocessing

Original data were extracted from a smart card system (called the Easy Card) used by MRT and YouBike, with records spanning 31 months between January 2017 and July 2019. The dataset consists of 11 fields for the MRT service, such as card number, ticket type, entry/exit time, entry/exit code, transaction amount, transfer code, transfer discount, and commuter ticket. The dataset also contains 18 fields for the bike-sharing service, including card number/type, deduction time/amount, borrowing/return time, borrowing/return station code and slot, bicycle number, rental free, mobile phone, rate type, and others. In this study, only trips that included a YouBike-to-MRT transfer were included in the analysis. Data fields for the MRT and YouBike systems contained some inconsistencies, and certain details have been omitted for confidentiality reasons.

Table 1 lists the MRT and YouBike data fields and descriptions. Following the Metro Taipei website, all passengers purchase different passes on which the trip and fare amount are recorded. E-ticket types include standard, students, welfare (seniors and charities), and children. A ride begins and ends when travellers swipe cards to enter and leave the system, and its time is thus recorded and calculated as the travel time. A trip is considered when swiping a card from an original station to a destination station; nevertheless, a transfer made within a time window will be regarded as only a single trip. This study can identify the transfer behaviour between MRT-YouBike modes since both exact boarding and alighting stations are known. Finally, the fare for the trip is calculated and deducted after counting all discounts. A few travellers purchase All-Pass Tickets, which cover nearly all public transportation modes in Taipei and are eligible to rent YouBikes for free within the Taipei area for the first 30 minutes of each rental session.

2.2. Data Extraction

Once the data are preprocessed, the first step is to group travellers. In customer segmentation, three common indicators are recency (i.e., the most recent transfer), frequency (the number of transfers), and monetary value (the total amount spent by the traveller). Figure 1 illustrates the notations for RFM data for a transit traveller. SD and ED are the start and end dates of the study period. $TD_t$ denotes the date when a transfer occurs in the period t, and traveller transactions are monitored till the end of period T.

The recency value is determined by the number of days between the last trip date and the end of the analysis period. The closer the last transaction date is to the end of the analysis period, the smaller this number and the more recently active the traveller. Therefore, a traveller’s recency (R-value) is determined as follows:

$$R = ED - \max_{1 \le t \le T} TD_t \qquad (1)$$

Secondly, the number of transactions ($f_t$) made by a transit traveller at period t during the analysis period T shall be considered as the frequency (the F-value). Equation (2) calculates the F-value as follows:

$$F = \sum_{t=1}^{T} f_t \qquad (2)$$

A trip is counted once in this case, since travellers can transfer from one route to another by different modes to complete the same one-way trip. Therefore, the more frequently a traveller travels, the more valuable and loyal they are.

Finally, $m_t$ denotes the monetary value a traveller has paid at the end of the trip for period t. Its total amount is directly related to the number of transactions in the public transportation system and is set as M in (3). The higher the value, the more profit will be generated.

$$M = \sum_{t=1}^{T} m_t \qquad (3)$$
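
The three indicators can be illustrated with a minimal sketch in Python using pandas. The table layout and the column names (`card_id`, `transfer_date`, `amount`) are assumptions made for illustration only and do not reflect the actual Easy Card schema; the toy records are invented.

```python
import pandas as pd

# Hypothetical transfer log: one row per MRT-YouBike transfer.
# Column names and values are assumptions for illustration.
transfers = pd.DataFrame({
    "card_id": ["A", "A", "A", "B", "B", "C"],
    "transfer_date": pd.to_datetime([
        "2018-01-05", "2018-03-10", "2018-12-20",
        "2018-02-01", "2018-02-15", "2018-07-30",
    ]),
    "amount": [20, 25, 16, 30, 20, 45],  # fare paid per transfer, in NTD
})

end_date = pd.Timestamp("2018-12-31")  # ED, the end of the analysis period

rfm = transfers.groupby("card_id").agg(
    last_transfer=("transfer_date", "max"),
    frequency=("transfer_date", "count"),  # F: number of transfers
    monetary=("amount", "sum"),            # M: total amount paid
)
# R: days between the last transfer and the end of the period
rfm["recency"] = (end_date - rfm["last_transfer"]).dt.days
rfm = rfm[["recency", "frequency", "monetary"]]
print(rfm)
```

In this toy example, card A has the smallest recency (11 days) and the highest frequency, so it would score as the most active traveller of the three.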

2.3. The K-Means Algorithm

A K-means clustering algorithm, proposed by [64], is used here to group travellers transferring between YouBikes and MRTs. Initially, the algorithm assigns each object (or observation) to one of an arbitrarily determined number of clusters (k) based on the minimum distance between the object and the cluster centroids. If a set of objects ($x_1, \ldots, x_n$) contains a u-dimensional index (i.e., the RFM index), then the k value should be an optimal number effective for clustering. Accordingly, a high degree of homogeneity (i.e., compactness) within each group and a high degree of heterogeneity (i.e., separation) between different groups must be observed (as illustrated in Figure 2). A common method for validating the appropriate number of clusters is the elbow method. The k value range, in this case, is set so that the results can be compared with the RFM analysis later on. We then perform the K-means clustering for each k value.

Clustering and the average distance are determined as follows:

(1) Select the number of k partitions in which the objects will be clustered.

(2) Partition the objects ($x_1, \ldots, x_n$) into k subsets in a u-dimensional feature space.

(3) Choose k random points from the partitioning sets as the initial cluster centroids ($c_1, \ldots, c_k$).

(4) Calculate the distance between each data point ($x_i$) and the centroid of each cluster ($c_j$) using the Euclidean distance measure E as follows:

$$E(x_i, c_j) = \sqrt{\sum_{d=1}^{u} (x_{id} - c_{jd})^2} \qquad (4)$$

(5) Assign objects to the group with the shortest distance.

(6) Identify the new cluster centroid by recalculating the positions of all objects assigned to that cluster.

(7) Repeat steps 4 to 6 until convergence or until a fixed number of iterations is reached, and confirm that each object has the shortest Euclidean distance from its cluster centroid as defined in the following equation:

$$\min_{1 \le j \le k} E(x_i, c_j) \qquad (5)$$

(8) Calculate the average dissimilarity of the clusters, where n is the total number of objects in the RFM index dataset, using the following formula:

$$\bar{E} = \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} E(x_i, c_j) \qquad (6)$$
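
The procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the study: the function name `kmeans`, its parameters, and the optional `init_idx` argument (which fixes the initial centroids so the toy run is reproducible) are all assumptions, and the six data points are invented.

```python
import numpy as np

def kmeans(X, k, max_iter=100, init_idx=None, seed=0):
    """Basic K-means on an (n, u) array; returns labels, centroids, mean distance."""
    rng = np.random.default_rng(seed)
    # Step 3: choose k objects as initial centroids (random unless init_idx is given)
    if init_idx is None:
        init_idx = rng.choice(len(X), size=k, replace=False)
    centroids = X[np.asarray(init_idx)].astype(float)
    for _ in range(max_iter):
        # Step 4: Euclidean distance from every object to every centroid, shape (n, k)
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 5: assign each object to its nearest centroid
        labels = dist.argmin(axis=1)
        # Step 6: recompute each centroid as the mean of its members
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 7: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 8: average distance of each object to its nearest centroid
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    return labels, centroids, dist.min(axis=1).mean()

# Invented 3-dimensional points standing in for standardized RFM vectors.
X = np.array([
    [0.0, 0.0, 0.0], [0.2, 0.0, 0.0],
    [5.0, 5.0, 5.0], [5.2, 5.0, 5.0],
    [10.0, 0.0, 10.0], [10.2, 0.0, 10.0],
])
labels, centroids, avg_dist = kmeans(X, k=3, init_idx=[0, 2, 4])
print(labels, avg_dist)
```

With random initialization the cluster numbering, and occasionally the partition itself, can differ between runs, which is why practical implementations usually repeat the procedure from several starting points.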

3. Results and Discussion

This section first introduces the datasets used in the case study. Data are from the transit authority of Taipei Metro and the YouBike company in Taiwan. Millions of transactions have been transformed and preprocessed before being compiled into the dataset for the K-means clustering. The results of this implementation are then discussed.

3.1. Data Set Description

Figures 3 and 4 illustrate data for MRT and YouBike. The raw data may contain inconsistent, irrelevant, and abnormal transaction information (e.g., entry and exit times). The data were cleaned using Python, EmEditor, and Spotfire. To display transfer patterns, MRT and YouBike data need to be combined. Consequently, we extracted the data by matching the cardholder ID and the transfer code from the MRT transaction data. Afterwards, we concatenated the matching data with the YouBike data. Following the Taipei Metro policy, there is a limit to the transfer duration, namely, one hour. Therefore, transfer behaviour is analyzed in this study based on the time spent riding a YouBike to the MRT at location A or vice versa. Thus, if a traveller returns a bicycle to a YouBike station and enters an MRT station within one hour, the system considers this a transfer behaviour. Figure 5 presents information regarding the transfer behaviours between YouBike and MRT.
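
The one-hour rule can be illustrated with a small pandas sketch. It matches on card ID and time only; the study itself also relied on the transfer code in the MRT data, which is not modelled here. The column names (`card_id`, `return_time`, `entry_time`) and the three records are assumptions for illustration.

```python
import pandas as pd

# Hypothetical, simplified records; column names are assumptions.
bike = pd.DataFrame({
    "card_id": ["A", "B", "C"],
    "return_time": pd.to_datetime(
        ["2018-03-01 08:00", "2018-03-01 08:10", "2018-03-01 09:00"]),
})
mrt = pd.DataFrame({
    "card_id": ["A", "B", "C"],
    "entry_time": pd.to_datetime(
        ["2018-03-01 08:12", "2018-03-01 09:30", "2018-03-01 08:50"]),
})

# merge_asof requires both frames to be sorted by their time keys.
bike = bike.sort_values("return_time")
mrt = mrt.sort_values("entry_time")

# For each bike return, look for the next MRT entry by the same card
# that occurs at or after the return and no more than one hour later.
matched = pd.merge_asof(
    bike, mrt,
    left_on="return_time", right_on="entry_time",
    by="card_id",
    direction="forward",
    tolerance=pd.Timedelta(hours=1),
)
# Rows without a qualifying MRT entry have a missing entry_time.
transfers = matched.dropna(subset=["entry_time"])
print(transfers)
```

Only card A qualifies here: card B enters the MRT 80 minutes after returning the bicycle, and card C enters the MRT before the bicycle is returned. The reverse direction (MRT exit followed by a YouBike rental) could be handled the same way with the roles of the two tables swapped.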

Table 2 summarizes the final data set, consisting of 5,023,808 records of transfer transactions from January 2018 through July 2019. After transforming the EasyCard usage of individual travellers into time-dependent transit frequency (Table 3), we observe that the average daily transfers increased from Sunday to the middle of the week before plateauing (or slightly decreasing) until Saturday. Figure 6 indicates that most users are either standard or student cardholders. The number of transactions during the weekdays remains higher than during the weekend, regardless of cardholder type. This suggests that most transfers are likely to be made by daily commuters.

The data were then transformed into RFM features for each traveller. Some illustrative examples are shown in Table 4. In this study, all the model features are given equal weight (i.e., are equally important), and the results of the RFM data conversion are used to perform a nonhierarchical cluster analysis. The RFM values differ due to scale differences, which in turn affects the clustering analysis. Accordingly, we standardized the RFM values using the simple z-score method.
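
As a brief illustration of the z-score step, the sketch below standardizes each RFM column to zero mean and unit variance. The three rows are invented values, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the text does not specify which variant was used.

```python
import numpy as np

# Invented RFM rows: recency (days), frequency (transfers), monetary (NTD).
rfm = np.array([
    [27.0, 2.0, 50.0],
    [165.0, 1.0, 37.0],
    [55.0, 24.0, 523.0],
])

# z-score per column: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0).
z = (rfm - rfm.mean(axis=0)) / rfm.std(axis=0)
print(z.round(2))
```

After this transformation, a difference of one unit means one standard deviation in every column, so the monetary values no longer dominate the Euclidean distances simply because they are numerically larger.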

3.2. Cluster Analysis

Several methods in the literature are used to determine the number of optimal clusters. However, according to Horvat et al. [65] and Raza et al. [66], clustering algorithms can be discriminatory, making it difficult to evaluate the results objectively. In addition, the categories that emerge from this process can take on different meanings depending on their context. For our study, since no ground-truth labels exist for our problem, we validate the number of clusters using the elbow method, a type of internal clustering validation. In the elbow method, K-means clustering is performed on the dataset for a range of values of k, and the sum of the squared distances between each point and its closest centroid, also known as the inertia, is calculated. We represent the inertia as the mean distortion in our graph, while others may represent it as the sum of squared errors (SSE), the within-cluster sum of squares (WCSS), etc.

We start with k = 2 and increment by 1 until k = 10. Upon reaching a certain value of k, the cost of training (i.e., the diminishing return) will drop dramatically and eventually reach a plateau as the k value increases further (see Table 5). The diminishing return is greatest when k = 1. From k = 4 onward, the change rate becomes indistinguishable; the movement is almost parallel to the X-axis. Figures 7 and 8 show that both distortions decline rapidly as k increases from 1 to 3; the diminishing returns set in at k = 3, and the decline slows down after k = 4. Therefore, k = 3 is the optimal number of clusters.
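
The elbow procedure can be sketched with scikit-learn's `KMeans`, whose fitted `inertia_` attribute is the sum of squared distances to the closest centroid. The data here are synthetic (three well-separated blobs standing in for standardized RFM vectors), so the curve only illustrates the method and does not reproduce Table 5 or Figures 7 and 8.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for standardized RFM vectors: three separated blobs.
X = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 3))
    for center in ([0, 0, 0], [5, 5, 5], [10, 0, 10])
])

# Fit K-means for k = 2..10 and record the inertia of each fit.
inertias = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Reduction in inertia gained by adding the k-th cluster.
drops = {k: inertias[k - 1] - inertias[k] for k in range(3, 11)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On this synthetic data the largest reduction occurs when moving from two to three clusters, after which the curve flattens. On real transaction data the bend is usually less sharp, and reading it off involves some judgment.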

Tables 6 and 7 provide details regarding the three clusters in 2018 and 2019. In 2018, there were 526,697 travellers, while in 2019, there were 426,717. It is important to note that the 2019 data cover only seven months (January through July), while the 2018 data cover the entire year. On a proportional basis, we can conclude that transit ridership is increasing, providing a promising outlook for the use of public transportation. Among the three clusters in 2018, cluster 1 has 116,154 travellers with an average recency of 27.73 days, a frequency of 2.43, and an average monetary value of NTD50.16. Cluster 2 travellers have an average recency of 165.17 days, a frequency of 1.62, and an average monetary value of NTD37.29; cluster 3 travellers have an average recency of 54.93 days, a frequency of 23.66, and an average monetary value of NTD523.33. The results for 2019 are similar to those of 2018, with RFM values slightly lower than in 2018.
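
The proportional comparison can be made explicit with a rough per-month calculation based on the traveller counts quoted above. This is only an illustrative back-of-the-envelope check: distinct travellers do not accumulate linearly over time, so the monthly figures are a coarse indication rather than a result of the study.

```python
# Traveller counts and period lengths as quoted in the text.
travellers_2018, months_2018 = 526_697, 12
travellers_2019, months_2019 = 426_717, 7

rate_2018 = travellers_2018 / months_2018  # travellers per month, 2018
rate_2019 = travellers_2019 / months_2019  # travellers per month, 2019

# Share of cluster 1 among all 2018 travellers, in percent.
share_c1 = 116_154 / travellers_2018 * 100

print(round(rate_2018), round(rate_2019), round(share_c1, 2))
# prints: 43891 60960 22.05
```

The per-month figure for 2019 is higher than for 2018, which is consistent with the increase described in the text, and the cluster 1 share agrees with the reported 22.05%.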

The average RFM values of each cluster are compared with the overall average RFM values to determine the RFM score tendency. An upward symbol (↑) is placed if the average R, F, or M value is greater than the overall average; otherwise, a downward symbol (↓) is used. Table 8 summarizes transit traveller profiles for 2018 and 2019. For instance, a traveller in cluster 2, with recency values of 165.17 and 122.85, higher than the overall averages of 82.61 and 59.18, respectively, is likely to be a vulnerable customer who has not used transit for a long time. The percentage of travellers in this group decreased slightly from 62.35% in 2018 to 60.59% in 2019. In addition, a traveller in cluster 1 may be a potential customer who has just begun travelling by transit. This group represents 22.05% of all travellers in 2018 and 24.66% in 2019. Moreover, travellers in cluster 3 tend to be loyal (regular commuters), incurring significant travel expenses and using transit frequently. This group accounted for 15.60% of travellers in 2018 and 14.75% in 2019.
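
The tendency rule can be expressed as a one-line comparison. The sketch below applies it to the 2018 recency values quoted in the text; the helper name `tendency` is an illustrative choice, and frequency and monetary values would be handled the same way once their overall averages are taken from Table 8.

```python
def tendency(cluster_mean, overall_mean):
    """Return the upward symbol if the cluster mean exceeds the overall mean."""
    return "↑" if cluster_mean > overall_mean else "↓"

# 2018 recency values as quoted in the text (days).
overall_recency_2018 = 82.61
cluster_recency_2018 = {1: 27.73, 2: 165.17, 3: 54.93}

symbols = {
    cluster: tendency(value, overall_recency_2018)
    for cluster, value in cluster_recency_2018.items()
}
print(symbols)  # {1: '↓', 2: '↑', 3: '↓'}
```

Only cluster 2 lies above the overall recency average, which matches its description as the segment that has not used transit for a long time.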

Cluster 2 is a vulnerable segment of the travel market since customers are likely not using public transit. By contrast, cluster 1 customers tend to be transit-savvy. Lastly, cluster 3 customers tend to be loyal travellers with monthly or weekly passes. For this reason, we further break down the data into quarters to understand the transit behaviour, as shown in Tables 9–14. Recency gaps between clusters 1 and 2 have narrowed from 12 to 2 days on average quarterly. Likewise, the difference between these two clusters has been relatively large in most quarters, except for the first quarter of 2018, when Taipei’s city government launched a promotional scheme for two-way transfers among YouBike, the MRT, and the bus. Table 15 indicates the market sizes for all clusters have remained steady across all quarters. Cluster 3 is the most profitable among these three clusters, with the highest frequency and monetary value.

As shown in Figure 9, we first exclude the frequency data from our analysis to gain a deeper understanding of this target segment. Consequently, the points illustrating the recency coordinates mostly fall between 10 and 20 days. Afterwards, target segments are analyzed based on the number of transit users and their monetary value. Based on Figure 10, the points illustrating the Frequency-Monetary values are clearly divided into two distinct groups: one group with fewer than three transits and monetary values less than NTD50, and a second group with more than 15 transits and monetary values of at least NTD300. Therefore, even though one category is more profitable, the two categories can still be promoted differently to maximize profits. In Figure 11, the target segments are visually displayed as three clusters, with the upper left cluster representing the loyal customer (K = 3). These findings allow public-transit operators such as Taipei Metro and YouBike companies to conduct microtargeted campaigns offering incentives to each segment and promoting transit options and special fare subsidies. In addition, public transportation operators may approach large employers in some areas to encourage employees to use public transportation. Public-transit operators and employers can facilitate this together, and employer-sponsored passes can be beneficial.

Tables 16 and 17 summarize the distribution of each group (potential, vulnerable, loyal) over the week. The segment with the highest percentage of transfers differs between 2018 and 2019, shifting from mainly the vulnerable segment to the potential segment. Loyal customers are mainly found from Wednesday to Friday, which indicates that most regular passengers are students and employees. Conversely, the transfer of vulnerable passengers varies from 2018 to 2019. The vulnerable group’s participation rate has increased on Saturday and Sunday. Therefore, it can be concluded that the number of casual users has increased over the weekend.

Table 18 presents the annual MRT revenue contributions among the three segments. The “11∼20” represents the amount the passenger pays between NTD11 and NTD20 for this MRT ride. The total percentages of the three segments illustrated different contributions in 2018 and 2019, with 51.43% in 2018 and 78.81% in 2019. It is pertinent to note that the base charge in Taipei MRT is NTD20 once the passenger leaves the boarding station by MRT. We also found that revenue contributions for all segments were generated through the expense of NTD11∼20 per trip, lower than the base charge. In fact, by adding up all three segments, travellers spending NTD11∼20 per ride accounted for approximately 58.94% of the 2018 revenue and 61.10% of the 2019 revenue, and the traveller segment changed from the vulnerable to the potential in 2019.

In Table 19, the highest YouBike annual revenue percentages in 2018 and 2019 were 50.34% and 45.73%, respectively. In 2018, 68.85% of total revenue came from passengers who spent no more than NTD10 on an excursion, which increased to 76.25% in 2019. As such, more travellers are using YouBike as their first and last-mile mode of transportation since it charges NTD10 per 30 minutes, and most customers return their bikes within half an hour. Therefore, while it was encouraging to see more recent travellers using the MRT and YouBike, there seems to be a need to encourage more frequent transfers since a fare increase seems implausible.

4. Conclusions

In order to gain back riders, especially in this period of the pandemic, it is essential to focus on first- and last-mile issues. In this study, we examine the travel patterns of visitors to Taipei exclusively using the MRT and the bike-sharing network (“YouBike”). According to longitudinal observations of ridership patterns, travellers can be categorized into distinct groups within distinctive profiles associated with stations and their areas of influence.

This study contributes to the stream of research. Ridership categorization is derived from the Recency-Frequency-Monetary (RFM) scoring model [63], capturing travel patterns. In transit travel, the number of days since the last transfer is used as the recency parameter, the number of transfers made in a given period as the frequency parameter, and the amount of profit generated over the number of transfers made by travellers in that period as the monetary parameter. Using the K-means approach, a transit ridership portfolio reveals interesting relationships between the local environments of metro stations and urban mobility patterns that differentiate their contributions to the system.

Passes popular before the pandemic, such as monthly or weekly passes, are unlikely to appeal to workers who have switched to hybrid work schedules or riders wary of using public transportation. As a result, agencies can eliminate fares for a limited period, such as a few weeks after the summer holidays, to encourage riders to make transit part of their new routine. It is also possible for agencies to offer deep discounts during off-peak hours to reduce crowding. In addition, agencies may replace monthly or weekly unlimited passes with more flexible arrangements, such as fare capping and the option to buy one-way tickets between one origin and one destination at a specified discount valid for a certain time frame.

This research is a preliminary exploration of the riders’ patterns to identify possible clusters. Utilizing data extracted from contactless smart cards has its limitations as factors affecting transit ridership in a large metropolitan setting involve more than just smart cards’ POS (point of sale) records. Other critical factors such as socioeconomic characteristics, technology (e.g., the effect of intelligent transportation information systems), energy price, urbanization, and automobile dependence should not be overlooked. An examination of the aforementioned factors, combined with the formation of rider clusters, may provide better policy implications for combating congestion and carbon emissions in the near future. [67–71]

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.