Abstract

The public transportation network (PTN) provides mobility and access to community resources, employment, medical care, infrastructures, and other resources in the city. This research studies the process of the formation of links among nodes in different real-world PTNs. We have found that this process may be appropriately explained by a generalized linear model (GLM) using local, global, and quasilocal similarity indexes as explanatory variables. In modeling, the response variable was described by a binomial probability density function, and the logit function was used as a link function. In the crossvalidation process, utilising a downsampling approach, both average accuracy and area under the receiver operating characteristic curve (AUC) metrics presented higher values than 0.99. The kappa parameter had magnitudes larger than 0.93 for most of the PTNs. In the final validation stage, recall and specificity metrics took the value 1. Accuracy and precision parameters were larger than 0.99 and 0.87, respectively, for the majority of PTNs. Only one of the PTNs required utilising a smoothed bootstrap approach in order to achieve better results. The similarity measures with the greatest influence on the model were determined. We also assessed the impact of link removal on the global efficiency of PTNs, considering several similarity indexes. Additionally, we find that most of the networks show low local and global efficiencies (≤0.20), as well as travel times with a relevant variability, exhibiting standard deviations larger than 790 seconds. Significant similarities exist between the cumulative probability distributions of the local efficiency in all PTNs. With respect to the centrality measures, the eigenvector centrality presented a strong correlation with the hub/authority centralities (>0.80), while the pagerank showed a moderate, high, or very high correlation with the degree in all PTNs, >0.50.

1. Introduction

Link prediction methods have been the subject of research [1–4], which suggests several mechanisms to detect hidden connections. These mechanisms take into account the path information between pairs of nodes in order to estimate their common neighbors. They also consider a mutual information perspective in order to evaluate the similarity index between pairs of nodes. The conditional probability for the existence of a link is calculated, given the common neighbors of two nodes, as described in [5]. Finally, the weights of the links are considered, developing the mechanisms described in [6], which are based on the common neighbor, resource allocation (RA) [7], and Adamic-Adar (AA) [8] indexes. The above is combined with the weighted mutual information (WMI) [9] score estimated between node pairs. Reference [10] suggests a new local information-based link prediction method, the tie connection strength (TCS) index, concerning the efficient paths between the target node pair and their common neighbors. An adaptable parameter is presented in order to estimate the impact of the TCS and the topology of the network on the similarity of pairs of nodes. Reference [11] establishes a new type of triangle structure, which consists of one seed node, one common neighbor, and another node. Based on this, a new similarity index, named the TRA index by the authors, is proposed for link prediction. The authors integrate the new triangle structure and the idea of the RA index [7]. Reference [12] proposed a new similarity measure based on the AA score, information related to communities generated from the topological structure of the network, and the degree centrality. The link prediction algorithms use two open implementations of a bulk synchronous parallel programming model [13]: Apache Giraph and Apache GraphX. Reference [14] demonstrates that similarities with respect to structural features (eigenvectors) optimize the link prediction task in multiplex networks. 
This is done using a layer reconstruction method (LRM), which considers the unconnected node pairs in the target layer as similar, provided that they are not only analogous from the point of view of the target layer but also from the perspective of other layers. Tests on real multiplex networks show that LRM takes advantage of existing information redundancy in different layers.

The application of link prediction methods in real contexts has also been analyzed. A great deal of research has been done on the analysis of social networks. Reference [15] carries out a comprehensive review and discusses some link prediction applications in social networks such as recommender systems, community detection, anomaly detection, and influence analysis. Because social networks are highly dynamic, with nodes and links coming and going, some research considers temporal aspects. Reference [16] characterizes the likelihood of a link between two nodes from both the existing connectivity topology and the popularity of both nodes. Several datasets are considered in order to test and calculate the performance of the algorithms. Reference [17] builds a linear model for integrating neighborhood similarity measures and node-specific information and uses an evolutionary algorithm to locate the coefficients that optimize the prediction of links. The authors assign different weights to each index using the Covariance Matrix Adaptation Evolution Strategy (CMAES) [18, 19]. In addition, protein-protein interaction (PPI) networks have been examined using link prediction methods. Reference [20] utilises the support vector machine learning method for PPI prediction. Features often used in social networks, such as similarity indexes, have been progressively put into practice to make predictions in PPI networks [21, 22].

This paper studies the link formation process in several PTNs using various similarity measures, which have been applied in a link prediction theoretical framework. The most influential indexes in the pattern followed by link formation between pairs of nodes are determined.

PTNs have been examined from different points of view. Thus, models have been implemented to analyze travel behaviours. Reference [23] forecasts, based on surveys, some characteristics related to the passenger flow. Reference [24] implements a Bayesian network to detect the relationships between travel happiness and several parameters that affect travel behavior. Reference [24] checks pretravel information-seeking behaviours of the passengers using data collected during an extensive public transport on-board survey. For this purpose, the authors implement a multivariate binomial logistic regression model. The model takes into account factors related to sociodemographics, aspects of the travelers, characteristics of the trip, and devices used for information consultation.

The main novelty of our research is that it shows that the link formation pattern in PTNs can be appropriately explained by means of a generalized linear model (GLM), which has local, quasilocal, and global similarity measures between nodes as explanatory variables. The response variable, which establishes whether or not a link exists between pairs of nodes, is described by a binomial probability density function. The link function used is the logit function.

Studies exist that analyze topological parameters in PTNs (degree distributions, path length distribution, and betweenness), as well as growth models. However, to our knowledge, no previous analysis has demonstrated this for PTNs. Existing research has developed growth models for PTNs based on other considerations. Reference [25] replicates some statistical features of PTNs, describing their evolution in terms of adding routes in P-space. The authors use a self-avoiding walk (SAW) as a route model. In the aforementioned P-Space [26], one node symbolizes one stop, and one link joins a pair of stops if at least one route exists that supports a direct service between them. Reference [27] developed an area-based model of highway growth. Specifically, a binary logit model was obtained in order to estimate the new route growth probability of divided highways and secondary highways, using high-quality geographic information system (GIS) data on land use, population distribution, and the highway network for the Twin Cities Metropolitan Area from 1958 to 1990. In [28], a growth model that iteratively invested in constructing new links or incrementing the capacity of existing ones was implemented. The objective of that research was to establish the impact that the demand distributions and operational costs have on the evolution of a PTN. The model considered parameters related to grid geometry, demand characteristics, and operating mode (operational speed per mode, cost per km, and capacity). In contrast, the model described in this paper explains the appearance of links in PTNs based exclusively on topological parameters.

PTNs have also been studied as complex systems [29]. Reference [30] describes a geospatial layout for distributing stops and uses a maximum allowable walking distance in order to link the routes. The PTNs are optimized, considering aspects such as efficiency and robustness. Reference [31] studies common problems that have been found when a complex system scheme is used for the analysis of the topology of a transportation system (such as mechanisms for the evaluation of the scale-freeness, metrics for the analysis of the network structure, and examination of the vulnerability of the networks using methods with an unacceptable computational time). The vulnerability of PTNs has also been analyzed in depth [26, 32].

This paper studies the impact that the removal of links, with certain similarity characteristics, has on the global efficiency of PTNs. The relationships between similarity characteristics and the local efficiency of nodes are also checked. Other research has analysed the effect that the node elimination has on the global efficiency of PTNs [33], and the robustness of PTNs has been examined from other points of view, such as the evolution of the giant component when several nodes are deleted [26, 33]. The fault propagation [20, 26] from nodes with certain topological characteristics (highest betweenness, degree, eigenvector centralities, and pagerank) has also been analyzed. However, a detailed study of the effect on the global efficiency in PTNs when certain links are removed according to similarity indexes analysed in this research has not been found.

This paper also examines the correlation between some centrality measures and relates them to other traffic flow characteristics. Some research exists [34–37] that analyzes the correlations between centrality measures in networks of different types. However, we focus on the study of centralities in PTNs and relate them to the flow of vehicles. To our knowledge, these characteristics have not been previously studied specifically in the PTNs presented here. Moreover, the networks analyzed here are of very different sizes and from different countries, which suggests that they may also operate differently, bringing generality to the analysis. The correlation between centrality measures can explain some of the patterns found in PTNs when they suffer a targeted attack or a fault propagation [26].

The same applies to the study of travel times. It has been shown that, in general, the size, complexity, and variability of available routes in PTNs produce trip times that are highly different between routes. We also study the local efficiency, demonstrating that there are commonalities between PTNs with respect to this feature.

The PTNs studied are AVL, CFL, RGTR, and TICE in Luxembourg, which has 1,372 nodes and 340,684 links; Island Transit in the USA, which has 358 nodes and 5,946 links; Lanta in the USA, which consists of 2,150 nodes and 91,583 links; Linja-Karjala Oy in Kuopio, Finland, which has 551 nodes and 63,339 links; Metlink in New Zealand, which has 3,007 nodes and 355,621 links; Prague Public Transit Company (PPTC), Regional Organiser of Prague Integrated Transport (ROPIT) in Prague, which consists of 5,152 stops and 1,602,778 links; STAR in France, which consists of 1,415 stops and 9,477,213 links; Thunder Bay Transit in Ontario, Canada, which consists of 825 nodes and 78,247 links; TransAntofagasta in Chile, which has 650 nodes and 58 724,362 links; and finally, Sage in California, which has 31 stops and 66 links. It can be observed that the networks are of small, medium, and large sizes.

The vulnerability of AVL, CFL, RGTR, TICE; Linja-Karjala Oy, STAR; Thunder Bay Transit; and TransAntofagasta networks was analyzed in [26].

The objectives of this research were as follows:
(1) To analyze whether a GLM, which has as input variables certain measures of similarity between nodes, can correctly explain the formation of links, and to establish which of the measures have greater significance in this process.
(2) To detect the influence that the links can have on the global efficiency of the network, according to their similarity characteristics.
(3) To find common features in the networks that allow their efficiency and trip times to be characterized.
(4) To determine the relationships that may exist between some centrality measures (eigenvector, pagerank, betweenness, hub, and authority), as well as with other traffic flow characteristics.

2. Materials and Methods

2.1. Overview of Used Resources

Information related to the stops and routes of the studied networks, which is available on their websites, was utilised. Several programs in R [38] and Python [39] were specifically implemented to carry out this research, using versions R 3.6.0 and Python 3.8.3, respectively. The networks and igraph packages were used. In addition, the proxfun, caret, nortest, stats, vip, and rose packages in R were utilised.

The programmes specifically developed to perform this research allowed the following:
(i) Processing of information related to the PTNs to be able to work with it (routes, stops, stop times, trips, and calendars) (in Python, ProcessPTNInf.py).
(ii) Construction and simplification of the graphs that describe a PTN, and calculation of the similarity measures between nodes (in R, ConstGraphCalcSim.R).
(iii) Estimation of centralities (in R and Python, CalcCentralities.py and CalcCentralities.R).
(iv) Building of a binary classification model and evaluation of its results (in R, ModelingPTN.R).
(v) Calculation of frequency and cumulative probability distributions related to efficiency and trip times (in R, CalcDistr.R).
(vi) Generation of graphs showing the results (in R, DrawGraphs.R).

These programs followed the typical development life cycle with phases of specification, detailed design, coding, and testing.

2.2. Overview of Used Methods
2.2.1. Generalized Linear Models

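This section describes the generalized linear model (GLM) that we have used for the simulation of link formation in PTNs.

Consider the response $Y_i$ and the set of independent variables $X_i = (X_{i1}, \ldots, X_{ip})^T$ for $i = 1, \ldots, n$. A GLM consists of both a random and a systematic component, as well as a link function.

Regarding the random component, it is assumed that $Y_1, \ldots, Y_n$ are independent random variables described by a probability density function from the exponential family:

$$f_{\theta}(y) = \exp\left(\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right),$$

where $b(\cdot)$ and $c(\cdot, \cdot)$ are known functions, and $\theta$ and $\phi$ are parameters, called natural and dispersion parameters, respectively.

The systematic component relates some vector $\eta = (\eta_1, \ldots, \eta_n)^T$ to the features:

$$\eta_i = X_i^T \beta = \sum_{j=1}^{p} X_{ij} \beta_j,$$

where $\beta = (\beta_1, \ldots, \beta_p)^T$ are called regression parameters.

The link function $g$ relates the linear predictor $\eta$ to the mean $\mu$ of $Y$, $g(\mu) = \eta$. If $g(\mu) = \theta$, that is, if $\eta = \theta$, then $g = (b')^{-1}$ holds, because $\mu = b'(\theta)$. The link function is then called the canonical link function.

The exponential family contains commonly used distributions such as gamma, normal, inverse Gaussian, Bernoulli, binomial, Poisson, geometric, negative binomial, and exponential.

In particular, a probability density function characterized as a binomial distribution, where $m$ is the number of trials and $p$ is the success probability, can be defined as

$$f(y) = \binom{m}{y} p^{y} (1 - p)^{m - y} = \exp\left(y \log\frac{p}{1 - p} + m \log(1 - p) + \log\binom{m}{y}\right).$$

Therefore,

$$\theta = \log\frac{p}{1 - p}, \qquad b(\theta) = m \log\left(1 + e^{\theta}\right), \qquad \phi = 1, \qquad c(y, \phi) = \log\binom{m}{y}.$$

To evaluate the parameters of an exponential family GLM, maximum likelihood can be applied,

$$L_n = \prod_{i=1}^{n} f_{\theta_i}(Y_i) = \prod_{i=1}^{n} \exp\left(\frac{Y_i \theta_i - b(\theta_i)}{\phi} + c(Y_i, \phi)\right).$$

Therefore, the log-likelihood for the sample is

$$\ell_n = \sum_{i=1}^{n} \left(\frac{Y_i \theta_i - b(\theta_i)}{\phi} + c(Y_i, \phi)\right).$$

We use as link function $g$ the logit function,

$$g(\mu) = \log\frac{\mu}{1 - \mu}, \qquad \mu \in (0, 1).$$

Its inverse, $g^{-1}(\eta) = 1 / (1 + e^{-\eta})$, returns values between 0 and 1 for any input.

In order to maximize $\ell_n$ over all choices of coefficients $\beta$, it is necessary to consider that each natural parameter $\theta_i$ may be expressed using the mean $\mu_i$ of the exponential family distribution, since $\mu_i = b'(\theta_i)$. Taking this into account, and recalling that a link function $g$ exists such that

$$g(\mu_i) = X_i^T \beta, \tag{8}$$

which joins the mean $\mu_i$ to the parameter $\beta$, it is possible to compute $\theta_i$ as $\theta_i = (b')^{-1}(\mu_i)$ and then state that $\theta_i = h(X_i^T \beta)$, with $h = (g \circ b')^{-1}$.

Therefore, it is possible to establish

$$\ell_n(\beta) = \sum_{i=1}^{n} \frac{Y_i \, h(X_i^T \beta) - b\left(h(X_i^T \beta)\right)}{\phi},$$

where the terms that do not depend on $\beta$ have been removed.

If the canonical link function is considered in (8), $h$ is the identity, and the function to maximize over $\beta$ is

$$\ell_n(\beta) = \sum_{i=1}^{n} \frac{Y_i X_i^T \beta - b(X_i^T \beta)}{\phi}.$$

In order to maximize $\ell_n(\beta)$ to form $\hat{\beta}$, it is possible to carry out iteratively reweighted least squares (IRLS) regressions [40, 41]. Finally, the coefficients $\hat{\beta}$ can be regarded as the result of a single weighted least squares regression, the last one in the IRLS sequence.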
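The IRLS procedure mentioned above can be illustrated with a minimal sketch for the Bernoulli case with logit link. This is not the caret-based R implementation used in this research; the function names `solve_linear` and `irls_logistic` and the toy data in the usage note are hypothetical, and the sketch assumes that the classes are not perfectly separable, so that the weights stay strictly positive.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def irls_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit a Bernoulli GLM with logit link by IRLS.

    X is a list of rows (each row already contains the intercept column),
    y is a list of 0/1 labels. Returns the estimated coefficients.
    """
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        eta = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [m * (1.0 - m) for m in mu]  # IRLS weights for the logit link
        # Working response of the weighted least squares step.
        z = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]
        xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, X)) for k in range(p)]
                for j in range(p)]
        xtwz = [sum(wi * r[j] * zi for wi, r, zi in zip(w, X, z)) for j in range(p)]
        new_beta = solve_linear(xtwx, xtwz)
        converged = max(abs(nb - ob) for nb, ob in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta
```

For instance, `irls_logistic([[1.0, float(v)] for v in range(8)], [0, 0, 1, 0, 1, 1, 0, 1])` fits an intercept and one slope to eight illustrative observations; at convergence the score equations $\sum_i (Y_i - \mu_i) X_i = 0$ hold up to numerical precision.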

Specifically in this research, it is shown that the pattern of link formation in various PTNs can be well explained through a GLM. In this case, the response $Y$ takes a categorical value: whether or not a link exists between two stops. The independent variables, $X_1, \ldots, X_p$, correspond to several indexes describing the similarity between stops. The probability density function f is characterized as a binomial distribution. The similarity indexes utilised as predictors are described in the section labeled "Link Building Process in PTNs" and in the Supplementary materials section.

In order to check the importance of predictors using the t-test, it is required to examine whether $\hat{\beta}_j$ is normally distributed. This is checked by applying the Anderson–Darling test [30] with a significance level $\alpha$. The considered hypotheses are as follows:
(i) Null hypothesis $H_0$: "$\hat{\beta}_j$ is normally distributed"
(ii) Alternative hypothesis $H_a$: "$\hat{\beta}_j$ is not normally distributed"

If the p-value < $\alpha$, $H_0$ is rejected and $H_a$ is accepted. Otherwise, $H_0$ is taken.

The R package nortest was utilised for the calculation of the Anderson–Darling test.

Once it has been verified that $\hat{\beta}_j$ is normally distributed, t-tests [42] are carried out with a level of significance $\alpha$. This allows us to know the contribution of each individual explanatory variable, $X_j$, to the model. The possible hypotheses are as follows:
(i) Null hypothesis $H_0$: "explanatory variable $X_j$ has a slope that is equal to zero, that is, $X_j$ is not useful to predict $Y$, $\beta_j = 0$"
(ii) Alternative hypothesis $H_a$: "explanatory variable $X_j$ has a slope that is different from zero, that is, $X_j$ contributes to predicting $Y$, $\beta_j \neq 0$"

The results obtained in the test can be:
(i) If the p-value < $\alpha$, $H_0$ is rejected and $H_a$ is taken
(ii) Otherwise, $H_0$ is accepted and $H_a$ is rejected

Next, the importance of the predictors is determined using a t statistic estimator, which is defined as the ratio of the estimated parameter to the standard error SE of the estimation,

$$t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}.$$

For a given $SE(\hat{\beta}_j)$, the higher the absolute value of the estimator $\hat{\beta}_j$, the higher the absolute value of $t_j$.

Under the null hypothesis, a high absolute value of the t statistic provides evidence against it, similarly to when $\hat{\beta}_j$ is very far from the hypothesized value.

In order to implement the GLM model and to evaluate the importance of the predictors, the caret and vip packages in R are used.
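As a language-agnostic illustration of this ranking (not the caret or vip calls themselves), the following sketch orders predictors by the absolute value of the ratio between the estimated coefficient and its standard error. The function name `rank_predictors` is hypothetical, and the numbers in the usage note are invented for illustration only, not results of this study.

```python
def rank_predictors(estimates, std_errors):
    """Return (name, |t|) pairs sorted by decreasing |t| = |estimate / SE|.

    Both arguments are dictionaries keyed by predictor name.
    """
    scores = {name: abs(estimates[name] / std_errors[name]) for name in estimates}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, with the illustrative inputs `{"dsimcn": 2.0, "dsimpa": -0.9, "dsimlp": 0.5}` and standard errors `{"dsimcn": 0.5, "dsimpa": 0.1, "dsimlp": 0.25}`, the predictor with the largest absolute t statistic is listed first even though its coefficient is negative and not the largest in magnitude.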

2.2.2. Topological Representation of PTNs

A PTN can be represented in a topological space named L-Space in which a network is mapped as a graph G = (N; L), where N is the set of nodes symbolizing the stops and L is the set of links established between them. In the L-Space, one node represents one stop, and one link means a union between two consecutive stops. This tells us that there is a link between two stops, if one stop is the successor of the other on a route.
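The L-Space construction described above can be sketched as follows. The input format (a dictionary mapping a route identifier to its ordered list of stops) and the function name `build_l_space` are assumptions made for this example; they are not the data structures of ProcessPTNInf.py or ConstGraphCalcSim.R.

```python
def build_l_space(routes):
    """Build the L-Space node set and link list from ordered stop sequences.

    routes maps a route identifier to the ordered list of stops it serves.
    Links are kept as a list, so multiple links (and loops, if a stop is
    repeated consecutively) are preserved at this stage.
    """
    nodes = set()
    links = []
    for stops in routes.values():
        nodes.update(stops)
        # One link per pair of consecutive stops on the route.
        for a, b in zip(stops, stops[1:]):
            links.append((a, b))
    return nodes, links
```

With two hypothetical routes `{"r1": ["A", "B", "C"], "r2": ["B", "C", "D"]}`, the link between B and C appears twice, once per route, which is the kind of multiple link removed in the later simplification step.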

2.2.3. Link Building Process in PTNs

In each network, it was analyzed whether a GLM could adequately describe the link formation process. As was explained in Section 2.2.1, the caret package in R was used in order to carry out the stages of training and validation of the model. The process was as follows.

The L-Space was constructed. All the loops and multiple links were deleted from the graph, obtaining a simplified graph, in which the maximal connected components were identified. Then, with the largest cluster, the giant component (GC), the following operations were performed:

The numbers of pairs of connected and unconnected nodes were computed, and several similarity measures were calculated for each pair. Local, quasilocal, and global methods were applied.
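The simplification and giant component extraction described above can be sketched in a few lines. The function names are hypothetical, and the graph is stored as an adjacency dictionary of sets, which is a choice made for this example rather than the igraph-based representation used in the study.

```python
from collections import deque


def simplify(nodes, links):
    """Return an undirected adjacency dict without loops or multiple links."""
    adj = {v: set() for v in nodes}
    for a, b in links:
        if a != b:  # drop loops; sets absorb repeated links
            adj[a].add(b)
            adj[b].add(a)
    return adj


def connected_components(adj):
    """Return the maximal connected components as a list of node sets (BFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    component.add(u)
                    queue.append(u)
        components.append(component)
    return components


def giant_component(adj):
    """Return the adjacency dict restricted to the largest component."""
    largest = max(connected_components(adj), key=len)
    return {v: adj[v] & largest for v in largest}
```

The pairs of connected and unconnected nodes can then be enumerated over the node set of the returned giant component.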

The local similarity indexes used were: Adamic-Adar (dsimaa) [43], common neighbours (dsimcn), cosine (dsimcos) [44], cosine similarity on L+ (dsimcos_l) [45], hub promoted (dsimhpi) [46], jaccard (dsimjaccard) [47], hub depressed (dsimhdi) [3, 7], Leicht–Holme–Newman (dsimlhn_local) [48], preferential attachment (dsimpa) [49], and Sørensen (dsimsor) [50]. The global similarity measures used were: average commute time (dsimact) [37], normalized average commute time (dsimact_n) [51], Katz (dsimkatz) [52], L+ directly (dsiml) [45], matrix forest (dsimmf) [53], and random walk with restart (dsimrwr) [54]. Finally, the quasilocal measures of the similarity utilised were graph distance (dsimdis) and local path (dsimlp) [6, 55]. These indexes are described in detail in the Supplementary materials section.
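To make the local indexes concrete, the sketch below computes several of them for one pair of nodes from an adjacency dictionary. It uses the textbook definitions of these indexes, which are assumed here to agree with those in the Supplementary materials; the function name `local_similarities` is hypothetical, and the dictionary keys simply reuse the variable labels listed above. Both nodes are assumed to have at least one neighbour, as is the case inside a giant component with more than one node.

```python
import math


def local_similarities(adj, x, y):
    """Compute neighbourhood-based similarity indexes for the pair (x, y)."""
    nx, ny = adj[x], adj[y]
    common = nx & ny
    kx, ky = len(nx), len(ny)
    return {
        "dsimcn": len(common),                       # common neighbours
        "dsimjaccard": len(common) / len(nx | ny),   # Jaccard
        # Adamic-Adar: a common neighbour has degree >= 2, so log > 0.
        "dsimaa": sum(1.0 / math.log(len(adj[z])) for z in common),
        "dsimpa": kx * ky,                           # preferential attachment
        "dsimsor": 2.0 * len(common) / (kx + ky),    # Sorensen
        "dsimcos": len(common) / math.sqrt(kx * ky), # cosine (Salton)
        "dsimhpi": len(common) / min(kx, ky),        # hub promoted
        "dsimhdi": len(common) / max(kx, ky),        # hub depressed
    }
```

Evaluating this function over every connected and unconnected pair of the giant component yields one row of features per pair, which is the form of input expected by the classification model.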

The model takes the values that describe the different similarities between pairs of nodes as input variables (features) and the indication of whether or not there is a link between them as the output variable. In order to build the model, supervised learning is used. In this technique, the relations between the input variables (features) and the output variable (target) are learnt. That is, from labeled examples (for each of which the correct input and output are known), the algorithm learns to predict the value of the output for new cases not utilised in the learning (training) process. For each PTN, a set of data is provided with different features, and the outcome or target (label) is known for each case (pair of nodes). The goal is to predict the label of new cases (pairs of nodes) with the minimum possible error. Since the outcome variable is a categorical value, whether or not a link exists, the prediction corresponds to a binary classification problem.

Crossvalidation is used as a procedure to estimate the model. Instead of splitting the dataset into a training and a test subset, in the crossvalidation mechanism, $k$ equal partitions (folds) of the dataset are made. The model is trained $k$ times: each time one of the partitions is taken as a test set, and the model is trained with the rest of the data (with the remaining $k - 1$ folds). Each fold is used once as a test set. Finally, predictions exist for the whole dataset. This process results in $k$ estimates of a parameter related to the effectiveness of the model. An average of an estimated parameter (EP) can be made,

$$\overline{EP} = \frac{1}{k} \sum_{i=1}^{k} EP_i.$$
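The crossvalidation loop described above can be sketched independently of any modeling library. The names `k_fold_indices` and `cross_validate`, and the callback `fit_and_score` (which receives the training and test indices and returns one estimated parameter, for example the accuracy), are hypothetical helpers introduced for this example; the study itself relies on the caret package in R.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n, k, fit_and_score, seed=0):
    """Average an estimated parameter over k train/test splits.

    fit_and_score(train, test) must train on the train indices and return
    the estimated parameter computed on the test indices.
    """
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test in enumerate(folds):
        # The remaining k - 1 folds form the training set.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train, test))
    return sum(scores) / k
```

Each index appears in exactly one test fold, so every observation is predicted once by a model that did not see it during training.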

EP can be accuracy (14), area under the curve (AUC) [56], and kappa [57].

These parameters are described as follows:

Accuracy:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{14}$$

where TP: true positives, TN: true negatives, FP: false positives, and FN: false negatives.

AUC: AUC represents the probability that a classifier ranks a randomly selected positive instance higher than a randomly chosen negative instance. This EP can be defined, in general terms, as follows, given a binary classification task that has $m$ positive and $n$ negative instances, respectively. The outputs of a binary classifier $c$ can be considered as a strictly ordered list for these instances, and $1_X$ denotes the indicator function of a set $X$. Therefore, $c$ is a fixed classifier, where $x_1, \ldots, x_m$ are its outputs on the positive instances and $y_1, \ldots, y_n$ are its outputs on the negative instances. The AUC related to $c$ is described [58] as

$$AUC = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} 1_{x_i > y_j}}{m n},$$

which is the value of the Wilcoxon–Mann–Whitney statistic [59].

Kappa: this EP is defined as

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where

$$p_o = \frac{TP + TN}{TP + TN + FP + FN}$$

and

$$p_e = \frac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{(TP + TN + FP + FN)^2}.$$

Finally, an independent final estimation of the accuracy, recall, precision, and specificity of the model can be obtained using the validation set. The last three parameters are

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}.$$

In addition, the confusion matrix, as an estimation of the provided solution, was obtained in the final validation for each PTN. Table 1 describes the general concept of the confusion matrix for a binary classification problem.
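These evaluation parameters can be computed directly from the four cells of a confusion matrix, and the AUC from the classifier scores through the Wilcoxon–Mann–Whitney statistic. The sketch below uses the standard definitions of each quantity; the function names are hypothetical, and the counts in the usage note are invented for illustration and are not results from any of the studied PTNs.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, kappa, recall, precision, and specificity."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n  # observed agreement p_o
    # Agreement expected by chance, from the row and column totals.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "accuracy": accuracy,
        "kappa": (accuracy - p_e) / (1.0 - p_e),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }


def auc_wmw(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = sum(1 for x in pos_scores for y in neg_scores if x > y)
    return wins / (len(pos_scores) * len(neg_scores))
```

For the illustrative counts TP = 40, TN = 50, FP = 5, FN = 5, the accuracy is 0.9 while kappa is approximately 0.80, which shows how kappa discounts the agreement that would be expected by chance. The pairwise AUC computation is quadratic in the number of instances and assumes strictly ordered scores, so it is only intended for small examples.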

The final validation was performed on 20% of the total samples.

The selection of the similarity measures to be used as input variables to the model required checking the existing correlation between them. To determine whether this correlation should be estimated using Spearman’s or Pearson’s method, we checked whether the variables were normally distributed. The Anderson–Darling test [60] was applied with a significance level equal to 0.05. The following hypotheses were used:(i)H0: “the sample comes from a normal distribution”(ii)Ha: “the sample does not come from a normal distribution”

If the p-value < 0.05, H0 is rejected; otherwise, H0 is accepted.

The R package nortest was utilised for the calculation of the Anderson–Darling test.

2.2.4. Study of the Efficiency

In a graph, G, the distance between the two nodes (i and j), d(i, j), is the number of links that form the shortest path between them. If there is no path between i and j, then d(i, j) = ∞. The efficiency between i and j [60] can be defined as

$$\mathrm{Eff}(i, j) = \frac{1}{d(i, j)}.$$

Since $\mathrm{Eff}(i, j)$ is estimated based on the shortest path length between node pairs, an increase in d(i, j) would result in a decrease in the efficiency between i and j.

In addition, the global efficiency of G can be described as

$$\mathrm{GlobEff}(G) = \frac{1}{N(N - 1)} \sum_{i \neq j} \frac{1}{d(i, j)},$$

where N is the number of nodes in G.

This parameter is the average of the efficiencies calculated over all pairs of nodes in G. For a given number of nodes N, GlobEff (G) increases with the addition of links. According to the previous definition, 0 ≤ GlobEff (G) ≤ 1, with the value 1 being reached for a complete graph [61].

GlobEff (G) has been estimated in several PTNs as one of their features [62, 63]. This research analyses the impact that the elimination of links between pairs of nodes, with certain similarity characteristics, has on the GlobEff of the GC of the simplified graph. The result could help to achieve better network planning, since, depending on which links are removed or built, a higher or lower GlobEff can be obtained. Common characteristics regarding efficiency in PTNs are also identified.
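The effect of removing one link on the global efficiency can be illustrated with a breadth-first search over an unweighted, undirected graph. This is a minimal sketch with hypothetical function names, assuming the graph is stored as an adjacency dictionary of sets; unreachable pairs contribute zero, consistent with d(i, j) = ∞.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return shortest path lengths (in links) from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def global_efficiency(adj):
    """Average of 1 / d(i, j) over all ordered pairs of distinct nodes."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += 1.0 / d
    return total / (n * (n - 1))


def efficiency_after_removal(adj, link):
    """Global efficiency once the given link is removed (adj is not modified)."""
    a, b = link
    pruned = {v: set(nbrs) for v, nbrs in adj.items()}
    pruned[a].discard(b)
    pruned[b].discard(a)
    return global_efficiency(pruned)
```

In a complete graph of three nodes the global efficiency is 1, and removing any one link lowers it to 5/6, because one pair of nodes is then separated by two links instead of one. Repeating this comparison for links selected according to a similarity index gives the kind of impact curve analysed in this research.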

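The relationship between GlobEff (G) and network density is also analyzed. This last characteristic, for undirected graphs such as PTNs, can be defined as

$$D = \frac{2|L|}{|N|(|N| - 1)},$$

where $|N|$ and $|L|$ are the numbers of nodes and links in the graph, respectively.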

2.2.5. Correlations between Topological Measurements

Certain investigations have been performed focusing on the study of centrality measures [35] in a PTN. In [36], the authors study some centralities in 58 existing social networks. Further studies examine the correlation between centrality metrics: using Pearson, Spearman, and Kendall methods [37]. The authors use the degree as the base to approximate three other metrics: closeness, betweenness, and eigenvector. They check the correlation between centrality metrics in several real networks, categorized as social, technological, and biological networks. Authors find that the betweenness occupies the highest coefficient, closeness is at the middle level, while eigenvector fluctuates dramatically between networks. They also put forward the idea that rank correlation performs better than the Pearson one in scale-free networks. In [40], several different real-world network graphs, representing several contexts (social club network, birds’ social network, word adjacency network, airports network, games network, and related book network) with the number of nodes ranging from 34 to 332, were used. The authors classify the main centrality metrics into two categories: degree-based (degree and eigenvector centralities) and shortest path-based (betweenness, closeness, distance, and eccentricity centralities). They analyze the correlation between the aforementioned centrality metrics, showing that two degree-based centrality metrics (degree and eigenvector centrality) are highly correlated across all the studied networks. There is predominantly a moderate level of correlation between any two of the shortest path-based centrality metrics (betweenness, closeness, distance, and eccentricity). The authors explain that a poor correlation exists between a degree-based centrality metric and a shortest path-based centrality metric for regular random networks. 
As the variation in the degree distribution of the nodes increases, the correlation coefficient between the two classes of centrality metrics increases. Reference [34] uses a regression model to show a correlative relationship between passenger flow distribution and the conventional network properties (in/out degree, betweenness, and closeness) for the train system in Hague and Amsterdam cities.

Because the categories of social, technological, and biological networks can encompass networks of very different types, our investigation focuses specifically on the study of centralities in PTNs. These correlations are studied in the simplified graph. Specifically, the following centralities are calculated:
(i) The degree of a node $i$, for an undirected graph, G, such as a PTN, is [26, 64]

$$k_i = \sum_{j} a_{ij},$$

where $a_{ij}$ is the element of the adjacency matrix, A, such that $a_{ij}$ = 1 if the node $i$ is linked to node $j$, and 0 otherwise.
(ii) The minimum distance between two nodes in G, l, is the length of the shortest path between them.
(iii) The betweenness centrality of a node $v$ in G is [26, 65]

$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}},$$

where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through $v$.
(iv) Regarding the eigenvector centrality of a node $i$ in G [26, 65, 66]: $\lambda_1, \ldots, \lambda_N$ are the eigenvalues of the adjacency matrix A = $(a_{ij})$ of G. Then, the largest eigenvalue of matrix A is $\lambda_{max}$, with an eigenvector $e = (e_1, \ldots, e_N)^T$ such that $A e = \lambda_{max} e$. The eigenvector centrality for node $i$, represented as $EC(i)$, can be defined as

$$EC(i) = e_i = \frac{1}{\lambda_{max}} \sum_{j} a_{ij} e_j.$$

(v) Pagerank, PR, of a node $i$ in G, is [26, 66–68]

$$PR(i) = \frac{1 - d}{N} + d \sum_{j} \frac{PR(j)}{k_j^{out}},$$

where [26] $N$ is the number of nodes in G, $PR(j)$ is the pagerank of a node $j$, and $k_j^{out}$ is the outdegree of node $j$, the sum being executed over the nodes $j$ pointing towards $i$. In the case of the PTNs, it is considered that G is an undirected graph; therefore, $k_j^{out} = k_j$. $d$ is the damping parameter, $d$ ∈ [0, 1].
(vi) A hub is a node that points to many relevant nodes, and an authority node is one that is pointed to by many important nodes. Both are based on the eigenvectors related to the highest eigenvalues of the matrices $A A^T$ and $A^T A$. The hub centrality of the node i, denoted by HC (i), is the i-th entry of the vector y satisfying the equation

$$A A^T y = \lambda y,$$

with $\lambda$ the largest eigenvalue of $A A^T$. Similarly, the authority of a node i, symbolized by AC (i), is the i-th entry of the vector x satisfying the equation

$$A^T A x = \lambda x,$$

with $\lambda$ the largest eigenvalue of $A^T A$. For an undirected graph, such as a PTN, the adjacency matrix A is symmetric, and the two scores, AC(i) and HC(i), are identical.
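Some of these centralities can be approximated with simple fixed-point iterations on an undirected graph. The sketch below is illustrative only and is not the CalcCentralities.py or CalcCentralities.R code; the function names and default parameters (damping 0.85, 200 iterations) are assumptions. The graph is assumed to be connected and without isolated nodes, and the eigenvector iteration is applied to A + I rather than A, a standard shift that leaves the eigenvectors unchanged while avoiding oscillations on bipartite graphs.

```python
def degree(adj):
    """Degree of every node in an undirected adjacency dict."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def pagerank(adj, d=0.85, n_iter=200):
    """PageRank by fixed-point iteration, with out-degree equal to degree."""
    n = len(adj)
    pr = {v: 1.0 / n for v in adj}
    for _ in range(n_iter):
        pr = {v: (1.0 - d) / n + d * sum(pr[u] / len(adj[u]) for u in adj[v])
              for v in adj}
    return pr


def eigenvector_centrality(adj, n_iter=200):
    """Principal eigenvector of A via power iteration on A + I, scaled to max 1."""
    x = {v: 1.0 for v in adj}
    for _ in range(n_iter):
        nxt = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        top = max(nxt.values())
        x = {v: value / top for v, value in nxt.items()}
    return x
```

On a path of three stops A, B, C, the middle stop obtains the highest value for all three measures, which illustrates why degree-based centralities tend to be correlated with each other. Correlations between the resulting vectors can then be computed with a rank-based coefficient.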

3. Results and Discussion

3.1. Link Building Process in PTNs

As described in Section 2.2.2, the network was represented in the L-Space. All loops and multiple links were eliminated, obtaining a simplified graph, in which the maximal connected components were computed. For all analysed networks, Table 2 contains the information collected after this process: the number of links, nodes, and clusters in the simplified graph, as well as the number of nodes and links present in the largest cluster (GC). Some of the networks have several clusters. The detection of clusters in PTNs can also allow us to find urban groups that are strongly connected through transportation. The comparison between PTN clusters and urban agglomerations can be used to estimate whether the PTNs are capable of supporting these human distributions [69]. Identifying under- and overserviced areas can also help in policy decisions, including infrastructure planning and local development [70].

As explained in Section 2.2.1, we used the caret package in R to build the model. As described in Section 2.2.3, the model was trained k times: each time one of the partitions was taken as a test set, and the model was trained with the rest of the data (with the remaining folds). Each fold was used once as a test set. Finally, predictions exist for the whole dataset. This process results in k estimates of the accuracy, AUC, and kappa parameters. Additionally, if two similarity measures had a correlation greater than 0.9, one of them was not considered in the prediction. Table 3 shows the similarity indexes that present a Spearman correlation higher than 0.9 with another.

In order to decide which method, Pearson or Spearman, should be used for the calculation of correlations, the Anderson–Darling test was applied with a significance level α = 0.05. All networks showed a p-value < 0.05. Therefore, the null hypothesis, H0, was rejected, inferring that the distributions did not follow a normal pattern. Consequently, Spearman's method was used to calculate correlations.
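
A minimal Python sketch of the Spearman correlation and of the 0.9 correlation filter mentioned above is given below. The function names and the toy data are hypothetical, and the rule of keeping the first measure of each highly correlated pair is an assumption, since the text does not state which of the two measures was discarded.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def drop_correlated(features, threshold=0.9):
    """Keep a measure only if |rho| <= threshold with every measure already kept."""
    kept = []
    for name in features:
        if all(abs(spearman(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical similarity scores for five node pairs.
scores = {"sim_a": [1, 2, 3, 4, 5], "sim_b": [1, 4, 9, 16, 25], "sim_c": [2, 1, 4, 3, 5]}
print(drop_correlated(scores))
```

Here `sim_b` is a monotonic transformation of `sim_a`, so their Spearman correlation is 1 and only one of them is retained.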

The importance of each predictor in the model was estimated by calculating the absolute value of the statistic [71] defined in Section 2.2.1. The importance of the predictors is shown in Table 4.

Tables 5 and 6 show, for each PTN, the average of the estimators (accuracy, AUC, and kappa) calculated over the k times that the model was trained. Since the number of links between pairs of nodes was much lower than the number of unconnected pairs of nodes, the down-sampling approach was utilised, randomly removing observations from the majority class. In the Thunder Bay Transit network, in order to improve the results, artificial balanced samples were generated according to a smoothed bootstrap procedure [60], using the ROSE package in R.
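
The down-sampling step can be sketched in Python as follows. This is an illustrative simplification under stated assumptions, not the caret routine used in the study: the function name `downsample`, the fixed seed, and the toy data are hypothetical.

```python
import random

def downsample(rows, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for lab, members in by_class.items():
        balanced.extend((row, lab) for row in rng.sample(members, n_min))
    rng.shuffle(balanced)
    return balanced

# 3 linked node pairs (label 1) against 97 unlinked pairs (label 0).
rows = list(range(100))
labels = [1] * 3 + [0] * 97
print(len(downsample(rows, labels)))
```

After down-sampling, both classes contribute the same number of observations, at the cost of discarding most of the unlinked pairs.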

Table 7 shows, in each network, the confusion matrix [72] obtained in the final validation. In Table 8, accuracy, recall, precision, and specificity parameters are presented.
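
For reference, the evaluation metrics can be derived from the four cells of a binary confusion matrix as in the following Python sketch. The function name and the counts in the example are hypothetical and are not taken from Table 7; Cohen's kappa is included because it is reported for the crossvalidation stage.

```python
def classification_metrics(tp, fp, fn, tn):
    """Metrics from a 2x2 confusion matrix (positive class = link exists)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)        # sensitivity
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the marginal totals.
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "specificity": specificity, "kappa": kappa}

# Hypothetical counts, for illustration only.
print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
```

Kappa corrects the accuracy for the agreement expected by chance, which makes it more informative than accuracy alone when links are much rarer than unconnected pairs.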

All networks showed good results when down-sampling was applied, according to the parameters chosen for the evaluation of the model. In the crossvalidation process, average accuracy and AUC values were higher than 0.99, and kappa was larger than 0.93. In the validation stage, accuracy and recall showed values higher than 0.99, and specificity had a value equal to 1. The only exception was the Thunder Bay Transit network, where it was necessary to apply the ROSE method in order to achieve better kappa and precision values.

As a result, the process of building links was appropriately modeled using a GLM that takes several similarity measures between nodes as input variables. The response variable, which establishes whether or not a link exists between a pair of nodes, is appropriately described by a binomial probability density function. The link function used is the logit function, as explained in Section 2.2.1. The model has the novelties described in Section 1 with respect to other models that have already been developed for PTNs.
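
In compact form, a GLM with a binomial response and a logit link can be written as follows. The notation is illustrative and may differ from that of Section 2.2.1: $Y_{ij}$ denotes the presence (1) or absence (0) of a link between nodes $i$ and $j$, $s_m(i, j)$ the $m$-th similarity measure retained as a predictor, and $\beta_m$ the fitted coefficients; all of these symbols are assumptions introduced here.

```latex
Y_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \qquad
\operatorname{logit}(p_{ij}) = \ln\frac{p_{ij}}{1 - p_{ij}}
  = \beta_0 + \sum_{m=1}^{M} \beta_m \, s_m(i, j), \qquad
p_{ij} = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{m=1}^{M} \beta_m \, s_m(i, j)\right)}
```

A pair of nodes is then classified as linked when the estimated probability $p_{ij}$ exceeds a chosen threshold.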

In most networks, the similarity measure with the highest influence was dsimdis, followed by dsimact. In addition, dsimcos_l and dsimlp showed high or moderate importance in some networks.

3.2. Study of Trip Times

The trip times are analyzed in order to identify commonalities between networks. Several statistical parameters are calculated (average, standard deviation, median, mode, maximum, and minimum values). The results and the frequency distribution are displayed in Table 9 and Figure 1, respectively.

The cumulative probability distributions are also checked. They are shown in Figure 2. The stats package in R was used. The similarity between two distributions is examined by applying the Kolmogorov–Smirnov test [73]. A significance level equal to 0.05 is taken, and the following hypotheses are considered:
(i) Null hypothesis (H0): “the samples come from the same distribution.”
(ii) Alternative hypothesis (Ha): “the samples come from different distributions.”

If a p-value < 0.05 is obtained in the test, the null hypothesis is rejected. Table S.1 shows the results obtained in the test.
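
The two-sample Kolmogorov–Smirnov comparison can be sketched in Python as follows. This is not the R routine used in the study: the sketch computes the statistic D (the largest gap between the two empirical cumulative distributions) and applies the usual large-sample critical value instead of computing a p-value, which is an assumption that holds only approximately for small samples. The function name and the sample data are hypothetical.

```python
import math

def ks_two_sample(x, y, alpha=0.05):
    """Return (D, reject_H0) for the two-sample Kolmogorov-Smirnov test."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:   # advance both ECDFs past the value v
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Large-sample critical value: c(alpha) * sqrt((n + m) / (n * m)).
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return d, d > critical

# Hypothetical trip times (seconds) from two networks with no overlap.
d, reject = ks_two_sample(range(50), range(100, 150))
print(d, reject)
```

When the statistic exceeds the critical value, H0 is rejected at the chosen significance level, which corresponds to obtaining a p-value below 0.05.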

It can be noted that similarities do not exist between the PTNs in relation to the trip times. All networks presented a high standard deviation. The lowest is 13.17 minutes (790.23523 seconds) and the highest is 11.67 hours (42,027.19610 seconds). This shows that the size, complexity, and variability of the available routes in the PTNs cause trip times to be highly inconsistent between routes. Trip times allow us to evaluate how travelers choose a service based on whether or not it is convenient. Trip times have been considered by some researchers to evaluate the performance of PTNs [74, 75].

3.3. Study of Efficiency
3.3.1. Local Efficiency

In all networks, a large majority of nodes showed a low local efficiency (≤0.20), as can be noted in Figures 3 and 4.

As was done with trip times, the similarity between local efficiency distributions is examined by applying the Kolmogorov–Smirnov test. A significance level equal to 0.05 is taken, and the following hypotheses are considered:
(i) Null hypothesis (H0): “the samples come from the same distribution.”
(ii) Alternative hypothesis (Ha): “the samples come from different distributions.”

If the p-value obtained in the test is <0.05, the null hypothesis is rejected.

The networks presented strong similarities in the cumulative distributions of local efficiency. The test yielded a p-value > 0.05 in all pairwise comparisons performed, as can be appreciated in Table S.2. In general, the low value of the local efficiency [76] reveals that, if a stop is unavailable, its neighbours remain connected to each other only through indirect paths rather than through direct connections.
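
The interpretation above can be illustrated with a minimal Python sketch of the local efficiency of a node, computed as the efficiency of the subgraph formed by its neighbours once the node itself is removed. The use of unweighted hop-count distances, the function name, and the toy graph are assumptions; the definition used in the study is the one given in Section 2.

```python
from collections import deque

def local_efficiency(adj, node):
    """Average inverse shortest-path length among the neighbours of `node`,
    using only paths that stay inside the neighbourhood."""
    nbrs = set(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    total = 0.0
    for s in nbrs:
        dist = {s: 0}
        queue = deque([s])
        while queue:                      # BFS restricted to the neighbours
            v = queue.popleft()
            for w in adj[v]:
                if w in nbrs and w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(1.0 / d for t, d in dist.items() if t != s)
    return total / (k * (k - 1))

# Triangle a-b-c with a pendant stop d attached to c.
g = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(local_efficiency(g, "a"), local_efficiency(g, "c"))
```

In this toy graph, the neighbours of `a` are directly linked, so its local efficiency is 1, whereas removing `c` isolates `d` from the rest, which lowers the local efficiency of `c`.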

3.3.2. Global Efficiency

The GlobEff was calculated in the GC of the simplified graph. According to the results depicted in Table 10, the higher the density of the network, the higher the GlobEff.
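
As an illustration, the global efficiency of a connected graph can be computed as the average of the inverse shortest-path lengths over all ordered pairs of nodes. The Python sketch below assumes unweighted hop-count distances and an adjacency-list representation; the function names and the toy graphs are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop-count distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def global_efficiency(adj):
    """Mean of 1/d(s, t) over ordered pairs; unreachable pairs contribute 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += 1.0 / d
    return total / (n * (n - 1))

triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(global_efficiency(triangle), global_efficiency(path))
```

The denser triangle reaches the maximum efficiency of 1, while the sparser path on the same three nodes scores lower, in line with the relation between density and GlobEff noted above.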

Most of the analyzed networks presented a small GlobEff (<0.20). Some pieces of research use the GlobEff as a parameter to compare PTNs [77, 78], and others apply it to identify hubs [79, 80]. In the latter, the importance of a node is ranked by comparing the changes in PTN efficiency after eliminating the node. In contrast, this research analyses the variation in GlobEff when links with certain similarity characteristics are removed. The results are shown in Table 11. Similarity measures with a correlation higher than 0.9 with another measure were not considered. It can be noted that, in most of the networks, the 75% reduction in GlobEff was reached most quickly when links were removed according to dsimpa and dsimlp, and most slowly when they were removed according to dsimcos_l. Figures 5–7 show the variation in GlobEff when certain links are removed.

Table 11 shows, for each similarity measure, the number of removed links that causes the reduction of GlobEff by 75%.
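
The counting procedure summarised in Table 11 can be sketched in Python as follows. The sketch assumes that links are supplied already ranked by a similarity measure (the ranking direction is not specified here and is left to the caller), that distances are unweighted hop counts, and that efficiency is recomputed after every removal; the function names and the toy path graph are hypothetical.

```python
from collections import deque

def global_efficiency(adj):
    """Mean inverse hop distance over ordered node pairs (0 if unreachable)."""
    n = len(adj)
    total = 0.0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(1.0 / d for t, d in dist.items() if t != s)
    return total / (n * (n - 1)) if n > 1 else 0.0

def links_to_reduce(adj, ranked_links, fraction=0.75):
    """Number of links removed, in the given order, until the efficiency has
    dropped by at least `fraction` of its initial value (None if never)."""
    work = {v: set(nbrs) for v, nbrs in adj.items()}  # do not mutate the input
    target = (1 - fraction) * global_efficiency(work)
    for count, (u, v) in enumerate(ranked_links, start=1):
        work[u].discard(v)
        work[v].discard(u)
        if global_efficiency(work) <= target:
            return count
    return None

# Path a-b-c-d with links ranked by a hypothetical similarity score.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(links_to_reduce(path, [("b", "c"), ("a", "b"), ("c", "d")]))
```

Running the same routine with rankings produced by different similarity measures gives one count per measure, which is the quantity compared across measures.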

3.3.3. Correlations between Topological Measurements

The eigenvector, betweenness, pagerank, degree, hub, and authority centralities were calculated in order to study the correlation between them. The correlation of these variables with the number of vehicles arriving at and departing from a stop weekly was also estimated. In order to know which method, Pearson or Spearman, should be used in the calculation, the Anderson–Darling test was applied with a significance level α = 0.05. In this way, it could be known whether or not the variables were normally distributed. The test yielded a p-value < 0.05 for all variables, so the null hypothesis H0 was rejected, and the alternative hypothesis Ha was accepted.

The correlations obtained by applying Spearman's method are shown in Tables S.3–S.12. In all networks, the eigenvector centrality presented a strong correlation with the hub and authority centralities. Pagerank showed a moderate, high, or very high correlation with the degree. Therefore, in these networks, a node with a high degree usually has a significant influence. The pagerank and degree only presented a moderate or high correlation with betweenness in some networks, indicating that, specifically in these few networks, a node with a high degree also usually presents an important level of connectivity. Eigenvector centrality and degree exhibited a low or very low correlation in most networks. Furthermore, the number of weekly buses arriving at and departing from a bus stop showed no strong correlation with any of the centrality measures. Strong correlations between degree and pagerank, and between degree and betweenness, have also been found in some Chinese PTNs [78].

4. Conclusions

Regarding the formation of links between stops, this research shows that it can be correctly explained through a generalized linear model that has certain similarity measures as input variables. Although the similarity measures that explain the model differ among networks, in most of them dsimdis has the highest importance, with a value equal to 100. In addition, dsimcos_l and dsimlp presented relevant importance in some PTNs, with values higher than 30. Additionally, dsimact and dsimpa showed values equal to 100 and larger than 10, respectively, in certain PTNs.

Regarding travel times, these showed a high variability within networks (with standard deviations greater than 790.23 seconds), as well as very different cumulative probability distributions between networks (p-value < 0.05 in the Kolmogorov–Smirnov test).

The study of local efficiency reveals that its cumulative distributions have strong similarities across all networks (the Kolmogorov–Smirnov test showed a p-value > 0.05). The local efficiency showed values ≤0.2 in most PTNs. Similarly, the global efficiency exhibited reduced values (≤0.25). This seems to be a common feature of PTNs.

With respect to the centrality measures, they did not show a strong correlation with the flow of vehicles, suggesting that traffic dynamics in the network may be strongly influenced by parameters other than topological ones. In all networks, strong correlations of the eigenvector centrality with the hub and authority centralities were detected (with values higher than 0.80). The pagerank showed a moderate, high, or very high correlation with the degree (larger than 0.5 in all networks). Therefore, these correlation characteristics seem to be a commonality in PTNs.

This research can be continued with a detailed study of the interactions between the different modes of transport existing in the cities. A multimodal transportation system, embodied as a multiplex network, can be considered in order to face the problem of urban mobility. In a multiplex network, a node symbolizes a specific origin/destination stop, which exists in each of the network layers, whereas the links belong to different layers of interaction, determined by the transportation mode used to connect two nodes.

Data Availability

Information on the stops, routes, and trip times of AVL, CFL, RGTR, and TICE; Island Transit; Lanta; Linja-Karjala Oy; Metlink; PPTC, ROPIT; Sage; STAR; Thunder Bay Transit; and TransAntofagasta was retrieved from the operating companies’ public web sites, the Deconet Public Transport Network Data, and GTFS Data Exchange repositories.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially funded by Telefonica Chair at Francisco de Vitoria University.

Supplementary Materials

Supplementary Material includes (i) description of similarity measures (local, global, and quasilocal methods), (ii) tables related to the study of the trip times, (iii) tables regarding analysis of the local efficiency, and (iv) tables related to correlations between centrality measures.