Abstract

We seek to quantify the extent of similarity among nodes in a complex network with respect to two or more node-level metrics (like centrality metrics). In this pursuit, we propose the following unit disk graph-based approach: we first normalize the values for the node-level metrics (using the sum of the squares approach) and construct a unit disk graph of the network in a coordinate system based on the normalized values of the node-level metrics. There exists an edge between two vertices in the unit disk graph if the Euclidean distance between the two vertices in the normalized coordinate system is within a threshold value (ranging from 0 to $\sqrt{k}$, where k is the number of node-level metrics considered). We run a binary search algorithm to determine the minimum value for the threshold distance that would yield a connected unit disk graph of the vertices. We refer to “1 − (minimum threshold distance)/$\sqrt{k}$” as the node similarity index (NSI; ranging from 0 to 1) for the complex network with respect to the k node-level metrics considered. We evaluate the NSI values for a suite of 60 real-world networks with respect to both neighborhood-based centrality metrics (degree centrality and eigenvector centrality) and shortest path-based centrality metrics (betweenness centrality and closeness centrality).

1. Introduction

The weights assigned to nodes (a.k.a. vertices) in a complex network are either topology-based or domain-based or a combination of both. Centrality metrics quantify the topological importance of the nodes in a network [1]. There exist several centrality metrics, each proposed to capture a particular topological aspect; the four commonly studied centrality metrics are degree centrality (DEG), eigenvector centrality (EVC), betweenness centrality (BWC), and closeness centrality (CLC). While DEG and EVC could be categorized as neighborhood-based centrality metrics, BWC and CLC could be categorized as shortest path-based centrality metrics. More detailed information about these four centrality metrics and the procedures to individually compute them is available in [1]. Some of the examples for domain-based metrics are age, height, and weight of a patient (health information networks), number of publications and h-index of an author (citation networks), the processing capacity and the number of ports available for a router (communication networks), etc. Throughout the paper, the terms 'node' and 'vertex', 'link' and 'edge', and 'network' and 'graph' are used interchangeably. They mean the same.

Similarity assessment of nodes in complex networks has been so far conducted only at the node-level (e.g., [29]) and not at the network-level. To the best of our knowledge, all the similarity measures available in the literature quantify the extent of similarity between two nodes (like cosine similarity [10], matching index [11], etc.) or a set of nodes (the notion of equivalence classes [1], Rich club coefficient [12], etc.), but not among all the nodes in a network. It would not be appropriate to quantify the similarity among all the nodes in a network as a statistical function (like average or median) of the pair-wise similarity metric values. Also, the currently available similarity measures (like assortative index [11, 13]) use just one node-level metric (typically, the degree centrality metric) to assess the similarity between two vertices or a set of vertices. There is currently no quantitative measure available to rate the extent of similarity among all the vertices in a network with respect to a combination of node-level metrics (topological metrics and/or domain-based metrics). In this paper, we seek to develop a “network-level” node similarity index (NSI) to comprehensively quantify the extent of similarity (in a scale of 0 to 1) among “all” the nodes in a network with respect to a set of node-level metrics.

We propose that two vertices are to be considered “similar” with respect to a set of node-level metrics if the vertices are “closer” on the basis of the Euclidean distance between their coordinates (represented by the normalized values of the node-level metrics for the vertices in the network). For example, let BWC and CLC be the two node-level metrics considered. Let there be four vertices v1, v2, v3, and v4 in the network whose normalized BWC values are 0.49, 0.62, 0.11, and 0.79, respectively, and normalized CLC values are 0.38, 0.42, 0.87, and 0.48, respectively. Then, the coordinates of the vertices v1, v2, v3, and v4 are given by (0.49, 0.38), (0.62, 0.42), (0.11, 0.87), and (0.79, 0.48), respectively, wherein the first entry in each coordinate tuple is the normalized BWC value of the vertex and the second entry is its normalized CLC value. The Euclidean distance between vertices v1 and v2 is $\sqrt{(0.49 - 0.62)^2 + (0.38 - 0.42)^2}$ = 0.136 and the Euclidean distance between vertices v3 and v4 is $\sqrt{(0.11 - 0.79)^2 + (0.87 - 0.48)^2}$ = 0.784. According to our notion of similarity, vertices v1 and v2 are relatively more similar to each other, compared to vertices v3 and v4, with respect to BWC and CLC.
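
The two distances in this example can be checked with a few lines of Python. This is only a minimal sketch: the helper name `euclidean` and the dictionary layout are illustrative choices, while the coordinate values are the ones given in the example.

```python
import math

# Normalized (BWC, CLC) coordinates of the four vertices in the example.
coords = {
    "v1": (0.49, 0.38),
    "v2": (0.62, 0.42),
    "v3": (0.11, 0.87),
    "v4": (0.79, 0.48),
}


def euclidean(p, q):
    """Euclidean distance between two coordinate tuples of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d12 = euclidean(coords["v1"], coords["v2"])
d34 = euclidean(coords["v3"], coords["v4"])
print(round(d12, 3), round(d34, 3))  # 0.136 0.784
```

The smaller distance for the pair (v1, v2) is what makes that pair the more similar one under the proposed notion of similarity.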

Our approach to determine the NSI for a network is briefly summarized below (more details are in Section 2). Given a network of nodes and edges and a set of node-level metrics of interest (let k be the number of node-level metrics considered), we first determine the raw values for the nodes with respect to each of the k node-level metrics and individually normalize them (using the sum of the squares approach). We then distribute the vertices in a k-dimensional coordinate system wherein the coordinate of a vertex is a tuple represented by the normalized values for the k node-level metrics. We seek to construct a unit disk graph of the vertices in the k-dimensional normalized coordinate system (the range of coordinate values for any dimension is 0 to 1) such that two vertices are connected with an edge if the Euclidean distance between them is within a threshold value. We run a binary search algorithm to determine the minimum value for this threshold distance so that the unit disk graph of the vertices in the k-dimensional normalized coordinate system is connected. Our hypothesis is that the closer the vertices in this coordinate system (i.e., the more similar the vertices based on the node-level metric values), the smaller the value for the minimum threshold distance to obtain a connected unit disk graph. We hence propose the value for the node similarity index (NSI) to be 1 − (minimum threshold distance)/$\sqrt{k}$, where $\sqrt{k}$ is the maximum distance between any two vertices in a coordinate system based on the normalized values of the k node-level metrics considered for similarity assessment.

Some of the applications we envision for the proposed NSI measure and the normalized coordinate system of the node-level metrics used to compute the measure are as follows: a communication network with a smaller NSI value is more likely to have a single point of failure (one or a few routers would have more connections and carry more traffic than the rest) and is also more vulnerable to security attacks. A social network with a larger NSI value could be considered to comprise users who are more like peers (i.e., more similar to each other). Health professionals may decide on a single treatment plan or different treatment plans for the patients depending on the NSI value (with respect to a set of node-level metrics) for a health information network; if the values for the health metrics for all the patients are similar (a larger NSI value), then a single treatment plan for all the patients might be a good choice to at least begin with. Further, we could run clustering algorithms on the unit disk graph corresponding to the NSI value for a network and determine clusters of “similar” vertices that need not be directly connected to each other. For example, we could identify the cluster/set of vertices that have similar values for the health parameters and are physically spread (but need not be connected) over a health information network. Finally, the proposed model of unit disk graph-based node similarity index could be applied for outlier detection: for any unlabeled dataset of features and their normalized values, we could construct a unit disk graph (to represent the dataset) wherein the vertices are the data points (rows) in the dataset with coordinates corresponding to the normalized feature values and two vertices are connected if the Euclidean distance between the two vertices is within a threshold distance. The NSI value for such a dataset would quantify the extent of similarity among the data points with respect to the feature values. Any vertex with a degree of one in the unit disk graph (especially with a larger NSI value) is a potential candidate for being classified as an outlier.
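
As a rough illustration of the outlier-detection idea, the sketch below lists the degree-one vertices of a unit disk graph built over a small, made-up set of two-dimensional data points. The data points, the threshold value of 1.14, and the function name `degree_one_vertices` are all hypothetical and chosen only for illustration; in practice the threshold would be the minimum threshold distance found by the binary search of Section 2.

```python
import math


def degree_one_vertices(points, threshold):
    """Return indices of points that have exactly one neighbor within `threshold`."""
    result = []
    for i, p in enumerate(points):
        degree = sum(
            1
            for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= threshold
        )
        if degree == 1:
            result.append(i)
    return result


# Hypothetical normalized feature values: four tightly grouped points
# and one point far away from them.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.9, 0.9)]

# Hypothetical threshold, slightly above the distance (about 1.131)
# between the far point and its nearest neighbor (0.1, 0.1).
candidates = degree_one_vertices(points, threshold=1.14)
print(candidates)  # [4]
```

Only the far point ends up with a single neighbor, which is the kind of vertex the text flags as a potential outlier.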

In Sections 3 and 4, we consider a total of 60 real-world networks for similarity assessment and determine their NSI values with respect to a combination of node-level metrics. Since these networks belong to different domains, we do not consider the domain-based metrics as node-level metrics in our assessment calculations. We consider only the topology-based centrality metrics (DEG, EVC, BWC, and CLC) as the node-level metrics for the similarity assessment tests conducted in this paper. The rationale behind the choice of the above four centrality metrics is that they are widely considered as representatives of neighborhood-based (DEG and EVC) and shortest path-based (BWC and CLC) centrality metrics as well as are considered “prototypical” metrics representing three of the four classes of centrality metrics (radial versus medial metrics and volume versus length-based metrics) [14]. The DEG and EVC metrics are radial metrics that capture the volume (number) of walks originating or terminating at a node. The BWC metric is a medial metric capturing the volume of walks passing through a node and the CLC metric is a radial metric capturing the length of the walks originating or terminating at a node. Nevertheless, the NSI measure could be computed for any combination and any number of domain-based and/or topology-based node-level metrics for a complex network.

The rest of the paper is organized as follows: Section 2 describes the proposed procedure to construct the unit disk graph of the vertices based on a coordinate system comprising of the normalized values for the node-level metrics as well as explains the use of the binary search algorithm to determine the minimum threshold distance value that is required to obtain a connected unit disk graph; the section also analyzes the time complexity and memory space requirements of the binary search algorithm as well as illustrates the whole process using a toy network of eight vertices. Section 3 provides a brief overview of the 60 real-world networks used to evaluate the proposed unit disk graph-based NSI measure. Section 4 tabulates the results obtained for the NSI measure for the 60 real-world networks with respect to neighborhood-based centrality metrics and shortest path-based centrality metrics, considered separately as well as together. Section 4 compares the DEG-EVC NSI values and the BWC-CLC NSI values obtained for the real-world networks with that of the Pearson's correlation coefficient between these centrality metrics. Section 4 also compares the NSI values for the real-world networks based on a coordinate system of all the four centrality metrics with those of the NSI values of random networks with the same number of nodes and edges (generated using the well-known Erdos-Renyi model [15] and the Configuration model [16]); the purpose of this comparison is to highlight that the notion of node similarity captured by the unit disk graph-based NSI values is not a random phenomenon (unless the nodes in the real-world network are connected using randomly generated edges). 
Finally, we evaluate the correlation between the proposed NSI measure with that of recently proposed network-level measures (such as randomness index and spectral radius ratio for node degree) as well as classical network-level measures (such as assortative index and ratio of standard deviation to average path length) to showcase its uniqueness. Section 5 reviews the related work on similarity assessment in complex networks. Section 6 concludes the paper.

2. Node Similarity Index (NSI)

In this section, we describe the methodology to compute the proposed node similarity index (NSI) for a complex network. The NSI is a quantitative measure of the extent of similarity of the nodes in a complex network with respect to two or more node-level metrics. Let k be the number of node-level metrics considered for the similarity assessment. The sequence of steps to compute the NSI measure is first outlined below and then explained in detail. We use the graph shown in Figure 1 as a running example graph to illustrate the different steps in the procedure to compute the NSI measure.

(i) Compute the raw values of the k node-level metrics.

(ii) Normalize the values for each of the k node-level metrics using the sum of the squares approach.

(iii) Distribute the vertices in a k-dimensional coordinate system based on the normalized values for the node-level metrics.

(iv) Run a binary search algorithm to determine the minimum threshold distance that would be needed to obtain a connected unit disk graph of the vertices.

2.1. Raw Values for the Node-Level Metrics

As mentioned earlier, we use the centrality metrics as the basis to illustrate the procedure to compute the NSI for a network. Depending on the centrality metrics considered, we would need to use the appropriate algorithms to compute the (raw) values for each of these metrics for the vertices. In this paper, we consider the degree centrality (DEG), eigenvector centrality (EVC; [17]), betweenness centrality (BWC; [18, 19]), and closeness centrality (CLC; [20, 21]) for similarity assessment. The procedures to compute these metrics are available in several sources in the literature (e.g., [1]).

Here, we briefly outline the procedures, assuming the networks analyzed are modeled as undirected graphs and the edges are of unit weights:

(i) The DEG value for a vertex is simply the number of edges incident on the vertex.

(ii) The EVC of a vertex is computed using the power-iteration method [17], according to which we start with a vector of all 1s as the tentative principal eigenvector (that eventually has all the EVC values) and go through a series of iterations by multiplying (in each iteration) the adjacency matrix of the graph with the tentative principal eigenvector obtained in the previous iteration. At the end of an iteration, we normalize the entries in the resulting product vector (using the sum of the squares approach; see Section 2.2 for an example that illustrates this approach) and use the vector of normalized values as the tentative principal eigenvector for the next iteration. We stop the iterations when the values for the entries in the tentative principal eigenvector between two successive iterations converge to a certain level of precision.

(iii) The BWC of a vertex is obtained by running the Breadth First Search (BFS [21])-based version of Brandes' algorithm [19]: we run the BFS algorithm at each vertex to determine the number of shortest paths from the vertex to every other vertex in the graph. Using this information, for each vertex, we determine the fractions of the number of shortest paths between any two vertices that go through the vertex; the sum of all these fractions is the BWC of the vertex.

(iv) The CLC of a vertex is the inverse of the sum of the shortest path lengths (number of hops) from the vertex to every other vertex in the graph and is computed using the BFS algorithm.
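
The EVC and CLC procedures outlined above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used for the results in this paper: the adjacency-dictionary representation, the function names, the tolerance `tol`, the iteration cap `max_iter`, and the four-vertex example graph are all illustrative choices. The example graph is connected and deliberately contains a triangle, since plain power iteration can oscillate on bipartite graphs.

```python
import math
from collections import deque


def eigenvector_centrality(adj, tol=1e-6, max_iter=1000):
    """Power iteration on an undirected graph given as {vertex: set_of_neighbors}."""
    x = {v: 1.0 for v in adj}  # tentative principal eigenvector (all 1s)
    for _ in range(max_iter):
        # Multiply the adjacency matrix by the current vector.
        y = {v: sum(x[u] for u in adj[v]) for v in adj}
        norm = math.sqrt(sum(val * val for val in y.values()))
        if norm == 0:  # graph without edges
            return y
        # Normalize using the sum of the squares approach.
        y = {v: val / norm for v, val in y.items()}
        if max(abs(y[v] - x[v]) for v in adj) < tol:
            return y
        x = y
    return x


def closeness_centrality(adj, source):
    """Inverse of the sum of BFS hop counts from `source` (connected graph assumed)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return 1.0 / total if total > 0 else 0.0


# Hypothetical example graph: a triangle (0, 1, 2) with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
evc = eigenvector_centrality(adj)
clc = {v: closeness_centrality(adj, v) for v in adj}
```

In this small graph, vertex 2 obtains the largest EVC and CLC values, and the symmetric vertices 0 and 1 obtain identical values.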

2.2. Normalization of the Raw Values for the Node-Level Metrics

For each node-level metric, we normalize the raw values for the vertices and transform the values to a scale of 0 to 1. We use the sum of the squares approach for the normalization. As part of this process, we first obtain the square root of the sum of the squares of the raw node-level metric values of the vertices and then divide each of the raw values by this square root value.
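
The normalization step can be sketched as below. The function name `normalize` and the raw values are hypothetical (they are not the values of the example graph in Figure 1); the sketch only shows the division by the square root of the sum of the squares, and it assumes nonnegative raw values so that the results fall between 0 and 1.

```python
import math


def normalize(raw_values):
    """Divide each raw value by the square root of the sum of the squares."""
    norm = math.sqrt(sum(x * x for x in raw_values))
    if norm == 0:  # all raw values are zero; nothing to scale
        return [0.0 for _ in raw_values]
    return [x / norm for x in raw_values]


# Hypothetical raw metric values for four vertices.
raw = [3, 4, 0, 0]
normalized = normalize(raw)
print(normalized)  # [0.6, 0.8, 0.0, 0.0]
```

After this step, the squares of the normalized values of a metric sum to 1 across the vertices.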

For example, to obtain the normalized DEG values of the vertices in Figure 1, we first obtain the square root of the sum of the squares of the raw DEG values, which is 10.29. We then divide each of the raw DEG values by 10.29 to obtain the normalized DEG values of the vertices. Figure 2 displays the normalized centrality values of the vertices in the example graph.

2.3. Distribution of the Vertices in a Coordinate System of the Normalized Node-Level Metric Values

We now distribute the vertices in a coordinate system of the normalized values for the node-level metrics. Each node-level metric is considered as a dimension. If the number of node-level metrics considered is k, we basically distribute the vertices in a k-dimensional coordinate system of the normalized values for the node-level metrics. The coordinate for a vertex is represented as a tuple comprising of the normalized values for the k node-level metrics, which are centrality metrics in this paper.

For example, if all the four centrality metrics (DEG, EVC, BWC, CLC) are considered to form the coordinate system, the coordinate for vertex 0 in the example graph of Figures 1 and 2 would be (0.1943, 0.1535, 0.0000, 0.2696). For ease of presentation and visualization, we show the distribution of the vertices in the example graph using two dimensions at a time (see Figure 3): the neighborhood-based DEG and EVC metrics and the shortest path-based BWC and CLC metrics. As we can see, the distribution of the vertices is different in both the coordinate systems. Sometimes, it is possible that two or more vertices may be located at the same coordinate (like V6 and V7 in both the coordinate systems).

Just with a cursory look at the distributions of the vertices in the two coordinate systems of Figure 3, we could conclude that the vertices are more similar to each other with respect to DEG-EVC rather than BWC-CLC. We could also infer that vertices V3, V6, and V7 are more similar with respect to both DEG-EVC as well as BWC-CLC, even though V3 is not directly connected to V6 and V7. We could as well run some clustering algorithm to find clusters of similar vertices with respect to two or more centrality metrics.

2.4. Binary Search Algorithm to Obtain a Unit Disk Graph with Minimum Threshold Distance

We now seek to construct a unit disk graph that could capture the similarity among the vertices in the coordinate system of the normalized values for the node-level metrics. In a k-dimensional coordinate system of the normalized values (in the range of 0 to 1), the maximum value for the distance between any two vertices is $\sqrt{k}$ (for example, the maximum distance between any two points in a unit square is $\sqrt{2}$) and the minimum value for the distance is of course 0. The binary search algorithm maintains three auxiliary variables: a left index, a right index, and a middle index. For any iteration, the middle index is the average of the left index and right index values at the beginning of the iteration and is more appropriately called the threshold distance for that iteration. During each iteration, we construct a unit disk graph of the vertices such that there exists an edge between two vertices if the Euclidean distance between the two vertices is less than or equal to the value of the threshold distance for the particular iteration. During any iteration, we maintain the invariant that the unit disk graph is guaranteed to be connected when the right index is used as the threshold distance and not connected when the left index is used as the threshold distance (unless all the vertices are colocated at the same coordinate). The procedural details of the binary search algorithm (see Algorithm 1 for the pseudo code) are as follows:

(i) To begin with, the left index is 0 and the right index is $\sqrt{k}$. We go through a sequence of iterations (during which the left index and right index approach each other) until the difference between the right index and left index becomes less than $\epsilon$; in this paper, we use $\epsilon$ = 0.001. In a particular iteration, either the left index moves to the right (i.e., is increased) or the right index moves to the left (i.e., is decreased).

(ii) At the beginning of each iteration, we compute the value for the middle index (threshold distance) as the average of the left index and right index that are updated at the end of the previous iteration.

(iii) As part of the iteration, we construct a unit disk graph of the vertices such that there exists an edge between two vertices if the Euclidean distance between the two vertices is less than or equal to the threshold distance. After constructing such a unit disk graph, we run the Breadth First Search (BFS) algorithm on the graph to check if it is connected or not.

(a) If the unit disk graph constructed on the basis of the threshold distance for an iteration is connected, the final value for the minimum threshold distance should be greater than the left index, but less than or equal to the current threshold distance (middle index); accordingly, we update (decrease) the value for the right index to be the current value of the middle index.

(b) If the unit disk graph constructed on the basis of the threshold distance for an iteration is not connected, the final value for the minimum threshold distance should be greater than the current threshold distance (middle index), but less than or equal to the right index; accordingly, we update (increase) the value for the left index to be the current value of the middle index.

(c) When the difference between the right index and left index becomes less than $\epsilon$, we stop the iterations and consider the value for the right index during the last iteration as the value for the minimum threshold distance (since we always maintain the invariant that the unit disk graph for any iteration is connected when the right index is used as the threshold distance).

The NSI value for the network is then simply computed as “1 − (minimum threshold distance)/$\sqrt{k}$”.

(1) In a coordinate system based on the normalized values of k node-level metrics, the largest possible value for the minimum threshold distance will be $\sqrt{k}$ (when the vertices are the most dissimilar from each other) and the smallest possible value would be slightly above 0 (unless all the vertices are colocated/exactly similar). Hence, the above formulation of 1 − (minimum threshold distance)/$\sqrt{k}$ would restrict the NSI values to a range of 0...1 such that the larger the NSI value, the more similar are the vertices with respect to the metrics considered. Also, the division of the minimum threshold distance by $\sqrt{k}$ (where k is the number of dimensions: node-level metrics considered) would negate the impact of the number of node-level metrics considered for similarity assessment and capture the impact of the actual node-level metrics considered in their entirety. For example, with the above formulation, it is possible that the NSI value for a network with respect to (DEG, EVC, BWC, CLC) could end up being larger than the NSI value for the network with respect to (BWC, CLC) and be smaller than the NSI value for the network with respect to (DEG, EVC). That is, the significantly larger similarity among the vertices with respect to DEG and EVC could contribute to increasing the similarity among the vertices with respect to all the four centrality metrics and offset the relatively lower similarity among the vertices with respect to BWC and CLC.

(2) Note that we do not consider the value of the threshold distance (middle index) for the last iteration as the value for the minimum threshold distance because it might be the case that the unit disk graph of the last iteration was not connected for the threshold distance (middle index) of that iteration (see Table 1 for such a scenario).

Inputs
    Real-world network graph, $G = (V, E)$
    Number of centrality metrics, k
    The normalized centrality values for each vertex in $G$
    // The centrality-based logical coordinates for a vertex are represented as the tuple of its k normalized centrality values
Auxiliary Variables
    Left Index = 0, Right Index = $\sqrt{k}$, Middle Index, Logical Graph GL
Begin Binary Search Algorithm
    while ( | Right Index - Left Index | > $\epsilon$ ) do
        Middle Index = (Left Index + Right Index) / 2
        Construct Logical Graph GL for the vertices using the Middle Index as the threshold distance
        /* Two vertices u and v in $G$ are connected with an edge in GL if the Euclidean distance
           between their logical coordinates ≤ Middle Index */
        if ( GL is connected ) then
            Right Index = Middle Index
        else
            Left Index = Middle Index
        end if
    end while
    return NSI = 1 − ( Right Index / $\sqrt{k}$ )
End Binary Search Algorithm

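
The binary search described in this section can be sketched in Python as follows. This is a minimal sketch, not the implementation used to produce the results in this paper: the function names, the list-of-tuples input format, and the choice to run BFS directly over pairwise distances (instead of materializing the logical graph) are assumptions made for brevity. It assumes at least one point, with coordinates already normalized to the range 0 to 1, and it needs Python 3.8 or later for `math.dist`.

```python
import math
from collections import deque


def is_connected(points, threshold):
    """BFS over the implicit unit disk graph: edge iff distance <= threshold."""
    n = len(points)
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if j not in seen and math.dist(points[i], points[j]) <= threshold:
                seen.add(j)
                queue.append(j)
    return len(seen) == n


def node_similarity_index(points, eps=0.001):
    """Return (NSI, minimum threshold distance) for k-dimensional points."""
    k = len(points[0])
    left, right = 0.0, math.sqrt(k)
    while right - left > eps:
        middle = (left + right) / 2  # threshold distance for this iteration
        if is_connected(points, middle):
            right = middle  # connected: shrink the search range from the right
        else:
            left = middle  # not connected: shrink the search range from the left
    return 1 - right / math.sqrt(k), right


# Hypothetical coordinates, chosen only to exercise the function.
nsi_far, td_far = node_similarity_index([(0.0, 0.0), (1.0, 1.0)])
nsi_line, td_line = node_similarity_index([(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)])
```

For the two maximally distant points, the right index never moves and the NSI is 0; for the three evenly spaced points, the returned threshold lands within `eps` above the largest gap of 0.1.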
2.5. Example to Illustrate the Working of the Proposed Binary Search Algorithm

Figure 4 illustrates the sequence of iterations of the binary search algorithm executed on the example graph of Figures 1 and 2 with the coordinates of the vertices represented by the normalized values of DEG and EVC. As it is a 2-dimensional coordinate system, the initial value for the right index is $\sqrt{2}$ = 1.414. With the initial left index of 0, the initial value for the middle index is (0 + 1.414) / 2 = 0.707. The unit disk graph for the first iteration is constructed with 0.707 as the threshold distance and we notice the graph to be a connected graph (in this example, we actually see a complete graph wherein each vertex is connected to every other vertex). Hence, for the second iteration, we set the right index to be 0.707 and retain the left index as 0, leading to a new middle index value of (0 + 0.707) / 2 = 0.3535. The unit disk graph for this threshold distance (0.3535) value is also connected and we further reduce the search range by setting the right index to 0.3535. We continue the iterations by either increasing the left index or decreasing the right index. During the 12th iteration, we observe the difference between the right index and left index to be less than 0.001 ($\epsilon$), and we finalize the value for the minimum threshold distance to correspond to the value for the right index during the 12th iteration. We use a precision of at most 6 decimal digits (if needed) for the threshold distance.

In Figure 4, along with the iteration #, we indicate the threshold distance (referred to as TD) used to obtain the unit disk graph for that iteration. Table 1 lists the values for the left index, right index, and middle index (threshold distance) for each iteration as well as the difference between the values for the right index and left index and whether the unit disk graph for each iteration is connected or not. At the end of the 11th iteration, we notice that the difference between the right index and left index is less than 0.001 and we stop the iterations and conclude the value of the right index at the beginning of the iteration as the minimum threshold distance (0.172607) for the network under study. The NSI value for the network is then 1 − 0.172607/$\sqrt{2}$ = 0.877948, where k = 2 corresponds to the number of node-level metrics (DEG, EVC) considered for the analysis. With a cursory look at the unit disk graph for the minimum threshold distance of 0.172607 (see It # 10 in Figure 4), one could conclude that there are three clusters of similar vertices with respect to DEG-EVC: V1, V2, V3, V6, and V7 form the largest cluster (actually a clique); V4 and V5 form another cluster; and V0 is in its own cluster.
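
The arithmetic of this example can be reproduced with the short sketch below. The variable names are illustrative; the numeric inputs (an initial right index of 1.414, a minimum threshold distance of 0.172607, and k = 2) are the ones quoted in the text.

```python
import math

k = 2                         # number of node-level metrics (DEG, EVC)
initial_right = 1.414         # sqrt(2), rounded as in the text
min_threshold = 0.172607      # minimum threshold distance reported in the text

# Threshold distances (middle index values) of the first two iterations.
first_middle = (0 + initial_right) / 2
second_middle = (0 + first_middle) / 2

# NSI = 1 - (minimum threshold distance) / sqrt(k)
nsi = 1 - min_threshold / math.sqrt(k)

print(round(first_middle, 4), round(second_middle, 4), round(nsi, 6))
# 0.707 0.3535 0.877948
```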

2.6. Number of Iterations, Time Complexity, and Space Complexity

An interesting property of the binary search algorithm applied in the search space of [0, $\sqrt{k}$] is that the number of iterations of the algorithm for any real-world network just depends on the value of k (the number of node-level metrics/coordinates) and the parameter $\epsilon$ (we stop the algorithm if the difference between the right index and left index is less than $\epsilon$); it depends neither on the actual number of nodes and edges nor on the actual values of the centrality/node-level metrics involved. Even if the range of searchable values in each iteration would vary with the real-world network and the centrality/node-level metrics involved, the size of the search space reduces by half in each iteration (a characteristic of the binary search approach). For example, if k = 2: at the end of the first iteration, the search space is either [0, 0.707] or [0.707, 1.414]; in either case, the size of the search space is 0.707. In a similar vein, at the end of the second iteration, the search space is either [0, 0.3535] or [0.3535, 0.707] or [0.707, 1.0605] or [1.0605, 1.414]: the size of each of these search spaces is 0.3535. The size of the search space for the third iteration will be half of 0.3535 = 0.17675 and so on. With the size of the search space reducing by half in each iteration, the number of iterations needed for the search space to reduce from $\sqrt{k}$ to a value less than $\epsilon$ would be simply $\lceil \log_2(\sqrt{k}/\epsilon) \rceil$ and will be independent of the centrality metrics and their values as well as independent of the actual number of nodes and edges in the real-world network analyzed.
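
The iteration count can be checked numerically with the sketch below, which compares a closed-form expression, $\lceil \log_2(\sqrt{k}/\epsilon) \rceil$, against a direct count of halvings. The function names are illustrative, and the closed form is used here under the assumption that $\sqrt{k}/\epsilon$ is not an exact power of 2 (which holds for the values of k and $\epsilon$ used in this paper).

```python
import math


def iterations_closed_form(k, eps):
    """Number of halvings needed to shrink sqrt(k) below eps (closed form)."""
    return math.ceil(math.log2(math.sqrt(k) / eps))


def iterations_by_halving(k, eps):
    """Count halvings of the search space until its size drops below eps."""
    size, count = math.sqrt(k), 0
    while size >= eps:
        size /= 2
        count += 1
    return count


for k in (2, 4):
    print(k, iterations_closed_form(k, 0.001), iterations_by_halving(k, 0.001))
# 2 11 11
# 4 11 11
```

Neither function takes the graph as input, which reflects the point made above: the count depends only on k and $\epsilon$.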

The time complexity of the algorithm is dominated by the time to construct the logical graph (based on the normalized centrality values of the vertices as coordinates) for each iteration, which would be of complexity O($V^2$) for a real-world network of V nodes. The possibility of an edge between any two vertices in the real-world network needs to be evaluated, and hence the time complexity to construct the logical graph will be O($V^2$). After the logical graph is constructed during an iteration, we would need to check for its connectivity to decide whether to change the left index or right index for the next iteration. The Breadth First Search or Depth First Search algorithms of time complexity O($V^2$) could be used for this purpose. Putting together the number of iterations and the time complexity for each iteration, the overall time complexity of the proposed binary search algorithm (run for k node-level metrics with a terminating search space size of $\epsilon$) for a given graph of V vertices is O($V^2 \log_2(\sqrt{k}/\epsilon)$).

With regard to space complexity, for each iteration, the algorithm constructs a logical graph GL (a data structure) and checks for its connectivity. As mentioned above, the number of edges in GL would be O($V^2$), where V is the number of vertices in the graph $G$. Note that the logical graph constructed during an iteration could be cleared from memory at the end of the iteration. Also, the number of auxiliary variables used remains the same irrespective of the size of the real-world network graph analyzed. Hence, the memory requirement of the algorithm is O($V^2$), where V is the number of vertices in the real-world network graph.

3. Overview of the Real-World Networks Used for Analysis

In this section, we provide a brief overview of the 60 real-world networks that are analyzed for the proposed node similarity index (NSI) measure. The real-world networks are spread over several domains (listed below along with the number of networks considered for each domain): acquaintance network (12), friendship network (9), biological network (8), coappearance network (8), citation network (4), employment network (4), collaboration network (3), literature network (3), political network (3), communication network (2), game network (2), and transportation network (2). We now briefly describe these networks. An acquaintance network is a social network of people who are not close to each other but know each other slightly; such acquaintances are typically learnt during an observation period. A friendship network is a social network in which the participant nodes closely know each other, and no observation period is typically used to learn about the friendships. A biological network is a network that models the interactions involving genes, proteins and the associated transcriptions, or the interactions between animals of a species, etc. A coappearance network is a network based on the appearance of characters or words (extracted from novels/books/dictionary) alongside each other. A citation network is a network in which there exists a link between two nodes (papers) if one of the two papers has cited the other paper as a reference. An employment network is a network in which the interactions between employees (nodes) are due to the job requirements and not due to any personal liking. A collaboration network is a network of authors in which two authors are linked if they share at least one publication. A literature network is a network of books/papers/terminologies/authors (other than citation, collaboration or coauthorship) in a particular area of literature. A political network is a network of entities (typically politicians) involved in politics. A communication network is a network of entities that communicate in an organizational setting or over a common agenda (e.g., email network, criminal network, trade network, etc.). A game network is a network of teams or players playing for different teams and their associations. A transportation network is a network of entities (like airports and their flight connections) involved in public transportation.

In Table 2, we list the 60 real-world networks, the three-character acronym used for each network in the paper, the domain of the network, as well as the number of nodes, the number of edges, the average degree and the spectral radius ratio for node degree (a measure of the variation in node degree, with a minimum value of 1.0; [22]). In a recent work [23], we analyzed these 60 real-world networks for assortativity with respect to the neighborhood-based and shortest path-based centrality metrics and observed the real-world networks to be more assortative with respect to EVC and CLC and more disassortative with respect to BWC and DEG.

4. Node Similarity Index of the Real-World Networks

In this section, we present the results obtained for the proposed node similarity index (NSI) measure for the 60 real-world networks with respect to three coordinate systems: one formed by the neighborhood-based centrality metrics (DEG, EVC), one formed by the shortest path-based centrality metrics (BWC, CLC), and one formed by both the neighborhood-based and shortest path-based centrality metrics (DEG, EVC, BWC, CLC). With a terminating search space size of 0.001, the number of iterations incurred (for any complex network) by the binary search algorithm with two and four centrality metrics used for the coordinate systems is, respectively, $\lceil \log_2(\sqrt{2}/0.001) \rceil = 11$ and $\lceil \log_2(\sqrt{4}/0.001) \rceil = 11$. The medians of the NSI values for the (DEG, EVC), (BWC, CLC), and (DEG, EVC, BWC, CLC)-based coordinate systems are 0.92, 0.89, and 0.89, respectively.

Table 3 presents the numerical NSI values for the real-world networks with respect to all three coordinate systems. For domains that have at least five real-world networks (there are four such domains), we group the networks together to present the results in Table 3. For each of these four domains (acquaintance networks, friendship networks, biological networks, and coappearance networks), we make an NSI value bold if it is greater than or equal to the median value for all the real-world networks with respect to the same coordinate system. For example, we make the numbers bold for the (DEG, EVC) coordinate system if the NSI value in the cell is greater than or equal to 0.92. Based on this highlighting scheme, we introduce a measure called the relative node similarity score for a network domain, computed as the ratio of the number of bold cells in the domain to the total number of cells in that domain across all three coordinate systems. For example, in the case of acquaintance networks, there are 21 bold numbers in a total of 36 cells and hence the relative node similarity score for acquaintance networks (in comparison to any real-world network; with respect to any coordinate system) is 21/36 = 0.58. Likewise, the relative node similarity scores of the friendship networks, biological networks and coappearance networks are, respectively, 19/27 = 0.70, 11/24 = 0.46 and 5/24 = 0.21. We can thus infer that the nodes in friendship and acquaintance networks are more likely to be similar to each other with respect to the centrality metrics compared to the nodes in the biological and coappearance networks. Nodes in a coappearance network (especially when it involves the appearance of characters in the same chapter/scene) are less likely to be similar to each other with respect to the centrality metrics.
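The relative node similarity score described above amounts to a simple count, which the following minimal sketch illustrates. The function name `relative_similarity_score` and the two sample rows are hypothetical (the rows are made-up values, not entries of Table 3); only the medians 0.92, 0.89 and 0.89 are taken from the text.

```python
def relative_similarity_score(nsi_rows, medians):
    """nsi_rows: one tuple of NSI values per network, one value per coordinate system.
    medians: the all-network median NSI for each coordinate system."""
    bold = sum(
        1
        for row in nsi_rows
        for value, median in zip(row, medians)
        if value >= median  # a cell is bold if it reaches the median
    )
    total = sum(len(row) for row in nsi_rows)
    return bold / total


# Medians for (DEG, EVC), (BWC, CLC) and (DEG, EVC, BWC, CLC) as reported in the text.
MEDIANS = (0.92, 0.89, 0.89)

# Hypothetical domain with two networks (illustrative values only).
example_rows = [(0.95, 0.90, 0.85), (0.91, 0.89, 0.93)]
example_score = relative_similarity_score(example_rows, MEDIANS)
```

In this illustrative case, four of the six cells reach the corresponding median, so the score is 4/6.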

A visual comparison of the NSI values for the three coordinate systems is presented in Figures 5(a)–5(c). For 43 of the 60 real-world networks (i.e., more than 70% of the networks), the (DEG, EVC)-based NSI values are greater than the (BWC, CLC)-based NSI values (see Figure 5(a)). Hence, nodes in real-world networks are more likely to be similar with respect to the neighborhood-based (DEG, EVC) centrality metrics than with respect to the shortest path-based (BWC, CLC) centrality metrics. A notable exception to this trend is the Roget Network (#49: ROG), whose (DEG, EVC)-based NSI is 0.57 and (BWC, CLC)-based NSI is 0.88. In Figures 5(b) and 5(c), the (DEG, EVC)-based NSI values and the (BWC, CLC)-based NSI values are plotted against the (DEG, EVC, BWC, CLC)-based NSI values. We observe the (DEG, EVC, BWC, CLC)-based NSI values to be lower than the (DEG, EVC)-based NSI values for more than 85% (i.e., for 52/60) of the real-world networks; on the other hand, the (DEG, EVC, BWC, CLC)-based NSI values are greater than the (BWC, CLC)-based NSI values for more than 50% (i.e., for 32/60) of the real-world networks. The relatively larger similarity among the vertices with respect to (DEG, EVC) contributes to the larger values for the (DEG, EVC, BWC, CLC)-based NSI measure compared to the (BWC, CLC)-based NSI measure. As a result, nodes in real-world networks tend to be more similar to each other when both the neighborhood-based (DEG, EVC) and shortest path-based (BWC, CLC) centrality metrics are considered together than when the shortest path-based (BWC, CLC) centrality metrics are considered alone. This corroborates our earlier assertion in Section 2.4 that our formulation for NSI as “1 − (minimum threshold distance / $\sqrt{k}$)” negates the number of node-level metrics (k) considered and captures the contribution of the node-level metrics in their entirety to quantify the extent of similarity among the vertices.

4.1. Comparison of NSI Values with the Pearson's Correlation Coefficient of the Centrality Metrics

Correlation studies involving centrality metrics have been extensively conducted in the literature (e.g., [70–72]), with the Pearson's correlation coefficient [73], whose values range from -1 to 1, being the most commonly used correlation measure. A value closer to 1 (or closer to -1) for the Pearson's correlation coefficient between two centrality metrics means that the two centrality metrics are strongly and positively (or negatively) related and that one centrality metric could be predicted using a linear function of the other centrality metric (e.g., [74]). If the Pearson's correlation coefficient between two centrality metrics is closer to 0, it implies that the two metrics are not linearly related to each other.

In this subsection, we compare the NSI values obtained for the real-world networks based on the neighborhood (DEG, EVC)-based centrality metrics and the shortest path (BWC, CLC)-based centrality metrics with the Pearson's correlation coefficient values for DEG versus EVC and BWC versus CLC for these networks (see Figure 6). The purpose of this comparison is to showcase that the NSI values based on a coordinate system of a particular combination of centrality metrics are independent of the correlation between the corresponding centrality metrics. Thereby, we claim that the correlation coefficient between two centrality metrics for a real-world network cannot be construed as a network-level measure of the extent of similarity among the nodes in the network.

The plots in Figure 6 for both the neighborhood and shortest path-based centrality metrics indicate that the NSI values for the real-world networks based on the coordinate systems of these centrality metrics are independent of the Pearson's correlation coefficient between the constituent centrality metrics for the real-world networks. Though the Pearson's correlation coefficient values range from -1 to 1 (for DEG, EVC) or from 0 to 1 (for BWC, CLC), the NSI values for most of the real-world networks are 0.85 or above (for DEG, EVC) or 0.80 or above (for BWC, CLC). We could not identify any sort of relationship between the NSI values and the correlation coefficients.

Numerically, the (DEG, EVC)-based NSI values are greater than the Pearson's correlation coefficient between DEG and EVC for about two-thirds of the real-world networks, with the median of the difference being 0.12; on the other hand, for the remaining one-third of the real-world networks (for which the Pearson's correlation coefficients between DEG and EVC are larger than the NSI values for the networks based on these two metrics), the median of the difference in the values is only 0.04. Though DEG and EVC are positively correlated for a majority of the real-world networks, the Pearson's correlation values between DEG and EVC are negative (-0.5 or lower) for the following four networks: Marvel Universe Network (#33: MUN), Author Facebook Network (#35: AFB), Yeast Phosphorylation Network (#55: YPN) and Network Science Coauthorship Network (#60: NSC). In the case of (BWC, CLC), the NSI values are larger than the Pearson's correlation coefficient between BWC and CLC for more than 85% of the real-world networks, with the median of the difference being 0.43. Thus, the (DEG, EVC)-based NSI values are relatively closer to the Pearson's correlation coefficients between DEG and EVC than the (BWC, CLC)-based NSI values are to the Pearson's correlation coefficients between BWC and CLC.

4.2. Comparison of the NSI Values for the Real-World Networks and Random Networks

In this subsection, we compare the NSI values for the real-world networks with the NSI values obtained for random networks generated using the well-known Erdos-Renyi [15] and Configuration [17] models. For a given real-world network, both the models generate a random network with the same number of vertices and edges, but the edges between the vertices are randomly assigned. The degree distribution of the vertices in the random network generated using the Configuration model will be the same as the degree distribution of the vertices in the corresponding real-world network. On the other hand, the degree distribution of the vertices in the random network generated using the Erdos-Renyi model will always be Poisson in nature, irrespective of the degree distribution of the vertices in the corresponding real-world network. We expect relatively less variation in the centrality values of the nodes in the random network generated using the Erdos-Renyi model compared to those generated using the Configuration model. Nevertheless, our hypothesis is that since the edges are randomly assigned under both these models, the NSI values of the random networks with respect to any combination of centrality metrics should be different from the NSI values of the corresponding real-world networks.

For each of the 60 real-world networks, we generated 100 random networks according to each of the above two models. For a real-world network with N nodes and L links, to generate a random network per the Erdos-Renyi model, we first determine the probability ($p = 2L/(N(N-1))$) for a link between any two nodes in the random network; we then consider all possible pairs of distinct vertices and generate a random number for each pair. If the random number generated for a node pair is less than or equal to $p$, there is an edge between the two nodes in the random network; otherwise, there is no edge. To generate a random network according to the Configuration model, we first determine the degree sequence of the vertices in the corresponding real-world network. We set up a list LD of vertex IDs such that the number of times a vertex is included in this list corresponds to the degree of the vertex in the real-world network. We then randomly shuffle the vertices in the list LD ten times (to decrease the chances of the same vertex ID appearing consecutively). Finally, we sequentially parse through the shuffled list and connect the adjacent vertices in the list with an edge. For complex real-world networks with a larger number of nodes, the average number of self-loops and multilinks in the random networks generated according to the Configuration model is a constant and their density approaches zero as the number of nodes tends to infinity [75].
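The two generation procedures can be sketched in Python as follows. This is an illustration under assumptions, not the code used for the simulations: the function names are hypothetical, the link probability is taken to be 2L/(N(N−1)) (the value for which the expected number of links equals L), and "connecting the adjacent vertices" in the shuffled list is interpreted as pairing entries 1-2, 3-4, and so on, which preserves the degree sequence. Self-loops and multilinks are kept as generated.

```python
import random


def erdos_renyi(n, num_links, rng):
    """Random network on n nodes with expected number of links num_links."""
    p = 2 * num_links / (n * (n - 1))  # assumed link probability
    edges = []
    for u in range(n):
        for v in range(u + 1, n):      # every pair of distinct vertices
            if rng.random() <= p:
                edges.append((u, v))
    return edges


def configuration_model(degrees, rng, shuffles=10):
    """Random (multi)graph whose degree sequence matches `degrees`."""
    # Vertex u appears degrees[u] times in the list (LD in the text).
    stubs = [u for u, d in enumerate(degrees) for _ in range(d)]
    for _ in range(shuffles):
        rng.shuffle(stubs)
    # Pair consecutive entries; an odd leftover entry (if any) is dropped.
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs) - 1, 2)]


rng = random.Random(7)  # fixed seed for reproducibility of the illustration
er_edges = erdos_renyi(5, 10, rng)
cm_edges = configuration_model([3, 2, 2, 1], rng)
```

Note that the Erdos-Renyi procedure matches the number of links of the real-world network only in expectation, whereas the Configuration model sketch reproduces the degree of every vertex exactly (counting a self-loop twice).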

After running the simulations for a coordinate system based on all four major centrality metrics (DEG, EVC, BWC, CLC), we observe our hypothesis to be indeed true. For more than 85% and 63% of the real-world networks (i.e., 52 and 38 of the 60 networks), the average of the NSI values for the random networks generated according to the Erdos-Renyi model and the Configuration model, respectively, is greater than 0.90. In contrast, for only 19 of the 60 real-world networks (i.e., less than one-third of the real-world networks), the (DEG, EVC, BWC, CLC)-based NSI values are greater than 0.90. The relatively larger NSI values for the random networks per the Erdos-Renyi (ER) model vis-a-vis the Configuration model could be attributed to the lower variation in the values of the centrality metrics of the vertices in the ER-random networks, which exhibit a Poisson degree distribution.

Figure 7 shows the distribution of the (DEG, EVC, BWC, CLC)-based NSI values of the real-world networks and the average of the (DEG, EVC, BWC, CLC)-based NSI values for the corresponding random networks generated according to the Erdos-Renyi model (Figure 7(a)) and the Configuration model (Figure 7(b)). We do not see any relationship between the two NSI values in either Figure 7(a) or Figure 7(b), indicating that the NSI values measured for a real-world network are not random and that they do capture the extent of similarity among the nodes with respect to the centrality metrics considered. The median of the difference between the NSI value of a real-world network and the average NSI value of the corresponding random networks is 0.06 for the Configuration model and 0.10 for the Erdos-Renyi model.

For only nine of the sixty real-world networks, the NSI value for the real-world network is greater than the average of the NSI values for the corresponding random networks (per the Erdos-Renyi model). These nine real-world networks are as follows: Taro Exchange Network (#1: TEN), Friendship Network in a Hi-Tech Firm (#7: FHT), Windsurfers Beach Network (#10: WSB), College Dorm Fraternity Network (#13: CDF), Macaque Dominance Network (#15: MDN), Manufacturing Company Employee Network (#22: MCE), World Trade Metal Network (#23: WTN), US Football Network (#30: FON), and Primary School Contact Network (#39: PSN). The values for the spectral radius ratio for node degree for these nine real-world networks range from 1.01 to 1.57 with a median of 1.12. Real-world networks with such a low spectral radius ratio for node degree could be indeed considered to be randomly generated.

4.3. Comparison of the NSI Values with the Values for Other Network-Level Measures

In this subsection, we compare the NSI values obtained for the real-world networks with those of other recently proposed and classical network-level measures. These measures are (i) spectral radius ratio for node degree; (ii) randomness index; (iii) assortative index, and (iv) ratio of the standard deviation to the average path length. Below, we provide a brief description of each of these measures and analyze the relationship vis-a-vis the appropriate coordinate system-based NSI values with which we compare them:

(i) The spectral radius ratio for node degree [22] quantifies the extent of variation in node degree in a way that is independent of the number of nodes and edges in the network (unlike the classical standard deviation measure that is dependent on the number of nodes). The spectral radius ratio for node degree is computed as the ratio of the principal eigenvalue of the adjacency matrix and the average node degree. The smallest possible value for the measure is 1.0 and it corresponds to a regular network where there is no variation in node degree. For random networks that are characteristic of a smaller variation in the node degree, the spectral radius ratio for node degree is typically closer to 1.0. For scale-free networks that are characteristic of a larger variation in node degree, the spectral radius ratio for node degree is appreciably greater than 1.0. As it is a degree-based measure, we compare the (DEG, EVC)-based NSI values of the real-world networks with their spectral radius ratio for node degree (see Figure 8(a)). We could observe an increasing trend of the (DEG, EVC)-based NSI values with decrease in the spectral radius ratio for node degree. However, the $R^2$ values for all the models that we tried to fit to relate these two measures are at most 0.25.

(ii) The randomness index [76] quantifies the extent of randomness in any complex network. It is computed as the Pearson's correlation coefficient between the degree of the vertices and the average local clustering coefficient of the vertices with the particular degree. The local clustering coefficient of a vertex [1] is the probability that any two neighbors of the vertex are directly connected. For a theoretically random network (say, a random network generated according to the ER model [15]), the local clustering coefficient of a vertex is independent of the degree of the vertex, and the expected randomness index is 0. For real-world networks that are not random, the local clustering coefficient of the vertices decreases with increase in the degree of the vertices (as it is less likely that all the neighbors of a high-degree vertex will be directly connected to each other), and there is a negative correlation between the two measures, resulting in negative values for the randomness index. The more negative the randomness index value (i.e., the closer to -1) for a real-world network, the lower the extent of randomness in the network. As we expect the vertices in a theoretically random network to be similar to each other with respect to all the centrality metrics (as we saw in Section 4.2), we compare the (DEG, EVC, BWC, CLC)-based NSI values of the real-world networks with their randomness index (see Figure 8(b)). We do not observe any trend of decrease or increase in the NSI values of the real-world networks vis-a-vis their randomness index values: for example, the randomness index of real-world networks whose NSI values are in the vicinity of 0.90 ranges from -0.92 to -0.16.

(iii) The assortative index measure [13] quantifies the extent of similarity between the end vertices of the edges in a network with respect to node degree. It is calculated as the Pearson's correlation coefficient (ranging from -1 to 1) of the remaining degree of the end vertices of the edges in a network. The remaining degree of a vertex is one less than the degree of the vertex. Networks with larger positive values (closer to 1) for the assortative index are considered to be assortative and networks with negative values (closer to -1) for the assortative index are considered to be disassortative. As it is a degree-based measure, we compare the (DEG, EVC)-based NSI values of the real-world networks with their remaining degree-based assortative index (see Figure 8(c)). We observe larger NSI values for both assortative as well as disassortative networks. For example, the assortative index of real-world networks whose NSI values are in the vicinity of 0.90 ranges from -0.49 to 0.20.

(iv) The ratio of the standard deviation to the average path length has been a classical measure for getting an estimate of the similarity among the shortest path lengths between any two nodes in a network. If there is no significant variation in the shortest path lengths, the ratio is expected to be lower than 1.0 (and closer to 0.0). The larger the ratio (especially, if greater than 1.0), the larger the variation in the shortest path lengths. As it is a shortest path-based measure, we compare the (BWC, CLC)-based NSI values with the ratio of the standard deviation to the average shortest path length. There is no trend of increase or decrease in the NSI values with the ratio (see Figure 8(d)). The $R^2$ values for the different models that we tried to fit to the data do not exceed 0.10. Hence, as with the other three network-level measures, the proposed NSI measure captures an extent of similarity among the nodes (here, with respect to the BWC and CLC metrics) that is not captured by the classical approach of determining the ratio of the standard deviation to the average path length.
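For concreteness, three of the four measures described above, namely (i), (iii) and (iv), can be sketched in Python as follows. The function names are hypothetical and the sketch rests on a few assumptions that are not stated in the text: the principal eigenvalue is approximated by power iteration on a connected graph, each undirected edge contributes both of its orientations to the assortative index, and the population (rather than sample) standard deviation is used for the path lengths.

```python
import math
from collections import deque


def build_adjacency(n, edges):
    """Adjacency lists for an undirected graph on vertices 0..n-1."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def pearson(xs, ys):
    """Pearson's correlation coefficient (undefined if either input is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spectral_radius_ratio(adj, iterations=500):
    """(i) Principal eigenvalue of the adjacency matrix / average node degree."""
    n = len(adj)
    x = [1.0 / math.sqrt(n)] * n  # unit-length start vector
    shifted = 0.0
    for _ in range(iterations):
        # Power iteration on A + I; the shift avoids oscillation on bipartite graphs.
        y = [x[u] + sum(x[v] for v in adj[u]) for u in range(n)]
        shifted = math.sqrt(sum(c * c for c in y))
        x = [c / shifted for c in y]
    average_degree = sum(len(nbrs) for nbrs in adj) / n
    return (shifted - 1.0) / average_degree


def assortative_index(adj):
    """(iii) Pearson correlation of the remaining degrees at the two ends of each edge."""
    xs, ys = [], []
    for u, nbrs in enumerate(adj):
        for v in nbrs:  # each undirected edge is visited in both directions
            xs.append(len(adj[u]) - 1)
            ys.append(len(adj[v]) - 1)
    return pearson(xs, ys)


def path_length_ratio(adj):
    """(iv) Standard deviation / mean of the shortest path lengths over all vertex pairs."""
    lengths = []
    for source in range(len(adj)):
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        lengths.extend(d for target, d in dist.items() if target > source)
    mean = sum(lengths) / len(lengths)
    variance = sum((d - mean) ** 2 for d in lengths) / len(lengths)  # population variance
    return math.sqrt(variance) / mean
```

As a sanity check on the definitions, a regular graph such as a four-node cycle yields a spectral radius ratio of 1.0, and a star graph, in which every edge joins the hub to a leaf, yields an assortative index of -1.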

5. Related Work

To the best of our knowledge, similarity assessment in complex networks has been conducted only at the node-level (i.e., between any two nodes or a set of nodes, also referred to as pair-wise node similarity) and not at the network-level (i.e., among all the nodes in the network). The objective of this paper is to develop a measure to comprehensively (i.e., at the network-level) quantify the extent of the similarity among the vertices in a coordinate system based on the normalized values of the node-level metrics. In this section, we review the prominent measures available in the literature for pair-wise node similarity assessment.

One of the classical approaches for pair-wise node similarity assessment is based on the notion of “equivalence classes” [1]; there are three levels of equivalence classes: structural, automorphic and regular. Two nodes are structurally equivalent if they share many of their neighbors [1]. Some of the measures available to quantify structural equivalence are [1]: cosine similarity, Pearson's coefficient and Euclidean distance, all of which are computed based on the rows associated with the corresponding two vertices in the adjacency matrix of the graph. Two vertices u and v are automorphically equivalent if all the vertices can be relabeled to form an isomorphic graph such that the labels of u and v are interchanged [77]. Two vertices u and v are regularly equivalent if they have neighbors who are themselves similar [5, 77]. Similar to structural equivalence, there exist quantitative measures to assess automorphic equivalence and regular equivalence. In [9], the authors proposed four measures (based on maximum common neighborhood, neighborhood patterns, random walks and k-hop neighbors) to assess the automorphic equivalence of two nodes. SimRank [7] and its variants such as PathSim [8] are examples of well-known measures to assess the similarity of two nodes based on the similarity of their neighbors. However, none of these quantitative measures can be seamlessly extended to quantify the similarity among nodes at the network-level. Also, from the definitions of the three equivalence classes and the measures available to quantify them, we conjecture that it is very unlikely for two distant nodes (i.e., several hops away from each other) to belong to the same equivalence class, especially in the case of structural equivalence, which is the most restrictive of the three classes [1]. Note that two structurally equivalent nodes are also automorphically and regularly equivalent. Two nodes that are automorphically equivalent are regularly equivalent too, but need not be structurally equivalent. Two nodes that are regularly equivalent need not be structurally or automorphically equivalent [1].
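As a brief illustration of the row-based measures of structural equivalence mentioned above, the following minimal sketch computes the cosine similarity and the Euclidean distance between two rows of an adjacency matrix. The function names and the four-node example graph are hypothetical and serve only to make the definitions concrete.

```python
import math


def cosine_similarity(row_u, row_v):
    """Cosine of the angle between two adjacency-matrix rows."""
    dot = sum(a * b for a, b in zip(row_u, row_v))
    norm_u = math.sqrt(sum(a * a for a in row_u))
    norm_v = math.sqrt(sum(b * b for b in row_v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for isolated vertices
    return dot / (norm_u * norm_v)


def euclidean_distance(row_u, row_v):
    """Euclidean distance between two adjacency-matrix rows."""
    return math.dist(row_u, row_v)


# Hypothetical graph: vertices 0 and 1 are both adjacent to vertices 2 and 3 only.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
similarity_01 = cosine_similarity(A[0], A[1])
distance_01 = euclidean_distance(A[0], A[1])
```

Vertices 0 and 1 have identical neighborhoods, so their rows coincide: the cosine similarity is 1 (up to floating-point rounding) and the Euclidean distance is 0, whereas vertices 0 and 2 share no neighbors and have a cosine similarity of 0.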

In addition to the above, quantitative measures to assess pair-wise node similarity based on the neighborhood of the nodes were proposed by Ravasz et al. [78], Burt [79] and Goldberg and Roth [80]. Thiel and Berthold [2] proposed that two nodes (need not be directly connected to each other) are structurally similar if their neighborhoods are structurally similar to each other. In [3], Symeonidis et al. recommended that for two nodes that are not directly connected to each other, their similarity could be quantified as the product of the similarity of the end vertices constituting the edges of the shortest path between the two nodes. For weighted graphs, Chen et al. [4] introduced a measure called relation strength similarity (RSS) to assess similarity between two nodes: the RSS of two nodes (u, v) connected to each other is the ratio of the weight of the edge (u, v) to that of the sum of the weights of the edges incident on u and v. The transitive node similarity formulation proposed by Symeonidis et al. [3] for two nodes that are not directly connected to each other could be extended to the RSS measure as well. Though neighborhood-based methods are more common and widely used, there also exist pair-wise node similarity assessment measures that are not neighborhood-based. For example, in [6], the authors applied the notion of “mutual information” from Information Theory to quantify the extent of similarity between two nodes: the similarity score for two nodes is a function of the “information loss” encountered in the network by replacing the two nodes as one node.

While centrality metrics have been traditionally explored for their individual usability to analyze the characteristics of a real-world network, more recent studies [70–72] have focused on analyzing the correlation between any two centrality metrics to explore the usability of one centrality metric (typically, a computationally light metric) in lieu of another centrality metric (typically, a computationally heavy metric) at different levels (i.e., for prediction, network-wide ranking, pair-wise ranking, etc.). However, as seen in Section 4, correlation studies do not reveal or quantify the extent of similarity among the vertices on the basis of their centrality values with respect to two or more metrics. In [81], the authors introduced the notion of “centrality distance” to quantify the similarity of two graphs with respect to a centrality metric; it is measured as the sum of the absolute differences of the centrality values (without any normalization) of the individual vertices in the two graphs.

6. Conclusions

The high-level contribution of this paper is the proposal for a unit disk graph-based approach to quantify the similarity among all the nodes in a network with respect to two or more node-level metrics. As part of this approach, we propose the use of a k-dimensional coordinate system wherein the coordinate of a vertex is composed of the normalized values of the k node-level metrics considered for similarity assessment. We propose the use of a binary search algorithm to determine the minimum value for the threshold distance (in a search space ranging from 0 to $\sqrt{k}$) that would be needed to obtain a connected unit disk graph of the vertices in the normalized coordinate system. Our hypothesis is that the larger the similarity among the vertices, the smaller the value for the minimum threshold distance needed to obtain a connected unit disk graph. We propose a measure called the node similarity index (NSI), computed as 1 − (minimum threshold distance / $\sqrt{k}$), to quantify the extent of similarity among the vertices on a scale of 0 to 1. The division by $\sqrt{k}$ in the NSI formulation (where k is the number of node-level metrics considered for similarity assessment) negates the impact of the number of node-level metrics considered and solely captures the impact of the actual node-level metrics considered. With the binary search approach, for a given k and a given terminating search space size, the number of iterations needed for the algorithm is the same for any complex network; the overall time complexity and space complexity of the algorithm are, respectively, O() and $O(V^2)$, where V is the number of vertices in the network.

We evaluate our proposed model with respect to the four commonly studied centrality metrics (neighborhood-based degree and eigenvector centrality and the shortest path-based betweenness and closeness centrality) on a suite of 60 real-world networks belonging to different domains. Overall, we observe the nodes in real-world networks to be more similar with respect to the neighborhood-based centrality metrics rather than the shortest path-based centrality metrics. For all the combinations of centrality metrics considered, we observe nodes in friendship and acquaintance networks to be relatively more similar among themselves compared to the nodes in biological and coappearance networks. We showcase the uniqueness of the NSI values by comparing them with several quantitative measures such as correlation coefficient, spectral radius ratio of node degree, assortative index, randomness index and ratio of standard deviation to average path length. We do not observe any significant trend of increase or decrease in the NSI values with respect to each of these measures.

We also observed the NSI values of the real-world networks with respect to all the four centrality metrics to be different from the NSI values of the random networks (generated with the ER model) that have the same number of nodes and edges as that of the real-world networks. Thus, the notion of node similarity captured by the unit disk graph-based NSI values is not a random phenomenon and the proposed NSI measure is a unique measure whose values are also not correlated with several of the existing measures for complex network analysis.

Data Availability

The real-world network data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.