Abstract

Measuring the similarity between patterns correctly is a foundation of data mining, especially for graph data. Although existing methods can obtain good results, they still have limitations, for instance, when measuring the similarity of patterns in directed weighted graph data. Here, we introduce a new approach that takes second-order neighbors into consideration. The proposed approach is named relative entropy-based similarity for patterns in graph data, wherein relative entropy provides a new way to quantify the difference between patterns in directed weighted graph data. The proposed similarity measure is divided into three phases. First, a strength set is built from the degree and weight of patterns; in this phase, four variables holding the strength of out-degree, in-degree, out-weight, and in-weight are constructed. Then, with the help of the Euclidean metric, each pattern’s probability set is constructed, which captures the similarity between the pattern and all of its first-order neighbors. Finally, relative entropy is used to measure the difference between patterns. To examine the validity of our approach and its advantage over state-of-the-art approaches, two kinds of experiments are conducted on real-world and synthetic graph data. The experimental results indicate that the proposed method measures similarity efficiently and yields accurate results.

1. Introduction

At present, many practical networks such as Facebook social networks, protein interaction networks, aviation networks, and disease transmission networks can be represented as graph data. This type of data is no longer a plain description of a pattern’s attribute information; it also carries topological information between patterns, i.e., the degree and weight of patterns. Due to the extensive use of graph data, many practical problems including pattern analysis, link prediction, and community detection can be abstracted into problems on graph data. Among these problems, how to calculate the similarity between patterns in graph data is considered one of the most fundamental. Much research on graph data is based on a pattern similarity measure, for example, in traffic networks [1], image classification [2], and pattern recognition [3, 4].

Over the past few decades, discovering similarity between patterns has attracted substantial attention [5, 6]. Scholars have proposed a range of methods to measure pattern similarity, for example, shared neighbor-based similarity, random walk-based similarity, path-based similarity, and information theory-based similarity; these methods discuss the similarity of patterns from different perspectives.

The shared neighbor-based similarity measure takes into account the shared information of the connected neighbors between patterns: the greater the overlap of shared neighbors, the higher the similarity of two patterns. The Cosine index [7], Sorensen index [8], Jaccard index [9], AA (Adamic-Adar) index [10], and WAA (weighted Adamic-Adar) index [11] are common methods of this kind, all of which take into account the number of shared neighbors. Besides, the LP (local path) index [12] is an improvement of the CN (common neighbors) index [13]; on the basis of the CN index, it adds the influence of neighbors at path length 3 on the connection between patterns. These indicators reduce computation time and perform well in identifying the most similar patterns. Unfortunately, significant challenges remain: only the topological information of first-order neighbors is taken into account, and many patterns with high similarity have no common neighbors, which limits such indicators.

Random walk-based similarity is widely used to measure the topological similarity of patterns, for example, the MLRW (Multiplex Local Random Walk) index [14], BRW (Biased Random Walk) index [15], and LRW (Local Random Walk) index [16]. During calculation, these methods measure similarity by moving from one pattern to another through a multistep random walk, without the global information of graph data. They simplify the similarity measure to some extent, but all three rely on large-degree patterns, and the most similar patterns tend to be large-degree patterns, which makes the similarity results sensitive to large-degree patterns.

Path-based similarity is another important way to measure the similarity of patterns, for example, the Katz index [17, 18] and ACT (average commute time) [19]. Compared with a local index, a global index requires the overall topological information. Besides, Aziz et al. [20] proposed global and quasilocal extensions of some commonly used local similarity indices. Although a global index provides more accurate similarity than a local index, computing global metrics is time-consuming and generally not applicable to large-scale graph data; moreover, global topology information is sometimes unavailable, especially in decentralized implementations.

In addition to the similarity measures mentioned above, information theory-based similarity is also often used. Relative entropy, an important concept of information theory, is used to measure the similarity of patterns. Scholars have proposed pattern similarity measures based on relative entropy, such as LRE (local relative entropy) [21, 22], LRWE-SNM (Local Network Relative Weighted Entropy Based Similar Node Mining) [23], and the RE-model (relative entropy model) [24]. These methods have advantages in their respective fields and can measure the similarity of patterns to a certain extent. Although they are faster and simpler to compute, some pattern information and complex relationships between patterns are lost, for example, information about second-order neighbors of patterns. That is to say, it is hard to distinguish patterns with similar degree. In addition, there are many other ways to calculate pattern similarity; see [25-28] for details.

For patterns in directed weighted graph data, similarity is affected by the direction of edges, and edges in different directions contribute differently to a pattern’s weight. Besides, each pattern carries information such as out-degree, in-degree, out-weight, and in-weight, and the relationship between a pattern and its neighbors at different scales is complex. Therefore, the similarity measure of patterns in directed weighted graph data cannot start from a single direction. Generally speaking, the above similarity measures have been used extensively. Nonetheless, some limitations remain. Indices that use mutual information are limited to the common neighbor structure or local information of patterns, so patterns of larger degree easily become the general patterns in the similarity calculation. Even though existing methods simplify the measurement of topological information, they ignore the directivity of a pattern’s connections and the corresponding diversity of degree and weight in the relationships between patterns. Under these circumstances, some edge information of patterns is lost, which prevents further improvement in similarity calculation. In particular, the above indices may perform poorly when applied to link prediction. To sum up, calculating the similarity of patterns from the aspects of degree and weight diversity is still a hotspot [29-31].

In this paper, we focus on the similarity of patterns in directed weighted graph data. To this end, an extended similarity measure based on relative entropy is proposed. The complete process consists of three stages. First, compute the strength set. Using the degree and weight information of a pattern and its first-order neighbors, four variables that capture the influence of topological information about degree and weight diversity are constructed. Second, generate the probability set. To take advantage of the second-order neighbor information of patterns, the Euclidean metric is used to measure the similarity between a pattern and its first-order neighbors. On this basis, the similarity values are normalized to construct the probability set of each pattern. Third, quantify the similarity of patterns. With the help of relative entropy, the dissimilarity of any two patterns is measured, and the similarity can be obtained subsequently. We numerically simulated the proposed similarity measure and verified its effectiveness and efficiency in similarity measurement and link prediction. The main contributions of this paper are as follows:
(1) This paper presents a similarity measure based on relative entropy, which considers the information of second-order neighbors of patterns
(2) In the process of measuring pattern similarity, the proposed method considers both degree information and weight information
(3) Compared with most benchmark methods, the proposed similarity measure has a great advantage in measuring the similarity of patterns and performs well in link prediction

To describe the proposed similarity approach in detail, the remainder of this paper is organized as follows. Section 2 contains some preliminaries. Section 3 describes the generation of the strength set for patterns in detail. Section 4 proposes the probability set calculated from the similarity set. Section 5 constructs a measure to compute the similarity of patterns in graph data and proposes a novel algorithm. Section 6 carries out two types of experiments to prove the effectiveness of the proposed method. The conclusion is given in Section 7.

2. Preliminaries

In this section, we review some basic concepts used in this paper, such as graph data [26], relative entropy [32], and pattern’s neighbor [23].

2.1. Graph Data

Graph data is defined as a set of patterns and a set of edges. Generally speaking, directed weighted graph data can be expressed formally as a 4-tuple $G = (V, E, D, W)$, where
(i) $V = \{v_1, v_2, \ldots, v_n\}$ is the set of patterns, and $v_i$ represents the $i$th pattern
(ii) $E = \{e_{ij}\}$ is the set of edges. Hereinto, $e_{ij} = 1$ if there is an edge from pattern $v_i$ to pattern $v_j$; otherwise, $e_{ij} = 0$
(iii) $D$ is the set of degrees with respect to patterns, thereinto $d_i^{in}$ and $d_i^{out}$ represent the in-degree and out-degree of $v_i$, respectively, and their values, taking $v_i$ for example, can be determined by equations
$$d_i^{in} = \sum_{j=1}^{n} e_{ji}, \tag{1}$$
$$d_i^{out} = \sum_{j=1}^{n} e_{ij}. \tag{2}$$
Moreover, the degree can be calculated as the sum of in-degree and out-degree, i.e., $d_i = d_i^{in} + d_i^{out}$
(iv) $W$ is the set of weights with respect to the corresponding edges. Analogously, $w_i$, $w_i^{in}$, and $w_i^{out}$ represent the weight, in-weight, and out-weight of pattern $v_i$, respectively. The values of in-weight and out-weight can be determined by the following equations:
$$w_i^{in} = \sum_{j=1}^{n} w_{ji}, \tag{3}$$
$$w_i^{out} = \sum_{j=1}^{n} w_{ij}. \tag{4}$$
Thereinto, $w_{ij}$ represents the weight on the edge from $v_i$ to $v_j$. Furthermore, the weight can be calculated as the sum of in-weight and out-weight, i.e., $w_i = w_i^{in} + w_i^{out}$
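As an illustration of the quantities defined in this subsection, the following minimal Python sketch computes in-degree, out-degree, in-weight, out-weight, and their sums from an edge list. The function name `degree_and_weight` and the `(source, target, weight)` edge format are assumptions made for this example and are not part of the original method description.

```python
from collections import defaultdict


def degree_and_weight(edges):
    """Compute per-pattern degree and weight statistics.

    `edges` is an iterable of (source, target, weight) triples describing
    a directed weighted graph (hypothetical input format for this sketch).
    """
    stats = defaultdict(lambda: {"in_degree": 0, "out_degree": 0,
                                 "in_weight": 0.0, "out_weight": 0.0})
    for src, dst, w in edges:
        # An edge src -> dst adds to the out-side of src and the in-side of dst.
        stats[src]["out_degree"] += 1
        stats[src]["out_weight"] += w
        stats[dst]["in_degree"] += 1
        stats[dst]["in_weight"] += w
    for s in stats.values():
        # Total degree and weight are the sums of their in- and out-parts.
        s["degree"] = s["in_degree"] + s["out_degree"]
        s["weight"] = s["in_weight"] + s["out_weight"]
    return dict(stats)


edges = [("a", "b", 2.0), ("a", "c", 1.0), ("c", "a", 0.5)]
stats = degree_and_weight(edges)
```

Here pattern `a` has two outgoing edges and one incoming edge, so its degree is the sum of both counts and its weight is the sum of the weights on all three edges.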

2.2. Relative Entropy

As is well known, relative entropy is an asymmetric measure and can be applied to measure the difference between two probability distributions. In general, it can be expressed as
$$D_{KL}(P \,\|\, Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}, \tag{5}$$
where $P = (p_1, \ldots, p_n)$ and $Q = (q_1, \ldots, q_n)$ are two probability distributions, and “$n$” in equation (5) represents the number of variables that $P$ and $Q$ depend on. Certainly, a greater value of $D_{KL}(P \,\|\, Q)$ reflects a smaller similarity of $P$ and $Q$, and vice versa.
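The asymmetry of relative entropy can be checked numerically. The sketch below implements the standard Kullback-Leibler form; the natural logarithm and the handling of zero entries (terms with a zero first probability are skipped, and a zero second probability against a nonzero first gives infinity) are assumptions of this example, since the text does not fix them.

```python
import math


def relative_entropy(p, q):
    """Relative entropy D(P || Q) of two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # convention: 0 * log(0 / q) contributes nothing
        if qi == 0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total


p = [0.9, 0.1]
q = [0.5, 0.5]
forward = relative_entropy(p, q)   # D(P || Q)
backward = relative_entropy(q, p)  # D(Q || P), generally a different value
```

Because `forward` and `backward` differ, a symmetric similarity has to combine both directions, which is what the later sections address.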

2.3. Pattern’s Neighbor

For graph data $G$, if there exist two patterns $v_i$ and $v_j$ such that there is an edge between them in either direction ($e_{ij} = 1$ or $e_{ji} = 1$), then one can say that $v_j$ is a neighbor of $v_i$, and vice versa. All the neighbors of $v_i$ constitute the neighborhood of $v_i$ in terms of topological information. For simplicity and uniformity, we summarize it as the following definition.

Definition 1. (first-order neighborhood). Given that $G$ is directed weighted graph data, if $e_{ij} = 1$ or $e_{ji} = 1$ for $v_i, v_j \in V$, then the pattern $v_j$ is a first-order neighbor of $v_i$. Certainly, all of the neighbors of $v_i$ constitute its first-order neighborhood, which can be expressed as
$$N(v_i) = \{v_j \in V \mid e_{ij} = 1 \text{ or } e_{ji} = 1\}. \tag{6}$$
Generally speaking, if pattern $v_i$ has $k$ first-order neighbors, then $N(v_i)$ can be represented as $\{v_{i1}, v_{i2}, \ldots, v_{ik}\}$. Obviously, the elements of $N(v_i)$ reflect the topological information of $v_i$ directly. For the case that $v_j \in N(v_i)$ and $v_k \in N(v_j)$ but $v_k \notin N(v_i)$, how to depict the relationship of $v_i$ and $v_k$ in terms of topological information is no longer obvious. To this end, the next definition gives the concept of second-order neighborhood to depict such a situation.

Definition 2. (second-order neighborhood). Given that $G$ is a graph data and $v_i \in V$, the second-order neighborhood of a pattern $v_i$ is the set containing the neighbors of all its first-order neighbors, denoted as $N^2(v_i)$, which can be expressed as
$$N^2(v_i) = \bigcup_{v_j \in N(v_i)} N(v_j). \tag{7}$$

Definition 3. (local neighborhood). Given that is a graph data and , the so-named local neighborhood of can be expressed as where is the first order neighbor of , for .
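Definitions 1 and 2 can be sketched in a few lines of Python. The helper names and the `(source, target, weight)` edge format are assumptions of this example. The second-order neighborhood is implemented literally as the union of the neighbors of all first-order neighbors, so it contains the pattern itself; whether the original method excludes the pattern or its first-order neighbors is not stated in the text.

```python
def first_order(edges, v):
    """Patterns connected to v by an edge in either direction."""
    neighbors = set()
    for src, dst, _weight in edges:
        if src == v and dst != v:
            neighbors.add(dst)
        elif dst == v and src != v:
            neighbors.add(src)
    return neighbors


def second_order(edges, v):
    """Union of the first-order neighborhoods of all first-order neighbors of v."""
    result = set()
    for u in first_order(edges, v):
        result |= first_order(edges, u)
    return result


edges = [("a", "b", 1.0), ("b", "c", 1.0), ("d", "a", 1.0)]
n1 = first_order(edges, "a")   # neighbors of a, ignoring edge direction
n2 = second_order(edges, "a")  # neighbors of those neighbors
```

In this small graph, `c` is not adjacent to `a` but is reached through `b`, which is exactly the situation that motivates the second-order neighborhood.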

3. Degree and Weight-Based Pattern’s Strength Set

In this section, we investigate the problem of how to construct the pattern’s strength set in terms of degree and weight.

For any pattern in , its first order neighborhood depends on the corresponding topological connection. Whatever the connection, the topological information for each pattern can be described by four variables: in-degree , out-degree , in-weight , and out-weight .

In what follows, we introduce the concept of strength set for any pattern in a graph data .

Definition 4. (strength set). Given that is a graph data, for any pattern , its strength set can be expressed by following equation: where and . Each variable in contains four strength values consisting of in-degree, out-degree, in-weight, and out-weight, take for example, , in which case (i) represents the strength of out-degree and can be computed by equation (ii) represents the strength of in-degree and can be computed by equation (iii) represents the strength of out-weight and can be computed by equation (iv) represents the strength of in-weight and can be computed by equation

Analogously, represents the strength of first-order neighbor to , which can be calculated by equations mentioned above. One can find that the above proposed strength fully depicts personal properties and topological information with respect to corresponding its first-order neighbors.

As discussed above, take and for example, if and are two different patterns, then is nothing unusual to some extent. In particular, there would be one extreme situation that if , for and .

A closer look at relative entropy shows that patterns with more neighbors will lose certain information, because relative entropy only evaluates the nonzero elements in a probability set, and the information of nonzero elements in the probability set of the corresponding patterns will also be lost. Considering this deficiency, we introduce a concept, the scale of strength set, to depict the strength set. Before doing this, we suppose that for a graph data , there exists at least one pattern having the most neighbors, and we denote its number of neighbors as . Then, for the pattern with and pattern with , we take the following cases into consideration:

Case 1. If , then the and can be represented as

Case 2. If , then the and can be changed into

In other words, append zeros, i.e., to the end of and .

Case 3. If , we append average strength values of to the end of by the equation

Generally speaking, the and can be changed into where the insufficient locations of will be appended by strength values calculated with the help of equation (16), and the rest location of and will be appended by zeros, i.e., .

Case 4. If , the and can be changed into where strength value of can be calculated as the following equation:

4. Generating Probability Set

Relative entropy is applied to compare the difference between two probability sets. To some extent, similarity can be derived from this difference. For this reason, we calculate the similarity between patterns from the perspective of relative entropy. Before doing this, constructing the probability set of each pattern is the first step of the similarity measure.

We have known that the strength set , take for example, and its one order-neighbors can be determined in terms of degree and weight. To make full use of relative entropy for the purpose of similarity measure, in what follows, we construct an approach to generate the probability set of patterns for . Each strength value of the first-order neighbors is composed by four variables; here, the Euclid metric can be applied to compute the similarity between and for with respect to its strength set. The concrete formula can be depicted as the following equation:

Obviously, the value describes the similarity between and its first-order neighbor, and it is only a local description in view point of . With the help of equation (20), we can make a global description of the similarity of pattern by the following equation:

Up to now, the caring thing, that is, creating probability set, can be realized by the following equation: where

By above aforementioned, the relative entropy between the probability set and , take in for example, it can be determined by the equation where represents the maximal neighbors of patterns and ; that is, .

It can be seen that, in the process of calculating the relative entropy of pattern and , the strength sets of the first-order neighbors of pattern and are constructed, which contain second-order neighbor information. That is to say, with the help of the Euclidean metric, the information of a pattern’s second-order neighbors is used indirectly during the similarity calculation.
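The probability-set construction described in this section can be illustrated as follows. This is only a sketch under stated assumptions: each strength value is a 4-tuple (out-degree, in-degree, out-weight, in-weight strength), the local similarity is the plain Euclidean distance between the pattern's strength tuple and each neighbor's tuple, the probabilities are those distances divided by their sum, and the result is sorted in descending order. The function name `probability_set` and the uniform fallback for the all-zero case are hypothetical choices, not taken from the equations of the original method.

```python
import math


def probability_set(own_strength, neighbor_strengths):
    """Normalize Euclidean distances to first-order neighbors into probabilities."""
    if not neighbor_strengths:
        return []
    # Local description: one distance per first-order neighbor.
    distances = [math.dist(own_strength, s) for s in neighbor_strengths]
    # Global description: the total over all first-order neighbors.
    total = sum(distances)
    if total == 0:
        # Assumption of this sketch: identical strengths give a uniform set.
        return [1.0 / len(distances)] * len(distances)
    # Sorted in descending order so that two sets can be compared position-wise.
    return sorted((d / total for d in distances), reverse=True)


probs = probability_set((0, 0, 0, 0),
                        [(3, 4, 0, 0), (0, 0, 0, 0), (0, 0, 6, 8)])
```

The three neighbors are at distances 5, 0, and 10, so the resulting set sums to one and can be passed to a relative entropy computation.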

5. Similarity and Algorithm

The calculation of relative entropy among the patterns has been discussed in detail. In this section, the calculated value of relative entropy will be used to compute the similarity between patterns. And then, an algorithm is proposed.

5.1. Quantify Similarity of Pattern

From the process mentioned above, relative entropy of any two patterns is obtained based on the sorted probability sets. Therefore, the relative entropy matrix of graph data with respect to any two patterns can be represented as

And then, the similarity matrix of graph data can be given as follows.

For the value of relative entropy is asymmetric, take and for example, both and describe dissimilarity of pattern and . To obtain more accurate similarity of pattern, the value of relative entropy can be calculated by taking both and into consideration, and the specific calculations of it are shown as follows:
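One possible way to turn an asymmetric relative entropy matrix into a symmetric similarity matrix is sketched below. The specific choices here, summing the two directions and rescaling by the largest symmetric value so that similarity lies in [0, 1], are assumptions of this example and may differ from the equations used in the original method.

```python
def similarity_matrix(rel_entropy):
    """Symmetrize a relative entropy matrix and convert it to similarities."""
    n = len(rel_entropy)
    # Combine both directions so that the (i, j) and (j, i) entries agree.
    sym = [[rel_entropy[i][j] + rel_entropy[j][i] for j in range(n)]
           for i in range(n)]
    largest = max(max(row) for row in sym)
    if largest == 0:
        return [[1.0] * n for _ in range(n)]
    # Larger relative entropy means smaller similarity.
    return [[1.0 - sym[i][j] / largest for j in range(n)] for i in range(n)]


r = [[0.0, 0.2, 0.6],
     [0.4, 0.0, 0.1],
     [0.2, 0.3, 0.0]]
S = similarity_matrix(r)
```

With this convention the most dissimilar pair receives similarity 0 and every pattern has similarity 1 with itself.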

5.2. Algorithm

To give a better understanding of the proposed pattern similarity measure, this section presents an algorithm containing a detailed description of it. For brevity, “Relative entropy-based similarity for patterns in graph data” is abbreviated as “RESG.” With this algorithm, the similarity of any two patterns is computed, after which the most similar patterns can be obtained. The algorithm has four stages. The input of the RESG algorithm is a weighted directed graph data , and the output is a matrix composed of the similarity between any two patterns in .

Input: A directed weighted graph data .
Output: Similarity matrix S of patterns in .
1: for each do
2:   Calculate first-order neighbors
3:   Calculate by equations (10)-(13)
4: end
5: for each with neighbors and with neighbors do
6:   if then
7:
8:   else
9:
10:   end
11: end
12: for do
13:   Compute by equations (20) and (21)
14:   Compute by equation (22)
15: end
16: for each do
17:   Compute by equation (24)
18:   Compute entropy matrix by equation (25)
19:   Compute similarity matrix by equation (27)
20: end
21: return Similarity matrix S

The first stage of the RESG algorithm is lines 1-4: the strength set of each pattern in is generated, and each strength set has four variables in terms of in-degree, out-degree, in-weight, and out-weight. The second stage is lines 5-11: to fully utilize the information of a pattern’s first-order neighbors, the pattern with fewer neighbors appends the average strength value to the end of its strength set. The third stage is lines 12-15: the similarity between each pattern and its first-order neighbors is computed, and the similarity set is generated. With the help of the similarity set, the pattern’s probability set is obtained. It is not hard to see that, while generating the similarity set, the information of a pattern’s second-order neighbors is used indirectly. The last stage is lines 16-20: by taking the above information into account, the relative entropy and similarity of patterns are measured.

6. Experimental Materials

In this section, we introduce some experimental materials such as experimental environment, the graph data used in experiment, and benchmark algorithms. The experimental environment we used is listed in Table 1.

6.1. Data

This subsection will give a detailed description about the directed weighted graph data used in experiments.

A synthetic graph data, Data1, generated by means of the graph generator Gephi is used in the first experiment. Data1 contains 21 patterns and 31 edges and is used to illustrate the feasibility of the proposed RESG algorithm in the following illustrative example.

The following is a detailed description of the graph data used in the second experiment. Data2 and Gene [33] are used to compare the proposed RESG index with other similarity measures. For the edge connections of Data2 and Gene [33], see Figures 1 and 2. Stmarks, , , Celegans, and Email167 are directed weighted graph data collected from Stanford Dataset. Each of them is used to show the effectiveness of the proposed RESG algorithm in link prediction. The topology information of these eight graph data is shown in Table 2, where is the number of patterns, is the number of edges, is the average shortest distance, is the density, is the average degree, and is the clustering coefficient.

6.2. Benchmark Algorithms

Here, we introduce several benchmark pattern similarity indices that are commonly used for similarity measurement and link prediction. Adamic-Adar (AA) [10], weighted Adamic-Adar (WAA) [11], local relative entropy (LRE) [21], common neighbors (CN) [13], Katz [17], local path (LP) [12], and Local Random Walk (LRW) [16] are used as baselines for comparison with the RESG algorithm. The basic definitions of these indices are given below.

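AA index is the extended version of CN index, which is defined as
$$s_{ij}^{AA} = \sum_{v_z \in N(v_i) \cap N(v_j)} \frac{1}{\log k_z},$$
where $N(v_i)$ is the set of neighbors of pattern $v_i$ and $k_z$ is the degree of the common neighbor $v_z$.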

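WAA index is the weighted version of AA index, which is defined as
$$s_{ij}^{WAA} = \sum_{v_z \in N(v_i) \cap N(v_j)} \frac{w_{iz} + w_{zj}}{\log (1 + s_z)},$$
where $s_z$, the total weight of the edges attached to the common neighbor $v_z$, may be smaller than 1; so, we use $\log (1 + s_z)$ to avoid a negative value.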

CN index directly takes the number of all common neighbors between patterns as the similarity, which is defined as
$$s_{ij}^{CN} = |N(v_i) \cap N(v_j)|,$$
where $N(v_i)$ is the set of neighbors of pattern $v_i$.
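The three neighbor-based baselines (AA, WAA, and CN) can be sketched on a small example. The sketch assumes an undirected weighted graph stored as a symmetric dictionary of dictionaries, `adj[x][z] = weight`; this data layout, the function names, and the use of the natural logarithm are choices of this example rather than details given in the text.

```python
import math


def common_neighbors(adj, x, y):
    """CN: number of neighbors shared by x and y."""
    return len(adj[x].keys() & adj[y].keys())


def adamic_adar(adj, x, y):
    """AA: shared neighbors weighted by the inverse log of their degree."""
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[x].keys() & adj[y].keys())


def weighted_adamic_adar(adj, x, y):
    """WAA: edge weights to each shared neighbor over log(1 + its strength)."""
    score = 0.0
    for z in adj[x].keys() & adj[y].keys():
        strength = sum(adj[z].values())  # total weight attached to z
        score += (adj[x][z] + adj[z][y]) / math.log(1.0 + strength)
    return score


adj = {
    "a": {"c": 1.0, "d": 2.0},
    "b": {"c": 3.0, "d": 1.0},
    "c": {"a": 1.0, "b": 3.0},
    "d": {"a": 2.0, "b": 1.0, "e": 0.5},
    "e": {"d": 0.5},
}
cn = common_neighbors(adj, "a", "b")
aa = adamic_adar(adj, "a", "b")
waa = weighted_adamic_adar(adj, "a", "b")
```

Patterns `a` and `b` share the neighbors `c` and `d`; AA gives the lower-degree neighbor `c` more influence, while WAA additionally accounts for the edge weights.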

LRE index is a similarity measure based on relative entropy and local structure of patterns, which is defined as whereinto

Hereinto, is the maximum degree of the graph data, and is the probability set of pattern with respect to degree.

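Katz index is based on the global information of graph data, which is defined as
$$s_{ij}^{Katz} = \sum_{l=1}^{\infty} \beta^{l} \left| \mathrm{paths}_{ij}^{\langle l \rangle} \right|,$$
where $\mathrm{paths}_{ij}^{\langle l \rangle}$ represents the set of all paths with distance $l$ between patterns $v_i$ and $v_j$, and $\beta$ is the damping factor used to control the path weight.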

LP index considers the third-order paths on the basis of common neighbors, which is defined as
$$S^{LP} = A^{2} + \epsilon A^{3},$$
where $A$ is the adjacency matrix of graph data [34], $(A^{3})_{ij}$ represents the number of paths with length of 3 between patterns $v_i$ and $v_j$, and $\epsilon$ is an adjustable parameter.
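The Katz and LP indices are both built from powers of the adjacency matrix, which the following pure-Python sketch makes explicit. It assumes a 0/1 adjacency matrix given as nested lists; the Katz sum is truncated at a maximum path length `max_len`, which is an approximation introduced for this example, and the function names are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def local_path(adj, eps):
    """LP index: A^2 + eps * A^3."""
    a2 = matmul(adj, adj)
    a3 = matmul(a2, adj)
    n = len(adj)
    return [[a2[i][j] + eps * a3[i][j] for j in range(n)] for i in range(n)]


def katz_truncated(adj, beta, max_len):
    """Katz index with the path-length sum cut off at max_len."""
    n = len(adj)
    score = [[0.0] * n for _ in range(n)]
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for length in range(1, max_len + 1):
        power = matmul(power, adj)  # entry (i, j) counts paths of this length
        for i in range(n):
            for j in range(n):
                score[i][j] += (beta ** length) * power[i][j]
    return score


# Path graph 0 - 1 - 2 as an undirected 0/1 adjacency matrix.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
lp = local_path(A, eps=0.5)
katz = katz_truncated(A, beta=0.1, max_len=3)
```

In this graph the end patterns 0 and 2 are joined by one path of length 2, while patterns 0 and 1 are joined by one path of length 1 and two of length 3; the damping factor makes the longer paths count far less in the Katz score.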

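LRW index is proposed based on the local random walk of particles between two patterns, which is defined as
$$s_{ij}^{LRW}(t) = \frac{k_i}{2|E|} \pi_{ij}(t) + \frac{k_j}{2|E|} \pi_{ji}(t),$$
where $k_i$ is the degree of pattern $v_i$, $|E|$ is the number of the edges in the graph data, $\pi_{ij}(t)$ is obtained according to the density vector evolution equation $\vec{\pi}_i(t+1) = P^{T} \vec{\pi}_i(t)$, $P$ is the transition probability matrix, and $T$ is the matrix transpose.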
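A small sketch of the local random walk score follows. It assumes an undirected 0/1 adjacency matrix as nested lists, a walker that starts at one pattern with probability 1, and transition probabilities equal to one over the degree of the current pattern; the function name `lrw_score` is hypothetical.

```python
def lrw_score(adj, x, y, steps):
    """LRW similarity of patterns x and y after `steps` random-walk steps."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    num_edges = sum(degree) / 2  # each undirected edge is counted twice

    def walk(start):
        # Density vector of a walker that starts at `start`.
        pi = [0.0] * n
        pi[start] = 1.0
        for _ in range(steps):
            # One step of pi(t + 1) = P^T pi(t), with P[u][v] = adj[u][v] / degree[u].
            pi = [sum(pi[u] * adj[u][v] / degree[u]
                      for u in range(n) if degree[u] > 0)
                  for v in range(n)]
        return pi

    pi_x = walk(x)
    pi_y = walk(y)
    return (degree[x] / (2 * num_edges)) * pi_x[y] \
        + (degree[y] / (2 * num_edges)) * pi_y[x]


# Path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
s01_one_step = lrw_score(A, 0, 1, steps=1)
s02_one_step = lrw_score(A, 0, 2, steps=1)
s02_two_steps = lrw_score(A, 0, 2, steps=2)
```

After one step the walker cannot yet reach pattern 2 from pattern 0, so their score is zero; it becomes positive once the number of steps matches the distance between them.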

7. Experimental Analysis

In this section, we evaluate the proposed RESG index on different graph data. Two different forms of experiments are used to present the results, with the aim of further demonstrating the effectiveness and efficiency of the proposed RESG index.

7.1. Illustrative Example

Data1 is used to illustrate the proposed RESG index, for the edge connections of Data1, see Figure 3 for detail. Taking pattern and for example, in terms of RESG index, next, we deal with the problem of pattern similarity step by step.

Firstly, we find pattern’s first-order neighbors of them, respectively, and put them in , and relevant strength value about topological information of and is calculated and shown in Tables 3 and 4, respectively. However, it can be easily found that and . Based on this, a pattern with the average value of for and is added as the one-neighbor of . After that, the neighbors of the two patterns reached the same number, which avoided the partial information loss of in the subsequent calculation of relative entropy.

Secondly, the similarity sets are generated in the process of calculating the similarity between patterns and its first-order neighbors, and the details of and are shown as

The details of probability set based on strength set of and can be calculated and arranged each element in descending order, which can be shown as

Then, with the help of equation (24), the relative entropy of pattern and can be computed as follows.

Finally, by computing pattern’s similarity of the graph data , the maximum value of pattern similarity can be found from Figure 4; in terms of equation (25), similarity of and is . Obviously, the similarity calculation process of and can help better understand RESG index. The details of relevance matrix of graph data are shown in Figure 4, and the most similar pattern in is shown in Table 5.

According to Figure 3, we can find that compared with patterns and , they have more similar topological structures. Depending on Table 5, the most similar pattern of is exactly identified as pattern . Illustrative example given shows that RESG index is simple, efficient, and reliable with highly satisfactory accuracy.

7.2. Result Analysis

To further illustrate the efficiency of the proposed RESG algorithm in measuring pattern similarity, this subsection gives comparative experiments with several existing similarity measures. The experimental results are presented in two ways. The first shows the results through scatter plots and a table of the most similar patterns. The second demonstrates effectiveness by applying RESG to link prediction.

The scatter plot reflects the distribution of similarity between patterns. For example, if the most similar pattern of is , then a point is drawn at in the plane. A good similarity measure produces a scatter plot that is dispersed over the plane rather than concentrated on both sides of the diagonal. The reason is that if the points are concentrated on both sides of the diagonal, the method tends to identify a pattern’s neighbors as its most similar patterns, which is not accurate enough. In the following, under Data2 and Gene, the scatter plots of the proposed RESG index and seven other similarity indices are used to further validate the performance of the similarity measure; they are shown in Figures 5 and 6, respectively.
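The points of such a scatter plot can be derived from a similarity matrix as sketched below. The sketch assumes that patterns are indexed from 0, that a pattern is never its own most similar pattern, and that ties are all kept; the function names are hypothetical.

```python
def most_similar(sim):
    """Map each pattern index to the list of its most similar other patterns."""
    result = {}
    for i, row in enumerate(sim):
        candidates = [(s, j) for j, s in enumerate(row) if j != i]
        best = max(s for s, _ in candidates)
        # Keep every pattern that attains the maximum, since ties are possible.
        result[i] = [j for s, j in candidates if s == best]
    return result


def scatter_points(sim):
    """(pattern, most similar pattern) pairs, one point per pair."""
    return [(i, j) for i, js in most_similar(sim).items() for j in js]


sim = [[1.0, 0.2, 0.9],
       [0.2, 1.0, 0.4],
       [0.9, 0.4, 1.0]]
points = scatter_points(sim)
```

A measure that mostly returns adjacent indices produces points hugging the diagonal, whereas a measure that ranks by structural role spreads them over the plane.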

Figures 5(a), 6(a), 5(b), 6(b), 5(c), and 6(c) show the scatter plots formed by the AA index, WAA index, and CN index, respectively. As we can see, the most similar patterns are concentrated near the diagonal. There is no denying that these three indices have low computational complexity; nevertheless, they use very limited information. Generally speaking, similarity is determined by the number of common neighbors between patterns. Accordingly, the most similar patterns are distributed near the corresponding patterns. Although the symmetry of patterns is good, it is difficult to describe the similarity between patterns accurately when only one path is considered.

Figures 5(d) and 6(d) show the scatter plots formed by LRE. It can be seen that the most similar patterns are not distributed near the diagonal. Overall, although LRE takes information of the local structure into consideration, obtaining an accurate similarity value remains a challenge. In addition, as the size of graph data increases, the symmetry between patterns decreases significantly.

Figures 5(e) and 6(e) show scatter plots formed by Katz. It is worth noting that the adjustable parameter of Katz index is set to , whereby the scatter plot under Katz index is concentrated around the diagonal line. Furthermore, the symmetry is not desirable. Katz index relies more on path among patterns in graph data, and patterns with larger degree are more likely to be in the path between different patterns; so, there is a greater probability that most patterns are similar to the patterns with greater degree in graph data.

Figures 5(f) and 6(f) show scatter plots formed by LRW index, and the number of random walks in this experiment is set to 3. One can see that the scatter plots are unevenly distributed, and the accuracy of similarity obtained by this similarity measure needs to be further improved. Moreover, LRW index considers the random walk with finite number of steps, and the computational complexity of this measure is higher.

The scatter plots of the LP index are shown in Figures 5(g) and 6(g). The advantage of the LP index is low computational complexity. However, due to the limited information used, the distribution of similarity values is too concentrated, which makes it hard to distinguish the similarity between patterns.

Figures 5(h) and 6(h) show the scatter plots formed by the RESG index. As we can see, the most similar patterns are not distributed near the diagonal, and as the size of graph data increases, the scatter plot formed by the RESG index still maintains good symmetry. The RESG index measures the similarity between patterns using the influence of pattern degree and weight and takes the information of both first-order and second-order neighbors into account, which yields a more accurate similarity for any two patterns. Under these circumstances, most patterns avoid becoming general patterns, that is, patterns with a common structure that are identified as most similar to multiple patterns. Moreover, in terms of runtime, the RESG index is slower than the LRW index, CN index, and AA index. However, compared with the LRE index, which is the same type of relative entropy-based similarity, the running time of the RESG index is only of it. In addition, compared with the normal algorithm, the RESG index is simple and efficient and can measure the similarity of patterns in large graph data efficiently.

For a different method, a quantification named most similar pattern listed in Table 6 is used to demonstrate the difference between RESG index and three existing measures: LRE index, CN index, and EI index [35], so as to verify the good effect of RESG index from another perspective. The first line of Table 6 is the pattern’s label, three of every 100 patterns are selected randomly, and a total of 20 will be used as experimental patterns listed. “” represents that the pattern does not have the most similar pattern. Since there is such a situation that pattern in graph data has more than one of the most similar patterns, only the same pattern sequence numbers are listed, and the rest of most similar patterns are shown in the table with abbreviation numbers. Take pattern under the EI index for example, (148) represents pattern which has 148 most similar patterns.

As it shows in Table 6, pattern is identified as the most similar pattern of 7 different patterns under LRE index, including pattern . LRE index takes the degree of patterns into consideration simply; so, it is possible that most patterns may have the same degree distribution, which leads to the same similarity of patterns. Analogously, under the EI index, several patterns have more than one most similar pattern. For example, a number of 148 most similar patterns are identified by patterns and so on. However, there are also patterns without the most similar pattern, for instance, patterns .

As we can see, under the RESG index there is no case in which multiple patterns identify the same most similar pattern. The RESG index takes the information of a pattern’s first-order and second-order neighbors into account, which allows it to calculate the similarity accurately. Meanwhile, the weight of patterns also contains a lot of topological information, and two patterns may have the same degree distribution but different weights. The RESG index starts from the perspective of both degree and weight, which helps it distinguish the similarity precisely. As a result, the RESG index is feasible and effective.
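The case of equal degree but different weight can be made concrete with a minimal sketch. It only accumulates the four raw quantities (out-degree, in-degree, out-weight, in-weight) for a directed weighted graph; it is not the strength set of the RESG index itself, and the edge list and function name are hypothetical.

```python
def degree_and_weight(edges):
    """Accumulate degree and weight totals per pattern.

    edges: iterable of (source, target, weight) for a directed weighted graph.
    """
    stats = {}
    for u, v, w in edges:
        for node in (u, v):
            stats.setdefault(
                node,
                {"out_degree": 0, "in_degree": 0, "out_weight": 0.0, "in_weight": 0.0},
            )
        stats[u]["out_degree"] += 1
        stats[u]["out_weight"] += w
        stats[v]["in_degree"] += 1
        stats[v]["in_weight"] += w
    return stats


# Hypothetical toy graph: "a" and "b" both point to "c" and "d".
edges = [("a", "c", 1.0), ("a", "d", 1.0), ("b", "c", 5.0), ("b", "d", 0.5)]
stats = degree_and_weight(edges)

same_degree = stats["a"]["out_degree"] == stats["b"]["out_degree"]
same_weight = stats["a"]["out_weight"] == stats["b"]["out_weight"]
```

Here `a` and `b` have the same out-degree, so a degree-only measure cannot separate them, whereas their out-weights differ.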

To further verify the feasibility of the proposed similarity measure, the RESG index is applied to link prediction, and its prediction performance is compared with that of the CN index, LP index, Katz index, and LRW index. The experiment is carried out on six graph datasets collected from the Stanford Dataset, and AUC is selected as the index to evaluate the prediction performance of effective path topology stability. For more information on AUC, see reference [34].
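As a rough illustration of the evaluation, the sketch below estimates AUC by sampling, assuming the definition commonly used in the link prediction literature: over n random comparisons between a held-out link and a nonexistent link, AUC = (n' + 0.5 n'')/n, where n' counts comparisons in which the held-out link scores higher and n'' counts ties. Whether reference [34] uses exactly this form is an assumption. The toy graph is undirected and scored with common neighbors for brevity; all names are hypothetical.

```python
import random


def auc_link_prediction(score, test_edges, non_edges, n_samples=10000, seed=0):
    """Sampling estimate of AUC = (n' + 0.5 * n'') / n for a scoring function."""
    rng = random.Random(seed)
    higher = 0
    equal = 0
    for _ in range(n_samples):
        s_test = score(*rng.choice(test_edges))
        s_non = score(*rng.choice(non_edges))
        if s_test > s_non:
            higher += 1
        elif s_test == s_non:
            equal += 1
    return (higher + 0.5 * equal) / n_samples


# Hypothetical training graph as adjacency sets; edge (0, 1) is held out.
adj = {0: {2, 3}, 1: {2, 3}, 2: {0, 1}, 3: {0, 1}, 4: set()}


def common_neighbors(u, v):
    """CN score: number of neighbors shared by u and v in the training graph."""
    return len(adj[u] & adj[v])


test_edges = [(0, 1)]
non_edges = [(0, 4), (2, 4), (2, 3)]
auc_cn = auc_link_prediction(common_neighbors, test_edges, non_edges)
```

A scorer that cannot separate held-out links from nonexistent links yields an AUC of 0.5, and a scorer that always ranks held-out links higher yields 1.0.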

Figure 7 compares the AUC results of the RESG index and the other four similarity measures. Among them, the CN index only considers the degree information of patterns, while the LRW index, LP index, and Katz index consider either the local paths or the global paths of graph data, so their time complexity is relatively high. As Figure 7 shows, compared with the RESG index and LRW index, the AUC values of the CN index, LP index, and Katz index on Stmarks and FWEW are not ideal. However, compared with the RESG index, the LRW index has higher time complexity. The AUC of the RESG index is the highest on four graph datasets, FWMW, FWFW, Celegans, and Email167, and is second only to the LRW index on Stmarks and . Meanwhile, compared with the AUC of the other four measures, the improvement rate can reach . The experiment suggests that the RESG index achieves the highest AUC value on four graph datasets, which to some extent shows its effectiveness and feasibility.

However, it should be noted that the proposed RESG index also has limitations: it achieves better link prediction performance on graph data with small clustering coefficients, whereas for graph data with large clustering coefficients, the performance of this measure needs to be further improved and optimized.

8. Conclusion

Measuring the similarity of patterns in graph data is significant work in many fields. In this paper, to overcome the shortcomings and limitations of existing similarity measures, a relative entropy-based similarity for patterns in graph data, abbreviated as the RESG index, is constructed. Our main work is divided into three aspects. Firstly, the strength set is given by degree and weight, for which four variables containing the topological relationship information of first-order neighbors are proposed. Then, in order to generate the probability set, a pattern with fewer neighbors is redefined by appending empty neighbors until it has the same number of neighbors as the other pattern. Finally, the relative entropy is computed, and the pattern’s similarity is calculated. In addition, two sets of comparison experiments with several classic similarity measures are used to show the effectiveness and feasibility of the proposed RESG index algorithm. Experiments indicate that by taking a pattern’s degree, weight, and second-order neighbors into consideration, the RESG index algorithm can better identify the similarity between patterns. To some extent, our proposed approach can enrich the research in the area of pattern similarity in graph data.
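The padding and relative entropy steps summarized above can be sketched as follows. This is a minimal illustration under stated assumptions and not the RESG index itself: the probability sets are taken as given, entries are sorted in descending order to align them, empty neighbors are modeled as zero probabilities, terms with a zero on either side are skipped to keep the sum finite, and the two directions of relative entropy are added to obtain a symmetric value. All function names are hypothetical.

```python
import math


def pad(p, size):
    """Append zero-probability 'empty neighbors' until the list has `size` entries."""
    return list(p) + [0.0] * (size - len(p))


def relative_entropy(p, q):
    """Sum of p_i * log(p_i / q_i) over positions where both entries are positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)


def pattern_difference(p, q):
    """Symmetric relative entropy between two padded, sorted probability sets."""
    size = max(len(p), len(q))
    p_pad = pad(sorted(p, reverse=True), size)
    q_pad = pad(sorted(q, reverse=True), size)
    return relative_entropy(p_pad, q_pad) + relative_entropy(q_pad, p_pad)


# Hypothetical probability sets for two patterns with 2 and 3 neighbors.
p = [0.5, 0.5]
q = [0.5, 0.3, 0.2]
diff_pq = pattern_difference(p, q)
diff_pp = pattern_difference(p, p)
```

Under these assumptions, a smaller value indicates more similar patterns, and identical probability sets give a difference of zero.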

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We thank the National Natural Science Foundation of China (No. 61966039). This work is also partially supported by the Scientific Research Foundation of Education Department of Yunnan Province (No. 2021Y670).