Abstract

Identifying influential nodes is important for understanding the design patterns of software and for controlling its development and maintenance. However, no efficient method for discovering them has been available so far. Based on the invoking dependency relationships between nodes, this paper proposes a novel approach to defining node importance for mining influential software nodes. First, using information from multiple executions, we construct a weighted software network (WSN) to represent the software execution dependency structure. Second, taking into account the invoking times and the outdegree of software nodes, we improve the PageRank method and put forward a targeted algorithm, FunctionRank, to evaluate the node importance (NI) in the weighted software network. The larger the NI value of a node, the higher its influence. Finally, by comparing the NI of the nodes, we obtain the most influential nodes in the software network. The experimental results show that the proposed approach performs well in identifying the influential nodes.

1. Introduction

Accurately measuring the importance of nodes in a software network is a prerequisite for improving the security and robustness of software [1, 2]. Moreover, as software evolves, measuring node importance has practical significance for defending and protecting the influential nodes in the software network [3]: if these nodes suffer deliberate attacks, cascading failures may occur [4, 5]. Accordingly, how to mine the latent characteristics of software in order to control the evolution of the software structure has become a research hot spot [6–9].

Many researchers have introduced ideas from complex networks into the study of software structure and have abstracted software into networks at different levels of granularity [10]. With the network structure, many latent characteristics can be discovered directly. Ma et al. [11] abstracted the interaction relationships between packages into a software network, defining the functions in a package as nodes and the dependencies among functions as edges. Wang et al. [12] adopted the theory of complex networks to study the evolution of special software kernel components; they also proposed a generic method to find major structural changes that happened during the evolution of software systems. Li et al. [13] proposed a modular attachment mechanism of software network evolution. Their approach treated an object-oriented software system as a modular network, which was more realistic. A new definition of asymmetric probabilities was given for acquiring links in directed networks when new nodes attach to the existing network. With the directed network, both the “scale-free” and “small-world” properties were verified to be present in the software network. In [14], David proposed a method to reduce the complexity of the software network, with which some valuable characteristics of the network could be obtained easily. These studies show that complex networks are applicable to software engineering and bring a new perspective to research on software structure. However, the modeling methods mentioned above are based on the static structure of the source code and neglect the characteristics of the software at run time, although most characteristics of software are exhibited during execution.

The characteristics of software execution help us to understand the software better. The node is an important part of a network and has an enormous influence on its stability, reliability, and robustness [15]. In a software network, the software functions play a critical role in the stability and robustness of the software during execution. In the structure of the software, the functions carry most of the feature characteristics and topology information, and they can affect each other. In most cases, a fault in a function is caused not only by the function itself but also by other functions. Recently, the importance of a node in a network has been defined from different aspects. Bhattacharya et al. [2] defined a measure of the relative importance of nodes in software by assigning a numerical weight to each node of the software graph. Using the betweenness and the clustering coefficient, Zhang et al. [16] measured the importance of each node to analyze its influence on the entire network. According to the propagation field of classes, Li et al. [17] put forward an indicator to measure the importance of classes in the software network at the class level. Based on the indegree and outdegree of each node, Wang and Lü [18] proposed a method to mine the influential nodes; with it, they showed that faults appear with large probability in nodes with a large degree. These studies show that nodes play a key role in analyzing the network. However, each node was regarded as an individual unit, and the relationship between the node and the entire network was ignored. In practical applications, the network should be considered as a whole, in which the nodes interact with each other.

Considering the above-mentioned shortcomings, the dependency relationships between function nodes, and the absence of efficient analysis methods, we construct the WSN to represent the software structure using information from multiple executions. Based on the dependency relationships between the function nodes, we present a targeted method, FunctionRank, to evaluate the importance of the software nodes. Using the result for each node, we rank the nodes by influence to mine the top-$k$ nodes. These function nodes play an important part in ensuring software reliability and stability, so they deserve more attention during software updating and maintenance.

The primary contributions of this paper can be summarized as follows:
(i) A novel method is proposed to construct the weighted software network (WSN), which makes the understanding and recognition of the software structure more accurate.
(ii) A measure, node importance (NI), is put forward to evaluate the importance of each node in the network.
(iii) The IC (independent cascade) model is used as an attack model to evaluate the influential functions of a software system.
(iv) The proposed algorithm is an effective method for security measurement of cybernetworks and provides a basis for improving software security and reliability.

The rest of this paper is organized as follows. The construction of the weighted software network (WSN) is described in Section 2. The node importance of each function node is defined in Section 3. Then, in Section 4, the FunctionRank method is given to mine the most influential nodes. In Section 5, the performance of the proposed algorithm is shown by experiments. Finally, conclusions and future work are presented in Section 6.

2. Definitions of Weighted Software Network

Complex networks are suitable to show the invoking relationships between the software functions. Based on the information of the multiple execution processes, we define the software execution dependency structure with a directed-weighted network.

2.1. Software Network

In this section, according to the multiple execution information, we define a software network to demonstrate the software execution dependency structure. Figure 1 shows a real example of software network.

In Figure 1, each node represents a software function and each edge represents an invoking relationship between functions. In the software network, most of the characteristics are exhibited during the software execution process.

2.2. Weighted Software Network

Next, to guarantee the completeness of the experimental data and to make the understanding and recognition of the software structure more accurate, we define a weighted software network. In contrast to the plain software network, we take the invoking times between software functions over multiple executions as the weight of each edge. The weighted software network is suitable for representing the complex invoking relationships between software functions. The weighted software network (WSN) is defined as follows:

$$\mathrm{WSN} = (\mathit{Node}, \mathit{Edge}, W).$$

Figure 2 shows a weighted software network, where Node is the set of software functions and Edge is the set of invoking relationships between the software functions. $W_{ij}$, which stands for the weight of the edge from node $v_i$ to node $v_j$, is calculated by the following formula:

$$W_{ij} = \sum_{k=1}^{n} x_{ij}^{(k)},$$

where $n$ is the number of trials with different experiment cases and $x_{ij}^{(k)}$ takes the value 1 or 0. If the edge of a calling relationship appears in the $k$th execution trace, no matter how many times, $x_{ij}^{(k)}$ is 1; otherwise it is 0.

Figure 3 presents a simple example of how the WSN is established. As shown in Figure 3(a), each of $s_1$ to $s_5$ is a function invoking trace from a single execution of the software. A trace contains a series of function calling relationships that reflect the software execution process. Figure 3(b) shows the structure of the resulting WSN: the nodes and edges of the network are the functions and the calling relationships between functions that appear in $s_1$ to $s_5$ in Figure 3(a), and the weight of an edge is the number of the 5 executions in which that calling relationship appears. How many times a calling relationship occurs within a single execution is ignored.

Using execution information gathered under different experimental cases, we guarantee the completeness of the experimental data. The function nodes that appear during the multiple executions form the node set of the network, and the calling relationships between the functions form the edge set. The weight of an edge is the number of execution traces in which the edge appears; the number of times it appears within a single execution trace is ignored. In this way, the WSN is built.
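The construction can be sketched in a few lines of Python. This is a minimal sketch, assuming each execution trace is available as a list of (caller, callee) pairs; the function name `build_wsn` and the example traces are hypothetical and are not taken from the experimental data.

```python
from collections import Counter

def build_wsn(traces):
    """Build the node set and edge weights of a WSN from execution traces.

    Each trace is a list of (caller, callee) pairs from one run.
    The weight of an edge is the number of traces in which it appears
    at least once; repeats inside a single trace are ignored.
    """
    weights = Counter()
    for trace in traces:
        for edge in set(trace):  # de-duplicate within one trace
            weights[edge] += 1
    nodes = {n for edge in weights for n in edge}
    return nodes, dict(weights)

# Hypothetical traces for illustration only.
traces = [
    [("main", "parse"), ("parse", "read"), ("parse", "read")],
    [("main", "parse"), ("main", "report")],
    [("main", "parse"), ("parse", "read")],
]
nodes, weights = build_wsn(traces)
print(weights[("main", "parse")])  # 3
print(weights[("parse", "read")])  # 2
```

The edge from `parse` to `read` occurs twice in the first trace but still contributes only 1 for that trace, so its weight is 2.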

3. Node Importance

To explain the importance of a function node, Figure 4 shows the most common topology structures that arise from the complex invoking relationships in a weighted software network.

Definition 1 (IN (indegree nodes)). For a node $v_i$, IN($v_i$) is the set of functions that call $v_i$ directly, that is, the nodes that reach $v_i$ in exactly one call step.
Figure 4(a) shows an example. The influence of node $v_i$ is based on the nodes in IN($v_i$), which call $v_i$ directly.

Definition 2 (ON (outdegree nodes)). For a node $v_i$, ON($v_i$) is the set of functions that are called by $v_i$ directly. The number of nodes in ON($v_i$) is the outdegree of $v_i$, CO($v_i$).
As shown in Figure 4(a), ON($v_i$) contains three nodes, so CO($v_i$) = 3.

Definition 3 (TN (terminal nodes)). The nodes that have no outdegree and have no contribution to the influence of other nodes are defined as terminal nodes.
Figure 4(b) shows an example of a terminal node.

Definition 4 (LTN (loop terminal nodes)). The nodes whose only outlink points to themselves are defined as loop terminal nodes.
As shown in Figure 4(c), a node whose only outlink points to itself is a loop terminal node.

Definition 5 (OD (output degree)). The sum of the weights of the edges from a node $v_i$ to its outdegree nodes, OD($v_i$), is called the output degree of $v_i$.
In Figure 4(a), the weights of the edges from $v_i$ to its three outdegree nodes are 2, 2, and 5, respectively. OD($v_i$) is the sum of these weights, namely, OD($v_i$) = 2 + 2 + 5 = 9.

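Definition 6 (WC (weighted contribution)). The ratio of $W_{ij}$, the weight of the edge from node $v_i$ to node $v_j$, to the output degree of $v_i$, written WC($v_i$, $v_j$), is the weighted contribution of $v_i$ to $v_j$.
In Figure 4(a), the weight of the edge from $v_i$ to one of its outdegree nodes, $v_j$, is 2. The weighted contribution of $v_i$ to $v_j$ is given as follows:

$$\mathrm{WC}(v_i, v_j) = \frac{W_{ij}}{\mathrm{OD}(v_i)} = \frac{2}{2 + 2 + 5} = \frac{2}{9}.$$

Based on the above definitions, the node importance (NI) of node $v_i$ is computed from the weighted contributions that $v_i$ receives from the nodes in IN($v_i$), the outdegree CO($v_i$), and a parameter $d$, where $d$ is the probability of calling a random node for an LTN, with every node equally likely to be invoked. It is set to 0.15 on the basis of experimental verification.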
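Definitions 1 to 6 translate directly into set and sum operations over the weighted edges. The following Python sketch assumes the network is stored as a dictionary from (caller, callee) pairs to integer weights; the helper names and the small example network, including the self-loop on `c`, are hypothetical and serve only to mirror the weights 2, 2, and 5 used in the Figure 4(a) example.

```python
from fractions import Fraction

def out_nodes(weights, v):
    """ON(v): nodes called directly by v."""
    return {t for (s, t) in weights if s == v}

def in_nodes(weights, v):
    """IN(v): nodes that call v directly."""
    return {s for (s, t) in weights if t == v}

def output_degree(weights, v):
    """OD(v): sum of the weights on the outgoing edges of v."""
    return sum(w for (s, _), w in weights.items() if s == v)

def weighted_contribution(weights, v, u):
    """WC(v, u): weight of edge v -> u divided by OD(v)."""
    return Fraction(weights[(v, u)], output_degree(weights, v))

def is_terminal(weights, v):
    """TN: v has no outgoing edges."""
    return not out_nodes(weights, v)

def is_loop_terminal(weights, v):
    """LTN: the only outgoing edge of v is a self-loop."""
    return out_nodes(weights, v) == {v}

# Hypothetical network; the self-loop on "c" is added to illustrate an LTN.
weights = {("vi", "a"): 2, ("vi", "b"): 2, ("vi", "c"): 5, ("c", "c"): 1}
print(len(out_nodes(weights, "vi")))              # 3
print(output_degree(weights, "vi"))               # 9
print(weighted_contribution(weights, "vi", "a"))  # 2/9
print(is_terminal(weights, "a"), is_loop_terminal(weights, "c"))  # True True
```

Exact fractions are used so that the weighted contribution prints as 2/9 rather than a rounded decimal.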

4. Important Nodes Mining

In this section, we first provide an algorithm, outdegree nodes, which builds the list of outdegree nodes of every node in the software network. Based on these lists, we then provide a second algorithm, FunctionRank, to calculate the NI of each node. FunctionRank evaluates the importance of nodes iteratively (see Algorithms 1 and 2).

Algorithm 1: Outdegree nodes.
Input: node set V, edge set E
Output: childStr //the out-degree node list of all nodes
(01) for (each v_i ∈ V)
(02)   for (each edge (v_s, v_e) ∈ E)
(03)     if (v_i = v_s)
(04)       childStr += “ ” + v_e;
(05)   end for
(06)   print (v_i + childStr);
(07) end for

Algorithm 2: FunctionRank.
Input: node v_i, childStr (v_i)
Output: the NI of node v_i //evaluate the importance of nodes
Process:
(01) Initialize NI (v_i) = 1, d = 0.15
(02) if (childStr [v_i] != null)
(03)   outdegree = childStr.size ();
(04)   for (each v_j ∈ childStr [v_i])
(05)     if (v_j is equal v_i)
(06)       outdegree--;
(07)     else
(08)       weighMap.put (v_j, weight (v_i, v_j));
(09)       weigh += weight (v_i, v_j);
(10)     end if
(11)   end for
(12)   for (each v_j ∈ childStr [v_i])
(13)     if (v_j is not equal v_i)
(14)       tempNI (v_j) += NI (v_i) × weighMap.get (v_j)/weigh;
(15)     end if
(16)   end for
(17) end if
(18) tempNI (v_i) = () (outdegree (v_i) + 1);
(19) NI (v_i) = tempNI (v_i);

As shown in Algorithm 1 (outdegree nodes), for each node $v_i$ in the node set $V$ we traverse the edges in the edge set $E$ in lines (1) and (2). We denote the two endpoints of an edge as the start node $v_s$ and the end node $v_e$, respectively. In lines (3) and (4), we add the end node $v_e$ of an edge to the childStr of node $v_i$ when $v_i$ equals the start node $v_s$ of the edge. Finally, we print the childStr of $v_i$ in line (6).
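The same out-degree lists can be built in a single pass over the edges instead of scanning every edge for every node. The Python sketch below is an illustration under that assumption, not the implementation used in the experiments; the function name `outdegree_nodes` and the example nodes and edges are hypothetical.

```python
def outdegree_nodes(nodes, edges):
    """Return a dict mapping each node to the list of nodes it calls.

    `edges` is an iterable of (start, end) pairs. Every node in `nodes`
    gets an entry, so nodes with no outgoing edge map to an empty list.
    """
    children = {v: [] for v in nodes}
    for start, end in edges:
        # setdefault also covers a start node missing from `nodes`.
        children.setdefault(start, []).append(end)
    return children

# Hypothetical call graph for illustration only.
nodes = ["main", "parse", "read"]
edges = [("main", "parse"), ("parse", "read"), ("main", "read")]
for v, kids in outdegree_nodes(nodes, edges).items():
    print(v, kids)
```

A single pass costs time proportional to the number of nodes plus the number of edges, whereas the nested loops of the pseudocode cost time proportional to their product.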

We evaluate the importance of each node in the network by an iterative process, as shown in Algorithm 2. In line (1), we initialise NI($v_i$), the importance of node $v_i$, and $d$, the probability of calling a random node. Lines (2) to (19) form the iterative process that computes the importance contributed by the outdegree of the current node and by the other nodes that call the current node. The computational formula of node importance (NI) is given in line (18); the larger the NI value of a node, the higher its influence. Ultimately, the NI of a node is obtained when the difference between the current and previous NI values is less than a given threshold for all nodes.
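The iteration and its stopping rule can be illustrated in Python. This is a minimal sketch under stated assumptions, not the exact FunctionRank update: it uses a plain weighted PageRank-style rule, NI(v) = d + (1 − d) Σ NI(u) · WC(u, v) over the callers u of v, skips self-loops as Algorithm 2 does, and leaves out the outdegree factor of line (18). The function name `function_rank`, the tolerance, the iteration cap, and the example network are hypothetical.

```python
def function_rank(weights, d=0.15, tol=1e-9, max_iter=1000):
    """Iterate a weighted PageRank-style update until NI stabilises.

    `weights` maps (caller, callee) pairs to edge weights.
    Assumed update rule (an illustration, see the surrounding text):
        NI(v) = d + (1 - d) * sum over callers u of NI(u) * w(u, v) / OD(u)
    """
    nodes = {n for edge in weights for n in edge}
    # Output degree of each node, ignoring self-loops.
    od = {v: 0 for v in nodes}
    for (s, t), w in weights.items():
        if s != t:
            od[s] += w
    ni = {v: 1.0 for v in nodes}  # initial importance of every node
    for _ in range(max_iter):
        temp = {v: 0.0 for v in nodes}
        for (s, t), w in weights.items():
            if s != t:
                temp[t] += ni[s] * w / od[s]  # weighted contribution of s to t
        new = {v: d + (1 - d) * temp[v] for v in nodes}
        # Stop when every node changed by less than the threshold.
        if max((abs(new[v] - ni[v]) for v in nodes), default=0.0) < tol:
            return new
        ni = new
    return ni

# Hypothetical weighted network for illustration only.
weights = {("main", "parse"): 3, ("main", "report"): 1, ("parse", "read"): 2}
ni = function_rank(weights)
for v in sorted(ni, key=ni.get, reverse=True):
    print(v, round(ni[v], 4))
```

In this example `read` receives the whole contribution of `parse`, which itself receives three quarters of the contribution of `main`, so `read` ends up with the largest value and `main`, which nobody calls, with the smallest.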

With the NI values obtained from Algorithm 2, we choose the top-$k$ nodes as the influential nodes of the software network. Algorithm 3 illustrates the process of selecting the top-$k$ nodes (KN).

Algorithm 3: Top-k nodes (KN).
Input: node set V, NI of each node
Output: the top-k influential nodes
Process:
(01) Initialize list //store the importance of nodes
(02) for (each node v_i ∈ V)
(03)   list.add (NI (v_i));
(04) end for
(05) Collections.sort (list);
(06) Collections.reverse (list);
(07) print list.subList (0, k) //the first k entries, that is, the top-k nodes

In Algorithm 3, we initialise list as the list of NI values of all the nodes in line (1). Lines (2) to (4) loop over the nodes to store the NI value of each node. The sorting is done in lines (5) and (6), and the top-$k$ nodes are chosen from the list in line (7).
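When only the top-$k$ entries are needed, a full sort followed by a reverse can be replaced by a partial selection that also keeps each node name next to its score. The Python sketch below uses the standard `heapq.nlargest` function; the function name `top_k_nodes` and the NI values are hypothetical.

```python
import heapq

def top_k_nodes(ni, k):
    """Return the k (node, NI) pairs with the largest NI, best first."""
    return heapq.nlargest(k, ni.items(), key=lambda item: item[1])

# Hypothetical NI values for illustration only.
ni = {"main": 0.15, "parse": 0.9, "read": 2.4, "report": 0.4}
print(top_k_nodes(ni, 2))  # [('read', 2.4), ('parse', 0.9)]
```

Keeping the (node, NI) pairs together avoids a second lookup to recover which node a sorted value belongs to.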

5. Experimental Analysis

A series of experiments were conducted to compare the performance of the proposed algorithm (named as FunctionRank) with different parameter values. They were implemented in JDK1.6.0 and executed on a PC with 3.30 GHz CPU and 5 GB memory.

5.1. Experimental Datasets

Several dynamic software datasets are used to evaluate the performance of the algorithms. The software is classical software obtained from the open-source community. These programs, which include tar and cflow, are written in C or C++.

In the experiments, we chose different versions of tar and cflow. tar is an archiving utility for Linux, and cflow is an analysis tool that extracts the function call relationships of C programs (both were downloaded from the open-source software library https://sourceforge.net).

5.2. Evaluation on the FunctionRank

We run the algorithm on each version of tar and cflow. With the FunctionRank algorithm, we calculate the NI of each function node. Here we mine the top-10 nodes in each version of tar and cflow. The results are shown in Tables 1 and 2, respectively.

As shown in Table 1, for versions tar-1.21 and tar-1.23 the NI values of the top-10 nodes are almost the same. The reason is that these versions differ only in the number of function calls; in other words, the component functions of these two versions did not change. In the latest three versions, developers changed the logic of some functions or inserted new functions to enrich the features of the software; in addition, the software was simplified or some features were removed to improve robustness. These changes result in the ranking variation. For example, in the earlier versions node _gnu_flush_read ranked 2nd or 3rd, but it ranked 7th or 8th in versions tar-1.25, tar-1.27, and tar-1.28. Table 2 shows the top-10 influential functions of cflow in different versions. The ranking of some functions varies from version to version, but only within a small range. For example, the ranking of function print_symbol ranges from 1 to 2, so we can predict that it will probably remain more influential than most other functions in the next version. Meanwhile, function alloc_cons is absent from the latest versions, cflow-1.3 and cflow-1.4, which results in ranking variation; in other words, the component functions changed in these two versions.

In addition, the number of nodes with high NI is rather small in each version. These high-value nodes play a great part in ensuring software reliability and stability. This means that only a few functions need more attention during software updating and maintenance. We count the number of nodes in each range of NI values. The results for tar and cflow are shown in Figures 5 and 6, respectively.
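Counting nodes per NI range is a simple binning step. The Python sketch below assumes half-open ranges defined by a sorted list of boundaries; the function name `count_by_range`, the boundaries, and the NI values are hypothetical and do not reproduce the data behind Figures 5 and 6.

```python
from bisect import bisect_right

def count_by_range(values, edges):
    """Count values per range for sorted boundaries `edges`.

    With edges [e0, e1, ...] the ranges are (-inf, e0), [e0, e1), ...,
    [e_last, +inf), so the result has len(edges) + 1 counts.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return counts

# Hypothetical NI values and range boundaries for illustration only.
ni_values = [0.4, 0.4, 0.45, 0.7, 1.5, 3.0]
print(count_by_range(ni_values, [0.5, 1.0, 2.0]))  # [3, 1, 1, 1]
```

In this made-up example half of the nodes fall into the lowest range, which is the shape of distribution the text describes.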

As we can see in Figure 5, most nodes are ordinary functions, which need no special attention. Meanwhile, the handful of nodes with high NI deserve more attention, because they play important roles in software updating and maintenance. For cflow, the number of nodes in each range is shown in Figure 6. It has the same characteristic as tar: the number of nodes with high NI is much smaller than the number with low NI. By paying more attention to these influential nodes in future versions, we can improve software reliability and stability, and thereby greatly reduce the amount of work and improve efficiency.

At the same time, the NI of nodes with the same rank fluctuates only slightly across versions, as shown in Figures 7 and 8.

As shown in Figure 7, the NI distribution of tar is extremely similar across the six versions. As the rank number increases, the NI of the node at that rank decreases; conversely, the closer a node is to the top of the ranking, the larger its NI. The higher NI values range from 0.7 to 3.0, whereas most nodes have values around 0.4. The versions develop according to the same laws: the NI at a given rank remains stable, and the NI distribution of different versions is nearly the same. We can therefore predict the trends of future versions on this basis. Meanwhile, Figure 8 shows the NI distribution of cflow, where the higher NI values range from 0.8 to 4.0 and most nodes have values around 0.5. The curve of each version has the same tendency; namely, the NI distribution of cflow follows the same trend.

5.3. Performance Evaluation

In the study of complex networks, the effectiveness of a method is often examined [19, 20] by analyzing the spreading influence of the top-$k$ nodes. Therefore, this paper introduces the IC (independent cascade) model. The IC model is derived from the SIR (Susceptible-Infected-Recovered) model, a theory of virus spreading that has been widely studied in complex networks in areas such as marketing, advertising, early warning, and social stability. In software engineering, similar algorithms have been used to analyze change impact [21] and error propagation [22].

The IC model is a probabilistic model: when a node is activated, it attempts to activate each of its inactive outdegree nodes with a given probability, and only once [23]. Whether or not the node activates its neighbors successfully, it remains active but has no further influence. The propagation process ends when no influential active nodes remain in the network. In the actual execution of software, similarly, a running fault can affect the running of other functions through the invoking relationships. When a fault occurs, the invoked functions affect the normal execution of their parent function, so faults can spread widely among the function nodes during execution. We therefore take the IC model as a software attack model to evaluate the effectiveness of our method. A software attack instance is shown in Figure 9.

We assume that nodes a and d are attacked, as Figure 9(a) shows. Nodes a and d then each attack their inactive outdegree nodes with a given probability, and only once. In Figure 9(b), some of these outdegree nodes are attacked successfully by a and others are attacked successfully by d. Next, a and d have no further aggressivity, and the nodes attacked by a and d attack their own inactive outdegree nodes with the same probability in the same way. Finally, the number of attacked nodes represents the influence of the originally attacked nodes.

When calculating the influence of the top-$k$ important nodes obtained by different methods, we run the IC model about 10 times for each method and take the average number of active nodes as the performance measure of the method.
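The evaluation procedure can be sketched in Python as a breadth-first simulation repeated over several runs. This is an illustration under stated assumptions: the network is given as out-degree lists, a single probability `p` applies to every edge, and a fixed random seed is used for reproducibility. The function names, the value of `p`, and the example network are hypothetical.

```python
import random

def independent_cascade(children, seeds, p, rng):
    """Run one IC simulation and return the set of active nodes.

    Each newly activated node gets exactly one chance to activate
    each of its inactive outdegree nodes, with probability p.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in children.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier  # earlier nodes have no further influence
    return active

def average_spread(children, seeds, p, runs=10, seed=0):
    """Average number of active nodes over `runs` simulations."""
    rng = random.Random(seed)
    total = sum(len(independent_cascade(children, seeds, p, rng))
                for _ in range(runs))
    return total / runs

# Hypothetical network with seed nodes a and d, for illustration only.
children = {"a": ["b", "c"], "d": ["e"], "b": ["f"]}
print(average_spread(children, ["a", "d"], p=1.0))  # 6.0
print(average_spread(children, ["a", "d"], p=0.0))  # 2.0
```

With `p = 1.0` every reachable node is activated and with `p = 0.0` only the seeds are active, which gives two quick sanity checks before intermediate probabilities are used.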

The key entities of a software system typically account for a small proportion of it, only 1.5% to 2% in the study of class size [24]. At the same time, the cost of checking most of the entities is unacceptable, so an appropriate number of key entities needs to be selected. Ranking all functions in descending order of the measurements, we chose a small number of key functions for each system: the top 20 for tar and the top 30 for cflow.

Figure 10 shows the average number of active nodes for different software versions. In all versions of the software systems, the key functions identified by NI activate more nodes than those identified by PageRank and MKN [25]. Compared with these two methods, NI is therefore more effective in identifying the key functions. The key functions play an important role in a software system in terms of reducing the amount of test data, detecting vulnerabilities in the software structure, and analyzing software reliability, and they deserve more attention during software updating and maintenance. Accurately measuring the importance of nodes in a software network is a prerequisite for improving the security and robustness of software. Moreover, as software evolves, measuring node importance has practical significance for protecting the influential nodes in the software network from deliberate attacks.

6. Conclusions and Future Work

To understand and recognize software structure better, this paper proposes a novel method to mine the influential nodes in a weighted software network. First, taking the invoking times into account, we construct a directed-weighted network structure to make the understanding and recognition of the software structure more accurate. Then, a measure, NI, is put forward to evaluate node importance, which brings PageRank and the WSN into the software engineering domain. Furthermore, we consider the outdegree as a key parameter of node importance, since the outdegree reflects the complexity of a node. Finally, the FunctionRank algorithm is presented to calculate the NI, and the trends in node importance are analyzed across different software versions. The experimental results show that the proposed approach is feasible and performs well in identifying the influential software nodes.

Although the proposed approach shows some feasibility in identifying influential nodes in complex software networks, its broad validity should be demonstrated further. In future work, we will use more open-source software networks to evaluate its validity and to improve the approach.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Key R&D Program of China (2016YFB0800700), the National Natural Science Foundation of China under Grants no. 61472341, no. 61572420, and no. 61772449, the Natural Science Foundation of Hebei Province of China under Grants no. F2015203326 and no. F2016203330, and the Advanced Program of Postdoctoral Scientific Research under Grant no. B2017003005.