Abstract

The scale and complexity of software systems are constantly increasing, imposing new challenges for software fault location and daily maintenance. In this paper, the Security Feature measurement algorithm of Frequent dynamic execution Paths in Software, SFFPS, is proposed to provide a basis for improving the security and reliability of software. First, the dynamic execution of a complex software system is mapped onto a complex network model and a sequence model. Then, based on the invocation and dependency relationships between function nodes, the cumulative effect and the spread effect of faults are analyzed. The security features of function nodes in the software complex network are defined and measured according to the degree distribution and a global step attenuation factor. Finally, frequent software execution paths are mined and weighted, and security metrics of the frequent paths are obtained and sorted. The experimental results show that SFFPS has good time performance and scalability, and the security features of the important paths in the software can be effectively measured. This study provides a guide for research on defect propagation, software reliability, and software integration testing.

1. Introduction

The increasing complexity of software requirements leaves software developers unsure of the quality of the systems they build; in effect, the “software crisis” still has not been completely solved. How to effectively uncover the inherent characteristics of the software system structure, so as to recognize, measure, manage, and control the complexity of that structure, has become a key problem in removing the development bottleneck in the software industry.

Research on the complexity of software network structure can combine the methods of complex system science and statistical physics. Depending on the granularity, software systems can be composed of different types of software entities, such as functions, classes, subroutines, packages, and artifacts. With these entities interacting with each other, software systems can achieve specific functional requirements. If the software entities are viewed as nodes and the relationship between the nodes is abstracted as edges, the software execution process presents a nonlinear network structure according to the relationship of the entities [1] and also a linear sequence structure according to the sequential characteristics of the execution order. Then, the software system can be expressed as an abstracted complex network model and a sequence model, which provides a new train of thought [2] for the description of the software system.

The root cause of the security danger hidden in software lies in the vulnerability of the entity itself. Vulnerability measures the potential for a software entity to be exploited in an attack. It can be discussed from the perspective of computer networks [3, 4] or of static code analysis, but both perspectives ignore the integrity (whole structure) and the dynamic execution (behavior characteristics) of the software system. In addition, the degree to which software system security is threatened depends not only on the severity of the fault, but also on the fault propagation capacity of the entity. If one or more functions fail, the fault may be propagated to other functions through invocation relationships and may further cause part of or the whole software system to crash, which is known as “cascading failure” [5]. Therefore, software security feature measurement should take into account both the vulnerability and the propagation of software entities.

How to quantitatively measure the security features of nodes in the software complex network is the premise and basis for further analysis of the software behavior trajectory path. At present, there are many methods for discovering the important nodes in complex networks. The classic centrality-based methods include degree centrality [6], closeness centrality [7], betweenness centrality [8], eigenvector centrality [9], subgraph centrality [10], and so on. The classic methods based on the random walk model include PageRank [11], LeaderRank [12], and their improved algorithm NodeRank [13]. Wang and Lü [14] use an influential-node mining method to show that the defect propagation capacity of a node is stronger if its in-degree and out-degree are larger. Huang et al. [15] use the invocation and dependency relationships between functions, together with the fault probability of nodes, to calculate the fault accumulation degree of upper nodes by iterating from the leaf nodes. These methods attempt to relate software node importance to fault generation and propagation, but they do not form a measurement of software security.

A sequence, or path, is the most basic and important way to describe the dynamic execution process of software. The full execution path of the whole software reflects the occurrence order and frequency of its internal entities. However, path extraction and mining are restricted by the nested, cyclic, iterative, and continuous invocation relationships of entities. Most software path mining algorithms extract paths on the basis of complex networks. For example, Tang et al. [16] propose an algorithm for mining the shortest path between any two vertices in a complex network. Zhang et al. [17] minimize the length of the extracted path and reduce unnecessary time overhead by further processing the repetitive structure. The GP method proposed by Nguyen et al. [18] can automatically detect and fix software vulnerabilities according to the software execution path. Murtaza et al. [19] predict possible future software defects by analyzing historical vulnerability sequence data with Markov characteristics, so as to provide adequate response time. Zou et al. [20] analyze the reliability of Digital Instrumentation and Control software systems based on the flow network model by finding sensitive paths in complex software. These algorithms extract paths from the network, which can lead to repeated reading and approximate connection; in addition, these software security analyses cannot work without existing vulnerability information or real faults as training data.

In this paper, the Security Feature measurement algorithm of Frequent dynamic execution Paths in Software, SFFPS, is proposed. A complex network model and a sequence model are formed based on software dynamic execution behavior. The algorithm is intended for early security feature measurement, before real vulnerabilities or faults have appeared, and so can provide a premise for software quality and reliability evaluation. The main contributions are as follows.

The software system is mapped to a complex network model and sequence model, from the nonlinear perspective to effectively express the characterization of complex correlation between software entities and from the linear perspective to capture sequential characteristics of the dynamic execution.

The behavior of fault accumulation and propagation is analyzed based on the system structure of software dynamic execution, and standard measurements of the security features (vulnerability and propagation) are defined.

Frequent paths in software dynamic execution are mined and weighted by the node security features. The key paths that are worthy of attention are identified by both their frequency and their security features.

The remainder of the paper is organized as follows. Section 2 gives the model construction. Sections 3 and 4 develop the definition of the security features and the SFFPS algorithm. Section 5 provides some examples. Section 6 presents the performance study of SFFPS and shows the rank of the important paths. Section 7 contains the concluding remarks.

2. Constructions of Complex Network Model and Sequence Model

The dynamic execution trace of software systems contains three phases, which are data collection, tracking data simplification, and data visualization as shown in Figure 1. The modeling process of simple functions is shown in Figure 2.

Phase 1. Match the entry and exit configuration functions of the GNU compiler toolchain (gcc), and insert the analysis function into the entry and exit of the application functions to trace the function execution process. The tracking results are recorded in the file trace.txt.

Phase 2. The letters “E” and “X” before the tracking addresses represent the entry and exit of a function, respectively. The simplification tool Pvtrace analyzes the function invocations according to these letters, and the address translation tool Addr2line converts each address to a function name.

Phase 3. The function invocation order is mapped to the sequence model, and the visualization tool Graphviz is used to form the complex network, which defines the global relationship between all the functions.

The correspondence between function addresses and function names for this example is given in Figure 2.

Only the addresses with the letter “E” are used for sequence model construction.
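The data simplification step can be illustrated with a minimal sketch. It assumes that each line of the trace file holds one record of the form `E<address>` or `X<address>`, which is how Phase 2 describes the letters; the exact file layout, the function name `parse_trace`, and the sample addresses are assumptions made for illustration and are not taken from the tools themselves. A call stack yields the invocation edges for the complex network model, and the entry records alone yield the sequence model.

```python
from typing import Iterable, List, Set, Tuple


def parse_trace(lines: Iterable[str]) -> Tuple[Set[Tuple[str, str]], List[str]]:
    """Turn E/X trace records into call edges and an entry-order sequence."""
    edges: Set[Tuple[str, str]] = set()  # (caller, callee) pairs for the network model
    sequence: List[str] = []             # entry order for the sequence model
    stack: List[str] = []                # functions entered but not yet exited

    for raw in lines:
        record = raw.strip()
        if not record:
            continue
        kind, address = record[0], record[1:]
        if kind == "E":
            if stack:
                # The innermost open function is the caller of this entry.
                edges.add((stack[-1], address))
            stack.append(address)
            sequence.append(address)
        elif kind == "X":
            # Pop only when the exit matches the innermost open call.
            if stack and stack[-1] == address:
                stack.pop()
    return edges, sequence


# Hypothetical trace: 0x1 calls 0x2 and 0x3, and 0x3 calls 0x2.
trace = ["E0x1", "E0x2", "X0x2", "E0x3", "E0x2", "X0x2", "X0x3", "X0x1"]
edges, sequence = parse_trace(trace)
print(sorted(edges))
print(sequence)
```

Only the entry records contribute to `sequence`, which mirrors the rule that the sequence model is built from the addresses marked “E.”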

3. The Security Feature Definition and Measurement of Function Nodes

The security feature measurement of a function node is based on the software structure. Vulnerability and propagation are analyzed according to the cumulative effect and the spread effect that arise from the mechanism of fault production and propagation. Global accessibility and fault tolerance with a step attenuation effect are both considered, so the node security features are calculated according to the degree distribution and the step attenuation factor.

Definition 4 (software complex network). In a software complex network, functions are defined as the nodes; the invocation relationships between functions are defined as edges.

Definition 5 (vulnerability). Vulnerability of a function node is the characteristic that the node may break down because a fault in a node it invokes reaches it through the invocation relationship.
Typically, the more nodes a node invokes, the more functionality it carries and the more vulnerable it is; that is, it is more likely to be affected by a fault and to fail. The vulnerability $V$ is calculated as follows:
$$V(v_i) = \mathrm{OutDegree}(v_i) + \alpha \sum_{v_j \in O(v_i)} V(v_j),$$
where $v_i$, $v_j$ represent function nodes, $V(v_i)$ represents the vulnerability of node $v_i$, $\mathrm{OutDegree}(v_i)$ represents the out-degree of node $v_i$, $\alpha$ represents the step attenuation factor, which satisfies $0 < \alpha < 1$, and $O(v_i)$ represents the direct out-neighbor set of node $v_i$.

Definition 6 (propagation). Propagation of a function node is the characteristic that the node may propagate its fault to the nodes that invoke it. The propagation $P$ is calculated as follows:
$$P(v_i) = \mathrm{InDegree}(v_i) + \alpha \sum_{v_j \in I(v_i)} P(v_j),$$
where $P(v_i)$ represents the propagation capacity of node $v_i$, $\mathrm{InDegree}(v_i)$ represents the in-degree of node $v_i$, and $I(v_i)$ represents the direct in-neighbor set of node $v_i$.
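As a small worked illustration, consider a hypothetical three-node chain in which $a$ invokes $b$ and $b$ invokes $c$ (this chain is not the example of Figure 2). Assuming the recursive form that Algorithm 1 suggests, namely a node's own degree plus the attenuated sum of its neighbors' values, and taking $\alpha = 0.5$, the two features would evaluate as:

$$
\begin{aligned}
V(c) &= 0, & V(b) &= 1 + 0.5 \cdot V(c) = 1, & V(a) &= 1 + 0.5 \cdot V(b) = 1.5,\\
P(a) &= 0, & P(b) &= 1 + 0.5 \cdot P(a) = 1, & P(c) &= 1 + 0.5 \cdot P(b) = 1.5.
\end{aligned}
$$

Under this reading, vulnerability accumulates toward the callers at the top of the chain, while propagation accumulates toward the callees at the bottom, and each further step contributes with an extra factor of $\alpha$.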

Algorithm 1 describes the calculation process of vulnerability and propagation.

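Input: Complex network CN, step attenuation factor $\alpha$
Output: Node list with security features NFlist
for each node $v_i$ in CN
  $V(v_i)$ = calculation_V($v_i$);
  $P(v_i)$ = calculation_P($v_i$);
  NFlist.add($v_i$, $V(v_i)$, $P(v_i)$);
Procedure calculation_V($v_i$)
  $V(v_i)$ = outDegree($v_i$);
  for each node $v_j \in O(v_i)$
    $V(v_i)$ += $\alpha \cdot$ calculation_V($v_j$);
  return $V(v_i)$;
Procedure calculation_P($v_i$)
  $P(v_i)$ = inDegree($v_i$);
  for each node $v_j \in I(v_i)$
    $P(v_i)$ += $\alpha \cdot$ calculation_P($v_j$);
  return $P(v_i)$;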
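The recursion in Algorithm 1 can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' implementation: it assumes that each feature is the node's own degree plus the attenuated sum of the features of its direct neighbors, and that the call graph is acyclic (a recursive call cycle would need a depth limit, which is not shown). The function name `security_features` and the four-node sample graph are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List, Tuple


def security_features(
    out_neighbors: Dict[str, List[str]], alpha: float
) -> Dict[str, Tuple[float, float]]:
    """Return {node: (vulnerability, propagation)} for an acyclic call graph."""
    nodes = set(out_neighbors)
    for callees in out_neighbors.values():
        nodes.update(callees)

    # Build the reverse adjacency list so in-neighbors are easy to look up.
    in_neighbors: Dict[str, List[str]] = {n: [] for n in nodes}
    for caller, callees in out_neighbors.items():
        for callee in callees:
            in_neighbors[callee].append(caller)

    @lru_cache(maxsize=None)
    def vulnerability(node: str) -> float:
        # Out-degree plus the attenuated vulnerability of every callee.
        callees = out_neighbors.get(node, [])
        return len(callees) + alpha * sum(vulnerability(c) for c in callees)

    @lru_cache(maxsize=None)
    def propagation(node: str) -> float:
        # In-degree plus the attenuated propagation of every caller.
        callers = in_neighbors[node]
        return len(callers) + alpha * sum(propagation(c) for c in callers)

    return {n: (vulnerability(n), propagation(n)) for n in sorted(nodes)}


# Hypothetical call graph: main calls A and B, and both call C.
graph = {"main": ["A", "B"], "A": ["C"], "B": ["C"]}
features = security_features(graph, alpha=0.5)
for name, (v, p) in features.items():
    print(name, v, p)
```

Memoizing both procedures means each node is evaluated once, so shared callees such as `C` are not recomputed for every caller.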

4. Mining Frequent Paths from Dynamic Execution with Security Feature Measurement

The importance of a software dynamic execution path takes into account two aspects: one is the occurrence frequency of the path and the other one is the security feature coming from the nonrepetitive nodes contained in the path. These two aspects are complementary. For example, if there are lots of loop bodies in the software execution, loop body and its subset are always frequent. But because most of its contained nodes are the same, the fault influence range is small. Similarly, if a path contains many different nodes with a lower occurrence frequency, its impact range is large, but its occurrence possibility is small. That is to say, if the frequency of a path is very high and the path contains more nonrepetitive nodes, the path is worthy of more attention.

4.1. Relative Definitions of Frequent Path

Let $F = \{f_1, f_2, \ldots, f_m\}$ be a set of function symbols. $S = \langle s_1 s_2 \cdots s_n \rangle$ is a software execution path, composed of function symbols in the time order of their occurrence. The minimal support count (mincount) is calculated by $\mathrm{mincount} = \mathrm{minsup} \times |S|$, where minsup is a given threshold and $|S|$ is the number of function symbols in $S$. If there are $k$ symbols in a path $p$, $p$ is a $k$-path.

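Definition 7 (subpath and superpath). A path $a = \langle a_1 a_2 \cdots a_n \rangle$ is a subpath of another path $b = \langle b_1 b_2 \cdots b_m \rangle$, denoted as $a \subseteq b$, if there are numbers $i_1, i_2, \ldots, i_n$, such that $1 \le i_1 < i_2 < \cdots < i_n \le m$ and $a_j = b_{i_j}$, $1 \le j \le n$. It can also be said that $b$ is a superpath of path $a$.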

Definition 8 (support number). Let $p$ be a path; the support number of $p$, denoted as $sup(p)$, is defined as its occurrence number in the software execution.

Property 9 (frequent path). A path is frequent if its support number is equal to or more than mincount.

Property 10 (antimonotone). If path $p$ is not a frequent path, any path $q$ containing $p$, that is, any superpath of $p$, cannot be a frequent path.
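The support test can be made concrete with a short sketch. It assumes that an occurrence of a path means a contiguous run of function symbols in the execution sequence, in line with the continuity of function execution that the mining step preserves, and that mincount is the product of minsup and the sequence length. The helper names and the nine-symbol sequence are hypothetical.

```python
from typing import List, Sequence


def occurrences(path: Sequence[str], execution: Sequence[str]) -> List[int]:
    """Start positions (1-based) at which `path` occurs contiguously."""
    k = len(path)
    return [
        i + 1
        for i in range(len(execution) - k + 1)
        if list(execution[i:i + k]) == list(path)
    ]


def is_frequent(path: Sequence[str], execution: Sequence[str], minsup: float) -> bool:
    """A path is frequent when its support number reaches mincount."""
    mincount = minsup * len(execution)
    return len(occurrences(path, execution)) >= mincount


# Hypothetical execution sequence of nine function symbols.
execution = list("ABCABDABC")
print(occurrences("AB", execution), is_frequent("AB", execution, 0.3))
print(occurrences("BD", execution), is_frequent("BD", execution, 0.3))
print(occurrences("ABD", execution), is_frequent("ABD", execution, 0.3))
```

In this sample, `BD` is infrequent, so its superpath `ABD` cannot be frequent either; this is the pruning that the antimonotone property licenses.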

4.2. Weighting the Frequent Path Based on the Security Features of Function Node

The SFFPS algorithm measures the security features of frequent paths based on the dynamic execution sequence model and the node security features in the complex network model. It contains two phases: frequent path mining and security feature weighting. First, the function nodes in the sequence model are read to form the position set of each function. Then, the position index is used for pattern growth; this self-growth strategy avoids candidate generation and ensures the continuity of function execution. Finally, path frequency is validated against the minimum support count mincount, and each path is weighted according to the security features of the nonrepetitive nodes it contains. In this way, the security features of the frequent paths are measured. Algorithm 2 describes the mining and weighting process.

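Input: Function execution path $S$, minimal support threshold minsup
Output: Path list with security features Slist
mincount = minsup $\times |S|$;
for each node $f$ in $S$
  Pos($f$).add($f$.pos);
for each Pos($f$)
  sup = |Pos($f$)|;
  if (sup < mincount)
    Delete Pos($f$);
  else
    $L_1$ = $L_1$.add($f$, sup);
for ($k = 2$; $L_{k-1} \neq \emptyset$; $k$++)
  $L_k$ = gen_mine($L_{k-1}$);
for each frequent path $p$
  for each different function symbol $f$ in $p$
    $V(p)$ += $V(f)$;
    $P(p)$ += $P(f)$;
Sort each $p$ by $V(p)$ and $P(p)$, and form Slist;
Procedure gen_mine($L_{k-1}$)
for each $p \in L_{k-1}$
  for each position pos in Pos($p$)
    $f$ = $S$[pos + 1];
    for each position pos in Pos($p$)
      if (pos+1 exists in Pos($f$))
        Pos($pf$).add(pos + 1);
    sup = |Pos($pf$)|;
    if (sup < mincount)
      delete Pos($pf$);
    else
      $L_k$.add($pf$, sup);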
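The two phases of Algorithm 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: each path stores the 1-based positions at which its occurrences end, growth reads the symbol at the next position, and the weight of a path is assumed to be the sum of the security features of its distinct function symbols (the text says only that nonrepetitive nodes are used). The function names, the sample sequence, and the node feature values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Path = Tuple[str, ...]


def mine_frequent_paths(execution: Sequence[str], minsup: float) -> Dict[Path, int]:
    """Position-indexed pattern growth over contiguous paths."""
    mincount = minsup * len(execution)

    # Frequent 1-paths: index each symbol by the positions where it occurs.
    positions: Dict[Path, List[int]] = {}
    for pos, symbol in enumerate(execution, start=1):
        positions.setdefault((symbol,), []).append(pos)
    level = {p: ps for p, ps in positions.items() if len(ps) >= mincount}

    frequent: Dict[Path, int] = {}
    while level:
        frequent.update({p: len(ps) for p, ps in level.items()})
        next_level: Dict[Path, List[int]] = {}
        for path, ends in level.items():
            # Grow each occurrence by the symbol right after its end position.
            for end in ends:
                if end < len(execution):
                    grown = path + (execution[end],)
                    next_level.setdefault(grown, []).append(end + 1)
        # Keep only the grown paths whose support reaches mincount.
        level = {p: ps for p, ps in next_level.items() if len(ps) >= mincount}
    return frequent


def weight_paths(
    frequent: Dict[Path, int], features: Dict[str, Tuple[float, float]]
) -> List[Tuple[Path, int, float, float]]:
    """Attach (vulnerability, propagation) sums over distinct symbols and rank."""
    ranked = []
    for path, sup in frequent.items():
        distinct = set(path)  # repeated symbols count once
        v = sum(features[f][0] for f in distinct)
        p = sum(features[f][1] for f in distinct)
        ranked.append((path, sup, v, p))
    ranked.sort(key=lambda r: (r[2], r[3]), reverse=True)
    return ranked


# Hypothetical inputs: a nine-symbol sequence and made-up node features.
execution = list("ABCABDABC")
frequent = mine_frequent_paths(execution, minsup=0.3)
node_features = {"A": (1.0, 0.5), "B": (0.5, 1.0)}
ranked = weight_paths(frequent, node_features)
for row in ranked:
    print(row)
```

Because only frequent paths are grown, no candidate set is generated and infrequent prefixes are pruned as soon as they appear.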

5. An Illustrative Example

The complex network in Figure 2 is a variant of the tree-like structure in Figure 3, which is redrawn for easier understanding.

Without loss of generality, the step attenuation factor $\alpha$ is set to 0.5. The security features of each node are calculated as follows. Because the “main” function is special (its vulnerability is always large and its propagation is 0), it is excluded from the measurement.

Vulnerability......

Propagation; .....

According to the sequence model of the example, , if the minsup is set to 0.15, .; ; ; .; .

Frequent 1-Path, ., ., .

Frequent 2-paths are mined from the position sets of the frequent 1-paths, using the adjacent position value as an index to find the extended paths. For example, each position in the position set of node $E$ is increased by one to give its extended position set. The function nodes in positions 5 and 9 both correspond to node $F$. So, $Pos(EF) = \{5, 9\}$ is obtained, $sup(EF) = 2$, and path EF is a frequent 2-path.

Frequent 2-Path, ; ..

The security features of a frequent 1-path are those of the function node it contains, calculated as before; the security features of a frequent 2-path are obtained from the security features of its nonrepetitive nodes. Table 1 shows the security features of all the frequent paths.

6. Experimental Results

Experiments are performed on a PC with an Intel® Core™ 3.6 GHz CPU and 16 GB of main memory, running Windows 8. We evaluate the runtime and scalability of SFFPS and calculate the fault feature ranks of nodes and important paths. To test the algorithms in the same coding environment, all the programs are written in Java using MyEclipse. The datasets used in the experiment are the open-source programs Cflow and Tar, obtained from an open-source software library (https://sourceforge.net).

6.1. Runtime and Scalability Tests of SFFPS

To test the runtime and scalability of SFFPS, the two newest versions of each of Cflow and Tar are selected. The support threshold ranges from 0.005 to 0.01 for the runtime test, and the upper threshold 0.01 is used for the scalability test. The total runtime is composed of three parts: node fault feature calculation, frequent pattern mining, and weight appending. Figure 4 shows the runtime of SFFPS with different support thresholds, and Figure 5 shows the scalability with different length percentages of the sequence when the support threshold is set to 0.01.

From Figure 4, SFFPS performs well in the support threshold range [0.005, 0.010]. One reason is that the complex network model is stored as an adjacency table, which makes the out-degree and in-degree of each node easy to obtain and so speeds up the calculation of node security features. Furthermore, because the sequence model is based on the start order of each function, the detailed invocation and end time of a node are ignored, and the sequence model is shorter. In addition, the position value index is used for mining and pattern growth of the paths, which avoids candidate generation and lets each path be extended by direct lookup. Finally, the weight appending process is efficient because the nonrepetition strategy involves fewer nodes.

From Figure 5, SFFPS shows good scalability on the software Cflow. As the length of the sequence increases, the execution time of SFFPS grows essentially linearly. From the experimental data, the number of frequent sequences also increases. This indicates that the functions of Cflow are uniformly distributed. However, the time overhead for the software Tar is quite high at around 40% of the sequence length; the number of frequent sequences increases rapidly from 194, when the percentage is 20%, to 1123. After that, the time overhead and the number of frequent sequences decrease. This indicates that Tar has more core functions and that these core functions are invoked more often in the early stage of the program.

6.2. The Security Features of the Function Nodes

Tables 2 and 3 show the security feature rank and value of the function nodes in the newest versions of Cflow and Tar.

From Tables 2 and 3, the security features of the same function nodes are relatively stable across different versions of the same software. So, in the process of version evolution, it can be inferred that the same function should have an approximately equal rank in a new software version. Also, the function rank in the old version can be used as a basis for a version upgrade in which function nodes are removed, merged, or updated. The nodes with larger rank changes should be given more attention.
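The version comparison described above can be sketched as a simple rank difference. The function names and scores below are hypothetical placeholders, not values from Tables 2 and 3, and ranking by a single score per function is an assumption made to keep the illustration short.

```python
from typing import Dict, List, Tuple


def rank_by_score(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 is the highest score; ties are broken by name."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def rank_changes(
    old: Dict[str, float], new: Dict[str, float]
) -> List[Tuple[str, int, int]]:
    """List (function, old rank, new rank) for shared functions, largest shift first."""
    old_rank, new_rank = rank_by_score(old), rank_by_score(new)
    shared = old_rank.keys() & new_rank.keys()
    changes = [(name, old_rank[name], new_rank[name]) for name in shared]
    changes.sort(key=lambda c: (-abs(c[1] - c[2]), c[0]))
    return changes


# Hypothetical security feature scores for two versions of one program.
old_scores = {"parse": 9.0, "emit": 7.5, "scan": 3.0, "legacy": 1.0}
new_scores = {"parse": 9.5, "scan": 8.0, "emit": 4.0, "fresh": 2.0}
changes = rank_changes(old_scores, new_scores)
for name, before, after in changes:
    print(name, before, after)
```

Functions that exist in only one version (`legacy` and `fresh` here) are left out of the comparison, since they correspond to removed or newly added nodes.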

Tables 4 and 5 show the frequent paths of Cflow-1.4 in the top 10 security feature ranks of vulnerability and propagation.

The paths listed in Tables 4 and 5 are significant in two respects. First, the paths are frequent, which confirms that their occurrence possibility is relatively large. Second, the security feature values of the paths are large, which reflects the security risk of the path. Only when both aspects are considered together can a persuasive security measurement be made.

In addition, the frequency of a path can be used to predict the function nodes that are going to be affected, and the security features of the path can be used to evaluate the possible impact scale of an abnormal path. For example, the main consideration in random fault detection is vulnerability. According to the first path of Table 4, from the perspective of frequency, if the first three functions of a fault path are “is_printable,” “include_symbol,” and “direct_tree,” then the next functions likely to be affected are “include_symbol,” “print_symbol,” “gnu_output_handler,” and so on. From the perspective of security features, the path has a higher rank and value in vulnerability, which indicates that the fault location is relatively accurate. For hostile attack detection, the attacker expects a wider range of effect, so propagation should be given more weight. In this case, the analysis method is similar.

7. Conclusion

In this paper, a novel algorithm, SFFPS, is proposed to define and measure the security features of dynamic execution paths in software. A complex network model and a sequence model are constructed to record the invocation relationships and the function execution order. The node degree in the complex network is used for security feature analysis from a structural perspective, before real faults occur. The paths extracted from the sequence model are tested for frequency and weighted by the node security features. Finally, frequent dynamic execution paths with top security feature ranks are mined as important paths that deserve greater concern. The experiments show that SFFPS can effectively mine the important paths from the newest versions of the software programs Cflow and Tar. SFFPS can be applied as a basis for software evolution, a tool for software internal structure analysis, and a guide to fault location and attack detection, all of which are helpful for software quality assurance.

Conflicts of Interest

There are no conflicts of interest related to this paper.

Acknowledgments

This work is supported by the National Key R&D Program of China (2016YFB0800700), the National Natural Science Foundation of China under Grant nos. 61472341, 61772449, and 61572420, the Natural Science Foundation of Hebei Province, China, under Grant nos. F2016203330 and F2015203326, the Advanced Program of Postdoctoral Scientific Research under Grant no. B2017003005, and the Doctoral Foundation of Yanshan University under Grant no. B1036.