Abstract

With the rapid development of network technology, computer viruses have developed at a fast pace. The threat of computer viruses persists because of the constant demand for computers and networks. When a computer virus infects a facility, the virus seeks to invade other facilities in the network by exploiting the convenience of the network protocol and the high connectivity of the network. Hence, there is an increasing need to accurately calculate the probability of each possible computer-virus-infected area so that corresponding strategies can be developed, for example, interrupting the relevant connections between the uninfected and infected computers in time. The spread of a computer virus forms a scale-free network whose node degrees follow a power law. A novel algorithm based on the binary-addition tree algorithm (BAT) is proposed to effectively predict the spread of computer viruses. The proposed BAT utilizes the probability derived from PageRank on the scale-free network, together with state vectors that account for both temporal and learning effects. The performance of the proposed algorithm was verified via numerous experiments.

1. Introduction

Almost all individuals and industries constantly rely on computer and network technology [1–16]. Thus, the types of computer viruses are becoming more diversified, and the intensity of the attacks is increasing. When a computer virus infects a facility, the virus seeks to invade other facilities in the network by exploiting the convenience of the network protocol and the high connectivity of the network. Hence, computer viruses cause serious damage to computer facilities. The term “computer virus” was introduced by von Neumann in 1966 [17], and it was extended from his lectures regarding self-reproducing computer programs given in 1949 [18]. Subsequently, scholars performed small-scale research on computer viruses, publishing conference papers from 1988 to 2000, and numerous computer virus-related research papers have been published in different journals. For example, Risak studied the functional virus in computer programs in 1972 [19], and Kraus researched self-reproducing computer programs and investigated the behavior of computer-language-like biological viruses in 1980 [20].

The terminology for viruses has evolved to include the worm; malware (formerly called the trojan horse), a term that originated around 1990; and the advanced persistent threat (APT), a newer development in computer viruses over the past decade.

Research on worms includes the work of Shoutkov and Spesivtsev, who explored the self-replication of worms [21], and Griffin and Brooks, who studied the spread of worms in computer networks [22]. Research on malware includes the work of Witte, who explored the detection of malware [23], and Peng et al., who surveyed the propagation of malware [24]. Research on APTs includes the work of Li and Yang, who improved the cloud memory system under APT [25], and Tian et al., who defended against APT in the power grid [26].

In the past five years, research on computer viruses has focused on defense against viruses [27–29]. The research on computer viruses can be classified into the following major research directions: the description of the computer virus, the detection of computer viruses, and protection against computer viruses.

1.1. The Description of the Computer Virus

Eichin and Rochlis described the computer virus via a detailed analysis in 1989 [30], and Spafford investigated how computer viruses are formed in 1994 [31].

1.2. The Detection of Computer Viruses

Davis investigated the detection of computer viruses to enhance the control of risk management in 1988 [32]. Okamoto and Masumoto adopted authentication for detecting computer viruses in 1990 [33]. Spinellis proved that the detection of computer viruses is NP-complete [34]. A detection mechanism for processing sign streams was adopted to evaluate viruses by Wang et al. in 2015 [35]. A model for the nonlinear vaccination probability was used to detect computer viruses by Gan et al. in 2004 [36].

1.3. Protection against Computer Viruses

Al-Dossary proposed a classification formula for defense against computer viruses in 1989 [37]. Yuan et al. established a virus model to optimize the performance of the infection mechanism in 2009 [38]. Youssef and Scoglio focused on reducing the dissemination of viruses by optimizing the weight function of the network structure in 2014 [39].

Understanding the development of computer viruses is important for understanding the historical defense strategies against viruses [27]. In addition, for the detection of the aforementioned computer viruses, the method of identifying the code mode of the virus can be used. If the virus has a self-replicating function whereby the code of the virus is copied to other files, the appropriate protection strategy for the self-replicating virus can be selected and executed immediately [34]. Therefore, the aforementioned three types of computer-virus research all involve protection against computer viruses.

The lifecycle of a virus comprises the dormant, propagation, triggering, and execution phases. In the dormant phase, the computer virus code is created. In the propagation phase, the virus files are placed where they can propagate easily, and any computer they infect can suffer great harm. In the triggering and execution phases, once all the required conditions are met, the computer virus begins to execute destructive actions.

Therefore, to prevent computer virus infection, protection in advance is important in the propagation phase. Computer viruses persist owing to the constant demand for computer networks. Hence, it is important to predict the probabilities of the virus spreading to different areas so as to interrupt the relevant connections between the uninfected and possibly infected computers in time to save the files in the computers. This was the major focus of this study.

With the strength of the advanced technology, the susceptible-infectious-recovered (SIR) model is adapted to predict the number of computer-virus spread areas in this study. In the SIR model, all susceptible nodes can be infected at most once. After the infected node is recovered by removing the computer virus, it is protected by using a virus detection and killer software such that it is impossible to be infected by the same computer virus.

Assume that a virus can propagate freely in a heterogeneous environment, although hosts running different operating systems (Linux, Windows, Mac, etc.) have different code bases and the same virus may not work on every operating system.

The spread of computer viruses is a scale-free model formed by a scale-free network in which all node degrees follow a power-law distribution. The PageRank algorithm is the most popular among the different scale-free model-related algorithms for calculating the influence of nodes. Hence, the PageRank algorithm is used to provide the theoretical spread probability of the computer virus.

With a defense mechanism (antivirus protection) in the scale-free model to prevent or slow down virus propagation, the computer virus becomes easier to detect and kill as time passes. This defense mechanism is called the learning effect here. Hence, under the learning effect, the temporal learning-effect spread probability $p_{i,j,t}$ of the computer virus spreading from node i to node j at timeslot t is higher than $p_{i,j,t'}$ if t < t′ during the computer-virus propagation.

During the process of computer-virus propagation, consecutive timeslots in the time period when the infected node i can still spread out the computer virus are called valid timeslots. An infected node can affect any neighboring node at any valid timeslot; that is, $p_{i,k,t} > 0$ for any node k in V (i) and any valid timeslot t.

A novel computer-virus spread dynamic model based on the binary-addition tree (BAT) search algorithm with the learning effect is proposed for modeling the spread of computer viruses. The BAT proposed by Yeh [40] is a heuristic search method similar to the depth-first search (DFS), breadth-first search (BFS), and universal generating function methodology (UGFM). The BAT is more efficient than the DFS and more economical with regard to computer memory than the BFS and UGFM, both of which can crash the computer system because of computer memory overflow problems. Moreover, the BAT is easy to learn, convenient to code, and flexible (i.e., it can be made-to-fit).

The objective of this study was to theoretically predict the probabilities of a computer virus infection and the spread of the virus to different areas. The remainder of the paper is organized as follows. Section 2 provides acronyms and notations. Section 3 presents an overview of the infection model, the scale-free model, the PageRank algorithm, the BAT, and the learning effect, which form the basis of the proposed dynamic BAT. Section 4 introduces the novel temporal learning-effect spread probability and period, which are required data for using the proposed dynamic BAT. Section 5 describes the proposed state vectors, formed by the spread vector and the temporal vector, which need to be found in the proposed dynamic BAT before predicting the spread areas of computer viruses. Section 6 formally presents the proposed dynamic BAT, together with its computational complexity, a demonstration, and experimental results. Section 7 concludes the paper.

2. Acronyms and Notations

All required acronyms and notations are provided in Tables 1 and 2, respectively.

3. Infection Model, Scale-Free, Page Rank, BAT, and Learning Effect

The proposed dynamic BAT builds on the BAT and a temporal learning effect to predict, in a scale-free network, the areas infected by the computer virus and the areas to which the virus will spread. The infection model is adopted to describe the spread of computer viruses. The scale-free model and the PageRank algorithm are used to simulate the computer-virus spread probability, which the proposed dynamic BAT then uses to predict the probability of each infected area. The learning effect is integrated into the proposed model to simulate the spread of the computer virus in a more practical manner.

Hence, before the proposed BAT is discussed, an overview of the infection model, the scale-free model, the PageRank algorithm, the traditional BAT, and the learning effect is given in this section.

3.1. Infection Model

In recent years, the theory of the spread of epidemics in complex networks has yielded considerable success. Individuals in the system have several basic states: the susceptible state S (healthy but may be infected); the infected state I; and the removal state R (cured after infection with immunity gained, or dead after infection).

There are four mature epidemic infection models: the Susceptible–Infected (SI) model, the Susceptible–Infected–Susceptible (SIS) model, the Susceptible–Infectious–Recovered (SIR) model, and the Susceptible–Infectious–Recovered–Susceptible (SIRS) model.

In the SIR model, a susceptible node (S) has become infectious (I) and can recover (R) to obtain lifelong immunity after curing. The SIR model is the most popular mathematical model. The spread of computer viruses is similar to that of epidemics, and both propagations can be captured by a scale-free model. Hence, the SIR model was adopted to model the spread of the computer virus.

In the SIR model, a node acquires lifelong immunity after an illness, as shown in Figure 1 [41]. Let S (t), I (t), and R (t) be the proportions of susceptible, infectious, and recovered nodes, respectively, with S (t) + I (t) + R (t) = 1 in G (V, E) [41, 42]. The differential equations describing the propagation mechanism of the SIR model are as follows [41, 42]:

$$\frac{dS(t)}{dt} = -\beta S(t) I(t), \tag{1}$$

$$\frac{dI(t)}{dt} = \beta S(t) I(t) - \gamma I(t), \tag{2}$$

$$\frac{dR(t)}{dt} = \gamma I(t), \tag{3}$$

where β and γ represent the transmission and recovery rates, respectively.
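The SIR dynamics can be explored numerically. The following is a minimal sketch that integrates the standard SIR equations (dS/dt = −βSI, dI/dt = βSI − γI, dR/dt = γI) with the forward Euler method; the function name `simulate_sir`, the step size, and the values β = 0.5 and γ = 0.1 are illustrative assumptions, not parameters taken from this study.

```python
def simulate_sir(beta, gamma, s0, i0, r0=0.0, dt=0.01, steps=10000):
    """Integrate the SIR equations with forward Euler (illustrative sketch).

    s, i, r are proportions of susceptible, infectious, and recovered
    nodes; beta and gamma are the transmission and recovery rates.
    """
    s, i, r = s0, i0, r0
    history = [(0.0, s, i, r)]
    for n in range(1, steps + 1):
        new_infections = beta * s * i * dt   # flow from S to I
        new_recoveries = gamma * i * dt      # flow from I to R
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((n * dt, s, i, r))
    return history


# Hypothetical parameters chosen only to show the typical curve shape.
trajectory = simulate_sir(beta=0.5, gamma=0.1, s0=0.99, i0=0.01)
t_end, s_end, i_end, r_end = trajectory[-1]
print(f"t={t_end:.1f}  S={s_end:.4f}  I={i_end:.4f}  R={r_end:.4f}")
```

Because every update moves mass between compartments without creating or destroying it, S + I + R stays equal to 1 up to rounding, which is a quick sanity check on any implementation.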

3.2. Scale-Free and Page Rank Algorithm

The scale-free network is a special network, and its growth is independent of the number of nodes with the same underlying structure. The major difference between the scale-free network and other networks is that it has power-law (or scale-free) degree distributions. For example, the network in Figure 2 is generated from the Barabási–Albert model, which is the first scale-free model.

PageRank, which is used by Google search engines, was the first and the most popular among these famous algorithms for ranking nodes (web page, website, user, etc.) according to importance in the scale-free network. The PageRank value of a node (web page, website, user, etc.) i ∈ V is the probability that users clicking on nodes randomly will arrive at i. The array PR can be calculated as follows:

$$PR = \left[ dM + \frac{1-d}{N_{node}} I \right] \cdot PR, \tag{4}$$

where PR (i) represents the ith element in PR for all i ∈ V, Nnode represents the number of nodes, d represents a damping factor between 0 and 1, M represents the normalized adjacency square matrix such that $\sum_{a} M_{a,b} = 1$ for each column b, Ma,b represents the element in the ath row and the bth column in M, and I represents the identity matrix. The sizes of both matrices I and M are Nnode × Nnode.

The pseudo code of the PageRank algorithm is described as follows (Algorithm 1).

(i) INPUT: A scale-free network G (V, E).
(ii) OUTPUT: The PageRank value PR (i) for all i ∈ V.
(iii) STEP PR0. Let t = 0 and PR (i) = 1/Nnode for all i ∈ V.
(iv) STEP PR1. Let PR = [dM + ((1 − d)/Nnode) I]⋅PR.
(v) STEP PR2. Let PR (j) = PR (j) + $\sum_{i \in I} PR(i)$/|V − I| for all j ∈ (V − I), where I = {i ∈ V | Degout (i) = 0}.
(vi) STEP PR3. Halt if there is no change of PR (i) for all i ∈ V. Otherwise, let t = t + 1 and go to STEP PR1.

STEP PR0 initializes PR (i) for all i ∈ V. STEP PR1 updates PR (i) for all i ∈ V according to Equation (4), by adding part of PR (j) for all nodes j ∈ V with ej,i ∈ E. STEP PR2 adjusts the value of PR (i) by assuming that the search of users will continue even if it reaches a dead end for all i ∈ V. When PR (i) is updated and redistributed in STEPs PR1 and PR2 iteratively, the value of PR (i) converges, and the process halts for all i ∈ V.
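The iteration described above can be sketched in a few lines. This is a minimal illustration, not the exact procedure of Algorithm 1: it assumes the usual power-iteration form of PageRank, in which every node receives a uniform teleport share (1 − d)/Nnode and the rank held by dead-end nodes is redistributed uniformly over all nodes. The function name `pagerank`, the default d = 0.85, the tolerance, and the four-node graph are hypothetical.

```python
def pagerank(out_links, d=0.85, tol=1e-12, max_iter=1000):
    """Power-iteration PageRank on a dict {node: [out-neighbors]} (sketch)."""
    nodes = sorted(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}            # uniform initialization
    for _ in range(max_iter):
        # Rank currently held by dead-end nodes (no outgoing links).
        dangling = sum(pr[v] for v in nodes if not out_links[v])
        # Teleport share plus a uniform share of the dead-end rank.
        new_pr = {v: (1.0 - d) / n + d * dangling / n for v in nodes}
        # Each node passes its rank evenly to its out-neighbors.
        for v in nodes:
            for w in out_links[v]:
                new_pr[w] += d * pr[v] / len(out_links[v])
        change = sum(abs(new_pr[v] - pr[v]) for v in nodes)
        pr = new_pr
        if change < tol:                        # converged
            break
    return pr


# Hypothetical undirected graph: hub 0 linked to 1, 2, 3, plus the edge 1-2.
example_links = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
ranks = pagerank(example_links)
for node in sorted(ranks):
    print(node, round(ranks[node], 4))
```

In this small example the hub (node 0) obtains the largest value and the leaf (node 3) the smallest, which illustrates the link between degree and PageRank value.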

For example, in Figure 2, we need to have the adjacency matrix first and normalize the matrix by dividing the element values in the adjacency matrix by their degrees, as shown in Tables 3 and 4, respectively.

Then, by simply following the pseudocode above, we obtain all the PR (i) values for all i ∈ V, as follows:

As shown in Table 5, a higher degree of the node corresponds to a higher probability to have a larger PR value. Hence, more important nodes have more links from other nodes in the PageRank algorithm, which is the “rich-gets-richer” phenomenon [41].

PR (i) is used to simulate the spread probability of the computer virus to node i in the proposed dynamic BAT to solve the computer-virus spread problem.

3.3. BAT

The BAT proposed by Yeh is a simple implicit enumeration method. Experiments revealed that the BAT is more efficient than the DFS and more economical with regard to computer memory than the BFS and UGFM. The DFS, BFS, and UGFMs are all well-known implicit enumeration methods.

By adding one to the zero vector repeatedly via binary addition, the BAT can generate all binary-state vectors whose coordinates are either 0 or 1. Let X be an n-tuple binary-state vector and Xi be the value of its ith coordinate. The pseudocode of the BAT is presented below (Algorithm 2) [40]:

(i) Input: The number of coordinates n.
(ii) Output: All n-tuple binary-state vectors X.
(iii) STEP B0. Let X be a zero vector, SUM = 0, and i = 1.
(iv) STEP B1. If Xi = 0, let Xi = 1, SUM = SUM + 1, and go to STEP B4.
(v) STEP B2. Let Xi = 0 and SUM = SUM − 1.
(vi) STEP B3. Let i = i + 1 and go back to STEP B1 if i ≤ n.
(vii) STEP B4. If SUM = n, halt; otherwise, let i = 1 and go back to STEP B1.

In STEP B0, the BAT begins to generate all vectors from the zero vector X. From STEPs B1 to B3, one is added to the state vector X repeatedly to generate a new vector. To reduce the runtime, the current coordinate is changed either from 0 to 1 or from 1 to 0. If it is changed from 1 to 0, we must go to the coordinate adjacent to the current coordinate to repeat the same procedure until a coordinate is changed from 0 to 1. After each new X is generated in STEP B4, its probability, cost, time, or any predefined function can be calculated. STEP B4 also tests whether the stopping criterion SUM = n is satisfied, that is, whether X becomes a vector of which all the coordinates are 1.
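The binary-addition idea can be sketched as a short generator. This is an illustrative sketch rather than the reference implementation of [40]: coordinates are flipped from 1 to 0 until a 0 is found, which is then flipped to 1. Treating the first coordinate as the one updated first follows the choice i = 1 in STEP B0; the function name `binary_addition_tree` is an assumption.

```python
def binary_addition_tree(n):
    """Yield all n-tuple binary vectors by repeated binary addition (sketch)."""
    x = [0] * n
    total = 0                 # number of coordinates equal to 1 (SUM)
    yield tuple(x)            # the zero vector is the starting point
    while total < n:          # stop once every coordinate is 1
        i = 0
        while x[i] == 1:      # carry: change 1 to 0 and move on
            x[i] = 0
            total -= 1
            i += 1
        x[i] = 1              # first 0 found: change it to 1
        total += 1
        yield tuple(x)


vectors = list(binary_addition_tree(5))
print(len(vectors))           # 2**5 vectors, each generated exactly once
print(vectors[:4])
```

Only one vector is held in memory at a time, which is the property behind the memory advantage over the BFS and UGFM mentioned above.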

For example, let n = 5 and X = (0, 0, 0, 0, 0). To easily understand how BAT is based on the binary addition, each vector is rewritten to a binary number such that the ith digit of such a binary number is equal to the ith coordinate of X, e.g., the binary number of X = (0, 0, 0, 0, 0) is 00000. Note that there are 32 different vectors in total because 2^5 = 32.

Following the BAT code listed above, we have the first five new state vectors generated from zero in the sequence:

In the same way, we have all state vectors from (0, 0, 0, 0, 0) to (1, 1, 1, 1, 1) obtained from the BAT without duplications as listed in Table 6.

From the above, the BAT is very simple to learn, easy to code, and flexible (can be made-to-fit). Hence, BAT is modified to solve the proposed problem.

3.4. Learning Effect

In economics, productivity increases and results in higher wages after suitable education, and this process is the learning effect. In realistic industrial processes, the operation time is reduced because the workers’ skill or the flow process improves steadily, and this phenomenon is also called the learning effect. In many real-world applications [7, 28, 29, 40, 43–49], the learning effect is pragmatic. Hence, the learning effect is introduced in this work to study the proposed computer-virus spread area prediction problem and offers a defense mechanism (antivirus protection) in the scale-free model to prevent or slow down virus propagation.

Let pi,j,t be the probability that the computer virus spreads from node i to node j at time t. If there is no learning effect, pi,j,t is a constant for all values of t. However, pi,j,t is reduced over time because users acknowledge the spread of the computer virus and learn how to prevent infection or propagation after the infection.

The values of pi,j,t are reduced gradually because of the learning effect, according to the following formula:

$$p_{i,j,t} = p_{i,j,0} \, (t+1)^{-\alpha}, \tag{6}$$

where α represents the learning rate and is set to 0.35, as in [39–41]. Owing to the learning effect, $\lim_{t \to \infty} p_{i,j,t} = 0$ and $p_{i,j,t} < p_{i,j,t'} \le 1$ for all t′ < t < ∞ and 0 < pi,j,t, in accordance with Equation (6). Note that 0 ≤ pi,j,t = $p_{i,j,0}$ ≤ 1 for all i, j, and t if there is no learning effect.
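The decay caused by the learning effect can be checked numerically. The sketch below assumes the power-law learning curve p_{i,j,t} = p_{i,j,0} (t + 1)^(−α) with α = 0.35; this form is an assumption, although it reproduces the values of p0,5,t quoted in Section 4.4 (0.1263180 at t = 1 and 0.0719161 at t = 9 from an initial value of 0.161). The function name is hypothetical.

```python
def learning_effect_probability(p0, t, alpha=0.35):
    """Spread probability at timeslot t under a power-law learning curve.

    Assumed form: p_t = p0 * (t + 1) ** (-alpha), so p_0 = p0 and the
    probability decreases toward 0 as t grows.
    """
    if t < 0:
        raise ValueError("timeslot t must be nonnegative")
    return p0 * (t + 1) ** (-alpha)


# Initial probability 0.161 is the value of p0,5,0 quoted in Section 4.4.
for t in range(10):
    print(t, round(learning_effect_probability(0.161, t), 7))
```

Setting `alpha=0` recovers the case without a learning effect, in which the probability stays at its initial value for every timeslot.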

4. Temporal Learning-Effect Spread Probability and Period

Before the proposed method can calculate the probability that a specified number of computers are infected by a computer virus during a specific time period, both the temporal learning-effect spread probability and the length of the infected period must be known. These two factors are discussed in this section.

4.1. Proposed Learning-Effect-SIR Model

The SIR model is adopted. As mentioned in Section 3.1, each node can be categorized into susceptible (S), infected (I), or removed (R) in the SIR model. Each node undergoes the transition of SIR, that is, from a susceptible node (S) to an infected node (I) and then to a removed node (R) after a certain infected period.

The spread probability pi,j is the probability that the computer virus is spread out from an infected node i ∈ V to a susceptible node j ∈ V (i). The temporal learning-effect spread probability Pr (node j infected from node i at time t only) = pi,j,t is a special spread probability at timeslot t only, and it varies with the timeslot because of the learning effect, whereby users learn how to resist the computer virus and gradually reduce the loss.

The computer virus can spread to any susceptible node in V (i) at timeslot t with probability pi,j,t if node i ∈ V is infected. The computer virus cannot spread from node i ∈ V to any neighboring node j ∈ V (i), that is, pi,j,t = 0, if i is a removed node. Moreover because of the learning effect, the computer virus can occasionally be detected and killed easily, such that the temporal learning-effect spread probability pi,j,t is decreased if t is increased.

4.2. Initial Spread Probability

pi,j,0 = pi,j represents the initial spread probability from node i infected at time 0, where no learning effect is considered, for every infected node i ∈ V and susceptible node j ∈ V (i). Hence, the initial spread probability is simple to calculate, and it is derived here before we determine the temporal learning-effect spread probability.

The value of pi,j for all nodes i ∈ V to j ∈ V (i) is defined according to the PageRank algorithm. Details are presented in Section 3.2, as follows:

Equation (7) is based on the fundamental concept of the scale-free network: a larger PageRank number, that is, a higher node distribution probability, corresponds to a higher spread out probability, that is, pi,j is proportional to PR (j). For example, suppose that after node 0 is infected, we have p0,5 and p0,7 based on V (0) = {5, 7} and equation (7):

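According to equations (8) and (9), we have four possible situations: infected node 0 can spread the computer virus to susceptible node 5 only, node 7 only, both nodes 5 and 7, or nowhere, with the following probabilities:

$$p_{0,5}(1 - p_{0,7}), \quad (1 - p_{0,5})\,p_{0,7}, \quad p_{0,5}\,p_{0,7}, \quad (1 - p_{0,5})(1 - p_{0,7}), \tag{10}$$

respectively.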
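The four situations generalize to any number of neighbors. The sketch below enumerates every subset of neighbors that an infected node may reach in one timeslot, assuming that the spreads to different neighbors are independent events. The function name `spread_outcome_probabilities` and the probabilities 0.2 and 0.3 are hypothetical and are not the values of p0,5 and p0,7 from this study.

```python
from itertools import combinations


def spread_outcome_probabilities(neighbor_probs):
    """Map each subset of newly infected neighbors to its probability.

    neighbor_probs is {neighbor: spread probability}; spreads to
    different neighbors are assumed to be independent.
    """
    neighbors = sorted(neighbor_probs)
    outcomes = {}
    for size in range(len(neighbors) + 1):
        for infected in combinations(neighbors, size):
            prob = 1.0
            for j in neighbors:
                p = neighbor_probs[j]
                # Multiply by p if j is infected, by (1 - p) otherwise.
                prob *= p if j in infected else 1.0 - p
            outcomes[infected] = prob
    return outcomes


# Hypothetical spread probabilities from node 0 to nodes 5 and 7.
outcomes = spread_outcome_probabilities({5: 0.2, 7: 0.3})
for infected, prob in outcomes.items():
    print(infected, round(prob, 4))
```

The probabilities of all subsets sum to 1, since exactly one of the situations must occur.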

Without considering the learning effect, the initial spread probability of each infected node is provided below according to the adjacency matrix and PageRank values listed in Tables 3 and 5, respectively.

4.3. Lags

The infected timeslot of node i ∈ V is denoted as ti if node i is infected at time ti. Moreover, the infected timeslot of any node spread from node i can only be after or equal to ti. If node i is infected at time ti but only starts to spread to node j at time tj, there is a lag of Δti,j = tj − ti. The value of Δti,j can be any nonnegative integer, and it is called a no-lag infection if Δti,j = 0.

For example, in Figure 2, let the computer virus start spreading after infecting node 0 at time 0, that is, t0 = 0. Assuming that susceptible node 5 is infected from node 0 and susceptible node 7 is infected from node 5, we have t0 = 0 ≤ t5 and t5 ≤ t7, for example, t5 = t7 = Δt0,5 = Δt5,7 = 0; t5 = Δt0,5 = 1, t7 = 2, and Δt5,7 = 1; t5 = Δt0,5 = 2, t7 = 3, and Δt5,7 = 1. Moreover, as shown in Table 6, the whole network is infected at timeslot 2 if node 0 spreads the computer virus out to all its neighboring nodes, that is, V (0) = {5, 7} at timeslot 1, and the computer virus spreads from node 5 to all the nodes connected to node 5, that is, V (5) = {1, 2, 3, 4, 6}, at timeslot 2.

Suppose that all infections have 1-lag as shown in Table 7 and the computer virus is initialized at node 0. In the worst case, the computer virus is spread out in the order of the node labels, that is, 1, 2, …, |V| − 1, at time 1, 2, …, |V| − 1, respectively. Hence, we have the following property:

Property 1. The upper bound of the spread period is |V| − 1 if all infections are 1-lag infections.

4.4. Spread Probability with Learning Effect

As mentioned in Section 3.4, pi,j,t is reduced according to the learning effect modeled in the following equation:

$$p_{i,j,t} = p_{i,j} \, (t+1)^{-\alpha}. \tag{11}$$

For example, the values of pi,j,1, pi,j,2, and pi,j,3 are presented in Tables 8–10 on the basis of the initial spread probability given in Table 11 and Equation (7).

If node i, infected at timeslot t, did not spread the computer virus at timeslots t, t + 1, …, and t′ − 1 and spread it to node j at timeslot t′, the related probability is denoted as $P_{i,j,t'}$ and calculated using Equation (12):

$$P_{i,j,t'} = p_{i,j,t'} \prod_{\tau=t}^{t'-1} \prod_{k \in V(i)} \left(1 - p_{i,k,\tau}\right). \tag{12}$$

Here, $\prod_{k \in V(i)} (1 - p_{i,k,\tau})$ and $\prod_{\tau=t}^{t'-1} \prod_{k \in V(i)} (1 - p_{i,k,\tau})$ represent the probabilities that node i did not spread to any node during time τ and any time before t′, respectively.

From Equation (12), a larger (t′ − t) corresponds to a lower probability of the computer virus spreading from node i regardless of whether there is a learning effect.

For example, node 0 is infected at the beginning in Figure 2. The temporal learning-effect spread probabilities of nodes 5, 7, and ∅ are presented in the 2nd, 3rd, and 4th columns below. In addition, the probabilities that node 0 spreads the virus to nodes j = 5 and 7 are presented in the last two columns. The temporal learning-effect probability p0,5,t is reduced from 0.1610000 at t = 0 to 0.1263180, 0.1096058, …, 0.0719161 at timeslots t = 1, 2, …, 9, respectively, as indicated by Table 12. The probabilities P0,5,t that node 0 starts to spread to node 5 are 0.088903, 0.046458, 0.024074, …, 0.000299 at timeslots t = 1, 2, …, 9, respectively.

According to Equations (6) and (12), $\lim_{t \to \infty} p_{i,j,t} = 0$ with the learning effect and $\lim_{t \to \infty} P_{i,j,t} = 0$ regardless of whether there is a learning effect, respectively. Hence, the spread stops at a specific timeslot. The infected period is the total time in which either the whole network is infected or the computer virus is removed from the whole network. The node infected period of node i is defined as the time from the infection of the node to any of the following circumstances:
(1) Node i becomes a recovered node, that is, the computer virus on infected node i is killed;
(2) There are no susceptible nodes in V (i), that is, all nodes are either infected nodes or recovered nodes in V (i).

Because of the instantaneous infection, the smallest t such that pi,j,t = 0 for all j ∈ V (i) is called the infected period of node i. Pi,j,t = 0 if pi,j,t = 0 for all j ∈ V (i).

5. State Vectors

A state vector is a feasible vector to indicate where and when the computer virus spreads. This section proposes a dual-vector form to construct the state vector by integrating the spread vector and the temporal vector, where the former and the latter indicate where and when the computer virus spreads, respectively.

5.1. Spread Vectors

The spread vector is a |V|-tuple vector, and the kth coordinate is the state of the node (k−1) or node k if the first node is labeled 0 or 1, respectively. Moreover, in the spread vector, the first infected node is the node where the value of its related coordinate is equal to itself. Let both the first node and the first coordinate be labeled as 0. For example, in Figure 2, the spread vector X1 = (0, 5, 5, 5, 7, 0, 2, 0) indicates that node 0 is the first infected node and that the virus spreads to nodes 5 and 7; node 5 spreads the virus to nodes 1–3 after it is infected, because the values in coordinates 1–3 are all 5; nodes 4–7 are infected by nodes 7, 0, 2, and 0, respectively.

A spread vector must be feasible; i.e., the spread of the computer virus must be possible. Only nodes in V (i) can spread the virus to or from node i ∈ V. Hence, we have the following important property, which is implemented in the proposed BAT to generate all spread vectors without needing to verify their feasibility, thereby reducing the runtime.

Property 2. Let X (i) be the value in the ith coordinate represented by node i of vector X. A vector X is a feasible spread vector if X (i) ∈ V (i) for all nodes i ∈ V.

For example, in Figure 2, X = (0, 5, 5, 5, 7, 0, 7, 0) is an infeasible spread vector, because it is impossible for the computer virus to spread from node 7 to node 6; that is, X (6) = 7 ∉ V (6).

To simplify the use of the proposed BAT, each spread vector is reconstructed and called the labeled spread vector such that
(1) The value j at coordinate i denotes the jth node in V (i), counting from 0, where the nodes in V (i) are arranged in increasing order of their labels.
(2) The first infected node is in bold.

For example, X1 = (0, 5, 5, 5, 7, 0, 2, 0) discussed above is rewritten as (0, 0, 0, 0, 2, 0, 1, 0) because node 0 is the first infected node and must be written in bold, and nodes 5, 5, 5, 7, 0, 2, 0 at coordinates 1–7 are the nodes labeled 0, 0, 0, 2, 0, 1, and 0 in V (1) = {5, 7}, V (2) = {5, 6, 7}, V (3) = {5, 6}, V (4) = {5, 6, 7}, V (5) = {0, 1, 2, 3, 4, 6}, V (6) = {1, 2, 3, 4, 5, 7}, and V (7) = {0, 1, 2, 3, 4, 6}, respectively.

To clarify, the following list contains the first 10 spread vectors, labeled spread vectors, and their corresponding 1-lag temporal vectors, which are discussed in Section 5.2.
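The relabeling can be sketched as follows. The neighbor sets are copied from the text above (V (0) = {5, 7} is taken from Section 4.2), while the function names and the dictionary `NEIGHBORS` are hypothetical. The feasibility check only tests the condition of Property 2 plus the existence of exactly one first infected node; it is a sketch and does not test for cycles among the remaining nodes.

```python
# Neighbor sets V(i) as listed in the text (assumed to be complete).
NEIGHBORS = {
    0: [5, 7], 1: [5, 7], 2: [5, 6, 7], 3: [5, 6], 4: [5, 6, 7],
    5: [0, 1, 2, 3, 4, 6], 6: [1, 2, 3, 4, 5, 7], 7: [0, 1, 2, 3, 4, 6],
}


def is_feasible_spread_vector(x, neighbors):
    """Check X(i) in V(i) for every node except the first infected one."""
    roots = [i for i, src in enumerate(x) if src == i]
    if len(roots) != 1:          # exactly one first infected node
        return False
    return all(src in neighbors[i] for i, src in enumerate(x) if src != i)


def to_labeled_spread_vector(x, neighbors):
    """Replace each source node by its 0-based position in sorted V(i)."""
    labeled = []
    for i, src in enumerate(x):
        if src == i:
            labeled.append(i)    # first infected node keeps its own label
        else:
            labeled.append(sorted(neighbors[i]).index(src))
    return tuple(labeled)


x1 = (0, 5, 5, 5, 7, 0, 2, 0)
print(is_feasible_spread_vector(x1, NEIGHBORS))
print(to_labeled_spread_vector(x1, NEIGHBORS))
```

In labeled form, every coordinate i ranges over 0, 1, …, |V (i)| − 1, so an enumeration procedure can step through the values without consulting the node labels.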

5.2. Basic Temporal Vectors and Instantaneous Infection

A temporal vector is a vector in which the coordinate value is the timeslot at which the related node is infected. Similar to the labeled spread vector, the first infected node is in bold in the temporal vector. For example, T = (0, 2, 2, 2, 2, 1, 2, 1) is the timeslot vector with respect to that in Table 13. In T, nodes 0–7 are infected at timeslots 0, 2, 2, 2, 2, 1, 2, and 1, respectively. Moreover, from T, we observe that node 0 is the first infected node, because T0 = 0 and all the infections have 1-lag, as the gap between two consecutive distinctive numbers in T is 1, e.g., 0 and 1; 1 and 2.

Let Ti,t be the t-lag temporal vector of node i if all infections are t-lag, that is, Δtj,k = t for all j ∈ V and k ∈ V (j). Because all infections are t-lag, the following property holds.

Property 3. The t-lag timeslot vector is the upper bound of any feasible timeslot vector that has at most t-lag.

We have the following important property that is implemented in the proposed BAT to serve as an upper bound to help in searching for all possible feasible timeslot vectors.

Property 4. The t-lag timeslot vector Ti,t = t × Ti,1 for all i ∈ V.

Each infected node can spread the computer virus only after it is infected and before it is cured. Theoretically, the maximal spread period can be infinite if the computer virus is not detected, as discussed in Section 4.1. Hence, a new concept called instantaneous infection is provided for defensive pessimism.

In an instantaneous infection, the computer virus can spread without waiting for another timeslot, that is, Δti,j = 0 for all i ∈ V and all j in V (i) if the node is first infected. Note that a 0-lag timeslot Ti,0 corresponds to a zero vector and an instantaneous infection. Hence, we have the following property:

Property 5. The 1-lag basic temporal vector is the upper bound of any feasible timeslot vector.

All temporal vectors can be generated according to Property 5 such that each of their coordinates is less than or equal to that of the related basic temporal vector. Moreover, owing to the characteristic of the instantaneous infection, we have the following important property in filtering out the feasible temporal vectors from all vectors generated according to Property 5.

Property 6. The temporal vector Y is feasible if and only if the following conditions are satisfied, where T represents the 1-lag basic temporal vector that generates Y; that is, Y (i) ≤ T (i) for all node i in V.
(1) Y (i) < Y (j) if T (i) < T (j), where i, j ∈ V,
(2) Y (i) ≤ Y (j) if T (i) = T (j), where i, j ∈ V.

5.3. Basic State Vectors

The state vector is a dual vector formed by two vectors, a spread vector and a temporal vector, separated using the notation “;”. For example, X1 = (0, 5, 5, 5, 7, 0, 2, 0; 0, 2, 2, 3, 2, 1, 5, 1) indicates the following:
(1) Node 0 is the first infected node at timeslot 0 because it is characterized by X1 (0) = 0
(2) Node 0 spreads the virus to nodes 5 and 7 at t = 1
(3) Node 5 spreads the virus to nodes 1 and 2 at t = 2 and to node 3 at t = 3
(4) Node 7 spreads the virus to node 4 at t = 2
(5) Node 2 spreads the virus to node 6 at t = 5

A state vector is a basic state vector if its temporal vector is basic. A state vector (basic or not) is infeasible if it is impossible to spread the computer virus according to either its spread vector or its temporal vector, that is, if node j is infected from node i but j ∉ V (i) or tj ≤ ti; otherwise, it is a feasible state vector. For example, X1 discussed above is a feasible state vector, X2 = (0, 0, 5, 5, 7, 1, 5, 0; 0, 2, 2, 3, 2, 1, 5, 1) is infeasible because 1 ∉ V (0) in its spread vector, and X3 = (0, 5, 5, 5, 7, 0, 2, 0; 0, 2, 0, 3, 2, 1, 5, 1) is infeasible because T2 = 0 but T5 = 1 and node 2 is infected from node 5.
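The feasibility rule can be sketched as a direct check on both halves of the state vector. The neighbor sets are those listed in Section 5.1 (with V (0) = {5, 7} from Section 4.2); the function name and the dictionary `NEIGHBORS` are hypothetical. The sketch rejects a state vector when some node j is infected from a node outside V (j) or not strictly after its source.

```python
# Neighbor sets V(i) as listed in Section 5.1 (assumed to be complete).
NEIGHBORS = {
    0: [5, 7], 1: [5, 7], 2: [5, 6, 7], 3: [5, 6], 4: [5, 6, 7],
    5: [0, 1, 2, 3, 4, 6], 6: [1, 2, 3, 4, 5, 7], 7: [0, 1, 2, 3, 4, 6],
}


def is_feasible_state_vector(spread, times, neighbors):
    """Check a state vector (spread vector; temporal vector).

    spread[j] is the node that infects node j (spread[j] == j marks the
    first infected node) and times[j] is the timeslot of that infection.
    """
    roots = [j for j, src in enumerate(spread) if src == j]
    if len(roots) != 1:
        return False
    for j, (src, t_j) in enumerate(zip(spread, times)):
        if src == j:
            continue                       # first infected node
        if src not in neighbors[j]:        # spread along a missing link
            return False
        if t_j <= times[src]:              # infected no later than its source
            return False
    return True


x1 = ((0, 5, 5, 5, 7, 0, 2, 0), (0, 2, 2, 3, 2, 1, 5, 1))
x2 = ((0, 0, 5, 5, 7, 1, 5, 0), (0, 2, 2, 3, 2, 1, 5, 1))
x3 = ((0, 5, 5, 5, 7, 0, 2, 0), (0, 2, 0, 3, 2, 1, 5, 1))
for spread, times in (x1, x2, x3):
    print(is_feasible_state_vector(spread, times, NEIGHBORS))
```

The three calls reproduce the classification of X1, X2, and X3 given above: feasible, infeasible, and infeasible.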

For any state vector, the feasibility of its temporal vector depends on its spread vector. Moreover, from Property 5, all temporal vectors can be deduced from the 1-lag temporal vectors. Hence, in the proposed BAT, all the feasible spread vectors together with their 1-lag temporal vectors are found first to reduce the computational burden of searching for all the state vectors.

The following property relates the damping factor d to the state vectors. According to this property, it suffices to find all the state vectors once, for a single value of d, rather than repeating the search for every value of d, which is very useful for improving the efficiency of the related algorithms.

Property 7. A feasible state vector is feasible for every value of the damping factor d.

6. Proposed BAT

Similar to the DFS, BFS, and UGFM, the BAT is an implicit enumeration search method that can find all the feasible state vectors. However, the BAT is easier to code, more flexible to modify, and more efficient in execution than the other methods [40, 50, 51]. Hence, the BAT is adopted in this study and developed in this section formally to solve the proposed problem.

6.1. Basic Idea behind the Proposed BAT

The update procedure starts from the last coordinate and moves toward the first coordinate. The basic idea of the proposed BAT generalizes the update procedure of the traditional BAT by changing a binary vector to a multistate vector as follows:
(1) If it is possible to replace the value of the current coordinate with a larger feasible value, it is replaced, and the new vector is a new state vector. For example, in the traditional BAT, the last 0 in the binary-state vector Xi = (0, 1, 1, 0) can be updated to 1, and Xi+1 = (0, 1, 1, 1) is a new binary-state vector.
(2) If it is impossible to replace the value of the current coordinate with a larger feasible value, the current value is reset to the smallest feasible value, the algorithm moves to the next coordinate, the foregoing procedure is repeated until the replacement is possible, and then the first step is performed. For the example used in the first step, Xi+1 = (0, 1, 1, 1) is updated to Xi+2 = (1, 0, 0, 0).
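The two update rules can be illustrated for the traditional binary case with a few lines of Python. This is a minimal sketch of the update step only; the function name bat_next is hypothetical.

```python
def bat_next(x):
    """Return the binary vector that follows x, or None if x is all ones."""
    x = list(x)
    for i in range(len(x) - 1, -1, -1):   # start from the last coordinate
        if x[i] == 0:
            x[i] = 1                      # rule (1): replace with the larger value
            return tuple(x)
        x[i] = 0                          # rule (2): reset and move to the next coordinate
    return None                           # every coordinate was already 1

x = (0, 1, 1, 0)
x1 = bat_next(x)
x2 = bat_next(x1)
print(x1, x2)
```

Starting from the zero vector and calling bat_next repeatedly visits every binary vector exactly once, which is the property that the multistate version relies on.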

To deal with the temporal and learning effect properties in the proposed BAT, the details of the foregoing new idea are explained in the remainder of this section.

6.2. BAT-1 for Spread Vector and 1-Lag Temporal Vector

There are two BATs in the proposed BAT. The first one, which is called BAT-1, finds all the feasible spread vectors together with the related 1-lag temporal vectors. The second BAT, which is called BAT-2, finds all the feasible temporal vectors according to the found 1-lag temporal vectors.

BAT-1 is proposed here by changing the binary states to multistates to find all the feasible labeled spread vectors to fit the proposed problem, and its pseudocode is presented below (Algorithms 3 and 4):

(i)Input: A scale-free network G (V, E) in which the computer virus infects node TARGET first.
(ii)Output: All state vectors without duplications.
(iii)STEP S0. Let X be a zero vector with n coordinates representing the node states, vector index k = 1, istop = 1 if TARGET = 0, and istop = 0 if TARGET > 0.
(iv)STEP S1. Let coordinate index i = (n − 1).
(v)STEP S2. If i = TARGET, let i = (i − 1) and go to STEP S3.
(vi)STEP S3. If X (i) < (Wi − 1), let X (i) = X (i) + 1, and execute 1-lag_Temporal_Vector (X).
(vii)STEP S4. If Xk is feasible, let k = k + 1 and go to STEP S1.
(viii)STEP S5. If i = istop, halt and X1, X2, …, Xk are all feasible spread vectors.
(ix)STEP S6. Let X (i) = 0, i = (i − 1), and go to STEP S2.

(i)STEP L0. Let FLAG(v) = false for all v ∈ V, FLAG(TARGET) = true, t = 1, L0 = {TARGET}, and L1 = ∅.
(ii)STEP L1. For each node u with FLAG(u) = false and L(Xu) = v for some v ∈ Lt−1, let Tu = t, FLAG(u) = true, and Lt = Lt ∪ {u}.
(iii)STEP L2. If Lt = ∅, halt and return the information that X is infeasible.
(iv)STEP L3. If FLAG(v) = true for all v ∈ V, halt and return the information that X is feasible.
(v)STEP L4. Let t = t + 1, Lt = ∅, and go to STEP L1.
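The two procedures above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: it assumes that the label X (i) indexes the sorted neighbor list of node i, it also evaluates the all-zero label vector before the first update, and it uses a hypothetical four-node network (edges 0-1, 0-2, 1-2, 1-3, 2-3) rather than the network of Figure 2. The names bat1 and one_lag_temporal_vector are hypothetical.

```python
def one_lag_temporal_vector(spread, target):
    """Layered search (STEPs L0 to L4): return 1-lag infection times, or None if infeasible."""
    n = len(spread)
    times = [None] * n
    times[target] = 0
    layer = [target]
    t = 1
    while layer:
        # Nodes not yet infected whose infector was infected in the previous layer.
        layer = [u for u in range(n) if times[u] is None and spread[u] in layer]
        for u in layer:
            times[u] = t
        t += 1
    return times if all(x is not None for x in times) else None

def bat1(neighbors, target):
    """Enumerate label vectors by multistate addition and keep the feasible ones."""
    n = len(neighbors)
    free = [i for i in range(n) if i != target]   # the TARGET coordinate is skipped
    labels = [0] * n
    feasible = []
    while True:
        spread = [target if i == target else neighbors[i][labels[i]] for i in range(n)]
        times = one_lag_temporal_vector(spread, target)
        if times is not None:
            feasible.append((tuple(spread), tuple(times)))
        for i in reversed(free):                  # last coordinate first
            if labels[i] < len(neighbors[i]) - 1:
                labels[i] += 1                    # a larger label exists: use it
                break
            labels[i] = 0                         # otherwise reset and move on
        else:
            return feasible                       # all coordinates exhausted

# Hypothetical demo network (sorted neighbor lists), not the network of Figure 2.
demo = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2]]
vectors = bat1(demo, target=0)
print(len(vectors), "feasible spread vectors out of", 3 * 3 * 2, "candidates")
print(vectors[0])
```

In this reading, a spread vector is feasible exactly when every node can be traced back to TARGET through its infector, so the layered search doubles as the feasibility test and as the generator of the 1-lag temporal vector.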

The above procedure essentially follows the concepts proposed in Section 6.1. For example, according to the proposed BAT-1 algorithm, we have the first 10 spread states and the related temporal vectors, as shown in Table 11 listed in Section 5.1.

6.3. BAT-2 for All Temporal Vectors

From Property 5, all temporal vectors can be obtained from 1-lag temporal vectors. Hence, another BAT based on Properties 3–6 is implemented to find all temporal vectors for each 1-lag temporal vector as follows (Algorithm 5):

(i)Input: A 1-lag temporal vector T reordered in decreasing order of the coordinate values.
(ii)Output: All feasible temporal vectors with respect to T.
(iii)STEP T0. Let Y be a zero vector with n coordinates representing the node infection times, vector index k = 1, istop = 1 if TARGET = 0, and istop = 0 if TARGET > 0.
(iv)STEP T1. Let coordinate index i = (n − 1).
(v)STEP T2. If Y (i) < T (i), let Y (i) = Y (i) + 1, and execute 1-lag_Temporal_Vector(X).
(vi)STEP T3. If Y (u) < Y (v) and T (u) < T (v) for all nodes u and v in V, Y is infeasible and go to STEP T1. Otherwise, let k = k + 1 and go to STEP T1.
(vii)STEP T4. If i = istop, halt and Y1, Y2, …, Yk are all feasible temporal vectors generated from T.
(viii)STEP T5. Let Y (i) = 0, i = (i − 1), and go to STEP T2.

For example, there are 2591 1-lag temporal vector candidates generated from the basic 1-lag temporal vector T = (0, 2, 2, 2, 2, 1, 3, 3) obtained according to the feasible labeled spread vector X = (0, 0, 0, 0, 0, 0, 0, 3), for which the spread vector is (0, 5, 5, 5, 5, 0, 1, 1) in Figure 2. In total, 479 of them are feasible, and the first 30 temporal vectors are presented in Table 14.
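The candidate count quoted above can be checked with a small enumeration. The sketch below applies the multistate addition update to all integer vectors Y with 0 ≤ Y (i) ≤ T (i). Excluding the zero vector leaves 2591 vectors, which matches the reported number of candidates; that this is how the candidates are counted is an observation about the arithmetic, not a statement from the source. The feasibility filter that yields the 479 feasible vectors is not reproduced here. The name bounded_vectors is hypothetical.

```python
def bounded_vectors(upper):
    """Yield every integer vector y with 0 <= y[i] <= upper[i] by multistate addition."""
    y = [0] * len(upper)
    while True:
        yield tuple(y)
        for i in range(len(y) - 1, -1, -1):   # last coordinate first
            if y[i] < upper[i]:
                y[i] += 1                     # a larger value exists: use it
                break
            y[i] = 0                          # otherwise reset and move on
        else:
            return                            # every coordinate is at its bound

T = (0, 2, 2, 2, 2, 1, 3, 3)
total = sum(1 for _ in bounded_vectors(T))
print(total, "vectors in total;", total - 1, "excluding the zero vector")
```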

6.4. Pseudocode for the Proposed Algorithm

Assume that the computer virus starts to spread from node TARGET, and we wish to determine the probability that the whole network is infected within a t-lag. The proposed method for estimating this probability combines the temporal spread probability with the learning effect derived in Section 4.4, the novel dual state vectors developed in Section 5.3, and the new BAT proposed in Sections 6.2 and 6.3. Its pseudocode is presented below (Algorithm 6):

(i)Input: A scale-free network G (V, E), the first infected node TARGET, the damping factor d, and the allowed timeslot lag t.
(ii)Output: The probability that the computer virus spreads throughout the whole network within time lag t.
(iii)STEP 0. Count the degree of each node, calculate the PageRank values of each node, calculate the initial spread probability of each directed arc using Equation (7), and compute the temporal spread probability with the learning effect for each arc using Equation (12).
(iv)STEP 1. Implement the proposed BAT-1 algorithm to search for all feasible basic state vectors constructed by the spread vectors and the 1-lag temporal vectors.
(v)STEP 2. Implement the proposed BAT-2 algorithm to find all the feasible t-lag temporal vectors according to the 1-lag temporal vectors.
(vi)STEP 3. Calculate and sum the probabilities of all the feasible state vectors using Equation (12).
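As an illustration of the PageRank part of STEP 0, the sketch below runs the standard PageRank power iteration on an undirected network, with d treated as the probability of following a link. This is the textbook formulation and may differ in detail from the one used in this study; the initial spread probability of Equation (7) and the learning-effect probability of Equation (12) are not reproduced. The function name pagerank, the iteration count, and the four-node demo network are assumptions.

```python
def pagerank(neighbors, d, iters=100):
    """Standard PageRank by power iteration; every node must have at least one neighbor."""
    n = len(neighbors)
    pr = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n            # uniform jump to any node
        for j, nbrs in enumerate(neighbors):
            share = d * pr[j] / len(nbrs)    # node j splits its rank among its neighbors
            for i in nbrs:
                new[i] += share
        pr = new
    return pr

# Hypothetical demo network (edges 0-1, 0-2, 1-2, 1-3, 2-3), not the network of Figure 2.
demo = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2]]
degrees = [len(nbrs) for nbrs in demo]
ranks = pagerank(demo, d=0.1)
print(degrees)
print([round(r, 4) for r in ranks])
```

Nodes with identical neighbor structure receive identical PageRank values, which is consistent with the observation in Section 6.5 that nodes with the same neighbors yield the same results.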

For example, let the damping factor be d = 0.1, the lag be one timeslot, and node 0 be the first infected node in Figure 2. Then, we can obtain the PageRank values of each node, the degree of each node, the initial spread probability of each directed arc, and the temporal learning-effect spread probability of each directed arc, as shown in Tables 5, 7, and 9, respectively.

Using the proposed BAT-1 and BAT-2 algorithms, the basic state vectors and state vectors are obtained. The first 10 basic state vectors and the first 30 1-lag temporal vectors generated from the basic 1-lag temporal vector T = (0, 2, 2, 2, 2, 1, 3, 3) are presented in Table 14.

In the last step, that is, STEP 3, the probabilities of all the feasible state vectors are calculated and summed. The final probability that the virus, starting from node 0, spreads throughout the whole network within one timeslot is 0.7038000, and there are 1268 feasible basic state vectors among the 9720 state vector candidates. In addition, the probability of the labeled spread vector X = (0, 0, 0, 0, 0, 0, 0, 3), for which the basic 1-lag temporal vector is T = (0, 2, 2, 2, 2, 1, 3, 3), is 1.00259E-08, with 479 feasible 1-lag temporal vectors filtered out from 2591 1-lag temporal vectors, and the probabilities of the first 30 1-lag temporal vectors are presented in Table 15.

From the above, the proposed BAT can search for complete state vectors that satisfy the requirements of the proposed problem using BAT-1 and BAT-2. Only with all the state vectors can the analytical probability of the proposed algorithm be calculated, as described by STEP 3.

6.5. Experimental Analysis

In a scale-free network, the node degree distribution follows the power law. Thus, the computational burden increases with the size of the scale-free network, according to the power law. To confirm the performance of the proposed BAT in calculating the probability of the entire network being infected by a computer virus, the proposed algorithm was tested on the mid-size network shown in Figure 2, whose adjacency matrix is given in Table 3, by letting each node in turn be the first infected node for the damping factors of d = 0.1, 0.3, 0.5, 0.7, and 0.9 under allowed time lags of t = 0, 1, and 2. Hence, there were 8 × 5 × 3 = 120 tests in total.

As the number of state vectors increased, the runtime increased exponentially. Hence, we only discuss t values up to 2.

Both BAT-1 and BAT-2 were coded in Python 3.7.7 and run on Spyder 4.1.3. The 120 tests were conducted on Windows 10 with an Intel Core i7-8650U CPU at 1.90 GHz and 2.11 GHz with 16 GB RAM. The experimental results are presented in Tables 16–18.

The numbers of basic state vector candidates Ns and basic feasible state vectors ns were only related to the node degrees Deg (i) based on the characteristics of the scale-free network and were unrelated to the values of the damping factors (d) and the allowed time lags (t), for all i ∈ V. Hence, we have the same number of basic and feasible state vectors and probabilities if V (i) = V (j), for all nodes i and j. Thus, we list the values of Ns and ns in Table 14 and not in the other tables.

The foregoing observation is useful because it means that we need to consider only nodes with distinct neighbor sets. For example, nodes 1, 3, and 4 all have the same neighbors, that is, V (1) = V (3) = V (4) = {5, 6, 7}, so we can search for the state vectors and calculate the probability for node 1 only because the results for nodes 3 and 4 are identical.
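This saving can be automated by grouping the nodes by neighbor set and running the search once per group. The sketch below is illustrative: only V (1) = V (3) = V (4) = {5, 6, 7} is stated in the text, and the remaining neighbor sets are assumptions inferred from the examples in Section 5.3 rather than taken from Table 3. The names NEIGHBORS and group_by_neighbor_set are hypothetical.

```python
from collections import defaultdict

# Assumption: only V(1) = V(3) = V(4) = {5, 6, 7} is given in the text; the other
# sets are inferred from earlier examples and may be incomplete.
NEIGHBORS = {
    0: {5, 7},
    1: {5, 6, 7},
    2: {5, 6},
    3: {5, 6, 7},
    4: {5, 6, 7},
    5: {0, 1, 2, 3, 4},
    6: {1, 2, 3, 4},
    7: {0, 1, 3, 4},
}

def group_by_neighbor_set(neighbors):
    """Group nodes whose neighbor sets are identical."""
    groups = defaultdict(list)
    for node in sorted(neighbors):
        groups[frozenset(neighbors[node])].append(node)
    return list(groups.values())

groups = group_by_neighbor_set(NEIGHBORS)
representatives = [group[0] for group in groups]   # run the search once per group
print(groups)
print(representatives)
```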

Moreover, a first infected node with a higher degree is more likely to have a smaller number of basic state vector candidates. This is because the initial spread probabilities from a node to its neighbors sum to one, so a node with more neighbors is more likely to have a lower spread probability to each of them. In addition, the more neighbors there are, the smaller the values obtained from the multiplications in Equation (12).

For the same reason as in Table 16, the probabilities that the whole network was infected by node i and node j are equal if V (i) = V (j) in Table 17, for all nodes i and j. For example, V (1) = V (3) = V (4) = {5, 6, 7} from Table 5; the probabilities that the whole network was infected by nodes 1, 3, and 4 were 1.78550E-05 for d = 0.1 and t = 0.

The increment in d reduced the obtained probability. For example, for t = 0 and i = 0, the related probability was 1.05186E-05 for d = 0.1, and it was higher than 7.55361E-06 for d = 0.3, as shown in Table 15. This is because a smaller d corresponded to a higher probability of exploring new areas.

A virus that spent a longer amount of time spreading had a higher probability of spreading to the entire network. Hence, the increment of t increased the obtained probability; for example, the probability was 1.05186E-05 for t = 0 and increased to 1.64706E-05 for t = 1, as shown in Table 17. However, the rate of the increment in the probability decreased as t increased. This is because a larger value of t corresponded to a smaller number of susceptible nodes remaining, which resulted in a lower spread probability. For example, 1.64706E-05/1.05186E-05 = 1.56585477 > 1.67060E-05/1.64706E-05 = 1.014292133, where 1.05186E-05, 1.64706E-05, and 1.67060E-05 represent the spread probabilities of t = 0, 1, and 2 for i = 0 and d = 0.10, respectively.
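The diminishing rate of increase can be verified directly from the three probabilities quoted above for i = 0 and d = 0.1. The short check below only recomputes the two ratios; the variable names are illustrative.

```python
# Probabilities quoted in the text for i = 0 and d = 0.1, indexed by the time lag t.
p = {0: 1.05186e-05, 1: 1.64706e-05, 2: 1.67060e-05}

ratio_0_to_1 = p[1] / p[0]   # growth from t = 0 to t = 1
ratio_1_to_2 = p[2] / p[1]   # growth from t = 1 to t = 2

print(f"{ratio_0_to_1:.6f} {ratio_1_to_2:.6f}")
print("rate of increase declines:", ratio_0_to_1 > ratio_1_to_2)
```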

Similar to the foregoing observations in Table 17, Table 18 shows that the increment of the allowed time lag t increased the number of temporal state vector candidates Nt and feasible temporal vectors nt; e.g., Nt was 11651837 for t = 1 and increased to 514419848 for t = 2.

Interestingly, the ratio of (the number of temporal state vector candidates)/(the number of temporal state vectors), that is, Nt/nt in Table 18, increased with t. This confirms that a larger value of t corresponded to a larger Nt and nt. In addition, a larger Nt corresponded to a smaller portion of feasible temporal vectors because all the feasible temporal vectors were selected from the temporal state vector candidates.

Table 18 shows another interesting finding that is coincident with those in Tables 16 and 17: the values of Nt for nodes i and j are identical if V (i) = V (j). For example, Nt = 9196660 for nodes 1, 3, and 4 because V (1) = V (3) = V (4) = {5, 6, 7}. Moreover, the values of nt for nodes i and j are still the same if V (i) = V (j). Hence, the probability that the entire network is infected, Ns, ns, Nt, and nt are fixed for nodes i and j if V (i) = V (j).

The experimental results can be summarized in the following observations:
(1) Decreasing the damping factor d yields a higher probability of exploring new areas.
(2) Increasing the allowed time lag t increases the obtained probability.
(3) Increasing the allowed time lag t increases the number of temporal state vector candidates Nt, the number of feasible temporal vectors nt, and the ratio of (the number of temporal state vector candidates)/(the number of temporal state vectors), that is, Nt/nt.

7. Conclusions

The spread of computer viruses not only jeopardizes the security of computer and network systems but also hinders their normal operation. Because both human users and software become more vigilant in detecting a virus once it has been recognized, the spread of a computer virus after its creation is curbed by the digital antidote and decreases over time. This is called the learning effect.

The spread of a computer virus can be modeled using a scale-free network in which the node degree distribution follows the power law. A novel dynamic model of computer-virus spread with the learning effect, based on the scale-free model, is proposed. A simple and straightforward method is proposed for modeling the spread of computer viruses and theoretically predicting the analytical probabilities of all the infected-computer scenarios. The method is based on the BAT, a novel concept called the temporal learning-effect spread probability, and a dual vector that combines the spread vector and the temporal vector.

The reliability and performance of the proposed BAT were confirmed on a simulated temporal deterioration-effect scale-free network generated using the Barabási–Albert model. The results encourage the extension of the proposed BAT, including the integration of Monte Carlo simulation [7] into the BAT to solve larger scale-free problems.

In the future, we will seek opportunities to collaborate with government and private organizations on this research to further verify how well the proposed model fits real-life situations. In addition, a comparison of the current model with a virus-resistant model that includes a defense mechanism is planned for future work.

Moreover, several extended topics will be considered in future research. For example, we will study how to adjust the parameters of the model for calculating the spread probability when firewall devices, IDS, IPS, and DMZ are present in the intranet. We will further incorporate the version of the operating system, the status of open service ports, the status of protocol services, the running time of each facility, etc., to determine how these factors influence the probability of virus spreading.

Also, we will check whether a solution exists to ensure that the optimal control problem is solvable before attempting to formulate a solution. The results of the proposed algorithm will be compared with those of other methods in the literature, and the simulation results will also be presented graphically. Stability is an important issue for all dynamic systems. Although this paper is more academic in nature, stability has already been considered in the learning effect, and we will discuss it separately in the future. Later studies will examine the scalability of the model, using a large network of 3000 to 5000 routers to evaluate the performance of the proposed algorithm and to identify when the infection is less likely or more likely to become endemic.

Data Availability

The dataset is given as the adjacency matrix in Table 3.

Disclosure

This article was once submitted to arXiv as a temporary submission that was just for reference and did not provide the copyright.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported in part by the Ministry of Science and Technology, R.O.C. under grants MOST 102-2221-E-007-086-MY3 and MOST 104-2221-E-007-061-MY3.