Abstract

Nowadays, digital computer systems and networks are the main engineering tools, used in the planning, design, operation, and control of buildings, transportation, machinery, businesses, and life-maintaining devices of all sizes. Consequently, computer viruses have become one of the most important sources of uncertainty, reducing the reliability of vital activities. Many antivirus programs have been developed, but they are limited to detecting and removing infections on the basis of previous knowledge of the virus code. Although they adapt well, these programs work just as vaccines against diseases and cannot prevent new infections on the basis of the network state. Here, an attempt at modeling the propagation dynamics of computer viruses relates it to other notable events occurring in the network, which permits preventive policies to be established in network management. Data from three different viruses are collected on the Internet, and two different identification techniques, autoregressive and Fourier analyses, are applied, showing that it is possible to forecast the propagation dynamics of a new virus by using the data collected from other viruses that formerly infected the network.

1. Introduction

A few decades ago, computer viruses arose in the form of programs with simple code that were able to undermine the smooth operation of a machine. Initially, despite their large number, viruses caused minor damage to machinery and spread very slowly. Over the years, with the rapid development of software and hardware technology, the development and popularization of the Internet, and the great variety of equipment using software and networks, viruses have become a major threat [1].

Currently, virus programs have more complex code and can produce mutations of themselves, so their detection and removal by antivirus programs has become more difficult [2]. Their goals go much further than simply damaging a machine: they can acquire personal data of network users, such as bank account details, and cause severe damage to large corporations [3].

In view of these concerns, a better understanding of the spreading dynamics of computer viruses is mandatory. To improve the safety and reliability of computer systems and networks, it is important to be able to recognize and combat the several types of infections faster and more effectively [4, 5].

Research started at the end of the 1980s with the classical paper of Kephart et al. [6], which proposed an ecosystem approach for computational systems. Efforts then concentrated on the development of antivirus programs, responsible for the detection and removal of viruses through previous recognition of the infection code, following the models shown in [2, 7, 8]. These programs can be upgraded readily, but they act just as simple vaccines against diseases [2, 4]. They are not able to predict the behavior of networks when an infection is established in a machine and, consequently, cannot support preventive measures against virus actions based on events in the network.

The first effort to produce models for the spreading of computer viruses based on their epidemiological counterparts is reported in [7], with the initial ideas for deriving long-term behaviors from the graph representing the network connections. Then, with Markov chains representing the local behavior of the infection in a single node, susceptible-infected-removed (SIR) models were presented in an attempt to fit the long-term behavior of virus propagation [9].

This kind of approach has received some attention in the last five years, and the relations between virus spreading and topological parameters of the network have been studied, with success mainly in modeling propagation through email networks [10]. In addition, SIR models were modified [5] and applied to guide infection prevention [11, 12], with expressions derived for epidemiological thresholds [11–13].

This work focuses on obtaining models for the spreading dynamics of certain viruses, mainly taking into account the correlation functions between the spreading data of several viruses over a certain period of time. Thus, the number of infections by one type of virus could be foreseen in the short term by comparison with other viruses or with notable events in the network, which would support preventive policies.

To provide simple algorithms that are easy to operate, simple autoregressive models are chosen [14, 15]. Considering the periodicity of the collected data, Fourier models are also tried, producing results similar to those of the autoregressive ones.

2. Methodology

The data to be collected for modeling the propagation of computer infections are the numbers of daily, weekly, and monthly infections for several computer viruses. These numbers can be found on the Internet, for instance, at http://www.avira.com/, and support the development of linear identification models.

The next step is the choice of the specific viruses to be analyzed from the enormous range of possibilities. In this work, a premise was adopted: for the identification to be efficient, the chosen viruses need to present similar propagation dynamics. Here, the selection criterion combines a high incidence of reported cases with spreading by email.

Wormnetsky.p, wommytob.mr, and trdir.stration.ge were chosen, that is, two worms and a trojan. Figures 1, 2, and 3 show the dynamical evolution of the number of infections with Wormnetsky.p, wommytob.mr, and trdir.stration.ge, respectively.

First, in order to verify the relations among the viruses, cross-correlation coefficients are calculated. Consider two signals $x(t)$ and $y(t)$, simultaneously sampled at regular intervals, with samples $x_i$ and $y_i$. For a certain time interval containing $N$ sample periods, the cross-correlation coefficient $r_{xy}$ between $x$ and $y$ measures how they are related to each other in this interval (see [16, page 206]). Table 1 summarizes the cross-correlation coefficients calculated for the three pairs of infection signals, for the time interval of Figures 1, 2, and 3, with the data sampled daily.
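
As an illustration of this step, the following minimal sketch computes a cross-correlation coefficient for two daily series. It assumes that the coefficient is the zero-lag normalized cross-correlation (Pearson's r), which may differ in detail from the definition in [16]. The function name and the synthetic series are hypothetical and are not the collected infection data.

```python
import numpy as np

def cross_correlation_coefficient(x, y):
    """Zero-lag normalized cross-correlation (Pearson's r) of two series.

    Assumption: both series are sampled at the same regular instants.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("both series must have the same number of samples")
    dx = x - x.mean()  # deviations from the mean of each series
    dy = y - y.mean()
    denom = np.sqrt(np.sum(dx**2) * np.sum(dy**2))
    if denom == 0.0:
        raise ValueError("a constant series has no defined correlation")
    return float(np.sum(dx * dy) / denom)

# Synthetic daily counts (hypothetical, for illustration only)
days = np.arange(30)
virus_a = 100 + 40 * np.cos(2 * np.pi * days / 7)
virus_b = 80 + 25 * np.cos(2 * np.pi * days / 7 + 0.3)
r_ab = cross_correlation_coefficient(virus_a, virus_b)
```

A value of `r_ab` close to 1 indicates that the two series rise and fall together over the interval, which is the kind of similarity the identification strategy relies on.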

The results in Table 1 indicate acceptable correlation between the spread of the chosen viruses, corroborating the visual similarity between the temporal evolution of the three infections. Because of this, the data from only one of the viruses are used to identify the system parameters, which then provide short-term forecasts for all three viruses. Following this identification strategy, model accuracy is checked.

3. System Identification Algorithms

In order to identify the parameters to model the temporal evolution of the infections by the three types of viruses selected here, two approaches were followed:

(i) using a linear autoregressive model, that is, considering that the current value of a variable depends only on its former values, up to a certain delay [14, 15];
(ii) identifying the main frequencies of the time series and treating them as Fourier series [14, 15].

3.1. Autoregressive Model

Considering a regularly sampled signal $x(k)$, its estimated value $\hat{x}(k)$ at instant $k$ is given by

$$\hat{x}(k) = \sum_{i=1}^{n} a_i \, x(k-i),$$

where $a_i$ are the model parameters, estimated by the least-squares method, and $n$ is the maximum delay to be considered [14, 15], measured in number of sampling intervals.
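
The least-squares estimation can be sketched as follows, assuming the model has the usual linear form in which the estimate at instant k is a weighted sum of the n previous samples, with weights a_1, ..., a_n. The function names `fit_ar` and `predict_one_step` and the test series are hypothetical; this is one possible implementation, not necessarily the one used to produce the results reported here.

```python
import numpy as np

def fit_ar(x, n):
    """Least-squares estimate of a_1..a_n in x_hat(k) = sum_i a_i * x(k - i)."""
    x = np.asarray(x, dtype=float)
    if n < 1 or len(x) <= n:
        raise ValueError("need more samples than the maximum delay n")
    # The row for instant k holds [x(k-1), x(k-2), ..., x(k-n)]
    regressors = np.array([x[k - n:k][::-1] for k in range(n, len(x))])
    targets = x[n:]
    coeffs, *_ = np.linalg.lstsq(regressors, targets, rcond=None)
    return coeffs

def predict_one_step(x, coeffs):
    """Estimate the next sample from the last n observed samples."""
    n = len(coeffs)
    recent = np.asarray(x[-n:], dtype=float)[::-1]  # most recent sample first
    return float(np.dot(coeffs, recent))

# Hypothetical series generated by a known second-order recurrence
samples = [1.0, 2.0]
for _ in range(20):
    samples.append(0.5 * samples[-1] + 0.3 * samples[-2])
estimated = fit_ar(samples, n=2)  # should be close to [0.5, 0.3]
```

Because the model is linear in its parameters, the estimation reduces to a single call to a linear least-squares solver, which keeps the algorithm simple to operate.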

By using a “free-prediction” strategy, the data vector is divided into two parts: one is used for the identification of the system parameters and the other for the simulation and validation of the model. For the data described in Section 2, the first 25 samples are used for identification and the last 5 for simulation. Different values of the maximum delay are considered, and Figures 4, 5, and 6 show the results for three such values, one per figure, with the continuous line representing the real data and the asterisks representing the simulation results.
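
The identification and simulation split can be sketched as below. The sketch assumes that "free prediction" means feeding each estimate back into the model as an input for the following instants, so that the last 5 samples are never seen by the model. The 25/5 split follows the text; the synthetic series, the delay value of 3, and all function names are hypothetical.

```python
import numpy as np

def fit_ar(x, n):
    """Least-squares fit of x_hat(k) = sum_i a_i * x(k - i)."""
    x = np.asarray(x, dtype=float)
    regressors = np.array([x[k - n:k][::-1] for k in range(n, len(x))])
    coeffs, *_ = np.linalg.lstsq(regressors, x[n:], rcond=None)
    return coeffs

def free_prediction(history, coeffs, steps):
    """Simulate `steps` samples, feeding each estimate back as an input."""
    n = len(coeffs)
    window = list(np.asarray(history, dtype=float)[-n:])
    forecast = []
    for _ in range(steps):
        # window[-1] is the most recent value, so reverse to match a_1..a_n
        estimate = float(np.dot(coeffs, window[::-1]))
        forecast.append(estimate)
        window = window[1:] + [estimate]
    return np.array(forecast)

def mean_square_error(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Synthetic 30-day series (hypothetical, not the collected data)
days = np.arange(30)
series = 100 + 40 * np.cos(2 * np.pi * days / 7)
identification, holdout = series[:25], series[25:]

coeffs = fit_ar(identification, n=3)  # n=3 is an arbitrary illustrative delay
forecast = free_prediction(identification, coeffs, steps=len(holdout))
validation_mse = mean_square_error(holdout, forecast)
```

Repeating the last four lines for several values of `n` and comparing `validation_mse` gives a simple way to compare candidate delays.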

To compare the chosen delays, Table 2 shows the mean-square estimation error in each case. On the basis of these results, a single delay value is adopted for all models from now on.

To assess the efficiency of the adopted identification strategy, the parameters estimated for one virus are used to model the dynamics of the other two. The results are shown in Figures 7 and 8, one for each of these two viruses, with the continuous line representing the real data and the asterisks representing the simulation results. Table 3 summarizes the mean-square errors of these simulations.

The simulations performed with only the parameters calculated for a single virus show that the short-term estimations of new infections are not precise for the other two viruses, as expected, because the same model is used for different viruses. Nevertheless, the model is able to predict with some accuracy the increasing and decreasing tendencies in their dynamics. This knowledge permits the implementation of preventive policies from the propagation profile alone.

3.2. Fourier Series Model

Given the strong oscillatory character of the three viruses studied, a model treating the signals as a sum of cosines was developed. Figures 9, 10, and 11 present the frequency spectra of the temporal evolution of the propagation of the three viruses. As one can see, the main frequencies of the three dynamic behaviors are the same.
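
A frequency spectrum of this kind can be obtained with a discrete Fourier transform. The sketch below is one way to do it, assuming daily sampling so that frequencies are expressed in cycles per sample. The function names and the synthetic series are hypothetical and are not the collected data.

```python
import numpy as np

def amplitude_spectrum(x, sample_interval=1.0):
    """Return frequencies and DFT magnitudes (divided by the sample count)."""
    x = np.asarray(x, dtype=float)
    freqs = np.fft.rfftfreq(len(x), d=sample_interval)
    amplitudes = np.abs(np.fft.rfft(x)) / len(x)
    return freqs, amplitudes

def dominant_frequencies(x, count, sample_interval=1.0):
    """Return the `count` frequencies with the largest magnitudes, sorted."""
    freqs, amplitudes = amplitude_spectrum(x, sample_interval)
    strongest = np.argsort(amplitudes)[::-1][:count]
    return np.sort(freqs[strongest])

# Synthetic 30-day series with a mean level and one oscillation at 0.2
days = np.arange(30)
series = 10 + 5 * np.cos(2 * np.pi * 0.2 * days)
main_freqs = dominant_frequencies(series, count=2)
```

Applying `dominant_frequencies` to each infection series and comparing the outputs is a simple numerical counterpart to the visual comparison of the spectra.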

Figures 9, 10, and 11 indicate that a good set of frequencies for developing the model is $f$ = [0 0.2 0.3 0.5]. Following the same reasoning used in Section 3.1 for identification, the model parameters are calculated by using only the data from one virus, and the predictions of new infections for the other two are obtained by using the same parameters.
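
A sum-of-cosines model with fixed frequencies can be fitted by linear least squares, as in the minimal sketch below. Several points are assumptions rather than details given in the text: the frequencies are taken to be in cycles per sample, each cosine is given a free amplitude and phase by pairing it with a sine term, and the sine term is dropped where it vanishes at every integer sample index (frequencies 0 and 0.5). The function names and the synthetic series are hypothetical.

```python
import numpy as np

FREQS = [0.0, 0.2, 0.3, 0.5]  # from the text; units assumed to be cycles per sample

def design_matrix(k, freqs):
    """Columns cos(2*pi*f*k) and, where not identically zero, sin(2*pi*f*k)."""
    k = np.asarray(k, dtype=float)
    columns = []
    for f in freqs:
        columns.append(np.cos(2 * np.pi * f * k))
        frac = (2 * f) % 1.0
        # The sine vanishes at all integer k when 2*f is an integer
        if min(frac, 1.0 - frac) > 1e-12:
            columns.append(np.sin(2 * np.pi * f * k))
    return np.column_stack(columns)

def fit_fourier(x, freqs):
    """Least-squares coefficients for samples taken at k = 0, 1, ..., len(x)-1."""
    x = np.asarray(x, dtype=float)
    k = np.arange(len(x))
    coeffs, *_ = np.linalg.lstsq(design_matrix(k, freqs), x, rcond=None)
    return coeffs

def predict_fourier(coeffs, freqs, k):
    """Evaluate the fitted model at the sample indices k."""
    return design_matrix(k, freqs) @ coeffs

# Synthetic 30-day series built from the same frequencies (hypothetical data)
days = np.arange(30)
series = (100 + 30 * np.cos(2 * np.pi * 0.2 * days + 0.4)
          + 10 * np.cos(2 * np.pi * 0.3 * days - 1.0)
          + 5 * np.cos(np.pi * days))
coeffs = fit_fourier(series[:25], FREQS)           # identification on 25 samples
forecast = predict_fourier(coeffs, FREQS, days[25:])  # simulation of the last 5
```

Writing each cosine with a phase as a cosine and sine pair keeps the problem linear in the unknowns, so the same least-squares solver used for the autoregressive model applies here.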

To assess the efficiency of the identification strategy with Fourier methods, Figures 12, 13, and 14 show the predicted dynamics of the three viruses, one per figure, with the continuous line representing the real data and the asterisks representing the simulation results. Table 4 summarizes the mean-square errors of these simulations.

As with the autoregressive models, simulations performed with only the parameters calculated for a single virus show that the short-term estimations of new infections are not precise for the other two viruses, as expected, because the same model is used for different viruses. Again, however, the model is able to predict with some accuracy the increasing and decreasing tendencies in their dynamics, which allows preventive policies to be established by using only the data from the propagation of one virus.

4. Conclusions

Two different models for the propagation dynamics of computer viruses were compared, autoregressive and Fourier analysis, and they present similar results. They provide good predictions for three different types of infections by using the data collected for just one of them.

Although not totally satisfactory, these models make it possible to predict increasing and decreasing tendencies in the propagation of a certain type of virus by using the accumulated experience with another one. This could be used to predict and control the infection levels in advance, supporting preventive actions that increase safety and reliability.

Acknowledgments

The first author is supported by FAPESP and CNPq, and the second author is supported by the Brazilian Oil Agency.