#### Abstract

A susceptible-infective-recovered (SIR) epidemiological model based on probabilistic cellular automaton (PCA) is employed for simulating the temporal evolution of the registered cases of chickenpox in Arizona, USA, between 1994 and 2004. At each time step, every individual is in one of the states S, I, or R. The parameters of this model are the probabilities of each individual (each cell forming the PCA lattice) passing from a state to another state. Here, the values of these probabilities are identified by using a genetic algorithm. If nonrealistic values are allowed to the parameters, the predictions present better agreement with the historical series than if they are forced to present realistic values. A discussion about how the size of the PCA lattice affects the quality of the model predictions is presented.

#### 1. Introduction

In control engineering, it is fundamental to identify the system to be controlled in order to determine the best strategy for achieving the intended goals. The task of identifying a system is to determine the best model able of describing the dynamical behavior of such a system from measured data (e.g., [1]). Mathematical models in terms of differential equations for mechanical, electrical, thermal, and chemical systems are commonly obtained from well-known physical laws or conservation principles. For instance, the model of a mechanical system can be derived from the Newton's laws of motion; the model of an electrical system can be deduced from the Maxwell's equations of electromagnetism; models for chemical and thermal systems usually obey principles of energy and mass conservation. Biological systems, however, do not present a clear starting point for obtaining the corresponding mathematical models. In fact, models for such systems are usually empirical, and the identification and the controller synthesis are not usually simple. Moreover, scaling problems pose additional difficulties to accomplish these tasks.

Scaling problems are ubiquitous in nature and prominent in biology. However, such problems are not easily treated, or even recognized, in a plethora of cases. As put by Haldane in the beginning of the 20th century, “The most obvious differences between different animals are differences in size, but for some reason the zoologists have paid singularly little attention to them” (cited in [2]). Interestingly, the same kind of “overlooking” attention is found in many ecological and epidemiological contexts. The question that arises is whether we are prone to obtain reliable identification of systems if this bias is maintained.

There are several approaches to scaling problems, from dimensional analysis to fractal dimensions (e.g., [3, 4]). These approaches are extremely relevant in engineering problems. For instance, the benefits of dimensional analysis and similitude principles both in designing prototypes and in modeling general principles within given conditions are above argument (e.g., [5]). However, ecological/epidemiological problems have a subtle “dimensional architecture.” If system identification can be obtained in these problems, then one of the basic goals of scaling approaches would be fulfilled, namely, “the design of (relevant) experiments and the proper organization of the results” ([5, page 227]; we put the brackets in that phrase). In other words, there should be expected an important gain both in basic sciences and in economical aspects.

Here, we report a numerical study revealing the influence of scale on parametric identification of an epidemiological model used for predicting the temporal evolution of the number of chickenpox cases registered by the Arizona Department of Health Services [6], USA, between 1994 and 2004. The model is based on probabilistic cellular automaton (PCA), where each cell of the lattice corresponds to an individual, which is in one of three states: S, I, or R. The state S represents the individual that is susceptible and therefore subjected to this contagious disease; the state I is related to the individual that is infective and hence can transmit the disease for susceptible ones; the state R is associated to the recovered individual. The numerical values of the probabilities concerning the transitions among these three states define this SIR model.

In Section 2, the epidemiological model is introduced. In Section 3, the genetic algorithm (GA) employed for identifying the probabilities of state transitions in this SIR model is described. In Section 4, the numerical results are shown. In Section 5, a discussion about scaling problems in ecological and epidemiological systems is presented.

#### 2. SIR Model Based on PCA

In our cellular automaton (CA) model [7–9], the population lives in a square matrix formed by cells (individuals) with periodic boundary conditions. At each time step , there is a probability of a -cell being infected, where depends on the number of infective neighbors and it obeys the constraints: (i.e., a susceptible individual can contract the disease only if ) and if , then (i.e., is a monotone increasing function of ). Each -cell has probability per time step of becoming cured and probability per time step of dying because of the disease. At each iteration, infective and recovered cells may die for other causes with probability . When individuals die, susceptible ones replace them. Therefore, the total number of individuals remains constant, a common assumption in epidemiological models (e.g., [10]). The states of all cells are simultaneously updated at each time step .

The coupling topology influences the dynamical behavior of biological, electronic, or social networks (e.g., [9, 11, 12]). In studies on spreading diseases using CA, the contact network among individuals (which defines the neighborhood of each cell in the lattice) can be considered as regular in a first approximation. Hence, the von Neumann (e.g., [13]) and Moore (e.g., [7, 8, 14]) neighborhoods are commonly employed. The von Neumann neighborhood of radius of a cell is formed by the cells orthogonally surrounding such a cell until the distance (i.e., if , the von Neumann neighborhood comprises the 4 closest neighbors: left, right, up, and down); the Moore neighborhood of radius of a cell is constituted by all cells pertaining to the square matrix of size centered in this cell (i.e., if , the Moore neighborhood consists of the 24 cells pertaining to the square matrix of size centered in such a cell).

With this model, we intend to fit the data shown in Table 1 by finding appropriate values for , , , and able of reproducing the number of chickenpox cases recorded in each year, in the State of Arizona, USA, during 11 years. The values of these probabilities are found by using a genetic algorithm (GA).

#### 3. GA Used for Identifying the SIR Model

There have been published several works on identification of transition rules of CA by using GA (e.g., [15–19]).

Here each generation of our GA is composed by 15 chromosomes (15-candidate solutions), where the length of the chromosome depends on the kind and on the radius of the neighborhood. For instance, when the von Neumann neighborhood of is employed, each chromosome is formed by 7 genes (the values of , , , , , , ), when the Moore neighborhood of is used, each one consists of 11 genes (the values of , , , , ). Here, we investigate both neighborhoods for and .

The fitness of every chromosome is evaluated by simulations with the PCA model. The value of is high if the numbers of infective individuals in the PCA lattice are close to the data composing the historical time series. The fitness function is written as , where is defined here by and by where is a signal function: if and ; otherwise, is the total number of years composing the time series (here according to Table 1); labels each year of the epidemiological series. Thus, corresponds to the second year (1995), and is the initial year (corresponding to 1994; in fact, the number of infective cases in this year is the initial condition of the PCA simulations). is the number of historical cases registered in the th year; is the number of infective individuals obtained in the PCA model in the th year in the th simulation. Every chromosome is evaluated times (here ) in lattices , where the numerical values of the initial conditions are always the same, but the corresponding geographical distributions of susceptible, infective, and recovered individuals may be altered from one simulation to another.

The function is the sum of the modules of the differences between the data and the number of cases obtained in simulations by using the probabilities corresponding to that chromosome; thus, gives the total error between the historical series and the series produced by simulating times the disease spreading in the lattice. The higher , the lower .

The function is the number of segments of the time series, where the registered cases and the predicted values present contrary tendencies of evolution in simulations. For instance, if in two consecutive years the numbers in Table 1 decreased but the numbers of infective cases in the th simulation increased (i.e., if both curves have slopes with different signals), then ; if they have the same tendency of variation, this term is zero. The higher , the lower .

An initial population of chromosomes is randomly generated, and the value of their genes must always be real numbers between zero and one, that is, , because they represent probabilities. Also the genes representing must obey the constraint: for . The best seven chromosomes are carried over to the next GA generation without alteration. The other eight chromosomes (in order to complete 15) are obtained from the current generation by applying crossover and mutation operations.

Crossover is performed by selecting a single point in two randomly picked chromosomes (the parents) and swapping of the genes between them. Thus, two new chromosomes (the children) are created. The number of child chromosomes produced by crossover is seven.

Every parent chromosome has of chance of suffering mutation and producing a child chromosome. Every gene composing such a chromosome has of chance of changing its value. If a gene is selected to suffer mutation, then a number between and is added to the value of this gene (if after this genetic operation , then we impose ; if , then we impose in order to get ).

After applying these genetic operations for producing new chromosomes, the values of in such chromosomes are reordered (if necessary) because must be a monotone increasing function of .

To choose eight chromosomes for the next GA generation, the pool of parents and children is organized in a crescent order in terms of (excluding the seven fitter candidate solutions). Then, eight chromosomes are selected by applying eight times the expression: where is the position of the chromosome in such an ordering that will be chosen for composing the next generation; is the highest integer number lesser than or equal to ; is the length of this ordering (the total number of candidate solutions excluding those seven elite chromosomes); is a random number. The idea of using instead of is to favor the selection of chromosomes in the end of this ordering (with higher value of ). The total number of generations of this GA is 25.

Notice that the value of does not depend on the lattice size ; it only depends on the number of data of the historical series and on the number of simulations. Thus, . However, is dependent on . In fact, . In order to compare the quality of predictions obtained in PCA lattices with different sizes, we propose a metric called efficiency written as . The value of is obtained from by , and the value of is obtained from by where is given by and is the average value of the historical series excluding the first datum; thus, Notice that if (i.e., the numbers of infective individuals generated in the PCA lattice of size in all simulations are equal to the numbers of registered cases converted to this spatial scale), then ; if (i.e., the predictions in simulations give a value of corresponding to the average deviation presented by the historical series), then .

Because and usually , then . The efficiency is calculated only for the fitter chromosome of the last GA generation. In the first phrase of this paragraph, “usually” means that, at the end of the evolutionary process, the value of corresponding to the fitter chromosome commonly obeys the constraint (hence ). In fact, this always happened in our simulations. Obviously, the higher the value of , the better the PCA predictions.

#### 4. Results

About 5 million people live in Arizona. For a lattice of size , the cases of chickenpox must be multiplied by as shown in the third column of Table 1, in order to convert the historical data to this spatial scale.

Each time step of the PCA is equivalent to 2 months. Thus, to predict the number of infective individuals from one year to another, the PCA must be iterated by 6 time steps.

At first, we performed simulations considering von Neumann and Moore neighborhoods of or , and time steps per year. After 25 GA generations, the fitter chromosome was obtained with Moore neighborhood of , and it was formed by the following genes: varies from (for ) to (for ), , , . The efficiency of this candidate solution was . Observe that the numbers obtained for and are far from being realistic because these values mean that more than of infective and recovered individuals die and are replaced by susceptible ones at each two months. The value of should be about because about of the population die at each two months [20]. The usual death rate of chickenpox is 10 per 100 000 cases [21]; hence should be about . The value of found by the GA is small, since more than of infected individuals become cured in 2 months. In spite of these values being unrealistic, this set of probabilities can reasonably approximate the historical series as shown in Figure 1 (see the solid and the dashed lines).

In order to obtain realistic values for the state transition probabilities, the limit was imposed to the evolutionary algorithm. In this case, the fitter chromosome (after 25 GA generations) was found with Moore neighborhood of , and its efficiency was , a worst result than the previous one. Figure 1 also exhibits a temporal evolution (see the dotted line) of the number of infective individuals simulated by employing the PCA model with the corresponding chromosome.

Then, more modifications on the GA were accomplished. Now each time step of the PCA can correspond to 1 or 2 months (thus, 12 or 6 time steps are necessary to complete 1 year, resp.). And additional limits were imposed to the following genes: . With such modifications, the efficiency of the better candidate solution decreases to , and it was obtained for von Neumann neighborhood of and 6 times steps per year. The values of vary from to , , , and . These values are still unrealistic. Figure 1 presents a temporal evolution of the PCA with these probabilities (see the dash-dotted line).

In the simulations reported above, and consequently the spatial scale is 1:20 (one cell of the lattice corresponds to 20 individuals living in Arizona). For , the scale is . By using this new spatial scale and the additional limits to the death and cure probabilities, the best solution, with , was found with von Neumann neighborhood of and 12 time steps per year. The values of vary from to , , , and . A temporal evolution of the infective cases in PCA is presented in Figure 2 (see the dashed line).

In the scale 1:5, if those additional limits on , , and are removed, the best chromosome was found with Moore neighborhood of and 6 time steps per year. It has the highest efficiency found in all numerical experiments, ; however, the probabilities are again unrealistic. For instance, this chromosome presents (an absurd value, of course). Figure 2 also illustrates the temporal evolution of the corresponding PCA. Observe the good agreement between the historical series and a PCA simulation (see the dotted line).

#### 5. Discussion

We found that the size of the PCA lattice affects the quality of the predictions of our epidemiological model, after being identified by using a GA. The best set of state transition probabilities, with , was obtained in the scale 1:5. By imposing realistic constraints to the values of these probabilities, the efficiency of the best solution presented . In the scale 1:20, the best solution respecting such constraints dropped to .

Here, we had to deal with the problem of scaling the lattice of the CA in relation to the number of individuals, healthy and infective ones. The most obvious first approach is to maintain the ratios from the original data, that is, a linear relation of size. In this sense, space is completely embedded in the number of individuals, which, ultimately, becomes the size reference frame. One potential problem in doing this is that the neighborhood (either Moore or von Neumann) maintains, therefore, its absolute size. Let us exemplify this.

Epidemiological models based on differential equations can be used for describing the spreading of infectious diseases (e.g., [7–10]). These equations are a mean-field approximation for PCA if the three groups (S, I, and R) are homogeneously distributed in space. A parameter called basic reproduction number, , is obtained from the infectiveness, mortality, and recovery rate constants (e.g., [7–10]). The value of identifies the outbreak or the extinction of the disease (usually represents extinction and represents outbreak, e.g., [7–10]). It was suggested that “to make the 's equivalent,” the transmission probability of the local model should be the global (mean-field) probability times the ratio population size to neighborhood size ([22, page 116]). However, as it can be seen, our results indicate that such a linear proportionality may not hold, at least to estimate reliable parameters from real data.

Turner et al. [23] studied how spatial scale interferes with the identification of estimators in landscapes indexes as diversity and dominance. They concluded that both quantitative and qualitative changes in the measurements would be present depending on how the scale is defined. Later, in the same lines, Dungan et al. [24] called into attention that the various spatial scaling measures (as grain, resolution, range) are not interchangeable and suggested avoiding the term scale. As the present study showed, this shortcoming in space is readily present when size is an inherited variable of individuals.

DeAngelis and Petersen [25] focused in a predation model, where salmons migrate through predators and concluded that the size resolution of the cells has a crucial importance in the predicted outcome. The authors emphasize that fewer cells (therefore, a coarse spatial resolution) over predict the mortality rate. Moreover, irrespectively to the spatial resolution, the models give different predictions than the mean-field ones. On the other hand, Pascual et al. [26] concluded that despite variations in the predicted rates due to scaling problems, the functional forms are quite the same as the well-mixed (mean-field) models.

The discussion of system identification in ecological/epidemiological studies and scaling issues is still an open question. Our investigation goes in the same direction as many previous ones (see above), that is, that size matters. We add to the query the problem of interchanging space with individuals (and their neighborhoods). Therefore, if modeling is concerned with real predictions much more than simple data fitting, then a judicious evaluation of the underlying scale should be performed as a first step. It seems that these ecological/epidemiological problems need their particular definitions of size, whether this means space itself or number of individuals or a combination of both. In other words, the geometry for scaling these systems is still awaiting the due attention.

#### Acknowledgment

L. H. A. Monteiro is partially supported by CNPq.