#### Abstract

Corrosion occurs in many engineering structures such as bridges, pipelines, and refineries; it gradually destroys materials and thus shortens their lifespan. It is therefore crucial to assess the structural integrity of engineering structures that are approaching or exceeding their designed lifespan in order to ensure their correct functioning, for example, their carrying capacity and safety. An understanding of corrosion and an ability to predict the corrosion rate of a material in a particular environment play a vital role in evaluating the residual life of the material. In this paper we investigate the use of genetic programming and genetic algorithms in the derivation of corrosion-rate expressions for steel and zinc. Genetic programming is used to automatically evolve corrosion-rate expressions, while a genetic algorithm is used to evolve the parameters of an already engineered corrosion-rate expression. We show that both evolutionary techniques yield corrosion-rate expressions that have good accuracy.

#### 1. Introduction

Corrosion is a natural phenomenon that can cause substantial economic and environmental losses, which result from the damage incurred in metal constructions over the years. The cost of corrosion has been reported [1, 2] to be as large as 3.1% of the gross domestic product of countries such as the United States, United Kingdom, and Australia. Corrosion costs can be (i) direct, when the metallic structure is so damaged that replacement or expensive maintenance is required, or (ii) indirect, when the worsened appearance of the construction reduces its value (even if the construction is not greatly damaged and can still be used).

Corrosion refers to the disintegration of materials into their constituent atoms because of chemical or electrochemical reactions with the environment [3]. This disintegration causes a loss in the thickness of the construction, which results in a decrease in resistance and strength and consequently a decrease in the service performance of the construction. Corrosion occurs in many engineering structures such as bridges, pipelines, and refineries, where it gradually destroys materials and hence shortens their lifespan.

Corrosion can occur in many environments such as the atmosphere, soil, and sea, where environmental factors affect the material through complicated processes that lead to its corrosion. Depending on the environment, corrosion can be atmospheric, underground, marine, gaseous, or microbial and bacterial. Atmospheric corrosion is the type of corrosion we are mainly interested in because (i) it has been reported that atmospheric corrosion is responsible for more corrosion-induced failures than any other corrosion type [4] and (ii) it is the predominant corrosion type at SABIC [5] industrial sites, where the findings of this work are going to be applied.

Because of the huge impact of corrosion on the economy and the environment, an understanding of corrosion and an ability to predict the corrosion rate of a material in a particular environment play a vital role in evaluating the residual life of the material and consequently reducing associated costs. In order to understand and predict corrosion, we must model the environmental factors that influence corrosion and derive relationships between them and the rate of the resulting corrosion.

In this paper, we propose the use of genetic programming [6] and genetic algorithms [7] to derive corrosion-rate expressions in terms of the major influential environmental factors.

The rest of the paper is structured as follows. In Section 2, we formulate the problem of deriving corrosion-rate expressions mathematically. In Section 3, we describe the methodology of our work, in which we use genetic programming and genetic algorithms. In Section 4, we conduct an empirical evaluation of our work. In Section 5, we discuss our findings. In Section 6, we review related work in the automatic derivation of corrosion-rate expressions. Finally, in Section 7, we draw conclusions and set directions for future work.

#### 2. Problem Formulation

The problem of identifying a corrosion model reduces to defining a function that expresses corrosion rate in terms of the environmental factors that cause it. Such environmental factors differ from one site to another and include temperature, air humidity, wetness, acidity, concentration of particular chemicals, and so forth. The interaction between the environmental factors and the metal causes corrosion over time. The major influential environmental factors in atmospheric corrosion in the literature are the following.

(i) Temperature: measured in degrees Celsius; an increase in temperature stimulates corrosion by increasing the rate of electrochemical reactions and diffusion processes [4].

(ii) Time of wetness: the time during which the environment's relative humidity is greater than 80% and the average temperature is above 0 °C; under these conditions an electrolyte film forms on the metal, causing its corrosion [4, 8].

(iii) Sulfur: the concentration of the contaminant; sulfur stimulates electrochemical reactions in the electrolyte layer formed on the metal by humidity above 60% to 70% [8].

(iv) Chloride: the concentration of the contaminant; chloride prevents the creation of protective oxide layers on the metal, which accelerates the corrosion process [9].

(v) Exposure time: the time interval over which the measurements of the previous environmental factors took place.

In our work, the function that represents the corrosion rate has five inputs, one for each of the environmental factors listed above. We shall use the following representation: the environmental factors form an $n$-by-$5$ matrix $X$, which is input to the function $f$, and the resulting output corrosion rates form the $n$-by-$1$ vector $y$. Here, $n$ is the number of observations where the values of the environmental factors are recorded together with the corresponding corrosion rates. The corrosion model that we want to identify is thus $y = f(X)$.

#### 3. Methodology

In order to identify the corrosion-rate function, we first collect a matrix of data with one row per experiment; each row records the values of the environmental factors together with the resulting corrosion rate. After that, the set of observations is split into two parts: one for building the model and one for evaluating its accuracy. The part of the data used for building the model is usually the larger one, and the part used to evaluate the model is the remainder. This division is not set in stone and can be changed while balancing two opposing criteria: (i) the dataset used in training should be as large as possible to account for a diversity of data points while deriving the expression of interest and (ii) the dataset used in evaluation should be as large as possible so that overfitting in the derived model can be detected.
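The split described above can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the function name `split_observations`, the 70/30 ratio, and the placeholder rows are all assumptions made for the example.

```python
import random

def split_observations(rows, train_fraction=0.7, seed=0):
    """Shuffle the observations and split them into training and testing parts.

    train_fraction=0.7 is an assumed default, not a value taken from the paper.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Placeholder rows (NOT real measurements): five environmental factors
# followed by the observed corrosion rate.
observations = [tuple(float(i) for _ in range(6)) for i in range(10)]

train, test = split_observations(observations)
```

Shuffling before cutting keeps the two parts from inheriting any ordering present in the collected data; a fixed seed makes the split reproducible.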

After the data have been collected, we apply the evolutionary technique of interest, that is, genetic programming (Section 3.1) or genetic algorithms (Section 3.2), in order to determine the corrosion-rate function.

##### 3.1. Genetic Programming

Genetic programming (GP) [6] is a bioinspired computer algorithm that mimics natural evolution of living organisms. It is similar to genetic algorithms with the exception that individuals are computer programs as opposed to vectors of values.

The objective of GP is to evolve a computer program that solves a given problem. In order to do so, a population of computer programs called individuals—that are randomly generated initially—is evolved across a number of generations. The evolution of the population involves the exchange of genetic material between the individuals through crossover operations and the alteration of the genetic material of single individuals through mutation operations. A selection strategy is applied to the individuals of a population in a given generation to decide which ones are allowed to proceed to the next generation. Such selection is based on the fitness of the individuals which is a problem-dependent value that specifies the goodness of an individual in solving the problem at hand. The evolution continues until a good-enough individual that solves the problem adequately is found, or until a maximum number of generations is reached.

Each individual in the population is a program represented by its abstract-syntax tree (AST). All nonleaf nodes of the AST represent operators, and leaf nodes represent problem variables or constant values. Crossing over two programs means taking one or more subtrees from the first program and inserting them into the second program, and taking one or more subtrees from the second program and inserting them into the first program (crossovers can be single point or multiple point). Mutating means changing the content of one or more nodes in the AST.
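To make the two operators concrete, the following sketch encodes an AST as nested tuples of the form `(operator, left, right)` with leaves that are either the variable name `"x"` or a numeric constant. The encoding, the operator set, and all function names are assumptions chosen for illustration; they are not the data structures of the GP system used in this work.

```python
import random

OPERATORS = ("add", "sub", "mul")  # assumed operator set for this sketch

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def put(tree, path, new_subtree):
    """Return a copy of tree with the node at path replaced by new_subtree."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (put(tree[i], path[1:], new_subtree),) + tree[i + 1:]

def crossover(a, b, rng):
    """Single-point crossover: swap one random subtree of a with one of b."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return put(a, pa, get(b, pb)), put(b, pb, get(a, pa))

def mutate(tree, rng):
    """Change the content of one randomly chosen node."""
    p = rng.choice(list(paths(tree)))
    node = get(tree, p)
    if isinstance(node, tuple):
        # Operator node: swap the operator, keep the children.
        new_node = (rng.choice(OPERATORS),) + node[1:]
    else:
        # Leaf node: become the variable or a fresh random constant.
        new_node = rng.choice(("x", rng.uniform(-1.0, 1.0)))
    return put(tree, p, new_node)

def size(tree):
    """Number of nodes in the tree."""
    return sum(1 for _ in paths(tree))

rng = random.Random(0)
parent_a = ("add", ("mul", "x", "x"), 1.0)  # x*x + 1
parent_b = ("sub", "x", 2.0)                # x - 2
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = mutate(parent_a, rng)
```

Because tuples are immutable, both operators return new trees and leave the parents untouched, which is convenient when the same parent is selected more than once in a generation.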

GP can be used to solve a variety of optimization problems, among them symbolic regression, which we describe here because it is the essence of our approach. To solve a symbolic-regression problem (also known as a function-discovery problem), the genetic program (we use GP to refer to both “genetic programming” and “genetic program”) takes as input a set of observations of values of some variable $x$ and a set of observations of values of some variable $y$, and tries to identify the function $f$ such that $y = f(x)$ is true for all pairs $(x, y)$ in the observations, and also true outside the observations. The function $f$ to be determined is a computer program that will be evolved by the GP. The initial population of the GP contains a number of randomly generated functions, each represented as an AST; for example, Figure 1 shows the AST representation of some function in the GP. The functions are crossed over and mutated over generations to produce new, fitter functions. The fitness of a candidate function $g$ is calculated as the sum of the differences $|y - g(x)|$ for all pairs $(x, y)$ in the observations. At the end, the GP has discovered either $f$ or another function of equal or inferior fitness. Notice that the GP can derive some function $g$ that satisfies $y = g(x)$ for all observed data pairs but does not satisfy it in the general case, that is, for some unobserved pairs $(x, y)$ the relation $y \neq g(x)$ holds. This is a classic case of overfitting and is usually (partially) tackled by dividing the observations into a training part used during evolution and a testing part used after evolution to give an indication of how well the derived function generalizes to new unseen data. Should the derived function not generalize well enough, evolution is restarted with the derived function injected into the initial population of the new GP run.
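The fitness computation and the overfitting issue can be illustrated with a small sketch. It assumes a nested-tuple AST encoding, a single input variable, and a fitness equal to the sum of absolute differences between observed and predicted values; the data are synthetic, generated from $y = x^2 + 1$ purely for the example.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def evaluate(tree, x):
    """Evaluate an AST given as nested tuples (op, left, right) at input x."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree  # leaf: the variable or a constant

def fitness(tree, observations):
    """Sum of absolute differences over all (x, y) pairs; lower is fitter."""
    return sum(abs(y - evaluate(tree, x)) for x, y in observations)

# Synthetic observations of y = x*x + 1 (illustrative only).
observations = [(x, x * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
training, testing = observations[:2], observations[2:]

exact = ("add", ("mul", "x", "x"), 1.0)  # x*x + 1
rough = ("add", "x", 1.0)                # x + 1

# "rough" matches every training pair but fails on the held-out pairs,
# which is exactly the overfitting situation described above.
train_error = fitness(rough, training)
test_error = fitness(rough, testing)
```

Here `rough` is indistinguishable from `exact` on the training part, so only the held-out part reveals that it does not generalize.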

##### 3.2. Genetic Algorithms

Genetic algorithms (GAs) [7] work in a very similar fashion to genetic programming, except that each individual during evolution is an array as opposed to a tree. This means that GAs cannot be used to evolve a symbolic expression as GP does, since evolving an expression requires the ability to evolve an AST, not a flat array. However, a GA can be used as a powerful regression tool to estimate the coefficients of an expression whose structure is already known. For example, if we know the form of some function $f$ of the independent variable $x$ up to two constants $a$ and $b$, we can evolve the array $[a, b]$ such that the difference between $f(x)$ and the observed value $y$ is minimal over all observations (as discussed in Section 3.1). The use of GAs in this case is similar to the use of linear and nonlinear regression, with the added advantage that a GA can escape local minima.
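As a rough sketch of this use of a GA, the code below evolves a two-element array of coefficients for an expression of known structure. The linear form `a * x + b`, the tournament selection, the uniform crossover, the Gaussian mutation, and every numeric setting are assumptions picked for illustration; they are not the expression or the GA configuration used in the experiments.

```python
import random

def sse(params, data):
    """Sum of squared errors of the assumed model y = a*x + b."""
    a, b = params
    return sum((y - (a * x + b)) ** 2 for x, y in data)

def evolve(data, pop_size=40, generations=100, seed=1):
    """Evolve [a, b]; returns the best individual and the best error per generation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5), rng.uniform(-5, 5)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda p: sse(p, data))
        history.append(sse(pop[0], data))
        nxt = [pop[0][:]]  # elitism: the best individual always survives
        while len(nxt) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = min(rng.sample(pop, 3), key=lambda p: sse(p, data))
            p2 = min(rng.sample(pop, 3), key=lambda p: sse(p, data))
            # Uniform crossover: each gene comes from either parent.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            # Gaussian mutation of every gene.
            child = [g + rng.gauss(0.0, 0.1) for g in child]
            nxt.append(child)
        pop = nxt
    best = min(pop, key=lambda p: sse(p, data))
    history.append(sse(best, data))
    return best, history

# Synthetic data from y = 2x + 1 (illustrative only).
data = [(float(x), 2.0 * x + 1.0) for x in range(6)]
best, history = evolve(data)
```

Because of elitism, the best error recorded in `history` never increases from one generation to the next, which makes the run easy to monitor.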

#### 4. Evaluation

We use the datasets available from [10] to conduct our experiments. The datasets show the corrosion rates of two metals: steel and zinc. Corrosion for both metals is measured against the five most influential environmental factors: temperature, time of wetness, concentration of sulfur dioxide, concentration of chloride, and exposure time, as explained in Section 2. Tables 1 and 2 show some relevant statistics about the datasets we use in our experiments. In the following, we show the results of applying each evolutionary technique to determine an expression or model of the corrosion rates of steel and zinc.

##### 4.1. Results of Using GP

The GP was run using the parameters shown in Table 3. The fitness function is an aggregate, computed as the sum of the mean squared error (MSE) and the complexity of the solution, measured as the number of nodes in the resulting corrosion expression. The size of the expression was added as a penalty to guide the evolution process towards small expressions. The expressions obtained for the corrosion rates of steel and zinc are shown in Tables 4 and 5, respectively, in decreasing order of goodness-of-fit score. The goodness-of-fit value in Tables 4 and 5 is obtained by performing a regression analysis between the model output (i.e., the corrosion-rate values obtained using the derived expression) and the corresponding target (i.e., the corrosion-rate values available in the dataset). The closer this value is to 1, the better the model fits the target data.
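The aggregate fitness can be sketched as below, again assuming a nested-tuple AST and a single input variable for brevity (the actual expressions take five inputs). The unweighted sum of MSE and node count follows the description above; the names and the toy data are illustrative assumptions.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def evaluate(tree, x):
    """Evaluate an AST given as nested tuples (op, left, right) at input x."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree

def node_count(tree):
    """Complexity of an expression: the number of nodes in its AST."""
    if isinstance(tree, tuple):
        return 1 + sum(node_count(child) for child in tree[1:])
    return 1

def mse(tree, observations):
    """Mean squared error of the expression over all (x, y) pairs."""
    return sum((y - evaluate(tree, x)) ** 2 for x, y in observations) / len(observations)

def penalized_fitness(tree, observations):
    """Aggregate fitness: MSE plus expression size (lower is better)."""
    return mse(tree, observations) + node_count(tree)

# Toy data from y = x + 1 (illustrative only).
observations = [(x, x + 1.0) for x in (0.0, 1.0, 2.0)]
compact = ("add", "x", 1.0)                 # x + 1
bloated = ("add", ("add", "x", 0.0), 1.0)   # (x + 0) + 1

scores = (penalized_fitness(compact, observations),
          penalized_fitness(bloated, observations))
```

Both expressions fit the data perfectly, so the size penalty alone decides between them and the compact one wins. One design caveat: when MSE values are large, an unweighted node count contributes little, so the strength of the size pressure depends on the scale of the data.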

Figures 2, 3, 4, 5, and 6 show the goodness of fit of the derived GP expressions for steel according to their order in Table 4, and Figures 7, 8, 9, 10, and 11 show the goodness of fit of the derived GP expressions for zinc according to their order in Table 5. In these figures, the variable *Target* on the $x$-axis shows the measured corrosion rate in the datasets, while the variable *Output* on the $y$-axis shows the corrosion rate estimated by the respective GP expression for the same dataset point. The reported results were obtained using our own GP implementation. Although there are robust GP systems available, such as Eureqa Formulize [12], we got the best results using our own GP system, especially because we forced all environmental factors to appear in the final symbolic expression, something we had little control over in Eureqa Formulize.

As can be seen, the GP expressions have very high goodness-of-fit values despite the necessarily noisy datasets. The GP ran for around 60 minutes to derive the corrosion-rate expressions of each metal before reaching the maximum number of generations.

##### 4.2. Results of Using GAs

We used the robust MATLAB ga library to obtain the best results. We used the corrosion expression (1) from [13], whose constants are evolved using the GA. Table 6 lists the GA parameters used in the evaluation.

Initially, the GA yielded very inaccurate expressions for steel and zinc, which was rather unexpected. However, a closer investigation revealed that the GA was not performing protected division; that is, when the fitness of an individual was evaluated with parameter values that caused a division by zero, the resulting fitness values were erroneous.

To circumvent this problem, we penalized individuals with such parameter values by assigning poor fitness values to them. The GA corrosion expressions for steel and zinc are shown in (2) and (3), respectively, with MSE values of 5791 and 15.9, respectively. The goodness-of-fit plots are shown in Figure 12 for steel and Figure 13 for zinc.
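The penalty idea can be sketched as follows. The model form `x / a + 1 / b` is a hypothetical stand-in with parameters in the denominators (it is not expression (1) from [13]), and the constant `PENALTY` is an arbitrary value chosen for the example.

```python
PENALTY = 1e12  # arbitrary "poor fitness" value assumed for this sketch

def model(params, x):
    """Hypothetical expression with both parameters used as divisors."""
    a, b = params
    return x / a + 1.0 / b

def fitness(params, data):
    """MSE of the model, or a fixed penalty if a divisor parameter is zero."""
    if any(p == 0 for p in params):
        # Reject the individual instead of evaluating an undefined expression.
        return PENALTY
    return sum((y - model(params, x)) ** 2 for x, y in data) / len(data)

# Synthetic data generated from a = 2, b = 4 (illustrative only).
data = [(x, x / 2.0 + 1.0 / 4.0) for x in (1.0, 2.0, 3.0)]

good = fitness((2.0, 4.0), data)
invalid = fitness((0.0, 4.0), data)
```

Returning a large finite value rather than raising an error keeps such individuals comparable during selection, so they are simply outcompeted by valid ones.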

As can be seen, the GA expressions are also accurate. The GA ran for around 15 minutes to derive each of the reported expressions before reaching the maximum number of generations.

#### 5. Discussion

Table 7 shows a summary of the accuracy of the two evolutionary techniques for predicting corrosion rates for steel and zinc.

In terms of usefulness, the GP expressions are superior to the GA expressions because they are derived automatically, that is, without knowledge about the structure of the target corrosion-rate expression, whereas we assume a specific corrosion-expression structure for GAs and evolve its parameters. The explicit GP corrosion-rate expressions give more insight into the corrosion process because they show how corrosion rate is affected by the environmental factors.

The datasets used in the experiments are characterized by the presence of a number of outliers which can either be (i) genuine data points where corrosion rate deviates significantly from average because of the inherent complexity of the corrosion process or (ii) erroneous data points that result from faults in measurement devices, human mistakes during data entry, and so forth. The analysis presented in this paper assumes case (i), that is, all data points are assumed to be valid.

As can be seen from the results, the slope of the fitting line is seemingly controlled by outliers. To investigate this issue further, we repeated the GP evolution with better handling of the outliers, using two methods. First, we used a logarithmic distance measure instead of the squared error. The outliers in the dataset are data points of large magnitude, which means that if they do not lie close to the curve during evolution, they affect the fitness of the solution substantially when their distance from the curve is measured by, for example, the squared error. When the distance is logarithmic, the effect of outliers on the evolution of the curve is significantly reduced. Second, we removed the outliers altogether and used the squared error as we did previously.
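The damping effect of a logarithmic distance can be seen in a short numeric sketch. The exact logarithmic measure used in the experiments is not recoverable from the text, so the form `log(1 + |y - y_hat|)` below is an assumption, as are the residual values.

```python
import math

def squared_error(y, y_hat):
    """Squared distance between an observed and a predicted value."""
    return (y - y_hat) ** 2

def log_error(y, y_hat):
    """Assumed logarithmic distance: log(1 + |y - y_hat|)."""
    return math.log1p(abs(y - y_hat))

# Illustrative residuals: one ordinary point and one outlier far from the curve.
ordinary = (10.0, 12.0)   # (observed, predicted), residual of 2
outlier = (1000.0, 0.0)   # residual of 1000

# How many times more the outlier weighs than the ordinary point.
squared_ratio = squared_error(*outlier) / squared_error(*ordinary)
log_ratio = log_error(*outlier) / log_error(*ordinary)
```

Under the squared error the outlier outweighs the ordinary point by a factor of 250000, whereas under the assumed logarithmic measure the factor stays in the single digits, so a few large-magnitude points no longer dominate the fitness.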

The resulting corrosion-rate expression for steel using the logarithmic error is shown in (4); the resulting corrosion-rate expression for zinc using the logarithmic error is shown in (5); the resulting corrosion-rate expression for steel after dropping outliers is shown in (6); and the resulting corrosion-rate expression for zinc after dropping outliers is shown in (7). Figures 14, 15, 16, and 17 show the goodness of fit of the corrosion-rate expressions (4), (5), (6), and (7), respectively.

As can be seen from the results, using GP still gave good results when outliers were eliminated and when their effect was significantly reduced.

#### 6. Related Work

Corrosion modeling is not a novel research area. Many corrosion models have been developed in the literature to yield expressions of metal corrosion as shown in Table 8. Most of these models do not take into account all five environmental factors that we consider in this work (see Table 8).

In addition to this, corrosion modeling has also been attempted using artificial neural networks in numerous works including [10, 20–22] and using support vector regression [11]; however, these techniques do not yield explicit corrosion-rate expressions.

#### 7. Conclusions

In this paper, we have developed a corrosion model based on two evolutionary computation techniques, namely, genetic programming and genetic algorithms. Both techniques yielded corrosion-rate expressions with good accuracy, with genetic programming being superior because it can learn the structure of the corrosion expression without prior knowledge. The findings of this work will allow a better understanding of the corrosion phenomenon in terms of cause and effect, so that necessary action such as prevention measures can be carried out.

#### Acknowledgment

This research is funded by the Institute of Consulting Research & studies, Umm Al-Qura University, Makka, Saudi Arabia, Grant no. S2011-2.