Abstract

For large-scale optimization, CMA-ES suffers from high complexity and premature stagnation. This paper proposes an improved CMA-ES algorithm called GI-ES. To address the complexity problem, the method replaces the explicit calculation of a covariance matrix with the modeling of expected fitness scores for a given covariance matrix. To address premature stagnation, it replaces the historical information of elite individuals with the historical information of all individuals. This information can be viewed as an approximate gradient, and the parameters of the next generation of individuals are generated based on this approximate gradient. The algorithm was tested on the CEC2010 and CEC2013 LSGO benchmark test suites, and the experimental results verify its effectiveness on a number of different tasks.

1. Introduction

Most machine learning problems can be modeled as optimization problems, and as the amount of data increases, models become more and more complex. The problem of large-scale optimization has therefore become a focus for many researchers. The covariance matrix adaptation evolution strategy (CMA-ES) [1] is one of the most powerful evolution strategies for global optimization. It is an algorithm based on the population probability distribution, closely related to the estimation of distribution algorithm (EDA) [2]. A common shortcoming of simple evolution strategies is that the standard deviation of the sampling noise is fixed. CMA-ES automatically adjusts the standard deviation according to the distribution of the population, which brings two benefits: (1) it increases the diversity of the population, so that the algorithm can escape local optima; (2) it adapts the standard deviation parameters, so that the algorithm can adjust to the fitness landscape.

In addition, because CMA-ES can use the information of the optimal solutions to adjust its parameters, it can expand the search scope when the optimal solution is far away and narrow it when the optimal solution is near. Owing to these advantages, CMA-ES, as one of the most popular gradient-free optimization algorithms, has become the choice of many researchers and practitioners.

Despite its huge advantages in solving optimization problems, CMA-ES suffers from some limitations when dealing with LSOPs:

Time-consuming: the basic operations of CMA-ES are built on the covariance matrix, which requires O(n²) space per generation, and sampling the new population requires an additional matrix decomposition costing O(n³) time. CMA-ES relies on spectral decomposition to handle numerical errors and ill-conditioning, which is generally considered computationally inefficient compared with other decomposition techniques.

Lack of diversity in the population: another obvious disadvantage is that CMA-ES evaluates only some of the best individuals. Although this can speed up convergence to some extent, the strategy discards most of the available information, which prevents CMA-ES from performing well on ill-conditioned problems. By analogy, successful people have qualities worth learning from, but people who fail also keep a record of what not to do; both kinds of information are important for the next generation to make better evaluations.

These two limitations may prevent the use of CMA-ES on LSGO problems.

To solve the above problems, this paper proposes an evolution strategy based on gradient information utilization (GI-ES), which extends the application of CMA-ES to the field of large-scale optimization. The characteristics of GI-ES are as follows: (1) GI-ES optimizes the expected fitness score of each sampling scheme; if the expected result is good enough, the best-performing scheme is likely to perform even better in the next sampled generation. Maximizing the expected fitness score of a sampling scheme is equivalent to maximizing the overall fitness score of the samples it generates. (2) Gradient information is approximated from the information of all individuals, and this approximate gradient is used to guide the search. (3) GI-ES uses the approximate gradient as the search direction, allowing the algorithm to adapt to the fitness landscape on which the variables depend. The gradient signal is obtained through maximum likelihood estimation on the model described above. GI-ES differs from traditional evolutionary computation in that it represents the “population” as a parameterized distribution and uses a search gradient, computed from the fitness values of historical samples, to update the parameters of that distribution.

Following this introduction, Section 2 discusses related work on large-scale optimization and on strategies for utilizing historical information in evolutionary computation. Section 3 introduces the background and motivation of this paper. Section 4 describes the detailed implementation of GI-ES. Thereafter, simulation results on the benchmark test suites are examined in Section 5 to evaluate the effectiveness of the proposed approach. Finally, Section 6 concludes the paper.

2. Related Work

In recent years, large-scale optimization has become a hot research topic, and many large-scale benchmark functions have been put forward to examine the merits of large-scale optimization algorithms. Researchers have done a great deal of useful work and proposed many solutions to large-scale global optimization problems.

There are currently two main research directions for large-scale optimization problems: (1) decomposition-based algorithms, in which dimensionality reduction (decomposition) is carried out by grouping variables, so that a large-scale problem is decomposed into multiple subproblems; the subproblems are then optimized with evolutionary algorithms within the framework of cooperative coevolution (CC); (2) algorithms that, instead of directly decomposing the problem, carry out the optimization by combining multiple local search methods, each with its own set of parameters.

The rest of this section will further discuss some algorithms in the recent CEC large-scale optimization competition.

CEC2008 hosted the first known LSGO competition, and multiple trajectory search (MTS) [3] was its first winner. MTS-LS1, MTS-LS2, and MTS-LS3 are improved versions of MTS. Some DE-based algorithms have also achieved good results in LSGO competitions. Self-adaptive differential evolution with multi-trajectory search (SaDE-MMTS) [4] is a hybrid algorithm that integrates JADE [5] and a modified MTS-LS1. MA-SW-Chains [6] is an extension of MA-CMA-Chains in which CMA was replaced with Solis-Wets (SW).

In the CEC2010 competition, MA-SW-Chains was the winner, and the ensemble optimization evolutionary algorithm (EOEA) [7] was ranked second. In EOEA, the optimization process is divided into two stages, namely, global shrinking and local exploration. In the first stage, EOEA uses an EDA based on the mixed Gaussian and Cauchy model (MUEDA) to achieve faster convergence, while the goal of the second stage is exploitation. Third place in CEC2010 went to the differential ant-stigmergy algorithm (DASA) [8], which tries to solve LSGO by converting the real-parameter optimization problem into a graph search problem.

Improved multiple offspring sampling (MOS) [9] was the winner of CEC2012. MOS combines SW and MTS-LS as two local searches. The self-adaptive differential evolution algorithm jDElsgo was proposed in [10]; after continuous improvement, it ranked second in CEC2012. Third place went to the cooperative coevolution evolutionary algorithm with global search (CCGS), which can be considered an extension of EOEA [11]. Cooperative coevolution with delta grouping (DECC-DML) was proposed in [12] to enhance the performance of the CC framework on non-separable problems.

CEC2013 introduced a new set of benchmark functions. The winner of the CEC2013 competition was a modified MOS [13], with DECC-G [14] as the reference algorithm. The second-ranked algorithm was smoothing and auxiliary function-based cooperative coevolution for global optimization (SACC) [15], which adopted parallel search for the first time under the CC framework.

The bi-space interactive cooperative coevolutionary algorithm (BICA) [16] is a coevolutionary framework that evolves in two spaces: the model evolves to provide better grouping, and the individuals evolve to achieve better fitness. SHADE with iterative local search (SHADE-ILS) [17] iteratively combines a modern differential evolution algorithm with a local search method selected from several candidates. The selection of the local search method is dynamic and takes into account the improvement each method achieved in the previous enhancement phase, so as to determine the best method for the problem in each situation. In LSHADE-SPA [18], differential evolution with linear population size reduction is used for global exploration, while a modified version of multiple trajectory search is used for local exploitation.

A hybrid adaptive cooperative coevolution differential evolution algorithm (HACC-D) was also proposed in CEC2014 [19]. HACC-D belongs to the CC class of algorithms, with JADE and SaNSDE used as the subcomponent optimizers. Scaling up covariance matrix adaptation evolution strategy using cooperative coevolution (CC-CMA-ES) is another competitive CC-based algorithm whose basic optimizer is CMA-ES.

For a detailed overview of the state-of-the-art large-scale optimization algorithms, please refer to [20]. This paper proposes an algorithm for large-scale optimization problems, called GI-ES. The algorithm has been verified on the CEC2010 and CEC2013 test sets. The experimental results verify the potential value of the algorithm.

3. Background and Motivation

This section briefly introduces the background of CMA-ES algorithm. With that in mind, the paper focuses on the utilization of information by the CMA-ES algorithm. Based on the analysis of different information utilization methods, this paper proposes a model of gradient information utilization.

3.1. CMA-ES

The CMA-ES algorithm is an evolution strategy proposed by Hansen et al. [1]. It uses a Gaussian distribution to sample the solution space of the optimization problem, and the parameters of the distribution are updated according to a sample selection mechanism. This sampling and update process is iterated until a termination condition is met.

For an objective function f(x), CMA-ES generates a new generation of the population by estimating the distribution of good solutions; that is, the points of the new population are drawn from the normal distribution N(m, σ²C), where m represents the mean value and C is a positive definite matrix called the covariance matrix. During the run, the covariance matrix is continuously adjusted to bring the distribution of the search points closer to the equipotential lines of the objective function; ideally, the covariance matrix would be proportional to the inverse of the Hessian matrix, though this is difficult to achieve for more complex functions. Next, we briefly introduce the basic ideas behind CMA-ES.

3.1.1. Sampling

For a given objective function, CMA-ES first assumes that there is an optimal fitness landscape under which the search moves in the optimal direction. For a convex quadratic function, the correlations between variables are linear and can be completely eliminated, so CMA-ES can treat such functions “as” spherical functions and solve them effectively. This strategy also applies to general black-box objective functions, because their local landscapes can be approximated by convex quadratic functions. In general, the optimal covariance is not known. CMA-ES samples candidate solutions from the multivariate Gaussian distribution x_i ∼ m + σ·N(0, C), where σ is the search step size used to control the local search capability of the Gaussian distribution. In each generation, λ samples are drawn and their fitness values are calculated; CMA-ES thus provides the parameters of a multivariate normal distribution for sampling the next generation.
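
As an illustration of this sampling step, the following Python sketch draws a population from N(m, σ²C). The function and variable names are ours, not the authors' implementation, and a Cholesky factor is used for simplicity where CMA-ES itself uses a spectral decomposition of C.

import numpy as np

def sample_population(mean, sigma, C, lam, rng):
    # Draw lam candidate solutions from N(mean, sigma^2 * C).
    # mean:  current distribution mean (n-dimensional vector)
    # sigma: global step size controlling the search scale
    # C:     n x n positive definite covariance matrix
    B = np.linalg.cholesky(C)  # factor C = B B^T once per generation
    samples = []
    for _ in range(lam):
        s = rng.standard_normal(mean.size)      # s ~ N(0, I)
        samples.append(mean + sigma * (B @ s))  # x ~ N(mean, sigma^2 C)
    return samples

rng = np.random.default_rng(0)
xs = sample_population(np.zeros(10), 0.5, np.eye(10), lam=20, rng=rng)
fitness = [np.sum(x**2) for x in xs]  # e.g., the sphere function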

3.1.2. Update

The CMA-ES algorithm updates the parameters (m, σ, and C), and the update operations are relatively complex. The new mean is computed from the best individuals of the current generation, with each individual's contribution determined by its weight:

m ← Σ_{i=1}^{μ} w_i·x_{i:λ},

where x_{i:λ} denotes the i-th best of the λ sampled individuals ranked by fitness, and the weights satisfy w_1 ≥ w_2 ≥ … ≥ w_μ > 0 with Σ_{i=1}^{μ} w_i = 1. Because each sample has a different weight, samples with better fitness receive larger weights.

There are two ways to update the covariance matrix: the rank-one update and the rank-μ update. This paper adopts the rank-one update strategy, in which the covariance matrix is updated as

C ← (1 − c_1)·C + c_1·p_c p_c^T,

where C is initialized to the n-dimensional identity matrix, c_1 is the learning rate with range [0, 1], and p_c is the evolution path, that is, a memory of mean shifts that decays along with the optimization process. CMA-ES thus models the covariance matrix of the multivariate Gaussian distribution directly, instead of fitting a maximum likelihood estimate to the samples. In this way, a step that succeeded in the previous generation is highly likely to appear again in the next one. For the detailed update method, see [21].
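
The following sketch illustrates the rank-one update and its evolution path, following the standard CMA-ES formulation in [21]; the constants c_c, c_1, and mu_eff are illustrative defaults, not the paper's tuned values.

import numpy as np

def rank_one_update(C, p_c, mean_new, mean_old, sigma,
                    c_c=0.1, c_1=0.05, mu_eff=5.0):
    # The path p_c accumulates successive normalized mean shifts, so step
    # directions that keep succeeding are reinforced in the next generation.
    p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_eff) \
          * (mean_new - mean_old) / sigma
    # Shift C toward the outer product of the path (learning rate c_1 in [0, 1]).
    C = (1 - c_1) * C + c_1 * np.outer(p_c, p_c)
    return C, p_c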

The step size is updated as

σ ← σ·exp((c_σ/d_σ)·(‖p_σ‖/E‖N(0, I)‖ − 1)),

where p_σ is the conjugate evolution path, c_σ is its learning rate, and d_σ is a damping parameter.

3.2. Motivation

Because CMA-ES can use the information provided by the optimal solutions to adjust its mean value and variance simultaneously, it can expand the search scope when the optimal solution is far away and narrow it when the optimal solution is near. However, the covariance matrix takes O(n²) space, and its decomposition takes O(n³) time. Throughout the iteration process, only the individuals with the highest fitness ranking are considered. For simple problems the algorithm can converge quickly, but for large-scale optimization problems this can quickly lead to premature stagnation.

Survival of the fittest is an important part of evolutionary computing, but population diversity is crucial to the performance of the algorithm. Through evolution, the traditional CMA-ES preserves the best individuals and shapes the distribution of the next generation by learning only from this elite fraction. That is, over the course of evolution, populations tend more and more toward “elite” individuals and away from “bad” ones. But the bad individuals retain key information about what not to do: studies of success are everywhere, yet what is often most useful is the experience of the unsuccessful, because it contains long-term observations and lessons. These considerations constitute the motivation for this article.

4. Proposed Approach

The proposed algorithm GI-ES adopts the basic framework of CMA-ES but makes some improvements to it. In the proposed scheme, all the information about each candidate solution, good or bad, is kept available to each generation. With these gradient-signal assessments, the whole scheme can be moved in a better direction for the next generation. Since the gradient is only estimated, we can use the standard stochastic gradient descent (SGD) algorithm widely applied in deep learning [22].

4.1. GI-ES

In GI-ES, the expected fitness score is optimized for each sampling scheme. If the expected result is good enough, the best-performing scheme in the sampled generation is likely to perform even better. Maximizing the expected fitness score of a sampling scheme is actually equivalent to maximizing the overall fitness score of the samples.

Assume that z is a sampling scheme vector drawn from the probability distribution function π(z | θ), where θ represents the parameters of the distribution. We can define the expected value of the objective function F as

J(θ) = E_θ[F(z)] = ∫ F(z)·π(z | θ) dz.

For example, if π is a normal distribution, θ consists of the mean μ and the standard deviation σ. For our simple two-dimensional problem, every sample z is a two-dimensional vector (x, y). Using the same log-likelihood trick as in REINFORCE, we can calculate the gradient of J(θ):

∇_θ J(θ) = E_θ[F(z)·∇_θ log π(z | θ)].

In a sample of size λ, with schemes z_1, z_2, …, z_λ, the gradient can be estimated by summation:

∇_θ J(θ) ≈ (1/λ)·Σ_{i=1}^{λ} F(z_i)·∇_θ log π(z_i | θ).

With the above gradient estimate, we can use a learning rate α (say 0.01) and optimize the parameter θ of the probability distribution function so that the sampling scheme obtains a higher fitness score on the target function F. Using SGD or the Adam [23] algorithm, θ is updated for the next generation:

θ ← θ + α·∇_θ J(θ).
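
The following Python sketch makes the estimator and the update concrete for an isotropic Gaussian search distribution N(θ, σ²I). This is a simplified illustration under our own naming and constants, not the paper's implementation.

import numpy as np

def gi_es_step(theta, F, lam=50, sigma=0.1, alpha=0.01, rng=None):
    # One generation of the score-function (REINFORCE-style) update.
    # For pi(z|theta) = N(theta, sigma^2 I), grad_theta log pi(z) = (z - theta) / sigma^2,
    # so grad J is estimated as the mean of F(z_i) * (z_i - theta) / sigma^2.
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((lam, theta.size))          # z_i = theta + sigma * eps_i
    fitness = np.array([F(theta + sigma * e) for e in eps])
    grad = (fitness[:, None] * eps).mean(axis=0) / sigma  # Monte Carlo gradient estimate
    return theta + alpha * grad                           # SGD ascent step on J(theta)

# Usage: maximize F(z) = -||z||^2, whose optimum is the origin.
theta = np.ones(5)
for _ in range(200):
    theta = gi_es_step(theta, lambda z: -np.sum(z**2))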

After the probability distribution function is updated, new competitive schemes can be sampled until an appropriate solution is obtained. Since no correlation parameters are used in this form, the complexity of the algorithm is O(n).

GI-ES adopts the approximate gradient as the direction of search. It represents the “population” of traditional evolutionary computation as a parameterized distribution π(z | θ).

4.1.1. Multinormal Distribution

In the proposed method, the multivariate normal distribution is used as the parameterized distribution. Its parameter is θ = (μ, Σ), where μ is the mean of the multivariate normal distribution and Σ is the covariance matrix. To sample more efficiently, we need a matrix A satisfying Σ = A^T A; the problem is then reduced to a standard multivariate normal distribution, since z = μ + A^T s with s ∼ N(0, I). Here π(z | θ) represents the probability density function of the multivariate normal distribution.

To calculate the gradient information of the multivariate Gaussian variable, the logarithm of the probability density is taken,

log π(z | θ) = −(n/2)·log(2π) − (1/2)·log|Σ| − (1/2)·(z − μ)^T Σ^{−1} (z − μ),

so that the gradient can be estimated by summation; the gradients with respect to μ and Σ are

∇_μ log π(z | θ) = Σ^{−1}(z − μ),

∇_Σ log π(z | θ) = (1/2)·Σ^{−1}(z − μ)(z − μ)^T Σ^{−1} − (1/2)·Σ^{−1}.
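
These identities can be written directly in code. The sketch below uses the standard Gaussian log-density gradients, with Σ assumed positive definite.

import numpy as np

def log_density_grads(z, mu, Sigma):
    # Gradients of log N(z; mu, Sigma) with respect to mu and Sigma:
    #   grad_mu    = Sigma^{-1} (z - mu)
    #   grad_Sigma = 1/2 * Sigma^{-1} ((z - mu)(z - mu)^T - Sigma) * Sigma^{-1}
    # Averaging F(z_i) times these terms over the population gives the
    # approximate gradient used to update the distribution parameters.
    Sinv = np.linalg.inv(Sigma)
    d = z - mu
    grad_mu = Sinv @ d
    grad_Sigma = 0.5 * (Sinv @ np.outer(d, d) @ Sinv - Sinv)
    return grad_mu, grad_Sigma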

Then, the parameters are updated with the calculated gradient information.

To balance exploration and exploitation of the solution space, the algorithm can reshape the solution distribution according to the parameters of the region it is currently exploring.

4.1.2. The Technique of GI-ES

Further, A can be decomposed into a scale parameter σ and a normalized covariance factor B satisfying A = σ·B with det(B) = 1. This decoupling into two orthogonal components allows the scale and the shape of the distribution to be learned independently.

The advantage of using the information of the whole population is that it prevents information loss, but outliers still need to be considered. Therefore, this paper adopts fitness shaping to prevent outliers from dominating the update [24]. In this method, the individuals of the population are ranked by fitness value from smallest to largest, and a utility value u_i is calculated from the rank rather than from the raw fitness value f(z_i).
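
A sketch of this rank-based utility computation follows. We show the standard NES-style log-linear shaping as an assumption; the exact constants used by GI-ES are those of [24].

import numpy as np

def rank_utilities(fitness):
    # Map raw fitness values to rank-based utilities (fitness shaping).
    # Only the ranking matters, so a single outlier cannot dominate the
    # gradient estimate; larger fitness is assumed to be better here.
    lam = len(fitness)
    ranks = np.argsort(np.argsort(fitness))       # 0 = worst, lam - 1 = best
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(lam - ranks))
    return raw / raw.sum() - 1.0 / lam            # utilities sum to zero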

The complexity of each covariance matrix update is O(n³); it can be reduced to O(n²) by calculating the update in local non-exponential coordinates. In this case, the update of the gradient information can be decomposed into the following components:

∇_δJ = Σ_{i=1}^{λ} u_i·s_i, (17)

∇_MJ = Σ_{i=1}^{λ} u_i·(s_i s_i^T − I), (18)

∇_σJ = tr(∇_MJ)/n, (19)

∇_BJ = ∇_MJ − (∇_σJ)·I, (20)

where s_i denotes the i-th sample in the standard coordinates and u_i is its utility value.

The pseudocode of GI-ES is given in Algorithm 1.

Require: f: objective function; θ = (μ, σ, B): initial parameters; λ: population size
Ensure: optimal μ
(1) initialize μ, σ, and B;
(2) repeat
(3)  for i = 1 to λ do
(4)    draw sample s_i ∼ N(0, I);
(5)    z_i ← μ + σB^T s_i;
(6)    evaluate the fitness value f(z_i)
(7)  end for
(8)  sort the sampled individuals according to the fitness value and compute the utility function u_i
(9)  compute approximate gradients according to (17)–(20)
(10)  update the parameters using the approximate gradient information:
(11)  update the mean vector: μ ← μ + η_μ·σB·∇_δJ
(12)  update the scalar step size: σ ← σ·exp((η_σ/2)·∇_σJ)
(13)  update the covariance factor: B ← B·exp((η_B/2)·∇_BJ)
(14) until the termination criterion is met
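
To make Algorithm 1 concrete, the following is a minimal runnable Python sketch: our reading of the algorithm under the exponential natural-gradient scheme of [27], assuming minimization. The population size and learning rates are illustrative defaults, and SciPy supplies the matrix exponential.

import numpy as np
from scipy.linalg import expm

def gi_es(f, mu, sigma=1.0, lam=20, iters=500, seed=0):
    # Minimal loop mirroring Algorithm 1 for a function f to be minimized.
    rng = np.random.default_rng(seed)
    n = mu.size
    B = np.eye(n)                                 # normalized covariance factor
    I = np.eye(n)
    eta_mu = 1.0
    eta_sigma = eta_B = 0.6 * (3 + np.log(n)) / (n * np.sqrt(n))
    # Rank-based utilities (fitness shaping), precomputed once; sum to zero.
    u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(np.arange(1, lam + 1)))
    u = u / u.sum() - 1.0 / lam
    for _ in range(iters):
        S = rng.standard_normal((lam, n))         # s_i ~ N(0, I)
        Z = mu + sigma * (S @ B)                  # z_i = mu + sigma * B^T s_i
        fit = np.array([f(z) for z in Z])
        w = np.empty(lam)
        w[np.argsort(fit)] = u                    # best (smallest f) gets largest utility
        # Approximate gradients, decomposed as in (17)-(20).
        g_delta = w @ S
        g_M = (S.T * w) @ S - w.sum() * I
        g_sigma = np.trace(g_M) / n
        g_B = g_M - g_sigma * I
        # Updates of the mean, step size, and covariance factor (lines 11-13).
        mu = mu + eta_mu * sigma * (B @ g_delta)
        sigma = sigma * np.exp(eta_sigma / 2 * g_sigma)
        B = B @ expm(eta_B / 2 * g_B)
    return mu

best = gi_es(lambda z: float(np.sum(z**2)), mu=np.full(10, 3.0))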

5. Experiments and Analysis

In this section, GI-ES is compared with state-of-the-art algorithms to verify its effectiveness. Three performance analysis experiments were performed. First, GI-ES was evaluated on CEC2010. Second, CEC2013 was used to evaluate GI-ES against nine state-of-the-art algorithms. Finally, a parametric analysis was performed to study the effect of each component of GI-ES.

The first experiment uses CEC2010, the second uses CEC2013, and the third is a statistical analysis. The benchmark functions are as follows. The dimension (D) of all functions is 1000, except for the two overlapping functions F13 and F14 in CEC2013, where D is 905.

(1) CEC2010 contains 20 test functions, which can be divided into four classes:
(i) F1–F3: separable functions
(ii) F4–F8: partially separable functions, in which a small group of m variables are dependent while all the remaining ones are independent (m = 50)
(iii) F9–F18: partially separable functions that consist of multiple independent subcomponents, each of which is m-non-separable (m = 50)
(iv) F19–F20: fully non-separable functions
For more detailed features, please refer to [25].

(2) CEC2013 contains 15 test functions:
(i) F1–F3: separable functions
(ii) F4–F11: partially separable functions
(iii) F12–F14: overlapping functions (D = 905 for F13 and F14)
(iv) F15: fully non-separable functions

For more detailed features, please refer to [26].

To make the experimental data more reliable, each experiment was run 25 times and the statistical results were recorded. The solution error measure f(x) − f(x*) was recorded at the end of each run, where x* is the known global optimum of each function. The maximum number of function evaluations is set to 3.0E + 6 according to the default value of the test suites.

5.1. Parametric Analysis

Most of the parameters of GI-ES are the same as in reference [27]. In the algorithm, the population size and the learning rate for the gradient information are the parameters that need to be specified manually.

In Section 4, we stated that GI-ES represents the combined effect of three main components: (1) the expected fitness score is optimized for each sampling scheme, (2) gradient information is approximated from the information of all individuals and used to guide the search, and (3) GI-ES uses the approximate gradient as the search direction.

To further verify the performance of the algorithm, we analyzed each part of the algorithm to show the individual effect of each component. Table 1 lists the mean values for each component, with the best values marked in bold. It can be seen from the results that the GI-ES algorithm with all three mechanisms works best. GI-ES (1): only the expected fitness score is optimized for each sampling scheme. GI-ES (2): only the gradient information approximated from individual information is used to guide the search. GI-ES (3): only the approximate gradient is used as the search direction.

5.2. Evaluation Criteria

To evaluate the performance of GI-ES, we apply the same methods as in [18], using three evaluation criteria. The first is the Formula One Score (FOS), which was used in the latest LSGO competition (CEC2015). According to this criterion, the algorithms are ranked from best to worst, and the top 10 ranks receive 25, 18, 15, 12, 10, 8, 6, 4, 2, and 1 points, respectively; algorithms ranked outside the top 10 receive zero. Larger total scores indicate better performance. The second and third criteria are non-parametric statistical hypothesis tests, including the Friedman test with α = 0.05 as the significance level.
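
The FOS point assignment can be stated compactly, as in the small sketch below; ranks are computed per benchmark function and the points are summed across functions.

# Formula One Score: points awarded by rank; zero beyond the top 10.
FOS_POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]

def fos_score(rank):
    # rank is 1-based; algorithms ranked outside the top 10 get zero.
    return FOS_POINTS[rank - 1] if 1 <= rank <= len(FOS_POINTS) else 0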

5.3. Performance Analysis Using CEC2010 and CEC2013

Statistical results of GI-ES using CEC2010 and CEC2013 are illustrated in Tables 2 and 3, respectively. Figure 1 illustrates the convergence behavior of GI-ES using sample functions from each class in CEC2013: f3 as fully separable, f8 and f11 as partially separable functions, f12 and f14 as overlapping functions where D = 905, and f15 as fully non-separable functions.

CMA-ES variants for solving large-scale optimization are used as comparison algorithms, along with some state-of-the-art algorithms not derived from CMA-ES, for example, CC-based differential evolution (DECC-G) [14] and multiple offspring sampling (MOS) [9]. The comparison algorithms are listed in Tables 4 and 5. All of these algorithms follow the same CEC2010 and CEC2013 guidelines. The results of the comparative experiments are recorded in Tables 6 and 7.

Tables 8 and 9 summarize the ranking for GI-ES and the state-of-the-art algorithms using Formula One Score (FOS). Tables 10 and 11 summarize the ranking obtained using Friedman’s test.

5.3.1. Formula One Score (FOS)

As shown in Tables 8 and 9, GI-ES ranked second among all comparison algorithms on CEC2010 using the Formula One Score (FOS) and first on CEC2013. Regarding CEC2010, the best algorithm is MMO-CC with 237 points, while the winners of CEC2010 and CEC2012, MA-SW-Chains and MOS2012, get 115 and 168 points; this shows that GI-ES is competitive. On CEC2013, GI-ES obtains the highest score, followed by MOS2013 and VGDE with 218 and 196 points, respectively.

5.3.2. Friedman Test

According to the Friedman test results in Tables 10 and 11, GI-ES obtained the best ranking on both the CEC2010 and CEC2013 benchmarks. On CEC2010, GI-ES gets 5.33 points, jDEsps gets 6 points, MMO-CC gets 7.75, and MA-SW-Chains gets 7.95 points. On CEC2013, GI-ES gets 3 points, while MOS2013 gets 3.57 points.

In the Friedman test, the critical value is 16.92 and the p value is 1.64E−05, which indicates that there are significant differences among the algorithms.

A high ranking under the Formula One Score (FOS) does not guarantee the same ranking under the Friedman test, because FOS gives more weight to the top positions.

5.4. CPU Computational Time

Previous experimental results have evaluated the effectiveness of GI-ES; this section analyzes its runtime. Table 12 records the average running time of GI-ES and the baseline versions of CMA-ES and MA-SW-Chains on the CEC2010 LSGO benchmark, with dimension 1000.

It can be seen from the results that, for the separable functions f1–f3, the runtime of GI-ES is slightly better than that of MA-SW-Chains and CMA-ES, although the difference between the three algorithms is not significant. For the partially separable functions (f4–f18), the running time of GI-ES is significantly better than that of CMA-ES and MA-SW-Chains. On the fully non-separable functions, CMA-ES performs better than GI-ES. In general, GI-ES offers better performance on partially separable problems.

6. Conclusion

This research has shown that the use of guiding gradient information improves the performance of evolutionary computation. The problem of low utilization of historical information in ES was addressed by using information from all sampled solutions to guide the distribution of the next generation of solutions. The guiding information was obtained by approximating the gradient. This strategy not only increases the diversity of the information used but also makes full use of the best information available to the heuristic algorithm. The theoretical analysis and experimental results show that the method incorporating guiding information is accurate and stable.

The experimental results showed that the use of guiding information is effective. The algorithm was also compared with other state-of-the-art metaheuristic algorithms and demonstrated good average performance and pairwise comparison performance across a wide range of test functions.

The experiments showed that this algorithm is an effective global optimization method for large-scale problems, which makes it applicable to a large number of practical applications. The principle behind the use of guidance information is simple, but effective, and has a certain guiding significance for heuristic optimization algorithms.

Data Availability

The data are available at https://titan.csit.rmit.edu.au/~e46507/publications/lsgo-cec10.pdf.

Conflicts of Interest

The author declares that there are no conflicts of interest.