Computational Intelligence and Neuroscience

Volume 2016 (2016), Article ID 1537325, 13 pages

http://dx.doi.org/10.1155/2016/1537325

## Metaheuristic Algorithms for Convolution Neural Network

^{1}Machine Learning and Computer Vision Laboratory, Faculty of Computer Science, Universitas Indonesia, Depok 16424, Indonesia

^{2}Computer System Laboratory, STMIK Jakarta STI&K, Jakarta 12140, Indonesia

Received 29 January 2016; Revised 15 April 2016; Accepted 10 May 2016

Academic Editor: Martin Hagan

Copyright © 2016 L. M. Rasdi Rere et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Modern optimization techniques are usually either heuristic or metaheuristic. These techniques have solved a range of optimization problems in science, engineering, and industry. However, strategies for applying metaheuristics to improve the accuracy of convolution neural networks (CNN), a well-known deep learning method, are still rarely investigated. Deep learning is a type of machine learning technique whose aim is to move closer to the goal of artificial intelligence: creating a machine that can successfully perform any intellectual task that a human can carry out. In this paper, we propose an implementation strategy for three popular metaheuristic approaches, that is, simulated annealing, differential evolution, and harmony search, to optimize CNN. The performance of these metaheuristic methods in optimizing CNN for classifying the MNIST and CIFAR datasets was evaluated and compared. Furthermore, the proposed methods are also compared with the original CNN. Although the proposed methods increase the computation time, they also improve accuracy (up to 7.14 percent).

#### 1. Introduction

Deep learning (DL) is mainly motivated by research in artificial intelligence, in which the general goal is to imitate the ability of the human brain to observe, analyze, learn, and make decisions, especially for complex problems [1]. This technique lies at the intersection of the research areas of signal processing, neural networks, graphical modeling, optimization, and pattern recognition. The current reputation of DL is due to drastic improvements in chip processing capability, a significant decrease in the cost of computing hardware, and advances in machine learning and signal processing research [2].

In general, DL models can be classified into discriminative, generative, and hybrid models [2]. Examples of discriminative models are CNN, deep neural networks, and recurrent neural networks. Examples of generative models are deep belief networks (DBN), restricted Boltzmann machines, regularized autoencoders, and deep Boltzmann machines. A hybrid model, on the other hand, is a deep architecture that combines a discriminative and a generative model. An example is using a DBN to pretrain a deep CNN, which can improve the performance of the deep CNN over random initialization. Among the hybrid DL techniques, metaheuristic optimization for training a CNN is the focus of this paper.

Although DL can solve a variety of learning tasks, training it is difficult [3–5]. Some examples of successful methods for training DL are stochastic gradient descent, conjugate gradient, Hessian-free optimization, and Krylov subspace descent.

Stochastic gradient descent is easy to implement and fast when there are many training samples. However, this method needs several manual tuning schemes to make its parameters optimal, and its process is principally sequential; as a result, it is difficult to parallelize on a graphics processing unit (GPU). Conjugate gradient (CG), on the other hand, is easier to check for convergence and more stable to train. Nevertheless, CG is slow, so it needs multiple CPUs and a large amount of RAM [6].

Hessian-free optimization has been applied to train deep autoencoders [7]; it handles the underfitting problem well and is more efficient than the pretraining + fine-tuning approach proposed by Hinton and Salakhutdinov [8]. Krylov subspace descent, in turn, is simpler and more robust than Hessian-free optimization, and it appears to give better classification performance and optimization speed. However, Krylov subspace descent needs more memory than Hessian-free optimization [9].

In fact, modern optimization techniques are mostly heuristic or metaheuristic. These techniques have been applied to many optimization problems in science, engineering, and even industry [10]. However, research on using metaheuristics to optimize DL methods is rarely conducted. One such work is the combination of a genetic algorithm (GA) and CNN proposed by You and Pu [11]. Their model selects the CNN characteristics through the recombination and mutation processes of GA, in which each CNN model is an individual in the GA population. In the recombination process, only the layer weights and threshold values of C1 (convolution in the first layer) and C3 (convolution in the third layer) of the CNN model are changed. Another work is the fine-tuning of CNN using harmony search (HS) by Rosa et al. [12].

In this paper, we compare the performance of three metaheuristic algorithms, that is, simulated annealing (SA), differential evolution (DE), and HS, for optimizing CNN. The strategy employed is to search for the best value of the fitness function on the last layer using a metaheuristic algorithm; the results are then used to recalculate the weights and biases in the previous layers. To test the performance of the proposed methods, we use the MNIST (Modified National Institute of Standards and Technology) dataset. This dataset comprises images of handwritten digits and contains 60,000 training items and 10,000 testing items. All of the images have been centered and standardized to a size of 28 × 28 pixels. Each pixel of an image is represented by a value from 0 for black to 255 for white, with the values in between representing different shades of gray [13].

This paper is organized as follows: Section 1 is an introduction, Section 2 explains the metaheuristic algorithms used, Section 3 describes convolution neural networks, Section 4 gives a description of the proposed methods, Section 5 presents the simulation results, and Section 6 is the conclusion.

#### 2. Metaheuristic Algorithms

Metaheuristics are well known as efficient methods for hard optimization problems, that is, problems that cannot be solved optimally using a deterministic approach within a reasonable time limit. Metaheuristic methods serve three main purposes: solving problems faster, solving large problems, and making algorithms more robust. These methods are also simple to design as well as flexible and easy to implement [14].

In general, metaheuristic algorithms use a combination of rules and randomization to imitate natural phenomena. Metaheuristic algorithms that imitate biological systems include evolution strategy, GA, and DE. Algorithms inspired by ethology include particle swarm optimization (PSO), bee colony optimization (BCO), bacterial foraging optimization algorithms (BFOA), and ant colony optimization (ACO). Algorithms inspired by physics include SA, microcanonical annealing, and the threshold accepting method [15]. Another form of metaheuristic is inspired by musical phenomena, such as the HS algorithm [16].

Metaheuristic algorithms can also be divided into single-solution-based and population-based algorithms. Examples of single-solution-based metaheuristics are the noising method, tabu search, SA, threshold accepting (TA), and guided local search. Population-based metaheuristics can be classified into swarm intelligence and evolutionary computation. Swarm intelligence is inspired by the collective behavior of social insect colonies or animal societies. Examples of these algorithms are PSO, BCO, ACO, and BFOA. Evolutionary computation, on the other hand, takes inspiration from the Darwinian principles of adaptation to the environment. Examples of these algorithms are genetic programming (GP), GA, evolution strategy (ES), and DE [15]. Among all these metaheuristic algorithms, SA, DE, and HS are used in this paper.

##### 2.1. Simulated Annealing Algorithm

SA is a random search technique for global optimization problems. It mimics the process of annealing in material processing [10]. This technique was first proposed in 1983 by Kirkpatrick et al. [17].

The principal idea of SA is to use a random search that not only accepts changes that improve the fitness function but also keeps some changes that are not ideal. For example, in a minimization problem, any change that decreases the fitness function value is accepted, but some changes that increase it are also accepted with a transition probability $p$ as follows:

$$p = \exp\left(-\frac{\Delta E}{kT}\right),$$

where $\Delta E$ is the change in energy level, $k$ is Boltzmann's constant, and $T$ is the temperature that controls the annealing process. This equation is based on the Boltzmann distribution in physics [10]. The following is the standard procedure of SA for optimization problems:

(1) *Generate the solution vector*: the initial solution vector is randomly selected, and then the fitness function is calculated.

(2) *Initialize the temperature*: if the temperature value is too high, it will take a long time to reach convergence, whereas a value that is too small can cause the system to miss the global optimum.

(3) *Select a new solution*: a new solution is randomly selected from the neighborhood of the current solution.

(4) *Evaluate a new solution*: a new solution is accepted as the new current solution depending on its fitness function.

(5) *Decrease the temperature*: during the search process, the temperature is periodically decreased.

(6) *Stop or repeat*: the computation is stopped when the termination criterion is satisfied. Otherwise, the earlier steps are repeated.
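The SA procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this paper: it assumes an acceptance probability of the form exp(-ΔE / (kT)) with k set to 1, a geometric cooling schedule, a uniform random neighborhood, and a toy `sphere` fitness function. All function names and parameter values are hypothetical.

```python
import math
import random


def acceptance_probability(delta_e, temperature, k=1.0):
    """Probability of accepting a move that changes the fitness by delta_e."""
    if delta_e <= 0:
        return 1.0  # moves that do not worsen the fitness are always accepted
    return math.exp(-delta_e / (k * temperature))


def simulated_annealing(fitness, x0, t0=1.0, cooling=0.95, step=0.5,
                        iters_per_temp=20, t_min=1e-3, seed=0):
    """Minimize `fitness` starting from x0 (all settings are assumptions)."""
    rng = random.Random(seed)
    current, f_current = list(x0), fitness(x0)   # step (1): initial solution
    best, f_best = list(current), f_current
    t = t0                                       # step (2): initial temperature
    while t > t_min:                             # step (6): termination check
        for _ in range(iters_per_temp):
            # step (3): random neighbor of the current solution
            candidate = [xi + rng.uniform(-step, step) for xi in current]
            f_candidate = fitness(candidate)
            # step (4): accept better moves, and worse moves with probability p
            p = acceptance_probability(f_candidate - f_current, t)
            if rng.random() < p:
                current, f_current = candidate, f_candidate
                if f_current < f_best:
                    best, f_best = list(current), f_current
        t *= cooling                             # step (5): decrease temperature
    return best, f_best


def sphere(x):
    """Toy fitness function with its minimum of 0 at the origin."""
    return sum(xi * xi for xi in x)


best, f_best = simulated_annealing(sphere, [3.0, -2.0])
print(best, f_best)
```

Lowering the temperature shrinks the probability of accepting a worsening move, which is why the search behaves almost greedily near the end of the schedule.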

##### 2.2. Differential Evolution Algorithm

Differential evolution was first proposed by Price and Storn in 1995 to solve the Chebyshev polynomial problem [15]. This algorithm is based on the differences between individuals: it exploits random search in the solution space and then applies mutation, crossover, and selection to obtain suitable individuals in the system [18].

There are several variants of DE, including the classical form DE/rand/1/bin; this name indicates that in the mutation process the target vector is randomly selected and only a single difference vector is applied. The acronym bin indicates that the crossover process follows a binomial decision rule. The procedure of the DE algorithm is shown by the following steps:

(1) *Determining parameter setting*: population size is the number of individuals. The mutation factor ($F$) controls the magnification of the difference between two individuals to avoid search stagnation. The crossover rate (CR) decides how many consecutive genes of the mutated vector are copied to the offspring.

(2) *Initialization of population*: the population is produced by randomly generating the vectors in the suitable search range.

(3) *Evaluation of individual*: each individual is evaluated by calculating its objective function.

(4) *Mutation operation*: mutation adds an identical variable to one or more vector parameters. In this operation, three auxiliary parents $x_{r_1}$, $x_{r_2}$, and $x_{r_3}$ are selected randomly to take part in the mutation operation, which creates a mutated individual $v$ as follows:

$$v = x_{r_1} + F\,(x_{r_2} - x_{r_3}),$$

where $F$ is the mutation factor and $r_1$, $r_2$, and $r_3$ are mutually distinct random indices into the population.

(5) *Combination operation*: recombination (crossover) is applied after the mutation operation.

(6) *Selection operation*: this operation determines whether the offspring should become a member of the population in the next generation or not.

(7) *Stopping criterion*: the current generation is substituted by the new generation until the termination criterion is satisfied.
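As an illustration of the steps above, the following Python sketch implements a DE/rand/1/bin loop on a toy problem. It is a sketch under stated assumptions, not the code used in this paper: the mutation rule is taken to be the standard DE/rand/1 form (a random parent plus the scaled difference of two others), mutants are clipped to the search bounds, and the function names, parameter values, and `sphere` objective are hypothetical.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.5, cr=0.9,
                           generations=100, seed=0):
    """Minimize `objective` with DE/rand/1/bin (settings are assumptions)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # step (2): random initial population inside the search range
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    # step (3): evaluate every individual
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):                 # step (7): stopping criterion
        new_pop, new_scores = list(pop), list(scores)
        for i in range(pop_size):
            # step (4): three distinct parents, all different from individual i
            others = [j for j in range(pop_size) if j != i]
            r1, r2, r3 = rng.sample(others, 3)
            mutant = [pop[r1][d] + f * (pop[r2][d] - pop[r3][d])
                      for d in range(dim)]
            # keep the mutant inside the bounds (an assumption of this sketch)
            mutant = [min(max(m, lo), hi) for m, (lo, hi) in zip(mutant, bounds)]
            # step (5): binomial crossover; j_rand guarantees one mutant gene
            j_rand = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < cr or d == j_rand) else pop[i][d]
                     for d in range(dim)]
            # step (6): greedy selection between parent and offspring
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                new_pop[i], new_scores[i] = trial, trial_score
        pop, scores = new_pop, new_scores
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(xi * xi for xi in x)


best, score = differential_evolution(sphere, [(-5.0, 5.0)] * 2)
print(best, score)
```

Because selection only replaces a parent with an offspring that is at least as good, the best score in the population never gets worse from one generation to the next.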

##### 2.3. Harmony Search Algorithm

The harmony search algorithm was proposed by Geem et al. in 2001 [19]. This algorithm is inspired by the musical process of searching for a perfect state of harmony. Harmony in music is analogous to the solution vector in optimization, and the musician's improvisation is analogous to the local and global search schemes of optimization techniques.

In musical improvisation, the players each sound a pitch within the possible range, and together these pitches form one harmony vector. If the pitches produce a good harmony, the experience is stored in each player's memory, giving them the opportunity to create a better harmony next time [16]. There are three possible alternatives when a musician improvises one pitch: playing any one pitch from her/his memory, playing a pitch near one from her/his memory, or playing an entirely random pitch within the range of possible sounds. When these options are used for optimization, they have three equivalent components: the use of harmony memory, pitch adjusting, and randomization. In the HS algorithm, these rules are controlled by two relevant parameters, that is, harmony memory consideration rate (HMCR) and pitch adjusting rate (PAR). The procedure of the HS algorithm can be summarized in five steps as follows [16]:

(1) *Initialize the problem and parameters*: in this algorithm, the problem can be a maximization or minimization problem, and the relevant parameters are HMCR, PAR, size of harmony memory, and termination criterion.

(2) *Initialize harmony memory*: the harmony memory (HM) is usually initialized as a matrix of randomly created solution vectors, arranged according to the objective function.

(3) *Improvise a new harmony*: a new harmony vector is produced from HM based on HMCR, PAR, and randomization. Whether a value is drawn from HM is governed by the HMCR parameter, which ranges from 0 to 1. The new harmony vector is then examined to decide whether it should be pitch-adjusted using the PAR parameter. Pitch adjusting is executed only after a value is selected from HM.

(4) *Update harmony memory*: the new harmony replaces the worst harmony in HM, provided that the fitness function value of the new harmony is better than that of the worst harmony.

(5) *Repeat (3) and (4) until satisfying the termination criterion*: when the termination criterion is met, the computation ends. Otherwise, steps (3) and (4) are repeated. In the end, the best vector in HM is selected as the best solution to the problem.
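The five steps above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the implementation evaluated in this paper: pitch adjusting is modeled as a uniform perturbation within a hypothetical `bandwidth`, the problem is a minimization problem, and the function names, parameter values, and `sphere` objective are chosen for illustration.

```python
import random


def harmony_search(objective, bounds, hm_size=10, hmcr=0.9, par=0.3,
                   bandwidth=0.1, iterations=2000, seed=0):
    """Minimize `objective` with a basic HS loop (settings are assumptions)."""
    rng = random.Random(seed)
    # step (2): random harmony memory, sorted from best to worst
    memory = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(hm_size)]
    memory.sort(key=objective)
    for _ in range(iterations):                  # step (5): termination criterion
        new = []
        for d, (lo, hi) in enumerate(bounds):    # step (3): improvise a harmony
            if rng.random() < hmcr:
                # take this component from a randomly chosen stored harmony
                value = rng.choice(memory)[d]
                if rng.random() < par:
                    # pitch adjusting applies only to values taken from memory
                    value += rng.uniform(-bandwidth, bandwidth)
                    value = min(max(value, lo), hi)
            else:
                # randomization: any value in the allowed range
                value = rng.uniform(lo, hi)
            new.append(value)
        # step (4): replace the worst harmony only if the new one is better
        if objective(new) < objective(memory[-1]):
            memory[-1] = new
            memory.sort(key=objective)
    return memory[0], objective(memory[0])


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(xi * xi for xi in x)


best, score = harmony_search(sphere, [(-5.0, 5.0)] * 2)
print(best, score)
```

A high HMCR favors reusing stored harmonies (exploitation), while the remaining probability mass keeps injecting random values (exploration).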

#### 3. Convolution Neural Network

CNN is a variant of the standard multilayer perceptron (MLP). A substantial advantage of this method over conventional approaches, especially for pattern recognition, is its capability to reduce the dimension of the data, extract features sequentially, and classify, all within one network structure [20]. The basic architecture of CNN was inspired by the model of the visual cortex proposed by Hubel and Wiesel in 1962.

In 1980, Fukushima's Neocognitron provided the first computational realization of this model, and then in 1989, following the idea of Fukushima, LeCun et al. achieved state-of-the-art performance on a number of pattern recognition tasks using the error gradient method [21].

The classical CNN by LeCun et al. is an extension of the traditional MLP based on three ideas: local receptive fields, weight sharing, and spatial/temporal subsampling. These ideas are organized into two types of layers, namely convolution layers and subsampling layers. As shown in Figure 1, the processing layers contain three convolution layers C1, C3, and C5, interleaved with two subsampling layers S2 and S4, followed by an output layer F6. These convolution and subsampling layers are structured into planes called feature maps.
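To make the alternation of convolution and subsampling layers concrete, the following Python sketch traces how the side length of a feature map shrinks from C1 to C5. The numbers are assumptions based on the commonly cited LeNet-5 configuration (a 32 × 32 input, 5 × 5 kernels without padding, and 2 × 2 subsampling); they are not taken from this section, and the function names are hypothetical.

```python
def conv_output_size(size, kernel):
    """Side length after a convolution without padding and with stride 1."""
    return size - kernel + 1


def subsample_output_size(size, factor):
    """Side length after non-overlapping subsampling by `factor`."""
    return size // factor


def trace_feature_map_sizes(input_size=32, kernel=5, factor=2):
    """Return the feature-map side length after each of C1, S2, C3, S4, C5."""
    sizes = {}
    size = input_size
    for name in ("C1", "S2", "C3", "S4", "C5"):
        if name.startswith("C"):
            size = conv_output_size(size, kernel)       # convolution layer
        else:
            size = subsample_output_size(size, factor)  # subsampling layer
        sizes[name] = size
    return sizes


print(trace_feature_map_sizes())
# {'C1': 28, 'S2': 14, 'C3': 10, 'S4': 5, 'C5': 1}
```

Under these assumptions, each convolution trims four pixels from the side length and each subsampling layer halves it, so C5 reduces every map to a single value before the output layer.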