Abstract

Neuroevolution is the field of study that uses evolutionary computation to optimize certain aspects of the design of neural networks, most often their topology and hyperparameters. The field was introduced in the late 1980s, but only in recent years has it become mature enough to enable the optimization of deep learning models, such as convolutional neural networks. In this paper, we rely on previous work to apply neuroevolution to optimize the topology of deep neural networks that can be used to solve the problem of handwritten character recognition. Moreover, we take advantage of the fact that evolutionary algorithms optimize a population of candidate solutions by combining a set of the best evolved models into a committee of convolutional neural networks. This process is enhanced by using specific mechanisms to preserve the diversity of the population. Additionally, we address one of the disadvantages of neuroevolution: the process is very expensive in terms of computational time. To lessen this issue, we explore the performance of topology transfer learning: whether the best topology obtained using neuroevolution for a certain domain can be successfully applied to a different domain. By doing so, the expensive process of neuroevolution can be reused to tackle different problems, turning it into a more appealing approach for optimizing the design of neural network topologies. After evaluating our proposal, results show that both the use of neuroevolved committees and the application of topology transfer learning are successful: committees of convolutional neural networks improve classification results when compared to single models, and topologies learned for one problem can be reused for a different problem and data with good performance. Additionally, both approaches can be combined by building committees of transferred topologies, and this combination attains results that combine the best of both approaches.

1. Introduction

Deep learning encompasses a broad set of techniques that are able to infer “deep” models to solve diverse machine learning problems. Among these techniques, convolutional neural networks (CNNs) are probably the most well-known, extensively studied, and widely used. CNNs are a type of neural network that typically comprises two different parts: first, convolutional layers are in charge of automatically extracting relevant features from the input; then, fully-connected layers are responsible for performing supervised learning. While other architectures exist, such as residual networks [1] or fully convolutional networks [2], the one described is the most commonly found in the literature.

The advantage of CNNs is that all the network parameters, from both the feature extractor and the classifier, can be learned using backpropagation. CNNs have been widely used for computer vision, achieving an outstanding performance, but have also been applied to a variety of problems with remarkable success: natural language processing, signal classification, human activity recognition, or even music generation.

While CNNs have proven to work well in a very diverse number of domains, one downside is that the network topologies can be very complicated, involving a large number of hyperparameters from the convolutional layers, the fully-connected layers, and the training process itself. Generally, the topologies are manually designed, by either choosing the most suitable one after some trial-and-error or using domain-specific expertise to infer what would constitute a good topology for the problem. Unfortunately, the first approach is very time-consuming, while the second one requires expertise which may not always be available.

While there are no analytic procedures for automatically determining the optimal CNN topology and hyperparameters for a certain problem, in recent years some works have focused on developing mechanisms to automatically search for them. An important subset of these mechanisms involves neuroevolution, a concept that arose in the late 1980s to apply evolutionary computation for optimizing some aspects of neural networks and that, only in very recent years, with the improvement of hardware technology and efficient GPU-based deep learning frameworks, is starting to be applied to deep and convolutional neural networks.

In this work, we will first use a previously described evolutionary algorithm [3] to optimize the topology of convolutional neural networks. This procedure includes mechanisms specifically devoted to preserving the diversity of the population during evolution. Besides, in this paper, we will focus on the study of two aspects that arise naturally from the neuroevolutionary process itself.

The first of these aspects is the fact that evolutionary computation can output not just one solution, but rather a whole population of candidate solutions resulting from the optimization procedure. In this paper, we take advantage of the evolutionary process in order to obtain not just an optimized CNN but rather a set of well-performing CNNs that can be combined within an ensemble. This whole process is enriched by the diversity-enhancing techniques mentioned before, which will lead to a higher variance in the models making up the ensemble.

Regarding the second aspect, it must be noticed that neuroevolution is a very expensive task in computational terms. This high cost is due to the fact that, in every generation of the genetic algorithm, each candidate CNN topology must be evaluated, thus requiring first learning its parameters using some training data and then computing its performance (fitness) using some validation data. For this reason, in this paper, we study the process of transferring the topologies optimized for one domain to a different domain. A successful outcome in this topology transfer learning task would make neuroevolution a more appealing tool since most of the knowledge could be reused between different problems.

The main contribution of this paper is therefore to address these two aspects, showing how ensembles of neuroevolved and diverse CNNs can significantly outperform individual models and how the knowledge acquired during evolution can then be transferred to different problems with successful results.

The remainder of this paper is structured as follows: first, Section 2 introduces the context of this paper, describes related works, and elaborates on the contribution of this paper when compared with these works. The proposal of this paper is described in Sections 3 and 4, which describe the procedure for building committees of CNNs and for performing topology transfer learning and report and discuss the results attained after a systematic evaluation. Finally, Section 5 provides some conclusive remarks about the work carried out in this paper.

CNNs were first introduced by LeCun et al. in 1998 [4, 5] as an approach to achieve outperforming classification of different types of documents and information, such as images, speech, or time series. The most common CNN architecture comprises two distinct parts: first, a sequence of convolutional layers is in charge of automatically extracting relevant features from input data. This stage is known as “feature learning” or “representation learning” and replaces the procedure in which an expert or group of experts perform manual feature engineering to convert some unstructured information into a set of valuable features. Then, once these features have been extracted, the second part of the CNN, often known as “dense layers”, will be in charge of performing classification. This classifier module will commonly comprise a fully-connected feedforward or recurrent neural network. The backpropagation mechanism is used both for learning the parameters of the dense layers (as in classical neural networks) and also for the convolutional layers, in order to minimize a loss function defined over the output of the neural network and the expected classification output.

Nowadays, CNNs have become one of the most widespread techniques in the field known as “deep learning”. Many frameworks have arisen in recent years in order to easily train CNN models over very different types of input data, supporting a large variety of tensor algebra operations and automatic differentiation. Examples of widely used deep learning frameworks are Theano [6], Caffe [7], and TensorFlow [8]; other libraries have also been published to ease the design of convolutional neural networks, such as Lasagne [9] or Keras [10]. CNNs are becoming ubiquitous at solving a variety of artificial intelligence problems due to the availability of high-end hardware (mostly graphics processing units, GPUs, and specific hardware for tensor processing) and the ease of implementation provided by such frameworks and libraries. Nevertheless, choosing the right topology and hyperparameters of a CNN for solving a problem is still a challenging design decision, which in some cases requires expert knowledge and in most cases expensive trial-and-error.

The problem of designing the topology of a neural network is not new: it goes back as far as the mid-1980s, when the backpropagation algorithm was introduced by Rumelhart et al. [11]. Backpropagation enabled a fast and reliable way of learning the parameters of a neural network, and it was a key discovery for settling the field of neural networks; however, the topology of the network must be known prior to the start of the training process. Unfortunately, no analytic procedure exists which determines the optimal topology of a neural network, and the common way to design it involves either trying different architectures until one satisfies our expected quality or reusing topologies that have proven to be successful for very similar problems.

However, by the end of the 1980s, a new approach arose: neuroevolution. This paradigm uses evolutionary algorithms in order to optimize the topology, and in some cases also the weights, of a neural network. Evolutionary algorithms are a set of biologically inspired, metaheuristic search techniques that can optimize a population of individuals in order to maximize a certain fitness metric. In neuroevolution, the population comprises the description of the neural network topologies and/or hyperparameters, and the fitness function is a certain machine learning metric to be optimized (e.g., accuracy, precision or recall, F1 score, etc.).

Neuroevolution has been applied successfully to determine optimal neural network topologies for almost three decades. However, its application to CNNs has been marginal, and only in very recent years has a significant number of works been published proposing different approaches to perform an evolutionary optimization of CNN topologies. A very recent survey was provided by Stanley et al. early in 2019 [12].

One of the earliest approaches was proposed by Koutník et al. [13] in 2014, where an architecture of four convolutional layers with max-pooling and a small recurrent network with three hidden units is fixed. All parameters are evolved by encoding them in a real-valued genome. However, as can be seen, this work does not perform architecture optimization.

Another work was presented in 2015 by Verbancsics and Harguess [14], proposing an update to HyperNEAT [15] to support the evolution of CNNs, by learning the weights of a feature extractor consisting of convolutional layers. Also that year, Young et al. [16] introduced MENNDL (multi-node evolutionary neural networks for deep learning), where a genetic algorithm was used to optimize only six hyperparameters of a CNN, focusing on high performance computing. A more recent version was presented in 2017 [17], where the number of evolved hyperparameters was increased to eleven.

Some relevant works were also published during 2016. It is the case of the work by Loshchilov and Hutter [18], where an evolution strategy is used to evolve 19 hyperparameters from a CNN, most of which belong to the learning process, using an architecture with a fixed number of layers. Other work was published by Fernando et al. [19], where they propose the introduction of a differentiable compositional pattern producing network (DPPN), a type of network which can be evolved using approaches based on augmenting topologies. An interesting novelty of this approach is that the topology and the initial weights can be evolved altogether and later optimized using backpropagation.

Most works in this field started to arise in 2017. A representative example is GeNet, introduced by Xie and Yuille [20], whose approach consists of evolving a graph connecting different stages (each corresponding to a convolutional layer), and which allows for the optimization of complex nonsequential networks. Also relevant is CoDeepNEAT by Miikkulainen et al. [21], which consists of a modification of NEAT to support the evolution of CNNs, with support for nonsequential architectures and improved with a co-evolutionary approach. In EXACT, proposed by Desell [22], another NEAT-based approach is used to evolve the filter sizes of convolutional layers and their connectivity. Real et al. [23] from Google Brain also proposed an approach where a graph was evolved, with each node corresponding to a convolutional layer, allowing for complex topologies. Moreover, Sun et al. [24] described the application of a GA to the evolution of CNNs, innovating with the introduction of variable-length chromosomes. Also, Dufourq and Bassett [25] introduced EDEN, which relied on a genetic algorithm with two genes for encoding the network architecture and the learning rate.

In the works proposed by Suganuma et al. [26] and by Davison [27], genetic programming is used instead for evolving the architecture of the CNN. Meanwhile, Bochinski et al. [28] proposed IEA-CNN, an approach using an evolutionary strategy, innovating by sorting the evolved layers by descending complexity, effectively reducing the search space factorially on the number of layers. Additionally, they extend their contribution by building ensembles out of evolved models, using a fitness function that takes the global classification error of the population, and naming this alternative CEA-CNN.

State-of-the-art works have also been presented during 2018. For example, Baldominos et al. [3] conducted research in which two implementations of evolutionary algorithms were successfully used to evolve the topologies of CNNs for improving the performance of handwritten digit recognition. Liu et al. [29] suggested using a genetic algorithm for evolving individuals using mutation by adding, removing, or editing edges in a computation graph which can be translated into a convolutional neural network. Finally, first Kramer [30] and later Prellberg and Kramer [31] have presented an approach based on an evolutionary algorithm that relies only on the mutation operator and have introduced a mechanism to support parameter inheritance, so that descendants during the evolutionary process do not need to learn weights from scratch. Assunção et al. [32] have presented DENSER, a work where a multi-level encoding of candidate solutions allows for the optimization of the topology of the network and the activation functions, with the authors claiming that it can also be used to evolve the hyperparameters of the learning process as well as of the data augmentation stage. In late 2018, Wang et al. [33] used differential evolution to optimize different hyperparameters of both convolutional and fully-connected layers. Sun et al. [34] have recently proposed the application of a genetic algorithm where two building blocks (ResNet and DenseNet) are used to evolve the CNN architecture.

In this promising field, there are still many research lines that are worth being explored. As we already described in the previous section, in this paper we expect to contribute to the field of knowledge by working in two different subjects of study that arise naturally from the advantages and limitations of neuroevolution. First, given that evolutionary algorithms evolve not only a single individual but rather a population of them, we hypothesize that the output of the optimization process can be combined altogether to build a committee of neural networks that perform better than any single model.

The idea of ensembling several neural network models into a committee is not new and has been extensively carried out in the literature. For example, in the field of image classification, Cireşan et al. [35] attained an improvement of 0.8 percentage points over the best result reported in the state of the art of the MNIST database by using committees of CNNs. In recent years, this idea has been applied to a variety of fields, such as facial expression analysis [36], astrophysics [37], pose estimation [38], or medical imaging [39]. However, the idea of building an ensemble out of a population of neuroevolved CNN topologies is less common and, to the best of our knowledge, has only been explored before by Real et al. [23] and by Bochinski et al. [28] in 2017. In the former work, the ensemble is built by choosing the top-2 models of the evolved population based on validation accuracy. When testing over the CIFAR-10 dataset, the ensemble attains an accuracy of 95.6%, versus 94.6% of the best single model. Beyond neuroevolution, the process of ensembling automatically determined CNN topologies is also used in MetaQNN [40], where reinforcement learning is used to determine optimal topologies. As a result, ensembles attain a test accuracy over the MNIST dataset of 99.68%, compared to 99.56% when using a single model. In the latter work, by Bochinski et al. [28], ensembles are evolved using a fitness function that considers the global classification error of the population, and the authors report a classification accuracy of 99.76% with an ensemble of 34 CNNs, compared to 99.66% with a single model, a very competitive performance among the state of the art.

Regarding the second subject of study, since neuroevolution is an expensive process, we believe that it is worth exploring whether the best topology evolved for a problem can be reused for a similar problem, even when the data are different. The deep learning literature has studied transfer learning before, but understood as the application of a trained machine learning model to a different domain with very little additional training [41]. In the case of neural networks, the learned parameters (weights) are reused for a different problem. In some cases, only the feature representations (i.e., the parameters of the convolutional layers) may be used, training the classifier from scratch. In this paper, however, we will be focusing on topology transfer learning. Thus, we are interested in transferring knowledge not of the CNN model weights but rather of its architecture and learning hyperparameters. In other words, once the expensive process of determining the best CNN topology for solving a certain problem is completed, we are interested in testing whether this evolved topology is useful for training a CNN model in a different machine learning problem. If it proves useful, plenty of computation time could be saved by avoiding the repetition of the evolutionary process. This approach contrasts with the one followed by Real et al. [23], where they run the complete neuroevolutionary process again for the CIFAR-100 dataset, using the same encoding and hyperparameters that they used for CIFAR-10.

In this paper, we will explore both alternatives separately and then combine both of them together to put the robustness of these improvements in the neuroevolutionary process to the test. To the best of our knowledge, the study of the use of ensembles along with topology transfer learning within the context of neuroevolution is novel and has not been addressed earlier in the literature.

3. Committees of Convolutional Neural Networks

A committee of machine learning models, commonly referred to as an “ensemble”, is a set of models that operate together in order to provide a single response. In a classification problem, each model will return a class and then all responses will be considered to produce a single outcome.

In this paper, we will rely on the neuroevolutionary process described in the work by Baldominos et al. [3], which already implements specific mechanisms to preserve the diversity of the population of CNN topologies. This mechanism consists of a niching strategy where individuals more similar to others in the population are penalized. To do so, an adjusted fitness is computed for each individual from its non-adjusted fitness and from its similarity to the other individuals in the population, as defined in [3].

The similarity of two individuals is a value computed from their numbers of convolutional layers and of fully connected layers and from their hyperparameters. Individuals with a different number of layers are considered completely different. When the number of layers coincides, the similarity is the fraction of hyperparameters whose value is equal between both individuals. It follows that the image of the similarity function is in the range [0, 1], where 0 means that two individuals are completely different and 1 means that they share the exact same setup.
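The similarity measure described above can be illustrated with a minimal sketch. It assumes a hypothetical representation in which each layer is a tuple of hyperparameter values, and it only compares layer hyperparameters; the function name `similarity` and this representation are illustrative choices, not the implementation used in [3].

```python
from typing import Sequence, Tuple

# Hypothetical representation: a layer is a tuple of hyperparameter values,
# e.g. (n_kernels, kernel_size, activation, pooling) for a convolutional layer.
Layer = Tuple


def similarity(conv_a: Sequence[Layer], dense_a: Sequence[Layer],
               conv_b: Sequence[Layer], dense_b: Sequence[Layer]) -> float:
    """Return a value in [0, 1]: 0 = completely different, 1 = same setup."""
    # Individuals with a different number of layers are completely different.
    if len(conv_a) != len(conv_b) or len(dense_a) != len(dense_b):
        return 0.0
    # Pair up hyperparameters layer by layer, position by position.
    layers_a = list(conv_a) + list(dense_a)
    layers_b = list(conv_b) + list(dense_b)
    pairs = [(x, y)
             for layer_a, layer_b in zip(layers_a, layers_b)
             for x, y in zip(layer_a, layer_b)]
    if not pairs:
        return 1.0  # nothing to compare: treat as identical (assumption)
    # Fraction of hyperparameters whose value is equal in both individuals.
    equal = sum(1 for x, y in pairs if x == y)
    return equal / len(pairs)
```

For instance, two individuals with one convolutional and one dense layer each that differ only in the number of kernels share 8 of their 9 layer hyperparameters, so their similarity is 8/9.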

By using this niching strategy, we expect the increase in diversity to result in better ensembles, since models are guaranteed to be very different from each other.

During the neuroevolutionary process, the top 20 topologies found during the evolution are stored in a hall-of-fame and then trained for a longer time to come up with competitive models. Then, we will build committees of CNNs from the best models found during the neuroevolutionary procedure. To do so, we will sort the models based on their performance and then build committees by adding models one at a time, up to a maximum of 20 models. The committee will work based on a majority voting policy; that is, its response will correspond to the class returned by the majority of the models comprising the committee. In case of a tie, it will be resolved by choosing the class decided by the most competitive model involved in the tie.
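The committee-building procedure just described can be sketched as follows. The sketch assumes that each model's predictions are available as a list of class labels and that models are already sorted from most to least competitive; the names `committee_vote` and `committee_error_rates` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List, Sequence


def committee_vote(votes: Sequence[int]) -> int:
    """Majority vote; `votes` is ordered from most to least competitive model."""
    if not votes:
        raise ValueError("empty committee")
    counts = Counter(votes)
    top = max(counts.values())
    tied = {cls for cls, n in counts.items() if n == top}
    # On a tie, pick the class chosen by the most competitive model
    # among those involved in the tie (the first such vote in the list).
    for vote in votes:
        if vote in tied:
            return vote
    raise AssertionError("unreachable")


def committee_error_rates(predictions: Sequence[Sequence[int]],
                          labels: Sequence[int],
                          max_size: int = 20) -> List[float]:
    """Error rate of committees of size 1, 2, ..., adding models best-first.

    predictions[m][k] is the class predicted by model m for sample k.
    """
    rates = []
    for size in range(1, min(max_size, len(predictions)) + 1):
        members = predictions[:size]
        errors = sum(
            committee_vote([model[k] for model in members]) != label
            for k, label in enumerate(labels)
        )
        rates.append(errors / len(labels))
    return rates
```

With two models that disagree, the committee returns the answer of the better one, so a committee of size 2 behaves exactly like the best single model under this policy.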

3.1. Evolution of CNN Models

Following the procedure described by Baldominos et al. [3], we will use both genetic algorithms (GA) and grammatical evolution (GE) to automatically determine the topology and hyperparameters of CNNs that maximize the classification performance.

The GA encoding consists in a 69-bit binary string using Gray encoding. The chromosome encodes the following parameters:
(i) Input configuration: batch size from 25 to 200.
(ii) Convolutional layers: number of convolutional layers, and, for each convolutional layer, number of kernels, kernel size, pooling size (or no pooling), and activation function.
(iii) Dense layers: number of dense layers, and, for each dense layer, type of the layer (feedforward or recurrent), number of neurons, activation function, weights regularization (L1 or L2), and dropout rate.
(iv) Learning process: gradient descent function (stochastic gradient descent, Adam, Adamax, etc.) and learning rate.
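As a brief illustration of the Gray encoding mentioned above, the following sketch decodes a Gray-coded bit field into an integer and back. How the 69 bits are split into fields and how each integer maps to a hyperparameter value is not specified here, so the helper names and any field layout are assumptions.

```python
def gray_to_int(bits: str) -> int:
    """Decode a Gray-coded bit string (most significant bit first)."""
    value = 0
    acc = 0
    for ch in bits:
        # Each binary bit is the XOR of all Gray bits read so far.
        acc ^= int(ch)
        value = (value << 1) | acc
    return value


def int_to_gray(n: int, width: int) -> str:
    """Encode a non-negative integer as a Gray-coded bit string."""
    return format(n ^ (n >> 1), f"0{width}b")
```

The usual motivation for Gray coding in a GA is that consecutive integers differ in exactly one bit, so a single bit-flip mutation can move a hyperparameter to a neighbouring value.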

In GE, the approach is similar but the individuals’ phenotype is defined by a language generated by a grammar specified in Backus-Naur form (BNF), which can be found in Algorithm 1. The genotype in GE consists of a sequence of integers which are used to choose production rules from the grammar until a valid string is generated (i.e., there are no remaining non-terminal symbols). The set of valid strings (the language) corresponds to the search space of candidate solutions. This encoding provides better flexibility than the GA since it reduces redundancy.

<dnn> ::= <input> <conv_lys> <dense_lys> <opt_setup>
<input> ::= <batch_size>
<batch_size> ::= 25 | 50 | 100 | 150
<conv_lys> ::= <conv> | <conv> <conv> | <conv> <conv> <conv>
<conv> ::= <n_kernels> <k_size> <act_fn> <pooling>
<n_kernels> ::= 8 | 16 | 32 | 64 | 128 | 256
<k_size> ::= 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<pooling> ::= null | <p_size>
<p_size> ::= 2 | 3 | 4 | 5 | 6
<dense_lys> ::= <dense> | <dense> <dense> | <dense> <dense> <dense>
<dense> ::= <d_type> <n_units> <act_fn> <reg_fn> <dropout_r>
<d_type> ::= rnn | lstm | gru | feedforward
<n_units> ::= 32 | 64 | 128 | 256 | 512 | 1024
<act_fn> ::= relu | linear
<reg_fn> ::= null | l1 | l2 | l1l2
<dropout_r> ::= 0 | 0.5
<opt_setup> ::= <opt_type> <learn_rate> <batch_size>
<opt_type> ::= sgd | nesterov | momentum | adagrad | adamax | adam | adadelta | rmsprop
<learn_rate> ::= 5E-1 | 1E-1 | 5E-2 | 1E-2 | 5E-3 | 1E-3
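To make the genotype-to-phenotype mapping more concrete, the following sketch decodes a sequence of integers with the grammar of Algorithm 1 (copied here with an explicit `::=` separator). It assumes the conventions commonly used in grammatical evolution, which may differ from the implementation in [3]: the leftmost non-terminal is expanded first, each codon selects a production by taking it modulo the number of alternatives, rules with a single alternative consume no codon, and the genotype wraps around when exhausted. The names `parse_bnf` and `decode` are hypothetical.

```python
from typing import Dict, List, Sequence

BNF = """
<dnn> ::= <input> <conv_lys> <dense_lys> <opt_setup>
<input> ::= <batch_size>
<batch_size> ::= 25 | 50 | 100 | 150
<conv_lys> ::= <conv> | <conv> <conv> | <conv> <conv> <conv>
<conv> ::= <n_kernels> <k_size> <act_fn> <pooling>
<n_kernels> ::= 8 | 16 | 32 | 64 | 128 | 256
<k_size> ::= 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<pooling> ::= null | <p_size>
<p_size> ::= 2 | 3 | 4 | 5 | 6
<dense_lys> ::= <dense> | <dense> <dense> | <dense> <dense> <dense>
<dense> ::= <d_type> <n_units> <act_fn> <reg_fn> <dropout_r>
<d_type> ::= rnn | lstm | gru | feedforward
<n_units> ::= 32 | 64 | 128 | 256 | 512 | 1024
<act_fn> ::= relu | linear
<reg_fn> ::= null | l1 | l2 | l1l2
<dropout_r> ::= 0 | 0.5
<opt_setup> ::= <opt_type> <learn_rate> <batch_size>
<opt_type> ::= sgd | nesterov | momentum | adagrad | adamax | adam | adadelta | rmsprop
<learn_rate> ::= 5E-1 | 1E-1 | 5E-2 | 1E-2 | 5E-3 | 1E-3
"""


def parse_bnf(text: str) -> Dict[str, List[List[str]]]:
    """Map each non-terminal to its list of alternatives (lists of symbols)."""
    rules: Dict[str, List[List[str]]] = {}
    for line in text.strip().splitlines():
        head, body = line.split("::=")
        rules[head.strip()] = [alt.split() for alt in body.split("|")]
    return rules


def decode(genotype: Sequence[int], rules: Dict[str, List[List[str]]],
           start: str = "<dnn>") -> List[str]:
    """Expand the start symbol into a list of terminal symbols."""
    if not genotype:
        raise ValueError("genotype must contain at least one codon")
    pending = [start]   # symbols still to be processed, leftmost first
    phenotype = []      # terminals emitted so far
    used = 0            # number of codons consumed
    while pending:
        symbol = pending.pop(0)
        if symbol not in rules:
            phenotype.append(symbol)  # terminal symbol
            continue
        options = rules[symbol]
        if len(options) == 1:
            choice = options[0]       # no codon needed (assumption)
        else:
            codon = genotype[used % len(genotype)]  # wrap around
            choice = options[codon % len(options)]
            used += 1
        pending = list(choice) + pending
    return phenotype
```

Because the grammar is not recursive, every genotype decodes to a valid string in a bounded number of steps; for example, a genotype made only of zeros selects the first alternative of every rule.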

The learning procedure takes place as follows: the networks represented by each chromosome are trained for only 5 epochs, using a random 50% sample of the training set in each epoch. This yields reasonably good estimates of the performance of the networks while reducing the time needed to evaluate each individual; without this reduction, the evolutionary process would be unfeasible. When evaluating the fitness of an individual, the niching strategy described earlier is used to compute the adjusted fitness.

The complete procedure for evaluating each candidate solution is as follows:
(1) Translate the genotype into a phenotype by creating a CNN topology with the parameters specified by the genotype.
(2) Randomly initialize the network’s weights.
(3) Train the network for 5 epochs, using a random 50% of the data in each epoch.
(4) Compute the classification error of the network on a set that is different from the training set and assign the obtained error as the individual fitness, which must be minimized.
(5) Compute the adjusted fitness according to the niching strategy.
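The evaluation steps above can be outlined with a framework-agnostic sketch. The model-specific operations are passed in as callables (`build_model`, `train_epoch`, and `classify` are hypothetical names), so that no particular deep learning library is assumed; the last step, the niching adjustment, is left out because it depends on the rest of the population.

```python
import random
from typing import Any, Callable, Optional, Sequence, Tuple

Sample = Tuple[Any, Any]  # (input, label)


def evaluate_fitness(build_model: Callable[[], Any],
                     train_epoch: Callable[[Any, Sequence[Sample]], None],
                     classify: Callable[[Any, Any], Any],
                     train_set: Sequence[Sample],
                     val_set: Sequence[Sample],
                     epochs: int = 5,
                     sample_fraction: float = 0.5,
                     rng: Optional[random.Random] = None) -> float:
    """Return the classification error (to be minimized) of one individual."""
    rng = rng or random.Random()
    # Steps 1-2: build the network from the genotype with random weights
    # (both are delegated to the hypothetical build_model callable).
    model = build_model()
    # Step 3: train for a few epochs on a fresh random subset in each epoch.
    subset_size = int(len(train_set) * sample_fraction)
    for _ in range(epochs):
        subset = rng.sample(list(train_set), subset_size)
        train_epoch(model, subset)
    # Step 4: error rate on data not used for training.
    errors = sum(1 for x, y in val_set if classify(model, x) != y)
    return errors / len(val_set)
```

Drawing a new subset in every epoch means that, over the 5 epochs, the network is still likely to see most of the training data, while each epoch costs roughly half as much as a full pass.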

Throughout the execution of the evolutionary system, the best individuals found thus far are stored such that at the end of the process we will have a set called hall-of-fame, with the 20 best architectures found throughout the evolutionary process. The hyperparameters of each of the top 10 topologies for the GA are detailed in Table 1. Each of these architectures is trained for 30 epochs and without sampling. To avoid biases resulting from the stochastic nature of the process, each topology is trained 20 times. The results of this full training stage are summarized in Table 2 (each row corresponds to each topology in the hall of fame, whereas columns describe the distribution of performance of fully trained models).

3.2. Results and Discussion of Committees of CNNs

For the sake of clarity and brevity, throughout this section we will describe in more detail the results obtained using GA, which are slightly better than those attained by GE.

For the ensemble, we use the 20 models generated at the end of the evolutionary process (we only choose the best for each different topology) and follow a majority-voting policy: for each instance in the validation set, the class predicted by each of the 20 models is computed, and the class that obtains the largest consensus is returned.

The error rates of the 20 ensembles are shown in Figure 1 as the dark line. It is worth recalling that we are testing different ensemble sizes by adding models one at a time, adding first those with better performance. Light-colored diamonds in the figure show, just for reference, the average error rate of all the models included in the ensemble. Because classifiers are sorted by ascending error, this average increases as more classifiers are added.

As we expected, the error rate of a committee is lower than the average error rate of its components in all cases. It can be seen how the introduction of a few models decreases the error down to 0.28% (when 7 classifiers are used) and then stabilizes for a while, around 0.3%, until it starts to increase again when using more than 14 classifiers. In the case of GE, the best committee found led to an error rate of 0.29%.

It is worth noting that the best committee found in our research is the one involving the best 7 classifiers. As said before, this ensemble classifies the MNIST test set with an error rate of just 0.28%. To the best of our knowledge, when considering those works where no data augmentation is used at all, this result is only outperformed by the works by Chang and Chen [42] and by Bochinski et al. [28], whose proposals have both resulted in a test error rate of 0.24%.

An error rate of 0.28% over a test set of 10000 samples translates into 28 incorrectly classified samples. The confusion matrix of the best model is shown in Figure 2. The interpretation of this confusion matrix is very interesting: we can see how the most frequent error involves the number ‘9’ being classified as a ‘4’. Some other common mistakes involve recognizing ‘3’ instead of ‘5’ or ‘1’ instead of ‘7’. Mixing up these numbers may be acceptable if they are poorly written, as is the case of these samples. To be more specific, the 28 images that were misclassified are depicted in Figure 3. We can see how those handwritten digits are indeed very unclear or poorly written. For example, the fourth image in the first row could be either a ‘4’ or a ‘9’, the third image in the second row could be a ‘3’ or a ‘5’, etc. It can be difficult even for humans to recognize these numbers properly.

It is remarkable that the best classifier obtained using a single model translates into a test error rate of 0.40% (see Table 2), whereas using ensembles of neuroevolved models makes it possible to reduce this error rate down to 0.28%. This makes the use of evolutionary systems even more interesting, since it allows not only the automatic generation of the architectures and parameters of CNNs, already a complex and sophisticated task, but also a considerable improvement of the performance of the CNNs through the use of an ensemble of the evolved models, which manually would be almost impossible to attain.

4. Topology Transfer Learning

The second objective of this work is to verify whether a CNN model or committee optimized for a certain domain can be transferred directly to a different (but similar) domain, resulting in a performance that is good enough. This would prove the robustness of the models generated by evolution and would allow extending them to new data, without being forced to repeat the whole process again. In order to validate those transfer capabilities, we decide to use the EMNIST database, which shares structure with MNIST, the one used for evolving the models, and comprises a similar problem: handwriting recognition. We hypothesize that topologies that performed well with MNIST should also be able to attain reasonably good results with EMNIST. In this section, we will explore the problem of transferring a CNN topology learned using neuroevolution.

4.1. The EMNIST Database

The EMNIST (Extended MNIST) database was introduced in 2017 by Cohen et al. [43] and consists of a set of handwritten characters (both digits and letters).

Figure 4 shows ten samples for each letter in the EMNIST dataset, including both uppercase and lowercase variants, and two samples for each digit (in the last two columns).

The EMNIST database is derived from NIST Special Database 19 [44], which contains the entire corpus of training materials for handprinted document and character recognition of NIST (the US National Institute of Standards and Technology). It contains over 810,000 isolated characters from 3,699 writers [45] who filled in a form. These characters have been labelled after manual checking.

Authors releasing EMNIST admit that, in the past years, deep learning and convolutional neural networks have allowed scientists to achieve accuracies over 99.7% in the MNIST dataset, stating that at that point “the dataset labeling can be called into question” [43]. For this reason, they suggest that MNIST has become a non-challenging benchmark.

Even though NIST Special Database 19, from which EMNIST was extracted, has been available since 1995, it has remained mostly unused. The main reason is that this dataset was difficult to access and challenging to use on modern computers. In 2016, NIST released a second edition of this database [45], which is much easier to access.

In order to make EMNIST compatible with MNIST in terms of structure, the authors have applied a processing similar to the one used for the MNIST database. The result is a dataset that contains more instances than MNIST, includes letters in addition to digits, and, in consequence, is a more challenging benchmark for evaluating the performance of character recognition systems. This processing comprises the following steps:
(1) Original images in the NIST Special Database 19 are stored as 128x128-pixel BW images.
(2) A Gaussian blur filter is applied to soften the edges.
(3) Blank padding is removed, reducing the image to the region of interest (the actual character).
(4) The image is then centered in a square image while preserving the aspect ratio, padding it with a 2-pixel border.
(5) The image is downsampled to 28x28 pixels using bi-cubic interpolation.
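As an illustration, steps (3) and (4) of the processing above can be sketched in plain Python. This is a minimal sketch and not the code used by the EMNIST authors: the function names `crop_to_content` and `center_in_square` are hypothetical, the image is assumed to be a list of rows with 0 as the background value, and the blur and bi-cubic downsampling steps are deliberately left out.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale values, 0 = background (assumption)


def crop_to_content(image: Image) -> Image:
    """Step (3): drop the blank rows and columns around the character."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not rows:  # completely blank image: nothing to crop
        return [row[:] for row in image]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def center_in_square(image: Image, border: int = 2) -> Image:
    """Step (4): place the crop in the middle of a square canvas.

    The canvas side is the longer side of the crop plus `border` pixels on
    each side, so the character keeps its aspect ratio.
    """
    height, width = len(image), len(image[0])
    side = max(height, width) + 2 * border
    canvas = [[0] * side for _ in range(side)]
    top = (side - height) // 2
    left = (side - width) // 2
    for r in range(height):
        for c in range(width):
            canvas[top + r][left + c] = image[r][c]
    return canvas


# Toy 6x6 image containing a 3-pixel vertical stroke.
toy = [[0] * 6 for _ in range(6)]
for r in (2, 3, 4):
    toy[r][3] = 255
squared = center_in_square(crop_to_content(toy))
```

In the full pipeline, the resulting square image would then be downsampled to 28x28 pixels as described in step (5).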

As a result, each instance in the EMNIST database is a 28x28-pixel grayscale image, where each pixel is a number between 0 and 255.

In this paper we will use two different taxonomies provided by the EMNIST dataset:
(i) Digits: similar to MNIST, but with four times more instances (280,000 instead of 70,000).
(ii) Letters: this dataset contains only letters, and a distinction between uppercase and lowercase is not made. As a result, the dataset contains 26 classes and a total of 145,600 samples.

To the best of our knowledge, the EMNIST dataset is so new that few published works use it as a benchmark. The original EMNIST paper by Cohen et al. [43] includes a baseline using a linear classifier and OPIUM (Online Pseudo-Inverse Update Method), a classifier introduced by van Schaik and Tapson [46]. It is worth noting that the performance in the original EMNIST paper is reported in terms of accuracy instead of error rate. For this reason, in this section, we will use accuracy for reporting the performance of each work.

More recently, Peng and Yin [47] have used a Markov random field-based CNN, achieving an accuracy of 95.44% in the letters dataset and of 99.75% in the digits dataset. Also, Singh et al. [48] reported an accuracy of 99.62% in EMNIST Digits, using a CNN with three convolutional layers and two fully connected layers. In EDEN [25], the authors also tested their neuroevolution approach against the EMNIST Digits test set, obtaining an accuracy of 99.3%. The dataset has also been used by Neftci et al. [49] for testing the performance of event-driven random backpropagation when training neuromorphic deep neural networks, although they combined letters and digits data for classification, and therefore their performance cannot be compared with the results obtained in this work; and by Shu et al. [50], albeit with the purpose of pairwise classification.

Although EMNIST has rarely been used so far in other published works, some researchers have used NIST Special Database 19 in the past. It should be noted that these works are not directly comparable, because the database is not identical; however, results can be extrapolated. For example, Milgram et al. [51] reported an accuracy of 98.75% using SVMs with sigmoid function, Granger et al. [52] attained an accuracy of 96.49% using particle swarm optimization to evolve the topology of neural networks, and Oliveira et al. [53] reported an accuracy of 98.39% with a multi-layer perceptron.

Other authors have also used letters from NIST Special Database 19. For example, Radtke et al. [54] used record-to-record travel (accuracy of 96.53% for digits and 93.78% for letters), Koerich and Kalva [55] tested a multi-layer perceptron only with the letters dataset (accuracy of 87.79%), and Cavalin et al. [56] used hidden Markov models (accuracy of 98% for digits and up to 90% for letters, though this result uses only uppercase letters, and the accuracy decreases to 87% when lowercase letters are also considered).

Finally, to the best of our knowledge, Cireşan et al. [35] are the only authors to have used this database for testing the performance of committees of CNNs, attaining accuracies of 88.12% for the whole database, 92.42% for letters, and 99.19% for digits.

A summary of the reviewed works is shown in Table 3. The upper side of the table shows the performance of classical machine learning models and non-convolutional neural networks, whereas the lower side shows those works involving CNNs. Best results are boldfaced.

4.2. Transfer of Evolved CNN Topologies

In order to evaluate how the neural models obtained in the MNIST domain perform when directly transferred to the EMNIST domain, the 20 best topologies obtained by neuroevolution are selected and transferred to the new domain, where they follow a standard learning procedure. We have performed this task for both evolutionary algorithms (GA and GE) and for both the Letters and Digits domains. However, for clarity and brevity, in this paper we report only the results obtained by GE, which are similar to and slightly better than those obtained by GA.
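The transfer step can be pictured with the following minimal sketch. The `Topology` class and its fields are hypothetical simplifications introduced only for illustration, and the example layer sizes are made up. We also assume that only the output layer is resized to the number of classes of the target domain and that no trained weights are carried over, since each transferred topology is trained again on the new data.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Topology:
    """Hypothetical, simplified description of an evolved CNN."""
    conv_filters: Tuple[int, ...]  # filters per convolutional layer
    dense_units: Tuple[int, ...]   # units per hidden dense layer
    num_classes: int               # size of the output layer


def transfer_topology(source: Topology, num_classes: int) -> Topology:
    """Reuse the evolved layers; only the output layer is resized.

    No weights are carried over: the returned topology is meant to be
    trained from scratch on the target dataset.
    """
    return replace(source, num_classes=num_classes)


# Made-up example of a topology evolved for MNIST (10 digit classes).
mnist_topology = Topology(conv_filters=(32, 64), dense_units=(128,), num_classes=10)

emnist_digits = transfer_topology(mnist_topology, num_classes=10)
emnist_letters = transfer_topology(mnist_topology, num_classes=26)
```

Under these assumptions, the expensive search is run once on MNIST, and only the comparatively cheap training phase is repeated for EMNIST Digits and EMNIST Letters.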

After full training of the neuroevolved topologies, a statistical summary of the accuracies for each architecture is shown in Table 4 for the Letters dataset and Table 5 for the Digits dataset, showing the mean, median, standard deviation, and maximum and minimum values. It should be recalled that the results shown correspond to the accuracies obtained, over the test set, by the networks transferred from the MNIST domain to the Letters and Digits domains. We have reported the performance in terms of accuracy instead of error for consistency with most works in the state of the art.
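A summary of this kind can be computed with the Python standard library, as in the sketch below. The accuracies in `runs` are hypothetical values used only to exercise the function, not results from Tables 4 or 5, and the choice of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics
from typing import Dict, List


def summarize(accuracies: List[float]) -> Dict[str, float]:
    """Mean, median, standard deviation, maximum and minimum of a set of runs."""
    return {
        "mean": statistics.mean(accuracies),
        "median": statistics.median(accuracies),
        "std": statistics.stdev(accuracies),  # sample standard deviation (assumption)
        "max": max(accuracies),
        "min": min(accuracies),
    }


# Hypothetical test accuracies (%) of one topology over five training runs.
runs = [99.70, 99.72, 99.68, 99.71, 99.69]
summary = summarize(runs)
```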

It can be seen that the first individuals do not behave better than the rest, indicating that these topologies are not explicitly optimized for the EMNIST dataset. However, results are very good: the maximum accuracies obtained are over 99.7% in the Digits dataset and over 95% in the Letters dataset. If we compare the results obtained with our evolutionary system (refer to Tables 4 and 5) with those of the state of the art (shown in Table 3), the efficacy of the evolved models can be appreciated, even though the neuroevolution was carried out for a different (yet similar) domain. In both domains, the accuracy of our approach (95.19% for Letters and 99.73% for Digits) is only outperformed by the work by Peng and Yin [47] (95.44% and 99.75%, respectively).

Moreover, the distributions of accuracies for each topology after full training with the EMNIST training set are depicted in Figure 5 for the Letters domain and Figure 6 for the Digits domain. It is interesting to realize that results are very homogeneous for most individuals, and variance is very small in all cases. This points out the robustness of the method: it is easy to obtain competitive models even in one or few executions of the neuroevolution and full training phases. The variance is higher in the Letters dataset, since the domain is more complicated and the number of classes is larger (26 against 10), although the same conclusions apply to a lesser extent.

It is also worth noting that performance is consistent across both datasets. When a topology behaves especially well in one domain, it behaves particularly well in the other too. For example, the three best models in the Letters domain are 4, 6, and 2, and the same three models are also the best in the Digits domain.
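One simple way to check this kind of consistency is to compare the sets of top-ranked models in each domain. The sketch below is only illustrative: the helper `top_k` is hypothetical, and all the accuracies in the two dictionaries are made-up placeholders chosen so that models 4, 6, and 2 rank highest, as described above.

```python
from typing import Dict, Set


def top_k(accuracy_by_model: Dict[int, float], k: int) -> Set[int]:
    """Identifiers of the k models with the highest accuracy."""
    ranked = sorted(accuracy_by_model, key=accuracy_by_model.get, reverse=True)
    return set(ranked[:k])


# Placeholder accuracies (%) indexed by model identifier; not data from the paper.
letters = {1: 94.90, 2: 95.10, 4: 95.19, 6: 95.15, 7: 94.80}
digits = {1: 99.66, 2: 99.71, 4: 99.73, 6: 99.72, 7: 99.65}

shared_best = top_k(letters, 3) & top_k(digits, 3)
```

If the rankings agree, `shared_best` contains the same three identifiers for both domains.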

4.3. Topology Transfer with Committees of Evolved CNNs

Additionally, we find it interesting to explore how both mechanisms introduced in this paper perform when combined, i.e., whether committees for the EMNIST database can be built and obtain successful results given models whose topologies were designed for the MNIST dataset.

Figure 7 shows the evolution of the accuracy for the Letters dataset as new models are included. The best result in EMNIST Letters is an accuracy of 95.35% (error rate of 4.65%), with an ensemble involving 20 CNNs. The accuracy looks very steady when more than three CNNs are involved, yet results improve at the end, attaining better accuracy with a larger number of models.

The corresponding plot for the Digits dataset is shown in Figure 8. It is noticeable that, beyond 8 individuals, the accuracy stabilizes around 99.75% (error rate of 0.25%). The best accuracy is 99.7725% (error rate of 0.2275%), with a committee of six CNNs. The accuracy is very steady as new models are added to the ensemble: the worst ensemble of at least three CNNs obtains only 0.03 percentage points less than the aforementioned result.
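Curves such as those in Figures 7 and 8 can be obtained by evaluating the committee each time a new member is added. The following is a minimal sketch under two assumptions that are not taken from the text: the committee averages the class scores of its members, and members are added in a fixed, precomputed order. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence

Scores = Sequence[Sequence[float]]  # one row of class scores per sample


def committee_predict(member_outputs: Sequence[Scores]) -> List[int]:
    """Average the class scores of all members and take the arg-max."""
    n_samples = len(member_outputs[0])
    n_classes = len(member_outputs[0][0])
    predictions = []
    for i in range(n_samples):
        mean_scores = [
            sum(member[i][c] for member in member_outputs) / len(member_outputs)
            for c in range(n_classes)
        ]
        predictions.append(max(range(n_classes), key=mean_scores.__getitem__))
    return predictions


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Percentage of samples whose prediction matches the label."""
    hits = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * hits / len(labels)


def accuracy_curve(member_outputs: Sequence[Scores], labels: Sequence[int]) -> List[float]:
    """Accuracy of the committee formed by the first 1, 2, ..., n members."""
    return [
        accuracy(committee_predict(member_outputs[:size]), labels)
        for size in range(1, len(member_outputs) + 1)
    ]


# Toy example: three members, two samples, two classes.
labels = [0, 1]
model_a = [[0.6, 0.4], [0.7, 0.3]]  # wrong on the second sample
model_b = [[0.9, 0.1], [0.2, 0.8]]
model_c = [[0.4, 0.6], [0.4, 0.6]]  # wrong on the first sample
curve = accuracy_curve([model_a, model_b, model_c], labels)
```

In this toy case each of `model_a` and `model_c` makes one mistake on its own, but the averaged scores of the two- and three-member committees classify both samples correctly.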

Given these results, we can conclude that once again the use of committees improves the results over those attained with single models. In particular, the use of committees in the Letters dataset increases the accuracy from 95.19% (using a single model) to 95.35%, reducing the gap with the result reported by Peng and Yin [47] (95.44%). Regarding the Digits dataset, the accuracy rises from 99.73% to 99.7725%, a result that would head the ranking for this dataset.

Regarding the interpretation of the results, Figure 9 shows the confusion matrix for the EMNIST Letters dataset using the best ensemble found, which was obtained using the topologies optimized with GE. It can be seen that accuracy is almost perfect. The most common mistakes involve mixing up the letters ‘I’ and ‘L’, the letters ‘G’ and ‘Q’, and to a much lesser extent the letters ‘V’ and ‘U’. These seem like acceptable mistakes given the high similarity of these characters. Figure 10 shows a random sample of 100 misclassified images in the EMNIST Letters dataset. As we already knew from the confusion matrix, most misclassified samples are vertical bars which could be either an ‘L’ or an ‘I’ (notice that both are even more similar when comparing a lowercase ‘l’ with an uppercase ‘I’). Also, it can be seen that some characters are hardly recognizable even by a human.
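The pairs of classes that are most often mixed up can be read off a confusion matrix programmatically. The sketch below is illustrative only: the function names are hypothetical, and the short label and prediction lists are toy data rather than EMNIST results.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def confusion_counts(labels: Sequence[Hashable], predictions: Sequence[Hashable]) -> Counter:
    """Count (true label, predicted label) pairs, i.e. a sparse confusion matrix."""
    return Counter(zip(labels, predictions))


def most_confused_pairs(labels, predictions, top: int = 3) -> List[Tuple[tuple, int]]:
    """Rank unordered class pairs by the number of mistakes between them."""
    mistakes: Counter = Counter()
    for (true, pred), count in confusion_counts(labels, predictions).items():
        if true != pred:
            # Treat 'I' read as 'L' and 'L' read as 'I' as the same confusion.
            mistakes[tuple(sorted((true, pred)))] += count
    return mistakes.most_common(top)


# Toy data: two I/L confusions and one G/Q confusion.
true_letters = list("IILLGQVU")
predicted_letters = list("LIILGGVU")
pairs = most_confused_pairs(true_letters, predicted_letters)
```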

As for the confusion matrix for the Digits dataset using the best ensemble found, it is depicted in Figure 11. Again, most values are in the main diagonal, representing an almost perfect accuracy. The most remarkable mistakes involve confusing the digits ‘9’ and ‘4’, ‘3’ and ‘5’, and ‘2’ and ‘3’. To a lesser extent, the ensemble also mixes up the digits ‘6’ and ‘0’. In fact, only 91 instances from a total of 40,000 in the test set have been incorrectly classified, and these can be seen in Figure 12. Most of these digits are hardly recognizable. Actually, one instance seems to involve two digits in one (seventh row, eighth column). Others seem to be incomplete, and it is hard to tell whether they are a ‘5’ or a ‘3’ (e.g., first row, third column). Finally, the confusion between ‘4’s and ‘9’s seems to arise because either a digit ‘4’ is very rounded on the top or the digit ‘9’ seems to be slightly open, maybe due to the image being incomplete.

5. Conclusions

Convolutional neural networks are a very effective tool for numerous complex problems, and in particular for classification tasks. The main drawback of those systems is the dependence between network models, understood as the union of architecture and parameters, and the results they are able to accomplish. This dependence makes it difficult and expensive to find the optimal model for each problem. One way to avoid this difficulty is to use evolutionary systems to automatically find appropriate architectures in each case, an approach called neuroevolution, which has been used with success since the late 1980s but has only been applied to deep learning models in recent years.

In this work, we have focused on analyzing the potential of a neuroevolutionary system regarding two different lines of study: on one hand to improve the performance of the generated models, by means of the use of committees, and on the other hand to validate the robustness of such models to be transferred to new, similar, domains.

The use of committees exploits the property of evolutionary systems to generate a population of individuals, instead of working with a single solution. The use of multiple models allows the outcomes to be modulated in such a way that possible errors of individual systems can be corrected, provided that the models that make up the ensemble are different and effective enough not to distort the result of the best model. In this work, this effect is achieved through the inclusion of niching strategies in the evolutionary system, as well as the use of a historical set (hall-of-fame) of the best models.

Experiments have been carried out for different ensemble sizes, from 2 up to a maximum of 20, for the handwritten character classification task. Results show an improvement with respect to the isolated models in all cases. In addition, a more detailed analysis shows that most of the gain is obtained with small ensembles, of around seven models, beyond which improvements are less important. The results also show an insignificant dependency on the ensemble size from a size of four onwards, which allows us to conclude that large ensembles are not necessary once a critical minimum size is reached.

Regarding topology transfer learning, our hypothesis was that once suitable topologies were found using neuroevolution, these topologies should behave reasonably well over a different domain or problem which is similar to the one used for evolving the population. If the hypothesis is correct, then a lot of time could be saved, since neuroevolution is a computationally expensive process. In our work, the neuroevolutionary process searched for optimal topologies for the MNIST database, and then we transferred those topologies to EMNIST. EMNIST has been released recently and provides several databases: we have focused on EMNIST Digits, which is the same problem as MNIST but with data obtained from a different source, and EMNIST Letters, which is a similar yet different problem involving recognition of handwritten letters. Results have shown that transferred topologies are able to obtain a very high performance, attaining an accuracy of 99.73% in EMNIST Digits and 95.19% in EMNIST Letters.

These results show the great robustness of the models generated by our evolutionary system. Not only do they achieve the best results reported so far in these domains, but 18 out of the 20 models obtained also outperform the best state-of-the-art results in the Digits domain, and so do all 20 in the Letters domain. These results are even more significant if we take into account that one of the referred works also uses CNNs, in an ensemble of 7 models. This confirms not only the difficulty of finding effective models by hand, but also that models learned by our system in one domain, when applied directly to different but similar domains, can obtain better results than those designed by experts focused on the latter. Of course, future works may unveil models better adapted to those domains, thus resulting in a better performance.

Additionally, we have found that these two improvements are not mutually exclusive: when combined, they outperform single models, reaching accuracies of 99.7725% for EMNIST Digits and 95.35% for EMNIST Letters. This translates into an improvement of 0.0425 percentage points for EMNIST Digits and 0.16 percentage points for EMNIST Letters when compared to the best accuracy obtained with individual models.
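These figures can be checked with a few lines of exact decimal arithmetic. The sketch below uses only numbers quoted in this paper (the single-model and committee accuracies and the 40,000-sample EMNIST Digits test set); the variable names are ours.

```python
from decimal import Decimal

# Best accuracies (%) reported for single models and for committees.
single = {"digits": Decimal("99.73"), "letters": Decimal("95.19")}
committee = {"digits": Decimal("99.7725"), "letters": Decimal("95.35")}

# Improvement in percentage points: 0.0425 for Digits and 0.16 for Letters.
gain = {name: committee[name] - single[name] for name in single}

# Number of misclassified samples implied by the committee accuracy on the
# 40,000-sample EMNIST Digits test set: 0.2275% of 40,000 = 91.
test_size = 40_000
errors = (Decimal(100) - committee["digits"]) / 100 * test_size
```

The resulting error count of 91 agrees with the number of misclassified digits discussed in Section 4.3.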

Based on all the tests and analyses, the proposed approaches are recommended even if the process is time-consuming, since it is fully automated and the output can be used for building committees or to be applied to different, yet similar problems. Results empirically prove how successful both approaches are and support some of the benefits of using neuroevolution for determining the best topologies and hyperparameters of CNNs.

Data Availability

The databases used in this paper are publicly available for download. In particular, MNIST can be accessed from the following website: http://yann.lecun.com/exdb/mnist/, whereas EMNIST can be downloaded from the following site: https://www.nist.gov/itl/iad/image-group/emnist-dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is partially supported by the Spanish Ministry of Education, Culture and Sport under FPU fellowship with identifier FPU13/03917.