Abstract

The convolutional neural network (CNN) is an important approach to image classification and recognition. It can learn effective feature representations and has driven continuous breakthroughs in image recognition, but its training process is very time-consuming. At the same time, the random forest (RF) has the advantages of fast training and high classification accuracy. For the image classification and recognition problem, this paper proposes a CNN-based hybrid model that feeds the features extracted by the CNN into an RF for classification. Since a network with random weights can also yield valid results, no gradient algorithm is used to adjust the network parameters, which avoids a large amount of training time. Finally, experiments are conducted on the MNIST dataset and the rotated MNIST dataset, and the results show that the classification accuracy of the hybrid model is higher than that of RF and that its generalization ability is also improved.

1. Introduction

Handwritten digit recognition is a kind of pattern recognition included in character recognition technology, and it is the key technology for processing data such as financial statements, postal codes, and various bills [1]. In 1998, Lecun et al. proposed the handwritten digit recognition model LeNet-5, which was widely used to recognize handwritten digits on U.S. bank checks. The K-nearest neighbor (KNN) algorithm in [2, 3] achieved a classification error rate of 2.83% on the MNIST dataset. Support vector machines (SVM) and their improved algorithms have been widely used in classification tasks. In 2012, Niu et al. proposed the hybrid model CNN-SVM for digit recognition [4], using a CNN for feature extraction and an SVM as the classifier, combining the advantages of both and achieving good experimental results in image classification tasks. In 2014, Luo et al. [5] proposed the hybrid method ELM-SRC (sparse representation-based classification), which combined the advantage of SRC in processing noisy images with the fast training speed of ELM; experiments on the USPS handwritten dataset showed that it both improved the classification accuracy and ensured time efficiency.

CNN is a deep learning algorithm that is widely used in many fields, such as target recognition, scene classification, and face recognition [6]. A CNN learns layer by layer, and each layer automatically extracts different features from the input image; it works very well and is considered one of the representatives of general-purpose image recognition systems [7]. Usually, the neurons in a convolutional layer are connected to the previous layer through local receptive fields, the features of each local area are obtained by convolution, and secondary features are extracted by pooling in the pooling layer [8]. The structure of alternating convolutional and pooling layers makes it possible to tolerate input samples with certain distortions [9, 10]. However, the CNN requires the backpropagation (BP) algorithm to adjust its parameters. Random forest (RF) was proposed in 2001; it achieves high accuracy in classification and regression, trains quickly, is not prone to overfitting, and also performs well against noise [11, 12]. Existing RF-based classifiers rely on hand-selected features; however, manual feature selection is very time-consuming, requires domain expertise, and whether good results can be achieved depends to some extent on experience and luck. The literature [13, 14] shows that a network structure can achieve good results even with random, unpretrained weights.

The papers [15, 16] generated realistic original images using a randomly initialized network without any training; the papers [17, 18] proposed a deep neural network and RF model to detect retinal vessels in fundus images in 2015 and achieved an accuracy of 93.27% on the DRIVE dataset. Based on the above issues, a hybrid model is proposed in this paper. In the hybrid model, the features are extracted with a CNN with random weights and then handed over to an RF to complete the classification, which greatly reduces the time spent on feature extraction, overcomes the problem of the long training time of CNN, and avoids the drawback of manual feature selection in RF [19, 20].

2. Convolutional Neural Networks

The combination of convolutional, downsampling (pooling), and fully connected layers constitutes the most classical CNN network structure. The convolutional layer performs linear convolution with filter kernels, and a nonlinear activation function is then applied to compute the extracted features.

Figure 1 shows an example of a classical CNN neural network structure.

2.1. Convolutional Layers

The convolutional layer is the most critical component of a CNN. Its parameters are a set of trainable convolution kernels, also called filters, which are used to extract low-dimensional features from high-dimensional data; each kernel has a relatively small size so that feature maps of the right size can be extracted without losing useful information. Each convolutional layer contains a certain number of convolution kernels, which is a hyperparameter of the CNN that must be specified empirically, and each kernel computes one feature map. A feature map means that we have extracted some features of the input image; i.e., the original three-dimensional image becomes a two-dimensional feature map. The combination of all feature maps is the output data, which can be used for further feature extraction or as the final feature extraction result. Multiple convolution kernels are used to extract different aspects of the features, such as color, contour, and background.

The depth of an ordinary deep neural network is mainly reflected in its many layers, which leads to a dramatic increase in the number of parameters, while the most important feature of convolutional layers, compared with common fully connected networks, is that the parameters can be reduced substantially, even by orders of magnitude. If a $k \times k$ convolution kernel (with the same depth as the image) is convolved over an $n \times n$ image with stride $s$, we obtain a new image of size $\left(\frac{n-k}{s}+1\right) \times \left(\frac{n-k}{s}+1\right)$, which is the feature map. The stride can be set to other values according to our needs, but if $n-k$ is not divisible by the stride, we need to pad the image so that the kernel can take an integer number of steps; the important thing is that the division comes out exact.
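To make this size relation concrete, the following minimal Python sketch (an illustration written for this discussion, not code from the paper) computes the output size and performs a naive single-channel "valid" convolution; it assumes square inputs and kernels with no padding, and all function names are illustrative.

```python
import numpy as np

def conv_output_size(n, k, s=1):
    """Output side length for an n x n input, k x k kernel, stride s, no padding."""
    assert (n - k) % s == 0, "pad the input so the kernel takes an integer number of steps"
    return (n - k) // s + 1

def conv2d(image, kernel, s=1):
    """Naive 'valid' convolution (implemented as cross-correlation, as is common in CNNs)."""
    n, k = image.shape[0], kernel.shape[0]
    out = conv_output_size(n, k, s)
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i * s:i * s + k, j * s:j * s + k]
            result[i, j] = np.sum(patch * kernel)
    return result

# Example: a 28 x 28 MNIST image with a 5 x 5 kernel and stride 1 gives a 24 x 24 feature map.
print(conv_output_size(28, 5, 1))  # 24
```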

2.2. Pooling Layers

CNNs generally alternate a convolutional layer with a following pooling layer. The most intuitive function of pooling is to reduce the dimensionality, so that the number of parameters in the layer is also reduced, making computation simpler and faster while extracting important features and preserving invariance. The most common approach is to downsample the input with a filter of size 2 × 2, so that four pixel values are combined into one. Each max-pooling operation takes the largest of the four numbers (some region of the input image). The depth of the image does not change, and the image size is reduced while preserving as much of the original information as possible.

As shown in Figures 2 and 3, max-pooling has the advantage of not increasing the number of parameters to be adjusted and is generally more accurate than other methods because it highlights salient features, while mean-pooling tends to produce smoother results.
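As an illustration of the two pooling variants, here is a minimal sketch (assuming non-overlapping 2 × 2 windows and an input side length divisible by the pooling size; the function name is illustrative):

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Downsample a square feature map with non-overlapping size x size windows."""
    n = fmap.shape[0]
    assert n % size == 0
    out = n // size
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size]
            result[i, j] = window.max() if mode == "max" else window.mean()
    return result

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 1., 5., 6.],
              [2., 2., 7., 8.]])
print(pool2d(x, mode="max"))   # [[4. 2.] [2. 8.]]
print(pool2d(x, mode="mean"))  # [[2.5  1.  ] [1.25 6.5 ]]
```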

2.3. Training Process

The CNN is trained in a supervised manner, and the process is roughly as follows.

2.3.1. FC Layer

(1) Calculate the output of the fully connected layer as $x^{\ell} = f\left(W^{\ell} x^{\ell-1} + b^{\ell}\right)$, where $f$ represents the activation function; here, we use the sigmoid function.

The error loss of training sample $n$ is $E^{n} = \frac{1}{2}\sum_{k=1}^{c}\left(t_{k}^{n} - y_{k}^{n}\right)^{2}$, where $c$ indicates that there are a total of $c$ classes in the multicategory problem, $t_{k}^{n}$ is the target value, and $y_{k}^{n}$ is the network output. (2) Weight update.

For convenience of presentation, we define the derivative of the error with respect to the bias as the sensitivity $\delta$, whose expression is $\delta = \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u}$, since $\frac{\partial u}{\partial b} = 1$.

The sensitivity of layer $\ell$ in back propagation can be expressed as $\delta^{\ell} = \left(W^{\ell+1}\right)^{T}\delta^{\ell+1} \circ f'\left(u^{\ell}\right)$, where $\circ$ denotes the element-wise product.

For the bias in the output layer (at each layer, $b$ is a vector), the partial derivative of the error with respect to the bias of layer $\ell$ is $\frac{\partial E}{\partial b^{\ell}} = \delta^{\ell}$.

For the weights (at each layer, $W$ is a matrix), the partial derivative is $\frac{\partial E}{\partial W^{\ell}} = \delta^{\ell}\left(x^{\ell-1}\right)^{T}$.

Here, $x^{\ell-1}$ is the input of layer $\ell$, which is also the output of the upper layer, so the final amount of change in the weights is $\Delta W^{\ell} = -\eta\,\frac{\partial E}{\partial W^{\ell}}$, where $\eta$ is the learning rate.
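To ground these formulas, the following minimal numpy sketch runs one forward and backward pass for a single fully connected layer with a sigmoid activation and squared-error loss; the dimensions and variable names are illustrative, not those of the paper's network.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
x_prev = rng.random(20)                  # output of the previous layer
W = rng.standard_normal((10, 20)) * 0.1  # weights of the fully connected layer
b = np.zeros(10)                         # biases
t = np.eye(10)[3]                        # one-hot target vector

# forward: x = f(W x_prev + b)
u = W @ x_prev + b
y = sigmoid(u)

# squared-error loss: E = 1/2 * sum_k (t_k - y_k)^2
E = 0.5 * np.sum((t - y) ** 2)

# sensitivity: delta = dE/db = (y - t) * f'(u), with f'(u) = y * (1 - y) for the sigmoid
delta = (y - t) * y * (1 - y)

# gradients and update with learning rate eta
dW = np.outer(delta, x_prev)             # dE/dW = delta * x_prev^T
db = delta
eta = 0.1
W -= eta * dW
b -= eta * db
```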

2.3.2. Convolution Layer

In the convolution layer, the output is obtained after the input is convolved with the filter kernels and passed through the activation function.

From the weight-update rule above, we know that if we want to obtain the change in the weights of each neuron in layer $\ell$, we must first obtain its corresponding sensitivity $\delta^{\ell}$. In order to find this sensitivity, we need to first sum over the sensitivities of the nodes in the next layer $\ell+1$ (to obtain $\delta^{\ell+1}$) and then use the back-propagation rule for sensitivities to calculate the sensitivity corresponding to each neuron node in the current layer $\ell$. Next, the weights are updated.

Because the downsampling operation is often performed after the convolutional layer, the sensitivity map passed back from the pooling layer does not match the size of the convolutional layer's output, so it must first be upsampled. The sensitivity of feature map $j$ in layer $\ell$ is obtained by applying the weight $\beta$ of the pooling map: $\delta_{j}^{\ell} = \beta_{j}^{\ell+1}\left(f'\left(u_{j}^{\ell}\right) \circ \mathrm{up}\left(\delta_{j}^{\ell+1}\right)\right)$, where $\mathrm{up}(\cdot)$ means upsampling, whose form depends on the pooling method described earlier. Now, for a given map, we can compute the gradient of the bias by summing over all entries of the sensitivity map: $\frac{\partial E}{\partial b_{j}} = \sum_{u,v}\left(\delta_{j}^{\ell}\right)_{uv}$.
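As a minimal sketch of the upsampling step and the bias gradient (assuming 2 × 2 mean pooling, in which each pooled sensitivity is spread evenly over its window; for max pooling only the winning position would receive the value; names are illustrative):

```python
import numpy as np

def upsample(delta_next, size=2, mode="mean"):
    """Expand a pooled sensitivity map back to the pre-pooling size."""
    up = np.kron(delta_next, np.ones((size, size)))
    if mode == "mean":
        up /= size * size      # mean pooling spreads the gradient evenly over the window
    return up

delta_pool = np.array([[0.2, -0.1],
                       [0.4,  0.3]])   # sensitivity map of the pooling layer (2 x 2)
delta_conv = upsample(delta_pool)      # sensitivity propagated back to the 4 x 4 conv output
grad_b = delta_conv.sum()              # bias gradient: sum over all entries of the map
```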

Finally, the gradient of the weights of the convolution kernel is calculated using Matlab’s convolution function.

2.3.3. Pooling Layer

For the pooling layer, the number of input feature maps is not changed; the input feature maps are only made smaller: $x_{j}^{\ell} = f\left(\beta_{j}^{\ell}\,\mathrm{down}\left(x_{j}^{\ell-1}\right) + b_{j}^{\ell}\right)$, where $\mathrm{down}(\cdot)$ denotes a pooling function. Each output feature map corresponds to a multiplicative weight $\beta$ and an additive bias $b$ of its own.

Here, again, we have to obtain the sensitivity $\delta$ before we can update the weight $\beta$ and the bias $b$. If this layer is fully connected to the next layer, the sensitivities of this layer can be calculated directly by BP. However, when the next layer is a convolutional layer rather than a fully connected one, the sensitivity is calculated here again with the help of the convolution function.

The gradient of the bias is then calculated as in the convolution layer, by summing over the entries of the sensitivity map. The gradient of the multiplicative weight $\beta_{j}$ is calculated as follows: $\frac{\partial E}{\partial \beta_{j}} = \sum_{u,v}\left(\delta_{j}^{\ell} \circ \mathrm{down}\left(x_{j}^{\ell-1}\right)\right)_{uv}$.

3. Random Forest

An RF consists of classification trees, and its basic idea is to combine multiple weak classifiers into one strong classifier. A classification tree consists of different nodes: the root node represents the training set, each internal node represents a weak classifier that splits the samples according to a certain attribute, and each leaf node carries a class label, so the tree partitions the input data into several subsets. The final decision of the RF is the optimal result chosen by voting over all classification trees.

The Gini index is used to decide the optimal binary cut point for a feature; it represents the uncertainty of a set. In the classification problem, suppose there are $K$ classes; for a given set of samples $D$, the Gini index is defined as $\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{|C_{k}|}{|D|}\right)^{2}$, where $C_{k}$ is the subset of samples in $D$ that belong to the $k$th class. If the sample set $D$ is divided into two parts $D_{1}$ and $D_{2}$ according to whether feature $A$ takes the value $a$ or not, i.e., $D_{1} = \{(x, y) \in D \mid A(x) = a\}$ and $D_{2} = D - D_{1}$,

then, conditional on feature $A$, the Gini index of the set $D$ is defined as $\mathrm{Gini}(D, A) = \frac{|D_{1}|}{|D|}\mathrm{Gini}(D_{1}) + \frac{|D_{2}|}{|D|}\mathrm{Gini}(D_{2})$.

$\mathrm{Gini}(D, A)$ represents the uncertainty of the set $D$ after partitioning by feature $A$. When constructing a classification tree, the feature with the smallest Gini index and its corresponding optimal binary cut point are selected (a minimal sketch of this computation is given after the following steps). The RF is constructed using the Gini index minimization criterion with the following steps.
(1) Using the bootstrap resampling method, the $k$th sample set is drawn with replacement from the original sample set $D$ and denoted $D_{k}$, and a random vector $\theta_{k}$ is generated for the $k$th classification tree, independently and identically distributed with the previous random vectors. In this paper, we use $h(X, \theta_{k})$ to represent the $k$th classification tree model.
(2) Build a classification tree on each sample set. The generation of a classification tree is the process of recursively building a binary classification tree, using the feature with the smallest Gini index to split the tree.
(3) The final classification result is obtained by voting over the results of all classification trees.
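A minimal sketch of the Gini computation used for split selection (assuming integer class labels; the function names are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2"""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(labels_d1, labels_d2):
    """Gini(D, A) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)"""
    n = len(labels_d1) + len(labels_d2)
    return len(labels_d1) / n * gini(labels_d1) + len(labels_d2) / n * gini(labels_d2)

# Example: a split that separates the two classes perfectly has conditional Gini 0.
print(gini([0, 0, 1, 1]))          # 0.5
print(gini_split([0, 0], [1, 1]))  # 0.0
```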

The flow of constructing the random forest is shown in Figure 4.

4. Hybrid Model Based on Deep Learning and Random Forest

4.1. Model Structure

The hybrid model structure is shown in Figure 5; the main improvement is that the features from the output layer of the CNN are classified by an RF. First, feature extraction from the image is done with convolutional and pooling layers with random weights, and the extracted features are fed into the RF classifier to obtain the classification result. The number of filters in the convolutional layers greatly affects the generalization ability of the model; based on experience, the values of N1 and N2 in the model are set to 10 and 20, respectively.

In a CNN, part of the image area (the local receptive field) is used as the input of the bottom layer of the network and is then transmitted through the layers in turn, with each layer applying multiple filters to compute the most significant features. The local receptive fields allow the neurons to detect the most basic features of the image, such as edges or corners, and the neurons within each layer share their weights. A pooling layer for extracting secondary features follows each convolutional layer. This structure is able to obtain salient features that are invariant to translation, scaling, skew, and rotation.

Whether features are designed manually or learned by deep learning, the aim is always to obtain good features that reflect the nature of the original data, which matches the intuition that good features lead to good results. The ELM (extreme learning machine) applies a random projection to the original data, projecting the original information into a certain space at random and giving up the pursuit of good features in exchange for solution speed, and it has obtained very good results in some tasks; therefore, classification on random features is worth exploring.

RF is an ensemble classifier that overcomes the overfitting tendency of a single decision tree, is more resistant to noise and anomalies, runs relatively fast, and remains efficient for large amounts of data.

Our hybrid model exploits the advantages of CNN in feature extraction and of RF in speed and resistance to overfitting: a CNN with random weights automatically extracts the features, which are then used as the input of the RF classifier. This avoids the large amount of time the CNN would otherwise consume in training while achieving better classification accuracy.

4.2. Training Process
4.2.1. Extraction of Features

Step 1. Network initialization and random initialization of weights and biases.

Step 2. The convolutional layer extracts features. The convolution kernels are applied to the input to obtain the output $x_{j}^{\ell} = f\left(\sum_{i \in M_{j}} x_{i}^{\ell-1} * k_{ij}^{\ell} + b_{j}^{\ell}\right)$. Here, $M_{j}$ represents the set of input feature maps.

Step 3. The pooling layer extracts features. For the pooling layer, the number of feature maps remains the same, but the input feature maps are made smaller: $x_{j}^{\ell} = f\left(\beta_{j}^{\ell}\,\mathrm{down}\left(x_{j}^{\ell-1}\right) + b_{j}^{\ell}\right)$, where $\mathrm{down}(\cdot)$ denotes a pooling function.
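Putting the three steps together, here is a minimal sketch of random-weight feature extraction for a single image, assuming one convolution/pooling stage with N1 = 10 filters (as in Section 4.1), 5 × 5 kernels, a sigmoid activation, and 2 × 2 mean pooling; this is an illustration of the idea, not the exact network of the paper.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
N1 = 10                                       # number of filters in the first stage
kernels = rng.standard_normal((N1, 5, 5)) * 0.1
biases = np.zeros(N1)

def extract_features(image):
    """Random-weight convolution + sigmoid + 2x2 mean pooling, flattened to one vector."""
    feats = []
    for k, b in zip(kernels, biases):
        fmap = 1.0 / (1.0 + np.exp(-(correlate2d(image, k, mode="valid") + b)))
        h, w = fmap.shape
        pooled = fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        feats.append(pooled.ravel())
    return np.concatenate(feats)

image = rng.random((28, 28))                  # stand-in for one MNIST image
print(extract_features(image).shape)          # (1440,) = 10 maps of 12 x 12
```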

4.2.2. Replacement Classifier

The output of the C5 layer is used as the extracted features, serving as the training set and the test set for building a random forest as follows.

Input: sample set; number of split attributes.
Step1: Select samples from the sample set using bootstrap sampling.
Step2: Randomly select attributes and choose the best split attribute to build a CART decision tree.
Step3: Repeat Step1 and Step2 the required number of times to build the CART decision trees.
Step4: Form a random forest from the CART trees; for the test set, decide which class each sample belongs to by voting over the results of the trees; the proportion of samples whose vote differs from the correct classification label is the classification error rate of the RF.
Output: the random forest of trees.
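A minimal sketch of this replacement-classifier step, using scikit-learn's RandomForestClassifier to stand in for the RF; the feature matrices below are random stand-ins for the CNN-extracted features, so this illustrates the pipeline rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-ins for the CNN-extracted features and labels of the training/test sets.
X_train, y_train = rng.random((1000, 1440)), rng.integers(0, 10, 1000)
X_test, y_test = rng.random((200, 1440)), rng.integers(0, 10, 200)

# Bootstrap sampling and random attribute selection happen inside the forest:
# each tree sees a bootstrap sample and considers a random subset of features per split.
rf = RandomForestClassifier(n_estimators=100,     # Ntree
                            max_features="sqrt",  # number of split attributes per node
                            bootstrap=True,
                            random_state=0)
rf.fit(X_train, y_train)
error_rate = 1.0 - rf.score(X_test, y_test)       # fraction misclassified by majority vote
print(f"test error rate: {error_rate:.3f}")
```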

5. Experiments and Results

To evaluate the classification performance of our hybrid model, we conducted experiments on the well-known MNIST and rotated MNIST datasets.

5.1. MNIST Dataset

Let Ntree be the number of trees in the RF. When Ntree is too small, the classification accuracy will not reach the desired level. Since RF is not prone to overfitting, we can make Ntree as large as possible to ensure classification accuracy, but building the RF will then take much more time, so the value of Ntree is important for both the performance and the complexity of the RF. In order to avoid the long training time of the CNN, we use random weights: after the CNN extracts the features, they are input to the RF for classification. For comparison, we run the experiments with different values of Ntree. Table 1 shows the test error rate under different Ntree values.

From Table 1, it can be seen that the classification accuracy of the hybrid-RF model is better than that of the RF at every value of Ntree.

Figure 6 shows the results of the RF and the hybrid-RF model, from which we can see that the test error rate of the hybrid model is lower than that of RF at the different Ntree values.

Figure 7 shows the relationship between the classification error and the number of hidden-layer neurons of the ELM (extreme learning machine) classifier on the MNIST dataset; the final training error is 0.66%, and the test error is 2.47%.

Many methods have been evaluated on the MNIST dataset, and Table 2 lists the performance of several of them. Among these, CKELM (convolutional extreme learning machine with kernel) is a convolutional neural network with random weights used to extract features, with the classifier replaced by a kernel extreme learning machine; its error rate on the MNIST dataset is 3.20%. Results for the DAEs network are also listed. CNN-0 is a convolutional neural network with random-weight filter kernels whose weights are not adjusted, and CNN-1 is the convolutional neural network after 50 training iterations; their error rates can be seen from the table.

Under the given hardware conditions, the CNN takes about 190 s for one iteration, so 50 iterations consume close to 3 hours. The RF itself runs very fast, taking only 20 minutes to train on the MNIST dataset for the chosen Ntree, and the hybrid model takes even less time to train than the original RF because the dimensionality of the data is lower than the original. Our model greatly reduces the time for feature extraction while the accuracy is maintained.

Because random weights are effective, we use a CNN with random weights, which does not require a gradient descent algorithm to adjust the parameters and also avoids the problem of sensitive learning-rate selection; combined with the speed and efficiency of RF, this yields the hybrid model, whose effectiveness is also confirmed by the experimental results.

5.2. Rotated MNIST Dataset

To further illustrate the effectiveness of our model, we selected the rotated MNIST dataset from the MNIST variants for a comparison test; in it, the digit images of the MNIST dataset are rotated uniformly between 0 and 2π. Here, we randomly select 50,000 samples from the rotated MNIST dataset as training data and 10,000 as test data and run the same experiments as on the MNIST dataset for the RF and hybrid models with different Ntree values. Table 3 shows the error rates of the RF and hybrid models on the rotated MNIST dataset.

Figure 8 shows the experimental results of RF and our hybrid model on the rotated MNIST dataset, from which we can see that our hybrid model outperforms RF for the different Ntree values, which again validates the effectiveness of our model and shows that it has better generalization ability than RF.

Table 4 shows the comparison of the experimental results on the rotated MNIST dataset. CNN-0 is the CNN with random weights and CNN-1 is the CNN after 50 iterations of training.

6. Conclusion

For the image classification and recognition problem, we propose a hybrid model in this paper. In the hybrid-RF model, a CNN with random weights extracts the features, and the classification is then completed by an RF. The model therefore greatly reduces the time spent on feature extraction, effectively overcomes the problem of the long training time of CNN, and avoids the manual feature selection required by RF. Extensive experimental results show that the proposed hybrid-RF model has superior performance and can effectively solve image classification and recognition problems.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares no conflicts of interest regarding this work.