Abstract

Along with the strong representation ability of convolutional neural networks (CNNs), image classification tasks have achieved considerable progress. However, the majority of works focus on designing complicated and redundant architectures for extracting informative features to improve classification performance. In this study, we instead concentrate on rectifying the incomplete outputs of CNNs. Concretely, we propose an image classification method based on Label Rectification Learning (LRL) through the kernel extreme learning machine (KELM). It consists of two steps: (1) preclassification, which extracts incomplete labels through a pretrained CNN, and (2) label rectification, which rectifies the generated incomplete labels with the KELM to obtain the rectified labels. Experiments conducted on publicly available datasets demonstrate the effectiveness of our method. Notably, our method is extensible and can be easily integrated with off-the-shelf networks to improve performance.

1. Introduction

Image classification is a fundamental and challenging task in computer vision that has received ever-increasing interest. In general, image classification aims to distinguish image categories according to their semantic information. It is widely applied in real-world applications, including face recognition [1], traffic sign detection [2], and other applications [3–7]. Traditional image classification methods often adopt hand-crafted features (e.g., SIFT [8] and HOG [9]) combined with classical classifiers (e.g., SVM [10]). However, they often perform poorly due to the limited representation ability of such features.

Convolutional neural networks (CNNs) have brought great progress to image classification tasks. The works in [11–13] design small convolutional kernels to increase the depth of CNNs. The authors in [14, 15] build relations between different convolutional layers and alleviate the vanishing gradient problem in deep networks through shortcut connections. SKNet [8] allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information via a dynamic selection mechanism. Res2Net [16] represents multiscale features at a granular level and increases the range of receptive fields for each network layer. All of the above-mentioned methods try to improve CNN architecture design to boost classification performance, but complex architectures with huge numbers of parameters often lead to heavy computational loads and poorly trained models. These frameworks consist of two major components, i.e., a feature extractor and a softmax classifier. The feature extractor provides a powerful feature representation, and these features are normalized by the softmax function to obtain predicted probabilities. However, the softmax function encourages the output scores to be dramatically distinctive, which potentially leads to overfitting. Some studies consider improving the discriminative ability of features by replacing softmax with other machine learning algorithms [17–20], e.g., the Support Vector Machine (SVM) or the Extreme Learning Machine (ELM). Although considerable progress has been achieved in image classification, existing methods still face two limitations: (1) extracting high-dimensional features and manipulating the resulting matrices is time-consuming, and (2) most works rarely exploit label-wise relations for image classification.

To solve these problems, we propose a novel method to learn to rectify labels (LRL) with the kernel extreme learning machine (KELM), which calibrates the label output of a CNN to obtain a more accurate distribution. KELM improves upon ELM by adopting a kernel function for feature mapping. Because the kernel function can map labels into an infinite-dimensional space, the separability of the label distribution can be further enhanced. Figure 1 schematically illustrates our framework. It consists of two steps, i.e., preclassification and label rectification. In preclassification, we feed images into a trained CNN and take the corresponding classification results as incomplete labels. These labels may greatly deviate from the ground truth. Hence, we exploit label-wise relations through label rectification. The final classification result is obtained through random kernel mapping and linear combination. It is worth noting that KELM is more suitable for multiclass classification than KSVM [21].

In summary, our main contributions are threefold: (i) to the best of our knowledge, this is the first time a label rectification method is proposed for image classification, which exploits label-wise relations to obtain a more accurate distribution; (ii) we present a novel image classification framework (LRL) that combines a CNN with KELM and generalizes well to different CNN architectures; (iii) our experiments on publicly available datasets demonstrate the efficiency and effectiveness of our method.

The rest of this paper is organized as follows. Section 2 introduces the preclassification step. Section 3 briefly introduces label rectification. Section 4 presents the experimental results, and Section 5 concludes the paper.

2. Preclassification

Deep convolutional neural networks (CNNs) have recently achieved unprecedented success in image classification. A CNN is generally viewed as two components: a feature extractor and a softmax classifier. The feature extractor stacks layers in depth to obtain a strong representational capability for input images. Most existing efforts try to improve the architecture of the feature extractor (increasing depth [22], multiscale kernel sizes [23], attention mechanisms [24], etc.) to learn more complicated mappings. Some methods notice the limitation of the softmax classifier under nonlinear conditions: the softmax function encourages the output scores to be dramatically distinctive and drives the model to assign full probability to the ground-truth label of each training sample, which potentially leads to overfitting. Therefore, they replace it with other machine learning models (SVM [17], ELM [20], etc.). However, all these methods ignore the predicted label information, which is generally regarded as the final classification result of a CNN. In addition, according to existing observations, CNNs are sensitive to hyperparameters, so it is difficult to train the network well, and it lacks adaptability in real-world applications.

The observations above motivate us to exploit the label information output by a CNN to boost classification performance. This paper denotes these labels as incomplete labels because we conjecture that they still have potential for improvement. In our method, we therefore extract labels instead of features from the CNN. This can be formulated as
$$\mathbf{y} = f(\mathbf{x}; \theta), \quad (1)$$
where $f$ is a well-trained CNN, $\mathbf{x}$ is the input image, $\mathbf{y} \in \mathbb{R}^{c}$ is the extracted label (with $c$ the number of categories), and $\theta$ denotes the parameters of the CNN. Compared with features in fully connected layers [25], extracted labels contain the predicted scores of each image category and are relatively low-dimensional, so subsequent operations are more efficient. In addition, for a pretrained CNN, the proposed method can drastically improve performance according to our experiments. It is worth noting that we do not use the softmax function to normalize the extracted labels in our method.
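As a concrete illustration, the following is a minimal PyTorch-style sketch of the preclassification step, assuming a torchvision ResNet50 whose final layer has been replaced for the target dataset; the checkpoint path, data loader, and function names are illustrative assumptions, not the authors' actual code.

```python
import torch
from torchvision import models

# Minimal sketch of preclassification: extract pre-softmax logits (the
# "incomplete labels") from a pretrained CNN. Paths and loader are illustrative.
def extract_incomplete_labels(model, loader, device="cuda"):
    model.eval().to(device)
    all_logits, all_targets = [], []
    with torch.no_grad():
        for images, targets in loader:
            logits = model(images.to(device))   # raw class scores, no softmax
            all_logits.append(logits.cpu())
            all_targets.append(targets)
    return torch.cat(all_logits), torch.cat(all_targets)

# Example: a ResNet50 whose final layer was replaced for 257 Caltech-256 outputs.
# model = models.resnet50(weights=None)
# model.fc = torch.nn.Linear(model.fc.in_features, 257)
# model.load_state_dict(torch.load("resnet50_caltech256.pth"))
# train_logits, train_targets = extract_incomplete_labels(model, train_loader)
```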

3. Label Rectification

Given an incomplete label $\mathbf{y}$, we aim to revise it with a model $g$. To train $g$ efficiently, we utilize the ELM to rectify incomplete labels. The ELM, proposed by Huang et al. [21], is a single-hidden-layer feedforward neural network (SLFN) that randomly chooses its hidden nodes and analytically determines the output weights. ELM tends to provide good generalization performance at an extremely fast learning speed, and its hidden layer need not be tuned. ELM consists of three layers: an input layer, a hidden layer, and an output layer. Its structure is shown in Figure 2.

When the incomplete label $\mathbf{y}$ is passed through an ELM network, the hidden layer maps it to a high-dimensional space, which increases the universal approximation ability of the ELM. The output function of ELM for generalized SLFNs can be described as
$$f_L(\mathbf{y}) = \sum_{i=1}^{L} \beta_i h_i(\mathbf{y}) = \mathbf{h}(\mathbf{y})\boldsymbol{\beta}, \quad (2)$$
where $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_L]^{T}$ is the output weight vector of the hidden layer and $h_i(\mathbf{y})$ is the output of the $i$th hidden node; $\mathbf{h}(\mathbf{y}) = [h_1(\mathbf{y}), \ldots, h_L(\mathbf{y})]$ maps the $c$-dimensional label to the $L$-dimensional hidden-layer feature space. The output functions of the hidden nodes need not be unique, and different output functions may be used in different hidden neurons. The mapping $\mathbf{h}(\mathbf{y})$ can be defined as
$$h_i(\mathbf{y}) = G(\mathbf{a}_i, b_i, \mathbf{y}), \quad \mathbf{a}_i \in \mathbb{R}^{c}, \; b_i \in \mathbb{R}, \quad (3)$$
where $G(\mathbf{a}, b, \mathbf{y})$ is a nonlinear piecewise continuous function satisfying the ELM universal approximation capability theorems, and $(\mathbf{a}_i, b_i)$ are randomly generated according to any continuous probability distribution. The sigmoid function
$$G(\mathbf{a}, b, \mathbf{y}) = \frac{1}{1 + \exp\left(-(\mathbf{a} \cdot \mathbf{y} + b)\right)} \quad (4)$$
is a conventional choice for ELM.
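The random hidden-layer mapping in (2)–(4) can be sketched in a few lines of NumPy; the dimensions and the uniform sampling range below are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def elm_hidden_map(Y, W, b):
    """Sigmoid hidden-layer mapping: h_i(y) = G(a_i, b_i, y) as in (3)-(4).
    Y: (n, c) incomplete labels; W: (c, L) random weights a_i; b: (L,) biases."""
    return 1.0 / (1.0 + np.exp(-(Y @ W + b)))

rng = np.random.default_rng(0)
c, L = 257, 1000                         # label dimension and hidden-node count (illustrative)
W = rng.uniform(-1.0, 1.0, size=(c, L))  # random hidden weights a_i
b = rng.uniform(-1.0, 1.0, size=L)       # random hidden biases b_i
# H = elm_hidden_map(train_logits.numpy(), W, b)   # hidden-layer output matrix H
```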

KELM is a variant of ELM that adopts the Gaussian kernel function
$$K(\mathbf{u}, \mathbf{v}) = \exp\left(-\gamma \|\mathbf{u} - \mathbf{v}\|^{2}\right) \quad (5)$$
as the ELM feature mapping. Because the kernel function can map labels into an infinite-dimensional space, the separability of the label distribution can be further enhanced. This encourages an output that generalizes better.
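A small NumPy sketch of the pairwise Gaussian kernel in (5); the bandwidth gamma is a hyperparameter whose value is not specified here and is chosen purely for illustration.

```python
import numpy as np

def rbf_kernel(U, V, gamma=0.1):
    """Pairwise Gaussian kernel: K(u, v) = exp(-gamma * ||u - v||^2).
    U: (n, c), V: (m, c) -> (n, m) Gram matrix; gamma is illustrative."""
    sq_dists = (
        np.sum(U ** 2, axis=1, keepdims=True)
        - 2.0 * U @ V.T
        + np.sum(V ** 2, axis=1)
    )
    return np.exp(-gamma * sq_dists)
```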

KELM minimizes the training error as well as the norm of the output weights. With hidden-layer output matrix $\mathbf{H}$ and training target matrix $\mathbf{T}$, the learning objective can be formulated as
$$\min_{\boldsymbol{\beta}} \; \|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\| \;\; \text{and} \;\; \|\boldsymbol{\beta}\|. \quad (6)$$
The minimal norm least square method used in ELM provides the solution for $\boldsymbol{\beta}$:
$$\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T}, \quad (7)$$
where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of matrix $\mathbf{H}$. The orthogonal projection method can be used to calculate $\mathbf{H}^{\dagger}$: when $\mathbf{H}^{T}\mathbf{H}$ is nonsingular,
$$\mathbf{H}^{\dagger} = \left(\mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}, \quad (8)$$
or, when $\mathbf{H}\mathbf{H}^{T}$ is nonsingular,
$$\mathbf{H}^{\dagger} = \mathbf{H}^{T}\left(\mathbf{H}\mathbf{H}^{T}\right)^{-1}. \quad (9)$$
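The following NumPy sketch solves (7) using the nonsingular case (8); the small ridge term is our own addition for numerical stability and is not part of the formulation above, and the variable names are illustrative.

```python
import numpy as np

def solve_output_weights(H, T, reg=0.0):
    """beta = H^dagger T via the nonsingular case (8): (H^T H)^{-1} H^T T.
    The ridge term `reg` is our addition to guard against ill-conditioning."""
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + reg * np.eye(L), H.T @ T)

# Example (hypothetical names): one-hot targets for the training set.
# T = np.eye(num_classes)[train_targets]
# beta = solve_output_weights(H, T, reg=1e-3)
```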

Once the value of $\boldsymbol{\beta}$ is obtained, we pass the incomplete label $\mathbf{y}$ through the well-trained ELM or KELM, and the rectified label is given by
$$\hat{\mathbf{y}} = \mathbf{h}(\mathbf{y})\boldsymbol{\beta}. \quad (10)$$
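Putting the pieces together, the sketch below shows one possible end-to-end implementation of kernel-based rectification, reusing the rbf_kernel helper from the sketch after (5). The closed form with a regularization constant C follows the standard KELM formulation of Huang et al. and is our assumption about the concrete implementation; gamma, C, and the variable names are illustrative.

```python
import numpy as np
# Assumes rbf_kernel(U, V, gamma) as defined in the sketch after (5).

def kelm_train(Y_train, T, gamma=0.1, C=100.0):
    """Closed-form KELM on incomplete labels Y_train with one-hot targets T:
    alpha = (I / C + Omega)^{-1} T, where Omega = K(Y_train, Y_train).
    gamma and C are illustrative hyperparameters, not values from the paper."""
    omega = rbf_kernel(Y_train, Y_train, gamma)
    return np.linalg.solve(np.eye(len(Y_train)) / C + omega, T)

def kelm_rectify(Y_test, Y_train, alpha, gamma=0.1):
    """Rectified label vectors for test samples: K(Y_test, Y_train) @ alpha."""
    return rbf_kernel(Y_test, Y_train, gamma) @ alpha

# Usage with logits extracted as in Section 2 (names are illustrative):
# T = np.eye(num_classes)[train_targets]
# alpha = kelm_train(train_logits.numpy(), T)
# pred = kelm_rectify(test_logits.numpy(), train_logits.numpy(), alpha).argmax(1)
```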

4. Experiments

4.1. Label Rectification Evaluation

To evaluate the effectiveness of the proposed method, we conduct experiments on CIFAR-10 and CIFAR-100 [26]. These two datasets are widely used image classification benchmarks consisting of 32 × 32 natural scene images. The training and test sets contain 50k and 10k images, respectively. To evaluate our method adequately, we utilize two classical CNNs, i.e., VGG19 [22] and ResNet50 [27], and set two baselines [26, 28] for comparison. Table 1 shows our experimental results.

LRL improves CNN classification performance. On CIFAR-10, ResNet50 achieves 90.79% accuracy, while ResNet50-LRL (KELM) achieves 92.28%. On CIFAR-100, VGG19-LRL (KELM) outperforms VGG19 by 1.39% accuracy. From these comparisons, we conclude that the proposed method can effectively rectify the incomplete labels output by CNNs. Compared with LRL (ELM), LRL (KELM) has more potential for image classification tasks, which demonstrates that the RBF kernel is more appropriate for exploiting label relations.

4.2. Comparison with State-of-the-Art Methods

In this section, we compare our method with state-of-the-art methods on the Caltech-256 dataset [14]. It contains more classes and fewer samples per class than CIFAR, with 256 object classes and a total of 30,607 images; each category has a minimum of 80 images. Following [20], we use 60 randomly selected images per class as the training set and the rest as the test set, and we resize all images to a fixed resolution. VGG19 [22], ResNet18 [27], Zhu et al. [20], SqueezeNet [11], VGG19-BN [30], and Inception-V3 [31] are introduced for comparison, and we adopt ResNet50 for preclassification. Table 2 shows our experimental results.

Notably, ResNet50-LRL (KELM) also works well on Caltech-256 and outperforms all other state-of-the-art approaches, achieving up to 80.40% Top-1 and 92.95% Top-5 accuracy. Compared with Zhu et al. [20], our method gains 11.96% Top-1 and 2.4% Top-5 accuracy, and ResNet50-LRL (KELM) outperforms its base model by 1.46% Top-1 and 0.32% Top-5 accuracy. These results demonstrate the superiority of our approach on image classification.

4.3. Classifier Comparisons

In our method, we adopt the KELM to rectify incomplete labels. To evaluate this choice, we compare it with several other classifiers: linear SVM, kernel SVM (KSVM), a neural network (NN), and a sigmoid-based ELM (ELM). Besides, we include the method of Li et al. [20], which replaces the softmax classifier with a KELM; this strategy may cause two problems. First, features may contain more redundant information than labels. Second, the dimension of features is larger than that of labels, which consumes more training time. In our method, we extract labels whose dimension equals the number of image classes. We use a trained ResNet50 for preclassification. Table 3 exhibits our experimental results.

Both Li et al. and LRL (KELM) are effective in improving image classification performance, whereas the other classifiers have a negative impact on accuracy. Li et al. improves the accuracy of the base model by about 1.12%, and LRL (KELM) improves it by about 1.07%. In contrast, the experiments show that LRL (KELM) trains much faster. In conclusion, LRL (KELM) achieves a better balance between accuracy and training time.

4.4. Training Observation

In the training phase, we fine-tune an ImageNet-pretrained network on a specific dataset (e.g., Caltech-256) to obtain a well-trained CNN for preclassification. However, badly trained CNNs should also be considered. Therefore, we apply the LRL method at every training epoch of ResNet50 and plot the Top-1 accuracy of ResNet50, ResNet50-LRL (ELM), and ResNet50-LRL (KELM). The visual comparisons are provided in Figure 3.

Clearly, KELM-based LRL (green line) substantially improves ResNet50 (blue line) in the very early epochs, which shows that our method is effective for poorly trained CNNs. Although the gap between them narrows in later epochs, KELM-based LRL still outperforms ResNet50. Besides, we notice that ELM-based LRL (orange line) fails to improve ResNet50 after the 13th epoch.

4.5. Implementation Details

For the Caltech-256 dataset, we modify the final output of VGG19 and ResNet50 to 257 classes. These CNNs are trained with a batch size of 32 for 50 epochs. For the CIFAR-10 and CIFAR-100 datasets, we replace the fully connected layer of ResNet50 with 11 and 101 outputs, respectively. ResNet50 is trained for 300 epochs with a batch size of 128; the initial learning rate is set to 0.0001 and is divided by 10 at 30% and 75% of the training epochs. All CNNs are trained by stochastic gradient descent with a weight decay of 0.01 and a Nesterov momentum of 0.9. In all experiments, images are randomly flipped and cropped before being passed into the networks. In all our simulations, the ELM uses sigmoid additive hidden nodes and RBF hidden nodes, and all hidden node parameters are randomly generated from a uniform distribution. Experiments are conducted on an NVIDIA Titan Xp GPU.
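For reference, here is a hedged PyTorch sketch of the optimizer and learning-rate schedule described above; the exact training script is not given in the paper, so the function name and structure are illustrative.

```python
import torch

# Sketch of the training setup described above (standard PyTorch components).
def build_optimizer(model, epochs=300, lr=1e-4):
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=lr,                 # initial learning rate 0.0001 for the CIFAR runs
        momentum=0.9,          # Nesterov momentum of 0.9
        nesterov=True,
        weight_decay=0.01,     # weight decay of 0.01
    )
    # Divide the learning rate by 10 at 30% and 75% of the training epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer,
        milestones=[int(0.3 * epochs), int(0.75 * epochs)],
        gamma=0.1,
    )
    return optimizer, scheduler
```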

5. Conclusion

In this paper, instead of designing complicated and redundant CNNs, we explore label-wise relations for label rectification and propose a method (LRL) that learns label rectification through the kernel extreme learning machine to improve image classification accuracy. Compared with intermediate features, label-wise features contain less redundant information and have a lower dimension, so LRL achieves comparable accuracy while saving time during training and inference. In addition, through training observations, we find that LRL is particularly effective at rectifying the labels of pretrained models. Extensive experiments conducted on public datasets demonstrate the effectiveness of LRL (KELM). To the best of our knowledge, this is the first label rectification approach for image classification. In the future, we will develop a method to learn label rectification directly from a source dataset to a target dataset.

Data Availability

CIFAR10 and CIFAR100: http://www.cs.toronto.edu/~kriz/cifar.html; Caltech256: http://vision.caltech.edu/Image_Datasets/Caltech256/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Qiang Cai and Fenghai Li contributed equally to this work.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (2018YFB0803700), National Natural Science Foundation of China (No. 61877002), Beijing Municipal Commission of Education PXM2019 014213 000007, Beijing Natural Science Foundation, and Fengtai Rail Transit Frontier Research Joint Fund 19L00005.