Abstract

Indoor scene classification plays a vital part in the environment cognition of service robots. With the development of deep learning, fine-tuning a CNN (Convolutional Neural Network) on a target dataset has become a popular way to solve classification problems. However, this approach cannot achieve satisfactory indoor scene classification results, because overfitting occurs when the scene training data are insufficient. To solve this problem, an indoor scene classification method is proposed in this paper, which uses the CNN features of scene images to generate scene category features and classifies scenes with a novel feature matching algorithm. The novel feature matching algorithm further improves the speed of scene classification, and overfitting is avoided even when the training data are limited. The presented method was evaluated on two benchmark scene datasets, the Scene 15 dataset and the MIT 67 dataset, achieving 96.49% and 81.69% accuracy, respectively. The experimental results show that our method is superior to other scene classification methods in terms of accuracy, speed, and robustness. To evaluate the method further, test experiments on previously unseen scene images from the SUN 397 dataset were carried out, and the models based on the two different training datasets obtained 94.34% and 79.80% test accuracy, respectively, which confirms that the proposed method performs well in indoor scene classification.

1. Introduction

Scene cognition is a key part of service robot cognition, and scene information can help improve the level of service a robot provides. Indoor scene classification is one of the most important tasks of a service robot, since it enables the robot to provide different services according to different scenes.

A great deal of research on indoor scene classification has been done. Traditional scene classification methods are usually based on hand-crafted vision features such as SIFT (scale-invariant feature transform) [1] and SURF (speeded up robust features) [2]. Reference [3] used an improved SIFT feature named RootSIFT to build a BoW (Bag of Words) model and combined it with selective attention for scene classification. In [4], the SPM (spatial pyramid matching) model was proposed on top of BoW to classify scenes. Reference [5] constructed a CLM (codebookless model) by extracting SURF features of scene images, which obtained better accuracy on indoor scene classification. However, these vision features are low-level image features without rich semantic information, so it is hard to get satisfactory results on complicated scene classification.

In recent years, deep learning has made huge progress on image classification [6, 7], object detection [8–10], and so on. Deep learning methods, especially CNNs, have become popular solutions for scene classification. A scene classification model based on deep CNN features was presented in [11]. In [12], the accuracy of scene classification was improved by transfer learning, in which a model pretrained on a large-scale dataset is fine-tuned on a new dataset; in this way, better accuracy can be obtained than by training from scratch. In [13], CNN feature transfer was employed to classify scenes and achieved good results. From the cited references we can see that deep learning methods perform much better in scene classification than hand-crafted vision features. However, some problems remain. A deep CNN model with a very large number of parameters needs sufficient training data even when trained by transfer learning, and training a CNN requires powerful GPUs for acceleration, which is expensive. If the training dataset is insufficient, overfitting is around the corner. Therefore, it is difficult to get satisfactory results on very limited indoor scene datasets by fine-tuning a pretrained CNN.

Aiming at the above problems, this paper proposes an indoor scene classification method for service robots based on CNN features. Different from the common approach of fine-tuning a CNN, our method uses the CNN features of scene images to generate scene category features and classifies indoor scenes with a new feature matching algorithm. The novel feature matching algorithm further speeds up scene classification, and overfitting is avoided even when training data are insufficient. The presented method was thoroughly evaluated on two benchmark scene datasets, the Scene 15 [4] dataset and the MIT 67 [14] dataset, and tested on completely new scene images drawn from a different source than the training datasets.

2. Overall Framework

Essentially, a CNN is a mapping from input to output that can learn many mapping relationships between inputs and outputs without requiring any precise mathematical expression. A CNN usually adopts alternating convolution and sampling layers. The convolution layers are used to extract image features, referred to as CNN features. The CNN features of a network pretrained on large-scale datasets contain abundant representation information. Therefore, an indoor scene classification method for service robots based on CNN features is proposed in this paper. The method contains two parts, and the overall framework is illustrated in Figure 1.

The first part is CNN feature extraction and processing. A CNN feature extraction model is built by reconstructing a pretrained CNN model. The output of the CNN feature extraction model is a one-dimensional feature vector with discriminative representation information. Each category of scene images is then processed by the model to create a scene category feature that represents that kind of scene in general; in this way, the category features of all scenes are obtained. This part is a learning process, and its main purpose is to obtain highly discriminative category features of the various scenes for the classification performed in the next part.

The second part is scene classification. A test scene image is fed into the same CNN feature extraction model to generate a CNN feature vector. The CNN feature vector is then matched against the scene category features by the proposed feature matching algorithm, which calculates a score for each scene category. The scene category is decided according to the maximum score.

3. Method Description

3.1. Obtaining Pretrained CNN Models

There are a lot of open source deep learning frameworks such as Theano [21], Caffe [22], and MXNet [23], which promote the development of deep learning. This paper is based on MXNet.

MXNet supports many programming languages and deep learning algorithms and provides diverse pretrained deep CNN models based on a variety of large-scale datasets. Three pretrained models (shown in Table 1) are selected from MXNet in this paper. The selected deep CNN models are all ResNets [7] with different numbers of layers. Model 1 and Model 2 were trained on a combined dataset including the ImageNet11K [24] dataset and the Places365 [25] dataset, achieving 0.3113 and 0.2255 top-1 accuracy, respectively. The ImageNet11K dataset includes 11,221 object categories and 11,797,630 images in total. Places365 is a scene dataset with 365 scene categories and 8,000,000 images. Model 3 was trained on the ImageNet11K dataset. After training on these large-scale datasets, the three models have a powerful capacity to extract CNN features.

3.2. Scene Category Feature

First, the CNN feature extraction model is built from a pretrained model by using the flatten layer instead of the softmax layer as the output. The architectures of the feature extraction models are shown in Table 2, and ReLU activation functions are used in the models. The output of the CNN feature extraction model is a one-dimensional vector of length 2048 that is rich in semantic information. The steps for generating a scene category feature are listed as follows.
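As a minimal sketch of how such a feature extraction model can be built with the MXNet Module API (the checkpoint prefix, the epoch number, the 224 × 224 input size, and the flatten-layer name below are assumptions for illustration, not values given by the paper):

```python
from collections import namedtuple

import mxnet as mx
import numpy as np

Batch = namedtuple('Batch', ['data'])

# Load a pretrained checkpoint (prefix and epoch are placeholders).
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-152', 0)

# Truncate the network at the flatten layer instead of the softmax output.
# 'flatten0_output' is a common name in MXNet ResNet symbols; verify it with
# sym.get_internals().list_outputs() for the specific checkpoint used.
feat_sym = sym.get_internals()['flatten0_output']

mod = mx.mod.Module(symbol=feat_sym, context=mx.cpu(), label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

def extract_cnn_feature(img_chw):
    """Return the 2048-dimensional CNN feature of one preprocessed image
    given as a float32 array of shape (3, 224, 224)."""
    mod.forward(Batch([mx.nd.array(img_chw[np.newaxis])]))
    return mod.get_outputs()[0].asnumpy().ravel()
```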

Suppose that the number of scene categories in a dataset is $M$ and each category has $N$ scene images.

Input. Scene images in training dataset.

Output. All scene category features.

Step 1. Put image $j$ from scene category $i$ into the CNN feature extraction model to create the feature vector $f_{ij}$ of the image, where $L$ is the length of the feature vector $f_{ij}$ ($L = 2048$ here).

Step 2. The CNN features $f_{i1}, f_{i2}, \ldots, f_{iN}$ of all $N$ images in scene category $i$ are generated according to Step 1. Then their mean value is computed by
$$F_i = \frac{1}{N} \sum_{j=1}^{N} f_{ij} \tag{1}$$
and $F_i$ is the scene category feature vector of scene category $i$.

Step 3. Each scene category feature vector is generated by Step 2, and all scene category feature vectors are $F = \{F_1, F_2, \ldots, F_M\}$.

Step 4. Each scene category feature vector is normalized into $\hat{F}_i$ by Z-Score:
$$\hat{F}_i = \frac{F_i - \mu_i}{\sigma_i} \tag{2}$$
where $\mu_i$ is the mean value of the elements of scene category feature $F_i$ and $\sigma_i$ is their standard deviation. If the original scene category feature vectors were used directly for analysis, elements with higher values would dominate the comprehensive analysis and elements with lower values would be weakened. Therefore, in order to ensure the reliability of the results, the original vectors need to be standardized; the resulting improvement is demonstrated in the subsequent experiments. Figure 2 shows how the element values of a scene category feature vector change after normalization.

After the above steps we obtain all scene category feature vectors $\hat{F}_1, \ldots, \hat{F}_M$ for scene classification. A test scene image is then put into the CNN feature extraction model to create its CNN feature vector $T$, which is also normalized into $\hat{T}$ by Z-Score.
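A minimal NumPy sketch of Steps 1–4, assuming `extract_cnn_feature` returns the 2048-dimensional vector described above (the function and variable names are ours, not the paper's):

```python
import numpy as np

def zscore(v):
    """Equation (2): standardize a feature vector to zero mean and unit variance."""
    return (v - v.mean()) / v.std()

def build_category_features(images_by_category, extract_cnn_feature):
    """Steps 1-4: one normalized category feature per scene category."""
    category_features = []
    for images in images_by_category:            # one list of training images per category
        feats = np.stack([extract_cnn_feature(img) for img in images])  # shape (N, L)
        category_features.append(zscore(feats.mean(axis=0)))            # equations (1) and (2)
    return np.stack(category_features)           # shape (M, L)
```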

3.3. Scene Category Feature Matching

Scene classification results are obtained by measuring the similarity between the scene category features $\hat{F}_i$ and the CNN feature $\hat{T}$ of a test scene image, so feature matching is a key point. A service robot should possess the capacity for real-time scene classification, which requires the feature matching algorithm to be fast enough. Some common feature vector matching algorithms are compared and analysed as follows, and a new feature vector matching algorithm is proposed in this paper.

The difference between a vector $X = (x_1, \ldots, x_L)$ and a vector $Y = (y_1, \ldots, y_L)$ can be expressed by a distance measure or a similarity measure. The most familiar distance measure is the Euclidean distance:
$$d(X, Y) = \sqrt{\sum_{k=1}^{L} (x_k - y_k)^2} \tag{3}$$
A larger Euclidean distance means a larger difference between the two vectors. The frequently used similarity measures are the Pearson correlation coefficient $r$ and the cosine similarity $\cos\theta$:
$$r(X, Y) = \frac{\sum_{k=1}^{L} (x_k - \bar{x})(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{L} (x_k - \bar{x})^2}\,\sqrt{\sum_{k=1}^{L} (y_k - \bar{y})^2}} \tag{4}$$
$$\cos\theta(X, Y) = \frac{\sum_{k=1}^{L} x_k y_k}{\sqrt{\sum_{k=1}^{L} x_k^2}\,\sqrt{\sum_{k=1}^{L} y_k^2}} \tag{5}$$
The output range of both $r$ and $\cos\theta$ is $[-1, 1]$, and a value close to 1 means that the two vectors are more similar. Suppose that vectors $X$ and $Y$ are normalized by Z-Score:
$$\hat{x}_k = \frac{x_k - \bar{x}}{\sigma_x}, \qquad \hat{y}_k = \frac{y_k - \bar{y}}{\sigma_y} \tag{6}$$
Substituting (6) into (5) gives
$$\cos\theta(\hat{X}, \hat{Y}) = \frac{\sum_{k=1}^{L} \hat{x}_k \hat{y}_k}{\sqrt{\sum_{k=1}^{L} \hat{x}_k^2}\,\sqrt{\sum_{k=1}^{L} \hat{y}_k^2}} \tag{7}$$
Since the normalized vectors $\hat{X}$ and $\hat{Y}$ have mean 0 and standard deviation 1, equation (4) can be simplified to
$$r(\hat{X}, \hat{Y}) = \frac{\sum_{k=1}^{L} \hat{x}_k \hat{y}_k}{\sqrt{\sum_{k=1}^{L} \hat{x}_k^2}\,\sqrt{\sum_{k=1}^{L} \hat{y}_k^2}} \tag{8}$$
It can be seen that the Pearson correlation coefficient is equal to the cosine similarity when the input vectors are normalized.
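This equivalence is easy to verify numerically; the snippet below is only an illustration of equations (4)–(8) on random vectors, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=2048), rng.normal(size=2048)

def zscore(v):
    # Equation (6): zero mean, unit standard deviation.
    return (v - v.mean()) / v.std()

xh, yh = zscore(x), zscore(y)

pearson = np.corrcoef(xh, yh)[0, 1]
cosine = xh @ yh / (np.linalg.norm(xh) * np.linalg.norm(yh))
print(np.isclose(pearson, cosine))   # True: (7) and (8) coincide after Z-Score
```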

From the variance formula
$$\sigma_x^2 = \frac{1}{L-1} \sum_{k=1}^{L} (x_k - \bar{x})^2 \tag{9}$$
we can get, for the normalized vectors,
$$\sum_{k=1}^{L} \hat{x}_k^2 = \sum_{k=1}^{L} \hat{y}_k^2 = L - 1 \tag{10}$$
so that (8) can be simplified by (9) and (10) into
$$r(\hat{X}, \hat{Y}) = \frac{1}{L-1} \sum_{k=1}^{L} \hat{x}_k \hat{y}_k \tag{11}$$
When $L$ is big enough, $L-1$ can be treated as $L$, so $\sum_{k=1}^{L} \hat{x}_k^2 \approx L$ and, in the same way, $\sum_{k=1}^{L} \hat{y}_k^2 \approx L$. There is thus a direct linear relationship between the square of the Euclidean distance and the Pearson correlation coefficient [15]. With these conditions, the square of the Euclidean distance can be unfolded as follows:
$$d^2(\hat{X}, \hat{Y}) = \sum_{k=1}^{L} (\hat{x}_k - \hat{y}_k)^2 = \sum_{k=1}^{L} \hat{x}_k^2 + \sum_{k=1}^{L} \hat{y}_k^2 - 2\sum_{k=1}^{L} \hat{x}_k \hat{y}_k \approx 2L\left(1 - r(\hat{X}, \hat{Y})\right) \tag{12}$$
In this sense, for normalized vectors the Euclidean distance is equivalent to the Pearson correlation coefficient as well as to the cosine similarity: all three produce the same ranking of candidate categories. Although the three feature matching algorithms share this property, their computation speeds differ, because each of them requires a square root operation, which is time-consuming. Inspired by formula (12), a new feature matching algorithm is presented that does not require a square root; the improvement in computation speed is demonstrated in the experiments. The CNN feature vector $\hat{T}$ of a test scene image is matched with the $M$ scene category feature vectors $\hat{F}_1, \ldots, \hat{F}_M$ by the proposed feature matching algorithm, which outputs the score matrix
$$S = [s_1, s_2, \ldots, s_M], \qquad s_i = \sum_{k=1}^{L} \hat{t}_k \hat{F}_{ik} \tag{13}$$
The index of the largest element in the score matrix is the predicted scene category; in this way we get the category of the test scene image.
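A minimal NumPy sketch of this matching step, following our reconstruction of (13) as a dot-product score (function and variable names are ours, not the paper's):

```python
import numpy as np

def match_scores(test_feature, category_features):
    """Equation (13): score of the normalized test feature against every
    normalized category feature; no square root is computed."""
    return category_features @ test_feature       # shape: (M,)

def classify(test_feature, category_features):
    """The index of the largest score is the predicted scene category."""
    return int(np.argmax(match_scores(test_feature, category_features)))
```

Because all vectors are Z-scored, ranking categories by this score gives the same ordering as cosine similarity, Pearson correlation, and (inversely) Euclidean distance, while avoiding their square roots.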

4. Experiments and Analysis

4.1. Datasets

Our method was evaluated on two benchmark scene datasets, the Scene 15 dataset and the MIT 67 dataset. The Scene 15 dataset includes 15 categories of outdoor and indoor scenes. We selected 5 categories of indoor scenes (living room, store, kitchen, PAR office, and bedroom), shown in Figure 3, because our study object, the service robot, mainly works indoors. There are 1,245 grayscale images in these 5 categories. The MIT 67 dataset contains 67 types of indoor scenes and 15,620 RGB images in total; all 67 categories of indoor scene images in MIT 67 are used to train the proposed models. Some examples of scene images in the MIT 67 dataset are shown in Figure 4. The experiments were designed as 5-fold cross-validation in order to make the results convincing and repeatable: the images of each category were divided into 5 parts, 1 part was used to create the scene category features, and the remaining parts were used for testing. All images were resized to the fixed input size of the pretrained networks.
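For illustration, a per-category 5-fold split of this kind might be implemented as follows (the shuffling and fold rotation are our assumptions; the paper does not give its splitting code):

```python
import numpy as np

def five_fold_splits(images, seed=0):
    """Yield (train_part, test_part) pairs for one scene category: in each fold,
    one fifth of the images builds the category feature and the rest are tested."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    folds = np.array_split(order, 5)
    for k in range(5):
        train_idx = folds[k]
        test_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        yield [images[i] for i in train_idx], [images[i] for i in test_idx]
```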

4.2. Experiment Results and Analysis

The proposed method was implemented in Python using the MXNet deep learning framework and run on a PC with Ubuntu 16.04.4, an Intel i5-6500 CPU, 32 GB of memory, and one NVIDIA GTX 1080 graphics card. In order to fully test the performance of our method, the following experiments were carried out.

(1) The Impact of Normalization on Classification Results. This experiment tested whether Z-Score normalization could improve scene classification accuracy. Two groups were set up for comparison: one used Z-Score to process the CNN feature vectors and the other did not. Cosine similarity was used to match features, since the three matching algorithms are not equivalent without Z-Score. The experiment results are listed in Table 3, and a histogram is provided in Figure 5 for comparison. From Table 3 and Figure 5 we can see that Z-Score normalization yields better accuracy on both the Scene 15 dataset and the MIT 67 dataset with all 3 pretrained models.

(2) Computational Speed Comparison of Feature Matching Algorithms. A service robot should be able to recognize scenes in real time, so the speed of indoor scene classification is significant. In order to verify the advantage of the proposed feature matching algorithm in computing time, the following experiments were carried out in contrast with other algorithms. Euclidean distance (ED), cosine similarity (CS), Pearson correlation coefficient (PCC), and the proposed feature matching algorithm were each tested on the two datasets based on the three different pretrained models. The scene classification results are shown in Table 4.
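As an illustration of how such a timing comparison can be run (with random vectors standing in for real features; this is not the paper's benchmark code):

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
cats = rng.normal(size=(67, 2048))      # normalized category features (M x L)
test = rng.normal(size=2048)            # normalized test feature

def euclidean(): return np.sqrt(((cats - test) ** 2).sum(axis=1)).argmin()
def cosine():    return (cats @ test / (np.linalg.norm(cats, axis=1)
                                        * np.linalg.norm(test))).argmax()
def proposed():  return (cats @ test).argmax()   # no square root needed

for fn in (euclidean, cosine, proposed):
    print(fn.__name__, timeit.timeit(fn, number=10000))
```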

From the results in Table 4 we can see that CS and PCC have similar processing speeds, which are much slower than ED and our algorithm, and that the proposed algorithm is faster than ED. In addition, the scale of the datasets and the number of layers of the pretrained models affect the scene classification speed. By contrast, our feature matching algorithm is able to meet the service robot's demand for real-time scene classification.

(3) Contrast with Transfer Learning on CNNs. A transfer learning strategy was used as a baseline: the fully connected layers of the pretrained CNN models were replaced according to the number of scene categories, and the networks were fine-tuned on the datasets. The training parameters were a learning rate of 0.0001, a batch size of 32, and 100 epochs for MIT 67 and 200 epochs for Scene 15. Figure 6 shows the training curves of fine-tuning the CNNs on the Scene 15 dataset and the MIT 67 dataset. Results of the comparison are listed in Table 5.
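For reference, a fine-tuning baseline of this kind might be set up with MXNet Gluon roughly as follows; the backbone choice, data loader, and initialization details are placeholders since the paper does not specify its training code:

```python
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon.model_zoo import vision

num_classes = 67                                    # 5 for the Scene 15 subset
ctx = mx.gpu(0)

net = vision.resnet50_v2(pretrained=True, ctx=ctx)  # placeholder backbone
net.output = gluon.nn.Dense(num_classes)            # replace the classifier head
net.output.initialize(mx.init.Xavier(), ctx=ctx)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-4})

def train_epoch(data_loader):
    """One pass over a DataLoader yielding (image batch, label batch) pairs."""
    for data, label in data_loader:                  # batches of size 32
        data, label = data.as_in_context(ctx), label.as_in_context(ctx)
        with autograd.record():
            loss = loss_fn(net(data), label)
        loss.backward()
        trainer.step(data.shape[0])
```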

The training accuracy curves in Figure 6 indicate that the CNN models fit the training data almost perfectly, reaching about 100% training accuracy on the Scene 15 dataset since the number of categories is small. However, the same high accuracy cannot be reached on the test data, and overfitting appears. On the MIT 67 dataset, the larger number of scene categories and the insufficient training data result in unstable accuracy curves and overfitting. From Table 5 we can see that the proposed method obtains better results without overfitting for every pretrained model on each dataset.

(4) Contrast with Other Advanced Scene Classification Methods. In order to verify the performance of our method, other advanced indoor scene classification methods were used as references in the following comparison tests. Table 6 shows the indoor scene classification performance of the different methods. The confusion matrices produced by CLM+SVM [5] and by our method are shown in Figure 7. Performance comparisons on the MIT 67 dataset are listed in Table 7, and Figure 8 shows the scene classification confusion matrices of the MIT 67 dataset for the three pretrained models with our method.

In Table 6 our method achieves higher scene classification accuracy and better efficiency than the other methods. The scene classification confusion matrices in Figure 7 indicate that our method is more robust than CLM+SVM [5]. For the classification of the large number of scenes in the MIT 67 dataset, it can be observed from Table 7 that our method with Model 2 obtains better results than the other methods. The classification confusion matrices of the MIT 67 dataset in Figure 8 demonstrate the robustness of our method.

(5) Test Experiments on Different Scene Data. To further evaluate the indoor scene classification ability of our method, a completely new scene dataset, SUN 397 [26], whose source is different from the training data, was used for test experiments. SUN 397 contains 397 well-sampled categories with 130,519 RGB scene images. All test images were resized to the same input size as before. To test the models trained on the Scene 15 dataset, 1,336 indoor scene images of the same 5 categories were selected from SUN 397, with about 267 images per scene. For the models trained on the MIT 67 dataset, 15,937 indoor scene images of the same 67 categories were used, with approximately 237 scene images per category. The test results are listed in Table 8. To further demonstrate the performance of our method, the scene classification test confusion matrices are shown in Figure 9 (models trained on Scene 15) and Figure 10 (models trained on MIT 67).

From the test results in Table 8 it can be concluded that our method classifies indoor scenes well on scene data from a completely different source, which indicates that the proposed feature matching algorithm captures the content-level differences between indoor scenes. Although the models were trained on grayscale images from the Scene 15 dataset, they obtain more than 92% accuracy with the different pretrained models, which is further evidence that our method classifies scenes based on the content and semantics of the image rather than on low-level features such as pixels, colour, and edge descriptors. In addition, Figures 9 and 10 show the test confusion matrices of scene classification by the models trained on the Scene 15 and MIT 67 datasets, respectively, which demonstrates the robustness of our method on the test scene data.

5. Conclusions

In this paper an indoor scene classification method for service robots based on CNN features is proposed. We use the CNN features of scene images to generate scene category features and classify indoor scenes with an improved feature matching algorithm, which further speeds up scene classification. The presented method is thoroughly evaluated on two benchmark scene datasets, the Scene 15 dataset and the MIT 67 dataset. Compared with the common approach of fine-tuning a CNN on the training dataset, our method obtains satisfactory accuracy without overfitting on a small amount of training data and does not need to be trained repeatedly. In contrast to other indoor scene classification methods, the scene classification results are greatly improved in terms of accuracy, classification speed, and robustness. The experimental results show that this method performs well in indoor scene classification and can meet the requirements of service robot indoor scene classification tasks. Nowadays, with the continuous development of computer hardware and cloud robots, the computing capacity of service robots has been greatly improved. Our next step is to further improve the scene cognition ability of service robots based on deep learning methods.

Data Availability

The data used to support the findings of this study are available from the public scene datasets.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (U1813215 and 61773239) and the Taishan Scholars Program of Shandong Province.