Abstract

A new scene classification method is proposed based on the combination of local Gabor features with a spatial pyramid matching model. First, new local Gabor feature descriptors are extracted from dense sampling patches of scene images. These local feature descriptors are embedded into a bag-of-visual-words (BOVW) model, which is combined with a spatial pyramid matching framework. The new local Gabor feature descriptors have sufficient discrimination abilities for dense regions of scene images. Then the feature vectors of scene images can be obtained efficiently by the k-means clustering method and visual word statistics. Second, in order to decrease classification time and improve accuracy, an improved kernel principal component analysis (KPCA) method is applied to reduce the dimensionality of the pyramid histogram of visual words (PHOW). The principal components with larger interclass separability are retained in the feature vectors, which are used for scene classification by the linear support vector machine (SVM) method. The proposed method is evaluated on three commonly used scene datasets. Experimental results demonstrate the effectiveness of the method.

1. Introduction

Scene classification is an appealing and challenging problem in image processing and machine vision. The goal of scene classification is to automatically classify scene images into specific scene categories such as mountain, street, forest, and inside city. Scene classification methods have many applications, such as video retrieval, content-based image retrieval, UAV autonomous landing, and intelligent vehicle navigation [1]. Moreover, scene classification can provide an important cue for object recognition and detection, action recognition, and other computer vision tasks.

Scene classification methods can be divided into two main categories. The early methods mainly use low-level global features (e.g., texture and color) extracted from a whole image [2, 3]. These methods often exhibit poor classification performance because they lack an intermediate image description, which is extremely valuable in determining the scene category. More recent methods make use of semantic models [4]. They describe the contents of scene images by a semantic intermediate representation, and they can be divided into intermediate representation methods based on local semantic concepts and those based on global semantic concepts.

The intermediate representation methods based on local semantic concepts make use of features extracted from local regions of scene images [5, 6]. They generally represent the scene image by a collection of local descriptors obtained using segmentation, dense sampling patches, or interest point detectors. These methods are widely used due to their effectiveness, especially the bag-of-visual-words (BOVW) model [7, 8]. The BOVW model extracts local feature descriptors of scene images, obtains visual words by clustering, and then uses histograms of visual words to represent images. The BOVW model has obtained good performance, but it also has some limitations. Because the BOVW model represents scene images by an orderless collection of local descriptors [9], all spatial relationships within scene images are lost. This loss of spatial position information affects the accuracy of scene classification [10]. The weakness of the BOVW model can be mitigated by a spatial pyramid matching framework [11], in which a scene image is partitioned into increasingly fine grids and the histogram of visual words inside each subregion is computed. The pyramid matching framework has achieved encouraging performance, and many of the best scene categorization methods are now based on this scheme.

The intermediate representation methods based on global semantic concepts take the scene image as a whole when obtaining global description features. The “Gist” model is the most prominent of these methods and has exhibited good performance in many applications [12, 13]. In this model, scene images are convolved with multiscale and multiorientation Gabor filters. The filtering results are then divided into a 4 × 4 grid, and the means of all subregions are computed and assembled to yield the feature vectors [14]. Lastly, the “Gist” features are used for scene classification. Because the “Gist” feature is computed over this sparse grid, it is coarse-grained, and some detailed information of the scene image is lost. When scene images are complex, the classification performance of the “Gist” model is not very good. For example, when scene datasets include some categories of indoor environments, the classification accuracy of the “Gist” model drops dramatically.

In this study, we present a new method for scene classification using local Gabor features. The proposed method not only solves the coarse-grained problem of the “Gist” feature but also utilizes the spatial information of the pyramid matching model. In addition, the proposed method extracts the principal components of the feature vectors of scene images by an improved KPCA algorithm, which can retain more category information. Finally, linear “1-a-r” (one-against-rest) SVMs are used for scene classification. To evaluate the performance of the proposed method, three scene datasets are used for classification testing. We also investigate the impacts of different parameters on the performance of the proposed classification method, and we compare the proposed method with several well-known methods.

This paper is arranged as follows: In Section 2, our method of scene classification is described, and the implementation steps are presented. In Section 3, we evaluate the proposed method on three different datasets and present experimental results. In Section 4, the conclusions are given.

2. The Proposed Scene Classification Method

The framework of the proposed method is illustrated in Figure 1. First, scene images are convolved with a 2D Gabor filter bank, and image patches of 15 × 15 pixels are obtained from the filter responses by dense sampling. The local Gabor feature of each sample point is obtained by computing the Gaussian-weighted mean in the corresponding neighborhood of each filter channel and assembling these means into a vector. Accordingly, local Gabor feature descriptors of dense sampling patches of all scene images can be extracted, and then visual words can be obtained by the k-means clustering algorithm. To exploit spatial position information, the pyramid histogram of visual words (PHOW) based on a spatial pyramid model is used in this scheme. Owing to the relatively high dimension of the PHOW, the computational costs of training and testing the SVM classifiers are high. To solve this problem and improve classification accuracy, an improved KPCA method is used to extract appropriate principal components. The feature vectors obtained by the improved KPCA method are used for scene classification by linear SVMs.

2.1. Local Gabor Feature Extraction

Gabor filters are particularly appropriate for obtaining the texture representation of scene images [15]. In this paper, we extract local Gabor features of images for scene classification. Figure 2 illustrates the procedure of feature extraction. Given a scene image, we first convolve it with 2D Gabor filters. The 2D Gabor filters [16] are defined as

$$\psi_{u,v}(z) = \frac{\|k_{u,v}\|^2}{\sigma^2} \exp\left(-\frac{\|k_{u,v}\|^2 \|z\|^2}{2\sigma^2}\right)\left[\exp\left(i\, k_{u,v} \cdot z\right) - \exp\left(-\frac{\sigma^2}{2}\right)\right],$$

where $z = (x, y)$, $k_{u,v} = k_v e^{i\phi_u}$, $k_v = k_{\max}/f^{v}$, $\phi_u = \pi u/8$, $k_{\max} = \pi/2$, and $f = \sqrt{2}$. In this research, we adopt the Gabor filter bank with eight different orientations ($u = 0, 1, \ldots, 7$) and five different scales ($v = 0, 1, \ldots, 4$). The magnitude responses are used for feature extraction. In order to obtain fine-grained Gabor features, we perform dense sampling, utilizing 8 pixels as the sampling interval of the dense regular grid. The 15 × 15 pixel neighborhood of each sample point is used for calculating the local feature descriptor. For each sample point, the Gaussian-weighted mean of the corresponding neighborhood of every channel is computed. The mean is treated as the feature value of the corresponding filter channel. Then the local Gabor feature descriptor is obtained by concatenating the feature values of all channels. The dimension of the local Gabor feature descriptor is 40 (8 orientations × 5 scales). By dense sampling, Gabor feature descriptors of 961 sample points can be extracted from a scene image.
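
To make the construction concrete, the following Python/NumPy sketch builds such a 40-channel filter bank and computes the magnitude responses; the filter support size (31 pixels) and $\sigma = 2\pi$ are assumptions of ours, since the text does not state them:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(u, v, size=31, k_max=np.pi / 2, f=np.sqrt(2), sigma=2 * np.pi):
    """One complex Gabor wavelet at orientation u (0..7) and scale v (0..4),
    using the parameterization reconstructed in the text."""
    k = k_max / f ** v                        # k_v = k_max / f^v
    phi = np.pi * u / 8.0                     # phi_u = pi * u / 8
    kx, ky = k * np.cos(phi), k * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    sq = x ** 2 + y ** 2
    envelope = (k ** 2 / sigma ** 2) * np.exp(-(k ** 2) * sq / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

def gabor_magnitude_responses(image):
    """Convolve a grayscale image with the 5-scale, 8-orientation bank and
    return the 40 magnitude response maps, shape (40, H, W)."""
    responses = [np.abs(fftconvolve(image, gabor_kernel(u, v), mode="same"))
                 for v in range(5) for u in range(8)]
    return np.stack(responses, axis=0)
```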

We use a Gaussian function for the weighting calculation in the neighborhood of each sample point. The Gaussian weighting function is

$$G(x, y) = \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right),$$

where $(x, y)$ denotes the pixel position in the neighborhood. The sample point corresponds to $(0, 0)$, the pixel in the upper left corner of the neighborhood corresponds to $(-7, -7)$, and the pixel in the lower right corner corresponds to $(7, 7)$. $\sigma^2$ is the Gaussian width; we let $\sigma^2$ be 100 in this study.
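
Continuing the sketch, dense local Gabor descriptors can then be pooled from the magnitude responses; the Gaussian weighting follows the reconstruction above ($\sigma^2 = 100$), and the grid bounds are chosen so that a 256 × 256 image yields the 961 sample points mentioned earlier:

```python
import numpy as np

def local_gabor_descriptors(responses, step=8, radius=7, sigma2=100.0):
    """Dense 40-D local Gabor descriptors on a regular grid with 8-pixel spacing.

    Each descriptor is the Gaussian-weighted mean of every filter channel over
    the 15 x 15 neighborhood of the sample point (sigma2 = 100 as in the text)."""
    n_ch, H, W = responses.shape
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    weights = np.exp(-(x ** 2 + y ** 2) / (2 * sigma2))
    weights /= weights.sum()                  # normalize so the pooling is a mean
    descriptors, points = [], []
    for cy in range(radius + 1, H - radius, step):
        for cx in range(radius + 1, W - radius, step):
            patch = responses[:, cy - radius:cy + radius + 1,
                                 cx - radius:cx + radius + 1]
            descriptors.append((patch * weights).sum(axis=(1, 2)))
            points.append((cy, cx))
    return np.asarray(descriptors), np.asarray(points)   # shapes (N, 40), (N, 2)
```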

The local Gabor feature descriptors are fine-grained Gabor features which have sufficient discrimination abilities for dense sampling patches of scene images. We then represent scene images using the bag-of-visual-words model. First, we quantize these Gabor feature descriptors into discrete codewords by the k-means clustering algorithm. Each cluster center corresponds to a visual word. Scene images can be represented as histograms of visual words [17] after the Gabor feature descriptors are mapped into visual words.
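
A minimal sketch of this quantization step, using scikit-learn's KMeans as one possible implementation (the toolkit choice is ours, not the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, n_words=300, seed=0):
    """Cluster descriptors pooled from the training images; each of the
    cluster centers acts as one visual word."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=seed).fit(train_descriptors)

def bovw_histogram(vocabulary, descriptors):
    """Assign each descriptor to its nearest visual word and count occurrences."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)          # normalized histogram of visual words
```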

Figure 3(a) illustrates three scene images. Figure 3(b) shows the histograms of visual words based on local Gabor feature descriptors. In this experiment, the vocabulary size is 300. It can be seen that local Gabor features can yield the effective histogram representation of scene images.

We evaluate the local Gabor feature descriptors for scene classification on a 15-category scene dataset [18] and compare them with scale invariant feature transform (SIFT) descriptors [6]. The SIFT descriptors are extracted by dense sampling on a regular grid, which is the same as the grid used for the local Gabor features. We use “1-a-r” RBF-SVMs for scene classification and randomly select 200 images of each category as the experiment images. Half of them are used as training samples and the others are used for testing. The codebook size is set to 300. The comparison of the classification accuracy for all scene categories is shown in Figure 4. We can see that the local Gabor feature descriptor obtains good classification performance. Under the same experimental conditions, the classification accuracy of the local Gabor feature is higher than that of the SIFT descriptor on most scene categories.

2.2. Pyramid Histogram of Visual Words (PHOW)

The bag-of-visual-words model is limited by the loss of spatial position information. Thus, we construct a spatial pyramid and compute the pyramid histogram of visual words, which is suitable for scene classification because it contains position information of scene images [19]. To construct a spatial pyramid, a scene image is partitioned into increasingly fine grids by quadtree decomposition, yielding a sequence of grids at levels $l = 0, 1, \ldots, L$, where the grid at level $l$ has $2^l \times 2^l$ subregions. Then the histogram of visual words inside each subregion is computed. The PHOW is obtained by concatenating the histograms of visual words of all subregions at all levels.

Figure 5 shows the pyramid histogram of visual words (PHOW) of a scene image. The number of levels of the spatial pyramid is three ($L = 2$). For the three levels, the number of visual words in each subregion is counted and shown, respectively. The size of the vocabulary is 300, and therefore the dimensionality of the PHOW is $300 \times (1 + 4 + 16) = 6300$.
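
A sketch of the PHOW construction under these conventions; `words` and `points` are the outputs of the earlier descriptor sketch:

```python
import numpy as np

def phow(words, points, image_shape, n_words=300, L=2):
    """Pyramid histogram of visual words over a quadtree of 1x1, 2x2, and 4x4
    grids (three levels, L = 2)."""
    H, W = image_shape
    histograms = []
    for level in range(L + 1):
        cells = 2 ** level                          # cells x cells grid at this level
        for gy in range(cells):
            for gx in range(cells):
                in_cell = ((points[:, 0] * cells // H == gy) &
                           (points[:, 1] * cells // W == gx))
                histograms.append(np.bincount(words[in_cell], minlength=n_words))
    vec = np.concatenate(histograms).astype(float)
    return vec / max(vec.sum(), 1.0)                # 300 * (1 + 4 + 16) = 6300 dims
```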

Using pyramid histograms of visual words as feature vectors for scene classification, a spatial pyramid matching kernel (PMK) is adopted as follows:

$$K(X, Y) = \sum_{m=1}^{M} \kappa^{L}(X_m, Y_m),$$

where $X$ and $Y$ represent two scene images and $M$ is the visual word number. $\kappa^{L}$ is defined as

$$\kappa^{L}(X_m, Y_m) = \frac{1}{2^{L}} I^{0} + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}} I^{l},$$

where $L$ is the finest level of the pyramid and $l$ is the current level. Each level is weighted by $1/2^{L-l}$ so that matched points at finer resolutions are weighted more highly than those at coarser resolutions. $I^{l}$ is the abbreviation of the histogram intersection function, which is defined as

$$I^{l} = \sum_{j=1}^{4^{l}} \min\left(H_{X}^{l}(j, m),\, H_{Y}^{l}(j, m)\right),$$

where $H_{X}^{l}(j, m)$ denotes the count of the $m$th visual word in the $j$th subregion of image $X$ at level $l$, and $H_{Y}^{l}(j, m)$ denotes the count of the $m$th visual word in the $j$th subregion of image $Y$ at level $l$.
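
A compact sketch of this kernel; it assumes each image's per-level subregion histograms are stored as stacked count vectors, which is an implementation choice of ours:

```python
import numpy as np

def pyramid_match_kernel(hx, hy, L=2):
    """Spatial pyramid matching kernel between two images. hx[l] stacks the
    visual-word counts of all subregions at level l; level 0 is weighted
    1/2^L and level l >= 1 is weighted 1/2^(L - l + 1)."""
    def intersect(a, b):                      # histogram intersection I^l
        return np.minimum(a, b).sum()
    k = intersect(hx[0], hy[0]) / 2 ** L
    for l in range(1, L + 1):
        k += intersect(hx[l], hy[l]) / 2 ** (L - l + 1)
    return k
```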

2.3. Improved Kernel Principal Component Analysis

Pyramid histograms of visual words of scene images are assumed to be $x_k$ ($k = 1, 2, \ldots, N$), $x_k \in \mathbb{R}^{d}$. First, KPCA maps each original input vector $x_k$ into a higher-dimensional feature space and then computes the covariance matrix:

$$\bar{C} = \frac{1}{N}\sum_{j=1}^{N} \Phi(x_j)\,\Phi(x_j)^{T},$$

where $\Phi(\cdot)$ here is the nonlinear mapping of the input variables $x_k$. Then we solve the following eigenvalue problem [20]:

$$\lambda V = \bar{C} V.$$

All solutions $V$ with $\lambda \neq 0$ must lie in the span of $\Phi(x_1), \ldots, \Phi(x_N)$ [21], and $V = \sum_{i=1}^{N} \alpha_i \Phi(x_i)$. Thus, the eigenvalue problem is equivalent to

$$N \lambda \alpha = K \alpha,$$

where $K$ is a kernel matrix defined by $K_{ij} = (\Phi(x_i) \cdot \Phi(x_j)) = k(x_i, x_j)$. By utilizing the kernel function, the nonlinear mapping and the inner product computations in the feature space can be avoided [22]. The $n$th principal component of a sample $x$ can be extracted by projecting $\Phi(x)$ onto the eigenvector $V^{n}$ as follows [23]:

$$y_n = \left(V^{n} \cdot \Phi(x)\right) = \sum_{i=1}^{N} \alpha_i^{n}\, k(x_i, x).$$

Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ denote the nonzero eigenvalues of the kernel matrix $K$. By using only the first several eigenvectors sorted in descending order of the eigenvalues, the number of principal components can be reduced [24]. The number $n_1$ of retained principal components is chosen as follows:

$$\frac{\sum_{i=1}^{n_1} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \geq \theta_1,$$

where $\theta_1$ is the predefined threshold of the KPCA method.

For simplicity, we have assumed that the observation data are centered; this can be achieved by substituting the kernel matrix $K$ with

$$\tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N,$$

where $1_N$ is an $N \times N$ square matrix whose elements are all $1/N$.
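
The following sketch implements these steps, i.e., centering, eigendecomposition, and the $\theta_1$ criterion; the small-eigenvalue cutoff and eigenvector normalization are standard implementation details we have filled in:

```python
import numpy as np

def kpca(K, theta1=0.90):
    """Kernel PCA on an N x N kernel matrix K: center, eigendecompose, and keep
    the first n1 components whose eigenvalue mass reaches theta1."""
    N = K.shape[0]
    one_n = np.full((N, N), 1.0 / N)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # centering from the text
    lam, alpha = np.linalg.eigh(Kc)
    lam, alpha = lam[::-1], alpha[:, ::-1]               # descending eigenvalues
    keep = lam > 1e-10                                   # drop numerically zero modes
    lam, alpha = lam[keep], alpha[:, keep]
    alpha = alpha / np.sqrt(lam)                         # unit-norm eigenvectors in feature space
    n1 = int(np.searchsorted(np.cumsum(lam) / lam.sum(), theta1) + 1)
    return alpha[:, :n1], Kc

# Training projections: Y = Kc @ alpha, one row of principal components per image.
```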

KPCA retains as much information as possible when feature vectors are simplified. For pattern classification, however, what matters most is not the total amount of retained information but the category information. In view of this, we further extract appropriate principal components by evaluating the category information of the feature vectors.

In this research, we use the interclass separability to evaluate category information. The separability of the $i$th dimension component of the feature vectors between class $a$ and class $b$ is defined as follows:

$$J_{ab}(i) = \frac{d_{ab}(i)}{\sigma_a(i) + \sigma_b(i)},$$

where $d_{ab}(i)$ is the distance between the center of the $i$th dimension component of the feature vectors of class $a$ and the center of the $i$th dimension component of the feature vectors of class $b$: $d_{ab}(i) = |m_a(i) - m_b(i)|$, where $m_a(i)$ is the center of the $i$th dimension component of the feature vectors of class $a$, $m_a(i) = (1/N_a)\sum_{j=1}^{N_a} y_{a,j}(i)$, $N_a$ is the number of samples of class $a$, and $y_{a,j}(i)$ represents the $i$th dimension component of the $j$th sample of class $a$. $\sigma_a(i)$ represents the standard deviation of the $i$th dimension component of class $a$, formulated as $\sigma_a(i) = \sqrt{(1/N_a)\sum_{j=1}^{N_a} \left(y_{a,j}(i) - m_a(i)\right)^2}$.

The bigger $J_{ab}(i)$ is, the better the separability of the $i$th dimension component between class $a$ and class $b$. When $J_{ab}(i)$ is smaller than 1, there is an overlap between the $i$th dimension component of class $a$ and that of class $b$.

We define the interclass separability of the $i$th dimension component of the feature vectors as the sum of $J_{ab}(i)$ over all class pairs:

$$J(i) = \sum_{a=1}^{C-1} \sum_{b=a+1}^{C} J_{ab}(i),$$

where $C$ is the number of classes.

Let $J(i)$ represent the category information of the $i$th dimension component. The bigger $J(i)$ is, the more suitable the $i$th dimension component is for classification. The values $J(i)$ are then sorted in descending order, and the components corresponding to the first $n_2$ separability values are retained.

The number $n_2$ of appropriate principal components for scene classification is chosen as follows:

$$\frac{\sum_{i=1}^{n_2} \hat{J}(i)}{\sum_{i=1}^{n_1} \hat{J}(i)} \geq \theta_2,$$

where $\hat{J}(i)$ denotes the separability values sorted in descending order and $\theta_2$ is the predefined threshold.
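
A sketch of this selection step, under the pooled-over-pairs definition of $J(i)$ assumed above:

```python
import numpy as np

def interclass_separability(Y, labels):
    """J(i) for each component of Y, summed over all class pairs as assumed
    in the text."""
    classes = np.unique(labels)
    J = np.zeros(Y.shape[1])
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            Ya, Yb = Y[labels == classes[a]], Y[labels == classes[b]]
            d = np.abs(Ya.mean(axis=0) - Yb.mean(axis=0))       # d_ab(i)
            J += d / (Ya.std(axis=0) + Yb.std(axis=0) + 1e-12)  # J_ab(i)
    return J

def select_components(Y, labels, theta2=0.85):
    """Indices of the n2 components with the largest separability, where n2 is
    the smallest count whose cumulative share of J reaches theta2."""
    J = interclass_separability(Y, labels)
    order = np.argsort(J)[::-1]               # sort J(i) in descending order
    ratio = np.cumsum(J[order]) / J.sum()
    n2 = int(np.searchsorted(ratio, theta2) + 1)
    return order[:n2]
```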

After the appropriate principal components are extracted, linear “1-a-r” SVMs [25] are used for scene image classification. Linear SVMs have a simple decision function and fast classification speed, advantages that are especially prominent in multiclass classification problems.
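
A usage sketch tying the pieces together with scikit-learn's LinearSVC (one-against-rest by default); the variable names refer to the earlier sketches, and C = 10 matches the penalty factor used in the experiments below:

```python
from sklearn.svm import LinearSVC

# One-against-rest linear SVMs on the selected principal components; Y_train,
# Y_test, and `keep` come from the kpca and select_components sketches above.
clf = LinearSVC(C=10.0)                       # LinearSVC is one-vs-rest by default
clf.fit(Y_train[:, keep], labels_train)
print("accuracy:", clf.score(Y_test[:, keep], labels_test))
```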

3. Experiments and Results

The proposed method is evaluated on three datasets.

OT dataset [9, 14]: it contains 2688 images from 8 scene categories: coast (360 samples), forest (328 samples), mountain (374 samples), open country (410 samples), highway (260 samples), inside city (308 samples), tall buildings (356 samples), and streets (292 samples). The size of each image is 256 × 256 pixels.

FP dataset [4, 16]: it contains 3859 images from 13 scene categories. The FP dataset extends the OT dataset with 5 new categories: bedroom (216 samples), kitchen (210 samples), living room (289 samples), office (215 samples), and suburb (241 samples). The image size is approximately 250 × 300 pixels.

LS dataset [1, 11]: it contains 4485 images from 15 scene categories. The LS dataset extends the FP dataset with 2 new categories: industrial (311 samples) and store (315 samples). Figure 6 depicts some example images from the three datasets, which are publicly available at http://www-cvr.ai.uiuc.edu/ponce_grp/data/.

We randomly select 125 images of each category as the experiment images. Fivefold cross-validation is performed to achieve an accurate estimate of classification performance. First, scene images are filtered by the Gabor filter bank of 5 scales and 8 orientations, and local Gabor feature descriptors are extracted. Then, based on the spatial pyramid matching model, pyramid histograms of visual words are obtained. The vocabulary size is 300, and the number of levels of the spatial pyramid is three. The improved KPCA method with the spatial pyramid matching kernel (PMK) is used for dimensionality reduction; the thresholds $\theta_1$ and $\theta_2$ are set to fixed values, and their influence is examined in Figures 8 and 9. Last, linear “1-a-r” SVMs are adopted for scene classification, with the penalty factor set to 10.

Figure 7 shows the confusion matrices of the proposed method for the three scene datasets. In each confusion matrix, average classification rates for individual categories are listed along the diagonal; the entry in the $i$th row and $j$th column is the percentage of images from category $i$ that are misidentified as category $j$. For the OT dataset, the highest classification rate is 100% for the highway category, and the lowest is 72% for the open country category. The biggest confusion occurs between the coast and open country categories. On inspection, we find that the misclassified “coast” images show a certain similarity to the “open country” images at first glance; without color information to help separate sea water from grassland, they are easily confused with “open country” images. For the FP and LS datasets, the biggest confusion occurs among the indoor categories (kitchen, living room, and bedroom). Examining the misclassified images, we find that some classification errors are related to the ambiguity of the scene images. For example, some “kitchen” images are confused with “living room” images: most of them depict furniture (such as a dining table, coffee table, and cabinets) in the central parts of the images and windows at the edges, so they are easily confused. In spite of this, the proposed scheme achieves good performance. The classification accuracies on the three scene datasets are 87.5%, 82.8%, and 78.7%, respectively.

To test the influence of different factors (such as kernel functions and the scales and orientations of the Gabor features) on the classification performance of the proposed method, we perform experiments with the RBF kernel function, the POLY kernel function, and the pyramid matching kernel function on the three scene datasets. Tables 1–3 show the performance comparison of these experiments.

In this study, we set the Gaussian width of the RBF kernel function to 1 and the degree of the POLY kernel function to 2. As shown in Tables 1–3, the schemes using the RBF kernel function for KPCA obtain better classification performance than those using the POLY kernel function, and the scheme using the PMK for KPCA obtains the highest classification accuracy. The experimental results also show that classification accuracy tends to rise as the number of orientations and scales of the extracted Gabor features increases. However, it cannot be concluded that more orientations and scales always yield better classification performance: owing to the overly fine orientation division, Gabor features with 12 orientations are not more suitable for scene classification than Gabor features with 8 orientations. Consequently, local Gabor features with 5 scales and 8 orientations are the most appropriate for scene classification.

In the proposed method, the nonlinear principal components of the feature vectors are extracted by the improved KPCA, and linear “1-a-r” SVMs are used for scene classification. The training and testing times decrease owing to the dimensionality reduction of the feature vectors, and the classification performance changes with the number of retained principal components. Figures 8 and 9 show experimental curves of our method; the training time and the testing time are the runtime of the linear “1-a-r” SVMs. The experimental environment is as follows: Windows 7, MATLAB 7.10, Intel i3-2330M CPU at 2.20 GHz, and 2.00 GB RAM.

Figures 8(a)–8(d) show the experimental curves of the number of principal components, classification accuracy, training time, and testing time as the threshold $\theta_1$ changes from 95% to 60%. As shown in Figure 8, the number of principal components declines rapidly as the threshold decreases. The training and testing times of the “1-a-r” SVMs decrease correspondingly with the reduction of threshold $\theta_1$, and the classification accuracy also decreases correspondingly.

Figures 9(a)–9(d) show the experimental curves of the number of principal components, classification accuracy, training time, and testing time as the threshold $\theta_2$ changes from 95% to 60%. Because the principal components with larger interclass separability are used for scene classification in our method, good classification performance can be obtained. Figure 9(b) shows the classification accuracy for varying $\theta_2$. Initially, the classification accuracy gradually increases as $\theta_2$ decreases, because some components with less category information are discarded. After reaching the maximum, the classification accuracy gradually decreases as $\theta_2$ decreases further, because so many components are discarded that some carrying more category information are lost. The classification accuracy reaches its peak when $\theta_2$ is about 80%–90%. The number of principal components and the training and testing times of the “1-a-r” SVMs decrease correspondingly with the reduction of threshold $\theta_2$.

In this study, the image size is approximately 250 × 300 pixels; larger images would not affect the training and testing times. The computational cost of local Gabor feature extraction is linear in the size of the image, so larger images make feature extraction more expensive. However, the training and testing times measured in this paper are the runtime of the “1-a-r” SVMs, so changes in the feature extraction time are not included. Moreover, the factors that affect the runtime of the SVM classifiers (such as the dimensionality of the feature vectors and the numbers of training and test images) are unrelated to image size. Even if the images are bigger, the runtime of the “1-a-r” SVMs is unchanged.

The proposed method is also compared with several well-known algorithms: the dense SIFT method [11], the BOVW method [4], and the “Gist” method [14]. We randomly select 200 scene images of each category from the three datasets as experiment images; half are used as training samples and the others for testing. The penalty factor of the “1-a-r” SVMs is set to 10. In the dense SIFT method, the sampling interval of the dense regular grid is 8 pixels, and SIFT descriptors are computed from 16 × 16 pixel image patches. The vocabulary size is 300, and the number of levels of the spatial pyramid is three; the other parameter settings are the same as in [11]. “1-a-r” SVMs with the spatial pyramid matching kernel are used for scene classification. In the BOVW method, Difference of Gaussian (DoG) detectors are used to automatically detect key points, SIFT descriptors are adopted to represent local features of scene images, and “1-a-r” RBF-SVMs are used for scene classification. We set the Gaussian width of the RBF kernel function to 1; the other parameter settings are the same as in [4]. In the “Gist” method, the “Gist” feature is extracted from a 4 × 4 grid of the filtering output of a scene image convolved with 40 Gabor filters (5 scales and 8 orientations), as described in Section 2. “1-a-r” SVMs with the RBF kernel function are used for scene classification, with the Gaussian width of the RBF kernel set to 1.

Figure 10 shows the classification accuracy of the different methods. On all three scene datasets, the proposed method is slightly better than the dense SIFT method and much better than the BOVW method and the “Gist” method. In the proposed method, the local Gabor features, extracted by imitating the “Gist” model and thus conforming to the mechanism of human vision, have good discrimination abilities for sampling patches of scene images, so the accuracy of the visual words obtained by the k-means clustering algorithm can be guaranteed. Meanwhile, the improved KPCA is used to extract nonlinear principal components, and the principal components containing more category information, which are suitable for scene classification, are retained. The proposed method therefore achieves considerably higher accuracy.

4. Conclusions

A new scene classification method has been proposed based on local Gabor features. The local Gabor feature descriptors, extracted according to the “Gist” theory, have sufficient discrimination abilities for sampling patches of scene images. By quantizing local Gabor features into discrete codewords and employing a spatial pyramid matching model, pyramid histograms of visual words containing the spatial position information of images are obtained to represent scene images. In addition, the principal components of the PHOW containing more category information are extracted by an improved KPCA method. These principal components are suitable for scene classification, improving classification accuracy while reducing computational cost. Numerical experiments on three scene datasets demonstrate the effectiveness of the method. The proposed method can also be extended to other applications, such as the classification of commodity images and event images.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the Science Research Foundation of the Education Department of Liaoning Province of China (Grant no. L2014174).