Abstract

In recent years the spatial resolution of remote sensing images has improved greatly. However, a higher spatial resolution does not always lead to a better result of automatic scene classification. Visual attention is an important characteristic of the human visual system, which can effectively help to classify remote sensing scenes. In this study, a novel visual attention feature extraction algorithm was proposed, which extracts visual attention features through a multiscale process, and a fuzzy classification method using visual attention features (FC-VAF) was developed to perform high resolution remote sensing scene classification. FC-VAF was evaluated using remote sensing scenes from widely used high resolution remote sensing images, including IKONOS, QuickBird, and ZY-3 images. FC-VAF achieved more accurate classification results than four traditional classification methods according to the quantitative accuracy evaluation indices. We also discuss the role and impacts of different decomposition levels and different wavelets on the classification accuracy. FC-VAF improves the accuracy of high resolution scene classification and thereby advances digital image analysis research and the applications of high resolution remote sensing images.

1. Introduction

With the rapid development of satellite and sensor technologies, remote sensing has become an important and efficient means of collecting spatial information about the Earth [1–4]. Remote sensing images with high spatial resolutions can be acquired from satellites such as IKONOS, QuickBird, and WorldView [5, 6]. High resolution remote sensing images provide a great deal of information on texture structures and spatial details. However, the improvement in spatial resolution also increases the intraclass variability of land-cover classes and reduces the interclass variability between different classes [7], which increases the fuzziness of classification and poses a big challenge for automatic classification of remote sensing scenes. Remote sensing scenes are separated subareas extracted from remote sensing images that possess specific semantic meanings, such as farmlands and residential areas. Remote sensing scene classification is the process of classifying specific scenes in remote sensing images; it is essential to many remote sensing applications and has attracted much attention in recent years [8–10]. Various classification methods have been developed that can be applied to remote sensing scene classification, such as the minimum distance method [11], the maximum likelihood method [11], neural network methods [12–16], fuzzy methods [17–21], support vector machine methods [21–23], particle swarm optimization methods [19, 24], artificial immune methods [25, 26], and Markov model methods [27–29]. However, due to the complex texture structures and spatial details in high resolution remote sensing scenes, scene classification remains a difficult task. Remote sensing scene classification methods based on visual attention may provide potential solutions to this issue.

Visual attention is an important characteristic of the human visual system [30]. The human visual system is easily attracted by salient details of an image and can recognize objects or scenes in the image. Visual saliency measures the extent to which details in an image attract human attention [31]. In the past twenty years, visual attention has become one of the hot spots in artificial intelligence research and applications [32–36]. In 1998, Itti et al. proposed a visual attention model [32] based on the attention mechanism of the human visual system. The Itti visual attention model can be used to extract a variety of features from input images, such as brightness and color; these features are then analyzed and consolidated to generate saliency maps. Walther and Koch further developed the saliency model proposed by Itti et al., introducing a feedback mechanism in generating saliency maps for object recognition [33]. Achanta et al. proposed a frequency-tuned method to compute pixel saliency directly and detect salient regions [34]. Hou and Zhang designed a fast method to detect image saliency by exploring spectral components in an image [35]. Tian et al. proposed a color saliency model to detect salient objects in natural scenes [36]. In their color saliency model, different color features were extracted and analyzed, two efficient saliency measurements were employed to compute saliency maps for the different color features, and a feature combination strategy was presented to combine the multiple saliency maps into one integrated saliency map. Scene feature extraction is a key step in scene classification, which affects the classification accuracy. When the human visual system observes and classifies scenes, it usually does so through a multiscale process. However, attempts to extract visual attention features through a multiscale process for scene classification are relatively rare in the literature.

The assumption of this study is that visual attention features can be extracted through a multiscale process for high resolution remote sensing scene classification. Fuzzy theory is an effective mathematical tool for processing fuzzy and complex information [17–21] and is therefore well suited to high resolution remote sensing scene classification; accordingly, the fuzzy classification method [17–19] is preferred in this study. The main goals of this study are to propose a novel visual attention feature extraction algorithm based on the wavelet transform, which extracts visual attention features through a multiscale process; to apply a fuzzy classification method (FC) using visual attention features (VAF) to achieve improved accuracy in scene classification; to compare and evaluate FC-VAF against four traditional classification methods using IKONOS, QuickBird, and ZY-3 remote sensing scenes; and to discuss the parameter sensitivity of FC-VAF.

2. Methodology

2.1. Wavelet Transform-Based Visual Attention Feature Extraction
2.1.1. Basic Principle of Wavelet Transform

Wavelet analysis is a powerful mathematical tool for the decomposition, reconstruction, and multiscale representation of signals [37–39]. It provides inherent scaling and good identification of signals, which is consistent with human perception. A digital image can be regarded as a two-dimensional discrete signal and can be decomposed and reconstructed by the two-dimensional discrete wavelet transform, which allows good localization in both the frequency and spatial domains. An image can be decomposed into multiple levels using wavelet basis functions; this can be viewed as a chain of successive levels of decomposition obtained by applying the one-dimensional discrete wavelet transform in the horizontal and vertical directions [37–39]. Two-level two-dimensional discrete wavelet decomposition of an image is illustrated in Figure 1. Several wavelets are popular in the field of wavelet analysis, such as the Daubechies wavelets, the Symlets wavelets, and the Discrete Meyer wavelet [39]. Different wavelets lead to different wavelet decomposition effects and application results.
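As a concrete illustration of the multilevel decomposition described above, the following sketch uses the PyWavelets library; the Symlets wavelet ("sym4") and two decomposition levels mirror the settings adopted later in the case study, while the random input image is only a stand-in for a real remote sensing scene.

```python
import numpy as np
import pywt

# Toy 256x256 grayscale "image" standing in for a remote sensing scene.
image = np.random.rand(256, 256)

# Two-level two-dimensional discrete wavelet decomposition (cf. Figure 1).
# Returns [cA2, (cH2, cV2, cD2), (cH1, cV1, cD1)]: the level-2 approximation
# plus the horizontal/vertical/diagonal detail subbands at each level.
coeffs = pywt.wavedec2(image, wavelet="sym4", level=2)
cA2 = coeffs[0]              # coarsest approximation (top level)
cH2, cV2, cD2 = coeffs[1]    # level-2 detail subbands
cH1, cV1, cD1 = coeffs[2]    # level-1 detail subbands

# The image is recoverable from its coefficients (up to numerical error).
reconstructed = pywt.waverec2(coeffs, wavelet="sym4")
print(np.allclose(image, reconstructed))
```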

2.1.2. Visual Attention Feature Extraction through a Multiscale Process

The wavelet transform can produce a multiscale representation of an image. Therefore, a novel visual attention feature extraction algorithm based on the wavelet transform is proposed, which extracts visual attention features from the saliency maps of remote sensing scenes through a multiscale process.

Visual saliency in an image measures the extent to which details attract human attention [31]. Tian et al. proposed a color saliency model to detect salient objects in natural scenes [36], in which different color features were extracted and analyzed, two efficient saliency measurements were used to compute saliency maps for the different color features, and a feature combination strategy combined the multiple saliency maps into one integrated saliency map. We adopt this color saliency model to obtain the integrated saliency map of an image as follows [36]:

$$S_{\text{map}} = w_I S_I + w_H S_H + w_S S_S,$$

where $S_I$, $S_H$, and $S_S$ represent the saliency maps of the intensity, hue, and saturation components of the image, respectively; $w_I$, $w_H$, and $w_S$ represent the weight values of $S_I$, $S_H$, and $S_S$, respectively; $I$, $H$, and $S$ represent the intensity, hue, and saturation components of the image, respectively; $\bar{I}$ is the average value of $I$; and $\bar{H}$ is the average value of $H$ (the averages enter the saliency measurements of [36]).
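The two saliency measurements of Tian et al. [36] are not reproduced in this paper, so the sketch below substitutes a simple deviation-from-mean measure for every channel (in the spirit of the frequency-tuned method [34]) and equal weights; both choices are assumptions, and only the weighted-combination step corresponds directly to the formula above.

```python
import numpy as np
from skimage.color import rgb2hsv

def channel_saliency(channel):
    # Assumed measurement: absolute deviation from the channel mean.
    # Tian et al. [36] use two dedicated measurements instead.
    return np.abs(channel - channel.mean())

def integrated_saliency(rgb, w_i=1.0, w_h=1.0, w_s=1.0):
    """Weighted combination of intensity/hue/saturation saliency maps."""
    hsv = rgb2hsv(rgb)                   # channels: hue, saturation, value
    s_h = channel_saliency(hsv[..., 0])  # hue saliency map S_H
    s_s = channel_saliency(hsv[..., 1])  # saturation saliency map S_S
    s_i = channel_saliency(hsv[..., 2])  # intensity saliency map S_I
    return w_i * s_i + w_h * s_h + w_s * s_s
```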

The visual attention features are extracted from an integrated saliency map in the following four steps; a minimal code sketch of the whole procedure is given after step (d).

(a) The integrated saliency map is decomposed by the $N$-level two-dimensional discrete wavelet transform, yielding a multiscale representation composed of the approximation images at levels $N, N-1, \ldots, 0$, where level 0 is the original saliency map. The multiscale representation of an integrated saliency map for visual attention feature extraction is illustrated in Figure 2, where $N = 2$.

(b) Visual attention focuses are extracted in the top level of the multiscale representation. The salient points in the top level are identified based on their saliency values. The human visual system is most easily attracted by the most salient point, so the most salient point is selected as the first and current visual attention focus. Visual attention is then shifted among the remaining salient points in the top level: the next visual attention focus is the unselected salient point closest to the current focus. For example, there are three salient points $P_1$, $P_2$, and $P_3$ in Figure 2. The most salient point $P_1$ is selected as the first and current visual attention focus. Then the salient point $P_2$ is selected as the second visual attention focus because it is closer to $P_1$ than $P_3$ is.

(c) Visual attention is shifted from the top level down through the lower levels of the multiscale representation. Take the visual attention focus $P_1$ in Figure 2, for example. According to the position relation between two adjacent levels of the multiscale representation, $P_1$ in level 2 corresponds to a small region in level 1. The point with the maximal saliency value in that region is selected as the corresponding visual attention focus $P_1'$. In the same way, we obtain the corresponding visual attention focus $P_1''$ in the original visual saliency map.

(d) The saliency values of the visual attention focuses in the original visual saliency map are used as the visual attention features for scene classification. In Figure 2, the saliency values of $P_1''$, $P_2''$, and $P_3''$ are used as the visual attention features.
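The following is a minimal sketch of steps (a)–(d) under some simplifying assumptions: the "salient points" of the top level are taken to be its n highest-valued pixels, each point at one level is mapped to a 2 × 2 region one level below, and border effects are handled by clamping; the function and parameter names are ours, not the paper's.

```python
import numpy as np
import pywt

def extract_visual_attention_features(saliency_map, n_focuses=3, level=2,
                                      wavelet="sym4"):
    # (a) N-level decomposition: keep the approximation image at each level,
    # with approximations[0] the original saliency map (level 0).
    approximations = [np.asarray(saliency_map, dtype=float)]
    for _ in range(level):
        cA, _details = pywt.dwt2(approximations[-1], wavelet)
        approximations.append(cA)

    # (b) Focuses in the top level: the most salient point first, then
    # repeatedly the nearest not-yet-selected salient point.
    top = approximations[-1]
    order = np.argsort(top, axis=None)[::-1][:n_focuses]
    candidates = [np.unravel_index(i, top.shape) for i in order]
    focuses = [candidates.pop(0)]
    while candidates:
        current = np.array(focuses[-1])
        nearest = min(candidates,
                      key=lambda p: np.linalg.norm(np.array(p) - current))
        candidates.remove(nearest)
        focuses.append(nearest)

    # (c) Shift each focus down level by level: a point at one level maps to
    # a small (here 2x2) region one level below; keep the maximal point.
    features = []
    for r, c in focuses:
        for lvl in range(level - 1, -1, -1):
            img = approximations[lvl]
            r0 = min(2 * r, img.shape[0] - 2)   # clamp at the border
            c0 = min(2 * c, img.shape[1] - 2)
            region = img[r0:r0 + 2, c0:c0 + 2]
            dr, dc = np.unravel_index(np.argmax(region), region.shape)
            r, c = r0 + dr, c0 + dc
        # (d) The saliency value at the focus in the original map is a feature.
        features.append(approximations[0][r, c])
    return features
```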

2.2. Fuzzy Classification of Remote Sensing Scenes

We apply the fuzzy classification method [17–19] using visual attention features to achieve improved accuracy in scene classification. The classification procedure is as follows.

(a) Multiple original features are extracted from the samples of remote sensing scenes, including gray level cooccurrence matrix features [40], Laws texture energy features [40], and visual attention features. Together these features constitute the feature vectors that represent the corresponding scenes in the recognition process.

(b) The features are transformed into fuzzy features using the standard S-function as follows:

$$S(x; a, b, c) = \begin{cases} 0, & x \le a, \\ 2\left(\dfrac{x - a}{c - a}\right)^2, & a < x \le b, \\ 1 - 2\left(\dfrac{x - c}{c - a}\right)^2, & b < x \le c, \\ 1, & x > c, \end{cases}$$

where $a$, $b$, and $c$ are the fuzzy parameters; $b = (a + c)/2$.
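A direct implementation of the S-function is shown below; how the fuzzy parameters a and c were chosen for each feature is not stated in the paper, so the per-feature minimum and maximum used here are only an illustrative assumption.

```python
import numpy as np

def s_function(x, a, c):
    """Standard S-function with crossover point b = (a + c) / 2."""
    x = np.asarray(x, dtype=float)
    b = (a + c) / 2.0
    return np.where(x <= a, 0.0,
           np.where(x <= b, 2.0 * ((x - a) / (c - a)) ** 2,
           np.where(x <= c, 1.0 - 2.0 * ((x - c) / (c - a)) ** 2, 1.0)))

# Illustrative fuzzification of a feature matrix (rows: samples).
features = np.random.rand(80, 11)   # toy data: 80 samples, 11 features
a = features.min(axis=0)            # assumed per-feature parameter choice
c = features.max(axis=0)
fuzzy_features = s_function(features, a, c)
```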

(c) Fuzzy class centers are obtained by using the mean value method. Suppose $c_{ij}$ is the $j$th component of the class center of the $i$th class, $n_i$ is the number of training samples of the $i$th class, and $x_{kj}$ is the $j$th component of the feature vector of the training sample $k$; then $c_{ij}$ is computed as follows:

$$c_{ij} = \frac{1}{n_i} \sum_{k=1}^{n_i} x_{kj},$$

where $j = 1, 2, \ldots, D$; $D$ is the dimension of the feature vectors of the samples.
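The mean-value computation of the class centers is simply a per-class average of the fuzzified feature vectors; a short sketch with assumed array conventions follows.

```python
import numpy as np

def class_centers(fuzzy_features, labels):
    """c_ij = (1 / n_i) * sum over the training samples k of class i of x_kj.

    fuzzy_features: (n_samples, D) array of fuzzified feature vectors.
    labels: (n_samples,) array of class labels.
    """
    classes = np.unique(labels)
    centers = np.stack([fuzzy_features[labels == cls].mean(axis=0)
                        for cls in classes])
    return centers, classes
```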

(d) Test samples are classified using Euclidean fuzzy closeness degree on the basis of the fuzzy closeness principle [18].
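The paper does not spell out its closeness formula, so the sketch below uses one common definition of the Euclidean fuzzy closeness degree, N(x, c) = 1 − sqrt((1/D) Σ_j (x_j − c_j)²), and assigns each test sample to the class with the maximal closeness, per the fuzzy closeness principle [18]; the exact normalization in [18] may differ.

```python
import numpy as np

def euclidean_closeness(x, center):
    # One common Euclidean fuzzy closeness degree; for membership values
    # in [0, 1] the result also lies in [0, 1].
    return 1.0 - np.sqrt(np.mean((x - center) ** 2))

def classify(sample, centers, classes):
    # Fuzzy closeness principle: choose the class whose center has the
    # maximal closeness degree to the sample.
    degrees = [euclidean_closeness(sample, ctr) for ctr in centers]
    return classes[int(np.argmax(degrees))]
```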

(e) Fuzzy classification results are assessed using overall accuracy (OA), Kappa coefficient (KC), average producer’s accuracy (APA), and average user’s accuracy (AUA) based on confusion matrices [41, 42].
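All four indices can be read off a confusion matrix; the sketch below assumes rows index the reference (true) classes and columns the predicted classes.

```python
import numpy as np

def accuracy_metrics(confusion):
    """OA, Kappa, average producer's and average user's accuracy."""
    cm = np.asarray(confusion, dtype=float)
    total = cm.sum()
    diag = np.diag(cm)
    oa = diag.sum() / total                # overall accuracy
    row = cm.sum(axis=1)                   # reference totals per class
    col = cm.sum(axis=0)                   # prediction totals per class
    p_e = (row * col).sum() / total ** 2   # chance agreement
    kappa = (oa - p_e) / (1.0 - p_e)       # Kappa coefficient
    apa = np.mean(diag / row)              # average producer's accuracy
    aua = np.mean(diag / col)              # average user's accuracy
    return oa, kappa, apa, aua
```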

A flowchart of the fuzzy classification process is shown in Figure 3.

3. Case Study

3.1. Materials

In order to validate the effectiveness of FC-VAF, 80 samples of remote sensing scenes were selected as the experimental data from widely used high spatial resolution remote sensing images, including IKONOS, QuickBird, and ZY-3 images. The samples comprise four classes: residential areas, farmlands, woodlands, and water areas. Each class has 20 samples, of which 10 are used as training samples and all 20 are used as test samples. All samples share the same size in pixels. Representative samples of remote sensing scenes are shown in Figure 4.

3.2. Methods and Results

To demonstrate the effectiveness of FC-VAF, comparisons were carried out between FC-VAF and scene classification based on four traditional algorithms: standard backpropagation neural network classification (SBPC), adaptive learning rate backpropagation neural network classification (ALRBPC), general regression neural network classification (GRNNC), and fuzzy classification (FC). Four gray level cooccurrence matrix features and four Laws texture energy features were extracted from the samples for all scene classification methods. The Euclidean closeness degree measurement was adopted in both FC and FC-VAF, and the Symlets wavelet was adopted in FC-VAF. The main parameters of the different methods are shown in Table 1.

We compared the results of the different scene classification methods using the measures OA, KC, APA, and AUA. Table 2 shows the classification accuracy achieved by SBPC, ALRBPC, GRNNC, FC, and FC-VAF. From Table 2, we can see that GRNNC outperformed FC and ALRBPC in terms of OA, KC, APA, and AUA, while SBPC was the worst performer. FC-VAF obtained the best classification results among the five methods according to the values of OA, KC, APA, and AUA. For example, the OA values of SBPC, ALRBPC, GRNNC, FC, and FC-VAF are 76.3%, 78.8%, 82.5%, 80.0%, and 85.0%, respectively, and the corresponding KC values are 0.683, 0.717, 0.767, 0.733, and 0.800. FC-VAF obtains satisfactory classification results in such images because it is based on fuzzy theory and utilizes visual attention features in the classification process.

4. Discussion

4.1. Discussion of the Effects of Wavelet Decomposition Levels

The decomposition level (DL) of the wavelet transform is the key parameter of FC-VAF and affects the accuracy of scene classification. The scene classification accuracy of FC-VAF as a function of DL was therefore analyzed. The 80 scene samples from the case study were used with three different DL values; the other parameters of FC-VAF were kept the same as in the case study. The classification accuracy of FC-VAF for each DL value is shown in Figure 5. As the DL value increases, the OA value rises to its maximum of 85.0% at DL = 2 and then decreases; KC, APA, and AUA follow similar trends. Therefore, the optimal value of DL among the three tested values is 2 for FC-VAF in this application.

4.2. Discussion of the Effects of Different Wavelets

Different wavelets lead to different wavelet decomposition effects, which affect the classification accuracy of FC-VAF. The scene classification accuracy of FC-VAF with different wavelets was therefore analyzed. The 80 scene samples from the case study were used with different wavelets; the other parameters of FC-VAF were kept the same as in the case study. The classification accuracy of FC-VAF using different wavelets is shown in Figure 6. The DMeyer wavelet outperformed the Daubechies wavelet in terms of OA, KC, APA, and AUA, while the Symlets wavelet was the best performer. For example, the OA values of the Daubechies, DMeyer, and Symlets wavelets are 81.3%, 82.5%, and 85.0%, respectively. Therefore, the Symlets wavelet is optimal among the three wavelets for FC-VAF in this application.

5. Conclusions

In this study, a novel visual attention feature extraction algorithm was proposed, which extracts visual attention features through a multiscale process, and a fuzzy classification method using visual attention features (FC-VAF) was developed to perform high resolution remote sensing scene classification. FC-VAF was evaluated using 80 samples of remote sensing scenes selected from widely used high resolution remote sensing images, including IKONOS, QuickBird, and ZY-3 images. FC-VAF achieved more accurate classification results than four traditional classification methods according to the measures OA, KC, APA, and AUA: the OA values of SBPC, ALRBPC, GRNNC, FC, and FC-VAF are 76.3%, 78.8%, 82.5%, 80.0%, and 85.0%, respectively, and the corresponding KC values are 0.683, 0.717, 0.767, 0.733, and 0.800. The effects of the decomposition level and of different wavelets on the classification accuracy of FC-VAF were also discussed.

FC-VAF can extract visual attention features through a multiscale process and improve the accuracy of scene classification in high resolution remote sensing images. Therefore, FC-VAF not only advances the research of visual attention models and digital image analysis methods, but also promotes the applications of high resolution remote sensing images. Possible further development of the study will focus on the integration of FC-VAF and other intelligent algorithms to further improve the accuracy of high resolution remote sensing scene classification.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This paper was supported by the National Natural Science Foundation of China (Grant no. 41371343). The authors also wish to thank Susan Cuddy at CSIRO for her helpful comments and suggestions.