Journal of Sensors

Volume 2015, Article ID 142612, 10 pages

http://dx.doi.org/10.1155/2015/142612

## Bayesian Information Criterion Based Feature Filtering for the Fusion of Multiple Features in High-Spatial-Resolution Satellite Scene Classification

^{1}Signal Processing Laboratory, School of Electronic Information, Wuhan University, Wuhan 430072, China

^{2}Wireless Communication and Sensor Network Laboratory, School of Electronic Information, Wuhan University, Wuhan 430072, China

Received 12 November 2014; Accepted 18 February 2015

Academic Editor: Tianfu Wu

Copyright © 2015 Da Lin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper presents a novel method for high-spatial-resolution satellite scene classification that introduces a Bayesian information criterion (BIC) based feature filtering process to further eliminate ambiguous and redundant information among multiple features. First, two diverse and complementary feature descriptors are extracted to characterize the satellite scene. Then, sparse canonical correlation analysis (SCCA) with a penalty function is employed to fuse the extracted feature descriptors and, at the same time, remove the ambiguities and redundancies between them. After that, a two-phase BIC-based feature filtering process is designed to filter out the remaining redundant information. In the first phase, a constraint on the loadings is tightened gradually through an iterative process, which prevents the sparse correlation from falling below a lower confidence limit of the approximated canonical correlation. In the second phase, the BIC drives the feature filtering, which sets the smallest loading in absolute value to zero at each iteration for all features. Finally, a support vector machine with a pyramid match kernel is applied to obtain the final result. Experimental results on high-spatial-resolution satellite scenes demonstrate that the proposed approach achieves satisfactory classification accuracy.

#### 1. Introduction

Scene classification has attracted increasing attention in the remote sensing domain. High-spatial-resolution satellite imagery poses many challenges for scene classification because of high intraclass variability, low interclass disparity, and external factors such as changes of viewpoint, illumination and shadows, background clutter, partial occlusions, and multiple instances. In addition, as the spatial resolution of images increases considerably, details of the targets become clearer, and many cues, such as structure and colour, become more distinctive. It is therefore important to combine and fuse these cues appropriately. Over the past decade, many researchers and practitioners have exploited different sources of information in high-spatial-resolution satellite imagery to enhance classification performance [1–5].

Unlike the case of low-spatial-resolution satellite images, where a single type of feature descriptor has proved effective and efficient for classification [6, 7], for high-spatial-resolution images it is widely recognized that fusing a set of diverse and complementary features, such as features based on structure and colour information, is preferable to adopting a single type of feature [8, 9]. Hence, how to fuse these diverse and complementary pieces of information while eliminating the ambiguities and redundancies between them becomes a critical problem. One well-regarded approach is feature-level fusion, in which features from different channels are fused into a new pattern for scene classification. Many such approaches have been reported in the literature [10–12].

The canonical correlation analysis (CCA) [13] method has received particular attention in feature fusion [14, 15] because it can express the inherent correlation between two sets of features. To extract canonical correlation features from two groups of features, the CCA method first constructs a correlation criterion function from two different feature sets extracted from the same samples. It then derives effective discriminant features for classification.

However, when the dimensions of the features are too large, as in high-spatial-resolution satellite scene classification, traditional CCA methods are no longer suitable. Moreover, when the features are extracted from the same image, the sample covariance matrices can become undefined or unstable, which makes parameter estimation more difficult. A dimension reduction approach is therefore essential for CCA. Many approaches to feature shrinkage and selection have been proposed, including the nonnegative garrote by Breiman [16], the least absolute shrinkage and selection operator (Lasso) by Tibshirani [17], smoothly clipped absolute deviation (SCAD) by Fan and Li [18], and the Elastic-net by Zou and Hastie [19]. Recently, these approaches have been applied to CCA to assess the relationship between two sets of high-dimensional remote sensing data. The feature selection strategies are applied to the canonical feature loadings: some of the coefficients are set exactly to zero, and the remaining features are selected. The selected features are called the sparse set of features, and canonical correlation analysis that exploits them is known as sparse canonical correlation analysis (SCCA). Initially, Waaijenborg et al. [20] suggested a penalized form of CCA that adopts an iterative regression process with the univariate soft threshold (UST) form of the Elastic-net penalty. Subsequently, Parkhomenko et al. [21] suggested an SCCA approach that uses a form of regularization resembling the UST Elastic-net [19]. Witten and Tibshirani [22] incorporated the Lasso penalty in their SCCA. However, none of these approaches controls the sparsity directly.

At the same time, it has been shown that various penalized likelihood approaches have the oracle property under certain conditions [18, 23–26]. In practice, however, the oracle property cannot be realized without an appropriate feature selection model, so these approaches do not necessarily generate a sparse set of features. To address this issue, this paper proposes a two-phase process for implementing SCCA in which an L1 penalty is applied to the feature loadings during the first phase, and a Bayesian information criterion (BIC) based feature filtering algorithm is then used to further remove redundant and noisy information. Specifically, in the first phase, a constraint on the loadings is tightened gradually through an iterative process, which prevents the sparse correlation from falling below a lower confidence limit of the approximated canonical correlation. In the second phase, the BIC drives the feature filtering, which sets the smallest loading in absolute value to zero at each iteration for all features.

The rest of this paper is organized as follows. Section 2 presents the feature extraction process. Section 3 describes the methodology in detail. Section 4 reports the experimental results and assesses performance. Finally, Section 5 summarizes the work and outlines directions for future work.

#### 2. Multiset Feature Extraction

##### 2.1. Scale-Invariant Feature Transform (SIFT) Descriptor

Within the detected region, the SIFT descriptor extracts a gradient orientation histogram [27]. The gradient image is sampled over a 4 × 4 grid of locations in each of eight orientation planes, so the resulting descriptor has dimension 4 × 4 × 8 = 128. A Gaussian window function assigns a weight to the magnitude of each sample point. This emphasizes the gradients near the center of the region and makes the descriptor less sensitive to small changes in the position of the detected region. The gradient magnitude is used to weight the contribution to the orientation and location bins. The descriptor is robust to small errors in the region detection and to small geometric distortions, largely because orientations and gradient locations are quantized. Finally, the descriptor is divided by the square root of the sum of its squared components (its Euclidean norm) to obtain illumination invariance.
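The accumulation described above can be sketched in a few lines. This is a simplified illustration, not the reference SIFT implementation: the function name `sift_like_descriptor`, the 16 × 16 patch size, and the choice of a Gaussian width equal to half the patch width are assumptions made for the example, and refinements such as rotation to a dominant orientation and interpolation between bins are omitted.

```python
import math

def sift_like_descriptor(mag, ori, grid=4, n_ori=8):
    """Build a grid x grid x n_ori orientation histogram from a square patch.

    mag[y][x] is the gradient magnitude and ori[y][x] the gradient
    orientation in radians. The patch side must be a multiple of `grid`.
    """
    size = len(mag)
    cell = size // grid                 # pixels per spatial cell
    centre = (size - 1) / 2.0
    sigma = size / 2.0                  # assumed Gaussian window width
    desc = [0.0] * (grid * grid * n_ori)
    for y in range(size):
        for x in range(size):
            # Gaussian weight: gradients near the patch centre count more.
            w = math.exp(-((x - centre) ** 2 + (y - centre) ** 2) / (2 * sigma ** 2))
            # Quantize the orientation into one of n_ori planes.
            angle = ori[y][x] % (2 * math.pi)
            o = int(angle / (2 * math.pi) * n_ori) % n_ori
            idx = ((y // cell) * grid + (x // cell)) * n_ori + o
            desc[idx] += w * mag[y][x]
    # Normalize by the Euclidean norm for illumination invariance.
    norm = math.sqrt(sum(d * d for d in desc))
    return [d / norm for d in desc] if norm > 0 else desc
```

For a 16 × 16 patch the function returns a 128-dimensional vector of unit Euclidean norm, matching the dimension stated in the text.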

##### 2.2. Colour Histogram Descriptor

The colour histogram descriptor consists of three separate histograms for the R, G, and B channels [28]. A colour histogram represents the approximate distribution of the colours in an image, because each histogram bin stands for a local colour range in the given colour space. Colour histograms are invariant to translation and rotation of the image content, and they are simple to compute. In our experiments, the number of bins is 40 per channel, and concatenating the three independent histograms yields a descriptor of dimension 120.
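As a minimal sketch of this descriptor, the function below bins each channel into 40 equal-width ranges and concatenates the three histograms. The function name `colour_histogram`, the 8-bit value range, and the use of raw counts without normalization are assumptions of the example rather than details given in the paper.

```python
def colour_histogram(pixels, bins=40):
    """Concatenate per-channel histograms of 8-bit (R, G, B) pixels.

    `pixels` is an iterable of (r, g, b) integer triples in 0..255.
    The result has 3 * bins entries: R bins first, then G, then B.
    """
    hist = [0] * (3 * bins)
    for rgb in pixels:
        for channel, value in enumerate(rgb):
            # Map 0..255 onto 0..bins-1 with equal-width bins.
            hist[channel * bins + value * bins // 256] += 1
    return hist
```

With the default of 40 bins the output has 120 entries, and each pixel contributes exactly one count to each of the three channel histograms.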

#### 3. Methodology

##### 3.1. Sparse Canonical Correlation Analysis (SCCA)

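Canonical correlation analysis (CCA) is a multivariate statistical approach for exploring the correlation between two sets of features [13]. Suppose that two feature sets $X$ and $Y$, of dimensions $n \times p$ and $n \times q$, are extracted from the same image. Let the columns of $X$ and $Y$ be standardized to have mean 0 and standard deviation 1, let $u$ and $v$ be $p \times 1$ and $q \times 1$ vectors of weights, and let $\omega = Xu$ and $\xi = Yv$ be linear combinations of the features of data sets $X$ and $Y$, respectively. Note that $\omega$ and $\xi$ are $n \times 1$ vectors. The coefficient vectors $u$ and $v$ are then estimated by maximizing (1):

$$\rho = \operatorname{corr}(\omega, \xi) = \frac{u^{T}\Sigma_{XY}v}{\sqrt{u^{T}\Sigma_{XX}u}\,\sqrt{v^{T}\Sigma_{YY}v}}, \tag{1}$$

where $\Sigma_{XX}$ and $\Sigma_{YY}$ are the within-set covariance matrices and $\Sigma_{XY}$ is the between-set covariance matrix. Because the scaling of $u$ and $v$ does not affect the correlation coefficient, (1) can be reformulated as follows:

$$\rho = \max_{u,v}\; u^{T}\Sigma_{XY}v \quad \text{subject to} \quad u^{T}\Sigma_{XX}u = 1,\; v^{T}\Sigma_{YY}v = 1. \tag{2}$$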

The aforementioned CCA approach is not applicable when the number of features is excessive. Latent multicollinearity between predictor features further complicates the computation, because the covariance matrices can become undefined or unstable. As a consequence, some critical features should be selected by standard model selection criteria. The selected set of features is then used to compute the canonical correlation, which makes the results easier to interpret; this procedure is called sparse canonical correlation analysis (SCCA). Theoretically, SCCA is implemented by maximizing the penalized objective function below:

A number of approaches have been introduced to tackle the multicollinearity problem. Vinod suggested adding penalty terms to the diagonal elements of the covariance matrices, which resembles the idea of ridge regression in regression analysis [29]. This requires additional ridge parameters to be estimated. Other forms of regularization have been proposed in which the variance matrices are replaced with identity matrices [22] or with their corresponding diagonal matrices [21]. In our work, the within-set covariance matrices $\Sigma_{XX}$ and $\Sigma_{YY}$ are replaced with their corresponding diagonal matrices.

##### 3.2. Shrinkage Methods

Penalized linear regression has been widely applied to the analysis of high-dimensional data, and it incorporates feature selection and shrinkage techniques. Assume that $y$ is an $n \times 1$ response vector and $X$ is an $n \times p$ matrix. The penalized regression coefficient estimate is then obtained from the following penalized regression model:

$$\hat{\beta} = \arg\min_{\beta}\left\{ (y - X\beta)^{T}(y - X\beta) + P_{\lambda}(\beta) \right\}, \tag{4}$$

where $P_{\lambda}(\beta)$ is the penalty term and $\lambda$ is a tuning parameter that is estimated by permutation approaches or cross validation (CV). The first penalized regression approach, ridge regression, was proposed to temper the multicollinearity among the predictors by adding a quadratic penalty term to the ordinary least squares estimating equations [30]. Ridge regression penalizes the coefficients to shrink them towards zero, yet the shrunken coefficients are never exactly zero. As a consequence, ridge regression cannot perform feature selection. The least absolute shrinkage and selection operator (Lasso), the Elastic-net, and smoothly clipped absolute deviation (SCAD) differ from ridge regression in that they both address the multicollinearity problem (i.e., shrinkage) and set some of the coefficients exactly to zero, producing a sparse set of features (i.e., feature selection). In this paper, we apply different penalty functions to SCCA using the algorithm elaborated by Parkhomenko et al. [21]. The tuning parameters for all penalty functions are estimated through cross validation (CV).

###### 3.2.1. Least Absolute Shrinkage and Selection Operator (Lasso) Penalty

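The least absolute shrinkage and selection operator (Lasso) penalty is a shrinkage approach that selects discriminant features by shrinking some coefficients and setting others to zero [31]. The penalty term of the Lasso is defined as follows:

$$P_{\lambda}(\beta) = \lambda \sum_{j=1}^{p} |\beta_{j}|, \tag{5}$$

where $\lambda$ is a tuning parameter. When the columns of the design matrix are orthonormal, the solution of the Lasso is given as

$$\hat{\beta}_{j}^{\text{lasso}} = \operatorname{sign}(\hat{\beta}_{j})\left(|\hat{\beta}_{j}| - \frac{\lambda}{2}\right)_{+}, \tag{6}$$

where $\hat{\beta}_{j}$ is the least squares estimate and $(t)_{+} = \max(t, 0)$. This is similar to the soft thresholding rule introduced by Donoho et al. [32] and Donoho and Johnstone [33], which was used to estimate wavelet coefficients.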

###### 3.2.2. Elastic-Net Penalty

The Elastic-net is a regularization mechanism that carries out continuous shrinkage and feature selection simultaneously [34]. Specifically, it combines the Lasso penalty with the quadratic penalty of ridge regression in a convex combination. Consequently, it retains the abilities of feature selection and coefficient shrinkage. The Elastic-net penalty can be formulated as follows:

$$P_{\lambda_{1},\lambda_{2}}(\beta) = \lambda_{1}\sum_{j=1}^{p}|\beta_{j}| + \lambda_{2}\sum_{j=1}^{p}\beta_{j}^{2}. \tag{7}$$

Nevertheless, because two tuning parameters must be estimated, the computational cost of the Elastic-net is somewhat higher.

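As an alternative to the Elastic-net, Zou and Hastie proposed a simplified version called univariate soft thresholding (UST) [34], the solution of which is

$$\hat{\beta}_{j} = \left(|y^{T}x_{j}| - \frac{\lambda_{1}}{2}\right)_{+}\operatorname{sign}(y^{T}x_{j}), \tag{8}$$

where $x_{j}$ is the $j$th column of the design matrix and $y$ is the response vector. In this paper, the Elastic-net based on the univariate soft threshold is applied to the feature loadings $u$ and $v$ as follows:

$$u_{j} \leftarrow \left(|u_{j}| - \frac{\lambda_{u}}{2}\right)_{+}\operatorname{sign}(u_{j}), \qquad v_{k} \leftarrow \left(|v_{k}| - \frac{\lambda_{v}}{2}\right)_{+}\operatorname{sign}(v_{k}). \tag{9}$$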
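To make the role of soft thresholding on the loadings concrete, the following sketch alternates between updating the two loading vectors and thresholding them. It is loosely modelled on the iterative scheme attributed to Parkhomenko et al. [21] and is not the authors' implementation: the function names, the all-ones initialization, the fixed iteration count, and the replacement of the within-set covariance matrices by the identity (their diagonal after standardization) are assumptions made for the illustration.

```python
import math

def standardise(cols):
    """Centre each column and scale it to unit sample standard deviation."""
    out = []
    for c in cols:
        n = len(c)
        mean = sum(c) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in c) / (n - 1))
        out.append([(x - mean) / sd for x in c])
    return out

def soft(vec, delta):
    """Soft thresholding: shrink every entry towards zero by delta."""
    return [math.copysign(max(abs(x) - delta, 0.0), x) for x in vec]

def unit(vec):
    """Scale a vector to unit Euclidean norm (zero vectors are returned as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def scca(x_cols, y_cols, lam_u, lam_v, n_iter=50):
    """Return sparse loading vectors (u, v) for column-wise data sets."""
    xs, ys = standardise(x_cols), standardise(y_cols)
    n, p, q = len(xs[0]), len(xs), len(ys)
    # K[j][k] is the sample correlation between X column j and Y column k.
    K = [[sum(a * b for a, b in zip(xc, yc)) / (n - 1) for yc in ys] for xc in xs]
    u, v = unit([1.0] * p), unit([1.0] * q)
    for _ in range(n_iter):
        # Update u from v, then threshold with lam_u / 2 and renormalize.
        u = unit([sum(K[j][k] * v[k] for k in range(q)) for j in range(p)])
        u = unit(soft(u, lam_u / 2))
        # Update v from u in the same way.
        v = unit([sum(K[j][k] * u[j] for j in range(p)) for k in range(q)])
        v = unit(soft(v, lam_v / 2))
    return u, v
```

Loadings whose magnitude stays below half the tuning parameter after normalization are set exactly to zero, which is how the sparse set of features emerges. Larger values of `lam_u` and `lam_v` remove more features.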

###### 3.2.3. Smoothly Clipped Absolute Deviation (SCAD) Penalty

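Fan and Li proposed a nonconvex penalty function named smoothly clipped absolute deviation (SCAD) [18]. They suggested three criteria for a good penalty function, namely, (i) sparsity, (ii) continuity, and (iii) unbiasedness, and they showed that the SCAD penalty possesses these properties. With tuning parameters $\lambda$ and $a$, the SCAD penalty is given by

$$P_{\lambda}(\beta) = \begin{cases} \lambda|\beta|, & |\beta| \le \lambda, \\[4pt] -\dfrac{\beta^{2} - 2a\lambda|\beta| + \lambda^{2}}{2(a-1)}, & \lambda < |\beta| \le a\lambda, \\[4pt] \dfrac{(a+1)\lambda^{2}}{2}, & |\beta| > a\lambda. \end{cases} \tag{10}$$

When $|\beta|$ lies between $\lambda$ and $a\lambda$, the SCAD penalty function coincides with a quadratic spline function. The function is continuous, and for $a > 2$ and $\beta > 0$ the first derivative can be formulated as follows:

$$P'_{\lambda}(\beta) = \lambda\left\{ I(\beta \le \lambda) + \frac{(a\lambda - \beta)_{+}}{(a-1)\lambda}\, I(\beta > \lambda) \right\}. \tag{11}$$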

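The SCAD penalty is continuously differentiable on $(-\infty, 0) \cup (0, \infty)$, but singular at 0, with its derivative equal to zero outside the range $[-a\lambda, a\lambda]$. This penalty function sets small coefficients to zero, shrinks mid-size coefficients towards zero, and leaves large coefficients untouched. Consequently, the SCAD penalty produces a sparse solution and nearly unbiased estimates of large coefficients. With $z$ denoting the unpenalized estimate of a coefficient, and following the formulation of Fan and Li [18], in which the squared-error term carries a factor of 1/2, the solution of the SCAD penalty is

$$\hat{\beta} = \begin{cases} \operatorname{sign}(z)\,(|z| - \lambda)_{+}, & |z| \le 2\lambda, \\[4pt] \dfrac{(a-1)z - \operatorname{sign}(z)\,a\lambda}{a-2}, & 2\lambda < |z| \le a\lambda, \\[4pt] z, & |z| > a\lambda. \end{cases} \tag{12}$$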

This thresholding rule has two unknown parameters: $\lambda$ and $a$. Ideally, the optimal pair $(\lambda, a)$ would be found by a two-dimensional grid search with criteria such as cross validation. However, such a search is computationally intensive. From a Bayesian perspective, Fan and Li suggested that $a = 3.7$ is a good choice for many problems [18]. They further noted that feature selection performance does not improve significantly when $a$ is chosen by data-driven approaches. In this paper, we set $a$ to 3.7 and select $\lambda$ by cross validation. The thresholding rule (12) is applied to the loading vectors $u$ and $v$.
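The three regimes of the SCAD thresholding rule can be checked numerically with a short function. This is an illustrative sketch of the rule of Fan and Li [18] with $a = 3.7$ as the default; the function name `scad_threshold` and the scalar interface are choices made for the example.

```python
import math

def scad_threshold(z, lam, a=3.7):
    """Apply the SCAD thresholding rule to a single unpenalized estimate z."""
    absz = abs(z)
    if absz <= 2 * lam:
        # Small coefficients: soft thresholding, exactly zero below lam.
        return math.copysign(max(absz - lam, 0.0), z)
    if absz <= a * lam:
        # Mid-size coefficients: shrinkage that fades out as |z| grows.
        return ((a - 1) * z - math.copysign(a * lam, z)) / (a - 2)
    # Large coefficients are left untouched.
    return z
```

The rule is continuous at the two breakpoints: at $|z| = 2\lambda$ both neighbouring branches give $\lambda$, and at $|z| = a\lambda$ both give $a\lambda$.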

###### 3.2.4. Hard-Threshold Penalty

Hard-thresholding directly sets some coefficients to zero [35, 36]. However, this penalty function does not address the multicollinearity among the predictors, because it does not shrink any coefficients toward zero. Nevertheless, it yields unbiased estimates of coefficients with large effects. With $z$ denoting the unpenalized estimate of a coefficient, the hard-thresholding rule is as follows:

$$\hat{\beta} = z\, I(|z| > \lambda). \tag{13}$$
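A short sketch makes the keep-or-kill behaviour of hard thresholding explicit. The function name `hard_threshold` is chosen for the example.

```python
def hard_threshold(z, lam):
    """Keep z unchanged when |z| exceeds lam, otherwise set it to zero."""
    return z if abs(z) > lam else 0.0

# Coefficients just above the threshold survive without any shrinkage,
# while those at or below it are removed entirely.
loadings = [0.05, -0.30, 0.31, -1.20]
filtered = [hard_threshold(z, 0.30) for z in loadings]
```

Here `filtered` keeps `0.31` and `-1.20` at their original values and zeroes the other two entries, which illustrates why the rule selects features without shrinking them.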

##### 3.3. The Suggested BIC Based Feature Filtering Algorithm

The major drawback of current SCCA approaches is that they do not control sparsity directly, which makes it hard to eliminate noisy and redundant information efficiently and effectively. There is a trade-off between the sparsity of the features and the maximum correlation. In this paper, we propose a two-phase process to balance the two. In the first phase, a constraint on the loadings is tightened gradually through an iterative process, which prevents the sparse correlation from falling below a lower confidence limit of the approximated canonical correlation. In the second phase, the Bayesian information criterion (BIC) drives the feature filtering, which sets the smallest loading in absolute value to zero at each iteration for all features.

The proposed feature filtering process is iterative and simple. One more coefficient of , at each iteration, is set to be 0 according to the magnitudes of the coefficients in absolute value. Let and be the constrained effective dimension reduction direction; the proposed feature filtering process is shown as follows.

(i) Let .

(ii) Define a new direction by maintaining the largest coefficients of in absolute value and assigning the other coefficients to 0. Searching as projection of into the space , the set of all should satisfy the following.

  (1) The set of zero coefficients in is the same as that in .

  (2) Consider , .

  (3) Consider .

(iii) Compute the correlation and the BIC-type criterion , where is the sample size.

(iv) Let . Repeat Steps (ii)–(iv) until .

After the above feature filtering process is implemented, we obtain a sequence of as descends from to 0. Let be the integer at which is minimized. Then, the smallest coefficients of in absolute value are assigned to 0. This proposed feature filtering process is a streamlined feature selection process. In feature filtering, at most possibilities are taken into account, which makes it viable to conduct even when are large. Lastly, the features corresponding to the minimum BIC value are the final selected features.
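The overall shape of such a filtering loop can be sketched as follows. This is a hedged illustration and not the authors' algorithm: the criterion `n * log(1 - rho**2) + m * log(n)` is an assumed BIC-type form borrowed from Gaussian regression, the starting count of `p - 1` retained coefficients is an assumption, and the projection step with its constraints is replaced by simple truncation of the loading vector. The names `correlation` and `bic_filter` are hypothetical.

```python
import math

def correlation(a, b):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    if sa == 0 or sb == 0:
        return 0.0
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (sa * sb)

def bic_filter(x_rows, xi, u):
    """Zero the smallest loadings of u one at a time and keep the best BIC.

    x_rows: n rows of p feature values; xi: the n scores of the other
    canonical variate; u: the p loadings to be filtered.
    Returns (number of retained loadings, filtered loading vector).
    """
    n, p = len(x_rows), len(u)
    order = sorted(range(p), key=lambda j: abs(u[j]), reverse=True)
    best = None
    for m in range(p - 1, -1, -1):          # assumed range: p - 1 down to 0
        keep = set(order[:m])
        u_m = [u[j] if j in keep else 0.0 for j in range(p)]
        omega = [sum(row[j] * u_m[j] for j in range(p)) for row in x_rows]
        rho = correlation(omega, xi)
        # Assumed BIC-type criterion; the floor avoids log(0) when rho = 1.
        bic = n * math.log(max(1.0 - rho ** 2, 1e-12)) + m * math.log(n)
        if best is None or bic < best[0]:
            best = (bic, m, u_m)
    return best[1], best[2]
```

Because the loop evaluates one candidate per retained count, it inspects at most `p` loading vectors, so the cost grows linearly with the number of features rather than with the number of feature subsets.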

##### 3.4. Support Vector Machine (SVM) with Pyramid Match Kernel (PMK-SVM) Classifier

Kernel-based learning algorithms rest on the idea of mapping data into a Euclidean feature space and then finding linear relations within the mapped data. For example, the SVM finds the optimal separating hyperplane between two classes in a feature space. A kernel function maps pairs of data points in the input space to their inner product in the feature space, thereby measuring the similarity between all points and determining their relative positions. Linear relations are sought in the feature space, although the decision boundary may still be nonlinear in the input space, depending on the feature mapping function. The support vector machine (SVM) with pyramid match kernel (PMK-SVM) [37] provides an accurate and efficient solution for classification, and the pyramid match kernel function is formulated as follows:

$$K_{\Delta}(\Psi(y), \Psi(z)) = \sum_{i=0}^{L} \frac{1}{2^{i}}\Bigl(\mathcal{I}\bigl(H_{i}(y), H_{i}(z)\bigr) - \mathcal{I}\bigl(H_{i-1}(y), H_{i-1}(z)\bigr)\Bigr),$$

where $y$ and $z$ are the input sets, drawn from a feature space bounded by a sphere of diameter $D$, $\Psi$ is the feature extraction function, $H_{i}(x)$ is the $i$th histogram in $\Psi(x)$ for $i = 0, \ldots, L$, and $\mathcal{I}$ is a histogram intersection function which measures the overlap between the bins of two histograms:

$$\mathcal{I}(A, B) = \sum_{j=1}^{r} \min\bigl(A^{(j)}, B^{(j)}\bigr),$$

where $A$ and $B$ are histograms with $r$ bins, and $A^{(j)}$ denotes the count of the $j$th bin of $A$. Since, by construction of the pyramid, $\mathcal{I}(H_{-1}(y), H_{-1}(z)) = 0$ and $\mathcal{I}(H_{L}(y), H_{L}(z)) = \min(|y|, |z|)$, the pyramid match kernel is equivalent to

$$K_{\Delta}(\Psi(y), \Psi(z)) = \frac{\min(|y|, |z|)}{2^{L}} + \sum_{i=0}^{L-1} \frac{1}{2^{i+1}}\, \mathcal{I}\bigl(H_{i}(y), H_{i}(z)\bigr).$$

To preserve generality and obtain satisfactory classification results, we employ the PMK-SVM as the classifier. In the multiclass case, a set of binary classifiers combined by majority voting performs the multiclass categorization. Figure 1 illustrates our classification scheme based on the suggested two-phase BIC filtering process.
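The two forms of the pyramid match kernel can be compared on a toy example. The sketch below is a one-dimensional simplification with integer feature values, bin widths that double at each level, and $L = \lceil \log_2 D \rceil$ levels; these choices and the function names are assumptions made for the illustration, not details of the classifier used in the experiments.

```python
import math
from collections import Counter

def histogram(points, width):
    """Count the points falling into bins of the given width."""
    return Counter(p // width for p in points)

def intersection(hist_a, hist_b):
    """Histogram intersection: sum of bin-wise minima."""
    return sum(min(count, hist_b[key]) for key, count in hist_a.items())

def pyramid_match(y, z, diameter):
    """Weighted sum of the new matches found at each pyramid level."""
    levels = math.ceil(math.log2(diameter))
    score, prev = 0.0, 0
    for i in range(levels + 1):
        cur = intersection(histogram(y, 2 ** i), histogram(z, 2 ** i))
        score += (cur - prev) / 2 ** i      # new matches weighted by 1 / 2^i
        prev = cur
    return score

def pyramid_match_closed(y, z, diameter):
    """Equivalent form that uses the raw intersections at levels 0..L-1."""
    levels = math.ceil(math.log2(diameter))
    score = min(len(y), len(z)) / 2 ** levels
    for i in range(levels):
        score += intersection(histogram(y, 2 ** i), histogram(z, 2 ** i)) / 2 ** (i + 1)
    return score
```

For the sets `[1, 5, 12]` and `[2, 5]` with diameter 16, the intersections at levels 0 to 4 are 1, 1, 2, 2, and 2, so both functions return 1.25. Matches found at finer levels receive larger weights, which is what makes the kernel reward close correspondences.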