#### Abstract

Feature extraction plays an important role as a preprocessing step in dealing with small sample size problems. LDA, LPP, and many other existing methods each consider only one aspect of the data set, either its global structure or its local geometry. To address this limitation, we propose an efficient method named global between maximum and local within minimum (GBMLWM). It considers the global structure of the data set and also exploits its local geometry by dividing the data set into four domains. The method preserves nearest-neighborhood relations and demonstrates excellent classification performance. Its superiority is shown in experiments on data visualization, face representation, and face recognition.

#### 1. Introduction

With the continual development of information technology, the amount of data has expanded greatly in domains such as pattern recognition, artificial intelligence, and computer vision. When the dimension of the samples is much greater than the number of available samples, the result is “the curse of dimensionality” [1]. Feature extraction plays an important role in dealing with such small sample size (SSS) problems. It represents the original high-dimensional data in a low-dimensional space by capturing important structure and information in the data, and it is a common preprocessing procedure in multivariate statistical data analysis. Feature extraction methods have been applied successfully in many domains, such as text classification [2], remote sensing image analysis [3], microarray data analysis [4], and face recognition [5, 6].

Generally speaking, feature extraction methods can be divided into three kinds according to the tools they apply. The first kind is based on algebraic properties: it uses generalized eigendecomposition from matrix theory and extracts features containing more discriminative information by discovering the algebraic structure of the samples. The second kind is based on geometric characteristics and aims to find optimal projection directions. The third kind arises because traditional statistical methods, which depend on the number of samples, face challenges on SSS problems; it therefore draws on interdisciplinary knowledge from statistical inference and information science to obtain informative features.

From the viewpoint of data structure, feature extraction methods work in two ways: globally and locally. Global methods regard the data set, or each class in it, as a whole and extract features of the corresponding whole. Typical methods include principal component analysis (PCA) [7, 8], linear discriminant analysis (LDA) [7, 9], and maximum margin criterion (MMC) [10]. Although these methods can achieve the intended results in practice, they do not use the inner structure of the data set. Local methods, in contrast, use the neighborhood structure of each sample and preserve the local information in the extracted features. Locality preserving projections (LPP) [11, 12], neighborhood preserving embedding (NPE) [13], and average neighborhood margin maximum (ANMM) [14] assume that the neighborhood of each sample lies on a submanifold of the data set and preserve this property. DLPP [15], PRLPP [16], and ILPP [17] are based on local information and were proposed to address the small sample size (SSS) problem. In [18], an entropy regularization term is incorporated into the objective function to control the uniformity of the edge weights in the graph. DLPP/MMC [19] seeks to maximize the difference between local classes based on the maximum margin criterion. These methods use either the local geometry or the global information alone and do not capture the entire intrinsic structure of the training set.

In this paper, we combine the global character of the data set with its local structure to perform linear projection, and we propose a new feature extraction method called global between maximum and local within minimum (GBMLWM). For a fixed sample from the data set, the other samples are divided into four parts: samples of the same class as the fixed sample, inside or outside its nearest neighborhood, and samples of a different class, inside or outside its nearest neighborhood; see Figure 1. For local within minimum, we make use of three domains: domain I, domain II, and domain III. For global between maximum, our method, similarly to LDA, maximizes the between-class scatter. It therefore overcomes the disadvantages of global methods while taking full advantage of local methods. Some properties of the GBMLWM algorithm are worth highlighting:

(1) GBMLWM shares excellent properties with LDA and MMC. We maintain the global merit through global between maximum. Similar to LDA, we first keep all samples in the data set away from the class centroid, and then bring the samples that share the class of the fixed sample but lie beyond its nearest neighborhood close to their class centroid. GBMLWM is thus a supervised method that separates different classes and keeps each class compact.

(2) By making use of the local structure of the data set, GBMLWM inherits several important advantages of local methods such as LPP. We maintain the neighborhood of each sample and keep the samples of the same class within its nearest neighborhood close to it, while pushing the samples of different classes away from it. GBMLWM therefore maintains the submanifold structure around the fixed sample.

(3) PCA, LDA, MMC, LPP, and ANMM can be derived from the GBMLWM framework by imposing certain conditions; that is, those methods are special cases of GBMLWM. Visualization and classification experiments also indicate that the proposed method is superior to them.

The rest of this paper is organized as follows. Section 2 briefly reviews global and local methods, namely PCA, LDA, MMC, LPP, and ANMM. The GBMLWM algorithm is put forward in Section 3, where its relationship with the above methods is also discussed. The experimental results are presented in Section 4. The conclusions appear in Section 5.

#### 2. Brief Review of Global and Local Methods

Suppose that $X = \{x_1, x_2, \ldots, x_N\}$ is a set of $m$-dimensional samples of size $N$, composed of $C$ classes, where the $c$th class contains $N_c$ samples, $\sum_{c=1}^{C} N_c = N$, and let $x_i^c$ be an $m$-dimensional column vector denoting the $i$th sample from the $c$th class. Generally speaking, the aim of linear feature extraction or dimensionality reduction is to find an optimal linear transformation $W \in \mathbb{R}^{m \times d}$ from the original high-dimensional space $\mathbb{R}^m$ to the target low-dimensional space $\mathbb{R}^d$ ($d < m$), so that the transformed data $y_i = W^T x_i$ best represent, under different optimality criteria, different information such as algebraic and geometric structure.

##### 2.1. Principal Component Analysis

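PCA seeks an optimal projection direction so that the variance of the data set is maximized, or equivalently the average cost of projection is minimized, after transformation. The objective function of PCA is defined as follows:

$$\max_{W^T W = I} \frac{1}{N} \sum_{i=1}^{N} \left\| y_i - \bar{y} \right\|^2, \tag{2.1}$$

where $y_i = W^T x_i$ and $\bar{y}$ is the mean of the $y_i$. Applying simple algebra, (2.1) may be rewritten as

$$\max_{W^T W = I} \operatorname{tr}\left( W^T S_t W \right), \tag{2.2}$$

where

$$S_t = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T \tag{2.3}$$

is the sample covariance matrix and $\bar{x}$ is the mean of all the samples. The optimal $W$ consists of the eigenvectors of $S_t$ corresponding to the first $d$ largest eigenvalues.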
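The PCA step described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation; the function name `pca_directions`, the row-per-sample layout of `X`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def pca_directions(X, d):
    """Return the d leading principal directions of X (rows are samples)."""
    Xc = X - X.mean(axis=0)                 # centre on the total mean
    cov = Xc.T @ Xc / X.shape[0]            # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]   # indices of the d largest
    return eigvecs[:, order], eigvals[order]

# Synthetic data whose variance is concentrated on the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
W, lam = pca_directions(X, 2)
Y = (X - X.mean(axis=0)) @ W                # projected samples, 200 x 2
```

The variance of each projected coordinate equals the corresponding eigenvalue, which is a quick sanity check on the decomposition.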

##### 2.2. Linear Discriminant Analysis

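The purpose of LDA is to discriminate and classify: it seeks an optimal discriminative subspace by maximizing the between-class scatter while minimizing the within-class scatter. LDA's objective is to find a set of vectors $w$ according to

$$w^{*} = \arg\max_{w} \frac{w^T S_b w}{w^T S_w w}, \tag{2.4}$$

where

$$S_b = \sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T, \qquad S_w = \sum_{c=1}^{C} \sum_{i=1}^{N_c} (x_i^c - \bar{x}_c)(x_i^c - \bar{x}_c)^T \tag{2.5}$$

represent, respectively, the between-class scatter matrix and the within-class scatter matrix, $N_c$ is the number of samples in the $c$th class, $\bar{x}_c$ is the mean of the $c$th class, and $\bar{x}$ is the mean of all the samples. The projection directions are the generalized eigenvectors solving $S_b w = \lambda S_w w$ associated with the first $d$ largest eigenvalues.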
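As a rough illustration of the LDA step, the sketch below builds the two scatter matrices and solves the generalized eigenproblem with plain NumPy. The function names, the row-per-sample layout, and the small `ridge` term added to the within-class scatter are assumptions of this example; the ridge is only there so the solve does not fail when the within-class scatter is singular, which is exactly the SSS situation.

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class (Sb) and within-class (Sw) scatter; rows of X are samples."""
    mean_all = X.mean(axis=0)
    m = X.shape[1]
    Sb = np.zeros((m, m))
    Sw = np.zeros((m, m))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        diff = (mean_c - mean_all)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)    # weighted by class size
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
    return Sb, Sw

def lda_directions(X, y, d, ridge=1e-6):
    """Leading d generalized eigenvectors of Sb w = lambda Sw w."""
    Sb, Sw = scatter_matrices(X, y)
    Sw_reg = Sw + ridge * np.eye(Sw.shape[0])  # assumption: small ridge for stability
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw_reg, Sb))
    order = np.argsort(eigvals.real)[::-1][:d]
    return eigvecs[:, order].real

# Three synthetic Gaussian classes in four dimensions.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
X = np.vstack([mu + rng.normal(size=(30, 4)) for mu in means])
y = np.repeat(np.arange(3), 30)
Sb, Sw = scatter_matrices(X, y)
W = lda_directions(X, y, d=2)
```

With these definitions the two scatter matrices add up to the total scatter of the centred data, which gives a simple consistency check.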

##### 2.3. Maximum Margin Criterion

MMC keeps the similarity or dissimilarity information of the high-dimensional space as much as possible after dimensionality reduction by employing the overall variance and measuring the average margin between different classes. MMC's projection matrix is given by

$$W^{*} = \arg\max_{W^T W = I} \operatorname{tr}\left( W^T (S_b - S_w) W \right),$$

where $S_b$ and $S_w$ are defined as in (2.5).
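The following sketch illustrates why a difference-based criterion such as MMC needs no matrix inverse: the projection comes from an ordinary symmetric eigendecomposition, so it still works when there are fewer samples than dimensions. The function name `mmc_directions`, the row-per-sample layout, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def mmc_directions(X, y, d):
    """Top-d eigenvectors of Sb - Sw (rows of X are samples)."""
    mean_all = X.mean(axis=0)
    m = X.shape[1]
    Sb = np.zeros((m, m))
    Sw = np.zeros((m, m))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        diff = (mean_c - mean_all)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
    eigvals, eigvecs = np.linalg.eigh(Sb - Sw)  # symmetric, no inverse needed
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order], Sw

# Small-sample-size setting: 12 samples in 50 dimensions.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 4)
X = rng.normal(size=(12, 50)) + 3.0 * y[:, None]
W, Sw = mmc_directions(X, y, d=2)
```

Here the within-class scatter is rank deficient, so the LDA generalized eigenproblem would be ill-posed, while the eigendecomposition above is unaffected.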

##### 2.4. Locality Preserving Projection

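PCA, LDA, and MMC aim to preserve the global structure of the data set, while LPP preserves its local structure. LPP models the local submanifold structure by maintaining the neighborhood relations of the samples before and after transformation. With the same notation as above, the objective function of LPP is defined as follows:

$$\min_{W} \frac{1}{2} \sum_{i,j} \left\| y_i - y_j \right\|^2 S_{ij} = \min_{W} \operatorname{tr}\left( W^T X L X^T W \right) \quad \text{subject to } W^T X D X^T W = I,$$

where $X = [x_1, \ldots, x_N]$, $D$ is a diagonal matrix, that is, $D_{ii} = \sum_j S_{ij}$, and $L = D - S$ is the Laplacian matrix. $S$ is a similarity matrix, defined as follows:

$$S_{ij} = \begin{cases} \exp\left( -\left\| x_i - x_j \right\|^2 / t \right), & x_j \in N_k(x_i) \text{ or } x_i \in N_k(x_j), \\ 0, & \text{otherwise}, \end{cases}$$

where $y_i = W^T x_i$ for $i = 1, \ldots, N$, $t$ is a kernel parameter, and $N_k(x_i)$ is the set of the $k$ nearest neighbors of $x_i$. The optimal $W$ is given by the eigenvectors corresponding to the smallest eigenvalues of the following generalized eigenvalue problem:

$$X L X^T w = \lambda X D X^T w.$$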
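A small NumPy sketch of the LPP ingredients described above is given here: a symmetric nearest-neighbor similarity matrix with Gaussian weights, the graph Laplacian, and the generalized eigenproblem. The function names, the row-per-sample layout, the symmetrization rule (two samples are linked if either is among the other's neighbors), and the `ridge` term are assumptions of this example rather than details taken from the paper.

```python
import numpy as np

def heat_kernel_similarity(X, k, t):
    """Symmetric k-NN similarity matrix with Gaussian weights (rows are samples)."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    N = X.shape[0]
    S = np.zeros((N, N))
    for i in range(N):
        row = dist2[i].copy()
        row[i] = np.inf                      # a sample is not its own neighbour
        nn = np.argsort(row)[:k]
        S[i, nn] = np.exp(-dist2[i, nn] / t)
    return np.maximum(S, S.T)                # link if either is a neighbour of the other

def lpp_directions(X, S, d, ridge=1e-8):
    """Eigenvectors with the d smallest eigenvalues of the LPP eigenproblem."""
    Dg = np.diag(S.sum(axis=1))
    L = Dg - S                               # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ Dg @ X + ridge * np.eye(X.shape[1])  # assumption: tiny ridge
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(eigvals.real)[:d]
    return eigvecs[:, order].real

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
S = heat_kernel_similarity(X, k=5, t=1.0)
W = lpp_directions(X, S, d=2)

# Check the identity behind the objective for an arbitrary projection V.
V = rng.normal(size=(3, 2))
Y = X @ V
pair = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
lhs = 0.5 * np.sum(S * pair)
L = np.diag(S.sum(axis=1)) - S
rhs = np.trace(Y.T @ L @ Y)
```

The last lines compare the weighted pairwise-distance sum with the trace form; the two agree for any projection as long as the similarity matrix is symmetric.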

##### 2.5. Average Neighborhood Margin Maximum

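Different from PCA and LDA, ANMM obtains effective discriminating information by maximizing the average local neighborhood margin. For each sample, ANMM pulls the neighboring samples with the same label as close to it as possible while pushing the neighboring samples with different labels as far from it as possible. ANMM's solution is as follows:

$$W^{*} = \arg\max_{W^T W = I} \operatorname{tr}\left( W^T (\tilde{S} - \tilde{C}) W \right),$$

where

$$\tilde{S} = \sum_{i} \sum_{k:\, x_k \in N_i^{e}} \frac{(x_i - x_k)(x_i - x_k)^T}{\left| N_i^{e} \right|}$$

is called the scatterness matrix,

$$\tilde{C} = \sum_{i} \sum_{j:\, x_j \in N_i^{o}} \frac{(x_i - x_j)(x_i - x_j)^T}{\left| N_i^{o} \right|}$$

is called the compactness matrix, $N_i^{e}$ and $N_i^{o}$ are, respectively, the nearest heterogeneous and homogeneous neighborhoods of $x_i$, and $|\cdot|$ is the cardinality of a set. Here, we can regard ANMM as the local version of MMC.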
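The scatterness and compactness matrices can be sketched as below. This is an illustrative reading of ANMM in which each sample's heterogeneous and homogeneous neighborhoods are its `k` nearest samples with a different label and with the same label, respectively; the function names and the row-per-sample layout are assumptions of the example.

```python
import numpy as np

def anmm_matrices(X, y, k):
    """Scatterness (S) and compactness (C) matrices; rows of X are samples."""
    N, m = X.shape
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    S = np.zeros((m, m))
    C = np.zeros((m, m))
    idx_all = np.arange(N)
    for i in range(N):
        same = idx_all[(y == y[i]) & (idx_all != i)]
        other = idx_all[y != y[i]]
        homo = same[np.argsort(dist2[i, same])[:k]]      # nearest same-label samples
        hetero = other[np.argsort(dist2[i, other])[:k]]  # nearest other-label samples
        if hetero.size:
            E = X[i] - X[hetero]
            S += E.T @ E / hetero.size
        if homo.size:
            E = X[i] - X[homo]
            C += E.T @ E / homo.size
    return S, C

def anmm_directions(X, y, k, d):
    """Top-d eigenvectors of S - C."""
    S, C = anmm_matrices(X, y, k)
    eigvals, eigvecs = np.linalg.eigh(S - C)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order]

# Two classes separated along the first coordinate.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(15, 2)), rng.normal(size=(15, 2)) + [6.0, 0.0]])
y = np.repeat([0, 1], 15)
S, C = anmm_matrices(X, y, k=3)
W = anmm_directions(X, y, k=3, d=1)
```

On this toy data the leading direction is close to the first coordinate axis, which is the axis that separates the two classes.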

#### 3. Global between Maximum and Local Within Minimum

In this section, we present our algorithm, global between maximum and local within minimum (GBMLWM), which profits from both global and local methods. The GBMLWM algorithm preserves not only the local neighborhood of the submanifold structure but also the global information of the data set. To state the algorithm, we first define four domains relative to a fixed sample $x_i$ as follows:

*Domain I*: samples that lie in the nearest neighborhood of $x_i$ and have the same class label as $x_i$.

*Domain II*: samples that also lie in the nearest neighborhood of $x_i$ but have a different class label from $x_i$.

*Domain III*: samples that have the same class label as $x_i$ but do not lie in the nearest neighborhood of $x_i$.

*Domain IV*: samples that do not lie in the nearest neighborhood of $x_i$ and have a different class label from $x_i$.

Figure 1 gives an intuition for the four domains. The nearest neighborhood of $x_i$ consists of domains I and II. The samples with the same class label as $x_i$ lie in domains I and III, and the samples with a different class label lie in domains II and IV.
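The four-domain partition can be sketched in a few lines of NumPy. The function name `four_domains`, the use of `k` Euclidean nearest neighbors as the neighborhood, and the toy data are assumptions made for this illustration.

```python
import numpy as np

def four_domains(X, y, i, k):
    """Split all samples except sample i into domains I-IV (rows of X are samples)."""
    N = X.shape[0]
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf                          # exclude the fixed sample itself
    neighbour = np.zeros(N, dtype=bool)
    neighbour[np.argsort(dist)[:k]] = True    # k nearest neighbours of sample i
    others = np.arange(N) != i
    same = y == y[i]
    return {
        "I": np.flatnonzero(others & neighbour & same),
        "II": np.flatnonzero(neighbour & ~same),
        "III": np.flatnonzero(others & ~neighbour & same),
        "IV": np.flatnonzero(~neighbour & ~same),
    }

# Six one-dimensional samples; the fixed sample is index 0 with k = 3.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [0.5]])
y = np.array([0, 0, 1, 1, 0, 1])
domains = four_domains(X, y, i=0, k=3)
```

In this toy case the three nearest neighbors of sample 0 are samples 5, 1, and 2, so domain I is `[1]`, domain II is `[2, 5]`, domain III is `[4]`, and domain IV is `[3]`.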

##### 3.1. Global between Maximum

The purpose of classification and feature extraction is to separate samples of different classes from each other. We first treat the points in domains II and IV by maximizing the global and local between-class scatter. That is, our aim is not only to make the data globally separable but also to maximize the distance between different classes within the nearest neighborhood. Thus, our objective functions are defined as follows:

##### 3.2. Local Within Minimum

For classification, maximizing the between-class scatter is not sufficient; compacting the within-class scatter is also required. So, we now bring the samples from domain I close to $x_i$, push the samples from domain II away from $x_i$, and bring the samples from domain III close to their own class centroid. Two weights are attached to each pair of samples $x_i$ and $x_j$ and are defined by cases: the first is nonzero only if $x_j$ lies in domain I of $x_i$, and the second is nonzero only if $x_j$ lies in domain II of $x_i$; otherwise each weight is zero.

##### 3.3. GBMLWM Algorithm

In the previous description, the nearest neighborhood of $x_i$ means its nearest neighbors under the Euclidean distance between samples of the data set. Our objective function is defined as follows:

Our optimal projection directions are then the solutions to the following optimization problem:

So, $W$ consists of the eigenvectors of the resulting matrix corresponding to the first $d$ largest eigenvalues. Because the GBMLWM algorithm requires no matrix inversion, it is straightforward and avoids the singularity that arises in the SSS problem. The procedure of the GBMLWM algorithm is formally summarized as follows:

(1) for each sample $x_i$, divide the other samples of the data set into four domains: I, II, III, and IV;

(2) compute the four matrices according to (2.3), (3.3), (3.4), and (3.5), respectively;

(3) obtain the combined matrix according to (3.7);

(4) compute the eigenvectors of this matrix; the optimal projection matrix $W$ consists of those corresponding to the $d$ largest eigenvalues, where $d$ is at most the rank of the matrix.

For a testing sample $x$, its image in the lower-dimensional space is given by $y = W^T x$.

##### 3.4. Discussion

Methods limited to the global structure or the local geometry of the data set are special cases of the GBMLWM algorithm. PCA regards the data set as a single domain and pushes all the samples away from the total mean of the data set. Thus, PCA is an unsupervised special case of the GBMLWM algorithm. Both MMC and LDA divide the samples other than $x_i$ into two domains: one is composed of the samples with the same class label as $x_i$, called within-class; the other contains the samples with a different class label from $x_i$, called between-class. They correspond, respectively, to the domains I $\cup$ III and II $\cup$ IV, as illustrated in Figure 1. The local methods, such as LPP and ANMM, differ from the above methods based on global structure. LPP and ANMM divide the whole data set into two domains according to the nearest neighborhood of $x_i$. LPP operates on I $\cup$ II as a whole, while ANMM operates on domains I and II separately, as depicted in Figure 1. These local methods do not use the global information of the data set and are local special cases of the algorithm proposed in this paper. The superiority of the GBMLWM algorithm is shown by the experiments in the following section.

Training cost is the amount of computation required to find the optimal projection vectors and the feature vectors of the training samples. We compare the training cost of the methods based on their computational complexities. Here, we suppose that each class has the same number of training samples. Regarding each column vector as one computational unit and ignoring the cost of the eigen-analysis, we estimate the approximate computational complexity of six algorithms, including both local and global methods. Table 1 gives this analysis. From Table 1, we can see that our method has the largest training cost. In practice, however, the neighborhood size and the number of classes are usually small enough that our algorithm does not require much more computation. The computational complexity of GBMLWM also reflects that our algorithm considers the global information as well as the local geometry, which lets it capture the intrinsic structure of the training set. The following experimental results support this point.

#### 4. Experiments

In this section, we carry out several experiments to show the effectiveness of the proposed GBMLWM method for data visualization, face representation, and face recognition. We compare the global methods PCA, LDA, and MMC and the local methods LPP, NPE, and ANMM with our proposed method on the following four databases: MNIST digit, Yale, ORL, and UMIST. In the PCA preprocessing step, we retain only as many dimensions as needed to keep the scatter matrix nonsingular. In the testing phase, the neighborhood size is determined by 5-fold cross validation in all experiments, and the nearest neighbor (NN) rule is used for classification. For the LPP and GBMLWM algorithms, the weight between two samples is computed with a Gaussian kernel, and the kernel parameter $t$ is selected as follows: we first compute the pairwise distances among all the training samples and then set $t$ to half the median of those pairwise distances.
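The half-median rule for the kernel parameter can be sketched as follows. The function name is hypothetical, and the example assumes unsquared Euclidean distances over distinct pairs of training samples, which the text does not state explicitly.

```python
import numpy as np

def half_median_kernel_parameter(X):
    """Half the median pairwise Euclidean distance (rows of X are samples)."""
    rows, cols = np.triu_indices(X.shape[0], k=1)  # each unordered pair once
    dists = np.linalg.norm(X[rows] - X[cols], axis=1)
    return 0.5 * np.median(dists)

# Three collinear points with pairwise distances 5, 10, and 5.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
t = half_median_kernel_parameter(X)
```

For the three points above the median pairwise distance is 5, so the sketch returns 2.5.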

##### 4.1. Data Visualization

In this subsection, we first use a publicly available handwritten digit database to illustrate data visualization. The MNIST database [20] has 10 digits, and each digit contains 39 samples, so the total number of samples is 390. Here, we select only 20 samples from each digit. The training set therefore contains 200 samples, and each image is represented lexicographically as a high-dimensional vector of length 320. Figure 2 shows all the samples of the ten digits. For visualization, we project the data set into 2-D space with all seven subspace learning methods. The experimental results are depicted in Figure 3. With the exception of LDA and GBMLWM, the samples from different digits overlap heavily. Compared with the GBMLWM algorithm, LDA collapses the samples from the same class into a single point. Although this is helpful for classification, it generalizes poorly, since it hides the variation within each class. The GBMLWM algorithm not only separates the digits but also shows the structure hidden within each digit. As the number of nearest neighbors of $x_i$ is reduced, the samples from the same class become more and more compact. This also verifies that LDA is a special case of GBMLWM.


##### 4.2. Yale Database

This experiment demonstrates the ability to capture the important information in the Yale face database [21], which we call face representation. The Yale face database contains 165 grayscale images of 15 individuals. There are 11 images per subject, one per facial expression or configuration: center-light, with/without glasses, happy, left/right light, normal, sad, sleepy, surprised, and wink. All images from the Yale database were cropped, and the cropped images were normalized to a common size in pixels, with 256 gray levels per pixel. Some samples from the Yale database are shown in Figure 4. Here, the training set is composed of all the samples from this database. The 10 most significant eigenfaces obtained from the Yale face database with the seven subspace learning methods are shown in Figure 5. Figure 5 suggests that our algorithm captures more basic information of the face than the other methods.

##### 4.3. UMIST Database

The UMIST database [22] contains 564 images of 20 individuals, each covering a range of poses from profile to frontal views. The subjects cover a range of race, sex, and appearance. We use a cropped version of the UMIST database that is publicly available at S. Roweis' Web page. All the cropped images were normalized to a common size in pixels, with 256 gray levels per pixel. Figure 6 shows some images of one individual. We randomly select three, four, five, and six images of each individual for training and use the rest for testing. We repeat these trials ten times and compute the average results. The maximal average recognition rates of the seven subspace learning methods are presented in Table 2, which lists the highest accuracy of the GBMLWM algorithm on each of the four training sets and corresponding testing sets. The improvements over the other methods are significant. Furthermore, the dimensions of the four GBMLWM subspaces corresponding to the maximal recognition rates are remarkably low: 15, 13, 11, and 18, respectively.
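The evaluation protocol described above (a random per-class training split, nearest neighbor classification, and averaging over ten trials) can be sketched as follows. All function names are hypothetical, the optional `fit` callable stands in for any of the subspace learning methods, and the synthetic data merely exercises the code; none of this reproduces the reported experiments.

```python
import numpy as np

def split_per_class(y, n_train, rng):
    """Pick n_train random samples of each class for training, the rest for testing."""
    train = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        train.extend(rng.permutation(idx)[:n_train].tolist())
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(y.size), train)
    return train, test

def nn_classify(X_train, y_train, X_test):
    """1-nearest-neighbour rule under the Euclidean distance."""
    d2 = np.sum((X_test[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    return y_train[np.argmin(d2, axis=1)]

def average_accuracy(X, y, n_train, n_trials=10, seed=0, fit=None):
    """Mean and standard deviation of NN accuracy over random splits.

    fit is an optional callable (X_train, y_train) -> W giving a projection
    matrix; when it is None the samples are classified in the original space.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        tr, te = split_per_class(y, n_train, rng)
        W = np.eye(X.shape[1]) if fit is None else fit(X[tr], y[tr])
        pred = nn_classify(X[tr] @ W, y[tr], X[te] @ W)
        scores.append(np.mean(pred == y[te]))
    return float(np.mean(scores)), float(np.std(scores))

# Three well-separated synthetic classes with ten samples each.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 10)
X = rng.normal(size=(30, 3)) + 20.0 * y[:, None]
mean_acc, std_acc = average_accuracy(X, y, n_train=3)
train_idx, test_idx = split_per_class(y, 3, np.random.default_rng(1))
```

Passing a function that returns a learned projection as `fit` evaluates that method under the same splits, which keeps the comparison between methods fair.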

##### 4.4. ORL Database

The ORL face database [23] contains 40 distinct subjects with ten different images each, for 400 images in all. For some subjects, the images were taken at different times, with varying lighting, facial expressions, and facial details. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position. All images from the ORL database were cropped, and the cropped images were normalized to a common size in pixels, with 256 gray levels per pixel. Some samples from this database are shown in Figure 7. In this experiment, four training sets are formed by taking three, four, five, and six samples from each subject, respectively, and the remaining samples form the corresponding testing sets. We repeat these trials ten times and compute the average results. The recognition rates versus the reduced dimensions are shown in Figure 8. The best average recognition rates of the seven subspace learning methods are presented in Table 3. The recognition rates of the GBMLWM algorithm outperform those of the other methods on all four training subsets; its highest accuracies are listed in Table 3. The standard deviations of GBMLWM corresponding to the best results are 0.03, 0.02, 0.02, and 0.02.


#### 5. Conclusions

In this paper, we have proposed a new linear projection method called GBMLWM. It is an efficient linear subspace learning method with both supervised and unsupervised character. Similar to PCA, LDA, and MMC, it considers the global character of the data set. At the same time, similar to LPP, NPE, and ANMM, it makes the best use of the local geometric structure of the data set. We have pointed out that several existing linear subspace learning methods are special cases of our GBMLWM algorithm. The experiments demonstrate that the proposed method is superior to existing methods such as LDA and LPP.

#### Acknowledgments

The authors would like to express their gratitude to the anonymous referees as well as the Editor and Associate Editor for their valuable comments, which led to substantial improvements of the paper. This work was supported by the National Basic Research Program of China (973 Program) (Grant No. 2007CB311002), the National Natural Science Foundation of China (Grant Nos. 60675013, 61075006), and the Research Fund for the Doctoral Program of Higher Education of China (No. 20100201120048).