Abstract

An image object recognition approach based on deep features and adaptive weighted joint sparse representation (D-AJSR) is proposed in this paper. D-AJSR is a data-lightweight classification framework that can classify and recognize objects well with only a few training samples. In D-AJSR, a convolutional neural network (CNN) is used to extract the deep features of the training samples and test samples, and the feature vectors are then reconstructed by adaptive weighting, in which the contribution weight of each feature vector is calculated. To address the high dimensionality of the deep features, the principal component analysis (PCA) method is used to reduce the dimensions. Then, based on the joint sparse model, the public features and private features of the images are extracted from the training sample feature set so as to construct the joint feature dictionary. Finally, a sparse representation-based classifier (SRC) is applied on the joint feature dictionary to recognize the objects. Experiments on face images and remote sensing images show that D-AJSR is superior to the traditional SRC method and several other advanced methods.

1. Introduction

Sparse representation has unique advantages in signal processing, image processing, computer vision, pattern recognition, and related fields. Image recognition based on sparse representation can be divided into two parts: sparse representation and classification. First, a dictionary is built to represent the test samples; then, the sparse representation coefficients and the classification dictionary are used to identify the objects. Wright et al. first proposed the sparse representation-based classifier (SRC) [1]. SRC uses the original training samples as a dictionary to represent a test sample linearly and then calculates the sparse representation coefficients of the test sample. It uses the sparse representation coefficients and the training samples to calculate the reconstruction residual of each class, so the test sample can be identified according to the minimum reconstruction residual.

Lai et al. proposed a tensor feature extraction method based on multilinear sparse principal component analysis (MSPCA). The key operation of MSPCA is to rewrite multilinear PCA (MPCA) into multilinear regression forms and relax it for sparse regression. Moreover, it inherits the sparsity from sparse principal component analysis (SPCA) and can iteratively learn a series of sparse projections, achieving good results in face recognition [2]. By introducing sparsity, or $\ell_1$-norm learning, Lai et al. also proposed a unified sparse learning framework, which further extends locally linear embedding-based methods to sparse cases. This method achieves good results in image recognition, especially in the case of small samples [3]. There is also a generalized robust regression (GRR) method for jointly sparse subspace learning. By incorporating an elastic factor on the loss function, GRR can enhance robustness to obtain more projections for feature selection or classification and shows better robustness in face recognition [4].

Convolutional neural network (CNN) is a machine learning model based on deep supervised learning. For image recognition, CNN can use the image data directly as input without manual preprocessing or additional feature extraction, and it has therefore achieved good recognition results. CNN is well suited to extracting image features, as it can capture a variety of characteristics including texture, shape, color, and image topology.

Jiajia et al. proposed a CNN-GRNN model for image classification and recognition. This model used a simple CNN for image feature extraction and then a general regression neural network (GRNN) for classification [5]. Lu and Linghua proposed a face recognition method based on discriminant dictionary learning, which used a Gabor filter to learn a new dictionary and classified the images with sparse representation [6]. Mahoor et al. proposed a facial action combination recognition framework based on sparse representation and used the average Gabor features of action combinations to establish an overcomplete dictionary, improving the recognition accuracy of various actions [7].

To some extent, the aforementioned studies have improved recognition performance, but they also have their own limitations. For example, using a CNN model alone for image recognition takes a lot of time for parameter tuning and requires a large number of training samples [8], and it is sometimes difficult to obtain a large number of experimental samples that meet the requirements. On the other hand, the traditional dictionary-based sparse representation methods introduced above mostly use handcrafted features, which cannot meet the requirement of a high recognition rate in many cases. In view of these situations, we extend the traditional dictionary and use deep features as its atoms, leading to the proposed D-AJSR approach.

D-AJSR is a data-lightweight classification framework with a relatively high recognition rate. At the same time, compared with deep learning methods, D-AJSR can classify and recognize objects well with only a few training samples.

2. Joint Sparsity Model

2.1. Classification Method Based on Sparse Representation

SRC is a classification and recognition framework for face images first proposed by Wright et al. [1], which has gradually been applied to other image classification and recognition tasks. If there are $n$ training samples $v_{i,j} \in \mathbb{R}^m$ (generally $m \ll n$) belonging to $k$ classes, then the entire training data set can be expressed as

    A = [A_1, A_2, \ldots, A_k] = [v_{1,1}, v_{1,2}, \ldots, v_{k,n_k}] \in \mathbb{R}^{m \times n},  (1)

where $A_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,n_i}]$, $v_{i,j}$ is the jth sample of the ith class, and $n_i$ is the sample number of the ith class.

Based on the theory of sparse representation (SR), a new test sample $y \in \mathbb{R}^m$ in class i can be linearly expressed by the training samples as follows [9]:

    y = x_{i,1} v_{i,1} + x_{i,2} v_{i,2} + \cdots + x_{i,n_i} v_{i,n_i},  (2)

where $x_{i,j} \in \mathbb{R}$, $j = 1, 2, \ldots, n_i$, is the sparse representation coefficient of y.

Without considering the noise, formula (2) can be written as

    y = A x_0,  (3)

where $x_0 = [0, \ldots, 0, x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}, 0, \ldots, 0]^T \in \mathbb{R}^n$.

In order to get the sparsest x, SRC needs to solve the following minimization problem:

    \hat{x}_1 = \arg\min_x \|x\|_1 \quad \text{subject to} \quad Ax = y,  (4)

where $x$ is the coefficient vector and $\|\cdot\|_1$ is the $\ell_1$ norm.

For each class i, we can construct a mapping function $\delta_i(\cdot)$ that keeps the nonzero elements of the coefficient vector corresponding to class i and sets the remaining elements to zero. The test sample is then reconstructed from the sparse coefficients as $\hat{y}_i = A \delta_i(\hat{x}_1)$, and y is assigned to the class $\hat{i}$ with the minimum residual:

    \hat{i} = \arg\min_i r_i(y) = \arg\min_i \|y - A \delta_i(\hat{x}_1)\|_2,  (5)

where $\|\cdot\|_2$ is the $\ell_2$ norm and $i = 1, 2, \ldots, k$.
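
To make the SRC procedure concrete, the following is a minimal sketch in Python/NumPy (the paper's experiments were run in Matlab, so this is only an illustrative re-implementation, not the authors' code). As an assumption, the $\ell_1$ problem of formula (4) is relaxed to its common Lasso form via scikit-learn; the function name src_classify and the parameter alpha are ours.

    import numpy as np
    from sklearn.linear_model import Lasso

    def src_classify(A, labels, y, alpha=0.01):
        """A: m x n matrix of training atoms (columns l2-normalized),
        labels: length-n array of class ids, y: length-m test sample."""
        # Lasso solves min ||y - Ax||_2^2 + alpha * ||x||_1, a standard
        # relaxation of the l1-minimization in formula (4).
        solver = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        solver.fit(A, y)
        x_hat = solver.coef_
        best_class, best_residual = None, np.inf
        for c in np.unique(labels):
            # delta_c keeps only the coefficients of class c, as in formula (5)
            delta_c = np.where(labels == c, x_hat, 0.0)
            r = np.linalg.norm(y - A @ delta_c)
            if r < best_residual:
                best_class, best_residual = c, r
        return best_class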

2.2. Joint Sparsity Model

The joint sparse model (JSM) was first proposed to encode multiple related signals effectively [10]. In JSM, due to the correlation between the signals, the related signals can be grouped into one set, and each signal can be represented as a combination of public features and private features on a specific sparse basis. The public feature is the part shared by all the signals in one set, and the private feature is the part unique to each signal. So, the jth signal $x_j$ in a set of J signals can be represented by the public features of the set and its own private features:

    x_j = z_C + z_j, \quad j = 1, 2, \ldots, J,  (6)

where $z_C$ represents the public features and $z_j$ represents the private features of the jth signal.

Assume that all images can be divided into K classes with J training images in each class, and let the jth image in the ith class be expressed as $x_{i,j}$. If each image is represented as a one-dimensional column vector $x_{i,j} \in \mathbb{R}^N$, the images in the ith class can be represented as $X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,J}]$. According to JSM, the jth image in the ith class can be represented as

    x_{i,j} = z_{i,C} + z_{i,j},  (7)

where $z_{i,C}$ represents the public features of the images in the ith class and $z_{i,j}$ represents its own private features [11]. If $\Psi$ is the orthonormal basis that can sparsely represent the training images, then formula (7) can be written as

    x_{i,j} = \Psi \theta_{i,C} + \Psi \theta_{i,j},  (8)

where $\theta_{i,C}$ and $\theta_{i,j}$ represent the sparse representations of the public part and the private part on the basis $\Psi$, respectively. If both sides of formula (8) are left multiplied by $\Psi^T$, then formula (8) becomes $\Psi^T x_{i,j} = \theta_{i,C} + \theta_{i,j}$. Combined with formula (7), $z_{i,C} = \Psi \theta_{i,C}$ and $z_{i,j} = \Psi \theta_{i,j}$, so the joint representation of all the images in the ith class can be expressed as

    \bar{X}_i = \begin{bmatrix} x_{i,1} \\ x_{i,2} \\ \vdots \\ x_{i,J} \end{bmatrix}
              = \begin{bmatrix} \Psi & \Psi & 0 & \cdots & 0 \\ \Psi & 0 & \Psi & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \Psi & 0 & 0 & \cdots & \Psi \end{bmatrix}
                \begin{bmatrix} \theta_{i,C} \\ \theta_{i,1} \\ \vdots \\ \theta_{i,J} \end{bmatrix}.  (9)

Formula (9) can be simplified as

    \bar{X}_i = \tilde{\Psi} \Theta_i,  (10)

where $\bar{X}_i \in \mathbb{R}^{JN}$ and $\Theta_i = [\theta_{i,C}^T, \theta_{i,1}^T, \ldots, \theta_{i,J}^T]^T$. $\tilde{\Psi}$ is the overcomplete dictionary and consists of two parts: the public part (its first block column) and the private part (the remaining block columns). $\Theta_i$ preserves the discriminant information, and its sparse representation can be obtained by solving the minimization of the following formula:

    \hat{\Theta}_i = \arg\min_{\Theta_i} \|\Theta_i\|_1 \quad \text{subject to} \quad \bar{X}_i = \tilde{\Psi} \Theta_i.  (11)

After obtaining $\hat{\Theta}_i$, the public features of all images in class i and the private features of each image can be recovered in the spatial domain according to the inverse transformation:

    z_{i,C} = \Psi \theta_{i,C}, \quad z_{i,j} = \Psi \theta_{i,j}, \quad j = 1, 2, \ldots, J.  (12)

All public features and all private features form a joint feature dictionary D:

    D = [Q, H] = [z_{1,C}, \ldots, z_{K,C}, z_{1,1}, \ldots, z_{1,J}, \ldots, z_{K,1}, \ldots, z_{K,J}],  (13)

where Q collects the public features of all classes and H collects the private features of all images.

So, for a test sample y, we can use the following formula (14) to identify which category the object belongs to:

    \hat{i} = \arg\min_i \|y - D \delta_i(\hat{x})\|_2,  (14)

where $\hat{x}$ is the sparse representation coefficient of y on D and $\delta_i(\cdot)$ keeps the coefficients associated with class i.

As can be seen from the above, the joint sparse model represents each class of training images with only a public part and a set of private parts, which effectively reduces the required storage space.
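
As an illustration of how the two parts can be assembled, the sketch below builds a joint feature dictionary D = [Q, H] in the spirit of formula (13). Note the simplifying assumption: instead of solving the $\ell_1$ problem of formula (11), the public feature of each class is approximated by the class mean and the private features by the residuals; build_joint_dictionary is our own name.

    import numpy as np

    def build_joint_dictionary(X_by_class):
        """X_by_class: list of m x J arrays, one per class (columns are images)."""
        Q, H = [], []
        for X_i in X_by_class:
            z_c = X_i.mean(axis=1, keepdims=True)  # approximate public feature z_{i,C}
            Q.append(z_c)
            H.append(X_i - z_c)                    # approximate private features z_{i,j}
        return np.hstack(Q + H)                    # D = [Q, H], cf. formula (13)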

3. Image Object Recognition Based on D-AJSR

The algorithm framework of D-AJSR is shown in Figure 1. Unlike JSM, which sums public and private features directly, D-AJSR combines the public features and private features into a joint dictionary. Based on this dictionary, sparse representation is used to find the sparse solution of the test samples on the adaptive joint dictionary.

3.1. Deep Feature Extraction

CNN can automatically extract complex global and local features from images [8]. Therefore, D-AJSR introduces the deep features extracted by CNN into the sparse representation framework to enhance its recognition ability.

In this paper, VGG19 is adopted for feature extraction. In the ILSVRC-2014 image classification competition, VGG took second place with a 7.3% top-5 error rate and won the localization task [12]. VGG uses small 3 × 3 convolution kernels throughout the network and builds depth by stacking these small kernels. The network structure of VGG19 is shown in Figure 2.

Figure 3 shows examples of the extracted features: the left image is the original image, the upper row shows features extracted from the first layer, and the lower row shows features extracted from the second layer. Comparing the extracted features of each layer, most texture and detail features are extracted by the shallow layers, while contour and shape features are extracted by the deeper layers. Relatively speaking, the deeper the layer, the more representative the extracted features, but the lower the resolution of the feature maps.
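
The snippet below sketches how such shallow feature maps can be pulled from a pretrained VGG19. The paper extracted features in Matlab; here we assume PyTorch/torchvision instead, the file name sample.jpg is a placeholder, and the slice points (after the first conv + ReLU for 64 maps, after the second block's first conv + ReLU for 128 maps) are our reading of the paper's "first layer" and "second layer".

    import torch
    from torchvision import models, transforms
    from PIL import Image

    vgg = models.vgg19(pretrained=True).eval()
    layer1 = vgg.features[:2]   # conv1_1 + ReLU -> 64 feature maps
    layer2 = vgg.features[:7]   # up to conv2_1 + ReLU -> 128 feature maps

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    img = preprocess(Image.open("sample.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        f64 = layer1(img)    # shallow maps: textures and details
        f128 = layer2(img)   # deeper maps: contours and shapes
    # Flatten a subset of maps into one column vector to serve as a dictionary
    # atom, e.g., 8 maps per image as in D-AJSR(8).
    atom = f128[0, :8].reshape(-1)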

3.2. Adaptive Weighted Reconstruction

When constructing the joint dictionary, different samples contain different amounts of object information, and the samples with larger variance contain more object information. Therefore, we increase the weights of the samples with more object information in the dictionary and reduce the weights of the samples with less object information, so as to improve the discrimination ability of the feature dictionary [13].

After the feature vector of each image has been extracted by the CNN, it is transformed as follows:

    \tilde{f}_i = \frac{f_i - \bar{f}}{\|f_i - \bar{f}\|_2},  (15)

where $f_i$ represents the extracted features of the ith image, $\tilde{f}_i$ represents the weighted image features, and $\bar{f}$ represents the average of the features.

Formula (15) adaptively carries out weighted reconstruction and normalization of the feature vector elements, which increases the standard deviation or variance of the feature vectors to a certain extent, helps the deep feature dictionary contain more discriminative information, and improves the recognition efficiency.
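
A minimal sketch of this step, under our reading of formula (15) above (the exact weighting of [13] may differ): each column is centered on the mean feature and normalized, so the components that deviate most from the average are emphasized. The function name adaptive_weight is ours.

    import numpy as np

    def adaptive_weight(F, eps=1e-12):
        """F: m x n matrix whose columns f_i are the CNN features of n images."""
        f_bar = F.mean(axis=1, keepdims=True)     # average feature vector
        centered = F - f_bar                      # f_i - f_bar
        norms = np.linalg.norm(centered, axis=0, keepdims=True)
        return centered / np.maximum(norms, eps)  # formula (15), per column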

3.3. Main Steps of D-AJSR

The main steps of D-AJSR are as follows (a sketch of the overall pipeline is given after the list):

(1) The VGG19 network is used to extract the deep features of the training and test images.

(2) The feature vectors are adaptively weighted and reconstructed to improve the discrimination ability of the feature dictionary, and the principal component analysis (PCA) method is used to reduce the dimensionality of the reconstructed dictionary.

(3) The public features of each class in the feature dictionary and the private features of each image are extracted. All the public features form a matrix Q, and all the private features form a matrix H, giving the final joint feature dictionary $D = [Q, H]$.

(4) The sparse representation of each test sample on the joint feature dictionary is computed to obtain the sparse coefficients, and the feature vector y of the test sample is reconstructed and identified using formula (14).
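
The sketch below strings these steps together, reusing the illustrative helpers from the previous sections (src_classify, build_joint_dictionary, adaptive_weight) together with scikit-learn's PCA. Including the test sample in the weighting and PCA fit is a simplification of ours, not a statement of the paper's exact protocol.

    import numpy as np
    from sklearn.decomposition import PCA

    def d_ajsr_predict(train_feats, train_labels, test_feat, n_components=75):
        """train_feats: m x n deep features (one column per training image),
        train_labels: length-n array, test_feat: length-m deep feature vector."""
        # Step (2): adaptive weighting, then PCA dimensionality reduction.
        F = adaptive_weight(np.hstack([train_feats, test_feat[:, None]]))
        Z = PCA(n_components=n_components).fit_transform(F.T).T
        train_z, test_z = Z[:, :-1], Z[:, -1]
        # Step (3): joint feature dictionary D = [Q, H].
        classes = np.unique(train_labels)
        X_by_class = [train_z[:, train_labels == c] for c in classes]
        D = build_joint_dictionary(X_by_class)
        # One public atom per class, followed by all private atoms.
        d_labels = np.concatenate(
            [classes] + [np.full((train_labels == c).sum(), c) for c in classes])
        # Step (4): sparse coding on D and minimum-residual classification.
        return src_classify(D, d_labels, test_z)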

4. Experiments and Analysis

In order to verify the validity of D-AJSR, we conduct experiments on face images and remote sensing images, respectively. The computer used in the experiments is configured with an Intel Core i5-3210M @2.5 GHz and 4 GB of memory, and the experimental platform is Matlab R2017a. In deep feature extraction, we obtain 64 global deep feature maps in the first layer and 128 deep feature maps in the second layer. All the experimental results in this section are the averages of 10 runs, and D-AJSR(8) indicates that 8 deep feature maps are used.

4.1. Face Image Recognition

In this part, experiments are performed on extended YaleB [14] and AR [15] data sets, respectively. SRC [1], CRC [16], RRC [17], low-rank matrix recovery with structural incoherence (LR) [18], extended SRC (ESRC) [19], discriminative low-rank representation method (DLRR) [20], and sparse dictionary decomposition method (SDD) [21] approaches are compared with D-AJSR in the following experiments.

4.1.1. Experiments on Extended YaleB Data Set

The extended YaleB data set contains 2,414 frontal images of size 168 × 192 of 38 people under different lighting conditions, part of which are shown in Figure 4. In the experiments, 16 images of each person are randomly selected for training, and the rest are used for testing. The feature dimensions after PCA are 25, 50, 75, 100, and 150, respectively. The initial dimension of 8 deep feature maps is 42 × 48 × 8 = 16128. The experimental results are shown in Table 1.
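
For reference, a sketch of the random split used here (16 training images per person, the rest for testing); labels is an assumed length-N array of person ids, and the seed is arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    train_idx = np.concatenate(
        [rng.choice(np.where(labels == p)[0], 16, replace=False)
         for p in np.unique(labels)])                  # 16 images per person
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)  # the rest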

In the experiments, we chose the features obtained from the second layer. In Table 1, bold numbers in each column indicate the highest recognition rate under the same condition. As can be seen from Table 1, D-AJSR maintains a high accuracy in all dimensions, and its performance in 50 dimensions is better than that of other methods in 75, 100, and 150 dimensions. Therefore, D-AJSR can greatly reduce the feature dimension under the same precision requirement.

4.1.2. Experiments on AR Data Set

The AR data set contains more than 4,000 frontal images of 126 people, each of size 120 × 165. In the experiments, we use a subset of 2,600 facial images of 100 people, including 50 men and 50 women. Each person has 26 images, divided into two separate parts. Each part has 13 pictures: 7 with expression or illumination changes and no occlusion, 3 with sunglasses, and 3 with scarf occlusion, as shown in Figure 5 (the images are selected randomly). One part is used for training and the other for testing. The feature dimensions of the face images are also 25, 50, 75, 100, and 150, respectively. The initial dimension of 8 deep feature maps is 30 × 41 × 8 = 9840.

In the experiments, we chose the features obtained from the second layer. In Table 2, bold numbers in each column indicate the highest recognition rate under the same condition. Because the data set contains sunglasses-wearing and scarf-occluded samples and the number of such samples is small, dictionary training is affected, and the recognition rate of our method is lower than that on the YaleB data set. As can be seen from Table 2, although D-AJSR does not achieve the best results in 25 and 50 dimensions, it still remains at a medium level; this is mainly because the number of principal components is relatively small and the cumulative variance contribution rate is low (less than 0.6). As the principal components become more representative, from dimension 75 onwards, the recognition rate of D-AJSR is better than that of the other methods.

In addition, we also compared the experiment results with the locality-constrained and label embedding dictionary learning algorithm (LCLE-DL) [22]. The average recognition rate of LCLE-DL is about 80%, while the average recognition rate of our method is 86.60%. In terms of recognition accuracy, the result of D-AJSR(8) method is relatively better.

4.2. Remote Sensing Image Recognition Experiments

In this part, remote sensing aircraft images are selected from Google Earth 7.1.8 to build data sets for the experiments. The remote sensing images of Google Earth are composed of satellite images and aerial images: the satellite images come from the QuickBird and Landsat-7 satellites, and the aerial images come from BlueSky, Sanborn, and other companies. Images taken at different times and locations are downloaded as data sets. Figure 6 shows examples of the remote sensing images.

The size of each remote sensing image is 170 × 170, and the initial dimension of the 64 deep feature maps obtained from the first layer is 85 × 85 × 64 = 462400 (Figure 7). After PCA, the feature dimensions of the aircraft images are 25, 50, 75, and 100, respectively. In the experiments, the SRC method [1] and the adaptive weighted joint sparse representation classification method (AJRC) [13] are compared with D-AJSR, and we choose the features obtained from the first layer. The experimental results are shown in Table 3, and the bold numbers in each column indicate the highest recognition rate under the same condition.

Due to the small number of samples in the data set and the strong interference caused by aircraft shadows and tire marks on the ground, the recognition rates of all 3 methods do not reach the levels achieved in the aforementioned experiments. At the same time, because the atoms of the same object cover only 8 orientations, the recognition of the object is also affected. But on the same data set, the effect of D-AJSR is still better than that of the other methods.

4.3. Comprehensive Analysis of Experiments
4.3.1. Cumulative Percent of the Principal Components

In the experiments, when using PCA to reduce the dimension of the features, the cumulative variance contribution rates of features of different lengths are shown in Table 4. In Table 4, the left column is the number of feature maps selected from the second layer of VGG19, and the results are obtained on the YaleB data set.

As can be seen from Table 4, due to different image sizes and different numbers of deep feature maps, features of the same length do not have the same cumulative variance contribution rate. Conversely, if the same cumulative variance contribution rate is required, the feature length will differ. In order to keep the size of the dictionary consistent, we use the same number of principal components in the experiments.
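
The cumulative variance contribution rate reported in Table 4 can be computed as follows; this is a sketch assuming scikit-learn's PCA, with the feature matrix arranged one image per row, and the function name is ours.

    import numpy as np
    from sklearn.decomposition import PCA

    def cumulative_contribution(F, dims=(25, 50, 75, 100, 150)):
        """F: n x m matrix, one weighted feature vector per row (n >= max(dims))."""
        pca = PCA(n_components=max(dims)).fit(F)
        cum = np.cumsum(pca.explained_variance_ratio_)
        return {d: cum[d - 1] for d in dims}  # rate at each feature length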

4.3.2. Effect and Efficiency of Different Feature Map Numbers

The recognition rate of D-AJSR varies with the number of deep feature maps. For object recognition on the YaleB data set, we use the deep features obtained from the second layer of the VGG19 network, 128 feature maps in total, which contain different information about the objects. We choose different numbers of feature maps in the experiments, and the recognition results are shown in Table 5. The first row of the table is the number of principal components selected by PCA, and the left column is the number of feature maps selected after deep feature extraction.

The bold numbers in Table 5 are the highest recognition rates in each column. As can be seen from Table 5, when the number of feature maps is small, the recognition rate increases as feature maps are added. When the number of feature maps increases from 1 to 4, the recognition rate increases by 4.89% on average, with the largest gain of 8.69% when 25 principal components are selected. However, as the number of feature maps keeps growing, the improvement becomes insignificant or even reverses; for example, when the number of deep feature maps increases from 64 to 128, all recognition rates decrease to varying degrees. When the number of deep feature maps is too large, the atom column vectors become too long and their energy is difficult to concentrate, so the distinguishing ability of the final feature vectors decreases. Therefore, in practical applications, we need to select an appropriate number of deep feature maps.

In addition to the recognition rate of D-AJSR, we also measure its time efficiency. The experiments are carried out on the remote sensing image data set and compared with SRC and AJRC. The training efficiency results of the different approaches on remote sensing images are shown in Table 6, the test efficiency results are shown in Table 7, and the unit of time is seconds (s). In the experiments, there are 150 training samples and 225 test samples, and the sample size is 170 × 170. The experiments adopt the feature maps of the first layer, and D-AJSR(64) indicates that 64 deep feature maps are used.

From Tables 6 and 7, it can be seen that the object recognition time of D-AJSR is longer than that of SRC; however, when the number of feature maps is less than 32, its recognition time is shorter than that of AJRC. The feature extraction of 226 deep feature maps in D-AJSR takes about 30 seconds. Considering the results of Tables 5–7 together, D-AJSR is more advantageous than the other two methods.

From all the aforementioned experiments, we can see that D-AJSR achieves satisfactory recognition results when the data set is small. Generally speaking, in remote sensing object recognition, VGG and other neural network methods often need thousands of images per class as training sets, while D-AJSR only needs a few images per class as atoms to output recognition results. In many cases, it is difficult to obtain a large amount of training data for specific tasks, such as the recognition of sensitive objects in special circumstances or the identification of unusual objects. In such cases, D-AJSR can exploit its advantages and provide timely recognition results.

5. Conclusions

Aiming at the application requirements of object recognition, we introduce deep features into adaptive weighted joint sparse representation and propose D-AJSR, a data-lightweight classification framework. In order to improve the object recognition rate, the method also adaptively adjusts the atom weights. Experimental results show that the method has a relatively high recognition rate. On the other hand, since deep feature extraction is more complex than simple transform-based feature extraction, the time consumption of the method increases correspondingly.

When the number of samples is too small, methods such as deep learning cannot provide reliable identification results due to inadequate training. However, D-AJSR can provide recognition results with only a dozen or even a few samples, which offers an effective solution for object recognition without sufficient samples. In addition, after angular rotation expansion of the training samples, D-AJSR also has a certain ability to recognize rotated objects. In D-AJSR, feature extraction takes some time; therefore, under the sparse recognition framework, how to select features that are more expressive and can be extracted quickly deserves attention in future work.

Data Availability

The YaleB and AR data sets are public data sets, which can be found in references [14, 15].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Defense Pre-Research Foundation of China (7301506); the National Natural Science Foundation of China (61070040); the Education Department of Hunan Province (17C0043); and the Hunan Provincial Natural Science Fund (2019JJ80105) (Research on CAD Technology of Children Colonoscopy Based on Artificial Intelligence Technology).