Abstract

Nonnegative orthogonal matching pursuit (NOMP) has been shown to be a more stable encoder for unsupervised sparse representation learning. However, previous research has shown that NOMP is suboptimal in terms of computational cost, as coefficient selection and refinement with nonnegative least squares (NNLS) are carried out as two separate steps. This problem severely reduces encoding efficiency for large-scale image patches. In this work, we study fast nonnegative OMP (FNOMP) as an efficient encoder, accelerated by QR factorization and iterative updating of the coefficients, in deep networks for the full-size image categorization task. We analyze and demonstrate that, using relatively simple gain-shape vector quantization for dictionary training, FNOMP not only encodes more efficiently than NOMP but also significantly improves classification accuracy compared to OMP-based algorithms. In addition, the FNOMP-based algorithm is superior to other state-of-the-art methods on several publicly available benchmarks, namely, Oxford Flowers, UIUC-Sports, and Caltech101.

1. Introduction

In computer vision, image representation is a core topic for image understanding and processing. Over the past decade, sparsity has been adopted as one of the priors for a good encoder, making the corresponding representations more useful when building classifiers [1]. In particular, it is well suited to categorization tasks, since sparse representations are more likely to be separable in high-dimensional spaces.

It is well known that classical sparse coding with $\ell_1$-norm regularization achieves impressive performance in face recognition, text classification, and robotic perception tasks [2-4], whereas orthogonal matching pursuit (OMP), the canonical greedy algorithm for sparse approximation, can often replace the relaxed algorithm owing to its high efficiency in large-scale problems. While OMP as an encoder is simple and fast for many tasks, in practice it is not optimal in terms of stability. In other words, such a greedy algorithm can amplify small variations in the data, giving rise to large deviations in the resulting representations [5].

With the development of research on nonnegativity constraints in numerical analysis, nonnegative least squares (NNLS) and nonnegative matrix factorization (NMF), two frequently used tools, have been applied in image processing and computer vision, where experiments show that enforcing a nonnegativity constraint can produce a much more accurate approximate solution [6]. Therefore, nonnegativity constraints can be employed to ameliorate the aforementioned instability of OMP. Furthermore, nonnegative sparse coding has been shown to be useful in visual neuroscience for modeling the human visual system on natural images [7]. More importantly, nonnegative sparse coding has also appeared in various other applications, such as motion extraction, text classification, and human action recognition [8-11].

On the other hand, current research on sparse representation learning falls into two groups: methods that depend on manually designed descriptors such as SIFT [12, 13], and methods that build representations from the pixel level through hierarchical structures [14, 15]. The latter is referred to as layerwise unsupervised training, which advocates building models from scratch instead of depending strongly on hand-crafted descriptors. A considerable amount of work is dedicated to learning such deep architectures. Specifically, deep belief nets and convolutional deep belief networks make use of stacked Restricted Boltzmann Machines (RBMs) to learn high-level image features from low-level ones for recognition [16, 17]. Deconvolutional networks, which concentrate on high-quality latent representations, take advantage of a decoder-only model as opposed to the symmetric encoder-decoder of the RBM [18]. Deep autoencoders investigate the feasibility of building high-level features from only unlabeled data and obtain neurons that function as detectors for faces, human bodies, and cat faces [19]. Deep convolutional neural networks achieve record-breaking results on the highly challenging ImageNet dataset using purely supervised learning [20]. Remarkably, a popular architecture based on multilayer matching pursuit encoders has achieved great success over the last few years [21, 22].

Intuitively, an unsupervised hierarchical training scheme combined with nonnegative sparse coding should be taken into account. According to the point of view proposed in [28], it is desirable to obtain good image representations on top of nonnegative sparsity. The 4-layer model in [28] is trained on a 24-core CPU and an Nvidia Tesla M2075 GPU for fast computation; trained layer by layer with the ISTA algorithm, it shows only slightly better performance on object classification despite this heavy computational configuration. In addition, the nonnegative OMP (NOMP) put forward by Lin and Kung [5] can be regarded as a more stable encoder in a hierarchical architecture. However, NOMP is applied only to small-size images in the first layer of the model, and several complicated preprocessing steps, as well as the sign-splitting technique, are needed for layer 1. In spite of delivering accuracy competitive with some of the best known encoders, NOMP is actually not very efficient, because the selection and NNLS steps are separated, as verified on synthetic data in [30].

For this reason, we study and analyze an efficient orthogonal matching pursuit with nonnegativity constraints, called fast nonnegative OMP (FNOMP), in deep networks for full-size image categorization and demonstrate the benefits of this encoder. In this paper, we first compare the computational efficiency of the FNOMP encoder with that of the NOMP encoder under different experimental conditions. Next, we evaluate the classification accuracy of the FNOMP-based algorithm on three object and event datasets in comparison with OMP-based deep learning models and other state-of-the-art approaches.

The main contribution of this paper is threefold. First, we validate that the computational time of the novel FNOMP is significantly shorter than that of NOMP for encoding with dictionaries of different sizes and various sparsity levels. Second, we show that the FNOMP-based algorithm obtains meaningful image representations and is therefore appropriate for full-size image classification in deep networks. Moreover, the traditional preprocessing steps of mean subtraction, whitening, and sign-splitting are not applied in our method, which simplifies the whole pipeline. Finally, we find that image size has a great influence on classification accuracy.

The remainder of this paper falls into four sections. In Section 2, the definition of the hierarchical framework for categorization is given. In Section 3, the dictionary training and efficient OMP with nonnegativity constraints are presented. Then, in Section 4, details of our experimental results and analysis on several datasets are elaborated. Finally, in Section 5, the conclusion is drawn.

2. Hierarchical Learning in Deep Networks

Recently, it has become desirable to develop fully automatic approaches that can replace hand-designed descriptors. Meanwhile, a typical line of work in machine learning focuses on learning good representations from unlabeled input data for higher-level tasks such as image categorization. More specifically, hierarchical structures learn multilayer features by greedily training several layers, one layer at a time. For example, a 2-layer deep model which computes sparse codes with fast nonnegative OMP in each layer can be trained as shown in Figure 1.

As can be seen from Figure 1, densely sampled image patches are encoded with FNOMP into sparse codes in the first layer, which then serve as input to the second layer. Higher, image-level representations are obtained by repeating similar steps on the output of the first layer.

In practice, as discussed in [15, 21], the deep network implementations are generally composed of four steps.

Given an input image with p channels, the pipeline can be illustrated in Figure 2 (a minimal code sketch follows this list).
(i) A square pixel receptive field with a step of one pixel between patches is used for the first layer of features. After training the dictionary of filters for the first layer, the image takes on a three-dimensional (spatial-by-spatial-by-filters) representation based on the fast nonnegative OMP pattern.
(ii) A max pooling strategy is employed over adjacent spatial blocks, producing a smaller pooled representation.
(iii) A second square receptive field with a step of one pixel over the whole pooled maps yields the second layer. Akin to the dictionary training stage in the first step, the image finally obtains a higher-level representation by means of efficient OMP with nonnegativity constraints.
(iv) Pyramid max pooling and contrast normalization are then applied to form the final pooled representation.
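To make the four steps concrete, the following is a minimal sketch of the two-layer feature extraction pipeline, assuming a generic sparse encoder (for example FNOMP) passed in as a function; the receptive field sizes, pooling block size, sparsity level, and function names are illustrative defaults, not the exact settings used in the experiments.

```python
# Minimal two-layer pipeline sketch: dense patches -> sparse codes -> max pooling,
# repeated for a second layer. `encode(patches, dictionary, k)` is assumed.
import numpy as np

def extract_patches(maps, rf):
    """Densely extract rf-by-rf patches (stride 1) from an H x W x C array."""
    H, W, C = maps.shape
    out = []
    for i in range(H - rf + 1):
        for j in range(W - rf + 1):
            out.append(maps[i:i + rf, j:j + rf, :].ravel())
    return np.array(out), (H - rf + 1, W - rf + 1)

def max_pool(maps, block):
    """Max pooling over adjacent block-by-block spatial regions."""
    H, W, C = maps.shape
    Hp, Wp = H // block, W // block
    pooled = np.zeros((Hp, Wp, C))
    for i in range(Hp):
        for j in range(Wp):
            region = maps[i * block:(i + 1) * block, j * block:(j + 1) * block, :]
            pooled[i, j, :] = region.reshape(-1, C).max(axis=0)
    return pooled

def two_layer_features(image, D1, D2, encode, rf1=6, rf2=3, pool1=4, k=5):
    # Layer 1: dense patches -> sparse codes -> feature maps.
    patches, (h1, w1) = extract_patches(image, rf1)
    codes1 = encode(patches, D1, k)                     # (h1*w1, n_atoms1)
    maps1 = codes1.reshape(h1, w1, -1)
    pooled1 = max_pool(maps1, pool1)                    # spatial max pooling
    # Layer 2: repeat the same steps on the pooled layer-1 maps.
    patches2, (h2, w2) = extract_patches(pooled1, rf2)
    codes2 = encode(patches2, D2, k)
    maps2 = codes2.reshape(h2, w2, -1)
    return maps2                                        # before pyramid pooling
```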

3. Sparse Coding with Efficient Nonnegative OMP

3.1. Dictionary Training

The gain-shape vector quantization scheme for dictionary training in deep networks is used throughout this work. Let $Y = [y_1, \dots, y_m] \in \mathbb{R}^{n \times m}$ be a set of $n$-dimensional input signals. Specifically, the dictionary is trained in an alternating manner as follows:
$$\min_{D, X} \|Y - DX\|_F^2 \quad \text{s.t. } \|d_j\|_2 = 1 \ \forall j, \ \ \|x_i\|_0 \le q \ \forall i,$$
where $d_j$ indicates each column of the dictionary $D$, the constraint $\|d_j\|_2 = 1$ keeps each dictionary element normalized, $\|x_i\|_0$ is the number of nonzero elements in the code $x_i$, and $q$ is a sparsity constraint factor. For instance, OMP-1 ($q = 1$) is used as a form of gain-shape vector quantization: it begins with $x_i = 0$ and greedily selects one element of $x_i$ to be nonzero so as to minimize the residual reconstruction error at each iteration.
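As an illustration, here is a minimal sketch of this alternating training scheme with OMP-1 coding, assuming unit-norm atoms initialized from random training patches; the function name and iteration count are illustrative, not the authors' exact procedure.

```python
# Dictionary training with OMP-1 (gain-shape vector quantization): each signal
# is assigned one nonzero coefficient on its most correlated atom, then atoms
# are updated and renormalized.
import numpy as np

def train_dictionary_omp1(Y, n_atoms, n_iters=10, seed=0):
    """Y: (n_dims, n_samples) matrix of input signals (patches)."""
    rng = np.random.default_rng(seed)
    n_dims, n_samples = Y.shape
    # Initialize atoms from random samples and normalize columns.
    D = Y[:, rng.choice(n_samples, n_atoms, replace=False)].copy()
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iters):
        # Coding step (OMP-1): one nonzero coefficient per signal.
        corr = D.T @ Y                        # (n_atoms, n_samples)
        idx = np.abs(corr).argmax(axis=0)     # best atom per signal
        X = np.zeros((n_atoms, n_samples))
        X[idx, np.arange(n_samples)] = corr[idx, np.arange(n_samples)]
        # Dictionary update step, followed by renormalization.
        D = Y @ X.T
        norms = np.linalg.norm(D, axis=0, keepdims=True)
        D /= np.where(norms > 1e-12, norms, 1.0)
    return D
```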

3.2. Efficient OMP with Nonnegativity Constraints

The standard nonnegative OMP (NOMP) can be applied to find an approximate solution to the following problem:
$$\min_{x} \|y - Dx\|_2^2 \quad \text{s.t. } \|x\|_0 \le K, \ x \ge 0,$$
where NOMP computes codes with at most $K$ nonzero elements, all of which are nonnegative. Generally, the NOMP pipeline can be summarized as follows (a minimal code sketch follows this list).
(i) First, the residual vector is initialized as $r_0 = y$ and the iteration number is set to 1. To obtain the highest positive correlation with the residual, the algorithm chooses the atom $\lambda_k = \arg\max_j d_j^T r_{k-1}$; when $\max_j d_j^T r_{k-1} \le 0$, the iteration terminates.
(ii) Second, nonnegative least squares (NNLS) serves as the tool to approximate the coefficients of the selected atoms: $x_\Lambda = \arg\min_{z \ge 0} \|y - D_\Lambda z\|_2^2$.
(iii) Finally, the new residual $r_k = y - D_\Lambda x_\Lambda$ is computed and the corresponding iteration number is incremented by 1.
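A minimal sketch of this two-stage procedure, assuming unit-norm dictionary columns and using SciPy's NNLS solver for the refit, could look as follows; it is written to mirror the selection/NNLS separation described above rather than to be an optimized implementation.

```python
# Standard NOMP sketch: greedy atom selection followed by a separate NNLS refit
# of all selected coefficients at every iteration.
import numpy as np
from scipy.optimize import nnls

def nomp(D, y, K):
    """Return a nonnegative code with at most K nonzeros approximating y."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(K):
        corr = D.T @ residual
        j = int(np.argmax(corr))
        if corr[j] <= 0 or j in support:   # no new positively correlated atom
            break
        support.append(j)
        # Refit all selected coefficients with nonnegative least squares.
        coeffs, _ = nnls(D[:, support], y)
        residual = y - D[:, support] @ coeffs
    if support:
        x[support] = coeffs
    return x
```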

However, the selection process and the NNLS refit are divided into two relatively independent stages. Accordingly, we should consider a more efficient algorithm which merges these two steps. Inspired by the comparative analysis of OMP implementations based on matrix decomposition [31], we choose QR factorization, which provides the largest reduction in computational complexity as the problem size increases, while accumulating little numerical error in the inner products or the solution. Therefore, we can address the issue in a computationally efficient decomposition fashion. In fact, OMP attempts to find the orthogonal projection at each iteration as follows:
$$x_\Lambda = D_\Lambda^{+} y, \quad (4)$$
where $D_\Lambda$ and $x_\Lambda$ are the subdictionary and coefficient vector, respectively, restricted to the support $\Lambda$, and $D_\Lambda^{+} = (D_\Lambda^T D_\Lambda)^{-1} D_\Lambda^T$ denotes the Moore-Penrose pseudoinverse of $D_\Lambda$.
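For reference, this is the "expensive" form of the projection step that a naive OMP iteration would perform; the sketch assumes `D_sub` is the subdictionary restricted to the current support, and the QR-based alternative is sketched further below.

```python
# Orthogonal projection of y onto the span of the selected atoms via the
# pseudoinverse; recomputing this from scratch at every iteration is what the
# QR-based update avoids.
import numpy as np

def project_onto_support(D_sub, y):
    x = np.linalg.pinv(D_sub) @ y      # least-squares coefficients on the support
    residual = y - D_sub @ x
    return x, residual
```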

Let $r_k$ be the $k$th signal residual. At iteration $k$, directly inverting the $k \times k$ matrix $D_\Lambda^T D_\Lambda$ has a complexity of $O(k^3)$, which is a heavy computational burden. Thus, QR factorization is applied to maintain an incremental matrix decomposition of the selected subdictionary.

The subdictionary $D_\Lambda$ containing the $k$ selected atoms can be factorized as $D_\Lambda = QR$. The columns of $Q$ are ordered according to the iteration at which each atom was selected, with the $k$th column corresponding to the most recently selected atom. We can then readily solve (4) by replacing it with an equivalent problem, because the column spans of $D_\Lambda$ and $Q$ are the same. Since $Q$ is orthonormal and $R$ is upper triangular, the solution follows quickly from $R x_\Lambda = Q^T y$ by back-substitution. Therefore, the efficiency of the method depends heavily on how fast $Q$, $R$, and $Q^T y$ can be computed.

According to the Gram-Schmidt process, which orthonormalizes a set of vectors in an inner product space, once the first $k-1$ columns of $D_\Lambda$ have been decomposed we only need a single Gram-Schmidt step to obtain the last column $q_k$ of $Q$. To find $q_k$, we first compute the component of the newly selected atom $d_{\lambda_k}$ that is orthogonal to the span of $q_1, \dots, q_{k-1}$ and then normalize it:
$$\tilde{q}_k = d_{\lambda_k} - \sum_{i=1}^{k-1} (q_i^T d_{\lambda_k})\, q_i, \qquad q_k = \frac{\tilde{q}_k}{\|\tilde{q}_k\|_2}.$$

Similarly, $R$ and $z = Q^T y$ can be updated incrementally rather than recomputed:
$$R \leftarrow \begin{bmatrix} R & Q^T d_{\lambda_k} \\ 0 & \|\tilde{q}_k\|_2 \end{bmatrix}, \qquad z \leftarrow \begin{bmatrix} z \\ q_k^T y \end{bmatrix}.$$
While fast OMP benefits from this QR factorization, the method may still select atoms that lead to negative elements in $x_\Lambda$. Accordingly, we need to develop fast OMP with nonnegativity constraints.
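The following is a minimal sketch of this incremental update, following the standard Gram-Schmidt QR extension; variable names are illustrative. Appending one atom costs roughly O(nk) instead of refactorizing the whole subdictionary.

```python
# Incremental QR update used by fast OMP: when atom d joins the support, one
# Gram-Schmidt step extends Q, R, and z = Q^T y.
import numpy as np

def qr_append_atom(Q, R, z, d, y):
    """Q: (n, k-1) orthonormal, R: (k-1, k-1) upper triangular, z = Q^T y."""
    if Q is None:                        # first selected atom
        norm = np.linalg.norm(d)
        Q = (d / norm).reshape(-1, 1)
        return Q, np.array([[norm]]), np.array([Q[:, 0] @ y])
    w = Q.T @ d                          # projections onto existing columns
    q_tilde = d - Q @ w                  # component orthogonal to span(Q)
    norm = np.linalg.norm(q_tilde)
    q_new = q_tilde / norm
    # Extend R with the new column [w; norm] and append a zero row.
    k = R.shape[0]
    R_new = np.zeros((k + 1, k + 1))
    R_new[:k, :k] = R
    R_new[:k, k] = w
    R_new[k, k] = norm
    Q_new = np.hstack([Q, q_new.reshape(-1, 1)])
    z_new = np.append(z, q_new @ y)
    return Q_new, R_new, z_new

# The support coefficients then follow by back-substitution on R x = z,
# e.g. scipy.linalg.solve_triangular(R_new, z_new, lower=False).
```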

As stated above for NOMP, an atom is selected because it has the highest positive correlation with the residual. At iteration $k$, the approximation of the signal over the current support can be computed as in (7). Then, according to (7), in the $k$th iteration we obtain (8). For a unique corresponding triangular factor, this in turn gives (9). According to (9), as the normalizing term remains positive, we can ensure that all the selected coefficients are nonnegative whenever the candidate atom satisfies the condition in (10).

Next, if the atom with the largest correlation value also satisfies (10), the corresponding atom is selected directly. But if the atom with the highest positive correlation does not comply with (10), the most plausible candidate found so far should be recorded instead. The decision criterion is given in (11), where one term denotes the best candidate recorded so far and the other is the current candidate examined in the internal loop of the $k$th iteration. Candidates are examined in descending order of correlation with the residual, using a sorting operator in descending order, with the best candidate initialized before the inner loop. After the inner loop terminates, the selected atom is added to the support and $Q$ and $R$ are updated. The whole FNOMP process is summarized in Algorithm 1.

FNOMP
 Input: dictionary D, signal y, sparsity level K
 Initialization: residual r = y, support Λ = ∅, iteration k = 1
 While k ≤ K and max_j d_j^T r > 0 do
     Compute the correlations d_j^T r for all atoms
     Sort the candidate atoms in descending order of correlation
     Initialize the best recorded candidate
     While not Terminate do
         Take the next candidate atom from the sorted list
         Check the nonnegativity condition (10) for this candidate
         Update the best recorded candidate based on (11)
     End while
     Add the chosen atom λ_k to the support Λ
     Update Q and R by one Gram-Schmidt step
     Obtain the coefficients on Λ by back-substitution
     Update the residual r = y − D_Λ x_Λ
     Let k = k + 1
 End while
 Output: nonnegative sparse code x

As shown in Algorithm 1, the differences between NOMP and FNOMP can be elaborated from two aspects. First, although both algorithms consist of two loops, an internal and an external one, FNOMP performs the decision and update steps based on (11) inside the internal loop, whereas NOMP requires nonnegative least squares to optimize the coefficients of the selected atoms; both algorithms terminate in the external loop when the sparsity level is reached or the highest positive correlation with the residual is less than or equal to zero. Second, the two algorithms differ in time complexity. The total computational cost of FNOMP mainly comprises two parts, the cost of the internal loop and the cost of sorting the largest correlations, and depends on the sparsity level, the number of inner-loop iterations, the dimensionality of the atoms, and the number of atoms in the dictionary. Compared with FNOMP, the total computational cost of NOMP is dominated by its inner NNLS iterations, which must refit all of the selected coefficients and are typically more expensive.
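To tie the pieces together, the following is a hedged end-to-end sketch of an FNOMP-style encoder in the spirit of Algorithm 1: candidates are examined in descending order of positive correlation, the QR factorization is extended tentatively, and a candidate is accepted only when the back-substituted coefficients stay nonnegative. The acceptance test here is a simplified stand-in for conditions (10) and (11), and all names are illustrative.

```python
# FNOMP-style sketch: greedy selection and nonnegative coefficient update are
# fused via an incremental QR factorization; no separate NNLS refit is used.
import numpy as np
from scipy.linalg import solve_triangular

def fnomp(D, y, K, tol=1e-12):
    n, m = D.shape
    support, Q, R, z = [], np.zeros((n, 0)), np.zeros((0, 0)), np.zeros(0)
    residual = y.copy()
    x = np.zeros(m)
    for _ in range(K):
        corr = D.T @ residual
        if corr.max() <= 0:                        # external stopping rule
            break
        order = np.argsort(-corr)                  # candidates, best first
        chosen = None
        for j in order:
            if corr[j] <= 0:
                break                              # no remaining positive candidate
            if j in support:
                continue
            # Tentatively extend Q, R, z with atom j (one Gram-Schmidt step).
            d = D[:, j]
            w = Q.T @ d
            q_tilde = d - Q @ w
            norm = np.linalg.norm(q_tilde)
            if norm < tol:
                continue
            q_new = q_tilde / norm
            k = R.shape[0]
            R_try = np.zeros((k + 1, k + 1))
            R_try[:k, :k], R_try[:k, k], R_try[k, k] = R, w, norm
            z_try = np.append(z, q_new @ y)
            coeffs = solve_triangular(R_try, z_try, lower=False)
            if np.all(coeffs >= 0):                # simplified nonnegativity check
                chosen = (j, np.hstack([Q, q_new[:, None]]), R_try, z_try, coeffs)
                break
        if chosen is None:
            break
        j, Q, R, z, coeffs = chosen
        support.append(j)
        residual = y - D[:, support] @ coeffs
        x[:] = 0
        x[support] = coeffs
    return x
```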

4. Experiments and Analysis

In this section, we apply the FNOMP-based method to three widely used datasets, namely, Oxford Flowers, UIUC-Sports, and Caltech101. In the first part, the computational cost of FNOMP is compared with that of standard NOMP for different dictionary sizes. In the second part, without traditional preprocessing steps, the FNOMP-based algorithm is used to train 2-layer deep models and is compared with several state-of-the-art methods in terms of classification accuracy. Our PC is configured with an Intel Core i5 quad-core CPU at 3.1 GHz, 16 GB of RAM, and the Windows 7 64-bit operating system. All code is written in MATLAB.

4.1. Comparison of Computation Costs

In this part, we run the efficient NOMP encoder on densely sampled patches with a step size of one pixel, and the dimensionality of the atoms is fixed at 108. In the first experiment, the size of the overcomplete dictionary is increased from 200 to 400 and the sparsity level is set to 5. To study the encoding time in practice, we resize the test image to three different maximum sizes. Figure 3 illustrates that the running time of NOMP is consistently longer than that of FNOMP for all three settings. On average, the computation cost of the proposed method is 42% lower, which implies that it can be applied to full-size datasets with medium-sized images.

In the second experiment, we vary the sparsity level from 1 to 20 while the dictionary size is fixed at 400 and the image size is kept fixed. A comparison of the computation costs of NOMP and FNOMP is shown in Figure 4. As the sparsity level increases, the execution time of standard NOMP rises at a noticeably faster rate than that of FNOMP. In particular, the computation time increases by more than 50% when the sparsity level exceeds 8 and shoots up to 60 seconds when the coefficient vectors contain 20 nonzero elements.

4.2. Comparison of Classification Accuracy
4.2.1. Oxford Flowers Categorization

The Oxford Flowers dataset contains 1360 images of 17 different categories of flowers, with 80 images per class. Similarity between different classes makes this dataset challenging, and the intraclass variation is sometimes greater than the interclass variation between two species. Following the standard experimental settings for evaluation in [14], 60 random images are employed for training. Specifically, the receptive field size for max pooling is set to 4, and a small patch size is used for the second layer. The dictionary size is fixed at 400 and 1600 for the first and second layers, respectively. All images are kept in RGB and resized to two different maximum sizes. We report average classification accuracy over 10 trials. As shown in Table 1, the classification accuracy of the FNOMP-based deep learning method is far above that of HSSL, which leverages a hierarchical model composed of sparse coding, saliency pooling, and local grouping. Ito's methods, color-CoHOG and CoHD, which develop heterogeneous features based on co-occurrence, are also outperformed by the FNOMP-based approach. More importantly, the results show that image size has a great influence on the final accuracy. Figure 5 shows some examples from this dataset.

4.2.2. UIUC-Sports Categorization

UIUC-Sports is a static event-category dataset consisting of 8 sports categories, for example, bocce, polo, rock climbing, and snowboarding. It contains 1579 images in total, with 137 to 250 images per class. This dataset is quite challenging due to variations in pose and size within each category and cluttered backgrounds. Following the common experimental setting, we randomly choose 70 images per category for training and 60 for testing. Figure 6 gives example images from the classes of UIUC-Sports.

As mentioned above, the experimental settings are the same as in the previous experiment. The results in Table 2 indicate that the FNOMP-based method significantly outperforms the object bank (OB) and SIFT-based single-layer sparse coding (SIFT + SC). Meanwhile, the FNOMP-based scheme achieves highly competitive performance compared with the algorithm using nonnegative sparse coding with spatial pyramid matching (Sc + SPM), adapted Gaussian models (AGM), and the soft-assignment coding (SAC) approach. Similarly, it is found that a larger image size can enhance performance by a large margin.

4.2.3. Caltech101 Categorization

This is a challenging dataset for object recognition, comprising 9144 images in 102 classes. The number of images per category varies from 31 to 800. In addition to the background class, the remaining classes consist of vehicles, flowers, animals, and so forth. Some sample images from Caltech101 are shown in Figure 7. Following the common experimental setup for Caltech101, we train on 30 images per category and test on the rest. As before, we repeat the experiments 10 times with the other experimental settings identical to the previous one. As can be seen from Table 3, the performance of the FNOMP-based algorithm is marginally better than that of ScSPM and LLC, which are both SIFT-based algorithms. As for other hierarchical models, the FNOMP-based model outperforms deconvolutional networks (DN) by about 9% and deconvolutional networks with both nonnegative sparsity and selectivity (DNNS) by about 3%, even though DNNS employs the combined features of the 1st and 4th layers from a model trained with both properties, which is a more complex deep architecture. Hierarchical sparse coding (HSC) jointly learns two codebooks, which is also more complicated, yet the algorithm with FNOMP as the encoder outperforms HSC by 1.9%. Interestingly, the performance of low-rank nonnegative sparse coding (LR-Sc + SPM), which adopts a different strategy for the nonnegativity constraints, is extremely close to that of the FNOMP-based method.

Finally, we compare the performance of FNOMP as an encoder with that of OMP in deep networks under the same dictionary training scheme. Specifically, the same dense-grid patch sampling with a step size of one pixel is adopted and the dimensionality of the atoms remains unchanged. The dictionary size is set to 400 and 1600 for the first and second layers, respectively. As can be seen from Figure 8, FNOMP shows stronger performance than OMP on the three benchmarks in this trial. Using gain-shape vector quantization for dictionary training, the FNOMP-based algorithm increases the classification accuracy considerably, by around 6%, 3.2%, and 3.1%, respectively.

5. Conclusion

In this paper, we have studied fast nonnegative OMP as an encoder in deep networks for obtaining meaningful image representations. FNOMP yields impressive results in terms of both computational efficiency and classification accuracy. It is found that FNOMP performs significantly faster than standard NOMP with medium-sized images in practice; in particular, the computation cost of NOMP becomes two or more times that of FNOMP as the sparsity level increases. In addition, we have conducted further studies on three widely used benchmarks for image classification. The experimental results show that FNOMP performs better than SIFT-based single-layer sparse coding, hierarchical feature learning, and other state-of-the-art methods on the Oxford Flowers, UIUC-Sports, and Caltech101 datasets. Furthermore, with the same dictionary training approach, FNOMP is superior to OMP in terms of classification accuracy. Finally, the results also show that image size has a great influence on classification accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Research Fund for the Doctoral Program of Higher Education of China (no. 20120032110034) and the National Program on Key Basic Research Project (no. 2014CB340403).