Abstract

We study the problem of detecting and localizing objects in still, gray-scale images using the part-based representation provided by nonnegative matrix factorization. Nonnegative matrix factorization (NMF) is an emerging subspace method able to extract interpretable parts from a set of template image objects and then combine them additively to describe individual objects. In this paper, we present a prototype system based on several nonnegative factorization algorithms, which differ in the additional properties imposed on the nonnegative representation of the data, in order to investigate whether such additional constraints produce better results in generic object detection via nonnegative matrix factorization.

1. Introduction

The notion of low dimensional approximation has played a fundamental role in effectively and efficiently processing and conceptualizing the huge amounts of data stored in large sparse matrices. In particular, subspace techniques, such as Singular Value Decomposition [1], Principal Component Analysis (PCA) [2], and Independent Component Analysis [3], represent a class of linear algebra methods widely adopted to analyze high dimensional datasets in order to discover latent structures by projecting the data onto a low dimensional space. Generally, a subspace method learns a set of basis vectors from a set of suitable data templates. These vectors span a subspace able to capture the essential structure of the input data. Once the subspace has been found (during the off-line learning phase), the detection of a new sample can be accomplished (in the so-called on-line detection phase) by projecting it onto the subspace and finding the nearest neighbor among the templates projected onto this subspace. These methods have found efficient applications in several areas of information retrieval, computer vision, and pattern recognition, especially in the fields of face identification [4, 5], recognition of digits and characters [6, 7], and molecular pattern discovery [8, 9].

However, the pertinent information stored in many data matrices is often nonnegative (examples are pixels in images, the probability of a particular topic appearing in a linguistic document, the amount of pollutant emitted by a factory, and so on [10–15]). During the analysis process, taking this nonnegativity constraint into account can bring benefits in terms of interpretability and visualization of large scale data, while keeping the representation physically feasible. Nevertheless, classical subspace methods describe data as a combination of elementary features involving both additive and subtractive components; hence, they cannot guarantee that nonnegativity is preserved.

The recent approach of low-rank nonnegative matrix factorization (NMF) is particularly attractive for obtaining a reduced representation of data using additive components only. This idea has been motivated in a couple of ways. Firstly, in many applications one knows (e.g., from the rules of physics) that the quantities involved cannot be negative. Secondly, nonnegativity has been argued for based on the intuition that parts are generally combined additively (and not subtracted) to form a whole; moreover, psychological and physiological principles suggest that humans learn objects in a part-based manner. Hence, the nonnegativity constraints might be useful for learning part-based representations [16].

In this paper, we investigate the problem of performing “generic” object detection in images using the framework of NMF. By “generic” detection, we mean detecting, inside a given image, classes of objects, such as any car or any face, rather than finding a specific object (class instance), such as a particular car or a particular face.

Generally, the object detection task is accomplished by comparing object similarities to a small number of reference features, which can be expressed in holistic (global) or sparse (local) terms, and then adopting a learning mechanism to identify regions in the feature space that correspond to the object class of interest. Among subspace techniques, PCA constitutes an example of an approach adopting global descriptors related to the variance of the image space (the so-called eigenfaces) to visually represent a set of given face images [17]. Other holistic approaches are based on global descriptors expressed by color, texture histograms, and global image transformations [18]. On the other hand, local features have been proved to be invariant to noise, occlusion, and pose variation, and they are also supported by the theory of “recognition-by-components” introduced in [19]. The most widely adopted features of local type are Gabor features [20], wavelet features [21], and rectangular features [22]. Some approaches using part-based representations were proposed in [23, 24], but they have the drawback of requiring manually defined object parts and a vocabulary of parts to represent objects in the target class. More recently, automatic extraction of parts possessing high information content in terms of local signal change has been illustrated in [25], together with a classifier based on a sparse representation of patches extracted around interest points in the image.

The nonnegativity constraints of NMF make this subspace method a promising technique for automatically extracting parts describing the structure of object classes. In fact, these localized parts can be combined in a purely additive way (with varying combination coefficients) to describe individual objects, and the factorization can serve as a learning mechanism to extract interpretable parts from a set of template images. Moreover, making use of the concept of distance from the subspace spanned by the extracted parts, NMF can also be adopted as a learning method to detect whether an object is present inside a given image.

An interesting example of part-based representation of the original data can be found in the context of image articulation libraries. Here, NMFs are able to extract realistic parts (limbs) from images depicting stick figures with four limbs in different articulations. However, it should be pointed out that the existence of such a part-based representation heavily depends on the objects themselves [26].

The first proposed NMF algorithms (the multiplicative and additive update rules presented in [11]) have been applied in the field of face identification to decompose a face image into parts reminiscent of features such as lips, eyes, and nose. More recently, comparisons between other nonnegative part-based algorithms (such as nonnegative sparse coding and local NMF) have been presented in the context of facial feature learning, demonstrating a good performance in terms of detection rate using only a small number of basis components [27]. A preliminary comparison of three NMF algorithms (classical multiplicative NMF [11], local NMF [28], and discriminant NMF [29]) has been illustrated in [30] on the recognition of different object color images. Moreover, results on the influence of additional constraints on NMF, such as the sparseness proposed in [31], have been presented in [32] for various dimensions of the subspaces generated for object recognition tasks (particularly, face recognition and handwritten digit identification).

Here, we investigate the problem of performing detection of single objects in images using different NMF algorithms, in order to inquire whether the representation provided by the NMF framework can effectively produce added value in detecting and locating objects inside images. The problem to be explored can be formalized as follows. Given a collection of template images representing objects of the same class, that is, a group of objects which may differ slightly from each other visually but correspond to the same semantic concept (for example, cars, digits, or faces), we would like to understand whether NMF is able to provide some kind of local feature representation which can be used to locate objects in test images.

The rest of the paper is organized as follows. The next section describes the mathematical problem of computing a nonnegative matrix factorization and reviews some of the algorithms proposed in the literature to learn such a matrix decomposition model. These algorithms constitute the core of an object detection prototype system based on learning via NMF, which is proposed in Section 3 together with a brief description of its off-line learning and on-line detection phases. Section 4 presents experimental results illustrating the properties of the adopted NMF learning algorithms and their performance in detecting objects in real images. Finally, Section 5 concludes with a summary and possible directions for future work.

2. Mathematical Background and Algorithms

The problem of finding a nonnegative low dimensional approximation of a set of data templates stored in a high dimensional data matrix $V \in \mathbb{R}_+^{n \times m}$ can be stated as follows.

Given an initial dataset expressed by an $n \times m$ matrix $V$, where each column is an $n$-dimensional nonnegative vector of the original database ($m$ vectors), find an approximate decomposition of the data matrix into a basis matrix $W \in \mathbb{R}_+^{n \times r}$ and an encoding variable matrix $H \in \mathbb{R}_+^{r \times m}$, both having nonnegative elements, such that $V \approx WH$.

Generally, the rank $r$ of the matrices $W$ and $H$ is much lower than the rank of $V$ (usually it is chosen so that $(n+m)r < nm$). Each column of the matrix $W$ contains a basis vector of the spanned (NMF) subspace, while each column of $H$ represents the weights needed to approximate the corresponding column in $V$ by means of the vectors in $W$.

The NMF is actually a conical coordinate transformation: Figure 1 provides a graphical interpretation in a two dimensional space. The two basis vectors $w_1$ and $w_2$ describe a cone which encloses the dataset $V$. Due to the nonnegativity constraint, only points within this cone can be reconstructed through a linear combination of these basis vectors:
$$v = [w_1, w_2] \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} = h_1 w_1 + h_2 w_2. \quad (1)$$
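As a small numerical illustration of this conical reconstruction (our own NumPy sketch, not taken from the paper), the following lines rebuild a point of the cone from two nonnegative basis vectors and nonnegative coefficients:

import numpy as np

# Two nonnegative basis vectors spanning a cone in the plane.
w1 = np.array([1.0, 0.2])
w2 = np.array([0.3, 1.0])
W = np.column_stack([w1, w2])   # basis matrix W

h = np.array([0.5, 1.5])        # nonnegative encoding coefficients (h1, h2)
v = W @ h                       # v = h1*w1 + h2*w2, a point inside the cone
print(v)                        # [0.95 1.6 ]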

The factorization $V \approx WH$ presents the disadvantage of the lack of uniqueness of its factors. For example, if an arbitrary invertible matrix $A \in \mathbb{R}^{r \times r}$ can be found such that the two matrices $\widetilde{W} = WA$ and $\widetilde{H} = A^{-1}H$ are still nonnegative, then another factorization $V \approx \widetilde{W}\widetilde{H}$ exists. Such a transformation is always possible if $A$ is an invertible nonnegative monomial matrix (a matrix is called monomial if there is exactly one nonzero element in each row and column); in this case, the result of the transformation is simply a scaling and permutation of the original matrices [33].

An NMF of a given data matrix $V$ can be obtained by finding a solution of a nonlinear optimization problem over a specified error function. Two simple error functions are often used to measure the distance between the original data $V$ and its low dimensional approximation $WH$: the sum of squared errors (also known as the squared Euclidean distance), which leads to the minimization of the functional
$$\|V - WH\|^2, \quad (2)$$
subject to the nonnegativity constraints over the elements $W_{ij}$ and $H_{ij}$, and the generalized Kullback-Leibler divergence between the nonnegative matrices
$$\mathrm{Div}(V \,\|\, WH) = \sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right), \quad (3)$$
subject to the nonnegativity of the matrices $W$ and $H$.

2.1. Classical Algorithm

The most popular approach to numerically solve the NMF optimization problem is the multiplicative update algorithm proposed in [11]. In particular, it can be shown that the squared Euclidean distance measure (2) is nonincreasing under the iterative update rules described in Algorithm 1.

Initialize nonnegative matrices $W^{(0)}$ and $H^{(0)}$
while stopping criteria are not satisfied do
  $W \leftarrow W \otimes (VH^\top) \oslash (WHH^\top)$
  $H \leftarrow H \otimes (W^\top V) \oslash (W^\top WH)$
end while
{$\otimes$ and $\oslash$ denote the Hadamard product, that is, the element-wise matrix multiplication, and the element-wise division, respectively}
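As an illustration, the following minimal NumPy sketch implements Algorithm 1 (the function name, the random initialization, and the small epsilon added to the denominators to avoid division by zero are our own choices):

import numpy as np

def nmf_multiplicative(V, r, n_iter=2500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for the squared Euclidean cost (2)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))  # nonnegative random initialization of W^(0)
    H = rng.random((r, m))  # nonnegative random initialization of H^(0)
    for _ in range(n_iter):
        # Hadamard products/divisions realized by NumPy element-wise operators.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return W, H

For instance, W, H = nmf_multiplicative(V, 20) computes a rank-20 factorization after the fixed number of iterations used as the stopping criterion in Section 4.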

The Lee and Seung update rules can be interpreted as a diagonally rescaled gradient descent method (i.e., a gradient descent method using a rather large learning rate). It has been proved that the above algorithm converges to a local minimum. Other techniques, such as the alternating nonnegative least squares method or bound-constrained optimization algorithms, such as the projected gradient method, have also been used when additional constraints are added to the nonnegativity of the matrices $W$ or $H$ [34–36].

2.2. NMF Algorithms with Orthogonal Constraints

Unlike other subspace methods, the basis vectors learned by NMF are not orthonormal to each other. Different modifications of the standard cost functions (2) and (3) have been proposed to include further constraints on the factors $W$ and/or $H$, such as orthogonality or sparsity.

Concerning the possibility of making the basis or the encoding matrix closer to the Stiefel manifold (the Stiefel manifold is the set of all real $l \times k$ matrices with orthonormal columns $\{Q \in \mathbb{R}^{l \times k} : Q^\top Q = I_k\}$, where $I_k$ is the $k \times k$ identity matrix), which means that the vectors in $W$ or $H$ should be orthonormal to each other, two different update rules have been proposed in [37] to add orthogonality on $W$ or $H$, respectively. In particular, when one requires that $W^\top W$ be as close as possible to the identity matrix of conformal dimension (i.e., $W^\top W \approx I_r$), the multiplicative update rules in Algorithm 1 can be modified as described in Algorithm 2 (see [38] for details).

Initialize nonnegative matrices $W^{(0)}$ and $H^{(0)}$
while stopping criteria are not satisfied do
  $H \leftarrow H \otimes (W^\top V) \oslash (W^\top WH)$
  $W \leftarrow W \otimes \big((VH^\top) \oslash (WW^\top VH^\top)\big)^{\circ 1/2}$
end while
{$\otimes$ and $\oslash$ denote the Hadamard product and the element-wise division, respectively, and $(\cdot)^{\circ 1/2}$ denotes the element-wise square root operation}
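Along the same lines, a NumPy sketch of Algorithm 2 (again an illustration of ours, with an epsilon guard added to the denominators) could read:

import numpy as np

def nmf_orthogonal_W(V, r, n_iter=2500, eps=1e-9, seed=0):
    """Multiplicative updates pushing W toward W^T W = I (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # element-wise square root of the multiplicative factor
        W *= np.sqrt((V @ H.T) / (W @ W.T @ V @ H.T + eps))
    return W, H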

Different orthogonal NMF algorithms have been derived by directly using the true gradient on the Stiefel manifold [38, 39] and imposing the orthogonality between nonnegative basis vectors while learning the decomposition.

An interesting issue, strictly tied to the computation of the orthogonal NMF when the adopted cost function is the generalized KL-divergence, is its connection with some probabilistic latent variable models. In particular, it has been pointed out in [40] that the objective function of a probabilistic latent semantic indexing model is the same as the objective function of NMF with an additional orthogonal constraint. Moreover, when the encoding matrix $H$ is required to possess orthogonal columns, it can be proved that orthogonal NMF is equivalent to the K-means clustering algorithm [40, 41].

2.3. NMF Algorithm with Localization Constraints

NMF algorithms optimizing a slight variation of the KL-divergence (3) can be adopted to yield a factorization which reveals local features in the data, as proposed in [28]. In particular, local nonnegative matrix factorization (LNMF) uses the error function
$$\sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right) + \alpha \sum_{ij} U_{ij} - \beta \sum_{i} Q_{ii}, \quad (4)$$
where $\alpha, \beta > 0$ are constants, $U = W^\top W$, and $Q = HH^\top$. The function (4) is the KL-divergence (3) with additional terms designed to enforce the locality of the basis features. In particular, the modified objective function (4) attempts to minimize the number of basis components required to represent the dataset $V$ and the redundancy between different bases, trying to make them as orthogonal as possible. Moreover, it maximizes the total activity on each component, that is, the total squared projection coefficients summed over all training data, so that only the bases containing the most important information are retained. The iterative update rules derived from the error function (4) are described in Algorithm 3.

Initialize nonnegative matrices $W^{(0)}$ and $H^{(0)}$
while stopping criteria are not satisfied do
  $H \leftarrow \big(H \otimes (W^\top (V \oslash (WH)))\big)^{\circ 1/2}$
  $W \leftarrow W \otimes \big((V \oslash (WH))\,H^\top\big)$
  $W \leftarrow W\,\mathrm{diag}(\|W_1\|_1, \|W_2\|_1, \ldots, \|W_r\|_1)^{-1}$
end while
{$\otimes$ and $\oslash$ denote the Hadamard product and the element-wise division, respectively, $(\cdot)^{\circ 1/2}$ denotes the element-wise square root operation, and $\mathrm{diag}(\|W_1\|_1, \|W_2\|_1, \ldots, \|W_r\|_1)$ indicates the $r \times r$ diagonal matrix whose diagonal elements are the 1-norms of the column basis vectors of $W$}
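A NumPy sketch of Algorithm 3 under the same conventions (the epsilon guard and the naming are ours) might be:

import numpy as np

def lnmf(V, r, n_iter=2500, eps=1e-9, seed=0):
    """Local NMF updates of Algorithm 3."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(n_iter):
        H = np.sqrt(H * (W.T @ (V / (W @ H + eps))))  # element-wise square root update
        W *= (V / (W @ H + eps)) @ H.T                # multiplicative update of W
        W /= W.sum(axis=0, keepdims=True) + eps       # unit 1-norm columns (W is nonnegative)
    return W, H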

It has been proved that the update rules in Algorithm 3 monotonically decrease the objective function (4) to a local minimum.

2.4. NMF Algorithm with Sparseness Constraints

NMF algorithms can be extended to include the option of controlling sparseness explicitly, in order to discover part-based representations that are qualitatively better than those given by standard NMF, as proposed in [31]. In particular, to quantify the sparseness of a generic vector $x \in \mathbb{R}^k$, the following relationship between the 1-norm and the Euclidean norm has been adopted (in Hoyer's original paper the terminology $L_1$-norm and $L_2$-norm is used):
$$\mathrm{sparseness}(x) = \frac{\sqrt{k} - \|x\|_1 / \|x\|_2}{\sqrt{k} - 1}. \quad (5)$$
Function (5) assumes values in the interval $[0,1]$, where 0 indicates the minimum degree of sparsity, obtained when all the elements $x_i$ possess the same absolute value, while 1 indicates the maximum degree of sparsity, reached when only one component of the vector $x$ is different from zero. This measure can be adopted to impose a desired degree of sparseness on the basis vectors in $W$ and/or the encoding coefficient vectors in $H$, depending on the specific application the nonnegative decomposition is sought for.
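The measure (5) translates directly into code; the following NumPy sketch of ours computes it for a generic vector:

import numpy as np

def sparseness(x):
    """Hoyer's sparseness (5): 0 when all entries share the same absolute
    value, 1 when a single component is nonzero."""
    x = np.asarray(x, dtype=float).ravel()
    k = x.size
    ratio = np.abs(x).sum() / np.linalg.norm(x)  # ||x||_1 / ||x||_2
    return (np.sqrt(k) - ratio) / (np.sqrt(k) - 1)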

To compute NMF with sparseness constraints, a projected gradient descent algorithm has been developed. This algorithm essentially takes a step in the direction of the negative gradient of the cost function (2) and subsequently projects onto the constraint space, that is, the cone of nonnegative matrices with a prescribed degree of sparseness, ensured by imposing that $\mathrm{sparseness}(W_i) = s_W$ and $\mathrm{sparseness}(H_i) = s_H$, where $W_i$ and $H_i$ are the $i$th columns of $W$ and $H$, respectively, and $s_W$ and $s_H$ are the desired sparseness values. The update rules used to compute $W$ and $H$ are described in Algorithm 4.

Input: positive constants $\mu_W > 0$, $\mu_H > 0$
Choose an appropriate $\mathrm{Projection}(\cdot)$ operator to ensure the desired degree of sparseness
Initialize nonnegative matrices $W^{(0)}$ and $H^{(0)}$
while stopping criteria are not satisfied do
  $W_{ij} \leftarrow W_{ij} - \mu_W\big((WHH^\top)_{ij} - (VH^\top)_{ij}\big)$
  $W \leftarrow \mathrm{Projection}(W)$
  $H_{ij} \leftarrow H_{ij} - \mu_H\big((W^\top WH)_{ij} - (W^\top V)_{ij}\big)$
  $H \leftarrow \mathrm{Projection}(H)$
end while
{$\mu_W > 0$ and $\mu_H > 0$ are positive constants representing the step sizes of the algorithm, and $\mathrm{Projection}(\cdot)$ indicates the appropriate projection operator}
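A minimal NumPy sketch of Algorithm 4 follows; since Hoyer's projection operator is an iterative procedure of its own (see [31]), we leave it abstract and pass it in as a callable. A trivial placeholder such as lambda M: np.maximum(M, 0.0) would enforce only nonnegativity, not the desired sparseness:

import numpy as np

def nmf_sparseness(V, r, project_W, project_H, mu_W=1e-4, mu_H=1e-4,
                   n_iter=2500, seed=0):
    """Projected gradient descent of Algorithm 4; project_W and project_H
    stand in for Hoyer's sparseness projection operator."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = project_W(rng.random((n, r)))
    H = project_H(rng.random((r, m)))
    for _ in range(n_iter):
        W = project_W(W - mu_W * (W @ H @ H.T - V @ H.T))  # gradient step, then projection
        H = project_H(H - mu_H * (W.T @ W @ H - W.T @ V))
    return W, H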

It should be observed that, when the sparsity constraint is not required on $W$ or $H$, the corresponding update rules reduce to those of Algorithm 1 (the interested reader is referred to [31] for further details on this algorithm).

3. Object Detection System Based on NMF

In this section, we schematically present an object detection prototype system based on learning via NMF. The workflow of the prototype system can be roughly divided into two main phases: the off-line learning phase and the on-line detection phase (the latter mainly devoted to the object localization activity).

The off-line learning phase consists in preparing the training image data and then learning a proper subspace representation of them. To comply with the format of the data matrix $V$ (in order to obtain one of its nonnegative factorizations), each given $p \times q$ training image has to be converted into a $pq$-dimensional column vector (stacking the columns of the image matrix into a single vector) and then inserted as a column of the matrix $V$. It should be observed that this vector representation of image data presents the drawback of losing the spatial relationship between neighboring pixels inside the original image.

Once the image training matrix $V \in \mathbb{R}_+^{n \times m}$ is formed (with $n = pq$), its NMF can be computed by applying one of the following algorithms:
(i) the Lee and Seung multiplicative update rules (indicated by NMF and described in Algorithm 1) [11];
(ii) NMF with an additional orthogonality constraint on the basis matrix $W$ (indicated by DLPP and described in Algorithm 2) [37];
(iii) local NMF (indicated by LNMF and described in Algorithm 3) [28];
(iv) NMF with an additional sparseness constraint (indicated by NMFsc and described in Algorithm 4) [31].

Once the basis and encoding matrices have been obtained using one of the previous algorithms, the on-line detection phase can be started. In particular, for each test sample image $q$, the distance from the subspace spanned by the learned basis matrix $W$ is computed by means of the following formula:
$$\mathrm{dist}(W, q) = \|q - WW^\top q\|_2. \quad (6)$$
The distance value $\mathrm{dist}(W, q)$ is then compared with a fixed threshold $\vartheta$, which is adopted to positively recognize the test image $q$ as a known object. In particular, the decision rule which can be easily derived is
$$\text{IF } \mathrm{dist}(W, q) \le \vartheta \text{ THEN } q \text{ is labelled as a known object and the object is located inside it.} \quad (7)$$
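A direct NumPy transcription of (6) and (7) (a sketch of ours; note that $WW^\top q$ coincides with the orthogonal projection of $q$ onto the learned subspace only when the columns of $W$ are orthonormal) is:

import numpy as np

def distance_from_subspace(W, q):
    """dist(W, q) = ||q - W W^T q||_2, formula (6)."""
    return np.linalg.norm(q - W @ (W.T @ q))

def is_known_object(W, q, theta):
    """Decision rule (7): accept q as a known object when its distance
    from the subspace does not exceed the threshold."""
    return distance_from_subspace(W, q) <= theta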

Since the dimensions of a test image are bigger than those of the training images, we adopt a common approach used to detect rigid objects such as faces or cars [42]. In particular, a frame of the same dimensions as the training images (i.e., a window-frame of $p \times q$ pixels) is slid across the test image in order to locate the subregions of the test image which contain known objects. To reduce computational costs, starting from the top-left corner of the test image, the sliding frame is moved in steps of size 5 percent of the test image dimensions, first in the horizontal and then in the vertical direction (as shown in Figure 2).
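The sliding-window scan just described can be sketched as a simple generator (our own illustration, using the 5 percent step stated above):

import numpy as np

def sliding_windows(image, p, q, step_frac=0.05):
    """Slide a p-by-q window over a 2-D image array, starting from the
    top-left corner, in steps of 5 percent of the image dimensions."""
    rows, cols = image.shape
    dr = max(1, int(round(step_frac * rows)))
    dc = max(1, int(round(step_frac * cols)))
    for i in range(0, rows - p + 1, dr):
        for j in range(0, cols - q + 1, dc):
            yield i, j, image[i:i + p, j:j + q]

Each extracted window would then be stacked column-wise into a $pq$-dimensional vector, exactly like the training images, and fed to the decision rule (7).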

The detection threshold is crucial to label each query image as an object belonging or not to the subspace representation of the training space. Raising the threshold increases the correct detections, but also increases the false positives; lowering the threshold has the opposite effect. To overcome this weakness, a preliminary detection phase can be performed in order to determine a range $[d, D]$ of distances used to fix a default threshold value as follows:
$$\vartheta_{\mathrm{default}} = d + (D - d) \cdot 0.1. \quad (8)$$

The multiplicative factor 0.1 has been derived empirically. The simple mechanism adopted to estimate the threshold value has the drawback that the proposed system may identify something even when it deals with images which do not contain any object of interest. Different estimation methods for the default threshold could be adopted to increase the detection rate; however, we defer this aspect to a more detailed analysis to be tackled in future work. Figure 3 provides an example of the results obtained after the on-line detection phase: the picture on the left represents the test image, while the picture on the right represents a copy of the test image in which black pixels identify those pixels belonging to sliding windows which have not been identified as known objects.

4. Experimental Results

This section presents an experimental evaluation of the object detection/localization approach developed in the previous section. The prototype system is evaluated on single-scale images (i.e., images containing objects of the same dimensions as the training data). After a brief description of the datasets adopted in the off-line training phase, some comparisons of the above-mentioned NMF algorithms are reported. Our primary concern is the qualitative evaluation of the different algorithms, in order to assess whether additional constraints on the basis matrix (such as sparseness and orthogonality) and/or different numbers of basis images (explicitly represented by the rank $r$) produce better results in object detection.

All the numerical results have been obtained by Matlab 7.7 (R2008b) codes run on an Intel Core 2 Quad Q9400 processor (2.66 GHz, 4 GB RAM). The execution time of each algorithm has been measured by the built-in Matlab functions tic and toc.

In order to test the object detection prototype system based on the illustrated NMF algorithms, three image datasets have been adopted: CarData, USPS, and ORL. These datasets represent three different types of objects: cars, handwritten digits, and faces, respectively. Figure 4 illustrates some training images from the adopted datasets.

The CarData training set contains 550 gray scale training images of cars of size 100×40 pixels, while the test set is composed of 170 single-scale test images, containing 200 cars at roughly the same scale as in the training images. The USPS dataset contains normalized gray scale images of handwritten digits of size 16×16 pixels, divided into a training set of 7291 images and a test set of 2007 images including all digits from 0 to 9. A preprocessing of USPS has been applied to rescale the pixel values from the range $[-1, 1]$ to the range $[0, 1]$. The ORL dataset contains gray scale images of faces of 40 distinct subjects. Each image is of size 92×112 pixels and has been taken against a dark homogeneous background with the subject in an upright, frontal position, with slight left-right out-of-plane rotation. We use the first 8 images of each subject for the training set and the remaining 2 images for the test set.

4.1. Experimental Setup

The off-line learning phase has been run for different values of the rank $r$ (representing the number of basis images) and with a selected degree of sparsity imposed on the NMFsc algorithm (in particular, the sparsity parameter has been fixed as $s_W = 0.5$, while no sparseness constraint has been imposed on $H$). As previously observed, we are interested in assessing the existence of any qualitative difference between the NMF learning algorithms in the context of generic object detection. In fact, the rank value $r$ represents the dimensionality of the subspace spanned by the matrix $W$: an increase in its value can be interpreted as an information gain with respect to the original dataset. On the other hand, large values of $r$ could introduce some redundancy in the basis representation of the dataset, nullifying the benefits provided by the part-based representation of the NMF. The algorithms have been trained on each dataset for various values of the rank (CarData: $r = 20, \ldots, 110$; USPS: $r = 80, \ldots, 220$; ORL: $r = 20, \ldots, 80$). We report the results related to the lowest and the highest rank values for each dataset. For the benefit of comparison, the same stopping criterion has been adopted for all NMF learning algorithms (i.e., the algorithms stop when the maximum number of iterations, set to 2500, is reached). Moreover, the results reported in the following sections represent the average values obtained over ten different random initializations of the nonnegative initial matrices $W^{(0)}$ and $H^{(0)}$. Note that, for each trial, the same randomly generated initial matrices (with proper dimensions with respect to the adopted dataset) have been used for all the algorithms.

The algorithms have been compared in terms of final approximation error, computed by $\mathrm{MSE}(W, H) = \|V - WH\|^2$, execution time (the number of seconds required by each algorithm to complete the learning phase), and degree of orthogonality of $W$, measured by $\mathrm{orth}(W) = \|W^\top W - I\|_F$. This latter measure has been added to highlight when additional constraints (in this specific case, the orthogonality of the basis factor) provide better results in the detection phase.
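These measures are straightforward to compute; a NumPy sketch of ours is:

import numpy as np

def mse(V, W, H):
    """Final approximation error ||V - WH||^2 (squared Frobenius norm)."""
    return np.linalg.norm(V - W @ H) ** 2

def orth_error(W):
    """Degree of orthogonality orth(W) = ||W^T W - I||_F."""
    return np.linalg.norm(W.T @ W - np.eye(W.shape[1]))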

4.2. Results of the Off-Line Learning Phase

This section reports the results obtained at the end of the off-line training phase for the three chosen image datasets. Table 1 reports the MSE, the execution time, and the degree of orthogonality of $W$ when the algorithms are trained on the chosen datasets. For each dataset, the results obtained for the lowest and the highest values of the rank are reported; these correspond to the coarsest and the finest subspace approximations of each dataset.

Figure 5 illustrates the part-based representation of the CarData dataset learned by the adopted algorithms. To better appreciate the visual differences between the obtained bases, we plot the bases only for the smaller value $r = 20$. Analogously, Figures 6 and 7 report the basis representations of the USPS dataset (with rank value $r = 80$) and the ORL dataset (with rank value $r = 20$), respectively. The NMF algorithm learns a global representation of both the car and the face images, while it provides a local representation of the handwritten digits. The LNMF, DLPP, and NMFsc algorithms, instead, learn localized image parts, some of which appear to roughly correspond to parts of faces, parts of cars, and parts of digit marks. Essentially, these NMF algorithms select a subset of pixels which are simultaneously active across multiple images to be represented by a single basis vector.

As an example, Figure 8 illustrates the behavior of the MSE during the learning phase on the CarData dataset, with rank values $r = 20$ and $r = 115$, respectively. It should be observed that, after some iterations, all algorithms converge to similar values of the MSE. The LNMF algorithm presents a larger value of the MSE simply because it is based on the KL-divergence cost function, so it provides a rougher approximation of the dataset in terms of MSE. To better appreciate the rate of convergence of all algorithms, Figure 9 reports the behavior of the MSE during the initial 600 iterations of the learning phase associated with the USPS dataset, with rank value $r = 80$. A behavior similar to that depicted in Figures 8 and 9 is observed for all the other datasets and for different values of the rank $r$.

Concerning the degree of orthogonality of the matrix $W$ learned by each algorithm, Figure 10 reports the semilog plot of the orthogonality error of $W$ during the learning iterations on the CarData dataset (with rank values $r = 20$ and $r = 115$, resp.). It should be observed that both LNMF and DLPP produce a matrix $W$ possessing an appreciable degree of orthogonality. On the other hand, since NMF and NMFsc do not incorporate any additional constraint, they preserve or sometimes deteriorate the degree of orthogonality of the initial matrix $W^{(0)}$. Similar plots of the orthogonality error can be obtained for the matrices computed using the USPS and ORL datasets, respectively.

4.3. Results of the On-Line Detection Phase

Once the basis and encoding matrices have been obtained at the end of the learning phase, we are ready to enter the on-line detection and localization phase, in order to carry out a qualitative analysis of the considered algorithms (by means of the prototype system). To measure the performance of the NMF-based object detection/localization system, we are interested in knowing how many of the objects it detects and how often the detections it makes are false. In particular, the two quantities of interest are the number of correct detections and the number of false detections: the former should be maximized, while the latter has to be minimized. As already observed in Section 3, the decision rule (7), which allows the system to identify a test image as a known object, depends on the detection threshold $\vartheta$. By suitably varying the threshold $\vartheta$, different tradeoffs between correct and false detections can be reached. This tradeoff can be estimated considering the recall and the precision. The recall is the proportion of objects that are detected; the precision is the fraction of correctly detected objects among the total number of detections made by the system. Denoting by $TP$ the number of true positives, $FP$ the number of false positives, and $n_P$ and $n_F$ the total numbers of positives and negatives in the dataset, respectively, the performance measures are $\mathrm{Recall} = TP/n_P$ and $\mathrm{Precision} = TP/(TP + FP)$, and the rate of false detections can be computed as $1 - \mathrm{Precision}$. It should be pointed out that precision-recall is a more appropriate measure than the common ROC curve, since the latter is designed for binary classification tasks, not for detection tasks [25].
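These definitions translate into the following small sketch (ours):

def recall_precision(tp, fp, n_pos):
    """Recall = TP/n_P and Precision = TP/(TP + FP); the rate of false
    detections is then 1 - precision."""
    return tp / n_pos, tp / (tp + fp)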

The evaluation results have been obtained by manually determining the location of the windows containing the objects of interest. Tables 2, 3, and 4 report the performance results for CarData, USPS, and ORL, respectively, when different values of the dimensionality $r$ of the subspace approximation are adopted. The NMF algorithms show some differences in terms of recall and precision; in particular, NMF and NMFsc provide better results than LNMF and DLPP. The performance of the latter algorithms is also quite poor on the ORL face dataset, which represents one of the easiest databases in terms of recognition.

Figure 11 reports the results obtained after the on-line phase on a car test example. The picture on the top illustrates the query image; the remaining pictures show the positive pixels provided by (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP, respectively (trained with $r = 110$).

Figure 12 illustrates the results obtained after the on-line phase on a handwritten digit test example. The picture on the top illustrates the query image; the remaining pictures show the positive pixels provided by (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP, respectively (trained with $r = 80$). As can be noted, the DLPP algorithm provides the worst result, since it marks all the background pixels around the digit images.

Figure 13 illustrates the results obtained after the on-line phase on an image composed of different ORL test images. Again, the picture on the top illustrates the query image; the remaining pictures show the positive pixels provided by (a) NMF, (b) LNMF, (c) NMFsc, and (d) DLPP, respectively (trained with $r = 80$). Also in this case, the worst results are given by the DLPP algorithm, which is not able to correctly locate all the ORL test images.

4.4. Qualitative Analysis in Natural Images

The following images illustrate the results obtained during the on-line detection phase for each considered algorithm with different query images. Particularly, Figure 14 provides an example of detection of a car inside some test images taken from the CarData test set.

Figure 15 illustrates the detection and localization of some digit images inserted in a large-scale image with a white background, while Figure 16 reports the detection/localization results for some digit images written on a large white page. Figure 17 shows the detection of some handwritten digits present on an image of a real letter envelope. In the latter case, it can be observed that there are some false positive detections, such as the two stamps and the letters in the address. In the case of the stamps, this can be explained by their bigger dimension with respect to the sliding window and to the bases learnt by the algorithm (see Figure 6); in the case of the letters, by the inherent resemblance between some handwritten numbers and letters (such as “0” and “O,” “8” and “B,” “6” and “b”).

Figure 18 gives evidence of the capability of the NMF algorithms to recognize human faces inside two real world pictures which portray human figures against different backgrounds; as can be observed, the adopted algorithm is able to recognize the presence of a face different from the training faces learnt in the off-line training phase. This confirms that the part-based representation provided by NMF can effectively produce added value in detecting and locating objects inside images.

5. Conclusions and Future Work

To summarize, we have presented a prototype framework for learning how to detect and locate “generic” objects in images using the part-based representation provided by the nonnegative matrix factorization of a set of template images. Comparisons between different NMF algorithms have been presented, evidencing that different additional constraints (such as sparseness) can be more suitable for identifying localized parts describing some structures in object classes. Our experiments on well-known databases demonstrated that the proposed NMF-based prototype system is able to extract such interpretable parts from a set of training images and to use them for localizing similar objects in real world images.

Future work could be undertaken to allow the elaboration of object images at different scales, to improve the final localization (using, for instance, a repeated part elimination algorithm), and to apply different criteria and/or measures to identify whether a test image does or does not belong to the subspace of known objects.

Acknowledgment

The authors would like to thank the anonymous referees for their suggestions and comments, which proved to be very useful for improving the paper.