Extrinsic Least Squares Regression with Closed-Form Solution on Product Grassmann Manifold for Video-Based Recognition
Least squares regression is a fundamental tool in statistical analysis and can be more effective than more complicated models when the number of training samples is small. Representing multidimensional data on a product Grassmann manifold has recently led to notable results in various visual recognition tasks. This paper proposes extrinsic least squares regression with the Projection Metric on the product Grassmann manifold by embedding the Grassmann manifold into the space of symmetric matrices via an isometric mapping. The proposed regression has a closed-form solution, which is more accurate than the numerical solution of the previous least squares regression based on geodesic distance. Experiments on several recognition tasks show that the proposed method achieves competitive accuracy in comparison with some state-of-the-art methods.
1. Introduction
As an important application of computer vision, video-based recognition such as action recognition attracts more and more attention. For inferring the correct label of a query in a given database of examples, there are mainly two kinds of methods. One kind of approach is based on representations with handcrafted features, and the other is based on deep learning architectures such as Convolutional Neural Networks (CNN). Generally speaking, deep learning algorithms have been shown to be successful when a large amount of data is available [3, 4]. However, the size of the database for many recognition tasks in daily life is small. In this case, deep learning algorithms lose efficacy, and it becomes important to analyze the structure of the data and represent it with discriminant features.
Nowadays, the Grassmann manifold has proven to be a powerful representation for video-based applications such as activity classification, action recognition, age estimation, and face recognition [8, 9]. In these applications, the Grassmann manifold is used to characterize the intrinsic geometry of data. Taking one representative work as an example, Lui factorized a data tensor using Higher Order Singular Value Decomposition (HOSVD) and imposed each factorized element on a Grassmann manifold. This representation yields a very discriminating structure for action recognition.
Inference on manifold spaces can be achieved extrinsically by embedding the manifold into Euclidean space, which can be considered as flattening the manifold. In the literature, the most popular choice of embedding is through tangent spaces [11, 12]. For example, Lui presented a least squares regression on the product Grassmann manifold, in which the weighted average of the training samples was computed in the tangent space and projected back to the Grassmann manifold by the standard logarithmic and exponential maps. In a tangent space, only the distances from points to the tangent pole are equal to geodesic distances, which is restrictive and may lead to inaccurate modeling. An alternative method embeds the Grassmann manifold into the space of symmetric matrices by a diffeomorphism and uses the Projection Metric, which is equal to the true Grassmann geodesic distance up to a scale of .
In this paper, representing multidimensional data on the product Grassmann manifold in the same form as Lui, we propose an extrinsic least squares regression on the product Grassmann manifold using the Projection Metric and give a closed-form solution, which is more accurate. As a simple statistical model, least squares regression has advantages such as simple calculation and being more effective than some complicated models when the number of training samples is small. We evaluate the proposed method on three small-scale datasets (hand gesture, Ballet, and traffic); the recognition rates show that our method is competitive with some state-of-the-art methods.
The rest of this paper is organized as follows: Section 2 introduces the mathematical background; Section 3 gives the product Grassmann manifold representation for video; Section 4 presents the distance on the product Grassmann manifold; Section 5 proposes extrinsic least squares regression on the product Grassmann manifold; Section 6 gives classification based on extrinsic least squares regression; Section 7 reports experiments on different datasets, which show that the proposed method achieves competitive accuracy; Section 8 analyzes the time complexity of the proposed method; and Section 9 gives a conclusion.
2. Mathematical Background
In this section, we introduce the mathematical background used in this paper.
2.1. Grassmann Manifold
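The Stiefel manifold $\mathcal{V}_{n,p}$ is the set of all $n \times p$ matrices with orthonormal columns; that is, $\mathcal{V}_{n,p} = \{X \in \mathbb{R}^{n \times p} : X^{T}X = I_p\}$, where $I_p$ is the $p \times p$ identity matrix. The Grassmann manifold $\mathcal{G}_{n,p}$ can be defined as a quotient manifold of $\mathcal{V}_{n,p}$ with an equivalence relation $\sim$. In fact, for any $X, Y \in \mathcal{V}_{n,p}$, $X \sim Y$ if and only if $\mathrm{span}(X) = \mathrm{span}(Y)$, where $\mathrm{span}(X)$ is the subspace spanned by the columns of $X$. In other words, the Grassmann manifold is the space of $p$-dimensional linear subspaces of $\mathbb{R}^{n}$ for $0 \le p \le n$, each of which may be specified by an arbitrary matrix with orthonormal columns of dimension $n \times p$. Note that the choice of matrix for a point on the Grassmann manifold is not unique; that is, the same point on the Grassmann manifold can be spanned by different matrices $X$ and $Y$.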
2.2. Higher Order Singular Value Decomposition (HOSVD)
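HOSVD is a multilinear SVD operating on tensors. Let $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ be a tensor of order $N$. The process of reordering the elements of an $N$-mode tensor into a matrix is called matricization. The mode-$k$ matricization of a tensor $\mathcal{A}$ is denoted by $A_{(k)}$ (see details in ). Then each $A_{(k)}$ is factored using SVD as follows: $A_{(k)} = U^{(k)} \Sigma^{(k)} V^{(k)T}$, (3) where $\Sigma^{(k)}$ is a diagonal matrix, $U^{(k)}$ is an orthogonal matrix which spans the column space of $A_{(k)}$, and $V^{(k)}$ is an orthogonal matrix which spans the row space of $A_{(k)}$. By using the HOSVD method, an order-$N$ tensor can be decomposed as follows: $\mathcal{A} = \mathcal{S} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}$, where $\mathcal{S}$ is the core tensor, $U^{(1)}, \ldots, U^{(N)}$ are the orthogonal matrices given in (3), and $\times_k$ denotes mode-$k$ multiplication.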
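The decomposition above can be sketched numerically as follows. This is a minimal illustration assuming NumPy; the helper names `unfold`, `mode_multiply`, `hosvd`, and `reconstruct` are hypothetical, and the column ordering used in `unfold` is one common convention that may differ from the cited reference (the left singular vectors do not depend on that ordering).

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-k matricization: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_multiply(tensor, matrix, mode):
    """Mode-k product: apply `matrix` along axis `mode` of `tensor`."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)

def hosvd(tensor):
    """Return the core tensor and the left singular matrix of each unfolding."""
    factors = []
    for mode in range(tensor.ndim):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_multiply(core, u.T, mode)  # project onto each basis
    return core, factors

def reconstruct(core, factors):
    """Multiply the core tensor by every factor matrix along its mode."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_multiply(out, u, mode)
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))  # small random tensor for illustration
core, factors = hosvd(A)
print(np.allclose(reconstruct(core, factors), A))
```

Because each factor here is a square orthogonal matrix, multiplying the core tensor back by the factors recovers the original tensor up to floating-point error.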
2.3. Product Manifold
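Let $\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_m$ be manifolds; the product manifold of these manifolds is defined as $\mathcal{M} = \mathcal{M}_1 \times \mathcal{M}_2 \times \cdots \times \mathcal{M}_m$, where $\times$ denotes the Cartesian product and each $\mathcal{M}_i$ is called a factor manifold.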
3. Product Grassmann Manifold Representation for Video
Video is a kind of multidimensional data and can be represented as a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, and $I_3$ represent the height, width, and length of the video, respectively. The variation of each mode can be captured by HOSVD. Lui et al. found that traditional HOSVD is not appropriate for forming a product manifold, so they modified the traditional definition of HOSVD to factorize the tensor using the orthogonal matrices $V^{(1)}$, $V^{(2)}$, and $V^{(3)}$ described in (3). That is, $\mathcal{A} = \hat{\mathcal{S}} \times_1 V^{(1)} \times_2 V^{(2)} \times_3 V^{(3)}$, where $\hat{\mathcal{S}}$ is the core tensor.
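Since each factor matrix $V^{(k)}$ ($k = 1, 2, 3$) is a tall orthogonal matrix, it is a point on a Stiefel manifold. Then $\mathrm{span}(V^{(k)})$ is a point on a Grassmann manifold. Hence, $(\mathrm{span}(V^{(1)}), \mathrm{span}(V^{(2)}), \mathrm{span}(V^{(3)}))$ is a point on the product Grassmann manifold, and this triple is a representation of the video on the product Grassmann manifold.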
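The mapping from a video tensor to a point on the product Grassmann manifold can be sketched as below. The sketch assumes, as the phrase "tall orthogonal matrix" suggests, that each factor is the matrix of right singular vectors of one mode unfolding; the function name `video_to_product_grassmann` and the tensor size are hypothetical and chosen only for illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-k matricization: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def video_to_product_grassmann(video):
    """Return three tall matrices with orthonormal columns, one per mode."""
    point = []
    for mode in range(3):
        # Thin SVD: the rows of vt are orthonormal and span the row space
        # of the unfolding, so vt.T is a tall matrix with orthonormal columns.
        _, _, vt = np.linalg.svd(unfold(video, mode), full_matrices=False)
        point.append(vt.T)
    return point

rng = np.random.default_rng(0)
video = rng.standard_normal((20, 20, 32))  # hypothetical height x width x length
factors = video_to_product_grassmann(video)
for V in factors:
    # Each factor is a point on a Stiefel manifold: V^T V = I.
    print(V.shape, np.allclose(V.T @ V, np.eye(V.shape[1])))
```

Each returned matrix is one orthonormal basis of its subspace; any other basis of the same subspace represents the same point on the factor Grassmann manifold.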
4. Distance on Product Grassmann Manifold
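The metric on the Grassmann manifold is the geodesic distance, which is the length of the shortest curve between two $p$-dimensional subspaces $\mathrm{span}(X)$ and $\mathrm{span}(Y)$, that is, $d_G(X, Y) = \|\Theta\|_2$ with $\Theta = [\theta_1, \ldots, \theta_p]$ representing the principal angles. Recently, Chikuse introduced a projection embedding $\Pi: \mathcal{G}_{n,p} \to \mathrm{Sym}(n)$, $\Pi(X) = XX^{T}$, where $\mathrm{Sym}(n)$ denotes the space of $n \times n$ symmetric matrices. Hamm and Lee defined a distance called the Projection Metric on the Grassmann manifold as follows.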
Definition 1. Given two points and on Grassmann manifold , the distance between and is defined as
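Remark 2. In fact, for any matrix $\tilde{X}$ whose columns form an orthonormal basis of $\mathrm{span}(X)$, there exists a $p \times p$ orthogonal matrix $Q$ such that $\tilde{X} = XQ$, so the element $\mathrm{span}(\tilde{X})$ is equal to the element $\mathrm{span}(X)$. In this case, $\tilde{X}\tilde{X}^{T} = XQQ^{T}X^{T} = XX^{T}$. Hence it is feasible to use the matrix $XX^{T}$ to represent $\mathrm{span}(X)$. As noted earlier, the Projection Metric is equal to the geodesic distance between two points on the Grassmann manifold up to a scale factor.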
Based on Definition 1, we give a kind of definition of distance on product Grassmann manifold which sums distance of each factor Grassmann manifold.
Definition 3. Given two points and on product Grassmann manifold , the distance between and is defined as
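Definitions 1 and 3 can be illustrated with a short numerical sketch. It assumes that the Projection Metric takes Hamm and Lee's form $d_P(X, Y) = 2^{-1/2}\|XX^{T} - YY^{T}\|_F$ (the normalization constant is an assumption here) and that the product distance is the sum of the factor distances; the function names are hypothetical.

```python
import numpy as np

def projection_distance(X, Y):
    """Projection Metric, assuming the form 2^{-1/2} * ||X X^T - Y Y^T||_F."""
    return np.linalg.norm(X @ X.T - Y @ Y.T, "fro") / np.sqrt(2.0)

def product_distance(xs, ys):
    """Distance on the product manifold: sum of the factor distances."""
    return sum(projection_distance(X, Y) for X, Y in zip(xs, ys))

def random_subspace(rng, n, p):
    """Orthonormal basis of a random p-dimensional subspace of R^n."""
    q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    return q

rng = np.random.default_rng(0)
n, p = 8, 3
X = random_subspace(rng, n, p)
Y = random_subspace(rng, n, p)
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))  # p x p orthogonal matrix

# Invariance to the choice of basis: X and XQ span the same subspace.
print(projection_distance(X, X @ Q))

# With this normalization the distance equals sqrt(sum_i sin^2(theta_i)),
# where cos(theta_i) are the singular values of X^T Y.
cosines = np.linalg.svd(X.T @ Y, compute_uv=False)
print(np.isclose(projection_distance(X, Y),
                 np.sqrt(np.sum(1.0 - cosines ** 2))))

xs = [random_subspace(rng, m, p) for m in (6, 7, 8)]
ys = [random_subspace(rng, m, p) for m in (6, 7, 8)]
print(product_distance(xs, ys))
```

A different normalization constant would rescale all distances by the same factor and would not change nearest-neighbor or regression results.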
5. Extrinsic Least Squares Regression on Product Grassmann Manifold
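Least squares regression is a simple and efficient technique in statistical analysis. In Euclidean space, the parameter $\beta$ is estimated by minimizing the residual sum-of-squares error $R(\beta) = \|y - A\beta\|^2$, where $A$ is the training set and $y$ is the regression value. The estimated parameter has the closed-form solution $\hat{\beta} = (A^{T}A)^{-1}A^{T}y$. Hence the corresponding error is $R(\hat{\beta}) = \|y - A(A^{T}A)^{-1}A^{T}y\|^2$.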
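For reference, the Euclidean closed-form solution can be checked numerically. The sketch below assumes NumPy and synthetic data; the variable names and the noise level are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))                  # 50 samples, 3 features
beta_true = np.array([1.0, -2.0, 0.5])
y = A @ beta_true + 0.01 * rng.standard_normal(50)

# Normal equations: (A^T A) beta = A^T y, solved without forming the inverse.
beta_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Residual sum of squares at the estimated parameter.
residual = np.sum((y - A @ beta_hat) ** 2)

# Cross-check against NumPy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq), residual)
```

Solving the normal equations with `np.linalg.solve` avoids an explicit matrix inverse, which is usually preferable numerically.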
On the Grassmann manifold, Lui extended linear least squares regression to a nonlinear form. In detail, the estimated parameter is equal to where is a nonlinear similarity operator, is a set of training samples on manifold, and is an element on manifold. So the corresponding error is where is an operator mapping points from vector space back to manifold. Because the Grassmann manifold is not closed under ordinary matrix subtraction and addition, the mapping is realized by employing the exponential map and its inverse, without a closed-form solution. To realize the composition map , an improved Karcher Mean Computation algorithm is employed. To avoid the accuracy loss of this iterative algorithm, we introduce an extrinsic least squares regression on the Grassmann manifold by embedding its elements into the space of symmetric matrices. Because the distance on the product Grassmann manifold in (8) is additive over the factors, the extrinsic least squares regression on the product Grassmann manifold reduces to three independent subregression problems, one on each factor. Taking one factor as an example, we show the details in the following.
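Let $\{X_i\}_{i=1}^{N}$ be the training set on one factor Grassmann manifold, where $N$ is the number of samples, and let $w \in \mathbb{R}^{N}$ be the fitting parameter. $Y$ is the regression value. Similar to the idea of least squares regression in Euclidean space, we give a regression on the Grassmann manifold, which is defined in the embedded space of symmetric matrices. The residual is measured as follows: $\min_{w} \| YY^{T} - \sum_{i=1}^{N} w_i X_i X_i^{T} \|_F^2$, (14) where $w_i$ is the $i$th element in vector $w$. Next we show how to solve the optimization. We have $\| YY^{T} - \sum_{i} w_i X_i X_i^{T} \|_F^2 = \mathrm{tr}(YY^{T}YY^{T}) - 2\sum_{i} w_i\, \mathrm{tr}(YY^{T}X_iX_i^{T}) + \sum_{i,j} w_i w_j\, \mathrm{tr}(X_iX_i^{T}X_jX_j^{T})$, and we define $[K(X)]_{ij} = \mathrm{tr}(X_iX_i^{T}X_jX_j^{T}) = \|X_i^{T}X_j\|_F^2$ and $[K(X,Y)]_i = \mathrm{tr}(YY^{T}X_iX_i^{T}) = \|X_i^{T}Y\|_F^2$. Since $Y^{T}Y = I_p$, we have $\mathrm{tr}(YY^{T}YY^{T}) = p$. Hence model (14) becomes $\min_{w}\; p - 2w^{T}K(X,Y) + w^{T}K(X)w$. (17) Setting the derivative of (17) with respect to $w$ equal to 0, we have $-2K(X,Y) + 2K(X)w = 0$. So the solution of optimization (14) is $w^{*} = K(X)^{-1}K(X,Y)$. (19) Hence the corresponding error becomes $p - K(X,Y)^{T}K(X)^{-1}K(X,Y)$.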
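The closed-form solution of the subregression on one factor can be sketched as follows. The sketch assumes that the residual is the squared Frobenius distance between the embedded query $YY^{T}$ and a weighted sum of the embedded training points $X_iX_i^{T}$, so that the weights solve a linear system with the Gram matrix $K_{ij} = \|X_i^{T}X_j\|_F^2$ and the vector $k_i = \|X_i^{T}Y\|_F^2$. The function names `gram`, `cross`, and `elsr_fit` are hypothetical, and the Gram matrix is assumed to be invertible.

```python
import numpy as np

def gram(Xs):
    """K[i, j] = tr(X_i X_i^T X_j X_j^T) = ||X_i^T X_j||_F^2."""
    N = len(Xs)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = np.linalg.norm(Xs[i].T @ Xs[j], "fro") ** 2
    return K

def cross(Xs, Y):
    """k[i] = tr(Y Y^T X_i X_i^T) = ||X_i^T Y||_F^2."""
    return np.array([np.linalg.norm(X.T @ Y, "fro") ** 2 for X in Xs])

def elsr_fit(Xs, Y):
    """Return the closed-form weights and the residual p - k^T K^{-1} k."""
    K = gram(Xs)            # depends only on training data, can be cached
    k = cross(Xs, Y)
    w = np.linalg.solve(K, k)
    residual = Y.shape[1] - k @ w
    return w, residual

rng = np.random.default_rng(0)
n, p, N = 10, 2, 4
Xs = [np.linalg.qr(rng.standard_normal((n, p)))[0] for _ in range(N)]
Y = np.linalg.qr(rng.standard_normal((n, p)))[0]

w, residual = elsr_fit(Xs, Y)
# Compare with the residual evaluated directly in the embedded space.
direct = np.linalg.norm(
    Y @ Y.T - sum(wi * X @ X.T for wi, X in zip(w, Xs)), "fro") ** 2
print(np.isclose(residual, direct))
```

The Gram matrix depends only on the training samples, so in a recognition setting it can be computed once and reused for every query.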
6. Recognition Based on Extrinsic Least Squares Regression
In this section, we consider the 3-order product Grassmann manifold for videos; the situation for higher orders is similar. Suppose $C$ classes are defined for the data. We denote the training set corresponding to the $c$th class as $\{X_{c,i}\}_{i=1}^{N_c}$, where $N_c$ is the number of samples. Our objective is to infer to which class the test sample $Y$ belongs.
The residual error of query sample for class is defined as where are solutions of subregression on each factor Grassmann manifold, respectively. The category of the query sample is determined by
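The classification rule can be sketched as follows. The sketch assumes that the residual of a query for one class is the sum over the three factors of the closed-form squared residual of the subregression on that factor, and that the predicted class is the one with the smallest residual. The function names, the dictionary layout of the training set, and the random data are all hypothetical.

```python
import numpy as np

def factor_residual(train_factor, query_factor):
    """Closed-form squared residual on one factor: p - k^T K^{-1} k."""
    K = np.array([[np.linalg.norm(Xi.T @ Xj, "fro") ** 2 for Xj in train_factor]
                  for Xi in train_factor])
    k = np.array([np.linalg.norm(Xi.T @ query_factor, "fro") ** 2
                  for Xi in train_factor])
    w = np.linalg.solve(K, k)
    return query_factor.shape[1] - k @ w

def class_residual(class_samples, query):
    """Sum the factor residuals; each sample is a tuple of three bases."""
    total = 0.0
    for f in range(3):
        train_factor = [sample[f] for sample in class_samples]
        total += factor_residual(train_factor, query[f])
    return total

def classify(train, query):
    """Return the label with the smallest residual and all residuals."""
    residuals = {label: class_residual(samples, query)
                 for label, samples in train.items()}
    return min(residuals, key=residuals.get), residuals

def random_point(rng, dims=(12, 12, 9), p=3):
    """Random point on a product of three Grassmann manifolds."""
    return tuple(np.linalg.qr(rng.standard_normal((n, p)))[0] for n in dims)

rng = np.random.default_rng(0)
train = {"a": [random_point(rng) for _ in range(4)],
         "b": [random_point(rng) for _ in range(4)]}
query = train["b"][1]  # a query identical to a training sample of class "b"
label, residuals = classify(train, query)
print(label, residuals)
```

In this toy run the query coincides with a training sample, so its residual for that class is zero up to rounding; real queries give strictly positive residuals for every class.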
7. Experiments on Different Datasets
In this section, we compare the performance of the proposed method with some state-of-the-art methods on three datasets covering two kinds of tasks: action recognition and scene analysis.
7.1. Action Recognition
7.1.1. Cambridge Hand Gesture Dataset
The Cambridge hand gesture dataset contains 900 video sequences of nine kinds of hand gestures, divided into 5 sets according to different illuminations. Figure 1 shows some hand gesture samples. Set 5 (normal illumination) is used for training, while the remaining four sets (with different illumination characteristics) are used for testing. The original sequences are converted to grayscale and resized to . We denote our method as ELSR and report the correct recognition rate (CRR) for the four test sets in Table 1. Compared with product manifold (PM), Grassmann Sparse Coding (gSC), Grassmann Locality-Constrained Coding (gLC), kernel Grassmann Sparse Coding (kgSC), and kernel Grassmann Locality-Constrained Coding (kgLC), our method is competitive with these state-of-the-art methods.
7.1.2. Ballet Dataset
The Ballet dataset consists of 44 videos collected from a Ballet instruction DVD. It includes 8 complex motion patterns performed by 3 persons. In detail, the actions are “right-to-left hand opening,” “left-to-right hand opening,” “standing hand opening,” “jumping,” “leg swinging,” “hopping,” “turning,” and “standing still”. The main challenge of this dataset is the large variation in speed, clothing, and motion paths. Figure 2 shows some examples from the dataset. Table 2 shows that ELSR has superior performance compared with gSC-dic, gLC-dic, kgSC-dic, and kgLC-dic.
7.2. Scene Analysis
For scene analysis, we use the UCSD traffic dataset, which contains 254 videos of highway traffic under different weather conditions. Resolution is and number of frames ranges from 42 to 52. The dataset is divided into three classes (“heavy,” “medium,” and “light”) according to the traffic congestion level. In total, 44 sequences are labeled as heavy traffic, 45 as medium traffic, and 165 as light traffic. Figure 3 shows some typical examples. In the experiment, we use the first 40 frames of each video, converted to grayscale and resized to a resolution of 48 × 48. We adopt the four pairs of training and testing sets provided in paper . The classification results are shown in Table 3; the average correct recognition rate of ELSR is higher than that of gSC and gLC but lower than that of kgSC and kgLC.
Through the above experiments, we conclude that the proposed method is more effective for action recognition than for scene analysis. The product Grassmann manifold captures appearance, horizontal motion, and vertical motion through its three factor manifolds. To visualize the product manifold representation, the overlay appearance, horizontal motion, and vertical motion of examples from the three datasets are given in Figure 4. Note that there are obvious variations along horizontal motion for the hand gesture examples and along both horizontal and vertical motion for the Ballet examples. The curves in the last two columns characterize the motion and are the key factors for recognition. This can be seen as an explanation of the higher CRR of ELSR on the Ballet dataset. Meanwhile, for samples from UCSD, horizontal and vertical motion features are not clear because all cars run along the same path, and the critical factor is appearance, which characterizes the number of cars. Hence, for the UCSD dataset, the CRR of ELSR is only slightly higher than that of gSC and gLC and is lower than that of kgSC and kgLC, which map to higher-dimensional manifolds using kernel functions to diminish nonlinearity.
8. Performance Analysis
We analyze time complexity of inferring the label for a query with given training samples . The main computing steps include (19) and (22). We take one factor manifold for example. Some terms only related to such as can be computed offline. For computing , the time complexity of computing one element of is ; then the complexity of vector is , and hence the complexity of solving solution is . Computing the error needs . Therefore the whole time complexity of our approach is . For small-scale dataset, our proposed method is effective and the time complexity will not be too large. For example in experiment of Cambridge hand gesture dataset: , , , , , , and .
9. Conclusion
In this paper, we propose extrinsic least squares regression on the product Grassmann manifold. A video can be viewed as a third-order tensor and then transformed to a point on the product Grassmann manifold through HOSVD factorization. One advantage of this method is that the regression has a closed-form solution, which leads to a higher correct recognition rate. Moreover, the proposed method is efficient when the number of training samples is small. Experiments on different recognition tasks (hand gesture recognition, action recognition, and scene analysis) show that our method performs well on three small-scale public datasets.
In future work, we would like to devise a kernel version of extrinsic least squares regression on the product manifold.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research is supported by the National Natural Science Foundation of China (nos. 61390510, 61632006, and 61772049), the Beijing Natural Science Foundation (no. 4162009), Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality and Jing-Hua Talents Project of Beijing University of Technology, and Funding Project of Beijing Municipal Human Resources and Social Security Bureau (no. 2017-ZZ-031).
References
R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proceedings of the 29th Annual Conference on Neural Information Processing Systems, NIPS 2015, pp. 2377–2385, Canada, December 2015.
P. Turaga and R. Chellappa, “Locally time-invariant models of human activities using trajectories on the grassmannian,” in Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pp. 2435–2441, USA, June 2009.
Z. Huang, R. Wang, S. Shan, and X. Chen, “Projection Metric Learning on Grassmann Manifold with Application to Video based Face Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pp. 140–149, USA, June 2015.
Y. Chikuse, Statistics on special manifolds, vol. 174 of Lecture Notes in Statistics, Springer-Verlag, New York, 2003.
P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, Princeton, NJ, 2008.