Abstract

Recognizing human actions in videos is an active topic with broad commercial potential. Most existing action recognition methods assume the same camera view during both training and testing, and thus the performance of these single-view approaches may be severely degraded by camera movement and viewpoint variation. In this paper, we address this problem by utilizing videos simultaneously recorded from multiple views. To this end, we propose a learning framework based on multitask random forests to exploit a discriminative mid-level representation for videos from multiple cameras. In the first step, subvolumes of continuous human-centered figures are extracted from the original videos. In the next step, spatiotemporal cuboids sampled from these subvolumes are characterized by multiple low-level descriptors. Then a set of multitask random forests is built upon multiview cuboids sampled at adjacent positions, and an integrated mid-level representation is constructed for the multiview subvolumes of one action. Finally, a random forest classifier is employed to predict the action category from the learned representation. Experiments conducted on the multiview IXMAS action dataset illustrate that the proposed method can effectively recognize human actions depicted in multiview videos.

1. Introduction

Automatic recognition of human actions in videos has become increasingly important in many applications such as intelligent video surveillance, smart home systems, video annotation, and human-computer interaction. For example, detecting suspicious human behaviors in time is an essential task in intelligent video surveillance, and identifying falls of elderly people is of great importance for a smart home system. In recent years, a variety of action recognition approaches [1-5] have been proposed to solve single-view tasks, and several surveys [6-10] review the advances of single-view action recognition in detail. However, real-world videos pose great challenges to single-view action recognition, since the visual appearance of actions can be severely affected by viewpoint changes and self-occlusion.

Different from single-view approaches, which utilize one camera to capture human actions, multiview action recognition methods exploit several cameras to record actions from multiple views and try to recognize actions by fusing the multiview videos. One strategy is to handle multiview action recognition at the classification level by annotating videos from multiple views separately and merging the predicted labels of all views. Pehlivan and Forsyth [11] designed a fusion scheme for videos from multiple views. They first annotated labels over frames and cameras using a nearest neighbor query technique and then employed a weighting scheme to fuse action judgments into a sequence label. Another group of methods resorts to merging data from multiple views at the feature level. These methods [12-15] utilize 3D or 2D models to build a discriminative representation of an action based on videos from multiple views. In fact, how to represent an action video with expressive features plays an especially important role in both multiview and single-view action recognition. A video representation with strong discriminative and descriptive ability is able to express human actions reasonably and supply sufficient information to the action classifier, which leads to an improvement in recognition performance.

This paper presents a multiview action recognition approach with a novel mid-level action representation. A learning framework based on multitask random forests is proposed to exploit a discriminative mid-level representation from low-level descriptors of multiview videos. The input of our method is a set of multiview subvolumes, each of which includes continuous human-centered figures; these subvolumes simultaneously record one action from different perspectives. We then sample spatiotemporal cuboids from the subvolumes at regular positions and extract multiple low-level descriptors to characterize each cuboid. During training, cuboids from multiple views sampled at four adjacent positions are grouped together to construct a multitask random forest by using action category and position as two related tasks, and a set of multitask random forests is constructed in this way. In testing, each cuboid is classified by the corresponding random forest, and a fusion strategy is employed to create an integrated histogram describing the cuboids sampled at a certain position of the multiview subvolumes. Histograms of different positions are concatenated into a mid-level representation for subvolumes simultaneously recorded from multiple views. Moreover, the integrated histogram of multiview cuboids is created from the distributions of both action categories and cuboid positions, which endows the learned mid-level representation with the ability to exploit the spatial context of cuboids. To achieve multiview action recognition, a random forest classifier is adopted to predict the category of the action. Figure 1 depicts the overview of our multitask random forest learning framework.

The remainder of this paper is organized as follows. After a brief overview of the related work in Section 2, we describe our method in detail in Sections 3, 4, and 5. Then a description of the experimental evaluation procedure, followed by an analysis of the results, is given in Section 6. Finally, the paper concludes with discussions and conclusions in Section 7.

2. Related Work

The existing multiview action recognition methods that fuse data at the feature level can be roughly categorized into two groups: 3D-based approaches and 2D-based approaches.

Some action recognition methods based on 3D models have shown good performance on several public multiview action datasets. Weinland et al. [12] built 3D action representations based on invariant Fourier analysis of motion history volumes by using multiple view reconstructions. For the purpose of considering view dependency among cameras and adding full flexibility in camera configurations, they designed an exemplar-based hidden Markov model to characterize actions with 3D occupancy grids constructed from multiple views in another work [16]. Holte et al. [14] combined 3D optical flow of each view into enhanced 3D motion vector fields, which are described with the 3D Motion Context and the view-invariant Harmonic Motion Context in a view-invariant manner. Generally, 3D reconstruction from multiple cameras requires additional processing such as camera calibration, which would lead to high computational cost and reduce the flexibility. In order to overcome the limitation of 3D reconstruction from 2D images, some methods employ depth sensors for multiview action recognition. Hsu et al. [17] addressed the problem of view changes by using RGB-D cameras such as Microsoft Kinect. They constructed a view-invariant representation based on the Spatiotemporal Matrix and integrated the depth information into the spatiotemporal feature to improve the performance.

In recent years, different methods based on 2D models have been proposed for multiview action recognition. These methods aim to construct discriminative and view-invariant action representations from one or more descriptors. Souvenir and Babbs [13] learned low-dimensional and view-independent representations of actions recorded from multiple views by using manifold learning. In the work of [18], scale and location invariant features are calculated from human silhouettes to obtain sequences of multiview key poses, and action recognition is achieved through Dynamic Time Warping. Kushwaha et al. [19] extracted scale invariant contour-based pose features and uniform rotation invariant local binary patterns for view-invariant action recognition. Sargano et al. [20] learned discriminative and view-invariant descriptors for real-time multiview action recognition by using region-based geometrical and Hu-moments features extracted from human silhouettes. Chun and Lee [15] extracted local flow motion from multiview image sequences and estimated the dominant angle and intensity of optical flow for head direction identification. Then they utilized histogram of the dominant angle and intensity to represent each sequence and concatenated histograms of all views as the final feature of multiview sequences. Murtaza et al. [21] developed a silhouette-based view-independent action recognition scheme. They computed Motion History Images (MHI) for each view and employed Histograms of Oriented Gradients (HOG) to extract low-dimensional description of them. Gao et al. [22] evaluated seven popular regularized multitask learning algorithms on multiview action datasets and treated different actions as different tasks. In their work, videos from each view are handled separately. Hao et al. [23] employed a sparse coding algorithm to transfer the low-level features of multiple views into a discriminative and high-level semantics space and achieved action recognition by a multitask learning approach in which each action is considered as an individual task.

Besides, some other methods employ deep learning techniques to learn discriminative features for multiview action recognition, and several neural networks have been developed to build these deep-learned features directly from the raw data. Lei et al. [24] utilized a convolutional neural network to extract effective and robust action features for continuous action segmentation and recognition under a multiview setup.

The proposed method is also relevant to our previous work [25], in which a random forest based learning framework is designed for building mid-level representations of action videos. Different from [25], the proposed method aims to solve the problem of multiview action recognition, and an integrated mid-level representation is learned for an action depicted in videos recorded from multiple views. Meanwhile, our multitask random forest learning framework is able to effectively exploit the spatial context of cuboids.

3. Overview of Our Method

Our goal is to recognize a human action by using videos recorded from multiple views. To this end, we propose a novel multitask random forest framework to learn a uniform mid-level feature for an action. In order to remove the influence of the background, we first employ a human body detector or tracker to obtain human-centered figures from a video, and then each video is divided into a series of fixed-size subvolumes, each of which is a sequence of human-centered figures. We densely extract spatiotemporal cuboids from the subvolumes, and each of them is represented by multiple low-level features.
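
As a concrete illustration of this preprocessing step, the following Python sketch crops the tracked person from each frame, rescales the crops, and groups them into fixed-length subvolumes. It is a minimal sketch: the helper name, the 10-frame subvolume length (taken from the experimental setting in Section 6), and the `figure_size` value are illustrative assumptions rather than the authors' exact configuration.

```python
import numpy as np
import cv2  # only used for resizing the cropped figures

def build_subvolumes(frames, boxes, sub_len=10, figure_size=(64, 64)):
    """Crop the tracked person from each frame, rescale the crops to a fixed
    resolution, and group consecutive crops into fixed-length subvolumes.

    frames      : list of H x W x 3 uint8 images
    boxes       : list of (x, y, w, h) person bounding boxes, one per frame
    sub_len     : number of consecutive figures per subvolume (10 in Section 6)
    figure_size : target (width, height) of each human-centered figure; this
                  is a placeholder value, since the exact resolution is not
                  specified in the text above
    """
    figures = []
    for frame, (x, y, w, h) in zip(frames, boxes):
        crop = frame[y:y + h, x:x + w]
        figures.append(cv2.resize(crop, figure_size))
    # Drop the tail so every subvolume has exactly `sub_len` figures.
    n_sub = len(figures) // sub_len
    subvolumes = [np.stack(figures[i * sub_len:(i + 1) * sub_len])
                  for i in range(n_sub)]
    return subvolumes  # each item has shape (sub_len, height, width, 3)
```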

Our multitask random forest framework utilizes a fusion strategy to obtain an integrated histogram feature for cuboids sampled at the same position of subvolumes that simultaneously record an action from different views. Concretely, a multitask random forest is built upon cuboids extracted at four adjacent positions of the multiview subvolumes, and thus we can construct a set of multitask random forests corresponding to different groups of positions. For the purpose of exploiting the spatial context of cuboids, the position of a cuboid is treated as another task besides the action category in the construction of each multitask random forest. Decision trees in a multitask random forest vote on the action category and position of cuboids and generate a single histogram for the cuboids sampled at the same position of the simultaneously recorded multiview subvolumes, according to the distribution of both action category and cuboid position. The concatenation of the histograms of all positions is normalized to obtain the mid-level representation for the multiview subvolumes. For multiview action recognition, a random forest classifier is adopted to predict the category of the action.

4. Low-Level Features

Our multitask random forest based framework is general and can merge multiple low-level features. In our implementation, we extract three complementary low-level features to describe the motion, appearance, and temporal context of the human of interest. The optical flow feature computed from the entire human figure characterizes global motion information, the HOG3D spatiotemporal descriptor extracted from a single cuboid captures local motion and appearance information, and the temporal context feature expresses the relative temporal location of cuboids. Therefore, the mid-level feature built upon these three types of low-level features is more robust to video variations such as global deformation, local partial occlusion, and diversity of movement speed.

4.1. Optical Flow

Optical flow [35] is used to calculate the motion between two adjacent image frames. This motion descriptor is robust to noise, so it can tolerate the jitter of human figures caused by the human detector or tracker. Given a sequence of human-centered figures, a pixel-wise optical flow field is calculated at each frame using the Lucas-Kanade algorithm [36]. The optical flow vector field is split into two scalar fields corresponding to the horizontal and vertical components of the flow, $F_x$ and $F_y$. Then $F_x$ and $F_y$ are half-wave rectified into two nonnegative channels each, $F_x^+$, $F_x^-$ and $F_y^+$, $F_y^-$, respectively; namely, $F_x = F_x^+ - F_x^-$ and $F_y = F_y^+ - F_y^-$. Each channel is blurred with a Gaussian filter and normalized to obtain the final four sparse, nonnegative channels $\hat{F}_x^+$, $\hat{F}_x^-$, $\hat{F}_y^+$, and $\hat{F}_y^-$, which constitute the motion descriptor of each frame.
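
The sketch below illustrates the channel construction described above, starting from a precomputed dense flow field. The Gaussian width `sigma` and the L1 normalization are assumptions made for the example; the text above does not fix these choices, and any dense optical flow estimator can be used to produce the input field.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_descriptor(flow, sigma=1.5, eps=1e-8):
    """Turn a dense flow field (H x W x 2) into the four half-wave rectified,
    blurred, and normalized channels described above.

    `sigma` and the L1 normalization are assumptions; the blur width and the
    normalization scheme are not specified in the text above.
    """
    fx, fy = flow[..., 0], flow[..., 1]
    # Half-wave rectification: F_x = F_x^+ - F_x^-, F_y = F_y^+ - F_y^-.
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),
                np.maximum(fy, 0), np.maximum(-fy, 0)]
    blurred = [gaussian_filter(c, sigma) for c in channels]
    descriptor = np.concatenate([c.ravel() for c in blurred])
    return descriptor / (descriptor.sum() + eps)
```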

4.2. HOG3D

HOG3D [37] is a local spatiotemporal descriptor based on histograms of oriented 3D spatiotemporal gradients; it is an extension of the HOG descriptor [38] to video. 3D gradients are calculated at arbitrary spatial and temporal scales, followed by orientation quantization using regular polyhedrons. A local support region is divided into a grid of cells, an orientation histogram is computed for each cell, and the concatenation of all histograms is normalized to generate the final descriptor. In this paper, HOG3D descriptors are computed for cuboids densely sampled from the human-centered subvolumes, and an icosahedron with full orientation is utilized to quantize the 3D gradients of each cell.
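
For illustration, the following sketch computes a simplified HOG3D-style descriptor for a grayscale cuboid. It quantizes 3D gradient orientations on a regular azimuth/elevation grid instead of the icosahedron used by the original HOG3D, and the cell layout and bin counts are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def hog3d_like(cuboid, cells=(2, 2, 3), az_bins=5, el_bins=4, eps=1e-8):
    """A simplified HOG3D-style descriptor for a grayscale cuboid of shape
    (T, H, W).  Orientations are quantized on a regular azimuth/elevation
    grid instead of the icosahedron used in [37], and the 2 x 2 x 3 cell
    layout is only an illustrative choice.
    """
    gt, gy, gx = np.gradient(cuboid.astype(np.float64))
    mag = np.sqrt(gx ** 2 + gy ** 2 + gt ** 2)
    azimuth = np.arctan2(gy, gx)                             # [-pi, pi]
    elevation = np.arctan2(gt, np.sqrt(gx ** 2 + gy ** 2))   # [-pi/2, pi/2]

    T, H, W = cuboid.shape
    cy, cx, ct = cells
    hist = []
    for it in range(ct):
        for iy in range(cy):
            for ix in range(cx):
                sl = (slice(it * T // ct, (it + 1) * T // ct),
                      slice(iy * H // cy, (iy + 1) * H // cy),
                      slice(ix * W // cx, (ix + 1) * W // cx))
                # Magnitude-weighted 2D orientation histogram for this cell.
                h, _, _ = np.histogram2d(
                    azimuth[sl].ravel(), elevation[sl].ravel(),
                    bins=[az_bins, el_bins],
                    range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]],
                    weights=mag[sl].ravel())
                hist.append(h.ravel())
    descriptor = np.concatenate(hist)
    return descriptor / (np.linalg.norm(descriptor) + eps)
```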

4.3. Temporal Context

The temporal context feature characterizes the temporal relation among different cuboids and is regarded as a type of low-level feature in this paper. Given a video with $N$ frames, a cuboid $c$ is extracted from a subvolume which contains $T$ frames and begins at the $t$th frame. The temporal context of $c$ is described as a two-dimensional vector whose first component represents the temporal position of $c$ in the whole video and whose second component denotes the temporal offset of $c$ relative to the center of the video.
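
The snippet below gives one plausible instantiation of this two-dimensional feature; the exact normalization used in the paper is not recoverable from the text above, so the formulas here are assumptions for illustration only.

```python
def temporal_context(start_frame, sub_len, video_len):
    """One plausible instantiation of the two-dimensional temporal context:
    the normalized temporal position of the cuboid's subvolume in the video
    and its signed offset from the video center.  The exact normalization is
    an assumption; the text above only describes the two quantities
    qualitatively.
    """
    center = start_frame + sub_len / 2.0
    position = center / video_len                     # roughly in [0, 1]
    offset = (center - video_len / 2.0) / video_len   # roughly in [-0.5, 0.5]
    return (position, offset)
```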

5. Multitask Random Forest Learning Framework

We detail the proposed multitask random forest framework in this section. Suppose that an action is recorded by $V$ cameras simultaneously, and we obtain human-centered subvolumes with fixed size from each video; the subvolumes simultaneously recorded from the $V$ views are denoted as $S^1, S^2, \dots, S^V$. Then we densely sample spatiotemporal cuboids from every subvolume with particular size and stride, denoting those of subvolume $S^v$ by $c_1^v, c_2^v, \dots, c_P^v$, where $P$ is the number of sampling positions; each cuboid is characterized by multiple low-level features. In order to exploit the spatial context of cuboids, we treat the spatial position of a cuboid as another type of annotation and employ cuboids of various action instances extracted at adjacent positions to build a multitask random forest by using both action labels and position labels. The proposed multitask random forest framework constructs an integrated histogram $h_p$ to describe the cuboids $c_p^1, \dots, c_p^V$ sampled at position $p$ of the multiview subvolumes $S^1, \dots, S^V$, and the histograms of all $P$ positions are concatenated to create a unified mid-level representation for subvolumes that are simultaneously recorded from multiple views.

5.1. Construction of Multitask Random Forest

Our training cuboids are extracted from the subvolumes of the training action instances, and each video of an instance generates several subvolumes. Cuboids of an instance share the same action label, and the position label of cuboid $c_p^v$ is $p$. As shown in Figure 1, we draw cuboids at $P$ regular positions of the subvolumes and utilize training cuboids sampled at four adjacent positions to construct a multitask random forest. In total, we obtain a set of multitask random forests, denoted as $F_1, F_2, \dots, F_M$.

A multitask random forest $F_m$ is an ensemble of decision trees, and each tree only takes cuboids from a particular view as input. Decision trees of the same view $v$ share the original training set $D_m^v = \{c_p^v \mid p \in P_m\}$, where $P_m$ represents the set of positions belonging to multitask random forest $F_m$. As shown in Figure 1, $P_m$ includes four positions that are adjacent in the direction of width or height. For example, cuboids at position 1, position 2, position 5, and position 6 (i.e., $c_1^v$, $c_2^v$, $c_5^v$, and $c_6^v$ in Figure 1) belong to $F_1$, and thus $P_1 = \{1, 2, 5, 6\}$.

In order to build a decision tree $T_j^v$ of view $v$, we randomly sample about 2/3 of the cuboids from the original training set $D_m^v$ with the bootstrap method and obtain its own training dataset $D_{m,j}^v$. All of the training cuboids in $D_{m,j}^v$ go through the tree from the root. We split a node, and the training cuboids assigned to it, according to a particular feature chosen from a set of randomly sampled feature candidates. Since a cuboid is described by three types of low-level features (i.e., optical flow, HOG3D, and temporal context), two parameters $\theta_1$ and $\theta_2$ are predefined to control the selection of feature candidates. Specifically, we generate two random numbers $r_1$ and $r_2$ to decide which types of low-level features are utilized for the node split. If $r_1 < \theta_1$, then a number of optical flow features are randomly selected as feature candidates; otherwise, some randomly selected HOG3D features comprise the set of feature candidates. Meanwhile, if $r_2 < \theta_2$, then the two temporal context features are added to the set of feature candidates. Each feature candidate divides the training cuboids at this node into two groups, and the feature candidate with the largest information gain is chosen for the node split. The node then splits into two children nodes and each cuboid is sent to one of them. As the multitask random forest takes action category and cuboid position as two classification tasks, a random number $r_3$ and a prior probability $\eta$ codetermine which task is used to calculate the information gain of a data split.

A node stops splitting when it reaches the limited tree depth $d_{\max}$ or when all samples arriving at it belong to the same action category and position, and it is then regarded as a leaf. Two vectors $P_a$ and $P_s$ are created to store the distributions of action categories and cuboid positions, respectively. Here $P_a(a)$ denotes the posterior probability that a cuboid arriving at the corresponding leaf node belongs to action $a$, and we have $A$ actions in total. Similarly, $P_s(p)$ represents the proportion of cuboids at this leaf node extracted from position $p \in P_m$. Both $P_a$ and $P_s$ of a leaf node are calculated from the training cuboids assigned to it. The construction of a decision tree is summarized in Algorithm 1.

Input: The original training dataset $D_m^v$;
Predefined parameters $d_{\max}$, $\theta_1$, $\theta_2$, and $\eta$;
Output: Decision tree $T_j^v$;
(1) Build a bootstrap dataset $D_{m,j}^v$ by randomly sampling from $D_m^v$ with replacement;
(2) Create a root node, set its depth to 1, and assign all cuboids in $D_{m,j}^v$ to it;
(3) Initialize an unsettled node queue $Q$ and push the root node into $Q$;
(4) while $Q \neq \emptyset$ do
(5) Pop the first node $n$ in $Q$;
(6) if the depth of $n$ is larger than $d_{\max}$ or the cuboids assigned to $n$ belong to the same action and position then
(7) Label node $n$ as a leaf, and then calculate $P_a$ and $P_s$ from the cuboids at node $n$;
(8) Add the triple $(n, P_a, P_s)$ into decision tree $T_j^v$;
(9) else
(10) Initialize the feature candidate set $\Phi = \emptyset$;
(11) if random number $r_1 < \theta_1$ then
(12) Add a set of randomly selected optical flow features to $\Phi$;
(13) else
(14) Add a set of randomly selected HOG3D features to $\Phi$;
(15) end if
(16) if random number $r_2 < \theta_2$ then
(17) Add the two-dimensional temporal context features to $\Phi$;
(18) end if
(19) Set the best gain $G^* = 0$ and generate a random number $r_3$;
(20) for each feature $f \in \Phi$ do
(21) if $r_3 < \eta$ then
(22) Search for the corresponding threshold $\tau_f$ and compute the information gain $G_f$ in terms of the action labels of the cuboids arriving at $n$;
(23) else
(24) Search for the corresponding threshold $\tau_f$ and compute the information gain $G_f$ in terms of the positions of the cuboids arriving at $n$;
(25) end if
(26) if $G_f > G^*$ then
(27) $G^* = G_f$, $f^* = f$, $\tau^* = \tau_f$;
(28) end if
(29) end for
(30) Create a left child node $n_l$ and a right child node $n_r$, set their depth to that of $n$ plus 1, and assign each cuboid arriving at $n$ to $n_l$ or $n_r$ according to $f^*$ and $\tau^*$; then push nodes $n_l$ and $n_r$ into $Q$;
(31) Add the quintuple $(n, f^*, \tau^*, n_l, n_r)$ into decision tree $T_j^v$;
(32) end if
(33) end while
(34) return Decision tree $T_j^v$;
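
To make the split step of Algorithm 1 concrete, the following Python sketch selects feature candidates according to $\theta_1$ and $\theta_2$, picks the task (action versus position) with probability $\eta$, and searches for the candidate with the largest information gain. The candidate count, the median-based threshold search, and the default parameter values are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, actions, positions, of_dim, hog_dim,
               theta1=0.5, theta2=0.3, eta=0.7, n_candidates=20,
               rng=np.random):
    """Sketch of the node-split step of Algorithm 1.

    X is an (n_cuboids, d) matrix whose columns are ordered as
    [optical flow | HOG3D | temporal context]; `of_dim` and `hog_dim` give
    the sizes of the first two blocks.  Parameter defaults and the threshold
    search are illustrative assumptions.
    """
    d = X.shape[1]
    # Choose which feature type supplies the candidate dimensions.
    if rng.rand() < theta1:
        pool = np.arange(0, of_dim)                      # optical flow block
    else:
        pool = np.arange(of_dim, of_dim + hog_dim)       # HOG3D block
    candidates = list(rng.choice(pool, size=min(n_candidates, len(pool)),
                                 replace=False))
    if rng.rand() < theta2:
        candidates += list(range(of_dim + hog_dim, d))   # temporal context

    # Pick the task (action vs. position) that drives the information gain.
    labels = actions if rng.rand() < eta else positions
    base = entropy(labels)

    best = (None, None, -np.inf)                         # (feature, threshold, gain)
    for f in candidates:
        tau = np.median(X[:, f])                         # simple threshold choice
        left = labels[X[:, f] <= tau]
        right = labels[X[:, f] > tau]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = base - (len(left) * entropy(left)
                       + len(right) * entropy(right)) / len(labels)
        if gain > best[2]:
            best = (f, tau, gain)
    return best
```
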
5.2. Construction of Mid-Level Features

During testing, our task is to recognize an action instance by using videos recorded from $V$ views simultaneously. With respect to the multiview subvolumes $S^1, \dots, S^V$, the cuboids $c_p^1, \dots, c_p^V$ sampled at position $p$ are handled by the corresponding multitask random forest $F_m$, satisfying the condition $p \in P_m$. In particular, cuboid $c_p^v$ is dropped down the decision trees of view $v$; that is, a tree only takes the cuboid from its own view as input. Suppose that the input cuboid $c_p^v$ arrives at a leaf node of tree $T_j^v$, with histograms $P_a^{j,v}$ and $P_s^{j,v}$ representing the distributions of action categories and cuboid positions. Then the average distributions voted by all decision trees can be calculated by

$$\bar{P}_a = \frac{1}{V J} \sum_{v=1}^{V} \sum_{j=1}^{J} P_a^{j,v}, \qquad \bar{P}_s = \frac{1}{V J} \sum_{v=1}^{V} \sum_{j=1}^{J} P_s^{j,v}, \tag{1}$$

where $J$ is the number of decision trees per view in $F_m$. The concatenation of $\bar{P}_a$ and $\bar{P}_s$ constitutes an integrated local descriptor of the cuboids $c_p^1, \dots, c_p^V$, denoted by $h_p$. We deal with the cuboids sampled at each position separately and obtain a series of histograms $h_1, h_2, \dots, h_P$. The histograms of all positions are concatenated into a mid-level representation of subvolumes that are simultaneously recorded from multiple perspectives.
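
A minimal sketch of this fusion step is given below. It assumes that `forest[v]` holds the decision trees trained on view v and that each tree exposes a `leaf_distributions` method returning the pair $(P_a, P_s)$ stored at the leaf a cuboid reaches; both names are placeholders for whatever tree implementation is used.

```python
import numpy as np

def integrated_histogram(forest, multiview_cuboids):
    """Fuse the leaf distributions voted for one sampling position into a
    single histogram, following (1).  `forest[v]` and `leaf_distributions`
    are hypothetical names for the underlying tree implementation.
    """
    action_hists, position_hists = [], []
    for v, cuboid in enumerate(multiview_cuboids):
        for tree in forest[v]:
            p_a, p_s = tree.leaf_distributions(cuboid)
            action_hists.append(p_a)
            position_hists.append(p_s)
    avg_a = np.mean(action_hists, axis=0)
    avg_s = np.mean(position_hists, axis=0)
    return np.concatenate([avg_a, avg_s])   # local descriptor h_p
```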

Following [25], the out-of-bag estimate [39] is employed in the construction of mid-level representations during training to alleviate the overfitting problem. As described in Algorithm 1, the decision trees of view $v$ share an original training set $D_m^v$, and about 2/3 of the cuboids in $D_m^v$ constitute the bootstrap training set $D_{m,j}^v$ for tree $T_j^v$. The construction of the local descriptor for training cuboids is the same as that for test cuboids, except that tree $T_j^v$ does not contribute to $h_p$ if the cuboid $c_p^v$ was used to train it. Accordingly, we rewrite (1) as

$$\bar{P}_a = \frac{\sum_{v=1}^{V} \sum_{j=1}^{J} I(c_p^v \notin D_{m,j}^v)\, P_a^{j,v}}{\sum_{v=1}^{V} \sum_{j=1}^{J} I(c_p^v \notin D_{m,j}^v)}, \qquad \bar{P}_s = \frac{\sum_{v=1}^{V} \sum_{j=1}^{J} I(c_p^v \notin D_{m,j}^v)\, P_s^{j,v}}{\sum_{v=1}^{V} \sum_{j=1}^{J} I(c_p^v \notin D_{m,j}^v)}, \tag{2}$$

where $I(\cdot)$ is an indicator function defined by

$$I(x) = \begin{cases} 1, & \text{if } x \text{ holds}, \\ 0, & \text{otherwise}. \end{cases} \tag{3}$$

Similarly, the mid-level representation of training subvolumes is created by concatenating the local descriptors of all positions.
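
The out-of-bag variant can be sketched in the same style; the bookkeeping of which cuboids went into which bootstrap sample (`bags`) is hypothetical and would depend on how the trees are stored.

```python
import numpy as np

def oob_integrated_histogram(forest, multiview_cuboids, cuboid_ids, bags):
    """Out-of-bag variant of (1) for training cuboids, following (2)-(3):
    a tree only votes for a cuboid that was not in its bootstrap sample.
    `bags[v][j]` is assumed to hold the ids of the cuboids used to train
    tree j of view v; this bookkeeping is hypothetical.
    """
    action_hists, position_hists = [], []
    for v, cuboid in enumerate(multiview_cuboids):
        for j, tree in enumerate(forest[v]):
            if cuboid_ids[v] in bags[v][j]:
                continue                      # skip in-bag trees
            p_a, p_s = tree.leaf_distributions(cuboid)
            action_hists.append(p_a)
            position_hists.append(p_s)
    avg_a = np.mean(action_hists, axis=0)
    avg_s = np.mean(position_hists, axis=0)
    return np.concatenate([avg_a, avg_s])
```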

5.3. Action Recognition with Mid-Level Representations

Given the mid-level representations $\{(x_i, y_i)\}$ of training subvolumes, where $x_i$ denotes the integrated feature of the $i$th set of multiview subvolumes and $y_i$ is its action label, we train a random forest classifier [39], which is able to learn multiple categories discriminatively.

For a new action instance consisting of samples $x_1, \dots, x_K$, where each sample is the mid-level representation of one set of multiview subvolumes, all of the decision trees in the random forest classifier vote on the action category of each sample and assign a particular action label $\hat{y}_k$ to $x_k$. According to majority voting, we predict the final action category of this instance as

$$\hat{a} = \arg\max_{a} \sum_{k=1}^{K} I(\hat{y}_k = a), \tag{4}$$

where $I(\hat{y}_k = a)$ is an indicator function; that is, it is 1 if $\hat{y}_k = a$ and 0 otherwise.
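
The voting step can be written compactly as follows; `classifier` stands for any multiclass model with a scikit-learn style `predict` method, such as the trained random forest classifier.

```python
from collections import Counter

def predict_instance(classifier, sample_features):
    """Majority voting over the per-sample predictions of one action
    instance, following (4).  `classifier` is any model with a
    scikit-learn style `predict` method.
    """
    labels = classifier.predict(sample_features)   # one label per sample
    return Counter(labels).most_common(1)[0][0]
```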

6. Experiments

6.1. Human Action Datasets

Experiments are conducted on the multiview IXMAS action dataset [12] and the MuHAVi-MAS dataset [31] to evaluate the effectiveness of the proposed method.

The IXMAS Dataset. It consists of 11 actions performed by 10 actors, including “check watch”, “cross arms”, “scratch head”, “sit down”, “get up”, “turn around”, “walk”, “wave”, “punch”, “kick”, and “pick up”. Five cameras simultaneously recorded these actions from different perspectives, that is, four side views and one top view. This dataset presents an increased challenge since actors can freely choose their position and orientation. Thus, there are large inter-view and intra-view viewpoint variations of human actions in this dataset, which make it widely used to evaluate the performance of multiview action recognition methods.

The MuHAVi-MAS Dataset. It contains 136 manually annotated silhouette sequences of 14 primitive actions: “CollapseLeft”, “CollapseRight”, “GuardToKick”, “GuardToPunch”, “KickRight”, “PunchRight”, “RunLeftToRight”, “RunRightToLeft”, “StandupLeft”, “StandupRight”, “TurnBackLeft”, “TurnBackRight”, “WalkLeftToRight”, and “WalkRightToLeft”. Each action is performed several times by 2 actors and captured by 2 cameras from different views.

6.2. Experimental Setting

Since our method takes human-centered subvolumes recorded from multiple views as input, we utilize a background subtraction technique to obtain human silhouettes and fit a bounding box around each silhouette. In our implementation, a subvolume is composed of 10 successive human-centered bounding boxes which are scaled to a fixed resolution. We densely extract 84 cuboids of a fixed size from each subvolume, and the stride between cuboids is 5 pixels.
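
The dense sampling can be sketched as below. Only the 5-pixel stride and the 10-frame subvolume length are given above, so the cuboid footprint `cuboid_hw` and the choice to let each cuboid span the full temporal extent of the subvolume are assumptions for illustration.

```python
def sample_cuboids(subvolume, cuboid_hw=(16, 16), stride=5):
    """Densely sample spatiotemporal cuboids from one grayscale subvolume of
    shape (T, H, W).  Each cuboid spans the full temporal extent here, and
    `cuboid_hw` is a placeholder footprint; only the 5-pixel stride is
    stated in the text above.
    """
    T, H, W = subvolume.shape
    ch, cw = cuboid_hw
    cuboids, positions = [], []
    pos = 0
    for y in range(0, H - ch + 1, stride):
        for x in range(0, W - cw + 1, stride):
            cuboids.append(subvolume[:, y:y + ch, x:x + cw])
            positions.append(pos)
            pos += 1
    return cuboids, positions
```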

6.3. Experimental Results

We compare our method with state-of-the-art methods on the two datasets, and the experimental results on the IXMAS dataset and the MuHAVi-MAS dataset are illustrated in Tables 1 and 2, respectively. In our experiments, the leave-one-actor-out cross-validation strategy is adopted on both datasets. We execute the random forest classifier 10 times and report the recognition accuracy averaged over the results of the 10 classifiers.
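
A sketch of this evaluation protocol with scikit-learn is shown below. It reports per-sample accuracy rather than the per-instance majority vote of Section 5.3, and the number of trees is an assumption; it is meant only to illustrate the leave-one-actor-out splitting and the averaging over 10 runs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_actor_out_accuracy(X, y, actor_ids, n_runs=10, n_trees=100):
    """Leave-one-actor-out evaluation of the final random forest classifier.
    X is an array of mid-level representations, y the action labels, and
    actor_ids the actor of each sample.  The tree count and per-sample
    scoring are illustrative simplifications of the protocol above.
    """
    logo = LeaveOneGroupOut()
    accuracies = []
    for seed in range(n_runs):
        fold_acc = []
        for train_idx, test_idx in logo.split(X, y, groups=actor_ids):
            clf = RandomForestClassifier(n_estimators=n_trees,
                                         random_state=seed)
            clf.fit(X[train_idx], y[train_idx])
            fold_acc.append(clf.score(X[test_idx], y[test_idx]))
        accuracies.append(np.mean(fold_acc))
    return np.mean(accuracies)
```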

6.3.1. Results on the IXMAS Dataset

As shown in Table 1, our method significantly outperforms all the recently proposed methods for multiview action recognition, which demonstrates the effectiveness of the proposed learning framework based on multitask random forests. The confusion matrix of the multiview action recognition results is depicted in Figure 2. We can observe from Figure 2 that the proposed method achieves promising performance on most actions, among which four actions (i.e., "sit down", "get up", "walk", and "punch") are correctly recognized. Meanwhile, some actions involve similar motion, which may result in misclassification. For example, it is difficult to distinguish the actions "cross arms", "scratch head", and "wave", since they all involve motion of the upper limbs. Similarly, actors crouch in both "sit down" and "pick up", which may be a possible reason for the misclassification of "pick up".

The proposed method is also compared with other methods for single-view action recognition, and the results are summarized in Table 1. In this setting, videos of different views are handled separately, and a mid-level representation is created for each single video. Concretely, we build a set of multitask random forests for each view by using cuboids extracted from subvolumes of that view. Accordingly, all the decision trees in a multitask random forest share an original training dataset composed of cuboids from a single view. It is observed that the proposed method performs better than [26, 28] on all five views. Our method is competitive with [11, 25, 27] on two views and achieves much better performance on the other three views. Moreover, our method is able to outperform the above three methods by fusing videos of multiple views into an integrated representation.

6.3.2. Results on the MuHAVi-MAS Dataset

As is shown in Table 2, our method achieves much better performance than the listed methods on the MuHAVi-MAS dataset. The promising results demonstrate the effectiveness of the proposed method. Figure 3 reveals the confusion matrix of our results. We can see that our method can correctly identify eleven of the fourteen actions, that is, “CollapseLeft”, “KickRight”, “PunchRight”, “RunLeftToRight”, “RunRightToLeft”, “StandupLeft”, “StandupRight”, “TurnBackLeft”, “TurnBackRight”, “WalkLeftToRight”, and “WalkRightToLeft”. It is also observable that our method does not do well in distinguishing “GuardToKick” and “GuardToPunch”, since they have very similar motion.

6.4. Effects of Parameters

In this section, we evaluate the effect of two parameters in the construction of multitask random forest on the IXMAS dataset.

In consideration of the computation cost, we limit the maximum depth of each decision tree, and Figure 4 depicts the accuracy of both single-view action recognition on the five perspectives and multiview action recognition under different tree depths. Generally, the curves first rise and then decline slightly as the tree depth increases. One possible reason is that large decision trees may overfit the training data. Meanwhile, it is interesting to observe that the depth at which the forests begin to overfit varies across views. Furthermore, we can observe from Figure 4 that the multiview action recognition method achieves better performance than single-view action recognition at all tree depths, which demonstrates the effectiveness of the multiview fusion scheme.

Another key parameter of our method is the prior probability $\eta$. Our multitask random forest is designed to solve two tasks, action recognition and position classification. For the purpose of node splitting, we introduce the parameter $\eta$ to select a task for calculating the information gain of a data split. More concretely, $\eta$ denotes the probability that the action recognition task is selected at each node. We tune the value of $\eta$ to investigate how it affects the performance and summarize the action recognition results in Figure 5. It should be noted that the multitask random forest reduces to a random forest that takes action recognition as its only task if $\eta$ is set to 1. From Figure 5 we can observe that the action recognition results of the multitask random forest (i.e., $\eta < 1$) are better than those of the single-task random forest (i.e., $\eta = 1$), which demonstrates the effectiveness of our multitask random forest learning framework.

7. Conclusion

We presented a learning framework based on multitask random forests to exploit a discriminative mid-level representation for videos from multiple views. Our method starts from multiview subvolumes with fixed size, each of which is composed of continuous human-centered figures. Densely sampled spatiotemporal cuboids are extracted from the subvolumes, and three types of low-level descriptors are utilized to capture the motion, appearance, and temporal context of each cuboid. Then a multitask random forest is built upon cuboids from multiple views that are sampled at four adjacent positions, taking action category and position as two tasks. Each cuboid is classified by its corresponding random forest, and a fusion strategy is employed to create an integrated histogram describing the cuboids sampled at a certain position of the multiview subvolumes. The concatenation of the histograms of different positions is utilized as a mid-level representation for subvolumes simultaneously recorded from multiple views. Experiments on the IXMAS and MuHAVi-MAS action datasets show that the proposed method achieves promising performance.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China (NSFC) under Grants no. 61602320 and no. 61170185, Liaoning Doctoral Startup Project under Grants no. 201601172 and no. 201601180, Foundation of Liaoning Educational Committee under Grants no. L201607, no. L2015403, and no. L2014070, and the Young Scholars Research Fund of SAU under Grant no. 15YB37.