Human action recognition has the potential to predict the activities of an instructor within the lecture room. Evaluation of lecture delivery can help teachers analyze shortcomings and plan lectures more effectively. However, manual or peer evaluation is time-consuming, tedious and sometimes it is difficult to remember all the details of the lecture. Therefore, automation of lecture delivery evaluation significantly improves teaching style. In this paper, we propose a feedforward learning model for instructor’s activity recognition in the lecture room. The proposed scheme represents a video sequence in the form of a single frame to capture the motion profile of the instructor by observing the spatiotemporal relation within the video frames. First, we segment the instructor silhouettes from input videos using graph-cut segmentation and generate a motion profile. These motion profiles are centered by obtaining the largest connected components and normalized. Then, these motion profiles are represented in the form of feature maps by a deep convolutional neural network. Then, an extreme learning machine (ELM) classifier is trained over the obtained feature representations to recognize eight different activities of the instructor within the classroom. For the evaluation of the proposed method, we created an instructor activity video (IAVID-1) dataset and compared our method against different state-of-the-art activity recognition methods. Furthermore, two standard datasets, MuHAVI and IXMAS, were also considered for the evaluation of the proposed scheme.

1. Introduction

A tremendous amount of video sequences is generated every day from CCTV cameras, YouTube, surveillance systems, the entertainment industry, and academic institutes. Manual analysis of visual information is time-consuming and error-prone. In an era of advanced computer vision technology, it is possible to use automatic visual understanding methods to understand the visual semantics of the classroom. Teaching effectiveness is a fundamental concept in contemporary education, valued by academic institutions as a goal in its own right. Some researchers have explored human pose recognition techniques using handcrafted features for estimating the instructor’s activities in the classroom [14], such as walking, writing, pointing towards the board, standing, addressing the class, and pointing towards presentations. Silhouette representation is often computationally less expensive but demands precise segmentation of human silhouettes for pose estimation, and such techniques have primarily focused on handcrafted representations of spatial information; the temporal anchoring of spatial frames is not incorporated in these methods.

The literature on Human Activity Recognition (HAR) can be grouped into traditional (handcrafted) and deep learning action representations. Handcrafted spatiotemporal features encode the appearance and movement profile of the actor for better action prediction [57]. For instance, Dollar et al. [5] propose mapping 2D to 3D spatiotemporal interest points as cuboid descriptors for action prediction, while Wang and Schmid [7] establish improved dense trajectories (iDT) by estimating and compensating for camera motion. There are various types of spatiotemporal features for generalizing action recognition, including spatiotemporal features [1], 3D-SIFT [8], HOG3D [9], extended SURF [10], iDT [7], histogram of optical flow (HOF) [11], and motion boundary histogram (MBH) [12]. Describing iDT with MBH, HOG, and HOF has shown better prediction of activities on benchmark datasets (UCF101 [13], HMDB [14], and THUMOS [15]). However, trajectory information is sometimes degraded by large variations among action categories, and it can be argued that such features are not suitable for realistic HAR tasks [16].

Recently, deep learning approaches for HAR have modified 2D CNN models to capture 3D spatiotemporal action representations. Ji et al. [17] extend the 2D-CNN into a 3D-CNN using neighboring frame information. However, the performance of the 3D-CNN is only comparable with that of a 2D-CNN modeled for image category recognition tasks [18]. Simonyan et al. [19] suggest a two-stream ConvNet to encode frame appearance information combined with motion information; the two-stream ConvNet's performance is comparable with that of iDT on UCF101 and HMDB51. Combining handcrafted and deep features has improved action prediction rates [18], because sparse handcrafted spatiotemporal features are encoded into a deeper representation that recognizes activities more accurately. 2D and 3D CNNs for activity recognition are trained by backpropagation to reduce the classification loss and sometimes suffer from overfitting, since for action recognition applications the amount of training data is often too small to establish a generic model. Some of the standard HAR datasets are UCF-101 [13], comprising 13K videos of 101 action classes, MuHAVI-Uncut [20], which consists of 2898 videos of 17 classes, and HMDB-51 [14], consisting of 3.7K videos of 51 categories.

The visual information of the classroom holds significant cues for providing genuine feedback on lecture effectiveness [21, 22]. We believe that computer vision techniques are beneficial for automating a fair evaluation of the instructor [3]. In this work, we propose a feedforward learning model to recognize eight activities of an instructor within the lecture room. In the proposed technique, instructor silhouettes are segmented from the static background using graph-cut segmentation. The silhouettes of each video frame are used to encode the spatial and temporal dynamics of instructor activities through motion profiles, which store the instructor’s spatiotemporal contextual information from each video sequence in a single template. These motion profiles are then used to compute deep features by applying deep convolutional operations, inducing nonlinearity in the deep spatiotemporal representation. Finally, the learned features are presented to an Extreme Learning Machine (ELM) to generate a feedforward model for instructor activity recognition. An overview of the proposed model is shown in Figure 1.

At times, transfer learning from pretrained CNN models suffers from poor performance and overfitting due to lack of data. In contrast, our proposed technique learns a deep spatiotemporal action representation and performs fast and accurate action recognition, even with limited data. We elaborate on this contribution using our IAVID-I dataset. The proposed technique relies on feedforward learning and performs better than backpropagation CNN models for action recognition. Moreover, the motion profile effectively reduces the video’s spatiotemporal and computational complexity.

The deep features capture a high-level discriminative representation from the motion profile for the prediction of instructor actions. The performance of the proposed technique has been evaluated on our recorded single-view IAVID-I dataset for instructor activity recognition and also on benchmark multiview activity recognition datasets (MuHAVi, IXMAS). To the best of our knowledge, such a feedforward model has not been reported before for the action prediction task. The proposed technique achieved higher prediction scores on MuHAVI-Uncut (2898 videos) compared to state-of-the-art techniques.

Our contributions are summarized as follows:
(i) We propose a new feedforward learning model for fast and accurate instructor activity recognition.
(ii) The proposed fast feedforward technique can learn deep features from any kind of CNN model.
(iii) The technique can be applied to silhouette-based activity recognition applications.
(iv) We show that the proposed approach can be used for multiview human action recognition. This contribution is explained further in the experiments section using a standard multiview HAR dataset (MuHAVI-Uncut).

2. Proposed Method

The proposed technique consists of a three-step process, as shown in Figure 1. First, we extract the instructor silhouettes f from each video frame to generate a cumulative spatiotemporal instructor motion profile M_f. Next, we learn the deep spatiotemporal features x from M_f. Finally, we present these deep spatiotemporal features to an ELM to recognize the instructor’s activities into eight action classes.

2.1. Silhouettes Segmentation and Spatiotemporal Motion Profile Formation

The instructor silhouettes f are segmented from the RGB videos using graph-cut segmentation [23], with each pixel p_ab of a video frame generating a corresponding vertex v_ab of the graph. The foreground silhouette f and the static background B of the lecture room are represented by two additional terminal vertices. The weights on the links between the pixel vertices and the foreground f and background B terminals are derived from the difference between the background and the current frame at the corresponding pixel q_ab:

$$w(v_{ab}, f) = \begin{cases} 1, & \text{if } \lvert q_{ab} - B_{ab} \rvert > \tau, \\ 0, & \text{otherwise,} \end{cases} \qquad w(v_{ab}, B) = 1 - w(v_{ab}, f),$$

where τ is a threshold parameter that determines the association of q_ab with the instructor silhouette f or the lecture room background B. The instructor silhouettes f_t (t = 1, 2, …, N) segmented from each video frame are used to generate a single instructor motion profile M_f that encodes the spatiotemporal movement information at time t:

$$M_f(a, b, t) = \begin{cases} N, & \text{if } f_t(a, b) = 1, \\ \max\bigl(0,\; M_f(a, b, t-1) - 1\bigr), & \text{otherwise.} \end{cases}$$

Here, N is the total number of frames used to generate M_f for every action video. All the resulting M_f are normalized and rescaled to predefined dimensions before being presented to the deep CNN models.
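The following is a minimal sketch of the motion profile construction, assuming per-frame binary silhouettes are approximated by simple background differencing rather than the full graph-cut formulation used in the paper; all function and variable names are illustrative only.

```python
import cv2
import numpy as np

def silhouette(frame_gray, background_gray, tau=30):
    """Approximate foreground silhouette by thresholding the absolute
    difference between the current frame and the static background.
    (The paper uses graph-cut segmentation; this is a simpler stand-in.)"""
    diff = cv2.absdiff(frame_gray, background_gray)
    _, mask = cv2.threshold(diff, tau, 1, cv2.THRESH_BINARY)
    return mask.astype(np.uint8)

def motion_profile(video_path, background_gray, out_size=(227, 227)):
    """Accumulate the silhouettes of one action video into a single
    motion-history-style profile M_f, then min-max normalize and resize."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()

    n = len(frames)                      # N: total number of frames
    m_f = np.zeros_like(background_gray, dtype=np.float32)
    for frame_gray in frames:
        f_t = silhouette(frame_gray, background_gray)
        # pixels on the current silhouette take the maximum value N,
        # older silhouette pixels decay by 1 per frame
        m_f = np.where(f_t == 1, float(n), np.maximum(0.0, m_f - 1.0))

    # normalize to [0, 1] and rescale to the CNN input size
    m_f = (m_f - m_f.min()) / (m_f.max() - m_f.min() + 1e-8)
    return cv2.resize(m_f, out_size)
```

In this way an entire video sequence is collapsed into one template whose intensity encodes how recently each pixel belonged to the instructor silhouette.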

2.2. Spatiotemporal Deep Feature Learning

After obtaining M_f, deep representations x of instructor actions are generated from M_f through a CNN. In our algorithm, we have adapted Alexnet [24] and VGG19 [19]; the resulting features are denoted as x17, x20, and x23 (extracted from Alexnet with 1x4096, 1x4096, and 1x1000 dimensions) and x39, x42, and x45 (extracted from VGG19 with 1x4096, 1x4096, and 1x1000 dimensions). The visual representation of the deep spatiotemporal features is illustrated in Figure 2. The subscript of x represents the layer depth used for computation of the spatiotemporal features. The features x are normalized through the min-max normalization algorithm, x̂ = (x − min(x)) / (max(x) − min(x)) (eq. (4)). Implementation details and observed results are discussed in the Results section.
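As an illustration, the sketch below extracts a 4096-dimensional fully connected feature from a pretrained AlexNet for one motion profile and applies min-max normalization; the assumption that the second fully connected layer corresponds to the paper's x17, and the PyTorch/torchvision usage, are ours and not the authors' implementation.

```python
import numpy as np
import torch
import torchvision

# ImageNet-pretrained AlexNet used as a fixed feature extractor only
# (no transfer learning / fine-tuning, matching the feedforward setting).
model = torchvision.models.alexnet(weights="IMAGENET1K_V1")  # 'pretrained=True' on older torchvision
model.eval()

def deep_features(motion_profile):
    """motion_profile: HxW float array in [0, 1] (the normalized M_f).
    Returns a 4096-d vector from the second fully connected layer
    (assumed here to correspond to the paper's x17)."""
    # replicate the single-channel template into 3 channels for the CNN;
    # ImageNet mean/std normalization is omitted for brevity
    img = np.repeat(motion_profile[None, :, :], 3, axis=0).astype(np.float32)
    x = torch.from_numpy(img).unsqueeze(0)          # shape: 1x3xHxW
    with torch.no_grad():
        x = model.features(x)
        x = model.avgpool(x)
        x = torch.flatten(x, 1)
        x = model.classifier[:6](x)                 # stop after the second fc layer
    return x.squeeze(0).numpy()

def min_max_normalize(x):
    """Min-max normalization of a feature vector (eq. (4))."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)
```

Swapping in VGG19 would follow the same pattern, with its own classifier layers providing the x39, x42, and x45 counterparts.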

2.3. Extreme Learning Machine as a Classifier

The extreme learning machine is a feedforward learning algorithm that uses a single hidden layer of a neural network and is usually known as a Single Layer Feedforward Neural Network (SLFN) [24, 25]. In this work, we investigate it for recognizing instructor activities, something not reported in the literature to date [25, 26]. The ELM output used to classify the instructor activity recognition problem with L hidden nodes is described as

$$f_L(x) = \sum_{i=1}^{L} \beta_i\, G(w_i, b_i, x) = h(x)\beta,$$

where β = [β_1, …, β_L]^T are the output weights between the L hidden nodes and the output vectors, and h(x) = [G(w_1, b_1, x), …, G(w_L, b_L, x)] is the hidden-layer output for input x. The classification decision function of ELM with the logistic sigmoid and hyperbolic tangent sigmoid activation functions is expressed as follows:

The logistic sigmoid transforms the input at each hidden neuron and generates a nonlinear output within the interval (0, 1):

$$G(z) = \frac{1}{1 + e^{-z}}, \tag{7a}$$

Another activation function is the hyperbolic tangent ‘tanh’ function, whose output lies within (−1, 1):

$$G(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}. \tag{7b}$$

Selection of optimal training parameters is effective in reducing the ELM classification error [25, 26]. The ELM’s input weights are produced randomly using any continuous distribution function, whereas the weights at the output nodes are obtained from a minimum-norm linear system. In the proposed technique, x is an NxD matrix of deep spatiotemporal features of dimension D, where N is the number of training samples; w is a DxL matrix that represents the links between the ELM’s input layer and hidden layer; b is an NxL bias matrix; and H is an NxL matrix known as the hidden matrix, where G(.) is a piecewise nonlinear continuous function satisfying the ELM universal approximation capability theorem. In our proposed technique, we evaluated G(.) with the logistic sigmoid and the hyperbolic tangent sigmoid activation functions. The output weight matrix β has dimensions LxC.

H† is the Moore-Penrose generalized inverse of H. The matrix T has dimensions NxC, is referred to as the target label matrix, and holds the label vectors of the training examples in a one-hot encoding scheme, where C is the number of instructor activity classes. ELM optimizes the classification process to target generalized performance with minimum training error and minimum norm of the output weights using the following objective function:

$$\min_{\beta}\ \lVert H\beta - T \rVert^{2} \quad \text{and} \quad \lVert \beta \rVert.$$

Here, H is the output matrix of the hidden layer:

$$H = G(xw + b).$$

Minimizing the norm of the output weights maximizes the margins, strengthening the decision boundary among the eight instructor action classes within the ELM feature representation h(x). The output weights are obtained with a minimal least-squares method as

$$\beta = H^{\dagger} T,$$

where H† is the generalized Moore-Penrose inverse matrix, calculated through orthogonal projection or the singular value decomposition method. The working of ELM in the proposed technique is expressed in Algorithm 1.

Input: Deep spatiotemporal features x, target label matrix T, number of hidden nodes L,
and activation function G. Let w be the weight matrix between the ELM input layer and
the hidden layer, b the bias vector, β the output weight matrix, Y the predicted output
vector, H the output matrix of the hidden layer, and H† the generalized Moore-Penrose
inverse matrix of H.
Output: ELM parameters w, b, β, and prediction response Y.
Generate w and b randomly
Compute H = G(xw + b)
Compute β = H†T
Compute Y = Hβ
Return w, b, β, Y
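A minimal NumPy sketch of Algorithm 1 is given below, written under the assumption that the features are already min-max normalized and the labels are one-hot encoded; it illustrates the standard ELM procedure rather than the authors' exact implementation.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation G, eq. (7a)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(x, t, num_hidden, activation=sigmoid, seed=0):
    """Train a single-layer feedforward ELM.
    x: NxD deep spatiotemporal features, t: NxC one-hot target labels."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    w = rng.standard_normal((d, num_hidden))     # random input weights (DxL)
    b = rng.standard_normal((1, num_hidden))     # random biases
    h = activation(x @ w + b)                    # hidden layer output H (NxL)
    beta = np.linalg.pinv(h) @ t                 # output weights via Moore-Penrose inverse
    return w, b, beta

def predict_elm(x, w, b, beta, activation=sigmoid):
    h = activation(x @ w + b)
    y = h @ beta                                 # Y = H * beta
    return np.argmax(y, axis=1)                  # predicted class indices
```

With 4096-dimensional x17 features, training reduces to a single random projection followed by one pseudo-inverse computation, which is what makes the learning purely feedforward.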

3. Results and Discussion

In this section, we describe a series of tests performed to evaluate our approach. The following analyses are carried out:
(i) Examination of the impact of the number of ELM hidden nodes on action recognition.
(ii) Quantitative analysis.
(iii) Comparison with other state-of-the-art methods.

3.1. Datasets

Our investigation includes three different action recognition datasets: our recorded single-view dataset IAVID-I and two standard multiview datasets (MuHAVI-Uncut and IXMAS), to evaluate the performance of the proposed technique. Some sample action frames are shown in Figure 3. These datasets are described as follows.

3.1.1. IAVID-I

We constructed the Instructor Activity Video Dataset-I (IAVID-I) to evaluate the proposed scheme. Twelve actors participated in data recording in a realistic lecture room environment, with the camera focusing on the stage. The dataset contains 100 24-bit RGB videos at a high resolution of 1088x1920 pixels. The twelve subjects perform the eight instructor actions, so approximately 12 instances of each action class are present in the dataset. The dataset comprises the following actions: interacting or idle, pointing towards the board, pointing towards the screen, using a mobile phone, using a laptop, reading notes, sitting, walking, and writing on the board, as demonstrated in Figure 3. IAVID-I, publicly available for academic research, is the first attempt whose primary goal is to contribute resources for instructor activity recognition. Our dataset will support researchers in testing their algorithms for understanding the semantic information within the lecture room. IAVID-I can be a valuable source for algorithm assessment, evaluation, and comparison with state-of-the-art methods.

3.1.2. MuHAVI-Uncut

The MuHAVi-Uncut dataset is a multiview activity recognition dataset. It contains 17 activities performed by 14 actors, with sequences of varying durations. Eight CCTV cameras were mounted at 45° view intervals to capture each action sequence. MuHAVI-Uncut is a large video dataset (2898 videos) and provides segmented silhouettes of a single actor.

3.1.3. INRIA Xmas (IXMAS)

The INRIA Xmas Motion Acquisition Sequence (IXMAS) dataset contains 12 activities (cross arms, check watch, sit down, scratch head, get up, wave, walk, turn around, kick, punch, point, and pick up). Twelve actors perform these activities three times each. The dataset was captured from five different views. The frame resolution of each video sequence is 390x291 pixels.

The proposed technique requires the actor’s silhouettes to form motion templates, as better-segmented silhouettes form better MHIs; therefore, MuHAVI-Uncut and IXMAS are the most suitable silhouette datasets for evaluating the proposed technique. Another benefit of using them for evaluation is that MuHAVI-Uncut and IXMAS allow us to examine the performance of action prediction in a multiview setting, and all the actions in the MuHAVI-Uncut and IXMAS datasets are performed by a single actor against a static background, a scenario similar to a single instructor demonstrating in the class.

3.2. Experimental Setup

Leave-one-actor-out (LOAO), leave-one-camera-out (LOCO), and leave-one-sequence-out (LOSO) validation schemes are employed in our experiments to evaluate the performance of the proposed model. These schemes define the training and testing splits. For example, in LOAO all the action sequences of one actor are used for testing and the remaining sequences are used for training. This process is repeated for all actors and the average performance of the system is recorded. Similarly, in LOCO, action sequences from one camera view are used for testing while the remaining sequences are used for training. In LOSO, all the action sequences are considered for training except for one sequence, which is used as the testing sample. The reported accuracies are the average values for each of these experiments.
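As an illustration of these protocols, the sketch below generates the corresponding splits with scikit-learn's LeaveOneGroupOut and LeaveOneOut; the group arrays (actor and camera identifiers per sequence) are assumed to be available from the dataset metadata.

```python
from sklearn.model_selection import LeaveOneGroupOut, LeaveOneOut

def loao_splits(features, labels, actor_ids):
    """Leave-one-actor-out: all sequences of one actor form the test set."""
    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(features, labels, groups=actor_ids):
        yield train_idx, test_idx

def loco_splits(features, labels, camera_ids):
    """Leave-one-camera-out: all sequences from one camera view form the test set."""
    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(features, labels, groups=camera_ids):
        yield train_idx, test_idx

def loso_splits(features):
    """Leave-one-sequence-out: a single sequence is held out each time."""
    loo = LeaveOneOut()
    for train_idx, test_idx in loo.split(features):
        yield train_idx, test_idx
```

The per-fold accuracies are then averaged, matching the reporting convention described above.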

3.3. Impact of Number of ELM Hidden Nodes

ELM does not depend upon iterative or backpropagation approaches to adjust the weights and biases of SLFNs; rather, it analytically estimates the suitable SLFN parameters, using the universal approximation capability with random hidden nodes to establish a generalized learning model. However, the selection of the number of hidden nodes as a training parameter is effective in reducing classification error [25, 26]; we therefore examined its effect on the behavior of the proposed system.

The parameter ‘number of ELM hidden nodes’ was chosen using a grid search after empirical analysis of the proposed technique on the IAVID-I, MuHAVI-Uncut, and IXMAS single- and multiview datasets over a predefined interval. We empirically examined the deep spatiotemporal features from various depths of two types of CNN models (i.e., Alexnet and VGG19) and, in light of our observations, x17 and x39 yielded better action recognition. It can be noticed from Figure 4 that the deep representation from x17 performed better for both kinds of decision functions, whereas x39 shows slight variation in performance. The logistic sigmoid decision function performed better than the hyperbolic tangent sigmoid for a given deep spatiotemporal representation x. The best number of ELM hidden nodes for the IAVID-I dataset is 1300 for x17 and 500 for x39. However, for the MuHAVI-Uncut and IXMAS datasets the optimal number of ELM hidden nodes is chosen as 129,060 for the x17 and x39 features.
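A simple grid search over the number of hidden nodes could look like the sketch below, reusing the hypothetical train_elm/predict_elm helpers from the ELM sketch above; the candidate grid shown is illustrative, not the interval used in the paper.

```python
import numpy as np

# assumes train_elm / predict_elm from the ELM sketch above are in scope

def grid_search_hidden_nodes(x_train, t_train, x_val, y_val, candidates):
    """Pick the number of ELM hidden nodes that maximizes validation accuracy.
    t_train: one-hot training labels, y_val: integer validation labels."""
    best_l, best_acc = None, -1.0
    for l in candidates:
        w, b, beta = train_elm(x_train, t_train, num_hidden=l)
        acc = np.mean(predict_elm(x_val, w, b, beta) == y_val)
        if acc > best_acc:
            best_l, best_acc = l, acc
    return best_l, best_acc

# illustrative grid; the paper selects, e.g., 1300 nodes for x17 on IAVID-I
candidate_nodes = [100, 500, 900, 1300, 1700, 2100]
```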

Some of the key findings observed from this experiment are listed as follows:
(i) The choice of ELM activation function plays a significant role in the recognition rate. In this experiment, the logistic sigmoid activation function performed better than the hyperbolic tangent sigmoid. One explanation of this behavior is that ELM establishes different probability distributions with different activation functions: the mappings of the deep spatiotemporal features under different activation functions differ, which directly affects the recognition rate of ELM. In other words, our experiment validates that the logistic sigmoid provides a more meaningful nonlinear mapping of the deep spatiotemporal features than the hyperbolic tangent.
(ii) This parameter also depends upon the amount of data: since IAVID-I consists of only 100 videos, a small number of hidden nodes is required, whereas the MuHAVI-Uncut and IXMAS datasets are larger than IAVID-I, so a greater number of hidden nodes is needed.
(iii) Adding more hidden nodes may not always result in better performance, and a random selection of hidden nodes may lead the learning model to suffer from overfitting or underfitting. Therefore, an incremental selection of hidden nodes, followed by pruning of unnecessary hidden nodes, helps obtain a better network model.

After the empirical selection of the type of decision function and the number of hidden nodes, these parameters remain the same for the rest of the experiments.

3.4. Quantitative Analysis of Proposed Technique

The IAVID-I dataset was evaluated using the LOSO, LOAO, and 70-30 training-testing split validation schemes, and recognition rates of 82.19%, 67.98%, and 81.43%, respectively, were recorded for the eight instructor actions using x17.

It can be observed from Table 1 that the deep spatiotemporal features x17 and x39 performed better than the other variants x20, x23, x42, and x45 computed from the same CNN models under the LOAO scheme. This empirical analysis indicates that the shallower fully connected layers of a network provide a richer representation of the action classes than the higher layers: at higher layers of a CNN, some features are dropped due to compression of the representation by pooling and dropout layers. Moreover, the LOAO validation scheme illustrates the strength of the proposed technique for person-independent HAR, since in LOAO all the action sequences of one actor are used for testing and the remaining are used for training, repeating the process for all actors and recording the average performance. Even with the representation of one actor missing from training, the proposed technique achieves an accuracy of 67.98%.

From these results, we can infer that the features x17 and x39 are reasonable choices for examining performance under the LOSO validation scheme. The average recognition rate at LOSO is 82.19%.

From Table 2, it is observed that the number of hidden nodes is a significant parameter in tuning the performance of the proposed technique, as the same deep spatiotemporal representations x17 and x39 produce different prediction rates at different numbers of hidden nodes, peaking at 2100. Increasing the number of hidden nodes further decreases accuracy, due to the change in the probability distribution of the decision boundaries.

Similarly, from Table 3 it is observed that the spatiotemporal deep representations x17 and x39 performed better than the other variants of deep spatiotemporal representation computed from the same CNN models. It can also be observed from Table 3 that the recognition rate is higher when using the x17 and x39 representations for randomly sampled 70-30 training-testing splits. The results of LOAO and LOSO on IAVID-I illustrate the strength of the proposed technique for activity recognition. We have compared the performance of the proposed technique with state-of-the-art methods, elaborated in detail in the comparison section.

From the confusion matrix for the 70-30 validation split in Figure 5(a), the per-class recognition rates for ‘Interacting or idle’, ‘Pointing towards board or screen’, ‘Pointing students’, ‘Sitting’, ‘Using laptop’, ‘Using phone’, ‘Walking’, and ‘Writing on board’ are 44.4%, 57.1%, 50%, 100%, 100%, 100%, 100%, and 100%, respectively. The average accuracy is 81.43%.

Similarly, from the confusion matrix for LOAO validation in Figure 5(b), the per-class recognition rates for ‘Interacting or idle’, ‘Pointing towards board or screen’, ‘Pointing students’, ‘Sitting’, ‘Using laptop’, ‘Using phone’, ‘Walking’, and ‘Writing on board’ are 50%, 62.5%, 16.7%, 100%, 66.7%, 73.3%, 85.7%, and 88.9%, respectively. The instructor action class ‘Pointing students’ achieved the lowest recognition rate because it is visually similar to the actions ‘Interacting or idle’, ‘Pointing towards board or screen’, and ‘Using phone’. The average accuracy of LOAO is 67.98%.

In the case of LOSO, the per-class recognition rates for ‘Interacting or idle’, ‘Pointing towards board or screen’, ‘Pointing students’, ‘Sitting’, ‘Using laptop’, ‘Using phone’, ‘Walking’, and ‘Writing on board’ are 91.7%, 80.0%, 85.7%, 61.5%, 100%, 100%, 78.6%, and 60%, respectively, as depicted in Figure 5(c). The average accuracy of LOSO is 82.19%.

3.4.1. Comparison between Backpropagation CNN Model and Feedforward Proposed Technique

In this section, we elaborate on the effectiveness of our proposed technique compared to CNN models. We examined the performance when the deep spatiotemporal features x are used to train a backpropagation CNN model and the feedforward ELM. The recorded results are presented in Table 4. The inference time of the backpropagation CNN is higher than that of the proposed technique, which requires 1.5 milliseconds to recognize an activity, whereas the backpropagation CNN requires 50 ms (26.32 times slower). Training and testing of the backpropagation CNN model on 100 video sequences take 900 seconds; on average it takes 0.05 seconds per sequence to recognize an action, i.e., a frame rate of 1200 frames/second (FPS). By contrast, our proposed technique recognizes actions at a frame rate of 40,000 FPS.

Backpropagation CNN also requires a considerable amount of time to reduce the training error during model learning, depending on the amount of data. Our proposed technique performs better when we increase the number of ELM nodes, but the same is not true for a backpropagation CNN. A backpropagation CNN model requires tuning a large number of parameters and usually suffers from overfitting when the amount of data is small, as with the IAVID-1 dataset. Therefore, we preferred to use smaller networks like Alexnet and VGG19 for computing spatiotemporal features from the motion profiles. The proposed technique can, however, extract deep features from any kind of CNN model. In our proposed scheme, deep features are extracted from the CNN model in feedforward mode without transfer learning.

The results in Table 4 show that the spatiotemporal features x17 and x39 recognize activities more accurately when ELM is used as a classifier than with backpropagation CNN models. These results confirm that the proposed technique outperforms the backpropagation CNN models with respect to both prediction accuracy and computational time.

3.4.2. Comparison of Standard ELM with Variants of ELM

We evaluated the performance of the standard ELM classifier used in this paper against various variants of the ELM classifier, i.e., minimum class variance ELM (MCV-ELM) [29], minimum variance ELM (MV-ELM) [30], self-adaptive evolutionary ELM (SADE-ELM) [28], and regularized ELM (R-ELM) [27]. MCV-ELM [29] and MV-ELM [30] were introduced to improve the intraclass variance among fine-grained activities and to address unbalanced data in activity recognition; to improve the intraclass variation among the action classes, they employ clustering-based discriminant analysis and the chi-square distance within the action class scatter matrix. SADE-ELM [28] is an evolutionary variant of ELM that optimizes the hidden weights and biases along with the input weights through a differential evolutionary algorithm, while regularized ELM [27] explores structural minimization of data outliers to reduce model overfitting without increasing the computational time.

To examine the behavior of standard ELM, MCV-ELM, MV-ELM, R-ELM, and SADE-ELM, the operational parameters were kept the same for a coherent evaluation. The deep spatiotemporal features x17 extracted from the motion templates are used for model learning of the standard ELM and the other ELM variants. The number of hidden nodes L is set to 4096, matching the dimension of x17 extracted from Alexnet’s first fully connected layer (1x4096).

The ELM and R-ELM cost parameter C1 and kernel parameter γ are determined empirically within predefined ranges. C1 is a regularization coefficient introduced to trade off the training error against the norm of the output weights, whereas γ is a constant that is usually greater than the norm of the interconnection matrix of the ELM hidden layer and the bias vector [44]. Similarly, the operational cost parameter C1 and the regression regularization parameter C2 for MCV-ELM and MV-ELM are chosen empirically within predefined ranges. The parameter C2 for MCV-ELM and MV-ELM helps determine the output weights, which are significant in estimating the trade-off between the training errors and the dispersion of the training vectors in the scatter matrix; higher dispersion enables stronger decision boundaries and reduces outliers. The recognition performance of ELM and its variants is sensitive to the combination of the regularization parameters C1, C2, and γ, and the optimal combination is dataset specific, achieved within a narrow range for model generalization.
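For reference, regularized ELM replaces the plain pseudo-inverse with a ridge-style solution for the output weights; the sketch below shows this standard closed form under the assumption that C1 plays the usual role of the regularization coefficient (the variant-specific MCV-ELM/MV-ELM scatter terms are not reproduced here).

```python
import numpy as np

def train_regularized_elm(x, t, num_hidden, c1, activation=None, seed=0):
    """Regularized ELM: beta = (H^T H + I/C1)^(-1) H^T T.
    x: NxD features, t: NxC one-hot targets, c1: regularization coefficient."""
    if activation is None:
        activation = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic sigmoid
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.shape[1], num_hidden))
    b = rng.standard_normal((1, num_hidden))
    h = activation(x @ w + b)
    identity = np.eye(num_hidden)
    beta = np.linalg.solve(h.T @ h + identity / c1, h.T @ t)
    return w, b, beta

# illustrative parameter grids for (C1, gamma); the paper reports (2^-6, 10^-6) as optimal
c1_grid = [2.0 ** p for p in range(-6, 7)]
gamma_grid = [10.0 ** p for p in range(-6, 7)]
```

A grid search over these parameter combinations, analogous to the hidden-node search above, would reproduce the kind of sensitivity analysis reported in Tables 5 and 6.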

The comparison presented in Table 5 highlights the effectiveness of the proposed approach. The performance of the standard ELM for learning the instructor action classes is comparable with R-ELM when the high-dimensional deep spatiotemporal action representation x17 is used for model learning. However, the performance of SADE-ELM, MCV-ELM, and MV-ELM is low for instructor action recognition with the deep spatiotemporal action representation on the IAVID-1 dataset, although MCV-ELM and MV-ELM are slightly lower but comparable on the MuHAVI-Uncut dataset under the LOCO validation scheme. The reason for the performance gain on MuHAVI-Uncut compared to IAVID-1 is that the number of samples per class in MuHAVI-Uncut is higher; this also highlights that these variants depend more strongly on the class sample rate than the standard ELM does. Moreover, the computational time for model learning of SADE-ELM, MCV-ELM, and MV-ELM (presented in Table 5) is also higher than for standard ELM and R-ELM, due to the generation of data-driven hidden node weights and biases. From Table 6 we can observe that the optimal combination (C1, γ) for ELM and R-ELM is (2^-6, 10^-6), whereas the optimal combination (C1, C2) for MCV-ELM and MV-ELM is (2^3, 10^3) for better action recognition.

3.4.3. Comparison with State-of-the-Art Methods

In this section, we summarize the findings of our proposed technique compared to state-of-the-art techniques on the IAVID-1 dataset and on publicly available multiview action recognition datasets (MuHAVI-Uncut and IXMAS), as illustrated in Table 7.

The proposed technique outperforms other silhouette-based techniques in terms of recognition precision. To this end, we analyzed and compared our technique with the methods in [17, 20, 31, 32] on the IAVID-I dataset. The C3D features computed from the RGB IAVID-1 videos produce 48.77% and 40% prediction accuracy using SVM and CNN, respectively. The performance of C3D features is comparable to that of a 2D CNN that does not consider temporal information for HAR at the frame level [18]. Similarly, Bag of Expressions [32] for HAR produces a 26.67% recognition rate using handcrafted 3D-Harris and 3D-SIFT features. Some recent silhouette-based HAR techniques use MHIs described through HOG [20] and LBP-HOG [31] descriptors to recognize human activities with nearest neighbor and SVM classifiers. The instructor activity recognition rate is 63.5% for [20] and 55% for [31].

Since all the reported MHI-based action recognition techniques use traditional feature descriptors to represent the action, none of them employ deep learning features to represent the spatiotemporal movement of the actor from MHIs. We believe that deep features can learn higher-dimensional representations from motion templates, yielding more discriminative model learning for activity recognition. Table 7 shows comparative results for the IAVID-I dataset using 70% training and 30% testing data.

Evidently, learned feature representations are beneficial for action recognition: the proposed technique outperforms traditional feature representations such as HOG [20, 31], confirming the benefit of a good decision boundary among the eight instructor action classes learned with a feedforward network. These results confirm the advantage of deeply learned features over traditional and other deep learning action recognition techniques, as shown in Table 7.

Similar performance benefits are also obtained for the two standard multiview action recognition datasets, i.e., MuHAVI and IXMAS, as shown in Figure 6 and Table 7. We believe that the good performance of the proposed technique on the MuHAVI action recognition videos is due to the flexibility of fusing the spatiotemporal motion profile with unsupervised deep learned features x in a feedforward learning model. The proposed technique improves the baseline recognition results on the MuHAVI-Uncut dataset by 29.84% in the LOCO scheme and by 9.56% in the LOAO scheme; in LOSO there is a slight improvement of 0.42%. However, on the IXMAS dataset the performance of the proposed technique is less outstanding, because the actors in IXMAS do not have fixed angular positions towards the cameras, which introduces some misclassifications, although there is no significant variation in aspect within each view. Moreover, several IXMAS actions are visually fairly similar to each other, such as folding arms, checking the watch, and scratching the head, so the motion profiles generated from these actions are also visually similar. Therefore, the proposed technique does not predict these action classes precisely, though it does to some extent, as shown in Table 7.

To compare performance, we consider other approaches on the MuHAVi-Uncut and IXMAS datasets. For comparison, we implemented the approaches in [20, 33–35]. The operational parameters and the noise removal technique were kept the same across all the methods [20, 33–35] for a fair comparison. In [20], N is the total number of video frames used for silhouette generation, as in our method. In [35], motion profiles were used to model the classifier for action recognition. The recognition rates of [35] under the LOAO, LOCO, and LOSO validation schemes are not more than 60%, while the proposed approach improves the recognition accuracy by 36.96%, 46.13%, and 40.42%, respectively, compared to [35]. In [33], action motion profiles were clustered through a self-organizing map (SOM) and the clusters were further projected onto a manifold space to predict the action class through an observable Markov Model. The technique proposed here performed 9.76% better than the observable Markov Model under the LOAO validation scheme, as shown in Table 7.

4. Conclusion

In this paper, we presented a framework for instructor activity recognition using deep spatiotemporal features and a feedforward Extreme Learning Machine, incorporating the spatiotemporal information of instructor silhouettes in a single motion profile and representing it with high-dimensional deep convolutional features. These deep spatiotemporal representations are used to learn a model for instructor activity recognition by employing an extreme learning machine as a classifier. The proposed scheme has several salient features: it accurately predicts instructor actions and performs recognition in a feedforward fashion instead of backpropagation or iterative learning. Moreover, the proposed technique shows improvements against the challenges of scale, viewpoint variation, and multiple actors while accurately predicting actions. We have improved the baseline recognition rate on one of the multiview HAR datasets (MuHAVI-Uncut). In the future, we will explore new techniques to understand classroom semantics to support instructor self-reflection for lecture effectiveness.

Data Availability

We have used two publicly available datasets (MuHAVI and IXMAS). However, the IAVID-1 dataset used to support the findings of this study may be provided on request to “Muhammad Haroon Yousaf”, who can be contacted at [email protected].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for the research work carried out at the Centre for Computer Vision Research (C2VR) at the University of Engineering and Technology Taxila, Pakistan. Sergio A. Velastin acknowledges funding by the Universidad Carlos III de Madrid, the European Union’s Seventh Framework Programme for Research, Technological Development and Demonstration under grant agreement no. 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509), and Banco Santander. We are also very thankful to the participants, faculty, and postgraduate students of the Computer Engineering Department who took part in the data acquisition phase. Without their consent, this work would not have been possible.