Human Activity Recognition Using Gaussian Mixture Hidden Conditional Random Fields
In healthcare, analyzing patients’ activities is an important source of information for delivering better services and managing illness. The accuracy of most human activity recognition (HAR) systems depends heavily on the recognition module (stage). Our focus on the recognition stage is motivated by the lack of improvement in the learning methods it uses. In this study, we propose the use of hidden conditional random fields (HCRFs) for the human activity recognition problem. We contend that the existing HCRF model is limited by independence assumptions, which may reduce classification accuracy. We therefore use a new algorithm that relaxes these assumptions, allowing our model to use full-covariance distributions. We also show that our method has much lower computational complexity than the existing methods. For the experiments, we used four publicly available standard datasets. We used a 10-fold cross-validation scheme to train, assess, and compare the proposed model with the conditional learning method, the hidden Markov model (HMM), and the existing HCRF model, which can only use diagonal-covariance Gaussian distributions. The experiments show that the proposed model substantially improves classification accuracy, with a p value ≤0.2.
In real-life environments, the analysis of human activities plays a significant role in many applications. These include vision-based human/object detection, recognition, and tracking [1, 2], as well as applications in computer engineering, physical sciences, health-related issues, natural sciences, and industrial and academic areas. Most authors [6–11] recognized human activities in indoor environments using different methodologies. However, their systems relied on stable environments, such as fixed camera and preset lighting settings, and most of the activities were performed according to instructions given by an instructor. Similarly, the authors of [10, 12–14] proposed different methods to recognize daily human activities in outdoor environments. However, most of the datasets they used have a static background, which is a common drawback of these systems. The authors of [15–17] used different sensors to classify indoor and outdoor human activities.
Moreover, in telemedicine and healthcare, the value of human activity recognition (HAR) can be illustrated by the scenario of assisting physically disabled persons. A patient with half of the body paralyzed by a stroke is unable to walk, and daily exercise is one way to help the patient recover. Doctors normally recommend daily exercises (activities) to stroke patients to improve their health. A HAR system can be trained to correctly identify the activities performed by stroke patients, allowing doctors to easily monitor the patients’ progress.
There are four modules in a typical HAR system: preprocessing (segmentation), feature extraction, feature selection, and recognition, as shown in Figure 1. Most of the existing works [18–23] focused on feature extraction and selection; very little work has been done on the recognition module. Some studies exploited conventional techniques [24–28]. Among them, the HMM is one of the best candidates for activity recognition; however, the HMM is generative in nature and less precise than its discriminative counterpart, the HCRF model.
The motivation for focusing on the recognition stage is the lack of improvement in the learning method. We therefore make the following contributions:
(i) The existing HCRF model is limited by independence assumptions, which may reduce classification accuracy. The first objective of this study is therefore to propose a recognition model with a new algorithm that relaxes these assumptions, allowing the model to use full-covariance distributions.
(ii) The second objective is to show that our method has much lower computational complexity than the existing methods. In the training phase, the goal is to find the parameters that maximize the conditional probability of the training data, and we use the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method to search for the optimal point. However, instead of repeating the forward and backward algorithms to compute the gradients, as others did, we run the forward and backward algorithms only when calculating the conditional probability and then reuse the result to compute the gradients. As a result, the computation time is significantly reduced.
(iii) A comprehensive set of experiments yielded a weighted average classification rate of 97%, an improvement over the state-of-the-art methods.
The rest of the paper is organized as follows: Section 2 presents related works and their limitations. Section 3 describes the proposed recognition model and its advantages. Section 4 describes the experimental setup for the proposed model on four datasets. Based on this setup, a series of experiments is presented in Section 5. Finally, Section 6 concludes the paper with some future directions.
2. Related Works
In a typical HAR system, the preprocessing module uses segmentation methods to extract the human body from the activity frame, which helps to improve the performance of the activity recognition system. In the literature, the authors of [31–36] used recent methods to segment the human body from video frames. Similarly, for feature extraction, various recent methodologies have been employed to help the classifiers accurately classify human activities (see the workflow in Figure 1) [37–42]. They performed well on different datasets, and most of them achieved an average accuracy between 70% and 90%.
Regarding recognition, researchers have proposed diverse systems that exploit various classifiers, such as the Gaussian mixture model (GMM) [43, 44], artificial neural network (ANN) [45, 46], and support vector machine (SVM) [47–50]. These classifiers were principally employed for frame-based classification. In contrast, many HAR systems [37, 51, 52] have used the well-known hidden Markov model (HMM) for sequence-based classification. For frame-level features, HMMs have an advantage over vector-based classifiers such as SVM, GMM, and ANN because they handle sequential data effectively. However, the Markov property of the traditional HMM assumes that the current state depends only on the previous state. As a result, the labels of two adjacent states in the observation sequence are assumed to occur in succession, an assumption that is often not satisfied in practice. In addition, the generative nature of the HMM and its independence assumptions between observations and states limit its performance. To overcome these limitations, the maximum entropy Markov model (MEMM) was proposed, which performs comparatively better than the HMM. However, the MEMM suffers from the well-known “label bias problem.”
Two generalizations of the MEMM, conditional random fields (CRFs) and the HCRF, were developed to fix the “label bias problem.” To learn the hidden structure of sequential data, the HCRF extends the CRF with hidden states. In both models, however, per-state normalization is replaced with global normalization, which permits weighted scores and in turn results in larger parameter spaces than those of the HMM and MEMM.
For example, when a CRF is applied in a HAR system, the observed frames of a video are represented by the feature vector U, the resulting label by V, and the unknown state labels by K.
Suppose the image-labeling problem is posed with original labels K, image features U, and a vector of model parameters. The posterior probability maximized by the CRF, equation (1), is then the exponential of the weighted feature score divided by a normalization factor that sums the same quantity over all possible labelings.
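For reference, the CRF posterior and its normalization factor are conventionally written as follows. The symbols \Lambda (parameter vector) and f (feature vector) are assumptions that follow common HCRF notation; they are not taken from the authors' typesetting.

```latex
p(K \mid U; \Lambda) = \frac{\exp\bigl(\Lambda \cdot f(K, U)\bigr)}{Z(U, \Lambda)},
\qquad
Z(U, \Lambda) = \sum_{K'} \exp\bigl(\Lambda \cdot f(K', U)\bigr)
```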
Some issues in HCRF implementation are reviewed and analyzed in the following description. In an HCRF model, the posterior probability of the CRF in (1) is replaced by a sum of exponentials of the latent (hidden-state) functions, normalized over all expected labels L.
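In the usual formulation, this HCRF posterior marginalizes over the hidden state sequences. The sketch below uses the assumed symbols \Lambda and f for the parameter and feature vectors, and writes V' for a candidate label drawn from the label set L.

```latex
p(V \mid U; \Lambda)
  = \sum_{K} p(V, K \mid U; \Lambda)
  = \frac{\sum_{K} \exp\bigl(\Lambda \cdot f(V, K, U)\bigr)}{Z(U, \Lambda)},
\qquad
Z(U, \Lambda) = \sum_{V' \in L} \sum_{K} \exp\bigl(\Lambda \cdot f(V', K, U)\bigr)
```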
The normalization in the above equations guarantees that the conditional probability sums to one. Here, V is a possible label for the sequence of frames and K is a sequence of hidden states, each of which takes a value from 1 to Q (the number of states). The model is defined by a parameter vector and a feature vector, and the choice of feature vector determines which parameters the model learns; the feature vector therefore determines the form of the existing HCRF model. For example, a suitable selection of feature functions creates a Markov-chain-constrained HCRF with a Gaussian distribution at every state, where the features are indexed by the state labels and computed from the observation vectors. One of these features uses the per-component square of the observation vector at time t.
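One common choice of such feature functions, following the standard Gaussian HCRF construction, is sketched below. The feature names (Prior, Trans, Occ, M1, M2) and the symbols k_t (state at time t), u_t (observation at time t), and \delta (indicator function) are assumptions for illustration and may differ from the authors' own definitions.

```latex
\begin{aligned}
f^{\mathrm{Prior}}_{k}(V, K, U)    &= \delta(k_1 = k) \\
f^{\mathrm{Trans}}_{k k'}(V, K, U) &= \sum_{t=2}^{T} \delta(k_{t-1} = k)\,\delta(k_t = k') \\
f^{\mathrm{Occ}}_{k}(V, K, U)      &= \sum_{t=1}^{T} \delta(k_t = k) \\
f^{\mathrm{M1}}_{k}(V, K, U)       &= \sum_{t=1}^{T} \delta(k_t = k)\, u_t \\
f^{\mathrm{M2}}_{k}(V, K, U)       &= \sum_{t=1}^{T} \delta(k_t = k)\, u_t^{2}
\end{aligned}
```

Here u_t^2 denotes the per-component square of the observation vector, so the first- and second-moment features are enough to express a diagonal-covariance Gaussian at each state.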
With a particular set of parameters, the HCRF reduces to a hidden Markov model. For instance, with the feature vector described above, suppose the parameters are chosen as in equations (6)–(10), where b in (6) is the prior distribution of a Gaussian HMM and C in (7) is its transition matrix. The numerator of the conditional probability then takes the form of equation (11).
In equation (11), N denotes the Gaussian distribution. Equation (11) is therefore the conditional probability of U given V as computed by a Gaussian HMM with prior distribution b and transition matrix C.
Moreover, a more general form of the HCRF model has been proposed that handles complex distributions using a linear combination of Gaussian density functions, as expressed in equation (12).
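A worked version of this equivalence, written for a scalar observation to keep the notation short, is sketched below. The parameter names mirror assumed feature names (Prior, Trans, Occ, M1, M2), and \mu_k and \sigma_k^2 denote the mean and variance of state k; only b and C come from the text.

```latex
\begin{aligned}
\Lambda^{\mathrm{Prior}}_{k} &= \log b(k), &
\Lambda^{\mathrm{Trans}}_{k k'} &= \log C(k, k'), \\
\Lambda^{\mathrm{Occ}}_{k} &= -\tfrac{1}{2}\Bigl(\log\bigl(2\pi\sigma_k^{2}\bigr) + \frac{\mu_k^{2}}{\sigma_k^{2}}\Bigr), &
\Lambda^{\mathrm{M1}}_{k} &= \frac{\mu_k}{\sigma_k^{2}}, \qquad
\Lambda^{\mathrm{M2}}_{k} = -\frac{1}{2\sigma_k^{2}}
\end{aligned}
```

```latex
\exp\bigl(\Lambda \cdot f(V, K, U)\bigr)
  = b(k_1) \prod_{t=2}^{T} C(k_{t-1}, k_t) \prod_{t=1}^{T} N\bigl(u_t;\, \mu_{k_t}, \sigma_{k_t}^{2}\bigr)
```

The last three parameters reproduce the Gaussian log-density because \log N(u; \mu, \sigma^2) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{\mu^2}{2\sigma^2} + \tfrac{\mu}{\sigma^2}u - \tfrac{1}{2\sigma^2}u^2. For a diagonal covariance, the same terms are summed over the components of the observation vector.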
In equation (12), M indicates the number of components in Gaussian mixture.
Many works have reported good performance using the abovementioned HCRF [55, 56]; however, most of them did not consider the limitations of the model. It is clear from the aforementioned equations that the existing model employs diagonal-covariance Gaussian distributions, which means that the variables (the components of the observation) are presumed to be pairwise independent. Moreover, equations (8)–(10) suggest that each state observation density converges to a Gaussian form only for a specific set of parameter values. Unfortunately, no training method has yet been designed to guarantee this convergence, and these assumptions might decrease the accuracy.
Therefore, we propose an improved version of the HCRF technique that can explicitly employ a full-covariance Gaussian mixture in the feature function. The proposed model retains the benefits of the hidden conditional random field model while addressing the drawbacks of the previous method.
3. Proposed Methodology
3.1. Feature Extraction
In our previous work, we used the symlet wavelet to extract various features from the activity frames. There are a number of reasons why the symlet wavelet produces relatively better classification results. These include its capability to extract salient information from the activity frames in terms of frequency and its support for characteristics of grayscale images such as orthogonality, biorthogonality, and reverse biorthogonality. For a given support size, the symlet has the highest number of vanishing moments and the least asymmetry.
3.2. Proposed Hidden Conditional Random Fields (HCRFs) Model
As described earlier, the current Gaussian mixture HCRF model cannot use full-covariance distributions, nor does it guarantee that its parameters converge to the values for which the conditional probability is modeled as a mixture of normal density functions.
To address these limitations, we explicitly include a mixture of Gaussian distributions in the feature functions. In these feature functions, N represents the number of density functions, Gamma carries the relevant information of the entire observation sequence, and D indicates the dimension of the observation; each mixture component has a mixing weight, a mean, and a covariance matrix.
As indicated in equation (14), by varying parameters such as the mixing weights, the means μ, and the covariance matrices, we can build a mixture of normal densities, from which the resulting conditional probability follows.
To maximize the conditional probability of the training data, we first focus on calculating the model parameters. In the proposed approach, the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method is used to search for the optimum point. Unlike other models, which repeat the forward and backward algorithms to compute the gradients, our approach runs the forward and backward algorithms only to compute the conditional probability and then reuses the results to find the gradients. This considerably reduces the computation time.
At the observation level, we incorporate the full-covariance matrix directly in the feature function, as shown in (16). Equation (17) may be used to obtain the normal distribution, which is further elaborated in the gradient functions described next.
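The per-frame quantity such a feature function needs is the log-density of a full-covariance Gaussian mixture. The following is a minimal sketch using only the Python standard library, not the authors' Matlab implementation; the function names are hypothetical, and a Cholesky factorization is assumed as the way to handle the full covariance matrix.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("covariance is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def gaussian_logpdf(x, mean, cov):
    """Log-density of a D-dimensional Gaussian with a full covariance matrix."""
    d = len(x)
    low = cholesky(cov)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution solves low * z = diff, so z.z is the Mahalanobis term.
    z = []
    for i in range(d):
        z.append((diff[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    maha = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def mixture_logpdf(x, weights, means, covs):
    """Log of the weighted sum of component densities, via log-sum-exp."""
    logs = [math.log(w) + gaussian_logpdf(x, m, c)
            for w, m, c in zip(weights, means, covs)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))
```

Off-diagonal covariance entries enter through the Cholesky factor, which is what a diagonal-covariance model cannot represent. Working in the log domain avoids underflow when the densities are multiplied along a long frame sequence.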
The first function is the gradient with respect to an element of the prior probability vector.
The second function is the gradient with respect to an element of the transition probability vector.
The third function is the gradient with respect to a Gaussian mixture weight; it is expressed in terms of an auxiliary function.
The fourth function is the gradient with respect to the mean of a Gaussian distribution.
The fifth function is the gradient with respect to the covariance of a Gaussian distribution.
Equations (24)–(27) presented above describe an analytical method for calculating the gradients with respect to the prior probability vector, the transition probability vector, and the observation probability parameters (the feature function, the means, and the covariances of the Gaussian distributions) of the existing HCRF.
In our model, the recognition of a variety of real-time activities is divided into two steps: a training step and an inference step. In the training step, data with known labels are input to train the hidden conditional random field model. In the inference step, the inputs to be recognized are classified according to the parameters determined in the training step.
When an activity frame is the input in the training step, preprocessing first reduces lighting effects so that faces can be detected and extracted from the activity frames. Motion features are then extracted from the various facial parts to create the feature vector. The resulting feature vector serves as the input to the full-covariance Gaussian mixture hidden conditional random field model of the proposed recognition model.
As mentioned earlier, the feature gradients are determined by the L-BFGS approach in the training phase of the HCRF model. In the existing gradient calculation technique, however, the forward and backward algorithms are called repeatedly, which requires a very high computational time and thus reduces the computational speed. We have formulated an analytical approach that reduces the number of calls to the forward and backward algorithms by using the five gradient functions given in equations (24)–(28). With this approach, real-time computation can be carried out at a higher speed, giving a large decrease in computational time compared with the known approach. The overall workflow of the proposed model is shown in Figure 2.
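The idea of running the forward and backward passes once and reusing them can be sketched as follows. This is an illustrative linear-chain example in log space, not the authors' implementation: the function names and the toy potentials are hypothetical, and only the prior and transition gradients are shown.

```python
import math


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def forward_backward(log_prior, log_trans, log_obs):
    """One forward and one backward pass; returns cached alpha, beta, and log Z."""
    t_len, q = len(log_obs), len(log_prior)
    alpha = [[0.0] * q for _ in range(t_len)]
    beta = [[0.0] * q for _ in range(t_len)]  # last row stays log(1) = 0
    for k in range(q):
        alpha[0][k] = log_prior[k] + log_obs[0][k]
    for t in range(1, t_len):
        for k in range(q):
            alpha[t][k] = log_obs[t][k] + logsumexp(
                [alpha[t - 1][j] + log_trans[j][k] for j in range(q)])
    for t in range(t_len - 2, -1, -1):
        for j in range(q):
            beta[t][j] = logsumexp(
                [log_trans[j][k] + log_obs[t + 1][k] + beta[t + 1][k]
                 for k in range(q)])
    return alpha, beta, logsumexp(alpha[t_len - 1])


def expected_counts(log_trans, log_obs, alpha, beta, log_z):
    """Reuse the cached passes to get the quantities the gradients need."""
    t_len, q = len(log_obs), len(log_trans)
    # gamma[t][k]: posterior of state k at time t; gamma[0] is d(log Z)/d(log prior).
    gamma = [[math.exp(alpha[t][k] + beta[t][k] - log_z) for k in range(q)]
             for t in range(t_len)]
    # xi_sum[j][k]: expected number of j -> k transitions, d(log Z)/d(log trans).
    xi_sum = [[0.0] * q for _ in range(q)]
    for t in range(1, t_len):
        for j in range(q):
            for k in range(q):
                xi_sum[j][k] += math.exp(
                    alpha[t - 1][j] + log_trans[j][k] + log_obs[t][k]
                    + beta[t][k] - log_z)
    return gamma, xi_sum


# Toy example with Q = 2 states and T = 3 frames (values are arbitrary).
log_prior = [math.log(0.6), math.log(0.4)]
log_trans = [[math.log(0.7), math.log(0.3)], [math.log(0.2), math.log(0.8)]]
log_obs = [[-1.0, -2.0], [-0.5, -1.5], [-2.0, -0.3]]
alpha, beta, log_z = forward_backward(log_prior, log_trans, log_obs)
gamma, xi_sum = expected_counts(log_trans, log_obs, alpha, beta, log_z)
```

Every gradient term is read from the cached `alpha` and `beta` tables, so an optimizer such as L-BFGS needs one forward and one backward pass per objective evaluation rather than a fresh pass per parameter. In an HCRF the same computation would be done once with the label fixed and once summed over labels, and the gradient is the difference of the two expectations.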
4. Model Validation
4.1. Datasets Used
In this work, we used four open-source standard action datasets, namely, the Weizmann action dataset, the KTH action dataset, the UCF sports dataset, and the IXMAS action dataset, to validate the performance of the proposed HCRF model. All the datasets are described below.
4.1.1. Weizmann Action Dataset
This dataset consists of 10 actions, such as bending, running, walking, skipping, place jumping, side movement, jumping forward, two-hand waving, and one-hand waving, performed by a total of 9 subjects. The dataset comprises 90 video clips with an average of 15 frames per clip, where the frame size is 144 × 180.
4.1.2. KTH Action Dataset
The KTH dataset used for activity recognition comprises 25 subjects who performed 6 activities (running, walking, boxing, jogging, handclapping, and hand waving) in four distinct scenarios. A total of 2391 sequences were recorded with a static camera against a homogeneous background, with a frame size of 160 × 120.
4.1.3. UCF Sports Dataset
This dataset contains 182 videos, which were evaluated using the n-fold cross-validation rule. The videos were taken from different sports activities on broadcast television channels, and some of them have high intraclass similarities. This dataset was also collected using a static camera. It covers 9 activities: running, diving, lifting, golf swinging, skating, kicking, walking, horseback riding, and baseball swinging. Each frame has a size of 720 × 480.
4.1.4. IXMAS Action Dataset
The IXMAS (INRIA Xmas motion acquisition sequences) dataset comprises 13 activity classes, each performed 3 times by 11 actors. Every actor freely chose an orientation and position. The dataset provides annotated silhouettes for each person. For our experiments, we selected only 8 action classes: walk, cross arms, punch, turn around, sit down, wave, get up, and kick. IXMAS is a multiview dataset for view-invariant human activity recognition, where each frame has a size of 390 × 291. This dataset contains major occlusions that may cause misclassification; therefore, we used global histogram equalization to mitigate the occlusion issue.
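Global histogram equalization remaps gray levels through the cumulative histogram of the whole frame. The following is a minimal sketch for an 8-bit grayscale frame stored as a list of rows; the function name and the rounding rule are assumptions, and the paper does not state which implementation was used.

```python
def equalize_histogram(image, levels=256):
    """Return a copy of a grayscale image with its histogram globally equalized."""
    hist = [0] * levels
    for row in image:
        for pixel in row:
            hist[pixel] += 1

    # Cumulative distribution of gray levels over the entire frame.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return [row[:] for row in image]

    # Map the lowest occupied level to 0 and the highest to levels - 1.
    scale = (levels - 1) / (total - cdf_min)
    lookup = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lookup[pixel] for pixel in row] for row in image]
```

The mapping is monotonic, so the ordering of intensities is preserved while the occupied range is stretched over all available levels.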
For a comprehensive validation, we carried out the following set of experiments using Matlab.
(i) The first experiment was conducted on each dataset separately to show the performance of the proposed model. In this experiment, we employed the 10-fold cross-validation rule, which means that data from 9 subjects were used for training, while the data from one subject were used for testing. The procedure was repeated 10 times so that each subject's data were used for both training and testing.
(ii) The second experiment was conducted on all four datasets without the proposed recognition model to show the importance of the developed model. For this purpose, we used well-known existing classifiers, namely, SVM, ANN, HMM, and the existing HCRF, as the recognition model instead of the proposed HCRF model.
(iii) The third experiment was conducted to compare the performance of the proposed approach with the state-of-the-art methods.
(iv) In the last experiment, the computational complexity of the proposed HCRF model was compared with that of the forward/backward algorithms.
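The subject-wise protocol in experiment (i) can be sketched as below, under the assumption that each fold holds out all clips of one subject. The function name and the toy subject IDs are hypothetical, and the authors' Matlab code may form the folds differently.

```python
def subject_folds(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy example: 10 subjects with 2 clips each gives 10 folds.
clip_subjects = [subject for subject in range(10) for _ in range(2)]
folds = list(subject_folds(clip_subjects))
```

Splitting by subject rather than by clip keeps every clip of the test subject out of training, so the reported accuracy reflects generalization to an unseen person.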
5. Results and Discussion
5.1. First Experiment
As described before, this experiment validates the performance of the proposed recognition model on each individual dataset. The overall results are shown in Table 1 (Weizmann dataset), Table 2 (KTH dataset), Table 3 (UCF sports dataset), and Table 4 (IXMAS dataset).
As observed from Tables 1–4, the proposed recognition model consistently obtained high recognition rates on each dataset. This result shows the robustness of the proposed model: it performed well not just on one dataset but across multiple spontaneous datasets.
5.2. Second Experiment
As described before, the second experiment was conducted on all four datasets without the proposed recognition model to show its importance. For this purpose, we used well-known existing classifiers, namely, SVM, ANN, HMM, and the existing HCRF, as the recognition model instead of the proposed HCRF model.
Tables 5–8 show that when the proposed HCRF model was replaced with ANN, SVM, HMM, or the existing HCRF, the system failed to achieve higher recognition rates. The better performance of the proposed HCRF model, shown in Tables 1–4, indicates that it effectively fixes the drawbacks of the HMM and the existing HCRF, which have been extensively used for sequential HAR.
5.3. Third Experiment
In this experiment, the proposed model was compared with the state-of-the-art methods. All of these approaches were implemented according to the instructions provided in their respective articles. The 10-fold cross-validation rule was employed on each dataset, as explained in Section 4. The average classification results of the existing methods and the proposed method across the different datasets are summarized in Table 9.
Table 9 shows that the proposed method performed significantly better than the existing state-of-the-art methods. The proposed method therefore recognizes human activities accurately and robustly across different video data.
5.4. Fourth Experiment
In this experiment, we present the computational complexity analysis, which is also one of the contributions of this paper. The implementations of the previous HCRF available in the literature calculate the gradients by repeating the forward and backward algorithms, whereas the proposed HCRF model executes them only once and caches the results for later use. From (21) and (22), the complexity of the forward or backward algorithm is expressed in terms of T, Q, and M, where T represents the input sequence length, Q represents the number of states, and M indicates the number of mixtures. The full complexity of calculating the gradients in the proposed HCRF model can be seen from (22)–(29).
Figure 3 compares the execution time when the gradients are computed by the forward (or backward) algorithm and by our proposed method. The computational time was measured by running Matlab R2013a on an Intel® Pentium® Core™ i7-6700 (3.4 GHz) with 16 GB of RAM.
In healthcare and telemedicine, the value of human activity recognition (HAR) is best illustrated by the scenario of assisting physically disabled persons. A patient with half of the body affected by paralysis is unable to perform daily exercises without support. Doctors recommend specific activities to improve such patients’ health, and for this purpose they need a HAR system through which they can monitor the patients’ daily routines (activities) on a regular basis.
The accuracy of most HAR systems depends on the recognition module. For the feature extraction and selection modules, we used existing well-known methods, while for the recognition module, we proposed the use of an HCRF model that is capable of approximating a complex distribution using a mixture of Gaussian density functions. The proposed model was assessed on four publicly available standard action datasets. Our experiments show that the proposed model with full-covariance Gaussian density functions achieved a significant improvement in accuracy over the existing state-of-the-art methods. Furthermore, we showed that this improvement is statistically significant, with a p value ≤0.2 for the comparison. Similarly, the complexity analysis shows that the proposed computational method strongly decreases the execution time of the hidden conditional random field model.
The ultimate goal of this study is to deploy the proposed model on smartphones. Currently, the proposed model uses a full-covariance matrix, which might be time consuming, especially on smartphones. Using a lightweight classifier such as K-nearest neighbor (K-NN) could be one possible solution, but K-NN is very sensitive to environmental factors such as noise. Therefore, in future work, we will investigate how to reduce the computation time while sustaining the same recognition rate on smartphones in real environments.
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare there are no conflicts of interest.
This study was supported by the Jouf University, Sakaka, Aljouf, Kingdom of Saudi Arabia, under the registration no. 39/791.
C. Zhang and Y. Tian, “RGB-D camera-based daily living activity recognition,” Journal of Computer Vision and Image Processing, vol. 2, no. 4, pp. 1–7, 2012.
M. Leo, T. D’Orazio, and P. Spagnolo, “Human activity recognition for automatic visual surveillance of wide areas,” in Proceedings of the ACM 2nd International Workshop on Video Surveillance and Sensor Networks, pp. 124–130, ACM, New York, NY, USA, October 2004.
W. Niu, J. Long, D. Han, and Y.-F. Wang, “Human activity detection and recognition for video surveillance,” in Proceedings of the 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No. 04TH8763), vol. 1, pp. 719–722, IEEE, Taipei, Taiwan, June 2004.
L. M. G. Fonseca, L. M. Namikawa, and E. F. Castejon, “Digital image processing in remote sensing,” in Proceedings of the 2009 Tutorials of the XXII Brazilian Symposium on Computer Graphics and Image Processing, pp. 59–71, IEEE, Rio de Janeiro, Brazil, October 2009.
M. Sharif, M. A. Khan, T. Akram, M. Y. Javed, T. Saba, and A. Rehman, “A framework of human detection and action recognition based on uniform segmentation and combination of Euclidean distance and joint entropy-based features selection,” EURASIP Journal on Image and Video Processing, vol. 2017, no. 1, p. 89, 2017.
J. Zang, L. Wang, Z. Liu, Q. Zhang, G. Hua, and N. Zheng, “Attention-based temporal weighted convolutional neural network for action recognition,” in Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 97–108, Springer, Rhodes, Greece, May 2018.
J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: probabilistic models for segmenting and labeling sequence data,” in Proceedings of the 18th International Conference on Machine Learning (ICML-2001), pp. 282–289, Williamstown, MA, USA, June-July 2001.
A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, “Hidden conditional random fields for phone classification,” in Proceedings of the INTERSPEECH, vol. 2, pp. 1117–1120, Citeseer, Lisbon, Portugal, September 2005.
L. Yang, Q. Song, Z. Wang, and M. Jiang, “Parsing R-CNN for instance-level human analysis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 364–373, Long Beach, CA, USA, June 2019.
M. M. Requena, “Human segmentation with convolutional neural networks,” Thesis, University of Alicante, Alicante, Spain, 2018.
M. H. Siddiqi, A. M. Khan, and S.-W. Lee, “Active contours level set based still human body segmentation from depth images for video-based activity recognition,” KSII Transactions on Internet and Information Systems (TIIS), vol. 7, no. 11, pp. 2839–2852, 2013.
A. Jalal, S. Kamal, and D. Kim, “A depth video-based human detection and activity recognition using multi-features and embedded hidden Markov models for health care monitoring systems,” International Journal of Interactive Multimedia and Artificial Intelligence, vol. 4, no. 4, p. 54, 2017.
Y. Yang, B. Zhang, L. Yang, C. Chen, and W. Yang, “Action recognition using completed local binary patterns and multiple-class boosting classifier,” in Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp. 336–340, IEEE, Kuala Lumpur, Malaysia, November 2015.
H. Foroughi, A. Naseri, A. Saberi, and H. S. Yazdi, “An eigenspace-based approach for human fall detection using integrated time motion image and neural network,” in Proceedings of the 2008 9th International Conference on Signal Processing, pp. 1499–1503, IEEE, Beijing, China, October 2008.
A. K. B. Sonali, “Human action recognition using support vector machine and k-nearest neighbor,” International Journal of Engineering and Technical Research, vol. 3, no. 4, pp. 423–428, 2015.
V. Swarnambigai, “Action recognition using AMI and support vector machine,” International Journal of Computer Science and Technology, vol. 5, no. 4, pp. 175–179, 2014.
S. Kumar and M. Hebert, “Discriminative fields for modeling spatial dependencies in natural images,” in Proceedings of the NIPS, Vancouver, Canada, December 2003.
K. Soomro and A. R. Zamir, “Action recognition in realistic sports videos,” in Computer Vision in Sports, pp. 181–208, Springer, Berlin, Germany, 2014.
M. Elmezain and A. Al-Hamadi, “Vision-based human activity recognition using LDCRFs,” International Arab Journal of Information Technology (IAJIT), vol. 15, no. 3, pp. 389–395, 2018.
A. Roy, B. Banerjee, and V. Murino, “A novel dictionary learning based multiple instance learning approach to action recognition from videos,” in Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods, pp. 519–526, Porto, Portugal, February 2017.
A. P. Sirat, “Actions as space-time shapes,” International Journal of Application or Innovation in Engineering and Management, vol. 7, no. 8, pp. 49–54, 2018.