Journal of Healthcare Engineering

Review Article

A Review on Human Activity Recognition Using Vision-Based Method

Table 1

Taxonomy of the activity recognition literature.


References	Year	Representation (global/local/depth)	Classification	Modality	Level	Dataset	Performance result

Yamato et al. [94]	1992	Symbols converted from mesh feature vector and encoded by vector quantization (G)	HMM	RGB	Action/activity	Collected dataset: 3 subjects × 300 combinations	96% accuracy
Darrell and Pentland [92]	1993	View model sets (G)	Dynamic time warping	RGB	Action primitive	Collected instances of 4 gestures.	96% accuracy (“Hello” gesture)
Brand et al. [102]	1997	2D blob feature (G)	Coupled HMM (CHMM)	RGB	Action primitive	Collected dataset: 52 instances. 3 gestures × 17 times.	94.2% accuracy
Oliver et al. [97]	2000	2D blob feature (G)	(i) CHMM; (ii) HMM	RGB	Interaction	Collected dataset: 11–75 training sequences + 20 testing sequences. Organized as 5-level hierarchical interactions.	(i) 84.68% accuracy (average); (ii) 98.43% accuracy (average)
Bobick and Davis [17]	2001	Motion energy image & motion history image (G)	Template matching by measuring Mahalanobis distance	RGB	Action/activity	Collected dataset: 18 aerobic exercises × 7 views.	(a) 12/18 (single view); (b) 15/18 (multiple views)
Efros et al. [10]	2003	Optical flow (G)	K-nearest neighbor	RGB	Action/activity	(a) Ballet dataset; (b) tennis dataset; (c) football dataset	(a) 87.4% accuracy; (b) 64.3% accuracy; (c) 65.4% accuracy
Park and Aggarwal [103]	2004	Body model by combining an ellipse representation and a convex hull-based polygonal representation (G)	Dynamic Bayesian network	RGB	Interaction	Collected dataset: 56 instances. 9 interactions × 6 pairs of people.	78% accuracy
Schüldt et al. [105]	2004	Space-time interest points (L)	SVM	RGB	Action/activity	KTH dataset	71.7% accuracy
Blank et al. [5]	2005	Space-time shape (G)	Spectral clustering algorithm	RGB	Action/activity	Weizmann dataset	99.63% accuracy
Oikonomopoulos et al. [36]	2005	Spatiotemporal salient points (L)	RVM	RGB	Action/activity	Collected dataset: 152 instances. 19 activities × 4 subjects × 2 times.	77.63% recall
Dollar et al. [37]	2005	Space-time interest points (L)	(i) 1-nearest neighbor (1NN); (ii) SVM	RGB	Action/activity	KTH dataset	(i) 78.5% accuracy (1NN); (ii) 81.17% accuracy (SVM)
Ke et al. [38]	2005	Integral videos (L)	Adaboost	RGB	Action/activity	KTH dataset	62.97% accuracy
Veeraraghavan et al. [93]	2005	Space-time shape (G)	Nonparametric methods by extending DTW	RGB	Action/activity	(a) USF dataset [154]; (b) CMU dataset [155]; (c) MOCAP dataset	No accuracy data presented.
Duong et al. [98]	2005	High-level activities are represented as sequences of atomic activities; atomic activities are only represented using durations (−).	Switching hidden semi-Markov model (S-HSMM)	RGB	Interaction	Collected dataset: 80 video sequences. 6 high-level activities.	97.5% accuracy (average accuracy; Coxian model)
Weinland et al. [20]	2006	Motion history volumes (G)	Principal component analysis (PCA) + Mahalanobis distance	RGB	Action/activity	IXMAS dataset [20]	93.33% accuracy
Lu et al. [49]	2006	PCA-HOG (L)	HMM	RGB	Action/activity	(a) Soccer sequences dataset [10]; (b) Hockey sequences dataset [156]	The implemented system can track subjects in videos and recognize their activities robustly. No accuracy data presented.
Ikizler and Duygulu [18]	2007	Histogram of oriented rectangles and encoded with BoVW (G)	(i) Frame by frame voting; (ii) global histogramming; (iii) SVM classification; (iv) dynamic time warping	RGB	Action/activity	Weizmann dataset	100% accuracy (DTW)
Huang and Xu [19]	2007	Envelope shape acquired from silhouettes (G)	HMM	RGB	Action/activity; action primitive	Collected dataset: 9 activities × 7 subjects × 3 times × 3 views.	Subject dependent + view independent: 97.3% accuracy; subject independent + view independent: 95.0% accuracy; subject independent + view dependent: 94.4% accuracy
Scovanner et al. [46]	2007	3D SIFT (L)	SVM	RGB	Action/activity	Weizmann dataset	82.6% accuracy
Vail et al. [106]	2007	—	(i) HMM; (ii) conditional random field	—	Interaction	Data from the hourglass and the unconstrained tag domains generated by robot simulator.	98.1% accuracy (CRF, hourglass); 98.5% accuracy (CRF, unconstrained tag domains)
Cherla et al. [21]	2008	Width feature of normalized silhouette box (G)	Dynamic time warping	RGB	Action/activity	IXMAS dataset [20]	80.05% accuracy; 76.28% accuracy (cross view)
Tran and Sorokin [25]	2008	Silhouette and optical flow (G)	(i) Naïve Bayes (NB); (ii) 1-nearest neighbor (1NN); (iii) 1-nearest neighbor with rejection (1NN-R); (iv) 1-nearest neighbor with metric learning (1NN-M)	RGB	Interaction; Action/activity	(a) Weizmann dataset; (b) UMD dataset [15]; (c) IXMAS dataset [20]; (d) collected dataset: 532 instances. 10 activities × 8 subjects.	(a) 100% accuracy; (b) 100% accuracy; (c) 81% accuracy; (d) 99.06% accuracy (1NN-M & L1SO)
Achard et al. [26]	2008	Semi-global features extracted from space-time micro volumes (L)	HMM	RGB	Action/activity	Collected dataset: 1614 instances. 8 activities × 7 subjects × 5 views.	87.39% accuracy (average)
Rodriguez et al. [91]	2008	Action MACH-maximum average correlation height (G)	Maximum average correlation height filter	RGB	Interaction; Action/activity	(a) KTH dataset; (b) collected feature films dataset: 92 kissing + 112 hitting/slapping; (c) UCF dataset; (d) Weizmann dataset	(a) 80.9% accuracy; (b) 66.4% for kissing & 67.2% for hitting/slapping; (c) 69.2% accuracy; (d) reported a significant increase in algorithm efficiency, with no overall accuracy data presented
Kläser et al. [30]	2008	Histograms of oriented 3D spatiotemporal gradients (L)	SVM	RGB	Interaction; Action/activity	(a) KTH dataset; (b) Weizmann dataset; (c) Hollywood dataset	(a) 91.4% (±0.4) accuracy; (b) 84.3% (±2.9) accuracy; (c) 24.7% precision
Willems et al. [39]	2008	Hessian-based STIP detector & SURF3D (L)	SVM	RGB	Action/activity	KTH dataset	84.26% accuracy
Laptev et al. [50]	2008	STIP with HOG and HOF, encoded with BoVW (L)	SVM	RGB	Interaction; Action/activity	(a) KTH dataset; (b) Hollywood dataset	(a) 91.8% accuracy; (b) 38.39% accuracy (average)
Natarajan and Nevatia [95]	2008	23 degrees body model (G)	Hierarchical variable transition HMM (HVT-HMM)	RGB	Action/activity; Action primitive	(a) Weizmann dataset; (b) gesture dataset in [157]	(a) 100% accuracy; (b) 90.6% accuracy
Natarajan and Nevatia [107]	2008	2-layer graphical model: top layer corresponds to actions in particular viewpoint; lower layer corresponds to individual poses (G)	Shape, flow, duration-conditional random field (SFD-CRF)	RGB	Action/activity	Collected dataset: 400 instances. 6 activities × 4 subjects × 16 views (×6 backgrounds).	78.9% accuracy
Ning et al. [108]	2008	Appearance and position context (APC) descriptor encoded by BoVW (L)	Latent pose conditional random fields (LPCRF)	RGB	Action/activity; Action primitive	HumanEva dataset	95.0% accuracy (LPCRFinit)
Marszalek et al. [158]	2009	SIFT, HOG, HOF encoded by BoVW (L)	SVM	RGB	Interaction	Hollywood2 dataset	35.5% accuracy
Li et al. [76]	2010	Action graph of salient postures (D)	Non-Euclidean relational fuzzy (NERF) C-means & Hausdorff distance-based dissimilarity measure	Depth	Action/activity	MSR Action3D dataset	91.6% accuracy (train/test = 1/2); 94.2% accuracy (train/test = 2/1); 74.7% accuracy (train/test = 1/1 & cross subject)
Suk et al. [101]	2010	YIQ color model for skin pixels; histogram-based color model for face region; optical flow for tracking of hand motion (L)	Dynamic Bayesian network	RGB	Action primitive	Collected dataset: 498 instances. (a) 10 gestures × 7 subjects × 7 times (isolated gesture); (b) 8 longer videos contain 50 gestures (continuous gestures)	(a) 99.59% accuracy; (b) 84% recall & 80.77% precision
Baccouche et al. [124]	2010	SIFT descriptor encoded by BoVW (L)	Recurrent neural networks (RNN) with long short-term memory (LSTM)	RGB	Interaction	MICC-Soccer-Actions-4 dataset [159]	92% accuracy
Kumari and Mitra [29]	2011	Discrete Fourier transform on silhouettes (G)	K-nearest neighbor	RGB	Action/activity	(a) MuHaVi dataset; (b) DA-IICT dataset	(a) 96% accuracy; (b) 82.6667% accuracy
Wang et al. [51]	2011	Dense trajectory with HOG, HOF, MBH (L)	SVM	RGB	Interaction; Action/activity	(a) KTH dataset; (b) YouTube dataset; (c) Hollywood2 dataset; (d) UCF Sport dataset	(a) 94.2% accuracy; (b) 84.2% accuracy; (c) 58.3% accuracy; (d) 88.2% accuracy
Wang et al. [56]	2012	STIP with HOG and HOF, encoded with various encoding methods (L)	SVM	RGB	Interaction; Action/activity	(a) KTH dataset; (b) HMDB51 dataset	(a) 92.13% accuracy (Fisher vector); (b) 29.22% accuracy (Fisher vector)
Zhao et al. [77]	2012	Combined representations: (a) RGB: HOG & HOF upon space-time interest points (L); (b) depth: local depth pattern at each interest point (D)	SVM	RGB-D	Interaction	RGBD-HuDaAct dataset	89.1% accuracy
Yang et al. [78]	2012	DMM-HOG (D)	SVM	Depth	Action/activity	MSR Action3D dataset	95.83% accuracy (train/test = 1/2); 97.37% accuracy (train/test = 2/1); 91.63% accuracy (train/test = 1/1 & cross subject)
Xia et al. [84]	2012	Histograms of 3D joint locations (D)	HMM	Depth	Action/activity	(a) collected dataset: 6220 frames, 200 samples. 10 activities × 10 subjects × 2 times. (b) MSR Action3D dataset	(a) 90.92% accuracy; (b) 97.15% accuracy (highest); 78.97% accuracy (cross subject)
Yang and Tian [85]	2012	EigenJoints (D)	Naïve-Bayes-Nearest-Neighbor (NBNN)	Depth	Action/activity	MSR Action3D dataset	96.8% accuracy; 81.4% accuracy (cross subject)
Wang et al. [160]	2012	Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D)	SVM	Depth	Interaction; Action/activity	(a) MSR Action3D dataset; (b) MSR Action3DExt dataset; (c) CMU MOCAP dataset	(a) 88.2% accuracy; (b) 85.75% accuracy; (c) 98.13% accuracy
Wang et al. [53]	2013	Improved dense trajectory with HOG, HOF, MBH (L)	SVM	RGB	Interaction	(a) Hollywood2 dataset; (b) HMDB51 dataset; (c) Olympic Sports dataset [161]; (d) UCF50 dataset [162]	(a) 64.3% accuracy; (b) 57.2% accuracy; (c) 91.1% accuracy; (d) 91.2% accuracy
Oreifej and Liu [74]	2013	Histogram of oriented 4D surface normals (D)	SVM	Depth	Action/activity; Action primitive	(a) MSR Action3D dataset; (b) MSR Gesture3D dataset; (c) Collected 3D Action Pairs dataset	(a) 88.89% accuracy; (b) 92.45% accuracy; (c) 96.67% accuracy
Chaaraoui [88]	2013	Combined representations: (a) RGB: silhouette (G); (b) depth: skeleton joints (D)	Dynamic time warping	RGB-D	Action/activity	MSR Action3D dataset	91.80% accuracy
Ren et al. [152]	2013	Time-series curve of hand shape (G)	Dissimilarity measure based on Finger-Earth Mover’s Distance (FEMD)	RGB	Action primitive	Collected dataset: 1000 instances. 10 gestures × 10 subjects × 10 times.	93.9% accuracy
Ni et al. [163]	2013	Depth-Layered Multi-Channel STIPs (L)	SVM	RGB-D	Interaction	RGBD-HuDaAct database	81.48% accuracy (codebook size = 512 & SPM kernel)
Grushin et al. [123]	2013	STIP with HOF (L)	Recurrent neural networks (RNN) with long short-term memory (LSTM)	RGB	Action/activity	KTH dataset	90.7% accuracy
Peng et al. [31]	2014	(i) STIP with HOG, HOF and encoded by various encoding methods (L); (ii) iDT with HOG, HOF, MBHx, MBHy and encoded by various encoding methods (L)	SVM	RGB	Interaction	(a) HMDB51 dataset; (b) UCF50 dataset; (c) UCF101 dataset	Hybrid representation: (a) 61.1% accuracy; (b) 92.3% accuracy; (c) 87.9% accuracy
Peng et al. [32]	2014	Improved dense trajectory encoded with stacked Fisher kernel (L)	SVM	RGB	Interaction; Action/activity	(a) YouTube dataset; (b) HMDB51 dataset; (c) J-HMDB dataset	(a) 93.38% accuracy; (b) 66.79% accuracy; (c) 67.77% accuracy
Wang et al. [82]	2014	Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D)	SVM	Depth	Interaction; Action/activity	(a) MSR Action3D dataset; (b) MSR DailyActivity3D dataset; (c) Multiview 3D event dataset; (d) Cornell Activity Dataset [164]	(a) 88.2% accuracy; (b) 85.75% accuracy; (c) 88.34% accuracy (cross subject); 86.76% accuracy (cross view); (d) 97.06% accuracy (same person); 74.70% accuracy (cross person)
Simonyan and Zisserman [115]	2014	Spatial stream ConvNets & optical flow based temporal stream ConvNets (L)	SVM	RGB	Interaction	(a) HMDB51 dataset; (b) UCF101 dataset	(a) 59.4% accuracy; (b) 88.0% accuracy
Lan et al. [33]	2015	Improved dense trajectory with HOG, HOF, MBHx, MBHy enhanced with multiskip feature tracking (L)	SVM	RGB	Interaction	(a) HMDB51 dataset; (b) Hollywood2 dataset; (c) UCF101 dataset; (d) UCF50 dataset; (e) Olympic Sports dataset	(a) 65.1% accuracy (L = 3); (b) 68.0% accuracy (L = 3); (c) 89.1% accuracy (L = 3); (d) 94.4% accuracy (L = 3); (e) 91.4% accuracy (L = 3)
Shahroudy et al. [83]	2015	Combined representations: (a) RGB: dense trajectories with HOG, HOF, MBH (L); (b) depth: skeleton joints (D)	SVM	RGB-D	Interaction	MSR DailyActivity3D dataset	81.9% accuracy
Wang et al. [114]	2015	Weighted hierarchical depth motion maps (D)	Three-channel deep convolutional neural networks (3ConvNets)	Depth	Interaction; Action/activity	(a) MSR Action3D dataset; (b) MSR Action3DExt dataset; (c) UTKinect Action dataset [84]; (d) MSR DailyActivity3D dataset; (e) Combined dataset of above	(a) 100% accuracy; (b) 100% accuracy; (c) 90.91% accuracy; (d) 85% accuracy; (e) 91.56% accuracy
Wang et al. [165]	2015	Pseudo-color images converted from DMMs (D)	Three-channel deep convolutional neural networks (3ConvNets)	Depth	Interaction; Action/activity	(a) MSR Action3D dataset; (b) MSR Action3DExt dataset; (c) UTKinect Action dataset [84]	(a) 100% accuracy; (b) 100% accuracy; (c) 90.91% accuracy
Wang et al. [117]	2015	Trajectory-pooled deep-convolutional descriptor and encoded by Fisher kernel (L)	SVM	RGB	Interaction	(a) HMDB51 dataset; (b) UCF101 dataset	(a) 65.9% accuracy; (b) 91.5% accuracy
Veeriah et al. [125]	2015	(i) HOG3D in KTH 2D action dataset (L); (ii) skeleton-based features including skeleton positions, normalized pairwise angles, offset of joint positions, histogram of the velocity, and pairwise joint distances (D)	Differential recurrent neural network (dRNN)	RGB-D	Action/activity	(a) KTH dataset; (b) MSR Action3D dataset	(a) 93.96% accuracy (KTH-1); 92.12% accuracy (KTH-2); (b) 92.03% accuracy
Du et al. [126]	2015	Representations of skeleton data extracted by subnets (D)	Hierarchical bidirectional recurrent neural network (HBRNN)	RGB-D	Action/activity	(a) MSR Action3D dataset; (b) Berkeley MHAD Action dataset [166]; (c) HDM05 dataset [167]	(a) 94.49% accuracy; (b) 100% accuracy; (c) 96.92% (±0.50) accuracy
Zhen et al. [58]	2016	STIP with HOG3D and encoded with various encoding methods (L)	SVM	RGB	Interaction; Action/activity	(a) KTH dataset; (b) UCF YouTube dataset; (c) HMDB51 dataset	(a) 94.1% accuracy (Local NBNN); (b) 63.0% accuracy (improved Fisher kernel); (c) 30.5% accuracy (improved Fisher kernel)
Chen et al. [81]	2016	Action graph of skeleton-based features (D)	Maximum likelihood estimation	Depth	Action/activity	(a) MSR Action3D dataset; (b) UTKinect Action dataset	(a) 95.56% accuracy (cross subject); 96.1% accuracy (three subset evaluation); (b) 95.96% accuracy
Zhu et al. [87]	2016	Co-occurrence features of skeleton joints (D)	Recurrent neural networks (RNN) with long short-term memory (LSTM)	Depth	Interaction; Action/activity	(a) SBU Kinect interaction dataset [168]; (b) HDM05 dataset; (c) CMU dataset; (d) Berkeley MHAD Action dataset	(a) 90.41% accuracy; (b) 97.25% accuracy; (c) 81.04% accuracy; (d) 100% accuracy
Li et al. [116]	2016	VLAD for deep dynamics (G)	Deep convolutional neural networks (ConvNets)	RGB	Interaction; Action/activity	(a) UCF101 dataset; (b) Olympic Sports dataset; (c) THUMOS15 dataset [116]	(a) 84.65% accuracy; (b) 90.81% accuracy; (c) 78.15% accuracy
Berlin & John [119]	2016	Harris corner-based interest points and histogram-based features (L)	Deep neural networks (DNNs)	RGB	Interaction	UT Interaction dataset [169]	95% accuracy on set1; 88% accuracy on set2
Huang et al. [120]	2016	Lie group features (L)	Lie Group Network (LieNet)	Depth	Interaction; Action/activity	(a) G3D-Gaming dataset [170]; (b) HDM05 dataset; (c) NTU RGBD dataset [171]	(a) 89.10% accuracy; (b) 75.78% (±2.26) accuracy; (c) 66.95% accuracy
Mo et al. [113]	2016	Automatically extracted features from skeletons data (D)	Convolutional neural networks (ConvNets) + multilayer perceptron	Depth	Interaction	CAD-60 dataset	81.8% accuracy
Shi et al. [55]	2016	Three stream sequential deep trajectory descriptor (L)	Recurrent neural networks (RNN) and deep convolutional neural networks (ConvNets)	RGB	Interaction; Action/activity	(a) KTH dataset; (b) HMDB51 dataset; (c) UCF101 dataset [172]	(a) 96.8% accuracy; (b) 65.2% accuracy; (c) 92.2% accuracy
Yang et al. [79]	2017	Low-level polynormals assembled from local neighboring hypersurface normals and then aggregated by Super Normal Vector (D)	Linear classifier	Depth	Interaction; Action/activity; Action primitive	(a) MSR Action3D dataset; (b) MSR Gesture3D dataset; (c) MSR ActionPairs3D dataset [173]; (d) MSR DailyActivity3D dataset	(a) 93.45% accuracy; (b) 94.74% accuracy; (c) 100% accuracy; (d) 86.25% accuracy
Jalal et al. [80]	2017	Multifeatures extracted from human body silhouettes and joints information (D)	HMM	Depth	Interaction; Action/activity	(a) Online self-annotated dataset [174]; (b) MSR DailyActivity3D dataset; (c) MSR Action3D dataset	(a) 71.6% accuracy; (b) 92.2% accuracy; (c) 93.1% accuracy