Review Article

A Review on Human Activity Recognition Using Vision-Based Method

Table 1

Taxonomy of the activity recognition literature.

Reference | Year | Representation (global/local/depth) | Classification | Modality | Level | Dataset | Performance result

Yamato et al. [94] | 1992 | Symbols converted from mesh feature vector and encoded by vector quantization (G) | HMM | RGB | Action/activity | Collected dataset: 3 subjects × 300 combinations | 96% accuracy
Darrell and Pentland [92] | 1993 | View model sets (G) | Dynamic time warping | RGB | Action primitive | Collected instances of 4 gestures | 96% accuracy (“Hello” gesture)
Brand et al. [102] | 1997 | 2D blob feature (G) | Coupled HMM (CHMM) | RGB | Action primitive | Collected dataset: 52 instances; 3 gestures × 17 times | 94.2% accuracy
Oliver et al. [97] | 2000 | 2D blob feature (G) | (i) CHMM; (ii) HMM | RGB | Interaction | Collected dataset: 11–75 training sequences + 20 testing sequences; organized as 5-level hierarchical interactions | (i) 84.68% accuracy (average); (ii) 98.43% accuracy (average)
Bobick and Davis [17] | 2001 | Motion energy image & motion history image (G) | Template matching by measuring Mahalanobis distance | RGB | Action/activity | Collected dataset: 18 aerobic exercises × 7 views | (a) 12/18 (single view); (b) 15/18 (multiple views)
Efros et al. [10] | 2003 | Optical flow (G) | K-nearest neighbor | RGB | Action/activity | (a) Ballet dataset; (b) tennis dataset; (c) football dataset | (a) 87.4% accuracy; (b) 64.3% accuracy; (c) 65.4% accuracy
Park and Aggarwal [103] | 2004 | Body model by combining an ellipse representation and a convex hull-based polygonal representation (G) | Dynamic Bayesian network | RGB | Interaction | Collected dataset: 56 instances; 9 interactions × 6 pairs of people | 78% accuracy
Schüldt et al. [105] | 2004 | Space-time interest points (L) | SVM | RGB | Action/activity | KTH dataset | 71.7% accuracy
Blank et al. [5] | 2005 | Space-time shape (G) | Spectral clustering algorithm | RGB | Action/activity | Weizmann dataset | 99.63% accuracy
Oikonomopoulos et al. [36] | 2005 | Spatiotemporal salient points (L) | RVM | RGB | Action/activity | Collected dataset: 152 instances; 19 activities × 4 subjects × 2 times | 77.63% recall
Dollar et al. [37] | 2005 | Space-time interest points (L) | (i) 1-nearest neighbor (1NN); (ii) SVM | RGB | Action/activity | KTH dataset | (i) 78.5% accuracy (1NN); (ii) 81.17% accuracy (SVM)
Ke et al. [38] | 2005 | Integral videos (L) | Adaboost | RGB | Action/activity | KTH dataset | 62.97% accuracy
Veeraraghavan et al. [93] | 2005 | Space-time shape (G) | Nonparametric methods by extending DTW | RGB | Action/activity | (a) USF dataset [154]; (b) CMU dataset [155]; (c) MOCAP dataset | No accuracy data presented
Duong et al. [98] | 2005 | High-level activities are represented as sequences of atomic activities; atomic activities are represented only by their durations (−) | Switching hidden semi-Markov model (S-HSMM) | RGB | Interaction | Collected dataset: 80 video sequences; 6 high-level activities | 97.5% accuracy (average; Coxian model)
Weinland et al. [20] | 2006 | Motion history volumes (G) | Principal component analysis (PCA) + Mahalanobis distance | RGB | Action/activity | IXMAS dataset [20] | 93.33% accuracy
Lu et al. [49] | 2006 | PCA-HOG (L) | HMM | RGB | Action/activity | (a) Soccer sequences dataset [10]; (b) Hockey sequences dataset [156] | The implemented system can track subjects in videos and recognize their activities robustly. No accuracy data presented.
Ikizler and Duygulu [18] | 2007 | Histogram of oriented rectangles encoded with BoVW (G) | (i) Frame-by-frame voting; (ii) global histogramming; (iii) SVM classification; (iv) dynamic time warping | RGB | Action/activity | Weizmann dataset | 100% accuracy (DTW)
Huang and Xu [19] | 2007 | Envelope shape acquired from silhouettes (G) | HMM | RGB | Action/activity; action primitive | Collected dataset: 9 activities × 7 subjects × 3 times × 3 views | Subject dependent + view independent: 97.3% accuracy; subject independent + view independent: 95.0% accuracy; subject independent + view dependent: 94.4% accuracy
Scovanner et al. [46] | 2007 | 3D SIFT (L) | SVM | RGB | Action/activity | Weizmann dataset | 82.6% accuracy
Vail et al. [106] | 2007 | | (i) HMM; (ii) conditional random field | | Interaction | Data from the hourglass and the unconstrained tag domains generated by robot simulator | 98.1% accuracy (CRF, hourglass); 98.5% accuracy (CRF, unconstrained tag domains)
Cherla et al. [21] | 2008 | Width feature of normalized silhouette box (G) | Dynamic time warping | RGB | Action/activity | IXMAS dataset [20] | 80.05% accuracy; 76.28% accuracy (cross view)
Tran and Sorokin [25] | 2008 | Silhouette and optical flow (G) | (i) Naïve Bayes (NB); (ii) 1-nearest neighbor (1NN); (iii) 1-nearest neighbor with rejection (1NN-R); (iv) 1-nearest neighbor with metric learning (1NN-M) | RGB | Interaction; Action/activity | (a) Weizmann dataset; (b) UMD dataset [15]; (c) IXMAS dataset [20]; (d) collected dataset: 532 instances, 10 activities × 8 subjects | (a) 100% accuracy; (b) 100% accuracy; (c) 81% accuracy; (d) 99.06% accuracy (1NN-M & L1SO)
Achard et al. [26] | 2008 | Semi-global features extracted from space-time micro volumes (L) | HMM | RGB | Action/activity | Collected dataset: 1614 instances; 8 activities × 7 subjects × 5 views | 87.39% accuracy (average)
Rodriguez et al. [91] | 2008 | Action MACH (maximum average correlation height) (G) | Maximum average correlation height filter | RGB | Interaction; Action/activity | (a) KTH dataset; (b) collected feature films dataset: 92 kissing + 112 hitting/slapping; (c) UCF dataset; (d) Weizmann dataset | (a) 80.9% accuracy; (b) 66.4% for kissing & 67.2% for hitting/slapping; (c) 69.2% accuracy; (d) reported a significant increase in algorithm efficiency, with no overall accuracy data presented
Kläser et al. [30] | 2008 | Histograms of oriented 3D spatiotemporal gradients (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) Weizmann dataset; (c) Hollywood dataset | (a) 91.4% (±0.4) accuracy; (b) 84.3% (±2.9) accuracy; (c) 24.7% precision
Willems et al. [39] | 2008 | Hessian-based STIP detector & SURF3D (L) | SVM | RGB | Action/activity | KTH dataset | 84.26% accuracy
Laptev et al. [50] | 2008 | STIP with HOG, HOF encoded with BoVW (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) Hollywood dataset | (a) 91.8% accuracy; (b) 38.39% accuracy (average)
Natarajan and Nevatia [95] | 2008 | 23 degrees body model (G) | Hierarchical variable transition HMM (HVT-HMM) | RGB | Action/activity; Action primitive | (a) Weizmann dataset; (b) gesture dataset in [157] | (a) 100% accuracy; (b) 90.6% accuracy
Natarajan and Nevatia [107] | 2008 | 2-layer graphical model: top layer corresponds to actions in a particular viewpoint; lower layer corresponds to individual poses (G) | Shape, flow, duration-conditional random field (SFD-CRF) | RGB | Action/activity | Collected dataset: 400 instances; 6 activities × 4 subjects × 16 views (×6 backgrounds) | 78.9% accuracy
Ning et al. [108] | 2008 | Appearance and position context (APC) descriptor encoded by BoVW (L) | Latent pose conditional random fields (LPCRF) | RGB | Action/activity; Action primitive | HumanEva dataset | 95.0% accuracy (LPCRFinit)
Marszalek et al. [158] | 2009 | SIFT, HOG, HOF encoded by BoVW (L) | SVM | RGB | Interaction | Hollywood2 dataset | 35.5% accuracy
Li et al. [76] | 2010 | Action graph of salient postures (D) | Non-Euclidean relational fuzzy (NERF) C-means & Hausdorff distance-based dissimilarity measure | Depth | Action/activity | MSR Action3D dataset | 91.6% accuracy (train/test = 1/2); 94.2% accuracy (train/test = 2/1); 74.7% accuracy (train/test = 1/1 & cross subject)
Suk et al. [101] | 2010 | YIQ color model for skin pixels; histogram-based color model for face region; optical flow for tracking of hand motion (L) | Dynamic Bayesian network | RGB | Action primitive | Collected dataset: 498 instances; (a) 10 gestures × 7 subjects × 7 times (isolated gestures); (b) 8 longer videos containing 50 gestures (continuous gestures) | (a) 99.59% accuracy; (b) 84% recall & 80.77% precision
Baccouche et al. [124] | 2010 | SIFT descriptor encoded by BoVW (L) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | RGB | Interaction | MICC-Soccer-Actions-4 dataset [159] | 92% accuracy
Kumari and Mitra [29] | 2011 | Discrete Fourier transform on silhouettes (G) | K-nearest neighbor | RGB | Action/activity | (a) MuHaVi dataset; (b) DA-IICT dataset | (a) 96% accuracy; (b) 82.6667% accuracy
Wang et al. [51] | 2011 | Dense trajectory with HOG, HOF, MBH (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) YouTube dataset; (c) Hollywood2 dataset; (d) UCF Sport dataset | (a) 94.2% accuracy; (b) 84.2% accuracy; (c) 58.3% accuracy; (d) 88.2% accuracy
Wang et al. [56] | 2012 | STIP with HOG, HOF encoded with various encoding methods (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) HMDB51 dataset | (a) 92.13% accuracy (Fisher vector); (b) 29.22% accuracy (Fisher vector)
Zhao et al. [77] | 2012 | Combined representations: (a) RGB: HOG & HOF upon space-time interest points (L); (b) depth: local depth pattern at each interest point (D) | SVM | RGB-D | Interaction | RGBD-HuDaAct dataset | 89.1% accuracy
Yang et al. [78] | 2012 | DMM-HOG (D) | SVM | Depth | Action/activity | MSR Action3D dataset | 95.83% accuracy (train/test = 1/2); 97.37% accuracy (train/test = 2/1); 91.63% accuracy (train/test = 1/1 & cross subject)
Xia et al. [84] | 2012 | Histograms of 3D joint locations (D) | HMM | Depth | Action/activity | (a) collected dataset: 6220 frames, 200 samples; 10 activities × 10 subjects × 2 times; (b) MSR Action3D dataset | (a) 90.92% accuracy; (b) 97.15% accuracy (highest); 78.97% accuracy (cross subject)
Yang and Tian [85] | 2012 | EigenJoints (D) | Naïve-Bayes-Nearest-Neighbor (NBNN) | Depth | Action/activity | MSR Action3D dataset | 96.8% accuracy; 81.4% accuracy (cross subject)
Wang et al. [160] | 2012 | Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D) | SVM | Depth | Interaction; Action/activity | (a) MSR Action3D dataset; (b) MSR Action3DExt dataset; (c) CMU MOCAP dataset | (a) 88.2% accuracy; (b) 85.75% accuracy; (c) 98.13% accuracy
Wang et al. [53] | 2013 | Improved dense trajectory with HOG, HOF, MBH (L) | SVM | RGB | Interaction | (a) Hollywood2 dataset; (b) HMDB51 dataset; (c) Olympic Sports dataset [161]; (d) UCF50 dataset [162] | (a) 64.3% accuracy; (b) 57.2% accuracy; (c) 91.1% accuracy; (d) 91.2% accuracy
Oreifej and Liu [74] | 2013 | Histogram of oriented 4D surface normals (D) | SVM | Depth | Action/activity; Action primitive | (a) MSR Action3D dataset; (b) MSR Gesture3D dataset; (c) Collected 3D Action Pairs dataset | (a) 88.89% accuracy; (b) 92.45% accuracy; (c) 96.67% accuracy
Chaaraoui [88] | 2013 | Combined representations: (a) RGB: silhouette (G); (b) depth: skeleton joints (D) | Dynamic time warping | RGB-D | Action/activity | MSR Action3D dataset | 91.80% accuracy
Ren et al. [152] | 2013 | Time-series curve of hand shape (G) | Dissimilarity measure based on Finger-Earth Mover’s Distance (FEMD) | RGB | Action primitive | Collected dataset: 1000 instances; 10 gestures × 10 subjects × 10 times | 93.9% accuracy
Ni et al. [163] | 2013 | Depth-Layered Multi-Channel STIPs (L) | SVM | RGB-D | Interaction | RGBD-HuDaAct database | 81.48% accuracy (codebook size = 512 & SPM kernel)
Grushin et al. [123] | 2013 | STIP with HOF (L) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | RGB | Action/activity | KTH dataset | 90.7% accuracy
Peng et al. [31] | 2014 | (i) STIP with HOG, HOF encoded by various encoding methods (L); (ii) iDT with HOG, HOF, MBHx, MBHy encoded by various encoding methods (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; (b) UCF50 dataset; (c) UCF101 dataset | Hybrid representation: (a) 61.1% accuracy; (b) 92.3% accuracy; (c) 87.9% accuracy
Peng et al. [32] | 2014 | Improved dense trajectory encoded with stacked Fisher kernel (L) | SVM | RGB | Interaction; Action/activity | (a) YouTube dataset; (b) HMDB51 dataset; (c) J-HMDB dataset | (a) 93.38% accuracy; (b) 66.79% accuracy; (c) 67.77% accuracy
Wang et al. [82] | 2014 | Local occupancy pattern for depth maps & Fourier temporal pyramid for temporal representation & actionlet ensemble model for characterizing activities (D) | SVM | Depth | Interaction; Action/activity | (a) MSR Action3D dataset; (b) MSR DailyActivity3D dataset; (c) Multiview 3D event dataset; (d) Cornell Activity Dataset [164] | (a) 88.2% accuracy; (b) 85.75% accuracy; (c) 88.34% accuracy (cross subject); 86.76% accuracy (cross view); (d) 97.06% accuracy (same person); 74.70% accuracy (cross person)
Simonyan and Zisserman [115] | 2014 | Spatial stream ConvNets & optical flow based temporal stream ConvNets (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; (b) UCF101 dataset | (a) 59.4% accuracy; (b) 88.0% accuracy
Lan et al. [33] | 2015 | Improved dense trajectory with HOG, HOF, MBHx, MBHy enhanced with multiskip feature tracking (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; (b) Hollywood2 dataset; (c) UCF101 dataset; (d) UCF50 dataset; (e) Olympic Sports dataset | (a) 65.1% accuracy (L = 3); (b) 68.0% accuracy (L = 3); (c) 89.1% accuracy (L = 3); (d) 94.4% accuracy (L = 3); (e) 91.4% accuracy (L = 3)
Shahroudy et al. [83] | 2015 | Combined representations: (a) RGB: dense trajectories with HOG, HOF, MBH (L); (b) depth: skeleton joints (D) | SVM | RGB-D | Interaction | MSR DailyActivity3D | 81.9% accuracy
Wang et al. [114] | 2015 | Weighted hierarchical depth motion maps (D) | Three-channel deep convolutional neural networks (3ConvNets) | Depth | Interaction; Action/activity | (a) MSR Action3D dataset; (b) MSR Action3DExt dataset; (c) UTKinect Action dataset [84]; (d) MSR DailyActivity3D dataset; (e) Combined dataset of above | (a) 100% accuracy; (b) 100% accuracy; (c) 90.91% accuracy; (d) 85% accuracy; (e) 91.56% accuracy
Wang et al. [165] | 2015 | Pseudo-color images converted from DMMs (D) | Three-channel deep convolutional neural networks (3ConvNets) | Depth | Interaction; Action/activity | (a) MSR Action3D dataset; (b) MSR Action3DExt dataset; (c) UTKinect Action dataset [84] | (a) 100% accuracy; (b) 100% accuracy; (c) 90.91% accuracy
Wang et al. [117] | 2015 | Trajectory-pooled deep-convolutional descriptor encoded by Fisher kernel (L) | SVM | RGB | Interaction | (a) HMDB51 dataset; (b) UCF101 dataset | (a) 65.9% accuracy; (b) 91.5% accuracy
Veeriah et al. [125] | 2015 | (i) HOG3D in KTH 2D action dataset (L); (ii) skeleton-based features including skeleton positions, normalized pairwise angles, offset of joint positions, histogram of the velocity, and pairwise joint distances (D) | Differential recurrent neural network (dRNN) | RGB-D | Action/activity | (a) KTH dataset; (b) MSR Action3D dataset | (a) 93.96% accuracy (KTH-1); 92.12% accuracy (KTH-2); (b) 92.03% accuracy
Du et al. [126] | 2015 | Representations of skeleton data extracted by subnets (D) | Hierarchical bidirectional recurrent neural network (HBRNN) | RGB-D | Action/activity | (a) MSR Action3D dataset; (b) Berkeley MHAD Action dataset [166]; (c) HDM05 dataset [167] | (a) 94.49% accuracy; (b) 100% accuracy; (c) 96.92% (±0.50) accuracy
Zhen et al. [58] | 2016 | STIP with HOG3D encoded with various encoding methods (L) | SVM | RGB | Interaction; Action/activity | (a) KTH dataset; (b) UCF YouTube dataset; (c) HMDB51 dataset | (a) 94.1% (Local NBNN); (b) 63.0% (improved Fisher kernel); (c) 30.5% (improved Fisher kernel)
Chen et al. [81] | 2016 | Action graph of skeleton-based features (D) | Maximum likelihood estimation | Depth | Action/activity | (a) MSR Action3D dataset; (b) UTKinect Action dataset | (a) 95.56% accuracy (cross subject); 96.1% accuracy (three subset evaluation); (b) 95.96% accuracy
Zhu et al. [87] | 2016 | Co-occurrence features of skeleton joints (D) | Recurrent neural networks (RNN) with long short-term memory (LSTM) | Depth | Interaction; Action/activity | (a) SBU Kinect interaction dataset [168]; (b) HDM05 dataset; (c) CMU dataset; (d) Berkeley MHAD Action dataset | (a) 90.41% accuracy; (b) 97.25% accuracy; (c) 81.04% accuracy; (d) 100% accuracy
Li et al. [116] | 2016 | VLAD for deep dynamics (G) | Deep convolutional neural networks (ConvNets) | RGB | Interaction; Action/activity | (a) UCF101 dataset; (b) Olympic Sports dataset; (c) THUMOS15 dataset [116] | (a) 84.65% accuracy; (b) 90.81% accuracy; (c) 78.15% accuracy
Berlin and John [119] | 2016 | Harris corner-based interest points and histogram-based features (L) | Deep neural networks (DNNs) | RGB | Interaction | UT Interaction dataset [169] | 95% accuracy on set1; 88% accuracy on set2
Huang et al. [120] | 2016 | Lie group features (L) | Lie Group Network (LieNet) | Depth | Interaction; Action/activity | (a) G3D-Gaming dataset [170]; (b) HDM05 dataset; (c) NTU RGBD dataset [171] | (a) 89.10% accuracy; (b) 75.78% (±2.26) accuracy; (c) 66.95% accuracy
Mo et al. [113] | 2016 | Automatically extracted features from skeleton data (D) | Convolutional neural networks (ConvNets) + multilayer perceptron | Depth | Interaction | CAD-60 dataset | 81.8% accuracy
Shi et al. [55] | 2016 | Three stream sequential deep trajectory descriptor (L) | Recurrent neural networks (RNN) and deep convolutional neural networks (ConvNets) | RGB | Interaction; Action/activity | (a) KTH dataset; (b) HMDB51 dataset; (c) UCF101 dataset [172] | (a) 96.8% accuracy; (b) 65.2% accuracy; (c) 92.2% accuracy
Yang et al. [79] | 2017 | Low-level polynormals assembled from local neighboring hypersurface normals and then aggregated by Super Normal Vector (D) | Linear classifier | Depth | Interaction; Action/activity; Action primitive | (a) MSR Action3D dataset; (b) MSR Gesture3D dataset; (c) MSR ActionPairs3D dataset [173]; (d) MSR DailyActivity3D dataset | (a) 93.45% accuracy; (b) 94.74% accuracy; (c) 100% accuracy; (d) 86.25% accuracy
Jalal et al. [80] | 2017 | Multifeatures extracted from human body silhouettes and joints information (D) | HMM | Depth | Interaction; Action/activity | (a) Online self-annotated dataset [174]; (b) MSR DailyActivity3D dataset; (c) MSR Action3D dataset | (a) 71.6% accuracy; (b) 92.2% accuracy; (c) 93.1% accuracy