Abstract

In the current era of technological development, human actions can be recorded in public places like airports, shopping malls, and educational institutes to monitor suspicious activities such as terrorism, fighting, theft, and vandalism. Surveillance videos contain adequate visual and motion information about events that occur within a camera’s view. Our study builds on the idea that actions are a sequence of moving body parts. In this paper, a new descriptor is proposed that formulates human poses, tracks the relative motion of human body parts across the video frames, and extracts the position and orientation of the body parts. We use Part Affinity Fields (PAFs) to associate the body parts of the people present in each frame. The architecture jointly learns the body parts and their associations with other body parts in a sequential process, so that a pose can be built up step by step. We obtain the complete pose with a limited number of points, follow it as it moves through the video, and conclude with a defined action. These feature points are then classified with a Support Vector Machine (SVM). The proposed work was evaluated on benchmark datasets, namely the UT-Interaction, UCF11, CASIA, and HCA datasets, which contain criminal/suspicious actions such as kicking, punching, pushing, gun shooting, and sword-fighting, and achieved an accuracy of 96.4% on UT-Interaction, 99% on UCF11, 98% on CASIA, and 88.72% on HCA.

1. Introduction

Government and security institutions install surveillance cameras in homes, markets, hospitals, shopping malls, and public places to capture real-time events and ensure the safety of people. In the threat-laden context of vandalism, terrorism, or suspicious activities, surveillance videos are of the utmost necessity for any incident investigation. These threatening situations highlight the critical need for a suspicious action recognition system to aid forensic experts in capturing criminals and resolving their investigations. Action recognition encompasses the detection, understanding, and classification of simple actions like clapping, walking, or meeting. In recent years, scholars have begun investigating actions in complex environments such as sports. Turning to criminal actions, a crime can be defined as an action harmful to an individual, a community, or society, and it can take many forms, such as homicide, robbery, burglary, and cybercrime. Criminal actions are less studied, and datasets that provide substantial criminal actions are scarce. A criminal action arises from the interaction of a potential victim and an offender, and the offender’s motivation decreases when he is conscious of being watched [1]. Criminal actions differ markedly from a person’s regular activities: threatening gestures, poses, and movements are unlike other normal actions, which makes them difficult to recognize.

Human motion analysis is one of the most active research areas in computer vision. Motion analysis can be divided into two different tasks. The first is to describe the physical movements of the body parts, e.g., the raising of a hand or the turning of the head; pose estimation and tracking of body parts are useful methods for this. The second is to describe the semantics of the movements, such as picking up an object or shaking hands.

Action recognition approaches require a large amount of data to process actions, which in turn requires computational power. Nevertheless, action recognition is receiving immense focus from the research community due to its considerable range of applications. The action recognition process can generally be subdivided into preprocessing, feature extraction, feature encoding, and classification. There is still substantial room to explore in feature extraction and encoding, whereas the classification stage is very mature. Currently, features are either handcrafted or learned (deep). The most widespread handcrafted feature extraction methods are the Histogram of Oriented Gradients (HOG) [2], the Histogram of Optical Flow (HOF) [3], Motion Boundary Histograms (MBH) [4], and the Scale Invariant Feature Transform (SIFT) [5]. These descriptors extract features from various regions of a video, such as interest points, dense samplings [6], and motion trajectories [7]. Recently, extracting features using deep neural networks has inspired new directions in the field and is achieving impressive results [8–10]. Feature encoding translates features into a feature space. Fisher Vectors [11] and the Vector of Locally Aggregated Descriptors (VLAD) [12] are commonly used and perform well in many solutions [8, 10, 13]. However, these encoding schemes discard spatiotemporal information, which is vital when dealing with videos. Another popular method [14], known as the “Bag of Expression (BOE)” model, provides an encoding solution that maintains the spatiotemporal information. With the advancements in deep neural networks, learned features achieve better results than handcrafted ones. Their main advantage is the higher discriminative power of the top network layers, which are learned from low-level features and transformed through deep layers, whereas handcrafted solutions mostly contain low-level information such as edges and corners [9, 15–17]. Currently, three-dimensional poses can be extracted from monocular images or videos, where the human body is represented as a stick skeleton surrounded by surface-based (polygon) or volumetric (sphere or cylinder) flesh [18].

In the last few years, researchers have also explored variants of action recognition using suitable sensors [19]. Sensor-based recognition relies on time-series data collected from accelerometers, either in mobile phones [20, 21] or wrist-worn devices [22, 23], as well as magnetometers and gyroscopes [24]. In these approaches, the raw data acquired from the sensors is preprocessed and normalized, and features are extracted either manually [25] or with a CNN [26]. The time-series data is segmented sequentially into smaller segments, and each segment is labeled based on the feature response within it. Common models for analyzing time series include the moving average (MA)/sliding window, autoregression (AR), and the Autoregressive Moving Average (ARMA) [27]. In our work, we use the moving average, as it models the next step in the sequence as a linear function; in our case, we only link the interest points present in the first frame with those in the following frames.

In recent years, pose estimation methods have become more complex and accurate. Many studies on pose estimation have concentrated on finding the body parts of a single actor in an image [28]. One approach to this problem, named the top-down approach [29], reduces multiperson pose estimation to the single-person case: a person detector is applied to the image, and single-person pose estimation is then performed inside the bounding box of each detection. Examples of systems that use a top-down approach include [29–31]. However, this approach introduces several additional problems. The main one arises when a nonactor is detected as an actor, which increases pose estimation errors. Second, regions with detected persons may overlap and include body parts of different individuals, making it difficult for pose estimation algorithms to associate the detected body parts with the corresponding actor.

Several multiperson bottom-up pose estimation systems have been developed [28, 32] that use deep neural networks to achieve better pose estimation performance. Pose estimation and action recognition can also be performed jointly, as they are closely related; reference [33] used action recognition methods to build a pose estimation algorithm. Our approach adopts the preprocessing steps of [28], as the idea is to extract the motion information of the human body parts. In our work, pose estimation serves as the baseline for feature extraction: the frames extracted from the video are converted into feature maps, which are fed to the CNN to provide the location of each actor’s body parts in the frame. The location of each part is stored as a feature representation and used to track the respective motion across the video frames. The performance of our approach depends on how accurately the human poses are estimated and linked with the associated body parts. The main research contributions of this study are as follows:
(i) We use the CNN of [28] to extract the limbs in a frame and use that information to further localize the limbs.
(ii) The pose (skeleton) is modified and restructured to give extra weight to the head and neck area, as the head plays a vital role in suspicious actions.
(iii) The features extracted from the previous and current frames are stored and serve as guidance to relate the motion temporally; this also helps the descriptor cope with situations where body parts are missing or occluded.

The rest of the paper is arranged as follows: in Section 2, we describe the proposed algorithm. In Section 3, we discuss the experiments and their results. The conclusions are summarized in Section 4.

2. Proposed Approach

Our approach is based on the pose estimations of actors in video clips. The proposed approach is given in Figure 1. Each step is elaborated in the sections below.

2.1. Feature Extraction

For feature extraction, the network of [28] is used to extract the body parts and their associations with each other, which together constitute a full skeleton (pose). The videos are decomposed into image frames and reshaped to 368 × 654 × 3 to fit the GPU memory. Each frame is first passed through a CNN (the first 10 layers of VGG-19, fine-tuned), which generates the feature maps (F) that are input to the network. The network is divided into two branches with multiple stages; each stage consists of six 3 × 3 convolution layers and two 1 × 1 convolution layers, with a max-pooling layer at the end of the stage. The feature maps from the preprocessing stage (VGG-19 layers) are used as input to both branches. The first branch works as a feedforward network that calculates the 2D confidence maps (S), and the second branch calculates the set of part affinity vectors (L). At the end of each stage, the outputs of both branches are concatenated with the image feature maps. This process is iterative and successively refines the predictions of the previous stage, providing the degree of association between the body parts of each actor. S contains J confidence maps, one per body part, and L contains C vector fields, one per limb. Here, we reduce the matrices and extract only 18 key points (joints) per actor, connected by 17 limbs, which keeps the dimensionality of the feature vectors low.
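For concreteness, the sketch below outlines what such a two-branch, multi-stage network can look like in PyTorch. It is a minimal illustration in the spirit of [28], not the exact implementation: the class name `PoseFeatureExtractor`, the number of stages, and the layer widths in `_branch` are our own assumptions.

```python
# Hypothetical sketch of a two-branch, multi-stage pose network (OpenPose-style).
# Stage count, layer widths, and class names are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

class PoseFeatureExtractor(nn.Module):
    def __init__(self, num_joints=18, num_limbs=17, stages=3):
        super().__init__()
        # Backbone: early VGG-19 layers produce the feature maps F.
        vgg = torchvision.models.vgg19(weights=None).features
        self.backbone = nn.Sequential(*list(vgg.children())[:23])
        self.stages = nn.ModuleList()
        in_ch = 512
        for _ in range(stages):
            # Branch 1 predicts J confidence maps S; branch 2 predicts 2C affinity channels L.
            self.stages.append(nn.ModuleDict({
                "conf": self._branch(in_ch, num_joints),
                "paf": self._branch(in_ch, 2 * num_limbs),
            }))
            # Later stages also see the previous S and L concatenated with F.
            in_ch = 512 + num_joints + 2 * num_limbs

    @staticmethod
    def _branch(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, out_ch, 1),
        )

    def forward(self, frames):                      # frames: (B, 3, H, W)
        feats = self.backbone(frames)               # feature maps F
        x = feats
        for stage in self.stages:
            S = stage["conf"](x)                    # confidence maps, J channels
            L = stage["paf"](x)                     # part affinity fields, 2C channels
            x = torch.cat([feats, S, L], dim=1)     # refined in the next stage
        return S, L
```

Each stage sees the backbone feature maps F together with the previous stage’s S and L, which is what allows the predictions to be refined iteratively.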

The resultant confidence maps and affinity vector fields are given in equations (1) and (2), respectively:

$S = (S_1, S_2, \ldots, S_J)$, with $S_j \in \mathbb{R}^{w \times h}$, $j \in \{1, \ldots, J\}$, (1)

$L = (L_1, L_2, \ldots, L_C)$, with $L_c \in \mathbb{R}^{w \times h \times 2}$, $c \in \{1, \ldots, C\}$. (2)

To fine-tune the network for precise detection of the body parts, loss functions are applied between the estimated predictions and the ground truth. The loss functions [28] are as follows:

$f_S = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \lVert S_j(p) - S_j^{*}(p) \rVert_2^2,$

$f_L = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \lVert L_c(p) - L_c^{*}(p) \rVert_2^2,$

where $S_j^{*}$ and $L_c^{*}$ represent the ground-truth confidence map and part affinity field, respectively, and $W(p)$ is a window (mask) function that is zero where the annotation is missing at image location $p$. The whole process is pictured in Figure 2: (a) the network takes an image as input, (b) calculates the confidence maps for each body part, (c) in parallel computes the part affinity fields, (d) uses the information from (b) and (c) to join the relative body parts of each candidate, and (e) assembles them into a full-body pose.
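As a rough illustration of the masked L2 penalties described above (a sketch under our own naming, not the authors’ code), the per-stage losses can be computed as follows, with `W` acting as the binary mask that zeroes image locations with missing annotations.

```python
import torch

def stage_losses(S_pred, L_pred, S_gt, L_gt, W):
    """Masked L2 losses between predicted and ground-truth maps.

    S_pred, S_gt: (B, J, H, W) confidence maps
    L_pred, L_gt: (B, 2C, H, W) part affinity fields
    W:            (B, 1, H, W) binary mask, 0 where annotations are missing
    """
    f_S = (W * (S_pred - S_gt) ** 2).sum()
    f_L = (W * (L_pred - L_gt) ** 2).sum()
    return f_S, f_L
```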

2.2. Formulation of Feature Vector

The extracted features combine the body parts, in the form of affinity vectors, with the confidence maps discussed above. We identify the body parts that work together to perform a motion; the movement and orientation of the joints also help to form the action. We decompose the body parts with the help of the affinity vectors and encode them as a set of 18 key points, associated with the “joints,” and 17 lines connecting these joints, associated with the “limbs” of the body. Our main aim is to capture the motion of each limb and joint as a vector of coordinate locations and orientations: the coordinates give the location of each limb in each frame, and the orientation encapsulates the direction of motion.

We encode each limb with its x and y locations, giving the position and orientation of the limbs in each frame as 34 values (17 x and 17 y coordinates). Concatenating two consecutive frames yields a set of 68 values, so that the movement of each limb can be tracked separately; the formed skeleton and the calculated coordinates are shown in Figure 3. Our approach is efficient because of the very low count of interest points per frame compared to recent approaches [8, 13, 33–35]. This representation encodes both the position and the orientation of the body parts. To account for possible differences in frame sizes, we use coordinates relative to the frame center instead of the usual image coordinates (the center of the frame is the origin). An example of the visual representation from a single frame is shown in Figure 4. It is also important to note that the pose estimation stage sometimes cannot extract the full pose from a frame, and the number of extracted poses may differ between frames, mainly due to occlusion.
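A minimal sketch of this encoding, under our own function names, is given below: the frame center is subtracted so the representation is independent of frame size, each frame contributes 34 values, and two consecutive frames are concatenated into the 68-value descriptor.

```python
import numpy as np

def encode_pose(limb_xy, frame_w, frame_h):
    """Encode one frame's pose as 34 values (x, y per limb), relative to the frame center.

    limb_xy: (17, 2) array of limb coordinates; NaN where a limb was not detected.
    """
    center = np.array([frame_w / 2.0, frame_h / 2.0])
    return (np.asarray(limb_xy, dtype=float) - center).reshape(-1)   # (34,)

def encode_pair(prev_limb_xy, curr_limb_xy, frame_w, frame_h):
    """Concatenate two consecutive frames into the 68-value descriptor."""
    return np.concatenate([
        encode_pose(prev_limb_xy, frame_w, frame_h),
        encode_pose(curr_limb_xy, frame_w, frame_h),
    ])
```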

2.3. Action Formulation with Time Series

The next task is to combine the per-frame information about joints and limbs to form an action. The results are obtained for each frame individually, so it is important to connect them to obtain continuous movement for each actor. For each actor, we compute the centroid of all computed points as a reference point for comparison, and estimations in consecutive frames are connected if their centroids are closest between the frames. During the formulation of actions, a few situations may arise (see the linking sketch after this list):
(i) Partial occlusion or self-occlusion of body parts: pose estimation cannot produce all the key points, and only partial information about the pose is obtained.
(ii) Disappearing from the frame: a detected actor may leave the camera view during the video segment.
(iii) Incorrect pose estimation: a pose may be detected that does not belong to any actor in the video segment. An example of such a misdetection is shown in Figure 5.
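The centroid-based linking referred to above can be sketched as follows (a simplification with illustrative names and an assumed distance gate, not the exact matching rule): each pose detected in the current frame is attached to the track whose last centroid is nearest, and poses that match no track start a new one, which also absorbs occasional misdetections.

```python
import numpy as np

def centroid(pose_xy):
    """Centroid of the visible key points (pose_xy: (18, 2), NaN for missing parts)."""
    return np.nanmean(pose_xy, axis=0)

def link_poses(tracks, current_poses, max_dist=50.0):
    """Attach each pose in the current frame to the nearest existing track.

    tracks:        list of dicts {"centroid": (2,), "poses": [...]}
    current_poses: list of (18, 2) arrays detected in the current frame
    max_dist:      illustrative gating threshold in pixels (an assumption)
    """
    for pose in current_poses:
        c = centroid(pose)
        dists = [np.linalg.norm(c - t["centroid"]) for t in tracks]
        if dists and min(dists) < max_dist:
            track = tracks[int(np.argmin(dists))]
        else:                                   # new actor or misdetection
            track = {"centroid": c, "poses": []}
            tracks.append(track)
        track["poses"].append(pose)
        track["centroid"] = c                   # update the reference for the next frame
    return tracks
```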

To construct a time series [36] that tracks the motion of the actors in the video, we start with the first frame and compute the number of interest points present in it. If there is only one actor, we have 17 points with their coordinate values, which are used in the next frame to track the actor’s motion. Similarly, in subsequent frames, we combine the previous coordinates with the current locations of the interest points (joints), giving a total of 68 values across two consecutive frames. This process continues until all frames have been processed. After the actions are formulated, they are evaluated by comparison with the original input videos: we first measure the average motion in the original video and then perform the same operation on the time-series-based, feature-extracted video, as shown in Figure 6. The upper part of the figure depicts the average motion of three videos containing different actions, and the lower part shows the average motion of the same videos computed from the extracted feature points alone. The action can already be predicted from the lower portion of the figure, and the environmental noise is removed because our descriptor depends only on the body parts and their relative motion.
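The comparison in Figure 6 can be reproduced in spirit with the sketch below, which contrasts the average motion computed from raw frame differences with the average motion computed only from the tracked interest points; the exact measure used for the figure may differ.

```python
import numpy as np

def average_motion_from_frames(frames):
    """Mean absolute pixel difference between consecutive grayscale frames (N, H, W)."""
    frames = np.asarray(frames, dtype=float)
    return np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))

def average_motion_from_points(point_series):
    """Mean displacement of the tracked interest points between consecutive frames.

    point_series: (N, 17, 2) array of limb coordinates per frame (NaN for missing parts).
    """
    disp = np.linalg.norm(np.diff(point_series, axis=0), axis=2)   # (N-1, 17)
    return np.nanmean(disp, axis=1)
```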

The descriptor is further evaluated by calculating the Motion History Image (MHI) [37], which represents the motion in a single static template. It highlights the location and path of the motion across the frames and represents the motion history at each location, with brighter values signifying the most recent motion. MHI is widely used in video processing, where a video sequence is collapsed into a single image containing the motion flow and the moving parts of the action video. The MHIs of the original video and the feature-extracted video are shown in Figure 7: the upper portion shows the original videos of different actions, and the lower portion shows the MHI of the feature-extracted video. The features contain only the motion of the moving body parts and no other information, so the MHI of the extracted video contains only the information relevant to the actions happening in the video.
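A simple way to obtain such a Motion History Image is sketched below, assuming binary motion masks per frame (for example, from thresholded frame differences for the original video, or from rendered interest points for the feature-extracted video); recent motion keeps the highest intensity and older motion decays linearly. The decay length `duration` is an illustrative assumption.

```python
import numpy as np

def motion_history_image(motion_masks, duration=30):
    """Build an MHI from a sequence of binary motion masks (N, H, W).

    Pixels that moved in the most recent frame get the maximum value;
    older motion fades by one step per frame until it disappears.
    """
    mhi = np.zeros(motion_masks[0].shape, dtype=float)
    for mask in motion_masks:
        mhi = np.where(mask > 0, duration, np.maximum(mhi - 1, 0))
    return mhi / duration      # normalized to [0, 1]: brighter = more recent motion
```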

Next, we calculate the gradient of the values in the time series. Segments with a high gradient magnitude indicate that an actor performed a lot of movement at that point, whereas segments with a gradient close to zero indicate parts of the video where the actor remained still. This allows us to remove the portions at the beginning and end of the video segment where the actors did not perform any action, leaving only the localized part of the video and enabling more accurate action segmentation. The average gradient is visualized in Figure 8 and computed as

$\bar{g}(n) = \frac{1}{t} \sum_{i=1}^{t} \left| \frac{\partial D_i(n)}{\partial n} \right|,$

where $t$ is the total number of interest points present in a frame and $D$ denotes the ensemble of interest-point coordinates and orientations along $n$ (the number of frames). Frames with a very low average gradient depict either no motion or very small motion; excluding such frames not only improves the classification but also reduces computation time.
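The gradient-based trimming can be sketched as follows: the gradient of every descriptor value is taken along the frame axis, its magnitude is averaged per frame, and frames whose average gradient stays below a small threshold are dropped from the ends of the clip. The threshold value is an illustrative assumption.

```python
import numpy as np

def average_gradient(series):
    """series: (num_frames, num_points) descriptor values; returns per-frame mean |gradient|."""
    grad = np.gradient(series, axis=0)
    return np.abs(grad).mean(axis=1)

def trim_still_frames(series, threshold=0.5):
    """Remove leading/trailing frames with almost no motion, keeping the localized action."""
    g = average_gradient(series)
    active = np.where(g > threshold)[0]
    if active.size == 0:
        return series
    return series[active[0]:active[-1] + 1]
```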

The resultant vector represents the motion or actions in the video. We combine the descriptors along the temporal axis into a time series so that the action can be analyzed in more detail. In the case of missing frames or key points occluded by objects, we check whether a body part remains invisible for the entire clip; if so, it is removed from consideration. For the rest, we fill small gaps with the closest previous or future value. We then apply a Savitzky–Golay filter [38] to exclude random spikes and obtain a smooth time series of the movement. This filter increases precision without shifting any interest points, and as a result we obtain a vector of 68 interest-point gradients over the total number of frames in the video.
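A minimal version of this cleanup step might look like the following, using pandas to fill short gaps from neighboring frames and `scipy.signal.savgol_filter` for the smoothing; the gap limit, window length, and polynomial order are illustrative assumptions.

```python
import pandas as pd
from scipy.signal import savgol_filter

def clean_time_series(series, max_gap=5, window=9, polyorder=2):
    """series: (num_frames, 68) descriptor values, NaN where a body part was missed."""
    df = pd.DataFrame(series)
    # Drop body parts that were never visible in the whole clip.
    df = df.dropna(axis=1, how="all")
    # Fill short gaps with the closest previous or future value
    # (longer gaps would still need separate handling).
    df = df.ffill(limit=max_gap).bfill(limit=max_gap)
    # Savitzky-Golay smoothing removes random spikes without shifting the interest points.
    return savgol_filter(df.to_numpy(), window_length=window, polyorder=polyorder, axis=0)
```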

2.4. Action Classification with Proposed Descriptor

We calculated feature vectors for each video, as shown in Figure 9; the extracted features show the human movement along the frames. We then trained a classifier to distinguish between the different actions in the videos. We used a set of SVM classifiers [39], one for each class of actions, with varying sigmoid kernel parameters to best fit our feature data. The performance of the SVM classifier for different kernel parameters is shown in Figure 10; the graph shows that a kernel parameter of 0.5 gives the best results for the UT-Interaction dataset. Each classifier estimates the probability that the action performed in the analyzed video belongs to its specific category.
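The kernel search behind Figure 10 can be approximated with a sketch like the one below, which fits a sigmoid-kernel SVM for several kernel coefficient values and keeps the best cross-validated one; the candidate values, including 0.5, are placeholders for the sweep we describe.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def best_sigmoid_svm(X, y, coefs=(0.1, 0.25, 0.5, 1.0)):
    """Pick the sigmoid-kernel coefficient that gives the best cross-validated accuracy."""
    best, best_score = None, -1.0
    for g in coefs:
        clf = make_pipeline(StandardScaler(),
                            SVC(kernel="sigmoid", gamma=g, probability=True))
        score = cross_val_score(clf, X, y, cv=10).mean()
        if score > best_score:
            best, best_score = clf.fit(X, y), score
    return best, best_score
```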

The classifier for the first class of actions is trained on the entire training set with two labels, “first action” and “not first action,” assigned to the video segments. The video segments of that class are then excluded from the dataset, and the classifier for the second class is trained on the remaining data with the labels “second action” and “not second action,” and so on. In the case of N classes of actions, there are N − 1 classifiers, and the last classifier distinguishes between actions “N − 1” and “N.” Additionally, we use a sequential feature selection technique for each classifier to reduce the number of predictors used in the classification, so that only the information about the body-part movements most relevant for that class of actions is used.
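A simplified version of this cascade, under our own naming, is sketched below: one binary SVM is trained per class on the videos still in play, the videos of that class are then removed, and sequential feature selection restricts each classifier to the most relevant body-part movements.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

def train_cascade(X, y, classes, n_features=20):
    """Train N-1 binary classifiers; the last one separates the final two classes."""
    X, y = np.asarray(X), np.asarray(y)
    cascade = []
    for cls in classes[:-1]:
        labels = (y == cls).astype(int)              # "this action" vs "not this action"
        selector = SequentialFeatureSelector(
            SVC(kernel="sigmoid", gamma=0.5), n_features_to_select=n_features)
        selector.fit(X, labels)
        clf = SVC(kernel="sigmoid", gamma=0.5, probability=True)
        clf.fit(selector.transform(X), labels)
        cascade.append((cls, selector, clf))
        keep = y != cls                              # drop this class before the next stage
        X, y = X[keep], y[keep]
    return cascade
```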

3. Experiments and Discussion

We evaluated the proposed method using both a leave-one-out cross-validation technique and k-fold (10-fold) cross-validation. A classifier, as described above, is trained on the training set and used to predict the action of each video in the prediction set. The procedure is repeated for each of the N videos in the dataset, and the results of all the predictions are collected and represented in the form of a confusion matrix.
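This evaluation loop can be reproduced with standard scikit-learn utilities, as sketched below; for brevity the sketch uses a single multi-class SVM in place of the full cascade.

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVC

def evaluate_loo(X, y):
    """Leave-one-video-out evaluation collected into a confusion matrix."""
    clf = SVC(kernel="sigmoid", gamma=0.5)
    y_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
    return confusion_matrix(y, y_pred)
```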

3.1. Datasets

The algorithm was assessed on four action datasets: the UT-Interaction dataset [40], the YouTube action dataset [41], the CASIA dataset [42], and HCA [43].

The UT-Interaction dataset is considered a standard benchmark for human interaction. It contains six classes of two-actor interactions: hand-shaking, hugging, kicking, pointing, punching, and pushing.

The YouTube action dataset (also known as UCF11) contains 11 action categories with about a hundred video segments in each category and a total of 1595 videos. Most of these videos are extracted from YouTube and contain camera motion, cluttered backgrounds, different appearances of the actors, different illumination conditions, and viewpoints of the camera.

The CASIA action database contains various activities captured in an outdoor environment with different angles of view. The database contains various types of human actions (walking, running, bending, etc.). There are seven types of interaction involving two actors (robbing, fighting, follow-always, follow-together, meeting-apart, meeting-together, and overtaking). We selected only interaction videos for action recognition with respect to suspicious activities.

The HCA dataset contains video sequences of six criminal actions. Each category contains a set of videos depicting a particular action, recorded with different actors performing the action under various environmental conditions. There are 302 videos in total. The actions are fight, kick, push, punch, gun shooting, and sword-fighting.

3.2. Results and Comparison with Current Approaches

The proposed approach was evaluated on four datasets. The system used for the simulations had an Intel Xeon processor, 32 GB of RAM, and an NVIDIA GTX 1660 graphics card.

The runtime performance of our approach depends on the number of people present in the video; the number of interest points (body parts) increases with the number of people, which is the main factor in the time complexity. The runtime comprises two major parts: the CNN processing time, whose complexity is O(1), i.e., constant regardless of the number of people, and the multiperson analysis time, whose complexity is O(n²), where n is the number of actors present in the video. The remaining time-series stage has complexity O(n).

All four datasets contain a different variety of actions, which helps in better assessing the performance of the proposed approach. The first dataset processed was the UT-Interaction dataset, with a total of 20 videos (10 in Set 1 and 10 in Set 2). Set 1 contains actions with static backgrounds, which generate close to zero noise, whereas the Set 2 videos were captured in a more natural environment with multiple actors in the background. We used leave-one-out validation and 10-fold cross-validation separately to train the model; the confusion matrix is shown in Figure 11. Our approach outperforms the state-of-the-art techniques, with a recognition rate of 96.4% on Set 1. The reason for this high accuracy is that our descriptor extracts the most relevant information about the performed actions, and the full bodies of the actors are visible with minimal occlusion. For Set 2, the same results were achieved regardless of the environmental effects: our approach extracts the motion of the body parts and is therefore least affected by environmental changes or occlusion. Table 1 shows the per-action accuracies, indicating good performance on all actions except for a few misclassifications between the push and punch actions, caused by interclass similarity. Compared with the state-of-the-art methods, our approach improves the accuracy by 1%, as shown in Table 2.

Our approach successfully classifies the handshake, hug, kick, and point actions, as each has a uniquely defined movement of body parts that clearly differs from the other actions. For punch and push, however, we see a few misclassifications, as both actions comprise similar movements: one actor remains still while the other approaches and executes the action, which impacts the first actor and makes him lean back (from the effect of the push or punch). The hand movements of the two actions are quite similar in some cases, which causes the misclassifications.

Next, we evaluated the proposed approach on the UCF11 dataset, which contains 11 actions, most of them captured in realistic environments. This dataset is therefore quite challenging compared with UT-Interaction due to large viewpoint variations, backgrounds, camera motion, and object appearances, and different body parts remain unseen for fractions of time because of the varied viewpoints. Table 2 shows the performance comparison for the UCF11 dataset with other state-of-the-art methods. Our approach outperforms them because we first detect the poses and then shape the actions by joining the body parts together in the temporal domain. This dataset contains multiple occlusions where actor movements overlap with other actors, which can cause misclassification. Figure 11(e) shows the confusion matrix for the UCF11 dataset, which contains a few misclassifications, specifically in spiking and basketball shooting, where the actor information is occluded and the hand movement overlaps with other actors. It is therefore crucial not to lose the body parts for a long time. To overcome this issue, we pick the centroid of an actor and follow its motion relative to the other body parts so that we do not lose the action attributes. Because we extract the motion of the body parts, background jitter has little effect.

The CASIA dataset provides three viewpoints: the angle view, the horizontal view, and the top view. The horizontal and angle viewpoints suit our approach better than the vertical (bird’s-eye) viewpoint. We picked the interaction videos to test the performance of our approach; Figure 11 shows the confusion matrices for the horizontal (c) and vertical (d) viewpoints. Our approach requires the body parts to be visible for most of the duration in order to extract information about the action. In the vertical viewpoint, most actions look comparatively similar and most body parts are hidden, so translating the pose into motion causes misclassifications, as only the head, shoulders, and arms are visible, and only for a very short period.

Our approach performed best for the horizontal viewpoint and achieved an accuracy of 98%. Table 3 shows the per-activity accuracies; the results cover only the horizontal view, as our approach performs best when most of the body parts are visible throughout the video. The actions in this dataset are distinct from one another, so our approach does not suffer from the interclass similarity issues we faced in UT-Interaction (push and punch).

However, a few misclassifications were observed in the “following” and “overtaking” actions, where the actors overlapped in most of the frames and the approach was not able to identify the exact action. For the vertical viewpoint, most of the body cannot be seen due to the camera position, so many misclassifications occur. To better handle actions in the vertical viewpoint and similar cases, optical flow and trajectory information could help in translating the poses into actions.

The last dataset selected for evaluation is the Hybrid Criminal Action (HCA) dataset. This dataset comprises videos from different datasets: the push, punch, and kick videos were taken from the UT-Interaction dataset; the gun shooting and sword-fighting videos were taken from HMDB51; and the fight videos were taken from the CASIA dataset. The experimental parameters change for each criminal action, so each action is evaluated separately and the average value is shown in Table 2.

The proposed approach was also evaluated separately on each of the actions; the per-class accuracies for fight, kick, push, punch, gun-fighting, and sword-fighting are 96%, 99%, 93%, 91%, 78.4%, and 76%, respectively, as shown in Table 4.

The action videos in this dataset contain very little background noise, and our approach extracts the features and computes the relative descriptor efficiently; however, the sword-fighting and gun-fighting actions were classified with lower accuracy. The videos from HMDB51 were collected mostly from movies and web sources and are of low quality, with camera motion, illumination effects, nonstatic backgrounds, changes in position and viewpoint, and occlusion. Our approach misclassifies a few such videos due to camera motion and viewpoint variations, but the overall accuracy is acceptable.

4. Conclusions

The proposed approach achieved good performance on all the datasets. Our method uses the positions of the actors and computes the movement of their body parts for feature representation, and then combines these features into actions using a time-series approach. Because the method efficiently computes the features before formulating the action, background noise and occlusion do not greatly affect the overall performance. For future work, additional research is required to better separate the features from the background information and to handle vertical viewpoints of the actors; trajectory information and optical flow could also help in extracting valuable information. Extracting additional types of features, as well as reducing the dimensionality of the feature space using feature selection methods or Principal Component Analysis (PCA), could lead to higher system performance.

Data Availability

The authors have used publicly available datasets (UT-Interaction, CASIA, and UCF-11). However, HCA dataset can be made available on request to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors are thankful to the Directorate of Advanced Studies, Research and Technological Development (ASRTD), University of Engineering and Technology Taxila, for their support. This research work was carried out at the Swarm Robotics Lab under the National Centre for Robotics and Automation (NCRA), funded by the Higher Education Commission, Pakistan.