Abstract

This paper addresses the recognition of group activity in public areas, considering personal actions and interactions between people, from the perspective of computer vision. Modeling the interaction relationships between multiple people is essential for recognizing group activity in a video scene. In artificial intelligence applications, identifying group activities based on human interaction is often a challenging task. This paper proposes a model that formulates a group action context (GAC) descriptor. The descriptor is developed by integrating the focal person's action descriptor and the interaction joint context descriptor of nearby people in the video frame. The model uses an efficient machine-learning-based optimization principle to learn the discriminative interaction context relations between multiple persons. The proposed novel group action context descriptor is classified by a support vector machine (SVM) to recognize group activity. The effectiveness of the proposed technique is evaluated through experiments on the publicly available collective activity dataset. The proposed approach infers a group action class when multiple persons are together in a video sequence, especially when the interaction between people is ambiguous. The overall group action recognition model is compared with a baseline model to estimate the contribution of the interaction context information. The experimental results of the proposed group activity recognition model are comparable to and outperform those of previous methods.

1. Introduction

Multiple-person activity recognition algorithms have attracted significant attention in the fields of computer vision and artificial intelligence. However, group activity recognition from video sequences is often a challenging task due to the dynamic interactions between multiple people. Group activity recognition is important in many applications such as human-computer interaction [1], video surveillance [1], content-based video retrieval [2], video summarization [3], and healthcare [1]. In the surveillance, medical, and social care fields, these algorithms are used to detect abnormal activities in healthcare settings and in public spaces such as airports and metro stations. In [4], a computation- and storage-efficient approach is proposed for recognizing human activities from videos. In [5], k-nearest neighbor techniques are developed for human activity recognition.

Most traditional methods in computer vision focus on recognizing an individual person's activities [6–9]. Several recent works [10–16] have addressed group activities in real scenes, which often involve multiple persons in action along with their interrelated actions. Group activity recognition identifies actions that are performed by multiple people together.

It is normally hard to discriminate the activities of multiple people based on the appearance of an individual person alone. Viewed in isolation, the highlighted person in Figure 1(a) appears to be merely standing. However, the person may in fact be waiting in a queue or talking with other persons. The highlighted focal person in Figure 1(b) is interrelated with the people nearby. For group activity recognition, it is therefore essential to exploit the context information between the focal person and the surrounding people. Hence, context modeling is necessary for recognizing group activities. The proposed model detects the focal person and the interaction context information. In this direction, several researchers have investigated contextual information to analyze a group activity through the interactions between multiple people, denoted as "group activity" or "collective activity" recognition [13–16].

The focus is on modeling group activity descriptors by integrating the focal person's action descriptor and the interaction joint context information for group activity recognition. The performance of the proposed technique is estimated on the collective activity dataset [10]. This approach serves two main purposes: the first is to reduce the misclassification of group activities by eliminating the confusion between similar actions in the scenes, and the second is to recognize group activities by streamlining the interaction inference technique.

Furthermore, the interaction joint context is used to develop an innovative group action context (GAC) descriptor model for an efficient group activity recognition process. The proposed approach contributes, first, an interaction joint context based on a Bag-of-Words (BoW) representation of individual actions and pose interactions, along with the dominant poses and actions within the video frame. An algorithm is developed based on the dominant pose and action to determine the interactions among multiple people in the video scenes. Second, it proposes a novel group action context (GAC) descriptor that encodes the interaction between the joint context and the action descriptor of the focal person. The group action context descriptor is classified by an SVM classifier for group activity recognition.

The rest of the paper is organized as follows. Section 2 reviews work related to group activity recognition using different approaches. Section 3 explains the proposed framework in detail. Section 4 demonstrates the effectiveness of the proposed technique through experimental results and evaluation. Section 5 concludes with the performance of the proposed approach for group activity recognition.

2. Related Work

Recent methods have performed well in recognizing individual actions [7, 8]. In computer vision, human action recognition has diverse applications in intelligent surveillance, sports analytics systems, etc. In [17], the concept of human activity recognition specifically for video surveillance was explored. In such applications, activity understanding is important for improving human-computer interaction. However, recognizing group actions is not restricted to recognizing the actions of the individuals in the group. In recent research, group activity recognition is based on the actions of persons and the interaction context among multiple people [10, 12–14]. In [18], activity recognition was represented by combining global and local representations. The interaction between multiple persons within a group is most often encoded as context information. Machine learning (ML) techniques provide an effective framework for modeling the interaction context between multiple people. In [19], various ML approaches for group activity recognition were discussed. Most of the present methods assume that most of the people present in a scene exhibit a single activity as the group action. However, this is not true, especially in surveillance and sports videos. In addition, people might show different pose interactions within a group that exhibits a specific activity.

In [10, 11, 16, 20], multiple-person action recognition techniques have been developed for video. In some research, context information among multiple people has been proposed for group activity recognition. In [10, 21], contextual information is integrated by extracting feature descriptors from multiple persons. This context information is a significant feature descriptor for analyzing the interactions for group activity recognition. However, in these models, the action of each individual is classified independently, because spatial and temporal consistency in the group interaction is not always guaranteed. In [12, 14, 15, 22], graph structure models were proposed to describe the interaction among persons.

There exists contextual pose interaction information that differentiates the overall group activity, as shown in Figure 1. A hierarchical AND-OR graph model is proposed in [20, 21] for group activity, which models temporal and framewise relations in the video. However, this method is computationally expensive.

In [9], a spatiotemporal local (STL) descriptor was proposed that considers spatial variation. In [23], the randomized spatiotemporal volume (RSTV) was proposed to capture the context of a person in the crowd, but it fails under noisy pose estimates, and hence it was combined with a Markov random field (MRF). In [14], an action context (AC) descriptor was proposed to capture contextual information through HoG feature vectors; it considers the action scores of the focal person and all nearby people in the context region. However, it does not consider the person's pose interaction context, and the descriptor is sensitive to changes in viewpoint. In [24], the relative action context (RAC) descriptor was proposed, which encodes the relative relations within the activity to achieve viewpoint invariance. The model of [15] considers temporal consistency within the group, but the interactions considered are limited to successive frames, which leads to temporary misclassifications in these models.

In [13, 25], hierarchical graphical models of spatiotemporal patterns were proposed to model the interactions between people, which involve complex preprocessing and inference processes. As in [14], contextual information is considered only in the adjacent region; as a result, temporal and spatial uniformity is missing, which leads to misclassification in group activity classification.

In [26], the interactions between people are integrated through fully connected conditional random fields (CRFs) to avoid misclassification of group actions. Multiscale features are considered and integrated through the CRF to represent the interaction context. The approach in [27] uses a model of human behavior that considers semilocal parts and the interactions between them, which achieved reasonable performance in classifying multiclass activities. In [28], a graph-based clustering method was proposed for recognizing group actions in crowded scenes by considering motion and local interaction information.

However, it is very difficult to handle complex interaction context information with graphical models. Such approaches are competent only for modeling human-level trajectory information, which is inadequate for recognizing easily confused group activities such as walking and crossing. These activities can be distinguished using human action and pose appearance.

Fan et al. [29] offer a technique for understanding human gaze communication by studying human interaction in social videos. In this design, a spatiotemporal graph neural network is used to model dynamic human interaction by passing messages over the graph. To capture the temporal dynamics, an LSTM-based temporal reasoning module is incorporated to predict atomic gaze communication. The work in [30] detects shared attention intervals and predicts shared attention locations in video frames by proposing a spatiotemporal neural network. A convolutional Long Short-Term Memory network is employed to optimize the temporal domain within the shared attention intervals. The Graph Parsing Neural Network (GPNN) is a framework proposed in [31] for detecting and recognizing human-object interactions (HOI) in images and videos. GPNN represents the HOI structure and automatically infers the optimal graph structure, and the method is applicable to both spatial and spatiotemporal domains.

In computer vision, deep learning approaches have shown significant improvements in image classification, human activity recognition, and video classification. A deep neural network learning model is presented in [32] for recognizing activities performed by multiple people based on contextual relationships. For group activity recognition in surveillance scenes, Deng et al. [33] integrated hierarchical graphical models and deep networks. In [34], a deep LSTM-based temporal hierarchical model was proposed to learn sports activity data, and in [35], the Confidence-Energy Recurrent Network (CERN) encompasses a two-level hierarchy of LSTMs. In recent work on group activity recognition, Tora et al. [36] proposed a pretrained CNN model combined with an LSTM recurrent neural network to capture interaction context information. The work in [37] developed a multilevel hierarchical recurrent network to model the interaction context for group activity recognition. The deep RNN model of [38] captures person-level temporal context information. In [39], a novel Participation-Contributed Temporal Dynamic Model (PC-TDM) uses a Bi-LSTM to aggregate the interactions related to individuals' long-duration motions, which improves the performance of group activity recognition. A multistream spatiotemporal architecture with convolutional fusion is proposed in [40] for collective activity recognition. Tang et al. [41] proposed the Coherence Constrained Graph LSTM (CCGLSTM) to model the relevant motions of individuals while suppressing irrelevant motions, to effectively recognize group activity. In [42], group activity recognition is addressed by modeling person-level and group-level actions with a graph LSTM framework that exploits temporal features.

The existing research presented above is based on learned or handcrafted feature descriptors, and the techniques have been evaluated in the direction of context modeling. It has been observed that context descriptors that consider spatiotemporal features improve group activity classification. Most of the previous group activity recognition methods do not handle flexible interaction context information. Owing to this, in this research, a group action context descriptor is formulated using joint interaction context information for group activity recognition.

The experimental results presented in this paper outperform the graph structure models [14, 23, 26]. The proposed group activity recognition model focuses on the interaction context. The context information considers how the focal person is connected with nearby persons' actions and poses within the group activity. The resulting context model has discriminative interactional features, handles a varying number of persons in a group, and is flexible enough to model scalable context information.

3. Approach

Group actions are characterized by the poses and actions of persons along with the interactions among multiple people. The interaction context information plays an important role in recognizing the group action. However, it poses a challenging problem owing to changes in people's actions and, more precisely, variations in human pose, which cause variations in the interactions within the group action. The key purpose of group activity recognition is to capture the positional appearance of each individual through interaction context cues in the group.

This section describes the strategy of the group activity recognition method. The group action context (GAC) descriptor is formulated from the interactions between people in a scene, and this descriptor is then classified into a group activity category using a multiclass SVM classifier.

The proposed method constructs the GAC descriptor by combining the focal person action descriptor and the interaction joint context descriptor. It is assumed that people's head poses and 3D trajectories are available in the dataset [10]. The proposed framework of group action recognition is presented in Figure 2.

As shown in Figure 2, persons are detected in the video frame; let I denote a person. The person I_m detected at the center of the video frame is taken as the focal person, and the people nearby in the region of I_m are considered for the interaction joint context J_m. In each frame, the focal person is selected, and the corresponding interaction joint context feature is computed. The proposed group action context descriptor is learned through a weight function W_c between the focal person action descriptor I_m and the interaction joint context descriptor J_m. The underlying assumption of the proposed group interaction model is that the focal person action descriptor is strongly related to the interaction context of the group action, which is affected by the poses and actions of the multiple people in the video frame.
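To make this selection step concrete, the following minimal Python sketch (with hypothetical function and variable names; not the authors' code) picks the detection closest to the frame center as the focal person and collects the nearby people that feed the joint context:

```python
import numpy as np

def select_focal_and_nearby(centers, frame_size, radius):
    """Pick the person closest to the frame center as focal person I_m and
    return the indices of the nearby people used for the joint context J_m.

    centers    : (n, 2) array of person bounding-box centers (x, y)
    frame_size : (width, height) of the video frame
    radius     : context region radius around the focal person (assumed)
    """
    centers = np.asarray(centers, dtype=float)
    frame_center = np.array([frame_size[0] / 2.0, frame_size[1] / 2.0])
    focal = int(np.argmin(np.linalg.norm(centers - frame_center, axis=1)))
    dists = np.linalg.norm(centers - centers[focal], axis=1)
    nearby = [i for i in range(len(centers)) if i != focal and dists[i] <= radius]
    return focal, nearby
```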

The model formulation, which learns an optimal W_c by optimizing the model for activity inference, is presented in Section 4. Before that, Section 3.1 describes the features in detail.

3.1. Feature Details
3.1.1. Focal Person Action Descriptor

It is assumed that the video frame is preprocessed and that persons are detected along with their locations [10]. Features are extracted from each detected person using the histogram of oriented gradients (HoG). HoG [43] is an appearance-based feature vector that is well suited to complex environments, as it diminishes the effects of occlusion and illumination variations in individual action recognition. Owing to this, the HoG transform is used to extract features from the detected persons. However, instead of directly using raw HoG features, the individual action descriptor of [13] is adopted. In the proposed framework, for the individual action descriptors, a KNN classifier is trained on HoG features for the 5 action classes. The focal person at the center of the frame is described by the action descriptor I_m, a vector of classification scores for the 5 action classes.
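The following sketch illustrates this stage under stated assumptions: scikit-image's hog and scikit-learn's KNeighborsClassifier stand in for the HoG extraction and KNN training described above, and the synthetic crops and labels are placeholders, not dataset code:

```python
import numpy as np
from skimage.feature import hog
from sklearn.neighbors import KNeighborsClassifier

ACTIONS = ["crossing", "waiting", "queuing", "walking", "talking"]

def hog_descriptor(person_crop):
    """HoG feature vector from a grayscale person crop (H x W array)."""
    return hog(person_crop, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

# Train a KNN on HoG features of labeled person crops (synthetic stand-ins).
rng = np.random.default_rng(0)
crops = rng.random((50, 64, 32))                   # 50 placeholder 64x32 crops
labels = rng.integers(0, len(ACTIONS), size=50)    # placeholder action labels
X = np.stack([hog_descriptor(c) for c in crops])
knn = KNeighborsClassifier(n_neighbors=5).fit(X, labels)

def action_descriptor(crop):
    """I_m: vector of per-class classification scores for one person."""
    return knn.predict_proba(hog_descriptor(crop).reshape(1, -1))[0]
```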

3.1.2. Pose Context Feature

To exploit the spatial information around a person, we propose a BoW pose feature P_i covering eight pose categories: right, front-right, front, front-left, left, back-left, back, and back-right. In addition, the most influential pose feature in the video frame is also considered, which reflects the interaction context between the focal person and the nearby people in the focal person's region. This influential pose feature captures the contextual relationship between the focal person's pose and the poses of nearby people.

Let P_m represent the pose context feature vector, which is formulated in equation (1) by concatenating the most influential pose feature with the BoW of the poses of the people surrounding the focal person I_m:

P_m = [ p_dom , BoW({p_i : I_i ∈ N(I_m)}) ],   (1)

where p_dom is the most influential (dominant) pose in the frame and N(I_m) denotes the set of people nearby the focal person.
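A minimal sketch of this pose context feature, assuming the reconstructed form of equation (1) above (a one-hot dominant pose concatenated with a normalized pose histogram of nearby people; names are hypothetical):

```python
import numpy as np

POSES = ["right", "front-right", "front", "front-left",
         "left", "back-left", "back", "back-right"]

def pose_context_feature(focal_idx, poses, nearby):
    """Pose context P_m per equation (1): one-hot dominant pose in the frame
    concatenated with a BoW histogram of nearby people's poses.

    poses  : list of pose indices (0..7), one per detected person
    nearby : indices of people in the focal person's context region
    """
    near_poses = np.asarray([poses[i] for i in nearby if i != focal_idx], dtype=int)
    bow = np.bincount(near_poses, minlength=len(POSES)).astype(float)
    if bow.sum() > 0:
        bow /= bow.sum()                        # normalized pose histogram
    dominant = np.zeros(len(POSES))
    frame_hist = np.bincount(np.asarray(poses, dtype=int), minlength=len(POSES))
    dominant[int(np.argmax(frame_hist))] = 1.0  # most influential pose
    return np.concatenate([dominant, bow])      # 16-dim pose context vector
```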

3.1.3. Interaction Joint Context Descriptor

In a video frame, consider, for example, multiple people with two actions, crossing and walking, in which the people sometimes share the same collective pose, indicating that they are following each other. In such a case, it is important to integrate individual actions and interaction context information to discriminate between these activities. To investigate such complex structures involving spatial dependencies, the interaction joint context plays a significant role. Thus, to capture the interactions within the group, we propose an interaction joint context descriptor that encodes the pose context feature as well as the individual action descriptors. The proposed interaction joint context model formulates a BoW representation of the poses and individual actions of the people in the video frame. In addition, the dominant pose and actions present in the frame are included in the descriptor. The interaction joint context descriptor is described as

J_m = [ P_m , A_m ],   (2)

where P_m is the pose context feature and A_m is the BoW of actions, as shown in Figure 2. Hence, the descriptor captures the spatial relations between people, which are capable of discriminating between the activities.
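Following the reconstructed equation (2), a small sketch of how the joint context might be assembled (the soft action histogram and the dominant-action one-hot are implementation assumptions):

```python
import numpy as np

def joint_context_descriptor(pose_context, nearby_action_scores, num_actions=5):
    """Interaction joint context J_m = [P_m, A_m]: pose context concatenated
    with a BoW of nearby people's actions plus the dominant action."""
    a = np.asarray(nearby_action_scores, dtype=float).reshape(-1, num_actions)
    if len(a):
        bow_actions = a.sum(axis=0) / len(a)    # soft action histogram A_m
    else:
        bow_actions = np.zeros(num_actions)
    dominant = np.zeros(num_actions)
    if len(a):
        dominant[int(np.argmax(bow_actions))] = 1.0   # dominant action
    return np.concatenate([pose_context, bow_actions, dominant])
```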

3.1.4. Group Action Context Descriptor

The group action context (GAC) descriptor is formulated by integrating the focal person action descriptor and the interaction joint context descriptor of nearby people. Group activities involve individuals' actions and positional movements, which affect the interaction context information. Hence, it is essential to encode the contextual features into a novel group action context descriptor.

As shown in Figure 2, the proposed GAC descriptor is centered on a focal person and captures the relative interaction context between the focal person and the other people nearby. For group activity recognition, it is necessary to have an effective group action context descriptor that formulates the interactions between persons. In addition, the GAC descriptor encodes the discriminative information of each individual in the group activity.

The GAC descriptor formulates the interaction context feature, which enhances the performance of the classification algorithm. The GAC descriptor represents the contextual relationship between the focal person and the people nearby in BoW style, capturing action and pose features. Thus, it captures spatiotemporal dynamic information, which improves the learning performance of the SVM classifier in recognizing group activities.

4. Model Formulation

Persons in the frame are detected using the given person locations [10]. In addition, to train on the descriptor data in a supervised learning mode, each person in the frame is labeled with action and pose labels.

With multiple people, interaction signifies the co-occurrence of the actions and head poses of all individuals, which together form a group action. The main purpose of the proposed approach is to estimate the group action context descriptors through the interaction context for group activity recognition.

Let V = {K_1, K_2, ..., K_M} be the video sequence captured with a camera, consisting of a set of M frames.

Let {I_1, I_2, ..., I_m} be the set of m persons in a frame. Given this information, the main aim is to extract interaction context features between multiple people. The HoG feature descriptor extracted from each person I_i is used to train a KNN classifier that produces the individual action descriptors.

In frame K, the action of each person is denoted by an action label h ∈ H, where H is the set of possible action labels. The video frame K is associated with a group activity label y ∈ Y, where Y is the set of possible group activity labels. The focal person is selected at the center of frame K, and its action descriptor I_m is given as

I_m = [ S_A(1), S_A(2), ..., S_A(|H|) ]

for the possible actions h, where S_A(h) is the classification probability score of the focal person I_m for action h.

The group action context (GAC) descriptor response is computed as follows:

φ(I_m, W_c, J_m) = I_m^T W_c J_m,   (3)

where φ(·) can be observed as the interaction function between I_m and J_m, which is optimized through the weight matrix W_c so that the weighted relationship between the focal person and the interaction joint context J_m is learned. If there is maximum correlation between the focal person descriptor and the interaction context for the given group activity, equation (3) is maximal; otherwise, it is not. Considering the interaction modeling function over the set of m persons,

Φ(K) = Σ_{i=1}^{m} φ(I_i, W_c, J_i).   (4)
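Under the bilinear form assumed in the reconstruction of equations (3) and (4), the response can be computed as below (a sketch under that assumption, not the authors' implementation):

```python
import numpy as np

def gac_response(I_m, W_c, J_m):
    """Equation (3): interaction response phi(I_m, W_c, J_m) = I_m^T W_c J_m."""
    return float(np.asarray(I_m) @ np.asarray(W_c) @ np.asarray(J_m))

def frame_response(I_list, W_c, J_list):
    """Equation (4): interaction modeling summed over the m persons."""
    return sum(gac_response(I, W_c, J) for I, J in zip(I_list, J_list))
```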

4.1. Learning

In each video sequence, the aim is to recognize the group activities. Each video frame is represented by a group action context descriptor that encodes the focal person action descriptor along with the surrounding persons' interaction joint context. The GAC descriptor implicitly infers group actions during learning and inference. There are N training frames, each with the group activity label belonging to that video frame. In equation (3), the group action context descriptor is learned through the weight matrix W_c by integrating the focal person's action descriptor I_m and its surrounding interaction joint context descriptor J_m. The matrix W_c should represent the interaction context structure within a group activity. A multiclass SVM classifier [28] is trained on the GAC descriptor, optimized through parameter tuning, for correct group activity recognition.

Assume there are m people in the video frame K, and the corresponding focal person and interaction joint context are I_m and J_m, respectively. Then, equation (5) gives the group interaction context response with respect to a group activity class y ∈ Y:

Γ(K, y) = I_m^T W_c^y J_m.   (5)

A multiclass linear SVM classifier is trained on the GAC feature descriptor to learn the weight matrix W_c, and it classifies the group activities of a video sequence once the model is trained. The two-category ground-truth response of a group activity against the other group activities defines a margin δ in

Γ(K_n, y_n) − Γ(K_n, l) ≥ δ   for all l ≠ y_n.   (6)

The ultimate formulation of the model becomes

min_{W_c, ξ} (1/2) ‖W_c‖² + C Σ_n ξ_n
subject to Γ(K_n, y_n) − Γ(K_n, l) ≥ δ − ξ_n, ξ_n ≥ 0, ∀ l ≠ y_n.   (7)

Based on the above formulation, all the weight matrices W_c^{kl}, k, l ∈ {1, 2, ..., Y}, are optimized.

In equation (7), Lagrangian relaxation (also known as dual decomposition) is applied to the model constraints, and the model is solved by optimizing the resulting cost function.
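As one plausible realization of this pairwise learning (a plain subgradient stand-in for the SVM solver; the update rule and hyperparameters are assumptions, not the paper's algorithm):

```python
import numpy as np

def train_pairwise_W(I_feats, J_feats, labels, k, l,
                     delta=1.0, lr=0.01, reg=1e-3, epochs=100):
    """Learn one pairwise weight matrix W_c^{kl} with a hinge loss:
    frames of class k should score above +delta, class l below -delta."""
    W = np.zeros((len(I_feats[0]), len(J_feats[0])))
    for _ in range(epochs):
        for I_m, J_m, y in zip(I_feats, J_feats, labels):
            if y not in (k, l):
                continue                        # this pair ignores other classes
            t = 1.0 if y == k else -1.0         # target sign for the pair
            grad = reg * W                      # L2 regularization gradient
            if t * (I_m @ W @ J_m) < delta:     # margin violated: hinge active
                grad -= t * np.outer(I_m, J_m)
            W -= lr * grad
    return W
```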

4.2. Inference

The interaction context response in equation (3) is computed based on the weight matrix W_c by developing a voting scheme for each group activity. Equation (7) is used to optimize all the weight matrices W_c^{kl}, where k, l ∈ {1, 2, ..., Y}, through the pairwise learned classifier (k, l).

In frame K, the Bag-of-Words of pose and action labels is computed for each individual, and the corresponding joint interaction context J_m is calculated with respect to the focal person I_m. Then, we compute the value

s_{kl}(K) = I_m^T W_c^{kl} J_m.   (8)

In equation (8), s_{kl}(K) is computed for each pair k, l ∈ {1, 2, ..., Y}, k ≠ l, for recognizing the group activity of a video sequence V_m. If the score in equation (8) is greater than zero, class k receives a vote, and the class that collects the maximum votes is the most likely group activity class.
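The voting inference then reduces to a one-vs-one tally over the pairwise scores of equation (8); a minimal sketch, where W is assumed to map each class pair (k, l) to its learned matrix:

```python
import numpy as np

def predict_group_activity(I_m, J_m, W, num_classes):
    """One-vs-one voting: a positive score s_kl votes for class k,
    a negative one for class l; return the class with the most votes."""
    votes = np.zeros(num_classes, dtype=int)
    for k in range(num_classes):
        for l in range(k + 1, num_classes):
            s_kl = float(I_m @ W[(k, l)] @ J_m)   # equation (8)
            votes[k if s_kl > 0 else l] += 1
    return int(np.argmax(votes))
```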

5. Experiment and Results

This section describes the performance results of the proposed GAC descriptor model for group activity recognition.

5.1. Dataset

For the experiments in this paper, the Collective Activity Dataset proposed in [10] is selected. The dataset provides automatic person detections and trajectories and contains realistic, noisy observations with occlusion. The dataset is labeled, in every 10th frame, with bounding boxes around each person carrying out an action, together with the pose and activity class, for recognition purposes. The dataset contains video frames of five group activity classes, namely, waiting, walking, crossing, talking, and queuing, and eight poses: right, front-right, front, front-left, left, back-left, back, and back-right. The dataset comprises 44 short video sequences involving different multiple-person actions. The performance of the proposed interaction contextual relationship model was validated on the 5 group action categories of this dataset [10].

The focus of the proposed framework is group activity recognition, and interaction contextual modeling is used to improve the performance of the GAC classifier model. It is observed that optimization through the interaction joint context improves action recognition. In the experimentation, a leave-one-video-out cross-validation scheme [22] is performed.

5.2. Experimentation

Group action classification: based on Sections 3.1.1–3.1.4, the obtained context feature descriptor set can be fed to the SVM classifier for training and testing, and the performance of the model is estimated. In the learning and inference phases of the classifier, it is essential to provide a random and independent feature descriptor set. Owing to this, to ensure SVM classifier accuracy, the feature set of the video sequences is divided randomly and independently into a 70% training set and a 30% testing set. After training the SVM classifier with 10-fold cross-validation, group action classification is performed on the test feature set for the 5 action classes: crossing, waiting, queuing, walking, and talking. To evaluate the performance of the model, the following performance indicators are investigated and compared with the baseline model: accuracy, precision, recall, F1 score, and the confusion matrix.
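A sketch of this evaluation protocol using scikit-learn (the descriptors and labels below are placeholders; the real pipeline would use GAC descriptors computed from the dataset, and the paper itself reports using libSVM):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

ACTIVITIES = ["crossing", "waiting", "queuing", "walking", "talking"]

# Placeholder GAC descriptors X and group activity labels y.
rng = np.random.default_rng(0)
X = rng.random((200, 30))
y = rng.integers(0, len(ACTIVITIES), size=200)

# Random, independent 70% / 30% train-test split, as described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="linear", C=1.0)
print("10-fold CV accuracy:", cross_val_score(svm, X_tr, y_tr, cv=10).mean())

svm.fit(X_tr, y_tr)
y_pred = svm.predict(X_te)
print(confusion_matrix(y_te, y_pred, labels=range(len(ACTIVITIES))))
print(classification_report(y_te, y_pred, labels=range(len(ACTIVITIES)),
                            target_names=ACTIVITIES, zero_division=0))
```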

5.3. Results

The experimental results are compared with existing techniques in this section. The proposed model achieves improved performance compared to [14, 23, 26], as shown in Table 1. In Figure 3, the recognition accuracy for each activity is presented as a confusion matrix. It shows that the proposed GAC model achieves a significant improvement over the baseline model. Additionally, the proposed GAC descriptor optimization method can capture the diverse forms of interaction context in group actions.

The GAC model is trained with an SVM classifier for group action recognition using the libSVM library [44]. Careful parameter tuning of the SVM yields a significant improvement in the classifier's accuracy. The group descriptor is robust in the presence of noisy observations, since the descriptor is built on the interaction context. The proposed model's average classification accuracy is 88.8% on the collective activity dataset [10], as shown in Table 1, where it is compared with state-of-the-art methods.

To evaluate the performance of the model, the following performance indicators are investigated and compared with the baseline model in Table 2: accuracy, precision, recall, and F1 score.

The proposed GAC descriptor exploits context information in terms of the focal person action descriptor and the interaction joint context descriptor. The proposed framework automatically infers person interaction context information through an optimal GAC descriptor. Table 2 shows that the GAC descriptor improves performance compared to the baseline model. The interaction joint context encodes information about the focal person's actions and the poses and actions of multiple people in BoW form. This feature-level method offers an efficient way to incorporate the interaction context temporally and spatially. The joint context descriptor considers the dominant pose and action of nearby people, which benefits activities that are not discriminative based on the poses of multiple people (e.g., walking). In this case, the bag-of-words technique shows promising performance. Thus, formulating an interaction context descriptor provides more insight into diverse group activities.

Figure 4 reports qualitative results on the 5-activity dataset, indicating the group actions recognized by the proposed method. The first two sequences in Figure 4 are successfully recognized with improved classification results, whereas the final row represents a classification failure caused by a wrong action label, which leads to misclassification of the activity. The group activity classification result visualized in Figure 4 reflects the GAC structure of the person interaction context descriptor learned by the SVM machine learning algorithm. Note that the person marked in red received an incorrect individual action prediction; owing to this, the frame is incorrectly classified, which reduces the performance of the model.

5.4. Discussion on Result

The performance of the proposed model is evaluated against state-of-the-art methods and baseline models. The baseline model uses only the interaction joint context descriptor. The GAC framework demonstrates that capturing the joint interaction context enables successful group activity classification using the SVM. The interaction descriptor considers a BoW representation of the persons' actions and poses in the frame. This paper mainly proposes a group action context descriptor that extracts contextual information between the focal person and the people nearby. Owing to this, the GAC model achieves the performance improvement over the baseline methods for group activity recognition shown in Table 1. The results of the AC descriptor [14], RSTV + MRF [23], and AC-RAC + FC-CRF [26] for group activity classification are improved upon by the proposed GAC descriptor method. The AC descriptor [14] considers the action scores in the context region but does not consider the pose interaction context. The proposed GAC descriptor considers the pose context among multiple people, which improves the classification performance.

It is observed that in [14, 26] the classification between walking and crossing was ambiguous. The proposed model's inference technique improves the performance of recognizing group activities that are easily confused, such as walking, waiting, and crossing. By integrating interaction contextual information, the confusion between these activities is reduced, because the pose context feature provides important cues for discriminating them. The pose context feature considers the BoW of pose features along with the most influential pose feature in a group of people. In crossing, persons always cross the street with the same pose, and in waiting, people stand facing the same direction, with their postures rarely facing each other. By considering the most influential pose and action features, the group activity classification accuracy improves. This implies that GAC contextual modeling between the focal person and the nearby people is an efficient mechanism for detecting interaction context information among multiple people.

6. Conclusion

This paper presents a model for group activity recognition in video. The group activity recognition task is addressed by considering the interaction contextual information between multiple people. Based on that, a novel group action context (GAC) descriptor is proposed to model the interaction context between the focal person's actions and nearby people within a group activity. The group descriptor incorporates the focal person action and interaction joint context information to discriminate different group activities. The GAC descriptor model infers group activities efficiently through an effective SVM-based optimization algorithm. The best average accuracy of 88.8% shows that the proposed model performs significantly well compared to state-of-the-art methods on the collective activity dataset for group activity recognition. The proposed algorithm utilizes interaction joint context information, which is effective for the development of the group action context descriptor. Furthermore, the GAC descriptor is robust to easily confused activities such as crossing and walking. The proposed model can be readily adapted to surveillance applications for high-level activity and behavior analysis. Future work includes the investigation of other useful context information, such as scene context, and research on effective automated context feature learning.

Data Availability

The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors thank Professor Sangeeta Jadhav for her valuable supervision.