Abstract

As populations grow, crowd activities diversify, and socialization accelerates, group scenes are becoming more common, and the demand for modeling, analyzing, and understanding group behavior data in video is increasing. Compared with earlier video content analysis, the larger number of people and the more complex scenes in group video make group behavior analysis considerably more challenging. This paper therefore proposes a group behavior pattern recognition algorithm based on a spatio-temporal graph convolutional network, targeting group density analysis and group behavior recognition in video. First, a crowd detection and localization method based on density map regression-guided classification is designed. Then, a crowd behavior analysis method based on density grade division is designed to complete crowd density analysis and video group behavior detection. In addition, a dual-stream spatio-temporal graph network model is used to extract spatio-temporal features of crowd posture and density, so as to effectively capture the differentiated movement information among different groups. Experimental results on public data sets show that the proposed method achieves high accuracy and can effectively predict group behavior.

1. Introduction

Population growth and the diversity of crowd activities have made group scenes common. Group behavior [1–3] contains many important clues for interdisciplinary fields, and understanding how it forms has long been an important research topic in sociology and natural science. As the number of people in video increases and crowd scenes become more complex [4], automatically and effectively modeling, analyzing, and understanding group behavior data becomes an important challenge. Research on group behavior analysis can support many key engineering applications, such as intelligent video surveillance, crowd anomaly monitoring, and public facility planning. From the perspective of the cognitive mechanism of group behavior, this paper studies an effective computational framework and algorithm model for group behavior and attempts to mine dynamic group patterns and behaviors from real-scene video data [5, 6], so as to solve practical problems in the field of computer vision.

At present, group behavior analysis [7, 8] in video is based on ordinary surveillance video and uses computer vision technology to understand and analyze group behavior and events in the monitored scene. This removes the heavy reliance of traditional video surveillance on manpower: group behavior can be analyzed and described automatically, enabling intelligent monitoring of large-scale crowd scenes. Group behavior analysis and understanding has become an important research branch of video surveillance and has been widely applied in fields such as public security, transportation, and facility planning. At the same time, the vigorous development of artificial intelligence [9–12], machine vision [13–15], cognitive science, and other cutting-edge technologies supports the intelligent understanding of video content. Previous behavior understanding work [16] in video content analysis mainly focused on individual behaviors, through tasks such as motion detection, target tracking, and object recognition, while ignoring large-scale group behaviors. Compared with individual behavior, group behavior is closer to real scenes and more complicated to analyze and recognize.

When a computer detects crowd density [17, 18] and crowd behavior [19], the movement of the crowd is complex and the scene is changeable. Changes in illumination, occlusion within the crowd, perspective effects, different shooting angles, and other factors all make detection difficult. Crowd behaviors also have different semantics in different scenes. Quickly and effectively understanding and distinguishing the semantics of normal and abnormal crowd behaviors is therefore of great significance and remains an urgent problem in the field of computer vision [20].

Processing video images with computer vision can replace manual monitoring and enable real-time, efficient monitoring of crowd density and crowd behavior. Recently, many scholars have applied deep learning-based methods to tasks such as pedestrian detection, face recognition, and group behavior recognition and have made major breakthroughs. At present, crowd detection with computer vision still faces problems such as large crowds, poor detection accuracy, variable scenes, and high complexity. Deep learning-based techniques can effectively overcome these difficulties, and by monitoring crowd distribution and behavior in real time, they provide solutions for crowd supervision, which has great practical significance and application value.

The main contribution of this paper is a group behavior pattern recognition algorithm [21, 22] based on a spatio-temporal graph convolutional network, which can effectively recognize group behavior. The paper also proposes using a dual-stream spatio-temporal graph network model to extract spatio-temporal features of crowd posture and density, so as to effectively capture the differentiated movement information between different crowds.

The paper is organized as follows. Section 2 briefly reviews work related to the proposed research. Section 3 describes the methodology in detail. Experiments and results are given in Section 4. Section 5 concludes the paper.

2. Related Work

Early crowd research was mainly based on detecting the individuals in a crowd. The image is segmented, target detection is performed with a sliding window, and finally the crowd is counted using a classifier. Detection-based methods include whole-body detection [23] and detection based on parts of the human body [24]. Typical traditional methods train classifiers such as random forests and SVM detectors on features such as histograms of oriented gradients, edges, textures, and whole-body wavelets. In highly dense scenes, where people are severely occluded, detecting body parts such as the head and shoulders is used instead of whole-body detection. This improves the results, but the robustness of human detection is still not high.

Regression-based crowd density analysis and crowd counting learn the mapping between image features and the number of people [25]. These methods first segment the image and extract low-level features such as foreground, texture, edge, and gradient features. A regression function, such as linear regression, Gaussian regression, or ridge regression, is then learned to map the low-level features to the number of people in the image. These methods rely on a static background model, which is sensitive to illumination changes. The model needs to be retrained each time the scene changes, which is costly in terms of time and computation. Regression-based methods usually assume that the relationship between the number of people in the image and the foreground area is approximately linear. However, such a linear relationship is difficult to establish in real scenes because of occlusion, overlap, and perspective effects in the crowd.
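As a rough illustration of the regression idea described above, the following minimal sketch fits a straight line from foreground area to head count by ordinary least squares. The function name `fit_line`, the sample numbers, and the single-feature setup are assumptions made for illustration and are not taken from the cited methods.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x and cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training pairs: foreground pixel area -> annotated head count.
areas = [1000.0, 2000.0, 3000.0, 4000.0]
counts = [5.0, 10.0, 15.0, 20.0]

slope, intercept = fit_line(areas, counts)
predicted = slope * 2500.0 + intercept  # estimated count for an unseen frame
```

In real scenes the points would scatter around the line because of occlusion and perspective, which is the weakness of the linear assumption noted in the text.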

In densely crowded images, deep learning methods usually use convolutional neural networks to build end-to-end models that extract features of pedestrians at different scales and generate crowd density maps through Gaussian kernel functions, from which the crowd count is obtained [26]. The crowd density map not only supports crowd counting but also provides rich spatial information, reveals the crowd density distribution, and allows further analysis of crowd behavior. Zhang et al. [27] proposed the multicolumn convolutional neural network (MCNN) model, which uses subnetworks with different convolution kernel sizes to count crowds in scenes with severe occlusion and large scale variation. Sam et al. [28] proposed a switching network based on MCNN, which has multiple CNN subnetworks with different depths and convolution kernel sizes, improving the accuracy and robustness of crowd density analysis and crowd counting under heavy occlusion and multiscale scene changes. Sindagi and Patel [29] proposed the contextual pyramid model CP-CNN. To extract global and local context information, the network builds on the multicolumn architecture of MCNN, uses two subnetworks to map the input image or video frame to high-dimensional feature maps, and uses CNNs to estimate the context at each level. To reduce errors and generate higher quality density maps, Li et al. [30] proposed the deep neural network [31–35] model CSRNet, which abandons the multicolumn framework on the grounds that it has no obvious advantage over a single-column framework. The front end of the model is VGG-16 with the fully connected layers removed, retaining only the convolutional and pooling layers; it is followed by dilated convolutions that enlarge the receptive field and capture image features at different levels. The model then generates the crowd density distribution map and obtains better detection results.
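To make the idea of a Gaussian-kernel density map concrete, the sketch below places one normalized Gaussian at each annotated head position, so that the map sums to the number of people. It is a minimal illustration in plain Python; the function names, the fixed kernel size and sigma, and the sample coordinates are assumptions, not the configuration of any cited network.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    kernel = [[math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def density_map(points, height, width, size=5, sigma=1.0):
    """Add one unit-mass Gaussian per annotated head position (row, col)."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    dmap = [[0.0] * width for _ in range(height)]
    for r, c in points:
        for dy in range(size):
            for dx in range(size):
                y, x = r + dy - half, c + dx - half
                # Mass falling outside the image is dropped.
                if 0 <= y < height and 0 <= x < width:
                    dmap[y][x] += kernel[dy][dx]
    return dmap


# Hypothetical head annotations in a 20 x 20 image.
heads = [(5, 5), (10, 12), (11, 13)]
dmap = density_map(heads, 20, 20)
count = sum(map(sum, dmap))  # integrates to the number of heads
```

Because every kernel here lies fully inside the image, summing the map recovers the head count; heads near the border would lose part of their mass in this simple version.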

3. Methodology

3.1. Overall Framework

Crowd flow in video is characterized by temporal dynamics, spatial correlation, and uncertainty. To address these characteristics, this paper proposes a spatio-temporal dynamic graph convolutional network (STDGCN) [36, 37] to model and predict crowd flow. Figure 1 shows the framework of the proposed STDGCN. The model consists of an input transformation layer, an STDGCN layer, and an output layer composed of a fully connected layer.

The model uses the spatio-temporal data collected by the crowd flow sensors in the video, together with external factors, to predict future crowd flow and related parameters and to produce the spatio-temporal network [38] prediction output. The input transformation layer embeds and converts the crowd flow attribute data and the exogenous factor data; three types of data are used as exogenous factors. The STDGCN layer contains a graph convolution module and an encoder-decoder structure along the time dimension. The output layer generates the prediction result for each node through a fully connected layer.

The core ideas of the STDGCN model can be summarized in two points. First, the sensor data at a given time are regarded as graph data: nodes are connected to their neighbor nodes to represent the spatial correlation of the crowd flow, and the graph convolutional network is used to capture this correlation. Second, the data of a single node at different moments are treated as a time series, and a gated recurrent unit (GRU) with an attention mechanism is used to handle the temporal dynamics of the crowd flow. The STDGCN layer structure is shown in Figure 2.

The spatio-temporal dynamic graph convolution module consists of two parts: a graph convolutional network (GCN) and an attention encoder network (AEN). The graph convolutional network handles the spatial dependence of the crowd data, and the attention encoder network captures the dynamics along the time dimension.

3.2. Spatial Feature Extraction of Crowd

Compared with using two-dimensional image convolution to obtain the patterns and characteristics of the crowd, pedestrian sensor data, which have the characteristics of graph data, retain more primitive and realistic spatial attributes. In the proposed STDGCN model, graph convolution is applied directly to the graph structure, and highly meaningful patterns and features are extracted in the spatial domain. Traditional convolutional neural networks can effectively extract local features of data, but they are not suitable for general graph structures. There are two types of methods for generalizing convolutional neural networks to graph structures. One extends the spatial definition of convolution, and the other uses the graph Fourier transform to operate in the spectral domain.

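The spectral method of graph convolution uses a diagonalized linear operator defined in the Fourier domain to convolve the graph signal $x$ with the convolution kernel $g_\theta$. The convolution operation on the graph signal can be expressed as

$$g_\theta *_G x = U g_\theta(\Lambda) U^{T} x, \tag{1}$$

where $U$ is the Fourier basis composed of the eigenvectors of the graph Laplacian matrix $L = U \Lambda U^{T}$ and $\Lambda$ is the diagonal matrix composed of the eigenvalues of $L$. When the scale of the graph becomes large, that is, when the crowd is large, the computational complexity of the eigendecomposition of the Laplace matrix in equation (1) is very high, so the kernel can be approximated by Chebyshev polynomials:

$$g_\theta *_G x \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) x, \qquad \tilde{L} = \frac{2}{\lambda_{\max}} L - I_N, \tag{2}$$

where $\theta_k$ are the polynomial coefficients, $T_k$ is the Chebyshev polynomial of order $k$, and $\lambda_{\max}$ is the largest eigenvalue of $L$.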
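The Chebyshev polynomial approximation mentioned above avoids the eigendecomposition by using the recurrence T_0(x) = x, T_1(x) = L~x, T_k(x) = 2 L~ T_{k-1}(x) - T_{k-2}(x), where L~ is the rescaled Laplacian. The sketch below is one possible plain-Python rendering for a single scalar feature per node. The symmetric normalized Laplacian, the default `lambda_max = 2.0`, and all function names are assumptions made for illustration, since the paper does not specify these details.

```python
import math


def normalized_laplacian(adj):
    """Symmetric normalized Laplacian L = I - D^(-1/2) A D^(-1/2)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[(1.0 if i == j else 0.0)
             - (adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0)
             for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Multiply a dense matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def cheb_graph_conv(adj, x, theta, lambda_max=2.0):
    """Approximate spectral graph convolution with Chebyshev polynomials."""
    n = len(adj)
    lap = normalized_laplacian(adj)
    # Rescale the Laplacian so its spectrum lies in [-1, 1].
    scaled = [[2.0 * lap[i][j] / lambda_max - (1.0 if i == j else 0.0)
               for j in range(n)] for i in range(n)]
    t_prev, t_curr = list(x), matvec(scaled, x)
    out = [theta[0] * v for v in t_prev]
    if len(theta) > 1:
        out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        # Chebyshev recurrence: T_k = 2 * L~ * T_{k-1} - T_{k-2}.
        t_next = [2.0 * a - b for a, b in zip(matvec(scaled, t_curr), t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out


# Hypothetical 3-node path graph with one crowd-flow value per node.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
signal = [1.0, 2.0, 3.0]
filtered = cheb_graph_conv(adjacency, signal, theta=[0.5, 0.3, 0.2])
```

Each additional coefficient in `theta` lets information travel one more hop along the graph, so the filter order controls how far the spatial neighborhood of a node reaches.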

3.3. Time Feature Extraction of Crowd Flow

As shown in Figure 2, after spatial features are extracted from the crowd flow data by the graph convolutional network, the spatial feature sequence and the embedded representation of the exogenous factors are used as the input for modeling the time dimension. The AEN module is composed of two GRU networks with independent parameters. The GRU network on the left is the encoder module, and the GRU network on the right is the decoder module. The encoder encodes the input time sequence, and its state at the last moment is used to initialize the decoder. The decoder then generates the prediction output from the context vector step by step in time. The output representation of each sensor node at each time step is computed from the feature sequence obtained by the graph convolution operation and from the exogenous factor.
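The encoder-decoder arrangement of two GRU networks described above can be sketched as follows. This is a deliberately small illustration with a scalar hidden state for a single node; the parameter values, the choice to add the exogenous factor to the input, and feeding each decoder output back as the next decoder input are all assumptions, not details confirmed by the paper.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h, p):
    """One step of a scalar GRU cell with parameter dict p."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    return (1.0 - z) * h + z * candidate


def encode_decode(features, exogenous, horizon, enc, dec):
    """Encode a feature sequence, then decode `horizon` future steps."""
    h = 0.0
    for x, e in zip(features, exogenous):
        # Assumption: the exogenous factor is simply added to the feature.
        h = gru_step(x + e, h, enc)
    outputs, prev = [], features[-1]
    for _ in range(horizon):
        # The decoder starts from the encoder's last hidden state.
        h = gru_step(prev, h, dec)
        outputs.append(h)
        prev = h
    return outputs


# Hypothetical parameters; the two GRUs do not share weights.
enc_params = {"wz": 0.5, "uz": 0.4, "bz": 0.0, "wr": 0.3, "ur": 0.2,
              "br": 0.0, "wh": 0.8, "uh": 0.6, "bh": 0.0}
dec_params = {"wz": 0.2, "uz": 0.7, "bz": 0.1, "wr": 0.5, "ur": 0.1,
              "br": 0.0, "wh": 0.9, "uh": 0.3, "bh": 0.0}

graph_features = [0.1, 0.4, 0.35, 0.8]  # per-step output of the graph convolution
external = [0.0, 0.1, 0.1, 0.0]         # embedded exogenous factor per step
forecast = encode_decode(graph_features, external, 3, enc_params, dec_params)
```

In practice the hidden state would be a vector and the weights would be learned; the sketch only shows how the decoder is initialized from the encoder and unrolled over the prediction horizon.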

A potential problem of the encoder-decoder model is that it must compress the context information of the whole input sequence into a fixed-length vector. This makes it difficult for the model to handle long sequences, especially those longer than the feature sequences in the training data. In other words, the encoder-decoder model may struggle to capture longer periodic features in the crowd flow, such as weekly regularity.
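An attention mechanism addresses the fixed-length bottleneck by letting the decoder weight all encoder states instead of relying on the last one only. The sketch below uses dot-product scoring followed by a softmax; this scoring function and the names `attend` and `softmax` are assumptions for illustration, because the paper does not state which attention form it uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return attention weights and the weighted context vector."""
    # Dot-product score between the decoder query and each encoder state.
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


# Hypothetical decoder state and two encoder hidden states.
weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The context vector is recomputed at every decoding step, so distant time steps, such as the same weekday one week earlier, can still contribute directly to the prediction.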

3.4. Semantic Relevance of Group Behavior

In this paper, a spatio-temporal correlation model of video is designed to infer the behavioral semantics of groups of people in video sequences. The model is composed of two GRU layers. The first GRU layer encodes the predicted feature mask sequence. The second GRU layer decodes the hidden codes output by the first layer one by one and, after spatio-temporal correlation, outputs the action semantics of each person and the behavior semantics of the group for the corresponding time sequence. The spatio-temporal correlation of group behavior can thus be divided into two stages: an encoding stage and a decoding stage.

In the encoding stage, the first-layer GRU structure takes the predicted feature masks as input and computes the hidden-layer information, which includes the hidden-layer information of each individual person and the hidden-layer output of the group. For each person in each video frame, the output is computed from the predicted feature mask input of that person in the first-layer encoding stage and the hidden-layer output of that person in the first-layer encoding stage, combined by a fusion function. The hidden-layer output for the group behavior is then obtained by applying a maximum pooling operation to the fused person features.
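The fusion function and the maximum pooling operation described above can be illustrated with a short sketch. Concatenation is used here as the fusion function, which is an assumption, since the paper does not say which fusion it uses; the feature values and function names are likewise hypothetical.

```python
def fuse(mask_feature, hidden_state):
    """Assumed fusion: concatenate the mask feature and the hidden state."""
    return list(mask_feature) + list(hidden_state)


def group_feature(person_features):
    """Element-wise max pooling across all persons in one frame."""
    return [max(values) for values in zip(*person_features)]


# Three hypothetical persons in one frame: (mask feature, GRU hidden state).
persons = [
    fuse([0.2, 0.9], [0.1]),
    fuse([0.7, 0.3], [0.5]),
    fuse([0.4, 0.4], [0.2]),
]
pooled = group_feature(persons)
```

Max pooling makes the group representation independent of the number and order of the people in the frame, which is convenient when players enter or leave the view.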

In the decoding stage, the output is generated according to the semantic description of the previous person's action and the hidden-layer state of the previous moment. The behavioral semantics of the group are then analyzed so that, after passing through the GRU structure, the group behaviors carry temporal sequence information.

The resulting set of semantic prediction probabilities for the group behavior is passed through a maximization step: the group behavior semantics with the largest predicted probability in this set are taken as the group behavior semantics of the video.
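The maximization step amounts to an argmax over the predicted probabilities. A minimal sketch follows, with hypothetical labels and probability values that are not results from the paper.

```python
def most_likely_behavior(probabilities):
    """Return the label whose predicted probability is largest."""
    return max(probabilities, key=probabilities.get)


# Hypothetical prediction for one video clip.
probs = {"left spike": 0.62, "right set": 0.25, "left pass": 0.13}
label = most_likely_behavior(probs)
```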

4. Experiments and Results

4.1. Experimental Setup

The Volleyball data set is used to verify the semantic extraction method for sparse group behavior based on the spatio-temporal trajectories in video. The data set contains 55 real volleyball match videos and 4380 frame labels. The image size of each frame is 720 × 108. Each frame label contains the number of the current video frame and the position of each player, given by the coordinates of the upper-left corner of the player's bounding box together with the height and width of the box. The 3493 frame labels of the first 39 videos were used as the training set, and the 1137 video frames of the last 16 videos were used as the test set.

In the experiment, the input is a fixed-length video sequence from which the individual action semantics and group behavior semantics of the players are extracted. Each video sequence fragment consists of the 4 frames before and the 5 frames after the labeled video frame, including the labeled frame itself; this length is chosen according to the characteristics of the volleyball matches in the data set. All experiments in this section were developed using TensorFlow and run on a Linux platform.

4.2. Evaluation Standard

To evaluate the extraction of sparse group behavior semantics based on the spatio-temporal trajectories in video, tests are conducted on the Volleyball data set, and the mask position matching feature F_ code-B is used to match people. The results are compared with Inception and HDTM in terms of both individual action semantics and group behavior semantics, and the accuracy of the extracted group behavior semantics and individual action semantics is taken as the evaluation standard.
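Accuracy, the evaluation standard named above, is the fraction of samples whose predicted label matches the annotation. A minimal sketch with hypothetical labels:

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical group-behavior predictions for four test clips.
group_acc = accuracy(
    ["left spike", "right set", "left pass", "right spike"],
    ["left spike", "right set", "right pass", "right spike"],
)
```

The same function applies to individual action labels by passing one entry per player instead of one entry per clip.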

4.3. Experimental Results

Table 1 compares the accuracy of Inception, HDTM, and our algorithm for both group behavior semantics and individual action semantics. As Table 1 shows, the proposed algorithm outperforms both baselines on individual actions and on group behaviors. Compared with the two algorithms, the semantic accuracy of individual actions increases by 4.5% and 2.1%, and that of group behaviors increases by 8.3% and 1.8%. By integrating the related movement trajectories of the people in the group and tracking them accurately, the complete movement clues of the people in the video sequence can be captured. Figure 3 shows examples from the data set in which individual actions and group behaviors were extracted successfully. In Figure 3, the bounding box and individual action semantics of each player in the video frame are drawn, and the current group behavior semantics are marked.

4.4. Group Anomaly Recognition Experiments

To evaluate the detection of abnormal motion behavior in dense groups, this section uses the PETS 2009 data set, which contains sequences of activities by different groups of people. The data set is divided into five parts: calibration, training, counting, density estimation, and crowd tracking. It contains 9 videos with a frame resolution of 576 × 768 and includes 152 abnormal data items. In the experiments in this section, the first 1134 frames are used as the training set, and the last 378 frames are used as the test set.

For the detection of abnormal gathering behaviors of dense groups, the experiments in this section are divided into image-level detection and pixel-level detection. Image-level detection gives the abnormal gathering detection result, and pixel-level detection locates the place of abnormal gathering and counts the number of gatherings. The same data input and processing methods are used as for abnormal dispersion behavior.

The detection results for abnormal dispersion behavior of dense groups on the PETS 2009 data set were compared quantitatively with the optical flow-based DBM algorithm and the deep learning-based D-IncSFA method. The results are shown in Table 2.

Experiments show that the proposed method detects abnormal dispersion behavior well from crowd density distribution images and identifies abnormal video frames more accurately. A movement is judged abnormal only when its speed exceeds a certain threshold: the movement speed must be no less than 0.5 meters per second, and a speed of more than 1.2 meters per second is treated as abnormal dispersion behavior. The qualitative evaluation of abnormal dispersion behavior detection in dense groups is shown in Figures 4 and 5.
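One possible reading of the speed rule is sketched below: the speed of a tracked position is estimated from its displacement between consecutive frames, and speeds above the upper threshold are flagged. The frame rate, the pixel-to-meter scale, and the function names are hypothetical values chosen for illustration; the paper does not specify how speed is measured.

```python
import math


def speed_mps(p0, p1, fps, meters_per_pixel):
    """Speed between positions in two consecutive frames, in meters per second."""
    distance_px = math.hypot(p1[0] - p0[0], p1[1] - p0[1])
    return distance_px * meters_per_pixel * fps


def is_abnormal_dispersion(speed, threshold=1.2):
    """Flag speeds above the threshold (1.2 m/s assumed from the text)."""
    return speed > threshold


# Hypothetical track: a 5-pixel move in one frame at 7 fps, 0.05 m per pixel.
speed = speed_mps((0.0, 0.0), (3.0, 4.0), fps=7, meters_per_pixel=0.05)
flagged = is_abnormal_dispersion(speed)
```

A frame-level decision could then be made by counting how many tracked positions are flagged, although that aggregation step is not described in the text.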

5. Conclusion

With a rapidly growing population, increasingly varied crowd activities, and the rapid development of the socialization process, group scenes are becoming more common. As a result, the demand for analyzing, modeling, and understanding group behavior data in video is increasing. In this paper, we take group density analysis and group behavior recognition in video as the goal and propose a group behavior pattern recognition algorithm based on a spatio-temporal graph convolutional network. We designed a crowd detection and localization method based on density map regression-guided classification and a crowd behavior analysis method based on density level division to complete crowd density analysis and video group behavior detection. In addition, we proposed using a dual-stream spatio-temporal graph network model to extract spatio-temporal features of crowd posture and density, so as to effectively capture the differentiated movement information between different crowds. Experiments on public data sets show that the method achieves high recognition accuracy and can effectively predict group behavior, which demonstrates the effectiveness of the proposed approach.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding this paper.

Acknowledgments

This work was supported by Special Funds for Basic Scientific Research in Central Universities (ZY20215126) and China Scholarship Fund.