Abstract

With the rapid growth of the network user base and the number of short videos, a large number of videos related to terrorism and violence have emerged on the Internet, which has brought great challenges to the governance of the network environment. At present, most short-video platforms still rely on manual-review and user-report mechanisms to filter videos related to terrorism and violence, which cannot keep pace with the development of the short-video business in terms of recognition accuracy and timeliness. Among the single-mode methods for violent video recognition, this paper mainly studies the scene recognition mode. Firstly, the U-Net network is improved with the SE-block module and, after pretraining on the Cityscapes dataset, is used to perform semantic segmentation of video frames. On this basis, semantic features of scenes are extracted using the VGG16 network loaded with ImageNet pretraining weights, and the SE-U-Net-VGG16 scene recognition model is constructed. The experimental results show that the prediction accuracy of the SE-U-Net model is clearly higher than that of the FCN and U-Net models and that the SE-U-Net model has significant advantages in the scene recognition mode.

1. Introduction

With the expansion of the network user base of short videos and the rapid growth of the number of short videos, a large number of videos related to terrorism and violence have emerged on the Internet, and the uncontrollable nature of video content has become more and more obvious, which poses a great challenge to the supervision of the network ecology. At present, teenagers account for 25 percent of Internet users, and videos related to terrorism and violence on the Internet have a very harmful influence on their growth and development.

At present, most short-video content sharing platforms still adopt a combination of manual-review and user-reporting mechanisms to screen and filter short videos related to terrorism and violence. However, with the rapid popularization of the short-video format, the increasing number of users and videos, and the difficulty of reviewing videos related to terrorism and violence, the current review mechanism cannot adapt to the development trend of the short-video business. Therefore, an identification method is urgently needed to automatically screen terrorism-related and violence-related short videos, which can not only improve the accuracy and efficiency of content filtering but also greatly reduce labor costs.

In single-mode violent video recognition research, current work focuses on convolutional neural networks and long short-term memory (LSTM). Song et al. [1] proposed a violent video detection method based on an improved 3D convolutional neural network (3D ConvNet), which adopted a uniform sampling method to construct the network and achieved competitive results. Several researchers [2-4] added long short-term memory (LSTM) on the basis of convolutional networks to extend the length of recognizable videos. Mahalle and Rojatkar [5] proposed an extreme learning machine (ELM) classifier based on audio to detect violent scenes using audio features in the time and frequency domains; two types of datasets were used, and the accuracy reached over 85%. Several researchers [6-8] studied motion information to quickly detect violent activities and achieved better performance. In the article of Solmaz et al. [9], the distribution of motion representation was examined in terms of the velocity, direction, and acceleration of the crowd, which can effectively detect abnormal activities. Gu et al. [10] extracted the features of three different modes (appearance, motion, and audio) and selected a feature-level fusion strategy to fuse the multimodal features, which achieved good results on a public dataset. The U-Net semantic segmentation network also plays an important role in violent video recognition. Several studies [11-15] mainly concerned the application of U-Net in medicine; by optimizing local modules of U-Net and strengthening its training on different datasets, the accuracy of U-Net in medical image segmentation was improved to a certain extent. In the articles of several researchers [16-21], a variety of segmentation models with different performances were integrated so that their advantages and disadvantages complement each other, thus forming an integrated semantic segmentation framework that ultimately improves the segmentation effect and accuracy to a certain extent. Deep learning networks based on VGG16 are mainly used to improve the accuracy of image classification. Yan and Ma [22] used AIM-VGG16 acceleration to design reconfigurable heterogeneous computing hardware, which had very obvious advantages. Chen et al. [23] applied the VGG16 model to solar radio spectrum recognition, greatly improving the efficiency of solar activity research. Because violence detection datasets are small, models are prone to overfitting and poor generalization. Several researchers [24-27] used hybrid models, which achieved high accuracy on the Hockey Fights test set. Belaid and Loudini [28] proposed deep learning techniques based on combinations of pretrained VGG-16 CNNs to classify three types of brain tumors, with an accuracy of more than 90%, higher than that of the most advanced classifiers.

The above methods mainly study violent video recognition from the three aspects of appearance, motion, and audio based on deep learning, and they do not take video scene recognition into account. However, in most violent videos the scene carries a premonition of violence: the scene changes in some way before violent frames appear, or the scene is where violent frames are concentrated. Therefore, the recognition accuracy of the above work is not high or has certain limitations. In research on single-mode violent video recognition, the scene recognition mode needs to be studied.

Based on semantic segmentation, this paper models the environment where the video takes place and the state of the entities in that environment, and constructs the scene recognition mode for violent video. Firstly, the basic architectures of the FCN and U-Net semantic segmentation networks are introduced. Then, the network structure of U-Net is improved by combining the SE-block channel attention module, and the SE-U-Net semantic segmentation network and the SE-U-Net-VGG16 scene recognition model are proposed. Finally, the performance of the FCN, U-Net, and SE-U-Net semantic segmentation networks is tested and analyzed, and the effectiveness of the SE-U-Net-VGG16 scene recognition model is verified on two different datasets. The detection accuracy of the single scene recognition mode in cross-validation reaches 89.6%.

With the wide use of video as a carrier of information transmission, violent video recognition has become an important research direction in computer vision, with very broad application scenarios. Aiming at the problem of violent video recognition, this paper innovatively proposes a scene recognition mode alongside the three most common identification modes. From the perspective of the scene recognition mode, the detection of violent video is explored and studied. The main innovations are as follows:

(1) In the single-mode recognition model of violent video, in addition to the most common motion, velocity, and acceleration recognition modes, this paper adds the scene recognition mode, which uses a semantic segmentation network to generate the scene feature maps and then uses the VGG16 pretrained network to extract scene information from the feature maps for identification.

(2) Based on the U-Net semantic segmentation network, the SE-block channel attention module is added, and the SE-U-Net semantic segmentation network is proposed. The semantic segmentation accuracy of SE-U-Net on the CatsAndDogs and Cityscapes datasets is 95.3% and 92.0%, respectively, both better than those of FCN and U-Net.

(3) The SE-U-Net-VGG16 model is constructed based on the SE-U-Net network and achieves an identification accuracy of 89.6% on the Hockey Fights dataset.

2. Basic Theories and Methods

This part mainly describes the basic theories and methods involved in the model of this paper, including DNN, CNN, RNN, LSTM, and the attention mechanism. Understanding this basic deep learning background helps in understanding the subsequent models and methods of this paper.

2.1. DNN and CNN

As the basis of deep learning, the deep neural network (DNN) has gone through a long development process. Early neural network research was significantly influenced by theories from neuroscience and had an obvious hierarchical structure: information transmitted from upper-layer neurons was processed by neurons in the same layer and then transmitted to neurons in the lower layer.

The convolutional neural network (CNN) is a feature extraction tool commonly used in computer vision. Its front end is mainly composed of convolutional layers and pooling layers that perform feature extraction. These two kinds of layers usually appear alternately in the network structure and are the structural units unique to CNN.

The main function of the convolution layer is to extract features from the input feature map by means of the convolution operation. The convolution expression in discrete form is

$(f * g)(n) = \sum_{m} f(m)\, g(n - m).$

The matrix is expressed as

The definition of two-dimensional convolution is

$S(i, j) = (X * K)(i, j) = \sum_{m}\sum_{n} X(m, n)\, K(i - m, j - n).$

In the above equations, $*$ denotes the convolution operation. Each convolutional layer has multiple convolution kernels (filters), and the commonly used filter sizes are 3 × 3 and 5 × 5. The function of the convolution kernel is to extract and fuse the features output by the upper-layer neurons according to the kernel parameters, so as to enhance certain image features and reduce noise. The function of the pooling layer is to reduce the feature dimension, speed up computation, and prevent overfitting.
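To make the roles of the convolution and pooling layers concrete, the following minimal TensorFlow sketch (layer sizes and names are illustrative, not those of the paper's model) applies a 3 × 3 convolution followed by 2 × 2 max pooling:

```python
import tensorflow as tf

# Illustrative only: a single 3 x 3 convolution followed by 2 x 2 max pooling.
x = tf.random.normal([1, 32, 32, 3])            # batch, height, width, channels

conv = tf.keras.layers.Conv2D(filters=16, kernel_size=3,
                              padding="same", activation="relu")
pool = tf.keras.layers.MaxPooling2D(pool_size=2)

features = conv(x)        # (1, 32, 32, 16): features extracted by 16 kernels
reduced = pool(features)  # (1, 16, 16, 16): spatial size halved, computation reduced
```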

2.2. RNN and LSTM

For the DNN and CNN mentioned in Section 2.1, the input and output dimensions of the training samples are fixed, and such models can solve common problems. However, when training samples are sequential and of different lengths, such as speech fragments and handwritten text sequences, DNN and CNN are difficult to apply. Since these sequences have different lengths and their features are generally correlated across positions, they cannot be directly split into independent samples for training with DNN/CNN. In this case, the recurrent neural network (RNN) came into being.

To overcome the defects of RNN, the long short-term memory (LSTM) neural network emerged. LSTM controls the memory state of the information through a gated structure and updates the cell state with a forget gate, an input gate, and an output gate. The updating mechanism of the memory state is

$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f),$
$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i),$
$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o),$
$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c),$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$
$h_t = o_t \odot \tanh(c_t).$

In the equations above, the symbol $\sigma$ represents the sigmoid activation function, $\odot$ represents the element-wise product of tensors, $h_t$ represents the hidden state, $c_t$ represents the memory unit, $f_t$, $i_t$, and $o_t$ represent the gate variables, and $W$ and $b$ represent the weight and bias tensors in LSTM.

By adding a gated structure to the memory line, LSTM can control whether the sequence information at different positions is remembered or forgotten, which avoids the defect of RNN in the prediction and classification of long-sequence tasks. In this paper, each frame of the short video to be processed can be regarded as an element of a time series, and LSTM has a wide range of applications in this kind of problem.
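As a hedged sketch of how an LSTM can consume per-frame features of a short video (the batch size, frame count, and feature dimension below are hypothetical, not the paper's configuration):

```python
import tensorflow as tf

# Hypothetical shapes: 8 clips per batch, 40 frames per clip, 512-d feature per frame.
frame_features = tf.random.normal([8, 40, 512])    # batch, time steps, feature dim

lstm = tf.keras.layers.LSTM(units=128)             # gated memory over the frame sequence
clip_embedding = lstm(frame_features)              # (8, 128): one summary vector per clip
logits = tf.keras.layers.Dense(2)(clip_embedding)  # e.g., violent / non-violent scores
```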

2.3. Attention Mechanism

According to the different mechanisms, the attention mechanisms in CV can be divided into spatial and channel attention mechanisms [29].

The spatial attention mechanism controls which positions in the image the model focuses on. By adjusting spatial attention, the mechanism can reduce the interference of noise and other information irrelevant to the learning task during feature extraction; that is, when features are extracted from the original image, the network pays more attention to the spatial information that is more valuable for the learning task.

Different from the spatial attention mechanism, channel attention controls which channels of the image information the model focuses on. Commonly used feature extraction networks generally contain multiple feature extraction kernels, and each kernel generates its own feature channel. The task of channel attention is to assign the model's emphasis to each feature channel.

In the model fusion of this paper, the Concat method is used to splice the gradient, optical flow, and acceleration feature maps in the channel dimension, and the channel attention mechanism can then be used to further adjust the model's emphasis on these modal features.

3. Materials and Methods

3.1. Datasets

In this paper, the Hockey Fights violence video detection dataset, extracted from ice hockey games, is used. The dataset contains 1000 video clips with a resolution of 360 × 288, divided into positive and negative categories. Some of the video clips are shown in Figure 1(a). The clips do not have a fixed number of frames; the frame statistics of the 1000 clips are shown in Figure 1(b). To facilitate the establishment of the model, the first 40 frames of each clip are extracted in this paper.
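A minimal sketch of this preprocessing step, reading the first 40 frames of a clip with OpenCV (the file name and the padding rule for short clips are assumptions made for illustration):

```python
import cv2
import numpy as np

def first_n_frames(video_path, n=40, size=(360, 288)):
    """Read the first n frames of a clip; repeat the last frame if the clip is shorter."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < n:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, size))     # size is (width, height)
    cap.release()
    while frames and len(frames) < n:               # pad clips shorter than n frames
        frames.append(frames[-1])
    return np.stack(frames)                         # (n, 288, 360, 3)

clip = first_n_frames("HockeyFights/fight_001.avi")   # hypothetical file name
```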

3.2. Scene Recognition Framework Based on Semantic Segmentation

Based on the semantic segmentation of video frames, this chapter builds an end-to-end scene recognition model from the environment where the video takes place and the state of the entities in that environment. The input is the original video, and the output is the predicted category. The model first extracts a frame from the video and uses the semantic segmentation module to segment it, obtaining a semantic segmentation feature map. This feature map is fed into the VGG16 network to extract features; finally, average pooling is used to collapse the spatial dimensions, and the result is put into a fully connected neural network for classification [30]. The network structure of the model is shown in Figure 2.
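A hedged Keras sketch of this pipeline (the segmentation module is represented by a placeholder model `se_u_net`, and the classifier sizes are illustrative, not the exact configuration of the paper):

```python
import tensorflow as tf

def build_scene_recognizer(se_u_net, num_classes=2, input_shape=(224, 224, 3)):
    """Frame -> semantic segmentation map -> VGG16 features -> average pooling -> classifier."""
    frame = tf.keras.Input(shape=input_shape)

    # Pretrained segmentation module, assumed to output a 3-channel map of the same size.
    seg_map = se_u_net(frame)

    # VGG16 loaded with ImageNet weights, used as the scene feature extractor.
    vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                      input_shape=input_shape)
    vgg.trainable = False
    features = vgg(seg_map)

    # Average pooling collapses the spatial dimensions; a dense head classifies the scene.
    pooled = tf.keras.layers.GlobalAveragePooling2D()(features)
    hidden = tf.keras.layers.Dense(256, activation="relu")(pooled)
    output = tf.keras.layers.Dense(num_classes, activation="softmax")(hidden)
    return tf.keras.Model(frame, output)
```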

3.3. Semantic Segmentation
3.3.1. FCN

The traditional semantic segmentation network is built on CNN. The network takes the area around the pixel to be predicted (such as 10 × 10 or 15 × 15) as the network input and then uses fully connected layers at the back end to predict the semantic label. This approach has several problems, for example, how to determine the region size and how the region limits the receptive field of the convolution kernel. Against this background, the fully convolutional network (FCN) [31] came into being and is a milestone in image semantic segmentation: it extends label prediction from the image level to the pixel level, which greatly improves the accuracy of semantic segmentation.

In FCN, a deconvolution layer that can enlarge the feature map replaces the traditional fully connected layer at the back end of CNN, so that the network can accept images of any size as input. The network structure of FCN-8s is shown in Figure 3. At the same time, to compensate for the reduction of image resolution caused by convolution and pooling, FCN uses a deconvolution (deConv) strategy to restore the feature map to the original dimensions of the input image. As shown in Figure 4, the feature map is first filled with zero elements (i.e., unpooling) and then convolved, thus restoring the image resolution.

During deconvolution in the deConv layer, the semantic information of some pixels was missing. To compensate for the semantic information lost in the deConv layer, FCN incorporates a hierarchical skipping structure according to different levels of depth to ensure the accuracy and robustness of semantic segmentation.

The hierarchical structure combines the coarse high-level information learned by FCN with the detailed low-level information to generate the end-to-end semantic segmentation image. Based on the results of the convolutional and pooling layers, FCN-32s performs 32 times upsampling to restore the predicted image to the original image size. This processing is similar to the traditional CNN-based semantic segmentation network; because too much information is lost, the prediction results are not fine enough. FCN-16s first deconvolves the result of the final convolutional and pooling layers, combines it pixel by pixel with the output feature map of the POOL4 layer, and finally performs 16 times upsampling, which preserves both high-level semantic information and low-level image information. FCN-8s first upsamples the result of the final convolutional and pooling layers and fuses it with the output of the POOL4 layer, upsamples the fused map again and fuses it with the output of the POOL3 layer, and finally performs 8 times upsampling, thereby retaining the high-level semantic information together with the low-level image information of POOL3 and POOL4.
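The skip fusion of FCN-8s can be sketched as follows (a simplified illustration that assumes score, pool4, and pool3 tensors at strides 32, 16, and 8; this is not the exact implementation used in the paper):

```python
import tensorflow as tf
from tensorflow.keras import layers

def fcn8s_head(score, pool4, pool3, num_classes):
    """Fuse coarse class scores with POOL4/POOL3 features, then upsample 8x to full size."""
    up2 = layers.Conv2DTranspose(num_classes, 4, strides=2, padding="same")(score)
    fuse4 = layers.Add()([up2, layers.Conv2D(num_classes, 1)(pool4)])   # add POOL4 scores

    up4 = layers.Conv2DTranspose(num_classes, 4, strides=2, padding="same")(fuse4)
    fuse3 = layers.Add()([up4, layers.Conv2D(num_classes, 1)(pool3)])   # add POOL3 scores

    # Final 8x upsampling back to the input resolution.
    return layers.Conv2DTranspose(num_classes, 16, strides=8, padding="same")(fuse3)
```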

3.3.2. U-Net

FCN is the pioneering work of image semantic segmentation networks, and many semantic segmentation networks have been optimized and improved on the basis of FCN. U-Net is one of them: it refines the basic structure of FCN with a more detailed design and higher efficiency. U-Net consists of two parts. The first part is the traditional image feature extraction structure of convolutional and pooling layers, which is similar to FCN. Different from FCN, the second part of U-Net adopts a more elaborate deconvolution structure to fuse the early feature information. The overall structure of the network is shown in Figure 5.

U-Net consists of four modules, namely, the input, convolution, deconvolution, and output modules. In the convolution part, four downsampling modules are used to achieve 16-fold downsampling. To obtain a full-size output, the deconvolution part employs four upsampling modules, each consisting of a deConv layer and two ReLU-activated convolutional layers with a 3 × 3 kernel, achieving a total of 16-fold upsampling. At the same time, to fuse multiscale features of different depth levels, the U-Net network integrates the high-level information of each upsampling layer with the low-level information of the correspondingly sized downsampling layer. Four copy-and-crop structures are constructed in the four upsampling layers to fuse the multiscale features, so the information of any depth level can penetrate the whole network.
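One upsampling module of the structure described above can be sketched as follows (channel counts are illustrative, and the cropping step is omitted for brevity):

```python
import tensorflow as tf
from tensorflow.keras import layers

def unet_up_block(x, skip, filters):
    """DeConv (2x upsampling), concatenate the encoder skip feature, then two 3x3 convs."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate(axis=-1)([x, skip])          # copy-and-crop style feature fusion
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x
```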

In addition, in the upsampling process, as shown in Figure 6, semantic segmentation networks such as FCN fuse information by adding corresponding pixels, whereas U-Net adopts a completely different way of fusing earlier features at a later stage: in the channel dimension, the output features of network layers with the same size but different depths are stacked to form a more complete feature map, which guarantees the integrity of the information.
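The difference between the two fusion strategies, pixel-wise addition in FCN versus channel-dimension concatenation in U-Net, reduces to the following comparison (shapes are illustrative):

```python
import tensorflow as tf

decoder_map = tf.random.normal([1, 64, 64, 32])   # upsampled high-level feature map
encoder_map = tf.random.normal([1, 64, 64, 32])   # low-level feature map of the same size

fcn_style = tf.add(decoder_map, encoder_map)                  # (1, 64, 64, 32): pixels summed
unet_style = tf.concat([decoder_map, encoder_map], axis=-1)   # (1, 64, 64, 64): channels stacked
```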

3.3.3. SE-U-Net Based on SE-Block

The squeeze-and-excitation network (SE-block) [32] was originally designed to solve the problem of feature loss caused by the differing importance of the channels in the feature maps generated by CNN during the Conv-pool process. In a traditional CNN feature extraction network, feature extraction mainly relies on the convolution operation to fuse spatial information and channel information within the local receptive field. By default, all channels of the feature map generated by each layer are equally important, but in practical problems the importance of different channels may differ significantly, and there may even be dependencies between different channels.

For a feature map with $C$ feature channels, the work of SE-block is divided into three parts (a minimal code sketch is given after this list):

(1) The first part is the squeeze module. To obtain a global receptive field for each feature channel, the module compresses the input feature map in the spatial dimensions, using global average pooling or global maximum pooling, that is, $z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j)$ or $z_c = \max_{i, j} u_c(i, j)$. Each two-dimensional feature channel in the feature map is thus transformed into a real number with a global receptive field.

(2) The second module is the excitation module. The module uses a two-layer fully connected neural network whose nonlinearity fits the complex dependencies between the channels. The first layer is a dimension-reduction layer with $C/r$ neurons and the ReLU activation function; the second layer has $C$ neurons and uses the sigmoid function to generate a real number between 0 and 1, so the output dimension matches the number of input channels. The module mapping function is $s = F_{ex}(z, W) = \sigma(W_2\, \delta(W_1 z))$, where $\delta$ is the ReLU activation function, $\sigma$ is the sigmoid activation function, and $F_{ex}$ is the nonlinear mapping function.

(3) The third module is the reweight module. Based on the excitation output, the module assigns an importance weight to each channel of the original feature map, that is, $\tilde{x}_c = s_c \cdot u_c$.
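A minimal sketch of these three steps, assuming global average pooling for the squeeze and a reduction ratio r (hyperparameter values are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, r=16):
    """Squeeze-and-excitation: global pooling -> two-layer bottleneck -> channel reweighting."""
    c = x.shape[-1]
    z = layers.GlobalAveragePooling2D()(x)              # squeeze: (batch, C)
    s = layers.Dense(c // r, activation="relu")(z)      # excitation, dimension reduction to C/r
    s = layers.Dense(c, activation="sigmoid")(s)        # per-channel weights in (0, 1)
    s = layers.Reshape((1, 1, c))(s)
    return layers.Multiply()([x, s])                    # reweight: scale each feature channel
```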

The construction of SE-block is very simple, and the resulting network structure can be trained with ordinary forward and backward propagation. In addition, it adds little model and computational complexity.

To realize the fusion of multiscale features, the U-Net network is equipped with a skip-bridge structure. The skip bridge directly transfers the feature maps of the corresponding size in the downsampling path to the upsampling path and stacks them with the feature maps there in the channel dimension. This transfer-and-concatenation method makes it possible to use the low-level original image information during upsampling and recovery. However, because of the simplicity of the stacking method, the resulting feature map may not reflect the importance of individual channels or the interdependencies among them. Therefore, in this paper, on the basis of the standard U-Net network, the SE-block module is added to the bridge structure. The network structure is shown in Figure 7.
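A rough illustration of this idea, applying an SE-block to the skip feature before it is concatenated in the upsampling block (this is a sketch of the mechanism, not the exact SE-U-Net configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_unet_up_block(x, skip, filters, r=16):
    """U-Net up-block whose skip bridge is reweighted by an SE-block before concatenation."""
    # SE-block on the skip feature: squeeze, excitation, reweight.
    c = skip.shape[-1]
    w = layers.GlobalAveragePooling2D()(skip)
    w = layers.Dense(c // r, activation="relu")(w)
    w = layers.Dense(c, activation="sigmoid")(w)
    w = layers.Reshape((1, 1, c))(w)
    skip = layers.Multiply()([skip, w])                 # emphasize informative skip channels

    # Standard U-Net upsampling path using the reweighted skip feature.
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate(axis=-1)([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x
```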

4. Experimental Verification and Analysis

4.1. Introduction of Pretraining Dataset and Experimental Configuration Environment

This paper uses Cityscapes and CatsAndDogs, two classic semantic segmentation datasets, for the pretraining of the semantic segmentation networks.

The Cityscapes dataset is a street-scene image segmentation dataset for autonomous driving. The main scenes in the dataset are street scenes taken while driving, with 34 kinds of semantic labels, including people, vehicles, streets, and street lamps.

The CatsAndDogs dataset was released by Kaggle for image classification and semantic segmentation. The dataset contains a total of 7,393 pet images in 37 categories, and these images differ greatly in size, posture, and illumination.

The experimental model is built on the TensorFlow framework, and a GPU is used to accelerate computation. The experimental platform and software versions are shown in Table 1.

4.2. Experimental Results and Analysis

First, the FCN, U-Net, and SE-U-Net semantic segmentation networks are trained on the two classic datasets to verify and analyze the validity of the SE-U-Net network. Five-fold cross-validation was adopted for training and testing: the dataset was randomly divided into five subsets, four of which were used as the training set and one as the test set. The accuracy of the predicted semantic labels was used as the measurement index, and the iteration curves of the models' accuracy on the test set are shown in Figures 8 to 12.
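The five-fold split can be sketched as follows (using scikit-learn's KFold as an assumed utility; the training and evaluation calls are hypothetical placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold

clip_indices = np.arange(1000)                 # e.g., indices of the 1000 Hockey Fights clips
fold_accuracies = []

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(clip_indices):
    # model = train_model(clip_indices[train_idx])                    # hypothetical helper
    # fold_accuracies.append(evaluate(model, clip_indices[test_idx])) # hypothetical helper
    pass

# mean_accuracy = float(np.mean(fold_accuracies))   # reported cross-validation accuracy
```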

Figures 8(a) and 8(b), respectively, show the accuracy iteration curves of the FCN model on the test set after training on the CatsAndDogs and Cityscapes datasets. Considering that the positive and negative samples of the datasets are balanced, the accuracy of the predicted labels is used as the metric during training. On the CatsAndDogs dataset, the prediction accuracy of the semantic labels reaches 89.4% when the learning rate is 0.01, and on the Cityscapes dataset, the prediction accuracy of the model reaches 81.6% when the learning rate is 0.001.

Figures 9(a) and 9(b), respectively, show the accuracy iteration curves of the U-Net model in two datasets. When the learning rate is 0.0001, the prediction accuracy of the model in the CatsAndDogs dataset reaches 93.2%. When the learning rate is 0.001, the prediction accuracy of the model reaches 89% in the Cityscapes dataset.

Figures 10(a) and 10(b), respectively, show the accuracy iteration curves of SE-U-Net model in two datasets. In the CatsAndDogs dataset, the prediction accuracy of the model reaches 95.3% when the learning rate is 0.01. In the Cityscapes dataset, when the learning rate is 0.0001, the model has the best performance and the prediction accuracy reaches 92.0%.

The semantic segmentation diagram predicted by the model in the dataset is shown in Figure 11, where the first column is the original image, the second column is the standard semantic segmentation label, and the last three columns are the semantic segmentation diagram predicted by FCN, U-Net, and SE-U-Net, respectively.

According to the accuracy iteration curves and the predicted semantic segmentation images, all three models perform well on the CatsAndDogs dataset; the FCN model has the lowest accuracy, which still reaches 89.4%. The segmentation results also show that the semantic labels predicted by the FCN model have good integrity but lack some details compared with U-Net and SE-U-Net. On the Cityscapes dataset, the three models differ considerably because the dataset contains 34 semantic labels, which are much more complex than the three semantic labels in the CatsAndDogs dataset. The segmentation results also show that the street-view labels predicted by FCN are relatively rough and that the labels predicted by U-Net contain category errors for whole entities; SE-U-Net clearly has the best actual effect. In addition, from the point of view of convergence speed, the FCN and U-Net models converge faster; because of the added channel attention module, the SE-U-Net model converges significantly more slowly, requiring more than 40 epochs and nearly 80 epochs of training on the two datasets, respectively, before convergence.

Since this is a pretraining model, the convergence speed during training does not affect the detection of violent videos. In this paper, the SE-U-Net model is used for semantic segmentation in the scene recognition model, and different learning rates are compared. The iteration curves of the model training are shown in Figure 12. When the learning rate is 0.0001, the prediction effect of the model is the best: the average recognition accuracy reaches 89.6% in the five-fold cross-validation, and convergence is fast, with the model close to convergence by the 20th epoch. In the subsequent model fusion, this paper takes the pretrained network with a learning rate of 0.01 as the basic model of the scene identification mode.

5. Conclusions

Based on the most common motion, velocity, and acceleration recognition modes, this paper creatively proposes the scene recognition mode. A semantic segmentation network is used to model the environment where the video takes place and the state of the entities in that environment, and the scene recognition mode of violent video is constructed. Then, the U-Net network is improved with the SE-block channel attention module, and the SE-U-Net semantic segmentation network and the SE-U-Net-VGG16 scene recognition model are proposed.

Finally, the performance of the FCN, U-Net, and SE-U-Net semantic segmentation networks is tested and analyzed on two different datasets. The experimental results show that the FCN model reaches a semantic-label prediction accuracy of 89.4% on the CatsAndDogs dataset with a learning rate of 0.01 and 81.6% on the Cityscapes dataset with a learning rate of 0.001. The U-Net model reaches 93.2% on the CatsAndDogs dataset with a learning rate of 0.0001 and 89% on the Cityscapes dataset with a learning rate of 0.001. The SE-U-Net model established in this paper reaches 95.3% on the CatsAndDogs dataset with a learning rate of 0.01 and 92.0% on the Cityscapes dataset with a learning rate of 0.0001. The SE-U-Net model outperforms the FCN and U-Net models by several percentage points on both datasets and shows significant performance in the scene identification mode, which has potential application value.

The follow-up work will be carried out from the following aspects:

(1) In the process of feature extraction, extract more effective features, such as audio features; this also needs dataset support, so constructing appropriate datasets is an important task. At the same time, semantic segmentation data of violent scenes are still lacking for the pretraining of the semantic segmentation network.

(2) The model proposed in this paper is established on the basis of a fixed resolution and frame rate. However, the video size and frame rate in real application scenarios are often not fixed, so how to dynamically adapt the model to different parameters is also a research direction.

(3) An attention mechanism is added to modal fusion in this paper, but the correlation between different modal features has not been studied further, which is very important for better modal fusion, fused feature extraction, and missing feature completion.

(4) In this paper, a fixed network structure is used in the fusion of multimodal features, which means that the recognition model can only fuse features at a fixed depth level and cannot adaptively adjust the depth of the fused features. An adaptive fusion algorithm is of great significance for expanding the usage scenarios of the model.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the Group Building Scientific Innovation Project for Universities in Chongqing (CXQT21021), Science and Technology Research Project of Chongqing Education Commission (KJQN202100712), and the Joint Training Base Construction Project for Graduate Students in Chongqing (JDLHPYJD2021016).