With the growing number of surveillance cameras, human behavior detection has become important for public security. Detecting fight behavior in surveillance video is an essential and challenging research problem. We propose a multiview fight detection method based on the statistical characteristics of optical flow and a random forest classifier. Cyberphysical monitoring systems can obtain timely and accurate information from this method. Two novel descriptors, Motion Direction Inconsistency (MoDI) and Weighted Motion Direction Inconsistency (WMoDI), are defined to improve on existing methods for videos with different shooting views and to reduce the misjudgment of nonfight behaviors such as running and talking. First, the YOLO V3 algorithm is applied to mark the motion areas; then, the optical flow is computed to extract the descriptors. Finally, Random Forest is used for classification based on the statistical characteristics of the descriptors. Evaluation results on the CASIA dataset demonstrate that the proposed method improves accuracy, reduces the missing alarm and false alarm rates, and is robust to videos with different shooting views.

1. Introduction

With the increasing public demand for security and the development of machine vision technology, research on intelligent monitoring continues to deepen, and many scholars are committed to using the computing power of computers to process surveillance video data and strengthen security control. Large-scale monitoring cyberphysical systems now surround us, and the massive volume of video obtained from these systems must be analyzed to meet the public demand for security. However, security systems that rely on human observers are inefficient and may miss alarms because of the limited human capacity to monitor surveillance video continuously. This creates an urgent demand for research on automatic alarm technology for abnormal behaviors such as fighting. The result of the proposed method can serve as a foundation for cyberphysical systems that monitor and control entities in the physical world [1].

As a key step toward behavior recognition, behavior analysis explains and describes the motion of a target by using machine vision. Generally, behavior analysis methods fall into two classes: methods based on high-level human body structure and methods based on low-level image information. The information used in high-level methods includes the body point model, the 2D model of the human body [2], and the 3D model of the human body [3-5]. Moreover, to obtain more information, these methods require multiview videos or a 3D camera, which makes the model complicated and computationally intensive. It is also difficult to guarantee the stability of this kind of method because human posture estimation technology is still immature. In contrast, methods based on low-level image information use motion trajectory [6], shape features [7], texture features [8], optical flow features, and other image information to perform the behavior analysis. This kind of method describes the target's behavior from a macroperspective and is simple and of low complexity. Compared with the methods based on high-level human body structure, it performs better in real-time processing.

Fighting, as a typical abnormal behavior, not only threatens the safety of citizens but also has a negative influence on the public. Therefore, it is crucial to achieve real-time and reliable fight behavior recognition to assist security agents and ensure safety. The optical flow information extracted from pixel changes across video frames has satisfactory spatial and temporal characteristics and is widely used to describe a target's motion tendency in video processing [9]. In recent years, fight detection based on optical flow has attracted wide attention. Early work detected fights by setting thresholds on optical flow features. Later work has focused on employing machine learning algorithms for fight detection.

Based on the extracted optical flow information, several detection methods based on descriptors and thresholds have been proposed. Lin et al. [10] combined texture features and optical flow features to improve the recognition rate in crowded scenes. The amplitude-based weighted direction histogram can suppress the direction confusion caused by noise in a small area, and the entropy of this histogram can characterize the degree of chaos in the motion. However, it does not consider the changes of the optical flow energy when a fight occurs [11]. Xu et al. [12] combined the weighted reconstruction trajectory and histogram entropy to detect fight behavior. The changes in the optical flow energy of the target object can be used to identify abnormal behavior. Nevertheless, this approach does not distinguish well between fighting and running [13]. Lin et al. [7] used motion features and shape features to detect fights, which effectively reduces the misjudgment rate of anomaly recognition. However, it still misjudges conversation behavior. Based on the statistics of the amplitude and direction of optical flow, Huang et al. [14, 15] adopted the mean and variance to recognize fights, providing a novel idea for the definition of features. Although threshold-based detection is simple and runs in real time, it does not perform well on videos with different shooting views and can hardly avoid misjudging behaviors such as running and talking.

In addition, some scholars have investigated fight detection with machine learning models in order to distinguish fights from other behaviors effectively. Wang et al. [16] proposed a new feature, called Acceleration Violent Flow (AViF), and adopted a support vector machine (SVM) and random forest to detect fight behavior. However, this method has a low recognition rate for fights with insignificant movements, such as punching. Yang et al. [17] proposed a real-time method based on optical flow histograms, which calculates a scale and rotation invariant feature descriptor, the histogram of optical flow orientation (HOFO), and uses an SVM classifier for detection. However, this method also misjudges conversation. Considering the change of optical flow, Gao et al. [18] proposed the oriented violent flows (OViF) method for violence detection, which adopts AdaBoost and SVM classifiers for detection. However, this method performed poorly in crowded scenes. Recently, Mahmoodi et al. [19] proposed a new feature descriptor named Histogram of Optical flow Magnitude and Orientation (HOMO) to improve existing violence detection. Xu et al. [20] proposed a localization guided framework that exploits optical flow maps to extract motion activation information for detecting fight actions in surveillance videos. Febin et al. [21] proposed a cascaded method of violence detection based on motion boundary SIFT (MoBSIFT) and movement filtering for action recognition.

Approaches based on machine learning models perform well in many scenes, but most of them do not consider robustness across multiple views, which is essential for fight detection in real scenes. Existing algorithms also have limitations in avoiding misjudgment of behaviors such as running and overtaking, whose motion characteristics are sometimes similar to those of fighting. To address robustness to the shooting view and misjudgment of nonfight behaviors, we propose two descriptors, named Motion Direction Inconsistency (MoDI) and Weighted Motion Direction Inconsistency (WMoDI), by analyzing the motion characteristics of fights. In addition, a deep learning framework is used for motion region marking in this paper. We combine three existing descriptors with the two novel descriptors, and the final feature is calculated from six statistical characteristics of these descriptors to improve accuracy and multiview robustness.

Experimental results on CASIA Action Dataset and UT-Interaction Dataset show that the proposed descriptors are discriminative enough to distinguish fight and nonfight behaviors in videos with different shooting views. In addition, the proposed method has certain performance advantages in accuracy and can recognize fight in different datasets effectively.

In summary, the main contributions of this paper are as follows: (1) the proposed method considers the robustness of fight behavior recognition across multiple video perspectives. (2) The misjudgment rate of fight behavior detection can be reduced effectively by using MoDI and WMoDI. Part of this paper has appeared in [22]. As an extension of [22], we improve the method in motion region marking and descriptor definition and redesign several substantial experiments to evaluate the performance of our method in different scenes.

The remainder of this paper is structured as follows. Section 2 discusses the fighting detection and describes the proposed method. Section 3 introduces the data sets used and provides experimental results. Finally, in Section 4, we summarize our paper and outline some promising future research directions.

2. Fighting Detection

As the studies mentioned in Section 1 show, optical flow, as low-level image information, can express the pattern of motion, and scholars have proposed a series of descriptors for fight detection based on it. Compared with nonfight behaviors, e.g., talking and walking, fight behaviors are usually sudden, violent, and messy, which makes it difficult to define fight behavior in terms of standard actions [12]. Accelerated and sudden movement is often associated with states of high activation (happiness, anger) [23], which confirms that movements such as punches, kicks, or blows in a frame can be identified through suitable features. The challenge of this work is to distinguish fights from other behaviors with high speed and sudden movement, such as running.

To solve this challenge, we analyze the consistency and intensity of different motions in surveillance videos with different shooting views by calculating the displacement of each corner point. The result shows two typical distinctions. Compared with corner points in nonfighting (talking, walking, and running) videos whose inclination degree of displacement is consistent, corner points in fighting videos show chaotic direction changes. In addition, the velocity amplitude of corner points of nonfight tends to be stable while that of fight is larger and shows irregular changes. Therefore, velocity, acceleration and orientation are considered as the important factors for identifying fight behavior.

In our method, instead of recognizing the body structure and the movement of limbs, we adopt optical flow information to generate descriptors and the Random Forest algorithm to identify fight behavior. Figure 1 shows an overview of the proposed method. First, the state-of-the-art deep learning model YOLO V3 is adopted to mark the motion regions of passengers. Then, optical flow vectors are calculated, and a series of descriptors, including the two novel descriptors, are extracted. Next, we smooth the descriptors, compute several of their statistical characteristics, and generate the final feature vector for training by concatenating the statistical characteristics. Finally, the Random Forest algorithm is used to classify the frames.

2.1. Motion Regions Marking and Optical Flow Extraction

To reduce the impact of noise points, the first step of the method is motion region marking. Xu et al. [20] used optical flow to extract motion activation boxes from consecutive frames. Recently, Zhang et al. [24] proposed an improved algorithm based on the ViBe (visual background extraction) algorithm [25] to achieve moving target edge detection. In this paper, we employ You Only Look Once (YOLO) V3 [26], a deep learning framework based on Darknet-53 [27] that detects objects at three different scales, to mark the minimum bounding rectangles of passengers. Figure 2 shows its architecture.

After obtaining the motion regions, the optical flow magnitude and orientation are estimated by the Lucas-Kanade (LK) algorithm [28], which detects rapid changes of key points at the pixel level. Solving Equation (1) yields the optical flow vectors of each frame; the coefficients in Equation (1) are the partial derivatives of the image intensity with respect to position and time, evaluated at each key point. The result is shown in Equation (2).

Then, each optical flow vector is converted to the polar coordinate form as in Equations (3) and (4), which give the velocity magnitude and the orientation of the vector.
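For reference, the standard Lucas-Kanade formulation to which Equations (1)-(4) correspond can be sketched as follows. The notation is an assumption chosen for illustration and may differ from the symbols used in the original equations: $I_x$, $I_y$, $I_t$ denote the intensity derivatives, $p_1, \dots, p_n$ the pixels in the window around a key point, $(u, v)$ the flow vector, and $(\rho, \theta)$ its polar form.

```latex
% Brightness-constancy constraint at every pixel p_j of the window (assumed notation)
I_x(p_j)\,u + I_y(p_j)\,v = -I_t(p_j), \qquad j = 1, \dots, n

% Stacked as A [u v]^T = b and solved in the least-squares sense
A = \begin{bmatrix} I_x(p_1) & I_y(p_1) \\ \vdots & \vdots \\ I_x(p_n) & I_y(p_n) \end{bmatrix},
\quad
b = \begin{bmatrix} -I_t(p_1) \\ \vdots \\ -I_t(p_n) \end{bmatrix},
\quad
\begin{bmatrix} u \\ v \end{bmatrix} = (A^{\top} A)^{-1} A^{\top} b

% Polar form of the flow vector: velocity magnitude and orientation
\rho = \sqrt{u^{2} + v^{2}}, \qquad \theta = \operatorname{atan2}(v, u)
```

The least-squares solution exists only when $A^{\top} A$ is invertible, which is why the flow is evaluated at corner-like key points rather than in textureless regions.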

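Algorithm 1: Feature extraction.
Inputs: the video
Outputs: the final feature vectors
 for each frame of the video do
  Compute the optical flow vectors by Equation (2)
  Convert the optical flow vectors to the polar coordinate form by Equations (3) and (4)
  Compute the left and right frequencies and the left and right magnitudes by Equations (5), (6), (7), and (8)
  if the left frequency is larger than the right frequency then
   take the left frequency as the larger frequency
  else
   take the right frequency as the larger frequency
  end if
  if the left magnitude is larger than the right magnitude then
   take the left magnitude as the larger magnitude
  else
   take the right magnitude as the larger magnitude
  end if
  Compute MoDI by Equation (9)
  Compute WMoDI by Equation (10)
  Compute the Optical Flow Direction Entropy, the Weighted Optical Flow Direction Entropy, and the Average Kinetic Energy by Equations (11), (12), and (13)
  Obtain the smoothed descriptors by Equation (15)
 end for
 Calculate the statistical characteristics of the smoothed descriptors
 Concatenate the statistical characteristics to generate the final feature vectors

2.2. Feature Extraction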

Because input sequences in real scenes are captured from different views, we evaluate the multiview robustness of the descriptors. According to Deniz et al. [29], when a sudden movement with high acceleration occurs, the velocity magnitude of the key points in motion areas is usually larger than in normal behaviors such as walking and talking. Based on that finding, we use optical flow to calculate the two proposed descriptors and three existing descriptors [10-13], which are discriminative enough for fight behavior identification.

Based on the differences between fight and nonfight mentioned above, two novel descriptors are defined to emphasize the change in orientation and magnitude of velocity and ensure the multiview robustness. Next, we give the extraction process of two proposed descriptors.

First, the polar coordinates are divided into 12 intervals as shown in Figure 1, where each interval's size is 30° [30]. The frequency and magnitude of the vectors falling into the left intervals and the right intervals are calculated by Equations (5)-(8). In these equations, the sums run over the key points in the motion region, each vector is assigned to the angle interval in which its orientation falls, and the Kronecker delta function selects the vectors that belong to a given angle interval.
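The binning step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, and the split into "left" and "right" intervals is assumed to mean the left half-plane (orientations in [90°, 270°)) and the right half-plane (all other orientations), which the text does not spell out.

```python
NUM_BINS = 12                  # 12 angle intervals, as stated in the text
BIN_WIDTH = 360.0 / NUM_BINS   # 30 degrees per interval


def direction_histograms(vectors):
    """Return (freq, mag): per-interval counts and summed magnitudes.

    `vectors` is an iterable of (magnitude, orientation_in_degrees) pairs.
    """
    freq = [0] * NUM_BINS
    mag = [0.0] * NUM_BINS
    for magnitude, angle in vectors:
        # Wrap the angle into [0, 360) and find its 30-degree interval;
        # the final modulo guards against float rounding up to 360.0.
        k = int((angle % 360.0) // BIN_WIDTH) % NUM_BINS
        freq[k] += 1
        mag[k] += magnitude
    return freq, mag


def left_right_totals(hist):
    """Split a 12-bin histogram into (left, right) totals.

    ASSUMPTION: 'right' covers [0, 90) and [270, 360) degrees and
    'left' covers [90, 270) degrees; the paper may define them differently.
    """
    right_bins = {0, 1, 2, 9, 10, 11}
    right = sum(v for k, v in enumerate(hist) if k in right_bins)
    left = sum(v for k, v in enumerate(hist) if k not in right_bins)
    return left, right
```

Applying `left_right_totals` to the count histogram gives the two frequencies, and applying it to the magnitude histogram gives the two magnitudes; the larger value of each pair is then `max(left, right)`.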

Then, the definitions of the proposed descriptors Motion Direction Inconsistency (MoDI) and Weighted Motion Direction Inconsistency (WMoDI) are given.

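Definition 1. Given the frequencies of the left and right intervals, the MoDI of the current frame is defined by Equation (9), which also uses the larger of the two frequencies.

Definition 2. Given the magnitudes of the left and right intervals, the WMoDI is defined by Equation (10), which also uses the larger of the two magnitudes.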

In addition to MoDI and WMoDI, three existing descriptors that have proved effective in detecting fight behavior are adopted in the method. Their definitions are given below.

Definition 3. Given the normalized frequency, the Optical Flow Direction Entropy is defined by Equation (11).

Definition 4. Given the normalized magnitude, the Weighted Optical Flow Direction Entropy is defined by Equation (12).

Definition 5. Given the velocity magnitude and the number of corner points, the Average Kinetic Energy is defined by Equation (13).

The normalized frequency and the normalized magnitude are calculated by Equation (14). The Optical Flow Direction Entropy is calculated from the optical flow direction histogram, and the higher its value, the more chaotic the motion in the current frame. The Weighted Optical Flow Direction Entropy, which is based on both the magnitude and the orientation of the optical flow vectors, additionally considers the effect of amplitude energy. The Average Kinetic Energy considers the intensity of action in different behaviors.
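The three existing descriptors can be illustrated with a short Python sketch. It assumes the usual Shannon entropy with the natural logarithm and takes the kinetic energy as the mean squared velocity magnitude (unit mass, no factor of one half); the exact constants and logarithm base of Equations (11)-(13) may differ, and all function names are hypothetical.

```python
import math


def normalize(hist):
    """Scale a histogram so that its entries sum to 1 (all zeros if empty)."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [v / total for v in hist]


def entropy(probabilities):
    """Shannon entropy in nats; zero-probability intervals contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def average_kinetic_energy(magnitudes):
    """Mean squared velocity magnitude over the corner points.

    ASSUMPTION: unit mass and no 1/2 factor.
    """
    if not magnitudes:
        return 0.0
    return sum(m * m for m in magnitudes) / len(magnitudes)
```

Under these assumptions, `entropy(normalize(freq))` plays the role of the Optical Flow Direction Entropy and `entropy(normalize(mag))` that of the weighted variant, where `freq` and `mag` are the 12-interval count and magnitude histograms. Motion concentrated in one direction gives an entropy near zero, while motion spread evenly over all 12 intervals gives the maximum value, log 12.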
Finally, the above descriptors are smoothed according to the continuity of the video stream to enhance their regularity and reduce the impact of noise. The input video has a frame rate of 25 FPS, and we accumulate the descriptors of 10 consecutive frames as the accumulated descriptors of the current frame (as shown in Equation (15)). The last step of feature extraction is to calculate the statistical characteristics of the accumulated descriptors. Statistical characteristics are commonly used in data analysis. Inspired by [15], we combine the statistical characteristics of the accumulated descriptors to generate the final feature vector, which is fed into the classifier for training. Table 1 gives the descriptors and statistical characteristics used in our method, and Algorithm 1 summarizes the feature extraction.
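The smoothing and summarizing steps might look as follows in Python. This sketch makes several assumptions that are not confirmed by the text: accumulation is taken to be a sum over a trailing window that includes the current frame (shorter at the start of the video), and the four statistics used here (mean, variance, maximum, minimum) are placeholders for the characteristics listed in Table 1. All names are hypothetical.

```python
from statistics import mean, pvariance

WINDOW = 10  # number of consecutive frames accumulated, as stated in the text


def accumulate(values, window=WINDOW):
    """Sum each value with the preceding window-1 values (trailing window)."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the frame that left the window
        out.append(running)
    return out


def summarize(values):
    """Placeholder statistics; the paper's actual set is given in its Table 1."""
    return [mean(values), pvariance(values), max(values), min(values)]


def feature_vector(descriptor_series):
    """Concatenate the statistics of every accumulated descriptor series.

    `descriptor_series` maps a descriptor name to its per-frame values.
    Names are visited in sorted order so the layout is reproducible.
    """
    features = []
    for name in sorted(descriptor_series):
        features.extend(summarize(accumulate(descriptor_series[name])))
    return features
```

With five descriptors and four placeholder statistics, `feature_vector` would return 20 values; the real dimensionality depends on the statistics actually selected.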

2.3. Classification Model

In our method, the Random Forest (RF) algorithm is adopted as the classifier. The algorithm was first proposed by Breiman in 2001 [31]. In this paper, bootstrap samples and the Classification and Regression Tree (CART) algorithm are used to build each decision tree, with the Gini index, also called Gini impurity, measuring the quality of a split.
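As a small illustration of the split criterion, the Gini impurity of a set of labels is one minus the sum of the squared class proportions, and a candidate split is scored by the size-weighted impurity of its two children. The sketch below uses hypothetical function names and is independent of any particular random forest library.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions (0.0 if empty)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_quality(left, right):
    """Size-weighted Gini impurity of a candidate split; lower is better."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
```

For a two-class problem such as fight versus nonfight, the impurity ranges from 0 for a pure node to 0.5 for an evenly mixed one, so a split that separates the classes perfectly scores 0.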

Next, we discuss the values of some hyperparameters of the classifier in our method. Note that the classifier is trained on the CASIA action database, which has been widely used in behavior analysis research [10-12, 32, 33]. This data set contains 432 videos covering 15 types of behavior, captured by 3 still cameras from angle, horizontal, and top-down views. Of these, 105 are interactive videos and the remaining 327 are single-person videos.

According to existing work [34], the accuracy of the classifier is mainly affected by the number of trees and the maximum depth of the RF. Therefore, we tune these two hyperparameters, and the accuracy of the classifier is evaluated by 10-fold cross validation.
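The evaluation protocol can be sketched as follows. This is a generic 10-fold split written with the standard library only; `fit_and_score` is a hypothetical callback that trains a classifier on the training indices and returns its accuracy on the held-out indices, and the shuffling seed is an arbitrary assumption.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(fit_and_score, n_samples, k=10):
    """Average the accuracy returned by `fit_and_score(train, test)` over k folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

To tune a hyperparameter such as the number of trees, one would call `cross_validated_accuracy` once per candidate value, with the other hyperparameters held fixed, and keep the value with the highest average accuracy.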

2.3.1. The Number of Trees

This hyperparameter is the number of decision trees contained in the RF, and it has a great influence on accuracy. With the other hyperparameters of the model fixed, we evaluate the accuracy of the model with different numbers of trees. Figure 3 shows the accuracy with respect to the number of trees. The x-axis represents the number of trees, from 10 to 200, whereas the y-axis represents the accuracy obtained with that setting. The accuracy improves as the number of trees increases and gradually stabilizes once the number of trees exceeds 80. The accuracy reaches its maximum when the number of trees is 110. Therefore, the number of trees is set to 110 in the experiments of Section 3.

2.3.2. The Maximum Depth of RF

This parameter is tuned with the number of trees fixed at 110. Figure 4 shows the accuracy with respect to the maximum depth of the RF, where the x-axis represents the maximum depth of the decision trees, from 2 to 30, and the y-axis represents the accuracy obtained with that setting. The accuracy increases with the maximum depth while the maximum depth is below 16. Beyond 16, the improvement in accuracy slows down, and the accuracy reaches its maximum when the maximum depth is 26. Therefore, 26 is selected as the maximum depth of the RF in the experiments of Section 3. Table 2 shows the hyperparameters of the RF.

3. Experiments

In this section, we first introduce the data sets used in our experiments. Then, we evaluate the performance of the RF classifier and verify the robustness of the proposed descriptors by data visualization and ROC curves. Finally, we show the advantage of the proposed method by comparing it with some existing methods.

3.1. Datasets

CASIA Action Dataset and UT-Interaction Dataset [35] are used to evaluate the effectiveness and robustness of our method. From CASIA Action Dataset, all videos of fighting and 15 other videos in 5 categories are selected, and the details are shown in Table 3. With these videos, we verify the performance of the proposed method in different shooting views. Sample frames of fighting in three views of the dataset are shown in Figure 5.

In addition, UT-Interaction Dataset is selected as another dataset, which is taken by an outdoor still camera and contains 6 types of interactive videos. The videos of punching and kicking are used as samples. Figure 6 gives some sample frames.

3.2. Experimental Settings

We evaluate the performance of our classifier in terms of accuracy, missing alarm (MA), false alarm (FA), and F1-score and compare the RF classifier with other representative classifiers, including support vector machine (SVM), adaptive boosting (AdaBoost), and bootstrap aggregating (Bagging). The evaluation metrics are computed from four measures: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). Their definitions are given in Table 4.
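For concreteness, the four metrics can be computed from the confusion counts as in the sketch below. The formulas for accuracy and F1-score are standard; the formulas for MA (missed fights over actual fights) and FA (false detections over actual nonfights) are the commonly used definitions and are an assumption here. The function names are hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Compute accuracy, missing alarm, false alarm, and F1 from confusion counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "missing_alarm": _ratio(fn, tp + fn),   # ASSUMED: FN / (TP + FN)
        "false_alarm": _ratio(fp, fp + tn),     # ASSUMED: FP / (FP + TN)
        "f1": _ratio(2 * precision * recall, precision + recall),
    }
```

Under these definitions, MA equals one minus the recall, so a classifier tuned for a low missing alarm rate is one tuned for high recall on the fight class.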

The experimental results are shown in Table 5 and Figure 7. Note that the confusion matrices for each classifier are obtained from 300 frames sampled randomly from CASIA dataset. We can find that the RF outperforms other classifiers in terms of accuracy. In addition, the MA and FA of RF are suitable for fight detection in real scenes. Therefore, we adopt RF as classifier in our method.

Furthermore, 18 videos of six behaviors shot from 3 different views in the CASIA data set are selected to analyze the robustness of the proposed descriptors. Figure 8 gives the quantitative results of the descriptors of the six behaviors over time. The results show that the descriptors of fights are higher than those of the other behaviors in most frames. To ensure that the classifier can still recognize fights when the descriptors are not the highest, and to improve frame-level accuracy, we compute the statistical characteristics of the descriptors, which represent their global changes.

To evaluate the effectiveness of our proposed descriptors and of the statistical characteristics used in our method, the Receiver Operating Characteristic (ROC) curves for each descriptor are illustrated in Figure 9. For the model with statistical characteristics of the proposed and existing descriptors [10-13, 36], the Area Under Curve (AUC) is 0.976, which exceeds that of the other descriptors. The results show that the statistical characteristics of the descriptors are more discriminative than the descriptors themselves and help the classifier distinguish fight from nonfight.

We evaluate the performance of our proposed model against the following state-of-the-art approaches:
(i) The method proposed by Xia et al. [37]: a multiview interactive behavior recognition method based on self-similarity descriptors and graph shared multitask learning
(ii) The method proposed by Wang and Gao [38]: a deep learning based abnormal behavior recognition approach that uses interframe information and a two-stream convolutional neural network
(iii) MI_ULBP [39]: it combines moment invariants and the uniform LBP feature and trains a binary support vector machine pattern classifier to recognize behaviors
(iv) The method proposed by Xu et al. [12]: a human abnormal behavior detection algorithm that combines the weighted reconstruction trajectory and histogram entropy

The accuracy of proposed method and existing methods is shown in Table 6. Our method gives the best result for the CASIA Action Dataset.

In addition, the UT-Interaction Dataset is used to verify the applicability of the method. In total, 20 videos of human-human interactions are selected; half of them contain kicking, and the rest contain punching. Each video contains between 80 and 120 frames, and we consider that a fight occurs in a video when the frames predicted as positive account for more than thirty percent of the total frames. Table 7 shows the results, which indicate that the method proposed in this paper is also applicable to the UT-Interaction Dataset and can effectively identify fight behavior in videos.
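The video-level decision rule described above is simple enough to state in code. The following sketch assumes that the frame-level classifier outputs one truthy or falsy prediction per frame; the function name is hypothetical, and the strict inequality reflects the wording "more than thirty percent".

```python
def video_contains_fight(frame_predictions, threshold=0.30):
    """Flag a video as a fight when the share of positive frames exceeds the threshold."""
    if not frame_predictions:
        return False
    positive = sum(1 for prediction in frame_predictions if prediction)
    return positive / len(frame_predictions) > threshold
```

Aggregating over frames in this way tolerates isolated frame-level errors: a 100-frame video needs at least 31 positive frames to be flagged.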

In summary, the experimental results above show that the proposed method can effectively recognize fight behavior in videos with different shooting views. There are three main reasons. First, the feature vectors proposed in this paper are highly discriminative and generalize across views. Second, smoothing the extracted descriptors reduces the impact of noise and improves robustness. Third, as an efficient machine learning classifier, Random Forest is well suited to this task.

4. Conclusions

In this paper, to help cyberphysical systems perceive the required environment information more effectively, we proposed a method for fight behavior detection that reduces the misjudgment rate on nonfight behaviors and improves the robustness of detection for videos with different shooting views. Based on an analysis of motion characteristics, two novel descriptors, MoDI and WMoDI, are presented, which are extracted from optical flow information. Instead of feeding the descriptors into the classifier directly, 5 statistical characteristics are selected to generate the final vector. Experimental results demonstrate that the proposed detection method performs well on videos with different shooting views and achieves the highest accuracy among the compared methods on the CASIA dataset. In future work, we will focus on improving the early-warning capability and extending our method to poorly lit and even dark scenes.

Data Availability

The CASIA Action Dataset is available at http://www.cbsr.ia.ac.cn/english/action%20databases%20en.asp. The UT-Interaction Dataset is available at https://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.


Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2018YFC0807500), National Natural Science Foundation of China (No. 62072067), National Natural Science Foundation of China (No. 71904022), Social Science Foundation of Liaoning Province (No. L17CTQ002), and Fundamental Research Funds for the Central Universities (No. DUT21JC27).