Abstract

Over the past decade, considerable attention has been paid to crowd control and management in the area of intelligent video surveillance. Among the tasks of automatic video-based crowd management, crowd motion modeling is recognized as one of the most critical components, since it lays a crucial foundation for numerous subsequent analyses. However, it still encounters many unsolved challenges due to occlusions among pedestrians, complicated motion patterns in crowded scenarios, and so forth. To address these issues, we propose a novel spatiotemporal Weber field, which integrates both the appearance characteristics and the stimulus of crowd motion patterns, to recognize large-scale crowd events. On the one hand, crowd motion is regarded as a variation of the spatiotemporal signal, and we measure this variation based on the Weber law; the result is referred to as the spatiotemporal Weber variation feature. On the other hand, motivated by findings in crowd dynamics that crowd motion is closely related to interaction forces, we propose a spatiotemporal Weber force feature to exploit the stimulus of crowd behaviors. Finally, we utilize the latent Dirichlet allocation model to establish the relationship between crowd events and crowd motion patterns. Experiments on the PETS2009 and UMN databases demonstrate that our proposed method outperforms previous methods for large-scale crowd behavior perception.

1. Introduction

Over the past decades, the crowd phenomenon has become an important carrier of economic development and cultural exchange along with steady population growth and worldwide urbanization. Meanwhile, the difficulty of crowd management is rising rapidly as crowd scales increase. Examples of large-scale crowds are shown in Figure 1. According to statistics from The Guardian, more than fifteen fatal crowd accidents with high casualties have occurred within the past twenty years when people lost control during crowded special events, for example, the stampede at the Cambodia Water Festival and the Love Parade stampede in Germany. Obviously, such terrible events are much easier to control if we become aware of abnormal clues and nip the tragedy in the bud before it gets serious. Nowadays, the massive deployment of video surveillance systems provides a possibility to manage large-scale crowds and prevent such unfortunate events. Therefore, the perception of large-scale crowd events has attracted attention from the technical research community, especially for anomalous behaviors in crowd activities, where computer vision algorithms play a growing role.

1.1. Related Work

Vision-based crowd behavior perception has been thoroughly studied during the past few years, and existing work can be categorized into two main philosophies. The first category implements crowd behavior understanding from the appearance perspective. Some works can be regarded as microscopic models, which treat the crowd as a collection of discrete individuals and extend methods designed for individual behavior perception to crowd analysis. Representative works include the methods proposed by Jacques et al. [1] and Pellegrini et al. [2], in which the Voronoi diagram and linear trajectory avoidance are utilized to recognize crowd behavior, respectively. Obviously, it is essential for these algorithms to segment or detect individuals and track isolated pedestrians. A lot of related work has been done on these issues, including pedestrian detection [3] and crowd tracking [4–7]. However, these methods always suffer a significant degradation when the density of people in the scene increases. After all, in most cases it is not necessary to track each pedestrian for crowd event perception, which is time consuming. The other type is the macroscopic model, which treats the crowd as an entity, without the need to segment each individual. To the best of the authors' knowledge, most of these algorithms capture the low-level motion pattern of crowd behavior via optical flow [8–10], spatiotemporal motion patterns [11–14], or a combination of both [15]. However, most of these methods are sensitive to noise, for example, illumination changes, or are likely to fail when the crowd motion changes suddenly, which usually indicates an irregular crowd flow and is crucial in crowd analysis. An alternative is to model the crowd motion as a time-dependent flow field [16, 17], since it has been observed that pedestrian crowds exhibit striking analogies with the motion of fluids [18, 19]; for example, the footprints of a pedestrian crowd in the snow look similar to the streamlines of fluids. Although the fluid dynamic model has achieved success in crowd simulation [20, 21], it still faces difficulty in estimating the velocity of pedestrians for crowd behavior analysis.

Recently, another category of algorithms has become popular, which analyzes the stimuli, or driving factors, of crowd behavior. These methods rely on the assumption that crowd behavior originates from the interaction of its elementary individuals. Helbing proposed a social force model [22, 23] to investigate crowd motion dynamics. According to his work, crowd behavior is stimulated by a social force field, and pedestrians react to energy potentials caused by other pedestrians and static obstacles through a repulsive force. Mehran et al. [24] and Raghavendra et al. [25] adopted this model to detect abnormal events in the crowd. However, the model was designed for simulation purposes and is overall a microscopic model. For the application of crowd analysis, it faces difficulties in estimating the velocity of pedestrians, especially for large-scale crowds. Both algorithms approximate the desired force by averaging the optical flow around the pedestrian, which is not applicable to all cases, since the desired motion is very subjective. Moreover, the social force concept aims to model the interaction force between pairwise individuals and is thus inappropriate for dense crowd flow. Nevertheless, it provides us an inspiring perspective to analyze crowd behavior according to a force model.

1.2. Our Proposal

Motivated by the previous work, we propose a spatiotemporal Weber field which integrates both the appearance and the stimulus of crowd behavior. The theoretical framework of the paper is illustrated in Figure 2. From the appearance perspective, the moving crowd disturbs the distribution of the background and causes a fluctuation of the signal in both the spatial and temporal domains. This motivated us to measure the variation, based on which the crowd motion pattern can be explored. The Weber law is an important achievement in psychophysics, which attempts to describe the relationship between the physical magnitude of a stimulus and its perceived intensity [26]. The Weber law points out that the threshold at which humans can discriminate a change depends not only on the absolute variation but also on the original stimulus. It has been proved to be an effective tool to measure the variation of a stimulus and has been used for image texture recognition [27]. In this paper, we propose a Spatiotemporal Weber Variation Feature (ST-WVF) for crowd behavior perception, which adopts the Weber law to measure the variation of the video signal.

According to research in crowd dynamics, crowd motion is always driven by the stimulus of a force. In this paper, we propose a potential function and analyze the intensity of the force field for the Weber variation feature, referred to as the Spatiotemporal Weber Force Feature (ST-WFF), to explore the stimulus of crowd behavior. The above spatiotemporal Weber variation and force features together compose the spatiotemporal Weber field. The contributions of this paper are as follows.
(i) Firstly, we propose a spatiotemporal Weber variation feature to estimate the variation of the video signal and explore the crowd motion pattern.
(ii) Secondly, we propose a spatiotemporal Weber force feature, which explores the stimulus of crowd behavior.
(iii) Finally, we propose a crowd behavior perception system, which integrates the above spatiotemporal Weber variation and force features with location information and utilizes the latent Dirichlet allocation model to analyze them. Experiments on the UMN database and PETS2009 show that the proposed method achieves more desirable results than the conventional methods.

The remainder of this paper is organized as follows. The structure of the system is presented in Section 2. The details of the proposed algorithm are then discussed in Section 3, including the methods to extract the spatiotemporal Weber variation and force features. In Section 4, we utilize the latent Dirichlet allocation (LDA) model, in combination with the bag of features algorithm, to realize crowd behavior perception. Experimental results of our system are shown in Section 5. Finally, conclusions are drawn in Section 6.

2. System Architecture

In this section, we establish an intelligent vision system that is capable of modeling the large-scale crowd motion pattern and recognizing the crowd behavior. Specifically, motion pattern of crowd behavior is explored in terms of both appearance and stimulus.

From the appearance perspective, the moving crowd disturbs the distribution of the background and causes a fluctuation of the signal in both the spatial and temporal domains. Figure 3 shows a sample of the fluctuation resulting from a moving crowd in both domains. Specifically, Figure 3(a) is selected from one sequence of PETS2009, in which the pedestrians enter the scene from the right boundary, run along the paved road, and exit the scene on the left. We mark a red dotted line on the paved road to demonstrate the spatiotemporal fluctuations when pedestrians pass it, and we show the fluctuations in Figure 3(b). The green circle indicates the fluctuation at the red dotted line from the 80th frame to the 120th frame. As observed in the green circle in Figure 3(b), the fluctuation along the temporal direction is small, since all pedestrians have passed the red dotted line after the 80th frame, and the fluctuation thus results from spatial variation in the background, for example, strong edges. On the other hand, the fluctuation in the red circle corresponds to the variation from the 30th frame to the 80th frame, when the pedestrians pass the red dotted line; this fluctuation results from variation in both the spatial and temporal domains. From the result, we can observe that background regions, that is, static pixels, always exhibit a uniform distribution with little variation in the temporal domain (indicated by the green circle in Figure 3(b)), while regions of moving crowd present drastic fluctuations in the spatiotemporal domain (indicated by the red circle in Figure 3(b)).

In stimulus perspective, interaction force in this paper is estimated by investigating the distinctiveness of local motion patterns; that is, the interaction force is relatively small if the pedestrians’ motion is identical and large if their motion patterns are distinctive. Force vectors of each pixel construct a spatiotemporal Weber force field.

The schematic diagram of the proposed system is shown in Figure 4, which consists of feature extraction and feature analysis modules. In this paper, we focus on the module that extracts the spatiotemporal Weber variation and force features, which together are referred to as the spatiotemporal Weber field. For a specific location of the input video, we construct a cylindroid spatiotemporal volume with the corresponding pixel as its center. Afterwards, we propose a spatiotemporal Weber variation feature, which adopts the Weber law to measure the variation within the volume. After that, a spatiotemporal Weber force feature is proposed to explore the stimulus of crowd behavior by analyzing the properties of the Weber variation feature. The bag of features (BoF) algorithm is utilized to estimate the likelihood of the spatiotemporal Weber field. Finally, we utilize the latent Dirichlet allocation model to recognize the crowd behavior.

The proposed spatiotemporal Weber field reveals the crowd motion pattern from the appearance and stimulus aspects, respectively. Furthermore, an abnormal event, which is essentially an eccentric state of the crowd motion, can be regarded as an irregular change of the signal in either the spatial or the temporal domain. Meanwhile, there usually exists an anomalous stimulus for an abnormal event. Compared with conventional systems for crowd behavior analysis, the system proposed in this paper benefits from the following characteristics.
(i) Firstly, the system extracts the feature directly in the spatiotemporal domain. In this case, it does not depend on individual detection or tracking, and there is no need for background modeling, which is too complicated to implement in heavily crowded scenes. Therefore, the system is more suitable and practical for large-scale crowd analysis.
(ii) Secondly, the system models crowd behavior as a variation of the signal in the spatiotemporal domain and measures the change with the Weber law, which has been proved effective in psychophysics.
(iii) Thirdly, the system explores the stimulus of crowd behavior in addition to its appearance, which reveals essential motion characteristics.
(iv) Finally, the system utilizes latent Dirichlet allocation to model the crowd behavior, so that crowd behavior is recognized effectively.

We detail the proposed algorithm in the following sections.

3. Spatiotemporal Weber Field

3.1. Overview of the Weber Law

The Weber law was first proposed in the nineteenth century by the German physiologist Weber and was later formulated quantitatively by Fechner [28], the founder of modern psychophysics. The law reveals that the threshold of a just noticeable difference is a constant proportion of the original stimulus value. This relationship, known since as the Weber law, can be expressed as
$$\frac{\Delta I}{I} = k,$$
where $\Delta I$ denotes the increment threshold, or the just noticeable difference for discrimination, $I$ denotes the initial stimulus intensity, and $k$ represents the Weber fraction. The Weber law reveals that the difference threshold for discrimination depends not only on the absolute variation but, more essentially, on the relative change. It interprets a law of human cognition and provides a measurement of the variation of a stimulus, which is important for video analysis.

3.2. The Spatiotemporal Weber Variation Feature

Generally speaking, the moving crowd disturbs the distribution of the background and causes a variation in both the spatial and temporal domains. Therefore, the behavior can be regarded as a distortion of the signal over the spatiotemporal domain, and the Weber law provides an effective way to measure it. For a specific location $p_0 = (x_0, y_0, t_0)$ in the image sequence, we construct a cylindroid spatiotemporal volume with radius $R$ and height $2T$, with $p_0$ as its center. The volume is denoted as $V(p_0; R, T)$, as shown in Figure 5. The analytical expression of the volume is
$$V(p_0; R, T) = \left\{ (x, y, t) \mid (x - x_0)^2 + (y - y_0)^2 \le R^2,\ |t - t_0| \le T \right\},$$
where $R$ denotes the maximum distance in the spatial domain away from $(x_0, y_0)$ and $T$ denotes the largest margin in the temporal domain. For the sake of analysis, the expression is transformed into cylindrical coordinates $(r, \theta, t)$ centered at $p_0$:
$$V(p_0; R, T) = \left\{ (r, \theta, t) \mid 0 \le r \le R,\ 0 \le \theta < 2\pi,\ |t - t_0| \le T \right\}.$$
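For illustration, the following is a minimal sketch of how such a cylindroid volume could be gathered from a grayscale video, under the reconstruction above; the function name `cylindroid_volume` and the NumPy-array representation of the video are our own assumptions, not part of the original method.

```python
import numpy as np

def cylindroid_volume(video, x0, y0, t0, R, T):
    """Collect the intensities inside the cylindroid volume V(p0; R, T).

    video : ndarray of shape (num_frames, height, width), grayscale.
    (x0, y0, t0) : center pixel (column, row, frame index).
    R : spatial radius, T : temporal half-height.
    Returns a 1-D array of the intensities of all pixels in the volume.
    """
    num_frames, height, width = video.shape
    t_lo, t_hi = max(t0 - T, 0), min(t0 + T, num_frames - 1)

    # Circular mask in the spatial plane: (x - x0)^2 + (y - y0)^2 <= R^2.
    ys, xs = np.ogrid[:height, :width]
    spatial_mask = (xs - x0) ** 2 + (ys - y0) ** 2 <= R ** 2

    # Stack the masked pixels over the temporal extent |t - t0| <= T.
    return np.concatenate(
        [video[t][spatial_mask].ravel() for t in range(t_lo, t_hi + 1)]
    )
```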

As shown in Figure 5, $p_i$ and $p_j$ are two arbitrary pixels in the spatiotemporal volume, and the corresponding intensities of the pixels are $I(p_i)$ and $I(p_j)$. We calculate the difference of the intensity between the two pixels and integrate the difference over all pixel pairs in the volume:
$$s(R, T) = \iint_{p_i, p_j \in V(p_0; R, T)} \left| I(p_i) - I(p_j) \right| \, dp_i \, dp_j.$$

In order to exploit the variation properties of the signal comprehensively, the parameters $R$ and $T$ are varied over a set of increasing values:
$$s_{mn} = s(R_m, T_n), \quad m, n = 1, \ldots, N,$$
where $R_1 < R_2 < \cdots < R_N$ and $T_1 < T_2 < \cdots < T_N$. The features for all $(R_m, T_n)$ construct a matrix
$$S = \begin{bmatrix} s_{11} & \cdots & s_{1N} \\ \vdots & \ddots & \vdots \\ s_{N1} & \cdots & s_{NN} \end{bmatrix}.$$

The determinant of the feature matrix, denoted as $\det(S)$, reflects the variations in the volume. According to the Weber law, the effective variation, or increment threshold, is $\Delta I = k \, I(p_0)$, where $k$ is the Weber fraction and $I(p_0)$ is the intensity of the center pixel $p_0$. The Weber variation feature for $p_0$, which reveals the appearance property of the behavior, is then calculated as
$$w(p_0) = \frac{\det(S)}{k \, I(p_0)},$$
where $k$ is the Weber fraction and $\det(S)$ is the determinant of the feature matrix $S$.
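As an illustration of the feature described above, the following is a minimal sketch of computing the ST-WVF for one pixel, reusing the hypothetical `cylindroid_volume` helper from the earlier sketch; the discrete pairwise-difference statistic standing in for $s(R, T)$, the parameter grid, and the default Weber fraction are our assumptions rather than values from the paper.

```python
import numpy as np

def pairwise_variation(intensities):
    """Mean absolute pairwise intensity difference within one volume
    (a discrete stand-in for the integral s(R, T))."""
    diffs = np.abs(intensities[:, None] - intensities[None, :])
    return diffs.mean()

def weber_variation_feature(video, x0, y0, t0, radii, extents, k=0.05):
    """ST-WVF for pixel p0 = (x0, y0, t0).

    radii, extents : increasing sequences of equal length N, giving the
        spatial radii R_m and temporal half-heights T_n of the volumes.
    k : Weber fraction.
    """
    N = len(radii)
    S = np.zeros((N, N))
    for m, R in enumerate(radii):
        for n, T in enumerate(extents):
            vol = cylindroid_volume(video, x0, y0, t0, R, T)
            S[m, n] = pairwise_variation(vol)

    # Normalize the accumulated variation by the Weber threshold k * I(p0).
    center_intensity = float(video[t0, y0, x0])
    return np.linalg.det(S) / (k * center_intensity + 1e-8)
```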

We illustrate a sample result of the spatiotemporal Weber variation field in Figure 6. The left and right columns correspond to the results for the PETS2009 and UMN datasets, respectively. The colors pass through blue, green, and red as the amplitude increases. As shown in Figures 6(b1), 6(c1), 6(b2), and 6(c2), the positions of the walking crowd always have much larger variations in the spatiotemporal domain and appear red. Conversely, the positions of the background, with smaller variations, always appear blue. The motion is indicated distinctly in the simulation results. Therefore, it can be inferred that the spatiotemporal Weber variation field reflects the crowd motion pattern. In particular, abnormal behavior, which is essentially an eccentric state of the crowd, appears as irregular variations in the sequence, in either the spatial or the temporal domain.

3.3. The Spatiotemporal Weber Force Feature

Crowd behavior is driven by the social force, including the desired force and the interaction force [29]. However, it is still a challenging task to estimate the interaction force. In [29], the interaction force is estimated following the potential between gas molecules, but this is obviously not applicable to crowds of medium or high density. Mehran et al. [24] calculated the interaction force by subtracting the personal desire force from the acceleration force. However, their work relies on the assumption that the desired velocity of a pedestrian is the average of the neighboring optical flow, which is not applicable to all cases, since the desired velocity is quite subjective.

It is observed that the motion of a crowd, especially one of medium or high density, presents striking analogies with fluids [30]. Moreover, it is easily understood that the interaction force between pedestrians is closely related to the motion pattern of the crowd (e.g., the friction on a pedestrian walking in the opposite direction through a crowd). It is therefore reasonable that the interaction force is highly related to the relative motion: the force is relatively small if the pedestrians' motions are identical, while it is large if their motion patterns are distinct. In this respect, the interaction force has properties analogous to an electromagnetic field. In this paper, the crowd motion pattern is modeled as electromagnetic field lines, and the interaction force, which is the driving factor of crowd behavior, is then estimated in analogy with the interaction force in an electromagnetic field.

The result of the spatiotemporal Weber variation feature is denoted as $W$:
$$W(x, y, t) = w(p), \quad p = (x, y, t),$$
where $W$ is a scalar field constructed from the Weber variation feature of each pixel. According to the theory of the electromagnetic field, the field intensity, which reflects the property of the force, has a close relation with the potential of the field:
$$\mathbf{E} = -\nabla \varphi,$$
where $\mathbf{E}$ denotes the intensity of the field and $\varphi$ denotes the potential of the field. The equation reveals that the denser the isopotential lines, the stronger the field intensity. In this paper, the intensity of the Weber variation feature field is calculated as
$$\mathbf{E}_W = -\nabla W,$$
where $\mathbf{E}_W$ denotes the intensity of the Weber variation feature field $W$. We concentrate on the amplitude of the intensity, and the force field is calculated as
$$F(x, y, t) = \left\| \nabla W(x, y, t) \right\|.$$
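A minimal sketch of the gradient-magnitude formulation reconstructed above is given below; it assumes $W$ is stored as a 3-D NumPy array indexed (frame, row, column), and the function name is our own.

```python
import numpy as np

def weber_force_field(W):
    """Amplitude of the force field F = ||grad W|| for a scalar field W.

    W : ndarray of shape (num_frames, height, width) holding the
        spatiotemporal Weber variation feature of every pixel.
    """
    # Finite-difference gradients along the temporal and spatial axes.
    dW_dt, dW_dy, dW_dx = np.gradient(W)
    return np.sqrt(dW_dt ** 2 + dW_dy ** 2 + dW_dx ** 2)
```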

The force field is the driving factor of the variation feature and reveals the stimulus of the crowd behavior. The simulation result of the force field is shown in Figure 7. The result is also depicted with the JET color map, ranging from blue through green to red. Unlike the previous figures, the value of the force is always positive in this paper, so blue indicates positions with smaller amplitude, and the color shifts toward red as the amplitude increases. As the results show, the background and positions with little variation of motion appear blue, indicating a small amplitude in the force field. In contrast, positions with significant movement variation appear red, with a larger force-field amplitude, which indicates a greater possibility of a change in motion pattern or evidence of irregular behavior.

4. Crowd Event Perception Based on the Spatiotemporal Weber Field

The spatiotemporal Weber variation and force features provide a picture of the local crowd activities, but they do not capture the relationship between their occurrences. In other words, the discrete values are not clear evidence of a crowd behavior or an abnormal event, and crowd behavior cannot be recognized robustly merely by the approach in the previous sections. Therefore, we adopt the latent Dirichlet allocation (LDA) model to establish the relation between the proposed features and crowd behavior, which has been proved effective in video activity perception [24, 31].

4.1. Spatiotemporal Weber Word Generation

LDA is a hierarchical Bayesian model, which has gained great success in language processing [32]. To use LDA, we partition a video sequence into blocks, and the properties of each block are treated as words for word-document analysis. We aim to infer the distribution of word co-occurrences and thus recognize the crowd behavior.

By combining the spatiotemporal variation feature in (9) and the corresponding force feature in (12), we obtain a spatiotemporal Weber feature $f$, denoted by
$$f = \left[ w, F \right]^{\mathrm{T}},$$
where $w$ and $F$ are the spatiotemporal variation and force features, respectively.

Furthermore, we utilize a multivariate normal distribution to model all the feature vectors in block $b$:
$$f \sim \mathcal{N}(\boldsymbol{\mu}_b, \boldsymbol{\Sigma}_b),$$
where $\boldsymbol{\mu}_b = [\mu_w, \mu_F]^{\mathrm{T}}$ is the 2-dimensional mean vector, with $\mu_w$ and $\mu_F$ being the means of the variation and force features, respectively, and $\boldsymbol{\Sigma}_b$ is a 2 × 2 diagonal covariance matrix $\operatorname{diag}(\sigma_w^2, \sigma_F^2)$, because we assume that each dimension of $f$ is independent. The mean and variance of each dimension are calculated from the $n_b$ feature vectors in the block as
$$\mu = \frac{1}{n_b} \sum_{i=1}^{n_b} f_i, \qquad \sigma^2 = \frac{1}{n_b} \sum_{i=1}^{n_b} (f_i - \mu)^2.$$
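For illustration, a minimal sketch of computing these per-block Gaussian descriptors is shown below; it assumes the variation field $W$ and force field $F$ are available as dense arrays, and the default block size follows the 8 × 8 × 5 setting reported in the experiments. The function name is hypothetical.

```python
import numpy as np

def block_descriptors(W, F, block=(5, 8, 8)):
    """Per-block Gaussian descriptors (means, variances) of the Weber
    variation field W and force field F.

    W, F  : ndarrays of shape (num_frames, height, width).
    block : (frames, rows, cols) block size.
    Returns a list of descriptors [mu_w, mu_F, var_w, var_F], one per block.
    """
    bt, by, bx = block
    nt, ny, nx = W.shape
    descriptors = []
    for t in range(0, nt - bt + 1, bt):
        for y in range(0, ny - by + 1, by):
            for x in range(0, nx - bx + 1, bx):
                w_blk = W[t:t + bt, y:y + by, x:x + bx]
                f_blk = F[t:t + bt, y:y + by, x:x + bx]
                descriptors.append(np.array([
                    w_blk.mean(), f_blk.mean(),
                    w_blk.var(), f_blk.var(),
                ]))
    return descriptors
```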

Afterwards, we use the BoF method [33] to construct a codebook including $K$ visual words. The block descriptors are partitioned into $K$ clusters $\{C_1, \ldots, C_K\}$ by minimizing a cost function, which is the sum of pairwise distances between the descriptors within the same cluster $C_k$:
$$J = \sum_{k=1}^{K} \sum_{d_i, d_j \in C_k} D(d_i, d_j).$$

In this paper, we employ the Bhattacharyya distance [34] to measure the distance between descriptors $d_i$ and $d_j$:
$$D(d_i, d_j) = \frac{1}{8} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) + \frac{1}{2} \ln \frac{\det \boldsymbol{\Sigma}}{\sqrt{\det \boldsymbol{\Sigma}_i \det \boldsymbol{\Sigma}_j}}, \qquad \boldsymbol{\Sigma} = \frac{\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j}{2},$$
where $(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ and $(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$ are the normal distribution parameters for $d_i$ and $d_j$, respectively. Note that the first term is related to the distance between the mean vectors, and the second term takes into account the variance distance.
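The following is a minimal sketch of the Gaussian Bhattacharyya distance above, assuming each block descriptor is summarized by a mean vector and a covariance matrix; the function name is our own.

```python
import numpy as np

def bhattacharyya_distance(mu_i, cov_i, mu_j, cov_j):
    """Bhattacharyya distance between two multivariate normal
    distributions N(mu_i, cov_i) and N(mu_j, cov_j)."""
    cov = (cov_i + cov_j) / 2.0
    diff = mu_i - mu_j
    # First term: Mahalanobis-like distance between the mean vectors.
    mean_term = diff @ np.linalg.solve(cov, diff) / 8.0
    # Second term: dissimilarity between the covariances.
    var_term = 0.5 * np.log(
        np.linalg.det(cov)
        / np.sqrt(np.linalg.det(cov_i) * np.linalg.det(cov_j))
    )
    return mean_term + var_term
```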

Furthermore, the visual word $v_k$ is the center of cluster $C_k$, and we thus construct a codebook $\mathcal{V} = \{v_1, \ldots, v_K\}$. The features are then quantized to visual words with the codebook:
$$q(d) = \arg\min_{k} \left\| \boldsymbol{\mu}_d - \boldsymbol{\mu}_{v_k} \right\|,$$
where $\boldsymbol{\mu}_{v_k}$ is the mean vector of $v_k$.
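As a sketch of the quantization step and the construction of per-clip word histograms (the LDA documents of the next subsection), the snippet below uses a nearest-mean assignment consistent with the formula above; the function names are hypothetical.

```python
import numpy as np

def quantize_descriptors(descriptors, codebook_means):
    """Assign each block descriptor to its nearest visual word.

    descriptors    : (n, dim) array of descriptor mean vectors.
    codebook_means : (K, dim) array, the mean vector of each visual word.
    Returns an array of word indices, one per descriptor.
    """
    descriptors = np.asarray(descriptors)
    # Squared Euclidean distance from each descriptor to every codeword mean.
    dists = ((descriptors[:, None, :] - codebook_means[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def clip_histogram(word_indices, K):
    """Word-count histogram of one video clip, used as an LDA document."""
    return np.bincount(word_indices, minlength=K)
```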

4.2. Crowd Behavior Recognition Based on LDA

We then learn the distribution of co-occurrences of visual words with the LDA model and infer the crowd behavior in the video. The basic idea of the LDA model is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [32]. In this case, the inference problem for a document is to compute the posterior distribution of the hidden variables given the document. In this paper, the whole video sequence is treated as a corpus; the sequence is uniformly divided into nonoverlapping short clips, and the video clips are treated as documents. Each document is modeled as a mixture of topics $z$, with a topic mixture parameter $\theta$. Moreover, the corpus has two Dirichlet prior parameters: $\alpha$ determines the per-document topic distributions, and $\beta$ corresponds to the topic-word distributions. The corpus-level parameters $\alpha$ and $\beta$ are estimated by maximizing the marginal log-likelihood of the data during the learning procedure. The probabilistic graphical representation is illustrated in Figure 8.

Based on these descriptions, we calculate the posterior distribution of the hidden variables for each video clip $\mathbf{w}$:
$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}.$$
A video clip is then classified into one of the crowd behaviors based on its particular topic mixture, estimated by the variational Bayes inference algorithm [32]. Crowd behavior recognition is implemented by calculating the pairwise similarity between the video clip and the training data, which is measured by the JS divergence [35].
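A minimal sketch of this recognition stage is shown below. It substitutes scikit-learn's variational LDA implementation for the paper's own inference code and matches a test clip to training clips by JS divergence between topic mixtures; all function names, the nearest-neighbor decision rule, and the use of scikit-learn are our assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def train_lda(train_histograms, n_topics):
    """Fit LDA on the word-count histograms of the training clips."""
    lda = LatentDirichletAllocation(n_components=n_topics, learning_method="batch")
    topic_mixtures = lda.fit_transform(np.asarray(train_histograms))
    # Normalize to proper topic distributions per clip.
    topic_mixtures /= topic_mixtures.sum(axis=1, keepdims=True)
    return lda, topic_mixtures

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two topic distributions."""
    p, q = p + eps, q + eps
    m = (p + q) / 2.0
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def classify_clip(lda, clip_hist, train_mixtures, train_labels):
    """Label a test clip with the behavior of the most similar training clip."""
    theta = lda.transform(clip_hist.reshape(1, -1))[0]
    theta /= theta.sum()
    dists = [js_divergence(theta, t) for t in train_mixtures]
    return train_labels[int(np.argmin(dists))]
```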

5. Experiments and Discussion

Data and Parameters. In this paper, the approach is tested on publicly available crowd video datasets. Seq. 1 comprises videos from PETS2009, which contain crowd walking, evacuation (rapid dispersion), local dispersion, and crowd gathering/splitting. We select more than 5000 frames from 30 sequences in the dataset and manually label every frame as ground truth. Seq. 2 comprises crowd videos from the University of Minnesota (UMN), which contain 11 different scenarios of escape events. The videos are shot in 3 different scenes, including both indoor and outdoor. We randomly select 40% of the frames for each scenario for parameter optimization during the learning procedure and use the other frames for testing.

In order to explore the variation properties of the videos, the parameters in (5) are set to fixed values. For the construction of visual words, the videos are partitioned into blocks of size 8 × 8 × 5, and visual words are extracted from the properties of the blocks with the BoF method to construct the codebook. In the LDA model, we use a fixed number of latent topics for learning and recognition. We adopt the proposed method to identify the crowd event as well as the start and end of the event. Furthermore, the results are compared with the ground truth and with detection results based on the social force model [24], HOG/HOF [36], HOG3D [37], and optical flow [8]. For comparison, we use the same videos and parameter settings for model learning. Visual words are extracted with the other corresponding features in the same spatiotemporal patches, and we also create a codebook from them.

Recognition Results. We demonstrate some sample recognition results in Figure 9. Color bars are used to represent the per-frame labels of the videos in the sequence, and different crowd behaviors are indicated with different colors. As a comparison, results based on the social force model, HOG/HOF, HOG3D, and optical flow are also shown in Figure 9. It is observed that our proposed method outperforms the conventional methods, because we exploit both the appearance and the driving factor of the behavior. The social force model gains results comparable to our proposed method for abnormal event detection, such as crowd splitting and evacuation. However, it produces some false detections for normal crowd behaviors, for example, crowd walking, because the social force is not obvious in such cases. HOG/HOF yields balanced results for both normal and abnormal event perception, but it faces difficulty if the motion is not obvious, such as in crowd gathering. The performance of HOG3D is not as desirable as the previous ones, because the spatial feature is overweighted in this descriptor, and the spatiotemporal volume construction for HOG3D analysis also leads to a lag in detection. Optical flow fails to exploit the spatial characteristics of the behavior, and its performance is also inferior to our proposed method. Overall, these results demonstrate that crowd behavior can be recognized effectively and accurately based on our proposed spatiotemporal Weber field, because it exploits both the appearance and the driving factor of the behavior.

Quantitative Evaluations. In order to evaluate our method quantitatively, we denote the behaviors labeled red in Figure 9 as positive events, that is, run, gather, split, evacuate, and so forth. The positive events are usually more important in practical crowd analysis. The behaviors labeled green in Figure 9 are denoted as negative events, that is, walk, wait, and so forth.

We measure the performance of each algorithm in terms of precision, recall, and the $F$-measure:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$
where $TP$ is the number of true positives, or correct detections of positive events, $FP$ is the number of false positives, or erroneous detections of positive events, and $FN$ is the number of false negatives, or missed detections of positive events.
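For completeness, the evaluation metrics above correspond to the following small helper; the zero-denominator guards are our addition.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F-measure from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```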

The results of these criteria for each crowd behavior are listed in Table 1. The results show that, compared with the social force model, our proposed model has much better performance in terms of precision, since the social force is not obvious for normal behaviors. HOG/HOF performs well in terms of precision, but it misses some detections of positive events. The main reason is that it fails to exploit the interaction between pedestrians, which is a crucial factor in cases such as opposing flows and crowd gathering. HOG3D and optical flow perform inferiorly to the other algorithms, because the spatial feature is overemphasized in HOG3D, whereas optical flow fails to utilize the spatial information. The overall evaluation for all the behaviors is demonstrated in Figure 10. It is observed that our proposed method performs well in terms of precision and recall and thus achieves a much better $F$-measure.

Moreover, by changing the decision parameter of LDA, we illustrate the receiver operating characteristic (ROC) curves (true positive rate versus false positive rate) in Figure 11. Figures 11(a) and 11(b) show the results of our proposed algorithm and the comparative algorithms for Seq. 1 and Seq. 2, respectively.

We use the area under the ROC curve (AUC) and the accuracy (ACC) as the metrics to evaluate the performance. A larger AUC indicates more robust recognition, and ACC, which indicates the effectiveness of the perception, is defined as
$$ACC = \frac{TP + TN}{TP + TN + FP + FN},$$
where $TN$ is the number of true negatives, that is, correctly recognized negative events.

The results are reported in Table 2, which demonstrates that our proposed method outperforms the other algorithms in terms of both AUC and ACC. The social force model has a comparable AUC, but its ACC is much lower. This is due to the difficulty of estimating the social force for normal behaviors, which leads to some erroneous detections. Our method gains much better performance by comprehensively integrating both the appearance and the driving factors of crowd behavior.

6. Conclusion

This paper proposes a novel spatiotemporal Weber field to recognize large-scale crowd events, which is an attractive topic in the area of intelligent video analysis. The motion of the crowd is modeled as a variation of the signal in the spatiotemporal domain, and we propose a spatiotemporal Weber variation feature, which adopts the Weber law, to measure this change. Afterwards, the paper proposes a potential function and analyzes the intensity of the force for the Weber variation feature to exploit the stimulus of crowd behavior. Finally, we utilize the latent Dirichlet allocation model, combined with the bag of features algorithm, to recognize the crowd behavior. Overall, the paper exploits the characteristics of the behavior from both the appearance and stimulus perspectives. The experiments show that the proposed method is effective and robust for large-scale crowd event perception. Additionally, the proposed method does not rest on the premise of perfect background extraction or individual tracking, which makes it more suitable for large-scale crowd behavior perception.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.