Abstract

Topology inference is the study of the spatial and temporal relationships among cameras within a video surveillance network. We propose a novel approach to understanding activities based on the visual coverage of a video surveillance network. In our approach, an optimal camera placement scheme is first presented using a binary integer programming algorithm in order to maximize the surveillance coverage. Each camera view is then decomposed into regions based on Histograms of Color Optical Flow (HCOF), according to the spatial-temporal distribution of activity patterns observed in a training set of video sequences. We conduct experiments on hours of video sequences captured at an office building with seven camera views, all of which are sparse scenes with complex activities. The results of the real-scene experiments show that histograms of color optical flow offer important contextual information for the spatial and temporal topology inference of a camera network.

1. Introduction

Video surveillance networks are widely deployed in the public security field to monitor wide areas and detect events. With the development of intelligent video analysis and data fusion, collaborative information processing among multiple smart cameras is becoming increasingly essential to identify, reconstruct, and track targets automatically. To facilitate more efficient multicamera surveillance, growing research effort has been devoted to automated activity understanding in camera networks, focusing on camera topology inference [1], camera placement, scene decomposition, and activity analysis [2]. Camera placement is a typical optimization problem, where constraints are given by the characteristics of the camera (field of view and focal length), the required quality (resolution), and the environment (obstacles and occlusions). Semantic scene decomposition is the basis for camera topology inference. The aim of topology inference is to infer the spatial and temporal relationships among cameras. Figure 1 shows an example from our camera topology inference experiment, describing the relationships among 7 cameras within a building in Shanghai. As for scene decomposition and activity analysis, one wishes to understand activities captured by multiple cameras holistically by building activity models. These problems are nontrivial, especially with multiple disjoint cameras having nonoverlapping views, in which activities can only be observed partially, with different views separated by unknown time gaps. The unknown and large separation of cameras in space and time increases the uncertainty in activity understanding, owing to drastic feature variations and temporal discontinuity in the visual observations.

Visual coverage is the most important constraint for traditional camera placement. The Art Gallery Problem (AGP) [3] is a classical theory on how to place guards in an arbitrarily shaped polygon. However, AGP cannot be applied to camera placement directly, because AGP assumes that each sensor has unlimited visibility, whereas every visual sensor or camera has a constrained sensing range referred to as its field of view (FoV). Over the last two decades, camera placement has attracted considerable research interest. Hörster and Lienhart [4] developed a binary integer programming model by discretizing the region and determined the minimum number of cameras needed to cover the space completely at a given sampling frequency. Zhao and Cheung [5] proposed a general framework for camera placement, where the goal is to identify distinctive visual features of objects in two or more camera views. Gonzalez-Barbosa et al. [6] described the surveillance model of directional and omnidirectional cameras in 2D grid graphs and simulated the minimal number of cameras when the two kinds of cameras are used simultaneously. Yabuta and Kitazawa [7] introduced the concept of the “essential region”: essential areas must be fully covered, while ordinary ones are weighted for coverage, so the number of cameras can be further decreased. Gupta et al. [8] proposed a framework for optimal camera placement that gives simultaneous consideration to different qualitative aspects using a multiobjective genetic algorithm.

In [9], Loy et al. proposed a semantic scene segmentation method based on static and moving foreground features, which performs well for crowded pedestrian scenes. Inside a building, however, pedestrians are sparse, and Loy's method does not perform well in this situation. A seemingly natural solution to semantic scene segmentation and activity understanding is to track objects within and across camera views. Indeed, most previous methods rely on tracking to detect entry and exit events [10, 11]. These methods generally assume reliable object detection. However, such assumptions are invalid in real-world surveillance settings.

2. System Overview

As shown in Figure 2, we propose a novel approach to understanding activities from their observations monitored through multiple nonoverlapping cameras.

Firstly, a novel 3D surveillance model is proposed to simulate the FoV of a camera. An optimal camera placement scheme is then presented by using a binary integer programming algorithm, in order to maximize the visual coverage of the video surveillance network.

Secondly, in our approach each camera view is decomposed automatically into regions based on the correlation of object dynamics across different spatial locations in all camera views.

A new method of histograms of color optical flow (HCOF) is then formulated to decompose each camera view. Then the correlations of regional activities observed within and across multiple camera views are discovered and quantified in a single common reference space. We automatically decompose a complex scene into regions, according to the spatial-temporal distribution of activity patterns observed in a training set of video sequences.

In particular, the image space is first divided into equal-sized blocks. A histogram of color optical flow is then computed over the whole image space using the Horn-Schunck (HS) model.

Correlation distances are computed among local block activity patterns to construct an affinity matrix, which is then used as an input to a spectral clustering algorithm for semantic scene decomposition. Given the scene decomposition, the regional activity patterns of a camera view are formed from the local block activity patterns.

We demonstrate the effectiveness of the proposed approach using several hours of video captured at 0.5 fps from a building with seven camera views, all of which feature sparse scenes and complex activities. We show that histograms of color optical flow offer important contextual information for the spatial and temporal topology inference of a camera network.

The rest of this paper is organized as follows. In Section 3, we describe the blind-area issue caused by the 2D model and introduce our 3D surveillance model. In Section 4, the coverage requirement and the camera placement scheme are discussed. In Section 5, we describe the histograms of the orientation of color optical flow. In Section 6, we describe the local block activity pattern representation. In Section 7, activity-based scene decomposition is presented. In Section 8, we present the scene decomposition results of the proposed method. Finally, we summarize the paper.

3. Blind-Area Issue and 3D Surveillance Model

In an actual surveillance scene, both the sensing area of a camera and the target to be monitored are three-dimensional. However, the 3D scene is usually simplified to 2D for the convenience of computation. Since a directional camera has perspective restrictions, blind areas always exist beneath this kind of camera. Figure 3(a) illustrates the blind-area issue, where the target is outside the sensing area, yet the 2D surveillance model would consider the target to be covered by the upper camera. When the 2D model is applied, the blind-area issue cannot be predicted, which reduces the coverage performance of the video surveillance network.

This paper proposes to extend the classical 2D surveillance model to 3D, as in Figure 3(b), in order to avoid the blind-area issue. The sensing area of the camera is approximately the shaded cone, which is characterized by the camera's field-of-view angle and sensing range. Our camera placement task is to find the camera configuration parameters in the 3D surveillance model, including the minimum number of cameras that must be deployed, the coordinate values denoting each camera's location, and the rotation and tilt angles denoting its posture.

A coordinate system (which we call the scene coordinate system) is established from the real surveillance scene. We then establish a new coordinate system (which we call the camera coordinate system) whose origin is the location of the camera and whose principal axis passes through the midline of the sensing cone. On the basis of homogeneous coordinate transformation, the coordinates of a point in the scene coordinate system are converted into the camera coordinate system by a translation given by the camera position in scene coordinates followed by rotation matrices corresponding to the camera's rotation and tilt angles.
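As an illustration, the following minimal sketch (our own simplification, not the paper's exact matrices or symbols) transforms a point from scene coordinates to camera coordinates with a translation to the camera position followed by rotations for the pan (rotation) and tilt angles, taking the optical axis along the local x-axis:

```python
# Minimal sketch (not the paper's exact matrices): express a scene point in
# camera coordinates via a translation to the camera position followed by
# rotations for the pan (rotation) and tilt angles.
import numpy as np

def scene_to_camera(point, cam_pos, pan, tilt):
    """point, cam_pos: (x, y, z) in scene coordinates; pan/tilt in radians."""
    p = np.asarray(point, dtype=float) - np.asarray(cam_pos, dtype=float)
    # Undo the pan: rotate about the vertical (z) axis by -pan.
    c, s = np.cos(-pan), np.sin(-pan)
    Rz = np.array([[c, -s, 0.0],
                   [s,  c, 0.0],
                   [0.0, 0.0, 1.0]])
    # Undo the tilt: rotate about the y axis by -tilt.
    c, s = np.cos(-tilt), np.sin(-tilt)
    Ry = np.array([[c, 0.0, s],
                   [0.0, 1.0, 0.0],
                   [-s, 0.0, c]])
    return Ry @ Rz @ p  # the point expressed in camera coordinates
```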

According to geometrical visibility analysis, a point expressed in camera coordinates can be monitored as long as it lies inside the sensing cone, that is, it is within the camera's sensing range and the angle between the point's direction and the cone's midline does not exceed half of the field-of-view angle.

If obstacles exist in the surveillance area, we need to guarantee that the obstacles do not block the sensing area of a camera, that is, no obstacle crosses the line between the camera and a test point. We consider that the obstacles in the surveillance area are mainly bottom-to-top structures such as uprights and walls, so we analyze the problem in the 2D ground plane. From Figure 4(a), we know that when a point \(P\) lies inside a rectangular obstacle \(ABCD\), the triangle areas satisfy

\( S_{\triangle PAB} + S_{\triangle PBC} + S_{\triangle PCD} + S_{\triangle PDA} = S_{ABCD}, \)

where \(S\) represents the area of a triangle or rectangle.

Similarly, when the point lies outside the obstacle (see Figure 4(b)), the triangle areas satisfy

\( S_{\triangle PAB} + S_{\triangle PBC} + S_{\triangle PCD} + S_{\triangle PDA} > S_{ABCD}. \)

So, besides the sensing-cone visibility condition, we have to ensure that all points on the line segment between the camera and the test point satisfy the outside-obstacle condition above.
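A minimal sketch of this visibility test is given below. It reuses scene_to_camera() from above, takes the optical axis as the local x-axis, and assumes a maximum sensing range r_max and a field-of-view angle fov (both illustrative parameters); the line-of-sight check is approximated by sampling points along the camera-to-target segment in the ground plane:

```python
# Sketch of the geometric visibility check. r_max (sensing range) and fov
# (full field-of-view angle, radians) are assumed parameters.
import numpy as np

def in_sensing_cone(p_cam, r_max, fov):
    x = p_cam[0]
    dist = np.linalg.norm(p_cam)
    if x <= 0 or dist > r_max:
        return False
    # Angle between the point direction and the optical (x) axis.
    return np.arccos(x / dist) <= fov / 2.0

def tri_area(a, b, c):
    # Area of the triangle abc in the 2D ground plane.
    return 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))

def inside_rectangle(p, rect):
    # Point-in-rectangle test of Figure 4: the four triangles formed by p and
    # consecutive corners tile the rectangle exactly iff p lies inside it.
    a, b, c, d = rect
    tri_sum = (tri_area(p, a, b) + tri_area(p, b, c) +
               tri_area(p, c, d) + tri_area(p, d, a))
    rect_area = tri_area(a, b, c) + tri_area(a, c, d)
    return np.isclose(tri_sum, rect_area)

def occluded(cam_xy, point_xy, rect, samples=50):
    # The target is occluded if any sample on the camera-to-target segment
    # falls inside the obstacle's footprint.
    cam_xy, point_xy = np.asarray(cam_xy, float), np.asarray(point_xy, float)
    for t in np.linspace(0.0, 1.0, samples):
        if inside_rectangle((1 - t) * cam_xy + t * point_xy, rect):
            return True
    return False
```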

4. Visual Coverage Constraint

First, we discretize the surveillance scene for convenient processing. The process involves region division, defining essential regions, and defining test points within each region, as illustrated in Figure 5.

We take a real exhibition center as the surveillance scene. The whole area is divided into small regions according to its functional blocks (booths, gates, service facilities, and so on). In order to reduce the number of cameras and the cost, several regions are categorized as essential regions according to particular requirements, for instance, the entrance of the hall. Removing the restriction that each region contains only one test point, we propose an adaptive method of test point selection: the algorithm searches each region and places a test point at the center of every plane of the region, so the number and positions of test points adapt to the size and shape of the regions. This method enhances the surveillance quality. The result of the surveillance scene discretization is shown in Figure 6. The surveillance regions in the real scene are a set of cubes, and we use floor plans just for convenience. In total, we obtain 64 regions and 163 test points, of which 10 regions and 29 points are essential.

We assume that the camera can only be deployed at the center of the discrete regions. We need to find out the optimal locations and angles of the cameras with constraints of surveillance coverage and deployment cost.

We define a 0-1 decision variable for each combination of region, rotation angle, and tilt angle, which equals 1 if a camera with that rotation and tilt angle is placed in that region and 0 otherwise. For the coverage of each test point, we define a second binary variable, which equals 1 if a camera placed in a given region with a given rotation and tilt angle can cover that point and 0 otherwise.

If there is only one test point in a region, the region is considered monitored by the surveillance network only if that test point is covered by at least the required number of cameras. If a region contains more than one test point, we can similarly define binary variables indicating whether each test point in the region is monitored and aggregate them to decide whether the region as a whole is covered.

We can set the coverage constraint flexibly according to the surveillance importance of a region when considering whole-region coverage. For example, we require all test points of an essential region to be observed by the cameras, while observing only part of the test points is sufficient for an ordinary region.

We define an indicator to specify whether a region is essential. All essential regions must be completely covered by the surveillance network regardless of cost, whereas for ordinary regions the coverage requirement is relaxed, as defined in (7).

To obtain a tradeoff between coverage level and deployment cost, we apply the balance scheme of [7]: the video surveillance network should cover all essential regions and as many ordinary regions as possible. We define an expected gain; a camera is deployed only if it brings at least this expected number of additional regions under coverage, and otherwise it is not deployed. In practice, we can set a proper expected gain according to the specific circumstances to achieve the balance between coverage and cost.

Based on the above analysis, the camera placement problem can be abstracted as a constrained binary integer program: minimize the number of deployed cameras subject to the coverage constraints described above.
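For illustration, the sketch below expresses the essential-region part of this binary integer program using the PuLP solver. The solver choice, the variable names, and the input covers[(r, a, t, p)] (a precomputed 0/1 table indicating whether a camera in region r with rotation a and tilt t covers test point p) are our assumptions; the ordinary-region and expected-gain terms of the full model would be added as further constraints in the same style:

```python
# Sketch of the camera-placement binary integer program. covers[(r, a, t, p)]
# is an assumed precomputed 0/1 visibility table; PuLP is our choice of solver.
import pulp

def place_cameras(regions, angles, tilts, points_in, covers, essential, k=1):
    prob = pulp.LpProblem("camera_placement", pulp.LpMinimize)
    keys = [(r, a, t) for r in regions for a in angles for t in tilts]
    # x[(r, a, t)] = 1 iff a camera with rotation a and tilt t is placed in region r.
    x = pulp.LpVariable.dicts("x", keys, cat="Binary")
    # Objective: minimise the number of deployed cameras.
    prob += pulp.lpSum(x.values())
    # Each test point of an essential region must be seen by at least k cameras.
    for r in essential:
        for p in points_in[r]:
            prob += pulp.lpSum(covers[(rr, a, t, p)] * x[(rr, a, t)]
                               for (rr, a, t) in keys) >= k
    prob.solve()
    return [key for key, var in x.items() if var.value() == 1]
```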

5. Feature Selection: Histograms of the Orientation of Color Optical Flow (HCOF)

In [12], Wang and Snoussi proposed histograms of the orientation of optical flow (HOF) for abnormal event detection. In this paper, we extend HOF to HCOF for scene decomposition; HCOF offers more useful features for semantic scene decomposition than HOF.

Optical flow is the distribution of apparent velocities of movement of brightness patterns in an image. It can give important information about the spatial arrangement of the objects and the rate of change of this arrangement [13]. Since abnormal actions manifest themselves in the direction and amplitude of movement, optical flow is chosen for scene description. Horn and Schunck [13] proposed an algorithm that introduces a global smoothness constraint to compute optical flow, and this basic Horn-Schunck (HS) method is used in our work. The HS method combines a data term that assumes constancy of some image property with a spatial term that models how the flow is expected to vary across the image [14]. For two-dimensional image sequences, the optical flow is formulated as a global energy functional.
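A minimal NumPy sketch of the classical Horn-Schunck iteration is shown below; the smoothness weight alpha, the iteration count, and the derivative kernels are generic textbook choices rather than the paper's settings:

```python
# Minimal Horn-Schunck optical flow sketch for two grayscale frames (float
# arrays). alpha (smoothness weight) and n_iter are assumed hyperparameters.
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(im1, im2, alpha=1.0, n_iter=100):
    kx = 0.25 * np.array([[-1.0, 1.0], [-1.0, 1.0]])
    ky = 0.25 * np.array([[-1.0, -1.0], [1.0, 1.0]])
    kt = 0.25 * np.ones((2, 2))
    fx = convolve(im1, kx) + convolve(im2, kx)   # spatial derivative in x
    fy = convolve(im1, ky) + convolve(im2, ky)   # spatial derivative in y
    ft = convolve(im2, kt) - convolve(im1, kt)   # temporal derivative
    avg = np.array([[1.0, 2.0, 1.0],
                    [2.0, 0.0, 2.0],
                    [1.0, 2.0, 1.0]]) / 12.0     # neighbourhood averaging kernel
    u = np.zeros_like(im1)
    v = np.zeros_like(im1)
    for _ in range(n_iter):
        u_bar, v_bar = convolve(u, avg), convolve(v, avg)
        num = fx * u_bar + fy * v_bar + ft
        den = alpha ** 2 + fx ** 2 + fy ** 2
        u = u_bar - fx * num / den
        v = v_bar - fy * num / den
    return u, v   # horizontal and vertical flow fields
```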

We propose, in this paper, to compute histograms of the orientation of color optical flow (HCOF). HCOFs are similar to histograms of oriented gradients (HOGs) [15], but they are computed quite differently: HOGs are computed on dense grids of image gradients at a single scale without dominant orientation alignment, whereas HCOFs are computed on dense grids of the optical flow. HCOFs also differ from the descriptor proposed in [16], where the differential optical flow is considered. Here, HCOF descriptors are computed over dense and overlapping grids of spatial blocks, with optical flow orientation features extracted at a fixed resolution and gathered into a high-dimensional feature vector [17]. HCOFs are more useful than HOGs for motion detection.

The image is divided into small spatial regions (“cells”), and for each cell a local 1D histogram of optical flow orientations is accumulated over the pixels of the cell. Several cells are combined into a block.

Figure 7 shows an HOF descriptor with 2 × 2 cells. The HOF feature vectors are computed on individual frames, with the horizontal and vertical optical flow components voting into orientation bins. In our experiments, a block contains several cells, and each cell contains a fixed number of pixels; HOFs are computed with an overlap between neighboring blocks.

The combined histogram entries form the representation. For better invariance to illumination, shadowing, and so forth, it is also useful to contrast-normalize the local responses before using them. This can be done by accumulating a measure of local histogram “energy” over spatial regions (“blocks”) and using the results to normalize all of the cells in the block. We will refer to the normalized descriptor blocks as histogram of oriented optical flow (HOF) descriptors.

Different block normalization schemes can be chosen; the normalization we use is shown in (5). Let \(v\) be the unnormalized descriptor feature vector, let \(\|v\|_2\) be its 2-norm, and let \(\varepsilon\) be a small constant; each block then contributes a descriptor of fixed dimension. Consider

\( v \leftarrow \frac{v}{\sqrt{\|v\|_2^2 + \varepsilon^2}}. \)  (5)
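A short sketch of one cell's orientation histogram and the block normalization of (5) follows; the number of orientation bins and the magnitude weighting of the votes are our assumptions:

```python
# Sketch of a per-cell orientation histogram of optical flow and the block
# normalisation of (5). n_bins is an assumed parameter.
import numpy as np

def cell_hof(u, v, n_bins=9):
    # Magnitude-weighted votes of the flow orientation at every pixel of the cell.
    mag = np.sqrt(u ** 2 + v ** 2)
    ang = np.arctan2(v, u)                        # orientation in (-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    return hist

def normalize_block(cell_hists, eps=1e-6):
    # Concatenate the cell histograms of a block and apply eq. (5).
    vec = np.concatenate(cell_hists).astype(float)
    return vec / np.sqrt(np.sum(vec ** 2) + eps ** 2)
```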

The color model decomposes a color into a brightness component and a two-dimensional color coordinate system. The difference between the two color-plane representations lies in how the color plane is described: one pair of components describes a vector in polar form, representing the angular and magnitudinal components, respectively, whereas the other pair forms an orthogonal Euclidean space.

The conversion between the two color-plane representations is the standard polar-Cartesian one: the angular component is the arctangent of the ratio of the two Euclidean components, and the magnitudinal component is their Euclidean norm.

The resulting HCOF descriptor of each block has a fixed dimension.
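As an illustration of how the color extension can be computed, the sketch below runs the Horn-Schunck flow of the earlier sketch on each color channel and concatenates the per-channel orientation histograms; this is our simplified reading rather than the paper's exact color decomposition, and for brevity the whole input patch is treated as a single cell. With, for example, 2 × 2 cells per block, three color channels, and 9 orientation bins, such a descriptor would have 108 dimensions, matching the block representation used in Section 6:

```python
# One possible HCOF sketch: per-channel optical flow histograms, concatenated.
# Reuses horn_schunck() and cell_hof() from the earlier sketches.
import numpy as np

def cell_hcof(patch1, patch2, n_bins=9):
    """patch1, patch2: H x W x 3 colour patches from two consecutive frames."""
    hists = []
    for ch in range(patch1.shape[2]):
        u, v = horn_schunck(patch1[..., ch].astype(float),
                            patch2[..., ch].astype(float))
        hists.append(cell_hof(u, v, n_bins))
    return np.concatenate(hists)   # colour-aware orientation descriptor
```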

6. Local Block Activity Pattern Representation

First, we divide the image space of a camera view into equal-sized blocks (Figure 7). The activity pattern of each block is then represented as a time series of 108-dimensional feature vectors,

\( \mathbf{y}_b = [\mathbf{h}_{b,1}, \mathbf{h}_{b,2}, \ldots, \mathbf{h}_{b,T}], \)

where \(h_{b,t,i}\) represents the \(i\)-th feature of the \(t\)-th frame, \(\mathbf{y}_b\) represents the activity pattern of the \(b\)-th block in the image space, and \(T\) is the total number of frames. Note that \(T\) needs to be sufficiently large to cover enough repetitions of activity patterns, depending on the complexity of a scene.
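The sketch below shows one way to build these per-block time series from consecutive frames, reusing the cell_hcof sketch above; the block size is an assumed parameter and each block is treated as a single cell for brevity:

```python
# Sketch of the local block activity representation: slice each pair of
# consecutive frames into equal-sized blocks and stack each block's HCOF
# descriptor over time. The block size is an assumed parameter.
import numpy as np

def block_activity_series(frames, block=32, n_bins=9):
    """frames: list of H x W x 3 arrays; returns {block index: T x D array}."""
    h, w = frames[0].shape[:2]
    series = {}
    for t in range(len(frames) - 1):
        for bi, y in enumerate(range(0, h - block + 1, block)):
            for bj, x in enumerate(range(0, w - block + 1, block)):
                p1 = frames[t][y:y + block, x:x + block]
                p2 = frames[t + 1][y:y + block, x:x + block]
                series.setdefault((bi, bj), []).append(cell_hcof(p1, p2, n_bins))
    return {key: np.vstack(rows) for key, rows in series.items()}
```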

7. Activity-Based Scene Decomposition

After feature extraction, we group blocks into regions according to the similarity of local spatiotemporal activity patterns represented as histograms of the orientation of color optical flow (HCOF). Specifically, two blocks are considered similar and grouped together if they are close to each other spatially and exhibit highly correlated HCOF activities over time. The grouping process begins with computing correlation distances between the local activity patterns of each pair of blocks.

A correlation distance is defined as a dissimilarity metric derived from Pearson's correlation coefficient [18]. For two block activity series, the correlation coefficient is given as

\( r_{ab} = \frac{\operatorname{cov}(\mathbf{y}_a, \mathbf{y}_b)}{\sigma_a \, \sigma_b}, \)

and the correlation distance is given as

\( d_{ab} = 1 - r_{ab}. \)

Upon obtaining the normalised affinity matrix, we employ the spectral clustering method proposed by Zelnik-Manor and Perona [19] to decompose each camera view into regions, with the optimal number of regions determined automatically.
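The following sketch puts the decomposition step together: Pearson correlation distances between the block activity series are turned into an affinity matrix, which is then clustered. We use scikit-learn's SpectralClustering as a stand-in for the self-tuning spectral clustering of Zelnik-Manor and Perona, so the number of regions is fixed here rather than selected automatically, and the Gaussian scale sigma is an assumed parameter:

```python
# Sketch of activity-based scene decomposition: correlation-distance affinity
# plus spectral clustering. sklearn's SpectralClustering stands in for the
# self-tuning variant, so n_regions is fixed rather than chosen automatically.
import numpy as np
from sklearn.cluster import SpectralClustering

def decompose_scene(series, n_regions=5, sigma=0.5):
    keys = sorted(series.keys())
    # Flatten each block's T x D activity matrix into one long signal.
    signals = np.vstack([series[k].ravel() for k in keys])
    r = np.corrcoef(signals)                        # Pearson correlation matrix
    dist = 1.0 - r                                  # correlation distance
    affinity = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    labels = SpectralClustering(n_clusters=n_regions,
                                affinity="precomputed").fit_predict(affinity)
    return dict(zip(keys, labels))                  # block index -> region label
```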

8. Experimental Results

8.1. Dataset

The dataset employed in our experiments contains synchronized views captured at a frame rate of 0.5 fps from uncalibrated and disjoint cameras installed in a building of the Shanghai Advanced Research Institute, Chinese Academy of Sciences. Each image frame has a size of 320 × 230 pixels. A snapshot of each of the 7 camera views and the camera topology of this building are depicted in Figure 1.

8.2. Visual Coverage

We conduct simulation experiments on the surveillance scene of Figure 6. The directional camera is configured with a fixed focal length and a field of view of 30 degrees.

Our proposed 3D surveillance model adds the heights of cameras and test points as optimization parameters and applies the adaptive test point selection scheme. Figure 8 shows the minimum number of cameras versus the camera height. The minimum number shows a general tendency to increase as the height rises; that is, the probability that test points fall within a camera's blind area increases as the camera height rises, so more cameras are needed to cover these points. When the cameras are placed at a particular height, the minimum number of cameras required to satisfy the coverage requirement is reached; in that case, the camera configuration matches the scene well, which means it is the optimal height, and we can place the cameras at this height when deploying a video surveillance network for this scene. It should be noted that some singular points exist in Figure 8, because the locations and rotation angles of the cameras are not continuous.

8.3. The Experiment of Optical Flow

The experiment of optical flow is shown in Figure 9.

8.4. The Experiment of Scene Decomposition

The experiment of activity-based scene decomposition is shown in Figure 10.

We used 3000 frames (1 hour in length) from each camera view for activity-based scene decomposition. In particular, the seven camera views from the dataset were automatically decomposed into several regions (Figure 10). As can be seen from Figure 10, the camera views were decomposed automatically into semantically meaningful regions.

It is difficult to provide a quantitative result on semantic scene decomposition because the correct region segmentation is subjective, especially when the segmentation is based not on visual appearance but on activity patterns observed over time. We therefore performed only a qualitative comparison between the scene decomposition method introduced by Loy et al. [9] and our method (Figure 10). The two methods differ mainly in their feature representations, that is, the time-series representation in Loy's method and HCOF in our method. We found that our method yielded more meaningful region boundaries: activity-based scene decomposition using HCOF performs better than decomposition based on static and moving foreground features. Some results are shown in Figure 10.

9. Discussion and Future Work

In this work we have presented a novel approach to multicamera activity understanding by modeling the correlations within a video surveillance network. In particular, a camera placement algorithm is presented for collaborative information processing between correlated cameras. We then introduced HCOF to detect and quantify correlation and temporal relationships between partial observations across local regions. Experimental results have shown that the activity correlations are useful for activity interpretation and video temporal segmentation. In future work, the approach can be applied to inferring the topology of a camera network and to more challenging surveillance videos.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the Opening Project of Shanghai Key Laboratory of Digital Media Processing and Transmission (Grant no. 2011KF02).