Abstract

Intelligent video surveillance networks have many practical applications such as human tracking, vehicle tracking, and event detection. In this paper, an active multicamera network framework is designed for human detection and tracking by optimizing the collaborative control of the cameras. A multicamera collaborative control algorithm based on a Bayes network is proposed to minimize the number of PTZ cameras under control and optimize the cameras' fields of view. A hybrid human local transform feature, selected by the AdaBoost algorithm, is adopted to improve tracking precision. Experimental results in a real-world environment indicate the effectiveness and efficiency of the proposed framework and algorithm.

1. Introduction

Monitoring and tracking mobile objects in public regions such as subway stations, squares, railway stations, and commercial centers is one of the most important functions of surveillance [1-3]. Although a large number of cameras have been installed, it is hard to obtain high-quality, widely covering surveillance videos in real-world environments because stationary cameras have fixed fields of view (FOV) and obstacles are likely to occlude the objects. With the rapid growth of sensor networks, camera networks including both stationary cameras and active pan/tilt/zoom (PTZ) cameras can be constructed to solve this problem. PTZ cameras can change their viewing angles toward objects of interest according to control instructions.

However, as the size of the camera network increases, it becomes infeasible for administrators to observe many cameras and discover events of interest. Manually controlling each PTZ camera, for example, calibrating it and zooming in on a moving object, is equally impractical. Consequently, it is of practical value to design an intelligent camera network that can turn to objects of interest automatically or semiautomatically according to control instructions in order to observe given targets.

Such surveillance systems face two difficulties: how can cameras automatically direct themselves to the center of given objects, and what is an effective scheme for multicamera collaborative tracking?

As far as the multicamera network is concerned, the first problem is to calibrate the cameras toward the center of an object of interest. To achieve robust detection and tracking, cameras should be calibrated to a good viewpoint that ensures high quality of the captured videos. Definitions of projective rotations were proposed in [4, 5], and methods to determine the intrinsic and extrinsic camera parameters were proposed in [6, 7]. In [8], an image-to-ground-plane transformation for each camera was proposed based on a ground-plane coordinate system that is nearly independent of changes in object location. Other researchers have proposed camera network calibration strategies for both stationary and actively controlled camera nodes observed from multiple viewpoints [9, 10]. The above works place more emphasis on independent camera calibration. With the prevalence of camera networks, some studies began to take the relations among cameras into consideration when designing calibration strategies. For example, Collins et al. [11] and Zhou et al. [1] employed a stationary wide-FOV camera to control a PTZ camera. More recent works take content-aware transmission strategies and transmission delay into account in their control strategies [2, 12].

In multicamera networks, strategies for scheduling multiple cameras to recognize and track one or more moving objects have attracted much attention. One scenario is tracking moving objects in an environment covered by multiple cameras with overlapping FOVs. In this scenario, the tracking problem is defined as "consistent labeling of objects," which aims at establishing correspondence between tracks of the same object seen in different cameras in order to recover complete information about the object [3]. Another scenario considers the linkage among multiple cameras, which is usually exploited to schedule, coordinate, and control a network of active cameras so that they focus on and observe given targets at high resolution. In this case, target information such as location and moving direction is used to predict which cameras are likely to observe the targets. This paper focuses on the latter kind of problem.

Regarding multicamera collaboration, most works focus on control strategies in a fully observable surveillance environment, where the locations and directions of targets can be directly observed and their velocities estimated by using stationary cameras or by configuring one or more active PTZ cameras to zoom in on them. In [13], an adaptive algorithm was proposed to provide automated control of a PTZ camera using only the captured visual information. In [14], an image-based PTZ camera control method was proposed based on projective rotations for automated multicamera surveillance systems. The work in [15] focused on controlling a set of PTZ cameras to acquire close-up views of subjects based on the output of a set of fixed cameras. Reference [16] described using the color models of objects to integrate information from multiple cameras. Reference [17] evaluated various strategies for scheduling a single active camera to acquire biometric imagery.

However, dynamic factors carrying some uncertainty often affect surveillance quality. For example, when multiple pedestrians are present in the scene, it is difficult for a camera to decide which one should be focused on. A more general scenario is partially observable [18]; that is, the exact locations of the objects may not always be observed completely by the cameras. Simply adding stationary or PTZ cameras does not help, because their measured locations become less accurate, regardless of the calibration method, as the objects move further away from the cameras. More intelligent coordination methods should therefore be explored.

Recently, some researchers have begun to integrate estimation algorithms into multicamera coordination to deal with the uncertainty of multicamera surveillance systems. Zhang et al. [19] proposed a quantized control design for fuzzy networked systems. Zhou et al. [1] tracked pedestrians using an active camera; when multiple people are present in the scene, the person closest to the last tracked person is chosen. From another viewpoint, Hampapur et al. [20] dealt with these issues by deciding how cameras should be assigned to the various people present in the scene. Ostland et al. [21] used a Markov Chain Monte Carlo approach for object detection and tracking in a multicamera system. In [2], both video transmission and its impact on the accuracy of camera control were considered, and a content-aware transmission strategy was designed. In [22], missing measurements were considered in the control system. In [23], a principled approach based on Partially Observable Markov Decision Processes was proposed to coordinate and control a network of active cameras for tracking and observing multiple mobile objects. These works help to eliminate the dependency on camera FOVs for tracking object locations by exploiting probabilistic frameworks to model the multicamera network. Our work is partially inspired by these works.

Nevertheless, most of the above works consider a virtual or simulated environment for algorithm design and experiments, which differs from real-world environments to some extent. For example, the distance between cameras in a large region (e.g., a district) should be computed according to the route between the cameras, which may not be a straight line. The movement of objects between different cameras is dynamic, and occlusion is common in such environments. All of these factors make real-world surveillance a more complicated problem. In this work, we propose an algorithm for real-world video surveillance with PTZ cameras, in which the multicamera collaboration scheme is constructed on a Bayes network that incorporates GIS information and camera actions into its models. It can work in a wide-region surveillance system with high efficiency.

The remainder of the paper is organized as follows. Section 2 presents the system model. Section 3 proposes a multicamera coordination algorithm in a probabilistic framework. Section 4 introduces and discusses the experimental results. Section 5 concludes and proposes future research topics.

2. Proposed System Model

The proposed multicamera surveillance network model is shown in Figure 1, where each camera node works as an agent that codes the captured videos and transmits the coded packets to the Surveillance Control Center (SCC) over a wireless network. Videos from all camera nodes are decoded, and targets of interest are detected in them. Once a target of interest is found in the videos from one camera, the SCC sends control decisions to the cameras that are likely to capture the target. We assign the computation for detection and control to the SCC because it has the more powerful computational capability necessary for video search tasks. Based on the video analysis at the SCC, the control model sends instructions to the related cameras, and these cameras turn to the targets and prepare to transmit videos. In this process, we mainly address the following factors to advance the intelligence and efficiency of the surveillance system.

(a) Content-Aware Transmission. Although videos are coded before being transmitted to the SCC, they still occupy considerable node resources. Since a captured video frame can be regarded as the combination of a target and its background, the target can be coded and transmitted with a higher priority than the background. We adopt a content-aware approach that assigns the limited network resources to the target of interest first and processes the background in a best-effort fashion. To extract the target region in a frame, a background-foreground separation method must be selected. Considering the tradeoff between separation accuracy and computational complexity at the camera node, the strategy proposed in [2] is used in our system.
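As a minimal sketch of this separation-plus-priority idea, the following uses OpenCV's Mixture-of-Gaussians subtractor (MOG2) as a stand-in for the strategy of [2], with JPEG quality levels as placeholders for the real codec's priority settings:

# Content-aware splitting: the foreground (target) layer is extracted with a
# background subtractor so it can be coded at higher priority than the
# background. MOG2 and the JPEG qualities are illustrative stand-ins.
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16)

def split_frame(frame: np.ndarray):
    """Return (foreground, background) layers of one BGR frame."""
    mask = subtractor.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    foreground = cv2.bitwise_and(frame, frame, mask=mask)
    background = cv2.bitwise_and(frame, frame, mask=cv2.bitwise_not(mask))
    return foreground, background

def encode_with_priority(foreground, background):
    """Code the target layer with high quality and the background
    best-effort (JPEG quality is only a placeholder for real codec settings)."""
    _, fg_packet = cv2.imencode('.jpg', foreground, [cv2.IMWRITE_JPEG_QUALITY, 90])
    _, bg_packet = cv2.imencode('.jpg', background, [cv2.IMWRITE_JPEG_QUALITY, 30])
    return fg_packet, bg_packet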

(b) Video Content Detection. Taking human tracking as an example, we consider the problem as detecting a certain person in given video volumes. A front-view image and several rotated images of the person are taken as labeled examples. The SCC detects the same person or suspected persons in the videos using a hybrid feature that combines several local transform features by means of the AdaBoost feature selection method proposed in [24]. In this scheme, the local transform feature with the lowest classification error is sequentially selected from among several candidates until the classification performance is satisfactory. This hybridization makes human detection robust to changes in the external environment and highly efficient.
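The selection loop can be sketched as follows; decision stumps over precomputed feature channels stand in for the weak classifiers of [24], and the channel names and round count are illustrative:

# Greedy AdaBoost-style selection over several local transform feature
# channels (e.g. LBP, LGP, BHOG): at each round the channel whose weak
# classifier has the lowest weighted error joins the ensemble.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_select(channels, y, rounds=10):
    """channels: dict name -> (n_samples, n_features) array; y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # sample weights
    ensemble = []
    for _ in range(rounds):
        best = None
        for name, X in channels.items():
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            err = np.sum(w * (stump.predict(X) != y))
            if best is None or err < best[0]:
                best = (err, name, stump)
        err, name, stump = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of the chosen stump
        w *= np.exp(-alpha * y * stump.predict(channels[name]))
        w /= w.sum()                           # renormalize sample weights
        ensemble.append((alpha, name, stump))
    return ensemble

def predict(ensemble, channels):
    score = sum(a * s.predict(channels[name]) for a, name, s in ensemble)
    return np.sign(score)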

(c) Multicamera Collaboration. A controller in the SCC is responsible for the interaction between the PTZ cameras and the detection results at the SCC. Once a suspect target is detected in the videos from camera $c$, the controller must decide which camera should be zoomed and (or) rotated to focus on the suspected target. The next camera that can capture the target is uncertain; it is determined by the object's moving direction and velocity and by the camera distribution. In real-world conditions, occlusion or camera blind areas often affect the tracking results. In this work, we propose a multicamera collaboration algorithm based on a Bayes network, in which the GIS information of the cameras is merged to estimate the probability of the target's location. The detailed algorithm is addressed in Section 3.

(d) PTZ Camera Control. Only when the control decision from the SCC becomes a parameter of the camera motor can the camera zoom and change its view toward the center of the target. In this process, the PTZ cameras are instructed to observe the person, and a look-up table is computed for each PTZ camera that associates the person's location with the corresponding camera settings. The controlled cameras send high-quality videos to the SCC, and the tracking model of the SCC begins to track the target in these videos. The cameras recover their original status after a given time. The delay estimation algorithm in [2] is borrowed here because it resolves the time delay in our system successfully.
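A minimal sketch of such a look-up table, assuming a quantized ground-plane grid; the cell size and the calibration interface are our assumptions, not the paper's:

# Hypothetical look-up table mapping a quantized ground-plane location to
# precomputed pan/tilt/zoom settings for one camera.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PTZSetting:
    pan: float    # degrees
    tilt: float   # degrees
    zoom: float   # zoom factor

class PTZLookupTable:
    def __init__(self, cell_size: float = 1.0):
        self.cell_size = cell_size   # metres per grid cell (illustrative)
        self.table = {}              # (cell_x, cell_y) -> PTZSetting

    def _cell(self, x: float, y: float):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def calibrate(self, x: float, y: float, setting: PTZSetting) -> None:
        """Record the settings that centre a person standing at (x, y)."""
        self.table[self._cell(x, y)] = setting

    def settings_for(self, x: float, y: float) -> Optional[PTZSetting]:
        """Look up the stored settings for the person's estimated location."""
        return self.table.get(self._cell(x, y))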

(e) Target Tracking. As soon as the cameras are controlled to turn to the targets, the tracking model begins to work on the video sequences transmitted from these cameras. In the tracking model, TLD (Tracking-Learning-Detection) is adopted due to its high performance and robustness to unpredictable factors [25]. TLD decomposes the tracking task into tracking, learning, and detection: the tracker follows the object from frame to frame; the detector localizes the object in each frame and corrects the tracker if necessary; and the learning component estimates the detector's errors and updates it to avoid those errors in the future. If the target is successfully found in these videos, they are selected for the user's application.
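A minimal tracking loop along these lines, assuming the TLD implementation shipped with opencv-contrib (cv2.legacy.TrackerTLD_create); the stream source and initial bounding box are placeholders:

# TLD tracking on one controlled camera's stream: the tracker follows the
# target frame to frame while the built-in detector corrects it.
import cv2

def track_target(stream_url: str, init_box):
    cap = cv2.VideoCapture(stream_url)
    ok, frame = cap.read()
    if not ok:
        return
    tracker = cv2.legacy.TrackerTLD_create()
    tracker.init(frame, init_box)           # (x, y, w, h) of labeled target
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, box = tracker.update(frame)  # detector corrects the tracker
        if found:
            yield box                       # target location in this frame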

The procedure of each module, including object detection, camera election, camera control, and object tracking, is described in Algorithm 1, where $CL_t$ denotes the candidate camera list at time $t$ and $V_{c,t}$ denotes the video captured by camera $c$ at time $t$. All videos captured by the assigned cameras constitute the set of candidate videos $V_{\mathrm{cand}}$. If the object is successfully tracked in $V_{c,t}$, that video is put into $V_{\mathrm{final}}$, the final video set containing the given objects.

Surveillance process with active camera control:
Initialization: at $t = 0$, object $O$ is observed in camera $c_0$
(1) manually label object $O$
(2) for $t = 1, 2, \ldots$
(3) { clear $CL_t$
(4)   for each camera $c$: if $P(l' \in R_c) > \theta$, add $c$ into $CL_t$
(5)   for each $c \in CL_t$ do
(6)    { send PTZ control instruction to $c$;
(7)      $c$ conducts the instruction;
(8)      $c$ transmits video $V_{c,t}$ to SCC;
(9)    }
(10)  conduct tracking algorithm in $V_{\mathrm{cand}}$;
(11)  if $O$ is found in $V_{c,t}$, add $V_{c,t}$ into $V_{\mathrm{final}}$
(12) }
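For concreteness, Algorithm 1 can be rendered in Python roughly as follows; the camera-facing helpers (send_ptz, receive_video), the tracker interface, and the threshold value are hypothetical stand-ins for the real SCC interfaces:

# Rough Python rendering of Algorithm 1; all dependencies are injected so
# the sketch stays self-contained.
def surveillance_loop(cameras, target, tracker, p_next_in_fov,
                      send_ptz, receive_video, max_steps=1000, threshold=0.3):
    v_final = []                                     # videos containing target
    for t in range(1, max_steps + 1):                # line (2)
        candidates = [c for c in cameras             # lines (3)-(4): build CL_t
                      if p_next_in_fov(target, c) > threshold]
        v_cand = []
        for c in candidates:                         # lines (5)-(9)
            send_ptz(c, target)                      # camera turns and zooms
            v_cand.append(receive_video(c, t))       # camera streams to SCC
        for video in v_cand:                         # lines (10)-(11)
            if tracker.find(target, video):          # tracking in V_cand
                v_final.append(video)                # add V_{c,t} into V_final
    return v_final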

3. Multicamera Collaborative Tracking Based on Bayes Network

The collaboration of cameras is affected by various factors of the target, such as its location, moving direction, and velocity. The cameras' geographical distribution should also be considered because it directly determines the target's moving time between cameras. These relations can be modeled by a Bayes network. This section presents the model definition and parameter estimation in the network.

A multicamera interactive network concerning target movement and camera actions is defined as a triple $(S, A, T)$, where the set $S$ represents the state pairs of PTZ cameras and targets in the surveillance system, $A$ represents the actions of the PTZ cameras, and $T$ is the transition function that defines the transfer relation from a state $s$ to a state $s'$.

A state is defined as a pair $(o, c)$, where $o$ represents the target to be observed and $c$ represents the PTZ camera; $S$ consists of all possible combinations of targets and cameras. Any target $o$ can be described by a triple $s_o = (l, d, v)$, where $l$, $d$, and $v$ represent the target's location, moving direction, and velocity, respectively. The state space of a PTZ camera is a set of positions. For each camera $c$, $R_c$ represents the cover region of camera $c$, which can be linked to targets by judging the relation between $l$ and $R_c$. Once the target is near the FOV of camera $c$, a control instruction adjusting the PTZ parameters should be sent to camera $c$ to obtain a better view.
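The state definitions above can be encoded as simple data structures; the following sketch is illustrative, and the concrete field types (ground-plane coordinates, a polygonal cover region) are our assumptions:

# Illustrative encodings of the target state s_o = (l, d, v) and a camera's
# cover region R_c from the definitions above.
from dataclasses import dataclass

@dataclass
class TargetState:
    location: tuple    # l: (x, y) ground-plane position from camera GIS
    direction: float   # d: heading in radians
    velocity: float    # v: speed in m/s

@dataclass
class CameraState:
    position: tuple    # installed location from GIS
    cover_region: list # R_c: FOV footprint as a list of (x, y) vertices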

There are two assumptions in our system as follows:
(1) given the state $s_c$ and action $a_c$, camera $c$'s next state is conditionally independent of the other cameras' states and of any target's state at the current time;
(2) given the target state $s_o$, the target's next state is conditionally independent of the other targets' states and of any camera's state at the current time.

Then, the transition model can be simplified as in the following formula:
$$T(s, a, s') = \prod_{c} P(s'_c \mid s_c, a_c) \prod_{o} P(s'_o \mid s_o),$$
where $s'_c$ represents the next state of camera $c$ obtained by executing action $a_c$. Because a camera responds deterministically to its control instruction, the function $P(s'_c \mid s_c, a_c)$ is defined as in the following formula:
$$P(s'_c \mid s_c, a_c) = \begin{cases} 1, & \text{if } s'_c = f(s_c, a_c), \\ 0, & \text{otherwise}, \end{cases}$$
where $f(s_c, a_c)$ denotes the camera state reached from $s_c$ by action $a_c$.

The next state of a target is conditionally dependent only on its current state $s_o = (l, d, v)$; that is, the target's location in the next state is decided by its current location, direction, and velocity. So, $P(s'_o \mid s_o)$ can be expressed as in the following formula:
$$P(s'_o \mid s_o) = P(l' \mid l, d, v)\, P(d' \mid d)\, P(v' \mid v).$$

It is supposed that the target's direction and velocity obey the Gaussian distributions $\mathcal{N}(d, \sigma_d^2)$ and $\mathcal{N}(v, \sigma_v^2)$, respectively, where $d$ and $v$ are the direction and velocity in the current state and $\sigma_d^2$ and $\sigma_v^2$ are the variances of direction and velocity, which can be estimated from a large number of targets' individual behaviors. The transition probability $P(l' \mid l, d, v)$ can be computed according to the general velocity-direction motion model proposed in [26].
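As an illustration, one Monte Carlo step of this motion model can be sketched as follows; the time step and the standard deviations are placeholder values that would in practice be fitted to observed pedestrian behavior:

# Sample one step of the velocity-direction motion model: perturb direction
# and speed with Gaussian noise, then advance the location along the heading.
import math
import random

def sample_next_state(l, d, v, sigma_d=0.2, sigma_v=0.3, dt=1.0):
    """l = (x, y) location, d = direction (rad), v = speed (m/s);
    sigma_d, sigma_v are standard deviations (illustrative values)."""
    d_next = random.gauss(d, sigma_d)
    v_next = max(0.0, random.gauss(v, sigma_v))   # speed stays non-negative
    x, y = l
    l_next = (x + v_next * dt * math.cos(d_next),
              y + v_next * dt * math.sin(d_next))
    return l_next, d_next, v_next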

For a large surveillance region, the GIS information of a camera, which is stored at the SCC, is taken as the location of a target once the target enters that camera's FOV. The GIS information is also used to compute the target's moving time from one camera to another.

In the Bayes paradigm, the question of "which camera should be controlled" becomes the question of "which camera has the highest probability of co-occurring with the given target." We obtain a candidate camera list by computing the probability $P(l' \in R_c)$ for each camera $c$. The SCC then sends control commands to the top cameras in the list according to the actual task requirements.
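A sketch of the resulting camera ranking, reusing sample_next_state from the sketch above and approximating each cover region $R_c$ as a circle purely for brevity (the region shape and sample count are our assumptions):

# Rank candidate cameras by a Monte Carlo estimate of P(l' in R_c): draw
# samples from the motion model and count how many land in each region.
import math

def point_in_region(point, region):
    """Cover region approximated as ((cx, cy), radius) -- an assumption."""
    (cx, cy), r = region
    return math.hypot(point[0] - cx, point[1] - cy) <= r

def camera_scores(cameras, l, d, v, n_samples=1000):
    """cameras: dict cam_id -> ((cx, cy), radius). Returns ranked list."""
    ranked = []
    for cam_id, region in cameras.items():
        hits = sum(point_in_region(sample_next_state(l, d, v)[0], region)
                   for _ in range(n_samples))
        ranked.append((cam_id, hits / n_samples))  # estimate of P(l' in R_c)
    ranked.sort(key=lambda kv: kv[1], reverse=True)
    return ranked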

4. Experiments and Discussions

4.1. Experimental Setting

Video surveillance experiments were performed in a subway interchange station in Shanghai with 20 exits and more than 100 cameras, including 60 active PTZ cameras (2-megapixel). The performance of the network is evaluated in terms of the effectiveness of camera coordination and PTZ control. The PTZ cameras are normally installed at the exits, especially in places with a large pedestrian flow. We tested our coordination strategy with different autonomous persons and different numbers of targets. For each trial, the target is selected in the videos transmitted from a stationary camera. Once it is designated as the tracking object, our coordination algorithm begins to work.

All captured video frames except the first were coded as inter frames. Each video frame was segmented into a foreground part and a background part by a Mixture of Gaussians model, and the two parts were coded separately. Based on the received videos, the target's location was labeled according to the GIS information of the corresponding camera, and its future location was estimated at the SCC using the Mean Shift algorithm for PTZ camera control decision making.

In the experiments, we adopt a hybrid feature that combines several local transform features by means of the AdaBoost method proposed in [24], where the feature with the lowest classification error is sequentially selected until the required classification performance is obtained. The features include local gradient patterns (LGP), binary histograms of oriented gradients (BHOG), and local binary patterns (LBP). LBP is robust to global illumination changes, LGP is robust to intensity changes, and BHOG resists the effect of local pose changes. Their hybridization makes face and human detection robust to various environmental changes.

4.2. Collaborated Tracking Process

The process of a multicamera collaboration example is shown in Figure 2, where a suspicious man is found in the video from camera number 1 at the 1st Exit (the image at the top middle of Figure 2). A user in the SCC selects this target with a red frame and adds some user-provided pictures of this person, and the collaboration process starts. Camera number 1 at the 1st Exit receives the PTZ (pan, tilt, and zoom) control command from the SCC, conducts rotation and zoom operations, and transmits the zoomed-in videos to the SCC. Meanwhile, the other possible cameras estimated by our algorithm also receive control instructions and conduct similar operations. From Figure 2, we can observe that camera number 2 at the 11th Exit (top right corner of Figure 2) takes over the observation task, which in turn is taken over by camera number 1 at the 9th Exit (lower right corner of Figure 2) and camera number 2 at the 9th Exit (lower left corner of Figure 2). In Figure 2, three cameras obtained zoomed-in videos; the first camera could not, because the target had left its FOV by the time the control command arrived. The targets' faces are manually occluded in the pictures for personal privacy protection.

4.3. Multicamera Collaboration Performance Analysis

The overall performance is evaluated by the metric proposed in [23], written here as
$$M_{\mathrm{all}} = \frac{\sum_{t=1}^{T} n_t}{T_r},$$
where $T$ is the total number of time steps in a trial of the experiments, $n_t$ is the total number of targets tracked by the PTZ cameras at time step $t$, and $T_r$ is the real number of time steps during which the targets are present in the experimental environment. This metric indicates the average time observed by the PTZ camera network compared to the targets' real presence time.

In fact, the time a target spends in front of each camera differs from camera to camera because the cameras vary in FOV and environment, so the metric is liable to be dominated by the performance of a few major cameras. For example, if a target is present in front of a camera with a good view and that camera captures it for a long time, the overall evaluation under this metric tends to be better. Another case is a target that stays in front of one camera $c$ for a long time and passes the other cameras rapidly; as a result, the overall performance is dominated by the performance of camera $c$. In order to reflect the average level across cameras, the metric $M_{\mathrm{avg}}$ is proposed here to weight each camera equally, as shown in the following formula:
$$M_{\mathrm{avg}} = \frac{1}{|C_o|} \sum_{c \in C_o} \frac{T_c^{o}}{T_c^{r}},$$
where $C_o$ is the set of cameras that observed the targets, $T_c^{o}$ is the number of time steps in which camera $c$ observed the targets, and $T_c^{r}$ is the number of time steps in which the targets were actually present in camera $c$'s FOV.
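For clarity, the two metrics can be computed from experiment logs as follows; the log format (per-step target counts and per-camera time-step counts) is an assumption:

# Compute M_all and M_avg from logged counts.
def m_all(n_t, t_real):
    """n_t: list of targets tracked at each time step; t_real: total number
    of time steps the targets were actually present."""
    return sum(n_t) / t_real

def m_avg(per_camera):
    """per_camera: list of (observed_steps, present_steps), one per camera
    in C_o; each camera contributes its own observed/present ratio equally."""
    ratios = [obs / pres for obs, pres in per_camera if pres > 0]
    return sum(ratios) / len(ratios)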

The performance evaluated by the two metrics is shown in Table 1, where "number of $C_r$" represents the number of cameras that the targets really pass and "number of $C_o$" represents the number of cameras that observed the targets. To distinguish the two metrics, the metric of [23] is written as $M_{\mathrm{all}}$, which reflects the overall time span, while $M_{\mathrm{avg}}$ reflects the average level per camera.

From Table 1, we can observe that the proposed algorithm correctly assigned PTZ cameras in most cases. The most important cause of false assignment is occlusion by other pedestrians when the pedestrian flow is crowded. In experiment 4 and experiment 6, one camera in each case could not capture the targets; however, the mistake mainly resulted from failed tracking rather than from the multicamera coordination algorithm. As the number of pedestrians grows, the occlusion problem becomes more serious and the tracking results are less satisfactory, and a falsely tracked target consequently affects the tracking performance of other cameras. Our design keeps the control decision at the SCC, which avoids this problem to some extent because a heuristic search can be reinitiated by the SCC once track loss occurs.

We can also find that $M_{\mathrm{avg}}$ is slightly lower than the corresponding $M_{\mathrm{all}}$ when all the cameras successfully observed the targets; when some cameras failed to observe the targets, $M_{\mathrm{avg}}$ is obviously lower than $M_{\mathrm{all}}$.

According to the proposed multicamera collaboration algorithm, once the tracking targets are selected, the algorithm decides which cameras should be controlled, and the possible cameras estimated by the algorithm are initiated to participate in the tracking process. Suppose instead that there were no estimation process; camera selection would then depend on neighbor search, all the neighboring cameras of the current camera would be controlled, and the tracking model would have to work on videos from all of them. In fact, only the cameras in the target's moving direction are useful for the tracking task. In our experiments, the time cost of our collaboration algorithm is generally no more than one third of the time cost of working on all neighboring cameras.

4.4. Effect of Camera Control

In this subsection, we analyze the effect of the camera control model by comparing against results without PTZ operations. In Figure 3, we can observe that the overall performance of the camera network decreases when there is no pan-tilt-zoom control on the cameras. The reason is that PTZ control enlarges the effective FOV of the cameras, so a camera can capture a target over a longer period. The zoom operation helps to obtain a clear image of the target even when it is somewhat far from the camera, and a clear image can be detected with less ambiguity.

5. Conclusion and Future Works

In this paper, an active camera network framework is designed for video surveillance by jointly considering collaborative camera control and effective target tracking. To maximize the FOV coverage of the observed targets and minimize the number of cameras being controlled, a Bayes network based collaborative control algorithm is proposed that estimates the target's location at the next time step. Hybrid human feature selection is also employed to improve tracking precision. Experimental results in a real-world environment indicate the efficacy of the proposed framework and algorithm. Nevertheless, as the pedestrian flow grows, system performance decreases. The tracking performance is also affected by the number of targets, owing to the relatively low precision of multiobject tracking. Future research will focus on these problems and improve the adaptability of the system.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was partly supported by the National Natural Science Foundation of China under Grants 61202170, 61203247, and 61273304, by the Fundamental Research Funds for the Central Universities under Grant 0800219226, and by the Opening Project of the Shanghai Key Laboratory of Digital Media Processing and Transmission (no. 2011KF03).