Abstract

This paper presents an algorithm for camera localization using trajectory estimation (CLUTE) in a distributed network of nonoverlapping cameras. The algorithm recovers the extrinsic calibration parameters, namely, the relative position and orientation of the camera network on a common ground plane coordinate system. We first model the observed trajectories in each camera's field of view using Kalman filtering and then use these models to estimate the missing trajectory segments in the unobserved areas by fusing the results of a forward and a backward linear regression estimation from adjacent cameras. These estimated trajectories are then filtered and used to recover the relative position and orientation of the cameras by analyzing the estimated and observed exit and entry points of an object in each camera's field of view. The final configuration of the network is established by considering one camera as a reference and by adjusting the remaining cameras with respect to this reference. We demonstrate the algorithm on both simulated and real data and compare the results with state-of-the-art approaches. The experimental results show that the proposed algorithm is more robust to noisy and missing data and to camera failure.

1. Introduction

Over the last few decades, the use of large-scale camera networks has grown rapidly in a wide range of applications such as visual surveillance of mass transportation sites, disaster monitoring, and traffic control. These networks enable the monitoring of very large areas and hence the detection of activities on a scale that is impossible with single cameras (see [1, 2]). Existing activity detection systems rely primarily on manual analysis of the data collected by such networks, which is extremely tedious; the development of automated data analysis and summarization tools is therefore essential to obtain the maximum benefit from these camera networks [3].

Calibration of camera networks is the first and most important step in the development of such an automated activity summarization system. The calibration defines the correspondence between points in the image plane and points in 3D space and can be divided into intrinsic and extrinsic calibration (the latter also referred to as localization). The intrinsic calibration establishes the relationship between camera-centric coordinates and image coordinates and can be performed by acquiring an object with known Euclidean structure, while the extrinsic calibration defines the relationship between the scene-centric and the camera-centric coordinate systems. For overlapping camera networks, the estimation of epipolar geometry is a popular choice for extrinsic calibration, where candidate corresponding points are initially extracted from the scene and then a model is learnt that minimizes the error between the image points and their reprojections [4]. However, in many real scenarios, cameras do not have overlapping views (Figure 1). Examples of such scenarios are wide-area surveillance of underground train stations and/or subways, and mobile ad-hoc networks of low-cost surveillance cameras. The large number of cameras makes it difficult or expensive to record the cameras' locations manually or to equip each camera with a GPS unit [5]. One way of addressing this issue is through automated nonoverlapping camera calibration algorithms, which enable cameras to determine their positions and orientations after placement.

In this paper, we present a camera localization algorithm using trajectory estimation (CLUTE) for a network of nonoverlapping cameras. The algorithm addresses the problem of recovering the relative position and orientation of multiple cameras whose intrinsic parameters are known. CLUTE uses temporal and geometric constraints derived from the available trajectory information to estimate the unobserved trajectory segments, which are then used to position the cameras on the common ground plane. The object motion information also helps in estimating the relative orientation of the cameras. The registration process aligns the cameras with respect to each other in a single camera frame (that of the reference camera). The algorithm is demonstrated on both simulated and real data, and its results are compared with state-of-the-art approaches.

This paper is organized as follows. Section 2 reviews state-of-the-art approaches for camera localization. Section 3 formalizes the problem under consideration. Section 4 provides a detailed description of CLUTE. Section 5 presents results and analysis of CLUTE on simulated and real data and compares its performance with two existing techniques. Finally, Section 6 draws the conclusions.

2. Related Work

Localization is a well-established problem in sensor (camera, audio, RFID, etc.) networks ([6–8]). Algorithms for localization can be categorized into two main classes, namely, fine-grained algorithms (see [9, 10]) and coarse-grained algorithms ([11, 12]). Fine-grained methods use timing and/or signal strength for localization. In this class of localization methods, only a few sensors' positions are known. These sensors are called beacons or anchors (see [13–15]). The knowledge of the beacons is then propagated across the entire network to find the positions of the remaining sensors. The nodes measure the distance to their neighbors whenever possible using hardware ranging techniques such as received signal strength (RSS) and time difference of arrival (TDoA). However, the selection of the anchor nodes is a significant problem, and using all anchor nodes does not necessarily give the most precise position. Moreover, criteria are needed for selecting the optimal number of anchor nodes to achieve a more accurate position estimate.

Economic factors and hardware limitations are key motivations for the use of coarse-grained localization methods. These methods estimate the proximity of the sensors in a network to an arbitrary reference sensor. Coarse-grained algorithms can further be divided into nonstatistical and statistical approaches. The first nonstatistical approach is multidimensional scaling (MDS) for localization ([16–18]). MDS arranges the sensors in a lower-dimensional space, whose dimensionality depends on the application data. MDS is very accurate in recovering the sensor network configuration when precise distances are available between all sensor pairs. To compute the locations of cameras on a 2D space using MDS [19], an affinity matrix is constructed based upon pairwise Euclidean distances. Then, the inner-product matrix is calculated from the affinity matrix and the double-centering matrix, followed by the computation of its eigenvalues and eigenvectors. After sorting the eigenvalues in descending order, the sensors' locations are calculated using the top eigenvalues and their corresponding eigenvectors. The major shortcoming of this approach is its dependency on the affinity matrix: the unavailability of even a few distances degrades the overall performance significantly.
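
For concreteness, the following sketch (not from the paper; the camera positions are hypothetical) shows how classical MDS recovers 2D positions from a complete pairwise distance matrix via double centering and eigendecomposition, which is the procedure outlined above:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Recover coordinates (up to a rigid transform) from a full
    pairwise Euclidean distance matrix D using classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # double-centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # inner-product matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dim]  # top eigenvalues/eigenvectors
    L = np.diag(np.sqrt(np.maximum(eigvals[order], 0.0)))
    return eigvecs[:, order] @ L             # n x dim coordinates

# Hypothetical ground-truth camera positions on a ground plane.
true_pos = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])
D = np.linalg.norm(true_pos[:, None] - true_pos[None, :], axis=-1)
print(classical_mds(D))  # positions recovered up to rotation/translation
```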

Another nonstatistical approach is used to calibrate a network of randomly placed cameras with nonoverlapping fields of view using moving scene features in the near and far fields [20]. A strong assumption is made that object motions are deterministic. Distant objects (e.g., stars) enable the recovery of the orientation (rotation) of the cameras, while close objects (e.g., people or cars) enable the recovery of the translation of the cameras up to a scalar multiple. In this approach, the camera parameters are recovered by solving a complex geometry problem, without imposing a probabilistic framework.

Statistical approaches include numerical, motion-based, maximum a posteriori (MAP), and velocity-extrapolation-based approaches. Numerical solutions are iterative methods for network localization, where each iteration contributes to the reduction of the residual errors. Existing approaches are generally variations of the gradient descent or the Gauss-Newton methods. As an example, Taylor et al. [21] use the Newton-Raphson method to estimate the network configuration. Although the simplicity of such approaches is a key advantage, they depend heavily on proper initialization and on the increment (or decrement) rate to find the global (or local) minimum.

In structure from motion (SFM), the trajectory of a moving camera and the 3D coordinates of a stationary target are recovered simultaneously from a series of 2D images of a scene. In [22], the focus is on real-time processing of the image data using an extended Kalman filter, whereas the concept of recursive or sequential SFM can be found in [23, 24]. Similar to SFM, simultaneous localization and mapping (SLAM) localizes a moving sensor (robot) and estimates its trajectory using its egomotion and the stationary objects in the scene (see [25, 26]). The performance of SLAM algorithms is affected by noise, as the robots rely on their cameras to compute the distance traveled, and noisy measurements therefore accumulate quickly. Environment maps can be helpful in these situations.

Rahimi et al. [27] used the maximum a posteriori (MAP) framework for simultaneous calibration and tracking. A network of nonoverlapping cameras is localized by using the motion of a target. The MAP estimates for the calibration parameters are calculated using the trajectory prior (i.e., the motion model) and the likelihood function, which are constructed from the available observations. The MAP approach is computationally very expensive. Furthermore, it is also possible that the solution places the target inside the field of view of another sensor for which no observations are available at that particular time instant.

Javed et al. [28] use the concept of velocity extrapolation to project the field of view of one camera onto the other. The projection is then used as a tool to find the calibration parameters. However, the approach assumes that people walk in a straight line in the unobserved regions. Finally, Junejo et al. [29] propose an approach in which vanishing points are used to find the relative orientation of the cameras whose positions are already known.

A summary of the state-of-the-art approaches for sensor localization is presented in Table 1.

3. Problem Formulation

Suppose we have a network of $N$ nonoverlapping cameras $\{C_1, C_2, \ldots, C_N\}$, similar to [33]. Let a trajectory within $C_i$ be represented as $T_i = \{\mathbf{x}_i(1), \mathbf{x}_i(2), \ldots, \mathbf{x}_i(M_i)\}$, where $\mathbf{x}_i(t)$ is the estimated position of the target in the image plane and $M_i$ is the number of target observations from camera $C_i$.

Furthermore, let each observation be generated by the motion model
$$\mathbf{x}_i(t+1) = \mathbf{x}_i(t) + \mathbf{v}_i(t) + \boldsymbol{\omega}(t),$$
where $\mathbf{v}_i(t)$ is the velocity of the object, which may change over time, and $\boldsymbol{\omega}(t)$ is additive noise with covariance $\mathbf{Q}$.
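
As an illustration of this motion model, the sketch below (all numerical values are illustrative assumptions, not the paper's settings) generates a trajectory with a slowly varying velocity and additive Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trajectory(x0, v0, steps=50, q=0.01):
    """Generate positions x(t+1) = x(t) + v(t) + w(t), where the velocity
    performs a small random walk and w(t) is zero-mean Gaussian noise with
    covariance q*I (all values illustrative)."""
    x, v = np.array(x0, dtype=float), np.array(v0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        v = v + rng.normal(scale=0.02, size=2)  # slowly varying velocity
        x = x + v + rng.multivariate_normal(np.zeros(2), q * np.eye(2))
        traj.append(x.copy())
    return np.array(traj)

traj = simulate_trajectory(x0=(0.0, 0.0), v0=(0.5, 0.2))
```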

Let each camera provide a vertical top-down view of the scene (i.e., its optical axis is perpendicular to the ground plane, or the trajectories are preprocessed using a homography transformation [34]). Under this assumption, the number of parameters for the localization of each camera is reduced to two, namely, the camera position, $\mathbf{p}_i = (x_i, y_i)$, and the rotation angle, $\theta_i$, expressed as the relative angle between the camera and the horizontal axis (Figure 2). To summarize, the unknown parameters for camera $C_i$ are $\Phi_i = (\mathbf{p}_i, \theta_i)$. If $C_i$ observes the object at a particular time instant and after $\Delta t$ time intervals the object enters $C_j$ ($j \neq i$), then $C_j$ can be viewed as observing the object from the position
$$\mathbf{x}_j(t + \Delta t) = \mathbf{R}_{ij}\,\mathbf{x}_i(t) + \mathbf{t}_{ij},$$
where $\mathbf{R}_{ij}$ and $\mathbf{t}_{ij}$ are the rotation matrix and the translation vector.

The camera localization process estimates $(\mathbf{R}_{ij}, \mathbf{t}_{ij})$ such that the configuration estimation error is minimized, that is,
$$(\hat{\mathbf{R}}_{ij}, \hat{\mathbf{t}}_{ij}) = \arg\min_{\mathbf{R}_{ij},\, \mathbf{t}_{ij}} \sum_t \left\| \mathbf{x}_j(t + \Delta t) - \hat{\mathbf{x}}_{i \to j}(t + \Delta t) \right\|^2,$$
where $\hat{\mathbf{x}}_{i \to j}(t + \Delta t)$ is the projected estimate of the object's position from $C_i$ at time $t$ to $C_j$ at time $t + \Delta t$.
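
One possible way to realize this minimization, sketched below under the assumption that corresponding observed and projected points are available, is the standard least-squares rigid alignment (Procrustes) solution; the function name and the use of an SVD are our choices, not necessarily the paper's:

```python
import numpy as np

def rigid_align(P, Q):
    """Least-squares rotation R and translation t such that
    R @ p + t approximates q for corresponding rows p of P and q of Q.
    P, Q are N x 2 arrays of corresponding 2D points."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)          # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # enforce a proper rotation (det = +1)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cq - R @ cp
    return R, t
```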

4. Proposed Approach

Let an object move in the environment and be tracked by each camera in the network. To find the rotation matrix, $\mathbf{R}_{ij}$, and the translation vector, $\mathbf{t}_{ij}$, between adjacent cameras, we propose a two-step iterative process. Two consecutive batches of measurements from two adjacent cameras form one iteration, and the adjacency between cameras is defined by the motion of the object itself. The first step calculates $\mathbf{t}_{ij}$ by estimating the missing trajectory between pairs of adjacent cameras. In the second step, $\mathbf{R}_{ij}$ is calculated by utilizing the trajectory information and the object's exit and entry points in the fields of view of adjacent cameras. The details of each step are given below.

4.1. Trajectory Estimation in Unobserved Areas

Figure 3 shows the flow diagram of the trajectory estimation process. Unlike [31], each trajectory segment from each camera is smoothed before intercamera trajectory estimation: $\tilde{T}_i = f(T_i)$, where $f(\cdot)$ is a smoothing function and $\tilde{T}_i$ is the smoothed trajectory. If $C_i$ and $C_j$ are two adjacent cameras, then the estimated trajectory $\hat{T}_{ij}$ between the camera pair is obtained by combining $g(\tilde{T}_i)$ and $h(\tilde{T}_i)$, where $g(\cdot)$ and $h(\cdot)$ are a nonparametric and a parametric function, respectively, which extrapolate the smoothed trajectory in the unobserved region.

We use the Kalman filter (see [35, 36]) as the parametric function $h(\cdot)$. The approach is inspired by speech recognition, where the Kalman filter [37] has been used extensively to estimate intervals with missing observations [38]. However, the Kalman filter's innovation signal distorts significantly in the absence of target information. To overcome this problem, we employ linear regression estimation as the nonparametric function $g(\cdot)$, since, in the unobserved regions, the Kalman filter and the linear regression model exhibit a similar behavior (see the appendix).

Within each camera's field of view, Kalman filtering is applied to obtain the initialization of the parameters. When the target enters an unobserved region, the linear regression model replaces the Kalman filter to estimate the target position. In order to improve the estimation, we use both forward and backward motion models: the trajectory is estimated in both the forward ($\hat{T}_{ij}^{f}$) and the backward ($\hat{T}_{ij}^{b}$) direction between $C_i$ and $C_j$.
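
The following sketch illustrates this handoff with a constant-velocity Kalman filter and a first-order regression extrapolation; the state model, noise covariances, and function names are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

dt = 1.0
F = np.block([[np.eye(2), dt * np.eye(2)],
              [np.zeros((2, 2)), np.eye(2)]])   # constant-velocity state model
H = np.hstack([np.eye(2), np.zeros((2, 2))])    # only the position is observed
Q = 0.01 * np.eye(4)                            # process noise (illustrative)
R = 0.05 * np.eye(2)                            # measurement noise (illustrative)

def kalman_step(s, P, z=None):
    """One predict(+update) cycle; the update runs only while observed."""
    s, P = F @ s, F @ P @ F.T + Q               # prediction
    if z is not None:
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        s = s + K @ (z - H @ s)
        P = (np.eye(4) - K @ H) @ P
    return s, P

def regression_extrapolate(times, points, future_times):
    """Fit x(t) and y(t) with first-order linear regression over the last
    observed samples and extrapolate into the unobserved region."""
    coeff = np.polyfit(times, points, deg=1)    # (2, 2): slope, intercept per axis
    t = np.asarray(future_times)
    return np.outer(t, coeff[0]) + coeff[1]     # T x 2 predicted positions
```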

The final estimate is obtained as a weighted average of $\hat{T}_{ij}^{f}$ and $\hat{T}_{ij}^{b}$, computed for each segment $k$ as
$$\hat{T}_{ij}(k) = w_f(k)\,\hat{T}_{ij}^{f}(k) + w_b(k)\,\hat{T}_{ij}^{b}(k),$$
where $\hat{T}_{ij}^{f}(k)$ and $\hat{T}_{ij}^{b}(k)$ are segments of the estimated trajectory. The forward estimation results are given weights that decrease with the distance from the border of the camera's field of view. The backward estimation results are given higher weights when the object gets closer to the next camera's field of view. The underlying assumption is that both trajectories contribute to the construction of the estimated trajectory [39]. The linear regression model expands over the uncertainty volume of the region so that, when the target becomes visible again in a camera, the Kalman filter can be reinitialized immediately. For simplicity, sharp turns are modeled as smooth curves (Figure 4). The process terminates when the difference between five consecutive iterations is smaller than a threshold or when all the available data have been used.
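
A minimal sketch of the fusion step is given below; the linear weighting profile is an assumption consistent with the description above (forward weights decay towards the entering camera, backward weights grow):

```python
import numpy as np

def fuse_forward_backward(fwd, bwd):
    """Weighted average of forward and backward trajectory estimates over an
    unobserved gap. fwd, bwd: K x 2 arrays covering the same K time steps."""
    k = len(fwd)
    w_fwd = np.linspace(1.0, 0.0, k)[:, None]  # high near the exiting camera
    w_bwd = 1.0 - w_fwd                        # high near the entering camera
    return w_fwd * fwd + w_bwd * bwd
```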

4.2. Orientation Estimation

The relative angle $\theta_{ij}$ between two adjacent cameras $C_i$ and $C_j$ is computed as the angle between the observed object positions in camera $C_j$ and the corresponding object positions estimated in the same camera by extrapolating the trajectory from $C_i$.
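
One plausible realization of this measurement is sketched below; the choice of comparing the headings of the observed and the extrapolated track segments, as well as the function name, are assumptions, since the paper's exact formula is not reproduced here:

```python
import numpy as np

def relative_angle(observed, estimated):
    """Angle (radians) between the heading of the observed track segment in
    the entering camera and the heading of the segment extrapolated from the
    previous camera, wrapped to (-pi, pi]."""
    def heading(p):
        d = p[-1] - p[0]                 # overall displacement of the segment
        return np.arctan2(d[1], d[0])
    a = heading(np.asarray(observed)) - heading(np.asarray(estimated))
    return (a + np.pi) % (2 * np.pi) - np.pi
```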

Once $\theta_{ij}$ is computed for all pairs of adjacent cameras, the final configuration is obtained by rearranging all the cameras with respect to the reference camera. An example of the complete localization process of the proposed algorithm is shown in Figure 5. The figure shows the original trajectory segments (linear and nonlinear) as well as the reconstructed trajectory segments in the unobserved regions. Furthermore, intermediate and final localization results of the proposed algorithm are also provided in the same figure to illustrate the localization process over the iterations.
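
A sketch of this registration step is given below, assuming the pairwise estimates are chained along the order in which the object visits the cameras; the dictionary layout and chain order are illustrative assumptions:

```python
import numpy as np

def compose_to_reference(pairwise, n_cams, ref=0):
    """Place all cameras in the reference camera's frame by chaining the
    pairwise (R, t) estimates along the chain ref -> ref+1 -> ...
    `pairwise[(i, j)]` maps points from camera j's frame into camera i's frame."""
    poses = {ref: (np.eye(2), np.zeros(2))}
    for i in range(ref, n_cams - 1):
        R_ij, t_ij = pairwise[(i, i + 1)]
        R_i, t_i = poses[i]
        # camera (i+1) expressed in the reference frame
        poses[i + 1] = (R_i @ R_ij, R_i @ t_ij + t_i)
    return poses
```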

5. Experimental Results and Analysis

In this section, we compare the proposed approach (CLUTE) with the MDS (see [16, 17]) and MAP (see [27, 32]) approaches on both simulated and real data. A 4-camera and an 8-camera network are tested with simulated data, and a 4-camera network is tested with real data. The performance of the algorithms is evaluated based on the translation and rotation errors of the localization with respect to ground-truth information. The translation error is calculated as the Euclidean distance between the true camera position and the estimated camera position. The rotation error is calculated as the absolute difference between the true camera orientation (the angle with respect to the reference camera) and the estimated camera orientation. Details about the datasets used and the experimental results are given below.
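
For clarity, the two error measures can be written as the following small helpers (a straightforward reading of the definitions above, not code from the paper):

```python
import numpy as np

def translation_error(p_true, p_est):
    """Euclidean distance between true and estimated camera positions."""
    return np.linalg.norm(np.asarray(p_true) - np.asarray(p_est))

def rotation_error(theta_true, theta_est):
    """Absolute angular difference in degrees, wrapped to [0, 180]."""
    d = abs(theta_true - theta_est) % 360.0
    return min(d, 360.0 - d)
```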

5.1. 4-Camera Network

In this setup, we first simulate a network of 4 cameras and then test the algorithms on a real camera network. Let an object travel across the network and generate a trajectory. Each camera observes the object within its field of view (in Figure 6, the observations within each field of view are shown with circles). The lines outside the squares are the unobserved track points of the object. We analyze the results of three experiments using the same network configuration but with three different moving objects. The CLUTE results are shown in Figure 8, and the translation error at each iteration is shown in Figure 9. The performance of CLUTE is evaluated and compared with MAP and MDS on the original, noisy, and subsampled data. The algorithms are also compared in the case of camera failure.

Figure 10 compares the results of CLUTE, MAP, and MDS along with the true positions of the cameras. Visual inspection of the results shows that, in all three experiments, the performance of the statistical approaches is better than that of the nonstatistical approach; this is particularly evident for individual cameras in each of the three datasets. The detailed experimental results are shown in Table 2. In the first experiment, the average translation error for CLUTE is 0.18 units, or 4.5% of the environment area. In comparison with the other two approaches, CLUTE is on average more accurate in localizing the network: the average translation error for MAP and MDS is poorer by 0.50% and 3.25%, respectively, and their average rotation errors are also worse than that of CLUTE. In the second experiment, the average translation error for CLUTE is 0.13 units, or 3.25% of the environment area, which is better by 1.25% and 4.25% than MAP and MDS, respectively; the average rotation error estimated by CLUTE is again lower than that of MAP and MDS. In the third experiment, the average translation error for CLUTE is 0.24 units, which is 6% of the environment area, whereas the average translation error for MAP and MDS is 0.32 units and 0.46 units, respectively; CLUTE also yields the smallest average rotation error. In summary, these experiments show that the performance of CLUTE is better than that of MAP by an average of 1.25% for translation, and better than that of MDS by 4.5% for translation, with corresponding improvements in rotation. In addition, the error variance for CLUTE is smaller than that for MAP and MDS. It is also noticeable that the estimation results of the statistical approaches (i.e., CLUTE and MAP) are better than those of the nonstatistical approach (i.e., MDS). In general, the main limitation of MDS is that it relies on a single parameter (in this case, the shortest Euclidean distance) being set to an appropriate value; the parameter can be calculated accurately, and the approach performs satisfactorily, only if enough information is available.

We also analyzed the robustness of CLUTE to reduced trajectory sampling (missing data), to noisy trajectories, and to camera failure. As MDS performs poorly in estimating the localization, we compare the robustness of CLUTE and MAP only (Table 3).

For the missing-data test, the object's trajectory is downsampled by 2 and by 3 for all three experiments. The results for downsampling by 2 show that on average the translation estimation of CLUTE does not suffer in experiments 1 and 2 and even improves in experiment 3. A similar behavior is observed for the rotation (especially in experiments 2 and 3). A closer look at the results shows that in experiment 3 the estimation result on the original data is quite poor for one of the cameras compared to the others. Figure 8 shows that in most cases the object takes very sharp turns before entering this camera. These sharp turns cause the degradation of the result, while downsampling reduces the sharpness of the turns and therefore simplifies the modeling. Compared to MAP, the average translation error over the three experiments for CLUTE is better by 2.75%, and the rotation estimation is also in favor of CLUTE. The performance of MAP degrades considerably with downsampling. The reason for this behavior is that in MAP the track points provide the likelihood function over the trajectories and the camera parameters, and the posterior probability therefore depends directly upon the availability of enough reliable object measurements. Downsampling reduces the likelihood and hence degrades the overall results. In CLUTE, on the other hand, the trajectories are estimated by interpolation of the track points over time; as long as the available object observations preserve the shape of the trajectory, the calibration is performed accurately. For further downsampling by 3, the average translation error over the three experiments for CLUTE is still 4.25% better than that of MAP, and the rotation estimation also remains better for CLUTE. Moreover, when going from downsampling by 2 to downsampling by 3, the degradation of the position estimation is just 0.02 units for CLUTE against 0.08 units for MAP. To summarize, CLUTE is more robust to reduced sampling rates than MAP.

In order to analyze the performance of CLUTE with noisy observations, we introduced 5% Gaussian noise in the measurements, with variance equal to 15% of the camera's field of view, for all three datasets. For both CLUTE and MAP, the results degrade substantially for the position estimation (especially for some cameras in experiments 1 and 3) and for the orientation (especially in experiment 3). This is because the noise not only degrades the quality of the observations (essential for MAP), but also changes the shape of the trajectory (essential for CLUTE). However, on average the results for CLUTE degrade by 8% for the translation, whereas the degradation for MAP is 8.5%, and the rotation degradation is also smaller for CLUTE.

In order to analyze the localization performance in the case of camera failure, we removed one camera's observations from the available datasets (1-camera failure case). The complete results for the three experiments are shown in Table 4. On average, the results for CLUTE degrade by 5% for the translation, whereas the degradation for MAP is 10.25%; the rotation degradation is also smaller for CLUTE. The results show that CLUTE is more robust in the case of camera failure.

In the final experiment for the 4-camera network, we use real data captured indoors (Figure 7). The fields of view of the cameras are squares with sides of 1.5 meters, and the cameras are between 3 and 4 meters apart. A toy car moving at varying velocities traces a long trajectory across the cameras. Based on the ground-truth segmentation of the moving object in the field of view of every camera, we compute the relative distance and orientation between the cameras using trajectory estimation. Figure 11 shows the results obtained with CLUTE, and Table 5 compares the results of the three approaches on the real data. For CLUTE, the sensors were on average misplaced by 70 cm from the locations measured by hand. For the orientation estimation, CLUTE performs on average better than MAP and better than MDS by 5°. For the position estimation, the error of CLUTE is 0.45 meters smaller than that of MDS and only 0.04 meters larger than that of MAP.

5.2. 8-Camera Network

Three datasets have been generated by simulating three different moving objects in a network of 8 cameras with nonoverlapping fields of view. The results obtained with CLUTE on these datasets are shown in Figure 12, and Table 6 contains the results obtained in each experiment. CLUTE outperforms MDS, as its average translation error is 13.75% lower and its average rotation error is also lower. Likewise, CLUTE outperforms MAP in estimating the orientation, with an average rotation error over the three datasets that is lower by 0.90%, while in estimating the translation MAP is better by 0.16%.

We further investigated the robustness of the methods to missing data, noisy trajectories (Table 7), and camera failure. For downsampling by 2, the average translation error over the three experiments for CLUTE is 0.03 units better than for MAP, and the rotation estimation is also better. For downsampling by 3, the average translation error using CLUTE over the three experiments is 1.5% better than using MAP, and the rotation estimation is better by 3.69° for CLUTE. Moreover, when going from downsampling by 2 to downsampling by 3, the degradation of the position estimation is 40% smaller for CLUTE than for MAP. For trajectories contaminated with 5% Gaussian noise with variance equal to 15% of the camera's field of view, the average translation error for CLUTE over the three datasets is 2.5% better than that of MAP, and the rotation error is also better.

For camera failure, we simulated one-, three-, and five-camera failure situations by ignoring the measurements coming from one, three, and five cameras, respectively, and by evaluating the localization accuracy of the remaining sensors. For one camera failing (Table 8), the average translation error (taken over the three datasets) for CLUTE is 0.16 units, which is better by 0.25 units compared to MAP; the rotation error for CLUTE is also better than that of MAP. Likewise, the translation error for CLUTE in the case of three (Table 9) and five (Table 10) cameras failing is 12% of the environment size, and the translation results obtained by CLUTE are better than those of MAP by more than 5% in both cases. The rotation error of CLUTE is also smaller in both the three- and five-camera failure cases.

6. Conclusions

We proposed an algorithm to recover the network configuration of a set of cameras with disjoint views. The algorithm finds the position and orientation of each camera on a common ground plane and consists of two main steps: the estimation of the unobserved trajectory in the regions not covered by the cameras' fields of view and the estimation of the relative orientations of the cameras. Kalman filtering and linear regression are used for the estimation of the trajectory. Forward and backward estimations are used to increase the reliability of the results. The relative orientation of the cameras is obtained by using the exit and entry point information in each camera's field of view.

We have compared the performance of the proposed approach with a statistical approach (MAP) and with a nonstatistical approach (MDS) on both simulated and real data. The experimental results show that CLUTE is more accurate in localizing the network than these state-of-the-art approaches. Also, the proposed approach is more robust to missing and noisy data and performs better in the case of failure of one or more cameras.

Our current work includes further improvements of the trajectory estimation in unobserved areas by exploiting, when available, the audio information captured with stereo microphones coupled with each camera [40].

Appendix

Relationship between Kalman Filter and Linear Regression

To show that the Kalman filter and linear regression exhibit similar behaviors in the unobserved regions, let us define the object state at time $t$ as $\mathbf{s}(t) = [\mathbf{x}(t), \mathbf{v}(t)]^{T}$, where $\mathbf{x}(t)$ is the object position and $\mathbf{v}(t)$ is the object velocity. If $\mathbf{F}$ is the model that transforms the object state at time $t$ to the next state at time $t+1$, the state evolution process can be expressed as $\mathbf{s}(t+1) = \mathbf{F}\,\mathbf{s}(t) + \mathbf{w}(t)$, where $\mathbf{w}(t)$ is the process additive noise, assumed to be zero-mean Gaussian with covariance $\mathbf{Q}$.

The Kalman filter propagates the state using a prediction and an update step. The state prediction and the associated error covariance matrix are defined as
$$\hat{\mathbf{s}}^{-}(t) = \mathbf{F}\,\hat{\mathbf{s}}(t-1), \qquad \mathbf{P}^{-}(t) = \mathbf{F}\,\mathbf{P}(t-1)\,\mathbf{F}^{T} + \mathbf{Q}, \tag{A.1}$$
where $\hat{\mathbf{s}}^{-}(t)$ is the predicted state estimate and the superscript $T$ indicates the transpose of a matrix. The filter is updated by computing the Kalman gain, $\mathbf{K}(t)$, as
$$\mathbf{K}(t) = \mathbf{P}^{-}(t)\,\mathbf{H}^{T}\left(\mathbf{H}\,\mathbf{P}^{-}(t)\,\mathbf{H}^{T} + \mathbf{R}\right)^{-1}, \tag{A.2}$$
where $\mathbf{R}$ is the covariance of the observation noise and $\mathbf{H}$ maps the state vector to the measurements. The object state can then be updated using
$$\hat{\mathbf{s}}(t) = \hat{\mathbf{s}}^{-}(t) + \mathbf{K}(t)\left(\mathbf{z}(t) - \mathbf{H}\,\hat{\mathbf{s}}^{-}(t)\right), \tag{A.3}$$
where $\mathbf{z}(t)$ is the observation obtained through the observation model $\mathbf{H}$.

In the unobserved regions, there is no prior information about the object state and the observation noise covariance is zero. Therefore, (A.2) can be written as
$$\mathbf{K}(t) = \mathbf{P}^{-}(t)\,\mathbf{H}^{T}\left(\mathbf{H}\,\mathbf{P}^{-}(t)\,\mathbf{H}^{T}\right)^{-1}. \tag{A.4}$$
The optimal state estimate in this case can be expressed as
$$\hat{\mathbf{s}}(t) = \hat{\mathbf{s}}^{-}(t) + \mathbf{P}^{-}(t)\,\mathbf{H}^{T}\left(\mathbf{H}\,\mathbf{P}^{-}(t)\,\mathbf{H}^{T}\right)^{-1}\left(\mathbf{z}(t) - \mathbf{H}\,\hat{\mathbf{s}}^{-}(t)\right). \tag{A.5}$$

Linear regression finds the optimal estimate of the object state at $t$ by minimizing the squared error between the estimate and the observation. Let us consider the generalized weighted sum of the squared residuals
$$J\big(\hat{\mathbf{s}}(t)\big) = \left(\mathbf{z}(t) - \mathbf{H}\,\hat{\mathbf{s}}(t)\right)^{T}\mathbf{W}\left(\mathbf{z}(t) - \mathbf{H}\,\hat{\mathbf{s}}(t)\right), \tag{A.6}$$
where $\mathbf{W}$ is a weighting matrix. To minimize the squared residual, we take the derivative of (A.6) with respect to the state estimate and set it to zero. This results in
$$\hat{\mathbf{s}}(t) = \left(\mathbf{H}^{T}\mathbf{W}\,\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{W}\,\mathbf{z}(t), \tag{A.7}$$
and thus from (A.5) and (A.7) it is possible to notice that the Kalman filter and linear regression exhibit a similar behavior when no prior object information is available.
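
As a quick numerical check of this equivalence (with illustrative matrices only), the snippet below verifies that, for an invertible observation model, the Kalman update with zero observation-noise covariance and the weighted least-squares estimate coincide:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2D setup: invertible observation model H, arbitrary prior
# covariance P, and an arbitrary positive-definite regression weight W.
H = np.array([[1.0, 0.3], [0.0, 1.0]])
P = np.array([[2.0, 0.4], [0.4, 1.5]])
W = np.diag([1.0, 3.0])
s_prior = rng.normal(size=2)   # predicted state (has no influence here)
z = rng.normal(size=2)         # observation

# Kalman update with zero observation-noise covariance (cf. (A.5)).
K = P @ H.T @ np.linalg.inv(H @ P @ H.T)
s_kalman = s_prior + K @ (z - H @ s_prior)

# Weighted least-squares estimate (cf. (A.7)).
s_ls = np.linalg.inv(H.T @ W @ H) @ H.T @ W @ z

print(np.allclose(s_kalman, s_ls))  # True: both reduce to inv(H) @ z
```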

Acknowledgment

I am deeply indebted to my supervisor, Professor A. Cavallaro of Queen Mary University of London, whose help, stimulating suggestions, and encouragement helped me in the completion of this work.