Surveillance systems capable of autonomously monitoring vast areas are an emerging trend, particularly when wide-angle cameras are combined with pan-tilt-zoom (PTZ) cameras in a master-slave configuration. The use of fish-eye lenses allows the master camera to maximize the coverage area while the PTZ acts as a foveal sensor, providing high-resolution images of regions of interest. Despite the advantages of this architecture, the mapping between image coordinates and pan-tilt values is the major bottleneck in such systems, since it depends on depth information and fish-eye effect correction. In this paper, we address these problems by exploiting geometric cues to perform height estimation. This information is used both for inferring 3D information from a single static camera deployed at an arbitrary position and for determining lens parameters to remove fish-eye distortion. When compared with the previous approaches, our method has the following advantages: (1) fish-eye distortion is corrected without relying on calibration patterns; (2) 3D information is inferred from a single static camera placed at an arbitrary location in the scene.

1. Introduction

The coexistence of humans and video surveillance cameras in outdoor environments is becoming commonplace in modern societies. This new paradigm has raised interest in automated surveillance systems capable of inferring useful information from the scene (e.g., person identification, action recognition, and abnormal event detection). However, these systems are designed for monitoring vast areas, which greatly reduces the resolution of regions of interest.

To address this issue, several approaches have exploited pan-tilt-zoom (PTZ) cameras, since the mechanical properties of these devices allow zooming in on arbitrary scene locations. Most PTZ-based methods adopt a master-slave configuration, where a static camera monitors a large surveillance area to instruct the PTZ camera to zoom in on regions of interest. While several advantages can be outlined, intercamera calibration is the major bottleneck of this configuration, since an accurate mapping from image coordinates to pan-tilt space requires depth information and distortion correction, as illustrated in Figure 1. The existing approaches [1–4] rely on rough approximations or on the use of multiple static devices to perform triangulation. Also, they assume the pin-hole model for the static camera. This assumption is highly restrictive, since in surveillance scenarios fish-eye lenses are commonly used to increase the coverage area and the distortion introduced by these lenses is nonnegligible.

In this paper, we propose a master-slave calibration algorithm capable of both removing the fish-eye distortion and inferring an accurate mapping between the static and the active camera without requiring calibration patterns. Our approach exploits geometric cues, which are typically available in urban environments, to measure objects in the scene. As in [5], the vanishing line of a reference plane in the scene and one vertical vanishing point are used to infer the height of static objects or subjects walking throughout a surveillance scenario. This information has a twofold goal: to determine the properties of a fish-eye lens and to determine the 3D position of a subject. In the former, the height of an object is exploited to determine the angle of view and the projection type of the lens, so that the image coordinates can be rectified according to the pin-hole projective transform. In the latter, the subject's height is imposed on the projective transform to determine the subject's 3D location, enabling the correct estimation of pan-tilt values.

When compared with the previous approaches, our method has the following advantages: fish-eye distortion is corrected without relying on calibration patterns; 3D information is inferred from a single static camera; cameras can be placed at arbitrary locations in the scene.

The remainder of this paper is organized as follows. Section 2 summarizes the most relevant master-slave approaches as well as the existing fish-eye correction strategies. Section 3 describes the proposed method. The experimental evaluation of the proposed algorithm is presented and discussed in Section 4. Finally, Sections 5 and 6 outline the major conclusions of this work and its future direction.

2. Related Work

Most fish-eye correction approaches focus on defining a mapping from the viewing sphere to the view plane using polynomial functions or fish-eye projection models [6–8]. Straight line preservation is a strategy commonly used to infer the correction models, for which two distinct approaches have been proposed: the use of planar calibration patterns [9–14] and automatic extraction of geometric constraints from the scene. In the former, a set of calibration points, arranged in straight lines, is used to minimize the line curvature or, when full calibration is considered, the reprojection error. The latter uses a set of automatically detected key points to impose epipolar geometry constraints in multiple views of the scene [15–18]. Another strategy is to use semiassisted straight line detection [19].

Regarding the integration of fish-eye correction in master-slave systems, [20] is the only work that proposed a fully integrated system. However, this approach does not take depth information into account, which makes the mapping between both devices an ill-posed problem. To alleviate the mapping inaccuracies, the cameras are assumed to be side-by-side.

To address the lack of depth information in master-slave systems, a large number of approximations have been proposed. The use of manually constructed look-up tables [21] or linear interpolations [22, 23] is one alternative to perform the static to pan-tilt mapping. To alleviate the burden of manual mapping, automatic calibration approaches infer an approximate relation between camera images using feature point matching [20].

Some alternative approaches have also been presented in [2, 24, 25]. In [24] multiple consecutive frames were used to approximate target depth. However, this strategy is time-consuming and, consequently, increases the delay between issuing the order and directing the PTZ. You et al. [25] estimated the relationship between the static and the active camera using a homography for each image of the mosaic derived from the slave camera. Del Bimbo et al. [2] relied on feature point matching to automatically estimate a homography ($H$), relating the master and slave views with respect to the reference plane. $H$ is used to perform an online mapping between the feet locations in the master and the slave camera and also to determine the reference plane vanishing line from the one manually marked on the static view. Despite being capable of determining head location, this strategy has to set the active camera at an intermediate zoom level to cope with the uncertainties of vanishing line location. In contrast to the previous approaches, the use of multiple static cameras has also been introduced to solve the lack of depth information in master-slave systems. However, these systems either rely on stereographic reconstruction [26], which is computationally expensive, or arrange the cameras in a specific configuration to ease object triangulation [3, 4], which is not practical for real-world scenarios.

3. Proposed Method

In this section, the proposed method is divided into two distinct phases: the fish-eye correction method and the master-slave calibration algorithm. The former is used to rectify the image coordinates to the perspective projection, on which our master-slave calibration depends. The latter shows how to determine the 3D position of a subject's head in the scene and the corresponding pan and tilt values.

3.1. Fish-Eye Correction

While the pin-hole camera projection can be modelled by the perspective projection $r = f\tan\theta$, fish-eye lenses introduce one of the projections described in Table 1, where $r$ is the distance to the principal point, $f$ the focal distance, and $\theta$ the angle between the incident ray and the optical axis. Figure 2 illustrates the effect of fish-eye lenses on the projection of an incident ray when compared with the perspective projection of the pin-hole model. $r_f$ and $r_p$ represent the radial positions where a ray is projected when a fish-eye lens is used and when it is not, respectively. This model shows that the radial position yielded by a perspective projection can be recovered by establishing a relation between $r_f$ and $r_p$. Although a more general model exists, the polynomial fish-eye transform (PFET) [27], it requires a larger amount of ground truth data, and for the majority of lenses the fish-eye projection models described in Table 1 are a good approximation [28].

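Given the pin-hole camera projection model $r_p = f\tan\theta$ and a fish-eye projection model $r_f = f\,\varphi(\theta)$, a relation between $r_p$ and $r_f$ is given by

$$r_p = f\tan\left(\varphi^{-1}\left(\frac{r_f}{f}\right)\right), \tag{1}$$

where $\varphi$ is one fish-eye projection function.

Considering that $f$ is necessary to define (1), and that a ray at $\theta = \alpha/2$ is imaged at the horizontal border of the image ($r_f = w/2$), it can be determined by

$$f = \frac{w}{2\,\varphi(\alpha/2)}, \tag{2}$$

and thus

$$r_p = \frac{w}{2\,\varphi(\alpha/2)}\tan\left(\varphi^{-1}\left(\frac{2\,\varphi(\alpha/2)\,r_f}{w}\right)\right), \tag{3}$$

where $\alpha$ is the horizontal angle of view and $w$ represents the image width in pixels. While determining $w$ is trivial, $\alpha$ and $\varphi$ require knowledge about the lens properties, which is often unavailable.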
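As a minimal sketch of the radial correction described above, the following Python code maps a fish-eye radial position to its pin-hole counterpart. Table 1 is not reproduced in this text, so the four classical projection models used here (equidistant, stereographic, equisolid, orthographic), the normalized form $r_f = f\,\varphi(\theta)$, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math

# Normalized fish-eye projections r_f / f = phi(theta), paired with their inverses.
# ASSUMPTION: these four classical models stand in for Table 1.
PROJECTIONS = {
    "equidistant": (lambda t: t, lambda u: u),
    "stereographic": (lambda t: 2 * math.tan(t / 2), lambda u: 2 * math.atan(u / 2)),
    "equisolid": (lambda t: 2 * math.sin(t / 2), lambda u: 2 * math.asin(u / 2)),
    "orthographic": (lambda t: math.sin(t), lambda u: math.asin(u)),
}


def focal_length(alpha, width, kind):
    """Focal length in pixels, assuming theta = alpha / 2 is imaged at r_f = width / 2."""
    phi, _ = PROJECTIONS[kind]
    return width / (2 * phi(alpha / 2))


def undistort_radius(r_f, alpha, width, kind):
    """Map a fish-eye radial distance r_f to the pin-hole radial distance r_p."""
    _, phi_inv = PROJECTIONS[kind]
    f = focal_length(alpha, width, kind)
    theta = phi_inv(r_f / f)  # incidence angle recovered from the fish-eye model
    return f * math.tan(theta)  # perspective projection of the same ray


def undistort_point(x, y, cx, cy, alpha, width, kind):
    """Rectify an image point radially about the principal point (cx, cy)."""
    dx, dy = x - cx, y - cy
    r_f = math.hypot(dx, dy)
    if r_f == 0:
        return (x, y)
    scale = undistort_radius(r_f, alpha, width, kind) / r_f
    return (cx + scale * dx, cy + scale * dy)
```

Note that $\tan\theta$ diverges as $\theta$ approaches 90 degrees, so rays near the border of a very wide lens map to very large radii in the rectified image.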

As such, we argue that the height of scene objects can be used to estimate $\alpha$ and $\varphi$. The insight behind this idea is that image-based height estimation methods rely on the pin-hole camera model and thus yield incorrect height measurements in distorted images. Therefore, fish-eye correction is regarded as a minimization problem, where the correct lens parameters are the ones that minimize the height estimation error in the corrected image.

In order to perform height estimation from a single camera, we build on the work of Criminisi et al. [5]. We use three vanishing points, $\mathbf{v}_x$, $\mathbf{v}_y$, and $\mathbf{v}_z$, for the $X$, $Y$, and $Z$ axes, determined by the intersection of parallel lines (points at infinity) drawn manually in the image. $\mathbf{v}_x$ and $\mathbf{v}_y$ are determined from parallel lines contained in the reference plane, so that the line defined by these points, $\mathbf{l} = \mathbf{v}_x \times \mathbf{v}_y$, represents the plane vanishing line. The point $\mathbf{v}_z$ does not belong to the reference plane since it is the intersection of two parallel lines perpendicular to the reference plane.
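The construction of vanishing points from manually drawn segments can be sketched with homogeneous coordinates, where both the line through two points and the intersection of two lines are cross products. The helper names below are hypothetical and the code is only an illustration of the geometry, not the authors' annotation tool.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(segment_a, segment_b):
    """Homogeneous intersection of the lines supporting two annotated segments.

    Each segment is a pair of (x, y) points. A third coordinate of zero means
    the lines are also parallel in the image (vanishing point at infinity).
    """
    return cross(line_through(*segment_a), line_through(*segment_b))


def vanishing_line(v_x, v_y):
    """Vanishing line of the reference plane from its two vanishing points."""
    return cross(v_x, v_y)
```

Keeping the result in homogeneous form avoids a division by zero when the annotated lines happen to be parallel in the image.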

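Given the vanishing line $\mathbf{l}$, the vertical vanishing point $\mathbf{v}_z$, and the top ($\mathbf{t}$) and bottom ($\mathbf{b}$) points of an object in the image, the height $h$ of the object can be obtained by

$$h = -\frac{\lVert \mathbf{b}\times\mathbf{t}\rVert}{\mu\,(\mathbf{l}\cdot\mathbf{b})\,\lVert \mathbf{v}_z\times\mathbf{t}\rVert}, \tag{4}$$

where $\mu = -\lVert \mathbf{b}_r\times\mathbf{t}_r\rVert / \left(h_r\,(\mathbf{l}\cdot\mathbf{b}_r)\,\lVert \mathbf{v}_z\times\mathbf{t}_r\rVert\right)$, whereas $\mathbf{t}_r$ and $\mathbf{b}_r$ are the top and base points of a reference object in the image with height equal to $h_r$.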
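A minimal sketch of single-view height estimation in the spirit of Criminisi et al. [5] is given below. It assumes undistorted image points, a vanishing line and a vertical vanishing point in homogeneous coordinates, and one reference object of known height; the scale of the vanishing line and vanishing point cancels because the same quantities appear for the object and for the reference. Function names are hypothetical.

```python
import math


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def _scaled_height(line, v_z, base, top):
    """Height up to the unknown metric factor: |b x t| / (|l . b| |v_z x t|)."""
    b = (base[0], base[1], 1.0)
    t = (top[0], top[1], 1.0)
    return norm(cross(b, t)) / (abs(dot(line, b)) * norm(cross(v_z, t)))


def estimate_height(line, v_z, base, top, ref_base, ref_top, ref_height):
    """Metric height of an object, calibrated by a reference object of known height.

    line: vanishing line of the ground plane (homogeneous 3-vector).
    v_z: vertical vanishing point (homogeneous 3-vector).
    base, top, ref_base, ref_top: (x, y) image points.
    """
    return (
        ref_height
        * _scaled_height(line, v_z, base, top)
        / _scaled_height(line, v_z, ref_base, ref_top)
    )
```

For a synthetic camera 2 m above the ground looking horizontally (horizon $y = 0$, vertical vanishing point at infinity), a 1 m reference at depth 4 and a 1.8 m subject at depth 5 are recovered correctly by this routine.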

Considering that the vanishing points are marked on the original image, (3) is used to correct their locations and estimate the height of an object with respect to the lens parameters, hereinafter denoted by $h_{\alpha,\varphi}$. Given the height $h$ of an object in the scene, the angle of view ($\alpha$) and the projection type ($\varphi$) can be estimated by

$$(\hat{\alpha}, \hat{\varphi}) = \arg\min_{\alpha,\,\varphi}\,\lvert h_{\alpha,\varphi} - h \rvert. \tag{5}$$
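The minimization over lens parameters can be sketched as an exhaustive search over candidate angles of view and projection types. The text does not state which optimizer is used, so the grid search below is an assumption, and `estimate_height` is a hypothetical user-supplied callable that rectifies the annotations for a given parameter pair and returns the resulting height estimate.

```python
def fit_lens_parameters(estimate_height, true_height, alphas, kinds):
    """Return (alpha, kind, error) minimizing |estimated height - true height|.

    estimate_height: callable (alpha, kind) -> height measured in the image
        rectified with those lens parameters (hypothetical, user supplied).
    alphas: iterable of candidate angles of view (radians).
    kinds: iterable of candidate projection types.
    """
    best = None
    for kind in kinds:
        for alpha in alphas:
            error = abs(estimate_height(alpha, kind) - true_height)
            if best is None or error < best[2]:
                best = (alpha, kind, error)
    if best is None:
        raise ValueError("no candidate lens parameters were supplied")
    return best
```

With several reference objects, the same search could minimize the mean absolute error instead of the error of a single measurement.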

3.2. Master-Slave Calibration

First, we introduce the notation used to describe the proposed master-slave calibration algorithm:
(i) $(X, Y, Z)$: the 3D world coordinates.
(ii) $(X_s, Y_s, Z_s)$: the 3D coordinates in the static camera reference frame.
(iii) $(X_p, Y_p, Z_p)$: the 3D coordinates in the PTZ camera reference frame.
(iv) $(x_s, y_s)$: the 2D coordinates in the static camera reference frame.
(v) $(x_p, y_p)$: the 2D coordinates in the PTZ camera reference frame.
(vi) $(x_h, y_h)$: the head position of a subject in the static camera image plane.
(vii) $(\theta_p, \theta_t)$: the pan and tilt parameters of the PTZ camera.

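In the pin-hole camera model, the projective transformation of 3D scene points onto the 2D image plane is governed by

$$\lambda\begin{bmatrix}x\\ y\\ 1\end{bmatrix} = K\,[R \mid T]\begin{bmatrix}X\\ Y\\ Z\\ 1\end{bmatrix} = P\begin{bmatrix}X\\ Y\\ Z\\ 1\end{bmatrix}, \tag{6}$$

where $\lambda$ is a scalar factor and $K$ and $[R \mid T]$ represent the intrinsic and extrinsic camera matrices, which define the projection matrix $P$.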

Let $(x_h, y_h)$ denote the head position of a subject in the static camera image plane. Solving (6) for $(X, Y, Z)$ yields an underdetermined system, that is, infinitely many possible 3D locations for this point. As such, we propose to solve (6) by determining one of the 3D components beforehand.

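By assuming a world coordinate system (WCS) where the $XY$ plane corresponds to the reference ground plane of the scene, the $Z$ component of a subject's head corresponds to its height ($h$). The use of height information reduces (6) to

$$\begin{bmatrix}X\\ Y\\ \lambda\end{bmatrix} = -\left[\mathbf{p}_1\;\;\mathbf{p}_2\;\;-\tilde{\mathbf{x}}_h\right]^{-1}\left(h\,\mathbf{p}_3 + \mathbf{p}_4\right), \tag{7}$$

where $\lambda$ is a scalar factor, $\tilde{\mathbf{x}}_h = (x_h, y_h, 1)^T$ is the head position in homogeneous coordinates, and $\mathbf{p}_i$, $i \in \{1, 2, 3, 4\}$, is the set of column vectors of the projection matrix $P$ (refer to the Appendix for the demonstration of (7)). Consequently, our algorithm works on the static camera to determine $h$ and infer the subject position in the WCS using its height.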
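To illustrate how a known height makes the back-projection well posed, the sketch below solves the 3x3 linear system obtained from the pin-hole projection when the world $Z$ coordinate is fixed to the subject's height. The formulation (unknowns $X$, $Y$, and the scale factor) and the function names are assumptions chosen for this example; the projection matrix is passed as a 3x4 nested list.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def solve3(a, b):
    """Solve the 3x3 system a @ x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("degenerate configuration: system is singular")
    solution = []
    for j in range(3):
        replaced = [[b[i] if k == j else a[i][k] for k in range(3)] for i in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def locate_head(P, x_h, y_h, height):
    """World position (X, Y, Z) of a head imaged at (x_h, y_h) with Z = height.

    Uses X*p1 + Y*p2 - lam*(x_h, y_h, 1) = -(height*p3 + p4), where p1..p4 are
    the columns of the 3x4 projection matrix P.
    """
    x = (x_h, y_h, 1.0)
    a = [[P[i][0], P[i][1], -x[i]] for i in range(3)]
    b = [-(height * P[i][2] + P[i][3]) for i in range(3)]
    X, Y, _lam = solve3(a, b)
    return (X, Y, height)
```

The image coordinates passed to this routine are expected to be already rectified, since the derivation relies on the pin-hole model.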

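Assuming that there is no displacement between the PTZ center of rotation and the optical center, the coordinates of the 3D world point $(X, Y, h)$ in the PTZ reference frame are given by

$$\begin{bmatrix}X_p\\ Y_p\\ Z_p\end{bmatrix} = [R_p \mid T_p]\begin{bmatrix}X\\ Y\\ h\\ 1\end{bmatrix}, \tag{8}$$

where $[R_p \mid T_p]$ is the extrinsic matrix of the PTZ camera.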

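The corresponding pan and tilt angles can therefore be obtained by

$$\theta_p = \arctan\left(\frac{X_p}{Z_p}\right), \qquad \theta_t = \arcsin\left(\frac{Y_p}{\sqrt{X_p^2 + Y_p^2 + Z_p^2}}\right). \tag{9}$$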
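The conversion from a world point to pan and tilt angles can be sketched as follows. The axis convention is an assumption made for this example: the PTZ optical axis at zero pan and tilt is the $Z$ axis, pan rotates about the vertical ($Y$) axis, and tilt is the elevation of the ray from the $XZ$ plane. The rotation `R` (3x3) and translation `T` (3-vector) of the PTZ camera are hypothetical inputs, and the PTZ center of rotation is taken to coincide with its optical center.

```python
import math


def to_ptz_frame(R, T, point):
    """Express a 3D world point in the PTZ frame: R @ point + T."""
    return tuple(
        sum(R[i][k] * point[k] for k in range(3)) + T[i] for i in range(3)
    )


def pan_tilt(x_p, y_p, z_p):
    """Pan and tilt angles (radians) that point the optical axis at (x_p, y_p, z_p).

    ASSUMPTION: Z is the optical axis at pan = tilt = 0 and Y is the vertical axis.
    """
    distance = math.sqrt(x_p * x_p + y_p * y_p + z_p * z_p)
    if distance == 0:
        raise ValueError("target coincides with the PTZ center")
    pan = math.atan2(x_p, z_p)  # atan2 keeps the correct quadrant
    tilt = math.asin(y_p / distance)
    return pan, tilt
```

Using `atan2` rather than a plain arctangent keeps the pan angle valid for targets behind the camera, where the ratio alone would be ambiguous.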

Considering that both the fish-eye correction and the master-slave calibration algorithms depend on an accurate height estimation, it is important to note that the ground is assumed to be approximately planar. The validity of our method in approximately planar scenarios is assessed in Section 4.

4. Experimental Results

In this section, the evaluation of the proposed method is divided into two distinct phases: fish-eye correction and estimation of the image coordinate to pan-tilt mapping.

4.1. Performance Evaluation: Fish-Eye Correction

The proposed fish-eye correction method was tested using a surveillance camera equipped with a fish-eye lens installed in an outdoor parking lot. Three pairs of parallel lines were manually annotated on the distorted image to estimate the location of one vertical and two horizontal vanishing points. Additionally, two reference objects were annotated and measured as depicted in Figure 3(a). These data were used to estimate the height deviation with respect to $\alpha$ using the different fish-eye projection functions, and the attained results are presented in Figure 3(b). The comparative analysis between the different fish-eye projection types supports the idea that lens parameters can be inferred by minimizing the error of automatic height estimation. According to (5), the pair $(\alpha, \varphi)$ with the lowest height deviation would be chosen as the lens parameters, which constitutes a good approximation to the real angle of view of the lens.

In order to validate the effectiveness of our approach, a comparison with pattern-based approaches (CB) was conducted by determining the average reprojection error when calibrating the camera using images corrected with the different strategies. For this purpose, a checkerboard was used, and 60 marks were placed in the scene, with their image and world coordinates manually determined. In both strategies, the intrinsic and extrinsic parameters of the camera were determined with the method described in [29]. The distribution of the reprojection error using both strategies is presented in Figure 4(a), whereas Figures 4(b) and 4(c) illustrate the displacement between the correct positions (in green) and the projected positions (in red) for CB and our method, respectively. The comparative analysis of the reprojection error of both approaches indicates that the proposed method is a good approximation to typical fish-eye removal approaches without requiring a planar calibration pattern.

Additionally, a comparative analysis of the height estimation performance was conducted. This performance was measured with respect to the deviation from the true height of the target. The height of a human being was used to assess this deviation in 50 different scene locations, as illustrated in Figure 4(e). As shown in Figure 4(d), the distribution of the height deviation is highly similar for both approaches, and on average an accurate height estimation is attained.

4.2. Performance Evaluation: Intercamera Calibration

To assess the accuracy of the proposed approach, we used the following procedure: given $(x_h, y_h)$ and its corresponding point $(x_p, y_p)$ in the PTZ image, the algorithm error ($\Delta\theta$) was determined by the angular difference between the estimated $(\theta_p, \theta_t)$ and the 3D ray associated with $(x_p, y_p)$. When compared with the typical reprojection error, this strategy is advantageous since it allows a direct comparison with the camera angle of view.
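The angular error can be sketched as the angle between two unit rays, one built from the estimated pan and tilt and one from the reference direction. The axis convention (optical axis along $Z$ at zero pan and tilt, tilt measured as elevation from the $XZ$ plane) and the function names are assumptions for this illustration; the reference direction is given here as pan and tilt angles as well.

```python
import math


def direction(pan, tilt):
    """Unit ray for the given pan/tilt, with Z the optical axis at pan = tilt = 0."""
    return (
        math.cos(tilt) * math.sin(pan),
        math.sin(tilt),
        math.cos(tilt) * math.cos(pan),
    )


def angular_error(estimated, reference):
    """Angle in radians between two viewing directions given as (pan, tilt) pairs."""
    a = direction(*estimated)
    b = direction(*reference)
    cosine = sum(x * y for x, y in zip(a, b))
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    return math.acos(max(-1.0, min(1.0, cosine)))
```

Because the result is an angle, it can be compared directly with the PTZ field of view at a given zoom level.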

To assess the overall performance of our approach, three different persons were recorded (comprising more than 300 frames) by both the static and the active camera while walking throughout a surveillance scenario. Both PTZ and wide-view images were annotated to mark the pixel location of the head and feet. Using these data, the system was evaluated with respect to the angular error $\Delta\theta$, which was useful to determine whether an object of interest would be successfully imaged when using the PTZ at maximum zoom.

Figure 4(f) illustrates the attained results for the proposed method with respect to the pan and tilt error. The results indicate that, in the majority of cases, the displacement between the estimated pan-tilt values and the center of the region of interest is less than the field of view of the PTZ camera when using a 30x zoom magnification, which corresponds to the maximum capability of state-of-the-art PTZ cameras.

5. Conclusions

In this paper, we introduced a master-slave calibration algorithm capable of removing fish-eye distortion and accurately estimating the mapping from the image coordinates to pan-tilt space without depending on calibration patterns. The geometrical cues typically available in urban scenes were exploited to perform height estimation, which can be used to infer the parameters of fish-eye lenses and also the 3D position of subjects in the scene.

An experimental evaluation in a real surveillance scenario provided evidence that fish-eye correction based on height estimation attains highly similar results to typical pattern-based approaches. Regarding the master-slave calibration algorithm, the pan and tilt errors of the method are confined to a tight range of values which in the majority of the cases do not exceed the PTZ field of view.

6. Further Work

In the future, we aim to determine how this approach can be extended to more general fish-eye correction models while keeping the amount of ground truth data as low as possible. For that purpose, we will investigate whether multiple height measurements extracted from a walking human are informative enough to infer the correct parameters of a PFET.


Appendix

Determining 3D Position from the Inverse Projective Transform

An explanation of the relation between (6) and (7) is given below.

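A complete representation of (6), for a head imaged at $(x_h, y_h)$ with world height $Z = h$, is given by

$$\lambda\begin{bmatrix}x_h\\ y_h\\ 1\end{bmatrix} = \begin{bmatrix}p_{11} & p_{12} & p_{13} & p_{14}\\ p_{21} & p_{22} & p_{23} & p_{24}\\ p_{31} & p_{32} & p_{33} & p_{34}\end{bmatrix}\begin{bmatrix}X\\ Y\\ h\\ 1\end{bmatrix}, \tag{A.1}$$

from where we get the following equations:

$$\begin{aligned}\lambda x_h &= p_{11}X + p_{12}Y + p_{13}h + p_{14},\\ \lambda y_h &= p_{21}X + p_{22}Y + p_{23}h + p_{24},\\ \lambda &= p_{31}X + p_{32}Y + p_{33}h + p_{34}.\end{aligned} \tag{A.2}$$

Equation (A.2) can be equivalently written using homogeneous coordinates as

$$X\,\mathbf{p}_1 + Y\,\mathbf{p}_2 - \lambda\,\tilde{\mathbf{x}}_h = -\left(h\,\mathbf{p}_3 + \mathbf{p}_4\right), \tag{A.3}$$

which can be combined in

$$\left[\mathbf{p}_1\;\;\mathbf{p}_2\;\;-\tilde{\mathbf{x}}_h\right]\begin{bmatrix}X\\ Y\\ \lambda\end{bmatrix} = -\left(h\,\mathbf{p}_3 + \mathbf{p}_4\right), \tag{A.4}$$

where $\tilde{\mathbf{x}}_h = (x_h, y_h, 1)^T$ and $\mathbf{p}_i$ denotes the $i$th column of the projection matrix. Inverting the $3 \times 3$ matrix on the left-hand side yields (7).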

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.