Mobile Systems Design Laboratory, Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY 11794, USA
Department of Electronics Engineering, College of Information Technology, Ajou University, Suwon 443-749, South Korea
Abstract
We present a simplified algorithm for localizing an object using multiple visual images that are obtained from widely used digital imaging devices. We use a parallel projection model which supports both zooming and panning of the imaging devices. Our proposed algorithm is based on a virtual viewable plane
for creating a relationship between an object position and a reference coordinate. The reference point is obtained from a rough estimate, which may come from a preestimation process. The algorithm minimizes the localization error through an iterative process with relatively low computational complexity. In addition, the nonlinear distortion of the digital imaging devices is compensated during the iterative process. Finally, the performance in several scenarios is evaluated and analyzed in both indoor and outdoor environments.
1. Introduction
Object localization is one of the key operations
in many tracking applications such as surveillance and monitoring
[1–8]. In these tracking systems,
the accuracy of the object localization is critical and poses a
considerable challenge. Most localization methods use the geometric relationship
between the object and the sensors. Acoustic sensors have been widely used in many
localization applications due to their flexibility, low cost, and easy
deployment. An acoustic sensor provides the direction (angle) of
the source with respect to the sensor coordinates, which is used to construct the
geometry for localization. However, an acoustic sensor is extremely sensitive
to its surrounding environment, produces noisy data, and does not fully satisfy the requirement
of consistent data [9]. Thus, as a reliable alternative, visual sensors
are also often used in tracking and monitoring systems [10, 11]. Visual localization
has the potential to yield a noninvasive, accurate, and low-cost solution [12–14].
Multiple-image-based
multiple-object detection and tracking are used in indoor and outdoor
surveillance, and give a detailed and complete history of the actions of an
object of interest [2, 15, 16]. Object tracking can be reduced to a
2D tracking problem on the ground plane [2, 17–19]. Correspondences across multiple images
can be established by using field-of-view lines [2, 20]. In addition, selecting
the best view of objects of interest requires camera movement such as zooming
and panning [19].
There are many localization methods which use image
sensors [5, 6, 13, 21–25]. Most conventional
localization methods follow two steps of operation. Initially, the camera
parameters are computed offline using known objects or pattern images. Then
using additional information such as control points in the scene or techniques
such as structure from motion, the relative displacements of a camera are
estimated [21, 26]. In essence, these studies
localize objects through 3D reconstruction. Once a sufficient
number of points is observed in multiple images from different positions, it is
mathematically possible to deduce the locations of the points as well as the
positions of the original cameras, up to a known factor of scale [21]. In the localization method
based on a perspective projection model, the camera calibration is critical.
The calibration usually uses a flat plate with a regular pattern [14, 27, 28]. However, in many
applications, it is not easy to obtain calibration patterns [29, 30]. To reduce the dependence on calibration
patterns, some methods based on self-calibration use point matching across
image sequences [29–34]. In these methods,
image feature extraction must be very accurate since this procedure is very
sensitive to noise [21, 27, 35]. Moreover, if a pair of stereo images for a single
scene is not calibrated and the motions between two images are unknown, the
image matching requires prohibitively high complexity [27, 34–36].
Localization based on affine
reconstruction can be used for object localization without
complex calibration [37–40].
Essentially, the relationships between physical space and the geometric properties of
a set of cameras are considered. The method uses two uncalibrated perspective
images where an image is induced by the plane at infinity [37–39, 41–44]. In particular, the factorization method based on the
paraperspective projection model can be used for localization [42, 44, 45]. In [42], three well-known
approximations to full-perspective projection (orthography, weak perspective, and paraperspective) are
related within the
affine projection model. In [44, 45], shape and motion recovery is used to reduce the
complexity of depth computation. However, localization based on the
affine structure requires at least five correspondences in two images [37–39]. In contrast, our
proposed method requires only one correspondence (i.e., the centroid coordinate
of the detected object) in two images, where each correspondence represents the
same object. Thus, the critical requirements of an effective localization
algorithm in tracking applications are computational simplicity, with a
simpler model in which 3D reconstruction is not necessary, and robust
adaptation to camera movement during tracking (i.e., zooming and panning)
without requiring any additional calibration of the imaging devices from the images.
The contribution of this paper is an efficient, simplified localization method
that requires neither 3D reconstruction nor complex calibration.
In this paper, we propose a simplified algorithm for
localizing multiple objects in a multiple-camera environment, where images are
obtained from traditional digital imaging devices. Figure 1 illustrates the
application model, in which multiple people are localized in a multiple-camera
environment. The cameras can move freely with zooming and panning capabilities.
Within a tracking environment, the proposed method uses detected object points
to find the object location. We use 2D global coordinates to represent the
object location. In our localization algorithm, the distance between an object
and a camera is provided by a reference point. Since the reference point is
initially a rough estimate, we are motivated to obtain a more accurate
reference point. To this end, we use an iterative process which replaces the
reference point with the previously localized position, which is closer to the
real object location. In addition, the proposed
localization method has the advantage of using a zooming factor without
requiring the focal length. Thus, the
computation of an object's position is simplified while
supporting both zooming and panning. Furthermore, the localization
algorithm compensates for nonideal properties such as the optical
characteristics of a camera lens.
Figure 1: Illustration of the model of application.
The rest of this paper is organized as follows.
Section 2 briefly describes a parallel projection model with a single camera.
Section 3 illustrates the visual localization algorithm in a 2D coordinate
with multiple cameras. In Section 4, we present analysis and simulation
results where the localization errors are minimized by compensating for
nonlinearity of the digital imaging devices. An application that uses the
proposed algorithm for tracking people within a closed environment is
illustrated. Section 5 concludes the paper.
2. Characterization of Viewable Images
2.1. Basic Concept of a Parallel Projection Model
In this section, we introduce a parallel projection
model to simplify the visual localization. The model comprises
three planes: an object plane, a virtual viewable plane, and an actual camera
plane. In Figure 2, an object
placed on an object plane is projected to both
a virtual viewable plane and an actual camera plane, and
denotes the projected object point on the
virtual viewable plane. The distance
denotes the distance between a virtual
viewable plane and an object plane.
and
denote the position of projected object
on both the actual camera plane and the
virtual viewable plane. The virtual viewable plane is
in parallel with the object plane by distance
.
and
denote each length of the virtual viewable
plane and the actual camera plane, respectively. The virtual viewable plane is
for the connection between the position
on the object plane and the position
on the actual camera plane; it has an advantage
of simplifying the computation process.
Figure 2: Illustration of the parallel projection model.
Since the size of image sensor is much smaller than
the virtual viewable plane, the viewable range starts from a point
.
Thus the camera model of parallel projection model is similar to a pin-hole
camera. All planes are represented as
- and
-axes but we use
-axis for the explanation of the parallel
projection model in this section. Since
represents the origin of both the virtual
viewable plane and the camera plane, two planes are placed on the same camera
position. However, in Figure 2, we drew two planes separately to show the
relationship between three planes.
In the parallel projection model, an object is
projected from an object plane through a virtual viewable plane to an actual
camera plane. Hence, as formulated in (1),
is expressed as
,
,
and
through the proportional lines of two planes
as the following:
(1)
Thus the object
is represented from
and the distance
between the virtual viewable plane and the
object plane.
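The proportional relation behind (1) can be sketched in a few lines of Python. This is a minimal illustration under assumed notation, not the authors' implementation: `u_cam` stands for the position on the actual camera plane, `len_cam` and `len_virtual` for the lengths of the actual camera plane and the virtual viewable plane, and the function name `camera_to_virtual` is hypothetical.

```python
def camera_to_virtual(u_cam: float, len_cam: float, len_virtual: float) -> float:
    """Map a position on the actual camera plane to the virtual viewable plane.

    Assumption: both positions are measured from the shared origin and the
    two planes are related by simple proportionality of their lengths.
    """
    if len_cam <= 0:
        raise ValueError("len_cam must be positive")
    return u_cam * (len_virtual / len_cam)


# Hypothetical numbers: a 0.02 m wide sensor and a 2.0 m wide virtual viewable plane.
u_virtual = camera_to_virtual(u_cam=0.005, len_cam=0.02, len_virtual=2.0)
print(u_virtual)
```

The point of the sketch is that no focal length appears: only the ratio of the two plane lengths is needed.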
2.2. Zooming and Panning
Since the size of the virtual viewable plane and the
object plane are proportional to the distance between the object and the camera
(
), the length of the virtual viewable plane (
) is derived from the distance
and the viewable range.
Zooming factor represents the relationship between
and
.
The zooming factor
is defined as a ratio of
and
as follows:
(2)
Since both
and
use metric units, zooming factor
is a constant.
Figure 3 illustrates the model of zooming in terms of
two different zooming factors. Even though the zooming factor of a camera has
changed from
to
,
if the distance between object and camera is not changed, the position of
projected object on the virtual viewable plane is not changed. In the figure,
since the distance
is equal to
the distance
,
the position of the object on the virtual viewable plane is invariant but the
position on the actual camera plane is variant. Thus the distance
is equal to
but the distance
is different from
the distance
.
The projected positions
and
on the actual camera
planes 1 and 2 are expressed as
and
.
Since
and
,
the relationship between
and
is represented as
.
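The zooming behaviour described above can be checked numerically with a small sketch. It assumes, purely for illustration, that the zooming factor is the ratio of the object distance to the length of the virtual viewable plane; the direction of this ratio, the function names, and all numbers are assumptions rather than values taken from the paper.

```python
def virtual_plane_length(distance: float, zoom: float) -> float:
    """Length of the virtual viewable plane, assuming zoom = distance / length."""
    return distance / zoom


def project_to_camera(u_virtual: float, distance: float, zoom: float, len_cam: float) -> float:
    """Scale a position on the virtual viewable plane down to the actual camera plane."""
    return u_virtual * len_cam / virtual_plane_length(distance, zoom)


# Same object, same distance, two hypothetical zooming factors.
distance, len_cam, u_virtual = 3.0, 0.02, 0.3
zoom1, zoom2 = 1.0, 2.0
u_cam1 = project_to_camera(u_virtual, distance, zoom1, len_cam)
u_cam2 = project_to_camera(u_virtual, distance, zoom2, len_cam)

# The position on the virtual viewable plane is unchanged, while the
# positions on the actual camera plane scale with the zooming factors.
print(u_cam1, u_cam2, u_cam2 / u_cam1)
```

Under this assumption, the ratio of the two camera-plane positions equals the ratio of the two zooming factors, which is the kind of relationship the paragraph above states for the two actual camera planes.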
Figure 3: Illustration of the model of zooming in terms of two
different zooming factors.
Figure 4 illustrates a special case in which two
different objects denoted
and
are projected to the same spot on the actual
camera plane.
and
denote the projected objects on the virtual
viewable planes 1 and 2.
Figure 4: Illustration of a special case in which different
objects are projected to the same spot on the actual camera plane.
The objects
and
are projected to a point on the actual camera
plane while two objects are separated as two different points on the virtual
viewable plane 1 and 2. Since the zooming factor
is equal to
and
,
the relationship between the distance
and
is expressed as
.
The distance
is equal to the distance
,
and the distance
is different from the distance
.
It is shown that the distance in projection direction between an object and a camera
is an important parameter for the object localization.
Now, we consider a panning factor denoted as
that represents camera rotation. The panning
angle is defined as the angle difference between
-axis and
-axis where
-axis represents the normal direction of the
virtual viewable plane. Thus the panning angle can exist in the range of
.
The sign of
is determined: the left rotation is positive
and the right rotation is negative.
To get the global coordinate of the object,
-axis and
-axis in camera coordinate are translated to
-axis and
-axis in global coordinate. We define camera
angle factor (
) to represent the absolute camera angle in
global coordinate. The camera angle
is useful to translate the object coordinate
from camera images.
Figure 5 illustrates the relationship between the
camera angle
and the panning angle
in global coordinate. The global coordinate is
represented as
-axis and
-axis. For example, in the position of Camera
,
panning angle
is the angle between
- and
-axes; while in Camera
,
the panning angle is the angle between
-axis and
-axis. Thus four cases of camera deployment
such as Camera
,
,
,
have different relationships between
and
.
Thus the projected object
on the virtual viewable plane is derived from
and
.
denotes the origin on the virtual viewable
plane in global coordinate.
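As a rough sketch of this coordinate change, the snippet below places a point of the virtual viewable plane into global coordinates from the plane origin and the camera angle. The angle convention (counterclockwise from the global horizontal axis to the axis of the virtual viewable plane) and the name `virtual_to_global` are assumptions made for illustration only.

```python
import math


def virtual_to_global(origin, camera_angle_deg, u_virtual):
    """Return the global (x, y) of a point at offset u_virtual along the plane axis.

    Assumption: the plane axis is the unit vector (cos(theta), sin(theta)),
    with theta measured counterclockwise from the global horizontal axis.
    """
    theta = math.radians(camera_angle_deg)
    ox, oy = origin
    return (ox + u_virtual * math.cos(theta), oy + u_virtual * math.sin(theta))


# Hypothetical plane origin at (1.0, 2.0) and an offset of 0.5 along the plane.
p_aligned = virtual_to_global((1.0, 2.0), 0.0, 0.5)
p_rotated = virtual_to_global((1.0, 2.0), 90.0, 0.5)
print(p_aligned, p_rotated)
```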
Figure 5: Illustration of individual panning factors with
respect to a global coordinate.
2.3. The Relationship between Camera Positions and Pan Factors
Figure 6 illustrates the panning factor selection in a
pair of cameras depending on an object position. Among deployment of four
possible cameras, such as cameras
,
,
,
and
,
a pair of cameras located in adjacent axes is chosen.
Figure 6: Illustration of panning factor selection in a pair of
cameras depending on an object position.
In this paper, we choose
cameras
and
for the deployment of two cameras for the sake
of the localization formulation. The camera angles
in Camera
and
are expressed as
and
in terms of the panning angle
.
3. Visual Localization Algorithm in a 2-Dimensional Coordinate
3.1. The Concept of Visual Localization
Turning to the object localization with an estimate,
consider a single-camera-based localization. In the
single-camera localization, we use the estimate
plane as an object plane. Figure 7 illustrates the object localization using
the estimate
based on a single camera, where
denotes the estimate which is used for a
reference point. Note that the estimate
as a reference point may be any position at
the first time, and it becomes close to a real position. The estimate
and the object
are projected to two planes: virtual viewable
plane and actual camera plane. Here, the reference point
generates the object plane. The distance
denotes the distance between the estimate and
the virtual viewable plane. In view of the projected positions, the length
is obtained by the length
.
Hence the object
is determined from the estimate
.
Figure 7: Illustration of the visual localization in a single
camera.
Once we use the estimate plane as an object plane, the
estimated object position
is different from the real-object position
.
In other words, since any points on the ray between the object and origin are
projected to the same spot on the actual camera plane, the real object
is distorted to the point
.
Thus, the localization has an error from the distance difference of the
distances
and
.
Through the single-image sensor-based visual projection method, it is shown
that an approximated localization is accomplished with a reference point.
We are now motivated to use multiple image sensors in
order to reduce the error between
and
.
In the case of single camera, the distance difference between the
distances
and
cannot be found by a
single-camera view. However, if an additional
camera is available for localizing the object within different angles, the
distance difference can be compensated by the relationship between two camera
views.
Figure 8 illustrates the localization using two
cameras for a simple case where both panning factors are zero, and the
directions of
- and
-axes are aligned to
- and
-axes. Given by a reference point
,
the virtual viewable planes for two cameras are determined.
and
are the obtained object coordinates in each
single camera. In view of camera 1, the length
between the projected points
and
supports the distance between the object plane
of camera 2 and the point
.
Similarly, in the view of camera 2, the length
between the projected points
and
supports a distance between the object plane
of camera 1 and the point
.
Therefore, the basic compensation algorithm is that camera 1 compensates
-direction by the length
,
and camera 2 compensates
-direction by the length
given by a reference point
.
Figure 8: Illustration of the localization in multiple
cameras.
Through one additional image sensor, both
in
-direction and
in
-direction make a reference point
closer to a
real-object position. Hence
is computed by
and
.
Note that
is the localized object position through the
two cameras, which still results in an error with the real-object position
.
The error can be reduced by obtaining a reference point
closer to a real position
.
In Section 3.5, an iterative approach is introduced for improving
localization. In the next section, we formulate the multiple image
sensor-based localization.
3.2. 2D Localization
3.2.1. 2D Localization Model
In this section, we introduce a simplified
localization model. If the estimate
and the object
have the same
-coordinate and
-axis is aligned with
-axis, all points are placed on a plane. Thus
the localization is simplified in 2D coordinate. The 2D localization is simple
and has an advantage for mapping the test environment. Moreover, once the
object is represented as
in global coordinate, the 2D localization
gives a feasible solution.
To derive 2D localization equations, we use vector
notation which has a benefit to express the
relationship between the estimate and the object where “
” denotes a unit vector and “
” represents a vector. For example, one vector
is represented as
,
where
,
,
and
denote unit vectors toward
-,
-, and
-axes and A, B, and C are the magnitude of
-,
-, and
-axes, respectively. Figure 9 shows the basic
model of object localization. The vectors
,
,
and
denote the vector from the estimate
to the object
,
the vector from the projected estimate
to the projected object
on the virtual viewable plane 1, and the
vector from the projected estimate
to the projected object
on the virtual viewable plane 2, respectively.
The lengths
and
are the projections of the vector
on the virtual viewable
planes 1 and 2.
Figure 9: Illustration of basic localization algorithm.
Figure 10 shows the projected image on the virtual
viewable planes 1 and 2 where the projected
points
and
are expressed as
(
,
) and
(
,
) on the virtual viewable
planes 1 and 2.
and
denote the
-coordinates of the projected objects in
global coordinate and are equal to
and
.
Since the estimate has the same height as the object, the projected estimate and
object have the same
-coordinate on the virtual viewable plane 1
and 2. Thus in the figure,
is different from
while
is equal to
.
Since an estimate is a reference point, the actual estimates in the figure are
not displayed on the actual camera plane. Since the projected vectors
and
are the projection of vector
toward
-axis and
-axis, the lengths
and
are equal to
and
.
Figure 10: Illustration of the projected images on the virtual
viewable planes 1 and 2.
3.2.2. Object Localization Based on a Single Camera
The projected object
in
-axis is transformed into
in global coordinate. The origin
is the center of virtual viewable plane. The
camera deployment is expressed as the origin
and camera angle
.
Figure 11 shows the estimation with a reference point,
and a projected object.
denotes the vector from the origin
to the estimate
.
The object
,
estimate
,
projected objects
,
and projected estimates
are denoted as
,
,
,
and
in global coordinate. The vector
is expressed in two ways which have different
points of view:
on the virtual viewable plane and
in global coordinate.
Figure 11: The estimation of a projected object.
The unit vector
is represented in global coordinate as
.
The vector
is expressed as
.
Since the length
is equal to the projection of vector
toward
-axis (
), the length
is represented as:
(3)
Once we assume the estimate is close to the object,
the length
is represented as
(4)where the
length
is the length of the projected estimate and
object on the actual camera plane.
In Figure 11, since the vector
is equal to
,
the length of vector
is represented as follows:
(5)
Since the length
is the projection of the vector
toward
-axis (
), the global coordinate
is related with
as follows:
(6)
Note that since there are two unknown values of
,
two equations are necessary.
3.2.3. Object Localization Based on Multiple Cameras
As shown in Figure 9, once there are two available
cameras which show an object at the same time, two cameras have the following
relationship:
(7)
The projected vector sizes of the
vectors
and
are derived from
and
in (5). The
lengths
and
are represented as
and
in (4). The length between
and
in an actual camera plane (
) and the length between
and
in an actual camera plane (
) are obtained from displayed images.
Therefore, the object position
is represented as follows:
(8)
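The two-camera step can be sketched as a 2-by-2 linear solve. The sketch assumes that each camera contributes one equation of the form "projection of the estimate-to-object vector onto the axis of its virtual viewable plane equals the measured length on that plane", and that these lengths have already been converted from the actual camera plane. The function name, argument layout, and angle convention are hypothetical.

```python
import math


def localize_two_cameras(estimate, plane_angles_deg, lengths):
    """Solve (P - E) . u_i = l_i for the object position P, i = 1, 2.

    estimate         -- reference point E as (x, y)
    plane_angles_deg -- assumed angles of the two virtual-plane axes
    lengths          -- signed lengths l_1, l_2 measured on the virtual planes
    """
    ex, ey = estimate
    (a1, a2), (l1, l2) = plane_angles_deg, lengths
    u1x, u1y = math.cos(math.radians(a1)), math.sin(math.radians(a1))
    u2x, u2y = math.cos(math.radians(a2)), math.sin(math.radians(a2))
    det = u1x * u2y - u1y * u2x
    if abs(det) < 1e-12:
        raise ValueError("plane axes are parallel; two views give one equation")
    # Cramer's rule for the displacement (dx, dy) = P - E.
    dx = (l1 * u2y - u1y * l2) / det
    dy = (u1x * l2 - l1 * u2x) / det
    return (ex + dx, ey + dy)


# Hypothetical case: camera 1's plane lies along the vertical axis (90 degrees),
# camera 2's plane along the horizontal axis (0 degrees).
position = localize_two_cameras((1.5, 1.5), (90.0, 0.0), (0.25, 0.25))
print(position)
```

The explicit determinant check reflects the remark that two equations are needed for the two unknowns: if both virtual planes share the same axis, the second camera adds no independent information.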
3.3. Effect of Zooming and Lens Distortion
The errors caused by the zooming effect and lens
distortion result in scale distortion. In practice, since every general
camera lens has a nonlinear viewable range, the zooming factor is not a constant.
Moreover, since a reference point is a rough estimate, the distance
could be different from the distance
.
However, in (4), the distance
,
instead of the distance
,
is used to get the length
.
Figure 12 illustrates the actual (nonideal) zooming
model caused by lens distortion where the dashed line and the solid line
indicate ideal viewable angle and actual viewable angle, respectively.
Figure 12: Illustration of actual zooming model caused by lens
distortion.
For reference, zooming distortion is illustrated in
Figure 13 as a function of the distance from the camera for various actual
zooming factors, measured with a Canon Digital Rebel XT
with a Tamron SP AF 17–50 mm zoom lens [46, 47], where the dashed line is the ideal zooming factor and
the solid line is the actual (nonideal) zooming factor. As the distance
increases, the nonlinearity of the zooming factor decreases.
Figure 13: Illustration of zooming distortion on a function of
distance from the camera and various actual zooming factors used.
To reduce the localization error, we update the length
.
The lengths
and
are equal to
and
,
respectively. Due to the definition of zooming factor,
and
are expressed as
and
.
Since the objects
and
are projected at the same point on the actual
camera plane in Figure 12,
and
have the same length
on the actual camera plane. Thus the actual
length
is represented as follows:
(9)
The distances
and
are derived from
(10)where
,
,
,
and
,
are equal to
,
,
,
and
,
respectively.
Finally, the compensated object position
is determined as follows:
(11)where the
lengths
and
are equal to
and
,
respectively.
3.4. Effect of Lens Shape
The virtual viewable plane is flat, whereas a real camera
displays a curved space. Thus, the unit distances per pixel in
- and
-axes are
nonlinear on the actual camera plane. Figure 14 shows the error caused by lens
shape, where the distances
and
denote two different distances between the
estimates and the camera.
Figure 14: Illustration of the error caused by lens shape.
Figure 15 illustrates the distribution of unit distance
of
- and
-axes on the actual camera plane. The distance
between the camera and the calibration sheet is 35 inches
and a unit distance is 1 inch.
Figure 15: Illustration of unit distance distribution due to
camera nonlinearity on the actual camera plane.
The translation of the distance between the estimate
and the object needs the compensation for the nonlinearity by camera
calibration. In Figure 15(a), the unit distance for
-axis is invariant in
-axis and in Figure 15(b), the unit distance
for
-axis is also invariant in
-axis. Hence in Figure 10, the height
differences of two different cameras have little
effect on the overall localization error.
3.5. Iterative Localization for Error Minimization
Once the virtual viewable plane is defined by the
estimate, the localized result has the error caused by the distance difference
between the estimate
and the real object
.
Thus the distance between the object and the estimate is important for reducing
the localization error.
The basic concept of iterative approach is to use the
previous localized position
as a new reference point
for the localization of object
.
Thus since the reference point
is closer to a real position
,
the localized position
is getting closer to a real position
.
Figure 16(a) illustrates the basic localization based
on two cameras where
represents the real object. If the distance
is equal to the distance
,
the obtained object coordinate uses the coordinate of
and
to translate the global coordinate of the
object. Thus the object point
is closer to the real object point
.
Figure 16: Illustration of iterative localization.
Figure 16(b) shows the iterative localization. Each
iteration yields an object coordinate closer to the real position with relatively low computational
complexity. Thus the iterative approach can reduce the localization error.
Furthermore, through the iteration process, the localization becomes
insensitive to the nonlinear properties.
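The iterative idea can be illustrated with a toy simulation. The sketch below is not the authors' implementation: it assumes two axis-aligned cameras with zero panning, an ideal pinhole measurement (lateral offset per unit depth), and hypothetical camera positions. Each pass places the object plane at the current reference point, localizes, and then reuses the result as the next reference point.

```python
import math


def localize_once(reference, cam1_y, cam2_x, tan1, tan2):
    """One localization pass with the object plane placed at the reference point.

    Camera 1 sits at (0, cam1_y) looking along +x; camera 2 sits at
    (cam2_x, 0) looking along +y. tan1 and tan2 are the measured lateral
    offsets per unit depth (an assumed, idealized measurement model).
    """
    rx, ry = reference
    depth1, depth2 = rx, ry  # depth of the reference plane for each camera
    # Offset between projected object and projected reference on each plane.
    l1 = tan1 * depth1 - (ry - cam1_y)  # along y, seen by camera 1
    l2 = tan2 * depth2 - (rx - cam2_x)  # along x, seen by camera 2
    return (rx + l2, ry + l1)


def iterate_localization(reference, cam1_y, cam2_x, tan1, tan2, n_iter):
    """Reuse each localized position as the next reference point."""
    history = [reference]
    for _ in range(n_iter):
        reference = localize_once(reference, cam1_y, cam2_x, tan1, tan2)
        history.append(reference)
    return history


# Hypothetical scene: true object at (1.8, 1.8), rough reference at (1.5, 1.5).
true_obj = (1.8, 1.8)
cam1_y, cam2_x = 1.5, 1.5
tan1 = (true_obj[1] - cam1_y) / true_obj[0]
tan2 = (true_obj[0] - cam2_x) / true_obj[1]
history = iterate_localization((1.5, 1.5), cam1_y, cam2_x, tan1, tan2, n_iter=5)
errors = [math.hypot(x - true_obj[0], y - true_obj[1]) for x, y in history]
for k, err in enumerate(errors):
    print(f"iteration {k}: error = {err:.5f} m")
```

In this toy setup the error shrinks on every pass, because the depth implied by the reference point moves closer to the true depth each time.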
3.6. Effect of Tilting Angle
In a surveillance system, a camera can be tilted
to increase its viewable area. The tilting angle
represents the angle difference between
-axis and
-axis on the virtual viewable plane. The
tilting angle has the range as
.
Figure 17 illustrates an example of the tilting angle
where one plane is placed on
-axis and the other has
tilting angle. The tilting angle
is equal to the angle difference between
virtual viewable plane and virtual viewable
.
Since
-axis is invariant for the variation of
tilting angle,
-axis on the virtual viewable plane is the
same as
-axis on the virtual viewable
.
Figure 17: Illustration of an example of the tilting angle.
The tilting angle is the reason for distortion in
- and
-axes as shown in Figure 18.
and
denote the projected object positions of the
same object under different tilting angles. The tilting angle does not affect
the variation in
-axis. However, the tilting angle changes the
distance between the object and the camera. Thus, once this distance
changes, the zooming factor also changes. Therefore, the tilting angle
distorts the object position in
-axis.
Figure 18: Illustration of the distortion by the tilting angle (

).
In Figure 18, the distance
is different from the distance
even if the position of camera and object is
not changed. Since
and
on the actual camera plane are translated to
and
using the zooming factor and the distance
between the object and camera, the tilting angle is the reason for the
localization error.
Figure 19 illustrates the effect of tilting angle in
terms of the distance between the object and the virtual viewable plane. The
heights
and
denote the object height and the camera
height. If the camera has
tilting angle, the distance
is changed by the distance
.
Figure 19: Illustration of the effect of tilting angle.
In order to compensate the localization error from
tilting angle, we update the distance
to
and then change the zooming factor for the distance
.
Thus the length
in (9) is updated as follows:
(12)where
denotes the zooming factor when the distance
between the object and the virtual viewable plane is
.
The distance
is derived as follows:
(13)where the
distance
is computed as
(14)
To quantify the localization error caused by the tilting
angle, we tested the localization error in a simple case. Figure 23 shows the
experimental setup, where two cameras are placed on the left side for camera 1
and the bottom side for camera 2 in Cartesian coordinates. For simplicity, the
panning factors
and
are both zero.
We denote the object is placed on
(1.8 m, 1.8 m) and
(1.5 m, 1.5 m).
Figure 20(a) illustrates the localization error in
terms of tilting angle variation. If the tilting angle is zero, the height
difference between the camera and the object (
) does not affect the localization result
whereas a larger tilting angle produces a larger localization error. Thus the
tilting angle
is the reason for localization error. For
example, if the height of the object is 0.2 m lower than the camera height, the
range of localization error is from 0.003 to 0.025 m.
Figure 20: Illustration of the localization error in terms of
tilting angle variation.
Once the object height is provided, the localization error
is compensated by (12). In Figure 20(b), we compensated the localization error
by setting the camera height to 1.8 m and the object height to 1.6 m. The
overall error caused by the tilting angle then ranges from 0.003 to
0.011 m. If we know the camera height and the object height, the error is
compensated. Moreover, even when the exact height difference between the object and the
camera is unknown, the localization error at high tilting angles
is clearly reduced. Therefore, if we can anticipate the
height of the object, the localization error can be successfully compensated.
When the height difference between the object and the
camera is unknown, compensating for the localization error caused by the tilting
angle is difficult. However, if the distance
is much longer than the distance
,
the tilting angle has little effect on the localization error. Figure 21
illustrates the localization error in terms of the distance
where the tilting angle is 12.4 degrees. When
the distance
increases, the localization error increases,
but once
reaches 2.7 m, the error saturates. In the worst
case, the error rate is 0.01 m per 0.2 m of height difference. For example, if
the camera height difference is 6 m, the expected error is about 0.3 m. Moreover,
when the camera is 0.2 m higher than the object, the error ranges from
0.023 to 0.04 m. If we assume that the object is placed 0.2 m lower than the
camera, the compensation reduces the error to the range of 0.006 to 0.024 m.
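As a quick check of the worst-case figure quoted above, assuming the error grows linearly with the height difference (the symbols $e$ and $\Delta h$ are introduced here only for this check):

```latex
e \approx \frac{0.01\ \text{m}}{0.2\ \text{m}}\,\Delta h = 0.05\,\Delta h,
\qquad
\Delta h = 6\ \text{m} \;\Rightarrow\; e \approx 0.05 \times 6\ \text{m} = 0.3\ \text{m}.
```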
Figure 21: Illustration of the localization error in terms of the
distance

(

).
4. Analysis and Simulation
4.1. Simulation Setup: Basic Illustration
The objective of this simulation is to verify the proposed
localization algorithm by measuring the localization error in a real setting. To
show the compensation for camera nonlinearity, we chose a small space
close to the camera. In the case of Figure 13, the distortion from camera
nonlinearity is present within 2.0 m of the camera. Thus in this simulation, we use
area.
Our target application is a surveillance system where
most target objects are humans or vehicles. However, in this simulation, we
use a small ball as the target object to simplify target detection. Detection
can cause localization error in many ways. For example,
centroid detection of a human is important for reducing localization error
since a human is represented as a point. If the centroids detected in the
two camera images correspond to different positions on the object, the localization result contains a centroid error. Thus in
this setup, we use a small ball. Moreover, after taking pictures, we manually
locate the center of the ball. We analyze the localization error in 2D global
coordinates. The object is represented as
.
Figure 22 shows the displayed images in two cameras
where the lengths
and
are distances between a reference point
and a real-object point
in camera 1 and camera 2, respectively. To
explain the test setup, we showed the reference point
in Figures 22(a) and 22(b), but actually the
reference point is a virtual point.
Figure 22: Illustration of two images of camera 1 and camera 2.
Figure 23: Illustration of experimental setup for localizing an actual object.
Figure 23 shows the experimental setup to measure an
actual object. In this experiment, the actual position of the object is
calculated from the reference based on the parallel
projection model. In Figure 23, two cameras are placed on the left side for
camera 1 and the bottom side for camera 2 in Cartesian coordinate.
Both camera panning factors
and
are at zero.
The actual zooming factors are
and
,
where
is the zooming factor when the distance
between the object plane and the virtual viewable plane is
,
and
is the zooming factor when the distance
between the object plane and the virtual viewable plane is
.
We now analyze the localization result and compare the localization error
as a function of the iterative process, which we call compensation.
4.2. Localization Error and Object Tracking Performance
Figure 24 shows the error distribution of the
algorithm where two cameras are positioned at
and
.
The actual object is located at
.
The figures illustrate the amount of localization error as a function of the
reference coordinate. Since each camera has a limited viewable angle,
reference coordinates that lie outside the viewable angle are not
considered. Note that the error is minimized when the reference point is
close to the actual object point. The localization error can be further reduced
with multiple iterations.
Figure 24: Illustration of error comparison based on the number
of iterations.
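As a rough illustration of why iterating on the reference coordinate reduces the error, the following Python sketch uses a deliberately simplified stand-in for the setup of Figure 23: two ideal pinhole cameras at zero pan, with no lens nonlinearity. It is not the formulation derived in this paper. The camera positions, object position, initial reference, and all function names are hypothetical. Each camera is assumed to measure only the ratio of lateral offset to depth; the reference point supplies the assumed depth, and each new estimate is reused as the next reference.

```python
import math

def measure(cam1, cam2, obj):
    """Ideal measurements (assumption): lateral offset divided by depth."""
    u1 = (obj[1] - cam1[1]) / (obj[0] - cam1[0])  # camera 1 looks along +x
    u2 = (obj[0] - cam2[0]) / (obj[1] - cam2[1])  # camera 2 looks along +y
    return u1, u2

def localize_once(cam1, cam2, u1, u2, ref):
    """One pass: the reference point fixes the depth assumed by each camera."""
    d1 = ref[0] - cam1[0]      # depth assumed for camera 1
    d2 = ref[1] - cam2[1]      # depth assumed for camera 2
    y = cam1[1] + u1 * d1      # camera 1 resolves the y coordinate
    x = cam2[0] + u2 * d2      # camera 2 resolves the x coordinate
    return (x, y)

def iterate(cam1, cam2, u1, u2, ref, n_iter):
    """Reuse each estimate as the reference coordinate of the next pass."""
    estimates = []
    for _ in range(n_iter):
        ref = localize_once(cam1, cam2, u1, u2, ref)
        estimates.append(ref)
    return estimates

# Hypothetical layout: camera 1 on the left, camera 2 at the bottom.
cam1, cam2 = (0.0, 1.0), (1.0, 0.0)
obj = (1.2, 1.3)                      # hypothetical true object position
u1, u2 = measure(cam1, cam2, obj)
estimates = iterate(cam1, cam2, u1, u2, ref=(1.0, 1.0), n_iter=6)
errors = [math.hypot(x - obj[0], y - obj[1]) for x, y in estimates]
for k, err in enumerate(errors, start=1):
    print(f"iteration {k}: error = {err:.6f} m")
```

In this toy model each pass replaces the assumed depth with a better one, so the error shrinks from pass to pass. Convergence here relies on the product of the two measured ratios being smaller than one in magnitude, which holds for the values chosen.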
The proposed localization algorithm is also applied to a
tracking example. In this example, an object moves within a
area, and the images are obtained from
real cameras. We first apply the proposed noniterative localization algorithm
with compensation to the tracking problem. Each time the object changes
coordinates, a corresponding estimate is generated. Figure 25(a)
illustrates the resulting trajectory. After the compensation, the
tracking performance improves. Figures 25(b) and 25(c) illustrate the tracking
performance along the x-axis and the y-axis. These figures clearly show that the
compensation improves the tracking performance, although some localization error
remains.
Figure 25: Application of the noniterative localization in
tracking a trajectory with rough estimates.
Similarly, the proposed iterative localization
algorithm is applied to the same tracking example. In this case, only one
reference coordinate is used for the entire localization. The chosen estimate
lies outside the trajectory, as shown in Figure 26, which illustrates the
resulting trajectory. There is a significant error after a single
iteration since the estimated coordinate is not close to the object. Note that
the error increases as the object moves farther away from the estimated
coordinate. However, successive iterations eliminate the localization error, as
shown in the figure.
Figure 26: Application of the iterative localization with single
estimate.
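The behavior described for Figure 26 can be mimicked with a small Python sketch. It uses the same kind of simplified model as an assumption (two ideal pinhole cameras at zero pan, no lens nonlinearity) and is not the algorithm as derived in this paper. The camera positions, the trajectory, the single reference coordinate, and the function name `estimate` are all hypothetical.

```python
import math

CAM1 = (0.0, 1.0)   # hypothetical: camera 1 on the left, looking along +x
CAM2 = (1.0, 0.0)   # hypothetical: camera 2 at the bottom, looking along +y
REF = (0.8, 0.8)    # single reference coordinate, outside the trajectory

def estimate(obj, ref, n_iter):
    """Localize obj from ideal ratio measurements, refining ref n_iter times."""
    u1 = (obj[1] - CAM1[1]) / (obj[0] - CAM1[0])   # camera 1 measurement
    u2 = (obj[0] - CAM2[0]) / (obj[1] - CAM2[1])   # camera 2 measurement
    x, y = ref
    for _ in range(n_iter):
        # Both coordinates are updated from the previous estimate.
        x, y = CAM2[0] + u2 * (y - CAM2[1]), CAM1[1] + u1 * (x - CAM1[0])
    return (x, y)

def error(obj, est):
    return math.hypot(est[0] - obj[0], est[1] - obj[1])

# Hypothetical trajectory moving away from the reference coordinate.
trajectory = [(1.0 + 0.2 * k, 1.0 + 0.1 * k) for k in range(6)]

one_iter = [error(p, estimate(p, REF, 1)) for p in trajectory]
multi_iter = [error(p, estimate(p, REF, 10)) for p in trajectory]

for p, e1, e10 in zip(trajectory, one_iter, multi_iter):
    print(f"object ({p[0]:.1f}, {p[1]:.1f}): "
          f"1 iteration {e1:.4f} m, 10 iterations {e10:.6f} m")
```

With these values the single-iteration error grows along the trajectory as the object moves away from the fixed reference, while ten iterations bring the error below a millimeter at every point of this toy trajectory.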
4.3. Application of the Algorithms
Figure 27 shows a tracking environment with moving
cameras in which the proposed localization algorithm is applied. For illustration,
two sequences of images are shown. The coordinate
of the center of the room is chosen as the initial reference coordinate. The cameras
follow the object during the localization. When the object is detected by
each camera, the coordinates from the camera images are combined to obtain the actual
coordinate, which is shown in the tracking environment. In the
experiment, the cameras follow the object by panning.
Figure 27: Snapshots of the tracking environment with moving
cameras based on the proposed localization algorithm. A human face is used to
localize a person. The circle represents the actual coordinate of the person
within the room.
Figure 28 illustrates object detection in an outdoor
environment, where two objects are used to evaluate the proposed localization
algorithm. Both cameras are placed on the same side, and the panning angles for
camera 1 and camera 2 are
and
,
respectively. Figure 29 illustrates the trajectories of the two objects
in an outdoor environment. Since the
method is computationally simple, the total computation time is proportional to
the number of objects and is not significant relative to the overall
computation. As shown in the figure, the trajectory computation errors are
negligibly small for practical use. The average error, measured as the
distance between the actual and computed trajectories, is
0.294 m and 0.296 m for persons A and B,
respectively. However, the maximum error can reach 0.608 m (3%)
for person B. Besides the computation errors of the localization algorithm,
other factors contributing to the error are the measurements of
the distances between the cameras and the persons, and the selection of the center point of
the detected regions of the persons used in the computation.
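For reference, the average and maximum errors quoted above are distances between corresponding points of the actual and computed trajectories. A minimal Python sketch of that calculation follows. The function name `trajectory_errors` and the sample coordinates are hypothetical and are not the measured data from this experiment.

```python
import math

def trajectory_errors(actual, computed):
    """Return (average, maximum) Euclidean distance between paired points."""
    if len(actual) != len(computed) or not actual:
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [math.hypot(ax - cx, ay - cy)
             for (ax, ay), (cx, cy) in zip(actual, computed)]
    return sum(dists) / len(dists), max(dists)

# Hypothetical trajectories in meters, for illustration only.
actual = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
computed = [(0.0, 0.3), (1.0, -0.4), (2.3, 0.4), (3.0, 0.0)]

mean_err, max_err = trajectory_errors(actual, computed)
print(f"average error: {mean_err:.3f} m, maximum error: {max_err:.3f} m")
```

Reporting both values is useful because an average can hide isolated large deviations, as with the 0.608 m maximum noted for person B.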
Figure 28: Illustration of detection results for people
localization in an outdoor environment.
Figure 29: Illustration of the trajectories of two objects in an outdoor
environment.
5. Conclusion
This paper proposes an accurate and effective algorithm that
localizes an object from visual images, starting from unreliable estimated coordinates.
To simplify the modeling of visual localization, the parallel
projection model is presented, in which simple geometry is used in the computation. The
algorithm minimizes the localization error through an iterative approach with
relatively low computational complexity.
The nonlinear distortion of the digital imaging devices is compensated during the
iterative approach. The effectiveness of the proposed algorithm in object
position localization as well as tracking is illustrated. The proposed
algorithm can be effectively applied in many tracking applications where visual
imaging devices are used.
Acknowledgments
This research is supported by the Foundation of Ubiquitous Computing and Networking (UCN) project, the Ministry of Knowledge Economy (MKE) 21st Century Frontier R&D Program in Korea, and is a result of subproject UCN 08B3-O4-30S.
References
- R. Okada, Y. Shirai, and J. Miura, “Object tracking based on optical flow and depth,” in Proceedings of the IEEE/SICE/RSJ International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 565–571, Washington, DC, USA, December 1996.
- S. Khan and M. Shah, “Consistent labeling of tracked objects in multiple cameras with overlapping fields of view,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1355–1360, 2003.
- A. Bakhtari, M. D. Naish, M. Eskandari, E. A. Croft, and B. Benhabib, “Active-vision-based multisensor surveillance—an implementation,” IEEE Transactions on Systems, Man, and Cybernetics C, vol. 36, no. 5, pp. 668–680, 2006.
- N. X. Dao, B.-J. You, S.-R. Oh, and Y. J. Choi, “Simple visual self-localization for indoor mobile robots using single video camera,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '04), vol. 4, pp. 3767–3772, Sendai, Japan, September 2004.
- V. Ayala, J. B. Hayet, F. Lerasle, and M. Devy, “Visual localization of a mobile robot in indoor environments using planar landmarks,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '00), vol. 1, pp. 275–280, Takamatsu, Japan, October 2000.
- K. Nickel, T. Gehrig, R. Stiefelhagen, and J. McDonough, “A joint particle filter for audio-visual speaker tracking,” in Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI '05), pp. 61–68, Trento, Italy, October 2005.
- D. N. Zotkin, R. Duraiswami, and L. S. Davis, “Joint audio-visual tracking using particle filters,” EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1154–1164, 2002.
- G. Pingali, G. Tunali, and I. Carlbom, “Audio-visual tracking for natural interactivity,” in Proceedings of the 7th ACM International Conference on Multimedia, pp. 373–382, Orlando, Fla, USA, October 1999.
- D. B. Ward, E. A. Lehmann, and R. C. Williamson, “Particle filtering algorithms for tracking an acoustic source in a reverberant environment,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 826–836, 2003.
- H. Lee and H. Aghajan, “Collaborative node localization in surveillance networks using opportunistic target observations,” in Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, pp. 9–18, Santa Barbara, Calif, USA, October 2006.
- O. Yakimenko, I. Kaminer, and W. Lentz, “A three point algorithm for attitude and range determination using vision,” in Proceedings of the American Control Conference (ACC '00), vol. 3, pp. 1705–1709, Chicago, Ill, USA, June 2000.
- H. Tsutsui, J. Miura, and Y. Shirai, “Optical flow-based person tracking by multiple cameras,” in Proceedings of the International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI '01), pp. 91–96, Baden-Baden, Germany, August 2001.
- V. Lepetit and P. Fua, “Monocular model-based 3D tracking of rigid objects: a survey,” Foundations and Trends in Computer Graphics and Vision, vol. 1, no. 1, pp. 1–89, 2005.
- Z. Zhang, “A flexible new technique for camera calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330–1334, 2000.
- M. Han, A. Sethi, W. Hua, and Y. Gong, “A detection-based multiple object tracking method,” in Proceedings of the International Conference on Image Processing (ICIP '04), vol. 5, pp. 3065–3068, October 2004.
- I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: real-time surveillance of people and their activities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809–830, 2000.
- H. Jin and G. Qian, “Robust multi-camera 3D people tracking with partial occlusion handling,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), vol. 1, pp. 909–912, Honolulu, Hawaii, USA, April 2007.
- J. Berclaz, F. Fleuret, and P. Fua, “Robust people tracking with global trajectory optimization,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 1, pp. 744–750, New York, NY, USA, June 2006.
- K. Nummiaro, E. Koller-Meier, T. Svoboda, D. Roth, and L. Van Gool, “Color-based object tracking in multi-camera environments,” in Proceedings of 25th DAGM Symposium on Pattern Recognition, pp. 591–599, Magdeburg, Germany, September 2003.
- O. Javed, S. Khan, Z. Rasheed, and M. Shah, “Camera handoff: tracking in multiple uncalibrated stationary cameras,” in Proceedings of the IEEE Workshop on Human Motion (HUMO '00), pp. 113–118, Los Alamitos, Calif, USA, December 2000.
- P. E. Debevec, Modeling and rendering architecture from photographs, Ph.D. thesis, University of California at Berkeley Computer Science Division, Berkeley Calif, USA, 1996.
- M. Watanabe and S. K. Nayar, “Telecentric optics for computational vision,” in Proceedings of the 4th European Conference on Computer Vision (ECCV '96), vol. 2, pp. 439–451, Cambridge, UK, April 1996.
- M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
- C. Geyer and K. Daniilidis, “Omnidirectional video,” The Visual Computer, vol. 19, no. 6, pp. 405–416, 2003.
- S. Spors, R. Rabenstein, and N. Strobel, “A multi-sensor object localization system,” in Proceedings of the Vision Modeling and Visualization Conference (VMV '01), pp. 19–26, Stuttgart, Germany, November 2001.
- S. Bougnoux, “From projective to Euclidean space under any practical situation, a criticism of self-calibration,” in Proceedings of the 6th IEEE International Conference on Computer Vision (ICCV '98), pp. 790–796, Bombay, India, January 1998.
- R. K. Lenz and R. Y. Tsai, “Techniques for calibration of the scale factor and image center for high accuracy 3-D machine vision metrology,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 5, pp. 713–720, 1988.
- J. Heikkila and O. Silven, “A four-step camera calibration procedure with implicit image correction,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '97), pp. 1106–1112, San Juan, Puerto Rico, USA, June 1997.
- F. Lv, T. Zhao, and R. Nevatia, “Camera calibration from video of a walking human,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1513–1518, 2006.
- O. D. Faugeras, Q.-T. Luong, and S. J. Maybank, “Camera self-calibration: theory and experiments,” in Proceedings of the 2nd European Conference on Computer Vision (ECCV '92), pp. 321–334, Santa Margherita Ligure, Italy, May 1992.
- A. Zisserman, P. A. Beardsley, and I. D. Reid, “Metric calibration of a stereo rig,” in Proceedings of the IEEE Workshop on Representation of Visual Scenes (WVRS '95), pp. 93–100, Cambridge, Mass, USA, June 1995.
- E. Horster, R. Lienhart, W. Kellermann, and J.-Y. Bouguet, “Calibration of visual sensors and actuators in distributed computing platforms,” in Proceedings of the 3rd ACM International Workshop on Video Surveillance & Sensor Networks, pp. 19–28, Hilton, Singapore, November 2005.
- P. F. Sturm and S. J. Maybank, “On plane-based camera calibration: a general algorithm, singularities, applications,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '99), vol. 1, pp. 432–437, Fort Collins, Colo, USA, June 1999.
- Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong, “A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry,” Artificial Intelligence, vol. 78, no. 1-2, pp. 87–119, 1995.
- Q. Memon and S. Khan, “Camera calibration and three-dimensional world reconstruction of stereo-vision using neural networks,” International Journal of Systems Science, vol. 32, no. 9, pp. 1155–1159, 2001.
- R. Cipolla, T. W. Drummond, and D. Robertson, “Camera calibration from vanishing points in images of architectural scenes,” in Proceedings of the British Machine Vision Conference, vol. 2, pp. 382–391, Nottingham, UK, September 1999.
- P. A. Beardsley, A. Zisserman, and D. W. Murray, “Sequential updating of projective and affine structure from motion,” International Journal of Computer Vision, vol. 23, no. 3, pp. 235–259, 1997.
- O. Faugeras, “Stratification of three-dimensional vision: projective, affine, and metric representations: errata,” Journal of Optical Society of America, vol. 12, no. 3, pp. 465–484, 1995.
- T. Moons, L. Van Gool, M. Proesmans, and E. Pauwels, “Affine reconstruction from perspective image pairs with a relative object-camera translation in between,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 1, pp. 77–83, 1996.
- M. Pollefeys and L. Van Gool, “A stratified approach to metric self-calibration,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '97), pp. 407–412, San Juan, Puerto Rico, USA, June 1997.
- P. A. Beardsley and A. Zisserman, “Affine calibration of mobile vehicles,” in Proceedings of the Europe-China Workshop on Geometrical Modelling and Invariants for Computer Vision (GMICV '95), Xi'an, China, April 1995.
- J. J. Koenderink and A. J. van Doorn, “Affine structure from motion,” Journal of the Optical Society of America A, vol. 8, no. 2, pp. 377–385, 1991.
- P. Sturm and L. Quan, “Affine stereo calibration,” in Proceedings of the 6th International Conference on Computer Analysis of Images and Patterns (CAIP '95), pp. 838–843, Prague, Czech Republic, September 1995.
- C. Tomasi and T. Kanade, “Shape and motion from image streams under orthography: a factorization method,” International Journal of Computer Vision, vol. 9, no. 2, pp. 137–154, 1992.
- C. J. Poelman and T. Kanade, “A paraperspective factorization method for shape and motion recovery,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp. 206–218, 1997.
- http://www.usa.canon.com/.
- http://www.tamron.com/.