Abstract

This paper presents a vision-based technique for localizing targets in a 3D environment. It is achieved by combining different types of sensors: optical wheel encoders, an electrical compass, and visual observations from a single camera. Based on the robot motion model and image sequences, an extended Kalman filter (EKF) is applied to estimate the target locations and the robot pose simultaneously. The proposed localization system is applicable in practice because it does not require an initialization step in which the system is started from artificial landmarks of known size. The technique is especially suitable for navigation and target tracking with an indoor robot and can potentially be extended to surveillance and monitoring with Unmanned Aerial Vehicles equipped with aerial odometry sensors. The experimental results demonstrate centimeter-level accuracy in localizing targets in an indoor environment under high-speed robot motion.

1. Introduction

Knowledge about the environment is a critical issue for autonomous vehicle operations, and the capability to localize environmental targets with a robot is in high demand. The work in [1] investigates a vision-based object recognition technique for detecting targets in the environment that have to be reached by a robot. Once the robot has detected the targets, knowing their positions is an essential prerequisite for tasks such as navigation and object tracking. As a result, how to localize targets for specific tasks is an essential issue in many applications. For UAVs there are similar demands [2, 3]: UAVs have to track the positions of ground objects for reconnaissance or rescue assistance with monocular vision. In addition, UAVs need to observe the changing surroundings to better understand the aircraft's movements, or the localization system has to direct the aircraft to a region of interest after ground observations are taken.

In recent years, odometry sensors have been widely used for estimating the motion of vehicles moving in a 3D environment, such as UAVs and Unmanned Ground Vehicles (UGVs). For instance, Inertial Navigation Systems (INSs) measure linear acceleration and rotational velocity and are capable of tracking the position, velocity, and attitude of a vehicle by integrating these signals [4], and mobile vehicles use the Global Positioning System (GPS) and Inertial Measurement Units (IMUs) for land vehicle applications [5]. However, most inertial navigation sensors are too expensive for some indoor applications. Optical wheel encoders and an electrical compass provide linear and angular velocities, respectively. Both are basic odometry sensors and are widely used owing to their low cost, simplicity, and easy maintenance. Encoders provide a way of measuring velocity to estimate the position of the robot, and compasses are often used to detect the robot's orientation. Based on this sensor information, motion control is performed and the robot's localization is then estimated [6, 7]. Despite some limitations of encoders, most researchers agree that an encoder is a vital part of a robot's navigation system and that tasks are simplified if encoder accuracy is improved. In addition, a camera as an extra sensor allows a robot to perform a variety of tasks autonomously. The use of computer vision for localization has been investigated for several decades. The camera has not been at the center of robot localization, as most researchers have paid more attention to other sensors such as laser range-finders and sonar. Nevertheless, vision remains an attractive sensing choice because cameras are compact, cheap, well understood, and ubiquitous. In this paper, the algorithm is achieved by combining different types of sensors: optical wheel encoders, an electrical compass, and a single camera.

Mobile vehicles such as robots, UGVs, and UAVs are becoming fully intelligent robotic systems that can handle more complicated tasks and environments, so it is necessary to provide them with higher autonomy and better functionality. For instance, robot localization and target tracking are mandatory. For these tasks, sensors such as GPS receivers, IMUs, laser range-finders, encoders, and compasses may be combined to describe the time evolution of the motion model. Additionally, visual sensors, such as electro-optic or infrared cameras, have been included in the on-board sensor suite of many robots to increase their value. How to combine hybrid sensors to achieve a specific task has therefore become a popular research topic. Recent research has shown that the way to improve sequential estimation and achieve repeatable motion with multiple sensors is to adopt probabilistic methods. The Kalman filter has been widely used for data fusion in applications such as navigation systems [8], virtual environment tracking systems [9], and 3D scene modeling [10]. The Kalman filter is a mathematical method that provides an iterated procedure for least-squares estimation of a linear system; the procedure predicts the state using the motion model and then corrects the state with the measurement innovation. The extended Kalman filter (EKF) is a variant of the Kalman filter for nonlinear systems, such as the target tracking system in this paper. More details about the Kalman filter are introduced in [11–14]. In this paper, the algorithm represents a new approach to EKF-based multisensor data fusion: information from different types of sensors is combined to improve the estimates in the system state. In order to make our system able to deal with a variety of tasks, such as navigation for a mobile robot or ground target tracking for UAVs, we choose monocular rather than binocular vision as one of our sensor types. With monocular vision, simultaneous robot localization and target tracking become more complex and computationally demanding because the depths of targets are unknown from a single image observation. Thus, it is important to use computational techniques or more efficient algorithms to reduce the running time. In this paper, Cholesky decomposition [15] and forward-and-back substitution are used to invert the covariance matrix and improve computational efficiency, exploiting the fact that the covariance matrix is positive semidefinite.

The rest of this paper is structured as follows. Section 2 presents the proposed algorithm and explains the details. Section 3 explains two key points about the proposed algorithm. The experimental results are provided in Section 4, and finally, Section 5 concludes the paper and suggests future work.

2. Explanation of the Algorithm

The algorithm is divided into five main parts: motion modeling, adding new targets, measurement modeling, image matching, and EKF updating. With sequential measurements from the sensors, the EKF is capable of improving the initial estimate of the unknown information while simultaneously updating the localization of the targets and the robot pose. Finally, Cholesky decomposition and forward-and-back substitution are presented to calculate the inverse covariance matrix efficiently.

2.1. The Origins of the Proposed Algorithm

A single camera is mounted on our system as one of the sensors. With monocular vision, the depth of a target cannot be measured from only one image but must be estimated from sequential images. Therefore, the single camera has to observe the target repeatedly to obtain parallax between the rays captured from the target to the robot at different poses. The orientation angles of the target in the world coordinate system, in contrast, can be estimated from a single image. The state of the i-th target is a six-dimensional vector describing its position in 3D space. It contains the location of the robot when the robot detects the i-th target for the first time and the orientation angles of the i-th target, which are calculated from the pixel location of the target in that single image; the position of the target with respect to the camera coordinate system is also defined. The relationship between the position of the target and the current robot localization is presented in Figure 1.

The position of the target with respect to the world frame is obtained from the camera location at the first observation, the distance from that location to the target, and the unit vector pointing from that location toward the target. The vector from the first-observation camera location to the target is the product of the distance and the unit vector, so the world-frame position of the target can be computed once the depth is known. However, the depth of the target cannot be measured from a single image. The EKF is therefore applied to estimate the position, orientation angles, and depth, and these estimates converge toward the correct values through recursive iteration using sequential image measurements. To sum up, (2) expresses the basic concept and origin of our proposed algorithm.
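For clarity, the relation described above can be written compactly as follows; the symbols used here are our own shorthand, since the paper's original notation for (2) is not reproduced in the text. With $\mathbf{r}^{W}_{c,i}$ the camera location at the first observation of target $i$, $d_{i}$ the unknown depth, and $\mathbf{m}(\theta_{i},\phi_{i})$ the unit bearing vector defined by the two orientation angles,

$$
\mathbf{y}^{W}_{i} \;=\; \mathbf{r}^{W}_{c,i} \;+\; d_{i}\,\mathbf{m}(\theta_{i},\phi_{i}).
$$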

2.2. Motion Modeling

We first derive the continuous-time system model that describes the time evolution of the state estimate. This model allows us to employ the sampled measurements of the wheel encoders and the compass for state propagation. The process state is described by a vector containing the position of the robot with respect to the world frame, the quaternion representing the orientation between the world frame and the robot, the linear velocity in the world frame, and the angular velocity with respect to the camera frame. In this system, the control inputs are measured by the wheel encoders and the compass, which provide the linear velocity and the angular velocity, respectively. In addition, the linear velocity is used to approximate the acceleration of the robot so that the evolution of every component of the system state vector is described completely. Acceleration is defined as the time derivative of the linear velocity; to keep the computation on-line, we approximate the acceleration at the current state with the backward difference of the encoder velocities rather than the forward difference. Although the backward difference is less accurate than the forward difference, which is closer to the real acceleration, an iterated EKF is still capable of compensating for the resulting error. The system model describing the time evolution driven by the wheel encoders and the compass is given by

The compass is used to measure the angular velocity of the robot. From the angular velocity and the sampling interval, the incremental rotation quaternion over one step is derived and composed with the current orientation quaternion.
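As a point of reference, a minimal discrete-time propagation step consistent with this description (a MonoSLAM-style constant-velocity model driven by encoder and compass readings) might be written as follows; the symbols $\mathbf{r}^{W}$, $\mathbf{q}^{WR}$, $\mathbf{v}^{W}$, $\boldsymbol{\omega}^{R}$, and $\Delta t$ are our own notation, not the paper's:

$$
\begin{aligned}
\mathbf{r}^{W}_{k+1} &= \mathbf{r}^{W}_{k} + \mathbf{v}^{W}_{k}\,\Delta t + \tfrac{1}{2}\,\mathbf{a}_{k}\,\Delta t^{2},\\
\mathbf{q}^{WR}_{k+1} &= \mathbf{q}^{WR}_{k} \otimes \mathbf{q}\!\left(\boldsymbol{\omega}^{R}_{k}\,\Delta t\right),\\
\mathbf{v}^{W}_{k+1} &= \mathbf{v}^{W}_{k} + \mathbf{a}_{k}\,\Delta t,
\qquad
\mathbf{a}_{k} \approx \frac{\mathbf{v}_{k} - \mathbf{v}_{k-1}}{\Delta t},
\end{aligned}
$$

where $\mathbf{q}(\boldsymbol{\omega}^{R}_{k}\,\Delta t)$ is the quaternion corresponding to a rotation by $\boldsymbol{\omega}^{R}_{k}\,\Delta t$ and the acceleration is the backward difference of consecutive encoder velocities.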

The predicted covariance is modified when the control covariance is taken into account. We assume that the process noise of the control vector is not correlated with the process state vector, so the corresponding cross-covariance term is set to a zero matrix. By using (A.4) in Appendix A, the predicted covariance of the system model is expressed as follows, where the first term represents the noise propagated from the process state vector and the second term describes the noise from the measurements of the wheel encoders and the compass. The Jacobian matrix of the process model with respect to the state is given accordingly, and the remaining Jacobian matrices are obtained in the same way.
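In generic EKF notation (our symbols, not the paper's), the prediction in (A.4) takes the standard form

$$
\mathbf{P}_{k+1|k} \;=\; \mathbf{F}_{\mathbf{x}}\,\mathbf{P}_{k|k}\,\mathbf{F}_{\mathbf{x}}^{\top}
\;+\; \mathbf{F}_{\mathbf{u}}\,\mathbf{Q}_{\mathbf{u}}\,\mathbf{F}_{\mathbf{u}}^{\top},
\qquad
\mathbf{F}_{\mathbf{x}} = \frac{\partial f}{\partial \mathbf{x}},\quad
\mathbf{F}_{\mathbf{u}} = \frac{\partial f}{\partial \mathbf{u}},
$$

where $\mathbf{Q}_{\mathbf{u}}$ is the covariance of the encoder and compass control inputs.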

2.3. Adding New Target

A notable feature of our algorithm is that new targets are initialized using only one image. The initialization includes assigning the initial state values and the covariance.

2.3.1. Target Initialization

Equation (1) is used to define the location of the new target, and the six parameters of the target state are estimated with the EKF prediction-update loop. When the robot senses the 1st target for the first time at its current state, the new target information is added and the process state vector is expanded, as represented by the following equations. The first observation of the 1st target is made at the current camera location, and a unit vector from this location to the 1st target is defined with respect to the world frame; Figure 2 illustrates the relationship among the camera location, the unit vector, and the target. The pixel location of the 1st target in an undistorted image and the location of the 1st target with respect to the camera frame are defined as well, and the location of the 1st target with respect to the world frame is then expressed accordingly. From a single image only the orientation angles, not the full camera-frame position, can be computed, because the depth cannot be recovered from one observation. However, the undistorted pixel location can be used to calculate the two orientation angles, and the unit vector is derived as a function of these orientation angles. The final parameter, the depth, cannot be measured from a single image; it is estimated from sequential images with the EKF prediction-update loop and is therefore assigned an initial default value.
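One common way to write this initialization, borrowed from the monocular SLAM literature, is given below; the paper's own angle conventions are not preserved in the extracted text, so these formulas are an assumed but representative choice. Let $\mathbf{h}^{W} = (h_{x}, h_{y}, h_{z})^{\top}$ be the ray from the first-observation camera location toward the target, expressed in the world frame and obtained from the undistorted pixel measurement; then

$$
\theta_{1} = \operatorname{atan2}\!\left(h_{x},\, h_{z}\right),\qquad
\phi_{1} = \operatorname{atan2}\!\left(-h_{y},\, \sqrt{h_{x}^{2}+h_{z}^{2}}\right),
$$

$$
\mathbf{m}(\theta_{1},\phi_{1}) =
\begin{pmatrix}
\cos\phi_{1}\,\sin\theta_{1}\\
-\sin\phi_{1}\\
\cos\phi_{1}\,\cos\theta_{1}
\end{pmatrix},
$$

and the depth is set to a default initial value to be refined by the EKF.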

2.3.2. New Covariance Assignment

The system covariance is modified after a new target is added. By using (B.1) in Appendix B, the new covariance and the function of the new target state are written accordingly, where the cross-covariance block is not a zero matrix because the new target's location is correlated with the estimates of the robot. According to (B.4) and (B.5) in Appendix B, the corresponding blocks are derived accordingly.
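In generic SLAM state-augmentation notation (our symbols, not the paper's), the augmented covariance for a new target whose state is a function $g(\mathbf{x}, \mathbf{z})$ of the robot state and the measurement has the standard form

$$
\mathbf{P}_{\text{new}} =
\begin{bmatrix}
\mathbf{P} & \mathbf{P}\,\mathbf{G}_{\mathbf{x}}^{\top}\\
\mathbf{G}_{\mathbf{x}}\,\mathbf{P} & \mathbf{G}_{\mathbf{x}}\,\mathbf{P}\,\mathbf{G}_{\mathbf{x}}^{\top} + \mathbf{G}_{\mathbf{z}}\,\mathbf{R}\,\mathbf{G}_{\mathbf{z}}^{\top}
\end{bmatrix},
\qquad
\mathbf{G}_{\mathbf{x}} = \frac{\partial g}{\partial \mathbf{x}},\;\;
\mathbf{G}_{\mathbf{z}} = \frac{\partial g}{\partial \mathbf{z}},
$$

where $\mathbf{R}$ is the measurement noise covariance; the off-diagonal blocks make explicit the correlation between the new target and the robot state mentioned above.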

2.4. Measurement Modeling

The robot moves continuously and records sequential images. This is a process of detecting and identifying the targets: the parameters of the targets have been placed in the process state vector, and their covariance is estimated within the recursive loop.

2.4.1. Sensor Model

The predicted pixel locations of the targets are estimated as a function of the prior process state vector described in (4). According to (2), the location of the i-th target in the camera frame can be written in another way. Our measurement sensor is a monocular camera, and a camera sensor model describes how the variables in the process state vector map into the sensor variables. Using the pinhole model [16], the camera-frame location is related to the undistorted pixel location of the i-th target; (22) and (23) are combined to derive the predicted undistorted pixel location of the i-th target. Applying the image correction (distortion) model then gives the predicted distorted pixel location of the i-th target, where the coefficients describe the radial distortion of the image.

The actual image measurement taken by the robot is the distorted pixel location rather than the undistorted one. Therefore, we have to calculate the predicted distorted pixel location in order to form the measurement innovation for EKF updating.
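As an illustration of this sensor model, the following sketch projects a camera-frame point to a distorted pixel location using a pinhole model plus a simple radial distortion term applied around the principal point. The function name, parameter names, and numerical values are illustrative assumptions, not the paper's calibration:

```cpp
#include <cmath>
#include <cstdio>

struct Pixel { double u, v; };

// Pinhole projection followed by a radial distortion model.
// (fx, fy): focal lengths in pixels; (u0, v0): principal point;
// (k1, k2): radial distortion coefficients.
Pixel projectDistorted(double xc, double yc, double zc,
                       double fx, double fy, double u0, double v0,
                       double k1, double k2) {
    // Undistorted pixel location from the pinhole model.
    double uu = u0 + fx * xc / zc;
    double vu = v0 + fy * yc / zc;
    // Radial distortion applied to the offset from the principal point.
    double dx = uu - u0, dy = vu - v0;
    double r2 = dx * dx + dy * dy;
    double factor = 1.0 + k1 * r2 + k2 * r2 * r2;
    return { u0 + dx * factor, v0 + dy * factor };
}

int main() {
    // Example: a point 3.5 m in front of the camera, slightly off-axis.
    Pixel p = projectDistorted(0.20, -0.10, 3.5,
                               195.0, 195.0, 160.0, 120.0,
                               1e-7, 1e-13);
    std::printf("predicted distorted pixel: (%.1f, %.1f)\n", p.u, p.v);
    return 0;
}
```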

2.4.2. Measurement Covariance Assignment

The measurement covariance is expected to describe the likely variation of the measurement made by the sensor under the current conditions. This variation is affected by the variables in the process state vector and by the noise that corrupts the sensor. The measurement Jacobian matrix is defined accordingly. When several targets are observed concurrently, their measurements are stacked in one measurement vector to form a single batch-form update equation, and the batch measurement Jacobian matrix is derived by stacking the individual Jacobians in the same order.
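In generic notation (our symbols), the batch form for $n$ concurrently observed targets and the resulting innovation covariance read

$$
\mathbf{z} = \begin{bmatrix}\mathbf{z}_{1}\\ \vdots\\ \mathbf{z}_{n}\end{bmatrix},\qquad
\mathbf{H} = \begin{bmatrix}\mathbf{H}_{1}\\ \vdots\\ \mathbf{H}_{n}\end{bmatrix},\qquad
\mathbf{S} = \mathbf{H}\,\mathbf{P}\,\mathbf{H}^{\top} + \mathbf{R},
$$

where $\mathbf{R}$ is block-diagonal with the per-target pixel noise covariances.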

2.5. Image Matching by Using Cross-Correlation

The image patch information of the targets whose variables are placed in the process state vector is stored in the database when the robot senses them for the first time. The predicted distorted pixel locations of the targets are estimated using (25). The next important issue is how to search for the actual pixel locations of the targets in a 320 × 240 image. Image matching is a fundamental and vital aspect of many problems in computer vision, including motion tracking and object recognition. Ideally, image features should be invariant to rotation, image scaling, changes in illumination, and 3D camera viewpoint, and how to allow a single candidate image to be correctly matched with high probability remains a popular research topic. Based on the estimated measurement covariance, the small area in which the target lies with high probability can be predicted. The measurement covariance of the i-th target is defined as

The template matching technique is used to search for the actual image pixel location of the i-th target in a small search area whose width and length are set by the measurement covariance. A constant coefficient defines how large the search area is; the larger it is, the longer the search time. There are two advantages to estimating the search area from the measurement covariance. One advantage is that a lot of time is saved because the image pixel location of the i-th target is detected in a small search area rather than in the whole 320 × 240 image. The other is that the successful search rate increases dramatically because the search area allows the image pixel location to be correctly matched with high probability. As a result, it is not necessary to use a complicated object recognition algorithm such as the Scale Invariant Feature Transform (SIFT) [17] for image matching in our system. Cross-correlation search is adopted as our template matching algorithm, and its computational load is low owing to its simplicity. This approach uses image cross-correlation to search for a suitable image patch: a template image patch for the i-th target is stored in the database, and candidate image patches are taken from the search area. The cross-correlation value of the template with a candidate patch is given by

Here the normalization uses the number of pixels in an image patch, the standard deviations of the template and candidate patches, and their average values. The maximum cross-correlation value between the template and the candidates corresponds to the best template match and is taken as the matched target pixel patch for the i-th target.
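A minimal sketch of the normalized cross-correlation computation described above, assuming gray-level patches stored as flat arrays of equal size (the patch size and data layout are illustrative assumptions):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Normalized cross-correlation between a template patch t and an
// equally sized candidate patch c; returns a value in [-1, 1].
double ncc(const std::vector<double>& t, const std::vector<double>& c) {
    const double n = static_cast<double>(t.size());
    double meanT = 0.0, meanC = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) { meanT += t[i]; meanC += c[i]; }
    meanT /= n; meanC /= n;
    double num = 0.0, varT = 0.0, varC = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) {
        num  += (t[i] - meanT) * (c[i] - meanC);
        varT += (t[i] - meanT) * (t[i] - meanT);
        varC += (c[i] - meanC) * (c[i] - meanC);
    }
    const double denom = std::sqrt(varT * varC);
    return denom > 0.0 ? num / denom : 0.0;
}

int main() {
    // Toy 3x3 patches: the candidate is a scaled and offset copy of the
    // template, so the correlation should be close to 1.
    std::vector<double> tmpl = {10, 20, 30, 40, 50, 60, 70, 80, 90};
    std::vector<double> cand = {25, 45, 65, 85, 105, 125, 145, 165, 185};
    std::printf("NCC = %.3f\n", ncc(tmpl, cand));
    return 0;
}
```

In the full search, this score would be evaluated for every candidate window inside the covariance-derived search area, and the window with the maximum score taken as the match.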

2.6. Iterated EKF Updating

The EKF is one of the most widely used nonlinear estimators owing to its similarity to the optimal linear filter, its simplicity of implementation, and its ability to provide accurate estimates in practice. We employ the iterated EKF to update the state. At each iteration step, the prior process state vector is computed using (4), and the predicted measurement is then calculated from it using (28). Next, the measurement innovation and the measurement Jacobian matrix are computed, and the innovation covariance and Kalman gain are evaluated. Finally, the process state vector and its covariance are updated at each iteration step, where the actual measurement is detected by the image matching search and the predicted measurement is computed by the sensor model. The estimation error is reduced by the iteration, and the unknown depths of the targets gradually converge toward their real values.
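Written out in generic EKF notation (our symbols, not the paper's equation numbers), one inner iteration of this update is

$$
\begin{aligned}
\mathbf{K} &= \mathbf{P}\,\mathbf{H}^{\top}\,\mathbf{S}^{-1},\\
\hat{\mathbf{x}}^{+} &= \hat{\mathbf{x}}^{-} + \mathbf{K}\left(\mathbf{z} - h(\hat{\mathbf{x}}^{-})\right),\\
\mathbf{P}^{+} &= \left(\mathbf{I} - \mathbf{K}\,\mathbf{H}\right)\mathbf{P},
\end{aligned}
$$

with $\mathbf{S} = \mathbf{H}\,\mathbf{P}\,\mathbf{H}^{\top} + \mathbf{R}$ as above; in the iterated variant, $h$ and $\mathbf{H}$ are re-linearized about the latest state estimate before the next inner iteration.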

2.7. Fast Inverse Transformation for Covariance Matrices

The inverse of the measurement covariance has to be computed at each iteration step, and inverting it by means of the adjugate (the transpose of the matrix of cofactors) requires considerable running time. In order to handle the growing size of this matrix efficiently as the number of targets varies, Cholesky decomposition is applied to invert the measurement covariance and reduce the running time at each iteration step. The measurement covariance is factored into the product of a lower triangular matrix and its transpose, which is possible because the measurement covariance is positive semidefinite. Such a factorization of a given matrix is not unique in general, and the Cholesky factorization is a standard method for factoring positive semidefinite matrices. The elements of the lower triangular factor are computed by the Cholesky decomposition formulas given below, expressed in terms of the elements of the measurement covariance.

A matrix equation with a lower triangular coefficient matrix can be solved by forward substitution, and one with an upper triangular coefficient matrix by back substitution. In our proposed method, Cholesky decomposition and forward-and-back substitution are combined to invert the measurement covariance and reduce the computational load.
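A compact sketch of this procedure, inverting a small symmetric positive definite matrix by Cholesky factorization followed by forward and back substitution (the matrix storage and the 2 × 2 example are illustrative assumptions, not the paper's data):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Invert a symmetric positive definite matrix S by factoring S = L*L^T
// and solving S*X = I column by column with forward/back substitution.
Mat choleskyInverse(const Mat& S) {
    const int n = static_cast<int>(S.size());
    Mat L(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = S[i][j];
            for (int k = 0; k < j; ++k) sum -= L[i][k] * L[j][k];
            L[i][j] = (i == j) ? std::sqrt(sum) : sum / L[j][j];
        }
    }
    Mat X(n, std::vector<double>(n, 0.0));
    std::vector<double> y(n);
    for (int col = 0; col < n; ++col) {
        // Forward substitution: L * y = e_col.
        for (int i = 0; i < n; ++i) {
            double sum = (i == col) ? 1.0 : 0.0;
            for (int k = 0; k < i; ++k) sum -= L[i][k] * y[k];
            y[i] = sum / L[i][i];
        }
        // Back substitution: L^T * x = y.
        for (int i = n - 1; i >= 0; --i) {
            double sum = y[i];
            for (int k = i + 1; k < n; ++k) sum -= L[k][i] * X[k][col];
            X[i][col] = sum / L[i][i];
        }
    }
    return X;  // X = S^{-1}
}

int main() {
    Mat S = {{4.0, 2.0}, {2.0, 3.0}};   // simple SPD test matrix
    Mat Sinv = choleskyInverse(S);      // expected: [[0.375,-0.25],[-0.25,0.5]]
    std::printf("%.3f %.3f\n%.3f %.3f\n",
                Sinv[0][0], Sinv[0][1], Sinv[1][0], Sinv[1][1]);
    return 0;
}
```

The factorization costs on the order of n³/3 operations, which is consistent with the large saving over cofactor-based inversion reported in Section 4.2.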

3. The Analysis of the Algorithm

In this section, two key points about the proposed algorithm are presented. The first is an analysis of the algorithm showing that it does not need to obtain depth information even though it is designed to track the 3D locations of targets. The second explains why the simple template matching technique, cross-correlation search, is applied.

3.1. Depth Information Analysis in the Proposed Algorithm

In principle, depth information must be known when the robot localizes targets in 3D space using monocular vision. However, the proposed algorithm does not need depth information to localize targets. As presented in (14), the depth is not required to calculate the two orientation angles. Similarly, it might seem that the depth should be known to compute one of the quantities in (19); it is derived as shown below, and the depth appears to be needed for one of its factors according to the corresponding equation. However, that factor can still be computed without knowing the depth, as proved by the following derivation:

The quantities involved are computed by (14), and (39) can be evaluated without knowing the depth information because of the following identity:

In the same way, it might seem that the depth information should be known to compute the remaining quantity in (19); it is derived as

Some of these factors can be computed without the depth, but one of them apparently cannot, according to (39). However, the quantity can still be estimated without knowing either depth if the relevant terms are combined, as proved by the following equations. Therefore, the quantity is computed without the depths by using (44) and applying (14) to calculate the orientation-angle terms.

Based on these results and (14), we have shown how to work around the fact that the depth information cannot be measured from only one image. This is a very important characteristic of the proposed algorithm.

3.2. Analysis Regarding Using Cross-Correlation Search

Generally speaking, cross-correlation search is regarded as a simple template matching algorithm and is not as robust as SIFT. However, the target is still detected correctly in the search area by cross-correlation search, because the small area estimated by the iterated EKF contains the actual pixel location of the target with high probability. As shown in Figure 3, the predicted pixel locations of the targets are estimated using the sensor model and denoted by green crosses, while the red crosses are the corresponding target pixel locations detected by cross-correlation search. Since the red crosses lie close to the actual target pixel locations, this result shows that it is feasible for the proposed algorithm to use cross-correlation search as its template matching algorithm.

4. Experimental Results

In order to validate the proposed algorithm for localizing targets when ground truth is available, we performed a number of experiments. The algorithm is implemented in C++ and runs on a PC with a 2.0 GHz microprocessor. The monocular sensor is a single camera with a wide-angle lens, chosen so that more targets can be observed in one image and the tracking rate can be higher. The camera's field of view is 170° with a focal length of 1.7 mm. The image measurements, received at a rate of 15 Hz, are corrupted by distortion noise. The reported experimental results are obtained under high-speed robot motion: the average velocity of each case is higher than 20 cm/s, and the maximum velocity over all cases is 69.11 cm/s. Throughout the experiments, the initial distance between the camera and the targets ranges from 1.68 m to 5.76 m. The unknown depth of each target is estimated from sequential images with the EKF, and six default depth values (3.0 m, 3.5 m, 4.0 m, 4.5 m, 5.0 m, and 5.5 m) are assumed for each experimental case. All of the sensors mounted on our system are shown in Figure 4.

Some measurement errors are caused by camera distortion when using a wide-angle lens. Before validating the proposed algorithm, we performed an experiment to estimate the distortion noise by making use of artificial landmarks, choosing the corners of the landmarks as targets. The undistorted pixel locations of the corners are estimated by the pinhole model, and image correction is then applied to compute their predicted distorted pixel locations. Owing to the wide-angle lens, there is an error between the predicted distorted pixel location and the actual pixel location detected by the cross-correlation search. In the proposed algorithm, the predicted undistorted pixel location is estimated from the prior process state vector, and the distortion error is introduced when transforming the undistorted pixel locations of the targets to the distorted pixel locations. Therefore, the distortion error should be taken into account very carefully. Based on this experimental result, the distortion noise is assumed to be 20 pixels.

4.1. Results of Target Localization

To demonstrate the performance and practical applicability of the proposed MVTL algorithm, experiments were conducted in which a mobile robot passes through doors by tracking feature targets located on the door while localizing itself simultaneously.

4.1.1. Target Localization Experiments

When a robot has to pass through doors, it must know the accurate positions of the doors. The experiment shown in Figure 5 is designed to verify the accuracy of MVTL with the door as the tracked target. Figure 6 shows the real environmental features used as targets, whose positions must be estimated for the task of passing through the door. Since the developed system does not require an initial setting at the first iteration, it can be used practically under dynamic motion once a target is detected, without a training process on environmental features for robot localization. Two different moving paths with varied linear velocities and orientations are examined, as shown in Figure 7. The initial true depth between the robot and the doors is about 3.5 m for both paths; path A is a straight forward path, and path B approaches the doors from the left. The experimental results are summarized in Table 1, and the average error of the proposed approach is about 10 cm.

According to the experimental results shown in Figures 8 and 9, the camera alone cannot provide measurement information precise enough for MonoSLAM [18] to correct the prior process state variables under high-speed movement when no control input is provided. The results also show that the image distortion noise is too large for the MonoSLAM [18] model. Owing to the combination of hybrid sensor information, the proposed MVTL algorithm is capable of localizing targets in the presence of image distortion noise, and its average error is lower than 0.11 m. This demonstrates that the MVTL algorithm tracks targets robustly under higher-speed movement with larger measurement noise.

4.1.2. Acceleration Approximation-Based Error Comparison

We also performed an experiment comparing encoders with an IMU. We choose an encoder instead of an IMU as our sensor in the indoor environment. The acceleration is approximated from the encoder velocities in order to obtain on-line data. Using the previous velocity difference rather than the forward difference to approximate the acceleration of the robot at each state introduces additional acceleration noise; however, an iterated EKF is robust enough to compensate for the errors from this low-cost sensor and from the acceleration approximation. The acceleration error of a simulated IMU is lower than 0.001 m/s². Based on the results shown in Table 2, the localization errors obtained with encoders are still acceptable even though the previous velocity is used to approximate the acceleration at each iterated state. This shows that the errors are reduced gradually by the EKF recursive loop.

4.1.3. Performance of Depth Initial Conditions

From the experimental results shown in Figure 10, we conclude that the ability to localize targets depends on the parallax between different views rather than on the distance between the robot and the target. This is a crucial viewpoint for analyzing the stability of the system in terms of means and deviations. We analyze not only single-target localization but also the details of multiple-target localization. Referring to the above experiments, the MVTL algorithm localizes targets under higher-speed motion, and the six default depth values converge to similar estimates with small errors.

4.2. Experiments of Fast Inverse Matrix Transformation

Finally, we performed an experiment to verify the Cholesky decomposition and forward-and-back substitution. In this experiment, the total testing time is 4.2 s at a 15 Hz image rate. At first, the transpose of the matrix of cofactors is used to invert the measurement covariance matrix, whose size ranges from 2 × 2 to 8 × 8. The experimental results shown in Table 3 indicate that the running time becomes long when the system inverts a 6 × 6 or larger matrix. However, the running time is reduced dramatically when the system inverts an 8 × 8 matrix using Cholesky decomposition and forward-and-back substitution. The test result shows that the system remains real-time even when localizing four targets concurrently.

5. Conclusion and Future Work

5.1. Conclusion

The experiments we performed show that the proposed algorithm localizes targets with centimeter-level accuracy under higher-speed movement. In addition, we have validated that it is practical to use odometry sensors for target tracking as well. The system starts the recursive procedure without an initial setting, and the robot can move rapidly while localizing targets. The EKF-based algorithm is practical and robust in reducing sensor errors even though low-cost sensors are used in our system, and its efficiency benefits from Cholesky decomposition and other computational techniques. From the experimental results shown in Figure 10, we conclude that the ability to localize targets depends on the parallax between different views rather than on the distance between the robot and the target. Consequently, the proposed algorithm has a high potential for extension to surveillance and monitoring with UAVs equipped with aerial odometry sensors, and it can likewise be used widely for other robot tasks.

5.2. Future Work

The targets are assumed to be stationary landmarks in the proposed algorithm. Modifying the algorithm to track a moving target would be an interesting and challenging research direction, since this type of technique is necessary and will be widely used in many applications. For instance, a UGV has to know the intended motions of other moving objects in order to plan its navigation path for the next state. Therefore, extending the proposed algorithm to track moving targets is a good topic for future work. Another important direction for our system is to find methods or algorithms that improve the accuracy of the measurement data from the low-cost sensors, since more accurate measurements will reduce the localization errors.

Appendices

The derivation of the system covariance is presented in the appendices: Appendix A derives the modified covariance of the motion model at each iteration step, and Appendix B derives the new covariance after adding new targets.

A. The Modified System Covariance in Motion Model

The process state vector is predicted by the system process model; it is determined by the old state and the control inputs applied in the process, where the control vector is corrupted by a mean-zero noise process. Both equations are described accordingly. The prior estimate error covariance is derived as a function of the prior state estimate. If the control vector is not correlated with the state, the cross-covariance terms are equal to zero matrices and the prior estimate error covariance simplifies to the form used in (A.4).

B. The New Covariance after Adding New Targets

When the robot tracks a target, the location information of the new target is added to the process state vector. The form of the new information is determined by the sensor measurement model, and the new state block is composed of the random variables of the new target information. The new covariance is expressed as follows.

The size of the posterior estimate error covariance increases with the number of targets (Figure 11). The EKF provides a proper way to include all the measurement information. There are two cases regarding the relationship between the new target and the other variables. In the first case, the estimate of the new target's location is independent of the estimates of the other targets and the robot variables; the covariance matrix of the new target is then given by its own covariance, and the cross-covariance block is a zero matrix because the new target is independent of the other targets and the robot by definition. In the other case, the new target is determined as a function of its spatial relation to other target locations, so the new target's location is correlated with the estimates of the other targets and the robot; the function and the covariance matrix of the new target are then given accordingly.

Acknowledgment

This work is supported by the National Science Council under Grant no. NSC 99-2221-E-009-159.