Abstract

Augmented Reality (AR) aims at enhancing our perception of the real world by adding fictitious elements that are not naturally perceptible, such as computer-generated images, virtual objects, text, symbols, graphics, sounds, and smells. The quality of the real/virtual registration depends mainly on the accuracy of the 3D camera pose estimation. In this paper, we present an original real-time localization system for outdoor AR which combines three heterogeneous sensors: a camera, a GPS receiver, and an inertial sensor. The proposed system is subdivided into two modules: the main module is vision based and estimates the user's location using a markerless tracking method. When the visual tracking fails, the system switches automatically to the secondary localization module composed of the GPS and the inertial sensor.

1. Introduction

The idea of combining several kinds of sensors is not recent. The first multi-sensor systems appeared in robotics where, for example, Vieville et al. [1] proposed to combine a camera with an inertial sensor to automatically correct the path of an autonomous mobile robot. This idea has been exploited in recent years by the Mixed Reality community. Several works proposed to fuse vision and inertial sensor data using a Kalman filter [26] or a particle filter [7, 8]. The strategy consists in merging all the data from all the sensors to localize the camera following a prediction/correction model. The data provided by inertial sensors (gyroscopes, magnetometers, etc.) are generally used to predict the 3D motion of the camera, which is then adjusted and refined using vision-based techniques. The Kalman filter is generally implemented to perform the data fusion. It is a recursive filter that estimates the state of a linear dynamic system from a series of noisy measurements. Recursive estimation means that only the estimated state from the previous time step and the current measurement are needed to compute the estimate of the current state, so no history of observations and/or estimates is required.

In [2], You et al. developed a hybrid sensor combining a vision system with three gyroscopes to estimate the orientation of the camera in an outdoor environment; their visual tracking refines the obtained estimate. The system described by Ababsa [5] combines edge-based tracking with inertial measurements (angular velocity, linear acceleration, magnetic fields). The visual tracking is used for accurate 3D localization, while the inertial sensor compensates for errors due to sudden motion and occlusion. The measurements of gravity and magnetic field are used to limit the drift problem, and the gyroscope is employed to automatically reset the tracking process. Data provided by the two sensors are combined with an extended Kalman filter using a constant velocity model. More recently [9], the same authors proposed to use GPS positions to re-initialize the visual tracking when it fails: the initialization of the visual tracking is obtained by defining a search area represented by an ellipse centered on the GPS position.

Recently, Bleser and Stricker [6] proposed to combine texture-based tracking with an inertial sensor. The camera pose is predicted from the data provided by the accelerometers using an Extended Kalman Filter (EKF). To estimate the pose, the EKF fuses the 2D/3D correspondences obtained from the image analysis with the inertial measurements acquired from the inertial sensor. A rendering of a CAD model (textured patches) is made using the predicted poses, which allows iteratively aligning the textured patches in the current image to estimate the 2D motion and update the estimate given by the filter. Natural feature points are tracked by a KLT (Kanade-Lucas-Tomasi) tracker. The motion model assumes constant acceleration and constant angular velocity. This approach requires offline preparation to generate a textured CAD model of the environment. Hu et al. [10] proposed to combine a camera, a GPS and an inertial gyroscope sensor. Their fusion approach is based on a PPM (parameterized model matching) algorithm: the road shape model is derived from the digital map with respect to the GPS position and matched with road features extracted from the real images. The fusion follows a predictor-corrector control scheme. After checking data integrity, GPS data start a new loop and reset the gyro integration. The gyro prediction is fed back into the gyro integration module as a dynamic correction factor, and when the image feature tracking fails, the gyro prediction is used for the camera pose estimation. Ababsa and Mallem [8] proposed a particle filter instead of the Kalman filter. Particle filters (PF), also known as sequential Monte Carlo methods, are sophisticated simulation-based estimation techniques generally used for Bayesian models. They represent an alternative to the extended Kalman filter, with the advantage that they approach the optimal Bayesian estimate given enough samples. Ababsa et al. merged data from a fiducial-based method with inertial data (gyros and accelerometers). Their fusion algorithm is based on a particle filter with sampling importance resampling (SIR). As the two sensors have different sampling frequencies, the authors implemented two complementary filters: if no vision data is available (e.g., during occlusion), the system uses only the data from the inertial sensor, and vice versa. Aron et al. [11] used the inertial sensor to estimate the orientation of the camera only when the visual tracking fails. The orientation allows tracking the visual primitives by defining a search area in the image to perform feature matching. A homography is estimated from this set of matched features to estimate the camera pose, and the errors of the inertial sensor are taken into account to optimize the search area. Unlike the approach proposed by Aron et al. [11], which only estimates the camera orientation, Maidi et al. [12] used an inertial sensor to estimate both the position and the orientation. Their multimodal system tracks fiducials and handles occlusions by combining several sensors and techniques depending on the conditions existing in the environment. When the target is partially occluded, the system uses point-based tracking; in the presence of a total occlusion of the fiducials, the inertial sensor helps to overcome the vision failure. However, estimating position from acceleration produces drift over time, resulting in tracking failure. Combining sensors following an assistance scheme seems more interesting than data fusion.
Indeed, the assistance approach makes the system more intelligent, so that it can adapt itself to different situations and use, at each instant, only the data provided by the available sensors.

In this research work, we are interested in developing an original localization system combining three heterogeneous sensors (camera, GPS and IMU) in order to increase accuracy and robustness. Our objective is to provide a generic solution for 3D localization, 3D visualization and interaction that is adaptable to several outdoor environments. The remainder of the paper is organized as follows: Section 2 gives an overview of our hybrid localization system. Section 3 formulates the camera pose estimation problem when using point features and describes the vision-based localization subsystem. Section 4 details the IMU-GPS localization subsystem. Section 5 explains how the whole system operates in real time. Section 6 presents experiments and results obtained in outdoor environments under real conditions. We conclude in Section 7 and suggest future work.

2. Hybrid Localization System Overview

Our localization system is wearable and composed of a Tablet PC, an Inertial Measurement Unit (IMU), a camera and a GPS receiver (see Figure 1). The IMU is rigidly coupled with the camera and used to estimate the camera rotation. We used an Xsens MTi sensor, which contains gyroscopes, accelerometers and magnetometers. The advantage of the MTi is that it incorporates an internal digital signal processor which runs a real-time sensor fusion algorithm providing a reliable 3D orientation estimate. Data from the MTi are measured synchronously at 100 Hz. For vision, we opted for the uEye UI-2220-RE-C CCD camera with a 6 mm lens, which is extremely compact, low-cost and well adapted to outdoor environments. Color images with a resolution of 768 × 576 pixels are streamed to the PC at a frame rate of 25 Hz over a USB 2.0 connection. A Trimble GPS Pathfinder ProXT receiver carried by the user provides GPS measurements. The ProXT receiver integrates a multipath rejection technology providing submeter accuracy; its update rate is about 1 Hz and it communicates with the computer over a Bluetooth wireless connection. The three sensors providing measurements to the system are synchronized in hardware and run at different rates.

This localization system uses an assistance scheme between the sensors and is thus subdivided into two subsystems: a main subsystem and an auxiliary one. The main subsystem corresponds to the vision-based localization. The auxiliary subsystem, composed of the GPS and the inertial sensor, is used only when the visual tracking fails. This hybrid system continuously estimates the position and orientation of the point of view, even when the vision fails. Figure 2 provides a flow chart describing the 3D localization process using the proposed assistance scheme.

3. Vision-Based Localization System

3.1. Camera Pose Problem Formulation

Throughout this paper, we assume a calibrated camera and a perspective projection model. If a point has coordinates $(x, y, z)^T$ in the coordinate frame of the camera, its projection onto the image plane is $(x/z, y/z, 1)^T$. In this section, we present the constraints for camera pose determination when using point features.

3.2. Problem Definition

Let $\mathbf{p}_i = (x_i, y_i, z_i)^T$, $i = 1, \ldots, n$, $n \geq 3$, be a set of non-collinear 3D reference points defined in the world reference frame. The corresponding camera-space coordinates $\mathbf{q}_i = (x'_i, y'_i, z'_i)^T$ are given by:

$$\mathbf{q}_i = R\,\mathbf{p}_i + T, \tag{1}$$

where $R = (\mathbf{r}_1^T, \mathbf{r}_2^T, \mathbf{r}_3^T)^T$ and $T = (t_x, t_y, t_z)^T$ are a rotation matrix and a translation vector, respectively. $R$ and $T$ describe the rigid body transformation from the world coordinate system to the camera coordinate system and are precisely the parameters associated with the camera pose problem.

Let the image point $\mathbf{g}_i = (u_i, v_i, 1)^T$ be the projection of $\mathbf{p}_i$ on the normalized image plane. Using the camera pinhole model, the relationship between $\mathbf{g}_i$ and $\mathbf{p}_i$ is given by:

$$\mathbf{g}_i = \frac{1}{\mathbf{r}_3^T \mathbf{p}_i + t_z}\left(R\,\mathbf{p}_i + T\right), \tag{2}$$

which is known as the collinearity equation.

The point constraint corresponds to the image-space error; it relates the 3D reference points, their corresponding extracted 2D image points and the camera pose parameters as follows:

$$E^p_i = \left(\hat{u}_i - \frac{\mathbf{r}_1^T \mathbf{p}_i + t_x}{\mathbf{r}_3^T \mathbf{p}_i + t_z}\right)^2 + \left(\hat{v}_i - \frac{\mathbf{r}_2^T \mathbf{p}_i + t_y}{\mathbf{r}_3^T \mathbf{p}_i + t_z}\right)^2, \tag{3}$$

where $\hat{\mathbf{m}}_i = (\hat{u}_i, \hat{v}_i, 1)^T$ are the observed image points.

The pose estimation problem is to find the rigid transformation $(R, T)$ that best fits the known 3D reference points to the observed 2D image points. Usually this is achieved by minimizing an accumulation of the errors (3) over all points (least-squares methods), typically with the Gauss-Newton or Levenberg-Marquardt method [13, 14].
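For illustration, the following sketch shows how such a least-squares pose refinement can be set up with a Rodrigues (axis-angle) parameterization of the rotation. It is a minimal SciPy-based example with hypothetical inputs, not the authors' exact implementation; the initial guess is assumed to come from the initialization stage described later.

```python
# Minimal sketch: refine a camera pose by minimizing the image-space error of Eq. (3).
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation


def residuals(params, pts_3d, pts_2d):
    """Reprojection residuals; params = [rx, ry, rz, tx, ty, tz] (rotation as rotvec)."""
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    cam = pts_3d @ R.T + t                 # q_i = R p_i + T  (Eq. 1)
    proj = cam[:, :2] / cam[:, 2:3]        # projection on the normalized image plane (Eq. 2)
    return (proj - pts_2d).ravel()         # differences accumulated in Eq. (3)


def estimate_pose(pts_3d, pts_2d, init):
    """Levenberg-Marquardt refinement; pts_2d are observed normalized image points."""
    sol = least_squares(residuals, init, args=(pts_3d, pts_2d), method="lm")
    return Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:]
```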

3.3. Discussion

3D-2D feature matching is critical for camera pose estimation and remains a difficult, unsolved problem in computer vision. The tracking system needs an initialization that provides a set of good 3D-2D matched points. In practice, the accuracy of the matching process depends on the relevance of the information associated with the 3D points for their recognition. One interesting approach is to define a reference patch around each image point corresponding to a 3D model point. Matching is then performed by aligning these reference patches with those extracted from the current frame; correlation can be used to measure the similarity between the patches, as in [15]. However, this method is not robust against illumination variation. Other approaches use SIFT descriptors [16] for their robustness to changes in viewing conditions. The main disadvantage of SIFT is its complexity and high computational cost, which makes it unsuitable for real-time applications when applied to the whole image. In this work, we propose two vision-based initialization approaches. The first one is semi-automated and requires user intervention to guide the matching process; it is used to start the tracking system. The second approach is fully automated; it is executed when the tracking is lost. The next sections give an overview of these approaches.

3.4. Semi-Automated Initialization Approach

The proposed semi-automated initialization approach is performed in two steps. First, the wireframe model of the environment (here the building frontage) is rendered in real time over the video stream coming from the camera, using a set of predefined poses (see Figure 3(a)). At the same time, the user moves the camera in order to align the projected model with its image. Once this alignment is achieved (see Figure 3(b)), the user validates the corresponding pose and the system switches to the matching step in order to perform, with high accuracy, the 3D-2D point matching. Aligning the rendered model limits the search area of the 2D points in the current image, which makes the approach faster and more robust to outliers.

The 3D-2D matching is performed as follows. A search box is defined around each projected 3D model point in the aligned image, and interest points are extracted from these image regions. As the 3D points represent corners of the model, we use the Harris detector [17] to extract the 2D interest points. Then, a SIFT descriptor is computed and associated with each extracted Harris point. We chose the SIFT descriptor because it is scale and rotation invariant and, restricted to these small regions, remains compatible with real-time tracking. The distances between the reference descriptor associated with the 3D point and the descriptors of the extracted 2D points are computed and compared; the 2D point that minimizes this distance is selected as the corresponding 2D point. We also use a RANSAC algorithm [18] to detect and remove outliers in the matching set, and thus increase the accuracy of the initialization.
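The sketch below illustrates this matching step with OpenCV: Harris corners are extracted inside each search box, a SIFT descriptor is computed for each corner, and the corner closest to the reference descriptor is kept. The input names (`boxes`, `ref_descriptors`) are hypothetical, and the subsequent RANSAC rejection step is omitted for brevity.

```python
# Sketch of the per-box Harris + SIFT matching (assumed inputs, not the authors' code).
import cv2
import numpy as np

def match_in_boxes(gray, boxes, ref_descriptors):
    """boxes[i] = (x, y, w, h) search box of 3D point i; ref_descriptors[i] = its SIFT descriptor."""
    sift = cv2.SIFT_create()
    matches = []  # list of (3d_point_index, matched 2D point)
    for i, (x, y, w, h) in enumerate(boxes):
        roi = gray[y:y + h, x:x + w]
        # Harris-based corner detection inside the search box
        corners = cv2.goodFeaturesToTrack(roi, maxCorners=20, qualityLevel=0.01,
                                          minDistance=5, useHarrisDetector=True)
        if corners is None:
            continue
        kps = [cv2.KeyPoint(float(c[0][0] + x), float(c[0][1] + y), 7) for c in corners]
        kps, desc = sift.compute(gray, kps)           # SIFT descriptor for each corner
        if desc is None:
            continue
        # keep the corner whose descriptor is closest to the reference descriptor
        d = np.linalg.norm(desc - ref_descriptors[i], axis=1)
        matches.append((i, kps[int(np.argmin(d))].pt))
    return matches
```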

In order to validate the whole set of matched 3D-2D points, we introduce a coherence test which is used as a quality measure for the estimated camera pose. We assume that this pose is close to the one selected when the wireframe model was aligned with the image. Let $P_a = [R_a \mid T_a]$ be the predefined camera pose used for the model/image alignment, and $P = [R \mid T]$ the camera pose estimated from the set of candidate matched points. If the two matrices are identical, we can write:

$$P_a P^{-1} = I, \tag{4}$$

where $I$ is the $4 \times 4$ identity matrix.

So, the trace of the matrix $P_a P^{-1}$ tends to 4 as the two poses get closer. Our coherence test can therefore be formulated as:

$$\delta < \operatorname{Trace}\left(P_a P^{-1}\right) < 4, \tag{5}$$

where $\delta$ is a threshold below which the two matrices are considered different.
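A direct transcription of this test could look as follows; the value of the threshold `delta` is an assumption that would be tuned experimentally.

```python
# Coherence test of Eqs. (4)-(5) on 4x4 homogeneous pose matrices [R | T; 0 0 0 1].
import numpy as np

def coherent(P_a, P, delta=3.8):
    trace = np.trace(P_a @ np.linalg.inv(P))
    # for rigid transforms the trace is at most 4 and equals 4 when the two poses coincide
    return trace > delta
```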

3.5. Automated Initialization Approach

Unlike the semi-automated approach described above, the automated initialization procedure does not require user intervention. The system switches automatically to this mode whenever the tracking fails because of noise, occlusion or image blur. This approach is performed as follows (see Figure 4).

Let $F_i$ be the reference frame corresponding to the last captured image before the tracking failed. Several pieces of information are associated with this frame, namely the camera pose $P_i$ and the set of matched 3D-2D points. The idea is to generate new 3D-2D matches between the reference and the current frames.

For that, we first project the 3D points onto the current frame using the reference camera pose $P_i$ to generate predicted search areas. These image areas, named patches, are centered on the projected 3D points and have a rectangular shape. The interest points corresponding to SIFT features are then extracted inside these patches and matched with those extracted from the reference frame. We also use a RANSAC algorithm to discard outliers. To find the 3D-2D matched points for the current frame, we only need to identify the transformation that maps interest points defined in the reference image to those extracted in the current image. We assume that this transformation is a homography because the motion between the two frames is small.

Let $H_{ij}$ be this homography, and $m_i$ and $m_j$ the interest points extracted from the reference and current frames $F_i$ and $F_j$, respectively. The relationship between $m_i$, $m_j$ and $H_{ij}$ is defined as:

$$m_j = H_{ij}\, m_i. \tag{6}$$

Once the homography is estimated, we apply it to the 2D image points associated with the 3D model points in the reference frame, in order to transfer them to the current image. This updates the correspondences between 3D and 2D points for the current image and hence restarts the tracking process automatically.
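A possible OpenCV sketch of this re-initialization step is given below. For brevity, the SIFT features are extracted over the whole images rather than only inside the predicted patches, and all input names are hypothetical.

```python
# Sketch: SIFT matching + RANSAC homography + transfer of the model points (Eq. 6).
import cv2
import numpy as np

def reinitialize(ref_img, cur_img, ref_model_pts_2d):
    """ref_model_pts_2d: 2D projections of the 3D model points in the reference frame."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(ref_img, None)
    kp2, des2 = sift.detectAndCompute(cur_img, None)
    # brute-force matching with Lowe's ratio test
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # homography H_ij estimated with RANSAC to discard outliers
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    # transfer the model points to the current frame to restart the tracker
    pts = np.float32(ref_model_pts_2d).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)
```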

3.6. Visual Tracking

Once the vision system is initialized, the visual tracking can start. To estimate the camera pose, we must maintain the 2D/3D matching for each current view. This can be achieved by frame-to-frame 2D point tracking, which consists in following features from one frame $t-1$ to the next frame $t$. Several approaches can be used, such as correlation-based matching methods; however, they are very expensive in computing time. To track 2D features in real time, the chosen method must be fast and accurate; for that, the KLT tracker [19] can be adopted. This algorithm uses an optical flow computation to track feature points, or a set of predefined points, from the previous image $I_{t-1}$ to the current image $I_t$. In our case, it tracks the set of 2D points associated with the visible 3D points. Briefly, each 2D point is searched in the neighborhood of its position in view $t-1$ by minimizing the brightness difference. To reduce the computation time, the KLT tracker uses a pyramid of images of the current view: tracking is done at the coarsest level and then propagated to the finest one. This allows following the features over a long distance with good precision. The approach is fast and accurate, but it requires that the tracked points always remain visible, so it does not handle occlusions.
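A minimal version of this frame-to-frame tracking using OpenCV's pyramidal Lucas-Kanade implementation might look as follows; the pyramid depth and window size are assumptions, not values from the paper.

```python
# Sketch of the KLT frame-to-frame tracking of the 2D points linked to visible 3D points.
import cv2
import numpy as np

def track_points(prev_gray, cur_gray, prev_pts):
    """Returns the tracked points and a boolean mask of successfully tracked ones."""
    lk_params = dict(winSize=(21, 21), maxLevel=3,   # 3-level image pyramid, coarse to fine
                     criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, cur_gray, prev_pts.astype(np.float32).reshape(-1, 1, 2), None, **lk_params)
    ok = status.ravel() == 1
    return cur_pts.reshape(-1, 2), ok
```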

4. IMU-GPS Localization System

This subsystem replaces the vision-based localization system when the latter fails. The position and orientation given by the vision subsystem are then substituted by the absolute position provided by the GPS receiver and the orientation given by the inertial sensor. The use of the Auxiliary subsystem is not limited to replacing the vision subsystem: it is also used to initialize it. Moreover, from the position and orientation given by this subsystem, we can assess the accuracy of the 3D localization estimated by the vision subsystem by defining confidence intervals. The Auxiliary subsystem is composed of two modules: prediction and correction. The prediction module is used to predict the accuracy errors of the localization system; it is based on online training of the error between the two subsystems. Once the localization system switches to the Auxiliary subsystem, the error is predicted following a Gaussian model and used to improve the position and orientation provided by the GPS and the inertial sensor. The two parts composing the system interact continuously with each other. In addition, using the GPS for position estimation avoids the drift of the inertial sensor, which is used only for orientation estimation.

4.1. Sensors Calibration

In our hybrid system, each sensor provides data in its own reference frame. The inertial sensor computes the orientation between a body reference frame attached to it and a local level reference frame, while the GPS position is expressed in an earth reference frame defined by the WGS84 (World Geodetic System) standard. For registration, we need to continuously estimate the camera pose, which relates the world reference frame to the camera reference frame. Thus, the 3D localization provided by the IMU-GPS system must be aligned with the camera reference frame. The sensors must be brought into a unified reference frame in order to obtain a common position and orientation of the point of view, so the hybrid sensor must be calibrated to determine the relationships between the sensors and thus unify the measurements. The accuracy of the IMU-GPS system depends on the accuracy of the calibration processes. We have developed an original calibration method in order to compute these transformations with high accuracy. Our method is divided into two calibration processes: the first one estimates the relationship between the inertial sensor and the camera (Inertial/Camera calibration), and the second one estimates the transformation which maps the GPS position to the camera position (GPS/Camera calibration).

4.1.1. Inertial/Camera Calibration

In order to deduce the camera orientation from the orientation given by the inertial sensor, we need to estimate the transformation between the reference frames attached to the camera and to the inertial sensor. The reference frames used are illustrated in Figure 5.

Let $R_{CW}$ be the rotation of the world frame $R_W$ with respect to the camera frame $R_C$, and $R_{CI}$ the rotation of the inertial frame $R_I$ with respect to the camera frame $R_C$. The rotation between the body frame $R_I$ and $R_G$ is denoted $R_{IG}$. Finally, $R_{GW}$ represents the rotation of the world frame $R_W$ with respect to $R_G$. The rotation $R_{IG}$ is given by the inertial sensor and $R_{CW}$ is obtained from the camera pose estimation; thus, we need to compute the rotations $R_{CI}$ and $R_{GW}$. The relationship between the different frames is expressed by:

$$R_{CW} = R_{CI}\, R_{IG}\, R_{GW}. \tag{7}$$

In this case, the Inertial/Camera calibration process consists in estimating the rotation matrix $R_{CI}$ and then deducing the matrix $R_{GW}$. We assume that the $Z$-axis of the $R_G$ frame points up along the vertical and is collinear with the $Z$-axis of the world coordinate frame $R_W$ (Figure 6).

Therefore, this configuration implies that the matrix $R_{GW}$ corresponds to a single rotation of angle $\theta$ around the $Z$-axis, so it can be expressed as:

$$R_{GW} = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}. \tag{8}$$

Writing $R_{IG} = (r^1_{IG}, r^2_{IG}, r^3_{IG})$, where $r^i_{IG}$ is the $i$th column of the $R_{IG}$ matrix, and combining (7) with (8), we obtain:

$$R_{CW} = \begin{pmatrix} R_{CI}\, r^1_{IG} & R_{CI}\, r^2_{IG} & R_{CI}\, r^3_{IG} \end{pmatrix} \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix} \tag{9}$$

and then:

$$R_{CW} = \begin{pmatrix} R_{CI}\left(r^1_{IG}\cos\theta + r^2_{IG}\sin\theta\right) & R_{CI}\left(-r^1_{IG}\sin\theta + r^2_{IG}\cos\theta\right) & R_{CI}\, r^3_{IG} \end{pmatrix}. \tag{10}$$

From (10) we can deduce that:

$$r^3_{CW} = R_{CI}\, r^3_{IG}, \tag{11}$$

where $r^3_{CW}$ is the third column of $R_{CW}$.

We rewrite this equation in the linear form $AX = b$:

$$\begin{pmatrix} (r^3_{IG})^T & 0 & 0 \\ 0 & (r^3_{IG})^T & 0 \\ 0 & 0 & (r^3_{IG})^T \end{pmatrix} X = r^3_{CW}, \tag{12}$$

where $X = \left(r^{CI}_{11}, r^{CI}_{12}, r^{CI}_{13}, r^{CI}_{21}, r^{CI}_{22}, r^{CI}_{23}, r^{CI}_{31}, r^{CI}_{32}, r^{CI}_{33}\right)^T$. We need at least three pairs of matrices $(R_{IG}, R_{CW})$ in order to estimate $R_{CI}$ with the least-squares algorithm. Once the matrix $R_{CI}$ is computed, we deduce the matrix $R_{GW}$ using (7):

$$R_{GW} = R_{IG}^T\, R_{CI}^T\, R_{CW}. \tag{13}$$

The estimation of the two matrices $R_{CI}$ and $R_{GW}$ is done offline; it then allows the system to estimate the rotation of the camera in real time using only the orientation matrix given by the inertial sensor. This is very important to recover the camera pose when the visual tracking fails.
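The following sketch illustrates how the linear system (12) can be stacked over several calibration poses and solved by least squares; the final projection onto a valid rotation matrix is a common numerical safeguard and is an assumption, not a step described in the paper.

```python
# Sketch of the off-line Inertial/Camera calibration of Eqs. (11)-(13).
import numpy as np

def calibrate_inertial_camera(R_IG_list, R_CW_list):
    """Inputs: lists of 3x3 rotation matrices recorded for several calibration poses."""
    A, b = [], []
    for R_IG, R_CW in zip(R_IG_list, R_CW_list):
        r3_IG = R_IG[:, 2]                     # third column of R_IG
        r3_CW = R_CW[:, 2]                     # third column of R_CW
        # each pose contributes 3 rows of the linear system (12)
        A.append(np.kron(np.eye(3), r3_IG))    # block rows [r3^T 0 0; 0 r3^T 0; 0 0 r3^T]
        b.append(r3_CW)
    X, *_ = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)
    R_CI = X.reshape(3, 3)
    # project back onto a rotation matrix (safeguard, not in the paper)
    U, _, Vt = np.linalg.svd(R_CI)
    R_CI = U @ Vt
    # deduce R_GW from one pose pair using Eq. (13)
    R_GW = R_IG_list[0].T @ R_CI.T @ R_CW_list[0]
    return R_CI, R_GW
```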

4.1.2. GPS/Camera Calibration

This second calibration process estimates the rigid transformation (rotation + translation) which maps the GPS position to the local world reference frame $R_W$ (see Figure 6). In our case, the position provided by the GPS receiver is expressed in degrees, so a Lambert conic projection is necessary to express the GPS positions in meters. This projection, which superimposes a cone over the sphere of the Earth, minimizes the distortions and is adapted to the region where the experiments were done (i.e., France). Let $p_{\mathrm{gps}}$ be the camera position given by the GPS receiver and expressed in the earth reference frame, and $p_{\mathrm{cam}}$ the camera position in the local world reference frame. The relationship between $p_{\mathrm{gps}}$ and $p_{\mathrm{cam}}$ is given by:

$$p_{\mathrm{cam}} = R\, p_{\mathrm{gps}} + t, \tag{14}$$

where $p_{\mathrm{cam}}$ is computed from the camera pose parameters as $p_{\mathrm{cam}} = -R_{CW}^T\, t_{CW}$.

Finally, the rigid transformation is obtained by minimizing the following criterion:

$$\sum_{i=1}^{n} \left\| p^i_{\mathrm{cam}} - \left(R\, p^i_{\mathrm{gps}} + t\right) \right\|^2. \tag{15}$$

In order to simplify the problem, we use the axis-angle representation $(n, \theta)$ of the rotation matrix $R$ [20]. This allows subdividing the optimization problem into two subproblems: one for the rotation estimation and one for the translation vector.
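As an illustration of criterion (15), the sketch below uses the classical SVD-based closed-form solution (Horn/Kabsch) rather than the axis-angle parameterization adopted in the paper; both minimize the same criterion, and the inputs are assumed to be paired N x 3 arrays of positions.

```python
# Sketch: closed-form rigid alignment minimizing Eq. (15) (Kabsch/Horn, not the paper's solver).
import numpy as np

def fit_rigid_transform(p_gps, p_cam):
    """Finds R, t minimizing sum_i || p_cam_i - (R p_gps_i + t) ||^2."""
    mu_g, mu_c = p_gps.mean(axis=0), p_cam.mean(axis=0)
    H = (p_gps - mu_g).T @ (p_cam - mu_c)      # 3x3 cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_c - R @ mu_g
    return R, t
```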

4.2. Localization Error Prediction

The 3D localization provided by the combination of the GPS and the inertial sensor is less accurate than the vision-based estimate. Computing the resulting error is therefore important in the localization process: it quantifies the quality of the measurements in order to improve the 3D localization estimate provided by the IMU-GPS localization system. The error represents the offset between the camera pose and the position and orientation deduced from the GPS and the inertial sensor. When the vision fails, this error must be predicted. For that, the error is modeled as a regression with a Gaussian process [21]. A Gaussian process is a stochastic process which generates samples and can be used as a prior probability distribution over functions in Bayesian inference.

Let $x_1, x_2, \ldots, x_n$ be a set of training data with associated outputs $y_1, y_2, \ldots, y_n$, where $y_i = f(x_i)$. The goal is to predict the value $y_{n+1}$ associated with a new input $x_{n+1}$. We consider $(Y_1, Y_2, \ldots, Y_{n+1})$ a set of $(n+1)$ random variables with a joint Gaussian distribution of zero mean and covariance matrix $\Sigma_{n+1}$ such that:

$$\Sigma_{n+1} = \begin{pmatrix} \Sigma_n & \kappa \\ \kappa^T & \kappa_{n+1} \end{pmatrix}, \tag{16}$$

where $\Sigma_n$ is an $n \times n$ matrix, $\kappa$ is an $n$-dimensional column vector and $\kappa_{n+1}$ is a scalar. If $y_1, y_2, \ldots, y_n$ are the observed values associated with $x_1, x_2, \ldots, x_n$, then the conditional distribution $P(Y_{n+1} \mid Y_1, \ldots, Y_n)$ is a Gaussian distribution with:

$$\mu_{Y_{n+1}} = \kappa^T \Sigma_n^{-1} \mathbf{y}_n, \qquad \sigma^2_{Y_{n+1}} = \kappa_{n+1} - \kappa^T \Sigma_n^{-1} \kappa, \tag{17}$$

where $\mathbf{y}_n = (y_1, y_2, \ldots, y_n)^T$, $\kappa_{n+1} = \operatorname{cov}(y_{n+1}, y_{n+1})$, $\kappa_i = \operatorname{cov}(y_{n+1}, y_i)$ and $\Sigma_{ij} = \operatorname{cov}(y_i, y_j)$. The covariance between $y_i$ and $y_j$ is taken to be the covariance between $x_i$ and $x_j$, which is given by:

$$\operatorname{cov}\left(x_i, x_j\right) = \frac{1}{\left|N_{ij}\right|} \sum_{n=1}^{N - |i-j|} x_n\, x_{n + |i-j|}. \tag{18}$$

In our case, $x_i$ represents either the GPS position or the orientation matrix given by the inertial sensor, and $y$ is the localization error that we want to predict. So, during the visual tracking, the offset between the IMU-GPS localization system and the vision-based localization system is recorded for the online training step. The training data are used to learn a sampled covariance function. When the visual tracking fails, the Gaussian process predicts the offset made by the GPS and the inertial sensor. This offset, represented by the mean error, is used to correct the estimated 3D localization. Indeed, the Gaussian process allows computing $p(y \mid x)$, the likelihood of the localization error. This likelihood function is Gaussian, with mean and variance at the point $x$ given by the Gaussian-process-based map. In our case, the mean and variance are used to compute the error ellipse of the estimated localization.
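The following sketch shows how the predictive mean and variance of equation (17) and the sampled covariance of equation (18) can be computed; the jitter term added for numerical stability and the interpretation of $|N_{ij}|$ as the number of summed terms are implementation assumptions.

```python
# Sketch of the Gaussian-process error prediction (Eqs. 16-18), with assumed inputs.
import numpy as np

def sampled_cov(x, i, j):
    """Eq. (18): empirical covariance of the training sequence x for lag |i - j|
    (|N_ij| taken here as the number of terms in the sum)."""
    lag = abs(i - j)
    n = len(x) - lag
    return float(np.dot(x[:n], x[lag:lag + n]) / n)

def gp_predict(y_train, k_vec, k_new, Sigma_n, jitter=1e-6):
    """Eq. (17): predictive mean and variance of the localization error Y_{n+1}."""
    Sigma = Sigma_n + jitter * np.eye(len(y_train))   # jitter for numerical stability
    alpha = np.linalg.solve(Sigma, y_train)
    mean = k_vec @ alpha
    var = k_new - k_vec @ np.linalg.solve(Sigma, k_vec)
    return mean, var
```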

5. Real-Time Operating

Our localization system operates according to a finite state machine scheme (see Figure 7). A finite state machine is an abstract model composed of a finite number of states, transitions between those states, and actions. This formalism is mainly used in the theory of computability and formal languages; here it allows the system to switch between the subsystems in real time.

The proposed state machine is composed of three states: the Auxiliary predominance state, the initialization state, and the visual predominance state. The transitions between the states are as follows. In the initialization state, the Auxiliary subsystem provides an estimate of the pose (1). This estimate is refined by the vision subsystem (2). When the visual tracking fails, the Auxiliary subsystem takes over to estimate the 3D localization (3). Since the Auxiliary subsystem is less accurate than the vision subsystem, its estimate is corrected taking into account the predicted error. Thereafter, the estimate is used to re-initialize the visual tracking (4).
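A schematic transcription of this state machine is given below; the state names and the transition tests (`pose_is_valid`, `init_succeeded`) are placeholders for the criteria detailed in Sections 5.1, 5.2 and 5.3.

```python
# Sketch of the three-state machine of Figure 7 (state names and tests are assumptions).
from enum import Enum, auto

class State(Enum):
    AUXILIARY = auto()       # GPS + IMU predominance
    INITIALIZATION = auto()  # (re-)initialization of the visual tracking
    VISION = auto()          # visual predominance

def step(state, pose_is_valid, init_succeeded):
    if state is State.AUXILIARY:
        # (4) the corrected Auxiliary estimate re-initializes the visual tracking
        return State.INITIALIZATION
    if state is State.INITIALIZATION:
        # (1)-(2) the Auxiliary estimate is refined by the vision subsystem
        return State.VISION if init_succeeded else State.AUXILIARY
    # State.VISION: (3) fall back to the Auxiliary subsystem when the tracking fails
    return State.VISION if pose_is_valid else State.AUXILIARY
```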

We have defined three criteria to quantify the quality of the pose estimated by the visual tracking. If one of these criteria is not satisfied, the pose is rejected and the system switches to the Auxiliary subsystem.

5.1. Number of Tracked Points

The number of 2D/3D point matches affects the accuracy of the minimization process used to estimate the camera pose: the larger the set of 2D/3D matched points, the more accurate the estimated pose, and vice versa. We therefore define a minimum number of matches below which it is considered impossible to estimate the pose with the vision subsystem.

5.2. Projection Error

The number of matched points is not sufficient to ensure the accuracy of the pose estimation; the projection error criterion is also used. This error is the mean squared difference between the projections of the 3D points obtained with the estimated pose and the observed 2D points. If the error is larger than an empirical threshold, the pose is considered wrong.

5.3. Confidence Intervals

The data provided by the Auxiliary subsystem can also be used as an indicator of pose validity. From the position and orientation given by the Auxiliary subsystem, confidence intervals are defined. They are represented by an ellipsoid centered on the orientation provided by the inertial sensor and an ellipse whose center is determined by the 2D position given by the GPS. The semi-axes of the ellipse and the ellipsoid can be set to 3σ (three times the standard deviation of the offset between the camera pose and the Auxiliary estimate) or chosen empirically. If the pose computed by the vision subsystem lies within these confidence intervals (position inside the ellipse and orientation inside the ellipsoid), the pose is considered correct.
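The three criteria can be gathered into a single validity test, as sketched below; all thresholds and semi-axes are assumed values (3σ or empirical), not values reported in the paper, and the orientation check is illustrated with Euler angles for simplicity.

```python
# Sketch combining the validation criteria of Sections 5.1-5.3 (all thresholds are assumptions).
import numpy as np

def pose_is_valid(n_tracked, reproj_error, cam_pos, gps_pos, cam_rpy, imu_rpy,
                  min_points=8, max_error=2.0, pos_axes=(3.0, 3.0), rot_axes=(10.0, 10.0, 10.0)):
    cam_pos, gps_pos = np.asarray(cam_pos), np.asarray(gps_pos)
    # 5.1: enough 2D/3D correspondences
    if n_tracked < min_points:
        return False
    # 5.2: mean squared reprojection error below an empirical threshold
    if reproj_error > max_error:
        return False
    # 5.3: position inside the GPS-centered ellipse (2D, metres)
    d_pos = (cam_pos[:2] - gps_pos[:2]) / np.asarray(pos_axes)
    # 5.3: orientation inside the IMU-centered ellipsoid (Euler angles, degrees)
    d_rot = (np.asarray(cam_rpy) - np.asarray(imu_rpy)) / np.asarray(rot_axes)
    return float(np.sum(d_pos**2)) <= 1.0 and float(np.sum(d_rot**2)) <= 1.0
```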

6. Experiments

Several experiments have been carried out to study the behavior of the proposed localization system when used in outdoor environments. The first experiment evaluates the performance of the semi-automated initialization approach under several conditions. Figure 8 shows that this approach performs good matching in spite of illumination changes.

Furthermore, in order to analyze the error in the matching process, we estimated the mean distance between the 2D points obtained after the semi-automated initialization step and the 2D points extracted from the images after refining the camera pose. We found a mean error of 3.8216 pixels with a standard deviation of about 1.0873 pixels. This means that the semi-automated approach is very efficient at generating a rough estimate of the 2D-3D correspondences and really helps the tracking system to converge rapidly to the optimal solution.

We also analyzed the execution time by carefully evaluating the processing time needed to achieve each step in the semi-automated initialization procedure. An example of these computation times is given in Table 1.

This table shows that the computation time needed to achieve the semi-automated initialization matching is quite low, which makes this approach particularly efficient for the initialization stage of the tracking system.

In addition, we also tested the performance of our automated initialization approach. For that we considered several images taken from different viewpoints. Figure 9 shows some of the obtained results. We can see that, for all the considered cases, the reference points (taken on the frontage of the building) are well matched in the current frames. In this example we considered coplanar points.

We also tested our approach with non-coplanar points chosen on the tower of the castle (Figure 10). The results obtained in this case are satisfactory. Indeed, combining the SIFT points with the RANSAC algorithm provides an accurate and robust homography estimation, and thereby good point matching. The matching process in this case gives a mean error of about 1.7823 pixels with a standard deviation of 0.6634 pixels.

The computation time of this approach depends mainly on the SIFT feature extraction and matching. Introducing a prediction stage to limit the search area of the interest points in the current frame significantly reduces the total time of the whole algorithm (in practice, it is divided by two). Compared to similar approaches [15], our automated initialization technique is more flexible and provides better real-time performance.

In the next experiments, we evaluated the performance of our GPS-IMU localization system. The first experiment uses a straight line as ground truth. The origin of this line is defined in front of the origin of the world reference frame. The line is sampled at several positions and, for each sample, we perform several acquisitions, namely images and GPS positions. The sensors are mounted on a tripod to ensure more stability (see Figure 11). The reference positions are measured with a telemeter whose accuracy is about 0.15 m. In addition, for each acquired image, we computed the position and the orientation of the camera.

From the GPS data and the transformation estimated during the calibration step, we deduce the absolute position with respect to the world reference frame associated with the real scene. By comparing the different estimated positions to the reference positions, we find a mean offset of about (1.8374 m; 1.4810 m). The same GPS positions compared to the camera positions give a mean error of (1.7321 m; 1.4702 m) with a standard deviation of (1.8314 m; 1.0116 m). Figure 12 shows the whole trajectories obtained from the GPS and camera position computations compared to the reference trajectory.

The second experiment focused on the relative position between two successive fixed positions. On average, the offset between the reference position and that obtained with the GPS is about 0.7817 m with a standard deviation of 1.06 m. Similar values are given by the vision subsystem, that is, a mean offset of about 0.8743 m with a standard deviation of 0.9524 m. These results demonstrate that the motion estimated by the two subsystems is consistent. The third experiment performed several continuous recordings of GPS/camera positions. The two sensors are time-stamped in order to synchronize them and to retrieve the data acquired at the same time. The positions given by the vision and by the GPS without correction are compared, and the obtained errors are about 0.9235 m on the x-axis (with a standard deviation of 0.6669 m) and 0.8170 m on the y-axis (with a standard deviation of 0.6755 m). In addition, in order to study the error prediction approach, we first used a set of 76 samples acquired continuously to perform the error training. Then, the Gaussian process is used on the last 30 samples to predict the errors. The mean offset between the predicted error and the real one is about (μx = 0.2742 m; σx = 0.4799 m) and (μy = 0.5757 m; σy = 0.5097 m) (see Figure 13). The positions provided by the GPS receiver are then corrected using this predicted error, which improves the 3D localization provided by the Auxiliary subsystem.

To assess the accuracy of the inertial sensor, we compared the orientations produced by the gyroscope to those computed by the vision-based pose estimation algorithm. For that, a video with several orientations was recorded in an outdoor environment. Both orientations show the same behavior (Figure 14).

However, in some cases, we found that external factors can affect the inertial measurements, particularly when defining the local reference frame whose x-axis points toward the local magnetic north. This causes errors in the orientation estimation. To solve this problem, the rotation between the local reference frame associated with the inertial sensor and the world reference frame is re-estimated continuously. The behavior of the whole system was also tested. The initialization process provides the matching of the visible 3D points of the 3D model with their projections in the first view. From this 2D/3D matching, the set of 2D points is defined and tracked frame to frame. For each frame, the wireframe model is registered using the positions and orientations obtained from the hybrid localization system. In Figure 15, the green projection is obtained from the positions and orientations provided by the vision subsystem. The wireframe model is well superimposed on the real view, which demonstrates the accuracy of the camera pose estimation. In magenta, the projected model is obtained with the positions and orientations provided by the Auxiliary subsystem. Figure 15 shows that when the vision fails, the localization system switches to the Auxiliary subsystem to provide the 3D localization. The localization is corrected with the predicted error, which contributes to improving the estimate.

Figure 16 shows that during the occlusion of the tracked points, the GPS-IMU localization system still provides an estimate of the position and orientation of the camera. Therefore, even when a total occlusion occurs, the system can provide a rough estimate of the 3D localization, which would not be the case if we used only the camera.

Furthermore, we evaluated the performance of our 3D localization system in this urban scene. The 3D model of the building is known; it is composed of 120 natural points defined by their 3D coordinates in the world coordinate frame. Several trials at different locations were recorded. The data coming from the three sensors are time-stamped and stored in a data file. The camera pose estimated by the visual tracking is used as ground truth for the performance evaluation. We defined a set of 2D/3D point correspondences from the first frame of the sequence. The interest points are then tracked frame to frame using the KLT feature tracking algorithm [17, 18]. The 2D/3D correspondences are then updated and used to estimate the 3D localization. We compared our approach with a data fusion method using an Extended Kalman Filter. The computational results show the effectiveness of our approach, since it obtains better accuracy. Figure 17 shows the recovered camera trajectory in the world coordinate system; we note that the trajectory estimated by our localization approach is closer to the ground truth than the one estimated by fusing the data coming from the three sensors.

Finally, the performance of our localization system is compared to the state-of-the-art results reported by DiVerdi and Höllerer [22] for their GroundCam system, which also combines a camera, a GPS receiver and an orientation tracker. The authors ran their system along a residential street for approximately 90 seconds and compared the estimated trajectory to a hand-labelled ground truth; the reported RMS error is about 5.5 m. We ran our system under similar conditions around our institution building and obtained an RMS error of about 1.55 m, which shows that our localization system produces significantly better results.

7. Conclusion

In this paper, we presented an original solution for 3D camera localization using multi-sensor technology. The system combines a camera, a GPS and an inertial sensor, and is designed to work in outdoor environments. Instead of fusing all the data, the proposed system is based on an assistance scheme. It is composed of two parts which work in a complementary manner and are controlled by a finite state machine, allowing continuous 3D localization. The vision subsystem, representing the main part, uses point-based visual tracking. Once the vision fails, the system switches to the Auxiliary subsystem, which is composed of the GPS and inertial sensors. The Auxiliary subsystem is less accurate than the vision subsystem, especially for the GPS positioning. Hence, a prediction stage is performed to improve the accuracy of the Auxiliary subsystem: the 3D localization provided by the two subsystems is used to learn, online, the errors made by the Auxiliary subsystem. The two subsystems interact continuously with each other. The obtained results are quite satisfactory with respect to the purpose of mobile Augmented Reality systems and show that the proposed system has good accuracy compared to other approaches.

The system was tested in outdoor environments and has demonstrated its capacity to adapt itself to the various conditions occurring in such environments. For example, when a total occlusion of the scene model occurs, the Auxiliary subsystem takes over the 3D localization estimation until the vision becomes operational again. However, to increase the robustness and efficiency of the whole system, improvements must be made in several parts. Currently, with the implemented vision-based method, the tracked points must always be visible; one challenge is therefore to develop a tracking method which can handle visual occlusions and automatically update the set of tracked points by adding, in real time, new visible points. In addition, other markerless tracking approaches can be combined with the point tracker, such as edge-based methods [23] or fiducial-based approaches [24, 25], to improve the accuracy of the vision-based pose estimation. The fusion process could also be improved by taking into account the motion dynamics of the camera given by the IMU sensor. On the other hand, the experiments have shown that the GPS signal can be obstructed when the user is very close to buildings, so when the system switches to the Auxiliary subsystem, the position cannot always be estimated. This problem can be solved by adding other kinds of positioning sensors which can replace the GPS (RFID, WiFi, etc.). The main idea is to develop a ubiquitous tracking system composed of a network of complementary sensors which can be used separately and in real time, depending on the situations encountered in the environment.