Abstract

To address the low positioning accuracy of mobile robots that rely only on a visual sensor and an IMU, a method based on tight coupling and nonlinear optimization is proposed to obtain a high-precision visual positioning scheme by combining preintegrated inertial measurement unit (IMU) measurements, odometer measurements, and feature observations. First, the observation data are preprocessed, including tracking of the image data, processing of the odometer data, and preintegration of the IMU data. Second, the three sensors are initialized, which includes IMU preintegration, odometer preintegration, and gyroscope bias calibration, as well as the alignment of velocity, gravity, and scale. Finally, a local BA (bundle adjustment) joint optimization and a global graph optimization are established to obtain more accurate positioning results.

1. Introduction

Positioning technology is undoubtedly the basic prerequisite for a wide range of applications such as robot navigation, autonomous driving, and virtual reality. Visual positioning systems have attracted researchers' attention because of their small size, low cost, and simple hardware setup [14]. In recent years, visual SLAM (simultaneous localization and mapping) has developed rapidly; classic methods include the PTAM [5] algorithm proposed by Klein et al. in 2007 and its successor ORB-SLAM [6], as well as LSD-SLAM [7] for the monocular direct method, the semidirect method SVO [8], DSO [9], and DVO-SLAM [10]. Since a full survey is beyond the scope of this paper, this section focuses on multisensor fusion localization rather than the purely visual methods above. Because cameras provide a large amount of information at low cost, they are widely used on mobile robots; the following therefore mainly introduces positioning methods that combine the camera with other sensors.

The way the camera is integrated with other sensors can be divided into loose coupling [11, 12] and tight coupling. In loose coupling, the camera and the other sensors are treated as separate modules that each compute their own pose; the information is then fused, generally with a Kalman filter (KF), an extended Kalman filter (EKF), or similar methods, while the other sensors' measurements are used for state propagation. Tight-coupling methods are mainly divided into two kinds: EKF-based [13, 14] and graph-optimization-based [15, 16]. A widely followed EKF-based VIO method is the MSCKF [13] algorithm proposed by the MARS Laboratory of the University of Minnesota. It marginalizes old map points and poses so that the number of map points and poses in the state vector stays bounded, ensuring both accuracy and computational efficiency. However, with the development of optimization techniques in SLAM, graph-optimization-based VINS has received increasing attention. The limitations of filtering algorithms are analyzed in reference [17], which finds optimization-based methods superior to filter-based ones. The ICE-BA algorithm proposed in reference [18] improves the BA part of VI-SLAM, exploits the incremental nature of SLAM measurements, resolves the conflict between sliding-window marginalization and global loop-closure constraints, and improves computational efficiency. In [8], the IMU preintegration technique was extended to the Lie algebra in 2015 and implemented in the open-source factor-graph optimization library GTSAM. Building on this, [19, 20] designed the VINS-Mono algorithm and showed experimentally that continuous-time preintegration yields better dynamic performance than discrete preintegration. However, sensor applicability differs considerably across platforms. A mobile robot mostly undergoes planar, nearly uniform motion [4], so the IMU cannot be fully excited, which causes numerical stability problems during scale initialization; the IMU is therefore less suitable in this setting than the odometer. At the same time, extrinsic calibration and initialization between sensors pose significant challenges for highly nonlinear visual-inertial systems.

These problems affect the positioning accuracy and robustness of mobile robots. To solve them, this paper fuses a monocular vision sensor, an IMU, and a wheeled odometer in a tightly coupled, optimization-based framework to achieve accurate pose estimation. The main contributions of this paper are as follows:

(1) An odometer-inertial subsystem is used to estimate the scale, addressing the fact that a localization algorithm using only a monocular camera can neither recover the absolute scale nor guarantee scale stability.

(2) A tightly coupled BA optimization model is established, including marginalization (removal of pose and feature-point constraints from the sliding window), the IMU residual between adjacent frames in the sliding window, the visual reprojection error, and the odometer measurement error. Accurate positioning results are obtained by minimizing this error cost function.

2. System Composition and Measurement Data Preprocessing

2.1. System Composition and Coordinate System Definition

As shown in Figure 1, the system starts with data preprocessing (Section 2.2), including image feature extraction and tracking, preintegration of the IMU measurements between two consecutive frames, and computation of the relative pose measured by the odometer between two consecutive frames. The second part (Section 3) is initialization, which provides the attitude, velocity, gravity vector, gyroscope bias, and three-dimensional feature locations required by the subsequent nonlinear optimization. The third part is local BA joint optimization (Section 4) and global graph optimization, built around a sliding-window BA optimization model. The fourth part is loop detection, which eliminates drift and ensures map consistency; since it mainly applies DBoW2 [21], it is not discussed further in this paper.

The coordinate systems used in this paper are shown in Figure 2, which defines the coordinate systems and symbols: the world coordinate system, the camera coordinate system, the IMU coordinate system, the odometer coordinate system, and the robot coordinate system, which is defined to coincide with the IMU coordinate system. Both the rotation matrix R and the quaternion are used to represent rotation; the rotation and translation from the robot coordinate system to the world coordinate system, the robot coordinate system at the moment the j-th image frame is obtained, and the camera coordinate system at that moment are denoted by the corresponding sub- and superscripts defined in Figure 2. Quaternion multiplication and the measured or estimated value of a quantity are likewise marked with dedicated symbols defined there.

2.2. Data Preprocessing
2.2.1. Feature Point Tracking

The tracking and matching of feature points in monocular vision inherit feature points from the previous frame and generate new ones in the current frame. The specific process is as follows: first, the strongest corner features are detected on the first image frame, with a nonmaximum-suppression radius enforced between them. When the next frame arrives, these feature points are tracked with the optical flow method [22] to find their matches in the new frame. The matched points in both frames are then corrected to obtain their final positions. Based on the corrected positions, the fundamental matrix combined with RANSAC is used to remove outliers. Finally, the strongest new feature points are extracted in the areas not already covered by the tracked features, replenishing the feature count to a fixed total so that the number of features stays consistent from frame to frame. This total is generally in the range of 100–300; for details see [23].
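As a concrete illustration of this front end, the sketch below implements optical-flow tracking with a fundamental-matrix RANSAC check and corner replenishment using OpenCV; the constants MAX_FEATURES and MIN_DISTANCE and the function name are illustrative choices, not values taken from the paper.

```python
# Sketch of the front-end tracking described above, using OpenCV on grayscale
# images. All constants and names here are illustrative assumptions.
import cv2
import numpy as np

MAX_FEATURES = 150   # paper keeps roughly 100-300 features per frame
MIN_DISTANCE = 30    # non-maximum-suppression radius in pixels

def track_features(prev_img, cur_img, prev_pts):
    """Track prev_pts into cur_img, reject outliers, and top up new corners."""
    cur_pts = np.empty((0, 2), np.float32)
    if prev_pts is not None and len(prev_pts) > 0:
        p0 = prev_pts.reshape(-1, 1, 2).astype(np.float32)
        p1, status, _ = cv2.calcOpticalFlowPyrLK(
            prev_img, cur_img, p0, None, winSize=(21, 21), maxLevel=3)
        ok = status.ravel() == 1
        p0, p1 = p0[ok], p1[ok]
        if len(p1) >= 8:
            # Fundamental-matrix RANSAC check removes remaining outliers.
            _, inliers = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, 1.0, 0.99)
            if inliers is not None:
                p1 = p1[inliers.ravel() == 1]
        cur_pts = p1.reshape(-1, 2)
    # Mask the neighbourhood of tracked points, then detect the strongest new
    # corners so that the total feature count is replenished to MAX_FEATURES.
    mask = np.full(cur_img.shape[:2], 255, np.uint8)
    for x, y in cur_pts:
        cv2.circle(mask, (int(x), int(y)), MIN_DISTANCE, 0, -1)
    need = MAX_FEATURES - len(cur_pts)
    if need > 0:
        new = cv2.goodFeaturesToTrack(cur_img, need, 0.01, MIN_DISTANCE, mask=mask)
        if new is not None:
            cur_pts = np.vstack([cur_pts, new.reshape(-1, 2)])
    return cur_pts.astype(np.float32)
```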

2.2.2. Inertial Navigation Preintegration

The raw accelerometer and gyroscope measure the acceleration and angular velocity of the robot. The additive noise in the accelerometer and gyroscope is assumed to be Gaussian white noise, and the accelerometer and gyroscope biases are modeled as random walks whose derivatives follow Gaussian normal distributions. The measurement model is as follows:
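For reference, a typical form of this measurement model in the common VINS-Mono-style convention is given below; the notation is an illustrative reconstruction and may differ slightly from the symbols defined in Figure 2.

```latex
\hat{a}_t = a_t + b_{a_t} + R^{t}_{w}\, g^{w} + n_a, \qquad
\hat{\omega}_t = \omega_t + b_{\omega_t} + n_{\omega},
```

with $n_a \sim \mathcal{N}(0, \sigma_a^2)$, $n_\omega \sim \mathcal{N}(0, \sigma_\omega^2)$, and bias derivatives $\dot{b}_{a_t} \sim \mathcal{N}(0, \sigma_{b_a}^2)$, $\dot{b}_{\omega_t} \sim \mathcal{N}(0, \sigma_{b_\omega}^2)$.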

From the above formula, the position, velocity, and orientation at the next moment can be calculated over the given time interval between the two moments:

where

The reference coordinate system is then changed from the world coordinate system to the robot coordinate system: formula (2) is multiplied by the transformation matrix from the world coordinate system to the robot coordinate system of the earlier frame. With that frame used as the reference and the bias given, the preintegration terms can be obtained from the IMU measurements alone, and the covariance matrices of the preintegrated position, velocity, and rotation terms are propagated accordingly. Then, when the IMU bias changes slightly, the preintegration terms are adjusted by a first-order approximation with respect to the bias, and the resulting preintegration is expressed as

The preintegration method avoids repeatedly reintegrating the IMU measurements, saving considerable computing time and resources for the subsequent graph-based optimization.
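The sketch below shows how such a preintegration update can be computed between two image frames with a midpoint scheme, in the spirit of the common VINS-Mono-style derivation; the symbol names (alpha, beta, gamma) and the omission of covariance and Jacobian propagation are simplifications for illustration only.

```python
# Midpoint propagation of the IMU preintegration terms between two image
# frames: alpha (position), beta (velocity), gamma (rotation).
import numpy as np
from scipy.spatial.transform import Rotation as R

def preintegrate_imu(samples, ba, bg):
    """samples: iterable of (dt, acc, gyro) between frames k and k+1;
    ba, bg: current accelerometer and gyroscope bias estimates."""
    alpha, beta, gamma = np.zeros(3), np.zeros(3), R.identity()
    prev_acc, prev_gyro = None, None
    for dt, acc, gyro in samples:
        if prev_acc is None:
            prev_acc, prev_gyro = acc, gyro
        w_mid = 0.5 * (prev_gyro + gyro) - bg            # bias-corrected gyro rate
        gamma_next = gamma * R.from_rotvec(w_mid * dt)   # rotation update
        a_mid = 0.5 * (gamma.apply(prev_acc - ba) + gamma_next.apply(acc - ba))
        alpha += beta * dt + 0.5 * a_mid * dt * dt       # position term
        beta += a_mid * dt                               # velocity term
        gamma = gamma_next
        prev_acc, prev_gyro = acc, gyro
    return alpha, beta, gamma
```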

2.2.3. Odometer Preintegration

The wheel odometer measures the linear velocity and the angular velocity of the robot. The noise between the measured value and the true value is assumed to be Gaussian. The measurement model is given as follows:

Considering possible wheel slip of the wheeled robot, the angular velocity measured by the gyroscope is used in place of the odometer's angular rate. Following the IMU preintegration algorithm, the preintegration is computed from the odometer velocity measurements and the gyroscope angular velocity; the velocity preintegration is simpler because, unlike the accelerometer, the odometer velocity measurement carries no bias term. The odometer preintegration results are as follows:

Similar to the IMU preintegration, the preintegration measurement is obtained by separating the initial state from the integrated terms. When the gyroscope bias changes slightly, a first-order approximation around the previous estimate is used to update the preintegrated translation and rotation terms instead of recomputing the preintegration measurements:
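A minimal sketch of this first-order bias correction is given below; the Jacobians J_alpha_bg and J_gamma_bg of the preintegrated translation and rotation with respect to the gyroscope bias are assumed to have been accumulated during preintegration, and all names are illustrative.

```python
# First-order correction of the preintegrated terms when the gyroscope bias
# estimate changes by delta_bg, avoiding a full re-integration.
import numpy as np
from scipy.spatial.transform import Rotation as R

def correct_preintegration(alpha, gamma, J_alpha_bg, J_gamma_bg, delta_bg):
    alpha_new = alpha + J_alpha_bg @ delta_bg                  # translation term
    gamma_new = gamma * R.from_rotvec(J_gamma_bg @ delta_bg)   # rotation term
    return alpha_new, gamma_new
```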

3. System Initialization State Estimation

The initial values of the system are obtained by a loosely coupled method. First, structure from motion (SFM) is used to estimate the camera poses and the 3D positions of the features. Then, the SFM results are aligned with the IMU preintegration and the odometer preintegration, respectively, to correct the gyroscope bias and to solve for the velocity of each frame, the gravity vector direction, and the scale factor. Note that the final scale is obtained by weighting the scale estimates from the two alignment processes according to the dynamic weighting method.

3.1. Vision-Only Initialization Based on the Sliding Window

Monocular initialization uses SFM to estimate the up-to-scale camera pose of each frame within the sliding window and the spatial positions of the feature points. First, the feature matches between the latest frame and every other frame in the window are checked. If stable feature tracking and sufficient parallax are found, the relative pose between the two frames is recovered up to scale by the five-point method [24]; the features are then triangulated and the reprojection error is minimized. Otherwise, the current latest frame is kept and the system continues to wait for a new frame. The poses of all frames in the sliding window are obtained by the above procedure. With the extrinsic parameters between the camera and the IMU and between the camera and the odometer assumed known, the poses are converted to the robot coordinate system and the odometer coordinate system, respectively. The conversion to the IMU coordinate system is given as

Similarly, the conversion to the odometer coordinate system is given as

where the two recovered quantities both represent the scale of the visual measurements and are used in the subsequent dynamic weighting to solve for the final scale parameter.
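For the relative-pose recovery step mentioned above, a minimal sketch using OpenCV's five-point essential-matrix estimation is shown below; the intrinsic matrix K, the parallax threshold, and the function name are assumptions for illustration, not the paper's exact procedure.

```python
# Relative pose recovery between two frames with sufficient parallax, using
# OpenCV's five-point essential-matrix estimator and cheirality-checked
# decomposition.
import cv2
import numpy as np

def relative_pose(pts_ref, pts_cur, K, min_parallax_px=20.0):
    """pts_ref, pts_cur: matched pixel coordinates, shape (N, 2)."""
    parallax = np.linalg.norm(pts_cur - pts_ref, axis=1).mean()
    if len(pts_ref) < 15 or parallax < min_parallax_px:
        return None                      # wait for a frame with more motion
    E, inliers = cv2.findEssentialMat(pts_ref, pts_cur, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R_rel, t_rel, _ = cv2.recoverPose(E, pts_ref, pts_cur, K, mask=inliers)
    return R_rel, t_rel                  # translation is recovered only up to scale
```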

3.2. Vision and IMU Preintegration Alignment
3.2.1. Gyroscope Bias Calibration

The rotations of two consecutive frames in the sliding window are obtained from the visual SFM, and the relative rotation constraint between them is obtained from the IMU preintegration. The IMU preintegration term is then linearized with respect to the gyroscope bias, and the cost function is minimized as

where B represents all keyframes in the sliding window, and

After the initial calibration value of the gyroscope bias is obtained, the IMU preintegration is recalculated based on the new gyroscope bias.
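For reference, a typical form of this calibration cost, following the common VINS-Mono-style derivation, is given below; it is shown only as an illustrative sketch, and the paper's exact expression may differ.

```latex
\min_{\delta b_{\omega}} \sum_{k \in B}
\left\| \left(q^{c_0}_{b_{k+1}}\right)^{-1} \otimes q^{c_0}_{b_k}
        \otimes \gamma^{b_k}_{b_{k+1}} \right\|^2,
\qquad
\gamma^{b_k}_{b_{k+1}} \approx \hat{\gamma}^{b_k}_{b_{k+1}} \otimes
\begin{bmatrix} 1 \\ \tfrac{1}{2} J^{\gamma}_{b_\omega}\, \delta b_{\omega} \end{bmatrix},
```

where $q^{c_0}_{b_k}$ is the SFM rotation of frame $k$ in the reference camera frame, $\hat{\gamma}^{b_k}_{b_{k+1}}$ is the preintegrated rotation, and $J^{\gamma}_{b_\omega}$ is its Jacobian with respect to the gyroscope bias.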

3.2.2. Velocity, Gravity Vector, and Scale Initialization

After the gyroscope bias is initialized, the gyroscope measurements are regarded as accurate and continue to be used to initialize the other variables to be optimized:

where the variables to be optimized are the velocity of the robot coordinate system at each image frame, the gravity vector in the reference coordinate system, and the monocular scale to be aligned with the IMU preintegration.

The world coordinate system in the preintegration terms is replaced with the reference coordinate system, and the terms are rewritten as a linear function of the variables to be optimized:

where

By solving the linear least-squares problem,

According to the above formulas, the variables to be optimized, namely the velocities of all keyframes in the sliding window, the gravity vector, and the monocular scale, can be obtained.
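In practice, these alignment constraints stack into one linear system that can be solved in closed form; the sketch below shows such a solve with NumPy, where H and b denote the stacked coefficient matrix and residual vector and the state layout [v_1 ... v_n, g, s] is an assumed convention rather than the paper's notation.

```python
# Solving the stacked alignment constraints as one linear least-squares problem.
import numpy as np

def solve_initialization(H, b, num_frames):
    x, *_ = np.linalg.lstsq(H, b, rcond=None)
    velocities = x[:3 * num_frames].reshape(num_frames, 3)   # per-keyframe velocity
    gravity = x[3 * num_frames:3 * num_frames + 3]            # gravity vector
    scale = x[-1]                                             # monocular scale
    return velocities, gravity, scale
```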

3.3. Vision and Odometer Preintegration Alignment

Since the gyroscope measurement provides the rotation part of the odometer preintegration, the calibrated gyroscope bias can be used directly. The variables to be optimized are defined as

Similarly, the odometer preintegration terms are transferred to the reference coordinate system and rewritten as a linear function of the variables to be optimized:

By solving the linear least-squares problem,

The odometer-frame velocity of each frame in the window and the monocular scale parameter can then be obtained.

3.4. Dynamic Weighted Scale Parameter

Following the idea of dynamic weighting, an inner sliding window of 5 keyframes is run on top of the system's original sliding window of 10 keyframes. Keyframes are selected from the original window so that frames 1–5 form the first inner window, frames 2–6 form the second, and so on. For each inner window, the scale parameters of the IMU alignment and of the odometer alignment are calculated according to Sections 3.2 and 3.3, respectively, together with their minimum mean square errors. The final scale parameter is then computed by dynamically adjusting the two weights according to these errors, where the two weights sum to 1 and are both initialized to 0.5, as shown in Figure 3.
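One plausible realization of this dynamic weighting is inverse-error weighting, sketched below; the exact update rule is not spelled out in the text, so the formula here is an assumption that merely satisfies the stated constraint that the two weights sum to 1.

```python
# Fuse the two scale estimates with weights proportional to the inverse of
# each alignment's mean square error (an assumed rule, not the paper's).
def fuse_scale(s_imu, e_imu, s_odom, e_odom, eps=1e-9):
    w_imu = 1.0 / (e_imu + eps)
    w_odom = 1.0 / (e_odom + eps)
    total = w_imu + w_odom
    w_imu, w_odom = w_imu / total, w_odom / total
    return w_imu * s_imu + w_odom * s_odom, (w_imu, w_odom)
```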

4. Back-End Tight Coupling Optimization

4.1. State Variables and Cost Functions

To improve computational efficiency, a sliding window is used to implement the graph optimization in this paper. The state variables of the sliding-window optimization at a given time are defined as

where each frame state comprises the position, velocity, and orientation of the robot in the global coordinate system together with the IMU biases; the remaining entries are the inverse depths of the feature points; and the counts in the definition denote the number of keyframes and the total number of 3D feature points, respectively.

In the sliding window, the residuals of all measurement models are minimized to optimize the state variables:

where the first two residual terms are the IMU and odometer measurement residuals between consecutive states, B and D are the sets of IMU and odometer preintegration measurements, the next term is the 3D feature-point reprojection error, the prior term encodes the information marginalized out of the window together with the Jacobian matrix obtained from the last optimization result, and the Huber kernel function is applied to make the optimization more robust.
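For concreteness, the overall cost has the familiar VINS-Mono structure extended with an odometer term; the expression below is an illustrative reconstruction in that convention, where $r_p$, $H_p$ denote the marginalization prior, $r_B$, $r_D$, $r_C$ the IMU, odometer, and visual residuals, and $\rho$ the Huber kernel.

```latex
\min_{\mathcal{X}} \Biggl\{
\bigl\| r_p - H_p \mathcal{X} \bigr\|^2
+ \sum_{k \in B} \bigl\| r_B\!\left(\hat{z}^{b_k}_{b_{k+1}}, \mathcal{X}\right) \bigr\|^2_{P^{b_k}_{b_{k+1}}}
+ \sum_{k \in D} \bigl\| r_D\!\left(\hat{z}^{o_k}_{o_{k+1}}, \mathcal{X}\right) \bigr\|^2_{P^{o_k}_{o_{k+1}}}
+ \sum_{(l,j) \in C} \rho\Bigl( \bigl\| r_C\!\left(\hat{z}^{c_j}_{l}, \mathcal{X}\right) \bigr\|^2_{P^{c_j}_{l}} \Bigr)
\Biggr\}
```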

4.2. Camera Measurement Residual

In this paper, a pinhole model is used to compute the reprojection error. First, the reprojection error is defined from the observation of the kth feature point in the image frame, expressed in the normalized camera coordinate system:

where

The above formula is the three-dimensional form of the reprojection equation; the reprojection error in the normalized camera coordinate system depends on the estimated values of the optimization variables, namely the state quantities of the two frames and the inverse depth of the feature. To minimize the reprojection error, these state quantities are optimized, and the corresponding Jacobian matrices are given in equations (23)–(25).
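The sketch below evaluates such a residual for one feature observed in frames i and j using an inverse-depth parameterization; the variable names and the (R, p) pose convention are illustrative assumptions rather than the paper's exact formulation.

```python
# Reprojection residual of one feature, first observed in frame i with inverse
# depth inv_depth and re-observed in frame j. (R_wb, p_wb) are body poses in
# the world frame and (R_bc, p_bc) the camera-body extrinsics.
import numpy as np

def reprojection_residual(obs_i, obs_j, inv_depth,
                          R_wbi, p_wbi, R_wbj, p_wbj, R_bc, p_bc):
    pt_ci = np.array([obs_i[0], obs_i[1], 1.0]) / inv_depth   # back-project in frame i
    pt_w = R_wbi @ (R_bc @ pt_ci + p_bc) + p_wbi              # camera i -> world
    pt_bj = R_wbj.T @ (pt_w - p_wbj)                          # world -> body j
    pt_cj = R_bc.T @ (pt_bj - p_bc)                           # body j -> camera j
    pred = pt_cj[:2] / pt_cj[2]                               # normalized image plane
    return pred - np.asarray(obs_j[:2])                       # 2D residual
```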

4.3. IMU Measurement Residual

According to the IMU measurements between the two consecutive frames and the preintegration terms in formula (4), the residual of the IMU is obtained as follows:

Here, the preintegration measurements are those obtained after correcting for the updated biases; the rotation residual is a three-dimensional error-state representation of the quaternion, obtained by extracting the imaginary part of the quaternion error. In the graph optimization, the state quantities of the two frames need to be optimized, and the corresponding Jacobian matrices are given in equations (27)–(29).
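As a reference, the standard VINS-Mono form of this residual, written in the common $\alpha$, $\beta$, $\gamma$ preintegration notation (assumed here rather than taken from the paper's equations), is:

```latex
r_B\!\left(\hat{z}^{b_k}_{b_{k+1}},\mathcal{X}\right) =
\begin{bmatrix}
R^{b_k}_{w}\!\left(p^{w}_{b_{k+1}} - p^{w}_{b_k} - v^{w}_{b_k}\Delta t_k + \tfrac{1}{2} g^{w}\Delta t_k^2\right) - \hat{\alpha}^{b_k}_{b_{k+1}} \\[2pt]
R^{b_k}_{w}\!\left(v^{w}_{b_{k+1}} - v^{w}_{b_k} + g^{w}\Delta t_k\right) - \hat{\beta}^{b_k}_{b_{k+1}} \\[2pt]
2\left[\left(\hat{\gamma}^{b_k}_{b_{k+1}}\right)^{-1} \otimes \left(q^{w}_{b_k}\right)^{-1} \otimes q^{w}_{b_{k+1}}\right]_{xyz} \\[2pt]
b_{a_{k+1}} - b_{a_k} \\[2pt]
b_{\omega_{k+1}} - b_{\omega_k}
\end{bmatrix}
```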

4.4. Odometer Measurement Residual

It is assumed that the wheel axle lies on the central axis of the robot, so the displacement measured by the odometer can be converted into the displacement of the robot's center. The odometer measurement residual is therefore defined as

In the graph optimization model, the states of the two consecutive frames need to be optimized, and the corresponding Jacobian matrices are given in formulas (31) and (32), where

5. Experimental Result Analysis

The experimental scene in this paper is an indoor environment equipped with a Vicon motion capture system, and the robot platform is a TurtleBot3 mobile robot fitted with a Xiaomi camera. The robot moves in the area of , and the average speed of the whole process is . In the experiments, the robot pose acquired by the motion capture system in the motion capture coordinate system is taken as the ground truth of the motion trajectory.

5.1. Scale Error Analysis

This experiment mainly verifies the effect of the initialization state estimation algorithm and of the dynamic weighting scheme proposed in Section 3.4. Compared with initialization using only the IMU, the combination of vision, IMU, and odometer in this paper achieves a more accurate scale and a shorter convergence time. The scale estimation results and times are shown in Table 1, and the root mean square error (RMSE) is used to quantify the accuracy of the algorithm.

The results in Table 1 show that, compared with initialization using only the IMU, the initialization state estimation algorithm of this paper reduces the RMSE of the scale value by 72.2% and the RMSE of the convergence time by 62.9%. The proposed initialization state estimation algorithm therefore clearly improves the accuracy and stability of scale estimation.

5.2. Pose Error Analysis

The algorithm proposed in this paper builds on VINS-Mono, so the experimental results are compared with VINS-Mono. The mono_imu trajectory in the figures is the result of the VINS-Mono [25] algorithm running on the robot platform. The trajectories of the two algorithms and the ground truth are compared in Figures 4 and 5, where the dotted line is the real trajectory obtained by the motion capture system and the solid line is the trajectory obtained by each algorithm. As can be seen from the figures, thanks to the odometer constraint and the more accurate initialization scale added in this paper, the proposed algorithm localizes the mobile robot more accurately than VINS-Mono.

To quantitatively analyze the accuracy of the algorithm, the absolute pose error (APE) is computed by comparing the estimated pose with the true pose at each time step. The translational APE and rotational APE of the algorithm proposed in this paper and of the monocular inertial navigation algorithm are shown in Figure 6, and the overall pose APE of the two algorithms is given in Table 2.

Comparing the two methods, it is clear that the proposed algorithm is more accurate than the monocular inertial fusion algorithm, which indicates that the odometer information improves the positioning result to a certain degree. Moreover, the planar constraint provided by the odometer keeps the proposed algorithm consistently stable at about 0.03 m in the vertical direction, while the monocular inertial fusion algorithm shows a certain degree of trajectory divergence in that direction. Since the center of the rigid-body model built in the Vicon system is 0.03 m, the true value also stays near 0.03 m in the vertical direction. The results are shown in Figure 7.

6. Conclusions

In this paper, aiming at the low accuracy and robustness of monocular inertial navigation algorithms in mobile-robot pose estimation, a multisensor fusion positioning system combining monocular vision, IMU, and odometer is designed, which realizes the initial state estimation of monocular vision and tightly coupled visual-inertial-odometer fusion localization.

An odometer preintegration method with dynamically weighted scale parameters and a back-end tightly coupled pose optimization method based on graph optimization are proposed, which greatly improve the accuracy and robustness of the positioning system. The effectiveness and feasibility of the proposed algorithm are then verified by theoretical derivation and practical experiments.

Although the research in this paper has achieved certain results, some aspects still need improvement. First, the results were tested only in the laboratory and in a small number of scenarios, which cannot fully demonstrate the ability to work robustly and stably over long periods in complex environments. Second, stereo cameras are superior to monocular cameras in depth computation and scale estimation; subsequent work may therefore consider multisensor fusion schemes based on stereo vision and contribute to future map-based visual navigation studies.

Data Availability

The data are available at https://figshare.com/s/c0328d437eb86a021e24, DOI: 10.6084/m9.figshare.9782474.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the NSFC (National Natural Science Foundation of China) under grant nos. 61374128 and 61973278.