Abstract

The initialization process has a great effect on the performance of a monocular visual-inertial simultaneous localization and mapping (VI-SLAM) system. The initial estimation is usually solved by a least-squares method such as the Gauss-Newton (G-N) algorithm, but a large iteration increment may lead to slow convergence or even divergence. To solve this problem, an improved iterative strategy for initial estimation is proposed. Our initialization can be divided into four steps: firstly, the pure visual ORB-SLAM model is utilized to make all variables observable. Secondly, IMU preintegration is adopted for IMU-camera frequency alignment at the same time as key frame generation. Thirdly, an improved iterative strategy based on the trust region is introduced for the gyroscope bias estimation, and the gravity direction is refined. Finally, the accelerometer bias and visual scale are estimated on the basis of the previous estimates. Experimental results on public datasets show that the initial values converge faster, and that the velocity and pose of the sensor suite are estimated more accurately, than with the original method.

1. Introduction

With the development of artificial intelligence (AI), monocular visual-inertial simultaneous localization and mapping (VI-SLAM) has become an active research topic in the robotics and computer vision communities. The camera image contains rich information about the surrounding environment, which can be utilized to estimate the camera poses as well as to reconstruct sparse or dense maps. The IMU provides measurements of angular velocity and local linear acceleration, which can be utilized to estimate the rigid body motion over a short period. These complementary features make the visual-inertial combination suitable for many applications such as autonomous or semiautonomous driving [1, 2], unmanned aerial robots [3, 4], augmented reality (AR) [5, 6], and 3D reconstruction [7, 8]. At present, tightly coupled nonlinear optimization is widely applied to visual/visual-inertial SLAM, for example in ORB-SLAM and ORB-SLAM 2/3 [9–11], OKVIS [12], VINS-MONO/VINS-FUSION [13, 14], VI-DSO [15], and VI-ORBSLAM [16]. The estimation of initial values has a great effect on the aforementioned VI-SLAM systems. Specifically, the IMU's initial value estimation is of great significance to the initialization process; once these parameters are obtained successfully, the measurements of the inertial measurement unit (IMU) can be used to improve the accuracy and robustness of continuous tracking, as well as to recover the metric scale of the three-dimensional (3D) visual map.

In early studies, several initialization methods have been proposed, notably the joint methods [17–19] and the disjoint methods [16, 20, 21]. (i) The joint visual-inertial initialization method was pioneered by Martinelli [17], under the assumption that all features are correctly tracked in all frames; spurious tracks lead to poor real-time performance. It was later improved by Kaiser et al. at the cost of a small loss of precision [18]. The method in [19] suffers from low initialization recall; it only works at twenty percent of the trajectory points, which might be a problem for robot applications in case the system needs to be launched immediately. (ii) The disjoint method was first introduced by Mur-Artal and Tardós [16] and later adapted by Qin and Shen [20] and Yang and Shen [21] with good performance. In both cases, the IMU parameters are estimated in separate steps by solving a series of linear formulas with a least-squares method such as Gauss-Newton (G-N) or Levenberg-Marquardt (L-M) [22, 23]. The G-N method is built on the Newton method. It achieves a high local convergence speed, but several limitations remain: in particular, a large iteration increment may lead to slow global convergence [24, 25]. In later research, Levenberg [26] and Marquardt [27] suggested a damped G-N method, in which the size and direction of the iterative step are influenced by the damping parameter. This allows the method to guarantee global convergence without a specific line search [28]. However, a set of complex equations must be solved several times in each iteration, which may slow down the solution.
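The contrast between a plain G-N step and a damped (L-M-style) step can be sketched on a toy one-dimensional least-squares problem. This is purely illustrative: the residual r(x) = exp(x) − 2, the starting point, and the damping value are made up for this example and are not from the paper.

```python
# Toy comparison: plain Gauss-Newton step vs. a damped (L-M-style) step
# on the 1-D least-squares problem r(x) = exp(x) - 2, minimized at ln 2.
import math

def residual(x):
    return math.exp(x) - 2.0

def jacobian(x):
    return math.exp(x)  # dr/dx

def gauss_newton_step(x):
    # h = -(J^T J)^(-1) J^T r; in the scalar case, the full undamped increment
    return -residual(x) / jacobian(x)

def damped_step(x, lam):
    # h = -(J^T J + lam)^(-1) J^T r: the damping term shrinks large steps
    J, r = jacobian(x), residual(x)
    return -(J * r) / (J * J + lam)

x_gn = x_lm = 5.0  # deliberately poor initial guess, far from the minimum
for _ in range(25):
    x_gn += gauss_newton_step(x_gn)
    x_lm += damped_step(x_lm, lam=1.0)
```

On this convex toy problem both variants reach ln 2; the damped increment is simply shorter wherever the Jacobian is small, which is what protects the damped method in cases where the undamped increment would overshoot.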

To sum up, the speed of initialization convergence is important for the VI-SLAM system. The contribution of this paper is an improved iterative strategy, based on the trust region method, that speeds up the initial estimation.

The flowchart of our initialization method is shown in Figure 1. It can be divided into four steps: firstly, the pure visual ORB-SLAM model is adopted to make all variables observable in the preliminary stage. Secondly, IMU preintegration is adopted for IMU-camera frequency alignment at the same time as key frame generation. Thirdly, an iterative strategy based on the trust region is introduced for the gyroscope bias estimation, while the gravity direction is refined. Finally, the accelerometer bias and visual scale are estimated on the basis of the previous estimates. In the experiments, qualitative and quantitative analyses on the public dataset [29] demonstrate the improvement. The results show that the initial values converge faster, and that the velocity and poses of the sensor suite are estimated more accurately, than with the original method.

The remainder of this paper is organized as follows: Section 2 describes the process of IMU initial estimation. Then, the improved iterative strategy is described in Section 3. The experiments are described in Section 4. The conclusion is drawn in Section 5.

2. IMU Initial Estimation

In this section, the initial parameters of the IMU are estimated. The relationship between the IMU body frame {B} and the camera frame {C} is defined with the scale factor taken into account; it is described by a rotation and a translation vector between the two frames, i.e., the fixed camera-IMU extrinsic calibration.

IMU preintegration [30, 31] is adopted for IMU-camera frequency alignment. In order to make all variables observable, the pure monocular visual SLAM system [9, 10] is run for a few seconds to generate key frames. The detailed process of the IMU parameter estimation is described as follows.

2.1. Estimating Gyroscope Bias

The gyroscope bias can be estimated from the known orientations of two consecutive key frames. Firstly, we assume that the change of bias is negligible, i.e., the bias is a small constant value. Then, the difference between the preintegration of the gyroscope measurements and the orientation estimated by pure visual SLAM is minimized over all key frames, where the relative orientation term is computed from the visually estimated orientations and the extrinsic calibration, the preintegration term accumulates the gyroscope measurements between two consecutive key frames, the exponential map converts a rotation vector into the corresponding rotation matrix, and the Jacobian matrix captures the first-order effect of the bias. Analytic Jacobian matrices for a similar expression can be found in [31].
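As a simplified illustration of this step, consider a one-dimensional, yaw-only analogue (not the paper's SO(3) formulation; all rates, segment lengths, and the true bias below are synthetic). Because the residual is linear in the bias, the estimate has a closed form:

```python
# 1-D analogue of the gyroscope-bias estimation: find the constant bias b
# so that integrating the corrected gyro rates (omega - b) over each
# key frame interval matches the yaw increment from the visual track.
TRUE_BIAS = 0.05   # rad/s, synthetic value to be recovered
DT = 0.005         # 200 Hz IMU sampling period

# Synthetic gyro rate segments between 4 consecutive key frames;
# the measurements carry the constant bias.
true_rates = [[0.1] * 40, [0.2] * 40, [-0.1] * 40]
measured = [[w + TRUE_BIAS for w in seg] for seg in true_rates]
visual_increments = [sum(w * DT for w in seg) for seg in true_rates]

# residual_k = sum((omega - b) * dt) - dtheta_k is linear in b,
# so least squares over all key frame pairs gives b in closed form.
num = sum(sum(w * DT for w in seg) - dth
          for seg, dth in zip(measured, visual_increments))
den = sum(len(seg) * DT for seg in measured)
bias_estimate = num / den
```

In the real system the same idea is applied on SO(3) with Gauss-Newton (or, in this paper, the trust-region strategy of Section 3) instead of a scalar closed form.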

2.2. Estimating Gravity Direction

The gravity direction has a great influence on the estimation of acceleration; it should be refined before estimating the accelerometer bias and scale. In particular, a new constraint, i.e., the gravity magnitude (G ≈ 9.8 m/s2), is introduced. In the frame {W}, the gravity direction can be computed as follows:

The rotation can be calculated from the angle between the two direction vectors:

With this rotation, the gravity vector can be expressed as follows, where the correction can be parameterized by just two angles, around the x-axis and the y-axis of frame {I}, since a rotation around the z-axis (the gravity direction itself) has no effect.
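A minimal sketch of this alignment, assuming a synthetic gravity estimate and using the axis-angle (Rodrigues) form of the rotation between the two direction vectors:

```python
# Align an estimated gravity direction in {W} with the vertical axis of
# the inertial frame {I}. The rotation is built from the angle between
# the two direction vectors (Rodrigues formula); values are synthetic.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def align(v_from, v_to):
    """Return a function rotating v_from onto v_to (Rodrigues formula)."""
    f, t = normalize(v_from), normalize(v_to)
    axis = normalize(cross(f, t))                 # rotation axis
    ang = math.acos(max(-1.0, min(1.0, dot(f, t))))  # rotation angle
    def rotate(v):
        kxv = cross(axis, v)
        kdv = dot(axis, v)
        return [v[i] * math.cos(ang) + kxv[i] * math.sin(ang)
                + axis[i] * kdv * (1 - math.cos(ang)) for i in range(3)]
    return rotate

g_w = [0.31, -0.24, -9.78]          # rough gravity estimate in {W}
rotate = align(g_w, [0.0, 0.0, -1.0])
g_aligned = rotate(normalize(g_w))  # should now point along -z
```

Note that the two-angle parameterization in the paper follows from exactly this construction: the axis returned by the cross product is horizontal, so no rotation about the vertical axis ever appears.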

2.3. Estimating Accelerometer Bias and Scale

Once the accurate gyroscope bias and gravity vector are obtained, the positions, velocities, and rotations can be calculated by integration. Considering the influence of the accelerometer bias, the gravity direction is also adjusted, expressed by a two-degree-of-freedom disturbance; Equation (5) can then be rewritten accordingly. Considering the effect of the accelerometer bias, it can be obtained that

The velocities can be eliminated when the constraints between three consecutive key frames are taken into account, which yields a linear relationship whose coefficient blocks are parameterized in terms of the preintegrated quantities; one factor involves only the first two columns of the gravity-alignment matrix, reflecting the two-degree-of-freedom disturbance. Stacking all relations between three consecutive key frames in (8), a linear system is formed that can be solved by the singular value decomposition (SVD) method. For N key frames it contains 3(N − 2) equations and 6 unknown variables, so at least 4 key frames are needed to solve the system.
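The stacking-and-solving step can be sketched as follows. This is illustrative only: the coefficient matrix is synthetic, and a plain Gaussian elimination stands in for the SVD solve; the unknown vector holds the scale, the 2-DoF gravity disturbance, and the 3-DoF accelerometer bias.

```python
# Solve the stacked linear system A x = b for the 6 initialization
# unknowns. With N key frames the triplet constraints contribute
# 3(N - 2) rows, so N = 4 already gives a 6x6 system.
def solve_linear(A, b):
    """Gaussian elimination with partial pivoting (stand-in for SVD)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c]
                              for c in range(r + 1, n))) / M[r][r]
    return x

# Synthetic ground truth: [scale, dtheta_x, dtheta_y, ba_x, ba_y, ba_z]
x_true = [1.2, 0.01, -0.02, 0.03, -0.01, 0.02]
A = [[4.0, 1.0, 0.0, 0.5, 0.0, 0.0],   # made-up, well-conditioned
     [1.0, 5.0, 1.0, 0.0, 0.5, 0.0],   # coefficients standing in for
     [0.0, 1.0, 6.0, 0.0, 0.0, 0.5],   # the preintegration-derived blocks
     [0.5, 0.0, 0.0, 4.0, 1.0, 0.0],
     [0.0, 0.5, 0.0, 1.0, 5.0, 1.0],
     [0.0, 0.0, 0.5, 0.0, 1.0, 6.0]]
b = [sum(A[i][j] * x_true[j] for j in range(6)) for i in range(6)]
x_est = solve_linear(A, b)
```

In practice SVD is preferred over elimination because the stacked system is overdetermined (3(N − 2) > 6 for N > 4) and possibly ill-conditioned.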

3. Improved Iterative Strategy

As Equation (2) is a typical nonlinear least-squares problem, the common solving method is the G-N algorithm, which is adopted in [16]. The G-N method provides a high local convergence speed, but a large iteration increment might lead to slow global convergence or even divergence. To tackle this problem, an improved iterative strategy is introduced. In particular, our method is a trust region-based method which combines the steepest descent and G-N algorithms; the model of the trust region is shown in Figure 2.

Similar to the L-M algorithm, the objective function is approximated with a trust model, and the solution is taken as the minimum of the model function around the current point. The following minimization subproblem is then solved in each iteration step, where the approximate model in Equation (10) is built from the gradient and the Hessian matrix of the objective function around the current point, and the step is constrained to lie within the trust-region radius. At each iteration step, the radius of the trust region is adjusted according to the gain ratio, whose denominator and numerator are the predicted and the actual reduction values, respectively. The steps of updating the radius are described in Algorithm 1.

Inputs: , ,
Output:
Step 1: If , then put: where ,
Step 2: Else if , then put:
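A common form of the radius update in Algorithm 1 can be sketched as follows. Hedged: the 0.25/0.75 thresholds and the shrink/grow factors below are the textbook choices, not values taken from this paper.

```python
# Trust-region radius update driven by the gain ratio rho
# (actual reduction / predicted reduction).
def update_radius(radius, rho, step_norm, max_radius=100.0):
    if rho < 0.25:
        # Model agrees poorly with the objective: shrink the region.
        return radius * 0.25
    if rho > 0.75 and abs(step_norm - radius) < 1e-12:
        # Good agreement and the step hit the boundary: enlarge,
        # capped at a maximum radius.
        return min(2.0 * radius, max_radius)
    # Otherwise the radius is left unchanged.
    return radius
```

A rho near 1 means the quadratic model predicted the actual reduction well, so a larger region can be trusted; a small or negative rho means the model is unreliable at the current radius.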

In this work, the sufficient reduction condition for global convergence requires that the reduction achieved by the obtained step is at least comparable to the reduction of the Cauchy step [32], which is described as follows, where the Jacobian matrix contains the gradients of the residuals and can be found in [22].

Assuming that the subproblem in Equation (10) has been solved, and since our method combines the steepest descent and G-N algorithms, the solution is obtained by selecting one of two candidate steps: the step of the steepest descent algorithm and the step of the G-N algorithm. The relation of the two steps is described as follows:

According to Equation (13), two cases can occur for the two steps: (1) the G-N point is outside of the trust region.

In this case, the path of our iterative strategy consists of two segments: the first is a line segment starting from the current point along the steepest-descent step, and the second is a line segment extending from the steepest-descent point toward the G-N point, truncated at the trust-region boundary; the interpolation coefficient can be computed from the following equation. (2) The G-N point is inside of the trust region.

Thus, the solution of our method is obtained on this approximate path as the minimum point of the model function; the process is summarized in Algorithm 2.

Inputs: , ,
Output:
Step 1: If , then output:
Step 2: Else if , then output:
Step 3: Else output: , where is calculated by Equation (17).
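The case analysis of Equations (13)–(17) and Algorithm 2 can be sketched as a dogleg-style step selector. Illustrative only: plain lists stand in for vectors, and the steepest-descent and G-N steps are assumed to be precomputed.

```python
# Select the trust-region step from the steepest-descent step h_sd and
# the Gauss-Newton step h_gn, given the current radius.
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dogleg_step(h_sd, h_gn, radius):
    if norm(h_gn) <= radius:
        return h_gn                      # G-N point inside: take it whole
    if norm(h_sd) >= radius:
        s = radius / norm(h_sd)          # even the SD point is outside:
        return [s * x for x in h_sd]     # clip the steepest-descent step
    # Otherwise walk from h_sd toward h_gn until the boundary is hit:
    # solve ||h_sd + beta * (h_gn - h_sd)|| = radius for beta in (0, 1).
    d = [g - s for g, s in zip(h_gn, h_sd)]
    a = sum(x * x for x in d)
    b = 2.0 * sum(s * x for s, x in zip(h_sd, d))
    c = sum(s * s for s in h_sd) - radius * radius
    beta = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return [s + beta * x for s, x in zip(h_sd, d)]
```

The returned step always lies within the trust region, so the radius update of Algorithm 1 fully controls how aggressive each iteration is.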

4. Experiments

In this section, the proposed initialization algorithm is integrated into the VI-ORBSLAM framework [16], and several tests are conducted on the EuRoC dataset [29]. To evaluate its performance, we compare our method with the original VI-ORBSLAM, VINS-MONO, and the monocular version of ORB-SLAM3.

4.1. EuRoC Dataset

The EuRoC dataset, with 11 sequences, was collected by a micro aerial vehicle (MAV) platform carrying a visual-inertial sensor suite. The data collection environments, shown in Figure 3, consist of a machine hall scene and a man-made laboratory room.

According to texture, illumination, motion blur, and fast/slow motions, the sequences are classified into easy, medium, and difficult sets [33]. The binocular cameras (denoted CAM0 and CAM1) and the IMU are hardware time-synchronized and logged at 20 Hz and 200 Hz, respectively. Besides accurate ground truth of the moving trajectories, the dataset also provides the accelerometer and gyroscope biases and the velocities of the IMU body. In this work, we only use the monocular (left camera) and the IMU measurements. All experiments are carried out on a computer with 16 GB RAM and an i7-9700 CPU (8 cores @ 3.00 GHz), running Ubuntu 18.04 + Melodic. For convenience of analysis, the performance of parameter convergence, velocity estimation, and localization accuracy is evaluated on the V2_01_easy sequence of the EuRoC dataset, in which the accelerometer bias and gyroscope bias are approximately [-0.0236, 0.1210, 0.0748] m/s2 and [-0.0023, 0.0250, 0.0817] rad/s, respectively. The sparse mapping results on the V2_01_easy sequence with the key frame trajectory are shown in Figure 4. The convergence performance of the initial parameters, the velocity estimation errors, and the key frame absolute trajectory error (ATE) over all 11 sequences are also analyzed, as described below.

4.2. Initialization Parameter Convergence

The time-varying characteristic curves of the initial estimates (accelerometer bias, gyroscope bias, gravity, and visual scale) are shown in Figure 5. We compare the convergence performance of two optimization-based methods: the Gauss-Newton method and the proposed method. All parameters converge successfully in both methods; the Gauss-Newton- (G-N-) based method converges within 9 seconds, while our trust region-based method converges within 6 seconds. In particular, as shown in Figures 5(a) and 5(c), the characteristic curves of the accelerometer bias and gravity oscillate severely within the first 4 seconds. This is because the MAV platform does not excite the sensor suite sufficiently while stationary or under slight disturbance, which makes the accelerometer bias and the gravity vector hard to distinguish. Besides, the estimation of the gyroscope bias converges to stable values in a very short time (within 2 seconds for ours), as shown in Figure 5(b).

This confirms that the proposed iterative strategy described in Section 3 achieves good performance. The estimation of the visual scale is plotted in Figure 5(d); it should be noted that the optimal value of the scale factor is obtained in advance by manually scaling and aligning the estimated IMU body trajectories to the ground-truth trajectories. The result shows that our method also converges to the optimal value faster than the G-N-based method.

4.3. Velocity Estimation

The time characteristic curves of the estimated velocity in the x, y, and z directions are shown in Figure 6. It compares the results of VI-ORBSLAM (red line), the ground truth (blue line), and our proposed algorithm (green line), tested on the V2_01 sequence of the EuRoC dataset. Since the estimates and the ground truth are expressed in different reference frames, the estimated velocities need to be aligned with the ground truth in advance. The velocity estimation suffers different drifts in the x, y, and z directions, but the error of our method is smaller than that of VI-ORBSLAM. The velocity drift over the whole path is quantified in Table 1. Compared with the velocity drift of VI-ORBSLAM, [0.0293, 0.0165, 0.0501] m/s, the drift of ours is [0.0195, 0.0110, 0.0385] m/s in the x, y, and z axes, so the accuracy of ours is improved by [33.45, 33.33, 23.15]%. The percentage improvement in accuracy is calculated as (d_o − d_p)/d_o × 100%, where d_o and d_p denote the drift of VI-ORBSLAM and the drift of our method, respectively. So it can be verified that the improved iterative strategy also has a positive effect on the velocity estimation. It should be noted that the reported results are the median of 10 tests on the V2_01 sequence. Although we keep the same parameters and the same playback speed of the dataset (×1.5 speed of the bag file) in each test, the results still differ slightly. For a fair comparison, we choose the median value as the evaluation indicator. The boxplots of the 10 tests are shown in Figure 7, with the median values represented by a horizontal line.
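The per-axis improvement figures quoted above follow directly from the drift values in Table 1:

```python
# Percentage improvement per axis: (d_o - d_p) / d_o * 100,
# with d_o the VI-ORBSLAM drift and d_p the drift of the proposed method.
drift_vi_orbslam = [0.0293, 0.0165, 0.0501]  # m/s, x/y/z, from Table 1
drift_ours = [0.0195, 0.0110, 0.0385]        # m/s, x/y/z
improvement = [round((d_o - d_p) / d_o * 100, 2)
               for d_o, d_p in zip(drift_vi_orbslam, drift_ours)]
# improvement == [33.45, 33.33, 23.15]
```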

4.4. Localization Accuracy

The localization accuracy of the proposed method is also examined on the whole EuRoC dataset. In this work, the open-source package evo [34] is adopted to evaluate the VI-SLAM algorithms. The qualitative absolute pose error (APE) results on the V2_01_easy sequence with respect to translation are shown in Figures 8(a)–8(h), where the blue colors of (a), (c), (e), and (g) encode the corresponding absolute pose errors of the trajectory, and red represents errors larger than those shown in blue. Panels (b), (d), (f), and (h) draw the curves of other indicators: mean, median, RMSE, and Std. As shown in Figure 9, the three comparison curves of the APE results are also given; our method performs better than VI-ORBSLAM and VINS-MONO in the APE evaluation. The comparisons of the key frame trajectories with the ground truth on the V2_01 sequence are shown in Figure 10, where the ground truth is represented by a black dotted line, VI-ORBSLAM by a blue line, VINS-MONO by a green line, ORB-SLAM3 by a violet line, and our method by a red line. The quantitative comparison with the values of max, mean, median, min, RMSE, SSE, and Std is shown in Table 2.

The quantitative root mean square errors (RMSE) over all 11 sequences are likewise the median values of 10 executions on the same computing platform. Our results are shown in Table 3, and the boxplots are provided in Figure 11. It should be noted that the estimated poses are usually not in the same coordinate system as the ground truth, so an alignment step is required. Practically, we calculate the transformation matrix from the estimated pose to the ground truth; the APE in the i-th frame is defined as follows:

Then, the absolute translational and orientation RMSE are calculated as follows:
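The translational RMSE can be sketched as follows (illustrative; the trajectories are assumed to be already aligned as described above, and the sample points are made up):

```python
# ATE RMSE: root mean square of the per-frame translational errors
# between an aligned estimated trajectory and the ground truth.
import math

def ate_rmse(estimated, ground_truth):
    sq = [sum((e - g) ** 2 for e, g in zip(p, q))  # squared error per frame
          for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(sq) / len(sq))

# Tiny synthetic example with two frames of 3-D positions.
est = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
gt = [[0.0, 0.0, 0.3], [1.0, 0.4, 0.0]]
rmse = ate_rmse(est, gt)
```

The orientation RMSE is computed analogously, with the per-frame rotational error angle replacing the translational distance.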

From Table 3, it can be seen that our method achieves more accurate results on average than VI-ORBSLAM, VINS-MONO, and ORB-SLAM3. Specifically, our method provides the best performance on six sequences for translation and on seven sequences for orientation. Note that VI-ORBSLAM could not completely run the V1_03_difficult sequence, while our method can. Besides, we list the average values over all sequences in the last row: VI-ORBSLAM attains 0.077 m in translation and 1.971 deg in orientation, VINS-MONO 0.111 m and 2.427 deg, ORB-SLAM3 0.042 m and 2.015 deg, and ours 0.036 m and 1.909 deg. VI-ORBSLAM and ORB-SLAM3 perform better on several sequences such as V1_01_easy, V1_02_medium, and MH_04_difficult, while our algorithm performs better on the others.

5. Conclusions

In this paper, the proposed initialization algorithm is built on the VI-ORBSLAM framework. In order to improve the quality of initialization, an improved trust region-based iterative strategy is proposed. Our method has been verified on a public dataset, showing a faster convergence speed in the initialization stage and more accurate velocity and pose estimation than the original method. At present, binocular VI-SLAM technology has more practical value in industrial applications, where several key issues should be considered, such as real-time computing and online recovery capabilities. In future work, we will try to apply the proposed strategy to the binocular model, as well as to the positioning and mapping of unmanned vehicles in outdoor environments.

Data Availability

The datasets used in this article can be found in the following link: https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (61673306).