Complex Deep Learning and Evolutionary Computing Models in Computer Vision
Yu Yang, Jing Xiong, Xiaoyu She, Chang Liu, ChengWei Yang, Jie Li, "Passive Initialization Method Based on Motion Characteristics for Monocular SLAM", Complexity, vol. 2019, Article ID 8176489, 11 pages, 2019. https://doi.org/10.1155/2019/8176489
Passive Initialization Method Based on Motion Characteristics for Monocular SLAM
Abstract
Visual SLAM techniques have proven to be effective methods for robust position and attitude estimation in the field of robotics. However, current monocular SLAM algorithms cannot guarantee timely system startup due to long initialization times and low success rates. This paper introduces a rectilinear platform-motion hypothesis and thereby converts the estimation problem into a verification problem to achieve fast monocular SLAM initialization. The proposed method is tested in simulation on a fixed-wing UAV. Tests show that the proposed method produces faster initialization of visual SLAM and that its advantages are more pronounced on systems with sparse image features.
1. Introduction
In recent years, with the application of graph-based optimization [1] and Bundle Adjustment (BA) [2] in Visual Simultaneous Localization and Mapping (vSLAM) and the emergence of excellent open-source libraries [3, 4], vSLAM systems are increasingly used on autonomous motion platforms. Exceptional open-source vSLAM systems have also helped popularize vSLAM techniques. Presently, vSLAM systems have been applied to UAV autonomous navigation [5, 6] and obstacle avoidance [7, 8] in GPS-denied environments. However, current vSLAM systems usually take a long time to initialize [9], posing difficulties for real-world engineering problems.
Currently, there exist many powerful vSLAM methods, such as PTAM [10], ORB-SLAM [11, 12], the semi-direct SVO [13, 14], and the direct LSD-SLAM [15] and DSO [16]. Their initialization methods are summarized in Table 1.

Generally speaking, feature-based vSLAM techniques rely on epipolar geometry constraints or homography constraints [17]; they obtain the rotation R and translation t corresponding to the minimum Reprojection Error with RANSAC or Least Squares methods. Direct methods, in contrast, are usually initialized through randomized approaches, as exact point-to-point mappings cannot be obtained directly, leaving R and t non-computable.
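As a concrete illustration of the epipolar constraint these methods exploit, the sketch below builds an essential matrix E = [t]×R from an assumed camera motion and verifies that a matched point pair satisfies x₂ᵀEx₁ = 0 in normalised coordinates. The motion, scene point, and variable names are toy assumptions for illustration, not values from this paper:

```python
import numpy as np

def skew(v):
    # Cross-product (skew-symmetric) matrix, so that skew(v) @ u == np.cross(v, u)
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# Assumed toy motion: pure sideways translation, no rotation.
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ R                       # essential matrix E = [t]x R

# Project one hypothetical 3D point into both (normalised) cameras.
X = np.array([0.2, -0.1, 4.0])
x1 = np.append(X[:2] / X[2], 1.0)     # homogeneous normalised coords, frame 1
p2 = R @ X + t
x2 = np.append(p2[:2] / p2[2], 1.0)   # homogeneous normalised coords, frame 2

residual = x2 @ E @ x1                # epipolar constraint: should be ~0
```

A RANSAC-based initializer searches for the E that drives this residual toward zero over many correspondences; here the motion is known, so the constraint holds exactly.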
As can be seen from the above, most classical monocular vSLAM methods do not consider the motion characteristics of the platform during the initialization phase. However, the basic equations of a SLAM system comprise both motion equations and observation equations, and most current research focuses on the observation equations. This paper argues that a reasonably introduced motion hypothesis can effectively improve the robustness of the observations, especially in the initialization phase.
PTAM’s initialization works under the hypothesis that captured images are mainly composed of flat surfaces; the initial camera motion R and t is then computed from the homography matrix (H) accordingly. The ORB-SLAM algorithms are effective extensions of PTAM that compute the essential matrix (E) and H simultaneously; the final initialization method is then selected by comparing their respective scores. LSD-SLAM and DSO, as direct methods, cannot compute R and t through Reprojection; they therefore initialize with random depth values, and when the camera motion covers enough distance, initialization is effectuated by locking onto specific depths. SVO’s initialization is similar to that of PTAM, except that SVO adds the assumption that the motion direction is perpendicular to the photographed plane, as SVO was originally designed for rotor UAVs.
These five classic methods are renowned in the field of monocular SLAM/VO, each possessing unique strengths. They have been successfully applied in their respective environments with satisfactory performance. The initialization workflows of the feature-based algorithms are summarized in Figure 1.
Theoretically, the initialization process depicted herein can initialize any movement except pure rotation. Firstly, corresponding points from separate frames are identified through feature-based or optical-flow methods. These point mappings are then used, together with monocular-camera imaging characteristics, to compute E or H under the epipolar geometry framework. E or H is then decomposed to produce R and t, and finally the initial map points, under the additional assumption that the mapped points undergo no actual movement. This concludes the traditional initialization process, where the frames can be adjacent or non-adjacent, and the decomposition utilizes RANSAC, the Eight-Point Method, or Bundle Adjustment. Subsequent processes use the initial R, t, and map points (3D) for the chain processes maintaining the monocular SLAM system. Due to the scale uncertainty of monocular visual SLAM, no initialization method can produce real-world distances of the map points; dimensionless depths are provided instead. The initial map points (3D) play an important role for subsequent frames. The indirect 3D-to-2D correspondences between pixel points and map points (3D), together with the geometric DLT/P3P [18]/EPnP [19]/UPnP [20] methods or optimization-based BA, are used to determine the subsequent frames’ positions and orientations relative to the preceding key frame. Frame I is an initialization frame as well as a key frame. As the camera moves on, the number of 3D-to-2D correspondences that can be established gradually decreases, leading to probable failure of the aforementioned chain processes. It is then necessary to insert new key frames to replenish the map points (3D) needed for the chain processes. Complete SLAM algorithms also involve another important process termed loop closure, which will not be discussed further, as it is not closely related to the present paper.
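The triangulation step of this workflow can be sketched as a linear (DLT) triangulation. The projection matrices and the scene point below are hypothetical toy values, not data from the paper:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # Linear (DLT) triangulation: each view contributes two rows of the
    # homogeneous system A @ X_h = 0; the SVD null vector is the point.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]            # dehomogenise

def project(P, X):
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

# Toy check: identity first camera, unit-baseline second camera.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0])
x1, x2 = project(P1, X_true), project(P2, X_true)
X_rec = triangulate(P1, P2, x1, x2)    # recovers X_true up to float precision
```

The dimensionless depth noted above appears here as well: scaling both translations and the point by a common factor leaves the image observations unchanged.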
It can be seen from the above that the E or H obtained from point correspondences is the initial enabler of the entire monocular SLAM system. However, when point correspondences are insufficient or too inaccurate, the obtained E or H may contain large errors, affecting the accuracy of the map points (3D) and thus compromising the subsequent processes. Current methods rely on the errors of R and t being limited; therefore, actual implementations perform much strict computation on E or H, leading to low success rates in many cases. When monocular SLAM systems are applied to fixed-wing unmanned aerial vehicles (UAVs), the initialization success rates are even more worrying [5, 6].
In this paper, we add a generalized motion-characteristic hypothesis to the initialization process, transforming the solution of the camera motion R and t into an error elimination problem. In this way, the success rate of initialization is increased. To address the error introduced by the hypothesis, this paper reduces it through multi-frame optimization, thus improving the accuracy of the initialization process.
1.1. Contribution
Firstly, this paper proposes the platform motion characteristics, which represent the motion state of the platform most of the time. Secondly, this paper introduces the platform motion characteristics into the initialization phase of monocular vSLAM and avoids solving for the essential matrix and the homography matrix by optimization. Finally, this paper uses the subsequent BA to convert initialization from a transient process into a convergent process over several consecutive frames.
2. Monocular SLAM Initialization Method Based on Platform Motion Characteristics and Optimization
2.1. Overview
The present paper proposes a monocular SLAM initialization method based on platform motion characteristics and optimization, the flowchart of which is shown in Figure 2.
The proposed method contains an offline process and an online process. In the offline process, the initial motions R_init and t_init are computed from the platform motion characteristics and the camera installation mode. In the online process, a pair of frames, Frame I and Frame T, is first used to detect and match feature points; initial map points are then generated under the initial motion hypothesis, and their initial errors are reduced by Init BA. Finally, the subsequent BA is performed with the matched feature points of Frame I and Frame n, further reducing the errors of R, t, and the map points. Initialization is considered successful when the errors converge.
2.2. Initial Motion Hypothesis Combining Platform Motion Characteristics and Camera Installation
The proposed method utilizes an initial motion hypothesis to initialize the system. Cars on the ground generally run along straight lines, while aerial vehicles usually fly at a fixed angle of attack. This is a very broad description, as ground vehicles may turn and aircraft may roll. The general motion characteristics can be expressed as in Equation (1), a 6-DOF rectilinear description of any motion platform. For monocular vSLAM initialization, the rectilinear hypothesis of the platform motion needs to be expressed in the coordinate system of the camera. The required conversion is derived from the mounting characteristics of the camera and is expressed in Equation (2), where the conversion matrix is the transformation of the camera coordinate system with respect to the platform coordinate system; it can be obtained from the camera installation characteristics, and its general form is given in Equation (3). Substituting (1) and (3) into (2), the camera motion model under the aforementioned rectilinear hypothesis is obtained.
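Under the stated hypothesis this model is simple to express in code. The sketch below derives the inter-frame camera motion from the rectilinear hypothesis and a mounting rotation; the function name, the body-x travel direction, and the 30° camera mount are assumptions for illustration, not values from the paper:

```python
import numpy as np

def initial_camera_motion(R_cp, v_body):
    # Rectilinear hypothesis: the platform translates along a fixed body
    # direction with no rotation, so the inter-frame camera rotation is
    # identity and the translation is the body velocity rotated into the
    # camera frame by the (assumed known) mounting rotation R_cp.
    R_init = np.eye(3)
    t_cam = R_cp @ v_body
    return R_init, t_cam / np.linalg.norm(t_cam)   # direction only: scale is unobservable

# Hypothetical mounting: camera rotated 30 degrees about the body y-axis.
th = np.deg2rad(30.0)
R_cp = np.array([[np.cos(th), 0.0, -np.sin(th)],
                 [0.0,        1.0,  0.0],
                 [np.sin(th), 0.0,  np.cos(th)]])
R_init, t_init = initial_camera_motion(R_cp, np.array([1.0, 0.0, 0.0]))
```

Because this is an offline computation, it adds no per-frame cost at startup.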
2.3. Monocular SLAM Initialization with Initial Motion Hypothesis
The established camera motion model is then used to implement the passive initialization process. In conventional initialization methods, R and t are obtained by decomposing the strictly computed E or H. The proposed method replaces these decomposition results with R_init and t_init, thereby triangulates the feature points to obtain the map points, and finally utilizes Init BA to reduce the errors (Algorithm 1).

The Init BA mentioned above minimizes the Reprojection Error for the feature points of Frame I and Frame T. Let x_I and x_T denote the coordinates of the matched feature points in Frame I and Frame T, respectively. The previously established camera motion model is used to triangulate x_I and x_T to obtain the map points X. Because the model introduces the rectilinear hypothesis, it is necessary to restrain the errors in the map points’ coordinates. The map points can be optimized once by Equation (6), which describes the Init BA optimization and reduces the error caused by the hypothetical model. However, due to the quality of the matched feature points, Init BA alone cannot reduce the errors of X, R, and t to an acceptable margin; the proposed method therefore utilizes the subsequent BA to reduce them further.
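The per-point optimization behind Init BA can be sketched as a tiny Gauss-Newton refinement of one triangulated point under fixed poses. The projection matrices, the perturbation, and the numerical Jacobian are toy assumptions for illustration, not the paper's actual solver:

```python
import numpy as np

def project(P, X):
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

def refine_point(P1, P2, x1, x2, X0, iters=10):
    # Gauss-Newton on the stacked reprojection residuals of one map point
    # in Frame I and Frame T; poses are held fixed, only X moves.
    X = np.asarray(X0, dtype=float).copy()
    eps = 1e-6
    for _ in range(iters):
        r = np.hstack([project(P1, X) - x1, project(P2, X) - x2])
        J = np.zeros((4, 3))
        for k in range(3):               # forward-difference Jacobian
            dX = np.zeros(3)
            dX[k] = eps
            r2 = np.hstack([project(P1, X + dX) - x1,
                            project(P2, X + dX) - x2])
            J[:, k] = (r2 - r) / eps
        X -= np.linalg.lstsq(J, r, rcond=None)[0]   # GN step
    return X

# Toy check: a perturbed point converges back to the true point.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])
X_true = np.array([0.2, -0.1, 4.0])
x1, x2 = project(P1, X_true), project(P2, X_true)
X_ref = refine_point(P1, P2, x1, x2, X_true + 0.3)
```

A full Init BA stacks such residuals for all matched points; production systems use an analytic Jacobian and a robust loss rather than this finite-difference sketch.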
2.4. Error Reduction with Subsequent BA
Limited by the number and distribution of the matched feature points, the errors contained in X, R, and t cannot be evaluated through Init BA alone, so this paper introduces the subsequent BA for initial error evaluation and further error suppression. The main idea of the subsequent BA is to optimize X, R_init, and t_init with each subsequent frame and then decide whether to continue the optimization by judging its convergence (Algorithm 2).
For each subsequent input frame, the coordinates of the map points are optimized with the error reduction process shown in Figure 3; the subsequent BA optimization is given in Equation (7). When the process converges, error elimination is considered complete. As (7) shows, the scale of the optimization grows with the continuous input of subsequent frames, which to some extent ensures the reliability of the optimized results. The optimized map points are taken as the initialization results, providing input for the subsequent chain processes. The quality of initialization is evaluated with Equation (8), where the evaluated quantity is the sum of the Reprojection Errors of all map points in all frames participating in the optimization.
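The error sum and the convergence test driving this loop can be sketched as follows. The data layout (poses as 3×4 matrices, observations keyed by frame and point index), the function names, and the threshold value are assumptions for illustration:

```python
import numpy as np

def total_reproj_error(poses, points, obs):
    # Sum of reprojection errors of all map points in all frames
    # participating in the optimisation, i.e. the quality measure of
    # Equation (8); obs maps (frame_idx, point_idx) -> observed pixel.
    total = 0.0
    for (f, j), x in obs.items():
        h = poses[f] @ np.append(points[j], 1.0)   # project point j into frame f
        total += np.linalg.norm(h[:2] / h[2] - x)
    return total

def converged(e_prev, e_curr, eps=0.5):
    # Stop the subsequent-BA loop once the error change between two
    # consecutive optimisation runs falls below the threshold eps.
    return abs(e_prev - e_curr) < eps

# Toy usage: perfect observations of one point in two frames give ~0 error.
poses = [np.hstack([np.eye(3), np.zeros((3, 1))]),
         np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])]
X = np.array([0.2, -0.1, 4.0])
obs = {}
for f in (0, 1):
    h = poses[f] @ np.append(X, 1.0)
    obs[(f, 0)] = h[:2] / h[2]
e = total_reproj_error(poses, [X], obs)
```

Because each new frame adds observations, the growing window makes the convergence judgement progressively more trustworthy, at the cost of a larger optimization.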
3. Simulation
3.1. Simulation System
In order to better reproduce vehicle motion characteristics, the present study builds a hardware-in-the-loop (HIL) simulation system, as illustrated in Figure 4. It consists of four parts: the X-Plane 10 flight simulation software, the Pixhawk 2 flight controller, the QGroundControl software, and the data logger. X-Plane 10 and QGroundControl run on a PC (CPU: Intel i7-7700K 4.20 GHz; graphics card: NVIDIA GTX 1080 8 GB; memory: 32 GB). The Pixhawk 2 controller is linked to the PC via a USB port.
X-Plane 10 plays the most important role in the simulation system, providing aircraft models and simulation images. The Pixhawk 2 controller performs autonomous control of the fixed-wing aircraft in X-Plane 10, with QGroundControl acting as the data relay. Specifically, X-Plane 10 sends the aircraft states to QGroundControl through local loopback UDP; QGroundControl forwards the aircraft data to the Pixhawk 2 via the USB port using the MAVLink protocol; the Pixhawk 2 sends out control commands through the same protocols. The data logger records the uplink/downlink data through UDP and the first-person-view (FPV) simulation images through the video capture card.
Data transmitted in the simulation system can be roughly classified as periodic and sporadic. Periodic data includes the control commands and the aircraft states; sporadic data includes the start signal, the waypoint-planning instruction, etc. Through meticulous testing, the frequency of the periodic commands is set at 65 Hz and the image sampling frequency at 25 Hz.
3.2. Performance Indicators
Reasonable and balanced performance indicators are needed to evaluate the initialization methods. This paper proposes two groups of performance indicators, for self-evaluation (Table 2) and comparative evaluation (Table 3), respectively.


This study holds that the convergence frame number and the initial error are the key indicators for assessing the proposed passive initialization algorithm. As the initial error is affected only by the aircraft state at the initial time and the rectilinear motion hypothesis, the convergence frame number is the stronger indicator of the usability of the proposed method.
The optimized map points and pose information are only usable after convergence. Since the rotation error is quite unintuitive, for ease of understanding this paper decomposes R into its yaw, pitch, and roll angles to facilitate the evaluation of performance. The difference in length between t and its true value is not considered, due to the depth uncertainty of monocular vSLAM initialization; only the difference in angle between them is considered.
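The scale-free translation comparison described here reduces to the angle between two direction vectors. A minimal helper (names assumed, not from the paper) is:

```python
import numpy as np

def translation_angle_deg(t_est, t_true):
    # Angle between the estimated and true translation directions, in
    # degrees; vector lengths are ignored because monocular initialisation
    # cannot recover metric scale.
    c = np.dot(t_est, t_true) / (np.linalg.norm(t_est) * np.linalg.norm(t_true))
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
```

The clip guards against floating-point round-off pushing the cosine marginally outside [-1, 1], which would otherwise make arccos return NaN for perfectly aligned vectors.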
3.3. Test Design
In order to thoroughly test the proposed initialization method, we devise a simple self-evaluation test and an advanced comparative-evaluation test. The self-evaluation test measures the inherent capabilities of the new method, while the comparative-evaluation test runs the competing algorithms on different terrains. The test scenarios include taxiing, climbing, level flight, BTT (bank-to-turn) turning, diving, and landing.
3.4. Self-Evaluation Test and Result Analysis
The test results of the proposed algorithm are shown in Figure 6, which gives its convergence curves. Figure 6(a) shows the convergence curve for initialization in the taxiing state; the initial error is small because the aircraft’s motion in this state is very close to the motion hypothesis, and the error stays within 1° even before error elimination. Figure 6(b) shows the convergence curve for initialization at the moment of takeoff, where there is a large discrepancy between the aircraft’s motion state and the motion hypothesis. Owing to the characteristics of fixed-wing aircraft, the platform is mostly in level flight during cruise, where its motion is similar to the motion hypothesis. Several special states are therefore selected in the test, including climbing to level flight (Figure 6(d)), level flight to BTT turn (Figure 6(e)), and level flight to dive (Figure 6(f)). Figure 6 thus gives the convergence of the algorithm in each typical state during a complete flight, but it does not reflect the ability of the algorithm over the whole process.
Figures 7 and 8 and Table 4 summarize the convergence-related initialization performance at different poses throughout one complete flight. Figure 7 shows the convergence-time statistics over the whole flight; Figure 8 shows the error distribution of the algorithm under different thresholds; Table 4 gives the exact values behind Figures 7 and 8.

3.5. Comparative-Evaluation Test and Result Analysis
This paper selects ORB-SLAM2 and DSO as the competing classical algorithms for the comparative-evaluation test. In order to better reflect their performance, this study runs all methods on plain terrain (Figure 5) and mountainous terrain (Figure 9). Considering the stochastic nature of ORB-SLAM2, the study conducts five comparative-evaluation subtests on each terrain; the best subtest results are taken as the illustrative test results.
Figure 10 gives the initialization results of the three algorithms on the two terrains; the convergence threshold of the proposed method is set to 0.5 during the test. It can be seen that the SRI of the proposed method on both plain and mountainous terrain is greater than that of ORB-SLAM2 or DSO. Considering the effect of the threshold on the proposed method, the SRI under different threshold values is compared in Table 5.

In addition to the comparative-evaluation performance indicators introduced in Table 3, this paper also compares the number of matched feature points (ANMFP) needed by ORB-SLAM2, DSO, and the proposed method, respectively, to effectuate successful initialization (Table 6).

It can be seen from Table 6 that the ANMFP of the proposed algorithm is between 50 and 70, while that of ORB-SLAM2 is above 200; DSO requires an even larger ANMFP because it uses a direct-method framework. The number of feature points required by the proposed algorithm is thus much smaller than that of ORB-SLAM2 and DSO. This result follows from the basic structure of the proposed algorithm: instead of directly calculating E or H from the feature-point correspondences of two adjacent frames, it continuously optimizes the initial pose using the feature-point correspondences that remain observable across successive frames. In other words, the method does not need many feature points in the initial frame; it can obtain an acceptable initial attitude as long as enough points can be continuously observed in successive frames. This also explains, from another angle, why the algorithm achieves a higher SRI. The proposed method can therefore achieve better results from few feature points when image features are sparse.
4. Conclusion
In this paper, we propose a rectilinear hypothesis of platform motion and thereby derive a passive initialization method for monocular SLAM. Init BA and subsequent BA are utilized to reduce the errors between the actual motion and that of the proposed hypothesis. A simulated fixed-wing aircraft is selected as the test platform. Results show that the success rate of monocular SLAM initialization is greatly improved compared with that of ORB-SLAM2. However, the method is only effective on platforms with strong motion characteristics and cannot be used indiscriminately on platforms characterized by randomized motions, such as humans and animals. At present, the method has yet to be tested in real-world environments, which will be addressed in future work.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Disclosure
The research received no external funding. Among the authors, Yu Yang, Jing Xiong, and Xiaoyu She are graduate students of Beijing Institute of Technology. The research was performed as part of their education. Authors Jie Li, Chengwei Yang, and Chang Liu are employed by Beijing Institute of Technology. They only played a supervising role in the research.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
 G. Grisetti, R. Kummerle, C. Stachniss, and W. Burgard, “A tutorial on graph-based SLAM,” IEEE Intelligent Transportation Systems Magazine, vol. 2, no. 4, pp. 31–43, 2010.
 B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment: a modern synthesis,” in Vision Algorithms: Theory and Practice, B. Triggs, A. Zisserman, and R. Szeliski, Eds., vol. 1883 of Lecture Notes in Computer Science, pp. 298–372, Springer, Corfu, Greece, 1999.
 R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: a general framework for graph optimization,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '11), pp. 3607–3613, IEEE, Shanghai, China, May 2011.
 S. Agarwal and K. Mierle, “Ceres solver,” http://ceres-solver.org.
 Y. Lin, F. Gao, T. Qin et al., “Autonomous aerial navigation using monocular visual-inertial fusion,” Journal of Field Robotics, vol. 35, no. 1, pp. 23–51, 2018.
 F. Nex and F. Remondino, “UAV for 3D mapping applications: a review,” Applied Geomatics, vol. 6, no. 1, pp. 1–15, 2014.
 S. Lin, M. A. Garratt, and A. J. Lambert, “Monocular vision-based real-time target recognition and tracking for autonomously landing an UAV in a cluttered shipboard environment,” Autonomous Robots, vol. 41, no. 4, pp. 881–901, 2017.
 D. Scaramuzza, M. C. Achtelik, L. Doitsidis et al., “Vision-controlled micro flying robots: from system design to autonomous navigation and mapping in GPS-denied environments,” IEEE Robotics and Automation Magazine, vol. 21, no. 3, pp. 26–40, 2014.
 J. Ventura, C. Arth, G. Reitmayr, and D. Schmalstieg, “Global localization from monocular SLAM on a mobile phone,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 4, pp. 531–539, 2014.
 G. Klein and D. Murray, “Parallel tracking and mapping for small AR workspaces,” in Proceedings of the 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR '07), pp. 225–234, Nara, Japan, November 2007.
 R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
 R. Mur-Artal and J. D. Tardos, “ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
 C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: fast semi-direct monocular visual odometry,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '14), pp. 15–22, IEEE, Hong Kong, June 2014.
 M. Pizzoli, C. Forster, and D. Scaramuzza, “REMODE: probabilistic, monocular dense reconstruction in real time,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '14), pp. 2609–2616, IEEE, June 2014.
 J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: large-scale direct monocular SLAM,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., vol. 8690, pp. 834–849, Springer International Publishing, 2014.
 J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2018.
 R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2nd edition, 2003.
 X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng, “Complete solution classification for the perspective-three-point problem,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 930–943, 2003.
 V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: an accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, no. 2, pp. 155–166, 2009.
 L. Kneip, H. Li, and Y. Seo, “UPnP: an optimal O(n) solution to the absolute pose problem with universal applicability,” in Proceedings of the European Conference on Computer Vision, vol. 8689, pp. 127–142, Springer, 2014.
Copyright
Copyright © 2019 Yu Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.