Abstract

Mobile robots are widely used in medicine, agriculture, household services, and industry. Simultaneous localization and mapping (SLAM) is the working basis of mobile robots, so research on SLAM technology is both necessary and meaningful. SLAM draws on robot kinematics, logic, mathematics, perceptual detection, and other fields, and the difficulty of organizing this technical content has led to diverse SLAM frameworks. Among them, visual SLAM (V-SLAM) has become a key academic research topic due to its advantages of low price, easy installation, and simple algorithmic model. Firstly, we illustrate the advantages of V-SLAM by comparing it with other localization techniques. Secondly, we survey open-source V-SLAM algorithms and compare their real-time performance, robustness, and innovation. Then, we analyze the frameworks, mathematical models, and related basic theoretical knowledge of V-SLAM. Meanwhile, we review related works from four aspects: visual odometry, back-end optimization, loop closure detection, and mapping. Finally, we discuss future development trends and lay a foundation for researchers to extend this work. In summary, this paper classifies each module of V-SLAM in detail, aims to provide good readability, and offers a comprehensive, up-to-date review of V-SLAM.

1. Introduction

Tracing back to modern positioning technology, the earliest system widely used worldwide is the global positioning system (GPS) [1], proposed by the U.S. Department of Defense in 1964. Its military accuracy can reach 1–2 meters, and its civilian accuracy is 5–10 meters. Obviously, this kind of precision is not enough for many application devices. A higher-precision positioning technology was then proposed, namely a costly combination of GPS, real-time kinematic (RTK) positioning, and an inertial measurement unit (IMU) [2], which can achieve an accuracy of 1–3 centimeters. Although this integrated inertial navigation technology basically solves the outdoor problem, it is not usable indoors, because the RTK signal cannot be received under occlusion. In 2004, an indoor ultra-wideband (UWB) [3] wireless positioning technology was proposed to solve the indoor positioning problem. However, UWB technology requires modification of the working environment of mobile machines, such as the installation of base stations and transmitters, and is therefore not suitable for unfamiliar, unmodified scenes. To solve the above problems, an independent, indoor-and-outdoor compatible, and low-cost integrated solution that localizes by real-time mapping without modifying the working environment has been proposed in recent years, namely, simultaneous localization and mapping (SLAM). The emergence of SLAM technology brings broad application scenarios and great significance for robot positioning.

Here is a look at the origins and applications of SLAM. It originated in 1986 [4], when statistical probability was first being applied to the field of artificial intelligence. Smith and Cheeseman discussed the problem of continuous map building at the IEEE Robotics and Automation Conference, concluded that consistency checking is important, and found the correlation law of landmarks through a statistical framework. SLAM then developed alongside statistical image processing. On the basis of [5–7], Smith et al. proposed synchronous positioning and real-time monitoring, using a state vector to store the joint robot pose and landmark coordinates [8]. Through the integration of the above techniques, the abbreviation SLAM appeared for the first time at an academic forum [9] in 1995. Since then, there has been a relatively systematic framework for SLAM, and this conference opened the prelude to the SLAM era. In recent years, SLAM technology has received wide attention and application, for example in augmented reality (AR), where Google and Apple launched ARCore and ARKit, respectively. In the field of mobile robots and aerial vehicles, SLAM can assist these intelligent agents in positioning, path planning, autonomous exploration, obstacle avoidance, and navigation. Similarly, SLAM technology is widely used in underwater operations, industrial robots, medical treatment, unmanned driving, and other fields [10–13], as shown in Figure 1.

According to the perceptual module, common SLAM systems generally take two forms: LIDAR-based SLAM (light detection and ranging, LIDAR) and visual SLAM (V-SLAM). When a LIDAR sensor is used, the system is called LIDAR-based SLAM; similarly, when a camera sensor is used, it is called V-SLAM. In LIDAR-based SLAM, LIDAR can collect point clouds containing direct geometric relations. Representative methods include real-time 3D laser SLAM (LOAM) [14], real-time loop closure in 2D LIDAR SLAM (Cartographer) [15], the tightly coupled LIDAR-visual-inertial SLAM system (LVI-SLAM) [16], and so on. According to surveys of LIDAR-based SLAM applications over the past decade, enterprises have mainly applied it to autonomous driving vehicles and automated guided vehicle (AGV) robots. However, these applications face production costs that are too high for mass deployment; the prices of some well-known LIDAR manufacturers are shown in Figure 2. Besides, the current bottleneck is that the technology cannot be generalized. Although some low-cost LIDAR sensors exist, they are only suitable for small-scale detection scenarios, mostly indoor applications such as the sweeping robot [17], and not for outdoor applications. With the development of computer vision, V-SLAM is undoubtedly the best choice to provide customers with better service due to its advantages of low cost, simple structure, diversified installation, universal indoor and outdoor usage, and rich semantic information.

Next, we evaluate existing review papers on V-SLAM from an objective perspective. In 2015, Fuentes-Pacheco et al. first reviewed the main concepts and basic algorithms of V-SLAM [18]. Although the covered methods appear dated from today's perspective, the review was unmatched for its time. In 2016, Gui et al. discussed filter-based and optimization-based V-SLAM [19]; nevertheless, they only investigated the main concepts of visual-inertial SLAM. In 2017, Taketomi et al. reviewed V-SLAM technology from 2010 to 2016, but they focused only on comparing earlier V-SLAM systems [20] and did not expand on the V-SLAM problem in detail. In 2018, Li et al. analyzed the opportunities and challenges of V-SLAM and predicted its future development prospects, but they focused too heavily on deep learning [21]. In 2022, Chen et al. collated a large body of traditional and semantic V-SLAM literature and summarized classical frameworks that employ direct/indirect approaches [22]. Despite providing a comprehensive exposition, their classification of methods was limited to the types of features used in feature-based V-SLAM. In another work, Macario Barros et al. provided a timeline of V-SLAM techniques over the years [23]. The visualization was easy to understand, but the rationale was generalized rather than explored in depth.

Table 1 classifies the classic review papers to date. It can be seen that most of them focus on a particular topic or trend. In particular, none of the existing V-SLAM reviews discuss loop closure detection in detail. Compared with other reviews, this paper provides a more comprehensive review of V-SLAM. The main contributions of this paper are summarized as follows:
(a) We discuss the basic theoretical knowledge and comprehensively review the advantages of V-SLAM. Recent research in the field of V-SLAM is classified into the front end, back end, loop closure detection, and mapping.
(b) We modularly classify the open-source V-SLAM algorithms by seven indicators and provide convenient references for readers.
(c) We analyze the advantages and disadvantages of the current mainstream V-SLAM systems by studying different methods from different aspects. In particular, we describe visual odometry and loop closure detection in detail.
(d) We explain the V-SLAM algorithms with tables and flow charts, which lays a foundation for researchers.
(e) We divide the V-SLAM research directions into five categories. Meanwhile, we analyze the existing solutions, summarize their shortcomings, look forward to future research directions, and provide researchers with extended research ideas.

The remainder of this paper is organized as follows. The research problem of V-SLAM is described in Section 2. In Section 3, the details of V-SLAM technology are stated. The research status of V-SLAM is reported in Section 4. Then, the development trends and active research areas of V-SLAM are discussed in Section 5. Finally, Section 6 concludes the paper.

2. Research Problem

For V-SLAM, a major research issue is how to use cameras to collect, store, and update environmental information in order to locate and map accurately in an unknown environment. There are two main research directions in V-SLAM. One is V-SLAM based on filtering, which uses all image information for visual fusion. The other is based on feature extraction and optimization theory, which uses key frames for data association. Both methods are described in detail below.

Filter-based V-SLAM deals with data association in a more holistic way: it considers the relationship between the current state and all previous states, which saves a large amount of feature extraction time. However, it has the following inevitable shortcomings. Firstly, it is based on the assumption of invariant gray levels, so it is easily affected by external illumination. Secondly, single pixels are not discriminative, so correlations over pixel blocks must be computed in regions of uniform color. Thirdly, because the image-alignment problem is nonconvex, a good initial value is needed to ensure correct tracking.

V-SLAM based on feature extraction and optimization theory has become the mainstream due to its direct extraction of salient information and its use of nonlinear optimization, as shown in Table 2. However, it also has inevitable shortcomings. Firstly, feature-based V-SLAM still spends a lot of time computing descriptors and matching features, and many extracted features cannot be used properly. Secondly, due to the heavy matrix computation, the efficiency of feature extraction, loop closure detection, and failure recovery needs to be optimized. Thirdly, its robustness still needs to be enhanced.

All in all, the above research issues are the focus of this paper's analysis, and the classification of the relevant technical content is made in the following sections.

3. Formal Description

V-SLAM is generally divided into sensor data collection, front-end visual odometry, back-end optimization, and loop closure detection. The front-end visual odometry estimates the relative motion of the camera from the information in adjacent images and builds a local map. Back-end optimization further optimizes the initial estimates produced by the front-end visual odometry according to statistical inference. Loop closure detection judges whether a previously visited scene has been reached by visually checking the pose information at a certain point in time, as shown in Figure 3.

In order to obtain positioning information, we may assume discrete sampling times t_k, k = 1, ..., K, where Δt is the sampling period. The position information at the corresponding time t_k is x_k, and these discrete points constitute the trajectory of the rigid robot. We can use a general and abstract mathematical model x_k = f(x_{k-1}, u_k, w_k), where f is the motion equation, x_k represents the vector of the robot's position (pose) at time k, x_{k-1} is the position at the last moment, u_k is the input of the motion sensor, and w_k is the noise. In order to obtain map-building information, we may assume that the rigid robot observes a landmark point y_j from position x_k and generates an observation datum z_{k,j}. According to the observation data, the rigid robot can independently adjust its position in an unknown environment. Similarly, we can use the general and abstract mathematical model z_{k,j} = h(y_j, x_k, v_{k,j}), where v_{k,j} is the noise and h is the observation equation. To sum up, V-SLAM estimates the internal state variables (the poses x and the landmarks y) from noisy measurement data for localization and mapping.
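To make the abstract models above concrete, the following sketch (not from the paper; all values are hypothetical) simulates one step of the motion equation x_k = f(x_{k-1}, u_k) + w_k and the observation equation z_{k,j} = h(x_k, y_j) + v_{k,j} for a planar robot observing a single landmark.

```python
# Minimal sketch: a 2D robot propagated by a unicycle motion model and observing the
# range and bearing to one landmark.  All numeric values are assumed for illustration.
import numpy as np

def motion_model(x_prev, u, dt=1.0):
    """f(x_{k-1}, u_k): unicycle model, x = [px, py, yaw], u = [v, omega]."""
    px, py, yaw = x_prev
    v, omega = u
    return np.array([px + v * np.cos(yaw) * dt,
                     py + v * np.sin(yaw) * dt,
                     yaw + omega * dt])

def observation_model(x, landmark):
    """h(x_k, y_j): range and bearing from the robot pose to a 2D landmark."""
    dx, dy = landmark - x[:2]
    return np.array([np.hypot(dx, dy), np.arctan2(dy, dx) - x[2]])

rng = np.random.default_rng(0)
x_true = np.array([0.0, 0.0, 0.0])          # initial pose
landmark = np.array([5.0, 3.0])             # hypothetical landmark position y_j
u = np.array([1.0, 0.1])                    # motion command u_k

x_true = motion_model(x_true, u) + rng.normal(0, 0.02, 3)         # w_k: motion noise
z = observation_model(x_true, landmark) + rng.normal(0, 0.05, 2)  # v_{k,j}: sensor noise
print("noisy observation z =", z)
```

An estimator (filter or optimizer) would then invert these noisy relationships to recover x and y, which is exactly the estimation problem stated above.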

The accuracy of the measurements depends greatly on the type of camera sensor. Camera sensors are classified according to the implementation mode: monocular camera, binocular camera, and RGB-D camera, as shown in Figure 4. In addition, camera sensors include event camera, panorama camera, and other special cameras.

We cannot determine the depth of a spatial point from the imaging model of a monocular camera alone, so the monocular camera generally relies on its own motion, collecting two frames separated by a fixed time interval and using the parallax between them to estimate depth. A binocular camera, in contrast, can estimate the depth of a pixel from the static disparity between the left and right cameras. For both monocular and binocular cameras, parallax itself is difficult to compute, so RGB-D cameras were invented; they actively measure the depth of each pixel using infrared structured light or time-of-flight principles. However, RGB-D cameras have disadvantages in cost, power consumption, and so on. In addition, visual cameras have inherent defects caused by intrinsic-parameter calibration errors: the images they capture are largely determined by the focal length of the lens and are susceptible to exposure, so the data obtained by a camera must be corrected. Furthermore, camera sensors can be divided into fisheye and wide-angle cameras according to the lens type. A wide-angle camera has a short focal length and a wide angle of view, so it can cover a broad scene even in a tight space and is generally used to shoot large scenes. A fisheye camera has an even wider range of viewing angles, but the smaller the scale of an object in the picture, the less clear it becomes, so it is generally used in scenes where fine detail does not matter.
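As an illustration of how a binocular (stereo) camera turns disparity into depth via z = f·b/d, the following sketch uses OpenCV's block-matching stereo. The file names, focal length, and baseline are assumed values, not from the paper.

```python
# Minimal sketch: per-pixel depth from a rectified stereo pair, z = focal * baseline / disparity.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical rectified stereo pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # StereoBM returns fixed-point values

focal_px = 718.0      # assumed focal length in pixels
baseline_m = 0.54     # assumed baseline in meters
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = focal_px * baseline_m / disparity[valid]
print("median depth of valid pixels:", np.median(depth[valid]), "m")
```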

The motion of a rigid body robot in 3D space is actually the corresponding rotation and translation operation, as shown in Figure 5. Therefore, we can estimate the trajectory of the rigid body robot and the spatial structure of the environment through the pixel information between adjacent frames.

There is also a problem that cannot be ignored: the transformation relationship is determined by estimating the motion between two images, and estimation error is inevitable. Therefore, we need back-end optimization to handle the resulting errors. In past decades, filters were almost always used for back-end optimization, as shown in Figure 6. However, the system is generally linearized and its noise approximated by a Gaussian distribution, so higher-order terms are discarded and the covariance matrix of the state must be updated in real time, leading to high time complexity and error. Alternatively, the least-squares method can be used, minimizing the error between the actual measurements z_{k,j} and the values h(y_j, x_k) predicted by the observation model, i.e., min Σ ||z_{k,j} − h(y_j, x_k)||². The spatial position of every feature and every camera pose is adjusted while meeting the real-time requirement of the system. Therefore, nonlinear optimization represented by bundle adjustment (BA) is currently the mainstream back-end optimization method. There is another method based on graph optimization, as shown in Figure 7: by using pose nodes and landmark nodes as optimization variables and edges as error terms, it greatly improves optimization efficiency.
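The following sketch illustrates the nonlinear least-squares idea behind BA on synthetic data: a single camera pose is refined by minimizing the reprojection error with SciPy. The intrinsics and noise levels are assumed, and full BA would also optimize the landmark positions.

```python
# Minimal sketch: motion-only bundle adjustment, minimizing sum ||z_i - h(x, y_i)||^2.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])   # assumed pinhole intrinsics

def project(pose6, pts3d):
    """h(x, y): project world points with pose parameterized as [rotvec(3), t(3)]."""
    R = Rotation.from_rotvec(pose6[:3]).as_matrix()
    cam = pts3d @ R.T + pose6[3:]
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def residuals(pose6, pts3d, obs2d):
    return (project(pose6, pts3d) - obs2d).ravel()

rng = np.random.default_rng(1)
pts3d = rng.uniform([-2, -2, 4], [2, 2, 8], (30, 3))             # synthetic landmarks
pose_true = np.array([0.05, -0.02, 0.01, 0.1, 0.0, 0.2])         # synthetic true pose
obs2d = project(pose_true, pts3d) + rng.normal(0, 0.5, (30, 2))  # noisy pixel observations

result = least_squares(residuals, x0=np.zeros(6), args=(pts3d, obs2d))
print("refined pose:", result.x)                                 # should approach pose_true
```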

Moreover, the error caused by each movement is propagated to the next moment, so drift occurs after a long period of motion. To solve this problem, the rigid robot can determine whether it has returned to a previously visited place based on the similarity between images, which significantly reduces the cumulative error and yields a consistent map. At present, loop closure detection is basically a combination of the bag-of-words model and feature point detection. Although nonlinear optimization or graph optimization is used when updating frames, this architecture is more efficient than the direct filtering approach.

In some specific applications, we need different forms of maps to store environmental information. Generally speaking, map forms can be divided into point cloud map, occupancy grid map, octree map, and grid map, as shown in Figure 8.

4. Research Status

V-SLAM has made major breakthroughs in the last 10 years. The realization methods of V-SLAM include the direct method, the feature-points method (the indirect method), and the hybrid semidirect method. In this section, we discuss the three methods in detail. Then, we introduce the current status of loop closure detection. Finally, we analyze the problems of current V-SLAM technology from a modular perspective and propose future solutions.

4.1. The Direct Method

In recent years, the direct estimation method, which is based on the invariance of pixel gray levels and directly uses the light-and-shade information of the image, has developed rapidly [53–55], as shown in Figure 9. The direct method evolved from optical flow [56]; it estimates the camera pose from the photometric error between pixels, which is formulated as a least-squares problem. This method not only avoids the cost of extracting key points and computing descriptors in the feature-based method but also copes with feature-poor scenes. Generally, the direct method can be divided into three categories according to the number of pixels used: the sparse direct method, the dense direct method, and the semidense direct method.
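A minimal sketch of the direct method's core computation, under the brightness-constancy assumption and a translation-only warp on a synthetic image: the shift between two frames is estimated by Gauss–Newton minimization of the photometric error. Real direct V-SLAM optimizes a full 6-DoF pose and per-pixel depth instead.

```python
# Minimal sketch: estimate a 2D image shift by minimizing the photometric error.
import numpy as np
from scipy.ndimage import shift as warp_shift

ys, xs = np.mgrid[0:60, 0:80]
ref = np.sin(xs / 6.0) + np.cos(ys / 7.0)            # smooth synthetic reference image I1
cur = warp_shift(ref, (1.3, -0.7), mode="nearest")   # current image I2, shifted by the true motion

p = np.zeros(2)                                      # estimated (dy, dx)
for _ in range(20):
    warped = warp_shift(cur, -p, mode="nearest")     # warp I2 back by the current estimate
    r = (warped - ref).ravel()                       # photometric residual
    gy, gx = np.gradient(warped)
    J = np.stack([gy.ravel(), gx.ravel()], axis=1)   # intensity Jacobian w.r.t. the shift
    p -= np.linalg.solve(J.T @ J, J.T @ r)           # Gauss-Newton update
print("estimated shift:", p)                         # should approach (1.3, -0.7)
```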

Early direct V-SLAM was not built on tracking-and-mapping frameworks, and most approaches relied on hand-selected key points. Later, a variational formulation was proposed in [57] to construct dense maps, but its GPU requirements were quite stringent. In [58], a depth filter was proposed to construct semidense maps, which not only reduced the time complexity but also allowed real-time operation on a single CPU. In 2011, Newcombe et al. proposed DTAM [59], the first completely direct approach, with stability and accuracy as its virtues. However, this method aligns whole images against dense depth maps and requires GPU acceleration, so its real-time performance on ordinary hardware is limited. Subsequently, Engel et al. proposed a direct monocular SLAM for large-scale environments (LSD-SLAM) in 2014. LSD-SLAM extended the direct approach, which had previously been applied only to partial processing, to a complete SLAM system. The semidense map generated by LSD-SLAM on a common CPU is composed directly of key frames related by transformations and can accurately detect scale drift. However, for loop closure detection, LSD-SLAM uses the same feature-based pose optimization as ORB-SLAM2 [60], so it does not get rid of the computation of feature points.

DSO (direct sparse odometry) is also a pure direct method and an updated successor to Engel et al.'s LSD-SLAM. DSO is not a complete SLAM system [61]: it lacks loop closure detection and map reuse. DSO improves real-time performance by projecting each point onto all frames and computing its residual in each frame. Moreover, DSO improves accuracy by using a sliding window of several key frames as its back end.

TANDEM is a novel monocular V-SLAM approach that combines classic direct methods with multiview stereo (MVS) 3D reconstruction [62]. TANDEM not only renders dense maps from a global truncated signed distance function (TSDF) model but also uses view aggregation and multilevel depth prediction over the entire key-frame window, which greatly improves accuracy, real-time performance, and stability.

Like all direct methods, these approaches rest on the assumption of pixel gray-level invariance, which is in practice susceptible to environmental lighting, camera exposure, and other factors, so their application scenarios are limited. None of the above algorithms escapes these common disadvantages of the direct method. In addition, the direct method is based on gradient search: if the time between two image acquisitions is too large, the image content may move too far, and the optimization can fall into a local minimum. From this point of view, the direct method is better suited to environments with large tonal differences and high-speed movement of objects.

4.2. The Feature-Points Method (The Indirect Method)

For the feature-points V-SLAM, its basic principle is to estimate the relative motion of the camera by extracting and comparing the feature points of adjacent images. The steps of the feature point method mainly include feature extraction, feature matching, motion estimation, and local optimization, as shown in Figure 10.

Feature points consist of key points and descriptors. Initially, David Lowe proposed the scale-invariant feature transform (SIFT), based on the difference of Gaussians (DoG) scale space, in 1999 and perfected it in 2004 [63]. This was the first algorithm to achieve invariance to rotation, scale, and brightness and to be applied to V-SLAM. Later, the histogram of oriented gradients (HOG) algorithm [64] maintained good invariance by computing histogram statistics over local image regions, so subtle movements can be ignored without influence. HOG is faster than SIFT, but it is far inferior to SIFT in complex environments. Subsequently, in 2006, Bay et al. improved on SIFT and proposed the speeded-up robust features (SURF) algorithm [65], which improved execution efficiency by using the Hessian matrix and a lower-dimensional feature descriptor. However, the dominant orientation in this method depends heavily on the gradient direction of a local pixel region, and subsequent extraction steps also depend on this orientation, so errors can easily be magnified. In 2011, Rublee et al. proposed oriented FAST and rotated BRIEF (ORB) [66], which extended binary robust independent elementary features (BRIEF) [67] into a rotation-aware variant (rBRIEF) and added scale and orientation information to the original FAST detector [68]. The rBRIEF descriptor selects its sampling point pairs by an exhaustive search and, steered by the keypoint orientation, ensures rotation invariance. The ORB-SLAM series makes good use of this feature description. Setiawan et al. compared the three classical algorithms on low-light images: SIFT had the highest rotation-handling accuracy despite being slower than the other two algorithms; ORB and SURF had the best overall ability, but SURF was clearly inferior to the other two in the presence of noise [69].
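For reference, the sketch below extracts and matches ORB features with OpenCV; the image file names are hypothetical. Because ORB descriptors are binary strings, they are compared with the Hamming distance.

```python
# Minimal sketch: ORB keypoint extraction and brute-force Hamming matching between two frames.
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # hypothetical adjacent frames
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary BRIEF-style descriptors are compared with the Hamming distance; cross-checking
# keeps only mutually consistent matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} cross-checked matches, best distance {matches[0].distance}")
```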

As the first V-SLAM system, mono-SLAM [39] used very sparse feature points at the front end and an EKF at the back end. Besides, the multistate constraint Kalman filter (MSCKF) was used to update the mean and covariance of the camera state and landmarks treated as state quantities. However, this maximum-likelihood approach has no loop closure detection, so the longer it runs, the larger the error becomes. Moreover, the sparse feature points of mono-SLAM are easily lost. Subsequently, Klein and Murray proposed window optimization and the covisibility graph to realize large-scale monocular SLAM in PTAM [40], the first V-SLAM to introduce the concept of key frames. The front end and back end of PTAM run in parallel to improve real-time performance, but the tracked target information is easily lost. To solve this problem, ORB-SLAM, built on the ORB feature, was proposed in 2015 and has become the most widely used system so far. The ORB feature uses the compact binary string representation of BRIEF and inherits the advantages of both FAST and BRIEF; it not only saves storage space but also greatly reduces matching time. V-SLAM systems based on ORB even provide short-, medium-, and long-term data association in VI-ORB-SLAM [70] and ORB-SLAM3 [49], and ORB-SLAM3 achieves almost zero drift with its Atlas system; the framework diagram is shown in Figure 11.

Then, the motion of the camera is estimated from the matched feature points and their descriptors. Camera motion problems can be divided into three categories: 2D-2D, 3D-2D, and 3D-3D, as shown in Figure 12. The typical 2D-2D problem occurs when a monocular camera lacks additional depth information; the motion between the two camera coordinate systems can be solved from the epipolar geometric constraints, followed by triangulation to estimate the depth of map points. Related formulations are solved by the perspective-three-point method (P3P) [71] and the iterative closest point method (ICP) [72]. Compared with directly solving the pose by least-squares matching, random sample consensus (RANSAC) [73] removes false matches and noise, making the solution more accurate. Perspective-n-point (PnP) methods solve 3D-2D point-pair motion; variants include UPnP [74], EPnP [75], the direct linear transform (DLT) [76], and nonlinear optimization. Generally, the 3D-3D problem solves the camera pose via singular value decomposition (SVD) [77] on matched feature points, but using depth information on both sides lowers the accuracy, so this formulation is generally avoided with ordinary cameras.
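The 2D-2D case can be illustrated with OpenCV on synthetic correspondences: the essential matrix is estimated with RANSAC and decomposed into a rotation and a unit-scale translation. The intrinsics and true motion below are assumed values.

```python
# Minimal sketch: 2D-2D relative motion from the essential matrix, the typical monocular step.
import cv2
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])       # assumed camera matrix
rng = np.random.default_rng(3)
pts3d = rng.uniform([-2, -2, 4], [2, 2, 8], (50, 3))               # synthetic landmarks

R_true, _ = cv2.Rodrigues(np.array([0.0, 0.1, 0.0]))               # true relative rotation
t_true = np.array([0.3, 0.0, 0.05])                                # true relative translation

def project(P, R, t):
    cam = P @ R.T + t
    uv = cam @ K.T
    return (uv[:, :2] / uv[:, 2:3]).astype(np.float32)

pts1 = project(pts3d, np.eye(3), np.zeros(3))                      # view 1 (identity pose)
pts2 = project(pts3d, R_true, t_true)                              # view 2

E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
# For the 3D-2D case, cv2.solvePnPRansac(pts3d, pts2, K, None) would recover the pose
# from known landmarks instead.
print("recovered R:\n", R)
print("recovered t (unit scale, monocular):", t.ravel())           # translation only up to scale
```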

V-SLAM based on feature-points estimates camera motion according to feature-points after feature matching and optimizes reprojection error, so it is not sensitive to illumination changes. So far, it is the most mature and popular scheme. However, V-SLAM based on feature-points takes a long time in the extraction of key points and descriptors, feature matching, and other aspects. Meanwhile, since V-SLAM based on feature points can only build sparse maps, the traditional feature point extraction methods cannot meet the requirements for some specific scenes.

4.3. The Hybrid Semidirect Method

Forster et al. combined the advantages of the feature-points method and the direct method and proposed semidirect visual odometry (SVO), applied to UAV aerial photography, in 2014 [42].

SVO is organized into motion estimation and mapping. Firstly, a rough camera pose estimate is obtained by aligning sparse feature patches between two frames under the direct method's assumption of gray-level invariance. Secondly, in depth estimation, the Gauss–Newton method is used to optimize the predicted positions of the feature patches, which not only saves computation but also ensures that the pixel gradients are distinct. Finally, the camera pose is optimized using BA.

Unlike the traditional feature-points method, SVO relies on feature points only when selecting key frames and does not require the computation and matching of descriptors, which saves a significant amount of time. Unlike the traditional direct method, SVO does not use the information of all pixels in each frame but takes small image patches to estimate the camera motion. This algorithm not only improves real-time performance at a given accuracy but also uses three optimization steps to ensure the robustness of the results.

Compared with other open-source systems such as ORB-SLAM, SVO has lightweight threads and a simple framework that can run on low-end platforms, but it does not perform well with forward-looking cameras and cannot easily relocalize if pose estimation is lost. In 2016, Forster et al. improved SVO into SVO 2.0 [78]. It not only supports perspective, fisheye, and binocular cameras but also adds edge tracking and takes the IMU rotation prior into account. In addition, SVO 2.0 converges faster between frames and provides theoretical guidance for combining V-SLAM with an IMU; the flowchart is shown in Figure 13.

On the basis of DSO and ORB-SLAM, Seong Hun Lee and Javier Civera used the direct method locally to track the camera pose quickly and robustly on locally accurate, short-term, semidense maps [45]. At the global scale, feature-based approaches are used to optimize key-frame poses, perform loop closure, and build reusable, globally consistent, long-term, sparse feature maps. This loose coupling between the direct method and the feature-points method makes up for the shortcomings of each.

To overcome these shortcomings, a semidirect method that exploits the raw photometric information of a selected pixel set in the image is also under development [79], as shown in Figure 14.

4.4. Loop Closure Detection

As an important part of V-SLAM, loop closure detection plays a key role in establishing a globally consistent map. Its core is to detect the same scene in nonadjacent frames and then add constraints to eliminate the cumulative error. Generally, there are two kinds of methods: (a) V-SLAM based on visual odometry, which uses the estimated distance traveled to decide whether a loop closure has occurred but cannot give accurate results when the cumulative error is large, as shown in Figure 15, and (b) appearance-based V-SLAM, which relies solely on the similarity between the current frame and historical frames and includes traditional methods and deep-learning methods.

4.4.1. Traditional Method

Most traditional loop closure detection methods are realized with the bag-of-words (BoW) model [80]. The BoW model takes image features as words and the whole image as a bag of words. The dictionary is established by clustering the features of a training image data set. By compressing the image information, this method greatly improves retrieval speed and accuracy.
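The sketch below is a toy bag-of-visual-words pipeline, assuming a few hypothetical key-frame images: ORB descriptors are clustered into a small vocabulary with k-means, each image becomes a normalized word histogram, and loop candidates are scored by cosine similarity. Production systems use dedicated libraries such as DBoW2/FBoW with binary-aware clustering and inverse indices.

```python
# Minimal sketch: build a small visual vocabulary and score image similarity with it.
import cv2
import numpy as np

def orb_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = cv2.ORB_create(500).detectAndCompute(img, None)
    return des.astype(np.float32)

paths = ["kf_000.png", "kf_001.png", "kf_002.png"]         # hypothetical keyframes
all_des = np.vstack([orb_descriptors(p) for p in paths])

k = 64                                                      # vocabulary size (assumed)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, _, vocab = cv2.kmeans(all_des, k, None, criteria, 3, cv2.KMEANS_PP_CENTERS)

def bow_vector(des):
    """Assign each descriptor to its nearest word and return a normalized histogram."""
    words = np.argmin(np.linalg.norm(des[:, None] - vocab[None], axis=2), axis=1)
    hist = np.bincount(words, minlength=k).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-9)

vectors = [bow_vector(orb_descriptors(p)) for p in paths]
score = float(vectors[0] @ vectors[-1])                     # cosine similarity, 1.0 = identical scene
print("loop-closure score between first and last keyframe:", score)
```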

In 2006, Nister and Stewenius proposed a tree-structured storage method based on BoW, which greatly improved retrieval efficiency [81]. In 2008, Angeli et al. extended the BoW method from image classification to an incremental setting and used Bayesian filtering to estimate the loop closure probability, enabling real-time loop closure detection in environments with strong perceptual aliasing [82]. However, this method has good pose invariance but not condition invariance. In the same year, Cummins et al. proposed FAB-MAP [83]; however, the image features used to construct the BoW and GIST descriptors are manually trained, and the success rate of loop closure detection is not high under complex illumination. In 2012, Smith et al. proposed a discrete binary vocabulary based on BoW and extended it with a direct index [84], as shown in Figure 16. This method was the first to use a binary vocabulary for loop closure detection; it not only finds loop closures quickly but also detects them in dynamic environments. Over the last decade, this BoW algorithm has been open-sourced and updated to its fourth-generation version, FBoW, which is highly optimized with AVX, SSE, and MMX instructions to speed up dictionary creation.

Moreover, with traditional feature extraction operators such as SIFT, SURF, and ORB, the BoW method has achieved good results on the loop closure detection problem. However, BoW-based loop closure detection has at least two problems: (a) perceptual confusion: the method stores descriptors without order and does not consider the spatial distribution of local features in the scene image, so highly similar scenes can be mistaken for the same scene even at different geographic locations, causing loop closure detection to fail, and (b) manual intervention: a manually built offline dictionary has limitations; it does not make full use of the deep information in the image and can lead to tracking failure over large areas, long durations, and complex environments.

4.4.2. Based on Deep Learning

Loop closure detection based on deep learning mainly uses neural networks to extract image features and makes full use of the deep feature information of the image to obtain higher accuracy. The current main framework is shown in Figure 17.

In 2015, Gomez-Ojeda et al. first retrained a neural network on a place recognition database [85]. This method improved image retrieval accuracy and optimized the effect of loop closure detection in scenes with appearance changes. In the same year, Lowry et al. used the ImageNet database to pretrain a Caffe-based AlexNet [86], which significantly improved robustness against partial occlusion in the scene. Chen et al. first used a convolutional neural network (CNN) to learn image features [87], improving matching accuracy and robustness. Gao and Zhang first used an autoencoder to extract image features and detected the similarity between images through a similarity matrix [88]; this method achieved good results on public data sets. In 2017, Xia proposed a loop closure detection algorithm based on PCANet [26], which used the output of an intermediate CNN layer as the image descriptor and achieved better results in practical performance. In 2019, Khaliq et al. proposed a lightweight method using the VGG16 network to extract image features, which effectively improved efficiency [89]. In 2021, Zhong and Fang adopted BigBiGAN, which improved recall by 20% and reduced time loss by 14% compared with ORB-SLAM2 [90]. In 2022, Gao et al. proposed AirLoop, which used unsupervised learning to minimize forgetting when incrementally training loop closure detection models and showed greatly improved robustness over ORB-SLAM2 on a data set [91].
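A common starting point of these learning-based methods is to use a pretrained CNN as a global image descriptor and compare frames by cosine similarity, as in the sketch below. The image files are hypothetical, and the choice of ResNet-18 with torchvision weights is an assumption for illustration, not the network used in the cited works.

```python
# Minimal sketch: CNN global descriptors for loop-closure candidate scoring.
import torch
import torchvision.transforms as T
from torchvision.models import resnet18
from PIL import Image

model = resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()            # drop the classifier, keep the 512-d feature
model.eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def describe(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return torch.nn.functional.normalize(model(x), dim=1)

d1, d2 = describe("query.png"), describe("candidate.png")   # hypothetical frames
print("loop-closure similarity:", float((d1 * d2).sum()))   # cosine similarity of unit vectors
```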

4.5. Summary Analysis of Existing Methods

The feature-points method calculates the camera pose and map point positions by minimizing the reprojection error, whose offset errors come from the camera intrinsic parameters, the shutter, and so on. The direct method instead minimizes the photometric error, whose offset errors come from blur and noise. Therefore, the direct method can work as long as the selected key points have intensity gradients. In addition, according to research from BMW and the Technical University of Munich, the sparse direct method can run very fast, so it is well suited to platforms with limited resources. However, the direct method also suffers from the weakness of the gray-level invariance assumption and from nonconvexity. Therefore, the direct method is generally applied to scenes with large light-dark variation and small motion amplitude.

As requirements on real-time performance and robustness become stricter, the feature-points method is considered the mainstream approach to visual odometry because of its stability and its insensitivity to dynamic objects and illumination. However, on the one hand, the traditional feature-points method has difficulty matching feature points in weakly textured areas such as the sky and white walls. On the other hand, it uses only the information at feature points, so the utilization of the image is not high.

Compared with the traditional loop closure detection algorithm, the method based on deep learning uses a deep neural network to extract image features, expresses image information more fully, and has stronger robustness to environmental changes such as illumination and season. However, how to choose the appropriate hidden layer to represent image features, how to design the neural network architecture to get away from manual intervention, and how to use task-oriented large data sets to optimize the network parameters for transfer learning are still important issues for future research. In addition, for both traditional and deep-learning-based methods, the amount of data will become larger and larger with the increase of the scene size, so eliminating redundant data and erroneous data is obviously a top priority. Therefore, future loop closure detection should focus on the compressed sensing system, dividing the region, speeding up the spatial search, and eliminating noncritical information.

Table 3 shows some of the top institutions for academic research on V-SLAM.

5. Development Trends and Active Research Areas

The real-time performance, robustness, and accuracy of positioning and mapping have always been the pursuit of researchers. At present, products are still at the stage of further research, development, and application-scenario expansion. Around these three problems, there are many new research directions, such as multifeature V-SLAM, multirobot cooperative V-SLAM, multisensor fusion V-SLAM, semantic V-SLAM, and event-based V-SLAM.

5.1. The Multifeature V-SLAM

Earlier in this paper, we mentioned that point features, as the most popular and commonly used features, are very fast to store and match. However, point features are easily lost in the absence of texture. Thus, more advanced geometric features (such as lines, edges, and planes) can be utilized and integrated into V-SLAM, as shown in Figure 18.

A line feature is composed of multiple points and is naturally invariant to illumination and viewing angle, which overcomes a shortcoming of the point feature. However, line features are not robust in degenerate configurations: localization degrades mainly because the constraints are reduced, while adding extra constraints increases the computational burden. Among the many studies on multifeature V-SLAM, He et al. integrated line features into a V-SLAM system in a tightly coupled way, but they did not consider the parallel constraints between structural lines [92]. In [93], a structural constraint is defined and the mapping accuracy is improved by creating parallel lines, but structural and nonstructural lines are not distinguished. Xu et al. used different line constraints to improve the accuracy and robustness of pose estimation and mapping in complex environments [94]. In addition, Han et al. used a bag-of-words visual vocabulary to extract point and line features and used information entropy to combine the point-based and line-based similarity scores, further improving the accuracy of the similarity assessment between two images [95]. However, these methods still do not systematically consider structural information such as the parallelism, orthogonality, and coplanarity of lines, so their stability in complex environments is poor. Moreover, pose estimation based on line features is not yet as reliable as pose estimation based on feature points in complex environments.
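As a simple illustration of line features, the sketch below extracts line segments with Canny edges and the probabilistic Hough transform; dedicated detectors such as LSD or EDLines are normally used in point-line V-SLAM, and the image file is hypothetical.

```python
# Minimal sketch: line-segment extraction as a stand-in for the line features used in point-line V-SLAM.
import cv2
import numpy as np

img = cv2.imread("corridor.png", cv2.IMREAD_GRAYSCALE)    # hypothetical low-texture scene
edges = cv2.Canny(img, 50, 150)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=60,
                        minLineLength=40, maxLineGap=5)

# Each segment gives two endpoints; direction vectors can later be grouped into parallel
# or orthogonal sets to exploit structural (e.g., Manhattan-world) constraints.
for x1, y1, x2, y2 in lines[:5, 0]:
    d = np.array([x2 - x1, y2 - y1], dtype=float)
    print("segment", (x1, y1), (x2, y2), "direction", d / np.linalg.norm(d))
```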

Edge features provide more information about the environment, including points, lines, and arbitrary curves. At present, the mainstream methods can be divided into two types. One uses edges as auxiliary features for minimizing the photometric error [96–98]; even though this enhances robustness in texture-poor scenes, it is not stable in complex environments (e.g., under changing illumination). The other uses edges as the main feature; several such methods have been proposed [99–101], but they all suffer from failures of edge extraction and from edge-feature redundancy.

Furthermore, most planar-feature-based methods adopt the Manhattan world assumption [102] that all normal vectors are distributed along three mutually perpendicular axes. In complex outdoor environments, the benefit of planar feature enhancement is not obvious [103, 104].

Although more advanced geometric features carry rich pixel information, they still need to be extracted, described, and matched in theory. Facing complex environments, the current algorithms are not yet mature, so finding appropriate data structures, efficient preprocessing methods, and stable, fast matching algorithms has become the most important problem.

5.2. The Multirobot Cooperation V-SLAM

Multirobot collaborative V-SLAM refers to a team of several robots operating in the environment at the same time to construct and locate maps through collaboration. It not only improves the real time, accuracy, and robustness of map construction but also plays an important role in dealing with a complex environment. In it, communication and map fusion are the key problems to be solved.

Communication can be divided into centralized, decentralized, and distributed modes, as shown in Figure 19. The centralized mode is processed by a single control center, so its coordination is good, but its real-time performance, adaptability, and flexibility are poor. The work in [105] uses a common Wi-Fi network, but communication congestion and delay between the master and slaves can greatly affect the practical effect. The decentralized mode is jointly handled by separate controllers, but these controllers are in a parallel relationship, so it is difficult to diagnose structural problems and make effective adjustments. The distributed mode lies in between: even if a local error occurs in the network, it will not affect the whole communication process, so it offers high real-time performance, adaptability, and robustness; however, its architectural design, deployment, and management become difficult and complex. The map fusion problem of multirobot cooperative V-SLAM is an extension of data association in single-robot map creation: it is necessary to know the relative pose of one robot in the local map of another robot and to maintain real-time communication. Lajoie et al. [106] performed distributed pose graph optimization to retrieve the robots' trajectory estimates. Bhutta et al. [107] adopted a multiagent method for large-scale mapping. Zhu et al. [108] used covariance intersection (CI), which allows each robot to estimate only its own state and self-covariance while compensating for unknown correlations between robots.
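Covariance intersection, as used in [108]-style distributed estimation, fuses two estimates whose cross-correlation is unknown by a convex combination of their information matrices; the sketch below picks the weight that minimizes the trace of the fused covariance, using assumed values.

```python
# Minimal sketch: covariance intersection of two estimates with unknown cross-correlation.
import numpy as np

def covariance_intersection(x1, P1, x2, P2, steps=50):
    best = None
    for w in np.linspace(1e-3, 1 - 1e-3, steps):
        info = w * np.linalg.inv(P1) + (1 - w) * np.linalg.inv(P2)   # fused information matrix
        P = np.linalg.inv(info)
        x = P @ (w * np.linalg.inv(P1) @ x1 + (1 - w) * np.linalg.inv(P2) @ x2)
        if best is None or np.trace(P) < best[2]:
            best = (x, P, np.trace(P))                               # keep the minimum-trace weight
    return best[0], best[1]

# Two robots' estimates of the same 2D landmark, with different uncertainties (assumed values).
x1, P1 = np.array([2.0, 1.0]), np.diag([0.5, 0.1])
x2, P2 = np.array([2.2, 0.9]), np.diag([0.1, 0.4])
x_fused, P_fused = covariance_intersection(x1, P1, x2, P2)
print("fused estimate:", x_fused, "\nfused covariance:\n", P_fused)
```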

It should be mentioned that most of the long-term drift caused by accumulated error in multirobot collaborative V-SLAM is compensated by additional sensors, which increases the amount of information to be processed, as in [109]. In addition, there are significant problems with the real-time performance of robot data sharing in the presence of noisy points and limited bandwidth.

5.3. The Multisensor Fusion V-SLAM

A single sensor easily loses track of features in complex scenes with strong light, intense motion, or little texture, which leads to the failure of map construction. Multisensor fusion can solve this problem well, as shown in Figure 20.

However, multisensor fusion raises the following problems: firstly, how to synchronize the sensor data and how to calibrate the extrinsic parameter relations; secondly, the large amount of information brought by multiple sensors contains a certain redundancy. Most current research fuses an inertial measurement unit (IMU) with a camera sensor [110]. The IMU can ensure better pose estimation under fast camera motion, and the camera compensates for IMU drift. However, the IMU data rate is very high, so the amount of computation is also very large. Recently, the authors of [111] added the optimization of laser plane parameters on the basis of LIC-Fusion [112] to improve the accuracy of positioning and mapping, but it has only been used in small indoor areas. At present, multisensor fusion V-SLAM is still at an initial stage, and a relatively unified data-fusion theory and effective fusion models have not yet been established. Moreover, there is no good solution to the problems of fault tolerance and robustness in data-fusion systems.
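The role of the IMU between camera frames can be illustrated by simple dead reckoning: high-rate accelerations are integrated to predict the position at the next image, which visual measurements then correct. The sketch below uses synthetic IMU samples and ignores gravity, bias, and orientation for brevity.

```python
# Minimal sketch: integrating high-rate IMU accelerations between two camera frames.
import numpy as np

imu_rate, cam_rate = 200.0, 20.0                 # assumed sensor rates (Hz)
dt = 1.0 / imu_rate
rng = np.random.default_rng(4)

p = np.zeros(3)                                  # position at the last camera frame
v = np.array([0.5, 0.0, 0.0])                    # velocity at the last camera frame
accels = rng.normal([0.1, 0.0, 0.0], 0.02, (int(imu_rate / cam_rate), 3))  # synthetic IMU samples

for a in accels:                                 # integrate every IMU sample in the interval
    p = p + v * dt + 0.5 * a * dt * dt
    v = v + a * dt

print("predicted position at the next camera frame:", p)
```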

5.4. The Semantic V-SLAM

Semantic V-SLAM fuses traditional V-SLAM with neural-network technologies such as semantic segmentation, object detection, and instance segmentation that extract object-level information labels, as shown in Figure 21.

Generally speaking, there are two approaches to semantic V-SLAM. One combines semantic information with localization: with the help of V-SLAM, the positional constraints between objects can be computed, and consistency constraints can be applied to the recognition results of the same object seen from different angles and at different times. A large amount of new data can thus be generated to provide more optimization conditions, improving the precision of semantic understanding and saving the cost of manual labeling. The other combines semantic information with map building: data association is upgraded from the traditional pixel level to the labeled object level, which provides high-level maps and improves localization accuracy.

The development of semantic SLAM is relatively recent; Alexey Dosovitskiy et al. [114] introduced a CNN to the field. In 2015, Vineet et al. [115] from Stanford first realized a system capable of simultaneous map construction and semantic segmentation, demonstrating that semantic SLAM can be pushed into practice. In fact, semantic or object-level SLAM papers have been published in recent years, but almost all of them are limited by the computational complexity and localization accuracy of V-SLAM itself. Moreover, most of these works are carried out in controllable scenarios and need to build a data knowledge base to store prior knowledge. The authors of [116] used Mask R-CNN for semantic segmentation and adopted multiview geometry. Although this approach enabled the recognition of moving objects, as long as a single dynamic point lies on an object, the whole object is considered dynamic. In [117], Mask R-CNN was also used for the semantic segmentation of images; on the basis of DS-SLAM, photometric, reprojection, and depth errors were used jointly to assign a robust weight to each key point. As a result, a semantic label was assigned to each pixel, but real-time performance still could not be achieved even with GPU acceleration. In [118], fast plane extraction was added to semantic detection, and then a graph relating the camera pose and the planes was constructed; however, the parameter estimation accuracy was still poor.
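A typical building block of such dynamic-scene semantic V-SLAM is to mask feature points that fall on segmented movable objects before pose estimation, as sketched below with torchvision's Mask R-CNN and COCO "person" masks; the frame file and thresholds are assumptions, not the exact pipeline of the cited systems.

```python
# Minimal sketch: filter out ORB keypoints lying on segmented dynamic objects.
import cv2
import numpy as np
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

img = cv2.imread("frame.png")                                  # hypothetical BGR frame
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

with torch.no_grad():
    x = torch.from_numpy(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)).permute(2, 0, 1).float() / 255
    out = model([x])[0]

# Union of confident "person" masks (COCO label 1) marks the dynamic region.
dynamic = np.zeros(img.shape[:2], dtype=bool)
for label, score, mask in zip(out["labels"], out["scores"], out["masks"]):
    if label.item() == 1 and score.item() > 0.7:
        dynamic |= mask[0].numpy() > 0.5

kps = cv2.ORB_create(1000).detect(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), None)
static_kps = [k for k in kps if not dynamic[int(k.pt[1]), int(k.pt[0])]]
print(f"kept {len(static_kps)} of {len(kps)} keypoints outside dynamic objects")
```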

It can be seen that semantic V-SLAM is still in the preliminary stage of research, and most semantic segmentation is performed on dense maps at present. On the sparse map, wrong data association will bring serious errors, and there is no good solution at present.

5.5. The Event-Based V-SLAM

Unlike traditional cameras, the event camera outputs "events" by detecting changes in light intensity, so the output signal stream has the advantages of low latency (millisecond level), high pixel bandwidth (kHz level), high dynamic range (140 dB), and ultralow power consumption (20 mW vs. 1.5 W for standard cameras), as shown in Figure 22.
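Many event-based front ends start by accumulating the asynchronous event stream (x, y, timestamp, polarity) over a short window into an "event frame" that downstream V-SLAM modules can process; the sketch below does this on a synthetic stream with an assumed sensor resolution.

```python
# Minimal sketch: accumulate a synthetic event stream into a signed event frame.
import numpy as np

H, W = 180, 240                                       # assumed sensor resolution
rng = np.random.default_rng(5)
n = 5000
events = np.stack([rng.integers(0, W, n),             # x
                   rng.integers(0, H, n),             # y
                   np.sort(rng.uniform(0, 0.05, n)),  # timestamp in seconds
                   rng.choice([-1, 1], n)], axis=1)   # polarity

def event_frame(ev, t0, t1):
    """Sum event polarities per pixel over the time window [t0, t1)."""
    frame = np.zeros((H, W), dtype=np.float32)
    window = ev[(ev[:, 2] >= t0) & (ev[:, 2] < t1)]
    for x, y, _, pol in window:
        frame[int(y), int(x)] += pol                   # signed accumulation of brightness changes
    return frame

frame = event_frame(events, 0.0, 0.01)                 # 10 ms window
print("accumulated events in window:", int(np.abs(frame).sum()))
```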

At present, although many algorithms for localization and mapping based on event cameras have been proposed, they still have problems. A survey of event-based cameras over the past decade is given in [119]. For example, the algorithm in [120] requires estimating the image intensity and adjusting the depth, so its computational cost is very high. The algorithm in [121] relies on traditional cameras for depth estimation, wasting the low-latency and high-dynamic-range characteristics of event cameras. The algorithm in [122] estimates the event-camera pose by generating high-resolution dense environment maps, so it takes a long time to converge.

Most event-based V-SLAM methods currently require more high-quality data sets and collaboration with other sensors. Furthermore, the event sensor also has the disadvantages of low spatial resolution, low signal-to-noise ratio, and high price. From a general research point of view, completely new hardware is needed to match the operating characteristics of the event camera. Besides, other problems of event-based V-SLAM, such as recognition, behavior understanding, and cumulative error, also need to be solved. In the future, using deep-learning methods to process event data is a primary research direction.

5.6. Summary Analysis of Research Directions

V-SLAM still has many problems to be solved. For example, information is more easily lost when operating in outdoor environments, especially in scenes with uniform color and little texture. Moreover, structured light cannot be calibrated well, resulting in poor vision, which lacks the universality of the stereo camera. Secondly, if the front-end data are too sparse, the drift of the controlled object cannot be corrected, whereas if dense data are generated at the front end, the back-end optimization becomes heavy, resulting in a mismatch between map construction and control speed. Moreover, heavy data processing requires sophisticated cameras and high-performance processors, which makes such systems difficult to popularize. Finally, the global consistency of the map built by multiple robots cannot be guaranteed at the lowest cost.

We believe that the future is the era of multitechnology convergence. For example, the authors of [123] combine geometric features and machine-learning methods: a two-layer convolutional long short-term memory (LSTM) module is used to model pose estimation, a traditional geometric method is used to supervise key modules in a self-supervised way, and a staged training strategy is proposed, which guarantees a certain accuracy even without loop closure detection. Yun Chang et al. proposed a novel incremental maximum clique outlier rejection protocol based on semantic information, which is powerful and efficient [124]. Introducing artificial intelligence technologies (such as deep learning, machine learning, and computer vision) into V-SLAM is the future trend.

6. Conclusion

Starting from the origin of SLAM and its wide range of applications, this paper has shown that V-SLAM is still at the laboratory research stage, is mainly applied indoors, and has no reliable commercial product yet. However, it is worth noting that V-SLAM offers low cost and rich semantics. In addition to the problems mentioned above regarding V-SLAM and its camera characteristics, V-SLAM fails to track in complex environments such as those with challenging illumination, occlusion, dynamic objects, and weak texture. Moreover, when the camera moves too fast, the resulting motion blur also causes V-SLAM to fail. More importantly, in practice we should consider not only the accuracy, robustness, and efficiency of the algorithm but also the hardware cost. In future V-SLAM research, what kind of sensor to use to obtain what kind of data, what kind of decisions to make when processing these data, and how to achieve high performance at light weight and low cost will be the primary research directions. According to [24], the current algorithms are in an era of refinement, and we believe that more and better frameworks will appear in the future, which requires the unremitting efforts of researchers.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Yong Dai obtained the funding and supervised the technical research and direction. Yong Dai and Jiaxin Wu carried out the concepts and the method. Jiaxin Wu designed the experiment, analyzed the data, and wrote the original manuscript. Yong Dai and Duo Wang reviewed and revised the paper.

Acknowledgments

This work was supported by the 2021 High-Level Talents Research Support Program of Shenyang Ligong University (No. 1010147001017), the project funded by the China Postdoctoral Science Foundation, and the Scientific Research Fund of Liaoning Provincial Education Department (grant no.: LJKMZ20220616).