Abstract

Mobile robots are widely used in medicine, agriculture, household services, and industry. Simultaneous localization and mapping (SLAM) is the working basis of mobile robots, so research on SLAM technology is both necessary and meaningful. SLAM draws on robot kinematics, logic, mathematics, perceptual detection, and other fields, and the difficulty of organizing this technical content has led to diverse SLAM frameworks. Among them, visual SLAM (V-SLAM) has become a key academic research topic due to its advantages of low price, easy installation, and simple algorithmic model. Firstly, we illustrate the advantages of V-SLAM by comparing it with other localization techniques. Secondly, we survey open-source V-SLAM algorithms and compare their real-time performance, robustness, and innovation. Then, we analyze the frameworks, mathematical models, and related basic theoretical knowledge of V-SLAM. Meanwhile, we review related works from four aspects: visual odometry, back-end optimization, loop closure detection, and mapping. Finally, we discuss future development trends and lay a foundation for researchers to extend this work. In summary, this paper classifies each module of V-SLAM in detail, aims to provide good readability, and offers a comprehensive, up-to-date review of V-SLAM.

1. Introduction

Tracing back to modern positioning technology, the earliest system widely used worldwide is the global positioning system (GPS) [1], proposed by the U.S. Department of Defense in 1964. Its military accuracy can reach 1–2 meters, and its civilian accuracy is 5–10 meters. Obviously, this kind of precision is not enough for many application devices. A higher-precision positioning technology was then proposed, namely a costly combination of GPS, real-time kinematic (RTK) positioning, and an inertial measurement unit (IMU) [2], which can achieve an accuracy of 1–3 centimeters. Although this integrated inertial navigation technology basically solves the outdoor problem, it is not usable indoors, because the RTK signal cannot be received under occlusion. In 2004, an indoor ultra-wideband (UWB) [3] wireless positioning technology was proposed to solve the indoor positioning problem. However, UWB technology requires modification of the working environment of mobile machines, such as the installation of base stations and transmitters, and is therefore not suitable for unfamiliar, unmodified scenes. To solve the above problems, an independent, indoor-and-outdoor compatible, and low-cost integrated solution that localizes by real-time mapping without modifying the working environment has been proposed in recent years, namely, simultaneous localization and mapping (SLAM). The emergence of SLAM technology brings broad application scenarios and great significance for robot positioning.

Here is a look at the origins and applications of SLAM. It originated in 1986 [4], when statistical probability was first being applied to the field of artificial intelligence. Smith and Cheeseman discussed the problem of continuous map building at the IEEE Robotics and Automation Conference, concluded that consistency checking is important, and found the correlation law of landmarks through a statistical framework. SLAM then developed alongside statistical image processing. On the basis of [5–7], Smith et al. proposed synchronous positioning and real-time monitoring, using a state vector to store the joint robot pose and landmark coordinates [8]. Through the integration of the above techniques, the abbreviation SLAM appeared for the first time at an academic forum [9] in 1995. Since then, there has been a relatively systematic framework for SLAM, and this conference opened the prelude to the SLAM era. In recent years, SLAM technology has received wide attention and application, for example in augmented reality (AR), where Google and Apple launched ARCore and ARKit, respectively. In the field of mobile robots and aerial vehicles, SLAM can assist these intelligent agents in positioning, path planning, autonomous exploration, obstacle avoidance, and navigation. Similarly, SLAM technology is widely used in underwater operations, industrial robots, medical treatment, unmanned driving, and other fields [10–13], as shown in Figure 1.

According to the perceptual module, common SLAM systems generally take two forms: LIDAR-based SLAM (light detection and ranging, LIDAR) and visual SLAM (V-SLAM). When a LIDAR sensor is used, the system is called LIDAR-based SLAM; similarly, when a camera sensor is used, it is called V-SLAM. In LIDAR-based SLAM, LIDAR can collect point clouds containing direct geometric relations. Representative methods include real-time 3D laser SLAM (LOAM) [14], real-time loop closure in 2D LIDAR SLAM (Cartographer) [15], the tightly coupled LIDAR-visual-inertial SLAM system (LVI-SLAM) [16], and so on. According to surveys of LIDAR-based SLAM applications over the past decade, enterprises have mainly applied it to autonomous driving vehicles and automated guided vehicle (AGV) robots. However, these applications face production costs that are too high for mass deployment; the prices of some well-known LIDAR manufacturers are shown in Figure 2. Besides, the current bottleneck is that the technology cannot be generalized. Although some low-cost LIDAR sensors exist, they are only suitable for small-scale detection scenarios, mostly indoor applications such as the sweeping robot [17], and not for outdoor applications. With the development of computer vision, V-SLAM is undoubtedly the best choice to provide customers with better service due to its advantages of low cost, simple structure, diversified installation, universal indoor and outdoor usage, and rich semantic information.

Next, we evaluate existing review papers on V-SLAM from an objective perspective. In 2015, Fuentes-Pacheco et al. first reviewed the main concepts and basic algorithms of V-SLAM [18]. Although the covered methods appear dated from today's perspective, the review was unmatched for its time. In 2016, Gui et al. discussed filter-based and optimization-based V-SLAM [19]; nevertheless, they only investigated the main concepts of visual-inertial SLAM. In 2017, Taketomi et al. reviewed V-SLAM technology from 2010 to 2016, but they focused only on comparing earlier V-SLAM systems [20] and did not expand on the V-SLAM problem in detail. In 2018, Li et al. analyzed the opportunities and challenges of V-SLAM and predicted its future development prospects, but they focused too heavily on deep learning [21]. In 2022, Chen et al. collated a large body of traditional and semantic V-SLAM literature and summarized classical frameworks that employ direct/indirect approaches [22]. Despite providing a comprehensive exposition, their classification of methods was limited to the types of features used in feature-based V-SLAM. In another work, Macario Barros et al. provided a timeline of V-SLAM techniques over the years [23]. The visualization was easy to understand, but the rationale was generalized rather than explored in depth.

Table 1 classifies the classic review papers to date. It can be seen that most of them focus on a particular topic or trend. In particular, none of the existing V-SLAM reviews discuss loop closure detection in detail. Compared with other reviews, this paper provides a more comprehensive review of V-SLAM. The main contributions of this paper are summarized as follows:
(a) We discuss the basic theoretical knowledge and comprehensively review the advantages of V-SLAM. Recent research in the field of V-SLAM is classified into the front end, back end, loop closure detection, and mapping.
(b) We modularly classify the open-source V-SLAM algorithms by seven indicators and provide convenient references for readers.
(c) We analyze the advantages and disadvantages of the current mainstream V-SLAM systems by studying different methods from different aspects. In particular, we describe visual odometry and loop closure detection in detail.
(d) We explain the V-SLAM algorithms with tables and flow charts, which lays a foundation for researchers.
(e) We divide the V-SLAM research directions into five categories. Meanwhile, we analyze the existing solutions, summarize their shortcomings, look forward to future research directions, and provide researchers with extended research ideas.

The remainder of this paper is organized as follows. The research problem of V-SLAM is described in Section 2. In Section 3, the details of V-SLAM technology are stated. The research status of V-SLAM is reported in Section 4. Then, the development trends and active research areas of V-SLAM are discussed in Section 5. Finally, Section 6 concludes the paper.

2. Research Problem

For V-SLAM, a major research issue is how to use cameras to collect, store, and update environmental information in order to locate and map accurately in an unknown environment. There are two main research directions in V-SLAM. One is V-SLAM based on filtering, which uses all image information for visual fusion. The other is based on feature extraction and optimization theory, which uses key frames for data association. Both methods are described in detail below.

Filter-based V-SLAM deals with data association in a more holistic way: it considers the relationship between the current state and all previous states, which saves a large amount of feature extraction time. However, it has the following inevitable shortcomings. Firstly, it is based on the assumption of invariant gray levels, so it is easily affected by external illumination. Secondly, single pixels are not discriminative, so correlations over pixel blocks must be computed in regions of uniform color. Thirdly, because the image-alignment problem is nonconvex, a good initial value is needed to ensure correct tracking.

V-SLAM based on feature extraction and optimization theory has become the mainstream due to its direct extraction of salient information and its use of nonlinear optimization, as shown in Table 2. However, it also has inevitable shortcomings. Firstly, feature-based V-SLAM still spends a lot of time computing descriptors and matching features, and many extracted features cannot be used properly. Secondly, due to the heavy matrix computation, the efficiency of feature extraction, loop closure detection, and failure recovery needs to be optimized. Thirdly, its robustness still needs to be enhanced.

All in all, the above research issues are the focus of this paper's analysis, and the classification of the relevant technical content is made in the following sections.

3. Formal Description

V-SLAM is generally divided into sensor data collection, front-end visual odometry, back-end optimization, and loop closure detection. The front-end visual odometry estimates the relative motion of the camera from the information in adjacent images and builds a local map. Back-end optimization further optimizes the initial estimates produced by the front-end visual odometry according to statistical inference. Loop closure detection judges whether a previously visited scene has been reached by visually checking the pose information at a certain point in time, as shown in Figure 3.

In order to obtain positioning information, we may assume discrete sampling times t_k, k = 1, ..., K, where Δt is the sampling period. The position information at the corresponding time t_k is x_k, and these discrete points constitute the trajectory of the rigid robot. We can use a general and abstract mathematical model x_k = f(x_{k-1}, u_k, w_k), where f is the motion equation, x_k represents the vector of the robot's position (pose) at time k, x_{k-1} is the position at the last moment, u_k is the input of the motion sensor, and w_k is the noise. In order to obtain map-building information, we may assume that the rigid robot observes a landmark point y_j from position x_k and generates an observation datum z_{k,j}. According to the observation data, the rigid robot can independently adjust its position in an unknown environment. Similarly, we can use the general and abstract mathematical model z_{k,j} = h(y_j, x_k, v_{k,j}), where v_{k,j} is the noise and h is the observation equation. To sum up, V-SLAM estimates the internal state variables (the poses x and the landmarks y) from noisy measurement data for localization and mapping.
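To make the abstract models above concrete, the following sketch (not from the paper; all values are hypothetical) simulates one step of the motion equation x_k = f(x_{k-1}, u_k) + w_k and the observation equation z_{k,j} = h(x_k, y_j) + v_{k,j} for a planar robot observing a single landmark.

```python
# Minimal sketch: a 2D robot propagated by a unicycle motion model and observing the
# range and bearing to one landmark.  All numeric values are assumed for illustration.
import numpy as np

def motion_model(x_prev, u, dt=1.0):
    """f(x_{k-1}, u_k): unicycle model, x = [px, py, yaw], u = [v, omega]."""
    px, py, yaw = x_prev
    v, omega = u
    return np.array([px + v * np.cos(yaw) * dt,
                     py + v * np.sin(yaw) * dt,
                     yaw + omega * dt])

def observation_model(x, landmark):
    """h(x_k, y_j): range and bearing from the robot pose to a 2D landmark."""
    dx, dy = landmark - x[:2]
    return np.array([np.hypot(dx, dy), np.arctan2(dy, dx) - x[2]])

rng = np.random.default_rng(0)
x_true = np.array([0.0, 0.0, 0.0])          # initial pose
landmark = np.array([5.0, 3.0])             # hypothetical landmark position y_j
u = np.array([1.0, 0.1])                    # motion command u_k

x_true = motion_model(x_true, u) + rng.normal(0, 0.02, 3)         # w_k: motion noise
z = observation_model(x_true, landmark) + rng.normal(0, 0.05, 2)  # v_{k,j}: sensor noise
print("noisy observation z =", z)
```

An estimator (filter or optimizer) would then invert these noisy relationships to recover x and y, which is exactly the estimation problem stated above.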

The accuracy of the measurements depends greatly on the type of camera sensor. Camera sensors are classified according to the implementation mode: monocular camera, binocular camera, and RGB-D camera, as shown in Figure 4. In addition, camera sensors include event camera, panorama camera, and other special cameras.

We cannot determine the depth of a spatial point from the imaging model of a monocular camera alone, so the monocular camera generally relies on its own motion, collecting two frames separated by a fixed time interval and using the parallax between them to estimate depth. A binocular camera, in contrast, can estimate the depth of a pixel from the static disparity between the left and right cameras. For both monocular and binocular cameras, parallax itself is difficult to compute, so RGB-D cameras were invented; they actively measure the depth of each pixel using infrared structured light or time-of-flight principles. However, RGB-D cameras have disadvantages in cost, power consumption, and so on. In addition, visual cameras have inherent defects caused by intrinsic-parameter calibration errors: the images they capture are largely determined by the focal length of the lens and are susceptible to exposure, so the data obtained by a camera must be corrected. Furthermore, camera sensors can be divided into fisheye and wide-angle cameras according to the lens type. A wide-angle camera has a short focal length and a wide angle of view, so it can cover a broad scene even in a tight space and is generally used to shoot large scenes. A fisheye camera has an even wider range of viewing angles, but the smaller the scale of an object in the picture, the less clear it becomes, so it is generally used in scenes where fine detail does not matter.
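As an illustration of how a binocular (stereo) camera turns disparity into depth via z = f·b/d, the following sketch uses OpenCV's block-matching stereo. The file names, focal length, and baseline are assumed values, not from the paper.

```python
# Minimal sketch: per-pixel depth from a rectified stereo pair, z = focal * baseline / disparity.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical rectified stereo pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # StereoBM returns fixed-point values

focal_px = 718.0      # assumed focal length in pixels
baseline_m = 0.54     # assumed baseline in meters
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = focal_px * baseline_m / disparity[valid]
print("median depth of valid pixels:", np.median(depth[valid]), "m")
```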

The motion of a rigid body robot in 3D space is actually the corresponding rotation and translation operation, as shown in Figure 5. Therefore, we can estimate the trajectory of the rigid body robot and the spatial structure of the environment through the pixel information between adjacent frames.

There is also a problem that cannot be ignored: the transformation relationship is determined by estimating the motion between two images, and estimation error is inevitable. Therefore, we need back-end optimization to handle the resulting errors. In past decades, filters were almost always used for back-end optimization, as shown in Figure 6. However, the system is generally linearized and its noise approximated by a Gaussian distribution, so higher-order terms are discarded and the covariance matrix of the state must be updated in real time, leading to high time complexity and error. Alternatively, the least-squares method can be used, minimizing the error between the actual measurements z_{k,j} and the values h(y_j, x_k) predicted by the observation model, i.e., min Σ ||z_{k,j} − h(y_j, x_k)||². The spatial position of every feature and every camera pose is adjusted while meeting the real-time requirement of the system. Therefore, nonlinear optimization represented by bundle adjustment (BA) is currently the mainstream back-end optimization method. There is another method based on graph optimization, as shown in Figure 7: by using pose nodes and landmark nodes as optimization variables and edges as error terms, it greatly improves optimization efficiency.
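The following sketch illustrates the nonlinear least-squares idea behind BA on synthetic data: a single camera pose is refined by minimizing the reprojection error with SciPy. The intrinsics and noise levels are assumed, and full BA would also optimize the landmark positions.

```python
# Minimal sketch: motion-only bundle adjustment, minimizing sum ||z_i - h(x, y_i)||^2.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])   # assumed pinhole intrinsics

def project(pose6, pts3d):
    """h(x, y): project world points with pose parameterized as [rotvec(3), t(3)]."""
    R = Rotation.from_rotvec(pose6[:3]).as_matrix()
    cam = pts3d @ R.T + pose6[3:]
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def residuals(pose6, pts3d, obs2d):
    return (project(pose6, pts3d) - obs2d).ravel()

rng = np.random.default_rng(1)
pts3d = rng.uniform([-2, -2, 4], [2, 2, 8], (30, 3))             # synthetic landmarks
pose_true = np.array([0.05, -0.02, 0.01, 0.1, 0.0, 0.2])         # synthetic true pose
obs2d = project(pose_true, pts3d) + rng.normal(0, 0.5, (30, 2))  # noisy pixel observations

result = least_squares(residuals, x0=np.zeros(6), args=(pts3d, obs2d))
print("refined pose:", result.x)                                 # should approach pose_true
```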

Moreover, the error caused by each movement is propagated to the next moment, so drift occurs after a long period of motion. To solve this problem, the rigid robot can determine whether it has returned to a previously visited place based on the similarity between images, which significantly reduces the cumulative error and yields a consistent map. At present, loop closure detection is basically a combination of the bag-of-words model and feature point detection. Although nonlinear optimization or graph optimization is used when updating frames, this architecture is more efficient than the direct filtering approach.

In some specific applications, we need different forms of maps to store environmental information. Generally speaking, map forms can be divided into point cloud map, occupancy grid map, octree map, and grid map, as shown in Figure 8.

4. Research Status

V-SLAM has made major breakthroughs in the last 10 years. The realization methods of V-SLAM include the direct method, the feature-points method (the indirect method), and the hybrid semidirect method. In this section, we discuss the three methods in detail. Then, we introduce the current status of loop closure detection. Finally, we analyze the problems of current V-SLAM technology from a modular perspective and propose future solutions.

4.1. The Direct Method

In recent years, the direct estimation method, which is based on the invariance of pixel gray levels and directly uses the light-and-shade information of the image, has developed rapidly [53–55], as shown in Figure 9. The direct method evolved from optical flow [56]; it estimates the camera pose from the photometric error between pixels, which is formulated as a least-squares problem. This method not only avoids the cost of extracting key points and computing descriptors in the feature-based method but also copes with feature-poor scenes. Generally, the direct method can be divided into three categories according to the number of pixels used: the sparse direct method, the dense direct method, and the semidense direct method.
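A minimal sketch of the direct method's core computation, under the brightness-constancy assumption and a translation-only warp on a synthetic image: the shift between two frames is estimated by Gauss–Newton minimization of the photometric error. Real direct V-SLAM optimizes a full 6-DoF pose and per-pixel depth instead.

```python
# Minimal sketch: estimate a 2D image shift by minimizing the photometric error.
import numpy as np
from scipy.ndimage import shift as warp_shift

ys, xs = np.mgrid[0:60, 0:80]
ref = np.sin(xs / 6.0) + np.cos(ys / 7.0)            # smooth synthetic reference image I1
cur = warp_shift(ref, (1.3, -0.7), mode="nearest")   # current image I2, shifted by the true motion

p = np.zeros(2)                                      # estimated (dy, dx)
for _ in range(20):
    warped = warp_shift(cur, -p, mode="nearest")     # warp I2 back by the current estimate
    r = (warped - ref).ravel()                       # photometric residual
    gy, gx = np.gradient(warped)
    J = np.stack([gy.ravel(), gx.ravel()], axis=1)   # intensity Jacobian w.r.t. the shift
    p -= np.linalg.solve(J.T @ J, J.T @ r)           # Gauss-Newton update
print("estimated shift:", p)                         # should approach (1.3, -0.7)
```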

Early direct V-SLAM was not built on tracking-and-mapping frameworks, and most approaches relied on hand-selected key points. Later, a variational formulation was proposed in [57] to construct dense maps, but its GPU requirements were quite stringent. In [58], a depth filter was proposed to construct semidense maps, which not only reduced the time complexity but also allowed real-time operation on a single CPU. In 2011, Newcombe et al. proposed DTAM [59], the first completely direct approach, with stability and accuracy as its virtues. However, this method aligns whole images against dense depth maps and requires GPU acceleration, so its real-time performance on ordinary hardware is limited. Subsequently, Engel et al. proposed a direct monocular SLAM for large-scale environments (LSD-SLAM) in 2014. LSD-SLAM extended the direct approach, which had previously been applied only to partial processing, to a complete SLAM system. The semidense map generated by LSD-SLAM on a common CPU is composed directly of key frames related by transformations and can accurately detect scale drift. However, for loop closure detection, LSD-SLAM uses the same feature-based pose optimization as ORB-SLAM2 [60], so it does not get rid of the computation of feature points.

DSO (direct sparse odometry) is also a pure direct method and an updated successor to Engel et al.'s LSD-SLAM. DSO is not a complete SLAM system [61]: it lacks loop closure detection and map reuse. DSO improves real-time performance by projecting each point onto all frames and computing its residual in each frame. Moreover, DSO improves accuracy by using a sliding window of several key frames as its back end.

TANDEM is a novel monocular V-SLAM approach that combines classic direct methods with multiview stereo (MVS) 3D reconstruction [62]. TANDEM not only renders dense maps from a global truncated signed distance function (TSDF) model but also uses view aggregation and multilevel depth prediction over the entire key-frame window, which greatly improves accuracy, real-time performance, and stability.

Like all direct methods, these approaches rest on the assumption of pixel gray-level invariance, which is in practice susceptible to environmental lighting, camera exposure, and other factors, so their application scenarios are limited. None of the above algorithms escapes these common disadvantages of the direct method. In addition, the direct method is based on gradient search: if the time between two image acquisitions is too large, the image content may move too far, and the optimization can fall into a local minimum. From this point of view, the direct method is better suited to environments with large tonal differences and high-speed movement of objects.

4.2. The Feature-Points Method (The Indirect Method)

For the feature-points V-SLAM, its basic principle is to estimate the relative motion of the camera by extracting and comparing the feature points of adjacent images. The steps of the feature point method mainly include feature extraction, feature matching, motion estimation, and local optimization, as shown in Figure 10.

Feature points consist of key points and descriptors. Initially, David Lowe proposed the scale-invariant feature transform (SIFT), based on the difference of Gaussians (DoG) scale space, in 1999 and perfected it in 2004 [63]. This was the first algorithm to achieve invariance to rotation, scale, and brightness and to be applied to V-SLAM. Later, the histogram of oriented gradients (HOG) algorithm [64] maintained good invariance by computing histogram statistics over local image regions, so subtle movements can be ignored without influence. HOG is faster than SIFT, but it is far inferior to SIFT in complex environments. Subsequently, in 2006, Bay et al. improved on SIFT and proposed the speeded-up robust features (SURF) algorithm [65], which improved execution efficiency by using the Hessian matrix and a lower-dimensional feature descriptor. However, the dominant orientation in this method depends heavily on the gradient direction of a local pixel region, and subsequent extraction steps also depend on this orientation, so errors can easily be magnified. In 2011, Rublee et al. proposed oriented FAST and rotated BRIEF (ORB) [66], which extended binary robust independent elementary features (BRIEF) [67] into a rotation-aware variant (rBRIEF) and added scale and orientation information to the original FAST detector [68]. The rBRIEF descriptor selects its sampling point pairs by an exhaustive search and, steered by the keypoint orientation, ensures rotation invariance. The ORB-SLAM series makes good use of this feature description. Setiawan et al. compared the three classical algorithms on low-light images: SIFT had the highest rotation-handling accuracy despite being slower than the other two algorithms; ORB and SURF had the best overall ability, but SURF was clearly inferior to the other two in the presence of noise [69].
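For reference, the sketch below extracts and matches ORB features with OpenCV; the image file names are hypothetical. Because ORB descriptors are binary strings, they are compared with the Hamming distance.

```python
# Minimal sketch: ORB keypoint extraction and brute-force Hamming matching between two frames.
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # hypothetical adjacent frames
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary BRIEF-style descriptors are compared with the Hamming distance; cross-checking
# keeps only mutually consistent matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} cross-checked matches, best distance {matches[0].distance}")
```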

As the first V-SLAM system, mono-SLAM [39] used very sparse feature points at the front end and an EKF at the back end. Besides, the multistate constraint Kalman filter (MSCKF) was used to update the mean and covariance of the camera state and landmarks treated as state quantities. However, this maximum-likelihood approach has no loop closure detection, so the longer it runs, the larger the error becomes. Moreover, the sparse feature points of mono-SLAM are easily lost. Subsequently, Klein and Murray proposed window optimization and the covisibility graph to realize large-scale monocular SLAM in PTAM [40], the first V-SLAM to introduce the concept of key frames. The front end and back end of PTAM run in parallel to improve real-time performance, but the tracked target information is easily lost. To solve this problem, ORB-SLAM, built on the ORB feature, was proposed in 2015 and has become the most widely used system so far. The ORB feature uses the compact binary string representation of BRIEF and inherits the advantages of both FAST and BRIEF; it not only saves storage space but also greatly reduces matching time. V-SLAM systems based on ORB even provide short-, medium-, and long-term data association in VI-ORB-SLAM [70] and ORB-SLAM3 [49], and ORB-SLAM3 achieves almost zero drift with its Atlas system; the framework diagram is shown in Figure 11.

Then, the motion of the camera is estimated from the matched feature points and their descriptors. Camera motion problems can be divided into three categories: 2D-2D, 3D-2D, and 3D-3D, as shown in Figure 12. The typical 2D-2D problem occurs when a monocular camera lacks additional depth information; the motion between the two camera coordinate systems can be solved from the epipolar geometric constraints, followed by triangulation to estimate the depth of map points. Related formulations are solved by the perspective-three-point method (P3P) [71] and the iterative closest point method (ICP) [72]. Compared with directly solving the pose by least-squares matching, random sample consensus (RANSAC) [73] removes false matches and noise, making the solution more accurate. Perspective-n-point (PnP) methods solve 3D-2D point-pair motion; variants include UPnP [74], EPnP [75], the direct linear transform (DLT) [76], and nonlinear optimization. Generally, the 3D-3D problem solves the camera pose via singular value decomposition (SVD) [77] on matched feature points, but using depth information on both sides lowers the accuracy, so this formulation is generally avoided with ordinary cameras.
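The 2D-2D case can be illustrated with OpenCV on synthetic correspondences: the essential matrix is estimated with RANSAC and decomposed into a rotation and a unit-scale translation. The intrinsics and true motion below are assumed values.

```python
# Minimal sketch: 2D-2D relative motion from the essential matrix, the typical monocular step.
import cv2
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])       # assumed camera matrix
rng = np.random.default_rng(3)
pts3d = rng.uniform([-2, -2, 4], [2, 2, 8], (50, 3))               # synthetic landmarks

R_true, _ = cv2.Rodrigues(np.array([0.0, 0.1, 0.0]))               # true relative rotation
t_true = np.array([0.3, 0.0, 0.05])                                # true relative translation

def project(P, R, t):
    cam = P @ R.T + t
    uv = cam @ K.T
    return (uv[:, :2] / uv[:, 2:3]).astype(np.float32)

pts1 = project(pts3d, np.eye(3), np.zeros(3))                      # view 1 (identity pose)
pts2 = project(pts3d, R_true, t_true)                              # view 2

E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
# For the 3D-2D case, cv2.solvePnPRansac(pts3d, pts2, K, None) would recover the pose
# from known landmarks instead.
print("recovered R:\n", R)
print("recovered t (unit scale, monocular):", t.ravel())           # translation only up to scale
```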

V-SLAM based on feature-points estimates camera motion according to feature-points after feature matching and optimizes reprojection error, so it is not sensitive to illumination changes. So far, it is the most mature and popular scheme. However, V-SLAM based on feature-points takes a long time in the extraction of key points and descriptors, feature matching, and other aspects. Meanwhile, since V-SLAM based on feature points can only build sparse maps, the traditional feature point extraction methods cannot meet the requirements for some specific scenes.

4.3. The Hybrid Semidirect Method

Forster et al. combined the advantages of the feature-points method and the direct method and proposed semidirect visual odometry (SVO), applied to UAV aerial photography, in 2014 [42].

SVO is organized into motion estimation and mapping. Firstly, a rough camera pose estimate is obtained by aligning sparse feature patches between two frames under the direct method's assumption of gray-level invariance. Secondly, in depth estimation, the Gauss–Newton method is used to optimize the predicted positions of the feature patches, which not only saves computation but also ensures that the pixel gradients are distinct. Finally, the camera pose is optimized using BA.

Unlike the traditional feature-points method, SVO relies on feature points only when selecting key frames and does not require the computation and matching of descriptors, which saves a significant amount of time. Unlike the traditional direct method, SVO does not use the information of all pixels in each frame but takes small image patches to estimate the camera motion. This algorithm not only improves real-time performance at a given accuracy but also uses three optimization steps to ensure the robustness of the results.

Compared with other open-source systems such as ORB-SLAM, SVO has lightweight threads and a simple framework that can run on low-end platforms, but it does not perform well with forward-looking cameras and cannot easily relocalize if pose estimation is lost. In 2016, Forster et al. improved SVO into SVO 2.0 [78]. It not only supports perspective, fisheye, and binocular cameras but also adds edge tracking and takes the IMU rotation prior into account. In addition, SVO 2.0 converges faster between frames and provides theoretical guidance for combining V-SLAM with an IMU; the flowchart is shown in Figure 13.

On the basis of DSO and ORB-SLAM, Seong Hun Lee and Javier Civera used the direct method locally to track the camera pose quickly and robustly on locally accurate, short-term, semidense maps [45]. At the global scale, feature-based approaches are used to optimize key-frame poses, perform loop closure, and build reusable, globally consistent, long-term, sparse feature maps. This loose coupling between the direct method and the feature-points method makes up for the shortcomings of each.

To overcome these shortcomings, a semidirect method that exploits the raw photometric information of a selected pixel set in the image is also under development [79], as shown in Figure 14.

4.4. Loop Closure Detection

As an important part of V-SLAM, loop closure detection plays a key role in establishing a globally consistent map. Its core is to detect the same scene in nonadjacent frames and then add constraints to eliminate the cumulative error. Generally, there are two kinds of methods: (a) V-SLAM based on visual odometry, which uses the estimated distance traveled to decide whether a loop closure has occurred but cannot give accurate results when the cumulative error is large, as shown in Figure 15, and (b) appearance-based V-SLAM, which relies solely on the similarity between the current frame and historical frames and includes traditional methods and deep-learning methods.

4.4.1. Traditional Method

Most traditional loop closure detection methods are realized with the bag-of-words (BoW) model [80]. The BoW model takes image features as words and the whole image as a bag of words. The dictionary is established by clustering the features of a training image data set. By compressing the image information, this method greatly improves retrieval speed and accuracy.
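The sketch below is a toy bag-of-visual-words pipeline, assuming a few hypothetical key-frame images: ORB descriptors are clustered into a small vocabulary with k-means, each image becomes a normalized word histogram, and loop candidates are scored by cosine similarity. Production systems use dedicated libraries such as DBoW2/FBoW with binary-aware clustering and inverse indices.

```python
# Minimal sketch: build a small visual vocabulary and score image similarity with it.
import cv2
import numpy as np

def orb_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = cv2.ORB_create(500).detectAndCompute(img, None)
    return des.astype(np.float32)

paths = ["kf_000.png", "kf_001.png", "kf_002.png"]         # hypothetical keyframes
all_des = np.vstack([orb_descriptors(p) for p in paths])

k = 64                                                      # vocabulary size (assumed)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, _, vocab = cv2.kmeans(all_des, k, None, criteria, 3, cv2.KMEANS_PP_CENTERS)

def bow_vector(des):
    """Assign each descriptor to its nearest word and return a normalized histogram."""
    words = np.argmin(np.linalg.norm(des[:, None] - vocab[None], axis=2), axis=1)
    hist = np.bincount(words, minlength=k).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-9)

vectors = [bow_vector(orb_descriptors(p)) for p in paths]
score = float(vectors[0] @ vectors[-1])                     # cosine similarity, 1.0 = identical scene
print("loop-closure score between first and last keyframe:", score)
```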

In 2006, Nister and Stewenius proposed a tree-structured storage method based on BoW, which greatly improved retrieval efficiency [81]. In 2008, Angeli et al. extended the BoW method from image classification to an incremental setting and used Bayesian filtering to estimate the loop closure probability, enabling real-time loop closure detection in environments with strong perceptual aliasing [82]. However, this method has good pose invariance but not condition invariance. In the same year, Cummins et al. proposed FAB-MAP [83]; however, the image features used to construct the BoW and GIST descriptors are manually trained, and the success rate of loop closure detection is not high under complex illumination. In 2012, Smith et al. proposed a discrete binary vocabulary based on BoW and extended it with a direct index [84], as shown in Figure 16. This method was the first to use a binary vocabulary for loop closure detection; it not only finds loop closures quickly but also detects them in dynamic environments. Over the last decade, this BoW algorithm has been open-sourced and updated to its fourth-generation version, FBoW, which is highly optimized with AVX, SSE, and MMX instructions to speed up dictionary creation.

Moreover, with traditional feature extraction operators such as SIFT, SURF, and ORB, the BoW method has achieved good results on the loop closure detection problem. However, BoW-based loop closure detection has at least two problems: (a) perceptual confusion: the method stores descriptors without order and does not consider the spatial distribution of local features in the scene image, so highly similar scenes can be mistaken for the same scene even at different geographic locations, causing loop closure detection to fail, and (b) manual intervention: a manually built offline dictionary has limitations; it does not make full use of the deep information in the image and can lead to tracking failure over large areas, long durations, and complex environments.

4.4.2. Based on Deep Learning

Loop closure detection based on deep learning mainly uses neural networks to extract image features and makes full use of the deep feature information of the image to obtain higher accuracy. The current main framework is shown in Figure 17.

In 2015, Gomez-Ojeda et al. first retrained a neural network on a place recognition database [85]. This method improved image retrieval accuracy and optimized the effect of loop closure detection in scenes with appearance changes. In the same year, Lowry et al. used the ImageNet database to pretrain a Caffe-based AlexNet [86], which significantly improved robustness against partial occlusion in the scene. Chen et al. first used a convolutional neural network (CNN) to learn image features [87], improving matching accuracy and robustness. Gao and Zhang first used an autoencoder to extract image features and detected the similarity between images through a similarity matrix [88]; this method achieved good results on public data sets. In 2017, Xia proposed a loop closure detection algorithm based on PCANet [26], which used the output of an intermediate CNN layer as the image descriptor and achieved better results in practical performance. In 2019, Khaliq et al. proposed a lightweight method using the VGG16 network to extract image features, which effectively improved efficiency [89]. In 2021, Zhong and Fang adopted BigBiGAN, which improved recall by 20% and reduced time loss by 14% compared with ORB-SLAM2 [90]. In 2022, Gao et al. proposed AirLoop, which used unsupervised learning to minimize forgetting when incrementally training loop closure detection models and showed greatly improved robustness over ORB-SLAM2 on a data set [91].
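A common starting point of these learning-based methods is to use a pretrained CNN as a global image descriptor and compare frames by cosine similarity, as in the sketch below. The image files are hypothetical, and the choice of ResNet-18 with torchvision weights is an assumption for illustration, not the network used in the cited works.

```python
# Minimal sketch: CNN global descriptors for loop-closure candidate scoring.
import torch
import torchvision.transforms as T
from torchvision.models import resnet18
from PIL import Image

model = resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()            # drop the classifier, keep the 512-d feature
model.eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def describe(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return torch.nn.functional.normalize(model(x), dim=1)

d1, d2 = describe("query.png"), describe("candidate.png")   # hypothetical frames
print("loop-closure similarity:", float((d1 * d2).sum()))   # cosine similarity of unit vectors
```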

4.5. Summary Analysis of Existing Methods

The feature-points method calculates the camera pose and map point positions by minimizing the reprojection error, whose offset errors come from the camera intrinsic parameters, the shutter, and so on. The direct method instead minimizes the photometric error, whose offset errors come from blur and noise. Therefore, the direct method can work as long as the selected key points have intensity gradients. In addition, according to research from BMW and the Technical University of Munich, the sparse direct method can run very fast, so it is well suited to platforms with limited resources. However, the direct method also suffers from the weakness of the gray-level invariance assumption and from nonconvexity. Therefore, the direct method is generally applied to scenes with large light-dark variation and small motion amplitude.

As requirements on real-time performance and robustness become stricter, the feature-points method is considered the mainstream approach to visual odometry because of its stability and its insensitivity to dynamic objects and illumination. However, on the one hand, the traditional feature-points method has difficulty matching feature points in weakly textured areas such as the sky and white walls. On the other hand, it uses only the information at feature points, so the utilization of the image is not high.

Compared with the traditional loop closure detection algorithm, the method based on deep learning uses a deep neural network to extract image features, expresses image information more fully, and has stronger robustness to environmental changes such as illumination and season. However, how to choose the appropriate hidden layer to represent image features, how to design the neural network architecture to get away from manual intervention, and how to use task-oriented large data sets to optimize the network parameters for transfer learning are still important issues for future research. In addition, for both traditional and deep-learning-based methods, the amount of data will become larger and larger with the increase of the scene size, so eliminating redundant data and erroneous data is obviously a top priority. Therefore, future loop closure detection should focus on the compressed sensing system, dividing the region, speeding up the spatial search, and eliminating noncritical information.

Table 3 shows some of the top institutions for academic research on V-SLAM.

5. Development Trends and Active Research Areas

The real-time performance, robustness, and accuracy of positioning and mapping have always been the pursuit of researchers. At present, products are still at the stage of further research, development, and application-scenario expansion. Around these three problems, there are many new research directions, such as multifeature V-SLAM, multirobot cooperative V-SLAM, multisensor fusion V-SLAM, semantic V-SLAM, and event-based V-SLAM.

5.1. The Multifeature V-SLAM

Earlier in this paper, we mentioned that point features, as the most popular and commonly used features, are very fast to store and match. However, point features are easily lost in the absence of texture. Thus, more advanced geometric features (such as lines, edges, and planes) can be utilized and integrated into V-SLAM, as shown in Figure 18.

A line feature is composed of multiple points and is naturally invariant to illumination and viewing angle, which overcomes a shortcoming of the point feature. However, line features are not robust in degenerate configurations: localization degrades mainly because the constraints are reduced, while adding extra constraints increases the computational burden. Among the many studies on multifeature V-SLAM, He et al. integrated line features into a V-SLAM system in a tightly coupled way, but they did not consider the parallel constraints between structural lines [92]. In [93], a structural constraint is defined and the mapping accuracy is improved by creating parallel lines, but structural and nonstructural lines are not distinguished. Xu et al. used different line constraints to improve the accuracy and robustness of pose estimation and mapping in complex environments [94]. In addition, Han et al. used a bag-of-words visual vocabulary to extract point and line features and used information entropy to combine the point-based and line-based similarity scores, further improving the accuracy of the similarity assessment between two images [95]. However, these methods still do not systematically consider structural information such as the parallelism, orthogonality, and coplanarity of lines, so their stability in complex environments is poor. Moreover, pose estimation based on line features is not yet as reliable as pose estimation based on feature points in complex environments.
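As a simple illustration of line features, the sketch below extracts line segments with Canny edges and the probabilistic Hough transform; dedicated detectors such as LSD or EDLines are normally used in point-line V-SLAM, and the image file is hypothetical.

```python
# Minimal sketch: line-segment extraction as a stand-in for the line features used in point-line V-SLAM.
import cv2
import numpy as np

img = cv2.imread("corridor.png", cv2.IMREAD_GRAYSCALE)    # hypothetical low-texture scene
edges = cv2.Canny(img, 50, 150)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=60,
                        minLineLength=40, maxLineGap=5)

# Each segment gives two endpoints; direction vectors can later be grouped into parallel
# or orthogonal sets to exploit structural (e.g., Manhattan-world) constraints.
for x1, y1, x2, y2 in lines[:5, 0]:
    d = np.array([x2 - x1, y2 - y1], dtype=float)
    print("segment", (x1, y1), (x2, y2), "direction", d / np.linalg.norm(d))
```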

Edge features provide more information about the environment, including points, lines, and arbitrary curves. At present, the mainstream methods can be divided into two types. One uses edges as auxiliary features for minimizing the photometric error [96–98]; even though this enhances robustness in texture-poor scenes, it is not stable in complex environments (e.g., under changing illumination). The other uses edges as the main feature; several such methods have been proposed [99–101], but they all suffer from failures of edge extraction and from edge-feature redundancy.

Furthermore, most planar-feature-based methods adopt the Manhattan world assumption [102] that all normal vectors are distributed along three mutually perpendicular axes. In complex outdoor environments, the benefit of planar feature enhancement is not obvious [103, 104].

Although more advanced geometric features carry rich pixel information, they still need to be extracted, described, and matched in theory. Facing complex environments, the current algorithms are not yet mature, so finding appropriate data structures, efficient preprocessing methods, and stable, fast matching algorithms has become the most important problem.

5.2. The Multirobot Cooperation V-SLAM

Multirobot collaborative V-SLAM refers to a team of several robots operating in the environment at the same time to construct and locate maps through collaboration. It not only improves the real time, accuracy, and robustness of map construction but also plays an important role in dealing with a complex environment. In it, communication and map fusion are the key problems to be solved.

Communication can be divided into centralized, decentralized, and distributed modes, as shown in Figure 19. The centralized mode is processed by a single control center, so its coordination is good, but its real-time performance, adaptability, and flexibility are poor. The work in [105] uses a common Wi-Fi network, but communication congestion and delay between the master and slaves can greatly affect the practical effect. The decentralized mode is jointly handled by separate controllers, but these controllers are in a parallel relationship, so it is difficult to diagnose structural problems and make effective adjustments. The distributed mode lies in between: even if a local error occurs in the network, it will not affect the whole communication process, so it offers high real-time performance, adaptability, and robustness; however, its architectural design, deployment, and management become difficult and complex. The map fusion problem of multirobot cooperative V-SLAM is an extension of data association in single-robot map creation: it is necessary to know the relative pose of one robot in the local map of another robot and to maintain real-time communication. Lajoie et al. [106] performed distributed pose graph optimization to retrieve the robots' trajectory estimates. Bhutta et al. [107] adopted a multiagent method for large-scale mapping. Zhu et al. [108] used covariance intersection (CI), which allows each robot to estimate only its own state and self-covariance while compensating for unknown correlations between robots.
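Covariance intersection, as used in [108]-style distributed estimation, fuses two estimates whose cross-correlation is unknown by a convex combination of their information matrices; the sketch below picks the weight that minimizes the trace of the fused covariance, using assumed values.

```python
# Minimal sketch: covariance intersection of two estimates with unknown cross-correlation.
import numpy as np

def covariance_intersection(x1, P1, x2, P2, steps=50):
    best = None
    for w in np.linspace(1e-3, 1 - 1e-3, steps):
        info = w * np.linalg.inv(P1) + (1 - w) * np.linalg.inv(P2)   # fused information matrix
        P = np.linalg.inv(info)
        x = P @ (w * np.linalg.inv(P1) @ x1 + (1 - w) * np.linalg.inv(P2) @ x2)
        if best is None or np.trace(P) < best[2]:
            best = (x, P, np.trace(P))                               # keep the minimum-trace weight
    return best[0], best[1]

# Two robots' estimates of the same 2D landmark, with different uncertainties (assumed values).
x1, P1 = np.array([2.0, 1.0]), np.diag([0.5, 0.1])
x2, P2 = np.array([2.2, 0.9]), np.diag([0.1, 0.4])
x_fused, P_fused = covariance_intersection(x1, P1, x2, P2)
print("fused estimate:", x_fused, "\nfused covariance:\n", P_fused)
```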

It should be mentioned that most of the long-term drift caused by accumulated error in multirobot collaborative V-SLAM is compensated by additional sensors, which increases the amount of information to be processed, as in [109]. In addition, there are significant problems with the real-time performance of robot data sharing in the presence of noisy points and limited bandwidth.

5.3. The Multisensor Fusion V-SLAM

A single sensor easily loses track of features in complex scenes with strong light, intense motion, or little texture, which leads to the failure of map construction. Multisensor fusion can solve this problem well, as shown in Figure 20.

However, multisensor fusion raises the following problems: firstly, how to synchronize the sensor data and how to calibrate the extrinsic parameter relations; secondly, the large amount of information brought by multiple sensors contains a certain redundancy. Most current research fuses an inertial measurement unit (IMU) with a camera sensor [110]. The IMU can ensure better pose estimation under fast camera motion, and the camera compensates for IMU drift. However, the IMU data rate is very high, so the amount of computation is also very large. Recently, the authors of [111] added the optimization of laser plane parameters on the basis of LIC-Fusion [112] to improve the accuracy of positioning and mapping, but it has only been used in small indoor areas. At present, multisensor fusion V-SLAM is still at an initial stage, and a relatively unified data-fusion theory and effective fusion models have not yet been established. Moreover, there is no good solution to the problems of fault tolerance and robustness in data-fusion systems.
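The role of the IMU between camera frames can be illustrated by simple dead reckoning: high-rate accelerations are integrated to predict the position at the next image, which visual measurements then correct. The sketch below uses synthetic IMU samples and ignores gravity, bias, and orientation for brevity.

```python
# Minimal sketch: integrating high-rate IMU accelerations between two camera frames.
import numpy as np

imu_rate, cam_rate = 200.0, 20.0                 # assumed sensor rates (Hz)
dt = 1.0 / imu_rate
rng = np.random.default_rng(4)

p = np.zeros(3)                                  # position at the last camera frame
v = np.array([0.5, 0.0, 0.0])                    # velocity at the last camera frame
accels = rng.normal([0.1, 0.0, 0.0], 0.02, (int(imu_rate / cam_rate), 3))  # synthetic IMU samples

for a in accels:                                 # integrate every IMU sample in the interval
    p = p + v * dt + 0.5 * a * dt * dt
    v = v + a * dt

print("predicted position at the next camera frame:", p)
```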

5.4. The Semantic V-SLAM

Semantic V-SLAM fuses traditional V-SLAM with neural-network technologies such as semantic segmentation, object detection, and instance segmentation that extract object-level information labels, as shown in Figure 21.

Generally speaking, there are two approaches to semantic V-SLAM. One combines semantic information with localization: with the help of V-SLAM, the positional constraints between objects can be computed, and consistency constraints can be applied to the recognition results of the same object seen from different angles and at different times. A large amount of new data can thus be generated to provide more optimization conditions, improving the precision of semantic understanding and saving the cost of manual labeling. The other combines semantic information with map building: data association is upgraded from the traditional pixel level to the labeled object level, which provides high-level maps and improves localization accuracy.

The development of semantic SLAM is relatively recent; Alexey Dosovitskiy et al. [114] introduced a CNN to the field. In 2015, Vineet et al. [115] from Stanford first realized a system capable of simultaneous map construction and semantic segmentation, demonstrating that semantic SLAM can be pushed into practice. In fact, semantic or object-level SLAM papers have been published in recent years, but almost all of them are limited by the computational complexity and localization accuracy of V-SLAM itself. Moreover, most of these works are carried out in controllable scenarios and need to build a data knowledge base to store prior knowledge. The authors of [116] used Mask R-CNN for semantic segmentation and adopted multiview geometry. Although this approach enabled the recognition of moving objects, as long as a single dynamic point lies on an object, the whole object is considered dynamic. In [117], Mask R-CNN was also used for the semantic segmentation of images; on the basis of DS-SLAM, photometric, reprojection, and depth errors were used jointly to assign a robust weight to each key point. As a result, a semantic label was assigned to each pixel, but real-time performance still could not be achieved even with GPU acceleration. In [118], fast plane extraction was added to semantic detection, and then a graph relating the camera pose and the planes was constructed; however, the parameter estimation accuracy was still poor.
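A typical building block of such dynamic-scene semantic V-SLAM is to mask feature points that fall on segmented movable objects before pose estimation, as sketched below with torchvision's Mask R-CNN and COCO "person" masks; the frame file and thresholds are assumptions, not the exact pipeline of the cited systems.

```python
# Minimal sketch: filter out ORB keypoints lying on segmented dynamic objects.
import cv2
import numpy as np
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

img = cv2.imread("frame.png")                                  # hypothetical BGR frame
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

with torch.no_grad():
    x = torch.from_numpy(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)).permute(2, 0, 1).float() / 255
    out = model([x])[0]

# Union of confident "person" masks (COCO label 1) marks the dynamic region.
dynamic = np.zeros(img.shape[:2], dtype=bool)
for label, score, mask in zip(out["labels"], out["scores"], out["masks"]):
    if label.item() == 1 and score.item() > 0.7:
        dynamic |= mask[0].numpy() > 0.5

kps = cv2.ORB_create(1000).detect(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), None)
static_kps = [k for k in kps if not dynamic[int(k.pt[1]), int(k.pt[0])]]
print(f"kept {len(static_kps)} of {len(kps)} keypoints outside dynamic objects")
```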

It can be seen that semantic V-SLAM is still in the preliminary stage of research, and most semantic segmentation is performed on dense maps at present. On the sparse map, wrong data association will bring serious errors, and there is no good solution at present.

5.5. The Event-Based V-SLAM

Unlike traditional cameras, the event camera outputs "events" by detecting changes in light intensity, so the output signal stream has the advantages of low latency (millisecond level), high pixel bandwidth (kHz level), high dynamic range (140 dB), and ultralow power consumption (20 mW vs. 1.5 W for standard cameras), as shown in Figure 22.
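Many event-based front ends start by accumulating the asynchronous event stream (x, y, timestamp, polarity) over a short window into an "event frame" that downstream V-SLAM modules can process; the sketch below does this on a synthetic stream with an assumed sensor resolution.

```python
# Minimal sketch: accumulate a synthetic event stream into a signed event frame.
import numpy as np

H, W = 180, 240                                       # assumed sensor resolution
rng = np.random.default_rng(5)
n = 5000
events = np.stack([rng.integers(0, W, n),             # x
                   rng.integers(0, H, n),             # y
                   np.sort(rng.uniform(0, 0.05, n)),  # timestamp in seconds
                   rng.choice([-1, 1], n)], axis=1)   # polarity

def event_frame(ev, t0, t1):
    """Sum event polarities per pixel over the time window [t0, t1)."""
    frame = np.zeros((H, W), dtype=np.float32)
    window = ev[(ev[:, 2] >= t0) & (ev[:, 2] < t1)]
    for x, y, _, pol in window:
        frame[int(y), int(x)] += pol                   # signed accumulation of brightness changes
    return frame

frame = event_frame(events, 0.0, 0.01)                 # 10 ms window
print("accumulated events in window:", int(np.abs(frame).sum()))
```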

At present, although many algorithms for localization and mapping based on event cameras have been proposed, they still have problems. A survey of event-based cameras over the past decade is given in [119]. For example, the algorithm in [120] requires estimating the image intensity and adjusting the depth, so its computational cost is very high. The algorithm in [121] relies on traditional cameras for depth estimation, wasting the low-latency and high-dynamic-range characteristics of event cameras. The algorithm in [122] estimates the event-camera pose by generating high-resolution dense environment maps, so it takes a long time to converge.

Most event-based V-SLAM methods currently require more high-quality data sets and collaboration with other sensors. Furthermore, the event sensor also has the disadvantages of low spatial resolution, low signal-to-noise ratio, and high price. From a general research point of view, completely new hardware is needed to match the operating characteristics of the event camera. Besides, other problems of event-based V-SLAM, such as recognition, behavior understanding, and cumulative error, also need to be solved. In the future, using deep-learning methods to process event data is a primary research direction.

5.6. Summary Analysis of Research Directions

V-SLAM still has many problems to be solved. For example, information is more easily lost when operating in outdoor environments, especially in scenes with uniform color and little texture. Moreover, structured light cannot be calibrated well, resulting in poor vision, which lacks the universality of the stereo camera. Secondly, if the front-end data are too sparse, the drift of the controlled object cannot be corrected, whereas if dense data are generated at the front end, the back-end optimization becomes heavy, resulting in a mismatch between map construction and control speed. Moreover, heavy data processing requires sophisticated cameras and high-performance processors, which makes such systems difficult to popularize. Finally, the global consistency of the map built by multiple robots cannot be guaranteed at the lowest cost.

We believe that the future is the era of multitechnology convergence. For example, the authors of [123] combine geometric features and machine-learning methods: a two-layer convolutional long short-term memory (LSTM) module is used to model pose estimation, a traditional geometric method is used to supervise key modules in a self-supervised way, and a staged training strategy is proposed, which guarantees a certain accuracy even without loop closure detection. Yun Chang et al. proposed a novel incremental maximum clique outlier rejection protocol based on semantic information, which is powerful and efficient [124]. Introducing artificial intelligence technologies (such as deep learning, machine learning, and computer vision) into V-SLAM is the future trend.

6. Conclusion

Starting from the origin of SLAM and its wide range of applications, this paper has shown that V-SLAM is still at the laboratory research stage, is mainly applied indoors, and has no reliable commercial product yet. However, it is worth noting that V-SLAM offers low cost and rich semantics. In addition to the problems mentioned above regarding V-SLAM and its camera characteristics, V-SLAM fails to track in complex environments such as those with challenging illumination, occlusion, dynamic objects, and weak texture. Moreover, when the camera moves too fast, the resulting motion blur also causes V-SLAM to fail. More importantly, in practice we should consider not only the accuracy, robustness, and efficiency of the algorithm but also the hardware cost. In future V-SLAM research, what kind of sensor to use to obtain what kind of data, what kind of decisions to make when processing these data, and how to achieve high performance at light weight and low cost will be the primary research directions. According to [24], the current algorithms are in an era of refinement, and we believe that more and better frameworks will appear in the future, which requires the unremitting efforts of researchers.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Yong Dai obtained the funding and supervised the technical research and direction. Yong Dai and Jiaxin Wu carried out the concepts and the method. Jiaxin Wu designed the experiment, analyzed the data, and wrote the original manuscript. Yong Dai and Duo Wang reviewed and revised the paper.

Acknowledgments

This work was supported by the 2021 High-Level Talents Research Support Program of Shenyang Ligong University (No. 1010147001017), the project funded by the China Postdoctoral Science Foundation, and the Scientific Research Fund of Liaoning Provincial Education Department (grant no.: LJKMZ20220616).