Abstract

Traditional visual simultaneous localization and mapping (VSLAM) systems mostly rely on the static-world assumption, which limits their application in real-world scenarios with dynamic objects. When dynamic objects appear in the scene, the localization accuracy of the system degrades severely. In this paper, in order to minimize the interference of dynamic objects in visual localization, we propose a real-time and robust dynamic interference removal (DIR) method based on both prior knowledge and geometric information. Our approach employs a novel lightweight CNN to output semantic labels and extends the semantics based on the correlations of descriptors to generate a segmented mask. We design a geometric consistency check module to remove the dynamic interference: it computes bundle adjustment to predetermine the static keypoints, and a semantic weighted epipolar constraint is then used to identify the dynamic outliers. The proposed method is integrated into the front end of ORB-SLAM2 to filter out the dynamic keypoints associated with both known and unknown dynamic objects. We conduct experiments on the public TUM RGB-D dataset; the qualitative and quantitative results show that the DIR method can improve the performance of a state-of-the-art VSLAM system in dynamic scenarios.

1. Introduction

A mobile robot is an intelligent system that integrates multiple functions such as environment perception, dynamic anticollision, and motion control. To support these capabilities, visual simultaneous localization and mapping (VSLAM) is considered a fundamental problem. In recent decades, VSLAM systems [1–7] based on visual sensors have attracted increasing attention and been well studied, with rather satisfactory performance that can facilitate high-level tasks [8–10]. Typically, some well-performing VSLAM systems have been developed, such as ORB-SLAM2 [3] and LSD-SLAM [2]. Given a sequence of images, these systems can jointly estimate the camera pose and generate a continuous camera trajectory. However, the vast majority of VSLAM systems rely on the assumption of a static environment and estimate the pose from a static feature set. As a consequence, they are vulnerable to unexpected changes in their surroundings, such as dynamic objects, especially humans. In these scenarios, dynamic content affects the whole VSLAM pipeline, which inevitably degrades localization accuracy and reliability.

To address these problems, many algorithms have been adopted to make existing VSLAM systems dynamic-object-aware [8, 11–19]. Algorithms such as random sample consensus (RANSAC) [20, 21] are employed to reject outliers, which can weaken the dynamic interference by optimizing the feature set. However, these algorithms tend to fail when moving objects occupy a major part of the camera's field of view. Compared with purely optimizing features, distinguishing the content of a scene as static or dynamic benefits visual localization in dynamic scenarios [8, 11–14].

With the development of machine learning, VSLAM systems combined with deep learning methods have advanced rapidly [15–19]. Advanced convolutional neural network (CNN) architectures such as Mask R-CNN [22], YOLO [23], and SegNet [24] are applied to effectively obtain prior knowledge, which is used to classify the objects in scenes. However, these methods handle only known objects and ignore unknown dynamic objects, which are labeled as background; it is therefore not sufficient to judge objects by prior knowledge alone. In addition, most of these methods suffer from high computational costs and easily cause information loss. Robustness and low computational cost are thus two challenges for such approaches.

In this paper, we propose a real-time and robust dynamic interference removal (DIR) method for dynamic scenarios, which mainly includes a semantic part and a geometric consistency check module. The former is composed of a novel semantic segmentation network and the dynamic correlation region, which are introduced to provide a pixel-wise classification and to extend the semantics of the local areas correlated with dynamic objects. The latter uses bundle adjustment and a semantic weighted epipolar constraint to identify and reject the dynamic outliers. The main contributions of the proposed method are summarized as follows:

(1) We propose a novel lightweight semantic segmentation network built on MobileNetV2 [25], called De-MNetV2, which is more sensitive to dynamic objects and inconspicuous details. To capture the dynamic content completely, we define the local areas correlated with dynamic pixels as the dynamic correlation region and extend the corresponding semantics over this region.

(2) We design an efficient geometric consistency check module based on bundle adjustment (BA) and an epipolar geometry constraint with semantic weights. The former is computed to predetermine static keypoints and avoid information loss, and the latter is used to robustly identify the dynamic keypoints on both known and unknown objects.

(3) We integrate the proposed method into ORB-SLAM2 [3]; the resulting system is called DIR-SLAM (Dynamic Interference Removal SLAM). Experiments on the widely used TUM RGB-D benchmark dataset [26] convincingly show that visual localization accuracy in dynamic environments can be greatly improved.

The rest of this article is organized as follows: Section 2 summarizes various dynamic SLAM methods and presents the essence of VSLAM problems in dynamic environments. Section 3 describes the theoretical content and verification of the proposed method. Section 4 presents the experimental results and analysis. We draw conclusions and outline future work in Section 5.

2. Related Work

In dynamic environments, some areas of the image may be occupied by dynamic pixels. As a result, visual localization accuracy cannot be guaranteed once the dynamic content is fused into the estimation. To address this problem, we give a comprehensive analysis of existing dynamic VSLAM algorithms in Section 2.1 and explain the nature of dynamic VSLAM problems in Section 2.2.

2.1. Existing Dynamic VSLAM Algorithms

The direct methods mainly depend on the temporal or spatial coherence of dynamic points, such as the comparison of geometric structures [8, 11–14]. Jaimez et al. [11] use the K-means clustering algorithm and reprojection errors to classify geometric clusters as static or dynamic; the dense dynamic points are then removed. Scona et al. [8] employ both sensor information fusion and per-point static probabilities to optimize the robot's pose. Sun et al. [14] design a motion removal method to address the problem of RGB-D SLAM in dynamic environments, which estimates the possible foreground points by dense optical flow computation. Given the 3D information provided by an RGB-D camera, the depth information alone can be regarded as a classification criterion: Li and Lee [12] present a static weighting method for depth edge points that indicates the likelihood of a point being part of the static environment, which improves tracking and mapping performance.

Despite being suitable for dynamic environments, these methods use all pixels in the image for pose estimation; therefore, projection errors caused by interference such as camera noise and illumination changes cannot be properly handled, and reliable localization results are not consistently achieved. In addition, geometric structures can only detect objects that are currently moving, not movable objects such as a person who temporarily stays still. Therefore, it is necessary to introduce prior knowledge to infer the movable objects.

With the prosperity of deep learning technologies, feature-based VSLAM combined with deep learning methods, which can provide prior knowledge, has developed rapidly and infers dynamic objects with impressive performance [15–18]. Yu et al. [15] employ SegNet to obtain semantics and check moving consistency, then optimize localization by filtering keypoints on humans. Bescos et al. [16] combine multiview geometry models and Mask R-CNN to detect dynamic objects and use a region growing algorithm to remove all dynamic points in the mapping process so as to estimate static maps. Cheng et al. [17] jointly employ the YOLOv3, Faster R-CNN, and SSD detection models as the prior knowledge generation module, and a Bayesian framework is then applied to determine and discard dynamic regions.

2.2. VSLAM Problems in Dynamic Environments

In feature-based VSLAM systems, the interference caused by dynamic objects is multifaceted and mainly affects keypoints, descriptors, and geometric structures. Dynamic keypoints on moving objects lead to inaccurate landmarks for tracking. Meanwhile, because patch-based descriptors are constructed by sampling neighboring points of an area [27], the descriptors will contain dynamic content whenever dynamic points exist in that area, which hinders feature matching and pose estimation. Finally, the dynamic keypoints destroy consistency and cause conflicts in the geometric structures, which directly reduces the accuracy of visual localization.

The essence of the VSLAM problem in dynamic environments lies in unreliable observations; hence, we filter out unreliable observations to remove the dynamic interference. As discussed in Section 2.1, since objects with a high probability of motion easily cause pose estimation errors and trajectory tracking failures, we contend that it is essential to use deep learning networks to infer the movable objects in advance. However, the predictions of such networks are often inaccurate, so the geometric structures, which express the consistency of points, cannot be ignored. Figure 1 shows how dynamic keypoints can destroy the epipolar geometry constraints.

Compared with the methods mentioned in Section 2.1, our proposed method falls into the category of feature-based VSLAM combined with deep learning; we describe its detailed characteristics in Section 3.

3. DIR-SLAM

3.1. Method Overview

ORB-SLAM2 [3] is one of the most widely used solutions for visual localization and has shown excellent performance in most practical situations. In dynamic environments, however, its performance degrades considerably. Therefore, we propose a dynamic interference removal (DIR) method and integrate it into ORB-SLAM2; the resulting system is named DIR-SLAM. The flowchart of DIR-SLAM is shown in Figure 2.

Figure 3 illustrates the details of the DIR method, which contains the following four parts: (1) the semantic part; (2) the dynamic correlation region; (3) feature matching; (4) the geometric consistency check module. In the semantic part, we design a lightweight semantic segmentation network to output the semantic labels, which are arranged according to the likelihood of movement from 0 (background) to 20 (person). Then, the semantic content is extended according to the dynamic correlation region (defined in Section 3.3) to generate a segmented mask, which we use to provide semantic weights. For feature matching, keypoints are tracked between the current frame and the previous frame by optical flow [28] to generate the initial feature matches, as sketched below. In the geometric consistency check module, BA is computed first to reserve the static keypoints that are consistent with the previous camera pose, and then the epipolar geometry constraint with semantic weights is calculated to identify and reject the dynamic outliers.
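To make the matching step concrete, here is a minimal sketch of how the initial matches can be generated with pyramidal Lucas-Kanade optical flow [28] in OpenCV. The file names, feature count, and window parameters are illustrative assumptions, not the system's exact settings.

```python
import cv2
import numpy as np

# Track keypoints from the previous frame into the current frame with
# pyramidal Lucas-Kanade optical flow to form the initial feature matches.
prev_img = cv2.imread("frame_prev.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
cur_img = cv2.imread("frame_cur.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
keypoints = orb.detect(prev_img, None)
p_prev = np.float32([kp.pt for kp in keypoints]).reshape(-1, 1, 2)

p_cur, status, _ = cv2.calcOpticalFlowPyrLK(
    prev_img, cur_img, p_prev, None, winSize=(21, 21), maxLevel=3)

ok = status.ravel() == 1                  # keep only successfully tracked points
matches_prev = p_prev[ok].reshape(-1, 2)
matches_cur = p_cur[ok].reshape(-1, 2)
```

The surviving pairs (matches_prev, matches_cur) are then handed to the geometric consistency check module of Section 3.4.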

3.2. Semantic Segmentation

Consecutive frames captured by a moving camera inevitably suffer from blur or ghosting, which demands a stronger scene parsing ability. In addition, the frequently changing details of dynamic objects lead to strong interference and inconspicuous pixels in the frames, so it is worthwhile to capture details. To address these problems, we propose an improved network based on MobileNetV2 [25], named De-MNetV2. The network structure of De-MNetV2 is shown in Figure 4.

As Figure 4 shows, the pyramid pooling module (PPM) of PSPNet [29] is connected to the backbone as the PSP header, which gathers global context information and provides a complete understanding of the scene. Considering that the low-level layers of the network are rich in spatial details [31], we insert two skip connection branches that fuse the low-level features to add detail, which benefits the high-level features. The branches first extract the low-level features through dilated convolution, and a fully connected layer is then used to keep the dimensions consistent. Finally, the details from the branches and the global context information provided by the PSP header are superposed and sent to the decoder to predict the semantic labels.
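As an illustration of this architecture, the following Keras sketch wires a MobileNetV2 backbone to a PSP-style header and two dilated skip branches. The tap points ("block_1_expand_relu", "block_3_expand_relu"), channel widths, and fusion resolution are our assumptions for illustration; the actual De-MNetV2 hyperparameters are those given in Figure 4.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def psp_header(x, bins=(1, 2, 3, 6), ch=128):
    """PSPNet-style pyramid pooling: pool the feature map at several grid
    sizes, project each with a 1x1 conv, resize back, and concatenate."""
    h, w = int(x.shape[1]), int(x.shape[2])
    outs = [x]
    for b in bins:
        p = layers.AveragePooling2D(pool_size=(max(h // b, 1), max(w // b, 1)))(x)
        p = layers.Conv2D(ch, 1, activation="relu")(p)
        outs.append(layers.Resizing(h, w, interpolation="bilinear")(p))
    y = layers.Concatenate()(outs)
    return layers.Conv2D(256, 3, padding="same", activation="relu")(y)

def build_de_mnetv2(input_shape=(480, 480, 3), num_classes=21, fuse_hw=120):
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights="imagenet")
    # Assumed tap points for the two low-level skip branches.
    low1 = backbone.get_layer("block_1_expand_relu").output   # ~1/2 resolution
    low2 = backbone.get_layer("block_3_expand_relu").output   # ~1/4 resolution
    high = psp_header(backbone.output)                        # global context

    def skip_branch(feat, ch=48, rate=2):
        # Dilated conv enlarges the receptive field over spatial details,
        # then a 1x1 projection keeps the channel dimensions consistent.
        y = layers.Conv2D(ch, 3, padding="same",
                          dilation_rate=rate, activation="relu")(feat)
        y = layers.Conv2D(ch, 1, activation="relu")(y)
        return layers.Resizing(fuse_hw, fuse_hw, interpolation="bilinear")(y)

    fused = layers.Concatenate()([
        layers.Resizing(fuse_hw, fuse_hw, interpolation="bilinear")(high),
        skip_branch(low1),
        skip_branch(low2),
    ])
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(fused)
    x = layers.Conv2D(num_classes, 1)(x)                      # per-pixel logits
    x = layers.Resizing(input_shape[0], input_shape[1],
                        interpolation="bilinear")(x)
    return Model(backbone.input, layers.Softmax()(x))
```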

The PSP header improves both adaptability and scene parsing in dynamic environments, reducing mismatches and category confusion, while the fused details enhance the classification performance. We hold that these improvements together benefit classification in dynamic environments.

3.3. Dynamic Correlation Region

In feature-based VSLAM methods, the camera pose is estimated by matching the descriptors of keypoints, such as ORB [32], SIFT [33], and FREAK [34]. Here, we take the ORB algorithm as an example to illustrate the correlation between dynamic points and their neighbors, and we then define the areas affected by dynamic objects as the dynamic correlation region.

The ORB algorithm uses the Rotation-Aware BRIEF (rBRIEF) descriptor. To generate the rBRIEF descriptor, a pixel-wise circular sampling patch centered at the keypoint is first rotated according to the orientation of the keypoint to guarantee rotation invariance. The circular patch centered at the keypoint is shown in Figure 5.

In Figure 5, assuming the keypoint $p$ is static, the orientation angle $\theta$ of $p$ is determined by the intensity centroid of the patch. The image moments $m_{pq}$ of the patch are defined as follows [35]:

$$m_{pq} = \sum_{x, y \in r} x^{p} y^{q} I(x, y), \quad p, q \in \{0, 1\}, \qquad (1)$$

where $I(x, y)$ is the pixel intensity and the sum runs over the circular patch of radius $r$. The orientation angle is then calculated by [35]

$$\theta = \arctan\!\left(\frac{m_{01}}{m_{10}}\right). \qquad (2)$$

It can be seen that $\theta$ is closely related to the intensity distribution of the patch, whose centroid is $C = \left(m_{10}/m_{00},\, m_{01}/m_{00}\right)$. Hence, when the patch content is moving, the centroid $C$ shifts to $C'$ and $\theta$ drifts to $\theta'$. The orientation deviation angle $\Delta\theta$ can be expressed as follows:

$$\Delta\theta = \left| \theta' - \theta \right|. \qquad (3)$$

In these cases, the rBRIEF descriptor will contain dynamic content, which degrades feature matching. To validate this influence, we simulate the orientation calculation process of the rBRIEF descriptor. Pairs of frames are captured under different conditions to represent various dynamic environments, and the comparative results are shown in Figure 6. First, we convert each image to grayscale and filter it with a Gaussian kernel. The orientation is computed by equation (2), and the per-pixel angular deviation between the image pairs is calculated by equation (3). For a more intuitive presentation, the deviation value is scaled to the red color channel and rendered as a mask: the redder the color, the more severe the angular deviation, and thus the stronger the impact of dynamic objects.
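A minimal NumPy/OpenCV sketch of this simulation is given below. The patch radius, Gaussian kernel size, and file names are illustrative assumptions (the paper does not disclose its exact kernel), and a square window stands in for the circular rBRIEF patch.

```python
import cv2
import numpy as np

def orientation_map(gray, r=15):
    """Per-pixel patch orientation theta = arctan(m01 / m10), equations (1)-(2).
    The moments are accumulated over a (2r+1)x(2r+1) window via separable
    filtering: m10 weights intensities by x, m01 weights them by y."""
    img = gray.astype(np.float32)
    coords = np.arange(-r, r + 1, dtype=np.float32)
    ones = np.ones(2 * r + 1, dtype=np.float32)
    m10 = cv2.sepFilter2D(img, -1, coords, ones)   # sum of x * I(x, y)
    m01 = cv2.sepFilter2D(img, -1, ones, coords)   # sum of y * I(x, y)
    return np.arctan2(m01, m10)                    # atan2 for numeric robustness

# Two frames of the same scene under dynamic motion (placeholder file names).
g1 = cv2.GaussianBlur(cv2.imread("pair_a.png", cv2.IMREAD_GRAYSCALE), (5, 5), 0)
g2 = cv2.GaussianBlur(cv2.imread("pair_b.png", cv2.IMREAD_GRAYSCALE), (5, 5), 0)

dtheta = np.abs(orientation_map(g2) - orientation_map(g1))   # equation (3)
dtheta = np.minimum(dtheta, 2 * np.pi - dtheta)              # wrap to [0, pi]
red_mask = (dtheta / np.pi * 255).astype(np.uint8)           # redder = larger deviation
```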

The simulation results indicate that dynamic objects inevitably affect static points, especially within the circular neighboring areas of radius $r$. We observe that dynamic semantics should also occupy these areas so that the prior knowledge can play its full role. Hence, we define these areas as the dynamic correlation region.

According to equations (1) and (2), we employ the morphological dilation algorithm to extend the semantics of dynamic objects so that they cover the dynamic correlation region. The dilation kernel $K$ can be expressed as

$$K = \left\{ (x, y) \;\middle|\; x^{2} + y^{2} \le r^{2} \right\}, \qquad (4)$$

that is, a circular structuring element whose radius matches the descriptor patch radius $r$.

Then, the segmented mask of the frame is updated. Without loss of generality, if another feature extractor is adopted, the parameter $r$ just needs to be adjusted to that extractor's patch size.
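The semantic extension itself reduces to a grayscale morphological dilation, as in the sketch below (assuming an OpenCV circular structuring element). Because dilation takes the neighborhood maximum, a label of higher motion probability overwrites lower ones, consistent with the weighting scheme in Section 3.4.

```python
import cv2
import numpy as np

def extend_dynamic_semantics(label_mask, r=15):
    """Dilate the semantic labels so that dynamic classes also cover their
    dynamic correlation region (a circle of radius r around each pixel).

    label_mask: uint8 per-pixel labels, arranged by motion likelihood from
    0 (background) to 20 (person); r: descriptor patch radius, equation (4).
    """
    size = 2 * r + 1
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (size, size))
    # Grayscale dilation = neighborhood maximum, so the most dynamic label
    # wins inside the correlation region.
    return cv2.dilate(label_mask.astype(np.uint8), kernel)
```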

3.4. Geometric Consistency Check Module

We design an adaptive geometric consistency check module between two consecutive frames, which can be applied in scenarios with known and unknown dynamic objects to robustly remove the dynamic interference. First, we compute BA [36] to estimate the static keypoints, which are consistent with the previous camera pose. Then, the epipolar geometry constraint with semantic weights is calculated to identify and reject dynamic outliers.

As shown in Figure 7, we assume that the keypoint $p_c$ in the current frame is matched with $p_r$ in the previous frame, which means the camera observes the same landmark in both frames. The coordinates of the matched pair can be expressed as $p_c = [u_c, v_c]^{T}$ and $p_r = [u_r, v_r]^{T}$. Since the pose of the previous frame is known, according to the reprojection model, we can project the observed point back to the world coordinate frame and compute the corresponding 3D coordinates $P_w$. Thus, the reprojection error $e$ between $p_c$ and the reprojection of $P_w$ is calculated as follows:

$$e = \left\| p_c - \pi\!\left(T, P_w\right) \right\| < \varepsilon, \qquad (5)$$

where $\pi(\cdot)$ is the projection function of the current frame and $T$ is the pose of the previous frame. We set a small value for $\varepsilon$ (1.0); thus, the keypoints that satisfy equation (5) are considered static and are directly reserved as inliers. For the other matched pairs, the epipolar line $l$ of the keypoint $p_r$ is calculated as follows:

$$l = \begin{bmatrix} A \\ B \\ C \end{bmatrix} = F \begin{bmatrix} u_r \\ v_r \\ 1 \end{bmatrix}, \qquad (6)$$

where $F$ denotes the fundamental matrix and $A$, $B$, and $C$ denote the line coefficients. Then, the distance $D$ from $p_c$ to the epipolar line $l$ is denoted by the following equation:

$$D = \frac{\left| \begin{bmatrix} u_c & v_c & 1 \end{bmatrix} F \begin{bmatrix} u_r & v_r & 1 \end{bmatrix}^{T} \right|}{\sqrt{A^{2} + B^{2}}}. \qquad (7)$$
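The two geometric tests can be sketched in a few lines of NumPy. The interfaces below (pose matrix layout, triangulated points already expressed in world coordinates) are our assumptions for illustration, not the system's exact code.

```python
import numpy as np

def static_by_reprojection(P_w, pts_cur, K, T_prev, eps=1.0):
    """Predetermine static keypoints by the reprojection error, equation (5).

    P_w: Nx3 world-frame points observed in the previous frame,
    pts_cur: Nx2 matched pixels in the current frame, K: 3x3 intrinsics,
    T_prev: 4x4 world-to-camera pose of the previous frame."""
    P_c = T_prev[:3, :3] @ P_w.T + T_prev[:3, 3:4]        # world -> camera
    uv = (K @ (P_c / P_c[2])).T[:, :2]                    # pinhole projection
    return np.linalg.norm(uv - pts_cur, axis=1) < eps     # True = static inlier

def epipolar_distance(pts_prev, pts_cur, F):
    """Distance from each current keypoint to its epipolar line, eqs. (6)-(7)."""
    ones = np.ones((len(pts_prev), 1))
    p1 = np.hstack([pts_prev, ones])                      # homogeneous coords
    p2 = np.hstack([pts_cur, ones])
    lines = p1 @ F.T                                      # rows are [A, B, C]
    return np.abs(np.sum(p2 * lines, axis=1)) / np.hypot(lines[:, 0], lines[:, 1])
```

In practice, the fundamental matrix can be estimated from the tentative matches, e.g., F, _ = cv2.findFundamentalMat(pts_prev, pts_cur, cv2.FM_RANSAC).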

Because of sensor errors and dynamic interference, keypoints deviate from their epipolar lines; the larger the distance $D$, the more likely the keypoint is to be dynamic. Here, we employ the prior semantic information to weight the distance $D$. The label values are arranged according to the likelihood of movement from low (0) to high (20); therefore, after the semantic extension in Section 3.3, a label of higher motion probability covers lower ones. We assign the label values and the corresponding semantic weights as shown in Figure 8.

As shown in Figure 8, the semantic weight $\omega$ increases with the likelihood of the object moving. We employ the segmented mask to provide the semantic weight $\omega$ of each keypoint. The final distance function is calculated as follows:

$$D_{w} = \omega \cdot D. \qquad (8)$$

$D_{w}$ is used to identify and reject dynamic outliers: when $D_{w}$ is larger than a certain threshold (1.0), the keypoint is considered a dynamic point and is rejected.
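Putting the pieces together, the weighted decision of equation (8) can be sketched as follows. The weight table stands in for Figure 8, whose exact values are not reproduced here, so treat it as an assumption.

```python
import numpy as np

def reject_dynamic(pts_cur, D, seg_mask, weights, delta=1.0):
    """Semantic weighted outlier test, equation (8): D_w = omega * D.

    pts_cur: Nx2 current-frame keypoints, D: N epipolar distances from
    equation (7), seg_mask: extended label mask from Section 3.3,
    weights: lookup table of shape (21,) mapping label -> semantic weight
    (monotone increasing with motion likelihood, per Figure 8)."""
    cols = pts_cur[:, 0].astype(int)
    rows = pts_cur[:, 1].astype(int)
    omega = weights[seg_mask[rows, cols]]    # semantic weight per keypoint
    D_w = omega * D
    return D_w >= delta                      # True = dynamic outlier, rejected
```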

We show the single-step results of the geometric consistency check module in Figure 9. In Figure 9(a), the green points are inliers predetermined by BA; these keypoints are recognized as static and are reserved with priority to avoid information loss. In Figure 9(b), the semantic weighted epipolar distance is computed, and the filtered dynamic keypoints are shown as red points. The blue points represent the remaining static keypoints, which are used to track the pose.

4. Experiments

To prove the effectiveness of the proposed method, we conduct comparative experiments and evaluate the results quantitatively and qualitatively. In this section, we evaluate the proposed method in two parts: (1) the De-MNetV2 network; (2) DIR-SLAM.

4.1. De-MNetV2 Network

The De-MNetV2 network is trained on the PASCAL VOC 2012 dataset [37]. The model can detect 20 classes that include common dynamic objects, e.g., people, cats, and dogs, which suffices for the testing requirements of the TUM dataset [26]. For more complex environments, the model should be trained on the COCO dataset [38] to classify more categories. The implementation is trained on the Keras platform with an RTX 2080 Ti, an Intel E5-2678 v3 CPU, and 64 GB RAM. For data augmentation, we randomly scale (from 0.5 to 1.5) and left-right flip the input images. The images are cropped to a fixed size and grouped with a batch size of 6. We set the initial learning rate to 0.0001, which gradually decreases to 0 following the "poly" strategy [39]. The network is trained with Adam, with the weight decay set to 0.00001. The pixel-wise dice loss [40] is used as the loss function.
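For concreteness, the two less common ingredients of this recipe, the "poly" learning-rate schedule and the pixel-wise dice loss, can be written as follows. The decay power 0.9 is the value conventionally used with this schedule [39] and is an assumption here, since the paper does not state it.

```python
import tensorflow as tf

def poly_lr(step, total_steps, base_lr=1e-4, power=0.9):
    """'Poly' schedule: the learning rate decays from base_lr to 0."""
    return base_lr * (1.0 - step / float(total_steps)) ** power

def dice_loss(y_true, y_pred, smooth=1.0):
    """Pixel-wise dice loss over one-hot labels and softmax predictions;
    the sums run over the spatial dimensions of each class map."""
    inter = tf.reduce_sum(y_true * y_pred, axis=(1, 2))
    union = tf.reduce_sum(y_true + y_pred, axis=(1, 2))
    return 1.0 - tf.reduce_mean((2.0 * inter + smooth) / (union + smooth))
```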

We conduct the experiments with the evaluation metrics mean Intersection over Union (mIOU) and mean Pixel Accuracy (mPA). Table 1 gives the comparison of semantic segmentation results on the PASCAL VOC 2012 validation set.

Our network achieves an mIOU of 75.75% and an mPA of 84.06%. Compared with the original MobileNetV2, the mIOU of MobileNet + PSPNet is reduced by 2.31%, because the multiscale pyramid pooling module abstracts the high-level features, which enhances scene parsing but reduces classification capability in lightweight networks. After the insertion of the skip connection branches, the mIOU of De-MNetV2 is 2.14% higher than that of MobileNet + PSPNet, which indicates that fusing details from the low-level layers refines the high-level features and improves overall performance. Compared with MobileNet + DeepLabV3, which achieves the best semantic segmentation performance in the original paper [25], our network is competitive.

Figure 10 compares the scene parsing ability and detail preservation of De-MNetV2 and MobileNet + DeepLabV3. The results suggest that De-MNetV2 is more sensitive to dynamic objects, with fewer misclassifications and discontinuous labels. We therefore consider De-MNetV2 more suitable for our requirements.


4.2. DIR-SLAM

We evaluate DIR-SLAM on the public TUM RGB-D dataset [26] and compare it with other state-of-the-art VSLAM systems [15, 16]. A runtime analysis is presented to show the efficiency of our method. Furthermore, we demonstrate the performance of our method with a Kinect V1 in a real environment.

The descriptions of the sequences used for evaluation are as follows. The f3 sequences are dynamic object sequences that contain four types of camera motion: (1) half (half sphere): the camera is moved on a small half sphere of approximately one-meter diameter; (2) rpy: the camera is rotated along the principal axes (roll-pitch-yaw) at the same position; (3) static: the camera is kept in place manually; (4) xyz: the camera is manually moved along three directions (xyz) while keeping the same orientation. Specifically, f3/s denotes the f3_sitting sequences, which depict low-dynamic scenarios, and f3/w denotes the f3_walking sequences, which depict high-dynamic scenarios. The validation sequences are marked with a dedicated suffix and have undisclosed ground truth. Each sequence contains both RGB and depth images recorded at the full frame rate (30 Hz) and resolution (640 × 480). We performed all the experiments on a notebook with a 2.6 GHz Intel i7-9750H, 16 GB RAM, an NVIDIA GTX 1660 Ti, and Ubuntu 16.04.

The quantitative evaluation employs the Absolute Trajectory Error (ATE) and the Relative Pose Error (RPE) as indicators. Our method is compared with the other methods in terms of the Root-Mean-Square Error (RMSE) and the Standard Deviation (SD) of these errors.
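As a reference for how these numbers are obtained, the RMSE of the ATE can be computed as in the sketch below, which rigidly aligns the estimated trajectory to the ground truth in the spirit of the standard TUM evaluation tooling [26]; timestamp association is assumed to be done already.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of the Absolute Trajectory Error after rigid (Kabsch) alignment.

    est_xyz, gt_xyz: Nx3 time-associated camera positions."""
    mu_e, mu_g = est_xyz.mean(0), gt_xyz.mean(0)
    E, G = est_xyz - mu_e, gt_xyz - mu_g          # centered point sets
    U, _, Vt = np.linalg.svd(E.T @ G)             # cross-covariance SVD
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflection
    R = (U @ S @ Vt).T                            # rotation aligning est to gt
    t = mu_g - R @ mu_e
    err = gt_xyz - (est_xyz @ R.T + t)            # residuals after alignment
    return np.sqrt((err ** 2).sum(axis=1).mean())
```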

We choose the state-of-the-art dynamic VSLAM methods DS-SLAM [15] and DynaSLAM [16], together with the original ORB-SLAM2 [3], for performance comparison. All the methods are based on ORB-SLAM2, and the comparison results are shown in Tables 2 and 3.
From Tables 2 and 3, we can see that DIR-SLAM achieves competitive ATE and RPE values in most of the sequences, which shows that our method performs well. The results also illustrate that the proposed DIR method improves dramatically on the original ORB-SLAM2 in high-dynamic scenarios.

Similar to DS-SLAM, we segment the scene and use epipolar constraints to determine and reject the dynamic outliers. Our method outperforms DS-SLAM for two reasons: we extend the semantics over the dynamic correlation region to obtain more complete prior information, and we use semantic weights to make the motion of dynamic targets more distinguishable. Besides, we do not cluster the keypoints by pixels or geometric structures; all keypoints are identified robustly and independently, and the experimental results confirm the feasibility of this design. Compared with DynaSLAM, our performance is very close. In the semantic part, DynaSLAM extends semantics in a similar way, but without further theoretical analysis. DynaSLAM relies on Mask R-CNN and multiview geometry to improve semantics, which incurs expensive computational costs, whereas our method is implemented with the lightweight De-MNetV2 network and the semantic extension of the dynamic correlation region, which are fast and efficient.

We also evaluate the accuracy of our system with different configurations; the RMSE of the ATE is shown in Table 4. We test four configurations of DIR-SLAM:
(1) MobileNet + DeepLabV3 is used for semantic segmentation.
(2) The semantics are not extended to cover the dynamic correlation region.
(3) The semantic weights are not used.
(4) The BA of the geometric consistency check module is not computed.

Table 4 shows that the full DIR-SLAM outperforms the other configurations. In view of Table 2, we notice that the results of our method are not ideal in low-dynamic sequences: owing to its inherent robustness, ORB-SLAM2 is already sufficient to overcome the dynamic interference caused by slight movements. For the f3/s/xyz sequence, the ATE of ORB-SLAM2 is 0.0097 m, better than that of our method. However, Table 4 shows that the DIR-SLAM configuration without semantic weights outperforms the others in this sequence. Hence, we consider that prior knowledge can lead to information loss; we use BA to verify the consistency between keypoints and the pose of the previous frame, which predetermines the static keypoints and reduces this loss. Overall, the gaps between DIR-SLAM and ORB-SLAM2 in the f3/s sequences are small.

Figure 11 shows the comparative trajectories on these sequences, drawn with three types of lines. The trajectory of DIR-SLAM is closer to the ground truth, which indicates that its localization results are more precise. Although the experimental results have a certain degree of uncertainty, they follow a consistent pattern: qualitatively, DIR-SLAM is better suited to high-dynamic scenarios with larger camera movements.

Our system is a real-time semantic SLAM system. To show its efficiency, we compare the average computation time of the major processing modules of DIR-SLAM and ORB-SLAM2. To relate the computational cost to the number of dynamic points, we choose the f3/w/static and f3/w/xyz sequences for comparison. As described above, the motion in f3/w/xyz is more complicated because the camera is always moving. The results are shown in Table 5.

The modules in Table 5 correspond to Figure 3. Specifically, the semantic part runs in parallel with the system as a separate GPU thread. In the geometric consistency check module, f3/w/xyz takes more time than f3/w/static because the dynamic points in f3/w/xyz are more scattered, so estimating the fundamental matrix takes more iterations. In pose estimation, DIR-SLAM is faster than the original ORB-SLAM2: because the dynamic outliers are rejected, the pose optimization converges faster. Finally, tracking is the main thread that processes every single frame; our method costs less than 100 ms per frame, which is as fast as human visual perception [41].

4.3. Robustness Test in a Real Environment

We integrate DIR-SLAM with ROS and conduct experiments in a real environment to demonstrate its robustness and real-time performance. Frames are captured by a Kinect V1 camera at a resolution of 640 × 480. The duration is about 2 minutes. Experimental results of DIR-SLAM during the real environment test are shown in Figure 12. The red points are dynamic keypoints identified by the proposed method, and the blue points are static keypoints.

In the real environment, a person holding a book sits in front of the camera, and the camera is held static. Note that the person class is labeled by the network, but the book class is not. In Figure 12(a), the person is moving while the book stays static, and the dynamic keypoints are distributed mainly on the person. In Figure 12(b), the book is moved while the person is not, and our method identifies the dynamic keypoints distributed on the book. We recorded the complete experimental test video: https://wo712268.lofter.com/post/1d4e8522_2b40b2d53.

5. Conclusion

In this article, we propose DIR-SLAM, a real-time semantic SLAM system that addresses the problems of visual localization in dynamic environments. As described above, we reject all the dynamic keypoints on the basis of prior knowledge and geometric constraints. We use ORB-SLAM2 [3] as the system framework and perform experiments on the public TUM RGB-D dataset [26]. The results show that our system brings larger improvements in high-dynamic scenarios than in low-dynamic ones and works robustly in unknown environments.

However, errors still exist: the method cannot handle well the trajectory deviation caused by camera movement. To handle the pose estimation errors caused by rapid and large viewpoint changes, we intend to adopt an affine-invariant feature extractor that is more adaptable to camera movements. Besides, we aim to add semantic mapping and background repairing methods to DIR-SLAM to realize real-time dense mapping.

Data Availability

The authors have used third-party data. More information about these data can be obtained from the following references: Yu, C., Liu, Z., Liu, X.-J., Xie, F., Yang, Y., Wei, Q., and Fei, Q., 2018. DS-SLAM: a semantic visual SLAM towards dynamic environments. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). https://doi.org/10.1109/IROS.2018.8593691. Bescos, B., Fácil, J. M., Civera, J., and Neira, J., 2018. DynaSLAM: tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters. https://ieeexplore.ieee.org/document/8421015.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Shaanxi Key Laboratory of Complex System Control and Intelligent Information Processing (Grant no. 2020CP05), the Shaanxi Natural Science Basic Research Project (Grant nos. 2022JQ-711 and 2022JM-348), the Xi’an Science and Technology Bureau Science and Technology Innovation Leading Project (Grant nos. 21XJZZ0022 and 21XJZZ0020), the Key R&D Plan of Shaanxi Province (Grant no. 2020ZDLGY06-01), and the National Natural Science Foundation of China (Grant no. 61873200).